MINISTRY OF EDUCATION AND TRAINING
MINISTRY OF NATIONAL DEFENSE
MILITARY TECHNICAL ACADEMY
NGUYEN THI HIEN
A STUDY OF GENERALIZATION
IN GENETIC PROGRAMMING
THE THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
IN MATHEMATICS
Hanoi – 2014
Specialized in: Applied Mathematics and Computer Science
Code: 60 46 01 10
THE THESIS IS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN MATHEMATICS
ACADEMIC SUPERVISORS:
1. Assoc. Prof. Dr. Nguyen Xuan Hoai
2. Prof. Dr. R.I. (Bob) McKay
Abstract
Genetic Programming (GP) is a meta-heuristic technique that simulates biological evolution. GP differs from other Evolutionary Algorithms (EAs) in that it generates solutions to problems in the form of computer programs. The quality criterion for each individual, referred to in EAs as fitness, determines which individuals are selected. GP induces a population of computer programs whose fitness improves automatically as they experience the data on which they are trained. In this sense, GP can be seen as a machine learning method. Machine learning is a process that trains a system on some training data to capture the succinct causal relationships among the data. The key parts of this process are the "learning domain", the "training set", the "learning system" and the "testing" (validation) of the learnt results. The quality of a learnt solution is its ability to predict the outputs for future unseen data (often simulated on a test set), which is referred to as "generalization". The goal of this thesis is to investigate the aspects that affect the generalization ability of GP, as well as techniques from the more traditional machine learning literature that help to improve the generalization and learning efficiency of GP. In particular, a number of early stopping criteria were proposed and tested. The early stopping techniques (with the proposed stopping criteria) helped to avoid over-fitting in GP, producing much simpler solutions in much shorter training time.
Over-fitting generally occurs when a model is excessively complex, for example when it has too many parameters relative to the number of observations. An over-fit model generally has poor predictive performance, as it can exaggerate minor fluctuations in the data. The thesis also proposes a framework that combines progressive sampling, a traditional machine learning technique, with early stopping for GP. The experimental results showed that progressive sampling in combination with early stopping helps to further improve the learning efficiency of GP (i.e. to avoid over-fitting and produce simpler solutions in much shorter training time).
Acknowledgements
The first person I would like to thank is my supervisor, who directly guided me through the PhD process. His enthusiasm has been a constant source of motivation in this work, and his guidance has inspired much of the research in this thesis.
I also wish to thank my co-supervisor. Although he was far away, we regularly held online research discussions, and working with him I have learnt how to do research systematically. The leaders of the Department of Software Engineering and the Faculty of Information Technology, Military Technical Academy, have frequently supported my research with facilities and materials for running the experiments. Dr. Nguyen Quang Uy discussed a number of issues related to this research with me; I would like to thank him for his thorough comments and suggestions.
I also would like to acknowledge the support of my family, especially my parents, Nguyen Van Hoan and Nguyen Thi Quy, who have worked hard and always believed strongly in their children, and my husband, Nguyen Hai Trieu, for sharing the joys and difficulties of life with me. To my beloved daughter Nguyen Thi Hai Phuong, who was born before this thesis was completed, I would like to express my thanks for being such a good girl, always cheering me up.
Contents

Abstract
List of Figures
List of Tables
Abbreviations
Introduction
  0.1 Motivations
  0.2 Research Perspectives
  0.3 Contributions of the Thesis
  0.4 Organization of the Thesis
1 Backgrounds
  1.1 Genetic Programming
    1.1.1 Representation, Initialization and Operators in GP
    1.1.2 Some Variants of GP
  1.2 An Example of Problem Domain
  1.3 Machine Learning
  1.4 GP and Machine Learning Issues
    1.4.1 Search Space
    1.4.2 Bloat
    1.4.3 Generalization and Complexity
  1.5 Generalization in GP
    1.5.1 Overview of Studies on GP Generalization Ability
    1.5.2 Methods for Improving GP Generalization Ability
    1.5.3 Problem of Improving Training Efficiency
  1.6 Summary
2 Early Stopping, Progressive Sampling, and Layered Learning
  2.1 Over-training, Early Stopping and Regularization
    2.1.1 Over-training
    2.1.2 Early Stopping
    2.1.3 Regularization Learning
  2.2 Progressive Sampling
    2.2.1 Types of Progressive Sampling Schedule
    2.2.2 Detection of Convergence
  2.3 Layered Learning
  2.4 Conclusion
3 An Empirical Study of Early Stopping for GP
  3.1 Method
    3.1.1 Proposed Classes of Stopping Criteria
  3.2 Experimental Settings
  3.3 Results and Discussions
    3.3.1 Comparison with GP
    3.3.2 Comparison with Tarpeian
  3.4 Conclusion
4 A Progressive Sampling Framework for GP - An Empirical Approach
  4.1 The Proposed Method
    4.1.1 The Sampling Schedule
    4.1.2 The Initial Sample Set Size
    4.1.3 The Number of Learning Layers
    4.1.4 The Stopping Criterion for Each Layer
  4.2 Experimental Settings
    4.2.1 Data Preparation
    4.2.2 GP Systems Configurations
  4.3 Results and Discussions
    4.3.1 Learning Effectiveness Comparison
    4.3.2 Learning Efficiency
    4.3.3 Solution Complexity
    4.3.4 Distribution of Best Solutions by Layers
    4.3.5 Synergies between System Components
  4.4 Conclusion
5 Conclusions and Future Work
  5.1 Contributions
  5.2 Future Directions
Bibliography
List of Figures

1.1 GP's main loop [57]
1.2 GP expression tree representing max(x*x, x+3y) [57]
1.3 Creation of a full tree having maximum depth 2 (and therefore a total of seven nodes) using the Full initialisation method (t = time) [57]
1.4 Creation of a five-node tree using the Grow initialisation method with a maximum depth of 2 (t = time). A terminal is chosen at t = 2, causing the left branch of the root to be terminated at that point even though the maximum depth has not been reached [57]
1.5 Example of subtree crossover
1.6 Example of subtree mutation [57]
2.1 Idealized training and validation error curves. Vertical axis: errors; horizontal axis: time [88]
2.2 A real validation error curve. Vertical: validation set error; horizontal: time (in training epochs) [88]
2.3 Learning curves and progressive sample
2.4 Regions of the Efficiency of Arithmetic and Geometric Schedule Sampling [89]
3.1 Distribution of stop time (last generation of the run) of the OF criterion and GPM
3.2 Distribution of stop time (last generation of the run) of the VF criterion and GPM
3.3 Distribution of stop time (last generation of the run) of the GL criterion and GPM
3.4 Distribution of stop time (last generation of the run) of the PQ criterion and GPM
3.5 Distribution of stop time (last generation of the run) of the UP criterion and GPM
4.1 PS framework for GP
4.2 GP Evolution of each layer
List of Tables

1.1 Test Functions
1.2 The real-world data sets
3.1 Parameter settings for the GP systems
3.2 Data Sets for Problems. For the synthetic problems, the notation [min, max] defines the range from which the data points are sampled
3.3 Generalization error and run time of VF, OF, True Adap stopping criterion and GP, Tar
3.4 Size of best individual of VF, OF and True Adap stopping criterion and GP, Tar
3.5 p-value of run time and GE of VF, OF, True Adap stopping criterion and Tar vs GP
3.6 p-value of complexity of VF, OF, True Adap stopping criterion and Tar vs GP
3.7 Relative Generalization Errors and Run Time of OF, VF and True Adap stopping criterion (vs GP)
3.8 Generalization error and run time of GL, PQ and UP stopping criterion
3.9 Size of the best individual of GL, PQ and UP stopping criterion
3.10 p-value of GL, PQ and UP stopping criterion vs GP
3.11 p-value of size of best individual of GL, PQ and UP stopping criterion vs GP
3.12 Relative Generalization Errors and Run Time of GL, PQ and UP stopping criterion (vs GP)
3.13 p-value of GE, run time and size of best solution of True Adap, PQ vs Tarpeian
4.1 Data Sets for the Test Functions
4.2 Evolutionary Parameter Settings for the Genetic Programming Systems
4.3 Mean and SD of Generalization Error, GPLL and GP (14 Problems)
4.4 Relative Average Generalization Errors and p-values (GPLL vs GP) (14 Problems)
4.5 Mean Run Time, GPLL and GP (14 Problems)
4.6 Relative Run Times and p-values (GPLLs vs GP) (14 Problems)
4.7 Mean Solution Size and p-values, GPLL vs GP (14 Problems)
4.8 Number of End-of-run Best Solutions First Found by GPLL in Each Layer
4.9 Average Generalization Error on 14 Problems
4.10 Average Run Time of All Systems
4.11 Average Size of Solutions Found by all Systems
Abbreviations

Abbreviation   Meaning
ARMA           Autoregressive Moving-Average
EA             Evolutionary Algorithm
DSS            Dynamic Subset Selection
GA             Genetic Algorithms
GGGP           Grammar Guided Genetic Programming
GP             Genetic Programming
GPLL           Layered Learning Genetic Programming
ILP            Inductive Logic Programming
KDD            Knowledge Discovery and Data Mining
LL             Layered Learning
ML             Machine Learning
MSE            Mean Squared Error
PDGP           Parallel Distributed GP
PS             Progressive Sampling
RST            Random Sampling Technique
Introduction
The genetic programming (GP) paradigm, first proposed by Koza [53], can be viewed as the discovery of computer programs which produce desired outputs for particular inputs. A computer program could be an expression, a formula, a plan, a control strategy, a decision tree or a model, depending on the type of problem. GP is a simple and powerful technique which has been applied to a wide range of problems in combinatorial optimization, automatic programming and model induction. The direct encoding of solutions as variable-length computer programs allows GP to provide solutions that can be evaluated and also examined to understand their internal workings. In the field of evolutionary and natural computation, GP plays a special role. In Koza's seminal book, the ambition of GP was to evolve a population of programs that can learn automatically from training data. Since its birth, GP has developed rapidly, with various fruitful applications. Some of these developments concern novel mechanisms in the GP evolutionary process; others concern finding new areas to which GP can be successfully applied. Despite these advances in GP research, when GP is applied to learning tasks, its generalization as a learning machine has not received the attention it deserves. This thesis focuses on the generalization aspect of GP and proposes mechanisms to improve GP's generalization ability.
0.1 Motivations
GP is one of the evolutionary algorithms (EAs). The common underlying idea behind all EA techniques is the same: given a population of individuals, environmental pressure causes natural selection (survival of the fittest), and this causes a rise in the fitness of the population. Given a quality function to be maximised, we can randomly create a set of candidate solutions, i.e., elements of the function's domain, and apply the quality function as an abstract fitness measure - the higher, the better. Based on this fitness, some of the better candidates are chosen to seed the next generation by applying recombination and/or mutation to them. Recombination is an operator applied to two or more selected candidates (the so-called parents) and results in one or more new candidates (the children). Mutation is applied to one candidate and results in one new candidate. Executing recombination and mutation leads to a set of new candidates (the offspring) that compete - based on their fitness (and possibly age) - with the old ones for a place in the next generation. This process can be iterated until a candidate with sufficient quality (a solution) is found or a previously set computational limit is reached. GP has been proposed as a machine learning (ML) method [53], in which solutions to various learning problems are sought by means of an evolutionary process of program discovery. Since GP often tackles ML tasks, generalization, here synonymous with robustness, is the desirable property at the end of the GP evolutionary process. The main goal when using GP or other ML techniques is not only to create a program that exactly fits the training examples (as in many GP applications so far), but also to obtain good generalization ability: for unseen, future samples of data, the program should output values that closely resemble those of the underlying function. Although the issue of generalization in GP has recently received increasing attention, the use of more traditional machine learning techniques in GP has been rather limited. This is surprising, since generalization has long been studied in machine learning, with many practical techniques proposed for improving learning machines (such as decision trees, neural networks, etc.). It is hoped that adapting these techniques to GP will help to improve its generalization ability.
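The generic evolutionary loop described above can be sketched in a few lines. This is an illustrative sketch only: the representation, fitness function, and variation operators are placeholders supplied by the caller, not the GP system studied in this thesis.

```python
import random

def evolve(init_individual, fitness, crossover, mutate,
           pop_size=60, generations=30, cx_rate=0.9):
    """Generic EA loop: evaluate, select, recombine/mutate, repeat."""
    population = [init_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]

        # Tournament selection: the fitter of two random candidates survives.
        def select():
            a, b = random.sample(scored, 2)
            return a[1] if a[0] >= b[0] else b[1]

        offspring = []
        while len(offspring) < pop_size:
            if random.random() < cx_rate:
                offspring.append(crossover(select(), select()))  # two parents
            else:
                offspring.append(mutate(select()))               # one parent
        population = offspring
    return max(population, key=fitness)
```

For instance, with bit-string individuals and `sum` as the quality function (the toy one-max problem), this loop quickly drives the population toward the all-ones string.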
0.2 Research Perspectives
The focus of this thesis is on understanding the generalization of GP, the ways it can be improved, and the role it plays in learning. Specifically, the generalization of GP is analysed to uncover its key features. Based on this analysis, the thesis proposes two new learning mechanisms for GP: early stopping and progressive sampling. In the process of developing better learning frameworks for GP, a clearer description of the scalability of the framework emerges to facilitate and motivate future enhancements. The current theoretical models of GP are limited in use and applicability due to their complexity; therefore, the majority of theoretical work has been derived from experimentation. The approach taken in this thesis is likewise based on carefully designed experiments and the analysis of experimental results.
0.3 Contributions of the Thesis
This thesis makes the following contributions:
1. An early stopping approach for GP, based on an estimate of the generalization error, together with some new stopping criteria for early stopping in GP. The GP training process is stopped early at the point that promises to yield the solution with the smallest generalization error.
2. A progressive sampling framework for GP, which divides the GP learning process into several layers, with a training set size that starts small and gradually grows at each layer. The termination of training on each layer is based on certain early stopping criteria.
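To make the two contributions concrete, the following toy sketch illustrates the general ideas. The function names, the patience-based criterion, and the geometric sample schedule are illustrative assumptions only; the actual criteria and schedules are developed in Chapters 3 and 4.

```python
def train_with_early_stopping(step, val_error, max_gens=200, patience=10):
    """Stop when the validation error has not improved for `patience` generations.

    `step()` advances the learner one generation; `val_error()` returns the
    current estimate of the generalization error (e.g. on a held-out set).
    Returns the best error seen and the generation at which it occurred.
    """
    best_err, best_gen, stale = float("inf"), 0, 0
    for gen in range(max_gens):
        step()
        err = val_error()
        if err < best_err:
            best_err, best_gen, stale = err, gen, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stop: generalization no longer improving
    return best_err, best_gen

def progressive_sampling(train_layer, full_size, n0=50, factor=2):
    """Run one learning layer per sample size: n0, 2*n0, 4*n0, ... (geometric).

    Each layer trains (with its own early stopping) on a larger sample,
    until the full training set size is reached.
    """
    size, results = n0, []
    while size < full_size:
        results.append(train_layer(size))
        size *= factor
    results.append(train_layer(full_size))
    return results
```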
0.4 Organization of the Thesis
The organization of the rest of the thesis is as follows:
Chapter 1 first introduces the basic components of GP - the algorithm, representation, and operators - as well as some benchmark problem domains (some of which are used for the experiments in this thesis). Two important research issues and a metaphor of GP search are then discussed. Next, the major issues of GP, especially when GP is considered as a machine learning system, are discussed. The chapter first overviews the approaches proposed in the literature to improve the generalization ability of GP. Then, solution complexity (code bloat) in the particular context of GP, and its relation to GP learning performance (generalization, training time, solution comprehensibility), are discussed.
Chapter 2 provides background on a number of concepts and techniques subsequently used in the thesis for improving GP learning performance: progressive sampling, layered learning, and early stopping in machine learning.
Chapter 3 presents one of the main contributions of the thesis. It proposes criteria used to determine when to stop the training of GP, with the aim of avoiding over-fitting. Experiments are conducted to assess the effectiveness of the proposed early stopping criteria. The results show that GP with early stopping can find solutions with at least similar generalization errors to standard GP, but in much shorter training time and with lower solution complexity.
Chapter 4 develops a learning framework for GP based on layered learning and progressive sampling. The results show that the solutions of GP with progressive sampling are significantly smaller and the training time much shorter compared to standard GP, while the generalization error of the obtained solutions is often smaller.
Chapter 5 concludes the thesis, summarizing the main results and suggesting future work that extends the research in this thesis.
Chapter 1
Backgrounds
GP [53] is an evolutionary algorithm (EA) that uses either a procedural or a functional representation. This chapter describes the representation and the specific algorithm components used in the canonical version of GP. The basics of GP are presented first, followed by a discussion of the algorithm and a description of some variants of GP. Then, a survey of the ML-related issues of GP is given. The chapter ends with a comprehensive overview of the literature on approaches used to improve GP generalization ability.
1.1 Genetic Programming
GP is an evolutionary paradigm, inspired by biological evolution, for finding solutions to a problem in the form of computer programs. An early form of GP used a tree representation of programs in conjunction with subtree crossover to evolve a multiplication function. The evolution through each generation is depicted in Figure 1.1. GP adopts a similar search strategy to a genetic algorithm (GA), but uses a program representation and special operators. It is this representation that makes GP unique. Algorithm 1 shows the basic steps to run GP.
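As a toy illustration of this tree representation (a sketch for exposition only, not the implementation used in this thesis), the expression max(x*x, x+3y) from Figure 1.2 can be stored as a nested tuple in prefix form and evaluated recursively:

```python
import operator

# Function set: name -> (arity, implementation). The particular set is an
# illustrative choice; real GP systems define their own primitives.
FUNCS = {
    "max": (2, max),
    "+":   (2, operator.add),
    "*":   (2, operator.mul),
}

# max(x*x, x + 3*y) as a prefix tree: (function, child, child)
tree = ("max", ("*", "x", "x"), ("+", "x", ("*", 3, "y")))

def evaluate(node, env):
    """Recursively evaluate an expression tree against variable bindings."""
    if isinstance(node, tuple):            # internal node: apply function
        fname, *children = node
        _, fn = FUNCS[fname]
        return fn(*(evaluate(c, env) for c in children))
    if isinstance(node, str):              # terminal: variable
        return env[node]
    return node                            # terminal: constant

# e.g. evaluate(tree, {"x": 2, "y": 3})  ->  max(4, 11) = 11
```

Subtree crossover then amounts to swapping a randomly chosen sub-tuple of one parent with a sub-tuple of another, which is what makes the variable-length program representation distinctive.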