
Document: Machine Learning in Action

Format: PDF

Description:

Machine Learning in Action
Peter Harrington
Manning, Shelter Island

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact: Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 261, Shelter Island, NY 11964. Email: [email protected]

©2012 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Development editor: Jeff Bleiel
Technical proofreaders: Tricia Hoffman, Alex Ott
Copyeditor: Linda Recktenwald
Proofreader: Maureen Spencer
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor

ISBN 9781617290183
Printed in the United States of America

To Joseph and Milo

Brief contents

Part 1: Classification
1 Machine learning basics
2 Classifying with k-Nearest Neighbors
3 Splitting datasets one feature at a time: decision trees
4 Classifying with probability theory: naïve Bayes
5 Logistic regression
6 Support vector machines
7 Improving classification with the AdaBoost meta-algorithm
Part 2: Forecasting numeric values with regression
8 Predicting numeric values: regression
9 Tree-based regression
Part 3: Unsupervised learning
10 Grouping unlabeled items using k-means clustering
11 Association analysis with the Apriori algorithm
12 Efficiently finding frequent itemsets with FP-growth
Part 4: Additional tools
13 Using principal component analysis to simplify data
14 Simplifying data with the singular value decomposition
15 Big data and MapReduce

Contents

preface
acknowledgments
about this book
about the author
about the cover illustration

Part 1: Classification

1 Machine learning basics
1.1 What is machine learning?
    Sensors and the data deluge ■ Machine learning will be more important in the future
1.2 Key terminology
1.3 Key tasks of machine learning
1.4 How to choose the right algorithm
1.5 Steps in developing a machine learning application
1.6 Why Python?
    Executable pseudo-code ■ Python is popular ■ What Python has that other languages don’t have ■ Drawbacks
1.7 Getting started with the NumPy library
1.8 Summary

2 Classifying with k-Nearest Neighbors
2.1 Classifying with distance measurements
    Prepare: importing data with Python ■ Putting the kNN classification algorithm into action ■ How to test a classifier
2.2 Example: improving matches from a dating site with kNN
    Prepare: parsing data from a text file ■ Analyze: creating scatter plots with Matplotlib ■ Prepare: normalizing numeric values ■ Test: testing the classifier as a whole program ■ Use: putting together a useful system
2.3 Example: a handwriting recognition system
    Prepare: converting images into test vectors ■ Test: kNN on handwritten digits
2.4 Summary

3 Splitting datasets one feature at a time: decision trees
3.1 Tree construction
    Information gain ■ Splitting the dataset ■ Recursively building the tree
3.2 Plotting trees in Python with Matplotlib annotations
    Matplotlib annotations ■ Constructing a tree of annotations
3.3 Testing and storing the classifier
    Test: using the tree for classification ■ Use: persisting the decision tree
3.4 Example: using decision trees to predict contact lens type
3.5 Summary

4 Classifying with probability theory: naïve Bayes
4.1 Classifying with Bayesian decision theory
4.2 Conditional probability
4.3 Classifying with conditional probabilities
4.4 Document classification with naïve Bayes
4.5 Classifying text with Python
    Prepare: making word vectors from text ■ Train: calculating probabilities from word vectors ■ Test: modifying the classifier for real-world conditions ■ Prepare: the bag-of-words document model
4.6 Example: classifying spam email with naïve Bayes
    Prepare: tokenizing text ■ Test: cross validation with naïve Bayes
4.7 Example: using naïve Bayes to reveal local attitudes from personal ads
    Collect: importing RSS feeds ■ Analyze: displaying locally used words
4.8 Summary

5 Logistic regression
5.1 Classification with logistic regression and the sigmoid function: a tractable step function
5.2 Using optimization to find the best regression coefficients
    Gradient ascent ■ Train: using gradient ascent to find the best parameters ■ Analyze: plotting the decision boundary ■ Train: stochastic gradient ascent
5.3 Example: estimating horse fatalities from colic
    Prepare: dealing with missing values in the data ■ Test: classifying with logistic regression
5.4 Summary

6 Support vector machines
6.1 Separating data with the maximum margin
6.2 Finding the maximum margin
    Framing the optimization problem in terms of our classifier ■ Approaching SVMs with our general framework
6.3 Efficient optimization with the SMO algorithm
    Platt’s SMO algorithm ■ Solving small datasets with the simplified SMO
6.4 Speeding up optimization with the full Platt SMO
6.5 Using kernels for more complex data
    Mapping data to higher dimensions with kernels ■ The radial basis function as a kernel ■ Using a kernel for testing
6.6 Example: revisiting handwriting classification
6.7 Summary

7 Improving classification with the AdaBoost meta-algorithm
7.1 Classifiers using multiple samples of the dataset
    Building classifiers from randomly resampled data: bagging ■ Boosting
7.2 Train: improving the classifier by focusing on errors
7.3 Creating a weak learner with a decision stump
7.4 Implementing the full AdaBoost algorithm
7.5 Test: classifying with AdaBoost
7.6 Example: AdaBoost on a difficult dataset
7.7 Classification imbalance
    Alternative performance metrics: precision, recall, and ROC ■ Manipulating the classifier’s decision with a cost function ■ Data sampling for dealing with classification imbalance
7.8 Summary

Part 2: Forecasting numeric values with regression

8 Predicting numeric values: regression
8.1 Finding best-fit lines with linear regression
8.2 Locally weighted linear regression
8.3 Example: predicting the age of an abalone
8.4 Shrinking coefficients to understand our data
    Ridge regression ■ The lasso ■ Forward stagewise regression
8.5 The bias/variance tradeoff
8.6 Example: forecasting the price of LEGO sets
    Collect: using the Google shopping API ■ Train: building a model
8.7 Summary

9 Tree-based regression
9.1 Locally modeling complex data
9.2 Building trees with continuous and discrete features
9.3 Using CART for regression
    Building the tree ■ Executing the code
9.4 Tree pruning
    Prepruning ■ Postpruning
9.5 Model trees
9.6 Example: comparing tree methods to standard regression
9.7 Using Tkinter to create a GUI in Python
    Building a GUI in Tkinter ■ Interfacing Matplotlib and Tkinter
9.8 Summary

Part 3: Unsupervised learning

10 Grouping unlabeled items using k-means clustering
10.1 The k-means clustering algorithm
10.2 Improving cluster performance with postprocessing
10.3 Bisecting k-means
10.4 Example: clustering points on a map
    The Yahoo! PlaceFinder API ■ Clustering geographic coordinates
10.5 Summary

11 Association analysis with the Apriori algorithm
11.1 Association analysis
11.2 The Apriori principle
11.3 Finding frequent itemsets with the Apriori algorithm
    Generating candidate itemsets ■ Putting together the full Apriori algorithm
11.4 Mining association rules from frequent itemsets
11.5 Example: uncovering patterns in congressional voting
    Collect: build a transaction data set of congressional voting records ■ Test: association rules from congressional voting records
11.6 Example: finding similar features in poisonous mushrooms
11.7 Summary

12 Efficiently finding frequent itemsets with FP-growth
12.1 FP-trees: an efficient way to encode a dataset
12.2 Build an FP-tree
    Creating the FP-tree data structure ■ Constructing the FP-tree
12.3 Mining frequent items from an FP-tree
    Extracting conditional pattern bases ■ Creating conditional FP-trees
12.4 Example: finding co-occurring words in a Twitter feed
12.5 Example: mining a clickstream from a news site
12.6 Summary

Part 4: Additional tools

13 Using principal component analysis to simplify data
13.1 Dimensionality reduction techniques
13.2 Principal component analysis
    Moving the coordinate axes ■ Performing PCA in NumPy
13.3 Example: using PCA to reduce the dimensionality of semiconductor manufacturing data
13.4 Summary

14 Simplifying data with the singular value decomposition
14.1 Applications of the SVD
    Latent semantic indexing ■ Recommendation systems
14.2 Matrix factorization
14.3 SVD in Python
14.4 Collaborative filtering–based recommendation engines
    Measuring similarity ■ Item-based or user-based similarity? ■ Evaluating recommendation engines
14.5 Example: a restaurant dish recommendation engine
    Recommending untasted dishes ■ Improving recommendations with the SVD ■ Challenges with building recommendation engines
14.6 Example: image compression with the SVD
14.7 Summary

15 Big data and MapReduce
15.1 MapReduce: a framework for distributed computing
15.2 Hadoop Streaming
    Distributed mean and variance mapper ■ Distributed mean and variance reducer
15.3 Running Hadoop jobs on Amazon Web Services
    Services available on AWS ■ Getting started with Amazon Web Services ■ Running a Hadoop job on EMR
15.4 Machine learning in MapReduce
15.5 Using mrjob to automate MapReduce in Python
    Using mrjob for seamless integration with EMR ■ The anatomy of a MapReduce script in mrjob
15.6 Example: the Pegasos algorithm for distributed SVMs
    The Pegasos algorithm ■ Training: MapReduce support vector machines with mrjob
15.7 Do you really need MapReduce?
15.8 Summary

Appendix A: Getting started with Python
Appendix B: Linear algebra
Appendix C: Probability refresher
Appendix D: Resources
Index

Preface

After college I went to work for Intel in California and mainland China. Originally my plan was to go back to grad school after two years, but time flies when you are having fun, and two years turned into six.
I realized I had to go back at that point, and I didn’t want to do night school or online learning; I wanted to sit on campus and soak up everything a university has to offer. The best part of college is not the classes you take or the research you do, but the peripheral things: meeting people, going to seminars, joining organizations, dropping in on classes, and learning what you don’t know.

Sometime in 2008 I was helping set up for a career fair. I began to talk to someone from a large financial institution, and they wanted me to interview for a position modeling credit risk (figuring out if someone is going to pay off their loans or not). They asked me how much stochastic calculus I knew. At the time, I wasn’t sure I knew what the word stochastic meant. They were hiring for a geographic location my body couldn’t tolerate, so I decided not to pursue it any further. But this stochastic stuff interested me, so I went to the course catalog and looked for any class being offered with the word “stochastic” in its title. The class I found was “Discrete-time Stochastic Systems.” I started attending the class without registering, doing the homework and taking tests. Eventually I was noticed by the professor, and she was kind enough to let me continue, for which I am very grateful. This class was the first time I saw probability applied to an algorithm. I had seen algorithms take an averaged value as input before, but this was different: the variance and mean were internal values in these algorithms. The course was about “time series” data, where every piece of data is a regularly spaced sample. I found another course with Machine Learning in the title. In this class the data was not assumed to be uniformly spaced in time, and it covered more algorithms but with less rigor. I later realized that similar methods were also being taught in the economics, electrical engineering, and computer science departments.

In early 2009, I graduated and moved to Silicon Valley to start work as a software consultant. Over the next two years, I worked with eight companies on a very wide range of technologies and saw two trends emerge that make up the major thesis of this book: first, in order to develop a compelling application you need to do more than just connect data sources; and second, employers want people who understand theory and can also program.

A large portion of a programmer’s job can be compared to the concept of connecting pipes—except that instead of pipes, programmers connect the flow of data—and monstrous fortunes have been made doing exactly that. Let me give you an example. You could make an application that sells things online—the big picture for this would be allowing people a way to post things and to view what others have posted. To do this you could create a web form that allows users to enter data about what they are selling, and this data would be shipped off to a data store. In order for other users to see what a user is selling, you would have to ship the data out of the data store and display it appropriately. I’m sure people will continue to make money this way; however, to make the application really good you need to add a level of intelligence. This intelligence could do things like automatically remove inappropriate postings, detect fraudulent transactions, direct users to things they might like, and forecast site traffic. To accomplish these objectives, you would need to apply machine learning.
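As a taste of what that kind of intelligence can look like in code, here is a minimal sketch of a bag-of-words naïve Bayes filter that flags postings as acceptable or inappropriate. It is not code from the book (chapter 4 develops the real thing); the toy postings, labels, and function names are invented purely for illustration.

```python
from collections import Counter
import math

def train(postings, labels):
    """Count word frequencies per class and turn them into smoothed log-probabilities."""
    counts = {0: Counter(), 1: Counter()}          # 0 = acceptable, 1 = inappropriate
    for words, label in zip(postings, labels):
        counts[label].update(words)
    vocab = {w for words in postings for w in words}
    model = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        # Laplace smoothing keeps unseen words from zeroing out a class
        model[label] = {w: math.log((counts[label][w] + 1) / (total + len(vocab)))
                        for w in vocab}
    priors = {label: math.log(labels.count(label) / len(labels)) for label in (0, 1)}
    return model, priors, vocab

def classify(words, model, priors, vocab):
    """Pick the class with the higher posterior log-probability."""
    scores = {label: priors[label] + sum(model[label][w] for w in words if w in vocab)
              for label in (0, 1)}
    return max(scores, key=scores.get)

# Toy postings, invented for this sketch
postings = [["great", "camera", "for", "sale"],
            ["buy", "cheap", "pills", "now"],
            ["vintage", "bike", "good", "condition"],
            ["cheap", "pills", "click", "now"]]
labels = [0, 1, 0, 1]

model, priors, vocab = train(postings, labels)
print(classify(["cheap", "pills", "for", "sale"], model, priors, vocab))  # prints 1: flagged
```

With these four toy postings, the words "cheap" and "pills" appear only in postings labeled inappropriate, so a new posting containing them gets flagged.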
The end user would not know that there is magic going on behind the scenes; to them your application “just works,” which is the hallmark of a well-built product.

An organization may choose to hire a group of theoretical people, or “thinkers,” and a set of practical people, “doers.” The thinkers may have spent a lot of time in academia, and their day-to-day job may be pulling ideas from papers and modeling them with very high-level tools or mathematics. The doers interface with the real world by writing the code and dealing with the imperfections of a non-ideal world, such as machines that break down or noisy data. Separating thinkers from doers is a bad idea, and successful organizations realize this. (One of the tenets of lean manufacturing is for the thinkers to get their hands dirty with actual doing.) When there is a limited amount of money to be spent on hiring, who will get hired more readily—the thinker or the doer? Probably the doer, but in reality employers want both. Things need to get built, but when applications call for more demanding algorithms it is useful to have someone who can read papers, pull out the idea, implement it in real code, and iterate.

I didn’t see a book that addressed the problem of bridging the gap between thinkers and doers in the context of machine learning algorithms. The goal of this book is to fill that void and, along the way, to introduce uses of machine learning algorithms so that the reader can build better applications.

Acknowledgments

This is by far the easiest part of the book to write... First, I would like to thank the folks at Manning. Above all, I would like to thank my editor Troy Mott; if not for his support and enthusiasm, this book never would have happened. I would also like to thank Maureen Spencer, who helped polish my prose in the final manuscript; she was a pleasure to work with.

Next I would like to thank Jennie Si at Arizona State University for letting me sneak into her class on discrete-time stochastic systems without registering, and Cynthia Rudin at MIT for pointing me to the paper “Top 10 Algorithms in Data Mining,”[1] which inspired the approach I took in this book. For indirect contributions I would like to thank Mark Bauer, Jerry Barkely, Jose Zero, Doug Chang, Wayne Carter, and Tyler Neylon.

Special thanks to the following peer reviewers who read the manuscript at different stages during its development and provided invaluable feedback: Keith Kim, Franco Lombardo, Patrick Toohey, Josef Lauri, Ryan Riley, Peter Venable, Patrick Goetz, Jeroen Benckhuijsen, Ian McAllister, Orhan Alkan, Joseph Ottinger, Fred Law, Karsten Strøbæk, Brian Lau, Stephen McKamey, Michael Brennan, Kevin Jackson, John Griffin, Sumit Pal, Alex Alves, Justin Tyler Wiley, and John Stevenson.

My technical proofreaders, Tricia Hoffman and Alex Ott, reviewed the technical content shortly before the manuscript went to press, and I would like to thank them.

[1] Xindong Wu, et al., “Top 10 Algorithms in Data Mining,” Journal of Knowledge and Information Systems 14, no. 1 (December 2007).
