Document: Computational Business Analytics

  • Number of pages: 505
  • File type: PDF


Description:

COMPUTATIONAL BUSINESS ANALYTICS

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota, Department of Computer Science and Engineering, Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES
ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
  Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
BIOLOGICAL DATA MINING
  Jake Y. Chen and Stefano Lonardi
COMPUTATIONAL BUSINESS ANALYTICS
  Subrata Das
COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE DEVELOPMENT
  Ting Yu, Nitesh V. Chawla, and Simeon Simoff
COMPUTATIONAL METHODS OF FEATURE SELECTION
  Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS
  Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS
  Guozhu Dong and James Bailey
DATA CLUSTERING: ALGORITHMS AND APPLICATIONS
  Charu C. Aggarwal and Chandan K. Reddy
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
  Guojun Gan
DATA MINING FOR DESIGN AND MARKETING
  Yukio Ohsawa and Katsutoshi Yada
DATA MINING WITH R: LEARNING WITH CASE STUDIES
  Luís Torgo
FOUNDATIONS OF PREDICTIVE ANALYTICS
  James Wu and Stephen Coggeshall
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION
  Harvey J. Miller and Jiawei Han
HANDBOOK OF EDUCATIONAL DATA MINING
  Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
  Vagelis Hristidis
INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS
  Priti Srinivas Sajja and Rajendra Akerkar
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS AND TECHNIQUES
  Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
  David Skillicorn
KNOWLEDGE DISCOVERY FROM DATA STREAMS
  João Gama
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR ENGINEERING SYSTEMS HEALTH MANAGEMENT
  Ashok N. Srivastava and Jiawei Han
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
  David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY
  Zhongfei Zhang and Ruofei Zhang
MUSIC DATA MINING
  Tao Li, Mitsunori Ogihara, and George Tzanetakis
NEXT GENERATION OF DATA MINING
  Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
RAPIDMINER: DATA MINING USE CASES AND BUSINESS ANALYTICS APPLICATIONS
  Markus Hofmann and Ralf Klinkenberg
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS
  Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY
  Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING
  Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
  George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY, ALGORITHMS, AND EXTENSIONS
  Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING
  Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
  Ashok N. Srivastava and Mehran Sahami
THE TOP TEN ALGORITHMS IN DATA MINING
  Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS
  David Skillicorn

COMPUTATIONAL BUSINESS ANALYTICS
SUBRATA DAS
Machine Analytics, Inc., Belmont, Massachusetts, USA

The author can be contacted at sdas@machineanalytics.com to request a demonstration version of any of the three Machine Analytics tools used to perform case studies in the two penultimate chapters of the book. It will be the sole discretion of the author to provide tools upon a satisfactory analysis of the requestor's usage intention. Use of the tools is entirely at the requestor's own risk. Machine Analytics is not responsible for the consequences of reliance on any analyses provided by the tools. Licensing details for commercial versions of these tools can be obtained by sending an email to admin@machineanalytics.com.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20131206
International Standard Book Number-13: 978-1-4398-9073-8 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources.
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. 
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

3.3 CONTINUOUS PROBABILITY DISTRIBUTIONS
  3.3.1 Gaussian or Normal Distribution
  3.3.2 Lognormal
  3.3.3 Exponential Distribution
  3.3.4 Weibull Distribution
  3.3.5 Beta and Dirichlet Distributions
  3.3.6 Gamma Distribution
3.4 GOODNESS-OF-FIT TEST
  3.4.1 Probability Plot
  3.4.2 One-Way Chi-Square Goodness-of-Fit Test
  3.4.3 Kolmogorov-Smirnov Test
3.5 FURTHER READING

Chapter 4 Bayesian Probability and Inference
4.1 BAYESIAN INFERENCE
4.2 PRIOR PROBABILITIES
  4.2.1 Conjugate Priors
  4.2.2 The Jeffreys Prior
4.3 FURTHER READING

Chapter 5 Inferential Statistics and Predictive Analytics
5.1 CHI-SQUARE TEST OF INDEPENDENCE
5.2 REGRESSION ANALYSES
  5.2.1 Simple Linear Regression
  5.2.2 Multiple Linear Regression
  5.2.3 Logistic Regression
  5.2.4 Polynomial Regression
5.3 BAYESIAN LINEAR REGRESSION
  5.3.1 Gaussian Processes
5.4 PRINCIPAL COMPONENT AND FACTOR ANALYSES
5.5 SURVIVAL ANALYSIS
5.6 AUTOREGRESSION MODELS
5.7 FURTHER READING

Chapter 6 Artificial Intelligence for Symbolic Analytics
6.1 ANALYTICS AND UNCERTAINTIES
  6.1.1 Ignorance to Uncertainties
  6.1.2 Approaches to Handling Uncertainties
6.2 NEO-LOGICIST APPROACH
  6.2.1 Evolution of Rules
  6.2.2 Inferencing in Rule-based Systems
  6.2.3 Advantages and Disadvantages of Rule-Based Systems
6.3 NEO-PROBABILIST
6.4 NEO-CALCULIST APPROACH
  6.4.1 Certainty Factors
  6.4.2 Dempster-Shafer Theory of Belief Function
6.5 NEO-GRANULARIST
  6.5.1 Probabilistic Logic
  6.5.2 Fuzzy Logic
  6.5.3 Fuzzy Logic for Customer Segmentation
6.6 FURTHER READING

Chapter 7 Probabilistic Graphical Modeling
7.1 NAIVE BAYESIAN CLASSIFIER (NBC)
7.2 K-DEPENDENCE NAIVE BAYESIAN CLASSIFIER (KNBC)
7.3 BAYESIAN BELIEF NETWORKS
  7.3.1 Conditional Independence in Belief Networks
  7.3.2 Evidence, Belief, and Likelihood
  7.3.3 Prior Probabilities in Networks without Evidence
  7.3.4 Belief Revision
  7.3.5 Evidence Propagation in Polytrees
    Upward Propagation in a Linear Fragment
    Downward Propagation in a Linear Fragment
    Upward Propagation in a Tree Fragment
    Downward Propagation in a Tree Fragment
    Upward Propagation in a Polytree Fragment
    Downward Propagation in a Polytree Fragment
  7.3.6 Propagation Algorithm
  7.3.7 Evidence Propagation in Directed Acyclic Graphs
    Graphical Transformation
    Join Tree Initialization
    Propagation in Join Tree and Marginalization
    Handling Evidence
  7.3.8 Complexity of Inference Algorithms
  7.3.9 Acquisition of Probabilities
  7.3.10 Advantages and Disadvantages of Belief Networks
  7.3.11 Belief Network Tools
7.4 FURTHER READING

Chapter 8 Decision Support and Prescriptive Analytics
8.1 EXPECTED UTILITY THEORY AND DECISION TREES
8.2 INFLUENCE DIAGRAMS FOR DECISION SUPPORT
  8.2.1 Inferencing in Influence Diagrams
  8.2.2 Compilation of Influence Diagrams
8.3 SYMBOLIC ARGUMENTATION FOR DECISION SUPPORT
  8.3.1 Measuring Consensus
  8.3.2 Combining Sources of Varying Confidence
8.4 FURTHER READING

Chapter 9 Time Series Modeling and Forecasting
9.1 PROBLEM MODELING
  9.1.1 State Transition and Observation Models
  9.1.2 Estimation Problem
9.2 KALMAN FILTER (KF)
  9.2.1 Extended Kalman Filter (EKF)
9.3 MARKOV MODELS
  9.3.1 Hidden Markov Models (HMM)
  9.3.2 The Forward Algorithm
  9.3.3 The Viterbi Algorithm
  9.3.4 Baum-Welch Algorithm for Learning HMM
9.4 DYNAMIC BAYESIAN NETWORKS (DBNS)
  9.4.1 Inference Algorithms for DBNs
9.5 FURTHER READING

Chapter 10 Monte Carlo Simulation
10.1 MONTE CARLO APPROXIMATION
10.2 GIBBS SAMPLING
10.3 METROPOLIS-HASTINGS ALGORITHM
10.4 PARTICLE FILTER (PF)
  10.4.1 Particle Filter for Dynamical Systems
  10.4.2 Particle Filter for DBN
  10.4.3 Particle Filter Issues
10.5 FURTHER READING

Chapter 11 Cluster Analysis and Segmentation
11.1 HIERARCHICAL CLUSTERING
11.2 K-MEANS CLUSTERING
11.3 K-NEAREST NEIGHBORS
11.4 SUPPORT VECTOR MACHINES
  11.4.1 Linearly Separable Data
  11.4.2 Preparation of Data and Packages
  11.4.3 Non-Separable Data
  11.4.4 Non-Linear Classifier
  11.4.5 VC Dimension and Maximum Margin Classifier
11.5 NEURAL NETWORKS
  11.5.1 Model Building and Data Preparation
  11.5.2 Gradient Descent for Updating Weights
11.6 FURTHER READING

Chapter 12 Machine Learning for Analytics Models
12.1 DECISION TREES
  12.1.1 Algorithms for Constructing Decision Trees
  12.1.2 Overfitting in Decision Trees
  12.1.3 Handling Continuous Attributes
  12.1.4 Advantages and Disadvantages of Decision Tree Techniques
12.2 LEARNING NAIVE BAYESIAN CLASSIFIERS
  12.2.1 Semi-Supervised Learning of NBC via EM
12.3 LEARNING OF KNBC
12.4 LEARNING OF BAYESIAN BELIEF NETWORKS
  12.4.1 Cases for Learning Bayesian Networks
  12.4.2 Learning Probabilities
    Brief Survey
    Learning Probabilities from Fully Observable Variables
    Learning Probabilities from Partially Observable Variables
    Online Adjustment of Parameters
  12.4.3 Structure Learning
    Brief Survey
    Learning Structure from Fully Observable Variables
    Learning Structure from Partially Observable Variables
  12.4.4 Use of Prior Knowledge from Experts
12.5 INDUCTIVE LOGIC PROGRAMMING
12.6 FURTHER READING

Chapter 13 Unstructured Data and Text Analytics
13.1 INFORMATION STRUCTURING AND EXTRACTION
13.2 BRIEF INTRODUCTION TO NLP
  13.2.1 Syntactic Analysis
    Tokenization
    Morphological Analysis
    Part-of-Speech (POS) Tagging
    Syntactic Parsing
  13.2.2 Semantic Analysis
    Named Entity Recognition
    Co-reference Resolution
    Relation Extraction
13.3 TEXT CLASSIFICATION AND TOPIC EXTRACTION
  13.3.1 Naïve Bayesian Classifiers (NBC)
  13.3.2 k-Dependence Naïve Bayesian Classifier (kNBC)
  13.3.3 Latent Semantic Analysis
  13.3.4 Probabilistic Latent Semantic Analysis (PLSA)
  13.3.5 Latent Dirichlet Allocation (LDA)
13.4 FURTHER READING

Chapter 14 Semantic Web
14.1 RESOURCE DESCRIPTION FRAMEWORK (RDF)
  14.1.1 RDF Schema (RDFS)
  14.1.2 Ontology Web Language (OWL)
14.2 DESCRIPTION LOGICS
  14.2.1 Description Logic Syntax
  14.2.2 Description Logic Axioms
  14.2.3 Description Logic Constructs and Subsystems
  14.2.4 Description Logic and OWL Constructs in Relational Database
  14.2.5 Description Logic as First-Order Logic
14.3 FURTHER READING

Chapter 15 Analytics Tools
15.1 INTELLIGENT DECISION AIDING SYSTEM (IDAS)
15.2 ENVIRONMENT FOR 5TH GENERATION APPLICATIONS (E5)
  15.2.1 Rule-based Expert System Shell
  15.2.2 Prolog Interpreter
  15.2.3 Lisp Interpreter
15.3 ANALYSIS OF TEXT (ATEXT)
15.4 R AND MATLAB
15.5 SAS AND WEKA

Chapter 16 Analytics Case Studies
16.1 RISK ASSESSMENT MODEL I3
16.2 RISK ASSESSMENT IN INDIVIDUAL LENDING USING IDAS
16.3 RISK ASSESSMENT IN COMMERCIAL LENDING USING E5 AND IDAS
16.4 FRAUD DETECTION
16.5 SENTIMENT ANALYSIS USING ATEXT
  16.5.1 Text Corpus Classification
  16.5.2 Evaluation Results
16.6 LIFE STATUS ESTIMATION USING DYNAMIC BAYESIAN NETWORKS

Appendix A Usage of Symbols
A.1 SYMBOLS USED IN THE BOOK

Appendix B Examples and Sample Data
B.1 PLAY-TENNIS EXAMPLE
B.2 UNITED STATES ELECTORAL COLLEGE DATA

Appendix C MATLAB and R Code Examples
C.1 MATLAB CODE FOR STOCK PREDICTION USING KALMAN FILTER
C.2 R CODE FOR STOCK PREDICTION USING KALMAN FILTER

Index

Preface

According to the Merriam-Webster dictionary [1], analytics is the method of logical analysis.
This is a very broad definition of analytics, without an explicitly stated end-goal. A view of analytics within the business community is that analytics describes a process (a method or an analysis) that transforms (hopefully, logically) raw data into actionable knowledge in order to guide strategic decision-making. Along this line, technology research guru Gartner defines analytics as methods that leverage data in a particular functional process (or application) to enable context-specific insight that is actionable (Kirk, 2006). Business analytics naturally concerns the application of analytics in industry, and the title of this book, Computational Business Analytics, refers to the algorithmic process of analytics as implemented via computer. This book provides a computational account of analytics, and leaves such areas as visualization-based analytics to other authors.

Each of the definitions provided above is broad enough to cover any application domain. This book is not intended to cover every possible business vertical, but rather to teach the core tools and techniques applicable across multiple domains. In the process of doing so, we present many examples and a selected number of challenging case studies from interesting domains. Our hope is that practitioners of business analytics will be able to easily see the connections to their own problems and to formulate their own strategies for finding the solutions they seek.

Traditional business analytics has focused mostly on descriptive analyses of structured historical data using myriad statistical techniques. The current trend has been a turn towards predictive analytics and text analytics of unstructured data. Our approach is to augment and enrich numerical statistical techniques with symbolic Artificial Intelligence (AI) [2] and Machine Learning (ML) [3] techniques. Note our usage of the terms augment and enrich as opposed to replace.
Traditional statistical approaches are invaluable in data-rich environments, but there are areas where AI and ML approaches provide better analyses, especially where there is an abundance of subjective knowledge. Benefits of such augmentation include:

  • Mixing of numerical (e.g., interest rate, income) and categorical (e.g., day of the week, position in a company) variables in algorithms.
  • What-if or explanation-based reasoning (e.g., what if the revenue target is set higher; explain the reason for a customer churn).
  • Results of inferences are easily understood by human analysts.
  • Efficiency enhancement by incorporating knowledge from domain experts as heuristics, for example to deal with the curse of dimensionality.

[1] http://www.merriam-webster.com/
[2] AI systems are computer systems exhibiting some form of human intelligence.
[3] Computer systems incorporating ML technologies have the ability to learn from observations.

Though early AI reasoning was primarily symbolic in nature (i.e., the manipulation of linguistic symbols with well-defined semantics), it has moved towards a hybrid of symbolic and numerical, and therefore one can expect to find both probabilistic and statistical foundations in many AI approaches. Here are some augmentation/enrichment approaches readers will find covered by this book (not to worry if you are not familiar with the terms): we enrich principal component and factor analyses with subspace methods (e.g., latent semantic analyses), meld regression analyses with probabilistic graphical modeling, extend autoregression and survival analysis techniques with Kalman filter and dynamic Bayesian networks, embed decision trees within influence diagrams, and augment nearest-neighbor and k-means clustering techniques with support vector machines and neural networks.
On the surface, these extensions may seem to be replacements of traditional analytics, but in most of these cases a generalized technique can be reduced to the underlying traditional base technique under very restrictive conditions. The enriched techniques offer efficient solutions in areas such as customer segmentation, churn prediction, credit risk assessment, fraud detection, and advertising campaigns.

Descriptive and Predictive Analytics together establish current and projected situations of an organization, but do not recommend actions. An obvious next step is Prescriptive Analytics, which is a process to determine alternative courses of actions or decision options, given the situation along with a set of objectives, requirements, and constraints. Automation of decision-making for routine tasks is ubiquitous (e.g., preliminary approval of loan eligibility or determining insurance premiums), but subjective processes within organizations are still used for complex decision-making (e.g., credit risk assessment or clinical trial assessment). This current use of subjectivity should not prohibit the analytics community from pursuing a computational approach to the generation of decision options by accounting for various non-quantifiable subjective factors together with numerical data. The analytics-generated options can then be presented, along with appropriate explanations and backing, to the decision-makers of the organization.

Analytics is ultimately about processing data and knowledge. If available data are structured in relational databases, then data samples and candidate variables for the models to be built are well-identified. However, more than eighty percent of enterprise data today is unstructured (Grime, 2011), and there is an urgent need for automated analyses. Text analytics is a framework to enable an organization to discover and maximize the value of information within large quantities of text (open source or internal).
Applications include sentiment analysis, business intelligence analysis, e-service, military intelligence analysis, scientific discovery, and search and information access. This book covers computational technologies to support two fundamental requirements for text analyses: information extraction and text classification.

Most analytics systems presented as part of case studies will be hybrid in nature, combining the above three approaches, namely statistics-, AI-, and ML-based. Special emphasis is placed on techniques handling time. Examples in this book are drawn from numerous domains, including life status estimation, loan processing, and credit risk assessment. Since the techniques presented here have roots in the theory of statistics and probability, in AI and ML, and in control theory, there is an abundance of relevant literature for further studies.

Readership

The book may be used by designers and developers of analytics systems for any vertical (e.g., healthcare, finance and accounting, human resources, customer support, transportation) who work within business organizations around the world. They will find the book useful as a vehicle for moving towards a new generation of analytics approaches. University students and teachers, especially those in business schools, who are studying and teaching in the field of analytics will find the book useful as a textbook for undergraduate and graduate courses, and as a reference book for researchers. Prior understanding of the theories presented in the book will be beneficial for those who wish to build analytics systems grounded in well-founded theory, rather than ad hoc ones.

Contents

The sixteen chapters in this book are divided into six parts, mostly along the line of statistics, AI, and ML paradigms, including the parts for introductory materials, information structuring and dissemination, and tools and case studies.
It would have been unnatural to divide along the three categories of analytics processes, namely, descriptive, predictive, and prescriptive. This is mainly due to the fact that some models can be used for the purpose of more than one of these three analytics. For example, if a model helps to discriminate a set of alternative hypotheses based on the available information, these hypotheses could be possible current or future situations, or alternative courses of actions.

The coverage of statistics and probability theory in this book is far from comprehensive; we focus only on those descriptive and inferential techniques that are either enhanced via or used within some AI and ML techniques. There is an abundance of books on statistics and probability theory for further investigation, if desired.

PART I  Introduction and Background

Chapter 1 details the concepts of analytics, with examples drawn from various application domains. It provides a brief account of analytics modeling and some well-known models and architectures of analytics. Chapter 1 is written in an informal manner and uses relatable examples, and is crucial for understanding the basics of analytics in general.

Chapter 2 presents background on mathematical and statistical preliminaries, including basic probability and statistics, graph theory, mathematical logic, performance measurement, and algorithmic complexity. This chapter will serve as a refresher for those readers who have already been exposed to these concepts.

PART II  Statistical Analytics

Chapter 3 provides a detailed account of various statistical techniques for descriptive analytics. These include relevant discrete and continuous probability distributions and their applicability, goodness-of-fit tests, and measures of central tendency and dispersion.

Chapter 4 is dedicated to Bayesian probability and inferencing, given its importance across most of the approaches.
We analyze Bayes's rule, and discuss the concept of priors and various techniques for obtaining them.

Chapter 5 covers inferential statistics for predictive analytics. Topics include generalization, hypothesis testing, estimation, prediction, and decision. We cover various dependence methods in this category, including linear and logistic regression, polynomial regression, Bayesian regression, autoregression, factor analysis, and survival analysis. We save the Decision Tree (DT) learning technique Classification and Regression Tree (CART) for a later chapter, given its close similarity with other DT techniques from the ML community.

PART III  Artificial Intelligence for Analytics

Chapter 6 presents the traditional symbolic AI approach to analytics. This chapter provides a detailed account of uncertainty and describes various well-established formal approaches to handling uncertainty, some of which are to be covered in more detail in subsequent chapters.

Chapter 7 presents several probabilistic graphical models for analytics. We start with Naïve Bayesian Classifiers (NBCs), move to their generalizations, the k-dependence Naïve Bayesian Classifiers (kNBCs), and, finally, explore the most general Bayesian Belief Networks (BNs). The chapter presents various evidence propagation algorithms. There is not always an intuitive explanation of how evidence is propagated up and down the arrows in a BN model via abductive (explanation-based) and deductive (causal) inferencing. This is largely due to the conditional independence assumption and, as a consequence, separation among variables. To understand evidence propagation behavior and also to identify sources of inferencing inefficiency, readers are therefore encouraged to go through the theory underlying BN technology and propagation algorithms in as much detail as they can.

Chapter 8 describes the use of the Influence Diagram (ID) and symbolic argumentation technologies to make decisions using prescriptive analytics.
The BN and rule-based formalisms for hypothesis evaluation do not explicitly incorporate the concepts of action and utility that are ubiquitous in decision-making contexts. IDs incorporate the concepts of action and utility. Symbolic argumentation allows one to express arguments for and against decision hypotheses with weights from a variety of dictionaries, including the probability dictionary. Arguments are aggregated to rank the considered set of hypotheses to help choose the most plausible one. Readers must go through the BN chapter to understand IDs.

Chapter 9 presents our discussion of models in the temporal category. We present several approaches to modeling time-series data generated from a dynamic environment, such as the financial market, and then make use of such models for forecasting. We present the Kalman Filter (KF) technique for estimating the state of a dynamic environment, then present the Hidden Markov Model (HMM) framework and the more generalized Dynamic Bayesian Network (DBN) technology. DBNs are temporal extensions of BNs. Inference algorithms for these models are also provided. Readers must understand the BN technology to understand its temporal extension.

Chapter 10 presents sampling-based approximate algorithms for inferences in non-linear models. The algorithms that we cover are Markov Chain Monte Carlo (MCMC), Gibbs sampling, Metropolis-Hastings, and Particle Filter (PF). PF algorithms are especially effective in handling hybrid DBNs containing both categorical and numerical variables.

PART IV  Machine Learning for Analytics

Chapter 11 covers some of the most popular and powerful clustering techniques for segmenting data sets, namely, hierarchical, k-means, k-Nearest Neighbor (kNN), Support Vector Machines (SVM), and feed-forward Neural Networks (NNs). The first three have their roots in traditional statistics, whereas the latter two developed within the ML community.
Chapter 12 presents supervised and unsupervised techniques for learning trees, rules, and graphical models for analytics, some of which have been presented in the previous chapters. We start with algorithms for learning Decision Trees (DTs), and then investigate learning of various probabilistic graphical models, namely, NBC, kNBC, and BN. Finally, we present a general rule induction technique, called Inductive Logic Programming (ILP).

PART V  Information Structuring and Dissemination

Chapter 13 deals with the analytics of unstructured textual data. The two fundamental tasks that provide foundations for text analytics are information extraction and text classification. This chapter briefly introduces some popular linguistic techniques for extracting structured information in the form of Resource Description Framework (RDF) triples, then details an array of techniques for learning classifiers for text corpora, such as NBC, kNBC, Latent Semantic Analysis (LSA), probabilistic LSA (PLSA), and Latent Dirichlet Allocation (LDA). PLSA and LDA are particularly useful for extracting latent topics in a text corpus in an unsupervised manner.

Chapter 14 presents standardized semantics for information content that is exchanged among, and must be comprehended by, various consuming entities, whether they are computer-based processes, physical systems, or human operators. We present the Semantic Web technology to serve such a purpose.

PART VI  Analytics Tools and Case Studies

Chapter 15 presents three analytics tools that are designed and conceived by the author: 1) Intelligent Decision Aiding System (iDAS), which provides implementations of a set of ML techniques; 2) Environment for 5th Generation Applications (E5), which provides a development environment in declarative languages with an embedded expert system shell; and 3) Analysis of Text (aText) for information extraction and classification of text documents.
Demo versions of iDAS, E5, and aText can be obtained by purchasing a copy of the book and then emailing a request to the author. The chapter also presents very briefly a handful of commercial and publicly available tools for analytics, including R, MATLAB, WEKA, and SAS.

The author can be contacted at sdas@machineanalytics.com or subrata@skdas.com to request a demonstration version of any of the above three Machine Analytics tools used to perform case studies in the two penultimate chapters of the book. It will be the sole discretion of the author to provide tools upon a satisfactory analysis of the requestor's usage intention. Use of the tools is entirely at the requestor's own risk. Machine Analytics is not responsible for the consequences of reliance on any analyses provided by the tools. Licensing details for commercial versions of these tools can be obtained by sending an email to admin@machineanalytics.com.

Chapter 16 presents four detailed case studies, namely, risk assessment for both individual and commercial lending, life status estimation, and sentiment analysis, making use of all three tools, iDAS, E5, and aText. The demo versions of the tools (see above) come with data from these case studies for readers to run on their own. The chapter also describes various types of fraud detection problems that can be solved by using various modeling and clustering technologies introduced in the book.

The scope of analytics is broad and interdisciplinary in nature, and is likely to cover a breadth of topic areas. The aim of this book is not to cover each and every aspect of analytics. The book provides a computational account of analytics, and leaves areas such as visual analytics, image analytics, and web analytics for other authors. Moreover, the symbolic thrust of the book naturally puts less emphasis on sub-symbolic areas, such as neural networks.