COMPUTATIONAL
BUSINESS ANALYTICS
K14110_FM.indd 1
11/19/13 6:40 PM
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.
AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge
discovery, while summarizing the computational tools and techniques useful in data analysis. This
series encourages the integration of mathematical, statistical, and computational methods and
techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the
series includes, but is not limited to, titles in the areas of data mining and knowledge discovery
methods and applications, modeling, algorithms, theory and foundations, data and knowledge
visualization, data mining systems and tools, and privacy and security issues.
PUBLISHED TITLES
ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
COMPUTATIONAL BUSINESS ANALYTICS
Subrata Das
COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE
DEVELOPMENT
Ting Yu, Nitesh V. Chawla, and Simeon Simoff
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY,
AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS
Guozhu Dong and James Bailey
DATA CLUSTERING: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal and Chandan K. Reddy
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
Guojun Gan
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
DATA MINING WITH R: LEARNING WITH CASE STUDIES
Luís Torgo
FOUNDATIONS OF PREDICTIVE ANALYTICS
James Wu and Stephen Coggeshall
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY,
SECOND EDITION
Harvey J. Miller and Jiawei Han
HANDBOOK OF EDUCATIONAL DATA MINING
Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis
INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS
Priti Srinivas Sajja and Rajendra Akerkar
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS
AND TECHNIQUES
Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND
LAW ENFORCEMENT
David Skillicorn
KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR
ENGINEERING SYSTEMS HEALTH MANAGEMENT
Ashok N. Srivastava and Jiawei Han
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO
CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang
MUSIC DATA MINING
Tao Li, Mitsunori Ogihara, and George Tzanetakis
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
RAPIDMINER: DATA MINING USE CASES AND BUSINESS ANALYTICS
APPLICATIONS
Markus Hofmann and Ralf Klinkenberg
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS,
AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY
Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING
Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY,
ALGORITHMS, AND EXTENSIONS
Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING
Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX
DECOMPOSITIONS
David Skillicorn
COMPUTATIONAL
BUSINESS ANALYTICS
SUBRATA DAS
Machine Analytics, Inc.
Belmont, Massachusetts, USA
The author can be contacted at
[email protected] to request a demonstration version of any of
the three Machine Analytics tools used to perform case studies in the two penultimate chapters of
the book. It will be at the sole discretion of the author to provide tools upon a satisfactory analysis of
the requestor’s usage intention. Use of the tools is entirely at the requestor’s own risk. Machine Analytics is not
responsible for the consequences of reliance on any analyses provided by the tools. Licensing details
for commercial versions of these tools can be obtained by sending an email to
[email protected].
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20131206
International Standard Book Number-13: 978-1-4398-9073-8 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
3.3 CONTINUOUS PROBABILITY DISTRIBUTIONS
3.3.1 Gaussian or Normal Distribution
3.3.2 Lognormal
3.3.3 Exponential Distribution
3.3.4 Weibull Distribution
3.3.5 Beta and Dirichlet Distributions
3.3.6 Gamma Distribution
3.4 GOODNESS-OF-FIT TEST
3.4.1 Probability Plot
3.4.2 One-Way Chi-Square Goodness-of-Fit Test
3.4.3 Kolmogorov-Smirnov Test
3.5 FURTHER READING
Chapter 4 Bayesian Probability and Inference
4.1 BAYESIAN INFERENCE
4.2 PRIOR PROBABILITIES
4.2.1 Conjugate Priors
4.2.2 The Jeffreys Prior
4.3 FURTHER READING
Chapter 5 Inferential Statistics and Predictive Analytics
5.1 CHI-SQUARE TEST OF INDEPENDENCE
5.2 REGRESSION ANALYSES
5.2.1 Simple Linear Regression
5.2.2 Multiple Linear Regression
5.2.3 Logistic Regression
5.2.4 Polynomial Regression
5.3 BAYESIAN LINEAR REGRESSION
5.3.1 Gaussian Processes
5.4 PRINCIPAL COMPONENT AND FACTOR ANALYSES
5.5 SURVIVAL ANALYSIS
5.6 AUTOREGRESSION MODELS
5.7 FURTHER READING
Chapter 6 Artificial Intelligence for Symbolic Analytics
6.1 ANALYTICS AND UNCERTAINTIES
6.1.1 Ignorance to Uncertainties
6.1.2 Approaches to Handling Uncertainties
6.2 NEO-LOGICIST APPROACH
6.2.1 Evolution of Rules
6.2.2 Inferencing in Rule-based Systems
6.2.3 Advantages and Disadvantages of Rule-Based Systems
6.3 NEO-PROBABILIST
6.4 NEO-CALCULIST APPROACH
6.4.1 Certainty Factors
6.4.2 Dempster-Shafer Theory of Belief Function
6.5 NEO-GRANULARIST
6.5.1 Probabilistic Logic
6.5.2 Fuzzy Logic
6.5.3 Fuzzy Logic for Customer Segmentation
6.6 FURTHER READING
Chapter 7 Probabilistic Graphical Modeling
7.1 NAIVE BAYESIAN CLASSIFIER (NBC)
7.2 K-DEPENDENCE NAIVE BAYESIAN CLASSIFIER (KNBC)
7.3 BAYESIAN BELIEF NETWORKS
7.3.1 Conditional Independence in Belief Networks
7.3.2 Evidence, Belief, and Likelihood
7.3.3 Prior Probabilities in Networks without Evidence
7.3.4 Belief Revision
7.3.5 Evidence Propagation in Polytrees
7.3.5.1 Upward Propagation in a Linear Fragment
7.3.5.2 Downward Propagation in a Linear Fragment
7.3.5.3 Upward Propagation in a Tree Fragment
7.3.5.4 Downward Propagation in a Tree Fragment
7.3.5.5 Upward Propagation in a Polytree Fragment
7.3.5.6 Downward Propagation in a Polytree Fragment
7.3.6 Propagation Algorithm
7.3.7 Evidence Propagation in Directed Acyclic Graphs
7.3.7.1 Graphical Transformation
7.3.7.2 Join Tree Initialization
7.3.7.3 Propagation in Join Tree and Marginalization
7.3.7.4 Handling Evidence
7.3.8 Complexity of Inference Algorithms
7.3.9 Acquisition of Probabilities
7.3.10 Advantages and Disadvantages of Belief Networks
7.3.11 Belief Network Tools
7.4 FURTHER READING
Chapter 8 Decision Support and Prescriptive Analytics
8.1 EXPECTED UTILITY THEORY AND DECISION TREES
8.2 INFLUENCE DIAGRAMS FOR DECISION SUPPORT
8.2.1 Inferencing in Influence Diagrams
8.2.2 Compilation of Influence Diagrams
8.3 SYMBOLIC ARGUMENTATION FOR DECISION SUPPORT
8.3.1 Measuring Consensus
8.3.2 Combining Sources of Varying Confidence
8.4 FURTHER READING
Chapter 9 Time Series Modeling and Forecasting
9.1 PROBLEM MODELING
9.1.1 State Transition and Observation Models
9.1.2 Estimation Problem
9.2 KALMAN FILTER (KF)
9.2.1 Extended Kalman Filter (EKF)
9.3 MARKOV MODELS
9.3.1 Hidden Markov Models (HMM)
9.3.2 The Forward Algorithm
9.3.3 The Viterbi Algorithm
9.3.4 Baum-Welch Algorithm for Learning HMM
9.4 DYNAMIC BAYESIAN NETWORKS (DBNS)
9.4.1 Inference Algorithms for DBNs
9.5 FURTHER READING
Chapter 10 Monte Carlo Simulation
10.1 MONTE CARLO APPROXIMATION
10.2 GIBBS SAMPLING
10.3 METROPOLIS-HASTINGS ALGORITHM
10.4 PARTICLE FILTER (PF)
10.4.1 Particle Filter for Dynamical Systems
10.4.2 Particle Filter for DBN
10.4.3 Particle Filter Issues
10.5 FURTHER READING
Chapter 11 Cluster Analysis and Segmentation
11.1 HIERARCHICAL CLUSTERING
11.2 K-MEANS CLUSTERING
11.3 K-NEAREST NEIGHBORS
11.4 SUPPORT VECTOR MACHINES
11.4.1 Linearly Separable Data
11.4.2 Preparation of Data and Packages
11.4.3 Non-Separable Data
11.4.4 Non-Linear Classifier
11.4.5 VC Dimension and Maximum Margin Classifier
11.5 NEURAL NETWORKS
11.5.1 Model Building and Data Preparation
11.5.2 Gradient Descent for Updating Weights
11.6 FURTHER READING
Chapter 12 Machine Learning for Analytics Models
12.1 DECISION TREES
12.1.1 Algorithms for Constructing Decision Trees
12.1.2 Overfitting in Decision Trees
12.1.3 Handling Continuous Attributes
12.1.4 Advantages and Disadvantages of Decision Tree Techniques
12.2 LEARNING NAIVE BAYESIAN CLASSIFIERS
12.2.1 Semi-Supervised Learning of NBC via EM
12.3 LEARNING OF KNBC
12.4 LEARNING OF BAYESIAN BELIEF NETWORKS
12.4.1 Cases for Learning Bayesian Networks
12.4.2 Learning Probabilities
12.4.2.1 Brief Survey
12.4.2.2 Learning Probabilities from Fully Observable Variables
12.4.2.3 Learning Probabilities from Partially Observable Variables
12.4.2.4 Online Adjustment of Parameters
12.4.3 Structure Learning
12.4.3.1 Brief Survey
12.4.3.2 Learning Structure from Fully Observable Variables
12.4.3.3 Learning Structure from Partially Observable Variables
12.4.4 Use of Prior Knowledge from Experts
12.5 INDUCTIVE LOGIC PROGRAMMING
12.6 FURTHER READING
Chapter 13 Unstructured Data and Text Analytics
13.1 INFORMATION STRUCTURING AND EXTRACTION
13.2 BRIEF INTRODUCTION TO NLP
13.2.1 Syntactic Analysis
13.2.1.1 Tokenization
13.2.1.2 Morphological Analysis
13.2.1.3 Part-of-Speech (POS) Tagging
13.2.1.4 Syntactic Parsing
13.2.2 Semantic Analysis
13.2.2.1 Named Entity Recognition
13.2.2.2 Co-reference Resolution
13.2.2.3 Relation Extraction
13.3 TEXT CLASSIFICATION AND TOPIC EXTRACTION
13.3.1 Naïve Bayesian Classifiers (NBC)
13.3.2 k-Dependence Naïve Bayesian Classifier (kNBC)
13.3.3 Latent Semantic Analysis
13.3.4 Probabilistic Latent Semantic Analysis (PLSA)
13.3.5 Latent Dirichlet Allocation (LDA)
13.4 FURTHER READING
Chapter 14 Semantic Web
14.1 RESOURCE DESCRIPTION FRAMEWORK (RDF)
14.1.1 RDF Schema (RDFS)
14.1.2 Ontology Web Language (OWL)
14.2 DESCRIPTION LOGICS
14.2.1 Description Logic Syntax
14.2.2 Description Logic Axioms
14.2.3 Description Logic Constructs and Subsystems
14.2.4 Description Logic and OWL Constructs in Relational Database
14.2.5 Description Logic as First-Order Logic
14.3 FURTHER READING
Chapter 15 Analytics Tools
15.1 INTELLIGENT DECISION AIDING SYSTEM (IDAS)
15.2 ENVIRONMENT FOR 5TH GENERATION APPLICATIONS (E5)
15.2.1 Rule-based Expert System Shell
15.2.2 Prolog Interpreter
15.2.3 Lisp Interpreter
15.3 ANALYSIS OF TEXT (ATEXT)
15.4 R AND MATLAB
15.5 SAS AND WEKA
Chapter 16 Analytics Case Studies
16.1 RISK ASSESSMENT MODEL I3
16.2 RISK ASSESSMENT IN INDIVIDUAL LENDING USING IDAS
16.3 RISK ASSESSMENT IN COMMERCIAL LENDING USING E5 AND IDAS
16.4 FRAUD DETECTION
16.5 SENTIMENT ANALYSIS USING ATEXT
16.5.1 Text Corpus Classification
16.5.2 Evaluation Results
16.6 LIFE STATUS ESTIMATION USING DYNAMIC BAYESIAN NETWORKS
Appendix A Usage of Symbols
A.1 SYMBOLS USED IN THE BOOK
Appendix B Examples and Sample Data
B.1 PLAY-TENNIS EXAMPLE
B.2 UNITED STATES ELECTORAL COLLEGE DATA
Appendix C MATLAB and R Code Examples
C.1 MATLAB CODE FOR STOCK PREDICTION USING KALMAN FILTER
C.2 R CODE FOR STOCK PREDICTION USING KALMAN FILTER
Index
Preface
According to the Merriam-Webster dictionary¹, analytics is the method of
logical analysis. This is a very broad definition of analytics, without an explicitly stated end-goal. A view of analytics within the business community is
that analytics describes a process (a method or an analysis) that transforms
(hopefully, logically) raw data into actionable knowledge in order to guide
strategic decision-making. Along this line, technology research guru Gartner
defines analytics as methods that leverage data in a particular functional
process (or application) to enable context-specific insight that is actionable
(Kirk, 2006). Business analytics naturally concerns the application of analytics in industry, and the title of this book, Computational Business Analytics,
refers to the algorithmic process of analytics as implemented via computer.
This book provides a computational account of analytics, and leaves such
areas as visualization-based analytics to other authors.
Each of the definitions provided above is broad enough to cover any application domain. This book is not intended to cover every possible business
vertical, but rather to teach the core tools and techniques applicable across
multiple domains. In the process of doing so, we present many examples and
a selected number of challenging case studies from interesting domains. Our
hope is that practitioners of business analytics will be able to easily see the
connections to their own problems and to formulate their own strategies for
finding the solutions they seek.
Traditional business analytics has focused mostly on descriptive analyses
of structured historical data using myriad statistical techniques. The current
trend has been a turn towards predictive analytics and text analytics of unstructured data. Our approach is to augment and enrich numerical statistical
techniques with symbolic Artificial Intelligence (AI)² and Machine Learning
(ML)³ techniques. Note our usage of the terms augment and enrich as opposed
to replace. Traditional statistical approaches are invaluable in data-rich
environments, but there are areas where AI and ML approaches provide
better analyses, especially where there is an abundance of subjective knowledge. Benefits of such augmentation include:
• Mixing of numerical (e.g., interest rate, income) and categorical (e.g.,
day of the week, position in a company) variables in algorithms.
• What-if or explanation-based reasoning (e.g., what if the revenue target is set higher, or explain the reason for a customer churn).
• Results of inferences that are easily understood by human analysts.
• Efficiency enhancement by incorporating knowledge from domain experts
as heuristics to deal with, for example, the curse of dimensionality.
¹ http://www.merriam-webster.com/
² AI systems are computer systems exhibiting some form of human intelligence.
³ Computer systems incorporating ML technologies have the ability to learn from observations.
Though early AI reasoning was primarily symbolic in nature (i.e., the manipulation of linguistic symbols with well-defined semantics), it has moved
towards a hybrid of symbolic and numerical, and therefore one is expected to
find both probabilistic and statistical foundations in many AI approaches.
Here are some augmentation/enrichment approaches readers will find covered by this book (not to worry if you are not familiar with the terms): we
enrich principal component and factor analyses with subspace methods (e.g.,
latent semantic analyses), meld regression analyses with probabilistic graphical modeling, extend autoregression and survival analysis techniques with
Kalman filter and dynamic Bayesian networks, embed decision trees within
influence diagrams, and augment nearest-neighbor and k-means clustering
techniques with support vector machines and neural networks. On the surface,
these extensions may seem to be replacements of traditional analytics, but in
most of these cases a generalized technique can be reduced to the underlying
traditional base technique under very restrictive conditions. The enriched techniques offer efficient solutions in areas such as customer segmentation, churn
prediction, credit risk assessment, fraud detection, and advertising campaigns.
Descriptive and Predictive Analytics together establish current and projected situations of an organization, but do not recommend actions. An obvious next step is Prescriptive Analytics, which is a process to determine alternative courses of action or decision options, given the situation along with a set
of objectives, requirements, and constraints. Automation of decision-making
for routine tasks is ubiquitous (e.g., preliminary approval of loan eligibility or
determining insurance premiums), but subjective processes within organizations are still used for complex decision-making (e.g., credit risk assessment
or clinical trial assessment). This current use of subjectivity should not prohibit the analytics community from pursuing a computational approach to the
generation of decision options by accounting for various non-quantifiable subjective factors together with numerical data. The analytics-generated options
can then be presented, along with appropriate explanations and backing, to
the decision-makers of the organization.
Analytics is ultimately about processing data and knowledge. If available
data are structured in relational databases, then data samples and candidate
variables for the models to be built are well-identified. However, more than
eighty percent of enterprise data today is unstructured (Grime, 2011), and
there is an urgent need for automated analyses. Text analytics is a framework
to enable an organization to discover and maximize the value of information
within large quantities of text (open source or internal). Applications include
sentiment analysis, business intelligence analysis, e-service, military intelligence analysis, scientific discovery, and search and information access. This
book covers computational technologies to support two fundamental requirements for text analyses, information extraction and text classification.
Most analytics systems presented as part of case studies will be hybrid
in nature, in combinations of the above three approaches, namely statistics-,
AI-, and ML-based. Special emphasis is placed on techniques handling time.
Examples in this book are drawn from numerous domains, including life status
estimation, loan processing, and credit risk assessment. Since the techniques
presented here have roots in the theory of statistics and probability, in AI and
ML, and in control theory, there is an abundance of relevant literature for
further studies.
Readership
The book may be used by designers and developers of analytics systems
for any vertical (e.g., healthcare, finance and accounting, human resources,
customer support, transportation) who work within business organizations
around the world. They will find the book useful as a vehicle for moving
towards a new generation of analytics approaches. University students and
teachers, especially those in business schools, who are studying and teaching
in the field of analytics will find the book useful as a textbook for undergraduate and graduate courses, and as a reference book for researchers. Prior
understanding of the theories presented in the book will be beneficial for those
who wish to build analytics systems grounded in well-founded theory, rather
than ad hoc ones.
Contents
The sixteen chapters in this book are divided into six parts, mostly along
the lines of statistics, AI, and ML paradigms, including the parts for introductory materials, information structuring and dissemination, and tools and
case studies. It would have been unnatural to divide along the three categories
of analytics processes, namely, descriptive, predictive, and prescriptive. This
is mainly due to the fact that some models can be used for the purpose of
more than one of these three analytics. For example, if a model helps to discriminate a set of alternative hypotheses based on the available information,
these hypotheses could be possible current or future situations, or alternative
courses of action. The coverage of statistics and probability theory in this
book is far from comprehensive; we focus only on those descriptive and inferential techniques that are either enhanced via or used within some AI and
ML techniques. There is an abundance of books on statistics and probability
theory for further investigation, if desired.
PART I Introduction and Background
Chapter 1 details the concepts of analytics, with examples drawn from
various application domains. It provides a brief account of analytics modeling and some well-known models and architectures of analytics. Chapter 1 is
written in an informal manner and uses relatable examples, and is crucial for
understanding the basics of analytics in general.
Chapter 2 presents background on mathematical and statistical preliminaries,
including basic probability and statistics, graph theory, mathematical
logic, performance measurement, and algorithmic complexity. This chapter
will serve as a refresher for those readers who have already been exposed to
these concepts.
PART II Statistical Analytics
Chapter 3 provides a detailed account of various statistical techniques
for descriptive analytics. These include relevant discrete and continuous probability distributions and their applicability, goodness-of-fit tests, measures of
central tendency, and dispersions.
Chapter 4 is dedicated to Bayesian probability and inferencing, given
its importance across most of the approaches. We analyze Bayes's rule, and
discuss the concept of priors and various techniques for obtaining them.
Chapter 5 covers inferential statistics for predictive analytics. Topics include
generalization, hypothesis testing, estimation, prediction, and decision. We
cover various dependence methods in this category, including linear and logistic regressions, polynomial regression, Bayesian regression, auto-regression,
factor analysis, and survival analysis. We save the Decision Tree (DT) learning technique Classification and Regression Tree (CART) for a later chapter,
given its close similarity with other DT techniques from the ML community.
PART III Artificial Intelligence for Analytics
Chapter 6 presents the traditional symbolic AI approach to analytics.
This chapter provides a detailed account of uncertainty and describes various
well-established formal approaches to handling uncertainty, some of which are
to be covered in more detail in subsequent chapters.
Chapter 7 presents several probabilistic graphical models for analytics.
We start with Naïve Bayesian Classifiers (NBCs), move to their generalizations, the
k-dependence Naïve Bayesian Classifiers (kNBCs), and, finally, explore
the most general Bayesian Belief Networks (BNs). The chapter presents
various evidence propagation algorithms. There is not always an intuitive explanation of how evidence is propagated up and down the arrows in a BN
model via abductive (explanation-based) and deductive (causal) inferencing.
This is largely due to the conditional independence assumption and, as a consequence, separation among variables. To understand evidence propagation
behavior and also to identify sources of inferencing inefficiency, readers are
therefore encouraged to go through in as much detail as they can the theory
underlying BN technology and propagation algorithms.
Chapter 8 describes the use of the Influence Diagram (ID) and symbolic
argumentation technologies to make decisions using prescriptive analytics. The BN and rule-based formalisms for hypothesis evaluation do not
explicitly incorporate the concepts of action and utility that are ubiquitous
in decision-making contexts. IDs incorporate the concepts of action and utility. Symbolic argumentation allows one to express arguments for and against
decision hypotheses with weights from a variety of dictionaries, including the
probability dictionary. Arguments are aggregated to rank the considered set
of hypotheses to help choose the most plausible one. Readers must go through
the BN chapter to understand IDs.
Chapter 9 presents our discussion of models in the temporal category.
We present several approaches to modeling time-series data generated from a
dynamic environment, such as the financial market, and then make use of such
models for forecasting. We present the Kalman Filter (KF) technique for estimating the state of a dynamic environment, then present the Hidden Markov
Model (HMM) framework and the more generalized Dynamic Bayesian Network (DBN) technology. DBNs are temporal extensions of BNs. Inference
algorithms for these models are also provided. Readers must understand the
BN technology to understand its temporal extension.
Chapter 10 presents sampling-based approximate algorithms for inferences
in non-linear models. The algorithms that we cover are Markov Chain
Monte Carlo (MCMC), Gibbs sampling, Metropolis-Hastings, and Particle
Filter (PF). PF algorithms are especially effective in handling hybrid DBNs
containing both categorical and numerical variables.
PART IV Machine Learning for Analytics
Chapter 11 covers some of the most popular and powerful clustering
techniques for segmenting data sets, namely, hierarchical, k-means, k-Nearest
Neighbor (kNN), Support Vector Machines (SVM), and feed-forward Neural Networks (NNs). The first three have their roots in traditional statistics,
whereas the latter two developed within the ML community.
Chapter 12 presents supervised and unsupervised techniques for learning
trees, rules, and graphical models for analytics, some of which have been
presented in the previous chapters. We start with algorithms for learning
Decision Trees (DTs), and then investigate learning of various probabilistic
graphical models, namely, NBC, kNBC, and BN. Finally, we present a general
rule induction technique, called Inductive Logic Programming (ILP).
PART V Information Structuring and Dissemination
Chapter 13 deals with the analytics of unstructured textual data. The two
fundamental tasks that provide foundations for text analytics are information
extraction and text classification. This chapter briefly introduces some popular
linguistic techniques for extracting structured information in the form
of Resource Description Framework (RDF) triples, then details an array of
techniques for learning classifiers for a text corpus, such as NBC, kNBC, Latent
Semantic Analysis (LSA), probabilistic LSA (PLSA), and Latent Dirichlet
Allocation (LDA). PLSA and LDA are particularly useful for extracting latent topics in a text corpus in an unsupervised manner.
Chapter 14 presents standardized semantics for information content, so that
exchanged content can be comprehended by its various consumers,
whether they are computer-based processes, physical systems, or human operators. We present the Semantic Web technology to serve such a purpose.
PART VI Analytics Tools and Case Studies
Chapter 15 presents three analytics tools that are designed and conceived
by the author: 1) Intelligent Decision Aiding System (iDAS), which provides
implementations of a set of ML techniques; 2) Environment for 5th Generation
Applications (E5), which provides a development environment in declarative
languages with an embedded expert system shell; and 3) Analysis of Text
(aText) for information extraction and classification of text documents. Demo
versions of iDAS, E5, and aText can be obtained by purchasing a copy of
the book and then emailing a request to the author. The chapter presents
very briefly a handful of commercial and publicly available tools for analytics,
including R, MATLAB, WEKA, and SAS.
The author can be contacted at [email protected] or
[email protected] to request a demonstration version of any of the above three
Machine Analytics tools used to perform case studies in the two penultimate
chapters of the book. It will be at the sole discretion of the author to provide
tools upon a satisfactory analysis of the requestor's usage intention. Use of the
tools is entirely at the requestor's own risk. Machine Analytics is not responsible for
the consequences of reliance on any analyses provided by the tools. Licensing
details for commercial versions of these tools can be obtained by sending an
email to [email protected].
Chapter 16 presents four detailed case studies, namely, risk assessment
for both individual and commercial lendings, life status estimation, and sentiment analysis, making use of all three tools, iDAS, E5, and aText. The
demo versions of the tools (see above) come with data from these case studies
for readers to run on their own. The chapter also describes various types of
fraud detection problems that can be solved by using various modeling and
clustering technologies introduced in the book.
The scope of analytics is broad and interdisciplinary in nature, and is likely
to cover a breadth of topic areas. The aim of this book is not to cover each
and every aspect of analytics. The book provides a computational account
of analytics, and leaves areas such as visual analytics, image analytics, and
web analytics for other authors. Moreover, the symbolic thrust of the book
naturally puts less emphasis on sub-symbolic areas, such as neural networks.