Data science as an experimental process: unsupervised and supervised learning

Data Science

As a companion to my recent post “Correlation versus Causation: The Science, Art, and Magic of Experimental Design”, I wanted to offer a more technical exposition concerning data science approaches to focused causal model development.

A fundamental question faced by business analytics professionals and data scientists is whether they have a working correlative and causal explanatory model of the phenomenon they are observing, be it reducing manufacturing error rates, determining the cause of customer abandonment, reducing fraud, targeting marketing, or realizing logistics efficiencies.  This is known as an experimental model in science or a conceptual model in broader research venues (e.g. the social sciences).

The increasing interest in analytics has led to a proliferation of powerful tools that make sophisticated data analysis easier to conduct.  The danger, however, is that inexperienced analysts take a shotgun approach, throwing data at a tool and leaping at any hints of statistical significance or apparent causation that emerge.

For example, a junior data analyst might rush to notify management that s/he has detected a strong correlation between the marketing budget and revenues, suggesting that the marketing budget should be increased as much as possible.  A deeper examination of marketing efficacy in relation to mediating factors (e.g. macroeconomic trends, demographic features, competitive forces, trending consumer preferences, seasonality, weather) will reveal that marketing expenditures are rarely a constant, direct causal agent in revenue growth (and, when they are a strong factor, the effect is only temporary in scope).  Otherwise, marketing would have an infinite budget and run most companies (though this might not stop them from trying to assert this right).
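To make the marketing example concrete, here is a minimal sketch on synthetic data.  It assumes a hypothetical `season` factor that drives both the marketing budget and revenue, with no direct effect of marketing on revenue at all.  The variable names and coefficients are invented for illustration; only the standard library is used.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """Correlation of x and y after controlling for a third variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Synthetic data (an assumption): season drives BOTH series independently.
rng = random.Random(0)
season = [rng.random() for _ in range(500)]
marketing = [10 + 5 * s + rng.gauss(0, 0.5) for s in season]
revenue = [100 + 20 * s + rng.gauss(0, 2.0) for s in season]

raw = pearson(marketing, revenue)
controlled = partial_corr(marketing, revenue, season)
print(f"raw correlation:          {raw:.2f}")
print(f"controlling for season:   {controlled:.2f}")
```

The raw correlation comes out strong, while the partial correlation that controls for season falls to roughly zero.  In this constructed case the "marketing drives revenue" reading is entirely an artifact of the shared seasonal factor.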

A fundamental question analytics and data professionals should ask is whether, at any particular point, there are sufficient grounds, based upon statistical significance, to put a proposed causal model into operational use (e.g. to recommend a decision path based on descriptive or predictive analysis, or to operationalize a prescriptive algorithm).  In other words, if there is a working causal hypothesis or practical model in place, has it been sufficiently tested to establish statistical significance and validity?
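One simple way to ask whether an observed effect clears the bar of statistical significance is a permutation test.  The sketch below is one possible approach, not a prescribed procedure: it compares a hypothetical `treated` group against a `control` group on synthetic outcome values and estimates how often a difference at least as large would arise from random relabeling alone.

```python
import random

def permutation_test(control, treated, n_permutations=2000, seed=0):
    """One-sided permutation test for a difference in means (treated - control)."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(control) + list(treated)
    n_treated = len(treated)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Randomly reassign the group labels and recompute the difference.
        rng.shuffle(pooled)
        perm_treated = pooled[:n_treated]
        perm_control = pooled[n_treated:]
        diff = (sum(perm_treated) / len(perm_treated)
                - sum(perm_control) / len(perm_control))
        if diff >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic outcomes (an assumption): the treated group has a higher true mean.
data_rng = random.Random(1)
control = [data_rng.gauss(10.0, 2.0) for _ in range(50)]
treated = [data_rng.gauss(13.0, 2.0) for _ in range(50)]

observed, p_value = permutation_test(control, treated)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

A small p-value only says the difference is unlikely to be chance.  It does not by itself establish that the proposed causal model is the right explanation, which is where the questions in the next paragraph come in.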

Much of this comes down to whether a structured analytical process was followed to establish experimental validity (statistical significance) for the experimental / conceptual model.  Where did the experimental / conceptual model come from?  Were the proper experts consulted?  Were alternative explanations / hypotheses properly considered?  Was there a deep enough examination of mediating and moderating variables, and of the grounds for establishing direct causation as opposed to correlation?  Are there hidden, more fundamental factors at play that have been missed?

As an example of the importance of drilling down to fundamental causes, we could stop at saying, for instance, “it is good for a surgeon to clean his/her hands before operating because we notice fewer subsequent infections”.  However, we now know (because of sound scientific inquiry) that the causal factor of infection is bacterial and viral agents.  A deeper understanding of microbiology (in particular viruses and bacteria) allows us to also prescribe the sterilization of operating instruments and the operating theater.  As well, we know the simple efficacy of washing one’s hands in reducing the transmission of the flu.  When we stop at noticing a correlation (e.g. clean hands = fewer infections), we not only forgo broader understanding, but we potentially continue perpetuating serious errors (e.g. not sterilizing surgical instruments).

To turn our attention more directly to data science and analytics, the analysis of data should follow a methodical process of iteratively strengthening a conceptual model through staged statistical and algorithmic analysis.  A major division concerns whether there are suitable grounds for segmenting a dataset prior to applying statistical analysis (e.g. customers, manufacturing errors, credit card transactions), or whether there is a lack of understanding concerning the operative correlative factors in terms of grouping.

The following two fundamental approaches should provide some guidance in the development of an operational explanatory model.  If there is little understanding of the nature of the correlative factors which suggest groupings or clusters of phenomena (e.g. customer categories), unsupervised learning should be applied to segment or cluster fundamental statistical categories.  If there is an understanding of fundamental categories, supervised learning should be used to profile segmented groups for prediction and prescriptive treatments.

1.  Unsupervised Learning:  segmentation / clustering

Unsupervised learning to cluster or segment a dataset should be the first step if there is no working understanding of the phenomenon at play (the base correlative interaction of variables) and no classification labels.  Such an approach should be used in cases where there is a large enough dataset and there is a core phenomenon of interest (e.g. declining sales, increasing fraud), but no clear primary understanding of how the component variables correlate amongst themselves (e.g. how meaningful groups of customers are identified, or whether and how observed variables on the assembly line contribute to the error rate).

Unsupervised techniques group observations based on statistical similarity.  Such an approach is applicable where there is no labeled training set (e.g. where customers at risk for fraud have not yet been segmented into meaningful groups).  Clustering algorithms are specifically used to identify unique segments of a population and to depict the common attributes of members of a cluster in relation to the target phenomenon.  The goal is to generate or extract classification labels automatically (hence the term unsupervised).

This approach is useful when an analyst has no idea how to segment the population (e.g. customers) in relation to the phenomenon (e.g. purchasing or fraud).  Running a clustering technique is a good first step to see how the elements in a dataset relate to one another in distinct groups.  Such techniques are used regularly in marketing analysis to derive meaningful categories of customers, which can then be targeted independently with tailored sales and marketing messages.

The types of statistical / analytical techniques available include:

•  Cluster analysis

o  Hierarchical cluster analysis

o  Two-step cluster analysis

o  O-Cluster (proprietary Oracle algorithm)

o  K-Means

•  Factor analysis

•  Naïve Bayes for clustering

•  Kohonen Networks / Self-Organizing Maps
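As a rough illustration of how a clustering technique extracts groups from unlabeled data, here is a minimal k-means sketch.  K-means is my choice of a simple, widely known cluster analysis method for illustration.  The two customer features are hypothetical and assumed to be already scaled, and the farthest-point initialization is a simplification chosen to keep the example deterministic.

```python
import math
import random

def kmeans(points, k, iters=20):
    """Plain k-means on a list of coordinate tuples; returns (labels, centroids)."""
    # Deterministic farthest-point initialization (a simplifying assumption).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Synthetic, unlabeled customers: (scaled visit frequency, scaled basket value).
rng = random.Random(7)
occasional = [(rng.gauss(1, 0.5), rng.gauss(1, 0.5)) for _ in range(30)]
frequent = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(30)]
points = occasional + frequent

labels, centroids = kmeans(points, k=2)
print("cluster sizes:", [labels.count(j) for j in range(2)])
print("centroids:", [tuple(round(c, 2) for c in centroid) for centroid in centroids])
```

The labels returned here are exactly the kind of automatically generated classification labels described above.  They can then serve as the starting segmentation for the supervised techniques in the next section.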

2.  Supervised Learning: profiling segmented groups

Once a training set has been labeled (e.g. customers who have or have not purchased in the last year have been identified within particular customer groups, or distinct groups of customers interested in a product have been segmented), supervised learning techniques can be applied.  At this point, there is an existing notion of how to segment the population (e.g. customers) and the analyst would like to implement some type of automatic procedure or operational approach (e.g. automatic fraud risk assessment, marketing messaging).

Supervised techniques learn a pre-defined answer based on the segmented groups (e.g. fraud / non-fraud customers, buying / non-buying sales prospects) and provide a method for new instances to be assessed based on the ‘trained’ algorithm, structure, or facility.
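The idea of assessing a new instance against labeled groups can be sketched with a tiny k-nearest-neighbors classifier.  The training rows, the two feature values, and the "fraud" / "non-fraud" labels below are all made up for illustration.

```python
import math
from collections import Counter

def knn_predict(training_set, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled neighbors."""
    nearest = sorted(training_set, key=lambda row: math.dist(row[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training set: ((feature_1, feature_2), label).
training_set = [
    ((1.0, 1.0), "non-fraud"),
    ((1.2, 0.8), "non-fraud"),
    ((0.9, 1.3), "non-fraud"),
    ((5.0, 5.2), "fraud"),
    ((5.3, 4.9), "fraud"),
    ((4.8, 5.1), "fraud"),
]

print(knn_predict(training_set, (1.1, 1.0)))  # lands among the non-fraud cases
print(knn_predict(training_set, (5.1, 5.0)))  # lands among the fraud cases
```

Here the "trained" facility is nothing more than the stored labeled examples.  Other supervised techniques compress the same information into coefficients, trees, or network weights.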

A.  Classical Supervised Learning

CLASSICAL methods are useful in determining how a classification is made, explaining how the model is composed, or determining what influencers can be centrally attributed to a category (e.g. factors which predispose fraudulent behavior).  Such techniques are useful for gaining a better understanding of the specific causal and correlative factors and, via this understanding, for guiding decision making related to future phenomena.  Thus, this type of evaluation has both explanatory and predictive power, allowing for prescriptive operationalization as well as progressive targeting and micro-segmentation.

•  Regression analysis 

o  Logistic regression

o  Multiple linear regression

•  Structural equation modeling

•  K-Nearest Neighbors (K-NN)

•  Recency, Frequency, Monetary (RFM) (customer value)

•  LDA (Linear Discriminant Analysis)

•  Decision Trees (DT)

o  Chi-squared Automatic Interaction Detection (CHAID)

o  Exhaustive CHAID

o  Random forests

o  Boosted trees / gradient boosting

o  C&RT / CART (Classification and Regression Trees)

o  J48 / C4.5

o  QUEST / Supervised learning in quest (SLIQ)

•  Naïve Bayes classifier
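To illustrate the explanatory side of the classical methods, the sketch below fits a logistic regression by plain gradient descent on synthetic data and then reads the fitted coefficients.  The two features are hypothetical: by construction the first one drives the label and the second is pure noise.  In practice a statistics package would be used, so treat this hand-rolled fit as a teaching sketch only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the average gradient of the log-loss.
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic labeled data (an assumption): only the first feature matters.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(400)]
y = [1 if x1 + rng.gauss(0, 0.5) > 0 else 0 for x1, _ in X]

w, b = fit_logistic(X, y)
predictions = [
    1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5 else 0 for xi in X
]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"weights: {[round(wj, 2) for wj in w]}, bias: {b:.2f}, accuracy: {accuracy:.2f}")
```

The large weight on the first feature and the near-zero weight on the second are what make this kind of model explanatory as well as predictive: the coefficients point to which factors are associated with the category.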

B.  Advanced Supervised Learning

ADVANCED methods allow for automation when the analyst is not interested in explanatory logic, just operationalizing prediction: a prescriptive solution.  Such approaches allow for automated procedures such as real-time online customer approval or real-time flow control on an assembly line.

•  Neural networks

•  Support Vector Machine (Linear and Kernel) / Support Vector Networks

•  Ensembles / Ensemble Learning

Bringing It All Together:  Continual Refinement

Taken together, this range of unsupervised and supervised techniques ideally iterates in a cyclical fashion to refine progressive understanding and to optimize actions.   A segmentation strategy should evolve over time and incorporate feedback from earlier cycles, continually assessing new instances to modify the model.  The segmentation results can then be used to refine profiling and action.

While an initial cycle may be forced to rely on unsupervised learning, plan (if possible) to track the outcomes (e.g. the behavior / reaction of customers or the result of an error reduction technique).  You can then use that data to generate a subsequent prediction concerning the likelihood of a positive response (in an iterative fashion).  On a future sample, you can generate a predicted likelihood of success for each segment.

For segments with low likelihood, you can pilot a different approach (changing the message, channel, incentive, etc.) and measure whether the observed response exceeds the expected level.  By following this process over multiple cycles, an “optimized” strategy (e.g. marketing approach, fraud reduction approach) should emerge in which each segment is targeted with the type of treatment (e.g. marketing communication, credit approval) most likely to yield positive results for the ever-refined segments.
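One cycle of this loop might look like the following sketch.  The segment names, response counts, the 10% cutoff for a "low likelihood" segment, and the pilot results are all invented for illustration.  The binomial tail probability is one simple way, among others, to check whether a pilot beat the expected level by more than chance.

```python
import math

def response_rates(history):
    """Expected response rate per segment from (responded, contacted) counts."""
    return {seg: responded / contacted for seg, (responded, contacted) in history.items()}

def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical results tracked from an earlier cycle: (responded, contacted).
history = {
    "segment_a": (40, 200),
    "segment_b": (6, 200),
    "segment_c": (30, 150),
}
expected = response_rates(history)

# Segments below an assumed 10% cutoff become candidates for a new treatment.
low_segments = [seg for seg, rate in expected.items() if rate < 0.10]

# Hypothetical pilot of a different message on segment_b.
pilot_responded, pilot_contacted = 18, 150
pilot_rate = pilot_responded / pilot_contacted
p_chance = binomial_tail(pilot_responded, pilot_contacted, expected["segment_b"])

print("low-likelihood segments:", low_segments)
print(f"expected {expected['segment_b']:.2%}, pilot observed {pilot_rate:.2%}")
print(f"probability of a pilot this good by chance: {p_chance:.2e}")
```

If the pilot clears the expected level convincingly, the new treatment becomes the baseline for that segment in the next cycle, and the tracked responses feed the next round of prediction.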

About SARK7

Scott Allen Mongeau (SARK7) is an INFORMS Certified Analytics Professional (CAP) and a Data Scientist in the Cybersecurity business unit at SAS Institute. Scott has over 20 years of experience in project-focused analytics functions in a range of industries, including IT, biotech, pharma, materials, insurance, law enforcement, financial services, and start-ups. Scott is a part-time PhD (ABD) researcher at Nyenrode Business University. He holds a Global Executive MBA (OneMBA) and Masters in Financial Management from Erasmus Rotterdam School of Management (RSM). He has a Certificate in Finance from University of California at Berkeley Extension, a MA in Communication from the University of Texas at Austin, and a Graduate Degree (GD) in Applied Information Systems Management from the Royal Melbourne Institute of Technology (RMIT). He holds a BPhil from Miami University of Ohio. Having lived and worked in a number of countries, Scott is a dual American (native) and Dutch citizen. He may be contacted at:

All posts are copyright © 2015 SARK7. All external materials utilized imply no ownership rights and are presented purely for educational purposes.
