Tell me… how ugly is your bad data?

August 14, 2014

Best practices, Management, Methods

The adage ‘garbage-in-garbage-out’ is an analytics mantra so ingrained it has its own shorthand: GIGO. Yet, in the mad, blind rush toward all things ‘big data’, there is the danger of sidelining the crucial-but-dreary topic of data quality, to which GIGO refers.

While data quality is not as ‘sexy’ as big data, anyone who wants to work with big data or fancies themselves a data scientist will, without explicit forethought, quickly run smack into a ‘big bad data wall’. The discipline of Master Data Management (MDM) can help quell the pain – knowing the basics can help you avoid a world of ‘big hurt’!

Bad Data

You saw the terrifying movie, now buy the book!

Jurney, R. 2014. Agile Data Science. O’Reilly.

Tell me… how ugly is your bad data?

While fast-evolving tools and techniques allow us to massage and manage sloppy data, when the rubber meets the road, bad data at best poses fundamental challenges to an analytics inquiry. At worst, bad data results in misleading insights, which spawn poor, even destructive, decisions. Such perverse results can even remain hidden – decision flaws in waiting – until disaster strikes.

A key point to assimilate, internalize, and imbibe is that data quality is only partially a technical problem. The scourge of bad data encompasses and often finds its very origins in organizational, as opposed to technical, challenges. At a fundamental level, data quality is thus an organizational challenge: one of governance, aligned incentives, proper processes, and even culture.

Business analytics is itself an organizational process: framing problems that can be addressed with data analysis, which in turn leads to insights that drive value-creating decisions.

  • Business analytics (beyond data science) is a process which addresses the coupling of problem framing, data management, and analytics to assure decision quality.
  • Want to know more? Online class on the business analytics process (developed by SARK7 for Accenture Academy)

Bad data thus encompasses situations where poor problem framing (a broken business analytics process) and breakdowns in organizational decision culture perpetuate poor analytics, such as the well-known case of the 1986 Challenger space shuttle disaster.

Excuse me, would you care for a big, steaming heap of bad data?

For those just getting started with analytics, it is often a shock how much time is spent on gathering, cleaning, sorting, and preparing data for analysis. Often there are several rounds of data cleaning as an analytics model evolves, leading to a rinse, wash, spin, repeat cycle.

Many analytics projects follow a classic Pareto-style 80/20 split between ‘data cleansing’ and actual analytics (indeed, I have had 95/5 projects). Much of this time involves gathering, combining, re-formatting, sorting, compacting, ‘munging’ (or wrangling), and attempting to structure and make sense of data which is often in a messy, low-quality condition.

A recent New York Times article, ‘For Big-Data Scientists, ‘Janitor Work’ is a Key Hurdle to Insights’, attested to this increasing trend. This was reinforced in a recent article in the INFORMS Analytics Magazine, ‘Students, Professionals Need ‘Data Wrangling’ Skills’. The observation is that increasingly large and diverse multi-format datasets demand more intensive approaches to managing and transforming data (see recent post on data management / engineering as a growing trend requiring specialty skills).

But what happens when the data is fundamentally flawed and analysis is thus compromised? Sometimes the mission is hopeless! What happens when there are seven product databases and multiple departments disagree on key aspects such as ‘base price’? What happens when a large circle of security databases update each other in an endless, mechanical chain such that ex-employees keep being restored for systems access (as happened on one past project of mine, at a company which shall remain nameless)?

The truth is, almost all businesses struggle perpetually with fundamental issues of data quality. Typical businesses have a hodgepodge of data sources (spreadsheets, databases, unstructured documents, etc.) surrounding such key artifacts as ‘customer’ and ‘product’. These struggles are organizational problems more so than technical problems: breakdowns in data ownership and governance. Tools can help to improve processes, but basic organizational roles, agreements, and incentives need to be put in place to drive true change.

This is where MDM comes in.  MDM is a discipline which focuses on bringing organizational processes, governance, and systems together to improve data quality. A major objective is to establish a ‘single version of the truth’ in terms of data definitions. Where there are disagreements, for instance based on different professional domains, MDM brokers explicit definitions concerning the distinctions. Tools include metadata dictionaries and/or ontologies – formal descriptions of contextual and conceptual meaning within a domain.
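To make the ‘single version of the truth’ idea concrete, here is a minimal Python sketch that flags products whose ‘base price’ differs between source systems. The system names, product IDs, and prices are invented for illustration; a real MDM program would put agreed definitions and ownership rules around a check like this rather than rely on a script alone.

```python
from collections import defaultdict

# Hypothetical records: (source system, product id, base price).
records = [
    ("erp", "P-100", 19.99),
    ("crm", "P-100", 19.99),
    ("webshop", "P-100", 24.50),
    ("erp", "P-200", 5.00),
    ("crm", "P-200", 5.00),
]

def find_conflicts(rows):
    """Return {product_id: {source: price}} for products whose sources disagree."""
    by_product = defaultdict(dict)
    for source, product_id, price in rows:
        by_product[product_id][source] = price
    return {
        pid: prices
        for pid, prices in by_product.items()
        if len(set(prices.values())) > 1
    }

conflicts = find_conflicts(records)
for pid, prices in conflicts.items():
    print(pid, prices)
```

The script only surfaces the disagreement. Deciding which price is ‘the’ base price remains a governance question for the business.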

But… I’ll just dump it into a big data store and worry about it later!

One suggested promise of big data is ‘collect all the bad data and clean it later’. While Hadoop and other mass-storage approaches make this increasingly technically feasible, the ‘clean later’ part does not, as a result, go away. ‘Clean later’, as in “I’ll clean my house / do my homework / pay my taxes next week”, runs the danger of never happening or, worse, of dysfunctional data hoarding that leaves servers jam-packed with a mess of crud!

The emerging big data processing ‘stack’ implies that data will be ‘cleaned’ and presented for analytics as part of a structured process:

Flow of data processing (from Jurney, R. 2014. Agile Data Science. O'Reilly)

An example technical ‘stack’ here would be (also from Jurney’s Agile Data Science):

Avro -> IMPA -> Hadoop -> Pig -> MongoDB -> Lightweight web framework -> D3

This is all great! This is an engineering solution to storing, extracting, transforming, and presenting large sets of data. However, if we wish to perform data analytics, the use of powerful technology does not issue a ‘get out of jail free’ pass.

The assumption is that somewhere in the ‘middle part’, magic happens whereby reasonable sense is made of the massive set of data such that there is integrity in the business analytics process. At a minimum, this encompasses a set of organizational procedures:

  • proper problem framing (strategic and tactical governance; stakeholder alignment),
  • validation of data quality (which assumes a link to MDM),
  • proper data selection and sampling (proper data analytics methods applied),
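  • model building (proper data analytics methods applied),
  • model testing (proper data analytics methods applied),
  • proper interpretation (proper data analytics methods applied), and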
  • clear communication of results (organizational stakeholder communications).

In the context of the big data ‘stack’, such orchestration assumes that technology tools, processes, methods, and organizational stakeholders are aligned. An MDM program and a clear business analytics process assure that quality and risks are formally addressed.
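The ‘validation of data quality’ step in the list above can be sketched as a small set of explicit rules run before any modeling. The Python below is a rough illustration only: the field names, the age range, and the sample rows are assumptions, and in practice the rules would come from the agreed data definitions.

```python
# Hypothetical customer records; field names and limits are illustrative assumptions.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
    {"id": 1, "email": "a@example.com", "age": 34},
]

def validate(rows):
    """Return a list of (row_index, problem) tuples for rows breaking a rule."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Rule 1: email must be present and non-empty.
        if not row.get("email"):
            problems.append((i, "missing email"))
        # Rule 2: age must be present and within a plausible range.
        age = row.get("age")
        if age is None or not 0 <= age <= 120:
            problems.append((i, "age out of range"))
        # Rule 3: ids must be unique across the dataset.
        if row["id"] in seen_ids:
            problems.append((i, "duplicate id"))
        seen_ids.add(row["id"])
    return problems

issues = validate(rows)
for index, problem in issues:
    print(f"row {index}: {problem}")
```

Keeping the rules explicit and reported, rather than silently ‘fixing’ rows, gives the organization something concrete to discuss and own.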

A particularly troublesome challenge concerns properly confronting the methodological issues raised by large sets of data (both large sample sets as well as large ranges of variables). There is a pernicious myth that massive and broad sets of data issue some type of methodological omnipotence. This is not the case: large datasets are particularly subject to issues regarding model overfitting and variance.

The flip side is that tight, targeted models are susceptible to bias (they work in many cases, but potentially overgeneralize). These are two sides of the same coin in modeling: ideally a data scientist seeks a sweet spot between the two, but there is always a compromise one way (high variance) or the other (high bias).
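The trade-off can be seen in a toy experiment. The Python sketch below uses a k-nearest-neighbour average as a stand-in model (my choice for illustration, not a method discussed in the sources above) on synthetic data drawn from an assumed true signal, sin(x), plus noise. With k=1 the model memorizes the training data (high variance); with k equal to the full training set it predicts the overall mean everywhere (high bias).

```python
import math
import random

def make_data(n, rng):
    """Noisy samples from a hypothetical true signal sin(x) on [0, 6]."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model on the given data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(60, rng)
test = make_data(200, rng)

results = {}
for k in (1, 7, 60):
    results[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

The k=1 model scores a perfect zero error on its own training data yet does worse on fresh data, while the k=60 model is poor on both. A moderate k typically lands nearer the sweet spot, though where that spot lies depends on the data.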

A recent article in Science, ‘The Parable of Google Flu: Traps in Big Data Analysis’, on problems with the Google Flu Trends platform, goes into some detail on this topic. Issues of mistaking correlation for causation also abound. Big data sets produce multiple models, many of which may involve spurious, context-specific, or phantom correlations.
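Phantom correlations are easy to manufacture. As a minimal sketch (pure random numbers, with sample and variable counts chosen arbitrarily), the Python below generates a random ‘target’ and a thousand unrelated random variables, then reports the strongest correlation found among them.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(42)
n_samples, n_variables = 20, 1000

# Everything here is independent noise: no variable truly relates to the target.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
candidates = [
    [rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_variables)
]

best = max(abs(pearson(c, target)) for c in candidates)
print(f"strongest |correlation| among {n_variables} random variables: {best:.2f}")
```

With this many candidate variables and so few observations, at least one of them will look convincingly related to the target purely by chance. This is why a wide dataset needs out-of-sample testing before anyone reads meaning into a correlation.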

The details of such methodological issues are still being debated. Part of the issue is that machine learning involves a paradigm shift from statistical methods. Principles for validating and testing machine learning methods are still being developed and socialized. This means that extra vigilance needs to be applied when attempting to assert causal conclusions from machine learning-derived insights, especially when relying on computer-built or heavily computer-guided models focused on correlation (as opposed to models rigorously tested for causal indications via traditional tests for statistical significance).

In conclusion, big data is not a panacea. Technology is ineffective without proper processes and organizational application. As well, there are methodological issues associated with large sets of data which must be confronted explicitly.

Do you suffer from bad data? I recommend pursuing an MDM program and implementing an end-to-end business analytics decision process. A Hadoop implementation alone will not lead to effective big data analytics…

  • Want to learn more? A short presentation on methodological challenges associated with Big Data analytics by Scott Mongeau (presented to Erasmus Rotterdam School of Management): https://www.youtube.com/watch?v=UPsJx427rKE

DO YOU HAVE A BAD DATA STORY?  Leave a comment below…

About SARK7

Scott Allen Mongeau (SARK7) is an INFORMS Certified Analytics Professional (CAP) and a Data Scientist in the Cybersecurity business unit at SAS Institute. Scott has over 20 years of experience in project-focused analytics functions in a range of industries, including IT, biotech, pharma, materials, insurance, law enforcement, financial services, and start-ups. Scott is a part-time PhD (ABD) researcher at Nyenrode Business University. He holds a Global Executive MBA (OneMBA) and Masters in Financial Management from Erasmus Rotterdam School of Management (RSM). He has a Certificate in Finance from University of California at Berkeley Extension, an MA in Communication from the University of Texas at Austin, and a Graduate Degree (GD) in Applied Information Systems Management from the Royal Melbourne Institute of Technology (RMIT). He holds a BPhil from Miami University of Ohio. Having lived and worked in a number of countries, Scott is a dual American (native) and Dutch citizen. He may be contacted at: webmaster@sark7.com.

All posts are copyright © 2015 SARK7. All external materials utilized imply no ownership rights and are presented purely for educational purposes.
