In my work and university lecturing, I speak with many senior leaders, experienced professionals, young professionals, and students, many of whom are interested in hiring, transitioning into, or becoming a ‘data scientist’. The title, however, is quite contentious. Why is this? Essentially, there is a great deal of confusion surrounding this new field, and it will take some time for organizations, as well as the labor market, to sort the hype from the core value proposition.
A number of established professionals are skeptical about the need for ‘data scientists’, feeling it is all hype and that fancy methods do not trump experience-guided intuition. Others, particularly recruiters and forward-thinking managers, are overexcited, desperate to hire for and establish ‘data science’ programs in their organizations. Meanwhile, there has been an understandable flood of opportunists with relatively little experience who have been quick to brand themselves as expert ‘data scientists’. These ‘new entrants’ often claim expertise based on relatively weak formal education and experience, which serves to discredit and blur the burgeoning field. Alongside this, many motivated and hardworking students and young professionals are interested in moving into the profession, but are confused about what a ‘data scientist’ is and does, and how to properly train themselves.
The Gartner Technology Hype Curve: it’s not just a good idea, it’s the law! 😉
The field, being quite new, is distorted by imperfect information in the labor markets. In the rush to hire, there is a lack of consensus amongst companies concerning the base prerequisites for being a ‘data scientist’. This allows many to claim the title on weak grounds, e.g. having a basic grasp of R and a few Coursera classes on machine learning. Hiring managers, themselves unclear on the nature of this new field, do not always establish a firm baseline for their expectations or the nature of the work duties, and the uncertainty widens the pool of entrants.
Certifications and focused degree programs are now emerging (e.g. the INFORMS CAP certification and business analytics MA programs), but the field is relatively new, meaning there is still a Wild West aspect to hiring, recruiting, and the pool of interested entrants, all of which leads to oversupply and depresses wages. Higher uncertainty => less established standards for entrants => a larger supply of available labor => increased competition => greater negotiating power for firms => lower wages. This dampens the incentive for trained and qualified professionals, many with multiple advanced degrees and years of experience (e.g. BI + computer science + statistics + programming + data engineering, etc.), to pursue ‘data scientist’ roles.
At the moment, a 22-year-old can plausibly claim the title of practicing ‘data scientist’ based on, for example, a BA in Economics, a six-month internship as a ‘data analytics lead’ at an investment bank, a few example R or Python scripts on GitHub, and a few Coursera courses. It is much like San Francisco in 1998, when one only had to have created a few HTML pages and be able to sit in front of a computer for extended periods to get hired as a ‘web development lead’.
Beyond this, and a gripe of many who worked in ‘analytics’ before it became branded and hyped, is that the term ‘data scientist’ is highly contentious: even if we forgo the implication that the claimant holds a PhD (STEM PhD = scientist = formal academic certification signifying one is proficient in independently framing and leading research initiatives based on accepted research methodologies), it raises the question of the nature of the institution hosting the ‘data science’ position.
If the firm or institution is not hosting and funding pure research, that is, formally protecting a group of researchers from financial and political pressures so they can conduct the often laborious, slow, and expensive stages of scientific inquiry (often with no immediate promise of tangible results), then issuing someone the ‘scientist’ title creates a false expectation.
The reality is that most data scientists land in a commercial role where they are quickly subsumed by issues of basic data quality (80-90% of time spent on data munging/wrangling); by political struggles which influence not only the interpretation of results but which bias problem framing and structuring from the outset; and, most worryingly, by extreme time pressures which hobble the application of recognized scientific approaches: no time for A/B testing, no time to properly contextualize and frame data types, no leverage to invest effort in identifying the key significant variables, no appetite to bother with framing hypotheses (throw all or many of the variables in and operate simply on demonstrated correlation, a recipe for model overfitting, a la the initial Google Flu Trends), and demand for instant results which prohibits trialing several algorithmic approaches and comparing like to like with proper diagnostics, confusion matrices, etc.
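The overfitting trap mentioned above (throw every variable in and trust demonstrated correlation) is easy to demonstrate. Here is a minimal sketch using NumPy and purely synthetic noise data; the sample sizes and feature counts are illustrative assumptions, not drawn from any real project. Fitting 40 pure-noise predictors to 50 pure-noise observations produces an impressive in-sample fit that evaporates on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 observations, 40 candidate predictors that are pure noise.
n_train, n_test, n_features = 50, 50, 40
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = rng.normal(size=n_train)  # the target is also noise: there is nothing to learn
y_test = rng.normal(size=n_test)

def r_squared(X, y, coefs):
    """Fraction of variance in y explained by the linear model X @ coefs."""
    resid = y - X @ coefs
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Ordinary least squares using every available variable: no hypothesis, no selection.
coefs, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

print(f"in-sample R^2:     {r_squared(X_train, y_train, coefs):.2f}")  # deceptively high
print(f"out-of-sample R^2: {r_squared(X_test, y_test, coefs):.2f}")    # near zero or worse
```

With 40 free parameters chasing 50 data points, the in-sample R² is large by construction, while the out-of-sample R² reveals that the "model" found nothing, which is essentially what happened to the initial Google Flu Trends model at scale.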
This situation is not the fault of the ‘data scientist’: even for a PhD statistician with mad programming and machine learning skills, if the host institution is not structured to support the basic process of scientific inquiry, is it fair to create the false expectation that this person will be able to perform as a ‘scientist’ in any way, shape, or form? If someone is hired as a ‘firefighter’, yet is prohibited from approaching any type of flame or burning structure due to safety and liability concerns, is this not a misappellation?
Lastly, even if the firm does support a pure research effort, there is the challenge of linking the resulting insights and discoveries to decisions and commercial initiatives. Many companies have rushed to staff up on data scientists without considering their ‘organizational decision architecture’. If data science will automate away the decision rights of several idle managers, it is understandable that some of those middle managers will act to sabotage, discredit, or otherwise subvert the data science program.
Likewise, the field being new, those who must ostensibly cooperate with ‘data scientists’ to make data-driven decisions do not yet have a clear context for facilitating data-focused decision making. Is an innovation valuable if management and other key stakeholders do not act to incorporate or exploit it? The classic lesson is Xerox PARC, a storied and talented group of R&D ‘scientists’ who established the platform for modern computing. The problem was that Xerox management itself did not properly understand or commercialize the key innovations. Instead, in a one-hour visit, Steve Jobs saw the potential of the GUI, the light went on, and the rest is history. One can imagine ‘data scientists’ identifying an unexpected and valuable insight, but if they fail to convince key stakeholders in the firm, the insight will not result in value-creating decisions.
In summary, although data science is a promising new field, it is still the Wild West out there. Distorted views concerning the nature of the field and its role in organizations create uncertainty, which leads to an oversupply of titled claimants to the expertise, which leads to confusion in recruiting and hiring, which leads to too many candidates, which leads to depressed wages.
In my view, it will take at least five years for the newly emerging focused academic programs to produce fully qualified data analytics specialists and for certification programs to gain currency. However, it will take at least a decade for organizations to properly design the ‘data scientist’ role (if the dicey title sticks) so that it meshes with decision architectures and organizational role architectures. Everything works on incentives, and currently only a bleeding-edge minority of companies have properly incentivized their middle and senior management to ‘play nice’ with data scientists, by demonstrating that aggregate economic profit improves as a result.
To put it in terms of labor markets and firms: the Prisoner’s Dilemma (middle manager vs. data scientist in a game of ‘I’m right based on experience and intuition’ vs. ‘No, I’m right based on structured diagnostics and evidence-based reasoning’) must be resolved with demonstrated economic optimality (an equilibrium, whether dominant-strategy or Nash, that is also Pareto-efficient). The bleeding-edge firms have demonstrated that everybody wins when employees apply scientific principles to data-driven decision making. The rest will likely need to evolve slowly in that direction, which is a slow, generational, human-timescale problem.
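The incentive problem can be made concrete with a toy payoff matrix. A minimal sketch in Python, with hypothetical payoffs chosen to form a classic Prisoner's Dilemma (the numbers and strategy labels are illustrative, not empirical): brute-force checking each strategy profile shows that mutual defection is the only Nash equilibrium, even though mutual cooperation pays both players more. This is exactly why firms must change the payoffs (the incentives) rather than merely exhort cooperation.

```python
# Hypothetical payoffs (manager_payoff, scientist_payoff) for each strategy profile.
# "cooperate" = share decision rights and follow the evidence; "defect" = guard turf.
payoffs = {
    ("cooperate", "cooperate"): (3, 3),  # evidence-based decisions, everybody wins
    ("cooperate", "defect"):    (0, 4),  # manager defers, scientist grabs credit
    ("defect",    "cooperate"): (4, 0),  # manager overrides the analysis
    ("defect",    "defect"):    (1, 1),  # turf war: intuition vs. diagnostics stalemate
}

strategies = ["cooperate", "defect"]

def is_nash(manager, scientist):
    """A profile is a Nash equilibrium if neither player gains by deviating alone."""
    m_payoff, s_payoff = payoffs[(manager, scientist)]
    no_manager_gain = all(payoffs[(alt, scientist)][0] <= m_payoff for alt in strategies)
    no_scientist_gain = all(payoffs[(manager, alt)][1] <= s_payoff for alt in strategies)
    return no_manager_gain and no_scientist_gain

equilibria = [(m, s) for m in strategies for s in strategies if is_nash(m, s)]
print(equilibria)  # mutual defection, despite (3, 3) being Pareto-superior for both
```

Under these payoffs the equilibrium is ("defect", "defect"): each side sticks to ‘I’m right’ because unilaterally cooperating makes them worse off. A firm that changes the payoff structure, rewarding jointly evidence-based decisions, shifts the equilibrium to mutual cooperation.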