Cybersecurity Data Science (CSDS): How Not to Drown in Your Cyber Data Lake!

Future Shock: Growing Vulnerabilities and Liabilities

Cybersecurity data science (CSDS) brings hope to organizations challenged by evolving cyber threats. A rapidly developing field, CSDS utilizes advanced analytics to address security gaps in an increasingly data-driven, interconnected world.

The consequences of ignoring security challenges are rising. According to the Cisco 2018 Annual Cybersecurity Report, over half of cyber-attacks resulted in damages greater than US$500k, with nearly a fifth costing more than US$2.5M. Meanwhile, regulators seeking to spur heightened oversight have become more aggressive in levying fines and holding corporate boards accountable.

Cybersecurity Data Science in a Nutshell

CSDS offers a path forward for organizations besieged by unknown-unknowns. The discipline unites a range of analytical methods to achieve detection and prevention goals. When operationalized, the result is an end-to-end process orchestrating people, methods, and technologies.

Cybersecurity data science (CSDS) drives value through:

  • Aligning data engineering objectives
  • Refining fast and big data into ‘smart data’
  • Orchestrating a cyclical process of discovery and detection
  • Facilitating the development of analytical models for pattern extraction and event detection
  • Leveraging data analytics tools and methods to produce targeted, evidence-based alerts
  • Routing focused incidents to the right resources at the right time for rapid review and remediation

Chasing Phantoms: Lurking Unknown-Unknowns 

Cybersecurity monitoring efforts are increasingly overwhelmed by false alerts. This challenge is exacerbated by limited monitoring and remediation resources. A recent report by Cybersecurity Ventures predicts 3.5 million unfilled cybersecurity job openings by 2021.

Status quo rule-based approaches to surfacing security events are deluged by data overload and data fragmentation from distributed sources. The lack of an integrated view results in confusion concerning unknown-unknowns: phantom patterns issuing from increasingly complex environments and disconnected data. The resulting confusion and complexity obscure threat indicators lurking in the shadows. The consequences are dire and growing: overstretched teams struggle to do more with less while exposure to persistent and sophisticated threats grows. Cybersecurity professionals struggle to find elusive needles in exponentially expanding data haystacks.

A New Hope: Cybersecurity Data Science (CSDS)

Data science drives the application of advanced analytics methods to yield value-creating insights from data. A practitioner-driven discipline, data science combines methods from a variety of fields, including computer science, data engineering, statistics, machine learning, and operations research. Combining domain challenges with data science methods results in hybrid areas, CSDS being a significant and growing example.

Cybersecurity is a broad, established professional domain addressing a range of topics associated with safeguarding network and computer infrastructure and devices. Sub-domains focus on solutions engineering, data protection, safeguarding access, network & device monitoring, incident response & handling, forensics, penetration testing / ethical hacking, and rapidly emerging focus areas such as wireless, mobile, and internet of things (IoT) security.

Data, Data Everywhere and Not a Drop to Drink!  

As false alerts plague monitoring efforts, cybersecurity professionals are challenged to separate signal from noise. Similarly, while organizations are overwhelmed by growing volumes of cyber data, security monitoring solutions struggle to extract focused alerts. Data science addresses these twin gaps by bringing to bear a range of techniques to refine data into focused and effective alerts.

Lacking properly prepared data and context, cybersecurity analytics efforts are highly constrained from the outset. Before trends can be extrapolated and predictions made, data engineering and selection must be undertaken. This involves exploring data to determine key features, preprocessing to impose structure, integrating sources, and establishing pipes – jobs or routines to streamline the movement of data from raw-and-distributed to structured-and-integrated.
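The preprocessing, integration, and pipe steps described above can be illustrated with a minimal sketch. The log line format, the field names, and the `device_metadata` lookup are assumptions made for illustration; real sources will differ and usually need far more robust parsing.

```python
import re
from datetime import datetime

# Hypothetical raw format: "<ISO timestamp> <host> <user> <LOGIN_OK|LOGIN_FAIL>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<user>\S+)\s+(?P<event>LOGIN_OK|LOGIN_FAIL)$"
)

def parse_line(line):
    """Impose structure on one raw line; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.fromisoformat(record["ts"])
    return record

def integrate(records, device_metadata):
    """Join parsed events with device metadata keyed by host name."""
    for record in records:
        meta = device_metadata.get(record["host"], {})
        yield {**record, "device_type": meta.get("device_type", "unknown")}

def run_pipeline(raw_lines, device_metadata):
    """A simple 'pipe': raw lines in, structured and integrated records out."""
    parsed = (parse_line(line) for line in raw_lines)
    valid = (record for record in parsed if record is not None)
    return list(integrate(valid, device_metadata))

# Illustrative inputs only.
raw_lines = [
    "2018-05-01T09:00:00 ws-01 alice LOGIN_OK",
    "corrupted line without structure",
    "2018-05-01T09:00:05 srv-07 bob LOGIN_FAIL",
]
device_metadata = {"ws-01": {"device_type": "workstation"}}
events = run_pipeline(raw_lines, device_metadata)
```

In this sketch unparseable lines are silently dropped; in practice it is usually worth counting them, since a rising rejection rate is itself a data quality signal.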

Data sources for cyber analytics are prolific and voluminous, including log files, network traffic & packets, authentication and proxy records, device configuration, SIEM and monitoring data, device telemetry, threat feeds, domain lookup, and user and device metadata.

The diversity of cyber data sources, many of them unstructured, requires focused data engineering and feature selection to aggregate and transform sources into an integrated picture. CSDS directly supports the refinement, linking, and selection of efficacious master datasets, turning big data into ‘smart data’.
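As a rough illustration of turning event-level data into a smaller set of measures, the sketch below aggregates login events into one feature row per user. The event fields and the three features chosen (`event_count`, `distinct_hosts`, `failure_rate`) are hypothetical examples, not a recommended feature set.

```python
from collections import defaultdict

def build_user_features(events):
    """Aggregate raw events into one row of features per user."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["user"]].append(event)

    features = {}
    for user, user_events in grouped.items():
        failures = sum(1 for e in user_events if e["event"] == "LOGIN_FAIL")
        features[user] = {
            "event_count": len(user_events),
            "distinct_hosts": len({e["host"] for e in user_events}),
            "failure_rate": failures / len(user_events),
        }
    return features

# Illustrative events only.
sample_events = [
    {"user": "alice", "host": "ws-01", "event": "LOGIN_OK"},
    {"user": "alice", "host": "ws-01", "event": "LOGIN_OK"},
    {"user": "bob", "host": "srv-07", "event": "LOGIN_FAIL"},
    {"user": "bob", "host": "srv-09", "event": "LOGIN_FAIL"},
    {"user": "bob", "host": "ws-02", "event": "LOGIN_OK"},
    {"user": "bob", "host": "srv-07", "event": "LOGIN_FAIL"},
]
user_features = build_user_features(sample_events)
```

Six raw events reduce to two rows here; at production volumes the same idea reduces millions of log lines to a table that analysts and models can actually work with.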

Advanced Analytics: Walk Before You Run

Advanced analytics methods applied in CSDS range from exploratory methods – for instance unsupervised machine learning, diagnostic statistics, and time-series analysis – to predictive techniques – such as forecasting and supervised machine learning. Specialized techniques such as text analytics, network graph analysis, and probabilistic process benchmarking are applied for advanced challenges.

A fundamental CSDS best practice is to use basic exploratory techniques to develop a clearer picture of what is ‘normal’ in the environment, that is, to better understand natural groups and dynamics so that anomalies are easier to spot. This implies that straightforward descriptive and diagnostic techniques are applied before jumping to predictive machine learning and advanced methods.

Utilizing combinations of fundamental analytical methods, CSDS provides focused value by improving organizational understandings of security infrastructure and dynamics. Analytical techniques support the identification of statistical baselines – an understanding of what is expected for a given environment and set of entities, be they users or devices. Pattern detection algorithms and diagnostics bring clarity and definition to complex environments.
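A statistical baseline of this kind can be sketched with nothing more than a mean and standard deviation per entity. The example below is one simple option among many, assuming daily event counts per user and a z-score threshold of 3; both the data and the threshold are illustrative assumptions.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarize each entity's historical daily counts as (mean, std dev)."""
    return {
        entity: (mean(counts), stdev(counts))
        for entity, counts in history.items()
        if len(counts) >= 2  # stdev needs at least two observations
    }

def find_anomalies(baseline, today, threshold=3.0):
    """Return entities whose count today deviates strongly from their baseline."""
    anomalies = {}
    for entity, observed in today.items():
        if entity not in baseline:
            continue  # no baseline yet; a candidate for exploration instead
        mu, sigma = baseline[entity]
        if sigma == 0:
            continue  # constant history; a z-score is undefined
        z = (observed - mu) / sigma
        if abs(z) > threshold:
            anomalies[entity] = round(z, 2)
    return anomalies

# Illustrative daily login counts per user.
history = {
    "alice": [20, 22, 19, 21, 20, 23, 21],
    "bob": [5, 6, 4, 5, 7, 5, 6],
}
today = {"alice": 22, "bob": 40}
baseline = build_baseline(history)
anomalies = find_anomalies(baseline, today)
```

With these numbers only `bob` is flagged: 40 logins is far outside a history that hovers around 5, whereas 22 is ordinary for `alice`. A single global threshold would miss this, which is why the baseline is kept per entity.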

CSDS as-a-Process: The Explore and Detect Cycle

A frequent misstep in cybersecurity analytics initiatives is failing to distinguish processes for exploration from those for detection. The former process, exploration and discovery, is used to identify new detection methods, and involves larger and richer datasets to spot emerging exploits and threats. The latter, detection automation, focuses on operational detection, and by nature utilizes a highly refined set of data and methods for monitoring.

The processes taken together, exploring for new insights and detection monitoring, operate as a virtuous lifecycle. However, from an operational standpoint, it is a costly misstep to attempt to ‘boil the ocean’ by storing all the data, all the time. Operational effectiveness and cost control depend on knowing which data can be forgotten and which must be operationalized, while allowing for the model to change based upon new insights.
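One way to make the "forget versus operationalize" decision explicit is to encode it as configuration. The sketch below assumes a hypothetical list of detection fields and a hypothetical 30-day retention window for raw exploration data; both values are placeholders chosen for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical settings, chosen for illustration only.
DETECTION_FIELDS = ("ts", "user", "host", "event")
EXPLORATION_RETENTION = timedelta(days=30)

def to_detection_record(event):
    """Keep only the refined fields that operational detection relies on."""
    return {field: event[field] for field in DETECTION_FIELDS}

def prune_exploration_store(store, now):
    """Forget raw events older than the exploration retention window."""
    cutoff = now - EXPLORATION_RETENTION
    return [event for event in store if event["ts"] >= cutoff]

now = datetime(2018, 6, 1)
raw_events = [
    {"ts": datetime(2018, 4, 1), "user": "alice", "host": "ws-01",
     "event": "LOGIN_OK", "user_agent": "curl/7.58", "bytes_out": 120},
    {"ts": datetime(2018, 5, 25), "user": "bob", "host": "srv-07",
     "event": "LOGIN_FAIL", "user_agent": "python-requests", "bytes_out": 98000},
]

# Exploration keeps rich records, but only for a limited time.
exploration_store = prune_exploration_store(raw_events, now)
# Detection keeps every event, but only the refined fields.
detection_feed = [to_detection_record(e) for e in raw_events]
```

When exploration shows that a field such as `bytes_out` carries signal, adding it to `DETECTION_FIELDS` is how the operational model changes in response to new insights.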

Many companies now struggle with immense and costly cyber data lakes filled with unstructured, untreated data that yields no analytics insights or value. Explicitly distinguishing the exploration and discovery process from the detection process implies distinct data and models for each challenge.

A unified approach combines iterative analytical methods to refine big data into smart data, and from there delivers targeted, efficacious alerts to the right resources at the right time.
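The final step, delivering alerts to the right resources, could be sketched as a simple routing table. The categories, team queue names, and the 0.8 score cut-off below are all hypothetical; they stand in for whatever an organization's own triage rules specify.

```python
# Hypothetical routing table: alert category -> responsible team queue.
ROUTES = {
    "credential_abuse": "identity-team",
    "malware": "incident-response",
    "data_exfiltration": "incident-response",
}
DEFAULT_QUEUE = "soc-triage"

def route_alerts(alerts, min_score=0.8):
    """Drop low-confidence alerts, then queue the rest by category, highest score first."""
    queues = {}
    for alert in sorted(alerts, key=lambda a: a["score"], reverse=True):
        if alert["score"] < min_score:
            continue
        queue = ROUTES.get(alert["category"], DEFAULT_QUEUE)
        queues.setdefault(queue, []).append(alert["id"])
    return queues

# Illustrative alerts only.
alerts = [
    {"id": "A1", "category": "malware", "score": 0.95},
    {"id": "A2", "category": "credential_abuse", "score": 0.85},
    {"id": "A3", "category": "malware", "score": 0.40},
    {"id": "A4", "category": "policy_violation", "score": 0.90},
    {"id": "A5", "category": "data_exfiltration", "score": 0.99},
]
queues = route_alerts(alerts)
```

Unmapped categories fall through to a default triage queue rather than being discarded, so a new alert type is still seen by someone.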

Together, Better: Take Away CSDS Best Practices

Having summarized why CSDS is a new hope for cybersecurity and outlined its central approaches, we can distill a number of focused best practices:

1. The New Normal: Using analytics methods to build a picture of ‘normal’ for your environment is the first step towards focused detection.

2. Garbage In, Garbage Out: Data quality is often overlooked, but small investments in refining data using analytics have outsized benefits in operational efficiency and effectiveness.

3. Walk Before You Run: Focus on basic descriptive and statistical diagnostic techniques before jumping to ‘fancier’ predictive machine learning and AI approaches.

4. From Big Data to Smart Data: Feature engineering supports the refinement and reduction of exhaustive cybersecurity datasets into highly refined measures for monitoring.

5. Insight as a Process: There is a range of CSDS methods available; focusing on implementing an end-to-end process from raw data to insights will help to structure engineering efforts.

6. Segment Your Goals: Formally distinguish data and methods used for continuing exploration from those used for operational detection.

7. Knowing the Right Questions: Use a continual exploration approach to refine the questions being asked; operationalize the cycle in a refined exploration-detection process.

Do You Know the Way? Next Steps for Your Organization

Organizations should carefully plan out a set of objectives for a cybersecurity analytics initiative. Planning should focus on defining a process for applying cybersecurity data science (CSDS) to systematically refine big data into smart data. The goal is an operationalized process that reduces data into an efficacious set of features, which in turn produce focused alerts. Emerging threats are monitored continuously, but separately, in a parallel exploratory process.

Dumping cybersecurity data into a massive cyber data lake and attaching machine learning algorithms on top is not only insufficient; it is guaranteed to be a costly misstep. It is important to keep one’s bearings by focusing on the operational goals of each refined step in the CSDS process. At a high level, the operating process should facilitate the role-based interactions of data engineers, data scientists, cyber investigators, and incident response professionals.

While most large organizations have focused experts in the areas of big data, data science, and cybersecurity, it is beneficial to speak with experienced, focused CSDS professionals who have implemented cutting-edge solutions at the junction of these three domains. Going it alone or attempting to reinvent the wheel can be a costly and time-consuming misstep.

As you advance your cybersecurity analytics initiative, attending to CSDS best practices will help you stay on track in terms of goals, costs, and value realized. Considering analytics tools and solutions to address gaps in your processes is a rapid, cost-effective way to get started. SAS CSDS experts are available to provide feedback and guidance to ensure your initiative is a success.


Scott has three decades of experience designing and deploying data analytics solutions in a range of industries. Active globally, Scott is based near The Hague, Netherlands. He is currently completing a book on cybersecurity data science organizational best practices.

Scott Allen Mongeau, Cybersecurity Data Scientist, SAS Institute

mobile: +31 (0)6 8370 3097



International Institute for Analytics. (2016). “Stronger Cybersecurity Starts with Data Management.”

Cisco Systems Inc. (2018). Cisco 2018 Annual Cybersecurity Report. San Jose, CA, USA: Cisco Systems Inc.

Kirchhoff, C., Upton, D., and Winnefeld, Jr., Admiral J. A. (2015, October 7). “Defending Your Networks.” Harvard Business Review.

Morgan, Steve. (2018). “Cybersecurity Jobs Report 2018-2021.”

Ponemon Institute. (2017). “When Seconds Count: How Security Analytics Improves Cybersecurity Defenses.”

SANS Institute. (2015). “2015 Analytics and Intelligence Survey.”

SANS Institute. (2016). “Using Analytics to Predict Future Attacks and Breaches.”

SAS Institute. (2018). “Managing the Analytical Life Cycle for Decisions at Scale.”

SAS Institute. (2017). “SAS Cybersecurity: Counter cyberattacks with your information advantage.”

Security Brief Magazine. (2016). “Analyze This! Who’s Implementing Security Analytics Now?”

UBM. (2016). “Dark Reading: Close the Detection Deficit with Security Analytics.”


About SARK7

Scott Allen Mongeau (@SARK7), an INFORMS Certified Analytics Professional (CAP), is a researcher, lecturer, and consulting Data Scientist. Scott has over 30 years of project-focused experience in data analytics across a range of industries, including IT, biotech, pharma, materials, insurance, law enforcement, financial services, and start-ups. Scott is a part-time lecturer and PhD (abd) researcher at Nyenrode Business University on the topic of data science. He holds a Global Executive MBA (OneMBA) and Masters in Financial Management from Erasmus Rotterdam School of Management (RSM). He has a Certificate in Finance from University of California at Berkeley Extension, an MA in Communication from the University of Texas at Austin, and a Graduate Degree (GD) in Applied Information Systems Management from the Royal Melbourne Institute of Technology (RMIT). He holds a BPhil from Miami University of Ohio. Having lived and worked in a number of countries, Scott is a dual American and Dutch citizen. He may be contacted on Twitter at @sark7.

All posts are copyright © 2020 SARK7. All external materials utilized imply no ownership rights and are presented purely for educational purposes.
