CrowdStrike’s Approach to Better Machine Learning Evaluation Using Strategic Data Splitting

August 11, 2025 | Josh Sun

“Leakage” in machine learning (ML) occurs when data that an ML model should not learn on is included at training time, often in unexpected ways.

This can cause overconfidence in ML model training results, producing cybersecurity ML models that fail to recognize threats. CrowdStrike data scientists employ strategic data splitting during ML model training to prevent data leakage. Since day one, CrowdStrike's mission has been to stop breaches. Our pioneering AI-native approach quickly set our platform apart from the landscape of legacy cybersecurity vendors that were heavily reliant on reactive, signature-based approaches for threat detection and response.

Our use of patented models across the CrowdStrike Falcon® sensor and in the cloud enables us to quickly and proactively detect threats — even unknown or zero-day threats. This requires accurate threat prediction by the CrowdStrike Falcon® platform. To achieve this critical requirement, CrowdStrike data scientists think carefully about how we train and evaluate our ML models. We train our models on datasets containing millions of cybersecurity events.

These events can be structured in certain ways; they can have dependencies or similarities to one another. For example, we might collect multiple data points for a single malicious process tree, and those data points will be closely related to one another, or we might collect malicious scripts that are extremely similar. Because of the kinds of relationships present in cybersecurity data, our domain requires us to carefully consider the ML concepts of train-test leakage and data splitting.

When observations are not independent of one another, the data should be split in a way that does not cause overconfidence. Otherwise, we might think our model handles malicious processes very well, even though it fails to recognize new threats when they appear. In this post, we explain why CrowdStrike data scientists adopt strategic data splitting when training our ML models. Employing the strategic data splitting approach discussed below helps prevent train-test leakage in datasets with interdependent observations.

This helps ensure more reliable model performance against novel threats in the wild.

From Random to Strategic Data Splitting

One tenet of ML is to split the data into train, validation, and test sets, or to perform cross-validation, where the data are partitioned into multiple iterations of training and testing. The model learns from the training data and is then evaluated on the validation/testing data.

This allows us to have a reasonable expectation of real-world performance and to select a winner from competing models. A common statistical assumption is that observations in the data are independent. However, in real-world scenarios, data points often relate to each other. If we split dependent data without accounting for that structure, we get train-test leakage — the training set contains information about the test set that it should not be expected to have.

When correlated observations are mixed randomly into both the train and test sets, the model’s training data is dependent on the testing data in a way that may not be realistic for the model in production. Therefore, real-world performance for the ML model may not match what was seen in testing. As an analogy for train-test leakage, imagine you’re evaluating a student in a class with a final test. In order to prepare them for the test, you give them a set of practice questions — the training set.

If the practice questions are too closely related to the actual test questions (for example, only changing a few words in the question), the student might ace the test just by memorizing the practice questions. The student may have performed well, but we are overconfident in how much the student has learned because information leaked from the training set to the actual test — giving us an inflated view of their true knowledge.
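To make the analogy concrete, the sketch below (toy data, not CrowdStrike's pipeline or dataset) builds near-duplicate observations and compares a purely random split against a split that keeps all copies of an observation on one side. The random split typically reports a much higher AUC, for the same reason the student aces a test built from memorized practice questions.

```python
# A minimal sketch on toy data (not CrowdStrike's pipeline or dataset):
# near-duplicate observations mimic correlated events, e.g., multiple data
# points collected from one process tree.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)

# 200 "base" observations, each repeated 5 times with tiny feature noise.
base_X = rng.normal(size=(200, 10))
base_y = (base_X[:, 0] + rng.normal(scale=2.0, size=200) > 0).astype(int)
X = np.repeat(base_X, 5, axis=0) + rng.normal(scale=0.01, size=(1000, 10))
y = np.repeat(base_y, 5)
groups = np.repeat(np.arange(200), 5)  # which base observation each row came from

def holdout_auc(train_idx, test_idx):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    return roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])

# Random split: near-duplicates of test rows leak into the training set,
# so the model can "memorize the practice questions."
tr, te = train_test_split(np.arange(len(y)), test_size=0.2, random_state=0)
print(f"random split AUC:  {holdout_auc(tr, te):.3f}")

# Grouped split: all copies of a base observation stay on one side.
tr, te = next(GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0).split(X, y, groups))
print(f"grouped split AUC: {holdout_auc(tr, te):.3f}")
```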

Looking at this issue in a real-world scenario from the physical sciences, these kinds of dependency structures are often found in ecological data, as noted by Roberts et al. (2016). In this domain, it is common to observe autocorrelation in space or time, or dependency among observations from the same individuals or groups. For example, if data points are spatially autocorrelated (related by location), traditional random train-test splits can lead to misleading results.

This happens because nearby locations share similar features, like climate, which can leak information between training and test sets. Therefore, a random split may inflate the performance estimate of the model. At prediction time, we may get data from an entirely new region, and the random splitting method will have led to overoptimism and overfitting. This problem is not limited to one domain. Kapoor and Narayanan (2023) describe how different types of data leakage have contributed to a reproducibility crisis in ML-based science, spanning over 290 papers in 17 fields, due to the overoptimism that leakage produces.

Different modeling strategies for nonindependent data are possible — such as linear mixed models or time-series approaches — but many performant predictive models, including tree-based ensembles and neural networks, may not be designed to account for these dependency structures. We should then turn toward a more careful approach to data splitting. The Roberts et al. study recommends splitting the data into blocks, where each block groups together dependent data at some level.

Each block is then assigned to a cross-validation fold. In the ecological data example, grouping nearby locations together as one block prevents data leakage and gives more accurate model performance estimates. There are also trade-offs here. It is possible that blocking the data may limit what is seen in predictor space — the possible feature values — which can decrease the model’s predictive capability.
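As a minimal illustration of this blocking idea, the sketch below (toy data) uses scikit-learn's GroupKFold, which the experiments later in this post also reference, to show that no block ever straddles the train and test indices of a fold.

```python
# A minimal sketch (toy data) of blocked cross-validation with scikit-learn's
# GroupKFold: every observation in a block lands in exactly one test fold.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(12, 1)                          # 12 observations
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 4, 4])  # 5 blocks

for fold, (train_idx, test_idx) in enumerate(GroupKFold(n_splits=3).split(X, groups=groups)):
    print(f"fold {fold}: test blocks {sorted(set(groups[test_idx]))}, "
          f"train blocks {sorted(set(groups[train_idx]))}")
# No block ever appears in both the train and test indices of a fold, so the
# model is never evaluated on data correlated with what it trained on.
```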

Some experiments below illustrate these concepts.

CrowdStrike’s Solution to Data Leakage

One approach CrowdStrike takes to stop breaches is applying ML to detect malicious processes by their behaviors. However, observations from a process are correlated with other observations from that process — and with other processes from its process genealogy and machine of origin. We experimented with “blocking” by machine.

Figure 1. A machine has many different processes, and each process is part of its own genealogy. Processes in a machine are not independent, so we consider each machine a “block.”

Our experiments consisted of training tree-based ML models for binary classification. We used an experimental dataset containing observations of process behaviors, with each observation labeled as either malicious or non-malicious.

There was fair label balance. Eighty percent of the data was used to run 1) blocked cross-validation (see scikit-learn's GroupKFold) and 2) random cross-validation, each with five folds. The remaining 20% of the data was held out in a way we theorized to be realistic for the prediction context — across new blocks and with data later in time. We trained a final model on the original 80% of the data and evaluated it over the remaining 20% to get a realistic performance estimate.
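The sketch below is a simplified reconstruction of this procedure on synthetic data; the dataset, features, and model here are illustrative assumptions, not the production setup. It uses a grouped 20% holdout, five-fold blocked and random cross-validation on the remaining 80%, and a final model scored on the holdout.

```python
# A simplified sketch (synthetic data, assumed model) of blocked vs. random
# five-fold cross-validation plus a grouped 20% "realistic" holdout.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold, GroupShuffleSplit, KFold, cross_val_score

rng = np.random.default_rng(1)
n_machines, per_machine = 100, 20
machine_effect = rng.normal(size=(n_machines, 5))          # shared per-machine signal
X = np.repeat(machine_effect, per_machine, axis=0) + rng.normal(scale=0.3, size=(2000, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
groups = np.repeat(np.arange(n_machines), per_machine)     # machine = block

# Hold out 20% of machines (new blocks) as the realistic test set.
dev_idx, test_idx = next(GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
                         .split(X, y, groups))

clf = GradientBoostingClassifier(random_state=1)
auc_blocked = cross_val_score(clf, X[dev_idx], y[dev_idx], scoring="roc_auc",
                              cv=GroupKFold(n_splits=5), groups=groups[dev_idx])
auc_random = cross_val_score(clf, X[dev_idx], y[dev_idx], scoring="roc_auc",
                             cv=KFold(n_splits=5, shuffle=True, random_state=1))

# Train a final model on the full 80% and score it on the held-out machines.
final = clf.fit(X[dev_idx], y[dev_idx])
realistic = roc_auc_score(y[test_idx], final.predict_proba(X[test_idx])[:, 1])
print(f"blocked CV AUC: {auc_blocked.mean():.3f}")
print(f"random CV AUC:  {auc_random.mean():.3f}")
print(f"realistic AUC:  {realistic:.3f}")
```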

We used AUC (area under the ROC curve) as a performance metric, where a higher AUC is better.

Figure 2. AUC across five cross-validation folds plotted for the two split strategies. A final model was trained over all of the cross-validation data, tested on a more realistic test set, and shown as “Realistic AUC.” Points are jittered for clarity.

A few conclusions are apparent based on our results:

- A purely random partition strategy overestimates performance
- Blocked cross-validation better estimates realistic performance
- Extrapolation across blocks is difficult

Our findings show how a blocked cross-validation approach can illuminate the structure of the data.

If we used a random split, we would be overoptimistic about our ML model. Along with overoptimism, there is also the potential of overfitting to the data. One method to avoid overfitting in some iterative ML models is early stopping, which attempts to halt model training before the point of overfitting. A validation-based early stopping rule halts the training process once performance stops improving on a validation set.

It is clear that leakage into the validation set can cause problems. With that in mind, we trained two iterative boosting models on our data with early stopping: once the validation loss failed to improve on its minimum for 20 rounds, training ceased. Eighty percent of the data was used for training and 10% for validation, split either randomly or systematically as in the cross-validation procedure. Otherwise, the models were identical.
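The post does not name the exact boosting library, so the sketch below uses XGBoost as a stand-in on synthetic grouped data, with a 20-round early stopping patience and, for simplicity, a 90/10 train/validation split. Only the validation split strategy differs between the two runs.

```python
# A minimal sketch of validation-based early stopping (synthetic data;
# XGBoost is an assumed stand-in, not necessarily the library used here).
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(2)
groups = np.repeat(np.arange(100), 20)  # 100 machines ("blocks")
X = np.repeat(rng.normal(size=(100, 5)), 20, axis=0) + rng.normal(scale=0.3, size=(2000, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

def best_iteration(train_idx, val_idx):
    """Train with a 20-round early stopping patience; report where training stopped."""
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dval = xgb.DMatrix(X[val_idx], label=y[val_idx])
    booster = xgb.train(
        {"objective": "binary:logistic", "eval_metric": "logloss"},
        dtrain,
        num_boost_round=1000,
        evals=[(dval, "val")],
        early_stopping_rounds=20,
        verbose_eval=False,
    )
    return booster.best_iteration

# Random validation split: correlated rows leak into validation, so the
# validation loss can keep "improving" and training may run far longer.
tr, val = train_test_split(np.arange(len(y)), test_size=0.1, random_state=2)
print("random split best iteration: ", best_iteration(tr, val))

# Blocked validation split: whole machines go to validation.
tr, val = next(GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=2).split(X, y, groups))
print("blocked split best iteration:", best_iteration(tr, val))
```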

Figure 3. Logistic loss for the validation set plotted over boosting iteration for the two split strategies. A lower loss is better. The blocked split stops early at iteration 198.

We observed the model trained with a random split did not stop early before 1,000 rounds. This suggests the randomly split validation set is correlated with the training set and, therefore, loss continues improving across iterations.

It is possible the model trained this way was overfit to the training data. However, the randomly split model performed better on the final 10% of the data — a realistic test set — with AUC 0.966 vs. 0.948, so we might ultimately use a random split. One possibility for the better performance of the random split model is that blocking can limit what we see in predictor space while training, leading to worse overall performance.

Perhaps the blocks are too dissimilar, and using a blocked split has actually underfit the model. These trade-offs should be weighed against our goal: catching threats.

Building Better Machine Learning Threat Predictions

In typical ML workflows, data scientists gather and clean data, conduct exploratory analyses, consider what approaches might work, and finally train and evaluate models.

Each of these steps requires careful consideration of the underlying data. In particular, practitioners should be careful about their data partitioning and evaluation strategies. At CrowdStrike, we continuously and carefully evaluate our models in order to understand them and pick the best ones. By making our analyses rigorous, we get better threat predictions.

Additional Resources

- Learn how you can stop cloud breaches with CrowdStrike unified cloud security for multi-cloud and hybrid environments — all in one lightweight platform.

- Read about adversaries tracked by CrowdStrike in the CrowdStrike 2025 Global Threat Report.
- Read about the types of insider threats and how to detect them here.
- Test CrowdStrike next-generation antivirus for yourself: start your free trial of CrowdStrike® Falcon Prevent™ today.
