EMBER2024: Advancing the Training of Cybersecurity ML Models Against Evasive Malware

CrowdStrike Blog T2 clear — 2636 words ORIGINAL

Classification

SEV 6/10

September 03, 2025
Phil Roth

CrowdStrike data scientists are members of a team of cybersecurity researchers that recently released EMBER2024, an update to EMBER, the popular open source malware benchmark dataset originally released in 2018.

CONFIDENCE 56%

Categories

malware, vulnerability, cloud_security

Threat Actors

Conti, Play

Target Sectors

finance, education, government


The EMBER2024 dataset includes metadata, labels, and calculated features for over 3.2 million files from six different file formats. It provides data scientists conducting cybersecurity research with an extensive, modern dataset to support the training and evaluation of machine learning models for malware detection, including a collection of advanced malware that has demonstrated its ability to evade antivirus products.

An academic paper, EMBER2024: A Benchmark Dataset for Holistic Evaluation of Malware Classifiers, details this new dataset and was presented at the SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2025) in Toronto in August 2025. The paper also includes 14 benchmark models trained on different subsets of the data and on varying classification tasks.

There are many barriers to releasing public datasets in the cybersecurity field, including preserving customer privacy and hiding defender capabilities from attackers. Because of this, CrowdStrike researchers were excited for the opportunity to help update this very popular dataset. In this post, researchers can learn more about what this dataset provides and the new research enabled by it.

Original EMBER Dataset (2018): An Influential Resource for Malware Classification

The original EMBER dataset was a labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable (PE) files.

Released in 2018, it was accompanied by an academic paper co-authored by a CrowdStrike data scientist who is part of the EMBER2024 team. The paper was subsequently updated the following year. The goal of EMBER was to invigorate research in the field of malware classification, just as other benchmark datasets had done for image classification. It has helped to significantly advance malware detection in cybersecurity products, including the CrowdStrike Falcon® Platform.

As of this writing, the paper has been cited in academic research over 700 times since its original publication in 2018, reflecting just how influential EMBER has been in the field of ML training for cybersecurity. Researchers have used EMBER to measure how quickly malware classifiers degrade over time, to explore adversarial machine learning attacks and defenses, and to serve as a basis for educational projects.

Last year, CrowdStrike researchers augmented the data with tags and leaf similarity information to create EMBERSim, an effort to make it easier to build binary code similarity techniques using benign data. EMBER2024 builds on the innovative and influential original, delivering a leap forward in capability.

EMBER2024: Updated to Help Train the Next Generation of Cybersecurity ML Researchers

With an ongoing industry shift to ML-based malware detection, the importance of innovative tools like EMBER has only increased.

A team of researchers from multiple organizations, including a member of the CrowdStrike Data Science team who co-created the original EMBER dataset, recently undertook the project of updating and improving EMBER. They had ambitious plans to expand and extend the original dataset in many different ways, ending up with more than 3.2 million files from six file formats. Figure 1 shows how many of each file type are included in EMBER2024.

The dataset features seven different types of labels and tags that support training classifiers on seven common tasks, including malicious/benign detection, malware family classification, and malware behavior identification. Source code is included that will allow researchers to replicate the feature calculation, model training, and file collection techniques used to construct the dataset. A supplemental release also includes the raw bytes and disassembly for 16.3 million functions from malicious files identified and compiled by the FLARE team’s capa tool.

Figure 1. File type stats for the EMBER2024 dataset

File Type   Train       Test      Challenge   Total
Win32       1,560,000   360,000   3,225       1,923,225
Win64         520,000   120,000     814         640,814
.NET          260,000    60,000     805         320,805
APK           208,000    48,000     256         256,256
PDF            52,000    12,000     805          64,805
ELF            26,000     6,000     386          32,386

CrowdStrike’s contribution to the project consisted of updating the original feature calculation code to make it easier to use.

EMBER2018 features require version 0.9.0 of the LIEF library. Updating this library results in features that may not be equivalent to what’s calculated with 0.9.0. But LIEF 0.9.0 requires Python 3.6, which is now quite out of date and unsupported. One of EMBER’s main use cases is teaching students how to work with machine learning in cybersecurity, and this very old dependency was just introducing them to the pain of Python packaging and versioning instead.

To solve this problem, the feature calculation code was updated to use the most recent version of the pefile library instead of LIEF. Because pefile is pure Python, a single pinned version of pefile is more likely to remain installable on future versions of Python as they’re released. Future versions of pefile are also unlikely to introduce breaking changes to the calculated features, so pinning a specific pefile version can be delayed as long as possible.

While making this change, the repository also switched to more modern Python tooling (polars, uv, etc.). In addition to the dependency update, EMBER2024 features now include information about a file’s Rich header and Authenticode, as well as any warnings that the pefile module emits while attempting to parse the PE file. Figure 2 shows the categories of features that are calculated along with examples of all of the metadata included.
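To give a sense of what Rich header information looks like, here is a minimal standard-library sketch that decodes the Rich header of a PE image. It is not the EMBER2024 feature code, which relies on pefile; the function name `decode_rich_header` and the output field names are illustrative assumptions rather than the dataset's actual schema.

```python
import struct

DANS = 0x536E6144  # b"DanS" read as a little-endian 32-bit integer


def decode_rich_header(data: bytes):
    """Return a list of Rich header entries, or None if none is found.

    Illustrative only: field names are assumptions, not EMBER2024's schema.
    """
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None  # not a DOS/PE image
    # e_lfanew (offset 0x3C) points at the PE signature; the Rich header,
    # when present, sits between the DOS header and that offset.
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    region = data[:min(e_lfanew, len(data))]
    end = region.find(b"Rich")
    if end == -1 or end + 8 > len(region):
        return None
    # The 32-bit XOR key follows the "Rich" marker.
    key = struct.unpack_from("<I", region, end + 4)[0]
    # The header starts with "DanS" XORed with the key.
    start = region.find(struct.pack("<I", DANS ^ key))
    if start == -1 or start + 16 > end or (end - start - 16) % 8:
        return None
    entries = []
    # Skip "DanS" plus three padding dwords, then read (comp_id, count) pairs.
    for off in range(start + 16, end, 8):
        comp_id, count = (v ^ key for v in struct.unpack_from("<II", region, off))
        entries.append({
            "product_id": comp_id >> 16,   # toolchain component identifier
            "build": comp_id & 0xFFFF,     # build number of that component
            "count": count,                # how many objects it produced
        })
    return entries
```

Each entry records which build-tool component contributed to the binary and how often, which is why this header can be a useful static signal for a classifier.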

A full description of all changes to the feature calculation can be found in the paper and the source code.

Figure 2. Categories of features calculated and examples of metadata included in EMBER2024

Beyond the updated features, two other aspects of the new dataset warrant mentioning: the inclusion of a challenge set and infrastructure code.

Adding a Challenge Set to Improve ML Training for Commercial Cybersecurity Solutions

The challenge set includes 6,315 files that were not initially detected as malicious by any of the AV products in VirusTotal.

Files are included in the challenge set if they are later (after 30 days) detected as malicious by enough AV products to meet EMBER2024’s definition of malicious. In order to gather as many of these samples as possible, they are saved throughout the train and test time periods of the dataset. They are then set aside from the train and test sets, and models are evaluated on them separately. Figure 1 shows the relative size of the challenge set among the various file types collected.
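The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the EMBER2024 collection code: the `ScanReport` type, its field names, and the `MIN_MALICIOUS_DETECTIONS` threshold are hypothetical, since this post does not state how many detections EMBER2024 requires before a file counts as malicious.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical threshold: the post does not state how many AV detections
# EMBER2024 requires before a file is labeled malicious.
MIN_MALICIOUS_DETECTIONS = 5
RESCAN_DELAY = timedelta(days=30)  # "later (after 30 days)" in the post


@dataclass
class ScanReport:
    """One scan of a file (illustrative stand-in for a VirusTotal report)."""
    scanned_at: datetime
    detections: int  # number of AV products flagging the file


def is_challenge_file(first: ScanReport, later: ScanReport) -> bool:
    """True if the file evaded every AV product at first, then was caught."""
    initially_missed = first.detections == 0
    rescanned_late_enough = later.scanned_at - first.scanned_at > RESCAN_DELAY
    later_malicious = later.detections >= MIN_MALICIOUS_DETECTIONS
    return initially_missed and rescanned_late_enough and later_malicious
```

For example, a file with zero detections on first submission and twelve detections 45 days later would qualify under these assumed settings, while a file flagged by even one product at first submission would not.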

One of the drawbacks of the original EMBER dataset is that it’s too “easy” to classify. The benchmark model from the very first release achieved a ROC AUC score of 0.99911 on the test set. This made it very difficult for researchers to publicly demonstrate that their novel techniques for classification would perform better. The dataset wasn’t large enough to reflect the difficulties of training and shipping a real commercial AV solution.
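To make the headroom problem concrete, ROC AUC can be read as the fraction of (malicious, benign) pairs in which the malicious file receives the higher score. The sketch below computes it that way in plain Python on made-up toy data; it is for illustration only, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (malicious, benign) pairs ranked correctly.

    labels: 1 for malicious, 0 for benign; scores: higher means more malicious.
    Ties count as half a correct ranking. Quadratic time, for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malicious and one benign sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (invented): one benign file scores above one malicious file.
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.7, 0.6, 0.9]
print(roc_auc(labels, scores))
```

Under this reading, a score of 0.99911 means that fewer than 0.1% of malicious/benign pairs are ranked in the wrong order, so a genuinely better technique has very little room to show a measurable gain.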

The challenge sets take a step toward solving this problem by highlighting the very hardest files to classify. Checking signatures and using allowlists and blocklists with cloud lookups make it possible to identify “known bad” behavior. The promise of incorporating machine learning into your AV system is that you’ll be better able to identify malicious files that nobody has seen before. Most of the AV products used to generate EMBER2024 labels already use ML to attempt this.

And even then, the community of defenders sometimes fails. Creating a collection of those files that weren’t initially identified as malicious highlights where existing solutions are struggling and creates a metric that has room for improvement.

Infrastructure Code to Support Future Research

Another innovation in the EMBER2024 public release is that it includes the code used to construct the dataset itself.

This includes retrieving VirusTotal reports, labeling the files contained in those reports, and selecting a pre-set number of files from a certain time period while excluding near duplicates. This will allow researchers with access to VirusTotal to rebuild the dataset as the EMBER2024 process would construct it at some point in the future. Given enough resources, future projects could use this code to put together a much larger dataset that would enable larger models or studies about the evolution of benign and malicious software over time.
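The selection step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the released infrastructure code: `FileRecord`, `select_files`, and especially `similarity_key` are hypothetical, with `similarity_key` standing in for whatever near-duplicate detection the real pipeline applies.

```python
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileRecord:
    sha256: str
    first_seen: datetime
    similarity_key: str  # hypothetical near-duplicate bucket identifier


def select_files(records, start, end, quota, seed=0):
    """Pick up to `quota` files first seen in [start, end), one per bucket."""
    seen_keys = set()
    candidates = []
    for rec in sorted(records, key=lambda r: (r.first_seen, r.sha256)):
        if not (start <= rec.first_seen < end):
            continue  # outside the collection window
        if rec.similarity_key in seen_keys:
            continue  # near duplicate of a file already kept
        seen_keys.add(rec.similarity_key)
        candidates.append(rec)
    if len(candidates) <= quota:
        return candidates
    # Seeded sampling keeps the selection reproducible across runs.
    return random.Random(seed).sample(candidates, quota)
```

Running the same function over successive time windows is one way a future project could assemble comparable snapshots for studying model degradation.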

There’s no guarantee about the consistency of the files that get added to VirusTotal in any given time period, but there are still interesting questions about model degradation or other topics that can now be approached with this codebase.

EMBER2024 Exemplifies CrowdStrike’s Commitment to Research

The original EMBER dataset achieved its objective of boosting research in the field of malware classification, with hundreds of citations in academic research in the years since it was first published.

It has also been used to help teach the latest generations of cybersecurity researchers. Its popularity spawned related projects like EMBERSim and now EMBER2024. This effort, and the involvement of our data scientists, reflects CrowdStrike’s ongoing commitment to research in the cybersecurity industry. We believe when defenders collaborate and share knowledge, we collectively strengthen our position against the threat actors who benefit from operating in the shadows.

Open source initiatives like EMBER2024 represent the kind of industry-wide cooperation that helps to drive innovation and support continuous product improvement. Projects like these tilt the playing field toward defenders and ensure the AI-native CrowdStrike Falcon platform remains a leader in stopping breaches.

Additional Resources

CrowdStrike Falcon Wins AV-Comparatives Awards for EDR Detection and Mac Security
To learn what other industry analysts are saying about CrowdStrike, visit the Industry Recognition webpage.

Test CrowdStrike next-gen AV for yourself with a free trial of CrowdStrike Falcon® Prevent.


Extracted Entities (1)

CVEs

CVE-2026-20929

ID: 112 Lang: en Type: article