CSCE 689: Machine Learning-Based CyberDefenses

Special Topics, TAMU, 2023

In this course, we will navigate through the applications of ML in the security field: the pros, the cons, and the future yet to come.

What to expect

Lots of malware analysis stuff (50% of the course).
Discussions about ML limits in general.
You will be asked to map the concepts introduced to your own work.

Topics

Pitfalls of ML, dataset size, generalization, and more.
Malware detection, streams, concept drift.
Adversarial ML (attacks and defenses).
Biometrics, Authentication, and related applications.
Large Language Models, GPT-3, and other fun stuff.

Evaluation/Format

Seminar presentation + competition

The competition

Let’s make a small version of the MLSEC competition, with students playing together to create attacks and defenses.

Official Course Link

TAMU students can already enroll in CSCE 689 via the Howdy! system.

Course Progress

Topic 1.1 Machine Learning for Malware Detection
- Concepts:
  - Static vs. Dynamic analysis.
  - Endpoint vs. Cloud-based strategies.
  - Locality Sensitive Hashing (LSH).
  - The role of AV updates.
  - AV scans in the cloud.
  - Non-negative Neural Networks (NNs) against evasion.
- Outcomes (2023):
  - The student Sidharth Baveja created a blog on ML for security. Check it here
  - The student Sidharth Anil created a mind map for this paper. Check it here
- Outcomes (2024):
  - The student Ali Seyed created a blog on ML for security. Check it here
  - The student Ayushri Jain created a blog on ML for security. Check it here
  - The student Dylan Nguyen created a blog on ML for security. Check it here
Topic 1.2 Malware Detection on Highly Imbalanced Data through Sequence Modeling
- Concepts:
  - Traditional Machine Learning (ML) vs. Deep Learning (DL) algorithms.
  - Imbalance in endpoint files vs. Imbalance in malware repositories.
  - Malware detection as a Natural Language Processing (NLP) problem.
  - The need for having very low False Positives (FPs).
- Outcomes:
  - The student Soumyajyoti Dutta has been coding some examples on imbalanced data. Check it here
Topic 2.1 Machine Learning (In) Security: A Stream of Problems (1/2)
- Concepts:
  - Problem space vs. Feature space.
  - Data leakage and Temporal inconsistency.
  - Label flipping.
  - Under- and Over-sampling (temporally accurate).
  - Dataset size vs. Dataset diversity.
  - FPR as a target metric.
- Outcomes:
  - The student Soumyajyoti Dutta created a mind map for this paper. Check it here
Topic 2.2 Machine Learning (In) Security: A Stream of Problems (2/2)
- Concepts:
  - Concept drift vs. Concept evolution.
  - Cost-sensitive learning.
  - Ensembling and Bagging.
  - The RandomForest non-linearity and the AdaptiveRandomForest drift retraining.
  - One-class classifiers.
  - White-box and Black-box attacks.
  - Substitute model and offline attacks.
  - Gradient-descent attacks in images and malware classifiers.
  - Adversarial retraining.
  - What is a 0-day.
  - Penetration testing: Blue vs. Red teams.
- Outcomes:
  - The student Dutta created a discussion outline for this paper. Check it here
Topic 3.1 Dos and Don’ts of Machine Learning in Computer Security (1/2)
- Concepts:
  - Sampling bias.
  - AV labels shifting.
  - Base Rate Fallacy.
  - Regional datasets implications.
  - Threat Models.
  - Website fingerprinting.
  - Infection vectors: USB sticks and Autorun.
  - Stuxnet malware.
  - Hardware Trojans.
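
The Base Rate Fallacy listed among the concepts above can be made concrete with a short calculation. The sketch below applies Bayes' rule to a hypothetical detector; the detection rate, false positive rate, and malware prevalence are illustrative assumptions, not figures taken from the paper.

```python
def precision_from_rates(tpr: float, fpr: float, base_rate: float) -> float:
    """Return P(malware | alert) using Bayes' rule.

    tpr: fraction of malware samples that raise an alert.
    fpr: fraction of benign samples that raise an alert.
    base_rate: fraction of all samples that are malware.
    """
    true_alerts = tpr * base_rate
    false_alerts = fpr * (1.0 - base_rate)
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical scenario: 99% detection, 1% false positives,
# and only 1 in 1,000 files is actually malicious.
precision = precision_from_rates(tpr=0.99, fpr=0.01, base_rate=0.001)
print(f"Share of alerts that are real malware: {precision:.1%}")
```

Under these assumed numbers, roughly 9% of alerts correspond to real malware, even though the detector looks accurate when the rates are read in isolation.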
Topic 3.2 Dos and Don’ts of Machine Learning in Computer Security (2/2)
- Concepts:
  - Vulnerabilities vs. Malware.
  - Malware campaigns, variants, and attribution.
  - Reputation-based mechanisms: AV whitelisting.
  - Outlier detection: profile-based and abnormal behavior detection.
Topic 4.1 Fast & Furious: On the modelling of malware detection as an evolving data stream
- Concepts:
  - Feature extractor retraining.
  - Encodings: 1-hot encoding.
  - Embeddings: TF-IDF, Word2Vec, Graph embeddings.
  - Graph Representations: Call Graphs (CGs), Control Flow Graphs (CFGs), Data Dependency Graphs (DDGs).
  - Concept drift: vocabulary changes.
  - Software Repositories: Dynamic analysis in application marketplaces.
  - Android malware: SMS fraud.
  - Web malware: Clickjacking.
  - Evasion strategies: Dead code insertion.
  - Limits of static analysis: Opaque constants.
Topic 4.2 DroidEvolver: Self-Evolving Android Malware Detection System
- Concepts:
  - Pool of classifiers.
  - Confidence levels.
  - Pseudo-labels.
  - Model poisoning.
  - Backdooring ML models.
  - MLOps and DevSecOps.
  - AV telemetry.
Topic 5.1 Transcend: Detecting Concept Drift in Malware Classification Models
- Concepts:
  - Conformal Evaluation.
  - Statistics vs. Probabilities.
  - P-values.
  - Classifier credibility vs. Prediction confidence.
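
The P-values concept above can be illustrated with a minimal sketch of how a conformal p-value is commonly computed: the fraction of calibration samples whose nonconformity score is at least as large as the score of the test sample. The function name and the scores are hypothetical, and the choice of nonconformity measure is left open; some formulations also add 1 to both the numerator and the denominator.

```python
from typing import Sequence


def conformal_p_value(calibration_scores: Sequence[float], test_score: float) -> float:
    """Fraction of calibration scores that are >= the test score.

    A small value means the test sample looks unlike the calibration
    samples under the chosen nonconformity measure.
    """
    if not calibration_scores:
        raise ValueError("calibration_scores must not be empty")
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return at_least_as_strange / len(calibration_scores)


# Hypothetical nonconformity scores for samples of one class.
calibration = [0.1, 0.2, 0.3, 0.4]
print(conformal_p_value(calibration, 0.25))  # sample that fits reasonably well
print(conformal_p_value(calibration, 0.90))  # sample that looks like drift
```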
Topic 5.2 Transcending TRANSCEND: Revisiting Malware Classification in the Presence of Concept Drift
- Concepts:
  - Pipeline of detection strategies.
  - FPs in spam detection.
  - PDF malware, Word malware, and macros.
  - What is a file packer.
  - Evasion techniques: Anti-VM and Sandbox fingerprinting.
  - Attacker strategy: File format migration.
  - Server-side Polymorphism.
  - Attack opportunity window and AV response time.
  - Malware types: Botnets and Remote Access Trojans (RATs).
  - Command and Control (C&C or C2).
  - Infection vectors: 1-click vs. 0-click vulnerability exploits.
Topic 6.1 Shallow Security: on the Creation of Adversarial Variants to Evade Machine Learning-Based Malware Detectors
- Concepts:
  - Raw bytes vs. Feature-based models.
  - MalConv vs. Non-negative MalConv.
  - Adversarial Attacks: Data appending and Data hiding.
  - Packers: Compressors, Crypters, Droppers, and Injectors.
  - Malware Downloaders and URL obfuscation.
  - PE Loading: OS internals, PE headers, fake timestamps, and checksum recomputation.
  - ML biases: Detecting UPX as malware regardless of its content.
  - C&C strategies: domain fronting.
Topic 6.2 No Need to Teach New Tricks to Old Malware: Winning an Evasion Challenge with XOR-based Adversarial Samples
- Concepts:
  - Obfuscation strategies: XOR encoding and Base64 encoding.
  - Mimicry attacks: Dead imports and Dynamic library resolving.
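
The two encodings named above can be sketched in a few lines. This is a generic illustration on a harmless byte string, assuming a hypothetical single-byte key; it is not the transformation used in the paper.

```python
import base64


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with a repeating key (the same call decodes)."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


payload = b"example payload"   # harmless placeholder content
key = b"\x5a"                  # hypothetical single-byte key

encoded = xor_bytes(payload, key)        # byte values change, length does not
wrapped = base64.b64encode(encoded)      # printable ASCII representation
restored = xor_bytes(base64.b64decode(wrapped), key)

print(wrapped)
print(restored == payload)
```

Both steps are trivially reversible, which is why they change what a byte-level model sees without changing what the content is after decoding.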
Topic 7.1 Functionality-Preserving Black-Box Optimization of Adversarial Windows Malware
- Concepts:
  - Intrusion Detection: The need for minimizing the number of adversarial queries.
  - Adversarial Attacks: Hard-labels vs. Soft-labels.
  - ML Evasion as an optimization problem.
  - Genetic algorithms for minimizing complex formulas.
  - PE files: Entry points vs. Main functions.
  - PE files: Code caves, Slack spaces, Padding, and Patching.
  - Real-world attacks: Malware as a Service.
  - Threat Models: AVs assuming pristine OS installations.
Topic 7.2 Mal-LSGAN: An Effective Adversarial Malware Example Generation Model
- Concepts:
  - Generative Adversarial Networks (GANs).
  - Attack Transferability.
  - Image generation: “This person does not exist”.
  - Moving Target Defense (MTD).
Topic 8.1 EvadeDroid: A Practical Evasion Attack on Machine Learning for Black-box Android Malware Detection
- Concepts:
  - Android Reverse Engineering: Smali code.
  - Assembly vs. Bytecode: Compiled vs. Interpreted languages.
  - Attacks: Code gadgets.
  - ML Classification: Markov models and text generation.
  - Android models: GANs against permission-based classifiers.
  - Access control: The absence of native manifest files in Windows.
- Outcomes:
  - The student Brandon Gathright presented a small implementation of the paper's approach.
Topic 8.2 Adversarial Machine Learning in Image Classification: A Survey Toward the Defender’s Perspective
- Concepts:
  - Adversarial attacks in images: Invisible perturbations.
  - Physical adversarial attacks: Autonomous vehicles.
  - Explainable AI: Shapley values.
  - Optical Character Recognition (OCR)-based SQL Injection.
Topic 9.1 Pop Quiz! Can a Large Language Model Help With Reverse Engineering?
- Concepts:
  - Reverse Engineering: Concepts and applications.
  - GPT-3 Playground, ChatGPT, GitHub Copilot, and other LLM-based applications.
  - Davinci, Codex, and other models and their training sets.
  - One-shot vs. Few-shot learning.
  - Discussion on Large Models: Generalization or Overfitting?
  - Google Captcha and OCR.
  - Oracles in query-response systems.
  - Sentiment analysis in textual documents.
  - Usable Security: Pattern-based authentication.
  - Language-based modeling: Grammar-based fuzzing.
Topic 9.2 TrojanPuzzle: Covertly Poisoning Code-Suggestion Models
- Concepts:
  - Poisoning attacks against LLMs.
  - Discussion: LLMs in the real-world.
  - Sybil attacks.
  - Common Weakness Enumeration (CWE).
  - Popular Vulnerabilities: Path Traversal.
  - Downgrade Attacks: ECB-mode cryptography.
  - Supply Chain attacks: Open-source repository trojanization.
Topic 10.1 Examining Zero-Shot Vulnerability Repair with Large Language Models
- Concepts:
  - LLMs for bug fix: Should they be perfect or better than humans?
  - LLM’s temperature and the amount of randomness.
  - Discussion: Is LLM coding the same as LLM learning the code?
  - Hardware bugs: Glitches.
  - (Secure) Software Engineering: (Security) Regression Testing.
Topic 10.2 Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants
- Concepts:
  - Complexity metrics: Lines of Code (LoC), Bugs per Line, and others.
  - Bugs vs. Vulnerabilities.
  - Trusted Computing Base (TCB).
  - Vulnerability Discovery: Fuzzing.
  - Common Weakness Enumeration (CWE).
- Outcomes:
  - The student Amith Mattar coded an automatic code generation tool. (Source) (Page)
Topic 11.1 Online Binary Models are Promising for Distinguishing Temporally Consistent Computer Usage Profiles
- Concepts:
  - Authentication vs. Authorization.
  - Continuous Authentication.
  - Research Methodology: Online vs. Offline experiments.
  - Malware vs. Goodware: Keyloggers vs. Keystroke authenticators.
Topic 11.2 Passphrase and keystroke dynamics authentication: Usable security
- Concepts:
  - Keystroke Dynamics.
  - Secret Types: Passwords and Passphrases.
  - Password policies: Best practices and Usable Security.
  - Password Strength: Shannon Entropy.
  - Security Threats: Password leakage.
  - Password Breaking: Brute-force, dictionary, and rainbow tables.
  - Password Storage: Hashing and Salting.
  - Security Threats II: Typosquatting.
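
The "Password Strength: Shannon Entropy" concept above can be illustrated with a minimal sketch that measures the per-character Shannon entropy of a string from its own character frequencies. This is one simple reading of the concept, assumed here for illustration; real strength meters also account for dictionaries and common patterns.

```python
import math
from collections import Counter


def shannon_entropy_bits(text: str) -> float:
    """Per-character Shannon entropy (in bits) of the characters in text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical passwords, used only to compare character diversity.
for candidate in ["aaaaaaaa", "aabbaabb", "abcdefgh"]:
    print(candidate, shannon_entropy_bits(candidate))
```

A string made of one repeated character scores zero bits per character, while a string of eight distinct characters scores three, which reflects only character diversity and not how guessable the password is.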
Topic 12.1 Automatic Yara Rule Generation Using Biclustering
- Concepts:
  - Pattern Matching: YARA rules.
  - Taxonomy: 0-day, 1-day, N-day.
  - Efficient rule storage: Bloom filters.
  - Efficient rule matching: Aho-Corasick algorithm.
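
The Bloom filter concept above can be sketched as a small set-membership structure that never misses a stored item but may occasionally report an item that was never added. The class below is a minimal sketch with hypothetical sizes and a SHA-256-based hashing scheme chosen for simplicity; it is not the data structure from the paper or from any YARA implementation.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Probabilistic set: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple (not memory-optimal).
        self.bits = bytearray(size_bits)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive several positions by prefixing the item with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical byte patterns standing in for rule strings.
patterns = BloomFilter()
patterns.add(b"pattern-one")
print(patterns.might_contain(b"pattern-one"))
```

A negative answer is definitive, so a filter like this can cheaply rule out most candidates before a more expensive exact match is attempted.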
Topic 12.2 DeepSign: Deep learning for automatic malware signature generation and classification
- Concepts:
  - ML construction: Denoising AutoEncoder.
  - Malware Behaviors: Behavioral Signatures.
  - Behavior Masking: Distributed Threats.

Challenge

Results (2023)
- Defense
  - Syed Wall and Yasir Farrukh won the defense challenge!
- Attack
  - Veronika Maragulova, Sidharth Baveja, Sidharth Anil, and Soumyajyoti Dutta won the attack challenge!
  - See the winner report here
Results (2024)
- Defense
  - Sidharth Arivarasan, Sahil Salunkhe, and Ali Ayati won the defense challenge!
- Attack
  - Akshat Punjabi, Akshat Pandey, and Ayushri Jain won the attack challenge!

Public Detection Models (2023):
- Multiple Students
  - DockerHub: yasirali12/malwaredetector
  - DockerHub: felzek/malware-classifier
  - DockerHub: amithmkini/cyberai
  - DockerHub: yasirali12/pipeline
  - DockerHub: sidbav/689-final-submission
  - DockerHub: yasirali12/model
  - DockerHub: felzek/defender
Public Detection Models (2024):
- Bhavan Dondapati, Vishal Vardhan Adepu, and Rohith Yogi Nomula
  - DockerHub: vva2/defender - Version: 1.0.2

ChatGPT Fun

Results:
- Syed Rizvi and Yasir Farrukh were able to make ChatGPT create a Python ransomware.
  - Prompt:
  - Execution:
- Amith KMattar, Chunkai Fu, and Mason Jerome were able to make ChatGPT create a Dropper.
  - The Tool:
  - Code Generation:

Marcus Botacin