Software engineers and AI practitioners rely on software packages from vast ecosystems like PyPI (Python), Cargo (Rust), and npm (JavaScript/TypeScript), often with deeply cascading dependencies. A key activity in software supply chain security (SSCS) is to understand the full dependency footprint of a product and stay abreast of potential security vulnerabilities, with the goal of preventing vulnerable or malicious packages from entering the product or build process.
In this blog post, we share our research on AI-assisted dependency vetting; the key idea is to scan software source code with large language models (LLMs) in order to identify malicious behaviour as well as unintentionally dangerous code. We have found that such LLM-based techniques can identify vulnerabilities that more traditional rule-based vetting pipelines are typically not able to discover. Today, we actively use the described vetting technique as one of several building blocks in our SSCS infrastructure.
Metrics-based vetting
Traditional approaches for vetting software packages (i.e., for determining whether one can trust an open-source package) often rely on metrics and other metadata about the software, for example:
- Popularity and maintainer trust
- Financial backing
- Active development
- Security practices (SAST, CVEs, fuzzing)
- Domain criticality (e.g. authentication, cryptography, communication)
Unfortunately, these metrics do not provide sufficient signal in dynamic ecosystems such as AI research, where we find a large number of small (and not widely adopted) code repositories, often written in the context of academic research. These repositories have usually not adopted state-of-the-art security engineering practices, are only sporadically maintained by a small number of contributors, and don’t have strong financial or organisational backing.
How can we draw a line between a trustworthy and a non-trustworthy package? How can we gain trust in our software supply chain in such an environment? Clearly, the metrics- and metadata-based approach reaches its limits here, and we instead need to look at the code itself.
Maliciousness
Our policy for allowlisting a package for internal usage is simple: The package must be non-malicious. We have broken this down into a set of technical characteristics that can indicate malicious intent and that, by default, bar a package from our software supply chain:
- Package performs suspicious behaviour, like obfuscation or having a typo-squatted name (a minimal check for the latter is sketched after this list)
- Package performs network traffic (e.g., telemetry) to external (unverified) systems, with the risk of sensitive data leakage
- Package implements insecure cryptographic components, like a custom RSA implementation
- Package pulls source code or binaries from unverified servers and executes those
- Package originates from a country which has a high density of hacker groups targeting our industry
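Some of these characteristics are cheap to check mechanically. As an illustration, here is a minimal, hypothetical sketch of a typo-squatting check; the package list and similarity threshold are illustrative, not taken from our pipeline:

```python
import difflib

# Illustrative list of popular package names; a real check would use registry data.
POPULAR = ["requests", "numpy", "pandas", "cryptography", "urllib3"]

def possible_typosquat(name: str) -> bool:
    # Flag names that are suspiciously similar, but not identical, to a popular package.
    close = difflib.get_close_matches(name, POPULAR, n=1, cutoff=0.85)
    return bool(close) and close[0] != name

print(possible_typosquat("reqeusts"))  # True: one transposition away from "requests"
print(possible_typosquat("requests"))  # False: exact match to a known package
```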
Now, how can we detect those properties at scale, i.e., for several thousand packages across multiple programming languages, each with regular updates? Human review by security experts works well for a small number of high-impact packages, but won't scale to the long tail of thousands of packages.
The canonical approach to automating vetting relies on rule-based analysis. Unfortunately, most readily available tools (e.g., Semgrep) focus on quality aspects of software, not malicious behaviour. Moreover, rule-based analysis (“if this then that”) already struggles with simple techniques like obfuscation:
目水鸟月木人木鳥马口马刀木鳥水子 + ''.join(map(getattr(__builtins__, oct.__str__()[-3 << 0] + hex.__str__()[-1 << 2] + copyright.__str__()[4 << 0]), [(((3 << 3) - 1) << 2), (((5 << 2) - 1) << 2) + 1, ((((3 << 2) + 1)) << 3) + 1, (((3 << 3) + 1) << 2) - 1, (7 << 4) + (1 << 1), (7 << 4) - 1, (7 << 4) + 3, (7 << 4) - 1, ((((3 << 2) + 1)) << 3) - (1 << 1), (7 << 4) + (1 << 2), (((3 << 3) - 1) << 2), ((
((3 << 2) - 1)) << 3) - 1, ((((3 << 2) + 1)) << 3) + 1, (7 << 4) - (1 << 1), (((3 << 3) + 1) << 2), (7 << 4) - 1, (((1 << 4) - 1) << 3) - 1, (7 << 4) + 3, (((3 << 3) - 1) << 2), (((5 << 2) + 1) << 2) - 1, (7 << 4) + (1 << 2), (3 << 5) + 1, (7 << 4) + (1 << 1), (7 << 4) + (1 << 2), (1 << 5), (((5 << 2) - 1) << 2) + 1, (((3 << 3) + 1) << 2) + 1, (7 << 4) - (1 << 1), (((1 << 4) - 1) << 3) - 3])),
...
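What the snippet does only becomes apparent at runtime: the getattr(__builtins__, ...) call resolves to the built-in chr (its name is itself assembled from the string representations of oct, hex, and copyright), and the bit-shift expressions resolve to character codes, so no suspicious literal ever appears in the source for a rule to match. Below is a minimal, hypothetical sketch of the same trick in readable form:

```python
# The string "import os" is never written in the source; it is rebuilt from
# integer arithmetic at runtime, leaving nothing for a pattern rule to match.
hidden = "".join(map(chr, [
    (3 << 5) + 9, (7 << 4) - 3, (7 << 4), (7 << 4) - 1, (7 << 4) + 2,
    (7 << 4) + 4, (1 << 5), (7 << 4) - 1, (7 << 4) + 3,
]))
print(hidden)  # -> "import os"
```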
The limits of rule-based systems inspired us to investigate AI-based vetting methods.
AI-based vetting
We have designed a multi-layer AI pipeline for automated package vetting. It has the following components (a simplified sketch of the control flow follows the list):
- First, a lightweight model (DistilBERT), trained on malware samples, scans for malicious patterns in packages. The design stems from the paper Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application (2025). The idea is to quickly spot malicious needles in a haystack of thousands of packages.
- When there is a finding, the package files are sent to a large language model for triage. Currently, only larger models (Gemini 2.5 Pro, Mistral Large 2) or fine-tuned models are reliable at triaging the security attributes of software. In our experience, smaller open-source models like Gemma 3 or Mistral Small 3.2 create too many false positives.
- If the finding is categorised as malicious by both AI layers, a security engineer takes over to make a final decision on the trustworthiness of the package. We tune both layers for high recall (i.e., better safe than sorry) and handle false positives (i.e., the model suggests malicious intent where there is none) with human review.
- The final human decision is supported by a Deep Research agent system which gathers information about the trustworthiness of a package: background checks, popularity, maintenance activity, etc.
- Given all the information above, a human security engineer makes the final vetting decision.
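The sketch below shows how these layers could fit together; it is a hypothetical simplification with stubbed-out models, not our production code:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    malicious: bool
    rationale: str

def classifier_score(path: str) -> float:
    # Layer 1 stub: the lightweight transformer classifier would score the file here.
    return 0.0

def triage_with_llm(path: str) -> Verdict:
    # Layer 2 stub: a large model would triage the flagged file here.
    return Verdict(False, "triage stub")

def vet_package(package_files: list[str]) -> Verdict:
    # Layer 1: fast scan of every file; only findings move on to the LLM.
    flagged = [f for f in package_files if classifier_score(f) > 0.5]
    if not flagged:
        return Verdict(False, "no classifier findings")
    # Layer 2: LLM triage, tuned for high recall (better safe than sorry).
    findings = [triage_with_llm(f) for f in flagged]
    if not any(v.malicious for v in findings):
        return Verdict(False, "classifier findings dismissed by LLM triage")
    # Layers 3 and 4 (deep-research report, human decision) happen outside this sketch.
    return Verdict(True, "escalated to human review")

print(vet_package(["setup.py", "pkg/net.py"]))
```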
Learnings
With the AI vetting pipeline, we have massively reduced package review times and gained novel insights into our supply chain. We identified packages performing unwanted telemetry, packages side-loading source code, and packages with suspicious naming schemes (probably typo-squatting attempts).
Let’s take a look at a few examples; the examples are notional, but are representative of real findings of our vetting pipeline.
Obfuscation
Highly obfuscated code like the following can be used to hide malicious functionality, such as a backdoor or spyware that exfiltrates sensitive data, without being detected by standard security scans.
import base64

_BLOB = "cHJpbnQoImhlbGxvIik="  # opaque payload (here a harmless print; in a real finding, the malicious logic)

def _decode():
    return base64.b64decode(_BLOB).decode()  # rebuilds the hidden source at runtime

def activate():
    return exec(_decode())  # exec() of decoded data evades "if this then that" rules
Telemetry
A telemetry function disguised as a generic data sender could be repurposed by an adversary to covertly transmit sensitive project details, system configurations, or operational data to an unauthorised external server. Most of the time, telemetry has no malicious intent and is used deliberately to collect insights into a software's usage. Still, we want to know whether, and what, data is shared with unverified external systems.
import json, platform, urllib.request

def send_usage_data(event):
    try:
        payload = json.dumps({"event": event, "host": platform.node()}).encode()
    except Exception:
        return  # Fail silently if payload creation fails
    try:
        # Destination is a notional, unverified external endpoint.
        urllib.request.urlopen("https://metrics.example.org/v1/ingest", data=payload, timeout=2)
    except Exception:
        pass  # Fail silently if the request fails
Insecure authentication
Deploying a cryptographically weak or flawed authentication implementation could create a critical vulnerability in systems protecting sensitive information. Here, the LLM layer identified a suspicious comment in an authentication implementation.
"""
Encrypts a byte stream using a proprietary asymmetric cipher in ECB mode.
This function takes a stream of bytes and encrypts each byte individually
using the provided public key components (e, n).
WARNING: This implementation is for academic purposes only. It is a textbook
example of a cryptographic protocol (ECB mode) and is NOT
suitable for securing real-world data.
"""
, =
# This encrypts each byte independently, which is a major cryptographic flaw.
=
return
Code from untrusted sources
The following code could be exploited in a supply chain attack: a hostile actor compromises the build process to download and execute a malicious pre-compiled binary from an untrusted server, embedding malware directly into critical software whenever the build is not run in an air-gapped environment.
"""
Fetches and integrates pre-built components from an existing package.
"""
=
=
=
=
=
=
=
=
# Placeholder for environment compatibility check
return True
Conclusion
Scalable software package vetting has become a critical ingredient of our software supply chain infrastructure, allowing our engineers to experiment and work with a vast number of packages while keeping the security team in the loop and in control.
Thus far, we are very happy with the first iteration of our AI-based vetting system, but of course there is always room for improvement. For example, since the LLM layers are costly in both time and money, we are actively working on improving the precision of the DistilBERT classifier; one lever here is to use the results of the LLM and human review layers as training data for the classifier.
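A minimal sketch of that feedback loop, assuming the Hugging Face transformers and datasets libraries; the example data, model checkpoint, and training settings are illustrative, not our production setup:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Reviewed findings: code snippet plus the final (LLM- and human-confirmed) label.
reviewed = [
    {"text": "exec(base64.b64decode(blob))", "label": 1},  # confirmed malicious
    {"text": "def add(a, b): return a + b", "label": 0},   # confirmed benign
]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Turn the review verdicts into a tokenised training set for the first-stage classifier.
dataset = Dataset.from_list(reviewed).map(
    lambda row: tokenizer(row["text"], truncation=True, padding="max_length")
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vetting-classifier", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```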
Written by the Helsing Security Engineering team