Software engineers and AI practitioners rely on software packages from vast ecosystems like PyPI (Python), Cargo (Rust), and npm (JavaScript/TypeScript), often with deeply cascading dependencies. A key activity in software supply chain security (SSCS) is to understand the full dependency footprint of a product and stay abreast of potential security vulnerabilities, with the goal of preventing vulnerable or malicious packages from entering the product or build process.
In this blog post, we share our research on AI-assisted dependency vetting; the key idea is to scan software source code with large language models (LLMs) in order to identify malicious behaviour as well as unintentionally dangerous code. We have found that such LLM-based techniques can identify vulnerabilities that more traditional rule-based vetting pipelines are typically not able to discover. Today, we actively use the described vetting technique as one of several building blocks in our SSCS infrastructure.
Metrics-based vetting
Traditional approaches for vetting software packages (i.e., for determining whether one can trust an open-source package) often rely on metrics and other metadata about the software, for example:
- Popularity and maintainer trust
- Financial backing
- Active development
- Security practices (SAST, CVEs, fuzzing)
- Domain criticality (e.g. authentication, cryptography, communication)
Unfortunately, these metrics do not provide sufficient signal in dynamic ecosystems such as AI research, where we find a large number of small (and not widely adopted) code repositories, often written in the context of academic research. These repositories have usually not adopted state-of-the-art security engineering practices, are only sporadically maintained by a small number of contributors, and don’t have strong financial or organisational backing.
How can we draw a line between a trustworthy and a non-trustworthy package? How can we gain trust in our software supply chain in such an environment? Clearly, the metrics- and metadata-based approach reaches its limits here, and we instead need to look at the code itself.
Maliciousness
Our policy for allowlisting a package for internal usage is simple: The package must be non-malicious. We have broken this down into a set of technical characteristics that can indicate malicious intent and that, by default, bar a package from our software supply chain:
- Package performs suspicious behaviour, like obfuscation or having a typo-squatted name (a minimal check for the latter is sketched after this list)
- Package performs network traffic (e.g., telemetry) to external (unverified) systems, with the risk of sensitive data leakage
- Package implements insecure cryptographic components, like a custom RSA implementation
- Package pulls source code or binaries from unverified servers and executes those
- Package originates from a country which has a high density of hacker groups targeting our industry
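Some of these characteristics are cheap to check mechanically. As an illustration, here is a minimal, hypothetical sketch of a typo-squatting check; the package list and similarity threshold are illustrative, not taken from our pipeline:

```python
import difflib

# Illustrative list of popular package names; a real check would use registry data.
POPULAR = ["requests", "numpy", "pandas", "cryptography", "urllib3"]

def possible_typosquat(name: str) -> bool:
    # Flag names that are suspiciously similar, but not identical, to a popular package.
    close = difflib.get_close_matches(name, POPULAR, n=1, cutoff=0.85)
    return bool(close) and close[0] != name

print(possible_typosquat("reqeusts"))  # True: one transposition away from "requests"
print(possible_typosquat("requests"))  # False: exact match to a known package
```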
Now, how can we detect those properties at scale, i.e., for several thousand packages across multiple programming languages, each with regular updates? Human review by security experts works well for a small number of high-impact packages, but won't scale to the long tail of thousands of packages.
The canonical approach to automating vetting relies on rule-based analysis. Unfortunately, most readily available tools (e.g., Semgrep) focus on quality aspects of software, not malicious behaviour. Moreover, rule-based analysis (“if this then that”) already struggles with simple techniques like obfuscation:
目水鸟月木人木鳥马口马刀木鳥水子 + ''.join(map(getattr(__builtins__, oct.__str__()[-3 << 0] + hex.__str__()[-1 << 2] + copyright.__str__()[4 << 0]), [(((3 << 3) - 1) << 2), (((5 << 2) - 1) << 2) + 1, ((((3 << 2) + 1)) << 3) + 1, (((3 << 3) + 1) << 2) - 1, (7 << 4) + (1 << 1), (7 << 4) - 1, (7 << 4) + 3, (7 << 4) - 1, ((((3 << 2) + 1)) << 3) - (1 << 1), (7 << 4) + (1 << 2), (((3 << 3) - 1) << 2), ((
((3 << 2) - 1)) << 3) - 1, ((((3 << 2) + 1)) << 3) + 1, (7 << 4) - (1 << 1), (((3 << 3) + 1) << 2), (7 << 4) - 1, (((1 << 4) - 1) << 3) - 1, (7 << 4) + 3, (((3 << 3) - 1) << 2), (((5 << 2) + 1) << 2) - 1, (7 << 4) + (1 << 2), (3 << 5) + 1, (7 << 4) + (1 << 1), (7 << 4) + (1 << 2), (1 << 5), (((5 << 2) - 1) << 2) + 1, (((3 << 3) + 1) << 2) + 1, (7 << 4) - (1 << 1), (((1 << 4) - 1) << 3) - 3])),
...
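What the snippet does only becomes apparent at runtime: the getattr(__builtins__, ...) call resolves to the built-in chr (its name is itself assembled from the string representations of oct, hex, and copyright), and the bit-shift expressions resolve to character codes, so no suspicious literal ever appears in the source for a rule to match. Below is a minimal, hypothetical sketch of the same trick in readable form:

```python
# The string "import os" is never written in the source; it is rebuilt from
# integer arithmetic at runtime, leaving nothing for a pattern rule to match.
hidden = "".join(map(chr, [
    (3 << 5) + 9, (7 << 4) - 3, (7 << 4), (7 << 4) - 1, (7 << 4) + 2,
    (7 << 4) + 4, (1 << 5), (7 << 4) - 1, (7 << 4) + 3,
]))
print(hidden)  # -> "import os"
```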
The limits of rule-based systems inspired us to investigate AI-based vetting methods.
AI-based vetting
We have designed a multi-layer AI pipeline for automated package vetting. It has the following components (a simplified sketch of the control flow follows the list):
- First, a lightweight model (DistilBERT), trained on malware samples, scans for malicious patterns in packages. The design stems from the paper Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application (2025). The idea is to quickly spot malicious needles in a haystack of thousands of packages.
- When there is a finding, the package files are sent to a large language model for triage. Currently, only larger models (Gemini 2.5 Pro, Mistral Large 2) or fine-tuned models are reliable at triaging the security attributes of software. In our experience, smaller open-source models like Gemma 3 or Mistral Small 3.2 create too many false positives.
- If the finding is categorised as malicious by both AI layers, a security engineer takes over to make a final decision on the trustworthiness of the package. We tune both layers for high recall (i.e., better safe than sorry) and handle false positives (i.e., the model suggests malicious intent where there is none) with human review.
- The final human decision is supported by a Deep Research agent system which gathers information about the trustworthiness of a package: background checks, popularity, maintenance activity, etc.
- Given all the information above, a human security engineer makes the final vetting decision.
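The sketch below shows how these layers could fit together; it is a hypothetical simplification with stubbed-out models, not our production code:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    malicious: bool
    rationale: str

def classifier_score(path: str) -> float:
    # Layer 1 stub: the lightweight transformer classifier would score the file here.
    return 0.0

def triage_with_llm(path: str) -> Verdict:
    # Layer 2 stub: a large model would triage the flagged file here.
    return Verdict(False, "triage stub")

def vet_package(package_files: list[str]) -> Verdict:
    # Layer 1: fast scan of every file; only findings move on to the LLM.
    flagged = [f for f in package_files if classifier_score(f) > 0.5]
    if not flagged:
        return Verdict(False, "no classifier findings")
    # Layer 2: LLM triage, tuned for high recall (better safe than sorry).
    findings = [triage_with_llm(f) for f in flagged]
    if not any(v.malicious for v in findings):
        return Verdict(False, "classifier findings dismissed by LLM triage")
    # Layers 3 and 4 (deep-research report, human decision) happen outside this sketch.
    return Verdict(True, "escalated to human review")

print(vet_package(["setup.py", "pkg/net.py"]))
```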
Learnings
With the AI vetting pipeline, we have massively reduced package review times and gained novel insights into our supply chain. We identified packages performing unwanted telemetry, packages side-loading source code, and packages with suspicious naming schemes (probably typo-squatting attempts).
Let’s take a look at a few examples; the examples are notional, but are representative of real findings of our vetting pipeline.
Obfuscation
Highly obfuscated code like the following can be used to hide malicious functionality, such as a backdoor or spyware that exfiltrates sensitive data, without being detected by standard security scans.
import base64

_BLOB = "cHJpbnQoImhlbGxvIik="  # opaque payload (here a harmless print; in a real finding, the malicious logic)

def _decode():
    return base64.b64decode(_BLOB).decode()  # rebuilds the hidden source at runtime

def activate():
    return exec(_decode())  # exec() of decoded data evades "if this then that" rules
Telemetry
A telemetry function disguised as a generic data sender could be repurposed by an adversary to covertly transmit sensitive project details, system configurations, or operational data to an unauthorised external server. Most of the time, telemetry has no malicious intent and is used deliberately to collect insights into a software's usage. Still, we want to know whether, and what, data is shared with unverified external systems.
import json, platform, urllib.request

def send_usage_data(event):
    try:
        payload = json.dumps({"event": event, "host": platform.node()}).encode()
    except Exception:
        return  # Fail silently if payload creation fails
    try:
        # Destination is a notional, unverified external endpoint.
        urllib.request.urlopen("https://metrics.example.org/v1/ingest", data=payload, timeout=2)
    except Exception:
        pass  # Fail silently if the request fails
Insecure authentication
Deploying a cryptographically weak or flawed authentication implementation could create a critical vulnerability in systems protecting sensitive information. Here, the LLM layer identified a suspicious comment in an authentication implementation.
"""
Encrypts a byte stream using a proprietary asymmetric cipher in ECB mode.
This function takes a stream of bytes and encrypts each byte individually
using the provided public key components (e, n).
WARNING: This implementation is for academic purposes only. It is a textbook
example of a cryptographic protocol (ECB mode) and is NOT
suitable for securing real-world data.
"""
, =
# This encrypts each byte independently, which is a major cryptographic flaw.
=
return
Code from untrusted sources
The following code could be exploited in a supply chain attack: a hostile actor compromises the build process to download and execute a malicious pre-compiled binary from an untrusted server, embedding malware directly into critical software whenever the build is not run in an air-gapped environment.
"""
Fetches and integrates pre-built components from an existing package.
"""
=
=
=
=
=
=
=
=
# Placeholder for environment compatibility check
return True
Conclusion
Scalable software package vetting has become a critical ingredient of our software supply chain infrastructure, allowing our engineers to experiment and work with a vast number of packages while keeping the security team in the loop and in control.
Thus far, we are very happy with the first iteration of our AI-based vetting system, but of course there is always room for improvement. For example, since the LLM layers are costly in both time and money, we are actively working on improving the precision of the DistilBERT classifier; one lever here is to use the results of the LLM and human review layers as training data for the classifier.
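A minimal sketch of that feedback loop, assuming the Hugging Face transformers and datasets libraries; the example data, model checkpoint, and training settings are illustrative, not our production setup:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Reviewed findings: code snippet plus the final (LLM- and human-confirmed) label.
reviewed = [
    {"text": "exec(base64.b64decode(blob))", "label": 1},  # confirmed malicious
    {"text": "def add(a, b): return a + b", "label": 0},   # confirmed benign
]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Turn the review verdicts into a tokenised training set for the first-stage classifier.
dataset = Dataset.from_list(reviewed).map(
    lambda row: tokenizer(row["text"], truncation=True, padding="max_length")
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vetting-classifier", num_train_epochs=1),
    train_dataset=dataset,
)
trainer.train()
```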
Written by the Helsing Security Engineering team