
LLM Code Authorship Detection: Unmasking Malicious Package Contributions

Research, Technical
|
November 14, 2023
|
7 min read

TL;DR: Apiiro’s security research team has developed a revolutionary approach that combines LLMs and sophisticated algorithms to accurately correlate different code segments—such as open-source packages or commits—by similarity. This methodology supports various use cases, including matching multiple OSS package contributors with different aliases to singular authors, a common practice employed by bad actors.

In the past year, there has been a significant influx of malicious activity within open-source package managers such as PyPI, npm, and RubyGems. To protect your supply chain from such activity, you need to reliably determine whether a package is malicious prior to downloading and installing it. But that’s easier said than done, as existing software composition analysis (SCA) and software supply chain security (SSCS) tools are ill-equipped to do so.

As innovators in the age of AI, we saw LLMs as a great potential solution to this problem and created a new technique using LLMs—LLM Code Patterns (LCPs)—and proprietary code analysis. As outlined in a previous post, this approach accurately detects new malicious packages by establishing similarities with past known malicious packages and ensures accuracy through a self-enhancing loop. This is great when you can analyze the code, but threat actors go to great lengths to disguise their malicious activities or circumvent restrictions by using multiple aliases to publish packages and using encryption to obfuscate malicious code. That’s where our new methodology—LLM Code Authorship Detection (LCAD) comes in.

LLM Code Authorship Detection (LCAD): An innovative approach to code attribution with precision

LCAD goes beyond detecting malicious packages, and correlates packages with their respective author aliases to illuminate malicious activity. When confronted with a high volume of malicious packages published by different authors, LCAD can accurately group them based on their “true” authors. By linking seemingly unrelated code contributions to the same individual, LCAD helps expose deceptive practices and enhances accountability.

How LCAD works

LCAD leverages LLMs, sophisticated algorithms, and graph analysis.

  • LLM Code Patterns (LCPs). LCAD relies on LLM Code Patterns (LCPs), a representation of code that’s converted into vector format. To create LCPs, LLMs transform textual code into quantifiable numeric vectors, creating a standardized basis for comparison. This process allows us to concisely and precisely express complex code structures for more scalable and accurate comparisons.
  • LCP indexing. LCAD indexes all LCPs in a vector database, which becomes the foundation for calculating similarities between pairs of code segments. This similarity analysis produces corresponding connectivity graphs, in which nodes represent distinct code segments (such as commits, files, or packages) and edges signify substantial code similarity as indicated by the LLM vectors.
  • Authorship graph analysis. The authorship connectivity graph undergoes thorough analysis to identify meaningful connectivity components, which correspond to the same “real” author.
  • Anomaly detection. LCAD includes an anomaly detection mechanism to flag irregular patterns within the connectivity graphs to help identify and highlight potential malicious, abnormal behavior exhibited by the code author.
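The steps above can be sketched end to end. The snippet below is a minimal illustration under stated assumptions, not Apiiro's implementation: it stands in for LLM embeddings with a simple character-trigram hash vector, links code segments whose cosine similarity crosses a threshold, and groups them into connected components. The `embed` helper, `cluster_by_author` function, and threshold value are all ours, invented for illustration.

```python
import math
import zlib
from collections import defaultdict

def embed(code: str, dim: int = 256) -> list[float]:
    """Toy stand-in for an LLM code embedding: hash character trigrams
    into a fixed-size, L2-normalized vector."""
    v = [0.0] * dim
    for i in range(len(code) - 2):
        v[zlib.crc32(code[i:i + 3].encode()) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def connected_components(nodes, edges):
    """Union-find over the similarity graph; each connected component
    is one candidate 'real' author."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())

def cluster_by_author(snippets: dict[str, str], threshold: float = 0.6):
    """Index all vectors, connect similar pairs, return author clusters."""
    vecs = {name: embed(src) for name, src in snippets.items()}
    names = list(snippets)
    edges = [(a, b)
             for i, a in enumerate(names)
             for b in names[i + 1:]
             if cosine(vecs[a], vecs[b]) >= threshold]
    return connected_components(names, edges)
```

In this sketch, two near-identical snippets published under different aliases land in the same component, while unrelated code stays in its own cluster.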

Because it uses LCPs, this approach outperforms traditional authorship detection solutions: it can attribute malicious code to a single actor even in intricate scenarios like multiple authors or code repurposing, a common way threat actors try to mask malicious behavior. Alongside LCAD’s detection abilities, its anomaly detection mechanism helps teams defend against the introduction of potentially malicious code.

Validating LCAD in the wild: Unmasking malicious activity in PyPI

To put LCAD to the test, we analyzed 100k+ PyPI package contributions to see if it could identify malicious packages that appeared to have different authors but were, in fact, uploaded by the same individual.

As you can see in the connectivity graph below, LCAD was able to identify similarities between packages published under different aliases despite variations in patterns and the use of encryption within the packages.

[Image: Apiiro LCAD graph of correlated packages and authors]

Each point on the graph represents a unique package, and the colors represent the corresponding submitting authors. The edges between nodes in this graph denote significant code similarity, as identified by LCAD. Notably, this graph comprises a single, prominent connected component, indicating that all of these packages were indeed contributed by a single “real” author.

These packages were found to be malicious in nature, posing a significant security concern to the PyPI ecosystem. PyPI has been notified, and the packages were removed.

Trust but verify: Was LCAD accurate?

In an effort to verify LCAD’s findings, we manually traced the attack timeline and used our human investigative powers to spot any distinguishing clues to determine whether or not these packages were likely to be contributed by the same “real” author. We looked at the different types of encryption employed and traits such as email addresses to draw similarities.

  • April 1 – April 26 | A contributor under the alias ‘NHJonas’ using the email NHJonas@gmail.com uploaded ~30 malicious packages. The malicious code within these packages was in plain text and lacked any encryption.
  • May 1 | A contributor under the alias ‘Christian F’ using the email nick.faltermeier@gmx.de uploaded ~30 packages also lacking any encryption. As you can see on the left side of the above connectivity graph, LCAD determined that these packages are highly similar. With manual analysis, we confirmed that they were nearly identical to the set previously published by NHJonas.
  • May 3 – May 6 | Christian F published ~20 malicious packages. In a departure from the previous set, they began implementing encryption to hide some of the malicious content. Again, as you can see from the green and red clusters in the graph, LCAD determined that they were highly similar to each other and the non-encrypted packages. We verified that most of the malicious code in plain text bore a resemblance to the previous sets of packages, but one specific function was encrypted.
  • May 3 | NHJonas uploaded a new malicious package using the email nick.faltermeier@gmx.de (ding ding ding!). The package, syssqlite3V2, shared the same partial encryption technique employed by Christian F and is highly similar to both the non-encrypted and the partially encrypted packages.
  • May 6 – May 7 | A new contributor under the alias ‘Josef M’ uploaded three malicious packages (syssqlitedbextension, syscolouringsextV1, and syscoloringextensionV2) that bore a striking similarity to the partially encrypted malicious code uploaded by Christian F and NHJonas between May 3 and May 6. This marked the last instance where all three authors used the same partial encryption approach in their malicious code. As can be seen in the center of the authorship connectivity graph, these packages are connected with the other partially encrypted packages uploaded by Christian F.
  • May 7 | NHJonas uploaded another malicious package (pipcryptomodsV2), highlighted as an anomaly with respect to the rest of the packages in the graph: it used 100% encryption with no malicious code present in plain text, as opposed to the other fully encrypted packages, which still contain a few lines of plain-text code.
  • May 8, 2023 – July 3, 2023 | Josef M uploaded ~90 similar malicious packages now leveraging full encryption. All previously plain text malicious code was now fully encrypted, except for a few lines related to the installation of third-party packages and the download and execution of a malicious script from a nefarious website.

You can see the timeline of contributions of each alias below by date (x-axis) and time of day (y-axis):

[Image: Apiiro LCAD code author timeline plot]
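One way to quantify what a timeline plot like this shows is to compare the hour-of-day activity profile of each alias: aliases controlled by the same operator tend to upload during the same hours. The sketch below uses made-up timestamps and a `hour_profile` helper of our own devising, not the actual PyPI upload data from the investigation.

```python
import math
from collections import defaultdict
from datetime import datetime

def hour_profile(timestamps):
    """Normalized 24-bin histogram of upload hours for one alias."""
    hist = [0.0] * 24
    for ts in timestamps:
        hist[ts.hour] += 1.0
    norm = math.sqrt(sum(x * x for x in hist)) or 1.0
    return [x / norm for x in hist]

def pairwise_overlap(uploads):
    """Cosine overlap of hour-of-day profiles for every pair of aliases.
    High overlap is a (weak) hint that two aliases share an operator."""
    by_alias = defaultdict(list)
    for alias, ts in uploads:
        by_alias[alias].append(ts)
    profiles = {a: hour_profile(ts) for a, ts in by_alias.items()}
    aliases = sorted(profiles)
    return {(a, b): sum(x * y for x, y in zip(profiles[a], profiles[b]))
            for i, a in enumerate(aliases)
            for b in aliases[i + 1:]}
```

A timing signal like this is circumstantial on its own, which is why it is combined with the code-similarity evidence above.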

Our real-world example demonstrates the strength of LCAD, accurately linking multiple aliases to a single malicious actor. By analyzing the attack timeline and authorship connectivity graph, we exposed the actor’s deceptive practices. When applied within an organization, this means that code contributions attributed to known malicious actors, or code that deviates from an author’s established style, would be flagged as abnormal commits.

Beyond malicious packages: Using LCAD to detect malicious commits

The above example demonstrated the strength of LCAD for connecting multiple package manager author aliases to a single malicious actor by comparing the nature of their contributions and behaviors. By pinpointing malicious actors and flagging future contributions as malicious, LCAD can enrich package insights for an added dimension of risk reduction.

This same methodology, however, can be leveraged internally to analyze an organization’s own development anomalies. By analyzing developers’ commits and behavior, LCAD can detect instances where commits purportedly made by a specific user do not align with their established coding patterns, signaling possible code injection or unauthorized contributions.
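As a rough sketch of that internal use case (again using our toy trigram-hash stand-in for LLM embeddings, not Apiiro's model): keep a per-developer centroid over embeddings of their past commits, and flag a new commit whose similarity to that centroid falls below a cutoff. The 0.5 threshold and all function names here are illustrative assumptions.

```python
import math
import zlib

def embed(code: str, dim: int = 256) -> list[float]:
    # Toy stand-in for an LLM code embedding: hashed character trigrams,
    # L2-normalized.
    v = [0.0] * dim
    for i in range(len(code) - 2):
        v[zlib.crc32(code[i:i + 3].encode()) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def centroid(vectors):
    """Normalized mean vector: the developer's historical style profile."""
    dim = len(vectors[0])
    c = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in c)) or 1.0
    return [x / norm for x in c]

def is_abnormal_commit(new_commit: str, past_commits: list[str],
                       threshold: float = 0.5) -> bool:
    """Flag a commit whose embedding strays too far from the
    developer's historical centroid."""
    profile = centroid([embed(c) for c in past_commits])
    similarity = sum(x * y for x, y in zip(embed(new_commit), profile))
    return similarity < threshold
```

A commit in the developer's usual style scores close to the centroid, while an injected, stylistically foreign commit (say, an obfuscated payload) falls below the cutoff and gets flagged.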

The versatility of LCAD manifests in its ability to detect anomalies within commits, cluster code segments, and, in the case of malicious code, expose deceptive actor aliases. In an ever-evolving cybersecurity landscape, code authorship detection remains a paramount concern, demanding robust solutions like LCAD to ensure the integrity and trustworthiness of software repositories.

LLM Code Authorship Detection (LCAD) revolutionizes the landscape of code attribution, providing a powerful and precise solution for tracing code origins. By integrating these authorship signals, improved anomaly detection becomes possible.

How Apiiro protects organizations

At Apiiro, it’s our mission to contextualize security alerts from disparate tools (SAST, SCA, etc.) to determine which presents the most risk and needs to be quickly addressed. We’re also constantly implementing new, innovative ways to flag new types of potential risks—with higher precision and across different malicious activities.

Malicious package insights

Apiiro’s risk engine leverages LLMs and code analysis to detect malicious packages, which are reflected alongside other open source security alerts in Apiiro with a “Malicious package” insight.

[Image: Apiiro SCA “Malicious package” insights]

Abnormal commit behavior

Apiiro’s risk engine also leverages LLMs and code analysis to natively flag abnormal commits based on known developer behavior and the nature of the code and repository involved.

[Image: Apiiro abnormal commit behavior detection]

As the landscape of technology and cyber threats continue to advance, Apiiro remains ahead of the curve with its innovative solutions like LCAD that leverage the power of LLMs for accurate code attribution and strengthened security. By integrating LLMs, Apiiro has revolutionized the methodology of identifying and comparing code segments, vastly uplifting the reliability and efficacy of authorship detection systems.

We will continue to push the envelope to effectively use AI and LLMs to improve the effectiveness of application security tooling. To learn more about our work in this arena, read our recent blog on LCPs and subscribe for future posts.

Eli Shalom
VP of Algorithms and Data Science
Gil David
Head of AI