Research, Technical | November 14 2023 | 7 min read
TL;DR: Apiiro’s security research team has developed a revolutionary approach that combines LLMs and sophisticated algorithms to accurately correlate different code segments—such as open-source packages or commits—by similarity. This methodology supports various use cases, including matching multiple OSS package contributors with different aliases to singular authors, a common practice employed by bad actors.
In the past year, there has been a significant influx of malicious activity within open-source package managers such as PyPI, npm, and RubyGems. To protect your supply chain from such activity, you need to reliably determine whether a package is malicious prior to downloading and installing it. But that’s easier said than done, as existing software composition analysis (SCA) and software supply chain security (SSCS) tools are ill-equipped to do so.
As innovators in the age of AI, we saw LLMs as a great potential solution to this problem and created a new technique using LLMs—LLM Code Patterns (LCPs)—and proprietary code analysis. As outlined in a previous post, this approach accurately detects new malicious packages by establishing similarities with past known malicious packages and ensures accuracy through a self-enhancing loop. This is great when you can analyze the code, but threat actors go to great lengths to disguise their malicious activities or circumvent restrictions by using multiple aliases to publish packages and using encryption to obfuscate malicious code. That’s where our new methodology—LLM Code Authorship Detection (LCAD)—comes in.
LCAD goes beyond detecting malicious packages, and correlates packages with their respective author aliases to illuminate malicious activity. When confronted with a high volume of malicious packages published by different authors, LCAD can accurately group them based on their “true” authors. By linking seemingly unrelated code contributions to the same individual, LCAD helps expose deceptive practices and enhances accountability.
LCAD leverages LLMs, sophisticated algorithms, and graph analysis.
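To make this concrete, here is a minimal sketch of the first stage: scoring pairwise code similarity and keeping pairs above a threshold as graph edges. The `embed` function below is a toy character-trigram stand-in, not Apiiro’s actual LCP-based representation, and the function names and threshold are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt

def embed(code: str, dims: int = 256) -> list[float]:
    # Toy stand-in for an LLM-derived representation: a character-trigram
    # frequency vector. LCAD would use LLM Code Patterns (LCPs) instead.
    vec = [0.0] * dims
    for i in range(len(code) - 2):
        vec[hash(code[i:i + 3]) % dims] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_edges(packages: dict[str, str], threshold: float = 0.8):
    """Return (pkg_a, pkg_b) pairs whose code similarity exceeds threshold."""
    vectors = {name: embed(code) for name, code in packages.items()}
    return [
        (a, b)
        for a, b in combinations(packages, 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    ]
```

Each surviving pair becomes an edge in the authorship connectivity graph discussed below; the threshold controls how aggressively packages are linked.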
Because it uses LCPs, this approach outperforms traditional authorship detection solutions, enabling the attribution of malicious code to a single actor even in intricate scenarios like multiple authors or code repurposing, a common way threat actors try to mask malicious behavior. Alongside LCAD’s detection abilities, its anomaly detection mechanism helps teams defend against the introduction of potentially malicious code.
To put LCAD to the test, we analyzed 100k+ PyPI package contributions to see if it could identify malicious packages that appeared to have different authors but were, in fact, uploaded by the same individual.
As you can see in the connectivity graph below, LCAD was able to identify similarities between packages published under different aliases despite variations in patterns and the use of encryption within the packages.
Each point on the graph represents a unique package, and the colors represent the corresponding submitting authors. The edges between nodes in this graph denote significant code similarity, as identified by LCAD. Notably, this graph comprises a single, prominent connected component, indicating that all of these packages were indeed contributed by a single “real” author.
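The grouping step described above can be sketched as a plain connected-components traversal over the similarity graph, where each component is taken as the work of one “real” author. This is an illustrative implementation of the standard graph technique, not Apiiro’s production code.

```python
def group_by_author(packages: list[str], edges: list[tuple[str, str]]):
    """Group packages into connected components of the similarity graph.

    Each component is interpreted as the work of a single 'real' author."""
    adjacency = {pkg: set() for pkg in packages}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, groups = set(), []
    for pkg in packages:
        if pkg in seen:
            continue
        # Depth-first traversal to collect everything reachable from pkg.
        stack, component = [pkg], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        groups.append(component)
    return groups
```

A single dominant component, as in the graph above, is the signal that many aliases trace back to one contributor.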
These packages were found to be malicious in nature, posing a significant security concern to the PyPI ecosystem. PyPI has been notified, and the packages were removed.
To verify LCAD’s findings, we manually traced the attack timeline and looked for distinguishing clues to determine whether these packages were likely contributed by the same “real” author. We compared the types of encryption employed and traits such as email addresses to identify similarities.
You can see the timeline of contributions of each alias below by date (x-axis) and time of day (y-axis):
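Preparing that kind of timeline view amounts to mapping each contribution timestamp to a (date, hour-of-day) point per alias, ready to scatter-plot. The helper below is a small illustrative sketch of that transformation; the record format is an assumption.

```python
from datetime import datetime

def timeline_points(contributions: list[tuple[str, str]]):
    """Map (alias, ISO timestamp) records to per-alias (date, hour-of-day)
    points, suitable for plotting date on the x-axis and time of day on
    the y-axis."""
    points: dict[str, list[tuple[str, float]]] = {}
    for alias, ts in contributions:
        dt = datetime.fromisoformat(ts)
        # Fractional hour keeps within-day ordering visible on the y-axis.
        points.setdefault(alias, []).append(
            (dt.date().isoformat(), dt.hour + dt.minute / 60)
        )
    return points
```

Aliases whose points cluster in the same daily activity window, as in our investigation, are a further hint of a shared author.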
Our real-world example demonstrates the strength of LCAD: it accurately linked multiple aliases to a single malicious actor. By analyzing the attack timeline and authorship connectivity graph, we exposed the actor’s deceptive practices. When applied within an organization, code contributions attributed to known malicious actors, or code that deviates from the author’s established style, would be flagged as abnormal.
The above example demonstrated the strength of LCAD for connecting multiple package manager author aliases to a single malicious actor by comparing the nature of their contributions and behaviors. By pinpointing malicious actors and flagging future contributions as malicious, LCAD can enrich package insights for an added dimension of risk reduction.
This same methodology, however, can be leveraged internally to analyze an organization’s own development anomalies. By analyzing developers’ commits and behavior, LCAD can detect instances where commits purportedly made by a specific user do not align with their established coding patterns, signaling possible code injection or unauthorized contributions.
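One way to sketch that internal anomaly check: compare a new commit’s feature vector against the author’s historical average and flag low similarity. The feature vectors, profile averaging, and threshold here are simplifying assumptions standing in for LCAD’s learned code patterns.

```python
from math import sqrt

def flag_abnormal_commit(commit_vec: list[float],
                         author_history: list[list[float]],
                         threshold: float = 0.5) -> bool:
    """Flag a commit as abnormal when it deviates from the author's style.

    The style profile is approximated as the mean of the author's past
    commit feature vectors; low cosine similarity to it raises a flag."""
    dims = len(commit_vec)
    profile = [
        sum(vec[i] for vec in author_history) / len(author_history)
        for i in range(dims)
    ]
    dot = sum(a * b for a, b in zip(commit_vec, profile))
    norm = sqrt(sum(a * a for a in commit_vec)) * sqrt(sum(b * b for b in profile))
    similarity = dot / norm if norm else 0.0
    # Low similarity suggests code injection or an unauthorized contribution.
    return similarity < threshold
```

A commit that matches the author’s history passes quietly; one that diverges sharply is surfaced for review.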
The versatility of LCAD manifests in its ability to detect anomalies within commits, cluster code segments, and, in the case of malicious code, expose deceptive actor aliases. In an ever-evolving cybersecurity landscape, code authorship detection remains a paramount concern, demanding robust solutions like LCAD to ensure the integrity and trustworthiness of software repositories.
LLM Code Authorship Detection (LCAD) revolutionizes the landscape of code attribution, providing a powerful and precise solution to trace code origins. By integrating such signals, improved anomaly detection becomes possible.
At Apiiro, it’s our mission to contextualize security alerts from disparate tools (SAST, SCA, etc.) to determine which presents the most risk and needs to be quickly addressed. We’re also constantly implementing new, innovative ways to flag new types of potential risks—with higher precision and across different malicious activities.
Apiiro’s risk engine leverages LLMs and code analysis to detect malicious packages, which are reflected alongside other open source security alerts in Apiiro with a “Malicious package” insight.
Apiiro’s risk engine also leverages LLMs and code analysis to natively flag abnormal commits based on known developer behavior and the nature of the code or repository.
As the landscape of technology and cyber threats continues to advance, Apiiro remains ahead of the curve with innovative solutions like LCAD that leverage the power of LLMs for accurate code attribution and strengthened security. By integrating LLMs, Apiiro has revolutionized the methodology of identifying and comparing code segments, vastly uplifting the reliability and efficacy of authorship detection systems.
We will continue to push the envelope to effectively use AI and LLMs to improve the effectiveness of application security tooling. To learn more about our work in this arena, read our recent blog on LCPs and subscribe for future posts.