Over 100,000 Infected Repos Found on GitHub

February 28 2024

Our security research and data science teams detected a resurgence of a malicious repo confusion campaign that began in the middle of last year, this time on a much larger scale. The attack involves more than 100,000 GitHub repositories (and presumably millions) and succeeds when unsuspecting developers use repositories that resemble known and trusted ones but are, in fact, infected with malicious code.

How do repo confusion attacks happen?

Similar to dependency confusion attacks, malicious actors get their target to download their malicious version instead of the real one. But dependency confusion attacks take advantage of how package managers work, while repo confusion attacks simply rely on humans to mistakenly pick the malicious version over the real one, sometimes employing social engineering techniques as well.

In this case, in order to maximize the chances of infection, the malicious actor is flooding GitHub with malicious repos, following these steps:

Cloning existing repos (for example: TwitterFollowBot, WhatsappBOT, discord-boost-tool, Twitch-Follow-Bot, and hundreds more).
Infecting them with malware loaders.
Uploading them back to GitHub with identical names.
Automatically forking each one thousands of times.
Covertly promoting them across the web via forums, Discord, etc.

What happens when the malicious repos are in use?

Once unsuspecting developers use any of the malicious repos, the hidden payload unpacks seven layers of obfuscation, a process that also involves pulling malicious Python code and later a binary executable. The malicious code (largely a modified version of BlackCap-Grabber) then collects login credentials from different apps, browser passwords and cookies, and other confidential data. It sends them back to the malicious actors’ C&C (command-and-control) server and performs a long series of additional malicious activities.

The automation effects on GitHub

Most of the forked repos are quickly removed by GitHub, which identifies the automation. However, the automation detection seems to miss many repos, and the ones that were uploaded manually survive. Because the whole attack chain seems to be mostly automated on a large scale, the 1% that survive still amount to thousands of malicious repos. You can check out a small portion of the current wave yourself by simply searching the following in GitHub: 🔥 2024 language:python.

Counting the removed ones, the number of repos reaches millions. Usually the removal happens a few hours after the upload, so it’s challenging to document them. We know the removal is automated because many of the original ones still exist, and it mainly targets the fork bombs. For example, here you can see thousands of forks appear in the summary but none in the details.

Because of the operation’s large scope, this campaign has a sort of second-order social engineering network effect: every now and then, naive users fork the malicious repos without realizing they are spreading malware. Kind of ironic to see it spread by humans after such heavy reliance on automation.

When did the campaign start?

Here is a brief history of this malicious campaign:

May 2023: As originally reported by Phylum, several malicious packages were uploaded to PyPI containing early parts of the current payload. These packages were spread by 'os.system("pip install package")' calls planted in forks of popular GitHub repos, such as 'chatgpt-api'.

July – August 2023: Several malicious repos were uploaded to GitHub, this time delivering the payload directly instead of through importing PyPI packages. This came after PyPI removed the malicious packages, and the security community increased its focus there. Aliakbar Zahravi and Peter Girnus from Trend Micro published a great technical analysis of it.

November 2023 – Now: We have detected more than 100,000 repos containing similar malicious payloads, and the number keeps growing. This attack approach has several advantages:

GitHub is huge, so despite the large number of instances, their relative share is still insignificant and thus hard to detect.
Package managers are no longer involved, so no explicit malicious package names are mentioned, which leaves one less indicator.
The targeted repos are in a small niche and have low popularity, making it easier for unsuspecting developers to mistakenly clone their malicious impersonators.
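
The May 2023 delivery step in the timeline above, an os.system("pip install package") call planted in forked code, can be flagged statically. The following is a minimal sketch, not a complete detector: the function name find_pip_install_calls is hypothetical, and it assumes the call is written literally as os.system with a plain string argument. It will miss aliased imports, f-strings, and subprocess-based variants.

```python
import ast


def find_pip_install_calls(source: str) -> list[int]:
    """Return line numbers of os.system(...) calls whose literal
    string argument contains "pip install"."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match only the literal form `os.system(...)`.
        is_os_system = (
            isinstance(func, ast.Attribute)
            and func.attr == "system"
            and isinstance(func.value, ast.Name)
            and func.value.id == "os"
        )
        if not is_os_system or not node.args:
            continue
        arg = node.args[0]
        # Only plain string literals are inspected in this sketch.
        if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
            if "pip install" in arg.value:
                lines.append(node.lineno)
    return sorted(lines)
```

A hit is only a reason to look closer: some legitimate scripts install their own dependencies this way, so each match still needs manual review.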

The transition of malware from package managers to SCMs

Judging by the many incidents we have observed in package managers and SCM platforms, the transition of this campaign from malicious packages in PyPI to malicious GitHub repos seems to reflect a general trend. It seems that nowadays, the security community puts extra focus on package managers, so that was to be expected.

The ease of automatically generating accounts and repos on GitHub and similar platforms, using convenient APIs and soft rate limits that are easy to bypass, combined with the huge number of repos to hide among, makes it a perfect target for covertly infecting the software supply chain.

This campaign, along with dependency confusion campaigns plaguing package registries and generally malicious code being spread through source control managers, demonstrates how fragile software supply chain security is, despite the abundance of tools and available security mechanisms.

Indicators of compromise (how to know if you are infected)

Search for the following Python patterns and investigate any matches:
- exec(Fernet
- exec(requests
- exec(__import
- exec(bytes
- exec("""\nimport
- exec(compile
- __import__("builtins").exec(
Check for the local presence of any repositories related to automation of actions on social platforms, bots, and gaming, and remove them. If you must use one, reinstall it only after carefully verifying the source; if you cannot verify it, either avoid it or run it in a sandbox.
If you believe there’s a chance a repository of this type was cloned, respond as if the following cookies, credentials and keys were stolen:
- From browsers: any financial services, any email services, any crypto services, Amazon, eBay, AliExpress, Facebook, Instagram, Twitter, YouTube, Discord, TikTok, Telegram, Twitch, Steam, Yahoo, ExpressVPN, Spotify, and any streaming services.
- From apps: Exodus, Atomic Wallet, Guarda, Coinomi, Ethereum.
If you would like to verify file checksums, the full list is impractically long, but some of the common ones can be found in this VirusTotal graph.
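
The Python patterns listed above can be searched for with a short script. This is a minimal sketch rather than a vetted detection tool: the names find_iocs and scan_tree are hypothetical, only .py files are examined, and it assumes the \n in the triple-quote pattern may appear in an infected file either as a literal backslash-n or as a real line break.

```python
import re
from pathlib import Path

# Patterns copied from the indicator list, matched as plain substrings.
SUBSTRING_IOCS = [
    "exec(Fernet",
    "exec(requests",
    "exec(__import",
    "exec(bytes",
    "exec(compile",
    '__import__("builtins").exec(',
]

# Assumption: the "\n" in the listed pattern may be a literal
# backslash-n or an actual line break, so both are accepted.
TRIPLE_QUOTE_IOC = re.compile(r'exec\("""(?:\\n|\r?\n)import')


def find_iocs(text: str) -> list[str]:
    """Return the indicator patterns that occur in the given source text."""
    hits = [pattern for pattern in SUBSTRING_IOCS if pattern in text]
    if TRIPLE_QUOTE_IOC.search(text):
        hits.append('exec("""\\nimport')
    return hits


def scan_tree(root: str) -> dict[str, list[str]]:
    """Scan every .py file under root; map file path to matched patterns."""
    results = {}
    for path in Path(root).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable entry: skip it rather than abort the scan
        hits = find_iocs(text)
        if hits:
            results[str(path)] = hits
    return results
```

Calling scan_tree on a directory of cloned repositories returns the files worth investigating. A match is a lead, not a verdict, and a clean result does not prove a repository is safe, since the payload may use patterns not listed here.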

Cloudflare was notified and deactivated the DNS records of the malicious addresses found.

How to protect yourself against repo confusions

GitHub was notified, and most of the malicious repos were deleted, but the campaign continues, and attacks that attempt to inject malicious code into the supply chain are becoming increasingly prevalent. There are countless solutions for catching malware at the system and network levels, but the supply chain remains a massive and lucrative attack surface for malicious actors. If you encounter any malicious repo, part of this campaign or not, we encourage you to report it.

At Apiiro, we’ve built a malicious code detection system that monitors any connected codebases. We detect attacks through deep code analysis that combines multiple advanced techniques: LLM-based code analysis, deconstruction of the code into a complete execution flow graph, an elaborate heuristics engine, dynamic decoding, decryption, and de-obfuscation, and more, so it’s pretty hard to fool.

Without monitoring your code for injected malicious payloads, the security of your whole organization depends on things like your developers never choosing an almost-identical wrong repo, not having a single CI/CD misconfiguration, having 100% secure 3rd party code, and other impossible conditions. That’s why we as an industry need to start going beyond typical vulnerability detection and ingestion to surface the next generation of software supply chain and application risks.

Matan Giladi

Security Researcher

Gil David

Head of AI