Large language models are now a staple in how developers write code. They’re used to scaffold new components, suggest implementations, and automate repetitive tasks. But as their adoption grows, so do the questions about the security of what they generate.
A recent study by a research group at George Mason University—“Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation”—offers one of the most comprehensive evaluations to date. The authors examined how reliably today’s LLMs produce secure code and how they respond to different types of guidance and feedback. Their findings are clear: while models like GPT-4o and CodeLlama often introduce vulnerabilities by default, their output improves significantly when they’re given explicit, contextual information about potential security risks.
At Apiiro, these insights echo what we see in real-world development environments. The issue isn’t that LLMs are inherently insecure; it’s that they lack access to the context that developers and AppSec teams rely on to make safe decisions. That’s where Apiiro fits in: by analyzing code structure, data flow, runtime exposure, and architectural intent, we help bridge the gap between raw AI output and production-ready, secure code.
In this post, we’ll unpack the paper’s key insights and show how Apiiro helps organizations ensure their LLM-assisted development produces code that not only works but also avoids introducing security issues.
The research doesn’t just show that AI-generated code carries risk; it explains why those risks persist and what actually improves outcomes. Across models and benchmarks, one consistent pattern emerged: without explicit signals about potential vulnerabilities, LLMs default to insecure implementations. That isn’t malicious behavior; the models simply lack visibility into constraints developers take for granted, such as data sensitivity, runtime exposure, and architectural intent.
The team evaluated eight popular LLMs across two security benchmarks: SecCodePLT, which focuses on CWE-specific vulnerability types, and SecurityEval, which includes real-world, expert-authored Python coding scenarios. On both, even high-performing models like GPT-4o and DeepSeek-Coder-V2 frequently generated vulnerable code. In the worst cases, over 40% of completions introduced security flaws.
But the most meaningful finding wasn’t how often models fail; it was how much they improve when given the right kind of context. When supplied with explicit security guidance, such as a definition of the relevant vulnerability or a hint tied to the specific coding scenario, models showed significant drops in vulnerability rates, with reductions of up to 80% in some scenarios. GPT-4o, for example, dropped from 41.3% to 5.2% (AllV-R on SecurityEval) with the right guidance.
“When these hints are further contextualized—explicitly linking the vulnerability to the specific scenario—they outperform basic definitions.”
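To make the idea concrete, here is a minimal sketch of what a contextualized hint can look like in practice, assuming an OpenAI-style chat client. The task, the CWE-502 hint wording, and the generate helper are illustrative stand-ins, not the prompts used in the study.

```python
# Minimal sketch: contrasting a plain code-generation prompt with one that
# carries a scenario-specific vulnerability hint, in the spirit of the paper.
# The task and hint wording are illustrative, not the benchmark's exact prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK = "Write a Python function that loads a user-supplied YAML config file."

# Plain prompt: the model gets no signal that unsafe deserialization is a risk.
plain_prompt = TASK

# Contextualized hint: names the weakness (CWE-502) and ties it to this scenario,
# which is the kind of guidance the study found most effective.
hinted_prompt = (
    f"{TASK}\n\n"
    "Security context: the file comes from an untrusted user, so unsafe "
    "deserialization (CWE-502) applies here. Use yaml.safe_load rather than "
    "yaml.load, and validate the parsed structure before using it."
)


def generate(prompt: str) -> str:
    """Request a completion for the given prompt and return the generated code."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(generate(plain_prompt))
    print(generate(hinted_prompt))
```

The only difference between the two prompts is the security context, which is exactly the variable the researchers isolated.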
The paper makes a strong case that if you give LLMs the right inputs, they can reason about security. What’s missing is structured, context-rich information about the application and environment they’re working within.
The findings from Yan et al. reinforce something we’ve seen play out repeatedly in practical, real-world scenarios. Language models can write insecure code, not because they’re careless, but because they don’t understand the environment they’re writing for. They have no awareness of whether a function is handling PII, whether an API is exposed externally, or whether a code path violates organizational security policy. Without that context, even a perfectly fluent model will make the wrong call.
At Apiiro, we’ve tracked this trend across real production environments. Since mid-2023, we’ve seen a 3x increase in PII exposure and a 10x spike in APIs missing authentication and validation logic, much of it tied to code authored or accelerated by GenAI tooling.
What the research demonstrates in controlled experiments, we’re seeing at scale: LLMs do better when they have something to anchor their decisions to. When they understand which risks matter, and why. Apiiro provides exactly that. By analyzing code changes in the context of runtime behavior, data sensitivity, architectural dependencies, and ownership, Apiiro brings structure to the otherwise invisible decisions LLMs (and developers) make as they move quickly through implementation.
And just as the paper found that “explained feedback” improved LLM repair performance, Apiiro applies that principle across the SDLC, not only flagging risks, but explaining their impact and relevance in plain, actionable terms. This gives developers the context they need to validate or refine what an LLM suggests, instead of blindly trusting it, or ignoring it altogether.
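As a rough illustration of that feedback pattern, here is a minimal sketch of an explained-feedback repair loop, assuming an OpenAI-style client and using Bandit as a stand-in static checker. The prompts, the ask_llm helper, and the loop structure are illustrative only; they are neither the paper’s setup nor Apiiro’s implementation.

```python
# Minimal sketch of an "explained feedback" repair loop: generate code, scan it,
# and feed the scanner's explanation back to the model for a targeted fix.
# Bandit is a stand-in checker; prompts and loop structure are illustrative.
import json
import subprocess
import tempfile

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_llm(prompt: str) -> str:
    """Send a single prompt to the model and return its text response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def scan(code: str) -> list[dict]:
    """Run Bandit on the generated code and return its findings as dicts."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(code)
        path = handle.name
    result = subprocess.run(
        ["bandit", "-f", "json", path], capture_output=True, text=True
    )
    return json.loads(result.stdout).get("results", [])


def generate_with_repair(task: str, max_rounds: int = 2) -> str:
    """Generate code, then iteratively repair it using explained findings."""
    # Assumes the model returns raw code without markdown fences.
    code = ask_llm(task)
    for _ in range(max_rounds):
        findings = scan(code)
        if not findings:
            break
        # Explain each finding (what, where, why) instead of just saying "insecure".
        explanation = "\n".join(
            f"- {f['test_id']} at line {f['line_number']}: {f['issue_text']}"
            for f in findings
        )
        code = ask_llm(
            "The following code has security findings:\n\n"
            f"{code}\n\nFindings:\n{explanation}\n\n"
            "Rewrite the code so these issues are fixed, keeping behavior the same."
        )
    return code
```

The key design choice mirrors the paper’s result: the repair prompt explains what is wrong and where, rather than merely asserting that the code is insecure.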
In short, where the research team modeled how LLMs can be guided into secure behavior, Apiiro turns that guidance into a product reality. Our platform closes the loop between AI-assisted development and risk-aware engineering.
If secure LLM-driven development is about context, Apiiro is the system of record for that context.
Our platform analyzes every code change through multiple lenses: data flow, control flow, runtime exposure, architectural dependencies, and ownership. Then, our system maps those insights directly to risk. This gives development teams and LLMs the information they need to make better decisions before vulnerabilities are introduced.
This level of context doesn’t require developers to become security experts or expert prompt engineers. Instead, helpful information is surfaced directly in the flow of development, at the PR, in the CI pipeline, or through the API that powers your LLM assistant.
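As a purely hypothetical sketch of what “context in the flow of development” can mean for an assistant, the snippet below turns an invented risk-context structure into explicit guidance for an LLM prompt. The RiskContext fields and build_system_prompt helper are made up for illustration and are not Apiiro’s API.

```python
# Hypothetical sketch only: folding risk context surfaced at the PR or CI stage
# into an LLM assistant's system prompt. The structure and field names are
# invented for illustration; they do not represent Apiiro's actual API.
from dataclasses import dataclass


@dataclass
class RiskContext:
    """Invented stand-in for context about the code being changed."""
    handles_pii: bool
    internet_exposed: bool
    missing_authz: bool
    owning_team: str


def build_system_prompt(ctx: RiskContext) -> str:
    """Turn the risk context into explicit guidance for a code assistant."""
    notes = []
    if ctx.handles_pii:
        notes.append("This module processes PII; avoid logging raw values.")
    if ctx.internet_exposed:
        notes.append("The API is internet-facing; validate and sanitize all input.")
    if ctx.missing_authz:
        notes.append("Authorization checks are currently missing on this route.")
    notes.append(f"Route security questions to {ctx.owning_team}.")
    return "You are assisting on security-sensitive code.\n" + "\n".join(notes)


print(build_system_prompt(RiskContext(True, True, False, "payments-platform")))
```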
For engineering teams, it means you can now move fast with AI without accepting unnecessary risk. Whether your developers are using Copilot or integrating LLMs into custom tooling, Apiiro ensures that security is factored into the development process.
For AppSec teams, it means less time chasing vague alerts and unpacking findings and more time focused on high-impact risks, because the platform prioritizes vulnerabilities based on exposure, reachability, and business context.
And for leadership, it’s a model for enabling innovation responsibly: accelerating productivity without compromising security posture.
LLMs can write secure code, but only with the right context, and Apiiro supplies that context. By mapping code to risk, architecture, and runtime behavior, we give both developers and their AI tools the information they need to make safer decisions.
See how it works by requesting a demo.