A code property graph (CPG) is a unified data structure that merges three fundamental code representations into a single, queryable graph: the abstract syntax tree (AST), the control flow graph (CFG), and the program dependence graph (PDG), which captures data and control dependencies between statements. By combining these representations, a CPG provides a joint structural and semantic model of source code that supports advanced security analysis, vulnerability discovery, and code understanding.
The concept was introduced in the 2014 paper “Modeling and Discovering Vulnerabilities with Code Property Graphs” by Yamaguchi et al. and popularized by tools like Joern and ShiftLeft. A code property graph enables analysts and automated tools to ask complex questions about code that no single representation can answer alone, such as “show me every path where user input reaches a database query without passing through a sanitizer.”
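Each component of a code property graph captures a different dimension of program behavior: the AST captures syntactic structure, the CFG captures the order in which statements can execute, and the PDG captures how values and conditions in one statement affect another.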
A CPG merges these three graphs by connecting their nodes through shared identity. A function node in the AST links to its entry point in the CFG and to the data flow edges in the PDG that trace values through its parameters. This unified code graph lets analysts traverse syntactic structure, execution paths, and data flow in a single query, without switching between separate tool outputs.
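The idea of shared node identity can be sketched in a few lines of Python. This is a toy model, not how Joern or any other tool stores its graph: the `ToyCPG` class, the node ids, and the edge labels `AST`, `CFG`, and `DATA_FLOW` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ToyCPG:
    """Hypothetical minimal property graph: one node set, labeled edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> property dict
        self.edges = defaultdict(list)  # (source id, edge label) -> target ids

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        self.edges[(src, label)].append(dst)

    def out(self, node_id, label):
        """Return the targets of edges with this label leaving node_id."""
        return list(self.edges.get((node_id, label), []))


# Toy program being modeled:
#   def greet(name):
#       msg = "hi " + name
#       print(msg)
cpg = ToyCPG()
cpg.add_node("fn", kind="FUNCTION", name="greet")
cpg.add_node("param", kind="PARAM", name="name")
cpg.add_node("assign", kind="ASSIGN", code='msg = "hi " + name')
cpg.add_node("call", kind="CALL", name="print")

# AST layer: the function syntactically contains its parameter and statements.
for child in ("param", "assign", "call"):
    cpg.add_edge("fn", "AST", child)

# CFG layer: execution enters at the assignment, then reaches the call.
cpg.add_edge("fn", "CFG", "assign")
cpg.add_edge("assign", "CFG", "call")

# Data dependence layer: `name` flows into `msg`, and `msg` flows into print().
cpg.add_edge("param", "DATA_FLOW", "assign")
cpg.add_edge("assign", "DATA_FLOW", "call")

# One traversal crossing all three layers, starting from the same function node.
params = [n for n in cpg.out("fn", "AST") if cpg.nodes[n]["kind"] == "PARAM"]
entry = cpg.out("fn", "CFG")
uses = [use for p in params for use in cpg.out(p, "DATA_FLOW")]
```

Because every layer refers to the same node ids, the traversal can move from the function's syntax to its entry point and on to the uses of its parameter without translating between separate graphs.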
Code property graphs are naturally suited to graph databases because they consist of nodes (representing code elements) and edges (representing relationships like “calls,” “flows to,” or “controls”). Graph databases like Neo4j, TinkerGraph, and OverflowDB store CPGs efficiently and support traversal queries that would be expensive or impractical in relational databases.
Query languages like Gremlin, Cypher, or tool-specific DSLs (such as Joern’s query language) allow analysts to express complex code patterns as graph traversals. For example, a query might find all functions that accept HTTP request parameters, trace data flow through intermediate variables, and check whether the data reaches a SQL execution sink without passing through a parameterized query builder.
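The kind of query described above can be approximated in plain Python to show what the traversal does. This sketch does not use Gremlin, Cypher, or Joern's DSL. The `unsanitized_paths` function and the node names in `flows` are hypothetical, and the data flow edges are written by hand rather than extracted from real code.

```python
def unsanitized_paths(flows, sources, sinks, sanitizers):
    """Return each acyclic data flow path from a source to a sink that avoids every sanitizer."""
    found = []

    def walk(node, path):
        if node in sanitizers:
            return  # data is treated as safe past this node; stop exploring
        path = path + [node]
        if node in sinks:
            found.append(path)
            return
        for nxt in flows.get(node, []):
            if nxt not in path:  # do not loop forever on cyclic flows
                walk(nxt, path)

    for src in sources:
        walk(src, [])
    return found


# Hand-written data flow edges for an imagined request handler (assumption).
flows = {
    "request.param('id')": ["user_id"],
    "user_id": ["raw_sql", "bind_param(user_id)"],
    "bind_param(user_id)": ["safe_sql"],
    "raw_sql": ["execute(raw_sql)"],
    "safe_sql": ["execute(safe_sql)"],
}

paths = unsanitized_paths(
    flows,
    sources={"request.param('id')"},
    sinks={"execute(raw_sql)", "execute(safe_sql)"},
    sanitizers={"bind_param(user_id)"},
)
```

Only the path through `raw_sql` is reported. The path through `bind_param(user_id)` is dropped because it passes through the node standing in for a parameterized query builder.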
This query-driven approach is what makes CPG code analysis powerful for security research. Analysts can write custom queries tailored to their application’s architecture, coding patterns, and risk profile, going far beyond the fixed rule sets of traditional scanners.
Code property graphs support several high-value use cases across security and development. Key use cases include:
These use cases include vulnerability discovery, variant analysis, mining codebases for insecure patterns, and structural queries that inform refactoring decisions.
Traditional static application security testing tools typically operate on individual representations: an AST for pattern matching, a CFG for path analysis, or a data flow graph for taint tracking. Each representation answers a subset of questions about the code, and findings from one representation cannot easily reference another.
A code property graph eliminates this separation. Because all three representations share a unified graph, a single query can combine syntactic patterns, control flow conditions, and data flow paths. This can produce more precise results with fewer false positives, since the analysis applies constraints from all three dimensions simultaneously and each constraint rules out candidates that the others would have reported.
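A rough illustration of how the three dimensions narrow a result set, again as a toy model under stated assumptions: the node ids, the `reachable` helper, and the rule that a `validate` call counts as a guard are all invented for this example and do not reflect any particular tool's semantics.

```python
def reachable(edges, start, blocked=frozenset()):
    """Nodes reachable from start along edges, never entering a blocked node."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen or node in blocked:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))
    return seen


# Node properties, as an AST-level view would expose them (hypothetical).
nodes = {
    "entry": {"kind": "METHOD"},
    "p": {"kind": "PARAM"},
    "log": {"kind": "CALL", "name": "log"},
    "exec1": {"kind": "CALL", "name": "execute"},
    "exec3": {"kind": "CALL", "name": "execute"},
    "check": {"kind": "CALL", "name": "validate"},
    "exec2": {"kind": "CALL", "name": "execute"},
}

# Control flow: entry -> log -> exec1 -> exec3 -> check -> exec2.
cfg = {"entry": ["log"], "log": ["exec1"], "exec1": ["exec3"],
       "exec3": ["check"], "check": ["exec2"]}

# Data flow: the parameter reaches log, exec1 and exec2, but not exec3.
data_flow = {"p": ["log", "exec1", "exec2"]}

# Syntactic constraint: calls named "execute".
syntactic = {n for n, props in nodes.items()
             if props["kind"] == "CALL" and props.get("name") == "execute"}

# Data flow constraint: nodes that receive the parameter's value.
tainted = reachable(data_flow, "p") - {"p"}

# Control flow constraint: nodes reachable without first passing the validate call.
unguarded = reachable(cfg, "entry", blocked={"check"})

findings = syntactic & tainted & unguarded
```

The syntactic pattern alone matches three calls. Adding the data flow constraint removes `exec3`, and adding the control flow constraint removes `exec2`, leaving a single finding.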
The tradeoff is cost. Building and storing a CPG for a large codebase requires more processing time and memory than generating a single AST or CFG. For smaller codebases or narrow scanning requirements, traditional representations may be sufficient. For deep security research, variant analysis, and complex vulnerability discovery across large codebases, the CPG’s unified model is significantly more powerful.
Security researchers and tools use CPGs to find complex vulnerabilities, perform variant analysis, mine codebases for insecure patterns, and support refactoring decisions through structural queries.
A CPG merges the AST, CFG, and data flow graph into a single unified structure. Individual representations capture only syntax, control flow, or data flow in isolation.
Joern is the most widely known open-source CPG tool. Commercial platforms like ShiftLeft (Qwiet AI) and Semgrep’s deep analysis also use CPG-based representations for vulnerability detection.
CPGs are stored in graph databases like OverflowDB, Neo4j, or TinkerGraph and queried using graph traversal languages such as Gremlin, Cypher, or tool-specific query DSLs.
CPGs do not replace traditional techniques; they complement them. Lightweight linters and pattern matchers remain valuable for fast, simple checks. CPGs excel at deep, cross-cutting analysis that requires combining syntax, control flow, and data flow.