Two scans, two questions
CodeDelta runs two independent analyses. They answer different questions and must not be read as the same thing — a file can score high on one and nothing on the other, and that is by design.
"Was this code written by AI?"
AI Audit
Examines the style and structure of the code itself — comment patterns, naming, uniformity, structural fingerprints — to estimate how likely it is that a file was produced by an AI model rather than written by hand.
"Does this code use or launch AI?"
Agent Scan
Looks for code that calls AI services or runs AI agents — SDK imports, agent frameworks, automated LLM loops, and the high-risk pattern of executing AI-generated output.
A file can be written by AI but contain no AI calls (an AI-generated sorting function). Another can be hand-written but launch an autonomous agent (a developer wiring up the OpenAI SDK). These are different risks, so CodeDelta measures them separately and never conflates the two.
AI Audit — was this code written by AI?
The AI Audit produces three numbers for every file. Understanding the relationship between them is the key to reading the report correctly.
GSS — a heuristic score from hand-built rules that look for tell-tale stylistic patterns of AI authorship.
MLS — a machine-learning score from a trained model that judges the file's statistical fingerprint.
AIC — the combined score that the risk rating is actually based on. This is the number that decides whether a file is flagged.
The heuristic score (GSS)
GSS is built from explicit, inspectable rules. Each rule ("signal") looks for a specific pattern that tends to appear in AI-generated code and contributes a fixed number of points when it fires. The points are summed and capped at 100. These are the exact signals and weights:
| Signal | Pts | What it detects |
|---|---|---|
| structural_uniformity | 40 | Repeated, near-identical function bodies — a strong sign of generated code. Humans parameterise; generators repeat. |
| docblock_coverage | 40 | A documentation block on every function. AI is exhaustively consistent; humans are not. |
| guard_clause_density | 35 | A null / empty / range check on essentially every parameter. |
| repetitive_block_pattern | 35 | The same multi-line construct repeated across methods (production-AI style). |
| formulaic_exceptions | 30 | Boilerplate error messages — "X cannot be null", "X must be positive". |
| audit_notify_coupling | 25 | Audit-log calls always paired with notification calls, in lockstep. |
| generic_name_saturation | 25 | High proportion of generic identifiers (result, data, temp, item, handler…). |
| annotation_saturation | 20 | @param/@return-style annotations on every method (Java/C#). |
| symmetric_validation | 20 | Paired null+empty+range checks repeated in the same order. |
| pure_code_no_comments | 20 | Zero comments of any kind — the terse end of the AI style spectrum. |
| high_add_velocity | 15 | A large block of code added in a single change (comparison scans only). |
| lloc_warning | 15 | Token-level repetition flagged by the core analysis engine. |
| this_init_pattern | 15 | Every field initialised via this.x = … in the constructor. |
| zero_human_comments | 10 | No inline comments, only machine-style doc blocks. |
GSS is fully transparent: every point in a file's score traces back to a named signal you can see in the report. Its limitation is the flip side of that strength — it can only detect patterns it has rules for, and disciplined human library code (heavy documentation, thorough validation) can resemble those patterns. That is precisely why GSS is never used alone to flag a file.
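The sum-and-cap rule can be sketched in a few lines. This is a minimal illustration, not CodeDelta's implementation: the weights are copied from the table above, while the function names `gss_score` and `gss_breakdown` are hypothetical, and detecting which signals fire is assumed to happen elsewhere.

```python
# Signal weights copied from the table above.
GSS_WEIGHTS = {
    "structural_uniformity": 40,
    "docblock_coverage": 40,
    "guard_clause_density": 35,
    "repetitive_block_pattern": 35,
    "formulaic_exceptions": 30,
    "audit_notify_coupling": 25,
    "generic_name_saturation": 25,
    "annotation_saturation": 20,
    "symmetric_validation": 20,
    "pure_code_no_comments": 20,
    "high_add_velocity": 15,
    "lloc_warning": 15,
    "this_init_pattern": 15,
    "zero_human_comments": 10,
}

GSS_CAP = 100  # the summed points are capped at 100


def gss_score(fired_signals):
    """Sum the points of each fired signal once, then cap the total."""
    unique = set(fired_signals)
    unknown = unique - set(GSS_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return min(GSS_CAP, sum(GSS_WEIGHTS[name] for name in unique))


def gss_breakdown(fired_signals):
    """Return (signal, points) pairs, largest first, so every point is traceable."""
    pairs = [(name, GSS_WEIGHTS[name]) for name in set(fired_signals)]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
```

For example, a file that fires `docblock_coverage` and `formulaic_exceptions` scores 40 + 30 = 70, while one that fires the three heaviest signals would sum to 115 and is reported as 100.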
The machine-learning score (MLS)
MLS comes from a model trained on a large corpus of paired examples — real human-written code and real AI-generated code — for each supported language. Rather than firing on named rules, it weighs dozens of measurable features of a file (comment density, token entropy, identifier statistics, structural regularity, and many more) and returns a probability, expressed 0–100, that the file is AI-written.
What the model learns from
Each language has its own model and its own feature extractor. The model is trained to separate genuine human code from genuine AI output, and is validated against held-out examples of both before release.
Resistance to gaming
A detector is only useful if it can't be trivially fooled. One early version of the C++ model leaned heavily on whether a file had a header comment — but that is exactly the kind of thing a team can mandate or strip at will, which would let AI code masquerade as human (or vice-versa). The current model is deliberately blind to that signal: it is trained without the header-comment feature, so it must judge a file on structural characteristics that can't be toggled by a coding-style rule. On validation this preserved detection accuracy on genuine AI code while removing the loophole.
The model is strongest on code that resembles real human or real AI authorship — what it was trained on. Code that is neither (for example, machine-generated stress-test fixtures, or heavily obfuscated output) sits outside its training distribution, and the model will be appropriately less certain. We report that uncertainty rather than forcing a verdict.
How they combine (AIC) and what the risk levels mean
The risk rating shown for each file is based on the AIC (the combined score), not on GSS or MLS alone. AIC blends the machine-learning judgement with the heuristic judgement, weighted toward the model.
The blend exists so the two detectors check each other: the model carries most of the weight, but a confident heuristic can still raise a score where the model is uncertain. The GSS = 0 exception matters — without it, a file the model is confident about would have its score pulled down purely because the heuristic happened to find none of its specific patterns, which would wrongly bury a real detection.
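To make the blend concrete, here is a sketch of a model-weighted combination. It rests on stated assumptions: this document does not give the actual weights or the exact conditions under which the GSS = 0 exception applies, so `model_weight` is a required, hypothetical parameter, the exception is an explicit flag set by the caller, and the exception is assumed to pass MLS through unchanged. The function name `combine_aic` is also hypothetical.

```python
def combine_aic(mls, gss, model_weight, gss_zero_exception=False):
    """Blend MLS and GSS into one combined score on the same 0-100 scale.

    model_weight is HYPOTHETICAL: the real weighting is not stated in this
    document. It must exceed 0.5 so the blend is weighted toward the model.
    gss_zero_exception is set by the caller because the conditions for the
    exception are not specified here.
    """
    if not 0.5 < model_weight <= 1.0:
        raise ValueError("model_weight must satisfy 0.5 < model_weight <= 1.0")
    if gss == 0 and gss_zero_exception:
        # Assumed behaviour: the model's score is not pulled down by an
        # empty heuristic result.
        return float(mls)
    return model_weight * mls + (1.0 - model_weight) * gss
```

With an illustrative weight of 0.6, a file at MLS 40 and GSS 90 combines to about 60: the confident heuristic lifts an uncertain model score. A file at MLS 73 and GSS 0 combines to about 43.8 under the plain blend, but stays at 73 when the exception applies.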
Risk levels
AIC is compared against a sensitivity threshold (default 50):
| Rating | Condition | Meaning |
|---|---|---|
| HIGH | AIC ≥ 70 | Strong combined evidence of AI authorship. |
| ELEVATED | 50 ≤ AIC < 70 | Meaningful evidence; warrants a look. |
| NORMAL | AIC < 50 | No strong combined signal. |
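The table above reduces to two comparisons. The sketch below assumes the adjustable sensitivity threshold moves only the ELEVATED boundary while the HIGH cutoff stays at 70; the document does not say whether the two move together, so `high_cutoff` is exposed as a separate parameter. The function name `audit_rating` is hypothetical.

```python
def audit_rating(aic, threshold=50.0, high_cutoff=70.0):
    """Map a combined AIC score to the AI Audit rating.

    threshold is the adjustable sensitivity threshold (default 50).
    high_cutoff is ASSUMED to be independent of the threshold.
    """
    if high_cutoff < threshold:
        raise ValueError("high_cutoff must not be below threshold")
    if aic >= high_cutoff:
        return "HIGH"
    if aic >= threshold:
        return "ELEVATED"
    return "NORMAL"
```

Under the defaults, an AIC of 43.8 reads NORMAL; lowering the threshold to 40 would move the same file to ELEVATED.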
"How can a file show a high MLS but still be rated NORMAL?" Because the rating follows AIC, not MLS. If a file scores MLS 73 but GSS 0, and the file's circumstances mean the blend (not the GSS=0 exception) applies, AIC can land below the threshold and the file reads NORMAL. The model "voted AI", the heuristic "voted nothing", and the combined figure sits under the line. The threshold is adjustable, so you can tune sensitivity to your own tolerance.
Agent Scan — does this code use or launch AI?
The Agent Scan is a separate analysis with its own score, AIS (AI-Initiation Score). It does not care whether code was written by AI — it looks for code that calls AI services or runs AI agents. Each category that fires adds points; the total determines the rating.
| Category | Pts | What it detects |
|---|---|---|
| Rogue-agent pattern | 70 | The critical case. An eval/exec call sitting next to an AI API call — i.e. code that executes AI-generated output. Detected across languages. |
| Dynamic execution | 50 | Direct eval/exec of code (Python). |
| Agent orchestration | 40 | Agent-framework constructs — LangChain, AutoGen, CrewAI, Semantic Kernel, LangGraph, and similar. |
| Hand-rolled agent loop | 40 | An AI API call inside a loop — an automation loop built without a named framework. |
| Subprocess w/ variable | 40 | Shell/process execution built from a variable — a command-injection surface (Python). |
| Prompt injection vector | 35 / 18 | Unsanitised external input flowing into a model prompt. |
| AI SDK import | 25 | An AI provider's SDK is imported — OpenAI, Anthropic, LangChain, Google, Mistral, Cohere, and others. Recognised across import styles and languages. |
| Dynamic import | 20 | Code loaded by name at runtime. |
Risk levels
| Rating | Condition | Meaning |
|---|---|---|
| CRITICAL | AI-written + agent-initiating | A file that both looks AI-written (high GSS) and launches AI agents (high AIS) — the combination warranting the closest review. |
| HIGH | AIS ≥ 70, or exec + SDK present | Strong agent-activity signal, including the rogue-execution pattern. |
| ELEVATED | 40 ≤ AIS < 70 | Meaningful agent activity present. |
| NORMAL | AIS < 40 | No significant agent activity. (A lone SDK import scores 25 and stays NORMAL — importing a library is not, by itself, agent behaviour.) |
The two scans answer different questions. AI-written arithmetic, business logic, or data structures contain no AI calls and no agent loops — so the Agent Scan correctly reports nothing, even as the AI Audit flags the same files. Conversely, hand-written code that imports an AI SDK and runs an agent loop will register on the Agent Scan regardless of who wrote it.
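The Agent Scan scoring and rating rules can be sketched as follows. Several details are assumptions rather than documented behaviour: the category keys are hypothetical identifiers for the table's labels; the prompt-injection category lists two values (35 / 18) without saying when each applies, so the caller passes the points explicitly; no cap is applied to AIS because none is stated; and CRITICAL is assumed to mean "meets the HIGH condition and the file also looks AI-written", with that second judgement supplied by the caller.

```python
# Points per category, from the Agent Scan table. Keys are hypothetical names.
AIS_POINTS = {
    "rogue_agent_pattern": 70,
    "dynamic_execution": 50,
    "agent_orchestration": 40,
    "hand_rolled_agent_loop": 40,
    "subprocess_with_variable": 40,
    "ai_sdk_import": 25,
    "dynamic_import": 20,
}


def ais_score(fired, prompt_injection_points=0):
    """Sum the points of every category that fired (each counted once)."""
    if prompt_injection_points not in (0, 18, 35):
        raise ValueError("prompt_injection_points must be 0, 18, or 35")
    return sum(AIS_POINTS[name] for name in set(fired)) + prompt_injection_points


def agent_rating(fired, prompt_injection_points=0, looks_ai_written=False):
    """Map fired categories to a rating, checking the strictest rule first."""
    score = ais_score(fired, prompt_injection_points)
    exec_plus_sdk = {"dynamic_execution", "ai_sdk_import"} <= set(fired)
    is_high = score >= 70 or exec_plus_sdk
    if is_high and looks_ai_written:  # ASSUMED reading of the CRITICAL row
        return "CRITICAL"
    if is_high:
        return "HIGH"
    if score >= 40:
        return "ELEVATED"
    return "NORMAL"
```

A lone `ai_sdk_import` scores 25 and stays NORMAL; adding `hand_rolled_agent_loop` brings the total to 65, which is ELEVATED; `rogue_agent_pattern` alone reaches 70 and is HIGH.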
What we deliberately don't claim
A detection tool is only trustworthy if it's honest about its edges. CodeDelta is an evidence tool, not an oracle.
- No certainty of authorship. The AI Audit estimates likelihood from observable patterns. A high score is strong evidence, not proof, that a file was AI-written.
- Out-of-distribution code is reported as uncertain. Code that resembles neither typical human nor typical AI output (synthetic fixtures, exotic generators) sits outside what the models were trained on; we surface that uncertainty rather than inventing a verdict.
- Deliberate obfuscation can evade pattern matching. The Agent Scan reads code as text. Code intentionally written to hide an AI call or an exec behind indirection can reduce what static analysis sees, a limitation shared by all static scanners.
- Language-specific scope. Some signals only make sense in certain languages (for example, plain eval/exec flagging applies to Python, where they are dangerous built-ins, rather than to languages where those words are ordinary method names). Scoping these correctly is what keeps the scan free of noise.
- The threshold is yours to set. Sensitivity is adjustable. The defaults are a sensible starting point, not a fixed judgement; tune them to your own tolerance for false positives versus missed detections.
CodeDelta tells you, with traceable evidence and stated uncertainty, how likely each file is to have been written by AI (AI Audit) and whether it uses or launches AI (Agent Scan) — two separate questions, two separate scores, no black box.