We ship API endpoints faster than our security team can review them. That’s not a flex — it’s the root cause of every “we found this in production” Slack message I’ve ever received. OWASP Top 10 plus the API Security Top 10 gives you 20 vulnerability classes. Multiply that by 45 endpoints and you’re looking at 900 test vectors, minimum. Nobody’s writing those by hand.
So we built a framework that spawns an LLM as a subprocess, hands it a target configuration, and lets it autonomously pentest our API. It crafts payloads, sends real HTTP requests, analyzes responses, iterates on leads, and writes a structured report. No hardcoded test cases. No Selenium scripts. Just a YAML file describing your endpoints and an AI agent with a terminal.
Why Not Just Use Burp Suite?
Burp Suite, ZAP, Postman security scans — they’re all good tools. We use them. But they share a fundamental limitation: they test what you configure them to test. They’ll fuzz parameters, replay requests, scan for known CVEs. What they won’t do is reason about your business logic.
Consider this: you have a PATCH /v1/pipelines/:id/stage endpoint that advances a candidate through your hiring pipeline. A traditional scanner will test for SQL injection in the id parameter. What it won’t test is whether a candidate’s auth token can call this recruiter-only endpoint. Or whether you can skip stages by sending a stage index that’s three steps ahead. Or whether the notes field in the request body is sanitized before being rendered in the recruiter dashboard.
An LLM that understands your endpoint spec, the auth model, and the business context will try all three. That’s the gap we’re filling.
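The three business-logic checks just described can be thought of as declarative test intents the agent derives on its own. A sketch of what that derivation produces (the endpoint, field names, and expected responses here are illustrative, not from the real system):

```javascript
// Hypothetical checks an agent might derive for the recruiter-only
// endpoint PATCH /v1/pipelines/:id/stage. All names are illustrative.
const businessLogicChecks = [
  {
    name: "vertical-privilege",
    idea: "Call the recruiter-only endpoint with a candidate's token",
    expect: "403 Forbidden",
  },
  {
    name: "stage-skipping",
    idea: "Send a stage index three steps ahead of the current one",
    expect: "Server-side state validation rejects the jump",
  },
  {
    name: "stored-xss",
    idea: "Put a <script> payload in the notes field",
    expect: "Payload escaped or stripped before dashboard rendering",
  },
];

// The agent turns each intent into one or more concrete HTTP requests.
for (const check of businessLogicChecks) {
  console.log(`${check.name}: ${check.idea}`);
}
```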
Architecture: Three Decoupled Layers
The system has three parts that know nothing about each other:
Layer 1 — Target Configuration. A YAML file that declaratively describes every endpoint in your API. Method, path, auth type, example parameters, and vulnerability tags:
```yaml
endpoints:
  - id: get-user-profile
    method: GET
    path: /v1/users/:userId
    auth: bearerToken
    tags: [idor, data-exposure]
    params:
      path:
        userId: "64a1b2c3d4e5f67890abcdef"
    notes: "Test with other users' IDs for IDOR"

  - id: create-account
    method: POST
    path: /v1/users
    auth: none
    tags: [mass-assignment, injection]
    params:
      body:
        email: "test@example.com"
        password: "TestPass123!"
        displayName: "Test User"
    notes: "Try injecting role and org fields"
```
No code. No test logic. Just a declaration of what exists and what might be vulnerable. Adding a new endpoint to the test suite is a five-line YAML block.
Layer 2 — Orchestrator. A lightweight Node.js runner that reads the YAML config, resolves environment variables (auth tokens are never hardcoded — they’re injected via ${AUTH_TOKEN} placeholders), optionally filters endpoints by tag or ID, and spawns the LLM CLI as a child process. It also validates that a report file was actually created when the subprocess exits.
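A minimal sketch of the orchestrator’s placeholder resolution, assuming the `${VAR}` syntax described above (the token value below is a stand-in; in practice it comes from CI secrets, and the resolved config is handed to the agent CLI via `child_process.spawn`):

```javascript
// Replace ${VAR} placeholders with environment variables, failing
// loudly on a missing variable so a bad CI secret never silently
// produces an unauthenticated test run.
function resolveEnv(value) {
  return value.replace(/\$\{(\w+)\}/g, (_, name) => {
    if (process.env[name] === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return process.env[name];
  });
}

process.env.AUTH_TOKEN = "example-token"; // normally injected by CI
const header = resolveEnv("Authorization: Bearer ${AUTH_TOKEN}");
console.log(header); // Authorization: Bearer example-token
```

Failing hard on a missing variable matters more than it looks: a silently empty `Authorization` header makes every access-control test pass for the wrong reason.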
Layer 3 — LLM Subprocess. The AI agent receives the assembled context — endpoint specs plus a methodology document — and gets full access to a terminal. It can execute curl, read files, and write the report. This is the critical design choice: it’s not an API call that returns text suggestions. It’s a process with a shell.
The 6-Step Testing Loop
For every endpoint, the agent follows a methodology we defined in a standalone document (not code — a markdown file it reads at startup):
1. Understand: parse the endpoint’s purpose, inputs, expected auth, and business context from the YAML notes.
2. Identify attack surface: determine which OWASP categories apply. A public registration endpoint gets mass-assignment and injection tests; a resource-fetching endpoint gets IDOR and data-exposure tests.
3. Execute: craft and fire `curl` payloads against the live API.
4. Analyze: inspect the response. A 200 where you expected 403 is a lead. A stack trace in a 500 is a finding. A response body containing fields the requester shouldn’t see is data exposure.
5. Iterate: if something looks suspicious, try variations. Change the ID format. Add unexpected fields. Swap auth tokens between roles.
6. Document: write the finding to the report immediately, with severity, payload, evidence, and remediation. Don’t wait until all endpoints are tested.
The “iterate” step is where the LLM genuinely shines. A traditional scanner gets a 200 and moves on. The agent gets a 200, realizes the response contains data belonging to a different organization, and immediately tries five more ID substitutions to confirm the pattern is systematic.
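A sketch of the kind of ID-variation set the agent generates when confirming a suspicious 200. The specific mutations are illustrative; the real agent decides them on the fly from context:

```javascript
// Generate ID variations to confirm whether an IDOR lead is a
// systematic pattern rather than a one-off. Mutations are examples.
function idVariations(id) {
  return [
    id,                                   // original, as a baseline
    id.slice(0, -1) + "0",                // mutate the last character
    id.toUpperCase(),                     // case-sensitivity check
    "000000000000000000000000",           // all-zero ObjectId
    encodeURIComponent(id + "/../" + id), // path-confusion attempt
  ];
}

const variations = idVariations("64a1b2c3d4e5f67890abcdef");
console.log(variations.length); // 5
```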
OWASP Coverage
The agent tests against both the OWASP Top 10 and the API Security Top 10. Here’s what that looks like in practice:
| Category | What Gets Tested | Example |
|---|---|---|
| Broken Access Control | Substitute IDs across orgs/users | GET /v1/resource/<other-org-id> with valid token |
| Injection | NoSQL operators, XSS, path traversal | {"field": {"$gt": ""}} in request body |
| Security Misconfiguration | CORS, verbose errors, debug routes | OPTIONS + inspect Access-Control-* headers |
| Mass Assignment | Inject privileged fields on create/update | {"role": "admin", "orgId": "<other>"} in registration |
| SSRF | Internal URLs in user-supplied fields | Metadata endpoint URLs in redirect parameters |
| Broken Authentication | Cross-role token usage, expired tokens | Candidate token on recruiter-only endpoint |
| Rate Limiting | Header spoofing, burst requests | X-Forwarded-For rotation against rate-limited routes |
| Excessive Data Exposure | Sensitive fields in responses | Check if password hashes, internal IDs leak in responses |
The tag system means you don’t have to run the full suite every time. Fixing an IDOR bug? Run --tag idor. Auditing public endpoints before launch? Run --tag public. Testing a single endpoint after a code change? Run --endpoint create-account.
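The filtering itself is small. A sketch, using the same `id` and `tags` fields as the YAML config above (`--endpoint` maps to an exact ID match, `--tag` to tag membership):

```javascript
// Filter the endpoint list by tag and/or endpoint ID before the
// agent is spawned. No filter means the full suite runs.
function filterEndpoints(endpoints, { tag, endpoint } = {}) {
  return endpoints.filter((e) => {
    if (endpoint && e.id !== endpoint) return false;
    if (tag && !e.tags.includes(tag)) return false;
    return true;
  });
}

const endpoints = [
  { id: "get-user-profile", tags: ["idor", "data-exposure"] },
  { id: "create-account", tags: ["mass-assignment", "injection"] },
];

console.log(filterEndpoints(endpoints, { tag: "idor" }).map((e) => e.id));
// [ 'get-user-profile' ]
```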
What the Reports Look Like
Every finding follows a structured template. This isn’t a wall of text — it’s a document a security engineer can action immediately:
## [CRITICAL] Mass Assignment — POST /v1/users
| Field | Value |
|-----------|------------------------------------------|
| ID | FINDING-002 |
| Severity | Critical |
| Endpoint | POST /v1/users |
| Category | OWASP A08 — Software & Data Integrity |
| Status | Confirmed |
### Description
The user registration endpoint accepts a `roles` field in the
request body. Providing `["admin"]` as the value results in an
account with elevated privileges.
### Payload
```shell
curl -X POST https://api.example.com/v1/users \
  -H "Content-Type: application/json" \
  -d '{"email":"attacker@test.com","password":"Pass123!",
       "displayName":"Test","roles":["admin"]}'
```
### Evidence
201 Created — response body includes `"roles": ["admin"]`
### Recommendation
Strip or ignore `roles`, `organizationId`, and `_id` fields
from user-supplied input during account creation. Use a
server-side allowlist for writable fields.
Reports are timestamped Markdown files, one per run. They’re gitignored by default because they contain auth tokens and response data you don’t want in version control. A summary table at the bottom aggregates findings by severity.
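The severity roll-up at the bottom of each report is a simple count over the structured findings. A sketch, assuming findings carry the `severity` field shown in the template above:

```javascript
// Aggregate structured findings into the summary counts appended
// to each report.
function summarize(findings) {
  const counts = {};
  for (const f of findings) {
    counts[f.severity] = (counts[f.severity] || 0) + 1;
  }
  return counts;
}

const findings = [
  { id: "FINDING-001", severity: "High" },
  { id: "FINDING-002", severity: "Critical" },
  { id: "FINDING-003", severity: "High" },
];

console.log(summarize(findings)); // { High: 2, Critical: 1 }
```

This only works because the report template makes `severity` a required field; free-form findings can’t be aggregated at all.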
Key Design Decisions
CLI subprocess, not API SDK. This is the most important decision. Calling an LLM API gives you text. Spawning it as a CLI process with tool access gives you an agent that can actually execute curl, observe the response, and decide what to do next. The difference is between reading about pentesting and doing it.
YAML config over hardcoded tests. Our API has 45 endpoints today. It’ll have 60 by next quarter. Nobody wants to write and maintain test scripts for each one. A YAML block is a five-line declaration. When an endpoint changes, you update the YAML — no code changes needed.
Methodology as a document, not code. The testing strategy lives in a markdown file the agent reads at startup. Want to add GraphQL-specific tests? Update the document. Want to emphasize rate-limit testing for a specific run? Edit the methodology. No redeployment.
Database isolation. Pentest runs create garbage data — accounts with XSS payloads as display names, resources with injected fields. We run the target API against a dedicated pentest database. Cleanup is a single dropDatabase() call. Development and production data are never touched.
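A sketch of how the target API can be pointed at the throwaway database. The environment variable and database names here are assumptions, not the real configuration:

```javascript
// Select a disposable database for pentest runs; everything the
// agent creates lands there and is dropped afterwards.
function databaseUrl() {
  const base = process.env.PENTEST_MONGO_URL || "mongodb://localhost:27017";
  const db = process.env.PENTEST_RUN === "1" ? "app_pentest" : "app_dev";
  return `${base}/${db}`;
}

process.env.PENTEST_RUN = "1"; // set by the orchestrator, not by hand
console.log(databaseUrl()); // mongodb://localhost:27017/app_pentest
```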
Tag-based filtering. The full suite takes 15-20 minutes. Tag filtering lets developers run targeted tests in under a minute. This is the difference between “we run security tests before release” and “we run security tests during development.”
What We Learned
When the agent gets a suspicious response — a 200 where it expected 403 — it doesn’t just log it. It tries variations: different IDs, different formats, different auth tokens. In one run, it found that our pipeline endpoint rejected cross-org access for GET requests but allowed it for PATCH — a bug that would have taken hours to find manually.
Early versions produced free-form findings. Some had payloads, some didn’t. Some had severity ratings, some said “this looks bad.” A strict report template with required fields fixed this completely. Every finding now has an ID, severity, exact payload, evidence, and remediation.
The agent fires requests faster than any human would, routinely triggering rate limiters. Adding configurable delays between requests and explicit methodology guidance to distinguish “rate limiter working correctly” from “rate limiter bypassable” cut false positives by ~60%.
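The pacing fix is mechanically simple. A sketch of the configurable inter-request delay, with stub requests standing in for real `curl` calls and an illustrative default of 250 ms:

```javascript
// Run requests strictly one at a time with a configurable pause
// between them, so the agent doesn't trip rate limiters by accident.
const delayMs = Number(process.env.REQUEST_DELAY_MS || 250);

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function paced(requests) {
  const results = [];
  for (const send of requests) {
    results.push(await send()); // fire one request...
    await sleep(delayMs);       // ...then wait before the next
  }
  return results;
}

// Usage with stubs that just return labels instead of HTTP calls:
paced([async () => "a", async () => "b"]).then((r) => console.log(r));
// [ 'a', 'b' ]
```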
The agent flags anything suspicious. That’s the right default. But it means a human needs to review Critical and High findings before they become tickets. We treat the report as a triage input, not a final verdict.
The agent is excellent at mechanical OWASP coverage — the same tests a human pentester would run first, just faster and more systematically. What it can’t do is chain findings together into complex attack paths, or understand that a combination of two Low-severity issues creates a Critical one. Professional pentesters still matter.
When This Approach Works Best
- Pre-release security sweeps — run the full suite against staging before every deploy
- Continuous regression testing — add to CI/CD to catch security regressions as endpoints change
- New endpoint onboarding — add a YAML block for the new endpoint, run the agent, review the report before merging
- Audit preparation — run the suite and fix low-hanging fruit before external auditors arrive. They’ll find the interesting stuff; you handle the obvious stuff.
What’s Next
The current implementation is HTTP/REST specific. The natural extensions: GraphQL introspection-based testing (the schema tells you everything), WebSocket message fuzzing, and gRPC service definition parsing. The architecture supports all of these — you’d add a new target config format and update the methodology document. The orchestrator and agent layer stay the same.
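For the GraphQL case, the seed data really is one request. A sketch of the standard introspection query the orchestrator would POST to the endpoint, with each returned type/field pair becoming an entry in the target config:

```javascript
// A trimmed introspection query: enough to enumerate every type
// and field name the GraphQL API exposes.
const introspectionQuery =
  "{ __schema { types { name fields { name } } } }";

// This is what would be POSTed to the (hypothetical) GraphQL endpoint.
const requestBody = JSON.stringify({ query: introspectionQuery });
console.log(requestBody.includes("__schema")); // true
```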
The bigger question: as LLMs get better at reasoning about code, can we feed them the actual route handler source and have them identify vulnerabilities statically before testing dynamically? That’s the convergence of SAST and DAST in a single agent. We’re not there yet, but the gap is closing fast.