Detect AI scrapers with TLS fingerprints, not user agents
Your robots.txt says "User-agent: GPTBot" and "Disallow: /". GPTBot honors it. The dozen RAG pipelines built on top of requests, httpx, and headless Chrome, all sending a Mozilla/5.0 string in their headers, do not. Recent traffic samples from large publishers show 30 to 60% of "Chrome" hits originating from non-browser TLS stacks. The User-Agent is theater; the TLS Client Hello is the truth.
JA4, the TLS fingerprint format FoxIO published in 2023, summarizes the parts of a Client Hello a client cannot easily change without swapping or forking its TLS library: TLS version, cipher suites, extensions, ALPN, and signature algorithms. Real Chrome and a Python httpx script have fingerprints that look nothing alike, no matter what either one writes in the User-Agent header. This post shows how to read JA4 at the edge, classify it via /v1/tls/fingerprint, and take action before the request reaches your origin.
What JA4 looks like
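A JA4 is three sections joined by underscores. The first is human-readable: transport (t for TCP, q for QUIC), TLS version, whether SNI is present (d) or absent (i), cipher count, extension count, and a two-character ALPN code. The second is a truncated hash of the sorted cipher suites. The third is a truncated hash of the sorted extensions plus the signature algorithms. A typical Chrome 124 fingerprint: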
t13d1516h2_8daaf6152771_e5627efa2ab1
A Python httpx 0.27 client:
t13d1517h2_8daaf6152771_b1ff8ab2d16f
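Same TLS 1.3, same ALPN h2, and in this pair the same cipher hash, but the third section differs: httpx runs on Python's ssl module (OpenSSL) and sends a different set of extensions and signature algorithms than Chrome's BoringSSL. The first section already hints at it, with 17 extensions against Chrome's 16. JA4 sorts extensions before hashing, so the difference comes from which extensions are sent, not their order. That third section is the discriminator that catches scrapers pretending to be Chrome.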
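To make the layout concrete, here is a minimal sketch that splits a JA4 string into its fields, following the published JA4 layout. The names parseJa4 and Ja4Parts are hypothetical helpers for illustration, not part of any library or of the botoi API.

```typescript
// Hypothetical helper: split a JA4 string into its readable fields.
interface Ja4Parts {
  transport: 'tcp' | 'quic' | 'other';
  tlsVersion: string;     // "13" means TLS 1.3
  sni: boolean;           // true when the Client Hello carried a domain SNI
  cipherCount: number;
  extensionCount: number;
  alpn: string;           // two-character ALPN code, e.g. "h2"
  cipherHash: string;     // section 2
  extensionHash: string;  // section 3
}

const JA4_PATTERN =
  /^([a-z])([0-9a-z]{2})([di])(\d{2})(\d{2})([0-9a-z]{2})_([0-9a-f]{12})_([0-9a-f]{12})$/;

function parseJa4(ja4: string): Ja4Parts | null {
  const m = JA4_PATTERN.exec(ja4);
  if (!m) return null; // not a well-formed JA4 string
  const [, proto = '', version = '', sni = '', ciphers = '', exts = '',
    alpn = '', cipherHash = '', extensionHash = ''] = m;
  return {
    transport: proto === 't' ? 'tcp' : proto === 'q' ? 'quic' : 'other',
    tlsVersion: version,
    sni: sni === 'd',
    cipherCount: Number.parseInt(ciphers, 10),
    extensionCount: Number.parseInt(exts, 10),
    alpn,
    cipherHash,
    extensionHash,
  };
}
```

Running it on the two fingerprints above gives 16 extensions for Chrome and 17 for httpx, with identical cipher hashes, which is a quick way to see where two clients diverge before you call any API.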
Classify a fingerprint
Send the JA4 to /v1/tls/fingerprint and get back a client identification, a verdict, and a confidence score. The endpoint maintains a fingerprint corpus indexed by client library and version range.
curl -X POST https://api.botoi.com/v1/tls/fingerprint \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BOTOI_API_KEY" \
  -d '{"ja4":"t13d1517h2_8daaf6152771_b1ff8ab2d16f"}'
Sample response for the Python httpx JA4 above:
{
  "ja4": "t13d1517h2_8daaf6152771_b1ff8ab2d16f",
  "client": {
    "library": "python-httpx",
    "version_range": "0.27.x",
    "category": "scraper"
  },
  "verdict": "challenge",
  "confidence": 0.92,
  "browser_match": false
}
The verdict field is the only piece you need to act on at the edge. browser_match is the boolean shortcut for "is this any known real browser version" and is a useful denominator for monitoring.
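If you want more than the verdict string, a small reader can validate the response before you act on it. This is a sketch under assumptions: the field names come from the sample response above, the "unknown" verdict is taken from the verdict table later in this post, and readClassification is a hypothetical helper that falls back to "allow" on anything it does not recognize.

```typescript
type Verdict = 'allow' | 'challenge' | 'block' | 'unknown';

interface Classification {
  verdict: Verdict;
  confidence: number | null; // null when missing or out of range
  browserMatch: boolean;
}

const KNOWN_VERDICTS: Verdict[] = ['allow', 'challenge', 'block', 'unknown'];

// Hypothetical helper: turn a parsed JSON body into a checked Classification.
function readClassification(body: unknown): Classification {
  // Fail open: a malformed response should never block a real user.
  const failOpen: Classification = { verdict: 'allow', confidence: null, browserMatch: false };
  if (typeof body !== 'object' || body === null) return failOpen;

  const fields = body as Record<string, unknown>;
  const verdict = KNOWN_VERDICTS.find((v) => v === fields['verdict']);
  if (!verdict) return failOpen;

  const confidence = fields['confidence'];
  const validConfidence =
    typeof confidence === 'number' && confidence >= 0 && confidence <= 1;

  return {
    verdict,
    confidence: validConfidence ? (confidence as number) : null,
    browserMatch: fields['browser_match'] === true,
  };
}
```

Failing open is a design choice: it trades a few missed scrapers during an API hiccup for never turning an upstream error into a site outage.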
Cloudflare Worker
Cloudflare exposes the JA4 to Workers on request.cf. The exact field and plan requirements depend on your zone: Cloudflare's documentation lists request.cf.botManagement.ja4 as part of Bot Management, so confirm which field your zone populates before relying on request.cf.ja4. Cache the verdict for an hour because fingerprints are stable until a client library updates.
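// cloudflare-worker.ts
// Block scraper fingerprints at the edge before the request reaches origin.

interface Env {
  BOTOI_API_KEY: string;
}

// Depending on the zone, the JA4 may sit on cf.ja4 or under cf.botManagement.
type CfWithJa4 = { ja4?: string; botManagement?: { ja4?: string } };

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const cf = req.cf as CfWithJa4 | undefined;
    const ja4 = cf?.ja4 ?? cf?.botManagement?.ja4;
    // No fingerprint available: pass the request through untouched.
    if (!ja4) return fetch(req);

    const verdict = await classify(ja4, env.BOTOI_API_KEY);
    if (verdict === 'block') {
      return new Response('Bot traffic is paywalled. See /api for licensing.', {
        status: 402,
        headers: { 'content-type': 'text/plain' },
      });
    }
    if (verdict === 'challenge') {
      return new Response(challengePage(), {
        status: 200,
        headers: { 'content-type': 'text/html' },
      });
    }
    return fetch(req);
  },
};

// Placeholder page; swap in your own JS challenge.
function challengePage(): string {
  return '<!doctype html><title>Checking your browser</title><p>One moment.</p>';
}

// Per-isolate cache: JA4 -> verdict, kept for one hour.
const cache = new Map<string, { verdict: string; expires: number }>();

async function classify(ja4: string, key: string): Promise<string> {
  const hit = cache.get(ja4);
  if (hit && hit.expires > Date.now()) return hit.verdict;

  try {
    const res = await fetch('https://api.botoi.com/v1/tls/fingerprint', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${key}`,
      },
      body: JSON.stringify({ ja4 }),
      signal: AbortSignal.timeout(150),
    });
    // Fail open: an API error must not take the site down.
    if (!res.ok) return 'allow';
    const data = (await res.json()) as { verdict: string };
    cache.set(ja4, { verdict: data.verdict, expires: Date.now() + 3_600_000 });
    return data.verdict;
  } catch {
    // The 150 ms timeout or a network failure throws; fail open here too.
    return 'allow';
  }
}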
The cache keeps the API call rate proportional to unique fingerprints, not request volume. A typical site sees fewer than 5,000 distinct JA4s per day, so the upstream call is rare in steady state. Each Workers isolate holds its own Map and loses it when the isolate is recycled, so the cache stays bounded in practice but is not shared across isolates; for high-cardinality origins, swap to caches.default with a 1-hour TTL.
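If you would rather not depend on isolate recycling to bound memory, one option is to cap the in-memory cache explicitly. The sketch below is an assumption-laden alternative to the plain Map, not a replacement for caches.default: TtlCache is a hypothetical class, eviction is oldest-inserted-first rather than true LRU, and the clock is injectable so the expiry logic can be tested.

```typescript
// Hypothetical bounded cache: entries expire after ttlMs, and the oldest
// inserted entry is dropped once maxEntries is reached.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expires: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expires <= this.now()) {
      // Expired: drop it so the slot is freed.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Deleting first moves a refreshed key to the back of the insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expires: this.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}
```

In the Worker, something like new TtlCache<string>(10_000, 3_600_000) could stand in for the Map, with get and set used in classify; the 10,000 cap is an illustrative number, not a recommendation from the API.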
Same thing in Express
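If your CDN forwards the JA4 in a header (Fastly's fastly-tls-ja4, Cloudflare's cf-ja4, or your own from a Worker upstream), the origin handler is a thin middleware. The example reads x-tls-ja4, so have the proxy normalize to that name and overwrite any client-supplied copy so a scraper cannot send its own.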
// express-middleware.ts
// Same idea behind a Node origin. Forward x-tls-ja4 from the proxy.
import type { Request, Response, NextFunction } from 'express';

const cache = new Map<string, { verdict: string; expires: number }>();

export async function tlsGate(req: Request, res: Response, next: NextFunction) {
  const ja4 = req.header('x-tls-ja4');
  if (!ja4) return next();

  const cached = cache.get(ja4);
  let verdict = cached && cached.expires > Date.now() ? cached.verdict : null;

  if (!verdict) {
    try {
      const r = await fetch('https://api.botoi.com/v1/tls/fingerprint', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${process.env.BOTOI_API_KEY}`,
        },
        body: JSON.stringify({ ja4 }),
        signal: AbortSignal.timeout(150),
      });
      // Fail open on API errors, as the Worker does.
      if (!r.ok) return next();
      const data = (await r.json()) as { verdict: string };
      verdict = data.verdict;
      cache.set(ja4, { verdict, expires: Date.now() + 3_600_000 });
    } catch {
      // Timeout or network failure: let the request through instead of
      // leaving a rejected promise that Express will not handle.
      return next();
    }
  }

  if (verdict === 'block') return res.status(402).send('paywall');
  if (verdict === 'challenge') return res.status(429).send('rate-limited');
  next();
}

Decide actions per verdict
The endpoint's three verdicts plus an unknown bucket map to four actions. Don't block the unknown bucket; you'll burn real users on novel browser releases.
Verdict    Confidence    Action                          Why
---------  ------------  ------------------------------  ----------------------------------
allow      any           pass through                    known browser fingerprint
challenge  0.7-0.95      JS challenge or 429 with retry  library client; could be legit dev
block      0.95+         402 with licensing pointer      known scraper, unwanted at scale
unknown    n/a           rate-limit only                 novel fingerprint, do not block
The 402 status code is intentional. It points scraper operators at /api for licensing rather than telling them they're blocked; a flat block is what gets you on a "publisher being hostile to AI" list. The friendly version lands you the occasional API customer instead of a Twitter pile-on.
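The verdict table can be sketched as a single function. This is one possible reading, not the API's own logic: decideAction and EdgeAction are hypothetical names, and the handling of verdicts that arrive below the table's confidence floors (downgrading one step) is an assumption the table does not spell out.

```typescript
type EdgeAction = 'pass' | 'rate-limit' | 'challenge' | 'paywall';

// Hypothetical mapping from (verdict, confidence) to an edge action.
function decideAction(verdict: string, confidence: number | null): EdgeAction {
  const c = confidence ?? 0;
  switch (verdict) {
    case 'allow':
      return 'pass';
    case 'block':
      // Assumption: a block below the 0.95 floor is downgraded to a challenge.
      return c >= 0.95 ? 'paywall' : 'challenge';
    case 'challenge':
      // Assumption: a challenge below the 0.7 floor falls back to rate limiting.
      return c >= 0.7 ? 'challenge' : 'rate-limit';
    default:
      // "unknown" and anything unrecognized: rate-limit only, never block.
      return 'rate-limit';
  }
}
```

Keeping the mapping in one place means the Worker and the Express middleware can share it, and the thresholds can be tuned after you have a week of logs.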
Measure the impact
Before flipping any block, log JA4 plus User-Agent for a week. The chart that matters is "share of requests where User-Agent claims Chrome but JA4 says scraper". Most teams see 15 to 40%. That number is the traffic a User-Agent-only block list would wave through. After enabling JA4-based gating, the share of those requests still reaching your origin should drop to single digits within hours.
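The metric can be computed from a week of logs with a few lines. This sketch makes two assumptions: each log row already carries the client category returned by the classification (as in the sample response), and the denominator is requests whose User-Agent claims Chrome rather than all requests. LogRow and spoofedChromeShare are hypothetical names.

```typescript
// Hypothetical log row: the User-Agent header plus the category the
// classification returned for that request's JA4.
interface LogRow {
  userAgent: string;
  category: string; // e.g. "scraper", as in client.category
}

// Share of Chrome-claiming requests whose JA4 was classified as a scraper.
function spoofedChromeShare(rows: LogRow[]): number {
  let claimsChrome = 0;
  let spoofed = 0;
  for (const row of rows) {
    // Other Chromium browsers also send a Chrome/ token; they count as claims here.
    if (!/\bChrome\/\d+/.test(row.userAgent)) continue;
    claimsChrome += 1;
    if (row.category === 'scraper') spoofed += 1;
  }
  return claimsChrome === 0 ? 0 : spoofed / claimsChrome;
}
```

Run it once on pre-gating logs and again on origin logs after gating to compare the two shares.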
Free tier on botoi covers 1,000 fingerprint classifications per day (5 req/min burst). Paired with the cache, that's enough headroom for any small to mid-size site. Grab a key at botoi.com/api/signup.
Endpoint reference: TLS Fingerprint API. Related: API observability when AI agents are your heaviest callers.
Frequently asked questions
- Why do user agents not work anymore?
- AI training crawlers and their many imitators (real bots, gray-market scrapers, customer-built RAG pipelines) increasingly send "Mozilla/5.0 ... Chrome/120" because publishers added robots.txt and User-Agent blocks. The TLS handshake is set by the client library (Go net/http, Python httpx, Node undici, headless Chrome) and is much harder to fake without rewriting the client.
- What is JA4?
- JA4 is a TLS Client Hello fingerprint format published by FoxIO in 2023. It condenses the offered TLS version, cipher suites, extensions, ALPN, and signature algorithms into a string like t13d1516h2_8daaf6152771_e5627efa2ab1. Two clients sharing a JA4 are almost certainly using the same TLS stack and configuration, regardless of what their User-Agent header claims.
- Will this block real users?
- Not if you leave unknown fingerprints unblocked. Real Chrome, Safari, and Firefox have well-known JA4s that change only on browser version bumps. The botoi endpoint flags fingerprints associated with scraping libraries (curl, wget, requests, httpx, Go default, Node undici, headless Chrome with anti-detection patches). You allow the browser fingerprints and challenge or rate-limit the rest.
- Can scrapers spoof JA4?
- Yes, with effort. Tools like curl-impersonate and the Python tls_client library can mimic Chrome JA4. Spoofers are still a small minority of scraper traffic in 2026, and once a spoofer ID becomes public you classify it like any other fingerprint. JA4 raises the cost from one HTTP header to a forked TLS library; that cost is enough to deter the long tail.
- How do I get the fingerprint at the edge?
- Cloudflare exposes it to Workers on request.cf (documented under cf.botManagement.ja4 for zones with Bot Management). Fastly exposes it through its TLS client variables. AWS CloudFront can forward it to the origin as the CloudFront-Viewer-JA4-Fingerprint request header. If your edge exposes none of these, route through a Cloudflare Worker or use the botoi /v1/tls/fingerprint endpoint with the raw Client Hello.