By day I work on DevRel at Tigris Data, and by night I’m a virtual anime person that makes software deployed by the United Nations. A side project of mine took off recently, a Web AI Firewall Utility named Anubis. Anubis blocks abusive scrapers from taking out web services by asserting that clients that claim to act like browsers do, in fact, act like browsers. As usage has grown (yes, the UN actually uses Anubis), we’ve gotten creative about detecting bots, including setting honeypots that give us more detailed data on their requests. These honeypot logs are sent to Tigris, and we analyze them using DuckDB to find patterns and improve bot detection.
In this article I'm going to be calling those abusive scrapers "AI scrapers", but to be clear, I'm not really sure whether they're gathering training data for generative AI or not. However, the worst offender so far has been Amazon's Alexa team, so it's pretty easy to take those two points and draw a line between them.
At a high level, Anubis has a big old set of rules in your bot policy file. If clients match a rule, they get allowed through, blocked, or selected for secondary screening. By default, Anubis is meant to work instantly, stopping all the bleeding and letting administrators sleep without downtime alerts waking them up. This means it's overly paranoid and aggressively challenges everything, similar to Cloudflare's "I'm under attack" mode.
My intent was that admins would start out with Anubis being quite paranoid and then slowly dial back the paranoia as they find better patterns and carve out rules for known-good traffic. In practice, users tend to run Anubis in its default configuration, but this default configuration interferes with RSS feed readers and other "good bots". Some users also have opinions about their privacy that lead them to disable things like cookie storage, JavaScript execution, and other fundamental parts of how the internet works. All of these interfere with how Anubis works.
I'd like to have Anubis throw challenges less often than it does right now, but there are a few key problems that interfere with this:
- Anubis is open source software, and that's a critical part of the setup. It's open source so that adoption is easier and so that users can trust that I am not doing malicious things in the name of security. But it's a liability as much as it is an asset, because attackers can (and do) study the source code in order to bypass it better.
- In order to show the challenge page less often, you need data. A lot of data. More specifically, you need a general "shape" of what "known good" and "known bad" clients look like. For privacy reasons, Anubis does not collect this data. This is intentional. You can look in the code to confirm this for yourself, should you desire.
- Data that can be used to improve bot detection is sensitive: it can be used to identify individual users. Though tools like Tigris are equipped to handle sensitive data, me personally, I don't want that responsibility right now.
Finally, the biggest problem is that I don't really know how to analyze data. I'm getting better at it, but generally, if the data is too big to fit into a single SQLite database or Excel sheet, it becomes a problem. I don't have an analytics team (yet), so an easy-to-use toolchain for small data seems appropriate. Enter DuckDB: a lightweight database engine inspired by MonetDB that's meant for exploratory queries with minimal overhead.
From here the rest is just wiring things up, getting access to the data I need, and figuring out how to analyze it.
Everything's a big data problem
As the sacred texts have foretold:
Given enough time, every problem becomes a big data problem.
My sites are popular, sure, but I need more data than I can get myself. After some experimentation, I set up some honeypot servers. I had a TLS-terminating reverse proxy in my giant bag of experimental code, so I slapped some logic on the side of it to dump information about every request to Tigris:
- Basic request metadata (method, path, query string)
- HTTP header metadata (names and contents of HTTP headers, just in case there's a pattern there)
- Other connection-level metadata (TLS session details, client IP addresses, and the like)
I set up two honeypots: one that gets talked about, and one that does not. Both had TLS certificates via Let's Encrypt. It's commonly thought that AI scrapers watch the TLS certificate transparency logs in order to find new targets to scrape. To attract the AIs, I ensured that the second honeypot had a cgit server with the source code of the Linux kernel. For some reason the AI scrapers seem to love requesting the "git blame" route. I won't reveal all my tricks, but rest assured, the honeypot was loaded with AI Brawndo: it's got what AIs crave.
I set the trap and let it sit for a few weeks. I posted publicly about the known server, and I kept the other a secret outside of its listing in the TLS certificate transparency logs. I also asked ChatGPT, Gemini, Qwen, and DeepSeek about the git server, just to see if I managed to get lucky and get them thrown into the training dataset.
Spoiler alert: I got lucky.
Peering at the data
In Tigris, the honeypot log data is a giant pile of .jsonl files. For lack of a better idea, I named them after UUIDv7 values so that alphabetic sorting would result in each file being in temporal order. I set up tigrisfs on my Fedora workstation and then started running grep to look for patterns, like what the OpenAI crawler does. Turns out that crawler is particularly noisy:
{
"request_date": {
"seconds": 1748104928,
"nanos": 106832015
},
"response_time": {
"nanos": 12817879
},
"host": "[redacted]",
"method": "GET",
"path": "/glibc.git/plain/locale/setlocale.c",
"query": {
"id": "b647f210e61ab339fbb75dd9873daf7cb8f12665"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, br, deflate",
"From": "gptbot(at)openai.com",
"User-Agent": "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)",
"X-Openai-Host-Hash": "589806218",
"X-Real-Ip": "[redacted but in an Azure ASN]"
},
"remote_ip": "20.171.207.44",
"ja4": "t13d1011h2_61a7ad8aa9b6_3fcd1a44f3e3",
"request_id": "0197032c-5b6a-7c7f-944d-8b2cfdb87258"
}
Other searches for openai got me here:
{
"request_date": {
"seconds": 1747130654,
"nanos": 998119913
},
"response_time": {
"nanos": 3273033
},
"host": "[redacted]",
"method": "GET",
"path": "/robots.txt",
"headers": {
"Accept": "text/plain",
"Accept-Encoding": "gzip, deflate, br",
"Cache-Control": "no-cache",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot",
"X-Envoy-Expected-Rq-Timeout-Ms": "13916",
"X-Openai-Internal-Caller": "webcrawler-robots-txt",
"X-Openai-Originator": "webcrawler-robots-txt",
"X-Openai-Originator-Env": "prod",
"X-Openai-Product-Sku": "unknown",
"X-Openai-Traffic-Source": "user",
"X-Real-Ip": "[redacted but in an Azure ASN]",
"X-Request-Id": "577763b8-200d-49c4-985f-bd7cdaf1964b"
},
"request_id": "0196c91a-2116-71ca-8a8a-429b4b14de76",
"response_code": 404
}
This is plenty of metadata to be able to write a good set of rules for OpenAI in particular:
- name: openai-crawler
  action: DENY
  expression:
    all:
      - userAgent.contains("https://openai.com/gptbot")
      - '"X-Openai-Host-Hash" in headers'
      - '"From" in headers'
      - headers["From"] == "gptbot(at)openai.com"
- name: openai-robots-txt-fetcher
  action: DENY
  expression:
    all:
      - userAgent.contains("OAI-SearchBot")
      - '"X-Envoy-Expected-Rq-Timeout-Ms" in headers'
      - '"X-Openai-Internal-Caller" in headers'
      - '"X-Openai-Originator" in headers'
      - '"X-Openai-Traffic-Source" in headers'
Or even to write a ruleset that matches clients that only partially look like OpenAI:
- name: challenge-sus-openai-crawler
  action: CHALLENGE
  expression:
    all:
      - userAgent.contains("https://openai.com/gptbot")
      - '!("From" in headers)'
      - '!("X-Openai-Host-Hash" in headers)'
- name: challenge-sus-openai-robots-txt-fetcher
  action: CHALLENGE
  expression:
    all:
      - userAgent.contains("OAI-SearchBot")
      - '!("X-Envoy-Expected-Rq-Timeout-Ms" in headers)'
      - '!("X-Openai-Internal-Caller" in headers)'
      - '!("X-Openai-Originator" in headers)'
      - '!("X-Openai-Traffic-Source" in headers)'
This should be all I need to make more educated guesses about how crawlers work so I can make it easier for website operators to interfere with the quality of service that the scraper bots get. In case you're wondering, here's how many requests OpenAI has made to my server:
| openai_request_count |
| --- |
| 78132 |
All of these requests account for four hours of compute time on my server. During OpenAI's most aggressive scrape, the server crashed and rebooted because it ran out of RAM.
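If you want to reproduce that number, a count like it can come from a query along these lines. This is a sketch: it assumes the requests table that gets built later in this post, and it matches on the openai.com URL that both crawlers embed in their User-Agent strings.

SELECT COUNT(*) AS openai_request_count
FROM requests
-- Both GPTBot and OAI-SearchBot advertise an openai.com URL in their UA.
WHERE headers['User-Agent'] LIKE '%openai.com%';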
But as the scrapers evolve, so must my methods. I needed to dig deeper into the data to keep improving detection.
Importing the logs to MotherDuck with DuckDB
Now that I have data, how do I find patterns so that we can better detect the AI scrapers as they evolve? I'm more of an SRE than a data analyst, but I have a lot of experience with SQL. Normally I'd use SQLite for this because its documentation has the most lovely diagrams, but SQLite has very poor support for nested data such as HTTP headers. DuckDB is an embedded database engine like SQLite, but with a heavy focus on analytics and dealing with unstructured data. It has extensions that let you define strictly typed structures, access files over HTTP (or even object storage), and more. This lets me take the JSON logs that the honeypots produce and turn them into something I can query.
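As a taste of what that looks like in practice, here's a minimal sketch (the URL is hypothetical): with the httpfs extension loaded, DuckDB can read a remote newline-delimited JSON file and infer a typed schema from it on the fly.

INSTALL httpfs;
LOAD httpfs;

-- Hypothetical URL: read_json_auto infers column names and types
-- directly from the newline-delimited JSON.
SELECT method, path
FROM read_json_auto('https://example.com/honeypot-sample.jsonl');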
MotherDuck offers a cloud SaaS built on DuckDB that gives you a notebook, storage, sharing, and hybrid execution across cloud and local data. The notebooks are pretty convenient, especially if I want to have fresh data as it gets written to Tigris. Having fresh data becomes important when I want to analyze traffic from a machine currently experiencing a heavy volume of scraping traffic.
Connecting your Tigris bucket o’ logs to MotherDuck is pretty simple: in your notebook, you use your Tigris URL:
from 's3://your-tigris-bucket/path/to/data'
And set the endpoint to the Tigris high performance endpoint in your secret config:
CREATE OR REPLACE SECRET tigris
( TYPE s3
, PROVIDER config
, REGION 'auto'
, ENDPOINT 't3.storage.dev'
, URL_STYLE 'vhost'
);
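Once the secret exists, queries can read straight from the bucket. A quick sanity check might look like this (the bucket name and path are placeholders, and this assumes the secret above covers the t3.storage.dev endpoint):

-- Count how many log lines DuckDB can see in the bucket.
SELECT COUNT(*) AS log_lines
FROM read_json_auto('s3://your-tigris-bucket/path/to/data/*.jsonl');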
It’s also pretty easy to run DuckDB locally if your data is small enough to fit on your laptop. I’m pretty pro having the same toolchain for local development as I do in the cloud, so that’s a little nicety.
Now that we have the data loaded into a notebook, we can take a look at those honeypot logs. The JSON logs contain nested key-value pairs in two basic shapes:
- Known shapes, such as the timestamps and durations
- Unknown shapes, such as request headers and query strings
DuckDB lets me handle both of them. Here's the SQL table I made after reading the DuckDB schema docs and converting relayd’s HTTP request schema by hand:
CREATE TABLE requests
( request_date STRUCT("seconds" BIGINT, "nanos" BIGINT)
, response_time STRUCT(nanos BIGINT)
, host VARCHAR
, "method" VARCHAR
, path VARCHAR
, headers MAP(STRING, STRING)
, remote_ip VARCHAR
, ja4 VARCHAR
, request_id UUID UNIQUE
, request_protocol VARCHAR
, alpn VARCHAR
);
Then I imported all of the data through DuckDB's object storage functions:
INSERT INTO
requests
( request_date , response_time
, host , "method"
, path , headers
, remote_ip , ja4
, request_id , alpn
, request_protocol
)
SELECT
*
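-- Derive the ALPN from the JA4 fingerprint's first segment: the leading
-- character is the transport (t = TCP, q = QUIC) and the trailing hN token
-- is the negotiated HTTP version.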
, CASE
WHEN ja4 LIKE 't_______h1\_%' ESCAPE '\' THEN 'HTTP/1.1'
WHEN ja4 LIKE 't_______h2\_%' ESCAPE '\' THEN 'HTTP/2.0'
WHEN ja4 LIKE 'q_______h3\_%' ESCAPE '\' THEN 'HTTP/3.0'
ELSE NULL
END AS alpn
, CASE
WHEN alpn IS NOT NULL THEN alpn
ELSE 'HTTP/1.1'
END AS request_protocol
FROM
READ_JSON
( 's3://relayd-logs/halone.within.lgbt/*.jsonl'
, columns =
{ request_date: 'struct("seconds" bigint, "nanos" bigint)'
, response_time: 'struct(nanos bigint)'
, host: 'varchar'
, "method": 'varchar'
, path: 'varchar'
, headers: 'map(string, string)'
, remote_ip: 'varchar'
, ja4: 'varchar'
, request_id: 'uuid'
}
, format = 'nd'
);
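Before hunting for scraper patterns, it's worth sanity-checking the derived columns. A quick aggregate like this (my own check, not part of the import) shows how the traffic splits across protocol versions:

-- How does the honeypot traffic break down by negotiated protocol?
SELECT request_protocol, COUNT(*) AS hits
FROM requests
GROUP BY request_protocol
ORDER BY hits DESC;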
Off to the races
Now that we have the data, we can start to look for patterns. Let’s start by looking at patterns involved with my own browser's User-Agent string:
SELECT COUNT(DISTINCT remote_ip)
FROM requests
WHERE headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:138.0) Gecko/20100101 Firefox/138.0';
| count(DISTINCT remote_ip) |
| --- |
| 185 |
Cool! 185 IP addresses. Let's see if anything else pops up. Here are the Google Chrome hits:
SELECT headers['User-Agent'], COUNT(*) AS hits
FROM requests
WHERE headers['User-Agent'] LIKE '%Chrome/%'
GROUP BY headers['User-Agent']
HAVING COUNT(*) >= 76
ORDER BY hits DESC;
| headers['User-Agent'] | hits |
| --- | --- |
| Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Mobile Safari/537.36 | 325 |
| Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 | 179 |
| Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 | 170 |
| Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Mobile Safari/537.36 | 154 |
| Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 | 140 |
| Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 | 138 |
| Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36 | 76 |
So, let's look at a random Google Chrome for Android request:
{
"request_date": {
"seconds": 1746331998,
"nanos": 774116562
},
"response_time": {
"nanos": 1002352
},
"host": "halone.within.lgbt",
"method": "GET",
"path": "/",
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Accept-Language": "en-US,en;q=0.9",
"Priority": "u=0, i",
"Sec-Ch-Ua": "\"Chromium\";v=\"136\", \"Google Chrome\";v=\"136\", \"Not.A/Brand\";v=\"99\"",
"Sec-Ch-Ua-Mobile": "?1",
"Sec-Ch-Ua-Platform": "\"Android\"",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "cross-site",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Mobile Safari/537.36",
"X-Real-Ip": "[mobile phone IP]"
},
"remote_ip": "[mobile phone IP]",
"ja4": "t13d1516h2_8daaf6152771_d8a2da3f94cd",
"request_id": "0196997f-9a36-71c2-9662-6e7bf807b43d"
}
Something interesting about Chrome is that it sets a bunch of headers beyond the User-Agent string (Sec-Ch-Ua, the Sec-Fetch-* family, and so on), and you can use those headers to pretty reliably identify Chrome.
Either way, these let you build a library of “known good” patterns so you can build fingerprinting methods on HTTP requests. There’s prior art here in the form of JA4H, but in testing against this dataset I didn’t have a sufficiently high match rate against malicious clients.
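JA4H itself is more involved, but as a crude sketch of the same idea in DuckDB (my own approximation, not the JA4H algorithm), you can hash the sorted set of header names each request sends and see which "shapes" dominate the dataset:

-- Crude request-shape fingerprint: the set of header names a client sends,
-- sorted and hashed so identical shapes group together.
SELECT
    md5(array_to_string(list_sort(map_keys(headers)), ',')) AS header_shape,
    COUNT(*) AS hits
FROM requests
GROUP BY header_shape
ORDER BY hits DESC
LIMIT 10;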
Labeling
One of the great things about this dataset is that it's all HTTP requests, which have a well-known structure with many components. Patterns of interest could be based on the IP range the requests came from, HTTP headers, or other bits of metadata.
Every request has a unique ID, so it's pretty easy to record which rules each request matches by creating a little table:
CREATE TABLE rule_matches
( request_id TEXT
, rule_name TEXT
);
CREATE UNIQUE INDEX rule_matches_request_id_rule_name
ON rule_matches(request_id, rule_name);
And then using INSERT INTO ... SELECT to fill it with data:
INSERT INTO rule_matches (request_id, rule_name)
SELECT request_id, 'chrome-without-sec-ch-ua'
FROM requests
WHERE headers['User-Agent'] LIKE '%Chrome/%'
AND NOT map_contains(headers, 'Sec-Ch-Ua');
SELECT COUNT(*) FROM rule_matches WHERE rule_name='chrome-without-sec-ch-ua';
| count_star() |
| --- |
| 267 |
INSERT INTO rule_matches (request_id, rule_name)
SELECT request_id, 'generic-browser'
FROM requests
WHERE headers['User-Agent'] LIKE '%Mozilla/%'
OR headers['User-Agent'] LIKE '%Opera/%';
SELECT COUNT(*) FROM rule_matches WHERE rule_name='generic-browser';
| count_star() |
| --- |
| 86143 |
I’ve converted over a lot of the other core/stdlib request matchers and have found that they seem to work well against this dataset.
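With labels in place, it's easy to ask which rules fire most often, or to pull a few sample requests for a rule to eyeball whether the match is sane. A couple of sketches along those lines:

-- Which rules fire the most across the dataset?
SELECT rule_name, COUNT(*) AS matches
FROM rule_matches
GROUP BY rule_name
ORDER BY matches DESC;

-- Eyeball a few requests a given rule matched.
-- rule_matches stores request_id as TEXT, so cast the UUID for the join.
SELECT r.path, r.headers['User-Agent'] AS ua
FROM requests r
JOIN rule_matches m ON m.request_id = CAST(r.request_id AS TEXT)
WHERE m.rule_name = 'chrome-without-sec-ch-ua'
LIMIT 5;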
Conclusion
Though this starter dataset is much smaller than RAM, as I collect more data, I'll be more confident in the patterns I find. I'm working on an Ingress Controller named hythlodaeus that helps my homelab and production clusters collect a filtered version of this request metadata. Then I can import that data into DuckDB, find more patterns, and refine my filtering logic. I'd also like to try having webhooks kick off fresh analysis as new logs land in the bucket.
As AI scrapers evolve, Anubis has to stay a couple steps ahead by spotting new patterns in the data. Lightweight tools like Tigris and DuckDB make it easy to query that data and, ultimately, make the internet a better place by stopping abusive bots.