Verified Crawlers

IP ranges for major search engine and platform crawlers fetched from official provider geofeeds, compiled into goodBots.mmdb.

The Crawlers source fetches IP ranges for major search engine crawlers and legitimate automated agents from their official geofeeds and documentation pages. Data is extracted from JSON, CSV, and HTML sources, then merged and compiled into a single MMDB database.

The fetcher uses a tiered strategy. It first attempts a regular fetch and, when it detects that the request is being blocked, falls back to invoking curl. The fallback exists to get past anti-scraping measures on social media geofeeds.

Output file: goodBots.mmdb

Make sure curl is installed on your system when using this data source. The fallback mechanism requires it to fetch certain provider pages.
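
The tiered fetch described above can be sketched roughly as follows. This is a minimal illustration and not the package's actual implementation: the function names, the set of status codes treated as "blocked", and the curl flags are all assumptions made for the example.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Assumed heuristic: statuses that commonly indicate an anti-scraping block.
const BLOCKED_STATUSES = new Set([401, 403, 429, 503]);

function looksBlocked(status: number): boolean {
  return BLOCKED_STATUSES.has(status);
}

interface PlainResult {
  ok: boolean;
  status: number;
  body: string;
}

// Tier 1: a regular fetch (global fetch is available in Node 18+).
async function fetchPlain(url: string): Promise<PlainResult> {
  const res = await fetch(url);
  return { ok: res.ok, status: res.status, body: await res.text() };
}

// Tier 2: shell out to curl.
// -f fails on HTTP errors, -sS hides progress but keeps error messages, -L follows redirects.
async function fetchWithCurl(url: string): Promise<string> {
  const { stdout } = await execFileAsync(
    "curl",
    ["-fsSL", "--max-time", "30", url],
    { encoding: "utf8", maxBuffer: 32 * 1024 * 1024 },
  );
  return stdout;
}

// Try the regular fetch first; use the fallback only when the response looks blocked.
// Both tiers are injectable so the control flow can be exercised without network access.
async function fetchTiered(
  url: string,
  primary: (url: string) => Promise<PlainResult> = fetchPlain,
  fallback: (url: string) => Promise<string> = fetchWithCurl,
): Promise<string> {
  const first = await primary(url);
  if (first.ok) return first.body;
  if (!looksBlocked(first.status)) {
    throw new Error(`HTTP ${first.status} for ${url}`);
  }
  return fallback(url);
}
```

In this sketch, only responses that look blocked trigger curl, so ordinary failures such as a 404 surface immediately instead of being retried.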

Built-in Providers


Usage

pnpm dlx @riavzon/shield-base --seo
When custom providers are passed, they are merged with the built-in datasets and compiled into a single goodBots.mmdb database.

Record Structure

interface CrawlersRecord {
  range: string;        // IP prefix, e.g. "66.249.66.0/24"
  provider: string;     // Provider name, e.g. "google", "bing", "apple"
  syncToken: string;    // Provider sync token (when available)
  creationTime: string; // Provider creation timestamp
}

interface ProvidersLists {
  name: string;                    // Stored as the `provider` field in the database
  type: 'HTML' | 'JSON' | 'CSV';  // Format of the source URL
  urls: string[];                  // One or more URLs to fetch
}
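
To illustrate how a JSON source could map onto CrawlersRecord, here is a minimal sketch for a Google-style range file, which exposes a creationTime and a prefixes array of ipv4Prefix / ipv6Prefix entries. The helper name toRecords, the GoogleStyleFeed type, the sample values, and the empty-string defaults for missing fields are assumptions for the example, not the package's parsing engine.

```typescript
interface CrawlersRecord {
  range: string;
  provider: string;
  syncToken: string;
  creationTime: string;
}

// Assumed input shape, modeled on Google-style range files.
interface GoogleStyleFeed {
  syncToken?: string;
  creationTime?: string;
  prefixes: Array<{ ipv4Prefix?: string; ipv6Prefix?: string }>;
}

// Hypothetical helper: flatten one feed into one record per prefix.
function toRecords(provider: string, feed: GoogleStyleFeed): CrawlersRecord[] {
  const records: CrawlersRecord[] = [];
  for (const prefix of feed.prefixes) {
    const range = prefix.ipv4Prefix ?? prefix.ipv6Prefix;
    if (!range) continue; // skip entries that carry neither prefix kind
    records.push({
      range,
      provider,
      // Assumption: missing metadata is stored as an empty string.
      syncToken: feed.syncToken ?? "",
      creationTime: feed.creationTime ?? "",
    });
  }
  return records;
}

// Illustrative sample data, not a real provider response.
const sampleFeed: GoogleStyleFeed = {
  syncToken: "1710000000",
  creationTime: "2024-03-09T22:00:00.000Z",
  prefixes: [{ ipv4Prefix: "66.249.66.0/24" }, { ipv6Prefix: "2001:db8::/32" }],
};

const sampleRecords = toRecords("google", sampleFeed);
```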

Example Lookup

mmdbctl read -f json-pretty 66.249.66.1 outputDirectory/goodBots.mmdb
{
  "provider": "google",
  "range": "66.249.66.0/24",
  "syncToken": "1710000000",
  "creationTime": "2024-03-09T22:00:00.000Z"
}
The type field of ProvidersLists determines how each URL is parsed, so data retrieval succeeds only when it matches the source format. Use HTML for a regular HTML, Markdown, or other raw-text page. Use CSV for a link to a CSV file. Use JSON for a JSON document (e.g., https://developers.google.com/static/search/apis/ipranges/googlebot.json). A provider whose urls mix formats (CSV with JSON, or raw text with CSV and JSON) will fail to process. See the built-in providers or the source code to get an idea of how the parsing engine works.
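
As a rough pre-flight check before passing custom providers, one could guess each URL's format from its file extension and flag entries that do not match the declared type. This is only a sketch: guessType and findMismatchedUrls are hypothetical helpers, the extension heuristic is an assumption, and it will misclassify sources that serve JSON or CSV from an extensionless URL.

```typescript
type SourceType = "HTML" | "JSON" | "CSV";

interface ProvidersLists {
  name: string;
  type: SourceType;
  urls: string[];
}

// Hypothetical heuristic: infer the format from the URL path's extension.
// Anything that is not clearly JSON or CSV is treated as an HTML/raw-text page.
function guessType(url: string): SourceType {
  const path = new URL(url).pathname.toLowerCase();
  if (path.endsWith(".json")) return "JSON";
  if (path.endsWith(".csv")) return "CSV";
  return "HTML";
}

// Return the URLs whose guessed format differs from the provider's declared type.
function findMismatchedUrls(provider: ProvidersLists): string[] {
  return provider.urls.filter((url) => guessType(url) !== provider.type);
}

// Illustrative provider that mixes formats; the CSV URL is a placeholder.
const mixedProvider: ProvidersLists = {
  name: "example",
  type: "JSON",
  urls: [
    "https://developers.google.com/static/search/apis/ipranges/googlebot.json",
    "https://example.com/ranges.csv",
  ],
};

const mismatched = findMismatchedUrls(mixedProvider);
```

A non-empty result suggests splitting the provider into separate entries, one per format.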