Shield Base

Verified Crawlers

IP ranges for major search engine and platform crawlers fetched from official provider geofeeds, compiled into goodBots.mmdb.

The Crawlers source fetches IP ranges for major search engine crawlers and legitimate automated agents from their official geofeeds and documentation pages. Data is extracted from JSON, CSV, and HTML sources, then merged and compiled into a single MMDB database.

The fetcher uses a tiered fetch with a fallback mechanism that invokes curl to bypass anti-scraping measures on social media geofeeds when it detects a regular fetch being blocked.

Output file: goodBots.mmdb

Make sure curl is installed on your system when using this data source. The fallback mechanism requires it to fetch certain provider pages.

Built-in Providers

Provider	Type	Source
Google	JSON	googlebot.json, special-crawlers.json, user-triggered-fetchers.json, user-triggered-fetchers-google.json, goog.json
Bing	JSON	bingbot.json
OpenAI	JSON	gptbot.json, searchbot.json
Apple	JSON	applebot.json
Ahrefs	JSON	crawler-ip-ranges
DuckDuckGo	HTML	duckassistbot, duckduckbot
Common Crawl	HTML	commoncrawl.org/faq
X / Twitter	HTML	troubleshooting-cards
Facebook	CSV	facebook.com/peering/geofeed
Pinterest	HTML	pinterestbot
Telegram	HTML	bots/webhooks
Semrush	HTML	semrush.com/kb/1149

Usage

pnpm dlx @riavzon/shield-base --seo

yarn dlx @riavzon/shield-base --seo

npx @riavzon/shield-base --seo

bunx @riavzon/shield-base --seo

import { getCrawlersIps } from '@riavzon/shield-base';

// Compile with built-in providers only
await getCrawlersIps('./out', 'mmdbctl');

// Or merge custom providers with built-in ones
import type { ProvidersLists } from '@riavzon/shield-base';

const customProviders: ProvidersLists[] = [
  {
    name: 'cloudflare',
    type: 'JSON',
    urls: [
      'https://www.cloudflare.com/ips-v4',
      'https://www.cloudflare.com/ips-v6',
    ],
  },
];

await getCrawlersIps('./out', 'mmdbctl', customProviders);

When custom providers are passed, they are merged with the built-in datasets and compiled into a single goodBots.mmdb database.

Record Structure

interface CrawlersRecord {
  range: string;        // IP prefix, e.g. "66.249.66.0/24"
  provider: string;     // Provider name, e.g. "google", "bing", "apple"
  syncToken: string;    // Provider sync token (when available)
  creationTime: string; // Provider creation timestamp
}

interface ProvidersLists {
  name: string;                    // Stored as the `provider` field in the database
  type: 'HTML' | 'JSON' | 'CSV';  // Format of the source URL
  urls: string[];                  // One or more URLs to fetch
}

Example Lookup

Terminal

mmdbctl read -f json-pretty 66.249.66.1 outputDirectory/goodBots.mmdb

{
  "provider": "google",
  "range": "66.249.66.0/24",
  "syncToken": "1710000000",
  "creationTime": "2024-03-09T22:00:00.000Z"
}

The type field is a special field that the success of the data retrieval depends on. If the links you are providing include a regular html/markdown/other-raw-text-data page, use HTML. If it is a link to a CSV file, use CSV. If it is a JSON (e.g., https://developers.google.com/static/search/apis/ipranges/googlebot.json), use JSON. Providing urls that mixes CSV with JSON data or raw text with CSV and JSON will fail to process this provider. Visit the built-in providers to get an idea of the parsing engine or check the source code.

Country Geolocation

Disposable Emails