
URL Archive Mining for Bug Bounty Recon

2026-03-26 · rojo-sombrero

Every web application has a past. Pages get renamed, endpoints get deprecated, API versions get replaced, config files get moved. The code changes but the internet remembers. Archive services like the Wayback Machine, CommonCrawl, and OTX have been indexing URLs for years — and those old URLs are gold for recon.

This post covers how I built a URL archive mining module into my recon pipeline, what patterns to look for in the output, and how to turn thousands of archived URLs into actionable attack surface.

The Data Sources

There are several public archives that index URLs as they crawl or observe the web — most notably the Wayback Machine, CommonCrawl, and AlienVault OTX.

You can query each of these individually, or use gau (GetAllURLs) which queries all of them in parallel:

# Install gau
go install github.com/lc/gau/v2/cmd/gau@latest

# Pull all known URLs for a domain
echo "target.com" | gau --threads 5 --timeout 60 -o urls.txt

# Or multiple domains from subdomains list
gau --threads 5 --timeout 60 -o urls.txt < subdomains.txt

For supplemental coverage, especially on older domains, hit the Wayback CDX API directly:

# Wayback CDX API — all URLs ever archived under *.target.com
curl -s "http://web.archive.org/cdx/search/cdx?\
url=*.target.com/*&\
output=text&\
fl=original&\
collapse=urlkey"

The collapse=urlkey parameter deduplicates by URL pattern, which keeps the output manageable. Without it, you'll get every single snapshot of every URL — potentially millions of rows for large sites.
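Even after collapse=urlkey, many archived URLs differ only in parameter values. A further local normalization pass collapses those too. A minimal sketch (assumes GNU sed; the /tmp file names and sample URLs are illustrative stand-ins for your real CDX output):

```shell
# Sample stand-in for CDX output
cat > /tmp/cdx-urls.txt <<'EOF'
https://target.com/item?id=1
https://target.com/item?id=2
https://target.com/about
EOF

# Blank out parameter values so URLs that differ only in values
# collapse to a single line per pattern
sed -E 's/=[^&]*/=/g' /tmp/cdx-urls.txt | sort -u > /tmp/cdx-dedup.txt
cat /tmp/cdx-dedup.txt
```

The two `/item?id=` URLs collapse into one, which is usually what you want before feeding the list into downstream tooling.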

What You're Looking For

Raw URL lists are noise. The value is in the patterns. Here's what I extract automatically:

1. Parameterized URLs (Injection Points)

# Extract all URLs with query parameters
grep -E '\?' urls.txt | sort -u > urls-with-params.txt

# Extract just the parameter names, ranked by frequency
# Extract just the parameter names, ranked by frequency
# ([^=&]+ stops at & so valueless params don't swallow their neighbors)
grep -oP '[?&]\K[^=&]+' urls.txt | sort | uniq -c | sort -rn > params.txt

The parameter list is immediately useful. High-frequency params like id, page, search, redirect, url, file, path are your first targets for injection testing. But the rare, domain-specific params are often more interesting — they suggest custom functionality that gets less scrutiny.
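To make the ranking concrete, here is the extraction pipeline run against a tiny sample (the /tmp file names and URLs are illustrative; the `[^=&]+` character class also copes with valueless parameters):

```shell
# Hypothetical sample input
cat > /tmp/sample-urls.txt <<'EOF'
https://target.com/item?id=1
https://target.com/item?id=2&ref=home
https://target.com/search?q=test
EOF

# Pull out parameter names, count occurrences, rank by frequency
grep -oP '[?&]\K[^=&]+' /tmp/sample-urls.txt | sort | uniq -c | sort -rn > /tmp/params-ranked.txt
cat /tmp/params-ranked.txt
```

Here `id` floats to the top with a count of 2, with `q` and `ref` below it — exactly the frequency signal you use to prioritize injection testing.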

2. API and Config Endpoints

grep -iE -e 'api/|/v[0-9]/|\.json|\.xml|\.yaml|\.yml' \
         -e '\.env|\.config|\.conf|\.bak|\.old|\.sql|\.zip' \
         -e '\.tar|\.gz|\.log|\.txt|/graphql|/swagger' \
         -e '/openapi|/admin|/debug|/internal|/backup|\.git' \
  urls.txt | sort -u

This is where archive mining really shines. These endpoints may have been removed from the live site, but the functionality is often still deployed — just unlinked. Old API versions, forgotten backup and config files, and debug routes all surface this way.
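A quick liveness pass separates history from current attack surface. Here's a minimal sketch; the sample host is deliberately unresolvable (`.invalid` never resolves) so the loop runs offline, and curl reports `000` for it — in practice you'd feed in your filtered endpoint list:

```shell
# Illustrative input: one unresolvable URL so the sketch works offline
cat > /tmp/endpoints.txt <<'EOF'
http://nonexistent.invalid/admin
EOF

# Probe each archived endpoint; keep anything that isn't a clean 404.
# curl's %{http_code} is "000" when the request never completes.
while read -r url; do
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url")
  [ "$code" != "404" ] && echo "$code $url"
done < /tmp/endpoints.txt > /tmp/live-check.txt
cat /tmp/live-check.txt
```

Treat the 403s and 401s in the output as interesting too: the endpoint exists, it's just gated.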

3. JavaScript Files

grep -iE '\.js(\?|$)' urls.txt | grep -v '\.json' | sort -u

JS files are recon goldmines: they routinely contain hardcoded API routes, parameter names, feature flags, and — too often — credentials.

Old JS files from the archive are especially valuable because developers are sloppier in earlier versions. That API key they removed in v2.3 was probably in v1.0, and you can still read v1.0 through the Wayback Machine.
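Once you've saved an old bundle locally (appending `id_` to the timestamp in a Wayback snapshot URL returns the raw file without the archive toolbar), grep it for credential-shaped strings. A sketch — the sample file and patterns are illustrative, not exhaustive:

```shell
# Hypothetical old bundle recovered from the archive
cat > /tmp/app-v1.js <<'EOF'
var apiKey = "AKIAIOSFODNN7EXAMPLE";
fetch("/internal/v1/users");
EOF

# Line-numbered, case-insensitive sweep for secret-looking strings
# (AKIA[0-9A-Z]{16} is the AWS access key ID shape)
grep -inE '(api[_-]?key|secret|token|AKIA[0-9A-Z]{16})' /tmp/app-v1.js > /tmp/js-hits.txt
cat /tmp/js-hits.txt
```

Tools like trufflehog do this with far better pattern coverage, but a grep pass over every archived JS version is a cheap first filter.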

4. Sensitive Path Patterns

grep -iE -e 'password|token|secret|key|auth|login' \
         -e 'signup|register|upload|download|reset|verify' \
         -e 'confirm|checkout|payment|admin|dashboard' \
         -e 'internal|staging|dev\.|test\.' \
  urls.txt | sort -u

These are paths that suggest sensitive functionality. staging.target.com and dev.target.com are often less hardened than production. /upload endpoints might accept unexpected file types. /reset and /verify flows are common sources of logic bugs.
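Since the non-production hosts are usually the softest targets, it's worth pulling them out of the URL list as a distinct artifact. A sketch with sample data (hostname keywords are illustrative):

```shell
# Sample stand-in for the full archived URL list
cat > /tmp/all-urls.txt <<'EOF'
https://staging.target.com/login
https://dev.target.com/api/users
https://www.target.com/
EOF

# Extract the scheme+host portion, then keep hosts whose first label
# looks like a non-production environment
grep -oE 'https?://[^/]+' /tmp/all-urls.txt \
  | grep -iE '//(dev|staging|test|uat)\.' | sort -u > /tmp/nonprod-hosts.txt
cat /tmp/nonprod-hosts.txt
```

The resulting host list is a good candidate for a second, more aggressive scanning pass.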

Putting It Together

In my recon pipeline, the URL archive module runs after subdomain enumeration and HTTP probing. This means it pulls archived URLs for all discovered subdomains, not just the root domain. The output feeds directly into nuclei for automated vulnerability scanning — parameterized URLs get tested for injection, exposed endpoints get checked for known vulnerabilities.

# Full pipeline: recon with URL archive mining + nuclei
recon target.com -w -n

# Output structure:
#   urls-archived.txt      — all unique URLs from archives
#   urls-interesting.txt   — filtered interesting patterns
#   params.txt             — unique parameters ranked by frequency
#   nuclei.txt             — vulnerability scan results

A note on volume: Large domains can return hundreds of thousands of archived URLs. On constrained hardware (I run this on a Chromebook with a Celeron N4120), I limit the Wayback CDX queries to the top 20 subdomains to avoid timeouts. gau handles the parallelism better but still benefits from reasonable scope.

The Mindset

URL archive mining is about exploiting the gap between what an application is now and what it was before. Developers remove links but not functionality. They upgrade APIs but leave old versions running. They clean up exposed files but don't rotate the credentials that were in them.

The internet has a better memory than most security teams. Use it.