Every web application has a past. Pages get renamed, endpoints get deprecated, API versions get replaced, config files get moved. The code changes but the internet remembers. Archive services like the Wayback Machine, CommonCrawl, and OTX have been indexing URLs for years — and those old URLs are gold for recon.
This post covers how I built a URL archive mining module into my recon pipeline, what patterns to look for in the output, and how to turn thousands of archived URLs into actionable attack surface.
There are several public archives that index URLs as they crawl the web:

- **Wayback Machine** — the Internet Archive's snapshot service, queryable via its CDX API
- **Common Crawl** — an open repository of web crawl data with a searchable URL index
- **AlienVault OTX** — a threat-intelligence exchange that records URLs observed for a domain
You can query each of these individually, or use gau (GetAllURLs) which queries all of them in parallel:
```bash
# Install gau
go install github.com/lc/gau/v2/cmd/gau@latest

# Pull all known URLs for a domain
echo "target.com" | gau --threads 5 --timeout 60 -o urls.txt

# Or multiple domains from a subdomains list
gau --threads 5 --timeout 60 -o urls.txt < subdomains.txt
```
For supplemental coverage, especially on older domains, hit the Wayback CDX API directly:
```bash
# Wayback CDX API — all URLs ever archived under *.target.com
curl -s "http://web.archive.org/cdx/search/cdx?\
url=*.target.com/*&\
output=text&\
fl=original&\
collapse=urlkey"
```
The collapse=urlkey parameter deduplicates by URL pattern, which keeps
the output manageable. Without it, you'll get every single snapshot of every URL —
potentially millions of rows for large sites.
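If you already pulled a raw snapshot list without `collapse=urlkey`, you can approximate the dedup client-side. This is a rough sketch on made-up sample data — it keeps the first occurrence of each exact URL, which is cruder than the server-side urlkey canonicalization, but it gets an unmanageable dump down to size:

```shell
# Hypothetical raw CDX output; real dumps repeat each URL once per snapshot
cat > /tmp/cdx-raw.txt <<'EOF'
http://target.com/page?id=1
http://target.com/page?id=1
http://target.com/about
http://target.com/page?id=1
EOF

# awk keeps only the first occurrence of each line, preserving order
awk '!seen[$0]++' /tmp/cdx-raw.txt
```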
Raw URL lists are noise. The value is in the patterns. Here's what I extract automatically:
```bash
# Extract all URLs with query parameters
grep -E '\?' urls.txt | sort -u > urls-with-params.txt

# Extract just the parameter names, ranked by frequency
grep -oP '[\?&]\K[^=]+' urls.txt | sort | uniq -c | sort -rn > params.txt
```
The parameter list is immediately useful. High-frequency params like `id`, `page`, `search`, `redirect`, `url`, `file`, and `path` are your first targets for injection testing. But the rare, domain-specific params are often more interesting — they suggest custom functionality that gets less scrutiny.
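As a quick illustration on hypothetical data, the frequency ranking makes that split obvious — generic names rise to the top, and the one-off custom params settle to the bottom of `params.txt`:

```shell
# Hypothetical slice of an archived URL list
cat > /tmp/urls-sample.txt <<'EOF'
https://target.com/item?id=42
https://target.com/search?q=test&id=7
https://target.com/export?fmt=csv&id=9
https://target.com/legacy?debug_render=1
EOF

# Same ranking as above; the singletons at the bottom are the
# domain-specific parameters worth manual review
grep -oP '[?&]\K[^=]+' /tmp/urls-sample.txt | sort | uniq -c | sort -rn
```

Here `id` would top the list as a routine injection target, while something like `debug_render` hints at custom functionality.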
```bash
grep -iE '(api/|/v[0-9]+/|\.json|\.xml|\.yaml|\.yml|\.env|\.config|\.conf|\.bak|\.old|\.sql|\.zip|\.tar|\.gz|\.log|\.txt|/graphql|/swagger|/openapi|/admin|/debug|/internal|/backup|\.git)' urls.txt | sort -u
```
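One pattern from that filter is worth pulling out on its own: version segments. Listing every distinct `/api/vN/` path that ever appeared tells you which older generations to probe for unlinked endpoints — a sketch on hypothetical data:

```shell
# Hypothetical archived URLs spanning two API generations
cat > /tmp/urls-api.txt <<'EOF'
https://target.com/api/v1/users?id=1
https://target.com/api/v2/users?id=1
https://target.com/api/v1/export
EOF

# Every API version path the archive has ever seen
grep -oE '/api/v[0-9]+/' /tmp/urls-api.txt | sort -u
```

Anything below the current version is a candidate for still-running, less-patched code.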
This is where archive mining really shines. These endpoints may have been removed from the live site but the functionality might still be there — just unlinked. I've seen:
- `/api/v1/` endpoints still live after `/api/v2/` replaced them — older version, fewer security patches
- `swagger.json` files that map the entire API
- `.env` and `.config` files that were briefly exposed during deployment
- `/admin` and `/debug` paths behind obscurity rather than auth
- `.git/config` exposures that reveal internal repository structure

Next, pull the JavaScript files out of the URL list:

```bash
grep -iE '\.js(\?|$)' urls.txt | grep -v '\.json' | sort -u
```
JS files are recon goldmines. They contain:

- API endpoints and routes the frontend calls
- hardcoded keys, tokens, and other credentials
- internal hostnames and paths
- comments and feature flags that hint at unreleased functionality
Old JS files from the archive are especially valuable because developers are sloppier in earlier versions. That API key they removed in v2.3 was probably in v1.0, and you can still read v1.0 through the Wayback Machine.
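A crude first pass over a recovered bundle can be done with grep alone. The file below is hypothetical, and this is no substitute for proper JS analysis tooling, but it shows the idea:

```shell
# Hypothetical v1.0 bundle recovered from the Wayback Machine
cat > /tmp/app.v1.0.js <<'EOF'
var config = { apiKey: "AKIAIOSFODNN7EXAMPLE", env: "prod" };
fetch("/api/v1/users");
fetch("/internal/debug/state");
EOF

# Quoted absolute paths: candidate endpoints
grep -oE '"/[A-Za-z0-9_/.-]+"' /tmp/app.v1.0.js | tr -d '"' | sort -u

# Credential-shaped identifiers: candidate secrets
grep -nE 'apiKey|api_key|secret|token' /tmp/app.v1.0.js
```

Run the same pass over each archived version of the bundle, oldest first — the early builds are where the sloppy secrets live.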
```bash
grep -iE '(password|token|secret|key|auth|login|signup|register|upload|download|reset|verify|confirm|checkout|payment|admin|dashboard|internal|staging|dev\.|test\.)' urls.txt | sort -u
```
These are paths that suggest sensitive functionality. `staging.target.com` and `dev.target.com` are often less hardened than production. `/upload` endpoints might accept unexpected file types. `/reset` and `/verify` flows are common sources of logic bugs.
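Splitting the archived list by host surfaces those non-production environments quickly — a sketch on hypothetical data:

```shell
# Hypothetical archived URLs across environments
cat > /tmp/urls-hosts.txt <<'EOF'
https://staging.target.com/admin
https://dev.target.com/debug/queue
https://target.com/login
EOF

# Distinct hosts; staging.* and dev.* jump out for closer review
grep -oE '^https?://[^/]+' /tmp/urls-hosts.txt | sed -E 's#^https?://##' | sort -u
```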
In my recon pipeline, the URL archive module runs after subdomain enumeration and HTTP probing. This means it pulls archived URLs for all discovered subdomains, not just the root domain. The output feeds directly into nuclei for automated vulnerability scanning — parameterized URLs get tested for injection, exposed endpoints get checked for known vulnerabilities.
```bash
# Full pipeline: recon with URL archive mining + nuclei
recon target.com -w -n

# Output structure:
#   urls-archived.txt    — all unique URLs from archives
#   urls-interesting.txt — filtered interesting patterns
#   params.txt           — unique parameters ranked by frequency
#   nuclei.txt           — vulnerability scan results
```
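The filtering stage itself is just the earlier greps chained together. A stand-alone sketch — the file names mirror the pipeline's outputs, the input data is made up, and the filter is abbreviated:

```shell
# Stand-in for the deduped archive pull
cat > /tmp/urls-archived.txt <<'EOF'
https://target.com/search?q=x&id=3
https://staging.target.com/admin?debug=1
https://target.com/static/app.js
EOF

# Interesting patterns (abbreviated filter for the sketch)
grep -iE '(admin|debug|\.js(\?|$))' /tmp/urls-archived.txt \
  | sort -u > /tmp/urls-interesting.txt

# Parameter frequency table
grep -oP '[?&]\K[^=]+' /tmp/urls-archived.txt \
  | sort | uniq -c | sort -rn > /tmp/params.txt
```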
URL archive mining is about exploiting the gap between what an application is now and what it was before. Developers remove links but not functionality. They upgrade APIs but leave old versions running. They clean up exposed files but don't rotate the credentials that were in them.
The internet has a better memory than most security teams. Use it.