# Large Documentation Handling

Strategies for scraping and managing documentation sites with 10,000+ pages.
## Overview
Large documentation sites (10K+ pages) present unique challenges:

- **Token limits** - A single skill exceeds Claude/Gemini/OpenAI context windows
- **Scraping time** - 10K pages × 1 second ≈ 3 hours
- **Memory usage** - Storing 10K pages in memory strains RAM
- **Skill usability** - Too much content makes skills slow and unfocused

Solutions:

- **Split strategies** - Divide into category-based sub-skills
- **Router pattern** - Intelligently route queries to the right sub-skill
- **Parallel scraping** - Multi-worker concurrent scraping
- **Checkpointing** - Resume interrupted scrapes
- **Size-based splitting** - Auto-split by token budget

**Version:** v2.0.0+
## When to Use Large Doc Strategies

### Size Thresholds
| Pages | Recommendation | Strategy |
|---|---|---|
| < 500 | Single skill | Standard scraping |
| 500-2000 | Single skill + optimization | Async scraping, selective content |
| 2000-5000 | Consider splitting | Category-based split |
| 5000-10000 | Split strongly recommended | Router + sub-skills |
| 10000+ | Must split | Router + parallel scraping |
### Indicators You Need Splitting
- ✅ Documentation organized into clear categories (API, Guides, Tutorials, etc.)
- ✅ Estimated token count > 100K
- ✅ Scraping estimated to take > 1 hour
- ✅ Different sections serve different use cases
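When page text is available (for example after a `--scan-only` pass), a rough token estimate is enough to decide whether to split. A minimal sketch, assuming the common ~4 characters-per-token heuristic; `estimate_tokens` and `needs_split` are illustrative helpers, not part of the skill-seekers CLI:

```python
def estimate_tokens(pages, chars_per_token=4):
    """Rough token estimate for a list of page texts (~4 chars/token heuristic)."""
    total_chars = sum(len(p) for p in pages)
    return total_chars // chars_per_token

def needs_split(pages, budget=100_000):
    """True when the estimated token count exceeds the 100K split threshold."""
    return estimate_tokens(pages) > budget

# Example: 1,000 pages of ~2,000 characters each ≈ 500K tokens
pages = ["x" * 2000] * 1000
print(estimate_tokens(pages))  # → 500000
print(needs_split(pages))      # → True
```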
## Split Strategies

### 1. Category-Based Split (Recommended)

**Best for:** Documentation with clear categorical organization

**How it works:** Scrape each category into a separate sub-skill, then create a router.

**Example: Kubernetes Docs**
```bash
# 1. Split by category
skill-seekers scrape --config configs/k8s-concepts.json --output output/k8s-concepts/
skill-seekers scrape --config configs/k8s-tasks.json --output output/k8s-tasks/
skill-seekers scrape --config configs/k8s-api.json --output output/k8s-api/

# 2. Create router
skill-seekers router \
  output/k8s-concepts/ \
  output/k8s-tasks/ \
  output/k8s-api/ \
  --output output/k8s-router/ \
  --name kubernetes-complete

# 3. Package
skill-seekers package output/k8s-router/ --include-subskills
```
**Config example (`k8s-concepts.json`):**

```json
{
  "name": "kubernetes-concepts",
  "base_url": "https://kubernetes.io/docs/concepts/",
  "url_patterns": {
    "include": ["concepts"],
    "exclude": []
  },
  "max_pages": 500
}
```
### 2. Size-Based Split (Automatic)

**Best for:** Unorganized docs or uniform structure

**How it works:** Auto-split when the token budget is exceeded.
```bash
# Automatic splitting at 50K tokens per skill
skill-seekers scrape --config configs/large-docs.json \
  --auto-split \
  --max-tokens 50000 \
  --output output/large-docs/

# Creates:
# output/large-docs-part1/
# output/large-docs-part2/
# output/large-docs-part3/
# output/large-docs-router/ (automatically generated)
```
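The auto-split behavior can be approximated as a greedy pass that starts a new part whenever the next page would blow the token budget. A sketch under that assumption; `split_by_budget` is a hypothetical helper, not the tool's actual implementation:

```python
def split_by_budget(pages, max_tokens=50_000, chars_per_token=4):
    """Greedily pack pages (in order) into parts, starting a new part when
    the token budget would be exceeded. A single oversized page still gets
    its own part rather than being dropped."""
    parts, current, used = [], [], 0
    budget_chars = max_tokens * chars_per_token
    for page in pages:
        if current and used + len(page) > budget_chars:
            parts.append(current)
            current, used = [], 0
        current.append(page)
        used += len(page)
    if current:
        parts.append(current)
    return parts

pages = ["x" * 60_000] * 10       # ~15K tokens each
parts = split_by_budget(pages)    # 50K-token budget → 3 pages per part
print(len(parts))                 # → 4
```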
### 3. Router-First Split (Manual)

**Best for:** Pre-planned organization

**Config structure:**
```json
{
  "name": "django-complete",
  "router_mode": true,
  "sub_skills": [
    {
      "name": "django-tutorial",
      "base_url": "https://docs.djangoproject.com/en/stable/intro/",
      "max_pages": 200
    },
    {
      "name": "django-api",
      "base_url": "https://docs.djangoproject.com/en/stable/ref/",
      "max_pages": 1000
    },
    {
      "name": "django-topics",
      "base_url": "https://docs.djangoproject.com/en/stable/topics/",
      "max_pages": 500
    }
  ]
}
```
**Scrape:**

```bash
skill-seekers scrape --config configs/django-router.json
# Automatically scrapes all sub-skills and generates the router
```
## Parallel Scraping

### Multi-Worker Scraping
Speed up scraping with concurrent workers:
```bash
# 4 parallel workers
skill-seekers scrape --config configs/large-docs.json \
  --workers 4 \
  --output output/large-docs/

# Performance:
# - 1 worker:  10,000 pages = 3 hours
# - 4 workers: 10,000 pages = 45 minutes
# - 8 workers: 10,000 pages = 25 minutes (diminishing returns)
```
**Optimal worker count:**

- **CPU-bound:** number of CPU cores
- **Network-bound:** 4-8 workers (avoid rate limiting)
- **Large docs (10K+ pages):** 4-6 workers recommended
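Because scraping is dominated by network I/O, worker-based parallelism maps naturally onto a thread pool. A minimal sketch using Python's standard library; `fetch` is a stand-in for a real HTTP request, not a skill-seekers function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Placeholder for a real page fetch. Network I/O dominates scraping,
    so threads are appropriate here despite the GIL."""
    return f"<html>{url}</html>"

def scrape_parallel(urls, workers=4):
    """Fetch pages concurrently; pool.map preserves input order in the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

urls = [f"https://docs.example.com/page/{i}" for i in range(8)]
results = scrape_parallel(urls, workers=4)
print(len(results))  # → 8
```

Raising `workers` past the point where the network or the site's rate limit saturates yields the diminishing returns noted above.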
### Async Scraping
Single-process async for moderate speedup:
```bash
# Async mode (faster than sync, no multi-process overhead)
skill-seekers scrape --config configs/large-docs.json \
  --async \
  --output output/large-docs/

# 2-3x faster than synchronous mode
```
**When to use:**

- **Async mode:** 500-2000 pages, network-bound
- **Parallel mode:** 2000+ pages, need maximum speed
- **Sync mode:** < 500 pages, simple/stable
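Async mode gets its speedup by overlapping network waits on a single event loop rather than spawning workers. A sketch of the pattern with `asyncio`; the `fetch` coroutine stands in for a real HTTP client call, and the semaphore caps concurrency the way a rate-limit-friendly scraper would:

```python
import asyncio

async def fetch(url):
    """Placeholder fetch: a real implementation would await an HTTP client here."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return f"<html>{url}</html>"

async def scrape_async(urls, concurrency=10):
    """Fetch pages concurrently on one event loop, capped by a semaphore."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

pages = asyncio.run(scrape_async([f"https://docs.example.com/p{i}" for i in range(5)]))
print(len(pages))  # → 5
```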
## Checkpointing and Resume

### Checkpoint Scraping

Resume interrupted scrapes:
```bash
# Enable checkpointing (saves progress every 100 pages)
skill-seekers scrape --config configs/large-docs.json \
  --checkpoint \
  --checkpoint-interval 100 \
  --output output/large-docs/

# If interrupted, resume from the last checkpoint:
skill-seekers scrape --config configs/large-docs.json \
  --resume \
  --output output/large-docs/
```
**Checkpoint location:**

```
output/large-docs/.checkpoint/
├── progress.json   # Pages scraped, current URL, etc.
├── cache/          # Cached page content
└── metadata.json   # Scraping metadata
```
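The resume logic only needs the set of completed URLs recorded in `progress.json`. A minimal sketch of the save/load round trip; the field names here are illustrative, and the tool's actual checkpoint schema may differ:

```python
import json
import tempfile
from pathlib import Path

def save_checkpoint(output_dir, done_urls, current_url):
    """Write progress.json so an interrupted scrape can resume later."""
    ckpt = Path(output_dir) / ".checkpoint"
    ckpt.mkdir(parents=True, exist_ok=True)
    (ckpt / "progress.json").write_text(json.dumps({
        "done": sorted(done_urls),
        "current": current_url,
    }))

def load_checkpoint(output_dir):
    """Return the set of already-scraped URLs (empty if no checkpoint exists)."""
    path = Path(output_dir) / ".checkpoint" / "progress.json"
    if not path.exists():
        return set()
    return set(json.loads(path.read_text())["done"])

with tempfile.TemporaryDirectory() as tmp:
    save_checkpoint(tmp, {"/a", "/b"}, "/c")
    print(sorted(load_checkpoint(tmp)))  # → ['/a', '/b']
```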
### Smart Resume

Detect and skip already-scraped pages:
```bash
# Resume automatically detects existing content
skill-seekers scrape --config configs/large-docs.json \
  --smart-resume \
  --output output/large-docs/

# Skips pages that:
# - Already exist in references/
# - Have not changed since the last scrape (based on the Last-Modified header)
# - Match a content hash from the previous scrape
```
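The content-hash check amounts to comparing a stable digest per URL against the previous scrape. A sketch, assuming SHA-256 digests keyed by URL (not necessarily the tool's exact scheme):

```python
import hashlib

def content_hash(page_text):
    """Stable digest of page content, used to detect unchanged pages."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def pages_to_rescrape(new_pages, previous_hashes):
    """Return URLs whose content differs from (or is absent in) the previous scrape.

    new_pages: {url: freshly fetched text}
    previous_hashes: {url: digest recorded last time}
    """
    return [url for url, text in new_pages.items()
            if previous_hashes.get(url) != content_hash(text)]

old = {"/intro": content_hash("v1")}
new = {"/intro": "v1", "/api": "v2"}
print(pages_to_rescrape(new, old))  # → ['/api']
```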
## Router Pattern for Large Docs

### Router SKILL.md Example

**Kubernetes Router (4 sub-skills):**
```markdown
---
name: kubernetes-complete
description: Complete Kubernetes documentation with intelligent routing
---

# Kubernetes Complete Router

## Sub-Skills

### 1. kubernetes-concepts
**When to use:** Understanding Kubernetes architecture and concepts
**Contains:** Pods, Services, Deployments, ReplicaSets, etc.

### 2. kubernetes-tasks
**When to use:** Step-by-step how-to guides
**Contains:** Creating deployments, exposing services, scaling, etc.

### 3. kubernetes-api
**When to use:** API reference and specifications
**Contains:** API objects, fields, methods

### 4. kubernetes-tutorials
**When to use:** End-to-end learning guides
**Contains:** Hello Minikube, Stateless Applications, etc.

## Routing Strategy

1. **Conceptual questions** → kubernetes-concepts
2. **How-to questions** → kubernetes-tasks
3. **API reference questions** → kubernetes-api
4. **Learning questions** → kubernetes-tutorials

For complex questions requiring multiple perspectives, consult multiple sub-skills and synthesize an answer.
```
**Benefits:**
- ✅ Focused sub-skills (500-2000 pages each)
- ✅ Fast routing (only load needed sub-skill)
- ✅ Better token efficiency (no 10K-page context)
- ✅ User-friendly (clear organization)
## Optimization Techniques

### 1. Selective Content Extraction

Extract only essential content:
```json
{
  "name": "large-docs",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "exclude_selectors": [
      ".sidebar",
      ".footer",
      ".navigation",
      ".advertisement"
    ]
  },
  "extract_api": true,
  "extract_examples": true,
  "extract_navigation": false,
  "extract_metadata": false
}
```
Excluding non-essential content in this way can reduce the token count by 30-50%.
### 2. Smart Page Filtering

Skip low-value pages:
```json
{
  "url_patterns": {
    "include": ["docs"],
    "exclude": [
      "blog",
      "news",
      "changelog",
      "404",
      "search"
    ]
  },
  "min_content_length": 500,
  "skip_duplicate_content": true
}
```
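The include/exclude patterns above can be read as substring filters applied before fetching, with excludes taking precedence. A sketch under that assumption; `keep_url` is illustrative, and the tool may use regex matching instead:

```python
def keep_url(url, include, exclude):
    """Substring-based URL filter: excludes win over includes;
    an empty include list admits everything not excluded."""
    if exclude and any(p in url for p in exclude):
        return False
    return not include or any(p in url for p in include)

urls = [
    "https://docs.example.com/docs/install",
    "https://docs.example.com/blog/2025",
    "https://docs.example.com/search?q=x",
]
kept = [u for u in urls if keep_url(u, ["docs"], ["blog", "news", "search", "404"])]
print(kept)  # → ['https://docs.example.com/docs/install']
```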
### 3. Incremental Updates

Update only changed pages:
```bash
# First scrape (full)
skill-seekers scrape --config configs/docs.json --output output/docs/

# Later updates (incremental)
skill-seekers scrape --config configs/docs.json \
  --output output/docs/ \
  --incremental \
  --since "2025-01-01"

# Only re-scrapes pages modified after 2025-01-01
```
### 4. Content Deduplication

Remove duplicate content:
```bash
# Enable deduplication
skill-seekers scrape --config configs/docs.json \
  --deduplicate \
  --similarity-threshold 0.9

# Skips pages with > 90% similarity to already-scraped pages
```
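A 0.9 similarity threshold means near-identical pages are dropped while genuinely different ones survive. A sketch of the threshold semantics using `difflib`'s sequence similarity; a real implementation would likely use faster hashing, so this is purely illustrative:

```python
from difflib import SequenceMatcher

def is_duplicate(text, seen, threshold=0.9):
    """True when text is >= threshold similar to any already-kept page."""
    return any(SequenceMatcher(None, text, s).ratio() >= threshold for s in seen)

def deduplicate(pages, threshold=0.9):
    """Keep only the first copy of each near-duplicate page."""
    kept = []
    for text in pages:
        if not is_duplicate(text, kept, threshold):
            kept.append(text)
    return kept

pages = [
    "install guide " * 50,            # original
    "install guide " * 50 + "extra",  # ~99% similar → dropped
    "totally different content",      # kept
]
print(len(deduplicate(pages)))  # → 2
```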
## Case Studies

### Case Study 1: Kubernetes (10,000+ pages)

**Challenge:** Comprehensive docs across 5 major sections

**Solution:** Category-based split
```bash
# Split into 5 sub-skills
skill-seekers scrape --config configs/k8s-concepts.json --workers 4
skill-seekers scrape --config configs/k8s-tasks.json --workers 4
skill-seekers scrape --config configs/k8s-api.json --workers 6
skill-seekers scrape --config configs/k8s-tutorials.json --workers 2
skill-seekers scrape --config configs/k8s-reference.json --workers 4

# Generate router
skill-seekers router output/k8s-*/ --output output/k8s-router/

# Package
skill-seekers package output/k8s-router/ --include-subskills
```
**Results:**
- 5 focused sub-skills (500-2500 pages each)
- Total scraping time: 1.5 hours (with 4-6 workers)
- Token count per sub-skill: 20K-60K
- Router overhead: ~5K tokens
### Case Study 2: Python Docs (4,000+ pages)

**Challenge:** Large, monolithic documentation

**Solution:** Automatic size-based split
```bash
# Auto-split at 50K tokens
skill-seekers scrape --config configs/python-docs.json \
  --auto-split \
  --max-tokens 50000 \
  --workers 4

# Creates:
# - python-stdlib (part 1)
# - python-tutorial (part 2)
# - python-reference (part 3)
# - python-howto (part 4)
# - python-router (auto-generated)
```
**Results:**
- 4 sub-skills automatically created
- Even distribution (~1000 pages each)
- Scraping time: 45 minutes
### Case Study 3: Internal Company Docs (20,000+ pages)

**Challenge:** Massive internal wiki with poor organization

**Solution:** Hybrid approach
```bash
# Phase 1: Category-based split (where possible)
skill-seekers scrape --config configs/company-api.json --workers 4
skill-seekers scrape --config configs/company-guides.json --workers 4

# Phase 2: Auto-split remaining unorganized docs
skill-seekers scrape --config configs/company-misc.json \
  --auto-split \
  --max-tokens 50000 \
  --workers 6

# Phase 3: Generate router
skill-seekers router output/company-*/ --output output/company-router/
```
**Results:**
- 2 manual categories + 3 auto-split parts = 5 sub-skills
- Scraping time: 3 hours (20K pages)
- Manageable sub-skills (50K-70K tokens each)
## Performance Guidelines

### Scraping Speed Estimates
| Pages | Sync | Async | Parallel (4 workers) |
|---|---|---|---|
| 100 | 2 min | 1 min | 30 sec |
| 500 | 8 min | 4 min | 2 min |
| 1000 | 17 min | 8 min | 4 min |
| 5000 | 1.5 hr | 45 min | 22 min |
| 10000 | 3 hr | 1.5 hr | 45 min |
| 20000 | 6 hr | 3 hr | 1.5 hr |
**Factors affecting speed:**

- **Network latency** - higher latency means slower scraping
- **Rate limiting** - respecting `robots.txt` and site rate limits
- **Page complexity** - heavy JavaScript and dynamic content
- **Content extraction** - complex selectors slow down processing
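The table's entries follow from simple throughput arithmetic: pages × seconds-per-page, divided by the effective worker count. A sketch, where the ~1 page/second sync baseline comes from the Overview and the 0.85 efficiency discount for coordination overhead is an illustrative assumption:

```python
def estimate_hours(pages, seconds_per_page=1.0, workers=1, efficiency=0.85):
    """Wall-clock scraping estimate. `efficiency` discounts coordination
    overhead when running multiple workers (illustrative assumption)."""
    effective_rate = workers * efficiency if workers > 1 else 1.0
    return pages * seconds_per_page / effective_rate / 3600

print(round(estimate_hours(10_000), 1))             # → 2.8 (sync: ~3 hr)
print(round(estimate_hours(10_000, workers=4), 1))  # → 0.8 (~45 min with 4 workers)
```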
### Memory Usage
| Pages | Memory (Sync) | Memory (Async) | Memory (Parallel 4x) |
|---|---|---|---|
| 100 | 50 MB | 80 MB | 200 MB |
| 500 | 200 MB | 300 MB | 800 MB |
| 1000 | 400 MB | 600 MB | 1.5 GB |
| 5000 | 2 GB | 3 GB | 7 GB |
| 10000 | 4 GB | 6 GB | 14 GB |
**Recommendations:**
- < 1000 pages: Any mode works
- 1000-5000 pages: Use async or 2-4 workers
- 5000+ pages: Use checkpointing + parallel workers
- 10000+ pages: Split into sub-skills
## Advanced Configuration

### Multi-Stage Scraping

**Stage 1: Quick scan (get structure)**
```bash
skill-seekers scrape --config configs/docs.json \
  --scan-only \
  --output output/docs-scan/

# Creates a URL map without full content extraction
```
**Stage 2: Analyze and plan the split**
```bash
# Analyze structure
skill-seekers analyze output/docs-scan/ --suggest-split

# Outputs suggested categories and sizes
```
**Stage 3: Full scrape with split**
```bash
# Use the suggested split
skill-seekers scrape --config configs/docs.json \
  --split-by-categories \
  --workers 4 \
  --output output/docs/
```
### Custom Router Logic

Define custom routing rules:
```json
{
  "router_config": {
    "name": "custom-router",
    "sub_skills": [
      {
        "name": "api-reference",
        "keywords": ["api", "method", "function", "class"],
        "priority": 1
      },
      {
        "name": "user-guide",
        "keywords": ["how to", "guide", "tutorial", "example"],
        "priority": 2
      },
      {
        "name": "concepts",
        "keywords": ["concept", "overview", "architecture"],
        "priority": 3
      }
    ],
    "default_skill": "user-guide",
    "multi_skill_threshold": 0.5
  }
}
```
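The keyword lists and `multi_skill_threshold` above can be read as: score each sub-skill by keyword hits, then route to every sub-skill whose score is within half of the best. A sketch of that rule; `route` is illustrative, not the generated router's actual code:

```python
def route(question, sub_skills, default="user-guide", multi_threshold=0.5):
    """Score each sub-skill by keyword hits in the question; return all
    sub-skills within `multi_threshold` of the best score, or the default
    when nothing matches (mirrors default_skill / multi_skill_threshold)."""
    q = question.lower()
    scores = {s["name"]: sum(kw in q for kw in s["keywords"]) for s in sub_skills}
    best = max(scores.values())
    if best == 0:
        return [default]
    return [name for name, score in scores.items()
            if score > 0 and score >= best * multi_threshold]

skills = [
    {"name": "api-reference", "keywords": ["api", "method", "function", "class"]},
    {"name": "user-guide", "keywords": ["how to", "guide", "tutorial", "example"]},
    {"name": "concepts", "keywords": ["concept", "overview", "architecture"]},
]
print(route("how to call the api method", skills))  # → ['api-reference', 'user-guide']
```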
## Troubleshooting

### Issue: Out of memory during scraping

**Symptoms:** Process killed, `MemoryError`

**Solutions:**

1. Reduce the batch size:

   ```bash
   skill-seekers scrape --config X --batch-size 50
   ```

2. Enable streaming mode:

   ```bash
   skill-seekers scrape --config X --streaming
   ```

3. Split into smaller sub-skills:

   ```bash
   skill-seekers scrape --config X --auto-split --max-pages 1000
   ```
### Issue: Scraping too slow

**Symptoms:** Taking 5+ hours for 10K pages

**Solutions:**

1. Use parallel workers:

   ```bash
   skill-seekers scrape --config X --workers 4
   ```

2. Enable async mode:

   ```bash
   skill-seekers scrape --config X --async
   ```

3. Skip low-value content:

   ```json
   { "url_patterns": { "exclude": ["blog", "news", "search", "404"] } }
   ```
### Issue: Skill too large for Claude

**Symptoms:** Upload fails, "Token limit exceeded"

**Solutions:**

1. Check the token count:

   ```bash
   skill-seekers validate output/skill/ --check-tokens
   ```

2. Split into router + sub-skills:

   ```bash
   skill-seekers router output/skill/ --max-tokens 50000
   ```

3. Optimize content extraction:

   ```json
   { "extract_navigation": false, "extract_metadata": false, "exclude_selectors": [".sidebar", ".footer"] }
   ```
## Best Practices

### 1. Plan Your Split Strategy

✅ Before scraping, analyze the documentation structure:

```bash
skill-seekers analyze https://docs.example.com/ --suggest-split
```
### 2. Use Category-Based Split When Possible

✅ Clearer organization, better routing
❌ Avoid arbitrary size-based splits if categories exist
### 3. Test with a Sample First

✅ Scrape a small sample (100 pages) to validate the config:

```bash
skill-seekers scrape --config X --max-pages 100 --output test/
```
### 4. Monitor Progress

✅ Enable verbose logging:

```bash
skill-seekers scrape --config X --verbose
```
### 5. Use Checkpointing for 5K+ Pages

✅ Always use `--checkpoint` for large scrapes
✅ Enables resume if interrupted
## Next Steps

- **Three-Stream GitHub Architecture** - Router pattern for multi-source skills
- **Skill Architecture Guide** - Layering and splitting strategies
- **Unified Scraping** - Multi-source scraping with conflict detection
**Status:** ✅ Production Ready (v2.0.0+)
Found an issue or have suggestions? Open an issue