unified - Multi-Source Scraping
Combine multiple sources (docs + GitHub + PDF) into one unified skill with conflict detection.
Basic Usage
skill-seekers unified [OPTIONS]
Quick Examples
# Use existing unified configs
skill-seekers unified --config configs/react_unified.json
skill-seekers unified --config configs/django_unified.json
skill-seekers unified --config configs/fastapi_unified.json
# Analyze GitHub repo with three-stream architecture
skill-seekers unified \
--repo-url https://github.com/facebook/react \
--depth c3x \
--fetch-github-metadata
Why Use Unified Scraping?
The Problem: Documentation and code often drift apart. Docs can be outdated, omit features that exist in the code, or describe features that have been removed.
The Solution: Unified scraping combines multiple sources and automatically detects conflicts.
Three-Stream Architecture
New in v2.6.0 - GitHub repos are split into three streams:
- Stream 1: Code - Deep C3.x analysis (patterns, examples, architecture)
- Stream 2: Docs - Repository documentation (README, docs/*.md)
- Stream 3: Insights - GitHub issues (common problems + solutions)
skill-seekers unified \
--repo-url https://github.com/fastapi/fastapi \
--depth c3x \
--fetch-github-metadata \
--output-dir output/fastapi
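To make the three-stream split concrete, here is a minimal sketch of how repository content could be routed into the streams listed above. It is an illustration only, not the tool's implementation: the `split_into_streams` helper, the README/`docs/*.md` matching rule, and the sample paths and issue title are all assumptions.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def split_into_streams(file_paths, issue_titles):
    """Route repo files and issues into code / docs / insights (hypothetical helper)."""
    streams = {"code": [], "docs": [], "insights": list(issue_titles)}
    for path in file_paths:
        name = PurePosixPath(path).name.lower()
        # Stream 2: README files and markdown under docs/ (note: * also matches "/").
        if name.startswith("readme") or fnmatchcase(path, "docs/*.md"):
            streams["docs"].append(path)
        else:
            # Stream 1: everything else is treated as code in this sketch.
            streams["code"].append(path)
    return streams


streams = split_into_streams(
    ["README.md", "docs/tutorial.md", "fastapi/routing.py", "tests/test_app.py"],
    ["Example issue title"],  # Stream 3 comes from GitHub issues, not from files
)
print(streams["docs"])  # ['README.md', 'docs/tutorial.md']
print(streams["code"])  # ['fastapi/routing.py', 'tests/test_app.py']
```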
Config File Format
{
"name": "myframework",
"description": "Complete framework knowledge from docs + code",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://docs.myframework.com/",
"extract_api": true,
"max_pages": 200
},
{
"type": "github",
"repo": "owner/myframework",
"include_code": true,
"code_analysis_depth": "deep"
},
{
"type": "pdf",
"pdf_path": "docs/manual.pdf",
"extract_tables": true
}
]
}
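A config like the one above can also be sanity-checked from a script before a long run. The following is a minimal sketch and not part of skill-seekers: the `validate_unified_config` helper is hypothetical, and its rules (known source types, merge modes, and one identifying key per source) are inferred from the examples on this page, so the tool itself may accept more.

```python
# Values that appear on this page; the real tool may support others.
KNOWN_SOURCE_TYPES = {"documentation", "github", "pdf"}
KNOWN_MERGE_MODES = {"rule-based", "ai-powered"}

# Key that identifies each source type, inferred from the example config (an assumption).
LOCATOR_KEY = {"documentation": "base_url", "github": "repo", "pdf": "pdf_path"}


def validate_unified_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    if not config.get("name"):
        problems.append("missing 'name'")
    if config.get("merge_mode") not in KNOWN_MERGE_MODES:
        problems.append(f"unknown merge_mode: {config.get('merge_mode')!r}")
    sources = config.get("sources") or []
    if not sources:
        problems.append("'sources' must list at least one source")
    for i, source in enumerate(sources):
        kind = source.get("type")
        if kind not in KNOWN_SOURCE_TYPES:
            problems.append(f"sources[{i}]: unknown type {kind!r}")
        elif not source.get(LOCATOR_KEY[kind]):
            problems.append(f"sources[{i}]: missing '{LOCATOR_KEY[kind]}'")
    return problems


config = {
    "name": "myframework",
    "description": "Complete framework knowledge from docs + code",
    "merge_mode": "rule-based",
    "sources": [
        {"type": "documentation", "base_url": "https://docs.myframework.com/"},
        {"type": "github", "repo": "owner/myframework"},
        {"type": "pdf", "pdf_path": "docs/manual.pdf"},
    ],
}
print(validate_unified_config(config) or "config looks OK")
```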
Options
Required (choose one)
- `--config CONFIG` - Load unified configuration
- `--repo-url URL` - GitHub repository URL
For GitHub Repos
- `--depth DEPTH` - Analysis depth: `basic` or `c3x`
- `--fetch-github-metadata` - Include issues, stars, forks
- `--output-dir DIR` - Output directory
Config-Based
- `--merge-mode MODE` - Conflict resolution: `rule-based` or `ai-powered`
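When the command is driven from a script, the options above can be assembled into an argument list. This is a minimal sketch under stated assumptions: `build_unified_command` is a hypothetical helper, and it treats "choose one" as meaning exactly one of `--config` or `--repo-url`, which this page does not confirm.

```python
import shlex


def build_unified_command(config=None, repo_url=None, depth=None,
                          fetch_github_metadata=False, output_dir=None,
                          merge_mode=None):
    """Build the argv list for `skill-seekers unified` (hypothetical helper)."""
    # Assumption: --config and --repo-url are mutually exclusive.
    if (config is None) == (repo_url is None):
        raise ValueError("pass exactly one of config or repo_url")
    if depth is not None and depth not in ("basic", "c3x"):
        raise ValueError("depth must be 'basic' or 'c3x'")

    argv = ["skill-seekers", "unified"]
    if config is not None:
        argv += ["--config", config]
    else:
        argv += ["--repo-url", repo_url]
    if depth is not None:
        argv += ["--depth", depth]
    if fetch_github_metadata:
        argv.append("--fetch-github-metadata")
    if output_dir is not None:
        argv += ["--output-dir", output_dir]
    if merge_mode is not None:
        argv += ["--merge-mode", merge_mode]
    return argv


argv = build_unified_command(
    repo_url="https://github.com/fastapi/fastapi",
    depth="c3x",
    fetch_github_metadata=True,
    output_dir="output/fastapi",
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` to run the command without going through a shell.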
Analysis Depths
Basic (Fast - 1-2 min)
skill-seekers unified \
--repo-url https://github.com/fastapi/fastapi \
--depth basic
- File structure
- Import relationships
- Entry points
- GitHub metadata (if `--fetch-github-metadata` is set)
C3.x (Comprehensive - 20-60 min)
skill-seekers unified \
--repo-url https://github.com/fastapi/fastapi \
--depth c3x \
--fetch-github-metadata
- Everything from basic
- C3.1: Design pattern detection
- C3.2: Test example extraction
- C3.3: How-to guide generation
- C3.4: Configuration analysis
- C3.7: Architectural patterns
- GitHub issues with solutions
Conflict Detection
Unified scraping automatically detects 4 types of conflicts:
1. Missing in Code (🔴 High Priority)
#### `initialize_auth(config: dict)`
🔴 **Missing in code**: Documented but not found in implementation
**Documentation:**
- Purpose: Initialize authentication system
- Parameters: config (dict) - Auth configuration
2. Missing in Docs (🟡 Medium Priority)
#### `initialize_auth(config: dict, timeout: int = 30)`
🟡 **Missing in docs**: Implemented but not documented
**Implementation:**
- File: src/auth.py:45
- Has additional parameter: timeout (int) = 30
3. Signature Mismatch (⚠️ Warning)
#### `move_local_x(delta: float)`
⚠️ **Conflict**: Documentation signature differs from implementation
**Documentation says:**
`def move_local_x(delta: float)`
**Code implementation:**
`def move_local_x(delta: float, snap: bool = False) -> None`
4. Description Mismatch (ℹ️ Info)
#### `get_user_data()`
ℹ️ **Conflict**: Different descriptions
**Documentation:** "Returns all user profile data"
**Code docstring:** "Returns user data excluding sensitive fields"
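The four conflict types can be illustrated with a small comparison routine. This is a sketch of the idea only, not the detection logic skill-seekers uses: `ApiEntry`, `detect_conflicts`, and the `refresh_token` entry are hypothetical, and signatures are compared as plain strings, which is much cruder than real signature parsing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiEntry:
    signature: str          # e.g. "move_local_x(delta: float)"
    description: str = ""


def detect_conflicts(documented: dict, implemented: dict) -> list[tuple[str, str]]:
    """Compare two {name: ApiEntry} maps and return (name, conflict_type) pairs."""
    conflicts = []
    for name in sorted(documented.keys() | implemented.keys()):
        doc, code = documented.get(name), implemented.get(name)
        if code is None:
            conflicts.append((name, "missing_in_code"))
        elif doc is None:
            conflicts.append((name, "missing_in_docs"))
        elif doc.signature != code.signature:
            conflicts.append((name, "signature_mismatch"))
        elif doc.description != code.description:
            conflicts.append((name, "description_mismatch"))
    return conflicts


documented = {
    "initialize_auth": ApiEntry("initialize_auth(config: dict)", "Initialize authentication system"),
    "move_local_x": ApiEntry("move_local_x(delta: float)"),
    "get_user_data": ApiEntry("get_user_data()", "Returns all user profile data"),
}
implemented = {
    "move_local_x": ApiEntry("move_local_x(delta: float, snap: bool = False) -> None"),
    "get_user_data": ApiEntry("get_user_data()", "Returns user data excluding sensitive fields"),
    "refresh_token": ApiEntry("refresh_token()"),  # hypothetical undocumented function
}

conflicts = detect_conflicts(documented, implemented)
for name, kind in conflicts:
    print(f"{name}: {kind}")
```

A name is checked for absence first, then for signature differences, and only then for description differences, so each API gets at most one conflict label in this sketch.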
Example Configs
React (Docs + GitHub)
{
"name": "react",
"description": "React docs + GitHub repo",
"merge_mode": "rule-based",
"sources": [
{
"type": "documentation",
"base_url": "https://react.dev/",
"max_pages": 300
},
{
"type": "github",
"repo": "facebook/react",
"code_analysis_depth": "deep"
}
]
}
FastAPI (Docs + GitHub + PDF)
{
"name": "fastapi",
"description": "Complete FastAPI knowledge",
"merge_mode": "ai-powered",
"sources": [
{
"type": "documentation",
"base_url": "https://fastapi.tiangolo.com/"
},
{
"type": "github",
"repo": "fastapi/fastapi",
"include_issues": true,
"max_issues": 100
},
{
"type": "pdf",
"pdf_path": "docs/fastapi_guide.pdf"
}
]
}
Output Structure
output/
└── {name}_unified_data/
    ├── SKILL.md              # Merged content with conflicts marked
    ├── references/
    │   ├── index.md
    │   ├── from_docs.md      # Documentation content
    │   ├── from_code.md      # Code analysis
    │   ├── from_pdf.md       # PDF content
    │   └── conflicts.md      # Conflict report
    └── c3_analysis_temp/     # C3.x analysis data
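After a run, a script can check which of these files were actually produced. The sketch below assumes the layout shown above, with `references/` nested under `{name}_unified_data/` inside an `output/` root; `unified_output_paths` is a hypothetical helper, not something the tool provides.

```python
from pathlib import Path


def unified_output_paths(name: str, output_root: str = "output") -> dict[str, Path]:
    """Return the expected output files for a unified skill (assumed layout)."""
    base = Path(output_root) / f"{name}_unified_data"
    refs = base / "references"
    return {
        "skill": base / "SKILL.md",
        "index": refs / "index.md",
        "from_docs": refs / "from_docs.md",
        "from_code": refs / "from_code.md",
        "from_pdf": refs / "from_pdf.md",
        "conflicts": refs / "conflicts.md",
    }


paths = unified_output_paths("fastapi")
for label, path in paths.items():
    status = "found" if path.exists() else "missing"
    print(f"{label:10} {path.as_posix()}  [{status}]")
```

A missing `from_pdf.md` is expected when the config has no PDF source, so a "missing" status is not necessarily an error.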
Time Estimates
| Configuration | Time |
|---|---|
| Docs only | 20-40 min |
| Docs + GitHub (basic) | 25-45 min |
| Docs + GitHub (c3x) | 40-80 min |
| Docs + GitHub (c3x) + PDF | 50-90 min |
Next Steps
- Three-Stream Architecture - Learn about the architecture
- C3.x Analysis - Deep code analysis
- Package Command - Package unified skills