PDF Documentation Scraping

Extract content from PDF documentation and convert it into AI skills. Advanced features include OCR, table extraction, parallel processing, and MCP integration.

Overview

Skill Seekers’ PDF scraper converts PDF documentation into AI skills with:

Text extraction from any PDF document
Code detection with language identification and quality scoring
Image extraction with configurable size filtering
Table extraction from well-formatted tables
Chapter detection for automatic organization
OCR support for scanned PDFs
Password support for encrypted PDFs
Parallel processing for roughly 3x faster extraction

Quick Start

Basic Usage

# Extract from PDF
skill-seekers pdf --input manual.pdf --output output/manual/

# With OCR for scanned PDFs
skill-seekers pdf --input scanned.pdf --output output/scanned/ --ocr

# Password-protected PDF
skill-seekers pdf --input encrypted.pdf --password "your-password"

# Extract tables
skill-seekers pdf --input data.pdf --extract-tables

# Parallel processing (3x faster)
skill-seekers pdf --input large.pdf --parallel --workers 8

Complete Workflow

# 1. Extract from PDF
skill-seekers pdf --input manual.pdf --output output/manual/

# 2. Enhance (optional)
skill-seekers enhance output/manual/

# 3. Package
skill-seekers package output/manual/ --target claude

# 4. Upload
skill-seekers upload manual-claude.zip

Usage Modes

Mode 1: Direct PDF (Quick)

skill-seekers pdf \
  --input manual.pdf \
  --output output/manual/ \
  --extract-images \
  --min-quality 6.0

Options not set on the command line fall back to the defaults:

Chunk size: 10 pages
Min quality: 5.0
Extract images: true
Chapter-based categorization

Mode 2: Config File (Recommended)

Create configs/manual_pdf.json:

{
  "name": "mymanual",
  "description": "My Manual documentation",
  "pdf_path": "docs/manual.pdf",
  "extract_options": {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 150
  },
  "categories": {
    "getting_started": ["introduction", "setup"],
    "api": ["api", "reference", "function"],
    "tutorial": ["tutorial", "example", "guide"]
  }
}

Run scraper:

skill-seekers pdf --config configs/manual_pdf.json

Mode 3: From Extracted JSON (Iteration)

# Step 1: Extract to JSON (one time)
skill-seekers pdf --input manual.pdf --extract-only --output manual.json

# Step 2: Build skill from JSON (fast, can iterate)
skill-seekers pdf --from-json manual.json --output output/manual/

Benefits:

Separate extraction and building
Fast iteration on categorization
No re-extraction needed

Advanced Features

OCR for Scanned PDFs

Extract text from scanned PDFs using Optical Character Recognition:

Installation:

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Python packages
pip install pytesseract Pillow

Usage:

# Basic OCR
skill-seekers pdf --input scanned.pdf --ocr

# OCR with other options
skill-seekers pdf --input scanned.pdf --ocr --parallel --workers 4

How it works:

Checks whether the text extracted from the page has fewer than 50 characters
Renders page as image if text is sparse
Runs Tesseract OCR on the image
Uses OCR text if longer than extracted text
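
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: `choose_page_text` and the `run_ocr` callable are hypothetical names, and stripping whitespace before counting characters is an assumption. In practice `run_ocr` would render the page to an image and pass it to `pytesseract.image_to_string`.

```python
from typing import Callable

SPARSE_TEXT_THRESHOLD = 50  # characters; mirrors the "< 50 characters" check


def choose_page_text(extracted: str, run_ocr: Callable[[], str],
                     threshold: int = SPARSE_TEXT_THRESHOLD) -> str:
    """Return the extracted text, or the OCR text when the page looks scanned."""
    # Pages with enough embedded text skip OCR entirely.
    if len(extracted.strip()) >= threshold:
        return extracted
    # Sparse page: run OCR (hypothetical callable that renders and reads the page).
    ocr_text = run_ocr()
    # Keep the OCR result only if it recovered more text than plain extraction.
    return ocr_text if len(ocr_text) > len(extracted) else extracted


# Usage with a stand-in OCR function instead of a real Tesseract call
print(choose_page_text("Page 3", lambda: "Chapter 1: Installing the toolchain on Linux"))
```

Passing the OCR step in as a callable means the slow rendering and recognition work only runs for pages that actually need it.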

Performance: ~2-5 seconds per page

Password-Protected PDFs

Handle encrypted PDFs:

# Basic usage
skill-seekers pdf --input encrypted.pdf --password mypassword

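# Environment variable (keeps the literal password out of the command you type)
export PDF_PASSWORD="mypassword"
skill-seekers pdf --input encrypted.pdf --password "$PDF_PASSWORD"

Security note: The password is passed on the command line, so it is visible in the process list while the command runs. This is also true when it comes from an environment variable, because the shell expands the variable before starting the process. The variable only keeps the literal password out of the skill-seekers command itself.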

Table Extraction

Extract tables from PDFs:

# Extract tables
skill-seekers pdf --input data.pdf --extract-tables

# Tables included in output
# Formatted as markdown tables in reference files

Example output:

## Data Tables

### Table 1 (Page 5)
| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Data 1   | Data 2   | Data 3   |

Works best with well-formatted tables. Tables with complex merged cells may not extract cleanly.
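
As a rough illustration of how extracted rows could be turned into the markdown shown above, here is a minimal sketch. The function name `table_to_markdown` and the list-of-rows input shape are assumptions made for this example; the tool's internal representation may differ.

```python
from typing import List


def table_to_markdown(rows: List[List[str]]) -> str:
    """Format a table (first row is the header) as a markdown pipe table."""
    if not rows:
        return ""
    header, *body = rows
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    for row in body:
        # Pad or trim each row so it has exactly one cell per header column.
        cells = (list(row) + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


print(table_to_markdown([
    ["Header 1", "Header 2", "Header 3"],
    ["Data 1", "Data 2", "Data 3"],
]))
```

Padding short rows keeps the output a valid table even when a cell was lost during extraction, which is one reason merged cells tend to need manual review.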

Parallel Processing

Process pages in parallel for roughly 3x faster extraction:

# Auto-detect CPU count
skill-seekers pdf --input large.pdf --parallel

# Specify worker count
skill-seekers pdf --input large.pdf --parallel --workers 8

Performance:

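| Pages | Sequential | Parallel (4 workers) | Parallel (8 workers) |
|-------|------------|----------------------|----------------------|
| 50    | 25s        | 10s (2.5x)           | 8s (3.1x)            |
| 100   | 50s        | 18s (2.8x)           | 15s (3.3x)           |
| 500   | 4m 10s     | 1m 30s (2.8x)        | 1m 15s (3.3x)        |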

Note: Parallel processing only activates for PDFs with more than 5 pages.
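
The behavior described in this section (auto-detected worker count, sequential fallback for short documents) can be sketched as follows. This is an illustration only: `extract_pages` and `extract_page` are hypothetical names, and the sketch uses a thread pool for simplicity, whereas the tool itself may use processes or a different mechanism.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional

MIN_PAGES_FOR_PARALLEL = 6  # parallel mode only applies to more than 5 pages


def extract_pages(page_numbers: List[int], extract_page: Callable[[int], str],
                  parallel: bool = False, workers: Optional[int] = None) -> List[str]:
    """Run extract_page over every page, in parallel when it is worthwhile."""
    if not parallel or len(page_numbers) < MIN_PAGES_FOR_PARALLEL:
        return [extract_page(n) for n in page_numbers]
    # Fall back to the CPU count when no worker count is given.
    max_workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so page order is preserved.
        return list(pool.map(extract_page, page_numbers))


# Usage with a stand-in per-page extractor
print(extract_pages(list(range(1, 9)), lambda n: f"text of page {n}", parallel=True, workers=4))
```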

Page Chunking and Chapters

Automatic Chapter Detection

Detects chapter boundaries automatically:

Recognizes H1/H2 headings
Patterns: “Chapter 1”, “Part 2”, “Section 3”
Numbered sections: “1. Introduction”

Chapter output:

{
  "chapters": [
    {
      "title": "Getting Started",
      "start_page": 1,
      "end_page": 12
    },
    {
      "title": "API Reference",
      "start_page": 13,
      "end_page": 45
    }
  ]
}

Page Chunking

Break large PDFs into manageable chunks:

# Default chunking (10 pages per chunk)
skill-seekers pdf --input manual.pdf

# Custom chunk size
skill-seekers pdf --input manual.pdf --chunk-size 20

# Disable chunking
skill-seekers pdf --input manual.pdf --chunk-size 0

Benefits:

Better memory efficiency for large PDFs
Respects chapter boundaries
Structured output for downstream processing
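
To make the chunk-size semantics concrete, here is a minimal sketch of fixed-size chunking. The function name `chunk_pages` is an assumption, and the sketch deliberately ignores chapter boundaries, which the real chunker is described as respecting.

```python
from typing import List, Tuple


def chunk_pages(total_pages: int, chunk_size: int = 10) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (start_page, end_page) ranges."""
    if total_pages <= 0:
        return []
    # A chunk size of 0 disables chunking: the whole document is one chunk.
    if chunk_size <= 0:
        return [(1, total_pages)]
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


print(chunk_pages(25))      # [(1, 10), (11, 20), (21, 25)]
print(chunk_pages(25, 0))   # [(1, 25)]
```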

Code Block Merging

Intelligently merges code blocks split across pages:

Example:

Page 5:  def calculate_total(items):
             total = 0
             for item in items:

Page 6:         total += item.price
             return total

Result: Combined into a single code block
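
The merging heuristic itself is not documented here. One plausible approach, shown purely as an assumption, is to join a block that runs to the bottom of one page with a same-language block that opens the next page. The record fields (`page`, `language`, `code`, `starts_page`, `ends_page`) and the function name are hypothetical.

```python
from typing import Dict, List


def merge_split_blocks(blocks: List[Dict]) -> List[Dict]:
    """Merge a block that ends a page with a same-language block starting the next page."""
    merged: List[Dict] = []
    for block in blocks:
        prev = merged[-1] if merged else None
        continues_prev = (
            prev is not None
            and prev["ends_page"]                        # previous block ran to the page bottom
            and block["starts_page"]                     # this block opens its page
            and block["page"] == prev["last_page"] + 1   # pages are consecutive
            and block["language"] == prev["language"]
        )
        if continues_prev:
            prev["code"] += "\n" + block["code"]
            prev["last_page"] = block["page"]
            prev["ends_page"] = block["ends_page"]
        else:
            # Copy the record so the caller's input is not mutated.
            merged.append({**block, "last_page": block["page"]})
    return merged


blocks = [
    {"page": 5, "language": "python", "code": "def calculate_total(items):\n    total = 0",
     "starts_page": False, "ends_page": True},
    {"page": 6, "language": "python", "code": "    return total",
     "starts_page": True, "ends_page": False},
]
print(merge_split_blocks(blocks)[0]["code"])
```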

Categorization

Chapter-Based (Automatic)

If PDF has detectable chapters:

Extracts chapter titles and page ranges
Creates one category per chapter
Assigns pages by page number

Advantages:

Automatic, no config needed
Respects document structure
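
Assigning pages by page number amounts to a range lookup over the detected chapters. A minimal sketch, assuming the chapter records use the `title`, `start_page`, and `end_page` fields from the chapter output shown earlier (the function name `chapter_for_page` is hypothetical):

```python
from typing import Dict, List, Optional


def chapter_for_page(page: int, chapters: List[Dict]) -> Optional[str]:
    """Return the title of the chapter whose page range contains `page`."""
    for chapter in chapters:
        if chapter["start_page"] <= page <= chapter["end_page"]:
            return chapter["title"]
    return None  # page falls outside every detected chapter


chapters = [
    {"title": "Getting Started", "start_page": 1, "end_page": 12},
    {"title": "API Reference", "start_page": 13, "end_page": 45},
]
print(chapter_for_page(13, chapters))  # API Reference
```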

Keyword-Based (Configurable)

Provide custom categories in config:

{
  "categories": {
    "getting_started": [
      "introduction",
      "getting started",
      "installation"
    ],
    "scripting": [
      "gdscript",
      "scripting",
      "code"
    ],
    "api": [
      "api",
      "class reference",
      "method"
    ]
  }
}

Scoring:

Keyword in page text: +1 point
Keyword in page heading: +2 points
Assigned to highest-scoring category

Advantages:

Flexible, customizable
Works without clear chapters
Combines related sections
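
The scoring rules above (+1 for a keyword in the page text, +2 for a keyword in the heading, highest score wins) can be sketched as follows. The function name `categorize_page`, the case-insensitive substring matching, and returning `None` when nothing matches are assumptions of this example rather than documented behavior.

```python
from typing import Dict, List, Optional


def categorize_page(text: str, heading: str,
                    categories: Dict[str, List[str]]) -> Optional[str]:
    """Score each category and return the best one, or None if nothing matches."""
    text_lower, heading_lower = text.lower(), heading.lower()
    scores: Dict[str, int] = {}
    for category, keywords in categories.items():
        score = 0
        for keyword in keywords:
            kw = keyword.lower()
            if kw in text_lower:
                score += 1  # keyword appears in the page text
            if kw in heading_lower:
                score += 2  # keyword appears in the page heading
        scores[category] = score
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


categories = {
    "getting_started": ["introduction", "installation"],
    "api": ["api", "method"],
}
print(categorize_page("This method returns the API version.", "API Reference", categories))
```

Because matching is by substring, short keywords such as "api" can also match inside longer words, so more specific keywords tend to give cleaner categories.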

Output Structure

Generated Files

output/
├── manual_extracted.json          # Raw extraction data
└── manual/                        # Skill directory
    ├── SKILL.md                   # Main skill file
    ├── references/                # Reference documentation
    │   ├── index.md               # Category index
    │   ├── getting_started.md     # Category 1
    │   ├── api.md                 # Category 2
    │   └── tutorial.md            # Category 3
    ├── scripts/                   # Empty (for user scripts)
    └── assets/                    # Assets directory
        └── images/                # Extracted images
            ├── manual_page5_img1.png
            └── manual_page12_img2.jpeg

SKILL.md Format

# Mymanual Documentation Skill

My Manual documentation

## When to use this skill

Use this skill when the user asks about mymanual documentation,
including API references, tutorials, examples, and best practices.

## What's included

This skill contains:

- **Getting Started**: 25 pages
- **Api**: 80 pages
- **Tutorial**: 45 pages

## Quick Reference

### Top Code Examples

**Example 1** (Quality: 8.5/10):

```python
def initialize_system():
    config = load_config()
    setup_logging(config)
    return System(config)
```

## Navigation

See `references/index.md` for complete documentation structure.

## Languages Covered

- python: 45 examples
- javascript: 32 examples
- shell: 8 examples

Config File Reference

Complete Example

{
  "name": "godot_manual",
  "description": "Godot Engine documentation from PDF manual",
  "pdf_path": "docs/godot_manual.pdf",
  "extract_options": {
    "chunk_size": 15,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 200,
    "ocr": false,
    "extract_tables": false,
    "parallel": false,
    "workers": 4
  },
  "categories": {
    "getting_started": [
      "introduction",
      "getting started",
      "installation",
      "first steps"
    ],
    "scripting": [
      "gdscript",
      "scripting",
      "code",
      "programming"
    ],
    "3d": [
      "3d",
      "spatial",
      "mesh",
      "shader"
    ]
  }
}

Field Reference

Required Fields

name (string): Skill identifier (lowercase, no spaces)
pdf_path (string): Path to PDF file

Optional Fields

description (string): Skill description
extract_options (object):
- chunk_size (number): Pages per chunk (default: 10)
- min_quality (number): Min code quality 0-10 (default: 5.0)
- extract_images (boolean): Extract images (default: true)
- min_image_size (number): Min image pixels (default: 100)
- ocr (boolean): Enable OCR (default: false)
- extract_tables (boolean): Extract tables (default: false)
- parallel (boolean): Parallel processing (default: false)
- workers (number): Worker count (default: 4)
categories (object): Keyword-based categorization
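
To show how the required fields and defaults above fit together, here is a minimal sketch of loading a config. It is not the tool's own loader: `load_config`, `DEFAULT_EXTRACT_OPTIONS`, and the error message are names and behavior assumed for this example. The default values are copied from the field reference.

```python
import json
from typing import Any, Dict

# Defaults copied from the field reference
DEFAULT_EXTRACT_OPTIONS: Dict[str, Any] = {
    "chunk_size": 10,
    "min_quality": 5.0,
    "extract_images": True,
    "min_image_size": 100,
    "ocr": False,
    "extract_tables": False,
    "parallel": False,
    "workers": 4,
}
REQUIRED_FIELDS = ("name", "pdf_path")


def load_config(raw: str) -> Dict[str, Any]:
    """Parse a config JSON string, check required fields, and fill in defaults."""
    config = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in config]
    if missing:
        raise ValueError(f"Missing required field(s): {', '.join(missing)}")
    # User-supplied options override the defaults.
    config["extract_options"] = {
        **DEFAULT_EXTRACT_OPTIONS,
        **config.get("extract_options", {}),
    }
    return config


cfg = load_config('{"name": "mymanual", "pdf_path": "docs/manual.pdf", '
                  '"extract_options": {"min_quality": 6.0}}')
print(cfg["extract_options"]["min_quality"], cfg["extract_options"]["chunk_size"])
```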

MCP Integration

Using MCP Tool

The scrape_pdf MCP tool provides PDF scraping through Model Context Protocol:

# Assumes `mcp` is an already-connected MCP client session (not defined in this snippet)
# Mode 1: Config file
result = await mcp.call_tool("scrape_pdf", {
    "config_path": "configs/manual_pdf.json"
})

# Mode 2: Direct PDF
result = await mcp.call_tool("scrape_pdf", {
    "pdf_path": "manual.pdf",
    "name": "mymanual",
    "description": "My Manual Docs"
})

# Mode 3: From JSON
result = await mcp.call_tool("scrape_pdf", {
    "from_json": "output/manual_extracted.json"
})

Complete MCP Workflow

# 1. Scrape PDF
scrape_result = await mcp.call_tool("scrape_pdf", {
    "pdf_path": "docs/api_manual.pdf",
    "name": "api_manual"
})

# 2. Package skill
package_result = await mcp.call_tool("package_skill", {
    "skill_dir": "output/api_manual/",
    "auto_upload": True
})

See: MCP Setup for MCP server configuration

Combined Usage Examples

Maximum Performance

skill-seekers pdf \
  --input docs/manual.pdf \
  --output output/manual/ \
  --extract-images \
  --extract-tables \
  --parallel \
  --workers 8 \
  --min-quality 5.0

Scanned PDF with Tables

skill-seekers pdf \
  --input docs/scanned.pdf \
  --output output/scanned/ \
  --ocr \
  --extract-tables \
  --parallel \
  --workers 4

Encrypted PDF with All Features

skill-seekers pdf \
  --input docs/encrypted.pdf \
  --output output/encrypted/ \
  --password mypassword \
  --extract-images \
  --extract-tables \
  --parallel \
  --workers 8

Performance

Benchmarks

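| PDF Size | Pages | Extraction | Building | Total  |
|----------|-------|------------|----------|--------|
| Small    | 50    | 30s        | 5s       | 35s    |
| Medium   | 200   | 2m         | 15s      | 2m 15s |
| Large    | 500   | 5m         | 45s      | 5m 45s |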

Feature Overhead

| Feature              | Time Impact     | Memory Impact  |
|----------------------|-----------------|----------------|
| OCR                  | +2-5s per page  | +50MB per page |
| Table extraction     | +0.5s per page  | +10MB          |
| Image extraction     | +0.2s per image | Varies         |
| Parallel (8 workers) | -66% total time | +8x memory     |
| Caching              | -50% on re-run  | +100MB         |

Optimization Tips

Use --from-json for iteration
- Extract once, build many times
- Test categorization without re-extraction
Adjust chunk size
- Larger chunks: Faster extraction
- Smaller chunks: Better chapter detection
Filter aggressively
- Higher min_quality: Fewer low-quality code blocks
- Higher min_image_size: Fewer small images
Parallel processing
- Use --workers equal to CPU cores
- Not recommended with very large images (memory intensive)

Troubleshooting

No Categories Created

Problem: Only “content” or “other” category

Solutions:

# Check extracted chapters
jq '.chapters' output/manual_extracted.json

# Add keyword categories to config if chapters empty
# Or accept single category for small PDFs

Low-Quality Code Blocks

Problem: Too many poor code examples

Solution: Raise the min_quality threshold in the config:

{
  "extract_options": {
    "min_quality": 7.0
  }
}

Images Not Extracted

Problem: No images in assets/images/

Solution: Make sure image extraction is enabled and lower the min_image_size threshold:

{
  "extract_options": {
    "extract_images": true,
    "min_image_size": 50
  }
}

OCR Not Working

Problem: OCR fails or gives poor results

Solutions:

# Check Tesseract installed
tesseract --version

# Install if missing
# Ubuntu: sudo apt-get install tesseract-ocr
# macOS: brew install tesseract

# Try with verbose mode
skill-seekers pdf --input scanned.pdf --ocr --verbose

Password Errors

Problem: Password not accepted

Solutions:

# Check password is correct
# Try with quotes
skill-seekers pdf --input file.pdf --password "my password"

# Use environment variable
export PDF_PASSWORD="my password"
skill-seekers pdf --input file.pdf --password "$PDF_PASSWORD"

Best Practices

For Large PDFs (500+ pages)

Use parallel processing with --workers 8
Extract to JSON first, then build skill
Monitor system resources (RAM, CPU)
Use larger chunk sizes (20-50 pages)

For Scanned PDFs

Use OCR with parallel processing
Test on sample pages first
Use --verbose to monitor OCR performance
Expect 2-5x slower processing

For Encrypted PDFs

Use environment variable for password
Clear shell history after use
Don’t commit passwords to version control

For PDFs with Tables

Enable table extraction with --extract-tables
Check table quality in output JSON
Manual review recommended for critical data
Works best with well-formatted tables

Next Steps

Tutorials:

Extracting PDFs - Step-by-step tutorial
Multi-Source Skills - Combine PDF with docs and GitHub

Manual:

Unified Scraping - Combine PDF with other sources
MCP Setup - Configure MCP server

CLI Reference:

pdf command - Complete command reference