pdf - PDF Extraction

Extract content from PDF files and convert it into AI skills.

Basic Usage

skill-seekers pdf [OPTIONS]

Quick Examples

# Basic PDF extraction
skill-seekers pdf --pdf docs/manual.pdf --name myskill

# Scanned PDFs with OCR
skill-seekers pdf --pdf docs/scanned.pdf --name myskill --ocr

# Password-protected PDFs
skill-seekers pdf --pdf docs/encrypted.pdf --name myskill --password mypassword

# Table extraction with parallel processing (8 workers)
skill-seekers pdf --pdf docs/manual.pdf --name myskill \
    --extract-tables \
    --parallel \
    --workers 8

Options

Required

--pdf FILE - Path to PDF file
--name NAME - Skill name

Optional

--description DESC - Skill description
--ocr - Enable OCR for scanned PDFs
--password PASS - Password for encrypted PDFs
--extract-tables - Extract complex tables
--parallel - Enable parallel processing (roughly 3x faster on large PDFs)
--workers N - Number of CPU cores to use (default: 4)
--output DIR - Output directory

Features

Basic Extraction

✅ Text Extraction - All text content
✅ Code Detection - Recognizes code blocks in 20+ languages
✅ Image Extraction - Embedded images
✅ Metadata - Title, author, creation date

Advanced Features

✅ OCR Support - For scanned documents
✅ Password Protection - Handle encrypted PDFs
✅ Table Extraction - Complex table structures
✅ Parallel Processing - 3x faster for large PDFs
✅ Intelligent Caching - 50% faster on re-runs

OCR Support

Prerequisites

# Install OCR dependencies
pip install pytesseract Pillow

# Install Tesseract OCR engine
# macOS
brew install tesseract

# Linux
sudo apt install tesseract-ocr

# Windows
# Download from: https://github.com/UB-Mannheim/tesseract/wiki

Usage

skill-seekers pdf --pdf scanned_manual.pdf --name myskill --ocr

Parallel Processing

For large PDFs, enable parallel processing:

# Enable parallel processing with the default worker count (see --workers)
skill-seekers pdf --pdf large_manual.pdf --name myskill --parallel

# Specify worker count
skill-seekers pdf --pdf large_manual.pdf --name myskill --parallel --workers 16

Performance:

Without parallel: 15-20 minutes for a 500-page PDF
With parallel (8 cores): 5-7 minutes for a 500-page PDF
That is roughly a 3x speedup.
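
Because the worker count is expressed in CPU cores, it can be derived at run time instead of being hard-coded. The following is a minimal sketch in POSIX shell; `detect_workers` is a hypothetical helper, not part of skill-seekers, and it falls back to 4 (the documented default) when the core count cannot be determined.

```shell
# Hypothetical helper, not part of skill-seekers: print a worker count
# based on the number of online CPU cores.
detect_workers() {
    # getconf _NPROCESSORS_ONLN is available on Linux and macOS.
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
    case "$n" in
        # Empty, non-numeric, or zero: fall back to the documented default.
        ''|*[!0-9]*|0) n=4 ;;
    esac
    echo "$n"
}
```

It could then be passed on the command line as --workers "$(detect_workers)" together with --parallel.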

Table Extraction

Extract complex tables from PDFs:

skill-seekers pdf --pdf data_report.pdf --name myskill --extract-tables

Supports:

Multi-column tables
Merged cells
Nested headers
Formatted data (numbers, dates, currency)

Output Structure

output/
└── {name}/
    ├── SKILL.md              # Main content
    ├── references/
    │   ├── index.md
    │   ├── extracted_text.md
    │   ├── tables.md         # If --extract-tables
    │   └── code_blocks.md
    └── assets/
        ├── image_001.png     # Extracted images
        └── ...
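
After a run, the layout above can be checked with a few file tests. This is only a sketch: `verify_skill_output` is a hypothetical helper, and treating SKILL.md and references/index.md as always present is an assumption, since the other files may depend on the PDF's content and the options used.

```shell
# Hypothetical helper, not part of skill-seekers: check that a generated
# skill directory contains the core files from the output structure.
verify_skill_output() {
    skill_dir=$1
    status=0
    # Assumption: these two files are written on every successful run.
    for path in SKILL.md references/index.md; do
        if [ ! -f "$skill_dir/$path" ]; then
            echo "missing: $skill_dir/$path" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, verify_skill_output output/myskill returns a non-zero status and lists what is missing if the extraction did not complete.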

Advanced Examples

Complete Extraction

skill-seekers pdf --pdf technical_manual.pdf --name technical \
    --description "Technical manual from PDF" \
    --ocr \
    --extract-tables \
    --parallel \
    --workers 8 \
    --output output/technical

Password-Protected PDF

skill-seekers pdf --pdf encrypted.pdf --name secure \
    --password "my-secure-password" \
    --extract-tables

Batch Processing

# Process multiple PDFs
for pdf in docs/*.pdf; do
    [ -e "$pdf" ] || continue    # skip the literal pattern when no PDFs match
    name=$(basename "$pdf" .pdf)
    skill-seekers pdf --pdf "$pdf" --name "$name" --parallel
done
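
For larger batches it can help to count failures instead of letting them scroll past. The sketch below wraps the same loop in a function. `batch_convert` and the `SKILL_SEEKERS_CMD` variable are hypothetical names introduced here, and the failure check assumes that skill-seekers exits with a non-zero status when a conversion fails.

```shell
# Command to run; override (for example SKILL_SEEKERS_CMD=echo) for a dry run.
SKILL_SEEKERS_CMD=${SKILL_SEEKERS_CMD:-skill-seekers}

# Hypothetical helper: convert every PDF in a directory and count failures.
batch_convert() {
    dir=$1
    failed=0
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # glob matched nothing
        name=$(basename "$pdf" .pdf)
        # Assumption: a non-zero exit status means the conversion failed.
        if ! "$SKILL_SEEKERS_CMD" pdf --pdf "$pdf" --name "$name" --parallel; then
            echo "FAILED: $pdf" >&2
            failed=$((failed + 1))
        fi
    done
    echo "failures: $failed"
    [ "$failed" -eq 0 ]
}
```

Calling batch_convert docs processes docs/*.pdf and returns a non-zero status if any file failed, which makes it usable in scripts and CI jobs.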

Time Estimates

PDF Size	Pages	Without Parallel	With Parallel (8 cores)
Small	10-50	1-2 min	30 sec - 1 min
Medium	100-200	5-10 min	2-3 min
Large	500+	15-20 min	5-7 min

Troubleshooting

OCR Not Working

# Check Tesseract installation
tesseract --version

# If not found, install:
# macOS: brew install tesseract
# Linux: sudo apt install tesseract-ocr
# Windows: download from https://github.com/UB-Mannheim/tesseract/wiki
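
Both halves of the OCR setup (the Tesseract binary and the Python packages) can be checked in one step before running with --ocr. This is a minimal sketch; `have_cmd` and `ocr_preflight` are hypothetical helpers, and it assumes that python3 is the interpreter skill-seekers was installed into.

```shell
# Hypothetical helper: succeed if a command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: report missing OCR prerequisites on stderr.
ocr_preflight() {
    missing=0
    if ! have_cmd tesseract; then
        echo "missing: tesseract binary" >&2
        missing=1
    fi
    # Pillow is imported as PIL. Assumption: python3 is the right interpreter.
    if ! python3 -c 'import pytesseract, PIL' >/dev/null 2>&1; then
        echo "missing: pytesseract or Pillow" >&2
        missing=1
    fi
    return "$missing"
}
```

Running ocr_preflight before an --ocr extraction gives a single pass/fail result plus a list of what still needs to be installed.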

Tables Not Extracting

Some PDFs embed tables as images, which text-based extraction cannot read. Combine --ocr with --extract-tables:

skill-seekers pdf --pdf doc.pdf --name myskill --ocr --extract-tables

Memory Issues

For very large PDFs, reduce the worker count to lower memory use:

skill-seekers pdf --pdf huge.pdf --name myskill --parallel --workers 2
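
If the right worker count is not known in advance, one option is to start high and halve it after each failed attempt. The sketch below does that. `run_with_backoff` and `SKILL_SEEKERS_CMD` are hypothetical names, and the approach assumes that an out-of-memory run ends with a non-zero exit status, which may not hold on every system.

```shell
# Command to run; override for testing or a dry run.
SKILL_SEEKERS_CMD=${SKILL_SEEKERS_CMD:-skill-seekers}

# Hypothetical helper: retry an extraction, halving --workers after each failure.
run_with_backoff() {
    pdf=$1
    name=$2
    workers=${3:-8}
    while [ "$workers" -ge 1 ]; do
        if "$SKILL_SEEKERS_CMD" pdf --pdf "$pdf" --name "$name" \
            --parallel --workers "$workers"; then
            echo "succeeded with $workers workers"
            return 0
        fi
        echo "failed with $workers workers, retrying with fewer" >&2
        workers=$((workers / 2))    # integer division: 8, 4, 2, 1, then stop
    done
    return 1
}
```

For example, run_with_backoff huge.pdf myskill 8 tries 8, 4, 2, and 1 workers in turn and stops at the first attempt that succeeds.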

Next Steps

Unified Command - Combine PDFs with other sources
Package Command - Package your skills
Features: PDF - Advanced PDF features