LangChain
将任何来源转换为用于 RAG 流水线的 LangChain 文档。
快速开始
# 从文档获取
skill-seekers scrape --format langchain --config configs/react.json
# 从 GitHub 仓库获取
skill-seekers scrape --format langchain --github https://github.com/facebook/react
# 从 PDF 获取
skill-seekers scrape --format langchain --pdf ./manual.pdf
# 从本地代码库获取
skill-seekers analyze --directory ./my-project --format langchain
您会得到什么
- 完整的元数据的 LangChain Document 对象
- 分类内容(API、指南、教程等)
- 来源追踪(URL、文件路径、章节)
- 带有语言检测的 代码示例
- 用于过滤和检索的 丰富元数据
Python 示例
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document
import json
# 从 Skill Seekers 输出加载文档
def load_documents(output_dir):
"""从 Skill Seekers LangChain 输出加载文档。"""
documents = []
with open(f"{output_dir}/documents.json", "r") as f:
data = json.load(f)
for doc in data:
documents.append(Document(
page_content=doc["content"],
metadata=doc["metadata"]
))
return documents
# 加载文档
documents = load_documents("output/react-langchain/")
print(f"已加载 {len(documents)} 个文档")
# 创建向量存储
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
documents,
embeddings,
collection_name="react-docs"
)
# 查询
results = vectorstore.similarity_search("How do I use useState?")
print(results[0].page_content)
文档结构
每个文档包含:
{
"page_content": "...",
"metadata": {
"source": "https://react.dev/docs/hooks-intro",
"title": "Introducing Hooks",
"category": "api",
"language": "javascript"
}
}
完整的 RAG 流水线
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
# 使用现代 ChatOpenAI 创建 QA 链
qa_chain = RetrievalQA.from_chain_type(
llm=ChatOpenAI(model="gpt-4o"),
chain_type="stuff",
retriever=vectorstore.as_retriever()
)
# 提问
response = qa_chain.invoke({"query": "What are React Hooks?"})
print(response["result"])