Strengthening GraphRAG Global Retrieval: Dry-Run Cost Estimation + Hybrid FTS Retrieval

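Building on the Map-Reduce architecture for global retrieval, this post adds three practical improvements: a dry-run that estimates LLM cost (with token counts from tiktoken), a fixed temperature to limit nondeterminism, and hybrid FTS + vector retrieval to improve evidence recall. Two further additions follow: a Voyage reranker and Leiden community detection.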

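Why these improvements are needed

The core flow of global retrieval is:

Retrieve topic communities from graph_communities
Map over each community: fetch evidence, then have the LLM generate a partial answer
Reduce: aggregate the partial answers into the final answer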

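This flow makes multiple LLM calls; with many communities or a large volume of evidence, the token cost can exceed expectations. In addition, when evidence retrieval relies on vector search alone, it is weak at exact keyword matching.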
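The flow above can be sketched as follows. This is a minimal illustration, not the project's implementation: `global_answer`, `get_evidence`, and `fake_llm` are hypothetical names, and the LLM is replaced by a stub so the call count is visible.

```python
from typing import Callable


def global_answer(
    question: str,
    communities: list[dict],
    get_evidence: Callable[[dict, str], list[str]],
    llm: Callable[[str], str],
) -> str:
    """Map over communities, then reduce the partial answers into one."""
    partials = []
    for community in communities:
        evidence = get_evidence(community, question)
        if not evidence:
            continue  # no evidence, so no LLM call for this community
        prompt = f"Question: {question}\nEvidence:\n" + "\n".join(evidence)
        partials.append(llm(prompt))  # map: one LLM call per community
    if not partials:
        return ""
    reduce_prompt = f"Question: {question}\nPartial answers:\n" + "\n".join(partials)
    return llm(reduce_prompt)  # reduce: one final LLM call


# Stub LLM that records every prompt it receives (hypothetical, for illustration).
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer-{len(calls)}"


communities = [
    {"id": 1, "chunks": ["built REST APIs"]},
    {"id": 2, "chunks": []},
    {"id": 3, "chunks": ["tuned PostgreSQL", "ran Kafka consumers"]},
]
result = global_answer("backend", communities, lambda c, q: c["chunks"], fake_llm)
print(len(calls), result)  # 3 answer-3
```

With N communities that have evidence, the sketch makes N + 1 LLM calls, which is why cost grows with the number of matched communities.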

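So we add:

Dry-run + tiktoken: estimate the cost before deciding whether to run
Temperature control: keep results stable across runs
Hybrid FTS + index-creation CLI: combine BM25 full-text search to improve recall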

Improvement 1: Dry-run cost estimation (tiktoken version)

tiktoken dependency

Add to pyproject.toml:

"tiktoken>=0.7",

Token counting functions

Added token counting to [global_retrieval.py](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L17-L47):

def _count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens using tiktoken for accurate estimation."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        # tiktoken missing or model unknown: rough fallback of ~4 characters per token
        return len(text) // 4

def _estimate_community_summary_tokens(nodes: list[str]) -> int:
    prompt_template = 200  # fixed allowance for the prompt template
    nodes_text = ", ".join(nodes[:40])  # only the first 40 nodes are counted
    output_estimate = 300  # fixed allowance for the generated summary
    return prompt_template + _count_tokens(nodes_text) + output_estimate

CLI usage

uv run career-kb graph build-communities --max-communities 10 --dry-run
# Output (token counts from tiktoken):
# communities: 5
# estimated_tokens: 2567
# estimated_cost_usd: $0.0077

uv run career-kb graph query "backend" --mode global --dry-run
# Output:
# communities_matched: 5
# estimated_tokens: 5647
# estimated_cost_usd: $0.0169
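Both sample outputs above are consistent with a flat rate of $3 per million tokens (2567 × 3 / 1,000,000 ≈ 0.0077). The sketch below reproduces that arithmetic. The rate is inferred from the sample output rather than taken from the project's code, and `estimate_cost_usd` is a hypothetical helper name.

```python
def estimate_cost_usd(total_tokens: int, usd_per_million_tokens: float = 3.0) -> float:
    """Convert a token estimate into USD at a flat per-million-token rate.

    The default rate is an assumption inferred from the sample CLI output.
    """
    return total_tokens * usd_per_million_tokens / 1_000_000


for tokens in (2567, 5647):
    print(f"estimated_tokens: {tokens}")
    print(f"estimated_cost_usd: ${estimate_cost_usd(tokens):.4f}")
# estimated_tokens: 2567
# estimated_cost_usd: $0.0077
# estimated_tokens: 5647
# estimated_cost_usd: $0.0169
```

Keeping the rate as a parameter makes it easy to re-run the estimate when the model or its pricing changes.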

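Improvement 2: Temperature control

Confirmed that get_chat_model() in [reflection.py](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/llm/reflection.py#L71-94) already sets temperature=0.3 globally, so every community summary and map-reduce call uses the same low temperature. This reduces run-to-run variation; it does not make output strictly deterministic, because the model still samples at any temperature above 0.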
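To see why a lower temperature stabilizes output, here is a small sketch of temperature-scaled softmax, the standard way temperature reshapes a sampling distribution. The logits are made-up values for illustration and have nothing to do with the project's model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
probs_default = softmax_with_temperature(logits, 1.0)
probs_low = softmax_with_temperature(logits, 0.3)
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_low])
```

At temperature 0.3 the top candidate takes a much larger share of the probability mass than at 1.0, so repeated runs pick the same token more often, though not always.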

Improvement 3: Hybrid FTS + automatic index creation

FTS index CLI

Added a create-fts-index command in [graph_cli.py](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/cli/graph_cli.py#L317-L336):

uv run career-kb graph create-fts-index
# Output: ✓ FTS index ready on resume_chunks.text

uv run career-kb graph create-fts-index --table graph_communities --column summary_short

ensure_fts_index function

Added to [lancedb.py](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/db/lancedb.py#L69-L93):

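import lancedb

DB_URI = "data/lancedb"  # hypothetical path; the original snippet does not show it

def ensure_fts_index(table_name: str = "resume_chunks", column: str = "text") -> bool:
    """Ensure an FTS index exists on the specified table column."""
    # Assumption: the original snippet omits how `table` is obtained.
    table = lancedb.connect(DB_URI).open_table(table_name)
    try:
        # Probe: an FTS query raises if no FTS index exists yet.
        table.search("test", query_type="fts").limit(1).to_list()
        return True  # index already exists
    except Exception:
        pass  # fall through and try to create the index
    try:
        table.create_fts_index(column, replace=True)
        return True
    except Exception:
        return False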

Hybrid retrieval

[hybrid_search_materials](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/db/lancedb.py#L132-L203) combines FTS + vector search:

uv run career-kb graph query "Kubernetes deployment" --mode global --use-fts --json

The FTS weight defaults to 0.3 and the vector weight to 0.7
When FTS is unavailable, retrieval automatically falls back to pure vector search

Improvement 4: Voyage reranker integration

rerank_documents function

Added a reranker to [embedding.py](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/rag/embedding.py#L62-L112):

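def rerank_documents(
    query: str,
    documents: list[dict],
    text_key: str = "text",
    top_k: int | None = None,
    model: str = "rerank-2.5",
) -> list[dict]:
    """Rerank documents using the Voyage AI reranker."""
    if not documents:
        return []  # nothing to rerank; skip the API call
    client = get_client()
    reranking = client.rerank(
        query=query,
        documents=[doc.get(text_key, "") for doc in documents],
        model=model,
        top_k=top_k or len(documents),
    )
    # Each result carries the index of the input document and its relevance score;
    # return copies of the originals in reranked order with "rerank_score" added.
    return [
        {**documents[r.index], "rerank_score": r.relevance_score}
        for r in reranking.results
    ]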

CLI usage

uv run career-kb graph query "distributed systems" \
  --mode global --use-rerank --json

First retrieve 3x the candidate count with vector/FTS search
Then rerank the candidates with Voyage rerank-2.5
Return the top_k results, each with a rerank_score attached
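The three steps above can be sketched as an over-fetch-then-rerank helper. The retriever and scorer here are stand-ins (a list slice and a word-overlap count) so the example runs without any API key; `retrieve_and_rerank` and `overlap_score` are hypothetical names, and in the real pipeline the scorer would be the Voyage reranker.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    retrieve: Callable[[str, int], list[dict]],
    score: Callable[[str, str], float],
    top_k: int = 5,
    oversample: int = 3,
) -> list[dict]:
    """Fetch top_k * oversample candidates, rescore them, keep the best top_k."""
    candidates = retrieve(query, top_k * oversample)
    scored = [{**doc, "rerank_score": score(query, doc["text"])} for doc in candidates]
    scored.sort(key=lambda d: d["rerank_score"], reverse=True)
    return scored[:top_k]


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: number of query words that appear in the text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


corpus = [
    {"text": "frontend styling with CSS"},
    {"text": "built distributed systems on Kubernetes"},
    {"text": "systems administration"},
    {"text": "designed distributed cache"},
]
top = retrieve_and_rerank(
    "distributed systems",
    retrieve=lambda q, n: corpus[:n],  # stand-in for vector/FTS retrieval
    score=overlap_score,
    top_k=1,
)
print(top)
# [{'text': 'built distributed systems on Kubernetes', 'rerank_score': 2.0}]
```

Over-fetching matters because the first-stage ranking is approximate: the best document for the query was second in retrieval order here and would have been cut with a plain top-1.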

Improvement 5: Leiden community detection

igraph + leidenalg

New dependencies:

"igraph>=0.11",
"leidenalg>=0.10",

_leiden_communities function

Added the Leiden algorithm to [global_retrieval.py](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L112-L130):

from typing import Any

def _leiden_communities(skill_graph: dict[str, Any], seed: int = 42) -> list[set[str]]:
    """Detect communities using the Leiden algorithm with a fixed seed for reproducibility."""
    import leidenalg

    g, skills = _build_igraph(skill_graph)
    partition = leidenalg.find_partition(
        g,
        leidenalg.ModularityVertexPartition,
        seed=seed,  # fixed seed keeps the partition reproducible
    )
    # ...

CLI usage

# Leiden is used by default (more stable)
uv run career-kb graph build-communities --max-communities 10

# Pin the seed for reproducibility
uv run career-kb graph build-communities --leiden-seed 123

# Fall back to NetworkX
uv run career-kb graph build-communities --no-use-leiden

--use-leiden / --no-use-leiden: Leiden is enabled by default
--leiden-seed: sets the random seed (default 42)
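ModularityVertexPartition tells Leiden to optimize modularity, which for an undirected, unweighted graph is Q = Σ_c [ e_c / m − (d_c / 2m)² ], where m is the edge count, e_c the number of edges inside community c, and d_c the total degree of its nodes. The sketch below computes Q in plain Python to show what the optimizer is scoring. It is an illustration only: the `modularity` helper and the toy skill graph are hypothetical and assume no self-loops.

```python
def modularity(edges: list[tuple[str, str]], communities: list[set[str]]) -> float:
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    degree: dict[str, int] = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for community in communities:
        # Edges with both endpoints inside this community.
        internal = sum(1 for u, v in edges if u in community and v in community)
        total_degree = sum(degree.get(node, 0) for node in community)
        q += internal / m - (total_degree / (2 * m)) ** 2
    return q


# Two skill triangles joined by a single bridge edge (made-up data).
edges = [
    ("python", "django"), ("django", "postgres"), ("python", "postgres"),
    ("react", "typescript"), ("typescript", "css"), ("react", "css"),
    ("postgres", "react"),
]
split = [{"python", "django", "postgres"}, {"react", "typescript", "css"}]
merged = [{"python", "django", "postgres", "react", "typescript", "css"}]
print(round(modularity(edges, split), 4))   # 0.3571
print(round(modularity(edges, merged), 4))  # 0.0
```

Splitting the graph at the bridge scores higher than lumping everything together, which is the kind of partition a modularity optimizer prefers.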

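Summary

This post rounds out the improvements to GraphRAG global retrieval:

Feature	CLI option	Effect
Cost estimation	`--dry-run`	Token counts from tiktoken and the estimated cost in USD
FTS index creation	`create-fts-index`	Builds the full-text search index in one command
Hybrid retrieval	`--use-fts`	Adds FTS to improve recall on exact matches
Voyage Reranker	`--use-rerank`	Cross-encoder reranking for better relevance
Leiden community detection	`--use-leiden`	More stable clustering, reproducible with a fixed seed
Nondeterminism control	Built-in `temperature=0.3`	Less run-to-run variation in results

Every item previously listed as "to be improved" has now been implemented.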

Career Knowledge Base is a local-first resume knowledge base system built with Python + LanceDB + Voyage AI + LangChain.
