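GraphRAG Global Retrieval: Community Detection + Community Summaries + Map-Reduce Aggregation

When local GraphRAG can only answer questions strongly tied to a handful of seeds, global retrieval partitions all of your experience into topic communities, retrieves at the topic level first, returns to the material level for evidence, and finally aggregates everything into a structured summary with map-reduce.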

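Why Global Retrieval

Our original GraphRAG follows a local retrieval route: run a vector search over resume_chunks to find seed materials, extract seed skills from them, expand N hops through the knowledge graph, and then run a second retrieval with the expanded skills.

This pipeline is strong on local questions such as "have I ever done X?", but it struggles with questions that require aggregation across topics: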

What are my core backend strengths?
What kinds of observability work have I done?
How would I summarize my pattern of impact across roles?

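The reason is straightforward: local GraphRAG depends on which seeds it hits. A broad question may hit seeds that are scattered or skewed toward one area, and the subsequent expansion inherits that bias.

So we add a global retrieval path: use community detection to partition the skill graph into topic communities, generate a retrievable summary (community summary) for each one, retrieve over those summaries at query time, fetch evidence from the material store for each matched community, and aggregate the answer with map-reduce.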

Target Architecture (Two-Tier Retrieval + Map-Reduce)

query
  → retrieve community summaries（LanceDB: graph_communities）
  → for each selected community:
       retrieve evidence（LanceDB: resume_chunks）
       map: generate a per-community answer + evidence ids
  → reduce: merge all community answers into the final summary (with evidence ids)

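The key difference: the first retrieval tier moves up from the material level to the topic level. Think of graph_communities as the index to a map of your experience, and resume_chunks as the citable source evidence.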
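The flow above can be sketched as a small orchestration function. This is only an illustration under stated assumptions, not the repo's global_query: the four callables (search_communities, retrieve_evidence, map_answer, reduce_answers) are hypothetical stand-ins passed in as parameters, and the record keys community_id and id are assumed.

```python
from typing import Callable

Record = dict  # a plain dict standing in for a LanceDB row (assumption)


def global_answer(
    query: str,
    search_communities: Callable[[str, int], list[Record]],
    retrieve_evidence: Callable[[str, Record, int], list[Record]],
    map_answer: Callable[[str, Record, list[Record]], str],
    reduce_answers: Callable[[str, list[dict]], str],
    top_communities: int = 3,
    evidence_per_community: int = 2,
) -> dict:
    """Two-tier retrieval followed by map-reduce (hypothetical sketch)."""
    maps = []
    # Tier 1: retrieve at the topic level (community summaries).
    for community in search_communities(query, top_communities):
        # Tier 2: go back to the material level for citable evidence.
        evidence = retrieve_evidence(query, community, evidence_per_community)
        # Map: one sub-answer per community, carrying its evidence ids.
        maps.append(
            {
                "community_id": community["community_id"],
                "map_answer": map_answer(query, community, evidence),
                "evidence_ids": [e["id"] for e in evidence],
            }
        )
    # Reduce: merge all per-community answers into one final summary.
    return {"answer": reduce_answers(query, maps), "maps": maps}
```

Injecting the steps as callables keeps the sketch independent of any particular database or LLM client, so each tier can be swapped or stubbed in isolation.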

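Offline: Community Detection and Summary Index (build-communities)

Step 1: Build an undirected skill graph from skill-graph.json

Community detection is more natural on an undirected graph, because communities reflect topical clustering rather than causal direction. We treat implies / impliedBy / relatedTo all as edges and assemble a single undirected graph.

The implementation is in: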

[global_retrieval.py](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L73-L87)
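As a rough illustration of this step, the sketch below collapses the three relation types into undirected edges. The schema of skill-graph.json is not shown here, so the input shape (a mapping from skill name to an object with optional implies / impliedBy / relatedTo lists) is an assumption, and the function name build_undirected_adjacency is hypothetical. It uses a plain adjacency dict instead of a graph library to stay self-contained.

```python
from collections import defaultdict

EDGE_FIELDS = ("implies", "impliedBy", "relatedTo")


def build_undirected_adjacency(skill_graph: dict) -> dict[str, set[str]]:
    """Collapse all three relation types into undirected edges."""
    adj: dict[str, set[str]] = defaultdict(set)
    for skill, relations in skill_graph.items():
        adj[skill]  # touch the key so isolated skills still appear as nodes
        for field in EDGE_FIELDS:
            for other in relations.get(field, []):
                if other == skill:
                    continue  # skip self-loops
                # Direction is dropped: record the edge both ways.
                adj[skill].add(other)
                adj[other].add(skill)
    return dict(adj)


graph = {
    "Python": {"relatedTo": ["FastAPI"]},
    "FastAPI": {"implies": ["Python"]},
    "Kafka": {},
}
adjacency = build_undirected_adjacency(graph)
```

Because edges live in sets, an implies link and its mirrored impliedBy link collapse into a single undirected edge.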

Step 2: Run community detection with Leiden (default)

The repo currently defaults to the Leiden algorithm (leidenalg + igraph) with a fixed random seed (default leiden_seed=42) for reproducibility.

The Leiden clustering is implemented in:

[_leiden_communities](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L112-L132)

If Leiden fails (for example, because a dependency is missing from the environment), the code falls back to NetworkX's greedy_modularity_communities, which is what the MVP version used:

try:
    communities = _leiden_communities(skill_graph, seed=42)
except Exception:
    g = _build_skill_undirected_graph(skill_graph)
    from networkx.algorithms.community import greedy_modularity_communities
    communities = list(greedy_modularity_communities(g))

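Next, min_size / max_size / max_communities trim the result so that neither very small nor very large communities survive (the summary of an oversized community is too noisy).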
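A minimal sketch of that trimming step follows. It assumes that out-of-range communities are simply dropped and that the largest remaining ones are kept first. The repo may handle oversized communities differently (for example by splitting them), and the default values here are placeholders, not the CLI's defaults.

```python
def prune_communities(
    communities: list[set[str]],
    min_size: int = 2,
    max_size: int = 50,
    max_communities: int = 200,
) -> list[set[str]]:
    """Keep communities within [min_size, max_size], largest first."""
    kept = [c for c in communities if min_size <= len(c) <= max_size]
    # Sort by size (descending), then by member names for a stable order.
    kept.sort(key=lambda c: (-len(c), sorted(c)))
    return kept[:max_communities]


candidates = [{"a"}, {"b", "c"}, {"d", "e", "f"}, set("ghijkl")]
pruned = prune_communities(candidates, min_size=2, max_size=3, max_communities=5)
```

Here the singleton and the six-member community are dropped, leaving the three-member and two-member communities in that order.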

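Step 3: Generate a retrievable summary for each community

There are two summarization strategies:

When an LLM key is available: use LlamaIndex structured_predict to produce a CommunitySummary
When there is no key or the call fails: use a heuristic (top skills by centrality) to produce the summary, so the pipeline still runs end to end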

from pydantic import BaseModel, Field

class CommunitySummary(BaseModel):
    summary_short: str
    summary_long: str
    key_skills: list[str] = Field(default_factory=list)

Implemented in:

[CommunitySummary](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L52-L55)
[_llm_summary](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L181-L206)
[_heuristic_summary](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L173-L178)
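The heuristic fallback can be sketched roughly as follows. It ranks skills by their degree inside the community, which is one simple centrality measure. The centrality the repo actually uses is not stated, so this choice, the function name heuristic_summary, and the summary wording are all assumptions.

```python
def heuristic_summary(
    community: set[str], adj: dict[str, set[str]], top_k: int = 5
) -> dict:
    """Summarize a community by its most connected skills (no LLM)."""
    # In-community degree: count only neighbors that are also members.
    degree = {s: len(adj.get(s, set()) & community) for s in community}
    # Highest degree first; ties broken alphabetically for determinism.
    ranked = sorted(community, key=lambda s: (-degree[s], s))
    key_skills = ranked[:top_k]
    return {
        "summary_short": "Skills centered on " + ", ".join(key_skills[:3]),
        "summary_long": (
            f"Community of {len(community)} skills. "
            f"Most connected: {', '.join(key_skills)}."
        ),
        "key_skills": key_skills,
    }
```

The returned keys mirror the CommunitySummary fields, so the LLM path and the fallback can feed the same indexing step.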

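Step 4: Write community summaries to LanceDB (graph_communities)

Each graph_communities record contains:

graph_version: a hash of the skill-graph.json content, used as the version (cache invalidation)
nodes_json / key_skills_json: stored as JSON strings to avoid compatibility trouble with Arrow nested structures
vector: an embedding of summary_short + key_skills combined, which makes the community summary searchable by vector

On vector dimensions: resume_chunks.vector in this repo is already 1024-dimensional (Voyage-4), so community summaries use the same embedding configuration to avoid a dimension mismatch.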
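The record layout can be sketched roughly like this. The helper names graph_version_of and build_record are hypothetical, SHA-256 is an assumed choice of hash, embed is a caller-supplied function standing in for the real embedding client, and the community_id, summary_short, and summary_long columns are assumed in addition to the fields listed above.

```python
import hashlib
import json


def graph_version_of(content: bytes) -> str:
    """Hash the raw graph bytes so any edit yields a new version string."""
    return hashlib.sha256(content).hexdigest()


def build_record(community_id, nodes, summary, graph_version, embed) -> dict:
    """Assemble one graph_communities row (hypothetical layout)."""
    # Embed the short summary together with the key skills.
    text = summary["summary_short"] + "\n" + ", ".join(summary["key_skills"])
    return {
        "community_id": community_id,
        "graph_version": graph_version,
        "summary_short": summary["summary_short"],
        "summary_long": summary["summary_long"],
        # Lists are serialized to JSON strings instead of nested Arrow types.
        "nodes_json": json.dumps(sorted(nodes), ensure_ascii=False),
        "key_skills_json": json.dumps(summary["key_skills"], ensure_ascii=False),
        "vector": embed(text),
    }
```

Sorting the nodes before serialization keeps the stored JSON identical across runs for the same community.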

The LanceDB wrapper is in:

[upsert_graph_communities](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/db/lancedb.py#L132-L152)

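On the write path, besides clearing out older versions, we also delete any existing rows with the same graph_version first, so that repeated add calls do not keep piling up duplicate data: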

table.delete(f"graph_version != '{graph_version}'")
table.delete(f"graph_version = '{graph_version}'")
table.add(records)

CLI: Build the community index

cd py-kb
uv run career-kb graph build-communities --max-communities 200
uv run career-kb graph build-communities --max-communities 30 --no-use-llm
uv run career-kb graph build-communities --no-use-leiden
uv run career-kb graph build-communities --use-leiden --leiden-seed 42

Command entry point:

[graph_cli.py build-communities](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/cli/graph_cli.py#L336-L372)

Online: global/hybrid queries (query --mode global|hybrid)

Step 1: Run a vector search over community summaries first

comm_hits = search_graph_communities(query, limit=top_communities)

Wrapped in:

[search_graph_communities](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/db/lancedb.py#L154-L160)

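Step 2: For each community, go back to resume_chunks for evidence

For each matched community, we:

Read nodes_json (the community's skill nodes)
In hybrid mode: expand the community skills by hops, so that skills adjacent to the topic also feed the evidence retrieval
Merge query + extra_terms into one query against resume_chunks to fetch the evidence

Implemented in: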

[_retrieve_evidence](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L298-L345)
[_expand_skills](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L135-L169)
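The hop expansion and query merging described above might look something like the sketch below. It is a simplified stand-in for _expand_skills and _retrieve_evidence, not a copy of them: the function names, the breadth-first strategy, and the max_terms cap are assumptions.

```python
from collections import deque


def expand_skills(seeds: set[str], adj: dict[str, set[str]], hops: int) -> set[str]:
    """Return the seeds plus every skill reachable within `hops` edges."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in sorted(seeds))
    while frontier:
        skill, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand past the hop budget
        for neighbor in sorted(adj.get(skill, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def compose_evidence_query(query: str, extra_terms, max_terms: int = 8) -> str:
    """Append a capped, de-duplicated list of skill terms to the user query."""
    terms = sorted(set(extra_terms))[:max_terms]
    return query if not terms else f"{query} {' '.join(terms)}"
```

With hops=0 the expansion returns the community's own skills unchanged, which corresponds to global mode; hybrid mode passes a positive hop count.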

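In addition, to keep the --json output usable, we remove vector from each evidence record (otherwise a 1024-dimensional float array lands in the JSON and the output becomes huge):

[strip vector from evidence](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L316-L336)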
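A minimal version of that cleanup is a one-line dict filter. The helper name strip_vector and the sample record are illustrative, not taken from the repo.

```python
import json


def strip_vector(record: dict, field: str = "vector") -> dict:
    """Return a copy of the record without the embedding column."""
    return {k: v for k, v in record.items() if k != field}


row = {"id": "c1", "text": "Built a tracing pipeline", "vector": [0.1] * 1024}
clean = strip_vector(row)
print(len(json.dumps(row)), "->", len(json.dumps(clean)))
```

Returning a copy leaves the original row intact, so the vector is still available to any code that needs it before serialization.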

Step 3: Map (a sub-answer per community, with evidence ids)

Map uses LlamaIndex structured_predict to force the model to emit a fixed schema, which makes the subsequent reduce more stable:

class CommunityMap(BaseModel):
    community_id: int
    map_answer: str
    evidence_ids: list[str] = Field(default_factory=list)

Implemented in:

[CommunityMap](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L58-L61)

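Step 4: Reduce (aggregate summary, sectioned by theme + evidence ids)

Reduce merges the per-community map outputs into one overall answer. The prompt asks for a clear structure and an evidence id on every section, which keeps the answer traceable.

Implemented in: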

[global_query](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/graph/global_retrieval.py#L348-L462)
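To make the reduce input concrete, the sketch below formats map outputs into a prompt and collects the evidence ids. The prompt wording and both function names are hypothetical, and the map outputs are plain dicts with the CommunityMap fields rather than the model class itself.

```python
def build_reduce_prompt(query: str, maps: list[dict]) -> str:
    """Format per-community answers so the reducer can cite evidence ids."""
    sections = []
    for m in maps:
        ids = ", ".join(m["evidence_ids"]) or "none"
        sections.append(
            f"[community {m['community_id']}] {m['map_answer']}\n"
            f"evidence_ids: {ids}"
        )
    return (
        f"Question: {query}\n\n"
        + "\n\n".join(sections)
        + "\n\nWrite one section per theme and cite evidence ids in each section."
    )


def collect_evidence_ids(maps: list[dict]) -> list[str]:
    """Unique evidence ids in first-seen order, for the final answer."""
    return list(dict.fromkeys(i for m in maps for i in m["evidence_ids"]))
```

Keeping each community's evidence ids next to its answer in the prompt gives the reducer what it needs to attach citations section by section.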

CLI: global / hybrid queries

cd py-kb

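# global: match community summaries first, then fetch evidence + map-reduce
uv run career-kb graph query "backend core strengths" \
  --mode global --top-communities 3 --evidence-per-community 2 --json

# hybrid: after global matches communities, expand community skills by hops, then fetch evidence + map-reduce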
uv run career-kb graph query "observability" \
  --mode hybrid --top-communities 2 --evidence-per-community 2 --hops 1 --json

Command entry point:

[graph_cli.py query](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/cli/graph_cli.py#L183-L254)

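Pitfalls and Fixes

1) Vector dimensions must match (1024)

resume_chunks.vector is 1024-dimensional in the existing DB schema, so community summaries must use an embedding of the same dimension. Otherwise table.search(query_vector) fails at runtime or returns abnormal results.

The embedding configuration is in: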

[embedding.py](file:///home/iantp/GitHub/2601-19-resume-skills/py-kb/src/career_kb/rag/embedding.py#L1-L60)
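One way to surface this mismatch early is a small guard that runs before vectors are written or searched. The helper below is a hypothetical sketch and not part of embedding.py.

```python
EXPECTED_DIM = 1024  # dimension stated for resume_chunks.vector


def check_dim(vector, expected: int = EXPECTED_DIM):
    """Fail fast with a clear message instead of a search-time error."""
    if len(vector) != expected:
        raise ValueError(f"embedding has {len(vector)} dims, expected {expected}")
    return vector
```

Calling check_dim on every vector that comes back from the embedding client turns a silent schema mismatch into an immediate, readable error.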

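2) Keep vector out of the JSON output

The vector column is useful for DB retrieval but does not belong in the CLI's JSON output. With vector removed from the evidence, the --json output is much smaller and better suited to downstream use (for example, feeding it into another pipeline).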

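Summary

This global retrieval path supplies the other half of the GraphRAG puzzle:

Local retrieval (local): answers precise questions strongly tied to specific skills or materials
Global retrieval (global/hybrid): answers broad questions that aggregate across topics, while staying traceable (evidence ids)

To raise quality further, there are usually three directions to start from:

Improve community detection (more stable clustering, or clustering closer to the semantics)
Strengthen evidence retrieval (combine BM25 / filters / rerank)
Produce more strongly structured output from reduce (for example, fixed sections and a fixed citation format)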

Career Knowledge Base is a local-first resume knowledge base system built with Python + LanceDB + Voyage AI + LlamaIndex.