Lv, Qi 03b53aed71 feat: Refactor Analysis Context Mechanism and Generic Worker

- Implemented Unified Context Mechanism (Task 20251127):
  - Decoupled intent (Module) from resolution (Orchestrator).
  - Added ContextResolver for resolving input bindings (Manual Glob/Auto LLM).
  - Added IOBinder for managing physical paths.
  - Updated GenerateReportCommand to support explicit input bindings and output paths.

- Refactored Report Worker to Generic Execution (Task 20251128):
  - Removed hardcoded financial DTOs and specific formatting logic.
  - Implemented Generic YAML-based context assembly for better LLM readability.
  - Added detailed execution tracing (Sidecar logs).
  - Fixed input data collision bug by using full paths as context keys.

- Updated Tushare Provider to support dynamic output paths.
- Updated Common Contracts with new configuration models.

2025-11-28 20:11:17 +08:00


Task: Refactor the Analysis Module Context Mechanism (Merging Two-Stage Selection with Unified I/O Binding)

Status: In design (Finalizing) Date: 2025-11-27 Priority: High Owner: @User / @Assistant

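1. Core idea: decoupling intent from implementation

Our thinking went through three stages, which now need to be merged into one coherent system:

Context Projection: a module needs to "project" the data it requires out of the global context.
Two-Stage Selection: this projection splits into two stages, "selection (what do I need?)" and "analysis (how do I process it?)", and both are driven by a Prompt/Model.
Unified I/O Binding: a module should not handle physical paths itself; the Orchestrator is responsible for I/O binding.

Merged design:

Module defines the intent (Intent): through its Configuration (Prompt/Rules), a module describes "what kind of input I need" (for example, "I need last year's financial data" or "match by this Glob rule").
Orchestrator handles resolution (Resolution): the Orchestrator (with the help of the IO Binder) computes concrete physical path bindings from the module's intent and the current state of the global context.
Module performs the execution (Execution): the module receives the physical paths passed in by the Orchestrator, then reads, analyzes, and writes.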

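2. Architecture design

2.1. Module configuration: describing "what I need"

AnalysisModuleConfig keeps its two-stage structure, but the "Input/Context Selector" here describes a logical requirement, not a physical location.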

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AnalysisModuleConfig {
    pub id: String,

    // Phase 1: Input Intent (what data do I need?)
    // Manual: explicit rules (e.g., "financials/*.json")
    // Auto: a loosely specified need, left to the Orchestrator/Agent to infer
    // Hybrid: a specific prompt (e.g., "Find all news about 'Environment' from last year")
    pub context_selector: ContextSelectorConfig,

    // Phase 2: Analysis Intent (how should this data be processed?)
    pub analysis_prompt: String,
    pub llm_config: Option<LlmConfig>,

    // Output Intent (what is the result?)
    // The module only declares the type of result it produces;
    // the physical path is allocated by the Orchestrator.
    pub output_type: String, // e.g., "markdown_report", "json_summary"
}

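2.2. Orchestrator runtime: resolving "where"

Before scheduling a task, the Orchestrator runs a Resolution Step.

For a Manual Selector:
- The Orchestrator uses the rules (Glob) to find matching files in the current VGCS Head Commit.
- It produces concrete InputBindings (Map<FileName, PhysicalPath>).
For an Auto/Hybrid Selector:
- This is the key point of fusion: the Orchestrator (or a dedicated Resolution Agent) runs a lightweight LLM task.
- Input: the current VGCS directory tree + the module's Selection Prompt (or the Auto strategy).
- Output: a concrete list of VGCS file paths.
- The Orchestrator packages these paths into InputBindings.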
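The Manual branch of the Resolution Step can be sketched as follows. This is a minimal illustration, not the project's ContextResolver: the directory tree is passed in as a plain slice instead of being listed from VGCS, and the glob dialect (`*` within one path segment, `**` across segments) is an assumption, as are the names `glob_match` and `resolve_manual`.

```rust
/// Match `path` against `pattern`. Assumed dialect: `*` matches any run of
/// characters inside one segment, `**` matches zero or more whole segments.
fn glob_match(pattern: &str, path: &str) -> bool {
    let pat: Vec<&str> = pattern.split('/').collect();
    let segs: Vec<&str> = path.split('/').collect();
    match_segments(&pat, &segs)
}

fn match_segments(pat: &[&str], segs: &[&str]) -> bool {
    match pat.split_first() {
        None => segs.is_empty(),
        // `**` may swallow zero or more leading segments.
        Some((p, rest)) if *p == "**" => {
            (0..=segs.len()).any(|n| match_segments(rest, &segs[n..]))
        }
        Some((p, rest)) => match segs.split_first() {
            Some((s, tail)) => {
                match_one(p.as_bytes(), s.as_bytes()) && match_segments(rest, tail)
            }
            None => false,
        },
    }
}

/// Byte-wise wildcard match inside a single path segment.
fn match_one(pat: &[u8], text: &[u8]) -> bool {
    match pat.split_first() {
        None => text.is_empty(),
        // `*` tries every possible split point of the remaining text.
        Some((c, rest)) if *c == b'*' => {
            (0..=text.len()).any(|n| match_one(rest, &text[n..]))
        }
        Some((c, rest)) => match text.split_first() {
            Some((t, tail)) => c == t && match_one(rest, tail),
            None => false,
        },
    }
}

/// Resolve Manual rules against a flat listing of the commit's files.
/// Returns a sorted, de-duplicated list of full paths.
fn resolve_manual(rules: &[&str], tree: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = tree
        .iter()
        .filter(|path| rules.iter().any(|rule| glob_match(rule, path)))
        .map(|path| path.to_string())
        .collect();
    out.sort();
    out.dedup();
    out
}
```

Returning full paths rather than bare file names keeps two files with the same base name (for example, two `financials.json` under different symbols) distinct in the resulting bindings.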

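2.3. Module execution: performing the "transformation"

By the time the module actually starts (the Worker receives the Command), what it sees is a deterministic world that has already been resolved.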

use uuid::Uuid;

// The final command sent to the Worker
pub struct GenerateReportCommand {
    pub request_id: Uuid,
    pub commit_hash: String, // the pinned world state

    // Concrete I/O bindings (already resolved by the Orchestrator)
    pub input_bindings: Vec<String>, // e.g., ["raw/tushare/AAPL/financials.json", ...]
    pub output_path: String,         // e.g., "analysis/financial_v1/report.md"

    // Analysis logic (passed through to the Worker unchanged)
    pub analysis_prompt: String,
    pub llm_config: Option<LlmConfig>,
}

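What changes:

Complex Selection logic moves up: the Select_Smart logic originally planned for the Worker now looks better suited as an Orchestrator preprocessing step (or a standalone micro-task).
The Worker gets lighter: it becomes deliberately "dumb" and only does Read(paths) -> Expand -> Prompt -> Write(output_path). This is what lets a module focus solely on its core task.
Flexibility is preserved: in Auto/Hybrid mode the Orchestrator decides the Input Bindings dynamically; in Manual mode they come from static rule resolution. Either way, the Worker always receives a definite list of files.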
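The Read -> Expand -> Prompt -> Write pipeline can be sketched with an in-memory file store. Everything here is a stand-in under stated assumptions: the trimmed-down command struct, the `run_worker` name, the `BTreeMap` in place of VGCS, and the `llm` closure in place of a real model client are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down stand-in for the command (std types only).
struct GenerateReportCommand {
    input_bindings: Vec<String>,
    output_path: String,
    analysis_prompt: String,
}

/// Read -> Expand -> Prompt -> Write against an in-memory file store.
/// `llm` is any function from a rendered prompt to a completion.
fn run_worker(
    cmd: &GenerateReportCommand,
    files: &mut BTreeMap<String, String>,
    llm: impl Fn(&str) -> String,
) -> Result<(), String> {
    let mut context = String::new();
    for path in &cmd.input_bindings {
        // Read: every binding must already exist; the worker never searches.
        let body = files
            .get(path)
            .ok_or_else(|| format!("missing input: {path}"))?;
        // Expand: label each document with its full path so that inputs
        // sharing a base name stay distinguishable in the context.
        context.push_str(&format!("--- {path} ---\n{body}\n"));
    }
    // Prompt: the analysis prompt is passed through, followed by the context.
    let prompt = format!("{}\n\n{}", cmd.analysis_prompt, context);
    let report = llm(&prompt);
    // Write: the destination comes from the command, never from the worker.
    files.insert(cmd.output_path.clone(), report);
    Ok(())
}
```

Because the worker only touches paths named in the command, a missing input surfaces as an error instead of triggering a fallback search.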

3. Implementation roadmap (Revised)

Phase 1: Protocol and Configuration (Contracts)

Define AnalysisModuleConfig (containing Selector, Prompt, LlmConfig).
Define GenerateReportCommand (containing the input_bindings list of physical paths, output_path, commit_hash).

Phase 2: Orchestrator Resolution Logic

Implement the ContextResolver component:
- Support Glob resolution (Manual).
- (Later) support LLM reasoning over the directory tree (Auto/Hybrid).
In the scheduling loop, call ContextResolver before generating the Command.

Phase 3: Module Refactor

Provider: receives output_path (generated by the Orchestrator by convention, e.g. raw/{provider}/{symbol}) and writes to it.
Generator:
- Remove all selection logic.
- Read the files in cmd.input_bindings directly.
- Run the Expander (JSON->Table, etc.).
- Run the Prompt.
- Write to cmd.output_path.

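4. Summary

This design merges the three threads of our discussion:

Input/Output Symmetry: both are bound explicitly in the Command.
Two-Stage:
- Stage 1 (Selection) happens at Orchestration Time (resolving the Bindings).
- Stage 2 (Analysis) happens at Execution Time (the Worker runs).
Module Focus: a module does not need to know "where to look", only "give me these files and I will give you that result".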

5. Implementation checklist (Checklist)

Phase 1: Protocol and configuration definitions (Contracts & Configs)

Common Contracts: create or update configs.rs in services/common-contracts/src.
- Define SelectionMode (Manual, Auto, Hybrid).
- Define LlmConfig (model_id, parameters).
- Define ContextSelectorConfig (mode, rules, prompt, llm_config).
- Define AnalysisModuleConfig (id, selector, analysis_prompt, llm_config, output_type).
Messages: update services/common-contracts/src/messages.rs.
- GenerateReportCommand: add commit_hash, input_bindings: Vec<String>, output_path: String, llm_config.
- FetchCompanyDataCommand: add output_path: Option<String>.
VGCS Types: make sure the types in the workflow-context crate are sufficient for path operations. (Confirmed: Vgcs struct has methods)
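One possible shape for the selector types listed above, written with std types only so it compiles on its own. Field types (for example `parameters` as key/value pairs) and the `validate` rules are assumptions for illustration; the real definitions in configs.rs may differ and would normally also derive serde traits.

```rust
/// How the input selection is expressed.
#[derive(Debug, Clone, PartialEq)]
enum SelectionMode {
    Manual,
    Auto,
    Hybrid,
}

/// Assumed shape: a model id plus free-form string parameters.
#[derive(Debug, Clone, PartialEq)]
struct LlmConfig {
    model_id: String,
    parameters: Vec<(String, String)>,
}

#[derive(Debug, Clone, PartialEq)]
struct ContextSelectorConfig {
    mode: SelectionMode,
    rules: Vec<String>,      // glob rules, used by Manual
    prompt: Option<String>,  // selection prompt, used by Hybrid
    llm_config: Option<LlmConfig>,
}

impl ContextSelectorConfig {
    /// Hypothetical sanity check: each mode must carry the data it relies on.
    fn validate(&self) -> Result<(), String> {
        match self.mode {
            SelectionMode::Manual if self.rules.is_empty() => {
                Err("manual selector needs at least one rule".to_string())
            }
            SelectionMode::Hybrid if self.prompt.is_none() => {
                Err("hybrid selector needs a selection prompt".to_string())
            }
            _ => Ok(()),
        }
    }
}
```

Validating at configuration load time, rather than inside the resolver, would let a misconfigured module fail before any task is scheduled.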

Phase 2: Orchestrator changes (Resolution Logic)

Context Resolver: create context_resolver.rs in workflow-orchestrator-service.
- Implement resolve_input(selector, vgcs_client, commit_hash) -> Result<Vec<String>>.
- For Manual mode: implement Glob matching (recursive lookup via VGCS list_dir).
- For Auto/Hybrid mode: (interface stub for now) return Empty or NotImplemented; LLM integration comes later.
IO Binder: implement io_binder.rs.
- Implement allocate_output_path(task_type, task_id) -> String, the convention-based path generation logic.
Scheduler: update dag_scheduler.rs.
- Before dispatching a task, call ContextResolver and IOBinder.
- Fill the resolved results into the Command.
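The binder and the pre-dispatch step might look like the sketch below. The `TaskType` variants, the path conventions, the `bind_command` helper, and the decision to reject an empty binding list are all assumptions of this example; the source only fixes the `allocate_output_path(task_type, task_id) -> String` signature.

```rust
/// Hypothetical task categories; the source only names `task_type`.
enum TaskType {
    Provider,
    Analysis,
}

/// Assumed convention: providers write under `raw/`, analyses under `analysis/`.
fn allocate_output_path(task_type: &TaskType, task_id: &str) -> String {
    match task_type {
        TaskType::Provider => format!("raw/{task_id}"),
        TaskType::Analysis => format!("analysis/{task_id}/report.md"),
    }
}

/// Trimmed-down stand-in for the command (std types only).
#[derive(Debug)]
struct GenerateReportCommand {
    commit_hash: String,
    input_bindings: Vec<String>,
    output_path: String,
    analysis_prompt: String,
}

/// What the scheduler could do just before dispatch: combine the resolver's
/// output with an allocated path into a fully bound command.
fn bind_command(
    commit_hash: &str,
    task_id: &str,
    resolved_inputs: Vec<String>,
    analysis_prompt: &str,
) -> Result<GenerateReportCommand, String> {
    // Assumption: a report task with no inputs is treated as an error.
    if resolved_inputs.is_empty() {
        return Err(format!("task {task_id}: selector resolved to no inputs"));
    }
    Ok(GenerateReportCommand {
        commit_hash: commit_hash.to_string(),
        input_bindings: resolved_inputs,
        output_path: allocate_output_path(&TaskType::Analysis, task_id),
        analysis_prompt: analysis_prompt.to_string(),
    })
}
```

Keeping path allocation in one function means the convention can change without touching either the resolver or the workers.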

Phase 3: Write-side changes (Provider Adaptation)

Tushare Provider: update services/tushare-provider-service/src/generic_worker.rs.
- Read output_path from the Command (if present).
- Use WorkerContext to write data to the specified path (no longer hardcoding raw/tushare/..., but trusting the Command).
- Commit and return the New Commit Hash.
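Since output_path is optional on the provider command, the provider needs a rule for the case where it is absent. The sketch below assumes a fallback to a conventional path built from the symbol; that fallback, the `symbol` field, and the helper name are guesses for illustration, not behavior stated in this document.

```rust
/// Trimmed-down stand-in for the provider command (std types only).
struct FetchCompanyDataCommand {
    symbol: String,
    output_path: Option<String>,
}

/// Prefer the path bound by the Orchestrator; fall back to an assumed
/// conventional location only when the command carries none.
fn effective_output_path(cmd: &FetchCompanyDataCommand) -> String {
    cmd.output_path
        .clone()
        .unwrap_or_else(|| format!("raw/tushare/{}", cmd.symbol))
}
```

An alternative design would be to reject commands without an output_path, which makes the "trust the Command" rule strict at the cost of backward compatibility.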

Phase 4: Read-side changes (Report Generator Adaptation)

Worker Refactor: rewrite services/report-generator-service/src/worker.rs.
- Remove: delete fetch_data_and_configs (the old DB read logic).
- Checkout: use vgcs.checkout(cmd.commit_hash).
- Read Input: iterate over cmd.input_bindings and read each file's content with vgcs.read_file.
- Expand: implement a simple Expander (JSON -> Markdown Table).
- Prompt: render cmd.analysis_prompt.
- LLM Call: initialize the Client with cmd.llm_config and call it.
- Write Output: write the result to cmd.output_path.
- Commit: commit the changes and broadcast an Event.
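The Expand step can be illustrated by the table-rendering half of a JSON -> Markdown Table expander. To stay std-only, this sketch assumes the JSON has already been parsed into flat string records; the parsing itself (for example with a JSON library) is omitted, and `to_markdown_table` is a hypothetical name.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Render flat records as a Markdown table. Columns are the union of all
/// keys in sorted order; a record missing a key gets an empty cell.
fn to_markdown_table(rows: &[BTreeMap<String, String>]) -> String {
    let columns: BTreeSet<&str> = rows
        .iter()
        .flat_map(|row| row.keys().map(|k| k.as_str()))
        .collect();
    if columns.is_empty() {
        return String::new();
    }
    let header: Vec<&str> = columns.iter().copied().collect();
    let mut out = String::new();
    out.push_str(&format!("| {} |\n", header.join(" | ")));
    out.push_str(&format!("|{}\n", " --- |".repeat(header.len())));
    for row in rows {
        let cells: Vec<&str> = header
            .iter()
            .map(|col| row.get(*col).map(|v| v.as_str()).unwrap_or(""))
            .collect();
        out.push_str(&format!("| {} |\n", cells.join(" | ")));
    }
    out
}
```

Cell values containing `|` or newlines are not escaped here, so a production expander would need to handle those before the table reaches the prompt.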

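Phase 5: Integration and verification (Integration)

Config Migration: update config/analysis-config.json (or the configuration in the DB) to fit the new AnalysisModuleConfig structure.
End-to-End Test: run the full flow and verify:
1. The Provider writes files to Git.
2. The Orchestrator resolves the paths.
3. The Generator reads the files and produces the report.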
