init repo

This commit is contained in:
2026-04-25 19:25:22 +08:00
commit c7533eada2
50 changed files with 3732 additions and 0 deletions

24
.gitignore vendored Normal file
View File

@@ -0,0 +1,24 @@
# OS
.DS_Store
# Env
.env
.env.*
# Python
__pycache__/
.pytest_cache/
.mypy_cache/
.venv/
venv/
# Node
node_modules/
.next/
dist/
build/
.nuxt/
.output/
# Misc
*.log

17
.project.json Normal file
View File

@@ -0,0 +1,17 @@
{
"created" : "2026-03-27",
"description" : "AI 自动生成咨询报告系统(规划中)",
"kind" : "app",
"name" : "咨询报告生成",
"stack" : [
"Python",
"FastAPI",
"LiteLLM",
"Pydantic"
],
"status" : "paused",
"worklog" : {
"auto" : true,
"path" : "\/Users\/kangwan\/Projects\/business\/20260327-咨询报告生成\/Users\/kangwan\/Projects\/business\/20260327-咨询报告生成\/Users\/kangwan\/Projects\/business\/20260327-咨询报告生成\/Users\/kangwan\/Projects\/business\/20260327-咨询报告生成\/Users\/kangwan\/Projects\/business\/20260327-咨询报告生成\/Users\/kangwan\/Projects\/business\/20260327-咨询报告生成\/Users\/kangwan\/Projects\/business\/20260327-咨询报告生成\/memory\/worklog.json"
}
}

21
AGENTS.md Normal file
View File

@@ -0,0 +1,21 @@
# 咨询报告生成 Agent Rules
## Must Read First
- `.project.json` 是机器真源:公网链接、快捷登录、凭证引用都以它为准
- `RULES.md` 是人工规则和部署事实:启动命令、平台、域名、注意事项都写这里
- 不允许编造不存在的域名、账号、密码;未知就保持空白并明确标记待补充
## Deployment Metadata Contract
- 任何任务只要新增、删除或修改公网地址,必须在同一次任务里更新 `.project.json`
- `urls[]` 推荐显式写 `type``app``backend``docs``admin``repo`
- 项目专属的网页登录信息,如果允许放进仓库,就写 `.project.json.quick_login`
- 不能直接入库的敏感登录,不要伪造 `quick_login`,改为写 `.project.json.credentials` 引用
- 数据库密码、API Key、服务器 root 密码,不属于 `quick_login`
## Completion Gate
- 部署完成后,不允许在 `.project.json` 缺少最新公网链接的状态下结束任务
- 部署完成后,必须同步更新 `RULES.md` 的部署事实
- 如果只更新了代码但没回写部署元数据,这个任务不算完成

123
BRIEF.md Normal file
View File

@@ -0,0 +1,123 @@
# 咨询报告 AI 生成系统 — 项目简报
> 新窗口打开时,让 Claude 先读这个文件
## 项目定位
为咨询公司构建 **安全、可控** 的行业报告自动生成系统。
- 输入:客户需求 + 行业数据 + 报告模板
- 输出Word/PPT/Excel/PDF 格式的专业咨询报告
- 安全铁律:**客户数据绝不经过第三方**
## 架构灵感(来自 Open SWE
借鉴 Open SWE 的多 Agent 流水线,但从"写代码"改为"写报告"
```
用户输入(报告需求 + 数据)
├── Researcher Agent → 分析需求、检索资料、梳理框架
├── Writer Agent → 按模板撰写报告正文
├── Data Agent → 处理数据、生成图表、制作附录
├── Reviewer Agent → 检查质量、一致性、合规性
└── Formatter Agent → 排版输出 docx/pptx/xlsx/pdf
```
## 可用 Skills已有
| Skill | 路径 | 能力 |
|-------|------|------|
| docx | `~/Projects/code/20260119-skills合集/anthropics_skills/skills/docx/` | Word 文档生成/编辑/批注 |
| pptx | `~/Projects/code/20260119-skills合集/anthropics_skills/skills/pptx/` | PPT 演示文稿生成 |
| xlsx | `~/Projects/code/20260119-skills合集/anthropics_skills/skills/xlsx/` | Excel 数据分析/图表/财务模型 |
| pdf | `~/Projects/code/20260119-skills合集/anthropics_skills/skills/pdf/` | PDF 生成/合并 |
## Open SWE 安全审查结论(避坑清单)
从 Open SWE 代码审查中得出的教训,本项目必须避免:
| Open SWE 的问题 | 本项目的对策 |
|-----------------|-------------|
| 所有内容发送给第三方 LLM (Poe/OpenRouter) | 必须用有 DPA 的 API 或自建模型 |
| LangSmith 遥测发送完整执行数据 | 不用 LangSmith自建 tracing 或不 trace |
| fetch_url 无 SSRF 防护 | Agent 工具严格限制外网访问 |
| 本地沙箱 inherit_env 暴露密钥 | 隔离执行环境,最小化环境变量 |
| 无多租户隔离 | 每个客户项目独立隔离 |
## 需要用户提供的材料
### 必须提供(不然搭不了)
1. **报告模板** — 你们现在用的 Word/PPT 模板文件(至少 2-3 个不同类型)
2. **样例报告** — 之前交付过的成品报告脱敏后的2-3 份
3. **LLM 选择** — 用哪个模型?选项:
- Poe API现有但数据经过 Poe → 有泄露风险)
- 本地部署 LLM192.168.2.221 Linux 服务器,需要 GPU
- 有 DPA 的 API如 Azure OpenAI / AWS Bedrock
### 最好提供(能做更好)
4. **行业数据源** — 你们通常从哪里获取行业数据?(数据库/网站/内部资料库)
5. **报告类型清单** — 你们做哪几种报告?(市场分析/竞品分析/尽职调查/投资备忘录...
6. **品牌规范** — 公司 logo、配色、字体规范用于排版
7. **审批流程** — 报告生成后谁审核?需要什么审批机制?
### 可选
8. **客户列表结构** — 多租户需要支持多少客户?权限怎么分?
9. **部署偏好** — 部署在 VPS (76.13.31.179) 还是本地服务器 (192.168.2.221)
## 技术方案初步思路
```
┌─────────────────────────────────────────────┐
│ Web UI后续做
│ 用户输入需求 / 上传数据 / 下载报告 │
└──────────────────┬──────────────────────────┘
┌──────────────────▼──────────────────────────┐
│ 调度引擎 (Python/FastAPI) │
│ 接收任务 → 编排 Agent → 输出报告 │
│ - 多租户隔离 │
│ - 任务队列 │
│ - 模板管理 │
└──────────────────┬──────────────────────────┘
┌─────────────┼─────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────┐
│Researcher│ │ Writer │ │ Data │
│ Agent │ │ Agent │ │ Agent │
│ 检索分析 │ │ 撰写正文 │ │ 图表数据 │
└────┬─────┘ └────┬─────┘ └────┬─────┘
└─────────────┼─────────────┘
┌─────────────────┐
│ Reviewer Agent │
│ 质量审查 │
└────────┬────────┘
┌─────────────────┐
│ Formatter Agent │
│ docx/pptx/pdf │
│ (用现有 Skills) │
└─────────────────┘
```
## 相关项目参考
- Open SWE 源码:`~/Projects/research/20260327-open-swe/source/`
- Open SWE 安全审查:在本项目创建对话中完成
- Skills 合集:`~/Projects/code/20260119-skills合集/`
- GPT Researcher MCP已配置在 `~/.claude.json`(可用于资料检索)
## 状态
- [x] 需求明确:咨询公司内部效率工具 + 客户增值服务
- [x] 安全需求明确:客户数据不得外泄
- [x] 可用 Skills 盘点完成
- [ ] **等待用户提供:报告模板 + 样例报告 + LLM 选择**
- [ ] 架构设计
- [ ] 开发
- [ ] 部署

21
CLAUDE.md Normal file
View File

@@ -0,0 +1,21 @@
# 咨询报告生成 Agent Rules
## Must Read First
- `.project.json` 是机器真源:公网链接、快捷登录、凭证引用都以它为准
- `RULES.md` 是人工规则和部署事实:启动命令、平台、域名、注意事项都写这里
- 不允许编造不存在的域名、账号、密码;未知就保持空白并明确标记待补充
## Deployment Metadata Contract
- 任何任务只要新增、删除或修改公网地址,必须在同一次任务里更新 `.project.json`
- `urls[]` 推荐显式写 `type``app``backend``docs``admin``repo`
- 项目专属的网页登录信息,如果允许放进仓库,就写 `.project.json.quick_login`
- 不能直接入库的敏感登录,不要伪造 `quick_login`,改为写 `.project.json.credentials` 引用
- 数据库密码、API Key、服务器 root 密码,不属于 `quick_login`
## Completion Gate
- 部署完成后,不允许在 `.project.json` 缺少最新公网链接的状态下结束任务
- 部署完成后,必须同步更新 `RULES.md` 的部署事实
- 如果只更新了代码但没回写部署元数据,这个任务不算完成

37
RULES.md Normal file
View File

@@ -0,0 +1,37 @@
# 咨询报告生成
## 启动
- `待补充`
## 部署事实
- 平台:待定
- 发布状态:未部署
- 主站 / 前端:待定
- API / 后端:待定
- 文档 / 解析:待定
- 管理后台:待定
- 代码仓:待定
## 快捷登录
- 登录地址:待补充
- 用户名:待补充
- 密码:待补充
- 说明这里只写项目专属网页登录数据库密码、API Key、服务器 root 密码不要写这里
## 元数据回写清单
- 新增或变更公网地址后,必须同步更新 `.project.json.urls`
- 如果有网页后台登录:
- 可直接入库:写 `.project.json.quick_login`
- 不应入库:写 `.project.json.credentials` 引用
- 部署完成后,`RULES.md``.project.json` 必须同一次任务一起更新
## 环境变量
- 待补充
## 规则
- 不允许编造不存在的部署域名、账号、密码
- 没有公网地址时,`.project.json.urls` 保持空数组
- 任何部署或域名变化,都要先改元数据,再视为任务完成
## 注意事项
- 待补充

0
app/__init__.py Normal file
View File

15
app/agents/__init__.py Normal file
View File

@@ -0,0 +1,15 @@
from .base import BaseAgent
from .researcher import ResearcherAgent
from .writer import WriterAgent
from .data_agent import DataAgent
from .reviewer import ReviewerAgent
from .formatter import FormatterAgent
__all__ = [
"BaseAgent",
"ResearcherAgent",
"WriterAgent",
"DataAgent",
"ReviewerAgent",
"FormatterAgent",
]

166
app/agents/base.py Normal file
View File

@@ -0,0 +1,166 @@
"""Base agent with LLM calling via litellm."""
from __future__ import annotations
import json
import logging
from typing import Any
import litellm
from app.config import settings
logger = logging.getLogger(__name__)
# Disable litellm telemetry
litellm.telemetry = False
class BaseAgent:
"""Base class for all pipeline agents."""
name: str = "base"
description: str = ""
system_prompt: str = ""
model: str = "" # empty = use default from config
def __init__(self, model: str | None = None):
if model:
self.model = model
def get_model(self) -> str:
return self.model or settings.llm_model
async def call_llm(
self,
prompt: str,
*,
system: str | None = None,
temperature: float = 0.3,
max_tokens: int = 4096,
response_format: dict | None = None,
) -> str:
"""Call LLM via litellm. Returns the text response."""
messages = []
sys_prompt = system or self.system_prompt
if sys_prompt:
messages.append({"role": "system", "content": sys_prompt})
messages.append({"role": "user", "content": prompt})
kwargs: dict[str, Any] = {
"model": self.get_model(),
"messages": messages,
"temperature": temperature,
"max_tokens": max_tokens,
}
if settings.llm_api_key:
kwargs["api_key"] = settings.llm_api_key
if settings.llm_api_base:
kwargs["api_base"] = settings.llm_api_base
if response_format:
kwargs["response_format"] = response_format
logger.info(f"[{self.name}] calling {self.get_model()}")
response = await litellm.acompletion(**kwargs)
content = response.choices[0].message.content
logger.info(f"[{self.name}] got {len(content)} chars")
return content
async def call_llm_json(self, prompt: str, **kwargs) -> dict:
"""Call LLM and parse response as JSON."""
raw = await self.call_llm(
prompt,
response_format={"type": "json_object"},
**kwargs,
)
# Strip markdown code fences if present
text = raw.strip()
if text.startswith("```"):
first_nl = text.find("\n")
if first_nl != -1:
text = text[first_nl + 1:]
if text.endswith("```"):
text = text[: text.rfind("```")]
text = text.strip()
# Sanitize control characters inside JSON string values
# (models sometimes emit literal newlines/tabs inside strings)
import re
def _clean_json_string(s: str) -> str:
# Replace unescaped control chars within JSON strings
# This is a best-effort fix for common model outputs
result = []
in_string = False
escape = False
for ch in s:
if escape:
result.append(ch)
escape = False
continue
if ch == '\\':
result.append(ch)
escape = True
continue
if ch == '"':
in_string = not in_string
result.append(ch)
continue
if in_string and ord(ch) < 32:
# Replace control chars with escaped versions
if ch == '\n':
result.append('\\n')
elif ch == '\r':
result.append('\\r')
elif ch == '\t':
result.append('\\t')
else:
result.append(f'\\u{ord(ch):04x}')
continue
result.append(ch)
return ''.join(result)
# Try parsing with multiple strategies
for attempt, candidate in enumerate([text, _clean_json_string(text)]):
try:
return json.loads(candidate)
except json.JSONDecodeError:
continue
# Last resort: try to extract the largest valid JSON object
# (model may have appended commentary after the JSON)
brace_depth = 0
start = text.find('{')
if start == -1:
raise json.JSONDecodeError("No JSON object found", text, 0)
cleaned = _clean_json_string(text)
for i, ch in enumerate(cleaned[start:], start):
if ch == '{':
brace_depth += 1
elif ch == '}':
brace_depth -= 1
if brace_depth == 0:
try:
return json.loads(cleaned[start:i + 1])
except json.JSONDecodeError:
continue
# If all else fails, use json_repair library or raise
try:
import json_repair
return json_repair.loads(text)
except (ImportError, Exception):
raise json.JSONDecodeError(
f"Failed to parse JSON after multiple attempts", text, 0
)
async def run(self, context: dict[str, Any]) -> dict[str, Any]:
"""Execute this agent's task. Override in subclasses.
Args:
context: Shared pipeline context (accumulated by previous agents).
Returns:
Dict of new keys to merge into context.
"""
raise NotImplementedError

78
app/agents/data_agent.py Normal file
View File

@@ -0,0 +1,78 @@
"""Data Agent — processes data, generates chart specs and table data."""
from __future__ import annotations
import json
from typing import Any
from .base import BaseAgent
from app.config import settings
class DataAgent(BaseAgent):
name = "data"
description = "处理数据、生成图表规格和表格数据"
system_prompt = """\
你是一位数据分析专家。你的任务是根据报告草稿中标注的图表和表格需求,
生成具体的数据和图表规格。
输出要求JSON 格式):
{
"charts": [
{
"id": "chart_1",
"title": "图表标题",
"type": "bar|line|pie|area|scatter",
"description": "图表说明",
"data": {
"labels": ["标签1", "标签2"],
"datasets": [
{"label": "数据集名", "data": [100, 200]}
]
}
}
],
"tables": [
{
"id": "table_1",
"title": "表格标题",
"headers": ["列1", "列2", "列3"],
"rows": [["数据1", "数据2", "数据3"]]
}
]
}"""
def __init__(self):
super().__init__(model=settings.model_for_domain("fast"))
async def run(self, context: dict[str, Any]) -> dict[str, Any]:
draft = context["draft"]
extra_data = context.get("extra_data", "")
# Collect chart/table needs from draft
chart_needs = []
table_needs = []
for ch in draft.get("chapters", []):
chart_needs.extend(ch.get("charts", []))
table_needs.extend(ch.get("tables", []))
if not chart_needs and not table_needs:
return {"data_assets": {"charts": [], "tables": []}}
prompt = f"""\
## 报告标题
{draft.get("title", "")}
## 需要生成的图表
{json.dumps(chart_needs, ensure_ascii=False)}
## 需要生成的表格
{json.dumps(table_needs, ensure_ascii=False)}
## 补充数据源
{extra_data if extra_data else "(无额外数据,请根据行业常识生成合理的示例数据)"}
请为以上需求生成具体的图表规格和表格数据。输出 JSON。"""
result = await self.call_llm_json(prompt)
return {"data_assets": result}

669
app/agents/formatter.py Normal file
View File

@@ -0,0 +1,669 @@
"""Formatter Agent — renders final report using Skills toolkit.
Skills integration:
- docx: python-docx (baseline) + docx-js via Node.js (rich mode) + OOXML template editing
- pptx: html2pptx.js via Node.js (visual slides) + python-pptx fallback
- xlsx: openpyxl + recalc.py (formula recalculation via LibreOffice)
- pdf: reportlab with CJK support + fpdf2 fallback
"""
from __future__ import annotations
import asyncio
import json
import logging
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Any
from .base import BaseAgent
logger = logging.getLogger(__name__)
# Skills root
SKILLS_ROOT = Path.home() / "Projects/code/20260119-skills合集/anthropics_skills/skills"
DOCX_SKILLS = SKILLS_ROOT / "docx"
PPTX_SKILLS = SKILLS_ROOT / "pptx"
XLSX_SKILLS = SKILLS_ROOT / "xlsx"
PDF_SKILLS = SKILLS_ROOT / "pdf"
def _skills_available() -> dict[str, bool]:
"""Check which skill toolkits are available."""
return {
"docx_js": (DOCX_SKILLS / "docx-js.md").exists(),
"html2pptx": (PPTX_SKILLS / "scripts" / "html2pptx.js").exists(),
"recalc": (XLSX_SKILLS / "recalc.py").exists(),
"ooxml_docx": (DOCX_SKILLS / "ooxml" / "scripts" / "unpack.py").exists(),
"ooxml_pptx": (PPTX_SKILLS / "ooxml" / "scripts" / "unpack.py").exists(),
"pdf_scripts": (PDF_SKILLS / "scripts").is_dir(),
}
class FormatterAgent(BaseAgent):
name = "formatter"
description = "将报告渲染为 docx/pptx/xlsx/pdf融合 Skills 能力"
def __init__(self):
super().__init__()
self.skills = _skills_available()
available = [k for k, v in self.skills.items() if v]
logger.info(f"[formatter] available skills: {available}")
async def run(self, context: dict[str, Any]) -> dict[str, Any]:
draft = context["draft"]
data_assets = context.get("data_assets", {})
output_dir = Path(context.get("output_dir", "output"))
formats = context.get("output_formats", ["docx"])
template_path = context.get("template_path") # optional: user-provided template
output_dir.mkdir(parents=True, exist_ok=True)
title = draft.get("title", "报告")
generated_files = []
for fmt in formats:
try:
match fmt:
case "docx":
path = await self._render_docx(draft, data_assets, output_dir, title, template_path)
case "pptx":
path = await self._render_pptx(draft, data_assets, output_dir, title)
case "xlsx":
path = await self._render_xlsx(data_assets, output_dir, title)
case "pdf":
path = await self._render_pdf(draft, data_assets, output_dir, title)
case _:
logger.warning(f"Unsupported format: {fmt}")
continue
generated_files.append(str(path))
logger.info(f"[formatter] generated {path}")
except Exception as e:
logger.exception(f"[formatter] failed to render {fmt}")
return {"generated_files": generated_files}
# -----------------------------------------------------------------------
# DOCX — python-docx baseline + OOXML template editing
# -----------------------------------------------------------------------
async def _render_docx(
self, draft: dict, data_assets: dict, output_dir: Path, title: str,
template_path: str | None = None,
) -> Path:
if template_path and self.skills["ooxml_docx"]:
return await self._render_docx_from_template(
draft, data_assets, output_dir, title, Path(template_path)
)
return await self._render_docx_baseline(draft, data_assets, output_dir, title)
async def _render_docx_baseline(
self, draft: dict, data_assets: dict, output_dir: Path, title: str
) -> Path:
from docx import Document
from docx.shared import Pt, RGBColor
from docx.enum.text import WD_ALIGN_PARAGRAPH
doc = Document()
# -- Styles --
style = doc.styles["Normal"]
style.font.name = "微软雅黑"
style.font.size = Pt(11)
# Title
t = doc.add_heading(title, level=0)
t.alignment = WD_ALIGN_PARAGRAPH.CENTER
# Executive summary
if summary := draft.get("executive_summary"):
doc.add_heading("执行摘要", level=1)
# Add summary with highlight styling
p = doc.add_paragraph()
run = p.add_run(summary)
run.font.size = Pt(11)
run.font.color.rgb = RGBColor(0x33, 0x33, 0x33)
# Chapters
for chapter in draft.get("chapters", []):
doc.add_heading(chapter["title"], level=1)
content = chapter.get("content", "")
self._docx_render_markdown(doc, content)
# Tables from data assets
for table_spec in data_assets.get("tables", []):
doc.add_heading(table_spec.get("title", "数据表"), level=2)
self._docx_add_table(doc, table_spec)
# Page break + chart descriptions as placeholders
for chart_spec in data_assets.get("charts", []):
doc.add_heading(chart_spec.get("title", "图表"), level=2)
desc = chart_spec.get("description", "")
chart_type = chart_spec.get("type", "")
doc.add_paragraph(f"[{chart_type.upper()} 图表] {desc}")
# Render chart data as a table too
chart_data = chart_spec.get("data", {})
if labels := chart_data.get("labels"):
for ds in chart_data.get("datasets", []):
self._docx_add_table(doc, {
"headers": ["项目", ds.get("label", "数据")],
"rows": [[str(l), str(v)] for l, v in zip(labels, ds.get("data", []))],
})
path = output_dir / f"{title}.docx"
doc.save(str(path))
return path
async def _render_docx_from_template(
self, draft: dict, data_assets: dict, output_dir: Path, title: str,
template_path: Path,
) -> Path:
"""Edit an existing DOCX template using OOXML unpack/edit/pack workflow."""
unpack_script = DOCX_SKILLS / "ooxml" / "scripts" / "unpack.py"
pack_script = DOCX_SKILLS / "ooxml" / "scripts" / "pack.py"
with tempfile.TemporaryDirectory() as tmpdir:
work_dir = Path(tmpdir) / "unpacked"
# Unpack template
proc = await asyncio.create_subprocess_exec(
"python3", str(unpack_script), str(template_path), str(work_dir),
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
await proc.wait()
if proc.returncode != 0:
logger.warning("[formatter] OOXML unpack failed, falling back to baseline")
return await self._render_docx_baseline(draft, data_assets, output_dir, title)
# TODO: edit XML content in work_dir based on draft
# For now, just pack back as-is (template passthrough)
output_path = output_dir / f"{title}.docx"
proc = await asyncio.create_subprocess_exec(
"python3", str(pack_script), str(work_dir), str(output_path),
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
await proc.wait()
return output_path
def _docx_render_markdown(self, doc, content: str):
"""Convert markdown-ish content to docx paragraphs."""
from docx.shared import Pt
for block in content.split("\n\n"):
block = block.strip()
if not block:
continue
if block.startswith("#### "):
doc.add_heading(block[5:], level=4)
elif block.startswith("### "):
doc.add_heading(block[4:], level=3)
elif block.startswith("## "):
doc.add_heading(block[3:], level=2)
elif block.startswith("- ") or block.startswith("* "):
# Bullet list
for line in block.split("\n"):
line = line.lstrip("- *").strip()
if line:
doc.add_paragraph(line, style="List Bullet")
elif block.startswith("1. ") or block.startswith("1"):
# Numbered list
for line in block.split("\n"):
text = line.lstrip("0123456789.) ").strip()
if text:
doc.add_paragraph(text, style="List Number")
else:
p = doc.add_paragraph(block)
for run in p.runs:
run.font.size = Pt(11)
def _docx_add_table(self, doc, table_spec: dict):
"""Add a formatted table to the document."""
from docx.shared import Pt, RGBColor
from docx.oxml.ns import qn
headers = table_spec.get("headers", [])
rows = table_spec.get("rows", [])
if not headers:
return
tbl = doc.add_table(rows=1 + len(rows), cols=len(headers))
tbl.style = "Light Grid Accent 1"
# Header row
for i, h in enumerate(headers):
cell = tbl.rows[0].cells[i]
cell.text = str(h)
for p in cell.paragraphs:
for run in p.runs:
run.font.bold = True
run.font.size = Pt(10)
# Data rows
for r_idx, row in enumerate(rows):
for c_idx, cell_val in enumerate(row):
tbl.rows[r_idx + 1].cells[c_idx].text = str(cell_val)
# -----------------------------------------------------------------------
# PPTX — html2pptx.js (rich) or python-pptx (fallback)
# -----------------------------------------------------------------------
async def _render_pptx(
self, draft: dict, data_assets: dict, output_dir: Path, title: str
) -> Path:
if self.skills["html2pptx"]:
try:
return await self._render_pptx_html2pptx(draft, data_assets, output_dir, title)
except Exception as e:
logger.warning(f"[formatter] html2pptx failed ({e}), falling back to python-pptx")
return await self._render_pptx_baseline(draft, data_assets, output_dir, title)
async def _render_pptx_html2pptx(
self, draft: dict, data_assets: dict, output_dir: Path, title: str
) -> Path:
"""Generate PPTX using html2pptx.js skill for visual slides."""
with tempfile.TemporaryDirectory() as tmpdir:
work = Path(tmpdir)
# Generate HTML slides
slides_html = []
# Title slide
slides_html.append(f"""<html><body style="width:720pt;height:405pt;display:flex;align-items:center;justify-content:center;flex-direction:column;background:linear-gradient(135deg,#1a1a2e,#16213e);color:white;font-family:sans-serif;">
<h1 style="font-size:36pt;margin:0;">{title}</h1>
<p style="font-size:18pt;color:#aaa;margin-top:20pt;">{draft.get('executive_summary', '')[:100]}</p>
</body></html>""")
# Chapter slides
for ch in draft.get("chapters", []):
content_lines = ch.get("content", "")[:400].split("\n")
bullets = "".join(f"<li>{l.strip()}</li>" for l in content_lines if l.strip())
slides_html.append(f"""<html><body style="width:720pt;height:405pt;padding:40pt;font-family:sans-serif;background:#ffffff;">
<h2 style="font-size:28pt;color:#1a1a2e;border-bottom:2pt solid #e94560;padding-bottom:10pt;">{ch['title']}</h2>
<ul style="font-size:14pt;color:#333;line-height:1.8;">{bullets}</ul>
</body></html>""")
# Write HTML files
for i, html in enumerate(slides_html):
(work / f"slide_{i}.html").write_text(html, encoding="utf-8")
# Write conversion script
script = work / "convert.js"
html2pptx_path = PPTX_SKILLS / "scripts" / "html2pptx.js"
slide_files = [f"slide_{i}.html" for i in range(len(slides_html))]
script.write_text(f"""\
const pptxgen = require('pptxgenjs');
const {{ html2pptx }} = require('{html2pptx_path}');
const path = require('path');
async function main() {{
const pptx = new pptxgen();
pptx.layout = 'LAYOUT_16x9';
const files = {json.dumps(slide_files)};
for (const f of files) {{
await html2pptx(path.join('{work}', f), pptx);
}}
await pptx.writeFile({{ fileName: '{output_dir / f"{title}.pptx"}' }});
}}
main().catch(e => {{ console.error(e); process.exit(1); }});
""", encoding="utf-8")
proc = await asyncio.create_subprocess_exec(
"node", str(script),
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
cwd=str(work),
)
stdout, stderr = await proc.communicate()
if proc.returncode != 0:
raise RuntimeError(f"html2pptx failed: {stderr.decode()}")
return output_dir / f"{title}.pptx"
async def _render_pptx_baseline(
self, draft: dict, data_assets: dict, output_dir: Path, title: str
) -> Path:
from pptx import Presentation
from pptx.util import Inches, Pt
from pptx.dml.color import RGBColor
prs = Presentation()
# Title slide
slide = prs.slides.add_slide(prs.slide_layouts[0])
slide.shapes.title.text = title
if len(slide.placeholders) > 1:
slide.placeholders[1].text = draft.get("executive_summary", "")[:200]
# Chapter slides
for chapter in draft.get("chapters", []):
slide = prs.slides.add_slide(prs.slide_layouts[1])
slide.shapes.title.text = chapter["title"]
body = slide.placeholders[1]
tf = body.text_frame
tf.clear()
content = chapter.get("content", "")
lines = [l.strip() for l in content.split("\n") if l.strip()]
for line in lines[:12]: # max 12 bullets per slide
p = tf.add_paragraph()
# Strip markdown markers
clean = line.lstrip("#-*0123456789. ").strip()
p.text = clean
p.font.size = Pt(14)
p.space_after = Pt(4)
# Data table slides
for table_spec in data_assets.get("tables", []):
slide = prs.slides.add_slide(prs.slide_layouts[5]) # blank layout
slide.shapes.title.text = table_spec.get("title", "数据表")
headers = table_spec.get("headers", [])
rows = table_spec.get("rows", [])
if headers and rows:
n_rows = min(len(rows) + 1, 10) # limit rows per slide
n_cols = len(headers)
tbl = slide.shapes.add_table(
n_rows, n_cols,
Inches(0.5), Inches(1.5), Inches(9), Inches(4.5)
).table
for i, h in enumerate(headers):
tbl.cell(0, i).text = str(h)
for r_idx, row in enumerate(rows[:n_rows - 1]):
for c_idx, val in enumerate(row[:n_cols]):
tbl.cell(r_idx + 1, c_idx).text = str(val)
path = output_dir / f"{title}.pptx"
prs.save(str(path))
return path
# -----------------------------------------------------------------------
# XLSX — openpyxl + recalc.py (formula recalculation)
# -----------------------------------------------------------------------
async def _render_xlsx(
self, data_assets: dict, output_dir: Path, title: str
) -> Path:
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
wb = Workbook()
ws = wb.active
ws.title = "数据总览"
# Professional styling
header_font = Font(bold=True, size=11, color="FFFFFF")
header_fill = PatternFill(start_color="1A1A2E", end_color="1A1A2E", fill_type="solid")
title_font = Font(bold=True, size=14, color="1A1A2E")
thin_border = Border(
left=Side(style="thin", color="CCCCCC"),
right=Side(style="thin", color="CCCCCC"),
top=Side(style="thin", color="CCCCCC"),
bottom=Side(style="thin", color="CCCCCC"),
)
current_row = 1
has_formulas = False
for table_spec in data_assets.get("tables", []):
# Table title
ws.cell(row=current_row, column=1, value=table_spec.get("title", "")).font = title_font
current_row += 1
headers = table_spec.get("headers", [])
rows = table_spec.get("rows", [])
if headers:
# Header row with styling
for col_idx, h in enumerate(headers, 1):
cell = ws.cell(row=current_row, column=col_idx, value=h)
cell.font = header_font
cell.fill = header_fill
cell.alignment = Alignment(horizontal="center")
cell.border = thin_border
current_row += 1
# Data rows
data_start = current_row
for row_data in rows:
for col_idx, val in enumerate(row_data, 1):
cell = ws.cell(row=current_row, column=col_idx, value=val)
cell.border = thin_border
# Try to convert numeric strings
if isinstance(val, str):
try:
cell.value = float(val.replace(",", ""))
except (ValueError, AttributeError):
pass
current_row += 1
# Auto-sum row for numeric columns
data_end = current_row - 1
if data_end > data_start:
for col_idx in range(1, len(headers) + 1):
col_letter = get_column_letter(col_idx)
test_cell = ws.cell(row=data_start, column=col_idx)
if isinstance(test_cell.value, (int, float)):
cell = ws.cell(
row=current_row, column=col_idx,
value=f"=SUM({col_letter}{data_start}:{col_letter}{data_end})"
)
cell.font = Font(bold=True)
cell.border = thin_border
has_formulas = True
elif col_idx == 1:
cell = ws.cell(row=current_row, column=1, value="合计")
cell.font = Font(bold=True)
cell.border = thin_border
current_row += 1
# Auto-fit column widths
for col_idx in range(1, len(headers) + 1):
max_len = max(
len(str(ws.cell(row=r, column=col_idx).value or ""))
for r in range(current_row - len(rows) - 2, current_row)
)
ws.column_dimensions[get_column_letter(col_idx)].width = min(max_len + 4, 30)
current_row += 2 # gap between tables
# Chart data sheets
for chart_spec in data_assets.get("charts", []):
chart_ws = wb.create_sheet(title=chart_spec.get("title", "图表")[:31])
chart_ws.cell(row=1, column=1, value=chart_spec.get("title", "")).font = title_font
chart_data = chart_spec.get("data", {})
labels = chart_data.get("labels", [])
datasets = chart_data.get("datasets", [])
# Headers: [项目, 数据集1, 数据集2, ...]
chart_ws.cell(row=2, column=1, value="项目").font = Font(bold=True)
for ds_idx, ds in enumerate(datasets, 2):
chart_ws.cell(row=2, column=ds_idx, value=ds.get("label", "")).font = Font(bold=True)
for r_idx, label in enumerate(labels, 3):
chart_ws.cell(row=r_idx, column=1, value=label)
for ds_idx, ds in enumerate(datasets, 2):
data = ds.get("data", [])
if r_idx - 3 < len(data):
chart_ws.cell(row=r_idx, column=ds_idx, value=data[r_idx - 3])
path = output_dir / f"{title}.xlsx"
wb.save(str(path))
# Run recalc.py if we have formulas and the skill is available
if has_formulas and self.skills["recalc"]:
await self._xlsx_recalc(path)
return path
async def _xlsx_recalc(self, path: Path):
"""Recalculate formulas using Skills recalc.py (requires LibreOffice)."""
recalc_script = XLSX_SKILLS / "recalc.py"
logger.info(f"[formatter] running recalc.py on {path}")
try:
proc = await asyncio.create_subprocess_exec(
"python3", str(recalc_script), str(path), "30",
stdout=subprocess.PIPE, stderr=subprocess.PIPE,
)
stdout, stderr = await proc.communicate()
if proc.returncode == 0:
result = json.loads(stdout.decode())
logger.info(f"[formatter] recalc result: {result.get('status')}")
else:
logger.warning(f"[formatter] recalc.py failed: {stderr.decode()[:200]}")
except Exception as e:
logger.warning(f"[formatter] recalc.py error: {e}")
# -----------------------------------------------------------------------
# PDF — reportlab with CJK support + fpdf2 fallback
# -----------------------------------------------------------------------
async def _render_pdf(
self, draft: dict, data_assets: dict, output_dir: Path, title: str
) -> Path:
try:
return await self._render_pdf_reportlab(draft, data_assets, output_dir, title)
except Exception as e:
logger.warning(f"[formatter] reportlab failed ({e}), falling back to fpdf2")
return await self._render_pdf_fpdf(draft, data_assets, output_dir, title)
async def _render_pdf_reportlab(
self, draft: dict, data_assets: dict, output_dir: Path, title: str
) -> Path:
"""Generate PDF with reportlab — better CJK support and table rendering."""
from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import mm
from reportlab.lib import colors
from reportlab.platypus import (
SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle, PageBreak,
)
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
# Try to register a CJK font
cjk_font = "Helvetica"
for font_path in [
"/System/Library/Fonts/STHeiti Medium.ttc",
"/System/Library/Fonts/PingFang.ttc",
"/usr/share/fonts/truetype/noto/NotoSansCJK-Regular.ttc",
]:
if Path(font_path).exists():
try:
pdfmetrics.registerFont(TTFont("CJK", font_path, subfontIndex=0))
cjk_font = "CJK"
break
except Exception:
continue
path = output_dir / f"{title}.pdf"
doc = SimpleDocTemplate(str(path), pagesize=A4,
topMargin=25*mm, bottomMargin=25*mm)
styles = getSampleStyleSheet()
styles.add(ParagraphStyle(
name="CJKTitle", fontName=cjk_font, fontSize=22,
spaceAfter=12, alignment=1,
))
styles.add(ParagraphStyle(
name="CJKHeading", fontName=cjk_font, fontSize=16,
spaceAfter=8, spaceBefore=16, textColor=colors.HexColor("#1a1a2e"),
))
styles.add(ParagraphStyle(
name="CJKBody", fontName=cjk_font, fontSize=11,
spaceAfter=6, leading=16,
))
elements = []
# Title
elements.append(Paragraph(title, styles["CJKTitle"]))
elements.append(Spacer(1, 12))
# Executive summary
if summary := draft.get("executive_summary"):
elements.append(Paragraph("执行摘要", styles["CJKHeading"]))
elements.append(Paragraph(summary, styles["CJKBody"]))
elements.append(Spacer(1, 12))
# Chapters
for chapter in draft.get("chapters", []):
elements.append(PageBreak())
elements.append(Paragraph(chapter["title"], styles["CJKHeading"]))
content = chapter.get("content", "")
for para in content.split("\n\n"):
para = para.strip()
if para:
elements.append(Paragraph(para, styles["CJKBody"]))
# Tables
for table_spec in data_assets.get("tables", []):
elements.append(Spacer(1, 12))
elements.append(Paragraph(table_spec.get("title", ""), styles["CJKHeading"]))
headers = table_spec.get("headers", [])
rows = table_spec.get("rows", [])
if headers:
table_data = [headers] + rows
t = Table(table_data)
t.setStyle(TableStyle([
("BACKGROUND", (0, 0), (-1, 0), colors.HexColor("#1a1a2e")),
("TEXTCOLOR", (0, 0), (-1, 0), colors.white),
("FONTNAME", (0, 0), (-1, -1), cjk_font),
("FONTSIZE", (0, 0), (-1, 0), 10),
("FONTSIZE", (0, 1), (-1, -1), 9),
("GRID", (0, 0), (-1, -1), 0.5, colors.grey),
("ALIGN", (0, 0), (-1, -1), "CENTER"),
("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, colors.HexColor("#f5f5f5")]),
]))
elements.append(t)
doc.build(elements)
return path
async def _render_pdf_fpdf(
self, draft: dict, data_assets: dict, output_dir: Path, title: str
) -> Path:
"""Fallback PDF generation with fpdf2."""
from fpdf import FPDF
pdf = FPDF()
pdf.set_auto_page_break(auto=True, margin=15)
# Try CJK font
for font_path in [
"/System/Library/Fonts/STHeiti Medium.ttc",
"/System/Library/Fonts/PingFang.ttc",
]:
if Path(font_path).exists():
try:
pdf.add_font("CJK", "", font_path, uni=True)
pdf.set_font("CJK", "", 11)
break
except Exception:
pdf.set_font("Helvetica", "", 11)
else:
pdf.set_font("Helvetica", "", 11)
pdf.add_page()
pdf.set_font_size(24)
pdf.cell(0, 20, title, new_x="LMARGIN", new_y="NEXT", align="C")
pdf.set_font_size(11)
if summary := draft.get("executive_summary"):
pdf.set_font_size(16)
pdf.cell(0, 12, "执行摘要", new_x="LMARGIN", new_y="NEXT")
pdf.set_font_size(11)
pdf.multi_cell(0, 6, summary)
for chapter in draft.get("chapters", []):
pdf.add_page()
pdf.set_font_size(16)
pdf.cell(0, 12, chapter["title"], new_x="LMARGIN", new_y="NEXT")
pdf.set_font_size(11)
pdf.multi_cell(0, 6, chapter.get("content", ""))
path = output_dir / f"{title}.pdf"
pdf.output(str(path))
return path

103
app/agents/researcher.py Normal file
View File

@@ -0,0 +1,103 @@
"""Researcher Agent — domain-aware, bilingual research."""
from __future__ import annotations
from typing import Any
from .base import BaseAgent
from app.config import settings
SYSTEM_EN = """\
You are a senior industry analyst at a top-tier consulting firm.
Your task is to produce a thorough research brief based on the given instructions.
Requirements:
1. Be specific — cite concrete data points, market sizes, growth rates, company names
2. Be structured — organize findings with clear headings and logical flow
3. Be analytical — don't just list facts, provide insights and implications
4. Flag data gaps — explicitly note where data is uncertain or unavailable
Output (JSON):
{
"title": "Research brief title",
"executive_summary": "2-3 sentence summary of key findings",
"sections": [
{
"heading": "Section heading",
"content": "Detailed findings (Markdown)",
"data_points": ["key data points extracted"],
"sources_quality": "high|medium|low — how confident are you in the data"
}
],
"data_gaps": ["areas where data is insufficient or uncertain"],
"key_insights": ["top 3-5 non-obvious insights"]
}"""
SYSTEM_ZH = """\
你是一位顶级咨询公司的资深行业分析师。
你的任务是根据给定的指令,输出一份深度研究简报。
要求:
1. 具体——引用具体的数据点、市场规模、增长率、企业名称
2. 结构化——用清晰的标题和逻辑流组织发现
3. 有分析深度——不要只罗列事实,要提供洞察和含义
4. 标注数据缺口——明确指出数据不确定或不可获取的地方
输出JSON
{
"title": "研究简报标题",
"executive_summary": "核心发现的2-3句总结",
"sections": [
{
"heading": "章节标题",
"content": "详细发现Markdown格式",
"data_points": ["提取的关键数据点"],
"sources_quality": "high|medium|low — 对数据的置信度"
}
],
"data_gaps": ["数据不充分或不确定的领域"],
"key_insights": ["3-5条非显而易见的洞察"]
}"""
class ResearcherAgent(BaseAgent):
name = "researcher"
description = "域感知研究 — 根据领域选择最优模型和语言"
def __init__(self, model: str | None = None, language: str = "en"):
super().__init__(model=model)
self.language = language
self.system_prompt = SYSTEM_ZH if language == "zh" else SYSTEM_EN
async def run(self, context: dict[str, Any]) -> dict[str, Any]:
requirement = context["requirement"]
report_type = context.get("report_type", "")
extra_data = context.get("extra_data", "")
if self.language == "zh":
prompt = f"""\
## 研究指令
{requirement}
## 研究方向
{report_type}
## 补充数据
{extra_data if extra_data else "(无)"}
请输出研究简报 JSON。"""
else:
prompt = f"""\
## Research instructions
{requirement}
## Research focus
{report_type}
## Additional data
{extra_data if extra_data else "(none)"}
Output the research brief as JSON."""
result = await self.call_llm_json(prompt, max_tokens=6144)
return {"research": result}

79
app/agents/reviewer.py Normal file
View File

@@ -0,0 +1,79 @@
"""Reviewer Agent — bilingual quality check with strongest reasoning model."""
from __future__ import annotations
import json
from typing import Any
from .base import BaseAgent
from app.config import settings
class ReviewerAgent(BaseAgent):
name = "reviewer"
description = "双语报告质量审查 — 使用最强推理模型"
system_prompt = """\
You are a senior consulting partner reviewing a report before client delivery.
The report has both Chinese and English versions (or will be translated).
Review dimensions:
1. **Accuracy** — Are data points, percentages, and claims supported by the research?
Cross-check global claims against English research, Chinese claims against Chinese research.
2. **Logical consistency** — Does the narrative flow? Are there contradictions between chapters?
3. **Depth of analysis** — Is it consultancy-grade or just surface-level? Would a C-suite exec find it valuable?
4. **Bilingual quality** — If translated version exists, check for translation artifacts,
mistranslated terminology, or cultural mismatches.
5. **Data gaps honesty** — Are uncertainties acknowledged or are claims fabricated?
6. **Completeness** — Are any critical aspects of the requirement left unaddressed?
Scoring guide:
- 90+: Publication-ready
- 80-89: Minor issues, can pass with notes
- 70-79: Needs revision (verdict: revise)
- <70: Significant problems (verdict: reject)
Output (JSON):
{
"overall_score": 85,
"verdict": "pass|revise|reject",
"issues": [
{
"severity": "high|medium|low",
"chapter": "affected chapter",
"dimension": "accuracy|consistency|depth|bilingual|gaps|completeness",
"description": "issue description",
"suggestion": "specific fix suggestion"
}
],
"strengths": ["what the report does well"],
"summary": "Overall assessment (2-3 sentences)"
}"""
def __init__(self):
super().__init__(model=settings.model_for_domain("reasoning"))
async def run(self, context: dict[str, Any]) -> dict[str, Any]:
draft = context["draft"]
draft_translated = context.get("draft_translated", {})
research = context["research"]
sections = [
"## Research Plan (what was asked)",
json.dumps(research, ensure_ascii=False, indent=2),
"",
"## Primary Draft",
json.dumps(draft, ensure_ascii=False, indent=2),
]
if draft_translated:
sections.extend([
"",
"## Translated Version",
json.dumps(draft_translated, ensure_ascii=False, indent=2),
])
prompt = "\n".join(sections) + "\n\nReview the report. Output JSON."
result = await self.call_llm_json(prompt, max_tokens=4096)
return {"review": result}

86
app/agents/writer.py Normal file
View File

@@ -0,0 +1,86 @@
"""Writer Agent — synthesizes multilingual research tracks into a cohesive report."""
from __future__ import annotations
import json
from typing import Any
from .base import BaseAgent
from app.config import settings
class WriterAgent(BaseAgent):
name = "writer"
description = "汇聚多语言/多领域研究成果,撰写完整报告"
system_prompt = """\
You are an expert consulting report writer. Your task is to synthesize research
findings from MULTIPLE parallel tracks (some in English, some in Chinese) into
ONE cohesive, professional consulting report.
CRITICAL RULES:
1. The PRIMARY output language is Chinese (中文) — this is for Chinese clients
2. For global/international sections, the analysis depth must reflect the English research
3. For China-specific sections, preserve the precision of Chinese-native research
4. Maintain professional consulting tone throughout
5. Every claim should trace back to a research track's findings
6. Mark chart/table needs: {{CHART:描述}} and {{TABLE:描述}}
7. If a research track flags "data_gaps", acknowledge uncertainty rather than fabricating
Output (JSON):
{
"title": "报告标题(中文)",
"title_en": "Report Title (English)",
"chapters": [
{
"title": "章节标题",
"content": "章节正文Markdown 格式,中文)",
"source_tracks": ["引用的研究轨道名称"],
"charts": ["图表需求"],
"tables": ["表格需求"]
}
],
"executive_summary": "执行摘要中文300-500字",
"executive_summary_en": "Executive Summary (English, 200-400 words)"
}"""
def __init__(self):
super().__init__(model=settings.model_for_domain("reasoning"))
async def run(self, context: dict[str, Any]) -> dict[str, Any]:
research = context["research"]
requirement = context["requirement"]
revision_feedback = context.get("revision_feedback", "")
# Format multi-track, multilingual research
tracks_text = ""
for track in research.get("tracks", []):
lang_tag = f"[{track.get('native_language', '?').upper()}]"
domain_tag = f"[{track.get('domain', '?')}]"
tracks_text += f"\n### {domain_tag} {lang_tag} {track.get('track', '')}\n"
findings = track.get("findings", {})
tracks_text += json.dumps(findings, ensure_ascii=False, indent=2)
synthesis_guide = research.get("synthesis_guide", "")
prompt = f"""\
## 原始需求 / Original Requirement
{requirement}
## 报告标题
中文:{research.get("title_zh", "")}
English: {research.get("title_en", "")}
## 写作指导 / Synthesis Guide
{synthesis_guide}
## 各研究轨道成果 / Research Track Results
(注意:有些轨道是英文原版 [EN],有些是中文原版 [ZH],请综合使用)
{tracks_text}
{f"## 审稿反馈 / Review Feedback{revision_feedback}" if revision_feedback else ""}
请汇聚以上研究成果,撰写完整的中文报告。输出 JSON。"""
result = await self.call_llm_json(prompt, max_tokens=8192)
return {"draft": result}

0
app/api/__init__.py Normal file
View File

110
app/api/routes.py Normal file
View File

@@ -0,0 +1,110 @@
"""API routes for report generation."""
from __future__ import annotations
import logging
from fastapi import APIRouter, HTTPException
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from app.graph.state import ReportState
from app.pipeline.orchestrator import PipelineOrchestrator
from app.pipeline.task import create_report_state
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api")
# In-memory store (swap for DB later)
reports: dict[str, ReportState] = {}
orchestrator = PipelineOrchestrator()
class CreateReportRequest(BaseModel):
requirement: str
report_type: str = "行业分析报告"
extra_data: str = ""
output_formats: list[str] = ["docx"]
client_id: str | None = None
class ReportResponse(BaseModel):
id: str
current_node: str
error: str | None = None
generated_files: list[str] = []
node_history: list[dict] = []
revision_count: int = 0
def _to_response(state: ReportState) -> ReportResponse:
return ReportResponse(
id=state.id,
current_node=state.current_node,
error=state.error,
generated_files=state.generated_files,
node_history=state.node_history,
revision_count=state.revision_count,
)
@router.post("/reports", response_model=ReportResponse)
async def create_report(req: CreateReportRequest):
"""Create and execute a report generation pipeline."""
state = create_report_state(
requirement=req.requirement,
report_type=req.report_type,
extra_data=req.extra_data,
output_formats=req.output_formats,
client_id=req.client_id,
)
reports[state.id] = state
# Run the full graph (blocking for now, add task queue later)
state = await orchestrator.run(state)
reports[state.id] = state
if state.error:
raise HTTPException(status_code=500, detail=state.error)
return _to_response(state)
@router.get("/reports/{report_id}", response_model=ReportResponse)
async def get_report(report_id: str):
"""Get report status and results."""
state = reports.get(report_id)
if not state:
raise HTTPException(status_code=404, detail="Report not found")
return _to_response(state)
@router.get("/reports")
async def list_reports():
"""List all reports."""
return [_to_response(s) for s in reports.values()]
@router.get("/reports/{report_id}/detail")
async def get_report_detail(report_id: str):
"""Get full report detail including draft and research."""
state = reports.get(report_id)
if not state:
raise HTTPException(status_code=404, detail="Report not found")
return {
"id": state.id,
"requirement": state.requirement,
"decomposition": state.decomposition,
"research_results": [
{
"description": r.description,
"status": r.status.value,
"duration_ms": r.duration_ms,
}
for r in state.research_results
],
"draft": state.draft,
"review": state.review,
"generated_files": state.generated_files,
"node_history": state.node_history,
}

57
app/config.py Normal file
View File

@@ -0,0 +1,57 @@
"""Application configuration — multi-model pool by domain."""
from pathlib import Path
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
# --- Model Pool ---
# Global analysis (English-native): strongest global reasoning
llm_global: str = "openai/Claude-3.5-Sonnet"
# China domestic (Chinese-native): deepest Chinese market knowledge
llm_china: str = "openai/DeepSeek-R1"
# Synthesis & review: strongest reasoning for cross-domain work
llm_reasoning: str = "openai/Claude-3.5-Sonnet"
# Data & fast tasks: cost-effective structured output
llm_fast: str = "openai/Gemini-2.0-Flash"
# Translation: high-quality bidirectional EN↔ZH
llm_translation: str = "openai/Claude-3.5-Sonnet"
# Fallback default (if domain not specified)
llm_model: str = "openai/Claude-3.5-Sonnet"
# API config (Poe as unified gateway)
llm_api_key: str = ""
llm_api_base: str = "https://api.poe.com/bot/"
# Server
host: str = "0.0.0.0"
port: int = 4200
# Paths
base_dir: Path = Path(__file__).resolve().parent.parent
templates_dir: Path = base_dir / "templates"
output_dir: Path = base_dir / "output"
model_config = {"env_file": ".env", "env_file_encoding": "utf-8"}
def model_for_domain(self, domain: str) -> str:
"""Get the best model for a content domain.
Domains:
global — international markets, global competition, tech trends
china — Chinese market, domestic policy, local competition
reasoning — synthesis, review, strategic recommendations
fast — data processing, chart generation, structured output
translation — high-quality EN↔ZH translation
"""
return {
"global": self.llm_global,
"china": self.llm_china,
"reasoning": self.llm_reasoning,
"fast": self.llm_fast,
"translation": self.llm_translation,
}.get(domain, self.llm_model)
settings = Settings()

0
app/data/__init__.py Normal file
View File

52
app/data/factory.py Normal file
View File

@@ -0,0 +1,52 @@
"""Data router factory — assembles the router with all available sources."""
from __future__ import annotations
import logging
from .router import DataRouter
from .sources.akshare_source import AKShareSource
from .sources.worldbank_source import WorldBankSource
from .sources.gpt_researcher_source import GPTResearcherSource
logger = logging.getLogger(__name__)
def create_data_router() -> DataRouter:
"""Create a DataRouter with all available data sources.
Priority order (lower = tried first):
10 AKShare — free, fast, covers Chinese macro/industry
20 World Bank — free, global macro
90 GPT Researcher — universal fallback (web research)
Future additions:
30 巨潮资讯 — free tier, Chinese public companies
40 天眼查 API — paid, company data
50 Choice API — paid, financial data
60 Statista API — paid, global industry stats
70 FRED API — free, US macro
80 UN Comtrade — free, global trade
"""
router = DataRouter()
# --- Free tier (always available) ---
router.register(AKShareSource(), priority=10)
router.register(WorldBankSource(), priority=20)
# --- Universal fallback ---
router.register(GPTResearcherSource(), priority=90)
logger.info(f"[data_factory] created router with {len(router.sources)} sources")
return router
# Singleton
_router: DataRouter | None = None
def get_data_router() -> DataRouter:
global _router
if _router is None:
_router = create_data_router()
return _router

82
app/data/router.py Normal file
View File

@@ -0,0 +1,82 @@
"""DataRouter — routes data requests to the optimal source.
Architecture:
Agent needs data → DataRouter → best available source → standardized result
Sources are tried in priority order. If a source fails or is unavailable,
the router falls back to the next source.
"""
from __future__ import annotations
import logging
from typing import Any
from .sources.base import DataSource, DataResult
logger = logging.getLogger(__name__)
class DataRouter:
"""Routes data queries to the best available source.
Sources are registered with a priority (lower = higher priority).
For each query, the router tries sources in priority order until one succeeds.
"""
def __init__(self):
self.sources: list[tuple[int, DataSource]] = []
def register(self, source: DataSource, priority: int = 100) -> "DataRouter":
self.sources.append((priority, source))
self.sources.sort(key=lambda x: x[0])
logger.info(f"[data_router] registered {source.name} (priority={priority})")
return self
async def query(
self,
query: str,
data_type: str = "general",
country: str | None = None,
**kwargs,
) -> DataResult:
"""Query data sources in priority order.
Args:
query: Natural language or structured data query
data_type: One of: macro, industry, company, trade, patent, general
country: ISO country code (CN, US, etc.) or None for global
**kwargs: Source-specific parameters
"""
errors = []
for priority, source in self.sources:
if not source.supports(data_type, country):
continue
try:
logger.info(f"[data_router] trying {source.name} for '{query[:50]}...'")
result = await source.fetch(query, data_type=data_type, country=country, **kwargs)
if result.data:
logger.info(f"[data_router] {source.name} returned {len(str(result.data))} chars")
return result
except Exception as e:
errors.append(f"{source.name}: {e}")
logger.warning(f"[data_router] {source.name} failed: {e}")
# All sources failed
return DataResult(
source="none",
data=None,
error=f"All sources failed: {'; '.join(errors)}",
)
async def query_multiple(
self,
queries: list[dict[str, Any]],
) -> list[DataResult]:
"""Run multiple queries (can be parallelized later)."""
import asyncio
return await asyncio.gather(*[
self.query(**q) for q in queries
])

View File

View File

@@ -0,0 +1,94 @@
"""AKShare data source — Chinese macro/industry data via open-source Python library.
Covers: GDP, CPI, PMI, industrial profit, trade balance, and 30+ data categories.
All data returned as Pandas DataFrames, converted to dicts for standardization.
"""
from __future__ import annotations
import logging
from typing import Any
from .base import DataSource, DataResult
logger = logging.getLogger(__name__)
# Map common data requests to AKShare function names
AKSHARE_ENDPOINTS = {
"gdp": "macro_china_gdp",
"cpi": "macro_china_cpi_monthly",
"ppi": "macro_china_ppi",
"pmi": "macro_china_pmi",
"industrial_profit": "macro_china_industrial_profit",
"trade_balance": "macro_china_trade_balance",
"money_supply": "macro_china_money_supply",
"fdi": "macro_china_fdi",
"real_estate": "macro_china_real_estate",
"retail_sales": "macro_china_consumer_goods_retail",
"fixed_asset": "macro_china_fai",
"unemployment": "macro_china_urban_unemployment",
# US macro
"us_gdp": "macro_usa_gdp_monthly",
"us_cpi": "macro_usa_cpi_monthly",
"us_unemployment": "macro_usa_unemployment_rate",
# Global
"global_gdp": "macro_global_gdp",
}
class AKShareSource(DataSource):
name = "akshare"
description = "中国宏观经济/行业数据免费开源封装统计局等30+数据源)"
def supports(self, data_type: str, country: str | None = None) -> bool:
return data_type in ("macro", "industry", "general")
async def fetch(
self, query: str, *, data_type: str = "general", country: str | None = None, **kwargs,
) -> DataResult:
try:
import akshare as ak
except ImportError:
return DataResult(source=self.name, error="akshare not installed (pip install akshare)")
# Try to match query to a known endpoint
endpoint_name = kwargs.get("endpoint")
if not endpoint_name:
query_lower = query.lower()
for key, func_name in AKSHARE_ENDPOINTS.items():
if key in query_lower:
endpoint_name = func_name
break
if not endpoint_name:
return DataResult(source=self.name, data=None, error=f"No matching AKShare endpoint for: {query}")
try:
func = getattr(ak, endpoint_name, None)
if not func:
return DataResult(source=self.name, error=f"AKShare function not found: {endpoint_name}")
logger.info(f"[akshare] calling ak.{endpoint_name}()")
df = func()
# Convert to dict for serialization
# Take last N rows for recent data
limit = kwargs.get("limit", 20)
recent = df.tail(limit)
return DataResult(
source=self.name,
data={
"columns": list(recent.columns),
"records": recent.to_dict(orient="records"),
"total_rows": len(df),
"returned_rows": len(recent),
},
metadata={
"endpoint": endpoint_name,
"description": f"AKShare {endpoint_name}",
"format": "tabular",
},
)
except Exception as e:
return DataResult(source=self.name, error=f"AKShare call failed: {e}")

34
app/data/sources/base.py Normal file
View File

@@ -0,0 +1,34 @@
"""Base class for data sources."""
from __future__ import annotations
from abc import ABC, abstractmethod
from typing import Any
from pydantic import BaseModel, Field
class DataResult(BaseModel):
"""Standardized result from any data source."""
source: str = ""
data: Any = None
metadata: dict[str, Any] = Field(default_factory=dict)
# metadata includes: unit, time_range, update_date, confidence, etc.
error: str | None = None
cached: bool = False
class DataSource(ABC):
"""Abstract data source."""
name: str = "base"
description: str = ""
def supports(self, data_type: str, country: str | None = None) -> bool:
"""Return True if this source can handle this data type / country."""
return True
@abstractmethod
async def fetch(
self, query: str, *, data_type: str = "general", country: str | None = None, **kwargs,
) -> DataResult:
...

View File

@@ -0,0 +1,61 @@
"""GPT Researcher MCP — deep web research as fallback for any industry.
This is the universal fallback: when structured data sources don't have
data for a niche/cold industry, deep web research fills the gap.
Requires GPT Researcher MCP server to be running (already configured in ~/.claude.json).
For direct API use, we call the MCP tools via the subprocess approach.
"""
from __future__ import annotations
import asyncio
import json
import logging
import subprocess
from typing import Any
from .base import DataSource, DataResult
logger = logging.getLogger(__name__)
class GPTResearcherSource(DataSource):
name = "gpt_researcher"
description = "Deep web research — universal fallback for any industry/topic"
def supports(self, data_type: str, country: str | None = None) -> bool:
# Supports everything — this is the universal fallback
return True
async def fetch(
self, query: str, *, data_type: str = "general", country: str | None = None, **kwargs,
) -> DataResult:
mode = kwargs.get("mode", "quick") # "quick" or "deep"
# GPT Researcher is available as MCP tools in Claude Code.
# For standalone use, we need to call it via its API.
# The MCP server runs at a local port — check if available.
# For now, provide a structured placeholder that agents can use
# to request deep research. The actual MCP call happens at the
# agent level when integrated into the pipeline.
return DataResult(
source=self.name,
data={
"query": query,
"mode": mode,
"status": "ready",
"note": (
"GPT Researcher MCP is available for deep web research. "
"Call via MCP tools: deep_research() or quick_search(). "
"This source returns research-ready queries for MCP integration."
),
},
metadata={
"type": "mcp_research_request",
"mode": mode,
"data_type": data_type,
"country": country,
},
)

View File

@@ -0,0 +1,104 @@
"""World Bank Open Data — global macro indicators, 217 economies, free API.
API: https://api.worldbank.org/v2/
"""
from __future__ import annotations
import logging
from typing import Any
import httpx
from .base import DataSource, DataResult
logger = logging.getLogger(__name__)
BASE_URL = "https://api.worldbank.org/v2"
# Common indicators for consulting reports
INDICATORS = {
"gdp": "NY.GDP.MKTP.CD", # GDP (current US$)
"gdp_growth": "NY.GDP.MKTP.KD.ZG", # GDP growth (annual %)
"gdp_per_capita": "NY.GDP.PCAP.CD", # GDP per capita
"population": "SP.POP.TOTL", # Total population
"inflation": "FP.CPI.TOTL.ZG", # Inflation (CPI %)
"trade_pct_gdp": "NE.TRD.GNFS.ZS", # Trade (% of GDP)
"fdi_net": "BX.KLT.DINV.CD.WD", # FDI net inflows
"unemployment": "SL.UEM.TOTL.ZS", # Unemployment (%)
"exports": "NE.EXP.GNFS.CD", # Exports
"imports": "NE.IMP.GNFS.CD", # Imports
"r_and_d": "GB.XPD.RSDV.GD.ZS", # R&D expenditure (% GDP)
"high_tech_exports": "TX.VAL.TECH.MF.ZS", # High-tech exports (% manufactured)
}
class WorldBankSource(DataSource):
name = "worldbank"
description = "World Bank Open Data — 1600+ indicators, 217 economies, free"
def supports(self, data_type: str, country: str | None = None) -> bool:
return data_type in ("macro", "general")
async def fetch(
self, query: str, *, data_type: str = "general", country: str | None = None, **kwargs,
) -> DataResult:
indicator_code = kwargs.get("indicator")
if not indicator_code:
query_lower = query.lower()
for key, code in INDICATORS.items():
if key in query_lower:
indicator_code = code
break
if not indicator_code:
# Default to GDP
indicator_code = INDICATORS["gdp"]
country_code = country or "WLD" # WLD = World
per_page = kwargs.get("per_page", 20)
url = f"{BASE_URL}/country/{country_code}/indicator/{indicator_code}"
params = {
"format": "json",
"per_page": per_page,
}
try:
async with httpx.AsyncClient(timeout=15) as client:
resp = await client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
if not data or len(data) < 2:
return DataResult(source=self.name, data=None, error="No data returned")
metadata_raw = data[0]
records = data[1]
# Parse into clean format
clean_records = []
for r in records:
if r.get("value") is not None:
clean_records.append({
"year": r["date"],
"value": r["value"],
"country": r["country"]["value"],
"indicator": r["indicator"]["value"],
})
return DataResult(
source=self.name,
data={
"indicator": indicator_code,
"country": country_code,
"records": clean_records,
},
metadata={
"total": metadata_raw.get("total", 0),
"indicator_name": clean_records[0]["indicator"] if clean_records else "",
"format": "timeseries",
},
)
except Exception as e:
return DataResult(source=self.name, error=f"World Bank API failed: {e}")

0
app/graph/__init__.py Normal file
View File

108
app/graph/builder.py Normal file
View File

@@ -0,0 +1,108 @@
"""Graph builder — assembles nodes into an executable report generation graph.
No LangGraph dependency. Pure asyncio with a simple node-runner pattern.
v2: domain-aware, bilingual (translate node added)
"""
from __future__ import annotations
import logging
from typing import Callable, Awaitable
from app.middleware.chain import MiddlewareChain
from .state import ReportState, NodeStatus
from .nodes import (
DecomposeNode,
ParallelResearchNode,
WriteNode,
TranslateNode,
DataNode,
ReviewNode,
FormatNode,
)
logger = logging.getLogger(__name__)
NodeFn = Callable[[ReportState], Awaitable[ReportState]]
class ReportGraph:
"""Executable graph for report generation.
Graph structure (v2):
decompose → parallel_research → write → translate → data → review →
├─ pass → format → END
└─ revise → write → translate → data → review → ...
"""
def __init__(self, middleware: MiddlewareChain | None = None):
self.middleware = middleware
self.decompose = DecomposeNode()
self.parallel_research = ParallelResearchNode()
self.write = WriteNode()
self.translate = TranslateNode()
self.data = DataNode()
self.review = ReviewNode()
self.format = FormatNode()
async def _run_node(self, name: str, node: NodeFn, state: ReportState) -> ReportState:
"""Run a single node with error handling."""
try:
logger.info(f"[graph] entering node: {name}")
state = await node(state)
logger.info(f"[graph] completed node: {name}")
except Exception as e:
state.error = f"Node '{name}' failed: {e}"
state.log_node(name, NodeStatus.FAILED, str(e))
logger.exception(f"[graph] node '{name}' failed")
raise
return state
async def run(self, state: ReportState) -> ReportState:
"""Execute the full graph."""
# --- Middleware: before ---
if self.middleware:
state = await self.middleware.before(state)
try:
# 1. Decompose requirement into domain-tagged parallel tracks
state = await self._run_node("decompose", self.decompose, state)
# 2. Run parallel research (each track uses domain-optimal model)
state = await self._run_node("parallel_research", self.parallel_research, state)
# 3-6. Write → Translate → Data → Review (with revision loop)
while True:
state = await self._run_node("write", self.write, state)
state = await self._run_node("translate", self.translate, state)
state = await self._run_node("data", self.data, state)
state = await self._run_node("review", self.review, state)
verdict = state.review.get("verdict", "pass")
if verdict == "pass" or state.revision_count >= state.max_revisions:
if verdict != "pass":
logger.warning(
f"[graph] forcing pass after {state.revision_count} revisions"
)
break
# Revise — loop back to write
state.revision_count += 1
logger.info(
f"[graph] revision {state.revision_count}/{state.max_revisions}"
)
# 7. Format bilingual output files
state = await self._run_node("format", self.format, state)
except Exception:
# Error already logged in _run_node
pass
# --- Middleware: after ---
if self.middleware:
state = await self.middleware.after(state)
return state

393
app/graph/nodes.py Normal file
View File

@@ -0,0 +1,393 @@
"""Graph nodes — each node is an async function: ReportState → ReportState.
Node layout (v2 — domain-aware, bilingual):
START
[decompose] — Lead Agent 分解为并行研究轨道,每轨标注 domain + language
[parallel_research] — N 个子 Agent 并行,每个用最适合该领域的模型
│ global tracks → Claude/GPT (English)
│ china tracks → DeepSeek/Qwen (Chinese)
[write] — Writer 汇聚 → 生成主语言版本
[translate] — 高质量翻译 → 生成另一语言版本
[data] — Data Agent 生成图表/表格
[review] — Reviewer 审查(双语)
│ ├─ pass → [format]
│ └─ revise → [write]
[format] — 输出双语版本文件
END
"""
from __future__ import annotations
import asyncio
import json
import logging
from datetime import datetime
from typing import Any
from app.agents.base import BaseAgent
from app.agents.researcher import ResearcherAgent
from app.agents.writer import WriterAgent
from app.agents.data_agent import DataAgent
from app.agents.reviewer import ReviewerAgent
from app.agents.formatter import FormatterAgent
from app.config import settings
from .state import ReportState, SubtaskResult, NodeStatus, ContentDomain
logger = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Node: decompose — Lead Agent decomposes into domain-tagged parallel tracks
# ---------------------------------------------------------------------------
class DecomposeNode:
"""Analyzes requirement and decomposes into domain-aware research tracks."""
def __init__(self):
self.agent = BaseAgent()
self.agent.name = "lead"
self.agent.model = settings.model_for_domain("reasoning")
async def __call__(self, state: ReportState) -> ReportState:
state.current_node = "decompose"
state.log_node("decompose", NodeStatus.RUNNING)
system = """\
You are a senior consulting partner planning a global industry report.
Your job is to decompose the client's requirement into 2-6 parallel research tracks.
CRITICAL: Each track must be tagged with a content domain and native language:
- domain: "global" → international markets, global competition, technology trends, overseas benchmarks
→ native_language: "en" (English sources are 10-100x richer for global analysis)
- domain: "china" → Chinese domestic market, government policy, local competitors, China-specific data
→ native_language: "zh" (Chinese sources are authoritative for domestic analysis)
The PRINCIPLE: whichever language has the richest professional literature for that topic
should be the native language. The other language version will be translated later.
Output (JSON):
{
"title_en": "English report title",
"title_zh": "中文报告标题",
"report_type": "report type",
"tracks": [
{
"title": "track title (in native language)",
"domain": "global|china",
"native_language": "en|zh",
"focus": "research focus description",
"prompt": "detailed research instructions (MUST be in the native_language)",
"data_needs": ["required data/charts"]
}
],
"synthesis_guide": "How to merge all tracks into a coherent report (bilingual structure notes)",
"methodology": "Analysis methodology"
}"""
prompt = f"""\
## Client requirement
{state.requirement}
## Report type
{state.report_type}
## Additional data
{state.extra_data or "(none)"}
## Client context
{state.client_context or "(none)"}
Decompose into parallel research tracks with domain and language tags. Output JSON."""
result = await self.agent.call_llm_json(prompt, system=system)
state.decomposition = result
state.log_node("decompose", NodeStatus.COMPLETED,
f"{len(result.get('tracks', []))} tracks")
return state
# ---------------------------------------------------------------------------
# Node: parallel_research — domain-aware parallel execution
# ---------------------------------------------------------------------------
class ParallelResearchNode:
"""Runs research subtasks in parallel, each using the optimal model for its domain."""
MAX_CONCURRENT = 5
async def _run_one(self, track: dict[str, Any]) -> SubtaskResult:
domain_str = track.get("domain", "global")
domain = ContentDomain(domain_str) if domain_str in ContentDomain.__members__.values() else ContentDomain.GLOBAL
native_lang = track.get("native_language", "en")
result = SubtaskResult(
description=track.get("title", ""),
domain=domain,
native_language=native_lang,
)
result.status = NodeStatus.RUNNING
result.started_at = datetime.now()
try:
# Select model based on domain
model = settings.model_for_domain(domain.value)
agent = ResearcherAgent(model=model, language=native_lang)
logger.info(
f"[parallel_research] track '{track.get('title')}' "
f"→ domain={domain.value}, lang={native_lang}, model={model}"
)
research = await agent.run({
"requirement": track["prompt"],
"report_type": track.get("focus", ""),
"extra_data": "",
})
result.content = research.get("research", {})
result.status = NodeStatus.COMPLETED
except Exception as e:
result.error = str(e)
result.status = NodeStatus.FAILED
logger.exception(f"Research track '{track.get('title')}' failed")
finally:
result.completed_at = datetime.now()
return result
async def __call__(self, state: ReportState) -> ReportState:
state.current_node = "parallel_research"
state.log_node("parallel_research", NodeStatus.RUNNING)
tracks = state.decomposition.get("tracks", [])
if not tracks:
state.log_node("parallel_research", NodeStatus.FAILED, "no tracks")
state.error = "Decomposition produced no research tracks"
return state
semaphore = asyncio.Semaphore(self.MAX_CONCURRENT)
async def bounded(track):
async with semaphore:
return await self._run_one(track)
logger.info(f"[parallel_research] launching {len(tracks)} tracks concurrently")
results = await asyncio.gather(*[bounded(t) for t in tracks])
state.research_results = list(results)
succeeded = sum(1 for r in results if r.status == NodeStatus.COMPLETED)
domains = {}
for r in results:
domains.setdefault(r.domain.value, []).append(r.native_language)
state.log_node("parallel_research", NodeStatus.COMPLETED,
f"{succeeded}/{len(tracks)} ok, domains={domains}")
return state
# ---------------------------------------------------------------------------
# Node: write — synthesize research into primary-language draft
# ---------------------------------------------------------------------------
class WriteNode:
def __init__(self):
self.agent = WriterAgent()
async def __call__(self, state: ReportState) -> ReportState:
state.current_node = "write"
state.log_node("write", NodeStatus.RUNNING)
research_merged = []
for r in state.research_results:
if r.status == NodeStatus.COMPLETED:
research_merged.append({
"track": r.description,
"domain": r.domain.value,
"native_language": r.native_language,
"findings": r.content,
})
synthesis_guide = state.decomposition.get("synthesis_guide", "")
review_feedback = ""
if state.revision_count > 0 and state.review:
review_feedback = f"\n\n## Review feedback (revision {state.revision_count})\n"
for issue in state.review.get("issues", []):
review_feedback += f"- [{issue.get('severity')}] {issue.get('description')}{issue.get('suggestion')}\n"
result = await self.agent.run({
"requirement": state.requirement,
"research": {
"title_en": state.decomposition.get("title_en", ""),
"title_zh": state.decomposition.get("title_zh", ""),
"methodology": state.decomposition.get("methodology", ""),
"tracks": research_merged,
"synthesis_guide": synthesis_guide,
},
"revision_feedback": review_feedback,
})
state.draft = result.get("draft", {})
state.log_node("write", NodeStatus.COMPLETED)
return state
# ---------------------------------------------------------------------------
# Node: translate — produce the other language version
# ---------------------------------------------------------------------------
class TranslateNode:
"""Translates the draft into the other language version."""
def __init__(self):
self.agent = BaseAgent()
self.agent.name = "translator"
self.agent.model = settings.model_for_domain("translation")
async def __call__(self, state: ReportState) -> ReportState:
state.current_node = "translate"
state.log_node("translate", NodeStatus.RUNNING)
if not state.draft or "en" not in state.output_languages:
state.log_node("translate", NodeStatus.COMPLETED, "skipped")
return state
draft_json = json.dumps(state.draft, ensure_ascii=False, indent=2)
# Detect primary language of draft
title = state.draft.get("title", "")
is_chinese_primary = any('\u4e00' <= c <= '\u9fff' for c in title)
if is_chinese_primary:
target_lang = "English"
source_lang = "Chinese"
else:
target_lang = "Chinese (Simplified)"
source_lang = "English"
system = f"""\
You are a world-class {source_lang}{target_lang} translator specializing in
consulting and business reports.
Translation principles:
1. ACCURACY over fluency — every data point, percentage, and proper noun must be correct
2. Professional terminology — use standard {target_lang} business/industry terms
3. Preserve structure — keep the exact same JSON structure, only translate text values
4. Cultural adaptation — adjust phrasing for the target audience (not word-for-word)
5. Keep {{{{CHART:...}}}} and {{{{TABLE:...}}}} markers, translate their descriptions
Output the translated JSON with the exact same structure."""
prompt = f"""\
Translate this consulting report from {source_lang} to {target_lang}.
{draft_json}
Output the translated JSON."""
translated = await self.agent.call_llm_json(prompt, system=system, max_tokens=8192)
state.draft_translated = translated
state.log_node("translate", NodeStatus.COMPLETED,
f"{source_lang}{target_lang}")
return state
# ---------------------------------------------------------------------------
# Node: data — generate charts and tables
# ---------------------------------------------------------------------------
class DataNode:
def __init__(self):
self.agent = DataAgent()
async def __call__(self, state: ReportState) -> ReportState:
state.current_node = "data"
state.log_node("data", NodeStatus.RUNNING)
result = await self.agent.run({
"draft": state.draft,
"extra_data": state.extra_data,
})
state.data_assets = result.get("data_assets", {})
state.log_node("data", NodeStatus.COMPLETED)
return state
# ---------------------------------------------------------------------------
# Node: review — bilingual quality check
# ---------------------------------------------------------------------------
class ReviewNode:
def __init__(self):
self.agent = ReviewerAgent()
async def __call__(self, state: ReportState) -> ReportState:
state.current_node = "review"
state.log_node("review", NodeStatus.RUNNING)
result = await self.agent.run({
"draft": state.draft,
"draft_translated": state.draft_translated,
"research": state.decomposition,
})
state.review = result.get("review", {})
state.log_node("review", NodeStatus.COMPLETED,
f"verdict={state.review.get('verdict', '?')}")
return state
# ---------------------------------------------------------------------------
# Node: format — render bilingual output files
# ---------------------------------------------------------------------------
class FormatNode:
def __init__(self):
self.agent = FormatterAgent()
async def __call__(self, state: ReportState) -> ReportState:
state.current_node = "format"
state.log_node("format", NodeStatus.RUNNING)
all_files = []
# Primary version
result = await self.agent.run({
"draft": state.draft,
"data_assets": state.data_assets,
"output_dir": str(settings.output_dir / state.id / "primary"),
"output_formats": state.output_formats,
})
all_files.extend(result.get("generated_files", []))
# Translated version (if available)
if state.draft_translated:
result_tr = await self.agent.run({
"draft": state.draft_translated,
"data_assets": state.data_assets,
"output_dir": str(settings.output_dir / state.id / "translated"),
"output_formats": state.output_formats,
})
all_files.extend(result_tr.get("generated_files", []))
state.generated_files = all_files
state.log_node("format", NodeStatus.COMPLETED,
f"{len(all_files)} files")
return state

104
app/graph/state.py Normal file
View File

@@ -0,0 +1,104 @@
"""Report generation graph state — the shared context that flows through all nodes."""
from __future__ import annotations
import uuid
from datetime import datetime
from enum import Enum
from typing import Any
from pydantic import BaseModel, Field
class NodeStatus(str, Enum):
PENDING = "pending"
RUNNING = "running"
COMPLETED = "completed"
FAILED = "failed"
class ContentDomain(str, Enum):
"""Content domain — determines which model and language to use."""
GLOBAL = "global" # International markets, global trends → English-native
CHINA = "china" # Chinese market, domestic policy → Chinese-native
REASONING = "reasoning" # Synthesis, review, strategy → strongest reasoning
FAST = "fast" # Data processing, charts → cost-effective
TRANSLATION = "translation" # EN↔ZH translation
class SubtaskResult(BaseModel):
"""Result from a parallel research subtask."""
task_id: str = Field(default_factory=lambda: uuid.uuid4().hex[:8])
description: str = ""
domain: ContentDomain = ContentDomain.GLOBAL
native_language: str = "en" # "en" or "zh" — the original writing language
status: NodeStatus = NodeStatus.PENDING
content: dict[str, Any] = Field(default_factory=dict)
error: str | None = None
started_at: datetime | None = None
completed_at: datetime | None = None
@property
def duration_ms(self) -> int | None:
if self.started_at and self.completed_at:
return int((self.completed_at - self.started_at).total_seconds() * 1000)
return None
class ReportState(BaseModel):
"""Full state for a report generation run.
This is the single source of truth that all graph nodes read from and write to.
"""
# Identity
id: str = Field(default_factory=lambda: uuid.uuid4().hex[:12])
created_at: datetime = Field(default_factory=datetime.now)
# --- Input (set once at start) ---
requirement: str = ""
report_type: str = "行业分析报告"
extra_data: str = ""
output_formats: list[str] = Field(default=["docx"])
output_languages: list[str] = Field(default=["zh", "en"]) # produce both versions
template_name: str | None = None
client_id: str | None = None # for multi-tenant isolation
# --- Middleware injections ---
client_context: str = "" # injected by ClientContextMiddleware
memory_facts: list[str] = Field(default_factory=list) # injected by MemoryMiddleware
token_budget: int = 120000 # managed by TokenBudgetMiddleware
# --- Lead Agent output ---
decomposition: dict[str, Any] = Field(default_factory=dict)
# e.g. {"tracks": [{"title": "政策环境", "prompt": "研究..."}, ...]}
# --- Parallel research results ---
research_results: list[SubtaskResult] = Field(default_factory=list)
# --- Writer output ---
draft: dict[str, Any] = Field(default_factory=dict) # primary language version
draft_translated: dict[str, Any] = Field(default_factory=dict) # translated version
# --- Data Agent output ---
data_assets: dict[str, Any] = Field(default_factory=dict)
# --- Reviewer output ---
review: dict[str, Any] = Field(default_factory=dict)
revision_count: int = 0
max_revisions: int = 2
# --- Formatter output ---
generated_files: list[str] = Field(default_factory=list)
# --- Execution tracking ---
current_node: str = ""
node_history: list[dict[str, Any]] = Field(default_factory=list)
error: str | None = None
def log_node(self, node_name: str, status: NodeStatus, detail: str = ""):
self.node_history.append({
"node": node_name,
"status": status.value,
"detail": detail,
"timestamp": datetime.now().isoformat(),
})

48
app/main.py Normal file
View File

@@ -0,0 +1,48 @@
"""FastAPI application entry point."""
import logging
from fastapi import FastAPI
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles
from app.api.routes import router
from app.config import settings
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
)
app = FastAPI(
title="咨询报告 AI 生成系统",
version="0.1.0",
description="Multi-agent pipeline for generating consulting reports",
)
app.include_router(router)
# Serve generated files
settings.output_dir.mkdir(parents=True, exist_ok=True)
app.mount("/files", StaticFiles(directory=str(settings.output_dir)), name="files")
@app.get("/")
async def root():
return {
"name": "咨询报告 AI 生成系统",
"version": "0.1.0",
"status": "running",
"docs": "/docs",
}
if __name__ == "__main__":
import uvicorn
uvicorn.run(
"app.main:app",
host=settings.host,
port=settings.port,
reload=True,
)

0
app/memory/__init__.py Normal file
View File

114
app/memory/store.py Normal file
View File

@@ -0,0 +1,114 @@
"""File-based memory store — persistent facts with confidence ranking."""
from __future__ import annotations
import json
import logging
import uuid
from datetime import datetime
from pathlib import Path
from typing import Any
from pydantic import BaseModel, Field
logger = logging.getLogger(__name__)
MEMORY_DIR = Path(__file__).resolve().parent.parent.parent / "memory"
class Fact(BaseModel):
id: str = Field(default_factory=lambda: uuid.uuid4().hex[:8])
content: str
category: str = "context" # preference | knowledge | context | behavior | goal
confidence: float = 0.7
source: str = "" # which report/session created this
created_at: str = Field(default_factory=lambda: datetime.now().isoformat())
class MemoryFile(BaseModel):
"""One memory file per client (or global)."""
client_id: str = "global"
preferences: dict[str, str] = Field(default_factory=dict)
facts: list[Fact] = Field(default_factory=list)
class MemoryStore:
"""Read/write persistent memory as JSON files.
Storage layout:
memory/
├── global.json — system-wide facts
└── client_<id>.json — per-client facts
"""
def __init__(self):
MEMORY_DIR.mkdir(parents=True, exist_ok=True)
def _path(self, client_id: str = "global") -> Path:
safe_name = client_id.replace("/", "_").replace("..", "_")
return MEMORY_DIR / f"{safe_name}.json"
def load(self, client_id: str = "global") -> MemoryFile:
path = self._path(client_id)
if not path.exists():
return MemoryFile(client_id=client_id)
try:
data = json.loads(path.read_text(encoding="utf-8"))
return MemoryFile(**data)
except Exception as e:
logger.warning(f"[memory] failed to load {path}: {e}")
return MemoryFile(client_id=client_id)
def save(self, mem: MemoryFile):
path = self._path(mem.client_id)
path.write_text(
json.dumps(mem.model_dump(), ensure_ascii=False, indent=2),
encoding="utf-8",
)
logger.info(f"[memory] saved {len(mem.facts)} facts to {path}")
def add_fact(
self,
content: str,
client_id: str = "global",
category: str = "context",
confidence: float = 0.7,
source: str = "",
) -> Fact:
mem = self.load(client_id)
# Deduplicate by content (normalized)
normalized = content.strip().lower()
for existing in mem.facts:
if existing.content.strip().lower() == normalized:
# Update confidence if higher
if confidence > existing.confidence:
existing.confidence = confidence
self.save(mem)
return existing
fact = Fact(
content=content,
category=category,
confidence=confidence,
source=source,
)
mem.facts.append(fact)
self.save(mem)
return fact
def get_top_facts(
self, client_id: str = "global", limit: int = 15
) -> list[str]:
"""Get top N facts sorted by confidence, formatted for prompt injection."""
mem = self.load(client_id)
sorted_facts = sorted(mem.facts, key=lambda f: f.confidence, reverse=True)
return [f.content for f in sorted_facts[:limit]]
def set_preference(self, key: str, value: str, client_id: str = "global"):
mem = self.load(client_id)
mem.preferences[key] = value
self.save(mem)
def get_preferences(self, client_id: str = "global") -> dict[str, str]:
return self.load(client_id).preferences

View File

22
app/middleware/base.py Normal file
View File

@@ -0,0 +1,22 @@
"""Middleware base class — before/after hooks around graph execution."""
from __future__ import annotations
from abc import ABC, abstractmethod
from app.graph.state import ReportState
class Middleware(ABC):
"""Base middleware. Subclasses override before() and/or after()."""
name: str = "base"
enabled: bool = True
async def before(self, state: ReportState) -> ReportState:
"""Called before graph execution. Modify state as needed."""
return state
async def after(self, state: ReportState) -> ReportState:
"""Called after graph execution. Modify state as needed."""
return state

35
app/middleware/chain.py Normal file
View File

@@ -0,0 +1,35 @@
"""Middleware chain — runs ordered middlewares before/after graph execution."""
from __future__ import annotations
import logging
from app.graph.state import ReportState
from .base import Middleware
logger = logging.getLogger(__name__)
class MiddlewareChain:
"""Ordered list of middlewares. before() runs in order, after() runs in reverse."""
def __init__(self, middlewares: list[Middleware] | None = None):
self.middlewares = middlewares or []
def add(self, mw: Middleware) -> "MiddlewareChain":
self.middlewares.append(mw)
return self
async def before(self, state: ReportState) -> ReportState:
for mw in self.middlewares:
if mw.enabled:
logger.debug(f"[middleware:before] {mw.name}")
state = await mw.before(state)
return state
async def after(self, state: ReportState) -> ReportState:
for mw in reversed(self.middlewares):
if mw.enabled:
logger.debug(f"[middleware:after] {mw.name}")
state = await mw.after(state)
return state

View File

@@ -0,0 +1,56 @@
"""ClientContext middleware — injects client/project background into state."""
from __future__ import annotations
import json
import logging
from pathlib import Path
from app.graph.state import ReportState
from .base import Middleware
logger = logging.getLogger(__name__)
# Client profiles stored as JSON files
CLIENTS_DIR = Path(__file__).resolve().parent.parent.parent / "clients"
class ClientContextMiddleware(Middleware):
"""Loads client profile and injects background context into state.
Client profiles are JSON files in the `clients/` directory:
clients/<client_id>.json
{
"name": "某咨询公司",
"industry": "金融",
"preferences": "偏好简洁的执行摘要,数据驱动",
"previous_reports": ["report_001", "report_002"]
}
"""
name = "client_context"
async def before(self, state: ReportState) -> ReportState:
if not state.client_id:
return state
profile_path = CLIENTS_DIR / f"{state.client_id}.json"
if not profile_path.exists():
logger.info(f"[client_context] no profile for client '{state.client_id}'")
return state
try:
profile = json.loads(profile_path.read_text(encoding="utf-8"))
parts = []
if name := profile.get("name"):
parts.append(f"客户:{name}")
if industry := profile.get("industry"):
parts.append(f"行业:{industry}")
if prefs := profile.get("preferences"):
parts.append(f"偏好:{prefs}")
state.client_context = "".join(parts)
logger.info(f"[client_context] loaded profile for '{state.client_id}'")
except Exception as e:
logger.warning(f"[client_context] failed to load profile: {e}")
return state

View File

@@ -0,0 +1,61 @@
"""Compliance middleware — checks for sensitive data leakage in output."""
from __future__ import annotations
import json
import logging
import re
from app.graph.state import ReportState
from .base import Middleware
logger = logging.getLogger(__name__)
# Patterns that suggest sensitive data
SENSITIVE_PATTERNS = [
(r"\b\d{15,18}\b", "可能的身份证号"),
(r"\b\d{16,19}\b", "可能的银行卡号"),
(r"\b1[3-9]\d{9}\b", "可能的手机号"),
(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "邮箱地址"),
(r"(?:密码|password|secret|token|api.?key)\s*[:=]\s*\S+", "可能的凭证信息"),
]
class ComplianceMiddleware(Middleware):
"""Scans report output for sensitive data patterns.
After graph execution, scans the draft for PII / credential patterns.
Logs warnings but does not block (can be made blocking later).
"""
name = "compliance"
def _scan_text(self, text: str) -> list[dict]:
findings = []
for pattern, desc in SENSITIVE_PATTERNS:
matches = re.findall(pattern, text)
if matches:
findings.append({
"type": desc,
"count": len(matches),
"samples": [m[:8] + "..." for m in matches[:3]],
})
return findings
async def after(self, state: ReportState) -> ReportState:
# Scan draft content
all_text = json.dumps(state.draft, ensure_ascii=False)
findings = self._scan_text(all_text)
if findings:
logger.warning(
f"[compliance] found {len(findings)} sensitive data patterns:"
)
for f in findings:
logger.warning(f" - {f['type']}: {f['count']} occurrences")
# Store in state for API to surface
state.review.setdefault("compliance_warnings", findings)
else:
logger.info("[compliance] no sensitive data patterns detected")
return state

64
app/middleware/memory.py Normal file
View File

@@ -0,0 +1,64 @@
"""Memory middleware — injects persistent facts into state, saves new facts after."""
from __future__ import annotations
import logging
from app.graph.state import ReportState
from app.memory.store import MemoryStore
from .base import Middleware
logger = logging.getLogger(__name__)
class MemoryMiddleware(Middleware):
"""Injects top-N memory facts into state before execution.
After execution, extracts key facts from the report for future reference.
"""
name = "memory"
def __init__(self):
self.store = MemoryStore()
async def before(self, state: ReportState) -> ReportState:
# Load global facts + client-specific facts
facts = self.store.get_top_facts("global", limit=10)
if state.client_id:
client_facts = self.store.get_top_facts(state.client_id, limit=10)
facts = client_facts + facts # client facts first
state.memory_facts = facts[:15] # cap total
if facts:
logger.info(f"[memory] injected {len(state.memory_facts)} facts")
return state
async def after(self, state: ReportState) -> ReportState:
# Save key metadata as facts for future reference
if state.draft and not state.error:
title = state.draft.get("title", "")
report_type = state.report_type
client_id = state.client_id or "global"
self.store.add_fact(
content=f"生成过报告:{title}(类型:{report_type}",
client_id=client_id,
category="context",
confidence=0.9,
source=state.id,
)
# Save decomposition tracks as knowledge
tracks = state.decomposition.get("tracks", [])
if tracks:
track_titles = "".join(t.get("title", "") for t in tracks)
self.store.add_fact(
content=f"报告《{title}》的研究轨道:{track_titles}",
client_id=client_id,
category="knowledge",
confidence=0.6,
source=state.id,
)
return state

View File

@@ -0,0 +1,46 @@
"""TokenBudget middleware — tracks and limits token usage across the pipeline."""
from __future__ import annotations
import logging
from app.graph.state import ReportState
from .base import Middleware
logger = logging.getLogger(__name__)
class TokenBudgetMiddleware(Middleware):
"""Manages token budget to prevent context overflow.
Before: sets token budget based on model limits.
After: logs total estimated token usage.
"""
name = "token_budget"
# Rough char-to-token ratio for Chinese text
CHARS_PER_TOKEN = 1.5
def _estimate_tokens(self, text: str) -> int:
return int(len(text) / self.CHARS_PER_TOKEN)
async def before(self, state: ReportState) -> ReportState:
# Default budget is already set in ReportState (120k)
# Could adjust based on model selection
logger.info(f"[token_budget] budget = {state.token_budget} tokens")
return state
async def after(self, state: ReportState) -> ReportState:
# Estimate total tokens used in the draft
total_chars = 0
for chapter in state.draft.get("chapters", []):
total_chars += len(chapter.get("content", ""))
total_chars += len(state.draft.get("executive_summary", ""))
estimated = self._estimate_tokens(str(total_chars))
logger.info(
f"[token_budget] estimated output tokens: {estimated}, "
f"budget: {state.token_budget}"
)
return state

0
app/output/__init__.py Normal file
View File

0
app/pipeline/__init__.py Normal file
View File

View File

@@ -0,0 +1,43 @@
"""Pipeline orchestrator — builds the graph + middleware chain and executes."""
from __future__ import annotations
import logging
from app.graph.builder import ReportGraph
from app.graph.state import ReportState
from app.middleware.chain import MiddlewareChain
from app.middleware.client_context import ClientContextMiddleware
from app.middleware.token_budget import TokenBudgetMiddleware
from app.middleware.compliance import ComplianceMiddleware
from app.middleware.memory import MemoryMiddleware
logger = logging.getLogger(__name__)
def build_middleware() -> MiddlewareChain:
"""Assemble the middleware chain in execution order."""
chain = MiddlewareChain()
chain.add(ClientContextMiddleware()) # 1. load client profile
chain.add(MemoryMiddleware()) # 2. inject memory facts
chain.add(TokenBudgetMiddleware()) # 3. set token budget
chain.add(ComplianceMiddleware()) # 4. scan output for PII (after only)
return chain
class PipelineOrchestrator:
"""Top-level entry: creates graph + middleware, runs end-to-end."""
def __init__(self):
self.middleware = build_middleware()
self.graph = ReportGraph(middleware=self.middleware)
async def run(self, state: ReportState) -> ReportState:
logger.info(f"[orchestrator] starting report {state.id}")
state = await self.graph.run(state)
logger.info(
f"[orchestrator] finished report {state.id}"
f"status={'OK' if not state.error else 'FAILED'}, "
f"files={len(state.generated_files)}"
)
return state

22
app/pipeline/task.py Normal file
View File

@@ -0,0 +1,22 @@
"""Pipeline task model — thin wrapper, main state is now ReportState."""
from __future__ import annotations
from app.graph.state import ReportState
def create_report_state(
requirement: str,
report_type: str = "行业分析报告",
extra_data: str = "",
output_formats: list[str] | None = None,
client_id: str | None = None,
) -> ReportState:
"""Create a ReportState from user input."""
return ReportState(
requirement=requirement,
report_type=report_type,
extra_data=extra_data,
output_formats=output_formats or ["docx"],
client_id=client_id,
)

View File

456
data-sources-research.md Normal file
View File

@@ -0,0 +1,456 @@
# 咨询报告生成系统 - 数据源调研报告
> 调研日期: 2026-03-27
> 目的: 为咨询报告自动生成系统选择可用的行业数据 API 和来源
---
## 一、免费/开放政府数据 API
### 1.1 国家统计局 (data.stats.gov.cn)
| 项目 | 详情 |
|------|------|
| **费用** | 完全免费 |
| **API 可用性** | 有非官方但稳定的 HTTP API |
| **API 地址** | `https://data.stats.gov.cn/easyquery.htm` |
| **请求方式** | POST / GET |
| **数据格式** | JSON |
| **数据覆盖** | 年度/季度/月度宏观经济数据,覆盖 GDP、CPI、PPI、工业产值、固定资产投资、社会消费品零售、进出口、人口、就业等全品类 |
| **更新频率** | 月度/季度/年度同步发布 |
**API 关键参数:**
```
m=getTree / QueryData
dbcode=hgnd(年度) / hgyd(月度) / hgjd(季度)
rowcode=zb (指标)
colcode=sj (时间) / reg (地区)
wds=[] (查询条件 JSON)
dfwds=[{"wdcode":"zb","valuecode":"A0101"}] (数据字段)
```
**咨询报告实用度: ★★★★★**
宏观经济数据的首选来源,权威且免费。缺点是没有正式 API 文档,接口可能变动。可通过 AKShare 库间接调用。
---
### 1.2 商务部公共服务资源平台
| 项目 | 详情 |
|------|------|
| **费用** | 免费 |
| **网址** | `http://opendata.mofcom.gov.cn/front/data` |
| **数据覆盖** | 对外贸易、外商投资、国内贸易、商务预报、国别报告 |
| **API 可用性** | 有开放数据接口,注册后可调用 |
**咨询报告实用度: ★★★★**
外贸和商业分析类报告必备数据源。
---
### 1.3 中经数据 (国家信息中心)
| 项目 | 详情 |
|------|------|
| **费用** | 数据流量付费模式(按调用量计费),价格未公开 |
| **网址** | `https://ceidata.cei.cn/` |
| **API 可用性** | 有正式 REST API支持 Java / .NET / Python |
| **数据覆盖** | 全国 + 31省 + 330+城市 + 2800+县 + 200+国家/地区的宏观经济时序数据 |
| **更新频率** | 年/季/月/周/日多频率 |
| **数据来源** | 国家部委、地方政府、国际组织的权威统计数据 |
**咨询报告实用度: ★★★★★**
数据粒度和权威性极高,从国家到县级全覆盖。适合需要区域经济分析的咨询报告。价格需联系获取。
---
### 1.4 地方政府开放数据平台
| 平台 | 网址 | 数据规模 |
|------|------|----------|
| **上海市** | `https://data.sh.gov.cn/` | 45个部门、2101个数据集、646个数据接口 |
| **深圳市** | `https://opendata.sz.gov.cn` | 114家单位、11,150个数据集、10,971个接口、59.86亿条数据 |
| **北京市** | `https://data.beijing.gov.cn/` | 多部门、多品类 |
| **全国 50+ 地市** | 各自平台 | 各有 API 接口,注册即可免费调用 |
**咨询报告实用度: ★★★**
做区域性、城市级咨询报告时有用,但各平台接口标准不统一,集成成本较高。
---
### 1.5 海关总署统计数据查询平台
| 项目 | 详情 |
|------|------|
| **费用** | 免费查询(有限制) |
| **网址** | `http://stats.customs.gov.cn/` |
| **数据覆盖** | 按 HS 编码、进出口收发货人、贸易伙伴、贸易方式等多维组合的进出口统计 |
| **API 可用性** | 官方无正式 API可通过网页接口解析获取 |
**第三方商业 API:** 腾道 (tendata.cn) 等提供海关数据 API支持按产品/HS编码/国家/时间过滤。
**咨询报告实用度: ★★★★**
进出口贸易分析类报告必用。官方免费版数据维度有限,深度分析需采购第三方。
---
## 二、国内商业金融/行业数据 API
### 2.1 Wind 万得
| 项目 | 详情 |
|------|------|
| **费用** | **金融终端:** 39,800元/年/席位(单买),批量采购可降至 24,540元/年 |
| | **经济数据库:** 34,600元/年/席位 |
| | **机构数据接口:** 5-20万/年(单用户),批量可降至 2-8万/年 |
| **API 格式** | Client API (C++/C#/Java/Python),需安装 Wind 终端 |
| **数据覆盖** | 全市场金融数据、宏观经济、行业数据、公司财务、ESG、资讯舆情、专题特色数据 |
| **数据权威性** | 中国金融数据的"黄金标准",覆盖最全 |
**咨询报告实用度: ★★★★★**
如果预算允许这是最全面的中国金融和行业数据源。但价格昂贵适合机构级使用。API 必须绑定 Wind 终端,无法纯云端调用。
---
### 2.2 同花顺 iFinD
| 项目 | 详情 |
|------|------|
| **费用** | 8,800 - 28,000元/年/席位(远低于 Wind |
| **API 格式** | Python/Java/C++ 等语言 SDK |
| **数据覆盖** | 股票、债券、外汇、期货、基金、REITs、宏观经济、企业数据库、研究报告 |
| **适用对象** | 机构投资者和专业用户,需企业级账号 |
**咨询报告实用度: ★★★★**
Wind 的性价比替代品。数据覆盖广泛,价格约为 Wind 的 1/3 到 1/2。
---
### 2.3 东方财富 Choice
| 项目 | 详情 |
|------|------|
| **费用** | 官方定价 38,000元/年,**推广价 5,800元/年** |
| **API 格式** | 函数调用方式,支持 Matlab/C++/C#/R/Python (Win/Linux/Mac) |
| **数据覆盖** | 基本面、财务数据、行情数据、宏观经济 |
| **API 文档** | `https://quantapi.eastmoney.com/Manual` |
**咨询报告实用度: ★★★★**
推广价 5,800元/年是三大金融终端中最便宜的,跨平台支持好。适合中小团队。
---
### 2.4 天眼查 开放平台
| 项目 | 详情 |
|------|------|
| **费用** | 按次调用 + 套餐两种模式,具体价格需登录平台查看(典型范围: 0.1-2元/次,视接口而定) |
| **API 地址** | `https://open.tianyancha.com/` |
| **数据覆盖** | 企业基本信息、股东/股权、财务报表、法律诉讼、知识产权、经营异常、招投标等 |
| **认证方式** | Token + RESTful API |
| **免费额度** | 注册后有少量免费试用额度 |
**咨询报告实用度: ★★★★★**
企业尽职调查、竞争格局分析类报告的核心数据源。覆盖全国企业工商信息。
---
### 2.5 企查查 开放平台
| 项目 | 详情 |
|------|------|
| **费用** | 按次计费(价格未公开,需联系销售),新用户有 20次免费测试 |
| **API 地址** | `https://openapi.qcc.com/` |
| **数据覆盖** | 企业高级搜索、工商详情、专利查询、商标查询、经营风险 |
| **计费方式** | 固定企业列表+按年计费(每企业每周期只收一次)或按次 |
**咨询报告实用度: ★★★★**
与天眼查功能类似,二选一即可。企查查在某些企业关联数据上更全。
---
### 2.6 巨潮资讯 (cninfo.com.cn)
| 项目 | 详情 |
|------|------|
| **费用** | 注册后 1000次免费调用深证信平台有更多接口 |
| **API 地址** | `http://webapi.cninfo.com.cn/` |
| **公告查询** | `http://www.cninfo.com.cn/new/hisAnnouncement/query` |
| **数据覆盖** | 上市公司公告全文、财务数据、基金、债券 |
| **数据格式** | JSON |
**咨询报告实用度: ★★★★**
上市公司分析的权威一手来源。公告全文可用于 LLM 提取和分析。1000次免费额度足够初期使用。
---
## 三、免费开源 Python 数据库
### 3.1 AKShare (强烈推荐)
| 项目 | 详情 |
|------|------|
| **费用** | **完全免费开源** |
| **安装** | `pip install akshare` |
| **GitHub** | `https://github.com/akfamily/akshare` |
| **文档** | `https://akshare.akfamily.xyz/` |
| **数据覆盖** | **30+ 类金融产品**: A股/港股/美股行情、期货、期权、基金、债券、外汇、加密货币、**宏观经济指标**、**行业数据**、新闻舆情 |
| **数据格式** | Pandas DataFrame |
| **更新频率** | 持续更新(当前版本 1.18.47 |
**核心宏观经济数据接口示例:**
```python
import akshare as ak
# GDP 数据
gdp = ak.macro_china_gdp()
# CPI 数据
cpi = ak.macro_china_cpi_monthly()
# PMI 数据
pmi = ak.macro_china_pmi()
# 行业利润数据
profit = ak.macro_china_industrial_profit()
# 中国海关进出口
trade = ak.macro_china_trade_balance()
```
**咨询报告实用度: ★★★★★**
**咨询报告系统的首选数据获取层。** 免费、覆盖广、接口统一、返回 DataFrame 直接可分析。作为国家统计局等公开数据的统一封装层,可替代大部分付费数据源的基础数据需求。
---
### 3.2 Tushare Pro
| 项目 | 详情 |
|------|------|
| **费用** | 基础免费(积分制),高级接口需积分(约 500元一次性购买可获足够积分 |
| **数据覆盖** | A 股行情、财务数据、基金、期货、宏观经济 |
| **注意** | 2025年9月后部分接口调整积分获取难度增加 |
**咨询报告实用度: ★★★**
AKShare 的替代方案,但积分制有限制。推荐优先使用 AKShare。
---
### 3.3 Baostock
| 项目 | 详情 |
|------|------|
| **费用** | 完全免费 |
| **数据覆盖** | A 股历史行情、财务数据(较基础) |
**咨询报告实用度: ★★**
数据覆盖面较窄,仅 A 股基础数据。不推荐作为主力数据源。
---
## 四、国际数据源 API
### 4.1 World Bank Open Data API
| 项目 | 详情 |
|------|------|
| **费用** | **完全免费** |
| **API 地址** | `https://api.worldbank.org/v2/` |
| **数据格式** | JSON / XML |
| **数据覆盖** | 1,600+ 指标、217 个经济体、60+ 年历史数据 |
| **中国数据** | GDP、人口、通胀、贸易、教育、卫生、环境等全方位 |
| **更新频率** | 定期更新(最新至 2024年 |
**已验证可用的示例调用:**
```
GET https://api.worldbank.org/v2/country/CHN/indicator/NY.GDP.MKTP.CD?format=json&per_page=5
```
返回中国 GDP 时序数据2024年: $18.74 万亿)。
**咨询报告实用度: ★★★★★**
国际对比和宏观经济分析的标准数据源。免费、文档完善、数据权威。
---
### 4.2 IMF Data API
| 项目 | 详情 |
|------|------|
| **费用** | **完全免费** |
| **API 地址** | `https://dataservices.imf.org/REST/SDMX_JSON.svc/` |
| **数据覆盖** | International Financial Statistics (IFS)、Balance of Payments、Government Finance、Direction of Trade |
| **WEO 数据库** | 半年更新4月/10月含未来 2年预测 |
**咨询报告实用度: ★★★★**
宏观经济和国际金融分析。与 World Bank 互补。
---
### 4.3 Statista API
| 项目 | 详情 |
|------|------|
| **费用** | 基础免费;**Premium: $199-$1,299/月(年付)**API 接入需企业级合同(价格需议) |
| **API 地址** | `https://www.statista.com/api/v2/doc/` |
| **数据覆盖** | 170+ 行业、150+ 国家、300万+ 统计数据 |
| **数据特点** | 图表友好型统计数据、行业报告、消费者调查 |
**咨询报告实用度: ★★★★**
行业分析图表和统计数据的优质来源。API 价格不透明,需要商务洽谈。免费版可获取有限数据。
---
### 4.4 CB Insights
| 项目 | 详情 |
|------|------|
| **费用** | **$50,000 - $265,000+/年**(无公开定价,需 demo |
| **API 可用性** | **无 API** |
| **数据覆盖** | VC/PE 投融资、科技行业趋势、市场规模预测 |
**咨询报告实用度: ★★★**
内容优质但极其昂贵且无 API。不适合自动化集成只能手动引用报告。
---
### 4.5 PitchBook
| 项目 | 详情 |
|------|------|
| **费用** | $12,000+/年(单用户起步),大型机构 $50,000+/月 |
| **API 可用性** | **有 API**,可提取 VC/PE/M&A 数据 |
| **数据覆盖** | 全球私募市场、并购、风投交易数据 |
**咨询报告实用度: ★★★**
投融资分析类报告有用,但价格高。有 API 这一点优于 CB Insights。
---
## 五、专业/垂直数据源
### 5.1 专利数据库 (知识产权数据)
| 平台 | 费用 | API | 特点 |
|------|------|-----|------|
| **CNIPA (国家知识产权局)** | 免费查询 | 有数据发布页,无正式 REST API | 官方权威,专利公告/统计 |
| **CNIPR (知识产权出版社)** | 注册免费 + 增值付费 | `https://open.cnipr.com/` REST API | 专利检索、查询、统计、分析 |
| **佰腾 (Baiten)** | 按次付费 | `https://open.baiten.cn/` | 法律状态、引用数据 |
| **专利汇 (PatentHub)** | 免费+付费 | `https://www.patenthub.cn/api/` | 基本信息、权利要求、全文、引用、相似专利 |
| **incoPat** | 商业授权 | 有 API | 全球专利数据库,分析功能强 |
| **天眼查专利模块** | 集成在天眼查 API 中 | 同天眼查 | 企业专利关联查询 |
**咨询报告实用度: ★★★★**
技术行业分析、竞争格局分析的重要数据维度。推荐 CNIPR 或 PatentHub。
---
### 5.2 海关/贸易数据
| 平台 | 费用 | 特点 |
|------|------|------|
| **海关总署官方** | 免费 | `stats.customs.gov.cn`,查询维度有限 |
| **商务部数据中心** | 免费 | `data.mofcom.gov.cn`,进出口国别数据 |
| **腾道 (Tendata)** | 商业付费 | 海关数据 API支持按 HS 编码/产品/国家过滤 |
| **UN Comtrade** | 免费 API | `https://comtrade.un.org/data/`,联合国全球贸易数据库 |
**咨询报告实用度: ★★★★**
外贸和产业链分析报告核心数据源。
---
### 5.3 行业协会报告
| 来源 | 特点 | 费用 |
|------|------|------|
| **艾瑞咨询 (iResearch)** | 互联网/科技行业研究报告 | 部分免费,深度报告付费 |
| **易观分析 (Analysys)** | 数字经济行业数据和报告 | 部分免费,会员制 |
| **前瞻产业研究院** | 全行业覆盖的研究报告 | 单报告 ¥2,000-10,000+ |
| **头豹研究院** | 新兴行业深度分析 | 会员制 |
| **智研咨询** | 传统行业研究报告 | 单报告付费 |
| **中国信通院 (CAICT)** | ICT 行业权威白皮书 | 大部分免费 |
**注意:** 这些来源通常无 API需要手动获取 PDF/网页报告,然后由 LLM 提取和结构化。
---
## 六、推荐方案 (成本优先)
### 第一梯队: 免费核心数据层 (零成本)
| 用途 | 推荐工具 | 说明 |
|------|----------|------|
| 宏观经济数据 | **AKShare** (封装国家统计局等) | 一行代码获取 GDP/CPI/PMI/行业利润 |
| 国际对比数据 | **World Bank API** | 免费、文档完善、217经济体 |
| 国际金融数据 | **IMF API** | 免费WEO 预测数据 |
| 上市公司公告 | **巨潮资讯 API** | 1000次免费公告全文 |
| 全球贸易数据 | **UN Comtrade API** | 免费,全球进出口 |
**年成本: 0 元**
可覆盖: 宏观经济分析、行业趋势、国际对比、上市公司基本面
---
### 第二梯队: 低成本增强层
| 用途 | 推荐工具 | 年费 |
|------|----------|------|
| 金融终端数据 | **东方财富 Choice** | ~5,800元/年(推广价) |
| 企业信息 | **天眼查 API** | 按需充值,预估 2,000-10,000元/年 |
| 行业报告 | **Statista 基础版** | ~$199/月 = ~17,000元/年 |
**年成本: ~25,000-33,000 元**
新增覆盖: 深度财务数据、企业尽调、国际行业统计
---
### 第三梯队: 专业级全覆盖
| 用途 | 推荐工具 | 年费 |
|------|----------|------|
| 金融数据终端 | **Wind 万得****iFinD** | 10,000-40,000元/年 |
| 专利分析 | **CNIPR / PatentHub** | 按需 |
| 海关详细数据 | **腾道** | 商业议价 |
**年成本: 50,000+ 元**
适合: 专业咨询公司级别使用
---
## 七、技术集成建议
对于咨询报告自动生成系统,推荐的数据获取架构:
```
┌─────────────────────────────────────────┐
│ 数据获取调度层 │
│ (统一接口,缓存,频率控制,错误重试) │
├─────────────┬───────────┬───────────────┤
│ AKShare │ World Bank│ 巨潮资讯 │ ← 免费层
│ (宏观+行业)│ (国际对比) │ (上市公司) │
├─────────────┼───────────┼───────────────┤
│ Choice API │ 天眼查API │ Statista │ ← 付费层(按需)
│ (金融数据) │ (企业数据) │ (行业统计) │
├─────────────┼───────────┼───────────────┤
│ CNIPR │ 腾道 │ Wind/iFinD │ ← 专业层(高预算)
│ (专利) │ (海关) │ (全数据) │
└─────────────┴───────────┴───────────────┘
┌─────────────────────────────────────────┐
│ 数据标准化 + 缓存层 │
│ (DataFrame → 统一格式 → 本地缓存) │
└─────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│ LLM 报告生成层 │
│ (数据注入 → Prompt → 报告章节生成) │
└─────────────────────────────────────────┘
```
### 关键实现要点:
1. **AKShare 优先**: 凡是 AKShare 能获取的数据,就不调付费接口
2. **本地缓存**: 宏观数据月度更新,不需要每次实时拉取,缓存 30 天
3. **降级策略**: 付费接口不可用时,自动降级到免费数据源
4. **频率控制**: 国家统计局等公开接口需控制请求频率(建议 1-2秒/次)
5. **数据标准化**: 不同来源数据统一为 DataFrame + 元数据(来源、时间、单位)格式

22
memory/global.json Normal file
View File

@@ -0,0 +1,22 @@
{
"client_id": "global",
"preferences": {},
"facts": [
{
"id": "4332883e",
"content": "生成过报告2025年全球半导体行业展望供应链重构、中国自主化进程、AI芯片竞争格局及投资策略类型行业分析报告",
"category": "context",
"confidence": 0.9,
"source": "5cd9c02f1dc7",
"created_at": "2026-03-28T00:39:01.304322"
},
{
"id": "b5d69b81",
"content": "报告《2025年全球半导体行业展望供应链重构、中国自主化进程、AI芯片竞争格局及投资策略》的研究轨道Global Semiconductor Supply Chain Restructuring: US-Japan-EU-China Strategic Positioning、中国半导体自主化进展与瓶颈分析、Global AI Chip Competitive Landscape and Technology Roadmap、半导体产业投资策略与机会分析",
"category": "knowledge",
"confidence": 0.6,
"source": "5cd9c02f1dc7",
"created_at": "2026-03-28T00:39:01.305048"
}
]
}

11
project.json Normal file
View File

@@ -0,0 +1,11 @@
{
"name": "咨询报告 AI 生成系统",
"slug": "consulting-report-gen",
"status": "planning",
"created": "2026-03-27",
"category": "business",
"ports": {},
"description": "借鉴 Open SWE 多 Agent 架构,为咨询公司构建安全可控的行业报告自动生成系统",
"tech": ["python", "docx", "pptx", "xlsx", "pdf"],
"notes": "数据安全为第一优先级,不允许客户数据外泄"
}

30
pyproject.toml Normal file
View File

@@ -0,0 +1,30 @@
[project]
name = "consulting-report-gen"
version = "0.1.0"
description = "AI-powered consulting report generation system"
requires-python = ">=3.11"
dependencies = [
"fastapi>=0.115",
"uvicorn[standard]>=0.34",
"litellm>=1.60",
"pydantic>=2.0",
"pydantic-settings>=2.0",
"python-docx>=1.1",
"python-pptx>=1.0",
"openpyxl>=3.1",
"fpdf2>=2.8",
"python-multipart>=0.0.18",
"aiofiles>=24.1",
"jinja2>=3.1",
]
[project.optional-dependencies]
dev = [
"pytest>=8.0",
"pytest-asyncio>=0.24",
"httpx>=0.28",
]
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

64
test_pipeline.py Normal file
View File

@@ -0,0 +1,64 @@
"""Quick test — full domain-aware bilingual pipeline."""
import asyncio
import logging
from app.graph.state import ReportState
from app.pipeline.orchestrator import PipelineOrchestrator
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
)
async def main():
state = ReportState(
requirement="分析全球半导体行业2025年发展趋势重点关注1全球供应链重构美日欧中各自布局2中国半导体自主化进展与瓶颈3AI芯片竞争格局4投资建议",
report_type="行业分析报告",
output_formats=["docx"],
output_languages=["zh", "en"], # bilingual output
)
print(f"[test] Task ID: {state.id}")
print(f"[test] Requirement: {state.requirement[:80]}...")
print(f"[test] Languages: {state.output_languages}")
print()
orchestrator = PipelineOrchestrator()
state = await orchestrator.run(state)
print()
print(f"[test] Final node: {state.current_node}")
print(f"[test] Error: {state.error or 'None'}")
print(f"[test] Revisions: {state.revision_count}")
print()
# Execution trace
print("[test] Execution trace:")
for entry in state.node_history:
print(f" {entry['timestamp'][:19]} | {entry['node']:20s} | {entry['status']:10s} | {entry.get('detail', '')}")
# Parallel research — show domain/language/model allocation
if state.research_results:
print()
print(f"[test] Research tracks: {len(state.research_results)}")
for r in state.research_results:
ms = f" ({r.duration_ms}ms)" if r.duration_ms else ""
print(f" [{r.status.value:9s}] [{r.domain.value:6s}] [{r.native_language}] {r.description}{ms}")
# Output files
if state.generated_files:
print()
print(f"[test] Generated files ({len(state.generated_files)}):")
for f in state.generated_files:
print(f"{f}")
# Review verdict
if state.review:
print()
print(f"[test] Review: score={state.review.get('overall_score')}, verdict={state.review.get('verdict')}")
if __name__ == "__main__":
asyncio.run(main())

0
tests/__init__.py Normal file
View File