init repo
274
代码实现/README.md
Normal file
@@ -0,0 +1,274 @@
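# Intelligent Search System - Simple, Practical Edition

An intelligent search system built on RSS feeds and APIs. It retrieves authoritative information across 8 industries and generates documents from the results automatically.

## 🌟 Core Features

- **English-first search**: searches in English by default and switches to Chinese automatically when the query contains Chinese keywords
- **8 industries covered**: finance, AI/software, manufacturing, healthcare and pharma, FMCG, retail and e-commerce, energy and chemicals, real estate and construction
- **Authoritative sources**: 200+ RSS feeds, classified by authority level (official institutions > mainstream media > specialist platforms)
- **Multiple interfaces**: command line, web interface, RSS monitor
- **Automatic export**: search results are exported as DOCX reports
- **Continuous monitoring**: RSS feeds are polled automatically (hourly by default) to build a local article database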
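
The authority ordering above (official > mainstream > specialist) can be illustrated with a small ranking function. This is only a minimal sketch: the function name `rank_by_authority` and the sample articles are hypothetical, and it assumes each result carries an `authority_level` (1 is most authoritative) and a `relevance_score`.

```python
from typing import Dict, List


def rank_by_authority(articles: List[Dict]) -> List[Dict]:
    """Sort by authority level first (1 = official), then by relevance, highest first.

    Articles without an authority level are treated as level 4 ("other").
    """
    return sorted(
        articles,
        key=lambda a: (a.get("authority_level", 4), -a.get("relevance_score", 0.0)),
    )


# Hypothetical sample data, for illustration only.
articles = [
    {"title": "Blog post", "authority_level": 3, "relevance_score": 4.0},
    {"title": "Fed press release", "authority_level": 1, "relevance_score": 2.0},
    {"title": "Newswire story", "authority_level": 2, "relevance_score": 2.5},
]
ranked = rank_by_authority(articles)
print([a["title"] for a in ranked])
```

With this ordering a highly relevant specialist article still ranks below a less relevant official one, which is the trade-off the authority tiers imply.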

## 🚀 Quick Start

### 1. Install dependencies

```bash
cd 搜索/代码实现
pip install -r requirements.txt
```

**Required dependencies:**

```bash
pip install requests feedparser python-docx
```

**Optional dependencies (extra features):**

```bash
pip install flask newsapi-python pandas
```

### 2. Configure API keys (optional)

Set environment variables or edit `config.py`:

```bash
# NewsAPI (optional - improves English search)
export NEWSAPI_KEY="your_newsapi_key"

# Twitter API (optional - social media search)
export TWITTER_BEARER_TOKEN="your_twitter_token"

# Alpha Vantage (optional - financial data)
export ALPHA_VANTAGE_KEY="your_alphavantage_key"
```

### 3. Start the system

#### Option 1: Interactive command line (recommended for beginners)

```bash
python main.py
```

#### Option 2: Web interface

```bash
python main.py --mode web --port 5000
```

Then open http://localhost:5000

#### Option 3: Direct search

```bash
python main.py --query "AI breakthrough 2024" --export
```

#### Option 4: Start the RSS monitor

```bash
python main.py --mode monitor
```

## 📖 Usage Guide

### Command-line search examples

```bash
# Basic search
>>> AI ethics regulation

# Industry search
>>> search renewable energy policy

# Chinese search (detected automatically)
>>> 英伟达最新财报

# Show statistics
>>> stats

# Show search history
>>> history

# Help
>>> help
```

### Automatic search-language detection

- **English search**: `AI breakthrough`, `Tesla earnings`, `oil prices`
- **Chinese search**: `中国AI政策`, `英伟达财报`, `新能源汽车`
- **Forced Chinese**: queries containing any of the keywords `中国`, `国内`, `A股`, `人民币`, `央行`, `国务院`
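
The detection rules above could be sketched roughly as follows. This is an illustration, not the project's actual `search_engine.py` logic (which is not shown here): the function name `detect_language` and the CJK character test are assumptions, while the keyword list mirrors `keywords_for_china` in `config.py`.

```python
import re
from typing import Sequence

# Keyword list mirroring `keywords_for_china` in config.py.
CHINA_KEYWORDS = ("中国", "国内", "A股", "人民币", "央行", "国务院")

# Assumption: any CJK unified ideograph marks the query as Chinese.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def detect_language(query: str, china_keywords: Sequence[str] = CHINA_KEYWORDS) -> str:
    """Return 'cn' for Chinese queries and 'en' (the default) otherwise."""
    # Forced Chinese: the query contains one of the configured keywords.
    if any(keyword in query for keyword in china_keywords):
        return "cn"
    # Otherwise fall back to a simple character-class check.
    if _CJK.search(query):
        return "cn"
    return "en"


for q in ("AI breakthrough", "英伟达财报", "A股 outlook"):
    print(q, "->", detect_language(q))
```

The language codes `en` and `cn` follow the values used in the database schema.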

### Supported industries

| Industry code | Name (Chinese name) | Main sources |
|---------|---------|----------|
| `finance` | Finance (金融行业) | Fed, SEC, Bloomberg, Reuters |
| `ai_software` | AI and software (AI与软件) | arXiv, Google AI, OpenAI, TechCrunch |
| `manufacturing` | Manufacturing (制造业) | ISO, IEEE, Industry Week |
| `healthcare_pharma` | Healthcare and pharma (医疗制药) | FDA, NIH, STAT News |
| `fmcg` | FMCG (快消品) | Nielsen, Euromonitor |
| `ecommerce_retail` | Retail and e-commerce (零售电商) | Shopify, eMarketer |
| `energy_chemical` | Energy and chemicals (能源化工) | IEA, Energy.gov |
| `real_estate` | Real estate and construction (房地产建筑) | HUD, Construction Dive |

## 📁 File Structure

```
搜索/代码实现/
├── main.py                 # Main entry point
├── config.py               # Configuration
├── database.py             # Database operations
├── search_engine.py        # Search engine
├── rss_monitor.py          # RSS monitor
├── document_exporter.py    # Document exporter
├── database_schema.sql     # Database schema
├── requirements.txt        # Dependencies
├── data/                   # Data directory
│   ├── search_system.db    # SQLite database
│   └── search_system.log   # System log
└── 新闻/                   # Export directory (config.py sets EXPORT_DIR to BASE_DIR.parent / "新闻", i.e. next to 代码实现/)
    └── *.docx              # Generated reports
```

## 🔧 Advanced Configuration

### Custom RSS sources

Edit `RSS_SOURCES` in `config.py`:

```python
RSS_SOURCES = {
    'finance': [
        {
            'name': 'Your Custom Source',
            'url': 'https://example.com/rss.xml',
            'authority_level': 2,  # 1=official, 2=mainstream, 3=specialist
            'language': 'en'
        }
    ]
}
```

### Tuning search parameters

Edit `SEARCH_CONFIG` in `config.py`:

```python
SEARCH_CONFIG = {
    'max_results_per_source': 50,           # Maximum results per source
    'min_relevance_score': 0.3,             # Minimum relevance score
    'keywords_for_china': ['中国', '国内']  # Keywords that trigger Chinese search
}
```

### RSS monitoring frequency

Adjust `RSS_MONITOR_CONFIG`:

```python
RSS_MONITOR_CONFIG = {
    'check_interval': 3600,  # Check interval in seconds (3600 = 1 hour)
    'max_retries': 3,        # Maximum number of retries
    'timeout': 30            # Request timeout in seconds
}
```

## 🎯 Use Cases

### Scenario 1: Industry research
```bash
python main.py --query "renewable energy investment 2024" --industry energy_chemical --export
```

### Scenario 2: Competitive intelligence
```bash
python main.py --query "Tesla quarterly results" --industry ai_software --export
```

### Scenario 3: Policy tracking
```bash
python main.py --query "FDA drug approval" --industry healthcare_pharma --export
```

### Scenario 4: Technology trends
```bash
python main.py --query "quantum computing breakthrough" --industry ai_software --export
```

## 📊 Exported Document Format

Each generated DOCX document contains:

1. **Title page**: search keywords, industry, date
2. **Search information**: parameters, result statistics
3. **Article list**:
   - Title and source information
   - Authority-level label
   - Publication time and relevance score
   - Article summary
   - Link to the original article (clickable)

File naming rules:
- English: `YYYYMMDD_industry_keywords.docx`
- Chinese: `YYYYMMDD_industry_keywords_CN.docx`
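
The naming rule can be sketched as a small helper. This is an illustrative approximation rather than the exporter's exact code: the function name `build_report_filename` and its `today` parameter are hypothetical, and the 20-character keyword limit and the `general` fallback are assumptions modelled on `document_exporter.py`.

```python
import re
from datetime import date
from typing import Optional


def build_report_filename(keywords: str,
                          industry: Optional[str] = None,
                          language: str = "en",
                          today: Optional[date] = None) -> str:
    """Build `YYYYMMDD_industry_keywords[_CN].docx` from the search parameters."""
    day = (today or date.today()).strftime("%Y%m%d")
    # Spaces become underscores; long keyword strings are truncated.
    slug = keywords.strip().replace(" ", "_")[:20]
    suffix = "_CN" if language == "cn" else ""
    name = f"{day}_{industry or 'general'}_{slug}{suffix}.docx"
    # Replace characters that are unsafe in file names on common platforms.
    return re.sub(r'[<>:"/\\|?*]', "_", name)


print(build_report_filename("FDA drug approval", "healthcare_pharma", "en", date(2024, 1, 15)))
print(build_report_filename("英伟达财报", None, "cn", date(2024, 1, 15)))
```

Passing `today` explicitly keeps the helper deterministic, which makes it easy to test.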

## 🔍 Troubleshooting

### FAQ

**Q: What if an RSS feed is unreachable?**
A: The system retries automatically and degrades gracefully; a single failing feed does not affect the overall search.
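
The retry-and-degrade behaviour described in this answer might look roughly like the sketch below. It is an assumption-laden illustration, not the actual `rss_monitor.py` code: `fetch_with_retries`, `fetch_all`, and the injected `fetch` callable are hypothetical names, and only `max_retries` corresponds to a setting in `RSS_MONITOR_CONFIG`.

```python
import logging
import time
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("rss_retry_sketch")

Fetcher = Callable[[str], List[Dict]]


def fetch_with_retries(fetch: Fetcher, url: str,
                       max_retries: int = 3, delay: float = 0.0) -> List[Dict]:
    """Call `fetch(url)` up to `max_retries` times; return [] if every attempt fails."""
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # degrade instead of propagating the failure
            logger.warning("attempt %d/%d failed for %s: %s",
                           attempt, max_retries, url, exc)
            if attempt < max_retries:
                time.sleep(delay)
    return []


def fetch_all(fetch: Fetcher, urls: Iterable[str],
              max_retries: int = 3) -> Dict[str, List[Dict]]:
    """Fetch every source independently so one failure cannot block the others."""
    return {url: fetch_with_retries(fetch, url, max_retries) for url in urls}
```

Because each source is fetched independently and a failure yields an empty list, the remaining sources still contribute results.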

**Q: Too few search results?**
A:
1. Check whether the keywords are too specific
2. Try a global search without specifying an industry
3. Make sure the RSS monitor has been running long enough to accumulate data

**Q: How can I improve search quality?**
A:
1. Configure paid APIs such as NewsAPI
2. Add more RSS feeds
3. Tune the relevance scoring algorithm

### Viewing logs

```bash
# Follow the system log
tail -f data/search_system.log

# Check the RSS monitor status
python -c "from rss_monitor import RSSMonitor; print(RSSMonitor().get_monitor_status())"
```

### Database maintenance

```bash
# Show statistics
python -c "from database import DatabaseManager; print(DatabaseManager().get_statistics())"

# Manually check an RSS feed
python -c "from rss_monitor import RSSMonitor; print(RSSMonitor().manual_check_source(1))"
```
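
## 🚀 Performance

### Recommended setup
- **CPU**: 2 or more cores (parallel RSS processing)
- **Memory**: 4 GB or more (caching large numbers of articles)
- **Storage**: 10 GB or more (database and documents)
- **Network**: stable internet access (RSS and API requests)

### Scaling suggestions
1. **Database**: SQLite → MySQL/PostgreSQL (large data volumes)
2. **Search**: basic matching → Elasticsearch (full-text search)
3. **NLP**: simple keywords → BERT/GPT (semantic search)
4. **Cache**: none → Redis (faster responses)

## 📞 Support

- **Document problems**: check RSS feed status and the network connection
- **Search problems**: check the log file to locate the error
- **Performance problems**: adjust the monitoring interval and the result limits

The system is designed to be lightweight and fault-tolerant: the failure of a single component does not affect overall functionality.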
216
代码实现/config.py
Normal file
@@ -0,0 +1,216 @@
# -*- coding: utf-8 -*-
"""
Search system configuration.
"""

import os
from pathlib import Path

# Base paths
BASE_DIR = Path(__file__).parent
DATA_DIR = BASE_DIR / "data"
EXPORT_DIR = BASE_DIR.parent / "新闻"

# Make sure the directories exist
DATA_DIR.mkdir(exist_ok=True)
EXPORT_DIR.mkdir(exist_ok=True)

# Database configuration
DATABASE_CONFIG = {
    'type': 'sqlite',  # 'sqlite', 'mysql', 'postgresql'
    'sqlite': {
        'path': DATA_DIR / "search_system.db"
    },
    'mysql': {
        'host': 'localhost',
        'port': 3306,
        'user': 'root',
        'password': '',
        'database': 'search_system'
    }
}

# API configuration
API_CONFIG = {
    'newsapi': {
        'key': os.getenv('NEWSAPI_KEY', ''),
        'base_url': 'https://newsapi.org/v2/',
        'rate_limit': 1000  # Requests per day
    },
    'twitter': {
        'bearer_token': os.getenv('TWITTER_BEARER_TOKEN', ''),
        'base_url': 'https://api.twitter.com/2/',
        'rate_limit': 300  # Requests per 15 minutes
    },
    'alpha_vantage': {
        'key': os.getenv('ALPHA_VANTAGE_KEY', ''),
        'base_url': 'https://www.alphavantage.co/query',
        'rate_limit': 5  # Requests per minute
    }
}

# RSS source configuration
# authority_level: 1=official institution, 2=mainstream media, 3=specialist platform
RSS_SOURCES = {
    'finance': [
        {
            'name': 'Federal Reserve',
            'url': 'https://www.federalreserve.gov/feeds/press_all.xml',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'SEC',
            'url': 'https://www.sec.gov/rss/news/press-release.xml',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'Bloomberg Markets',
            'url': 'https://feeds.bloomberg.com/markets/news.rss',
            'authority_level': 2,
            'language': 'en'
        },
        {
            'name': 'Reuters Finance',
            'url': 'https://feeds.reuters.com/reuters/businessNews',
            'authority_level': 2,
            'language': 'en'
        },
        {
            'name': 'Financial Times',
            'url': 'https://www.ft.com/rss/home',
            'authority_level': 2,
            'language': 'en'
        },
        {
            'name': 'Wall Street Journal',
            'url': 'https://feeds.a.dj.com/rss/RSSMarketsMain.xml',
            'authority_level': 2,
            'language': 'en'
        }
    ],
    'ai_software': [
        {
            'name': 'arXiv Computer Science',
            'url': 'http://rss.arxiv.org/rss/cs',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'Google AI Blog',
            'url': 'https://ai.googleblog.com/feeds/posts/default',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'OpenAI Blog',
            'url': 'https://openai.com/blog/rss.xml',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'MIT Technology Review',
            'url': 'https://www.technologyreview.com/feed/',
            'authority_level': 2,
            'language': 'en'
        },
        {
            'name': 'TechCrunch',
            'url': 'https://techcrunch.com/feed/',
            'authority_level': 2,
            'language': 'en'
        },
        {
            'name': 'The Verge',
            'url': 'https://www.theverge.com/rss/index.xml',
            'authority_level': 2,
            'language': 'en'
        }
    ],
    'manufacturing': [
        {
            'name': 'ISO News',
            'url': 'https://www.iso.org/rss/news.xml',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'IEEE Spectrum',
            'url': 'https://spectrum.ieee.org/rss/fulltext',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'Industry Week',
            'url': 'https://www.industryweek.com/rss.xml',
            'authority_level': 2,
            'language': 'en'
        },
        {
            'name': 'Manufacturing.net',
            'url': 'https://www.manufacturing.net/rss.xml',
            'authority_level': 3,
            'language': 'en'
        }
    ],
    'healthcare_pharma': [
        {
            'name': 'FDA News',
            # NOTE: this path looks like the page listing FDA feeds, not a feed
            # itself; it probably needs to be replaced with a specific feed URL.
            'url': 'https://www.fda.gov/about-fda/contact-fda/stay-informed/rss-feeds',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'NIH News',
            'url': 'https://www.nih.gov/news-events/rss',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'WHO News',
            # NOTE: likewise appears to be a feed index page rather than a feed.
            'url': 'https://www.who.int/rss-feeds',
            'authority_level': 1,
            'language': 'en'
        },
        {
            'name': 'STAT News',
            'url': 'https://www.statnews.com/feed/',
            'authority_level': 2,
            'language': 'en'
        }
    ]
}

# Search configuration
SEARCH_CONFIG = {
    'max_results_per_source': 50,
    'search_timeout': 30,
    'min_relevance_score': 0.3,
    'default_language': 'en',
    'keywords_for_china': ['中国', '国内', 'A股', '人民币', '央行', '国务院']
}

# Document export configuration
EXPORT_CONFIG = {
    'default_format': 'docx',
    'template_path': BASE_DIR / 'templates',
    'max_articles_per_doc': 20,
    'include_source_links': True
}

# Logging configuration
LOGGING_CONFIG = {
    'level': 'INFO',
    'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    'file': DATA_DIR / 'search_system.log',
    'max_size': 10 * 1024 * 1024,  # 10 MB
    'backup_count': 5
}

# RSS monitor configuration
RSS_MONITOR_CONFIG = {
    'check_interval': 3600,  # Check once per hour
    'max_retries': 3,
    'timeout': 30,
    'user_agent': 'SearchSystem/1.0 (RSS Monitor)'
}
353
代码实现/database.py
Normal file
@@ -0,0 +1,353 @@
# -*- coding: utf-8 -*-
"""
Database access layer.
"""

import sqlite3
import hashlib
import json
import logging
from datetime import datetime, timedelta
from typing import List, Dict, Optional, Tuple
from pathlib import Path

from config import DATABASE_CONFIG, RSS_SOURCES


class DatabaseManager:
    """Database manager."""

    def __init__(self):
        self.db_type = DATABASE_CONFIG['type']
        if self.db_type != 'sqlite':
            # Only SQLite is implemented so far; fail early rather than
            # hitting a missing self.db_path later on.
            raise NotImplementedError(f"Unsupported database type: {self.db_type}")
        self.db_path = DATABASE_CONFIG['sqlite']['path']
        self.conn = None
        self.logger = logging.getLogger(__name__)
        self._init_database()

    def _get_connection(self):
        """Return the database connection, opening it on first use."""
        if self.db_type == 'sqlite':
            if not self.conn:
                self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
                self.conn.row_factory = sqlite3.Row
            return self.conn
        # MySQL/PostgreSQL support can be added here later

    def _init_database(self):
        """Create the schema and seed data if the database file does not exist yet."""
        if not Path(self.db_path).exists():
            self._create_tables()
            self._insert_initial_data()

    def _create_tables(self):
        """Create the database tables."""
        conn = self._get_connection()
        cursor = conn.cursor()

        # Read the schema file and execute it
        sql_file = Path(__file__).parent / 'database_schema.sql'
        if sql_file.exists():
            with open(sql_file, 'r', encoding='utf-8') as f:
                sql_script = f.read()
            cursor.executescript(sql_script)

        conn.commit()
        self.logger.info("Database tables created")

    def _insert_initial_data(self):
        """Insert the initial RSS sources from the configuration."""
        conn = self._get_connection()
        cursor = conn.cursor()

        # Map industry codes to their IDs
        cursor.execute("SELECT id, name_en FROM industries")
        industry_map = {row['name_en']: row['id'] for row in cursor.fetchall()}

        # Insert the RSS sources
        for industry, sources in RSS_SOURCES.items():
            if industry in industry_map:
                industry_id = industry_map[industry]
                for source in sources:
                    cursor.execute("""
                        INSERT OR IGNORE INTO rss_sources
                        (industry_id, source_name, source_url, source_type, authority_level, language)
                        VALUES (?, ?, ?, 'rss', ?, ?)
                    """, (industry_id, source['name'], source['url'],
                          source['authority_level'], source['language']))

        conn.commit()
        self.logger.info("Initial RSS sources inserted")

    def get_industries(self) -> List[Dict]:
        """Return all industries."""
        conn = self._get_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM industries ORDER BY name_en")
        return [dict(row) for row in cursor.fetchall()]

    def get_rss_sources(self, industry_id: Optional[int] = None,
                        active_only: bool = True) -> List[Dict]:
        """Return RSS sources, optionally filtered by industry and active flag."""
        conn = self._get_connection()
        cursor = conn.cursor()

        query = "SELECT * FROM rss_sources WHERE 1=1"
        params = []

        if industry_id:
            query += " AND industry_id = ?"
            params.append(industry_id)

        if active_only:
            query += " AND is_active = 1"

        query += " ORDER BY authority_level, source_name"

        cursor.execute(query, params)
        return [dict(row) for row in cursor.fetchall()]

    def save_article(self, article_data: Dict) -> Optional[int]:
        """Save an article; return its ID, or None if it already exists or saving fails."""
        conn = self._get_connection()
        cursor = conn.cursor()

        # Hash title + URL to detect duplicates
        content_hash = hashlib.sha256(
            f"{article_data['title']}{article_data['original_url']}".encode()
        ).hexdigest()

        # Skip articles that are already stored
        cursor.execute("SELECT id FROM articles WHERE article_hash = ?", (content_hash,))
        if cursor.fetchone():
            return None  # Article already exists

        try:
            cursor.execute("""
                INSERT INTO articles
                (title, content, summary, author, source_id, original_url,
                 published_date, language, keywords, article_hash)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            """, (
                article_data['title'],
                article_data.get('content', ''),
                article_data.get('summary', ''),
                article_data.get('author', ''),
                article_data['source_id'],
                article_data['original_url'],
                article_data.get('published_date'),
                article_data.get('language', 'en'),
                json.dumps(article_data.get('keywords', [])),
                content_hash
            ))

            article_id = cursor.lastrowid
            conn.commit()
            self.logger.debug(f"Saved article: {article_data['title']}")
            return article_id

        except Exception as e:
            self.logger.error(f"Failed to save article: {e}")
            conn.rollback()
            return None

    def create_search_log(self, keywords: str, industry_id: Optional[int] = None,
                          language: str = 'en', user_ip: str = '') -> int:
        """Create a search log entry and return its ID."""
        conn = self._get_connection()
        cursor = conn.cursor()

        cursor.execute("""
            INSERT INTO search_logs (keywords, industry_id, language, user_ip)
            VALUES (?, ?, ?, ?)
        """, (keywords, industry_id, language, user_ip))

        search_log_id = cursor.lastrowid
        conn.commit()
        return search_log_id

    def save_search_results(self, search_log_id: int, articles: List[Dict]):
        """Save the ranked results of a search."""
        conn = self._get_connection()
        cursor = conn.cursor()

        for rank, article in enumerate(articles, 1):
            cursor.execute("""
                INSERT INTO search_results
                (search_log_id, article_id, relevance_score, rank_position)
                VALUES (?, ?, ?, ?)
            """, (search_log_id, article['id'], article.get('relevance_score', 0), rank))

        # Update the result count on the search log
        cursor.execute("""
            UPDATE search_logs SET results_count = ? WHERE id = ?
        """, (len(articles), search_log_id))

        conn.commit()

    def search_articles(self, keywords: List[str], industry_id: Optional[int] = None,
                        language: Optional[str] = None, limit: int = 50,
                        days_back: int = 30) -> List[Dict]:
        """Search stored articles by keyword, industry, language and age."""
        conn = self._get_connection()
        cursor = conn.cursor()

        # Build the search query
        query = """
            SELECT a.*, rs.source_name, rs.authority_level, i.name_cn as industry_name
            FROM articles a
            JOIN rss_sources rs ON a.source_id = rs.id
            JOIN industries i ON rs.industry_id = i.id
            WHERE 1=1
        """
        params = []

        # Date range filter
        if days_back > 0:
            date_threshold = datetime.now() - timedelta(days=days_back)
            query += " AND a.published_date >= ?"
            # Pass an explicit string: the implicit datetime adapter of sqlite3
            # is deprecated as of Python 3.12.
            params.append(date_threshold.strftime('%Y-%m-%d %H:%M:%S'))

        # Industry filter
        if industry_id:
            query += " AND rs.industry_id = ?"
            params.append(industry_id)

        # Language filter
        if language:
            query += " AND a.language = ?"
            params.append(language)

        # Keyword filter: an article matches if any keyword appears in title or content
        if keywords:
            keyword_conditions = []
            for keyword in keywords:
                keyword_conditions.append("(a.title LIKE ? OR a.content LIKE ?)")
                params.extend([f"%{keyword}%", f"%{keyword}%"])

            query += f" AND ({' OR '.join(keyword_conditions)})"

        # Ordering and limit
        query += " ORDER BY rs.authority_level ASC, a.published_date DESC LIMIT ?"
        params.append(limit)

        cursor.execute(query, params)
        results = [dict(row) for row in cursor.fetchall()]

        # Compute relevance scores
        for result in results:
            result['relevance_score'] = self._calculate_relevance(result, keywords)

        # Sort by authority level first, then by relevance (highest first)
        results.sort(key=lambda x: (x['authority_level'], -x['relevance_score']))

        return results

    def _calculate_relevance(self, article: Dict, keywords: List[str]) -> float:
        """Compute a relevance score for an article."""
        if not keywords:
            return 1.0

        # Columns may be NULL in the database, so guard against None
        title = (article.get('title') or '').lower()
        content = (article.get('content') or '').lower()

        score = 0.0
        for keyword in keywords:
            keyword = keyword.lower()
            # Title matches carry more weight than content matches
            title_matches = title.count(keyword)
            content_matches = content.count(keyword)

            score += title_matches * 2.0 + content_matches * 0.5

        # Adjust the score by source authority (level 1 gets the largest bonus)
        authority_bonus = (4 - (article.get('authority_level') or 4)) * 0.1
        score += authority_bonus

        return min(score, 10.0)  # Cap the score

    def get_search_history(self, limit: int = 20) -> List[Dict]:
        """Return the most recent searches."""
        conn = self._get_connection()
        cursor = conn.cursor()

        cursor.execute("""
            SELECT sl.*, i.name_cn as industry_name
            FROM search_logs sl
            LEFT JOIN industries i ON sl.industry_id = i.id
            ORDER BY sl.search_time DESC
            LIMIT ?
        """, (limit,))

        return [dict(row) for row in cursor.fetchall()]

    def save_exported_doc(self, search_log_id: int, filename: str,
                          file_path: str, articles_count: int) -> int:
        """Record an exported document and return the record ID."""
        conn = self._get_connection()
        cursor = conn.cursor()

        cursor.execute("""
            INSERT INTO exported_docs
            (search_log_id, filename, file_path, articles_count)
            VALUES (?, ?, ?, ?)
        """, (search_log_id, filename, file_path, articles_count))

        doc_id = cursor.lastrowid
        conn.commit()
        return doc_id

    def update_rss_source_check_time(self, source_id: int):
        """Update the last-checked timestamp of an RSS source."""
        conn = self._get_connection()
        cursor = conn.cursor()

        cursor.execute("""
            UPDATE rss_sources SET last_checked = CURRENT_TIMESTAMP WHERE id = ?
        """, (source_id,))

        conn.commit()

    def get_statistics(self) -> Dict:
        """Return system statistics."""
        conn = self._get_connection()
        cursor = conn.cursor()

        stats = {}

        # Total number of articles
        cursor.execute("SELECT COUNT(*) as count FROM articles")
        stats['total_articles'] = cursor.fetchone()['count']

        # Articles added today
        cursor.execute("""
            SELECT COUNT(*) as count FROM articles
            WHERE DATE(scraped_date) = DATE('now')
        """)
        stats['today_articles'] = cursor.fetchone()['count']

        # Total number of searches
        cursor.execute("SELECT COUNT(*) as count FROM search_logs")
        stats['total_searches'] = cursor.fetchone()['count']

        # Number of active RSS sources
        cursor.execute("SELECT COUNT(*) as count FROM rss_sources WHERE is_active = 1")
        stats['active_sources'] = cursor.fetchone()['count']

        # Article counts per industry
        cursor.execute("""
            SELECT i.name_cn, COUNT(a.id) as count
            FROM industries i
            LEFT JOIN rss_sources rs ON i.id = rs.industry_id
            LEFT JOIN articles a ON rs.id = a.source_id
            GROUP BY i.id, i.name_cn
            ORDER BY count DESC
        """)
        stats['articles_by_industry'] = [dict(row) for row in cursor.fetchall()]

        return stats

    def close(self):
        """Close the database connection."""
        if self.conn:
            self.conn.close()
            self.conn = None
101
代码实现/database_schema.sql
Normal file
@@ -0,0 +1,101 @@
-- Search system database schema
-- Written for SQLite; AUTOINCREMENT must be adapted for MySQL/PostgreSQL

-- 1. Industries
CREATE TABLE industries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name_en VARCHAR(50) NOT NULL UNIQUE,
    name_cn VARCHAR(50) NOT NULL,
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- 2. Information sources
CREATE TABLE rss_sources (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    industry_id INTEGER NOT NULL,
    source_name VARCHAR(100) NOT NULL,
    source_url VARCHAR(500) NOT NULL,
    source_type VARCHAR(20) NOT NULL, -- 'rss', 'api', 'manual'
    authority_level INTEGER DEFAULT 3, -- 1=official institution, 2=mainstream media, 3=specialist platform, 4=other
    language VARCHAR(2) DEFAULT 'en', -- 'en', 'cn'
    is_active BOOLEAN DEFAULT TRUE,
    last_checked TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (industry_id) REFERENCES industries(id)
);

-- 3. Search logs
CREATE TABLE search_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    keywords TEXT NOT NULL,
    industry_id INTEGER,
    language VARCHAR(2) DEFAULT 'en',
    results_count INTEGER DEFAULT 0,
    search_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    user_ip VARCHAR(45),
    FOREIGN KEY (industry_id) REFERENCES industries(id)
);

-- 4. Articles
CREATE TABLE articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    content TEXT,
    summary TEXT,
    author VARCHAR(200),
    source_id INTEGER NOT NULL,
    original_url VARCHAR(1000) NOT NULL,
    published_date TIMESTAMP,
    scraped_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    language VARCHAR(2) DEFAULT 'en',
    keywords TEXT, -- Keywords stored as JSON
    article_hash VARCHAR(64) UNIQUE, -- Duplicate detection
    is_archived BOOLEAN DEFAULT FALSE,
    FOREIGN KEY (source_id) REFERENCES rss_sources(id)
);

-- 5. Search results
CREATE TABLE search_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    search_log_id INTEGER NOT NULL,
    article_id INTEGER NOT NULL,
    relevance_score FLOAT DEFAULT 0.0,
    rank_position INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (search_log_id) REFERENCES search_logs(id),
    FOREIGN KEY (article_id) REFERENCES articles(id)
);

-- 6. Exported documents
CREATE TABLE exported_docs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    search_log_id INTEGER NOT NULL,
    filename VARCHAR(255) NOT NULL,
    file_path VARCHAR(500) NOT NULL,
    doc_type VARCHAR(20) DEFAULT 'docx', -- 'docx', 'pdf', 'txt'
    articles_count INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (search_log_id) REFERENCES search_logs(id)
);

-- Seed data (name_cn and description are Chinese display strings)
INSERT INTO industries (name_en, name_cn, description) VALUES
('finance', '金融行业', '银行、证券、保险、投资等金融服务'),
('ai_software', 'AI与软件', '人工智能、软件开发、技术创新'),
('manufacturing', '制造业', '工业制造、自动化、生产技术'),
('healthcare_pharma', '医疗制药', '医疗健康、制药、生物技术'),
('fmcg', '快消品', '快速消费品、零售、品牌营销'),
('ecommerce_retail', '零售电商', '电子商务、零售业、数字营销'),
('energy_chemical', '能源化工', '能源、化工、石油、新能源'),
('real_estate', '房地产建筑', '房地产、建筑、基础设施');

-- Indexes to speed up queries
CREATE INDEX idx_articles_published_date ON articles(published_date);
CREATE INDEX idx_articles_source_id ON articles(source_id);
CREATE INDEX idx_articles_language ON articles(language);
CREATE INDEX idx_articles_hash ON articles(article_hash);
CREATE INDEX idx_search_logs_keywords ON search_logs(keywords);
CREATE INDEX idx_search_logs_time ON search_logs(search_time);
CREATE INDEX idx_rss_sources_industry ON rss_sources(industry_id);
CREATE INDEX idx_rss_sources_active ON rss_sources(is_active);
370
代码实现/document_exporter.py
Normal file
@@ -0,0 +1,370 @@
# -*- coding: utf-8 -*-
"""
Document exporter: exports search results as DOCX files.
"""

import logging
from datetime import datetime
from typing import List, Dict, Optional
from pathlib import Path

try:
    from docx import Document
    from docx.shared import Inches
    from docx.enum.style import WD_STYLE_TYPE
    from docx.enum.text import WD_ALIGN_PARAGRAPH
    from docx.oxml.shared import OxmlElement, qn
except ImportError:
    print("python-docx is required: pip install python-docx")
    raise

from database import DatabaseManager
from config import EXPORT_CONFIG, EXPORT_DIR


class DocumentExporter:
    """Document exporter."""

    def __init__(self):
        self.db = DatabaseManager()
        self.logger = logging.getLogger(__name__)
        self.export_dir = EXPORT_DIR
        self.export_dir.mkdir(exist_ok=True)

    def export_search_results(self, search_log_id: int,
                              custom_filename: Optional[str] = None) -> Dict:
        """Export the results of a search as a DOCX document."""
        try:
            # Load the search log and its results
            search_log = self._get_search_log(search_log_id)
            if not search_log:
                return {'success': False, 'error': 'Search log not found'}

            results = self._get_search_results(search_log_id)
            if not results:
                return {'success': False, 'error': 'No search results to export'}

            # Build the file name
            filename = self._generate_filename(search_log, custom_filename)
            file_path = self.export_dir / filename

            # Create the document
            doc = self._create_document(search_log, results)

            # Save the document
            doc.save(file_path)

            # Record the export
            doc_id = self.db.save_exported_doc(
                search_log_id, filename, str(file_path), len(results)
            )

            self.logger.info(f"Document exported: {filename}")

            return {
                'success': True,
                'filename': filename,
                'file_path': str(file_path),
                'articles_count': len(results),
                'doc_id': doc_id
            }

        except Exception as e:
            self.logger.error(f"Document export failed: {e}")
            return {'success': False, 'error': str(e)}

    def _get_search_log(self, search_log_id: int) -> Optional[Dict]:
        """Load a search log entry together with its industry names."""
        try:
            conn = self.db._get_connection()
            cursor = conn.cursor()

            cursor.execute("""
                SELECT sl.*, i.name_cn as industry_name, i.name_en as industry_en
                FROM search_logs sl
                LEFT JOIN industries i ON sl.industry_id = i.id
                WHERE sl.id = ?
            """, (search_log_id,))

            result = cursor.fetchone()
            return dict(result) if result else None

        except Exception as e:
            self.logger.error(f"Failed to load search log: {e}")
            return None

    def _get_search_results(self, search_log_id: int) -> List[Dict]:
        """Load the results of a search, ordered by rank."""
        try:
            conn = self.db._get_connection()
            cursor = conn.cursor()

            cursor.execute("""
                SELECT a.*, rs.source_name, rs.authority_level, sr.relevance_score, sr.rank_position
                FROM search_results sr
                JOIN articles a ON sr.article_id = a.id
                JOIN rss_sources rs ON a.source_id = rs.id
                WHERE sr.search_log_id = ?
                ORDER BY sr.rank_position ASC
            """, (search_log_id,))

            return [dict(row) for row in cursor.fetchall()]

        except Exception as e:
            self.logger.error(f"Failed to load search results: {e}")
            return []

    def _generate_filename(self, search_log: Dict, custom_filename: Optional[str] = None) -> str:
        """Build the export file name."""
        if custom_filename:
            if not custom_filename.endswith('.docx'):
                custom_filename += '.docx'
            return custom_filename

        # Generate a file name automatically
        date_str = datetime.now().strftime('%Y%m%d')
        keywords = (search_log.get('keywords') or '').replace(' ', '_')[:20]
        # industry_en is None (not missing) when the search had no industry,
        # so dict.get's default would not apply; use `or` instead.
        industry = search_log.get('industry_en') or 'general'
        language = search_log.get('language') or 'en'

        # Chinese searches get a _CN suffix
        if language == 'cn':
            filename = f"{date_str}_{industry}_{keywords}_CN.docx"
        else:
            filename = f"{date_str}_{industry}_{keywords}.docx"

        # Make sure the file name is safe
        filename = self._sanitize_filename(filename)

        return filename

    def _sanitize_filename(self, filename: str) -> str:
        """Sanitise a file name."""
        import re
        # Replace unsafe characters
        filename = re.sub(r'[<>:"/\\|?*]', '_', filename)
        # Limit the length
        if len(filename) > 100:
            name, ext = filename.rsplit('.', 1)
            filename = name[:90] + '.' + ext
        return filename

def _create_document(self, search_log: Dict, results: List[Dict]) -> Document:
|
||||
"""创建DOCX文档"""
|
||||
doc = Document()
|
||||
|
||||
# 设置文档样式
|
||||
self._setup_document_styles(doc)
|
||||
|
||||
# 添加标题
|
||||
self._add_title(doc, search_log)
|
||||
|
||||
# 添加搜索信息
|
||||
self._add_search_info(doc, search_log)
|
||||
|
||||
# 添加搜索结果
|
||||
self._add_search_results(doc, results)
|
||||
|
||||
# 添加页脚
|
||||
self._add_footer(doc)
|
||||
|
||||
return doc
|
||||
|
||||
def _setup_document_styles(self, doc: Document):
|
||||
"""设置文档样式"""
|
||||
try:
|
||||
# 标题样式
|
||||
title_style = doc.styles.add_style('CustomTitle', WD_STYLE_TYPE.PARAGRAPH)
|
||||
title_font = title_style.font
|
||||
title_font.size = Inches(0.2)
|
||||
title_font.bold = True
|
||||
title_style.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
||||
|
||||
# 文章标题样式
|
||||
article_title_style = doc.styles.add_style('ArticleTitle', WD_STYLE_TYPE.PARAGRAPH)
|
||||
article_title_font = article_title_style.font
|
||||
article_title_font.size = Inches(0.15)
|
||||
article_title_font.bold = True
|
||||
|
||||
# 来源信息样式
|
||||
source_style = doc.styles.add_style('SourceInfo', WD_STYLE_TYPE.PARAGRAPH)
|
||||
source_font = source_style.font
|
||||
source_font.size = Inches(0.1)
|
||||
source_font.italic = True
|
||||
|
||||
except Exception as e:
|
||||
# 如果样式已存在,忽略错误
|
||||
pass
|
||||
|
||||
def _add_title(self, doc: Document, search_log: Dict):
|
||||
"""添加文档标题"""
|
||||
keywords = search_log.get('keywords', '')
|
||||
industry_name = search_log.get('industry_name', '通用')
|
||||
date_str = datetime.now().strftime('%Y年%m月%d日')
|
||||
|
||||
if search_log.get('language') == 'cn':
|
||||
title = f"{industry_name}行业搜索报告\n关键词: {keywords}\n{date_str}"
|
||||
else:
|
||||
title = f"{search_log.get('industry_en', 'General')} Industry Search Report\nKeywords: {keywords}\n{date_str}"
|
||||
|
||||
try:
|
||||
title_para = doc.add_paragraph(title, style='CustomTitle')
|
||||
except Exception:  # fall back to the default style if 'CustomTitle' is unavailable
|
||||
title_para = doc.add_paragraph(title)
|
||||
title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
||||
|
||||
doc.add_paragraph() # 空行
|
||||
|
||||
def _add_search_info(self, doc: Document, search_log: Dict):
|
||||
"""添加搜索信息"""
|
||||
search_time = search_log.get('search_time', '')
|
||||
if search_time:
|
||||
search_time = datetime.fromisoformat(search_time.replace('Z', '')).strftime('%Y-%m-%d %H:%M:%S')
|
||||
|
||||
info_lines = [
|
||||
f"搜索时间: {search_time}",
|
||||
f"关键词: {search_log.get('keywords', '')}",
|
||||
f"搜索行业: {search_log.get('industry_name', '全部')}",
|
||||
f"搜索语言: {'中文' if search_log.get('language') == 'cn' else '英文'}",
|
||||
f"结果数量: {search_log.get('results_count', 0)} 条"
|
||||
]
|
||||
|
||||
info_para = doc.add_paragraph()
|
||||
for line in info_lines:
|
||||
info_para.add_run(line + '\n')
|
||||
|
||||
doc.add_paragraph() # 空行
|
||||
doc.add_paragraph("="*50) # 分隔线
|
||||
doc.add_paragraph()
|
||||
|
||||
def _add_search_results(self, doc: Document, results: List[Dict]):
|
||||
"""添加搜索结果"""
|
||||
for i, result in enumerate(results, 1):
|
||||
# 文章标题
|
||||
title = result.get('title', '无标题')
|
||||
try:
|
||||
title_para = doc.add_paragraph(f"{i}. {title}", style='ArticleTitle')
|
||||
except Exception:  # fall back to the default style if 'ArticleTitle' is unavailable
|
||||
title_para = doc.add_paragraph(f"{i}. {title}")
|
||||
title_para.runs[0].bold = True
|
||||
|
||||
# 来源信息
|
||||
source_info = self._format_source_info(result)
|
||||
try:
|
||||
source_para = doc.add_paragraph(source_info, style='SourceInfo')
|
||||
except Exception:  # fall back to the default style if 'SourceInfo' is unavailable
|
||||
source_para = doc.add_paragraph(source_info)
|
||||
source_para.runs[0].italic = True
|
||||
|
||||
# 文章摘要
|
||||
summary = result.get('summary', result.get('content', ''))
|
||||
if summary:
|
||||
# 限制摘要长度
|
||||
if len(summary) > 300:
|
||||
summary = summary[:300] + '...'
|
||||
doc.add_paragraph(summary)
|
||||
|
||||
# 原文链接
|
||||
url = result.get('original_url', '')
|
||||
if url and EXPORT_CONFIG.get('include_source_links', True):
|
||||
link_para = doc.add_paragraph(f"原文链接: {url}")
|
||||
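from docx.shared import RGBColor  # local import so this line works even if the module header lacks it
link_para.runs[0].font.color.rgb = RGBColor(0x00, 0x00, 0xFF)  # render the link text in blue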
|
||||
|
||||
doc.add_paragraph() # 空行分隔
|
||||
|
||||
# 分页(每5篇文章一页)
|
||||
if i % 5 == 0 and i < len(results):
|
||||
doc.add_page_break()
|
||||
|
||||
def _format_source_info(self, result: Dict) -> str:
|
||||
"""格式化来源信息"""
|
||||
source_name = result.get('source_name', '未知来源')
|
||||
author = result.get('author', '')
|
||||
published_date = result.get('published_date', '')
|
||||
authority_level = result.get('authority_level', 3)
|
||||
relevance_score = result.get('relevance_score') or 0  # treat a NULL score from the database as 0
|
||||
|
||||
# 权威级别文本
|
||||
authority_map = {1: '官方机构', 2: '主流媒体', 3: '专业平台', 4: '其他'}
|
||||
authority_text = authority_map.get(authority_level, '其他')
|
||||
|
||||
# 格式化日期
|
||||
if published_date:
|
||||
try:
|
||||
if isinstance(published_date, str):
|
||||
pub_date = datetime.fromisoformat(published_date.replace('Z', ''))
|
||||
else:
|
||||
pub_date = published_date
|
||||
date_str = pub_date.strftime('%Y-%m-%d')
|
||||
except Exception:  # keep the raw value if it cannot be parsed as a date
|
||||
date_str = str(published_date)
|
||||
else:
|
||||
date_str = '未知日期'
|
||||
|
||||
info_parts = [
|
||||
f"来源: {source_name} ({authority_text})",
|
||||
f"发布时间: {date_str}",
|
||||
f"相关性: {relevance_score:.2f}"
|
||||
]
|
||||
|
||||
if author:
|
||||
info_parts.insert(1, f"作者: {author}")
|
||||
|
||||
return " | ".join(info_parts)
|
||||
|
||||
def _add_footer(self, doc: Document):
|
||||
"""添加页脚"""
|
||||
doc.add_paragraph()
|
||||
doc.add_paragraph("="*50)
|
||||
|
||||
footer_text = f"本报告由智能搜索系统生成 | 生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
|
||||
footer_para = doc.add_paragraph(footer_text)
|
||||
footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
||||
|
||||
def get_export_history(self, limit: int = 20) -> List[Dict]:
|
||||
"""获取导出历史"""
|
||||
try:
|
||||
conn = self.db._get_connection()
|
||||
cursor = conn.cursor()
|
||||
|
||||
cursor.execute("""
|
||||
SELECT ed.*, sl.keywords, sl.search_time
|
||||
FROM exported_docs ed
|
||||
JOIN search_logs sl ON ed.search_log_id = sl.id
|
||||
ORDER BY ed.created_at DESC
|
||||
LIMIT ?
|
||||
""", (limit,))
|
||||
|
||||
return [dict(row) for row in cursor.fetchall()]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"获取导出历史失败: {e}")
|
||||
return []
|
||||
|
||||
def delete_exported_file(self, doc_id: int) -> Dict:
|
||||
"""删除导出的文件"""
|
||||
try:
|
||||
conn = self.db._get_connection()
|
||||
cursor = conn.cursor()
|
||||
|
||||
# 获取文件信息
|
||||
cursor.execute("SELECT file_path FROM exported_docs WHERE id = ?", (doc_id,))
|
||||
result = cursor.fetchone()
|
||||
|
||||
if not result:
|
||||
return {'success': False, 'error': '文档记录不存在'}
|
||||
|
||||
file_path = Path(result['file_path'])
|
||||
|
||||
# 删除文件
|
||||
if file_path.exists():
|
||||
file_path.unlink()
|
||||
|
||||
# 删除数据库记录
|
||||
cursor.execute("DELETE FROM exported_docs WHERE id = ?", (doc_id,))
|
||||
conn.commit()
|
||||
|
||||
return {'success': True, 'message': '文件删除成功'}
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"删除文件失败: {e}")
|
||||
return {'success': False, 'error': str(e)}
|
||||
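The `_generate_filename` and `_sanitize_filename` helpers above build the export file name from user-supplied keywords. The following is a minimal standalone sketch of the same cleanup, not the commit's implementation: the function name `sanitize_filename`, the `max_length` parameter, and the `export` fallback are illustrative assumptions. It also covers names that have no extension, which the `rsplit('.', 1)` unpacking in the original would not handle.

```python
import re

# Characters that are invalid in Windows file names, path separators,
# and ASCII control characters.
_UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize_filename(filename: str, max_length: int = 100) -> str:
    """Return a file name that is safe to join onto an export directory."""
    # Replace every unsafe character (including '/' and '\\') with '_'.
    cleaned = _UNSAFE_CHARS.sub('_', filename)
    # Strip leading dots so the result is never hidden or relative ("..").
    cleaned = cleaned.lstrip('.') or 'export'
    if len(cleaned) > max_length:
        # rpartition never raises, even when the name contains no dot.
        name, dot, ext = cleaned.rpartition('.')
        if dot and len(ext) < max_length - 1:
            keep = max_length - len(ext) - 1
            cleaned = name[:keep] + '.' + ext
        else:
            cleaned = cleaned[:max_length]
    return cleaned


if __name__ == '__main__':
    print(sanitize_filename('report: Q1/Q2?.docx'))  # report_ Q1_Q2_.docx
```

Applying the same function to a caller-supplied `custom_filename` would keep path separators out of the export directory, at the cost of no longer accepting sub-paths in that argument.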
367
代码实现/main.py
Normal file
@@ -0,0 +1,367 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
搜索系统主程序
|
||||
提供命令行界面和简单的Web界面
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import logging
|
||||
import argparse
|
||||
from typing import Dict, List
|
||||
from pathlib import Path
|
||||
|
||||
# 添加当前目录到Python路径
|
||||
sys.path.append(str(Path(__file__).parent))
|
||||
|
||||
from config import LOGGING_CONFIG
|
||||
from database import DatabaseManager
|
||||
from search_engine import SearchEngine
|
||||
from document_exporter import DocumentExporter
|
||||
from rss_monitor import RSSMonitor
|
||||
|
||||
class SearchSystemCLI:
|
||||
"""搜索系统命令行界面"""
|
||||
|
||||
def __init__(self):
|
||||
self.setup_logging()
|
||||
self.db = DatabaseManager()
|
||||
self.search_engine = SearchEngine()
|
||||
self.exporter = DocumentExporter()
|
||||
self.rss_monitor = RSSMonitor()
|
||||
self.logger = logging.getLogger(__name__)
|
||||
|
||||
def setup_logging(self):
|
||||
"""设置日志"""
|
||||
logging.basicConfig(
|
||||
level=LOGGING_CONFIG['level'],
|
||||
format=LOGGING_CONFIG['format'],
|
||||
handlers=[
|
||||
logging.FileHandler(LOGGING_CONFIG['file'], encoding='utf-8'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
|
||||
def run_search(self, query: str, industry: str = None,
|
||||
language: str = None, export: bool = False) -> Dict:
|
||||
"""执行搜索"""
|
||||
print(f"\n🔍 搜索查询: {query}")
|
||||
print(f"📊 行业: {industry or '全部'}")
|
||||
print(f"🌐 语言: {language or '自动检测'}")
|
||||
print("-" * 50)
|
||||
|
||||
# 执行搜索
|
||||
result = self.search_engine.search(
|
||||
query=query,
|
||||
industry=industry,
|
||||
language=language
|
||||
)
|
||||
|
||||
if not result['success']:
|
||||
print(f"❌ 搜索失败: {result.get('error', '未知错误')}")
|
||||
return result
|
||||
|
||||
# 显示搜索结果
|
||||
self.display_search_results(result)
|
||||
|
||||
# 导出文档
|
||||
if export and result['results']:
|
||||
export_result = self.exporter.export_search_results(result['search_log_id'])
|
||||
if export_result['success']:
|
||||
print(f"\n📄 文档导出成功: {export_result['filename']}")
|
||||
print(f"📁 文件路径: {export_result['file_path']}")
|
||||
else:
|
||||
print(f"❌ 文档导出失败: {export_result.get('error', '未知错误')}")
|
||||
|
||||
return result
|
||||
|
||||
def display_search_results(self, result: Dict):
|
||||
"""显示搜索结果"""
|
||||
print(f"\n✅ 搜索完成!")
|
||||
print(f"📈 找到 {result['total_count']} 条结果")
|
||||
print(f"⏱️ 搜索耗时: {result['search_time']} 秒")
|
||||
print(f"🔗 检索源: {result['sources_searched']['total_sources']} 个")
|
||||
|
||||
if not result['results']:
|
||||
print("\n📭 没有找到相关结果")
|
||||
return
|
||||
|
||||
print(f"\n📰 搜索结果预览 (前5条):")
|
||||
print("=" * 80)
|
||||
|
||||
for i, article in enumerate(result['results'][:5], 1):
|
||||
print(f"\n{i}. {article['title']}")
|
||||
print(f" 🏢 来源: {article['source_name']} ({self.get_authority_text(article['authority_level'])})")
|
||||
print(f" 📅 时间: {self.format_date(article.get('published_date', ''))}")
|
||||
print(f" 🎯 相关性: {article.get('final_score', 0):.2f}")
|
||||
print(f" 🔗 链接: {article['original_url']}")
|
||||
|
||||
summary = article.get('summary', article.get('content', ''))
|
||||
if summary:
|
||||
summary = summary[:100] + '...' if len(summary) > 100 else summary
|
||||
print(f" 📝 摘要: {summary}")
|
||||
|
||||
if len(result['results']) > 5:
|
||||
print(f"\n... 还有 {len(result['results']) - 5} 条结果")
|
||||
|
||||
def get_authority_text(self, level: int) -> str:
|
||||
"""获取权威级别文本"""
|
||||
authority_map = {1: '官方机构', 2: '主流媒体', 3: '专业平台', 4: '其他'}
|
||||
return authority_map.get(level, '其他')
|
||||
|
||||
def format_date(self, date_str: str) -> str:
|
||||
"""格式化日期"""
|
||||
if not date_str:
|
||||
return '未知'
|
||||
try:
|
||||
from datetime import datetime
|
||||
if isinstance(date_str, str):
|
||||
date_obj = datetime.fromisoformat(date_str.replace('Z', ''))
|
||||
else:
|
||||
date_obj = date_str
|
||||
return date_obj.strftime('%Y-%m-%d')
|
||||
except Exception:  # keep the raw value if it cannot be parsed as a date
|
||||
return str(date_str)
|
||||
|
||||
def show_statistics(self):
|
||||
"""显示系统统计"""
|
||||
stats = self.db.get_statistics()
|
||||
|
||||
print("\n📊 系统统计信息")
|
||||
print("=" * 40)
|
||||
print(f"📰 文章总数: {stats['total_articles']}")
|
||||
print(f"🆕 今日新增: {stats['today_articles']}")
|
||||
print(f"🔍 搜索总次数: {stats['total_searches']}")
|
||||
print(f"📡 活跃源数: {stats['active_sources']}")
|
||||
|
||||
print(f"\n📈 各行业文章分布:")
|
||||
for item in stats['articles_by_industry'][:8]:
|
||||
print(f" {item['name_cn']}: {item['count']} 篇")
|
||||
|
||||
def show_search_history(self, limit: int = 10):
|
||||
"""显示搜索历史"""
|
||||
history = self.db.get_search_history(limit)
|
||||
|
||||
print(f"\n📜 最近 {limit} 次搜索记录")
|
||||
print("=" * 60)
|
||||
|
||||
for i, record in enumerate(history, 1):
|
||||
print(f"{i}. {record['keywords']}")
|
||||
print(f" 行业: {record.get('industry_name', '全部')} | "
|
||||
f"结果: {record['results_count']} 条 | "
|
||||
f"时间: {self.format_date(record['search_time'])}")
|
||||
|
||||
def interactive_mode(self):
|
||||
"""交互模式"""
|
||||
print("🚀 欢迎使用智能搜索系统!")
|
||||
print("输入 'help' 查看帮助,输入 'quit' 退出")
|
||||
|
||||
while True:
|
||||
try:
|
||||
command = input("\n>>> ").strip()
|
||||
|
||||
if command.lower() in ['quit', 'exit', 'q']:
|
||||
print("👋 再见!")
|
||||
break
|
||||
elif command.lower() == 'help':
|
||||
self.show_help()
|
||||
elif command.lower() == 'stats':
|
||||
self.show_statistics()
|
||||
elif command.lower() == 'history':
|
||||
self.show_search_history()
|
||||
elif command.lower().startswith('search '):
|
||||
query = command[7:]
|
||||
self.run_search(query, export=True)
|
||||
elif command:
|
||||
# 直接搜索
|
||||
self.run_search(command, export=True)
|
||||
else:
|
||||
print("请输入搜索查询或命令")
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\n👋 再见!")
|
||||
break
|
||||
except Exception as e:
|
||||
print(f"❌ 错误: {e}")
|
||||
|
||||
def show_help(self):
|
||||
"""显示帮助信息"""
|
||||
help_text = """
|
||||
🆘 命令帮助:
|
||||
search <查询词> - 执行搜索
|
||||
stats - 查看统计信息
|
||||
history - 查看搜索历史
|
||||
help - 显示此帮助
|
||||
quit/exit/q - 退出程序
|
||||
|
||||
🔍 搜索示例:
|
||||
search AI breakthrough 2024
|
||||
search 英伟达最新财报
|
||||
search renewable energy policy
|
||||
|
||||
💡 提示:
|
||||
- 英文搜索会自动使用英文信源
|
||||
- 包含中文关键词会自动切换中文搜索
|
||||
- 搜索结果会自动导出为DOCX文档
|
||||
"""
|
||||
print(help_text)
|
||||
|
||||
def create_web_app():
|
||||
"""创建简单的Web界面"""
|
||||
try:
|
||||
from flask import Flask, render_template_string, request, jsonify
|
||||
|
||||
app = Flask(__name__)
|
||||
cli = SearchSystemCLI()
|
||||
|
||||
# 简单的HTML模板
|
||||
HTML_TEMPLATE = """
|
||||
<!DOCTYPE html>
|
||||
<html>
|
||||
<head>
|
||||
<title>智能搜索系统</title>
|
||||
<meta charset="utf-8">
|
||||
<style>
|
||||
body { font-family: Arial, sans-serif; margin: 40px; }
|
||||
.header { text-align: center; margin-bottom: 30px; }
|
||||
.search-box { text-align: center; margin-bottom: 30px; }
|
||||
input[type="text"] { padding: 10px; width: 400px; font-size: 16px; }
|
||||
button { padding: 10px 20px; font-size: 16px; margin-left: 10px; }
|
||||
.results { margin-top: 30px; }
|
||||
.result-item { border: 1px solid #ddd; margin: 10px 0; padding: 15px; }
|
||||
.result-title { font-weight: bold; color: #2c5aa0; }
|
||||
.result-meta { color: #666; font-size: 14px; margin: 5px 0; }
|
||||
.result-summary { margin: 10px 0; }
|
||||
.stats { background: #f5f5f5; padding: 15px; margin: 20px 0; }
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="header">
|
||||
<h1>🔍 智能搜索系统</h1>
|
||||
<p>支持8个行业的权威信息搜索</p>
|
||||
</div>
|
||||
|
||||
<div class="search-box">
|
||||
<form method="POST">
|
||||
<input type="text" name="query" placeholder="输入搜索关键词..." value="{{ query or '' }}">
|
||||
<select name="industry">
|
||||
<option value="">全部行业</option>
|
||||
<option value="finance">金融</option>
|
||||
<option value="ai_software">AI/软件</option>
|
||||
<option value="manufacturing">制造业</option>
|
||||
<option value="healthcare_pharma">医疗制药</option>
|
||||
<option value="fmcg">快消品</option>
|
||||
<option value="ecommerce_retail">零售电商</option>
|
||||
<option value="energy_chemical">能源化工</option>
|
||||
<option value="real_estate">房地产建筑</option>
|
||||
</select>
|
||||
<button type="submit">搜索</button>
|
||||
</form>
|
||||
</div>
|
||||
|
||||
{% if search_result %}
|
||||
<div class="stats">
|
||||
<strong>搜索结果:</strong> {{ search_result.total_count }} 条 |
|
||||
<strong>耗时:</strong> {{ search_result.search_time }} 秒 |
|
||||
<strong>信源:</strong> {{ search_result.sources_searched.total_sources }} 个
|
||||
</div>
|
||||
|
||||
<div class="results">
|
||||
{% for article in search_result.results[:10] %}
|
||||
<div class="result-item">
|
||||
<div class="result-title">{{ loop.index }}. {{ article.title }}</div>
|
||||
<div class="result-meta">
|
||||
📰 {{ article.source_name }} |
|
||||
📅 {{ article.published_date or '未知时间' }} |
|
||||
🎯 相关性: {{ "%.2f"|format(article.final_score or 0) }}
|
||||
</div>
|
||||
<div class="result-summary">{{ (article.summary or article.content or '')[:200] }}...</div>
|
||||
<div><a href="{{ article.original_url }}" target="_blank">🔗 查看原文</a></div>
|
||||
</div>
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
{% if error %}
|
||||
<div style="color: red; text-align: center;">
|
||||
❌ {{ error }}
|
||||
</div>
|
||||
{% endif %}
|
||||
</body>
|
||||
</html>
|
||||
"""
|
||||
|
||||
@app.route('/', methods=['GET', 'POST'])
|
||||
def index():
|
||||
if request.method == 'POST':
|
||||
query = request.form.get('query', '').strip()
|
||||
industry = request.form.get('industry', '') or None
|
||||
|
||||
if query:
|
||||
try:
|
||||
result = cli.search_engine.search(query, industry)
|
||||
if result['success']:
|
||||
return render_template_string(HTML_TEMPLATE,
|
||||
query=query,
|
||||
search_result=result)
|
||||
else:
|
||||
return render_template_string(HTML_TEMPLATE,
|
||||
query=query,
|
||||
error=result.get('error', '搜索失败'))
|
||||
except Exception as e:
|
||||
return render_template_string(HTML_TEMPLATE,
|
||||
query=query,
|
||||
error=str(e))
|
||||
else:
|
||||
return render_template_string(HTML_TEMPLATE,
|
||||
query=query,
|
||||
error='请输入搜索关键词')
|
||||
|
||||
return render_template_string(HTML_TEMPLATE)
|
||||
|
||||
return app
|
||||
|
||||
except ImportError:
|
||||
print("Flask未安装,无法启动Web界面")
|
||||
print("请运行: pip install flask")
|
||||
return None
|
||||
|
||||
def main():
|
||||
"""主函数"""
|
||||
parser = argparse.ArgumentParser(description='智能搜索系统')
|
||||
parser.add_argument('--mode', choices=['cli', 'web', 'monitor'],
|
||||
default='cli', help='运行模式')
|
||||
parser.add_argument('--query', type=str, help='搜索查询')
|
||||
parser.add_argument('--industry', type=str, help='搜索行业')
|
||||
parser.add_argument('--language', type=str, choices=['en', 'cn'], help='搜索语言')
|
||||
parser.add_argument('--export', action='store_true', help='导出结果')
|
||||
parser.add_argument('--port', type=int, default=5000, help='Web端口')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.mode == 'monitor':
|
||||
# RSS监控模式
|
||||
print("🚀 启动RSS监控器...")
|
||||
from rss_monitor import start_rss_monitor
|
||||
start_rss_monitor()
|
||||
|
||||
elif args.mode == 'web':
|
||||
# Web界面模式
|
||||
app = create_web_app()
|
||||
if app:
|
||||
print(f"🌐 启动Web界面: http://localhost:{args.port}")
|
||||
app.run(host='0.0.0.0', port=args.port, debug=False)
|
||||
|
||||
elif args.mode == 'cli':
|
||||
# 命令行模式
|
||||
cli = SearchSystemCLI()
|
||||
|
||||
if args.query:
|
||||
# 直接执行搜索
|
||||
cli.run_search(args.query, args.industry, args.language, args.export)
|
||||
else:
|
||||
# 交互模式
|
||||
cli.interactive_mode()
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
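Both `format_date` in `main.py` and `_format_source_info` in the exporter parse timestamps with `datetime.fromisoformat(value.replace('Z', ''))`, which silently discards the UTC marker. A hedged alternative is sketched below. The helper names `parse_timestamp` and `format_date`, the `'unknown'` fallback, and the assumption that naive values are stored as UTC are illustrative and not confirmed by the commit.

```python
from datetime import datetime, timezone
from typing import Optional, Union


def parse_timestamp(value: Union[str, datetime, None]) -> Optional[datetime]:
    """Parse an ISO 8601 value (optionally ending in 'Z') into an aware UTC datetime."""
    if value is None or value == '':
        return None
    if isinstance(value, datetime):
        parsed = value
    else:
        text = value.strip()
        # fromisoformat() accepts a trailing 'Z' only from Python 3.11 on,
        # so rewrite it as an explicit UTC offset for older interpreters.
        if text.endswith(('Z', 'z')):
            text = text[:-1] + '+00:00'
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return None
    # Assumption: values without an offset were stored as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def format_date(value: Union[str, datetime, None], fallback: str = 'unknown') -> str:
    """Format a timestamp as YYYY-MM-DD, or return the fallback if it cannot be parsed."""
    parsed = parse_timestamp(value)
    return parsed.strftime('%Y-%m-%d') if parsed else fallback


if __name__ == '__main__':
    print(format_date('2024-03-05T23:30:00Z'))  # 2024-03-05
```

Returning `None` for unparseable input, instead of echoing the raw string as the original does, is a design choice of this sketch; callers that prefer the raw value can pass it as `fallback`.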
41
代码实现/requirements.txt
Normal file
@@ -0,0 +1,41 @@
|
||||
# 搜索系统依赖包
|
||||
|
||||
# 核心依赖
|
||||
requests>=2.28.0
|
||||
feedparser>=6.0.10
|
||||
python-docx>=0.8.11
|
||||
|
||||
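# Database
# sqlite3 ships with the Python standard library; it must not be listed here,
# otherwise `pip install -r requirements.txt` tries to fetch it from PyPI.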
|
||||
|
||||
# 可选API依赖
|
||||
newsapi-python>=0.2.6
|
||||
|
||||
# Logging and utilities
# The modules below are part of the Python standard library and need no
# installation. They are kept as comments only: as active requirement lines,
# pip would look for same-named PyPI packages, which either fails or installs
# obsolete backports.
#   pathlib, logging, hashlib, json, datetime, typing, threading,
#   concurrent.futures, collections, html, re, time
|
||||
|
||||
# Web界面(可选)
|
||||
flask>=2.0.0
|
||||
jinja2>=3.0.0
|
||||
|
||||
# 数据处理增强(可选)
|
||||
pandas>=1.5.0
|
||||
numpy>=1.21.0
|
||||
|
||||
# 中文处理(可选)
|
||||
jieba>=0.42.1
|
||||
|
||||
# 更高级的NLP处理(可选)
|
||||
nltk>=3.8
|
||||
scikit-learn>=1.1.0
|
||||
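Since several of the packages above are marked optional, it can help to check at startup which ones are actually importable, in the same spirit as the `ImportError` guard around Flask in `create_web_app`. The following is a minimal sketch: the feature grouping and the names `OPTIONAL_MODULES`, `missing_modules`, and `feature_status` are hypothetical. Note that the import names differ from some pip names (`newsapi-python` is imported as `newsapi`, `scikit-learn` as `sklearn`).

```python
from importlib.util import find_spec
from typing import Dict, Iterable, List

# Import names (not pip names) grouped by the optional feature they enable.
# The grouping itself is an assumption made for illustration.
OPTIONAL_MODULES: Dict[str, List[str]] = {
    'web': ['flask'],
    'newsapi': ['newsapi'],
    'data': ['pandas', 'numpy'],
    'chinese': ['jieba'],
    'nlp': ['nltk', 'sklearn'],
}


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


def feature_status() -> Dict[str, List[str]]:
    """Map each optional feature to the modules it is still missing."""
    return {feature: missing_modules(mods) for feature, mods in OPTIONAL_MODULES.items()}


if __name__ == '__main__':
    for feature, missing in feature_status().items():
        state = 'ok' if not missing else 'missing: ' + ', '.join(missing)
        print(f'{feature}: {state}')
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages such as pandas.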
324
代码实现/rss_monitor.py
Normal file
@@ -0,0 +1,324 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
RSS监控脚本 - 自动获取RSS源更新
|
||||
"""
|
||||
|
||||
import feedparser
|
||||
import requests
|
||||
import time
|
||||
import logging
|
||||
import threading
|
||||
from datetime import datetime, timezone
|
||||
from typing import List, Dict, Optional
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
|
||||
from database import DatabaseManager
|
||||
from config import RSS_MONITOR_CONFIG, SEARCH_CONFIG
|
||||
|
||||
class RSSMonitor:
|
||||
"""RSS监控器"""
|
||||
|
||||
def __init__(self):
|
||||
self.db = DatabaseManager()
|
||||
self.logger = logging.getLogger(__name__)
|
||||
self.is_running = False
|
||||
self.check_interval = RSS_MONITOR_CONFIG['check_interval']
|
||||
self.max_retries = RSS_MONITOR_CONFIG['max_retries']
|
||||
self.timeout = RSS_MONITOR_CONFIG['timeout']
|
||||
self.user_agent = RSS_MONITOR_CONFIG['user_agent']
|
||||
|
||||
def start_monitoring(self):
|
||||
"""开始监控RSS源"""
|
||||
self.is_running = True
|
||||
self.logger.info("RSS监控器启动")
|
||||
|
||||
while self.is_running:
|
||||
try:
|
||||
self._check_all_sources()
|
||||
self.logger.info(f"等待 {self.check_interval} 秒后进行下次检查")
|
||||
time.sleep(self.check_interval)
|
||||
except KeyboardInterrupt:
|
||||
self.logger.info("收到停止信号")
|
||||
break
|
||||
except Exception as e:
|
||||
self.logger.error(f"监控过程出错: {e}")
|
||||
time.sleep(60) # 出错后等待1分钟再继续
|
||||
|
||||
def stop_monitoring(self):
|
||||
"""停止监控"""
|
||||
self.is_running = False
|
||||
self.logger.info("RSS监控器停止")
|
||||
|
||||
def _check_all_sources(self):
|
||||
"""检查所有RSS源"""
|
||||
sources = self.db.get_rss_sources()
|
||||
self.logger.info(f"开始检查 {len(sources)} 个RSS源")
|
||||
|
||||
# 使用线程池并行处理
|
||||
with ThreadPoolExecutor(max_workers=10) as executor:
|
||||
futures = {
|
||||
executor.submit(self._check_single_source, source): source
|
||||
for source in sources
|
||||
}
|
||||
|
||||
success_count = 0
|
||||
error_count = 0
|
||||
|
||||
for future in as_completed(futures):
|
||||
source = futures[future]
|
||||
try:
|
||||
articles_count = future.result()
|
||||
if articles_count is not None:
|
||||
success_count += 1
|
||||
if articles_count > 0:
|
||||
self.logger.info(
|
||||
f"{source['source_name']}: 新增 {articles_count} 篇文章"
|
||||
)
|
||||
else:
|
||||
error_count += 1
|
||||
except Exception as e:
|
||||
error_count += 1
|
||||
self.logger.error(f"检查 {source['source_name']} 时出错: {e}")
|
||||
|
||||
self.logger.info(f"RSS检查完成: 成功 {success_count}, 失败 {error_count}")
|
||||
|
||||
def _check_single_source(self, source: Dict) -> Optional[int]:
|
||||
"""检查单个RSS源"""
|
||||
source_id = source['id']
|
||||
source_name = source['source_name']
|
||||
source_url = source['source_url']
|
||||
|
||||
try:
|
||||
# 获取RSS内容
|
||||
articles = self._fetch_rss_articles(source_url, source)
|
||||
|
||||
if articles is None:
|
||||
return None
|
||||
|
||||
# 保存新文章
|
||||
new_articles_count = 0
|
||||
for article in articles:
|
||||
article['source_id'] = source_id
|
||||
article_id = self.db.save_article(article)
|
||||
if article_id:
|
||||
new_articles_count += 1
|
||||
|
||||
# 更新RSS源检查时间
|
||||
self.db.update_rss_source_check_time(source_id)
|
||||
|
||||
return new_articles_count
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"检查RSS源 {source_name} 失败: {e}")
|
||||
return None
|
||||
|
||||
def _fetch_rss_articles(self, url: str, source: Dict) -> Optional[List[Dict]]:
|
||||
"""获取RSS文章"""
|
||||
headers = {
|
||||
'User-Agent': self.user_agent,
|
||||
'Accept': 'application/rss+xml, application/xml, text/xml'
|
||||
}
|
||||
|
||||
for attempt in range(self.max_retries):
|
||||
try:
|
||||
# 获取RSS内容
|
||||
response = requests.get(url, headers=headers, timeout=self.timeout)
|
||||
response.raise_for_status()
|
||||
|
||||
# 解析RSS
|
||||
feed = feedparser.parse(response.content)
|
||||
|
||||
if feed.bozo and feed.bozo_exception:
|
||||
self.logger.warning(
|
||||
f"RSS解析警告 {source['source_name']}: {feed.bozo_exception}"
|
||||
)
|
||||
|
||||
articles = []
|
||||
for entry in feed.entries:
|
||||
article = self._parse_rss_entry(entry, source)
|
||||
if article:
|
||||
articles.append(article)
|
||||
|
||||
return articles
|
||||
|
||||
except requests.RequestException as e:
|
||||
self.logger.warning(
|
||||
f"第 {attempt + 1} 次尝试获取 {source['source_name']} 失败: {e}"
|
||||
)
|
||||
if attempt < self.max_retries - 1:
|
||||
time.sleep(2 ** attempt) # 指数退避
|
||||
except Exception as e:
|
||||
self.logger.error(f"解析RSS {source['source_name']} 时出错: {e}")
|
||||
break
|
||||
|
||||
return None
|
||||
|
||||
def _parse_rss_entry(self, entry, source: Dict) -> Optional[Dict]:
|
||||
"""解析RSS条目"""
|
||||
try:
|
||||
# 获取发布时间
|
||||
published_date = None
|
||||
if hasattr(entry, 'published_parsed') and entry.published_parsed:
|
||||
published_date = datetime(*entry.published_parsed[:6], tzinfo=timezone.utc)
|
||||
elif hasattr(entry, 'updated_parsed') and entry.updated_parsed:
|
||||
published_date = datetime(*entry.updated_parsed[:6], tzinfo=timezone.utc)
|
||||
|
||||
# 获取内容
|
||||
content = ''
|
||||
if hasattr(entry, 'content') and entry.content:
|
||||
content = entry.content[0].value if isinstance(entry.content, list) else entry.content
|
||||
elif hasattr(entry, 'summary'):
|
||||
content = entry.summary
|
||||
elif hasattr(entry, 'description'):
|
||||
content = entry.description
|
||||
|
||||
# 获取作者
|
||||
author = ''
|
||||
if hasattr(entry, 'author'):
|
||||
author = entry.author
|
||||
elif hasattr(entry, 'dc_creator'):
|
||||
author = entry.dc_creator
|
||||
|
||||
# 提取关键词
|
||||
keywords = self._extract_keywords(getattr(entry, 'title', ''), content)  # entries without a title must not raise here
|
||||
|
||||
article = {
|
||||
'title': entry.title if hasattr(entry, 'title') else '',
|
||||
'content': self._clean_content(content),
|
||||
'summary': entry.summary if hasattr(entry, 'summary') else '',
|
||||
'author': author,
|
||||
'original_url': entry.link if hasattr(entry, 'link') else '',
|
||||
'published_date': published_date,
|
||||
'language': source.get('language', 'en'),
|
||||
'keywords': keywords
|
||||
}
|
||||
|
||||
# 验证必要字段
|
||||
if not article['title'] or not article['original_url']:
|
||||
return None
|
||||
|
||||
return article
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"解析RSS条目时出错: {e}")
|
||||
return None
|
||||
|
||||
def _clean_content(self, content: str) -> str:
|
||||
"""清理HTML内容"""
|
||||
if not content:
|
||||
return ''
|
||||
|
||||
try:
|
||||
import re
|
||||
from html import unescape
|
||||
|
||||
# 移除HTML标签
|
||||
content = re.sub(r'<[^>]+>', '', content)
|
||||
# 解码HTML实体
|
||||
content = unescape(content)
|
||||
# 移除多余空白
|
||||
content = re.sub(r'\s+', ' ', content).strip()
|
||||
|
||||
return content
|
||||
except Exception:  # return the text unchanged if cleanup fails
|
||||
return content
|
||||
|
||||
def _extract_keywords(self, title: str, content: str) -> List[str]:
|
||||
"""提取关键词"""
|
||||
try:
|
||||
text = f"{title} {content}".lower()
|
||||
|
||||
# 简单关键词提取(可以用更高级的NLP库)
|
||||
import re
|
||||
words = re.findall(r'\b[a-zA-Z]{3,}\b', text)
|
||||
|
||||
# 过滤常见停用词
|
||||
stop_words = {
|
||||
'the', 'and', 'for', 'are', 'but', 'not', 'you', 'all', 'can', 'had',
|
||||
'her', 'was', 'one', 'our', 'out', 'day', 'get', 'has', 'him', 'his',
|
||||
'how', 'its', 'may', 'new', 'now', 'old', 'see', 'two', 'who', 'boy',
|
||||
'this', 'that', 'with', 'have', 'will', 'from', 'they', 'been',
|
||||
'said', 'each', 'make', 'most', 'over', 'some', 'time', 'very',
|
||||
'what', 'when', 'here', 'just', 'like', 'long', 'many', 'than',
|
||||
'them', 'well', 'your', 'come', 'could', 'into', 'more', 'much',
|
||||
'only', 'other', 'such', 'take', 'were'
|
||||
}
|
||||
|
||||
keywords = [word for word in words if word not in stop_words]
|
||||
|
||||
# 统计词频并返回前10个
|
||||
from collections import Counter
|
||||
word_counts = Counter(keywords)
|
||||
return [word for word, count in word_counts.most_common(10)]
|
||||
|
||||
except Exception as e:
|
||||
self.logger.error(f"提取关键词时出错: {e}")
|
||||
return []
|
||||
|
||||
def manual_check_source(self, source_id: int) -> Dict:
|
||||
"""手动检查指定RSS源"""
|
||||
sources = self.db.get_rss_sources()
|
||||
source = next((s for s in sources if s['id'] == source_id), None)
|
||||
|
||||
if not source:
|
||||
return {'success': False, 'message': 'RSS源不存在'}
|
||||
|
||||
try:
|
||||
articles_count = self._check_single_source(source)
|
||||
if articles_count is not None:
|
||||
return {
|
||||
'success': True,
|
||||
'message': f'成功检查 {source["source_name"]}',
|
||||
'new_articles': articles_count
|
||||
}
|
||||
else:
|
||||
return {
|
||||
'success': False,
|
||||
'message': f'检查 {source["source_name"]} 失败'
|
||||
}
|
||||
except Exception as e:
|
||||
return {
|
||||
'success': False,
|
||||
'message': f'检查失败: {str(e)}'
|
||||
}
|
||||
|
||||
def get_monitor_status(self) -> Dict:
|
||||
"""获取监控状态"""
|
||||
stats = self.db.get_statistics()
|
||||
|
||||
return {
|
||||
'is_running': self.is_running,
|
||||
'check_interval': self.check_interval,
|
||||
'total_sources': stats.get('active_sources', 0),
|
||||
'total_articles': stats.get('total_articles', 0),
|
||||
'today_articles': stats.get('today_articles', 0)
|
||||
}
|
||||
|
||||
def start_rss_monitor():
|
||||
"""启动RSS监控器的主函数"""
|
||||
import logging.config
|
||||
from config import LOGGING_CONFIG
|
||||
|
||||
# 配置日志
|
||||
logging.basicConfig(
|
||||
level=LOGGING_CONFIG['level'],
|
||||
format=LOGGING_CONFIG['format'],
|
||||
handlers=[
|
||||
logging.FileHandler(LOGGING_CONFIG['file'], encoding='utf-8'),
|
||||
logging.StreamHandler()
|
||||
]
|
||||
)
|
||||
|
||||
monitor = RSSMonitor()
|
||||
|
||||
try:
|
||||
monitor.start_monitoring()
|
||||
except KeyboardInterrupt:
|
||||
print("\n收到停止信号,正在关闭RSS监控器...")
|
||||
finally:
|
||||
monitor.stop_monitoring()
|
||||
monitor.db.close()
|
||||
print("RSS监控器已停止")
|
||||
|
||||
if __name__ == "__main__":
|
||||
start_rss_monitor()
|
||||
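`start_monitoring` waits between checks with `time.sleep(self.check_interval)`, so a call to `stop_monitoring` from another thread only takes effect once the current sleep has finished. One possible alternative, sketched here under stated assumptions, is to wait on a `threading.Event`. The class name `StoppableLoop` and its parameters are hypothetical and do not appear in the commit.

```python
import threading
from typing import Callable


class StoppableLoop:
    """Run a task periodically until stop() is called."""

    def __init__(self, task: Callable[[], None], interval: float,
                 error_delay: float = 60.0) -> None:
        self._task = task
        self._interval = interval        # seconds between successful runs
        self._error_delay = error_delay  # seconds to wait after a failed run
        self._stop_event = threading.Event()
        self.runs = 0                    # number of successful task runs

    def run(self) -> None:
        while not self._stop_event.is_set():
            try:
                self._task()
                self.runs += 1
                delay = self._interval
            except Exception:
                delay = self._error_delay
            # Event.wait() returns True as soon as stop() is called, whereas
            # time.sleep() always blocks for the full interval.
            if self._stop_event.wait(delay):
                break

    def stop(self) -> None:
        """Request termination; safe to call from any thread."""
        self._stop_event.set()
```

In `RSSMonitor`, the task would be `_check_all_sources` and the interval `check_interval`, so a stop request would end the loop within the current wait instead of after it.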
461
代码实现/search_engine.py
Normal file
@@ -0,0 +1,461 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
搜索引擎主类
|
||||
"""
|
||||
|
||||
import requests
|
||||
import logging
|
||||
import time
|
||||
from typing import List, Dict, Optional, Tuple
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
from database import DatabaseManager
|
||||
from config import API_CONFIG, SEARCH_CONFIG
|
||||
|
||||
class SearchEngine:
|
||||
"""智能搜索引擎"""
|
||||
|
||||
def __init__(self):
|
||||
self.db = DatabaseManager()
|
||||
self.logger = logging.getLogger(__name__)
|
||||
self.newsapi_key = API_CONFIG['newsapi']['key']
|
||||
self.twitter_token = API_CONFIG['twitter']['bearer_token']
|
||||
self.alpha_vantage_key = API_CONFIG['alpha_vantage']['key']
|
||||
|
||||
    def search(self, query: str, industry: Optional[str] = None,
               language: Optional[str] = None, user_ip: str = '') -> Dict:
        """Run a search"""
        start_time = time.time()

        # Parse the query parameters
        search_params = self._parse_query(query, industry, language)
        keywords = search_params['keywords']
        industry_id = search_params['industry_id']
        detected_language = search_params['language']

        self.logger.info(f"Starting search: {keywords}, industry: {industry}, language: {detected_language}")

        # Create the search log record
        search_log_id = self.db.create_search_log(
            keywords=' '.join(keywords),
            industry_id=industry_id,
            language=detected_language,
            user_ip=user_ip
        )

        try:
            # Multi-source search
            all_results = []

            # 1. Search the local database
            db_results = self._search_database(keywords, industry_id, detected_language)
            all_results.extend(db_results)
            self.logger.info(f"Database results: {len(db_results)}")

            # 2. NewsAPI search (only when an API key is configured)
            if self.newsapi_key and detected_language == 'en':
                news_results = self._search_newsapi(keywords, industry)
                all_results.extend(news_results)
                self.logger.info(f"NewsAPI results: {len(news_results)}")

            # 3. Financial data API search (finance industry only)
            if industry == 'finance' and self.alpha_vantage_key:
                finance_results = self._search_financial_data(keywords)
                all_results.extend(finance_results)
                self.logger.info(f"Financial data results: {len(finance_results)}")

            # Deduplicate and rank the results
            final_results = self._process_results(all_results, keywords)

            # Save the search results
            if final_results:
                self.db.save_search_results(search_log_id, final_results)

            search_time = time.time() - start_time

            return {
                'success': True,
                'search_log_id': search_log_id,
                'query': query,
                'keywords': keywords,
                'industry': industry,
                'language': detected_language,
                'results': final_results,
                'total_count': len(final_results),
                'search_time': round(search_time, 2),
                'sources_searched': self._get_sources_info(industry_id)
            }

        except Exception as e:
            self.logger.error(f"Search failed: {e}")
            return {
                'success': False,
                'error': str(e),
                'search_log_id': search_log_id,
                'query': query
            }

    def _parse_query(self, query: str, industry: Optional[str] = None,
                     language: Optional[str] = None) -> Dict:
        """Parse the search query"""
        # Extract keywords
        keywords = self._extract_keywords(query)

        # Detect the language unless the caller supplied one
        if not language:
            language = self._detect_language(query)

        # Look up the industry ID by its English name
        industry_id = None
        if industry:
            industries = self.db.get_industries()
            for ind in industries:
                if ind['name_en'] == industry:
                    industry_id = ind['id']
                    break

        return {
            'keywords': keywords,
            'industry_id': industry_id,
            'language': language
        }

    def _extract_keywords(self, query: str) -> List[str]:
        """Extract search keywords"""
        import re

        # Basic keyword extraction
        words = re.findall(r'\b\w+\b', query.lower())

        # Filter out stop words
        stop_words = {
            'the', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
            'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before',
            'after', 'above', 'below', 'down', 'out', 'off', 'over', 'under',
            'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
            'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',
            'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too',
            'very', 'can', 'will', 'just', 'should', 'now', 'what', 'news',
            'latest', 'recent', 'update', 'today', 'yesterday'
        }

        # Words of one or two characters are dropped here; uppercase
        # acronyms such as "AI" are recovered by _extract_phrases below
        keywords = [word for word in words if len(word) > 2 and word not in stop_words]

        # Keep important phrases from the original query
        phrases = self._extract_phrases(query)
        keywords.extend(phrases)

        # Deduplicate while preserving order, so that callers slicing the
        # list (e.g. keywords[:5]) get the same keywords on every run
        return list(dict.fromkeys(keywords))

    def _extract_phrases(self, query: str) -> List[str]:
        """Extract important phrases"""
        import re

        # Phrases inside double quotes (empty quotes are ignored)
        quoted_phrases = re.findall(r'"([^"]+)"', query)

        # Common technical terms and company names
        phrases = []

        # Patterns for technical terms
        tech_patterns = [
            r'\b[A-Z]{2,}\b',   # uppercase acronyms (AI, API, GDP)
            r'\b\w+\.\w+\b',    # domain-like tokens
            r'\b\w+-\w+\b',     # hyphenated terms
        ]

        for pattern in tech_patterns:
            matches = re.findall(pattern, query)
            phrases.extend(matches)

        phrases.extend(quoted_phrases)
        return phrases

    def _detect_language(self, query: str) -> str:
        """Detect the query language"""
        import re

        # Check for China-specific keywords from the configuration
        china_keywords = SEARCH_CONFIG['keywords_for_china']

        for keyword in china_keywords:
            if keyword in query:
                return 'cn'

        # Check for Chinese characters (CJK Unified Ideographs)
        if re.search(r'[\u4e00-\u9fff]', query):
            return 'cn'

        return SEARCH_CONFIG['default_language']

    def _search_database(self, keywords: List[str], industry_id: Optional[int],
                         language: str) -> List[Dict]:
        """Search the local database"""
        return self.db.search_articles(
            keywords=keywords,
            industry_id=industry_id,
            # No language filter is applied to 'cn' queries
            language=language if language != 'cn' else None,
            limit=SEARCH_CONFIG['max_results_per_source']
        )

    def _search_newsapi(self, keywords: List[str], industry: Optional[str] = None) -> List[Dict]:
        """Search with NewsAPI"""
        import hashlib

        if not self.newsapi_key:
            return []

        try:
            url = f"{API_CONFIG['newsapi']['base_url']}everything"

            # Build the query string
            query_str = ' AND '.join(keywords[:5])  # cap the number of keywords

            params = {
                'q': query_str,
                'apiKey': self.newsapi_key,
                'language': 'en',
                'sortBy': 'relevancy',
                'pageSize': 20,
                # Second precision; isoformat() would also emit microseconds
                'from': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%dT%H:%M:%S')
            }

            # Restrict to industry-related domains when available
            if industry:
                domains = self._get_industry_domains(industry)
                if domains:
                    params['domains'] = ','.join(domains)

            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()

            data = response.json()
            articles = []

            for article in data.get('articles', []):
                # hash() is salted per process for str, so the ID is built
                # from a stable digest of the URL instead
                url_digest = hashlib.sha256(article['url'].encode('utf-8')).hexdigest()[:16]
                processed_article = {
                    'id': f"newsapi_{url_digest}",
                    'title': article['title'],
                    # These fields can be null in the response
                    'content': article.get('description') or '',
                    'summary': article.get('description') or '',
                    'author': article.get('author') or '',
                    'original_url': article['url'],
                    'published_date': self._parse_date(article.get('publishedAt')),
                    'source_name': article['source']['name'],
                    'authority_level': 2,    # default: mainstream-media level
                    'language': 'en',
                    'relevance_score': 0.8   # NewsAPI results are assumed to be fairly relevant
                }
                articles.append(processed_article)

            self.logger.info(f"NewsAPI returned {len(articles)} results")
            return articles

        except Exception as e:
            self.logger.error(f"NewsAPI search failed: {e}")
            return []

    def _search_financial_data(self, keywords: List[str]) -> List[Dict]:
        """Search financial data"""
        if not self.alpha_vantage_key:
            return []

        try:
            # Check whether the keywords contain stock symbols
            stock_symbols = self._extract_stock_symbols(keywords)
            if not stock_symbols:
                return []

            articles = []
            for symbol in stock_symbols[:3]:  # cap the number of API calls
                data = self._get_stock_news(symbol)
                if data:
                    articles.extend(data)

            return articles

        except Exception as e:
            self.logger.error(f"Financial data search failed: {e}")
            return []

    def _extract_stock_symbols(self, keywords: List[str]) -> List[str]:
        """Extract stock symbols"""
        import re
        symbols = []

        for keyword in keywords:
            # Accept only tokens that were typed in uppercase. Uppercasing
            # the keyword first would turn every short word ("stock",
            # "price") into a candidate symbol.
            if re.fullmatch(r'[A-Z]{1,5}', keyword):
                symbols.append(keyword)

        # Symbol mapping for a few well-known companies
        company_symbols = {
            'apple': 'AAPL', 'microsoft': 'MSFT', 'google': 'GOOGL',
            'amazon': 'AMZN', 'tesla': 'TSLA', 'meta': 'META',
            'nvidia': 'NVDA', 'intel': 'INTC', 'amd': 'AMD'
        }

        for keyword in keywords:
            if keyword.lower() in company_symbols:
                symbols.append(company_symbols[keyword.lower()])

        # Deduplicate while preserving order
        return list(dict.fromkeys(symbols))

    def _get_stock_news(self, symbol: str) -> List[Dict]:
        """Fetch news for a stock symbol"""
        import hashlib

        try:
            url = API_CONFIG['alpha_vantage']['base_url']
            params = {
                'function': 'NEWS_SENTIMENT',
                'tickers': symbol,
                'apikey': self.alpha_vantage_key,
                'limit': 10
            }

            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()

            data = response.json()
            articles = []

            for item in data.get('feed', []):
                # hash() is salted per process for str, so the ID is built
                # from a stable digest of the URL instead
                url_digest = hashlib.sha256(item['url'].encode('utf-8')).hexdigest()[:16]
                article = {
                    'id': f"alphavantage_{url_digest}",
                    'title': item['title'],
                    'content': item.get('summary', ''),
                    'summary': item.get('summary', ''),
                    'author': ','.join(item.get('authors', [])),
                    'original_url': item['url'],
                    'published_date': self._parse_date(item.get('time_published')),
                    'source_name': item.get('source', 'Alpha Vantage'),
                    'authority_level': 2,
                    'language': 'en',
                    # NOTE: the overall sentiment score stands in for relevance
                    # here; it can be negative for bearish articles
                    'relevance_score': float(item.get('overall_sentiment_score', 0.5))
                }
                articles.append(article)

            return articles

        except Exception as e:
            self.logger.error(f"Failed to fetch stock news for {symbol}: {e}")
            return []

    def _parse_date(self, date_str: Optional[str]) -> Optional[datetime]:
        """Parse a date string"""
        if not date_str:
            return None

        # Try several date formats in turn
        formats = [
            '%Y-%m-%dT%H:%M:%SZ',
            '%Y-%m-%dT%H:%M:%S',
            '%Y%m%dT%H%M%S',
            '%Y-%m-%d %H:%M:%S',
            '%Y-%m-%d'
        ]

        for fmt in formats:
            try:
                return datetime.strptime(date_str, fmt)
            except (ValueError, TypeError):
                continue

        # No format matched
        return None

    def _process_results(self, results: List[Dict], keywords: List[str]) -> List[Dict]:
        """Process and rank the search results"""
        if not results:
            return []

        # Deduplicate by URL (results without a URL are dropped)
        seen_urls = set()
        unique_results = []

        for result in results:
            url = result.get('original_url', '')
            if url and url not in seen_urls:
                seen_urls.add(url)
                unique_results.append(result)

        # Compute the final relevance score
        for result in unique_results:
            score = result.get('relevance_score', 0)

            # Adjust by authority level (1 is the most authoritative)
            authority_bonus = (4 - result.get('authority_level', 4)) * 0.2
            score += authority_bonus

            # Adjust by publication time (newer is better)
            pub_date = result.get('published_date')
            if isinstance(pub_date, str):
                # Some sources may carry the date as text
                pub_date = self._parse_date(pub_date)
            if pub_date:
                days_old = (datetime.now() - pub_date).days
                # Linear decay over 30 days, clamped so that future dates
                # earn no extra bonus
                time_factor = min(1, max(0, 1 - days_old / 30))
                score += time_factor * 0.1

            result['final_score'] = score

        # Filter out low-relevance results
        min_score = SEARCH_CONFIG['min_relevance_score']
        filtered_results = [r for r in unique_results if r.get('final_score', 0) >= min_score]

        # Sort by score, highest first
        filtered_results.sort(key=lambda x: x.get('final_score', 0), reverse=True)

        # Cap the number of results
        max_results = SEARCH_CONFIG['max_results_per_source'] * 2
        return filtered_results[:max_results]

    def _get_industry_domains(self, industry: str) -> List[str]:
        """Return the domains associated with an industry"""
        domain_map = {
            'finance': [
                'bloomberg.com', 'reuters.com', 'ft.com', 'wsj.com',
                'cnbc.com', 'marketwatch.com', 'forbes.com'
            ],
            'ai_software': [
                'techcrunch.com', 'venturebeat.com', 'theverge.com',
                'arstechnica.com', 'wired.com', 'technologyreview.com'
            ],
            'healthcare_pharma': [
                'statnews.com', 'fiercepharma.com', 'biopharmadive.com',
                'nature.com', 'nejm.org'
            ]
        }

        # Industries without an entry get no domain restriction
        return domain_map.get(industry, [])

    def _get_sources_info(self, industry_id: Optional[int]) -> Dict:
        """Return information about the searched sources"""
        sources = self.db.get_rss_sources(industry_id)

        return {
            'total_sources': len(sources),
            # Source counts per authority level
            'by_authority': {
                '1': len([s for s in sources if s['authority_level'] == 1]),
                '2': len([s for s in sources if s['authority_level'] == 2]),
                '3': len([s for s in sources if s['authority_level'] == 3])
            }
        }

    def get_search_suggestions(self, partial_query: str, limit: int = 10) -> List[str]:
        """Return search suggestions"""
        try:
            # Suggestions are based on the search history
            history = self.db.get_search_history(limit=100)
            suggestions = []

            partial_lower = partial_query.lower()

            for record in history:
                # Guard against missing or null keywords
                keywords = record.get('keywords') or ''
                if keywords and partial_lower in keywords.lower() and keywords not in suggestions:
                    suggestions.append(keywords)
                    if len(suggestions) >= limit:
                        break

            return suggestions

        except Exception as e:
            self.logger.error(f"Failed to get search suggestions: {e}")
            return []
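
The ranking in `_process_results` adds an authority bonus and a recency bonus to each result's base relevance. The standalone sketch below reproduces that arithmetic so it can be checked by hand. The function name `final_score`, the explicit `now` parameter and the clamp of the recency factor to the range 0 to 1 are assumptions made for illustration, not part of the original class.

```python
from datetime import datetime, timedelta
from typing import Optional


def final_score(relevance: float, authority_level: int,
                published: Optional[datetime], now: datetime) -> float:
    """Combine base relevance, an authority bonus and a recency bonus."""
    # Authority levels 1, 2, 3 earn bonuses of 0.6, 0.4, 0.2; level 4 earns none.
    score = relevance + (4 - authority_level) * 0.2
    if published is not None:
        days_old = (now - published).days
        # Linear decay over 30 days, clamped to [0, 1]; worth at most 0.1.
        score += min(1.0, max(0.0, 1 - days_old / 30)) * 0.1
    return score


now = datetime(2024, 6, 30)
fresh = final_score(0.8, 2, now - timedelta(days=3), now)
stale = final_score(0.8, 2, now - timedelta(days=60), now)
print(round(fresh, 2), round(stale, 2))
```

With these weights the authority level dominates: one authority step (0.2) outweighs the entire recency bonus (at most 0.1).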
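
Article IDs such as `newsapi_…` and `alphavantage_…` are derived from the article URL. For an ID to match across runs the digest has to be deterministic, and Python's built-in `hash()` is salted per process for strings by default. A minimal sketch of a deterministic variant follows; the helper name `stable_article_id` and the 16-character truncation are assumptions for illustration.

```python
import hashlib


def stable_article_id(prefix: str, url: str) -> str:
    """Build an ID that is identical across runs for the same URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # 16 hex characters (64 bits) keep the ID short.
    return f"{prefix}_{digest[:16]}"


article_id = stable_article_id("newsapi", "https://example.com/story")
print(article_id)
```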
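
The keyword extraction in `_extract_keywords` lowercases the query, drops short words and stop words, then adds back uppercase acronyms taken from the original query. The sketch below shows that interplay on a small input. It is a simplified standalone version: the stop-word list is abbreviated, only the acronym pattern is included, and the order-preserving deduplication with `dict.fromkeys` is an assumption rather than the class's confirmed behavior.

```python
import re
from typing import List

# Abbreviated, illustrative stop-word list
STOP_WORDS = {"the", "and", "for", "with", "latest", "news"}


def extract_keywords(query: str) -> List[str]:
    words = re.findall(r"\b\w+\b", query.lower())
    keywords = [w for w in words if len(w) > 2 and w not in STOP_WORDS]
    # The length filter drops "ai", so acronyms are taken from the
    # original, case-preserved query instead.
    keywords.extend(re.findall(r"\b[A-Z]{2,}\b", query))
    # dict.fromkeys removes duplicates and keeps first-seen order.
    return list(dict.fromkeys(keywords))


print(extract_keywords("Latest AI news and the AI regulation debate"))
```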