init repo

This commit is contained in:
2026-04-25 19:21:03 +08:00
commit bab2d40577
33 changed files with 5291 additions and 0 deletions

274
代码实现/README.md Normal file

@@ -0,0 +1,274 @@
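# Intelligent Search System - Simple, Practical Edition
An intelligent search system built on RSS feeds and APIs. It retrieves information from authoritative sources across 8 industries and generates documents automatically.
## 🌟 Core Features
- **English-first search**: Searches in English by default and switches to Chinese automatically when the query contains Chinese keywords
- **8 industries covered**: Finance, AI/software, manufacturing, healthcare and pharma, FMCG, retail and e-commerce, energy and chemicals, real estate and construction
- **Authoritative sources**: 200+ RSS feeds ranked by authority level (official institutions > mainstream media > professional platforms)
- **Multiple interfaces**: Command line, web interface, RSS monitor
- **Automatic export**: Search results are exported as DOCX reports
- **Real-time monitoring**: RSS feeds are polled automatically to build a local article database
## 🚀 Quick Start
### 1. Install dependencies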
```bash
cd 搜索/代码实现
pip install -r requirements.txt
```
**Required dependencies:**
```bash
pip install requests feedparser python-docx
```
**Optional dependencies (extra features):**
```bash
pip install flask newsapi-python pandas
```
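### 2. Configure API keys (optional)
Set environment variables (`config.py` reads them with `os.getenv`) or edit `config.py`:
```bash
# NewsAPI (optional - improves English search)
export NEWSAPI_KEY="your_newsapi_key"
# Twitter API (optional - social media search)
export TWITTER_BEARER_TOKEN="your_twitter_token"
# Alpha Vantage (optional - financial data)
export ALPHA_VANTAGE_KEY="your_alphavantage_key"
```
### 3. Start the system
#### Option 1: Interactive command line (recommended for beginners)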
```bash
python main.py
```
#### Option 2: Web interface
```bash
python main.py --mode web --port 5000
```
Open http://localhost:5000
#### Option 3: Direct search
```bash
python main.py --query "AI breakthrough 2024" --export
```
#### Option 4: Start the RSS monitor
```bash
python main.py --mode monitor
```
## 📖 Usage Guide
### Command-line search examples
```bash
# Basic search
>>> AI ethics regulation
# Industry search
>>> search renewable energy policy
# Chinese search (detected automatically)
>>> 英伟达最新财报
# Show statistics
>>> stats
# Show history
>>> history
# Help
>>> help
```
### Automatic search-language detection
- **English search**: `AI breakthrough`, `Tesla earnings`, `oil prices`
- **Chinese search**: `中国AI政策`, `英伟达财报`, `新能源汽车`
- **Forced Chinese**: queries containing any of the keywords `中国`, `国内`, `A股`, `人民币`, `央行`, `国务院`
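The detection rules above can be sketched roughly as follows. This is a minimal illustration, not the code in `search_engine.py`: the function name `detect_language` and the CJK-range check are assumptions, and the keyword list is copied from `SEARCH_CONFIG['keywords_for_china']`.

```python
import re

# Assumption: mirrors SEARCH_CONFIG['keywords_for_china'] in config.py.
CHINA_KEYWORDS = ['中国', '国内', 'A股', '人民币', '央行', '国务院']

# CJK Unified Ideographs, the range that covers common Chinese characters.
_CJK_PATTERN = re.compile(r'[\u4e00-\u9fff]')


def detect_language(query: str, default: str = 'en') -> str:
    """Return 'cn' for Chinese queries, otherwise the default language.

    Hypothetical helper: the real detection logic may differ.
    """
    # Rule 1: any forced-Chinese keyword switches the search to Chinese.
    if any(keyword in query for keyword in CHINA_KEYWORDS):
        return 'cn'
    # Rule 2: any Chinese character at all also switches to Chinese.
    if _CJK_PATTERN.search(query):
        return 'cn'
    return default
```

For example, `detect_language('Tesla earnings')` falls through to the English default, while `detect_language('英伟达财报')` matches the character-range rule.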
### Supported industries
| Industry code | Name | Main sources |
|---------------|------|--------------|
| `finance` | Finance (金融行业) | Fed, SEC, Bloomberg, Reuters |
| `ai_software` | AI and Software (AI与软件) | arXiv, Google AI, OpenAI, TechCrunch |
| `manufacturing` | Manufacturing (制造业) | ISO, IEEE, Industry Week |
| `healthcare_pharma` | Healthcare and Pharma (医疗制药) | FDA, NIH, STAT News |
| `fmcg` | FMCG (快消品) | Nielsen, Euromonitor |
| `ecommerce_retail` | Retail and E-commerce (零售电商) | Shopify, eMarketer |
| `energy_chemical` | Energy and Chemicals (能源化工) | IEA, Energy.gov |
| `real_estate` | Real Estate and Construction (房地产建筑) | HUD, Construction Dive |
## 📁 File Structure
```
搜索/代码实现/
├── main.py                 # Main entry point
├── config.py               # Configuration
├── database.py             # Database operations
├── search_engine.py        # Search engine
├── rss_monitor.py          # RSS monitor
├── document_exporter.py    # Document exporter
├── database_schema.sql     # Database schema
├── requirements.txt        # Dependencies
├── data/                   # Data directory
│   ├── search_system.db    # SQLite database
│   └── search_system.log   # System log
└── 新闻/                   # Exported documents
    └── *.docx              # Generated reports
```
## 🔧 Advanced Configuration
### Custom RSS sources
Edit `RSS_SOURCES` in `config.py`:
```python
RSS_SOURCES = {
    'finance': [
        {
            'name': 'Your Custom Source',
            'url': 'https://example.com/rss.xml',
            'authority_level': 2,  # 1 = official, 2 = mainstream, 3 = professional
            'language': 'en'
        }
    ]
}
```
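Before adding a custom entry, it can help to check that it has the shape the rest of the system expects. The sketch below is only an illustration: `validate_source` is a hypothetical helper that does not exist in the repository, and the accepted values (authority levels 1 to 4, languages `en` and `cn`) are taken from the comments in `database_schema.sql`.

```python
from urllib.parse import urlparse

# Keys used by every entry in RSS_SOURCES.
REQUIRED_KEYS = ('name', 'url', 'authority_level', 'language')


def validate_source(source: dict) -> list:
    """Return a list of problems found in one RSS source entry (empty if none).

    Hypothetical helper for illustration only.
    """
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in source]
    if problems:
        return problems
    parsed = urlparse(source['url'])
    if parsed.scheme not in ('http', 'https') or not parsed.netloc:
        problems.append(f"invalid url: {source['url']}")
    # The schema comments define authority levels 1 through 4.
    if source['authority_level'] not in (1, 2, 3, 4):
        problems.append(f"authority_level out of range: {source['authority_level']}")
    if source['language'] not in ('en', 'cn'):
        problems.append(f"unsupported language: {source['language']}")
    return problems
```

Running it over every entry of `RSS_SOURCES` at startup would surface typos in the configuration before the monitor tries to fetch anything.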
### Tuning search parameters
Edit `SEARCH_CONFIG` in `config.py`:
```python
SEARCH_CONFIG = {
    'max_results_per_source': 50,  # Maximum results per source
    'min_relevance_score': 0.3,  # Minimum relevance score
    'keywords_for_china': ['中国', '国内']  # Keywords that trigger Chinese search
}
```
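To see what `min_relevance_score` is compared against, here is a small sketch of the scoring used by `DatabaseManager._calculate_relevance` in `database.py`: each keyword hit in the title counts 2.0, each hit in the content counts 0.5, the authority level adds `(4 - level) * 0.1`, and the total is capped at 10. The `filter_by_relevance` helper is an assumption; the part of the system that applies the threshold is not shown here.

```python
def relevance_score(title, content, keywords, authority_level=4):
    """Score one article the way database.py does (substring counts, case-insensitive)."""
    if not keywords:
        return 1.0
    title = (title or '').lower()
    content = (content or '').lower()
    score = 0.0
    for keyword in keywords:
        keyword = keyword.lower()
        # Title matches weigh four times as much as content matches.
        score += title.count(keyword) * 2.0 + content.count(keyword) * 0.5
    # More authoritative sources (lower level number) get a small bonus.
    score += (4 - authority_level) * 0.1
    return min(score, 10.0)


def filter_by_relevance(articles, keywords, min_score=0.3):
    """Hypothetical filter: keep articles scoring at least min_score."""
    kept = []
    for article in articles:
        score = relevance_score(article.get('title'), article.get('content'),
                                keywords, article.get('authority_level', 4))
        if score >= min_score:
            kept.append(dict(article, relevance_score=score))
    return kept
```

Two properties of this formula are worth knowing when tuning the threshold. Matching is by substring, so a short keyword such as `ai` also matches inside longer words. And a level-1 source receives a bonus of about 0.3 on its own, which already reaches the default threshold even with no keyword hits.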
### RSS monitoring frequency
Adjust `RSS_MONITOR_CONFIG`:
```python
RSS_MONITOR_CONFIG = {
    'check_interval': 3600,  # Check interval in seconds (3600 = 1 hour)
    'max_retries': 3,  # Maximum number of retries
    'timeout': 30  # Request timeout in seconds
}
```
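How `max_retries` and `timeout` might interact can be sketched as a simple retry loop. This is an assumption about the behavior, not the code in `rss_monitor.py`: `fetch_with_retries`, the injected `fetch` callable, and the linear backoff are all hypothetical, and `max_retries` is treated here as the total number of attempts.

```python
import time
from typing import Callable, Optional

# Values copied from RSS_MONITOR_CONFIG in config.py.
RSS_MONITOR_CONFIG = {'check_interval': 3600, 'max_retries': 3, 'timeout': 30}


def fetch_with_retries(fetch: Callable[[str, int], str], url: str,
                       max_retries: int = RSS_MONITOR_CONFIG['max_retries'],
                       timeout: int = RSS_MONITOR_CONFIG['timeout'],
                       backoff_seconds: float = 1.0) -> Optional[str]:
    """Call fetch(url, timeout) up to max_retries times.

    Returns the fetched text, or None if every attempt fails, so one
    broken feed never aborts the whole monitoring pass.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url, timeout)
        except Exception as exc:
            print(f"attempt {attempt}/{max_retries} failed for {url}: {exc}")
            if attempt < max_retries:
                # Wait a little longer after each failure.
                time.sleep(backoff_seconds * attempt)
    return None
```

In practice `fetch` could be a thin wrapper such as `lambda url, timeout: requests.get(url, timeout=timeout).text`; passing it in as a parameter keeps the retry logic testable without network access.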
## 🎯 Use Cases
### Scenario 1: Industry research
```bash
python main.py --query "renewable energy investment 2024" --industry energy_chemical --export
```
### Scenario 2: Competitive intelligence
```bash
python main.py --query "Tesla quarterly results" --industry ai_software --export
```
### Scenario 3: Policy tracking
```bash
python main.py --query "FDA drug approval" --industry healthcare_pharma --export
```
### Scenario 4: Technology trends
```bash
python main.py --query "quantum computing breakthrough" --industry ai_software --export
```
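## 📊 Exported Document Format
The generated DOCX document contains:
1. **Title page**: Search keywords, industry, date
2. **Search information**: Parameters, result statistics
3. **Article list**:
   - Title and source information
   - Authority level label
   - Publication time and relevance score
   - Article summary
   - URL of the original article
File naming rules:
- English: `YYYYMMDD_industry_keywords.docx`
- Chinese: `YYYYMMDD_industry_keywords_CN.docx`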
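The naming rule can be reproduced in a few lines. The sketch below follows what `_generate_filename` and `_sanitize_filename` in `document_exporter.py` do (spaces become underscores, keywords are cut to 20 characters, unsafe characters are replaced), but `build_export_filename` itself is a hypothetical standalone helper, and falling back to `general` when no industry is given is an assumption.

```python
import re
from datetime import date
from typing import Optional


def build_export_filename(keywords: str, industry: Optional[str] = None,
                          language: str = 'en',
                          today: Optional[date] = None) -> str:
    """Build a report filename of the form YYYYMMDD_industry_keywords[_CN].docx."""
    today = today or date.today()
    # Spaces become underscores; keywords are limited to 20 characters.
    slug = keywords.strip().replace(' ', '_')[:20]
    suffix = '_CN' if language == 'cn' else ''
    name = f"{today:%Y%m%d}_{industry or 'general'}_{slug}{suffix}.docx"
    # Replace characters that are unsafe in file names on common platforms.
    return re.sub(r'[<>:"/\\|?*]', '_', name)
```

For instance, the query `AI breakthrough 2024` in the `ai_software` industry on 2026-04-25 yields `20260425_ai_software_AI_breakthrough_2024.docx`.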
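## 🔍 Troubleshooting
### FAQ
**Q: What if an RSS source is unreachable?**
A: The system retries automatically and degrades gracefully; a single failed source does not affect the overall search.
**Q: Too few search results?**
A:
1. Check whether the keywords are too specific
2. Try a global search without specifying an industry
3. Make sure the RSS monitor has been running long enough to accumulate data
**Q: How can I improve search quality?**
A:
1. Configure paid APIs such as NewsAPI
2. Add more RSS sources
3. Tune the relevance scoring algorithm
### Viewing logs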
```bash
# Tail the system log
tail -f data/search_system.log
# Check the RSS monitor status
python -c "from rss_monitor import RSSMonitor; print(RSSMonitor().get_monitor_status())"
```
### Database maintenance
```bash
# Show statistics
python -c "from database import DatabaseManager; print(DatabaseManager().get_statistics())"
# Manually check an RSS source
python -c "from rss_monitor import RSSMonitor; print(RSSMonitor().manual_check_source(1))"
```
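The SQLite file can also be inspected directly with Python's built-in `sqlite3` module. The sketch below counts articles per source using the `rss_sources` and `articles` tables from `database_schema.sql`. The helper names are hypothetical, and `demo_connection` builds a reduced in-memory copy of the two tables purely so the example runs on its own.

```python
import sqlite3


def articles_per_source(conn: sqlite3.Connection) -> list:
    """Return (source_name, article_count) pairs, busiest sources first."""
    rows = conn.execute("""
        SELECT rs.source_name, COUNT(a.id) AS article_count
        FROM rss_sources rs
        LEFT JOIN articles a ON a.source_id = rs.id
        GROUP BY rs.id, rs.source_name
        ORDER BY article_count DESC, rs.source_name
    """).fetchall()
    return [(name, count) for name, count in rows]


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database (assumed subset of the real schema)."""
    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE rss_sources (id INTEGER PRIMARY KEY, source_name TEXT NOT NULL);
        CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                               source_id INTEGER NOT NULL);
        INSERT INTO rss_sources (id, source_name) VALUES (1, 'SEC'), (2, 'TechCrunch');
        INSERT INTO articles (title, source_id) VALUES ('a', 1), ('b', 1);
    """)
    return conn
```

Against the real database, pass `sqlite3.connect('data/search_system.db')` instead of the demo connection. Sources that report zero articles after the monitor has run for a while are good candidates for a manual check.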
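## 🚀 Performance Optimization
### Recommended configuration
- **CPU**: 2 or more cores (parallel RSS processing)
- **Memory**: 4 GB or more (caching large numbers of articles)
- **Storage**: 10 GB or more (database and documents)
- **Network**: Stable internet connection (RSS and API access)
### Scaling suggestions
1. **Database**: SQLite → MySQL/PostgreSQL (large data volumes)
2. **Search**: Basic matching → Elasticsearch (full-text search)
3. **NLP**: Simple keywords → BERT/GPT (semantic search)
4. **Cache**: None → Redis (fast responses)
## 📞 Technical Support
- **Document issues**: Check RSS source status and network connectivity
- **Search issues**: Check the log file to locate errors
- **Performance issues**: Adjust the monitoring frequency and result limits
The system is designed to be lightweight and fault tolerant; the failure of a single component does not affect overall functionality.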

216
代码实现/config.py Normal file

@@ -0,0 +1,216 @@
# -*- coding: utf-8 -*-
"""
Search system configuration
"""
import os
from pathlib import Path
# Base paths
BASE_DIR = Path(__file__).parent
DATA_DIR = BASE_DIR / "data"
EXPORT_DIR = BASE_DIR.parent / "新闻"
# Make sure the directories exist
DATA_DIR.mkdir(parents=True, exist_ok=True)
EXPORT_DIR.mkdir(parents=True, exist_ok=True)
# Database configuration
DATABASE_CONFIG = {
    'type': 'sqlite',  # Only 'sqlite' is implemented in database.py; 'mysql'/'postgresql' are placeholders
    'sqlite': {
        'path': DATA_DIR / "search_system.db"
    },
    'mysql': {
        'host': 'localhost',
        'port': 3306,
        'user': 'root',
        'password': '',
        'database': 'search_system'
    }
}
# API configuration
API_CONFIG = {
    'newsapi': {
        'key': os.getenv('NEWSAPI_KEY', ''),
        'base_url': 'https://newsapi.org/v2/',
        'rate_limit': 1000  # Requests per day
    },
    'twitter': {
        'bearer_token': os.getenv('TWITTER_BEARER_TOKEN', ''),
        'base_url': 'https://api.twitter.com/2/',
        'rate_limit': 300  # Requests per 15 minutes
    },
    'alpha_vantage': {
        'key': os.getenv('ALPHA_VANTAGE_KEY', ''),
        'base_url': 'https://www.alphavantage.co/query',
        'rate_limit': 5  # Requests per minute
    }
}
# RSS source configuration
RSS_SOURCES = {
'finance': [
{
'name': 'Federal Reserve',
'url': 'https://www.federalreserve.gov/feeds/press_all.xml',
'authority_level': 1,
'language': 'en'
},
{
'name': 'SEC',
'url': 'https://www.sec.gov/rss/news/press-release.xml',
'authority_level': 1,
'language': 'en'
},
{
'name': 'Bloomberg Markets',
'url': 'https://feeds.bloomberg.com/markets/news.rss',
'authority_level': 2,
'language': 'en'
},
{
'name': 'Reuters Finance',
'url': 'https://feeds.reuters.com/reuters/businessNews',
'authority_level': 2,
'language': 'en'
},
{
'name': 'Financial Times',
'url': 'https://www.ft.com/rss/home',
'authority_level': 2,
'language': 'en'
},
{
'name': 'Wall Street Journal',
'url': 'https://feeds.a.dj.com/rss/RSSMarketsMain.xml',
'authority_level': 2,
'language': 'en'
}
],
'ai_software': [
{
'name': 'arXiv Computer Science',
'url': 'http://rss.arxiv.org/rss/cs',
'authority_level': 1,
'language': 'en'
},
{
'name': 'Google AI Blog',
'url': 'https://ai.googleblog.com/feeds/posts/default',
'authority_level': 1,
'language': 'en'
},
{
'name': 'OpenAI Blog',
'url': 'https://openai.com/blog/rss.xml',
'authority_level': 1,
'language': 'en'
},
{
'name': 'MIT Technology Review',
'url': 'https://www.technologyreview.com/feed/',
'authority_level': 2,
'language': 'en'
},
{
'name': 'TechCrunch',
'url': 'https://techcrunch.com/feed/',
'authority_level': 2,
'language': 'en'
},
{
'name': 'The Verge',
'url': 'https://www.theverge.com/rss/index.xml',
'authority_level': 2,
'language': 'en'
}
],
'manufacturing': [
{
'name': 'ISO News',
'url': 'https://www.iso.org/rss/news.xml',
'authority_level': 1,
'language': 'en'
},
{
'name': 'IEEE Spectrum',
'url': 'https://spectrum.ieee.org/rss/fulltext',
'authority_level': 1,
'language': 'en'
},
{
'name': 'Industry Week',
'url': 'https://www.industryweek.com/rss.xml',
'authority_level': 2,
'language': 'en'
},
{
'name': 'Manufacturing.net',
'url': 'https://www.manufacturing.net/rss.xml',
'authority_level': 3,
'language': 'en'
}
],
'healthcare_pharma': [
{
'name': 'FDA News',
'url': 'https://www.fda.gov/about-fda/contact-fda/stay-informed/rss-feeds',
'authority_level': 1,
'language': 'en'
},
{
'name': 'NIH News',
'url': 'https://www.nih.gov/news-events/rss',
'authority_level': 1,
'language': 'en'
},
{
'name': 'WHO News',
'url': 'https://www.who.int/rss-feeds',
'authority_level': 1,
'language': 'en'
},
{
'name': 'STAT News',
'url': 'https://www.statnews.com/feed/',
'authority_level': 2,
'language': 'en'
}
]
}
# Search configuration
SEARCH_CONFIG = {
    'max_results_per_source': 50,
    'search_timeout': 30,
    'min_relevance_score': 0.3,
    'default_language': 'en',
    'keywords_for_china': ['中国', '国内', 'A股', '人民币', '央行', '国务院']
}
# Document export configuration
EXPORT_CONFIG = {
    'default_format': 'docx',
    'template_path': BASE_DIR / 'templates',
    'max_articles_per_doc': 20,
    'include_source_links': True
}
# Logging configuration
LOGGING_CONFIG = {
    'level': 'INFO',
    'format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    'file': DATA_DIR / 'search_system.log',
    'max_size': 10 * 1024 * 1024,  # 10 MB
    'backup_count': 5
}
# RSS monitor configuration
RSS_MONITOR_CONFIG = {
    'check_interval': 3600,  # Check once per hour
    'max_retries': 3,
    'timeout': 30,
    'user_agent': 'SearchSystem/1.0 (RSS Monitor)'
}

353
代码实现/database.py Normal file

@@ -0,0 +1,353 @@
# -*- coding: utf-8 -*-
"""
Database access layer
"""
import sqlite3
import hashlib
import json
import logging
from datetime import datetime, timedelta
from typing import List, Dict, Optional, Tuple
from pathlib import Path
from config import DATABASE_CONFIG, RSS_SOURCES
class DatabaseManager:
    """Database manager"""
    def __init__(self):
        self.db_type = DATABASE_CONFIG['type']
        if self.db_type != 'sqlite':
            # MySQL/PostgreSQL support is planned but not implemented yet;
            # fail early instead of hitting a missing db_path later
            raise NotImplementedError(f"Unsupported database type: {self.db_type}")
        self.db_path = DATABASE_CONFIG['sqlite']['path']
        self.conn = None
        self.logger = logging.getLogger(__name__)
        self._init_database()
    def _get_connection(self):
        """Return the shared database connection, opening it on first use"""
        if not self.conn:
            self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
            self.conn.row_factory = sqlite3.Row
        return self.conn
    def _init_database(self):
        """Create the tables and seed data the first time the database file is used"""
        if not Path(self.db_path).exists():
            self._create_tables()
            self._insert_initial_data()
def _create_tables(self):
"""创建数据库表"""
conn = self._get_connection()
cursor = conn.cursor()
# 读取SQL文件并执行
sql_file = Path(__file__).parent / 'database_schema.sql'
if sql_file.exists():
with open(sql_file, 'r', encoding='utf-8') as f:
sql_script = f.read()
cursor.executescript(sql_script)
conn.commit()
self.logger.info("数据库表创建完成")
def _insert_initial_data(self):
"""插入初始RSS源数据"""
conn = self._get_connection()
cursor = conn.cursor()
# 获取行业ID映射
cursor.execute("SELECT id, name_en FROM industries")
industry_map = {row['name_en']: row['id'] for row in cursor.fetchall()}
# 插入RSS源
for industry, sources in RSS_SOURCES.items():
if industry in industry_map:
industry_id = industry_map[industry]
for source in sources:
cursor.execute("""
INSERT OR IGNORE INTO rss_sources
(industry_id, source_name, source_url, source_type, authority_level, language)
VALUES (?, ?, ?, 'rss', ?, ?)
""", (industry_id, source['name'], source['url'],
source['authority_level'], source['language']))
conn.commit()
self.logger.info("初始RSS源数据插入完成")
def get_industries(self) -> List[Dict]:
"""获取所有行业"""
conn = self._get_connection()
cursor = conn.cursor()
cursor.execute("SELECT * FROM industries ORDER BY name_en")
return [dict(row) for row in cursor.fetchall()]
def get_rss_sources(self, industry_id: Optional[int] = None,
active_only: bool = True) -> List[Dict]:
"""获取RSS源"""
conn = self._get_connection()
cursor = conn.cursor()
query = "SELECT * FROM rss_sources WHERE 1=1"
params = []
if industry_id:
query += " AND industry_id = ?"
params.append(industry_id)
if active_only:
query += " AND is_active = 1"
query += " ORDER BY authority_level, source_name"
cursor.execute(query, params)
return [dict(row) for row in cursor.fetchall()]
def save_article(self, article_data: Dict) -> Optional[int]:
"""保存文章"""
conn = self._get_connection()
cursor = conn.cursor()
# 生成文章hash防重复
content_hash = hashlib.sha256(
f"{article_data['title']}{article_data['original_url']}".encode()
).hexdigest()
# 检查是否已存在
cursor.execute("SELECT id FROM articles WHERE article_hash = ?", (content_hash,))
if cursor.fetchone():
return None # 文章已存在
try:
cursor.execute("""
INSERT INTO articles
(title, content, summary, author, source_id, original_url,
published_date, language, keywords, article_hash)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
article_data['title'],
article_data.get('content', ''),
article_data.get('summary', ''),
article_data.get('author', ''),
article_data['source_id'],
article_data['original_url'],
article_data.get('published_date'),
article_data.get('language', 'en'),
json.dumps(article_data.get('keywords', [])),
content_hash
))
article_id = cursor.lastrowid
conn.commit()
self.logger.debug(f"保存文章: {article_data['title']}")
return article_id
except Exception as e:
self.logger.error(f"保存文章失败: {e}")
conn.rollback()
return None
def create_search_log(self, keywords: str, industry_id: Optional[int] = None,
language: str = 'en', user_ip: str = '') -> int:
"""创建搜索记录"""
conn = self._get_connection()
cursor = conn.cursor()
cursor.execute("""
INSERT INTO search_logs (keywords, industry_id, language, user_ip)
VALUES (?, ?, ?, ?)
""", (keywords, industry_id, language, user_ip))
search_log_id = cursor.lastrowid
conn.commit()
return search_log_id
def save_search_results(self, search_log_id: int, articles: List[Dict]):
"""保存搜索结果"""
conn = self._get_connection()
cursor = conn.cursor()
for rank, article in enumerate(articles, 1):
cursor.execute("""
INSERT INTO search_results
(search_log_id, article_id, relevance_score, rank_position)
VALUES (?, ?, ?, ?)
""", (search_log_id, article['id'], article.get('relevance_score', 0), rank))
# 更新搜索记录的结果数量
cursor.execute("""
UPDATE search_logs SET results_count = ? WHERE id = ?
""", (len(articles), search_log_id))
conn.commit()
    def search_articles(self, keywords: List[str], industry_id: Optional[int] = None,
                        language: Optional[str] = None, limit: int = 50,
                        days_back: int = 30) -> List[Dict]:
        """Search stored articles by keyword, with optional industry, language, and date filters"""
        conn = self._get_connection()
        cursor = conn.cursor()
        # Build the search query
        query = """
            SELECT a.*, rs.source_name, rs.authority_level, i.name_cn as industry_name
            FROM articles a
            JOIN rss_sources rs ON a.source_id = rs.id
            JOIN industries i ON rs.industry_id = i.id
            WHERE 1=1
        """
        params = []
        # Date-range filter
        if days_back > 0:
            date_threshold = datetime.now() - timedelta(days=days_back)
            query += " AND a.published_date >= ?"
            # Bind an explicit string; sqlite3's implicit datetime adapter is deprecated since Python 3.12
            params.append(date_threshold.strftime('%Y-%m-%d %H:%M:%S'))
        # Industry filter
        if industry_id:
            query += " AND rs.industry_id = ?"
            params.append(industry_id)
        # Language filter
        if language:
            query += " AND a.language = ?"
            params.append(language)
        # Keyword search: an article matches if any keyword appears in the title or content
        if keywords:
            keyword_conditions = []
            for keyword in keywords:
                keyword_conditions.append("(a.title LIKE ? OR a.content LIKE ?)")
                params.extend([f"%{keyword}%", f"%{keyword}%"])
            query += f" AND ({' OR '.join(keyword_conditions)})"
        # Ordering and limit
        query += " ORDER BY rs.authority_level ASC, a.published_date DESC LIMIT ?"
        params.append(limit)
        cursor.execute(query, params)
        results = [dict(row) for row in cursor.fetchall()]
        # Compute relevance scores
        for result in results:
            result['relevance_score'] = self._calculate_relevance(result, keywords)
        # Sort by authority level first, then by relevance within each level
        results.sort(key=lambda x: (x['authority_level'], -x['relevance_score']))
        return results
    def _calculate_relevance(self, article: Dict, keywords: List[str]) -> float:
        """Compute the relevance score of an article"""
        if not keywords:
            return 1.0
        # Columns may be NULL in the database, so guard against None before lowercasing
        title = (article.get('title') or '').lower()
        content = (article.get('content') or '').lower()
        score = 0.0
        for keyword in keywords:
            keyword = keyword.lower()
            # Title matches carry more weight than content matches
            title_matches = title.count(keyword)
            content_matches = content.count(keyword)
            score += title_matches * 2.0 + content_matches * 0.5
        # Adjust the score by source authority level
        authority_bonus = (4 - (article.get('authority_level') or 4)) * 0.1
        score += authority_bonus
        return min(score, 10.0)  # Cap the maximum score
def get_search_history(self, limit: int = 20) -> List[Dict]:
"""获取搜索历史"""
conn = self._get_connection()
cursor = conn.cursor()
cursor.execute("""
SELECT sl.*, i.name_cn as industry_name
FROM search_logs sl
LEFT JOIN industries i ON sl.industry_id = i.id
ORDER BY sl.search_time DESC
LIMIT ?
""", (limit,))
return [dict(row) for row in cursor.fetchall()]
def save_exported_doc(self, search_log_id: int, filename: str,
file_path: str, articles_count: int) -> int:
"""保存导出文档记录"""
conn = self._get_connection()
cursor = conn.cursor()
cursor.execute("""
INSERT INTO exported_docs
(search_log_id, filename, file_path, articles_count)
VALUES (?, ?, ?, ?)
""", (search_log_id, filename, file_path, articles_count))
doc_id = cursor.lastrowid
conn.commit()
return doc_id
def update_rss_source_check_time(self, source_id: int):
"""更新RSS源检查时间"""
conn = self._get_connection()
cursor = conn.cursor()
cursor.execute("""
UPDATE rss_sources SET last_checked = CURRENT_TIMESTAMP WHERE id = ?
""", (source_id,))
conn.commit()
def get_statistics(self) -> Dict:
"""获取系统统计信息"""
conn = self._get_connection()
cursor = conn.cursor()
stats = {}
# 文章总数
cursor.execute("SELECT COUNT(*) as count FROM articles")
stats['total_articles'] = cursor.fetchone()['count']
# 今日新增文章
cursor.execute("""
SELECT COUNT(*) as count FROM articles
WHERE DATE(scraped_date) = DATE('now')
""")
stats['today_articles'] = cursor.fetchone()['count']
# 搜索总次数
cursor.execute("SELECT COUNT(*) as count FROM search_logs")
stats['total_searches'] = cursor.fetchone()['count']
# 活跃RSS源数量
cursor.execute("SELECT COUNT(*) as count FROM rss_sources WHERE is_active = 1")
stats['active_sources'] = cursor.fetchone()['count']
# 按行业统计文章数
cursor.execute("""
SELECT i.name_cn, COUNT(a.id) as count
FROM industries i
LEFT JOIN rss_sources rs ON i.id = rs.industry_id
LEFT JOIN articles a ON rs.id = a.source_id
GROUP BY i.id, i.name_cn
ORDER BY count DESC
""")
stats['articles_by_industry'] = [dict(row) for row in cursor.fetchall()]
return stats
def close(self):
"""关闭数据库连接"""
if self.conn:
self.conn.close()
self.conn = None


@@ -0,0 +1,101 @@
-- Search system database schema
-- Written for SQLite (AUTOINCREMENT is SQLite syntax); MySQL/PostgreSQL require dialect changes
-- 1. 行业分类表
CREATE TABLE industries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name_en VARCHAR(50) NOT NULL UNIQUE,
name_cn VARCHAR(50) NOT NULL,
description TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- 2. 信息源配置表
CREATE TABLE rss_sources (
id INTEGER PRIMARY KEY AUTOINCREMENT,
industry_id INTEGER NOT NULL,
source_name VARCHAR(100) NOT NULL,
source_url VARCHAR(500) NOT NULL,
source_type VARCHAR(20) NOT NULL, -- 'rss', 'api', 'manual'
authority_level INTEGER DEFAULT 3, -- 1=官方机构, 2=主流媒体, 3=专业平台, 4=其他
language VARCHAR(2) DEFAULT 'en', -- 'en', 'cn'
is_active BOOLEAN DEFAULT TRUE,
last_checked TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (industry_id) REFERENCES industries(id)
);
-- 3. 搜索记录表
CREATE TABLE search_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
keywords TEXT NOT NULL,
industry_id INTEGER,
language VARCHAR(2) DEFAULT 'en',
results_count INTEGER DEFAULT 0,
search_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
user_ip VARCHAR(45),
FOREIGN KEY (industry_id) REFERENCES industries(id)
);
-- 4. 文章内容表
CREATE TABLE articles (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
content TEXT,
summary TEXT,
author VARCHAR(200),
source_id INTEGER NOT NULL,
original_url VARCHAR(1000) NOT NULL,
published_date TIMESTAMP,
scraped_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
language VARCHAR(2) DEFAULT 'en',
keywords TEXT, -- JSON格式存储关键词
article_hash VARCHAR(64) UNIQUE, -- 防重复
is_archived BOOLEAN DEFAULT FALSE,
FOREIGN KEY (source_id) REFERENCES rss_sources(id)
);
-- 5. 搜索结果表
CREATE TABLE search_results (
id INTEGER PRIMARY KEY AUTOINCREMENT,
search_log_id INTEGER NOT NULL,
article_id INTEGER NOT NULL,
relevance_score FLOAT DEFAULT 0.0,
rank_position INTEGER,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (search_log_id) REFERENCES search_logs(id),
FOREIGN KEY (article_id) REFERENCES articles(id)
);
-- 6. 导出文档记录表
CREATE TABLE exported_docs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
search_log_id INTEGER NOT NULL,
filename VARCHAR(255) NOT NULL,
file_path VARCHAR(500) NOT NULL,
doc_type VARCHAR(20) DEFAULT 'docx', -- 'docx', 'pdf', 'txt'
articles_count INTEGER DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (search_log_id) REFERENCES search_logs(id)
);
-- 插入基础数据
INSERT INTO industries (name_en, name_cn, description) VALUES
('finance', '金融行业', '银行、证券、保险、投资等金融服务'),
('ai_software', 'AI与软件', '人工智能、软件开发、技术创新'),
('manufacturing', '制造业', '工业制造、自动化、生产技术'),
('healthcare_pharma', '医疗制药', '医疗健康、制药、生物技术'),
('fmcg', '快消品', '快速消费品、零售、品牌营销'),
('ecommerce_retail', '零售电商', '电子商务、零售业、数字营销'),
('energy_chemical', '能源化工', '能源、化工、石油、新能源'),
('real_estate', '房地产建筑', '房地产、建筑、基础设施');
-- 创建索引优化查询性能
CREATE INDEX idx_articles_published_date ON articles(published_date);
CREATE INDEX idx_articles_source_id ON articles(source_id);
CREATE INDEX idx_articles_language ON articles(language);
CREATE INDEX idx_articles_hash ON articles(article_hash);
CREATE INDEX idx_search_logs_keywords ON search_logs(keywords);
CREATE INDEX idx_search_logs_time ON search_logs(search_time);
CREATE INDEX idx_rss_sources_industry ON rss_sources(industry_id);
CREATE INDEX idx_rss_sources_active ON rss_sources(is_active);


@@ -0,0 +1,370 @@
# -*- coding: utf-8 -*-
"""
Document exporter: exports search results in DOCX format
"""
import logging
from datetime import datetime
from typing import List, Dict, Optional
from pathlib import Path
try:
from docx import Document
from docx.shared import Inches
from docx.enum.style import WD_STYLE_TYPE
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.shared import OxmlElement, qn
except ImportError:
print("需要安装 python-docx: pip install python-docx")
raise
from database import DatabaseManager
from config import EXPORT_CONFIG, EXPORT_DIR
class DocumentExporter:
"""文档导出器"""
def __init__(self):
self.db = DatabaseManager()
self.logger = logging.getLogger(__name__)
self.export_dir = EXPORT_DIR
self.export_dir.mkdir(exist_ok=True)
def export_search_results(self, search_log_id: int,
custom_filename: str = None) -> Dict:
"""导出搜索结果为DOCX文档"""
try:
# 获取搜索记录和结果
search_log = self._get_search_log(search_log_id)
if not search_log:
return {'success': False, 'error': '搜索记录不存在'}
results = self._get_search_results(search_log_id)
if not results:
return {'success': False, 'error': '没有搜索结果可导出'}
# 生成文件名
filename = self._generate_filename(search_log, custom_filename)
file_path = self.export_dir / filename
# 创建文档
doc = self._create_document(search_log, results)
# 保存文档
doc.save(file_path)
# 记录导出信息
doc_id = self.db.save_exported_doc(
search_log_id, filename, str(file_path), len(results)
)
self.logger.info(f"文档导出成功: {filename}")
return {
'success': True,
'filename': filename,
'file_path': str(file_path),
'articles_count': len(results),
'doc_id': doc_id
}
except Exception as e:
self.logger.error(f"文档导出失败: {e}")
return {'success': False, 'error': str(e)}
def _get_search_log(self, search_log_id: int) -> Optional[Dict]:
"""获取搜索记录"""
try:
conn = self.db._get_connection()
cursor = conn.cursor()
cursor.execute("""
SELECT sl.*, i.name_cn as industry_name, i.name_en as industry_en
FROM search_logs sl
LEFT JOIN industries i ON sl.industry_id = i.id
WHERE sl.id = ?
""", (search_log_id,))
result = cursor.fetchone()
return dict(result) if result else None
except Exception as e:
self.logger.error(f"获取搜索记录失败: {e}")
return None
def _get_search_results(self, search_log_id: int) -> List[Dict]:
"""获取搜索结果"""
try:
conn = self.db._get_connection()
cursor = conn.cursor()
cursor.execute("""
SELECT a.*, rs.source_name, rs.authority_level, sr.relevance_score, sr.rank_position
FROM search_results sr
JOIN articles a ON sr.article_id = a.id
JOIN rss_sources rs ON a.source_id = rs.id
WHERE sr.search_log_id = ?
ORDER BY sr.rank_position ASC
""", (search_log_id,))
return [dict(row) for row in cursor.fetchall()]
except Exception as e:
self.logger.error(f"获取搜索结果失败: {e}")
return []
    def _generate_filename(self, search_log: Dict, custom_filename: Optional[str] = None) -> str:
        """Build the export filename"""
        if custom_filename:
            if not custom_filename.endswith('.docx'):
                custom_filename += '.docx'
            # Sanitize user-supplied names too, so they cannot contain path separators
            return self._sanitize_filename(custom_filename)
        # Auto-generate the filename
        date_str = datetime.now().strftime('%Y%m%d')
        keywords = (search_log.get('keywords') or '').replace(' ', '_')[:20]
        # industry_en is present but None when the search had no industry filter (LEFT JOIN)
        industry = search_log.get('industry_en') or 'general'
        language = search_log.get('language') or 'en'
        # Pick the filename pattern by language
        if language == 'cn':
            filename = f"{date_str}_{industry}_{keywords}_CN.docx"
        else:
            filename = f"{date_str}_{industry}_{keywords}.docx"
        # Make sure the filename is safe
        filename = self._sanitize_filename(filename)
        return filename
def _sanitize_filename(self, filename: str) -> str:
"""清理文件名"""
import re
# 移除不安全字符
filename = re.sub(r'[<>:"/\\|?*]', '_', filename)
# 限制长度
if len(filename) > 100:
name, ext = filename.rsplit('.', 1)
filename = name[:90] + '.' + ext
return filename
def _create_document(self, search_log: Dict, results: List[Dict]) -> Document:
"""创建DOCX文档"""
doc = Document()
# 设置文档样式
self._setup_document_styles(doc)
# 添加标题
self._add_title(doc, search_log)
# 添加搜索信息
self._add_search_info(doc, search_log)
# 添加搜索结果
self._add_search_results(doc, results)
# 添加页脚
self._add_footer(doc)
return doc
def _setup_document_styles(self, doc: Document):
"""设置文档样式"""
try:
# 标题样式
title_style = doc.styles.add_style('CustomTitle', WD_STYLE_TYPE.PARAGRAPH)
title_font = title_style.font
title_font.size = Inches(0.2)
title_font.bold = True
title_style.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
# 文章标题样式
article_title_style = doc.styles.add_style('ArticleTitle', WD_STYLE_TYPE.PARAGRAPH)
article_title_font = article_title_style.font
article_title_font.size = Inches(0.15)
article_title_font.bold = True
# 来源信息样式
source_style = doc.styles.add_style('SourceInfo', WD_STYLE_TYPE.PARAGRAPH)
source_font = source_style.font
source_font.size = Inches(0.1)
source_font.italic = True
except Exception as e:
# 如果样式已存在,忽略错误
pass
    def _add_title(self, doc: Document, search_log: Dict):
        """Add the document title"""
        keywords = search_log.get('keywords', '')
        # industry_name / industry_en are None (not missing) when no industry was selected
        if search_log.get('language') == 'cn':
            industry_name = search_log.get('industry_name') or '通用'
            date_str = datetime.now().strftime('%Y年%m月%d日')
            title = f"{industry_name}行业搜索报告\n关键词: {keywords}\n{date_str}"
        else:
            industry_en = search_log.get('industry_en') or 'General'
            date_str = datetime.now().strftime('%Y-%m-%d')
            title = f"{industry_en} Industry Search Report\nKeywords: {keywords}\n{date_str}"
        try:
            title_para = doc.add_paragraph(title, style='CustomTitle')
        except Exception:
            # Fall back to the default style if CustomTitle was not registered
            title_para = doc.add_paragraph(title)
            title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
        doc.add_paragraph()  # Blank line
def _add_search_info(self, doc: Document, search_log: Dict):
"""添加搜索信息"""
search_time = search_log.get('search_time', '')
if search_time:
search_time = datetime.fromisoformat(search_time.replace('Z', '')).strftime('%Y-%m-%d %H:%M:%S')
        info_lines = [
            f"搜索时间: {search_time}",
            f"关键词: {search_log.get('keywords', '')}",
            # industry_name is None (not missing) when no industry was selected
            f"搜索行业: {search_log.get('industry_name') or '全部'}",
            f"搜索语言: {'中文' if search_log.get('language') == 'cn' else '英文'}",
            f"结果数量: {search_log.get('results_count', 0)}"
        ]
info_para = doc.add_paragraph()
for line in info_lines:
info_para.add_run(line + '\n')
doc.add_paragraph() # 空行
doc.add_paragraph("="*50) # 分隔线
doc.add_paragraph()
    def _add_search_results(self, doc: Document, results: List[Dict]):
        """Add the search results"""
        from docx.shared import RGBColor  # Local import, only needed for the link color
        for i, result in enumerate(results, 1):
            # Article title
            title = result.get('title') or '无标题'
            try:
                title_para = doc.add_paragraph(f"{i}. {title}", style='ArticleTitle')
            except Exception:
                title_para = doc.add_paragraph(f"{i}. {title}")
                title_para.runs[0].bold = True
            # Source information
            source_info = self._format_source_info(result)
            try:
                source_para = doc.add_paragraph(source_info, style='SourceInfo')
            except Exception:
                source_para = doc.add_paragraph(source_info)
                source_para.runs[0].italic = True
            # Article summary; fall back to the content when the summary is empty or NULL
            summary = result.get('summary') or result.get('content') or ''
            if summary:
                # Limit the summary length
                if len(summary) > 300:
                    summary = summary[:300] + '...'
                doc.add_paragraph(summary)
            # Link to the original article
            url = result.get('original_url', '')
            if url and EXPORT_CONFIG.get('include_source_links', True):
                link_para = doc.add_paragraph(f"原文链接: {url}")
                link_para.runs[0].font.color.rgb = RGBColor(0x00, 0x00, 0xFF)  # Blue link text
            doc.add_paragraph()  # Blank line between articles
            # Start a new page after every 5 articles
            if i % 5 == 0 and i < len(results):
                doc.add_page_break()
def _format_source_info(self, result: Dict) -> str:
"""格式化来源信息"""
source_name = result.get('source_name', '未知来源')
author = result.get('author', '')
published_date = result.get('published_date', '')
authority_level = result.get('authority_level', 3)
relevance_score = result.get('relevance_score', 0)
# 权威级别文本
authority_map = {1: '官方机构', 2: '主流媒体', 3: '专业平台', 4: '其他'}
authority_text = authority_map.get(authority_level, '其他')
# 格式化日期
if published_date:
try:
if isinstance(published_date, str):
pub_date = datetime.fromisoformat(published_date.replace('Z', ''))
else:
pub_date = published_date
date_str = pub_date.strftime('%Y-%m-%d')
except:
date_str = str(published_date)
else:
date_str = '未知日期'
info_parts = [
f"来源: {source_name} ({authority_text})",
f"发布时间: {date_str}",
f"相关性: {relevance_score:.2f}"
]
if author:
info_parts.insert(1, f"作者: {author}")
return " | ".join(info_parts)
def _add_footer(self, doc: Document):
"""添加页脚"""
doc.add_paragraph()
doc.add_paragraph("="*50)
footer_text = f"本报告由智能搜索系统生成 | 生成时间: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
footer_para = doc.add_paragraph(footer_text)
footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
def get_export_history(self, limit: int = 20) -> List[Dict]:
"""获取导出历史"""
try:
conn = self.db._get_connection()
cursor = conn.cursor()
cursor.execute("""
SELECT ed.*, sl.keywords, sl.search_time
FROM exported_docs ed
JOIN search_logs sl ON ed.search_log_id = sl.id
ORDER BY ed.created_at DESC
LIMIT ?
""", (limit,))
return [dict(row) for row in cursor.fetchall()]
except Exception as e:
self.logger.error(f"获取导出历史失败: {e}")
return []
def delete_exported_file(self, doc_id: int) -> Dict:
"""删除导出的文件"""
try:
conn = self.db._get_connection()
cursor = conn.cursor()
# 获取文件信息
cursor.execute("SELECT file_path FROM exported_docs WHERE id = ?", (doc_id,))
result = cursor.fetchone()
if not result:
return {'success': False, 'error': '文档记录不存在'}
file_path = Path(result['file_path'])
# 删除文件
if file_path.exists():
file_path.unlink()
# 删除数据库记录
cursor.execute("DELETE FROM exported_docs WHERE id = ?", (doc_id,))
conn.commit()
return {'success': True, 'message': '文件删除成功'}
except Exception as e:
self.logger.error(f"删除文件失败: {e}")
return {'success': False, 'error': str(e)}

367
代码实现/main.py Normal file

@@ -0,0 +1,367 @@
# -*- coding: utf-8 -*-
"""
Search system main program
Provides a command-line interface and a simple web interface
"""
import os
import sys
import logging
import argparse
from typing import Dict, List
from pathlib import Path
# 添加当前目录到Python路径
sys.path.append(str(Path(__file__).parent))
from config import LOGGING_CONFIG
from database import DatabaseManager
from search_engine import SearchEngine
from document_exporter import DocumentExporter
from rss_monitor import RSSMonitor
class SearchSystemCLI:
"""搜索系统命令行界面"""
def __init__(self):
self.setup_logging()
self.db = DatabaseManager()
self.search_engine = SearchEngine()
self.exporter = DocumentExporter()
self.rss_monitor = RSSMonitor()
self.logger = logging.getLogger(__name__)
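    def setup_logging(self):
        """Configure logging"""
        from logging.handlers import RotatingFileHandler
        logging.basicConfig(
            level=LOGGING_CONFIG['level'],
            format=LOGGING_CONFIG['format'],
            handlers=[
                # Rotate the log file so max_size and backup_count from config.py take effect
                RotatingFileHandler(
                    LOGGING_CONFIG['file'],
                    maxBytes=LOGGING_CONFIG['max_size'],
                    backupCount=LOGGING_CONFIG['backup_count'],
                    encoding='utf-8'
                ),
                logging.StreamHandler()
            ]
        )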
def run_search(self, query: str, industry: str = None,
language: str = None, export: bool = False) -> Dict:
"""执行搜索"""
print(f"\n🔍 搜索查询: {query}")
print(f"📊 行业: {industry or '全部'}")
print(f"🌐 语言: {language or '自动检测'}")
print("-" * 50)
# 执行搜索
result = self.search_engine.search(
query=query,
industry=industry,
language=language
)
if not result['success']:
print(f"❌ 搜索失败: {result.get('error', '未知错误')}")
return result
# 显示搜索结果
self.display_search_results(result)
# 导出文档
if export and result['results']:
export_result = self.exporter.export_search_results(result['search_log_id'])
if export_result['success']:
print(f"\n📄 文档导出成功: {export_result['filename']}")
print(f"📁 文件路径: {export_result['file_path']}")
else:
print(f"❌ 文档导出失败: {export_result.get('error', '未知错误')}")
return result
def display_search_results(self, result: Dict):
"""显示搜索结果"""
print(f"\n✅ 搜索完成!")
print(f"📈 找到 {result['total_count']} 条结果")
print(f"⏱️ 搜索耗时: {result['search_time']}")
print(f"🔗 检索源: {result['sources_searched']['total_sources']}")
if not result['results']:
print("\n📭 没有找到相关结果")
return
print(f"\n📰 搜索结果预览 (前5条):")
print("=" * 80)
for i, article in enumerate(result['results'][:5], 1):
print(f"\n{i}. {article['title']}")
print(f" 🏢 来源: {article['source_name']} ({self.get_authority_text(article['authority_level'])})")
print(f" 📅 时间: {self.format_date(article.get('published_date', ''))}")
print(f" 🎯 相关性: {article.get('final_score', 0):.2f}")
print(f" 🔗 链接: {article['original_url']}")
summary = article.get('summary', article.get('content', ''))
if summary:
summary = summary[:100] + '...' if len(summary) > 100 else summary
print(f" 📝 摘要: {summary}")
if len(result['results']) > 5:
print(f"\n... 还有 {len(result['results']) - 5} 条结果")
def get_authority_text(self, level: int) -> str:
"""获取权威级别文本"""
authority_map = {1: '官方机构', 2: '主流媒体', 3: '专业平台', 4: '其他'}
return authority_map.get(level, '其他')
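    def format_date(self, date_str: str) -> str:
        """Format an ISO date string (or a datetime) as YYYY-MM-DD."""
        if not date_str:
            return '未知'
        try:
            from datetime import datetime
            if isinstance(date_str, str):
                # fromisoformat() rejects a trailing 'Z' before Python 3.11,
                # so rewrite it as an explicit UTC offset
                date_obj = datetime.fromisoformat(date_str.replace('Z', '+00:00'))
            else:
                date_obj = date_str
            return date_obj.strftime('%Y-%m-%d')
        except (ValueError, TypeError, AttributeError):
            # Unparseable input: fall back to the raw value
            return str(date_str)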
def show_statistics(self):
"""显示系统统计"""
stats = self.db.get_statistics()
print("\n📊 系统统计信息")
print("=" * 40)
print(f"📰 文章总数: {stats['total_articles']}")
print(f"🆕 今日新增: {stats['today_articles']}")
print(f"🔍 搜索总次数: {stats['total_searches']}")
print(f"📡 活跃源数: {stats['active_sources']}")
print(f"\n📈 各行业文章分布:")
for item in stats['articles_by_industry'][:8]:
print(f" {item['name_cn']}: {item['count']}")
def show_search_history(self, limit: int = 10):
"""显示搜索历史"""
history = self.db.get_search_history(limit)
print(f"\n📜 最近 {limit} 次搜索记录")
print("=" * 60)
for i, record in enumerate(history, 1):
print(f"{i}. {record['keywords']}")
print(f" 行业: {record.get('industry_name', '全部')} | "
f"结果: {record['results_count']} 条 | "
f"时间: {self.format_date(record['search_time'])}")
def interactive_mode(self):
"""交互模式"""
print("🚀 欢迎使用智能搜索系统!")
print("输入 'help' 查看帮助,输入 'quit' 退出")
while True:
try:
command = input("\n>>> ").strip()
if command.lower() in ['quit', 'exit', 'q']:
print("👋 再见!")
break
elif command.lower() == 'help':
self.show_help()
elif command.lower() == 'stats':
self.show_statistics()
elif command.lower() == 'history':
self.show_search_history()
elif command.startswith('search '):
query = command[7:]
self.run_search(query, export=True)
elif command:
# 直接搜索
self.run_search(command, export=True)
else:
print("请输入搜索查询或命令")
except KeyboardInterrupt:
print("\n👋 再见!")
break
except Exception as e:
print(f"❌ 错误: {e}")
def show_help(self):
"""显示帮助信息"""
help_text = """
🆘 命令帮助:
search <查询词> - 执行搜索
stats - 查看统计信息
history - 查看搜索历史
help - 显示此帮助
quit/exit/q - 退出程序
🔍 搜索示例:
search AI breakthrough 2024
search 英伟达最新财报
search renewable energy policy
💡 提示:
- 英文搜索会自动使用英文信源
- 包含中文关键词会自动切换中文搜索
- 搜索结果会自动导出为DOCX文档
"""
print(help_text)
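The date handling in `format_date` can be illustrated on its own. The following is a minimal sketch, not part of the repository: the helper name `format_iso_date` and the `"unknown"` placeholder are assumptions made for the example. It rewrites a trailing `Z` as `+00:00` so that `datetime.fromisoformat` accepts the value on Python versions before 3.11, and it falls back to the raw string when parsing fails.

```python
from datetime import datetime
from typing import Union


def format_iso_date(value: Union[str, datetime, None]) -> str:
    """Return YYYY-MM-DD for an ISO 8601 string or a datetime (hypothetical helper)."""
    if not value:
        return "unknown"
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d")
    try:
        # Rewrite the "Z" suffix as an explicit UTC offset before parsing.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        # Not an ISO date: hand the original text back unchanged.
        return str(value)
    return parsed.strftime("%Y-%m-%d")
```

Keeping the fallback means a malformed feed date still shows up in the result list instead of raising inside the display loop.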
def create_web_app():
"""创建简单的Web界面"""
try:
from flask import Flask, render_template_string, request, jsonify
app = Flask(__name__)
cli = SearchSystemCLI()
# 简单的HTML模板
HTML_TEMPLATE = """
<!DOCTYPE html>
<html>
<head>
<title>智能搜索系统</title>
<meta charset="utf-8">
<style>
body { font-family: Arial, sans-serif; margin: 40px; }
.header { text-align: center; margin-bottom: 30px; }
.search-box { text-align: center; margin-bottom: 30px; }
input[type="text"] { padding: 10px; width: 400px; font-size: 16px; }
button { padding: 10px 20px; font-size: 16px; margin-left: 10px; }
.results { margin-top: 30px; }
.result-item { border: 1px solid #ddd; margin: 10px 0; padding: 15px; }
.result-title { font-weight: bold; color: #2c5aa0; }
.result-meta { color: #666; font-size: 14px; margin: 5px 0; }
.result-summary { margin: 10px 0; }
.stats { background: #f5f5f5; padding: 15px; margin: 20px 0; }
</style>
</head>
<body>
<div class="header">
<h1>🔍 智能搜索系统</h1>
<p>支持8个行业的权威信息搜索</p>
</div>
<div class="search-box">
<form method="POST">
<input type="text" name="query" placeholder="输入搜索关键词..." value="{{ query or '' }}">
<select name="industry">
<option value="">全部行业</option>
<option value="finance">金融</option>
<option value="ai_software">AI/软件</option>
<option value="manufacturing">制造业</option>
<option value="healthcare_pharma">医疗制药</option>
<option value="fmcg">快消品</option>
<option value="ecommerce_retail">零售电商</option>
<option value="energy_chemical">能源化工</option>
<option value="real_estate">房地产建筑</option>
</select>
<button type="submit">搜索</button>
</form>
</div>
{% if search_result %}
<div class="stats">
<strong>搜索结果:</strong> {{ search_result.total_count }} 条 |
<strong>耗时:</strong> {{ search_result.search_time }} 秒 |
<strong>信源:</strong> {{ search_result.sources_searched.total_sources }} 个
</div>
<div class="results">
{% for article in search_result.results[:10] %}
<div class="result-item">
<div class="result-title">{{ loop.index }}. {{ article.title }}</div>
<div class="result-meta">
📰 {{ article.source_name }} |
📅 {{ article.published_date or '未知时间' }} |
🎯 相关性: {{ "%.2f"|format(article.final_score or 0) }}
</div>
<div class="result-summary">{{ (article.summary or '')[:200] }}...</div>
<div><a href="{{ article.original_url }}" target="_blank">🔗 查看原文</a></div>
</div>
{% endfor %}
</div>
{% endif %}
{% if error %}
<div style="color: red; text-align: center;">
{{ error }}
</div>
{% endif %}
</body>
</html>
"""
@app.route('/', methods=['GET', 'POST'])
def index():
if request.method == 'POST':
query = request.form.get('query', '').strip()
industry = request.form.get('industry', '') or None
if query:
try:
result = cli.search_engine.search(query, industry)
if result['success']:
return render_template_string(HTML_TEMPLATE,
query=query,
search_result=result)
else:
return render_template_string(HTML_TEMPLATE,
query=query,
error=result.get('error', '搜索失败'))
except Exception as e:
return render_template_string(HTML_TEMPLATE,
query=query,
error=str(e))
else:
return render_template_string(HTML_TEMPLATE,
query=query,
error='请输入搜索关键词')
return render_template_string(HTML_TEMPLATE)
return app
except ImportError:
print("Flask未安装无法启动Web界面")
print("请运行: pip install flask")
return None
def main():
"""主函数"""
parser = argparse.ArgumentParser(description='智能搜索系统')
parser.add_argument('--mode', choices=['cli', 'web', 'monitor'],
default='cli', help='运行模式')
parser.add_argument('--query', type=str, help='搜索查询')
parser.add_argument('--industry', type=str, help='搜索行业')
parser.add_argument('--language', type=str, choices=['en', 'cn'], help='搜索语言')
parser.add_argument('--export', action='store_true', help='导出结果')
parser.add_argument('--port', type=int, default=5000, help='Web端口')
args = parser.parse_args()
if args.mode == 'monitor':
# RSS监控模式
print("🚀 启动RSS监控器...")
from rss_monitor import start_rss_monitor
start_rss_monitor()
elif args.mode == 'web':
# Web界面模式
app = create_web_app()
if app:
print(f"🌐 启动Web界面: http://localhost:{args.port}")
app.run(host='0.0.0.0', port=args.port, debug=False)
elif args.mode == 'cli':
# 命令行模式
cli = SearchSystemCLI()
if args.query:
# 直接执行搜索
cli.run_search(args.query, args.industry, args.language, args.export)
else:
# 交互模式
cli.interactive_mode()
if __name__ == "__main__":
main()


@@ -0,0 +1,41 @@
# Dependencies for the search system
# Core dependencies
requests>=2.28.0
feedparser>=6.0.10
python-docx>=0.8.11
# Database: sqlite3 is part of the Python standard library; nothing to install
# Optional API dependencies
newsapi-python>=0.2.6
# Logging and utilities: pathlib, logging, hashlib, json, datetime, typing,
# threading, concurrent.futures, collections, html, re and time are all part of
# the Python standard library. They must not be listed as requirements, because
# pip would try to download packages with those names from PyPI.
# Web interface (optional)
flask>=2.0.0
jinja2>=3.0.0
# Enhanced data processing (optional)
pandas>=1.5.0
numpy>=1.21.0
# Chinese text processing (optional)
jieba>=0.42.1
# More advanced NLP processing (optional)
nltk>=3.8
scikit-learn>=1.1.0
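Because several of these packages are optional, it can help to check which ones are importable before choosing a run mode. The script below is a sketch under stated assumptions: the function name `check_modules` is hypothetical, and the tuples list import names rather than PyPI names (`python-docx` is imported as `docx`, `newsapi-python` as `newsapi`).

```python
import importlib.util
from typing import Dict, Iterable

# Import names (not PyPI names) of third-party packages from requirements.txt.
REQUIRED = ("requests", "feedparser", "docx")
OPTIONAL = ("flask", "newsapi", "pandas", "jieba")


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_modules(REQUIRED + OPTIONAL).items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

`importlib.util.find_spec` only locates the module and does not execute it, so the check stays cheap even for heavy packages such as pandas.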

324
代码实现/rss_monitor.py Normal file

@@ -0,0 +1,324 @@
# -*- coding: utf-8 -*-
"""
RSS监控脚本 - 自动获取RSS源更新
"""
import feedparser
import requests
import time
import logging
import threading
from datetime import datetime, timezone
from typing import List, Dict, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed
from database import DatabaseManager
from config import RSS_MONITOR_CONFIG, SEARCH_CONFIG
class RSSMonitor:
"""RSS监控器"""
def __init__(self):
self.db = DatabaseManager()
self.logger = logging.getLogger(__name__)
self.is_running = False
self.check_interval = RSS_MONITOR_CONFIG['check_interval']
self.max_retries = RSS_MONITOR_CONFIG['max_retries']
self.timeout = RSS_MONITOR_CONFIG['timeout']
self.user_agent = RSS_MONITOR_CONFIG['user_agent']
def start_monitoring(self):
"""开始监控RSS源"""
self.is_running = True
self.logger.info("RSS监控器启动")
while self.is_running:
try:
self._check_all_sources()
self.logger.info(f"等待 {self.check_interval} 秒后进行下次检查")
time.sleep(self.check_interval)
except KeyboardInterrupt:
self.logger.info("收到停止信号")
break
except Exception as e:
self.logger.error(f"监控过程出错: {e}")
time.sleep(60) # 出错后等待1分钟再继续
def stop_monitoring(self):
"""停止监控"""
self.is_running = False
self.logger.info("RSS监控器停止")
def _check_all_sources(self):
"""检查所有RSS源"""
sources = self.db.get_rss_sources()
self.logger.info(f"开始检查 {len(sources)} 个RSS源")
# 使用线程池并行处理
with ThreadPoolExecutor(max_workers=10) as executor:
futures = {
executor.submit(self._check_single_source, source): source
for source in sources
}
success_count = 0
error_count = 0
for future in as_completed(futures):
source = futures[future]
try:
articles_count = future.result()
if articles_count is not None:
success_count += 1
if articles_count > 0:
self.logger.info(
f"{source['source_name']}: 新增 {articles_count} 篇文章"
)
else:
error_count += 1
except Exception as e:
error_count += 1
self.logger.error(f"检查 {source['source_name']} 时出错: {e}")
self.logger.info(f"RSS检查完成: 成功 {success_count}, 失败 {error_count}")
def _check_single_source(self, source: Dict) -> Optional[int]:
"""检查单个RSS源"""
source_id = source['id']
source_name = source['source_name']
source_url = source['source_url']
try:
# 获取RSS内容
articles = self._fetch_rss_articles(source_url, source)
if articles is None:
return None
# 保存新文章
new_articles_count = 0
for article in articles:
article['source_id'] = source_id
article_id = self.db.save_article(article)
if article_id:
new_articles_count += 1
# 更新RSS源检查时间
self.db.update_rss_source_check_time(source_id)
return new_articles_count
except Exception as e:
self.logger.error(f"检查RSS源 {source_name} 失败: {e}")
return None
def _fetch_rss_articles(self, url: str, source: Dict) -> Optional[List[Dict]]:
"""获取RSS文章"""
headers = {
'User-Agent': self.user_agent,
'Accept': 'application/rss+xml, application/xml, text/xml'
}
for attempt in range(self.max_retries):
try:
# 获取RSS内容
response = requests.get(url, headers=headers, timeout=self.timeout)
response.raise_for_status()
# 解析RSS
feed = feedparser.parse(response.content)
if feed.bozo and feed.bozo_exception:
self.logger.warning(
f"RSS解析警告 {source['source_name']}: {feed.bozo_exception}"
)
articles = []
for entry in feed.entries:
article = self._parse_rss_entry(entry, source)
if article:
articles.append(article)
return articles
except requests.RequestException as e:
self.logger.warning(
f"{attempt + 1} 次尝试获取 {source['source_name']} 失败: {e}"
)
if attempt < self.max_retries - 1:
time.sleep(2 ** attempt) # 指数退避
except Exception as e:
self.logger.error(f"解析RSS {source['source_name']} 时出错: {e}")
break
return None
    def _parse_rss_entry(self, entry, source: Dict) -> Optional[Dict]:
        """Parse a single RSS entry into an article dict."""
        try:
            # Publication time: feedparser's *_parsed fields are UTC time tuples
            published_date = None
            if getattr(entry, 'published_parsed', None):
                published_date = datetime(*entry.published_parsed[:6], tzinfo=timezone.utc)
            elif getattr(entry, 'updated_parsed', None):
                published_date = datetime(*entry.updated_parsed[:6], tzinfo=timezone.utc)
            # Content: prefer the full content, then the summary, then the description
            content = ''
            if hasattr(entry, 'content') and entry.content:
                content = entry.content[0].value if isinstance(entry.content, list) else entry.content
            elif hasattr(entry, 'summary'):
                content = entry.summary
            elif hasattr(entry, 'description'):
                content = entry.description
            # Author
            author = ''
            if hasattr(entry, 'author'):
                author = entry.author
            elif hasattr(entry, 'dc_creator'):
                author = entry.dc_creator
            # Validate the required fields before doing any further work
            title = getattr(entry, 'title', '')
            link = getattr(entry, 'link', '')
            if not title or not link:
                return None
            # Extract keywords from the cleaned text so HTML tag names are not counted
            cleaned_content = self._clean_content(content)
            keywords = self._extract_keywords(title, cleaned_content)
            article = {
                'title': title,
                'content': cleaned_content,
                'summary': entry.summary if hasattr(entry, 'summary') else '',
                'author': author,
                'original_url': link,
                'published_date': published_date,
                'language': source.get('language', 'en'),
                'keywords': keywords
            }
            return article
        except Exception as e:
            self.logger.error(f"Error while parsing RSS entry: {e}")
            return None
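    def _clean_content(self, content: str) -> str:
        """Strip HTML tags and entities and collapse whitespace."""
        if not content:
            return ''
        try:
            import re
            from html import unescape
            # Remove HTML tags
            content = re.sub(r'<[^>]+>', '', content)
            # Decode HTML entities
            content = unescape(content)
            # Collapse runs of whitespace
            content = re.sub(r'\s+', ' ', content).strip()
            return content
        except Exception:
            # A bare `except:` would also swallow KeyboardInterrupt and SystemExit
            return content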
def _extract_keywords(self, title: str, content: str) -> List[str]:
"""提取关键词"""
try:
text = f"{title} {content}".lower()
# 简单关键词提取可以用更高级的NLP库
import re
words = re.findall(r'\b[a-zA-Z]{3,}\b', text)
# 过滤常见停用词
stop_words = {
'the', 'and', 'for', 'are', 'but', 'not', 'you', 'all', 'can', 'had',
'her', 'was', 'one', 'our', 'out', 'day', 'get', 'has', 'him', 'his',
'how', 'its', 'may', 'new', 'now', 'old', 'see', 'two', 'who', 'boy',
'this', 'that', 'with', 'have', 'will', 'from', 'they', 'been',
'said', 'each', 'make', 'most', 'over', 'some', 'time', 'very',
'what', 'when', 'here', 'just', 'like', 'long', 'many', 'than',
'them', 'well', 'your', 'come', 'could', 'into', 'more', 'much',
'only', 'other', 'such', 'take', 'than', 'them', 'well', 'were'
}
keywords = [word for word in words if word not in stop_words]
# 统计词频并返回前10个
from collections import Counter
word_counts = Counter(keywords)
return [word for word, count in word_counts.most_common(10)]
except Exception as e:
self.logger.error(f"提取关键词时出错: {e}")
return []
def manual_check_source(self, source_id: int) -> Dict:
"""手动检查指定RSS源"""
sources = self.db.get_rss_sources()
source = next((s for s in sources if s['id'] == source_id), None)
if not source:
return {'success': False, 'message': 'RSS源不存在'}
try:
articles_count = self._check_single_source(source)
if articles_count is not None:
return {
'success': True,
'message': f'成功检查 {source["source_name"]}',
'new_articles': articles_count
}
else:
return {
'success': False,
'message': f'检查 {source["source_name"]} 失败'
}
except Exception as e:
return {
'success': False,
'message': f'检查失败: {str(e)}'
}
def get_monitor_status(self) -> Dict:
"""获取监控状态"""
stats = self.db.get_statistics()
return {
'is_running': self.is_running,
'check_interval': self.check_interval,
'total_sources': stats.get('active_sources', 0),
'total_articles': stats.get('total_articles', 0),
'today_articles': stats.get('today_articles', 0)
}
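`start_monitoring` waits between checks with `time.sleep(self.check_interval)`, so a call to `stop_monitoring()` only takes effect once the current sleep ends. One possible alternative is to wait on a `threading.Event`, which wakes up as soon as the event is set. The class below is a self-contained sketch of that pattern; `StoppableLoop` is a hypothetical name and is not part of `RSSMonitor`.

```python
import threading
from typing import Callable


class StoppableLoop:
    """Run a callback periodically until stop() is called (hypothetical helper)."""

    def __init__(self, interval: float, callback: Callable[[], None]):
        self.interval = interval
        self.callback = callback
        self._stop_event = threading.Event()

    def run(self) -> None:
        while not self._stop_event.is_set():
            self.callback()
            # wait() returns early as soon as stop() sets the event,
            # otherwise it blocks for `interval` seconds.
            self._stop_event.wait(self.interval)

    def stop(self) -> None:
        self._stop_event.set()
```

In the monitor, `callback` would correspond to `_check_all_sources` and `interval` to `check_interval`; `stop()` can be called from another thread or a signal handler.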
def start_rss_monitor():
"""启动RSS监控器的主函数"""
import logging.config
from config import LOGGING_CONFIG
# 配置日志
logging.basicConfig(
level=LOGGING_CONFIG['level'],
format=LOGGING_CONFIG['format'],
handlers=[
logging.FileHandler(LOGGING_CONFIG['file'], encoding='utf-8'),
logging.StreamHandler()
]
)
monitor = RSSMonitor()
try:
monitor.start_monitoring()
except KeyboardInterrupt:
print("\n收到停止信号正在关闭RSS监控器...")
finally:
monitor.stop_monitoring()
monitor.db.close()
print("RSS监控器已停止")
if __name__ == "__main__":
start_rss_monitor()
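`_fetch_rss_articles` retries failed requests and waits `2 ** attempt` seconds between attempts. The same exponential backoff can be factored into a small helper, sketched here under assumptions: `retry_with_backoff` is a hypothetical name, it catches every `Exception` (the monitor itself only retries on `requests.RequestException`), and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call func until it succeeds or max_retries attempts have failed."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            # Back off before the next attempt, but not after the last one.
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)  # base, 2x base, 4x base, ...
    return None
```

Returning `None` after the final failure mirrors how the monitor signals a source that could not be fetched.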


@@ -0,0 +1,461 @@
# -*- coding: utf-8 -*-
"""
搜索引擎主类
"""
import requests
import logging
import time
from typing import List, Dict, Optional, Tuple
from datetime import datetime, timedelta
from database import DatabaseManager
from config import API_CONFIG, SEARCH_CONFIG
class SearchEngine:
"""智能搜索引擎"""
def __init__(self):
self.db = DatabaseManager()
self.logger = logging.getLogger(__name__)
self.newsapi_key = API_CONFIG['newsapi']['key']
self.twitter_token = API_CONFIG['twitter']['bearer_token']
self.alpha_vantage_key = API_CONFIG['alpha_vantage']['key']
def search(self, query: str, industry: str = None,
language: str = None, user_ip: str = '') -> Dict:
"""执行搜索"""
start_time = time.time()
# 解析查询参数
search_params = self._parse_query(query, industry, language)
keywords = search_params['keywords']
industry_id = search_params['industry_id']
detected_language = search_params['language']
self.logger.info(f"开始搜索: {keywords}, 行业: {industry}, 语言: {detected_language}")
# 创建搜索记录
search_log_id = self.db.create_search_log(
keywords=' '.join(keywords),
industry_id=industry_id,
language=detected_language,
user_ip=user_ip
)
try:
# 多源搜索
all_results = []
# 1. 搜索本地数据库
db_results = self._search_database(keywords, industry_id, detected_language)
all_results.extend(db_results)
self.logger.info(f"数据库搜索结果: {len(db_results)}")
# 2. NewsAPI搜索如果有API密钥
if self.newsapi_key and detected_language == 'en':
news_results = self._search_newsapi(keywords, industry)
all_results.extend(news_results)
self.logger.info(f"NewsAPI搜索结果: {len(news_results)}")
# 3. 金融数据API搜索金融行业
if industry == 'finance' and self.alpha_vantage_key:
finance_results = self._search_financial_data(keywords)
all_results.extend(finance_results)
self.logger.info(f"金融数据搜索结果: {len(finance_results)}")
# 结果去重和排序
final_results = self._process_results(all_results, keywords)
# 保存搜索结果
if final_results:
self.db.save_search_results(search_log_id, final_results)
search_time = time.time() - start_time
return {
'success': True,
'search_log_id': search_log_id,
'query': query,
'keywords': keywords,
'industry': industry,
'language': detected_language,
'results': final_results,
'total_count': len(final_results),
'search_time': round(search_time, 2),
'sources_searched': self._get_sources_info(industry_id)
}
except Exception as e:
self.logger.error(f"搜索过程出错: {e}")
return {
'success': False,
'error': str(e),
'search_log_id': search_log_id,
'query': query
}
def _parse_query(self, query: str, industry: str = None,
language: str = None) -> Dict:
"""解析搜索查询"""
# 提取关键词
keywords = self._extract_keywords(query)
# 检测语言
if not language:
language = self._detect_language(query)
# 获取行业ID
industry_id = None
if industry:
industries = self.db.get_industries()
for ind in industries:
if ind['name_en'] == industry:
industry_id = ind['id']
break
return {
'keywords': keywords,
'industry_id': industry_id,
'language': language
}
def _extract_keywords(self, query: str) -> List[str]:
"""提取搜索关键词"""
import re
# 基础关键词提取
words = re.findall(r'\b\w+\b', query.lower())
# 过滤停用词
stop_words = {
'the', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before',
'after', 'above', 'below', 'up', 'down', 'out', 'off', 'over', 'under',
'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',
'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too',
'very', 'can', 'will', 'just', 'should', 'now', 'what', 'news',
'latest', 'recent', 'update', 'today', 'yesterday'
}
keywords = [word for word in words if len(word) > 2 and word not in stop_words]
# 保留原始查询中的重要短语
phrases = self._extract_phrases(query)
keywords.extend(phrases)
return list(set(keywords)) # 去重
def _extract_phrases(self, query: str) -> List[str]:
"""提取重要短语"""
import re
# 提取引号内的短语
quoted_phrases = re.findall(r'"([^"]*)"', query)
# 提取常见的技术术语和公司名
phrases = []
# 技术术语模式
tech_patterns = [
r'\b[A-Z]{2,}\b', # 大写缩写 (AI, API, GDP)
r'\b\w+\.\w+\b', # 域名格式
r'\b\w+-\w+\b', # 连字符词组
]
for pattern in tech_patterns:
matches = re.findall(pattern, query)
phrases.extend(matches)
phrases.extend(quoted_phrases)
return phrases
def _detect_language(self, query: str) -> str:
"""检测查询语言"""
# 检查是否包含中文特定关键词
china_keywords = SEARCH_CONFIG['keywords_for_china']
for keyword in china_keywords:
if keyword in query:
return 'cn'
# 检查是否包含中文字符
import re
chinese_chars = re.findall(r'[\u4e00-\u9fff]+', query)
if chinese_chars:
return 'cn'
return SEARCH_CONFIG['default_language']
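The detection rule in `_detect_language` can be tried in isolation. This is a minimal sketch, assuming the keyword list and default language shown below; the real values come from `SEARCH_CONFIG` in `config.py`, which is not shown here, so `CHINA_KEYWORDS` and `DEFAULT_LANGUAGE` are stand-ins based on the README.

```python
import re
from typing import Iterable

# Stand-ins for SEARCH_CONFIG['keywords_for_china'] and
# SEARCH_CONFIG['default_language']; the real values live in config.py.
CHINA_KEYWORDS = ("中国", "国内", "A股", "人民币", "央行")
DEFAULT_LANGUAGE = "en"

# One character from the CJK Unified Ideographs block is enough.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def detect_language(query: str, china_keywords: Iterable[str] = CHINA_KEYWORDS) -> str:
    """Return 'cn' for China-specific keywords or any CJK character, else the default."""
    if any(keyword in query for keyword in china_keywords):
        return "cn"
    if CJK_PATTERN.search(query):
        return "cn"
    return DEFAULT_LANGUAGE
```

With these stand-in values every keyword except a bare `A股` prefix case already contains a CJK character, so the keyword check mainly matters if the configured list also holds Latin-script terms.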
def _search_database(self, keywords: List[str], industry_id: Optional[int],
language: str) -> List[Dict]:
"""搜索本地数据库"""
return self.db.search_articles(
keywords=keywords,
industry_id=industry_id,
language=language if language != 'cn' else None,
limit=SEARCH_CONFIG['max_results_per_source']
)
    def _search_newsapi(self, keywords: List[str], industry: str = None) -> List[Dict]:
        """Search via NewsAPI."""
        if not self.newsapi_key:
            return []
        try:
            import hashlib
            url = f"{API_CONFIG['newsapi']['base_url']}everything"
            # Build the query string (limit the number of keywords)
            query_str = ' AND '.join(keywords[:5])
            params = {
                'q': query_str,
                'apiKey': self.newsapi_key,
                'language': 'en',
                'sortBy': 'relevancy',
                'pageSize': 20,
                # Send a plain date so the timestamp carries no microseconds
                'from': (datetime.now() - timedelta(days=30)).date().isoformat()
            }
            # Restrict to industry-specific domains when available
            if industry:
                domains = self._get_industry_domains(industry)
                if domains:
                    params['domains'] = ','.join(domains)
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
            articles = []
            for article in data.get('articles', []):
                article_url = article.get('url') or ''
                if not article_url:
                    continue
                # hash() on str is randomized per process, so derive a stable ID instead
                url_hash = hashlib.sha256(article_url.encode('utf-8')).hexdigest()[:16]
                # Fields may be null in the response, hence `or ''` instead of a .get() default
                processed_article = {
                    'id': f"newsapi_{url_hash}",
                    'title': article.get('title') or '',
                    'content': article.get('description') or '',
                    'summary': article.get('description') or '',
                    'author': article.get('author') or '',
                    'original_url': article_url,
                    'published_date': self._parse_date(article.get('publishedAt')),
                    'source_name': (article.get('source') or {}).get('name') or 'NewsAPI',
                    'authority_level': 2,  # default: mainstream media
                    'language': 'en',
                    'relevance_score': 0.8  # NewsAPI results are assumed fairly relevant
                }
                articles.append(processed_article)
            self.logger.info(f"NewsAPI returned {len(articles)} results")
            return articles
        except Exception as e:
            self.logger.error(f"NewsAPI search failed: {e}")
            return []
def _search_financial_data(self, keywords: List[str]) -> List[Dict]:
"""搜索金融数据"""
if not self.alpha_vantage_key:
return []
try:
# 检查关键词是否包含股票代码
stock_symbols = self._extract_stock_symbols(keywords)
if not stock_symbols:
return []
articles = []
for symbol in stock_symbols[:3]: # 限制查询数量
data = self._get_stock_news(symbol)
if data:
articles.extend(data)
return articles
except Exception as e:
self.logger.error(f"金融数据搜索失败: {e}")
return []
    def _extract_stock_symbols(self, keywords: List[str]) -> List[str]:
        """Extract stock ticker symbols from the keywords."""
        import re
        symbols = []
        for keyword in keywords:
            # Only keywords that are already upper-case count as tickers. Upper-casing
            # first would turn every short word ("stock", "price") into a "symbol".
            if re.fullmatch(r'[A-Z]{1,5}', keyword):
                symbols.append(keyword)
        # Map a few well-known company names to their tickers
        company_symbols = {
            'apple': 'AAPL', 'microsoft': 'MSFT', 'google': 'GOOGL',
            'amazon': 'AMZN', 'tesla': 'TSLA', 'meta': 'META',
            'nvidia': 'NVDA', 'intel': 'INTC', 'amd': 'AMD'
        }
        for keyword in keywords:
            if keyword.lower() in company_symbols:
                symbols.append(company_symbols[keyword.lower()])
        return list(set(symbols))
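    def _get_stock_news(self, symbol: str) -> List[Dict]:
        """Fetch news for a stock symbol from Alpha Vantage."""
        try:
            import hashlib
            url = API_CONFIG['alpha_vantage']['base_url']
            params = {
                'function': 'NEWS_SENTIMENT',
                'tickers': symbol,
                'apikey': self.alpha_vantage_key,
                'limit': 10
            }
            response = requests.get(url, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
            articles = []
            for item in data.get('feed', []):
                item_url = item.get('url') or ''
                if not item_url:
                    continue
                # hash() on str is randomized per process, so derive a stable ID instead
                url_hash = hashlib.sha256(item_url.encode('utf-8')).hexdigest()[:16]
                # The overall sentiment score can be negative, so it is not a relevance
                # measure. Prefer the per-ticker relevance score when the response has
                # one (assumes the `ticker_sentiment` list of the NEWS_SENTIMENT feed).
                relevance = 0.5
                for ticker_info in item.get('ticker_sentiment', []):
                    if ticker_info.get('ticker') == symbol:
                        try:
                            relevance = float(ticker_info.get('relevance_score', 0.5))
                        except (TypeError, ValueError):
                            pass
                        break
                article = {
                    'id': f"alphavantage_{url_hash}",
                    'title': item.get('title') or '',
                    'content': item.get('summary') or '',
                    'summary': item.get('summary') or '',
                    'author': ','.join(item.get('authors') or []),
                    'original_url': item_url,
                    'published_date': self._parse_date(item.get('time_published')),
                    'source_name': item.get('source') or 'Alpha Vantage',
                    'authority_level': 2,
                    'language': 'en',
                    'relevance_score': relevance
                }
                articles.append(article)
            return articles
        except Exception as e:
            self.logger.error(f"Failed to fetch stock news for {symbol}: {e}")
            return []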
def _parse_date(self, date_str: str) -> Optional[datetime]:
"""解析日期字符串"""
if not date_str:
return None
try:
# 尝试多种日期格式
formats = [
'%Y-%m-%dT%H:%M:%SZ',
'%Y-%m-%dT%H:%M:%S',
'%Y%m%dT%H%M%S',
'%Y-%m-%d %H:%M:%S',
'%Y-%m-%d'
]
for fmt in formats:
try:
return datetime.strptime(date_str, fmt)
except ValueError:
continue
return None
except Exception:
return None
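    def _process_results(self, results: List[Dict], keywords: List[str]) -> List[Dict]:
        """Deduplicate, score, filter and sort the search results."""
        if not results:
            return []
        # Deduplicate by URL
        seen_urls = set()
        unique_results = []
        for result in results:
            url = result.get('original_url', '')
            if url and url not in seen_urls:
                seen_urls.add(url)
                unique_results.append(result)
        # Compute the final relevance score
        for result in unique_results:
            score = result.get('relevance_score', 0)
            # Authority bonus: level 1 adds 0.6, level 4 adds nothing
            authority_bonus = (4 - result.get('authority_level', 4)) * 0.2
            score += authority_bonus
            # Freshness bonus (newer is better)
            pub_date = result.get('published_date')
            if isinstance(pub_date, str):
                # Stored dates may come back as ISO strings rather than datetime objects
                try:
                    pub_date = datetime.fromisoformat(pub_date.replace('Z', '+00:00'))
                except ValueError:
                    pub_date = None
            if isinstance(pub_date, datetime):
                # Match the timezone-awareness of pub_date: subtracting an aware
                # datetime (as produced by the RSS monitor) from a naive one raises TypeError
                now = datetime.now(pub_date.tzinfo)
                days_old = max(0, (now - pub_date).days)
                time_factor = max(0, 1 - days_old / 30)  # linear decay over 30 days
                score += time_factor * 0.1
            result['final_score'] = score
        # Drop low-relevance results
        min_score = SEARCH_CONFIG['min_relevance_score']
        filtered_results = [r for r in unique_results if r.get('final_score', 0) >= min_score]
        # Sort by score, highest first
        filtered_results.sort(key=lambda x: x.get('final_score', 0), reverse=True)
        # Cap the number of results
        max_results = SEARCH_CONFIG['max_results_per_source'] * 2
        return filtered_results[:max_results]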
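The scoring rule in `_process_results` adds an authority bonus of `(4 - authority_level) * 0.2` and a freshness bonus of up to `0.1` that decays linearly over 30 days. The standalone function below is a sketch of that arithmetic; the name `final_score` and the injectable `now` parameter are assumptions made so the calculation can be checked without a database.

```python
from datetime import datetime, timedelta
from typing import Optional


def final_score(
    relevance: float,
    authority_level: int = 4,
    published: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> float:
    """Relevance plus authority bonus plus freshness bonus (hypothetical helper)."""
    # Level 1 (official body) adds 0.6, level 4 (other) adds nothing.
    score = relevance + (4 - authority_level) * 0.2
    if published is not None:
        now = now or datetime.now()
        days_old = (now - published).days
        # Linear decay: full 0.1 bonus today, nothing after 30 days.
        score += max(0.0, 1 - days_old / 30) * 0.1
    return score


# A level-1 article published today with relevance 0.5 scores 0.5 + 0.6 + 0.1.
example_now = datetime(2024, 6, 1)
example = final_score(0.5, authority_level=1, published=example_now, now=example_now)
```

Because the authority bonus can reach 0.6 while freshness adds at most 0.1, source authority dominates recency in the ranking.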
def _get_industry_domains(self, industry: str) -> List[str]:
"""获取行业相关域名"""
domain_map = {
'finance': [
'bloomberg.com', 'reuters.com', 'ft.com', 'wsj.com',
'cnbc.com', 'marketwatch.com', 'forbes.com'
],
'ai_software': [
'techcrunch.com', 'venturebeat.com', 'theverge.com',
'arstechnica.com', 'wired.com', 'technologyreview.com'
],
'healthcare_pharma': [
'statnews.com', 'fiercepharma.com', 'biopharmadive.com',
'nature.com', 'nejm.org'
]
}
return domain_map.get(industry, [])
def _get_sources_info(self, industry_id: Optional[int]) -> Dict:
"""获取搜索源信息"""
sources = self.db.get_rss_sources(industry_id)
return {
'total_sources': len(sources),
'by_authority': {
'1': len([s for s in sources if s['authority_level'] == 1]),
'2': len([s for s in sources if s['authority_level'] == 2]),
'3': len([s for s in sources if s['authority_level'] == 3])
}
}
def get_search_suggestions(self, partial_query: str, limit: int = 10) -> List[str]:
"""获取搜索建议"""
try:
# 基于历史搜索记录提供建议
history = self.db.get_search_history(limit=100)
suggestions = []
partial_lower = partial_query.lower()
for record in history:
keywords = record.get('keywords', '')
if partial_lower in keywords.lower() and keywords not in suggestions:
suggestions.append(keywords)
if len(suggestions) >= limit:
break
return suggestions
except Exception as e:
self.logger.error(f"获取搜索建议失败: {e}")
return []
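Results from external APIs are given IDs derived from the article URL. Python's built-in `hash()` is randomized per process for strings, so an ID built with it changes between runs. A digest from `hashlib` stays stable; the helper below is a sketch with a hypothetical name, `stable_article_id`, and an arbitrary 16-character truncation.

```python
import hashlib


def stable_article_id(prefix: str, url: str) -> str:
    """Build an ID such as 'newsapi_<digest>' that is identical across runs."""
    # SHA-256 of the UTF-8 URL, shortened to 16 hex characters for readability.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}_{digest}"
```

The same URL always maps to the same ID, which makes the ID usable as a deduplication or storage key across processes.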