auto-save 2026-04-01 09:03 (+2, ~1)

2026-04-01 09:04:18 +08:00
parent 6ae622c451
commit eeaeaa1e04
3 changed files with 1017 additions and 9 deletions


@@ -14,33 +14,261 @@
.container { max-width: 1200px; margin: 0 auto; }
h1 {
font-size: 2.5rem; font-weight: 700;
background: linear-gradient(135deg, #60a5fa, #a78bfa);
background: linear-gradient(135deg, #f97316, #ef4444);
-webkit-background-clip: text; -webkit-text-fill-color: transparent;
margin-bottom: 0.5rem;
}
.subtitle { color: #888; font-size: 1.1rem; margin-bottom: 2rem; }
.badge-row { display: flex; gap: 0.5rem; margin-bottom: 2rem; flex-wrap: wrap; }
.badge {
display: inline-block; padding: 0.3rem 0.8rem; border-radius: 20px;
font-size: 0.8rem; font-weight: 600;
}
.badge-ms { background: #1a3a5c; color: #60a5fa; }
.badge-asr { background: #3c2e1a; color: #fbbf24; }
.badge-tts { background: #1a3c2a; color: #4ade80; }
.badge-mit { background: #2e1a3c; color: #c4b5fd; }
.card {
background: #141414; border: 1px solid #222; border-radius: 12px;
padding: 2rem; margin-bottom: 1.5rem;
}
.card h2 { color: #60a5fa; margin-bottom: 1rem; font-size: 1.3rem; }
.card p { line-height: 1.8; color: #aaa; }
.card h2 { color: #f97316; margin-bottom: 1rem; font-size: 1.3rem; }
.card p, .card li { line-height: 1.8; color: #aaa; }
.card ul { list-style: none; padding: 0; }
.card ul li::before { content: ""; margin-right: 0.5rem; }
.grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(340px, 1fr)); gap: 1.5rem; }
table { width: 100%; border-collapse: collapse; margin-top: 0.5rem; }
th, td { text-align: left; padding: 0.7rem 1rem; border-bottom: 1px solid #222; }
th { color: #f97316; font-weight: 600; font-size: 0.9rem; }
td { color: #aaa; font-size: 0.9rem; }
.highlight { color: #4ade80; font-weight: 600; }
.warn { color: #f87171; font-weight: 600; }
.model-card {
background: #1a1a2e; border: 1px solid #2a2a4e; border-radius: 10px;
padding: 1.5rem; text-align: center;
}
.model-card .icon { font-size: 2.5rem; margin-bottom: 0.5rem; }
.model-card .name { font-size: 1.1rem; color: #f97316; font-weight: 700; margin-bottom: 0.3rem; }
.model-card .size { font-size: 0.8rem; color: #666; margin-bottom: 0.8rem; }
.model-card .desc { font-size: 0.85rem; color: #aaa; line-height: 1.6; text-align: left; }
.three-col { display: grid; grid-template-columns: repeat(auto-fit, minmax(280px, 1fr)); gap: 1.5rem; margin-bottom: 1.5rem; }
.use-case {
background: #141414; border: 1px solid #2a4e2a; border-radius: 12px;
padding: 1.5rem;
}
.use-case h3 { color: #4ade80; font-size: 1rem; margin-bottom: 0.5rem; }
.use-case p { color: #888; font-size: 0.9rem; line-height: 1.6; }
.use-case .tag { display: inline-block; background: #1a3c2a; color: #4ade80; padding: 0.2rem 0.5rem; border-radius: 4px; font-size: 0.75rem; margin-top: 0.5rem; }
.code-block {
background: #1a1a2e; border: 1px solid #2a2a4e; border-radius: 8px;
padding: 1.2rem; margin-top: 1rem; overflow-x: auto;
font-family: "SF Mono", "Fira Code", monospace; font-size: 0.85rem;
color: #c4b5fd; line-height: 1.6;
}
.code-comment { color: #555; }
.links { display: flex; gap: 1rem; margin-top: 1.5rem; flex-wrap: wrap; }
.links a {
display: inline-flex; align-items: center; gap: 0.4rem;
padding: 0.6rem 1.2rem; border-radius: 8px; text-decoration: none;
font-size: 0.9rem; font-weight: 600; transition: opacity 0.2s;
}
.links a:hover { opacity: 0.8; }
.link-gh { background: #1a1a2e; color: #c4b5fd; border: 1px solid #2a2a4e; }
.link-hf { background: #1a2e1a; color: #4ade80; border: 1px solid #2a4e2a; }
.link-doc { background: #2e2a1a; color: #fbbf24; border: 1px solid #4e3a2a; }
.verdict {
background: linear-gradient(135deg, #1a0a00, #141414);
border: 1px solid #f9731633; border-radius: 12px;
padding: 2rem; margin-top: 1.5rem; text-align: center;
}
.verdict h2 { color: #f97316; margin-bottom: 0.5rem; }
.verdict p { color: #888; max-width: 600px; margin: 0 auto; }
footer { text-align: center; color: #333; margin-top: 3rem; font-size: 0.8rem; }
</style>
</head>
<body>
<div class="container">
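<h1>VibeVoice Speech AI Research</h1>
<p class="subtitle">Microsoft's open-source speech suite: ASR + TTS + real-time speech, usable for extracting subtitles from bar-exam videos</p>
<h1>VibeVoice Speech AI Suite</h1>
<p class="subtitle">Open-sourced by Microsoft | ASR + TTS + real-time speech | MIT license</p>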
<div class="badge-row">
<span class="badge badge-ms">Microsoft Research</span>
<span class="badge badge-asr">ASR speech recognition</span>
<span class="badge badge-tts">TTS speech synthesis</span>
<span class="badge badge-mit">MIT open source</span>
</div>
<!-- Three models -->
<div class="three-col">
<div class="model-card">
<div class="icon">🎙</div>
<div class="name">VibeVoice-ASR</div>
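<div class="size">Speech recognition model</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>Processes 60 minutes of audio in a single pass</li>
<li>Output: speaker + timestamps + content</li>
<li>Supports 50+ languages</li>
<li>Supports custom hotwords</li>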
</ul>
</div>
</div>
<div class="model-card">
<div class="icon">🔊</div>
<div class="name">VibeVoice-1.5B</div>
<div class="size">1.5B parameters · TTS</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>High-quality text-to-speech</li>
<li>Natural intonation and prosody</li>
<li>Multilingual support</li>
<li>Ultra-low 7.5 Hz frame-rate tokens</li>
</ul>
</div>
</div>
<div class="model-card">
<div class="icon"></div>
<div class="name">VibeVoice-Realtime-0.5B</div>
<div class="size">0.5B parameters · real-time TTS</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>Streaming text input</li>
<li>First-audio latency ~300 ms</li>
<li>Supports reading long texts aloud</li>
<li>Suited to real-time conversation</li>
</ul>
</div>
</div>
</div>
<div class="grid">
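<!-- Technical highlights -->
<div class="card">
<h2>Core technology</h2>
<table>
<tr><th>Technology</th><th>Description</th></tr>
<tr><td>Continuous speech tokenizer</td><td>Dual acoustic + semantic tokenizers at an ultra-low 7.5 Hz frame rate</td></tr>
<tr><td>Long-audio processing</td><td>60 minutes in a single pass, no chunking required</td></tr>
<tr><td>Speaker diarization</td><td>Automatically identifies who + when + what</td></tr>
<tr><td>Streaming inference</td><td>Generates speech while text is still being entered; 300 ms to first audio</td></tr>
<tr><td>Hotword support</td><td>Custom domain terminology improves recognition accuracy</td></tr>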
</table>
</div>
<!-- Comparison -->
<div class="card">
<h2>vs. comparable solutions</h2>
<table>
<tr><th>Dimension</th><th>Whisper</th><th>ElevenLabs</th><th>VibeVoice</th></tr>
<tr><td>ASR</td><td class="highlight">Yes</td><td class="warn">No</td><td class="highlight">Yes (stronger)</td></tr>
<tr><td>TTS</td><td class="warn">No</td><td class="highlight">Yes</td><td class="highlight">Yes</td></tr>
<tr><td>Real-time streaming</td><td class="warn">No</td><td class="highlight">Yes</td><td class="highlight">Yes</td></tr>
<tr><td>Speaker identification</td><td class="warn">No</td><td class="warn">No</td><td class="highlight">Built-in</td></tr>
<tr><td>Long audio</td><td>Requires chunking</td><td>N/A</td><td class="highlight">60 minutes in one pass</td></tr>
<tr><td>Open source</td><td class="highlight">Yes</td><td class="warn">No</td><td class="highlight">MIT</td></tr>
<tr><td>Cost</td><td>Free</td><td class="warn">Pay-as-you-go</td><td>Free</td></tr>
</table>
</div>
</div>
<!-- Use cases -->
<h2 style="color: #f97316; margin: 1.5rem 0 1rem;">Our use cases</h2>
<div class="three-col">
<div class="use-case">
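<h3>Subtitle extraction for bar-exam videos</h3>
<p>9,553 bar-exam videos need subtitles extracted. VibeVoice-ASR handles 60 minutes per pass with automatic timestamps and speaker identification; combined with legal hotwords such as "不当得利" (unjust enrichment) and "善意取得" (good-faith acquisition), this should markedly improve recognition accuracy.</p>
<span class="tag">High priority</span>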
</div>
<div class="use-case">
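<h3>Read-aloud for the 法海法考 app</h3>
<p>Use Realtime-0.5B to generate read-aloud audio for questions and explanations, so learners can listen to the explanation while reading the question.</p>
<span class="tag">Medium priority</span>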
</div>
<div class="use-case">
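<h3>Multilingual introductions for 百陶会</h3>
<p>Use VibeVoice-1.5B to generate Chinese and English voice introductions for product pages; support for 50+ languages covers overseas customers.</p>
<span class="tag">Low priority</span>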
</div>
</div>
<!-- Code examples -->
<div class="card">
<h2>Overview</h2>
<p>Research content to be added...</p>
<h2>ASR usage example</h2>
<div class="code-block">
<span class="code-comment"># Install (run in a shell)</span>
pip install transformers torch
<span class="code-comment"># ASR: speech to text, with timestamps and speakers</span>
<span class="code-comment"># Assumes the checkpoint loads through the standard pipeline API; check the model card first</span>
from transformers import pipeline
asr = pipeline(
"automatic-speech-recognition",
model="microsoft/VibeVoice-ASR"
)
result = asr("lecture_60min.wav")
<span class="code-comment"># Output: [{speaker: "A", start: 0.0, end: 3.2, text: "..."}, ...]</span>
</div>
</div>
<div class="card">
<h2>Key findings</h2>
<p>To be added...</p>
<h2>TTS usage example</h2>
<div class="code-block">
<span class="code-comment"># Real-time TTS: text to speech</span>
<span class="code-comment"># Pseudocode: generate_stream() and play() are illustrative placeholders, not confirmed APIs</span>
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"microsoft/VibeVoice-Realtime-0.5B"
)
<span class="code-comment"># Streaming generation, first audio in ~300 ms</span>
for audio_chunk in model.generate_stream("今天我们来讲民法典..."):
play(audio_chunk)
</div>
</div>
<!-- Hardware -->
<div class="card">
<h2>Hardware requirements and local machine fit</h2>
<table>
<tr><th>Model</th><th>VRAM required</th><th>Runs on M2 Max?</th></tr>
<tr><td>VibeVoice-ASR</td><td>~8GB</td><td class="highlight">Yes (MPS-accelerated)</td></tr>
<tr><td>VibeVoice-1.5B</td><td>~6GB</td><td class="highlight">Yes</td></tr>
<tr><td>VibeVoice-Realtime-0.5B</td><td>~2GB</td><td class="highlight">Yes</td></tr>
</table>
<p style="margin-top: 1rem; color: #4ade80; font-size: 0.9rem;">
This machine (M2 Max, 64 GB) fully meets the requirements of all three models
</p>
</div>
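<!-- Verdict -->
<div class="verdict">
<h2>Verdict: highly practical</h2>
<p>An open-source three-in-one solution (ASR + TTS + real-time speech) under the MIT license, with no restrictions on commercial use. The ASR model's 60-minute long-audio handling plus speaker identification is the real differentiator. It runs directly on the local M2 Max, so no GPU server is needed, and it is directly useful for the bar-exam subtitle extraction project.</p>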
</div>
<!-- Links -->
<div class="links">
<a href="https://github.com/microsoft/VibeVoice" target="_blank" class="link-gh">GitHub source</a>
<a href="https://huggingface.co/microsoft/VibeVoice-ASR" target="_blank" class="link-hf">ASR model</a>
<a href="https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B" target="_blank" class="link-hf">Realtime model</a>
<a href="https://microsoft.github.io/VibeVoice/" target="_blank" class="link-doc">Official docs</a>
</div>
<footer>
Research project · Started 2026-03-31 · Source cloned to ./source/
</footer>
</div>
</body>
</html>
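
For the subtitle-extraction use case, the segment list shown in the page's ASR example comment (`speaker`, `start`, `end`, `text`) could be turned into SRT text with plain Python. This is a minimal sketch and not part of the commit. It assumes segments arrive as dictionaries with exactly those four keys and with times in seconds, which is only what the example comment suggests and not a confirmed VibeVoice output schema. `format_timestamp` and `segments_to_srt` are hypothetical helper names.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as SRT cues, prefixing each line with its speaker label.

    Assumed segment shape (hypothetical, taken from the page's example comment):
    {"speaker": "A", "start": 0.0, "end": 3.2, "text": "..."}
    """
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n[{seg['speaker']}] {seg['text']}\n")
    # Cues are separated by a blank line, as the SRT format expects.
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [
        {"speaker": "A", "start": 0.0, "end": 3.2, "text": "hello"},
        {"speaker": "B", "start": 3.2, "end": 5.0, "text": "world"},
    ]
    print(segments_to_srt(demo))
```

Keeping the speaker label inside the cue text avoids any dependence on player-specific styling; it can be dropped for single-lecturer videos.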