<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>VibeVoice Speech AI Research</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
  background: #0a0a0a; color: #e0e0e0;
  min-height: 100vh; padding: 2rem;
}
.container { max-width: 1200px; margin: 0 auto; }
h1 {
  font-size: 2.5rem; font-weight: 700;
  background: linear-gradient(135deg, #f97316, #ef4444);
  -webkit-background-clip: text; -webkit-text-fill-color: transparent;
  margin-bottom: 0.5rem;
}
.subtitle { color: #888; font-size: 1.1rem; margin-bottom: 2rem; }
.badge-row { display: flex; gap: 0.5rem; margin-bottom: 2rem; flex-wrap: wrap; }
.badge {
  display: inline-block; padding: 0.3rem 0.8rem; border-radius: 20px;
  font-size: 0.8rem; font-weight: 600;
}
.badge-ms { background: #1a3a5c; color: #60a5fa; }
.badge-asr { background: #3c2e1a; color: #fbbf24; }
.badge-tts { background: #1a3c2a; color: #4ade80; }
.badge-mit { background: #2e1a3c; color: #c4b5fd; }

.card {
  background: #141414; border: 1px solid #222; border-radius: 12px;
  padding: 2rem; margin-bottom: 1.5rem;
}
.card h2 { color: #f97316; margin-bottom: 1rem; font-size: 1.3rem; }
.card p, .card li { line-height: 1.8; color: #aaa; }
.card ul { list-style: none; padding: 0; }
.card ul li::before { content: ""; margin-right: 0.5rem; }

.grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(340px, 1fr)); gap: 1.5rem; }

table { width: 100%; border-collapse: collapse; margin-top: 0.5rem; }
th, td { text-align: left; padding: 0.7rem 1rem; border-bottom: 1px solid #222; }
th { color: #f97316; font-weight: 600; font-size: 0.9rem; }
td { color: #aaa; font-size: 0.9rem; }

.highlight { color: #4ade80; font-weight: 600; }
.warn { color: #f87171; font-weight: 600; }

.model-card {
  background: #1a1a2e; border: 1px solid #2a2a4e; border-radius: 10px;
  padding: 1.5rem; text-align: center;
}
.model-card .icon { font-size: 2.5rem; margin-bottom: 0.5rem; }
.model-card .name { font-size: 1.1rem; color: #f97316; font-weight: 700; margin-bottom: 0.3rem; }
.model-card .size { font-size: 0.8rem; color: #666; margin-bottom: 0.8rem; }
.model-card .desc { font-size: 0.85rem; color: #aaa; line-height: 1.6; text-align: left; }

.three-col { display: grid; grid-template-columns: repeat(auto-fit, minmax(280px, 1fr)); gap: 1.5rem; margin-bottom: 1.5rem; }

.use-case {
  background: #141414; border: 1px solid #2a4e2a; border-radius: 12px;
  padding: 1.5rem;
}
.use-case h3 { color: #4ade80; font-size: 1rem; margin-bottom: 0.5rem; }
.use-case p { color: #888; font-size: 0.9rem; line-height: 1.6; }
.use-case .tag { display: inline-block; background: #1a3c2a; color: #4ade80; padding: 0.2rem 0.5rem; border-radius: 4px; font-size: 0.75rem; margin-top: 0.5rem; }

.code-block {
  background: #1a1a2e; border: 1px solid #2a2a4e; border-radius: 8px;
  padding: 1.2rem; margin-top: 1rem; overflow-x: auto;
  font-family: "SF Mono", "Fira Code", monospace; font-size: 0.85rem;
  color: #c4b5fd; line-height: 1.6;
  white-space: pre; /* keep code line breaks and indentation */
}
.code-comment { color: #555; }

.links { display: flex; gap: 1rem; margin-top: 1.5rem; flex-wrap: wrap; }
.links a {
  display: inline-flex; align-items: center; gap: 0.4rem;
  padding: 0.6rem 1.2rem; border-radius: 8px; text-decoration: none;
  font-size: 0.9rem; font-weight: 600; transition: opacity 0.2s;
}
.links a:hover { opacity: 0.8; }
.link-gh { background: #1a1a2e; color: #c4b5fd; border: 1px solid #2a2a4e; }
.link-hf { background: #1a2e1a; color: #4ade80; border: 1px solid #2a4e2a; }
.link-doc { background: #2e2a1a; color: #fbbf24; border: 1px solid #4e3a2a; }

.verdict {
  background: linear-gradient(135deg, #1a0a00, #141414);
  border: 1px solid #f9731633; border-radius: 12px;
  padding: 2rem; margin-top: 1.5rem; text-align: center;
}
.verdict h2 { color: #f97316; margin-bottom: 0.5rem; }
.verdict p { color: #888; max-width: 600px; margin: 0 auto; }

footer { text-align: center; color: #333; margin-top: 3rem; font-size: 0.8rem; }
</style>
</head>
<body>
<div class="container">
<h1>VibeVoice: An All-in-One Speech AI Suite</h1>
<p class="subtitle">Open-sourced by Microsoft | ASR + TTS + real-time speech | MIT license</p>

<div class="badge-row">
<span class="badge badge-ms">Microsoft Research</span>
<span class="badge badge-asr">ASR (speech recognition)</span>
<span class="badge badge-tts">TTS (speech synthesis)</span>
<span class="badge badge-mit">MIT open source</span>
</div>

<!-- The three models -->
<div class="three-col">
<div class="model-card">
<div class="icon">🎙</div>
<div class="name">VibeVoice-ASR</div>
<div class="size">Speech recognition model</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>Processes up to 60 minutes of audio in a single pass</li>
<li>Output: speaker + timestamps + content</li>
<li>Supports 50+ languages</li>
<li>Supports custom hotwords</li>
</ul>
</div>
</div>
<div class="model-card">
<div class="icon">🔊</div>
<div class="name">VibeVoice-1.5B</div>
<div class="size">1.5B parameters · TTS</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>High-quality text-to-speech</li>
<li>Natural intonation and prosody</li>
<li>Multilingual support</li>
<li>Ultra-low 7.5 Hz token frame rate</li>
</ul>
</div>
</div>
<div class="model-card">
<div class="icon">⚡</div>
<div class="name">VibeVoice-Realtime-0.5B</div>
<div class="size">0.5B parameters · real-time TTS</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>Streaming text input</li>
<li>First-audio latency of ~300 ms</li>
<li>Supports long-form text reading</li>
<li>Suited to real-time conversation</li>
</ul>
</div>
</div>
</div>

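<div class="grid">
<!-- Technical highlights -->
<div class="card">
<h2>Core technology</h2>
<table>
<tr><th>Technology</th><th>Description</th></tr>
<tr><td>Continuous speech tokenizers</td><td>Paired acoustic and semantic tokenizers running at an ultra-low 7.5 Hz frame rate</td></tr>
<tr><td>Long-audio processing</td><td>Up to 60 minutes in a single pass, no chunking required</td></tr>
<tr><td>Speaker diarization</td><td>Automatically identifies who spoke, when, and what was said</td></tr>
<tr><td>Streaming inference</td><td>Generates speech while text is still arriving, ~300 ms to first audio</td></tr>
<tr><td>Hotword support</td><td>Custom domain terms improve recognition accuracy</td></tr>
</table>
</div>
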
<!-- Comparison -->
<div class="card">
<h2>vs. comparable solutions</h2>
<table>
<tr><th>Dimension</th><th>Whisper</th><th>ElevenLabs</th><th>VibeVoice</th></tr>
<tr><td>ASR</td><td class="highlight">Yes</td><td class="warn">No</td><td class="highlight">Yes (stronger)</td></tr>
<tr><td>TTS</td><td class="warn">No</td><td class="highlight">Yes</td><td class="highlight">Yes</td></tr>
<tr><td>Real-time streaming</td><td class="warn">No</td><td class="highlight">Yes</td><td class="highlight">Yes</td></tr>
<tr><td>Speaker identification</td><td class="warn">No</td><td class="warn">No</td><td class="highlight">Built in</td></tr>
<tr><td>Long audio</td><td>Requires chunking</td><td>N/A</td><td class="highlight">60 minutes in one pass</td></tr>
<tr><td>Open source</td><td class="highlight">Yes</td><td class="warn">No</td><td class="highlight">Yes (MIT)</td></tr>
<tr><td>Cost</td><td>Free</td><td class="warn">Pay-as-you-go</td><td>Free</td></tr>
</table>
</div>
</div>
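<div class="card">
<h2>Sketch: what a 7.5 Hz frame rate means for sequence length</h2>
<p>To make the 7.5 Hz figure in the core technology table concrete, the minimal sketch below counts the tokenizer frames needed to cover a 60-minute recording. It is plain arithmetic rather than VibeVoice code: <code>frames_for</code> is a hypothetical helper, and the 50 Hz comparison rate is an assumed reference point chosen only for illustration.</p>

```python
def frames_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover duration_s seconds of audio."""
    return round(duration_s * frame_rate_hz)


one_hour_s = 60 * 60

# Frame rate quoted for the VibeVoice tokenizers in the table above.
vibevoice_frames = frames_for(one_hour_s, 7.5)

# Assumed comparison rate, for illustration only (not taken from this page).
reference_frames = frames_for(one_hour_s, 50.0)

print(vibevoice_frames)   # 27000
print(reference_frames)   # 180000
```

<p>At 7.5 Hz an hour of audio comes to 27,000 frames, roughly 6.7 times fewer than at the assumed 50 Hz rate. A sequence that short is at least consistent with handling 60 minutes in a single pass.</p>
</div>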
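
<!-- Use cases -->
<h2 style="color: #f97316; margin: 1.5rem 0 1rem;">Our use cases</h2>
<div class="three-col">
<div class="use-case">
<h3>Subtitle extraction for bar-exam videos</h3>
<p>Subtitles need to be extracted from 9,553 bar-exam prep videos. VibeVoice-ASR handles up to 60 minutes in a single pass and produces timestamps and speaker labels automatically. Adding legal hotwords such as "不当得利" (unjust enrichment) and "善意取得" (good-faith acquisition) can noticeably improve recognition accuracy.</p>
<span class="tag">High priority</span>
</div>
<div class="use-case">
<h3>Read-aloud audio for the 法海法考 app</h3>
<p>Use Realtime-0.5B to generate spoken versions of questions and their explanations, so learners can listen to the explanation while reading the question.</p>
<span class="tag">Medium priority</span>
</div>
<div class="use-case">
<h3>Multilingual introductions for 百陶会</h3>
<p>Use VibeVoice-1.5B to generate Chinese and English voice introductions for product pages. The 50+ language figure above is listed for the ASR model, so TTS coverage of further languages for overseas customers should be verified first.</p>
<span class="tag">Low priority</span>
</div>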
</div>
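<div class="card">
<h2>Sketch: planning windows for recordings longer than 60 minutes</h2>
<p>The 60-minute single-pass figure matters for the video backlog described above, because any recording longer than that still has to be split before transcription. The following is a minimal sketch of that planning step, assuming each video's duration is already known in seconds. <code>plan_windows</code> and its fixed-offset splitting are hypothetical illustrations, not part of VibeVoice.</p>

```python
import math


def plan_windows(duration_s: float, max_window_s: float = 3600.0) -> list[tuple[float, float]]:
    """Split a recording into consecutive (start, end) windows no longer than max_window_s.

    Assumes duration_s is positive.
    """
    count = max(1, math.ceil(duration_s / max_window_s))
    return [
        (i * max_window_s, min((i + 1) * max_window_s, duration_s))
        for i in range(count)
    ]


# A 45-minute lecture fits in one pass; a 150-minute one needs three windows.
print(plan_windows(45 * 60.0))    # [(0.0, 2700.0)]
print(plan_windows(150 * 60.0))   # [(0.0, 3600.0), (3600.0, 7200.0), (7200.0, 9000.0)]
```

<p>A production pipeline would probably cut at pauses rather than at fixed offsets, so that no sentence is split across two windows.</p>
</div>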
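
<!-- Code examples -->
<div class="card">
<h2>ASR usage example</h2>
<div class="code-block">
<span class="code-comment"># Install (run in a shell, not in Python):</span>
<span class="code-comment">#   pip install transformers torch</span>

<span class="code-comment"># ASR: speech to text, with timestamps and speaker labels.</span>
<span class="code-comment"># Assumes the checkpoint loads through the standard transformers pipeline;</span>
<span class="code-comment"># check the model card for the supported loading path.</span>
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="microsoft/VibeVoice-ASR",
)

result = asr("lecture_60min.wav")
<span class="code-comment"># Illustrative output shape:</span>
<span class="code-comment"># [{"speaker": "A", "start": 0.0, "end": 3.2, "text": "..."}, ...]</span>
</div>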
</div>
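<div class="card">
<h2>Sketch: turning ASR segments into SRT subtitles</h2>
<p>For the subtitle extraction use case, the segment list still has to be written out in a subtitle format. The sketch below converts segments to SRT, assuming they arrive as dictionaries with <code>speaker</code>, <code>start</code>, <code>end</code>, and <code>text</code> keys, as in the illustrative output comment in the ASR example. The real output schema may differ. <code>srt_timestamp</code> and <code>segments_to_srt</code> are hypothetical helper names.</p>

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues, prefixing each line with its speaker label."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"[{seg['speaker']}] {seg['text']}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


# Hand-written sample in the assumed shape, not real model output.
sample = [
    {"speaker": "A", "start": 0.0, "end": 3.2, "text": "Welcome to the lecture."},
    {"speaker": "B", "start": 3.2, "end": 5.75, "text": "Thank you."},
]
print(segments_to_srt(sample))
```

<p>Keeping the speaker label inside the cue text lets a later step strip or restyle it without re-running recognition.</p>
</div>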

<div class="card">
<h2>TTS usage example</h2>
<div class="code-block">
<span class="code-comment"># Real-time TTS: text to speech.</span>
<span class="code-comment"># NOTE: generate_stream() and play() are illustrative placeholders, not confirmed</span>
<span class="code-comment"># APIs; see the VibeVoice repository for the actual streaming entry point.</span>
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)

<span class="code-comment"># Streaming generation, ~300 ms to first audio</span>
for audio_chunk in model.generate_stream("Today we will cover the Civil Code..."):
    play(audio_chunk)
</div>
</div>

<!-- Hardware -->
<div class="card">
<h2>Hardware requirements and local fit</h2>
<table>
<tr><th>Model</th><th>GPU memory required</th><th>Runs on M2 Max?</th></tr>
<tr><td>VibeVoice-ASR</td><td>~8 GB</td><td class="highlight">Yes (MPS acceleration)</td></tr>
<tr><td>VibeVoice-1.5B</td><td>~6 GB</td><td class="highlight">Yes</td></tr>
<tr><td>VibeVoice-Realtime-0.5B</td><td>~2 GB</td><td class="highlight">Yes</td></tr>
</table>
<p style="margin-top: 1rem; color: #4ade80; font-size: 0.9rem;">
The local M2 Max with 64 GB of memory fully meets the requirements of all three models
</p>
</div>
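<div class="card">
<h2>Sketch: sanity-checking the memory figures</h2>
<p>As a rough cross-check on the hardware table, the sketch below estimates the memory taken by the weights alone from the parameter counts on the model cards. It assumes 16-bit weights (2 bytes per parameter), which this page does not state, and <code>weight_memory_gb</code> is a hypothetical helper. VibeVoice-ASR is left out because its parameter count is not given here.</p>

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


# Parameter counts as listed on the model cards; 2 bytes per parameter is an assumption.
for name, params_b in [("VibeVoice-1.5B", 1.5), ("VibeVoice-Realtime-0.5B", 0.5)]:
    print(f"{name}: ~{weight_memory_gb(params_b):.1f} GB of weights at 16-bit precision")
```

<p>Under that assumption the weights come to about 3 GB and 1 GB, roughly half of the ~6 GB and ~2 GB listed in the table. That leaves the rest for activations, caches, and the tokenizers.</p>
</div>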
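
<!-- Verdict -->
<div class="verdict">
<h2>Verdict: highly practical</h2>
<p>An open-source, three-in-one solution covering ASR, TTS, and real-time speech, released under the MIT license with no restrictions on commercial use. The ASR model's 60-minute long-audio support combined with speaker identification is the real differentiator. It runs directly on the local M2 Max with no GPU server needed, and it has direct value for the bar-exam subtitle extraction project.</p>
</div>
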
<!-- Links -->
<div class="links">
<a href="https://github.com/microsoft/VibeVoice" target="_blank" class="link-gh">GitHub source</a>
<a href="https://huggingface.co/microsoft/VibeVoice-ASR" target="_blank" class="link-hf">ASR model</a>
<a href="https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B" target="_blank" class="link-hf">Realtime model</a>
<a href="https://microsoft.github.io/VibeVoice/" target="_blank" class="link-doc">Official documentation</a>
</div>

<footer>
Research project · Project start date 2026-03-31 · Source cloned to ./source/
</footer>
</div>
</body>
</html>