<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>VibeVoice Voice AI Research</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
background: #0a0a0a; color: #e0e0e0;
min-height: 100vh; padding: 2rem;
}
.container { max-width: 1200px; margin: 0 auto; }
h1 {
font-size: 2.5rem; font-weight: 700;
background: linear-gradient(135deg, #f97316, #ef4444);
-webkit-background-clip: text; -webkit-text-fill-color: transparent;
margin-bottom: 0.5rem;
}
.subtitle { color: #888; font-size: 1.1rem; margin-bottom: 2rem; }
.badge-row { display: flex; gap: 0.5rem; margin-bottom: 2rem; flex-wrap: wrap; }
.badge {
display: inline-block; padding: 0.3rem 0.8rem; border-radius: 20px;
font-size: 0.8rem; font-weight: 600;
}
.badge-ms { background: #1a3a5c; color: #60a5fa; }
.badge-asr { background: #3c2e1a; color: #fbbf24; }
.badge-tts { background: #1a3c2a; color: #4ade80; }
.badge-mit { background: #2e1a3c; color: #c4b5fd; }
.card {
background: #141414; border: 1px solid #222; border-radius: 12px;
padding: 2rem; margin-bottom: 1.5rem;
}
.card h2 { color: #f97316; margin-bottom: 1rem; font-size: 1.3rem; }
.card p, .card li { line-height: 1.8; color: #aaa; }
.card ul { list-style: none; padding: 0; }
.card ul li::before { content: ""; margin-right: 0.5rem; }
.grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(340px, 1fr)); gap: 1.5rem; }
table { width: 100%; border-collapse: collapse; margin-top: 0.5rem; }
th, td { text-align: left; padding: 0.7rem 1rem; border-bottom: 1px solid #222; }
th { color: #f97316; font-weight: 600; font-size: 0.9rem; }
td { color: #aaa; font-size: 0.9rem; }
.highlight { color: #4ade80; font-weight: 600; }
.warn { color: #f87171; font-weight: 600; }
.model-card {
background: #1a1a2e; border: 1px solid #2a2a4e; border-radius: 10px;
padding: 1.5rem; text-align: center;
}
.model-card .icon { font-size: 2.5rem; margin-bottom: 0.5rem; }
.model-card .name { font-size: 1.1rem; color: #f97316; font-weight: 700; margin-bottom: 0.3rem; }
.model-card .size { font-size: 0.8rem; color: #666; margin-bottom: 0.8rem; }
.model-card .desc { font-size: 0.85rem; color: #aaa; line-height: 1.6; text-align: left; }
.three-col { display: grid; grid-template-columns: repeat(auto-fit, minmax(280px, 1fr)); gap: 1.5rem; margin-bottom: 1.5rem; }
.use-case {
background: #141414; border: 1px solid #2a4e2a; border-radius: 12px;
padding: 1.5rem;
}
.use-case h3 { color: #4ade80; font-size: 1rem; margin-bottom: 0.5rem; }
.use-case p { color: #888; font-size: 0.9rem; line-height: 1.6; }
.use-case .tag { display: inline-block; background: #1a3c2a; color: #4ade80; padding: 0.2rem 0.5rem; border-radius: 4px; font-size: 0.75rem; margin-top: 0.5rem; }
.code-block {
background: #1a1a2e; border: 1px solid #2a2a4e; border-radius: 8px;
padding: 1.2rem; margin-top: 1rem; overflow-x: auto;
font-family: "SF Mono", "Fira Code", monospace; font-size: 0.85rem;
color: #c4b5fd; line-height: 1.6;
white-space: pre; /* keep line breaks and indentation in code samples */
}
.code-comment { color: #555; }
.links { display: flex; gap: 1rem; margin-top: 1.5rem; flex-wrap: wrap; }
.links a {
display: inline-flex; align-items: center; gap: 0.4rem;
padding: 0.6rem 1.2rem; border-radius: 8px; text-decoration: none;
font-size: 0.9rem; font-weight: 600; transition: opacity 0.2s;
}
.links a:hover { opacity: 0.8; }
.link-gh { background: #1a1a2e; color: #c4b5fd; border: 1px solid #2a2a4e; }
.link-hf { background: #1a2e1a; color: #4ade80; border: 1px solid #2a4e2a; }
.link-doc { background: #2e2a1a; color: #fbbf24; border: 1px solid #4e3a2a; }
.verdict {
background: linear-gradient(135deg, #1a0a00, #141414);
border: 1px solid #f9731633; border-radius: 12px;
padding: 2rem; margin-top: 1.5rem; text-align: center;
}
.verdict h2 { color: #f97316; margin-bottom: 0.5rem; }
.verdict p { color: #888; max-width: 600px; margin: 0 auto; }
footer { text-align: center; color: #333; margin-top: 3rem; font-size: 0.8rem; }
</style>
</head>
<body>
<div class="container">
<h1>VibeVoice: An All-in-One Voice AI Suite</h1>
<p class="subtitle">Open-sourced by Microsoft | ASR + TTS + real-time speech | MIT license</p>
<div class="badge-row">
<span class="badge badge-ms">Microsoft Research</span>
<span class="badge badge-asr">ASR speech recognition</span>
<span class="badge badge-tts">TTS speech synthesis</span>
<span class="badge badge-mit">MIT open source</span>
</div>
<!-- The three models -->
<div class="three-col">
<div class="model-card">
<div class="icon">🎙</div>
<div class="name">VibeVoice-ASR</div>
<div class="size">Speech recognition model</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>Processes 60 minutes of audio in a single pass</li>
<li>Output: speaker + timestamps + content</li>
<li>Supports 50+ languages</li>
<li>Supports custom hotwords</li>
</ul>
</div>
</div>
<div class="model-card">
<div class="icon">🔊</div>
<div class="name">VibeVoice-1.5B</div>
<div class="size">1.5 billion parameters · TTS</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>High-quality text-to-speech</li>
<li>Natural intonation and prosody</li>
<li>Multilingual support</li>
<li>Tokens at an ultra-low 7.5 Hz frame rate</li>
</ul>
</div>
</div>
<div class="model-card">
<div class="icon"></div>
<div class="name">VibeVoice-Realtime-0.5B</div>
<div class="size">0.5 billion parameters · real-time TTS</div>
<div class="desc">
<ul style="list-style:none; padding:0;">
<li>Streaming text input</li>
<li>First-audio latency of ~300 ms</li>
<li>Supports long-form read-aloud</li>
<li>Suited to real-time conversation</li>
</ul>
</div>
</div>
</div>
<div class="grid">
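<!-- Technical highlights -->
<div class="card">
<h2>Core techniques</h2>
<table>
<tr><th>Technique</th><th>Description</th></tr>
<tr><td>Continuous speech tokenizers</td><td>Paired acoustic and semantic tokenizers at an ultra-low 7.5 Hz frame rate</td></tr>
<tr><td>Long-audio processing</td><td>60 minutes in a single pass, no chunking required</td></tr>
<tr><td>Speaker diarization</td><td>Automatically identifies who spoke, when, and what was said</td></tr>
<tr><td>Streaming inference</td><td>Generates speech while text is still arriving; ~300 ms to first audio</td></tr>
<tr><td>Hotword support</td><td>Custom domain terms improve recognition accuracy</td></tr>
</table>
</div>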
<!-- Comparison -->
<div class="card">
<h2>vs. comparable options</h2>
<table>
<tr><th>Dimension</th><th>Whisper</th><th>ElevenLabs</th><th>VibeVoice</th></tr>
<tr><td>ASR</td><td class="highlight">Yes</td><td class="warn">No</td><td class="highlight">Yes (stronger)</td></tr>
<tr><td>TTS</td><td class="warn">No</td><td class="highlight">Yes</td><td class="highlight">Yes</td></tr>
<tr><td>Real-time streaming</td><td class="warn">No</td><td class="highlight">Yes</td><td class="highlight">Yes</td></tr>
<tr><td>Speaker identification</td><td class="warn">No</td><td class="warn">No</td><td class="highlight">Built in</td></tr>
<tr><td>Long audio</td><td>Needs chunking</td><td>N/A</td><td class="highlight">60 minutes in one pass</td></tr>
<tr><td>Open source</td><td class="highlight">Yes</td><td class="warn">No</td><td class="highlight">MIT</td></tr>
<tr><td>Cost</td><td>Free</td><td class="warn">Pay-as-you-go</td><td>Free</td></tr>
</table>
</div>
</div>
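<div class="card">
<h2>Worked example: 7.5 Hz over a 60-minute file</h2>
<p>The 7.5 Hz frame rate and the 60-minute single-pass figure quoted above can be tied together with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: it assumes one token frame per tick of the frame rate, and the 50 Hz comparison rate is an illustrative assumption, not a figure from this page or from any specific model.</p>
<pre class="code-block">
```python
def frames_for(minutes: float, frame_rate_hz: float = 7.5) -> int:
    """Number of token frames for audio of the given length.

    Assumes exactly one frame per tick of the frame rate.
    """
    return round(minutes * 60 * frame_rate_hz)


# 60 minutes at the 7.5 Hz rate quoted on this page.
low_rate = frames_for(60)
# Same audio at an assumed 50 Hz rate, for comparison only.
high_rate = frames_for(60, frame_rate_hz=50.0)

print(low_rate, high_rate)
```
</pre>
<p>Under these assumptions an hour of audio comes to 27,000 frames at 7.5 Hz against 180,000 at the assumed 50 Hz, which is one way to see why a single pass over long audio becomes feasible.</p>
</div>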
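<!-- Use cases -->
<h2 style="color: #f97316; margin: 1.5rem 0 1rem;">Our use cases</h2>
<div class="three-col">
<div class="use-case">
<h3>Subtitle extraction for bar-exam prep videos</h3>
<p>9,553 bar-exam prep videos need subtitles extracted. VibeVoice-ASR handles 60 minutes in a single pass with automatic timestamps and speaker labels; combined with legal hotwords (such as "不当得利" (unjust enrichment) and "善意取得" (bona fide acquisition)), it should noticeably improve recognition of legal terminology.</p>
<span class="tag">High priority</span>
</div>
<div class="use-case">
<h3>Read-aloud for the 法海法考 app</h3>
<p>Use Realtime-0.5B to generate spoken versions of questions and their explanations, so learners can listen to the explanation while reading the question.</p>
<span class="tag">Medium priority</span>
</div>
<div class="use-case">
<h3>Multilingual introductions for 百陶会</h3>
<p>Use VibeVoice-1.5B to generate Chinese and English voice introductions for product pages; the 50+ language support covers overseas customers.</p>
<span class="tag">Low priority</span>
</div>
</div>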
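<div class="card">
<h2>Sketch: planning ASR passes for the video catalogue</h2>
<p>For the subtitle extraction use case, it helps to know in advance which videos fit within the 60-minute single-pass limit and which would need splitting. The sketch below is a planning aid under stated assumptions: the file names and durations are hypothetical, and the 60-minute limit is taken from this page rather than measured.</p>
<pre class="code-block">
```python
import math

MAX_SINGLE_PASS_MIN = 60  # single-pass limit quoted on this page


def passes_needed(duration_min: float, limit_min: float = MAX_SINGLE_PASS_MIN) -> int:
    """Number of ASR passes if a video is split into windows of at most limit_min."""
    return max(1, math.ceil(duration_min / limit_min))


# Hypothetical durations in minutes; the real catalogue has 9,553 videos.
durations = {
    "contract_law_01.mp4": 42.5,
    "civil_code_07.mp4": 60.0,
    "criminal_law_12.mp4": 95.0,
}

plan = {name: passes_needed(minutes) for name, minutes in durations.items()}
print(plan)
```
</pre>
<p>Videos that map to a single pass can be sent to the model whole; anything above that would need a splitting strategy, which this sketch does not cover.</p>
</div>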
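<!-- Code examples -->
<div class="card">
<h2>ASR usage example</h2>
<div class="code-block">
<span class="code-comment"># Install dependencies (shell)</span>
pip install transformers torch
<span class="code-comment"># ASR: speech to text with timestamps and speaker labels.</span>
<span class="code-comment"># Assumes the checkpoint loads through the generic pipeline; check the model card.</span>
from transformers import pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="microsoft/VibeVoice-ASR",
)
result = asr("lecture_60min.wav")
<span class="code-comment"># Intended output, one entry per segment: [{"speaker": "A", "start": 0.0, "end": 3.2, "text": "..."}, ...]</span>
</div>
</div>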
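<div class="card">
<h2>Sketch: turning ASR segments into SRT subtitles</h2>
<p>Since the goal is subtitle extraction, the segment list described in the ASR example would still need converting into a subtitle file. The sketch below shows one possible conversion to SRT using only the standard library. It assumes each segment is a dict with <code>speaker</code>, <code>start</code>, <code>end</code>, and <code>text</code> keys, as in the output comment above; the helper names and the sample segments are hypothetical.</p>
<pre class="code-block">
```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render diarized segments as numbered SRT cues, prefixed with the speaker."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n[{seg['speaker']}] {seg['text']}\n")
    return "\n".join(cues)


# Hypothetical segments in the shape described by the ASR example.
segments = [
    {"speaker": "A", "start": 0.0, "end": 3.2, "text": "Today we cover the Civil Code."},
    {"speaker": "B", "start": 3.2, "end": 5.75, "text": "Starting with unjust enrichment."},
]

print(segments_to_srt(segments))
```
</pre>
<p>Keeping the speaker label inside the cue text is a design choice for lecture videos with more than one voice; drop the prefix if the subtitles should carry text only.</p>
</div>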
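<div class="card">
<h2>TTS usage example</h2>
<div class="code-block">
<span class="code-comment"># Real-time TTS: text to speech (illustrative pseudocode).</span>
<span class="code-comment"># generate_stream() and play() are placeholders, not confirmed transformers APIs;</span>
<span class="code-comment"># check the VibeVoice repository for the supported inference entry points.</span>
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/VibeVoice-Realtime-0.5B"
)
<span class="code-comment"># Streaming generation, first audio after ~300 ms</span>
for audio_chunk in model.generate_stream("Today we will discuss the Civil Code..."):
    play(audio_chunk)
</div>
</div>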
<!-- Hardware -->
<div class="card">
<h2>Hardware requirements and local fit</h2>
<table>
<tr><th>Model</th><th>Memory required (VRAM)</th><th>Runs on M2 Max?</th></tr>
<tr><td>VibeVoice-ASR</td><td>~8 GB</td><td class="highlight">Yes (MPS acceleration)</td></tr>
<tr><td>VibeVoice-1.5B</td><td>~6 GB</td><td class="highlight">Yes</td></tr>
<tr><td>VibeVoice-Realtime-0.5B</td><td>~2 GB</td><td class="highlight">Yes</td></tr>
</table>
<p style="margin-top: 1rem; color: #4ade80; font-size: 0.9rem;">
The local M2 Max with 64 GB of memory meets the requirements of all three models.
</p>
</div>
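<div class="card">
<h2>Sketch: checking the memory budget</h2>
<p>The claim that the machine can run every model follows from the table above, and the arithmetic can be made explicit. The sketch below uses the approximate figures from the table; the 25% headroom reserved for the operating system and other processes is an assumption chosen for illustration, not a measured value.</p>
<pre class="code-block">
```python
# Approximate memory figures from the hardware table, in GB.
MODEL_MEMORY_GB = {
    "VibeVoice-ASR": 8,
    "VibeVoice-1.5B": 6,
    "VibeVoice-Realtime-0.5B": 2,
}
TOTAL_MEMORY_GB = 64  # M2 Max configuration named on this page


def fits(models, budget_gb, headroom=0.25):
    """True if the combined footprint stays within the budget minus reserved headroom.

    The headroom fraction is an assumed safety margin.
    """
    needed = sum(MODEL_MEMORY_GB[name] for name in models)
    return budget_gb * (1 - headroom) >= needed


total_gb = sum(MODEL_MEMORY_GB.values())
print(total_gb, fits(MODEL_MEMORY_GB, TOTAL_MEMORY_GB))
```
</pre>
<p>With these numbers all three models together need about 16 GB, well inside the 48 GB left after the assumed headroom, so they could in principle be loaded side by side.</p>
</div>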
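<!-- Verdict -->
<div class="verdict">
<h2>Verdict: highly practical</h2>
<p>An open-source package that combines ASR, TTS, and real-time speech under the MIT license, with no restrictions on commercial use. The ASR model's 60-minute long-form input with speaker identification is its real differentiator. It runs directly on the local M2 Max, so no GPU server is needed, and it is directly useful for the bar-exam subtitle extraction project.</p>
</div>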
<!-- Links -->
<div class="links">
<a href="https://github.com/microsoft/VibeVoice" target="_blank" class="link-gh">GitHub source</a>
<a href="https://huggingface.co/microsoft/VibeVoice-ASR" target="_blank" class="link-hf">ASR model</a>
<a href="https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B" target="_blank" class="link-hf">Realtime model</a>
<a href="https://microsoft.github.io/VibeVoice/" target="_blank" class="link-doc">Official documentation</a>
</div>
<footer>
Research project · Started 2026-03-31 · Source cloned to ./source/
</footer>
</div>
</body>
</html>