auto-save 2026-04-01 09:03 (+8, ~2)
This commit is contained in:
10
.env.example
Normal file
10
.env.example
Normal file
@@ -0,0 +1,10 @@
|
||||
# Device
|
||||
DEVICE_SERIAL= # leave empty for auto-detect
|
||||
|
||||
# VLM Provider: poe / openrouter / local
|
||||
VLM_PROVIDER=poe
|
||||
VLM_MODEL=Qwen/Qwen2.5-VL-7B-Instruct
|
||||
|
||||
# API Keys (fill the one matching your provider)
|
||||
POE_API_KEY=
|
||||
OPENROUTER_API_KEY=
|
||||
4
.gitignore
vendored
4
.gitignore
vendored
@@ -10,3 +10,7 @@ __pycache__/
|
||||
.vscode/
|
||||
.idea/
|
||||
*.log
|
||||
data/screenshots/
|
||||
*.egg-info/
|
||||
.venv/
|
||||
venv/
|
||||
|
||||
77
.memory/project-status.md
Normal file
77
.memory/project-status.md
Normal file
@@ -0,0 +1,77 @@
|
||||
---
|
||||
name: GUI Agent 项目状态
|
||||
description: 手机GUI Agent项目当前进度、技术决策和待确认事项
|
||||
type: project
|
||||
---
|
||||
|
||||
## 项目状态:端到端已跑通 + 手机端 OCR 已部署
|
||||
|
||||
### 设备信息
|
||||
- **华为 P40 Pro**(ELS-AN00)
|
||||
- 序列号:UQG5T20416000119
|
||||
- 分辨率:1200x2640
|
||||
- 系统:HarmonyOS 4.x(兼容安卓层,ADB 可用)
|
||||
- ADB 路径:`/opt/homebrew/bin/adb`
|
||||
- 连接注意:华为手机需在开发者选项中额外打开"仅充电模式下允许ADB调试"
|
||||
- **已开启「通过USB安装应用」权限**(2026-03-29)
|
||||
|
||||
### 已完成
|
||||
- 七层管线骨架代码(L1-L7)全部就位
|
||||
- Web 控制台(FastAPI + 暗色主题 UI)已验证可运行
|
||||
- 端口 4380,VLM 默认走 Poe API
|
||||
- 支持 8 种动作类型(tap/swipe/type/long_press/back/home/scroll/wait)
|
||||
- Agent 主循环含历史记忆(最近 5 步)和连续错误自动停止
|
||||
- **ADB 截屏已验证通过**(2026-03-29)
|
||||
- **Mac 端 OCR 元素定位已验证**(2026-03-29)— easyocr 中文识别,返回像素坐标
|
||||
- **中文文本输入已验证**(2026-03-29)— uiautomator2 send_keys
|
||||
- **端到端发微信消息已跑通 3 次**(2026-03-29)— "你是大聪明"、"祝你生日快乐"、"生日快乐"
|
||||
- **手机端 OCR Service APK 已部署**(2026-03-29)— ML Kit Chinese bundled,端口 18900
|
||||
|
||||
### 手机端 OCR Service(android-ocr-service/)
|
||||
- **引擎**:Google ML Kit text-recognition-chinese(bundled 版,不依赖 GMS,华为可用)
|
||||
- **架构**:Kotlin APK = OcrEngine + NanoHTTPD(18900) + ForegroundService
|
||||
- **接口**:
|
||||
- `GET /health` — 健康检查
|
||||
- `GET /ocr?path=/data/local/tmp/s.png` — 读文件 OCR
|
||||
- `GET /ocr?path=...&text=微信` — 按文本过滤
|
||||
- `POST /snap` — POST 图片字节直接 OCR(NanoHTTPD 二进制处理有 bug,待修)
|
||||
- **使用流程**:
|
||||
```bash
|
||||
adb shell am start -n com.guiagent.ocr/.MainActivity
|
||||
adb forward tcp:18900 tcp:18900
|
||||
adb shell "screencap -p /data/local/tmp/s.png"
|
||||
curl http://localhost:18900/ocr?path=/data/local/tmp/s.png
|
||||
```
|
||||
- **性能**:首次 ~2.4s(模型加载),后续 ~1.8s/次
|
||||
- **构建**:`ANDROID_HOME=/opt/homebrew/share/android-commandlinetools JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.10/libexec/openjdk.jdk/Contents/Home ./gradlew assembleDebug`
|
||||
|
||||
### 关键技术决策
|
||||
| 能力 | 方案 | 备注 |
|
||||
|------|------|------|
|
||||
| 元素定位(Mac) | easyocr | pytesseract 中文分词差,uiautomator dump 在华为微信上返回空 |
|
||||
| 元素定位(手机端) | ML Kit Chinese (bundled) | 不依赖 GMS/HMS,APK 自带模型 |
|
||||
| 中文输入 | uiautomator2 send_keys | 需装辅助 APK,华为需开 USB 安装权限 |
|
||||
| 截屏 | `adb shell screencap -p /data/local/tmp/s.png` | 不经 FUSE,比 /sdcard/ 快 |
|
||||
| adb input text | 不支持中文 | NullPointerException,clipboard 也不可用 |
|
||||
| 截屏显示 | 必须 sips -Z 1800 缩小 | 原始 1200x2640 超 Claude 2000px 限制 |
|
||||
|
||||
### 已知问题
|
||||
1. OCR 偶尔误读("康"→"東")— ML Kit 和 easyocr 都有此问题
|
||||
2. POST /snap 端点 NanoHTTPD 二进制 body 解析 bug — 文件方式 workaround
|
||||
3. 微信双开弹选择框 — 每次 am start 会弹"使用以下方式打开"
|
||||
4. 发送按钮白字绿底 OCR 不稳定 — 用坐标 (1008, 2425) 或 OCR "(田发送"
|
||||
|
||||
### 下一步(周一继续)
|
||||
1. **速度优化**:发送按钮固定坐标不走 OCR(省2s),缩短 sleep(省2s),目标 5-6s/操作
|
||||
2. **OCR 推理优化**:缩图再识别 / NNAPI 加速,目标 <1s
|
||||
3. **集成到 Agent 主循环**:device OCR 引擎接入 ocr_grounding.py
|
||||
4. 配置 .env(Poe API Key)
|
||||
5. 接入 VLM(Poe API 调 Qwen2.5-VL)— 复杂场景屏幕理解
|
||||
6. 端到端跑通复杂多步任务(滑动、长按、跨 App)
|
||||
7. 完善验证纠错层
|
||||
|
||||
### 技术背景
|
||||
项目灵感来自对字节 UI-TARS / 豆包手机的深度调研。结论:
|
||||
- UI-TARS 开源的是权重+推理壳,训练代码和系统级操控完全闭源
|
||||
- 核心壁垒不是模型,是"截屏→理解→定位→规划→执行→验证"的全链路
|
||||
- 本项目目标:用开源 VLM + ADB 复现这个全链路
|
||||
44
RULES.md
44
RULES.md
@@ -1,17 +1,43 @@
|
||||
# 手机 GUI Agent 自动操控
|
||||
|
||||
## 启动
|
||||
- `待补充` — 端口 4380
|
||||
## 架构
|
||||
|
||||
## 部署
|
||||
- 平台:待定
|
||||
- 域名:待定
|
||||
七层管线闭环:截屏 → 理解 → 定位 → 规划 → 执行 → 验证 → 循环
|
||||
|
||||
```
|
||||
src/
|
||||
├── capture/ # L1 - ADB/scrcpy 截屏
|
||||
├── vision/ # L2 - VLM 屏幕理解
|
||||
├── grounding/ # L3 - 元素定位(自然语言→坐标)
|
||||
├── planner/ # L4 - 任务规划与分解
|
||||
├── executor/ # L5 - ADB 动作执行
|
||||
└── verifier/ # L6+L7 - 验证纠错 + 状态记忆
|
||||
```
|
||||
|
||||
## 启动
|
||||
|
||||
- `python -m src.main` — 主服务,端口 4380
|
||||
- `python scripts/test_device.py` — 测试 ADB 连接
|
||||
|
||||
## 技术栈
|
||||
|
||||
- Python 3.11+
|
||||
- ADB + scrcpy(截屏与操控)
|
||||
- Qwen2.5-VL / UI-TARS-1.5(视觉理解)
|
||||
- FastAPI(Web 控制台)
|
||||
- Poe API / OpenRouter(LLM 调用,按用户偏好)
|
||||
|
||||
## 环境变量
|
||||
- 待补充
|
||||
|
||||
- `DEVICE_SERIAL` — Android 设备序列号(adb devices 查看)
|
||||
- `VLM_PROVIDER` — vlm 提供者:`local` / `poe` / `openrouter`
|
||||
- `VLM_MODEL` — 模型名,默认 `Qwen/Qwen2.5-VL-7B-Instruct`
|
||||
- `POE_API_KEY` — Poe API Key(VLM_PROVIDER=poe 时必填)
|
||||
- `OPENROUTER_API_KEY` — OpenRouter Key(备用)
|
||||
|
||||
## 规则
|
||||
- 待补充
|
||||
|
||||
## 注意事项
|
||||
- 待补充
|
||||
- 截屏用 adb exec-out screencap,不用 scrcpy 录屏流(省资源)
|
||||
- 动作执行后必须等待 + 重新截屏验证
|
||||
- 所有截屏保存到 `data/screenshots/` 供调试
|
||||
- 坐标系统统一为百分比 (0-1),执行时再转设备像素
|
||||
|
||||
BIN
android-ocr-service/.gradle/8.5/checksums/checksums.lock
Normal file
BIN
android-ocr-service/.gradle/8.5/checksums/checksums.lock
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/8.5/checksums/md5-checksums.bin
Normal file
BIN
android-ocr-service/.gradle/8.5/checksums/md5-checksums.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/8.5/checksums/sha1-checksums.bin
Normal file
BIN
android-ocr-service/.gradle/8.5/checksums/sha1-checksums.bin
Normal file
Binary file not shown.
Binary file not shown.
BIN
android-ocr-service/.gradle/8.5/fileChanges/last-build.bin
Normal file
BIN
android-ocr-service/.gradle/8.5/fileChanges/last-build.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/8.5/fileHashes/fileHashes.lock
Normal file
BIN
android-ocr-service/.gradle/8.5/fileHashes/fileHashes.lock
Normal file
Binary file not shown.
0
android-ocr-service/.gradle/8.5/gc.properties
Normal file
0
android-ocr-service/.gradle/8.5/gc.properties
Normal file
BIN
android-ocr-service/.gradle/8.7/checksums/checksums.lock
Normal file
BIN
android-ocr-service/.gradle/8.7/checksums/checksums.lock
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/8.7/checksums/md5-checksums.bin
Normal file
BIN
android-ocr-service/.gradle/8.7/checksums/md5-checksums.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/8.7/checksums/sha1-checksums.bin
Normal file
BIN
android-ocr-service/.gradle/8.7/checksums/sha1-checksums.bin
Normal file
Binary file not shown.
Binary file not shown.
Binary file not shown.
BIN
android-ocr-service/.gradle/8.7/fileChanges/last-build.bin
Normal file
BIN
android-ocr-service/.gradle/8.7/fileChanges/last-build.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/8.7/fileHashes/fileHashes.bin
Normal file
BIN
android-ocr-service/.gradle/8.7/fileHashes/fileHashes.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/8.7/fileHashes/fileHashes.lock
Normal file
BIN
android-ocr-service/.gradle/8.7/fileHashes/fileHashes.lock
Normal file
Binary file not shown.
Binary file not shown.
0
android-ocr-service/.gradle/8.7/gc.properties
Normal file
0
android-ocr-service/.gradle/8.7/gc.properties
Normal file
BIN
android-ocr-service/.gradle/9.4.1/checksums/checksums.lock
Normal file
BIN
android-ocr-service/.gradle/9.4.1/checksums/checksums.lock
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/9.4.1/checksums/md5-checksums.bin
Normal file
BIN
android-ocr-service/.gradle/9.4.1/checksums/md5-checksums.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/9.4.1/checksums/sha1-checksums.bin
Normal file
BIN
android-ocr-service/.gradle/9.4.1/checksums/sha1-checksums.bin
Normal file
Binary file not shown.
Binary file not shown.
Binary file not shown.
BIN
android-ocr-service/.gradle/9.4.1/fileChanges/last-build.bin
Normal file
BIN
android-ocr-service/.gradle/9.4.1/fileChanges/last-build.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/9.4.1/fileHashes/fileHashes.bin
Normal file
BIN
android-ocr-service/.gradle/9.4.1/fileHashes/fileHashes.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/9.4.1/fileHashes/fileHashes.lock
Normal file
BIN
android-ocr-service/.gradle/9.4.1/fileHashes/fileHashes.lock
Normal file
Binary file not shown.
0
android-ocr-service/.gradle/9.4.1/gc.properties
Normal file
0
android-ocr-service/.gradle/9.4.1/gc.properties
Normal file
Binary file not shown.
@@ -0,0 +1,2 @@
|
||||
#Sun Mar 29 02:14:23 CST 2026
|
||||
gradle.version=8.7
|
||||
BIN
android-ocr-service/.gradle/buildOutputCleanup/outputFiles.bin
Normal file
BIN
android-ocr-service/.gradle/buildOutputCleanup/outputFiles.bin
Normal file
Binary file not shown.
BIN
android-ocr-service/.gradle/file-system.probe
Normal file
BIN
android-ocr-service/.gradle/file-system.probe
Normal file
Binary file not shown.
0
android-ocr-service/.gradle/vcs-1/gc.properties
Normal file
0
android-ocr-service/.gradle/vcs-1/gc.properties
Normal file
43
android-ocr-service/app/build.gradle.kts
Normal file
43
android-ocr-service/app/build.gradle.kts
Normal file
@@ -0,0 +1,43 @@
|
||||
plugins {
|
||||
id("com.android.application")
|
||||
id("org.jetbrains.kotlin.android")
|
||||
}
|
||||
|
||||
android {
|
||||
namespace = "com.guiagent.ocr"
|
||||
compileSdk = 31
|
||||
|
||||
defaultConfig {
|
||||
applicationId = "com.guiagent.ocr"
|
||||
minSdk = 26
|
||||
targetSdk = 31
|
||||
versionCode = 1
|
||||
versionName = "1.0"
|
||||
}
|
||||
|
||||
buildTypes {
|
||||
release {
|
||||
isMinifyEnabled = false
|
||||
}
|
||||
}
|
||||
|
||||
compileOptions {
|
||||
sourceCompatibility = JavaVersion.VERSION_1_8
|
||||
targetCompatibility = JavaVersion.VERSION_1_8
|
||||
}
|
||||
|
||||
kotlinOptions {
|
||||
jvmTarget = "1.8"
|
||||
}
|
||||
}
|
||||
|
||||
dependencies {
|
||||
// ML Kit Text Recognition - bundled model (no GMS needed!)
|
||||
implementation("com.google.mlkit:text-recognition-chinese:16.0.0")
|
||||
|
||||
// HTTP server
|
||||
implementation("org.nanohttpd:nanohttpd:2.3.1")
|
||||
|
||||
// JSON
|
||||
implementation("com.google.code.gson:gson:2.10.1")
|
||||
}
|
||||
28
android-ocr-service/app/src/main/AndroidManifest.xml
Normal file
28
android-ocr-service/app/src/main/AndroidManifest.xml
Normal file
@@ -0,0 +1,28 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
|
||||
|
||||
<uses-permission android:name="android.permission.READ_EXTERNAL_STORAGE"/>
|
||||
<uses-permission android:name="android.permission.INTERNET"/>
|
||||
<uses-permission android:name="android.permission.FOREGROUND_SERVICE"/>
|
||||
|
||||
<application
|
||||
android:allowBackup="false"
|
||||
android:label="OCR Service"
|
||||
android:supportsRtl="true">
|
||||
|
||||
<activity
|
||||
android:name=".MainActivity"
|
||||
android:exported="true">
|
||||
<intent-filter>
|
||||
<action android:name="android.intent.action.MAIN"/>
|
||||
<category android:name="android.intent.category.LAUNCHER"/>
|
||||
</intent-filter>
|
||||
</activity>
|
||||
|
||||
<service
|
||||
android:name=".OcrService"
|
||||
android:exported="true"
|
||||
android:foregroundServiceType="dataSync"/>
|
||||
|
||||
</application>
|
||||
</manifest>
|
||||
@@ -0,0 +1,23 @@
|
||||
package com.guiagent.ocr
|
||||
|
||||
import android.app.Activity
|
||||
import android.content.Intent
|
||||
import android.os.Bundle
|
||||
import android.widget.TextView
|
||||
|
||||
class MainActivity : Activity() {
|
||||
override fun onCreate(savedInstanceState: Bundle?) {
|
||||
super.onCreate(savedInstanceState)
|
||||
val tv = TextView(this).apply {
|
||||
text = "OCR Service\nPort: 18900\nStarting..."
|
||||
textSize = 20f
|
||||
setPadding(40, 40, 40, 40)
|
||||
}
|
||||
setContentView(tv)
|
||||
|
||||
// Start the service
|
||||
val intent = Intent(this, OcrService::class.java)
|
||||
startForegroundService(intent)
|
||||
tv.text = "OCR Service\nPort: 18900\nRunning!"
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,79 @@
|
||||
package com.guiagent.ocr
|
||||
|
||||
import android.graphics.Bitmap
|
||||
import android.graphics.BitmapFactory
|
||||
import com.google.mlkit.vision.common.InputImage
|
||||
import com.google.mlkit.vision.text.TextRecognition
|
||||
import com.google.mlkit.vision.text.chinese.ChineseTextRecognizerOptions
|
||||
import java.io.File
|
||||
import java.util.concurrent.CountDownLatch
|
||||
import java.util.concurrent.TimeUnit
|
||||
|
||||
data class TextBox(
|
||||
val text: String,
|
||||
val x: Int,
|
||||
val y: Int,
|
||||
val w: Int,
|
||||
val h: Int,
|
||||
val confidence: Float
|
||||
) {
|
||||
val cx get() = x + w / 2
|
||||
val cy get() = y + h / 2
|
||||
}
|
||||
|
||||
object OcrEngine {
|
||||
|
||||
private val recognizer by lazy {
|
||||
TextRecognition.getClient(ChineseTextRecognizerOptions.Builder().build())
|
||||
}
|
||||
|
||||
fun recognize(imagePath: String): List<TextBox> {
|
||||
val file = File(imagePath)
|
||||
if (!file.exists()) return emptyList()
|
||||
val bitmap = BitmapFactory.decodeFile(imagePath) ?: return emptyList()
|
||||
return recognizeBitmap(bitmap)
|
||||
}
|
||||
|
||||
/** 直接截屏并识别,不落盘 */
|
||||
fun screencapAndRecognize(): List<TextBox> {
|
||||
val process = Runtime.getRuntime().exec("screencap -p")
|
||||
val bytes = process.inputStream.readBytes()
|
||||
process.waitFor()
|
||||
if (bytes.isEmpty()) return emptyList()
|
||||
val bitmap = BitmapFactory.decodeByteArray(bytes, 0, bytes.size) ?: return emptyList()
|
||||
return recognizeBitmap(bitmap)
|
||||
}
|
||||
|
||||
fun recognizeBitmap(bitmap: Bitmap): List<TextBox> {
|
||||
val image = InputImage.fromBitmap(bitmap, 0)
|
||||
val results = mutableListOf<TextBox>()
|
||||
val latch = CountDownLatch(1)
|
||||
|
||||
recognizer.process(image)
|
||||
.addOnSuccessListener { visionText ->
|
||||
for (block in visionText.textBlocks) {
|
||||
for (line in block.lines) {
|
||||
val box = line.boundingBox ?: continue
|
||||
results.add(
|
||||
TextBox(
|
||||
text = line.text,
|
||||
x = box.left,
|
||||
y = box.top,
|
||||
w = box.width(),
|
||||
h = box.height(),
|
||||
confidence = line.confidence ?: 0.8f
|
||||
)
|
||||
)
|
||||
}
|
||||
}
|
||||
latch.countDown()
|
||||
}
|
||||
.addOnFailureListener {
|
||||
latch.countDown()
|
||||
}
|
||||
|
||||
latch.await(10, TimeUnit.SECONDS)
|
||||
bitmap.recycle()
|
||||
return results
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,88 @@
|
||||
package com.guiagent.ocr
|
||||
|
||||
import android.graphics.BitmapFactory
|
||||
import com.google.gson.Gson
|
||||
import fi.iki.elonen.NanoHTTPD
|
||||
import java.io.ByteArrayOutputStream
|
||||
|
||||
class OcrHttpServer(port: Int = 18900) : NanoHTTPD(port) {
|
||||
|
||||
private val gson = Gson()
|
||||
private val defaultPath = "/sdcard/ocr_screen.png"
|
||||
|
||||
override fun serve(session: IHTTPSession): Response {
|
||||
return when (session.uri) {
|
||||
"/ocr" -> handleOcr(session)
|
||||
"/snap" -> handleSnap(session)
|
||||
"/health" -> jsonResponse(mapOf("status" to "ok", "engine" to "mlkit-chinese"))
|
||||
else -> newFixedLengthResponse(Response.Status.NOT_FOUND, MIME_PLAINTEXT, "404")
|
||||
}
|
||||
}
|
||||
|
||||
/** 读文件方式 OCR */
|
||||
private fun handleOcr(session: IHTTPSession): Response {
|
||||
val params = session.parms ?: emptyMap()
|
||||
val imagePath = params["path"] ?: defaultPath
|
||||
return doOcr(params["text"]) { OcrEngine.recognize(imagePath) }
|
||||
}
|
||||
|
||||
/** POST 图片数据直接 OCR,不存文件 */
|
||||
private fun handleSnap(session: IHTTPSession): Response {
|
||||
val params = session.parms ?: emptyMap()
|
||||
|
||||
if (session.method == Method.POST) {
|
||||
// NanoHTTPD parseBody 将 binary data 存到临时文件
|
||||
val bodyFiles = HashMap<String, String>()
|
||||
session.parseBody(bodyFiles)
|
||||
|
||||
// postData 键对应临时文件路径
|
||||
val tmpPath = bodyFiles["postData"]
|
||||
if (tmpPath != null) {
|
||||
val imageBytes = java.io.File(tmpPath).readBytes()
|
||||
val bitmap = BitmapFactory.decodeByteArray(imageBytes, 0, imageBytes.size)
|
||||
if (bitmap != null) {
|
||||
return doOcr(params["text"]) { OcrEngine.recognizeBitmap(bitmap) }
|
||||
}
|
||||
return jsonResponse(mapOf("error" to "decode failed", "size" to imageBytes.size, "count" to 0))
|
||||
}
|
||||
return jsonResponse(mapOf("error" to "no body received", "count" to 0))
|
||||
}
|
||||
|
||||
// GET: 读文件方式 fallback
|
||||
return handleOcr(session)
|
||||
}
|
||||
|
||||
private fun doOcr(query: String?, recognize: () -> List<TextBox>): Response {
|
||||
val startTime = System.currentTimeMillis()
|
||||
var results = recognize()
|
||||
|
||||
if (!query.isNullOrBlank()) {
|
||||
results = results.filter { it.text.contains(query) }
|
||||
}
|
||||
|
||||
val elapsed = System.currentTimeMillis() - startTime
|
||||
|
||||
val response = mapOf(
|
||||
"results" to results.map { box ->
|
||||
mapOf(
|
||||
"text" to box.text,
|
||||
"x" to box.x,
|
||||
"y" to box.y,
|
||||
"w" to box.w,
|
||||
"h" to box.h,
|
||||
"cx" to box.cx,
|
||||
"cy" to box.cy,
|
||||
"confidence" to box.confidence
|
||||
)
|
||||
},
|
||||
"count" to results.size,
|
||||
"elapsed_ms" to elapsed
|
||||
)
|
||||
return jsonResponse(response)
|
||||
}
|
||||
|
||||
private fun jsonResponse(data: Any): Response {
|
||||
val json = gson.toJson(data)
|
||||
return newFixedLengthResponse(Response.Status.OK, "application/json", json)
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,49 @@
|
||||
package com.guiagent.ocr
|
||||
|
||||
import android.app.*
|
||||
import android.content.Intent
|
||||
import android.os.Build
|
||||
import android.os.IBinder
|
||||
import android.util.Log
|
||||
|
||||
class OcrService : Service() {
|
||||
|
||||
private var server: OcrHttpServer? = null
|
||||
private val TAG = "OcrService"
|
||||
private val PORT = 18900
|
||||
|
||||
override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
|
||||
startForegroundNotification()
|
||||
|
||||
if (server == null) {
|
||||
server = OcrHttpServer(PORT).also {
|
||||
it.start()
|
||||
Log.i(TAG, "OCR HTTP server started on port $PORT")
|
||||
}
|
||||
}
|
||||
return START_STICKY
|
||||
}
|
||||
|
||||
override fun onDestroy() {
|
||||
server?.stop()
|
||||
server = null
|
||||
Log.i(TAG, "OCR HTTP server stopped")
|
||||
super.onDestroy()
|
||||
}
|
||||
|
||||
override fun onBind(intent: Intent?): IBinder? = null
|
||||
|
||||
private fun startForegroundNotification() {
|
||||
val channelId = "ocr_service"
|
||||
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.O) {
|
||||
val channel = NotificationChannel(channelId, "OCR Service", NotificationManager.IMPORTANCE_LOW)
|
||||
getSystemService(NotificationManager::class.java).createNotificationChannel(channel)
|
||||
}
|
||||
val notification = Notification.Builder(this, channelId)
|
||||
.setContentTitle("OCR Service")
|
||||
.setContentText("Running on port $PORT")
|
||||
.setSmallIcon(android.R.drawable.ic_menu_camera)
|
||||
.build()
|
||||
startForeground(1, notification)
|
||||
}
|
||||
}
|
||||
4
android-ocr-service/app/src/main/res/values/strings.xml
Normal file
4
android-ocr-service/app/src/main/res/values/strings.xml
Normal file
@@ -0,0 +1,4 @@
|
||||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<resources>
|
||||
<string name="app_name">OCR Service</string>
|
||||
</resources>
|
||||
4
android-ocr-service/build.gradle.kts
Normal file
4
android-ocr-service/build.gradle.kts
Normal file
@@ -0,0 +1,4 @@
|
||||
plugins {
|
||||
id("com.android.application") version "8.5.1" apply false
|
||||
id("org.jetbrains.kotlin.android") version "2.0.0" apply false
|
||||
}
|
||||
3
android-ocr-service/gradle.properties
Normal file
3
android-ocr-service/gradle.properties
Normal file
@@ -0,0 +1,3 @@
|
||||
org.gradle.jvmargs=-Xmx2048m
|
||||
android.useAndroidX=true
|
||||
kotlin.code.style=official
|
||||
BIN
android-ocr-service/gradle/wrapper/gradle-wrapper.jar
vendored
Normal file
BIN
android-ocr-service/gradle/wrapper/gradle-wrapper.jar
vendored
Normal file
Binary file not shown.
7
android-ocr-service/gradle/wrapper/gradle-wrapper.properties
vendored
Normal file
7
android-ocr-service/gradle/wrapper/gradle-wrapper.properties
vendored
Normal file
@@ -0,0 +1,7 @@
|
||||
distributionBase=GRADLE_USER_HOME
|
||||
distributionPath=wrapper/dists
|
||||
distributionUrl=https\://services.gradle.org/distributions/gradle-8.7-bin.zip
|
||||
networkTimeout=10000
|
||||
validateDistributionUrl=true
|
||||
zipStoreBase=GRADLE_USER_HOME
|
||||
zipStorePath=wrapper/dists
|
||||
249
android-ocr-service/gradlew
vendored
Executable file
249
android-ocr-service/gradlew
vendored
Executable file
@@ -0,0 +1,249 @@
|
||||
#!/bin/sh
|
||||
|
||||
#
|
||||
# Copyright © 2015-2021 the original authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# https://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
|
||||
##############################################################################
|
||||
#
|
||||
# Gradle start up script for POSIX generated by Gradle.
|
||||
#
|
||||
# Important for running:
|
||||
#
|
||||
# (1) You need a POSIX-compliant shell to run this script. If your /bin/sh is
|
||||
# noncompliant, but you have some other compliant shell such as ksh or
|
||||
# bash, then to run this script, type that shell name before the whole
|
||||
# command line, like:
|
||||
#
|
||||
# ksh Gradle
|
||||
#
|
||||
# Busybox and similar reduced shells will NOT work, because this script
|
||||
# requires all of these POSIX shell features:
|
||||
# * functions;
|
||||
# * expansions «$var», «${var}», «${var:-default}», «${var+SET}»,
|
||||
# «${var#prefix}», «${var%suffix}», and «$( cmd )»;
|
||||
# * compound commands having a testable exit status, especially «case»;
|
||||
# * various built-in commands including «command», «set», and «ulimit».
|
||||
#
|
||||
# Important for patching:
|
||||
#
|
||||
# (2) This script targets any POSIX shell, so it avoids extensions provided
|
||||
# by Bash, Ksh, etc; in particular arrays are avoided.
|
||||
#
|
||||
# The "traditional" practice of packing multiple parameters into a
|
||||
# space-separated string is a well documented source of bugs and security
|
||||
# problems, so this is (mostly) avoided, by progressively accumulating
|
||||
# options in "$@", and eventually passing that to Java.
|
||||
#
|
||||
# Where the inherited environment variables (DEFAULT_JVM_OPTS, JAVA_OPTS,
|
||||
# and GRADLE_OPTS) rely on word-splitting, this is performed explicitly;
|
||||
# see the in-line comments for details.
|
||||
#
|
||||
# There are tweaks for specific operating systems such as AIX, CygWin,
|
||||
# Darwin, MinGW, and NonStop.
|
||||
#
|
||||
# (3) This script is generated from the Groovy template
|
||||
# https://github.com/gradle/gradle/blob/HEAD/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
|
||||
# within the Gradle project.
|
||||
#
|
||||
# You can find Gradle at https://github.com/gradle/gradle/.
|
||||
#
|
||||
##############################################################################
|
||||
|
||||
# Attempt to set APP_HOME
|
||||
|
||||
# Resolve links: $0 may be a link
|
||||
app_path=$0
|
||||
|
||||
# Need this for daisy-chained symlinks.
|
||||
while
|
||||
APP_HOME=${app_path%"${app_path##*/}"} # leaves a trailing /; empty if no leading path
|
||||
[ -h "$app_path" ]
|
||||
do
|
||||
ls=$( ls -ld "$app_path" )
|
||||
link=${ls#*' -> '}
|
||||
case $link in #(
|
||||
/*) app_path=$link ;; #(
|
||||
*) app_path=$APP_HOME$link ;;
|
||||
esac
|
||||
done
|
||||
|
||||
# This is normally unused
|
||||
# shellcheck disable=SC2034
|
||||
APP_BASE_NAME=${0##*/}
|
||||
# Discard cd standard output in case $CDPATH is set (https://github.com/gradle/gradle/issues/25036)
|
||||
APP_HOME=$( cd "${APP_HOME:-./}" > /dev/null && pwd -P ) || exit
|
||||
|
||||
# Use the maximum available, or set MAX_FD != -1 to use that value.
|
||||
MAX_FD=maximum
|
||||
|
||||
warn () {
|
||||
echo "$*"
|
||||
} >&2
|
||||
|
||||
die () {
|
||||
echo
|
||||
echo "$*"
|
||||
echo
|
||||
exit 1
|
||||
} >&2
|
||||
|
||||
# OS specific support (must be 'true' or 'false').
|
||||
cygwin=false
|
||||
msys=false
|
||||
darwin=false
|
||||
nonstop=false
|
||||
case "$( uname )" in #(
|
||||
CYGWIN* ) cygwin=true ;; #(
|
||||
Darwin* ) darwin=true ;; #(
|
||||
MSYS* | MINGW* ) msys=true ;; #(
|
||||
NONSTOP* ) nonstop=true ;;
|
||||
esac
|
||||
|
||||
CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar
|
||||
|
||||
|
||||
# Determine the Java command to use to start the JVM.
|
||||
if [ -n "$JAVA_HOME" ] ; then
|
||||
if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
|
||||
# IBM's JDK on AIX uses strange locations for the executables
|
||||
JAVACMD=$JAVA_HOME/jre/sh/java
|
||||
else
|
||||
JAVACMD=$JAVA_HOME/bin/java
|
||||
fi
|
||||
if [ ! -x "$JAVACMD" ] ; then
|
||||
die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
|
||||
|
||||
Please set the JAVA_HOME variable in your environment to match the
|
||||
location of your Java installation."
|
||||
fi
|
||||
else
|
||||
JAVACMD=java
|
||||
if ! command -v java >/dev/null 2>&1
|
||||
then
|
||||
die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
|
||||
|
||||
Please set the JAVA_HOME variable in your environment to match the
|
||||
location of your Java installation."
|
||||
fi
|
||||
fi
|
||||
|
||||
# Increase the maximum file descriptors if we can.
|
||||
if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
|
||||
case $MAX_FD in #(
|
||||
max*)
|
||||
# In POSIX sh, ulimit -H is undefined. That's why the result is checked to see if it worked.
|
||||
# shellcheck disable=SC2039,SC3045
|
||||
MAX_FD=$( ulimit -H -n ) ||
|
||||
warn "Could not query maximum file descriptor limit"
|
||||
esac
|
||||
case $MAX_FD in #(
|
||||
'' | soft) :;; #(
|
||||
*)
|
||||
# In POSIX sh, ulimit -n is undefined. That's why the result is checked to see if it worked.
|
||||
# shellcheck disable=SC2039,SC3045
|
||||
ulimit -n "$MAX_FD" ||
|
||||
warn "Could not set maximum file descriptor limit to $MAX_FD"
|
||||
esac
|
||||
fi
|
||||
|
||||
# Collect all arguments for the java command, stacking in reverse order:
|
||||
# * args from the command line
|
||||
# * the main class name
|
||||
# * -classpath
|
||||
# * -D...appname settings
|
||||
# * --module-path (only if needed)
|
||||
# * DEFAULT_JVM_OPTS, JAVA_OPTS, and GRADLE_OPTS environment variables.
|
||||
|
||||
# For Cygwin or MSYS, switch paths to Windows format before running java
|
||||
if "$cygwin" || "$msys" ; then
|
||||
APP_HOME=$( cygpath --path --mixed "$APP_HOME" )
|
||||
CLASSPATH=$( cygpath --path --mixed "$CLASSPATH" )
|
||||
|
||||
JAVACMD=$( cygpath --unix "$JAVACMD" )
|
||||
|
||||
# Now convert the arguments - kludge to limit ourselves to /bin/sh
|
||||
for arg do
|
||||
if
|
||||
case $arg in #(
|
||||
-*) false ;; # don't mess with options #(
|
||||
/?*) t=${arg#/} t=/${t%%/*} # looks like a POSIX filepath
|
||||
[ -e "$t" ] ;; #(
|
||||
*) false ;;
|
||||
esac
|
||||
then
|
||||
arg=$( cygpath --path --ignore --mixed "$arg" )
|
||||
fi
|
||||
# Roll the args list around exactly as many times as the number of
|
||||
# args, so each arg winds up back in the position where it started, but
|
||||
# possibly modified.
|
||||
#
|
||||
# NB: a `for` loop captures its iteration list before it begins, so
|
||||
# changing the positional parameters here affects neither the number of
|
||||
# iterations, nor the values presented in `arg`.
|
||||
shift # remove old arg
|
||||
set -- "$@" "$arg" # push replacement arg
|
||||
done
|
||||
fi
|
||||
|
||||
|
||||
# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
|
||||
DEFAULT_JVM_OPTS='-Dfile.encoding=UTF-8 "-Xmx64m" "-Xms64m"'
|
||||
|
||||
# Collect all arguments for the java command:
|
||||
# * DEFAULT_JVM_OPTS, JAVA_OPTS, JAVA_OPTS, and optsEnvironmentVar are not allowed to contain shell fragments,
|
||||
# and any embedded shellness will be escaped.
|
||||
# * For example: A user cannot expect ${Hostname} to be expanded, as it is an environment variable and will be
|
||||
# treated as '${Hostname}' itself on the command line.
|
||||
|
||||
set -- \
|
||||
"-Dorg.gradle.appname=$APP_BASE_NAME" \
|
||||
-classpath "$CLASSPATH" \
|
||||
org.gradle.wrapper.GradleWrapperMain \
|
||||
"$@"
|
||||
|
||||
# Stop when "xargs" is not available.
|
||||
if ! command -v xargs >/dev/null 2>&1
|
||||
then
|
||||
die "xargs is not available"
|
||||
fi
|
||||
|
||||
# Use "xargs" to parse quoted args.
|
||||
#
|
||||
# With -n1 it outputs one arg per line, with the quotes and backslashes removed.
|
||||
#
|
||||
# In Bash we could simply go:
|
||||
#
|
||||
# readarray ARGS < <( xargs -n1 <<<"$var" ) &&
|
||||
# set -- "${ARGS[@]}" "$@"
|
||||
#
|
||||
# but POSIX shell has neither arrays nor command substitution, so instead we
|
||||
# post-process each arg (as a line of input to sed) to backslash-escape any
|
||||
# character that might be a shell metacharacter, then use eval to reverse
|
||||
# that process (while maintaining the separation between arguments), and wrap
|
||||
# the whole thing up as a single "set" statement.
|
||||
#
|
||||
# This will of course break if any of these variables contains a newline or
|
||||
# an unmatched quote.
|
||||
#
|
||||
|
||||
eval "set -- $(
|
||||
printf '%s\n' "$DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS" |
|
||||
xargs -n1 |
|
||||
sed ' s~[^-[:alnum:]+,./:=@_]~\\&~g; ' |
|
||||
tr '\n' ' '
|
||||
)" '"$@"'
|
||||
|
||||
exec "$JAVACMD" "$@"
|
||||
92
android-ocr-service/gradlew.bat
vendored
Normal file
92
android-ocr-service/gradlew.bat
vendored
Normal file
@@ -0,0 +1,92 @@
|
||||
@rem
|
||||
@rem Copyright 2015 the original author or authors.
|
||||
@rem
|
||||
@rem Licensed under the Apache License, Version 2.0 (the "License");
|
||||
@rem you may not use this file except in compliance with the License.
|
||||
@rem You may obtain a copy of the License at
|
||||
@rem
|
||||
@rem https://www.apache.org/licenses/LICENSE-2.0
|
||||
@rem
|
||||
@rem Unless required by applicable law or agreed to in writing, software
|
||||
@rem distributed under the License is distributed on an "AS IS" BASIS,
|
||||
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
@rem See the License for the specific language governing permissions and
|
||||
@rem limitations under the License.
|
||||
@rem
|
||||
|
||||
@if "%DEBUG%"=="" @echo off
|
||||
@rem ##########################################################################
|
||||
@rem
|
||||
@rem Gradle startup script for Windows
|
||||
@rem
|
||||
@rem ##########################################################################
|
||||
|
||||
@rem Set local scope for the variables with windows NT shell
|
||||
if "%OS%"=="Windows_NT" setlocal
|
||||
|
||||
set DIRNAME=%~dp0
|
||||
if "%DIRNAME%"=="" set DIRNAME=.
|
||||
@rem This is normally unused
|
||||
set APP_BASE_NAME=%~n0
|
||||
set APP_HOME=%DIRNAME%
|
||||
|
||||
@rem Resolve any "." and ".." in APP_HOME to make it shorter.
|
||||
for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi
|
||||
|
||||
@rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
|
||||
set DEFAULT_JVM_OPTS=-Dfile.encoding=UTF-8 "-Xmx64m" "-Xms64m"
|
||||
|
||||
@rem Find java.exe
|
||||
if defined JAVA_HOME goto findJavaFromJavaHome
|
||||
|
||||
set JAVA_EXE=java.exe
|
||||
%JAVA_EXE% -version >NUL 2>&1
|
||||
if %ERRORLEVEL% equ 0 goto execute
|
||||
|
||||
echo.
|
||||
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
|
||||
echo.
|
||||
echo Please set the JAVA_HOME variable in your environment to match the
|
||||
echo location of your Java installation.
|
||||
|
||||
goto fail
|
||||
|
||||
:findJavaFromJavaHome
|
||||
set JAVA_HOME=%JAVA_HOME:"=%
|
||||
set JAVA_EXE=%JAVA_HOME%/bin/java.exe
|
||||
|
||||
if exist "%JAVA_EXE%" goto execute
|
||||
|
||||
echo.
|
||||
echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME%
|
||||
echo.
|
||||
echo Please set the JAVA_HOME variable in your environment to match the
|
||||
echo location of your Java installation.
|
||||
|
||||
goto fail
|
||||
|
||||
:execute
|
||||
@rem Setup the command line
|
||||
|
||||
set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar
|
||||
|
||||
|
||||
@rem Execute Gradle
|
||||
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %*
|
||||
|
||||
:end
|
||||
@rem End local scope for the variables with windows NT shell
|
||||
if %ERRORLEVEL% equ 0 goto mainEnd
|
||||
|
||||
:fail
|
||||
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
|
||||
rem the _cmd.exe /c_ return code!
|
||||
set EXIT_CODE=%ERRORLEVEL%
|
||||
if %EXIT_CODE% equ 0 set EXIT_CODE=1
|
||||
if not ""=="%GRADLE_EXIT_CONSOLE%" exit %EXIT_CODE%
|
||||
exit /b %EXIT_CODE%
|
||||
|
||||
:mainEnd
|
||||
if "%OS%"=="Windows_NT" endlocal
|
||||
|
||||
:omega
|
||||
18
android-ocr-service/settings.gradle.kts
Normal file
18
android-ocr-service/settings.gradle.kts
Normal file
@@ -0,0 +1,18 @@
|
||||
pluginManagement {
|
||||
repositories {
|
||||
google()
|
||||
mavenCentral()
|
||||
gradlePluginPortal()
|
||||
}
|
||||
}
|
||||
|
||||
dependencyResolutionManagement {
|
||||
repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
|
||||
repositories {
|
||||
google()
|
||||
mavenCentral()
|
||||
}
|
||||
}
|
||||
|
||||
rootProject.name = "ocr-service"
|
||||
include(":app")
|
||||
3
config/__init__.py
Normal file
3
config/__init__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .settings import settings
|
||||
|
||||
__all__ = ["settings"]
|
||||
30
config/settings.py
Normal file
30
config/settings.py
Normal file
@@ -0,0 +1,30 @@
|
||||
from pydantic_settings import BaseSettings
|
||||
from typing import Optional
|
||||
|
||||
|
||||
class Settings(BaseSettings):
|
||||
# Device
|
||||
device_serial: Optional[str] = None # None = auto-detect first device
|
||||
adb_path: str = "/opt/homebrew/bin/adb"
|
||||
screenshot_dir: str = "data/screenshots"
|
||||
|
||||
# VLM
|
||||
vlm_provider: str = "poe" # local / poe / openrouter
|
||||
vlm_model: str = "Qwen/Qwen2.5-VL-7B-Instruct"
|
||||
poe_api_key: Optional[str] = None
|
||||
openrouter_api_key: Optional[str] = None
|
||||
|
||||
# Agent
|
||||
max_steps: int = 20
|
||||
action_delay: float = 1.5 # seconds to wait after each action
|
||||
screenshot_timeout: float = 5.0
|
||||
verify_after_action: bool = True
|
||||
|
||||
# Server
|
||||
host: str = "0.0.0.0"
|
||||
port: int = 4380
|
||||
|
||||
model_config = {"env_file": ".env", "env_file_encoding": "utf-8"}
|
||||
|
||||
|
||||
settings = Settings()
|
||||
15
requirements.txt
Normal file
15
requirements.txt
Normal file
@@ -0,0 +1,15 @@
|
||||
fastapi>=0.115.0
|
||||
uvicorn>=0.32.0
|
||||
pillow>=10.0.0
|
||||
httpx>=0.27.0
|
||||
pydantic>=2.0.0
|
||||
pydantic-settings>=2.0.0
|
||||
jinja2>=3.1.0
|
||||
python-multipart>=0.0.9
|
||||
|
||||
# OCR grounding (L3 - element detection by visible text)
|
||||
pytesseract>=0.3.10 # Fast, uses system tesseract binary
|
||||
numpy>=1.24.0 # Required by easyocr and image processing
|
||||
|
||||
# Optional: better Chinese OCR (install separately if pytesseract is insufficient)
|
||||
# pip install easyocr # ~150MB download, better zh_CN but slower first run
|
||||
38
scripts/test_device.py
Normal file
38
scripts/test_device.py
Normal file
@@ -0,0 +1,38 @@
|
||||
"""Quick test: check ADB device connection and take a screenshot."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
|
||||
from src.capture import ADBCapture
|
||||
|
||||
|
||||
def main():
|
||||
cap = ADBCapture()
|
||||
|
||||
print("Checking device...")
|
||||
info = cap.check_device()
|
||||
|
||||
if not info["connected"]:
|
||||
print(f"[FAIL] {info['error']}")
|
||||
print()
|
||||
print("Troubleshooting:")
|
||||
print(" 1. USB debugging enabled on phone?")
|
||||
print(" 2. Run: adb devices")
|
||||
print(" 3. Accept USB debugging prompt on phone")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"[OK] Device: {info['model']}")
|
||||
print(f" Serial: {info['serial']}")
|
||||
print(f" Resolution: {info['resolution']}")
|
||||
print(f" All devices: {info['all_devices']}")
|
||||
|
||||
print("\nTaking screenshot...")
|
||||
img = cap.screenshot(save=True)
|
||||
print(f"[OK] Screenshot: {img.size[0]}x{img.size[1]}")
|
||||
print(f" Saved to: {cap.screenshot_dir}/")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
149
scripts/test_ocr_grounding.py
Normal file
149
scripts/test_ocr_grounding.py
Normal file
@@ -0,0 +1,149 @@
|
||||
"""Test OCR grounding: take a screenshot and find text elements.
|
||||
|
||||
Usage:
|
||||
# Find a specific text on current screen
|
||||
python scripts/test_ocr_grounding.py "微信"
|
||||
|
||||
# Detect ALL text on screen (debug mode)
|
||||
python scripts/test_ocr_grounding.py --all
|
||||
|
||||
# Use a saved screenshot instead of live ADB capture
|
||||
python scripts/test_ocr_grounding.py "发送" --image data/screenshots/test.png
|
||||
|
||||
# Try different engines
|
||||
python scripts/test_ocr_grounding.py "微信" --engine easyocr
|
||||
python scripts/test_ocr_grounding.py "微信" --engine pytesseract
|
||||
|
||||
# Also try uiautomator dump (hybrid mode)
|
||||
python scripts/test_ocr_grounding.py "微信" --hybrid
|
||||
|
||||
# Save annotated screenshot with bounding boxes drawn
|
||||
python scripts/test_ocr_grounding.py --all --annotate
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import argparse
|
||||
|
||||
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
from src.grounding.ocr_grounding import OCRGrounding
|
||||
|
||||
|
||||
def annotate_image(img: Image.Image, boxes, query: str = "") -> Image.Image:
|
||||
"""Draw bounding boxes on the image for visualization."""
|
||||
annotated = img.copy()
|
||||
draw = ImageDraw.Draw(annotated)
|
||||
|
||||
for box in boxes:
|
||||
is_match = box.contains_text(query) if query else False
|
||||
color = "red" if is_match else "lime"
|
||||
width = 3 if is_match else 1
|
||||
|
||||
draw.rectangle(
|
||||
[box.x, box.y, box.x + box.w, box.y + box.h],
|
||||
outline=color, width=width,
|
||||
)
|
||||
label = f"{box.text} ({box.confidence:.0%})"
|
||||
draw.text((box.x, box.y - 14), label, fill=color)
|
||||
|
||||
return annotated
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Test OCR grounding on phone screen")
|
||||
parser.add_argument("query", nargs="?", default=None, help="Text to find on screen")
|
||||
parser.add_argument("--all", action="store_true", help="Detect all text on screen")
|
||||
parser.add_argument("--image", type=str, help="Use saved screenshot instead of ADB")
|
||||
parser.add_argument("--engine", type=str, default="auto",
|
||||
choices=["auto", "pytesseract", "easyocr"],
|
||||
help="OCR engine to use")
|
||||
parser.add_argument("--hybrid", action="store_true",
|
||||
help="Try uiautomator + OCR hybrid approach")
|
||||
parser.add_argument("--annotate", action="store_true",
|
||||
help="Save annotated screenshot with bounding boxes")
|
||||
args = parser.parse_args()
|
||||
|
||||
if not args.query and not args.all:
|
||||
parser.error("Provide a search query or --all")
|
||||
|
||||
# Get screenshot
|
||||
if args.image:
|
||||
print(f"Loading image: {args.image}")
|
||||
img = Image.open(args.image)
|
||||
else:
|
||||
from src.capture import ADBCapture
|
||||
cap = ADBCapture()
|
||||
info = cap.check_device()
|
||||
if not info["connected"]:
|
||||
print(f"[FAIL] {info['error']}")
|
||||
sys.exit(1)
|
||||
print(f"Device: {info['model']} ({info['resolution']})")
|
||||
print("Taking screenshot...")
|
||||
img = cap.screenshot(save=True)
|
||||
|
||||
print(f"Image size: {img.width}x{img.height}")
|
||||
grounding = OCRGrounding(engine=args.engine)
|
||||
|
||||
if args.all:
|
||||
print(f"\n--- Detecting ALL text (engine={args.engine}) ---\n")
|
||||
boxes = grounding.detect_all(img)
|
||||
if not boxes:
|
||||
print("[WARN] No text detected!")
|
||||
else:
|
||||
print(f"Found {len(boxes)} text regions:\n")
|
||||
for i, box in enumerate(boxes, 1):
|
||||
nx, ny = box.center_normalized(img.width, img.height)
|
||||
print(f" {i:3d}. '{box.text}'")
|
||||
print(f" pixel=({box.cx}, {box.cy}) "
|
||||
f"norm=({nx:.3f}, {ny:.3f}) "
|
||||
f"conf={box.confidence:.0%}")
|
||||
|
||||
if args.annotate and boxes:
|
||||
out_path = "data/screenshots/annotated_all.png"
|
||||
annotated = annotate_image(img, boxes, query=args.query or "")
|
||||
annotated.save(out_path)
|
||||
print(f"\nAnnotated image saved: {out_path}")
|
||||
|
||||
if args.query:
|
||||
print(f"\n--- Searching for: '{args.query}' (engine={args.engine}) ---\n")
|
||||
|
||||
if args.hybrid:
|
||||
result = grounding.find_text_hybrid(img, args.query)
|
||||
else:
|
||||
result = grounding.find_text(img, args.query)
|
||||
|
||||
if result is None:
|
||||
print(f"[NOT FOUND] '{args.query}' was not found on screen.")
|
||||
print("\nTip: Run with --all to see all detected text.")
|
||||
sys.exit(1)
|
||||
else:
|
||||
nx, ny = result.center_normalized(img.width, img.height)
|
||||
print(f"[FOUND] '{result.text}'")
|
||||
print(f" Pixel center: ({result.cx}, {result.cy})")
|
||||
print(f" Normalized center: ({nx:.4f}, {ny:.4f})")
|
||||
print(f" Bounding box: x={result.x} y={result.y} "
|
||||
f"w={result.w} h={result.h}")
|
||||
print(f" Confidence: {result.confidence:.0%}")
|
||||
print()
|
||||
print(f" To tap this element:")
|
||||
print(f" adb shell input tap {result.cx} {result.cy}")
|
||||
|
||||
# Show all matches
|
||||
all_matches = grounding.find_all_matches(img, args.query)
|
||||
if len(all_matches) > 1:
|
||||
print(f"\n ({len(all_matches)} total matches found)")
|
||||
for i, m in enumerate(all_matches):
|
||||
print(f" {i+1}. '{m.text}' at ({m.cx},{m.cy}) conf={m.confidence:.0%}")
|
||||
|
||||
if args.annotate:
|
||||
boxes = grounding.detect_all(img)
|
||||
out_path = "data/screenshots/annotated_search.png"
|
||||
annotated = annotate_image(img, boxes, query=args.query)
|
||||
annotated.save(out_path)
|
||||
print(f"\nAnnotated image saved: {out_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
0
src/__init__.py
Normal file
0
src/__init__.py
Normal file
3
src/capture/__init__.py
Normal file
3
src/capture/__init__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .adb_capture import ADBCapture
|
||||
|
||||
__all__ = ["ADBCapture"]
|
||||
118
src/capture/adb_capture.py
Normal file
118
src/capture/adb_capture.py
Normal file
@@ -0,0 +1,118 @@
|
||||
"""L1 - Screen Capture via ADB
|
||||
|
||||
Captures screenshots from Android device using ADB.
|
||||
Handles device connection, screenshot acquisition, and resolution detection.
|
||||
"""
|
||||
|
||||
import subprocess
|
||||
import time
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
from PIL import Image
|
||||
import io
|
||||
|
||||
from config import settings
|
||||
|
||||
|
||||
class ADBCapture:
|
||||
"""ADB-based screen capture for Android devices."""
|
||||
|
||||
def __init__(self):
|
||||
self.adb = settings.adb_path
|
||||
self.serial = settings.device_serial
|
||||
self.screenshot_dir = Path(settings.screenshot_dir)
|
||||
self.screenshot_dir.mkdir(parents=True, exist_ok=True)
|
||||
self._resolution: tuple[int, int] | None = None
|
||||
|
||||
def _adb_cmd(self, *args: str) -> list[str]:
|
||||
cmd = [self.adb]
|
||||
if self.serial:
|
||||
cmd.extend(["-s", self.serial])
|
||||
cmd.extend(args)
|
||||
return cmd
|
||||
|
||||
def check_device(self) -> dict:
|
||||
"""Check if device is connected and return device info."""
|
||||
result = subprocess.run(
|
||||
self._adb_cmd("devices"),
|
||||
capture_output=True, text=True, timeout=5
|
||||
)
|
||||
lines = result.stdout.strip().split("\n")[1:] # skip header
|
||||
devices = []
|
||||
for line in lines:
|
||||
parts = line.strip().split("\t")
|
||||
if len(parts) == 2 and parts[1] == "device":
|
||||
devices.append(parts[0])
|
||||
|
||||
if not devices:
|
||||
return {"connected": False, "error": "No device found"}
|
||||
|
||||
serial = self.serial or devices[0]
|
||||
if not self.serial:
|
||||
self.serial = serial
|
||||
|
||||
# Get device model
|
||||
model_result = subprocess.run(
|
||||
self._adb_cmd("shell", "getprop", "ro.product.model"),
|
||||
capture_output=True, text=True, timeout=5
|
||||
)
|
||||
model = model_result.stdout.strip()
|
||||
|
||||
# Get screen resolution
|
||||
w, h = self.get_resolution()
|
||||
|
||||
return {
|
||||
"connected": True,
|
||||
"serial": serial,
|
||||
"model": model,
|
||||
"resolution": f"{w}x{h}",
|
||||
"all_devices": devices,
|
||||
}
|
||||
|
||||
def get_resolution(self) -> tuple[int, int]:
|
||||
"""Get device screen resolution."""
|
||||
if self._resolution:
|
||||
return self._resolution
|
||||
|
||||
result = subprocess.run(
|
||||
self._adb_cmd("shell", "wm", "size"),
|
||||
capture_output=True, text=True, timeout=5
|
||||
)
|
||||
# Output: "Physical size: 1080x2400"
|
||||
size_str = result.stdout.strip().split(":")[-1].strip()
|
||||
w, h = size_str.split("x")
|
||||
self._resolution = (int(w), int(h))
|
||||
return self._resolution
|
||||
|
||||
def screenshot(self, save: bool = True) -> Image.Image:
|
||||
"""Take a screenshot and return as PIL Image.
|
||||
|
||||
Args:
|
||||
save: Whether to save the screenshot to disk for debugging.
|
||||
|
||||
Returns:
|
||||
PIL Image of the current screen.
|
||||
"""
|
||||
result = subprocess.run(
|
||||
self._adb_cmd("exec-out", "screencap", "-p"),
|
||||
capture_output=True, timeout=settings.screenshot_timeout
|
||||
)
|
||||
if result.returncode != 0:
|
||||
raise RuntimeError(f"Screenshot failed: {result.stderr.decode()}")
|
||||
|
||||
img = Image.open(io.BytesIO(result.stdout))
|
||||
|
||||
if save:
|
||||
ts = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
|
||||
path = self.screenshot_dir / f"{ts}.png"
|
||||
img.save(path)
|
||||
|
||||
return img
|
||||
|
||||
def screenshot_base64(self) -> str:
|
||||
"""Take screenshot and return as base64-encoded PNG string."""
|
||||
import base64
|
||||
img = self.screenshot(save=True)
|
||||
buffer = io.BytesIO()
|
||||
img.save(buffer, format="PNG")
|
||||
return base64.b64encode(buffer.getvalue()).decode("utf-8")
|
||||
3
src/executor/__init__.py
Normal file
3
src/executor/__init__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .adb_executor import ADBExecutor
|
||||
|
||||
__all__ = ["ADBExecutor"]
|
||||
109
src/executor/adb_executor.py
Normal file
109
src/executor/adb_executor.py
Normal file
@@ -0,0 +1,109 @@
|
||||
"""L5 - Action Execution via ADB
|
||||
|
||||
Translates structured actions into ADB commands and executes them on device.
|
||||
Coordinates are normalized (0-1), converted to device pixels at execution time.
|
||||
"""
|
||||
|
||||
import subprocess
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
|
||||
from config import settings
|
||||
|
||||
|
||||
@dataclass
|
||||
class Action:
|
||||
"""A single GUI action to execute."""
|
||||
type: str # tap, swipe, type, long_press, back, home, scroll, wait
|
||||
x: float = 0.0 # normalized x (0-1)
|
||||
y: float = 0.0 # normalized y (0-1)
|
||||
text: str = "" # for type action
|
||||
x2: float = 0.0 # for swipe end
|
||||
y2: float = 0.0 # for swipe end
|
||||
duration: int = 300 # ms, for long_press and swipe
|
||||
|
||||
|
||||
class ADBExecutor:
|
||||
"""Execute actions on Android device via ADB."""
|
||||
|
||||
def __init__(self, capture):
|
||||
self.capture = capture
|
||||
self.adb = settings.adb_path
|
||||
self.serial = settings.device_serial
|
||||
|
||||
def _adb_cmd(self, *args: str) -> list[str]:
|
||||
cmd = [self.adb]
|
||||
if self.serial:
|
||||
cmd.extend(["-s", self.serial])
|
||||
cmd.extend(args)
|
||||
return cmd
|
||||
|
||||
def _run(self, *args: str):
|
||||
cmd = self._adb_cmd(*args)
|
||||
result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
|
||||
if result.returncode != 0:
|
||||
raise RuntimeError(f"ADB command failed: {' '.join(cmd)}\n{result.stderr}")
|
||||
return result.stdout
|
||||
|
||||
def _to_pixels(self, x: float, y: float) -> tuple[int, int]:
|
||||
"""Convert normalized (0-1) coordinates to device pixels."""
|
||||
w, h = self.capture.get_resolution()
|
||||
return int(x * w), int(y * h)
|
||||
|
||||
def execute(self, action: Action) -> str:
|
||||
"""Execute a single action and return a description of what was done."""
|
||||
match action.type:
|
||||
case "tap":
|
||||
px, py = self._to_pixels(action.x, action.y)
|
||||
self._run("shell", "input", "tap", str(px), str(py))
|
||||
desc = f"tap ({px}, {py})"
|
||||
|
||||
case "long_press":
|
||||
px, py = self._to_pixels(action.x, action.y)
|
||||
self._run("shell", "input", "swipe",
|
||||
str(px), str(py), str(px), str(py), str(action.duration))
|
||||
desc = f"long_press ({px}, {py}) {action.duration}ms"
|
||||
|
||||
case "swipe":
|
||||
px1, py1 = self._to_pixels(action.x, action.y)
|
||||
px2, py2 = self._to_pixels(action.x2, action.y2)
|
||||
self._run("shell", "input", "swipe",
|
||||
str(px1), str(py1), str(px2), str(py2), str(action.duration))
|
||||
desc = f"swipe ({px1},{py1}) → ({px2},{py2})"
|
||||
|
||||
case "type":
|
||||
# Escape special characters for ADB
|
||||
escaped = action.text.replace(" ", "%s").replace("&", "\\&")
|
||||
self._run("shell", "input", "text", escaped)
|
||||
desc = f"type '{action.text}'"
|
||||
|
||||
case "back":
|
||||
self._run("shell", "input", "keyevent", "KEYCODE_BACK")
|
||||
desc = "back"
|
||||
|
||||
case "home":
|
||||
self._run("shell", "input", "keyevent", "KEYCODE_HOME")
|
||||
desc = "home"
|
||||
|
||||
case "scroll":
|
||||
# Scroll direction: swipe center screen
|
||||
px, py = self._to_pixels(0.5, 0.5)
|
||||
if action.y < 0: # scroll up
|
||||
self._run("shell", "input", "swipe",
|
||||
str(px), str(py - 300), str(px), str(py + 300), "300")
|
||||
desc = "scroll up"
|
||||
else: # scroll down
|
||||
self._run("shell", "input", "swipe",
|
||||
str(px), str(py + 300), str(px), str(py - 300), "300")
|
||||
desc = "scroll down"
|
||||
|
||||
case "wait":
|
||||
time.sleep(action.duration / 1000)
|
||||
desc = f"wait {action.duration}ms"
|
||||
|
||||
case _:
|
||||
raise ValueError(f"Unknown action type: {action.type}")
|
||||
|
||||
# Wait for UI to settle after action
|
||||
time.sleep(settings.action_delay)
|
||||
return desc
|
||||
3
src/grounding/__init__.py
Normal file
3
src/grounding/__init__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .ocr_grounding import OCRGrounding
|
||||
|
||||
__all__ = ["OCRGrounding"]
|
||||
354
src/grounding/ocr_grounding.py
Normal file
354
src/grounding/ocr_grounding.py
Normal file
@@ -0,0 +1,354 @@
|
||||
"""L3 - OCR-Based UI Element Grounding
|
||||
|
||||
Locates UI elements on screen by visible text using OCR on ADB screenshots.
|
||||
Provides reliable text-to-coordinate mapping that works on Huawei/HarmonyOS
|
||||
where uiautomator dump often returns empty XML for WeChat.
|
||||
|
||||
Strategy priority (auto mode):
|
||||
1. easyocr (best Chinese recognition, deep learning based)
|
||||
2. pytesseract (fallback, fast but fragments Chinese characters)
|
||||
3. uiautomator XML dump (supplementary, often empty on Huawei WeChat)
|
||||
|
||||
All coordinates returned as normalized (0.0-1.0) for consistency with the
|
||||
existing coordinate system in adb_executor.py.
|
||||
"""
|
||||
|
||||
import subprocess
|
||||
import re
|
||||
import io
|
||||
import logging
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from PIL import Image
|
||||
|
||||
from config import settings
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@dataclass
|
||||
class TextBox:
|
||||
"""A detected text region on screen."""
|
||||
text: str
|
||||
x: int # left pixel
|
||||
y: int # top pixel
|
||||
w: int # width pixels
|
||||
h: int # height pixels
|
||||
confidence: float # 0.0-1.0
|
||||
|
||||
@property
|
||||
def cx(self) -> int:
|
||||
"""Center x in pixels."""
|
||||
return self.x + self.w // 2
|
||||
|
||||
@property
|
||||
def cy(self) -> int:
|
||||
"""Center y in pixels."""
|
||||
return self.y + self.h // 2
|
||||
|
||||
def center_normalized(self, screen_w: int, screen_h: int) -> tuple[float, float]:
|
||||
"""Return center as normalized (0-1) coordinates."""
|
||||
return self.cx / screen_w, self.cy / screen_h
|
||||
|
||||
def contains_text(self, query: str, fuzzy: bool = True) -> bool:
|
||||
"""Check if this box's text matches the query.
|
||||
|
||||
Args:
|
||||
query: Text to search for.
|
||||
fuzzy: If True, does substring + case-insensitive match.
|
||||
"""
|
||||
if not query or not self.text:
|
||||
return False
|
||||
if fuzzy:
|
||||
return query.lower() in self.text.lower() or self.text.lower() in query.lower()
|
||||
return self.text == query
|
||||
|
||||
def match_score(self, query: str) -> float:
|
||||
"""Compute a match quality score (higher = better).
|
||||
|
||||
Scoring:
|
||||
- Exact match: 1000 + confidence
|
||||
- Query is full text: 500 + confidence
|
||||
- Text contains query as substring: 100 + confidence + length_ratio
|
||||
- Query contains text as substring: 50 + confidence
|
||||
- No match: 0
|
||||
"""
|
||||
if not query or not self.text:
|
||||
return 0.0
|
||||
|
||||
q = query.lower()
|
||||
t = self.text.lower().strip()
|
||||
|
||||
if t == q:
|
||||
return 1000 + self.confidence
|
||||
if q in t:
|
||||
# Prefer shorter texts that contain the query (more precise)
|
||||
length_ratio = len(q) / max(len(t), 1)
|
||||
return 100 + self.confidence + length_ratio
|
||||
if t in q:
|
||||
# Text is a subset of query -- weaker match
|
||||
length_ratio = len(t) / max(len(q), 1)
|
||||
return 50 + self.confidence * length_ratio
|
||||
return 0.0
|
||||
|
||||
|
||||
class OCRGrounding:
|
||||
"""OCR-based element grounding for Android screens.
|
||||
|
||||
Usage:
|
||||
grounding = OCRGrounding()
|
||||
|
||||
# From ADB screenshot (PIL Image)
|
||||
img = capture.screenshot()
|
||||
result = grounding.find_text(img, "发送")
|
||||
if result:
|
||||
norm_x, norm_y = result.center_normalized(img.width, img.height)
|
||||
# Use norm_x, norm_y with ADBExecutor
|
||||
"""
|
||||
|
||||
def __init__(self, engine: str = "auto"):
|
||||
"""
|
||||
Args:
|
||||
engine: OCR engine to use.
|
||||
"pytesseract" / "easyocr" / "auto" (easyocr first, pytesseract fallback)
|
||||
"""
|
||||
self.engine = engine
|
||||
self._easyocr_reader = None # lazy init (slow first load)
|
||||
|
||||
# ──────────────────────────────────────────────
|
||||
# Public API
|
||||
# ──────────────────────────────────────────────
|
||||
|
||||
def find_text(
|
||||
self, img: Image.Image, query: str, fuzzy: bool = True
|
||||
) -> TextBox | None:
|
||||
"""Find a UI element by visible text and return its bounding box.
|
||||
|
||||
Args:
|
||||
img: PIL Image (screenshot from ADB).
|
||||
query: Text to search for (e.g. "发送", "微信", "Search").
|
||||
fuzzy: Substring/case-insensitive match.
|
||||
|
||||
Returns:
|
||||
Best matching TextBox, or None if not found.
|
||||
"""
|
||||
boxes = self.detect_all(img)
|
||||
matches = [b for b in boxes if b.contains_text(query, fuzzy=fuzzy)]
|
||||
|
||||
if not matches:
|
||||
logger.warning(f"Text '{query}' not found. Detected texts: "
|
||||
f"{[b.text for b in boxes[:20]]}")
|
||||
return None
|
||||
|
||||
# Return best match by match_score (prefers exact/longer matches)
|
||||
matches.sort(key=lambda b: b.match_score(query), reverse=True)
|
||||
best = matches[0]
|
||||
logger.info(f"Found '{query}' → '{best.text}' at ({best.cx}, {best.cy}) "
|
||||
f"conf={best.confidence:.2f} score={best.match_score(query):.1f}")
|
||||
return best
|
||||
|
||||
def find_all_matches(
|
||||
self, img: Image.Image, query: str, fuzzy: bool = True
|
||||
) -> list[TextBox]:
|
||||
"""Find ALL matching elements (e.g., multiple chat contacts named similar)."""
|
||||
boxes = self.detect_all(img)
|
||||
return [b for b in boxes if b.contains_text(query, fuzzy=fuzzy)]
|
||||
|
||||
def detect_all(self, img: Image.Image) -> list[TextBox]:
|
||||
"""Run OCR on the full image and return all detected text boxes.
|
||||
|
||||
Tries engines in order based on self.engine setting.
|
||||
"""
|
||||
if self.engine == "pytesseract":
|
||||
return self._detect_pytesseract(img)
|
||||
elif self.engine == "easyocr":
|
||||
return self._detect_easyocr(img)
|
||||
else: # auto
|
||||
# Prefer easyocr (much better Chinese recognition), fall back to pytesseract
|
||||
try:
|
||||
return self._detect_easyocr(img)
|
||||
except Exception as e:
|
||||
logger.info(f"easyocr failed ({e}), trying pytesseract")
|
||||
|
||||
try:
|
||||
boxes = self._detect_pytesseract(img)
|
||||
if boxes:
|
||||
return boxes
|
||||
except Exception as e:
|
||||
logger.error(f"All OCR engines failed: {e}")
|
||||
|
||||
return []
|
||||
|
||||
def find_text_normalized(
|
||||
self, img: Image.Image, query: str, fuzzy: bool = True
|
||||
) -> tuple[float, float] | None:
|
||||
"""Convenience: find text and return normalized (x, y) center directly.
|
||||
|
||||
Returns None if not found.
|
||||
"""
|
||||
box = self.find_text(img, query, fuzzy=fuzzy)
|
||||
if box is None:
|
||||
return None
|
||||
return box.center_normalized(img.width, img.height)
|
||||
|
||||
# ──────────────────────────────────────────────
|
||||
# pytesseract engine
|
||||
# ──────────────────────────────────────────────
|
||||
|
||||
def _detect_pytesseract(self, img: Image.Image) -> list[TextBox]:
|
||||
"""Detect text using pytesseract (calls tesseract binary).
|
||||
|
||||
Uses chi_sim+eng for Chinese + English mixed content (common in WeChat).
|
||||
Falls back to eng-only if chi_sim data is not installed.
|
||||
"""
|
||||
import pytesseract
|
||||
|
||||
# Try Chinese+English first, fall back to English only
|
||||
for lang in ["chi_sim+eng", "eng"]:
|
||||
try:
|
||||
data = pytesseract.image_to_data(
|
||||
img,
|
||||
lang=lang,
|
||||
output_type=pytesseract.Output.DICT,
|
||||
config="--psm 11" # Sparse text: find as much text as possible
|
||||
)
|
||||
break
|
||||
except pytesseract.TesseractError:
|
||||
continue
|
||||
else:
|
||||
raise RuntimeError("Tesseract failed with all language configs")
|
||||
|
||||
boxes = []
|
||||
n = len(data["text"])
|
||||
for i in range(n):
|
||||
text = data["text"][i].strip()
|
||||
conf = int(data["conf"][i])
|
||||
if not text or conf < 20: # skip low-confidence noise
|
||||
continue
|
||||
boxes.append(TextBox(
|
||||
text=text,
|
||||
x=data["left"][i],
|
||||
y=data["top"][i],
|
||||
w=data["width"][i],
|
||||
h=data["height"][i],
|
||||
confidence=conf / 100.0,
|
||||
))
|
||||
|
||||
return boxes
|
||||
|
||||
# ──────────────────────────────────────────────
|
||||
# easyocr engine
|
||||
# ──────────────────────────────────────────────
|
||||
|
||||
def _detect_easyocr(self, img: Image.Image) -> list[TextBox]:
|
||||
"""Detect text using easyocr (better for Chinese, uses deep learning).
|
||||
|
||||
First call is slow (~10s) due to model loading. Subsequent calls are fast.
|
||||
"""
|
||||
import easyocr
|
||||
import numpy as np
|
||||
|
||||
if self._easyocr_reader is None:
|
||||
self._easyocr_reader = easyocr.Reader(
|
||||
["ch_sim", "en"],
|
||||
gpu=False, # CPU is fine for single screenshots
|
||||
)
|
||||
|
||||
# Convert PIL to numpy array for easyocr
|
||||
img_np = np.array(img.convert("RGB"))
|
||||
results = self._easyocr_reader.readtext(img_np)
|
||||
|
||||
boxes = []
|
||||
for (bbox, text, conf) in results:
|
||||
if not text.strip():
|
||||
continue
|
||||
# bbox is [[x1,y1],[x2,y2],[x3,y3],[x4,y4]] (quadrilateral)
|
||||
xs = [p[0] for p in bbox]
|
||||
ys = [p[1] for p in bbox]
|
||||
x = int(min(xs))
|
||||
y = int(min(ys))
|
||||
w = int(max(xs) - x)
|
||||
h = int(max(ys) - y)
|
||||
boxes.append(TextBox(
|
||||
text=text.strip(),
|
||||
x=x, y=y, w=w, h=h,
|
||||
confidence=float(conf),
|
||||
))
|
||||
|
||||
return boxes
|
||||
|
||||
# ──────────────────────────────────────────────
|
||||
# uiautomator XML dump (supplementary, often empty on Huawei)
|
||||
# ──────────────────────────────────────────────
|
||||
|
||||
def try_uiautomator_dump(self, serial: str | None = None) -> list[TextBox]:
|
||||
"""Attempt to get UI elements from uiautomator dump.
|
||||
|
||||
NOTE: This often returns nearly empty XML on Huawei/HarmonyOS,
|
||||
especially for WeChat. Use as a supplementary source, not primary.
|
||||
|
||||
Args:
|
||||
serial: Device serial (None = use settings or first device).
|
||||
|
||||
Returns:
|
||||
List of TextBox from accessibility tree, may be empty.
|
||||
"""
|
||||
adb = settings.adb_path
|
||||
cmd = [adb]
|
||||
if serial or settings.device_serial:
|
||||
cmd.extend(["-s", serial or settings.device_serial])
|
||||
|
||||
# Dump to device, then pull
|
||||
dump_cmd = cmd + ["shell", "uiautomator", "dump", "/sdcard/ui_dump.xml"]
|
||||
pull_cmd = cmd + ["shell", "cat", "/sdcard/ui_dump.xml"]
|
||||
|
||||
try:
|
||||
subprocess.run(dump_cmd, capture_output=True, timeout=10)
|
||||
result = subprocess.run(pull_cmd, capture_output=True, text=True, timeout=5)
|
||||
xml_content = result.stdout
|
||||
except Exception as e:
|
||||
logger.warning(f"uiautomator dump failed: {e}")
|
||||
return []
|
||||
|
||||
return self._parse_uiautomator_xml(xml_content)
|
||||
|
||||
def _parse_uiautomator_xml(self, xml_str: str) -> list[TextBox]:
|
||||
"""Parse uiautomator dump XML into TextBox list."""
|
||||
boxes = []
|
||||
# Pattern: text="..." bounds="[x1,y1][x2,y2]"
|
||||
pattern = r'text="([^"]*)"[^>]*bounds="\[(\d+),(\d+)\]\[(\d+),(\d+)\]"'
|
||||
for match in re.finditer(pattern, xml_str):
|
||||
text = match.group(1).strip()
|
||||
if not text:
|
||||
continue
|
||||
x1, y1, x2, y2 = (int(match.group(i)) for i in range(2, 6))
|
||||
boxes.append(TextBox(
|
||||
text=text,
|
||||
x=x1, y=y1,
|
||||
w=x2 - x1, h=y2 - y1,
|
||||
confidence=1.0, # accessibility tree is authoritative
|
||||
))
|
||||
return boxes
|
||||
|
||||
# ──────────────────────────────────────────────
|
||||
# Hybrid: combine OCR + uiautomator
|
||||
# ──────────────────────────────────────────────
|
||||
|
||||
def find_text_hybrid(
|
||||
self, img: Image.Image, query: str, fuzzy: bool = True
|
||||
) -> TextBox | None:
|
||||
"""Try uiautomator first (exact bounds), fall back to OCR.
|
||||
|
||||
Best strategy for Huawei: uiautomator might work for some apps,
|
||||
OCR always works as fallback.
|
||||
"""
|
||||
# Try uiautomator first (precise but often empty on Huawei)
|
||||
ua_boxes = self.try_uiautomator_dump()
|
||||
ua_matches = [b for b in ua_boxes if b.contains_text(query, fuzzy=fuzzy)]
|
||||
if ua_matches:
|
||||
logger.info(f"Found '{query}' via uiautomator")
|
||||
return ua_matches[0]
|
||||
|
||||
# Fall back to OCR
|
||||
logger.info(f"uiautomator found nothing for '{query}', using OCR")
|
||||
return self.find_text(img, query, fuzzy=fuzzy)
|
||||
122
src/main.py
Normal file
122
src/main.py
Normal file
@@ -0,0 +1,122 @@
|
||||
"""Phone GUI Agent - Main Entry Point
|
||||
|
||||
Web console for controlling the agent loop.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
from fastapi import FastAPI, WebSocket, WebSocketDisconnect, Request
|
||||
from fastapi.responses import HTMLResponse
|
||||
from fastapi.staticfiles import StaticFiles
|
||||
from fastapi.templating import Jinja2Templates
|
||||
|
||||
from config import settings
|
||||
from src.capture import ADBCapture
|
||||
from src.planner.agent_loop import AgentLoop
|
||||
|
||||
app = FastAPI(title="Phone GUI Agent", version="0.1.0")
|
||||
|
||||
BASE_DIR = Path(__file__).parent.parent
|
||||
app.mount("/static", StaticFiles(directory=BASE_DIR / "web" / "static"), name="static")
|
||||
templates = Jinja2Templates(directory=BASE_DIR / "web" / "templates")
|
||||
|
||||
# Global state
|
||||
capture = ADBCapture()
|
||||
agent = AgentLoop()
|
||||
|
||||
|
||||
@app.get("/", response_class=HTMLResponse)
|
||||
async def index(request: Request):
|
||||
return templates.TemplateResponse(request, "index.html")
|
||||
|
||||
|
||||
@app.get("/api/device")
|
||||
async def device_info():
|
||||
"""Check device connection status."""
|
||||
try:
|
||||
info = capture.check_device()
|
||||
return info
|
||||
except Exception as e:
|
||||
return {"connected": False, "error": str(e)}
|
||||
|
||||
|
||||
@app.get("/api/screenshot")
|
||||
async def take_screenshot():
|
||||
"""Take a screenshot and return base64."""
|
||||
try:
|
||||
b64 = capture.screenshot_base64()
|
||||
return {"ok": True, "image": b64}
|
||||
except Exception as e:
|
||||
return {"ok": False, "error": str(e)}
|
||||
|
||||
|
||||
@app.post("/api/stop")
|
||||
async def stop_task():
|
||||
"""Stop the current running task."""
|
||||
agent.stop()
|
||||
return {"ok": True}
|
||||
|
||||
|
||||
@app.websocket("/ws/task")
|
||||
async def task_websocket(ws: WebSocket):
|
||||
"""WebSocket endpoint for running tasks with real-time updates.
|
||||
|
||||
Client sends: {"task": "打开微信搜索张三"}
|
||||
Server streams: StepResult objects as JSON
|
||||
"""
|
||||
await ws.accept()
|
||||
try:
|
||||
data = await ws.receive_json()
|
||||
task = data.get("task", "")
|
||||
if not task:
|
||||
await ws.send_json({"error": "No task provided"})
|
||||
return
|
||||
|
||||
await ws.send_json({"status": "started", "task": task})
|
||||
|
||||
def on_step(result):
|
||||
asyncio.get_event_loop().call_soon_threadsafe(
|
||||
asyncio.ensure_future,
|
||||
ws.send_json({
|
||||
"status": "step",
|
||||
"step": result.step,
|
||||
"observation": result.observation,
|
||||
"thinking": result.thinking,
|
||||
"action_type": result.action_type,
|
||||
"action_desc": result.action_desc,
|
||||
"screenshot": result.screenshot_before[:100] + "..." if result.screenshot_before else None,
|
||||
"error": result.error,
|
||||
})
|
||||
)
|
||||
|
||||
session = await agent.run_task(task, on_step=on_step)
|
||||
|
||||
await ws.send_json({
|
||||
"status": session.status,
|
||||
"total_steps": len(session.steps),
|
||||
"task": task,
|
||||
})
|
||||
|
||||
except WebSocketDisconnect:
|
||||
agent.stop()
|
||||
except Exception as e:
|
||||
try:
|
||||
await ws.send_json({"error": str(e)})
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
def main():
|
||||
import uvicorn
|
||||
uvicorn.run(
|
||||
"src.main:app",
|
||||
host=settings.host,
|
||||
port=settings.port,
|
||||
reload=True,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
3
src/planner/__init__.py
Normal file
3
src/planner/__init__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .agent_loop import AgentLoop
|
||||
|
||||
__all__ = ["AgentLoop"]
|
||||
200
src/planner/agent_loop.py
Normal file
200
src/planner/agent_loop.py
Normal file
@@ -0,0 +1,200 @@
|
||||
"""L4+L6+L7 - Agent Loop: Planning, Verification, Memory
|
||||
|
||||
The core agent loop that orchestrates the full pipeline:
|
||||
Screenshot → VLM Analysis → Action Execution → Verification → Repeat
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime
|
||||
|
||||
from src.capture import ADBCapture
|
||||
from src.vision import VLMClient
|
||||
from src.executor.adb_executor import ADBExecutor, Action
|
||||
|
||||
|
||||
@dataclass
|
||||
class StepResult:
|
||||
step: int
|
||||
timestamp: str
|
||||
observation: str
|
||||
thinking: str
|
||||
action_type: str
|
||||
action_desc: str
|
||||
screenshot_before: str # base64
|
||||
screenshot_after: str | None = None
|
||||
verified: bool = False
|
||||
error: str | None = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class TaskSession:
|
||||
task: str
|
||||
status: str = "running" # running / completed / failed / stopped
|
||||
steps: list[StepResult] = field(default_factory=list)
|
||||
started_at: str = ""
|
||||
finished_at: str = ""
|
||||
|
||||
def history(self) -> list[dict]:
|
||||
"""Return history for VLM context."""
|
||||
return [
|
||||
{
|
||||
"observation": s.observation,
|
||||
"action": {"type": s.action_type},
|
||||
}
|
||||
for s in self.steps
|
||||
]
|
||||
|
||||
|
||||
class AgentLoop:
|
||||
"""Main agent loop orchestrating all pipeline layers."""
|
||||
|
||||
def __init__(self):
|
||||
self.capture = ADBCapture()
|
||||
self.vlm = VLMClient()
|
||||
self.executor = ADBExecutor(self.capture)
|
||||
self.current_session: TaskSession | None = None
|
||||
self._stop_requested = False
|
||||
|
||||
def stop(self):
|
||||
self._stop_requested = True
|
||||
|
||||
async def run_task(self, task: str, on_step=None) -> TaskSession:
|
||||
"""Execute a task through the full agent loop.
|
||||
|
||||
Args:
|
||||
task: Natural language task instruction.
|
||||
on_step: Optional callback called after each step with StepResult.
|
||||
|
||||
Returns:
|
||||
TaskSession with all steps and final status.
|
||||
"""
|
||||
from config import settings
|
||||
|
||||
session = TaskSession(
|
||||
task=task,
|
||||
started_at=datetime.now().isoformat(),
|
||||
)
|
||||
self.current_session = session
|
||||
self._stop_requested = False
|
||||
|
||||
try:
|
||||
for step_num in range(1, settings.max_steps + 1):
|
||||
if self._stop_requested:
|
||||
session.status = "stopped"
|
||||
break
|
||||
|
||||
result = await self._execute_step(step_num, task, session)
|
||||
session.steps.append(result)
|
||||
|
||||
if on_step:
|
||||
on_step(result)
|
||||
|
||||
if result.action_type == "done":
|
||||
session.status = "completed"
|
||||
break
|
||||
|
||||
if result.error:
|
||||
# Allow up to 3 consecutive errors before failing
|
||||
recent_errors = sum(
|
||||
1 for s in session.steps[-3:] if s.error
|
||||
)
|
||||
if recent_errors >= 3:
|
||||
session.status = "failed"
|
||||
break
|
||||
else:
|
||||
session.status = "failed" # max steps exceeded
|
||||
|
||||
except Exception as e:
|
||||
session.status = "failed"
|
||||
if session.steps:
|
||||
session.steps[-1].error = str(e)
|
||||
|
||||
session.finished_at = datetime.now().isoformat()
|
||||
self.current_session = None
|
||||
return session
|
||||
|
||||
async def _execute_step(
|
||||
self, step_num: int, task: str, session: TaskSession
|
||||
) -> StepResult:
|
||||
"""Execute a single step in the agent loop."""
|
||||
timestamp = datetime.now().isoformat()
|
||||
|
||||
# L1: Capture screenshot
|
||||
try:
|
||||
screenshot_b64 = self.capture.screenshot_base64()
|
||||
except Exception as e:
|
||||
return StepResult(
|
||||
step=step_num, timestamp=timestamp,
|
||||
observation="", thinking="",
|
||||
action_type="error", action_desc="",
|
||||
screenshot_before="", error=f"Screenshot failed: {e}"
|
||||
)
|
||||
|
||||
# L2+L3+L4: VLM analysis (understanding + grounding + planning)
|
||||
try:
|
||||
response = await self.vlm.analyze_screen(
|
||||
screenshot_b64, task, session.history()
|
||||
)
|
||||
except Exception as e:
|
||||
return StepResult(
|
||||
step=step_num, timestamp=timestamp,
|
||||
observation="", thinking="",
|
||||
action_type="error", action_desc="",
|
||||
screenshot_before=screenshot_b64,
|
||||
error=f"VLM analysis failed: {e}"
|
||||
)
|
||||
|
||||
observation = response.get("observation", "")
|
||||
thinking = response.get("thinking", "")
|
||||
action_data = response["action"]
|
||||
action_type = action_data["type"]
|
||||
|
||||
# Task complete
|
||||
if action_type == "done":
|
||||
return StepResult(
|
||||
step=step_num, timestamp=timestamp,
|
||||
observation=observation, thinking=thinking,
|
||||
action_type="done", action_desc="Task completed",
|
||||
screenshot_before=screenshot_b64,
|
||||
)
|
||||
|
||||
# L5: Execute action
|
||||
action = Action(
|
||||
type=action_type,
|
||||
x=action_data.get("x", 0),
|
||||
y=action_data.get("y", 0),
|
||||
text=action_data.get("text", ""),
|
||||
x2=action_data.get("x2", 0),
|
||||
y2=action_data.get("y2", 0),
|
||||
duration=action_data.get("duration", 300),
|
||||
)
|
||||
|
||||
try:
|
||||
action_desc = self.executor.execute(action)
|
||||
except Exception as e:
|
||||
return StepResult(
|
||||
step=step_num, timestamp=timestamp,
|
||||
observation=observation, thinking=thinking,
|
||||
action_type=action_type, action_desc="",
|
||||
screenshot_before=screenshot_b64,
|
||||
error=f"Execution failed: {e}"
|
||||
)
|
||||
|
||||
# L6: Verify by taking post-action screenshot
|
||||
screenshot_after = None
|
||||
if settings.verify_after_action:
|
||||
try:
|
||||
screenshot_after = self.capture.screenshot_base64()
|
||||
except Exception:
|
||||
pass # non-critical
|
||||
|
||||
return StepResult(
|
||||
step=step_num, timestamp=timestamp,
|
||||
observation=observation, thinking=thinking,
|
||||
action_type=action_type, action_desc=action_desc,
|
||||
screenshot_before=screenshot_b64,
|
||||
screenshot_after=screenshot_after,
|
||||
verified=screenshot_after is not None,
|
||||
)
|
||||
0
src/verifier/__init__.py
Normal file
0
src/verifier/__init__.py
Normal file
3
src/vision/__init__.py
Normal file
3
src/vision/__init__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .vlm_client import VLMClient
|
||||
|
||||
__all__ = ["VLMClient"]
|
||||
171
src/vision/vlm_client.py
Normal file
171
src/vision/vlm_client.py
Normal file
@@ -0,0 +1,171 @@
|
||||
"""L2+L3 - Vision Language Model Client
|
||||
|
||||
Sends screenshots to VLM for screen understanding and element grounding.
|
||||
Supports multiple providers: Poe API (preferred), OpenRouter (backup), local.
|
||||
"""
|
||||
|
||||
import base64
|
||||
import httpx
|
||||
from PIL import Image
|
||||
import io
|
||||
|
||||
from config import settings
|
||||
|
||||
|
||||
SYSTEM_PROMPT = """你是一个手机 GUI 操控助手。你会收到一张 Android 手机截图和一个用户任务指令。
|
||||
|
||||
你的职责:
|
||||
1. 分析当前屏幕内容(识别所有 UI 元素、文本、图标、按钮)
|
||||
2. 根据任务目标,决定下一步要执行的操作
|
||||
3. 精确定位目标元素的屏幕坐标
|
||||
|
||||
输出格式(严格 JSON):
|
||||
{
|
||||
"observation": "当前屏幕的简要描述",
|
||||
"thinking": "下一步应该做什么,为什么",
|
||||
"action": {
|
||||
"type": "tap|swipe|type|long_press|back|home|scroll|wait|done",
|
||||
"x": 0.5,
|
||||
"y": 0.3,
|
||||
"text": "",
|
||||
"x2": 0.0,
|
||||
"y2": 0.0,
|
||||
"duration": 300
|
||||
}
|
||||
}
|
||||
|
||||
坐标说明:
|
||||
- x, y 为归一化坐标,范围 0.0-1.0
|
||||
- (0, 0) 是屏幕左上角,(1, 1) 是右下角
|
||||
- 点击按钮时,坐标应指向按钮的中心位置
|
||||
|
||||
当任务完成时,action.type 设为 "done"。
|
||||
"""
|
||||
|
||||
|
||||
class VLMClient:
|
||||
"""Multi-provider VLM client for screen understanding."""
|
||||
|
||||
def __init__(self):
|
||||
self.provider = settings.vlm_provider
|
||||
self.model = settings.vlm_model
|
||||
|
||||
async def analyze_screen(
|
||||
self, screenshot_b64: str, task: str, history: list[dict] | None = None
|
||||
) -> dict:
|
||||
"""Send screenshot to VLM and get structured action response.
|
||||
|
||||
Args:
|
||||
screenshot_b64: Base64-encoded PNG screenshot.
|
||||
task: User's task instruction.
|
||||
history: Previous observation/action pairs for context.
|
||||
|
||||
Returns:
|
||||
Parsed dict with observation, thinking, and action.
|
||||
"""
|
||||
messages = self._build_messages(screenshot_b64, task, history)
|
||||
|
||||
match self.provider:
|
||||
case "poe":
|
||||
raw = await self._call_poe(messages)
|
||||
case "openrouter":
|
||||
raw = await self._call_openrouter(messages)
|
||||
case "local":
|
||||
raw = await self._call_local(messages)
|
||||
case _:
|
||||
raise ValueError(f"Unknown VLM provider: {self.provider}")
|
||||
|
||||
return self._parse_response(raw)
|
||||
|
||||
def _build_messages(
|
||||
self, screenshot_b64: str, task: str, history: list[dict] | None
|
||||
) -> list[dict]:
|
||||
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
|
||||
|
||||
# Add history context
|
||||
if history:
|
||||
history_text = "\n".join(
|
||||
f"Step {i+1}: {h['observation']} → {h['action']['type']}"
|
||||
for i, h in enumerate(history[-5:]) # last 5 steps
|
||||
)
|
||||
messages.append({
|
||||
"role": "user",
|
||||
"content": f"历史操作记录:\n{history_text}"
|
||||
})
|
||||
|
||||
# Current step: screenshot + task
|
||||
messages.append({
|
||||
"role": "user",
|
||||
"content": [
|
||||
{
|
||||
"type": "image_url",
|
||||
"image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}
|
||||
},
|
||||
{
|
||||
"type": "text",
|
||||
"text": f"当前任务:{task}\n\n请分析截图并给出下一步操作。"
|
||||
},
|
||||
],
|
||||
})
|
||||
return messages
|
||||
|
||||
async def _call_poe(self, messages: list[dict]) -> str:
|
||||
"""Call Poe API (preferred, cheapest)."""
|
||||
async with httpx.AsyncClient(timeout=30) as client:
|
||||
resp = await client.post(
|
||||
"https://api.poe.com/v1/chat/completions",
|
||||
headers={
|
||||
"Authorization": f"Bearer {settings.poe_api_key}",
|
||||
"Content-Type": "application/json",
|
||||
},
|
||||
json={"model": self.model, "messages": messages},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["choices"][0]["message"]["content"]
|
||||
|
||||
async def _call_openrouter(self, messages: list[dict]) -> str:
|
||||
"""Call OpenRouter API (backup)."""
|
||||
async with httpx.AsyncClient(timeout=30) as client:
|
||||
resp = await client.post(
|
||||
"https://openrouter.ai/api/v1/chat/completions",
|
||||
headers={
|
||||
"Authorization": f"Bearer {settings.openrouter_api_key}",
|
||||
"Content-Type": "application/json",
|
||||
},
|
||||
json={"model": self.model, "messages": messages},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["choices"][0]["message"]["content"]
|
||||
|
||||
async def _call_local(self, messages: list[dict]) -> str:
|
||||
"""Call local vLLM/Ollama server."""
|
||||
async with httpx.AsyncClient(timeout=60) as client:
|
||||
resp = await client.post(
|
||||
"http://localhost:11434/v1/chat/completions",
|
||||
json={"model": self.model, "messages": messages},
|
||||
)
|
||||
resp.raise_for_status()
|
||||
return resp.json()["choices"][0]["message"]["content"]
|
||||
|
||||
def _parse_response(self, raw: str) -> dict:
|
||||
"""Parse VLM response into structured action dict."""
|
||||
import json
|
||||
import re
|
||||
|
||||
# Extract JSON from response (handle markdown code blocks)
|
||||
json_match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
|
||||
if json_match:
|
||||
raw = json_match.group(1)
|
||||
|
||||
# Try to find JSON object directly
|
||||
json_match = re.search(r"\{.*\}", raw, re.DOTALL)
|
||||
if not json_match:
|
||||
raise ValueError(f"No JSON found in VLM response: {raw[:200]}")
|
||||
|
||||
parsed = json.loads(json_match.group())
|
||||
|
||||
# Validate required fields
|
||||
assert "action" in parsed, "Missing 'action' field"
|
||||
assert "type" in parsed["action"], "Missing action 'type'"
|
||||
|
||||
return parsed
|
||||
192
web/templates/index.html
Normal file
192
web/templates/index.html
Normal file
@@ -0,0 +1,192 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="zh-CN">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Phone GUI Agent</title>
|
||||
<style>
|
||||
* { margin: 0; padding: 0; box-sizing: border-box; }
|
||||
body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; background: #0a0a0a; color: #e0e0e0; height: 100vh; display: flex; flex-direction: column; }
|
||||
header { padding: 12px 20px; background: #111; border-bottom: 1px solid #222; display: flex; align-items: center; gap: 12px; }
|
||||
header h1 { font-size: 16px; font-weight: 600; }
|
||||
.status-dot { width: 8px; height: 8px; border-radius: 50%; background: #555; }
|
||||
.status-dot.connected { background: #22c55e; }
|
||||
.status-dot.running { background: #f59e0b; animation: pulse 1s infinite; }
|
||||
@keyframes pulse { 0%, 100% { opacity: 1; } 50% { opacity: 0.4; } }
|
||||
#device-info { font-size: 12px; color: #888; margin-left: auto; }
|
||||
|
||||
.main { flex: 1; display: flex; overflow: hidden; }
|
||||
|
||||
.panel-left { width: 320px; border-right: 1px solid #222; display: flex; flex-direction: column; }
|
||||
.panel-center { flex: 1; display: flex; align-items: center; justify-content: center; background: #050505; }
|
||||
.panel-right { width: 380px; border-left: 1px solid #222; display: flex; flex-direction: column; }
|
||||
|
||||
.phone-frame { width: 270px; height: 585px; border: 2px solid #333; border-radius: 24px; overflow: hidden; background: #111; position: relative; }
|
||||
.phone-frame img { width: 100%; height: 100%; object-fit: contain; }
|
||||
.phone-frame .placeholder { display: flex; align-items: center; justify-content: center; height: 100%; color: #444; font-size: 14px; }
|
||||
|
||||
.task-input { padding: 16px; border-bottom: 1px solid #222; }
|
||||
.task-input textarea { width: 100%; height: 80px; background: #1a1a1a; border: 1px solid #333; border-radius: 8px; color: #e0e0e0; padding: 10px; font-size: 14px; resize: none; }
|
||||
.task-input textarea:focus { outline: none; border-color: #4a9eff; }
|
||||
.btn-row { display: flex; gap: 8px; margin-top: 8px; }
|
||||
.btn { padding: 8px 16px; border-radius: 6px; border: none; cursor: pointer; font-size: 13px; font-weight: 500; }
|
||||
.btn-primary { background: #4a9eff; color: #fff; }
|
||||
.btn-primary:hover { background: #3a8eef; }
|
||||
.btn-danger { background: #ef4444; color: #fff; }
|
||||
.btn-secondary { background: #333; color: #ccc; }
|
||||
|
||||
.steps-list { flex: 1; overflow-y: auto; padding: 12px; }
|
||||
.step-card { background: #1a1a1a; border: 1px solid #222; border-radius: 8px; padding: 12px; margin-bottom: 8px; font-size: 13px; }
|
||||
.step-card .step-header { display: flex; justify-content: space-between; margin-bottom: 6px; }
|
||||
.step-num { color: #4a9eff; font-weight: 600; }
|
||||
.step-action { color: #22c55e; font-family: monospace; }
|
||||
.step-action.error { color: #ef4444; }
|
||||
.step-obs { color: #999; margin-top: 4px; }
|
||||
.step-think { color: #f59e0b; margin-top: 4px; font-style: italic; }
|
||||
|
||||
.log-panel { flex: 1; overflow-y: auto; padding: 12px; }
|
||||
.log-panel h3 { font-size: 13px; color: #888; margin-bottom: 8px; text-transform: uppercase; letter-spacing: 1px; }
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<header>
|
||||
<div class="status-dot" id="statusDot"></div>
|
||||
<h1>Phone GUI Agent</h1>
|
||||
<span id="device-info">检测设备中...</span>
|
||||
</header>
|
||||
|
||||
<div class="main">
|
||||
<div class="panel-left">
|
||||
<div class="task-input">
|
||||
<textarea id="taskInput" placeholder="输入任务指令,例如: 打开设置,连接WiFi 打开微信,搜索张三发消息"></textarea>
|
||||
<div class="btn-row">
|
||||
<button class="btn btn-primary" id="btnRun" onclick="runTask()">执行任务</button>
|
||||
<button class="btn btn-danger" id="btnStop" onclick="stopTask()" style="display:none">停止</button>
|
||||
<button class="btn btn-secondary" onclick="refreshScreenshot()">截屏</button>
|
||||
</div>
|
||||
</div>
|
||||
<div class="steps-list" id="stepsList"></div>
|
||||
</div>
|
||||
|
||||
<div class="panel-center">
|
||||
<div class="phone-frame">
|
||||
<img id="phoneScreen" style="display:none" />
|
||||
<div class="placeholder" id="phonePlaceholder">连接设备后显示截图</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="panel-right">
|
||||
<div class="log-panel">
|
||||
<h3>Agent 思考过程</h3>
|
||||
<div id="thinkingLog"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script>
|
||||
let ws = null;
|
||||
|
||||
async function checkDevice() {
|
||||
try {
|
||||
const resp = await fetch('/api/device');
|
||||
const data = await resp.json();
|
||||
const dot = document.getElementById('statusDot');
|
||||
const info = document.getElementById('device-info');
|
||||
if (data.connected) {
|
||||
dot.className = 'status-dot connected';
|
||||
info.textContent = `${data.model} (${data.resolution}) - ${data.serial}`;
|
||||
refreshScreenshot();
|
||||
} else {
|
||||
dot.className = 'status-dot';
|
||||
info.textContent = data.error || '未连接设备';
|
||||
}
|
||||
} catch (e) {
|
||||
document.getElementById('device-info').textContent = '服务未启动';
|
||||
}
|
||||
}
|
||||
|
||||
async function refreshScreenshot() {
|
||||
try {
|
||||
const resp = await fetch('/api/screenshot');
|
||||
const data = await resp.json();
|
||||
if (data.ok) {
|
||||
const img = document.getElementById('phoneScreen');
|
||||
img.src = 'data:image/png;base64,' + data.image;
|
||||
img.style.display = 'block';
|
||||
document.getElementById('phonePlaceholder').style.display = 'none';
|
||||
}
|
||||
} catch (e) {}
|
||||
}
|
||||
|
||||
function runTask() {
|
||||
const task = document.getElementById('taskInput').value.trim();
|
||||
if (!task) return;
|
||||
|
||||
document.getElementById('stepsList').innerHTML = '';
|
||||
document.getElementById('thinkingLog').innerHTML = '';
|
||||
document.getElementById('btnRun').style.display = 'none';
|
||||
document.getElementById('btnStop').style.display = 'inline-block';
|
||||
document.getElementById('statusDot').className = 'status-dot running';
|
||||
|
||||
const protocol = location.protocol === 'https:' ? 'wss:' : 'ws:';
|
||||
ws = new WebSocket(`${protocol}//${location.host}/ws/task`);
|
||||
|
||||
ws.onopen = () => {
|
||||
ws.send(JSON.stringify({ task }));
|
||||
};
|
||||
|
||||
ws.onmessage = (e) => {
|
||||
const data = JSON.parse(e.data);
|
||||
if (data.status === 'step') {
|
||||
addStep(data);
|
||||
} else if (data.status === 'completed' || data.status === 'failed' || data.status === 'stopped') {
|
||||
taskDone(data);
|
||||
}
|
||||
};
|
||||
|
||||
ws.onclose = () => taskDone({ status: 'disconnected' });
|
||||
}
|
||||
|
||||
function addStep(data) {
|
||||
const list = document.getElementById('stepsList');
|
||||
const card = document.createElement('div');
|
||||
card.className = 'step-card';
|
||||
card.innerHTML = `
|
||||
<div class="step-header">
|
||||
<span class="step-num">Step ${data.step}</span>
|
||||
<span class="step-action ${data.error ? 'error' : ''}">${data.error || data.action_desc || data.action_type}</span>
|
||||
</div>
|
||||
${data.observation ? `<div class="step-obs">${data.observation}</div>` : ''}
|
||||
${data.thinking ? `<div class="step-think">${data.thinking}</div>` : ''}
|
||||
`;
|
||||
list.appendChild(card);
|
||||
list.scrollTop = list.scrollHeight;
|
||||
|
||||
if (data.thinking) {
|
||||
const log = document.getElementById('thinkingLog');
|
||||
const p = document.createElement('div');
|
||||
p.className = 'step-card';
|
||||
p.innerHTML = `<span class="step-num">Step ${data.step}</span>: ${data.thinking}`;
|
||||
log.appendChild(p);
|
||||
log.scrollTop = log.scrollHeight;
|
||||
}
|
||||
|
||||
refreshScreenshot();
|
||||
}
|
||||
|
||||
function taskDone(data) {
|
||||
document.getElementById('btnRun').style.display = 'inline-block';
|
||||
document.getElementById('btnStop').style.display = 'none';
|
||||
document.getElementById('statusDot').className = 'status-dot connected';
|
||||
if (ws) { ws.close(); ws = null; }
|
||||
}
|
||||
|
||||
async function stopTask() {
|
||||
await fetch('/api/stop', { method: 'POST' });
|
||||
}
|
||||
|
||||
checkDevice();
|
||||
setInterval(checkDevice, 10000);
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
Reference in New Issue
Block a user