auto-save 2026-04-01 09:03 (+8, ~2)

2026-04-01 09:04:04 +08:00
parent 0ddaa889de
commit 9709573870
70 changed files with 2331 additions and 9 deletions

10
.env.example Normal file

@@ -0,0 +1,10 @@
# Device
DEVICE_SERIAL= # leave empty for auto-detect
# VLM Provider: poe / openrouter / local
VLM_PROVIDER=poe
VLM_MODEL=Qwen/Qwen2.5-VL-7B-Instruct
# API Keys (fill the one matching your provider)
POE_API_KEY=
OPENROUTER_API_KEY=

4
.gitignore vendored

@@ -10,3 +10,7 @@ __pycache__/
.vscode/
.idea/
*.log
data/screenshots/
*.egg-info/
.venv/
venv/

77
.memory/project-status.md Normal file

@@ -0,0 +1,77 @@
---
name: GUI Agent Project Status
description: Current progress, technical decisions, and open items for the phone GUI Agent project
type: project
---
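## Project status: end-to-end flow working + on-device OCR deployed
### Device information
- **Huawei P40 Pro** (ELS-AN00)
- Serial number: UQG5T20416000119
- Resolution: 1200x2640
- OS: HarmonyOS 4.x (Android compatibility layer; ADB works)
- ADB path: `/opt/homebrew/bin/adb`
- Connection note: on Huawei phones, "Allow ADB debugging in charge only mode" must also be enabled in Developer options
- **"Install via USB" permission enabled** (2026-03-29)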
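### Completed
- Skeleton code for the seven-layer pipeline (L1-L7) is all in place
- Web console (FastAPI + dark-theme UI) verified to run
- Port 4380; the VLM defaults to the Poe API
- Supports 8 action types: tap/swipe/type/long_press/back/home/scroll/wait
- The Agent main loop keeps a history memory (last 5 steps) and stops automatically on consecutive errors
- **ADB screenshot verified** (2026-03-29)
- **Mac-side OCR element localization verified** (2026-03-29): easyocr Chinese recognition, returns pixel coordinates
- **Chinese text input verified** (2026-03-29): uiautomator2 send_keys
- **End-to-end WeChat message sending succeeded 3 times** (2026-03-29): "你是大聪明" ("You're such a smarty"), "祝你生日快乐" ("Happy birthday to you"), "生日快乐" ("Happy birthday")
- **On-device OCR Service APK deployed** (2026-03-29): ML Kit Chinese (bundled), port 18900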
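
To make the action list above concrete, here is a minimal sketch of how some of the action types could map onto `adb shell input` arguments. The action dictionary schema (`type`, `x`, `y`, `x1`, `duration_ms`, ...) and the default durations are assumptions for illustration, not the project's actual executor. `type` and `wait` are deliberately left out: per these notes, text entry goes through uiautomator2, and waiting is just a sleep.

```python
from typing import Any

# KEYCODE_HOME = 3 and KEYCODE_BACK = 4 are standard Android key codes.
KEYCODES = {"home": "3", "back": "4"}


def build_input_args(action: dict[str, Any]) -> list[str]:
    """Translate one action (hypothetical schema) into `adb shell input ...` arguments."""
    kind = action["type"]
    if kind == "tap":
        return ["input", "tap", str(action["x"]), str(action["y"])]
    if kind == "long_press":
        # A swipe that starts and ends on the same point behaves like a long press.
        x, y = str(action["x"]), str(action["y"])
        return ["input", "swipe", x, y, x, y, str(action.get("duration_ms", 800))]
    if kind in ("swipe", "scroll"):
        return [
            "input", "swipe",
            str(action["x1"]), str(action["y1"]),
            str(action["x2"]), str(action["y2"]),
            str(action.get("duration_ms", 300)),
        ]
    if kind in KEYCODES:
        return ["input", "keyevent", KEYCODES[kind]]
    raise ValueError(f"no adb input mapping for action type: {kind}")
```

The returned list would be appended to `adb shell`, for example via `subprocess.run([adb, "shell", *build_input_args(action)])`.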
### On-device OCR Service (android-ocr-service/)
- **Engine**: Google ML Kit text-recognition-chinese (bundled variant; no GMS dependency, works on Huawei)
- **Architecture**: Kotlin APK = OcrEngine + NanoHTTPD(18900) + ForegroundService
- **Endpoints**:
  - `GET /health`: health check
  - `GET /ocr?path=/data/local/tmp/s.png`: run OCR on an image file
  - `GET /ocr?path=...&text=微信`: filter results by text
  - `POST /snap`: POST raw image bytes for direct OCR (NanoHTTPD binary handling has a bug; fix pending)
- **Usage**:
```bash
adb shell am start -n com.guiagent.ocr/.MainActivity
adb forward tcp:18900 tcp:18900
adb shell "screencap -p /data/local/tmp/s.png"
# Quote the URL so the shell does not treat "?" as a glob character
curl "http://localhost:18900/ocr?path=/data/local/tmp/s.png"
```
- **Performance**: first call ~2.4 s (model loading), subsequent calls ~1.8 s each
- **Build**: `ANDROID_HOME=/opt/homebrew/share/android-commandlinetools JAVA_HOME=/opt/homebrew/Cellar/openjdk@21/21.0.10/libexec/openjdk.jdk/Contents/Home ./gradlew assembleDebug`
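### Key technical decisions
| Capability | Approach | Notes |
|------|------|------|
| Element localization (Mac) | easyocr | pytesseract segments Chinese poorly; uiautomator dump returns empty on WeChat on Huawei |
| Element localization (on-device) | ML Kit Chinese (bundled) | No GMS/HMS dependency; the model ships inside the APK |
| Chinese input | uiautomator2 send_keys | Requires a helper APK; Huawei needs the USB install permission enabled |
| Screenshot | `adb shell screencap -p /data/local/tmp/s.png` | Bypasses FUSE; faster than /sdcard/ |
| adb input text | Chinese not supported | NullPointerException; the clipboard is also unavailable |
| Screenshot display | Must downscale with sips -Z 1800 | The original 1200x2640 exceeds Claude's 2000px limit |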
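
The downscaling rule in the last row is simple arithmetic: scale both sides by the same factor so that the longer side lands on the limit. A minimal sketch follows, with a hypothetical helper name; `sips` may round slightly differently.

```python
def fit_within(width: int, height: int, max_side: int = 1800) -> tuple[int, int]:
    """Scale (width, height) down so neither side exceeds max_side, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, which is already in `requirements.txt`, the result can be passed straight to `img.resize(fit_within(*img.size))`. Coordinates read off the downscaled image must be multiplied back up by the inverse factor before tapping.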
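### Known issues
1. OCR occasionally misreads characters ("康"→"東"); both ML Kit and easyocr have this problem
2. `POST /snap` endpoint: NanoHTTPD binary body parsing bug; the file-based route is the workaround
3. WeChat app-twin chooser dialog: every `am start` pops up an "Open with" dialog
4. The Send button (white text on green) is unreliable under OCR; use the coordinates (1008, 2425) or the OCR string "(田发送"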
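
One way to handle issue 4 is to try OCR first and fall back to the fixed coordinate. The sketch below assumes OCR results shaped like the service's JSON (`text`, `cx`, `cy`, `confidence`); the `min_conf` threshold is an arbitrary illustrative value.

```python
SEND_FALLBACK = (1008, 2425)  # fixed Send-button coordinate from the notes (1200x2640 screen)
SEND_LABEL = "发送"            # also matches the misread variant "(田发送" as a substring


def locate_send_button(ocr_results: list[dict], min_conf: float = 0.5) -> tuple[int, int]:
    """Return the OCR-detected Send button center, or the fixed fallback coordinate."""
    for box in ocr_results:
        confident = box.get("confidence", 0.0) >= min_conf
        if confident and SEND_LABEL in box.get("text", ""):
            return box["cx"], box["cy"]
    return SEND_FALLBACK
```

The fallback coordinate is only valid for this device and this WeChat layout, so it should be treated as a per-device setting rather than a constant.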
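### Next steps (resume Monday)
1. **Speed optimization**: use fixed coordinates for the Send button instead of OCR (saves 2 s) and shorten the sleeps (saves 2 s); target 5-6 s per operation
2. **OCR inference optimization**: downscale before recognition / NNAPI acceleration; target <1 s
3. **Integrate into the Agent main loop**: wire the device OCR engine into ocr_grounding.py
4. Configure .env (Poe API Key)
5. Connect the VLM (Qwen2.5-VL via the Poe API) for screen understanding in complex scenarios
6. Run complex multi-step tasks end to end (swipe, long press, cross-app)
7. Flesh out the verification and error-correction layer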
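### Technical background
The project was inspired by in-depth research into ByteDance's UI-TARS / Doubao phone. Conclusions:
- What UI-TARS open-sources is the weights plus an inference shell; the training code and system-level control are fully closed-source
- The core moat is not the model but the full "capture → understand → locate → plan → execute → verify" chain
- Goal of this project: reproduce that full chain with an open-source VLM + ADB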
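
The control flow around that chain can be sketched in a few lines. Here the whole capture-to-verify pass is collapsed into one hypothetical `step` callback; the 5-step history and the 20-step cap come from these notes and `config/settings.py`, while the consecutive-error limit of 3 is an assumption.

```python
from collections import deque
from typing import Callable

# step(history) runs one capture → ... → verify pass and returns (summary, success).
Step = Callable[[list[str]], tuple[str, bool]]


def run_agent(step: Step, max_steps: int = 20, history_size: int = 5,
              max_errors: int = 3) -> list[str]:
    """Run the loop until the step reports "done", errors pile up, or max_steps is hit."""
    history: deque[str] = deque(maxlen=history_size)  # only the most recent steps are kept
    log: list[str] = []
    consecutive_errors = 0
    for _ in range(max_steps):
        summary, ok = step(list(history))
        history.append(summary)
        log.append(summary)
        if summary == "done":
            break
        consecutive_errors = 0 if ok else consecutive_errors + 1
        if consecutive_errors >= max_errors:
            log.append("stopped: too many consecutive errors")
            break
    return log
```

A bounded `deque` keeps the prompt context small without any explicit trimming code.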

View File

@@ -1,17 +1,43 @@
# Phone GUI Agent Automation
## Architecture
Seven-layer closed-loop pipeline: capture → understand → locate → plan → execute → verify → loop
```
src/
├── capture/ # L1 - ADB/scrcpy screen capture
├── vision/ # L2 - VLM screen understanding
├── grounding/ # L3 - element localization (natural language → coordinates)
├── planner/ # L4 - task planning and decomposition
├── executor/ # L5 - ADB action execution
└── verifier/ # L6+L7 - verification and correction + state memory
```
## Startup
- `python -m src.main`: main service, port 4380
- `python scripts/test_device.py`: test the ADB connection
## Tech stack
- Python 3.11+
- ADB + scrcpy (screen capture and control)
- Qwen2.5-VL / UI-TARS-1.5 (visual understanding)
- FastAPI (web console)
- Poe API / OpenRouter (LLM calls, per user preference)
## Environment variables
- `DEVICE_SERIAL`: Android device serial number (see `adb devices`)
- `VLM_PROVIDER`: VLM provider, one of `local` / `poe` / `openrouter`
- `VLM_MODEL`: model name, default `Qwen/Qwen2.5-VL-7B-Instruct`
- `POE_API_KEY`: Poe API key (required when VLM_PROVIDER=poe)
- `OPENROUTER_API_KEY`: OpenRouter key (fallback)
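## Rules
- Capture with adb exec-out screencap, not the scrcpy recording stream (saves resources)
- After every action, wait and then re-capture the screen to verify
- Save all screenshots to `data/screenshots/` for debugging
- Use normalized coordinates (0-1) everywhere; convert to device pixels only at execution time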
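
The last rule amounts to a pair of small conversions. The sketch below uses hypothetical helper names and clamps so that a normalized value of 1.0 still lands on a valid pixel.

```python
def to_pixels(nx: float, ny: float, width: int, height: int) -> tuple[int, int]:
    """Convert normalized (0-1) coordinates to device pixels."""
    if not (0.0 <= nx <= 1.0 and 0.0 <= ny <= 1.0):
        raise ValueError(f"normalized coordinates out of range: ({nx}, {ny})")
    # Clamp so that 1.0 maps to the last valid pixel rather than one past the edge.
    x = min(int(nx * width), width - 1)
    y = min(int(ny * height), height - 1)
    return x, y


def to_normalized(x: int, y: int, width: int, height: int) -> tuple[float, float]:
    """Convert device pixels back to normalized (0-1) coordinates."""
    return x / width, y / height
```

Keeping plans in normalized form means the same plan can be replayed on a device with a different resolution, with only the final conversion changing.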

View File

@@ -0,0 +1,2 @@
#Sun Mar 29 02:14:23 CST 2026
gradle.version=8.7

Binary file not shown.

View File

@@ -0,0 +1,43 @@
plugins {
id("com.android.application")
id("org.jetbrains.kotlin.android")
}
android {
namespace = "com.guiagent.ocr"
compileSdk = 31
defaultConfig {
applicationId = "com.guiagent.ocr"
minSdk = 26
targetSdk = 31
versionCode = 1
versionName = "1.0"
}
buildTypes {
release {
isMinifyEnabled = false
}
}
compileOptions {
sourceCompatibility = JavaVersion.VERSION_1_8
targetCompatibility = JavaVersion.VERSION_1_8
}
kotlinOptions {
jvmTarget = "1.8"
}
}
dependencies {
// ML Kit Text Recognition - bundled model (no GMS needed!)
implementation("com.google.mlkit:text-recognition-chinese:16.0.0")
// HTTP server
implementation("org.nanohttpd:nanohttpd:2.3.1")
// JSON
implementation("com.google.code.gson:gson:2.10.1")
}

View File

@@ -0,0 +1,28 @@
<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android">
<uses-permission android:name="android.permission.READ_EXTERNAL_STORAGE"/>
<uses-permission android:name="android.permission.INTERNET"/>
<uses-permission android:name="android.permission.FOREGROUND_SERVICE"/>
<application
android:allowBackup="false"
android:label="OCR Service"
android:supportsRtl="true">
<activity
android:name=".MainActivity"
android:exported="true">
<intent-filter>
<action android:name="android.intent.action.MAIN"/>
<category android:name="android.intent.category.LAUNCHER"/>
</intent-filter>
</activity>
<service
android:name=".OcrService"
android:exported="true"
android:foregroundServiceType="dataSync"/>
</application>
</manifest>

View File

@@ -0,0 +1,23 @@
package com.guiagent.ocr
import android.app.Activity
import android.content.Intent
import android.os.Bundle
import android.widget.TextView
class MainActivity : Activity() {
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
val tv = TextView(this).apply {
text = "OCR Service\nPort: 18900\nStarting..."
textSize = 20f
setPadding(40, 40, 40, 40)
}
setContentView(tv)
// Start the service
val intent = Intent(this, OcrService::class.java)
startForegroundService(intent)
tv.text = "OCR Service\nPort: 18900\nRunning!"
}
}

View File

@@ -0,0 +1,79 @@
package com.guiagent.ocr
import android.graphics.Bitmap
import android.graphics.BitmapFactory
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.text.TextRecognition
import com.google.mlkit.vision.text.chinese.ChineseTextRecognizerOptions
import java.io.File
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit
data class TextBox(
val text: String,
val x: Int,
val y: Int,
val w: Int,
val h: Int,
val confidence: Float
) {
val cx get() = x + w / 2
val cy get() = y + h / 2
}
object OcrEngine {
private val recognizer by lazy {
TextRecognition.getClient(ChineseTextRecognizerOptions.Builder().build())
}
fun recognize(imagePath: String): List<TextBox> {
val file = File(imagePath)
if (!file.exists()) return emptyList()
val bitmap = BitmapFactory.decodeFile(imagePath) ?: return emptyList()
return recognizeBitmap(bitmap)
}
/** Capture the screen and run OCR directly, without writing to disk. */
fun screencapAndRecognize(): List<TextBox> {
val process = Runtime.getRuntime().exec("screencap -p")
val bytes = process.inputStream.readBytes()
process.waitFor()
if (bytes.isEmpty()) return emptyList()
val bitmap = BitmapFactory.decodeByteArray(bytes, 0, bytes.size) ?: return emptyList()
return recognizeBitmap(bitmap)
}
fun recognizeBitmap(bitmap: Bitmap): List<TextBox> {
val image = InputImage.fromBitmap(bitmap, 0)
val results = mutableListOf<TextBox>()
val latch = CountDownLatch(1)
recognizer.process(image)
.addOnSuccessListener { visionText ->
for (block in visionText.textBlocks) {
for (line in block.lines) {
val box = line.boundingBox ?: continue
results.add(
TextBox(
text = line.text,
x = box.left,
y = box.top,
w = box.width(),
h = box.height(),
confidence = line.confidence ?: 0.8f
)
)
}
}
latch.countDown()
}
.addOnFailureListener {
latch.countDown()
}
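// await() returns false on timeout; the recognizer may then still be reading
// the bitmap and appending to results, so neither is touched in that case.
val finished = latch.await(10, TimeUnit.SECONDS)
if (!finished) return emptyList()
bitmap.recycle()
return results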
}
}

View File

@@ -0,0 +1,88 @@
package com.guiagent.ocr
import android.graphics.BitmapFactory
import com.google.gson.Gson
import fi.iki.elonen.NanoHTTPD
import java.io.ByteArrayOutputStream
class OcrHttpServer(port: Int = 18900) : NanoHTTPD(port) {
private val gson = Gson()
private val defaultPath = "/sdcard/ocr_screen.png"
override fun serve(session: IHTTPSession): Response {
return when (session.uri) {
"/ocr" -> handleOcr(session)
"/snap" -> handleSnap(session)
"/health" -> jsonResponse(mapOf("status" to "ok", "engine" to "mlkit-chinese"))
else -> newFixedLengthResponse(Response.Status.NOT_FOUND, MIME_PLAINTEXT, "404")
}
}
/** OCR an image file read from disk. */
private fun handleOcr(session: IHTTPSession): Response {
val params = session.parms ?: emptyMap()
val imagePath = params["path"] ?: defaultPath
return doOcr(params["text"]) { OcrEngine.recognize(imagePath) }
}
/** OCR image data POSTed in the request body (no file is saved). */
private fun handleSnap(session: IHTTPSession): Response {
val params = session.parms ?: emptyMap()
if (session.method == Method.POST) {
// NanoHTTPD parseBody is expected to store the binary data in a temporary file
val bodyFiles = HashMap<String, String>()
session.parseBody(bodyFiles)
// The "postData" key is expected to map to the temporary file path
val tmpPath = bodyFiles["postData"]
if (tmpPath != null) {
val imageBytes = java.io.File(tmpPath).readBytes()
val bitmap = BitmapFactory.decodeByteArray(imageBytes, 0, imageBytes.size)
if (bitmap != null) {
return doOcr(params["text"]) { OcrEngine.recognizeBitmap(bitmap) }
}
return jsonResponse(mapOf("error" to "decode failed", "size" to imageBytes.size, "count" to 0))
}
return jsonResponse(mapOf("error" to "no body received", "count" to 0))
}
// GET: fall back to file-based OCR
return handleOcr(session)
}
private fun doOcr(query: String?, recognize: () -> List<TextBox>): Response {
val startTime = System.currentTimeMillis()
var results = recognize()
if (!query.isNullOrBlank()) {
results = results.filter { it.text.contains(query) }
}
val elapsed = System.currentTimeMillis() - startTime
val response = mapOf(
"results" to results.map { box ->
mapOf(
"text" to box.text,
"x" to box.x,
"y" to box.y,
"w" to box.w,
"h" to box.h,
"cx" to box.cx,
"cy" to box.cy,
"confidence" to box.confidence
)
},
"count" to results.size,
"elapsed_ms" to elapsed
)
return jsonResponse(response)
}
private fun jsonResponse(data: Any): Response {
val json = gson.toJson(data)
return newFixedLengthResponse(Response.Status.OK, "application/json", json)
}
}

View File

@@ -0,0 +1,49 @@
package com.guiagent.ocr
import android.app.*
import android.content.Intent
import android.os.Build
import android.os.IBinder
import android.util.Log
class OcrService : Service() {
private var server: OcrHttpServer? = null
private val TAG = "OcrService"
private val PORT = 18900
override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
startForegroundNotification()
if (server == null) {
server = OcrHttpServer(PORT).also {
it.start()
Log.i(TAG, "OCR HTTP server started on port $PORT")
}
}
return START_STICKY
}
override fun onDestroy() {
server?.stop()
server = null
Log.i(TAG, "OCR HTTP server stopped")
super.onDestroy()
}
override fun onBind(intent: Intent?): IBinder? = null
private fun startForegroundNotification() {
val channelId = "ocr_service"
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.O) {
val channel = NotificationChannel(channelId, "OCR Service", NotificationManager.IMPORTANCE_LOW)
getSystemService(NotificationManager::class.java).createNotificationChannel(channel)
}
val notification = Notification.Builder(this, channelId)
.setContentTitle("OCR Service")
.setContentText("Running on port $PORT")
.setSmallIcon(android.R.drawable.ic_menu_camera)
.build()
startForeground(1, notification)
}
}

View File

@@ -0,0 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<resources>
<string name="app_name">OCR Service</string>
</resources>

View File

@@ -0,0 +1,4 @@
plugins {
id("com.android.application") version "8.5.1" apply false
id("org.jetbrains.kotlin.android") version "2.0.0" apply false
}

View File

@@ -0,0 +1,3 @@
org.gradle.jvmargs=-Xmx2048m
android.useAndroidX=true
kotlin.code.style=official

Binary file not shown.

View File

@@ -0,0 +1,7 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-8.7-bin.zip
networkTimeout=10000
validateDistributionUrl=true
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists

249
android-ocr-service/gradlew vendored Executable file

@@ -0,0 +1,249 @@
#!/bin/sh
#
# Copyright © 2015-2021 the original authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
##############################################################################
#
# Gradle start up script for POSIX generated by Gradle.
#
# Important for running:
#
# (1) You need a POSIX-compliant shell to run this script. If your /bin/sh is
# noncompliant, but you have some other compliant shell such as ksh or
# bash, then to run this script, type that shell name before the whole
# command line, like:
#
# ksh Gradle
#
# Busybox and similar reduced shells will NOT work, because this script
# requires all of these POSIX shell features:
# * functions;
# * expansions «$var», «${var}», «${var:-default}», «${var+SET}»,
# «${var#prefix}», «${var%suffix}», and «$( cmd )»;
# * compound commands having a testable exit status, especially «case»;
# * various built-in commands including «command», «set», and «ulimit».
#
# Important for patching:
#
# (2) This script targets any POSIX shell, so it avoids extensions provided
# by Bash, Ksh, etc; in particular arrays are avoided.
#
# The "traditional" practice of packing multiple parameters into a
# space-separated string is a well documented source of bugs and security
# problems, so this is (mostly) avoided, by progressively accumulating
# options in "$@", and eventually passing that to Java.
#
# Where the inherited environment variables (DEFAULT_JVM_OPTS, JAVA_OPTS,
# and GRADLE_OPTS) rely on word-splitting, this is performed explicitly;
# see the in-line comments for details.
#
# There are tweaks for specific operating systems such as AIX, CygWin,
# Darwin, MinGW, and NonStop.
#
# (3) This script is generated from the Groovy template
# https://github.com/gradle/gradle/blob/HEAD/subprojects/plugins/src/main/resources/org/gradle/api/internal/plugins/unixStartScript.txt
# within the Gradle project.
#
# You can find Gradle at https://github.com/gradle/gradle/.
#
##############################################################################
# Attempt to set APP_HOME
# Resolve links: $0 may be a link
app_path=$0
# Need this for daisy-chained symlinks.
while
APP_HOME=${app_path%"${app_path##*/}"} # leaves a trailing /; empty if no leading path
[ -h "$app_path" ]
do
ls=$( ls -ld "$app_path" )
link=${ls#*' -> '}
case $link in #(
/*) app_path=$link ;; #(
*) app_path=$APP_HOME$link ;;
esac
done
# This is normally unused
# shellcheck disable=SC2034
APP_BASE_NAME=${0##*/}
# Discard cd standard output in case $CDPATH is set (https://github.com/gradle/gradle/issues/25036)
APP_HOME=$( cd "${APP_HOME:-./}" > /dev/null && pwd -P ) || exit
# Use the maximum available, or set MAX_FD != -1 to use that value.
MAX_FD=maximum
warn () {
echo "$*"
} >&2
die () {
echo
echo "$*"
echo
exit 1
} >&2
# OS specific support (must be 'true' or 'false').
cygwin=false
msys=false
darwin=false
nonstop=false
case "$( uname )" in #(
CYGWIN* ) cygwin=true ;; #(
Darwin* ) darwin=true ;; #(
MSYS* | MINGW* ) msys=true ;; #(
NONSTOP* ) nonstop=true ;;
esac
CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar
# Determine the Java command to use to start the JVM.
if [ -n "$JAVA_HOME" ] ; then
if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
# IBM's JDK on AIX uses strange locations for the executables
JAVACMD=$JAVA_HOME/jre/sh/java
else
JAVACMD=$JAVA_HOME/bin/java
fi
if [ ! -x "$JAVACMD" ] ; then
die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
else
JAVACMD=java
if ! command -v java >/dev/null 2>&1
then
die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
fi
# Increase the maximum file descriptors if we can.
if ! "$cygwin" && ! "$darwin" && ! "$nonstop" ; then
case $MAX_FD in #(
max*)
# In POSIX sh, ulimit -H is undefined. That's why the result is checked to see if it worked.
# shellcheck disable=SC2039,SC3045
MAX_FD=$( ulimit -H -n ) ||
warn "Could not query maximum file descriptor limit"
esac
case $MAX_FD in #(
'' | soft) :;; #(
*)
# In POSIX sh, ulimit -n is undefined. That's why the result is checked to see if it worked.
# shellcheck disable=SC2039,SC3045
ulimit -n "$MAX_FD" ||
warn "Could not set maximum file descriptor limit to $MAX_FD"
esac
fi
# Collect all arguments for the java command, stacking in reverse order:
# * args from the command line
# * the main class name
# * -classpath
# * -D...appname settings
# * --module-path (only if needed)
# * DEFAULT_JVM_OPTS, JAVA_OPTS, and GRADLE_OPTS environment variables.
# For Cygwin or MSYS, switch paths to Windows format before running java
if "$cygwin" || "$msys" ; then
APP_HOME=$( cygpath --path --mixed "$APP_HOME" )
CLASSPATH=$( cygpath --path --mixed "$CLASSPATH" )
JAVACMD=$( cygpath --unix "$JAVACMD" )
# Now convert the arguments - kludge to limit ourselves to /bin/sh
for arg do
if
case $arg in #(
-*) false ;; # don't mess with options #(
/?*) t=${arg#/} t=/${t%%/*} # looks like a POSIX filepath
[ -e "$t" ] ;; #(
*) false ;;
esac
then
arg=$( cygpath --path --ignore --mixed "$arg" )
fi
# Roll the args list around exactly as many times as the number of
# args, so each arg winds up back in the position where it started, but
# possibly modified.
#
# NB: a `for` loop captures its iteration list before it begins, so
# changing the positional parameters here affects neither the number of
# iterations, nor the values presented in `arg`.
shift # remove old arg
set -- "$@" "$arg" # push replacement arg
done
fi
# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='-Dfile.encoding=UTF-8 "-Xmx64m" "-Xms64m"'
# Collect all arguments for the java command:
# * DEFAULT_JVM_OPTS, JAVA_OPTS, JAVA_OPTS, and optsEnvironmentVar are not allowed to contain shell fragments,
# and any embedded shellness will be escaped.
# * For example: A user cannot expect ${Hostname} to be expanded, as it is an environment variable and will be
# treated as '${Hostname}' itself on the command line.
set -- \
"-Dorg.gradle.appname=$APP_BASE_NAME" \
-classpath "$CLASSPATH" \
org.gradle.wrapper.GradleWrapperMain \
"$@"
# Stop when "xargs" is not available.
if ! command -v xargs >/dev/null 2>&1
then
die "xargs is not available"
fi
# Use "xargs" to parse quoted args.
#
# With -n1 it outputs one arg per line, with the quotes and backslashes removed.
#
# In Bash we could simply go:
#
# readarray ARGS < <( xargs -n1 <<<"$var" ) &&
# set -- "${ARGS[@]}" "$@"
#
# but POSIX shell has neither arrays nor command substitution, so instead we
# post-process each arg (as a line of input to sed) to backslash-escape any
# character that might be a shell metacharacter, then use eval to reverse
# that process (while maintaining the separation between arguments), and wrap
# the whole thing up as a single "set" statement.
#
# This will of course break if any of these variables contains a newline or
# an unmatched quote.
#
eval "set -- $(
printf '%s\n' "$DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS" |
xargs -n1 |
sed ' s~[^-[:alnum:]+,./:=@_]~\\&~g; ' |
tr '\n' ' '
)" '"$@"'
exec "$JAVACMD" "$@"

92
android-ocr-service/gradlew.bat vendored Normal file

@@ -0,0 +1,92 @@
@rem
@rem Copyright 2015 the original author or authors.
@rem
@rem Licensed under the Apache License, Version 2.0 (the "License");
@rem you may not use this file except in compliance with the License.
@rem You may obtain a copy of the License at
@rem
@rem https://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.
@rem
@if "%DEBUG%"=="" @echo off
@rem ##########################################################################
@rem
@rem Gradle startup script for Windows
@rem
@rem ##########################################################################
@rem Set local scope for the variables with windows NT shell
if "%OS%"=="Windows_NT" setlocal
set DIRNAME=%~dp0
if "%DIRNAME%"=="" set DIRNAME=.
@rem This is normally unused
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%
@rem Resolve any "." and ".." in APP_HOME to make it shorter.
for %%i in ("%APP_HOME%") do set APP_HOME=%%~fi
@rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
set DEFAULT_JVM_OPTS=-Dfile.encoding=UTF-8 "-Xmx64m" "-Xms64m"
@rem Find java.exe
if defined JAVA_HOME goto findJavaFromJavaHome
set JAVA_EXE=java.exe
%JAVA_EXE% -version >NUL 2>&1
if %ERRORLEVEL% equ 0 goto execute
echo.
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.
goto fail
:findJavaFromJavaHome
set JAVA_HOME=%JAVA_HOME:"=%
set JAVA_EXE=%JAVA_HOME%/bin/java.exe
if exist "%JAVA_EXE%" goto execute
echo.
echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME%
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.
goto fail
:execute
@rem Setup the command line
set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar
@rem Execute Gradle
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %*
:end
@rem End local scope for the variables with windows NT shell
if %ERRORLEVEL% equ 0 goto mainEnd
:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
set EXIT_CODE=%ERRORLEVEL%
if %EXIT_CODE% equ 0 set EXIT_CODE=1
if not ""=="%GRADLE_EXIT_CONSOLE%" exit %EXIT_CODE%
exit /b %EXIT_CODE%
:mainEnd
if "%OS%"=="Windows_NT" endlocal
:omega

View File

@@ -0,0 +1,18 @@
pluginManagement {
repositories {
google()
mavenCentral()
gradlePluginPortal()
}
}
dependencyResolutionManagement {
repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
repositories {
google()
mavenCentral()
}
}
rootProject.name = "ocr-service"
include(":app")

3
config/__init__.py Normal file

@@ -0,0 +1,3 @@
from .settings import settings
__all__ = ["settings"]

30
config/settings.py Normal file

@@ -0,0 +1,30 @@
from pydantic_settings import BaseSettings
from typing import Optional


class Settings(BaseSettings):
    # Device
    device_serial: Optional[str] = None  # None = auto-detect first device
    adb_path: str = "/opt/homebrew/bin/adb"
    screenshot_dir: str = "data/screenshots"

    # VLM
    vlm_provider: str = "poe"  # local / poe / openrouter
    vlm_model: str = "Qwen/Qwen2.5-VL-7B-Instruct"
    poe_api_key: Optional[str] = None
    openrouter_api_key: Optional[str] = None

    # Agent
    max_steps: int = 20
    action_delay: float = 1.5  # seconds to wait after each action
    screenshot_timeout: float = 5.0
    verify_after_action: bool = True

    # Server
    host: str = "0.0.0.0"
    port: int = 4380

    model_config = {"env_file": ".env", "env_file_encoding": "utf-8"}


settings = Settings()

15
requirements.txt Normal file

@@ -0,0 +1,15 @@
fastapi>=0.115.0
uvicorn>=0.32.0
pillow>=10.0.0
httpx>=0.27.0
pydantic>=2.0.0
pydantic-settings>=2.0.0
jinja2>=3.1.0
python-multipart>=0.0.9
# OCR grounding (L3 - element detection by visible text)
pytesseract>=0.3.10 # Fast, uses system tesseract binary
numpy>=1.24.0 # Required by easyocr and image processing
# Optional: better Chinese OCR (install separately if pytesseract is insufficient)
# pip install easyocr # ~150MB download, better zh_CN but slower first run

38
scripts/test_device.py Normal file

@@ -0,0 +1,38 @@
"""Quick test: check ADB device connection and take a screenshot."""

import sys
import os

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from src.capture import ADBCapture


def main():
    cap = ADBCapture()

    print("Checking device...")
    info = cap.check_device()

    if not info["connected"]:
        print(f"[FAIL] {info['error']}")
        print()
        print("Troubleshooting:")
        print(" 1. USB debugging enabled on phone?")
        print(" 2. Run: adb devices")
        print(" 3. Accept USB debugging prompt on phone")
        sys.exit(1)

    print(f"[OK] Device: {info['model']}")
    print(f" Serial: {info['serial']}")
    print(f" Resolution: {info['resolution']}")
    print(f" All devices: {info['all_devices']}")

    print("\nTaking screenshot...")
    img = cap.screenshot(save=True)
    print(f"[OK] Screenshot: {img.size[0]}x{img.size[1]}")
    print(f" Saved to: {cap.screenshot_dir}/")


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,149 @@
"""Test OCR grounding: take a screenshot and find text elements.

Usage:
    # Find a specific text on current screen
    python scripts/test_ocr_grounding.py "微信"

    # Detect ALL text on screen (debug mode)
    python scripts/test_ocr_grounding.py --all

    # Use a saved screenshot instead of live ADB capture
    python scripts/test_ocr_grounding.py "发送" --image data/screenshots/test.png

    # Try different engines
    python scripts/test_ocr_grounding.py "微信" --engine easyocr
    python scripts/test_ocr_grounding.py "微信" --engine pytesseract

    # Also try uiautomator dump (hybrid mode)
    python scripts/test_ocr_grounding.py "微信" --hybrid

    # Save annotated screenshot with bounding boxes drawn
    python scripts/test_ocr_grounding.py --all --annotate
"""

import sys
import os
import argparse

sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from PIL import Image, ImageDraw, ImageFont
from src.grounding.ocr_grounding import OCRGrounding


def annotate_image(img: Image.Image, boxes, query: str = "") -> Image.Image:
    """Draw bounding boxes on the image for visualization."""
    annotated = img.copy()
    draw = ImageDraw.Draw(annotated)

    for box in boxes:
        is_match = box.contains_text(query) if query else False
        color = "red" if is_match else "lime"
        width = 3 if is_match else 1

        draw.rectangle(
            [box.x, box.y, box.x + box.w, box.y + box.h],
            outline=color, width=width,
        )
        label = f"{box.text} ({box.confidence:.0%})"
        draw.text((box.x, box.y - 14), label, fill=color)

    return annotated


def main():
    parser = argparse.ArgumentParser(description="Test OCR grounding on phone screen")
    parser.add_argument("query", nargs="?", default=None, help="Text to find on screen")
    parser.add_argument("--all", action="store_true", help="Detect all text on screen")
    parser.add_argument("--image", type=str, help="Use saved screenshot instead of ADB")
    parser.add_argument("--engine", type=str, default="auto",
                        choices=["auto", "pytesseract", "easyocr"],
                        help="OCR engine to use")
    parser.add_argument("--hybrid", action="store_true",
                        help="Try uiautomator + OCR hybrid approach")
    parser.add_argument("--annotate", action="store_true",
                        help="Save annotated screenshot with bounding boxes")
    args = parser.parse_args()

    if not args.query and not args.all:
        parser.error("Provide a search query or --all")

    # Get screenshot
    if args.image:
        print(f"Loading image: {args.image}")
        img = Image.open(args.image)
    else:
        from src.capture import ADBCapture
        cap = ADBCapture()
        info = cap.check_device()
        if not info["connected"]:
            print(f"[FAIL] {info['error']}")
            sys.exit(1)
        print(f"Device: {info['model']} ({info['resolution']})")
        print("Taking screenshot...")
        img = cap.screenshot(save=True)

    print(f"Image size: {img.width}x{img.height}")

    grounding = OCRGrounding(engine=args.engine)

    if args.all:
        print(f"\n--- Detecting ALL text (engine={args.engine}) ---\n")
        boxes = grounding.detect_all(img)
        if not boxes:
            print("[WARN] No text detected!")
        else:
            print(f"Found {len(boxes)} text regions:\n")
            for i, box in enumerate(boxes, 1):
                nx, ny = box.center_normalized(img.width, img.height)
                print(f" {i:3d}. '{box.text}'")
                print(f" pixel=({box.cx}, {box.cy}) "
                      f"norm=({nx:.3f}, {ny:.3f}) "
                      f"conf={box.confidence:.0%}")

        if args.annotate and boxes:
            out_path = "data/screenshots/annotated_all.png"
            # With --image the ADB capture never runs, so the directory may not exist yet.
            os.makedirs(os.path.dirname(out_path), exist_ok=True)
            annotated = annotate_image(img, boxes, query=args.query or "")
            annotated.save(out_path)
            print(f"\nAnnotated image saved: {out_path}")

    if args.query:
        print(f"\n--- Searching for: '{args.query}' (engine={args.engine}) ---\n")

        if args.hybrid:
            result = grounding.find_text_hybrid(img, args.query)
        else:
            result = grounding.find_text(img, args.query)

        if result is None:
            print(f"[NOT FOUND] '{args.query}' was not found on screen.")
            print("\nTip: Run with --all to see all detected text.")
            sys.exit(1)
        else:
            nx, ny = result.center_normalized(img.width, img.height)
            print(f"[FOUND] '{result.text}'")
            print(f" Pixel center: ({result.cx}, {result.cy})")
            print(f" Normalized center: ({nx:.4f}, {ny:.4f})")
            print(f" Bounding box: x={result.x} y={result.y} "
                  f"w={result.w} h={result.h}")
            print(f" Confidence: {result.confidence:.0%}")
            print()
            print(f" To tap this element:")
            print(f" adb shell input tap {result.cx} {result.cy}")

            # Show all matches
            all_matches = grounding.find_all_matches(img, args.query)
            if len(all_matches) > 1:
                print(f"\n ({len(all_matches)} total matches found)")
                for i, m in enumerate(all_matches):
                    print(f" {i+1}. '{m.text}' at ({m.cx},{m.cy}) conf={m.confidence:.0%}")

        if args.annotate:
            boxes = grounding.detect_all(img)
            out_path = "data/screenshots/annotated_search.png"
            # With --image the ADB capture never runs, so the directory may not exist yet.
            os.makedirs(os.path.dirname(out_path), exist_ok=True)
            annotated = annotate_image(img, boxes, query=args.query)
            annotated.save(out_path)
            print(f"\nAnnotated image saved: {out_path}")


if __name__ == "__main__":
    main()

0
src/__init__.py Normal file

3
src/capture/__init__.py Normal file

@@ -0,0 +1,3 @@
from .adb_capture import ADBCapture
__all__ = ["ADBCapture"]

118
src/capture/adb_capture.py Normal file

@@ -0,0 +1,118 @@
"""L1 - Screen Capture via ADB

Captures screenshots from Android device using ADB.
Handles device connection, screenshot acquisition, and resolution detection.
"""

import subprocess
import time
from pathlib import Path
from datetime import datetime
from PIL import Image
import io

from config import settings


class ADBCapture:
    """ADB-based screen capture for Android devices."""

    def __init__(self):
        self.adb = settings.adb_path
        self.serial = settings.device_serial
        self.screenshot_dir = Path(settings.screenshot_dir)
        self.screenshot_dir.mkdir(parents=True, exist_ok=True)
        self._resolution: tuple[int, int] | None = None

    def _adb_cmd(self, *args: str) -> list[str]:
        cmd = [self.adb]
        if self.serial:
            cmd.extend(["-s", self.serial])
        cmd.extend(args)
        return cmd
def check_device(self) -> dict:
"""Check if device is connected and return device info."""
result = subprocess.run(
self._adb_cmd("devices"),
capture_output=True, text=True, timeout=5
)
lines = result.stdout.strip().split("\n")[1:] # skip header
devices = []
for line in lines:
parts = line.strip().split("\t")
if len(parts) == 2 and parts[1] == "device":
devices.append(parts[0])
if not devices:
return {"connected": False, "error": "No device found"}
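# A configured serial must actually be attached; otherwise report it
if self.serial and self.serial not in devices:
return {"connected": False, "error": f"Device {self.serial} not found"}
serial = self.serial or devices[0]
if not self.serial:
self.serial = serial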
# Get device model
model_result = subprocess.run(
self._adb_cmd("shell", "getprop", "ro.product.model"),
capture_output=True, text=True, timeout=5
)
model = model_result.stdout.strip()
# Get screen resolution
w, h = self.get_resolution()
return {
"connected": True,
"serial": serial,
"model": model,
"resolution": f"{w}x{h}",
"all_devices": devices,
}
def get_resolution(self) -> tuple[int, int]:
"""Get device screen resolution."""
if self._resolution:
return self._resolution
result = subprocess.run(
self._adb_cmd("shell", "wm", "size"),
capture_output=True, text=True, timeout=5
)
# Output: "Physical size: 1080x2400"
size_str = result.stdout.strip().split(":")[-1].strip()
w, h = size_str.split("x")
self._resolution = (int(w), int(h))
return self._resolution
def screenshot(self, save: bool = True) -> Image.Image:
"""Take a screenshot and return as PIL Image.
Args:
save: Whether to save the screenshot to disk for debugging.
Returns:
PIL Image of the current screen.
"""
result = subprocess.run(
self._adb_cmd("exec-out", "screencap", "-p"),
capture_output=True, timeout=settings.screenshot_timeout
)
if result.returncode != 0:
raise RuntimeError(f"Screenshot failed: {result.stderr.decode()}")
img = Image.open(io.BytesIO(result.stdout))
if save:
ts = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
path = self.screenshot_dir / f"{ts}.png"
img.save(path)
return img
def screenshot_base64(self) -> str:
"""Take screenshot and return as base64-encoded PNG string."""
import base64
img = self.screenshot(save=True)
buffer = io.BytesIO()
img.save(buffer, format="PNG")
return base64.b64encode(buffer.getvalue()).decode("utf-8")
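
For reference, a minimal sketch of parsing `wm size` output, assuming the device may print an extra `Override size:` line after `Physical size:`. The helper name `parse_wm_size` is hypothetical and not part of this commit, and preferring the override value is an assumption about which resolution input coordinates follow.

```python
import re


def parse_wm_size(output: str) -> tuple[int, int]:
    """Parse `adb shell wm size` output into (width, height)."""
    found = {}
    # Collect every "<kind> size: WxH" line the command printed.
    for kind, w, h in re.findall(r"(Physical|Override) size:\s*(\d+)x(\d+)", output):
        found[kind] = (int(w), int(h))
    if not found:
        raise ValueError(f"Unrecognized wm size output: {output!r}")
    # Assumption: an override, when present, is the size currently in effect.
    return found.get("Override", found.get("Physical"))
```

Raising on unrecognized output makes a failed `adb` call visible instead of surfacing later as an unpacking error.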

3
src/executor/__init__.py Normal file

@@ -0,0 +1,3 @@
from .adb_executor import ADBExecutor
__all__ = ["ADBExecutor"]


@@ -0,0 +1,109 @@
"""L5 - Action Execution via ADB
Translates structured actions into ADB commands and executes them on device.
Coordinates are normalized (0-1), converted to device pixels at execution time.
"""
import subprocess
import time
from dataclasses import dataclass
from config import settings
@dataclass
class Action:
"""A single GUI action to execute."""
type: str # tap, swipe, type, long_press, back, home, scroll, wait
x: float = 0.0 # normalized x (0-1)
y: float = 0.0 # normalized y (0-1)
text: str = "" # for type action
x2: float = 0.0 # for swipe end
y2: float = 0.0 # for swipe end
duration: int = 300 # ms, for long_press and swipe
class ADBExecutor:
"""Execute actions on Android device via ADB."""
def __init__(self, capture):
self.capture = capture
self.adb = settings.adb_path
self.serial = settings.device_serial
def _adb_cmd(self, *args: str) -> list[str]:
cmd = [self.adb]
if self.serial:
cmd.extend(["-s", self.serial])
cmd.extend(args)
return cmd
def _run(self, *args: str):
cmd = self._adb_cmd(*args)
result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
if result.returncode != 0:
raise RuntimeError(f"ADB command failed: {' '.join(cmd)}\n{result.stderr}")
return result.stdout
def _to_pixels(self, x: float, y: float) -> tuple[int, int]:
"""Convert normalized (0-1) coordinates to device pixels."""
w, h = self.capture.get_resolution()
# Clamp to the last valid pixel so x=1.0 / y=1.0 stay on screen
return min(int(x * w), w - 1), min(int(y * h), h - 1)
def execute(self, action: Action) -> str:
"""Execute a single action and return a description of what was done."""
match action.type:
case "tap":
px, py = self._to_pixels(action.x, action.y)
self._run("shell", "input", "tap", str(px), str(py))
desc = f"tap ({px}, {py})"
case "long_press":
px, py = self._to_pixels(action.x, action.y)
self._run("shell", "input", "swipe",
str(px), str(py), str(px), str(py), str(action.duration))
desc = f"long_press ({px}, {py}) {action.duration}ms"
case "swipe":
px1, py1 = self._to_pixels(action.x, action.y)
px2, py2 = self._to_pixels(action.x2, action.y2)
self._run("shell", "input", "swipe",
str(px1), str(py1), str(px2), str(py2), str(action.duration))
desc = f"swipe ({px1},{py1}) → ({px2},{py2})"
case "type":
# Escape spaces and "&" for `input text`; other shell metacharacters and
# non-ASCII text (e.g. Chinese) are not handled by this path
escaped = action.text.replace(" ", "%s").replace("&", "\\&")
self._run("shell", "input", "text", escaped)
desc = f"type '{action.text}'"
case "back":
self._run("shell", "input", "keyevent", "KEYCODE_BACK")
desc = "back"
case "home":
self._run("shell", "input", "keyevent", "KEYCODE_HOME")
desc = "home"
case "scroll":
# Swipe around the screen center; a negative action.y means scroll up,
# anything else scrolls down
px, py = self._to_pixels(0.5, 0.5)
if action.y < 0: # scroll up
self._run("shell", "input", "swipe",
str(px), str(py - 300), str(px), str(py + 300), "300")
desc = "scroll up"
else: # scroll down
self._run("shell", "input", "swipe",
str(px), str(py + 300), str(px), str(py - 300), "300")
desc = "scroll down"
case "wait":
time.sleep(action.duration / 1000)
desc = f"wait {action.duration}ms"
case _:
raise ValueError(f"Unknown action type: {action.type}")
# Wait for UI to settle after action
time.sleep(settings.action_delay)
return desc
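
The normalized-to-pixel conversion used by the executor can be sketched on its own as follows. This is an illustrative stand-alone function (the name `to_pixels` and the clamping behavior are assumptions, not the code in this commit); it clamps out-of-range input so the result always lands on a valid pixel.

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized (0-1) coordinates to pixel indices inside the screen."""
    # Clamp first so slightly out-of-range model output still maps on-screen.
    x = min(max(x, 0.0), 1.0)
    y = min(max(y, 0.0), 1.0)
    # The last valid pixel index is width - 1 / height - 1.
    return min(int(x * width), width - 1), min(int(y * height), height - 1)
```

With the 1200x2640 screen noted in the project status, `to_pixels(0.5, 0.25, 1200, 2640)` maps to the pixel `(600, 660)`.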


@@ -0,0 +1,3 @@
from .ocr_grounding import OCRGrounding
__all__ = ["OCRGrounding"]


@@ -0,0 +1,354 @@
"""L3 - OCR-Based UI Element Grounding
Locates UI elements on screen by visible text using OCR on ADB screenshots.
Provides reliable text-to-coordinate mapping that works on Huawei/HarmonyOS
where uiautomator dump often returns empty XML for WeChat.
Strategy priority (auto mode):
1. easyocr (best Chinese recognition, deep learning based)
2. pytesseract (fallback, fast but fragments Chinese characters)
3. uiautomator XML dump (supplementary, often empty on Huawei WeChat)
TextBox coordinates are in pixels; center_normalized() and find_text_normalized()
return normalized (0.0-1.0) values for consistency with the coordinate system
in adb_executor.py.
"""
import subprocess
import re
import io
import logging
from dataclasses import dataclass
from pathlib import Path
from PIL import Image
from config import settings
logger = logging.getLogger(__name__)
@dataclass
class TextBox:
"""A detected text region on screen."""
text: str
x: int # left pixel
y: int # top pixel
w: int # width pixels
h: int # height pixels
confidence: float # 0.0-1.0
@property
def cx(self) -> int:
"""Center x in pixels."""
return self.x + self.w // 2
@property
def cy(self) -> int:
"""Center y in pixels."""
return self.y + self.h // 2
def center_normalized(self, screen_w: int, screen_h: int) -> tuple[float, float]:
"""Return center as normalized (0-1) coordinates."""
return self.cx / screen_w, self.cy / screen_h
def contains_text(self, query: str, fuzzy: bool = True) -> bool:
"""Check if this box's text matches the query.
Args:
query: Text to search for.
fuzzy: If True, does substring + case-insensitive match.
"""
if not query or not self.text:
return False
if fuzzy:
return query.lower() in self.text.lower() or self.text.lower() in query.lower()
return self.text == query
def match_score(self, query: str) -> float:
"""Compute a match quality score (higher = better).
Scoring:
- Exact match (case-insensitive): 1000 + confidence
- Text contains query as substring: 100 + confidence + length_ratio
- Query contains text as substring: 50 + confidence * length_ratio
- No match: 0
"""
if not query or not self.text:
return 0.0
q = query.lower()
t = self.text.lower().strip()
if t == q:
return 1000 + self.confidence
if q in t:
# Prefer shorter texts that contain the query (more precise)
length_ratio = len(q) / max(len(t), 1)
return 100 + self.confidence + length_ratio
if t in q:
# Text is a subset of query -- weaker match
length_ratio = len(t) / max(len(q), 1)
return 50 + self.confidence * length_ratio
return 0.0
class OCRGrounding:
"""OCR-based element grounding for Android screens.
Usage:
grounding = OCRGrounding()
# From ADB screenshot (PIL Image)
img = capture.screenshot()
result = grounding.find_text(img, "发送")
if result:
norm_x, norm_y = result.center_normalized(img.width, img.height)
# Use norm_x, norm_y with ADBExecutor
"""
def __init__(self, engine: str = "auto"):
"""
Args:
engine: OCR engine to use.
"pytesseract" / "easyocr" / "auto" (easyocr first, pytesseract fallback)
"""
self.engine = engine
self._easyocr_reader = None # lazy init (slow first load)
# ──────────────────────────────────────────────
# Public API
# ──────────────────────────────────────────────
def find_text(
self, img: Image.Image, query: str, fuzzy: bool = True
) -> TextBox | None:
"""Find a UI element by visible text and return its bounding box.
Args:
img: PIL Image (screenshot from ADB).
query: Text to search for (e.g. "发送", "微信", "Search").
fuzzy: Substring/case-insensitive match.
Returns:
Best matching TextBox, or None if not found.
"""
boxes = self.detect_all(img)
matches = [b for b in boxes if b.contains_text(query, fuzzy=fuzzy)]
if not matches:
logger.warning(f"Text '{query}' not found. Detected texts: "
f"{[b.text for b in boxes[:20]]}")
return None
# Return best match by match_score (prefers exact matches, then shorter
# texts that contain the query)
matches.sort(key=lambda b: b.match_score(query), reverse=True)
best = matches[0]
logger.info(f"Found '{query}' -> '{best.text}' at ({best.cx}, {best.cy}) "
f"conf={best.confidence:.2f} score={best.match_score(query):.1f}")
return best
def find_all_matches(
self, img: Image.Image, query: str, fuzzy: bool = True
) -> list[TextBox]:
"""Find ALL matching elements (e.g., multiple chat contacts named similar)."""
boxes = self.detect_all(img)
return [b for b in boxes if b.contains_text(query, fuzzy=fuzzy)]
def detect_all(self, img: Image.Image) -> list[TextBox]:
"""Run OCR on the full image and return all detected text boxes.
Tries engines in order based on self.engine setting.
"""
if self.engine == "pytesseract":
return self._detect_pytesseract(img)
elif self.engine == "easyocr":
return self._detect_easyocr(img)
else: # auto
# Prefer easyocr (much better Chinese recognition), fall back to pytesseract
try:
return self._detect_easyocr(img)
except Exception as e:
logger.info(f"easyocr failed ({e}), trying pytesseract")
try:
boxes = self._detect_pytesseract(img)
if boxes:
return boxes
except Exception as e:
logger.error(f"All OCR engines failed: {e}")
return []
def find_text_normalized(
self, img: Image.Image, query: str, fuzzy: bool = True
) -> tuple[float, float] | None:
"""Convenience: find text and return normalized (x, y) center directly.
Returns None if not found.
"""
box = self.find_text(img, query, fuzzy=fuzzy)
if box is None:
return None
return box.center_normalized(img.width, img.height)
# ──────────────────────────────────────────────
# pytesseract engine
# ──────────────────────────────────────────────
def _detect_pytesseract(self, img: Image.Image) -> list[TextBox]:
"""Detect text using pytesseract (calls tesseract binary).
Uses chi_sim+eng for Chinese + English mixed content (common in WeChat).
Falls back to eng-only if chi_sim data is not installed.
"""
import pytesseract
# Try Chinese+English first, fall back to English only
for lang in ["chi_sim+eng", "eng"]:
try:
data = pytesseract.image_to_data(
img,
lang=lang,
output_type=pytesseract.Output.DICT,
config="--psm 11" # Sparse text: find as much text as possible
)
break
except pytesseract.TesseractError:
continue
else:
raise RuntimeError("Tesseract failed with all language configs")
boxes = []
n = len(data["text"])
for i in range(n):
text = data["text"][i].strip()
# conf may arrive as an int, a float, or a numeric string
conf = int(float(data["conf"][i]))
if not text or conf < 20: # skip low-confidence noise
continue
boxes.append(TextBox(
text=text,
x=data["left"][i],
y=data["top"][i],
w=data["width"][i],
h=data["height"][i],
confidence=conf / 100.0,
))
return boxes
# ──────────────────────────────────────────────
# easyocr engine
# ──────────────────────────────────────────────
def _detect_easyocr(self, img: Image.Image) -> list[TextBox]:
"""Detect text using easyocr (better for Chinese, uses deep learning).
First call is slow (~10s) due to model loading. Subsequent calls are fast.
"""
import easyocr
import numpy as np
if self._easyocr_reader is None:
self._easyocr_reader = easyocr.Reader(
["ch_sim", "en"],
gpu=False, # CPU is fine for single screenshots
)
# Convert PIL to numpy array for easyocr
img_np = np.array(img.convert("RGB"))
results = self._easyocr_reader.readtext(img_np)
boxes = []
for (bbox, text, conf) in results:
if not text.strip():
continue
# bbox is [[x1,y1],[x2,y2],[x3,y3],[x4,y4]] (quadrilateral)
xs = [p[0] for p in bbox]
ys = [p[1] for p in bbox]
x = int(min(xs))
y = int(min(ys))
w = int(max(xs) - x)
h = int(max(ys) - y)
boxes.append(TextBox(
text=text.strip(),
x=x, y=y, w=w, h=h,
confidence=float(conf),
))
return boxes
# ──────────────────────────────────────────────
# uiautomator XML dump (supplementary, often empty on Huawei)
# ──────────────────────────────────────────────
def try_uiautomator_dump(self, serial: str | None = None) -> list[TextBox]:
"""Attempt to get UI elements from uiautomator dump.
NOTE: This often returns nearly empty XML on Huawei/HarmonyOS,
especially for WeChat. Use as a supplementary source, not primary.
Args:
serial: Device serial (None = use settings or first device).
Returns:
List of TextBox from accessibility tree, may be empty.
"""
adb = settings.adb_path
cmd = [adb]
if serial or settings.device_serial:
cmd.extend(["-s", serial or settings.device_serial])
# Dump to device, then pull
dump_cmd = cmd + ["shell", "uiautomator", "dump", "/sdcard/ui_dump.xml"]
pull_cmd = cmd + ["shell", "cat", "/sdcard/ui_dump.xml"]
try:
subprocess.run(dump_cmd, capture_output=True, timeout=10)
result = subprocess.run(pull_cmd, capture_output=True, text=True, timeout=5)
xml_content = result.stdout
except Exception as e:
logger.warning(f"uiautomator dump failed: {e}")
return []
return self._parse_uiautomator_xml(xml_content)
def _parse_uiautomator_xml(self, xml_str: str) -> list[TextBox]:
"""Parse uiautomator dump XML into TextBox list."""
boxes = []
# Pattern: text="..." bounds="[x1,y1][x2,y2]"
pattern = r'text="([^"]*)"[^>]*bounds="\[(\d+),(\d+)\]\[(\d+),(\d+)\]"'
for match in re.finditer(pattern, xml_str):
text = match.group(1).strip()
if not text:
continue
x1, y1, x2, y2 = (int(match.group(i)) for i in range(2, 6))
boxes.append(TextBox(
text=text,
x=x1, y=y1,
w=x2 - x1, h=y2 - y1,
confidence=1.0, # accessibility tree is authoritative
))
return boxes
# ──────────────────────────────────────────────
# Hybrid: combine OCR + uiautomator
# ──────────────────────────────────────────────
def find_text_hybrid(
self, img: Image.Image, query: str, fuzzy: bool = True
) -> TextBox | None:
"""Try uiautomator first (exact bounds), fall back to OCR.
Best strategy for Huawei: uiautomator might work for some apps,
OCR always works as fallback.
"""
# Try uiautomator first (precise but often empty on Huawei)
ua_boxes = self.try_uiautomator_dump()
ua_matches = [b for b in ua_boxes if b.contains_text(query, fuzzy=fuzzy)]
if ua_matches:
logger.info(f"Found '{query}' via uiautomator")
return max(ua_matches, key=lambda b: b.match_score(query))
# Fall back to OCR
logger.info(f"uiautomator found nothing for '{query}', using OCR")
return self.find_text(img, query, fuzzy=fuzzy)
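
As an alternative to the regular expression in `_parse_uiautomator_xml`, the dump can also be read with the standard library XML parser, which does not depend on attribute order and unescapes entities. A minimal sketch, assuming the dump uses `node` elements with `text` and `bounds` attributes; `parse_ui_dump` is a hypothetical name and it returns plain dicts rather than `TextBox` objects.

```python
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def parse_ui_dump(xml_str: str) -> list[dict]:
    """Return text nodes with pixel bounds from a uiautomator dump."""
    try:
        root = ET.fromstring(xml_str)
    except ET.ParseError:
        return []  # empty or truncated dump
    nodes = []
    for node in root.iter("node"):
        text = (node.get("text") or "").strip()
        m = BOUNDS_RE.fullmatch(node.get("bounds") or "")
        if not text or m is None:
            continue  # skip layout containers and malformed bounds
        x1, y1, x2, y2 = map(int, m.groups())
        nodes.append({"text": text, "x": x1, "y": y1, "w": x2 - x1, "h": y2 - y1})
    return nodes
```

An empty result here is the expected outcome on devices where the accessibility tree is not exposed, so callers would still fall back to OCR.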

122
src/main.py Normal file

@@ -0,0 +1,122 @@
"""Phone GUI Agent - Main Entry Point
Web console for controlling the agent loop.
"""
import asyncio
import json
from pathlib import Path
from fastapi import FastAPI, WebSocket, WebSocketDisconnect, Request
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates
from config import settings
from src.capture import ADBCapture
from src.planner.agent_loop import AgentLoop
app = FastAPI(title="Phone GUI Agent", version="0.1.0")
BASE_DIR = Path(__file__).parent.parent
app.mount("/static", StaticFiles(directory=BASE_DIR / "web" / "static"), name="static")
templates = Jinja2Templates(directory=BASE_DIR / "web" / "templates")
# Global state
capture = ADBCapture()
agent = AgentLoop()
@app.get("/", response_class=HTMLResponse)
async def index(request: Request):
return templates.TemplateResponse(request, "index.html")
@app.get("/api/device")
async def device_info():
"""Check device connection status."""
try:
info = capture.check_device()
return info
except Exception as e:
return {"connected": False, "error": str(e)}
@app.get("/api/screenshot")
async def take_screenshot():
"""Take a screenshot and return base64."""
try:
b64 = capture.screenshot_base64()
return {"ok": True, "image": b64}
except Exception as e:
return {"ok": False, "error": str(e)}
@app.post("/api/stop")
async def stop_task():
"""Stop the current running task."""
agent.stop()
return {"ok": True}
@app.websocket("/ws/task")
async def task_websocket(ws: WebSocket):
"""WebSocket endpoint for running tasks with real-time updates.
Client sends: {"task": "打开微信搜索张三"}
Server streams: StepResult objects as JSON
"""
await ws.accept()
try:
data = await ws.receive_json()
task = data.get("task", "")
if not task:
await ws.send_json({"error": "No task provided"})
return
await ws.send_json({"status": "started", "task": task})
def on_step(result):
asyncio.get_running_loop().call_soon_threadsafe(
asyncio.ensure_future,
ws.send_json({
"status": "step",
"step": result.step,
"observation": result.observation,
"thinking": result.thinking,
"action_type": result.action_type,
"action_desc": result.action_desc,
"screenshot": result.screenshot_before[:100] + "..." if result.screenshot_before else None,
"error": result.error,
})
)
session = await agent.run_task(task, on_step=on_step)
await ws.send_json({
"status": session.status,
"total_steps": len(session.steps),
"task": task,
})
except WebSocketDisconnect:
agent.stop()
except Exception as e:
try:
await ws.send_json({"error": str(e)})
except Exception:
pass
def main():
import uvicorn
uvicorn.run(
"src.main:app",
host=settings.host,
port=settings.port,
reload=True,
)
if __name__ == "__main__":
main()

3
src/planner/__init__.py Normal file

@@ -0,0 +1,3 @@
from .agent_loop import AgentLoop
__all__ = ["AgentLoop"]

200
src/planner/agent_loop.py Normal file

@@ -0,0 +1,200 @@
"""L4+L6+L7 - Agent Loop: Planning, Verification, Memory
The core agent loop that orchestrates the full pipeline:
Screenshot → VLM Analysis → Action Execution → Verification → Repeat
"""
import asyncio
import time
from dataclasses import dataclass, field
from datetime import datetime
from src.capture import ADBCapture
from src.vision import VLMClient
from src.executor.adb_executor import ADBExecutor, Action
from config import settings  # module-level: _execute_step also reads settings
@dataclass
class StepResult:
step: int
timestamp: str
observation: str
thinking: str
action_type: str
action_desc: str
screenshot_before: str # base64
screenshot_after: str | None = None
verified: bool = False
error: str | None = None
@dataclass
class TaskSession:
task: str
status: str = "running" # running / completed / failed / stopped
steps: list[StepResult] = field(default_factory=list)
started_at: str = ""
finished_at: str = ""
def history(self) -> list[dict]:
"""Return history for VLM context."""
return [
{
"observation": s.observation,
"action": {"type": s.action_type},
}
for s in self.steps
]
class AgentLoop:
"""Main agent loop orchestrating all pipeline layers."""
def __init__(self):
self.capture = ADBCapture()
self.vlm = VLMClient()
self.executor = ADBExecutor(self.capture)
self.current_session: TaskSession | None = None
self._stop_requested = False
def stop(self):
self._stop_requested = True
async def run_task(self, task: str, on_step=None) -> TaskSession:
"""Execute a task through the full agent loop.
Args:
task: Natural language task instruction.
on_step: Optional callback called after each step with StepResult.
Returns:
TaskSession with all steps and final status.
"""
from config import settings
session = TaskSession(
task=task,
started_at=datetime.now().isoformat(),
)
self.current_session = session
self._stop_requested = False
try:
for step_num in range(1, settings.max_steps + 1):
if self._stop_requested:
session.status = "stopped"
break
result = await self._execute_step(step_num, task, session)
session.steps.append(result)
if on_step:
on_step(result)
if result.action_type == "done":
session.status = "completed"
break
if result.error:
# Allow up to 3 consecutive errors before failing
recent_errors = sum(
1 for s in session.steps[-3:] if s.error
)
if recent_errors >= 3:
session.status = "failed"
break
else:
session.status = "failed" # max steps exceeded
except Exception as e:
session.status = "failed"
if session.steps:
session.steps[-1].error = str(e)
session.finished_at = datetime.now().isoformat()
self.current_session = None
return session
async def _execute_step(
self, step_num: int, task: str, session: TaskSession
) -> StepResult:
"""Execute a single step in the agent loop."""
timestamp = datetime.now().isoformat()
# L1: Capture screenshot
try:
screenshot_b64 = self.capture.screenshot_base64()
except Exception as e:
return StepResult(
step=step_num, timestamp=timestamp,
observation="", thinking="",
action_type="error", action_desc="",
screenshot_before="", error=f"Screenshot failed: {e}"
)
# L2+L3+L4: VLM analysis (understanding + grounding + planning)
try:
response = await self.vlm.analyze_screen(
screenshot_b64, task, session.history()
)
except Exception as e:
return StepResult(
step=step_num, timestamp=timestamp,
observation="", thinking="",
action_type="error", action_desc="",
screenshot_before=screenshot_b64,
error=f"VLM analysis failed: {e}"
)
observation = response.get("observation", "")
thinking = response.get("thinking", "")
action_data = response["action"]
action_type = action_data["type"]
# Task complete
if action_type == "done":
return StepResult(
step=step_num, timestamp=timestamp,
observation=observation, thinking=thinking,
action_type="done", action_desc="Task completed",
screenshot_before=screenshot_b64,
)
# L5: Execute action
action = Action(
type=action_type,
x=action_data.get("x", 0),
y=action_data.get("y", 0),
text=action_data.get("text", ""),
x2=action_data.get("x2", 0),
y2=action_data.get("y2", 0),
duration=action_data.get("duration", 300),
)
try:
action_desc = self.executor.execute(action)
except Exception as e:
return StepResult(
step=step_num, timestamp=timestamp,
observation=observation, thinking=thinking,
action_type=action_type, action_desc="",
screenshot_before=screenshot_b64,
error=f"Execution failed: {e}"
)
# L6: Verify by taking post-action screenshot
screenshot_after = None
if settings.verify_after_action:
try:
screenshot_after = self.capture.screenshot_base64()
except Exception:
pass # non-critical
return StepResult(
step=step_num, timestamp=timestamp,
observation=observation, thinking=thinking,
action_type=action_type, action_desc=action_desc,
screenshot_before=screenshot_b64,
screenshot_after=screenshot_after,
verified=screenshot_after is not None,
)

0
src/verifier/__init__.py Normal file

3
src/vision/__init__.py Normal file

@@ -0,0 +1,3 @@
from .vlm_client import VLMClient
__all__ = ["VLMClient"]

171
src/vision/vlm_client.py Normal file

@@ -0,0 +1,171 @@
"""L2+L3 - Vision Language Model Client
Sends screenshots to VLM for screen understanding and element grounding.
Supports multiple providers: Poe API (preferred), OpenRouter (backup), local.
"""
import base64
import httpx
from PIL import Image
import io
from config import settings
SYSTEM_PROMPT = """你是一个手机 GUI 操控助手。你会收到一张 Android 手机截图和一个用户任务指令。
你的职责:
1. 分析当前屏幕内容(识别所有 UI 元素、文本、图标、按钮)
2. 根据任务目标,决定下一步要执行的操作
3. 精确定位目标元素的屏幕坐标
输出格式(严格 JSON
{
"observation": "当前屏幕的简要描述",
"thinking": "下一步应该做什么,为什么",
"action": {
"type": "tap|swipe|type|long_press|back|home|scroll|wait|done",
"x": 0.5,
"y": 0.3,
"text": "",
"x2": 0.0,
"y2": 0.0,
"duration": 300
}
}
坐标说明:
- x, y 为归一化坐标,范围 0.0-1.0
- (0, 0) 是屏幕左上角,(1, 1) 是右下角
- 点击按钮时,坐标应指向按钮的中心位置
当任务完成时action.type 设为 "done"
"""
class VLMClient:
"""Multi-provider VLM client for screen understanding."""
def __init__(self):
self.provider = settings.vlm_provider
self.model = settings.vlm_model
async def analyze_screen(
self, screenshot_b64: str, task: str, history: list[dict] | None = None
) -> dict:
"""Send screenshot to VLM and get structured action response.
Args:
screenshot_b64: Base64-encoded PNG screenshot.
task: User's task instruction.
history: Previous observation/action pairs for context.
Returns:
Parsed dict with observation, thinking, and action.
"""
messages = self._build_messages(screenshot_b64, task, history)
match self.provider:
case "poe":
raw = await self._call_poe(messages)
case "openrouter":
raw = await self._call_openrouter(messages)
case "local":
raw = await self._call_local(messages)
case _:
raise ValueError(f"Unknown VLM provider: {self.provider}")
return self._parse_response(raw)
def _build_messages(
self, screenshot_b64: str, task: str, history: list[dict] | None
) -> list[dict]:
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
# Add history context
if history:
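recent = history[-5:]  # last 5 steps
offset = len(history) - len(recent)  # keep the original step numbers
history_text = "\n".join(
f"Step {offset + i + 1}: {h['observation']} -> {h['action']['type']}"
for i, h in enumerate(recent)
)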
messages.append({
"role": "user",
"content": f"历史操作记录:\n{history_text}"
})
# Current step: screenshot + task
messages.append({
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}
},
{
"type": "text",
"text": f"当前任务:{task}\n\n请分析截图并给出下一步操作。"
},
],
})
return messages
async def _call_poe(self, messages: list[dict]) -> str:
"""Call Poe API (preferred, cheapest)."""
async with httpx.AsyncClient(timeout=30) as client:
resp = await client.post(
"https://api.poe.com/v1/chat/completions",
headers={
"Authorization": f"Bearer {settings.poe_api_key}",
"Content-Type": "application/json",
},
json={"model": self.model, "messages": messages},
)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
async def _call_openrouter(self, messages: list[dict]) -> str:
"""Call OpenRouter API (backup)."""
async with httpx.AsyncClient(timeout=30) as client:
resp = await client.post(
"https://openrouter.ai/api/v1/chat/completions",
headers={
"Authorization": f"Bearer {settings.openrouter_api_key}",
"Content-Type": "application/json",
},
json={"model": self.model, "messages": messages},
)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
async def _call_local(self, messages: list[dict]) -> str:
"""Call local vLLM/Ollama server."""
async with httpx.AsyncClient(timeout=60) as client:
resp = await client.post(
"http://localhost:11434/v1/chat/completions",
json={"model": self.model, "messages": messages},
)
resp.raise_for_status()
return resp.json()["choices"][0]["message"]["content"]
def _parse_response(self, raw: str) -> dict:
"""Parse VLM response into structured action dict."""
import json
import re
# Extract JSON from response (handle markdown code blocks)
json_match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
if json_match:
raw = json_match.group(1)
# Try to find JSON object directly
json_match = re.search(r"\{.*\}", raw, re.DOTALL)
if not json_match:
raise ValueError(f"No JSON found in VLM response: {raw[:200]}")
parsed = json.loads(json_match.group())
# Validate required fields
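# Raise explicitly: assert statements are skipped under `python -O`
if "action" not in parsed:
raise ValueError("Missing 'action' field")
if "type" not in parsed["action"]:
raise ValueError("Missing action 'type'")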
return parsed
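
A greedy `\{.*\}` match can swallow stray braces in the surrounding chatter. One possible alternative, sketched here under the assumption that the reply contains exactly one well-formed action object, is to let `json.JSONDecoder.raw_decode` try each opening brace in turn. The function name `extract_action` is hypothetical.

```python
import json


def extract_action(raw: str) -> dict:
    """Return the first JSON object in `raw` that carries an action type."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None  # not valid JSON from this brace; try the next one
        if (
            isinstance(obj, dict)
            and isinstance(obj.get("action"), dict)
            and "type" in obj["action"]
        ):
            return obj
        start = raw.find("{", start + 1)
    raise ValueError(f"No action JSON found in reply: {raw[:200]!r}")
```

Because decoding starts at a brace and stops at the matching close, this works with or without a Markdown code fence around the object.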

192
web/templates/index.html Normal file

@@ -0,0 +1,192 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Phone GUI Agent</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif; background: #0a0a0a; color: #e0e0e0; height: 100vh; display: flex; flex-direction: column; }
header { padding: 12px 20px; background: #111; border-bottom: 1px solid #222; display: flex; align-items: center; gap: 12px; }
header h1 { font-size: 16px; font-weight: 600; }
.status-dot { width: 8px; height: 8px; border-radius: 50%; background: #555; }
.status-dot.connected { background: #22c55e; }
.status-dot.running { background: #f59e0b; animation: pulse 1s infinite; }
@keyframes pulse { 0%, 100% { opacity: 1; } 50% { opacity: 0.4; } }
#device-info { font-size: 12px; color: #888; margin-left: auto; }
.main { flex: 1; display: flex; overflow: hidden; }
.panel-left { width: 320px; border-right: 1px solid #222; display: flex; flex-direction: column; }
.panel-center { flex: 1; display: flex; align-items: center; justify-content: center; background: #050505; }
.panel-right { width: 380px; border-left: 1px solid #222; display: flex; flex-direction: column; }
.phone-frame { width: 270px; height: 585px; border: 2px solid #333; border-radius: 24px; overflow: hidden; background: #111; position: relative; }
.phone-frame img { width: 100%; height: 100%; object-fit: contain; }
.phone-frame .placeholder { display: flex; align-items: center; justify-content: center; height: 100%; color: #444; font-size: 14px; }
.task-input { padding: 16px; border-bottom: 1px solid #222; }
.task-input textarea { width: 100%; height: 80px; background: #1a1a1a; border: 1px solid #333; border-radius: 8px; color: #e0e0e0; padding: 10px; font-size: 14px; resize: none; }
.task-input textarea:focus { outline: none; border-color: #4a9eff; }
.btn-row { display: flex; gap: 8px; margin-top: 8px; }
.btn { padding: 8px 16px; border-radius: 6px; border: none; cursor: pointer; font-size: 13px; font-weight: 500; }
.btn-primary { background: #4a9eff; color: #fff; }
.btn-primary:hover { background: #3a8eef; }
.btn-danger { background: #ef4444; color: #fff; }
.btn-secondary { background: #333; color: #ccc; }
.steps-list { flex: 1; overflow-y: auto; padding: 12px; }
.step-card { background: #1a1a1a; border: 1px solid #222; border-radius: 8px; padding: 12px; margin-bottom: 8px; font-size: 13px; }
.step-card .step-header { display: flex; justify-content: space-between; margin-bottom: 6px; }
.step-num { color: #4a9eff; font-weight: 600; }
.step-action { color: #22c55e; font-family: monospace; }
.step-action.error { color: #ef4444; }
.step-obs { color: #999; margin-top: 4px; }
.step-think { color: #f59e0b; margin-top: 4px; font-style: italic; }
.log-panel { flex: 1; overflow-y: auto; padding: 12px; }
.log-panel h3 { font-size: 13px; color: #888; margin-bottom: 8px; text-transform: uppercase; letter-spacing: 1px; }
</style>
</head>
<body>
<header>
<div class="status-dot" id="statusDot"></div>
<h1>Phone GUI Agent</h1>
<span id="device-info">检测设备中...</span>
</header>
<div class="main">
<div class="panel-left">
<div class="task-input">
<textarea id="taskInput" placeholder="输入任务指令,例如:&#10;打开设置连接WiFi&#10;打开微信,搜索张三发消息"></textarea>
<div class="btn-row">
<button class="btn btn-primary" id="btnRun" onclick="runTask()">执行任务</button>
<button class="btn btn-danger" id="btnStop" onclick="stopTask()" style="display:none">停止</button>
<button class="btn btn-secondary" onclick="refreshScreenshot()">截屏</button>
</div>
</div>
<div class="steps-list" id="stepsList"></div>
</div>
<div class="panel-center">
<div class="phone-frame">
<img id="phoneScreen" style="display:none" />
<div class="placeholder" id="phonePlaceholder">连接设备后显示截图</div>
</div>
</div>
<div class="panel-right">
<div class="log-panel">
<h3>Agent 思考过程</h3>
<div id="thinkingLog"></div>
</div>
</div>
</div>
<script>
let ws = null;
async function checkDevice() {
try {
const resp = await fetch('/api/device');
const data = await resp.json();
const dot = document.getElementById('statusDot');
const info = document.getElementById('device-info');
if (data.connected) {
dot.className = 'status-dot connected';
info.textContent = `${data.model} (${data.resolution}) - ${data.serial}`;
refreshScreenshot();
} else {
dot.className = 'status-dot';
info.textContent = data.error || '未连接设备';
}
} catch (e) {
document.getElementById('device-info').textContent = '服务未启动';
}
}
async function refreshScreenshot() {
try {
const resp = await fetch('/api/screenshot');
const data = await resp.json();
if (data.ok) {
const img = document.getElementById('phoneScreen');
img.src = 'data:image/png;base64,' + data.image;
img.style.display = 'block';
document.getElementById('phonePlaceholder').style.display = 'none';
}
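} catch (e) {
// Log the failure; the previous screenshot stays on screen.
console.warn('Screenshot refresh failed:', e);
}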
}
function runTask() {
const task = document.getElementById('taskInput').value.trim();
if (!task) return;
document.getElementById('stepsList').innerHTML = '';
document.getElementById('thinkingLog').innerHTML = '';
document.getElementById('btnRun').style.display = 'none';
document.getElementById('btnStop').style.display = 'inline-block';
document.getElementById('statusDot').className = 'status-dot running';
const protocol = location.protocol === 'https:' ? 'wss:' : 'ws:';
ws = new WebSocket(`${protocol}//${location.host}/ws/task`);
ws.onopen = () => {
ws.send(JSON.stringify({ task }));
};
ws.onmessage = (e) => {
const data = JSON.parse(e.data);
if (data.status === 'step') {
addStep(data);
} else if (data.status === 'completed' || data.status === 'failed' || data.status === 'stopped') {
taskDone(data);
}
};
ws.onclose = () => taskDone({ status: 'disconnected' });
}
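// Escape text before interpolating it into innerHTML. Step fields come from
// the model and may contain markup characters such as < or &.
function escapeHtml(value) {
const div = document.createElement('div');
div.textContent = String(value ?? '');
return div.innerHTML;
}
function addStep(data) {
const list = document.getElementById('stepsList');
const card = document.createElement('div');
card.className = 'step-card';
const step = escapeHtml(data.step);
const action = escapeHtml(data.error || data.action_desc || data.action_type);
const thinking = escapeHtml(data.thinking);
card.innerHTML = `
<div class="step-header">
<span class="step-num">Step ${step}</span>
<span class="step-action ${data.error ? 'error' : ''}">${action}</span>
</div>
${data.observation ? `<div class="step-obs">${escapeHtml(data.observation)}</div>` : ''}
${data.thinking ? `<div class="step-think">${thinking}</div>` : ''}
`;
list.appendChild(card);
list.scrollTop = list.scrollHeight;
// Mirror the reasoning text into the right-hand log panel.
if (data.thinking) {
const log = document.getElementById('thinkingLog');
const p = document.createElement('div');
p.className = 'step-card';
p.innerHTML = `<span class="step-num">Step ${step}</span>: ${thinking}`;
log.appendChild(p);
log.scrollTop = log.scrollHeight;
}
// Show the device state after this step.
refreshScreenshot();
}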
function taskDone(data) {
document.getElementById('btnRun').style.display = 'inline-block';
document.getElementById('btnStop').style.display = 'none';
document.getElementById('statusDot').className = 'status-dot connected';
if (ws) {
// Detach the handler first so closing the socket does not re-enter taskDone.
ws.onclose = null;
ws.close();
ws = null;
}
}
async function stopTask() {
try {
await fetch('/api/stop', { method: 'POST' });
} catch (e) {
console.warn('Stop request failed:', e);
}
}
checkDevice();
setInterval(checkDevice, 10000);
</script>
</body>
</html>