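LibCyberAI · Feature Implementation Plan (rev 3)

Video Clip Keyframe Annotation: LibCyberAI Feature Implementation Plan

The user sends a short video clip. The Telegram webhook only performs cheap validation, records the inbound message, and enqueues a job carrying the durable file_id; a separate worker then downloads the clip, extracts frames with ffmpeg, and tiles them into one time-ordered "contact sheet" image. The model picks the frame and returns the annotations in a single vision call; the worker then re-extracts that frame at full resolution and reuses the existing annotation renderer to send the result back. The raw video is never persisted (it is extracted and discarded inside the worker's local temp directory).

Status: Proposed · Awaiting owner sign-off · Built on the screenshot vision annotation feature

2026-06-11 (rev 3)

Rev 3 change notes: fixes the key integration risks found in the implementation review. The webhook only enqueues durable Telegram file_id metadata and the worker downloads the clip itself (no tempfile paths are passed across containers); the worker explicitly persists inbound/outbound conversation messages; the structured vision call uses the direct-client response_format path; default sampling changes to N=9 / 2048px; badges are drawn outside the frame content and coordinates are relative to frame content, which removes the coordinate ambiguity.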

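Vision calls per clip (typical): 1

Adaptive cap (model- or guard-triggered): ≤3 rounds

Estimated cost per clip (by model and rounds): ~$0.0005–0.054

Reuse of the existing screenshot flow: ~65%

Typical latency (runs asynchronously; the user is not blocked): 5–12 s

Default frame count (2048px composite): N=9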

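1. Summary

What we are building. The user sends a short screen-recording clip (Phase 1 supports Telegram). The webhook first runs cheap Telegram metadata validation, records a lightweight inbound user message, enqueues a video_keyframe_analysis job carrying the file_id metadata, and acks immediately. A separate job worker then downloads the clip, extracts frames with ffmpeg, tiles them into one composite image (a contact sheet), and sends that single image to the vision model. In one call the model both understands the clip and answers the question: it returns which frame is most relevant and the arrow/box annotations to draw on it. The worker then re-extracts that single frame at full resolution, draws the annotations on it (reusing the existing annotation renderer), persists the assistant message and attachment, and pushes the annotated still back to the user. The raw video is never persisted (it is downloaded, extracted, and discarded inside the worker's temp directory).

The Codex insight (corrected). GPT vision models have no native video input, so the clip must be turned into images. Codex's trick, and ours, is to composite all sampled frames in time order into one large image and run vision on that single composite. Not one call per frame, and not scene detection: one image, one call. The model can reason across all frames at once.

Adaptive rounds + deterministic coverage guard. Default sampling is N=9: a 3×3 composite with a 2048px long edge still gives each frame about 680px, enough to read UI, while shrinking the first-round blind spot noticeably compared with the old N=6 plan. The model can still return need_another_round + focus_window to request a denser round. In addition, for longer clips or low-confidence results, one deterministic refinement around the selected timestamp runs even if the model did not ask for it. The cap is max_video_rounds (default 3). A typical clip costs 1 vision call; risky or ambiguous clips escalate predictably.

Why one call can annotate correctly (with explicit geometric constraints). Annotation coordinates are normalized to [0,1] (app/core/vision/analyzer.py:77-81; denormalized against the base image in app/core/annotation/renderer.py:382-394). Coordinates transfer exactly only when the model's box is relative to the frame-content rectangle, not to the whole cell or the badge. Each cell therefore records a content_rect, badges sit in a reserved strip outside the frame content, and the prompt explicitly requires coordinates relative to frame content only. The default same-aspect-ratio path needs no conversion; if padding appears, coordinates must be mapped through content_rect before rendering. grounding.snap_to_text (grounding.py:121-135) then uses OCR on the full-resolution image to correct drift on text targets.

How existing functionality is reused (~65% reuse). Frame extraction and the new contact-sheet analyzer are entirely new; everything from "draw on the selected frame" onward is verbatim reuse of the screenshot pipeline: app/utils/image.py::normalise / ::to_data_url → app/core/annotation/renderer.py::AnnotationRenderer.render (+ grounding.snap_to_text) → app/services/media_store.py::MediaStore.store → app/core/telegram/bot.py::send_photo → app/workers/cleanup_worker.py::_cleanup_expired_media. Net new: app/core/video/extractor.py (ffmpeg shell-out), app/core/video/analyzer.py (VideoAnalyzer: contact-sheet construction + its own schema/prompt + the adaptive-round loop), and the video_keyframe_analysis handler in app/workers/job_worker.py.

Off by default. Gated by a new per-vendor video_support_enabled flag (settings-dict driven, mirroring image_support_enabled at vendor_context.py:108); the flag needs no DB migration.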

2. How it works (Codex-style insight, implemented correctly)

                ┌──────────────────────── ASYNC JOB WORKER (off the sync chat path) ────────────────────────┐
 Telegram       │                                                                                            │
 video msg      │   worker         ffprobe        ffmpeg ×N         compose          VideoAnalyzer            │
   │            │  downloads      (duration/     (N=9 uniform      (tile N frames,   (ONE vision call on the  │
   │ _has_      │  by file_id      dims guard)    full-frame        badges outside   composite - "which frame  │
   │  video     │      │              │           cells)            content)         + draw what?")            │
   ▼            │      ▼               │                  │                        │                          │
 GATE ─enqueue► │  temp clip       frames[N] ──────► contact_sheet.png ──► to_data_url ──► { selected_cell,  │
 "⏳ analyzing"  │  too long                                                              annotations[],      │
   (sync ack)   │                                                                         need_another_round, │
                │                                                                         focus_window }      │
                │                                            need_another_round? ──yes──┐    │ no             │
                │                                            (rounds < max_video_rounds) │    │               │
                │          ┌──────────────────────────────────────────────────────────┘    │               │
                │          ▼  re-sample DENSER inside focus_window → new sheet → analyze ────┤ (loop ≤3)      │
                │                                                                            ▼               │
                │   ffmpeg -ss t_sel (settle-offset, accurate seek) ──► full-res chosen frame                │
                │                                                            │                               │
                │   image.normalise (1536px, EXIF strip)  ◄───────────────── │                               │
                │                          │                                                                 │
                │   AnnotationRenderer.render(frame, annotations[, OCR snap])   ◄── normalized [0,1] coords  │
                │   (content_rect maps cell coords → full-res frame; snap_to_text fixes drift)               │
                │                          │                                                                 │
                │   MediaStore.store(png) ──► MessageAttachment                                               │
                │                          │                                                                 │
                └────────────────────────── │ ───────────────────────────────────────────────────────────── ┘
                                            ▼
                                   bot.send_photo(annotated PNG)  ──►  user

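One vision call per round; a typical clip needs only one round. The composite is the analysis input: the model sees all N frames at once, picks the relevant frame, and returns the annotations to draw in the same response. Extra rounds happen only when the model asks for them or when the deterministic confidence/duration guard fires. ffmpeg/ffprobe run on local CPU and cost $0 in tokens.

Simplified flow

Video clip → ffmpeg frame extraction → tile into one large image (contact sheet) → AI analyzes once (picks the most relevant frame + annotations) → re-extract that frame at full resolution → draw arrow annotations → send back to the user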

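3. Architecture choice and why

Chosen approach: adaptive-contact-sheet (one call per round, model-driven rounds + deterministic guard). Uniformly sample N=9 frames → tile them as full-frame cells in time order into one 3×3 composite, with index + mm:ss badges outside the frame content → one structured vision call returns {selected_cell, annotations[], need_another_round, focus_window, confidence} → if the model needs more, or the confidence/duration guard judges the first sheet too coarse, resample more densely inside the focus window and repeat (≤3 rounds) → re-extract the selected frame at full resolution → map coordinates through the recorded content_rect and draw the annotations (OCR snap corrects text drift) → deliver.

Why this rather than the rev 1 two-call design. Rev 1 used a downsampled montage only to pick the frame, then spent a second full-resolution vision call on re-analysis and annotation. The code shows the second call can be dropped: coordinates are normalized [0,1] (analyzer.py:77-81 → renderer.py:382-394), so as long as the contact sheet records and respects content_rect, the annotations returned by that one call map onto the full-resolution re-extracted frame, and snap_to_text (grounding.py:121-135) sharpens text targets on the full-resolution image. Dropping the second call makes the design cheaper (1 call vs 2), simpler, and faithful to Codex's single-composite method, while full-resolution re-extraction + OCR snap still keep the returned screenshot sharp and the arrows accurate.

Why N=9 / 2048px. Each cell must be sharp enough for the model to (a) read UI text and (b) place arrows. With max_long_edge=2048, the 9 frames of a 3×3 grid still get about 680px of frame content each, while shrinking the temporal blind spot noticeably compared with the rev 2 default of N=6. N=6 is kept as an emergency cost/latency switch, not as the recommended default. The adaptive loop handles the remaining blind spot by resampling around the model-selected moment or the model-provided focus_window.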
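The cell-size figure can be checked with a little integer arithmetic. The sketch below is illustrative only: grid_geometry is a hypothetical helper (not part of the plan's API), it assumes a near-square grid, and it ignores the badge strips, which would shave a few pixels off each cell.

```python
import math


def grid_geometry(n_frames: int, frame_w: int, frame_h: int, max_long_edge: int = 2048):
    """Return (cols, rows, cell_w, cell_h) for a near-square contact-sheet grid.

    Hypothetical helper: badge strips are ignored, so real cells are slightly smaller.
    """
    # ceil(sqrt(n)) computed with integers to avoid floating-point edge cases.
    cols = math.isqrt(n_frames - 1) + 1
    rows = math.ceil(n_frames / cols)
    # Long edge of the untouched grid, in source pixels.
    long_edge = max(cols * frame_w, rows * frame_h)
    # Scale every cell by max_long_edge / long_edge, rounding down.
    cell_w = frame_w * max_long_edge // long_edge
    cell_h = frame_h * max_long_edge // long_edge
    return cols, rows, cell_w, cell_h


if __name__ == "__main__":
    # A 1920x1080 recording with the default N=9 / 2048px settings.
    print(grid_geometry(9, 1920, 1080))
    # The N=6 / 1536px fallback configuration.
    print(grid_geometry(6, 1920, 1080, max_long_edge=1536))
```

For a 1920×1080 source this gives 682×384 cells at N=9 / 2048px and 512×288 cells at N=6 / 1536px, which is where the "about 680px" figure comes from.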

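Optional · non-blocking: this fork has no billing. MODEL_PRICING only feeds the admin-dashboard cost estimate; adding gpt-5.x price rows merely makes the dashboard dollar figures more realistic. It is optional and unrelated to functionality or charging.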

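4. Detailed processing flow (exact commands)

Every expensive step after the inbound gate runs inside _handle_video_keyframe_analysis in job_worker.py, using the worker's own awaited AsyncSession (mirroring _handle_knowledge_reindex at job_worker.py:228). The Telegram webhook does not download to a tempfile and hand the path to the worker: ai-service and job-worker are separate processes/containers, so a path would only be safe on an explicitly configured shared volume. The job payload carries durable Telegram metadata (file_id, source type, caption/question, chat/user id, declared duration/size/mime, and the pre-created user message id). The worker calls getFile itself, downloads into a local TemporaryDirectory, and in a finally block unlinks the temp video plus all intermediate frames and composites.

4.1 Probe (authoritative duration/format guard)

The video.duration declared by Telegram is only a cheap pre-getFile filter; ffprobe is the source of truth: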

ffprobe -v error -select_streams v:0 \
  -show_entries format=duration:stream=codec_type,nb_frames,avg_frame_rate,width,height \
  -of json <clip>

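Reject (terminal failure → friendly message) if there is no video stream, duration > max_video_duration_sec (default 45), or width*height > MAX_IMAGE_PIXELS (50M, image.py:41). Record the native (w,h) and the frame aspect ratio (used to shape cells, §4.3).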
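A minimal sketch of the rejection rules, operating on the JSON text that the ffprobe command prints. ClipRejected, ProbeInfo, and check_probe are hypothetical names used for illustration; the plan's real probe() returns a VideoMeta and may differ in detail.

```python
import json
from dataclasses import dataclass

MAX_IMAGE_PIXELS = 50_000_000  # mirrors the 50M cap the plan cites from image.py:41


class ClipRejected(ValueError):
    """Terminal failure that should surface as a friendly user message."""


@dataclass
class ProbeInfo:  # hypothetical stand-in for the plan's VideoMeta
    duration_sec: float
    width: int
    height: int


def check_probe(ffprobe_json: str, max_duration_sec: float = 45.0) -> ProbeInfo:
    """Apply the three rejection rules to ffprobe's `-of json` output."""
    data = json.loads(ffprobe_json)
    video = [s for s in data.get("streams", []) if s.get("codec_type") == "video"]
    if not video:
        raise ClipRejected("no video stream")
    try:
        # ffprobe reports format.duration as a decimal string.
        duration = float(data["format"]["duration"])
    except (KeyError, TypeError, ValueError):
        raise ClipRejected("unreadable duration") from None
    if duration > max_duration_sec:
        raise ClipRejected("clip too long")
    width, height = int(video[0]["width"]), int(video[0]["height"])
    if width * height > MAX_IMAGE_PIXELS:
        raise ClipRejected("frame too large")
    return ProbeInfo(duration, width, height)
```

Keeping the checks in one pure function makes them easy to unit-test against canned ffprobe output without running ffprobe itself.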

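4.2 Uniform sampling → N frames

Cell-center timestamps are t_k = duration*(k+0.5)/N for k ∈ 0..N-1 (default N=9; this avoids black first/last frames). Each timestamp uses one fast input seek (cheap; keyframe alignment is good enough for the composite):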

ffmpeg -hide_banner -loglevel error -ss <t_k> -i <clip> \
  -frames:v 1 -vf "scale=<cell_w>:-2" -f image2pipe -vcodec png pipe:1

Choose cell_w so that the composite's long edge is ≈ 2048 (about 680px frame-content cells in a 3×3 grid; see §11).
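The cell-center formula can be sketched as a small function. The window parameter mirrors the window=(start, end) argument in the sample_frames signature; applying the same formula inside the window for dense rounds is an assumption, since the plan does not spell it out.

```python
from typing import List, Optional, Tuple


def cell_center_timestamps(
    duration: float, n: int = 9, window: Optional[Tuple[float, float]] = None
) -> List[float]:
    """Return n timestamps at the centres of n equal slices.

    Without a window the slices cover [0, duration]; with one they cover
    [start, end] (assumed behaviour for denser refinement rounds).
    """
    start, end = window if window is not None else (0.0, duration)
    span = end - start
    return [start + span * (k + 0.5) / n for k in range(n)]


if __name__ == "__main__":
    # A 45 s clip sampled at the default N=9.
    print(cell_center_timestamps(45.0))
    # A denser round inside a 10 s focus window.
    print(cell_center_timestamps(45.0, window=(20.0, 30.0)))
```

For a 45 s clip the first and last samples land at 2.5 s and 42.5 s, so neither the opening nor the closing frame is ever sampled directly.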

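4.3 Compose the contact sheet (full-frame cells, frame aspect ratio, explicit content geometry)

Decode each frame with Pillow and tile the frames in time order into a 3×3 grid. Each cell shows the whole frame at its native aspect ratio, with no cropping or stretching. The badge strip is reserved outside the frame content and never covers pixels the model may annotate. Record index → {timestamp, content_rect}, where content_rect is the exact frame-content rectangle in composite pixel coordinates. The prompt explicitly requires the model's annotation boxes to be normalized to the numbered frame's content rectangle. If padding is introduced, the render path must first map from cell-content coordinates to full-frame coordinates using content_rect; the default same-aspect-ratio path remains a no-op. Call normalise(sheet_png, max_long_edge=2048) on the composite for EXIF stripping + decompression-bomb guard + canonical encoding.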
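One way to keep badges outside the frame content is to reserve a strip above each frame and record the rectangle that remains. The sketch below only computes the geometry; layout_content_rects, the (x0, y0, x1, y1) rectangle convention, and the 28px badge height are illustrative assumptions, not values from the plan.

```python
from typing import Dict, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in composite pixels (assumed convention)


def layout_content_rects(
    n: int, cols: int, cell_w: int, cell_h: int, badge_h: int = 28
) -> Dict[int, Rect]:
    """Map each 1-based cell index to its frame-content rectangle.

    Every grid row is (badge_h + cell_h) tall: the badge strip sits above the
    frame, so no badge pixel ever overlaps frame content.
    """
    rects: Dict[int, Rect] = {}
    for i in range(n):
        row, col = divmod(i, cols)  # time order: left to right, top to bottom
        x0 = col * cell_w
        y0 = row * (cell_h + badge_h) + badge_h
        rects[i + 1] = (x0, y0, x0 + cell_w, y0 + cell_h)
    return rects


if __name__ == "__main__":
    for index, rect in layout_content_rects(9, 3, 682, 384).items():
        print(index, rect)
```

Because each content rectangle has exactly the frame's aspect ratio, a box normalized to it transfers to the full-resolution frame without any conversion, which is the no-op path described above.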

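4.4 Vision call (#1, typically the only one)

to_data_url(sheet_png, "image/png") → VideoAnalyzer.analyze_sheet(sheet_url, question, model=…). app/core/video/analyzer.py uses a separate strict schema (not the shipped _RESPONSE_SCHEMA, which has the single-screenshot shape) and a video-aware system prompt ("You are looking at N numbered frames in chronological order from the same screen recording; pick the single frame that best answers the question; return its 1-based selected_cell; annotation boxes are normalized [0,1] within that frame's content rectangle, excluding the badge strip; if no frame clearly shows the moment, set need_another_round + focus_window"). Annotation objects reuse the existing _ANNOTATION_SCHEMA shape verbatim (analyzer.py:71-87). Transport uses the same direct-client pattern as VisionAnalyzer._resolve_client and calls client.chat.completions.create(..., response_format=schema_format) directly, because OpenAIService.chat_completion does not currently accept response_format. Metering is done by a worker helper that mirrors ChatEngine._meter_vision_usage (engine.py:1361).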

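4.5 Adaptive-round decision

If need_another_round == true, round < max_video_rounds (default 3), and focus_window is a valid sub-range: resample N frames inside focus_window, rebuild the composite (§4.3), and repeat §4.4. If the model did not request a second round but duration / N > max_round1_gap_sec (default 4s) or selected_confidence < min_video_confidence (default 0.70), run one deterministic refinement window around the selected timestamp, still bounded by max_video_rounds. Accumulate usage across rounds. Guards: clamp selected_cell to [1,N]; on null / out-of-range / parse failure → middle cell.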
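The decision rules can be sketched as a single pure function that returns the next window to sample, or None to stop. next_window and its guard_used flag are hypothetical, and the width of the deterministic window (one sampling gap on either side of t_sel) is an assumption; the plan only says the window is "around the selected timestamp".

```python
from typing import Optional, Tuple

Window = Tuple[float, float]


def next_window(
    *,
    round_no: int,
    max_rounds: int,
    duration: float,
    n: int,
    need_another_round: bool,
    focus_window: Optional[Window],
    confidence: float,
    t_sel: float,
    guard_used: bool,
    max_round1_gap_sec: float = 4.0,
    min_confidence: float = 0.70,
) -> Optional[Window]:
    """Return the window to resample next, or None to stop the loop."""
    if round_no >= max_rounds:
        return None  # hard cap on vision calls per clip
    if need_another_round:
        # Model-requested round: honour it only for a valid sub-range of the clip.
        if focus_window is not None:
            start, end = focus_window
            if 0.0 <= start < end <= duration:
                return (start, end)
        return None  # invalid or out-of-clip window: ignore it and stop
    if guard_used:
        return None  # guard-triggered refinement runs at most once per clip
    if duration / n > max_round1_gap_sec or confidence < min_confidence:
        half = duration / n  # assumption: one sampling gap either side of t_sel
        return (max(0.0, t_sel - half), min(duration, t_sel + half))
    return None
```

With the defaults, a 45 s clip has a 5 s gap between samples, so the guard fires even when the model is confident; an 18 s clip (2 s gap) with confidence 0.9 stops after one round.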

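4.6 Re-extract the selected moment at native resolution (settle-offset + accurate seek)

Only this single frame uses a slow, accurate output seek, offset by about 300ms to avoid landing mid-animation: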

ffmpeg -hide_banner -loglevel error -i <clip> -ss <t_sel + settle_offset> \
  -frames:v 1 -f image2pipe -vcodec png pipe:1
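The only difference between the two seek modes is where -ss sits relative to -i: before the input it is a fast input seek, after it ffmpeg decodes from the start and discards frames until the target, which is slower but frame-accurate. A minimal sketch of both argument lists follows; the function names and the ffmpeg_bin parameter are hypothetical.

```python
from typing import List


def fast_seek_argv(ffmpeg_bin: str, clip: str, t: float, cell_w: int) -> List[str]:
    """-ss BEFORE -i: cheap input seek, used for the contact-sheet cells."""
    return [
        ffmpeg_bin, "-hide_banner", "-loglevel", "error",
        "-ss", f"{t:.3f}", "-i", clip,
        "-frames:v", "1", "-vf", f"scale={cell_w}:-2",
        "-f", "image2pipe", "-vcodec", "png", "pipe:1",
    ]


def accurate_seek_argv(
    ffmpeg_bin: str, clip: str, t_sel: float, settle_offset: float = 0.30
) -> List[str]:
    """-ss AFTER -i: slow, accurate output seek for the single delivered frame."""
    return [
        ffmpeg_bin, "-hide_banner", "-loglevel", "error",
        "-i", clip, "-ss", f"{t_sel + settle_offset:.3f}",
        "-frames:v", "1",
        "-f", "image2pipe", "-vcodec", "png", "pipe:1",
    ]


if __name__ == "__main__":
    print(" ".join(accurate_seek_argv("ffmpeg", "clip.mp4", 22.5)))
```

Building the command as a list keeps it compatible with create_subprocess_exec and avoids any shell quoting of the clip path.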

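4.7 Normalize the selected frame (reuse)

normalise(frame_bytes, max_bytes=max_video_mb*1024*1024) (image.py:129). The input is a single still, so the animation rejection (image.py:189) never fires; the call does EXIF stripping + decompression-bomb guard + downsampling to vision_max_long_edge_px (1536, config.py:93) + canonical re-encoding.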

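4.8 Annotate (reuse; no extra vision call)

If the model returned annotations for the selected frame: AnnotationRenderer.render(normalised_bytes, annotations, ocr=True, caption=…) (renderer.py:65). The normalized boxes returned by §4.4 are first normalized to the selected frame's content rectangle (a no-op on the default no-padding path; mapped explicitly if padding is introduced); _denormalise then multiplies by the full-resolution dimensions (renderer.py:382-394); for text targets, grounding.snap_to_text relocates the label by running OCR on the full-resolution frame (grounding.py:121-135), correcting any drift. Halo/icon targets (target_text is null) get no OCR corroboration: accuracy comes from the model's raw coordinates + the content_rect mapping.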
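The two coordinate steps can be sketched in isolation. Both functions are hypothetical stand-ins (the real denormalization lives in renderer.py), and the (x0, y0, x1, y1) box layout is an assumption, since the plan does not show the box format.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def sheet_box_to_content(box_px: Box, content_rect: Box) -> Box:
    """Convert a box in composite pixels to [0,1] coords relative to frame content.

    Only needed on a padded path; values are clamped so a box that strays into
    the badge strip is pulled back inside the frame content.
    """
    cx0, cy0, cx1, cy1 = content_rect
    width, height = cx1 - cx0, cy1 - cy0

    def clamp(value: float) -> float:
        return min(1.0, max(0.0, value))

    x0, y0, x1, y1 = box_px
    return (
        clamp((x0 - cx0) / width), clamp((y0 - cy0) / height),
        clamp((x1 - cx0) / width), clamp((y1 - cy0) / height),
    )


def to_frame_pixels(box: Box, frame_w: int, frame_h: int) -> Tuple[int, int, int, int]:
    """Scale a content-relative [0,1] box onto the full-resolution frame."""
    x0, y0, x1, y1 = box
    return (round(x0 * frame_w), round(y0 * frame_h),
            round(x1 * frame_w), round(y1 * frame_h))


if __name__ == "__main__":
    norm = sheet_box_to_content((1023, 632, 1364, 824), (682, 440, 1364, 824))
    print(norm, to_frame_pixels(norm, 1920, 1080))
```

In the example, the lower-right quadrant of a 682×384 cell maps to the lower-right quadrant of the 1920×1080 frame, independent of where the cell sits on the sheet.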

4.9 Persist + deliver (the worker explicitly writes the chat record)

The webhook pre-creates or resolves the Conversation and the inbound ConversationMessage, using the same external id shape as Telegram chat (tg:<telegram_user_id> / tg:<chat_id>), and puts the user message id in the job payload. The worker creates the assistant ConversationMessage, stores MediaStore.store(vendor.id, annotated_png, "image/png") (media_store.py:143, sha256 dedup, 30d expiry), creates the outbound MessageAttachment(purpose="annotated_video_frame"), sets assistant_message.annotated_attachment_url, flushes/commits, then reads the bytes via MediaStore.bytes_for and calls TelegramBot.send_photo (bot.py:836). If a failure occurs after the inbound message exists, the worker must persist a text fallback assistant message before sending the text, so the admin-dashboard history does not lose the turn.

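5. Frame selection algorithm (exact)

Model-driven "select + adaptive refinement" plus a deterministic safety check:

Round 1 (one vision call). N=9 full-frame cells cover the whole clip → one composite → the model returns selected_cell ∈ [1,N] + annotations[] + need_another_round + focus_window + confidence. The model picks the moment whose frame best answers the user's question.
Refinement rounds (model- or guard-triggered, ≤ max_video_rounds). If need_another_round, resample N frames more densely inside focus_window, rebuild the composite, and ask again. If the model did not ask but the clip's time granularity is too coarse (duration/N > max_round1_gap_sec) or the selection has low confidence, run one deterministic window around t_sel. This avoids letting the model decide whether it needs another look at a moment it never saw.
Land on a clean frame (CPU only). Re-extract t_sel + settle_offset at native resolution via an accurate output seek, so the annotated frame is the "settled" state a human would screenshot, not a mid-animation frame.

Guards: clamp selected_cell; null / out-of-range / parse failure → middle cell; focus_window outside the clip → ignore it and stop the loop; guard-triggered refinement runs at most once per clip. Accepted trade-off (explicit): even with N=9, round 1 can still miss events shorter than duration/N; the deterministic refinement guard narrows the blind spot in the highest-risk cases without introducing pixel-diff/scene-detection complexity.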

6. Reuse map (real paths)

Reused API	Path	Role in the video path
`image.normalise(data, *, max_long_edge, max_bytes)`	app/utils/image.py:129	Hardens + encodes the contact sheet (`max_long_edge=1536/2048`) and the selected full-resolution frame (a single still passes the animation rejection at `:189`).
`image.to_data_url(data, content_type)`	app/utils/image.py:208	Data URL for the contact sheet (and each round's composite); no public URL is exposed.
`_ANNOTATION_SCHEMA` (annotation object shape)	app/core/vision/analyzer.py:71-87	Reused verbatim inside the new `VideoAnalyzer` schema (`kind` arrow/box/halo/step, normalized `box`, `target_text`, `label`, `step_number`, `confidence`).
`AnnotationRenderer.render(...)` + `_denormalise`	app/core/annotation/renderer.py:65,382-394	Draws arrows/boxes/halos on the selected full-resolution frame; consumes normalized `[0,1]` boxes directly.
`grounding.snap_to_text(image, region, target_text)`	app/core/annotation/grounding.py:68,121-135	OCR-snaps text targets on the full-resolution frame → corrects cell→frame drift (region first, full-image fallback).
`VisionAnalyzer._resolve_client` direct-client pattern	app/core/vision/analyzer.py:215-230	Resolves the BYOK/custom-endpoint client and calls `chat.completions.create(..., response_format=...)` directly. `OpenAIService.chat_completion` accepts list content but does not expose `response_format`, so it is not reused for the structured video call.
`MediaStore.store / bytes_for / signed_url / verify_signed`	app/services/media_store.py:143,315,358	Persists + addresses the annotated PNG (sha256 dedup, `expires_at`).
`ChatEngine._meter_vision_usage` pattern	app/core/chat/engine.py:1361	A worker helper mirrors this metering path; it meters each round's `usage` (`{prompt,completion}`) and sums across rounds (§13).
`UsageTracker.track_image / check_image_quota`	app/services/usage_tracker.py:201,327	Quota mechanism (1 video = 1 unit, §13).
`estimate_cost / MODEL_PRICING`	app/services/usage_tracker.py:50,22	Cost estimate for the admin dashboard only (no billing); optionally add gpt-5.x rows.
`TelegramBot._download_telegram_file(...)`	app/core/telegram/bot.py:409	Reused by the worker after resolving `file_id → file_path`; SSRF-pinned download with a larger (≤20MB) cap.
`TelegramBot.send_photo / send_message`	app/core/telegram/bot.py:836,761	The worker sends the annotated frame or a text fallback directly after persisting the assistant record; `_deliver_user_response` remains the ordinary `ChatEngine` path.
Job worker framework + `JOB_TYPE_HANDLERS`	app/workers/job_worker.py:33,219,228	Hosts the new handler (string dispatch, retry/backoff).
Enqueue template	app/core/escalation/timer.py:103-126	Adds and flushes a `Job` row of the new type.
`cleanup_worker._cleanup_expired_media`	app/workers/cleanup_worker.py:267	Reclaims stored frames via `expires_at`, unchanged.
Vendor flag pattern	app/core/vendor_context.py:108,156	Cloned for `video_support_enabled` / `max_video_mb` / `max_video_duration_sec` / `max_video_rounds`.

NOT reused verbatim: VisionAnalyzer.analyze with its _SYSTEM_INSTRUCTION and _RESPONSE_SCHEMA (analyzer.py:89-137) is hard-coded for "ONE screenshot", with a single top-level description / should_annotate. VideoAnalyzer reuses the annotation object schema and the transport but needs its own top-level schema (selected_cell, need_another_round, focus_window, annotations) and a video-aware prompt. The shipped screenshot path stays provably untouched.

7. New modules and code structure

app/core/video/__init__.py                    # NEW package marker
app/core/video/extractor.py                   # NEW - ffmpeg/ffprobe shell-out + contact-sheet compositor
app/core/video/analyzer.py                    # NEW - VideoAnalyzer: own schema + video prompt + adaptive-round loop
app/workers/job_worker.py                     # EDIT - new _handle_video_keyframe_analysis + dict entry
app/core/telegram/bot.py                      # EDIT - _has_inbound_video gate → persist user turn + ENQUEUE file_id metadata
app/core/vendor_context.py                    # EDIT - video_support_enabled / max_video_mb / max_video_duration_sec / max_video_rounds
app/config.py                                 # EDIT - *_default platform settings (+ N=9, sheet cap, max_video_rounds, settle_offset, confidence guard)
app/services/usage_tracker.py                 # EDIT (optional) - add gpt-5.x to MODEL_PRICING (analytics estimate only, no billing)
alembic/versions/016_*.py                     # NEW - optional audit columns only (see §8)
tests/unit/test_video_extractor.py            # NEW
tests/unit/test_video_analyzer.py             # NEW
tests/unit/test_telegram_video_ingest.py      # NEW (+ update test_telegram_screenshot_vision.py:191-193)
tests/integration/test_video_keyframe_job.py  # NEW (contract/e2e mirroring screenshot tests)

`extractor.py` public API

from dataclasses import dataclass

@dataclass
class VideoMeta: duration_sec: float; width: int; height: int; codec: str; nb_frames: int | None; aspect: float

# Signatures only; bodies are elided with `...`.
async def probe(path: str) -> VideoMeta: ...                                       # ffprobe -of json
async def sample_frames(path, *, n, window=None, cell_w) -> list[tuple[int, float, bytes]]: ...  # (idx, t_k, png); window=(start,end) for dense rounds
def build_contact_sheet(frames, *, max_long_edge=2048) -> tuple[bytes, dict[int, dict]]: ...     # (png, idx→{ts, content_rect})
async def extract_frame_at(path, t, *, settle_offset=0.30) -> bytes: ...           # accurate output seek

Uses asyncio.create_subprocess_exec with an argument list (never a shell string), fixed binary paths from config, and a wall-clock timeout that kills hung or malicious clips.
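A minimal sketch of that subprocess discipline, using only the standard library. run_bounded is a hypothetical helper name; the real extractor may structure this differently, for example by streaming output instead of buffering it.

```python
import asyncio
from typing import Sequence


async def run_bounded(argv: Sequence[str], *, timeout_sec: float) -> bytes:
    """Run a command from an argument list (no shell) under a wall-clock timeout.

    Returns stdout on success, kills the child on timeout, and raises on a
    non-zero exit code.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout_sec)
    except asyncio.TimeoutError:
        proc.kill()        # a hung or malicious clip must not pin the worker
        await proc.wait()  # reap the child so it does not linger as a zombie
        raise
    if proc.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} exited with {proc.returncode}: {stderr.decode(errors='replace')[:200]}"
        )
    return stdout


if __name__ == "__main__":
    import sys

    # Stand-in command so the sketch runs without ffmpeg installed.
    print(asyncio.run(run_bounded([sys.executable, "-c", "print('ok')"], timeout_sec=10)))
```

Passing the clip path as its own list element means a filename containing spaces or shell metacharacters is never interpreted by a shell.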

`analyzer.py` public API

from dataclasses import dataclass

@dataclass
class SheetResult:
    selected_cell: int
    annotations: list["Annotation"]    # reuses the existing Annotation dataclass shape
    need_another_round: bool
    focus_window: tuple[float, float] | None
    confidence: float
    description: str
    usage: dict | None                 # {prompt, completion}, for metering

class VideoAnalyzer:
    async def analyze_sheet(self, sheet_url, question, *, model, openai_service, vendor) -> SheetResult: ...
    # analyze_clip drives the adaptive loop; returns (t_sel, final result, per-round usages)
    async def analyze_clip(self, path, question, *, model, openai_service, vendor,
                           n, max_rounds) -> tuple[float, SheetResult, list[dict]]: ...

`job_worker.py` edits

JOB_TYPE_HANDLERS = { ..., "video_keyframe_analysis": "_handle_video_keyframe_analysis" }

async def _handle_video_keyframe_analysis(self, session, job) -> None:
    # payload: file_id/source/chat/user/message ids + declared Telegram metadata
    # getFile/download in worker → probe → analyze_clip(adaptive rounds) → re-extract t_sel
    # → normalise → render → persist assistant message + attachment → commit → send_photo/send_message
    # finally: unlink worker-local temp video + frames + sheets

8. Data model and migration changes

Phase 1 needs no schema migration. Extract-and-discard means only the annotated PNG is stored; it becomes an ordinary MediaObject via the trusted MediaStore.store path, reusing migration 012 (media_objects + message_attachments). The latest applied migration is 015; the next free one is 016.

The Telegram gate records the inbound user turn before enqueueing: Conversation + ConversationMessage(role=USER, content=caption or "(video clip)"), and writes the message id into the job payload. This keeps the admin-dashboard history consistent even though the expensive video path does not go through ChatEngine.process_message.
After analysis, the worker records the assistant turn: ConversationMessage(role=ASSISTANT, content=...) + the outbound MessageAttachment for the annotated frame.
MessageAttachment.purpose (String(30)) gains the free-form value annotated_video_frame (outbound); no DDL.
MessageAttachment.attachment_metadata (JSON) carries provenance for auditing: {source:"video", selected_cell, selected_ts_sec, rounds_used, focus_windows:[…], settle_offset_sec, per_round_token_usage:[…], video_duration_sec, vision_model, source_file_id_hash}; no DDL.
jobs.job_type gains "video_keyframe_analysis"; no DDL (free-form string dispatch via JOB_TYPE_HANDLERS).
The new per-vendor flags (video_support_enabled, max_video_mb, max_video_duration_sec, max_video_rounds, monthly_video_quota) ride on the existing vendor settings JSON (vendor_context.py:158 pattern) → no migration.

Optional 016_add_video_frame_audit_columns.py (only if the owner wants first-class per-clip auditability): nullable message_attachments.source_kind ('image'|'video') + source_ts_sec (Float). Purely additive. Deferred; not needed to ship.

Explicitly deferred (raw clip persistence / timestamp addressing): a migration adding media_objects.parent_video_id / frame_ts_ms / source_kind + an ALLOWED_VIDEO_MIME set + a non-normalise storage path. Not needed as long as extract-and-discard holds.

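9. API and tool definition changes

The user's video arrives as an inbound attachment, so analysis is Python-driven on the downloaded clip (like screenshots), not a model-selected tool. We mirror the screenshot Step-11b pattern, not find_reference_image.

9.1 New `VideoAnalyzer` schema: do not modify the shipped `_RESPONSE_SCHEMA`.

The contact-sheet call uses a separate strict schema in app/core/video/analyzer.py: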

{"type":"json_schema","json_schema":{"name":"video_sheet","strict":true,"schema":{
  "type":"object","additionalProperties":false,
  "required":["selected_cell","description","annotations","need_another_round","focus_window","confidence"],
  "properties":{
    "selected_cell":{"type":"integer"},                       // 1-based cell index
    "description":{"type":"string"},
    "annotations":{"type":"array","items": _ANNOTATION_SCHEMA },// REUSED verbatim from analyzer.py:71-87
    "need_another_round":{"type":"boolean"},
    "focus_window":{"type":["array","null"],"items":{"type":"number"}}, // [start_sec, end_sec] or null
    "confidence":{"type":"number"}                                      // selected-frame confidence 0..1
}}}}

This keeps the screenshot strict-schema path (analyzer.py:89) provably untouched.
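Even with a strict schema, the worker still has to defend against null, out-of-range, or unparseable values. The sketch below shows one way to apply the fallbacks described in §4.5 and §5; parse_sheet_response is a hypothetical helper, it returns a plain dict rather than the plan's SheetResult, and it sends out-of-range cells to the middle cell (the plan mentions both clamping and the middle-cell fallback, so this choice is an assumption).

```python
import json
from typing import Any, Dict


def parse_sheet_response(raw: Any, *, n: int, duration: float) -> Dict[str, Any]:
    """Parse the model's JSON reply and apply the defensive fallbacks."""
    middle = (n + 1) // 2
    fallback = {
        "selected_cell": middle, "annotations": [], "description": "",
        "need_another_round": False, "focus_window": None, "confidence": 0.0,
    }
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return fallback  # unparseable reply: middle cell, no annotations
    if not isinstance(data, dict):
        return fallback

    cell = data.get("selected_cell")
    if isinstance(cell, bool) or not isinstance(cell, int) or not 1 <= cell <= n:
        cell = middle  # null or out of range

    window = data.get("focus_window")
    window_ok = (
        isinstance(window, list) and len(window) == 2
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in window)
        and 0.0 <= window[0] < window[1] <= duration
    )
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        conf = 0.0
    annotations = data.get("annotations")
    return {
        "selected_cell": cell,
        "annotations": annotations if isinstance(annotations, list) else [],
        "description": str(data.get("description") or ""),
        # A round request without a usable window is ignored, which stops the loop.
        "need_another_round": bool(data.get("need_another_round")) and window_ok,
        "focus_window": (float(window[0]), float(window[1])) if window_ok else None,
        "confidence": min(1.0, max(0.0, float(conf))),
    }
```

Treating a missing confidence as 0.0 means such a reply also trips the low-confidence guard, which errs on the side of one extra refinement round.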

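9.2 Executor branch: not needed in Phase 1.

The trigger is inbound media (Python), so no new ToolExecutor branch / ToolDefinition is needed. (Optional Phase 3: an analyze_video_clip ToolDefinition mirroring find_reference_image at definitions.py:1412 / executor.py:201, only if we want the model to request analysis of a clip it references.)

9.3 Chat-engine hook.

The synchronous engine.process_message (chat.py:228) is not extended: video work is too slow for the sync path. The Telegram gate records a minimal inbound user message and enqueues a job; the new handler owns the full download→extract→analyze(rounds)→annotate→persist→deliver sequence on its own session and commits itself. It deliberately stays outside the engine's flush-not-commit contract but writes equivalent conversation/message/attachment records.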

10. Channel changes

Telegram (Phase 1, in scope)

Inbound gate (bot.py:318). Add _has_inbound_video(message) (parallel to _has_inbound_image), returning True for message["video"], message["video_note"], message["animation"] (gif/mp4-loop), and a document whose mime_type ∈ {video/mp4, video/quicktime, video/webm}. Stickers / voice / audio stay ignored.
Cheap pre-getFile rejection. Compare Telegram's video.duration / file_size against max_video_duration_sec / max_video_mb (and the hard 20 MB Bot-API getFile ceiling: larger files cannot be downloaded, so tell the user).
ENQUEUE durable metadata, not a tempfile path (highest-risk edit). In process_update (bot.py:258-272), the video branch only runs cheap metadata checks, records the inbound user message, then enqueues a video_keyframe_analysis Job whose payload carries file_id, source type, caption/question, chat/user ids, declared metadata, and user_message_id, and immediately sends the "⏳ analyzing your clip…" ack. It must never call process_message inline, and must not download to a local tempfile before enqueueing, because the separate worker may not share that path. Asserted by tests.
Worker download. _handle_video_keyframe_analysis resolves file_id → file_path, then calls _download_telegram_file (bot.py:409; SSRF host-pin + no-redirect + Content-Length and streamed byte cap, max_bytes = min(max_video_mb*1024*1024, 20MB)) to download into a worker-local TemporaryDirectory.
Limits: mp4/quicktime/webm + video/video_note/animation; max_video_duration_sec (default 45, ffprobe authoritative); max_video_mb (default 20, clamped to the API ceiling).
Returned frame: a single PNG sent via send_photo multipart (bot.py:836); caption overflow > 1024 (_TG_CAPTION_LIMIT) follows the same photo/text split rule as _deliver_user_response, but the worker calls send_photo / send_message directly after commit.
Test update: tests/unit/test_telegram_screenshot_vision.py:191-193 currently asserts that a video update is ignored. Flag ON → it must assert that the update is enqueued; flag OFF (default) → still ignored.
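The gate and the cheap pre-getFile rejection above can be sketched as two pure functions over the Telegram message dict. The function names and the returned reason strings are illustrative; the field names (video, video_note, animation, document, mime_type, duration, file_size) are the ones Telegram's Bot API uses for these objects.

```python
from typing import Optional

VIDEO_DOC_MIMES = frozenset({"video/mp4", "video/quicktime", "video/webm"})
TELEGRAM_GETFILE_CEILING = 20 * 1024 * 1024  # Bot API getFile download limit


def has_inbound_video(message: dict) -> bool:
    """True for video, video_note, animation, or a document with a video mime type."""
    if any(message.get(key) for key in ("video", "video_note", "animation")):
        return True
    document = message.get("document") or {}
    return document.get("mime_type") in VIDEO_DOC_MIMES


def precheck_declared(
    message: dict, *, max_duration_sec: int = 45, max_mb: int = 20
) -> Optional[str]:
    """Return a rejection reason from Telegram's declared metadata, or None.

    This is only the cheap filter; ffprobe in the worker stays authoritative.
    """
    media = (
        message.get("video") or message.get("video_note")
        or message.get("animation") or message.get("document") or {}
    )
    limit = min(max_mb * 1024 * 1024, TELEGRAM_GETFILE_CEILING)
    if media.get("duration", 0) > max_duration_sec:
        return "too_long"
    if media.get("file_size", 0) > limit:
        return "too_large"
    return None
```

Neither function touches the network or the filesystem, so the webhook can run both before it records the message and enqueues the job.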

Web widget (Phase 2, deferred)

The widget has no deferred-reply channel today (chat.py:228 is synchronous; AirPilotSelfWidget.tsx renders inline). Phase 2 needs: a poll/SSE result endpoint; widget polling; relaxing accept="image/*" (AirPilotSelfWidget.tsx:378) and the file.type.startsWith('image/') guard (:119) to a video allowlist; a client-side <video> + loadedmetadata duration probe; relaxing upload_public_widget_attachment (public_widget.py:391) with max_video_mb; and a terminal-failure UX state. Out of scope for v1.

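11. Cost and latency budget

Cost: typically 1 vision call, capped at max_video_rounds (default 3). Each call sends one composite (~2048px, 9 cells). The planning figure is about 2,000–3,000 input tokens + ~500 output tokens per round; it must be measured with the chosen BYOK model. N=6/1536 is kept as a low-cost fallback configuration.

Tier ($/1M input, 2026-05)	Typical (1 round)	Worst case (3 rounds)
`gpt-5.4-nano` $0.20 (LibCyberAI default)	~$0.0005–$0.0008	~$0.0015–$0.0024
`gpt-5.4` $2.50 (workhorse)	~$0.006–$0.009	~$0.018–$0.027
`gpt-5.5` $5.00	~$0.013–$0.018	~$0.039–$0.054

The typical case is still cheaper than rev 1's two-call design (one call per round rather than an unconditional two). Size (geometry verified): N=9 (3×3) with normalise(..., max_long_edge=2048) yields ~680px frame-content cells and at most about 4.2 MP (a 2048×2048 square), still far below the 50 MP / 10 MB image caps. The selected full-resolution frame is separately normalise()d to 1536px. A real token-metering smoke test must be run before the production flag is turned on.

Latency: asynchronous, so the user is not blocked (immediate ack ≈ 1s perceived). Worker download depends on the Telegram CDN and clip size; ffprobe ~0.1–0.3s; 9 input-seek frames ~0.8–2.0s; Pillow compositing ~0.2–0.6s; vision call ~2–6s per round; extra rounds happen only when the model or a guard triggers them; accurate-seek re-extraction ~0.2–0.5s; normalise ~0.1–0.3s; render (+OCR) ~0.3–1s; store + send_photo ~0.5–1.5s. End to end, typically ≈ 5–12s (1 round), up to ~20s for a 3-round clip. Job pickup adds ≤ POLL_INTERVAL_SECONDS (5s). All of this is within async-job tolerance.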
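As a sanity check on the one-round figure, the per-stage ranges can simply be summed. The snippet uses only the numbers quoted in the latency paragraph and leaves out the download and job-pickup time, for which the plan gives no per-stage range.

```python
# Per-stage (low, high) latency in seconds, copied from the budget above.
STAGES_SEC = {
    "ffprobe": (0.1, 0.3),
    "9 input-seek frames": (0.8, 2.0),
    "Pillow compositing": (0.2, 0.6),
    "vision call (1 round)": (2.0, 6.0),
    "accurate-seek re-extract": (0.2, 0.5),
    "normalise": (0.1, 0.3),
    "render (+OCR)": (0.3, 1.0),
    "store + send_photo": (0.5, 1.5),
}

low = sum(lo for lo, _ in STAGES_SEC.values())
high = sum(hi for _, hi in STAGES_SEC.values())

print(f"one round, excluding download: {low:.1f}-{high:.1f} s")
```

The stages sum to roughly 4.2–12.2 s, which is broadly in line with the quoted 5–12 s once a short download is added at the low end.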

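12. Guards and limits

Round cap: max_video_rounds (default 3) hard-limits the number of vision calls per clip; focus_window must be a valid sub-range inside the clip, otherwise the loop stops.
Duration cap: ffprobe-authoritative duration > max_video_duration_sec (default 45s) → reject; the Telegram-declared duration is only a cheap pre-getFile filter.
Size cap: max_video_mb (default 20) at download time (Content-Length + streamed byte cap), hard-bounded by Telegram's 20 MB getFile ceiling.
Format: the container allowlist at the gate (mp4/quicktime/webm) is advisory; the authoritative check is ffprobe reporting a decodable video stream. Undecodable/corrupt → text error.
Frame/composite cap: N frames per round (default 9), one composite per round → cost is decoupled from clip length and bounded by the round cap.
Frame-content geometry (correctness guard): a cell must show the whole frame at its native aspect ratio (no crop/stretch), badges must sit outside the frame content, and content_rect must be recorded. The default same-aspect-ratio composite transfers normalized coordinates directly; any padding path must first map to full-frame coordinates before handing off to renderer.py:382-394.
Pixel/bomb cap: the composite and every extracted still pass through normalise's MAX_IMAGE_PIXELS=50M (image.py:41); the raw container is never passed to normalise (images only, multi-frame rejected at :189). Only the composite or a single still enters it.
ffmpeg safety: create_subprocess_exec argument list (no shell), fixed binaries, a wall-clock timeout that kills hung/malicious clips; clips without a video stream are rejected.
Abuse: work is enqueued (not inline), so a flood cannot block chat; standard job concurrency + retry/backoff; SSRF guards retained (host-pinned, redirects disabled).
Feature flag: the whole path is gated by per-vendor video_support_enabled (default OFF).
Extract-and-discard: the raw clip lives only in the worker-local TemporaryDirectory and is unlinked in finally; there is no raw-video MediaObject, and the job payload carries no local temp path.
No-OCR targets: halo/icon annotations (target_text is null) get no snap_to_text correction; accuracy comes from the model's raw coordinates + the content_rect mapping.
Fallbacks (degrade, never crash): ffprobe rejection → "clip too long/unreadable"; selected_cell null/out of range → middle cell; no vision response → middle cell + description only; re-extraction failure → text; render RuntimeError → caught → send the unannotated selected frame or text; job failure → the existing _mark_failed retry/backoff.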

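13. Usage and quota metering

This fork has no billing. There is no Stripe / wallet / prepaid path (removed at both the code and DB layers); BYOK is mandatory and vendors pay OpenAI directly. "Metering" here means usage recording + quotas, never charging.
Tokens/cost (multi-round). Each round's SheetResult.usage ({prompt,completion}) is metered via the _meter_vision_usage pattern (engine.py:1361): track_tokens → estimate_cost(vision_model, …) → track_cost. Multiple rounds compose cleanly: either sum the usages and meter once, or meter per round (both take the same path); keyed by vendor_ctx.vision_model.
Optional: MODEL_PRICING (analytics only). usage_tracker.py:22 lists only gpt-4o/4.1; estimate_cost returns 0.0 for unknown models (:59), so gpt-5.x usage shows a $0 estimate in the dashboard (the screenshot feature behaves the same way today). Purely cosmetic. Add gpt-5.4 / gpt-5.4-nano / gpt-5.5 rows only if the owner wants more realistic dashboard dollar figures. No effect on functionality or billing.
Quota (decision, §16). Count 1 video = 1 image unit via track_image(vendor.id), once per video turn (regardless of rounds); do not call track_image inside the loop. Gated by check_image_quota (usage_tracker.py:327). Alternative if video should weigh more: a track_video + monthly_video_quota Redis counter. Ship with 1-video=1-unit by default.

Optional · non-blocking: adding gpt-5.x rows to MODEL_PRICING only affects how realistic the admin-dashboard cost estimate looks (the same is true of the screenshot feature today) and has nothing to do with billing (this fork has none). See item 10 in §16.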

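14. Test plan (mirrors the existing layout)

Unit tests: `tests/unit/test_video_extractor.py`

probe() parses ffprobe JSON; rejects a missing video stream, over-limit duration, and over-limit pixels; reports the aspect ratio.
sample_frames() returns N cell-center timestamps; window= produces a denser set inside the sub-range.
build_contact_sheet() tiles in time order, with full-frame cells at the frame aspect ratio, index + mm:ss badges outside the frame content, and composite long edge ≤ cap; returns the idx→{ts, content_rect} map. (Uses a small committed fixture clip; ffmpeg is invoked with an argument list.)
extract_frame_at() uses accurate output seek + settle-offset; the timeout kills a hung subprocess.

Unit tests: `tests/unit/test_video_analyzer.py`

analyze_sheet() parses {selected_cell, annotations, need_another_round, focus_window, confidence} through the separate video schema; clamps when out of range; falls back to the middle cell on null/parse failure; exposes confidence and usage; uses the direct-client response_format transport rather than OpenAIService.chat_completion.
analyze_clip() loop: stops when need_another_round==false; resamples inside focus_window when it is true; runs one guard-triggered refinement when duration/N or confidence requires it; respects max_rounds; ignores a focus_window outside the clip.
Coordinate transfer: on the no-padding path, a normalized box returned from the composite call and applied to the full-resolution frame denormalizes to the expected pixels; a padding/badge-strip fixture proves the content_rect mapping excludes the badge area; snap_to_text corrects a deliberately offset text target.
The shipped _RESPONSE_SCHEMA + single-image VisionAnalyzer.analyze are byte-for-byte unchanged (regression guard).

Unit tests: `tests/unit/test_telegram_video_ingest.py` (+ update `test_telegram_screenshot_vision.py`)

_has_inbound_video matches video/video_note/animation/video-doc; ignores sticker/voice/audio.
Declared duration/size is rejected pre-getFile; the 20 MB ceiling is respected.
The gate records the user message, ENQUEUES a job carrying the file_id metadata, does not call process_message inline, and does not download to a tempfile (highest-risk edit guard).
test_telegram_screenshot_vision.py:191-193: flag OFF → still ignored; flag ON → enqueued.

Contract / integration: `tests/integration/test_video_keyframe_job.py`

(Mirrors the screenshot e2e): enqueue → _handle_video_keyframe_analysis with a mocked Telegram getFile/download path and a mocked OpenAI client (round 1 returns a cell + annotations + need_another_round=false; a second test forces one more round; a third test triggers the guard refinement) + fixture clip → assert: the worker downloads into its own temp dir, the selected frame is normalised, the assistant ConversationMessage is persisted, the annotated PNG is stored as a MediaObject, the outbound MessageAttachment(purpose="annotated_video_frame") is written, per-round usage is metered and summed, track_image is called once, commit precedes send_photo, send_photo is called with the PNG bytes, and temp files are unlinked. Failure-ladder cases: ffprobe rejection → persisted text fallback; no vision response → middle cell/text; render raises → persisted text/unannotated fallback.

Analytics test (optional): if the gpt-5.x rows are added, estimate_cost returns a non-zero dashboard estimate (no billing assertions; this fork has no billing).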

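15. Phased rollout

Phase 1: MVP (Telegram only, lowest risk). Add app/core/video/{extractor,analyzer}.py (contact-sheet construction + own schema/prompt + adaptive loop); the video_keyframe_analysis handler; Telegram gate → ENQUEUE + ack; the max_video_mb / max_video_duration_sec / max_video_rounds / video_support_enabled flags; extract-and-discard; 1-video=1-unit quota; default N=9 / 2048px. The annotation back half is reused verbatim. Ship behind the per-vendor default-OFF flag; move to Phase 2 only after validation in prod.
Phase 2: hardening + widget. Tune N / round cap / settle-offset from prod data; widget async delivery (poll/SSE, client-side duration probe, video upload allowlist, terminal-failure UX); split max_video_mb from max_image_mb; optional 016 audit columns.
Phase 3: optional enhancements. Once model token economics are confirmed, larger or adaptive composites beyond N=9; an optional model-facing analyze_video_clip tool; per-timestamp frame addressing + raw clip persistence (the deferred parent_video_id/frame_ts migration + ALLOWED_VIDEO_MIME); a separate track_video/monthly_video_quota if abuse appears.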

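16. Open questions for the owner to decide

The following open questions need an explicit decision from the owner. This section is the page's call to action.

1. Quota semantics (this fork has no billing of any kind). By default, ship 1 video = 1 image unit (counted against the existing check_image_quota, once per video turn regardless of rounds). Under mandatory BYOK the vendor pays OpenAI directly, so the quota is only an abuse cap, not a revenue lever. Do you want a cap at all, or should this single tenant be unlimited? If you want a separate limit: a dedicated track_video + monthly_video_quota counter (can be deferred to Phase 3).

2. N (frames per composite), sheet cap, and max_video_rounds. Proposed: N=9, max_long_edge=2048, and ≤3 rounds. Drop to N=6/1536 only if the token/latency smoke test shows N=9 is too expensive.

3. max_video_mb and max_video_duration_sec. Proposed: Telegram ≤ 20 MB (API ceiling) and ≤ 45s. Product needs to sign off on the exact numbers.

4. Docker: apt ffmpeg vs imageio-ffmpeg. We need ffprobe for the authoritative duration guard; imageio-ffmpeg does not bundle ffprobe → we recommend apt-get install ffmpeg (provides both; the image grows by about 60–100 MB on the current gcc-only Dockerfile). Please confirm.

5. Model token tolerance for large composites. Confirm that the BYOK vision model prices and handles ~2048px composites sensibly. The pipeline limits are fine; model-side economics are not encoded. A real metering smoke test is required before the production flag is enabled.

6. Scope of accepted inbound types. Should v1 be limited to true video/video_note + mp4/mov, or also accept Telegram animation (gif/mp4-loop)? Animated stickers stay out.

7. Audit columns now or later? Ship the optional 016 migration (source_kind/source_ts_sec) with Phase 1, or defer it (the JSON attachment_metadata is enough for the MVP)?

8. Widget timeline. Confirm that deferring the widget to Phase 2 (there is no deferred-reply channel today) is acceptable, i.e. that Telegram-only is a viable v1 for this customer.

9. Panel/ticket video support is not covered (the screenshot plan already deferred panel/tickets, depending on whether the ticket-automation stub is real). Please confirm it is out of scope.

10. (Optional, non-blocking) Add gpt-5.x price rows to MODEL_PRICING. This only makes the admin-dashboard cost estimate (token_cost_today / token_cost_month) look more realistic; it is a purely cosmetic analytics change, and this fork has no billing of any kind. See §13.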
