Porting Spec — vision_listener for Corporate macOS
Audience: a fresh Claude Code instance running on a different macOS machine, with no access to this project’s history. Read this whole file before writing any code. The “Failed Paths” section exists specifically so you do not waste hours re-discovering them.
0. Mission Brief
Build a single Python script, vision_listener.py, that:
- Lets the user calibrate a screen rectangle by clicking the top-left and bottom-right corners of any region (typically Zoom’s closed-caption / transcript panel) — even when Zoom is fullscreen on its own macOS Space.
- Every 15 seconds, screen-captures only that rectangle, runs macOS Vision OCR on the PNG, and appends any net-new text to a per-session Markdown file with Obsidian-friendly YAML front-matter.
- Survives the rolling-caption pattern where the same line reappears across many polls with OCR drift, in both English and Simplified Chinese (often mixed in the same meeting).
Non-goals: speaker diarization, real-time UI, post-processing, anything that touches Zoom’s APIs directly.
1. Critical Failed Paths — DO NOT REPEAT
These were tried, debugged, and abandoned in the previous build. Every one of them looks attractive on paper. They are not. If you find yourself reaching for any of these, stop and re-read this section.
1.1 ❌ Zoom Accessibility / AppleScript AX-tree walking
The first version of this project (archive/zoom_listener.py in the source repo, not present here) polled Zoom’s AXTable named “Closed caption subtitles” via AppleScript every few seconds.
Why it died:
- Zombie / ghost AX nodes. Zoom keeps stale closed-caption tables attached to old window handles. Walking them returns inconsistent garbage.
- -600 “application isn’t running” errors fire intermittently when Zoom switches between fullscreen and windowed modes, even though the app is plainly running.
- Breaks on every Zoom version bump because the AX hierarchy moves.
- Locked to Zoom only — useless for Teams / Meet / Webex.
Lesson: screen-pixel OCR is the only stable substrate. Do not try to tap into Zoom’s internals.
1.2 ❌ Tkinter transparent fullscreen overlay for click-and-drag selection
Multiple iterations: a Tk() window with attributes('-alpha', 0.3) and attributes('-fullscreen', True), draw a green rectangle on a canvas as the user drags, return the bbox.
Why it died:
- macOS Spaces routing. Zoom in fullscreen lives on its own Space. A tkinter window created from a terminal on the desktop Space appears on the desktop Space, yanking the user away from Zoom. Every workaround (NSWindowCollectionBehaviorCanJoinAllSpaces, NSWindowCollectionBehaviorFullScreenAuxiliary via PyObjC) is unreliable in timing: the behavior is set on an NSWindow that tkinter has already shown, and the OS has already assigned it to the wrong Space.
- -fullscreen creates a brand new Space, taking the user even further from Zoom.
- overrideredirect(True) + manual geometry dodges fullscreen but the window still won’t receive mouse events after the user swipes back to Zoom’s Space.
- '-[NSApplication macOSVersion]: unrecognized selector' crash: PyObjC’s NSApplication.sharedApplication() installs a plain NSApplication, which then prevents Tk from installing its own TKApplication subclass. Tk later calls macOSVersion on the wrong class and crashes. Fix: tk.Tk() must run before any PyObjC code touches NSApp. Even with this fixed, the Spaces problem above remains.
- Beach-balling the cursor. Calling root.destroy() from inside a tkinter event handler hangs the cursor in the spinning state. The fix is withdraw() + after(0, root.quit) and only destroy from outside, but again, this is moot because the window can’t see Zoom.
Lesson: do not draw an overlay window. The corporate Claude will try this; resist it. The shipped solution uses no window at all — see Module 1.
1.3 ❌ Pillow / mss for screen capture
You do not need Pillow, mss, or any third-party capture library. The system tool screencapture -x -R x,y,w,h out.png is faster, has zero dependencies beyond what macOS already ships, and reads the correct global coordinate system that the Vision pipeline expects.
1.4 ❌ Iterative / sliding-window / word-tokenized deduplication
The deduper went through six failed designs before landing on the current one. Specifically do not implement any of:
- Exact longest_overlap(suffix, prefix): breaks on capitalization, punctuation, and OCR drift.
- Single-pass find_longest_match at the seam only: misses overlaps that sit in the middle of a long Chinese frame because OCR noise scatters mismatches around the boundary.
- Iterative peeling with MIN_FIRST=12 / MIN_REST=5: over-trims legitimate new content.
- Word-level tokenizer with a 3-tier (exact / prefix / ratio≥0.8) word matcher: works fine for English, completely broken for Chinese because CJK has no inter-word spaces, so the entire frame becomes one token.
- Sliding-window normalization with a ratio ≥ 0.85 threshold: still fails on long Chinese frames where the overlap is real but the per-window similarity dips below threshold.
The only approach that works is the anywhere-overlap on a character-normalized stream described in Module 3. Implement that.
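To make the CJK failure mode concrete, here is a minimal sketch. The two caption strings are invented for illustration (they are not from a real meeting); the point is only that whitespace tokenization collapses a Chinese frame into one token, while a character stream still exposes the shared run.

```python
import difflib

# Hypothetical caption frames; real captions will differ.
prev_frame = "我们今天讨论一下项目进度和下周的计划安排"
next_frame = "项目进度和下周的计划安排以及预算问题"

# Word-level view: there are no spaces, so each frame is a single token
# and a word matcher only sees two different tokens.
prev_tokens = prev_frame.split()
next_tokens = next_frame.split()

# Character-level view: the longest shared run is directly visible.
matcher = difflib.SequenceMatcher(None, prev_frame, next_frame, autojunk=False)
m = matcher.find_longest_match(0, len(prev_frame), 0, len(next_frame))
shared = prev_frame[m.a : m.a + m.size]

print(len(prev_tokens), len(next_tokens), shared)
```

The character-level match is what Module 3 builds on; the word-level lists are shown only to illustrate why that design was abandoned.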
1.5 ❌ Local synthetic test loops
The user has explicit feedback on this: do not sit in a python -c "..." loop running synthetic strings through the deduper to “validate” thresholds. The user can re-run the script against real Zoom captions in seconds, and the live signal is more authoritative than any toy fixture. Ship a reasonable implementation and hand it back. One syntax / import sanity check is fine; deeper local validation only when (a) live testing is genuinely expensive, or (b) the user explicitly asks.
2. Architecture (Two Phases)
┌─────────────────────────────────────────────────────────────┐
│  Phase 1: Calibration (one-shot, no window)                 │
│                                                             │
│  Quartz CGEventTap (active) → swallow 2 left-mouse-downs    │
│     → record (x1,y1) and (x2,y2) → derive (x,y,w,h)         │
│     → play afplay confirmation on each click                │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│  Phase 2: Headless polling loop (every 15s, until Ctrl+C)   │
│                                                             │
│  screencapture -R → PNG → VNRecognizeTextRequest            │
│     → fuzzy_stitch(committed_tail, new_text)                │
│     → append net-new suffix to meeting_YYYYMMDD_HHMM.md     │
└─────────────────────────────────────────────────────────────┘
State that survives across iterations: a single in-memory committed string (the entire transcript so far) used as the dedup haystack; a file handle (re-opened append-mode each tick) for persistence.
3. Dependencies
3.1 System requirements
- macOS 13 (Ventura) or newer. Older macOS still runs but loses per-region language auto-detection. macOS 11 is the hard floor for VNRecognizeTextRequest.
- Python 3.10+ (the script uses tuple[int, ...] | None PEP 604 syntax).
- Two macOS permissions for the terminal/IDE that runs the script:
  - Accessibility (System Settings → Privacy & Security → Accessibility): required by CGEventTapCreate to receive global mouse events. Without it, the tap is silently rejected.
  - Screen Recording (System Settings → Privacy & Security → Screen Recording): required by screencapture to read pixels of other apps’ windows. Without it, the captured PNG will be blank.
  - First run will trigger the OS prompt for both. After granting, fully quit and relaunch the terminal; the permission is read at process launch, not live.
3.2 Python packages
pip install pyobjc-framework-Vision pyobjc-framework-Cocoa pyobjc-framework-Quartz
That is the entire dependency list. No tkinter, no Pillow, no mss, no Pillow-SIMD, no opencv. If the corporate pip is locked down:
- Try pip install --user <pkg>.
- If that fails, ask the user for the corp PyPI mirror URL and use pip install --index-url <url> ....
- All three pyobjc packages ship as prebuilt binary wheels for macOS, so no local compilation should be needed.
Standard-library imports only beyond that: difflib, re, subprocess, sys, tempfile, time, unicodedata, datetime.datetime, pathlib.Path. Import unicodedata lazily inside the normalizer if you want to keep the top-level imports tidy (the source does this).
3.3 No-go libraries
| Library | Why not |
|---|---|
| tkinter | Spaces routing breaks the overlay (see §1.2). |
| Pillow / mss / pyscreenshot | screencapture -R is sufficient and faster. |
| pytesseract | Vision is more accurate, especially for Chinese, and ships with the OS. |
| pyautogui | Pulls in Pillow and uses a different coordinate origin. |
4. Module 1 — Calibration (Quartz Event Tap)
Goal: capture two global left-mouse-down events, anywhere on any display, in any Space (including fullscreen Zoom), without drawing a window.
4.1 Why Quartz event tap
A CGEventTapCreate(kCGSessionEventTap, ..., kCGEventTapOptionDefault, mask, callback, None) listens to all mouse events session-wide. It has no UI. Spaces, fullscreen apps, and multi-monitor setups are invisible to it — the OS just hands you the raw CGEvent and CGEventGetLocation(event) returns the global top-left coordinate, which is exactly the coordinate system screencapture -R consumes.
kCGEventTapOptionDefault makes the tap active, meaning the callback can return None to swallow the event — critical, because we do not want the user’s calibration clicks to also activate Zoom UI under the cursor.
4.2 Implementation
def calibrate_via_clicks() -> tuple[int, int, int, int] | None:
    try:
        import Quartz
    except ImportError:
        print("✗ pyobjc-framework-Quartz missing")
        return None

    clicks: list[tuple[int, int]] = []

    def callback(proxy, type_, event, refcon):
        # macOS auto-disables slow taps; re-enable on those event types.
        if type_ in (
            Quartz.kCGEventTapDisabledByTimeout,
            Quartz.kCGEventTapDisabledByUserInput,
        ):
            Quartz.CGEventTapEnable(tap, True)
            return event
        if type_ != Quartz.kCGEventLeftMouseDown:
            return event
        loc = Quartz.CGEventGetLocation(event)
        x, y = int(round(loc.x)), int(round(loc.y))
        clicks.append((x, y))
        _play_lock_sound()  # confirmation on EVERY click, not just the second
        if len(clicks) == 1:
            print(f" ✓ top-left @ ({x:5d}, {y:5d})")
            print(" Now click the BOTTOM-RIGHT corner…")
        else:
            print(f" ✓ bot-right @ ({x:5d}, {y:5d})")
            Quartz.CFRunLoopStop(Quartz.CFRunLoopGetCurrent())
        return None  # swallow the click so Zoom doesn't react

    mask = Quartz.CGEventMaskBit(Quartz.kCGEventLeftMouseDown)
    tap = Quartz.CGEventTapCreate(
        Quartz.kCGSessionEventTap,
        Quartz.kCGHeadInsertEventTap,
        Quartz.kCGEventTapOptionDefault,  # active tap, can suppress
        mask,
        callback,
        None,
    )
    if not tap:
        print("✗ Could not create event tap. Grant Accessibility "
              "permission to your terminal and re-run.")
        return None

    source = Quartz.CFMachPortCreateRunLoopSource(None, tap, 0)
    Quartz.CFRunLoopAddSource(
        Quartz.CFRunLoopGetCurrent(), source, Quartz.kCFRunLoopCommonModes
    )
    Quartz.CGEventTapEnable(tap, True)
    print(" Switch to Zoom (any Space, fullscreen OK).")
    print(" Click the TOP-LEFT corner of the transcript region…")
    try:
        Quartz.CFRunLoopRun()
    finally:
        Quartz.CGEventTapEnable(tap, False)

    if len(clicks) < 2:
        return None
    (x1, y1), (x2, y2) = clicks
    x, y = min(x1, x2), min(y1, y2)
    w, h = abs(x2 - x1), abs(y2 - y1)
    if w < MIN_REGION_DIMENSION or h < MIN_REGION_DIMENSION:
        print(f"✗ Region too small ({w}×{h}). Aborted.")
        return None
    return (x, y, w, h)
4.3 Pitfalls
- Tap returns None at CGEventTapCreate. Always Accessibility permission. The OS does not raise; you just get a NULL tap. Print a guidance message.
- kCGEventTapDisabledByTimeout fires if your callback takes too long. Keep the callback fast: do not block on afplay (use subprocess.Popen, not subprocess.run). Re-enable the tap on this event type or you will silently lose all subsequent clicks.
- Coordinate min/max from (x1,y1) and (x2,y2): the user might click bottom-right first and top-left second; tolerate it.
- Reject tiny rectangles (MIN_REGION_DIMENSION = 20): an accidental double-click in the same spot would otherwise produce a 0×0 region and screencapture would silently fail.
- The clicks are global, so closing the corporate VPN client mid-calibration with a stray click is possible. Tell the user this.
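The min/max and tiny-rectangle rules can be isolated into a pure function that is checkable without any macOS permissions. This is a sketch only: rect_from_clicks is a hypothetical helper name, not something the source script defines.

```python
MIN_REGION_DIMENSION = 20  # value quoted in the pitfalls above


def rect_from_clicks(p1: tuple[int, int], p2: tuple[int, int]):
    """Hypothetical helper: order-independent (x, y, w, h), or None if too small."""
    (x1, y1), (x2, y2) = p1, p2
    x, y = min(x1, x2), min(y1, y2)      # top-left corner regardless of click order
    w, h = abs(x2 - x1), abs(y2 - y1)
    if w < MIN_REGION_DIMENSION or h < MIN_REGION_DIMENSION:
        return None                      # e.g. an accidental double-click
    return (x, y, w, h)


# Bottom-right clicked first, top-left second: same rectangle either way.
reversed_order = rect_from_clicks((900, 700), (300, 500))
normal_order = rect_from_clicks((300, 500), (900, 700))
double_click = rect_from_clicks((400, 400), (401, 400))
```

Keeping this logic free of Quartz calls means the one-off sanity check allowed by §1.5 can cover it without touching the event tap.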
5. Module 2 — Vision OCR
Goal: turn a PNG file path into a plain-text string of recognized lines. Bilingual EN + zh-Hans.
5.1 PyObjC imports
from Foundation import NSURL
from Vision import VNRecognizeTextRequest, VNImageRequestHandler
Wrap the import in try/except and gate downstream code on a _VISION_OK flag, because the corporate machine may not have pyobjc-Vision installed yet.
5.2 Recognition request
RECOGNITION_LANGUAGES = ["zh-Hans", "en-US"]

def vision_ocr(image_path: Path) -> str:
    nsurl = NSURL.fileURLWithPath_(str(image_path))
    request = VNRecognizeTextRequest.alloc().init()
    request.setRecognitionLevel_(0)  # 0 = accurate, 1 = fast. USE 0.
    request.setUsesLanguageCorrection_(True)
    try:
        request.setRecognitionLanguages_(RECOGNITION_LANGUAGES)
    except Exception:
        pass  # very old macOS may not expose this language
    # macOS 13+: let Vision pick the best language per text region
    # instead of forcing the priority list onto every line. Critical
    # for bilingual meetings where Chinese captions and English chat
    # share a single screenshot.
    if hasattr(request, "setAutomaticallyDetectsLanguage_"):
        try:
            request.setAutomaticallyDetectsLanguage_(True)
        except Exception:
            pass
    handler = VNImageRequestHandler.alloc().initWithURL_options_(nsurl, {})
    success, _err = handler.performRequests_error_([request], None)
    if not success:
        return ""
    lines: list[str] = []
    for obs in (request.results() or []):
        candidates = obs.topCandidates_(1)
        if candidates and len(candidates) > 0:
            lines.append(str(candidates[0].string()))
    return "\n".join(lines)
5.3 Why these settings
- setRecognitionLevel_(0) (accurate). Caption text is small and often anti-aliased; the fast mode misses ~30 % of Chinese characters.
- setUsesLanguageCorrection_(True) lets Vision smooth obvious OCR errors using its language model, which is vital for Chinese.
- ["zh-Hans", "en-US"] order matters. The first entry biases the recognizer. Reverse it and Chinese characters get parsed as garbled Latin.
- setAutomaticallyDetectsLanguage_(True) (macOS 13+) overrides the priority list per-region, so a screenshot with both Chinese and English in different boxes is handled correctly. Always feature-detect with hasattr, because older OS versions will not have it.
5.4 Capture helper
def capture_region(x: int, y: int, w: int, h: int, out_path: Path) -> bool:
    result = subprocess.run(
        ["screencapture", "-x", "-R", f"{x},{y},{w},{h}", str(out_path)],
        capture_output=True,
    )
    return (
        result.returncode == 0
        and out_path.exists()
        and out_path.stat().st_size > 0
    )
-x suppresses the camera shutter sound. -R x,y,w,h is the region rectangle in global top-left coordinates, the same numbers the event tap gave us.
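The main loop in §7.5 passes a TEMP_IMG path to capture_region(), but this spec never shows that constant being defined. A minimal sketch follows, assuming a fixed file in the system temp directory is acceptable. The filename and the region_arg helper are illustrative assumptions, not taken from the source script.

```python
import tempfile
from pathlib import Path

# Assumed location and filename; the spec only names the constant TEMP_IMG.
TEMP_IMG = Path(tempfile.gettempdir()) / "vision_listener_frame.png"


def region_arg(x: int, y: int, w: int, h: int) -> str:
    """Hypothetical helper: the comma-separated value passed after -R."""
    return f"{x},{y},{w},{h}"


# Example: the hand-coded test region suggested in the implementation order.
example = region_arg(100, 100, 400, 200)
```

Reusing one temp path per poll keeps disk usage flat; each capture simply overwrites the previous frame.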
6. Module 3 — The Deduplication (“Holy Grail”)
Goal: given the transcript so far (committed) and a freshly OCR’d frame (new_text), return only the suffix of new_text that is genuinely new. Must work for both English and Chinese, and tolerate OCR drift, punctuation drift, and rolling-caption windows.
6.1 The shape of the input
Zoom’s caption rail is a 2-line rolling window. Across consecutive polls you see frames like:
poll N: "i wake up. then i get dressed. i walk to school. i do not"
poll N+1: "to school. i do not ride a bike. i do not ride the bus."
poll N+2: "ride the bus. i like to go to school. it rings. i do not"
The same text reappears with a sliding window plus light OCR drift (spdf vs spf, fullwidth , vs halfwidth ,, capitalization flips). For Chinese, the drift is single-character substitutions and fullwidth/halfwidth punctuation flips.
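As a small illustration of the punctuation drift, the sketch below shows what Unicode NFKC normalization folds and what it leaves alone. It uses only the standard library; the variable names are illustrative.

```python
import unicodedata


def nfkc(ch: str) -> str:
    return unicodedata.normalize("NFKC", ch)


fullwidth_comma = "\uff0c"   # , as it often appears in Chinese captions
fullwidth_a = "\uff21"       # A
ideographic_stop = "\u3002"  # 。

# Fullwidth forms of ASCII characters fold to their halfwidth twins...
comma_folded = nfkc(fullwidth_comma)
letter_folded = nfkc(fullwidth_a)
# ...but CJK-only marks such as the ideographic full stop are unchanged.
stop_folded = nfkc(ideographic_stop)
```

This is why a normalizer for these captions needs an explicit punctuation map on top of NFKC: marks like 。 would otherwise never compare equal to an ASCII period.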
6.2 The algorithm: anywhere-overlap on a normalized character stream
1. Take the last OVERLAP_TAIL_RAW_CHARS (= 500) raw chars of `committed`.
   Call it the "tail". This is the search haystack.
2. Normalize both `tail` and `new_text` into a character stream:
     - strip whitespace
     - lowercase
     - NFKC-normalize (folds fullwidth digits/Latin to halfwidth)
     - translate fullwidth CJK punctuation to ASCII via _PUNCT_MAP
   While normalizing, remember the back-pointer from each normalized
   char to its original raw index; we need this to slice the original
   new_text at exactly the right cut point.
3. Run difflib.SequenceMatcher.find_longest_match over the entire
   (norm_tail, norm_new) to find the longest contiguous shared run
   ANYWHERE in either side. Not just at the seam: long Chinese frames
   have OCR misreads scattered around the boundary, so the true overlap
   is rarely flush against the edge.
4. If match.size >= OVERLAP_MIN_MATCH (= 15 chars):
     cut_norm = match.b + match.size
     if cut_norm covers the whole frame → drop entire frame (return "")
     else slice new_text after raw_new_idx[cut_norm - 1] + 1
          lstrip a fixed set of trailing-overlap punctuation
          return that suffix
5. Else (no contiguous run of 15 chars):
     sum every matching block from matcher.get_matching_blocks()
     coverage = covered / len(norm_new)
     if coverage >= OVERLAP_CONTAINMENT_RATIO (= 0.9):
       this frame is stale (≥90% of its content already lives in the
       tail, just shuffled / split across multiple matches) → drop
6. Else: the frame is genuinely new → return verbatim
Note the asymmetry between step 4 and step 5: step 4 demands a single contiguous 15-char run (cheap, common case), step 5 is a fallback for when noise has fragmented the overlap into many small matches but it still covers ≥90 %.
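Steps 1 to 4 can be traced by hand on the first two frames from §6.1. The sketch below uses a deliberately simplified stand-in normalizer (whitespace removal and lowercasing only, no NFKC or punctuation map) so that it stays self-contained. The function name fold is an assumption for this example.

```python
import difflib


def fold(text: str) -> tuple[str, list[int]]:
    """Simplified stand-in normalizer: drop whitespace, lowercase,
    and remember which raw index each kept character came from."""
    chars: list[str] = []
    raw_idx: list[int] = []
    for i, ch in enumerate(text):
        if ch.isspace():
            continue
        chars.append(ch.lower())
        raw_idx.append(i)
    return "".join(chars), raw_idx


tail = "i wake up. then i get dressed. i walk to school. i do not"
frame = "to school. i do not ride a bike. i do not ride the bus."

norm_tail, _ = fold(tail)
norm_frame, raw_idx = fold(frame)

matcher = difflib.SequenceMatcher(None, norm_tail, norm_frame, autojunk=False)
m = matcher.find_longest_match(0, len(norm_tail), 0, len(norm_frame))

# Map the end of the match in normalized space back to a raw offset.
cut_raw = raw_idx[m.b + m.size - 1] + 1
net_new = frame[cut_raw:].lstrip()
print(m.size, repr(net_new))
```

Here the shared run is "toschool.idonot", which is exactly 15 normalized characters, so this pair only just clears the OVERLAP_MIN_MATCH threshold described in step 4.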
6.3 The normalizer
Two functions: a simple one for collapsing whitespace, and an indexed one that preserves raw-offset back-pointers.
import re, unicodedata, difflib

_WS_RE = re.compile(r"\s+")
_STITCH_TRIM = " ,.;:!?\n\t-—…\"'),。!?;:、)】」』"

# Fullwidth → halfwidth (and CJK-only marks → ASCII). NFKC already folds
# fullwidth digits, Latin letters, and fullwidth ASCII punctuation, so
# those entries are only a safety net; the map is what covers CJK-only
# marks (。、【】「」『』) and curly quotes, which NFKC leaves alone.
_PUNCT_MAP = str.maketrans({
    ",": ",", "。": ".", "!": "!", "?": "?", ";": ";", ":": ":",
    "(": "(", ")": ")", "【": "[", "】": "]",
    "「": '"', "」": '"', "『": '"', "』": '"',
    "、": ",", "~": "~",
    "“": '"', "”": '"', "‘": "'", "’": "'",
})

def normalize(text: str) -> str:
    return _WS_RE.sub(" ", text).strip()

def _normalize_indexed(text: str) -> tuple[str, list[int]]:
    """Returns (normalized, raw_positions) where normalized[k] came from
    text[raw_positions[k]]. Preserves the mapping so a cut found in
    normalized space can be applied at the correct raw offset."""
    norm_chars: list[str] = []
    raw_idx: list[int] = []
    for i, ch in enumerate(text):
        if ch.isspace():
            continue
        nfkc = unicodedata.normalize("NFKC", ch)
        folded = nfkc.translate(_PUNCT_MAP).lower()
        for c in folded:  # NFKC may expand 1 char to many
            norm_chars.append(c)
            raw_idx.append(i)
    return "".join(norm_chars), raw_idx
6.4 The stitcher
POLL_INTERVAL_SECONDS = 15
OVERLAP_TAIL_RAW_CHARS = 500
OVERLAP_MIN_MATCH = 15
OVERLAP_CONTAINMENT_RATIO = 0.9
STITCH_DEBUG = True

def _stitch_log(msg: str) -> None:
    if STITCH_DEBUG:
        print(f" [stitch] {msg}")

def fuzzy_stitch(committed: str, new_text: str) -> str:
    if not committed:
        _stitch_log(f"first frame, append {len(new_text)}c")
        return new_text
    if not new_text:
        return ""
    committed_tail = committed[-OVERLAP_TAIL_RAW_CHARS:]
    norm_tail, _ = _normalize_indexed(committed_tail)
    norm_new, raw_new_idx = _normalize_indexed(new_text)
    if not norm_tail or not norm_new:
        _stitch_log(f"empty normalized side, append {len(new_text)}c verbatim")
        return new_text

    matcher = difflib.SequenceMatcher(None, norm_tail, norm_new, autojunk=False)
    match = matcher.find_longest_match(0, len(norm_tail), 0, len(norm_new))
    if match.size >= OVERLAP_MIN_MATCH:
        cut_norm = match.b + match.size
        if cut_norm >= len(raw_new_idx):
            _stitch_log(f"match={match.size}c covers full frame → drop")
            return ""
        cut_raw = raw_new_idx[cut_norm - 1] + 1
        suffix = new_text[cut_raw:].lstrip(_STITCH_TRIM)
        _stitch_log(
            f"match={match.size}c at norm[{match.b}:{cut_norm}] → append {len(suffix)}c"
        )
        return suffix

    covered = sum(blk.size for blk in matcher.get_matching_blocks())
    coverage = covered / len(norm_new) if norm_new else 0.0
    if coverage >= OVERLAP_CONTAINMENT_RATIO:
        _stitch_log(f"stale frame ({coverage:.0%} contained), drop")
        return ""
    _stitch_log(
        f"no overlap (longest={match.size}c, coverage={coverage:.0%}) → "
        f"append {len(new_text)}c"
    )
    return new_text
6.5 Why each constant is what it is
| Constant | Value | Why |
|---|---|---|
| POLL_INTERVAL_SECONDS | 15 | Long enough that captions roll a meaningful amount; short enough that you don’t lose lines that scroll off-screen between ticks. |
| OVERLAP_TAIL_RAW_CHARS | 500 | More than two caption-rail frames’ worth, but small enough that SequenceMatcher stays cheap. |
| OVERLAP_MIN_MATCH | 15 | Below ~12 chars the match is often coincidental (common short Chinese phrases or English stopword runs). 15 is the empirical sweet spot. |
| OVERLAP_CONTAINMENT_RATIO | 0.9 | Lower (0.7–0.8) drops legitimate new content. 0.95 lets through stale frames. |
| STITCH_DEBUG | True | Always leave on; per-frame decisions are how you debug the live run. |
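To show how OVERLAP_MIN_MATCH and OVERLAP_CONTAINMENT_RATIO interact, here is a small worked example of the coverage fallback. The two normalized streams are made-up placeholders: 20 characters that are identical except for one substitution, standing in for a single OCR misread.

```python
import difflib

OVERLAP_MIN_MATCH = 15
OVERLAP_CONTAINMENT_RATIO = 0.9

# Hypothetical normalized streams; position 10 differs ("k" vs "z").
norm_tail = "abcdefghijklmnopqrst"
norm_new = "abcdefghijzlmnopqrst"

matcher = difflib.SequenceMatcher(None, norm_tail, norm_new, autojunk=False)
longest = matcher.find_longest_match(0, len(norm_tail), 0, len(norm_new))

# The misread splits the overlap into two runs (10 + 9 chars), so no
# single run reaches the 15-char threshold...
too_short_for_contiguous_rule = longest.size < OVERLAP_MIN_MATCH

# ...but summed over all matching blocks, 19 of 20 chars are covered.
covered = sum(blk.size for blk in matcher.get_matching_blocks())
coverage = covered / len(norm_new)
stale = coverage >= OVERLAP_CONTAINMENT_RATIO
```

With these inputs the frame fails the contiguous-run test but is still classified as stale at 95 % coverage, which is the situation the containment ratio exists to catch.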
6.6 The STITCH_DEBUG log lines
Every call prints exactly one decision line so the user can read the live transcript and see why each frame was kept, trimmed, or dropped. Keep this on by default: it costs nothing and is the only window into the dedup behavior. The user has explicitly preferred this over running synthetic test fixtures locally.
7. Module 4 — UX & File Handling
7.1 Confirmation sound on every calibration click
LOCK_SOUND = "/System/Library/Sounds/Pop.aiff"

def _play_lock_sound() -> None:
    """Fire-and-forget audio cue. Uses Popen, NOT run, because the
    Quartz event tap callback must return fast; blocking inside the
    callback risks the tap being disabled for slowness."""
    try:
        subprocess.Popen(
            ["afplay", LOCK_SOUND],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except Exception:
        pass
Call this from inside the calibration callback on every click, not just the second one. (The user explicitly wants two beeps, one per corner, so they get instant feedback on each click.) Use a path under /System/Library/Sounds/ (e.g. Pop.aiff, Tink.aiff, Glass.aiff) because those paths are stable across macOS versions. Do not use the private framework paths that Gemini or earlier suggestions might propose; they break on OS upgrades.
7.2 Per-session timestamped output file
OUTPUT_DIR = Path(__file__).resolve().parent
# Inside main(), at the start of the run:
output_file = OUTPUT_DIR / f"meeting_{datetime.now().strftime('%Y%m%d_%H%M')}.md"
One file per script invocation. Do not reuse a single raw_meeting_transcript.md: the user does multiple meetings per day and overwrites are unrecoverable.
7.3 Obsidian-friendly YAML front-matter
The output file is meant to be dropped into an Obsidian vault. Write this header once at the start of the run:
---
type: meeting-transcript
date: 2026-04-13
started: "01:26:59"
source: zoom-vision-ocr
---

# Meeting Transcript — 2026-04-13 01:26

## Transcript

Then append each net-new chunk as its own paragraph (chunk + \n\n). On KeyboardInterrupt, append a footer:
---

_Ended: 01:28:51_
7.4 The TranscriptWriter class
class TranscriptWriter:
    """Append-only Markdown writer with OCR-aware deduplication state.
    Holds the running `committed` string in memory purely as the dedup
    haystack. Each net-new tail suffix is appended to disk as its own
    Markdown paragraph so the file stays human-readable.
    """

    def __init__(self, path: Path):
        self.path = path
        self.start_time = datetime.now()
        self.committed: str = ""
        self._write_header()

    def _write_header(self) -> None:
        header = (
            "---\n"
            "type: meeting-transcript\n"
            f"date: {self.start_time.strftime('%Y-%m-%d')}\n"
            f'started: "{self.start_time.strftime("%H:%M:%S")}"\n'
            "source: zoom-vision-ocr\n"
            "---\n\n"
            f"# Meeting Transcript — {self.start_time.strftime('%Y-%m-%d %H:%M')}\n\n"
            "## Transcript\n\n"
        )
        self.path.write_bytes(header.encode("utf-8"))

    def update(self, raw_text: str) -> str:
        text = normalize(raw_text)
        if not text:
            return ""
        if not self.committed:
            self.committed = text
            self._append_paragraph(text)
            return text
        suffix = fuzzy_stitch(self.committed, text)
        if not suffix:
            return ""
        # Join with a space so future overlap searches span the boundary,
        # but write a paragraph break to disk for readability.
        self.committed = self.committed + " " + suffix
        self._append_paragraph(suffix)
        return suffix

    def _append_paragraph(self, text: str) -> None:
        with self.path.open("ab") as f:
            f.write((text + "\n\n").encode("utf-8"))

    def finalize(self) -> None:
        footer = f"\n---\n\n_Ended: {datetime.now().strftime('%H:%M:%S')}_\n"
        with self.path.open("ab") as f:
            f.write(footer.encode("utf-8"))
7.5 The main loop
def main() -> None:
    if not _VISION_OK:
        print("✗ pyobjc Vision framework is not available.")
        print(f" {_VISION_ERR}")
        sys.exit(1)
    output_file = OUTPUT_DIR / f"meeting_{datetime.now().strftime('%Y%m%d_%H%M')}.md"
    print("▶ vision_listener")
    print(f" Output file : {output_file}")
    print(f" Poll every : {POLL_INTERVAL_SECONDS}s")
    print(f" Languages : {', '.join(RECOGNITION_LANGUAGES)}")
    print()
    print("Phase 1: Calibration (two-click)")
    print(" No overlay window. Switch to Zoom on whichever Space it lives")
    print(" on (fullscreen is fine), then click TOP-LEFT then BOTTOM-RIGHT")
    print(" corners of the transcript / caption area.\n")
    coords = calibrate_via_clicks()
    if coords is None:
        sys.exit(1)
    x, y, w, h = coords
    print(f"✓ Region locked: x={x} y={y} w={w} h={h}\n")
    print("Phase 2: Headless polling — Ctrl+C to stop.\n")
    writer = TranscriptWriter(output_file)
    poll = 0
    try:
        while True:
            poll += 1
            t0 = time.time()
            if not capture_region(x, y, w, h, TEMP_IMG):
                print(f" [{poll:03d}] screencapture failed")
                time.sleep(POLL_INTERVAL_SECONDS)
                continue
            try:
                raw = vision_ocr(TEMP_IMG)
            except Exception as exc:
                print(f" [{poll:03d}] OCR error: {exc}")
                time.sleep(POLL_INTERVAL_SECONDS)
                continue
            added = writer.update(raw)
            dt = (time.time() - t0) * 1000
            preview = added[:70].replace("\n", " ")
            if added:
                print(
                    f" [{poll:03d}] +{len(added):4d}ch "
                    f"({len(writer.committed)}ch total, {dt:.0f}ms) | {preview}"
                )
            else:
                print(f" [{poll:03d}] (no new text, {dt:.0f}ms)")
            time.sleep(POLL_INTERVAL_SECONDS)
    except KeyboardInterrupt:
        print("\n\n⏹ Stopping listener…")
        writer.finalize()
        try:
            TEMP_IMG.unlink(missing_ok=True)
        except Exception:
            pass
        print(f"✓ Transcript saved: {output_file}")
        print(f" {len(writer.committed)} chars across {poll} polls.")
8. Step-by-Step Implementation Order
Build it in this order. Each step is independently testable.
1. Constants & imports. Top of the file. Set up _VISION_OK gating around the Vision import.
2. capture_region(): wrap screencapture -R. Test by hand-coding a region (e.g. (100, 100, 400, 200)) and confirming the PNG opens.
3. vision_ocr(): feed it the PNG from step 2. Confirm it returns sensible text. Try a screenshot containing both English and Chinese.
4. _normalize_indexed() and normalize(). Sanity check that _normalize_indexed("Hello, 世界!") returns a stream like ("hello,世界!", [0,1,2,3,4,5,7,8,9]) (whitespace dropped, casing folded, fullwidth , and ! mapped to ASCII).
5. fuzzy_stitch(). Do one smoke test whose overlap reaches OVERLAP_MIN_MATCH (15 normalized characters): fuzzy_stitch("i wake up. then i get dressed. i walk to school. i do not", "to school. i do not ride a bike. i do not ride the bus.") should return "ride a bike. i do not ride the bus." A toy input with a shorter overlap (for example a shared "k l m n o") falls below the threshold, so the frame comes back verbatim; that is expected, not a bug. Do not sit in a loop tuning thresholds (see §1.5).
6. TranscriptWriter. Test by calling update() twice with overlapping strings and inspecting the file.
7. _play_lock_sound(). Run it standalone; you should hear a Pop sound once.
8. calibrate_via_clicks(). Run it. The first invocation will prompt for Accessibility permission; grant it, fully quit and relaunch the terminal, re-run. Confirm two clicks end the function and return a sensible rectangle.
9. main() glue. Wire it together. Run it against an actual Zoom meeting (or any window with rolling text). Watch the [stitch] ... debug lines to verify the dedup is doing the right thing on real captions. This is where validation happens, not in step 5.
9. Verification Checklist
Hand the script back to the user when all of these are true:
If any item fails, diagnose the actual cause before tuning constants. Most failures are permission issues or wrong coordinate systems, not algorithm problems.
10. What Not To Build
- No GUI framework. No tkinter, no PyQt, no wxPython.
- No drag-selection visual feedback. The two-click model is the model.
- No threading / async. The script is a tight serial loop and benefits from being trivially debuggable.
- No config file. Constants at the top of the file are sufficient.
- No CLI argparse. The script is invoked the same way every time.
- No retries with exponential backoff for OCR errors. A single try/except that logs and waits one poll interval is enough.
- No Markdown post-processing or summarization. That happens downstream in Obsidian / a separate agent.
11. Open Permissions That Will Bite You
Order of operations on a fresh corporate machine:
- pip install pyobjc-framework-Vision pyobjc-framework-Cocoa pyobjc-framework-Quartz
- Run the script once. It will fail at the event tap creation and print the Accessibility instructions.
- System Settings → Privacy & Security → Accessibility → enable Terminal (or iTerm2 / Ghostty / whichever you use).
- System Settings → Privacy & Security → Screen Recording → enable the same app.
- Fully quit and relaunch the terminal app (Cmd-Q, then reopen). macOS reads these permissions at process launch — a relaunch is not optional.
- Re-run. Both phases should now work.
If the corporate environment uses MDM (Jamf / Intune / Kandji) to enforce permission profiles, those two permissions may need to be requested through the IT helpdesk because the user cannot grant them themselves. There is no software workaround for this — the alternative is running the script from an environment that already has the permissions (e.g. an internal automation account).
12. Reference: file the source repo built
For a one-glance reference: the source repo’s working file is one ~480-line Python module organized as:
vision_listener.py
├── module docstring (architecture overview)
├── imports + constants
├── try/except Vision import → _VISION_OK flag
├── # --- Calibration phase ---
│   ├── _play_lock_sound()
│   └── calibrate_via_clicks()
├── # --- Vision OCR ---
│   └── vision_ocr()
├── # --- Dedup ---
│   ├── _PUNCT_MAP, _WS_RE, _STITCH_TRIM
│   ├── normalize()
│   ├── _normalize_indexed()
│   ├── _stitch_log()
│   ├── fuzzy_stitch()
│   └── class TranscriptWriter
├── # --- Screen capture ---
│   └── capture_region()
└── # --- Main ---
    └── main()
Recreate the same structure. One file. No package, no __init__.py, no setup.py. Just vision_listener.py next to a README.md.