test upload

Porting Spec — vision_listener for Corporate macOS

Audience: a fresh Claude Code instance running on a different macOS machine, with no access to this project’s history. Read this whole file before writing any code. The “Failed Paths” section exists specifically so you do not waste hours re-discovering them.


0. Mission Brief

Build a single Python script, vision_listener.py, that:

  1. Lets the user calibrate a screen rectangle by clicking the top-left and bottom-right corners of any region (typically Zoom’s closed-caption / transcript panel) — even when Zoom is fullscreen on its own macOS Space.
  2. Every 15 seconds, screen-captures only that rectangle, runs macOS Vision OCR on the PNG, and appends any net-new text to a per-session Markdown file with Obsidian-friendly YAML front-matter.
  3. Survives the rolling-caption pattern where the same line reappears across many polls with OCR drift, in both English and Simplified Chinese (often mixed in the same meeting).

Non-goals: speaker diarization, real-time UI, post-processing, anything that touches Zoom’s APIs directly.


1. Critical Failed Paths — DO NOT REPEAT

These were tried, debugged, and abandoned in the previous build. Every one of them looks attractive on paper. They are not. If you find yourself reaching for any of these, stop and re-read this section.

1.1 ❌ Zoom Accessibility / AppleScript AX-tree walking

The first version of this project (archive/zoom_listener.py in the source repo, not present here) polled Zoom’s AXTable named “Closed caption subtitles” via AppleScript every few seconds.

Why it died:

  • Zombie / ghost AX nodes. Zoom keeps stale closed-caption tables attached to old window handles. Walking them returns inconsistent garbage.
  • -600 “application isn’t running” errors fire intermittently when Zoom switches between fullscreen and windowed modes, even though the app is plainly running.
  • Breaks on every Zoom version bump because the AX hierarchy moves.
  • Locked to Zoom only — useless for Teams / Meet / Webex.

Lesson: screen-pixel OCR is the only stable substrate. Do not try to tap into Zoom’s internals.

1.2 ❌ Tkinter transparent fullscreen overlay for click-and-drag selection

Multiple iterations: a Tk() window with attributes('-alpha', 0.3) and attributes('-fullscreen', True), draw a green rectangle on a canvas as the user drags, return the bbox.

Why it died:

  • macOS Spaces routing. Zoom in fullscreen lives on its own Space. A tkinter window created from a terminal on the desktop Space appears on the desktop Space, yanking the user away from Zoom. Every workaround (NSWindowCollectionBehaviorCanJoinAllSpaces, NSWindowCollectionBehaviorFullScreenAuxiliary via PyObjC) is unreliable in timing — the behavior is set on a NSWindow that tkinter has already shown, and the OS has already assigned it to the wrong Space.
  • -fullscreen creates a brand new Space, taking the user even further from Zoom.
  • overrideredirect(True) + manual geometry dodges fullscreen but the window still won’t receive mouse events after the user swipes back to Zoom’s Space.
  • '-[NSApplication macOSVersion]: unrecognized selector' crash: PyObjC’s NSApplication.sharedApplication() installs a plain NSApplication, which then prevents Tk from installing its own TKApplication subclass. Tk later calls macOSVersion on the wrong class and crashes. Fix: tk.Tk() must run before any PyObjC code touches NSApp. Even with this fixed, the Spaces problem above remains.
  • Beach-balling the cursor. Calling root.destroy() from inside a tkinter event handler hangs the cursor in the spinning state. The fix is withdraw() + after(0, root.quit) and only destroy from outside — but again, this is moot because the window can’t see Zoom.

Lesson: do not draw an overlay window. The corporate Claude will try this; resist it. The shipped solution uses no window at all — see Module 1.

1.3 ❌ Pillow / mss for screen capture

You do not need Pillow, mss, or any third-party capture library. The system tool screencapture -x -R x,y,w,h out.png is faster, has zero dependencies beyond what macOS already ships, and reads the correct global coordinate system that the Vision pipeline expects.

1.4 ❌ Iterative / sliding-window / word-tokenized deduplication

The deduper went through six failed designs before landing on the current one. Specifically do not implement any of:

  • Exact longest_overlap(suffix, prefix) — breaks on capitalization, punctuation, and OCR drift.
  • Single-pass find_longest_match at the seam only — misses overlaps that sit in the middle of a long Chinese frame because OCR noise scatters mismatches around the boundary.
  • Iterative peeling with MIN_FIRST=12 / MIN_REST=5 — over-trims legitimate new content.
  • Word-level tokenizer with a 3-tier (exact / prefix / ratio≥0.8) word matcher — works fine for English, completely broken for Chinese because CJK has no inter-word spaces, so the entire frame becomes one token.
  • Sliding-window normalization with a ratio ≥ 0.85 threshold — still fails on long Chinese frames where the overlap is real but the per-window similarity dips below threshold.

The only approach that works is the anywhere-overlap on a character-normalized stream described in Module 3. Implement that.

1.5 ❌ Local synthetic test loops

The user has explicit feedback on this: do not sit in a python -c "..." loop running synthetic strings through the deduper to “validate” thresholds. The user can re-run the script against real Zoom captions in seconds, and the live signal is more authoritative than any toy fixture. Ship a reasonable implementation and hand it back. One syntax / import sanity check is fine; deeper local validation only when (a) live testing is genuinely expensive, or (b) the user explicitly asks.


2. Architecture (Two Phases)

┌─────────────────────────────────────────────────────────────┐
│ Phase 1: Calibration (one-shot, no window)                  │
│                                                             │
│   Quartz CGEventTap (active) → swallow 2 left-mouse-downs   │
│   → record (x1,y1) and (x2,y2) → derive (x,y,w,h)           │
│   → play afplay confirmation on each click                  │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│ Phase 2: Headless polling loop (every 15s, until Ctrl+C)    │
│                                                             │
│   screencapture -R → PNG → VNRecognizeTextRequest           │
│   → fuzzy_stitch(committed_tail, new_text)                  │
│   → append net-new suffix to meeting_YYYYMMDD_HHMM.md       │
└─────────────────────────────────────────────────────────────┘

State that survives across iterations: a single in-memory committed string (the entire transcript so far) used as the dedup haystack; a file handle (re-opened append-mode each tick) for persistence.


3. Dependencies

3.1 System requirements

  • macOS 13 (Ventura) or newer. Older macOS still runs but loses per-region language auto-detection. macOS 11 is the hard floor for VNRecognizeTextRequest.
  • Python 3.10+ (the script uses tuple[int, ...] | None PEP 604 syntax).
  • Two macOS permissions for the terminal/IDE that runs the script:
    • Accessibility (System Settings → Privacy & Security → Accessibility) — required by CGEventTapCreate to receive global mouse events. Without it, the tap is silently rejected.
    • Screen Recording (System Settings → Privacy & Security → Screen Recording) — required by screencapture to read pixels of other apps’ windows. Without it, the captured PNG will be blank.
    • First run will trigger the OS prompt for both. After granting, fully quit and relaunch the terminal — the permission is read at process launch, not live.

3.2 Python packages

pip install pyobjc-framework-Vision pyobjc-framework-Cocoa pyobjc-framework-Quartz

That is the entire dependency list. No tkinter, no Pillow, no mss, no Pillow-SIMD, no opencv. If the corporate pip is locked down:

  • Try pip install --user <pkg>.
  • If that fails, ask the user for the corp PyPI mirror URL and use pip install --index-url <url> ....
  • All three pyobjc packages are pure-Python wheels — no compilation needed.

Standard-library imports only beyond that: difflib, re, subprocess, sys, tempfile, time, unicodedata, datetime.datetime, pathlib.Path. Import unicodedata lazily inside the normalizer if you want to keep the top-level imports tidy (the source does this).

3.3 No-go libraries

Library Why not
tkinter Spaces routing breaks the overlay (see §1.2).
Pillow / mss / pyscreenshot screencapture -R is sufficient and faster.
pytesseract Vision is more accurate, especially for Chinese, and ships with the OS.
pyautogui Pulls in Pillow and uses a different coordinate origin.

4. Module 1 — Calibration (Quartz Event Tap)

Goal: capture two global left-mouse-down events, anywhere on any display, in any Space (including fullscreen Zoom), without drawing a window.

4.1 Why Quartz event tap

A CGEventTapCreate(kCGSessionEventTap, ..., kCGEventTapOptionDefault, mask, callback, None) listens to all mouse events session-wide. It has no UI. Spaces, fullscreen apps, and multi-monitor setups are invisible to it — the OS just hands you the raw CGEvent and CGEventGetLocation(event) returns the global top-left coordinate, which is exactly the coordinate system screencapture -R consumes.

kCGEventTapOptionDefault makes the tap active, meaning the callback can return None to swallow the event — critical, because we do not want the user’s calibration clicks to also activate Zoom UI under the cursor.

4.2 Implementation

def calibrate_via_clicks() -> tuple[int, int, int, int] | None:
    try:
        import Quartz
    except ImportError:
        print("✗ pyobjc-framework-Quartz missing")
        return None

    clicks: list[tuple[int, int]] = []

    def callback(proxy, type_, event, refcon):
        # macOS auto-disables slow taps; re-enable on those event types.
        if type_ in (
            Quartz.kCGEventTapDisabledByTimeout,
            Quartz.kCGEventTapDisabledByUserInput,
        ):
            Quartz.CGEventTapEnable(tap, True)
            return event
        if type_ != Quartz.kCGEventLeftMouseDown:
            return event

        loc = Quartz.CGEventGetLocation(event)
        x, y = int(round(loc.x)), int(round(loc.y))
        clicks.append((x, y))
        _play_lock_sound()  # confirmation on EVERY click, not just the second
        if len(clicks) == 1:
            print(f"  ✓ top-left  @ ({x:5d}, {y:5d})")
            print("  Now click the BOTTOM-RIGHT corner…")
        else:
            print(f"  ✓ bot-right @ ({x:5d}, {y:5d})")
            Quartz.CFRunLoopStop(Quartz.CFRunLoopGetCurrent())
        return None  # swallow the click so Zoom doesn't react

    mask = Quartz.CGEventMaskBit(Quartz.kCGEventLeftMouseDown)
    tap = Quartz.CGEventTapCreate(
        Quartz.kCGSessionEventTap,
        Quartz.kCGHeadInsertEventTap,
        Quartz.kCGEventTapOptionDefault,  # active tap, can suppress
        mask,
        callback,
        None,
    )
    if not tap:
        print("✗ Could not create event tap. Grant Accessibility "
              "permission to your terminal and re-run.")
        return None

    source = Quartz.CFMachPortCreateRunLoopSource(None, tap, 0)
    Quartz.CFRunLoopAddSource(
        Quartz.CFRunLoopGetCurrent(), source, Quartz.kCFRunLoopCommonModes
    )
    Quartz.CGEventTapEnable(tap, True)

    print("  Switch to Zoom (any Space, fullscreen OK).")
    print("  Click the TOP-LEFT corner of the transcript region…")
    try:
        Quartz.CFRunLoopRun()
    finally:
        Quartz.CGEventTapEnable(tap, False)

    if len(clicks) < 2:
        return None
    (x1, y1), (x2, y2) = clicks
    x, y = min(x1, x2), min(y1, y2)
    w, h = abs(x2 - x1), abs(y2 - y1)
    if w < MIN_REGION_DIMENSION or h < MIN_REGION_DIMENSION:
        print(f"✗ Region too small ({w}×{h}). Aborted.")
        return None
    return (x, y, w, h)

4.3 Pitfalls

  • Tap returns None at CGEventTapCreate. Always Accessibility permission. The OS does not raise — you just get a NULL tap. Print a guidance message.
  • kCGEventTapDisabledByTimeout fires if your callback takes too long. Keep the callback fast — do not block on afplay (use subprocess.Popen, not subprocess.run). Re-enable the tap on this event type or you will silently lose all subsequent clicks.
  • Coordinate min/max from (x1,y1) and (x2,y2) — the user might click bottom-right first and top-left second; tolerate it.
  • Reject tiny rectangles (MIN_REGION_DIMENSION = 20) — accidental double-click in the same spot would otherwise produce a 0×0 region and screencapture would silently fail.
  • The clicks are global, so closing the corporate VPN client mid- calibration with a stray click is possible. Tell the user this.

5. Module 2 — Vision OCR

Goal: turn a PNG file path into a plain-text string of recognized lines. Bilingual EN + zh-Hans.

5.1 PyObjC imports

from Foundation import NSURL
from Vision import VNRecognizeTextRequest, VNImageRequestHandler

Wrap the import in try/except and gate downstream code on a _VISION_OK flag — the corporate machine may not have pyobjc-Vision installed yet.

5.2 Recognition request

RECOGNITION_LANGUAGES = ["zh-Hans", "en-US"]

def vision_ocr(image_path: Path) -> str:
    nsurl = NSURL.fileURLWithPath_(str(image_path))

    request = VNRecognizeTextRequest.alloc().init()
    request.setRecognitionLevel_(0)        # 0 = accurate, 1 = fast. USE 0.
    request.setUsesLanguageCorrection_(True)
    try:
        request.setRecognitionLanguages_(RECOGNITION_LANGUAGES)
    except Exception:
        pass  # very old macOS may not expose this language

    # macOS 13+: let Vision pick the best language per text region
    # instead of forcing the priority list onto every line. Critical
    # for bilingual meetings where Chinese captions and English chat
    # share a single screenshot.
    if hasattr(request, "setAutomaticallyDetectsLanguage_"):
        try:
            request.setAutomaticallyDetectsLanguage_(True)
        except Exception:
            pass

    handler = VNImageRequestHandler.alloc().initWithURL_options_(nsurl, {})
    success, _err = handler.performRequests_error_([request], None)
    if not success:
        return ""

    lines: list[str] = []
    for obs in (request.results() or []):
        candidates = obs.topCandidates_(1)
        if candidates and len(candidates) > 0:
            lines.append(str(candidates[0].string()))
    return "\n".join(lines)

5.3 Why these settings

  • setRecognitionLevel_(0) (accurate). Caption text is small and often anti-aliased; the fast mode misses ~30 % of Chinese characters.
  • setUsesLanguageCorrection_(True) lets Vision smooth obvious OCR errors using its language model — vital for Chinese.
  • ["zh-Hans", "en-US"] order matters. The first entry biases the recognizer. Reverse it and Chinese characters get parsed as garbled Latin.
  • setAutomaticallyDetectsLanguage_(True) (macOS 13+) overrides the priority list per-region, so a screenshot with both Chinese and English in different boxes is handled correctly. Always feature-detect with hasattr — older OS will not have it.

5.4 Capture helper

def capture_region(x: int, y: int, w: int, h: int, out_path: Path) -> bool:
    result = subprocess.run(
        ["screencapture", "-x", "-R", f"{x},{y},{w},{h}", str(out_path)],
        capture_output=True,
    )
    return (
        result.returncode == 0
        and out_path.exists()
        and out_path.stat().st_size > 0
    )

-x suppresses the camera shutter sound. -R x,y,w,h is the region rectangle in global top-left coordinates — the same numbers the event tap gave us.


6. Module 3 — The Deduplication (“Holy Grail”)

Goal: given the transcript so far (committed) and a freshly OCR’d frame (new_text), return only the suffix of new_text that is genuinely new. Must work for both English and Chinese, and tolerate OCR drift, punctuation drift, and rolling-caption windows.

6.1 The shape of the input

Zoom’s caption rail is a 2-line rolling window. Across consecutive polls you see frames like:

poll N:    "i wake up. then i get dressed. i walk to school. i do not"
poll N+1:  "to school. i do not ride a bike. i do not ride the bus."
poll N+2:  "ride the bus. i like to go to school. it rings. i do not"

The same text reappears with a sliding window plus light OCR drift (spdf vs spf, fullwidth vs halfwidth ,, capitalization flips). For Chinese, the drift is single-character substitutions and fullwidth/halfwidth punctuation flips.

6.2 The algorithm: anywhere-overlap on a normalized character stream

1. Take the last OVERLAP_TAIL_RAW_CHARS (= 500) raw chars of `committed`.
   Call it the "tail". This is the search haystack.
2. Normalize both `tail` and `new_text` into a character stream:
     - strip whitespace
     - lowercase
     - NFKC-normalize  (folds fullwidth digits/Latin to halfwidth)
     - translate fullwidth CJK punctuation to ASCII via _PUNCT_MAP
   While normalizing, remember the back-pointer from each normalized
   char to its original raw index — we need this to slice the original
   new_text at exactly the right cut point.
3. Run difflib.SequenceMatcher.find_longest_match over the entire
   (norm_tail, norm_new) — find the longest contiguous shared run
   ANYWHERE in either side. Not just at the seam — long Chinese frames
   have OCR misreads scattered around the boundary, so the true overlap
   is rarely flush against the edge.
4. If match.size >= OVERLAP_MIN_MATCH (= 15 chars):
     cut_norm = match.b + match.size
     if cut_norm covers the whole frame → drop entire frame (return "")
     else slice new_text after raw_new_idx[cut_norm - 1] + 1
     lstrip a fixed set of trailing-overlap punctuation
     return that suffix
5. Else (no contiguous run of 15 chars):
     sum every matching block from matcher.get_matching_blocks()
     coverage = covered / len(norm_new)
     if coverage >= OVERLAP_CONTAINMENT_RATIO (= 0.9):
       this frame is stale (≥90% of its content already lives in the
       tail, just shuffled / split across multiple matches) → drop
6. Else: the frame is genuinely new → return verbatim

Note the asymmetry between step 4 and step 5: step 4 demands a single contiguous 15-char run (cheap, common case), step 5 is a fallback for when noise has fragmented the overlap into many small matches but it still covers ≥90 %.

6.3 The normalizer

Two functions: a simple one for collapsing whitespace, and an indexed one that preserves raw-offset back-pointers.

import re, unicodedata, difflib

_WS_RE = re.compile(r"\s+")
_STITCH_TRIM = " ,.;:!?\n\t-—…\"'),。!?;:、)】」』"

# Fullwidth → halfwidth (and CJK-only marks → ASCII). NFKC handles
# fullwidth digits and Latin letters automatically; this map fills in
# the punctuation that NFKC leaves alone.
_PUNCT_MAP = str.maketrans({
    ",": ",", "。": ".", "!": "!", "?": "?", ";": ";", ":": ":",
    "(": "(", ")": ")", "【": "[", "】": "]",
    "「": '"', "」": '"', "『": '"', "』": '"',
    "、": ",", "~": "~",
    "“": '"', "”": '"', "‘": "'", "’": "'",
})

def normalize(text: str) -> str:
    return _WS_RE.sub(" ", text).strip()

def _normalize_indexed(text: str) -> tuple[str, list[int]]:
    """Returns (normalized, raw_positions) where normalized[k] came from
    text[raw_positions[k]]. Preserves the mapping so a cut found in
    normalized space can be applied at the correct raw offset."""
    norm_chars: list[str] = []
    raw_idx: list[int] = []
    for i, ch in enumerate(text):
        if ch.isspace():
            continue
        nfkc = unicodedata.normalize("NFKC", ch)
        folded = nfkc.translate(_PUNCT_MAP).lower()
        for c in folded:           # NFKC may expand 1 char to many
            norm_chars.append(c)
            raw_idx.append(i)
    return "".join(norm_chars), raw_idx

6.4 The stitcher

POLL_INTERVAL_SECONDS = 15
OVERLAP_TAIL_RAW_CHARS = 500
OVERLAP_MIN_MATCH = 15
OVERLAP_CONTAINMENT_RATIO = 0.9
STITCH_DEBUG = True

def _stitch_log(msg: str) -> None:
    if STITCH_DEBUG:
        print(f"  [stitch] {msg}")

def fuzzy_stitch(committed: str, new_text: str) -> str:
    if not committed:
        _stitch_log(f"first frame, append {len(new_text)}c")
        return new_text
    if not new_text:
        return ""

    committed_tail = committed[-OVERLAP_TAIL_RAW_CHARS:]
    norm_tail, _ = _normalize_indexed(committed_tail)
    norm_new, raw_new_idx = _normalize_indexed(new_text)
    if not norm_tail or not norm_new:
        _stitch_log(f"empty normalized side, append {len(new_text)}c verbatim")
        return new_text

    matcher = difflib.SequenceMatcher(None, norm_tail, norm_new, autojunk=False)
    match = matcher.find_longest_match(0, len(norm_tail), 0, len(norm_new))

    if match.size >= OVERLAP_MIN_MATCH:
        cut_norm = match.b + match.size
        if cut_norm >= len(raw_new_idx):
            _stitch_log(f"match={match.size}c covers full frame → drop")
            return ""
        cut_raw = raw_new_idx[cut_norm - 1] + 1
        suffix = new_text[cut_raw:].lstrip(_STITCH_TRIM)
        _stitch_log(
            f"match={match.size}c at norm[{match.b}:{cut_norm}] → append {len(suffix)}c"
        )
        return suffix

    covered = sum(blk.size for blk in matcher.get_matching_blocks())
    coverage = covered / len(norm_new) if norm_new else 0.0
    if coverage >= OVERLAP_CONTAINMENT_RATIO:
        _stitch_log(f"stale frame ({coverage:.0%} contained), drop")
        return ""

    _stitch_log(
        f"no overlap (longest={match.size}c, coverage={coverage:.0%}) → "
        f"append {len(new_text)}c"
    )
    return new_text

6.5 Why each constant is what it is

Constant Value Why
POLL_INTERVAL_SECONDS 15 Long enough that captions roll a meaningful amount; short enough that you don’t lose lines that scroll off-screen between ticks.
OVERLAP_TAIL_RAW_CHARS 500 More than two caption-rail frames’ worth, but small enough that SequenceMatcher stays cheap.
OVERLAP_MIN_MATCH 15 Below ~12 chars the match is often coincidental (common short Chinese phrases or English stopword runs). 15 is the empirical sweet spot.
OVERLAP_CONTAINMENT_RATIO 0.9 Lower (0.7–0.8) drops legitimate new content. 0.95 lets through stale frames.
STITCH_DEBUG True Always leave on; per-frame decisions are how you debug the live run.

6.6 The STITCH_DEBUG log lines

Every call prints exactly one decision line so the user can read the live transcript and see why each frame was kept, trimmed, or dropped. Keep this on by default — it cost nothing and is the only window into the dedup behavior. The user has explicitly preferred this over running synthetic test fixtures locally.


7. Module 4 — UX & File Handling

7.1 Confirmation sound on every calibration click

LOCK_SOUND = "/System/Library/Sounds/Pop.aiff"

def _play_lock_sound() -> None:
    """Fire-and-forget audio cue. Uses Popen, NOT run, because the
    Quartz event tap callback must return fast — blocking inside the
    callback risks the tap being disabled for slowness."""
    try:
        subprocess.Popen(
            ["afplay", LOCK_SOUND],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except Exception:
        pass

Call this from inside the calibration callback on every click, not just the second one. (The user explicitly wants two beeps — one per corner — so they get instant feedback on each click.) Use a path under /System/Library/Sounds/ (e.g. Pop.aiff, Tink.aiff, Glass.aiff) because those paths are stable across macOS versions. Do not use private framework paths Gemini or earlier suggestions might propose; they break on OS upgrades.

7.2 Per-session timestamped output file

OUTPUT_DIR = Path(__file__).resolve().parent

# Inside main(), at the start of the run:
output_file = OUTPUT_DIR / f"meeting_{datetime.now().strftime('%Y%m%d_%H%M')}.md"

One file per script invocation. Do not reuse a single raw_meeting_transcript.md — the user does multiple meetings per day and overwrites are unrecoverable.

7.3 Obsidian-friendly YAML front-matter

The output file is meant to be dropped into an Obsidian vault. Write this header once at the start of the run:

---
type: meeting-transcript
date: 2026-04-13
started: "01:26:59"
source: zoom-vision-ocr
---

# Meeting Transcript — 2026-04-13 01:26

## Transcript

Then append each net-new chunk as its own paragraph (chunk + \n\n). On KeyboardInterrupt, append a footer:


---

_Ended: 01:28:51_

7.4 The TranscriptWriter class

class TranscriptWriter:
    """Append-only Markdown writer with OCR-aware deduplication state.

    Holds the running `committed` string in memory purely as the dedup
    haystack. Each net-new tail suffix is appended to disk as its own
    Markdown paragraph so the file stays human-readable.
    """

    def __init__(self, path: Path):
        self.path = path
        self.start_time = datetime.now()
        self.committed: str = ""
        self._write_header()

    def _write_header(self) -> None:
        header = (
            "---\n"
            "type: meeting-transcript\n"
            f"date: {self.start_time.strftime('%Y-%m-%d')}\n"
            f'started: "{self.start_time.strftime("%H:%M:%S")}"\n'
            "source: zoom-vision-ocr\n"
            "---\n\n"
            f"# Meeting Transcript — {self.start_time.strftime('%Y-%m-%d %H:%M')}\n\n"
            "## Transcript\n\n"
        )
        self.path.write_bytes(header.encode("utf-8"))

    def update(self, raw_text: str) -> str:
        text = normalize(raw_text)
        if not text:
            return ""
        if not self.committed:
            self.committed = text
            self._append_paragraph(text)
            return text
        suffix = fuzzy_stitch(self.committed, text)
        if not suffix:
            return ""
        # Join with a space so future overlap searches span the boundary,
        # but write a paragraph break to disk for readability.
        self.committed = self.committed + " " + suffix
        self._append_paragraph(suffix)
        return suffix

    def _append_paragraph(self, text: str) -> None:
        with self.path.open("ab") as f:
            f.write((text + "\n\n").encode("utf-8"))

    def finalize(self) -> None:
        footer = f"\n---\n\n_Ended: {datetime.now().strftime('%H:%M:%S')}_\n"
        with self.path.open("ab") as f:
            f.write(footer.encode("utf-8"))

7.5 The main loop

def main() -> None:
    if not _VISION_OK:
        print("✗ pyobjc Vision framework is not available.")
        print(f"  {_VISION_ERR}")
        sys.exit(1)

    output_file = OUTPUT_DIR / f"meeting_{datetime.now().strftime('%Y%m%d_%H%M')}.md"

    print("▶ vision_listener")
    print(f"  Output file : {output_file}")
    print(f"  Poll every  : {POLL_INTERVAL_SECONDS}s")
    print(f"  Languages   : {', '.join(RECOGNITION_LANGUAGES)}")
    print()
    print("Phase 1: Calibration (two-click)")
    print("  No overlay window. Switch to Zoom on whichever Space it lives")
    print("  on (fullscreen is fine), then click TOP-LEFT then BOTTOM-RIGHT")
    print("  corners of the transcript / caption area.\n")

    coords = calibrate_via_clicks()
    if coords is None:
        sys.exit(1)
    x, y, w, h = coords
    print(f"✓ Region locked: x={x} y={y} w={w} h={h}\n")
    print("Phase 2: Headless polling — Ctrl+C to stop.\n")

    writer = TranscriptWriter(output_file)
    poll = 0
    try:
        while True:
            poll += 1
            t0 = time.time()
            if not capture_region(x, y, w, h, TEMP_IMG):
                print(f"  [{poll:03d}] screencapture failed")
                time.sleep(POLL_INTERVAL_SECONDS)
                continue
            try:
                raw = vision_ocr(TEMP_IMG)
            except Exception as exc:
                print(f"  [{poll:03d}] OCR error: {exc}")
                time.sleep(POLL_INTERVAL_SECONDS)
                continue
            added = writer.update(raw)
            dt = (time.time() - t0) * 1000
            preview = added[:70].replace("\n", " ")
            if added:
                print(
                    f"  [{poll:03d}] +{len(added):4d}ch  "
                    f"({len(writer.committed)}ch total, {dt:.0f}ms) | {preview}"
                )
            else:
                print(f"  [{poll:03d}]  (no new text, {dt:.0f}ms)")
            time.sleep(POLL_INTERVAL_SECONDS)
    except KeyboardInterrupt:
        print("\n\n⏹  Stopping listener…")
        writer.finalize()
        try:
            TEMP_IMG.unlink(missing_ok=True)
        except Exception:
            pass
        print(f"✓ Transcript saved: {output_file}")
        print(f"  {len(writer.committed)} chars across {poll} polls.")

8. Step-by-Step Implementation Order

Build it in this order. Each step is independently testable.

  1. Constants & imports. Top of the file. Set up _VISION_OK gating around the Vision import.
  2. capture_region() — wrap screencapture -R. Test by hand-coding a region (e.g. (100, 100, 400, 200)) and confirming the PNG opens.
  3. vision_ocr() — feed it the PNG from step 2. Confirm it returns sensible text. Try a screenshot containing both English and Chinese.
  4. _normalize_indexed() and normalize(). Sanity check that _normalize_indexed("Hello, 世界!") returns a stream like ("hello,世界!", [0,1,2,3,4,5,7,8,9]) (whitespace and casing folded, fullwidth and mapped to ASCII).
  5. fuzzy_stitch(). Do one smoke test: fuzzy_stitch("a b c d e f g h i j k l m n o", "k l m n o p q r") should return " p q r" or similar. Do not sit in a loop tuning thresholds — see §1.5.
  6. TranscriptWriter. Test by calling update() twice with overlapping strings and inspecting the file.
  7. _play_lock_sound(). Run it standalone — you should hear a Pop sound once.
  8. calibrate_via_clicks(). Run it. The first invocation will prompt for Accessibility permission; grant it, fully quit and relaunch the terminal, re-run. Confirm two clicks end the function and return a sensible rectangle.
  9. main() glue. Wire it together. Run it against an actual Zoom meeting (or any window with rolling text). Watch the [stitch] ... debug lines to verify the dedup is doing the right thing on real captions. This is where validation happens — not in step 5.

9. Verification Checklist

Hand the script back to the user when all of these are true:

If any item fails, diagnose the actual cause before tuning constants. Most failures are permission issues or wrong coordinate systems, not algorithm problems.


10. What Not To Build

  • No GUI framework. No tkinter, no PyQt, no wxPython.
  • No drag-selection visual feedback. The two-click model is the model.
  • No threading / async. The script is a tight serial loop and benefits from being trivially debuggable.
  • No config file. Constants at the top of the file are sufficient.
  • No CLI argparse. The script is invoked the same way every time.
  • No retries with exponential backoff for OCR errors. A single try/except that logs and waits one poll interval is enough.
  • No Markdown post-processing or summarization. That happens downstream in Obsidian / a separate agent.

11. Open Permissions That Will Bite You

Order of operations on a fresh corporate machine:

  1. pip install pyobjc-framework-Vision pyobjc-framework-Cocoa pyobjc-framework-Quartz
  2. Run the script once. It will fail at the event tap creation and print the Accessibility instructions.
  3. System Settings → Privacy & Security → Accessibility → enable Terminal (or iTerm2 / Ghostty / whichever you use).
  4. System Settings → Privacy & Security → Screen Recording → enable the same app.
  5. Fully quit and relaunch the terminal app (Cmd-Q, then reopen). macOS reads these permissions at process launch — a relaunch is not optional.
  6. Re-run. Both phases should now work.

If the corporate environment uses MDM (Jamf / Intune / Kandji) to enforce permission profiles, those two permissions may need to be requested through the IT helpdesk because the user cannot grant them themselves. There is no software workaround for this — the alternative is running the script from an environment that already has the permissions (e.g. an internal automation account).


12. Reference: file the source repo built

For a one-glance reference: the source repo’s working file is one ~480-line Python module organized as:

vision_listener.py
├── module docstring (architecture overview)
├── imports + constants
├── try/except Vision import → _VISION_OK flag
├── # --- Calibration phase ---
│   ├── _play_lock_sound()
│   └── calibrate_via_clicks()
├── # --- Vision OCR ---
│   └── vision_ocr()
├── # --- Dedup ---
│   ├── _PUNCT_MAP, _WS_RE, _STITCH_TRIM
│   ├── normalize()
│   ├── _normalize_indexed()
│   ├── _stitch_log()
│   ├── fuzzy_stitch()
│   └── class TranscriptWriter
├── # --- Screen capture ---
│   └── capture_region()
└── # --- Main ---
    └── main()

Recreate the same structure. One file. No package, no __init__.py, no setup.py. Just vision_listener.py next to a README.md.