About Polyglot Watchdog

Overview

Polyglot Watchdog is a pipeline and operator interface for building an English reference dataset and detecting localization issues across languages.

It combines automated capture, structured extraction, operator review, deterministic element pairing, and issue generation. The UI in this repository mainly acts as an operator/development console for those workflows.

The problem it addresses

Localization QA is difficult to automate because pages include dynamic content, repeated templates, and elements that should not always be compared one-to-one.

Polyglot Watchdog addresses this by capturing pages under controlled contexts, extracting stable elements, applying operator rules, and comparing eligible EN reference data with target-language captures.

Current UI status: backend phases can produce issue artifacts, but parts of the visible UI still run against mock-backed data paths.

Pipeline summary

Phase 0 — URL discovery

Builds URL inventory artifacts for a domain and run.

Phase 1 — Page capture

Captures screenshots and extracts page items (text, images, buttons, inputs) with stable identifiers.

Phase 2 — Element annotation

Stores rules such as IGNORE_ENTIRE_ELEMENT, MASK_VARIABLE, and ALWAYS_COLLECT.
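The three rule names above suggest a simple precedence: drop, mask, or keep. A minimal sketch of how such rules might be applied to extracted text, assuming the rule names from this section (the function shape and the digit-masking regex are illustrative, not the real implementation):

```python
import re
from enum import Enum

class ElementRule(Enum):
    # Rule names come from the section above; values are illustrative.
    IGNORE_ENTIRE_ELEMENT = "ignore_entire_element"
    MASK_VARIABLE = "mask_variable"
    ALWAYS_COLLECT = "always_collect"

def apply_rule(rule: ElementRule, text: str):
    """Apply a stored annotation rule to one extracted text item."""
    if rule is ElementRule.IGNORE_ENTIRE_ELEMENT:
        return None  # exclude the element from comparison entirely
    if rule is ElementRule.MASK_VARIABLE:
        # Mask dynamic fragments (here: digit runs) so "3 items" and
        # "7 items" normalize to the same comparable string.
        return re.sub(r"\d+", "<VAR>", text)
    return text  # ALWAYS_COLLECT keeps the element as-is
```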

Phase 3 — English reference dataset

Builds the eligible EN dataset by applying rules and review status filters.
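In outline, building the eligible dataset is a two-condition filter: an element must not be ruled out and must have passed review. A sketch under assumed data shapes (dict-of-dicts inputs and an "approved" status string are illustrative, not the real schema):

```python
def build_eligible_en_dataset(items, rules, review_status):
    """Filter EN items by operator rules and review status.

    items         : element_id -> text
    rules         : element_id -> rule name (e.g. "IGNORE_ENTIRE_ELEMENT")
    review_status : element_id -> "approved" | "pending" | "rejected"
    """
    eligible = {}
    for element_id, text in items.items():
        if rules.get(element_id) == "IGNORE_ENTIRE_ELEMENT":
            continue  # operator excluded this element
        if review_status.get(element_id) != "approved":
            continue  # only reviewed-and-approved items are eligible
        eligible[element_id] = text
    return eligible
```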

Phase 6 — Localization comparison

Pairs EN and target-language items, then writes issue records with contextual evidence such as IDs, bounding boxes, and screenshot references.

Scope note: pairing and issue writing are implemented, but this UI does not yet expose every backend comparison detail as a dedicated operator screen.
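Because element IDs are deterministic and language-independent (see the next section), pairing reduces to a key join rather than fuzzy matching. A sketch of that join under assumed record and issue shapes (the field names and the two issue types shown are illustrative):

```python
def compare_runs(en_items, target_items, language):
    """Pair EN and target items by shared element ID and emit issue records.

    Each item is assumed to be a dict with at least a "text" field and
    optional "bbox"/"screenshot" evidence fields.
    """
    issues = []
    for element_id, en in en_items.items():
        tgt = target_items.get(element_id)
        if tgt is None:
            # Element exists in the EN reference but not in the target capture.
            issues.append({"type": "MISSING_ELEMENT", "element_id": element_id,
                           "language": language, "en_text": en["text"]})
        elif tgt["text"] == en["text"]:
            # Identical text across languages is a likely untranslated string.
            issues.append({"type": "UNTRANSLATED_TEXT", "element_id": element_id,
                           "language": language, "en_text": en["text"],
                           "bbox": tgt.get("bbox"),
                           "screenshot": tgt.get("screenshot")})
    return issues
```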

Capture contexts and deterministic pairing

A capture context includes URL, viewport kind, state, and optional user tier. Language is a runtime dimension but is deliberately excluded from context identity.

Page and element IDs are deterministic, so matching is stable across runs. Element identity excludes text content so that translated text does not change an element's identity.
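The two identity rules above can be sketched as hash functions over the identity inputs: language is simply not an input to context identity, and text content is not an input to element identity. The key composition and truncated-SHA-256 scheme here are assumptions for illustration:

```python
import hashlib
import json

def context_id(url, viewport, state, user_tier=None):
    """Context identity: language is deliberately not an input, so EN and
    target-language captures of the same context share one ID."""
    key = json.dumps({"url": url, "viewport": viewport,
                      "state": state, "user_tier": user_tier}, sort_keys=True)
    return hashlib.sha256(key.encode()).hexdigest()[:16]

def element_id(page_id, kind, locator):
    """Element identity: text content is deliberately not an input, so a
    translated string keeps the same ID across languages and runs."""
    key = f"{page_id}|{kind}|{locator}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```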

Review, rerun, and issue generation

Review records are saved by capture context and language and are consumed when building eligible EN data. Operators can request exact-context reruns using the same runtime dimensions.
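A minimal in-memory sketch of review storage keyed by (capture context, language), illustrating why an EN review does not silently apply to a target-language capture. The class and method names are illustrative; the real backend presumably persists these records:

```python
class ReviewStore:
    """Review records keyed by (context_id, language, element_id)."""

    def __init__(self):
        self._records = {}

    def save(self, context_id, language, element_id, status):
        # Language is part of the key, so reviews are per-language.
        self._records[(context_id, language, element_id)] = status

    def status(self, context_id, language, element_id, default="pending"):
        return self._records.get((context_id, language, element_id), default)
```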

Later comparison phases generate issue artifacts for target runs.

Artifacts produced

Artifacts are stored per domain/run, with screenshots kept under deterministic page IDs.
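One plausible layout for the per-domain/per-run storage described above, with screenshots filed under deterministic page IDs. The directory names here are assumptions for illustration, not the real on-disk scheme:

```python
from pathlib import PurePosixPath

def screenshot_path(domain, run_id, page_id):
    """Build a screenshot path under an assumed artifacts/<domain>/<run>
    layout; the deterministic page_id makes the path stable across runs."""
    return (PurePosixPath("artifacts") / domain / run_id
            / "screenshots" / f"{page_id}.png")
```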

Typical operator workflow

  1. Manage seed URLs in /urls.
  2. Discover or inspect URLs.
  3. Capture EN pages.
  4. Review and annotate elements.
  5. Build the EN eligible dataset.
  6. Capture target-language pages and run comparison.
  7. Query issues in /.

Some steps are currently triggered through backend endpoints that are not yet fully represented by dedicated UI pages.

Glossary