About Polyglot Watchdog
Overview
Polyglot Watchdog is a pipeline and operator interface for building an English reference dataset and detecting localization issues across languages.
It combines automated capture, structured extraction, operator review, deterministic element pairing, and issue generation. This repository UI mainly acts as an operator/development console for those workflows.
The problem it addresses
Localization QA is difficult to automate because pages include dynamic content, repeated templates, and elements that should not always be compared one-to-one.
Polyglot Watchdog addresses this by capturing pages under controlled contexts, extracting stable elements, applying operator rules, and comparing eligible EN reference data with target-language captures.
Current UI status: backend phases can produce issue artifacts, but parts of the visible UI still rely on mock-backed data paths.
Pipeline summary
Phase 0 — URL discovery
Builds URL inventory artifacts for a domain and run.
Phase 1 — Page capture
Captures screenshots and extracts page items (text, images, buttons, inputs) with stable identifiers.
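A captured page item might look like the following sketch. The field names (`item_id`, `bbox`, `screenshot`, and so on) are illustrative assumptions, not the repository's actual schema:

```python
# A captured page item, sketched as a plain dict.
# All field names and values here are illustrative assumptions.
page_item = {
    "item_id": "a3f9c2d1e4b5",            # deterministic element identifier
    "page_id": "7b2e9f0c4a1d",            # deterministic page identifier
    "type": "button",                      # text | image | button | input
    "selector": "header > nav > a.signup", # location on the page
    "bbox": {"x": 912, "y": 24, "w": 96, "h": 32},
    "text": "Sign up",                     # captured content; not part of identity
    "screenshot": "pages/7b2e9f0c4a1d.png",
}
```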
Phase 2 — Element annotation
Stores rules such as IGNORE_ENTIRE_ELEMENT, MASK_VARIABLE, and ALWAYS_COLLECT.
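A stored rule record might look like this minimal sketch. The rule names come from the list above, but the record shape and the `.cart-count` selector are assumptions:

```python
# Rule names taken from the text; the record shape is an assumption.
RULES = ("IGNORE_ENTIRE_ELEMENT", "MASK_VARIABLE", "ALWAYS_COLLECT")

# A hypothetical annotation decision made by an operator.
rule = {
    "rule": "MASK_VARIABLE",
    "selector": ".cart-count",  # hypothetical target selector
    "reason": "numeric badge varies per session",
}
```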
Phase 3 — English reference dataset
Builds the eligible EN dataset by applying rules and review status filters.
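A minimal sketch of the Phase 3 filter, assuming simple dict-based records and an `"approved"` review status (both assumptions, not the actual schema):

```python
def build_eligible_dataset(items, rules, reviews):
    """Sketch of Phase 3: keep EN items that pass rules and review filters.

    The shapes of `items`, `rules`, and `reviews` are assumptions
    made for illustration.
    """
    # Selectors that an operator has ruled out entirely.
    ignored = {r["selector"] for r in rules
               if r["rule"] == "IGNORE_ENTIRE_ELEMENT"}
    eligible = []
    for item in items:
        if item["selector"] in ignored:
            continue                      # dropped by operator rule
        if reviews.get(item["item_id"]) != "approved":
            continue                      # dropped by review status filter
        eligible.append(item)
    return eligible
```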
Phase 6 — Localization comparison
Pairs EN and target-language items, then writes issue records with contextual evidence such as IDs, bounding boxes, and screenshot references.
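The pairing-and-issue step could be sketched like this. Pairing by deterministic `item_id` follows the text; the issue kinds and field names are illustrative assumptions:

```python
def compare(en_items, target_items):
    """Sketch of Phase 6: pair items by deterministic item_id, then
    emit issue records with contextual evidence.

    The issue kinds (MISSING_ELEMENT, UNTRANSLATED_TEXT) and field
    names are assumptions, not the repository's actual taxonomy.
    """
    target_by_id = {t["item_id"]: t for t in target_items}
    issues = []
    for en in en_items:
        target = target_by_id.get(en["item_id"])
        if target is None:
            # Element exists in the EN reference but not the target capture.
            issues.append({
                "kind": "MISSING_ELEMENT",
                "item_id": en["item_id"],
                "en_bbox": en["bbox"],
                "en_screenshot": en["screenshot"],
            })
        elif target["text"] == en["text"]:
            # Identical text in the target language suggests an
            # untranslated string.
            issues.append({
                "kind": "UNTRANSLATED_TEXT",
                "item_id": en["item_id"],
                "text": en["text"],
            })
    return issues
```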
Scope note: pairing and issue writing are implemented, but this UI does not yet expose every backend comparison detail as a dedicated operator screen.
Capture contexts and deterministic pairing
A capture context includes the URL, viewport kind, state, and an optional user tier. Language is a runtime configuration dimension but is excluded from context identity.
Page and element IDs are deterministic, so matching stays stable across runs. Element identity excludes text content, so translated text does not change an element's identity.
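A minimal sketch of how such deterministic IDs could be derived, assuming a truncated SHA-256 hash over the identity fields (the hash scheme and truncation are assumptions; which fields are included or excluded follows the text):

```python
import hashlib


def page_id(url, viewport, state, user_tier=None):
    """Deterministic page ID from context identity fields.

    Language is deliberately not a parameter: it is runtime
    configuration and excluded from context identity.
    """
    identity = "|".join([url, viewport, state, user_tier or ""])
    return hashlib.sha256(identity.encode()).hexdigest()[:12]


def item_id(domain, url, selector, bbox, item_type):
    """Deterministic element ID from domain/URL/selector/bbox/type.

    Text content is excluded so that translation does not change
    an element's identity.
    """
    identity = "|".join([domain, url, selector,
                         ",".join(map(str, bbox)), item_type])
    return hashlib.sha256(identity.encode()).hexdigest()[:12]
```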
Review, rerun, and issue generation
Review records are saved per capture context and language, and are consumed when building the eligible EN dataset. Operators can request exact-context reruns that use the same runtime dimensions.
Later comparison phases generate issue artifacts for target runs.
Artifacts produced
- url_inventory.json
- page_screenshots.json
- collected_items.json
- universal_sections.json
- template_rules.json
- eligible_dataset.json
- issues.json
Artifacts are stored per domain/run, with screenshots kept under deterministic page IDs.
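The per-domain/run layout could be sketched with path helpers like these; the exact directory convention and the `screenshots/` subfolder name are assumptions:

```python
from pathlib import Path

# Artifact filenames as listed above.
ARTIFACTS = [
    "url_inventory.json", "page_screenshots.json", "collected_items.json",
    "universal_sections.json", "template_rules.json",
    "eligible_dataset.json", "issues.json",
]


def artifact_path(root, domain, run, name):
    """Path to one artifact for a given domain and run.

    The <root>/<domain>/<run>/<name> layout is an assumed convention.
    """
    return Path(root) / domain / run / name


def screenshot_path(root, domain, run, page_id):
    """Screenshots keyed by deterministic page ID (subfolder is assumed)."""
    return Path(root) / domain / run / "screenshots" / f"{page_id}.png"
```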
Typical operator workflow
- Manage seed URLs in /urls.
- Discover or inspect URLs.
- Capture EN pages.
- Review and annotate elements.
- Build the EN eligible dataset.
- Capture target-language pages and run comparison.
- Query issues in /.
Some steps are currently triggered through backend endpoints that are not yet fully represented by dedicated UI pages.
Glossary
- Seed URLs: manually managed canonical URL list.
- Capture context: runtime capture dimensions for a page (identity excludes language).
- page_id: deterministic page identifier from context identity fields.
- item_id: deterministic element identifier from domain/URL/selector/bbox/type.
- Template rules: annotation decisions that shape the eligible dataset.
- Eligible dataset: filtered English reference artifact used for comparison.
- Exact context rerun: replaying capture with the same context parameters.