The Hidden Infrastructure of Ethical AI: Lessons from a Vietnamese Text Normalizer

When AI makes headlines, the stories tend to orbit massive frontier models, national‑security deals, or billion‑dollar data centers. But sometimes the most important ethical and infrastructural shifts start much smaller, like a tool that helps machines read everyday Vietnamese the way people actually write it.

A paper titled VietNormalizer, co-authored by KETEMU researcher Dr. Ushik Shrestha, presents an open‑source, zero‑dependency Python library that converts the messy, heterogeneous text of everyday Vietnamese into structured, pronounceable forms. It handles numbers, dates, currency amounts, percentages, and acronyms, and it transliterates foreign loanwords, all through a fast, rule‑based pipeline. It avoids neural dependencies and GPU requirements, pre‑compiles its regex patterns for high‑throughput processing, and is available on PyPI and GitHub.
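To make the idea of a rule‑based pipeline with pre‑compiled patterns concrete, here is a minimal sketch in plain Python. It is not VietNormalizer's code or API: the names `read_number`, `RULES`, and `normalize` are hypothetical, the number reader only covers 0 to 99, and the spoken forms are simplified illustrations that skip conventions a real normalizer would need (for example "tháng tư" for April, or full dates with a year).

```python
import re

# Spoken forms of the digits 0-9.
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def read_number(n: int) -> str:
    """Spell out an integer from 0 to 99 in Vietnamese (illustrative only)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    head = "mười" if tens == 1 else f"{DIGITS[tens]} mươi"
    if ones == 0:
        return head
    if ones == 1 and tens > 1:
        tail = "mốt"  # 21 -> "hai mươi mốt"
    elif ones == 5:
        tail = "lăm"  # 15 -> "mười lăm"
    else:
        tail = DIGITS[ones]
    return f"{head} {tail}"


def _time(m: re.Match) -> str:
    hour, minute = int(m.group(1)), int(m.group(2))
    if hour > 23 or minute > 59:
        return m.group(0)  # not a plausible clock time, leave untouched
    spoken = f"{read_number(hour)} giờ"
    return spoken if minute == 0 else f"{spoken} {read_number(minute)} phút"


def _date(m: re.Match) -> str:
    day, month = int(m.group(1)), int(m.group(2))
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return m.group(0)  # not a plausible day/month pair
    return f"ngày {read_number(day)} tháng {read_number(month)}"


def _percent(m: re.Match) -> str:
    return f"{read_number(int(m.group(1)))} phần trăm"


# Patterns are compiled once at import time, then reused for every input.
RULES = [
    (re.compile(r"\b(\d{1,2}):(\d{2})\b"), _time),
    (re.compile(r"\b(\d{1,2})/(\d{1,2})\b(?!/)"), _date),  # day/month, no year
    (re.compile(r"\b(\d{1,2})%"), _percent),
]


def normalize(text: str) -> str:
    """Apply each rule in order; every rule is a readable (pattern, handler) pair."""
    for pattern, handler in RULES:
        text = pattern.sub(handler, text)
    return text


print(normalize("12/3, 14:30"))  # ngày mười hai tháng ba, mười bốn giờ ba mươi phút
print(normalize("giảm 15%"))     # giảm mười lăm phần trăm
```

Because each rule is an explicit pattern paired with a small function, a reviewer can read the whole behavior of the normalizer in one screen, which is the property the rest of this article leans on.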

At first glance, this might look like a niche technical contribution. But in practice, it speaks to one of AI’s most urgent ethical questions: who gets included in the machine-readable world? Vietnamese online text is filled with non‑standard forms such as digits, dates, emojis, and English acronyms, yet most global NLP systems still assume English‑centric formats or rely on heavy, proprietary preprocessing components. By showing that a transparent, auditable, rule‑based pipeline can serve an entire linguistic community, VietNormalizer reframes preprocessing as an equity issue, not a mere technical step.

This point becomes even sharper when contrasted with AI news headlines, many of which underscore gaps between AI intentions and real‑world consequences. In Canada, the federal government ordered a formal OpenAI safety review after revelations that the company failed to escalate red flags tied to a mass‑shooting suspect whose ChatGPT messages indicated potential violence. Officials pressed CEO Sam Altman on why the system didn’t stop the user from bypassing bans and what mechanisms exist for detecting dangerous content across contexts and languages.

Meanwhile in the U.S., OpenAI’s newly minted Pentagon agreement, which adds guardrails against domestic surveillance and autonomous weapons, triggered internal dissent and public criticism from other labs. Robotics hardware lead Caitlin Kalinowski resigned, warning that the deal lacked adequate deliberation around surveillance and lethal autonomy.

These controversies reveal that ethics concerns the entire pipeline, from how text is ingested to how outputs are interpreted. If a system can’t reliably parse dates or currencies in Vietnamese, or mishandles diacritics in a safety alert, any higher-level promise of responsible AI collapses at the point of contact with the user. VietNormalizer’s value is therefore both linguistic and governance‑related: rule‑based components can be evaluated, documented, regulated, and improved collaboratively.
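The diacritics point is easy to underestimate, so a small sketch may help. Vietnamese letters can be encoded either as single precomposed code points (Unicode NFC) or as a base letter plus combining marks (NFD), and the two look identical on screen. The example below uses only Python's standard `unicodedata` module; the helper name `contains_keyword` and the sample alert text are assumptions made for illustration, not anything taken from the VietNormalizer paper.

```python
import unicodedata

# The same phrase ("flash flood") in two Unicode encodings.
keyword = unicodedata.normalize("NFC", "lũ quét")       # precomposed letters
alert = unicodedata.normalize("NFD", "Cảnh báo lũ quét")  # base letters + combining marks

# A naive substring check misses the keyword, although both render identically.
print(keyword in alert)  # False


def contains_keyword(text: str, word: str) -> bool:
    """Compare text and keyword after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", word) in unicodedata.normalize("NFC", text)


print(contains_keyword(alert, keyword))  # True
```

A filter, search index, or safety classifier that skips this step can silently fail on text that a human reader would consider perfectly ordinary.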

For developers working in or with multilingual markets such as Vietnam, Indonesia, Singapore, and the broader ASEAN region, this means preprocessing pipelines must be both accurate and audit‑ready. Tools like VietNormalizer show how linguistic inclusion can align naturally with regulatory compliance. An open, well‑documented, rule‑based normalizer is far easier to justify to auditors than a closed neural component whose decisions are opaque.
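One way to picture what "audit‑ready" could mean in practice: because every change comes from a named rule, a normalizer can record exactly which rule rewrote which span. The sketch below is an assumption about how such a trail might look, not a feature documented for VietNormalizer; `AuditEntry`, `ACRONYMS`, `RULES`, and `normalize_with_audit` are hypothetical names, and number verbalization is left out to keep it short.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEntry:
    rule: str         # name of the rule that fired
    original: str     # text as it appeared in the input
    replacement: str  # text the rule produced


# Hypothetical acronym table for illustration.
ACRONYMS = {
    "TP.HCM": "Thành phố Hồ Chí Minh",
    "UBND": "Ủy ban nhân dân",
}

# Each rule is (name, pre-compiled pattern, expansion function).
RULES = [
    ("acronym",
     re.compile("|".join(re.escape(k) for k in ACRONYMS)),
     lambda m: ACRONYMS[m.group(0)]),
    ("percent_sign",
     re.compile(r"(?<=\d)%"),
     lambda m: " phần trăm"),
]


def normalize_with_audit(text: str):
    """Return the normalized text plus a log of every substitution made."""
    log = []
    for name, pattern, expand in RULES:
        def record(m, name=name, expand=expand):
            out = expand(m)
            log.append(AuditEntry(name, m.group(0), out))
            return out
        text = pattern.sub(record, text)
    return text, log


result, trail = normalize_with_audit("UBND TP.HCM: 50% hộ dân")
print(result)
for entry in trail:
    print(entry.rule, repr(entry.original), "->", repr(entry.replacement))
```

An auditor can replay the log against the input and confirm that nothing changed except what a documented rule allowed, which is hard to offer for a neural component.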

Another way VietNormalizer speaks to current concerns is through the lens of AI security and red‑teaming, where recent research and competitions reveal persistent vulnerabilities in frontier models and agentic systems. A landmark large‑scale public red‑teaming competition, targeting 22 frontier AI agents, logged 1.8 million prompt‑injection attempts and tens of thousands of policy violations, underscoring how brittle real‑world deployments can be.

Even Amazon AGI Labs’ massive red‑teaming efforts found over 1,000 jailbreaks across 17 safety categories, including multilingual attack vectors where unreliable preprocessing becomes a security risk in itself.

It is clear that the edge of AI safety increasingly lies in the mundane, in the infrastructure layers where text meets code and where real users, with their spellings, acronyms, emojis, and mixed‑language habits, live. This is especially true in Southeast Asia, where digital services now reach millions across varying literacy levels, dialects, and formats. A disaster-warning chatbot, for instance, cannot afford to misread ‘12/3, 14:30’ or stumble on VND currency amounts when clarity is life‑critical.

In the language of ordinary people, the questions are: Can your AI system even say my name correctly? Can it read my dates, understand my currency, or parse my organization’s acronym? These are the baseline of dignity and inclusion. Against the backdrop of Pentagon deals, public safety reviews, and escalating regulatory demands, a tool like VietNormalizer shows that the soul of responsible AI often lives in unglamorous spaces.

Full text of the VietNormalizer paper is available here:
https://arxiv.org/abs/2603.04145
