Normalize first
Store the original string unchanged, then derive comparison keys from it by normalizing to NFC, or to NFKC where compatibility folding is appropriate.
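The rule above can be sketched with Python's standard `unicodedata` module. This is a minimal illustration, not a prescribed implementation: the `NormalizedText` class and the `normalize_first` function are hypothetical names introduced here.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedText:
    """Hypothetical record: the original is kept, keys are derived from it."""
    original: str   # stored exactly as received, never overwritten
    nfc_key: str    # canonical composition, preserves compatibility distinctions
    nfkc_key: str   # compatibility composition, lossy (folds ligatures, etc.)


def normalize_first(text: str) -> NormalizedText:
    # Compute both keys up front; callers choose which one suits the comparison.
    return NormalizedText(
        original=text,
        nfc_key=unicodedata.normalize("NFC", text),
        nfkc_key=unicodedata.normalize("NFKC", text),
    )


composed = normalize_first("caf\u00e9")      # precomposed U+00E9
decomposed = normalize_first("cafe\u0301")   # "e" followed by U+0301
ligature = normalize_first("\ufb01le")       # starts with U+FB01 LATIN SMALL LIGATURE FI

# The originals differ, but the NFC keys compare equal.
print(composed.original == decomposed.original)  # False
print(composed.nfc_key == decomposed.nfc_key)    # True
# NFKC folds the ligature; NFC leaves it alone.
print(ligature.nfkc_key)                         # file
```

Keeping the original alongside the keys matters because NFKC is lossy: once the ligature is folded to "fi", the input cannot be recovered from the key.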
Public boundary
The public interchange layer and the internal semantic-stability layer should stay deliberately separate.
Unicode and ISO/IEC 10646 provide the public character repertoire and encoding layer. They do not give font-specific glyph images independent public meanings.
A semantic glyph interpreter must distinguish code points, grapheme clusters, rendered glyphs, SVG forms, and inferred concepts before it claims any interpretation.
The internal semantic layer may reason over visual structure, embeddings, ontology tags, and validation traces. Visible public output should remain assigned Unicode characters or valid public sequences.
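One way to keep the two layers apart is to give each its own record type, so internal evidence can never be mistaken for public text. The sketch below is an assumption about how that might look: `PublicText`, `InternalAnalysis`, `Interpretation`, and `make_public` are hypothetical names, and the fields simply mirror the terms used above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PublicText:
    """Public interchange layer: only the string and its code points."""
    original: str
    code_points: tuple[int, ...]


@dataclass
class InternalAnalysis:
    """Internal semantic layer: evidence and reasoning, never emitted as text."""
    ontology_tags: list[str] = field(default_factory=list)
    validation_trace: list[str] = field(default_factory=list)
    confidence: float = 0.0


@dataclass
class Interpretation:
    """Pairs the two layers without merging them into one record."""
    public: PublicText
    internal: InternalAnalysis


def make_public(text: str) -> PublicText:
    # The public record is derived from the string alone.
    return PublicText(original=text, code_points=tuple(ord(ch) for ch in text))


result = Interpretation(
    public=make_public("A\u0301"),  # "A" followed by COMBINING ACUTE ACCENT
    internal=InternalAnalysis(ontology_tags=["letter"], confidence=0.5),
)
print([f"U+{cp:04X}" for cp in result.public.code_points])  # ['U+0041', 'U+0301']
```

Making `PublicText` frozen is a design choice in this sketch: internal reasoning can be revised over time, while the public record stays fixed once created.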
Use grapheme-cluster boundaries for user-perceived characters rather than single-code-point assumptions.
Use script properties and script extensions; do not treat Unicode blocks as script identity.
Treat variation selectors and emoji-style sequences as meaningful only when they are valid public sequences.
Private-use characters may exist by private agreement, but the converter should not promote them to public semantics.
Every output should show public symbol status, retrieval lanes, ontology checks, and confidence.
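As a rough illustration of the public symbol status check, the sketch below classifies each code point by its Unicode general category using the standard `unicodedata` module. The function names and status labels are hypothetical, the result depends on the Unicode version bundled with the Python build, and the other output fields (retrieval lanes, ontology checks, confidence) are left out because their shape is not specified here.

```python
import unicodedata


def public_status(ch: str) -> str:
    """Classify one code point by general category (labels are illustrative)."""
    category = unicodedata.category(ch)
    if category == "Co":
        return "private-use"   # meaning exists only by private agreement
    if category == "Cn":
        return "unassigned"    # unassigned or noncharacter in this Unicode version
    if category == "Cs":
        return "surrogate"     # not a valid standalone character
    return "assigned"


def status_report(text: str) -> list[dict[str, str]]:
    # One entry per code point; private-use characters are flagged, not interpreted.
    return [
        {"code_point": f"U+{ord(ch):04X}", "status": public_status(ch)}
        for ch in text
    ]


for entry in status_report("A\ue000"):
    print(entry)
# {'code_point': 'U+0041', 'status': 'assigned'}
# {'code_point': 'U+E000', 'status': 'private-use'}
```

A converter built along these lines would report the private-use character and stop there, rather than attaching a public meaning to it.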
Do not silently assign public meanings to arbitrary private glyphs without inspection.
Do not treat a style variation as a new character unless public standards encode that distinction.
Do not equate visual similarity with semantic identity; homoglyphs and spoofing matter.
Do not collapse Unicode, glyph image, ontology, and canonical meaning into one registry row.
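The homoglyph rule can be made concrete with a small example: two strings that look alike in most fonts but differ at the code-point level, and that normalization does not unify. This is a sketch using only the standard `unicodedata` module; `same_identity` and `describe` are hypothetical helpers. A real confusable check would need dedicated confusables data, which the standard library does not provide.

```python
import unicodedata

latin = "paypal"
spoof = "p\u0430yp\u0430l"  # U+0430 CYRILLIC SMALL LETTER A replaces each Latin "a"


def same_identity(a: str, b: str, form: str = "NFC") -> bool:
    # Identity is decided on normalized code points, never on appearance.
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def describe(text: str) -> list[str]:
    # Character names expose the difference that rendering hides.
    return [unicodedata.name(ch, "<unnamed>") for ch in text]


print(same_identity(latin, spoof))          # False
print(same_identity(latin, spoof, "NFKC"))  # False, NFKC does not fold across scripts
print(describe(latin)[1])                   # LATIN SMALL LETTER A
print(describe(spoof)[1])                   # CYRILLIC SMALL LETTER A
```

Because neither NFC nor NFKC maps one onto the other, visual similarity has to be handled as a separate signal from semantic identity.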
The safest design is neither a pure text lookup nor an unconstrained glyph-invention system. It is layered: public Unicode for interchange, visual decomposition for evidence, ontology for validity, and stability scoring over time.