This idea is a bit technical, so I'll start with some background and definitions for the uninitiated, before jumping into the problem statement and a description of the solution.
BACKGROUND / DEFINITIONS:
- Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. (Source: Wikipedia)
- UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
- WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don't enforce the well-formedness invariant that surrogates must be paired. (Source: Simon Sapin, see link)
- WTF-8 (alternative definition) is that unintentionally popular encoding where someone takes UTF-8, accidentally decodes it as their favorite single-byte encoding such as Windows-1252, then encodes those characters as UTF-8. (Source: Hacker News, see link) A quick demonstration of all three encodings follows this list.
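For the uninitiated, all three behaviors can be reproduced in a few lines of Python 3. Conveniently, Python's "surrogatepass" error handler emits exactly the WTF-8 byte sequence for a lone surrogate, and a Windows-1252 round trip produces the alternative WTF-8.

    # UTF-8: one to four 8-bit bytes per code point.
    print("A".encode("utf-8"))    # b'A'                  (1 byte)
    print("é".encode("utf-8"))    # b'\xc3\xa9'           (2 bytes)
    print("€".encode("utf-8"))    # b'\xe2\x82\xac'       (3 bytes)
    print("🐍".encode("utf-8"))   # b'\xf0\x9f\x90\x8d'   (4 bytes)

    # WTF-8 (Sapin): a lone surrogate is not well-formed, and plain
    # .encode("utf-8") would raise UnicodeEncodeError, but the
    # "surrogatepass" handler produces the WTF-8 bytes for it.
    lone_surrogate = "\ud83d"  # a high surrogate with no partner
    print(lone_surrogate.encode("utf-8", "surrogatepass"))  # b'\xed\xa0\xbd'

    # WTF-8 (alternative definition): decode UTF-8 bytes as
    # Windows-1252 by accident, then re-encode the result as UTF-8.
    mangled = "café".encode("utf-8").decode("cp1252").encode("utf-8")
    print(mangled.decode("utf-8"))  # cafÃ©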
PROBLEM STATEMENT
UTF-8 is extremely hard to implement correctly, as evidenced by a Google search for "utf-8 issue" (32 million results), the emergence of grassroots alternative encodings such as WTF-8 (two variants), and computer-aided UTF-8 issue diagnostic tools such as ftfy (see link).
UTF-8 and its related WTF-8 encodings also suffer from a lack of human readability: someone looking at the bits in memory cannot immediately tell what characters they represent without referencing a Unicode table.
THE SOLUTION
We hereby propose two new encodings for Unicode: WTF-64 and WTF-512.
WTF-512 encodes each character as an 8x8 bitmap of 8-bit grayscale pixels (512 bits in total), depicting the underlying character as it is typically written by humans. A sketch of an encoder follows.
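To make this concrete, here is a minimal sketch of what a WTF-512 encoder could look like. The encode_wtf512 helper and the hand-drawn glyph are hypothetical, since the scheme (deliberately) leaves rasterization up to the writer.

    # Hypothetical WTF-512 encoder: 64 pixels x 8 bits each = 512 bits.
    def encode_wtf512(bitmap):
        """Pack an 8x8 grid of grayscale values (0-255) into 64 bytes."""
        assert len(bitmap) == 8 and all(len(row) == 8 for row in bitmap)
        return bytes(pixel for row in bitmap for pixel in row)

    # A hand-drawn letter "A" (0 = blank, 255 = full ink). Any other way
    # of writing an "A" yields a different, equally valid encoding.
    LETTER_A = [
        [  0,   0, 255, 255, 255,   0,   0,   0],
        [  0, 255,   0,   0,   0, 255,   0,   0],
        [  0, 255,   0,   0,   0, 255,   0,   0],
        [  0, 255, 255, 255, 255, 255,   0,   0],
        [  0, 255,   0,   0,   0, 255,   0,   0],
        [  0, 255,   0,   0,   0, 255,   0,   0],
        [  0, 255,   0,   0,   0, 255,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0,   0],
    ]

    assert len(encode_wtf512(LETTER_A)) == 64  # 64 bytes = 512 bits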
WTF-64 is a compressed representation of WTF-512, using a monochrome 8x8 bitmap (64 bits), which can be used when the text is known to contain only Latin-derived scripts. WTF-64 conveniently fits one character per 64-bit machine word, and hence can be processed efficiently by contemporary computers. A sketch of the compression step appears below.
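The compression from WTF-512 down to WTF-64 might look like the following sketch: threshold each grayscale pixel to one bit and pack the 64 bits into a single integer (the threshold value here is an arbitrary choice, not part of the proposal).

    # Hypothetical WTF-512 -> WTF-64 compressor: threshold the grayscale
    # bitmap to monochrome and pack the 64 bits into one machine word.
    def wtf512_to_wtf64(bitmap, threshold=128):
        """Collapse an 8x8 grayscale grid into a single 64-bit integer."""
        word = 0
        for row in bitmap:
            for pixel in row:
                word = (word << 1) | (pixel >= threshold)
        return word

    # An all-ink glyph packs to a word of all ones:
    assert wtf512_to_wtf64([[255] * 8] * 8) == 0xFFFFFFFFFFFFFFFF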
The key feature of this scheme is that there is no universal decoding table for either WTF-64 or WTF-512: all decoding must be done either by displaying the bitmap directly on the screen, or by using pre-trained machine learning algorithms to decode the character set into the machine's internal representation (see also: the MNIST dataset). A toy decoder in this spirit is sketched below.
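Since the proposal leaves decoding to a pre-trained model, here is a toy stand-in: a nearest-neighbour decoder over WTF-64 words by Hamming distance. The two reference words are invented for illustration; a serious deployment would presumably train on something MNIST-like instead.

    # Toy WTF-64 decoder: nearest neighbour by Hamming distance. The
    # reference words below are made up for illustration only.
    def hamming(a, b):
        """Count the differing bits between two 64-bit words."""
        return bin(a ^ b).count("1")

    REFERENCE = {
        0x3C66667E66666600: "A",
        0x7C66667C66667C00: "B",
    }

    def decode_wtf64(word):
        """Guess the character whose reference glyph is closest in bits."""
        best = min(REFERENCE, key=lambda ref: hamming(ref, word))
        return REFERENCE[best]

    # A slightly smudged "A" (one flipped pixel) still decodes as "A":
    print(decode_wtf64(0x3C66667E66666600 ^ 0x1))  # A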
WTF-64 is intended as a transitional technology for compute-limited devices, or until machines can be upgraded. New systems should be architected for WTF-512 natively.