h a l f b a k e r y
Contrary to popular belief


WTF-64 and WTF-512 Unicode charset encodings

Human-readable replacement for UTF-8 and all WTF-8 variants
(+2)

This idea is a bit technical, so I'll start with some background and definitions for the uninitiated, before jumping into the problem statement and a description of the solution.

BACKGROUND / DEFINITIONS:

- Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. (Source: Wikipedia)

- UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

- WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired. (Source: Simon Sapin, see link)

- WTF-8 (alternative definition) is that unintentionally popular encoding where someone takes UTF-8, accidentally decodes it as their favorite single-byte encoding such as Windows-1252, then encodes those characters as UTF-8 (Source: Hacker News, see link)
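
For the uninitiated, the three definitions above can be poked at in a few lines of Python. This is only a rough sketch: the helper names (utf8_width, lone_surrogate_bytes, double_encode) are made up for illustration, and Python's "surrogatepass" error handler is used as a stand-in that happens to give the same bytes as Sapin's WTF-8 for a single unpaired surrogate (it is not a full WTF-8 codec).

```python
def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 needs for one code point (one to four)."""
    return len(ch.encode("utf-8"))


def lone_surrogate_bytes() -> bytes:
    """Encode the unpaired surrogate U+D800 the way generalized UTF-8 would.

    Strict UTF-8 refuses this code point; "surrogatepass" lets it through,
    which is assumed here to mirror what WTF-8 does for a lone surrogate.
    """
    return "\ud800".encode("utf-8", "surrogatepass")


def double_encode(text: str) -> bytes:
    """The Hacker News sense of WTF-8: UTF-8 bytes misread as Windows-1252,
    then encoded as UTF-8 again.

    Note: Python's windows-1252 codec raises on the few byte values that
    code page leaves undefined, so this toy can fail on some inputs.
    """
    return text.encode("utf-8").decode("windows-1252").encode("utf-8")
```

Feeding a single accented letter such as é through double_encode turns its two UTF-8 bytes into four, which is the classic two-character garbage that tools like ftfy try to undo.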

PROBLEM STATEMENT

UTF-8 is extremely hard to implement correctly, as evidenced by a Google search for "utf-8 issue" (32 million results), the emergence of grassroots alternative encodings such as WTF-8 (two variants), and computer-aided UTF-8 issue diagnostic tools such as ftfy (see link).

UTF-8 and its related WTF-8 encodings also suffer from a lack of human readability: someone looking at the bits in memory would not be able to tell immediately what characters they represent without referencing a Unicode table.

THE SOLUTION

We hereby propose two new sets of encodings for Unicode: WTF-64 and WTF-512.

WTF-512 represents each character as an 8x8 bitmap with 8 bits of grayscale per pixel (8 x 8 x 8 = 512 bits), drawn the way the character is usually written by humans.
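
As a rough sketch of what a WTF-512 encoder might look like, assuming row-major pixel order and one byte per pixel (the idea does not specify either). The glyph GLYPH_A, the LEVELS mapping and the function encode_wtf512 are all hypothetical names invented for this example.

```python
# Hypothetical hand-drawn 8x8 glyph for "A": '#' is ink, '.' is paper.
GLYPH_A = [
    "..####..",
    ".#....#.",
    ".#....#.",
    ".######.",
    ".#....#.",
    ".#....#.",
    ".#....#.",
    "........",
]

# Assumed mapping from drawing characters to 8-bit gray levels.
LEVELS = {".": 0, "+": 128, "#": 255}


def encode_wtf512(rows):
    """Pack an 8x8 glyph into 64 bytes (512 bits), one gray level per pixel."""
    if len(rows) != 8 or any(len(row) != 8 for row in rows):
        raise ValueError("glyph must be exactly 8x8")
    return bytes(LEVELS[px] for row in rows for px in row)


encoded_a = encode_wtf512(GLYPH_A)
```

Each character costs 64 bytes this way, against one byte for the same letter in UTF-8.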

WTF-64 is a compressed representation of WTF-512, using an 8x8 bitmap in monochrome, which can be used when it is known that the text contains only Latin-derived scripts. WTF-64 conveniently uses one character per 64-bit machine word, and hence can be processed efficiently by contemporary computers.
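
A minimal sketch of the WTF-64 packing, assuming row-major order with the top-left pixel as the most significant bit (the bit order is a guess, as the idea leaves it open). GLYPH_A and encode_wtf64 are hypothetical names for illustration only.

```python
import struct

# Hypothetical hand-drawn 8x8 glyph for "A": '#' is ink, '.' is paper.
GLYPH_A = [
    "..####..",
    ".#....#.",
    ".#....#.",
    ".######.",
    ".#....#.",
    ".#....#.",
    ".#....#.",
    "........",
]


def encode_wtf64(rows):
    """Pack an 8x8 monochrome glyph into one 64-bit integer, MSB first."""
    if len(rows) != 8 or any(len(row) != 8 for row in rows):
        raise ValueError("glyph must be exactly 8x8")
    word = 0
    for row in rows:
        for px in row:
            word = (word << 1) | (1 if px == "#" else 0)
    return word


word_a = encode_wtf64(GLYPH_A)
# One character fills exactly one big-endian unsigned 64-bit machine word.
packed_a = struct.pack(">Q", word_a)
```

With this glyph each row becomes one byte of the word, so the letter can be read straight off a hex dump by anyone willing to squint.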

The key feature of this scheme is that there is no universal decoding table for either WTF-64 or WTF-512: all decoding must be done by either displaying the bitmap directly on the screen, or using pre-trained Machine Learning algorithms to decode the character set into the machine's internal representation (see also: MNIST dataset).
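
Both decoding paths can be sketched for WTF-64, with heavy caveats. The render function covers the "display it" path. The classify function is only a toy nearest-neighbour match by Hamming distance, standing in for a properly trained model. The two reference glyphs in TEMPLATES are a made-up local "training set", not a universal table, and the bit order (top-left pixel first) is an assumption.

```python
def render(word):
    """Display path: turn a 64-bit WTF-64 word into eight rows of text."""
    bits = format(word, "064b")
    return [
        "".join("#" if bit == "1" else "." for bit in bits[i:i + 8])
        for i in range(0, 64, 8)
    ]


# Hypothetical labelled examples; a real system would presumably learn
# from far more samples than two.
TEMPLATES = {
    "A": 0x3C42427E42424200,
    "L": 0x4040404040407E00,
}


def classify(word):
    """Return the label whose reference glyph differs in the fewest pixels."""
    return min(
        TEMPLATES,
        key=lambda label: bin(word ^ TEMPLATES[label]).count("1"),
    )


# A smudged "A" with one pixel flipped is still closest to the "A" glyph.
smudged_a = 0x3C42427E42424200 ^ 1
```

Because matching is by similarity rather than exact lookup, two machines with different reference glyphs may disagree about the same word.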

WTF-64 is intended as a transitional technology for compute-limited devices or until machines can be upgraded. New systems should be architected for WTF-512 natively.

ignobel, Mar 24 2019

Simon Sapin's WTF-8 https://simonsapin.github.io/wtf-8
Code library for dealing with UTF-8 issues [ignobel, Mar 24 2019]

The WTF-8 encoding https://news.ycombi...com/item?id=9611710
Lengthy and deep WTF-8 discussion [ignobel, Mar 24 2019]

ftfy https://github.com/...Insight/python-ftfy
Python library for decoding WTF-8 [ignobel, Mar 24 2019, last modified Apr 05 2019]

       And here I thought I knew what wtf meant
theircompetitor, Mar 24 2019
  

       I think 8 bits of grey-scale are too many. I know that I wouldn't be able to tell the difference between grey-175 and grey-176 (out of 256 levels). Maybe 4 bits for grey-scale, leaving 4 bits for other stuff (I'm not geek enough to know what stuff could be relevant...).
neutrinos_shadow, Mar 24 2019
  


 
