halfbakery: Renovating the wheel
UTF-4
Standardized efficient text storage | |
UTF-8 allows everyday English prose to be stored at 8 bits
per character while giving access to the higher ranges of
Unicode with a system resembling escape sequences. (A
lead byte above the one-byte ASCII range announces a
multi-byte sequence, and the one to three continuation
bytes that follow carry the rest of the character.)
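For anyone who wants to see the variable widths for themselves, here is a small sketch. The sample characters are my own picks, not anything from the idea; the only library call is Python's built-in `str.encode`.

```python
# Sample characters chosen for illustration: ASCII, Latin-1, BMP, and astral.
SAMPLES = ["e", "\u00e9", "\u20ac", "\U0001F600"]  # e, e-acute, euro sign, emoji


def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of one character as space-separated binary."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 spends on one character."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in SAMPLES:
        # The lead byte's high bits (0, 110, 1110, 11110) announce the length;
        # every continuation byte starts with 10.
        print(f"U+{ord(ch):04X}  {utf8_width(ch)} byte(s)  {utf8_bits(ch)}")
```

Plain English never leaves the one-byte row, which is the 8 bits per character the idea is complaining about.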
Eight bits still wastes too much space on everyday
text. Proposed is a UTF scheme that switches to its higher
registers more often. Sure, the escape sequences are
basically wasted information, but I don't think 8 bits per
character puts us anywhere near the point where escaping
costs more than it saves. Six bits might be more appropriate.
Four is a tempting choice, since two codes fit in one byte.
The code page should be tuned so the most frequent
characters, like "e" and " ", are on the bottom. If you want
a form feed, go back to ASCII. ;-)
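A rough sketch of how the 4-bit version might look. Everything here is an assumption layered on the idea: the 15-character code page is a guess at English frequency order rather than a measured table, nibble 0xF as the escape is my choice, and falling back to the character's UTF-8 bytes after an escape is just one way to reach the "higher registers".

```python
# Hypothetical code page: 15 frequent English characters (assumed ordering).
FREQUENT = " etaoinshrdlucm"
ESCAPE = 0xF  # assumed escape nibble; the remaining 15 values index FREQUENT


def encode(text: str) -> bytes:
    """Pack text into nibbles: 1 nibble for table hits, escape + UTF-8 otherwise."""
    nibbles = []
    for ch in text:
        idx = FREQUENT.find(ch)
        if idx >= 0:
            nibbles.append(idx)
        else:
            nibbles.append(ESCAPE)
            for b in ch.encode("utf-8"):
                nibbles += [b >> 4, b & 0xF]
    if len(nibbles) % 2:
        nibbles.append(ESCAPE)  # pad; a lone trailing escape means "nothing"
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def utf8_length(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    return 4


def decode(data: bytes) -> str:
    """Inverse of encode()."""
    nibbles = [n for b in data for n in (b >> 4, b & 0xF)]
    out, i = [], 0
    while i < len(nibbles):
        n = nibbles[i]
        i += 1
        if n != ESCAPE:
            out.append(FREQUENT[n])
            continue
        if i >= len(nibbles):
            break  # trailing pad nibble
        size = utf8_length((nibbles[i] << 4) | nibbles[i + 1])
        raw = bytes((nibbles[i + 2 * k] << 4) | nibbles[i + 2 * k + 1]
                    for k in range(size))
        out.append(raw.decode("utf-8"))
        i += 2 * size
    return "".join(out)


if __name__ == "__main__":
    for msg in ("the same old tune", "Hello, world", "XYZ"):
        print(repr(msg), len(msg.encode("utf-8")), "->", len(encode(msg)))
```

Lowercase text that stays inside the table shrinks to roughly half, while anything escaped costs three nibbles or more, so all-caps text comes out larger than plain UTF-8. That is the "wasted information" trade-off in action.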
|
|
Congratulations, you've moved one step closer to Huffman coding. I suggest sticking with UTF-8 if you want to keep your data byte-aligned and somewhat more easily decoded, or else go for a fully optimized Huffman encoding to maximize compression. I don't see much use in this halfway thing. |
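For comparison, a minimal sketch of the Huffman construction being referred to, built from the character counts of whatever text you feed it. The function name and the sample string are mine; `heapq` and `collections.Counter` are standard library.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict:
    """Build a prefix code (symbol -> bit string) from symbol counts in text."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encoded_bits(text: str, code: dict) -> int:
    """Total bits needed to store text with the given code (tree not included)."""
    return sum(len(code[ch]) for ch in text)


if __name__ == "__main__":
    sample = "abracadabra"
    code = huffman_code(sample)
    print(code, encoded_bits(sample, code), "bits vs", 8 * len(sample))
```

The frequent symbols get the short codes automatically, with no fixed 4-bit or 6-bit boundary, but the decoder has to know the tree.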
|
|
Or just compress your text before storage or
transmission. It'll provide the same additional hassle as
in this idea, plus far better storage efficiency to boot.
Or, given that a chip the size of my fingernail can store
the Encyclopedia Britannica dozens of times over, don't
even bother. [-] |
|
|
Edit: Yeah, pretty much what [scad mientist] posted
right before me. |
|
|
All caps (+ punctuation + numerals) fits into 6 bits rather nicely, and 7-bit was what Unix originally used for text. |
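The arithmetic works out: space + 26 letters + 10 digits leaves 27 slots for punctuation within 2^6 = 64 codes, and four 6-bit characters fill exactly three bytes. A sketch under those assumptions follows. The 64-symbol alphabet below is made up for illustration, and the character count is passed to the decoder separately because the padding can otherwise look like a trailing space.

```python
import string

# Hypothetical 64-symbol alphabet: space, A-Z, 0-9, then the first 27
# characters of string.punctuation (everything up to and including "_").
ALPHABET = (" " + string.ascii_uppercase + string.digits + string.punctuation)[:64]


def pack6(text: str) -> bytes:
    """Pack text at 6 bits per character, zero-padded to a whole byte."""
    bits = 0
    for ch in text:
        bits = (bits << 6) | ALPHABET.index(ch)  # ValueError if not in alphabet
    nbits = 6 * len(text)
    pad = -nbits % 8
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def unpack6(data: bytes, count: int) -> str:
    """Recover `count` characters from data produced by pack6()."""
    bits = int.from_bytes(data, "big") >> (len(data) * 8 - 6 * count)
    return "".join(
        ALPHABET[(bits >> (6 * (count - 1 - i))) & 0x3F] for i in range(count)
    )


if __name__ == "__main__":
    msg = "HELLO, WORLD 42!"
    packed = pack6(msg)
    print(len(msg), "chars ->", len(packed), "bytes")
    print(unpack6(packed, len(msg)))
```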
|
|
//Or just compress your text before storage or transmission// |
|
|
I did try that, but you have to be careful, as sometimes the O doesn't get completely flattened and it just looks like a D |
|
|
The trick is to compress your O around the middle until
it becomes an 8, then fold it in half so it looks like a º.
You can fit, like, six of those in the same space as the
original O. |
|
|
I'm still thinking the best compression would be to arrange the text at 90 degrees to the storing medium, then each character would only be 1 electron wide. |
|
|
[+] Anyway, the next generation of software
infrastructure will render all the current info useless,
much as the information in EBCDIC and anything
on magnetic tape became obsolete and practically
inaccessible. See the recovery attempts for
discovering the origin of :) :( |
|
|
I SEEM TO REMEMBER THAT TELETYPE MACHINES USED
7 BITS PER CHARACTER BUT HAD NO LOWERCASE AND
LIMITED PUNCTUATION. |
|
|
sp. "5" I think... wait no, I think that was telegrams. |
|
|
I wonder what would happen if we forced [po] to
communicate using an all-uppercase teletype
machine? Hmm - tellytypies. |
|
|
We sent a message to a friend who was getting married
overseas: "WE HEAR YOU ARE GETTING MARRIED. STOP.
YOU PLAN TO SPEND THE REST OF YOUR LIFE WITH THE
SAME WOMAN. STOP". He never commented on how it went
over. |
|
|
I would add to [n_m_r]'s suggestion about turning the text
sideways by pointing out that each sentence could be
stored end-on. |
|
|
Here's a real-world application: any medium where you can't fit more than a few bytes into a message, like 2D bar codes. There isn't enough space to fit a dictionary or Huffman tree. |
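A quick way to see the short-message problem, using Python's standard `zlib` as a stand-in for "just compress it". The sample messages are borrowed from the telegram above; the helper name is mine. A zlib stream carries a 2-byte header and a 4-byte checksum before any data, so a message of a few bytes comes out bigger than it went in.

```python
import zlib


def compressed_size(raw: bytes) -> int:
    """Size in bytes of raw after zlib compression at the highest level."""
    return len(zlib.compress(raw, 9))


SHORT_MESSAGE = b"STOP"
LONG_MESSAGE = b"WE HEAR YOU ARE GETTING MARRIED. STOP. " * 50

if __name__ == "__main__":
    for label, raw in (("short", SHORT_MESSAGE), ("long", LONG_MESSAGE)):
        print(label, len(raw), "->", compressed_size(raw))
```

General-purpose compression only pays off once there is enough text to amortize the overhead, which is where a fixed, pre-agreed code page has a niche.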
|
|
Like scad said, this is similar to Huffman coding with a predefined tree optimized for English text. |
|
|
//There isn't enough space to fit a dictionary or
huffman tree.// You could compress the Huffman
tree using a Huffman shrub, then compress that
using a Huffman petunia, and compress that using a
Huffman club-moss. |
|
|
// the tree is more complex and built on chains. // |
|
|
Is that one of those tyre rope-swing things, then ? |
|
|
Why don't you just come up to speed with the rest of the Universe and start using Trinary ? |
|
| |