h a l f b a k e r y
Renovating the wheel


UTF-4

Standardized efficient text storage
  (+2, -1)

UTF-8 allows everyday English prose to be stored at 8 bits per character while giving access to the higher ranges of Unicode with a system resembling escape sequences. (A lead byte above the one-byte ASCII range signals a multi-byte sequence, and the one to three continuation bytes that follow carry the rest of the code point.)

Eight bits still wastes too much space on everyday text. Proposed is a UTF scheme with a narrower base unit that switches to its higher registers more often. Sure, the escape sequences are basically wasted information, but I don't think 8 bits is the point where that trade-off stops paying off. Six bits might be more appropriate. Four is a tempting choice, since two characters would fit in each byte.

The code page should be tuned so that the most frequent characters, like "e" and " ", sit at the bottom where the codes are shortest. If you want a form feed, go back to ASCII. ;-)

kevinthenerd, Jul 24 2013
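
To make the four-bit variant of the idea concrete, here is a minimal sketch under stated assumptions. Nothing in it comes from the post itself: the 15-character table `COMMON`, the choice of nibble `0xF` as the escape, and the names `encode` and `decode` are all hypothetical. Characters in the table cost one nibble; every other byte of the UTF-8 form costs three (escape plus two nibbles).

```python
# Hypothetical "UTF-4" sketch: 15 common characters get one nibble each,
# and nibble 0xF escapes to a full byte spelled as two further nibbles.
COMMON = " etaoinshrdlucm"  # assumed frequency ranking, exactly 15 entries
ESCAPE = 0xF


def encode(text: str) -> bytes:
    nibbles = []
    for byte in text.encode("utf-8"):
        # Only ASCII bytes can match the table; UTF-8 bytes >= 0x80 are escaped.
        index = COMMON.find(chr(byte)) if byte < 0x80 else -1
        if index >= 0:
            nibbles.append(index)
        else:
            nibbles += [ESCAPE, byte >> 4, byte & 0xF]
    if len(nibbles) % 2:
        nibbles.append(ESCAPE)  # pad to a whole byte with a dangling escape
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def decode(data: bytes) -> str:
    nibbles = []
    for b in data:
        nibbles += [b >> 4, b & 0xF]
    out = bytearray()
    i = 0
    while i < len(nibbles):
        n = nibbles[i]
        if n != ESCAPE:
            out.append(ord(COMMON[n]))
            i += 1
        elif i + 2 < len(nibbles):
            out.append((nibbles[i + 1] << 4) | nibbles[i + 2])
            i += 3
        else:
            break  # an escape with no payload is the padding nibble
    return out.decode("utf-8")
```

With this particular table, `encode("the rain in spain")` takes 10 bytes instead of 17, while `encode("HELLO")` takes 8 bytes instead of 5, so the scheme only pays off when the text really does stay inside the tuned table.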







       Congratulations, you've moved one step closer to Huffman coding. I suggest sticking with UTF-8 if you want to keep your data byte-aligned and somewhat more easily decoded, or else go for a fully optimized Huffman encoding to maximize compression. I don't see much use in this halfway thing.
scad mientist, Jul 25 2013
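
For comparison, a rough sketch of the Huffman coding mentioned in this annotation, using only the Python standard library. The function name `huffman_code` and the sample sentence are illustrative assumptions; the tree is built from the sample's own character counts rather than from any standard table.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict[str, str]:
    """Return a prefix-free bit string for each distinct character."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:  # a lone symbol still needs one bit
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {char: code built so far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]


sample = "the quick brown fox jumps over the lazy dog " * 4
code = huffman_code(sample)
bits = sum(len(code[ch]) for ch in sample)
print(f"{bits} bits with Huffman vs {8 * len(sample)} bits at one byte per character")
```

Unlike a fixed nibble table, the code lengths here adapt to the text, but the result is no longer aligned to bytes or nibbles and the decoder needs the same tree.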
  

       Or just compress your text before storage or transmission. It'll provide the same additional hassle as in this idea, plus far better storage efficiency to boot. Or, given that a chip the size of my fingernail can store the Encyclopedia Britannica dozens of times over, don't even bother. [-]   

       Edit: Yeah, pretty much what [scad mientist] posted right before me.
ytk, Jul 25 2013
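
As a small illustration of the "just compress it" route, this sketch uses Python's standard `zlib` module; the sample text and variable names are made up for the example.

```python
import zlib

# Repetitive English-like text, stored as ordinary UTF-8 bytes.
text = ("Eight bits still gives too much wasted space for everyday text. " * 20).encode("utf-8")

packed = zlib.compress(text, 9)  # 9 = highest compression level
assert zlib.decompress(packed) == text

print(len(text), "bytes raw,", len(packed), "bytes compressed")

# Very short inputs grow instead, because of the fixed zlib header and checksum.
tiny = zlib.compress(b"hi", 9)
print(len(tiny), "bytes for a 2-byte message")
```

General-purpose compression wins easily on long text, but the fixed overhead means it cannot help with messages of only a few bytes.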
  

       All caps (+ punctuation + numerals) fits into 6 bits rather nicely, and 7-bit was what Unix originally used for text.
FlyingToaster, Jul 25 2013
  

       //Or just compress your text before storage or transmission   

       I did try that, but you have to be careful, as sometimes the O doesn't get completely flattened and it just looks like a D.
not_morrison_rm, Jul 25 2013
  

       The trick is to compress your O around the middle until it becomes an 8, then fold it in half so it looks like a º. You can fit, like, six of those in the same space as the original O.
ytk, Jul 25 2013
  

       Ahh, I'll give it a go.   

       I'm still thinking the best compression would be to arrange the text at 90 degrees to the storing medium, then each character would only be 1 electron wide.
not_morrison_rm, Jul 25 2013
  

       [+] Anyway, the next generation of software infrastructure will render all the current info useless, much as the information in EBCDIC and anything on magnetic tape became obsolete and practically inaccessible. See the recovery attempts for discovering the origin of :) :(
pashute, Jul 25 2013
  

       I SEEM TO REMEMBER THAT TELETYPE MACHINES USED 7 BITS PER CHARACTER BUT HAD NO LOWERCASE AND LIMITED PUNCTUATION.
MaxwellBuchanan, Jul 25 2013
  

       sp. "5" I think... wait no, I think that was telegrams.
FlyingToaster, Jul 26 2013
  

       /./ sp. "STOP"
Wrongfellow, Jul 26 2013
  

       I wonder what would happen if we forced [po] to communicate using an all-uppercase teletype machine? Hmm - tellytypies.
MaxwellBuchanan, Jul 26 2013
  

       POBOL ?
FlyingToaster, Jul 26 2013
  

       We sent a message to a friend who was getting married overseas: "WE HEAR YOU ARE GETTING MARRIED. STOP. YOU PLAN TO SPEND THE REST OF YOUR LIFE WITH THE SAME WOMAN. STOP". He never commented on how it went over.   

       I would add to [n_m_r]'s suggestion about turning the text sideways by pointing out that each sentence could be stored end-on.
AusCan531, Jul 26 2013
  

       Here's a real-world application: any medium where you can't fit more than a few bytes into a message, like 2D bar codes. There isn't enough space to fit a dictionary or Huffman tree.   

       Like [scad mientist] said, this is similar to Huffman coding with a predefined tree optimized for English text.
arvin, Jul 28 2013
  

       //There isn't enough space to fit a dictionary or huffman tree.// You could compress the Huffman tree using a Huffman shrub, then compress that using a Huffman petunia, and compress that using a Huffman club-moss.
MaxwellBuchanan, Jul 28 2013
  

       // the tree is more complex and built on chains. //   

       Is that one of those tyre rope-swing things, then ?   

       Why don't you just come up to speed with the rest of the Universe and start using Trinary ?
8th of 7, Jul 29 2013
  
      
  


 
