halfbakery: The mutter of invention.
Of the 256 possible characters that a byte can represent, only a few of these are used under most circumstances. Couldn't we somehow take advantage of this, and make our text files smaller by eliminating the need for the rarely used letters? Why yes. Yes we can.
Of the 8 bits in a byte, the most significant bit is rarely used (standard ASCII is only 7 bits long). So first we can eliminate the 8th bit, as is sometimes done anyway.
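Dropping the 8th bit can be sketched as packing 7-bit characters back to back in one bit stream. This is only an illustration, not something the post specifies: the names `pack7` and `unpack7` are made up here, and the sketch assumes the character count is sent alongside the packed bytes so that padding bits are not mistaken for a character.

```python
def pack7(text: str) -> bytes:
    """Pack ASCII characters into a bit stream using 7 bits per character."""
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    acc = 0      # bit accumulator
    nbits = 0    # number of bits currently held in acc
    out = bytearray()
    for b in data:
        acc = (acc << 7) | b
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        # pad the final partial byte with zero bits on the right
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack7(packed: bytes, count: int) -> str:
    """Recover `count` characters from a stream produced by pack7."""
    acc = 0
    nbits = 0
    chars = []
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(chars) < count:
            nbits -= 7
            chars.append(chr((acc >> nbits) & 0x7F))
        acc &= (1 << nbits) - 1
    return "".join(chars)
```

Eight characters fit in seven bytes this way, so the saving from this step alone is at most 12.5%.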
Next we can take a lesson from the 31337-5p34kers: many letters don't add any value in most situations, and can be replaced by more common letters. For example, a lower-case "L", capital "I", and the number "1" look nearly identical in most situations, so they can be consolidated.
There is little need for capital letters, so they can be dispensed with. The decompression/display program could even automatically capitalize the first letter of every sentence, common names, etc., if you like.
Repeating letters are hardly necessary. All multiple consecutive occurrences of a letter could be compressed into a single occurrence, and most of the meaning is preserved.
Certain letters and letter pairs look and/or sound the same as other letters/letter pairs. For example, "ck" offers nothing that a simple "k" does not.
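The letter-level rules above (merging look-alikes, dropping capitals, collapsing repeats, replacing "ck" with "k") can be sketched as a single lossy pass. The function name `lossy_normalize`, the `LOOKALIKES` table, and the order of the steps are assumptions made for illustration; the post does not fix any of them.

```python
import re

# Hypothetical look-alike table: capital "I" and the digit "1" are both
# folded into lower-case "l", as the post suggests.
LOOKALIKES = str.maketrans({"I": "l", "1": "l"})


def lossy_normalize(text: str) -> str:
    """Apply the lossy letter-level rules; the original cannot be recovered."""
    text = text.translate(LOOKALIKES)       # merge look-alike characters
    text = text.lower()                     # drop capital letters
    text = text.replace("ck", "k")          # "ck" carries nothing "k" does not
    text = re.sub(r"(.)\1+", r"\1", text)   # collapse runs of a repeated character
    return text
```

For example, `lossy_normalize("Still ticking")` gives `"stil tiking"`. Note that the same pass also mangles digits and repeated spaces, so a string such as "100" does not survive it.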
Another technique that might be used is phoneme encoding. The text would be parsed and broken into its phonemes. The decoding program would then expand the phonemes out to something that could be read aloud.
Finally, the text is ready to be compressed in a more traditional way. Huffman compression would be used, so the most commonly used characters would require fewer bits to express. Each language would have its own general purpose Huffman key that would always be used, eliminating the need to recalculate and send the key along with each message.
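A fixed, shared Huffman table might look roughly like the sketch below. Everything named here is hypothetical: `SAMPLE_TEXT` stands in for a large per-language corpus, `SHARED_CODE` is the table both ends would agree on in advance, and the bits are kept as a string of "0" and "1" characters for readability rather than packed into bytes.

```python
import heapq
from collections import Counter
from typing import Dict


def build_huffman_code(freqs: Dict[str, int]) -> Dict[str, str]:
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    # Heap entries are (weight, tiebreak, {symbol: code_so_far}); the unique
    # tiebreak integer means the dicts themselves are never compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, low = heapq.heappop(heap)
        w2, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def encode(text: str, code: Dict[str, str]) -> str:
    """Encode text as a string of bits; raises KeyError for unknown symbols."""
    return "".join(code[ch] for ch in text)


def decode(bits: str, code: Dict[str, str]) -> str:
    """Decode a bit string produced by encode() with the same table."""
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first match is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


# Assumption: a tiny stand-in for the corpus a real per-language table
# would be computed from.
SAMPLE_TEXT = "the quick brown fox jumps over the lazy dog and then the text is sent"
SHARED_CODE = build_huffman_code(Counter(SAMPLE_TEXT))
```

Because the table is fixed ahead of time, only the bits travel with each message. The trade-off is that a message whose letter frequencies differ from the corpus compresses worse than it would with a table built for that message, and any symbol missing from the table cannot be encoded at all.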
Work has been done on lossy text compression on a word level, but to my knowledge it has not been done on the letter level as I described. With such a method, text files could meaningfully be compressed by tremendous amounts. The character set could be reduced to a mere two or three dozen letters, I'm sure.
Applications could include email and instant messaging over slow connections, or the storage of large amounts of textual information.
Word-level compression
http://sequence.rutgers.edu/lossy/ (NOT an example of the described compression method) [Uberminky, Jun 08 2002, last modified Oct 17 2004]
A Plan for the Improvement of English Spelling
http://www.neth.de/...piele/newspell.html Seminal couple of paragraphs by Mark Twain. [jutta, Jun 08 2002]
Lossy PNG (ie zlib) compression
http://membled.com/work/apps/lossy_png/ A modified zlib which allows lossy matching in the Lempel-Ziv stage. The intended application is image compression, but it would be easy to change the code for particular letter mismatches. However I don't think it would be that much better compression than ordinary gzip. [Ed Avis, Oct 11 2002, last modified Oct 17 2004]
anfractuosity compressor
https://www.anfract...y-text-compression/ We simply pick the shortest alternative word from a thesaurus, in order to compress text in a lossy fashion. [mofosyne, Apr 17 2016]
Welcome to the bakery, Uber-dave.
Lossless compression can condense text files to 10% of their original size, or less. Efficient lossless encoding makes use of techniques such as efficiently encoding whitespace, using dictionaries of common words or sentence fragments, predicting what letter is most likely to follow, and avoiding the duplication of repeated strings. However, since text files tend to be only a very small part of the total data transmitted and stored, technology has perhaps lagged behind compared to image compression.
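For comparison, a lossless baseline is easy to measure with Python's standard `zlib` module (DEFLATE, which combines repeated-string matching with Huffman coding). The helper name `compression_ratio` is made up for this sketch, and the ratio you get depends heavily on the text, so no particular figure is implied.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress
    packed = zlib.compress(raw, 9)  # 9 = highest compression level
    # Lossless: decompressing gives back exactly the original bytes.
    assert zlib.decompress(packed) == raw
    return len(packed) / len(raw)
```

Running this on a text before and after any lossy letter-level pass would show how much the lossy step actually adds on top of ordinary compression.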
Nonetheless, I think most people would be very reluctant to use any lossy text compression method. Not only would it be unable to handle irregular strings such as car registration numbers, but it might change the meaning of sentences in subtle ways.
sctld: Forgive me, I'm new here -- what link did you want me to look at? The only link I see is the one I myself submitted, which is not an example of what I mentioned. I now see that I should have used the "link" link, thanks yamahito.
ravenswood: Actually the savings would be tremendous for things like instant messaging, etc. Worthwhile? No, probably not. I noticed that sarcasm doesn't go over very well around these parts. I'll keep this in mind.
It's good to feel welcome!
If it's nothing to do with the idea, then why link it?
He didn't say it wasn't to do with the idea, just that it wasn't an example of it. The link refers to compression at a word level, not a letter level.
What ravenswood and pottedstu said. To clarify, the savings that would be negligible are those compared to straightforward Huffman compression, not compared to the existing text.
yamahito: What's the difference?
The scale on which it's done, I guess.
Sorry, but I don't think you can compress letters any more than you can compress words. When you compress words, effectively you are removing letters deemed 'un-needed'. You can't compress letters, because they are the root, they are the fundamental particles of words, paragraphs, chapters, books, volumes. They are to language what quarks are to elements. So, actually, you can't compress letters.
There's a difference between compressing letters and compressing words on the letter level.
No, I don't feel like bickering over semantics. What Uberminky describes is what I would call compressing words on a letter by letter level. His link does it word by word, as I would imagine Huffman compression would.
It's a left-brain/right-brain issue. You can use lossy compression for JPEGs because that kind of visual information is handled by our right brains, which are holistic and relatively unaffected by loss of detail information--capable, indeed, of discerning meaning even in the face of awesomely unfavorable signal-to-noise ratios. The left brain, OTOH, is responsible for linear, logic-oriented thinking where attention to detail can be paramount. This is the part of the brain responsible for language and reading. Lossy compression is simply not appropriate for text compression because for the left brain to read the decompressed text, it must first work hard to "manufacture" the missing details (possibly calling on the right brain to step in and help?).
However, this would make a fun programming project for a rainy day, or a way to get your feet wet in a new programming environment. Certainly more useful than the fifteen-minute-hack pig-latin translator I wrote the other day.
If English were as diverse as ASCII itself, there wouldn't be any words that look foreign to us. Dictionary compression is baked, I'm sure. Storing the text in all caps and uncompressing it to the rules of our language would be a great idea too. I guess you could intelligently remove and replace commas too.
Hell, you get a bun for this. I'll bet this idea isn't baked nearly like it could be. I want to see BURNT edges, dammit!
The anfractuosity compressor in the link is pretty interesting.
According to it, Alice in Wonderland compresses from 164K to 157K (and is still just about readable)!
It basically uses a thesaurus to do the compression by picking the shortest word.
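The shortest-synonym approach can be sketched in a few lines. This is not the linked project's code: `TOY_THESAURUS`, `shortest_synonym`, and `thesaurus_compress` are invented for illustration, and a real version would need an actual thesaurus plus some care about word sense, capitalization, and grammar.

```python
import re

# Hypothetical three-entry thesaurus; a real one would be far larger.
TOY_THESAURUS = {
    "enormous": ["huge", "big", "gigantic"],
    "purchase": ["buy", "acquire"],
    "automobile": ["car", "vehicle"],
}


def shortest_synonym(word: str) -> str:
    """Return the shortest of the word and its listed synonyms."""
    candidates = [word] + TOY_THESAURUS.get(word.lower(), [])
    return min(candidates, key=len)  # on a tie, the earliest candidate wins


def thesaurus_compress(text: str) -> str:
    """Replace each alphabetic word with its shortest known alternative."""
    return re.sub(r"[A-Za-z]+", lambda m: shortest_synonym(m.group(0)), text)
```

With this toy table, `thesaurus_compress("purchase the enormous automobile")` gives `"buy the big car"`. Words that are not in the thesaurus pass through unchanged, which may help explain why the savings reported above are modest.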