h a l f b a k e r yWhat's a nice idea like yours doing in a place like this?
add, search, annotate, link, view, overview, recent, by name, random
news, help, about, links, report a problem
browse anonymously,
or get an account
and write.
register,
|
|
|
Usage: Networking and large text based storage and retrieval. Can compress even small texts of few words long.
Features:
a. Optimized header:
Header should be optimized in various ways. (OtherHalfbakery to supply other half of this explanation).
Algorithm:
a. Count unique letters.
(a,., ,c,o,u,n,t,i,q,e,l,t,r,s,. =15/24)
b. Use minimum bits to store. (4 bits per char)
c. Look for repeating letter sequences (starting at half lettes available and less. eg. qu would be one letter if used more than once.).
d. Look for repeating SIMILAR sequences with dif of up to 5%.
e. Replace repeating with the next unused number +
an identifier.
Lempel-Ziv-Welch Compression
http://en.wikipedia...ki/Lempel-Ziv-Welch The granddaddy. [jutta, Dec 23 2008]
[link]
|
|
You're trying to store the alphabet, numbers, and punctuation in "4 bits per char". 4 bits can represent 2^4, or 16, different symbols. There are more letters than that alone. |
|
|
The compression scheme reminds me a little bit of LZW, with the difference that LZW actually works and is widely used. Maybe understanding how it works would be a useful starting point? |
|
|
So look for similar text patterns, group them and then apply something like a Huffman code. Decoding would then need some kind of dictionary to determine what the original word would be. |
|
|
Jutta cmon! I wrote one of the first implementations of GSM (in '93 or 4 if I recall correctly) and of mpeg in the then new Java.
Last time I went into techspeak I was boned to the bone. |
|
|
We need a lightweight dictionary-free on the fly compression for SMALL sequences of words. |
|
|
Meant to say starting at 4 bits per char (header to give charsize, or use variable length compression schemes, in any case header points to it) |
|
|
I've always thought: Seven bits packed
together by missing out the ASCII parity bit (one
eighth space saved) Five bits per letter, then
case shift characters, so twenty-eight taken up by
that. Capitals occur automatically after a
period, which encodes two trailing
spaces. thirty-two codes for the commonest
bound morphemes. Thirty-two codes for the
commonest words. Thirty-two codes for the
commonest strings. Words are decompressed
by putting in spaces appropriate to
context. Numbers are not included but are
compressed as Roman numerals below about the
twelfth power of two, as words above. This gives
ten extra codes for punctuation, white
space and control characters. It would also enable
a limited degree of translation through
decompression. |
|
|
// We need a lightweight dictionary-free on the fly compression for SMALL sequences of words. // What for? Short text messages are the least of any networks load. (Except for the in-class networks relying on small paper scraps) |
|
|
//We need a lightweight dictionary-free on the fly compression for SMALL sequences of words.// |
|
|
//(header to give charsize, or use variable length compression schemes, in any case header points to it)// |
|
|
That's a bit dictionary-esque, not so? There are even heuristic methods for this, although they are lossy (consider t9 predictive text input). |
|
|
[loon...] Why? Well because any *lossless* optimisation, especially on small strings, results in optimisation and effeciencies that can be scaled up to larger strings. Although that would not be the intention of this idea. |
|
|
Why? For very low capacity, low cost devices which need networking with short text packets. |
|
|
I can remember the fist few versions of ol' PK. It was possible to double and triple zip files gaining small additional fractions of compression. I clearly remember getting a file within 100 bytes of a floppy disk's capacity. |
|
|
How about 100 bytes in 30? |
|
|
From wikipedia on Huffman coding: No other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to do this in linear time if input probabilities (also known as weights) are sorted. |
|
|
For normal text of say 100 letters, my algorithm would work better than huffman encoding, since it does "text compression". Finally, you could huffman encode the result. |
|
| |