Halfbakery: Alphanumeric only compression

Please log in.

Before you can vote, you need to register. Please log in or create an account.

Computer: Compression
Alphanumeric only compression (+2, -2) [vote for, against]
Fast text-only compression (only available chars and symbols on keyboard)

Usage: Networking and large text based storage and retrieval. Can compress even small texts of few words long.

Features:
a. Optimized header: Header should be optimized in various ways. (OtherHalfbakery to supply other half of this explanation).

Algorithm:
a. Count unique letters. (a,., ,c,o,u,n,t,i,q,e,l,t,r,s,. =15/24)
b. Use minimum bits to store. (4 bits per char)
c. Look for repeating letter sequences (starting at half lettes available and less. eg. qu would be one letter if used more than once.).
d. Look for repeating SIMILAR sequences with dif of up to 5%.
e. Replace repeating with the next unused number + an identifier.
-- pashute, Dec 23 2008

Lempel-Ziv-Welch Compression http://en.wikipedia...ki/Lempel-Ziv-Welch
The granddaddy. [jutta, Dec 23 2008]

You're trying to store the alphabet, numbers, and punctuation in "4 bits per char". 4 bits can represent 2^4, or 16, different symbols. There are more letters than that alone.

The compression scheme reminds me a little bit of LZW, with the difference that LZW actually works and is widely used. Maybe understanding how it works would be a useful starting point?
-- jutta, Dec 23 2008

So look for similar text patterns, group them and then apply something like a Huffman code. Decoding would then need some kind of dictionary to determine what the original word would be.
-- Jinbish, Dec 23 2008

Jutta cmon! I wrote one of the first implementations of GSM (in '93 or 4 if I recall correctly) and of mpeg in the then new Java. Last time I went into techspeak I was boned to the bone.

We need a lightweight dictionary-free on the fly compression for SMALL sequences of words.
-- pashute, Dec 31 2008

Meant to say starting at 4 bits per char (header to give charsize, or use variable length compression schemes, in any case header points to it)
-- pashute, Dec 31 2008

I've always thought:
Seven bits packed together by missing out the ASCII parity bit (one eighth space saved)
Five bits per letter, then case shift characters, so twenty-eight taken up by that.
Capitals occur automatically after a period, which encodes two trailing spaces.
thirty-two codes for the commonest bound morphemes.
Thirty-two codes for the commonest words.
Thirty-two codes for the commonest strings.
Words are decompressed by putting in spaces appropriate to context.
Numbers are not included but are compressed as Roman numerals below about the twelfth power of two, as words above. This gives ten extra codes for punctuation, white space and control characters. It would also enable a limited degree of translation through decompression.
-- nineteenthly, Dec 31 2008

// We need a lightweight dictionary-free on the fly compression for SMALL sequences of words. // What for? Short text messages are the least of any networks load. (Except for the in-class networks relying on small paper scraps)
-- loonquawl, May 27 2009

//We need a lightweight dictionary-free on the fly compression for SMALL sequences of words.//

//(header to give charsize, or use variable length compression schemes, in any case header points to it)//

That's a bit dictionary-esque, not so? There are even heuristic methods for this, although they are lossy (consider t9 predictive text input).

[loon...] Why? Well because any *lossless* optimisation, especially on small strings, results in optimisation and effeciencies that can be scaled up to larger strings. Although that would not be the intention of this idea.
-- 4whom, May 27 2009

pkunzip!
-- WcW, May 28 2009

Why? For very low capacity, low cost devices which need networking with short text packets.
-- pashute, Jun 17 2009

I can remember the fist few versions of ol' PK. It was possible to double and triple zip files gaining small additional fractions of compression. I clearly remember getting a file within 100 bytes of a floppy disk's capacity.
-- WcW, Jun 18 2009

How about 100 bytes in 30?
-- pashute, Jun 18 2009

bonsai Huffman tree?
-- pertinax, Jun 18 2009

From wikipedia on Huffman coding: No other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to do this in linear time if input probabilities (also known as weights) are sorted.

For normal text of say 100 letters, my algorithm would work better than huffman encoding, since it does "text compression". Finally, you could huffman encode the result.
-- pashute, Jun 24 2009

random, halfbakery