h a l f b a k e r y0.5 and holding.
ASCII code translates numbers in the range 0-255 into symbols (most of which are used seldom).
The genetic code (used by most living beings) translates DNA bases (a 4-letter alphabet: A, C, G, T) into amino acids (the 20-letter alphabet of proteins).
4x1 = 4. No use. 16 amino acids wouldn't be encoded.
4x4 = 16. Humpf! Almost there. Still 4 'unspeakable' amino acids.
4x4x4 = 'base triplet' or 'codon' = 64. More than enough!!
The technical word for 'more than enough' is 'degenerate'. Some amino acids are encoded by a single codon, some by two codons, and so on, up to six.
This means the first two bases in the triplet are usually 'essential', while the third one is often 'redundant': it may accept a mutation and the codon will still code for the same amino acid.
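The codon counts can be checked against the standard codon table. The sketch below is illustrative and not part of the original idea: it assumes the standard genetic code, written in the conventional T, C, A, G ordering with `*` for stop, and the names `CODON_TABLE`, `degeneracy` and `synonymous_fraction` are made up for the example. It tallies how many codons map to each amino acid, and how often a single-base change at each codon position leaves the amino acid unchanged.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def degeneracy() -> Counter:
    """Number of codons that encode each amino acid (and stop, '*')."""
    return Counter(CODON_TABLE.values())


def synonymous_fraction(position: int) -> float:
    """Share of single-base changes at `position` (0, 1 or 2) that are silent."""
    same = total = 0
    for codon, aa in CODON_TABLE.items():
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODON_TABLE[mutant] == aa
    return same / total


if __name__ == "__main__":
    print(sorted(degeneracy().items(), key=lambda kv: kv[1]))
    for pos in range(3):
        print(f"position {pos + 1}: {synonymous_fraction(pos):.2f} silent")
```

Running it prints the per-position fractions, which is a quick way to see which position in the triplet tolerates the most mutations.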
Imagine doubling the necessary numbers in the standard ASCII code, from 256 to 512. Useless! A waste of memory/disk room! Yeah, all of that, and more.
You don't want a binary 100 0111 (capital G) single-bit corrupted into a 101 0111 (capital W), and you definitely won't accept a 100 1000 -> 100 0000 (H -> @) change in your texts.
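Both corruptions are single-bit flips, which a few lines can reproduce. As a sketch only, the second half shows a hypothetical doubled code in which the lowest bit is a 'wobble' bit, so flipping it is silent. The helper names `encode9` and `decode9` and the 9-bit layout are assumptions for illustration, not something the idea specifies.

```python
def flip_bit(code: int, bit: int) -> int:
    """Return `code` with the given bit (0 = least significant) inverted."""
    return code ^ (1 << bit)


# Plain 7-bit ASCII: a single flipped bit yields a different character.
print(chr(flip_bit(ord("G"), 4)))  # 100 0111 -> 101 0111, prints W
print(chr(flip_bit(ord("H"), 3)))  # 100 1000 -> 100 0000, prints @


def encode9(ch: str, variant: int = 0) -> int:
    """Hypothetical doubled code: character code plus one synonymous low bit."""
    return (ord(ch) << 1) | (variant & 1)


def decode9(code: int) -> str:
    """Ignore the low 'wobble' bit when decoding."""
    return chr(code >> 1)


damaged = flip_bit(encode9("G"), 0)
print(decode9(damaged))  # still prints G
```

Only the wobble bit is protected in this layout; a flip in any of the other eight bits still changes the character.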
There would be several good rules to apply to the new code, such as providing highly used characters with more (synonymous) binary numbers.
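One possible reading of that rule, sketched under assumptions: every symbol gets at least one code, and the spare codes are handed out in proportion to how often each symbol is used. The function name `allocate_codes`, the largest-remainder rounding and the sample frequencies are choices made for this example only.

```python
def allocate_codes(freqs: dict[str, float], slots: int) -> dict[str, int]:
    """Split `slots` code numbers among symbols: one each, the rest by frequency."""
    if slots < len(freqs):
        raise ValueError("need at least one code per symbol")
    spare = slots - len(freqs)
    total = sum(freqs.values())
    shares = {s: spare * f / total for s, f in freqs.items()}
    counts = {s: 1 + int(share) for s, share in shares.items()}
    # Hand the codes lost to rounding to the symbols with the largest remainders.
    leftover = slots - sum(counts.values())
    by_remainder = sorted(freqs, key=lambda s: shares[s] - int(shares[s]), reverse=True)
    for s in by_remainder[:leftover]:
        counts[s] += 1
    return counts


if __name__ == "__main__":
    # Illustrative frequencies for a tiny four-symbol alphabet.
    print(allocate_codes({" ": 0.176, "e": 0.104, "t": 0.063, "K": 0.002}, 16))
```

With 512 slots and a full alphabet the same function would give the most common characters dozens of synonyms while rare ones keep a single code.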
For compression purposes, efficiency loss could be minimized by choosing the appropriate encoding so that code redundancy is put to use. If you can't compress the first binary number for character C1 along with the first binary number for character C2, then try alternative encodings until compression is optimal or the time cost is unaffordable.
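A rough way to probe how much the choice of synonym matters is to encode the same text twice, once always using the first synonym and once picking synonyms at random, and compare the compressed sizes. This is only a sketch: it assumes ASCII input, the hypothetical 9-bit doubled code (character code plus one synonym bit) stored in two bytes per character, and `zlib` as a stand-in for zip.

```python
import random
import zlib


def encode_text(text: str, pick_variant) -> bytes:
    """Encode ASCII text in the doubled code, two bytes per character."""
    out = bytearray()
    for ch in text:
        code = (ord(ch) << 1) | pick_variant(ch)  # low bit selects the synonym
        out += code.to_bytes(2, "big")
    return bytes(out)


def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))


text = "the quick brown fox jumps over the lazy dog. " * 40
rng = random.Random(0)  # fixed seed so the comparison is repeatable
fixed = encode_text(text, lambda ch: 0)
scattered = encode_text(text, lambda ch: rng.randint(0, 1))

print("always first synonym:", compressed_size(fixed))
print("random synonyms:     ", compressed_size(scattered))
```

Random synonym choices add about one bit of noise per character that no compressor can remove, so a consistent choice compresses better; searching over alternative choices, as suggested, would sit somewhere between these two extremes.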
Double the number of needles and the amount of hay in the stack: any magnet will be able to pull both needles in less than twice the time required for a single one.
Wikipedia: Forward error correction
http://en.wikipedia...ror-correcting_code [jutta, Jul 26 2007]
|
|
As far as I'm concerned computers are and will remain absolutely useless until they learn to perform internal operations in plain English. |
|
|
my humpf! my humpf my humpf my humpf! |
|
|
ASCII what you can do for your computer,
not what your computer can do for you. |
|
|
I just had wonderful mental images of computer internals of various nationalities. |
|
|
"No, don't be interrupting me, listen what I'm telling, this is a matter for the accountant only" |
|
|
Disclaimer: Any nationality assumed from the above text is the reader's responsibility. |
|
|
Really interesting link, the one about forward error correction. Thank you for that. |
|
|
Humm, as for compression yield, of course non-optimal compression would be achieved by this method.
The point I was trying to make is that living organisms use redundant codes (such as the genetic code) in order not to lose information even after billions of replicating generations.
Error rates in DNA copying range from 10^-7 to 10^-10 (that's one thousand to one million times better than the average outcome of proofreading a book), which is why organisms such as bacteria can copy their entire genetic information almost losslessly (it was never intended to compress anything).
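To put those rates in perspective, here is a small back-of-the-envelope calculation. The genome size is an assumption added for illustration (roughly that of E. coli, about 4.6 million bases), and the model treats every base as an independent copying event with the same error rate.

```python
import math

GENOME_BASES = 4_600_000  # assumption: roughly the size of an E. coli genome


def expected_errors(rate_per_base: float, bases: int = GENOME_BASES) -> float:
    """Average number of copying errors in one pass over the genome."""
    return rate_per_base * bases


def prob_perfect_copy(rate_per_base: float, bases: int = GENOME_BASES) -> float:
    """(1 - p)^N, computed via log1p to stay accurate for tiny p."""
    return math.exp(bases * math.log1p(-rate_per_base))


for rate in (1e-7, 1e-10):
    print(
        f"rate {rate:.0e}: {expected_errors(rate):.5f} expected errors, "
        f"P(perfect copy) = {prob_perfect_copy(rate):.4f}"
    )
```

Under these assumptions even the worse rate gives less than one expected error per copy of the whole genome.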
Anyway, from what you showed me, it's plain to foresee engineers will keep their jobs for a long while, because of their ability to tell the difference between 'signal' and 'noyse'. |
|
|
The second point, if ever there was a chance for it, was hummm, well, how good would zip performance be on that extended redundant code? And the answer, 'not as bad as expected', because of redundancy itself. |
|
|
Thank you all for discussing. |
|
|
1) The range of ASCII is 0-127
2) ASCII sucks anyway
3) Use RAID |
|
|
Well, I don't understand half of the ideas above, hummm. Ever since 'message science' began, people have been trying to figure out what to do or how to encode things.
An ancient Chinese Emperor ruled there should be 80,000 ideographic-syllabic characters at the most, because Chinese was almost impossible to learn already.
What the... do people mean by comments such as //ASCII sucks anyway//? Of course it sucks, man! Unused characters suck more than frequently used characters. (I'll give you an excerpt of Harry Potter's character frequencies at the bottom of this comment.)
Just bear in mind that there are only 58 different symbols in that book, and the top 30 most frequent characters make up >99% of the volume's 1,367,463 bytes. In fact, >51% of those bytes code for one of the top 6 characters in the ranking. A mini-micro-tiny-ridiculous 'ascii-6' would 'explain' half the goddamn book!!
Sounds like an outstanding statistic to begin with, doesn't it? I'm already half-baking a second-order Markov model so that frequent character pairs (such as 'th') could be 'condensed' into shorter codes. |
|
|
Char   Freq    Char   Freq
Blank  0.176   w      0.018
e      0.104   c      0.015
t      0.063   f      0.014
a      0.058   .      0.013
o      0.056   m      0.012
h      0.055          0.012
n      0.051   b      0.011
i      0.050   p      0.010
s      0.050   ,      0.010
r      0.049   v      0.010
d      0.039   k      0.008
l      0.028   H      0.008
g      0.026   I      0.003
u      0.024   T      0.002
y      0.019   K      0.002 |
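A frequency table like this one, the share covered by the top few characters, and the pair counts needed for condensing pairs such as 'th' can all be tallied in a few lines. This is a sketch with made-up function names and a toy sample sentence, not the script that produced the figures in the table.

```python
from collections import Counter


def char_frequencies(text: str) -> list[tuple[str, float]]:
    """Characters with their relative frequency, most common first."""
    counts = Counter(text)
    total = len(text)
    return [(ch, n / total) for ch, n in counts.most_common()]


def top_k_coverage(text: str, k: int) -> float:
    """Fraction of the text made up of its k most frequent characters."""
    return sum(freq for _, freq in char_frequencies(text)[:k])


def pair_counts(text: str) -> Counter:
    """Counts of adjacent character pairs, e.g. 'th'."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


sample = "the other thing that they thought"
print(char_frequencies(sample)[:3])
print(round(top_k_coverage(sample, 6), 3))
print(pair_counts(sample).most_common(1))
```

Feeding it a whole book instead of `sample` would give the kind of ranking shown in the table, and the pair counts show which digrams are worth a short code of their own.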
|
|
I am not sure I can follow your reasoning: in the idea you propose more than one code for often-used chars, and in the anno you propose less than two codes for two chars (th). So you would add a code for th, and then even make more than one code for th on account of its ubiquity? |
|
|
And where would the gain be in zipping a redundantly coded text? The redundancy would disappear. |
|
|
//living organisms use redundant codes (such as the genetic code) in order not to lose information even after billions of replicating generations.// |
|
|
That's not true in the way you meant it. I think you've got the wrong end of the stick.
The redundancy in 'the genetic code' - that is, the translation of DNA to protein sequence - is down to having 20 amino acids (plus 'stop') encoded within 64 codons[1]. The redundancy is used in a way which minimises the effect of some mutations, but if it were somehow not present, it wouldn't make any difference to the mutation rate. It doesn't end there, as there can be some information encoded within a set of codons for an amino acid[2], so supposedly silent mutations may still have an effect. |
|
|
There is redundancy which is used for error-correction, but that relies on there being two copies of the entire message: DNA has 2 strands which hold complementary information. If a base on one strand is damaged, or mis-encoded during copying, the issue can be repaired by reference to the other strand. Higher organisms are mostly diploid, having two DNA molecules encoding (virtually) the same message. Thus they really have four DNA strands carrying their genome. |
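The two-strand repair scheme can be sketched in a few lines. This is a deliberate simplification with made-up names: damaged bases are assumed to be recognisable (marked `N` here), and the two strands are compared position by position, ignoring the fact that real strands run in opposite directions.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Partner strand, base by base (orientation ignored in this sketch)."""
    return "".join(COMPLEMENT[base] for base in strand)


def repair(damaged: str, partner: str) -> str:
    """Fill damaged positions (marked 'N') using the partner strand."""
    if len(damaged) != len(partner):
        raise ValueError("strands must have the same length")
    return "".join(
        COMPLEMENT[p] if d == "N" else d for d, p in zip(damaged, partner)
    )


original = "GATTACA"
partner = complement(original)
print(repair("GATNACA", partner))  # prints GATTACA
```

The cost is a full second copy of the message, which is the point of comparison with more economical error-correcting codes.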
|
|
Modern error-correcting codes are much more efficient than this, in any case. They don't need even two times as many bits to correct relevant types of error. |
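As one small, classic illustration of that efficiency (chosen here as an example; the annotation does not name a specific code), the Hamming(7,4) code protects 4 data bits with 3 parity bits and corrects any single flipped bit. The function names are made up for this sketch.

```python
def hamming74_encode(data: list[int]) -> list[int]:
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word: list[int]) -> list[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1  # corrupt one bit in transit
print(hamming74_decode(codeword))  # prints [1, 0, 1, 1]
```

Seven bits carry four, so the overhead is 1.75 times the data, already below the twofold cost of keeping a complete second copy, and longer codes do considerably better.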
|
|
[1] 20-ish. There are some special cases we needn't concern ourselves with.
[2] Depending on the organism, one of the codons for an amino acid may be favoured, especially in highly expressed genes. There is a higher concentration of the corresponding tRNA in the cell. This is essentially an optimisation of translation speed. Mutating away from that codon may be deleterious, just not as bad as a change to another amino acid. |
|
| |