So various governments and organisations are collecting data in the hope of decrypting it at some point in the future, when the cryptographic standard is broken or quantum computing becomes a thing, or whatever.
To do this, they need to store the information until that point, so they have large
server farms intercepting, collating and storing traffic from the internet. Presumably this means they have to be moderately selective about what they save; they don't have infinite storage space.
Obviously, one can try to punt the problem further into the future by using stronger algorithms and more bits in your encryption key. But even that's not entirely reliable, because you need to trust that the (newer, larger, novel and less analysed) encryption algorithm isn't backdoored.
One method of encryption which is 100% secure (under certain assumptions) is the One Time Pad (OTP).
This works by "exclusive-oring" your message bit-by-bit with random data, which the recipient also has.
The /downside/ of this is that the recipient needs to have the random data beforehand, and you can only use each random bit once.
The /difficulty/ of this is that the data does need to be random, in the sense that there must be no discoverable bias in, or correlations between, the bits.
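To make the XOR step concrete, here's a minimal sketch in Python (the function name and the assertion are just illustrative, not part of any standard):

def otp_xor(message: bytes, pad: bytes) -> bytes:
    # One-time-pad core: XOR each message byte with a pad byte.
    # The pad must be truly random, at least as long as the message,
    # and never reused.
    assert len(pad) >= len(message), "pad too short"
    return bytes(m ^ p for m, p in zip(message, pad))

# Encrypting and decrypting are the same operation:
#   ciphertext = otp_xor(plaintext, pad)
#   plaintext  = otp_xor(ciphertext, pad)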
So I propose a combination of these two approaches:
First, you create a large number of very big files. It doesn't matter what they are, provided they have a large amount of effective randomness. Shaky video footage would be fine. Any file which may contain repetitive regions is compressed. You put these files on two hard disks, keep one and give the other to the person you want to communicate with. Each of these files can be used for one communication, after which it must be securely wiped.
To send a message using this protocol, you need 4 things:
1) your message.
2) a pseudokey, which is a freshly generated random string of bits of some reasonable size, perhaps 256 bytes (2048 bits).
3) a standard cryptographic function, with its password(s), which you and the recipient have previously arranged as normal.
4) a large One Time File, as described above.
In this process, all the data in this file is combined with your message. To do this, a one-way derivative of an all-or-nothing transform is used. The file is split into blocks, each the size of the pseudokey. If the file is not an integer number of blocks long, the partial block is discarded.
Each block is hashed together with its index (i.e. its position within the file), the pseudokey, and the hash output of the previous block. (The hash used should be cryptographically strong.)
The final block hash is not stored; instead it is added to the pseudokey. Then the process is repeated in reverse - that is, working backwards from the end of the file, without resetting the index (i.e. it continues to increment).
Now the original data in the file is completely garbled, but every bit depends on the rest of the file and the pseudokey. Therefore, to be able to generate it, the original file must be stored in its entirety (apart from any partial final block).
Call this resultant data the 'keypad'.
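To make the keypad derivation concrete, here is a rough Python sketch. The 256-byte block size, the use of SHAKE-256 (so each hash output can be made the same size as a block), the handling of the reverse pass, and reading "added to the pseudokey" as appending are my assumptions for illustration, not fixed parts of the scheme:

import hashlib

BLOCK = 256  # bytes; assumed equal to the pseudokey size

def chained_hashes(blocks, key, start_index):
    # Hash each block together with its index, the key, and the hash
    # output of the previous block.  SHAKE-256 lets each output be the
    # same size as a block (an assumption, not specified in the idea).
    outputs, chain, idx = [], b"", start_index
    for block in blocks:
        chain = hashlib.shake_256(
            idx.to_bytes(8, "big") + key + chain + block).digest(BLOCK)
        outputs.append(chain)
        idx += 1
    return outputs, idx

def derive_keypad(padfile: bytes, pseudokey: bytes) -> bytes:
    # Split into whole blocks; any partial final block is discarded.
    n = len(padfile) // BLOCK
    if n == 0:
        raise ValueError("padfile smaller than one block")
    blocks = [padfile[i * BLOCK:(i + 1) * BLOCK] for i in range(n)]
    # Forward pass over the file.
    forward, idx = chained_hashes(blocks, pseudokey, 0)
    # "The final block hash is added to the pseudokey" - assumed here
    # to mean appended to it.
    key2 = pseudokey + forward[-1]
    # Reverse pass: same procedure, working backwards from the end of
    # the file, with the index continuing to increment.
    backward, _ = chained_hashes(list(reversed(blocks)), key2, idx)
    # Every byte of the result depends on the pseudokey and the whole
    # file (minus any partial final block).
    return b"".join(backward)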
To encode the message, it is first exclusive-orred with the keypad, then encoded with a standard encryption algorithm, along with the pseudokey and an identifier for the file used, that is, its filepath on the hard disk.
After the message is encoded, the referenced file should be securely wiped from the disk by overwriting with various bitpatterns and pseudorandom data.
Decoding effectively follows the same process as encoding. The file is identified and converted to a keypad by the same procedure; the message is decrypted using the appropriate password, then exclusive-orred with the keypad.
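Putting that together, a sketch of the encode/decode steps around the keypad. This assumes the derive_keypad() function sketched above, and leaves the "standard encryption algorithm" as whatever cipher the two parties have already arranged:

import secrets

def precrypt(message: bytes, padfile: bytes):
    # Freshly generated pseudokey, 256 bytes (2048 bits) as suggested above.
    pseudokey = secrets.token_bytes(256)
    keypad = derive_keypad(padfile, pseudokey)   # from the sketch above
    if len(keypad) < len(message):
        raise ValueError("padfile too small for this message")
    # XOR the message with the keypad.
    xored = bytes(m ^ k for m, k in zip(message, keypad))
    return xored, pseudokey

# The bundle (xored message, pseudokey, padfile identifier) is then
# encrypted with the previously arranged standard algorithm and sent,
# and the padfile is securely wiped.  The recipient decrypts that bundle
# as normal, re-derives the keypad from their copy of the padfile, and
# XORs again:
def deprecrypt(xored: bytes, pseudokey: bytes, padfile: bytes) -> bytes:
    keypad = derive_keypad(padfile, pseudokey)
    return bytes(c ^ k for c, k in zip(xored, keypad))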
Now, if in the future /they/ want to read the message, they'll need to have acquired a copy of your padfiles, before you deleted them.
- How is this better than just using an OTP?
This has the additional security that a significant storage commitment is necessary. Note that it doesn't protect you from a premeditated, targeted attack. It protects you from a general, 'bottom-trawling' attack. OTPs are not sufficient for that purpose because they are relatively small and any 'popular', widespread solution could be easily stolen from everyone by malware and stored until required.
Also, in jurisdictions like the UK, OTPs may open you to a legal risk, in that they may (accidentally or disingenuously) be considered encrypted data. A set of large but easily interpreted files does not have that issue, provided the amount of data is large enough.
All-or-nothing_transform
https://en.wikipedi...r-nothing_transform [Loris, Apr 29 2023]
|
|
//Why not use digits of pi or e or any other transcendental number? The key for any day or message would simply be a start point to start using the bits.// |
|
|
You mean, for the pseudokey?
I suppose you could use e.g. the date and time the message was transmitted to encode e.g. an offset into pi. But the risk is you've reduced the search-space significantly, and you'd lose 'volume protection', because they could precalculate results for plausible values, or use some clever trick.
It's basically not worth the risk of 'them' getting hold of your (shared) method of precalculating a key for the message; the only reason the huge keypad file needs to be kept is that they don't know the pseudokey ahead of time, so why give them a chance at guessing it? |
|
|
For security, a few hundred extra bytes of random data in the message isn't a big deal - I think quite a few cryptographic functions use the pseudokey concept. |
|
|
//Double-take .. legal risk in UK from using some forms of data encryption? Tell me more?// |
|
|
I've gone on about this before. There's a law called the Regulation of Investigatory Powers Act (RIPA). Basically the police can demand you supply a key to make information 'intelligible'. If you can't, you have to prove you don't have a key, or you're still committing an offence. Quite how you're supposed to prove you've forgotten something (or never had it, or that the file isn't encrypted data) is not established. |
|
|
No, I think if you say that they'll keep up with the rubber hose until you remember. |
|
|
Using ANY non-random data is silly. Especially formatted data. A video file is nearly as bad as pi, which is nearly as bad as not using it at all. And if you use actual random data you're just moving the OTP one step up in the decryption scheme. |
|
|
OTP is perfect protection against anything which doesn't have the key. Your scheme doesn't add any significant security against people who have stolen the key. |
|
|
Voice, it may start off formatted, but that's not going to matter after you've munged all the data in the file together.
Basically 'any' normal large media file of undocumented content contains a fair amount of 'random' data. At one point people used videos of lava-lamps to generate random numbers. They'd hash a video frame down to get a string of random bits. |
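That trick amounts to something like this, in Python (frame_bytes standing in for the raw pixel data of one captured frame):

import hashlib

def random_bits_from_frame(frame_bytes: bytes) -> bytes:
    # Condense a noisy camera frame down to 256 'strong' random bits.
    return hashlib.sha256(frame_bytes).digest()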
|
|
You could read up on "all-or-nothing transforms" (link now provided, it was remiss of me not to have put one up in the first instance, sorry), which are a core part of this, and perform a similar set of manoeuvres, in a reversible way. One of the suggestions there is to perform an all-or-nothing transform on a large file, and then encrypt only a small part of it to secure all of it. |
|
|
And if you read all the way to the bottom of the idea, you'll see what advantages I claim for this, over OTP. It's a distinct use-case. |
|
|
So let's do some data crunching. How big should your files be? |
|
|
The objective is to out-cost the adversary in data storage.
Therefore, I think the primary metric to consider is how large a pool to dedicate to each communication channel (i.e. communication with another individual). The file dedicated to a single message should be 'large' - but when sending a message, you choose one from those available at random, so an adversary has to store them all.
(Of course, if they've fully compromised every machine, they can read everything you do already, game over. However, people would find out, and soon the world would know. Let us assume they want to avoid being detected and must act covertly, and therefore must avoid long-term footprints on users' computers. They can read data in the post, transiently infect computers with obfuscated viruses, maybe sneak in and clone people's hard-disks occasionally, etc.) |
|
|
Let us consider the mass-surveillance case. The adversary would need to compromise a large number of devices (they only need one end of each comms channel), and exfiltrate all the data over the internet. Let us assume that they are willing and able to do this. |
|
|
For this to work for individuals, this has to be manageable in terms of cost and convenience. I suggest that a reasonable size to consider is one large commodity hard-disk. A security-conscious individual needing to send and receive many messages might be willing to dedicate one or more entire disks to their secure comms, but for wide-scale use, most people needing only a few secure lines may use just a part of their existing hard disk.
I found a 4 TB disk on Amazon with good reviews for 88 UK pounds, while a similar 2 TB disk was 61 pounds, so I am going to take the 4 TB disk as the base size; dedicating half of it to padfiles gives 2 TB, with an incremental cost of 27 pounds. |
|
|
Amazon S3 Glacier Deep Archive reports a storage cost of about $1 per TB-month (which I think is reasonable to represent very efficient, long-term storage costs at scale). I'm going to ignore retrieval costs. One dollar is currently 0.8 pounds, give or take, so let's call it 0.8 pounds per month per terabyte, or 9.6 pounds a year per terabyte. So, for each individual user, /they/ would have to spend about 70% of what you spent to store your data outright - every year. If you get a new computer every 4 years, they've spent almost 3 times as much on storing it.
Why this difference? Well, they need infrastructure; you're just using a commodity system. The other thing is that you probably don't back this up. (In the mass market, to a good approximation, no-one ever backs anything up.) If the hard disk dies before you've transferred everything to a fresh one, well, you'll need to refresh all your channels.
Also, you send and receive messages, and refresh padfiles, at a pace which suits you; they have to dance to your tune. |
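The arithmetic behind those figures, for anyone who wants to check it (all prices as quoted above):

# Your side: commodity disk prices quoted above.
disk_4tb = 88                          # pounds
disk_2tb = 61                          # pounds
incremental_2tb = disk_4tb - disk_2tb  # 27 pounds for the extra 2 TB

# Their side: archival storage at ~$1 per TB-month, at 0.8 pounds/dollar.
pounds_per_tb_year = 1 * 12 * 0.8      # 9.6 pounds per TB per year
their_cost_per_year = 2 * pounds_per_tb_year   # 19.2 pounds/year for 2 TB

print(their_cost_per_year / incremental_2tb)       # ~0.71, i.e. about 70% per year
print(4 * their_cost_per_year / incremental_2tb)   # ~2.8, i.e. almost 3x over 4 years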
|
|
So the question is, is that enough? I think yes, because that sort of amount starts to look prohibitive on a budget when you're dealing with hundreds of millions of users. I've only calculated the storage cost here, but of course they'd also need a good comms system to deal with the firehose of all that additional data - hundreds, maybe thousands of times what they've previously budgeted for. |
|
|
So, next question for consideration - how do you generate all those padfiles?
They need to be individually, moderately large, relatively easy to generate, hard to predict, and 'obvious' in the sense that they should be 'intelligible' - they should have some sort of clear purpose, however vapid. |
|
|
One option, as mentioned in the idea, is videos of a source showing some sort of unpredictable movement. If you use an action camera or 'GoPro' to protect yourself from liability and/or identify offenders, you could generate at least one file from every trip you make. At least, where nothing of interest happens.
It looks like these produce about a gigabyte of data every 2 minutes. Assuming use for half an hour per day, and some software to automatically split the footage into ~2-minute, 1-gigabyte movie files, you could generate fifteen 1-gigabyte files per day. This means you could generate 1 terabyte in a little over 2 months.
I think this is a workable process - it might take a while for one person to build up some stock, but over time, more files would be generated than an average person will typically need for private messages. |
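Again, the back-of-envelope numbers (rates as assumed above):

gb_per_minute = 1 / 2                             # about 1 GB per 2 minutes of footage
minutes_per_day = 30                              # half an hour of recording per day
files_per_day = gb_per_minute * minutes_per_day   # 15 one-gigabyte files per day
days_per_terabyte = 1000 / files_per_day          # ~67 days - a little over 2 months
print(files_per_day, days_per_terabyte)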
|
|
Now, you might say that 1 gigabyte is too big. I say - read the idea title.
But this notwithstanding, smaller messages could get away with smaller files. Provided the padfile contains enough 'randomness', it is the size of the pool which matters, not the individual file.
And it would be nice to have an automatic process which could generate many padfiles. |
|
|
Therefore, I propose a program which generates many high-quality, high-resolution 'desktop background' images, each seeded using authenticated strong random numbers, by using a range of arbitrarily coloured, scaled and placed vector images.
These are in some sense the inverse of the lava-lamp random number generation I mentioned above. Instead of hashing an image containing randomness to generate a small amount of 'strong' random bits, this takes a (moderately) small amount of randomness and generates a large image - all the bits of which matter. |
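A rough sketch of such a generator, assuming Pillow for the drawing; the shape placement, size and colour are all driven by a strong random seed, so every pixel of the output depends on it:

import random, secrets
from PIL import Image, ImageDraw   # Pillow

def make_padfile_image(path, width=3840, height=2160, shapes=500):
    seed = secrets.token_bytes(32)    # strong random seed for this image
    rng = random.Random(seed)         # drives every drawing decision below
    background = tuple(rng.randrange(256) for _ in range(3))
    img = Image.new("RGB", (width, height), background)
    draw = ImageDraw.Draw(img)
    for _ in range(shapes):
        x0, y0 = rng.randrange(width), rng.randrange(height)
        x1, y1 = x0 + rng.randrange(50, 800), y0 + rng.randrange(50, 800)
        colour = tuple(rng.randrange(256) for _ in range(3))
        if rng.random() < 0.5:
            draw.ellipse([x0, y0, x1, y1], fill=colour)
        else:
            draw.polygon([(x0, y0), (x1, y0), (rng.randrange(width), y1)],
                         fill=colour)
    img.save(path, "PNG")             # lossless, so none of the bits are thrown away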
|
|
One issue with this is potential asymmetry in communication. Suppose we have a communications channel where one end sends more messages than the other, or messages are 'bursty' - several messages needing to be sent in quick succession.
If padfiles are earmarked for transmission in a specific direction, the pool may 'run dry' sooner than it needs to - that is, there are still plenty of padfiles available at both ends, they're just committed to the wrong direction. |
|
|
There are at least a couple of ways round this. One option would be to keep the directionality, but when a message is sent and the pools are too imbalanced, to additionally report that a different padfile should change pools to compensate.
Obviously there is an edge-case where that padfile has been used in the meantime, but that doesn't actually matter, provided the file isn't used by the initial proponent before confirmation that the message has got through. |
|
|
Alternatively, the two channel ends could initially arrange a seed and a distinguishing value between them which, when hashed with a varying value (say, the current date in a given format) and a padfile's name, divides the padfiles into four groups (call them 0, 1, 2, 3 - i.e. the hash modulo 4). When a message needs to be sent, one end pulls from group 0, while the other pulls from group 2.
However, neither end chooses a file which would have been in the group specified for the other end on the previous day.
If there are no files available in the group for the day, pulling from the next group up is acceptable (under the same conditions). At that point, however, the pool is dangerously low.
This strategy means that there are no large fixed groups of files with dedicated direction, and the groupings change each day. If a channel is rarely used, an adversary cannot narrow down which file would be chosen at an arbitrary (unspecified) point in the future, and would have to store the entire pool. If they only care about communications in one direction, they still need to store the entire pool.
However, as the pool becomes depleted, availability probabilistically becomes an issue before it is completely drained. A small number of emergency 'unidirectional' padfiles could be retained against this case, although ideally the pool should be replenished well before this transpires. |
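A sketch of the grouping rule, assuming SHA-256 and the four groups described above; 'seed' and 'distinguisher' are the values the two ends arranged beforehand:

import hashlib
from datetime import date

def group_of(padfile_name: str, seed: bytes, distinguisher: bytes,
             day: date) -> int:
    # Hash the shared seed, the distinguishing value, the date and the
    # padfile's name together, and take the result modulo 4.
    digest = hashlib.sha256(seed + distinguisher +
                            day.isoformat().encode() +
                            padfile_name.encode()).digest()
    return int.from_bytes(digest, "big") % 4

# By convention one end sends from today's group 0, the other from group 2,
# skipping any file that fell in the other end's group yesterday.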
|
|
So far I have mentioned having two different sizes (and types) of padfile. I'd like to consider the distribution of padfile sizes in more detail. But before that, I think we should consider how to decide which padfiles are acceptable for 'precrypting' a message. |
|
|
It is important that a long message doesn't get precrypted with insufficient randomness to properly scramble it. We can get a good first-line defence against this by specifying a minimum scale-factor for the raw padfile. That is, if the message is L bytes long, then the raw padfile should be at least M×L bytes long, where M is a parameter of the method - let us say that we expect it to be larger than 100. More security is provided by larger values. |
|
|
Further to that, different types of file may contain different proportions of 'random data'. Most file types contain at least some formatting information, which should be excluded from consideration of the amount of entropy a file contains. Some filetypes are more compact than others. So we could assign a 'reduction ratio', R, which is effectively an additional multiplication factor (greater than or equal to 1) assigned to each type of file. |
|
|
So the pool of padfiles available to encrypt a message of size L is those with a size of at least R×M×L. |
|
|
For example, let us suppose that many messages will be shorter than 10 kB. If we suppose that a filetype has an R of 2, and M was set to 100 for the communication channel, then padfiles of 2 MB and larger would suffice for these messages.
Longer messages would of course need larger padfiles. I will post something about that when I've thought about it some more. |
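The eligibility test itself is tiny; a sketch using the numbers from the worked example above (M=100, R=2, 10 kB message):

def eligible(padfile_size: int, R: float, L: int, M: int = 100) -> bool:
    # The raw padfile must be at least R * M * L bytes for an L-byte message.
    return padfile_size >= R * M * L

print(eligible(2_000_000, R=2, L=10_000))   # True  - a 2 MB padfile just suffices
print(eligible(1_500_000, R=2, L=10_000))   # False - too small for this message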
|
|
Padfile size distribution.
It is of course the case that the larger the number of padfiles in the pool, the more messages one can send before running out. And for a fixed total storage budget, the number of padfiles one can store is inversely proportional to their size.
However, the longest padfile one has would seem to dictate the maximum length of a message which could be sent using this system.
These two facts would seem to be at odds, and suggest the need for a large range of different padfile sizes. This then raises a new concern over how you choose the file to precrypt a particular message. If the choice is completely random except for size constraints, one would constantly be using files larger than deemed necessary. Conversely, if you select the best match in terms of size, an adversary may be able to predict which files you might use. |
|
|
Furthermore, we would ideally support an arbitrary mixture of message lengths (including 'all very large' and 'all very small'), and also support a large range of padfile sources - each of which may have a characteristic size range (when considered in terms of random data).
Therefore, I propose the following extension to the system: |
|
|
One Time Files can be concatenated to generate a larger file. |
|
|
Obviously, using more files means increasing the file description data which has to be sent, but I don't think that's a significant issue in practice.
Regarding the file selection problem, I propose the following algorithm (a rough sketch in code follows the list): |
|
|
1) Determine which files are available for transmission, and their capacity (that is, their size divided by the reduction ratio R)
2) Of these, choose a number of them ('X') at random. Let us say X=4.
3) Is any one of the files by itself large enough to encrypt the message? If so, use the smallest which will do that.
4) Of the files in play, is the total size larger than required? If not, fetch another file - and keep doing so until this is true.
5) Of the files in play, is the largest by itself large enough to encrypt the message? If so, use that.
6) If we get to here, we now have enough storage space, spread over multiple files, and have reduced the scope to a knapsack problem. Since we don't need an optimal solution, I suggest creating a sequence of files, starting with the largest, then speculatively adding the next largest until the total exceeds the target, then removing the last file added and checking each remaining option in its place, starting with the smallest. |
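A rough sketch of that selection procedure; 'available' maps padfile names to (size in bytes, reduction ratio R), and the details of steps 2, 4 and 6 are my reading of the list above rather than anything definitive:

import random

def choose_padfiles(available, message_len, M=100, X=4):
    target = M * message_len
    # 1) capacity of each available file is its size divided by R
    capacity = {name: size / R for name, (size, R) in available.items()}
    # 2) pick X of them at random
    in_play = random.sample(list(capacity), min(X, len(capacity)))
    # 3) if any single candidate is big enough, use the smallest such file
    fits = [n for n in in_play if capacity[n] >= target]
    if fits:
        return [min(fits, key=capacity.get)]
    # 4) keep fetching further files until the total capacity is sufficient
    rest = [n for n in capacity if n not in in_play]
    random.shuffle(rest)
    while sum(capacity[n] for n in in_play) < target and rest:
        in_play.append(rest.pop())
    # 5) a newly fetched file might be big enough on its own
    biggest = max(in_play, key=capacity.get)
    if capacity[biggest] >= target:
        return [biggest]
    # 6) knapsack-ish: add files largest-first until we exceed the target,
    #    then try to swap the last-added file for the smallest one that
    #    still gets the total over the line
    chosen, total = [], 0
    for n in sorted(in_play, key=capacity.get, reverse=True):
        chosen.append(n)
        total += capacity[n]
        if total >= target:
            break
    last = chosen.pop()
    total -= capacity[last]
    for n in sorted(set(in_play) - set(chosen), key=capacity.get):
        if total + capacity[n] >= target:
            return chosen + [n]
    return chosen + [last]   # fallback (assumes the pool as a whole suffices)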
|
|
This does mean that we will generally use a larger file than strictly necessary, and quite often use more than one small file for a medium-sized message. However, assuming you have more small files than large ones, for the most part it won't use up the largest files on small messages, or many small files on one large message. |
|
|
If we need to concatenate a series of files, we do that exactly end-to-end, and only discard the final partial block. |
|
| |