Please log in.
Before you can vote, you need to register. Please log in or create an account.
Computer: Backup
Anti-redundancy backup   (+8, -1)  [vote for, against]
Back up hard disks (or servers, or whatever) while avoiding redundancy

[rewritten to minimize confusion]

Many people's hard disks have numerous identical copies of many files. In some cases, this is because people make a 'backup copy' of a directory, then change a few files, then make a different backup copy of that directory, etc. In other cases, people will install different applications which are nice enough to keep their DLL's in the application directory, but which happen to have the same DLL's. For whatever reasons, though, they key point is that it's common for people to have multiple byte-for-byte identical files on their hard drives.

With hard drive storage being so cheap, this usually doesn't pose a problem. When trying to back up a hard drive, however, these extra files use up space on the backup media and cause the backup to take additional time.

I would propose that backup software perform a quick CRC-64 or similar hash on each file before backing it up, and compare it with the CRC-64's of files that have already been backed up in that session. In case there seems to be a match, do a direct file comparison (probably faster than backing the thing up) and if the files match, just write a link in place of the second file.

This approach could work especially well when backing up to media like CD-R's since they already inherently support file linking.
-- supercat, Dec 13 2002

ry4an's links as links, #1 http://www.mikerube...rs/rsync_snapshots/
[RayfordSteele, Oct 04 2004, last modified Oct 05 2004]

#2 http://rdiff-backup.stanford.edu/
[RayfordSteele, Oct 04 2004, last modified Oct 05 2004]

#3 http://www.nongnu.org/duplicity/
[RayfordSteele, Oct 04 2004, last modified Oct 05 2004]

CDZip http://www.halfbakery.com/idea/CD-Zip
Kind of like the last preposition here? [reap, Oct 04 2004, last modified Oct 05 2004]

Microsoft paper on similar subject in 2002 http://research.mic.../pubs/ICDCS2002.pdf
Reclaiming Space from Duplicate Files in a Serverless Distributed File System [gtoal, Apr 21 2008]

Efficient Algorithms for Sorting and Synchronization http://www.samba.or...idge/phd_thesis.pdf
Andrew Tridgell's PhD thesis (the samba guy) includes stuff on incremental checksums [gtoal, Apr 21 2008]

an implementation? http://www.opennet....kup/index_eng.shtml
[chrizoo, Apr 23 2008]

Delta Backup http://en.wikipedia.org/wiki/Rsync
[chrizoo, Apr 24 2008]

Areca http://areca.sourceforge.net/
Backup app supporting delta backup [chrizoo, Apr 24 2008]

Many modern file systems support linking too, so you would need a way to distinguish a file that is only linked on the backup from one that is actually linked on the original file system. Otherwise, when you restored the user wouldn't have "backup" copies of his files anymore, he'd just have links to one copy. This would defeat the user's original purpose in strewing copies of the file around in the first place.
-- krelnik, Dec 13 2002


Hypothetical scenario:

User backs up contents of hard drive to location#1, including AAA.doc (which is read-only file used for reference).

One week later user backs up hard drive again, to location#2. Computer detects that AAA.doc hasn't changed since last back up, so just creates a link from location#2 to location#1.

Repeat for one year, creating links to location#1 every time.

User then deletes location#1, because it's a year old. Now all the links point to a non-existent location (I'm sure there must be a techie name for this type of bad link). AAA.doc no longer exists. Information lost. Disaster.

Only other option is to keep all back ups for ever.
-- Mayfly, Dec 13 2002


I think that was what I was trying to say, Mayfly. I guess I wasn't clear.
-- krelnik, Dec 13 2002


Well, if you're backing up to some kind of ROM you won't be able to delete location #1. If you aren't, you could update last-modified date or another file attribute for the original backup and check that instead of creation date. Anyway, I think the idea is to avoid redundancy within a single snapshot, not over a series of them done over time.

Seems like a good idea, but it also seems too sensible to not be widely done already in some form by existing backup software. If the backups are relatively small, of course, you might do better to compress the whole tree into some sort of archive, thereby stripping out redundant information within as well as among files. [Waits quietly for egnor or jutta]
-- Monkfish, Dec 13 2002


They're called incremental backups and I run one each day at 3pm. One that's very similar to the type you describe was detailed here:

http://www.mikerubel.org/computers/rsync_snapshots/

The ones I use are here:

http://rdiff-backup.stanford.edu/

and here:

http://www.nongnu.org/duplicity/
-- ry4an, Dec 13 2002


Mayfly: That's why one mixes incremental and full backups. That being said, there's little advantage to having multiple copies of the same file on the same storage medium. Having two different files on two different CD's may be a significant advantage; havingtwo different files on the SAME CD is much less of an advantage.

Monkfish: If you make a .ZIP which happens to contain numerous copies of the same file, located in different directories, the resulting archive will contain separate independent compressed copies of each.

While links as a concept certainly exist in OS's, the idea of this approach is to (1) accommodate OS's which don't provide linking; and (2) accommodate situations in which linking is inappropriate (e.g. if I've made a backup directory so that I can change files in the original directory without losing the earlier versions, conventionally-implemented hard- or soft-links would not be appropriate, since a change to any of the original files would also change the linked "copy". If an OS were to implement "lazy-write" links (with or without pre-allocation) there would be less need for people to have redundant files on their hard drives, but some such duplication would still exist and it would be good to minimize it.
-- supercat, Dec 13 2002


ry4an: What I'm trying to get at isn't the idea of incremental backups (which are of course thoroughly baked and have been for decades).

What I'm getting at is the case where I'm backing up a hard disk which, for whatever reason, happens to have multiple copies of one or more large files. If I'm backing up my hard drive and both c:\foo\bar.bin and c:\bar\boz.bin happen to hold the exact same 20 megs of data today, there's no real benefit to having the backup medium hold two independent copies. The backup should contain the date/time stamps and other directory information for all copies, but the actual data should only need to be stored once.
-- supercat, Dec 13 2002


Fishbones from me. If you're going to keep multiple backups, then its just a bad idea to have them depend on one another, you might as well only keep a single back up.

Of course, this is unless you'd like to go back to many different possible times of data states. In which case, I say why not just commit your entire filesystem to a CVS-type system?
-- ironfroggy, Dec 13 2002


ironfroggy: What I am discussing, plainly and simply, is the idea that when making a SINGLE backup of a hard disk which happens to (for whatever reason) contain multiple copies of the same file, it would be useful to only have that particular backup contain one copy of the file and linking information for all the other instances of it. If the backup is on a medium where entire backup sets get replaced or discarded (rather than individual files thereon) there would be no danger of the link becoming obsolete since, if the file were to disappear the link would as well.

Why is everyone confused?
-- supercat, Dec 13 2002


(supercat: Is it really true that no archive programs use exactly this sort of idea to pack multiple directories (or even just multiple files, in case there are copies with different names)? It seemed like a natural (so, in hindsight, almost certainly wrong) assumption.

That wouldn't be an obstacle to this idea anyway, of course.)
-- Monkfish, Dec 13 2002


I don't think the issue here is whether or not it could be done, but whether or not it should be done. How would the software know two files are identical? File name? File size? Some other attribute? Some combination of attributes? Now give me a scenario where I couldn't - accidently - trick the backup software into thinking two files were identical when they weren't. Or vice-versa. Does the backup software do a byte-by-byte comparison? Does it prompt me for clarification?
-- phoenix, Dec 13 2002


phoenix: supercat's talking about a checksum. You can't trick it without trying really, really, really hard.
-- Monkfish, Dec 13 2002


phoenix: The software would initially use a CRC-64 or CRC-32 to identify files that are "probably" identical. It would then do a byte-for-byte comparison only on the rare files whose CRC's are identical. The backup software could then keep a list of files whose CRC's were identical but where the files differed (so that subsequent backup attempts would simply regard them as different without having to waste time checking).
-- supercat, Dec 13 2002


[supercat], maybe you could clarify your idea (make it crystal clear that you're linking files *within* a single backup, not across media), and then delete all the confused annotations?

You could just compress the backup image, which eliminates redundancy in a more general way at a cost of more processing time.

There are utilites that will go through a filesystem and replace identical files with hard links to each other. I'm unsure why backup utilities (which often compute a hash of the files being backed up for other reasons) don't do this as a matter of course. Maybe some do?
-- egnor, Dec 13 2002


egnor: Do you know of any filesystems which support "lazy-write" linking, with or without full or partial pre-allocation? Replacing identical copies of a file with a single hard-linked version would be dangerous if a conventional hard link was used, but could be quite handy if a lazy-write link was used. Know if anyone does this?
-- supercat, Dec 13 2002


I'm not aware of a filesystem with copy-on-write linking but it seems like such an obviously good idea that surely some obscure system has done it.

If you're creating a write-once filesystem (like CD-R) an ordinary hard link would suffice.
-- egnor, Dec 13 2002


Checksum? Every file? Then look for files with identical checksums and bit/byte-wise compare them? Then backup?
-- phoenix, Dec 13 2002


Phoenix: Checksumming files and then comparing checksums is indeed the idea. For most modern systems using typically-sized files, the system could read a file into memory, check it against the existing checksums (also kept in memory), and just write it out if needed. The overhead required to perform the checksum should be minimal, as should be the overhead of checking the checksum against existing ones.
-- supercat, Dec 13 2002


I badly wanted this to be a program that moves - not copies - all the important files you might need from wherever they are to a re-writeable CD. Note the stress is on moving not copying.

You would run the program when you have just been informed that you are free "to pursue a career outside the company".
-- DenholmRicshaw, Dec 13 2002


Well, if you didn't want the hashes for other reasons you could avoid computing them for the vast majority of files: Store just the file locations and sizes in bytes as you back them up. Compare new files to already-processed ones simply by size. Compute (and hang onto) the hash only for the tiny minority that have matches. If the hashes match, go on to do a byte-by-byte comparison if you're not satisfied with an overwhelming probability that the files are the same.

(In the absence of any evidence that this is baked or good arguments against it, why are there more fishbones that croissants?)
-- Monkfish, Dec 13 2002


// In the absence of any evidence that this is baked or good arguments against it, why are there more fishbones that croissants? //

Got me. Perhaps I should delete some of the annotations of people who mis-understood my original idea (re-reading it, I could see where it was unclear). I really don't like zapping other people's stuff, though; if people feel their annotations are no longer meaningful they can always edit or delete them themsleves.

Fishbones and croissants are funny. Some ideas which would be both possible and useful, but which have not yet been baked (so far as anyone can tell) get fishboned. Other bizarre silliness get croissanted and sometimes becomes a phenom of its own ("Tails for all?"). Go figure.
-- supercat, Dec 13 2002


//Well, if you didn't want the hashes for other reasons you could avoid computing them for the vast majority of files: Store just the file locations and sizes in bytes as you back them up. Compare new files to already-processed ones simply by size. Compute (and hang onto) the hash only for the tiny minority that have matches.//

If the file is small enough to fit in memory (less than 150 megs or so) computing the hash before writing it to the backup device should impose minimal cost; it would probably be faster to compute hashes for all files as a matter of course than to defer computing hashes until a possible size-match was found (since by that time the first file would no longer be cached).

BTW, I think size-matches probably happen pretty often.
-- supercat, Dec 13 2002


Here's why I fishboned this idea:
"With hard drive storage being so cheap, this usually doesn't pose a problem. When trying to back up a hard drive, however, these extra files use up space on the backup media and cause the backup to take additional time."

As you point out, drive space is cheap and time is not. I'm not convinced this is faster/better/cheaper than anything in existence now. You're saving what is probably a nominal bit of drive space while introducing the overhead of having to CRC every file, compare every CRC with every other CRC and verify duplicate CRCs.

Again, it's not a problem of whether or not it could be done, but whether or not it should be done.
-- phoenix, Dec 14 2002


I totally disagree.

Disk space is cheap but removable media less so, and someone who's using a CD-R or even DVD-R for backups would really appreciate having to change that media less frequently.

More importantly, it's easy to show the overhead would be practically zero, and easily dominated by the savings gained by not writing duplicate files. When you're backing up a disk you have to read every file *anyway*. Calculating a CRC32 (or even MD5) can be done as the data is read, and keeping an indexed table in memory of a few tens of thousands of checksums is near-trivial.
-- egnor, Dec 14 2002


egnor: You got it. Arguably, the fact that hard drives have become so cheap (less than $2/gig) has contributed to people having multiple copies of files on disk for a variety of reasons. Indeed, while CD-R's are in one sense a cheap backup medium, hard drives cost less than 20x as much and are both faster and reusable. Compared with rewritable media, hard drives are not much more expensive than anything else out there.

BTW, I have two hard drives in my system. Some stuff is stored on one, some on the other, and some on both. I frequently back up significant portions of one drive onto the other. If doing a full system backup, little is gained by writing out multiple copies of files which appear identically on both drives. When restoring, it would certainly be useful to have every instance of a file restored to its proper spot, but having redundant copies of the information isn't particularly useful.
-- supercat, Dec 14 2002


(Just some details.)

1) About check-as-you-go:
a) Will this approach work with CDs and other write-once media? Don't you need to write an index first?
b) It might be inconvenient not knowing your eventual space requirements when you set out.

2) About size-checking to avoid calculating hashes:

The efficiency of the size comparison approach will vary with the characteristics of the data: If you've an enormous number of very small files, most will have a coincidental size match; if you've got a few large ones, they won't. You may, of course, want a record of all the checksums somewhere anyway to do integrity checking or whatever, in which case the question's moot.

supercat, here's a suggestion that avoids the caching issue you point out. It may or may not be worthwhile in practice:

1) Traverse the directories first. Build a list (hash table/dictionary) of files and sizes, flagging/remembering/grouping any whose size is not unique. (This would let you calculate an upper limit for your space requirement (the sum of all file sizes) in case you wanted to verify that enough room is available. You would check any attributes you were interested in at this time, too.)

2) If you're writing to ROM or want to know your exact space requirement ahead of time, calculate the checksums you need and do your comparisons.

3) Loop through your in-memory table of files. If you've already done your check, just write the unique ones. If you haven't, do it for the flagged files as you go along (when they're in memory anyway).

----

(Also: Some archivers do indeed check for identical files, so compressing the whole tree can indeed work. You may not want to compress your backup, of course, so it doesn't make this anything other than a good idea.)
-- Monkfish, Dec 15 2002


Redundant programming and having dlls scattered across been my continual frustration with PC's, especially since systems like NeXTstations fixed this problem so many years ago with standard libraries and windows for everything and standard locations for them.

Now as far as backups are concerned, Isn't this what the 'archive' flag was originally designed for?

<non-programmer rant against programming with no common sense>There must be some relatively simple way for the computer to understand that 'yes, I want to create a second backup of everything' or 'just backup everything that's changed since X.' What I don't understand is why an actually useful addition such as incremental backups wasn't a standard feature ages ago.</rant>

Since I'm trying to keep the deluge of CD's under control, I want to know what I can throw away.
-- RayfordSteele, Dec 16 2002


//Now as far as backups are concerned, Isn't this what the 'archive' flag was originally designed for?//

The archive flag is intended to indicate that a particular has been backed up already (and has not been modified since); it is not intended to inform the system that multiple identical files happen to exist.
-- supercat, Dec 17 2002


I think this idea is nothing more than compressing an archive. What a wonderful and baked-for-many-decades idea you have. Can I toss another fishbone at this?
-- ironfroggy, Dec 17 2002


//I think this idea is nothing more than compressing an archive. What a wonderful and baked-for-many-decades idea you have. Can I toss another fishbone at this?//

Have you even READ it?

In all of the backup tools I'm aware of (including, at first glance, those linked above), if you are backing up two directories both of which happen to have identical forty-meg files, they will take up twice as much space on the backup media as would one forty-meg file. It may be that data compression gets them down to twenty megs each, but storing them both will still use twenty megs more storage media than storing just one.

Since backup media are, compared to hard drives, comparatively expensive, reducing the quantity of backup media required to back up a hard drive would be a significantly good thing.
-- supercat, Dec 17 2002


Running most compression algorithms on two back-to-back copies of a large file will result in a compressed copy that's almost twice as big as either alone. Further, there are decided advantages to having files compressed individually: if an entire archive is compressed, extracting a single file requires that there be space available to extract everything.
-- supercat, Dec 18 2002


Hmmm.
While this is obviously a good idea, wouldn't it be even better to detect similar files, and compress the second against the first (potentially compressed) copy?

Quite often there may be a series of similar files. These wouldn't get linked out, even if the difference was tiny. But you could compare the size of the compressed version against the size of the patch needed to modify another file, and store that instead if it were smaller.

You'd need some way of detecting similar files in a computationally feasible way, of course. It could use a series of hashes with collisions very likely in files differing in ways they are likely to differ (byte changes, insertions, deletions, other rearrangements). Other clues could come from similar filenames or directory structures (ie position relative to known identical files) and hints from the user.

To be really fancy it could use algorithms similar to those scientists use to compare DNA sequences - but that would take some time and resources.
-- Loris, Jul 25 2006


A backup routine which was familiar with the type of data to be backed up could take advantage of duplicated records. For a general-purpose backup routine, though, I don't think detecting partial similarity would be feasible.

I'm glad to see there don't seem to be fishbones this time 'round.

BTW, I believe CD-ROMs (and probably DVD-ROMs) usually write the index last. Although there would certainly be some advantages to knowing how many disks would be required prior t starting a backup (an analysis of directory information could provide reasonable min/max/probable estimates) many people wouldn't be upset if a backup that was expected to take five disks ended up only requiring four.
-- supercat, Jul 25 2006


Why not simply compress your backup archive? This is pretty much what a decent compressor would do, but straight compression would reduce the resulting backup size even further, of course.
-- ironfroggy, Jul 26 2006


Ironfroggy - you can do that as well of course. This has already been covered in previous comments. The point is that this method works in addition.

Supercat - well its your baby. But I don't think detecting similarity in the general case would be that hard. Many types of data vary in the same sorts of ways (as I said, replacements, insertions, deletions, extensions...). You are of course right in that if you know more about how the data would differ you could do better. It might be a good idea to have specific routines for common data types as well. But even the general case would be a big improvement in your algorithm I think.
The 'bad hashing' functions really needn't be that difficult to implement, either. For example, to detect small replacements you could hash large blocks of the file down to single bits.* This would detect many types of small changes to images. You could detect 'indels' (insertions and deletions) in a similar manner - simply start each block at a particular sequence. Use several different sequences to cover your bases.

BLAST, an algorithm used for finding similarity in DNA works by matching short 'seed' strings and then attempting to extend them. The full-blown method is quite compute-intensive, but a half-arsed attempt may pick up many useful matches. Just pick improbable sequences from one or two points in the file, and search new files for these. (I say improbable because there is no sense picking a sequence like &0000000000000000 - you'd get lots of incorrect hits.

(* i.e. EOR every bit of the first 4k to bit 0 of the hash, every byte of the next 4k to bit 1 and so on. Go back to bit 0 when you reach 64 bits, or whatever. Consider every file which differs in less than 2 bits, or something. To consider files over a large range, it might be an idea to increase successive block sizes..)
-- Loris, Jul 26 2006


As you note, the matching techniques used with DNA are quite computation-intensive. While there might be some reasonable ways of making a redundancy-detector work on a sub-file basis, such a detector would have to have some basis for deciding that two files were likely to be largely-identical before trying to compare them. Keeping 64 bytes of data in RAM for each of 1,000,000 files is entirely reasonable. Keeping 1K or more of data for 1,000,000 files would be rather less so.

Another consideration is that if decisions are made on a per-file basis, all of the files stored on the target medium will be either real files or simple pointers to real files. By contrast, if a file is 99% identical to another except for a few bytes here and there, storing that fact and being able to deal with it would add complexity to the backup/restore process.

Not a bad thought, but starting simpler would seem apropos.
-- supercat, Jul 26 2006


I think you mean 64 bits not bytes above - change this before I check back and I'll remove this line.

I don't think you'd even need 1k per file. The first method I proposed would essentially replace your collision-resistant CRC with one which would pick up small changes.
There are different ways in which sequences can differ by 1%. In practice you'd be looking for long stretches of 100% identity interspersed with unpredictable regions. Rather than 99% identity, with dispersed byte changes. So we are not actually looking at doing a full BLAST search. Also, while it would be nice to compress the data as well as theoretically possible, missing a few isn't a disaster. So I was suggesting a couple of potential seeds per file. That is another 128 bits, perhaps.
These could be reduced to one, or skipped entirely for small files, on the basis that they wouldn't give much of a saving anyway. The processing cost wouldn't be prohibitive on this basis, either.

You are right that it is more complicated than your system. But then, your system is more complicated than just burning the lot. It is just another point on the continuum I suppose. And I agree that it might be better to start simpler - I just liked the extrapolation.
-- Loris, Jul 26 2006


The 64 bytes/file figure would be assuming a moderate-length name, a path link, time/datestamp information, file size, checksum, etc.

It's easy to come up with a small hash for a file that will indicate whether anything has changed. It is impossible to come up with a good small hash that will indicate how much has changed.

To understand the difficulty, consider that when using a 64-bit hash function, at least for some files, it will be possible to change the file without changing the hash function by changing some number of bits <= 65. With advanced hash functions, there may be no tractible way to find an altered file that yields the same has value as the original, but for at least some input files, at least one such altered file must exist.

In most cases where anything in a file has changed, at least eight bytes will have done so. The only way to provide any information about whether changes were localized is to arrange the hash function so that localized changes can't affect all the bits. Unfortunately, doing that will mean that a file may be altered, without changing the hash function, by changing the state of fewer bits.

Doing meaningful comparisons of hash values to determine whether two files were "largely similar" would require using much larger hash values than if it were merely necessary to determine that they were identical or different.
-- supercat, Jul 26 2006


Oh, I thought you were talking about the additional cost. Never mind.
For what it is worth, why retain the time/datestamp or filesize if you are just matching identical files? The size might be a useful pre-check, but you can fetch it easily. Files may be identical but have different datestamps, so you can't rely on this at all.

I see what you are saying about an increase in the collision rate. In fact I thought I mentioned it myself. While the method I suggested may not be optimal in detail (for example, it may be better to generate larger hashes for each block), I don't think spurious collisions would be a problem in practice. Also, the crc isn't a large proportion of the data stored transiently in RAM while compiling the image. You could easily increase it without much of a hit.

Incidentally, one strategy to deal with the 'block size' issue would be to have different block sizes for different sizes of file. Each file would have two, one rounded down, and one up. So that files could at least double in size before they didn't indicate similarity.

The reason I'm so in favour of similarity matching, by the way, is that I think it is orders of magnitude more prevalent than identical files. As you say in the idea body, people copy files then make changes to the copy. This is the recommended practice in many circumstances. Programmers may have a few directories of mostly identical files, but otherwise the only people who'd be helped greatly would be the misguided who thought copying a file to another directory constituted a backup. And they should be re-educated not encouraged. (and they probably wouldn't use it anyway...)
-- Loris, Jul 27 2006


//For what it is worth, why retain the time/datestamp or filesize if you are just matching identical files? The size might be a useful pre-check, but you can fetch it easily. Files may be identical but have different datestamps, so you can't rely on this at all.//

The file size is a useful pre-check; the space required to store it is small compared with the space required to that required to store the filename, etc. The timestamp may not be particularly necessary for this particular application, but having a RAM table listing all of the information about all of the files in a backup set can be useful; again, the timestamps don't take up too much room, so why not?

//I see what you are saying about an increase in the collision rate. In fact I thought I mentioned it myself. While the method I suggested may not be optimal in detail (for example, it may be better to generate larger hashes for each block), I don't think spurious collisions would be a problem in practice. Also, the crc isn't a large proportion of the data stored transiently in RAM while compiling the image. You could easily increase it without much of a hit.//

The problem is that you'd have to increase it a LOT. A well-chosen 32-bit hash function will have a 1 in 4,000,000,000 chance of a false positive. A well-chosen 64-bit hash function will have a 1 in 16E36 chance.

Suppose I give you some simplifying assumptions: all files of interest are 65,536 bytes in size, and two files are "close" if there are no more than eight bytes different between them (but the eight bytes could be anywhere in the files). How would you go about designing a hash function to check for "close" matches?

Suppose I allow the eight changes to not just be direct byte changes, but allow any pair of changes to comprise an insertion and a deletion. How would you set up the hash functions then?

//The reason I'm so in favour of similarity matching, by the way, is that I think it is orders of magnitude more prevalent than identical files. As you say in the idea body, people copy files then make changes to the copy. This is the recommended practice in many circumstances. Programmers may have a few directories of mostly identical files, but otherwise the only people who'd be helped greatly would be the misguided who thought copying a file to another directory constituted a backup. And they should be re-educated not encouraged. (and they probably wouldn't use it anyway...)//

Similar files are indeed prevalent, but the cost of detecting similar files and compiling a difference list would almost certainly exceed the cost of simply backing up both copies separately.

It's easy, given a set of 1,000,000 files, to determine with pretty good accuracy whether within the set there exist two that are identical. This can be done in time essentially proportional to the number of files (there will be an NlgN term, but it's unlikely to dominate even for every large file sets).

Determining whether there exist two files within a set of 1,000,000 that are "close" is going to be another matter. When checking for identical files, one knows where in one's catalog any matching file must be found. When checking for similar files, things are much more difficult. The only way to avoid this problem is to put each file into the catalog multiple times, using different partial hashes, and then later look for each file multiple times, also using different partial hatches. All this greatly increases complexity as compared with simply looking for one hash value associated with each file.
-- supercat, Jul 28 2006


The idea here isn't to delete the duplicate files (since they may serve a useful purpose, especially in the absense of copy-on-write links) but rather to reduce the time and effort required to make backups.

BTW, in response to an earlier comment about why to keep timestamps: when producing the backup, it would be useful to include a notation of the time/datestamp of the "original" file. Although it wouldn't be incredibly slow to requery the original file to get that information, it's simpler to just buffer it.
-- supercat, Jul 28 2006


Hi everybody ... are there any news since 2006 (and 2002 respectively) ?

I was naturally assuming that the vast majority of backup tools would work as described here. This site is the only one I could find dealing with that issue though. I still don't believe it and continue to desperately look around, since I utterly need this function.

So if anybody stumbles over this message and could point me in the right direction I would be incredibly thankful.

As a side note: While I do encourage meaningful criticism - generally speaking - I find some notes here lacking decent respect. One of the basic rules in life is, if you don't need something but somebody else would desperately, and if furthermore you don't have any disadvantage whatsoever with the fact that a solution exists for other people, than please don't denigrate other peoples' ideas - especially if you didn't even take the time to fully think through the idea and really understand it ... as some posts above suggest.

I guess supercat must be quite frustrated to come up with an excellent idea and which additionally doesn't seem to be implemented anywhere ... only to run into ignorant comments. Sorry for being blunt about that, but that's my honest opinion.
-- chrizoo, Apr 21 2008


I may further add that all the bkup tools I came accross handle moved files as "deleted and newly created files elsewhere", so they are stored with every incremental or differential backup. I'd be similarly thankful for being told a (Windows) tool that doesn't.
-- chrizoo, Apr 21 2008


This is a sensible idea but the correct place for it is in the filing system, not the backup program. Adding it to backup once it is in the filing system is trivial.

The checksum has to be computed as the file is closed; and for optimum speed, on the fly if the file is being written sequentially. It *is* possible to have checksum algorithms that are resilient against things like seeking into the middle of a file and modifying a block or two but they take some extra work (and storage for partial sums; rsync has a rolling checksum that is not quite the same thing but is fairly close to what's needed. Read Andrew Tridgell's PhD paper linked above "Efficient Algorithms for Sorting and Synchronization")

The filing system for the EMAS mainframe OS from Edinburgh in the 70's/80's stored a checksum in the file descriptor; if the file was changed, the checksum was wiped and recalculated when the file was disconnected from the VM (ie closed).

A filing system where the file is stored under a file-name that reflects its content (you'd actually use something like an sha or md5 hash rather than a simple crc) is trivial to implement; I think the original Cambridge File Store from the 70's (Wilkes et al) used a similar scheme. (they called the file names 'capabilities' but i think it was a similar concept of a unique token based on content) In a modern context you would use 'fuse' as the underlying layer where files had content-dependent names, and export a traditional layer on top that had user-assigned names, with fuse doing the mapping of one to the other.

I have it at the back of my mind that something like this has been done but I'd have to dredge up a lot of old bookmarks to find it.

A slightly more sophisticated scheme can be done using diffs (someone suggested a DNA-type matcher, which is right in spirit but wrong in detail - there's plenty of code for doing simple text diffing) - which is not only possible but definitely has been done, again I believe using FUSE. Look for 'versioning' file systems that are layered over something like rsync. This idea is a degenerate case of an rsync filing system without the intermediate diffs allowing recovery of arbitrary versions. So the idea is _probably_ baked but not in the exact form the original poster suggested.

But a bun from me for smart thinking even if the OP is not entirely up to date on current FS design (or computer history). [later edit: Oh - he posted in 2002. So not being up to date on 2008 filing systems is probably excusable. later later edit: see link above entitled "Microsoft paper on similar subject in 2002" where things like this were being discussed around the time he first posted. And Tridgell's PhD was 1999, so these ideas in fact had actually been floating around for some time by the date of the original post...]
-- gtoal, Apr 21 2008


Thanks for looking at the idea. Certainly there have been some developments since I wrote the idea, and there were even some developments toward identifying identical files in distributed systems, but I'm still not aware of any program that simply performs the style of backup I indicated.

As for file-system support, the most practical addition would be support for lazy-write links. That could offer a performance benefit in many circumstances; backup software which understood such links could benefit as well.

I don't think it would be worth having the operating system try to analyze "coincidentally-identical" files, at least not on a routine basis. It may be useful to include within the OS the ability to cache information about file checksums and the like, and invalidate such information when files change. This would avoid any performance hit or other complications that might otherwise occur when performing large numbers of 'random' edits to a file.
-- supercat, Apr 21 2008


There maybe is an implementation of this idea. I posted the link to a Perl script in the link section. Could you check ?
-- chrizoo, Apr 23 2008


OK, I read a bit more about the subject. Isn't the advanced form of the idea you propose - the variant where similar files are detected and compressed against existing ones - basically the same as a delta backup ? See Wikipedia link in the link section.
-- chrizoo, Apr 24 2008


chrizoo: The perl script might be similar, but the site didn't seem to explain exactly what it does and I didn't feel like trying to find a .tar.gz extractor.

As for rsync, its function is to determine exactly what is different between two files which there is reason to believe may be largely similar. My proposal is to search for files that happen to be coincidentally identical.
-- supercat, Apr 24 2008


I seem to have missed this idea a couple of times now... I think it would have saved me some time & headache.

I was just concerned with trying to eliminate duplications in my .jpg photo collection. What I came up with was a data table containing the file length, the Adler-32 checksum of the first 1k of the file (Adler's not as good a checksum as CRC but it's faster), then the CRC-32 of the first 8k, and finally the CRC-32 of the whole thing. Checksums were only calculated when needed. If that didn't show differences, then do byte-to-byte.

Out of 33000 .jpgs, I had about 4000 removable as duplicate. No non-duplicated file had to go further than the Adler checksum - but with .jpgs, each with an EXIF in the header, there were pretty well guaranteed differences up front. Some other file types would not fare nearly so well.

I haven't played with this in a couple of years... I got into trying to identify duplicate, or very similar, images where one is a different file format, or a retouch, or some such thing. Some progress, not done yet.
-- lurch, Apr 24 2008


@supercat: /*My proposal is to search for files that happen to be coincidentally identical.*/

I know, this was the starting point. But during the discussion above, the idea was put forward to also include similar files. That was what I was referring to with rsync.

I have also found a consumer-level freeware app with delta backup support (I added the link above). I tested it. Delta backup works, but only for files which have the same name + path. So it implements the advanced part of the idea here ("find similar files and compress the new against the old one"), but not the core part of your idea, because it doesn't look for identical files. Thus when you move a file to another location, the delta method is not applied but the whole file is backed up.
-- chrizoo, Apr 24 2008


@lurch: If you want to find similar images, there are really a whole bunch of apps out there, but I guess you wanted to do the coding yourself ...
-- chrizoo, Apr 24 2008


I have been looking for a programm that does what's described here, but after some days of frustrating search I'm giving up.

I will just do it manually. Archive my files and save a checksum list with it. And then, at each incremental backup, I'll hash all current files, and backup only those with new hashes. I'll work with plain text files and text tools.

I have no experience with checksums/hashes whatsoever. What do you think is the hash/checksum (that's the same right?) that's the fastest to calculate but still is secure enough? (let's say the probability of getting a hash collision is less than a read/write error).
-- chrizoo, Apr 24 2008


/* If the hashes match, go on to do a byte-by-byte comparison if you're not satisfied with an overwhelming probability that the files are the same. */

"overwhelming" ... hm, has anybody done the maths? I thought so, but now, when I look through this page again, I can't find the calculation. Is it really overwhelming despite the birthday paradox ? (which comes into play here as far as I know)
-- chrizoo, Apr 24 2008


"then, at each incremental backup, I'll hash all current files, and backup only those with new hashes."
By definition, only those files that have changed should be flagged (eligible) for backup.
-- phoenix, Apr 24 2008


//only those files that have changed should be flagged (eligible) for backup//

"Archive" attribute.

Sounds not terribly complicated... sortof a reverse-linking; keep track of where all the copies of a file should go.

But you're still looking for a solution to a problem caused by incompetence. Nip it at the source.
-- FlyingToaster, Apr 24 2008


The birthday paradox states that the probability of collisions gets large when the number of items approaches the square root of the number of possibilities. A 32-bit hash will thus have real problems around 64,000 items; a 64-bit hash would have real problems around 4 billion items. A perfect 128-bit hash should not have problems, but it's better to be safe than sorry.

As for the archive attribute, that won't handle cases where files are coincidentally identical, nor will it be useful in situations where a unique file is copied from location A to B, and then the copy at A is trashed. In that case, the copy at B should be stored but the archive bit would have no way of knowing that.

As for "incompetance", when using a file system without lazy-write support, what would be your suggested method for backing up projects?
-- supercat, Apr 24 2008


I'd probably have to figure out what you mean by "lazy write" first as well as looking at my previous post trying to figure out where you got the idea that I was proposing using the "archive" attribute to solve the problem.

But, as far as the *original* post is concerned, I don't see a problem with occasionally backing up a redundant file by mistake, given there aren't too many of them.

However a computer system that has multiple copies of a file without some good reason is a system that is incompetently building its filesystem. Yes or No ?
-- FlyingToaster, Apr 25 2008


In a file system which supports 'lazy write' links, it's possible to request that the system make a pseudo-copy of a file be made. The act of making a pseudo-copy would not actually copy the file, but merely create a link to it. However, unlike a hard link:

-1- An attempt to rewrite any alias to the file would cause that alias to become detached from the file, so as not to disturb the file pointed to by the other aliases.

-2- An attempt to modify part of the file via any alias would cause the system to make a copy the file and attach the alias to that copy.

Unfortunately, I don't know of any mainstream operating systems that support such semantics.
-- supercat, Apr 25 2008


That's almost baked in several OSes: anything with ACL's could put a "read only" sticker on the link, and VMS produces a new version of a file when it's changed, while leaving the old one(s) intact. Microsoft's .NET does something with OS files such that when latest vers aren't backwards-compatible, the old one is kept as well.
-- FlyingToaster, Apr 25 2008


/*The birthday paradox states that the probability of collisions gets large when the number of items approaches the square root of the number of possibilities. A 32-bit hash will thus have real problems around 64,000 items; a 64-bit hash would have real problems around 4 billion items. A perfect 128-bit hash should not have problems, but it's better to be safe than sorry.*/ ... hm, look at the "birthday problem" entry at en.wikipedia.org. You see in the picture that the curve is very steep.

It also says: "The birthday problem in this more generic sense applies to hash functions: the expected number of N-bit hashes that can be generated before getting a collision is not 2^N, but rather only 2^(N/2) "
-- chrizoo, Apr 25 2008


And you seem to know quite a bit about hashing functions. Do you think that the probability of two given files (not looking at other files) having the same 128bit hash is always about 2^-128 or do you think it depends on the quality of the hash? And which one would you recommend then ?

PS: I can send you a .zip file if you can't open .tar.gz if you post your address.
-- chrizoo, Apr 25 2008


OK, I don't know if I got this right, but if Wikipedia is correct with 2^(N/2): If I take a 128 bit hash and let's assume we are on an average PC with 200 000 (all different) files, than the probability that 2 share the same hash is: 0,000000000001 %

Not exactly probable, but on the other hand not as impossible as I thought.

And let's further assume, this idea makes it into a software which is used by a million people and we get a probability of: 0,000001 %

If there are any further weaknesses, for example not all not all of the 2^128 hash values are equally probable, or any other weaknesses I don't know of, then we slowly approach numbers, where the chance of a hash collision seems realistic.

EDIT: uups ... and I forgot: the 200 000 files, that's only at one point in time. But if you want to include multiple backups, than collision chances get even higher.
-- chrizoo, Apr 25 2008


/* A well-chosen 32-bit hash function will have a 1 in 4,000,000,000 chance of a false positive. A well-chosen 64-bit hash function will have a 1 in 16E36 chance. */

???? My calc says: 2^32= 4,294,967,296 so that's about what you said, but 2^64=18E18 ... at least that's what my calc says ;)
-- chrizoo, Apr 25 2008


Don't get tempted to use your hashes like that. Remember - if you develop a working system, a useful system, it will grow - and if growth kills your system, it was fatally flawed from the start.

If the hashes are different, then the files are different. If the hashes are the same, then you don't know. If you try to make an assumption past that point, you're just begging for trouble.
-- lurch, Apr 25 2008


The guys said above the probability was "overwhelming", so I thought it would be interesting to do the maths and see how safe it is. Perfectionism for pure perfectionism's sake makes no sense. Consider that read/write errors even of modern hard disks are interestingly high and as long as a hash collision has 0,0001% of the probability of a read/write error, I think it's unnecessary to further optimize the method. Remember the chain is as weak as its weakest link. And in case the probability of a meteor hitting the earth and destroying therefore all the files is higher than a hash collision than it makes no sense to point the finger at the latter.
-- chrizoo, Apr 25 2008


OK, I found a very interesting note on that. See: en.wikipedia.org /wiki/Birthday_attack

10^-15 to 10^-18 is the uncorrectable bit error rate of a typical hard disk. A 128bit MD5 hash collision gets to that probability at 820 billion files.
-- chrizoo, Apr 25 2008


Certain filesystems already only store the file once and only delete a file if all links to it are deleted, get yourself a proper OS that uses a proper filesystem and ask it for this information. Currently ZFS is the most advanced in this area.
-- ewanm89, Apr 25 2008


[ewanm89] Do such file systems provide for 'lazy write' semantics? Not that such semantics are a cure-all, for a variety of reasons, but they would be helpful.

They wouldn't deal with the case of 'coincidentally' identical files, such as the object/intermediate files produced when building projects that have many identical source files, but if lazy-write semantics are supported they would still be quite useful.
-- supercat, Apr 25 2008



random, halfbakery