Halfbakery: Labeling Gitifier

Labeling Gitifier (+4)
A lossless way to compress files on hard drives...

[Problem]

So we tend to have multiple copies of files, because it took less time to buy a new hard drive than to go through the files, find all the very similar ones, and choose the latest version of each.

Moreover, we avoided deleting similar files in different folders, because those directories provided context for understanding what the files did in different situations.

We forgot where these redundancies were, and now, every time we make backups of disks, we just copy everything over, creating exponential growth in space requirements.

[Solution]

If a file is found under several different hierarchies, add every copy to a git repository as the same file. Arrange the copies by time, and commit.

For each commit, make the comment be the location where the file was found.

Commit in order of modification time.
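
A rough Python sketch of the steps above, for illustration only. It assumes that "the same file" means "the same file name", which the idea does not actually specify, and it only builds the plan of commits rather than calling git. The function name commit_plan is made up for this sketch.

```python
import os
from collections import defaultdict


def commit_plan(root):
    """Group files under root by name and order each group's copies by mtime.

    Returns {filename: [(mtime, directory), ...]}, oldest copy first.
    Each (mtime, directory) pair stands for one commit whose message is
    the directory the copy was found in.
    Assumption: two files are "the same file" if they share a file name.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            location = os.path.relpath(dirpath, root)
            groups[name].append((os.path.getmtime(full_path), location))
    # Keep only files that turn up in more than one place.
    return {name: sorted(copies)
            for name, copies in groups.items() if len(copies) > 1}
```

Each entry in the plan could then be replayed as one commit per copy, with the directory as the commit message and the modification time as the commit date.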

Then, create a browser for files defined this way, one that can browse multiple virtual hierarchies as defined by labels.

Moreover, based on file location statistics, automatically suggest the best location for each file.
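
A minimal sketch of the label browser and the location suggestion, assuming the labels are simply the directories recorded in the commit comments and that "best location" means "the directory the file was found in most often". Both are assumptions made for this example, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_label_index(locations_by_file):
    """Invert {file: [directories it was found in]} into {directory: {files}}.

    Each directory acts as a label, so one file can show up under
    several virtual hierarchies at once.
    """
    index = defaultdict(set)
    for name, directories in locations_by_file.items():
        for directory in directories:
            index[directory].add(name)
    return dict(index)


def suggest_location(directories):
    """Suggest the directory a file was found in most often (assumed heuristic)."""
    return Counter(directories).most_common(1)[0][0]
```

For example, a file found in "work", "backup" and "work" again would be listed under both labels, and "work" would be the suggested home.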

[Expectation]

Significant reduction of space + ability to easily see a file in multiple contexts + discovery of a better directory structure to organize your files.
-- Mindey, Apr 19 2015

Related idea File_20system_20sup...king_20hard_20links
Your idea renders mine obsolete. [scad mientist, Apr 21 2015]

I feel your pain.

Is the invention basically a shell script that scans the file system and makes intermittent calls into GIT?

Is some UNIX ninja going to lose their marbles trying to implement it in one line using "find"?
-- pertinax, Apr 21 2015

"Tired of young people ? Can't seem to make them see your point of view ? Introducing the La Beling Gitifier <shows device resembling a Buck Rogers ray gun>. . . One quick shot to the sternum and they'll be ranting right along with you . . ."
-- FlyingToaster, Apr 21 2015

[+] I had an idea that accomplishes some of the goals on a more limited scale but would be easier to implement <link>. It appears that the Gitifier would need to be incorporated into the file system and completely change how it worked, but would accomplish everything I wanted with my idea and much much more.

While we're at it, let's let GIT keep a history of each file. Obviously that would fill up the hard drive too fast if all history was stored, but history could be pruned as needed to make space, leaving more recent changes tracked in case the user needs to go back to an old version of any file.
-- scad mientist, Apr 21 2015

[pertinax], yes, what I had described, could probably be done in one line, as you say. Any UNIX ninjas?

[scad mientist], I see. Indeed!

// While we're at it, let's let GIT keep a history of each file. Obviously that would fill up the hard drive too fast if all history was stored //

If history is stored as change to files, but not copies of files, then it would not fill it up fast.
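
To put a rough number on that, here is a small sketch using Python's difflib to compare the size of a change with the size of a second full copy. It only illustrates the point about storing changes. Git itself keeps whole snapshots as objects and applies delta compression later, when it packs them. The helper name delta is made up for this example.

```python
import difflib


def delta(old_text, new_text):
    """Return a unified diff (as one string) that turns old_text into new_text."""
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="old", tofile="new", n=0,  # n=0: no context lines
    ))


# A 1000-line file in which a single line is edited.
old = "".join(f"line {i}\n" for i in range(1000))
new = old.replace("line 500\n", "line five hundred\n")
patch = delta(old, new)

# The patch is far smaller than keeping a second full copy.
print(len(new), len(patch))
```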
-- Mindey, Apr 22 2015

//but history could be pruned as needed to make space

I vote for losing 1982.

If you could do it on a repeated basis we could have a Millennium party every year and avoid whatever is supposed to happen with that Mayan calendar.
-- not_morrison_rm, Apr 22 2015

Doesn't git (and VCSes in general) only work on files that are text-based? I thought that was one of the reasons why many file formats (MS Office, CadSoft Eagle, …) have moved from being binary to being XML-based recently. But many formats are still binary (images, videos, …), so I don't see how this would work as well for those files.
-- notexactly, Apr 26 2016

Whilst not understanding the idea, I would like to comment on it herewith.

To the extent that I _do_ understand, GIT is some system for managing files to avoid redundancy, yes? OK, so how about a simpler option.

Have an application that runs in the background. Whenever it finds two identical files on the disk, and provided neither copy is being edited at that moment, it simply deletes one copy and replaces it with an alias. (Do aliases exist on non-Mac systems? I presume so.)

The alias will sit there in the directory where the file originally was, and will therefore be fully findable and will retain its context. Problem solved, no?
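
A sketch of that background job in Python, with one loud assumption: hard links stand in for "aliases", since they are the closest thing available from the standard library on most systems. Hard links only work within one filesystem, and editing one name changes the content seen through all of them. The "not being edited at that moment" check is left out, and the function names are hypothetical.

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 of a file's contents, read in blocks to keep memory use low."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def link_duplicates(root):
    """Replace byte-identical files under root with hard links to the first copy.

    Returns the list of paths that were replaced.
    Assumption: a hard link plays the role of the "alias".
    """
    first_seen = {}  # digest -> path of the first copy encountered
    replaced = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinks alone
            original = first_seen.setdefault(file_digest(path), path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy, or already linked
            # Create the link under a temporary name, then swap it in,
            # so the duplicate is never missing from its directory.
            tmp = path + ".tmp-link"
            os.link(original, tmp)
            os.replace(tmp, path)
            replaced.append(path)
    return replaced
```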

An extension of this system could save yet more space, by replacing the repetitive parts of large files with aliases, and then reinstating them on the fly. Maybe.
-- MaxwellBuchanan, Apr 26 2016

I have some experience that seems to contraindicate the use of aliases or other OSes' equivalents. I have a collection of reference documents (scientific papers, datasheets, etc.) in my Google Drive. I had them organized into topic folders, but recently I found that the folders I had weren't optimal, so I decided to rearrange the files. At the same time, I decided to consolidate all of the files into one folder and put only aliases to them in the topic folders, to be able to put a file in multiple categories without duplication. This worked great until I tried to access it from my Windows computer or the Google Drive web interface, which don't understand Mac aliases. I could use Windows shortcuts (equivalent to aliases) but Mac OS X doesn't understand those. I could use symlinks, but I found some reason that I don't remember for those to not work either. I also thought of using hardlinks, but I realized that Google Drive would see those as separate files, resulting in duplication in the cloud and then probably in the local folders when it synced again.

So. I'm currently thinking I have to build some kind of document management system to keep track of my reference documents (and it has to be able to sync between my computers and ideally also be accessible from mobile and web). I would like to be able to tag documents with multiple tags each, rather than having each one in just one folder (which is why I started the alias thing in the first place). I considered Evernote, which would work perfectly for that, but I use Evernote Basic (the free version), and the 60 MB/month upload cap is an order of magnitude too small.
-- notexactly, Apr 27 2016

Git isn't so much a method of avoiding redundancy, it's more a temporal directory - if you've a Mac, then it's like a focused time-machine that you can share with other people - if you've used "track-changes" in windows, then it's like that too, only not shit.

The way it works is that it maintains a set of keys based on the path+filename of a part of your directory tree, and for each key, a hash of the file-content. If the hash changes, or a new filename appears between one sync and the next, Git knows to update the centralised store with a copy of the contents of that file. What's nice is that it keeps a copy of the original, and shows you the precise differences between one version and the next.

If anything, you get added redundancy, as you're able to step back through time, to see the entire history of your project folder, with annotations, revert to pre-mullered versions of your code, and importantly, share a single repository with lots of people.

The idea is to leverage some of the features of Git in order to save space - based on the idea that in a given store, there is lots of duplication.

You could do this by generating a list of filenames as the keys in a dictionary, the values being lists of paths in which those files are found. The first disconnect between this and Git is that (I think) Git uses the filename (or at least path+filename) as the unique identifier of a file - but here, that assumption no longer applies - a file called accounts.csv in a folder called zentom_personal_account is best kept separate from another file called accounts.csv in a folder called orphans_do_not_embezzle.
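
On the identifier question, a short sketch may help. In a default (SHA-1) repository, Git names a file's contents by hashing a small header plus the bytes themselves, and the path is recorded separately in tree objects. So two accounts.csv files with different contents end up as different blobs, while identical contents found in different folders share one blob. The function name git_blob_id is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git gives a blob with this content.

    The id depends only on the bytes, not on the file's name or path.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the hash is taken over raw bytes, the same scheme applies to images and other binary files as well as text.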

You could take a measure of the content and do a comparison such that if two files are 99% the same, and are found in different folders, then they can be stored using the same root, and perhaps an accessory 'delta' file - or you could just say, if they're different, then store them differently. But that means then that there's no definite temporal connection you can make on filename only.
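
One way the "99% the same" test could look, sketched with Python's difflib. Comparing line by line is an assumption made here to keep it cheap, and the names similarity and share_root are hypothetical.

```python
import difflib


def similarity(a_lines, b_lines):
    """Ratio in [0, 1] of how much two lists of lines have in common."""
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    return matcher.ratio()


def share_root(a_lines, b_lines, threshold=0.99):
    """True if the two files are similar enough to be stored as root + delta."""
    return similarity(a_lines, b_lines) >= threshold
```

With a 200-line file and a copy in which one line was changed, the ratio is 2 * 199 / 400 = 0.995, so the pair would share a root. Two unrelated files score near 0 and would be stored separately.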

Alternately, you could completely turn git on its head, and build a 'tig' system that analyses the content of files, and stores their contents as a graph of referenced nodes that are formed from stubs of repeated content.

If I've got lots of files containing repeated boilerplate code, the system might extract that boilerplating and save it physically as a single unit, then any files referencing that unit would be able to do so via a reference. Essentially, you're re-coding the storage to optimise for performance, based on the assumption that there's lots of repeated content replicated across your filesystem. That works for portions of files, but it also works for exact copies of files as well, so if a single picture appears in lots of different folders, it only needs to actually be stored once, and referenced in each other case.
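
A toy version of such a store, as a sketch only. It assumes fixed-size chunks, which is the simplest choice but will miss boilerplate that sits at different offsets in different files. The class name ChunkStore and its methods are invented for this example.

```python
import hashlib


class ChunkStore:
    """Store files as lists of references to chunks, keeping each chunk once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> bytes; each distinct chunk stored once
        self.files = {}   # path -> ordered list of chunk digests

    def put(self, path, data):
        """Split data into fixed-size chunks and record references to them."""
        refs = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[path] = refs

    def get(self, path):
        """Reassemble a file from its chunk references."""
        return b"".join(self.chunks[d] for d in self.files[path])

    def stored_bytes(self):
        """Bytes physically held, after deduplication."""
        return sum(len(chunk) for chunk in self.chunks.values())
```

Putting the same picture in under two paths adds a second list of references but no new chunks, and a file that shares only its first chunk with another stores just the part that differs.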

Updating files could result in a temporary new copy of that file, but later it could be decomposed into its component parts and referenced into the filesystem.

It's interesting though in that, in some theories, our brains are supposed to work on this principle, with short-term memory being like a working, high-definition, high-density, dereferenced data object, which, after it's finished being used, eventually gets referenced against a longer-term memory set in which much of the content has already been encoded - i.e. after some point between childhood and growing up, our brains stop learning anything new, and instead move more to a filing/curation type role.
-- zen_tom, Apr 27 2016
