halfbakery: My hatstand runneth over
I'm hoping this is baked and I'm just using the wrong
search terms.
Many commonly used file systems support hard links
and/or soft links between files, allowing one copy of some
data to be accessed from two different locations. Many
also have snapshot features that allow all the files to be
virtually copied without making a second physical copy
unless an application tries to modify the original. This
allows capturing the state of all files at one point in time,
often for the purpose of performing a backup of all files at
a certain time stamp while allowing the system to continue
normal operation during the backup.
I think it would be useful for file systems to support a type
of hard link that would automatically create a copy and
unlink files as soon as any application tries to modify the
file through any of the links.
The times that I really want this feature are maybe
somewhat specific to my job, but another application that
I think would be useful to many people is in managing
digital photos.
When I copy photos onto my computer, I organize all of the
original files the way I like to archive them and never
modify them. When I want to gather a collection of photos
to send to someone, for example highlights of a vacation,
or when I'm sorting through a bunch of photos to select the
best (say, from a portrait session with my kids), I generally
create a separate folder and put copies of a bunch of
photos in that folder. Depending on what I'm doing,
sometimes I'll end up making modifications to some of the
files, which is why I make copies: to ensure I don't modify
the originals. After I use these files I generally leave them
in that folder so I can see what I did with them. Of course
I never delete the originals. Hard drive space is pretty
cheap, so I don't worry too much about that, but I do end
up wasting a lot, and worse is the time spent waiting for
files to copy when I copy a large number of photos to a
folder so I can pick the one I want based on process of
elimination. If I had the ability to make hard link copies
that would automatically split into real copies if either
were modified, that would save hard drive space and
copying time. When talking specifically about photos,
some photo management programs implement some of
these things, but I have other uses for this as well, and
photo management software could take advantage of this
feature if it was built into the file system.
Hard links almost implement this feature. If you create a
hard link to a file in a separate folder, you can read it from
both. If you overwrite the photo in one folder, the link will
be broken and the photo in the other folder will be
unchanged, but if you modify the photo in one folder, the
photo in both folders will be modified since they share the
same data. The technology to share data but make a copy
if an attempt is made to modify the file is present in
existing snapshot features like Windows Shadow Copy, but
that particular feature can only be used on the entire
volume at once, and the shadow copy is read only. For
some of the uses I'm interested in, there may be many
copies of each file and many of those need to be writable.
But if I ever modify one, I want to unlink and save it
separately so I don't modify the others.
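
The difference between replacing and modifying a hard-linked file can be seen in a small experiment. This is only a sketch: it assumes a file system that supports hard links (as most POSIX file systems and NTFS do), and the function name and file names are made up for illustration. Whether a program's "save" breaks the link depends on how it writes: a program that writes a new file and renames it over the old name breaks the link, while one that edits or truncates the existing file in place changes the data under every name.

```python
import os
import tempfile

def demo_hard_link():
    """Show that in-place edits go through a hard link but replacement does not."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.jpg")
        linked = os.path.join(d, "linked.jpg")
        with open(original, "wb") as f:
            f.write(b"photo-v1")
        os.link(original, linked)  # second name for the same data

        # In-place modification through one name: both names see the change.
        with open(linked, "r+b") as f:
            f.write(b"PHOTO")
        with open(original, "rb") as f:
            after_modify = f.read()

        # Replacement: write a new file, then rename it over one name.
        # The other name keeps pointing at the old data.
        tmp = os.path.join(d, "tmp.jpg")
        with open(tmp, "wb") as f:
            f.write(b"photo-v2")
        os.replace(tmp, linked)
        with open(original, "rb") as f:
            after_replace = f.read()
        with open(linked, "rb") as f:
            linked_after_replace = f.read()
        return after_modify, after_replace, linked_after_replace
```

The proposed feature would make the first case behave like the second: the first write through either name would split the file instead of changing both.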
One other use of this would be in a hard disk optimizer.
Someone could write a small program to scan through a
hard drive and link duplicate files together. This could be
basically transparent to the user because if any of the
duplicate files were modified, it would create a separate
copy of the modified one.
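
The scanning half of such an optimizer might look roughly like the sketch below. It only finds candidate groups by hashing file contents with SHA-256; the linking step is left out because it depends on the proposed file system feature, which is assumed here and does not exist in this form. The function name `find_duplicates` is invented for the example.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Group regular files under root by the SHA-256 of their contents."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symbolic links and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
    # Only groups with more than one path are duplicates.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

A real tool would probably compare file sizes first to avoid hashing files that cannot match, and would do a byte-for-byte comparison before linking anything.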
Note that this should not be used for making back-up
copies, but that's obvious because a backup copy on the
same volume is not very useful anyway.
This could make normal back-ups less efficient unless the
backup software is aware of this feature and there is a way
to track linked files so they can be linked on the backup
drive as well. Then again if this is only used in cases
where normal copies would have been used anyway, then
the backup would be no less efficient than before.
Maybe baked?
http://en.wikipedia...te_in_storage_media [scad mientist, Sep 25 2014]
HAMMER File System
http://www.dragonflybsd.org/hammer/ Part of DragonFly BSD [Spacecoyote, Jan 31 2015]
|
|
So you want a hard link that stops being a hard link as soon as
you write to it...the problem with that (besides being a
rather opaque solution to the problem where more
transparent solutions exist) is that a hard link simply means
there are multiple names for the same file; it can't do
anything fancy. You could implement this as a special case of
variable symbolic link, though IMHO what you should really be
after is a version control system and a backup system. |
|
|
// (besides being a rather opaque solution to the
problem where more transparent solutions exist)
// -- Please elaborate on these more transparent
solutions. |
|
|
By the way, the application where this came to
mind today (I've had this idea stirring for a long
time) does have a lot to do with version control.
I'm working on a software project. It's poorly
managed (good thing I'm not in charge or it would
be worse). Because of this it's hard to grab just
the set of files that are needed. Access to the
Perforce server is slow for some people in the
organization (overseas and no Perforce proxy
installed). Also, training on how best to use
Perforce is somewhat spotty in the off site
locations. So when trying to reproduce some
issue they are working on they just send me a
huge zip file of their version of the code. Now I
could check that into Perforce somewhere, but
that seems like I'm rubbing in their face the fact that
they should have used Perforce. Often I just save
their copy on my computer and make multiple
other copies to test various solutions. I end up
with a bunch of copies of every single file, but
only a couple of those have even the slightest
difference. |
|
|
It just seems wasteful to me to have the same
data stored twice on the same drive and it annoys
me waiting for the files to copy, when I really
don't want another copy. I think the best way to
implement it would actually be to have this be the
default behavior when a file is copied to a
different folder or file name on the same physical
volume. Or maybe just make it a check box next
to the box for "compress drive to save disk space". |
|
|
Oh yeah, I also remember wanting this feature
back when I was required to use Subversion for
source control. SVN stores two copies of every
file. It diffs the main one against the one in the
hidden .svn folder to see if you've made local
changes. Again, most of these file pairs could be
linked, and once that was implemented in the OS,
with a few useful hooks, SVN could simply ask the
OS if the files are still linked rather than actually
having to diff the files. Even if SVN didn't take
advantage of the feature explicitly, when the diff
tool read one file then the other, the file would
probably already be cached by the OS so it would
only get read once from the hard drive anyway. |
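
For ordinary hard links, the "are these still linked?" question is already cheap to answer, which hints at what such a hook could look like. The sketch below compares device and inode numbers, which is the same check `os.path.samefile` performs. It says nothing about how SVN actually works, and `still_linked` is a hypothetical name.

```python
import os

def still_linked(path_a, path_b):
    """Return True if both paths name the same file (same device and inode)."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

With the proposed copy-on-write links, a True answer would mean the working file cannot have changed, so no diff would be needed.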
|
|
To my knowledge, Git doesn't make a diff until there is a
change; you could have 100 branches where only 1 copy is
actually stored. Since it's a distributed version control system,
it doesn't need to connect to a remote server unless you are
pushing/pulling commits to said server. So there's your
problem: you need a proper DVCS, and you need people to
actually use it. |
|
|
I finally found the term I'm looking for. I knew it
had to be something others had thought of
before. It's called Copy-on-write. See link. I say
maybe baked because while the description of its
use for copying objects in memory matches how
I'd like to use this, the section on copy-on-write in
storage media says this is implemented on btrfs
and ZFS. I can't tell how it is used in btrfs, but in
ZFS it looks like their use of copy on write is to
always make a new copy of a full block of data
when modifying any of it to ensure file integrity,
with no mention of allowing duplicate files to
share space on disk. Qcow2 looks like what I
want, but not integrated with a normally running
OS (just virtual machines). |
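
To pin down the behavior being asked for, here is a toy in-memory model of file-level copy-on-write. It is a sketch of the idea only, not of how btrfs, ZFS or Qcow2 implement anything: the class `CowStore` and all of its methods are invented for illustration, and it works on whole files rather than blocks.

```python
class CowStore:
    """Toy model: names share one blob of data until one of them is written."""

    def __init__(self):
        self._blobs = {}    # blob id -> bytes
        self._names = {}    # file name -> blob id
        self._next_id = 0

    def _new_blob(self, data):
        blob_id = self._next_id
        self._next_id += 1
        self._blobs[blob_id] = bytes(data)
        return blob_id

    def _refcount(self, blob_id):
        return sum(1 for b in self._names.values() if b == blob_id)

    def create(self, name, data):
        self._names[name] = self._new_blob(data)

    def copy(self, src, dst):
        # A "copy" only adds a second name; no data is duplicated.
        self._names[dst] = self._names[src]

    def read(self, name):
        return self._blobs[self._names[name]]

    def write(self, name, data):
        blob_id = self._names[name]
        if self._refcount(blob_id) > 1:
            # Shared data: give this name its own blob, leave the others alone.
            self._names[name] = self._new_blob(data)
        else:
            # Sole owner: safe to update in place.
            self._blobs[blob_id] = bytes(data)

    def blob_count(self):
        # Number of distinct blobs currently referenced by some name.
        return len(set(self._names.values()))
```

Copying "orig" to "copy" leaves one blob in the store; writing to "copy" afterwards creates a second blob and leaves "orig" unchanged.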
|
|
I think you're thinking of a Versioning File System. |
|
|
Why does this idea remind me of Teamcenter? |
|
|
Oh yeah, because Teamcenter is what I have to deal
with on a random day-to-day basis. I hate that
program. |
|
|
No I am not thinking about a versioning file
system. I'm not looking for more data redundancy.
I'm looking for less. |
|
|
As I continue to think about what I'm looking for
here, it really is a file system that optimizes disk
space and file copy time by never copying data
unless it needs to. |
|
|
Previously I was thinking this would be an optional
way of dealing with some files I want to treat in
this way, but as I think about this more, I see no
reason not to always have the file system do this.
The only reason not to is if people think they are
making the data more secure by making a second
copy. But if it's on the same volume, a second
copy doesn't protect against a hard drive crash
anyway. If someone is worried about data
integrity, they would be much better off using a
system that stores data redundantly automatically.
If you make a copy of a file to improve data
integrity and one bit is corrupted in one of the
files, you may not notice that right away. If you do
notice, you have to manually decide which file you
think isn't corrupted. |
|
|
Therefore, a user should have a backup system and
data integrity system appropriate for their
situation, but on any one volume, the file system
should make at least a nominal effort not to make
multiple copies of the same data. |
|
|
Windows 8 ReFS sounds pretty interesting. It
does mention Copy on Write, but they are using
the term in the same sense as ZFS: they never
update data in place since a power failure during
the write would corrupt the new data and the old
data. |
|
|
From the 2nd link: "The NTFS features we have
chosen to not support in ReFS are: named
streams, object IDs, short names, compression,
file level encryption (EFS), user data transactions,
sparse, hard-links, extended attributes, and
quotas." |
|
|
Based on the list of NTFS features not included in
ReFS, (they excluded hard links and compression),
I'd say they are trending away from this concept.
Though they still support shadowing, which is
really the most essential infrastructure needed to
support my proposed feature. |
|
|
So, basically you want an editlog file (there's a name for that and I haven't been able to dredge it up for the last couple days), then commitable. The file (for picture editing), might read (may as well do it in plaintext) |
|
|
c:/proj1/work/Lookit.ptemp
----
Import c:/proj1/Lookitsunset.gif
Crop 24,500
Redout 300,475
----
|
|
|
which is a pretty small file, but it does rely on the original "LookitSunset" being around. At a later time the text file could be opened and a finished .gif written. Then if wanted it could be deleted as well as the original pic. |
|
|
Problem of course is that all that text is proprietary to the software program, barring creation of industry-wide standards. So it would be pointless putting it into the OS. |
|
|
I think that's a bit more high-level than he's going for, [FT]. |
|
|
yeah, I just reread the post. He wants to automatically write a new file when an opened hard-link file is modified. |
|
|
If it's Windows you can delete the link from SaveAs before SavingAs. Problem is "SaveAs" automatically opens in the original's directory, not where the link is. |
|
|
Sure, doing it by sectors would be great too. My
simple implementation should work well with larger
collections of small files, but would be quite
annoying with large files (for example Outlook
mailbox files). |
|
|
Unfortunately that would make it more difficult to
implement. |
|
|
DragonFly BSD's HAMMER File System sounds
interesting.
From the [link]: |
|
|
"HAMMER retains a fine-grained history. The state of the
filesystem can be accessed live on 30-60 second boundaries
without having to make explicit snapshots, up to a
configurable fine-grained retention time. Coarse-grained
history is controlled by snapshots. By default the system
cron generates one snapshot a day and retains 60 days worth.
Snapshots can be accessed live. A convenient undo command is
provided for single-file history, diffs, and extractions.
Snapshots may be used to access entire directory trees. Data
and meta-data is CRC-checked for integrity. Data block
deduplication [is used.]" |
|
| |