Halfbakery: Steganographic URL Text

Computer: Web: Searching
Steganographic URL Text (+3, -1)

There are quite a few sites, Wikipedia for instance, whose content is copied over and over again throughout the entire Internet. This occasionally makes searches a bit of a hassle, as the same results keep coming up over and over.

Proposed is that sites such as WP, which are both primary sources and archival in nature (in the sense that a page will stay online indefinitely), embed the original page's web address within the text of the page. The address would be mapped onto deprecated ASCII control characters, which replace as many SPACE characters (20h) as needed and still show up as whitespace.
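
A minimal sketch of the mapping, under assumptions the idea does not spell out: each byte of the URL is split into two 4-bit halves, and each half replaces one space with one of 16 control characters (0Eh to 1Dh, an arbitrary choice). The names `embed_url` and `extract_url` are hypothetical, and whether a given browser actually renders these characters as whitespace is not something the sketch can promise.

```python
# Hypothetical alphabet: 16 control characters, 0x0E..0x1D, one per 4-bit nibble.
BASE = 0x0E


def embed_url(text: str, url: str) -> str:
    """Replace the first spaces in text with control characters carrying url."""
    nibbles = []
    for byte in url.encode("utf-8"):
        nibbles += [byte >> 4, byte & 0x0F]
    if text.count(" ") < len(nibbles):
        raise ValueError("not enough spaces to carry the URL")
    pending = iter(nibbles)
    out = []
    for ch in text:
        if ch == " ":
            n = next(pending, None)
            # Once the URL is used up, remaining spaces stay ordinary spaces.
            out.append(" " if n is None else chr(BASE + n))
        else:
            out.append(ch)
    return "".join(out)


def extract_url(marked: str) -> str:
    """Collect the control characters back into the original URL."""
    nibbles = [ord(c) - BASE for c in marked if BASE <= ord(c) < BASE + 16]
    data = bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))
    return data.decode("utf-8")
```

A URL of n bytes needs 2n spaces this way, so a short page might not have room for a long address.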
-- FlyingToaster, May 15 2012

from... Everything_20except_20Wikipedia
[FlyingToaster, May 15 2012]

ISO latin ctrl chars http://www.blooberr.../charentity0-31.htm
tab, LF, CR treated as space, others are undefined [Loris, May 16 2012]

I suspect that this would require too much honesty and goodwill to work in the real world, but [+] for the thought.
-- pertinax, May 15 2012

I think the idea is that it somehow remains hidden, so that copying it along with the text is unavoidable. As far as I can tell, the actual information is read only by search engines, in an effort to make implementing the idea in the [link] easier.
-- erenjay, May 15 2012

I'm not quite sure what the goal is here. Or rather, I don't think it would work for the stated aim, since search engines don't generally provide exact HTML matches.
However, it might be useful for watermarking content to prove origin - if the content scrapers copy HTML rather than just the output text. You could make that more likely by incorporating lots of essential HTML formatting in the content.

I think there are four symbols which HTML treats as a space.
1) ASCII space
2) newline (actually CR and/or LF)
3) tab
4) nonbreaking space entity (&nbsp;)

The last is not quite the same as the others, since in some cases it will have a display effect, whereas the other three are defined to be equivalent.

Furthermore, runs of the first three are also collapsed into a single space (excluding within certain spans such as pre). So we could use two spaces, tab-space, space-tab and tab-tab etc to increase the data storage. One could go further, adding essentially invisible tags to the spaces between words; (i.e. "<b> </b>" and similar), but that might be unwise for several reasons.
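
A rough sketch of the two-character gaps, assuming the four combinations stand for the bit pairs 00, 01, 10 and 11 (an ordering picked only for illustration). `hide`, `reveal` and `collapse` are hypothetical names, and `collapse` only approximates how a browser folds runs of whitespace outside pre.

```python
import re

# Assumed ordering: index 0..3 is the two-bit value carried by each gap.
GAPS = ["  ", " \t", "\t ", "\t\t"]


def hide(text: str, payload: bytes) -> str:
    """Write payload into the gaps between words, two bits per gap."""
    words = text.split()
    symbols = [(b >> shift) & 3 for b in payload for shift in (6, 4, 2, 0)]
    if len(symbols) > len(words) - 1:
        raise ValueError("text too short for payload")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        # Gaps beyond the payload fall back to a plain single space.
        out.append(GAPS[symbols[i]] if i < len(symbols) else " ")
        out.append(word)
    return "".join(out)


def reveal(marked: str) -> bytes:
    """Read the two-character gaps back into bytes."""
    symbols = [GAPS.index(g) for g in re.findall(r"[ \t]{2}", marked)]
    usable = len(symbols) - len(symbols) % 4
    return bytes(
        (symbols[i] << 6) | (symbols[i + 1] << 4) | (symbols[i + 2] << 2) | symbols[i + 3]
        for i in range(0, usable, 4)
    )


def collapse(s: str) -> str:
    """Approximate HTML rendering: any run of space/tab/newline shows as one space."""
    return re.sub(r"[ \t\r\n]+", " ", s)
```

After `collapse`, the marked text and the plain text are identical, which is the point: the payload lives only in the raw source.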

The ideal watermarking system is undetectable, will survive any likely processing and uses a simple encoding (such that it is unfeasible that the codec was concocted to 'decrypt' arbitrary input to the desired output).

To this end I suggest using one or two spaces to encode one bit between each pair of words, with redundant data encoded after each full stop using newlines, tabs and spaces, and further redundant encoding of data in long documents by altering the width between newlines.
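
The one-bit-per-gap part might look something like this. It is only a sketch: the redundancy here is plain repetition of the bit string across the document with a majority vote on reading, which is an assumption, and it leaves out the after-full-stop and line-width channels entirely. `mark` and `read` are hypothetical names.

```python
import re


def mark(text: str, bits: str) -> str:
    """One space = 0, two spaces = 1; the bit string repeats along the text."""
    words = text.split()
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        out.append("  " if bits[i % len(bits)] == "1" else " ")
        out.append(word)
    return "".join(out)


def read(marked: str, n_bits: int) -> str:
    """Majority vote over every repetition of each bit position."""
    gaps = re.findall(r" +", marked.strip())
    votes = [[0, 0] for _ in range(n_bits)]  # [count of 0s, count of 1s]
    for i, gap in enumerate(gaps):
        votes[i % n_bits][1 if len(gap) >= 2 else 0] += 1
    return "".join("1" if ones > zeros else "0" for zeros, ones in votes)
```

With enough repetitions, a few gaps squashed to a single space in transit are outvoted; a copy that collapses every gap loses the mark completely.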
-- Loris, May 15 2012

The idea isn't to enable foolproof copyright detection, it's to simply state what the originating URL is.

99.9...% of copied material on the 'net is done legally. The WP snippet is quite legal, the AP news stories already paid for, etc.

Embedding the source URL allows shortest comparison for duplication.

Required also, of course, would be cooperation from the search engines and/or browsers: in the engine's case, an option (a radio button was mentioned in the preceding post's annos) to "compact" finds; in the browser's case, showing a second mouseover line.
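
On the engine side, "compacting" could be as little as grouping hits by the source address they carry. A sketch, assuming each hit is a dict with a "url" key and a "source" key holding the embedded address (or None when the page carries none); both the record layout and the name `compact` are made up for illustration.

```python
def compact(results):
    """Group result URLs under the source address each page declares."""
    groups = {}
    for hit in results:
        # Pages with no embedded source stand as their own group.
        key = hit.get("source") or hit["url"]
        groups.setdefault(key, []).append(hit["url"])
    return groups
```

The comparison is one short string per page rather than the page text itself.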

There are over 30 ASCII control characters, most of whose use has been deprecated over the decades.

Perhaps "as well as" 20h blanks, as opposed to "instead of".
-- FlyingToaster, May 15 2012

The original hypertext spec was supposedly for linking and referencing if I recall correctly. The linking took off, but the embedding and referencing side of it was too ad hoc.
-- not_only_but_also, May 16 2012

//referencing side was too ad hoc// most sites would rather just cut'n'paste the text, without taking any html tags.
-- FlyingToaster, May 16 2012

Wouldn't it make more sense to request search engine owners to match and discard identical text passages from search results, with preference accorded the earliest instance?
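
Roughly what that might look like, as a sketch only: it assumes the engine already has a first-seen date per page, treats "identical" as identical after whitespace and case normalisation, and works on whole passages rather than finding shared passages inside longer pages. `fingerprint` and `keep_earliest` are hypothetical names.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the passage with whitespace runs and case normalised."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def keep_earliest(results):
    """results: (url, first_seen, text) tuples, first_seen as an ISO date string.

    Returns the URLs that survive, one per distinct passage, earliest first_seen wins.
    """
    earliest = {}
    for url, first_seen, text in results:
        key = fingerprint(text)
        if key not in earliest or first_seen < earliest[key][1]:
            earliest[key] = (url, first_seen)
    return sorted(url for url, _ in earliest.values())
```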
-- UnaBubba, May 16 2012

//99.9...% of copied material on the 'net is done legally.//
I doubt that, but never mind.

Incorporating deprecated ASCII control chars is probably not the thing to do; they're very likely to be stripped at any stage of the process. Even using just one of CR or LF for a newline is likely to be munged at some point. I'm not even sure that all browsers would treat them the same way. I'm pretty sure they won't uniformly be treated as spaces, because that doesn't seem to be defined behaviour (see link).

On the other hand, if what you want is simply for legal copies to point to the source, the solution is trivial. The licence should require propagation of boilerplate disclaimer including source link:
"This article originally posted at http://www.halfbakery.com/idea/Steganographic_20URL_20Text#1337142042"
-- Loris, May 16 2012

The idea ain't a copyright protection scheme. Matching and identifying every bit of text on the net with each other has the disadvantage of trying to match and identify every bit of text on the net with each other.
-- FlyingToaster, May 17 2012
