h a l f b a k e r y

Misc. Search Engine Features wibni

combine whitespace neutrality with boolean search maths

== Combine boolean search w. whitespace neutrality ==

By 'whitespace neutrality' I mean the style of writing preferred by the ancient Greek and Roman civilizations, in which the words of a document are run together from beginning to end with no whitespace or punctuation. A text or hypertext file is thus prepared for indexing by stripping it of all but alphanumeric characters. Search terms are then matched as substrings of the stripped text and combined with boolean connectives, for example:

autonom and anarch

infallib and doctrin

veillance

(solar or photovoltaic) and (addtocart or addtobasket)
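
A minimal sketch of how the example searches above might run against a stripped file. The names normalize and has are made up for illustration, and lowercasing the text is an extra assumption that the idea itself does not specify.

```python
def normalize(text: str) -> str:
    """Keep only alphanumeric characters; lowercasing is an added assumption."""
    return "".join(ch for ch in text if ch.isalnum()).lower()


def has(stripped: str, term: str) -> bool:
    """Plain substring test against an already-stripped document."""
    return term in stripped


page = normalize("Solar panels on sale. Add to cart today!")

# (solar or photovoltaic) and (addtocart or addtobasket)
shop_match = (has(page, "solar") or has(page, "photovoltaic")) and (
    has(page, "addtocart") or has(page, "addtobasket")
)

# 'veillance' is a substring of both surveillance and sousveillance
veil_match = has(normalize("Sous-veillance vs. surveillance"), "veillance")

print(page, shop_match, veil_match)
```

Because the whitespace is gone, 'addtocart' matches "Add to cart" however the page spaces or punctuates it.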

== Use text dissimilarity in 'ranking' pages ==

I don't know which measures of text dissimilarity are ideal or feasible. Perhaps 'number of substrings in common' would be appropriate.
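
One possible reading of 'number of substrings in common', sketched under assumptions: substrings are taken to be fixed-length runs of 5 characters (the length is arbitrary), and the dissimilarity value is one minus the Jaccard overlap of the two sets. Neither choice comes from the idea itself, and the function names are hypothetical.

```python
def ngrams(text: str, n: int = 5) -> set:
    """All distinct substrings of length n (n = 5 is an arbitrary assumption)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def common_substrings(a: str, b: str, n: int = 5) -> int:
    """Number of distinct length-n substrings the two texts share."""
    return len(ngrams(a, n) & ngrams(b, n))


def dissimilarity(a: str, b: str, n: int = 5) -> float:
    """1 minus Jaccard overlap: 0.0 for identical sets, 1.0 for nothing shared."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    if not union:
        return 0.0  # both texts shorter than n; treated as identical here
    return 1.0 - len(ga & gb) / len(union)


print(common_substrings("autonomy", "autonomous"))
print(dissimilarity("autonomy", "autonomous"))
```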

I suggest the following method for selecting the first-ranked result. The search string 'autonom' is 7 characters long, so a single occurrence in a 7,000-character file makes up 0.1% of the whole file and has a numeric value of 0.001. Three occurrences of 'autonom' in a 7,000-character file would count for 0.003. The 'and' connective represents multiplication and the 'or' connective addition. Unary 'not' could be 1 for absent and 0 for present. The first-ranked result should simply be the page for which this formula yields the largest value.
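
The arithmetic above can be sketched as follows. The function names are hypothetical, the file length is assumed to be measured after stripping, and str.count counts non-overlapping occurrences, which is one of several ways to count.

```python
import math


def term_score(doc: str, term: str) -> float:
    """Fraction of the file taken up by occurrences of the search string."""
    if not doc or not term:
        return 0.0
    return doc.count(term) * len(term) / len(doc)


def score_and(*scores: float) -> float:
    """'and' is multiplication."""
    return math.prod(scores)


def score_or(*scores: float) -> float:
    """'or' is addition."""
    return sum(scores)


def score_not(doc: str, term: str) -> float:
    """Unary 'not': 1 if the term is absent, 0 if it is present."""
    return 0.0 if term in doc else 1.0


# A made-up 7,000-character file: 'autonom' three times, 'anarch' once.
doc = "autonom" * 3 + "anarch" + "x" * (7000 - 27)

autonom = term_score(doc, "autonom")  # 3 * 7 / 7000
anarch = term_score(doc, "anarch")    # 1 * 6 / 7000
both = score_and(autonom, anarch)     # autonom and anarch
either = score_or(autonom, anarch)    # autonom or anarch
print(autonom, anarch, both, either)
```

Under this formula a file consisting of nothing but 'autonom' scores 1.0 for that term, the largest value a single term can reach.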

The tenth-ranked result should not necessarily be the page with the tenth-largest value of the formula. It should be the page that is least textually similar to the first 9 results (perhaps to the first 9 results concatenated?).
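
A rough sketch of that selection rule, assuming it applies to every position after the first and that dissimilarity is measured by shared 5-character substrings (an assumed measure, not one the idea settles on). The page names, texts and scores below are invented for illustration.

```python
def ngrams(text: str, n: int = 5) -> set:
    """Distinct substrings of length n; n = 5 is an arbitrary assumption."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dissimilarity(a: str, b: str, n: int = 5) -> float:
    """1 minus the Jaccard overlap of the two substring sets (assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    union = ga | gb
    return 1.0 - len(ga & gb) / len(union) if union else 0.0


def rank(pages: dict, scores: dict, k: int = 10) -> list:
    """First result: largest score. Each later result: the matching page
    most dissimilar to the already chosen results concatenated."""
    candidates = [name for name in pages if scores[name] > 0]
    if not candidates:
        return []
    first = max(candidates, key=lambda name: scores[name])
    ranked = [first]
    candidates.remove(first)
    while candidates and len(ranked) < k:
        so_far = "".join(pages[name] for name in ranked)
        nxt = max(candidates, key=lambda name: dissimilarity(pages[name], so_far))
        ranked.append(nxt)
        candidates.remove(nxt)
    return ranked


# Invented, already-stripped pages; 'b' is a near copy of 'a'.
pages = {
    "a": "autonomyandanarchyintheory",
    "b": "autonomyandanarchyintheorytoo",
    "c": "anarchistcookingforautonomouspeople",
}
scores = {"a": 0.5, "b": 0.4, "c": 0.1}
print(rank(pages, scores))
```

Here the near copy 'b' drops below 'c' even though its formula value is higher, because it adds little that 'a' does not already contain.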

The top position is of course gameable by posting files containing a single word, but the dissimilarity algorithm should impose a certain level of signal-to-noise ratio on the results as a whole.

—	LoriZ, Apr 05 2009

I got lost partway through this, but I think your proposed similarity search has some features in common with the sequence similarity searches used in genomics and proteomics (DNA and protein).

—	MaxwellBuchanan, Apr 05 2009

Either Google or AltaVista used to be able to do ANDs and ORs as well as parentheses; your disambiguation idea sounds good, except you wouldn't want to go to that page, just refine your search to similar ones.

//Greek and Roman// had the advantage of common suffixes so reading a flowing text wouldn't require much thought.

—	FlyingToaster, Apr 05 2009

If that tenth result is the first nine concatenated, won't it be more relevant than all of them? And won't that encourage plagiarism?

—	notexactly, Apr 11 2019

No, the tenth result is whatever (among documents that match the search terms) is most dissimilar to the first nine concatenated.

—	LoriZ, Nov 30 2019

I think what I meant is: If somebody makes a new webpage by combining the first nine results, shouldn't that new combined page be considered relevant by this algorithm? Wouldn't it take the #1 spot and cause those nine to disappear from the results (or move down the list quite far) because they're so similar to it?

—	notexactly, Dec 02 2019
