Halfbakery: Misc. Search Engine Features wibni

== Combine boolean search w. whitespace neutrality ==

By 'whitespace neutrality' I mean the style of writing preferred by the ancient Greek and Roman civilizations, in which the words of a document are run together from beginning to end with no whitespace or punctuation. A text or hypertext file is thus prepared for indexing by stripping it of all but alphanumeric characters. Example queries:

autonom and anarch

infallib and doctrin

veillance

(solar or photovoltaic) and (addtocart or addtobasket)
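
To make the idea concrete, here is a minimal Python sketch of the stripping step and of boolean substring matching against the run-together text. The names `normalize` and `matches`, the nested-tuple query format, and the lowercasing step are my own assumptions for illustration; the idea only asks for stripping everything but alphanumeric characters.

```python
import re

def normalize(text: str) -> str:
    """Strip everything but ASCII letters and digits.
    Lowercasing is an added assumption so matching ignores case."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def matches(query, doc: str) -> bool:
    """Evaluate a query against already-normalized text.
    A query is either a plain string (substring test) or a tuple:
    ("and", q1, q2, ...), ("or", q1, q2, ...), or ("not", q)."""
    if isinstance(query, str):
        return query in doc
    op, *args = query
    if op == "and":
        return all(matches(a, doc) for a in args)
    if op == "or":
        return any(matches(a, doc) for a in args)
    if op == "not":
        return not matches(args[0], doc)
    raise ValueError(f"unknown operator: {op!r}")

# Hypothetical page text, for illustration only.
page = normalize("Photo-voltaic panels: Add to Cart!")
query = ("and", ("or", "solar", "photovoltaic"),
                ("or", "addtocart", "addtobasket"))
print(page)                  # photovoltaicpanelsaddtocart
print(matches(query, page))  # True
```

Because hyphens and spaces are gone before matching, 'Photo-voltaic' and 'Add to Cart' are found by the queries 'photovoltaic' and 'addtocart', and a stem like 'veillance' would hit any word that contains it.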

== Use text dissimilarity in 'ranking' pages ==

I don't know what is ideal or feasible in 'measures of text dissimilarity.' Perhaps 'number of substrings in common' would be appropriate.

I suggest the following method for selecting the first-ranked result: The search string 'autonom' is 7 characters long. A 7-character search string in a 7,000-character file is 0.1% of the whole file and has a numeric value of 0.001. Three occurrences of 'autonom' in a 7,000-character file would count for 0.003. The 'and' connective represents multiplication and the 'or' connective addition. Unary 'not' could be 1 for absent and 0 for present. The first-ranked result should simply be the page for which the query, evaluated this way, yields the largest value.
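
The arithmetic above can be sketched in Python as follows. The function name `score` and the nested-tuple query format are assumptions of mine, as are three details the description leaves open: occurrences are counted without overlap (Python's `str.count`), an empty file scores 0, and 'not' applied to a sub-expression treats a zero score as 'absent'.

```python
def score(query, doc: str) -> float:
    """Score a query against stripped text.
    A query is a plain string or a tuple:
    ("and", q1, q2, ...), ("or", q1, q2, ...), or ("not", q)."""
    if isinstance(query, str):
        if not doc:
            return 0.0  # assumption: an empty file scores nothing
        # fraction of the file covered by (non-overlapping) occurrences
        return doc.count(query) * len(query) / len(doc)
    op, *args = query
    if op == "and":  # 'and' multiplies
        result = 1.0
        for a in args:
            result *= score(a, doc)
        return result
    if op == "or":   # 'or' adds
        return sum(score(a, doc) for a in args)
    if op == "not":  # 1 for absent, 0 for present
        return 1.0 if score(args[0], doc) == 0 else 0.0
    raise ValueError(f"unknown operator: {op!r}")

# Three occurrences of 'autonom' padded out to 7,000 characters.
doc = "autonom" * 3 + "x" * (7000 - 21)
print(score("autonom", doc))  # 0.003
```

The first-ranked page would then be something like `max(pages, key=lambda p: score(query, p))`, where `pages` is a hypothetical list of stripped page texts.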

The tenth-ranked result should not necessarily be the tenth largest value for the formula. It should be the page that is least textually similar to the first 9 results (perhaps the first 9 results concatenated?).
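
One possible reading of this, sketched in Python: keep the top-scoring page first, then fill each later rank with whichever remaining page is least similar to the concatenation of the pages already chosen. The measure used here, the Jaccard overlap of 4-character substrings, is just a stand-in for 'number of substrings in common', and applying the rule at every rank after the first (not only the tenth) is my assumption. The names `ngrams`, `similarity`, and `diversify` are hypothetical.

```python
def ngrams(text: str, n: int = 4) -> set:
    """All distinct n-character substrings of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(a: str, b: str, n: int = 4) -> float:
    """Shared n-grams divided by total distinct n-grams (Jaccard)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def diversify(ranked_docs: list, k: int = 10) -> list:
    """ranked_docs: stripped page texts, best formula score first.
    Returns up to k pages, each chosen as the least similar to the
    concatenation of those already picked."""
    if not ranked_docs:
        return []
    chosen = [ranked_docs[0]]
    remaining = list(ranked_docs[1:])
    while remaining and len(chosen) < k:
        seen = "".join(chosen)
        # ties fall to the page with the better original rank
        pick = min(remaining, key=lambda d: similarity(d, seen))
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

docs = [
    "autonomousanarchistcollective",
    "autonomousanarchistcollectives",   # near-duplicate of the first
    "anarchyandautonomyinbeekeeping",
]
print(diversify(docs, k=2))
```

In this toy run the near-duplicate is passed over for second place in favor of the beekeeping page, which is the behavior the dissimilarity rule is meant to produce.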

The top position is of course gameable by posting files containing a single word, but the dissimilarity algorithm should impose a certain level of signal-to-noise ratio on the results as a whole.