h a l f b a k e r y
Cogito, ergo sumthin'


Misc. Search Engine Features wibni

combine whitespace neutrality with boolean search maths
  (+1)

== Combine boolean search with whitespace neutrality ==

By 'whitespace neutrality' I mean the style of writing preferred by the ancient Greek and Roman civilizations, in which the words of a document are run together from beginning to end with no whitespace or punctuation. A text or hypertext file is thus prepared for indexing by stripping it of all but alphanumeric characters. Boolean searches are then matched as substrings of the stripped text, for example:

autonom and anarch

infallib and doctrin

veillance

(solar or photovoltaic) and (addtocart or addtobasket)
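As a rough illustration of the stripping step and of how the example searches might be evaluated, here is a minimal Python sketch. The names `strip_text` and `matches` are hypothetical, and several details are assumptions not stated above: case is folded to lowercase, 'and' binds tighter than 'or', and the words 'and', 'or' and 'not' are reserved as operators (so they cannot themselves be search terms).

```python
import re

def strip_text(text):
    """Keep only alphanumeric characters; lowercasing is an assumption."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def matches(query, document):
    """Return True if the stripped document satisfies the boolean query."""
    stripped = strip_text(document)
    # Tokens are parentheses or runs of non-space, non-parenthesis characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", query.lower())
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            right = parse_and()  # always parse the right side before combining
            value = value or right
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            right = parse_atom()
            value = value and right
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "not":
            return not parse_atom()
        if tok == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in (")", "and", "or"):
            raise ValueError(f"unexpected token {tok!r}")
        # A plain term matches if it occurs anywhere in the stripped text.
        return strip_text(tok) in stripped

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

Because whitespace and punctuation are gone before matching, 'addtobasket' finds a page that says "Add to basket!", and 'veillance' finds both 'surveillance' and 'sous-veillance'.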

== Use text dissimilarity in 'ranking' pages ==

I don't know what is ideal or feasible in 'measures of text dissimilarity.' Perhaps 'number of substrings in common' would be appropriate.

I suggest the following method for selecting the first-ranked result. The search string 'autonom' is 7 characters long. A 7-character search string in a 7,000-character file is 0.1% of the whole file and has a numeric value of 0.001. Three occurrences of 'autonom' in a 7,000-character file would count for 0.003. The 'and' connective represents multiplication and 'or' represents addition. Unary 'not' could be 1 for absent and 0 for present. The first-ranked result is simply the page for which this formula yields the largest value.
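The arithmetic above can be sketched in Python as follows. This is only one reading of the formula: the function names are hypothetical, queries are written as nested tuples such as `("and", "autonom", "anarch")` rather than parsed from text, file length is measured after stripping non-alphanumerics (an assumption), occurrences are counted without overlap, and 'not' applied to a sub-expression treats a value of 0 as "absent" (also an assumption).

```python
import math

def strip_text(text):
    """Keep only alphanumeric characters; lowercasing is an assumption."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def term_value(term, stripped_doc):
    """Fraction of the file covered by (non-overlapping) occurrences of term."""
    if not term or not stripped_doc:
        return 0.0
    return stripped_doc.count(term) * len(term) / len(stripped_doc)

def score(query, stripped_doc):
    """Evaluate a query given as a term string or a nested tuple."""
    if isinstance(query, str):
        return term_value(strip_text(query), stripped_doc)
    op, *args = query
    values = [score(arg, stripped_doc) for arg in args]
    if op == "and":
        return math.prod(values)   # 'and' is multiplication
    if op == "or":
        return sum(values)         # 'or' is addition
    if op == "not":
        return 1.0 if values[0] == 0 else 0.0  # 1 for absent, 0 for present
    raise ValueError(f"unknown operator {op!r}")

def first_ranked(query, pages):
    """Return the page text for which the formula yields the largest value."""
    return max(pages, key=lambda page: score(query, strip_text(page)))
```

For instance, `score("autonom", "autonom" + "x" * 6993)` works out to 7/7000, matching the 0.001 in the text.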

The tenth-ranked result should not necessarily be the page with the tenth-largest value of the formula. It should be the page that is least textually similar to the first 9 results (perhaps to the first 9 results concatenated?).
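One way this might look in code is sketched below. It goes beyond what is stated above in two ways, both assumptions: the rule for the tenth result is generalized so that every result after the first is the matching page most dissimilar to all earlier results concatenated, and dissimilarity is measured as Jaccard distance over character trigrams, which is just one normalized way of counting substrings in common. The names `ngrams`, `dissimilarity` and `rerank` are hypothetical.

```python
def ngrams(text, n=3):
    """Set of length-n substrings of the text after stripping non-alphanumerics."""
    stripped = "".join(ch for ch in text.lower() if ch.isalnum())
    return {stripped[i:i + n] for i in range(len(stripped) - n + 1)}

def dissimilarity(a, b, n=3):
    """Jaccard distance between n-gram sets: 0 is identical, 1 is nothing shared."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)

def rerank(scored_pages, limit=10):
    """Order pages that already match the query.

    scored_pages is a list of (page_text, formula_value) pairs. The first
    result is the highest-valued page; each later result is the remaining
    page most dissimilar to the earlier results concatenated.
    """
    remaining = sorted(scored_pages, key=lambda pair: pair[1], reverse=True)
    if not remaining:
        return []
    results = [remaining.pop(0)[0]]
    while remaining and len(results) < limit:
        seen = " ".join(results)
        # On ties, max() keeps the earliest index, i.e. the higher formula value.
        index = max(range(len(remaining)),
                    key=lambda i: dissimilarity(remaining[i][0], seen))
        results.append(remaining.pop(index)[0])
    return results
```

With this ordering, a near-duplicate of the top result is pushed down the list even if its formula value is the second largest.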

The top position is of course gameable by posting files containing a single word, but the dissimilarity algorithm should impose a certain signal-to-noise ratio on the results as a whole.

LoriZ, Apr 05 2009


       I lost this part way through, but I think your proposed similarity search has some features in common with the sequence similarity searches used in genomics and proteomics (DNA and protein).
MaxwellBuchanan, Apr 05 2009
  

       Either Google or AltaVista used to be able to do ANDs and ORs as well as parentheses; your disambiguation idea sounds good, except you wouldn't want to go to that page, just refine your search to similar.   

       //Greek and Roman// had the advantage of common suffixes so reading a flowing text wouldn't require much thought.
FlyingToaster, Apr 05 2009
  

       If that tenth result is the first nine concatenated, won't it be more relevant than all of them? And won't that encourage plagiarism?
notexactly, Apr 11 2019
  

       No, the tenth result is whatever (among documents that match the search terms) is most dissimilar to the first nine concatenated.
LoriZ, Nov 30 2019
  

       I think what I meant is: If somebody makes a new webpage by combining the first nine results, shouldn't that new combined page be considered relevant by this algorithm? Wouldn't it take the #1 spot and cause those nine to disappear from the results (or move down the list quite far) because they're so similar to it?
notexactly, Dec 02 2019
  


 
