halfbakery: Eureka! Keeping naked people off the streets since 1999.
|
|
|
When searching for a term that appears in a hot press release, one finds that hundreds or even thousands of hits simply repeat the exact same text.
Since the search engine is already indexing that text, it could notice the duplicates and prune them, making the haystack somewhat smaller.
Google cheat sheet
http://www.google.c...elp/cheatsheet.html [waugsqueke, Mar 09 2005]
Google search: turvy -"topsy turvy"
http://www.google.c...+-%22topsy+turvy%22 "turvy" other than "topsy turvy" [waugsqueke, Mar 10 2005]
|
|
Is that a distinct problem from link farms, which I believe Google does its best to filter already? |
|
|
In the case that triggered this idea, I was looking for more data on a specific company that was mentioned in a press release. |
|
|
Instead, I'm getting thousands of hits all quoting the press release. Mind you, the summary paragraph shown for each hit is fairly obviously the same, so Google kind of knows it's showing the same data. |
|
|
Google sort of does this now... |
|
|
"In order to show you the most relevant results, we have omitted some entries very similar to the ## already displayed. If you like, you can repeat the search with the _omitted results included_." |
|
|
This would be easily solved if Google allowed a NOT operand. |
|
|
[waugs]: I think that only excludes multiple hits from the same domain. (I could be wrong, per usual.) |
|
|
I thought this idea would require that we carry each other. Carry each other. |
|
|
Damn you [bungston]! you beat me to it. |
|
|
Guys, just clarifying: these are hits from multiple different sites that all carry the exact same text. |
|
|
So the only clue that Google has to the "sameness" of the text is the abstract. |
|
|
And it's absolutely generating thousands of them. |
|
|
Now, you can "minus" certain terms and eliminate all of the hits -- which is not what you would want either. |
|
|
Ideally, you'd want to see unique information referred to in a unique way, and no more than necessary. |
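To make the "show it once" idea concrete, here is a minimal sketch of collapsing hits whose summary text is the same after trivial normalization. It is an illustration only, not a description of what Google does; the `(url, snippet)` shape of a hit and the names `fingerprint` and `collapse_duplicates` are assumptions made up for this example.

```python
import hashlib
import re


def fingerprint(snippet: str) -> str:
    """Hash a snippet after lowercasing it and dropping punctuation/extra spaces."""
    words = re.findall(r"[a-z0-9]+", snippet.lower())
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


def collapse_duplicates(hits):
    """Keep the first hit for each distinct snippet fingerprint.

    `hits` is a list of (url, snippet) pairs (a hypothetical shape).
    Returns (url, snippet, copies) tuples in first-seen order, where
    `copies` counts how many hits shared that snippet.
    """
    kept = {}  # fingerprint -> [url, snippet, copies]
    for url, snippet in hits:
        key = fingerprint(snippet)
        if key in kept:
            kept[key][2] += 1
        else:
            kept[key] = [url, snippet, 1]
    return [tuple(entry) for entry in kept.values()]
```

This only catches copies that are identical apart from case, punctuation and spacing; lightly edited copies of a press release would need a fuzzier comparison.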
|
|
[confidential to Freefal -- you should have said U2 bungston]. |
|
|
And I can't be holdin' on... |
|
|
//if Google allowed a NOT operand//
It does already. Just put "-" before an item in the query. |
|
|
This searches for pages with "foo" but not "bar":
foo -bar |
|
|
This searches for pages with "foo" but not on the halfbakery:
foo -site:halfbakery.com |
|
|
Sure, though it's tricky to do that for a whole paragraph or article. |
|
|
I think the criticism, though valid, misses the point. |
|
|
Sure I can be smart enough to still find what I want. |
|
|
But why would you show me 1000s of copies of the same entry? My assistant wouldn't, right? |
|
|
[UB], no sadly. I'm sure it's not my personality, though |
|
|
// it's tricky to do that for a whole paragraph or article. // |
|
|
Grab a fairly unique phrase from it, put it in quotes and then put a - before it. That tells Google to ignore anything that includes this passage of text. |
|
|
Added link to Google's cheat sheet, showing all the operators. They can be combined in very useful ways. |
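For anyone scripting their searches, the exclusions discussed above can be assembled into a query string. The sketch below only builds the text of the query using the `-term`, `-"phrase"` and `-site:` forms quoted in this thread; the function name `build_query` and its parameters are made up for illustration.

```python
def build_query(terms, exclude_terms=(), exclude_phrases=(), exclude_sites=()):
    """Build a query string using the "-" exclusion syntax from the thread."""
    parts = list(terms)
    parts += [f"-{term}" for term in exclude_terms]          # foo -bar
    parts += [f'-"{phrase}"' for phrase in exclude_phrases]  # turvy -"topsy turvy"
    parts += [f"-site:{site}" for site in exclude_sites]     # foo -site:halfbakery.com
    return " ".join(parts)
```

For example, `build_query(["foo"], exclude_terms=["bar"])` gives `foo -bar`.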
|
|
<aside>The Brits among us may care to check the first Google hit for "fuckwit".</aside> |
|
|
'Miserable failure' is interesting, too. |
|
|
I'd like an "other-than" boolean operator which would exclude text matches that satisfied a particular criterion, but not exclude entire pages on that basis. |
|
|
"turvy" other than "topsy turvy" |
|
|
would find places where the word "turvy" appeared not preceded by the word "topsy". Sites containing the phrase "topsy turvy" would not be completely excluded, but would only be included if the word "turvy" appeared without the word "topsy" in front of it. |
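A rough sketch of how such an "other-than" test might work on a single page's text, as opposed to the page-level exclusion that "-" gives. This is one possible reading of the proposal, not an existing search operator; the function name `matches_other_than` and the simple letters-only tokenization are assumptions.

```python
import re


def matches_other_than(text, word, preceding):
    """True if `word` occurs at least once NOT directly preceded by `preceding`.

    Words are runs of letters, compared case-insensitively, so hyphens and
    spaces both count as word breaks.
    """
    word = word.lower()
    preceding = preceding.lower()
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(
        tok == word and (i == 0 or tokens[i - 1] != preceding)
        for i, tok in enumerate(tokens)
    )
```

Under this reading, a page containing only "topsy turvy" is rejected, while a page containing "flopsy-turvy topsy-turvy" is accepted because one "turvy" stands without "topsy" in front of it.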
|
|
[angel] I'm a Yank, myself, but I checked Google as you suggested. I wonder who set it up to go to [John Leslie Prescott]. |
|
|
//Grab a fairly unique phrase from it, put it in quotes and then put a - before it. // |
|
|
[waugs], it seems like that would get rid of every instance of it, whereas the intention of this idea (I think) is to show it once and only once. |
|
|
// "turvy" other than "topsy turvy" // |
|
|
supercat, Google will do that too (see link). Note it found a surprisingly large number of instances of 'autopsy-turvy'. |
|
|
// the intention of this idea (I think) is to show it once and only once. // |
|
|
Yes, but after the initial search, where you've found thousands of the same thing (which I still think Google will reduce down, even over multiple domains), you know what to exclude on subsequent searches. |
|
|
//I wonder who set it up to go to [John Leslie Prescott]// That was a joint effort by several bloggers, led by a guy called Tim Worstall. Google his name for details. |
|
|
I don't think Google's doing what I want, since it would not return a page containing the phrase "flopsy-turvy topsy-turvy"; the phrase "topsy-turvy" should not disqualify the page altogether, but when using the "minus" operator on Google it does. |
|
|
Not a bad idea on the surface, but it begins to look less attractive when you consider some of the questions that would have to be answered in implementation. In particular, how should Google (or any other search engine) select a "definitive source" for a given document? |
|
|
Perhaps it would be better if we could attach our own intelligent agents to the search services, to sit between us and the raw flow of information and filter out the useful bits according to our own individual criteria. Otherwise we put search engines in the business of pre-filtering our information for us, and I don't know that we really want that. |
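As a sketch of what such a personal agent could look like, the function below leaves both decisions to the user: what counts as "the same document" (`group_key`) and which copy is the "definitive" one (`prefer`, lower is better). Everything here is hypothetical, including the dict shape of a hit and the name `personal_filter`; no search service is known to expose a hook like this.

```python
def personal_filter(hits, group_key, prefer):
    """Keep one hit per group, chosen by the user's own preference.

    hits      -- iterable of dicts such as {"url": ..., "snippet": ...}
    group_key -- function mapping a hit to a hashable "same document" key
    prefer    -- function mapping a hit to a sortable score; lowest wins
    """
    best = {}  # group key -> preferred hit seen so far
    for hit in hits:
        key = group_key(hit)
        if key not in best or prefer(hit) < prefer(best[key]):
            best[key] = hit
    return list(best.values())
```

One user might prefer the shortest URL, another a list of trusted domains; the search engine itself would not have to pick a winner.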
|
|
// it would not return a page containing the phrase "flopsy-turvy topsy-turvy"; the phrase "topsy-turvy" should not disqualify the page altogether, // |
|
|
Hm. I'm confused reading that, so I'm sure Google would be too. You're saying you don't want pages that have the phrase "topsy turvy" on them to appear in the search results, then say that the phrase "topsy turvy" shouldn't prevent a page from appearing when it's filtered out. Umm. |
|
| |