Halfbakery: web harvester black hole list

web harvester black hole list (+47, -2)
Spam : Realtime Black Hole List = Web Address Harvesting : X

The "Realtime Black Hole List" is a database that keeps track of systems from which spam originates. Many professional e-mail systems can be configured to not accept e-mail from the spam sources on the RBHL, and there are mechanisms for getting sites on and off that list.

I would like an equivalent mechanism for web spiders.

That is, I would like a database with profiles of web spiders that either are known to scan for e-mail addresses or are known to misbehave (too many accesses in too short a time).

This is independent from the current HTML tags and robots.txt mechanisms, which only work for crawlers that pay attention to them.

The database would plug into widely used web servers (such as Apache) and would be used to automatically and efficiently deny access to the systems listed in it.

How to tell spamming crawlers from good ones: Have a webpage and a mail server ("mailhost.com"). The webpage logs a profile of every access, with a unique key for each access. For access #369, it presents a dynamically generated e-mail address containing that key, e.g. user369@mailhost.com. If e-mail arrives at user369@mailhost.com, you know that the access in profile #369 was a harvester, and can add the profile to the database.
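A minimal sketch of the lure described above, in Python. The names TrapRegistry, issue_address and mail_received are made up for illustration, the profile is reduced to an IP address and a User-Agent string, and everything lives in memory; a real deployment would need persistent, distributed storage.

```python
import itertools

MAIL_DOMAIN = "mailhost.com"  # placeholder domain taken from the idea text


class TrapRegistry:
    """Hands out one bait address per page access and remembers who got it."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._profiles = {}   # access key -> profile of that access
        self.blacklist = []   # profiles confirmed as harvesters

    def issue_address(self, ip, user_agent):
        """Log the access and return the bait address to embed in the page."""
        key = next(self._counter)
        self._profiles[key] = {"ip": ip, "user_agent": user_agent}
        return f"user{key}@{MAIL_DOMAIN}"

    def mail_received(self, recipient):
        """Called when mail arrives; returns the harvester's profile, if any."""
        local, _, domain = recipient.partition("@")
        if domain != MAIL_DOMAIN or not local.startswith("user"):
            return None
        suffix = local[len("user"):]
        if not suffix.isdecimal():
            return None
        profile = self._profiles.get(int(suffix))
        if profile is not None and profile not in self.blacklist:
            self.blacklist.append(profile)
        return profile
```

The web page would call issue_address for each request, and the mail server would call mail_received for each message that arrives at the bait domain.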

Realistically, the lure webpage and mailhost can't be in a single place. They would have to be distributed; any specific address is easy to filter out.
-- jutta, Sep 17 2001

MAPS Realtime Black hole List http://mail-abuse.org/rbl/
The e-mail version. [jutta, Sep 17 2001]

Web Robots Database http://www.robotstxt.org/wc/active.html
A good starting point. [via www.memepool.com] [jutta, Sep 17 2001]

Wpoison http://www.monkeys.com/wpoison/
One example of a spamtrap. Does not include the obvious next step you describe. [egnor, Sep 17 2001]

Sneakemail http://www.sneakemail.com
This isn't really what you're describing, but this is a good opportunity to tell people about the way i've learned to find out from where spam originates. [cameron, Sep 17 2001]

My SPAM list http://techref.mass...g/stats/spamlog.htm
How to kill SPAM accounts. [James Newton, Aug 08 2002]

Project Honeypot https://www.projecthoneypot.org/
[aguydude, Apr 28 2015]

A list of evil bots' "User Agent" headers. http://www.botsvsbr...egory/16/index.html
[goldbb, Apr 30 2015]

This one gets my vote. Web scraping spammers have forced me to maintain some pretty extensive email filtering rules. I'd love to block this at the source. The trick is to maintain that delicate balance between keeping out the clever bots while still allowing access for the legit users on a given ISP.
-- BigBrother, Sep 17 2001

Seems like the web, after an initial carefree childhood, now has a strong capitalistic/exploitative faction moving onto the turf of the GNUcode/freeware/free information flowergeek children. My kneejerk is strongly positive for anything that might benefit the average netizen by discomfitting spammers, spyware, push-media advertising, and everything else that is evil incarnate (by which I mean everything that I personally don't like, eh?). Go! Write the code!
-- Dog Ed, Sep 18 2001

You would also have to make the web bot 'trap' outlined in the last paragraph of the idea mutate from time to time so that no web bot will be able to know what it's walking into.

I take it you get a lot of bandwidth-wasting bots. I have a cable connection to the Internet. My computer runs Apache because, well, I couldn't be bothered removing it. My web page serves no content (just a splash screen telling people to piss off) and I still find evidence of web bots and that irritating Code Red worm in my server logs. Amazing...
-- sdm, Sep 18 2001

No. We haven't deliberately "structured our society" around anything - media developments have far overtaken legal ones, and plain old users don't quite have the lobbying power of e-mail marketers, is all. Protection against Spam does not mean the end of civilization as we know it; it would simply once again level the playing field in an area where technology currently favors the ruthless.

I hope you'll find a worthier target for your feelings of solidarity than people who are violating your privacy, lying about you, stealing your resources, or harassing you.
-- jutta, Sep 18 2001, last modified Sep 19 2001

I don't want to delete PeterSealy's and Rods Tiger's deeply felt meditations on the general area, but please keep further annotations at least vaguely on the topic of reducing spam crawlers.

I get about twenty pieces of spam per day in my unfiltered accounts, but my real interest in this is as a website maintainer. The halfbakery isn't written very well, and building one idea takes a long time. The only reason you can still access halfbakery ideas in acceptable time is that I've manually profiled and locked out some of the crawlers that didn't pay attention to my robots.txt file and accessed about a page every second, slowing the site to the proverbial crawl. (Maybe that's where the name comes from.)
-- jutta, Sep 19 2001

Possibly a stupid question: Does an access to the Halfbakery from a crawler look any different to an access from a normal user? I'm guessing the answer is no, otherwise you'd presumably have filtered all crawlers out (except for a few permitted crawlers, like Google, AvantGo, etc.).

I suppose my first reaction to your idea is that it's clever but, by the time you've got an email from the spam crawler, it might be too late - that is, the crawler might have crawled all your real content pages. If you're able to distinguish crawlers from regular users you could operate a negative form of your idea - ban all crawlers from your site, but feed them a page with email addresses on. If, within a month, you haven't received an email from them then allow them in the next time they visit.
-- hippo, Sep 19 2001

Yeah, like the e-mail version, it doesn't work short-term for an individual site, but it would work mid-term for a group of sites, provided crawlers don't regularly change their mode of operations.

Do crawler accesses look different? Not by definition; both clients and crawlers are just software acting on behalf of people (and the boundary really is fluid.)

But accesses send lots of side information that can be used for profiling - the name of the client used, the IP address it comes from (all the filtering I do so far is based on that); and crawler accesses follow a certain pattern and persist longer than a user would.
-- jutta, Sep 19 2001

Just to clarify - part of what I was getting at in the 2nd para above (disallowing all crawlers except those which haven't fallen for your email bait in the past, rather than allowing all except those which fall for the bait) was to ensure that any crawler which was faking its origins (which I assume is possible) would never get in.
But if you can't reliably distinguish between a crawler and a user my variation on your idea crumbles to dust...
-- hippo, Sep 19 2001

Not only that, but the web crawler will still follow through the answer to the question that presumably only humans can answer and get in. I'm not sure if it's legal, but I can envisage evil marketing people creating web crawlers that spoof different IPs in order to evade being banned.

For my money, the short-term solution would be something like what PeterSealy mentioned, that is, writing your email address in a different form. It would take some understanding from the general public though, that when I say "contact J.Bloggs at my company's server, someplace.com", I really mean "J.Bloggs@someplace.com".
-- sdm, Sep 20 2001

[UnaBubba] - some simple means of disguising your email address like that will stop most crawlers getting your address (just to be clear though - Jutta isn't griping about addresses being harvested as much as unwanted crawlers on her site slowing down access for everyone else).

And IP spoofing is quite possible (and may be even more widely used with the release of Windows XP).
-- hippo, Sep 20 2001

1 Spam croissant comin' right up - well combined list of ingredients, jutta.
-- thumbwax, Sep 20 2001

Spamtraps, designed to lure harvesting robots into an artificially constructed infinite maze of "sites" and "pages" full of bogus e-mail addresses, already exist. I'm actually rather surprised that nobody's hooked them up with a monitoring system to some sort of realtime block list. One problem is that looking something up in a central database may be slightly time consuming. If it takes a few extra seconds to deliver e-mail, nobody really minds. If every Web page hit takes a few extra seconds, ... that's an issue. I suppose a caching scheme fixes this; after the first hit, the server would remember that your IP address wasn't blocked.
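A rough Python sketch of that caching idea. The slow central lookup is passed in as a plain function, and CachedBlocklist, ttl_seconds and clock are illustrative names, not part of any existing block-list service. Verdicts expire after a while so that an address added to the central list later is eventually noticed.

```python
import time


class CachedBlocklist:
    """Remembers block-list verdicts so only the first hit pays for the lookup."""

    def __init__(self, lookup, ttl_seconds=3600.0, clock=time.monotonic):
        self._lookup = lookup      # slow function: ip -> True if blocked
        self._ttl = ttl_seconds    # how long a cached verdict stays valid
        self._clock = clock        # injectable so the cache can be tested
        self._cache = {}           # ip -> (is_blocked, expiry time)

    def is_blocked(self, ip):
        now = self._clock()
        cached = self._cache.get(ip)
        if cached is not None and cached[1] > now:
            return cached[0]       # fresh verdict, no central query needed
        verdict = bool(self._lookup(ip))
        self._cache[ip] = (verdict, now + self._ttl)
        return verdict
```

The time-to-live is a trade-off: a long one keeps page hits fast, a short one reacts sooner when a crawler is newly listed.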

If you're feeling malicious, you can also tarpit the spam robot by sending them infinitely long pages that load very slowly.

Also, you could target crawlers by observing behavior; I think you imply this in one of your annotations. (Anything which doesn't obey robots.txt but which does request pages in serial order faster than any human could reasonably read them...)
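One way to sketch the "faster than any human could read" test in Python is a sliding window per IP address. RateWatcher and its thresholds are assumptions chosen for illustration; sensible limits would depend on the site.

```python
from collections import defaultdict, deque


class RateWatcher:
    """Flags an IP address that requests too many pages within a time window."""

    def __init__(self, max_hits=5, window_seconds=10.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of recent hits

    def record(self, ip, timestamp):
        """Record one hit; return True if this IP now exceeds the limit."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Forget hits that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window:
            hits.popleft()
        return len(hits) > self.max_hits
```

A server would call record on every page request and start refusing (or tarpitting) a client once it returns True.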
-- egnor, Sep 20 2001

I've heard of crawlers which check robots.txt and then immediately download all the listed "forbidden to crawlers" urls, and of software that watches for this behavior.
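A small Python sketch of what such watching software might do, assuming the access log is available as (ip, path) pairs in request order. The function names are made up, and the parser is deliberately simplified: it treats every Disallow line as applying to everyone and ignores User-agent sections.

```python
def disallowed_prefixes(robots_txt):
    """Collect the path prefixes from all Disallow lines (simplified parser)."""
    prefixes = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            prefixes.append(value.strip())
    return prefixes


def robots_abusers(robots_txt, access_log):
    """Return IPs that fetched robots.txt and then requested a forbidden path."""
    prefixes = disallowed_prefixes(robots_txt)
    read_robots = set()
    abusers = set()
    for ip, path in access_log:
        if path == "/robots.txt":
            read_robots.add(ip)
        elif ip in read_robots and any(path.startswith(p) for p in prefixes):
            abusers.add(ip)
    return abusers
```

Only clients that actually read robots.txt are flagged here, since a human following a link could reach a forbidden page without ever seeing the file.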

hippo: web crawlers can't hide using IP spoofing, because the web server has to be able to send packets back to the crawler (containing the contents of web pages, e.g.). IP spoofing is useful mainly for denial-of-service attacks (flood a site with garbage) and some moderately sophisticated protocol security holes.

Web crawlers could presumably use an anonymizing service though.
-- wiml, Sep 21 2001

[wiml] Thanks - I didn't think of that.
-- hippo, Sep 21 2001

Kill (temporarily, perhaps) crawlers by automatically denying access to anything that sends in load requests with a "non-human" regularity.
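"Non-human regularity" could be measured in several ways. One possible sketch, in Python, looks at how evenly spaced a client's requests are: if the gaps between them barely vary, the client is probably a program. The function name and both thresholds are assumptions for illustration only.

```python
import statistics


def looks_mechanical(timestamps, min_requests=6, max_cv=0.1):
    """Guess whether request times are too evenly spaced to be a person.

    Uses the coefficient of variation (standard deviation divided by mean)
    of the gaps between consecutive requests.
    """
    if len(timestamps) < min_requests:
        return False                      # too little data to judge
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return True                       # every request at the same instant
    return statistics.pstdev(gaps) / mean_gap <= max_cv
```

A crawler that adds random delays would slip past this check, so it would only be one signal among several.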
-- dsm, Oct 26 2001

AS IT HAPPENS... I have this exact thing half... err... done. (as opposed to half baked <GRIN>) Part of any anti-harvesting or spam trap system is not telling everyone what you are doing, but the basic idea is clear. The generation of the trap emails is automated on my servers for all spiders and the triggering (when an email arrives at that address) is manual. I also have a small, manually maintained list of ripping engines which I use to shut off access to jerks who are tying up my bandwidth by attempting to download the entire (2.5 GB) site.

But more importantly, I LART EVERY SPAM I get. I have reported over 4000 SPAMs and have several confirmed kills. (Yes, I'm proud of that). I use a system of MANUAL identification (whether it is or isn't a SPAM is still a human judgement call) combined with totally automatic reporting through SPAMCOP.NET. I have a batch (cron) job that runs every night and empties the SPAM folder, where I put them, by reporting them. I even log them... I'll add a link for ego purposes only.

And I get about 40 a day. I doubt that many people who are not web hosts realize what a wonderfully easy target webmaster@ and postmaster@ are for SPAM and virus generated email.

There is also a system that I use for reporting attacks against my firewalls. dshield.org runs a service for identifying and reporting virus and cracker activity and, like spamcop, provides a black list.
-- James Newton, Aug 08 2002

RDF on the other hand... is completely kicking my butt. <GRIN>
-- James Newton, Aug 08 2002

Careful, [zevkirsh]; [jutta]'s allowed to get 'uppity': it's her site.
-- pertinax, Nov 07 2006

I think this is great! I'm sure some spammer would figure out how to get past it, though, they always do....
-- TahuNuva, Nov 03 2007

Why not the other way round? No crawling at all on your domain, unless it is a 'trusted crawler'. Maintaining a database of trusted crawlers and bots working for useful services is easier than blacklisting the bad ones.

Setting up the protocol and the organisation behind such a trusted list would be a nice pet project for a Google employee; I read that they can devote 20% of their time to such projects.
-- rrr, Feb 16 2008

I've now implemented a pretty good system for blocking wikispam, or "spam" that comes not as an email but as a post or comment to a page on my site. It looks for and blocks updates, but not spidering, from anything that spiders the site (which is ok, since googlebot, et al. do not comment on blogs). It also blocks spidering from anything that accesses more than a few pages per second unless the source IP is on a whitelist. But the most effective part has been looking for key phrases in the text that is posted. This is sort of like SpamAssassin in that the rules are hand-coded, based on my observations about wikispam. At some point, it would be interesting to build a Bayesian filter that learns the difference; ASSP does a very fine job of that for my email. Sadly, sharing that list of good rules to match against would allow the wikispammers to work around it, but I would make an exception for well-known web hosts.
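For readers wondering what hand-coded rules of that kind might look like, here is a hypothetical Python sketch. The patterns are invented examples, not the actual rules (which are deliberately not shared), and should_block_post is an illustrative name.

```python
import re

# Invented example rules; real ones would come from watching actual wikispam.
SPAM_PATTERNS = [
    re.compile(r"\bcheap\s+pills\b", re.IGNORECASE),
    re.compile(r"\bonline\s+casino\b", re.IGNORECASE),
    re.compile(r"(https?://\S+\s*){4,}"),   # four or more links in a row
]


def should_block_post(text, has_spidered):
    """Decide whether to reject a post or comment.

    has_spidered: True if the same client has been seen crawling the site.
    Crawlers may read pages but are never allowed to post updates.
    """
    if has_spidered:
        return True
    return any(pattern.search(text) for pattern in SPAM_PATTERNS)
```

The spidering check costs almost nothing and catches bulk posters early; the phrase rules handle whatever gets past it.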
-- James Newton, Feb 22 2008
