Link from Link Rover

(Apologies if this already exists. I looked and looked and couldn’t find a tool that does this.)

Anyone who has had to maintain a web site knows that broken links are a continual pain in the neck. There are a huge number of tools to deal with this problem, but what they do is simply “spider” your web site, retrieve each link, and determine if a 404 (Page Not Found) error has occurred. Then they notify you in some way so you can fix it.

I think this could be taken further. Instead of waiting for maintenance time, spider the links while the site is in good working condition, and grab the intended target page of each link. Strip out all the redundant HTML, menus, scripts and such to get to the actual content of the page. Then store this content (or snippets of it) locally on the webmaster's system for later use. Include a manual interface so the webmaster can tweak which pieces of text from the target page are considered significant.
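To make the capture step concrete, here is a rough sketch using only the Python standard library. Everything in it is my own assumption rather than an existing tool: the `SKIP_TAGS` list, the "longest sentences make the best snippets" heuristic, and the names `extract_text`, `pick_snippets` and `fingerprint` are all made up for illustration.

```python
import hashlib
import json
import re
from html.parser import HTMLParser

# Tags whose text is treated as page furniture, not content (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks))


def pick_snippets(text, count=3, min_words=6):
    """Pick the longest sentences as a crude guess at 'significant' text."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    candidates = [s for s in sentences if len(s.split()) >= min_words]
    return sorted(candidates, key=len, reverse=True)[:count]


def fingerprint(url, html):
    """Build the record that would be stored locally for one link target."""
    text = extract_text(html)
    return {
        "url": url,
        "snippets": pick_snippets(text),
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }


if __name__ == "__main__":
    page = "<body><nav>Home About</nav><p>Some long sentence of real content lives here.</p></body>"
    print(json.dumps(fingerprint("http://example.com/page", page), indent=2))
```

The manual interface would then just be a way for the webmaster to edit the `snippets` list in each stored record.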

Then at maintenance time the software is armed with not only the URLs of the links, but also the content each one is intended to display. Spider the web site again, and verify not only that each link leads somewhere, but that the intended content is present at the target page.
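The maintenance-time check could look something like the sketch below. It assumes the page text passed in has been stripped of markup the same way as when the snippets were captured. The 0.5 threshold, the status labels and the function names are arbitrary choices of mine, not anything an existing link checker does.

```python
import re
import urllib.error
import urllib.request


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits don't break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def match_ratio(snippets, page_text):
    """Fraction of stored snippets still present in the page text."""
    if not snippets:
        return 1.0  # nothing stored, so nothing to contradict
    haystack = normalize(page_text)
    found = sum(1 for s in snippets if normalize(s) in haystack)
    return found / len(snippets)


def classify(status_code, snippets, page_text, threshold=0.5):
    """Label a link as 'broken', 'content_changed' or 'ok'."""
    if status_code is None or status_code >= 400:
        return "broken"
    if match_ratio(snippets, page_text) < threshold:
        return "content_changed"
    return "ok"


def fetch(url, timeout=10):
    """Return (status_code, body); status_code is None if the host is unreachable."""
    request = urllib.request.Request(url, headers={"User-Agent": "link-rover-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.status, response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""
    except (urllib.error.URLError, OSError):
        return None, ""
```

A parked or repurposed domain usually answers with a perfectly healthy 200, which is exactly why the `content_changed` case is the interesting one.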

But wait, there’s more. If a link breaks, the software has what it needs to find a replacement. It can use a search engine to search the web for the target text from the original link. (Google has an API for this sort of thing.) This would allow it to find a moved page on the original site, as well as mirrors of the original material. It could also automatically search the Internet Archive's Wayback Machine for an archived copy of the page, as well as the Google cache.

When it notifies you of broken links, it can supply suggested replacements, which you can verify and easily select.
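Here is a sketch of how the replacement hunt might be wired up. The Wayback Machine part relies on the Internet Archive's public availability endpoint, and the response shape it parses is how I understand that endpoint to reply. The search part only builds a quoted-phrase query and leaves the choice of search API open. The function names and the 12-word cap are my own inventions.

```python
import json
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"


def phrase_query(snippet, max_words=12):
    """Turn a stored snippet into an exact-phrase search query."""
    words = snippet.replace('"', " ").split()[:max_words]
    return '"' + " ".join(words) + '"'


def wayback_lookup_url(url, timestamp=None):
    """Build the availability request for a dead link."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD; asks for the nearest capture
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(response_body):
    """Pull the archived URL out of the JSON reply, or None if nothing is archived."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

Passing the date the link was last known good as `timestamp` should bring back the capture closest to the content the webmaster originally meant to link to.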

Aside from making link maintenance a breeze, this tool would also let you catch broken links that many current tools would never notice, such as:
· Content has moved to a different URL within the site
· Domain has been sold or repurposed
· Domain has been “parked” on a search page