
WebCrawler

The Case

MediaWiki, the content management system behind Wikipedia, the world’s largest wiki, produces articles that are clean, well formatted, and simple to read. But collecting them is a noisy, recursive nightmare. Behind the scenes, MediaWiki stores article revisions, user histories, and discussions between users. Useful data, to be sure, but not necessary for a straightforward data collection. Many of these pages are dynamically generated, and a webcrawl can recurse infinitely. And of course, all of this data lives in a database, which isn’t easily collected or produced.

A client was using MediaWiki to store technical data and documentation in a private wiki on their network. They needed to collect the data used by an entire business unit, but the usual tools were returning hundreds of thousands of pages for what should have amounted to a few thousand articles. There was too much noise for the automated collection to be of use, and the target data set was too large for manual collection.

The Solution

Our development team wrote a prototype webcrawler in Python specifically designed to pull only the articles of interest and bypass the metadata related to article history and user discussion. It made extensive use of regular expressions to determine what was of interest and what could be ignored. Additionally, it tracked every step it had taken through the site, generating a canonical URL for each page to prevent infinite recursion.
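In rough outline, that approach looks like the sketch below (Python standard library only). The namespace list, URL patterns, and the caller-supplied fetch_links callable are illustrative stand-ins, not the client's actual filters.

import re
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

# Skip metadata pages: talk/user/special namespaces and history/edit/diff views.
# These patterns are illustrative MediaWiki defaults, not the production filters.
SKIP_PATTERNS = [
    re.compile(r"[/=](Talk|User|User_talk|Special|Help):", re.I),
    re.compile(r"[?&]action=(history|edit|raw)", re.I),
    re.compile(r"[?&](diff|oldid)=", re.I),
]

def canonicalize(base_url, href):
    """Resolve a link against the page it came from and strip fragments and
    volatile query parameters so every page has exactly one canonical form."""
    url = urljoin(base_url, href)
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Keep only the parameter that identifies an article (title=); drop the rest.
    params = [(k, v) for k, v in parse_qsl(query) if k == "title"]
    return urlunsplit((scheme, netloc.lower(), path, urlencode(params), ""))

def is_article(url):
    """True if the URL looks like an article rather than history/talk/special metadata."""
    return not any(p.search(url) for p in SKIP_PATTERNS)

def crawl(url, fetch_links, visited=None):
    """Walk the wiki depth-first. Metadata links are filtered out, and every page
    is reduced to a canonical URL so the crawl never revisits a page or recurses forever."""
    visited = set() if visited is None else visited
    if not is_article(url):
        return visited
    canonical = canonicalize(url, url)
    if canonical in visited:
        return visited
    visited.add(canonical)
    for href in fetch_links(canonical):  # fetch_links downloads a page and returns its hrefs
        crawl(urljoin(canonical, href), fetch_links, visited)
    return visited

A caller would supply fetch_links as a small function that downloads a page and returns the links found on it; keeping the network layer separate leaves the filtering and recursion logic easy to test in isolation.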

Each article was exported to its own text-searchable PDF, a direct 1:1 ratio of articles to PDFs: just the “meat” and none of the gristle.
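A minimal sketch of that one-article-per-PDF export, assuming an HTML-to-PDF renderer such as WeasyPrint (the prototype's actual rendering backend isn't specified here):

from pathlib import Path
from weasyprint import HTML  # assumed renderer; any HTML-to-PDF library would do

def export_article(title, html_content, out_dir="export"):
    """Render one article's HTML to its own text-searchable PDF."""
    Path(out_dir).mkdir(exist_ok=True)
    # Build a filesystem-safe filename from the article title.
    safe_name = "".join(c if c.isalnum() or c in " _-" else "_" for c in title)
    HTML(string=html_content).write_pdf(str(Path(out_dir) / f"{safe_name}.pdf"))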

It was such a success that we opted to expand it into a full-featured tool that supports multiple wikis (MediaWiki, Confluence), forums (vBulletin), multiple authentication modes (basic, session, cookie), depth control, and user-defined regex filtering.
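To give a sense of what those options cover, a crawl configuration for the expanded tool might resemble the hypothetical sketch below; the field names are illustrative and do not reflect the tool's actual interface.

from dataclasses import dataclass, field
from typing import List

@dataclass
class CrawlConfig:
    # Hypothetical configuration object illustrating the option categories above.
    site_type: str = "mediawiki"       # "mediawiki", "confluence", or "vbulletin"
    auth_mode: str = "session"         # "basic", "session", or "cookie"
    max_depth: int = 5                 # depth control: how many links deep to follow
    include_patterns: List[str] = field(default_factory=list)   # user-defined regex filters
    exclude_patterns: List[str] = field(
        default_factory=lambda: [r"[?&]action=history"]
    )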