I thought it might be worth writing a quick follow-up to the Wikipedia Visualization piece. Being able to parse and process all of Wikipedia's articles in a reasonable amount of time opens up fantastic opportunities for data mining and analysis. What's more, it's easy once you know how.
Step 1: Get the data
This bit isn't hard, but it does require a lot of hard disk space. Find the latest Wikipedia dump (here's January 2011, for example), download it, and decompress it. The code I wrote at the hackday was designed to operate on the "Articles, templates, image descriptions, and primary meta-pages" dumps, which contain only the latest revisions and no edit history. They are split into 15 or so chunks, to make the whole thing a bit easier to handle.
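If you'd rather decompress from Python than from the command line, something like the following should work. It's a minimal sketch that assumes the chunk was downloaded as a .bz2 archive; decompress_dump and its parameters are names I've made up for illustration. It streams the data so the whole file never has to fit in memory.

```python
import bz2
import shutil


def decompress_dump(src_path, dest_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dest_path, one chunk at a time.

    Hypothetical helper: the name and the default chunk size are
    illustrative, not part of the original project.
    """
    with bz2.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(src, dest, chunk_size)
    return dest_path
```

Call it with the path of the downloaded archive and the path where you want the XML to end up. Remember that the decompressed files are several times larger than the archives.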
Step 2: Write your processing job
    class WikiPage(object):
        """
        Holds data related to one page element parsed from the dump
        """
        def __init__(self):
            self.title = u''
            self.id = u''
            self.text = u''
This little data structure will be populated for each Wikipedia page extracted from the dump files. An article's title is sufficient to uniquely identify and link to it, and the text will be populated with the full article text in wiki markup. To get an idea of what you'll get here, append ?action=raw to any Wikipedia link and look at the resulting file.
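If you want to build those raw-markup links in code, here is one way to do it. This is just a sketch using the standard library's URL helpers; raw_markup_url is a hypothetical name, and the function only builds the URL, it doesn't fetch anything.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def raw_markup_url(article_url):
    """Return article_url with action=raw set in its query string.

    Hypothetical helper for illustration; it replaces any existing
    'action' parameter and leaves the rest of the URL untouched.
    """
    parts = urlsplit(article_url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "action"]
    query.append(("action", "raw"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, passing in the link to the article titled 1066 gives back the same link with ?action=raw on the end.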
Now you're going to write a callback function that takes one of these page objects as its argument. Every time the dump parser (which I'll introduce below) extracts a page from the raw XML, your function will receive the page, and can perform whatever analysis you choose to define on it.
Here's an example from the project:
    def processPage(page):
        """
        We're interested in pages representing years with event descriptions,
        and those which mention any sort of geographic coordinates.
        """
        if isYearPattern.match(page.title):
            processYear(page)
            for event in page.events:
                dataSource.saveEvent(event)
        else:
            processPageForCoords(page)
            if page.coords:
                dataSource.savePage(page)
I hope that's self-explanatory.
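The definition of isYearPattern isn't shown above, so here is a guess at what such a pattern could look like. Treat it as an assumption: it is not the pattern from the project, and it only recognises titles made of one to four digits, optionally followed by " BC". The looks_like_year helper is a made-up name too.

```python
import re

# Assumed pattern, not the original: matches titles such as "1066" or "44 BC".
isYearPattern = re.compile(r"^\d{1,4}( BC)?$")


def looks_like_year(title):
    """Return True if a page title looks like a year article (hypothetical helper)."""
    return isYearPattern.match(title) is not None
```

With this pattern a title like "1066 in poetry" is rejected because of the anchors at both ends, which is what you want if only the year pages themselves carry event lists.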
Step 3: Hit go
Once you've got your callback function, you're good to go. It should be as simple as this:
    import page_parser

    def yourCallback(page):
        # do your processing here
        pass

    page_parser.parseWithCallback("enwiki-20110115-pages-articles1.xml", yourCallback)
The first argument to parseWithCallback() is the location of the dump file you want to process. The second is a reference to your callback function.
What's happening now?
On running the above snippet, parseWithCallback() uses a few classes and function calls from page_parser.py to do the following:
- It creates a SAX parser using Python's built-in XML support.
- It creates an instance of the WikiDumpHandler class, which is a subclass of xml.sax.handler.ContentHandler. This class is ready to handle the events that occur as the parser chews through the XML dump.
- Next, it creates a text_normalize_filter. This is an implementation detail that is not drastically important. It speeds up processing by storing up the chunks of text produced by the parser and only handing them off to the Handler when an XML element has been completely processed. This helps with the large chunks of text we encounter in Wikipedia dumps.
- Finally, it opens the file named in the first argument and starts parsing. You will find it takes a little while: performing my somewhat complex processing on all 30 GB of dumps took about 25 minutes.
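To make those steps a bit more concrete, here is a stripped-down sketch of the same idea. It is not the code from the project: SketchDumpHandler and parse_with_callback are names I've invented, the element names (page, title, id, text) are assumed from the dump's XML layout, and the text buffering is folded into the handler rather than done in a separate filter.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class WikiPage(object):
    """Holds data related to one page element parsed from the dump."""
    def __init__(self):
        self.title = u''
        self.id = u''
        self.text = u''


class SketchDumpHandler(ContentHandler):
    """Illustrative stand-in for a dump handler; calls callback once per page."""

    def __init__(self, callback):
        ContentHandler.__init__(self)
        self.callback = callback
        self.page = None
        self.path = []    # stack of currently open element names
        self.buffer = []  # text chunks collected for the current element

    def startElement(self, name, attrs):
        self.path.append(name)
        self.buffer = []
        if name == "page":
            self.page = WikiPage()

    def characters(self, content):
        # The parser may deliver one element's text in many small chunks,
        # so store them and join once when the element closes.
        self.buffer.append(content)

    def endElement(self, name):
        text = u''.join(self.buffer)
        self.buffer = []
        parent = self.path[-2] if len(self.path) > 1 else None
        self.path.pop()
        if self.page is None:
            return
        if name == "title" and parent == "page":
            self.page.title = text
        elif name == "id" and parent == "page":
            # Only the page's own id; revision ids have a different parent.
            self.page.id = text
        elif name == "text":
            self.page.text = text
        elif name == "page":
            self.callback(self.page)
            self.page = None


def parse_with_callback(source, callback):
    """Parse a dump (file name or file-like object), calling callback per page."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(SketchDumpHandler(callback))
    parser.parse(source)
```

Because the handler only ever holds one page at a time, memory use stays flat no matter how big the dump chunk is. That is the main reason to use a streaming SAX parser here rather than loading the whole XML tree.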
Hopefully all this is clear and straightforward. If you find this code at all useful, please let me know.