Processing every Wikipedia article

10 February 2011

I thought it might be worth writing a quick follow-up to the Wikipedia Visualization piece. Being able to parse and process all of Wikipedia's articles in a reasonable amount of time opens up fantastic opportunities for data mining and analysis. What's more, it's easy once you know how.

Step 1: Get the data

This bit isn't hard, but it does require a lot of hard disk space. Find the latest Wikipedia dump (here's January 2011, for example), download it, and decompress. The code I wrote at the hackday was designed to operate on the Articles, templates, image descriptions, and primary meta-pages dumps, which contain only the latest revisions and no edit history. They are chunked into 15 or so sections, to make the whole thing a bit easier to handle.
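If you'd rather do the decompression step from Python than from the command line, here is a minimal sketch. It assumes the chunk you downloaded is bzip2-compressed (a .bz2 file); decompress_dump and the file names in the comment are hypothetical and not part of the hack-day code.

```python
import bz2
import shutil


def decompress_dump(src_path, dst_path):
    """Stream a .bz2 dump chunk to an uncompressed file.

    copyfileobj moves the data across in blocks, so the whole dump
    never has to fit in memory at once.
    """
    with bz2.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)


# Hypothetical usage, assuming a chunk saved under this name:
# decompress_dump("enwiki-20110115-pages-articles1.xml.bz2",
#                 "enwiki-20110115-pages-articles1.xml")
```

The uncompressed files are far larger than the downloads, which is where the disk space goes.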

Step 2: Write your processing job

If you have a look at page-parser.py in the hack-day project on GitHub, you'll see a very simple Python class called WikiPage:

    class WikiPage(object):
        """
        Holds data related to one page element parsed from the dump
        """
        def __init__(self):
            self.title = u''
            self.id = u''
            self.text = u''

This little data structure will be populated for each Wikipedia page extracted from the dump files. An article's title is sufficient to uniquely identify and link to it, and the text will be populated with the full article text in Wiki markup. To get an idea of what you'll get here, append ?action=raw to any Wikipedia link and look at the resulting file.

Now you're going to write a callback function that takes one of these page objects as its argument. Every time the dump parser (which I'll introduce below) extracts a page from the raw XML, your function will receive the page, and can perform whatever analysis you choose to define on it.

Here's an example from the project:

    def processPage(page):
        """
        We're interested in pages representing years with event descriptions,
        and those which mention any sort of geographic coordinates.
        """
        if isYearPattern.match(page.title):
            processYear(page)
            for event in page.events:
                dataSource.saveEvent(event)
        else:
            processPageForCoords(page)
            if page.coords:
                dataSource.savePage(page)

I hope that's self-explanatory.
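The project example leans on helpers (isYearPattern, processYear, dataSource) that aren't shown here, so here is a smaller callback that stands on its own. It is only a sketch: count_links, link_counts and the WIKILINK pattern are made up for illustration, and the WikiPage class is repeated so the snippet runs without the project code.

```python
import re
from collections import Counter

# Captures the target of [[Target]], [[Target|label]] and [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


class WikiPage(object):
    """Stand-in mirroring the WikiPage class shown earlier."""
    def __init__(self):
        self.title = u''
        self.id = u''
        self.text = u''


# Hypothetical accumulator: how many times each article is linked to.
link_counts = Counter()


def count_links(page):
    """Callback: tally the link targets found in one page's markup."""
    for target in WIKILINK.findall(page.text):
        link_counts[target.strip()] += 1


# Try the callback on a hand-made page instead of a real dump.
page = WikiPage()
page.title = u'Example'
page.text = (u'See [[Python (programming language)|Python]] and [[XML]]. '
             u'More in [[XML#History]].')
count_links(page)
print(link_counts.most_common())
```

Because the callback only ever sees one page at a time, anything you want to aggregate across pages has to live outside it, as link_counts does here.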

Step 3: Hit go

Once you've got your callback function, you're good to go. It should be as simple as this:

    import page_parser

    def yourCallback(page):
        # do your processing here
        pass

    page_parser.parseWithCallback("enwiki-20110115-pages-articles1.xml", yourCallback)

The first argument to parseWithCallback() is the location of the dump file you want to process. The second is a reference to your callback function.

What's happening now?

On running the above snippet, parseWithCallback() uses a few classes and function calls from page-parser.py to do the following:

  • It creates a SAX parser using Python's built-in xml.sax module.
  • It creates an instance of the WikiDumpHandler class, which is a subclass of xml.sax.handler.ContentHandler. This class is ready to handle the events that occur as the parser chews through the XML dump.
  • Next, it creates a text_normalize_filter. This is an implementation detail that is not drastically important. It speeds up processing by storing up the chunks of text produced by the parser and only handing them off to the Handler when an XML element has been completely processed. This helps with the large chunks of text we encounter in Wikipedia dumps.
  • Finally, it opens the file named in the function's first argument and starts parsing. You will find it takes a little while: performing my somewhat complex processing on all 30GB of dumps took about 25 minutes.
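To make the steps above concrete, here is a stripped-down sketch of the same idea. It is not the code from page-parser.py: SketchDumpHandler and parse_with_callback are hypothetical names, and instead of a separate text_normalize_filter the handler simply collects the text chunks itself and joins them when an element closes. The sample XML assumes the dump layout of page elements containing title, id and a revision with text.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class WikiPage(object):
    """Stand-in mirroring the WikiPage class shown earlier."""
    def __init__(self):
        self.title = u''
        self.id = u''
        self.text = u''


class SketchDumpHandler(ContentHandler):
    """Hypothetical, simplified stand-in for WikiDumpHandler."""

    def __init__(self, callback):
        ContentHandler.__init__(self)
        self.callback = callback
        self.page = None
        self.path = []    # names of the elements currently open
        self.chunks = []  # text pieces seen since the last tag

    def startElement(self, name, attrs):
        self.path.append(name)
        self.chunks = []
        if name == 'page':
            self.page = WikiPage()

    def characters(self, content):
        # The parser may deliver one text node in many pieces.
        self.chunks.append(content)

    def endElement(self, name):
        value = u''.join(self.chunks)
        self.chunks = []
        parent = self.path[-2] if len(self.path) > 1 else None
        self.path.pop()
        if self.page is None:
            return
        if name == 'title':
            self.page.title = value
        elif name == 'id' and parent == 'page':
            # Skip the revision's own <id>, which is nested deeper.
            self.page.id = value
        elif name == 'text':
            self.page.text = value
        elif name == 'page':
            self.callback(self.page)
            self.page = None


def parse_with_callback(source, callback):
    """Parse a file name or byte stream, calling callback once per page."""
    xml.sax.parse(source, SketchDumpHandler(callback))


SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <id>12</id>
    <revision>
      <id>345</id>
      <text>Some '''wiki''' markup.</text>
    </revision>
  </page>
</mediawiki>"""

seen = []
parse_with_callback(io.BytesIO(SAMPLE), seen.append)
print(seen[0].title, seen[0].id, seen[0].text)
```

Because SAX streams through the file and each page is handed to the callback and then dropped, memory use stays flat however large the dump is.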

Hopefully all this is clear and straightforward. If you find this code at all useful, please let me know.

Comments

1 distant says...

Thank you. I find it useful. J.

Posted at 4:46 a.m. on March 31, 2011

2 Gianluca says...

I definitely find it useful, I am downloading this dump I want to parse and try to visualize using your code. Thanks a lot for sharing, very inspiring :-).

Posted at 11:55 a.m. on April 10, 2011

3 Gianluca says...

...this dump: http://download.wikimedia.org/eswiki/20110318/eswiki-2011...

Posted at 11:56 a.m. on April 10, 2011

4 Python-Developer says...

Thank you very much. I have been googling a while to find some parser for python, which parses wiki-xml dumps. I like very much your solution with the WikiPage-Class and the callback function. Very elegant. Thanks.

Posted at 8:39 a.m. on May 15, 2011

Comments are closed.