Processing every Wikipedia article
I thought it might be worth writing a quick follow-up to the Wikipedia Visualization piece. Being able to parse and process all of Wikipedia's articles in a reasonable amount of time opens up fantastic opportunities for data mining and analysis. What's more, it's easy once you know how.
Step 1: Get the data
This bit isn't hard, but it does require a lot of hard disk space. Find the latest Wikipedia dump (here's January 2011, for example), download it, and decompress. The code I wrote at the hackday was designed to operate on the Articles, templates, image descriptions, and primary meta-pages dumps, which contain only the latest revisions and no edit history. They are chunked into 15 or so sections, to make the whole thing a bit easier to handle.
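If you'd rather decompress from Python than from the command line, something like the following should do it. This is a minimal sketch that assumes the chunk was downloaded as a .bz2 file; decompress_dump and the file names in the usage note are illustrative names, not part of the hack-day code.

```python
import bz2
import shutil


def decompress_dump(src_path, dest_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 dump chunk to dest_path.

    The data is copied in chunk_size pieces, so the whole dump never
    has to fit in memory at once.
    """
    with bz2.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest, chunk_size)
    return dest_path
```

Calling decompress_dump("enwiki-20110115-pages-articles1.xml.bz2", "enwiki-20110115-pages-articles1.xml") would write the decompressed XML next to the archive, provided a file with that name exists locally.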
Step 2: Write your processing job
If you have a look at page-parser.py from the hack-day project on GitHub, you'll see a very simple Python class called WikiPage:
class WikiPage(object):
    """
    Holds data related to one page element parsed from the dump
    """
    def __init__(self):
        self.title = u''
        self.id = u''
        self.text = u''
This little data structure will be populated for each Wikipedia page extracted from the dump files. An article's title is sufficient to uniquely identify and link to it, and the text will be populated with the full article text in Wiki markup. To get an idea of what you'll get here, append ?action=raw to any Wikipedia link and look at the resulting file.
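To make the ?action=raw trick concrete, here is a small sketch that builds such a link from an article title. The helper name raw_markup_url and the default base URL are assumptions for illustration; the only thing taken from the text above is the ?action=raw suffix.

```python
from urllib.parse import quote


def raw_markup_url(title, base="https://en.wikipedia.org/wiki/"):
    """Build a link that returns a page's wiki markup rather than HTML."""
    # Wikipedia page links use underscores where the title has spaces.
    slug = quote(title.replace(" ", "_"))
    return base + slug + "?action=raw"
```

For example, raw_markup_url("French Revolution") gives https://en.wikipedia.org/wiki/French_Revolution?action=raw, which you can open in a browser to see the kind of markup that ends up in page.text.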
Now you're going to write a callback function that takes one of these page objects as its argument. Every time the dump parser (which I'll introduce below) extracts a page from the raw XML, your function will receive the page, and can perform whatever analysis you choose to define on it.
Here's an example from the project:
def processPage(page):
    """
    We're interested in pages representing years with event descriptions,
    and those which mention any sort of geographic coordinates.
    """
    if isYearPattern.match(page.title):
        processYear(page)
        for event in page.events:
            dataSource.saveEvent(event)
    else:
        processPageForCoords(page)
        if page.coords:
            dataSource.savePage(page)
I hope that's self-explanatory.
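The example leans on a few names that aren't shown in this post. As a rough idea of what two of them might look like, here is a sketch: a guess at isYearPattern that matches titles such as "1066" or "44 BC", and a hypothetical findCoordTemplates helper that pulls out {{coord|...}} templates from the markup. Both are assumptions for illustration; the project's own patterns and its processPageForCoords() may well differ.

```python
import re

# Assumed shape of a year-page title: one to four digits, optionally " BC".
isYearPattern = re.compile(r"^\d{1,4}( BC)?$")

# Assumed way of spotting coordinates: the {{coord|...}} template.
coordTemplatePattern = re.compile(r"\{\{coord\|([^}]*)\}\}", re.IGNORECASE)


def findCoordTemplates(text):
    """Return the raw argument string of every {{coord|...}} template in text."""
    return coordTemplatePattern.findall(text)
```

A callback could then test isYearPattern.match(page.title) exactly as above, and treat a non-empty result from findCoordTemplates(page.text) as a sign that the page mentions coordinates.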
Step 3: Hit go
Once you've got your callback function, you're good to go. It should be as simple as this:
import page_parser

def yourCallback(page):
    # do your processing here
    pass

page_parser.parseWithCallback("enwiki-20110115-pages-articles1.xml", yourCallback)
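The first argument to parseWithCallback() is the location of the dump file you want to process. The second is a reference to your callback function. Note that import page_parser only works if the module file is named page_parser.py, because Python module names cannot contain hyphens.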
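Since the dump comes in a number of chunks, you'll probably want to run the same callback over all of them. Here is one way that might look. It's a sketch: processAllDumps is a hypothetical helper, and it takes the parse function as an argument so that it makes no assumptions about page_parser beyond the parseWithCallback(path, callback) call shown above.

```python
import glob


def processAllDumps(pattern, callback, parse):
    """Call parse(path, callback) for every file matching pattern.

    Files are visited in sorted (lexicographic) order, and the list of
    visited paths is returned so the caller can log or check it.
    """
    paths = sorted(glob.glob(pattern))
    for path in paths:
        parse(path, callback)
    return paths
```

With the hack-day module available, the call would be along the lines of processAllDumps("enwiki-20110115-pages-articles*.xml", yourCallback, page_parser.parseWithCallback). Each page is handled independently, so the order in which the chunks are visited shouldn't matter.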
What's happening now?
On running the above snippet, parseWithCallback() uses a few classes and function calls from page-parser.py to do the following:
- It creates a SAX Parser using Python's built in xml.sax module.
- It creates an instance of the WikiDumpHandler class, which is a subclass of xml.sax.handler.ContentHandler. This class is ready to handle the events that occur as the parser chews through the XML dump.
- Next, it creates a text_normalize_filter. This is an implementation detail that is not drastically important. It speeds up processing by storing up the chunks of text produced by the parser and only handing them off to the Handler when an XML element has been completely processed. This helps with the large chunks of text we encounter in Wikipedia dumps.
- Finally, it opens the file named in the method argument and starts parsing. You will find it takes a little while - performing my somewhat complex processing on all 30GB of dumps took about 25 minutes.
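To give a feel for the steps above, here is a stripped-down sketch of the same idea using only xml.sax. It is not the code from page-parser.py: Page, SketchDumpHandler and parseSketch are stand-in names, and instead of a separate text_normalize_filter the handler simply collects text chunks itself and joins them when an element closes. It also assumes the dump layout where title and id sit directly under page, and text sits under revision.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class Page(object):
    """Stand-in for WikiPage: holds the fields pulled from one <page>."""
    def __init__(self):
        self.title = ''
        self.id = ''
        self.text = ''


class SketchDumpHandler(ContentHandler):
    """Builds one Page per <page> element and hands it to a callback."""

    def __init__(self, callback):
        ContentHandler.__init__(self)
        self.callback = callback
        self.page = None   # the Page being filled in, if any
        self.chunks = []   # text pieces of the element currently open
        self.path = []     # stack of open element names

    def startElement(self, name, attrs):
        self.path.append(name)
        self.chunks = []
        if name == "page":
            self.page = Page()

    def characters(self, content):
        # The parser may deliver one element's text in many pieces.
        self.chunks.append(content)

    def endElement(self, name):
        parent = self.path[-2] if len(self.path) > 1 else None
        value = "".join(self.chunks)
        if self.page is not None:
            if name in ("title", "id") and parent == "page":
                # Ignore the ids that belong to revisions and contributors.
                setattr(self.page, name, value)
            elif name == "text" and parent == "revision":
                self.page.text = value
            elif name == "page":
                self.callback(self.page)
                self.page = None
        self.path.pop()
        self.chunks = []


def parseSketch(source, callback):
    """Parse a dump (file name or file-like object), calling callback per page."""
    xml.sax.parse(source, SketchDumpHandler(callback))
```

Because SAX streams through the file and only one Page is held at a time, memory use stays flat however large the dump is, which is what makes this approach workable on files of this size.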
Hopefully all this is clear and straightforward. If you find this code at all useful, please let me know.
Comments are closed.
Comments
1 distant says...
Thank you. I find it useful. J.
Posted at 4:46 a.m. on March 31, 2011
2 Gianluca says...
I definitely find it useful, I am downloading this dump I want to parse and try to visualize using your code. Thanks a lot for sharing, very inspiring :-).
Posted at 11:55 a.m. on April 10, 2011
3 Gianluca says...
...this dump: http://download.wikimedia.org/eswiki/20110318/eswiki-2011...
Posted at 11:56 a.m. on April 10, 2011
4 Python-Developer says...
Thank you very much. I have been googling a while to find some parser for python, which parses wiki-xml dumps. I like very much your solution with the WikiPage-Class and the callback function. Very elegant. Thanks.
Posted at 8:39 a.m. on May 15, 2011