Tutorial: Parsing HTML with lxml, Requests, and XPath, Part I

Published: 2013-02-16
Tagged: python

There comes a moment in a man's life when copying and pasting simply isn't enough anymore. No matter what keyboard shortcuts and other tricks you use, it's still too slow, and the number of records to copy is too big. This is the day that you say to yourself, "I'll build a robot that will do this dirty work for me!" and you reach for Python and its awesome lxml and Requests libs.

We start off with a simple pip install lxml and pip install requests. If you're using Windows, you might have to visit the unofficial Windows Binaries for Python page instead; make sure to choose the correct 32/64-bit version. It would also help to review how XPath works, although this will be more handy in part II of this tutorial.

We'll start off with a simple example: liberating data out of a single page and storing it in a CSV file. We're going to use Requests to pull the page off the internet, and then we'll make an lxml.html object which we can query using the xpath() function to extract the information that we want. We'll accomplish all of this in two simple steps:

  1. Fetch the page and extract the information.
  2. Iterate over the information and write it to a .csv file.

For this part of the tutorial, we're going to go with something easy: a very simple HTML document on a single page, without complicated relations between data items.

The first part of the program:

import csv

import lxml.html
import requests

sheet = []  # one list of cell strings per table row
url = 'https://s3.amazonaws.com/codecave/tut1.html'
page = requests.get(url)
document_tree = lxml.html.fromstring(page.text)
i = 1  # XPath positions start at 1, not 0
while True:
    row = document_tree.xpath('//tr[%d]/td/text()' % i)
    if not row:
        break  # empty result: we ran past the last row
    i += 1
    sheet.append(row)

In the beginning we take care of the imports and set up an empty list called sheet to collect our data. After that we do:

url = 'https://s3.amazonaws.com/codecave/tut1.html'
page = requests.get(url)

We get Requests to fetch the HTML document from the URL. Then we:

document_tree = lxml.html.fromstring(page.text)

This piece of arcane magic uses the lxml.html library to create an object from the string representation (fromstring(page.text)) that we can query using XPath. Let's assume that we do not know how many rows are in the table. A simple way to handle this is to use an infinite loop that exits upon some condition. Each XPath query we make against document_tree returns a list of strings, so one might venture to assume that once the end of the table is reached, the query will return an empty list. One would be correct.
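To see that stop condition in isolation, here is a minimal sketch that runs the same kind of query against a tiny inline table instead of the real page. SAMPLE_HTML and its two rows are stand-ins made up for illustration, not the contents of tut1.html.

```python
import lxml.html

# Hypothetical two-row table, used here in place of the downloaded page.
SAMPLE_HTML = """<html><body><table>
  <tr><td>Red Car</td><td>5</td></tr>
  <tr><td>Blue Car</td><td>2</td></tr>
</table></body></html>"""

tree = lxml.html.fromstring(SAMPLE_HTML)

# Same query shape as in the tutorial, with the row number filled in.
first_row = tree.xpath('//tr[%d]/td/text()' % 1)
second_row = tree.xpath('//tr[%d]/td/text()' % 2)
past_the_end = tree.xpath('//tr[%d]/td/text()' % 3)  # there is no third row

print(first_row)
print(second_row)
print(past_the_end)
```

The third query matches nothing and comes back as an empty list, which is falsy in Python, so a check like `if not row` is enough to end the loop.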

The only tricky part of this might be row = document_tree.xpath('//tr[%d]/td/text()' % i), but rest assured, there's nothing to fear. row is the list of strings that the query returns and that we later append to the sheet list. The xpath() function takes a string argument which represents the XPath query. The tutorial that I linked to earlier does a good job of explaining why this query works; it's quite similar to the familiar URL or file path syntax. I'll only add that for real-life situations there are tools which make finding the XPath much easier, one of which is Firebug. Chrome has this feature built in.
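If the counter feels clunky, one possible alternative (not what this tutorial's program does) is to ask for all the tr elements once and then run a relative query on each of them. The sketch below assumes a made-up inline table, SAMPLE_HTML, standing in for the real page.

```python
import lxml.html

# Hypothetical two-row table, used here in place of the downloaded page.
SAMPLE_HTML = """<html><body><table>
  <tr><td>Red Car</td><td>5</td></tr>
  <tr><td>Blue Car</td><td>2</td></tr>
</table></body></html>"""

tree = lxml.html.fromstring(SAMPLE_HTML)

rows = []
for tr in tree.xpath('//tr'):
    # The leading "./" makes the query relative to this <tr> element,
    # so we only collect the text of its own <td> children.
    rows.append(tr.xpath('./td/text()'))

print(rows)
```

One thing to keep in mind with the counter version: //tr[1] means "every tr that is the first tr under its parent", so on a page with several tables it would pick up the first row of each of them. The single-table page used here doesn't have that problem.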

Once the loop finishes, sheet looks like this:

[['Red Car', '5', '5165476', '4999'],
 ['Blue Car', '2', '6549687', '9999'],
 ['Green Car', '1', '546576', '12999'],
 ['Yello Car', '15', '521635', '1999'],
 ['Purple Car', '3', '65687', '7999']]

Each sublist is a row, which makes writing this information into a CSV file very easy indeed:

# On Python 3, open in text mode with newline='' (Python 2 used 'wb').
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in sheet:
        writer.writerow(row)

We open a file for writing and feed the file object into a csv.writer object called writer. Then we iterate over each list in sheet using a for-loop and use writer to write each row. Pretty easy, write? The writerow function is a big help here because it writes every item in a list into a separate "cell", which saves us some work. That's it!
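To peek at what the writer actually produces without touching the disk, here is a small sketch that writes into an in-memory buffer instead of data.csv. It uses the first two rows from the output above; sample_sheet and buffer are names picked just for this example. It also uses writerows, which writes a whole list of rows in one call.

```python
import csv
import io

# First two rows from the scraped table shown earlier.
sample_sheet = [
    ['Red Car', '5', '5165476', '4999'],
    ['Blue Car', '2', '6549687', '9999'],
]

# An in-memory text buffer stands in for the real file.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerows(sample_sheet)  # same as calling writerow() once per row

csv_text = buffer.getvalue()
print(csv_text)
```

Each inner list becomes one comma-separated line, which is exactly the layout a spreadsheet program expects when it opens the file.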

This was but a warm-up for the really awesome stuff we will be doing next time: liberating data from multiple HTML documents with one-to-many relationships and saving it into an SQLite3 database.

Hi, I'm Matt.

This blog is an unordered set of thoughts extracted from the mind of a software developer.
