Tutorial: Parsing HTML with lxml, Requests, and XPath (and SQLite3), Part II

Published: 2013-04-01
Tagged: python

Time is a fickle thing, but like taxes and death, the second part of this tutorial was inevitable. Last month we looked at the basics of web scraping with Python, lxml, and XPath. This time we're going to take it a step further: we're going to scrape data from multiple pages and put it into an SQL database.

Foreword

If you just want to get to the script, jump over to the next section because here I will explain a bit of basic theory.

For our database, we'll go with SQLite3, as it is natively supported in Python and offers all the functionality we need in this case. If this is your first time using SQLite, I highly recommend the excellent zetcode.com tutorial about Python and SQLite, and if you'd like to explore SQLite a little more, I also recommend the awesome book by Jay A. Kreibich, Using SQLite. Both of these resources were a huge help when I didn't know a single thing about SQL, and they will surely explain some of the workings of the program we're about to write.

What exactly should our program do? Let's look at our target first: poor hostage data. For simplicity's sake I used the same table layout as in the previous example, but these methods work perfectly fine on any web page out there with slightly more complicated XPath queries. Notice that now there are five tables spread across five pages. This is fairly common - since data is usually meant for humans, breaking it down is only natural. Each page has a link that will take us to the next or previous page.

We have to get our script to .get each of these pages, and there are two ways to do this. The first one, which we'll use, is the easiest and relies on the fact that all five pages that interest us have a very simple link: https://s3.amazonaws.com/codecave/dataX.html where X is an integer between 1 and 5. We can simply write a loop that iterates from 1 to 5 and substitutes X with that integer, and this will work fine in our case.

I'd like to use this opportunity to also sketch out how to take care of pages which aren't such an easy case - where the links are some ugly random string, or where we don't know how many pages of data there even exist. In this second case, we would use an infinite loop that only terminates when there is no "Next page" link. This is closely related to how we would navigate to the next page - we would use XPath to query an lxml.html object to find the link for us. We would use the text "Next page" and any further clues to narrow down the location of the exact link. In our scenario, this is fairly simple:

tree.xpath('//a[text()="Next page"]/attribute::href')

The text in square brackets is an XPath conditional. In non-moon speak, it simply means "if the text() of any a element equals 'Next page', return its 'href' attribute". We would incorporate this piece of code in our loop to check each page for a link to the next one, and if there is no such link, we can usually assume that we've scraped all the pages. The value of this attribute is the link our script would follow on to the next page, so our solution would look like this in pseudocode:

while true:
    get page(url)
    data.append(scrape page for the data we want)
    url = scrape page to get the link to the next page
    if not url:
        break
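To make the pseudocode above concrete, here is a runnable sketch. The page contents and the fetcher function are made up for illustration; against a real site, fetcher would be something like lambda url: requests.get(url).text:

```python
from lxml import html

def next_page_url(page_text):
    """Return the href of the 'Next page' link, or None when there isn't one."""
    tree = html.fromstring(page_text)
    links = tree.xpath('//a[text()="Next page"]/attribute::href')
    return links[0] if links else None

def scrape_all(start_url, fetcher):
    """Follow 'Next page' links, collecting every table cell along the way."""
    data, url = [], start_url
    while url:
        text = fetcher(url)
        tree = html.fromstring(text)
        data.extend(tree.xpath('//tr/td/text()'))
        url = next_page_url(text)  # None terminates the loop
    return data

# A tiny fake "site" standing in for real HTTP requests:
pages = {
    'page1': '<table><tr><td>ammo</td></tr></table><a href="page2">Next page</a>',
    'page2': '<table><tr><td>rope</td></tr></table>',
}
print(scrape_all('page1', lambda url: pages[url]))  # prints ['ammo', 'rope']
```

Passing the fetcher in as a function also makes the loop easy to test without touching the network.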

This should give you an idea of how to tackle these sorts of problems. To see a few more examples, check out my repository on GitHub, which has four scripts that I cooked up when I was learning how to scrape.

Scrape and Grind

Well, the time has finally come to show the code that will give us what we want. We'll start off by laying down the groundwork for our program - importing the libs we need and opening a connection to a database:

import sqlite3
from lxml import html
import requests

con = sqlite3.connect('liberated_data.db')
cursor = con.cursor()

These imports are nothing new, but let's look at the last two lines. The first creates an SQLite3 Connection object, which is our link to the database. The connect() function takes the file name of the database we want to open; if the file doesn't exist, it is created. The following line gives us a cursor, which, as the name implies, is an object that keeps track of where it is in the database. This might come in handy when you need to check the exact row and table your script is writing to, but we don't need that now. SQLite3, unlike most other databases, keeps its data in a single flat file. This carries some performance penalties, which is why production-ready applications tend to use MySQL or Postgres. Don't be fooled though, SQLite still packs a punch.

The next part of our script will set up our database tables:

cursor.execute('DROP TABLE IF EXISTS data;')
cursor.execute('CREATE TABLE data (id integer primary key, item text, quantity integer, stock integer, price integer)')
con.commit()

It's pretty straightforward - we use the cursor object we created earlier to execute() SQL statements. The first statement drops any existing table called data, and the second creates it fresh with the columns id, item, quantity, stock, and price. SQLite is pretty forgiving when it comes to data types; more serious databases, not so much. The last line is the key here, because until you commit() whatever your cursor has executed, nothing is actually written to the database. A quick note: commits are time-hungry, so it is better to commit data in larger chunks instead of after every execute() call.
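As an aside on committing in larger chunks: executemany() pairs nicely with a single commit() at the end. A small self-contained sketch, using an in-memory database and made-up rows rather than the scraped data:

```python
import sqlite3

con = sqlite3.connect(':memory:')  # throwaway in-memory database
cursor = con.cursor()
cursor.execute('CREATE TABLE data (id integer primary key, item text, '
               'quantity integer, stock integer, price integer)')

# executemany() issues one INSERT per row, but we only pay for one commit
rows = [('rope', 10, 5, 3), ('lantern', 2, 1, 12), ('flare', 7, 7, 4)]
cursor.executemany('INSERT INTO data (item, quantity, stock, price) '
                   'VALUES (?, ?, ?, ?)', rows)
con.commit()

print(cursor.execute('SELECT count(*) FROM data').fetchone()[0])  # prints 3
```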

The next piece of code is going to be a big chunk, but I believe it's only understandable when taken in as a whole. It's based in large part on the code from the previous post about scraping, so it should be familiar:

i = 1
while 1:
    page = requests.get('https://s3.amazonaws.com/codecave/data%d.html' % i)
    if not page:
        break
    tree = html.fromstring(page.text)
    j = 1
    while 1:
        row = tree.xpath('//tr[%d]/td/text()' % j)
        if not row:
            break
        cursor.execute('INSERT INTO DATA (item, quantity, stock, price) VALUES (?, ?, ?, ?)',
                       (row[0], row[1], row[2], row[3]))
        print(row)
        j += 1
    i += 1

At a glance, it's easy to see that I set up two sentinel values, i and j. They are used by two infinite while loops to break at the right moment, and they also drive the actual scraping. i keeps track of which page we are on: after scraping a page, i is incremented, and on the next iteration of the main loop we download the next page we want to scrape. Simple, right? j keeps track of the rows. Each query selects a single row of data and returns a list of values. If it returns an empty list, it means we got to the end of the table and we break. If not, then we use the cursor to execute an SQL statement that inserts the values from the row list into the correct columns of the database. Basically, the inner loop scans each row of a table while the outer loop fetches fresh pages to scan.

Finally, if we get a 404, which the Requests library treats as falsy (a response object evaluates to False for any status code of 400 or above), the main loop breaks and we are finished scraping. This isn't the prettiest way to tackle this problem, but I think it's really simple and readable.

Last but not least we commit everything we executed and we close the connection:

con.commit()
con.close()

A note of caution: in a more real-life setting you would think about sqlite3 and requests throwing errors. A try/except sqlite3.Error or a with statement would go a long way toward making your script fail nicely and not mess things up.
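One possible shape for that, sketched with an in-memory database (the table mirrors the one above; the inserted row is made up):

```python
import sqlite3

con = sqlite3.connect(':memory:')  # ':memory:' here; 'liberated_data.db' in the script
con.execute('CREATE TABLE data (id integer primary key, item text, '
            'quantity integer, stock integer, price integer)')
try:
    # the connection doubles as a context manager: it commits on success
    # and rolls back if the block raises (the connection itself stays open)
    with con:
        con.execute('INSERT INTO data (item, quantity, stock, price) '
                    'VALUES (?, ?, ?, ?)', ('rope', 10, 5, 3))
except sqlite3.Error as e:
    print('database error:', e)

print(con.execute('SELECT count(*) FROM data').fetchone()[0])  # prints 1
```

A bad statement inside the with block would roll the transaction back instead of leaving the database half-written.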

That wraps up the slightly more complicated scrapers. The next step would be to use the sessions feature of the Requests lib to scrape websites that require authentication. That might become the third part of this tutorial series, because while it is fairly simple, its applications are countless.

By the way, you can find scripts for parts I and II of this tutorial in my repo here.
