Programming Collective Intelligence chapter 6 notes

Published: 2014-04-25
Tagged: python readings

Classifying Text

Chapter 6 of Programming Collective Intelligence (from here on referred to as The Book) introduces the reader to classifying texts. The algorithms allow a fairly simple script to assign a category to a text input. As with previous cases of machine learning algorithms from The Book, the algorithm learns and becomes more and more accurate in it's predictions.

The uses for this kind of problem solving method are numerous. The Book works on two of them - a spam filter and an RSS filter. The code for this chapter can be found here. The docclass.py file contains the algorithms and pertains to the first part of the chapter. The feedfilter.py contains code required for correctly parsing and classifying RSS feeds found in the python_search.xml file.

In the first case, the script will be able to differentiate between two categories - good and bad. By first training the algorithm, it will assign texts containing the words money, casino, or pharmaceuticals to the bad category, while all the others into the good category. Using this example, the reader is introduced to two slightly more complicated methods of classifying text documents: the naive Bayes classifier and the Fisher classifier. Both of these methods allow for a higher accuracy.

In the second case, the reader will use the code explored in the first case and use it to parse and analyze an RSS feed. The feed comes from searching for Python RSS feeds and so contains three categories: Python (programming language), Monty Python, and snakes. The script will iterate through each item in the RSS feed and provide a guess as to which category the item belongs to. Then it will ask for user input - the correct category. This will train the algorithm, so the next guesses are more and more accurate.

It's uncanny to watch how, at first, the algorithm mistakes Python for python snakes, but only a few tries later, correctly guesses the category. The code itself is pretty simple and boils down to correctly parsing the input and keeping track of how many times a word comes up in a given category. This can be an awesome way to automate a few things in life - do you ever wish you had a spam filter for Hacker News comments? Or something to filter out articles about Apple from Ars Technica?

Classifying Words

A Simple Classifier

The chapter starts by creating a very basic classifier, aptly called Classifier. It's mean purpose is to correctly associate words (features), their counts, and the categories. At first, the code works by using dictionaries and is later upgrade to use an Sqlite database. I commented out some of the dictionary-related functions, but it's easy to uncomment them back in and comment out the Sqlite related code. In this article, I'll be using the dictionary version of the script, since it's a bit easier to inspect the dictionaries and follow along.

To start off, the reader has to instantiate the classifier with a method to extract features (docclass.get_words in this case). Then run a small training sample through the classifier:

In [1]: import docclass
In [2]: cl = docclass.Classifier(docclass.get_words)
In [3]: docclass.sample_train(cl)

The data is kept in two places: the self.feature_category and self.document_in_category dictionaries. The self.feature_category dictionary collects all of the features (words) and keeps track of how often that feature occurs in each category. It looks something like:

{'quick': {'good': 2, 'bad': 1}, 'money': {'bad': 1}}

It's easy to see that in the training sample, the word quick is associated with two good items and one bad. The word money is only associated with one bad item. If our algorithm was making a guess about the word "quick", it would output that it belongs to the good category. How would it treat the phrase "quick money" though? We'll get to that soon.

The other dictionary (self.document_in_category) keeps track of the number of documents in each category. It looks like this:

In [5]: cl.document_in_category
Out[5]: {'bad': 2, 'good': 3}

This will help with all sorts of calculations later on.

We can interactively use the function f_prob to check out how often a feature belongs to a category given the count of the category:

In [18]: cl.f_prob('quick', 'good')
Out[18]: 0.6666666666666666
In [19]: cl.f_prob('quick', 'bad')
Out[19]: 0.5
In [20]: cl.f_prob('money', 'bad')
Out[20]: 0.5
In [21]: cl.f_count('money', 'good')
Out[21]: 0.0

This shows the reader that the word quick is more likely to be good than bad (~0.667 > 0.5) and that the word money is more likely to be bad than good (0.5 > 0.0). We can improve the accuracy of our predictions by using weighted probability.

In [27]: cl.weighted_prob('quick', 'good', cl.f_prob)
Out[27]: 0.6333333333333333
In [28]: cl.weighted_prob('quick', 'bad', cl.f_prob)
Out[28]: 0.5
In [29]: cl.weighted_prob('money', 'good', cl.f_prob)
Out[29]: 0.5
In [30]: cl.weighted_prob('money', 'bad', cl.f_prob)
Out[30]: 0.5

The results are slightly different this time. One might say that they're a bit more reserved - they're a little more skeptical if a feature is too good, and they give a second chance to features that are totally evil, like money above. It makes sense - with so little training input, how can we say that money never belongs to the good category, which is implied by using f_prob?

Naive Bayes Classifier

Now we're going to go one step further and apply a bit of Bayes' Theorem to get a more accurate classifier. I won't go into how Bayes Theorem works, but it's a fascinating object of study and I recommend this gentle introduction: An Intuitive Explanation of Bayes' Theorem. I will also recommend this very cool piece that might help fill in the blanks.

This classifier builds on the basic one (omg, inheritance!). It uses two extra methods to invoke the god-like mathematical machinery of Bayes' Theorem. Seriously, read the two articles I linked to in the previous paragraph. Furthermore it provides a convenient method to classify a string after training.

In [1]: import docclass
In [2]: cl = docclass.NaiveBayes(docclass.get_words)
In [3]: docclass.sample_train(cl)
In [4]: cl.classify('quick rabbit')
Out[4]: 'good'
In [5]: cl.classify('money')
Out[5]: 'bad'

The astute reader will notice that using the code from the repo will not give the same result. As the unconfirmed errata notes, there's something not right at this point in the book, so I added an extra bad training example to sample_train in docclass.py.

It is worth noting at this point that it's possible to tweak the algorithms in two ways. First, it's possible to alter the weighted_prob function weight and ap parameters and aim at getting a more accurate score. Second, we can use the set_threshold and get_threshold functions to set the... thresholds!

What are thresholds? Well, in the case of emails, it's usually better to let through a bit more spam than to block an important letter. For example, what if your relative emailed you something about money and pharmaceuticals? Then you can use set_threshold('bad', 0.8) so that more spam is allowed through the filter. It's really fun to play around with this.

Fisher Method

The Fisher method slightly increases the accuracy of our classifiers and gets it's own class (that extends the basic classifier). This method calculates the probability of belonging to a category for each feature, combines these probabilities and finally checks if this probability is better than a random one.

A noteworthy fact is that the fisher_prob function uses a logarithm function to even out the probabilities of different categories. Using the simple example - what if you have many more spam emails than non-spam emails? This means that each good feature will have a larger effect on the classification, because document_in_category will report only a few good documents and many more bad documents. Looking at the f_prob and weighted_prob functions, we see that this would alter the result a lot.

This method also uses the reverse χ^2 function that checks if the resulting probability is better than a random one.

In [20]: cl = docclass.FisherClassifier(docclass.get_words)
In [21]: docclass.sample_train(cl)
In [22]: cl.classify('quick rabbit')
Out[22]: 'good'
In [23]: cl.classify('quick money')
Out[23]: 'bad'

As with the naive Bayes classifier, we can set thresholds that will allow more bad messages to pass, ensuring that no one will miss out an important email:

In [7]: cl.set_minimum('good', 0.2)
In [8]: cl.set_minimum('bad', 0.8)
In [9]: cl.classify('quick money')
Out[9]: 'good'

Classifying RSS

The chapter ends with a really cool exercise that's closer to real life than the previous one. The Book author Toby Segaran provides us with a 2006 RSS feed result of searching for "Python". This feed is stored in an XML file and the code in feedfilter.py uses feedparser to correctly parse it. The feed itself is filled with Python-related things: Python (programming), Monty Python, and snakes.

feedfilter.py introduces a new feature extraction method - entry_features that's better suited for the task at hand. The function grabs a few RSS specific fields such as the title, summary, and publisher and turns them into features. Furthermore, it adds a special place for uppercase words as well as grabbing the word after an uppercase word and combining these two words into one feature. This is really cool as it shows yet another way to tweak our code to get a more accurate result - tweaking at the level of data cleaning.

As I mentioned in the beginning of this post, the feedfilter.py code is meant to be used in an interactive session. The code cycles through the RSS feed, presents a best guess using the Fisher method and asks for user input to help train the classifier. I really recommend giving this a go for a few minutes to see how more and more accurate the classifier gets. Here's an example session:

In [1]: import docclass
In [2]: import feedfilter
In [3]: cl = docclass.FisherClassifier(feedfilter.entry_features)

In [4]: feedfilter.read('python_search.xml', cl)
Title:       My new baby boy!                                                                                                                                                               [153/418]
Publisher:   Shetan Noir, the zombie belly dancer! - MySpace Blog

THis is my new baby, Anthem. He is a 3 and half month old ball <b>python</b>, orange
shaded normal pattern. I have held him about 5 time since I 
brought him home tonight at 8:00pm...
Guess:       None
Enter category: snake

_____
Title:       If you need a laugh...
Publisher:   Kate&#39;s space

Even does 'funny walks' from Monty <b>Python</b>. He talks about all the 
ol' ladies that are after him. He teases me about my horror obsession. 
He attempts suicide. And best of all, he talks about 
poo. Who doesn't think poo is funny???!
Guess:       snake
Enter category: monty
_____
Title:       Sisters of Perpetual Indugence trike drag (queen) race, SF
Publisher:   The Smorgasbord - technology, open source, Ubuntu, Drupal, 
xBox, opinion, news, and more!

The scene along the sidewalk was Fellini meets Almodovar meet Monty <b>Python</b>: 
contestants only sprinted for 10 or 20 feet at a time on their tricycles, 
then ducked in to air-conditioned bars f
or pomegranate martinis. <b>...</b>
Guess:       monty
Enter category: monty

Afterword

One thing strikes me after going through this chapter: some of this stuff is easy to grasp and even easier to implement. No wonder the web saw the explosion of spam filters a decade ago. What are some other areas that might benefit from applying this easy to use tool? Automatically sorting emails into folders? What about classifying the importance of RSS entries? How about... detecting astroturfing on online forums or on Twitter?

Hi, I'm Matt.

This blog is an unordered set of thoughts extracted from the mind of a software developer.

About Me PGP key

Archives  Feed  The Photolog!  t: pr0tagon1st