Programming Collective Intelligence - chapter 2 notes

March 29 2014

Simple Recommendation System

Chapter two of Programming Collective Intelligence (from here on called The Book) has us build a simple recommendation system. I think this is a great introduction to the topic and definitely a good hook to lure in unsuspecting programmers into the fascinating world of pulling meaning from data.

I'm writing this series of posts in order to solidify my own understanding of the material presented in The Book. I'll be going chapter by chapter. All the code comes from The Book (with minor tweaks by me) and can be found in my github repo here.

In this post, we'll be working specifically with the code here, which uses the scipy and requests libraries. Note: running or importing the script for the first time will download the MovieLens 100k dataset and unzip it as well for later use.

Theory

The idea behind this simple recommendation system is that, having a set of users, who in turn have rated items, it is possible to compare two users using items that both of them have rated. If user A has rated movies X1, X2, and X3 and user B has rated movies X1, Y1, and X2 so that:

{user A: {X1: rating, X2: rating}, user B: {X1: rating, Y1: rating, X2: rating}}

, it will be possible to compare the ratings of movies X1 and X2 and figure out how similar the users are. Building on this, it is easy to see that having enough users with enough ratings (rule of thumb: >= 20), it will be possible to find many similar users to user A. Then from that group we can extract movies that user A has not watched, but that the group has rated well, which gives us a recommendation of what movie user A would probably enjoy.

Simply: in order to find out what a user likes: 1. We find a group of users who appears to share his tastes (similar ratings). 2. Find what that group has liked that the user still hasn't. 3. We recommend it to them.

Using this very simple yet powerful system, we are further introduced to two possible ways to approach the recommendation problem:

  1. User-based - the system that compares user tastes and recommends based upon this comparison.
  2. Item-based - a system based on comparing items instead of users. It is really simple - take the user A and B example from above and flip the movies with the users so that you get something like:

    {movie X1: {user A: rating, user B: rating}, movie X2: {user A: rating, user B: rating}, 
    movie Y1: {user B: rating}}
    

Both systems give us essentially the same information - what item would the current user most likely... like. A user-based system is slightly simpler to build, but it needs to be computed for each user every single time ie. for every page view. An item-based system is minimally more complex, but it builds on the code of the user-based system and more importantly - it is possible to calculate all similar items in one processing intensive step and then simply fetch the results for every page view. The second approach would be easily manageable by running a cron job every once in a while - the more users/ratings, the less often.

Measuring similarity

How do we actually measure how similar are users or items? There are a number of mathematical tools we can apply here. We'll only use the two that The Book uses: Euclidean distance and Pearson r. Basically, both of these tools return a single float value that measures how "close" two sets of values are. I won't go into the details of how these two tools work, but you can find out more here.

The two functions below are the Python embodiment of the paragraph above. As we see below, both functions accept a dictionary of people and their movie ratings, and two users (person1 and person2) and both of them return how similar these people are. Inside the functions we see that we're comparing them on the set of movies that they both share. I cheated a little and used the scipy.stats.pearsonr function to calculate the Pearson r value, however you will find The Book's code in the file in the github repo as well.

def shared_items_fn(prefs, person1, person2):
    return {item: 1 for item in prefs[person1] if item in prefs[person2]}


# euclidean distance between two people
def sim_distance(prefs, person1, person2):
    '''
    Calculates euclidean distance between two people
    by comparing their shared item scores.
    '''
    shared_items = shared_items_fn(prefs, person1, person2)

    if len(shared_items) == 0:
        return 0

    sum_of_squares = sum([(prefs[person1][item] - prefs[person2][item]) ** 2
                            for item in prefs[person1]
                            if item in prefs[person2]])


# pearson coefficient using scipy.stats.pearsonr
def scipy_sim_pearson(prefs, person1, person2):
    '''
    Calculates the pearson r for two people
    using the shared item scores and the
    scipy.stats.pearsonr function for improved
    performance.
    '''
    shared_items = {item: 1 for item in prefs[person1]
            if item in prefs[person2]}

    if len(shared_items) == 0:
        return 0

    return sp.pearsonr([prefs[person1][item] for item in shared_items],
            [prefs[person2][item] for item in shared_items])[0]

Example output:

In [4]: recommendations.scipy_sim_pearson(recommendations.critics, 'Toby', 'Lisa Rose')                       
Out[4]: 0.99124070716193036

It would be cool if we were able to get a ranking of similar users. So we match a user against every other user, collect the similarity scores, sort them, and return them. It's cool to play around with this, but it'll be used later on for item-based recommendations.

def top_matches(prefs, person, n=5, similarity_fn=scipy_sim_pearson):
    '''
    Calculates n top similar matches for person.
    '''
    scores = [(similarity_fn(prefs, person, other), other)
                for other in prefs if other != person]
    scores.sort()
    scores.reverse()
    return scores[:n]

Example output:

In [10]: recommendations.top_matches(recommendations.critics, 'Toby')
Out[10]: 
[(0.99124070716193036, 'Lisa Rose'),
 (0.92447345164190498, 'Mick LaSalle'),
 (0.89340514744156441, 'Caludia Puid'),
 (0.66284898035987017, 'Jack Matthews'),
 (0.38124642583151169, 'Gene Seymour')]

If you're sure of your Python skills, skip this paragraph and continue to the next one. There are two note-worthy details in this function. First is the fact that we are setting scipy_sim_pearson as the default function to use, but we can pass in any other scoring function we'd like. This is the beauty of objects - a function is just an object and we can pass in any one we like, even a lambda function. Second is the fact that we use a list comprehension to create a list of tuples in the form of (similarity score, other person).

User-based recommendations

Next up is the meat of the user-based recommendation system - a function to: 1. Take a dictionary of users and their preferences and a target user. 2. Crunch the numbers. 3. Find similar users. 4. Check what they have rated that our target user hasn't. 5. Check the strength of the similarity 6. Return a list of matches from the best to the worst.

Before I show you the code of the function, I want to explain what I mean by the strength of the similarity. We accumulate the total similarity for a movie A by multiplying another person's rating of the movie by the other person's similarity score. This will cause the ratings made by people with similar tastes to be more important than the ratings of less similar people.

Sounds perfect, but there's one last thing - what if one movie is rated by many users and another only by a few? This would give popular, but not similar movies, a higher score than less well known but better suited movies for our target user. The answer is simple when you see it - at the same time when we're accumulating the multiplication of a rating and similarity score for a movie, we're also accumulating the similarity score itself into another dictionary. Then, when we're ready to return the scores, we divide the first accumulated total (ratings * score) by the second accumulated total (score). This normalizes the score so it is no longer dependent on the number of ratings for a specific movie.

Alright, here's the code:

def get_recommendations(prefs, person, similarity_fn=scipy_sim_pearson):
    '''
    Generates an orderded list of similar items for person.
    '''
    totals = {}
    similarity_sums = {}

    for other_person in prefs:
        if other_person != person:
            similarity = similarity_fn(prefs, person, other_person)

            if similarity > 0:
                for item in prefs[other_person]:
                    if item not in prefs[person] or prefs[person][item] == 0:
                        totals.setdefault(item, 0)
                        # accumulate movie score times similarity...
                        totals[item] += prefs[other_person][item] * similarity
                        similarity_sums.setdefault(item, 0)
                        # ... and accumulate the total similarity as well ...
                        similarity_sums[item] += similarity

    # ... so we can normalize the final score here!
    rankings = [(total / similarity_sums[item], item)
            for item, total in totals.items()]

    rankings.sort()
    rankings.reverse()
    return rankings

Example output:

In [11]: recommendations.get_recommendations(recommendations.critics, 'Toby')
Out[11]: 
[(3.3477895267131013, 'The Night Listener'),
 (2.8325499182641622, 'Lady in the Water'),
 (2.5309807037655645, 'Just My Luck')]

That's our user-based recommendation system in it's entirety. Try it out interactively by passing in different person names or by using a different similarity function like sim_distance and see what you get.

Item-based recommendations

For this part of the code, we'll be using the same similarity functions (sim_distance and scipy_sim_pearson) as for the user-based system. Remember from the beginning how I said that we have to swap the users with items? We'll do that now:

def transform_prefs(prefs):
    '''
    Transforms user based preferences into item based
    preferences ie. {'User': {'item': 3.5, 'item2': 5.0}}
    is turned into {'item': {'User': 3.5'}, 'item2': 
    {'User': 5.0}. Useful when trying to get similar items
    instead of similar users.
    '''
    results = {}
    for person in prefs:
        for item in prefs[person]:
            results.setdefault(item, {})
            results[item][person] = prefs[person][item]

    return results

By the way, I haven't tried this in a web application, but having a has many through relationship between users and items should make this transformation very easy. However, keeping to The Book, I'll stick to working on dictionaries.

We can compute a list of similar items in one step, keep it in a key-store value perhaps and thus save a nice chunk of time because we don't have to calculate anything every time a user refreshes a page. Another great reason to do that is that a list for user recommendations might change fairly often, however, items will change very rarely. This is the reason why we only have to calculate this list once every n hours and just use the generated list.

The function itself works very simply. We use the previous function (transform_prefs) to get an item-based list and then we use the top_matches function to generate a list of similar items for each single item. The Book also adds some code to update the user of the progress, which really shines when you're working with the MovieLens dataset, which contains 100 thousand entries and takes a good few seconds to process. Note: We have to use the euclidean distance function (sim_distance) because some ratings are equal to zero and we get a division by zero error.

def calculate_similar_items(prefs, n=10):
    '''
    Generates a list of items along with n
    top matched similar items and their similarity score.
    '''

    result = {}

    item_prefs = transform_prefs(prefs)
    c = 0
    for item in item_prefs:
        c += 1
        if c % 100 == 0:
            print "%d / %d" % (c, len(item_prefs))

        scores = top_matches(item_prefs, item, n, sim_distance)
        # use sim_distance, because pearsonr will
        # have problems with divide by 0 and introduce nan's
        result[item] = scores
    return result

Example output:

In [13]: recommendations.calculate_similar_items(recommendations.critics, 3)
Out[13]: 
{'Just My Luck': [(0.2222222222222222, 'Lady in the Water'),
  (0.18181818181818182, 'You, Me and Dupree'),
  (0.15384615384615385, 'The Night Listener')],
 'Lady in the Water': [(0.4, 'You, Me and Dupree'),
  (0.2857142857142857, 'The Night Listener'),
  (0.2222222222222222, 'Snakes on a Plane')],
 'Snakes on a Plane': [(0.2222222222222222, 'Lady in the Water'),
  (0.18181818181818182, 'The Night Listener'),
  (0.16666666666666666, 'Superman Returns')],
 'Superman Returns': [(0.16666666666666666, 'Snakes on a Plane'),
  (0.10256410256410256, 'The Night Listener'),
  (0.09090909090909091, 'Lady in the Water')],
 'The Night Listener': [(0.2857142857142857, 'Lady in the Water'),
  (0.18181818181818182, 'Snakes on a Plane'),
  (0.15384615384615385, 'Just My Luck')],
 'You, Me and Dupree': [(0.4, 'Lady in the Water'),
  (0.18181818181818182, 'Just My Luck'),
  (0.14814814814814814, 'The Night Listener')]}

Having the list of items, each with a sub-list of similar items, we now need a function to use this list and recommend similar items to a user. It uses a similar method to normalize the result as the get_recommendations function. We cycle over every item the user has rated and we match it with an entry from the calculate_similar_items list. We then cycle through the item sublist to checkout similar items and if they're absent from the list of items the user has already rated, we again accumulate the multiplication of the rating and the similarity. We also accumulate the similarity itself, which we later use to normalize the results:

def get_recommended_items(prefs, itemMatch, user):
    '''
    Generates a list of recommended items for a user
    based on user preferences (prefs[user]) as well 
    as a list of similar items.
    '''
    user_ratings = prefs[user]
    scores = {}
    total_similar = {}

    for (item, rating) in user_ratings.items():
        for (similarity, item2) in itemMatch[item]:
            if item2 not in user_ratings:
                scores.setdefault(item2, 0)
                scores[item2] += similarity * rating

                total_similar.setdefault(item2, 0)
                total_similar[item2] += similarity

    rankings = [(score / total_similar[item], item)
            for item, score in scores.items()]

    rankings.sort()
    rankings.reverse()
    return rankings

Example output:

In [14]: recommendations.get_recommended_items(recommendations.critics, 
recommendations.calculate_similar_items(recommendations.critics), 'Toby')
Out[14]: 
[(3.182634730538922, 'The Night Listener'),
 (2.5983318700614575, 'Just My Luck'),
 (2.4730878186968837, 'Lady in the Water')]

A really cool thing to notice about what we get back is that it's a list of tuples, where the 0th item in the list is the likely rating our target user would assign to the recommended movie. Wow, we just got hit with the future.

All good things end

That pretty much sums up the second chapter of The Book. I structured the code so that it is easy to play around with in the interactive interpreter (IDLE or better yet - IPython!). For more fun, use these functions interactively to explore the 100k strong MovieLens dataset that downloads automatically when you run/import recommendations.py.

Last but not least, get The Book. Despite coming out in 2007 and having some outdated examples, it gives a great introductory overview of extracting information from raw data. Me? I'll go and start on chapter 3 :).