Recommender System using Python & python-recsys

python-recsys is a Python library for building recommender systems.

Currently, python-recsys supports two recommender algorithms: Singular Value Decomposition (SVD) and SVD Neighbourhood.

There is a QuickStart tutorial on using python-recsys for recommender systems. It takes the MovieLens movie ratings dataset and shows how to compute similarity between movie items and recommend movies to users.

Here, I will show how to use a custom CSV dataset and evaluate a recommender system built with the SVD algorithm.

I will be fetching data from a CSV file. The CSV file consists of 3 fields (user_id, item_id, and star_rating). item_id can be the ID of anything: hotels, movies, books, etc. star_rating is the rating a user gives an item, ranging from 1 (worst) to 5 (best).

For this article, I have created a dummy CSV file named dataset-recsys.csv containing three columns (user_id, item_id, and star_rating). You can download the CSV file from here.

Here is the code to create the CSV file:


import random
import csv

fieldnames = ['user_id', 'item_id', 'star_rating']
with open('dataset-recsys.csv', 'w') as myfile:  # write data to a new CSV file
	writer = csv.DictWriter(myfile, delimiter=',', fieldnames=fieldnames)
	writer.writeheader()

	for user in range(1, 21):  # 20 users
		items = random.sample(range(1, 41), 20)  # each user rates 20 of 40 items
		for item in items:
			writer.writerow({'user_id': user, 'item_id': item, 'star_rating': random.randint(1, 5)})
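As a quick sanity check, the generated file can be read back with the standard csv module to confirm its shape: 20 users times 20 sampled items gives 400 ratings, all between 1 and 5. This sketch re-creates the dataset in a temporary location so it does not touch your real file:

```python
import csv
import os
import random
import tempfile

# Re-create the dataset in a temp directory and verify its shape:
# 20 users x 20 sampled items = 400 rating rows, ratings in 1..5.
path = os.path.join(tempfile.gettempdir(), 'dataset-recsys-check.csv')
fieldnames = ['user_id', 'item_id', 'star_rating']
with open(path, 'w') as f:
    writer = csv.DictWriter(f, delimiter=',', fieldnames=fieldnames)
    writer.writeheader()
    for user in range(1, 21):
        for item in random.sample(range(1, 41), 20):
            writer.writerow({'user_id': user, 'item_id': item,
                             'star_rating': random.randint(1, 5)})

with open(path) as f:
    rows = list(csv.DictReader(f))  # DictReader skips the header row

print(len(rows))  # 400
print(all(1 <= int(r['star_rating']) <= 5 for r in rows))  # True
```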

Creating a Data Model

The python-recsys library uses matrix factorization algorithms such as SVD and SVD Neighbourhood that take the input data (in the form of a matrix) and decompose it, i.e. reduce it to a lower-dimensional space.
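The decomposition idea can be sketched with plain NumPy (this is not python-recsys, just an illustration of what happens under the hood): factorize a tiny ratings matrix and rebuild it from only the top-k singular vectors.

```python
import numpy as np

# A tiny users x items ratings matrix, just to illustrate the decomposition.
M = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])

# Full SVD: M = U * Sigma * V^T
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top-k singular values: a lower-dimensional approximation of M.
k = 2
M_k = U[:, :k].dot(np.diag(sigma[:k])).dot(Vt[:k, :])

print(np.round(M_k, 1))  # close to M; the two "taste" groups survive
```

The rank-k matrix M_k is the best k-dimensional approximation of M (in the least-squares sense), which is what makes the truncated factors useful for similarity and prediction.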


import recsys.algorithm
recsys.algorithm.VERBOSE = True

from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data

data = Data()
data.load('./data/dataset-recsys.csv', sep=',', format={'col':0, 'row':1, 'value':2, 'ids': int})

# About the format parameter:
#   'row': 1   -> rows of the matrix come from the 2nd column of dataset-recsys.csv
#   'col': 0   -> cols of the matrix come from the 1st column of dataset-recsys.csv
#   'value': 2 -> values (Mij) of the matrix come from the 3rd column of dataset-recsys.csv
#   'ids': int -> ids (row and col ids) are integers (not strings)

train, test = data.split_train_test(percent=70) # 70% train, 30% test

svd = SVD()
svd.set_data(train)

k = 100
svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True)

# min_values=10 removes items rated by fewer than 10 users, and users who rated fewer than 10 items

# Parameters:	
  # k (int) – number of dimensions
  # min_values (int) – min. number of non-zeros (or non-empty values) any row or col must have
  # pre_normalize (string) – normalize input matrix. Possible values are tfidf, rows, cols, all.
  # mean_center (Boolean) – center the input matrix (aka mean subtraction)
  # post_normalize (Boolean) – Normalize every row of U Sigma to be a unit vector. Thus, row similarity (using cosine distance) returns [-1.0 .. 1.0]
  # savefile (string) – path to save the SVD factorization (U, Sigma and V matrices)

# the computed SVD model can also be saved to a zip file
svd.compute(k=k, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile='/tmp/datamodel')
svd.similarity(ITEMID1, ITEMID2)

# and then the saved model can be loaded back
svd2 = SVD(filename='/tmp/datamodel')
svd2.similarity(ITEMID1, ITEMID2)
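The post_normalize parameter above is what makes row similarity a cosine similarity in [-1, 1]. A small NumPy sketch (not python-recsys internals) of that trick: scale every row to unit length, and a plain dot product between two rows becomes their cosine.

```python
import numpy as np

# Sketch of post_normalize: scale every row (think rows of U * Sigma) to unit
# length, so the dot product between two rows is their cosine similarity.
rows = np.array([[3., 4.],
                 [6., 8.],
                 [-4., 3.]])
unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)

sim = unit.dot(unit.T)  # every entry lies in [-1, 1]
print(sim[0, 1])        # rows 0 and 1 point the same way -> 1.0
print(sim[0, 2])        # orthogonal rows -> 0.0
```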

Computing Similarity

Similarity between two items:


svd.similarity(ITEMID1, ITEMID2)

Similar items to a particular item:


svd.similar(ITEMID1, 5) # show 5 similar items

The similar and similarity functions take row IDs of the matrix M.

As you can see below, while loading the data we specified the 2nd column of our CSV dataset file as the row.


svd.load_data(filename='./data/dataset-recsys.csv', sep=',', format={'col':0, 'row':1, 'value':2, 'ids': int})

In our CSV file, the 1st column is user_id and the 2nd column is item_id. Hence, we pass an ITEMID to the similar and similarity functions.

So, if we want to compute similarity between users, we first need to load the data with the 1st column (user_id) as the row, like this:


svd.load_data(filename='./data/dataset-recsys.csv', sep=',', format={'col':1, 'row':0, 'value':2, 'ids': int})

Now we can compute similarity between users.


# Similarity between two users
svd.similarity(USERID1, USERID2)

# Similar users to a particular user 
svd.similar(USERID1, 5) # show 5 similar users

Predicting rating for a particular user and item


MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 1
USERID = 1
svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING) # predicted rating value
svd.get_matrix().value(ITEMID, USERID) # real rating value
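Conceptually, an SVD model predicts a rating by dotting the item's and user's latent vectors and clamping the result to the allowed range. A sketch with made-up factor values (these are hypothetical numbers, not python-recsys internals):

```python
import numpy as np

# Hypothetical latent factors for one item and one user
# (think: a row of U * sqrt(Sigma) and a column of sqrt(Sigma) * V^T).
MIN_RATING, MAX_RATING = 0.0, 5.0

item_factors = np.array([1.2, -0.4, 0.9])
user_factors = np.array([2.0, 0.5, 1.5])

raw = float(item_factors.dot(user_factors))   # 2.4 - 0.2 + 1.35 = 3.55
pred = min(max(raw, MIN_RATING), MAX_RATING)  # clamp to the rating range
print(pred)  # 3.55
```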

Recommend items to a particular user

Recommend items that the user hasn't rated before.


# cols are users and rows are items, thus we set is_row=False
# n = 5, recommend 5 items
# only_unknowns = True, only return unknown values in matrix M, i.e. items not rated by the user
svd.recommend(USERID, n=5, only_unknowns=True, is_row=False) 
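The only_unknowns=True behaviour can be sketched with plain NumPy (not python-recsys): score every item for a user, mask out the ones the user already rated, and keep the top n of the rest.

```python
import numpy as np

# Predicted ratings for items 0..4, and which of them the user already rated.
scores = np.array([4.2, 3.1, 4.8, 2.0, 3.9])
already_rated = np.array([False, True, False, False, True])

candidates = np.where(~already_rated)[0]           # items 0, 2, 3
top = candidates[np.argsort(-scores[candidates])]  # best-scoring first
print(top[:2])  # -> [2 0]
```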

Evaluation

The following code contains evaluation using two prediction-based metrics (Root Mean Square Error (RMSE) & Mean Absolute Error (MAE)) and two rank-based metrics (Spearman’s rho & Kendall-tau).


from recsys.evaluation.prediction import RMSE, MAE
from recsys.evaluation.ranking import SpearmanRho, KendallTau

rmse = RMSE()
mae = MAE()
spearman = SpearmanRho()
kendall = KendallTau()
for rating, item_id, user_id in test.get():
    try:
        pred_rating = svd.predict(item_id, user_id)
        rmse.add(rating, pred_rating)
        mae.add(rating, pred_rating)
        spearman.add(rating, pred_rating)
        kendall.add(rating, pred_rating) 
    except KeyError:
        continue

print 'RMSE=%s' % rmse.compute()
print 'MAE=%s' % mae.compute()
print 'Spearman\'s rho=%s' % spearman.compute()
print 'Kendall-tau=%s' % kendall.compute()
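For intuition, RMSE and MAE are simple enough to compute by hand. A sketch (plain Python, not the recsys.evaluation classes) over a handful of (real, predicted) rating pairs:

```python
import math

# A few (real rating, predicted rating) pairs, just for illustration.
pairs = [(5, 4.5), (3, 3.5), (1, 2.0), (4, 4.0)]

# MAE: mean of absolute errors; RMSE: root of the mean squared error.
mae = sum(abs(r - p) for r, p in pairs) / len(pairs)
rmse = math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

print(mae)   # (0.5 + 0.5 + 1.0 + 0.0) / 4 = 0.5
print(rmse)  # sqrt((0.25 + 0.25 + 1.0 + 0.0) / 4) ~= 0.6124
```

RMSE penalizes large errors more heavily than MAE, which is why the two are usually reported together.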

Here’s the full source code:


import sys

#To show some messages:
import recsys.algorithm
#recsys.algorithm.VERBOSE = True

from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data
from recsys.evaluation.prediction import RMSE, MAE
from recsys.evaluation.decision import PrecisionRecallF1
from recsys.evaluation.ranking import SpearmanRho, KendallTau

#Dataset
PERCENT_TRAIN = 70
data = Data()
data.load('./data/dataset-recsys.csv', sep=',', format={'col':0, 'row':1, 'value':2, 'ids':int})

#Train & Test data
train, test = data.split_train_test(percent=PERCENT_TRAIN)

#Create SVD
K=100
svd = SVD()
svd.set_data(train)

svd.compute(k=K, min_values=1, pre_normalize=None, mean_center=True, post_normalize=True)
#svd.compute(k=K, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True)
#svd.compute(k=K, pre_normalize=None, mean_center=True, post_normalize=True)

print ''
print 'COMPUTING SIMILARITY'
print svd.similarity(1, 2) # similarity between items
print svd.similar(1, 5) # show 5 similar items

print ''
print 'GENERATING PREDICTION'
MIN_RATING = 0.0
MAX_RATING = 5.0
ITEMID = 1
USERID = 1
print svd.predict(ITEMID, USERID, MIN_RATING, MAX_RATING) # predicted rating value
print svd.get_matrix().value(ITEMID, USERID) # real rating value

print ''
print 'GENERATING RECOMMENDATION'
print svd.recommend(USERID, n=5, only_unknowns=True, is_row=False) 

#Evaluation using prediction-based metrics
rmse = RMSE()
mae = MAE()
spearman = SpearmanRho()
kendall = KendallTau()
#decision = PrecisionRecallF1()
for rating, item_id, user_id in test.get():
    try:
        pred_rating = svd.predict(item_id, user_id)
        rmse.add(rating, pred_rating)
        mae.add(rating, pred_rating)
        spearman.add(rating, pred_rating)
        kendall.add(rating, pred_rating)         
    except KeyError:
        continue

print ''
print 'EVALUATION RESULT'
print 'RMSE=%s' % rmse.compute()
print 'MAE=%s' % mae.compute()
print 'Spearman\'s rho=%s' % spearman.compute()
print 'Kendall-tau=%s' % kendall.compute()
#print decision.compute()
print ''

Hope this helps.
Thanks.