Recommender System using Python & Crab

Crab, also known as scikits.recommender, is a Python framework for building recommender engines that integrates with the scientific Python stack (numpy, scipy, matplotlib).

Currently, Crab supports two Recommender Algorithms: User-based Collaborative Filtering and Item-based Collaborative Filtering.

Here is a tutorial on Introduction to Recommender Systems with Crab. It briefly explains what recommendation is, what the Collaborative Filtering and Content-based Filtering algorithms are, and how Crab is used to build and evaluate a recommender system. It also shows an example of implementing the User-based Collaborative Filtering algorithm on a sample movie dataset.

In this article, I will show code to build and evaluate a recommender system using both user-based and item-based collaborative filtering.

I will be fetching data from a CSV file. The CSV file consists of three fields: user_id, item_id, and star_rating. item_id can be the ID of anything (hotels, movies, books, etc.), and star_rating is the rating a user gave to an item. Ratings range from 1 to 5, where 5 is the best and 1 is the worst.

For this article, I have created a dummy CSV file named dataset-recsys.csv containing three columns (user_id, item_id, and star_rating). You can download the CSV file from here.
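
Since the ratings are generated at random, the exact values will differ, but the file looks something like this (illustrative rows):

user_id,item_id,star_rating
1,17,3
1,4,5
1,32,2
2,9,1
...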

Here is the code to create the CSV file:


import random
import csv

fieldnames = ['user_id', 'item_id', 'star_rating']
with open('dataset-recsys.csv', "w") as myfile: # writing data to new csv file
	writer = csv.DictWriter(myfile, delimiter=',', fieldnames=fieldnames)
	writer.writeheader()

	for x in range(1, 21): # 20 users
		items = random.sample(list(range(1, 41)), 20) # each user rates 20 of 40 items
		for item in items:
			writer.writerow({'user_id': x, 'item_id': item, 'star_rating': random.randint(1, 5)})

Creating a Python dictionary

To build and evaluate a recommender system, we first read the data from the CSV file into a Python dictionary that maps each user to the items they rated.


dataset = {} # dictionary mapping user_id -> {item_id: rating}
with open('dataset-recsys.csv') as myfile:
	reader = csv.DictReader(myfile, delimiter=',') # DictReader already skips the header row
	for line in reader:
		if int(line['user_id']) not in dataset:
			dataset[int(line['user_id'])] = {}

		dataset[int(line['user_id'])][int(line['item_id'])] = float(line['star_rating'])
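
After this step, dataset is a nested dictionary mapping each user_id to an inner dictionary of item_id: rating pairs. A quick sanity check (the values in the comments are illustrative, since the data is random):


print(len(dataset))  # number of users, 20 for the dummy file
print(dataset[1])    # ratings given by user 1, e.g. {17: 3.0, 4: 5.0, 32: 2.0, ...}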

Creating a Data Model

Now the dictionary is used to create a data model. In our example, we will use MatrixPreferenceDataModel.


model = MatrixPreferenceDataModel(dataset)

Alternatively, a boolean data model (MatrixBooleanPrefDataModel) can be used; it ignores the rating values and only records whether a user has rated an item.


boolean_model = MatrixBooleanPrefDataModel(dataset)

Creating Similarity

For user-based filtering we use the UserSimilarity class, and for item-based filtering we use the ItemSimilarity class. Crab provides several similarity measure implementations, such as euclidean_distances, cosine_distances, pearson_correlation, and jaccard_coefficient.

User-based Similarity


similarity = UserSimilarity(model, euclidean_distances, 3)
similarity = UserSimilarity(model, cosine_distances)
similarity = UserSimilarity(model, jaccard_coefficient)

# If using boolean model
boolean_similarity = UserSimilarity(boolean_model, jaccard_coefficient)

Item-based Similarity


similarity = ItemSimilarity(model, euclidean_distances, 3)
similarity = ItemSimilarity(model, cosine_distances)
similarity = ItemSimilarity(model, jaccard_coefficient)

# If using boolean model
boolean_similarity = ItemSimilarity(boolean_model, jaccard_coefficient)

Neighborhood Strategy

For user-based filtering, the neighborhood of most similar users is built with NearestNeighborsStrategy:


neighborhood = NearestNeighborsStrategy()

For item-based filtering, candidate items are selected with ItemsNeighborhoodStrategy:


nhood_strategy = ItemsNeighborhoodStrategy()

Building the Recommender System

For user-based filtering:


recsys = UserBasedRecommender(model, similarity, neighborhood)

# For boolean model and boolean similarity
boolean_recsys = UserBasedRecommender(boolean_model, boolean_similarity, neighborhood)

For item-based filtering:


recsys = ItemBasedRecommender(model, similarity, nhood_strategy, with_preference=False)

# For boolean model and boolean similarity
boolean_recsys = ItemBasedRecommender(boolean_model, boolean_similarity, nhood_strategy)
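
Although this article focuses on evaluation, the recommenders built above can also be queried directly. Here is a minimal sketch using the recommend(user_id) method shown in the Crab tutorial (the returned item IDs will vary with the random data):


# Recommend items for user 1 (output varies with the random dataset)
print(recsys.recommend(1))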

Evaluation

For evaluation, we use the evaluate function from the CfEvaluator class of the Crab framework. Currently, it supports the following evaluation metrics: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Normalized Mean Absolute Error (NMAE), Precision, Recall, and F1 Score.

Here are the details of each parameter of the evaluate function:

metric: [None|'rmse'|'f1score'|'precision'|'recall'|'nmae'|'mae']
If metric is None, all available metrics will be evaluated.
Otherwise, only the specified metric is evaluated and returned.

sampling_users: float or sampling, optional, default = None
If a float is passed, it is the percentage of users to evaluate.
If sampling_users is None, all users are used in the evaluation.
Specific sampling objects can also be passed; see the
scikits.crab.metrics.sampling module for the list of possible objects.

sampling_ratings: float or sampling, optional, default = None
If a float is passed, it is the percentage of ratings to evaluate.
If sampling_ratings is None, 70% of the ratings are used as the
training set and 30% as the test set. Specific sampling objects can
also be passed; see the scikits.crab.metrics.sampling module for the
list of possible objects.

at: integer, optional, default = None
This is the 'at' value, as in 'precision at 5'. For example, this
would mean precision or recall evaluated by removing the top 5
preferences for a user and then finding the percentage of those 5
items included in the top 5 recommendations for that user.
If at is None, the top 3 elements are considered.

Returns
Returns a dictionary containing the evaluation results:
(NMAE, MAE, RMSE, Precision, Recall, F1-Score)

The recommender system can be evaluated for each individual metric shown above, or for all of them at once.

Evaluating each metric separately


evaluator = CfEvaluator()

rmse = evaluator.evaluate(recsys, 'rmse', permutation=False)
mae = evaluator.evaluate(recsys, 'mae', permutation=False)
nmae = evaluator.evaluate(recsys, 'nmae', permutation=False)
precision = evaluator.evaluate(recsys, 'precision', permutation=False)
recall = evaluator.evaluate(recsys, 'recall', permutation=False)
f1score = evaluator.evaluate(recsys, 'f1score', permutation=False)

print(f1score)
print(mae)
print(nmae)
print(precision)
print(recall)
print(rmse)

# Evaluating boolean recommender system
rmse = evaluator.evaluate(boolean_recsys, 'rmse', permutation=False)

Evaluating all metrics at once


all_scores = evaluator.evaluate(recsys, permutation=False)

# For boolean recommender system
all_scores = evaluator.evaluate(boolean_recsys, permutation=False)

Here, we use 70% of the data as the training set and 30% as the test set, and evaluate precision and recall at N. We keep N = 10.


result = evaluator.evaluate(recsys, None, permutation=False, at=10, sampling_ratings=0.7)
pprint (result)

Evaluating with Cross Validation

Here, we use 5-fold cross-validation to evaluate the RMSE metric.


result = evaluator.evaluate_on_split(recsys, 'rmse', permutation=False, at=10, cv=5, sampling_ratings=0.7)
pprint (result)

Here is the full source code for building and evaluating a recommender system using the Item-based Collaborative Filtering technique:


from pprint import pprint
import csv

from scikits.crab.models import MatrixPreferenceDataModel, MatrixBooleanPrefDataModel
from scikits.crab.metrics import pearson_correlation, euclidean_distances, jaccard_coefficient, cosine_distances, manhattan_distances, spearman_coefficient
from scikits.crab.similarities import ItemSimilarity, UserSimilarity
from scikits.crab.recommenders.knn import ItemBasedRecommender, UserBasedRecommender
from scikits.crab.recommenders.knn.neighborhood_strategies import NearestNeighborsStrategy
from scikits.crab.recommenders.knn.item_strategies import ItemsNeighborhoodStrategy
from scikits.crab.recommenders.svd.classes import MatrixFactorBasedRecommender
from scikits.crab.metrics.classes import CfEvaluator

"""
import random 

fieldnames = ['user_id', 'item_id', 'star_rating']
with open('dataset-recsys.csv', "w") as myfile: # writing data to new csv file
	writer = csv.DictWriter(myfile, delimiter = ',', fieldnames = fieldnames)	
	writer.writeheader()	
	
	for x in range(1, 21): # 20 users
		items = random.sample(list(range(1, 41)), 20) # each user rates 20 of 40 items
		for item in items:
			writer.writerow({'user_id': x, 'item_id': item, 'star_rating': random.randint(1, 5)})
"""

dataset = {} # dictionary mapping user_id -> {item_id: rating}
with open('dataset-recsys.csv') as myfile:
	reader = csv.DictReader(myfile, delimiter=',') # DictReader already skips the header row
	for line in reader:
		if int(line['user_id']) not in dataset:
			dataset[int(line['user_id'])] = {}

		dataset[int(line['user_id'])][int(line['item_id'])] = float(line['star_rating'])
					

model = MatrixPreferenceDataModel(dataset)

# User-based Similarity

#similarity = UserSimilarity(model, cosine_distances)
#neighborhood = NearestNeighborsStrategy()
#recsys = UserBasedRecommender(model, similarity, neighborhood)

# Item-based Similarity

similarity = ItemSimilarity(model, cosine_distances)
nhood_strategy = ItemsNeighborhoodStrategy()
recsys = ItemBasedRecommender(model, similarity, nhood_strategy, with_preference=False)

#recsys = MatrixFactorBasedRecommender(model=model, items_selection_strategy=nhood_strategy, n_features=10, n_interations=1)

evaluator = CfEvaluator()

#rmse = evaluator.evaluate(recsys, 'rmse', permutation=False)
#mae = evaluator.evaluate(recsys, 'mae', permutation=False)
#nmae = evaluator.evaluate(recsys, 'nmae', permutation=False)
#precision = evaluator.evaluate(recsys, 'precision', permutation=False)
#recall = evaluator.evaluate(recsys, 'recall', permutation=False)
#f1score = evaluator.evaluate(recsys, 'f1score', permutation=False)

#all_scores = evaluator.evaluate(recsys, permutation=False)
#all_scores = evaluator.evaluate(boolean_recsys, permutation=False)

result = evaluator.evaluate(recsys, None, permutation=False, at=10, sampling_ratings=0.7) 

# Cross Validation
#result = evaluator.evaluate_on_split(recsys, 'rmse', permutation=False, at=10, cv=5, sampling_ratings=0.7)

pprint (result)

Hope this helps.
Thanks.