This recommender system predicts which jokes an individual user will find more or less funny. It uses a subset of the Jester data set, available at http://www.ieor.berkeley.edu/~goldberg/jester-data/
The data is loaded into an R data frame, and the value 99, which marks a joke the user did not rate, is replaced with NA.
I shift the scores from the range -10:10 to 0:20 for RMSE purposes.
For the remainder of this project I focus on a dense portion of the data frame, where most jokes have ratings from most users.
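library(readxl)  # provides read_xls()
# Column names: 'User' for the first column, then '1' to '100' for the jokes
column_names = seq(0, 100, 1)
column_names[1] = 'User'
df = read_xls(path = "jester-data-2.xls",col_names = column_names)
df[df==99] <- NA   # 99 marks a joke the user did not rate
df$User = NULL     # drop the first column so only joke ratings remain
dense_df = subset(df,select=(5:20))   # keep jokes 5 to 20, the dense block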
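As a small illustration of the sentinel replacement, the sketch below applies the same `df[df == 99] <- NA` idiom to a made-up data frame. The name `toy` and its values are hypothetical and are not taken from the Jester data.

```r
# Hypothetical toy ratings; 99 marks a joke the user did not rate
toy <- data.frame(j1 = c(4.5, 99, -3.2),
                  j2 = c(99, 7.1, 0.4))

# Comparing a data frame with a scalar gives a logical matrix,
# which can then be used to overwrite every 99 in one step
toy[toy == 99] <- NA

# Count the ratings each user actually supplied
rated_per_user <- rowSums(!is.na(toy))
```

Counting the non-missing entries per row this way is one possible check of how dense a block of columns is before choosing it.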
The code below splits the users into a training set (roughly 80% of the rows) and a test set (roughly 20%), and shifts the ratings by +10.
#This function was taken from 'Building a Recommendation System with R'
which_train <- sample(x = c(TRUE, FALSE), size = nrow(dense_df),
replace = TRUE, prob = c(0.8, 0.2))
train_set = dense_df[which_train, ] +10
test_set = dense_df[!which_train,] +10
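Because `sample()` draws a fresh set of flags on every run, the split and every number derived from it change between runs. A minimal sketch of a reproducible version, assuming an arbitrary seed and a made-up stand-in `toy` for `dense_df`:

```r
set.seed(42)  # arbitrary seed, an assumption made only for reproducibility

# Hypothetical stand-in for dense_df: 100 users by 4 jokes on the -10:10 scale
toy <- as.data.frame(matrix(runif(400, -10, 10), nrow = 100))

# Draw one TRUE/FALSE flag per user with an 80/20 weighting
which_train <- sample(c(TRUE, FALSE), size = nrow(toy),
                      replace = TRUE, prob = c(0.8, 0.2))

train_toy <- toy[which_train, ] + 10    # shift -10:10 to 0:20
test_toy  <- toy[!which_train, ] + 10

# Every user lands in exactly one of the two sets
nrow(train_toy) + nrow(test_toy) == nrow(toy)
```

The 80/20 weighting applies to each draw independently, so the realised proportion is only approximately 80%.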
This function computes the raw average of the user-item matrix, that is, the mean of all observed ratings.
raw_average = function(x){
return(sum(colSums(x,na.rm = TRUE)) / length(which(!is.na(x))))
}
raw_average_train = raw_average(train_set)
raw_average_test = raw_average(test_set)
raw_average_test
## [1] 9.692964
raw_average_train
## [1] 9.615193
This function computes the root-mean-square error (RMSE), ignoring missing ratings.
#found from ('https://stackoverflow.com/questions/26237688/rmse-root-mean-square-deviation-calculation-in-r')
RMSE = function(x,y){
# as.matrix() turns the data frame of squared errors into a numeric matrix for mean()
sqrt( mean (as.matrix((x-y)^2), na.rm=TRUE) )
}
RMSE(train_set,raw_average_train)
## [1] 5.243725
RMSE(test_set,raw_average_train)
## [1] 5.24164
Using the raw average of the training set as the prediction for every rating gives an RMSE of about 5.24 points on the 0:20 scale, for both the training set and the test set.
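To make the RMSE calculation concrete, here is a small worked sketch on made-up ratings. The helper `rmse` and the data frame `toy` are hypothetical names used only for this illustration.

```r
# RMSE over observed entries only; as.matrix() gives mean() a numeric matrix
rmse <- function(actual, predicted) {
  sqrt(mean(as.matrix((actual - predicted)^2), na.rm = TRUE))
}

# Hypothetical ratings on the 0:20 scale, one of them missing
toy <- data.frame(j1 = c(8, NA), j2 = c(12, 10))

# Squared errors against a constant prediction of 10 are 4, 4 and 0;
# the NA entry is skipped, so the result is sqrt(8 / 3)
toy_rmse <- rmse(toy, 10)
```

The missing rating does not count towards the mean, so the divisor is the number of observed ratings (3), not the number of cells (4).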
These functions compute the user bias and the item (joke) bias: how far each user's or each joke's mean rating sits from the raw average.
userBias = function(df,raw_avg){
return(rowMeans(df,na.rm=TRUE) - raw_avg)
}
jokeBias = function(df,raw_avg){
return(colMeans(df,na.rm=TRUE)-raw_avg)
}
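As a worked sketch of the two bias calculations, the made-up matrix below (hypothetical values, not Jester data) has a raw average of exactly 12, which makes the biases easy to check by hand.

```r
# Hypothetical 0:20 ratings: 3 users (rows) by 2 jokes (columns)
toy <- data.frame(j1 = c(12, 8, NA), j2 = c(16, 10, 14))

# Mean of the five observed ratings: (12 + 8 + 16 + 10 + 14) / 5 = 12
raw_avg <- mean(as.matrix(toy), na.rm = TRUE)

# Row means are 14, 9, 14, so the user biases are 2, -3, 2
user_bias <- rowMeans(toy, na.rm = TRUE) - raw_avg

# Column means are 10 and 40/3, so the joke biases are -2 and 4/3
joke_bias <- colMeans(toy, na.rm = TRUE) - raw_avg
```

A positive user bias marks a user who rates above average overall; a positive joke bias marks a joke that is rated above average overall.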
I use the above functions to create some baseline predictions.
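baseline_predictors = function(df){
# Bias of each user (row) and each joke (column) relative to the raw average
user_bias = userBias(df,raw_average(df))
joke_bias = jokeBias(df,raw_average(df))
# Overwrite every observed rating with the raw average; NA entries stay NA
df[!is.na(df)] = raw_average(df)
# Both vectors are recycled down the columns here, so user_bias lines up with
# the rows, but joke_bias does not line up with the joke columns
df = df + user_bias + joke_bias
return(df)
}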
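Arithmetic between a data frame and a plain vector recycles the vector down the columns, so a per-joke vector is easier to align explicitly. The following is a minimal sketch of one way to do that with `outer()`, assuming a small made-up matrix `toy`; it is an alternative formulation, not the code that produced the results in this report.

```r
# Hypothetical 0:20 ratings: 3 users (rows) by 2 jokes (columns)
toy <- as.matrix(data.frame(j1 = c(12, 8, NA), j2 = c(16, 10, 14)))

raw_avg   <- mean(toy, na.rm = TRUE)
user_bias <- rowMeans(toy, na.rm = TRUE) - raw_avg
joke_bias <- colMeans(toy, na.rm = TRUE) - raw_avg

# outer(user_bias, joke_bias, "+")[i, j] equals user_bias[i] + joke_bias[j]
baseline <- raw_avg + outer(user_bias, joke_bias, "+")

# Mirror the NA pattern of the input, as baseline_predictors does
baseline[is.na(toy)] <- NA
```

With this construction the difference between any two users' predictions is the same for every joke, which is a quick sanity check on the alignment.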
Let's see what the RMSE is for the training set, and then for the test set, once we account for user and item biases in the user-item matrix.
baseline_training_set = baseline_predictors(train_set)
baseline_test_set = baseline_predictors(test_set)
RMSE(baseline_training_set,raw_average_train)
## [1] 2.927845
RMSE(baseline_test_set,raw_average_test)
## [1] 2.902438
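With user and joke biases added, the reported RMSE is about 2.9 points on the 20-point scale, and it is very consistent between the training and test data. Note that the two calls above compare the baseline predictions with the raw average, so these values measure how far the predictions move away from the raw average rather than how far they are from the observed ratings.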
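For a held-out estimate of prediction error, the predictions can be scored against the observed test ratings. Because the split is by user, the held-out users have no user bias learned from the training set, so the sketch below uses only the training raw average and the training joke biases. All names and values here are hypothetical.

```r
rmse <- function(actual, predicted) {
  sqrt(mean((as.matrix(actual) - predicted)^2, na.rm = TRUE))
}

# Hypothetical 0:20 ratings; rows are users, columns are jokes
train_toy <- matrix(c(12, 8, 10,  16, 10, 16), ncol = 2)
test_toy  <- matrix(c(11, NA,     15, 13),     ncol = 2)

# Quantities learned from the training users only
raw_avg_train   <- mean(train_toy, na.rm = TRUE)
joke_bias_train <- colMeans(train_toy, na.rm = TRUE) - raw_avg_train

# One prediction per joke, repeated for every held-out user
pred_test <- matrix(raw_avg_train + joke_bias_train,
                    nrow = nrow(test_toy), ncol = ncol(test_toy), byrow = TRUE)

rmse_raw      <- rmse(test_toy, raw_avg_train)  # raw average only
rmse_baseline <- rmse(test_toy, pred_test)      # raw average plus joke bias
```

On this toy data the joke bias lowers the test RMSE, but whether it does so on the Jester subset would need to be measured.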
Below is the head of the test set's predicted baseline values. A value below 10, the midpoint of the 0:20 scale, means the user will probably not like the item (joke).
head(baseline_test_set)
## 5 6 7 8 9 10 11
## 1 10.098028 11.477651 8.631202 9.160518 NA 10.285258 9.090197
## 2 8.824466 9.466952 6.990910 7.271156 9.134611 6.652556 8.903901
## 3 11.607441 NA 12.732181 11.537120 NA NA 13.924574
## 4 10.035819 NA 9.417219 11.668565 NA NA 12.231615
## 5 6.894572 5.438938 7.902403 9.282026 6.435577 6.964893 5.898421
## 6 7.742026 3.737566 7.662591 8.305077 5.829035 6.109281 7.972736
## 12 13 14 15 16 17 18
## 1 7.634563 10.098028 11.477651 8.631202 9.160518 8.094046 10.285258
## 2 4.899441 8.824466 9.466952 6.990910 7.271156 9.134611 6.652556
## 3 11.078125 11.607441 10.540969 12.732181 11.537120 10.081486 12.544952
## 4 9.755574 10.035819 11.899274 9.417219 11.668565 7.664105 11.589129
## 5 8.089633 6.894572 5.438938 7.902403 9.282026 6.435577 6.964893
## 6 5.490681 7.742026 3.737566 7.662591 8.305077 5.829035 6.109281
## 19 20
## 1 9.090197 7.634563
## 2 8.903901 4.899441
## 3 13.924574 11.078125
## 4 12.231615 9.755574
## 5 5.898421 8.089633
## 6 7.972736 5.490681