The purpose of this recommender system is to predict which jokes an individual user will find more or less funny. We use a fraction of the Jester data set, available at http://www.ieor.berkeley.edu/~goldberg/jester-data/

The data is loaded into an R data frame, and the placeholder value '99' (joke not rated) is replaced with 'NA'.
I shift the scores from -10:10 to 0:20 for RMSE purposes.
I focus on a dense portion of the data frame for the remainder of this project (a block of joke columns in which most entries hold a user rating).

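library(readxl)  # provides read_xls()

# Name the first column 'User' and the joke columns '1' to '100'
column_names = seq(0, 100, 1)
column_names[1] = 'User'
df = read_xls(path = "jester-data-2.xls", col_names = column_names)
df[df == 99] <- NA  # 99 marks a joke the user did not rate
df$User = NULL
# Keep a dense block of joke columns
dense_df = subset(df, select = (5:20))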
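
To make the two cleaning steps concrete, here is a minimal sketch on a small made-up data frame (the `toy` values are hypothetical and not taken from the Jester file). It shows that the 99 placeholder becomes NA and that adding 10 moves the remaining ratings onto the 0:20 scale.

```r
# Hypothetical toy ratings on the -10:10 scale; 99 marks "not rated"
toy <- data.frame(j1 = c(-7.82, 99, 4.08),
                  j2 = c(8.79, -0.29, 99))

toy[toy == 99] <- NA      # placeholder -> missing value
toy_shifted <- toy + 10   # -10:10 becomes 0:20; NA stays NA

print(toy_shifted)
```

Replacing 99 before shifting matters: shifting first would turn the placeholder into 109, which the `== 99` test would no longer catch.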

The purpose of this code is to split the data into a training set and a test set.

# This split was taken from 'Building a Recommendation System with R'
# Assign each user (row) to the training set with probability 0.8
which_train <- sample(x = c(TRUE, FALSE), size = nrow(dense_df),
                      replace = TRUE, prob = c(0.8, 0.2))
# Adding 10 shifts the ratings from -10:10 to 0:20
train_set = dense_df[which_train, ] + 10
test_set = dense_df[!which_train, ] + 10
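
Because `sample()` is random, every run produces a different split and slightly different numbers. A minimal sketch of a reproducible version follows, assuming a made-up stand-in for `dense_df`; the seed value and the names `toy_df`, `train_toy` and `test_toy` are hypothetical.

```r
set.seed(42)  # hypothetical seed; any fixed value makes the split repeatable

# Stand-in for dense_df: 1000 users x 3 jokes of random ratings in -10:10
toy_df <- as.data.frame(matrix(runif(3000, -10, 10), nrow = 1000))

which_train <- sample(x = c(TRUE, FALSE), size = nrow(toy_df),
                      replace = TRUE, prob = c(0.8, 0.2))
train_toy <- toy_df[which_train, ] + 10
test_toy  <- toy_df[!which_train, ] + 10

# Each row is drawn independently, so the share is only approximately 0.8
mean(which_train)
# Every row ends up in exactly one of the two sets
nrow(train_toy) + nrow(test_toy) == nrow(toy_df)
```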

This function computes the raw average of the user-item matrix, that is, the mean of all observed ratings.

raw_average = function(x){
  return(sum(colSums(x,na.rm = TRUE)) / length(which(!is.na(x))))
}

raw_average_train = raw_average(train_set)
raw_average_test = raw_average(test_set)
raw_average_test
## [1] 9.692964
raw_average_train
## [1] 9.615193
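
As a quick sanity check, the sketch below applies `raw_average` to a small hypothetical data frame (`toy` is made up for illustration) and compares it with taking the mean over all cells directly. Missing ratings are excluded from both the sum and the count.

```r
raw_average = function(x){
  return(sum(colSums(x, na.rm = TRUE)) / length(which(!is.na(x))))
}

# Hypothetical ratings on the 0:20 scale, with two missing entries
toy <- data.frame(j1 = c(12, NA, 6),
                  j2 = c(18, 10, NA))

raw_average(toy)                    # (12 + 6 + 18 + 10) / 4 = 11.5
mean(as.matrix(toy), na.rm = TRUE)  # same quantity, computed directly
```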

This function computes the RMSE (root-mean-square error).

# Adapted from https://stackoverflow.com/questions/26237688/rmse-root-mean-square-deviation-calculation-in-r
RMSE = function(x,y){
  # as.matrix() lets mean() average over every cell when x is a data frame
  sqrt( mean (as.matrix((x-y)^2), na.rm=TRUE) )
}
RMSE(train_set,raw_average_train)
## [1] 5.243725
RMSE(test_set,raw_average_train)
## [1] 5.24164

Using the raw average of the training set as the prediction for every rating, we are off by about 5.2 points (RMSE) on the 0:20 scale, for both the training set and the test set.
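
The arithmetic behind this number can be followed on a tiny example. This is a minimal sketch on hypothetical data; `rmse_sketch` is an illustrative name, and `as.matrix()` is used so that `mean()` averages over every cell of the data frame.

```r
rmse_sketch <- function(actual, predicted) {
  sqrt(mean(as.matrix((actual - predicted)^2), na.rm = TRUE))
}

# Hypothetical ratings on the 0:20 scale, with two missing entries
toy <- data.frame(j1 = c(12, NA, 6),
                  j2 = c(18, 10, NA))
mu <- mean(as.matrix(toy), na.rm = TRUE)  # raw average: 11.5

# Errors are 0.5, -5.5, 6.5 and -1.5; their squares sum to 75,
# so the mean squared error is 75 / 4 = 18.75
rmse_sketch(toy, mu)  # sqrt(18.75), about 4.33
```

Missing ratings contribute nothing: they stay NA after the subtraction and are dropped by `na.rm = TRUE`.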

These functions compute the user bias and the item bias (the items are jokes in this case).

userBias = function(df,raw_avg){
  return(rowMeans(df,na.rm=TRUE) - raw_avg)
}
jokeBias = function(df,raw_avg){
  return(colMeans(df,na.rm=TRUE)-raw_avg)
}
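
A short worked example may help to read the two functions. The sketch below uses a hypothetical three-user, two-joke data frame: a positive user bias means the user rates above the overall average, and a positive joke bias means the joke is rated above it.

```r
userBias = function(df, raw_avg){
  return(rowMeans(df, na.rm = TRUE) - raw_avg)
}
jokeBias = function(df, raw_avg){
  return(colMeans(df, na.rm = TRUE) - raw_avg)
}

# Hypothetical ratings on the 0:20 scale; the raw average is 11.5
toy <- data.frame(j1 = c(12, NA, 6),
                  j2 = c(18, 10, NA))
mu <- 11.5

userBias(toy, mu)  # row means 15, 10, 6  ->  3.5, -1.5, -5.5
jokeBias(toy, mu)  # column means 9, 14   -> -2.5,  2.5
```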

I use the above functions to create some baseline predictions.

baseline_predictors = function(df){
  user_bias = userBias(df,raw_average(df))
  joke_bias = jokeBias(df,raw_average(df))

  df[!is.na(df)] = raw_average(df)
  
  df = df + user_bias + joke_bias
  
  return(df)
}
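
One detail is worth checking here: when a plain vector is added to a data frame, R recycles it down each column, so `df + joke_bias` lines the joke biases up with rows rather than with columns. The following is a minimal sketch, not the original implementation, of a baseline that adds each user's bias along the rows and each joke's bias along the columns with `outer()`. The name `baseline_sketch` and the toy data are hypothetical.

```r
baseline_sketch <- function(df) {
  m  <- as.matrix(df)
  mu <- mean(m, na.rm = TRUE)                  # raw average
  user_bias <- rowMeans(m, na.rm = TRUE) - mu  # one value per row
  joke_bias <- colMeans(m, na.rm = TRUE) - mu  # one value per column
  # outer() builds the full grid of user_bias[i] + joke_bias[j]
  pred <- mu + outer(user_bias, joke_bias, "+")
  pred[is.na(m)] <- NA  # keep NA where no rating exists, as in the code above
  as.data.frame(pred)
}

# Hypothetical ratings on the 0:20 scale; the raw average is 11.5
toy <- data.frame(j1 = c(12, NA, 6),
                  j2 = c(18, 10, NA))
baseline_sketch(toy)
# User 1 on joke j1: 11.5 + 3.5 + (-2.5) = 12.5
```

Leaving out the `pred[is.na(m)] <- NA` line would also give predictions for unrated jokes, which is what a recommender ultimately needs.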

Let's see what our RMSE is for the training set, and then the test set, once we account for user and item biases in the user-item matrix.

baseline_training_set = baseline_predictors(train_set)
baseline_test_set = baseline_predictors(test_set)
RMSE(baseline_training_set,raw_average_train)
## [1] 2.927845
RMSE(baseline_test_set,raw_average_test)
## [1] 2.902438
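
The calls above compare the baseline predictions with the raw average. To measure prediction error against what users actually rated, the predictions can instead be compared with the observed test ratings, using statistics estimated on the training set only. Because the split is by user, test users have no ratings in the training set, so this minimal sketch uses only the raw average and the joke bias. All names and data are hypothetical.

```r
set.seed(1)  # hypothetical seed, for repeatability only

# Hypothetical ratings on the 0:20 scale: 200 users x 4 jokes, 10% missing
ratings <- matrix(round(runif(800, 0, 20), 2), nrow = 200)
ratings[sample(length(ratings), 80)] <- NA

in_train <- sample(c(TRUE, FALSE), nrow(ratings),
                   replace = TRUE, prob = c(0.8, 0.2))
train <- ratings[in_train, , drop = FALSE]
test  <- ratings[!in_train, , drop = FALSE]

# Estimate the raw average and joke biases from the training rows only
mu        <- mean(train, na.rm = TRUE)
joke_bias <- colMeans(train, na.rm = TRUE) - mu

# Prediction for every test cell: mu plus the bias of that joke (column)
pred <- matrix(mu + joke_bias, nrow = nrow(test), ncol = ncol(test),
               byrow = TRUE)

rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}
rmse(test, mu)    # raw average vs. observed test ratings
rmse(test, pred)  # raw average + joke bias vs. observed test ratings
```

Adding a user bias to a held-out evaluation would require a different split, for example hiding some ratings of each user rather than whole users.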

Measured against the raw average, as in the calls above, the baseline predictions have an RMSE of about 2.9 points on the 20-point scale. The RMSE is very consistent between training and test data.
Below is the head of the test set's predicted baseline values; a value below 10 means the user will probably not like the item (joke).

head(baseline_test_set)
##           5         6         7         8        9        10        11
## 1 10.098028 11.477651  8.631202  9.160518       NA 10.285258  9.090197
## 2  8.824466  9.466952  6.990910  7.271156 9.134611  6.652556  8.903901
## 3 11.607441        NA 12.732181 11.537120       NA        NA 13.924574
## 4 10.035819        NA  9.417219 11.668565       NA        NA 12.231615
## 5  6.894572  5.438938  7.902403  9.282026 6.435577  6.964893  5.898421
## 6  7.742026  3.737566  7.662591  8.305077 5.829035  6.109281  7.972736
##          12        13        14        15        16        17        18
## 1  7.634563 10.098028 11.477651  8.631202  9.160518  8.094046 10.285258
## 2  4.899441  8.824466  9.466952  6.990910  7.271156  9.134611  6.652556
## 3 11.078125 11.607441 10.540969 12.732181 11.537120 10.081486 12.544952
## 4  9.755574 10.035819 11.899274  9.417219 11.668565  7.664105 11.589129
## 5  8.089633  6.894572  5.438938  7.902403  9.282026  6.435577  6.964893
## 6  5.490681  7.742026  3.737566  7.662591  8.305077  5.829035  6.109281
##          19        20
## 1  9.090197  7.634563
## 2  8.903901  4.899441
## 3 13.924574 11.078125
## 4 12.231615  9.755574
## 5  5.898421  8.089633
## 6  7.972736  5.490681