DATA 643 : Project 1 : Global Baseline Predictors and RMSE

The purpose of this recommender system is to predict which jokes an individual user will find more or less funny. We use a fraction of the Jester data set, available at http://www.ieor.berkeley.edu/~goldberg/jester-data/

The data is loaded into an R data frame, and the placeholder value 99, which marks a joke the user did not rate, is replaced with NA.
The scores are shifted from the original -10:10 scale to 0:20 for the RMSE calculations.
The remainder of this project focuses on a dense portion of the data frame: a block of jokes that most users rated.

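library(readxl)  # provides read_xls()

# Name the first column 'User' and the joke columns 1 to 100
column_names = seq(0, 100, 1)
column_names[1] = 'User'
df = read_xls(path = "jester-data-2.xls", col_names = column_names)
df[df == 99] <- NA  # 99 marks a joke the user did not rate
df$User = NULL
# Keep jokes 5 through 20, the dense block used below
dense_df = subset(df, select = (5:20))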

The following code splits the users (rows) at random into a training set (roughly 80%) and a test set (roughly 20%), and shifts the ratings by +10 onto the 0:20 scale.

# This split was taken from 'Building a Recommendation System with R'
which_train <- sample(x = c(TRUE, FALSE), size = nrow(dense_df),
                      replace = TRUE, prob = c(0.8, 0.2))
# Shift ratings from -10:10 to 0:20
train_set = dense_df[which_train, ] + 10
test_set = dense_df[!which_train, ] + 10

This function computes the raw average of the user-item matrix, that is, the mean of all observed (non-NA) ratings.

raw_average = function(x){
  return(sum(colSums(x,na.rm = TRUE)) / length(which(!is.na(x))))
}

raw_average_train = raw_average(train_set)
raw_average_test = raw_average(test_set)
raw_average_test

## [1] 9.692964

raw_average_train

## [1] 9.615193
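
As a quick check of what the raw average measures, the sketch below applies the same computation to a small made-up matrix and confirms that it matches the mean of the observed cells. The `toy` values are hypothetical and are not taken from the Jester data.

```r
# Same computation as raw_average above, repeated so the block runs on its own
raw_average <- function(x) {
  sum(colSums(x, na.rm = TRUE)) / length(which(!is.na(x)))
}

# Hypothetical 3 x 3 user-item ratings on the 0:20 scale; NA = unrated
toy <- data.frame(j1 = c(12, 8, NA),
                  j2 = c(15, NA, 9),
                  j3 = c(NA, 6, 10))

toy_avg <- raw_average(toy)

# Equivalent formulation: the mean of every observed cell
toy_avg_check <- mean(as.matrix(toy), na.rm = TRUE)
```

Six cells are rated and they sum to 60, so both expressions give 10 for this toy matrix.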

This function computes the root-mean-square error (RMSE) between two sets of values, ignoring missing entries.

#found from ('https://stackoverflow.com/questions/26237688/rmse-root-mean-square-deviation-calculation-in-r')
RMSE = function(x,y){
  # as.matrix() lets mean() average over every cell of a data frame
  sqrt( mean (as.matrix((x-y)^2), na.rm=TRUE) )
}
RMSE(train_set,raw_average_train)

## [1] 5.243725

RMSE(test_set,raw_average_train)

## [1] 5.24164

Using the raw average of the training set as the prediction for every rating, we are off by about 5.2 points (RMSE on the 0:20 scale) from what a user actually rated a specific joke, on both the training set and the test set.
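
To make the RMSE figure concrete, here is a minimal sketch on made-up numbers. It predicts the raw average for every rated cell and measures the error as sqrt(mean((actual - predicted)^2)). The `toy` matrix and the lowercase `rmse` helper are illustrative assumptions, not part of the project code.

```r
# RMSE over observed cells; as.matrix() lets mean() accept a data frame
rmse <- function(actual, predicted) {
  sqrt(mean(as.matrix((actual - predicted)^2), na.rm = TRUE))
}

# Hypothetical ratings on the 0:20 scale; NA = unrated
toy <- data.frame(j1 = c(12, 8, NA),
                  j2 = c(15, NA, 9),
                  j3 = c(NA, 6, 10))

# Predict the raw average (10 for this toy data) for every cell
toy_avg  <- mean(as.matrix(toy), na.rm = TRUE)
toy_rmse <- rmse(toy, toy_avg)
```

The six errors are 2, -2, 5, -1, -4 and 0, so the squared errors sum to 50 and the RMSE is sqrt(50 / 6), roughly 2.89.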

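These functions compute the user bias and the item bias (the items are jokes in this case): how far each user's or each joke's mean rating sits from the raw average.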

userBias = function(df,raw_avg){
  return(rowMeans(df,na.rm=TRUE) - raw_avg)
}
jokeBias = function(df,raw_avg){
  return(colMeans(df,na.rm=TRUE)-raw_avg)
}
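
The following sketch shows what the two bias vectors look like on a small made-up matrix. The `toy` values are hypothetical; the calls mirror `userBias` and `jokeBias` above.

```r
# Hypothetical ratings on the 0:20 scale; NA = unrated
toy <- data.frame(j1 = c(12, 8, NA),
                  j2 = c(15, NA, 9),
                  j3 = c(NA, 6, 10))

toy_avg <- mean(as.matrix(toy), na.rm = TRUE)        # raw average: 10

# One value per user (row): the user's mean rating minus the raw average
user_bias <- rowMeans(toy, na.rm = TRUE) - toy_avg   # 3.5, -3.0, -0.5

# One value per joke (column): the joke's mean rating minus the raw average
joke_bias <- colMeans(toy, na.rm = TRUE) - toy_avg   # 0, 2, -2
```

A positive user bias marks a user who rates above the raw average overall; a negative joke bias marks a joke that is rated below it.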

I use the above functions to create baseline predictions: the raw average plus the user bias plus the joke bias.

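baseline_predictors = function(df){
  # Biases are measured against the raw average of the same matrix
  user_bias = userBias(df,raw_average(df))
  joke_bias = jokeBias(df,raw_average(df))

  # Start every observed cell at the raw average; NA cells stay NA
  df[!is.na(df)] = raw_average(df)

  # Both vectors are recycled down the columns by this addition:
  # user_bias (one value per row) lines up with the rows, while
  # joke_bias is not matched to the columns by name
  df = df + user_bias + joke_bias

  return(df)
}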
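
One way to make sure each bias lands on the intended axis is to build the prediction grid explicitly with `outer()`, so that cell [u, j] holds raw average + user bias u + joke bias j. The sketch below is an illustration under that assumption; the name `baseline_grid` and the `toy` data are hypothetical, and this is not the code that produced the numbers in this report.

```r
# Baseline prediction for user u and joke j: mu + b_u + b_j
baseline_grid <- function(ratings) {
  m   <- as.matrix(ratings)
  mu  <- mean(m, na.rm = TRUE)            # raw average
  b_u <- rowMeans(m, na.rm = TRUE) - mu   # one bias per user (row)
  b_j <- colMeans(m, na.rm = TRUE) - mu   # one bias per joke (column)
  # outer() places b_u[i] + b_j[j] in cell [i, j]
  pred <- mu + outer(b_u, b_j, "+")
  pred[is.na(m)] <- NA                    # keep unrated cells as NA
  dimnames(pred) <- dimnames(m)
  pred
}

# Hypothetical ratings on the 0:20 scale; NA = unrated
toy <- data.frame(j1 = c(12, 8, NA),
                  j2 = c(15, NA, 9),
                  j3 = c(NA, 6, 10))

toy_pred <- baseline_grid(toy)
```

For the toy data the raw average is 10, user 1 has a bias of 3.5 and joke `j2` has a bias of 2, so the prediction for that cell is 15.5.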

Let's see what the RMSE is for the training set, and then the test set, once we account for user and item biases in our user-item matrix.

baseline_training_set = baseline_predictors(train_set)
baseline_test_set = baseline_predictors(test_set)
RMSE(baseline_training_set,raw_average_train)

## [1] 2.927845

RMSE(baseline_test_set,raw_average_test)

## [1] 2.902438

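With user and joke biases added, the reported RMSE is about 2.9 points on the 0:20 scale, and it is very consistent between the training data (2.93) and the test data (2.90). Note that the two calls above compare the baseline predictions with the raw average, so these figures describe how far the bias terms move a prediction away from the raw average; scoring the predictions against the observed ratings would instead use RMSE(train_set, baseline_training_set) and RMSE(test_set, baseline_test_set).
Below is the head of the test set's predicted baseline values; a value below 10, the midpoint of the 0:20 scale, means the user will probably not like the item (joke).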

head(baseline_test_set)

##           5         6         7         8        9        10        11
## 1 10.098028 11.477651  8.631202  9.160518       NA 10.285258  9.090197
## 2  8.824466  9.466952  6.990910  7.271156 9.134611  6.652556  8.903901
## 3 11.607441        NA 12.732181 11.537120       NA        NA 13.924574
## 4 10.035819        NA  9.417219 11.668565       NA        NA 12.231615
## 5  6.894572  5.438938  7.902403  9.282026 6.435577  6.964893  5.898421
## 6  7.742026  3.737566  7.662591  8.305077 5.829035  6.109281  7.972736
##          12        13        14        15        16        17        18
## 1  7.634563 10.098028 11.477651  8.631202  9.160518  8.094046 10.285258
## 2  4.899441  8.824466  9.466952  6.990910  7.271156  9.134611  6.652556
## 3 11.078125 11.607441 10.540969 12.732181 11.537120 10.081486 12.544952
## 4  9.755574 10.035819 11.899274  9.417219 11.668565  7.664105 11.589129
## 5  8.089633  6.894572  5.438938  7.902403  9.282026  6.435577  6.964893
## 6  5.490681  7.742026  3.737566  7.662591  8.305077  5.829035  6.109281
##          19        20
## 1  9.090197  7.634563
## 2  8.903901  4.899441
## 3 13.924574 11.078125
## 4 12.231615  9.755574
## 5  5.898421  8.089633
## 6  7.972736  5.490681
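
Predictions like these can also be scored directly against the observed ratings. The sketch below does that on a small made-up matrix, comparing the raw-average predictor with the bias-adjusted one. All names and values are hypothetical, and the scores are in-sample, since the biases are estimated from the same cells that are scored.

```r
# RMSE over observed cells only (NA cells are dropped)
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Hypothetical ratings on the 0:20 scale; NA = unrated
toy <- data.frame(j1 = c(12, 8, NA),
                  j2 = c(15, NA, 9),
                  j3 = c(NA, 6, 10))
m <- as.matrix(toy)

mu   <- mean(m, na.rm = TRUE)             # raw average
b_u  <- rowMeans(m, na.rm = TRUE) - mu    # user biases
b_j  <- colMeans(m, na.rm = TRUE) - mu    # joke biases
pred <- mu + outer(b_u, b_j, "+")         # mu + b_u + b_j per cell

rmse_raw      <- rmse(m, mu)              # predict the raw average everywhere
rmse_baseline <- rmse(m, pred)            # predict raw average plus both biases
```

On this toy matrix the raw-average RMSE is sqrt(50 / 6), about 2.89, and the bias-adjusted RMSE is sqrt(17 / 6), about 1.68.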


Michael Muller

June 10, 2017
