DATA612 - Project #1

Libraries used

require(dplyr)
require(tidyr)
require(caTools)
require(knitr)
require(kableExtra)

Describing the Recommender System

The recommender system that I’ll be building for this project will recommend data science books to readers.

Loading the dataset

I created a dummy dataset and uploaded it to my GitHub account. The dataset contains ratings from 15 users for seven different data science books, and includes some missing data. To read this file into R, I used the read.csv() function and stored the result in a data frame called bookdf:

bookdf <- read.csv('https://raw.githubusercontent.com/zachalexander/data612_cuny/master/Project1/book_recommendations.csv')

As we can see from looking at the dimensions of the data frame below, it has 15 rows of user ratings, and 8 columns (1 column shows the userID, which I’ll drop a little later):

dim(bookdf)

## [1] 15  8

We can take a look at the ratings data frame below:

kable(bookdf, 'html') %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 10)

User	Data.Science.for.Business	R.for.Data.Science	Super.Forecasting	Applied.Text.Analysis.with.Python	Applied.Predictive.Modeling	Data.Science.from.Scratch	Thinking..Fast.and.Slow
User1	5	NA	3	2	2	2	1
User2	NA	3	4	3	2	1	NA
User3	2	2	5	3	3	NA	1
User4	NA	NA	5	5	NA	1	NA
User5	5	3	4	1	3	NA	NA
User6	3	5	3	1	2	1	1
User7	3	4	NA	1	NA	1	3
User8	NA	1	NA	NA	3	NA	2
User9	3	NA	2	1	NA	1	NA
User10	4	2	3	NA	3	5	2
User11	NA	2	2	3	4	NA	4
User12	5	NA	5	2	NA	5	3
User13	NA	NA	NA	3	3	2	NA
User14	4	2	NA	2	NA	NA	2
User15	3	5	5	NA	3	2	2

Split data into training and test dataset

With the data frame loaded in R, I can split it into training and test datasets. I’ll also convert the resulting data frames into matrices to make later calculations easier. The code below performs a 70/30 train/test split by user, selects only the seven rating columns, and converts each data frame into a matrix:

set.seed(123)
bookdf$split <- sample.split(bookdf$User, SplitRatio = 0.7)

bookdf_train <- bookdf %>% 
  filter(split == TRUE) %>% 
  select(Data.Science.for.Business, R.for.Data.Science, Super.Forecasting, Applied.Text.Analysis.with.Python, Applied.Predictive.Modeling, Data.Science.from.Scratch, Thinking..Fast.and.Slow)

bookdf_test <- bookdf %>% 
  filter(split != TRUE) %>% 
  select(Data.Science.for.Business, R.for.Data.Science, Super.Forecasting, Applied.Text.Analysis.with.Python, Applied.Predictive.Modeling, Data.Science.from.Scratch, Thinking..Fast.and.Slow)

bookdf_train <- data.matrix(bookdf_train, rownames.force = NA)
bookdf_test <- data.matrix(bookdf_test, rownames.force = NA)
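
For readers without caTools, a similar user-level 70/30 split can be sketched in base R. This is an alternative to sample.split(), not the method used above, and it will generally pick different rows. The users vector is a stand-in for bookdf$User.

```r
# Base R sketch of a 70/30 split by user (an alternative to caTools::sample.split).
set.seed(123)

# Stand-in for bookdf$User: the 15 user labels from the table above.
users <- paste0("User", 1:15)

# Draw 70% of the row indices (rounded down) for training.
n_train   <- floor(0.7 * length(users))
train_idx <- sort(sample(seq_along(users), size = n_train))

train_users <- users[train_idx]   # users whose rows go to the training set
test_users  <- users[-train_idx]  # everyone else goes to the test set

length(train_users)  # 10
length(test_users)   # 5
```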

Compute the raw average of the training dataset

Before we compute the raw averages, here’s a quick look at my training and testing matrices:

To save space, I renamed the column names, but the key for each book is below:

Data Science for Business = 1
R for Data Science = 2
Super Forecasting = 3
Applied Text Analysis with Python = 4
Applied Predictive Modeling = 5
Data Science from Scratch = 6
Thinking, Fast and Slow = 7

Training matrix

colnames(bookdf_train) <- c(1, 2, 3, 4, 5, 6, 7)
bookdf_train

##        1  2  3  4  5  6  7
##  [1,]  5 NA  3  2  2  2  1
##  [2,]  2  2  5  3  3 NA  1
##  [3,]  3  5  3  1  2  1  1
##  [4,]  3  4 NA  1 NA  1  3
##  [5,]  3 NA  2  1 NA  1 NA
##  [6,]  4  2  3 NA  3  5  2
##  [7,]  5 NA  5  2 NA  5  3
##  [8,] NA NA NA  3  3  2 NA
##  [9,]  4  2 NA  2 NA NA  2
## [10,]  3  5  5 NA  3  2  2

Testing matrix

colnames(bookdf_test) <- c(1, 2, 3, 4, 5, 6, 7)
bookdf_test

##       1  2  3  4  5  6  7
## [1,] NA  3  4  3  2  1 NA
## [2,] NA NA  5  5 NA  1 NA
## [3,]  5  3  4  1  3 NA NA
## [4,] NA  1 NA NA  3 NA  2
## [5,] NA  2  2  3  4 NA  4

Now, with the data split accordingly, I can take the raw average (mean) of the training dataset:

raw_avg <- mean(bookdf_train, na.rm = TRUE)

I’ve found that the raw average of the training dataset is 2.75. We can use this value to create two constant matrices, one with the dimensions of the training matrix and one with the dimensions of the testing matrix, in which every entry is 2.75. We will use them in later calculations.

avg_train <- matrix(raw_avg, nrow=10, ncol=7, byrow=TRUE)
avg_test <- matrix(raw_avg, nrow=5, ncol=7, byrow=TRUE)
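
As a cross-check, the raw average can be reproduced by hand as the sum of the observed ratings divided by their count. The sketch below retypes the training matrix from the printed output. The names train, total_rating, n_rated, raw_avg_check and avg_check are illustrative and do not appear in the original code.

```r
# Training matrix copied from the printed output above (NA = no rating).
train <- matrix(c(
   5, NA,  3,  2,  2,  2,  1,
   2,  2,  5,  3,  3, NA,  1,
   3,  5,  3,  1,  2,  1,  1,
   3,  4, NA,  1, NA,  1,  3,
   3, NA,  2,  1, NA,  1, NA,
   4,  2,  3, NA,  3,  5,  2,
   5, NA,  5,  2, NA,  5,  3,
  NA, NA, NA,  3,  3,  2, NA,
   4,  2, NA,  2, NA, NA,  2,
   3,  5,  5, NA,  3,  2,  2
), nrow = 10, ncol = 7, byrow = TRUE)

total_rating  <- sum(train, na.rm = TRUE)  # 143
n_rated       <- sum(!is.na(train))        # 52
raw_avg_check <- total_rating / n_rated    # 2.75

# Constant prediction matrix sized from the data rather than hard-coded 10 and 7.
avg_check <- matrix(raw_avg_check, nrow = nrow(train), ncol = ncol(train))
```

Sizing the constant matrix with nrow() and ncol() means the code keeps working if the split ratio changes.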

The constant matrices avg_train and avg_test are our first approximation of users’ ratings. However, they do not factor in any biases and are therefore only rough estimates. To measure the error we incur by using them for our recommendations, we can calculate the root mean squared error (RMSE).

RMSE calculations

R can compute the RMSE of a matrix in a single line, but since the testing dataset has only 5 users, I’ll first demonstrate the calculation by hand.

First, I subtracted the raw average computed above (2.75) from each available rating. I then squared each difference and saved these values in a new matrix called test. Next, I averaged the values in the test matrix: since there were 21 available ratings, I used 21 as the denominator. Finally, I took the square root of this mean.

test <- matrix(c(NA, (3-2.75)^2, (4-2.75)^2, (3-2.75)^2, (2-2.75)^2, (1-2.75)^2, NA,
         NA, NA, (5-2.75)^2, (5-2.75)^2, NA, (1-2.75)^2, NA,
         (5-2.75)^2, (3-2.75)^2, (4-2.75)^2, (1-2.75)^2, (3-2.75)^2, NA, NA,
         NA, (1-2.75)^2, NA, NA, (3-2.75)^2, NA, (2-2.75)^2,
         NA, (2-2.75)^2, (2-2.75)^2, (3-2.75)^2, (4-2.75)^2, NA, (4-2.75)^2), nrow = 5, ncol = 7, byrow = TRUE)

# Sum the 21 squared differences stored in test and divide by the number of available ratings
mean_test <- sum(test, na.rm = TRUE) / 21

paste0('The RMSE of the test matrix is ', round(sqrt(mean_test), 2))

## [1] "The RMSE of the test matrix is 1.31"

Faster calculations of RMSE

To check this calculation, I can perform the same calculations using the shorthand syntax below:

rmse_test <- sqrt(mean((bookdf_test - avg_test)^2, na.rm = TRUE))
paste0('We can see that we also get the same RMSE for the test matrix of ', round(rmse_test, 2))

## [1] "We can see that we also get the same RMSE for the test matrix of 1.31"
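
The one-line calculation can also be wrapped in a small helper so that the same formula is reused for every predictor. This is a sketch: rmse() and test_ratings are illustrative names that do not appear in the original code, and the matrix is retyped from the printed testing matrix.

```r
# RMSE over the observed (non-NA) entries only.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Testing matrix copied from the printed output above (NA = no rating).
test_ratings <- matrix(c(
  NA,  3,  4,  3,  2,  1, NA,
  NA, NA,  5,  5, NA,  1, NA,
   5,  3,  4,  1,  3, NA, NA,
  NA,  1, NA, NA,  3, NA,  2,
  NA,  2,  2,  3,  4, NA,  4
), nrow = 5, ncol = 7, byrow = TRUE)

# A scalar prediction is recycled across the whole matrix.
round(rmse(test_ratings, 2.75), 2)  # 1.31
```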

We can do the same RMSE calculation on our training matrix:

rmse_train <- sqrt(mean((bookdf_train - avg_train)^2, na.rm = TRUE))
paste0('The RMSE of the train matrix is ', round(rmse_train, 3))

## [1] "The RMSE of the train matrix is 1.299"

With our initial RMSE calculations completed, we can see that the error is substantial: roughly 1.3 points on a 1-to-5 rating scale. To lower it, we can calculate a bias for each book and each user in the matrices.

Finding User Bias

To find the user bias in ratings, we can take the mean of each user’s ratings and subtract our raw average of 2.75 from it. The for loop below stores these values in a vector, which I then convert into a one-column user-bias matrix. I computed these values for both our training and testing matrices:

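user_bias <- c()
# raw_avg (2.75) was computed from the training matrix above, so it is reused here
for(i in 1:nrow(bookdf_train)){
  # bias = this user's mean rating minus the raw average
  user_bias[i] <- (mean(bookdf_train[i, ], na.rm = TRUE) - raw_avg)
}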

user_bias_train <- matrix(user_bias, nrow = 10, ncol = 1)
user_bias_train

##              [,1]
##  [1,] -0.25000000
##  [2,] -0.08333333
##  [3,] -0.46428571
##  [4,] -0.35000000
##  [5,] -1.00000000
##  [6,]  0.41666667
##  [7,]  1.25000000
##  [8,] -0.08333333
##  [9,] -0.25000000
## [10,]  0.58333333

user_bias <- c()
for(i in 1:length(bookdf_test[,1])){
  user_bias[i] <- (mean(bookdf_test[i, ], na.rm = TRUE) - raw_avg)
}

user_bias_test <- matrix(user_bias, nrow = 5, ncol = 1)
user_bias_test

##            [,1]
## [1,] -0.1500000
## [2,]  0.9166667
## [3,]  0.4500000
## [4,] -0.7500000
## [5,]  0.2500000

As we can see, values above zero indicate a user who rates books higher than the raw average, and values below zero indicate a user who rates them lower. The user in row 7 of the training matrix (User12 in the original data frame) is a relatively positive rater, with a bias of 1.25, whereas the user in row 4 of the testing matrix (User8) is a relatively negative rater, with a bias of -0.75. We can use this information later as we continue to build our recommender system.
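
As a worked example, the two rows with the most extreme biases can be checked directly. The sketch below retypes row 7 of the training matrix and row 4 of the testing matrix, and also shows that rowMeans() gives the same values without a loop. The names train_row7, test_row4 and bias_vectorized are illustrative.

```r
raw_avg <- 2.75

train_row7 <- c( 5, NA,  5,  2, NA,  5,  3)  # row 7 of the training matrix
test_row4  <- c(NA,  1, NA, NA,  3, NA,  2)  # row 4 of the testing matrix

# Bias = mean of the observed ratings minus the raw average.
mean(train_row7, na.rm = TRUE) - raw_avg  # 20 / 5 - 2.75 =  1.25
mean(test_row4,  na.rm = TRUE) - raw_avg  #  6 / 3 - 2.75 = -0.75

# Vectorized equivalent of the for loop: one bias per matrix row.
bias_vectorized <- rowMeans(rbind(train_row7, test_row4), na.rm = TRUE) - raw_avg
bias_vectorized
```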

Finding Book Bias

Next, we can do the same calculations for each book to find how far its average rating sits above or below the raw average. I use the same for loop as above, but compute averages down the columns for each book instead of across the rows for each user. I ran these calculations on both the training and test matrices and stored the results in book-bias matrices:

book_bias <- c()
for(i in 1:length(bookdf_train[1,])){
  book_bias[i] <- (mean(bookdf_train[,i ], na.rm = TRUE) - raw_avg)
}

book_bias_train <- matrix(book_bias, nrow = 1, ncol = 7)
book_bias_train

##           [,1]      [,2]      [,3]   [,4]        [,5]   [,6]   [,7]
## [1,] 0.8055556 0.5833333 0.9642857 -0.875 -0.08333333 -0.375 -0.875

book_bias <- c()
for(i in 1:length(bookdf_test[1,])){
  book_bias[i] <- (mean(bookdf_test[,i ], na.rm = TRUE) - raw_avg)
}

book_bias_test <- matrix(book_bias, nrow = 1, ncol = 7)
book_bias_test

##      [,1] [,2] [,3] [,4] [,5]  [,6] [,7]
## [1,] 2.25 -0.5    1 0.25 0.25 -1.75 0.25

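As we can see from our above calculations, “Super Forecasting” is well liked relative to the other books, with the highest bias in the training matrix (0.96) and a positive bias in the testing matrix (1.00). “Data Science from Scratch” has the lowest bias in the testing matrix (-1.75) and is also below average in the training matrix (-0.375), although the lowest training values belong to “Applied Text Analysis with Python” and “Thinking, Fast and Slow” (both -0.875).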
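
To make the training book-bias values easier to read, they can be labelled with the book titles and sorted. This sketch retypes the printed book_bias_train values into a named vector. The name book_bias_named is illustrative and does not appear in the original code.

```r
# Training book biases copied from the printed output above, labelled by title.
book_bias_named <- c(
  "Data Science for Business"         =  0.8055556,
  "R for Data Science"                =  0.5833333,
  "Super Forecasting"                 =  0.9642857,
  "Applied Text Analysis with Python" = -0.875,
  "Applied Predictive Modeling"       = -0.08333333,
  "Data Science from Scratch"         = -0.375,
  "Thinking, Fast and Slow"           = -0.875
)

# Book with the highest bias.
names(which.max(book_bias_named))  # "Super Forecasting"

# Book(s) with the lowest bias; ties are all returned.
names(book_bias_named)[book_bias_named == min(book_bias_named)]

# Full ranking from most liked to least liked.
sort(book_bias_named, decreasing = TRUE)
```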

Calculating the baseline predictors

With our biases calculated and stored in their own matrices, I wrote a nested for loop that iterates over the rows and columns, adds the user bias and book bias to our raw average of 2.75, and stores the result in a new baseline matrix for each of the training and testing datasets. I also built in logic to clip any value above 5 to 5 and any value below 1 to 1.

baseline_train <- matrix(NA, nrow = 10, ncol = 7)
for(i in 1:10){
  for(j in 1:7){
    baseline_train[i, j] <- ifelse((raw_avg + user_bias_train[i, 1] + book_bias_train[1, j]) > 5, 5, 
                           ifelse(raw_avg + user_bias_train[i, 1] + book_bias_train[1, j] < 1, 1,
                           raw_avg + user_bias_train[i, 1] + book_bias_train[1, j]))
  }
}
baseline_train

##           [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
##  [1,] 3.305556 3.083333 3.464286 1.625000 2.416667 2.125000 1.625000
##  [2,] 3.472222 3.250000 3.630952 1.791667 2.583333 2.291667 1.791667
##  [3,] 3.091270 2.869048 3.250000 1.410714 2.202381 1.910714 1.410714
##  [4,] 3.205556 2.983333 3.364286 1.525000 2.316667 2.025000 1.525000
##  [5,] 2.555556 2.333333 2.714286 1.000000 1.666667 1.375000 1.000000
##  [6,] 3.972222 3.750000 4.130952 2.291667 3.083333 2.791667 2.291667
##  [7,] 4.805556 4.583333 4.964286 3.125000 3.916667 3.625000 3.125000
##  [8,] 3.472222 3.250000 3.630952 1.791667 2.583333 2.291667 1.791667
##  [9,] 3.305556 3.083333 3.464286 1.625000 2.416667 2.125000 1.625000
## [10,] 4.138889 3.916667 4.297619 2.458333 3.250000 2.958333 2.458333

baseline_test <- matrix(NA, nrow = 5, ncol = 7)
for(i in 1:5){
  for(j in 1:7){
    baseline_test[i, j] <- ifelse((raw_avg + user_bias_test[i, 1] + book_bias_test[1, j]) > 5, 5, 
                           ifelse(raw_avg + user_bias_test[i, 1] + book_bias_test[1, j] < 1, 1,
                           raw_avg + user_bias_test[i, 1] + book_bias_test[1, j]))
  }
}
baseline_test

##      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
## [1,] 4.85 2.100000 3.600000 2.850000 2.850000 1.000000 2.850000
## [2,] 5.00 3.166667 4.666667 3.916667 3.916667 1.916667 3.916667
## [3,] 5.00 2.700000 4.200000 3.450000 3.450000 1.450000 3.450000
## [4,] 4.25 1.500000 3.000000 2.250000 2.250000 1.000000 2.250000
## [5,] 5.00 2.500000 4.000000 3.250000 3.250000 1.250000 3.250000

We now have baseline predictor matrices for both our training and testing datasets, which show rating values that combine our raw average with the user and book bias calculations.
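
The nested loop and ifelse() logic can also be written without loops. The sketch below is one possible vectorized version, not the code used above: outer() builds every user-plus-book combination, and pmin() and pmax() clip the result to the 1-to-5 scale. The bias values are retyped from the printed testing output, and the names unclipped and baseline_vec are illustrative.

```r
raw_avg   <- 2.75
user_bias <- c(-0.15, 0.9166667, 0.45, -0.75, 0.25)       # printed user_bias_test
book_bias <- c(2.25, -0.5, 1, 0.25, 0.25, -1.75, 0.25)    # printed book_bias_test

# 5 x 7 grid of raw_avg + user_bias[i] + book_bias[j].
unclipped <- raw_avg + outer(user_bias, book_bias, "+")

# Clip predictions to the valid rating range: at most 5, at least 1.
baseline_vec <- pmax(pmin(unclipped, 5), 1)

round(baseline_vec, 2)
```

The first row of baseline_vec starts at 4.85, and entries that would fall outside the scale, such as row 2, column 1 (5.92 before clipping), are set to the nearest bound.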

RMSE with baseline predictors

To check whether these baseline predictor matrices contain less error than the raw average matrices computed earlier, we can calculate the RMSE on these as well:

rmse_train_baseline <- sqrt(mean((bookdf_train - baseline_train)^2, na.rm = TRUE))
rmse_test_baseline <- sqrt(mean((bookdf_test - baseline_test)^2, na.rm = TRUE))

paste0('The RMSE value for our baseline training matrix is ', round(rmse_train_baseline, 2))

## [1] "The RMSE value for our baseline training matrix is 0.91"

paste0('The RMSE value for our baseline testing matrix is ', round(rmse_test_baseline, 2))

## [1] "The RMSE value for our baseline testing matrix is 0.89"

As we can see, these RMSE values are much lower than the RMSE values computed earlier for the raw average (0.91 versus 1.299 on training, and 0.89 versus 1.31 on testing), indicating that the matrices that factor in bias are better at predicting user ratings for the books in the dataset.

Percent improvement and takeaways

To demonstrate the actual improvement from our initial RMSE values calculated on our raw average matrices, we can perform the calculations below:

pct_improve_train <- (1 - (rmse_train_baseline/rmse_train)) * 100
paste0('The RMSE improved by ', round(pct_improve_train, 2), '% on the training dataset.')

## [1] "The RMSE improved by 29.79% on the training dataset."

pct_improve_test <- (1 - (rmse_test_baseline/rmse_test)) * 100
paste0('The RMSE improved by ', round(pct_improve_test, 2), '% on the test dataset.')

## [1] "The RMSE improved by 32.62% on the test dataset."
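
The same percentage can be expressed as a small helper. This is a sketch with an illustrative function name, pct_improvement(), and it is fed the rounded RMSE values printed above, so its results differ slightly from the 29.79% and 32.62% obtained from the unrounded values.

```r
# Percent reduction in RMSE when moving from one predictor to another.
pct_improvement <- function(rmse_before, rmse_after) {
  (1 - rmse_after / rmse_before) * 100
}

# Inputs are the rounded RMSE values printed earlier in this document.
pct_improvement(1.299, 0.91)  # training, about 29.9
pct_improvement(1.31, 0.89)   # test, about 32.1
```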

In the end, the baseline predictors reduced the prediction error by roughly 30% on the training dataset and 33% on the test dataset relative to the raw average predictors!