Movie Recommender System
The following project builds out a Movie recommender system. The dataset comes from GroupLens, a research lab in the Department of Computer Science and Engineering at the University of Minnesota. About GroupLens. GroupLens compiled data from the MovieLens web site. A user rated the movies on a scale of 1 to 5.
About the Data
For this project, I chose reviews of seven highly popular films. The dataset that I downloaded consists of the following features:
userid : the Id of the user rating the films
movieid : a unique, numerical identifier for the film
move title: the title of the film
user ratins: user ratings on a scale of 1 to 5
movie genre(s) : the genres of the films, “Action|Comedy” etc.
## 'data.frame': 29085 obs. of 5 variables:
## $ userid : int 43 43 43 43 43 43 43 120 120 120 ...
## $ movieId: int 110 2959 356 527 260 1 1307 110 2959 356 ...
## $ title : Factor w/ 7 levels "Braveheart (1995)",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ rating : num 4.5 5 5 3.5 4.5 4 4 5 5 5 ...
## $ genres : Factor w/ 7 levels "Action|Adventure|Sci-Fi",..: 3 2 5 7 1 4 6 3 2 5 ...
Data Preparation
For the purposes of this project, I removed the movieId and genres features.
Additionally, the data is in a long format. I changed it to a wide format so that the form of the dataset is a dense, user-matrix form.
Long Format
|
userid
|
title
|
rating
|
|
43
|
Braveheart (1995)
|
4.5
|
|
43
|
Fight Club (1999)
|
5.0
|
|
43
|
Forrest Gump (1994)
|
5.0
|
|
43
|
Schindler’s List (1993)
|
3.5
|
|
43
|
Star Wars: Episode IV - A New Hope (1977)
|
4.5
|
|
43
|
Toy Story (1995)
|
4.0
|
|
43
|
When Harry Met Sally… (1989)
|
4.0
|
ratings_wide <- spread(ratings, title, rating)
colnames(ratings_wide) <- c("UserId", "Braveheart", "Fight_Club", "Forrest_Gump", "Schindler's_List", "Star_Wars_Episode_IV_A_New_Hope", "Toy_Story", "When_Harry_Met_Sally")
kable(head(ratings_wide,5), caption = "Dense, User-Matrix Form") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Dense, User-Matrix Form
|
UserId
|
Braveheart
|
Fight_Club
|
Forrest_Gump
|
Schindler’s_List
|
Star_Wars_Episode_IV_A_New_Hope
|
Toy_Story
|
When_Harry_Met_Sally
|
|
43
|
4.5
|
5.0
|
5.0
|
3.5
|
4.5
|
4.0
|
4.0
|
|
120
|
5.0
|
5.0
|
5.0
|
5.0
|
5.0
|
5.0
|
5.0
|
|
171
|
4.5
|
4.5
|
4.5
|
4.0
|
4.5
|
4.5
|
4.0
|
|
426
|
4.5
|
5.0
|
4.0
|
4.5
|
4.5
|
2.5
|
3.5
|
|
431
|
4.5
|
5.0
|
3.5
|
1.5
|
4.0
|
3.0
|
1.5
|
## [1] 4155 8
I still have 4,155 which is too unruly for this project, so I trimmed it down to 100 users.
The last step in data preparation was to randomly assign “NAs” to the data as per project instructions. In the code block below, a random index number is generated between 1 and 100.
A for loop is created that does 10 iterations that randomly assigns NA values.
Dense, User-Matrix Form with NAs
|
UserId
|
Braveheart
|
Fight_Club
|
Forrest_Gump
|
Schindler’s_List
|
Star_Wars_Episode_IV_A_New_Hope
|
Toy_Story
|
When_Harry_Met_Sally
|
|
43
|
4.5
|
5.0
|
5.0
|
3.5
|
4.5
|
4.0
|
4.0
|
|
120
|
5.0
|
5.0
|
5.0
|
5.0
|
5.0
|
5.0
|
NA
|
|
171
|
4.5
|
4.5
|
4.5
|
4.0
|
4.5
|
4.5
|
4.0
|
|
426
|
4.5
|
NA
|
4.0
|
NA
|
4.5
|
NA
|
3.5
|
|
431
|
4.5
|
5.0
|
3.5
|
1.5
|
4.0
|
3.0
|
1.5
|
|
440
|
4.0
|
5.0
|
4.5
|
4.0
|
5.0
|
3.5
|
4.5
|
|
462
|
5.0
|
4.5
|
4.0
|
3.5
|
3.5
|
3.5
|
4.5
|
|
519
|
4.5
|
3.0
|
4.0
|
3.5
|
3.5
|
4.5
|
3.0
|
|
607
|
3.0
|
4.0
|
4.0
|
3.5
|
4.0
|
NA
|
3.5
|
|
707
|
NA
|
4.0
|
4.0
|
5.0
|
4.0
|
3.0
|
3.0
|
Break your ratings into separate training and test datasets
The data is split 50-50 into “train” and “test”
## [1] 50 8
## [1] 50 8
Using your training data, calculate the raw average (mean) rating for every user-item combination
User Averages
User Averages
|
|
user_avg
|
|
43
|
4.357143
|
|
171
|
4.357143
|
|
440
|
4.357143
|
|
707
|
3.833333
|
|
847
|
3.700000
|
|
896
|
4.250000
|
|
939
|
4.214286
|
|
947
|
4.500000
|
|
997
|
4.142857
|
|
1482
|
4.000000
|
Movie Averages
movie_avg <- colMeans(train[2:8], na.rm = T)
movie_avg_df <- as.data.frame(movie_avg)
rownames(movie_avg_df) <- c("Braveheart", "Fight_Club", "Forrest_Gump", "Schindler's_List", "Star_Wars_Episode_IV_A_New_Hope", "Toy_Story", "When_Harry_Met_Sally")
kable(head(movie_avg_df,10), caption = "Movie Averages") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Movie Averages
|
|
movie_avg
|
|
Braveheart
|
4.011628
|
|
Fight_Club
|
4.184783
|
|
Forrest_Gump
|
4.068182
|
|
Schindler’s_List
|
4.200000
|
|
Star_Wars_Episode_IV_A_New_Hope
|
4.114583
|
|
Toy_Story
|
3.777778
|
|
When_Harry_Met_Sally
|
3.944444
|
Raw averages for train set
For both the train and test sets, I converted them from a data.frame to a matrix while excluding the userid feature. Next, I calculated the mean for the entire matrixes and stored them in variables, train_raw_mean and test_raw_mean.
## [1] 4.044304
Raw averages for test set
## [1] 4
Calculate the RMSE for raw average for both your training data and your test data
In the code block below, I subtracted the raw mean from both the train and test data frames, then squared that result, converted that result to a matrix, took the mean, and finally took the square root of that result.
I got a train raw RMSE of 0.9528102 and a test raw RMSE of 0.9163092
## [1] 0.9528102
## [1] 0.9163092
Using your training data, calculate the bias for each user and each item.
In the next two code blocks, I subtracted the mean of the train dataset from from the average user and movie means.
User Biases - Top 10
|
userId
|
user_avg
|
|
43
|
0.3128391
|
|
171
|
0.3128391
|
|
440
|
0.3128391
|
|
707
|
-0.2109705
|
|
847
|
-0.3443038
|
|
896
|
0.2056962
|
|
939
|
0.1699819
|
|
947
|
0.4556962
|
|
997
|
0.0985533
|
|
1482
|
-0.0443038
|
We see that the users in the train set are negatively biased against Braveheart, Toy_Story, and When Harry Met Sally, and they have a positive bias to Schindler’s List and Fight Club.
Movie Biases
|
movie
|
movie_avg
|
|
Braveheart
|
-0.0326759
|
|
Fight_Club
|
0.1404788
|
|
Forrest_Gump
|
0.0238780
|
|
Schindler’s_List
|
0.1556962
|
|
Star_Wars_Episode_IV_A_New_Hope
|
0.0702795
|
|
Toy_Story
|
-0.2665260
|
|
When_Harry_Met_Sally
|
-0.0998594
|
From the raw average, and the appropriate user and item biases, calculate the baseline predictorsfor every user-item combination
The function below, create_baseline_predictors_df, takes in an item_bias and user_bias dataframes as well as the raw mean. It then creates and populate a data frame, baseline_predictors_df, with the raw mean plus the user and item biases. Additionally, it forces scores above 5 to be five and scores below 1 to be 1.
Baseline Predictors
|
|
Braveheart
|
Fight_Club
|
Forrest_Gump
|
Schindler’s_List
|
Star_Wars_Episode_IV_A_New_Hope
|
Toy_Story
|
When_Harry_Met_Sally
|
|
43
|
4.324467
|
4.497622
|
4.381021
|
4.512839
|
4.427422
|
4.090617
|
4.257283
|
|
171
|
4.324467
|
4.497622
|
4.381021
|
4.512839
|
4.427422
|
4.090617
|
4.257283
|
|
440
|
4.324467
|
4.497622
|
4.381021
|
4.512839
|
4.427422
|
4.090617
|
4.257283
|
|
707
|
3.800657
|
3.973812
|
3.857211
|
3.989030
|
3.903613
|
3.566807
|
3.733474
|
|
847
|
3.667324
|
3.840479
|
3.723878
|
3.855696
|
3.770280
|
3.433474
|
3.600141
|
|
896
|
4.217324
|
4.390479
|
4.273878
|
4.405696
|
4.320279
|
3.983474
|
4.150141
|
|
939
|
4.181610
|
4.354764
|
4.238164
|
4.369982
|
4.284565
|
3.947760
|
4.114426
|
|
947
|
4.467324
|
4.640479
|
4.523878
|
4.655696
|
4.570279
|
4.233474
|
4.400141
|
|
997
|
4.110181
|
4.283336
|
4.166735
|
4.298553
|
4.213137
|
3.876331
|
4.042998
|
|
1482
|
3.967324
|
4.140479
|
4.023878
|
4.155696
|
4.070279
|
3.733474
|
3.900141
|
Calculate the RMSE for the baseline predictors for both your training data and your test data
Here, I found something odd. Even though the RMSE improved for the training set, it got much worse for the test set.
## [1] 0.6759492
## [1] 1.154198
Summary
Accounting for usr bias improved the RMSE over using just the raw averages. I saw a 29% improvement of the RMSE.
## [1] 0.2905731
However, the same cannot be said for the test set. Here, I saw a 26% decline in the RMSE from using the raw average to adding in the item-user bias.
## [1] -0.2596158
I don’t have a ready explanation as to why, so I checked the movie averages and biases for the test set. I found that the test set had a stronger negative bias towards “When Harry Met Sally” than the train set. Also, the train set had a stronger negative bias against Toy Story than the train set. These may account for the train set’s RMSE decline.
movie_avg2 <- colMeans(test[2:8], na.rm = T)
movie_avg_df2 <- as.data.frame(movie_avg2)
rownames(movie_avg_df) <- c("Braveheart", "Fight_Club", "Forrest_Gump", "Schindler's_List", "Star_Wars_Episode_IV_A_New_Hope", "Toy_Story", "When_Harry_Met_Sally")
kable(head(movie_avg_df2,10), caption = "Movie Averages") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Movie Averages
|
|
movie_avg2
|
|
Braveheart
|
3.916667
|
|
Fight_Club
|
4.233333
|
|
Forrest_Gump
|
4.032609
|
|
Schindler’s_List
|
4.211111
|
|
Star_Wars_Episode_IV_A_New_Hope
|
4.093023
|
|
Toy_Story
|
4.000000
|
|
When_Harry_Met_Sally
|
3.522222
|
Movie Biases
|
movie
|
movie_avg2
|
|
Braveheart
|
-0.0833333
|
|
Fight_Club
|
0.2333333
|
|
Forrest_Gump
|
0.0326087
|
|
Schindler’s_List
|
0.2111111
|
|
Star_Wars_Episode_IV_A_New_Hope
|
0.0930233
|
|
Toy_Story
|
0.0000000
|
|
When_Harry_Met_Sally
|
-0.4777778
|
---
title: "CUNY DATA 612 Project One Summer 2020"
author: "John K. Hancock"
date: "6/3/2020"
output:
  html_document:
    code_download: yes
    code_folding: hide
    highlight: pygments
    number_sections: yes
    theme: paper
    toc: yes
    toc_float: yes
  pdf_document:
    toc: no
 
---

```{r, include=FALSE}
library(reshape2)
library(tidyr)
library(caTools)
library(kableExtra)
library(rmdformats)
```

## Movie Recommender System

The following project builds out a Movie recommender system.  The dataset comes from GroupLens, a research lab in the Department of Computer Science and Engineering at the University of Minnesota. [About GroupLens](https://grouplens.org/about/what-is-grouplens/).  GroupLens compiled data from the [MovieLens web site](https://movielens.org/).  A user rated the movies on a scale of 1 to 5. 

## About the Data

```{r, include=FALSE}
ratings <- read.csv("https://raw.githubusercontent.com/JohnKHancock/raw.github/master/CUNY%20DATA612/ML_Reviews.csv")
```

For this project, I chose reviews of seven highly popular films. The dataset that I downloaded consists of the following features:

userid : the Id of the user rating the films <br>
movieid : a unique, numerical identifier for the film <br>
move title: the title of the film <br>
user ratins: user ratings on a scale of 1 to 5 <br>
movie genre(s) : the genres of the films, "Action|Comedy" etc.<br> 

```{r}
str(ratings)
```



## Data Preparation


For the purposes of this project, I removed the movieId and genres features. 



```{r, include=FALSE}
ratings$genres <- NULL
ratings$movieId <- NULL
```

Additionally, the data is in a long format.  I changed it to a wide format so that the form of the dataset is a dense, user-matrix form. 

```{r}
kable(head(ratings,7), caption = "Long Format") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```


```{r, warning=FALSE}
ratings_wide <- spread(ratings, title, rating)
colnames(ratings_wide) <- c("UserId", "Braveheart", "Fight_Club", "Forrest_Gump", "Schindler's_List", "Star_Wars_Episode_IV_A_New_Hope", "Toy_Story", "When_Harry_Met_Sally")

kable(head(ratings_wide,5), caption = "Dense, User-Matrix Form") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```

```{r}
dim(ratings_wide)
```
I still have 4,155 which is too unruly for this project, so I trimmed it down to 100 users.


```{r}
top_100 <- ratings_wide[1:100,]
```


The last step in data preparation was to randomly assign "NAs" to the data as per project instructions. In the code block below, a random index number is generated between 1 and 100.  

```{r}
get_Index <-function(){
        return(floor(runif(1,min=0, max=101)))
  
  }
```

A for loop is created that does 10 iterations that randomly assigns NA values. 

```{r}
#Randomly assign NA
set.seed(123)
for (i in 1:10){
  top_100$Braveheart[get_Index()] = NA
  top_100$Fight_Club[get_Index()] = NA
  top_100$Forrest_Gump[get_Index()] = NA
  top_100$`Schindler's_List`[get_Index()] = NA
  top_100$Star_Wars_Episode_IV_A_New_Hope[get_Index()] = NA
  top_100$Toy_Story[get_Index()] = NA
  top_100$When_Harry_Met_Sally[get_Index()] = NA
  
  
  
  
}




```


```{r}

kable(head(top_100,10), caption = "Dense, User-Matrix Form with NAs") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

```


## Break your ratings into separate training and test datasets

The data is split 50-50 into "train" and "test"

```{r}
#Split the data into Training and Test Set using the caTools package
set.seed(123)
sample = sample.split(top_100$UserId, SplitRatio = .5)
train = subset(top_100, sample == TRUE)
test  = subset(top_100, sample == FALSE) 
```


```{r}
dim(train)
```
```{r}
dim(test)
```
## Using your training data, calculate the raw average (mean) rating for every user-item combination

### User Averages

```{r}
user_avg <- rowMeans(train[2:8], na.rm = T)
user_avg_df <- as.data.frame(user_avg)
rownames(user_avg_df) <- train$UserId

kable(head(user_avg_df,10), caption = "User Averages") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```







### Movie Averages
```{r}
movie_avg <-  colMeans(train[2:8], na.rm = T)
movie_avg_df <- as.data.frame(movie_avg)
rownames(movie_avg_df) <- c("Braveheart", "Fight_Club", "Forrest_Gump", "Schindler's_List", "Star_Wars_Episode_IV_A_New_Hope", "Toy_Story", "When_Harry_Met_Sally")

kable(head(movie_avg_df,10), caption = "Movie Averages") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

```




### Raw averages for train set

For both the train and test sets, I converted them from a data.frame to a matrix while excluding the userid feature. Next, I calculated the mean for the entire matrixes and stored them in variables, train_raw_mean and test_raw_mean. 

```{r}
train_matrix <- as.matrix(train[2:8])
train_raw_mean <- mean(train_matrix, na.rm = T)
train_raw_mean
```
### Raw averages for test set

```{r}
test_matrix <- as.matrix(test[2:8])
test_raw_mean <-mean(test_matrix, na.rm = T)
test_raw_mean
```

## Calculate the RMSE for raw average for both your training data and your test data

In the code block below, I subtracted the raw mean from both the train and test data frames, then squared that result, converted that result to a matrix, took the mean, and finally took the square root of that result. 

I got a train raw RMSE of 0.9528102 and a test raw RMSE of 0.9163092

```{r}
train_raw_RMSE <- sqrt(mean(as.matrix((train[2:8]-train_raw_mean)^2),na.rm = T))
train_raw_RMSE
```

```{r}
test_raw_RMSE <- sqrt(mean(as.matrix((test[2:8]-test_raw_mean)^2),na.rm = T))
test_raw_RMSE
```


## Using your training data, calculate the bias for each user and each item.

In the next two code blocks, I subtracted the mean of the train dataset from from the average user and movie means.  

```{r}
user_bias_df <- user_avg_df-train_raw_mean 
user_bias_df <- cbind("userId" = rownames(user_bias_df), user_bias_df)
rownames(user_bias_df) <- NULL

kable(head(user_bias_df,10), caption = "User Biases - Top 10") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

```

We see that the users in the train set are negatively biased against Braveheart, Toy_Story, and When Harry Met Sally, and they have a positive bias to Schindler's List and Fight Club. 


```{r}
movie_bias_df <- movie_avg_df-train_raw_mean 
movie_bias_df <- cbind("movie" = rownames(movie_bias_df), movie_bias_df)
rownames(movie_bias_df) <- NULL


kable(head(movie_bias_df,10), caption = "Movie Biases ") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

```

## From the raw average, and the appropriate user and item biases, calculate the baseline predictorsfor every user-item combination

The function below, create_baseline_predictors_df, takes in an item_bias and user_bias dataframes as well as the raw mean. It then creates and populate a data frame, baseline_predictors_df, with the raw mean plus the user and item biases. Additionally, it forces scores above 5 to be five and scores below 1 to be 1. 

```{r}
create_baseline_predictors_df <- function(item_Bias, user_Bias, raw_mean){
        baseline_predictors_df <- data.frame()
        
        for (i in 1:nrow(user_Bias)){
          arry <- c(raw_mean + user_Bias[i,2] + item_Bias[2])
          arry <- arry[[1]]
          arry[arry < 1] <- 1.00
          arry[arry > 5] <- 5.00
          baseline_predictors_df <- rbind(baseline_predictors_df,arry)
          }
        
        return(baseline_predictors_df)
  }
```



```{r}
baseline_predictors_df<- create_baseline_predictors_df(movie_bias_df, user_bias_df, train_raw_mean) 

colnames(baseline_predictors_df) <- as.character(movie_bias_df$movie)
rownames(baseline_predictors_df) <- user_bias_df$userId

kable(head(baseline_predictors_df,10), caption = "Baseline Predictors ") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```
## Calculate the RMSE for the baseline predictors for both your training data and your test data

Here, I found something odd.  Even though the RMSE improved for the training set, it got much worse for the test set. 

```{r}
train_baseline_RMSE<-sqrt(mean((as.matrix(train[2:8] - baseline_predictors_df))^2,na.rm=T))
train_baseline_RMSE
```

```{r}
test_baseline_RMSE<-sqrt(mean((as.matrix(test[2:8] - baseline_predictors_df))^2,na.rm=T))
test_baseline_RMSE
```

## Summary

Accounting for usr bias improved the RMSE over using just the raw averages. I saw a 29% improvement of the RMSE.

```{r}
1 - (train_baseline_RMSE / train_raw_RMSE)

```

However, the same cannot be said for the test set.  Here, I saw a 26% decline in the RMSE from using the raw average to adding in the item-user bias. 

```{r}
1-(test_baseline_RMSE / test_raw_RMSE)
```

I don't have a ready explanation as to why, so I checked the movie averages and biases for the test set.  I found that the test set had a stronger negative bias towards "When Harry Met Sally" than the train set.  Also, the train set had a stronger negative bias against Toy Story than the train set. These may account for the train set's RMSE decline. 

```{r}
movie_avg2 <-  colMeans(test[2:8], na.rm = T)
movie_avg_df2 <- as.data.frame(movie_avg2)
rownames(movie_avg_df) <- c("Braveheart", "Fight_Club", "Forrest_Gump", "Schindler's_List", "Star_Wars_Episode_IV_A_New_Hope", "Toy_Story", "When_Harry_Met_Sally")

kable(head(movie_avg_df2,10), caption = "Movie Averages") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

```


```{r}
test_movie_bias_df <- movie_avg_df2-test_raw_mean 
test_movie_bias_df <- cbind("movie" = rownames(test_movie_bias_df), test_movie_bias_df)
rownames(test_movie_bias_df) <- NULL


kable(head(test_movie_bias_df,10), caption = "Movie Biases ") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

```
