# Install any missing packages, then load each one.
# (require() alone does not load a package that had to be installed first.)
pkgs <- c("knitr", "tidyverse", "kableExtra", "dplyr")
for (pkg in pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
  library(pkg, character.only = TRUE)
}

 

Introduction

Briefly describe the recommender system that you’re going to build out from a business perspective, e.g. “This system recommends data science books to readers.”

This system recommends movies watched by some family members to other members of the family.

Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.

For the dataset, I ran a survey within my family and asked each member to rate movies that had been picked by family members.

There are 6 members and 8 movies.

Data

Load your data into (for example) an R or pandas dataframe, a Python dictionary or list of lists, (or another data structure of your choosing).

Movie_Name User_Name Ratings
Company(2002) Member1 NA
Deewaar(1975) Member1 4.5
Dil Chahta Hai(2001) Member1 4.0
Gully Boy(2019) Member1 NA
Mr. India(1987) Member1 3.0
Mughal-e-Azam(1960) Member1 4.5
Sholay(1975) Member1 5.0
Swades(2004) Member1 3.0
Company(2002) Member2 NA
Deewaar(1975) Member2 3.0
Dil Chahta Hai(2001) Member2 NA
Gully Boy(2019) Member2 2.5
Mr. India(1987) Member2 4.0
Mughal-e-Azam(1960) Member2 3.0
Sholay(1975) Member2 4.0
Swades(2004) Member2 NA
Company(2002) Member3 5.0
Deewaar(1975) Member3 NA
Dil Chahta Hai(2001) Member3 4.5
Gully Boy(2019) Member3 3.0
Mr. India(1987) Member3 4.5
Mughal-e-Azam(1960) Member3 4.0
Sholay(1975) Member3 4.0
Swades(2004) Member3 4.0
Company(2002) Member4 4.0
Deewaar(1975) Member4 NA
Dil Chahta Hai(2001) Member4 4.5
Gully Boy(2019) Member4 3.0
Mr. India(1987) Member4 3.0
Mughal-e-Azam(1960) Member4 3.5
Sholay(1975) Member4 4.0
Swades(2004) Member4 4.0
Company(2002) Member5 2.0
Deewaar(1975) Member5 4.0
Dil Chahta Hai(2001) Member5 4.5
Gully Boy(2019) Member5 2.0
Mr. India(1987) Member5 3.0
Mughal-e-Azam(1960) Member5 3.0
Sholay(1975) Member5 4.0
Swades(2004) Member5 3.0
Company(2002) Member6 4.5
Deewaar(1975) Member6 NA
Dil Chahta Hai(2001) Member6 4.5
Gully Boy(2019) Member6 4.0
Mr. India(1987) Member6 2.5
Mughal-e-Azam(1960) Member6 NA
Sholay(1975) Member6 4.0
Swades(2004) Member6 4.0

The dimensions of the results dataframe are (48, 3)

## 'data.frame':    48 obs. of  3 variables:
##  $ Movie_Name: Factor w/ 8 levels "Company(2002)",..: 1 2 3 4 5 6 7 8 1 2 ...
##  $ User_Name : Factor w/ 6 levels "Member1","Member2",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ Ratings   : num  NA 4.5 4 NA 3 4.5 5 3 NA 3 ...
##                 Movie_Name   User_Name    Ratings     
##  Company(2002)       : 6   Member1:8   Min.   :2.000  
##  Deewaar(1975)       : 6   Member2:8   1st Qu.:3.000  
##  Dil Chahta Hai(2001): 6   Member3:8   Median :4.000  
##  Gully Boy(2019)     : 6   Member4:8   Mean   :3.705  
##  Mr. India(1987)     : 6   Member5:8   3rd Qu.:4.250  
##  Mughal-e-Azam(1960) : 6   Member6:8   Max.   :5.000  
##  (Other)             :12               NA's   :9
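
The long-format table above could be built directly in R. The following is a minimal sketch under stated assumptions, not necessarily how the survey results were actually loaded: the object names `survey`, `movies` and `members` are illustrative, while the column names and ratings are copied from the table.

```r
# Labels as they appear in the table above
movies <- c("Company(2002)", "Deewaar(1975)", "Dil Chahta Hai(2001)",
            "Gully Boy(2019)", "Mr. India(1987)", "Mughal-e-Azam(1960)",
            "Sholay(1975)", "Swades(2004)")
members <- paste0("Member", 1:6)

# Long format: one row per (movie, member) pair, eight movies per member.
# `survey` is an assumed name, not one taken from the original report.
survey <- data.frame(
  Movie_Name = factor(rep(movies, times = length(members)), levels = movies),
  User_Name  = factor(rep(members, each = length(movies)), levels = members),
  Ratings    = c(
    NA,  4.5, 4.0, NA,  3.0, 4.5, 5.0, 3.0,  # Member1
    NA,  3.0, NA,  2.5, 4.0, 3.0, 4.0, NA,   # Member2
    5.0, NA,  4.5, 3.0, 4.5, 4.0, 4.0, 4.0,  # Member3
    4.0, NA,  4.5, 3.0, 3.0, 3.5, 4.0, 4.0,  # Member4
    2.0, 4.0, 4.5, 2.0, 3.0, 3.0, 4.0, 3.0,  # Member5
    4.5, NA,  4.5, 4.0, 2.5, NA,  4.0, 4.0   # Member6
  )
)

dim(survey)      # number of rows and columns
str(survey)      # column types and factor levels
summary(survey)  # per-column summary, including the number of NAs
```

Storing the two label columns as factors matches the `str()` output shown above; plain character columns would work equally well for the later steps.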

Transform Data

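I use pivot_wider to convert the long format above into a wide table with 8 rows (one per movie) and 6 rating columns (one per member). The matrices below show the same ratings with members as rows and movies as columns.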
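
A minimal sketch of the reshaping step with `tidyr::pivot_wider()` follows. The object names `survey`, `wide` and `ratings` are assumptions, and the final transpose is just one way to arrive at the member-by-movie layout used in the matrices further down.

```r
library(tidyr)  # provides pivot_wider()

movies <- c("Company(2002)", "Deewaar(1975)", "Dil Chahta Hai(2001)",
            "Gully Boy(2019)", "Mr. India(1987)", "Mughal-e-Azam(1960)",
            "Sholay(1975)", "Swades(2004)")
members <- paste0("Member", 1:6)

# Long-format survey data (assumed object name), eight movies per member
survey <- data.frame(
  Movie_Name = rep(movies, times = 6),
  User_Name  = rep(members, each = 8),
  Ratings    = c(
    NA,  4.5, 4.0, NA,  3.0, 4.5, 5.0, 3.0,  # Member1
    NA,  3.0, NA,  2.5, 4.0, 3.0, 4.0, NA,   # Member2
    5.0, NA,  4.5, 3.0, 4.5, 4.0, 4.0, 4.0,  # Member3
    4.0, NA,  4.5, 3.0, 3.0, 3.5, 4.0, 4.0,  # Member4
    2.0, 4.0, 4.5, 2.0, 3.0, 3.0, 4.0, 3.0,  # Member5
    4.5, NA,  4.5, 4.0, 2.5, NA,  4.0, 4.0   # Member6
  )
)

# One row per movie, one rating column per member
wide <- pivot_wider(survey, names_from = User_Name, values_from = Ratings)
dim(wide)  # 8 rows; 7 columns (Movie_Name plus the six member columns)

# Numeric matrix with members as rows and movies as columns
ratings <- t(as.matrix(wide[, -1]))
colnames(ratings) <- wide$Movie_Name
ratings
```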

Split training and test datasets.

I build a mask matrix of ones and zeros with the same shape as the ratings matrix; it selects which elements of the full matrix go into the training set and which go into the test set.

Training and test datasets:

TRAINING MATRIX
User_Name  Company(2002)  Deewaar(1975)  Dil Chahta Hai(2001)  Gully Boy(2019)  Mr. India(1987)  Mughal-e-Azam(1960)  Sholay(1975)  Swades(2004)
Member1               NA            4.5                    NA               NA              3.0                  4.5             5             3
Member2               NA            3.0                    NA               NA              4.0                  3.0             4            NA
Member3                5             NA                   4.5                3              4.5                   NA             4             4
Member4                4             NA                   4.5                3              3.0                  3.5            NA            NA
Member5                2             NA                   4.5                2               NA                  3.0             4             3
Member6               NA             NA                   4.5                4              2.5                   NA             4             4
TEST MATRIX
User_Name  Company(2002)  Deewaar(1975)  Dil Chahta Hai(2001)  Gully Boy(2019)  Mr. India(1987)  Mughal-e-Azam(1960)  Sholay(1975)  Swades(2004)
Member1               NA             NA                     4               NA               NA                   NA            NA            NA
Member2               NA             NA                    NA              2.5               NA                   NA            NA            NA
Member3               NA             NA                    NA               NA               NA                    4            NA            NA
Member4               NA             NA                    NA               NA               NA                   NA             4             4
Member5               NA              4                    NA               NA                3                   NA            NA            NA
Member6              4.5             NA                    NA               NA               NA                   NA            NA            NA
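
The split shown in the two matrices above can be reproduced with a mask along the following lines. This is a sketch under assumptions: the names `ratings`, `test_mask`, `held_out`, `train` and `test` are illustrative, the convention that 1 marks a test cell is a guess, and the held-out cells are copied from the TEST MATRIX rather than drawn at random.

```r
movies <- c("Company(2002)", "Deewaar(1975)", "Dil Chahta Hai(2001)",
            "Gully Boy(2019)", "Mr. India(1987)", "Mughal-e-Azam(1960)",
            "Sholay(1975)", "Swades(2004)")
members <- paste0("Member", 1:6)

# Full ratings matrix: members as rows, movies as columns
ratings <- matrix(c(
  NA,  4.5, 4.0, NA,  3.0, 4.5, 5.0, 3.0,
  NA,  3.0, NA,  2.5, 4.0, 3.0, 4.0, NA,
  5.0, NA,  4.5, 3.0, 4.5, 4.0, 4.0, 4.0,
  4.0, NA,  4.5, 3.0, 3.0, 3.5, 4.0, 4.0,
  2.0, 4.0, 4.5, 2.0, 3.0, 3.0, 4.0, 3.0,
  4.5, NA,  4.5, 4.0, 2.5, NA,  4.0, 4.0
), nrow = 6, byrow = TRUE, dimnames = list(members, movies))

# Mask of ones and zeros (assumed convention: 1 = held out for testing)
test_mask <- matrix(0, nrow = nrow(ratings), ncol = ncol(ratings),
                    dimnames = dimnames(ratings))
held_out <- cbind(row = c(1, 2, 3, 4, 4, 5, 5, 6),
                  col = c(3, 4, 6, 7, 8, 2, 5, 1))
test_mask[held_out] <- 1

# Training set keeps the unmasked ratings; test set keeps the masked ones
train <- ratings
train[test_mask == 1] <- NA
test <- ratings
test[test_mask == 0] <- NA

sum(!is.na(train))  # number of training ratings
sum(!is.na(test))   # number of test ratings
```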

Using your training data, calculate the raw average (mean) rating for every user-item combination.

## [1] 3.693548
MEAN-RATING MATRIX
Company(2002) Deewaar(1975) Dil Chahta Hai(2001) Gully Boy(2019) Mr. India(1987) Mughal-e-Azam(1960) Sholay(1975) Swades(2004)
Member1 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548
Member2 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548
Member3 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548
Member4 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548
Member5 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548
Member6 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548 3.693548
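
The raw average is a single number, the mean of all non-missing training ratings, repeated in every cell. A minimal sketch, assuming the training matrix is stored in an object called `train` (an illustrative name):

```r
movies <- c("Company(2002)", "Deewaar(1975)", "Dil Chahta Hai(2001)",
            "Gully Boy(2019)", "Mr. India(1987)", "Mughal-e-Azam(1960)",
            "Sholay(1975)", "Swades(2004)")
members <- paste0("Member", 1:6)

# Training matrix as shown above (assumed object name)
train <- matrix(c(
  NA, 4.5, NA,  NA, 3.0, 4.5, 5,  3,
  NA, 3.0, NA,  NA, 4.0, 3.0, 4,  NA,
  5,  NA,  4.5, 3,  4.5, NA,  4,  4,
  4,  NA,  4.5, 3,  3.0, 3.5, NA, NA,
  2,  NA,  4.5, 2,  NA,  3.0, 4,  3,
  NA, NA,  4.5, 4,  2.5, NA,  4,  4
), nrow = 6, byrow = TRUE, dimnames = list(members, movies))

# Mean over every observed training rating, ignoring NAs
raw_mean <- mean(train, na.rm = TRUE)

# The same value is the prediction for every user-item combination
mean_matrix <- matrix(raw_mean, nrow = nrow(train), ncol = ncol(train),
                      dimnames = dimnames(train))
raw_mean
```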

Calculate the RMSE for raw average for both your training data and your test data.

Raw-average RMSE, training then test:
## [1] 0.8099922
## [1] 0.6149689
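
RMSE here is the square root of the mean squared difference between each observed rating and its prediction, with missing cells skipped. The sketch below assumes a helper named `rmse` and objects named `train` and `test`; none of these names come from the original report.

```r
# Root mean squared error over the cells where a rating was observed
# (`rmse` is an illustrative helper, not a function from a package)
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Training matrix: members as rows, movies as columns
train <- matrix(c(
  NA, 4.5, NA,  NA, 3.0, 4.5, 5,  3,
  NA, 3.0, NA,  NA, 4.0, 3.0, 4,  NA,
  5,  NA,  4.5, 3,  4.5, NA,  4,  4,
  4,  NA,  4.5, 3,  3.0, 3.5, NA, NA,
  2,  NA,  4.5, 2,  NA,  3.0, 4,  3,
  NA, NA,  4.5, 4,  2.5, NA,  4,  4
), nrow = 6, byrow = TRUE)

# Test matrix: only the eight held-out ratings are filled in
test <- matrix(NA_real_, nrow = 6, ncol = 8)
test[cbind(c(1, 2, 3, 4, 4, 5, 5, 6), c(3, 4, 6, 7, 8, 2, 5, 1))] <-
  c(4, 2.5, 4, 4, 4, 4, 3, 4.5)

# The raw average comes from the training data only
raw_mean <- mean(train, na.rm = TRUE)

rmse_train_raw <- rmse(train, raw_mean)
rmse_test_raw  <- rmse(test, raw_mean)
c(train = rmse_train_raw, test = rmse_test_raw)
```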

Using your training data, calculate the bias for each user and each item.

MOVIE BIAS
Company(2002) -0.0268817
Deewaar(1975) 0.0564516
Dil Chahta Hai(2001) 0.8064516
Gully Boy(2019) -0.6935484
Mr. India(1987) -0.2935484
Mughal-e-Azam(1960) -0.1935484
Sholay(1975) 0.5064516
Swades(2004) -0.1935484
USER BIAS
Member1 0.3064516
Member2 -0.1935484
Member3 0.4731183
Member4 -0.0935484
Member5 -0.6102151
Member6 0.1064516
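
The biases above are consistent with taking each movie's (column) mean and each user's (row) mean of the training ratings and subtracting the raw average. A minimal sketch, with `train`, `user_bias` and `movie_bias` as assumed names:

```r
movies <- c("Company(2002)", "Deewaar(1975)", "Dil Chahta Hai(2001)",
            "Gully Boy(2019)", "Mr. India(1987)", "Mughal-e-Azam(1960)",
            "Sholay(1975)", "Swades(2004)")
members <- paste0("Member", 1:6)

# Training matrix (assumed object name)
train <- matrix(c(
  NA, 4.5, NA,  NA, 3.0, 4.5, 5,  3,
  NA, 3.0, NA,  NA, 4.0, 3.0, 4,  NA,
  5,  NA,  4.5, 3,  4.5, NA,  4,  4,
  4,  NA,  4.5, 3,  3.0, 3.5, NA, NA,
  2,  NA,  4.5, 2,  NA,  3.0, 4,  3,
  NA, NA,  4.5, 4,  2.5, NA,  4,  4
), nrow = 6, byrow = TRUE, dimnames = list(members, movies))

raw_mean <- mean(train, na.rm = TRUE)

# Bias = (mean rating of that movie or user) - (overall raw average)
movie_bias <- colMeans(train, na.rm = TRUE) - raw_mean
user_bias  <- rowMeans(train, na.rm = TRUE) - raw_mean

round(movie_bias, 7)
round(user_bias, 7)
```

A positive bias means the movie is rated above the overall average, or the user rates more generously than average; a negative bias means the opposite.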

From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.

BASELINE PREDICTORS
User_Name  Company(2002)  Deewaar(1975)  Dil Chahta Hai(2001)  Gully Boy(2019)  Mr. India(1987)  Mughal-e-Azam(1960)  Sholay(1975)  Swades(2004)
Member1         3.973118       4.056452              4.806452         3.306452         3.706452             3.806452      4.506452      3.806452
Member2         3.473118       3.556452              4.306452         2.806452         3.206452             3.306452      4.006452      3.306452
Member3         4.139785       4.223118              4.973118         3.473118         3.873118             3.973118      4.673118      3.973118
Member4         3.573118       3.656452              4.406452         2.906452         3.306452             3.406452      4.106452      3.406452
Member5         3.056452       3.139785              3.889785         2.389785         2.789785             2.889785      3.589785      2.889785
Member6         3.773118       3.856452              4.606452         3.106452         3.506452             3.606452      4.306452      3.606452
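
Each baseline predictor is the raw average plus the bias of that user plus the bias of that movie. The sketch below starts from the rounded values reported above, so its results agree with the table only to about six decimal places; the object names are assumptions.

```r
# Values copied from the output above (rounded as printed)
raw_mean <- 3.693548

user_bias <- c(Member1 = 0.3064516, Member2 = -0.1935484, Member3 = 0.4731183,
               Member4 = -0.0935484, Member5 = -0.6102151, Member6 = 0.1064516)

movie_bias <- c("Company(2002)" = -0.0268817, "Deewaar(1975)" = 0.0564516,
                "Dil Chahta Hai(2001)" = 0.8064516, "Gully Boy(2019)" = -0.6935484,
                "Mr. India(1987)" = -0.2935484, "Mughal-e-Azam(1960)" = -0.1935484,
                "Sholay(1975)" = 0.5064516, "Swades(2004)" = -0.1935484)

# outer() adds every user bias to every movie bias, giving a 6 x 8 matrix;
# the raw average is then added to every cell
baseline <- raw_mean + outer(user_bias, movie_bias, "+")
round(baseline, 6)
```

For example, Member1's prediction for Company(2002) is 3.693548 + 0.3064516 - 0.0268817, or about 3.973118.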

Calculate the RMSE for the baseline predictors for both your training data and your test data.

Baseline-predictor RMSE, training then test:
## [1] 0.5490571
## [1] 0.5501305
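
The same RMSE calculation can be applied to the baseline predictors. In this sketch the biases are estimated from the training matrix only and then scored against both sets; `rmse`, `train`, `test` and `baseline` are illustrative names.

```r
# Illustrative helper: RMSE over the cells where a rating was observed
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Training matrix: members as rows, movies as columns
train <- matrix(c(
  NA, 4.5, NA,  NA, 3.0, 4.5, 5,  3,
  NA, 3.0, NA,  NA, 4.0, 3.0, 4,  NA,
  5,  NA,  4.5, 3,  4.5, NA,  4,  4,
  4,  NA,  4.5, 3,  3.0, 3.5, NA, NA,
  2,  NA,  4.5, 2,  NA,  3.0, 4,  3,
  NA, NA,  4.5, 4,  2.5, NA,  4,  4
), nrow = 6, byrow = TRUE)

# Test matrix: only the eight held-out ratings are filled in
test <- matrix(NA_real_, nrow = 6, ncol = 8)
test[cbind(c(1, 2, 3, 4, 4, 5, 5, 6), c(3, 4, 6, 7, 8, 2, 5, 1))] <-
  c(4, 2.5, 4, 4, 4, 4, 3, 4.5)

# Raw average and biases come from the training data only
raw_mean   <- mean(train, na.rm = TRUE)
user_bias  <- rowMeans(train, na.rm = TRUE) - raw_mean
movie_bias <- colMeans(train, na.rm = TRUE) - raw_mean

# Baseline prediction for every user-item cell
baseline <- raw_mean + outer(user_bias, movie_bias, "+")

rmse_train_base <- rmse(train, baseline)
rmse_test_base  <- rmse(test, baseline)
c(train = rmse_train_base, test = rmse_test_base)
```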

Summary results.

Let's calculate the percentage improvement in RMSE from the raw-average predictor to the baseline predictor (which adds the user and item biases), for both the training and test sets.

Fractional improvement, training then test:
## [1] 0.3221452
## [1] 0.1054337

The training RMSE declined from 0.810 to 0.549, an improvement of 32.2 percent.

The test RMSE declined from 0.615 to 0.550, an improvement of 10.5 percent.
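
The improvement figures correspond to one minus the ratio of the baseline RMSE to the raw-average RMSE. A short sketch using the four RMSE values reported earlier; the function name `relative_improvement` is an assumption.

```r
# Fraction by which the RMSE shrinks when moving from the raw average
# to the baseline predictor (illustrative helper name)
relative_improvement <- function(rmse_raw, rmse_baseline) {
  1 - rmse_baseline / rmse_raw
}

# RMSE values copied from the output above
train_gain <- relative_improvement(rmse_raw = 0.8099922, rmse_baseline = 0.5490571)
test_gain  <- relative_improvement(rmse_raw = 0.6149689, rmse_baseline = 0.5501305)

# Expressed as percentages
round(100 * c(train = train_gain, test = test_gain), 3)
```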