01.Abstract

Recommendation algorithms are applied into industry on various business domains. The most popular are:

-The movie recommendations like the one used by Netflix.

-And the (related items recommendations) during online purchases.

There are 2 types of recommender systems:

-Content Filtering : Based on the description of the item also called meta data or side information

-And Collaborative Filtering: Are calculating the similarity measures of the target ITEMS and finding the minimum (Euclidean distance, or Cosine distance, or other metric, it depends on the algorithm). This is done by filtering the interests of a user, by collecting preferences from many users (collaborating).

Matrix factorization with parallel stochastic gradient descent, is an effective algorithm used to create a recommender system. The approach is to approximate the rating matrix:

$R_{m\times n}$ by the product of two matrixes containing lower dimensions, $P_{k\times m}$ and $Q_{k\times n}$, in a way that \[R\approx P^\prime Q\]
For Ex. $p_u$ is the $u$-th column of $P$, and $q_v$ is the $v$-th column of $Q$, then the movie rating placed by the user $u$ on the item $v$ would be predicted as $p^\prime_u q_v$.

A usual equation for the $P$ and th $Q$ is given by the below optimization problem :

\[\min_{P,Q} \sum_{(u,v)\in R} \left[f(p_u,q_v;r_{u,v})+\mu_P||p_u||_1+\mu_Q||q_v||_1+\frac{\lambda_P}{2} ||p_u||_2^2+\frac{\lambda_Q}{2} ||q_v||_2^2\right]\]

where the $(u,v)$ are the locations of the real entries in the $R$, $r_{u,v}$ is the real rating, $f$ is the loss function, and $\mu_P,\mu_Q,\lambda_P,\lambda_Q$ are the usual penalization parameters used by many algorithms to avoid overfitting.

The procedure of solving the matrix $P$ and $Q$ is the model training, and the selection of choosing penalization parameters is the hyper parameters tuning. After obtaining the $P$ and the $Q$,

we can then predict :

$\hat{R}_{u,v}=p^\prime_u q_v$.

Many thanks to Yixuan Qiu from Carnegie Mellon University Source: Yixuan

02.Introduction

The purpose of this R project is to create a rating recommender system through machine learning training. That recommender system will be able to predict users rating into a new movie. Or the user preference for a movie.

The most famous recommender training event was the competition launched by Netflix with one million dollar price. I will use the 10M (millions) rows rating dataset named MovieLens created by the University of Minnesota. It was released at 1/2009 so our newest movies are until 2008. In order to find a pattern and behavior of the data, the data sets where “enhanced” by many new features (dimensions). As validation of the models i will use RMSE (regression approach). During the project more explanation is given.

03.Methodology

Many algorithms and data transformations where applied in order to achieve the lowest RMSE. Such us:

-Matrix Factorization with parallel stochastic gradient descent
-H2o stacked ensembles of (GBM,GLM,DRF,NN)
-H2o Deep Learning (Neural Networks)
-H2o Gradient Boosting Machine (GBM)
-H2o Auto ML

collaborative filtering underlying assumption is that if a person X has the same opinion as a person Y then the recommendation system should be based on preferences of person Y (similarity). I will enhance the collaborative filtering with the application of:

-Matrix Factorization with parallel stochastic gradient descent algorithms. MF is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two dimensionality rectangular matrices.

This family of methods became widely known during the Netflix prize challenge due to its effectiveness as reported by Simon Funk in his 2006 blog where he shared his findings with the research community.

[recosystem](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)

We will apply Matrix Factorization with parallel stochastic gradient descent. With the help of “recosystem” package it is an R wrapper of the LIBMF library which creates a Recommender System by Using Parallel Matrix Factorization.

-The main task of recommender system is to predict unknown entries in the rating matrix based on observed values.

More info on the recosystem package and the techniques

recosystem

Loading of the required libraries

require(tidyverse)
require(caret)
require(rmarkdown)
require(plotly)
require(rpart)
require(lattice)
require(ggthemes)
require(GGally)
require(knitr)
require(tidyr)
require(wordcloud)
require(kableExtra)
require(RColorBrewer)
require(dplyr)
require(ggplot2)
require(lubridate)
require(h2o)
require(stringr)
require(formatR)
require(recosystem)
require(knitr)
require(scales)

04.Data Preparation

We download the MovieLens 10M dataset which contains 10 millions ratings and perform the appropriate joins (http://files.grouplens.org/datasets/movielens/ml-10m.zip)

dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)


ratings <- read.table(text = gsub("::", "\t", readLines(unzip(dl, "ml-10M100K/ratings.dat"))), 
    col.names = c("userId", "movieId", "rating", "timestamp"))


movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")), "\\::", 
    3)

colnames(movies) <- c("movieId", "title", "genres")

movies <- as.data.frame(movies) %>% mutate(movieId = as.numeric(levels(movieId))[movieId], 
    title = as.character(title), genres = as.character(genres))

movielens <- left_join(ratings, movies, by = "movieId")

Our testing set named validation will be the 10% of our training set named edx

set.seed(1)
test_index <- createDataPartition(y = movielens$rating, times = 1, p = 0.1, 
    list = FALSE)
edx <- movielens[-test_index, ]
temp <- movielens[test_index, ]

Make sure that userId and movieId in edxset are also in validation set

validation <- temp %>% semi_join(edx, by = "movieId") %>% semi_join(edx, by = "userId")

Add rows removed from validation set back into edx set

removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)

Delete objects that we wont use anymore

rm(dl, ratings, movies, test_index, temp, movielens, removed)

05.Data Observation

Data structure of the edx (training set)

Observations: 9,000,055
Variables: 6
$ userId    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ movieId   <dbl> 122, 185, 292, 316, 329, 355, 356, 362, 364, 370, 37...
$ rating    <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
$ timestamp <int> 838985046, 838983525, 838983421, 838983392, 83898339...
$ title     <chr> "Boomerang (1992)", "Net, The (1995)", "Outbreak (19...
$ genres    <chr> "Comedy|Romance", "Action|Crime|Thriller", "Action|D...

Its class’data.frame’with : -9000047 obs (rows) -6 variables (features)

The same movie entry might belong to more than one genre. Every discrete rating is on a discrete row.

First entries of the edx (training set)

  userId movieId rating timestamp                         title
1      1     122      5 838985046              Boomerang (1992)
2      1     185      5 838983525               Net, The (1995)
4      1     292      5 838983421               Outbreak (1995)
5      1     316      5 838983392               Stargate (1994)
6      1     329      5 838983392 Star Trek: Generations (1994)
7      1     355      5 838984474       Flintstones, The (1994)
                         genres
1                Comedy|Romance
2         Action|Crime|Thriller
4  Action|Drama|Sci-Fi|Thriller
5       Action|Adventure|Sci-Fi
6 Action|Adventure|Drama|Sci-Fi
7       Children|Comedy|Fantasy

It looks we have to transform the timestamp which represents the (rating date), since the release date is inside the movie (title) column.

We will extract the release date from the movie title. And we will create a new matrix with more dimensions, containing every movie genre separately as factor.

Summary of the edx (training set)

     userId         movieId          rating        timestamp        
 Min.   :    1   Min.   :    1   Min.   :0.500   Min.   :7.897e+08  
 1st Qu.:18124   1st Qu.:  648   1st Qu.:3.000   1st Qu.:9.468e+08  
 Median :35738   Median : 1834   Median :4.000   Median :1.035e+09  
 Mean   :35870   Mean   : 4122   Mean   :3.512   Mean   :1.033e+09  
 3rd Qu.:53607   3rd Qu.: 3626   3rd Qu.:4.000   3rd Qu.:1.127e+09  
 Max.   :71567   Max.   :65133   Max.   :5.000   Max.   :1.231e+09  
    title              genres         
 Length:9000055     Length:9000055    
 Class :character   Class :character  
 Mode  :character   Mode  :character

The rating mean shows that users are rating above the average rating (3.512) right skewed

-Rating (our dependent variable y) has 10 continuous values from 0 until 5. Its row has one given rating by one user for one movie.

-Rating is our dependent (target variable) y
-userId, movieId, timestamp (date&time) are: quantitative - Discrete unique numbers.
-Title and genres are: qualitative and not unique.

Data structure of the validation (testing set)

Observations: 999,999
Variables: 6
$ userId    <int> 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5...
$ movieId   <dbl> 231, 480, 586, 151, 858, 1544, 590, 4995, 34, 432, 4...
$ rating    <dbl> 5.0, 5.0, 5.0, 3.0, 2.0, 3.0, 3.5, 4.5, 5.0, 3.0, 3....
$ timestamp <int> 838983392, 838983653, 838984068, 868246450, 86824564...
$ title     <chr> "Dumb & Dumber (1994)", "Jurassic Park (1993)", "Hom...
$ genres    <chr> "Comedy", "Action|Adventure|Sci-Fi|Thriller", "Child...

Its class ‘data.frame’: With 999999 obs. of 6 variables: Its exactly the 10% of our training set. And has the same 6 features

First entries of the validation (testing set)

  userId movieId rating timestamp
1      1     231      5 838983392
2      1     480      5 838983653
3      1     586      5 838984068
4      2     151      3 868246450
5      2     858      2 868245645
6      2    1544      3 868245920
                                                    title
1                                    Dumb & Dumber (1994)
2                                    Jurassic Park (1993)
3                                       Home Alone (1990)
4                                          Rob Roy (1995)
5                                   Godfather, The (1972)
6 Lost World: Jurassic Park, The (Jurassic Park 2) (1997)
                                   genres
1                                  Comedy
2        Action|Adventure|Sci-Fi|Thriller
3                         Children|Comedy
4                Action|Drama|Romance|War
5                             Crime|Drama
6 Action|Adventure|Horror|Sci-Fi|Thriller

Same as our training set. So we will perform the same data transformation on both training and test datasets. we mentioned earlier we could add features in our datasets in order to analyse for correlations,and if they exist they would help our ML models. By adding a feature of release year and year rated. So we will create 2 new data frames with that features for train set and test sets.

We will transform timestamp to year rated and extract the premier data from the movie title and add it as a separate feature.

edx <- edx %>% mutate(timestamp = as.POSIXct(timestamp, origin = "1970-01-01", 
    tz = "GMT"))
edx$timestamp <- format(edx$timestamp, "%Y")

colnames(edx)

[1] "userId"    "movieId"   "rating"    "timestamp" "title"     "genres"

names(edx)[names(edx) == "timestamp"] <- "year_rated"

releaseyear <- stringi::stri_extract(edx$title, regex = "(\\d{4})", comments = TRUE) %>% 
    as.numeric()


edx <- edx %>% mutate(release_year = releaseyear)



validation <- validation %>% mutate(timestamp = as.POSIXct(validation$timestamp, 
    origin = "1970-01-01", tz = "GMT"))
validation$timestamp <- format(validation$timestamp, "%Y")

colnames(validation)

[1] "userId"    "movieId"   "rating"    "timestamp" "title"     "genres"

names(validation)[names(validation) == "timestamp"] <- "year_rated"

releaseyear2 <- stringi::stri_extract(validation$title, regex = "(\\d{4})", 
    comments = TRUE) %>% as.numeric()


validation <- validation %>% mutate(release_year = releaseyear2)

Check the data sets for consistency

  userId movieId rating year_rated                         title
1      1     122      5       1996              Boomerang (1992)
2      1     185      5       1996               Net, The (1995)
3      1     292      5       1996               Outbreak (1995)
4      1     316      5       1996               Stargate (1994)
5      1     329      5       1996 Star Trek: Generations (1994)
6      1     355      5       1996       Flintstones, The (1994)
                         genres release_year
1                Comedy|Romance         1992
2         Action|Crime|Thriller         1995
3  Action|Drama|Sci-Fi|Thriller         1995
4       Action|Adventure|Sci-Fi         1994
5 Action|Adventure|Drama|Sci-Fi         1994
6       Children|Comedy|Fantasy         1994

  userId movieId rating year_rated
1      1     231      5       1996
2      1     480      5       1996
3      1     586      5       1996
4      2     151      3       1997
5      2     858      2       1997
6      2    1544      3       1997
                                                    title
1                                    Dumb & Dumber (1994)
2                                    Jurassic Park (1993)
3                                       Home Alone (1990)
4                                          Rob Roy (1995)
5                                   Godfather, The (1972)
6 Lost World: Jurassic Park, The (Jurassic Park 2) (1997)
                                   genres release_year
1                                  Comedy         1994
2        Action|Adventure|Sci-Fi|Thriller         1993
3                         Children|Comedy         1990
4                Action|Drama|Romance|War         1995
5                             Crime|Drama         1972
6 Action|Adventure|Horror|Sci-Fi|Thriller         1997

Display the distinct number of users and distinct number of movies in our train set

edx %>% summarize(distinct_users = n_distinct(userId), distinct_movies = n_distinct(movieId))

  distinct_users distinct_movies
1          69878           10677

We create and display a new df with useful metrics in order to understand better our dataset and identify outliers.

edx_movies_metrics <- edx %>% separate_rows(genres, sep = "\\|") %>% group_by(genres) %>% 
    summarize(Ratings_perGenre_Sum = n(), Ratings_perGenre_Mean = mean(rating), 
        Movies_perGenre_Sum = n_distinct(movieId), Users_perGenre_Sum = n_distinct(userId))
edx_movies_metrics

# A tibble: 20 x 5
   genres Ratings_perGenr~ Ratings_perGenr~ Movies_perGenre~
   <chr>             <int>            <dbl>            <int>
 1 (no g~                7             3.64                1
 2 Action          2560545             3.42             1473
 3 Adven~          1908892             3.49             1025
 4 Anima~           467168             3.60              286
 5 Child~           737994             3.42              528
 6 Comedy          3540930             3.44             3703
 7 Crime           1327715             3.67             1117
 8 Docum~            93066             3.78              481
 9 Drama           3910127             3.67             5336
10 Fanta~           925637             3.50              543
11 Film-~           118541             4.01              148
12 Horror           691485             3.27             1013
13 IMAX               8181             3.77               29
14 Music~           433080             3.56              436
15 Myste~           568332             3.68              509
16 Roman~          1712100             3.55             1685
17 Sci-Fi          1341183             3.40              754
18 Thril~          2325899             3.51             1705
19 War              511147             3.78              510
20 Weste~           189394             3.56              275
# ... with 1 more variable: Users_perGenre_Sum <int>

We observe that the rating mean is not rounded so we will transform it it. Also we identify in our new edx movies metrics df that there is one movie without genres.
We will treat it as an outlier and delete it from all our datasets, since it doesnt add any value. We also have 19 distinct genres.

Display the genres with the most rated movies (not distinct movies)

edx_movies_metrics$Ratings_perGenre_Mean <- round(edx_movies_metrics$Ratings_perGenre_Mean, 
    digits = 2)

edx_movies_metrics[order(-edx_movies_metrics$Movies_perGenre_Sum), ]

# A tibble: 20 x 5
   genres Ratings_perGenr~ Ratings_perGenr~ Movies_perGenre~
   <chr>             <int>            <dbl>            <int>
 1 Drama           3910127             3.67             5336
 2 Comedy          3540930             3.44             3703
 3 Thril~          2325899             3.51             1705
 4 Roman~          1712100             3.55             1685
 5 Action          2560545             3.42             1473
 6 Crime           1327715             3.67             1117
 7 Adven~          1908892             3.49             1025
 8 Horror           691485             3.27             1013
 9 Sci-Fi          1341183             3.4               754
10 Fanta~           925637             3.5               543
11 Child~           737994             3.42              528
12 War              511147             3.78              510
13 Myste~           568332             3.68              509
14 Docum~            93066             3.78              481
15 Music~           433080             3.56              436
16 Anima~           467168             3.6               286
17 Weste~           189394             3.56              275
18 Film-~           118541             4.01              148
19 IMAX               8181             3.77               29
20 (no g~                7             3.64                1
# ... with 1 more variable: Users_perGenre_Sum <int>

We can observe that most movies are in the above genres
(Reminder) those are not distinct movies. Because as we observed earlier one movie might belong to more than one genre

Display of the genres - with the most distinct ratings

edx_movies_metrics[order(-edx_movies_metrics$Ratings_perGenre_Sum), ]

# A tibble: 20 x 5
   genres Ratings_perGenr~ Ratings_perGenr~ Movies_perGenre~
   <chr>             <int>            <dbl>            <int>
 1 Drama           3910127             3.67             5336
 2 Comedy          3540930             3.44             3703
 3 Action          2560545             3.42             1473
 4 Thril~          2325899             3.51             1705
 5 Adven~          1908892             3.49             1025
 6 Roman~          1712100             3.55             1685
 7 Sci-Fi          1341183             3.4               754
 8 Crime           1327715             3.67             1117
 9 Fanta~           925637             3.5               543
10 Child~           737994             3.42              528
11 Horror           691485             3.27             1013
12 Myste~           568332             3.68              509
13 War              511147             3.78              510
14 Anima~           467168             3.6               286
15 Music~           433080             3.56              436
16 Weste~           189394             3.56              275
17 Film-~           118541             4.01              148
18 Docum~            93066             3.78              481
19 IMAX               8181             3.77               29
20 (no g~                7             3.64                1
# ... with 1 more variable: Users_perGenre_Sum <int>

Here we observed that the top 3 genres with the most ratings are

-Drama
-Comedy and
-Action

Some genres have exponential low sum of ratings so probably they will be also treated as outliers in the data frame that we will create with all genres as factors

Display of ratings mean - per genre

edx_movies_metrics[order(-edx_movies_metrics$Ratings_perGenre_Mean), ]

# A tibble: 20 x 5
   genres Ratings_perGenr~ Ratings_perGenr~ Movies_perGenre~
   <chr>             <int>            <dbl>            <int>
 1 Film-~           118541             4.01              148
 2 Docum~            93066             3.78              481
 3 War              511147             3.78              510
 4 IMAX               8181             3.77               29
 5 Myste~           568332             3.68              509
 6 Crime           1327715             3.67             1117
 7 Drama           3910127             3.67             5336
 8 (no g~                7             3.64                1
 9 Anima~           467168             3.6               286
10 Music~           433080             3.56              436
11 Weste~           189394             3.56              275
12 Roman~          1712100             3.55             1685
13 Thril~          2325899             3.51             1705
14 Fanta~           925637             3.5               543
15 Adven~          1908892             3.49             1025
16 Comedy          3540930             3.44             3703
17 Action          2560545             3.42             1473
18 Child~           737994             3.42              528
19 Sci-Fi          1341183             3.4               754
20 Horror           691485             3.27             1013
# ... with 1 more variable: Users_perGenre_Sum <int>

Here we can observe that genres with low sum of ratings have higher rating mean. This is one more indicator that should be treated as outliers. Also movies with low sum of ratings will be removed from the training set for the same reasons

We create also for our training set a metrics df

validation_movies_metrics <- validation %>% separate_rows(genres, sep = "\\|") %>% 
    group_by(genres) %>% summarize(Ratings_perGenre_Sum = n(), Ratings_perGenre_Mean = mean(rating), 
    Movies_perGenre_Sum = n_distinct(movieId), Users_perGenre_Sum = n_distinct(userId))



validation_movies_metrics$Ratings_perGenre_Mean <- round(validation_movies_metrics$Ratings_perGenre_Mean, 
    digits = 2)


validation_movies_metrics <- subset(validation_movies_metrics, genres != "(no genres listed)")

We create a ratings distribution df

ratings_distribution <- edx %>% group_by(rating) %>% summarise(ratings_distribution_sum = n()) %>% 
    arrange(desc(ratings_distribution_sum))

Display of the ratings distribution

rating	ratings_distribution_sum
4.0	2588430
3.0	2121240
5.0	1390114
3.5	791624
2.0	711422
4.5	526736
1.0	345679
2.5	333010
1.5	106426
0.5	85374
Note: Ratings distribution

Interactive histogram of the ratings distribution

p1 <- ggplot(edx, aes(rating, fill = cut(rating, 100))) + geom_histogram(color = "blue", 
    binwidth = 0.2) + scale_x_continuous(breaks = seq(0.5, 5, 0.5)) + geom_vline(xintercept = ratings_mean, 
    col = "red", linetype = "dashed") + labs(title = "Distribution of ratings", 
    x = "Ratings Scale", y = "Sum Of Rating") + theme(axis.text = element_text(size = 10), 
    plot.title = element_text(size = 11, color = "darkblue", hjust = 0.5)) + 
    theme_solarized(light = FALSE)

ggplotly(p1)

For the training of our ML algorithms we want to penalized movies rated by low number of users.So in order to put more weight on movies that have been rated by more people, we will add 2 more features in our data sets.

-Number of users per movie and
-Number of movies per user (how many movies had that user rated). So movies and users that have not many rates will be penalized during the ML training
Add new dimensions: Number of movies per user. And number of users per movie

edx <- edx %>% group_by(userId) %>% mutate(number_movies_byUser = n())


edx <- edx %>% group_by(movieId) %>% mutate(number_users_byMovie = n())


validation <- validation %>% group_by(userId) %>% mutate(number_movies_byUser = n())


validation <- validation %>% group_by(movieId) %>% mutate(number_users_byMovie = n())

Interactive plot of the most rated movies (Only those that have been rated over 20.000 times)

Bar graph display: in order to understand the distribution of the sum of ratings per genres, and movies per genres

ggplot(data = edx_movies_metrics, aes(x = reorder(genres, -Ratings_perGenre_Sum), 
    y = Ratings_perGenre_Sum, colour = "red", fill = genres, label = Ratings_perGenre_Sum)) + 
    geom_col() + geom_text(aes(label = comma(Ratings_perGenre_Sum)), angle = 90, 
    color = "white", size = 3, check_overlap = T, position = position_stack((vjust = 0.5))) + 
    xlab("Movies genres") + ylab("Sum of ratings in Mio") + ggtitle("Sum of ratings per genre") + 
    scale_y_continuous(labels = scales::comma) + theme_solarized(light = FALSE) + 
    theme(axis.text.x = element_text(angle = 90))

genres	Ratings_perGenre_Sum
Drama	3910127
Comedy	3540930
Action	2560545
Thriller	2325899
Adventure	1908892
Romance	1712100
Sci-Fi	1341183
Crime	1327715
Fantasy	925637
Children	737994
Horror	691485
Mystery	568332
War	511147
Animation	467168
Musical	433080
Western	189394
Film-Noir	118541
Documentary	93066
IMAX	8181
(no genres listed)	7

Is also clear that the distribution is not equal in all genres. After the fantasy genre, we can observe the exponential growth in the number of ratings.
That is an indication that probably we will penalize the genres with low number of ratings, in our model
Word count cloud plot - with the most rated genres

wordcloud(words = edx_movies_metrics$genres, freq = edx_movies_metrics$Ratings_perGenre_Sum, 
    min.freq = 10, max.words = 10, random.order = FALSE, random.color = FALSE, 
    rot.per = 0.35, scale = c(5, 0.2), font = 4, colors = brewer.pal(8, "Dark2"), 
    main = "Most rated genres")

Bar graph plot to observe - the distribution of ratings mean - per genres

ggplot(data = edx_movies_metrics, aes(x = reorder(genres, -Ratings_perGenre_Mean), 
    y = Ratings_perGenre_Mean, colour = Ratings_perGenre_Sum, fill = genres, 
    label = Ratings_perGenre_Mean)) + geom_col() + ggtitle("Mean of ratings per genre") + 
    geom_text(aes(label = Ratings_perGenre_Mean), angle = 90, color = "white", 
        size = 4, check_overlap = T, position = position_stack((vjust = 0.5))) + 
    xlab("Movies genres") + ylab("Mean of ratings") + 
theme(axis.text.x = element_text(angle = 90)) + theme_solarized(light = FALSE) + 
    theme(axis.text.x = element_text(angle = 90))

In order to improve the results from our algorithms we will a We created a new df with (more dimensions) all the genres extracted and displayed as factors in order to use more features and improve our models if needed.

edx_matrix <- edx

validation_matrix <- validation

genres <- edx_matrix$genres %>% str_split(pattern = "\\|")

genres_unique <- genres %>% unlist() %>% unique()


ncol(edx_matrix)

[1] 9

cols <- ncol(edx_matrix)

for (i in seq_along(genres_unique)) {
    id <- grepl(pattern = genres_unique[i], edx_matrix$genres)
    edx_matrix[[cols + i]] <- 0
    edx_matrix[[cols + i]][id] <- 1
}

names(edx_matrix)[(cols + 1):ncol(edx_matrix)] <- genres_unique

Same for our test set

ncol(validation_matrix)

[1] 9

cols <- ncol(validation_matrix)

for (i in seq_along(genres_unique)) {
    id <- grepl(pattern = genres_unique[i], validation_matrix$genres)
    validation_matrix[[cols + i]] <- 0
    validation_matrix[[cols + i]][id] <- 1
}

names(validation_matrix)[(cols + 1):ncol(validation_matrix)] <- genres_unique

Also instead of having the year released we will create a new feature representing:
how old is every movie and drop the year released dimension.

edx_matrix <- edx_matrix %>% mutate(age_of_movie = 2019 - release_year)

validation_matrix <- validation_matrix %>% mutate(age_of_movie = 2019 - release_year)

Observe the new df with the genres as dimensions

# A tibble: 6 x 30
# Groups:   movieId [6]
  userId movieId rating year_rated title genres release_year
   <int>   <dbl>  <dbl> <chr>      <chr> <chr>         <dbl>
1      1     122      5 1996       Boom~ Comed~         1992
2      1     185      5 1996       Net,~ Actio~         1995
3      1     292      5 1996       Outb~ Actio~         1995
4      1     316      5 1996       Star~ Actio~         1994
5      1     329      5 1996       Star~ Actio~         1994
6      1     355      5 1996       Flin~ Child~         1994
# ... with 23 more variables: number_movies_byUser <int>,
#   number_users_byMovie <int>, Comedy <dbl>, Romance <dbl>, Action <dbl>,
#   Crime <dbl>, Thriller <dbl>, Drama <dbl>, `Sci-Fi` <dbl>,
#   Adventure <dbl>, Children <dbl>, Fantasy <dbl>, War <dbl>,
#   Animation <dbl>, Musical <dbl>, Western <dbl>, Mystery <dbl>,
#   `Film-Noir` <dbl>, Horror <dbl>, Documentary <dbl>, IMAX <dbl>, `(no
#   genres listed)` <dbl>, age_of_movie <dbl>

We noticed a column with no genres that contains only one movie in edx matrix, we will delete that column (outlier), also we will delete the columns (genres) with low sum of ratings. The reasons for this is because the train set is already big 9000047 rows so we don’t want to have many dimensions during the models building. The second reason is to prevent overfitting
Deletion of the dimensions that are outliers

In our data sets the variation of the age of the movies starts from 11 years (that means 2008- this is the last year we have movies until 104 then are the oldest movies we have).
We will check if there is a correlation between age of movie and ratings.

First we will create a new object, in order to observe easier, and also plot faster and with less memory!

avg_rating_per_oldness <- edx_matrix %>% group_by(age_of_movie) %>% summarize(avg_rating_by_age = mean(rating))

We will examine if there is a correlation between age of movie and rating.

ggpairs(avg_rating_per_oldness, mapping = aes(color = "age_of_movie"), title = "Age of Movie VS Rating correlation", 
    upper = list(continuous = wrap("cor", size = 10)), lower = list(continuous = "smooth"))

We can observe on the graph the negative skewness

We can clearly notice that there is a positive trend. The oldest the movie the highest the ratings it receives.

This is due to 2 reasons.

-First the oldest the movie the more ratings it has.
-Second usually the old movies are consider classics and they are rated better by the audience

It looks that the **age of movie will have low p value in or ML models training.**We create one more interactive plot to demonstrate it`

plotage <- avg_rating_per_oldness %>% ggplot(aes(x = age_of_movie, y = avg_rating_by_age)) + 
    ggtitle("Age of Movies vs Rating") + geom_line(color = "yellow") + theme_solarized(light = FALSE)
ggplotly(plotage)

Check if there is a correlation between the year that the film was rated and the rating. The films rated year varies from 1995 until 2009.

avg_rating_per_yearRated <- edx_matrix %>% group_by(year_rated) %>% summarize(avg_rating = mean(rating))
avg_rating_per_yearRated

# A tibble: 15 x 2
   year_rated avg_rating
   <chr>           <dbl>
 1 1995             4   
 2 1996             3.55
 3 1997             3.59
 4 1998             3.51
 5 1999             3.62
 6 2000             3.58
 7 2001             3.54
 8 2002             3.47
 9 2003             3.47
10 2004             3.43
11 2005             3.44
12 2006             3.47
13 2007             3.47
14 2008             3.54
15 2009             3.46

We observe that the oldest the movie was rated (1995) the highest was the rating. Same correlation as the age of the movie.

Now we have really good insight of our datasets.
We enrich them by adding new dimensions. And also we identified the outliers and the correlations.

Now we can choose the appropriate algorithms and proceed with the models training.

06.Method: Matrix Factorization with parallel stochastic

Model Building - Training and Validation

As we mentioned earlier. There are 2 types of recommender systems:
Content filtering (based on the description of the item - also called meta data or side information)
And collaborative Filtering: Those techniques are calculating the similarity
measures of the target ITEMS and finding the minimum (Euclidean distance,
or Cosine distance, or other metric, it depends on the algorithm). This is done
by filtering the interests of a user, by collecting preferences from many users
(collaborating). The underlying assumption is that if a person X has the same
opinion as a person Y then the recommendation system should be based
on preferences of person Y (similarity).

We will enhance the collaborative filtering with the help of Matrix factorization.
MF is a class of collaborative filtering algorithms used in recommender systems.

Matrix factorization algorithms work by decomposing the user-item interaction matrix
into the product of two lower dimensionality rectangular matrices. This family of methods
became widely known during the Netflix prize challenge due to its effectiveness as
reported by Simon Funk in his 2006 blog post, where he shared his findings with the research community
[recommender](https://en.wikipedia.org/wiki/Matrix_factorization_(recommender_systems)
We will apply Matrix Factorization with parallel stochastic gradient descent. With the help of “recosystem” package it is an R wrapper of the LIBMF library which creates a Recommender System by Using Parallel Matrix Factorization. The main task of recommender system is to predict unknown entries in the rating matrix based on observed values.
More info on the recosystem package and the techniques recosystem

Before we proceed with the model building, training and validation we define the RMSE function

RMSE <- function(true_ratings, predicted_ratings) {
    sqrt(mean((true_ratings - predicted_ratings)^2))
}

The data file for training set needs to be arranged in sparse matrix triplet form, i.e., each line in the file contains three numbers in order to use recosystem package we create 2 new matrices (our train and our validation set) with the below 3 features
-(movieId, userId, rating)
Create data sets for algorithm Matrix Factorization

edx_factorization <- edx %>% select(movieId, userId, rating)
validation_factorization <- validation %>% select(movieId, userId, rating)

edx_factorization <- as.matrix(edx_factorization)
validation_factorization <- as.matrix(validation_factorization)

Recosystem needs to save the files as tables on hard disk, (recosystem package needed)

write.table(edx_factorization, file = "trainingset.txt", sep = " ", row.names = FALSE, 
    col.names = FALSE)



write.table(validation_factorization, file = "validationset.txt", sep = " ", 
    row.names = FALSE, col.names = FALSE)

data_file is specific recosystem command

set.seed(1)
training_dataset <- data_file("trainingset.txt")

validation_dataset <- data_file("validationset.txt")

We reate a model object (a Reference Class object in R) by calling the function Reco()

r = Reco()

This step is optional. We call the $tune() method to select the best tuning parameters
(along a set of candidate values) You need to perform many different settings until you
will reach the optimal.

opts = r$tune(training_dataset, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 
    0.2), costp_l1 = 0, costq_l1 = 0, nthread = 1, niter = 10))

Display of the tuning

$min
$min$dim
[1] 30

$min$costp_l1
[1] 0

$min$costp_l2
[1] 0.1

$min$costq_l1
[1] 0

$min$costq_l2
[1] 0.01

$min$lrate
[1] 0.1

$min$loss_fun
[1] 0.7974779


$res
   dim costp_l1 costp_l2 costq_l1 costq_l2 lrate  loss_fun
1   10        0     0.01        0     0.01   0.1 0.8246018
2   20        0     0.01        0     0.01   0.1 0.8083978
3   30        0     0.01        0     0.01   0.1 0.8148374
4   10        0     0.10        0     0.01   0.1 0.8280235
5   20        0     0.10        0     0.01   0.1 0.8040914
6   30        0     0.10        0     0.01   0.1 0.7974779
7   10        0     0.01        0     0.10   0.1 0.8269870
8   20        0     0.01        0     0.10   0.1 0.8023862
9   30        0     0.01        0     0.10   0.1 0.8005307
10  10        0     0.10        0     0.10   0.1 0.8387589
 [ reached 'max' / getOption("max.print") -- omitted 14 rows ]

Now we train the model by calling the $train() method. A number of parameters can be set inside the function, coming from the result of the previous step - $tune().

r$train(training_dataset, opts = c(opts$min, nthread = 1, niter = 20))

"We write predictions to a tempfile on HDisk"

stored_prediction = tempfile()

With the $predict() method we will make predictions on validation set and will calculate RMSE:

r$predict(validation_dataset, out_file(stored_prediction))



real_ratings <- read.table("validationset.txt", header = FALSE, sep = " ")$V3


pred_ratings <- scan(stored_prediction)

Mean squared error (abbreviated MSE) and root mean square error (RMSE) refer to the amount by which the values predicted by an estimator differ from the quantities being estimated (typically outside the sample from which the model was estimated).

We calculate the standard deviation of the residuals (prediction errors) RMSE . Between the predicted ratings and the real ratings. If one or more predictors are significant, the second step is to assess how well the model fits the data by inspecting the Residuals Standard Error (RSE).

rmse_of_model_mf <- RMSE(real_ratings, pred_ratings)

Root mean squared error of the Matrix Factorization model

rmse_of_model_mf

[1] 0.7829978

We observe that the RMSE is extremely low. And possibly until know the Matrix factorization with SGD is the best approach to create a recommender system. I would like to thank Yu-Chin Juan, Wei-Sheng Chin, Yong Zhuang

for creating the LIMBF library but also Yixuan Qiu that created the R wrapper.
link

We will compare the first 50 predictions of the MF model with the real ratings. First we round the predictions for visualization convenience

pred_ratings_rounded <- pred_ratings



pred_ratings_rounded <- round(pred_ratings_rounded/0.5) * 0.5



MF_first50_pred <- data.frame(real_ratings[1:50], pred_ratings_rounded[1:50])


names(MF_first50_pred) <- c("real_ratings", "predicted_ratings")

Interactive plot - with the 50 first predicted ratings of the MF model. The light blue are the correct predictions

pmf <- ggplot(data = MF_first50_pred, aes(x = real_ratings, y = predicted_ratings, 
    colour = correct_predicted)) + xlab("Real Ratings") + ylab("Predicted Ratings") + 
    ggtitle("Real vs Predicted Ratings") + theme(plot.title = element_text(size = 12, 
    color = "darkblue", hjust = 0.5)) + geom_jitter()
ggplotly(pmf)

kable(MF_first50_pred) %>% kable_styling(bootstrap_options = "striped", full_width = F, 
    position = "center") %>% column_spec(1, border_left = T) %>% column_spec(2, 
    bold = T, border_right = T) %>% footnote(general = "MF model 50 first predictions", 
    footnote_as_chunk = T)

real_ratings	predicted_ratings	correct_predicted
5.0	4.0	0
5.0	5.0	1
5.0	5.0	1
3.0	3.5	0
2.0	4.5	0
3.0	2.5	0
3.5	4.0	0
4.5	4.5	1
5.0	4.5	0
3.0	3.5	0
3.0	3.5	0
3.0	3.5	0
3.0	3.5	0
3.0	4.0	0
3.0	3.5	0
3.0	4.5	0
3.0	4.0	0
3.0	3.0	1
4.0	4.0	1
5.0	4.0	0
3.0	3.5	0
4.0	5.0	0
4.0	4.5	0
4.0	4.5	0
5.0	5.0	1
4.0	3.5	0
2.0	2.0	1
5.0	4.5	0
5.0	4.5	0
5.0	4.5	0
4.0	4.0	1
3.0	3.0	1
4.0	4.5	0
4.0	3.5	0
5.0	4.5	0
5.0	4.0	0
5.0	4.0	0
4.0	3.0	0
4.0	4.0	1
4.0	4.5	0
4.0	4.0	1
5.0	4.0	0
3.5	3.0	0
5.0	3.5	0
4.0	3.5	0
4.5	4.0	0
2.5	3.0	0
4.5	3.5	0
3.5	4.0	0
4.0	3.5	0
Note: MF model 50 first predictions

all_models_rmse_results <- data.frame(Algorithm = c("Matrix factorization with SGD"), 
    RMSE = c(rmse_of_model_mf))

Root Mean Squared error of the Matrix factorization with parallel stochastic gradient descent

kable(all_models_rmse_results) %>% kable_styling(bootstrap_options = "striped", 
    full_width = F, position = "center") %>% column_spec(1, border_left = T) %>% 
    column_spec(2, bold = T, border_right = T)

Algorithm	RMSE
Matrix factorization with SGD	0.7829978

07.H2o Machine learning training, build and validating part

H2o open source machine learning and artificial intelligence platform

Create factors because in h2o the dependent variables need to be factors

edx_h2o <- edx_matrix

edx_h2o$userId <- as.factor(edx_h2o$userId)
edx_h2o$movieId <- as.factor(edx_h2o$movieId)
edx_h2o$Drama <- as.factor(edx_h2o$Drama)
edx_h2o$Comedy <- as.factor(edx_h2o$Comedy)
edx_h2o$age_of_movie <- as.factor(edx_h2o$age_of_movie)
edx_h2o$Thriller <- as.factor(edx_h2o$Thriller)
edx_h2o$Action <- as.factor(edx_h2o$Action)
edx_h2o$Adventure <- as.factor(edx_h2o$Adventure)
edx_h2o$Crime <- as.factor(edx_h2o$Crime)
edx_h2o$`Sci-Fi` <- as.factor(edx_h2o$`Sci-Fi`)
edx_h2o$year_rated <- as.factor(edx_h2o$year_rated)

validation_h2o <- validation_matrix

validation_h2o$userId <- as.factor(validation_h2o$userId)
validation_h2o$movieId <- as.factor(validation_h2o$movieId)
validation_h2o$Drama <- as.factor(validation_h2o$Drama)
validation_h2o$Comedy <- as.factor(validation_h2o$Comedy)
validation_h2o$age_of_movie <- as.factor(validation_h2o$age_of_movie)
validation_h2o$Thriller <- as.factor(validation_h2o$Thriller)
validation_h2o$Action <- as.factor(validation_h2o$Action)
validation_h2o$Adventure <- as.factor(validation_h2o$Adventure)
validation_h2o$Crime <- as.factor(validation_h2o$Crime)
validation_h2o$`Sci-Fi` <- as.factor(validation_h2o$`Sci-Fi`)
validation_h2o$year_rated <- as.factor(validation_h2o$year_rated)

Start of the H2o cluster into your environment. Adjust RAM accordingly

require(h2o)
h2o.init(ignore_config = TRUE, nthreads = -1, max_mem_size = "24G")


H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\npapaco\AppData\Local\Temp\2\Rtmp0ebFeF/h2o_npapaco_started_from_r.out
    C:\Users\npapaco\AppData\Local\Temp\2\Rtmp0ebFeF/h2o_npapaco_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 225 milliseconds 
    H2O cluster timezone:       Europe/Berlin 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.22.1.1 
    H2O cluster version age:    4 months and 9 days !!! 
    H2O cluster name:           H2O_started_from_R_npapaco_jms911 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   21.33 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.5.3 (2019-03-11)

After the start of the cluster optionally we can access it from browser to http://localhost:54321
I recommend to try it. Experiment also with pojo saving of models
H2o algorithms are working only with H2o frameworks. We create also a split on the train data. but for the model evaluation we will use the initial test set (validation)

splits3 <- h2o.splitFrame(as.h2o(edx_h2o), ratios = 0.7, seed = 1)

train3 <- splits3[[1]]
test3 <- splits3[[2]]

test_validation <- as.h2o(validation_h2o)

We start with h2o Automl: - H2o’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles - one based on all previously trained models, another one on the best model of each family - will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.

(http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)

08.Method: H2O AutoML

H2O’s AutoML can be used for automating the machine learning workflow , which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles - one based on all previously trained models, another one on the best model of each family - will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.

(http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)

H2o Auto ML hyper parameters

The model with the least RMSE in the leaderboard

h2oamlmodel2@leader

Model Details:
==============

H2ORegressionModel: drf
Model ID:  DRF_1_AutoML_20190424_155200 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                       50              185395         1
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        20   14.40000          2        680   160.96000


H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

MSE:  1.01896
RMSE:  1.009436
MAE:  0.8091325
RMSLE:  0.2679033
Mean Residual Deviance :  1.01896



H2ORegressionMetrics: drf
** Reported on cross-validation data. **
** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  1.017104
RMSE:  1.008516
MAE:  0.8084183
RMSLE:  0.2677314
Mean Residual Deviance :  1.017104


Cross-Validation Metrics Summary: 
                              mean            sd cv_1_valid cv_2_valid
mae                      0.8084183   2.020126E-5  0.8083981 0.80839807
mean_residual_deviance   1.0171036 1.16297306E-4   1.016921  1.0173197
mse                      1.0171036 1.16297306E-4   1.016921  1.0173197
r2                     0.095347136   8.510263E-5 0.09520545  0.0953363
residual_deviance        1.0171036 1.16297306E-4   1.016921  1.0173197
rmse                     1.0085155  5.7656973E-5   1.008425  1.0086226
rmsle                    0.2677314   3.339011E-5 0.26766708  0.2677791
                       cv_3_valid
mae                     0.8084587
mean_residual_deviance  1.0170699
mse                     1.0170699
r2                     0.09549966
residual_deviance       1.0170699
rmse                    1.0084988
rmsle                  0.26774806

we observe the model that had the lowest RMSE on the leaderboard .The (leader) of all models tested was the below.

-Model ID: DRF_1_AutoML_20190424_155200.

-Algorithm: Algorithm: Distributed Random Forest

Display of the 6 best models from the leaderboard

h2oamlmodel2@leaderboard

                                   model_id mean_residual_deviance
1              DRF_1_AutoML_20190424_155200               1.077967
2              XRT_1_AutoML_20190424_155200               1.088622
3 GLM_grid_1_AutoML_20190424_155200_model_1               1.099932
4              GBM_4_AutoML_20190424_155200               1.112490
5              GBM_2_AutoML_20190424_155200               1.116864
6 GLM_grid_1_AutoML_20190424_142708_model_1               1.117170
      rmse      mse       mae     rmsle
1 1.038252 1.077967 0.8410296 0.2702502
2 1.043370 1.088622 0.8475965 0.2717722
3 1.048777 1.099932 0.8462009 0.2689410
4 1.054746 1.112490 0.8592683 0.2673023
5 1.056818 1.116864 0.8620348 0.2674363
6 1.056963 1.117170 0.8571898 0.2725451

[15 rows x 6 columns]

Print of the scoring history

h2o.scoreHistory(h2oamlmodel2@leader)

Scoring History: 
             timestamp          duration number_of_trees training_rmse
1  2019-04-24 15:56:44  4 min 43.962 sec               0            NA
2  2019-04-24 15:56:45  4 min 44.868 sec               1       1.04531
3  2019-04-24 15:56:46  4 min 46.496 sec               2       1.01972
4  2019-04-24 15:56:48  4 min 47.738 sec               3       1.02279
5  2019-04-24 15:56:52  4 min 52.322 sec               7       1.02390
6  2019-04-24 15:56:57  4 min 56.892 sec              10       1.01407
7  2019-04-24 15:57:01  5 min  1.097 sec              14       1.01350
8  2019-04-24 15:57:06  5 min  5.648 sec              18       1.01059
9  2019-04-24 15:57:11  5 min 10.711 sec              22       1.00790
10 2019-04-24 15:57:15  5 min 15.087 sec              25       1.00708
11 2019-04-24 15:57:19  5 min 19.196 sec              30       1.00855
12 2019-04-24 15:57:24  5 min 23.735 sec              33       1.00705
   training_mae training_deviance
1            NA                NA
2       0.84292           1.09268
3       0.81434           1.03983
4       0.81802           1.04610
5       0.82144           1.04838
6       0.81201           1.02834
7       0.81177           1.02718
8       0.80910           1.02129
9       0.80635           1.01587
10      0.80546           1.01422
11      0.80753           1.01717
12      0.80580           1.01414
 [ reached 'max' / getOption("max.print") -- omitted 4 rows ]

Plot of the training history, metric (rmse)

plot(h2oamlmodel2@leader, timestep = "number_of_trees", metric = "rmse")

Plot of the variables Importance (by order)

h2o.varimp_plot(h2oamlmodel2@leader, num_of_features = NULL)

Root mean squared error of H2o Auto ML algorithm

Algorithm	RMSE
Matrix factorization with SGD	0.7829978
H2o Auto ML model2	1.0382518

09.Method: H2o Generalized Linear Models (GLM)

Because on autoML we can’t tune many hyper parameters, we will also try on other models with various hyper tunings. Then we will stack those different models stacked ensemble.

Ensemble machine learning methods usemultiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms we start with:

-Generalized Linear Models (GLM) estimate regression models for outcomes following exponential distributions. In addition to the Gaussian (i.e. normal) distribution, these include

-Poisson

-binomial and

-gamma distributions (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html)
H2o GLM hyper parameters 0=ridge , 1 = lasso, so we leave it in the middle to use both penalizers

output - Standardized Coefficient Magnitudes (standardized coefficient magnitudes)
-movieId
-userId
-age_of_movie

RMSE H2o GLM algorithm

Algorithm	RMSE
Matrix factorization with SGD	0.7829978
H2o Auto ML model	1.0382518
H2o GLM model	1.0163097

10.Method: H2o - Gradient Boosting Machine model

Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel. (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html)

Plot scoring history

plot(h2o_gbm_model)

Plot the variables importance

h2o.varimp_plot(h2o_gbm_model, num_of_features = NULL)

Print the scoring history

h2o.scoreHistory(h2o_gbm_model)

Scoring History: 
            timestamp          duration number_of_trees training_rmse
1 2019-04-24 21:30:32  2 min 32.011 sec               0       1.06041
2 2019-04-24 21:30:33  2 min 33.057 sec               1       1.05020
3 2019-04-24 21:30:35  2 min 35.116 sec               2       1.04137
4 2019-04-24 21:30:40  2 min 39.933 sec               9       1.00783
5 2019-04-24 21:30:50  2 min 50.090 sec              24       0.98933
6 2019-04-24 21:31:08  3 min  7.809 sec              50       0.97746
  training_mae training_deviance validation_rmse validation_mae
1      0.85561           1.12446         1.06016        0.85539
2      0.84784           1.10293         1.05000        0.84765
3      0.84121           1.08444         1.04120        0.84105
4      0.80780           1.01572         1.00785        0.80779
5      0.78449           0.97877         0.98954        0.78462
6      0.77237           0.95543         0.97786        0.77263
  validation_deviance
1             1.12393
2             1.10249
3             1.08410
4             1.01577
5             0.97918
6             0.95621

Root mean squared error calculation

h2o.performance(h2o_gbm_model, test3)
pred.ratings.h2o_gbm_model <- h2o.predict(h2o_gbm_model, as.h2o(test_validation))
rmse_of_model_h2o_gbm_model <- RMSE(pred.ratings.h2o_gbm_model, as.h2o(test_validation$rating))

Root mean squared error of H2o GBM model

Algorithm	RMSE
Matrix factorization with SGD	0.7829978
H2o Auto ML model	1.0382518
H2o GLM model	1.0163097
H2o GBM model	1.0353257

11.Method: H2o Distributed Random Forest

Distributed Random Forest (DRF) is a powerful classification and regression tool. When
given a set of data, DRF generates a forest of classification (or regression) trees, rather
** than a single classification (or regression) tree. Each of these trees is a weak learner built **
on a subset of rows and columns

Plot variables importance

h2o.varimp_plot(h2orf1, num_of_features = NULL)

The scoring history of h2orf1

h2o.scoreHistory(h2orf1)

Scoring History: 
            timestamp          duration number_of_trees training_rmse
1 2019-04-22 20:28:56 25 min 48.392 sec               0            NA
2 2019-04-22 20:29:00 25 min 52.347 sec               1       0.95437
3 2019-04-22 20:29:04 25 min 56.322 sec               2       0.95141
4 2019-04-22 20:29:08 26 min  0.384 sec               3       0.94985
5 2019-04-22 20:29:12 26 min  4.251 sec               4       0.94858
  training_mae training_deviance
1           NA                NA
2      0.74891           0.91081
3      0.74656           0.90518
4      0.74548           0.90221
5      0.74441           0.89981

---
             timestamp          duration number_of_trees training_rmse
46 2019-04-22 20:31:55 28 min 46.636 sec              45       0.93905
47 2019-04-22 20:31:59 28 min 50.840 sec              46       0.93903
48 2019-04-22 20:32:03 28 min 54.761 sec              47       0.93900
49 2019-04-22 20:32:07 28 min 58.739 sec              48       0.93900
50 2019-04-22 20:32:11 29 min  2.568 sec              49       0.93898
51 2019-04-22 20:32:14 29 min  6.439 sec              50       0.93898
   training_mae training_deviance
46      0.73704           0.88181
47      0.73702           0.88179
48      0.73699           0.88172
49      0.73699           0.88171
50      0.73697           0.88169
51      0.73697           0.88167

Root mean squared error of h2orf1_model

Algorithm	RMSE
Matrix factorization with SGD	0.7829978
H2o Auto ML model	1.0382518
H2o GLM model	1.0163097
H2o GBM model	1.0353257
H2o RF model	1.0295050

12.Method: H2o Stacked Ensembles

Ensemble machine learning methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms . H2O’s Stacked Ensemble method is supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking. This method currently supports regression and binary classification. Stacking, also called Super Learning or Stacked Regression, is a class of algorithms that involves training a second-level “metalearner” to find the optimal combination of the base learners. Unlike bagging and boosting, the goal in stacking is to ensemble strong, diverse sets of learners together.

(https://h2o-release.s3.amazonaws.com/h2o/rel-ueno/2/docs-website/h2o-docs/data-science/stacked-ensembles.html)

We will stack the previous 3 models:

-Generalized Linear Models
-Gradient Boosting Machine and the
-Distributed Random Forest (h2o_glm, h2o_gbm_model, h2orf_model)

-Algorithm: Stacked Ensemble
-Model ID: h2oensemble2

13.Root mean squared error of all models

methods	rmse
Matrix factorization with SGD	0.7829978
H2o Auto ML model	1.0382518
H2o GLM model	1.0163097
H2o GBM model	1.0353257
H2o RF model	1.0295050
H2o Ensemble model	1.0036833

14. Results

The lowest RMSE surprisingly was achieved only with 2 Features : user ID and Movie ID in Matrix factorization with SGD (RMSE 0.78). Newral Networks algorithms are extremely slow and demanding on computation (even only with 2 newrons and 2 features). They needed more than 8 hours! If you need them please ask and i will send you the trained models

I have also run the algorithms on CUDA where i show significantly speed improvement comparing with my above algorithms. I would definitely recommended it.

H2o XGBoost algorithm runs ONLY IN LINUX
The second lowest was the h2o ensemble model (RMSE 1.003) which stacked the below models

-(GLM,GBM,DRF)

With more hyper parameters tuning even lower RMSE can be achieved. But not so low like with MF models.

I wouldn’t recommend the auto ML model, seems you can’t tune the hyper parameters of the models, that results not into the lowest RMSE or higher accuracy on classification models.

I also trained the same models with scaled values but the RMSE was in all models higher. It seems that the most important features where:

-Number of users rated the movie
-age of movie = more ratings and higher mean rating
-the movie id and
-Drama = (genre with the most ratings)

The other features didn’t had low p value and didn’t improve the model efficiency but the over fitted it.

In H2o Auto ML model (RMSE 1.03) the most important features where the below OUTPUT - VARIABLE IMPORTANCES
-variable relative_importance scaled_importance percentage
-n.users_bymovie 7164049.0 1.0 0.3019
-Drama 5418105.0 0.7563 0.2283
-movieId 5250999.0 0.7330 0.2212

The GLM Model (RMSE 1.01 ) has similar evaluation process with Matrix factorization with SGD. That’s why put more weight on the below features
output - Standardized Coefficient Magnitudes (standardized coefficient magnitudes)

-movieId
-userId
-age_of_movie
-But the RMSE 1.01 was not so low like MF model.

The Gradient Boosting Machine Model (RMSE 1.03) put more weight on the below features
-variable relative_importance scaled_importance percentage
-age_of_movie 1759746.6250 1.0 0.3140
-n.users_bymovie 1684104.0 0.9570 0.3005
-movieId 883807.2500 0.5022 0.1577

The distributed Random Forest Model (RMSE 1.02) put more weight on the below features

-variable relative_importance scaled_importance percentage
-n.users_bymovie 26496842.0 1.0 0.2790
-age_of_movie 22493868.0 0.8489 0.2369
-movieId 21997428.0 0.8302 0.2317
-Drama 9633707.0 0.3636 0.1015

The h2o ensemble model (RMSE 1.003) stacked the below 3 models:

-GLM
-GBM
-DRF

And had significant the lowest RMSE of the H2o Models.
With more hyper parameters tuning even lower RMSE can be achieved. But not so low like with MF models.

15. Conclussion

With web scraping we could add more dimensions into our datasets such us:

-budget of movie
-critics rating and
-duration of movie and compare the RMSE’s
I wouldn’t recommend the auto ML model seems you can’t tune the hyper parameters of the models and that results not into the optimal model.
Thank you for reading my analysis.
KR
Niko
# Contact

(https://www.linkedin.com/in/niko-papacosmas-mba-pmp-mcse-695a2695/)

Capstone HarvardX - Project MovieLens

Papacosmas Niko

May 2019