1 - Description


MovieLens is a data set that contains ratings from the MovieLens website (http://movielens.org). The data set is published in several sizes for research purposes. We will be using the small data set, which contains roughly 100,000 ratings applied to 9,000 movies by 700 users.

Using the MovieLens data set, we will construct a simple recommender system to recommend movies to users. The system will be built with user-based collaborative filtering, using both hand-built algorithms and the recommenderlab package.

2 - Data Set


The MovieLens data includes two data sets that we are interested in. The first is the movies data, which contains the movie ID, the title, and the genres. The second is the user ratings data, which contains the user ID, the movie ID, the rating, and a time-stamp in UNIX time. We will drop the time-stamp for this project.

Reading in the required data sets.

Let's take a quick look at the data that we have loaded.

2.1 - Movies

2.2 - Ratings

3 - Building the Recommender by Hand


The first thing that we need to do is construct the functions that we will use to generate recommendations. We will be using the cosine similarity between users to generate our recommendations so we first build a cosine similarity function. We then build our function to generate the recommendations. This function takes the similarity matrix generated by the cosine similarity function, the movies and ratings data, and info on the user, number of recommendations requested, and the number of nearest neighbors to use.

4 - Recommender with Normalization


We first try to build a recommendation system using normalization. There are some issues that we will see below. The first is the meaning of zero. In order for R to calculate the matrix multiplication used in our similarity function, we need to fill the NA’s with 0. If we don’t, we get a matrix with 0 on the diagonal and NA in every other element. However, once the data is normalized, 0 means a centered (average) review, and this leads to issues with the list of movies that gets recommended.

4.1 - Converting the Pairwise Ratings into a Sparse Matrix with Normalization

To calculate the user-based collaborative filter we need to transform our pairwise data into a sparse matrix. We will use the tidyr package to reshape the data.

dim(user_movie_mat)
[1]  671 9066

4.2 - Calculating Similarity using the Cosine Similarity

Now that we have the data loaded into a sparse matrix, we want to calculate the cosine similarity between each pair of users. We also plot the similarity values. We see that there may be issues already, given that there is no diagonal line indicating each user’s similarity with themselves.

4.3 - Recommending a Movie Based on the Cosine Similarity with Normalization

We will now use the recommender and the normalized similarity matrix to recommend a set of 10 movies to user 100 using the 30 nearest neighbors.

norm_recs
      [,1]                                                  
 [1,] "Young Poisoner's Handbook, The (1995)"               
 [2,] "Addams Family Values (1993)"                         
 [3,] "Contact (1997)"                                      
 [4,] "No Holds Barred (1989)"                              
 [5,] "National Velvet (1944)"                              
 [6,] "Dracula: Dead and Loving It (1995)"                  
 [7,] "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)"
 [8,] "Friday (1995)"                                       
 [9,] "Misérables, Les (1995)"                              
[10,] "Screamers (1995)"                                    

5 - Building a Recommendation System Using Non-Normalized Data


Given that we ran into a potential issue with the normalized data, we will build out the recommender using the same data and functions without normalizing the data first.

5.1 - Generating the Similarity Matrix

We first need to recreate the user matrix without normalization.

5.2 - Generating the Cosine Similarities

Now that we have our non-normalized matrix we will generate the cosine similarities. We see that the plot of the similarity matrix also lacks any real pattern and, not surprisingly, we get a different set of movies recommended to the user.

5.3 - Non-Normalized Movie Recommendations

We note that this is a different set than the one generated from the normalized data.

recs
      [,1]                                    
 [1,] "Crossing Guard, The (1995)"            
 [2,] "Murder in the First (1995)"            
 [3,] "On Golden Pond (1981)"                 
 [4,] "Howling, The (1980)"                   
 [5,] "Fifth Element, The (1997)"             
 [6,] "Hot Lead and Cold Feet (1978)"         
 [7,] "Metroland (1997)"                      
 [8,] "Morning After, The (1986)"             
 [9,] "Sister Act 2: Back in the Habit (1993)"
[10,] "Where the Money Is (2000)"             

6 - Building A Recommendation System Using recommenderlab


For the final portion of the project we will use the recommenderlab package to build recommendations for users. The process is relatively straightforward, as the package takes the sparse user-review matrix, converts it to a real rating matrix, and then builds the recommender model based on this information.

movie_lab <- function(user_mat, movies, user, num_recs = 10, neigh = 10){
  #Convert rating matrix into a recommenderlab sparse matrix
  user_mat <- as(user_mat, "realRatingMatrix")
  
  #Create Recommender Model. "UBCF" stands for User-Based Collaborative Filtering
  recommender_model <- Recommender(user_mat, 
                                   method = "UBCF", 
                                   param=list(method="Cosine",nn = neigh))
  recom <- predict(recommender_model, 
                   user_mat[user], 
                   n=num_recs) #Obtain the top num_recs recommendations for the requested user
  
  recom_list <- as(recom, "list") #convert recommenderlab object to readable list
  
  recom_result <- matrix(0,num_recs)
  for (i in c(1:num_recs)){
    recom_result[i] <- movies$title[as.integer(recom_list[[1]][i])]
  }
  return(recom_result)
}

6.1 - Recommenderlab Movie Recommendations

Using the function above we can now generate the top 10 recommendations for user 100 using the 30 nearest neighbors. We see that these recommendations are once again different from the ones that we generated by hand above.

lab_recs
      [,1]                         
 [1,] "Scarlet Letter, The (1926)" 
 [2,] "Only You (1994)"            
 [3,] "Crooklyn (1994)"            
 [4,] "Substitute, The (1996)"     
 [5,] "Mary Reilly (1996)"         
 [6,] "Month by the Lake, A (1995)"
 [7,] "Top Hat (1935)"             
 [8,] "Eye for an Eye (1996)"      
 [9,] "Mighty Aphrodite (1995)"    
[10,] "Tales from the Hood (1995)" 

7 - Comparison of the Three Recommendations and Performance

The first thing that we note is that there is a definite difference in performance between the hand-coded algorithm and the recommenderlab package. The hand-coded step that generates the cosine similarity was slow: it took 20 to 30 seconds, if not more, to carry out the multiplications that generate the similarity matrix. The recommenderlab package does this much more quickly, so the model can be regenerated each time the function is called without a noticeable performance hit.

The second thing that we notice is that all three methods generate different recommendations. My best guess is that the function that generates the recommendations from the similarity matrix is not using the correct method for getting the best results. It is interesting to see that the normalized and non-normalized data lead to different results, but this is likely because we need to include zeros to do the matrix multiplication, and those zeros take on a meaning once we normalize the data.

---
title: 'DATA643 - Project 1: Basic Recommender System'
author: "Erik Nylander"
output:
  html_notebook: default
  pdf_document: default
---
```{R, include = FALSE}
require(tidyr)
require(dplyr)
```
## 1 - Description
***
[MovieLens](https://grouplens.org/datasets/movielens/) is a data set that contains ratings from the MovieLens website (http://movielens.org). The data set is published in several sizes for research purposes. We will be using the small data set, which contains roughly 100,000 ratings applied to 9,000 movies by 700 users.  

Using the MovieLens data set, we will construct a simple recommender system to recommend movies to users. The system will be built with user-based collaborative filtering, using both hand-built algorithms and the *recommenderlab* package.

## 2 - Data Set
***
The MovieLens data includes two data sets that we are interested in. The first is the movies data, which contains the movie ID, the title, and the genres. The second is the user ratings data, which contains the user ID, the movie ID, the rating, and a time-stamp in UNIX time. We will drop the time-stamp for this project.

Reading in the required data sets.

```{R, include = TRUE}
movies <- read.csv("~/GitHub/DATA643/data/ml-latest-small/movies.csv", header = TRUE, sep = ",",
                   stringsAsFactors = FALSE, encoding = "UTF-8")
ratings <- read.csv("~/GitHub/DATA643/data/ml-latest-small/ratings.csv", header = TRUE, sep =",",
                    stringsAsFactors = FALSE)
ratings <- ratings[,c(1,2,3)]
```

Let's take a quick look at the data that we have loaded.

#### 2.1 - Movies
```{R}
dim(movies)
head(movies)
```

#### 2.2 - Ratings
```{R}
dim(ratings)
head(ratings)
```


## 3 - Building the Recommender by Hand
***
The first thing that we need to do is construct the functions that we will use to generate recommendations. We will be using the cosine similarity between users to generate our recommendations so we first build a cosine similarity function. We then build our function to generate the recommendations. This function takes the similarity matrix generated by the cosine similarity function, the movies and ratings data, and info on the user, number of recommendations requested, and the number of nearest neighbors to use.

```{R}
# Calculate the cosine similarity using built in linear algebra.
cosineSim <- function(x){
  x <- as.dist(x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2))))) 
  return(x)
}

# Movie recommendation function
movie_rec <- function(sim_mat, movies, ratings, user, num_recs = 10, nn = 10){
  sim_vec <- sim_mat[user,]
  sim_vec <- names(sort(sim_vec))
  sim_vec <- sim_vec[2:nn+1]
  reviewed <- ratings[ratings$userId == user,]
  
  suggested <- ratings %>%
    filter(ratings$userId %in% sim_vec) %>%
    group_by(movieId) %>%
    summarise(
      count = n(), #How many times was the movie rated
      avg_rating = mean(rating) #What is the average rating from the similar reviewers
    ) %>%
    filter(!(movieId %in% reviewed$movieId)) %>%
    arrange(desc(avg_rating), desc(count))

  rec_list <- matrix(0, num_recs)
  for(i in 1:num_recs){
    rec_list[i] <- movies$title[suggested$movieId[i]]
  }
  return(rec_list)
}
```
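
For reference, the cosine similarity between two users is the dot product of their rating vectors divided by the product of the vectors' lengths. The following is a minimal sketch on a made-up 3 x 4 matrix (the values and the names `toy`, `sim_full`, and `cos_pair` are illustrative assumptions, not part of the project data). It checks that the matrix formulation used in `cosineSim` agrees with the pairwise definition, and it leaves out the `as.dist()` step so that the full matrix is visible.

```r
# Toy rating matrix: 3 users (rows) by 4 movies (columns); values are illustrative only.
toy <- matrix(c(5, 3, 0, 1,
                4, 0, 0, 1,
                1, 1, 5, 4),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("u", 1:3), paste0("m", 1:4)))

# Matrix form: all pairwise dot products divided by the outer product of the row lengths.
row_norms <- sqrt(rowSums(toy^2))
sim_full  <- (toy %*% t(toy)) / (row_norms %o% row_norms)

# Pairwise form for a single pair of users, written straight from the definition.
cos_pair <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

# The two forms should agree for any pair of users.
all.equal(sim_full["u1", "u2"], cos_pair(toy["u1", ], toy["u2", ]))
round(sim_full, 3)
```

In this form every user has a similarity of 1 with themselves, which is a useful sanity check on any similarity matrix.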

## 4 - Recommender with Normalization
***
We first try to build a recommendation system using normalization. There are some issues that we will see below. The first is the meaning of zero. In order for R to calculate the matrix multiplication used in our similarity function, we need to fill the NA's with 0. If we don't, we get a matrix with 0 on the diagonal and NA in every other element. However, once the data is normalized, 0 means a centered (average) review, and this leads to issues with the list of movies that gets recommended.
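
The need to fill the NA's can be seen on a tiny example. This is a sketch with made-up ratings (the names `toy` and `filled` are assumptions for illustration): a single NA in either vector makes the whole dot product NA, so the product of two sparse rating matrices is NA almost everywhere until the gaps are filled.

```r
# Two users, three movies; NA marks an unrated movie (illustrative values).
toy <- matrix(c(5, NA, 3,
                4,  2, NA),
              nrow = 2, byrow = TRUE)

# Every dot product here touches at least one NA, so every entry of the result is NA.
toy %*% t(toy)

# After filling with 0 the products are defined, but 0 now stands in for "unrated".
filled <- toy
filled[is.na(filled)] <- 0
filled %*% t(filled)
```

On raw ratings a 0 sits below the lowest possible rating, while on centered ratings it sits exactly at the user's average, so the same fill value carries a different meaning in the two cases.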

#### 4.1 - Converting the Pairwise Ratings into a Sparse Matrix with Normalization
To calculate the user-based collaborative filter we need to transform our pairwise data into a sparse matrix. We will use the *tidyr* package to reshape the data.  
```{R}
# Creating a sparse matrix of users as rows and movieId as columns.
user_movie_mat <- ratings %>%
  spread(key = movieId, value = rating) %>%
  as.matrix()

user_movie_mat = user_movie_mat[,-1] #remove userID col. Rows are userIds, cols are movieIds

user_movie_mat <- t(scale(t(user_movie_mat), center = TRUE, scale = TRUE)) #performing row-wise normalization.

user_movie_mat[is.na(user_movie_mat)] <- 0 #converting the NA's to 0 so that we can multiply the matrices below. 
```
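
To make the row-wise normalization concrete, here is a minimal sketch on a made-up matrix (the values and the names `toy` and `normed` are illustrative assumptions). It also shows one edge case worth being aware of: a user whose ratings are all identical has a standard deviation of 0, so scaling produces NaN, and the NA fill then turns that user's entire row into zeros.

```r
# Toy matrix with NA for unrated movies (illustrative values).
toy <- matrix(c(5, 3, NA, 1,
                4, 4, NA, 4,   # a user who gives every movie the same rating
                NA, 2, 5, NA),
              nrow = 3, byrow = TRUE)

# scale() works column-wise, so transpose, scale, and transpose back to normalize each row.
normed <- t(scale(t(toy), center = TRUE, scale = TRUE))

rowMeans(normed, na.rm = TRUE)   # rows 1 and 3 are centered on 0; row 2 is NaN
normed[2, ]                      # constant ratings: 0 / 0 gives NaN for the rated movies

# is.na() is TRUE for NaN as well, so unrated movies and the constant row all become 0.
normed[is.na(normed)] <- 0
normed
```

An all-zero row has length 0, so its cosine similarity with any other user is a division by zero. Whether any user in the MovieLens sample falls into this case has not been checked here.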

#### 4.2 - Calculating Similarity using the Cosine Similarity
Now that we have the data loaded into a sparse matrix, we want to calculate the cosine similarity between each pair of users. We also plot the similarity values. We see that there may be issues already, given that there is no diagonal line indicating each user's similarity with themselves.

```{R}
# Creating the similarity matrix
similar_reviews_norm <- as.matrix(cosineSim(user_movie_mat))

image(similar_reviews_norm)
```
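
One likely reason for the missing diagonal is the `as.dist()` call inside `cosineSim`. A `dist` object stores only the lower triangle, so converting it back with `as.matrix()` fills the diagonal with 0. A minimal sketch with a made-up similarity matrix (the names `sim` and `round_trip` are assumptions for illustration):

```r
# A small symmetric similarity matrix with 1s on the diagonal (illustrative values).
sim <- matrix(c(1.0, 0.8, 0.2,
                0.8, 1.0, 0.5,
                0.2, 0.5, 1.0),
              nrow = 3, byrow = TRUE)

# as.dist() keeps only the lower triangle; converting back fills the diagonal with 0.
round_trip <- as.matrix(as.dist(sim))
diag(round_trip)        # 0 0 0: the self-similarity of 1 is gone
round_trip[2, 1]        # 0.8: off-diagonal values are unchanged
```

A side effect is that each user's similarity with themselves is recorded as 0 rather than 1, which matters for any later step that assumes the user is their own closest match.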


#### 4.3 - Recommending a Movie Based on the Cosine Similarity with Normalization
We will now use the recommender and the normalized similarity matrix to recommend a set of 10 movies to user 100 using the 30 nearest neighbors. 
```{R}
norm_recs <- movie_rec(similar_reviews_norm, movies, ratings, user = 100, num_recs = 10, nn = 30)
norm_recs
```
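
Two details of the neighbor selection in `movie_rec` are worth checking before reading too much into this list. The function calls `sort(sim_vec)` and then indexes with `2:nn+1`. The sketch below uses a made-up similarity vector (the values are illustrative assumptions) to show how R evaluates each of those expressions.

```r
# Similarities between one user and five others (illustrative values, named by user ID).
sim_vec <- c("1" = 0.10, "2" = 0.90, "3" = 0.40, "4" = 0.75, "5" = 0.05)
nn <- 3

# Operator precedence: `:` binds tighter than `+`, so 2:nn+1 means (2:nn) + 1.
2:nn + 1        # 3 4
2:(nn + 1)      # 2 3 4

# sort() is ascending by default, so the first names are the least similar users.
names(sort(sim_vec))[1:nn]                       # "5" "1" "3"
names(sort(sim_vec, decreasing = TRUE))[1:nn]    # "2" "4" "3"
```

If the intent is to use the `nn` most similar users, then `sort(sim_vec, decreasing = TRUE)` with an explicit index range would express that directly. This is offered as a possible adjustment rather than a verified fix, since it has not been run against the full data set.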

## 5 - Building a Recommendation System Using Non-Normalized Data
***
Given that we ran into a potential issue with the normalized data, we will build out the recommender using the same data and functions without normalizing the data first.

#### 5.1 - Generating the Similarity Matrix
We first need to recreate the user matrix without normalization.
```{R}
# Creating a sparse matrix of users as rows and movieId as columns.
user_movie_mat <- ratings %>%
  spread(key = movieId, value = rating) %>%
  as.matrix()

user_movie_mat = user_movie_mat[,-1] #remove userID col. Rows are userIds, cols are movieIds

user_movie_mat[is.na(user_movie_mat)] <- 0 #converting the NA's to 0 so that we can multiply the matrices below. 
```

#### 5.2 - Generating the Cosine Similarities
Now that we have our non-normalized matrix we will generate the cosine similarities. We see that the plot of the similarity matrix also lacks any real pattern and, not surprisingly, we get a different set of movies recommended to the user. 
```{R}
# Creating the similarity matrix
similar_reviews <- as.matrix(cosineSim(user_movie_mat))

image(similar_reviews)
```
#### 5.3 - Non-Normalized Movie Recommendations
We note that this is a different set than the one generated from the normalized data.
```{R}
recs <- movie_rec(similar_reviews, movies, ratings, user = 100, num_recs = 10, nn = 30)
recs
```
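
The title lookup is another step that can be examined in isolation. `movie_rec` retrieves titles with `movies$title[suggested$movieId[i]]`, which uses the movie ID as a row position. That returns the intended title only when row *n* of `movies` holds movie ID *n*. The sketch below uses a made-up table (`movies_toy` and `wanted` are hypothetical names) to contrast positional indexing with a lookup by ID using `match()`.

```r
# Toy movies table whose IDs are not consecutive row numbers (illustrative values).
movies_toy <- data.frame(movieId = c(1, 5, 10),
                         title   = c("Movie A", "Movie B", "Movie C"),
                         stringsAsFactors = FALSE)

wanted <- 5

# Positional indexing treats the ID as a row number: row 5 does not exist.
movies_toy$title[wanted]                                  # NA

# match() finds the row whose movieId equals the requested ID.
movies_toy$title[match(wanted, movies_toy$movieId)]       # "Movie B"
```

Whether this affects the lists above depends on how the IDs in `movies.csv` line up with row numbers, which is worth verifying on the loaded data.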

## 6 - Building A Recommendation System Using *recommenderlab*
***
For the final portion of the project we will use the *recommenderlab* package to build recommendations for users. The process is relatively straightforward, as the package takes the sparse user-review matrix, converts it to a real rating matrix, and then builds the recommender model based on this information. 
```{R}
library(recommenderlab)

# Creating a sparse matrix of users as rows and movieId as columns.
user_movie_mat <- ratings %>%
  spread(key = movieId, value = rating) %>%
  as.matrix()

user_movie_mat = user_movie_mat[,-1]

movie_lab <- function(user_mat, movies, user, num_recs = 10, neigh = 10){
  #Convert rating matrix into a recommenderlab sparse matrix
  user_mat <- as(user_mat, "realRatingMatrix")
  
  #Create Recommender Model. "UBCF" stands for User-Based Collaborative Filtering
  recommender_model <- Recommender(user_mat, 
                                   method = "UBCF", 
                                   param=list(method="Cosine",nn = neigh))
  recom <- predict(recommender_model, 
                   user_mat[user], 
                   n=num_recs) #Obtain the top num_recs recommendations for the requested user
  
  recom_list <- as(recom, "list") #convert recommenderlab object to readable list
  
  recom_result <- matrix(0,num_recs)
  for (i in c(1:num_recs)){
    recom_result[i] <- movies$title[as.integer(recom_list[[1]][i])]
  }
  return(recom_result)
}
```
#### 6.1 - Recommenderlab Movie Recommendations
Using the function above we can now generate the top 10 recommendations for user 100 using the 30 nearest neighbors. We see that these recommendations are once again different from the ones that we generated by hand above.
```{R}
lab_recs <- movie_lab(user_movie_mat, movies, user = 100, num_recs = 10, neigh = 30)
lab_recs
```

## 7 - Comparison of the Three Recommendations and Performance
The first thing that we note is that there is a definite difference in performance between the hand-coded algorithm and the *recommenderlab* package. The hand-coded step that generates the cosine similarity was slow: it took 20 to 30 seconds, if not more, to carry out the multiplications that generate the similarity matrix. The *recommenderlab* package does this much more quickly, so the model can be regenerated each time the function is called without a noticeable performance hit.  
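
The cost of the hand-built similarity step can be measured with `system.time()`. The sketch below is an illustration on a random matrix with arbitrary dimensions, not on the MovieLens data, and the names `x`, `sim_a`, and `sim_b` are assumptions. It times the `%*%` formulation used in `cosineSim` (without the `as.dist()` step) against base R's `tcrossprod()`, which computes the same product without forming the transpose explicitly.

```r
set.seed(1)
# Random stand-in for the user-by-movie matrix; the dimensions are arbitrary.
x <- matrix(runif(200 * 500), nrow = 200)

# Same formulation as cosineSim, without the as.dist() step.
time_mult <- system.time(
  sim_a <- x %*% t(x) / sqrt(rowSums(x^2) %*% t(rowSums(x^2)))
)

# tcrossprod(x) is x %*% t(x); applied to the vector of row lengths it gives their outer product.
norms <- sqrt(rowSums(x^2))
time_tcross <- system.time(
  sim_b <- tcrossprod(x) / tcrossprod(norms)
)

all.equal(sim_a, sim_b)   # the two formulations agree numerically
c(mult = time_mult[["elapsed"]], tcrossprod = time_tcross[["elapsed"]])
```

Timings depend heavily on the machine and the linear algebra library R is linked against, so any difference seen here should be treated as indicative only.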

The second thing that we notice is that all three methods generate different recommendations, as we can see below. My best guess is that the function that generates the recommendations from the similarity matrix is not using the correct method for getting the best results. It is interesting to see that the normalized and non-normalized data lead to different results, but this is likely because we need to include zeros to do the matrix multiplication, and those zeros take on a meaning once we normalize the data.
```{R}
results <- data.frame(Normalized = norm_recs, Non_Normalized = recs, Recommenderlab = lab_recs)
results
```
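
The amount of agreement between the three lists can also be counted rather than judged by eye. A minimal sketch with made-up stand-in lists (the titles "A" to "F" and the names `lists` and `overlap` are assumptions for illustration); the real inputs would be `norm_recs`, `recs`, and `lab_recs`.

```r
# Stand-in recommendation lists (made-up titles).
lists <- list(Normalized     = c("A", "B", "C"),
              Non_Normalized = c("C", "D", "E"),
              Recommenderlab = c("F", "B", "C"))

# Count the titles shared by every pair of lists.
overlap <- sapply(lists, function(a) sapply(lists, function(b) length(intersect(a, b))))
overlap
```

The diagonal holds the length of each list and the off-diagonal entries hold the number of shared titles, so a matrix of off-diagonal zeros would mean the methods have no recommendations in common.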