DATA 612 - Project 2

PROJECT 2 - Content-Based and Collaborative Filtering

The goal of this assignment is for you to try out different ways of implementing and configuring a recommender, and to evaluate your different approaches.

Implement at least two of these recommendation algorithms:

• Content-Based Filtering

• User-User Collaborative Filtering

• Item-Item Collaborative Filtering

Load Data

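library(recommenderlab) # MovieLense data, Recommender(), rowCounts(), colCounts()
library(ggplot2)        # qplot()
library(dplyr)          # %>% pipe

data("MovieLense")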

dim(MovieLense@data)

## [1]  943 1664

Let's start with collaborative filtering:

Compute the similarity matrix

User-User Similarity (first 10 users)

	1	2	3	4	5	6	7	8	9	10
1	0.0000000	0.9605820	0.8339504	0.9192637	0.9326136	0.9541710	0.9446653	0.9775049	0.9764039	0.9683044
2	0.9605820	0.0000000	0.9268716	0.9370341	0.9848027	0.9543931	0.9670869	0.9586588	0.9400556	0.9826137
3	0.8339504	0.9268716	0.0000000	0.9130323	1.0000000	0.8668857	0.8738971	0.8958898	0.9191450	0.9024436
4	0.9192637	0.9370341	0.9130323	0.0000000	0.9946918	0.9238226	0.8823323	0.9765327	0.9938837	0.9691166
5	0.9326136	0.9848027	1.0000000	0.9946918	0.0000000	0.9336269	0.9081632	0.9412276	0.8809189	0.9401144
6	0.9541710	0.9543931	0.8668857	0.9238226	0.9336269	0.0000000	0.9605997	0.9775950	0.9427809	0.9784868
7	0.9446653	0.9670869	0.8738971	0.8823323	0.9081632	0.9605997	0.0000000	0.9561888	0.9394858	0.9780770
8	0.9775049	0.9586588	0.8958898	0.9765327	0.9412276	0.9775950	0.9561888	0.0000000	0.8990017	0.9752306
9	0.9764039	0.9400556	0.9191450	0.9938837	0.8809189	0.9427809	0.9394858	0.8990017	0.0000000	0.9880545
10	0.9683044	0.9826137	0.9024436	0.9691166	0.9401144	0.9784868	0.9780770	0.9752306	0.9880545	0.0000000

Item-Item Similarity (first 10 movies)

	Toy Story (1995)	GoldenEye (1995)	Four Rooms (1995)	Get Shorty (1995)	Copycat (1995)	Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)	Twelve Monkeys (1995)	Babe (1995)	Dead Man Walking (1995)	Richard III (1995)
Toy Story (1995)	0.0000000	0.9487374	0.9132997	0.9429069	0.9613638	0.9551194	0.9489155	0.9600459	0.9387445	0.9430394
GoldenEye (1995)	0.9487374	0.0000000	0.9088797	0.9394926	0.9426876	0.9550903	0.9411770	0.9499076	0.9145017	0.9389799
Four Rooms (1995)	0.9132997	0.9088797	0.0000000	0.8991940	0.9424719	0.9683641	0.9208737	0.8787096	0.9084892	0.9269418
Get Shorty (1995)	0.9429069	0.9394926	0.8991940	0.0000000	0.8919936	0.9190369	0.9484601	0.9539981	0.9497018	0.9582736
Copycat (1995)	0.9613638	0.9426876	0.9424719	0.8919936	0.0000000	0.9962406	0.9359823	0.9452349	0.9340369	0.9041944
Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)	0.9551194	0.9550903	0.9683641	0.9190369	0.9962406	0.0000000	0.9072989	0.8613908	0.9517965	0.9405701
Twelve Monkeys (1995)	0.9489155	0.9411770	0.9208737	0.9484601	0.9359823	0.9072989	0.0000000	0.9595148	0.9503334	0.9494393
Babe (1995)	0.9600459	0.9499076	0.8787096	0.9539981	0.9452349	0.8613908	0.9595148	0.0000000	0.9611934	0.9681040
Dead Man Walking (1995)	0.9387445	0.9145017	0.9084892	0.9497018	0.9340369	0.9517965	0.9503334	0.9611934	0.0000000	0.9476115
Richard III (1995)	0.9430394	0.9389799	0.9269418	0.9582736	0.9041944	0.9405701	0.9494393	0.9681040	0.9476115	0.0000000
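
The code behind the two matrices is not shown. A minimal sketch of how matrices of this shape could be produced, assuming recommenderlab's similarity() with cosine similarity on the first 10 users and the first 10 movies. The choice of method and the names user_sim and item_sim are assumptions, and the exact values may differ between package versions:

```r
library(recommenderlab)
data("MovieLense")

# Cosine similarity between the first 10 users (rows of the rating matrix)
user_sim <- as.matrix(similarity(MovieLense[1:10, ], method = "cosine", which = "users"))

# Cosine similarity between the first 10 movies (columns of the rating matrix)
item_sim <- as.matrix(similarity(MovieLense[, 1:10], method = "cosine", which = "items"))

round(user_sim, 4)
round(item_sim, 4)
```

Both results are square 10 x 10 matrices, with one row and one column per user or movie.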

Show the Distribution of Ratings

ratings <- as.vector(MovieLense@data)
unique(ratings)

## [1] 5 4 0 3 1 2

table_ratings <- table(ratings)
table_ratings

## ratings
##       0       1       2       3       4       5 
## 1469760    6059   11307   27002   33947   21077

#remove 0s since these are missing data
ratings <- ratings[ratings != 0]
ratings <- factor(ratings)
qplot(ratings) + ggtitle("Distribution of Ratings")

Sort the movies by number of views and visualize at least 10 rows:
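
The code for this step does not appear in the output. One possible sketch, assuming that a "view" means a non-missing rating so that colCounts() gives the number of views per movie (the names views_per_movie and top_movies are illustrative):

```r
library(recommenderlab)
data("MovieLense")

# Number of ratings each movie received (treated here as its number of views)
views_per_movie <- colCounts(MovieLense)

# Build a table of movies and sort it from most to least viewed
table_views <- data.frame(movie = names(views_per_movie),
                          views = as.integer(views_per_movie),
                          stringsAsFactors = FALSE)
table_views <- table_views[order(table_views$views, decreasing = TRUE), ]

# The 10 most viewed movies
top_movies <- head(table_views, 10)
top_movies
```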

Explore and visualize the distribution of the average ratings:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let's keep only movies with more than 100 ratings, so that averages based on a handful of ratings don't distort the distribution:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
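
The plotting code for the two histograms is not shown. A hedged sketch of both steps, assuming the averages come from colMeans() on the rating matrix and that the threshold applies to the number of ratings per movie. The variable names and the binwidth of 0.1 are assumptions. Setting binwidth explicitly also avoids the stat_bin() message:

```r
library(recommenderlab)
library(ggplot2)
data("MovieLense")

# Mean rating per movie, computed over the ratings that exist
avg_ratings <- colMeans(MovieLense)

# Histogram over all movies
ggplot(data.frame(avg = avg_ratings), aes(x = avg)) +
  geom_histogram(binwidth = 0.1) +
  ggtitle("Distribution of the average movie rating")

# Keep only movies that were rated more than 100 times
avg_ratings_relevant <- avg_ratings[colCounts(MovieLense) > 100]

ggplot(data.frame(avg = avg_ratings_relevant), aes(x = avg)) +
  geom_histogram(binwidth = 0.1) +
  ggtitle("Distribution of the average rating (movies with > 100 ratings)")
```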

Prepare and Normalize the Data

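#keep users with more than 50 ratings and movies with more than 100 ratings
working_data <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]

#normalize the data (by default this centers each user's ratings on that user's mean)
working_data <- normalize(working_data)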
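
To make the normalization step concrete, here is a small base R sketch of mean-centering on a made-up 3 x 3 rating matrix. The toy data and names are hypothetical. It mirrors what normalize() is assumed to do with its default "center" method: subtract each user's mean rating from that user's observed ratings.

```r
# Toy user x movie matrix; NA marks a missing rating (illustrative data only)
toy_ratings <- matrix(c(5, 3, NA,
                        4, NA, 2,
                        1, NA, 3),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(paste0("u", 1:3), paste0("m", 1:3)))

# Each user's mean over the movies they actually rated
user_means <- rowMeans(toy_ratings, na.rm = TRUE)

# Subtract the user mean from every rating in that user's row
centered <- sweep(toy_ratings, 1, user_means, FUN = "-")
centered
```

After centering, every user's observed ratings average to zero, so a generous rater and a harsh rater become comparable.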

Split Data Into Training and Test Sets

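#split training and test data
set.seed(100)
#caution: the index is sized from working_data, which has fewer rows than MovieLense,
#but it is applied to MovieLense below, so the models are not trained on working_data
train_index <- sample(x = c(TRUE, FALSE), size = nrow(working_data), replace = TRUE, prob = c(0.75, 0.25))

#set train and test sets (the recommendations shown below come from this MovieLense split)
train <- MovieLense[train_index,]
test <- MovieLense[!train_index,]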
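
The split above draws its index from working_data but subsets MovieLense. A sketch of a split that stays on the filtered data throughout is given here as an alternative. The names train_wd and test_wd are hypothetical, and the recommendations printed later in this report were not produced from this split. The explicit normalize() call is left out on the assumption that the IBCF and UBCF methods center ratings by default.

```r
library(recommenderlab)
data("MovieLense")

# Same filter as above: active users and frequently rated movies
working_data <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]

# Draw one TRUE/FALSE flag per row of the data that is actually being split
set.seed(100)
train_index <- sample(c(TRUE, FALSE), size = nrow(working_data),
                      replace = TRUE, prob = c(0.75, 0.25))

# Index and data now have matching lengths
train_wd <- working_data[train_index, ]
test_wd  <- working_data[!train_index, ]

c(train = nrow(train_wd), test = nrow(test_wd))
```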

IBCF - item-based collaborative filtering.

#create IBCF recommender; k = 30 keeps the 30 most similar items for each item
rec_IBCF <- Recommender(data = train, method = 'IBCF', parameter = list(k = 30))
#predict the top 5 recommendations for each test user
predict_IBCF <- predict(object = rec_IBCF, newdata = test, n = 5)
#recommendations for the first 5 people in test set
predict_IBCF %>% as("list") %>% head(5)

## $`7`
## [1] "Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)"
## [2] "Flipper (1996)"                                      
## [3] "Shall We Dance? (1996)"                              
## [4] "Pillow Book, The (1995)"                             
## [5] "In the Company of Men (1997)"                        
## 
## $`12`
## [1] "Brother Minister: The Assassination of Malcolm X (1994)"
## [2] "Maya Lin: A Strong Clear Vision (1994)"                 
## [3] "Blue Angel, The (Blaue Engel, Der) (1930)"              
## [4] "Wishmaster (1997)"                                      
## [5] "Leave It to Beaver (1997)"                              
## 
## $`15`
## [1] "Dolores Claiborne (1994)"            "Madness of King George, The (1994)" 
## [3] "Welcome to the Dollhouse (1995)"     "20,000 Leagues Under the Sea (1954)"
## [5] "Bedknobs and Broomsticks (1971)"    
## 
## $`27`
## [1] "North (1994)"                                              
## [2] "Losing Chase (1996)"                                       
## [3] "Brother Minister: The Assassination of Malcolm X (1994)"   
## [4] "Horseman on the Roof, The (Hussard sur le toit, Le) (1995)"
## [5] "Miracle on 34th Street (1994)"                             
## 
## $`28`
## [1] "Playing God (1997)"            "Across the Sea of Time (1995)"
## [3] "Boys Life (1995)"              "Orlando (1993)"               
## [5] "Rent-a-Kid (1995)"

UBCF - user-based collaborative filtering.

#create UBCF recommender with the default parameters
rec_UBCF <- Recommender(data = train, method = 'UBCF')
#predict the top 5 recommendations for each test user
predict_UBCF <- predict(rec_UBCF, newdata = test, n = 5)

#recommendations for the first 5 people in test set
predict_UBCF %>% as("list") %>% head(5)

## $`7`
## [1] "Titanic (1997)"            "Good Will Hunting (1997)" 
## [3] "L.A. Confidential (1997)"  "As Good As It Gets (1997)"
## [5] "Apt Pupil (1998)"         
## 
## $`12`
## [1] "Titanic (1997)"            "Scream (1996)"            
## [3] "Good Will Hunting (1997)"  "As Good As It Gets (1997)"
## [5] "Full Monty, The (1997)"   
## 
## $`15`
## [1] "Alien (1979)"                     "Silence of the Lambs, The (1991)"
## [3] "Fargo (1996)"                     "Empire Strikes Back, The (1980)" 
## [5] "Graduate, The (1967)"            
## 
## $`27`
## [1] "Contact (1997)"           "Good Will Hunting (1997)"
## [3] "L.A. Confidential (1997)" "Game, The (1997)"        
## [5] "Aliens (1986)"           
## 
## $`28`
## [1] "Full Monty, The (1997)"       "Wag the Dog (1997)"          
## [3] "Sense and Sensibility (1995)" "Cold Comfort Farm (1995)"    
## [5] "Evita (1996)"

Summary:

One problem with user-user similarity is that user preferences change over time: a user who liked an item a year ago may not like it today or in the future. One workaround is to use only recent ratings, say the most recent 3 months of data. The drawback is that restricting to recent data makes the user-item matrix even sparser.

By contrast, a key advantage of item-item similarity is that the ratings on a given item do not change significantly after an initial period.

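In this project, item ratings do not change much over time after the initial period, so an item-item similarity based recommender system is preferable to a user-user based one. The usual supporting argument that users outnumber items does not hold for the full matrix, which has 943 users and 1664 movies; it can hold only for a subset restricted to frequently rated movies, as working_data is.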
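
The preference for item-item similarity is argued qualitatively here. One hedged way to check it on this data is to compare prediction error for the two methods on held-out ratings. The sketch below assumes recommenderlab's evaluationScheme() and calcPredictionAccuracy(). The filter, the 75/25 split, given = 15 and the name errors are illustrative choices, not settings taken from this report.

```r
library(recommenderlab)
data("MovieLense")

# Active users and frequently rated movies, as in the preparation step
ratings_sub <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]

# Hold out 25% of users; for each of them, 15 ratings are given to the
# model and the remaining ratings are hidden for scoring
set.seed(100)
scheme <- evaluationScheme(ratings_sub, method = "split", train = 0.75,
                           given = 15, goodRating = 4)

models <- list(IBCF = list(name = "IBCF", param = list(k = 30)),
               UBCF = list(name = "UBCF", param = NULL))

# Train each model, predict the hidden ratings, and compute RMSE, MSE and MAE
errors <- sapply(models, function(m) {
  rec  <- Recommender(getData(scheme, "train"), method = m$name, parameter = m$param)
  pred <- predict(rec, getData(scheme, "known"), type = "ratings")
  calcPredictionAccuracy(pred, getData(scheme, "unknown"))
})
errors
```

Lower values in a column indicate better rating predictions for that method on this particular split. Repeating the comparison with other seeds or with cross-validation would show how stable the ranking is.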