In this project, I will be using singular value decomposition to factorize a matrix and identify “concepts” surrounding an item to be recommended.
I’ll once again use the MovieLens data set, and will focus on identifying singular values that describe a movie, based on it’s recommendations.
First, I’ll load packages and data.
library(recommenderlab)
## Loading required package: Matrix
## Loading required package: arules
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:Matrix':
##
## expand
data("MovieLense")
For this project, I’ll use the same scheme for subsetting the data as the most successfull pattern as my last project. This scheme made sure each user watched at least 20 movies, and that each movie had been reviewed 40 times.
rm.20_40 <- MovieLense[rowCounts(MovieLense) >= 20, colCounts(MovieLense) >= 40]
My next step is to convert the movies data out of the RecommenderLab matrix object into a dataframe, which I can than convert into a generic R data frame. My goal is to have m rows relating to users, and n columns relating to films.
rawDataFrame <- getData.frame(rm.20_40)
Since my data is in a “long” format, I’ll “spread” the movie values out into columns.
spreadDF <- spread(rawDataFrame, item, rating, fill = 0)
I also need to remove the user ID column, as that will not be used in this analysis.
#929 users
#706 movies
spreadMatrix <- data.matrix(spreadDF[,2:705])
I’ll next scale my data as method for handling un-standardized rating scores.
scaleMatrix <- scale(spreadMatrix)
I’m now ready to run a singular value decomposition.
svd_result <- svd(scaleMatrix)
With the SVD algorithm factorization ran, I can pull out the left singular matrix (U), the diagonal matrix (D), and right singular matrix (V).
Roughly speaking, these matrices decompose my ratings data so that I can think of it in terms of “concepts”. The U matrix relates users to concepts, the V matrix relates movies to concepts, and the D matrix measures the strength of those concepts.
u <- svd_result$u
d <- svd_result$d
v <- svd_result$v
I want to only examine concepts with values over 100, of which there are 6.
#limit d to "concepts" with values of over 100
concepts <- d[1:6]
I’m interested in examining the strongest Movie examples of each concept. In order to do so, I’ll reattach the movie list.
movies <- unlist(dimnames(scaleMatrix)[2])
movie_concepts <- t(v[1:6,])
#put them together in a dataframe
movie_concept_df <- as.data.frame(movie_concepts)
movie_concept_df <- cbind.data.frame(movies, movie_concept_df)
Now I can view the top 10 examples of each concept.
#take a look at the top 10 movies for each concept
arrange(movie_concept_df, desc(V1))[1:10,1]
## [1] One Fine Day (1996) Magnificent Seven, The (1954)
## [3] Basquiat (1996) Liar Liar (1997)
## [5] Last Supper, The (1995) Dave (1993)
## [7] Jean de Florette (1986) Circle of Friends (1995)
## [9] Hercules (1997) Dangerous Minds (1995)
## 704 Levels: 101 Dalmatians (1996) 12 Angry Men (1957) ... Young Guns II (1990)
arrange(movie_concept_df, desc(V2))[1:10,1]
## [1] Full Metal Jacket (1987) Fierce Creatures (1997)
## [3] It Could Happen to You (1994) Batman Returns (1992)
## [5] Mortal Kombat: Annihilation (1997) Reservoir Dogs (1992)
## [7] Emma (1996) Bedknobs and Broomsticks (1971)
## [9] Seven (Se7en) (1995) Craft, The (1996)
## 704 Levels: 101 Dalmatians (1996) 12 Angry Men (1957) ... Young Guns II (1990)
arrange(movie_concept_df, desc(V3))[1:10,1]
## [1] Matilda (1996)
## [2] Jurassic Park (1993)
## [3] Jungle Book, The (1994)
## [4] Circle of Friends (1995)
## [5] Annie Hall (1977)
## [6] Lion King, The (1994)
## [7] Citizen Kane (1941)
## [8] Flubber (1997)
## [9] Bridge on the River Kwai, The (1957)
## [10] Platoon (1986)
## 704 Levels: 101 Dalmatians (1996) 12 Angry Men (1957) ... Young Guns II (1990)
arrange(movie_concept_df, desc(V4))[1:10,1]
## [1] Man Without a Face, The (1993)
## [2] Substitute, The (1996)
## [3] Matilda (1996)
## [4] Jaws 2 (1978)
## [5] Money Talks (1997)
## [6] Chinatown (1974)
## [7] Taxi Driver (1976)
## [8] Truth About Cats & Dogs, The (1996)
## [9] Ulee's Gold (1997)
## [10] Jungle Book, The (1994)
## 704 Levels: 101 Dalmatians (1996) 12 Angry Men (1957) ... Young Guns II (1990)
arrange(movie_concept_df, desc(V5))[1:10,1]
## [1] Trees Lounge (1996) Money Talks (1997)
## [3] Forget Paris (1995) Time to Kill, A (1996)
## [5] Taxi Driver (1976) Sword in the Stone, The (1963)
## [7] Blues Brothers, The (1980) Liar Liar (1997)
## [9] Speed (1994) Glory (1989)
## 704 Levels: 101 Dalmatians (1996) 12 Angry Men (1957) ... Young Guns II (1990)
arrange(movie_concept_df, desc(V6))[1:10,1]
## [1] Wings of the Dove, The (1997)
## [2] Platoon (1986)
## [3] Substitute, The (1996)
## [4] Mystery Science Theater 3000: The Movie (1996)
## [5] Good, The Bad and The Ugly, The (1966)
## [6] Independence Day (ID4) (1996)
## [7] Raise the Red Lantern (1991)
## [8] Pretty Woman (1990)
## [9] Young Guns II (1990)
## [10] Life Less Ordinary, A (1997)
## 704 Levels: 101 Dalmatians (1996) 12 Angry Men (1957) ... Young Guns II (1990)
Potential next steps would be to run analysis on each set of movies in terms of various features, in an attempt to identify a common underlying theme that unifies them.