Understanding Movie Preferences

Background Information on the Dataset

In Unit 6, we were introduced to a MovieLens dataset containing movies and their associated genres, and clustered movies accordingly. In addition to collecting data on movies and genres, MovieLens collects data on users and their ratings of movies.

We collected the “latest” MovieLens dataset in September 2018 from https://grouplens.org/datasets/movielens/, and used it to create a new dataset that aggregates user ratings by the genres of the movies (omitting users who had rated fewer than 500 or more than 2000 movies).

Dataset: movielens-user-genre-ratings.csv

Our dataset has the following columns:

  • userId: a unique integer identifying a user

  • action, adventure, …, western: the sum of the ratings this user gave to all movies in the corresponding genre. For example, the user with userId = 24 has action = 431, which means that the ratings this user gave to action movies sum to 431. For brevity, we will refer to this “sum of all the ratings” value as the score of the genre for the user. Note that both the number of movies of a genre that a user has rated and the rating given to each of those movies contribute to the score of that genre.
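To make the definition of a score concrete, here is a minimal sketch on made-up data. The data frame `raw` and its columns (`userId`, `genre`, `rating`) are assumptions for illustration only; they are not the layout of the original MovieLens files.

```r
# Hypothetical toy ratings: one row per rating of a movie in a genre.
raw <- data.frame(
  userId = c(1, 1, 1, 2, 2),
  genre  = c("action", "action", "drama", "action", "drama"),
  rating = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# Sum the ratings within each user/genre pair.
# Rows are users, columns are genres, cells are scores.
scores <- xtabs(rating ~ userId + genre, data = raw)
print(scores)
```

User 1 rated two action movies, so both ratings add into that user's action score; a user who rates more movies in a genre, or rates them higher, ends up with a larger score.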

In this problem, we aim to cluster users by the genres of movies they watch.

Exploratory Data Analysis

Read the dataset movielens-user-genre-ratings.csv into a dataframe called ratings.

# Read in the dataset
ratings = read.csv("movielens-user-genre-ratings.csv")

How many users are in the dataset?

# Find the number of observations
str(ratings)
## 'data.frame':    9144 obs. of  20 variables:
##  $ userId     : int  24 46 120 132 150 229 231 251 332 340 ...
##  $ action     : num  431 845 220 292 336 ...
##  $ adventure  : num  384 532 320 302 173 ...
##  $ animation  : num  40 127 120 85 20 ...
##  $ children   : num  101 179 116 110 41 ...
##  $ comedy     : num  917 969 850 759 636 ...
##  $ crime      : num  351 328 260 335 317 ...
##  $ documentary: num  54 22 20 25 36 104 43.5 3.5 80.5 35.5 ...
##  $ drama      : num  1085 766 1034 1091 953 ...
##  $ fantasy    : num  153 288 241 172 129 ...
##  $ film.noir  : num  20 14.5 16.5 35 22 89 23.5 16 10.5 27 ...
##  $ horror     : num  151 482 74.5 122 282 ...
##  $ imax       : num  8 19.5 19.5 7 3 ...
##  $ musical    : num  64 56 71 84 37 ...
##  $ mystery    : num  108 192 139 160 144 ...
##  $ romance    : num  407 287 413 403 234 ...
##  $ sci.fi     : num  199 494 210 183 241 ...
##  $ thriller   : num  396 780 296 394 483 ...
##  $ war        : num  122 87 104 100 76 ...
##  $ western    : num  69 82 30.5 28 28 70 83 28 34 52 ...

9144 users in the dataset.

How many genres are in the dataset?

# Number of variables - 1 (every column except userId is a genre)
ncol(ratings) - 1
## [1] 19

19 genres.

Which genre has the highest mean score across all users?

# Summary of ratings (kable() comes from the knitr package)
library(knitr)
z = summary(ratings)
kable(z)
Summary statistics (one row per variable):

Variable      Min.     1st Qu.   Median    Mean      3rd Qu.   Max.
userId        24       67234     136407    135573    204038    270769
action        41.0     471.0     667.5     739.0     927.6     3268.0
adventure     52.0     384.4     521.5     561.1     687.1     2306.0
animation     0.0      78.5      134.0     165.0     214.5     1629.5
children      0.0      111.0     175.0     211.9     274.5     1892.0
comedy        48.0     687.9     902.5     1000.9    1202.5    4304.5
crime         46.5     308.5     409.0     452.6     552.5     1558.0
documentary   0.00     12.00     28.00     48.69     60.00     1147.00
drama         159.5    852.0     1118.2    1263.4    1522.2    5685.5
fantasy       20.0     191.0     264.5     291.6     358.0     1543.5
film.noir     0.00     10.50     21.50     32.81     43.00     568.00
horror        1.5      107.0     173.5     223.5     280.0     2167.0
imax          0.00     18.50     54.00     88.18     135.00    574.00
musical       0.0      55.0      93.0      109.8     145.5     828.5
mystery       12.0     153.0     204.5     226.3     277.5     866.5
romance       42.0     311.5     434.0     486.3     601.0     2286.0
sci.fi        15.0     259.5     378.5     419.5     532.5     1860.5
thriller      44.0     487.5     650.0     724.1     891.0     2751.5
war           3.0      82.0      120.0     136.5     170.5     878.5
western       0.00     28.00     45.50     56.18     71.50     614.00

Drama has the highest mean score across all users.

Which genre has the lowest total score across all users?

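film.noir has the lowest total score across all users. Because every genre column covers the same 9144 users, the genre with the lowest mean (32.81) also has the lowest total.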
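The total can also be computed directly rather than read off the means. The sketch below uses a small made-up data frame `toy` (an assumption, standing in for ratings) and base R's colSums and which.min.

```r
# Hypothetical toy scores for three users and three genres.
toy <- data.frame(
  userId    = c(1, 2, 3),
  drama     = c(900, 1200, 1000),
  comedy    = c(700, 800, 900),
  film.noir = c(10, 40, 25)
)

# Total score per genre, excluding the identifier column.
totals <- colSums(toy[, -1])

# Name of the genre with the smallest total.
lowest <- names(which.min(totals))
print(totals)
print(lowest)
```

Dividing each total by the number of rows gives the column mean, which is why ranking by mean and ranking by total agree when every column has the same number of users.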

Which of the following pairs of genres are most positively correlated in their user scores?

# Correlations amongst variables
cor(ratings$action, ratings$adventure)
## [1] 0.8749937
cor(ratings$action, ratings$crime)
## [1] 0.6687332
cor(ratings$adventure, ratings$fantasy)
## [1] 0.8777215
cor(ratings$animation, ratings$children)
## [1] 0.8404969

adventure and fantasy are the most positively correlated pair (0.8777).
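Instead of checking pairs one at a time, the full correlation matrix can be searched for its largest off-diagonal entry. This is a sketch on simulated data; the data frame `toy` and its columns `a`, `b`, `c` are hypothetical, with `a` and `b` built to be strongly correlated.

```r
set.seed(1)
n <- 200
base <- rnorm(n)

# Hypothetical features: a and b share a common component, c is independent.
toy <- data.frame(
  a = base + rnorm(n, sd = 0.2),
  b = base + rnorm(n, sd = 0.2),
  c = rnorm(n)
)

# Correlation matrix; blank out the diagonal and upper triangle
# so that each pair appears once and self-correlations are ignored.
cm <- cor(toy)
cm[upper.tri(cm, diag = TRUE)] <- NA

# Row/column position of the largest remaining correlation.
idx <- which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)
best_pair <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
print(best_pair)
```

On the real data the same pattern would be applied to the genre columns only, for example cor(ratings[, -1]).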

Clustering

Remove the first column of the table using the following line of code.

points = ratings[,2:ncol(ratings)]

Why did we remove the first column of our dataframe?

The values in the first column are not meaningful for clustering.

Why do we normalize data when clustering?

To give all features equal weight

What will the maximum value of action be after normalization? Answer without actually normalizing the data.

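Not enough information. Normalization subtracts the mean of action and divides by its standard deviation; the summary above reports the maximum (3268.0) and the mean (739.0), but not the standard deviation.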
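A small sketch shows why the standard deviation is needed. It assumes normalization means centering and scaling, z = (x - mean(x)) / sd(x); the vectors `x1` and `x2` and the helper `zscore` are made up for illustration.

```r
# Two hypothetical features with the same mean (5) and the same maximum (10)
# but different spread.
x1 <- c(0, 5, 10)
x2 <- c(4, 1, 10)

# Center and scale: subtract the mean, divide by the standard deviation.
zscore <- function(x) (x - mean(x)) / sd(x)

max_z1 <- max(zscore(x1))
max_z2 <- max(zscore(x2))
print(c(max_z1, max_z2))
```

Both features share a mean and a maximum, yet their normalized maxima differ, so the maximum and the mean alone do not determine the normalized maximum.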

Normalize the data using the following code. What is the maximum value of adventure after the normalization?

# Normalize the data
library(caret)
preproc = preProcess(points)
pointsnorm = predict(preproc, points)

# Maximum value of adventure
max(pointsnorm$adventure)
## [1] 7.013803

7.013803 is the maximum value of adventure after the normalization.

Create a dendrogram using the following code:

# Create dendrogram
distances = dist(pointsnorm, method = "euclidean")

dend = hclust(distances, method = "ward.D")

plot(dend, labels = FALSE)

What number of clusters is associated with a height of approximately 1500?

3 clusters according to the dendrogram.
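Reading a cluster count off the plot can be cross-checked with cutree, which cuts a tree at a given height. The sketch below uses simulated data, so the matrix `toy`, the helper `clusters_at`, and the cut height `h` are all assumptions; the height is taken from the toy tree itself, not from the dendrogram above.

```r
set.seed(42)

# Hypothetical data: three well-separated groups of 20 points each.
toy <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2),
  matrix(rnorm(40, mean = 20), ncol = 2)
)
tree <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Number of clusters left when the tree is cut at height h.
clusters_at <- function(tree, h) length(unique(cutree(tree, h = h)))

# Pick a height between the third-to-last and second-to-last merges,
# so that exactly two merges lie above the cut.
n <- nrow(toy)
h <- mean(tree$height[(n - 3):(n - 2)])
k_at_h <- clusters_at(tree, h)
print(k_at_h)
```

On the real tree, the analogous call would be clusters_at(dend, 1500).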

In our clustering, we want to set the number of clusters to 5. Which of the following statements is most correct?

The number of clusters with the most vertical room on the dendrogram is less than 5, but we want more specific clusters.

Set the random seed to 200, and run the k-means clustering algorithm on your normalized dataset, setting the number of clusters to 5.

# K-means clustering
set.seed(200)
kmc = kmeans(pointsnorm, centers = 5)
# Split the normalized data into one subset per cluster
KmeansCluster1 = subset(pointsnorm, kmc$cluster == 1)

KmeansCluster2 = subset(pointsnorm, kmc$cluster == 2)

KmeansCluster3 = subset(pointsnorm, kmc$cluster == 3)

KmeansCluster4 = subset(pointsnorm, kmc$cluster == 4)

KmeansCluster5 = subset(pointsnorm, kmc$cluster == 5)

How many observations are in the smallest cluster?

# Output the number of observations in each cluster
nrow(KmeansCluster1)
## [1] 2084
nrow(KmeansCluster2)
## [1] 942
nrow(KmeansCluster3)
## [1] 3968
nrow(KmeansCluster4)
## [1] 748
nrow(KmeansCluster5)
## [1] 1402

748 observations are present in the smallest cluster, which is Cluster 4.
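The cluster sizes can also be read without building one subset per cluster, either by tabulating the cluster labels or from the size component that kmeans returns. This is a sketch on simulated data; the matrix `toy` and the object `km` are hypothetical stand-ins for pointsnorm and kmc.

```r
set.seed(200)

# Hypothetical data: a larger group near 0 and a smaller group near 8.
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 8), ncol = 2)
)
km <- kmeans(toy, centers = 2)

# Two equivalent views of the cluster sizes.
sizes <- table(km$cluster)
print(sizes)
print(km$size)

# Size of the smallest cluster and its label.
smallest_size  <- min(km$size)
smallest_label <- which.min(km$size)
```

Applied to the clustering above, table(kmc$cluster) would give the same five counts as the nrow calls.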

Conceptual Questions

If we ran k-means clustering a second time without making any additional calls to set.seed, we would expect:

Different results from the first time.

If we ran k-means clustering a second time after calling set.seed(200), we would expect:

The same results as the first time.
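The effect of the seed can be checked directly. In this sketch the matrix `toy` is simulated and the choice of 3 clusters is arbitrary; only set.seed, kmeans, and identical are used.

```r
# Hypothetical data: 100 points in two dimensions.
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)

# Same seed before each run, so the random starting centers are the same.
set.seed(200)
first <- kmeans(toy, centers = 3)
set.seed(200)
second <- kmeans(toy, centers = 3)

same <- identical(first$cluster, second$cluster)
print(same)
```

Without the second call to set.seed, the starting centers are drawn from a different point in the random stream, so the assignments may differ or the cluster labels may be permuted.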

Why do we typically use cluster centroids to describe the clusters?

The cluster centroid captures the average behavior in the cluster, and can be used to summarize the general pattern in the cluster.
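As a sketch of what the centroid is, the example below compares the column means of one cluster with the centers component returned by kmeans. The data frame `toy` and its columns `x` and `y` are made up for illustration.

```r
set.seed(200)

# Hypothetical data with two features.
toy <- data.frame(x = rnorm(50), y = rnorm(50))
km <- kmeans(toy, centers = 2)

# Centroid of cluster 1 computed by hand: the mean of each feature
# over the points assigned to that cluster.
manual <- colMeans(subset(toy, km$cluster == 1))

# Centroid of cluster 1 as stored by kmeans.
stored <- km$centers[1, ]

print(manual)
print(stored)
```

The two agree, so the rows of the centers matrix give the same per-cluster feature averages that colMeans produces on each subset.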

Is “overfitting” a problem in clustering?

Yes, at the extreme every data point can be assigned to its own cluster.
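The extreme case can be illustrated numerically. This sketch uses hierarchical clustering on simulated points and a hypothetical helper `wss` that computes the total within-cluster sum of squares for a cut into k clusters; none of these names come from the analysis above.

```r
set.seed(7)

# Hypothetical data: 20 points in two dimensions.
toy <- matrix(rnorm(40), ncol = 2)
tree <- hclust(dist(toy), method = "ward.D")

# Total within-cluster sum of squares when the tree is cut into k clusters.
wss <- function(k) {
  groups <- cutree(tree, k = k)
  sum(sapply(split(as.data.frame(toy), groups), function(g) {
    # Squared deviations of each point from its cluster mean.
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }))
}

fit <- sapply(c(1, 5, 10, 20), wss)
print(fit)
```

The within-cluster sum of squares shrinks as k grows and reaches zero when every point is its own cluster, a perfect fit that summarizes nothing.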

Is “multicollinearity” a problem in clustering?

Yes, multicollinearity could cause certain features to be overweighted in the distance calculations.
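The overweighting is easiest to see in the limiting case of a perfectly collinear feature, that is, an exact copy. The points `p` and `q` and the feature names below are made up for illustration.

```r
# Two hypothetical points described by features a and b.
p <- c(a = 0, b = 0)
q <- c(a = 3, b = 4)

# Euclidean distance using the original two features: sqrt(9 + 16).
d_original <- sqrt(sum((p - q)^2))

# Append an exact copy of feature a to both points.
p2 <- c(p, a_copy = p[["a"]])
q2 <- c(q, a_copy = q[["a"]])

# Feature a now enters the squared distance twice: sqrt(9 + 16 + 9).
d_duplicated <- sqrt(sum((p2 - q2)^2))

print(c(d_original, d_duplicated))
```

Feature a accounts for 9 of 25 units of squared distance before duplication and 18 of 34 after, so differences in a now dominate the comparison. Highly correlated genres have a similar, if weaker, effect.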

Understanding the Clusters

# Understanding the Clusters
z = tail(sort(colMeans(KmeansCluster1)))
kable(z)
x
adventure 0.2824732
horror 0.3363904
imax 0.4718513
thriller 0.4925912
action 0.5495948
sci.fi 0.5579951

z = tail(sort(colMeans(KmeansCluster2)))
kable(z)
x
mystery 1.367318
romance 1.373039
crime 1.381170
war 1.484580
film.noir 1.570508
drama 1.842013

z = tail(sort(colMeans(KmeansCluster3)))
kable(z)
x
musical -0.3643688
war -0.3539588
romance -0.3342490
western -0.3186429
documentary -0.1667101
film.noir -0.1179028

z = tail(sort(colMeans(KmeansCluster4)))
kable(z)
x
animation 1.624056
thriller 1.760436
sci.fi 1.946086
action 1.998269
fantasy 2.060513
adventure 2.112335

z = tail(sort(colMeans(KmeansCluster5)))
kable(z)
x
comedy 0.2781019
adventure 0.3795921
fantasy 0.6394475
musical 0.8548469
animation 0.8867496
children 1.1078301

Which of the clusters is best described as “users who like dramas, film noir, war movies, and crime movies?”

Cluster 2.

Which of the clusters is best described as “users who like adventure movies, fantasy movies, and action movies?”

Cluster 4.

What genre contributes least to the cluster described as “users who like adventure movies, fantasy movies, and action movies?”

# Sort the genres in cluster 4
z = sort(colMeans(KmeansCluster4))
kable(z)
x
film.noir 0.0779509
documentary 0.4077432
western 0.7529009
musical 0.8168096
war 0.8519114
romance 0.9125008
drama 1.0208771
horror 1.3084837
mystery 1.3759077
crime 1.4182716
children 1.4878481
imax 1.5361422
comedy 1.5956005
animation 1.6240559
thriller 1.7604359
sci.fi 1.9460857
action 1.9982694
fantasy 2.0605133
adventure 2.1123349

Film noir has the lowest contribution in Cluster 4.