In Unit 6, we were introduced to a MovieLens dataset containing movies and their associated genres, and clustered movies accordingly. In addition to collecting data on movies and genres, MovieLens collects data on users and their ratings of movies.
We collected the “latest” MovieLens dataset in September 2018 from https://grouplens.org/datasets/movielens/ and used it to create a new dataset that aggregates user ratings by the genres of the movies (omitting users who had rated fewer than 500 or more than 2000 movies).
Dataset: movielens-user-genre-ratings.csv
Our dataset has the following columns:
userId: a unique integer identifying a user
action, adventure, …, western: the sum of all the ratings that this user has given to movies in each of these genres. For example, the user with userId = 24 has action = 431, which means that the ratings this user gave to action movies sum to 431. For brevity, we will refer to this “sum of all the ratings” value as the score of the genre for the user. Note that both the number of movies of a genre that a user has watched and the ratings that the user gave to each of those movies contribute to the score of the genre.
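The score definition above can be sketched on a toy example. The long-format table below (one row per user, movie, and genre) and all of its values are hypothetical and only illustrate the aggregation; they are not taken from the MovieLens files.

```r
# Hypothetical toy ratings: one row per (user, movie, genre) pair.
toy = data.frame(
  userId = c(1, 1, 1, 2, 2),
  genre  = c("action", "action", "drama", "action", "drama"),
  rating = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# Score of a genre for a user = sum of that user's ratings in the genre.
scores = xtabs(rating ~ userId + genre, data = toy)
scores
```

User 1 rated two action movies, so both the count and the individual ratings feed into that user's action score.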
In this problem, we aim to cluster users by the genres of movies they watch.
Read the dataset movielens-user-genre-ratings.csv into a dataframe called ratings.
# Read in the dataset
ratings = read.csv("movielens-user-genre-ratings.csv")
# Find the number of observations
str(ratings)
## 'data.frame': 9144 obs. of 20 variables:
## $ userId : int 24 46 120 132 150 229 231 251 332 340 ...
## $ action : num 431 845 220 292 336 ...
## $ adventure : num 384 532 320 302 173 ...
## $ animation : num 40 127 120 85 20 ...
## $ children : num 101 179 116 110 41 ...
## $ comedy : num 917 969 850 759 636 ...
## $ crime : num 351 328 260 335 317 ...
## $ documentary: num 54 22 20 25 36 104 43.5 3.5 80.5 35.5 ...
## $ drama : num 1085 766 1034 1091 953 ...
## $ fantasy : num 153 288 241 172 129 ...
## $ film.noir : num 20 14.5 16.5 35 22 89 23.5 16 10.5 27 ...
## $ horror : num 151 482 74.5 122 282 ...
## $ imax : num 8 19.5 19.5 7 3 ...
## $ musical : num 64 56 71 84 37 ...
## $ mystery : num 108 192 139 160 144 ...
## $ romance : num 407 287 413 403 234 ...
## $ sci.fi : num 199 494 210 183 241 ...
## $ thriller : num 396 780 296 394 483 ...
## $ war : num 122 87 104 100 76 ...
## $ western : num 69 82 30.5 28 28 70 83 28 34 52 ...
9144 users in the dataset.
# Number of variables - 1
str(ratings)
## 'data.frame': 9144 obs. of 20 variables:
19 genres (the 20 variables minus userId).
# Summary of ratings
library(knitr)  # provides kable()
z = summary(ratings)
kable(z)
| userId | action | adventure | animation | children | comedy | crime | documentary | drama | fantasy | film.noir | horror | imax | musical | mystery | romance | sci.fi | thriller | war | western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 24 | Min. : 41.0 | Min. : 52.0 | Min. : 0.0 | Min. : 0.0 | Min. : 48.0 | Min. : 46.5 | Min. : 0.00 | Min. : 159.5 | Min. : 20.0 | Min. : 0.00 | Min. : 1.5 | Min. : 0.00 | Min. : 0.0 | Min. : 12.0 | Min. : 42.0 | Min. : 15.0 | Min. : 44.0 | Min. : 3.0 | Min. : 0.00 |
| 1st Qu.: 67234 | 1st Qu.: 471.0 | 1st Qu.: 384.4 | 1st Qu.: 78.5 | 1st Qu.: 111.0 | 1st Qu.: 687.9 | 1st Qu.: 308.5 | 1st Qu.: 12.00 | 1st Qu.: 852.0 | 1st Qu.: 191.0 | 1st Qu.: 10.50 | 1st Qu.: 107.0 | 1st Qu.: 18.50 | 1st Qu.: 55.0 | 1st Qu.:153.0 | 1st Qu.: 311.5 | 1st Qu.: 259.5 | 1st Qu.: 487.5 | 1st Qu.: 82.0 | 1st Qu.: 28.00 |
| Median :136407 | Median : 667.5 | Median : 521.5 | Median : 134.0 | Median : 175.0 | Median : 902.5 | Median : 409.0 | Median : 28.00 | Median :1118.2 | Median : 264.5 | Median : 21.50 | Median : 173.5 | Median : 54.00 | Median : 93.0 | Median :204.5 | Median : 434.0 | Median : 378.5 | Median : 650.0 | Median :120.0 | Median : 45.50 |
| Mean :135573 | Mean : 739.0 | Mean : 561.1 | Mean : 165.0 | Mean : 211.9 | Mean :1000.9 | Mean : 452.6 | Mean : 48.69 | Mean :1263.4 | Mean : 291.6 | Mean : 32.81 | Mean : 223.5 | Mean : 88.18 | Mean :109.8 | Mean :226.3 | Mean : 486.3 | Mean : 419.5 | Mean : 724.1 | Mean :136.5 | Mean : 56.18 |
| 3rd Qu.:204038 | 3rd Qu.: 927.6 | 3rd Qu.: 687.1 | 3rd Qu.: 214.5 | 3rd Qu.: 274.5 | 3rd Qu.:1202.5 | 3rd Qu.: 552.5 | 3rd Qu.: 60.00 | 3rd Qu.:1522.2 | 3rd Qu.: 358.0 | 3rd Qu.: 43.00 | 3rd Qu.: 280.0 | 3rd Qu.:135.00 | 3rd Qu.:145.5 | 3rd Qu.:277.5 | 3rd Qu.: 601.0 | 3rd Qu.: 532.5 | 3rd Qu.: 891.0 | 3rd Qu.:170.5 | 3rd Qu.: 71.50 |
| Max. :270769 | Max. :3268.0 | Max. :2306.0 | Max. :1629.5 | Max. :1892.0 | Max. :4304.5 | Max. :1558.0 | Max. :1147.00 | Max. :5685.5 | Max. :1543.5 | Max. :568.00 | Max. :2167.0 | Max. :574.00 | Max. :828.5 | Max. :866.5 | Max. :2286.0 | Max. :1860.5 | Max. :2751.5 | Max. :878.5 | Max. :614.00 |
Drama has the highest mean score across all users.
film.noir has the lowest total score across all users.
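Rather than reading the extremes off the summary table, the same two answers could be computed directly. This is a minimal sketch on a made-up three-genre data frame; on the real data one would presumably apply the same calls to the genre columns, for example `ratings[, -1]`.

```r
# Toy stand-in for the genre columns of `ratings` (values are made up).
toy = data.frame(
  action    = c(10, 20, 30),
  drama     = c(40, 50, 60),
  film.noir = c(1, 2, 3)
)

highest_mean = names(which.max(colMeans(toy)))  # genre with the highest mean score
lowest_total = names(which.min(colSums(toy)))   # genre with the lowest total score
highest_mean
lowest_total
```

Because every genre column has the same number of users, the genre with the lowest total score is also the genre with the lowest mean score.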
Remove the first column of the table using the following line of code.
points = ratings[,2:ncol(ratings)]
The values in the first column are not meaningful for clustering.
To give all features equal weight
Not enough information
What is the maximum value of adventure after the normalization?
# Normalize the data
library(caret)
preproc = preProcess(points)
pointsnorm = predict(preproc, points)
# Maximum value of adventure
max(pointsnorm$adventure)
## [1] 7.013803
7.013803 is the maximum value of adventure after the normalization.
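By default, `preProcess` centers and scales each column, so the normalized values are z-scores. The sketch below reproduces that transformation with base R's `scale()` on a made-up data frame, which avoids the caret dependency; the column values are illustrative only.

```r
# Toy data with columns on very different scales (values are made up).
toy = data.frame(
  adventure = c(100, 200, 300, 1000),
  drama     = c(5, 7, 9, 11)
)

# z-score: subtract each column's mean, then divide by its standard deviation.
toynorm = as.data.frame(scale(toy))

colMeans(toynorm)        # approximately 0 for every column
apply(toynorm, 2, sd)    # 1 for every column
max(toynorm$adventure)   # how many standard deviations the largest value lies above the mean
```

Read this way, a maximum of about 7 for adventure means that the heaviest adventure watcher sits roughly seven standard deviations above the average user.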
# Create dendrogram
distances = dist(pointsnorm, method = "euclidean")
dend = hclust(distances, method = "ward.D")
plot(dend, labels = FALSE)
3 clusters according to the dendrogram.
The number of clusters with the most vertical room on the dendrogram is fewer than 5, but we want more specific clusters.
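Once a number of clusters has been chosen from the dendrogram, `cutree` turns the tree into cluster assignments. The sketch below uses simulated points in two well-separated groups rather than the MovieLens data; the names `toy`, `toydend`, and `groups` are illustrative.

```r
set.seed(1)
# Toy data: two well-separated groups of 10 two-dimensional points each.
toy = rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 10), ncol = 2)
)

# Same distance and linkage settings as used for the dendrogram above.
toydend = hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Cut the tree into k groups and count the members of each group.
groups = cutree(toydend, k = 2)
table(groups)
```

Passing a larger `k` cuts the same tree lower down, which yields more and smaller clusters.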
# K-means clustering
set.seed(200)
kmc = kmeans(pointsnorm, centers = 5)
# Divides the dataset into 5 different subsets for each cluster
KmeansCluster1 = subset(pointsnorm, kmc$cluster == 1)
KmeansCluster2 = subset(pointsnorm, kmc$cluster == 2)
KmeansCluster3 = subset(pointsnorm, kmc$cluster == 3)
KmeansCluster4 = subset(pointsnorm, kmc$cluster == 4)
KmeansCluster5 = subset(pointsnorm, kmc$cluster == 5)
# Output the number of observations in each cluster
nrow(KmeansCluster1)
## [1] 2084
nrow(KmeansCluster2)
## [1] 942
nrow(KmeansCluster3)
## [1] 3968
nrow(KmeansCluster4)
## [1] 748
nrow(KmeansCluster5)
## [1] 1402
748 observations are present in the smallest cluster, which is Cluster 4.
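The cluster sizes can also be obtained in one call by tabulating the assignment vector instead of creating one subset per cluster. This sketch uses the built-in `iris` measurements as a stand-in for `pointsnorm`, with three clusters chosen arbitrarily.

```r
set.seed(200)
toy = scale(iris[, 1:4])           # built-in data as a stand-in for pointsnorm
toykmc = kmeans(toy, centers = 3)

sizes = table(toykmc$cluster)      # number of observations per cluster
sizes
names(which.min(sizes))            # label of the smallest cluster
```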
Different results from the first time
The same results from the first time
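Assuming these two answers refer to rerunning k-means without and with the same random seed, the behavior can be checked on a small example. The sketch uses the built-in `iris` data as a stand-in; k-means starts from randomly chosen centers, so the seed determines the outcome.

```r
toy = scale(iris[, 1:4])

# Resetting the seed before each run reproduces the same random starting centers.
set.seed(200)
run1 = kmeans(toy, centers = 3)
set.seed(200)
run2 = kmeans(toy, centers = 3)
same_with_seed = identical(run1$cluster, run2$cluster)
same_with_seed

# Without resetting the seed, the starting centers change, so the labels
# (and possibly the clusters themselves) are not guaranteed to match.
run3 = kmeans(toy, centers = 3)
table(run1$cluster, run3$cluster)
```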
The cluster centroid captures the average behavior in the cluster, and can be used to summarize the general pattern in the cluster.
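To make the centroid concrete: for a fitted k-means model, each row of `$centers` is the column-wise mean of the observations assigned to that cluster. The sketch below verifies this on the built-in `iris` data, used here only as a stand-in.

```r
set.seed(200)
toy = scale(iris[, 1:4])
toykmc = kmeans(toy, centers = 3)

# Average of the observations assigned to cluster 1, computed by hand.
manual = colMeans(toy[toykmc$cluster == 1, , drop = FALSE])

# Compare with the centroid stored in the fitted model.
matches = isTRUE(all.equal(unname(manual), unname(toykmc$centers[1, ])))
matches
```

This is why calling `colMeans` on each cluster subset, as done later in this analysis, describes the typical user of the cluster.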
Yes, at the extreme every data point can be assigned to its own cluster.
Yes, multicollinearity could cause certain features to be overweighted in the distances calculations.
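The overweighting effect can be illustrated with a made-up example in which one feature is duplicated. After normalization, the duplicated feature contributes twice to every squared Euclidean distance. The data and column names below are hypothetical.

```r
toy = data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 1, 4, 3, 5))
toy$a_copy = toy$a                  # perfectly correlated duplicate of a

round(cor(toy), 2)                  # a and a_copy have correlation 1

d_without = dist(scale(toy[, c("a", "b")]))  # distances using a and b only
d_with    = dist(scale(toy))                 # distances using a, b, and a_copy
d_a       = dist(scale(toy["a"]))            # contribution of a alone

# The duplicate adds a's squared contribution a second time.
extra = as.numeric(d_with^2) - as.numeric(d_without^2)
all.equal(extra, as.numeric(d_a^2))
```

On the real data, inspecting `cor(points)` would show which genre scores move together, for example because heavy watchers score high in every genre.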
# Understanding the Clusters
z = tail(sort(colMeans(KmeansCluster1)))
kable(z)
| | x |
|---|---|
| adventure | 0.2824732 |
| horror | 0.3363904 |
| imax | 0.4718513 |
| thriller | 0.4925912 |
| action | 0.5495948 |
| sci.fi | 0.5579951 |
z = tail(sort(colMeans(KmeansCluster2)))
kable(z)
| | x |
|---|---|
| mystery | 1.367318 |
| romance | 1.373039 |
| crime | 1.381170 |
| war | 1.484580 |
| film.noir | 1.570508 |
| drama | 1.842013 |
z = tail(sort(colMeans(KmeansCluster3)))
kable(z)
| | x |
|---|---|
| musical | -0.3643688 |
| war | -0.3539588 |
| romance | -0.3342490 |
| western | -0.3186429 |
| documentary | -0.1667101 |
| film.noir | -0.1179028 |
z = tail(sort(colMeans(KmeansCluster4)))
kable(z)
| | x |
|---|---|
| animation | 1.624056 |
| thriller | 1.760436 |
| sci.fi | 1.946086 |
| action | 1.998269 |
| fantasy | 2.060513 |
| adventure | 2.112335 |
z = tail(sort(colMeans(KmeansCluster5)))
kable(z)
| | x |
|---|---|
| comedy | 0.2781019 |
| adventure | 0.3795921 |
| fantasy | 0.6394475 |
| musical | 0.8548469 |
| animation | 0.8867496 |
| children | 1.1078301 |
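The five per-cluster summaries above repeat the same steps, so they could also be produced in one pass. The sketch below is one possible way to do that, shown on the built-in `iris` data with three clusters as a stand-in; `centers` and `top2` are illustrative names.

```r
set.seed(200)
toy = as.data.frame(scale(iris[, 1:4]))
toykmc = kmeans(toy, centers = 3)

# One row per cluster, one column per feature: the mean normalized value.
centers = aggregate(toy, by = list(cluster = toykmc$cluster), FUN = mean)
centers

# For each cluster, the two features with the highest mean normalized value.
top2 = apply(centers[, -1], 1, function(m) names(tail(sort(m), 2)))
top2
```

Because the values are normalized, a positive mean indicates that the cluster scores above the overall average on that feature, and a negative mean indicates that it scores below.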
Cluster 2.
Cluster 4.
# Sort the genres in cluster 4
z = sort(colMeans(KmeansCluster4))
kable(z)
| | x |
|---|---|
| film.noir | 0.0779509 |
| documentary | 0.4077432 |
| western | 0.7529009 |
| musical | 0.8168096 |
| war | 0.8519114 |
| romance | 0.9125008 |
| drama | 1.0208771 |
| horror | 1.3084837 |
| mystery | 1.3759077 |
| crime | 1.4182716 |
| children | 1.4878481 |
| imax | 1.5361422 |
| comedy | 1.5956005 |
| animation | 1.6240559 |
| thriller | 1.7604359 |
| sci.fi | 1.9460857 |
| action | 1.9982694 |
| fantasy | 2.0605133 |
| adventure | 2.1123349 |
Film noir has the lowest contribution in Cluster 4.