In this problem, we will use a dataset of Google reviews of attractions from 23 categories across Europe. Google user ratings range from 1 to 5, and the average user rating per category is pre-calculated. The dataset was populated by capturing user ratings from Google reviews. Each observation represents a user.
Dataset: ratings.csv
Our dataset has the following columns:
userid: a unique identifier for a user (stored as a character string such as "User 1")
churches, resorts, beaches,..,monuments, gardens: the average rating this user gave to the attractions in each of these categories. For example, the user with userid = User 1 has parks = 3.65, which means that the average rating of all the parks this user rated is 3.65. An average rating of 0 should be taken at face value: it is the user's actual average, not a sign that the user never rated that category.
In this problem, we aim to cluster users by their average rating per category. Hence, users in the same cluster tend to enjoy or dislike the same categories.
ratings <- read.csv("ratings.csv")
str(ratings)
## 'data.frame': 5456 obs. of 24 variables:
## $ userid : chr "User 1" "User 2" "User 3" "User 4" ...
## $ churches : num 0 0 0 0 0 0 0 0 0 0 ...
## $ resorts : num 0 0 0 0.5 0 0 5 5 5 5 ...
## $ beaches : num 3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
## $ parks : num 3.65 3.65 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
## $ theatres : num 5 5 5 5 5 5 5 5 5 5 ...
## $ museums : num 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 ...
## $ malls : num 5 5 5 5 5 5 3.03 5 3.03 5 ...
## $ zoo : num 2.35 2.64 2.64 2.35 2.64 2.63 2.35 2.63 2.62 2.35 ...
## $ restaurants : num 2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.32 2.32 ...
## $ pubs : num 2.64 2.65 2.64 2.64 2.64 2.65 2.64 2.64 2.63 2.63 ...
## $ burger_shops : num 1.69 1.69 1.69 1.69 1.69 1.69 1.68 1.68 1.67 1.67 ...
## $ hotels : num 1.7 1.7 1.7 1.7 1.7 1.69 1.69 1.69 1.68 1.67 ...
## $ juice_bars : num 1.72 1.72 1.72 1.72 1.72 1.72 1.71 1.71 1.7 1.7 ...
## $ art_galleries: num 1.74 1.74 1.74 1.74 1.74 1.74 1.75 1.74 0.75 0.74 ...
## $ dance_clubs : num 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.6 0.6 0.59 ...
## $ pools : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 ...
## $ gyms : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bakeries : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 ...
## $ spas : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cafes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ view_points : num 0 0 0 0 0 0 0 0 0 0 ...
## $ monuments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ gardens : num 0 0 0 0 0 0 0 0 0 0 ...
summary(ratings)
## userid churches resorts beaches
## Length:5456 Min. :0.000 Min. :0.000 Min. :0.000
## Class :character 1st Qu.:0.920 1st Qu.:1.360 1st Qu.:1.540
## Mode :character Median :1.340 Median :1.905 Median :2.060
## Mean :1.456 Mean :2.320 Mean :2.489
## 3rd Qu.:1.810 3rd Qu.:2.683 3rd Qu.:2.740
## Max. :5.000 Max. :5.000 Max. :5.000
##
## parks theatres museums malls
## Min. :0.830 Min. :1.120 Min. :1.110 Min. :1.120
## 1st Qu.:1.730 1st Qu.:1.770 1st Qu.:1.790 1st Qu.:1.930
## Median :2.460 Median :2.670 Median :2.680 Median :3.230
## Mean :2.797 Mean :2.959 Mean :2.893 Mean :3.351
## 3rd Qu.:4.093 3rd Qu.:4.312 3rd Qu.:3.840 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## zoo restaurants pubs burger_shops
## Min. :0.860 Min. :0.840 Min. :0.810 Min. :0.780
## 1st Qu.:1.620 1st Qu.:1.800 1st Qu.:1.640 1st Qu.:1.290
## Median :2.170 Median :2.800 Median :2.680 Median :1.690
## Mean :2.541 Mean :3.126 Mean :2.833 Mean :2.078
## 3rd Qu.:3.190 3rd Qu.:5.000 3rd Qu.:3.530 3rd Qu.:2.285
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :1
## hotels juice_bars art_galleries dance_clubs
## Min. :0.770 Min. :0.760 Min. :0.000 Min. :0.000
## 1st Qu.:1.190 1st Qu.:1.030 1st Qu.:0.860 1st Qu.:0.690
## Median :1.610 Median :1.490 Median :1.330 Median :0.800
## Mean :2.126 Mean :2.191 Mean :2.207 Mean :1.193
## 3rd Qu.:2.360 3rd Qu.:2.740 3rd Qu.:4.440 3rd Qu.:1.160
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
##
## pools gyms bakeries spas
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00
## 1st Qu.:0.5800 1st Qu.:0.5300 1st Qu.:0.5200 1st Qu.:0.54
## Median :0.7400 Median :0.6900 Median :0.6900 Median :0.69
## Mean :0.9492 Mean :0.8224 Mean :0.9698 Mean :1.00
## 3rd Qu.:0.9100 3rd Qu.:0.8400 3rd Qu.:0.8600 3rd Qu.:0.86
## Max. :5.0000 Max. :5.0000 Max. :5.0000 Max. :5.00
##
## cafes view_points monuments gardens
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.5700 1st Qu.:0.740 1st Qu.:0.790 1st Qu.:0.880
## Median :0.7600 Median :1.030 Median :1.070 Median :1.290
## Mean :0.9658 Mean :1.751 Mean :1.531 Mean :1.561
## 3rd Qu.:1.0000 3rd Qu.:2.070 3rd Qu.:1.560 3rd Qu.:1.660
## Max. :5.0000 Max. :5.000 Max. :5.000 Max. :5.000
## NA's :1
# removing missing values
ratings <- ratings[rowSums(is.na(ratings)) == 0, ]
summary(ratings)
## userid churches resorts beaches
## Length:5454 Min. :0.000 Min. :0.000 Min. :0.000
## Class :character 1st Qu.:0.920 1st Qu.:1.360 1st Qu.:1.540
## Mode :character Median :1.340 Median :1.910 Median :2.060
## Mean :1.456 Mean :2.320 Mean :2.489
## 3rd Qu.:1.810 3rd Qu.:2.688 3rd Qu.:2.740
## Max. :5.000 Max. :5.000 Max. :5.000
## parks theatres museums malls
## Min. :0.830 Min. :1.120 Min. :1.110 Min. :1.120
## 1st Qu.:1.730 1st Qu.:1.770 1st Qu.:1.790 1st Qu.:1.930
## Median :2.460 Median :2.670 Median :2.680 Median :3.230
## Mean :2.797 Mean :2.959 Mean :2.893 Mean :3.351
## 3rd Qu.:4.098 3rd Qu.:4.310 3rd Qu.:3.837 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## zoo restaurants pubs burger_shops
## Min. :0.860 Min. :0.840 Min. :0.810 Min. :0.780
## 1st Qu.:1.620 1st Qu.:1.800 1st Qu.:1.640 1st Qu.:1.290
## Median :2.170 Median :2.800 Median :2.680 Median :1.690
## Mean :2.541 Mean :3.127 Mean :2.833 Mean :2.078
## 3rd Qu.:3.190 3rd Qu.:5.000 3rd Qu.:3.527 3rd Qu.:2.288
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## hotels juice_bars art_galleries dance_clubs
## Min. :0.770 Min. :0.76 Min. :0.000 Min. :0.000
## 1st Qu.:1.190 1st Qu.:1.03 1st Qu.:0.860 1st Qu.:0.690
## Median :1.610 Median :1.49 Median :1.330 Median :0.800
## Mean :2.126 Mean :2.19 Mean :2.206 Mean :1.193
## 3rd Qu.:2.360 3rd Qu.:2.74 3rd Qu.:4.440 3rd Qu.:1.160
## Max. :5.000 Max. :5.00 Max. :5.000 Max. :5.000
## pools gyms bakeries spas
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.5800 1st Qu.:0.5300 1st Qu.:0.5200 1st Qu.:0.5400
## Median :0.7400 Median :0.6900 Median :0.6900 Median :0.6900
## Mean :0.9493 Mean :0.8225 Mean :0.9692 Mean :0.9996
## 3rd Qu.:0.9100 3rd Qu.:0.8400 3rd Qu.:0.8600 3rd Qu.:0.8600
## Max. :5.0000 Max. :5.0000 Max. :5.0000 Max. :5.0000
## cafes view_points monuments gardens
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.5700 1st Qu.:0.740 1st Qu.:0.790 1st Qu.:0.880
## Median :0.7600 Median :1.030 Median :1.070 Median :1.290
## Mean :0.9653 Mean :1.749 Mean :1.531 Mean :1.561
## 3rd Qu.:1.0000 3rd Qu.:2.070 3rd Qu.:1.560 3rd Qu.:1.660
## Max. :5.0000 Max. :5.000 Max. :5.000 Max. :5.000
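The drop from 5456 to 5454 rows matches the two NA's reported in the first summary (one in burger_shops, one in gardens), assuming they fall in different rows. As a minimal sketch, the same filter can also be written with base R's complete.cases(); the toy data frame below is hypothetical and only illustrates that the two forms agree.

```r
# Hypothetical toy data frame with one missing value in each of two rows.
toy <- data.frame(
  userid = c("User 1", "User 2", "User 3", "User 4"),
  burger_shops = c(1.69, NA, 1.68, 1.67),
  gardens = c(0, 0.5, NA, 1.2)
)

# Row-wise NA count, as in the step above.
kept_rowsums <- toy[rowSums(is.na(toy)) == 0, ]

# Equivalent base R idiom: keep only rows with no missing values.
kept_complete <- toy[complete.cases(toy), ]

nrow(kept_complete)
```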
Before performing clustering on the dataset, which variable(s) should be removed?
Ans: user ID
Remove the necessary column from the dataset and rename the new data frame points.
Now, we will normalize the data.
What will the maximum value of pubs be after applying mean-var normalization? Answer without actually normalizing the data.
points <- ratings[-1]
summary(points)
## churches resorts beaches parks
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.830
## 1st Qu.:0.920 1st Qu.:1.360 1st Qu.:1.540 1st Qu.:1.730
## Median :1.340 Median :1.910 Median :2.060 Median :2.460
## Mean :1.456 Mean :2.320 Mean :2.489 Mean :2.797
## 3rd Qu.:1.810 3rd Qu.:2.688 3rd Qu.:2.740 3rd Qu.:4.098
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## theatres museums malls zoo
## Min. :1.120 Min. :1.110 Min. :1.120 Min. :0.860
## 1st Qu.:1.770 1st Qu.:1.790 1st Qu.:1.930 1st Qu.:1.620
## Median :2.670 Median :2.680 Median :3.230 Median :2.170
## Mean :2.959 Mean :2.893 Mean :3.351 Mean :2.541
## 3rd Qu.:4.310 3rd Qu.:3.837 3rd Qu.:5.000 3rd Qu.:3.190
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## restaurants pubs burger_shops hotels juice_bars
## Min. :0.840 Min. :0.810 Min. :0.780 Min. :0.770 Min. :0.76
## 1st Qu.:1.800 1st Qu.:1.640 1st Qu.:1.290 1st Qu.:1.190 1st Qu.:1.03
## Median :2.800 Median :2.680 Median :1.690 Median :1.610 Median :1.49
## Mean :3.127 Mean :2.833 Mean :2.078 Mean :2.126 Mean :2.19
## 3rd Qu.:5.000 3rd Qu.:3.527 3rd Qu.:2.288 3rd Qu.:2.360 3rd Qu.:2.74
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.00
## art_galleries dance_clubs pools gyms
## Min. :0.000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.860 1st Qu.:0.690 1st Qu.:0.5800 1st Qu.:0.5300
## Median :1.330 Median :0.800 Median :0.7400 Median :0.6900
## Mean :2.206 Mean :1.193 Mean :0.9493 Mean :0.8225
## 3rd Qu.:4.440 3rd Qu.:1.160 3rd Qu.:0.9100 3rd Qu.:0.8400
## Max. :5.000 Max. :5.000 Max. :5.0000 Max. :5.0000
## bakeries spas cafes view_points
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.5200 1st Qu.:0.5400 1st Qu.:0.5700 1st Qu.:0.740
## Median :0.6900 Median :0.6900 Median :0.7600 Median :1.030
## Mean :0.9692 Mean :0.9996 Mean :0.9653 Mean :1.749
## 3rd Qu.:0.8600 3rd Qu.:0.8600 3rd Qu.:1.0000 3rd Qu.:2.070
## Max. :5.0000 Max. :5.0000 Max. :5.0000 Max. :5.000
## monuments gardens
## Min. :0.000 Min. :0.000
## 1st Qu.:0.790 1st Qu.:0.880
## Median :1.070 Median :1.290
## Mean :1.531 Mean :1.561
## 3rd Qu.:1.560 3rd Qu.:1.660
## Max. :5.000 Max. :5.000
# Normalizing the data
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(points)
pointsnorm <- predict(preproc, points)
summary(pointsnorm)
## churches resorts beaches parks
## Min. :-1.7587 Min. :-1.6320 Min. :-1.9952 Min. :-1.5025
## 1st Qu.:-0.6472 1st Qu.:-0.6753 1st Qu.:-0.7608 1st Qu.:-0.8151
## Median :-0.1398 Median :-0.2884 Median :-0.3439 Median :-0.2575
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4280 3rd Qu.: 0.2585 3rd Qu.: 0.2012 3rd Qu.: 0.9933
## Max. : 4.2819 Max. : 1.8852 Max. : 2.0128 Max. : 1.6826
## theatres museums malls zoo
## Min. :-1.3736 Min. :-1.3910 Min. :-1.57892 Min. :-1.5127
## 1st Qu.:-0.8880 1st Qu.:-0.8606 1st Qu.:-1.00579 1st Qu.:-0.8288
## Median :-0.2158 Median :-0.1665 Median :-0.08595 Median :-0.3340
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 1.0092 3rd Qu.: 0.7364 3rd Qu.: 1.16644 3rd Qu.: 0.5838
## Max. : 1.5246 Max. : 1.6431 Max. : 1.16644 Max. : 2.2124
## restaurants pubs burger_shops hotels
## Min. :-1.6853 Min. :-1.5472 Min. :-1.0393 Min. :-0.9638
## 1st Qu.:-0.9777 1st Qu.:-0.9123 1st Qu.:-0.6311 1st Qu.:-0.6653
## Median :-0.2407 Median :-0.1168 Median :-0.3109 Median :-0.3667
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.3808 3rd Qu.: 0.5315 3rd Qu.: 0.1674 3rd Qu.: 0.1665
## Max. : 1.3808 Max. : 1.6578 Max. : 2.3386 Max. : 2.0432
## juice_bars art_galleries dance_clubs pools
## Min. :-0.9073 Min. :-1.2857 Min. :-1.07725 Min. :-0.97506
## 1st Qu.:-0.7361 1st Qu.:-0.7845 1st Qu.:-0.45405 1st Qu.:-0.37935
## Median :-0.4443 Median :-0.5106 Median :-0.35469 Median :-0.21502
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.3486 3rd Qu.: 1.3019 3rd Qu.:-0.02954 3rd Qu.:-0.04041
## Max. : 1.7822 Max. : 1.6283 Max. : 3.43874 Max. : 4.16037
## gyms bakeries spas cafes
## Min. :-0.86763 Min. :-0.80577 Min. :-0.8378 Min. :-1.03980
## 1st Qu.:-0.30857 1st Qu.:-0.37348 1st Qu.:-0.3852 1st Qu.:-0.42579
## Median :-0.13979 Median :-0.23215 Median :-0.2595 Median :-0.22112
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.01843 3rd Qu.:-0.09082 3rd Qu.:-0.1170 3rd Qu.: 0.03741
## Max. : 4.40655 Max. : 3.35091 Max. : 3.3528 Max. : 4.34624
## view_points monuments gardens
## Min. :-1.0948 Min. :-1.1633 Min. :-1.33179
## 1st Qu.:-0.6317 1st Qu.:-0.5630 1st Qu.:-0.58080
## Median :-0.4502 Median :-0.3503 Median :-0.23090
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.2007 3rd Qu.: 0.0220 3rd Qu.: 0.08485
## Max. : 2.0344 Max. : 2.6356 Max. : 2.93521
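With its default settings, preProcess() centers and scales each column, so every value becomes (x - mean) / sd. The maximum of a column therefore maps to (max - mean) / sd. For pubs, the summary above shows a normalized maximum of 1.6578 with a raw mean of 2.833 and a raw maximum of 5, which implies a standard deviation of roughly 1.31. The sketch below checks the formula on a hypothetical vector using base R only; the numbers are illustrative and not taken from ratings.csv.

```r
# Hypothetical ratings for one category; not taken from ratings.csv.
x <- c(0.81, 1.64, 2.68, 3.53, 5.00)

# Mean-variance (z-score) normalization: center, then divide by the sample sd.
z <- (x - mean(x)) / sd(x)

# Base R's scale() performs the same centering and scaling column-wise.
z_scale <- as.numeric(scale(x))

# The maximum maps to (max - mean) / sd, so it can be found without
# normalizing every value.
max_z <- (max(x) - mean(x)) / sd(x)
```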
Create a dendrogram using the following code:
distances = dist(pointsnorm, method = "euclidean")
dend = hclust(distances, method = "ward.D")
plot(dend, labels = FALSE)
Based on the dendrogram, how many clusters do you think would NOT be appropriate for this problem?
distances <- dist(pointsnorm, method = "euclidean")
dend <- hclust(distances, method = "ward.D")
plot(dend, labels = FALSE)
Ans: 5
Based on this dendrogram, in choosing the number of clusters, what is the best option?
Ans: 4
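One way to inspect a candidate number of clusters is to cut the tree with cutree() and look at the group sizes. The following is a minimal sketch on hypothetical, well-separated toy points standing in for pointsnorm; on the real data the call would be cutree(dend, k = 4).

```r
set.seed(1)  # arbitrary seed, used only for the toy data

# Hypothetical stand-in for pointsnorm: 40 points around 4 separated centers.
centers <- rbind(c(0, 0), c(6, 0), c(0, 6), c(6, 6))
toy <- centers[rep(1:4, each = 10), ] + matrix(rnorm(80, sd = 0.3), ncol = 2)

distances <- dist(toy, method = "euclidean")
dend <- hclust(distances, method = "ward.D")

# Cut the tree into 4 groups and count the members of each.
groups <- cutree(dend, k = 4)
table(groups)

# Optionally draw the cut on the dendrogram.
plot(dend, labels = FALSE)
rect.hclust(dend, k = 4, border = "red")
```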
Set the random seed to 100, and run the k-means clustering algorithm on your normalized dataset, setting the number of clusters to 4.
How many observations are in the largest cluster?
set.seed(100)
# Ignore this rm(ratingsCluster)
# kmeans clustering, k = 4
kmc_ratings <- kmeans(pointsnorm, centers = 4)
# Subset the normalized data by cluster assignment
KmeansCluster1 <- subset(pointsnorm, kmc_ratings$cluster == 1)
KmeansCluster2 <- subset(pointsnorm, kmc_ratings$cluster == 2)
KmeansCluster3 <- subset(pointsnorm, kmc_ratings$cluster == 3)
KmeansCluster4 <- subset(pointsnorm, kmc_ratings$cluster == 4)
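To count the observations in each cluster, the assignments can be tabulated directly, or read from the size component of the kmeans() result. A minimal sketch on hypothetical toy data follows; on the real data the calls would be table(kmc_ratings$cluster) or kmc_ratings$size. No output is shown because the toy cluster sizes depend on the random start.

```r
set.seed(100)

# Hypothetical stand-in for pointsnorm: three groups of unequal size.
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),   # 30 points
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),   # 20 points
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)   # 10 points
)

kmc <- kmeans(toy, centers = 3)

# Observations per cluster, two equivalent ways.
table(kmc$cluster)
kmc$size

# Size of the largest cluster.
largest <- max(kmc$size)
```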
Ans: False
Ans: True
Ans: The cluster centroid captures the average behavior in the cluster, and can be used to summarize the general pattern in the cluster.
Ans: Yes, at the extreme every data point can be assigned to its own cluster.
Ans: Yes, multicollinearity could cause certain features to be overweighted in the distances calculations.
Which cluster has the lowest average rating for restaurants?
# Average rating
tapply(ratings$restaurants, kmc_ratings$cluster, mean)
## 1 2 3 4
## 4.033197 2.617096 2.777727 1.744214
# Ans: Cluster 4
# Which cluster enjoys churches, pools, gyms, bakeries, and cafes?
tapply(ratings$churches, kmc_ratings$cluster, mean)
## 1 2 3 4
## 1.052756 1.518646 1.927692 2.353155
tapply(ratings$pools, kmc_ratings$cluster, mean)
## 1 2 3 4
## 0.7023474 0.7359990 0.8975874 2.2309726
tapply(ratings$bakeries, kmc_ratings$cluster, mean)
## 1 2 3 4
## 0.7171782 0.6309063 1.7812238 2.2608479
tapply(ratings$cafes, kmc_ratings$cluster, mean)
## 1 2 3 4
## 0.6846865 0.8170494 1.6820280 1.9166584
tapply(ratings$gyms, kmc_ratings$cluster, mean)
## 1 2 3 4
## 0.6119266 0.5507981 0.9092308 2.0860973
# Ans: Cluster 4 again :)
# Which cluster seems to enjoy being outside, but does not enjoy going to the zoo or pool as much?
tapply(ratings$beaches, kmc_ratings$cluster, mean)
## 1 2 3 4
## 1.879150 3.265566 2.804860 2.339589
tapply(ratings$resorts, kmc_ratings$cluster, mean)
## 1 2 3 4
## 1.918861 2.643960 2.951049 2.523254
tapply(ratings$zoo, kmc_ratings$cluster, mean)
## 1 2 3 4
## 3.119076 2.333496 1.646713 1.616372
tapply(ratings$pools, kmc_ratings$cluster, mean)
## 1 2 3 4
## 0.7023474 0.7359990 0.8975874 2.2309726
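The repeated tapply() calls above can be collapsed into a single table of per-cluster means, which makes it easier to compare categories side by side. This is a minimal sketch using aggregate() on hypothetical toy ratings and cluster labels; on the real data the equivalent call would be aggregate(ratings[-1], by = list(cluster = kmc_ratings$cluster), FUN = mean).

```r
# Hypothetical ratings and cluster labels; not taken from ratings.csv.
toy <- data.frame(
  beaches = c(3.6, 3.4, 1.8, 2.0, 2.9, 2.7),
  zoo     = c(2.3, 2.4, 3.1, 3.2, 1.6, 1.7),
  pools   = c(0.7, 0.8, 0.7, 0.6, 2.2, 2.3)
)
cluster <- c(1, 1, 2, 2, 3, 3)

# One row per cluster, one column per category.
cluster_means <- aggregate(toy, by = list(cluster = cluster), FUN = mean)
cluster_means
```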