Exam 3: Understanding User Ratings

In this problem, we use a dataset of Google reviews on attractions from 23 categories across Europe. Google user ratings range from 1 to 5, and the average user rating per category is pre-calculated. Each observation represents a user.

Dataset: ratings.csv

Our dataset has the following columns:

userid: a unique string identifying a user (for example, "User 1")
churches, resorts, beaches, ..., monuments, gardens: the average rating this user gave to attractions in each of these categories. For example, the user with userid = "User 1" has parks = 3.65, which means that the average rating of all the parks this user rated is 3.65. If an average rating is 0, assume that 0 is the actual average rating, not that the user has not rated that category.

In this problem, we aim to cluster users by their average rating per category. Hence, users in the same cluster tend to enjoy or dislike the same categories.

ratings <- read.csv("ratings.csv")
str(ratings)
## 'data.frame':    5456 obs. of  24 variables:
##  $ userid       : chr  "User 1" "User 2" "User 3" "User 4" ...
##  $ churches     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ resorts      : num  0 0 0 0.5 0 0 5 5 5 5 ...
##  $ beaches      : num  3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
##  $ parks        : num  3.65 3.65 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
##  $ theatres     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ museums      : num  2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 ...
##  $ malls        : num  5 5 5 5 5 5 3.03 5 3.03 5 ...
##  $ zoo          : num  2.35 2.64 2.64 2.35 2.64 2.63 2.35 2.63 2.62 2.35 ...
##  $ restaurants  : num  2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.32 2.32 ...
##  $ pubs         : num  2.64 2.65 2.64 2.64 2.64 2.65 2.64 2.64 2.63 2.63 ...
##  $ burger_shops : num  1.69 1.69 1.69 1.69 1.69 1.69 1.68 1.68 1.67 1.67 ...
##  $ hotels       : num  1.7 1.7 1.7 1.7 1.7 1.69 1.69 1.69 1.68 1.67 ...
##  $ juice_bars   : num  1.72 1.72 1.72 1.72 1.72 1.72 1.71 1.71 1.7 1.7 ...
##  $ art_galleries: num  1.74 1.74 1.74 1.74 1.74 1.74 1.75 1.74 0.75 0.74 ...
##  $ dance_clubs  : num  0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.6 0.6 0.59 ...
##  $ pools        : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 ...
##  $ gyms         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bakeries     : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 ...
##  $ spas         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cafes        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ view_points  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ monuments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gardens      : num  0 0 0 0 0 0 0 0 0 0 ...
summary(ratings)
##     userid             churches        resorts         beaches     
##  Length:5456        Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:0.920   1st Qu.:1.360   1st Qu.:1.540  
##  Mode  :character   Median :1.340   Median :1.905   Median :2.060  
##                     Mean   :1.456   Mean   :2.320   Mean   :2.489  
##                     3rd Qu.:1.810   3rd Qu.:2.683   3rd Qu.:2.740  
##                     Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                    
##      parks          theatres        museums          malls      
##  Min.   :0.830   Min.   :1.120   Min.   :1.110   Min.   :1.120  
##  1st Qu.:1.730   1st Qu.:1.770   1st Qu.:1.790   1st Qu.:1.930  
##  Median :2.460   Median :2.670   Median :2.680   Median :3.230  
##  Mean   :2.797   Mean   :2.959   Mean   :2.893   Mean   :3.351  
##  3rd Qu.:4.093   3rd Qu.:4.312   3rd Qu.:3.840   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##       zoo         restaurants         pubs        burger_shops  
##  Min.   :0.860   Min.   :0.840   Min.   :0.810   Min.   :0.780  
##  1st Qu.:1.620   1st Qu.:1.800   1st Qu.:1.640   1st Qu.:1.290  
##  Median :2.170   Median :2.800   Median :2.680   Median :1.690  
##  Mean   :2.541   Mean   :3.126   Mean   :2.833   Mean   :2.078  
##  3rd Qu.:3.190   3rd Qu.:5.000   3rd Qu.:3.530   3rd Qu.:2.285  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                  NA's   :1      
##      hotels        juice_bars    art_galleries    dance_clubs   
##  Min.   :0.770   Min.   :0.760   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.190   1st Qu.:1.030   1st Qu.:0.860   1st Qu.:0.690  
##  Median :1.610   Median :1.490   Median :1.330   Median :0.800  
##  Mean   :2.126   Mean   :2.191   Mean   :2.207   Mean   :1.193  
##  3rd Qu.:2.360   3rd Qu.:2.740   3rd Qu.:4.440   3rd Qu.:1.160  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                                 
##      pools             gyms           bakeries           spas     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:0.5800   1st Qu.:0.5300   1st Qu.:0.5200   1st Qu.:0.54  
##  Median :0.7400   Median :0.6900   Median :0.6900   Median :0.69  
##  Mean   :0.9492   Mean   :0.8224   Mean   :0.9698   Mean   :1.00  
##  3rd Qu.:0.9100   3rd Qu.:0.8400   3rd Qu.:0.8600   3rd Qu.:0.86  
##  Max.   :5.0000   Max.   :5.0000   Max.   :5.0000   Max.   :5.00  
##                                                                   
##      cafes         view_points      monuments        gardens     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.5700   1st Qu.:0.740   1st Qu.:0.790   1st Qu.:0.880  
##  Median :0.7600   Median :1.030   Median :1.070   Median :1.290  
##  Mean   :0.9658   Mean   :1.751   Mean   :1.531   Mean   :1.561  
##  3rd Qu.:1.0000   3rd Qu.:2.070   3rd Qu.:1.560   3rd Qu.:1.660  
##  Max.   :5.0000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##                                                   NA's   :1
# removing missing values

ratings <- ratings[rowSums(is.na(ratings)) == 0, ]
summary(ratings)
##     userid             churches        resorts         beaches     
##  Length:5454        Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:0.920   1st Qu.:1.360   1st Qu.:1.540  
##  Mode  :character   Median :1.340   Median :1.910   Median :2.060  
##                     Mean   :1.456   Mean   :2.320   Mean   :2.489  
##                     3rd Qu.:1.810   3rd Qu.:2.688   3rd Qu.:2.740  
##                     Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      parks          theatres        museums          malls      
##  Min.   :0.830   Min.   :1.120   Min.   :1.110   Min.   :1.120  
##  1st Qu.:1.730   1st Qu.:1.770   1st Qu.:1.790   1st Qu.:1.930  
##  Median :2.460   Median :2.670   Median :2.680   Median :3.230  
##  Mean   :2.797   Mean   :2.959   Mean   :2.893   Mean   :3.351  
##  3rd Qu.:4.098   3rd Qu.:4.310   3rd Qu.:3.837   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##       zoo         restaurants         pubs        burger_shops  
##  Min.   :0.860   Min.   :0.840   Min.   :0.810   Min.   :0.780  
##  1st Qu.:1.620   1st Qu.:1.800   1st Qu.:1.640   1st Qu.:1.290  
##  Median :2.170   Median :2.800   Median :2.680   Median :1.690  
##  Mean   :2.541   Mean   :3.127   Mean   :2.833   Mean   :2.078  
##  3rd Qu.:3.190   3rd Qu.:5.000   3rd Qu.:3.527   3rd Qu.:2.288  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      hotels        juice_bars   art_galleries    dance_clubs   
##  Min.   :0.770   Min.   :0.76   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.190   1st Qu.:1.03   1st Qu.:0.860   1st Qu.:0.690  
##  Median :1.610   Median :1.49   Median :1.330   Median :0.800  
##  Mean   :2.126   Mean   :2.19   Mean   :2.206   Mean   :1.193  
##  3rd Qu.:2.360   3rd Qu.:2.74   3rd Qu.:4.440   3rd Qu.:1.160  
##  Max.   :5.000   Max.   :5.00   Max.   :5.000   Max.   :5.000  
##      pools             gyms           bakeries           spas       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.5800   1st Qu.:0.5300   1st Qu.:0.5200   1st Qu.:0.5400  
##  Median :0.7400   Median :0.6900   Median :0.6900   Median :0.6900  
##  Mean   :0.9493   Mean   :0.8225   Mean   :0.9692   Mean   :0.9996  
##  3rd Qu.:0.9100   3rd Qu.:0.8400   3rd Qu.:0.8600   3rd Qu.:0.8600  
##  Max.   :5.0000   Max.   :5.0000   Max.   :5.0000   Max.   :5.0000  
##      cafes         view_points      monuments        gardens     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.5700   1st Qu.:0.740   1st Qu.:0.790   1st Qu.:0.880  
##  Median :0.7600   Median :1.030   Median :1.070   Median :1.290  
##  Mean   :0.9653   Mean   :1.749   Mean   :1.531   Mean   :1.561  
##  3rd Qu.:1.0000   3rd Qu.:2.070   3rd Qu.:1.560   3rd Qu.:1.660  
##  Max.   :5.0000   Max.   :5.000   Max.   :5.000   Max.   :5.000
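
The row filter above keeps only rows with no missing values. As a minimal sketch on a made-up three-row data frame (the `toy` data and its column names are assumptions for illustration, not part of ratings.csv), the same filter can also be written with base R's `complete.cases()`:

```r
# Hypothetical toy data: two of the three rows contain an NA.
toy <- data.frame(
  id = c("User 1", "User 2", "User 3"),
  a  = c(1.5, NA, 3.0),
  b  = c(2.0, 2.5, NA)
)

# Approach used in this document: count the NAs per row and keep rows with zero.
keep_rowsums <- toy[rowSums(is.na(toy)) == 0, ]

# Equivalent base R helper: TRUE for rows without any NA.
keep_complete <- toy[complete.cases(toy), ]

print(keep_rowsums)
```

Both forms drop a row if any column is missing, which is why two rows disappear from `ratings` even though each affected column reports only one NA.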

Preparing the data

Before performing clustering on the dataset, which variable(s) should be removed?

Ans: user ID

Remove the necessary column from the dataset and rename the new data frame points.

Now, we will normalize the data.

What will the maximum value of pubs be after applying mean-variance normalization (subtracting the mean and dividing by the standard deviation)? Answer without actually normalizing the data.

points <- ratings[-1]
summary(points)
##     churches        resorts         beaches          parks      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.830  
##  1st Qu.:0.920   1st Qu.:1.360   1st Qu.:1.540   1st Qu.:1.730  
##  Median :1.340   Median :1.910   Median :2.060   Median :2.460  
##  Mean   :1.456   Mean   :2.320   Mean   :2.489   Mean   :2.797  
##  3rd Qu.:1.810   3rd Qu.:2.688   3rd Qu.:2.740   3rd Qu.:4.098  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     theatres        museums          malls            zoo       
##  Min.   :1.120   Min.   :1.110   Min.   :1.120   Min.   :0.860  
##  1st Qu.:1.770   1st Qu.:1.790   1st Qu.:1.930   1st Qu.:1.620  
##  Median :2.670   Median :2.680   Median :3.230   Median :2.170  
##  Mean   :2.959   Mean   :2.893   Mean   :3.351   Mean   :2.541  
##  3rd Qu.:4.310   3rd Qu.:3.837   3rd Qu.:5.000   3rd Qu.:3.190  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##   restaurants         pubs        burger_shops       hotels        juice_bars  
##  Min.   :0.840   Min.   :0.810   Min.   :0.780   Min.   :0.770   Min.   :0.76  
##  1st Qu.:1.800   1st Qu.:1.640   1st Qu.:1.290   1st Qu.:1.190   1st Qu.:1.03  
##  Median :2.800   Median :2.680   Median :1.690   Median :1.610   Median :1.49  
##  Mean   :3.127   Mean   :2.833   Mean   :2.078   Mean   :2.126   Mean   :2.19  
##  3rd Qu.:5.000   3rd Qu.:3.527   3rd Qu.:2.288   3rd Qu.:2.360   3rd Qu.:2.74  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00  
##  art_galleries    dance_clubs        pools             gyms       
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.860   1st Qu.:0.690   1st Qu.:0.5800   1st Qu.:0.5300  
##  Median :1.330   Median :0.800   Median :0.7400   Median :0.6900  
##  Mean   :2.206   Mean   :1.193   Mean   :0.9493   Mean   :0.8225  
##  3rd Qu.:4.440   3rd Qu.:1.160   3rd Qu.:0.9100   3rd Qu.:0.8400  
##  Max.   :5.000   Max.   :5.000   Max.   :5.0000   Max.   :5.0000  
##     bakeries           spas            cafes         view_points   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.5200   1st Qu.:0.5400   1st Qu.:0.5700   1st Qu.:0.740  
##  Median :0.6900   Median :0.6900   Median :0.7600   Median :1.030  
##  Mean   :0.9692   Mean   :0.9996   Mean   :0.9653   Mean   :1.749  
##  3rd Qu.:0.8600   3rd Qu.:0.8600   3rd Qu.:1.0000   3rd Qu.:2.070  
##  Max.   :5.0000   Max.   :5.0000   Max.   :5.0000   Max.   :5.000  
##    monuments        gardens     
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.790   1st Qu.:0.880  
##  Median :1.070   Median :1.290  
##  Mean   :1.531   Mean   :1.561  
##  3rd Qu.:1.560   3rd Qu.:1.660  
##  Max.   :5.000   Max.   :5.000
# Normalizing the data
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(points)
pointsnorm <-  predict(preproc, points)
summary(pointsnorm)
##     churches          resorts           beaches            parks        
##  Min.   :-1.7587   Min.   :-1.6320   Min.   :-1.9952   Min.   :-1.5025  
##  1st Qu.:-0.6472   1st Qu.:-0.6753   1st Qu.:-0.7608   1st Qu.:-0.8151  
##  Median :-0.1398   Median :-0.2884   Median :-0.3439   Median :-0.2575  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4280   3rd Qu.: 0.2585   3rd Qu.: 0.2012   3rd Qu.: 0.9933  
##  Max.   : 4.2819   Max.   : 1.8852   Max.   : 2.0128   Max.   : 1.6826  
##     theatres          museums            malls               zoo         
##  Min.   :-1.3736   Min.   :-1.3910   Min.   :-1.57892   Min.   :-1.5127  
##  1st Qu.:-0.8880   1st Qu.:-0.8606   1st Qu.:-1.00579   1st Qu.:-0.8288  
##  Median :-0.2158   Median :-0.1665   Median :-0.08595   Median :-0.3340  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 1.0092   3rd Qu.: 0.7364   3rd Qu.: 1.16644   3rd Qu.: 0.5838  
##  Max.   : 1.5246   Max.   : 1.6431   Max.   : 1.16644   Max.   : 2.2124  
##   restaurants           pubs          burger_shops         hotels       
##  Min.   :-1.6853   Min.   :-1.5472   Min.   :-1.0393   Min.   :-0.9638  
##  1st Qu.:-0.9777   1st Qu.:-0.9123   1st Qu.:-0.6311   1st Qu.:-0.6653  
##  Median :-0.2407   Median :-0.1168   Median :-0.3109   Median :-0.3667  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.3808   3rd Qu.: 0.5315   3rd Qu.: 0.1674   3rd Qu.: 0.1665  
##  Max.   : 1.3808   Max.   : 1.6578   Max.   : 2.3386   Max.   : 2.0432  
##    juice_bars      art_galleries      dance_clubs           pools         
##  Min.   :-0.9073   Min.   :-1.2857   Min.   :-1.07725   Min.   :-0.97506  
##  1st Qu.:-0.7361   1st Qu.:-0.7845   1st Qu.:-0.45405   1st Qu.:-0.37935  
##  Median :-0.4443   Median :-0.5106   Median :-0.35469   Median :-0.21502  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.3486   3rd Qu.: 1.3019   3rd Qu.:-0.02954   3rd Qu.:-0.04041  
##  Max.   : 1.7822   Max.   : 1.6283   Max.   : 3.43874   Max.   : 4.16037  
##       gyms             bakeries             spas             cafes         
##  Min.   :-0.86763   Min.   :-0.80577   Min.   :-0.8378   Min.   :-1.03980  
##  1st Qu.:-0.30857   1st Qu.:-0.37348   1st Qu.:-0.3852   1st Qu.:-0.42579  
##  Median :-0.13979   Median :-0.23215   Median :-0.2595   Median :-0.22112  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.01843   3rd Qu.:-0.09082   3rd Qu.:-0.1170   3rd Qu.: 0.03741  
##  Max.   : 4.40655   Max.   : 3.35091   Max.   : 3.3528   Max.   : 4.34624  
##   view_points        monuments          gardens        
##  Min.   :-1.0948   Min.   :-1.1633   Min.   :-1.33179  
##  1st Qu.:-0.6317   1st Qu.:-0.5630   1st Qu.:-0.58080  
##  Median :-0.4502   Median :-0.3503   Median :-0.23090  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.2007   3rd Qu.: 0.0220   3rd Qu.: 0.08485  
##  Max.   : 2.0344   Max.   : 2.6356   Max.   : 2.93521
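
The normalized maximum of a column can be worked out from its raw maximum, mean, and standard deviation as (max - mean) / sd. For pubs this is (5.000 - 2.833) / sd(pubs), which agrees with the 1.6578 shown in the summary above; note that `summary()` does not print the standard deviation, so it has to be computed separately. The sketch below checks the formula on a made-up vector (the values in `x` are assumptions, not the real pubs column) and compares it with base R's `scale()`, which performs the same centering and scaling as the default `preProcess()` settings:

```r
# Hypothetical stand-in for one rating column.
x <- c(0.81, 1.64, 2.68, 3.53, 5.00)

# Center and scale by hand: subtract the mean, divide by the standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# Base R's scale() applies the same transformation.
z_scale <- as.numeric(scale(x))

# The normalized maximum, computed from max, mean, and sd alone.
predicted_max <- (max(x) - mean(x)) / sd(x)

print(c(predicted = predicted_max, observed = max(z_manual)))
```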

Clustering

Create a dendrogram using the following code:

distances = dist(pointsnorm, method = "euclidean")

dend = hclust(distances, method = "ward.D")

plot(dend, labels = FALSE)

Based on the dendrogram, how many clusters do you think would NOT be appropriate for this problem?

distances <- dist(pointsnorm, method = "euclidean")

dend <- hclust(distances, method = "ward.D")

plot(dend, labels = FALSE)

Ans: 5

Based on this dendrogram, in choosing the number of clusters, what is the best option?

Ans: 4
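
Once a number of clusters has been read off the dendrogram, `cutree()` turns the tree into cluster assignments. The following is a minimal sketch on simulated two-group data (the `toy` matrix and `k = 2` are assumptions for illustration); on the exam data the analogous call would be `cutree(dend, k = 4)`:

```r
set.seed(1)

# Hypothetical data: two groups of 20 points each, in two dimensions.
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2)
)

# Same steps as above: pairwise distances, then Ward hierarchical clustering.
toy_dist <- dist(toy, method = "euclidean")
toy_dend <- hclust(toy_dist, method = "ward.D")

# Cut the tree into k groups; the result is one cluster label per row.
toy_groups <- cutree(toy_dend, k = 2)

print(table(toy_groups))
```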

Clustering 2

Set the random seed to 100, and run the k-means clustering algorithm on your normalized dataset, setting the number of clusters to 4.

How many observations are in the largest cluster?

set.seed(100)

# kmeans clustering, k = 4
kmc_ratings <- kmeans(pointsnorm, centers = 4)

# Create one subset of the normalized data per cluster
KmeansCluster1 <-  subset(pointsnorm, kmc_ratings$cluster == 1)
KmeansCluster2 <-  subset(pointsnorm, kmc_ratings$cluster == 2)
KmeansCluster3 <-  subset(pointsnorm, kmc_ratings$cluster == 3)
KmeansCluster4 <-  subset(pointsnorm, kmc_ratings$cluster == 4)
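
The size of each cluster can be read directly from the fitted k-means object rather than from the subsets. A minimal sketch on simulated data follows (the `toy_points` matrix and the choice of two centers are assumptions for illustration); on the exam data the same calls would be applied to `kmc_ratings`:

```r
set.seed(100)

# Hypothetical data: 15 points near 0 and 25 points near 5, in two dimensions.
toy_points <- rbind(
  matrix(rnorm(30, mean = 0), ncol = 2),
  matrix(rnorm(50, mean = 5), ncol = 2)
)

toy_kmc <- kmeans(toy_points, centers = 2)

# Count the observations per cluster label.
cluster_sizes <- table(toy_kmc$cluster)

# Label and size of the largest cluster.
largest_cluster <- as.integer(names(cluster_sizes)[which.max(cluster_sizes)])
largest_size <- max(cluster_sizes)

print(cluster_sizes)
```

The fitted object also stores the same counts in its `size` component, so `max(toy_kmc$size)` gives the size of the largest cluster in one step.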

Conceptual Questions

  1. True or False: If we ran k-means clustering a second time without making any additional calls to set.seed, we would expect every observation to be in the same cluster as it is now.

Ans: False

  2. True or False: K-means clustering is sensitive to outliers.

Ans: True

  3. Why do we typically use cluster centroids to describe the clusters?

Ans: The cluster centroid captures the average behavior in the cluster, and can be used to summarize the general pattern in the cluster.

  4. Is “overfitting” a problem in clustering?

Ans: Yes, at the extreme every data point can be assigned to its own cluster.

  5. Is “multicollinearity” a problem in clustering?

Ans: Yes, multicollinearity could cause certain features to be overweighted in the distance calculations.
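
The multicollinearity answer can be illustrated with a small worked example. In this sketch (the two points and the duplicated `x_copy` feature are made up for illustration), adding an exact copy of one feature makes that feature count twice in the squared Euclidean distance:

```r
# Two hypothetical observations described by features x and y.
a <- c(x = 0, y = 0)
b <- c(x = 3, y = 4)

# Euclidean distance on the original features: sqrt(3^2 + 4^2).
d_original <- sqrt(sum((a - b)^2))

# Add a perfectly collinear feature: an exact copy of x.
a_dup <- c(a, x_copy = a[["x"]])
b_dup <- c(b, x_copy = b[["x"]])

# The x difference now enters twice: sqrt(3^2 + 4^2 + 3^2).
d_duplicated <- sqrt(sum((a_dup - b_dup)^2))

print(c(original = d_original, duplicated = d_duplicated))
```

The distance grows from 5 to sqrt(34) even though no new information was added, so differences in x now influence cluster assignments more than differences in y.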

Understanding the Clusters

Which cluster has the users with the lowest average rating for restaurants?

# Average rating
tapply(ratings$restaurants, kmc_ratings$cluster, mean)
##        1        2        3        4 
## 4.033197 2.617096 2.777727 1.744214
# Ans: Cluster 4

# Which cluster enjoys churches, pools, gyms, bakeries, and cafes?
tapply(ratings$churches, kmc_ratings$cluster, mean)
##        1        2        3        4 
## 1.052756 1.518646 1.927692 2.353155
tapply(ratings$pools, kmc_ratings$cluster, mean)
##         1         2         3         4 
## 0.7023474 0.7359990 0.8975874 2.2309726
tapply(ratings$bakeries, kmc_ratings$cluster, mean)
##         1         2         3         4 
## 0.7171782 0.6309063 1.7812238 2.2608479
tapply(ratings$cafes, kmc_ratings$cluster, mean)
##         1         2         3         4 
## 0.6846865 0.8170494 1.6820280 1.9166584
tapply(ratings$gyms, kmc_ratings$cluster, mean)
##         1         2         3         4 
## 0.6119266 0.5507981 0.9092308 2.0860973
# Ans: Cluster 4 again :)

# Which cluster seems to enjoy being outside, but does not enjoy as much going to the zoo or pool?
tapply(ratings$beaches, kmc_ratings$cluster, mean)
##        1        2        3        4 
## 1.879150 3.265566 2.804860 2.339589
tapply(ratings$resorts, kmc_ratings$cluster, mean)
##        1        2        3        4 
## 1.918861 2.643960 2.951049 2.523254
tapply(ratings$zoo, kmc_ratings$cluster, mean)
##        1        2        3        4 
## 3.119076 2.333496 1.646713 1.616372
tapply(ratings$pools, kmc_ratings$cluster, mean)
##         1         2         3         4 
## 0.7023474 0.7359990 0.8975874 2.2309726
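
The per-category `tapply()` calls above can be collapsed into a single table of cluster means. A minimal sketch with `aggregate()` on made-up data follows (the `toy_ratings` values and `toy_cluster` labels are assumptions, not the exam data); on the exam data the inputs would be `ratings[-1]` and `kmc_ratings$cluster`:

```r
# Hypothetical ratings for six users in two categories.
toy_ratings <- data.frame(
  restaurants = c(4.0, 4.2, 1.5, 1.7, 2.9, 3.1),
  zoo         = c(3.0, 3.2, 1.6, 1.4, 2.0, 2.2)
)

# Hypothetical cluster label for each user.
toy_cluster <- c(1, 1, 2, 2, 3, 3)

# Mean of every column within each cluster, one row per cluster.
cluster_means <- aggregate(toy_ratings, by = list(cluster = toy_cluster), FUN = mean)

# Cluster with the lowest average restaurant rating.
lowest_restaurants <- cluster_means$cluster[which.min(cluster_means$restaurants)]

print(cluster_means)
```

Having all categories side by side makes questions that compare several categories at once, such as the outdoor versus zoo and pool question, easier to read off than separate calls.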