Preliminary data analysis showed that many categories were unrated: ratings ranged from 1 to 5, and unrated categories were recorded as 0. We therefore replaced every 0 with NA. We also observed that a 26th column containing only NA values had been introduced on import; it added no value to the analysis, so we dropped it.
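library(dplyr)  # provides %>% and mutate()
Europe_Travel_Reviews <- read.csv("C:/Users/willi/Desktop/Georgetown/RStudio Datasource/Travel_Review.csv")
# LocalServices is not read as numeric; coerce it (non-numeric entries become NA)
Europe_Travel_Reviews <- Europe_Travel_Reviews %>%
  mutate(LocalServices = as.numeric(LocalServices))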
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
###EDA
dim(Europe_Travel_Reviews)
## [1] 5456 26
str(Europe_Travel_Reviews)
## 'data.frame': 5456 obs. of 26 variables:
## $ UserID : chr "User 1" "User 2" "User 3" "User 4" ...
## $ Churches : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Resorts : num 0 0 0 0.5 0 0 5 5 5 5 ...
## $ Beaches : num 3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
## $ Parks : num 3.65 3.65 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
## $ Theatres : num 5 5 5 5 5 5 5 5 5 5 ...
## $ Museums : num 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 ...
## $ Malls : num 5 5 5 5 5 5 3.03 5 3.03 5 ...
## $ Zoo : num 2.35 2.64 2.64 2.35 2.64 2.63 2.35 2.63 2.62 2.35 ...
## $ Restaurants : num 2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.32 2.32 ...
## $ Pubs_Bars : num 2.64 2.65 2.64 2.64 2.64 2.65 2.64 2.64 2.63 2.63 ...
## $ LocalServices : num 1.7 1.7 1.7 1.73 1.7 1.71 1.73 1.7 1.71 1.69 ...
## $ Burger_PizzaShops : num 1.69 1.69 1.69 1.69 1.69 1.69 1.68 1.68 1.67 1.67 ...
## $ Hotels_OtherLodgings: num 1.7 1.7 1.7 1.7 1.7 1.69 1.69 1.69 1.68 1.67 ...
## $ JuiceBars : num 1.72 1.72 1.72 1.72 1.72 1.72 1.71 1.71 1.7 1.7 ...
## $ ArtGalleries : num 1.74 1.74 1.74 1.74 1.74 1.74 1.75 1.74 0.75 0.74 ...
## $ DanceClubs : num 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.6 0.6 0.59 ...
## $ Swimming.Pools : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 ...
## $ Gyms : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Bakeries : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 ...
## $ BeautySpas : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Cafes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ViewPoints : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Monuments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Gardens : num 0 0 0 0 0 0 0 0 0 0 ...
## $ X : num NA NA NA NA NA NA NA NA NA NA ...
head(Europe_Travel_Reviews)
## UserID Churches Resorts Beaches Parks Theatres Museums Malls Zoo Restaurants
## 1 User 1 0 0.0 3.63 3.65 5 2.92 5 2.35 2.33
## 2 User 2 0 0.0 3.63 3.65 5 2.92 5 2.64 2.33
## 3 User 3 0 0.0 3.63 3.63 5 2.92 5 2.64 2.33
## 4 User 4 0 0.5 3.63 3.63 5 2.92 5 2.35 2.33
## 5 User 5 0 0.0 3.63 3.63 5 2.92 5 2.64 2.33
## 6 User 6 0 0.0 3.63 3.63 5 2.92 5 2.63 2.33
## Pubs_Bars LocalServices Burger_PizzaShops Hotels_OtherLodgings JuiceBars
## 1 2.64 1.70 1.69 1.70 1.72
## 2 2.65 1.70 1.69 1.70 1.72
## 3 2.64 1.70 1.69 1.70 1.72
## 4 2.64 1.73 1.69 1.70 1.72
## 5 2.64 1.70 1.69 1.70 1.72
## 6 2.65 1.71 1.69 1.69 1.72
## ArtGalleries DanceClubs Swimming.Pools Gyms Bakeries BeautySpas Cafes
## 1 1.74 0.59 0.5 0 0.5 0 0
## 2 1.74 0.59 0.5 0 0.5 0 0
## 3 1.74 0.59 0.5 0 0.5 0 0
## 4 1.74 0.59 0.5 0 0.5 0 0
## 5 1.74 0.59 0.5 0 0.5 0 0
## 6 1.74 0.59 0.5 0 0.5 0 0
## ViewPoints Monuments Gardens X
## 1 0 0 0 NA
## 2 0 0 0 NA
## 3 0 0 0 NA
## 4 0 0 0 NA
## 5 0 0 0 NA
## 6 0 0 0 NA
Europe_Travel_Reviews[Europe_Travel_Reviews == 0] <- NA
head(Europe_Travel_Reviews)
## UserID Churches Resorts Beaches Parks Theatres Museums Malls Zoo Restaurants
## 1 User 1 NA NA 3.63 3.65 5 2.92 5 2.35 2.33
## 2 User 2 NA NA 3.63 3.65 5 2.92 5 2.64 2.33
## 3 User 3 NA NA 3.63 3.63 5 2.92 5 2.64 2.33
## 4 User 4 NA 0.5 3.63 3.63 5 2.92 5 2.35 2.33
## 5 User 5 NA NA 3.63 3.63 5 2.92 5 2.64 2.33
## 6 User 6 NA NA 3.63 3.63 5 2.92 5 2.63 2.33
## Pubs_Bars LocalServices Burger_PizzaShops Hotels_OtherLodgings JuiceBars
## 1 2.64 1.70 1.69 1.70 1.72
## 2 2.65 1.70 1.69 1.70 1.72
## 3 2.64 1.70 1.69 1.70 1.72
## 4 2.64 1.73 1.69 1.70 1.72
## 5 2.64 1.70 1.69 1.70 1.72
## 6 2.65 1.71 1.69 1.69 1.72
## ArtGalleries DanceClubs Swimming.Pools Gyms Bakeries BeautySpas Cafes
## 1 1.74 0.59 0.5 NA 0.5 NA NA
## 2 1.74 0.59 0.5 NA 0.5 NA NA
## 3 1.74 0.59 0.5 NA 0.5 NA NA
## 4 1.74 0.59 0.5 NA 0.5 NA NA
## 5 1.74 0.59 0.5 NA 0.5 NA NA
## 6 1.74 0.59 0.5 NA 0.5 NA NA
## ViewPoints Monuments Gardens X
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
Europe_Travel_Reviews = Europe_Travel_Reviews[-26] #(My data did not have an extra column)
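The recoding and column removal described above can be sketched on a small made-up data frame. This is only an illustration under assumptions: the `toy` values are invented, and the sketch restricts the recoding to numeric columns and drops any column that is entirely NA by content rather than by position.

```r
# Toy data frame standing in for the review data (values are made up)
toy <- data.frame(
  UserID   = c("User 1", "User 2", "User 3"),
  Churches = c(0, 1.5, 0),
  Parks    = c(3.65, 0, 2.10),
  X        = c(NA_real_, NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

# Recode 0 as NA, but only in numeric columns
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(
  toy[num_cols],
  function(v) replace(v, !is.na(v) & v == 0, NA)
)

# Drop every column that has no observed values at all
all_na <- vapply(toy, function(v) all(is.na(v)), logical(1))
toy_clean <- toy[, !all_na, drop = FALSE]

print(toy_clean)
```

Selecting the empty column by its content avoids depending on its position, which matters when an export may or may not contain the extra column.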
During exploratory analysis we analyzed Summary Statistics, Missing Values, and Correlation among variables to identify errors, unlock initial patterns, and find interesting relationships among the variables.
Our Summary Statistics give us several data points that help us understand the distribution of our data such as Mean Rating per category, its Minimum, Maximum, Quartiles, and Interquartile Range. It also shows us that most of our variables are dramatically skewed, and gives us initial insight into which categories have the highest volume of unrated observations via the Pct.Valid column. If Pct.Valid is below 100%, then we’ve identified a variable where there are missing values.
summary_stats <- summarytools::descr(Europe_Travel_Reviews, round.digits = 2, transpose = TRUE)
view(summary_stats, method = "render")
## Non-numerical variable(s) ignored: UserID
| Variable | Mean | Std.Dev | Min | Q1 | Median | Q3 | Max | MAD | IQR | CV | Skewness | SE.Skewness | Kurtosis | N.Valid | Pct.Valid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArtGalleries | 2.21 | 1.72 | 0.50 | 0.86 | 1.33 | 4.44 | 5.00 | 0.77 | 3.58 | 0.78 | 0.86 | 0.03 | -1.05 | 5452 | 99.93 |
| Bakeries | 1.20 | 1.23 | 0.50 | 0.61 | 0.76 | 0.93 | 5.00 | 0.22 | 0.32 | 1.03 | 2.45 | 0.04 | 4.58 | 4410 | 80.83 |
| Beaches | 2.49 | 1.25 | 0.50 | 1.54 | 2.06 | 2.74 | 5.00 | 0.85 | 1.20 | 0.50 | 1.09 | 0.03 | -0.12 | 5452 | 99.93 |
| BeautySpas | 1.20 | 1.21 | 0.50 | 0.61 | 0.74 | 0.92 | 5.00 | 0.21 | 0.31 | 1.01 | 2.43 | 0.04 | 4.59 | 4560 | 83.58 |
| Burger_PizzaShops | 2.08 | 1.25 | 0.78 | 1.29 | 1.69 | 2.29 | 5.00 | 0.73 | 1.00 | 0.60 | 1.39 | 0.03 | 0.82 | 5455 | 99.98 |
| Cafes | 1.09 | 0.92 | 0.50 | 0.64 | 0.80 | 1.05 | 5.00 | 0.28 | 0.41 | 0.84 | 3.04 | 0.04 | 9.01 | 4852 | 88.93 |
| Churches | 1.51 | 0.79 | 0.50 | 0.97 | 1.35 | 1.86 | 5.00 | 0.64 | 0.89 | 0.53 | 1.99 | 0.03 | 5.55 | 5261 | 96.43 |
| DanceClubs | 1.22 | 1.10 | 0.50 | 0.71 | 0.81 | 1.17 | 5.00 | 0.27 | 0.46 | 0.91 | 2.72 | 0.03 | 6.32 | 5344 | 97.95 |
| Gardens | 1.63 | 1.15 | 0.50 | 0.92 | 1.31 | 1.69 | 5.00 | 0.58 | 0.77 | 0.71 | 2.00 | 0.03 | 3.13 | 5230 | 95.86 |
| Gyms | 1.01 | 0.96 | 0.50 | 0.61 | 0.74 | 0.89 | 5.00 | 0.21 | 0.28 | 0.95 | 3.48 | 0.04 | 11.35 | 4439 | 81.36 |
| Hotels_OtherLodgings | 2.13 | 1.41 | 0.77 | 1.19 | 1.61 | 2.36 | 5.00 | 0.77 | 1.17 | 0.66 | 1.26 | 0.03 | 0.11 | 5456 | 100.00 |
| JuiceBars | 2.19 | 1.58 | 0.76 | 1.03 | 1.49 | 2.74 | 5.00 | 0.83 | 1.71 | 0.72 | 1.03 | 0.03 | -0.65 | 5456 | 100.00 |
| LocalServices | 2.55 | 1.38 | 0.78 | 1.58 | 2.00 | 3.22 | 5.00 | 0.99 | 1.64 | 0.54 | 0.82 | 0.03 | -0.72 | 5455 | 99.98 |
| Malls | 3.35 | 1.41 | 1.12 | 1.93 | 3.23 | 5.00 | 5.00 | 2.37 | 3.07 | 0.42 | 0.02 | 0.03 | -1.60 | 5456 | 100.00 |
| Monuments | 1.62 | 1.30 | 0.26 | 0.84 | 1.10 | 1.65 | 5.00 | 0.46 | 0.81 | 0.80 | 1.76 | 0.03 | 1.80 | 5154 | 94.46 |
| Museums | 2.89 | 1.28 | 1.11 | 1.79 | 2.68 | 3.84 | 5.00 | 1.33 | 2.05 | 0.44 | 0.56 | 0.03 | -1.07 | 5456 | 100.00 |
| Parks | 2.80 | 1.31 | 0.83 | 1.73 | 2.46 | 4.10 | 5.00 | 1.19 | 2.36 | 0.47 | 0.71 | 0.03 | -0.98 | 5456 | 100.00 |
| Pubs_Bars | 2.83 | 1.31 | 0.81 | 1.64 | 2.68 | 3.53 | 5.00 | 1.51 | 1.89 | 0.46 | 0.52 | 0.03 | -0.93 | 5456 | 100.00 |
| Resorts | 2.36 | 1.40 | 0.50 | 1.37 | 1.97 | 2.72 | 5.00 | 0.95 | 1.35 | 0.59 | 0.93 | 0.03 | -0.43 | 5366 | 98.35 |
| Restaurants | 3.13 | 1.36 | 0.84 | 1.80 | 2.80 | 5.00 | 5.00 | 1.68 | 3.20 | 0.43 | 0.27 | 0.03 | -1.39 | 5456 | 100.00 |
| Swimming.Pools | 1.04 | 0.97 | 0.50 | 0.61 | 0.76 | 0.95 | 5.00 | 0.24 | 0.34 | 0.93 | 3.41 | 0.03 | 10.84 | 4977 | 91.22 |
| Theatres | 2.96 | 1.34 | 1.12 | 1.77 | 2.67 | 4.31 | 5.00 | 1.48 | 2.54 | 0.45 | 0.49 | 0.03 | -1.27 | 5456 | 100.00 |
| ViewPoints | 1.87 | 1.58 | 0.50 | 0.78 | 1.07 | 2.20 | 5.00 | 0.55 | 1.42 | 0.85 | 1.19 | 0.03 | -0.26 | 5111 | 93.68 |
| Zoo | 2.54 | 1.11 | 0.86 | 1.62 | 2.17 | 3.19 | 5.00 | 1.05 | 1.57 | 0.44 | 0.77 | 0.03 | -0.36 | 5456 | 100.00 |
Generated by summarytools 0.9.9 (R version 4.1.0)
2021-06-30
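To make the Pct.Valid and Skewness columns concrete, here is a minimal base R sketch. The helper names `pct_valid` and `skewness_moment` are hypothetical, the sample vector is made up, and the moment-based skewness formula is an assumption that may differ slightly from the estimator summarytools uses.

```r
# Share of non-missing values, in percent (mirrors the idea of Pct.Valid)
pct_valid <- function(x) 100 * mean(!is.na(x))

# Moment-based skewness: third central moment over variance^(3/2).
# Positive values indicate a long tail to the right.
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up ratings: mostly low values, one high value, one unrated entry
ratings <- c(0.5, 0.6, 0.7, 0.8, 5, NA)

pct_valid(ratings)        # 5 of 6 values are present
skewness_moment(ratings)  # positive: the single 5 stretches the right tail
```

A category whose ratings cluster near the minimum with a few values near 5 produces a large positive skewness, which is the pattern seen for the lowest rated categories in the table.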
The top 5 highest rated attractions in Europe were Malls (3.35), Restaurants (3.13), Theaters (2.96), Museums (2.89), and Pubs/Bars (2.83). The 5 lowest rated attractions were Gyms (1.01), Swimming Pools (1.04), Cafes (1.09), Beauty Spas (1.20), and Bakeries (1.20). A clear distinction between the top 5 and the lowest 5 categories is their completion rate, which indicates the presence or absence of missing values. Every reviewer in our sample rated all of the top 5 attractions. However, the completion rate for the lowest rated attractions bottomed out at 81% and did not go higher than 91%. These numbers tell us a couple of things: 1) travelers frequented the top 5 attractions more than they frequented the bottom 5 attractions, and 2) their experiences at these places were, on average, rated higher than the rest of the categories in the data set.
# Average User Rating Bar Chart
ggplot(summary_stats_table) +
  aes(x = reorder(Variable, Mean), y = Mean) +
  geom_bar(position = "dodge", stat = "identity", fill = "#0c4c8a") +
  coord_flip() +
  labs(title = "Average User Rating by Category",
       x = "Variables", y = "Mean") +
  theme_minimal()
The top 5 categories by average rating showed little to no skewness, indicating that their distributions are the closest to symmetric of all the variables. Positive (right) skewness begins to pick up between Parks and Art Galleries, and then increases sharply from Resorts through Gyms, indicating that most of our variables are highly right-skewed: most ratings are low, with a long tail toward 5. This is important to note because clustering algorithms work best with relatively independent continuous variables with low skewness. Given the skewness of our variables, standardization and Principal Component Analysis become very important in building our clusters.
# Skewness By Variable
ggplot(summary_stats_table) +
  aes(x = reorder(Variable, Skewness), y = Skewness) +
  geom_bar(position = "dodge", stat = "identity", fill = "#2F4F4F") +
  coord_flip() +
  labs(title = "Skewness by Category",
       x = "Variables", y = "Skewness") +
  theme_minimal()
The sample population consists of 5,456 unique Travel Reviewers and their average ratings across 24 categories in Europe. About 68%, or 3,724 Travel Reviewers, had complete ratings across all 24 categories in the data set. Conversely, 1,732 reviewers, or 32% of the sample population, accounted for 5,322 missing values, which is approximately 4% of the total 136,400 values in the data set.
The presence of a missing value indicates the reviewer did not provide a rating for a specific category. Based on the overall distribution of missing values, we can surmise that they are heavily concentrated across a few categories with some overlapping patterns of nullity. Using the naniar and VIM packages we will assess the source and concentrations of non-response ratings across the data set.
| Data.Category | Unique.Observations | Percent.Observations | Total.Values | Percent.Values |
|---|---|---|---|---|
| Missing | 1,732 | 31.74% | 5,322 | 3.90% |
| Complete | 3,724 | 68.26% | 131,078 | 96.10% |
| Total | 5,456 | 100.00% | 136,400 | 100.00% |
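The quantities in this table can be computed with a few base R calls. The sketch below uses a made-up data frame `toy` (an assumption for illustration only) and then re-derives the reported cell-level percentage from the totals stated above.

```r
# Made-up data: 4 reviewers, 3 categories, 3 missing ratings
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 2, 5),
  c = c(1, 2, 3, 4)
)

n_complete   <- sum(complete.cases(toy))   # reviewers with no missing rating
n_incomplete <- nrow(toy) - n_complete     # reviewers with at least one NA
n_missing    <- sum(is.na(toy))            # missing cells overall
pct_missing  <- 100 * n_missing / (nrow(toy) * ncol(toy))

# Same arithmetic applied to the totals reported in the table
reported_pct_values <- 100 * 5322 / 136400
reported_pct_rows   <- 100 * 1732 / 5456

round(c(reported_pct_values, reported_pct_rows), 2)
```

Row-level and cell-level percentages answer different questions: about a third of reviewers have at least one gap, yet fewer than one in twenty-five individual ratings is missing.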
Certain categories had a substantial number of missing values. Overall, 15 out of the 25 variables, or 60% of the variables in the data set, had at least one missing value (no rating). The top 5 categories with missing values were Bakeries (19.17%), Gyms (18.64%), Beauty Spas (16.42%), Cafes (11.07%), and Swimming Pools (8.78%). Notably, these same 5 categories had the lowest average ratings and were among the most skewed distributions in the data set.
# Missing Value Summaries and Visualizations
missing_values_summary <- miss_var_summary(Europe_Travel_Reviews, order = TRUE) # Missing Values by Category
formattable(missing_values_summary) # Missing Value Output Table
| variable | n_miss | pct_miss |
|---|---|---|
| Bakeries | 1046 | 19.17155425 |
| Gyms | 1017 | 18.64002933 |
| BeautySpas | 896 | 16.42228739 |
| Cafes | 604 | 11.07038123 |
| Swimming.Pools | 479 | 8.77932551 |
| ViewPoints | 345 | 6.32331378 |
| Monuments | 302 | 5.53519062 |
| Gardens | 226 | 4.14222874 |
| Churches | 195 | 3.57404692 |
| DanceClubs | 112 | 2.05278592 |
| Resorts | 90 | 1.64956012 |
| Beaches | 4 | 0.07331378 |
| ArtGalleries | 4 | 0.07331378 |
| LocalServices | 1 | 0.01832845 |
| Burger_PizzaShops | 1 | 0.01832845 |
| UserID | 0 | 0.00000000 |
| Parks | 0 | 0.00000000 |
| Theatres | 0 | 0.00000000 |
| Museums | 0 | 0.00000000 |
| Malls | 0 | 0.00000000 |
| Zoo | 0 | 0.00000000 |
| Restaurants | 0 | 0.00000000 |
| Pubs_Bars | 0 | 0.00000000 |
| Hotels_OtherLodgings | 0 | 0.00000000 |
| JuiceBars | 0 | 0.00000000 |
missing_value_plot <- gg_miss_var(Europe_Travel_Reviews) # Missing Value Plot by Variable
missing_value_perc <- gg_miss_var(Europe_Travel_Reviews, show_pct = TRUE) + ylim(0,25) # Missing Value Percentage Plot by Variable
grid.arrange(missing_value_plot,missing_value_perc) # Arrange Plots in the same output
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use
## `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use
## `guide = "none"` instead.
### Relationship Among Missing Values

The graphic below shows the top 5 variables with the most missing values, ordered by the size of their nullity in the data set. We can see from the visualization that the combination of Bakeries, Gyms, and Beauty Spas has the most frequent intersection of missing values. The second most frequent intersection is the combination of Bakeries, Gyms, and Swimming Pools. From the chart, the missing values are most probably MNAR (Missing Not At Random): travelers who did not rate one of the top categories with missing values were also likely not to rate another top category with missing values. This suggests these are places that certain pockets of travelers did not visit while they were in Europe.
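The idea of intersecting missingness can be illustrated in base R without any plotting package. This is a sketch on made-up values, not the data behind the graphic: `co_missing` counts, for each pair of variables, how many reviewers are missing both, and `patterns` counts each distinct row-level missingness pattern.

```r
# Made-up ratings for three categories (NA = not rated)
toy <- data.frame(
  Bakeries   = c(NA, NA, 1.2, NA, 0.8),
  Gyms       = c(NA, NA, 0.9, 1.1, NA),
  BeautySpas = c(NA, 1.0, 0.7, NA, 0.6)
)

# 0/1 indicator matrix: 1 where the rating is missing
miss <- is.na(toy) * 1

# Entry [i, j] = number of rows where both variable i and variable j are missing;
# the diagonal holds each variable's own missing count
co_missing <- crossprod(miss)

# Frequency of each missingness pattern, e.g. "110" = first two missing
patterns <- table(apply(miss, 1, paste, collapse = ""))

co_missing
patterns
```

Large off-diagonal counts relative to the diagonal are what an intersection plot surfaces visually: the same reviewers tend to skip the same groups of categories.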
### Approach to Handling Missing Values & Analyzing the Data

Since traditional Principal Component Analysis (PCA) and clustering methodologies do not work with missing values, we tried a couple of different approaches to handling nullity and assessed which approach had a better impact on our results. For the remainder of our analysis we compare and contrast the results from each method, and ultimately select the model(s) that produce the best results.
Approach I: Multiple Imputation of Missing Values

In the first approach we used the MICE package to impute the missing values in the data set. Since the missing value percentages for Bakeries and Gyms approach 20%, imputing these values inherently introduces bias into our subsequent models. After imputing the missing values, we performed Principal Component Analysis on the imputed data set to find correlations among variables, and then used the actual variables to create clusters of Travel Reviewers.
Approach II: Leverage missMDA Package to Perform PCA with Missing Values

The missMDA package allows principal components analysis to be performed on data sets with missing values. It uses an iterative PCA algorithm for a pre-defined number of dimensions to predict the missing values. The PCA is then performed on the imputed data set. Since it is based on a principal component method, it accounts for the similarities between observations and the relationships between variables. missMDA imputes missing values in such a way that the imputed values have no weight on the PCA results. We then used the PCA to get an idea of how our clusters would form, and used the PCs to create the clusters.
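The iterative PCA idea can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not missMDA's implementation: the function name `iterative_pca_impute` is hypothetical, there is no regularization or variable scaling, and the stopping rule is a plain tolerance on the change in imputed cells.

```r
# Simplified iterative PCA imputation (illustrative only):
# 1. fill missing cells with column means
# 2. fit a rank-`ncp` PCA via SVD on the centered data
# 3. overwrite only the missing cells with the low-rank reconstruction
# 4. repeat until the imputed cells stop changing
iterative_pca_impute <- function(X, ncp = 1, max_iter = 200, tol = 1e-8) {
  X <- as.matrix(X)
  miss <- is.na(X)
  col_means <- colMeans(X, na.rm = TRUE)
  X[miss] <- col_means[col(X)[miss]]

  for (i in seq_len(max_iter)) {
    mu <- colMeans(X)
    Xc <- sweep(X, 2, mu)                      # center columns
    s  <- svd(Xc, nu = ncp, nv = ncp)
    fitted <- s$u %*% diag(s$d[seq_len(ncp)], ncp, ncp) %*% t(s$v)
    fitted <- sweep(fitted, 2, mu, "+")        # add the means back
    delta  <- sum((X[miss] - fitted[miss])^2)  # change in imputed cells
    X[miss] <- fitted[miss]
    if (delta < tol) break
  }
  X
}

# Made-up example: three strongly related columns, two cells removed
set.seed(1)
base <- 1:8
toy  <- cbind(a = base, b = 2 * base + rnorm(8, sd = 0.1), c = 3 * base)
toy_na <- toy
toy_na[8, "b"] <- NA
toy_na[2, "c"] <- NA

completed <- iterative_pca_impute(toy_na, ncp = 1)
completed
```

Observed cells are never modified; only the missing cells are updated, using the structure that the retained components capture across variables.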
# Create Separate Data Set for Approach I
travel <- as_tibble(Europe_Travel_Reviews)
travel = dplyr::rename(travel, Pool = `Swimming.Pools`)
# Here we will use the mice package that will help us do the imputation (we are going for 1 imputation, m=1)
impute_travel<-mice(travel,m=1,seed = 1111)
##
## iter imp variable
## 1 1 Churches Resorts Beaches LocalServices Burger_PizzaShops ArtGalleries DanceClubs Pool Gyms Bakeries BeautySpas Cafes ViewPoints Monuments Gardens
## 2 1 Churches Resorts Beaches LocalServices Burger_PizzaShops ArtGalleries DanceClubs Pool Gyms Bakeries BeautySpas Cafes ViewPoints Monuments Gardens
## 3 1 Churches Resorts Beaches LocalServices Burger_PizzaShops ArtGalleries DanceClubs Pool Gyms Bakeries BeautySpas Cafes ViewPoints Monuments Gardens
## 4 1 Churches Resorts Beaches LocalServices Burger_PizzaShops ArtGalleries DanceClubs Pool Gyms Bakeries BeautySpas Cafes ViewPoints Monuments Gardens
## 5 1 Churches Resorts Beaches LocalServices Burger_PizzaShops ArtGalleries DanceClubs Pool Gyms Bakeries BeautySpas Cafes ViewPoints Monuments Gardens
## Warning: Number of logged events: 1
travel<-complete(impute_travel,1)
Re-Check for Missing Values
n_miss(travel)
## [1] 0
We use Box Plots to explore the distribution of our Travel Review Categories.
## Warning in melt(travel): The melt generic in data.table has been passed a
## data.frame and will attempt to redirect to the relevant reshape2 method;
## please note that reshape2 is deprecated, and this redirection is now
## deprecated as well. To continue using melt methods from reshape2 while both
## libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(travel). In the next version, this warning will become an error.
## Using UserID as id variables
The box plots reiterate the right skew in our variables. Our lowest rated categories also show a number of outliers: because their average ratings are below 1.5, higher ratings toward 5 are flagged as outliers within these variables. We see this effect on the right side of the box plots for Swimming Pools, Gyms, Bakeries, Beauty Spas, and Cafes. View Points, Monuments, and Gardens show the same kind of distribution, but to a lesser degree, as their averages are slightly higher than those of the aforementioned variables.
In order to perform correlation analysis, PCA, and clustering, we need to remove UserID from the data set in both approaches. We will merge UserID back to the PCs and Clusters for further analysis and interpretation. In the first approach, we only need to remove UserID itself before performing correlation analysis because the missing values have been imputed. Since we are using missMDA in the second approach we have to remove both UserID and the missing values in order to perform correlation analysis.
# Approach I: Remove UserID
travelvar <- travel[-1]
# Approach II: Remove UserID to isolate numeric variables, and create a data set with missing values removed to perform correlation analysis.
Travel_Reviews_Vars <- subset(Europe_Travel_Reviews, select = -c(UserID))
Travel_Review_Vars_NA_Removed <- na.omit(Travel_Reviews_Vars)
## corrplot 0.89 loaded
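The effect of removing incomplete rows before correlation analysis can be checked with base R's `cor()`. The sketch below uses made-up ratings and a hypothetical `toy` data frame; it shows that `na.omit()` followed by `cor()` matches `cor(..., use = "complete.obs")`, and computes the pairwise alternative for comparison.

```r
# Made-up ratings with a few unrated cells
toy <- data.frame(
  UserID   = paste("User", 1:6),
  Parks    = c(3.6, 2.1, NA, 4.0, 1.5, 2.8),
  Theatres = c(5.0, 2.5, 3.1, 4.4, NA, 3.0),
  Zoo      = c(2.3, 2.6, 1.9, 2.2, 2.0, 2.7)
)

# Drop the identifier, then keep only fully rated reviewers
vars          <- subset(toy, select = -UserID)
vars_complete <- na.omit(vars)

cor_listwise <- cor(vars_complete)                        # rows with any NA removed
cor_builtin  <- cor(vars, use = "complete.obs")           # same result in one call
cor_pairwise <- cor(vars, use = "pairwise.complete.obs")  # uses more rows per pair

nrow(vars_complete)
round(cor_listwise, 2)
```

Listwise deletion discards every reviewer with at least one gap, so when missingness is concentrated in a few categories the correlations are estimated from a noticeably smaller and possibly different subpopulation.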
The first approach and visualizations do not show much correlation. The colors and shades tend to blend together without offering much information.
When we add the correlation coefficient values to the visualization in the second approach, some relationships begin to emerge. While six out of the twenty-four Travel Review categories exhibit moderate correlation, the patterns among them were not obvious: the correlated variables tended to increase with one another, but the reasons why were not clear.
Theaters and Parks had the highest positive correlation among variables at 0.62. The remaining correlations of any significance were driven by Restaurants and Zoo. The highest correlation among this group was between Restaurants and Pubs/Bars at 0.57, followed by Restaurants and Zoo at 0.56, Pubs/Bars and Zoo at 0.54, Restaurants and Malls at 0.54, and Malls and Zoo at 0.53. There are a couple of ways to think about these correlations:
We have complementary categories of attractions, either by function or by proximity of activity. That is, travelers who enjoy theaters also enjoy parks, or good-quality theaters and parks are frequented one after the other because they are in the same destination. The same anecdotal reasoning could be applied to Restaurants and Pubs/Bars, or to any of the other correlated variables.
Correlations among variables are coincidental and not behavioral.
To check our thought process we further isolate the correlated variables and use the GGally package to explore the relationships among the correlated variables. In doing so, we see that the variables are likely not correlated by chance and have some sort of relationship. We will be able to further unpack them through PCA and Clustering.
We conducted PCA using two different approaches. The first approach used the imputed data set travel produced with MICE. The second approach used the missMDA package to perform PCA on the Travel Reviews data set with missing values. Each approach produced similar results, with the missMDA approach accounting for slightly more variance in the top 4 PCs.
# Create travel2 data set for PCA model
travel2 = copy(travel)
# Remove UserID from data set
travel2 = subset(travel, select = -UserID)
UserID = travel$UserID
# Run PCA model using prcomp
pca2 = prcomp(travelvar, scale. = TRUE)
#car::vif(travel2)
pcs = as.data.frame(pca2$x)
combdata = cbind(UserID, pcs)
head(combdata)
## UserID PC1 PC2 PC3 PC4 PC5 PC6
## 1 User 1 0.25756185 -1.789616 -0.9580841 -0.192729145 -0.1637141 -0.1689986
## 2 User 2 0.12880935 -2.222705 -0.7237128 -0.002998196 0.1049917 0.9272178
## 3 User 3 -2.46073610 -2.732965 0.6205651 -1.267083471 0.8886820 -1.6244595
## 4 User 4 0.74430117 -1.736327 -1.2054701 0.072787200 -0.2285305 0.3670751
## 5 User 5 -0.67788008 -2.346384 -0.7699898 0.110948757 0.3575986 -1.6762360
## 6 User 6 -0.01122303 -2.237991 -1.1776180 0.635360853 0.2439735 -1.1769982
## PC7 PC8 PC9 PC10 PC11 PC12
## 1 -0.9849344 -0.17091907 -0.47475534 0.2658401 0.94242004 0.4890931
## 2 -0.1548428 0.05516802 -0.15935148 0.7881863 -0.21399876 -0.2357170
## 3 0.3889483 -2.37687902 0.99030760 0.4474360 -0.12090983 1.2151787
## 4 -1.0535964 0.30211174 -0.58221007 0.1977566 0.70843944 0.3405807
## 5 1.1326676 -1.19866729 0.24265698 -0.2381035 0.24202397 0.5727746
## 6 1.2738029 -0.64525316 -0.01967841 -0.3594399 0.08633248 0.1219503
## PC13 PC14 PC15 PC16 PC17 PC18
## 1 -0.10087228 0.09758711 0.49732621 -0.39825880 1.0207123 -0.30912746
## 2 0.05067522 -0.47260099 -0.78604631 -0.19065273 -0.6954410 0.38309404
## 3 -0.11517069 0.39931074 -0.28802813 0.45756285 -0.9684179 0.96799956
## 4 -0.48650812 0.35934738 0.36771998 -0.39478850 0.6837842 -0.04165999
## 5 0.29251999 -0.35219394 -0.03597081 -0.03936501 0.3042821 0.29726619
## 6 0.23998190 -0.63198858 -0.06401170 -0.26079033 0.3497699 0.24802461
## PC19 PC20 PC21 PC22 PC23 PC24
## 1 0.8385173 0.7400919 0.4230716 -0.1393456 0.4011609 0.4464884
## 2 1.1599972 0.8162189 -0.2166374 -1.3886077 0.2832106 0.5927382
## 3 0.6849134 0.2590930 -0.4005175 -0.5439907 0.7474203 0.5954807
## 4 0.7868757 0.7450735 0.3684155 -0.4880521 0.3334450 0.4370365
## 5 0.4788189 0.7045643 0.4681104 -0.1250196 0.5823903 0.3068787
## 6 0.4970208 0.8275216 0.6475252 -0.3225851 0.4583333 0.2468726
nb <- estim_ncpPCA(Travel_Reviews_Vars, scale = TRUE, method.cv = "Kfold", nbsim = 100)
nb$ncp
## [1] 5
res.comp <- imputePCA(Travel_Reviews_Vars, ncp = nb$ncp)
imp <- data.frame(res.comp$completeObs)
head(imp)
## Churches Resorts Beaches Parks Theatres Museums Malls Zoo Restaurants
## 1 1.461173 2.845288 3.63 3.65 5 2.92 5 2.35 2.33
## 2 1.441032 2.828942 3.63 3.65 5 2.92 5 2.64 2.33
## 3 1.441243 2.829215 3.63 3.63 5 2.92 5 2.64 2.33
## 4 1.445697 0.500000 3.63 3.63 5 2.92 5 2.35 2.33
## 5 1.441243 2.829215 3.63 3.63 5 2.92 5 2.64 2.33
## 6 1.441748 2.828332 3.63 3.63 5 2.92 5 2.63 2.33
## Pubs_Bars LocalServices Burger_PizzaShops Hotels_OtherLodgings JuiceBars
## 1 2.64 1.70 1.69 1.70 1.72
## 2 2.65 1.70 1.69 1.70 1.72
## 3 2.64 1.70 1.69 1.70 1.72
## 4 2.64 1.73 1.69 1.70 1.72
## 5 2.64 1.70 1.69 1.70 1.72
## 6 2.65 1.71 1.69 1.69 1.72
## ArtGalleries DanceClubs Swimming.Pools Gyms Bakeries BeautySpas
## 1 1.74 0.59 0.5 0.5135240 0.5 0.9048210
## 2 1.74 0.59 0.5 0.5010227 0.5 0.8778343
## 3 1.74 0.59 0.5 0.5020172 0.5 0.8801363
## 4 1.74 0.59 0.5 0.5915906 0.5 0.7889952
## 5 1.74 0.59 0.5 0.5020172 0.5 0.8801363
## 6 1.74 0.59 0.5 0.5020174 0.5 0.8798243
## Cafes ViewPoints Monuments Gardens
## 1 0.7957945 1.801658 1.521236 1.557100
## 2 0.7802937 1.775173 1.498736 1.531745
## 3 0.7815898 1.774185 1.497922 1.531352
## 4 0.7561164 1.851798 1.587643 1.595324
## 5 0.7815898 1.774185 1.497922 1.531352
## 6 0.7821357 1.776209 1.499535 1.532373
PCA_Travel_Reviews <- prcomp(imp, scale. = TRUE)
var = get_pca_var(PCA_Travel_Reviews)
var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
library("factoextra")
eig.val = get_eigenvalue(pca2)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.4882415 18.701006 18.70101
## Dim.2 3.3974242 14.155934 32.85694
## Dim.3 1.8135401 7.556417 40.41336
## Dim.4 1.6204095 6.751706 47.16506
## Dim.5 1.4750728 6.146137 53.31120
## Dim.6 1.1749669 4.895695 58.20690
## Dim.7 1.1065657 4.610690 62.81759
## Dim.8 0.9675169 4.031320 66.84891
## Dim.9 0.7677442 3.198934 70.04784
## Dim.10 0.7163200 2.984667 73.03251
## Dim.11 0.6436108 2.681712 75.71422
## Dim.12 0.6249997 2.604165 78.31838
## Dim.13 0.6023845 2.509935 80.82832
## Dim.14 0.5355661 2.231526 83.05985
## Dim.15 0.5282108 2.200878 85.26072
## Dim.16 0.4968939 2.070391 87.33112
## Dim.17 0.4572299 1.905124 89.23624
## Dim.18 0.4399519 1.833133 91.06937
## Dim.19 0.4050301 1.687626 92.75700
## Dim.20 0.3968623 1.653593 94.41059
## Dim.21 0.3710364 1.545985 95.95658
## Dim.22 0.3504492 1.460205 97.41678
## Dim.23 0.3313589 1.380662 98.79744
## Dim.24 0.2886137 1.202557 100.00000
The eigenvalue of Principal Component 1 (Dim.1) represents about four and a half variables' worth of variation (4.49) and explains 18.70% of the variance in the original 24 variables. The eigenvalue of Principal Component 2 (Dim.2) represents a little over three variables' worth of variation (3.40); together, the first two components explain 32.86% of the variance within our Travel Category ratings. Principal Components 3 through 7 each represent between a little under 2 and a little over 1 variable's worth of variation within the data set. Together with Principal Components 1 and 2, they explain 62.82% of the variance within the Travel Review rating categories.
While the rest of the eigenvalues do not represent a full variable's worth of variation, their makeup could be useful in explaining more than 62.82% of the variance within our data set. For instance, including up to 10 Principal Components accounts for 73.03% of the variation within the Travel Review rating categories, while including up to 13 would explain 80.83% of the variation. We wouldn't want to go much beyond that range, or we would be defeating the purpose of Principal Components Analysis, which is variable reduction and feature optimization.
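The relationship between eigenvalues and explained variance can be reproduced directly from a `prcomp` fit. The sketch below uses R's built-in `USArrests` data purely as a stand-in (an assumption for illustration, not the Travel Reviews data). With scaled variables the eigenvalues sum to the number of variables, which is why an eigenvalue can be read as that many variables' worth of variation.

```r
# PCA on standardized variables (built-in stand-in data set)
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca$sdev^2

# With scaling, the eigenvalues sum to the number of variables
total_variance <- sum(eigenvalues)

# Percent of variance per component and the running total
variance_percent   <- 100 * eigenvalues / total_variance
cumulative_percent <- cumsum(variance_percent)

# Components worth more than one original variable (eigenvalue > 1)
n_above_one <- sum(eigenvalues > 1)

data.frame(eigenvalue = eigenvalues,
           variance.percent = variance_percent,
           cumulative.variance.percent = cumulative_percent)
```

Applied to the table above, the same arithmetic gives 4.4882415 / 24 ≈ 18.70% for Dim.1, matching the reported variance.percent.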
#Screeplot
fviz_eig(pca2, addlabels = TRUE,
barfill="lightyellow", barcolor ="orange",
linecolor ="orange", title = "Eigenvalues")
The Scree Plot illustrates how Principal Components work: the first principal component explains as much variation as possible. The second principal component is orthogonal to the first and captures as much of the remaining variation as possible. Each subsequent principal component captures the maximum variation not already captured by the preceding components. The Scree Plot stops at ten dimensions, which is the default number of components shown by fviz_eig (ncp = 10); that is sufficient here because the idea is to capture the maximum amount of variance in the fewest principal components.
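The orthogonality and ordering described above can be verified numerically. This is a small sketch using base R's `prcomp` on the built-in `mtcars` data as a stand-in; it is not the model fitted in this report.

```r
# Illustration only: mtcars stands in for the travel review ratings.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# 1. The loading vectors are orthonormal: t(V) %*% V is the identity matrix.
orth <- crossprod(pca$rotation)
max_orth_error <- max(abs(orth - diag(ncol(mtcars))))

# 2. The component scores are mutually uncorrelated.
score_cor <- cor(pca$x)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

# 3. Component variances (eigenvalues) never increase from one component to the next.
vars <- pca$sdev^2
is_decreasing <- all(diff(vars) <= 1e-12)

print(c(max_orth_error = max_orth_error, max_offdiag = max_offdiag))
print(is_decreasing)
```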
eigen.values = get_eigenvalue(PCA_Travel_Reviews)
formattable(eigen.values)
| | eigenvalue | variance.percent | cumulative.variance.percent |
|---|---|---|---|
| Dim.1 | 4.5630460 | 19.012692 | 19.01269 |
| Dim.2 | 3.4511581 | 14.379825 | 33.39252 |
| Dim.3 | 1.8309519 | 7.628966 | 41.02148 |
| Dim.4 | 1.6646764 | 6.936152 | 47.95764 |
| Dim.5 | 1.4814778 | 6.172824 | 54.13046 |
| Dim.6 | 1.1484313 | 4.785131 | 58.91559 |
| Dim.7 | 1.0972310 | 4.571796 | 63.48739 |
| Dim.8 | 0.9081492 | 3.783955 | 67.27134 |
| Dim.9 | 0.7549385 | 3.145577 | 70.41692 |
| Dim.10 | 0.6987725 | 2.911552 | 73.32847 |
| Dim.11 | 0.6375082 | 2.656284 | 75.98475 |
| Dim.12 | 0.6191782 | 2.579909 | 78.56466 |
| Dim.13 | 0.5945424 | 2.477260 | 81.04192 |
| Dim.14 | 0.5303580 | 2.209825 | 83.25175 |
| Dim.15 | 0.5054309 | 2.105962 | 85.35771 |
| Dim.16 | 0.4851320 | 2.021383 | 87.37909 |
| Dim.17 | 0.4523722 | 1.884884 | 89.26398 |
| Dim.18 | 0.4399273 | 1.833030 | 91.09701 |
| Dim.19 | 0.4061759 | 1.692399 | 92.78941 |
| Dim.20 | 0.3916423 | 1.631843 | 94.42125 |
| Dim.21 | 0.3688550 | 1.536896 | 95.95815 |
| Dim.22 | 0.3506153 | 1.460897 | 97.41904 |
| Dim.23 | 0.3311240 | 1.379683 | 98.79873 |
| Dim.24 | 0.2883055 | 1.201273 | 100.00000 |
The results from Approach I and Approach II are similar, with the second approach’s eigenvalues capturing slightly more variation in the first five principal components. Using this approach we would still want between 10 and 13 principal components, which capture 73.33% and 81.04% of the variation respectively.
# Screeplot of Eigenvalues
fviz_eig(PCA_Travel_Reviews,
barfill="darkblue",
barcolor ="grey55",
linecolor ="grey55",
title = "Eigenvalue Scree Plot",
addlabels = TRUE)
The Eigenvalue Scree Plot shows the first principal component capturing 19.01% of the variance within the 24 Travel Review categories. The second principal component captures 14.38% of the variance. Collectively, the first ten principal components in the second approach capture 73.33% of the variation within the data set. Since Approach II captures slightly more variance than Approach I, we will use Approach II’s principal components to build our clusters.
### Approach II Correlation between Variables and Principal Components
Approach I and II show similar correlations between variables and Principal Components, with the second approach explaining slightly more variation within the first two principal components. The correlation plots below give a grid representation of the correlation between variables and principal components for Approach II.
The first principal component (Dim.1) had moderately high negative correlation with Restaurants (-0.69), Pubs/Bars (-0.63), Malls (-0.63), and Zoo (-0.63). During our exploratory analysis we observed moderate correlation among these variables. Their shared loading on the first principal component provides further evidence of their relationship: Travel Reviewers tend to frequent and rate these attractions together. Additionally, we observed moderate positive correlation between the first principal component and Churches (0.59), View Points (0.54), Gardens (0.54), Cafes (0.54), and Monuments (0.53).
The second principal component (Dim.2) had high negative correlation with Theaters (-0.72) and Parks (-0.68). These two variables exhibited the highest pairwise correlation in our exploratory analysis. We also observed a moderate negative correlation between principal component 2 and Museums at -0.58, and a moderate positive correlation with Juice Bars at 0.59.
We observe that our correlated variables, except Museums, have their highest quality of representation in principal components 1 and 2.
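The variable-to-component correlations discussed above follow directly from the PCA output: for standardized variables, the correlation of a variable with a component equals its loading multiplied by the square root of that component's eigenvalue. A minimal sketch in base R, again with `mtcars` as a stand-in and with `coord` and `cos2` as illustrative names:

```r
# Illustration only: mtcars stands in for the travel review ratings.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Correlation of each variable with each component:
# loading * sqrt(eigenvalue), i.e. rotation scaled column-wise by sdev.
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# The same quantity computed directly from the data and the scores.
direct <- cor(mtcars, pca$x)
max_diff <- max(abs(coord - direct))

# Quality of representation (cos2) is the squared correlation;
# across all components it sums to 1 for every variable.
cos2 <- coord^2
print(round(coord[, 1:2], 2))
print(round(rowSums(cos2), 6))
```

A variable with a high `cos2` on the first two components is well represented in the two-dimensional factor map; a variable whose `cos2` is spread over later components is not.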
#### Quality of Representation Bar Plot (Approach II)
The circle shows a slightly different factor map than in Approach I. Variables such as Juice Bars and Gyms have different coordinates in the second approach than in the first.
Technically, we have been using the values from Approach II when analyzing PCA variables with corrplot. This does not have a large impact, because the factoextra package works from each approach’s own model output. Additionally, the two models produced highly similar results, with the second approach accounting for slightly more variability in the first five components. What we observe in Approach II will therefore not be far from what we observe in Approach I. We label Approach I versus Approach II in the corrplot usage only to keep the results separate and avoid confusion.
| | Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | Dim.6 | Dim.7 | Dim.8 | Dim.9 | Dim.10 | Dim.11 | Dim.12 | Dim.13 | Dim.14 | Dim.15 | Dim.16 | Dim.17 | Dim.18 | Dim.19 | Dim.20 | Dim.21 | Dim.22 | Dim.23 | Dim.24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Churches | 7.611314e+00 | 0.14567033 | 1.63730292 | 0.43731131 | 1.35375215 | 10.337239856 | 3.792917e+00 | 3.51038718 | 0.31345111 | 0.02140916 | 18.481802565 | 0.014053591 | 2.452733e+01 | 9.91741347 | 4.556023e-01 | 5.6708866 | 4.229879553 | 3.06561241 | 1.98212103 | 0.001322932 | 7.127238e-01 | 1.636279e+00 | 3.812057e-02 | 1.060967e-01 |
| Resorts | 5.335035e-01 | 2.25428892 | 1.14491823 | 5.36012276 | 1.55628155 | 2.801017037 | 4.663501e+01 | 0.89887497 | 1.13452200 | 3.23743521 | 11.141304119 | 1.574721897 | 2.014478e-01 | 1.66661912 | 6.783519e+00 | 0.4713336 | 4.736217129 | 4.50480299 | 2.28114869 | 0.062886767 | 6.631036e-02 | 1.260491e-01 | 4.282381e-01 | 3.994295e-01 |
| Beaches | 1.082077e+00 | 7.13954924 | 5.01144795 | 0.27881842 | 0.07334987 | 2.584161054 | 1.595440e+01 | 0.36377013 | 3.04299840 | 23.91447654 | 3.564304960 | 1.771819769 | 1.517475e+01 | 0.20089554 | 5.110233e+00 | 0.8357906 | 0.525223443 | 4.14019615 | 0.00174181 | 1.860004639 | 3.529076e+00 | 1.775806e+00 | 1.593961e+00 | 4.711499e-01 |
| Parks | 4.535361e-01 | 13.53090581 | 3.35393320 | 2.84335560 | 1.22893820 | 7.671281606 | 2.595030e-01 | 0.03766050 | 4.95056316 | 0.01702433 | 1.565371668 | 0.125386131 | 2.351906e+00 | 5.45312421 | 1.103812e+00 | 1.8626944 | 3.338892780 | 6.52778302 | 7.54265556 | 0.073791095 | 4.088833e-01 | 1.369858e+01 | 4.127290e+00 | 1.747312e+01 |
| Theatres | 8.220319e-04 | 15.08977003 | 6.21788449 | 3.24203278 | 2.27116289 | 0.101300584 | 3.258710e-01 | 0.48796498 | 5.38277832 | 1.31731847 | 1.458985868 | 0.350235040 | 3.821799e-01 | 0.75558596 | 1.584013e+00 | 0.5269366 | 0.152763911 | 0.68068414 | 0.08809335 | 7.452975576 | 5.159008e+00 | 2.805863e+00 | 2.271183e+00 | 4.189459e+01 |
| Museums | 1.831868e+00 | 9.74638234 | 4.55111478 | 1.28697385 | 3.05564117 | 2.998405200 | 8.785329e+00 | 0.99890007 | 2.82946446 | 0.31205824 | 0.027071839 | 0.658682839 | 1.402832e+00 | 2.58315503 | 1.106653e+00 | 0.5836594 | 14.205757299 | 6.24105582 | 18.51472771 | 7.663929431 | 2.281894e+00 | 3.678697e+00 | 4.514502e-01 | 4.204297e+00 |
| Malls | 8.765188e+00 | 0.41137378 | 0.44270524 | 3.60277934 | 2.08350909 | 4.708163197 | 2.999171e+00 | 6.21216909 | 0.16890205 | 3.56073254 | 0.162452977 | 10.623312310 | 4.440707e-01 | 14.01065909 | 6.145459e-01 | 0.9374306 | 0.468912089 | 18.81564840 | 12.94418696 | 2.325899699 | 3.952304e-04 | 3.588439e+00 | 1.602656e+00 | 5.066964e-01 |
| Zoo | 8.740294e+00 | 0.26497957 | 7.83383699 | 0.12844918 | 2.09139280 | 7.795264787 | 1.363500e+00 | 0.52195604 | 7.49322035 | 0.67565607 | 0.004229098 | 0.852053147 | 1.940100e-02 | 4.31515197 | 6.026255e-05 | 1.8014345 | 1.880070100 | 12.07057107 | 2.95727808 | 7.042286688 | 1.381472e-03 | 3.305768e+00 | 2.795724e+01 | 8.845260e-01 |
| Restaurants | 1.037651e+01 | 0.03625603 | 7.65678695 | 3.59454088 | 0.10838919 | 0.078950570 | 3.502437e-01 | 0.76354309 | 3.67026536 | 1.60064416 | 0.761498811 | 0.154756857 | 7.384913e-01 | 0.29200390 | 1.417786e-01 | 1.0085027 | 9.776188366 | 6.05391501 | 1.38909907 | 25.424765229 | 8.174428e-01 | 3.601206e+00 | 3.943746e+00 | 1.766048e+01 |
| Pubs_Bars | 8.768412e+00 | 0.05507274 | 12.19992021 | 0.76740371 | 0.72094743 | 1.447737158 | 5.972131e+00 | 0.09643482 | 2.49638446 | 0.07792392 | 0.470500279 | 1.263244505 | 1.568149e-01 | 1.72463849 | 4.729463e+00 | 1.9710362 | 0.438654647 | 4.54657038 | 0.85585923 | 0.028186698 | 1.613315e+00 | 1.832863e-02 | 4.634124e+01 | 3.239777e+00 |
| LocalServices | 4.307522e+00 | 1.00946844 | 4.72530438 | 18.13487944 | 0.76894019 | 1.275153100 | 4.761496e+00 | 2.37890388 | 1.18484305 | 3.60740091 | 0.956095374 | 1.150914495 | 1.260442e-02 | 3.88499449 | 2.844824e+00 | 0.6302276 | 12.930254306 | 5.09694468 | 9.22797450 | 3.133254392 | 9.498463e+00 | 6.427927e+00 | 1.569876e+00 | 4.817339e-01 |
| Burger_PizzaShops | 1.745529e+00 | 5.23986847 | 4.88030464 | 15.50422157 | 0.73684936 | 0.370886301 | 1.110685e+00 | 0.27308222 | 0.00551141 | 12.83115309 | 10.262233396 | 3.909091894 | 5.549061e-04 | 3.45250175 | 2.909534e+00 | 7.1426483 | 1.116343658 | 4.00265345 | 6.03888807 | 7.240634773 | 7.012156e+00 | 2.090260e-01 | 3.574069e+00 | 4.315743e-01 |
| Hotels_OtherLodgings | 1.200029e+00 | 5.73560712 | 6.75602793 | 14.32821413 | 2.27243567 | 0.020814405 | 1.286186e-02 | 3.61624429 | 11.29038046 | 3.32374130 | 0.530078949 | 3.002665584 | 1.236143e-01 | 0.34401911 | 2.028760e-01 | 0.5664465 | 10.289094774 | 2.79594495 | 13.53657444 | 0.205780491 | 5.913764e-01 | 1.455413e+01 | 7.154714e-02 | 4.629498e+00 |
| JuiceBars | 1.031615e+00 | 9.98246335 | 9.80458774 | 0.06405407 | 2.58095505 | 0.003002792 | 8.956599e-04 | 5.95202786 | 4.53112253 | 3.59557683 | 1.471433898 | 8.245142141 | 5.483686e+00 | 0.04201261 | 2.526014e+00 | 2.4389872 | 4.103374976 | 1.34506448 | 10.90789548 | 0.153302601 | 2.760451e+00 | 1.920775e+01 | 1.240243e+00 | 2.528340e+00 |
| ArtGalleries | 4.665127e-01 | 7.00746612 | 1.64813418 | 4.70001772 | 1.28704245 | 6.883643872 | 1.706415e-01 | 26.81340469 | 3.93989998 | 0.24398860 | 0.460661053 | 5.132685205 | 1.515199e+01 | 14.30352478 | 2.117778e+00 | 2.1436163 | 3.527665163 | 0.33701691 | 1.59560763 | 1.559957921 | 3.141084e-02 | 2.143576e-01 | 2.037165e-01 | 5.925711e-02 |
| DanceClubs | 1.492791e+00 | 1.35887723 | 0.05249914 | 0.08660540 | 25.17095012 | 11.729254824 | 2.323109e+00 | 5.02371911 | 3.05444759 | 12.42388946 | 8.912735728 | 0.792573305 | 6.751641e-03 | 3.91970534 | 1.702007e+00 | 18.5069377 | 0.170189672 | 0.18477301 | 0.02203339 | 0.990281344 | 1.671776e+00 | 1.430141e-01 | 1.867723e-01 | 7.430644e-02 |
| Swimming.Pools | 3.959738e+00 | 4.56399761 | 0.18564464 | 0.01020114 | 25.58320737 | 0.418083261 | 2.305152e-01 | 0.06509340 | 0.03568510 | 1.51066594 | 1.057692296 | 0.020017431 | 5.873074e-01 | 0.93697183 | 5.307453e+00 | 10.7747417 | 0.005278408 | 1.08601020 | 0.06338711 | 10.605562044 | 2.611653e+01 | 5.254011e+00 | 6.630146e-02 | 1.555903e+00 |
| Gyms | 4.405715e+00 | 6.66402171 | 0.41304413 | 0.07589762 | 16.94218859 | 2.263482525 | 1.037802e-01 | 0.21499736 | 0.18368427 | 1.92605328 | 1.002183052 | 0.001218303 | 3.603044e-01 | 0.17712153 | 2.805645e+00 | 7.3692290 | 0.886104480 | 0.19340652 | 1.82134347 | 16.611186046 | 2.577629e+01 | 8.288280e+00 | 6.135583e-05 | 1.514760e+00 |
| Bakeries | 4.647042e+00 | 5.49770056 | 1.38288872 | 1.76994138 | 0.67014610 | 3.282383867 | 1.860132e-01 | 11.12563615 | 0.83806820 | 9.62252821 | 5.704700666 | 1.980986604 | 8.883940e+00 | 0.29374164 | 2.982014e+01 | 9.6419624 | 0.552060541 | 1.06534314 | 1.13987736 | 1.008904027 | 1.064868e-02 | 4.088304e-01 | 1.785583e-01 | 2.879631e-01 |
| BeautySpas | 3.427684e+00 | 1.43056900 | 0.25449558 | 5.80715425 | 7.33890378 | 2.308690185 | 4.544072e-02 | 6.96662532 | 41.99308028 | 0.05842999 | 1.483271890 | 20.286179715 | 1.521683e+00 | 0.08018815 | 4.464916e-02 | 3.6857199 | 1.595462727 | 0.13372273 | 0.74543640 | 0.278772683 | 4.998847e-01 | 3.155569e-03 | 1.015127e-02 | 6.492831e-04 |
| Cafes | 6.319347e+00 | 0.53550046 | 6.46526904 | 2.86884497 | 1.19285845 | 2.523200608 | 2.517109e+00 | 8.15523958 | 0.10268750 | 1.05399201 | 7.575040907 | 15.237187750 | 8.778458e+00 | 10.09667791 | 1.335253e+01 | 5.5020448 | 0.273633167 | 0.02410304 | 0.72424835 | 1.785687747 | 1.434140e+00 | 8.704372e-04 | 3.259760e+00 | 2.215722e-01 |
| ViewPoints | 6.456981e+00 | 1.35188388 | 6.53763252 | 4.26410685 | 0.37481820 | 9.284957180 | 1.508031e+00 | 0.18259338 | 1.03392549 | 7.78028677 | 0.169995336 | 3.913926932 | 4.631402e+00 | 0.35180742 | 8.628486e+00 | 0.5465166 | 18.235177912 | 1.19546130 | 4.96727424 | 1.387705679 | 8.806225e+00 | 7.138442e+00 | 4.975098e-05 | 1.252314e+00 |
| Monuments | 6.057766e+00 | 0.69424741 | 5.08193294 | 6.24423617 | 0.18643419 | 1.003915298 | 5.738245e-01 | 9.05599046 | 0.24850078 | 4.56753964 | 18.519734912 | 18.907123999 | 1.754391e-01 | 0.54197473 | 1.036217e+00 | 15.1845220 | 2.084516109 | 4.26287226 | 0.59964575 | 0.130705478 | 3.394587e-01 | 3.914141e+00 | 5.763999e-01 | 1.286174e-02 |
| Gardens | 6.318203e+00 | 0.25407983 | 1.76238346 | 4.59983745 | 0.35090612 | 18.109010731 | 1.752410e-02 | 6.28478143 | 0.07561370 | 2.72007533 | 4.256620361 | 0.032020558 | 8.883040e+00 | 20.65551194 | 5.072174e+00 | 0.1966947 | 4.478284792 | 11.62984392 | 0.05290232 | 2.972216020 | 8.607546e-01 | 1.046687e-03 | 3.073657e-01 | 1.091086e-01 |
Variable contributions to principal components 1 through 3 tended to be spread fairly evenly across the variables that had moderate to high correlations with the component in either direction. For instance, Restaurants (10.38%), Pubs/Bars (8.77%), Malls (8.77%), Zoo (8.74%), and Churches (7.61%) were the highest contributing variables to principal component one, whose eigenvalue of 4.56 accounts for 19% of the variation within the data set. The variables’ order of contribution matched their degree of correlation. We observe the same sort of distribution in principal component two, where the highest contributing variables were Theaters (15.09%), Parks (13.53%), Juice Bars (9.98%), and Museums (9.75%). Principal component two has an eigenvalue of 3.45 and accounts for 14% of the variation within the data set.
At principal components four and five we start to see the contribution concentrated in fewer variables. For example, Local Services (18.13%), Burger/Pizza Shops (15.50%), and Hotels/Other Lodgings (14.33%) had the highest contribution to principal component four. Similarly, Swimming Pools (25.58%), Dance Clubs (25.17%), and Gyms (16.94%) had the highest contribution to principal component five.
We analyze the variable contributions of up to the first 10 to 13 principal components for use in clustering our data set.
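The contribution percentages in the table can be related back to the correlations and eigenvalues: a variable's contribution to a component is its squared correlation with that component divided by the component's eigenvalue, times 100. The sketch below shows this in base R on the stand-in `mtcars` data; `contrib` and `uniform_share` are illustrative names.

```r
# Illustration only: mtcars stands in for the travel review ratings.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
eig <- pca$sdev^2

# Squared variable-component correlations (cos2).
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)
cos2  <- coord^2

# Contribution (%) = 100 * cos2 / eigenvalue, computed column by column.
contrib <- 100 * sweep(cos2, 2, eig, `/`)

# Each component's contributions sum to 100%.
print(round(colSums(contrib), 6))

# If every variable contributed equally, each would hold this share.
uniform_share <- 100 / ncol(mtcars)
print(round(contrib[contrib[, 1] > uniform_share, 1], 2))
```

As a check against the figures above, Restaurants has a correlation of about -0.69 with Dim.1, whose eigenvalue is 4.563, so its contribution is roughly 0.69² / 4.563 × 100 ≈ 10.4%, in line with the reported 10.38% once rounding is taken into account. With 24 variables, an evenly spread contribution would be 100 / 24 ≈ 4.17% per variable.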
set.seed(123)
km = kmeans(var$coord, centers = 3, nstart = 25)
grp = as.factor(km$cluster)
## Principal Component Analysis Results for individuals
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the individuals"
## 2 "$cos2" "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
“The Hopkins statistic (Lawson and Jurs 1990) is used to assess the clustering tendency of a data set by measuring the probability that a given data set is generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.” The Null Hypothesis assumes the data set is uniformly distributed containing no meaningful clusters. The Alternative Hypothesis states the opposite, and assumes the data is not uniformly distributed containing meaningful clusters.
We conduct the Hopkins Statistic Test iteratively, using 0.5 as a threshold. If H < 0.5, we fail to reject the Null Hypothesis and treat the data as not clusterable. If H > 0.5, we reject the Null Hypothesis and conclude the data is clusterable. We will cluster the data set whose Hopkins Statistic is closest to 1.
For more information visit Assessing Clustering Tendency.
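To make the statistic concrete, the sketch below implements one common formulation of the Hopkins statistic in base R, in which values near 1 indicate clustered data and values near 0.5 indicate spatial randomness. It is an illustrative assumption, not the implementation used to produce the table in this report: the function name `hopkins_sketch`, the sample size `m`, and the synthetic data are all hypothetical.

```r
# Minimal Hopkins-statistic sketch (illustrative; not a package implementation).
hopkins_sketch <- function(x, m = ceiling(nrow(x) / 10)) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)

  # Distance from point p to its nearest neighbour among the rows of `data`.
  nn_dist <- function(p, data) min(sqrt(colSums((t(data) - p)^2)))

  # w: nearest-neighbour distances for m sampled real observations,
  # each measured against the remaining observations.
  idx <- sample(n, m)
  w <- vapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]), numeric(1))

  # u: nearest-neighbour distances from m uniform random points,
  # drawn inside the bounding box of the data, to the real observations.
  mins <- apply(x, 2, min)
  maxs <- apply(x, 2, max)
  u_pts <- matrix(runif(m * d, rep(mins, each = m), rep(maxs, each = m)), nrow = m)
  u <- vapply(seq_len(m), function(i) nn_dist(u_pts[i, ], x), numeric(1))

  sum(u) / (sum(u) + sum(w))
}

set.seed(123)
# Two tight, well-separated blobs versus uniformly scattered points.
clustered <- rbind(matrix(rnorm(200, 0, 0.3), ncol = 2),
                   matrix(rnorm(200, 10, 0.3), ncol = 2))
uniform   <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_sketch(clustered)
h_uniform   <- hopkins_sketch(uniform)
print(c(clustered = h_clustered, uniform = h_uniform))
```

Because real observations in clustered data sit much closer to their neighbours than random points do, the `w` distances shrink relative to `u` and the ratio moves toward 1. Note that some packages report the complementary convention, where values near 0 indicate clusterable data, so the direction of the threshold should be confirmed for whichever function is used.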
| Data.Set | Hopkins.Statistic |
|---|---|
| Mice Imputed Variables | 0.8138782 |
| Mice Imputed Principal Components | 0.8712349 |
| PCA Imputed Variables | 0.8172558 |
| PCA Principal Components | 0.8763902 |
All of the data sets are highly clusterable based on their Hopkins Statistics, with the missMDA (PCA) imputed data sets scoring slightly higher. The data sets using the average travel review ratings themselves, Mice Imputed Variables and PCA Imputed Variables, received Hopkins Statistics of 0.81 and 0.82 respectively. Both Principal Component versions performed better: Mice Imputed Principal Components returned 0.87 and PCA Principal Components returned 0.88. We will use the missMDA PCA Principal Components data set to cluster our travel reviews.
##
## Call:
## factanal(x = travelvar, factors = 3, rotation = "varimax")
##
## Uniquenesses:
## Churches Resorts Beaches
## 0.70 0.94 0.77
## Parks Theatres Museums
## 0.47 0.30 0.60
## Malls Zoo Restaurants
## 0.64 0.52 0.38
## Pubs_Bars LocalServices Burger_PizzaShops
## 0.53 0.82 0.66
## Hotels_OtherLodgings JuiceBars ArtGalleries
## 0.61 0.51 0.79
## DanceClubs Pool Gyms
## 0.94 0.80 0.74
## Bakeries BeautySpas Cafes
## 0.73 0.84 0.68
## ViewPoints Monuments Gardens
## 0.68 0.73 0.76
##
## Loadings:
## Factor1 Factor2 Factor3
## Malls -0.58
## Zoo -0.68
## Restaurants -0.73 -0.31
## Pubs_Bars -0.64
## Parks 0.70
## Theatres 0.82
## Museums -0.36 0.51
## Burger_PizzaShops 0.58
## Hotels_OtherLodgings 0.62
## JuiceBars 0.65
## Churches 0.44 -0.33
## Resorts
## Beaches 0.44
## LocalServices -0.34
## ArtGalleries -0.32 0.32
## DanceClubs
## Pool 0.41
## Gyms 0.45
## Bakeries 0.46
## BeautySpas 0.34
## Cafes 0.40 -0.35
## ViewPoints 0.35 -0.44
## Monuments 0.35 -0.38
## Gardens 0.40
##
## Factor1 Factor2 Factor3
## SS loadings 3.50 2.31 2.07
## Proportion Var 0.15 0.10 0.09
## Cumulative Var 0.15 0.24 0.33
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 11678.07 on 207 degrees of freedom.
## The p-value is 0
##
## Call:
## factanal(x = travelvar, factors = 4, rotation = "varimax")
##
## Uniquenesses:
## Churches Resorts Beaches
## 0.70 0.92 0.77
## Parks Theatres Museums
## 0.45 0.29 0.49
## Malls Zoo Restaurants
## 0.55 0.51 0.38
## Pubs_Bars LocalServices Burger_PizzaShops
## 0.38 0.47 0.52
## Hotels_OtherLodgings JuiceBars ArtGalleries
## 0.52 0.52 0.76
## DanceClubs Pool Gyms
## 0.94 0.81 0.75
## Bakeries BeautySpas Cafes
## 0.72 0.82 0.67
## ViewPoints Monuments Gardens
## 0.57 0.66 0.73
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## Malls -0.60
## ViewPoints 0.62
## Monuments 0.55
## Parks 0.71
## Theatres 0.84
## Museums -0.40 0.54
## Zoo -0.32 0.62
## Restaurants -0.43 0.62
## Pubs_Bars 0.75
## LocalServices 0.56 0.45
## Burger_PizzaShops 0.68
## Hotels_OtherLodgings 0.67
## Churches 0.47
## Resorts
## Beaches 0.40
## JuiceBars -0.33 -0.32 0.48
## ArtGalleries -0.37
## DanceClubs
## Pool
## Gyms -0.30 -0.33
## Bakeries -0.32 -0.38
## BeautySpas
## Cafes 0.38
## Gardens 0.47
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 2.53 2.42 2.39 1.74
## Proportion Var 0.11 0.10 0.10 0.07
## Cumulative Var 0.11 0.21 0.31 0.38
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 8066.47 on 186 degrees of freedom.
## The p-value is 0
##
## Call:
## factanal(x = travelvar, factors = 2, rotation = "varimax")
##
## Uniquenesses:
## Churches Resorts Beaches
## 0.73 0.94 0.76
## Parks Theatres Museums
## 0.47 0.40 0.62
## Malls Zoo Restaurants
## 0.63 0.61 0.51
## Pubs_Bars LocalServices Burger_PizzaShops
## 0.62 0.82 0.88
## Hotels_OtherLodgings JuiceBars ArtGalleries
## 0.89 0.78 0.82
## DanceClubs Pool Gyms
## 0.94 0.78 0.72
## Bakeries BeautySpas Cafes
## 0.72 0.85 0.77
## ViewPoints Monuments Gardens
## 0.74 0.77 0.77
##
## Loadings:
## Factor1 Factor2
## Malls -0.60
## Zoo -0.60
## Restaurants -0.62 -0.32
## Pubs_Bars -0.55
## Parks 0.72
## Theatres 0.74
## Churches 0.48
## Resorts
## Beaches 0.49
## Museums -0.44 0.44
## LocalServices -0.32
## Burger_PizzaShops -0.33
## Hotels_OtherLodgings -0.32
## JuiceBars -0.47
## ArtGalleries -0.42
## DanceClubs
## Pool 0.45
## Gyms 0.48
## Bakeries 0.49
## BeautySpas 0.38
## Cafes 0.48
## ViewPoints 0.39 0.33
## Monuments 0.39
## Gardens 0.42
##
## Factor1 Factor2
## SS loadings 3.61 2.85
## Proportion Var 0.15 0.12
## Cumulative Var 0.15 0.27
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 16308.61 on 229 degrees of freedom.
## The p-value is 0
## Churches Resorts Beaches Parks Theatres
## Churches 0.000001 0.124526 -0.001357 -0.066153 -0.044424
## Resorts 0.124526 0.000001 0.197951 -0.012102 -0.022370
## Beaches -0.001357 0.197951 0.000000 0.050193 -0.017093
## Parks -0.066153 -0.012102 0.050193 -0.000001 0.073324
## Theatres -0.044424 -0.022370 -0.017093 0.073324 -0.000001
## Museums -0.024765 -0.045665 -0.028450 -0.041760 0.065112
## Malls -0.002333 -0.011742 0.005850 -0.056676 0.007853
## Zoo 0.112609 0.053041 -0.069647 -0.059471 -0.013711
## Restaurants 0.038128 0.050497 -0.029902 0.003130 -0.077883
## Pubs_Bars 0.044253 0.003852 -0.011376 0.034913 -0.021375
## LocalServices 0.022070 -0.131170 -0.003779 0.053435 0.006962
## Burger_PizzaShops -0.060290 -0.033975 -0.065756 0.063875 0.116131
## Hotels_OtherLodgings -0.015911 -0.105217 -0.021812 0.078881 0.130078
## JuiceBars -0.086419 0.015043 0.068497 0.030084 0.064235
## ArtGalleries -0.059140 0.050448 0.073867 0.035200 -0.006711
## DanceClubs -0.053021 -0.034504 -0.004504 0.088658 0.052323
## Pool -0.043987 -0.043179 -0.013279 0.033847 0.069276
## Gyms -0.028298 -0.021478 -0.014257 0.017097 0.056601
## Bakeries -0.007693 0.051267 0.013148 -0.003857 0.011498
## BeautySpas -0.013283 0.040950 0.006424 0.000737 -0.043666
## Cafes 0.020031 -0.010270 -0.058948 -0.044372 -0.061627
## ViewPoints 0.051999 -0.131571 -0.065127 0.060445 -0.052557
## Monuments 0.110632 -0.056922 -0.059909 -0.021438 -0.020363
## Gardens 0.187552 -0.005296 -0.077648 -0.065383 -0.004340
## Museums Malls Zoo Restaurants Pubs_Bars
## Churches -0.024765 -0.002333 0.112609 0.038128 0.044253
## Resorts -0.045665 -0.011742 0.053041 0.050497 0.003852
## Beaches -0.028450 0.005850 -0.069647 -0.029902 -0.011376
## Parks -0.041760 -0.056676 -0.059471 0.003130 0.034913
## Theatres 0.065112 0.007853 -0.013711 -0.077883 -0.021375
## Museums 0.000001 0.160075 0.013182 -0.017009 -0.139059
## Malls 0.160075 0.000001 0.031113 0.030603 -0.101749
## Zoo 0.013182 0.031113 0.000001 0.110863 0.171569
## Restaurants -0.017009 0.030603 0.110863 0.000003 0.132150
## Pubs_Bars -0.139059 -0.101749 0.171569 0.132150 0.000001
## LocalServices -0.166316 -0.118882 0.053786 -0.023650 0.215688
## Burger_PizzaShops -0.063944 -0.062574 -0.116460 -0.183123 -0.022193
## Hotels_OtherLodgings -0.025153 -0.042356 -0.103634 -0.121946 -0.058357
## JuiceBars 0.049435 0.040953 -0.107286 -0.122284 -0.137445
## ArtGalleries 0.007092 0.061970 -0.126849 0.004308 -0.068527
## DanceClubs -0.013584 0.000230 0.015489 0.022844 0.093468
## Pool 0.059991 0.057025 0.061287 0.002477 0.019139
## Gyms 0.071543 0.082729 0.053125 -0.018982 -0.013079
## Bakeries 0.064185 0.011162 0.018470 -0.020206 -0.056812
## BeautySpas -0.025690 -0.032204 -0.026186 0.058239 -0.015475
## Cafes -0.021026 0.007316 0.014371 0.109605 0.077651
## ViewPoints -0.096046 -0.121345 0.033492 0.072462 0.141480
## Monuments -0.072289 0.011233 0.107444 0.050632 0.091993
## Gardens -0.023448 0.001595 0.142238 -0.010297 0.040238
## LocalServices Burger_PizzaShops Hotels_OtherLodgings
## Churches 0.022070 -0.060290 -0.015911
## Resorts -0.131170 -0.033975 -0.105217
## Beaches -0.003779 -0.065756 -0.021812
## Parks 0.053435 0.063875 0.078881
## Theatres 0.006962 0.116131 0.130078
## Museums -0.166316 -0.063944 -0.025153
## Malls -0.118882 -0.062574 -0.042356
## Zoo 0.053786 -0.116460 -0.103634
## Restaurants -0.023650 -0.183123 -0.121946
## Pubs_Bars 0.215688 -0.022193 -0.058357
## LocalServices -0.000001 0.196490 0.148496
## Burger_PizzaShops 0.196490 0.000000 0.357869
## Hotels_OtherLodgings 0.148496 0.357869 -0.000001
## JuiceBars -0.075675 0.197221 0.361967
## ArtGalleries -0.139648 0.011175 0.063952
## DanceClubs 0.055771 -0.043864 -0.052851
## Pool 0.031945 0.013204 -0.018508
## Gyms -0.005978 0.009594 0.008816
## Bakeries -0.055748 -0.017553 -0.002405
## BeautySpas -0.124240 -0.083811 -0.058210
## Cafes -0.113886 -0.187780 -0.154185
## ViewPoints 0.155980 -0.122081 -0.035750
## Monuments 0.106556 -0.014047 0.003008
## Gardens 0.049854 0.009177 0.019572
## JuiceBars ArtGalleries DanceClubs Pool Gyms
## Churches -0.086419 -0.059140 -0.053021 -0.043987 -0.028298
## Resorts 0.015043 0.050448 -0.034504 -0.043179 -0.021478
## Beaches 0.068497 0.073867 -0.004504 -0.013279 -0.014257
## Parks 0.030084 0.035200 0.088658 0.033847 0.017097
## Theatres 0.064235 -0.006711 0.052323 0.069276 0.056601
## Museums 0.049435 0.007092 -0.013584 0.059991 0.071543
## Malls 0.040953 0.061970 0.000230 0.057025 0.082729
## Zoo -0.107286 -0.126849 0.015489 0.061287 0.053125
## Restaurants -0.122284 0.004308 0.022844 0.002477 -0.018982
## Pubs_Bars -0.137445 -0.068527 0.093468 0.019139 -0.013079
## LocalServices -0.075675 -0.139648 0.055771 0.031945 -0.005978
## Burger_PizzaShops 0.197221 0.011175 -0.043864 0.013204 0.009594
## Hotels_OtherLodgings 0.361967 0.063952 -0.052851 -0.018508 0.008816
## JuiceBars -0.000001 0.172507 -0.026830 -0.001874 0.008704
## ArtGalleries 0.172507 0.000000 0.078880 -0.016070 -0.027027
## DanceClubs -0.026830 0.078880 0.000000 0.284062 0.195664
## Pool -0.001874 -0.016070 0.284062 0.000001 0.356334
## Gyms 0.008704 -0.027027 0.195664 0.356334 -0.000001
## Bakeries 0.019153 -0.040039 -0.043733 0.100374 0.120908
## BeautySpas 0.002292 -0.002177 -0.034432 -0.112509 -0.079131
## Cafes -0.075264 0.046396 0.034405 -0.056265 -0.058103
## ViewPoints -0.112414 -0.050734 0.011332 -0.045398 -0.088076
## Monuments -0.126569 -0.056725 -0.040001 -0.000963 -0.029000
## Gardens -0.057475 -0.122275 -0.076162 -0.002338 -0.000391
## Bakeries BeautySpas Cafes ViewPoints Monuments
## Churches -0.007693 -0.013283 0.020031 0.051999 0.110632
## Resorts 0.051267 0.040950 -0.010270 -0.131571 -0.056922
## Beaches 0.013148 0.006424 -0.058948 -0.065127 -0.059909
## Parks -0.003857 0.000737 -0.044372 0.060445 -0.021438
## Theatres 0.011498 -0.043666 -0.061627 -0.052557 -0.020363
## Museums 0.064185 -0.025690 -0.021026 -0.096046 -0.072289
## Malls 0.011162 -0.032204 0.007316 -0.121345 0.011233
## Zoo 0.018470 -0.026186 0.014371 0.033492 0.107444
## Restaurants -0.020206 0.058239 0.109605 0.072462 0.050632
## Pubs_Bars -0.056812 -0.015475 0.077651 0.141480 0.091993
## LocalServices -0.055748 -0.124240 -0.113886 0.155980 0.106556
## Burger_PizzaShops -0.017553 -0.083811 -0.187780 -0.122081 -0.014047
## Hotels_OtherLodgings -0.002405 -0.058210 -0.154185 -0.035750 0.003008
## JuiceBars 0.019153 0.002292 -0.075264 -0.112414 -0.126569
## ArtGalleries -0.040039 -0.002177 0.046396 -0.050734 -0.056725
## DanceClubs -0.043733 -0.034432 0.034405 0.011332 -0.040001
## Pool 0.100374 -0.112509 -0.056265 -0.045398 -0.000963
## Gyms 0.120908 -0.079131 -0.058103 -0.088076 -0.029000
## Bakeries 0.000000 0.050178 -0.067631 -0.099243 -0.077663
## BeautySpas 0.050178 0.000000 0.061783 0.026861 -0.040187
## Cafes -0.067631 0.061783 0.000001 0.153069 0.103893
## ViewPoints -0.099243 0.026861 0.153069 0.000002 0.188747
## Monuments -0.077663 -0.040187 0.103893 0.188747 0.000001
## Gardens -0.057269 -0.070728 0.038869 0.022149 0.185498
## Gardens
## Churches 0.187552
## Resorts -0.005296
## Beaches -0.077648
## Parks -0.065383
## Theatres -0.004340
## Museums -0.023448
## Malls 0.001595
## Zoo 0.142238
## Restaurants -0.010297
## Pubs_Bars 0.040238
## LocalServices 0.049854
## Burger_PizzaShops 0.009177
## Hotels_OtherLodgings 0.019572
## JuiceBars -0.057475
## ArtGalleries -0.122275
## DanceClubs -0.076162
## Pool -0.002338
## Gyms -0.000391
## Bakeries -0.057269
## BeautySpas -0.070728
## Cafes 0.038869
## ViewPoints 0.022149
## Monuments 0.185498
## Gardens 0.000001
##
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
##
## parallel
## Warning: factor.pa is deprecated. Please use the fa function with fm=pa
## Factor Analysis using method = pa
## Call: factor.pa(r = travelvar, nfactors = 2, rotate = "varimax")
## Unstandardized loadings (pattern matrix) based upon covariance matrix
## PA1 PA2 h2 u2 H2 U2
## Churches 0.49 -0.24 0.295 0.71 0.293 0.71
## Resorts 0.05 -0.25 0.067 0.93 0.067 0.93
## Beaches 0.05 -0.47 0.225 0.78 0.224 0.78
## Parks -0.08 -0.66 0.437 0.56 0.438 0.56
## Theatres -0.22 -0.66 0.477 0.52 0.479 0.52
## Museums -0.43 -0.41 0.354 0.65 0.353 0.65
## Malls -0.59 0.09 0.359 0.64 0.359 0.64
## Zoo -0.58 0.12 0.351 0.65 0.351 0.65
## Restaurants -0.61 0.25 0.432 0.57 0.431 0.57
## Pubs_Bars -0.54 0.24 0.349 0.65 0.350 0.65
## LocalServices -0.32 0.28 0.183 0.82 0.183 0.82
## Burger_PizzaShops -0.12 0.42 0.194 0.81 0.194 0.81
## Hotels_OtherLodgings -0.07 0.42 0.182 0.82 0.181 0.82
## JuiceBars -0.02 0.55 0.306 0.69 0.307 0.69
## ArtGalleries 0.02 0.44 0.191 0.81 0.191 0.81
## DanceClubs 0.26 0.08 0.077 0.92 0.077 0.92
## Pool 0.47 0.19 0.257 0.74 0.258 0.74
## Gyms 0.51 0.25 0.318 0.68 0.318 0.68
## Bakeries 0.49 0.21 0.284 0.72 0.283 0.72
## BeautySpas 0.37 0.07 0.145 0.85 0.146 0.85
## Cafes 0.48 -0.06 0.238 0.76 0.239 0.76
## ViewPoints 0.40 -0.35 0.283 0.72 0.282 0.72
## Monuments 0.40 -0.31 0.254 0.75 0.253 0.75
## Gardens 0.43 -0.25 0.244 0.76 0.243 0.76
##
## PA1 PA2
## SS loadings 3.61 2.89
## Proportion Var 0.15 0.12
## Cumulative Var 0.15 0.27
## Proportion Explained 0.56 0.44
## Cumulative Proportion 0.56 1.00
##
## Standardized loadings (pattern matrix)
## item PA1 PA2 h2 u2
## Churches 1 0.48 -0.24 0.293 0.71
## Resorts 2 0.06 -0.25 0.067 0.93
## Beaches 3 0.05 -0.47 0.224 0.78
## Parks 4 -0.08 -0.66 0.438 0.56
## Theatres 5 -0.22 -0.66 0.479 0.52
## Museums 6 -0.43 -0.41 0.353 0.65
## Malls 7 -0.59 0.09 0.359 0.64
## Zoo 8 -0.58 0.12 0.351 0.65
## Restaurants 9 -0.61 0.25 0.431 0.57
## Pubs_Bars 10 -0.54 0.24 0.350 0.65
## LocalServices 11 -0.32 0.28 0.183 0.82
## Burger_PizzaShops 12 -0.12 0.42 0.194 0.81
## Hotels_OtherLodgings 13 -0.07 0.42 0.181 0.82
## JuiceBars 14 -0.02 0.55 0.307 0.69
## ArtGalleries 15 0.02 0.44 0.191 0.81
## DanceClubs 16 0.27 0.08 0.077 0.92
## Pool 17 0.47 0.19 0.258 0.74
## Gyms 18 0.51 0.25 0.318 0.68
## Bakeries 19 0.49 0.21 0.283 0.72
## BeautySpas 20 0.37 0.07 0.146 0.85
## Cafes 21 0.48 -0.06 0.239 0.76
## ViewPoints 22 0.40 -0.35 0.282 0.72
## Monuments 23 0.40 -0.31 0.253 0.75
## Gardens 24 0.43 -0.25 0.243 0.76
##
## PA1 PA2
## SS loadings 3.61 2.89
## Proportion Var 0.15 0.12
## Cumulative Var 0.15 0.27
## Cum. factor Var 0.56 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 276 and the objective function was 7.38 with Chi Square of 40169.45
## The degrees of freedom for the model are 229 and the objective function was 3.03
##
## The root mean square of the residuals (RMSR) is 0.08
## The df corrected root mean square of the residuals is 0.09
##
## The harmonic number of observations is 5456 with the empirical chi square 21175.16 with prob < 0
## The total number of observations was 5456 with Likelihood Chi Square = 16511.3 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.508
## RMSEA index = 0.114 and the 90 % confidence intervals are 0.113 0.116
## BIC = 14540.87
## Fit based upon off diagonal values = 0.84
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.92 0.90
## Multiple R square of scores with factors 0.84 0.81
## Minimum correlation of possible factor scores 0.68 0.62
# Scale Imputed Data Sets
travel3 <- scale(travel2)
imp2 <- scale(imp)
# MICE Imputed Model
k3 = kmeans(travel3, centers =3, nstart = 25)
#str(k3)
#k3
# PCA Imputed Model
k4 = kmeans(imp2, centers = 3, nstart = 25)
#str(k4)
#k4
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
The elbow is not clearly defined. The optimal number of clusters appears to lie between 10 and 12, although we may settle on 8 for practical business and marketing purposes. Let’s use the fviz_nbclust function from the factoextra package to see whether we can get a clearer elbow or silhouette.
We get a slight elbow at 6 before the line descends toward 8 to 10, so the elbow method suggests that between 6 and 10 clusters would be optimal. However, the silhouette method marks the optimal number of clusters at 2. It’s worth noting that, for the non-scaled travel2 data set, the silhouette method suggests 5 clusters. We will use the NbClust package to see whether we get better results from the scaled or the non-scaled data set.
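As a rough illustration of what the elbow and silhouette methods compute, the sketch below uses simulated data rather than the travel ratings (an assumption, so the example is self-contained). It relies on the base kmeans function and cluster::silhouette instead of fviz_nbclust itself, and the range of k is chosen arbitrarily.

```r
# Simulated stand-in for a scaled ratings matrix (assumption, not the travel data)
set.seed(1)
x <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 4), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
)
x <- scale(x)
d <- dist(x)

ks <- 2:8

# Elbow method: total within-cluster sum of squares for each k
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

# Silhouette method: average silhouette width for each k
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25, iter.max = 50)$cluster
  mean(cluster::silhouette(cl, d)[, "sil_width"])
})

# k with the largest average silhouette width
best_k_sil <- ks[which.max(avg_sil)]
data.frame(k = ks, wss = round(wss, 1), avg_sil = round(avg_sil, 3))
```

The elbow is read off the wss column by eye, which is why it can be ambiguous, whereas the silhouette method simply picks the k with the largest average width.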
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
#### Travel 2 Data Set
The NbClust algorithm was run on the travel2 data set, but the call was left out of the rendered analysis because of how long the package takes to run. The results from running it are below. We will create 2, 3, 4, and 6 clusters from the data set.
nbclust <- NbClust(data = travel2, distance = "euclidean", min.nc = 2, max.nc = 15, method = "kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 5 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
#### Travel 3 Data Set
The NbClust algorithm determined the best number of clusters to be 3 by majority rule, while the travel2 data set produced a tie between 2 and 3 clusters (15 clusters also received five votes). We will create 2, 3, and 7 clusters from the travel3 data set.
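To make the majority-rule comparison concrete, the sketch below tallies the vote counts copied from the two NbClust summaries above. The helper top_k is hypothetical and is not part of NbClust; it only lists every k that shares the highest vote count, so it says nothing about how NbClust itself breaks ties.

```r
# Votes reported by NbClust for the travel2 data set (copied from the output above)
votes_travel2 <- c("2" = 5, "3" = 5, "4" = 4, "6" = 2, "11" = 1, "14" = 1, "15" = 5)

# Votes reported for the travel3 data set
votes_travel3 <- c("2" = 5, "3" = 11, "6" = 1, "7" = 2, "14" = 1, "15" = 4)

# Hypothetical helper: every k that shares the top vote count
top_k <- function(votes) as.integer(names(votes)[votes == max(votes)])

top_k(votes_travel2)  # more than one k ties for first place
top_k(votes_travel3)  # a single clear winner
```

For travel2 the tally is split, which is why several cluster counts are carried forward rather than only the reported winner.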
## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used
## Warning in if (class(best_nc) == "matrix") .viz_NbClust(x, print.summary, : the
## condition has length > 1 and only the first element will be used
## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used
## Warning in if (class(best_nc) == "matrix") {: the condition has length > 1 and
## only the first element will be used
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 5 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 4 proposed 15 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
The line is relatively similar to the one for the MICE Imputed data. We will again use fviz_nbclust to gain a clearer picture.
The PCA Imputed data set produces the same results as the MICE Imputed data set. We will run the NbClust package on it to see whether the same majority-rule result is produced for the PCA Imputed data set.
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 2 proposed 14 as the best number of clusters
## * 3 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
NbClust produced the same majority-rule result as for the MICE Imputed (travel3) data set. We will produce 2, 3, and 7 clusters from the imp2 PCA Imputed data set to compare and validate against the same cluster counts from the MICE Imputed data set.
## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used
## Warning in if (class(best_nc) == "matrix") .viz_NbClust(x, print.summary, : the
## condition has length > 1 and only the first element will be used
## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used
## Warning in if (class(best_nc) == "matrix") {: the condition has length > 1 and
## only the first element will be used
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 5 proposed 2 as the best number of clusters
## * 11 proposed 3 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 2 proposed 14 as the best number of clusters
## * 3 proposed 15 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
The Hopkins statistic revealed the most clusterable data set to be the missMDA principal components data set, with a value of 0.88. Our PCA analysis revealed two important facts: 1) the missMDA approach accounted for slightly more variability in the first five PCs than the MICE imputed data set, and 2) using 10 PCs accounts for 73.33% of the variability within the Travel Ratings data set. Increasing the number of PCs to 13 accounts for 81.04% of the variability within the data set.
Europe_Travel_Ratings <- cbind(Europe_Travel_Reviews,Travel_Review_PCS)
pcavar <- c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10")
europca <- scale(Europe_Travel_Ratings[pcavar])
#### MICE Imputed
pcavar <- c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10")
europca_mice <- scale(pcs[pcavar])
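The choice of 10 components rests on the cumulative share of variance they explain. A minimal sketch of that calculation follows, using simulated ratings rather than the travel data (an assumption) and base prcomp rather than missMDA; the helper n_pcs_for is hypothetical.

```r
# Simulated stand-in for the 24 rating columns (assumption, not the travel data)
set.seed(42)
ratings <- matrix(runif(500 * 24, min = 1, max = 5), ncol = 24)

pca <- prcomp(ratings, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Hypothetical helper: smallest number of PCs reaching a target share
n_pcs_for <- function(target) which(cum_var >= target)[1]

n_pcs_for(0.73)
n_pcs_for(0.81)
```

On the real data the same calculation would give the 73.33% and 81.04% figures quoted above for 10 and 13 components; the simulated values will differ.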
Using principal components appears to produce clusters that are visually less clean than those built from the actual ratings. We’d expect the opposite given the Hopkins statistic. Perhaps three is simply a less-than-optimal number of clusters for this data set, as our earlier optimal-number-of-clusters exercises suggested. So, before we continue analyzing K-Means with principal components, let’s look at some key metrics across the three models we’ve run so far, to see which data set minimizes the within sum of squares and maximizes the between sum of squares.
| Data.Set | Cluster.Sizes | Within.Total.Sum.Squares | Between.Sum.Squares | Total.Sum.Squares |
|---|---|---|---|---|
| Mice Imputed Variables | 2413, 2053, 990 | 100431.55 | 30488.451 | 130920 |
| PCA Imputed Variables | 988, 2059, 2409 | 100047.57 | 30872.425 | 130920 |
| Principal Components | 1931, 2467, 1058 | 46106.05 | 8443.953 | 54550 |
As the numbers show, the Principal Components model has the lowest within sum of squares, but its total sum of squares is also much smaller (54550 versus 130920), so the raw values are not directly comparable. It performs poorly at maximizing the between sum of squares: only about 15% of its total sum of squares lies between clusters, compared with roughly 23% to 24% for the two imputed-variable models. In other words, there is not enough distance between clusters to clearly delineate their makeup. In fact, the between sum of squares is far lower than the within sum of squares, which indicates poor performance. Let’s see if these numbers improve by finding the optimal number of clusters.
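Because the models have different total sums of squares, a like-for-like comparison is easier on the ratio of between to total sum of squares. The sketch below computes that ratio from the values in the table above; the data frame and column names are illustrative only.

```r
# Sums of squares copied from the comparison table
fit_summary <- data.frame(
  data_set = c("MICE imputed", "PCA imputed", "Principal components"),
  tot_withinss = c(100431.55, 100047.57, 46106.05),
  betweenss = c(30488.451, 30872.425, 8443.953),
  totss = c(130920, 130920, 54550)
)

# Share of total variation explained by the cluster assignment
fit_summary$between_share <- round(fit_summary$betweenss / fit_summary$totss, 3)
fit_summary
```

A higher share means the clusters are better separated relative to the spread of the data; on this measure the principal components model trails both imputed-variable models.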
The optimal number of clusters based on the silhouette method is 9 for the Principal Components data set, which is also the lowest point before 10 clusters on the elbow plot. We run NbClust on the data set to see what its 30 indices consider the optimal number of clusters.
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 2 proposed 11 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 3 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
The optimal number of clusters by majority rule for the Principal Components data set is 2, closely followed by 4. We will create 2, 4, and 9 clusters from the data, and then compare them with the clusters created from the travel2, travel3, and imp2 data sets.
## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used
## Warning in if (class(best_nc) == "matrix") .viz_NbClust(x, print.summary, : the
## condition has length > 1 and only the first element will be used
## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used
## Warning in if (class(best_nc) == "matrix") {: the condition has length > 1 and
## only the first element will be used
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 5 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 2 proposed 11 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 3 proposed 15 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
### Create and Extract Final K-Means Clusters
#### Travel 2 Data Set: Unscaled Multiple Imputation Data Set
# Compute k-means clustering with k = 2, 3, 4, 6
set.seed(402)
km1 <- kmeans(travel2, 2, nstart = 25)
km2 <- kmeans(travel2, 3, nstart = 25)
km3 <- kmeans(travel2, 4, nstart = 25)
km4 <- kmeans(travel2, 6, nstart = 25)
# Extract Clusters into original data set
Travel_Rating_Clusters <- cbind(Europe_Travel_Reviews,
Travel_Cluster_2 = km1$cluster,
Travel_Cluster_3 = km2$cluster,
Travel_Cluster_4 = km3$cluster,
Travel_Cluster_6 = km4$cluster)
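The code that builds the cluster profile tables is not shown in this report. As a rough sketch under stated assumptions, a summary with the same columns (cluster size, share, and mean rating per category) could be assembled in base R as follows; the data are simulated and only three category names are borrowed for illustration.

```r
# Simulated stand-in for a ratings data frame (assumption, not the travel data)
set.seed(402)
ratings <- data.frame(
  Malls = runif(300, 0, 5),
  Restaurants = runif(300, 0, 5),
  Theatres = runif(300, 0, 5)
)
km <- kmeans(ratings, centers = 3, nstart = 25)

# Mean rating per category within each cluster
means <- aggregate(ratings, by = list(Cluster = km$cluster), FUN = mean)

# Cluster sizes and shares, in the same cluster order as `means`
sizes <- as.vector(table(km$cluster))
profile <- data.frame(
  Cluster = means$Cluster,
  Cluster_Size = sizes,
  Cluster_Percent = round(sizes / sum(sizes), 2),
  round(means[, -1], 2)
)
profile
```

Cluster labels from kmeans are arbitrary, so the same group can appear under a different number from one run or data set to the next.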
| Travel_Cluster_2 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2822 | 0.52 | 3.94 | 3.90 | 2.45 | 2.79 | 3.47 | 2.20 | 3.12 | 3.04 | 2.00 | 2.04 | 2.79 | 2.86 | 2.57 | 2.45 | 1.17 | 1.20 | 1.05 | 1.20 | 1.16 | 0.92 | 0.92 | 0.84 | 0.94 | 0.89 |
| 2 | 2634 | 0.48 | 2.72 | 2.29 | 3.51 | 3.00 | 2.15 | 3.43 | 1.94 | 2.00 | 3.02 | 2.69 | 1.59 | 1.47 | 1.65 | 1.69 | 2.56 | 2.06 | 2.18 | 1.82 | 1.28 | 1.48 | 1.48 | 1.32 | 1.15 | 1.14 |
| Travel_Cluster_3 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1042 | 0.19 | 2.04 | 1.84 | 2.05 | 1.89 | 1.64 | 2.22 | 1.61 | 1.60 | 2.46 | 2.64 | 2.47 | 1.96 | 1.58 | 1.49 | 2.66 | 2.52 | 2.38 | 2.28 | 1.82 | 2.49 | 2.49 | 1.85 | 1.93 | 1.93 |
| 2 | 2434 | 0.45 | 3.94 | 4.05 | 2.28 | 2.73 | 3.59 | 2.11 | 3.20 | 3.10 | 1.93 | 2.02 | 2.85 | 2.91 | 2.59 | 2.49 | 1.13 | 1.16 | 1.02 | 1.17 | 1.02 | 0.89 | 0.89 | 0.84 | 0.77 | 0.75 |
| 3 | 1980 | 0.36 | 3.33 | 2.67 | 4.27 | 3.63 | 2.53 | 3.95 | 2.24 | 2.35 | 3.19 | 2.61 | 1.28 | 1.42 | 1.84 | 1.88 | 2.27 | 1.69 | 1.89 | 1.50 | 1.14 | 0.76 | 0.76 | 0.93 | 0.86 | 0.74 |
| Travel_Cluster_4 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 962 | 0.18 | 3.56 | 3.22 | 2.19 | 2.42 | 2.91 | 2.04 | 2.89 | 2.38 | 2.03 | 1.77 | 3.99 | 4.79 | 3.84 | 3.26 | 1.12 | 1.23 | 1.00 | 1.17 | 1.19 | 1.08 | 1.08 | 0.93 | 1.01 | 0.91 |
| 2 | 1031 | 0.19 | 2.00 | 1.84 | 2.07 | 1.91 | 1.60 | 2.28 | 1.57 | 1.59 | 2.53 | 2.71 | 2.37 | 1.85 | 1.54 | 1.45 | 2.69 | 2.52 | 2.40 | 2.29 | 1.68 | 2.50 | 2.50 | 1.84 | 1.81 | 1.86 |
| 3 | 1879 | 0.34 | 3.30 | 2.60 | 4.31 | 3.64 | 2.54 | 4.00 | 2.27 | 2.31 | 3.23 | 2.62 | 1.30 | 1.44 | 1.85 | 1.89 | 2.30 | 1.73 | 1.92 | 1.50 | 1.16 | 0.74 | 0.74 | 0.93 | 0.86 | 0.74 |
| 4 | 1584 | 0.29 | 4.16 | 4.53 | 2.40 | 2.94 | 3.93 | 2.16 | 3.30 | 3.53 | 1.87 | 2.15 | 2.09 | 1.73 | 1.79 | 1.99 | 1.16 | 1.13 | 1.05 | 1.18 | 1.00 | 0.74 | 0.74 | 0.80 | 0.71 | 0.68 |
| Travel_Cluster_6 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 613 | 0.11 | 3.05 | 2.79 | 1.75 | 1.72 | 2.91 | 1.77 | 3.23 | 2.01 | 1.71 | 1.24 | 4.04 | 4.88 | 4.26 | 3.89 | 1.15 | 1.09 | 0.99 | 1.08 | 1.38 | 1.24 | 1.24 | 0.98 | 1.11 | 1.00 |
| 2 | 904 | 0.17 | 3.71 | 4.50 | 2.30 | 2.56 | 4.44 | 2.21 | 4.19 | 3.62 | 1.71 | 1.58 | 1.39 | 1.37 | 1.90 | 2.09 | 1.38 | 1.24 | 1.20 | 1.26 | 1.12 | 0.66 | 0.66 | 0.74 | 0.74 | 0.67 |
| 3 | 1331 | 0.24 | 3.64 | 2.64 | 4.34 | 3.82 | 2.43 | 3.68 | 2.05 | 2.43 | 3.21 | 2.80 | 1.31 | 1.69 | 1.97 | 2.05 | 0.89 | 1.73 | 1.51 | 1.45 | 1.11 | 0.85 | 0.85 | 0.84 | 0.86 | 0.78 |
| 4 | 939 | 0.17 | 4.66 | 4.48 | 2.50 | 3.48 | 3.16 | 2.24 | 2.16 | 3.24 | 2.29 | 2.93 | 3.66 | 2.94 | 1.99 | 1.90 | 0.91 | 1.03 | 0.92 | 1.13 | 0.82 | 0.79 | 0.79 | 0.88 | 0.68 | 0.69 |
| 5 | 729 | 0.13 | 2.73 | 2.61 | 4.07 | 3.20 | 2.67 | 4.29 | 2.60 | 2.17 | 3.15 | 2.25 | 1.18 | 1.25 | 1.79 | 1.60 | 4.83 | 1.78 | 2.49 | 1.65 | 1.23 | 0.68 | 0.68 | 1.11 | 0.88 | 0.70 |
| 6 | 940 | 0.17 | 1.97 | 1.76 | 2.02 | 1.85 | 1.61 | 2.19 | 1.60 | 1.59 | 2.42 | 2.62 | 2.41 | 1.92 | 1.57 | 1.47 | 2.64 | 2.59 | 2.46 | 2.32 | 1.74 | 2.61 | 2.61 | 1.90 | 1.93 | 1.98 |
# Compute k-means clustering with k = 2, 3, 7
set.seed(123)
km5 <- kmeans(travel3, 2, nstart = 25)
km6 <- kmeans(travel3, 3, nstart = 25)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 272800)
km7 <- kmeans(travel3, 7, nstart = 25)
# Extract Clusters into original data set
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters,
Kmeans_Cluster_2 = km5$cluster,
Kmeans_Cluster_3 = km6$cluster,
Kmeans_Cluster_7 = km7$cluster)
| Kmeans_Cluster_2 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2834 | 0.52 | 3.99 | 3.91 | 2.57 | 2.90 | 3.49 | 2.28 | 3.13 | 3.07 | 2.03 | 2.03 | 2.63 | 2.76 | 2.54 | 2.43 | 1.18 | 1.18 | 1.04 | 1.16 | 1.01 | 0.88 | 0.88 | 0.81 | 0.77 | 0.75 |
| 2 | 2622 | 0.48 | 2.66 | 2.28 | 3.38 | 2.89 | 2.13 | 3.35 | 1.92 | 1.97 | 2.99 | 2.70 | 1.75 | 1.58 | 1.67 | 1.70 | 2.55 | 2.09 | 2.19 | 1.87 | 1.44 | 1.52 | 1.52 | 1.35 | 1.35 | 1.27 |
| Kmeans_Cluster_3 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 990 | 0.18 | 2.06 | 1.83 | 2.03 | 1.86 | 1.69 | 2.14 | 1.65 | 1.61 | 2.36 | 2.55 | 2.47 | 2.04 | 1.64 | 1.54 | 2.71 | 2.58 | 2.44 | 2.34 | 1.93 | 2.46 | 2.46 | 1.94 | 2.04 | 2.02 |
| 2 | 2053 | 0.38 | 3.28 | 2.65 | 4.21 | 3.58 | 2.47 | 3.94 | 2.22 | 2.27 | 3.24 | 2.66 | 1.40 | 1.48 | 1.85 | 1.86 | 2.24 | 1.71 | 1.87 | 1.51 | 1.14 | 0.83 | 0.83 | 0.92 | 0.86 | 0.74 |
| 3 | 2413 | 0.44 | 3.94 | 4.06 | 2.28 | 2.74 | 3.61 | 2.09 | 3.20 | 3.15 | 1.91 | 2.01 | 2.79 | 2.86 | 2.56 | 2.48 | 1.14 | 1.14 | 1.02 | 1.15 | 0.99 | 0.90 | 0.90 | 0.83 | 0.75 | 0.75 |
| Kmeans_Cluster_7 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 550 | 0.10 | 2.95 | 2.80 | 1.66 | 1.65 | 2.94 | 1.61 | 3.31 | 1.98 | 1.65 | 1.18 | 3.90 | 4.76 | 4.39 | 4.17 | 0.97 | 0.95 | 0.89 | 0.95 | 1.03 | 1.28 | 1.28 | 0.93 | 0.79 | 0.89 |
| 2 | 1197 | 0.22 | 2.91 | 2.52 | 4.20 | 3.48 | 2.45 | 4.23 | 2.00 | 2.08 | 3.77 | 3.00 | 1.48 | 1.38 | 1.46 | 1.60 | 2.24 | 1.26 | 1.30 | 1.36 | 1.32 | 0.88 | 0.88 | 0.93 | 0.83 | 0.71 |
| 3 | 625 | 0.11 | 3.42 | 2.96 | 4.24 | 3.27 | 2.91 | 3.83 | 3.27 | 2.85 | 2.69 | 2.36 | 1.30 | 1.80 | 3.05 | 2.57 | 2.80 | 2.85 | 3.29 | 1.85 | 0.94 | 0.76 | 0.76 | 0.78 | 0.95 | 0.79 |
| 4 | 1099 | 0.20 | 3.99 | 4.66 | 2.17 | 2.63 | 4.61 | 2.07 | 3.71 | 3.71 | 1.83 | 1.95 | 2.08 | 1.53 | 1.66 | 1.93 | 1.23 | 1.16 | 1.06 | 1.19 | 1.01 | 0.69 | 0.69 | 0.79 | 0.69 | 0.65 |
| 5 | 259 | 0.05 | 2.83 | 2.41 | 2.05 | 1.94 | 2.23 | 2.05 | 2.20 | 2.06 | 2.05 | 1.93 | 3.25 | 2.91 | 2.25 | 2.20 | 1.76 | 1.78 | 1.74 | 1.71 | 3.34 | 2.97 | 2.97 | 1.39 | 4.32 | 4.06 |
| 6 | 952 | 0.17 | 4.72 | 3.75 | 3.18 | 4.03 | 2.32 | 2.49 | 1.96 | 2.88 | 2.13 | 2.41 | 2.70 | 2.94 | 2.12 | 1.90 | 0.90 | 1.11 | 0.96 | 1.20 | 0.84 | 0.85 | 0.85 | 0.84 | 0.71 | 0.73 |
| 7 | 774 | 0.14 | 1.85 | 1.72 | 2.09 | 1.87 | 1.60 | 2.25 | 1.48 | 1.50 | 2.48 | 2.75 | 2.08 | 1.71 | 1.42 | 1.32 | 3.06 | 2.81 | 2.66 | 2.54 | 1.44 | 2.25 | 2.25 | 2.16 | 1.30 | 1.45 |
# Compute k-means clustering with k = 2, 3, 7
set.seed(123)
km8 <- kmeans(imp2, 2, nstart = 25)
km9 <- kmeans(imp2, 3, nstart = 25)
km10 <- kmeans(imp2, 7, nstart = 25)
# Extract Clusters into original data set
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters,
KMeans_Cluster_2 = km8$cluster,
KMeans_Cluster_3 = km9$cluster,
KMeans_Cluster_7 = km10$cluster)
| KMeans_Cluster_2 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1328 | 0.24 | 2.12 | 1.99 | 2.38 | 2.00 | 1.78 | 2.62 | 1.64 | 1.63 | 2.69 | 2.64 | 2.22 | 1.86 | 1.54 | 1.47 | 2.90 | 2.37 | 2.44 | 2.18 | 1.75 | 2.15 | 2.15 | 1.84 | 1.77 | 1.73 |
| 2 | 4128 | 0.76 | 3.75 | 3.49 | 3.15 | 3.18 | 3.17 | 2.85 | 2.84 | 2.83 | 2.43 | 2.27 | 2.20 | 2.30 | 2.31 | 2.28 | 1.51 | 1.38 | 1.34 | 1.29 | 1.04 | 0.83 | 0.83 | 0.81 | 0.78 | 0.73 |
| KMeans_Cluster_3 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2409 | 0.44 | 3.95 | 4.06 | 2.28 | 2.75 | 3.61 | 2.09 | 3.20 | 3.15 | 1.91 | 2.01 | 2.78 | 2.85 | 2.56 | 2.48 | 1.13 | 1.14 | 1.02 | 1.15 | 0.98 | 0.88 | 0.88 | 0.83 | 0.74 | 0.75 |
| 2 | 988 | 0.18 | 2.07 | 1.85 | 2.02 | 1.85 | 1.71 | 2.13 | 1.66 | 1.61 | 2.34 | 2.52 | 2.51 | 2.08 | 1.65 | 1.55 | 2.70 | 2.57 | 2.45 | 2.33 | 1.96 | 2.50 | 2.50 | 1.95 | 2.06 | 2.02 |
| 3 | 2059 | 0.38 | 3.27 | 2.64 | 4.21 | 3.57 | 2.47 | 3.94 | 2.22 | 2.27 | 3.25 | 2.68 | 1.40 | 1.48 | 1.85 | 1.86 | 2.25 | 1.72 | 1.87 | 1.51 | 1.13 | 0.83 | 0.83 | 0.92 | 0.86 | 0.74 |
| KMeans_Cluster_7 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1095 | 0.20 | 3.99 | 4.66 | 2.17 | 2.63 | 4.62 | 2.07 | 3.71 | 3.72 | 1.83 | 1.94 | 2.09 | 1.52 | 1.66 | 1.93 | 1.23 | 1.16 | 1.06 | 1.19 | 1.01 | 0.68 | 0.68 | 0.79 | 0.69 | 0.65 |
| 2 | 771 | 0.14 | 1.85 | 1.71 | 2.08 | 1.86 | 1.60 | 2.24 | 1.48 | 1.49 | 2.49 | 2.76 | 2.10 | 1.71 | 1.42 | 1.32 | 3.05 | 2.81 | 2.66 | 2.53 | 1.45 | 2.28 | 2.28 | 2.15 | 1.31 | 1.45 |
| 3 | 952 | 0.17 | 4.73 | 3.77 | 3.17 | 4.02 | 2.32 | 2.48 | 1.96 | 2.88 | 2.14 | 2.43 | 2.70 | 2.95 | 2.11 | 1.90 | 0.90 | 1.10 | 0.96 | 1.20 | 0.84 | 0.85 | 0.85 | 0.84 | 0.71 | 0.72 |
| 4 | 628 | 0.12 | 3.42 | 2.95 | 4.24 | 3.28 | 2.90 | 3.84 | 3.25 | 2.85 | 2.66 | 2.33 | 1.31 | 1.78 | 3.03 | 2.54 | 2.80 | 2.85 | 3.28 | 1.85 | 0.94 | 0.76 | 0.76 | 0.79 | 0.95 | 0.79 |
| 5 | 259 | 0.05 | 2.83 | 2.39 | 1.98 | 1.87 | 2.23 | 2.00 | 2.22 | 2.04 | 1.99 | 1.88 | 3.37 | 3.00 | 2.30 | 2.24 | 1.76 | 1.79 | 1.74 | 1.71 | 3.46 | 2.91 | 2.91 | 1.40 | 4.31 | 4.06 |
| 6 | 1204 | 0.22 | 2.92 | 2.52 | 4.20 | 3.48 | 2.44 | 4.23 | 1.99 | 2.08 | 3.78 | 3.00 | 1.46 | 1.38 | 1.46 | 1.60 | 2.24 | 1.26 | 1.30 | 1.36 | 1.32 | 0.86 | 0.86 | 0.94 | 0.85 | 0.71 |
| 7 | 547 | 0.10 | 2.93 | 2.81 | 1.66 | 1.65 | 2.95 | 1.60 | 3.33 | 1.99 | 1.64 | 1.18 | 3.86 | 4.76 | 4.42 | 4.20 | 0.96 | 0.95 | 0.90 | 0.95 | 0.96 | 1.28 | 1.28 | 0.92 | 0.79 | 0.90 |
# Compute k-means clustering with k = 2, 4, 9
set.seed(915)
km11 <- kmeans(europca, 2, nstart = 25)
km12 <- kmeans(europca, 4, nstart = 25)
## Warning: did not converge in 10 iterations
km13 <- kmeans(europca, 9, nstart = 25)
# Extract Clusters into original data set
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters,
Kmeans_PCA_Cluster_2 = km11$cluster,
Kmeans_PCA_Cluster_4 = km12$cluster,
Kmeans_PCA_Cluster_9 = km13$cluster)
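The "did not converge in 10 iterations" warning reflects the default iter.max = 10 in stats::kmeans. One way to address it, sketched here on simulated data (an assumption; the value 100 is arbitrary), is to raise iter.max and then inspect the fit.

```r
# Simulated stand-in for a scaled matrix of 10 component scores (assumption)
set.seed(915)
scores <- scale(matrix(rnorm(2000 * 10), ncol = 10))

# iter.max defaults to 10 in stats::kmeans; raising it gives each start
# more iterations to converge
km <- kmeans(scores, centers = 4, nstart = 25, iter.max = 100)

km$iter    # iterations used by the returned solution
km$ifault  # 0 means no algorithm problem was flagged
```

If the warning persists with a larger iter.max, the cluster assignments for that k should be treated with some caution.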
| Kmeans_PCA_Cluster_2 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2607 | 0.48 | 2.85 | 2.34 | 3.42 | 3.09 | 1.98 | 3.27 | 1.76 | 1.91 | 2.89 | 2.56 | 1.68 | 1.52 | 1.48 | 1.58 | 2.36 | 1.93 | 1.92 | 1.74 | 1.23 | 1.53 | 1.53 | 1.32 | 1.15 | 1.16 |
| 2 | 2849 | 0.52 | 3.81 | 3.85 | 2.54 | 2.72 | 3.61 | 2.36 | 3.27 | 3.11 | 2.13 | 2.17 | 2.69 | 2.81 | 2.71 | 2.54 | 1.38 | 1.34 | 1.33 | 1.29 | 1.21 | 0.88 | 0.88 | 0.84 | 0.95 | 0.87 |
| Kmeans_PCA_Cluster_4 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1214 | 0.22 | 3.51 | 3.20 | 2.79 | 2.67 | 2.95 | 2.54 | 3.05 | 2.54 | 2.13 | 2.01 | 3.14 | 4.01 | 4.13 | 3.25 | 1.56 | 1.66 | 1.62 | 1.38 | 0.93 | 1.02 | 1.02 | 0.85 | 0.84 | 0.83 |
| 2 | 1388 | 0.25 | 4.09 | 4.55 | 2.25 | 2.84 | 4.17 | 2.04 | 3.40 | 3.76 | 1.90 | 2.20 | 2.03 | 1.75 | 1.61 | 1.95 | 1.27 | 1.19 | 1.16 | 1.21 | 0.94 | 0.72 | 0.72 | 0.81 | 0.72 | 0.69 |
| 3 | 1008 | 0.18 | 2.12 | 1.85 | 2.08 | 1.89 | 1.69 | 2.12 | 1.68 | 1.65 | 2.41 | 2.57 | 2.39 | 2.01 | 1.62 | 1.57 | 2.61 | 2.69 | 2.45 | 2.43 | 1.90 | 2.37 | 2.37 | 1.88 | 2.05 | 2.00 |
| 4 | 1846 | 0.34 | 3.36 | 2.70 | 4.08 | 3.63 | 2.37 | 3.90 | 2.05 | 2.11 | 3.22 | 2.57 | 1.63 | 1.43 | 1.47 | 1.68 | 2.06 | 1.35 | 1.50 | 1.30 | 1.24 | 0.88 | 0.88 | 0.95 | 0.81 | 0.71 |
| Kmeans_PCA_Cluster_9 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 422 | 0.08 | 3.39 | 2.57 | 3.88 | 3.18 | 2.39 | 3.15 | 2.17 | 2.78 | 2.54 | 3.02 | 1.11 | 1.91 | 2.46 | 2.83 | 1.19 | 3.86 | 2.51 | 2.40 | 0.81 | 0.97 | 0.97 | 0.82 | 1.01 | 0.86 |
| 2 | 1132 | 0.21 | 3.98 | 2.85 | 3.82 | 4.08 | 1.99 | 3.32 | 1.87 | 2.29 | 2.35 | 1.59 | 1.93 | 1.95 | 1.85 | 1.77 | 1.51 | 1.18 | 1.16 | 1.23 | 1.02 | 0.96 | 0.96 | 0.83 | 0.85 | 0.80 |
| 3 | 1047 | 0.19 | 4.01 | 4.68 | 2.15 | 2.71 | 4.54 | 2.02 | 3.65 | 3.80 | 1.73 | 1.92 | 1.96 | 1.54 | 1.57 | 1.94 | 1.20 | 1.15 | 1.06 | 1.18 | 0.99 | 0.70 | 0.70 | 0.81 | 0.68 | 0.64 |
| 4 | 364 | 0.07 | 2.41 | 2.76 | 2.38 | 2.20 | 2.11 | 2.70 | 1.52 | 1.82 | 2.50 | 2.61 | 2.03 | 2.34 | 1.93 | 1.50 | 2.65 | 1.64 | 1.76 | 1.76 | 1.36 | 2.26 | 2.26 | 1.53 | 1.02 | 1.14 |
| 5 | 333 | 0.06 | 2.69 | 2.30 | 1.99 | 1.89 | 2.17 | 2.02 | 2.23 | 2.00 | 2.02 | 2.01 | 2.94 | 2.86 | 2.24 | 2.20 | 1.86 | 1.99 | 1.83 | 1.79 | 3.03 | 3.08 | 3.08 | 1.50 | 3.88 | 3.86 |
| 6 | 310 | 0.06 | 1.56 | 1.73 | 2.10 | 1.67 | 1.75 | 2.31 | 1.49 | 1.35 | 2.40 | 2.47 | 2.55 | 1.13 | 0.96 | 1.05 | 4.19 | 3.13 | 3.40 | 2.43 | 1.65 | 1.36 | 1.36 | 3.39 | 1.14 | 1.24 |
| 7 | 818 | 0.15 | 3.36 | 3.15 | 3.41 | 3.12 | 2.81 | 3.30 | 2.10 | 2.43 | 3.73 | 4.39 | 2.48 | 2.32 | 1.73 | 1.92 | 1.20 | 1.14 | 1.02 | 1.42 | 1.23 | 1.15 | 1.15 | 0.85 | 0.77 | 0.75 |
| 8 | 498 | 0.09 | 2.91 | 2.78 | 1.70 | 1.66 | 2.95 | 1.66 | 3.30 | 1.97 | 1.64 | 1.19 | 3.97 | 4.74 | 4.37 | 4.24 | 0.99 | 0.97 | 0.90 | 0.96 | 1.00 | 1.18 | 1.18 | 0.92 | 0.74 | 0.83 |
| 9 | 532 | 0.10 | 3.21 | 2.97 | 3.98 | 3.12 | 3.08 | 4.04 | 3.66 | 2.63 | 3.50 | 2.10 | 1.56 | 1.72 | 2.77 | 1.54 | 4.09 | 1.79 | 3.04 | 1.68 | 1.08 | 0.71 | 0.71 | 0.80 | 0.92 | 0.73 |
##would not run##
pam_sil_viz1 <- fviz_nbclust(travel2, cluster::pam, method = "silhouette")+labs(subtitle = "Multiple Imputation Data Set")
pam_sil_viz2 <- fviz_nbclust(travel3, cluster::pam, method = "silhouette")+labs(subtitle = "Scaled Multiple Imputation Data Set")
pam_sil_viz3 <- fviz_nbclust(imp2, cluster::pam, method = "silhouette")+labs(subtitle = "PCA Imputed Data Set")
pam_sil_viz4 <- fviz_nbclust(europca, cluster::pam, method = "silhouette")+labs(subtitle = "Principal Components Data Set")
grid.arrange(pam_sil_viz1, pam_sil_viz2, pam_sil_viz3, pam_sil_viz4)
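The silhouette search above can be approximated by hand: fit PAM for each candidate k and keep the k with the largest average silhouette width. The following is a minimal sketch on a small synthetic matrix (`toy` is a hypothetical stand-in for `travel2`, not the report's data), using only `cluster::pam`.

```r
library(cluster)  # pam(); the cluster package ships with standard R installs

set.seed(42)
# Hypothetical stand-in for travel2: three well-separated groups in 2-D
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.3), ncol = 2)
)

ks <- 2:6
# Average silhouette width of the PAM solution for each candidate k
avg_sil <- sapply(ks, function(k) pam(toy, k)$silinfo$avg.width)
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))
best_k
```

Because PAM works on all pairwise dissimilarities, running this loop on a subsample first is a cheap way to check that the search finishes before trying the full data set.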
The optimal number of clusters for the travel2 Multiple Imputation data set was 10.
mi_pam <- eclust(travel2, "pam", k = 10, hc_metric="euclidean")
# PAM Clustering
mi_pamclus <- pam(travel2, 10)
# Add Clusters into original data set
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, PAM_Travel_Cluster = mi_pamclus$cluster)
| PAM_Travel_Cluster | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 746 | 0.14 | 4.55 | 2.86 | 4.08 | 4.44 | 2.12 | 3.32 | 2.00 | 2.36 | 2.59 | 1.59 | 1.40 | 1.83 | 1.99 | 1.92 | 0.91 | 1.54 | 1.45 | 1.25 | 0.98 | 0.86 | 0.86 | 0.82 | 0.85 | 0.78 |
| 2 | 593 | 0.11 | 2.55 | 2.43 | 4.39 | 3.40 | 2.54 | 4.68 | 2.51 | 2.07 | 3.15 | 2.15 | 1.10 | 1.19 | 1.70 | 1.63 | 4.56 | 1.77 | 2.14 | 1.55 | 1.21 | 0.66 | 0.66 | 1.00 | 0.90 | 0.71 |
| 3 | 650 | 0.12 | 3.23 | 2.90 | 4.26 | 3.33 | 2.79 | 3.65 | 2.08 | 2.67 | 3.52 | 4.33 | 1.37 | 1.90 | 1.88 | 2.21 | 0.94 | 1.52 | 1.32 | 1.57 | 1.22 | 0.75 | 0.75 | 0.87 | 0.76 | 0.75 |
| 4 | 632 | 0.12 | 2.27 | 2.10 | 2.35 | 2.18 | 1.64 | 2.50 | 1.61 | 1.76 | 2.79 | 2.96 | 2.06 | 2.05 | 1.82 | 1.62 | 1.71 | 2.29 | 1.93 | 2.18 | 1.18 | 2.71 | 2.71 | 1.12 | 1.46 | 1.64 |
| 5 | 407 | 0.07 | 1.57 | 1.62 | 1.97 | 1.66 | 1.64 | 2.23 | 1.48 | 1.35 | 2.45 | 2.51 | 2.15 | 1.31 | 1.07 | 1.09 | 4.33 | 3.02 | 3.43 | 2.42 | 1.73 | 1.69 | 1.69 | 3.00 | 1.42 | 1.36 |
| 6 | 756 | 0.14 | 3.50 | 4.35 | 2.28 | 2.38 | 4.27 | 2.16 | 4.88 | 3.48 | 1.74 | 1.75 | 1.52 | 1.44 | 2.09 | 2.21 | 1.75 | 1.28 | 1.46 | 1.34 | 1.15 | 0.71 | 0.71 | 0.72 | 0.79 | 0.71 |
| 7 | 474 | 0.09 | 4.63 | 4.91 | 2.17 | 3.33 | 4.54 | 1.99 | 1.87 | 4.01 | 1.83 | 2.34 | 2.23 | 1.57 | 1.43 | 1.64 | 1.06 | 1.13 | 1.07 | 1.12 | 0.75 | 0.72 | 0.72 | 0.88 | 0.61 | 0.64 |
| 8 | 636 | 0.12 | 4.50 | 3.91 | 2.76 | 3.02 | 3.01 | 2.56 | 2.65 | 2.90 | 2.54 | 2.29 | 4.49 | 3.85 | 2.83 | 2.22 | 1.08 | 1.25 | 1.03 | 1.22 | 1.07 | 0.77 | 0.77 | 0.89 | 0.73 | 0.69 |
| 9 | 158 | 0.03 | 3.08 | 2.61 | 1.94 | 1.90 | 2.34 | 1.96 | 2.27 | 2.15 | 1.94 | 1.82 | 4.10 | 3.26 | 2.42 | 2.27 | 1.58 | 1.60 | 1.56 | 1.57 | 4.31 | 2.19 | 2.19 | 1.26 | 4.66 | 4.46 |
| 10 | 404 | 0.07 | 2.52 | 2.72 | 1.56 | 1.57 | 2.94 | 1.54 | 3.32 | 1.89 | 1.57 | 1.06 | 3.89 | 5.00 | 4.59 | 4.52 | 0.89 | 0.92 | 0.85 | 0.87 | 0.88 | 1.40 | 1.40 | 0.87 | 0.81 | 0.96 |
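The report does not show how the profile tables are built. A minimal base R sketch of one way to produce a table of this shape (cluster size, share of all users, and mean rating per category) follows. The data frame `ratings` and its two category columns are hypothetical and only illustrate the layout.

```r
set.seed(1)
# Hypothetical ratings with a cluster label column (names are illustrative)
ratings <- data.frame(
  Cluster = rep(1:3, times = c(5, 3, 2)),
  Malls   = round(runif(10, 0, 5), 2),
  Parks   = round(runif(10, 0, 5), 2)
)

# Cluster sizes and per-cluster means, both ordered by cluster label
sizes <- as.vector(table(ratings$Cluster))
means <- aggregate(cbind(Malls, Parks) ~ Cluster, data = ratings, FUN = mean)

profile <- data.frame(
  Cluster         = means$Cluster,
  Cluster_Size    = sizes,
  Cluster_Percent = round(sizes / nrow(ratings), 2),
  round(means[, c("Malls", "Parks")], 2)
)
profile
```

The same pattern extends to all 24 rating columns by listing them inside `cbind()` or by aggregating over every column except the cluster label.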
The optimal number of clusters for the travel3 Scaled Multiple Imputation data set was 2.
miscaled_pam <- eclust(travel3, "pam", k = 2, hc_metric="euclidean")
# PAM Clustering
miscaled_pamclus <- pam(travel3, 2)
# Add Clusters into original data set
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, PAM_Travel_Cluster_2 = miscaled_pamclus$cluster)
| PAM_Travel_Cluster_2 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3338 | 0.61 | 4.03 | 3.70 | 2.97 | 3.15 | 3.32 | 2.66 | 2.93 | 2.94 | 2.26 | 2.17 | 2.43 | 2.43 | 2.40 | 2.34 | 1.23 | 1.26 | 1.18 | 1.21 | 1.00 | 0.82 | 0.82 | 0.81 | 0.74 | 0.72 |
| 2 | 2118 | 0.39 | 2.28 | 2.21 | 2.94 | 2.49 | 2.06 | 3.01 | 1.95 | 1.91 | 2.86 | 2.65 | 1.86 | 1.82 | 1.69 | 1.67 | 2.83 | 2.20 | 2.29 | 1.98 | 1.55 | 1.69 | 1.69 | 1.47 | 1.48 | 1.39 |
The optimal number of clusters for the imp2 PCA Imputed data set was 2.
pcai_pam <- eclust(imp2, "pam", k = 2, hc_metric="euclidean")
# PAM Clustering
pcai_pamclus <- pam(imp2, 2)
# Add Clusters into original data set
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, PAM_Travel_Cluster_3 = pcai_pamclus$cluster)
| PAM_Travel_Cluster_3 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3651 | 0.67 | 3.91 | 3.62 | 3.04 | 3.17 | 3.27 | 2.72 | 2.91 | 2.92 | 2.38 | 2.23 | 2.35 | 2.43 | 2.41 | 2.32 | 1.34 | 1.29 | 1.24 | 1.24 | 1.04 | 0.83 | 0.83 | 0.82 | 0.77 | 0.72 |
| 2 | 1805 | 0.33 | 2.23 | 2.13 | 2.79 | 2.33 | 1.96 | 2.94 | 1.81 | 1.78 | 2.72 | 2.62 | 1.92 | 1.71 | 1.55 | 1.59 | 2.90 | 2.31 | 2.37 | 2.06 | 1.57 | 1.80 | 1.80 | 1.58 | 1.54 | 1.48 |
The optimal number of clusters for the europca Principal Components data set was 10.
pca_pam <- eclust(europca, "pam", k = 10, hc_metric="euclidean")
# PAM Clustering
pca_pamclus <- pam(europca, 10)
# Add Clusters into original data set
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, PAM_Travel_Cluster_4 = pca_pamclus$cluster)
| PAM_Travel_Cluster_4 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 740 | 0.14 | 3.10 | 2.50 | 3.93 | 3.32 | 2.49 | 4.06 | 2.29 | 2.08 | 4.44 | 2.87 | 1.60 | 1.54 | 1.47 | 1.51 | 2.15 | 1.22 | 1.38 | 1.32 | 1.08 | 1.15 | 1.15 | 0.78 | 0.88 | 0.71 |
| 2 | 948 | 0.17 | 4.01 | 2.89 | 3.72 | 4.10 | 2.00 | 3.14 | 1.87 | 2.37 | 1.97 | 1.55 | 1.88 | 1.85 | 1.79 | 1.82 | 1.57 | 1.24 | 1.20 | 1.25 | 1.01 | 0.98 | 0.98 | 0.83 | 0.84 | 0.83 |
| 3 | 629 | 0.12 | 4.12 | 3.94 | 2.98 | 3.20 | 3.36 | 2.66 | 2.18 | 3.22 | 2.82 | 4.52 | 3.15 | 2.82 | 1.89 | 2.04 | 0.93 | 1.11 | 0.98 | 1.41 | 1.25 | 0.77 | 0.77 | 0.89 | 0.69 | 0.70 |
| 4 | 439 | 0.08 | 3.25 | 2.42 | 3.82 | 3.05 | 2.35 | 3.10 | 2.19 | 2.64 | 2.50 | 2.92 | 1.01 | 1.78 | 2.38 | 2.83 | 1.44 | 3.89 | 2.57 | 2.33 | 0.87 | 1.11 | 1.11 | 0.89 | 1.06 | 1.04 |
| 5 | 875 | 0.16 | 3.75 | 4.58 | 2.15 | 2.56 | 4.55 | 2.06 | 4.06 | 3.69 | 1.64 | 1.63 | 1.65 | 1.38 | 1.72 | 2.07 | 1.32 | 1.19 | 1.08 | 1.22 | 0.97 | 0.71 | 0.71 | 0.77 | 0.71 | 0.68 |
| 6 | 517 | 0.09 | 3.12 | 2.89 | 1.73 | 1.84 | 2.90 | 1.70 | 3.10 | 1.92 | 1.70 | 1.14 | 4.25 | 4.95 | 4.34 | 3.98 | 0.92 | 0.96 | 0.92 | 0.95 | 0.92 | 1.13 | 1.13 | 0.93 | 0.72 | 0.81 |
| 7 | 417 | 0.08 | 2.33 | 2.67 | 2.36 | 2.19 | 2.11 | 2.64 | 1.52 | 1.88 | 2.44 | 2.62 | 1.85 | 2.21 | 1.84 | 1.51 | 2.57 | 1.75 | 1.79 | 1.83 | 1.42 | 2.46 | 2.46 | 1.50 | 1.11 | 1.32 |
| 8 | 306 | 0.06 | 3.33 | 3.44 | 3.85 | 3.04 | 3.33 | 3.84 | 3.99 | 2.92 | 2.96 | 2.11 | 1.65 | 2.10 | 3.78 | 1.74 | 4.08 | 1.97 | 3.54 | 1.86 | 1.01 | 0.75 | 0.75 | 0.80 | 0.96 | 0.78 |
| 9 | 289 | 0.05 | 1.59 | 1.83 | 2.14 | 1.70 | 1.83 | 2.35 | 1.50 | 1.35 | 2.49 | 2.51 | 2.75 | 1.14 | 0.95 | 1.01 | 4.23 | 2.98 | 3.44 | 2.39 | 1.58 | 1.33 | 1.33 | 3.52 | 1.06 | 1.08 |
| 10 | 296 | 0.05 | 2.79 | 2.44 | 2.02 | 1.91 | 2.27 | 2.05 | 2.24 | 2.05 | 2.04 | 1.94 | 3.22 | 2.86 | 2.16 | 2.16 | 1.84 | 1.86 | 1.82 | 1.74 | 3.45 | 2.69 | 2.69 | 1.50 | 4.20 | 3.91 |
# For example, given a distance matrix "distance" generated by the function dist(),
# the base R function hclust() can be used to create the hierarchical tree.
# Do not use the ward.D method (it does not correctly implement Ward's distance).
travel2 Multiple Imputation Data Set
Travel_Sample_1 <- travel2 %>% sample_n(1000)
Travel_Sample_2 <- travel2 %>% sample_n(1000)
Travel_Sample_3 <- travel2 %>% sample_n(1000)
Travel_Sample_4 <- travel2 %>% sample_n(1000)
Travel_Sample_5 <- travel2 %>% sample_n(1000)
Distance_1 <- get_dist(Travel_Sample_1)
Distance_2 <- get_dist(Travel_Sample_2)
Distance_3 <- get_dist(Travel_Sample_3)
Distance_4 <- get_dist(Travel_Sample_4)
Distance_5 <- get_dist(Travel_Sample_5)
hclus1 <- hclust(d=Distance_1, method="ward.D2")
hclus2 <- hclust(d=Distance_2, method="ward.D2")
hclus3 <- hclust(d=Distance_3, method="ward.D2")
hclus4 <- hclust(d=Distance_4, method="ward.D2")
hclus5 <- hclust(d=Distance_5, method="ward.D2")
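The choice of `ward.D2` over `ward.D` can be checked directly. As a sketch under the assumption of plain Euclidean distances, `ward.D2` on the raw distances should give the same tree as `ward.D` on the squared distances, with merge heights related by a square root. The matrix `x` is hypothetical toy data, not a sample from `travel2`.

```r
set.seed(7)
# Hypothetical stand-in for one sample (kept small so it runs instantly)
x <- matrix(rnorm(200), ncol = 4)
d <- dist(x)  # Euclidean distances

h_d2 <- hclust(d,   method = "ward.D2")  # Ward's criterion on raw distances
h_d  <- hclust(d^2, method = "ward.D")   # ward.D needs squared distances as input

# Heights agree after a square root, and the 4-group cuts are the same
same_heights <- isTRUE(all.equal(h_d2$height, sqrt(h_d$height)))
same_groups  <- identical(cutree(h_d2, k = 4), cutree(h_d, k = 4))
c(same_heights = same_heights, same_groups = same_groups)
```

Passing unsquared distances to `ward.D` would therefore optimise a different criterion, which is why the comment in the code warns against it.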
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
Sample 1 indicates 4 clear clusters as optimal, but one could also infer 6 clusters that merge into those 4. We have already run k-means clustering on the full data set using both values.
Sample 2 shows 3 clear clusters, a value we have also run with k-means (up to a maximum of 9 clusters). This dendrogram could also be read as showing 4 to 5 clusters.
Sample 3 looks very similar to Sample 1: we can clearly interpret 4 clusters and infer 6.
In Sample 4, 4 or 6 clusters are easy to identify. It is worth noting that 2 to 3 clusters can be read from the dendrogram as well. The NbClust algorithm voted for 2 and 3 clusters by majority rule, so we included those splits in the k-means analysis as well.
Based on the results from the fifth sample, we can conclude that 4 is an optimal number of clusters. Two clusters can also be inferred as optimal, since those two groups are the furthest apart.
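The five-sample procedure above can be written as a loop with a fixed seed so that the same samples, and therefore the same dendrograms, come back on every run. This is a sketch on hypothetical data (`toy` stands in for `travel2`, and 100 rows are drawn per sample instead of the 1000 used in the report). It uses base `dist()`, which computes Euclidean distances.

```r
set.seed(2024)  # assumption: any fixed seed works; it makes the samples repeatable
# Hypothetical stand-in for travel2
toy <- as.data.frame(matrix(runif(500 * 4, 0, 5), ncol = 4))

n_samples <- 5
trees <- lapply(seq_len(n_samples), function(i) {
  rows <- sample(nrow(toy), 100)  # the report draws 1000 rows per sample
  hclust(dist(toy[rows, ]), method = "ward.D2")
})

# Sizes of the groups obtained by cutting each sample's tree at k = 4
cut_sizes <- sapply(trees, function(tr) as.vector(table(cutree(tr, k = 4))))
cut_sizes
```

Comparing the columns of `cut_sizes` gives a rough numerical check on whether the samples agree, alongside the visual reading of each dendrogram.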
We then perform hierarchical cluster analysis on the full travel2 Multiple Imputation data set. Based on our samples, we cut the tree at 4 and 6 clusters.
mi_distance1 <- get_dist(travel2)
# Ward's method
hclus_mi1 <- hclust(d=mi_distance1, method="ward.D2")
# Cut tree into 4 and 6 groups
hclus_group_4 <- cutree(hclus_mi1, k = 4)
hclus_group_6 <- cutree(hclus_mi1, k = 6)
# Visualize results in scatter plot
fviz_cluster(list(data = travel2, cluster = hclus_group_4),
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07", "#00FF00"),
             ellipse.type = "convex", # Convex hull around each cluster
             repel = FALSE, # Allow label overplotting (repel = TRUE avoids it but is slow)
             #main = "Travel Ratings Cluster Plot",
             #sub = "Multiple Imputed Data Set with Four Subgroups",
             show.clust.cent = FALSE, ggtheme = theme_minimal())
Travel_Rating_Clusters <- Travel_Rating_Clusters %>%
dplyr::mutate(Hier_Cluster_Group_4 = hclus_group_4,
Hier_Cluster_Group_6 = hclus_group_6)
| Hier_Cluster_Group_4 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2047 | 0.38 | 3.33 | 2.65 | 4.25 | 3.58 | 2.67 | 3.90 | 2.44 | 2.44 | 3.13 | 2.54 | 1.32 | 1.61 | 2.10 | 1.94 | 2.31 | 1.77 | 1.91 | 1.52 | 1.16 | 0.74 | 0.74 | 0.92 | 0.88 | 0.76 |
| 2 | 1072 | 0.20 | 2.07 | 1.94 | 2.09 | 1.93 | 1.58 | 2.29 | 1.56 | 1.60 | 2.55 | 2.72 | 2.41 | 1.94 | 1.55 | 1.45 | 2.61 | 2.46 | 2.30 | 2.23 | 1.74 | 2.45 | 2.45 | 1.80 | 1.86 | 1.87 |
| 3 | 1853 | 0.34 | 4.22 | 4.43 | 2.40 | 3.04 | 3.70 | 2.19 | 3.05 | 3.35 | 1.98 | 2.22 | 2.63 | 2.26 | 1.85 | 2.01 | 1.08 | 1.10 | 1.01 | 1.19 | 1.09 | 0.73 | 0.73 | 0.85 | 0.75 | 0.67 |
| 4 | 484 | 0.09 | 2.92 | 2.76 | 1.56 | 1.60 | 2.95 | 1.56 | 3.30 | 1.95 | 1.59 | 1.10 | 3.90 | 4.92 | 4.55 | 4.28 | 0.89 | 0.94 | 0.86 | 0.92 | 0.75 | 1.40 | 1.40 | 0.89 | 0.83 | 0.98 |
| Hier_Cluster_Group_6 | Cluster_Size | Cluster_Percent | Malls | Restaurants | Theatres | Museums | Pubs_Bars | Parks | LocalServices | Zoo | Beaches | Resorts | ArtGalleries | JuiceBars | Hotels_OtherLodgings | Burger_PizzaShops | ViewPoints | Gardens | Monuments | Churches | DanceClubs | Bakeries | BeautySpas | Cafes | SwimmingPools | Gyms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 941 | 0.17 | 3.65 | 2.57 | 4.35 | 3.92 | 2.54 | 3.74 | 2.08 | 2.40 | 3.47 | 2.88 | 1.45 | 1.73 | 1.90 | 1.89 | 0.91 | 1.09 | 0.98 | 1.38 | 1.23 | 0.80 | 0.80 | 0.83 | 0.83 | 0.78 |
| 2 | 1106 | 0.20 | 3.07 | 2.72 | 4.17 | 3.28 | 2.79 | 4.04 | 2.75 | 2.47 | 2.84 | 2.25 | 1.21 | 1.51 | 2.27 | 1.99 | 3.46 | 2.34 | 2.69 | 1.63 | 1.10 | 0.71 | 0.71 | 0.98 | 0.91 | 0.74 |
| 3 | 1072 | 0.20 | 2.07 | 1.94 | 2.09 | 1.93 | 1.58 | 2.29 | 1.56 | 1.60 | 2.55 | 2.72 | 2.41 | 1.94 | 1.55 | 1.45 | 2.61 | 2.46 | 2.30 | 2.23 | 1.74 | 2.45 | 2.45 | 1.80 | 1.86 | 1.87 |
| 4 | 1364 | 0.25 | 4.55 | 4.43 | 2.47 | 3.28 | 3.45 | 2.23 | 2.35 | 3.26 | 2.14 | 2.51 | 3.12 | 2.61 | 1.85 | 1.90 | 0.96 | 1.05 | 1.00 | 1.16 | 1.12 | 0.75 | 0.75 | 0.89 | 0.72 | 0.68 |
| 5 | 484 | 0.09 | 2.92 | 2.76 | 1.56 | 1.60 | 2.95 | 1.56 | 3.30 | 1.95 | 1.59 | 1.10 | 3.90 | 4.92 | 4.55 | 4.28 | 0.89 | 0.94 | 0.86 | 0.92 | 0.75 | 1.40 | 1.40 | 0.89 | 0.83 | 0.98 |
| 6 | 489 | 0.09 | 3.30 | 4.44 | 2.19 | 2.34 | 4.40 | 2.09 | 4.99 | 3.61 | 1.55 | 1.43 | 1.24 | 1.28 | 1.86 | 2.33 | 1.45 | 1.26 | 1.04 | 1.26 | 1.02 | 0.65 | 0.65 | 0.69 | 0.82 | 0.64 |
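The two tables above are nested: the groups of 1072 and 484 users appear unchanged in both, while 941 + 1106 = 2047 and 1364 + 489 = 1853 show which 4-group clusters were split. A minimal sketch of how to confirm this kind of nesting with a cross-tabulation follows. It runs on hypothetical random data rather than `travel2`.

```r
set.seed(3)
x  <- matrix(rnorm(300), ncol = 3)  # hypothetical data
hc <- hclust(dist(x), method = "ward.D2")

# Cutting at several k in one call returns a matrix with one column per k
cuts <- cutree(hc, k = c(4, 6))
xtab <- table(k6 = cuts[, "6"], k4 = cuts[, "4"])
xtab

# Each 6-group cluster should fall entirely inside one 4-group cluster
nested <- all(rowSums(xtab > 0) == 1)
nested
```

Rows of `xtab` with a single non-zero cell identify, for each of the six groups, the parent group in the four-cluster solution.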
Note: Understanding cluster profiles are best
clmethods <- c("hierarchical", "kmeans", "pam")
travel_internal <- clValid(travel2, nClust = 3:10, clMethods = clmethods, validation = "internal", maxitems = 5456)
travel_scaled_internal <- clValid(travel3, nClust = 3:10, clMethods = clmethods, validation = "internal", maxitems = 5456)
travel_pcai_internal <- clValid(imp2, nClust = 3:10, clMethods = clmethods, validation = "internal", maxitems = 5456)
travel_pca_internal <- clValid(europca, nClust = 3:10, clMethods = clmethods, validation = "internal", maxitems = 5456)
## Warning: did not converge in 10 iterations
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 9.4603 19.2286 47.4845 85.3508 143.0341 149.1032 233.8627 254.2690
## Dunn 0.2649 0.2649 0.2189 0.1926 0.1292 0.1292 0.1311 0.1411
## Silhouette 0.1734 0.0825 0.0579 0.0287 0.0719 0.0526 0.0823 0.0907
## kmeans Connectivity 833.4464 841.4909 760.7056 885.7698 972.1968 1009.0960 1058.1020 1025.7151
## Dunn 0.0318 0.0332 0.0054 0.0178 0.0218 0.0178 0.0208 0.0208
## Silhouette 0.1336 0.1457 0.1503 0.1464 0.1497 0.1506 0.1589 0.1525
## pam Connectivity 1191.7956 1394.5278 1243.8123 1439.3845 1495.9766 1564.6048 1684.3956 1695.8183
## Dunn 0.0011 0.0022 0.0021 0.0015 0.0088 0.0112 0.0112 0.0115
## Silhouette 0.0978 0.1200 0.1211 0.1236 0.1272 0.1181 0.1191 0.1281
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 9.4603 hierarchical 3
## Dunn 0.2649 hierarchical 3
## Silhouette 0.1734 hierarchical 3
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 7.3702 59.2675 61.0909 62.6865 82.2544 88.2591 111.1655 113.4250
## Dunn 0.2254 0.1548 0.1557 0.1557 0.1560 0.1560 0.1560 0.1560
## Silhouette 0.1829 0.1888 0.1597 0.1582 0.1335 0.1268 0.1160 0.1095
## kmeans Connectivity 688.9484 769.7591 904.3278 801.5972 857.4563 897.3956 902.8179 1026.7782
## Dunn 0.0220 0.0215 0.0168 0.0035 0.0291 0.0094 0.0094 0.0174
## Silhouette 0.1441 0.1404 0.1489 0.1461 0.1487 0.1420 0.1470 0.1458
## pam Connectivity 1309.5131 1386.7464 1422.0921 1462.4032 1676.9393 1774.6433 1682.5694 1712.0758
## Dunn 0.0015 0.0020 0.0020 0.0020 0.0100 0.0107 0.0048 0.0093
## Silhouette 0.0772 0.0997 0.1071 0.1188 0.1243 0.1285 0.1267 0.1282
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 7.3702 hierarchical 3
## Dunn 0.2254 hierarchical 3
## Silhouette 0.1888 hierarchical 4
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 10.2060 60.2282 72.1873 76.4349 79.4167 98.8861 129.0131 135.2357
## Dunn 0.2226 0.1505 0.1505 0.1505 0.1540 0.1540 0.1679 0.1679
## Silhouette 0.2037 0.1939 0.1683 0.1411 0.1301 0.1162 0.1428 0.1382
## kmeans Connectivity 671.9722 762.5563 745.7270 826.9480 1005.8107 985.2929 947.9067 917.1889
## Dunn 0.0156 0.0118 0.0066 0.0131 0.0097 0.0098 0.0275 0.0257
## Silhouette 0.1452 0.1489 0.1482 0.1528 0.1417 0.1453 0.1470 0.1497
## pam Connectivity 993.9496 1263.6901 1252.3071 1297.5603 1431.2468 1665.1766 1677.0290 1654.8706
## Dunn 0.0015 0.0015 0.0113 0.0113 0.0039 0.0041 0.0049 0.0042
## Silhouette 0.1302 0.1133 0.1124 0.1241 0.1156 0.1118 0.1176 0.1203
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 10.2060 hierarchical 3
## Dunn 0.2226 hierarchical 3
## Silhouette 0.2037 hierarchical 3
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 3 4 5 6 7 8 9 10
##
## hierarchical Connectivity 50.7321 103.8397 153.6329 162.0135 175.1313 175.1313 224.2028 262.5607
## Dunn 0.0817 0.0687 0.0746 0.0746 0.0757 0.0757 0.0701 0.0701
## Silhouette 0.2143 0.1911 0.1780 0.1640 0.1491 0.1410 0.1195 0.0948
## kmeans Connectivity 586.2115 914.9972 876.5504 927.1952 1056.4540 962.8099 1044.9456 1081.1587
## Dunn 0.0208 0.0113 0.0100 0.0079 0.0218 0.0079 0.0047 0.0040
## Silhouette 0.1690 0.1232 0.1368 0.1373 0.1566 0.1567 0.1790 0.1867
## pam Connectivity 919.3008 943.7290 1190.4151 1274.4563 1293.6722 1313.8040 1587.2595 1560.2778
## Dunn 0.0041 0.0043 0.0011 0.0035 0.0035 0.0075 0.0035 0.0088
## Silhouette 0.0870 0.1102 0.1359 0.1497 0.1662 0.1823 0.1837 0.1912
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 50.7321 hierarchical 3
## Dunn 0.0817 hierarchical 3
## Silhouette 0.2143 hierarchical 3
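When reading these summaries, note that Connectivity is better when smaller, while Dunn and Silhouette are better when larger. As a rough illustration of what two of these measures compute, the sketch below derives the Dunn index and the average silhouette width by hand for a two-cluster solution. The data are hypothetical and the code is not how `clValid` is implemented internally.

```r
library(cluster)  # silhouette()

set.seed(11)
# Hypothetical data: two well-separated groups in 2-D
x  <- rbind(matrix(rnorm(40, 0, 0.5), ncol = 2),
            matrix(rnorm(40, 5, 0.5), ncol = 2))
d  <- dist(x)
cl <- cutree(hclust(d, method = "ward.D2"), k = 2)

# Dunn index: smallest between-cluster distance / largest within-cluster diameter
dm   <- as.matrix(d)
same <- outer(cl, cl, "==")
dunn_index <- min(dm[!same]) / max(dm[same])

# Average silhouette width over all observations
avg_sil <- mean(silhouette(cl, d)[, "sil_width"])

c(Dunn = dunn_index, Silhouette = avg_sil)
```

Silhouette values near 0.1 to 0.2, as in the output above, indicate heavily overlapping clusters, so the "optimal" scores should be read as relative rankings rather than evidence of strong structure.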