Load Data and Preliminary Analysis

Preliminary analysis showed that many categories were unrated: ratings are given on a scale of 1 to 5, so a value of 0 marks a category the reviewer did not rate. We therefore replaced every 0 with NA. We also observed that the import introduced a 26th column (X) containing only NA values. It added no value to the analysis, so we dropped it.

Europe_Travel_Reviews <- read.csv("C:/Users/willi/Desktop/Georgetown/RStudio Datasource/Travel_Review.csv")
  
Europe_Travel_Reviews <- Europe_Travel_Reviews %>% 
  mutate(LocalServices = as.numeric(LocalServices))

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

###EDA
dim(Europe_Travel_Reviews)

## [1] 5456   26

str(Europe_Travel_Reviews)

## 'data.frame':    5456 obs. of  26 variables:
##  $ UserID              : chr  "User 1" "User 2" "User 3" "User 4" ...
##  $ Churches            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Resorts             : num  0 0 0 0.5 0 0 5 5 5 5 ...
##  $ Beaches             : num  3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
##  $ Parks               : num  3.65 3.65 3.63 3.63 3.63 3.63 3.63 3.63 3.64 3.64 ...
##  $ Theatres            : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ Museums             : num  2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 2.92 ...
##  $ Malls               : num  5 5 5 5 5 5 3.03 5 3.03 5 ...
##  $ Zoo                 : num  2.35 2.64 2.64 2.35 2.64 2.63 2.35 2.63 2.62 2.35 ...
##  $ Restaurants         : num  2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.33 2.32 2.32 ...
##  $ Pubs_Bars           : num  2.64 2.65 2.64 2.64 2.64 2.65 2.64 2.64 2.63 2.63 ...
##  $ LocalServices       : num  1.7 1.7 1.7 1.73 1.7 1.71 1.73 1.7 1.71 1.69 ...
##  $ Burger_PizzaShops   : num  1.69 1.69 1.69 1.69 1.69 1.69 1.68 1.68 1.67 1.67 ...
##  $ Hotels_OtherLodgings: num  1.7 1.7 1.7 1.7 1.7 1.69 1.69 1.69 1.68 1.67 ...
##  $ JuiceBars           : num  1.72 1.72 1.72 1.72 1.72 1.72 1.71 1.71 1.7 1.7 ...
##  $ ArtGalleries        : num  1.74 1.74 1.74 1.74 1.74 1.74 1.75 1.74 0.75 0.74 ...
##  $ DanceClubs          : num  0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.6 0.6 0.59 ...
##  $ Swimming.Pools      : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 ...
##  $ Gyms                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bakeries            : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 ...
##  $ BeautySpas          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cafes               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ViewPoints          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Monuments           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gardens             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X                   : num  NA NA NA NA NA NA NA NA NA NA ...

head(Europe_Travel_Reviews)

##   UserID Churches Resorts Beaches Parks Theatres Museums Malls  Zoo Restaurants
## 1 User 1        0     0.0    3.63  3.65        5    2.92     5 2.35        2.33
## 2 User 2        0     0.0    3.63  3.65        5    2.92     5 2.64        2.33
## 3 User 3        0     0.0    3.63  3.63        5    2.92     5 2.64        2.33
## 4 User 4        0     0.5    3.63  3.63        5    2.92     5 2.35        2.33
## 5 User 5        0     0.0    3.63  3.63        5    2.92     5 2.64        2.33
## 6 User 6        0     0.0    3.63  3.63        5    2.92     5 2.63        2.33
##   Pubs_Bars LocalServices Burger_PizzaShops Hotels_OtherLodgings JuiceBars
## 1      2.64          1.70              1.69                 1.70      1.72
## 2      2.65          1.70              1.69                 1.70      1.72
## 3      2.64          1.70              1.69                 1.70      1.72
## 4      2.64          1.73              1.69                 1.70      1.72
## 5      2.64          1.70              1.69                 1.70      1.72
## 6      2.65          1.71              1.69                 1.69      1.72
##   ArtGalleries DanceClubs Swimming.Pools Gyms Bakeries BeautySpas Cafes
## 1         1.74       0.59            0.5    0      0.5          0     0
## 2         1.74       0.59            0.5    0      0.5          0     0
## 3         1.74       0.59            0.5    0      0.5          0     0
## 4         1.74       0.59            0.5    0      0.5          0     0
## 5         1.74       0.59            0.5    0      0.5          0     0
## 6         1.74       0.59            0.5    0      0.5          0     0
##   ViewPoints Monuments Gardens  X
## 1          0         0       0 NA
## 2          0         0       0 NA
## 3          0         0       0 NA
## 4          0         0       0 NA
## 5          0         0       0 NA
## 6          0         0       0 NA

Replace 0 with NA

Europe_Travel_Reviews[Europe_Travel_Reviews == 0] <- NA
head(Europe_Travel_Reviews)

##   UserID Churches Resorts Beaches Parks Theatres Museums Malls  Zoo Restaurants
## 1 User 1       NA      NA    3.63  3.65        5    2.92     5 2.35        2.33
## 2 User 2       NA      NA    3.63  3.65        5    2.92     5 2.64        2.33
## 3 User 3       NA      NA    3.63  3.63        5    2.92     5 2.64        2.33
## 4 User 4       NA     0.5    3.63  3.63        5    2.92     5 2.35        2.33
## 5 User 5       NA      NA    3.63  3.63        5    2.92     5 2.64        2.33
## 6 User 6       NA      NA    3.63  3.63        5    2.92     5 2.63        2.33
##   Pubs_Bars LocalServices Burger_PizzaShops Hotels_OtherLodgings JuiceBars
## 1      2.64          1.70              1.69                 1.70      1.72
## 2      2.65          1.70              1.69                 1.70      1.72
## 3      2.64          1.70              1.69                 1.70      1.72
## 4      2.64          1.73              1.69                 1.70      1.72
## 5      2.64          1.70              1.69                 1.70      1.72
## 6      2.65          1.71              1.69                 1.69      1.72
##   ArtGalleries DanceClubs Swimming.Pools Gyms Bakeries BeautySpas Cafes
## 1         1.74       0.59            0.5   NA      0.5         NA    NA
## 2         1.74       0.59            0.5   NA      0.5         NA    NA
## 3         1.74       0.59            0.5   NA      0.5         NA    NA
## 4         1.74       0.59            0.5   NA      0.5         NA    NA
## 5         1.74       0.59            0.5   NA      0.5         NA    NA
## 6         1.74       0.59            0.5   NA      0.5         NA    NA
##   ViewPoints Monuments Gardens  X
## 1         NA        NA      NA NA
## 2         NA        NA      NA NA
## 3         NA        NA      NA NA
## 4         NA        NA      NA NA
## 5         NA        NA      NA NA
## 6         NA        NA      NA NA

Europe_Travel_Reviews = Europe_Travel_Reviews[-26] # Drop the all-NA 26th column X (note: one copy of the data did not have this extra column)
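As a minimal sketch of the same cleaning step on toy data (the `toy` data frame and its values are made up for illustration), the zero replacement can be limited to numeric columns and the empty column dropped by content rather than by position. This is one possible way to sidestep the question of whether a given copy of the file has the extra column; it is not the code used in this analysis.

```r
# Toy stand-in for the review data; values are illustrative only
toy <- data.frame(
  UserID   = c("User 1", "User 2", "User 3"),
  Churches = c(0, 1.5, 0),
  Beaches  = c(3.63, 0, 2.10),
  X        = rep(NA_real_, 3)
)

# Replace 0 with NA in numeric columns only (0 means "not rated")
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], function(v) replace(v, which(v == 0), NA))

# Drop any column that is entirely NA, wherever it sits
all_na <- vapply(toy, function(v) all(is.na(v)), logical(1))
toy <- toy[, !all_na, drop = FALSE]

print(toy)
```

Dropping by content means the step is a no-op on a file that never had the extra column, whereas `[-26]` would fail or remove a real column.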

Exploratory Analysis

During exploratory analysis we analyzed Summary Statistics, Missing Values, and Correlation among variables to identify errors, unlock initial patterns, and find interesting relationships among the variables.

Summary Statistics

Our summary statistics give several measures that describe the distribution of each category: the mean rating, minimum, maximum, quartiles, and interquartile range. They also show that most of our variables are strongly skewed, and the Pct.Valid column gives initial insight into which categories have the most unrated observations: any variable with Pct.Valid below 100% has missing values.

summary_stats <- summarytools::descr(Europe_Travel_Reviews, round.digits = 2, transpose = TRUE)
view(summary_stats, method = "render")

## Non-numerical variable(s) ignored: UserID

Descriptive Statistics

Europe_Travel_Reviews

N: 5456

	Mean	Std.Dev	Min	Q1	Median	Q3	Max	MAD	IQR	CV	Skewness	SE.Skewness	Kurtosis	N.Valid	Pct.Valid
ArtGalleries	2.21	1.72	0.50	0.86	1.33	4.44	5.00	0.77	3.58	0.78	0.86	0.03	-1.05	5452	99.93
Bakeries	1.20	1.23	0.50	0.61	0.76	0.93	5.00	0.22	0.32	1.03	2.45	0.04	4.58	4410	80.83
Beaches	2.49	1.25	0.50	1.54	2.06	2.74	5.00	0.85	1.20	0.50	1.09	0.03	-0.12	5452	99.93
BeautySpas	1.20	1.21	0.50	0.61	0.74	0.92	5.00	0.21	0.31	1.01	2.43	0.04	4.59	4560	83.58
Burger_PizzaShops	2.08	1.25	0.78	1.29	1.69	2.29	5.00	0.73	1.00	0.60	1.39	0.03	0.82	5455	99.98
Cafes	1.09	0.92	0.50	0.64	0.80	1.05	5.00	0.28	0.41	0.84	3.04	0.04	9.01	4852	88.93
Churches	1.51	0.79	0.50	0.97	1.35	1.86	5.00	0.64	0.89	0.53	1.99	0.03	5.55	5261	96.43
DanceClubs	1.22	1.10	0.50	0.71	0.81	1.17	5.00	0.27	0.46	0.91	2.72	0.03	6.32	5344	97.95
Gardens	1.63	1.15	0.50	0.92	1.31	1.69	5.00	0.58	0.77	0.71	2.00	0.03	3.13	5230	95.86
Gyms	1.01	0.96	0.50	0.61	0.74	0.89	5.00	0.21	0.28	0.95	3.48	0.04	11.35	4439	81.36
Hotels_OtherLodgings	2.13	1.41	0.77	1.19	1.61	2.36	5.00	0.77	1.17	0.66	1.26	0.03	0.11	5456	100.00
JuiceBars	2.19	1.58	0.76	1.03	1.49	2.74	5.00	0.83	1.71	0.72	1.03	0.03	-0.65	5456	100.00
LocalServices	2.55	1.38	0.78	1.58	2.00	3.22	5.00	0.99	1.64	0.54	0.82	0.03	-0.72	5455	99.98
Malls	3.35	1.41	1.12	1.93	3.23	5.00	5.00	2.37	3.07	0.42	0.02	0.03	-1.60	5456	100.00
Monuments	1.62	1.30	0.26	0.84	1.10	1.65	5.00	0.46	0.81	0.80	1.76	0.03	1.80	5154	94.46
Museums	2.89	1.28	1.11	1.79	2.68	3.84	5.00	1.33	2.05	0.44	0.56	0.03	-1.07	5456	100.00
Parks	2.80	1.31	0.83	1.73	2.46	4.10	5.00	1.19	2.36	0.47	0.71	0.03	-0.98	5456	100.00
Pubs_Bars	2.83	1.31	0.81	1.64	2.68	3.53	5.00	1.51	1.89	0.46	0.52	0.03	-0.93	5456	100.00
Resorts	2.36	1.40	0.50	1.37	1.97	2.72	5.00	0.95	1.35	0.59	0.93	0.03	-0.43	5366	98.35
Restaurants	3.13	1.36	0.84	1.80	2.80	5.00	5.00	1.68	3.20	0.43	0.27	0.03	-1.39	5456	100.00
Swimming.Pools	1.04	0.97	0.50	0.61	0.76	0.95	5.00	0.24	0.34	0.93	3.41	0.03	10.84	4977	91.22
Theatres	2.96	1.34	1.12	1.77	2.67	4.31	5.00	1.48	2.54	0.45	0.49	0.03	-1.27	5456	100.00
ViewPoints	1.87	1.58	0.50	0.78	1.07	2.20	5.00	0.55	1.42	0.85	1.19	0.03	-0.26	5111	93.68
Zoo	2.54	1.11	0.86	1.62	2.17	3.19	5.00	1.05	1.57	0.44	0.77	0.03	-0.36	5456	100.00

Generated by summarytools 0.9.9 (R version 4.1.0)
2021-06-30

Average User Ratings by Category

The 5 highest rated attractions in Europe were Malls (3.35), Restaurants (3.13), Theatres (2.96), Museums (2.89), and Pubs/Bars (2.83). The 5 lowest rated attractions were Gyms (1.01), Swimming Pools (1.04), Cafes (1.09), Beauty Spas (1.20), and Bakeries (1.20). A clear distinction between the top 5 and the bottom 5 categories is their completion rate, which indicates the presence or absence of missing values. Every reviewer in our sample rated all of the top 5 attractions (100% valid). The completion rate for the lowest rated attractions, however, ranged from about 81% to 91%. These numbers suggest a couple of things: 1) travelers frequented the top 5 attractions more than they frequented the bottom 5 attractions, and 2) their experiences at these places were, on average, rated higher than the rest of the categories in the data set.

# Average User Rating Bar Chart
ggplot(summary_stats_table) +
  aes(x = reorder(Variable, Mean), y = Mean) +
  geom_bar(position="dodge",stat="identity", fill = "#0c4c8a") +
  coord_flip() +
  labs(title = "Average User Rating by Category",
       x = "Variables", y = "Mean")+
  theme_minimal()
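The chart code reads from `summary_stats_table`, which is not defined in the code shown here. A hypothetical base R construction is sketched below on toy data, assuming the object is a data frame with one row per category and columns named `Variable`, `Mean`, and `Pct.Valid`. The real object was presumably derived from the summarytools output instead.

```r
# Toy ratings with NA marking unrated categories; values are illustrative only
toy <- data.frame(
  Malls    = c(5, 3.03, 5, 4.2),
  Gyms     = c(NA, 0.61, NA, 0.89),
  Bakeries = c(0.5, 0.5, NA, 0.76)
)

# One row per category: mean of the observed ratings and share of rated rows
summary_stats_table <- data.frame(
  Variable  = names(toy),
  Mean      = round(colMeans(toy, na.rm = TRUE), 2),
  Pct.Valid = round(100 * colMeans(!is.na(toy)), 2),
  row.names = NULL
)

print(summary_stats_table)
```

A table in this shape can be passed to `ggplot()` with `reorder(Variable, Mean)` on the x axis, as in the chart code above.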

Skewness by Category

The top 5 categories by average rating had little to no skewness, indicating they have the most symmetric distributions of all the variables. Positive (right) skewness, where most ratings sit at the low end of the scale with a tail toward 5, begins to pick up between Parks and Art Galleries and then increases sharply from Resorts through to Gyms, indicating that most of our variables are highly right-skewed. This is important to note because clustering algorithms work best with relatively independent continuous variables with low skewness. Given the skewness of our variables, standardization and Principal Component Analysis become very important in building our clusters.
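To make the sign convention concrete, the sketch below computes a moment-based skewness on two made-up rating vectors. The `skewness()` helper is written here for illustration and is only one common definition; packages such as summarytools may apply a slightly different formula or correction.

```r
# Moment-based sample skewness (illustrative helper, not a package function)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Mostly low ratings with a few 5s, similar in shape to the low-rated categories
low_rated <- c(0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 5, 5)
# Evenly spread ratings
symmetric <- c(1, 2, 3, 4, 5)

skewness(low_rated)  # positive: long tail toward high ratings
skewness(symmetric)  # zero: symmetric around the mean
```

Positive values of the kind reported in the Skewness column therefore correspond to ratings bunched at the low end with a long tail toward 5.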

# Skewness By Variable
ggplot(summary_stats_table) +
  aes(x = reorder(Variable, Skewness), y = Skewness) +
  geom_bar(position="dodge",stat="identity", fill = "#2F4F4F") +
  coord_flip() +
  labs(title = "Skewness by Category",
       x = "Variables", y = "Skewness")+
  theme_minimal()

Missing Values

Check Missing Values by Unique Observations and Total Values

The sample population consists of 5,456 unique Travel Reviewers and their average ratings across 24 categories in Europe. About 68%, or 3,724, of the Travel Reviewers had complete ratings across all 24 categories in the data set. Conversely, 1,732 reviewers, or 32% of the sample population, accounted for 5,322 missing values, which is approximately 4% of the total 136,400 values in the data set.

The presence of a missing value indicates the reviewer did not provide a rating for a specific category. Based on the overall distribution of missing values, we can surmise that they are heavily concentrated across a few categories with some overlapping patterns of nullity. Using the naniar and VIM packages we will assess the source and concentrations of non-response ratings across the data set.

Data.Category	Unique.Observations	Percent.Observations	Total.Values	Percent.Values
Missing	1,732	31.74%	5,322	3.90%
Complete	3,724	68.26%	131,078	96.10%
Total	5,456	100.00%	136,400	100.00%
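The figures in this table follow from simple counts: 5,456 rows × 25 columns (UserID included) gives 136,400 values, and 5,322 / 136,400 ≈ 3.90%. The code that produced the table is not shown, so the following is only a sketch of how such counts could be obtained in base R, using a made-up `toy` data frame.

```r
# Toy data; NA marks an unrated category
toy <- data.frame(
  UserID   = c("User 1", "User 2", "User 3", "User 4"),
  Gyms     = c(NA, 0.61, NA, 0.89),
  Bakeries = c(0.5, NA, NA, 0.76),
  Malls    = c(5, 3.03, 5, 4.2)
)

rows_with_missing <- sum(!complete.cases(toy))  # reviewers with at least one gap
total_missing     <- sum(is.na(toy))            # individual missing ratings
total_values      <- nrow(toy) * ncol(toy)      # every cell, UserID included

missing_summary <- data.frame(
  Data.Category       = c("Missing", "Complete"),
  Unique.Observations = c(rows_with_missing, nrow(toy) - rows_with_missing),
  Total.Values        = c(total_missing, total_values - total_missing)
)

print(missing_summary)
```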

Missing Values by Category

Certain categories had a substantial number of missing values. Overall, 15 of the 25 variables (60%) in the data set had at least one missing value, that is, at least one reviewer who gave no rating. The top 5 categories with missing values were Bakeries (19.17%), Gyms (18.64%), Beauty Spas (16.42%), Cafes (11.07%), and Swimming Pools (8.78%). Notably, these same 5 categories had the lowest average ratings and were among the most skewed distributions in the data set.

# Missing Value Summaries and Visualizations 
missing_values_summary <- miss_var_summary(Europe_Travel_Reviews, order = TRUE) # Missing Values by Category
formattable(missing_values_summary) # Missing Value Output Table

variable	n_miss	pct_miss
Bakeries	1046	19.17155425
Gyms	1017	18.64002933
BeautySpas	896	16.42228739
Cafes	604	11.07038123
Swimming.Pools	479	8.77932551
ViewPoints	345	6.32331378
Monuments	302	5.53519062
Gardens	226	4.14222874
Churches	195	3.57404692
DanceClubs	112	2.05278592
Resorts	90	1.64956012
Beaches	4	0.07331378
ArtGalleries	4	0.07331378
LocalServices	1	0.01832845
Burger_PizzaShops	1	0.01832845
UserID	0	0.00000000
Parks	0	0.00000000
Theatres	0	0.00000000
Museums	0	0.00000000
Malls	0	0.00000000
Zoo	0	0.00000000
Restaurants	0	0.00000000
Pubs_Bars	0	0.00000000
Hotels_OtherLodgings	0	0.00000000
JuiceBars	0	0.00000000

missing_value_plot <- gg_miss_var(Europe_Travel_Reviews) # Missing Value Plot by Variable
missing_value_perc <- gg_miss_var(Europe_Travel_Reviews, show_pct = TRUE) + ylim(0,25) # Missing Value Percentage Plot by Variable

grid.arrange(missing_value_plot,missing_value_perc) # Arrange Plots in the same output

## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use
## `guide = "none"` instead.

## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please use
## `guide = "none"` instead.

Relationship Among Missing Values

The graphic below shows the top 5 variables with the most missing values, ordered by the size of their nullity in the data set. We can see from the visualization that the combination of Bakeries, Gyms, and Beauty Spas has the most frequent intersection of missing values. The second most frequent intersection is the combination of Bakeries, Gyms, and Swimming Pools. From the chart, the missing values are most probably MNAR (Missing Not At Random): travelers who did not rate one of the top categories with missing values were also likely not to rate another of those categories. This suggests these are places that certain pockets of travelers did not visit while they were in Europe.
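The overlap in missingness can also be tabulated directly. The sketch below uses a made-up `toy` data frame and base R only; it is not the naniar or VIM code behind the graphic, just an illustration of counting which combinations of variables are missing together.

```r
# Toy data; NA marks an unrated category
toy <- data.frame(
  Bakeries   = c(NA, NA, 0.5, NA, 0.7),
  Gyms       = c(NA, NA, 0.6, 0.8, NA),
  BeautySpas = c(NA, 0.9, 0.7, NA, NA)
)

miss <- is.na(toy)

# Label each row by the set of variables that are missing in it
pattern <- apply(miss, 1, function(r) {
  if (any(r)) paste(names(toy)[r], collapse = " + ") else "none"
})
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)

# Share of rows missing Gyms among the rows that are missing Bakeries
gyms_given_bakeries <- mean(miss[miss[, "Bakeries"], "Gyms"])
gyms_given_bakeries
```

A conditional share well above the overall missing rate for a variable is the kind of pattern that points away from values being missing completely at random.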

Approach to Handling Missing Values & Analyzing the Data

Since traditional Principal Component Analysis (PCA) and clustering methodologies do not work with missing values, we tried two different approaches to handling nullity and assessed which had a better impact on our results. For the remainder of our analysis we compare and contrast the results from each method, and ultimately select the model(s) that produce the best results.

Approach I: Multiple Imputation of Missing Values

In the first approach we leveraged the MICE package to impute the missing values in the data set (the code below generates a single imputed data set, m = 1). Since the missing value percentages for Bakeries and Gyms approach 20%, imputing these values risks introducing bias into our subsequent models. After imputing the missing values, we performed Principal Component Analysis on the imputed data set to find correlations among variables, and then used the actual variables to create clusters of Travel Reviewers.

Approach II: Leverage missMDA Package to Perform PCA with Missing Values

The missMDA package allows principal component analysis to be performed on data sets with missing values. It uses an iterative PCA algorithm for a pre-defined number of dimensions to predict the missing values. The PCA is then performed on the imputed data set. Since it is based on a principal component method, it accounts for the similarities between observations and the relationships between variables. missMDA imputes missing values in such a way that the imputed values have no weight on the PCA results. We then used our PCA to get an idea of how our clusters would be formed, and used the PCs to actually create our clusters.
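The idea behind iterative PCA imputation can be sketched in base R. The `iterative_pca_impute()` function below is a simplified, hypothetical illustration written for this purpose: it is unregularized and does not rescale the variables, so it should not be read as the missMDA implementation. It starts from column means, fits a low-rank PCA reconstruction, overwrites only the missing cells with the reconstructed values, and repeats.

```r
# Simplified iterative PCA imputation (illustrative, not the missMDA algorithm)
iterative_pca_impute <- function(X, ncp = 2, max_iter = 100, tol = 1e-6) {
  X <- as.matrix(X)
  miss <- is.na(X)

  # Initial guess: fill each missing cell with its column mean
  col_means <- colMeans(X, na.rm = TRUE)
  X_hat <- X
  X_hat[miss] <- col_means[col(X)[miss]]

  for (i in seq_len(max_iter)) {
    # Rank-ncp reconstruction of the centered, currently completed data
    mu <- colMeans(X_hat)
    s <- svd(sweep(X_hat, 2, mu), nu = ncp, nv = ncp)
    fitted <- s$u %*% diag(s$d[seq_len(ncp)], ncp, ncp) %*% t(s$v)
    fitted <- sweep(fitted, 2, mu, "+")

    # Update only the missing cells; observed values are never changed
    X_new <- X_hat
    X_new[miss] <- fitted[miss]
    change <- sum((X_new - X_hat)^2)
    X_hat <- X_new
    if (change < tol) break
  }
  X_hat
}

# Toy data with a two-dimensional structure and three cells removed
set.seed(1)
scores <- matrix(rnorm(40), ncol = 2)
loadings <- matrix(rnorm(10), nrow = 2)
toy <- scores %*% loadings + matrix(rnorm(100, sd = 0.1), 20, 5)
toy_missing <- toy
toy_missing[cbind(c(2, 7, 15), c(1, 3, 5))] <- NA

completed <- iterative_pca_impute(toy_missing, ncp = 2)
```

Because the filled-in cells are, at convergence, exactly what the low-rank model predicts, they contribute no residual to the fit, which is the sense in which imputed values carry no weight in the PCA.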

Approach I

Create Separate Data Set & Rename Variables

# Create Separate Data Set for Approach I 
travel <- as_tibble(Europe_Travel_Reviews)
travel = dplyr::rename(travel, Pool = `Swimming.Pools`)

Leverage MICE package to impute for missing values

# Here we will use the mice package that will help us do the imputation (we are going for 1 imputation, m=1)
impute_travel<-mice(travel,m=1,seed = 1111)

## 
##  iter imp variable
##   1   1  Churches  Resorts  Beaches  LocalServices  Burger_PizzaShops  ArtGalleries  DanceClubs  Pool  Gyms  Bakeries  BeautySpas  Cafes  ViewPoints  Monuments  Gardens
##   2   1  Churches  Resorts  Beaches  LocalServices  Burger_PizzaShops  ArtGalleries  DanceClubs  Pool  Gyms  Bakeries  BeautySpas  Cafes  ViewPoints  Monuments  Gardens
##   3   1  Churches  Resorts  Beaches  LocalServices  Burger_PizzaShops  ArtGalleries  DanceClubs  Pool  Gyms  Bakeries  BeautySpas  Cafes  ViewPoints  Monuments  Gardens
##   4   1  Churches  Resorts  Beaches  LocalServices  Burger_PizzaShops  ArtGalleries  DanceClubs  Pool  Gyms  Bakeries  BeautySpas  Cafes  ViewPoints  Monuments  Gardens
##   5   1  Churches  Resorts  Beaches  LocalServices  Burger_PizzaShops  ArtGalleries  DanceClubs  Pool  Gyms  Bakeries  BeautySpas  Cafes  ViewPoints  Monuments  Gardens

## Warning: Number of logged events: 1

travel<-complete(impute_travel,1)

Re-Check for Missing Values

n_miss(travel)

## [1] 0

Exploratory Analysis Continued: BoxPlots

We use Box Plots to explore the distribution of our Travel Review Categories.

## Warning in melt(travel): The melt generic in data.table has been passed a
## data.frame and will attempt to redirect to the relevant reshape2 method;
## please note that reshape2 is deprecated, and this redirection is now
## deprecated as well. To continue using melt methods from reshape2 while both
## libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(travel). In the next version, this warning will become an error.

## Using UserID as id variables

The box plots confirm the right skewness of our variables: most ratings are concentrated at the low end of the scale. We also see that our lowest rated categories have a number of outliers, because their average ratings are below 1.5. With averages this low, higher ratings toward 5 are treated as outliers within these variables. We see this effect on the right side of the box plots for Swimming Pools, Gyms, Bakeries, Beauty Spas, and Cafes. View Points, Monuments, and Gardens show the same sort of distribution but to a lesser degree, as their averages are slightly higher than those of the aforementioned variables.
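A short worked calculation shows why so many high ratings appear as outlier points. The plotting code is not shown, so this sketch assumes the common default of drawing points beyond 1.5 × IQR from the quartiles, and it uses the Bakeries quartiles from the summary statistics table (computed before imputation, so the plotted values may differ slightly).

```r
# Quartiles for Bakeries as reported in the summary statistics table
q1 <- 0.61
q3 <- 0.93

iqr <- q3 - q1                  # interquartile range: 0.32
upper_fence <- q3 + 1.5 * iqr   # ratings above this are drawn as outlier points
lower_fence <- q1 - 1.5 * iqr   # ratings below this are drawn as outlier points

upper_fence
lower_fence
```

Under this assumption any Bakeries rating above roughly 1.41 is plotted as an outlier, even though it is an ordinary value on a 5-point scale.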

Remove UserID from Analysis

In order to perform correlation analysis, PCA, and clustering, we need to remove UserID from the data set in both approaches. We will merge UserID back to the PCs and Clusters for further analysis and interpretation. In the first approach, we only need to remove UserID itself before performing correlation analysis because the missing values have been imputed. Since we are using missMDA in the second approach we have to remove both UserID and the missing values in order to perform correlation analysis.

# Approach I: Remove UserID
travelvar <- travel[-1]

# Approach II: Remove UserID to isolate numeric variables, and create a data set with missing values removed to perform correlation analysis.
Travel_Reviews_Vars <- subset(Europe_Travel_Reviews, select = -c(UserID))
Travel_Review_Vars_NA_Removed <- na.omit(Travel_Reviews_Vars)
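Removing incomplete rows with `na.omit()` discards every reviewer who has at least one missing rating (1,732 of 5,456 here). The sketch below, on a made-up `toy` data frame, contrasts that listwise deletion with pairwise deletion in `cor()`. The pairwise option is shown only as an alternative to be aware of, not as the method used in this analysis.

```r
# Toy data; Gyms is missing for half of the rows
toy <- data.frame(
  Theatres = c(5, 4.2, 3.1, 2.0, 1.5, 4.8),
  Parks    = c(4.9, 4.0, 3.3, 2.2, 1.1, 4.5),
  Gyms     = c(NA, 0.6, NA, 0.9, 0.7, NA)
)

# Listwise deletion: same result as calling na.omit() before cor()
cor_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair uses all rows where both variables are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

nrow(na.omit(toy))                  # rows surviving listwise deletion
cor_complete["Theatres", "Parks"]   # based on 3 rows
cor_pairwise["Theatres", "Parks"]   # based on all 6 rows
```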

Correlation Analysis

Approach I

## corrplot 0.89 loaded

The first approach and visualizations do not show much correlation. The colors and shades tend to blend together without offering much information.

Approach II

When we add the correlation coefficient values to the visualization in the second approach, some relationships begin to emerge. Six of the twenty-four Travel Review categories exhibit moderate correlation, but the patterns within them were not obvious: the correlated variables tended to increase with one another, but the reasons why were unclear.

Theatres and Parks had the highest positive correlation among variables at 0.62. The remaining correlations of any significance were driven by Restaurants and Zoo. The highest correlation among this group was between Restaurants and Pubs/Bars at 0.57, followed by Restaurants and Zoo at 0.56, Pubs/Bars and Zoo at 0.54, Restaurants and Malls at 0.54, and Malls and Zoo at 0.53. There are a couple of ways to think about these correlations:

1. We have complementary categories of attractions, either by function or by proximity of activity. That is, travelers who enjoy theatres also enjoy parks, or theatres and parks of good quality are frequented one after the other because they are in the same destination. The same reasoning could be applied to Restaurants and Pubs/Bars, or to any of the other correlated variables.
2. Correlations among variables are coincidental and not behavioral.

To check our thought process we further isolate the correlated variables and use the GGally package to explore the relationships among the correlated variables. In doing so, we see that the variables are likely not correlated by chance and have some sort of relationship. We will be able to further unpack them through PCA and Clustering.

Principal Component Analysis (PCA)

We conducted PCA using two different approaches. The first approach used the imputed data set travel. The second approach used the missMDA package to perform PCA on the Travel Reviews data set with missing values. Both approaches produced similar results, with the missMDA approach accounting for slightly more variance in the top 4 PCs.

Step 1: Run Principal Component Analysis (PCA) using Approach I and Approach II

Approach I: PCA on Multiple Imputed Data Set

# Create travel2: the imputed data set without UserID
travel2 = subset(travel, select = -UserID)
UserID = travel$UserID

# Run PCA model using prcomp on the standardized variables
# (travelvar was created above as travel[-1], so it holds the same columns as travel2)
pca2 = prcomp(travelvar, scale. = TRUE)

Check for multicollinearity

#car::vif(travel2)

Combine the UserID variable to the principal components

pcs = as.data.frame(pca2$x)
combdata = cbind(UserID, pcs)
head(combdata)

##   UserID         PC1       PC2        PC3          PC4        PC5        PC6
## 1 User 1  0.25756185 -1.789616 -0.9580841 -0.192729145 -0.1637141 -0.1689986
## 2 User 2  0.12880935 -2.222705 -0.7237128 -0.002998196  0.1049917  0.9272178
## 3 User 3 -2.46073610 -2.732965  0.6205651 -1.267083471  0.8886820 -1.6244595
## 4 User 4  0.74430117 -1.736327 -1.2054701  0.072787200 -0.2285305  0.3670751
## 5 User 5 -0.67788008 -2.346384 -0.7699898  0.110948757  0.3575986 -1.6762360
## 6 User 6 -0.01122303 -2.237991 -1.1776180  0.635360853  0.2439735 -1.1769982
##          PC7         PC8         PC9       PC10        PC11       PC12
## 1 -0.9849344 -0.17091907 -0.47475534  0.2658401  0.94242004  0.4890931
## 2 -0.1548428  0.05516802 -0.15935148  0.7881863 -0.21399876 -0.2357170
## 3  0.3889483 -2.37687902  0.99030760  0.4474360 -0.12090983  1.2151787
## 4 -1.0535964  0.30211174 -0.58221007  0.1977566  0.70843944  0.3405807
## 5  1.1326676 -1.19866729  0.24265698 -0.2381035  0.24202397  0.5727746
## 6  1.2738029 -0.64525316 -0.01967841 -0.3594399  0.08633248  0.1219503
##          PC13        PC14        PC15        PC16       PC17        PC18
## 1 -0.10087228  0.09758711  0.49732621 -0.39825880  1.0207123 -0.30912746
## 2  0.05067522 -0.47260099 -0.78604631 -0.19065273 -0.6954410  0.38309404
## 3 -0.11517069  0.39931074 -0.28802813  0.45756285 -0.9684179  0.96799956
## 4 -0.48650812  0.35934738  0.36771998 -0.39478850  0.6837842 -0.04165999
## 5  0.29251999 -0.35219394 -0.03597081 -0.03936501  0.3042821  0.29726619
## 6  0.23998190 -0.63198858 -0.06401170 -0.26079033  0.3497699  0.24802461
##        PC19      PC20       PC21       PC22      PC23      PC24
## 1 0.8385173 0.7400919  0.4230716 -0.1393456 0.4011609 0.4464884
## 2 1.1599972 0.8162189 -0.2166374 -1.3886077 0.2832106 0.5927382
## 3 0.6849134 0.2590930 -0.4005175 -0.5439907 0.7474203 0.5954807
## 4 0.7868757 0.7450735  0.3684155 -0.4880521 0.3334450 0.4370365
## 5 0.4788189 0.7045643  0.4681104 -0.1250196 0.5823903 0.3068787
## 6 0.4970208 0.8275216  0.6475252 -0.3225851 0.4583333 0.2468726
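The comparison of how much variance the top PCs account for relies on the variance explained by each component, which can be read from the `sdev` element of a `prcomp` object. A minimal sketch on simulated data (the `toy` data frame and its columns are made up for illustration):

```r
# Simulated data: columns a and b share a common signal, c and d are noise
set.seed(42)
n <- 200
signal <- rnorm(n)
toy <- data.frame(
  a = signal + rnorm(n, sd = 0.5),
  b = signal + rnorm(n, sd = 0.5),
  c = rnorm(n),
  d = rnorm(n)
)

# PCA on standardized variables, as in the analysis above
pca_toy <- prcomp(toy, scale. = TRUE)

# Proportion of variance per component and its running total
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
cum_var <- cumsum(var_explained)

round(rbind(var_explained, cum_var), 3)
```

Applied to `pca2`, the fourth entry of the cumulative vector would give the share of variance captured by the top 4 PCs for Approach I.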

Approach II: PCA Using missMDA Package

1. Estimate the number of dimensions for Principal Component Analysis (PCA) by K-fold Cross-Validation

nb <- estim_ncpPCA(Travel_Reviews_Vars, scale = TRUE, method.cv = "Kfold", nbsim = 100)

  |                                                                            
  |===================================                                   |  49%
  |                                                                            
  |===================================                                   |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |=====================================                                 |  54%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==========================================                            |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%

nb$ncp

## [1] 5

2. Impute missing values using the iterative PCA algorithm and the optimal number of dimensions from the previous step

res.comp <- imputePCA(Travel_Reviews_Vars, ncp = nb$ncp)
imp <- data.frame(res.comp$completeObs)
head(imp)

##   Churches  Resorts Beaches Parks Theatres Museums Malls  Zoo Restaurants
## 1 1.461173 2.845288    3.63  3.65        5    2.92     5 2.35        2.33
## 2 1.441032 2.828942    3.63  3.65        5    2.92     5 2.64        2.33
## 3 1.441243 2.829215    3.63  3.63        5    2.92     5 2.64        2.33
## 4 1.445697 0.500000    3.63  3.63        5    2.92     5 2.35        2.33
## 5 1.441243 2.829215    3.63  3.63        5    2.92     5 2.64        2.33
## 6 1.441748 2.828332    3.63  3.63        5    2.92     5 2.63        2.33
##   Pubs_Bars LocalServices Burger_PizzaShops Hotels_OtherLodgings JuiceBars
## 1      2.64          1.70              1.69                 1.70      1.72
## 2      2.65          1.70              1.69                 1.70      1.72
## 3      2.64          1.70              1.69                 1.70      1.72
## 4      2.64          1.73              1.69                 1.70      1.72
## 5      2.64          1.70              1.69                 1.70      1.72
## 6      2.65          1.71              1.69                 1.69      1.72
##   ArtGalleries DanceClubs Swimming.Pools      Gyms Bakeries BeautySpas
## 1         1.74       0.59            0.5 0.5135240      0.5  0.9048210
## 2         1.74       0.59            0.5 0.5010227      0.5  0.8778343
## 3         1.74       0.59            0.5 0.5020172      0.5  0.8801363
## 4         1.74       0.59            0.5 0.5915906      0.5  0.7889952
## 5         1.74       0.59            0.5 0.5020172      0.5  0.8801363
## 6         1.74       0.59            0.5 0.5020174      0.5  0.8798243
##       Cafes ViewPoints Monuments  Gardens
## 1 0.7957945   1.801658  1.521236 1.557100
## 2 0.7802937   1.775173  1.498736 1.531745
## 3 0.7815898   1.774185  1.497922 1.531352
## 4 0.7561164   1.851798  1.587643 1.595324
## 5 0.7815898   1.774185  1.497922 1.531352
## 6 0.7821357   1.776209  1.499535 1.532373

3. Run and Analyze PCA on the Imputed Data Set

PCA_Travel_Reviews <- prcomp(imp, scale. = TRUE)
var = get_pca_var(PCA_Travel_Reviews)
var

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

Step 2: Analyze and Compare Eigenvalues from Approach I and Approach II

Approach I Eigenvalues

library("factoextra")
eig.val = get_eigenvalue(pca2)
eig.val

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   4.4882415        18.701006                    18.70101
## Dim.2   3.3974242        14.155934                    32.85694
## Dim.3   1.8135401         7.556417                    40.41336
## Dim.4   1.6204095         6.751706                    47.16506
## Dim.5   1.4750728         6.146137                    53.31120
## Dim.6   1.1749669         4.895695                    58.20690
## Dim.7   1.1065657         4.610690                    62.81759
## Dim.8   0.9675169         4.031320                    66.84891
## Dim.9   0.7677442         3.198934                    70.04784
## Dim.10  0.7163200         2.984667                    73.03251
## Dim.11  0.6436108         2.681712                    75.71422
## Dim.12  0.6249997         2.604165                    78.31838
## Dim.13  0.6023845         2.509935                    80.82832
## Dim.14  0.5355661         2.231526                    83.05985
## Dim.15  0.5282108         2.200878                    85.26072
## Dim.16  0.4968939         2.070391                    87.33112
## Dim.17  0.4572299         1.905124                    89.23624
## Dim.18  0.4399519         1.833133                    91.06937
## Dim.19  0.4050301         1.687626                    92.75700
## Dim.20  0.3968623         1.653593                    94.41059
## Dim.21  0.3710364         1.545985                    95.95658
## Dim.22  0.3504492         1.460205                    97.41678
## Dim.23  0.3313589         1.380662                    98.79744
## Dim.24  0.2886137         1.202557                   100.00000

The first eigenvalue (Dim.1) is 4.49, so Principal Component 1 represents roughly four and a half variables’ worth of variation and explains 18.70% of the variance in the original 24 variables. The second eigenvalue (Dim.2) is 3.40, a little over three variables’ worth of variation; together the first two components explain 32.86% of the variance within our travel category ratings. Principal Components 3 through 7 each represent between a little under two and a little over one variable’s worth of variation. Together with Principal Components 1 and 2 they explain 62.82% of the variance within the Travel Review rating categories.

While the remaining eigenvalues each represent less than a full variable’s worth of variation, including them would explain more than 62.82% of the variance within our data set. For instance, the first 10 Principal Components account for 73.03% of the variation within the Travel Review rating categories, while the first 13 explain 80.83%. We would not want to go much beyond that range, or we would defeat the purpose of Principal Components Analysis, which is variable reduction and feature optimization.
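The eigenvalue table and the component counts discussed above can be rebuilt from a plain prcomp fit. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in for the travel ratings, and the helper name eigen_table is hypothetical.

```r
# Stand-in data: mtcars replaces the travel ratings purely for illustration.
pca_demo <- prcomp(mtcars, scale. = TRUE)

# Hypothetical helper: rebuild the eigenvalue table from a prcomp fit.
eigen_table <- function(fit) {
  eigenvalue <- fit$sdev^2                         # variance of each component
  variance.percent <- 100 * eigenvalue / sum(eigenvalue)
  data.frame(
    eigenvalue = eigenvalue,
    variance.percent = variance.percent,
    cumulative.variance.percent = cumsum(variance.percent),
    row.names = paste0("Dim.", seq_along(eigenvalue))
  )
}

eig_demo <- eigen_table(pca_demo)

# Components with an eigenvalue above 1 carry more than one
# standardized variable's worth of variance.
n_above_one <- sum(eig_demo$eigenvalue > 1)

# Smallest number of components that reaches 80% cumulative variance.
n_for_80 <- which(eig_demo$cumulative.variance.percent >= 80)[1]
```

Because the variables are standardized, the eigenvalues sum to the number of variables, which is why an eigenvalue can be read as "variables' worth of variation".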

#Screeplot
fviz_eig(pca2, addlabels = TRUE,
               barfill="lightyellow", barcolor ="orange",
               linecolor ="orange", title = "Eigenvalues")

The Scree Plot illustrates how Principal Components work: the first principal component explains as much variation as possible. The second principal component is orthogonal to the first and captures as much of the remaining variation as possible. Each subsequent principal component is orthogonal to the earlier ones and captures the maximum variation they left unexplained. The Scree Plot shows only the first ten dimensions, which is the fviz_eig default (ncp = 10); this suits our goal of capturing the maximum amount of variance in the fewest principal components.
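The orthogonality and ordering properties described above can be checked numerically. This is a minimal sketch on the built-in USArrests data, used here only as a stand-in for the travel ratings.

```r
# Stand-in data: USArrests replaces the travel ratings for illustration only.
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Loading vectors are orthonormal: t(R) %*% R is the identity matrix.
loading_gram <- crossprod(pca_demo$rotation)
max_loading_error <- max(abs(loading_gram - diag(ncol(USArrests))))

# Component scores are uncorrelated, so off-diagonal correlations are ~0.
score_cor <- cor(pca_demo$x)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

# Variances are sorted in decreasing order: each component captures
# no more variance than the one before it.
variances <- pca_demo$sdev^2
is_decreasing <- all(diff(variances) <= 0)
```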

Approach II Eigenvalues

eigen.values = get_eigenvalue(PCA_Travel_Reviews)
formattable(eigen.values)

	eigenvalue	variance.percent	cumulative.variance.percent
Dim.1	4.5630460	19.012692	19.01269
Dim.2	3.4511581	14.379825	33.39252
Dim.3	1.8309519	7.628966	41.02148
Dim.4	1.6646764	6.936152	47.95764
Dim.5	1.4814778	6.172824	54.13046
Dim.6	1.1484313	4.785131	58.91559
Dim.7	1.0972310	4.571796	63.48739
Dim.8	0.9081492	3.783955	67.27134
Dim.9	0.7549385	3.145577	70.41692
Dim.10	0.6987725	2.911552	73.32847
Dim.11	0.6375082	2.656284	75.98475
Dim.12	0.6191782	2.579909	78.56466
Dim.13	0.5945424	2.477260	81.04192
Dim.14	0.5303580	2.209825	83.25175
Dim.15	0.5054309	2.105962	85.35771
Dim.16	0.4851320	2.021383	87.37909
Dim.17	0.4523722	1.884884	89.26398
Dim.18	0.4399273	1.833030	91.09701
Dim.19	0.4061759	1.692399	92.78941
Dim.20	0.3916423	1.631843	94.42125
Dim.21	0.3688550	1.536896	95.95815
Dim.22	0.3506153	1.460897	97.41904
Dim.23	0.3311240	1.379683	98.79873
Dim.24	0.2883055	1.201273	100.00000

The results from Approach I and Approach II are closely aligned, with the second approach’s eigenvalues capturing slightly more variation in the first five principal components (54.13% versus 53.31%). Using this approach we would still want between 10 and 13 principal components, which capture 73.33% and 81.04% of the variation respectively.
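The comparison of the first five components can be verified by hand from the two eigenvalue tables. The sketch below copies the first five eigenvalues of each approach and relies on the fact that 24 standardized variables have a total variance of 24; the variable names are illustrative.

```r
# First five eigenvalues copied from the Approach I and Approach II tables.
eig_approach1 <- c(4.4882415, 3.3974242, 1.8135401, 1.6204095, 1.4750728)
eig_approach2 <- c(4.5630460, 3.4511581, 1.8309519, 1.6646764, 1.4814778)

# With 24 standardized variables the total variance is 24.
n_vars <- 24

# Share of total variance captured by the first five components, in percent.
share1 <- 100 * sum(eig_approach1) / n_vars
share2 <- 100 * sum(eig_approach2) / n_vars

# Difference between the approaches, in percentage points.
gain <- share2 - share1
```

The two shares agree with the cumulative.variance.percent values at Dim.5 in the tables, and the gap between the approaches is under one percentage point.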

# Screeplot of Eigenvalues
fviz_eig(PCA_Travel_Reviews,
         barfill="darkblue", 
         barcolor ="grey55",
         linecolor ="grey55", 
         title = "Eigenvalue Scree Plot",
         addlabels = TRUE)

The Eigenvalue Scree Plot shows the first principal component capturing 19.01% of the variance within the 24 Travel Review categories. The second principal component captures 14.38% of the variance. Collectively, the first ten principal components in the second approach capture 73.33% of the variation within the data set. Since Approach II captures slightly more variance than Approach I, we will use Approach II’s principal components to build our clusters.

Step 3: Variable Analysis for Approach I and Approach II

Approach I Correlation Between Variables and Principal Components

Approach II Correlation between Variables and Principal Components

Approach I and II produce closely matching correlations between variables and Principal Components, with the second approach explaining slightly more variation within the first two principal components. The correlation plots below give a grid representation of the correlation between variables and principal components for Approach II.

The first principal component (Dim.1) had moderately high negative correlation with Restaurants (-0.69), Pubs/Bars (-0.63), Malls (-0.63), and Zoo (-0.63). During our exploratory analysis we observed moderate correlation among these variables. Their shared relationship with the first principal component provides further evidence that they move together: travel reviewers tend to frequent and rate these attractions together. Additionally, we observed moderate positive correlation between the first principal component and Churches (0.59), View Points (0.54), Gardens (0.54), Cafes (0.54), and Monuments (0.53).

The second principal component (Dim.2) had high negative correlation with Theatres (-0.72) and Parks (-0.68). Both variables exhibited the highest correlation among variables in our exploratory analysis. We also observed a moderate negative correlation between principal component 2 and Museums (-0.58), and a moderate positive correlation with Juice Bars (0.59).
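For a PCA on standardized variables, the correlations quoted above are each loading multiplied by the standard deviation of its component. The sketch below checks that identity on the built-in USArrests data, which stands in for the travel ratings; it is an illustration rather than the code used to produce the plots.

```r
# Stand-in data: USArrests replaces the travel ratings for illustration only.
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Correlation of each variable with each component, computed directly.
cor_direct <- cor(USArrests, pca_demo$x)

# Same quantity from the fit: loading times component standard deviation.
cor_from_loadings <- sweep(pca_demo$rotation, 2, pca_demo$sdev, `*`)

max_diff <- max(abs(cor_direct - cor_from_loadings))

# The sign of a component is arbitrary, so only the relative signs of the
# variables on a component are interpretable.
round(cor_from_loadings[, 1:2], 2)
```

This is why a component can show negative correlations with one group of variables and positive correlations with another: flipping the sign of the whole component would leave the interpretation unchanged.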

Approach I Quality of Representation

Quality of Factor Map (Approach II)

Quality of Representation Bar Plot (Approach I)

We observe that the correlated variables identified above, with the exception of Museums, have the highest quality of representation on principal components 1 and 2.

Quality of Factor Circle Map (Approach I)

Approach II Quality of Representation

Quality of Factor Map (Approach II)

Quality of Representation Bar Plot (Approach II)

Quality of Factor Circle Map (Approach II)

The factor circle map differs slightly from the one for Approach I. Variables such as Juice Bars and Gyms have different coordinates in the second approach than in the first.

Approach I Variable Contribution

Variable Contribution Plot (Approach II)

Strictly speaking, the corrplot analyses of the PCA variables all use the values from Approach II. This has little practical impact, because the factoextra plots use each approach’s own model output, and the two models produced highly similar results, with the second approach accounting for slightly more variability in the first five components. What we observe in Approach II is therefore close to what we would observe in Approach I. We label the corrplot output as Approach I or Approach II only to keep the results separated and avoid confusion.

Variable Contribution Bar Charts for PC I (Approach I)

Variable Contribution Bar Charts for PC II (Approach I)

Variable Contribution Circle Plot (Approach I)

Approach II Variable Contribution

Variable Contribution Table (Approach II)

	Dim.1	Dim.2	Dim.3	Dim.4	Dim.5	Dim.6	Dim.7	Dim.8	Dim.9	Dim.10	Dim.11	Dim.12	Dim.13	Dim.14	Dim.15	Dim.16	Dim.17	Dim.18	Dim.19	Dim.20	Dim.21	Dim.22	Dim.23	Dim.24
Churches	7.611314e+00	0.14567033	1.63730292	0.43731131	1.35375215	10.337239856	3.792917e+00	3.51038718	0.31345111	0.02140916	18.481802565	0.014053591	2.452733e+01	9.91741347	4.556023e-01	5.6708866	4.229879553	3.06561241	1.98212103	0.001322932	7.127238e-01	1.636279e+00	3.812057e-02	1.060967e-01
Resorts	5.335035e-01	2.25428892	1.14491823	5.36012276	1.55628155	2.801017037	4.663501e+01	0.89887497	1.13452200	3.23743521	11.141304119	1.574721897	2.014478e-01	1.66661912	6.783519e+00	0.4713336	4.736217129	4.50480299	2.28114869	0.062886767	6.631036e-02	1.260491e-01	4.282381e-01	3.994295e-01
Beaches	1.082077e+00	7.13954924	5.01144795	0.27881842	0.07334987	2.584161054	1.595440e+01	0.36377013	3.04299840	23.91447654	3.564304960	1.771819769	1.517475e+01	0.20089554	5.110233e+00	0.8357906	0.525223443	4.14019615	0.00174181	1.860004639	3.529076e+00	1.775806e+00	1.593961e+00	4.711499e-01
Parks	4.535361e-01	13.53090581	3.35393320	2.84335560	1.22893820	7.671281606	2.595030e-01	0.03766050	4.95056316	0.01702433	1.565371668	0.125386131	2.351906e+00	5.45312421	1.103812e+00	1.8626944	3.338892780	6.52778302	7.54265556	0.073791095	4.088833e-01	1.369858e+01	4.127290e+00	1.747312e+01
Theatres	8.220319e-04	15.08977003	6.21788449	3.24203278	2.27116289	0.101300584	3.258710e-01	0.48796498	5.38277832	1.31731847	1.458985868	0.350235040	3.821799e-01	0.75558596	1.584013e+00	0.5269366	0.152763911	0.68068414	0.08809335	7.452975576	5.159008e+00	2.805863e+00	2.271183e+00	4.189459e+01
Museums	1.831868e+00	9.74638234	4.55111478	1.28697385	3.05564117	2.998405200	8.785329e+00	0.99890007	2.82946446	0.31205824	0.027071839	0.658682839	1.402832e+00	2.58315503	1.106653e+00	0.5836594	14.205757299	6.24105582	18.51472771	7.663929431	2.281894e+00	3.678697e+00	4.514502e-01	4.204297e+00
Malls	8.765188e+00	0.41137378	0.44270524	3.60277934	2.08350909	4.708163197	2.999171e+00	6.21216909	0.16890205	3.56073254	0.162452977	10.623312310	4.440707e-01	14.01065909	6.145459e-01	0.9374306	0.468912089	18.81564840	12.94418696	2.325899699	3.952304e-04	3.588439e+00	1.602656e+00	5.066964e-01
Zoo	8.740294e+00	0.26497957	7.83383699	0.12844918	2.09139280	7.795264787	1.363500e+00	0.52195604	7.49322035	0.67565607	0.004229098	0.852053147	1.940100e-02	4.31515197	6.026255e-05	1.8014345	1.880070100	12.07057107	2.95727808	7.042286688	1.381472e-03	3.305768e+00	2.795724e+01	8.845260e-01
Restaurants	1.037651e+01	0.03625603	7.65678695	3.59454088	0.10838919	0.078950570	3.502437e-01	0.76354309	3.67026536	1.60064416	0.761498811	0.154756857	7.384913e-01	0.29200390	1.417786e-01	1.0085027	9.776188366	6.05391501	1.38909907	25.424765229	8.174428e-01	3.601206e+00	3.943746e+00	1.766048e+01
Pubs_Bars	8.768412e+00	0.05507274	12.19992021	0.76740371	0.72094743	1.447737158	5.972131e+00	0.09643482	2.49638446	0.07792392	0.470500279	1.263244505	1.568149e-01	1.72463849	4.729463e+00	1.9710362	0.438654647	4.54657038	0.85585923	0.028186698	1.613315e+00	1.832863e-02	4.634124e+01	3.239777e+00
LocalServices	4.307522e+00	1.00946844	4.72530438	18.13487944	0.76894019	1.275153100	4.761496e+00	2.37890388	1.18484305	3.60740091	0.956095374	1.150914495	1.260442e-02	3.88499449	2.844824e+00	0.6302276	12.930254306	5.09694468	9.22797450	3.133254392	9.498463e+00	6.427927e+00	1.569876e+00	4.817339e-01
Burger_PizzaShops	1.745529e+00	5.23986847	4.88030464	15.50422157	0.73684936	0.370886301	1.110685e+00	0.27308222	0.00551141	12.83115309	10.262233396	3.909091894	5.549061e-04	3.45250175	2.909534e+00	7.1426483	1.116343658	4.00265345	6.03888807	7.240634773	7.012156e+00	2.090260e-01	3.574069e+00	4.315743e-01
Hotels_OtherLodgings	1.200029e+00	5.73560712	6.75602793	14.32821413	2.27243567	0.020814405	1.286186e-02	3.61624429	11.29038046	3.32374130	0.530078949	3.002665584	1.236143e-01	0.34401911	2.028760e-01	0.5664465	10.289094774	2.79594495	13.53657444	0.205780491	5.913764e-01	1.455413e+01	7.154714e-02	4.629498e+00
JuiceBars	1.031615e+00	9.98246335	9.80458774	0.06405407	2.58095505	0.003002792	8.956599e-04	5.95202786	4.53112253	3.59557683	1.471433898	8.245142141	5.483686e+00	0.04201261	2.526014e+00	2.4389872	4.103374976	1.34506448	10.90789548	0.153302601	2.760451e+00	1.920775e+01	1.240243e+00	2.528340e+00
ArtGalleries	4.665127e-01	7.00746612	1.64813418	4.70001772	1.28704245	6.883643872	1.706415e-01	26.81340469	3.93989998	0.24398860	0.460661053	5.132685205	1.515199e+01	14.30352478	2.117778e+00	2.1436163	3.527665163	0.33701691	1.59560763	1.559957921	3.141084e-02	2.143576e-01	2.037165e-01	5.925711e-02
DanceClubs	1.492791e+00	1.35887723	0.05249914	0.08660540	25.17095012	11.729254824	2.323109e+00	5.02371911	3.05444759	12.42388946	8.912735728	0.792573305	6.751641e-03	3.91970534	1.702007e+00	18.5069377	0.170189672	0.18477301	0.02203339	0.990281344	1.671776e+00	1.430141e-01	1.867723e-01	7.430644e-02
Swimming.Pools	3.959738e+00	4.56399761	0.18564464	0.01020114	25.58320737	0.418083261	2.305152e-01	0.06509340	0.03568510	1.51066594	1.057692296	0.020017431	5.873074e-01	0.93697183	5.307453e+00	10.7747417	0.005278408	1.08601020	0.06338711	10.605562044	2.611653e+01	5.254011e+00	6.630146e-02	1.555903e+00
Gyms	4.405715e+00	6.66402171	0.41304413	0.07589762	16.94218859	2.263482525	1.037802e-01	0.21499736	0.18368427	1.92605328	1.002183052	0.001218303	3.603044e-01	0.17712153	2.805645e+00	7.3692290	0.886104480	0.19340652	1.82134347	16.611186046	2.577629e+01	8.288280e+00	6.135583e-05	1.514760e+00
Bakeries	4.647042e+00	5.49770056	1.38288872	1.76994138	0.67014610	3.282383867	1.860132e-01	11.12563615	0.83806820	9.62252821	5.704700666	1.980986604	8.883940e+00	0.29374164	2.982014e+01	9.6419624	0.552060541	1.06534314	1.13987736	1.008904027	1.064868e-02	4.088304e-01	1.785583e-01	2.879631e-01
BeautySpas	3.427684e+00	1.43056900	0.25449558	5.80715425	7.33890378	2.308690185	4.544072e-02	6.96662532	41.99308028	0.05842999	1.483271890	20.286179715	1.521683e+00	0.08018815	4.464916e-02	3.6857199	1.595462727	0.13372273	0.74543640	0.278772683	4.998847e-01	3.155569e-03	1.015127e-02	6.492831e-04
Cafes	6.319347e+00	0.53550046	6.46526904	2.86884497	1.19285845	2.523200608	2.517109e+00	8.15523958	0.10268750	1.05399201	7.575040907	15.237187750	8.778458e+00	10.09667791	1.335253e+01	5.5020448	0.273633167	0.02410304	0.72424835	1.785687747	1.434140e+00	8.704372e-04	3.259760e+00	2.215722e-01
ViewPoints	6.456981e+00	1.35188388	6.53763252	4.26410685	0.37481820	9.284957180	1.508031e+00	0.18259338	1.03392549	7.78028677	0.169995336	3.913926932	4.631402e+00	0.35180742	8.628486e+00	0.5465166	18.235177912	1.19546130	4.96727424	1.387705679	8.806225e+00	7.138442e+00	4.975098e-05	1.252314e+00
Monuments	6.057766e+00	0.69424741	5.08193294	6.24423617	0.18643419	1.003915298	5.738245e-01	9.05599046	0.24850078	4.56753964	18.519734912	18.907123999	1.754391e-01	0.54197473	1.036217e+00	15.1845220	2.084516109	4.26287226	0.59964575	0.130705478	3.394587e-01	3.914141e+00	5.763999e-01	1.286174e-02
Gardens	6.318203e+00	0.25407983	1.76238346	4.59983745	0.35090612	18.109010731	1.752410e-02	6.28478143	0.07561370	2.72007533	4.256620361	0.032020558	8.883040e+00	20.65551194	5.072174e+00	0.1966947	4.478284792	11.62984392	0.05290232	2.972216020	8.607546e-01	1.046687e-03	3.073657e-01	1.091086e-01

Variable Contribution Plots (Approach I and II)

Variable contributions to principal components 1 through 3 tended to be spread fairly evenly across the variables that had moderate to high correlations with the component in either direction. For instance, Restaurants (10.38%), Pubs/Bars (8.77%), Malls (8.77%), Zoo (8.74%), and Churches (7.61%) were the highest contributing variables to principal component one, which is consistent with its eigenvalue of 4.56 accounting for 19% of the variation within the data set. The variables’ order of contribution matched their degree of correlation. We observe the same sort of distribution in principal component two, where the highest contributing variables were Theatres (15.09%), Parks (13.53%), Juice Bars (9.98%), and Museums (9.75%). These contributions are consistent with an eigenvalue of 3.45 accounting for 14% of the variation within the data set.

At principal components four and five the contributions become concentrated in fewer variables. For example, Local Services (18.13%), Burger/Pizza Shops (15.50%), and Hotels/Other Lodgings (14.33%) contributed most to principal component four. Similarly, Swimming Pools (25.58%), Dance Clubs (25.17%), and Gyms (16.94%) contributed most to principal component five.
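The link between contribution and correlation can be checked with the Approach II numbers. For a standardized PCA, a variable's contribution to a component is 100 * cos2 / eigenvalue, where cos2 is the squared correlation. The sketch below inverts that relationship for principal component one; the variable names are illustrative.

```r
# Values copied from the Approach II eigenvalue and contribution tables.
eigenvalue_dim1 <- 4.5630460
contrib_dim1 <- c(Restaurants = 10.37651, Pubs_Bars = 8.768412,
                  Malls = 8.765188, Zoo = 8.740294, Churches = 7.611314)

# contrib (%) = 100 * cos2 / eigenvalue, so cos2 = contrib * eigenvalue / 100.
cos2_dim1 <- contrib_dim1 * eigenvalue_dim1 / 100

# |correlation| is the square root of cos2; the sign cannot be recovered
# from the contribution alone.
abs_cor_dim1 <- sqrt(cos2_dim1)
round(abs_cor_dim1, 2)

# Reference level if all 24 variables contributed equally to a component.
uniform_contrib <- 100 / 24
```

The recovered magnitudes (0.69, 0.63, 0.63, 0.63, 0.59) match the correlations reported for Dim.1, and each of these variables contributes well above the equal-share level of about 4.17%.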

Variable Contribution Bar Charts (Approach II)

We analyze the variable contributions for up to 10 to 13 principal components, the range we are considering for clustering our data set.

Variable Contribution Circle Plot (Approach II)

Create grouping variables using kmeans

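# Fix the random seed so the k-means starts are reproducible
set.seed(123)
# Cluster the variables by their PCA coordinates into 3 groups, using 25 random starts
km = kmeans(var$coord, centers = 3, nstart = 25)
# Cluster membership as a factor, used as the grouping variable
grp = as.factor(km$cluster)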

k Means Grouping Using Approach I

k Means Grouping Using Approach II

Individual Analysis

## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"

Step 4: Assess Cluster Tendency of Data Sets Using Hopkins Statistic

“The Hopkins statistic (Lawson and Jurs 1990) is used to assess the clustering tendency of a data set by measuring the probability that a given data set is generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.” The Null Hypothesis assumes the data set is uniformly distributed and contains no meaningful clusters. The Alternative Hypothesis states the opposite: the data is not uniformly distributed and contains meaningful clusters.

We compute the Hopkins statistic (H) for each data set in turn, using 0.5 as a threshold. If H < 0.5, we fail to reject the Null Hypothesis and treat the data as not clusterable. If H > 0.5, we reject the Null Hypothesis and conclude the data is clusterable. We will cluster the data set whose Hopkins statistic is closest to 1.
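To make the threshold concrete, the sketch below implements one common form of the Hopkins statistic in base R, under the convention used here (H near 0.5 for uniform data, H near 1 for clustered data). The helper hopkins_sketch and the simulated data are hypothetical illustrations; this is not the implementation used to produce the results in this report, and other definitions raise the distances to a power or use the opposite convention.

```r
# Hypothetical helper: a simple Hopkins statistic. H is near 0.5 for
# uniform data and near 1 for clustered data.
hopkins_sketch <- function(x, m = ceiling(nrow(x) / 10)) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)

  # m points drawn uniformly from the bounding box of the data.
  mins <- apply(x, 2, min)
  maxs <- apply(x, 2, max)
  u_pts <- matrix(runif(m * d, rep(mins, each = m), rep(maxs, each = m)),
                  nrow = m)

  # m real observations sampled without replacement.
  idx <- sample(n, m)

  # Distance from point p to its nearest neighbour in the data,
  # optionally skipping one row (the point itself).
  nearest <- function(p, skip = NULL) {
    dists <- sqrt(colSums((t(x) - p)^2))
    if (!is.null(skip)) dists <- dists[-skip]
    min(dists)
  }

  u <- apply(u_pts, 1, nearest)                      # uniform point -> data
  w <- sapply(idx, function(i) nearest(x[i, ], i))   # data point -> other data

  sum(u) / (sum(u) + sum(w))
}

set.seed(123)
# Two tight, well-separated groups versus points spread uniformly.
clustered <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
                   matrix(rnorm(200, mean = 5, sd = 0.3), ncol = 2))
uniform <- matrix(runif(400), ncol = 2)

h_clustered <- hopkins_sketch(clustered)
h_uniform <- hopkins_sketch(uniform)
```

In clustered data, randomly placed points tend to fall far from any observation while real observations sit close to their neighbours, which pushes H toward 1.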

For more information visit Assessing Clustering Tendency.
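To make the statistic concrete, here is a minimal base R sketch. The function name `hopkins_sketch` and the simulated data are our own illustration, not the code used for the results below. It also assumes the variant that sums raw nearest-neighbour distances; some definitions raise the distances to the power of the number of dimensions.

```r
# Hypothetical helper: Hopkins statistic from nearest-neighbour distances.
hopkins_sketch <- function(x, m = max(1, floor(0.1 * nrow(x)))) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Distance from each row of `pts` to its nearest row in `ref`.
  nn_dist <- function(pts, ref, skip = NULL) {
    vapply(seq_len(nrow(pts)), function(i) {
      d <- sqrt(colSums((t(ref) - pts[i, ])^2))
      if (!is.null(skip)) d[skip[i]] <- Inf  # ignore the point itself
      min(d)
    }, numeric(1))
  }
  # w: nearest-neighbour distances for m sampled real observations.
  idx <- sample.int(n, m)
  w <- nn_dist(x[idx, , drop = FALSE], x, skip = idx)
  # u: the same for m artificial points drawn uniformly from the bounding box.
  u_pts <- apply(x, 2, function(col) runif(m, min(col), max(col)))
  u_pts <- matrix(u_pts, nrow = m)
  u <- nn_dist(u_pts, x)
  sum(u) / (sum(u) + sum(w))
}

set.seed(42)
# Two tight groups versus points spread uniformly over the unit square.
clustered <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 5, sd = 0.3), ncol = 2)
)
uniform <- matrix(runif(400), ncol = 2)
h_clustered <- hopkins_sketch(clustered, m = 20)
h_uniform <- hopkins_sketch(uniform, m = 20)
print(c(clustered = h_clustered, uniform = h_uniform))
```

With this convention, clustered data push the statistic toward 1 because real points sit much closer to their neighbours than uniformly placed points do, while spatially random data hover around 0.5.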

Get Cluster Tendency for MICE Imputed Data Set, MICE Imputed PCA Scores, missMDA PCA Imputed, missMDA PCA Scores

Data.Set	Hopkins.Statistic
Mice Imputed Variables	0.8138782
Mice Imputed Principal Components	0.8712349
PCA Imputed Variables	0.8172558
PCA Principal Components	0.8763902

All four data sets are highly clusterable according to their Hopkins statistics, with the missMDA imputed data sets scoring slightly higher. The data sets built on the average travel review ratings themselves, Mice Imputed Variables and PCA Imputed Variables, scored 0.81 and 0.82 respectively. Both principal component versions scored higher: Mice Imputed Principal Components returned 0.87 and PCA Principal Components returned 0.88. We will use the missMDA PCA Principal Components data set to cluster our travel reviews.

Exploratory Factor Analysis (interpretation of extracted features)

Maximum Likelihood Analysis

## 
## Call:
## factanal(x = travelvar, factors = 3, rotation = "varimax")
## 
## Uniquenesses:
##             Churches              Resorts              Beaches 
##                 0.70                 0.94                 0.77 
##                Parks             Theatres              Museums 
##                 0.47                 0.30                 0.60 
##                Malls                  Zoo          Restaurants 
##                 0.64                 0.52                 0.38 
##            Pubs_Bars        LocalServices    Burger_PizzaShops 
##                 0.53                 0.82                 0.66 
## Hotels_OtherLodgings            JuiceBars         ArtGalleries 
##                 0.61                 0.51                 0.79 
##           DanceClubs                 Pool                 Gyms 
##                 0.94                 0.80                 0.74 
##             Bakeries           BeautySpas                Cafes 
##                 0.73                 0.84                 0.68 
##           ViewPoints            Monuments              Gardens 
##                 0.68                 0.73                 0.76 
## 
## Loadings:
##                      Factor1 Factor2 Factor3
## Malls                -0.58                  
## Zoo                  -0.68                  
## Restaurants          -0.73   -0.31          
## Pubs_Bars            -0.64                  
## Parks                         0.70          
## Theatres                      0.82          
## Museums              -0.36    0.51          
## Burger_PizzaShops                     0.58  
## Hotels_OtherLodgings                  0.62  
## JuiceBars                             0.65  
## Churches              0.44           -0.33  
## Resorts                                     
## Beaches                       0.44          
## LocalServices        -0.34                  
## ArtGalleries                 -0.32    0.32  
## DanceClubs                                  
## Pool                  0.41                  
## Gyms                  0.45                  
## Bakeries              0.46                  
## BeautySpas            0.34                  
## Cafes                 0.40           -0.35  
## ViewPoints            0.35           -0.44  
## Monuments             0.35           -0.38  
## Gardens               0.40                  
## 
##                Factor1 Factor2 Factor3
## SS loadings       3.50    2.31    2.07
## Proportion Var    0.15    0.10    0.09
## Cumulative Var    0.15    0.24    0.33
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 11678.07 on 207 degrees of freedom.
## The p-value is 0

## 
## Call:
## factanal(x = travelvar, factors = 4, rotation = "varimax")
## 
## Uniquenesses:
##             Churches              Resorts              Beaches 
##                 0.70                 0.92                 0.77 
##                Parks             Theatres              Museums 
##                 0.45                 0.29                 0.49 
##                Malls                  Zoo          Restaurants 
##                 0.55                 0.51                 0.38 
##            Pubs_Bars        LocalServices    Burger_PizzaShops 
##                 0.38                 0.47                 0.52 
## Hotels_OtherLodgings            JuiceBars         ArtGalleries 
##                 0.52                 0.52                 0.76 
##           DanceClubs                 Pool                 Gyms 
##                 0.94                 0.81                 0.75 
##             Bakeries           BeautySpas                Cafes 
##                 0.72                 0.82                 0.67 
##           ViewPoints            Monuments              Gardens 
##                 0.57                 0.66                 0.73 
## 
## Loadings:
##                      Factor1 Factor2 Factor3 Factor4
## Malls                -0.60                          
## ViewPoints            0.62                          
## Monuments             0.55                          
## Parks                         0.71                  
## Theatres                      0.84                  
## Museums              -0.40    0.54                  
## Zoo                  -0.32            0.62          
## Restaurants          -0.43            0.62          
## Pubs_Bars                             0.75          
## LocalServices                         0.56    0.45  
## Burger_PizzaShops                             0.68  
## Hotels_OtherLodgings                          0.67  
## Churches              0.47                          
## Resorts                                             
## Beaches                       0.40                  
## JuiceBars            -0.33   -0.32            0.48  
## ArtGalleries                 -0.37                  
## DanceClubs                                          
## Pool                                                
## Gyms                         -0.30   -0.33          
## Bakeries                     -0.32   -0.38          
## BeautySpas                                          
## Cafes                 0.38                          
## Gardens               0.47                          
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings       2.53    2.42    2.39    1.74
## Proportion Var    0.11    0.10    0.10    0.07
## Cumulative Var    0.11    0.21    0.31    0.38
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 8066.47 on 186 degrees of freedom.
## The p-value is 0

## 
## Call:
## factanal(x = travelvar, factors = 2, rotation = "varimax")
## 
## Uniquenesses:
##             Churches              Resorts              Beaches 
##                 0.73                 0.94                 0.76 
##                Parks             Theatres              Museums 
##                 0.47                 0.40                 0.62 
##                Malls                  Zoo          Restaurants 
##                 0.63                 0.61                 0.51 
##            Pubs_Bars        LocalServices    Burger_PizzaShops 
##                 0.62                 0.82                 0.88 
## Hotels_OtherLodgings            JuiceBars         ArtGalleries 
##                 0.89                 0.78                 0.82 
##           DanceClubs                 Pool                 Gyms 
##                 0.94                 0.78                 0.72 
##             Bakeries           BeautySpas                Cafes 
##                 0.72                 0.85                 0.77 
##           ViewPoints            Monuments              Gardens 
##                 0.74                 0.77                 0.77 
## 
## Loadings:
##                      Factor1 Factor2
## Malls                -0.60          
## Zoo                  -0.60          
## Restaurants          -0.62   -0.32  
## Pubs_Bars            -0.55          
## Parks                         0.72  
## Theatres                      0.74  
## Churches              0.48          
## Resorts                             
## Beaches                       0.49  
## Museums              -0.44    0.44  
## LocalServices        -0.32          
## Burger_PizzaShops            -0.33  
## Hotels_OtherLodgings         -0.32  
## JuiceBars                    -0.47  
## ArtGalleries                 -0.42  
## DanceClubs                          
## Pool                  0.45          
## Gyms                  0.48          
## Bakeries              0.49          
## BeautySpas            0.38          
## Cafes                 0.48          
## ViewPoints            0.39    0.33  
## Monuments             0.39          
## Gardens               0.42          
## 
##                Factor1 Factor2
## SS loadings       3.61    2.85
## Proportion Var    0.15    0.12
## Cumulative Var    0.15    0.27
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 16308.61 on 229 degrees of freedom.
## The p-value is 0
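The degrees of freedom reported by the three fits above follow from the number of variables p and the number of factors k as ((p - k)^2 - (p + k)) / 2. The short check below is our own worked arithmetic, not part of the original analysis. It also shows why the p-values print as 0: the chi square statistics are far out in the tail, so the hypothesis that 2, 3, or 4 factors are sufficient is rejected in each case.

```r
# Degrees of freedom of the factor-model likelihood ratio test.
fa_df <- function(p, k) ((p - k)^2 - (p + k)) / 2

p <- 24                       # number of rating variables
k <- c(2, 3, 4)               # factor counts fitted above
dof <- sapply(k, fa_df, p = p)
print(dof)                    # 229 207 186

# Upper-tail probabilities for the reported chi square statistics.
stat <- c(16308.61, 11678.07, 8066.47)
p_values <- pchisq(stat, df = dof, lower.tail = FALSE)
print(p_values)
```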

##                       Churches   Resorts   Beaches     Parks  Theatres
## Churches              0.000001  0.124526 -0.001357 -0.066153 -0.044424
## Resorts               0.124526  0.000001  0.197951 -0.012102 -0.022370
## Beaches              -0.001357  0.197951  0.000000  0.050193 -0.017093
## Parks                -0.066153 -0.012102  0.050193 -0.000001  0.073324
## Theatres             -0.044424 -0.022370 -0.017093  0.073324 -0.000001
## Museums              -0.024765 -0.045665 -0.028450 -0.041760  0.065112
## Malls                -0.002333 -0.011742  0.005850 -0.056676  0.007853
## Zoo                   0.112609  0.053041 -0.069647 -0.059471 -0.013711
## Restaurants           0.038128  0.050497 -0.029902  0.003130 -0.077883
## Pubs_Bars             0.044253  0.003852 -0.011376  0.034913 -0.021375
## LocalServices         0.022070 -0.131170 -0.003779  0.053435  0.006962
## Burger_PizzaShops    -0.060290 -0.033975 -0.065756  0.063875  0.116131
## Hotels_OtherLodgings -0.015911 -0.105217 -0.021812  0.078881  0.130078
## JuiceBars            -0.086419  0.015043  0.068497  0.030084  0.064235
## ArtGalleries         -0.059140  0.050448  0.073867  0.035200 -0.006711
## DanceClubs           -0.053021 -0.034504 -0.004504  0.088658  0.052323
## Pool                 -0.043987 -0.043179 -0.013279  0.033847  0.069276
## Gyms                 -0.028298 -0.021478 -0.014257  0.017097  0.056601
## Bakeries             -0.007693  0.051267  0.013148 -0.003857  0.011498
## BeautySpas           -0.013283  0.040950  0.006424  0.000737 -0.043666
## Cafes                 0.020031 -0.010270 -0.058948 -0.044372 -0.061627
## ViewPoints            0.051999 -0.131571 -0.065127  0.060445 -0.052557
## Monuments             0.110632 -0.056922 -0.059909 -0.021438 -0.020363
## Gardens               0.187552 -0.005296 -0.077648 -0.065383 -0.004340
##                        Museums     Malls       Zoo Restaurants Pubs_Bars
## Churches             -0.024765 -0.002333  0.112609    0.038128  0.044253
## Resorts              -0.045665 -0.011742  0.053041    0.050497  0.003852
## Beaches              -0.028450  0.005850 -0.069647   -0.029902 -0.011376
## Parks                -0.041760 -0.056676 -0.059471    0.003130  0.034913
## Theatres              0.065112  0.007853 -0.013711   -0.077883 -0.021375
## Museums               0.000001  0.160075  0.013182   -0.017009 -0.139059
## Malls                 0.160075  0.000001  0.031113    0.030603 -0.101749
## Zoo                   0.013182  0.031113  0.000001    0.110863  0.171569
## Restaurants          -0.017009  0.030603  0.110863    0.000003  0.132150
## Pubs_Bars            -0.139059 -0.101749  0.171569    0.132150  0.000001
## LocalServices        -0.166316 -0.118882  0.053786   -0.023650  0.215688
## Burger_PizzaShops    -0.063944 -0.062574 -0.116460   -0.183123 -0.022193
## Hotels_OtherLodgings -0.025153 -0.042356 -0.103634   -0.121946 -0.058357
## JuiceBars             0.049435  0.040953 -0.107286   -0.122284 -0.137445
## ArtGalleries          0.007092  0.061970 -0.126849    0.004308 -0.068527
## DanceClubs           -0.013584  0.000230  0.015489    0.022844  0.093468
## Pool                  0.059991  0.057025  0.061287    0.002477  0.019139
## Gyms                  0.071543  0.082729  0.053125   -0.018982 -0.013079
## Bakeries              0.064185  0.011162  0.018470   -0.020206 -0.056812
## BeautySpas           -0.025690 -0.032204 -0.026186    0.058239 -0.015475
## Cafes                -0.021026  0.007316  0.014371    0.109605  0.077651
## ViewPoints           -0.096046 -0.121345  0.033492    0.072462  0.141480
## Monuments            -0.072289  0.011233  0.107444    0.050632  0.091993
## Gardens              -0.023448  0.001595  0.142238   -0.010297  0.040238
##                      LocalServices Burger_PizzaShops Hotels_OtherLodgings
## Churches                  0.022070         -0.060290            -0.015911
## Resorts                  -0.131170         -0.033975            -0.105217
## Beaches                  -0.003779         -0.065756            -0.021812
## Parks                     0.053435          0.063875             0.078881
## Theatres                  0.006962          0.116131             0.130078
## Museums                  -0.166316         -0.063944            -0.025153
## Malls                    -0.118882         -0.062574            -0.042356
## Zoo                       0.053786         -0.116460            -0.103634
## Restaurants              -0.023650         -0.183123            -0.121946
## Pubs_Bars                 0.215688         -0.022193            -0.058357
## LocalServices            -0.000001          0.196490             0.148496
## Burger_PizzaShops         0.196490          0.000000             0.357869
## Hotels_OtherLodgings      0.148496          0.357869            -0.000001
## JuiceBars                -0.075675          0.197221             0.361967
## ArtGalleries             -0.139648          0.011175             0.063952
## DanceClubs                0.055771         -0.043864            -0.052851
## Pool                      0.031945          0.013204            -0.018508
## Gyms                     -0.005978          0.009594             0.008816
## Bakeries                 -0.055748         -0.017553            -0.002405
## BeautySpas               -0.124240         -0.083811            -0.058210
## Cafes                    -0.113886         -0.187780            -0.154185
## ViewPoints                0.155980         -0.122081            -0.035750
## Monuments                 0.106556         -0.014047             0.003008
## Gardens                   0.049854          0.009177             0.019572
##                      JuiceBars ArtGalleries DanceClubs      Pool      Gyms
## Churches             -0.086419    -0.059140  -0.053021 -0.043987 -0.028298
## Resorts               0.015043     0.050448  -0.034504 -0.043179 -0.021478
## Beaches               0.068497     0.073867  -0.004504 -0.013279 -0.014257
## Parks                 0.030084     0.035200   0.088658  0.033847  0.017097
## Theatres              0.064235    -0.006711   0.052323  0.069276  0.056601
## Museums               0.049435     0.007092  -0.013584  0.059991  0.071543
## Malls                 0.040953     0.061970   0.000230  0.057025  0.082729
## Zoo                  -0.107286    -0.126849   0.015489  0.061287  0.053125
## Restaurants          -0.122284     0.004308   0.022844  0.002477 -0.018982
## Pubs_Bars            -0.137445    -0.068527   0.093468  0.019139 -0.013079
## LocalServices        -0.075675    -0.139648   0.055771  0.031945 -0.005978
## Burger_PizzaShops     0.197221     0.011175  -0.043864  0.013204  0.009594
## Hotels_OtherLodgings  0.361967     0.063952  -0.052851 -0.018508  0.008816
## JuiceBars            -0.000001     0.172507  -0.026830 -0.001874  0.008704
## ArtGalleries          0.172507     0.000000   0.078880 -0.016070 -0.027027
## DanceClubs           -0.026830     0.078880   0.000000  0.284062  0.195664
## Pool                 -0.001874    -0.016070   0.284062  0.000001  0.356334
## Gyms                  0.008704    -0.027027   0.195664  0.356334 -0.000001
## Bakeries              0.019153    -0.040039  -0.043733  0.100374  0.120908
## BeautySpas            0.002292    -0.002177  -0.034432 -0.112509 -0.079131
## Cafes                -0.075264     0.046396   0.034405 -0.056265 -0.058103
## ViewPoints           -0.112414    -0.050734   0.011332 -0.045398 -0.088076
## Monuments            -0.126569    -0.056725  -0.040001 -0.000963 -0.029000
## Gardens              -0.057475    -0.122275  -0.076162 -0.002338 -0.000391
##                       Bakeries BeautySpas     Cafes ViewPoints Monuments
## Churches             -0.007693  -0.013283  0.020031   0.051999  0.110632
## Resorts               0.051267   0.040950 -0.010270  -0.131571 -0.056922
## Beaches               0.013148   0.006424 -0.058948  -0.065127 -0.059909
## Parks                -0.003857   0.000737 -0.044372   0.060445 -0.021438
## Theatres              0.011498  -0.043666 -0.061627  -0.052557 -0.020363
## Museums               0.064185  -0.025690 -0.021026  -0.096046 -0.072289
## Malls                 0.011162  -0.032204  0.007316  -0.121345  0.011233
## Zoo                   0.018470  -0.026186  0.014371   0.033492  0.107444
## Restaurants          -0.020206   0.058239  0.109605   0.072462  0.050632
## Pubs_Bars            -0.056812  -0.015475  0.077651   0.141480  0.091993
## LocalServices        -0.055748  -0.124240 -0.113886   0.155980  0.106556
## Burger_PizzaShops    -0.017553  -0.083811 -0.187780  -0.122081 -0.014047
## Hotels_OtherLodgings -0.002405  -0.058210 -0.154185  -0.035750  0.003008
## JuiceBars             0.019153   0.002292 -0.075264  -0.112414 -0.126569
## ArtGalleries         -0.040039  -0.002177  0.046396  -0.050734 -0.056725
## DanceClubs           -0.043733  -0.034432  0.034405   0.011332 -0.040001
## Pool                  0.100374  -0.112509 -0.056265  -0.045398 -0.000963
## Gyms                  0.120908  -0.079131 -0.058103  -0.088076 -0.029000
## Bakeries              0.000000   0.050178 -0.067631  -0.099243 -0.077663
## BeautySpas            0.050178   0.000000  0.061783   0.026861 -0.040187
## Cafes                -0.067631   0.061783  0.000001   0.153069  0.103893
## ViewPoints           -0.099243   0.026861  0.153069   0.000002  0.188747
## Monuments            -0.077663  -0.040187  0.103893   0.188747  0.000001
## Gardens              -0.057269  -0.070728  0.038869   0.022149  0.185498
##                        Gardens
## Churches              0.187552
## Resorts              -0.005296
## Beaches              -0.077648
## Parks                -0.065383
## Theatres             -0.004340
## Museums              -0.023448
## Malls                 0.001595
## Zoo                   0.142238
## Restaurants          -0.010297
## Pubs_Bars             0.040238
## LocalServices         0.049854
## Burger_PizzaShops     0.009177
## Hotels_OtherLodgings  0.019572
## JuiceBars            -0.057475
## ArtGalleries         -0.122275
## DanceClubs           -0.076162
## Pool                 -0.002338
## Gyms                 -0.000391
## Bakeries             -0.057269
## BeautySpas           -0.070728
## Cafes                 0.038869
## ViewPoints            0.022149
## Monuments             0.185498
## Gardens               0.000001
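The matrix above has the shape of a residual correlation matrix: it is symmetric and its diagonal is essentially zero. Assuming that is what it is, the sketch below shows how such a matrix is typically derived from a factanal fit, as the observed correlations minus the correlations implied by the loadings and uniquenesses. The simulated data and variable names are illustrative only.

```r
set.seed(1)
n <- 1000
f1 <- rnorm(n)
f2 <- rnorm(n)
# Six simulated variables driven by two latent factors (illustrative only).
x <- cbind(
  a1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  a2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  a3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  b1 = 0.8 * f2 + rnorm(n, sd = 0.6),
  b2 = 0.7 * f2 + rnorm(n, sd = 0.7),
  b3 = 0.6 * f2 + rnorm(n, sd = 0.8)
)

fit <- factanal(x, factors = 2, rotation = "varimax")
lambda <- unclass(fit$loadings)                      # p x k loading matrix
fitted_cor <- lambda %*% t(lambda) + diag(fit$uniquenesses)
resid_cor <- cor(x) - fitted_cor                     # observed minus implied
print(round(resid_cor, 6))
```

Large off-diagonal residuals point to pairs of variables whose correlation the chosen number of factors does not reproduce.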

Interpretation of factors

Determine number of factors in data

## 
## Attaching package: 'nFactors'

## The following object is masked from 'package:lattice':
## 
##     parallel

## Warning: factor.pa is deprecated. Please use the fa function with fm=pa

## Factor Analysis using method =  pa
## Call: factor.pa(r = travelvar, nfactors = 2, rotate = "varimax")
## Unstandardized loadings (pattern matrix) based upon covariance matrix
##                        PA1   PA2    h2   u2    H2   U2
## Churches              0.49 -0.24 0.295 0.71 0.293 0.71
## Resorts               0.05 -0.25 0.067 0.93 0.067 0.93
## Beaches               0.05 -0.47 0.225 0.78 0.224 0.78
## Parks                -0.08 -0.66 0.437 0.56 0.438 0.56
## Theatres             -0.22 -0.66 0.477 0.52 0.479 0.52
## Museums              -0.43 -0.41 0.354 0.65 0.353 0.65
## Malls                -0.59  0.09 0.359 0.64 0.359 0.64
## Zoo                  -0.58  0.12 0.351 0.65 0.351 0.65
## Restaurants          -0.61  0.25 0.432 0.57 0.431 0.57
## Pubs_Bars            -0.54  0.24 0.349 0.65 0.350 0.65
## LocalServices        -0.32  0.28 0.183 0.82 0.183 0.82
## Burger_PizzaShops    -0.12  0.42 0.194 0.81 0.194 0.81
## Hotels_OtherLodgings -0.07  0.42 0.182 0.82 0.181 0.82
## JuiceBars            -0.02  0.55 0.306 0.69 0.307 0.69
## ArtGalleries          0.02  0.44 0.191 0.81 0.191 0.81
## DanceClubs            0.26  0.08 0.077 0.92 0.077 0.92
## Pool                  0.47  0.19 0.257 0.74 0.258 0.74
## Gyms                  0.51  0.25 0.318 0.68 0.318 0.68
## Bakeries              0.49  0.21 0.284 0.72 0.283 0.72
## BeautySpas            0.37  0.07 0.145 0.85 0.146 0.85
## Cafes                 0.48 -0.06 0.238 0.76 0.239 0.76
## ViewPoints            0.40 -0.35 0.283 0.72 0.282 0.72
## Monuments             0.40 -0.31 0.254 0.75 0.253 0.75
## Gardens               0.43 -0.25 0.244 0.76 0.243 0.76
## 
##                        PA1  PA2
## SS loadings           3.61 2.89
## Proportion Var        0.15 0.12
## Cumulative Var        0.15 0.27
## Proportion Explained  0.56 0.44
## Cumulative Proportion 0.56 1.00
## 
##  Standardized loadings (pattern matrix)
##                      item   PA1   PA2    h2   u2
## Churches                1  0.48 -0.24 0.293 0.71
## Resorts                 2  0.06 -0.25 0.067 0.93
## Beaches                 3  0.05 -0.47 0.224 0.78
## Parks                   4 -0.08 -0.66 0.438 0.56
## Theatres                5 -0.22 -0.66 0.479 0.52
## Museums                 6 -0.43 -0.41 0.353 0.65
## Malls                   7 -0.59  0.09 0.359 0.64
## Zoo                     8 -0.58  0.12 0.351 0.65
## Restaurants             9 -0.61  0.25 0.431 0.57
## Pubs_Bars              10 -0.54  0.24 0.350 0.65
## LocalServices          11 -0.32  0.28 0.183 0.82
## Burger_PizzaShops      12 -0.12  0.42 0.194 0.81
## Hotels_OtherLodgings   13 -0.07  0.42 0.181 0.82
## JuiceBars              14 -0.02  0.55 0.307 0.69
## ArtGalleries           15  0.02  0.44 0.191 0.81
## DanceClubs             16  0.27  0.08 0.077 0.92
## Pool                   17  0.47  0.19 0.258 0.74
## Gyms                   18  0.51  0.25 0.318 0.68
## Bakeries               19  0.49  0.21 0.283 0.72
## BeautySpas             20  0.37  0.07 0.146 0.85
## Cafes                  21  0.48 -0.06 0.239 0.76
## ViewPoints             22  0.40 -0.35 0.282 0.72
## Monuments              23  0.40 -0.31 0.253 0.75
## Gardens                24  0.43 -0.25 0.243 0.76
## 
##                  PA1  PA2
## SS loadings     3.61 2.89
## Proportion Var  0.15 0.12
## Cumulative Var  0.15 0.27
## Cum. factor Var 0.56 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  276  and the objective function was  7.38 with Chi Square of  40169.45
## The degrees of freedom for the model are 229  and the objective function was  3.03 
## 
## The root mean square of the residuals (RMSR) is  0.08 
## The df corrected root mean square of the residuals is  0.09 
## 
## The harmonic number of observations is  5456 with the empirical chi square  21175.16  with prob <  0 
## The total number of observations was  5456  with Likelihood Chi Square =  16511.3  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.508
## RMSEA index =  0.114  and the 90 % confidence intervals are  0.113 0.116
## BIC =  14540.87
## Fit based upon off diagonal values = 0.84
## Measures of factor score adequacy             
##                                                    PA1  PA2
## Correlation of (regression) scores with factors   0.92 0.90
## Multiple R square of scores with factors          0.84 0.81
## Minimum correlation of possible factor scores     0.68 0.62

Clustering

K-Means Clustering

MICE and PCA Imputed Data Sets Original Variables

# Scale Imputed Data Sets
travel3 <- scale(travel2)
imp2 <- scale(imp)

# MICE Imputed Model
k3 <- kmeans(travel3, centers = 3, nstart = 25)
#str(k3)
#k3

# PCA Imputed Model
k4 <- kmeans(imp2, centers = 3, nstart = 25)
#str(k4)
#k4

Visualize MICE Imputed Cluster

Visualize PCA Imputed Cluster

Determine Optimal Number of Clusters for MICE Imputed Data

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

The elbow is not clear. The optimal number of clusters appears to be between 10 and 12, although we may settle on 8 for practical business and marketing purposes. Let’s use the fviz_nbclust function from the factoextra package to see whether we can get a clearer elbow or silhouette.
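For reference, an elbow curve can be built in base R by recording the total within-cluster sum of squares for each candidate k. The sketch below runs on simulated data rather than our ratings. It also raises `iter.max` above the kmeans default of 10, which is the usual remedy for the "did not converge in 10 iterations" warnings shown above.

```r
set.seed(123)
# Simulated stand-in for the scaled ratings: three groups in two dimensions.
sim <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2),
  cbind(rnorm(100, mean = 0), rnorm(100, mean = 6))
)
sim <- scale(sim)

k_values <- 1:10
wss <- sapply(k_values, function(k) {
  # iter.max = 50 gives the algorithm more room to converge than the default.
  kmeans(sim, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

# Plotting k_values against wss (type = "b") gives the elbow chart.
print(data.frame(k = k_values, wss = round(wss, 1)))
```

The elbow is the k after which further clusters reduce the within-cluster sum of squares only marginally.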

We get a slight elbow at 6 before the line descends toward 8 to 10. Based on the elbow method, between 6 and 10 clusters would be optimal. However, the silhouette method marks the optimal number of clusters at 2. It’s worth noting that the silhouette for the non-scaled data set travel2 suggests 5 clusters as optimal. We will use the NbClust package to see whether we get clearer results from the scaled or the non-scaled data set.
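The silhouette method scores each observation by comparing its average distance to its own cluster (a) with its average distance to the nearest other cluster (b), as (b - a) / max(a, b), and picks the k with the highest mean score. The base R sketch below is our own illustration on simulated data. The helper `avg_silhouette` is hypothetical and not part of factoextra.

```r
# Hypothetical helper: mean silhouette width for a given cluster assignment.
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  ids <- sort(unique(cluster))
  s <- vapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # singleton clusters score 0 by convention
    a <- mean(d[i, own])
    b <- min(vapply(setdiff(ids, cluster[i]),
                    function(g) mean(d[i, cluster == g]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

set.seed(123)
# Three well-separated simulated groups of 50 points each.
sim <- rbind(
  cbind(rnorm(50, 0), rnorm(50, 0)),
  cbind(rnorm(50, 10), rnorm(50, 0)),
  cbind(rnorm(50, 0), rnorm(50, 10))
)

k_values <- 2:6
sil <- sapply(k_values, function(k) {
  km <- kmeans(sim, centers = k, nstart = 25, iter.max = 50)
  avg_silhouette(sim, km$cluster)
})
best_k <- k_values[which.max(sil)]
print(data.frame(k = k_values, avg_silhouette = round(sil, 3)))
```

Because the silhouette rewards separation as well as compactness, it can favour far fewer clusters than the elbow method when groups overlap, which is consistent with the disagreement noted above.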

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 4 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

Travel 2 Data Set

The NbClust algorithm was also run on the travel2 data set, but the run is left out of this report because of how long the package takes to execute. The results from running it are shown below. We will create 2, 3, 4, and 6 clusters from the data set.

nbclust <- NbClust(data = travel2, distance = "euclidean", min.nc = 2, max.nc = 15, method = "kmeans")

*** : The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.

Among all indices:
5 proposed 2 as the best number of clusters
5 proposed 3 as the best number of clusters
4 proposed 4 as the best number of clusters
2 proposed 6 as the best number of clusters
1 proposed 11 as the best number of clusters
1 proposed 14 as the best number of clusters
5 proposed 15 as the best number of clusters

***** Conclusion *****

According to the majority rule, the best number of clusters is 2

Travel 3 Data Set

The NbClust algorithm determined the best number of clusters for the travel3 data set to be 3 by majority rule, while the travel2 votes were tied at five each for 2, 3, and 15 clusters, with 2 reported as the majority-rule result. We will create 2, 3, and 7 clusters from the travel3 data set.
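Tallying the travel2 votes listed above makes the tie explicit. The sketch below only re-counts the reported votes and does not call NbClust. The reported conclusion of 2 is consistent with taking the first maximum, though we have not verified that NbClust resolves ties this way.

```r
# Votes per candidate number of clusters, copied from the travel2 output.
votes <- c("2" = 5, "3" = 5, "4" = 4, "6" = 2, "11" = 1, "14" = 1, "15" = 5)

# All candidates sharing the highest vote count.
tied <- names(votes)[votes == max(votes)]
print(tied)        # "2" "3" "15"

# which.max() returns the first maximum, which here is 2 clusters.
first_max <- names(votes)[which.max(votes)]
print(first_max)   # "2"
```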

## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(best_nc) == "matrix") .viz_NbClust(x, print.summary, : the
## condition has length > 1 and only the first element will be used

## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(best_nc) == "matrix") {: the condition has length > 1 and
## only the first element will be used

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 5 proposed  2 as the best number of clusters
## * 11 proposed  3 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 2 proposed  7 as the best number of clusters
## * 1 proposed  14 as the best number of clusters
## * 4 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

Determine Optimal Number of Clusters for PCA Imputed Data

The curve closely resembles the one for the MICE Imputed data. We will again use fviz_nbclust to get a clearer picture.

The PCA Imputed data set produces the same results as the MICE Imputed data set. We will run NbClust on it to check whether the majority rule gives the same result.

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 11 proposed 3 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 2 proposed 14 as the best number of clusters 
## * 3 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

NbClust again selects 3 clusters by majority rule, the same conclusion as for the MICE Imputed (travel3) data set. We will produce 2, 3, and 7 clusters from the imp2 PCA Imputed data set to compare with, and validate against, the same cluster counts from the MICE Imputed data set.

## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(best_nc) == "matrix") .viz_NbClust(x, print.summary, : the
## condition has length > 1 and only the first element will be used

## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(best_nc) == "matrix") {: the condition has length > 1 and
## only the first element will be used

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 5 proposed  2 as the best number of clusters
## * 11 proposed  3 as the best number of clusters
## * 2 proposed  7 as the best number of clusters
## * 1 proposed  13 as the best number of clusters
## * 2 proposed  14 as the best number of clusters
## * 3 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .
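
The majority rule reported above can be reproduced by hand. The following is a minimal sketch that uses the vote counts printed in the first summary; the vector name `votes` is illustrative and not part of the original analysis, and the two entries listed under 0 in the second summary are left out.

```r
# Vote counts copied from the NbClust summary above
# (names = proposed number of clusters, values = number of indices).
votes <- c("2" = 5, "3" = 11, "7" = 2, "13" = 1, "14" = 2, "15" = 3)

# Majority rule: pick the cluster count proposed by the most indices.
best_k <- as.integer(names(votes)[which.max(votes)])

# Share of the voting indices that back the winner.
winner_share <- max(votes) / sum(votes)

print(best_k)
print(round(winner_share, 2))
```

The winner is backed by fewer than half of the voting indices, which is one reason we also carry the runner-up values forward. With an actual NbClust result object, the same tally should be obtainable from its `Best.nc` component, although that is not shown in the original code.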

Merge PCs to Original Data and Scale

The Hopkins statistic revealed the most clusterable data set to be the missMDA Principal Components data set, with a value of 0.88. Our PCA analysis revealed two important facts: 1) the missMDA approach accounted for slightly more variability in the first five PCs than the MICE imputed data set, and 2) leveraging 10 PCs accounts for 73.33% of the variability within the Travel Ratings data set. Increasing the number of PCs to 13 would allow us to account for 81.04% of the variability within the data set.
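
The cumulative share of variability by number of PCs can be computed directly from the component standard deviations. The sketch below uses base R `prcomp()` on synthetic data as a stand-in; the original analysis used missMDA on the actual ratings, so the object names and the 80% target here are assumptions for illustration only.

```r
# Synthetic stand-in for the ratings matrix (hypothetical data, 24 columns).
set.seed(1)
ratings <- matrix(rnorm(200 * 24), nrow = 200, ncol = 24)

# PCA on centered, scaled columns.
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of PCs whose cumulative share reaches a target (here 80%).
n_pcs <- which(cum_var >= 0.80)[1]

print(round(cum_var[c(10, 13)], 4))
print(n_pcs)
```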

PCA Imputed

Europe_Travel_Ratings <- cbind(Europe_Travel_Reviews,Travel_Review_PCS)
pcavar <- c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10")
europca <- scale(Europe_Travel_Ratings[pcavar])

Randomly Start with 3 Clusters: PCA Imputed

MICE Imputed

pcavar <- c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10")
europca_mice <- scale(pcs[pcavar])

Randomly Start with 3 Clusters: MICE Imputed

Clustering on principal components appears to produce clusters that are visually less clean than clustering on the actual ratings. We would expect the opposite given the Hopkins statistic. Perhaps three is simply a less than optimal number of clusters for this data set, as shown by our previous optimal number of clusters exercises. So, before we continue analyzing K-Means using Principal Components, let’s take a look at some key metrics across the three models we’ve run thus far. We’ll look to see which data set minimizes the within sum of squares and maximizes the between sum of squares.

Data.Set	Cluster.Size	Within.Total.Sum.Squares	Between.Sum.Squares	Total.Sum.Squares
Mice Imputed Variables	2413	100431.55	30488.451	130920
Mice Imputed Variables	2053	100431.55	30488.451	130920
Mice Imputed Variables	990	100431.55	30488.451	130920
PCA Imputed Variables	988	100047.57	30872.425	130920
PCA Imputed Variables	2059	100047.57	30872.425	130920
PCA Imputed Variables	2409	100047.57	30872.425	130920
Principal Components	1931	46106.05	8443.953	54550
Principal Components	2467	46106.05	8443.953	54550
Principal Components	1058	46106.05	8443.953	54550

As the numbers show, the Principal Components model has the lowest within sum of squares, but it performs poorly in terms of maximizing the between sum of squares, meaning there is not enough distance between clusters to clearly delineate their makeup. In fact, its between sum of squares is far lower than its within sum of squares (8,443.95 versus 46,106.05, about 15% of the total, compared with roughly 23% for the two imputed-variable data sets), which indicates poor performance. Note that the raw totals are not directly comparable, since the Principal Components set has 10 scaled columns while the other two have 24. Let’s see if these numbers improve by finding the optimal number of clusters.
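
Because the three models have different total sums of squares, the between sum of squares is easier to compare as a share of the total. The sketch below recomputes that share from the figures in the table above; the data frame `fits` and its column names are illustrative.

```r
# Values copied from the comparison table above.
fits <- data.frame(
  data_set   = c("Mice Imputed Variables", "PCA Imputed Variables",
                 "Principal Components"),
  within_ss  = c(100431.55, 100047.57, 46106.05),
  between_ss = c(30488.451, 30872.425, 8443.953)
)

# Total SS = within + between; the ratio is the share of the total
# variation that the cluster assignment accounts for.
fits$total_ss <- fits$within_ss + fits$between_ss
fits$between_share <- round(fits$between_ss / fits$total_ss, 3)

print(fits[, c("data_set", "between_share")])
```

On a `kmeans` fit, the same quantity is `betweenss / totss`. A higher share means the clusters are better separated relative to their spread.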

Determine Optimal Number of Clusters for Principal Components

The optimal number of clusters based on the Silhouette method is 9 for the Principal Components data set, which is also the lowest point before 10 clusters on the Elbow plot. We then run NbClust on the data set to see what its 30 indices consider the optimal number of clusters.
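
For readers unfamiliar with the Elbow method, the curve is simply the total within-cluster sum of squares plotted against k. The sketch below builds it with base R `kmeans()` on synthetic data; the data and the range of k are assumptions for illustration and do not reproduce the plot from the original analysis.

```r
set.seed(42)
# Hypothetical data: three well-separated 2-D blobs of 50 points each.
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 8, sd = 0.5), ncol = 2)
)

ks <- 1:8
# Total within-cluster sum of squares for each k (the elbow curve).
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# Relative drop in WSS gained by adding one more cluster.
rel_drop <- -diff(wss) / wss[-length(wss)]

print(round(wss, 1))
print(round(rel_drop, 2))
```

The elbow is the value of k after which the relative drop becomes small. On real data the bend is often less pronounced, which is why we cross-check it against the Silhouette method and NbClust.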

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 1 proposed 3 as the best number of clusters 
## * 4 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 3 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 2 proposed 11 as the best number of clusters 
## * 1 proposed 12 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 3 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

The optimal number of clusters by majority rule for the Principal Components data set is 2, closely followed by 4. We will create 2, 4, and 9 clusters from the data, and then compare them with the clusters created from the travel2, travel3, and imp2 data sets.

## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(best_nc) == "matrix") .viz_NbClust(x, print.summary, : the
## condition has length > 1 and only the first element will be used

## Warning in if (class(best_nc) == "numeric") print(best_nc) else if
## (class(best_nc) == : the condition has length > 1 and only the first element
## will be used

## Warning in if (class(best_nc) == "matrix") {: the condition has length > 1 and
## only the first element will be used

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 5 proposed  2 as the best number of clusters
## * 1 proposed  3 as the best number of clusters
## * 4 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 3 proposed  9 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## * 2 proposed  11 as the best number of clusters
## * 1 proposed  12 as the best number of clusters
## * 1 proposed  13 as the best number of clusters
## * 1 proposed  14 as the best number of clusters
## * 3 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

Create and Extract Final K-Means Clusters

Travel 2 Data Set: Unscaled Multiple Imputation Data Set

# Compute k-means clustering with k = 2, 3, 4, 6
set.seed(402)
km1 <- kmeans(travel2, 2, nstart = 25)
km2 <- kmeans(travel2, 3, nstart = 25)
km3 <- kmeans(travel2, 4, nstart = 25)
km4 <- kmeans(travel2, 6, nstart = 25)

# Extract Clusters into original data set 
Travel_Rating_Clusters <- cbind(Europe_Travel_Reviews, 
                                Travel_Cluster_2 = km1$cluster, 
                                Travel_Cluster_3 = km2$cluster, 
                                Travel_Cluster_4 = km3$cluster,
                                Travel_Cluster_6 = km4$cluster)
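
The profile tables that follow report, for each cluster, its size, its share of all reviewers, and the mean rating per category. The code that built them is not shown, so the following is only a minimal base R sketch of one way to compute such a profile; the toy data frame and its column names are hypothetical.

```r
# Hypothetical mini data set with two rating columns and a cluster label.
ratings <- data.frame(
  Malls       = c(4.0, 3.5, 2.0, 2.5, 5.0, 1.5),
  Restaurants = c(4.5, 4.0, 2.0, 1.5, 3.5, 2.5),
  cluster     = c(1, 1, 2, 2, 1, 2)
)

# Mean rating per cluster for every rating column.
profile <- aggregate(. ~ cluster, data = ratings, FUN = mean)

# Cluster size and share of all reviewers.
sizes <- table(ratings$cluster)
profile$Cluster_Size <- as.integer(sizes[as.character(profile$cluster)])
profile$Cluster_Percent <- round(profile$Cluster_Size / nrow(ratings), 2)

print(profile)
```

One caveat: the formula interface of `aggregate()` drops rows with missing values by default, so with unrated categories stored as NA a per-column mean with `na.rm = TRUE` may be the safer choice.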

Travel Cluster 2

Travel_Cluster_2	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	2822	0.52	3.94	3.90	2.45	2.79	3.47	2.20	3.12	3.04	2.00	2.04	2.79	2.86	2.57	2.45	1.17	1.20	1.05	1.20	1.16	0.92	0.92	0.84	0.94	0.89
2	2634	0.48	2.72	2.29	3.51	3.00	2.15	3.43	1.94	2.00	3.02	2.69	1.59	1.47	1.65	1.69	2.56	2.06	2.18	1.82	1.28	1.48	1.48	1.32	1.15	1.14

Travel Cluster 3

Travel_Cluster_3	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	1042	0.19	2.04	1.84	2.05	1.89	1.64	2.22	1.61	1.60	2.46	2.64	2.47	1.96	1.58	1.49	2.66	2.52	2.38	2.28	1.82	2.49	2.49	1.85	1.93	1.93
2	2434	0.45	3.94	4.05	2.28	2.73	3.59	2.11	3.20	3.10	1.93	2.02	2.85	2.91	2.59	2.49	1.13	1.16	1.02	1.17	1.02	0.89	0.89	0.84	0.77	0.75
3	1980	0.36	3.33	2.67	4.27	3.63	2.53	3.95	2.24	2.35	3.19	2.61	1.28	1.42	1.84	1.88	2.27	1.69	1.89	1.50	1.14	0.76	0.76	0.93	0.86	0.74

Travel Cluster 4

Travel_Cluster_4	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	962	0.18	3.56	3.22	2.19	2.42	2.91	2.04	2.89	2.38	2.03	1.77	3.99	4.79	3.84	3.26	1.12	1.23	1.00	1.17	1.19	1.08	1.08	0.93	1.01	0.91
2	1031	0.19	2.00	1.84	2.07	1.91	1.60	2.28	1.57	1.59	2.53	2.71	2.37	1.85	1.54	1.45	2.69	2.52	2.40	2.29	1.68	2.50	2.50	1.84	1.81	1.86
3	1879	0.34	3.30	2.60	4.31	3.64	2.54	4.00	2.27	2.31	3.23	2.62	1.30	1.44	1.85	1.89	2.30	1.73	1.92	1.50	1.16	0.74	0.74	0.93	0.86	0.74
4	1584	0.29	4.16	4.53	2.40	2.94	3.93	2.16	3.30	3.53	1.87	2.15	2.09	1.73	1.79	1.99	1.16	1.13	1.05	1.18	1.00	0.74	0.74	0.80	0.71	0.68

Travel Cluster 6

Travel_Cluster_6	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	613	0.11	3.05	2.79	1.75	1.72	2.91	1.77	3.23	2.01	1.71	1.24	4.04	4.88	4.26	3.89	1.15	1.09	0.99	1.08	1.38	1.24	1.24	0.98	1.11	1.00
2	904	0.17	3.71	4.50	2.30	2.56	4.44	2.21	4.19	3.62	1.71	1.58	1.39	1.37	1.90	2.09	1.38	1.24	1.20	1.26	1.12	0.66	0.66	0.74	0.74	0.67
3	1331	0.24	3.64	2.64	4.34	3.82	2.43	3.68	2.05	2.43	3.21	2.80	1.31	1.69	1.97	2.05	0.89	1.73	1.51	1.45	1.11	0.85	0.85	0.84	0.86	0.78
4	939	0.17	4.66	4.48	2.50	3.48	3.16	2.24	2.16	3.24	2.29	2.93	3.66	2.94	1.99	1.90	0.91	1.03	0.92	1.13	0.82	0.79	0.79	0.88	0.68	0.69
5	729	0.13	2.73	2.61	4.07	3.20	2.67	4.29	2.60	2.17	3.15	2.25	1.18	1.25	1.79	1.60	4.83	1.78	2.49	1.65	1.23	0.68	0.68	1.11	0.88	0.70
6	940	0.17	1.97	1.76	2.02	1.85	1.61	2.19	1.60	1.59	2.42	2.62	2.41	1.92	1.57	1.47	2.64	2.59	2.46	2.32	1.74	2.61	2.61	1.90	1.93	1.98

Travel 3 Data Set: Scaled Multiple Imputation Data Set

# Compute k-means clustering with k = 2, 3, 7
set.seed(123)
km5 <- kmeans(travel3, 2, nstart = 25)
km6 <- kmeans(travel3, 3, nstart = 25)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 272800)

km7 <- kmeans(travel3, 7, nstart = 25)


# Extract Clusters into original data set 
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, 
                                Kmeans_Cluster_2 = km5$cluster, 
                                Kmeans_Cluster_3 = km6$cluster, 
                                Kmeans_Cluster_7 = km7$cluster)

K Means Cluster 2

Kmeans_Cluster_2	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	2834	0.52	3.99	3.91	2.57	2.90	3.49	2.28	3.13	3.07	2.03	2.03	2.63	2.76	2.54	2.43	1.18	1.18	1.04	1.16	1.01	0.88	0.88	0.81	0.77	0.75
2	2622	0.48	2.66	2.28	3.38	2.89	2.13	3.35	1.92	1.97	2.99	2.70	1.75	1.58	1.67	1.70	2.55	2.09	2.19	1.87	1.44	1.52	1.52	1.35	1.35	1.27

K Means Cluster 3

Kmeans_Cluster_3	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	990	0.18	2.06	1.83	2.03	1.86	1.69	2.14	1.65	1.61	2.36	2.55	2.47	2.04	1.64	1.54	2.71	2.58	2.44	2.34	1.93	2.46	2.46	1.94	2.04	2.02
2	2053	0.38	3.28	2.65	4.21	3.58	2.47	3.94	2.22	2.27	3.24	2.66	1.40	1.48	1.85	1.86	2.24	1.71	1.87	1.51	1.14	0.83	0.83	0.92	0.86	0.74
3	2413	0.44	3.94	4.06	2.28	2.74	3.61	2.09	3.20	3.15	1.91	2.01	2.79	2.86	2.56	2.48	1.14	1.14	1.02	1.15	0.99	0.90	0.90	0.83	0.75	0.75

K Means Cluster 7

Kmeans_Cluster_7	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	550	0.10	2.95	2.80	1.66	1.65	2.94	1.61	3.31	1.98	1.65	1.18	3.90	4.76	4.39	4.17	0.97	0.95	0.89	0.95	1.03	1.28	1.28	0.93	0.79	0.89
2	1197	0.22	2.91	2.52	4.20	3.48	2.45	4.23	2.00	2.08	3.77	3.00	1.48	1.38	1.46	1.60	2.24	1.26	1.30	1.36	1.32	0.88	0.88	0.93	0.83	0.71
3	625	0.11	3.42	2.96	4.24	3.27	2.91	3.83	3.27	2.85	2.69	2.36	1.30	1.80	3.05	2.57	2.80	2.85	3.29	1.85	0.94	0.76	0.76	0.78	0.95	0.79
4	1099	0.20	3.99	4.66	2.17	2.63	4.61	2.07	3.71	3.71	1.83	1.95	2.08	1.53	1.66	1.93	1.23	1.16	1.06	1.19	1.01	0.69	0.69	0.79	0.69	0.65
5	259	0.05	2.83	2.41	2.05	1.94	2.23	2.05	2.20	2.06	2.05	1.93	3.25	2.91	2.25	2.20	1.76	1.78	1.74	1.71	3.34	2.97	2.97	1.39	4.32	4.06
6	952	0.17	4.72	3.75	3.18	4.03	2.32	2.49	1.96	2.88	2.13	2.41	2.70	2.94	2.12	1.90	0.90	1.11	0.96	1.20	0.84	0.85	0.85	0.84	0.71	0.73
7	774	0.14	1.85	1.72	2.09	1.87	1.60	2.25	1.48	1.50	2.48	2.75	2.08	1.71	1.42	1.32	3.06	2.81	2.66	2.54	1.44	2.25	2.25	2.16	1.30	1.45

Imp 2 Data Set: Scaled PCA Imputed Data Set

# Compute k-means clustering with k = 2, 3, 7
set.seed(123)
km8 <- kmeans(imp2, 2, nstart = 25)
km9 <- kmeans(imp2, 3, nstart = 25)
km10 <- kmeans(imp2, 7, nstart = 25)


# Extract Clusters into original data set 
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, 
                                KMeans_Cluster_2 = km8$cluster, 
                                KMeans_Cluster_3 = km9$cluster, 
                                KMeans_Cluster_7 = km10$cluster)

KMeans Cluster 2

KMeans_Cluster_2	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	1328	0.24	2.12	1.99	2.38	2.00	1.78	2.62	1.64	1.63	2.69	2.64	2.22	1.86	1.54	1.47	2.90	2.37	2.44	2.18	1.75	2.15	2.15	1.84	1.77	1.73
2	4128	0.76	3.75	3.49	3.15	3.18	3.17	2.85	2.84	2.83	2.43	2.27	2.20	2.30	2.31	2.28	1.51	1.38	1.34	1.29	1.04	0.83	0.83	0.81	0.78	0.73

KMeans Cluster 3

KMeans_Cluster_3	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	2409	0.44	3.95	4.06	2.28	2.75	3.61	2.09	3.20	3.15	1.91	2.01	2.78	2.85	2.56	2.48	1.13	1.14	1.02	1.15	0.98	0.88	0.88	0.83	0.74	0.75
2	988	0.18	2.07	1.85	2.02	1.85	1.71	2.13	1.66	1.61	2.34	2.52	2.51	2.08	1.65	1.55	2.70	2.57	2.45	2.33	1.96	2.50	2.50	1.95	2.06	2.02
3	2059	0.38	3.27	2.64	4.21	3.57	2.47	3.94	2.22	2.27	3.25	2.68	1.40	1.48	1.85	1.86	2.25	1.72	1.87	1.51	1.13	0.83	0.83	0.92	0.86	0.74

KMeans Cluster 7

KMeans_Cluster_7	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	1095	0.20	3.99	4.66	2.17	2.63	4.62	2.07	3.71	3.72	1.83	1.94	2.09	1.52	1.66	1.93	1.23	1.16	1.06	1.19	1.01	0.68	0.68	0.79	0.69	0.65
2	771	0.14	1.85	1.71	2.08	1.86	1.60	2.24	1.48	1.49	2.49	2.76	2.10	1.71	1.42	1.32	3.05	2.81	2.66	2.53	1.45	2.28	2.28	2.15	1.31	1.45
3	952	0.17	4.73	3.77	3.17	4.02	2.32	2.48	1.96	2.88	2.14	2.43	2.70	2.95	2.11	1.90	0.90	1.10	0.96	1.20	0.84	0.85	0.85	0.84	0.71	0.72
4	628	0.12	3.42	2.95	4.24	3.28	2.90	3.84	3.25	2.85	2.66	2.33	1.31	1.78	3.03	2.54	2.80	2.85	3.28	1.85	0.94	0.76	0.76	0.79	0.95	0.79
5	259	0.05	2.83	2.39	1.98	1.87	2.23	2.00	2.22	2.04	1.99	1.88	3.37	3.00	2.30	2.24	1.76	1.79	1.74	1.71	3.46	2.91	2.91	1.40	4.31	4.06
6	1204	0.22	2.92	2.52	4.20	3.48	2.44	4.23	1.99	2.08	3.78	3.00	1.46	1.38	1.46	1.60	2.24	1.26	1.30	1.36	1.32	0.86	0.86	0.94	0.85	0.71
7	547	0.10	2.93	2.81	1.66	1.65	2.95	1.60	3.33	1.99	1.64	1.18	3.86	4.76	4.42	4.20	0.96	0.95	0.90	0.95	0.96	1.28	1.28	0.92	0.79	0.90

Europca Data Set: Principal Components Data Set

# Compute k-means clustering with k = 2, 4, 9
set.seed(915)
km11 <- kmeans(europca, 2, nstart = 25)
km12 <- kmeans(europca, 4, nstart = 25)

## Warning: did not converge in 10 iterations

km13 <- kmeans(europca, 9, nstart = 25)


# Extract Clusters into original data set 
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, 
                                Kmeans_PCA_Cluster_2 = km11$cluster, 
                                Kmeans_PCA_Cluster_4 = km12$cluster, 
                                Kmeans_PCA_Cluster_9 = km13$cluster)
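
The "did not converge in 10 iterations" warning above refers to the default `iter.max = 10` of `kmeans()`, and the earlier "Quick-TRANSfer" warning is specific to the default Hartigan-Wong algorithm. The sketch below shows two options we could try, on synthetic data; it is not what the original analysis ran, and the value 50 is an arbitrary choice.

```r
set.seed(915)
# Hypothetical data standing in for the scaled principal components:
# two groups of 100 points in three dimensions.
x <- rbind(
  matrix(rnorm(300, mean = 0), ncol = 3),
  matrix(rnorm(300, mean = 5), ncol = 3)
)

# Default Hartigan-Wong algorithm with a higher iteration cap.
km_hw <- kmeans(x, centers = 2, nstart = 25, iter.max = 50)

# MacQueen's algorithm has no Quick-TRANSfer stage, so that
# particular warning cannot occur with it.
km_mq <- kmeans(x, centers = 2, nstart = 25, iter.max = 50,
                algorithm = "MacQueen")

# Number of iterations used by the returned fit and the resulting fit quality.
print(km_hw$iter)
print(c(hartigan_wong = km_hw$tot.withinss, macqueen = km_mq$tot.withinss))
```

If a warning persists after raising `iter.max`, the affected fit should be treated with caution, since the cluster assignments may not be at a stable solution.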

K Means Principal Components Cluster 2

Kmeans_PCA_Cluster_2	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	2607	0.48	2.85	2.34	3.42	3.09	1.98	3.27	1.76	1.91	2.89	2.56	1.68	1.52	1.48	1.58	2.36	1.93	1.92	1.74	1.23	1.53	1.53	1.32	1.15	1.16
2	2849	0.52	3.81	3.85	2.54	2.72	3.61	2.36	3.27	3.11	2.13	2.17	2.69	2.81	2.71	2.54	1.38	1.34	1.33	1.29	1.21	0.88	0.88	0.84	0.95	0.87

K Means Principal Components Cluster 4

Kmeans_PCA_Cluster_4	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	1214	0.22	3.51	3.20	2.79	2.67	2.95	2.54	3.05	2.54	2.13	2.01	3.14	4.01	4.13	3.25	1.56	1.66	1.62	1.38	0.93	1.02	1.02	0.85	0.84	0.83
2	1388	0.25	4.09	4.55	2.25	2.84	4.17	2.04	3.40	3.76	1.90	2.20	2.03	1.75	1.61	1.95	1.27	1.19	1.16	1.21	0.94	0.72	0.72	0.81	0.72	0.69
3	1008	0.18	2.12	1.85	2.08	1.89	1.69	2.12	1.68	1.65	2.41	2.57	2.39	2.01	1.62	1.57	2.61	2.69	2.45	2.43	1.90	2.37	2.37	1.88	2.05	2.00
4	1846	0.34	3.36	2.70	4.08	3.63	2.37	3.90	2.05	2.11	3.22	2.57	1.63	1.43	1.47	1.68	2.06	1.35	1.50	1.30	1.24	0.88	0.88	0.95	0.81	0.71

K Means Principal Components Cluster 9

Kmeans_PCA_Cluster_9	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	422	0.08	3.39	2.57	3.88	3.18	2.39	3.15	2.17	2.78	2.54	3.02	1.11	1.91	2.46	2.83	1.19	3.86	2.51	2.40	0.81	0.97	0.97	0.82	1.01	0.86
2	1132	0.21	3.98	2.85	3.82	4.08	1.99	3.32	1.87	2.29	2.35	1.59	1.93	1.95	1.85	1.77	1.51	1.18	1.16	1.23	1.02	0.96	0.96	0.83	0.85	0.80
3	1047	0.19	4.01	4.68	2.15	2.71	4.54	2.02	3.65	3.80	1.73	1.92	1.96	1.54	1.57	1.94	1.20	1.15	1.06	1.18	0.99	0.70	0.70	0.81	0.68	0.64
4	364	0.07	2.41	2.76	2.38	2.20	2.11	2.70	1.52	1.82	2.50	2.61	2.03	2.34	1.93	1.50	2.65	1.64	1.76	1.76	1.36	2.26	2.26	1.53	1.02	1.14
5	333	0.06	2.69	2.30	1.99	1.89	2.17	2.02	2.23	2.00	2.02	2.01	2.94	2.86	2.24	2.20	1.86	1.99	1.83	1.79	3.03	3.08	3.08	1.50	3.88	3.86
6	310	0.06	1.56	1.73	2.10	1.67	1.75	2.31	1.49	1.35	2.40	2.47	2.55	1.13	0.96	1.05	4.19	3.13	3.40	2.43	1.65	1.36	1.36	3.39	1.14	1.24
7	818	0.15	3.36	3.15	3.41	3.12	2.81	3.30	2.10	2.43	3.73	4.39	2.48	2.32	1.73	1.92	1.20	1.14	1.02	1.42	1.23	1.15	1.15	0.85	0.77	0.75
8	498	0.09	2.91	2.78	1.70	1.66	2.95	1.66	3.30	1.97	1.64	1.19	3.97	4.74	4.37	4.24	0.99	0.97	0.90	0.96	1.00	1.18	1.18	0.92	0.74	0.83
9	532	0.10	3.21	2.97	3.98	3.12	3.08	4.04	3.66	2.63	3.50	2.10	1.56	1.72	2.77	1.54	4.09	1.79	3.04	1.68	1.08	0.71	0.71	0.80	0.92	0.73

K-Medoids Method (PAM)

Find Optimal Clusters Using Silhouette

# Note: this chunk would not run
pam_sil_viz1 <- fviz_nbclust(travel2, cluster::pam, method = "silhouette") +
  labs(subtitle = "Multiple Imputation Data Set")
pam_sil_viz2 <- fviz_nbclust(travel3, cluster::pam, method = "silhouette") +
  labs(subtitle = "Scaled Multiple Imputation Data Set")
pam_sil_viz3 <- fviz_nbclust(imp2, cluster::pam, method = "silhouette") +
  labs(subtitle = "PCA Imputed Data Set")
pam_sil_viz4 <- fviz_nbclust(europca, cluster::pam, method = "silhouette") +
  labs(subtitle = "Principal Components Data Set")

grid.arrange(pam_sil_viz1, pam_sil_viz2, pam_sil_viz3, pam_sil_viz4)
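
The chunk above is marked as not having run, and the source does not say why. If the cause was the cost of fitting PAM repeatedly on all 5,456 rows (an assumption), one workaround is to compute the average silhouette width on a random subsample. The sketch below does this with `cluster::pam()` on synthetic data; the subsample size and the range of k are arbitrary choices.

```r
library(cluster)  # pam()

set.seed(7)
# Hypothetical data: three separated groups in two dimensions.
full <- rbind(
  matrix(rnorm(600, mean = 0), ncol = 2),
  matrix(rnorm(600, mean = 6), ncol = 2),
  matrix(rnorm(600, mean = 12), ncol = 2)
)

# Work on a random subsample to keep the distance matrix small.
idx <- sample(nrow(full), 300)
sub <- full[idx, ]

ks <- 2:6
# Average silhouette width of the PAM solution for each k.
avg_sil <- sapply(ks, function(k) pam(sub, k)$silinfo$avg.width)

best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))
print(best_k)
```

A result from a subsample is only indicative, so it is worth repeating the exercise on a few different subsamples before settling on a value of k.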

Run PAM Using Optimal Number of Clusters Per Data Set

PAM Multiple Imputation Clusters

The optimal number of clusters for the travel2 Multiple Imputation data set was 10 clusters.

Multiple Imputation PAM Clusters Plot

mi_pam <- eclust(travel2, "pam", k = 10, hc_metric="euclidean")

PAM Multiple Imputation Cluster Model and Merge

# PAM Clustering
mi_pamclus <- pam(travel2, 10)

# Add Clusters into original data set 
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, PAM_Travel_Cluster = mi_pamclus$cluster)

PAM Multiple Imputation Cluster Profiles

PAM_Travel_Cluster	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	746	0.14	4.55	2.86	4.08	4.44	2.12	3.32	2.00	2.36	2.59	1.59	1.40	1.83	1.99	1.92	0.91	1.54	1.45	1.25	0.98	0.86	0.86	0.82	0.85	0.78
2	593	0.11	2.55	2.43	4.39	3.40	2.54	4.68	2.51	2.07	3.15	2.15	1.10	1.19	1.70	1.63	4.56	1.77	2.14	1.55	1.21	0.66	0.66	1.00	0.90	0.71
3	650	0.12	3.23	2.90	4.26	3.33	2.79	3.65	2.08	2.67	3.52	4.33	1.37	1.90	1.88	2.21	0.94	1.52	1.32	1.57	1.22	0.75	0.75	0.87	0.76	0.75
4	632	0.12	2.27	2.10	2.35	2.18	1.64	2.50	1.61	1.76	2.79	2.96	2.06	2.05	1.82	1.62	1.71	2.29	1.93	2.18	1.18	2.71	2.71	1.12	1.46	1.64
5	407	0.07	1.57	1.62	1.97	1.66	1.64	2.23	1.48	1.35	2.45	2.51	2.15	1.31	1.07	1.09	4.33	3.02	3.43	2.42	1.73	1.69	1.69	3.00	1.42	1.36
6	756	0.14	3.50	4.35	2.28	2.38	4.27	2.16	4.88	3.48	1.74	1.75	1.52	1.44	2.09	2.21	1.75	1.28	1.46	1.34	1.15	0.71	0.71	0.72	0.79	0.71
7	474	0.09	4.63	4.91	2.17	3.33	4.54	1.99	1.87	4.01	1.83	2.34	2.23	1.57	1.43	1.64	1.06	1.13	1.07	1.12	0.75	0.72	0.72	0.88	0.61	0.64
8	636	0.12	4.50	3.91	2.76	3.02	3.01	2.56	2.65	2.90	2.54	2.29	4.49	3.85	2.83	2.22	1.08	1.25	1.03	1.22	1.07	0.77	0.77	0.89	0.73	0.69
9	158	0.03	3.08	2.61	1.94	1.90	2.34	1.96	2.27	2.15	1.94	1.82	4.10	3.26	2.42	2.27	1.58	1.60	1.56	1.57	4.31	2.19	2.19	1.26	4.66	4.46
10	404	0.07	2.52	2.72	1.56	1.57	2.94	1.54	3.32	1.89	1.57	1.06	3.89	5.00	4.59	4.52	0.89	0.92	0.85	0.87	0.88	1.40	1.40	0.87	0.81	0.96

PAM Scaled Multiple Imputation Clusters

The optimal number of clusters for the travel3 Scaled Multiple Imputation data set was 2 clusters.

Scaled Multiple Imputation PAM Clusters Plot

miscaled_pam <- eclust(travel3, "pam", k = 2, hc_metric="euclidean")

PAM Scaled Multiple Imputation Cluster Model and Merge

# PAM Clustering
miscaled_pamclus <- pam(travel3, 2)

# Add Clusters into original data set 
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, PAM_Travel_Cluster_2 = miscaled_pamclus$cluster)

PAM Scaled Multiple Imputation Cluster Profiles

PAM_Travel_Cluster_2	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	3338	0.61	4.03	3.70	2.97	3.15	3.32	2.66	2.93	2.94	2.26	2.17	2.43	2.43	2.40	2.34	1.23	1.26	1.18	1.21	1.00	0.82	0.82	0.81	0.74	0.72
2	2118	0.39	2.28	2.21	2.94	2.49	2.06	3.01	1.95	1.91	2.86	2.65	1.86	1.82	1.69	1.67	2.83	2.20	2.29	1.98	1.55	1.69	1.69	1.47	1.48	1.39

PAM PCA Imputed Clusters

The optimal number of clusters for the imp2 PCA Imputed data set was 2 clusters.

PCA Imputed PAM Clusters Plot

pcai_pam <- eclust(imp2, "pam", k = 2, hc_metric="euclidean")

PAM PCA Imputed Cluster Model and Merge

# PAM Clustering
pcai_pamclus <- pam(imp2, 2)

# Add Clusters into original data set 
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, PAM_Travel_Cluster_3 = pcai_pamclus$cluster)

PAM PCA Imputed Cluster Profiles

PAM_Travel_Cluster_3	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	3651	0.67	3.91	3.62	3.04	3.17	3.27	2.72	2.91	2.92	2.38	2.23	2.35	2.43	2.41	2.32	1.34	1.29	1.24	1.24	1.04	0.83	0.83	0.82	0.77	0.72
2	1805	0.33	2.23	2.13	2.79	2.33	1.96	2.94	1.81	1.78	2.72	2.62	1.92	1.71	1.55	1.59	2.90	2.31	2.37	2.06	1.57	1.80	1.80	1.58	1.54	1.48

PAM Principal Components Clusters

The optimal number of clusters for the europca Principal Components data set was 10 clusters.

Principal Components PAM Clusters Plot

pca_pam <- eclust(europca, "pam", k = 10, hc_metric="euclidean")

PAM Principal Components Cluster Model and Merge

# PAM Clustering
pca_pamclus <- pam(europca, 10)

# Add Clusters into original data set 
Travel_Rating_Clusters <- cbind(Travel_Rating_Clusters, PAM_Travel_Cluster_4 = pca_pamclus$cluster)

PAM Principal Components Cluster Profiles

PAM_Travel_Cluster_4	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	740	0.14	3.10	2.50	3.93	3.32	2.49	4.06	2.29	2.08	4.44	2.87	1.60	1.54	1.47	1.51	2.15	1.22	1.38	1.32	1.08	1.15	1.15	0.78	0.88	0.71
2	948	0.17	4.01	2.89	3.72	4.10	2.00	3.14	1.87	2.37	1.97	1.55	1.88	1.85	1.79	1.82	1.57	1.24	1.20	1.25	1.01	0.98	0.98	0.83	0.84	0.83
3	629	0.12	4.12	3.94	2.98	3.20	3.36	2.66	2.18	3.22	2.82	4.52	3.15	2.82	1.89	2.04	0.93	1.11	0.98	1.41	1.25	0.77	0.77	0.89	0.69	0.70
4	439	0.08	3.25	2.42	3.82	3.05	2.35	3.10	2.19	2.64	2.50	2.92	1.01	1.78	2.38	2.83	1.44	3.89	2.57	2.33	0.87	1.11	1.11	0.89	1.06	1.04
5	875	0.16	3.75	4.58	2.15	2.56	4.55	2.06	4.06	3.69	1.64	1.63	1.65	1.38	1.72	2.07	1.32	1.19	1.08	1.22	0.97	0.71	0.71	0.77	0.71	0.68
6	517	0.09	3.12	2.89	1.73	1.84	2.90	1.70	3.10	1.92	1.70	1.14	4.25	4.95	4.34	3.98	0.92	0.96	0.92	0.95	0.92	1.13	1.13	0.93	0.72	0.81
7	417	0.08	2.33	2.67	2.36	2.19	2.11	2.64	1.52	1.88	2.44	2.62	1.85	2.21	1.84	1.51	2.57	1.75	1.79	1.83	1.42	2.46	2.46	1.50	1.11	1.32
8	306	0.06	3.33	3.44	3.85	3.04	3.33	3.84	3.99	2.92	2.96	2.11	1.65	2.10	3.78	1.74	4.08	1.97	3.54	1.86	1.01	0.75	0.75	0.80	0.96	0.78
9	289	0.05	1.59	1.83	2.14	1.70	1.83	2.35	1.50	1.35	2.49	2.51	2.75	1.14	0.95	1.01	4.23	2.98	3.44	2.39	1.58	1.33	1.33	3.52	1.06	1.08
10	296	0.05	2.79	2.44	2.02	1.91	2.27	2.05	2.24	2.05	2.04	1.94	3.22	2.86	2.16	2.16	1.84	1.86	1.82	1.74	3.45	2.69	2.69	1.50	4.20	3.91

3. Hierarchical Clustering Method

For example, given a distance matrix generated by the function dist(), the base R function hclust() can be used to create the hierarchical tree. Do not use the ward.D method on plain distances, as it does not correctly implement Ward’s criterion; the code in this section uses ward.D2 instead.
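
The difference between the two Ward options can be checked empirically. In the sketch below, on synthetic data, `ward.D2` on plain distances is compared with `ward.D` on squared distances (which should yield the same partition) and with `ward.D` on plain distances (which may not). The data and the choice of 4 groups are assumptions for illustration.

```r
set.seed(11)
# Hypothetical data: 60 points in 4 dimensions.
x <- matrix(rnorm(240), ncol = 4)
d <- dist(x)

# Ward's criterion as implemented by "ward.D2" on plain distances.
hc_d2 <- hclust(d, method = "ward.D2")

# "ward.D" only matches that criterion when fed squared distances.
hc_d_sq <- hclust(d^2, method = "ward.D")

# "ward.D" on plain distances is a different procedure.
hc_d <- hclust(d, method = "ward.D")

# Compare the 4-cluster partitions.
same_as_sq <- identical(cutree(hc_d2, k = 4), cutree(hc_d_sq, k = 4))
print(same_as_sq)
print(table(ward.D2 = cutree(hc_d2, k = 4), ward.D = cutree(hc_d, k = 4)))
```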

Multiple Imputed Data Set Hierarchical Cluster Analysis

Generate 5 Random Samples from the `travel2` Multiple Imputation Data Set

Travel_Sample_1 <- travel2 %>% sample_n(1000)
Travel_Sample_2 <- travel2 %>% sample_n(1000)
Travel_Sample_3 <- travel2 %>% sample_n(1000)
Travel_Sample_4 <- travel2 %>% sample_n(1000)
Travel_Sample_5 <- travel2 %>% sample_n(1000)
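
The sampling chunk above does not set a seed, so unless one was set earlier in the session, the five samples (and the dendrograms built from them) will differ from run to run. The sketch below shows one way to make the draws reproducible; the helper `draw_samples()`, the seed value, and the demo data frame are all hypothetical.

```r
# Hypothetical stand-in for the travel2 data frame.
travel_demo <- data.frame(id = 1:5000, rating = runif(5000, 1, 5))

# Fix the seed once, then draw the samples; rerunning the function
# with the same seed reproduces the same five samples.
draw_samples <- function(data, n_samples = 5, size = 1000, seed = 2021) {
  set.seed(seed)
  lapply(seq_len(n_samples), function(i) data[sample(nrow(data), size), ])
}

samples_a <- draw_samples(travel_demo)
samples_b <- draw_samples(travel_demo)

print(identical(samples_a, samples_b))
print(sapply(samples_a, nrow))
```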

Get Distance for Each Sample

Distance_1 <- get_dist(Travel_Sample_1)
Distance_2 <- get_dist(Travel_Sample_2)
Distance_3 <- get_dist(Travel_Sample_3)
Distance_4 <- get_dist(Travel_Sample_4)
Distance_5 <- get_dist(Travel_Sample_5)

Run Hierarchical Clustering

hclus1 <- hclust(d=Distance_1, method="ward.D2")
hclus2 <- hclust(d=Distance_2, method="ward.D2")
hclus3 <- hclust(d=Distance_3, method="ward.D2")
hclus4 <- hclust(d=Distance_4, method="ward.D2")
hclus5 <- hclust(d=Distance_5, method="ward.D2")

Sample I Hierarchical Cluster Dendrogram

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Sample I indicates 4 clear clusters as optimal, but you could also infer 6 clusters that merge into those 4. We have already run k-means on the full data set using both values.

Sample II Hierarchical Cluster Dendrogram

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Sample II shows 3 clear clusters, a value we have also run with k-means, where we went up to a maximum of 9 clusters. This dendrogram could also be read as supporting 4 or 5 clusters.

Sample III Hierarchical Cluster Dendrogram

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Sample III looks very similar to Sample I: we can clearly interpret 4 clusters and infer 6.

Sample IV Hierarchical Cluster Dendrogram

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

In Sample IV, you can easily interpret 4 or 6 clusters. It’s worth noting that 2 or 3 clusters can be interpreted from the dendrogram as well. NbClust voted for 2 and 3 clusters by majority rule, so we included those splits in the k-means runs as well.

Sample V Hierarchical Cluster Dendrogram

## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Based on the results from the fifth sample, we can safely conclude that 4 is an optimal number of clusters. Two clusters can also be inferred as optimal since they are the furthest apart.

Perform Hierarchical Cluster Analysis on Full Multiple Imputed Data Set

We perform Hierarchical cluster analysis on the full travel2 Multiple Imputed data set. Based on our samples, we will cut the tree at 4 and 6 clusters.

Get Distance for the Full Data Set

mi_distance1 <- get_dist(travel2)

Run Hierarchical Clustering and Cut Tree into Groups

# Ward's method
hclus_mi1 <- hclust(d=mi_distance1, method="ward.D2")

# Cut tree into 4 and 6 groups
hclus_group_4 <- cutree(hclus_mi1, k = 4)
hclus_group_6 <- cutree(hclus_mi1, k = 6)

# Visualize results in scatter plot
fviz_cluster(list(data = travel2, cluster = hclus_group_4),
             palette = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07", "#00FF00"),
             ellipse.type = "convex", # Convex hull around each cluster
             repel = FALSE, # Labels may overlap (repel = TRUE avoids this but is slow)
             #main = "Travel Ratings Cluster Plot",
             #sub =  "Multiple Imputed Data Set with Four Subgroups",
             show.clust.cent = FALSE, ggtheme = theme_minimal())

Merge Hierarchical Clusters to Travel Rating Clusters Table

Travel_Rating_Clusters <- Travel_Rating_Clusters %>% 
  dplyr::mutate(Hier_Cluster_Group_4 = hclus_group_4,
                Hier_Cluster_Group_6 = hclus_group_6)

Hierarchical Cluster 4 Profile

Hier_Cluster_Group_4	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	2047	0.38	3.33	2.65	4.25	3.58	2.67	3.90	2.44	2.44	3.13	2.54	1.32	1.61	2.10	1.94	2.31	1.77	1.91	1.52	1.16	0.74	0.74	0.92	0.88	0.76
2	1072	0.20	2.07	1.94	2.09	1.93	1.58	2.29	1.56	1.60	2.55	2.72	2.41	1.94	1.55	1.45	2.61	2.46	2.30	2.23	1.74	2.45	2.45	1.80	1.86	1.87
3	1853	0.34	4.22	4.43	2.40	3.04	3.70	2.19	3.05	3.35	1.98	2.22	2.63	2.26	1.85	2.01	1.08	1.10	1.01	1.19	1.09	0.73	0.73	0.85	0.75	0.67
4	484	0.09	2.92	2.76	1.56	1.60	2.95	1.56	3.30	1.95	1.59	1.10	3.90	4.92	4.55	4.28	0.89	0.94	0.86	0.92	0.75	1.40	1.40	0.89	0.83	0.98
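
The code that builds this profile table is not shown in the report. A minimal sketch of one way to produce a table of this shape (cluster size, share, and mean rating per category) is given below. It uses base R on made-up ratings, and every object name in it is hypothetical.

```r
set.seed(42)
# Hypothetical toy ratings (not the travel data): 12 users, 3 categories
ratings <- data.frame(
  Malls = round(runif(12, 1, 5), 2),
  Parks = round(runif(12, 1, 5), 2),
  Zoo   = round(runif(12, 1, 5), 2)
)
groups <- cutree(hclust(dist(ratings), method = "ward.D2"), k = 3)

# Mean rating per category within each cluster (rows ordered by cluster id)
means <- aggregate(ratings, by = list(Cluster = groups), FUN = mean)

# Cluster sizes in the same cluster order
sizes <- as.vector(table(groups))

profile <- cbind(means["Cluster"],
                 Cluster_Size    = sizes,
                 Cluster_Percent = round(sizes / sum(sizes), 2),
                 round(means[-1], 2))
print(profile)
```

Reading across a row then shows which categories a cluster rates highest, which is how the profiles in this section can be interpreted.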

Hierarchical Cluster 6 Profile

Hier_Cluster_Group_6	Cluster_Size	Cluster_Percent	Malls	Restaurants	Theatres	Museums	Pubs_Bars	Parks	LocalServices	Zoo	Beaches	Resorts	ArtGalleries	JuiceBars	Hotels_OtherLodgings	Burger_PizzaShops	ViewPoints	Gardens	Monuments	Churches	DanceClubs	Bakeries	BeautySpas	Cafes	SwimmingPools	Gyms
1	941	0.17	3.65	2.57	4.35	3.92	2.54	3.74	2.08	2.40	3.47	2.88	1.45	1.73	1.90	1.89	0.91	1.09	0.98	1.38	1.23	0.80	0.80	0.83	0.83	0.78
2	1106	0.20	3.07	2.72	4.17	3.28	2.79	4.04	2.75	2.47	2.84	2.25	1.21	1.51	2.27	1.99	3.46	2.34	2.69	1.63	1.10	0.71	0.71	0.98	0.91	0.74
3	1072	0.20	2.07	1.94	2.09	1.93	1.58	2.29	1.56	1.60	2.55	2.72	2.41	1.94	1.55	1.45	2.61	2.46	2.30	2.23	1.74	2.45	2.45	1.80	1.86	1.87
4	1364	0.25	4.55	4.43	2.47	3.28	3.45	2.23	2.35	3.26	2.14	2.51	3.12	2.61	1.85	1.90	0.96	1.05	1.00	1.16	1.12	0.75	0.75	0.89	0.72	0.68
5	484	0.09	2.92	2.76	1.56	1.60	2.95	1.56	3.30	1.95	1.59	1.10	3.90	4.92	4.55	4.28	0.89	0.94	0.86	0.92	0.75	1.40	1.40	0.89	0.83	0.98
6	489	0.09	3.30	4.44	2.19	2.34	4.40	2.09	4.99	3.61	1.55	1.43	1.24	1.28	1.86	2.33	1.45	1.26	1.04	1.26	1.02	0.65	0.65	0.69	0.82	0.64
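
Because both groupings are cuts of the same tree, every 6-cluster group sits entirely inside one 4-cluster group. The sizes above are consistent with this: clusters 2 and 4 of the 4-group solution reappear unchanged as clusters 3 and 5 of the 6-group solution (1072 and 484 users), while cluster 1 splits into 941 + 1106 = 2047 and cluster 3 splits into 1364 + 489 = 1853. A cross-tabulation makes that nesting explicit. The sketch below uses simulated data and hypothetical names, not the report's objects.

```r
set.seed(7)
# Hypothetical stand-in data: 100 rows, 3 columns
toy <- matrix(rnorm(300), ncol = 3)

hc <- hclust(dist(toy), method = "ward.D2")
g4 <- cutree(hc, k = 4)
g6 <- cutree(hc, k = 6)

# Rows: 4-cluster labels, columns: 6-cluster labels
nesting <- table(k4 = g4, k6 = g6)
print(nesting)

# Each 6-cluster group has members in exactly one 4-cluster group
print(all(colSums(nesting > 0) == 1))
```

Applied to the report's own vectors, the equivalent call would be table(hclus_group_4, hclus_group_6).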

Cluster Validation

Note: Understanding cluster profiles are best

Compute Validation for Data Sets leveraged in Cluster Analysis

# Clustering algorithms to compare
clmethods <- c("hierarchical", "kmeans", "pam")

# Internal validation (connectivity, Dunn index, silhouette width) for 3 to 10 clusters.
# maxitems is raised from its default to 5456, the number of rows, so the full data sets can be clustered.
travel_internal <- clValid(travel2, nClust = 3:10, clMethods = clmethods, validation = "internal", maxitems = 5456)
travel_scaled_internal <- clValid(travel3, nClust = 3:10, clMethods = clmethods, validation = "internal", maxitems = 5456)
travel_pcai_internal <- clValid(imp2, nClust = 3:10, clMethods = clmethods, validation = "internal", maxitems = 5456)
travel_pca_internal <- clValid(europca, nClust = 3:10, clMethods = clmethods, validation = "internal", maxitems = 5456)

## Warning: did not converge in 10 iterations
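
The convergence warning above has the form emitted by stats::kmeans() when a run stops at its iteration limit (iter.max, which defaults to 10). As a hedged illustration on simulated data, a standalone k-means fit can be given more iterations and several random starts. Whether and how these arguments can be forwarded through clValid() is not shown in the report and is not assumed here.

```r
set.seed(2021)
# Hypothetical toy data: three groups of 50 points in four dimensions
toy <- rbind(matrix(rnorm(200, mean = 0), ncol = 4),
             matrix(rnorm(200, mean = 3), ncol = 4),
             matrix(rnorm(200, mean = 6), ncol = 4))

# Allow up to 50 iterations per run and keep the best of 25 random starts
fit <- kmeans(toy, centers = 3, iter.max = 50, nstart = 25)

print(fit$size)  # cluster sizes
print(fit$iter)  # iterations used by the returned run
```

A run that stops early is not necessarily wrong, but its k-means scores in the tables below should be read with that caveat in mind.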

Summary Validation of Multiple Imputation Data Set

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                    3         4         5         6         7         8         9        10
##                                                                                                           
## hierarchical Connectivity     9.4603   19.2286   47.4845   85.3508  143.0341  149.1032  233.8627  254.2690
##              Dunn             0.2649    0.2649    0.2189    0.1926    0.1292    0.1292    0.1311    0.1411
##              Silhouette       0.1734    0.0825    0.0579    0.0287    0.0719    0.0526    0.0823    0.0907
## kmeans       Connectivity   833.4464  841.4909  760.7056  885.7698  972.1968 1009.0960 1058.1020 1025.7151
##              Dunn             0.0318    0.0332    0.0054    0.0178    0.0218    0.0178    0.0208    0.0208
##              Silhouette       0.1336    0.1457    0.1503    0.1464    0.1497    0.1506    0.1589    0.1525
## pam          Connectivity  1191.7956 1394.5278 1243.8123 1439.3845 1495.9766 1564.6048 1684.3956 1695.8183
##              Dunn             0.0011    0.0022    0.0021    0.0015    0.0088    0.0112    0.0112    0.0115
##              Silhouette       0.0978    0.1200    0.1211    0.1236    0.1272    0.1181    0.1191    0.1281
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 9.4603 hierarchical 3       
## Dunn         0.2649 hierarchical 3       
## Silhouette   0.1734 hierarchical 3

Summary Validation of Scaled Multiple Imputation Data Set

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                    3         4         5         6         7         8         9        10
##                                                                                                           
## hierarchical Connectivity     7.3702   59.2675   61.0909   62.6865   82.2544   88.2591  111.1655  113.4250
##              Dunn             0.2254    0.1548    0.1557    0.1557    0.1560    0.1560    0.1560    0.1560
##              Silhouette       0.1829    0.1888    0.1597    0.1582    0.1335    0.1268    0.1160    0.1095
## kmeans       Connectivity   688.9484  769.7591  904.3278  801.5972  857.4563  897.3956  902.8179 1026.7782
##              Dunn             0.0220    0.0215    0.0168    0.0035    0.0291    0.0094    0.0094    0.0174
##              Silhouette       0.1441    0.1404    0.1489    0.1461    0.1487    0.1420    0.1470    0.1458
## pam          Connectivity  1309.5131 1386.7464 1422.0921 1462.4032 1676.9393 1774.6433 1682.5694 1712.0758
##              Dunn             0.0015    0.0020    0.0020    0.0020    0.0100    0.0107    0.0048    0.0093
##              Silhouette       0.0772    0.0997    0.1071    0.1188    0.1243    0.1285    0.1267    0.1282
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 7.3702 hierarchical 3       
## Dunn         0.2254 hierarchical 3       
## Silhouette   0.1888 hierarchical 4

Summary Validation of PCA Imputed Data Set

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                    3         4         5         6         7         8         9        10
##                                                                                                           
## hierarchical Connectivity    10.2060   60.2282   72.1873   76.4349   79.4167   98.8861  129.0131  135.2357
##              Dunn             0.2226    0.1505    0.1505    0.1505    0.1540    0.1540    0.1679    0.1679
##              Silhouette       0.2037    0.1939    0.1683    0.1411    0.1301    0.1162    0.1428    0.1382
## kmeans       Connectivity   671.9722  762.5563  745.7270  826.9480 1005.8107  985.2929  947.9067  917.1889
##              Dunn             0.0156    0.0118    0.0066    0.0131    0.0097    0.0098    0.0275    0.0257
##              Silhouette       0.1452    0.1489    0.1482    0.1528    0.1417    0.1453    0.1470    0.1497
## pam          Connectivity   993.9496 1263.6901 1252.3071 1297.5603 1431.2468 1665.1766 1677.0290 1654.8706
##              Dunn             0.0015    0.0015    0.0113    0.0113    0.0039    0.0041    0.0049    0.0042
##              Silhouette       0.1302    0.1133    0.1124    0.1241    0.1156    0.1118    0.1176    0.1203
## 
## Optimal Scores:
## 
##              Score   Method       Clusters
## Connectivity 10.2060 hierarchical 3       
## Dunn          0.2226 hierarchical 3       
## Silhouette    0.2037 hierarchical 3

Summary Validation of Principal Components Data Set

## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                    3         4         5         6         7         8         9        10
##                                                                                                           
## hierarchical Connectivity    50.7321  103.8397  153.6329  162.0135  175.1313  175.1313  224.2028  262.5607
##              Dunn             0.0817    0.0687    0.0746    0.0746    0.0757    0.0757    0.0701    0.0701
##              Silhouette       0.2143    0.1911    0.1780    0.1640    0.1491    0.1410    0.1195    0.0948
## kmeans       Connectivity   586.2115  914.9972  876.5504  927.1952 1056.4540  962.8099 1044.9456 1081.1587
##              Dunn             0.0208    0.0113    0.0100    0.0079    0.0218    0.0079    0.0047    0.0040
##              Silhouette       0.1690    0.1232    0.1368    0.1373    0.1566    0.1567    0.1790    0.1867
## pam          Connectivity   919.3008  943.7290 1190.4151 1274.4563 1293.6722 1313.8040 1587.2595 1560.2778
##              Dunn             0.0041    0.0043    0.0011    0.0035    0.0035    0.0075    0.0035    0.0088
##              Silhouette       0.0870    0.1102    0.1359    0.1497    0.1662    0.1823    0.1837    0.1912
## 
## Optimal Scores:
## 
##              Score   Method       Clusters
## Connectivity 50.7321 hierarchical 3       
## Dunn          0.0817 hierarchical 3       
## Silhouette    0.2143 hierarchical 3
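
To make the measures in these tables concrete: silhouette width and the Dunn index are better when larger, while connectivity is better when smaller, which is why the smallest connectivity and the largest Dunn and silhouette values are listed under Optimal Scores. The sketch below computes the average silhouette width and the Dunn index by hand for several cuts of a tree built on simulated data. The use of the cluster package and of average linkage are assumptions made for the example, and the helper dunn_index() is hypothetical rather than clValid's own implementation.

```r
library(cluster)  # recommended package shipped with R; provides silhouette()

set.seed(1)
# Hypothetical toy data: three Gaussian blobs in two dimensions
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2),
             cbind(rnorm(30, mean = 0), rnorm(30, mean = 5)))
d  <- dist(toy)
hc <- hclust(d, method = "average")

# Dunn index: smallest distance between points in different clusters,
# divided by the largest distance between points in the same cluster
dunn_index <- function(d, groups) {
  m    <- as.matrix(d)
  same <- outer(groups, groups, "==")
  min(m[!same]) / max(m[same])
}

# One row per candidate number of clusters
scores <- t(sapply(3:6, function(k) {
  g <- cutree(hc, k = k)
  c(k          = k,
    Silhouette = mean(silhouette(g, d)[, "sil_width"]),
    Dunn       = dunn_index(d, g))
}))
print(round(scores, 4))
```

Silhouette values near 0, like most of those reported above, indicate clusters that overlap substantially, so the "optimal" choice of 3 clusters reflects the best of several weak separations rather than a sharply defined structure.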

ML II Final Project: Principal Component Analysis (PCA) and Clustering

Samia Cabezas, Ryan Taylor, Will Helfrich, Pedro Voyer, and Austin Schwoegl

6/26/2021
