Brian Surratt

head(jdat)
## # A tibble: 6 × 62
##   STATE_ABBR statename  state region division   rz_ext   rz_agr   rz_cns  rz_neu
##   <chr>      <chr>      <dbl> <chr>  <chr>       <dbl>    <dbl>    <dbl>   <dbl>
## 1 AK         Alabama        1 3      6        -0.126   -0.105   -0.110   -0.0491
## 2 AL         Alaska         2 4      9         0.0430   0.100    0.0806  -0.0152
## 3 AR         Arizona        4 4      8        -0.00934 -0.0154  -0.0583   0.0765
## 4 AZ         Arkansas       5 3      7         0.0122  -0.00702  0.00843 -0.0436
## 5 CA         California     6 4      9        -0.0430  -0.0357  -0.0654  -0.0143
## 6 CO         Colorado       8 4      8        -0.00489 -0.0658  -0.0221  -0.0451
## # … with 53 more variables: rz_opn <dbl>, mzextra <dbl>, mzagree <dbl>,
## #   mzconsc <dbl>, mzneuro <dbl>, mzopen <dbl>, fzextra <dbl>, fzagree <dbl>,
## #   fzconsc <dbl>, fzneuro <dbl>, fzopen <dbl>, E_LT30 <dbl>, A_LT30 <dbl>,
## #   C_LT30 <dbl>, N_LT30 <dbl>, O_LT30 <dbl>, E_GT30 <dbl>, A_GT30 <dbl>,
## #   C_GT30 <dbl>, N_GT30 <dbl>, O_GT30 <dbl>, GenD_E <dbl>, GenD_A <dbl>,
## #   GenD_C <dbl>, GenD_N <dbl>, GenD_O <dbl>, AgeD_E <dbl>, AgeD_A <dbl>,
## #   AgeD_C <dbl>, AgeD_N <dbl>, AgeD_O <dbl>, TFR <dbl>, alpha <dbl>, …
# changing from tibble to data frame

jdat <- as.data.frame(jdat)

head(jdat)
##   STATE_ABBR  statename state region division    rz_ext    rz_agr    rz_cns
## 1         AK    Alabama     1      3        6 -0.125762 -0.104588 -0.109966
## 2         AL     Alaska     2      4        9  0.043042  0.100483  0.080650
## 3         AR    Arizona     4      4        8 -0.009335 -0.015433 -0.058324
## 4         AZ   Arkansas     5      3        7  0.012155 -0.007023  0.008430
## 5         CA California     6      4        9 -0.043034 -0.035667 -0.065377
## 6         CO   Colorado     8      4        8 -0.004892 -0.065783 -0.022110
##      rz_neu    rz_opn   mzextra  mzagree  mzconsc   mzneuro   mzopen  fzextra
## 1 -0.049075  0.051630 -0.004242 0.412499 0.301710 -0.326062 0.427654 0.139469
## 2 -0.015238 -0.112345  0.112153 0.504055 0.357393 -0.292434 0.394611 0.250319
## 3  0.076501 -0.080592  0.090463 0.468918 0.313488 -0.242351 0.435278 0.200209
## 4 -0.043650  0.002879  0.100366 0.469782 0.343585 -0.307043 0.452290 0.219097
## 5 -0.014349  0.092209  0.058111 0.463655 0.294992 -0.271535 0.496399 0.187948
## 6 -0.045139  0.073840  0.084083 0.433223 0.310563 -0.315733 0.477640 0.211284
##    fzagree  fzconsc   fzneuro   fzopen    E_LT30    A_LT30    C_LT30    N_LT30
## 1 0.534365 0.337558 -0.013555 0.419930 -0.140225 -0.133313 -0.142583 -0.041884
## 2 0.635946 0.440848  0.012200 0.300435  0.051194  0.140257  0.130334 -0.030951
## 3 0.574798 0.384941  0.061924 0.311068 -0.009714 -0.012163 -0.048157  0.072481
## 4 0.578078 0.413102 -0.013868 0.365238 -0.003936  0.007324  0.020526 -0.047188
## 5 0.557243 0.376993 -0.004257 0.411690 -0.070760 -0.023765 -0.059610 -0.013238
## 6 0.552301 0.406930 -0.010579 0.407388 -0.007388 -0.066719 -0.012239 -0.044091
##      O_LT30    E_GT30    A_GT30    C_GT30    N_GT30    O_GT30    GenD_E
## 1  0.064560 -0.093362 -0.040240 -0.036899 -0.065184  0.022665 -0.143711
## 2 -0.109746  0.017092 -0.026130 -0.077508  0.034782 -0.120618 -0.138166
## 3 -0.036030 -0.008526 -0.022415 -0.080035  0.085087 -0.175739 -0.109746
## 4  0.014119  0.051829 -0.042397 -0.021394 -0.034928 -0.024835 -0.118731
## 5  0.082416  0.027033 -0.065745 -0.079949 -0.017157  0.116955 -0.129837
## 6  0.078307  0.001056 -0.063551 -0.045642 -0.047637  0.063192 -0.127201
##      GenD_A    GenD_C    GenD_N   GenD_O    AgeD_E    AgeD_A    AgeD_C
## 1 -0.121866 -0.035848 -0.312507 0.007724 -0.046863 -0.093073 -0.105684
## 2 -0.131891 -0.083455 -0.304634 0.094176  0.034101  0.166388  0.207842
## 3 -0.105880 -0.071453 -0.304275 0.124210 -0.001189  0.010252  0.031878
## 4 -0.108296 -0.069517 -0.293175 0.087052 -0.055765  0.049721  0.041920
## 5 -0.093588 -0.082001 -0.267278 0.084709 -0.097794  0.041980  0.020339
## 6 -0.119078 -0.096367 -0.305154 0.070252 -0.008443 -0.003168  0.033403
##      AgeD_N    AgeD_O    TFR     alpha     peak     stop ageFB t_ageFM nevermar
## 1  0.023300  0.041896 2.3470 13.338400 23.90560 2.854100  24.3   25.85 0.316093
## 2 -0.065733  0.010872 1.8715 11.279618 24.63356 4.435501  23.6   26.60 0.291652
## 3 -0.012606  0.139709 2.0030 12.428397 23.17632 4.598583  23.0   25.65 0.263799
## 4 -0.012260  0.038953 2.0680 10.913909 25.19252 2.924860  24.0   26.80 0.316186
## 5  0.003918 -0.034539 1.9475  7.012688 28.85349 2.941833  25.6   28.30 0.360036
## 6  0.003546  0.015115 1.9240  6.950724 28.27968 3.288943  25.7   27.15 0.306618
##    divorce cohabit abortion t_nmf unintprg famplnpw med_inc perAA perHisp
## 1 0.016276     8.2     12.0  31.9       53      147   64576   3.7     5.5
## 2 0.017909     4.8     12.0  45.0       55      147   40474  26.2     3.9
## 3 0.018978     5.3      8.7  39.1       56      152   38307  15.4     6.4
## 4 0.014391     7.7     15.2  41.3       51      151   46789   4.1    29.6
## 5 0.012347     8.0     27.6  33.9       56      245   57708   6.2    37.6
## 6 0.015716     8.1     15.7  29.8       48       80   54046   4.0    20.7
##   perFem perBA perUrb voteO vryrel relcons
## 1   47.9  27.2  66.02 38.74   56.3   42.76
## 2   51.5  21.7  59.04 37.89   27.9   18.75
## 3   50.9  19.1  56.16 45.12   35.7   18.08
## 4   50.3  26.3  89.81 38.86   52.1   39.93
## 5   50.3  30.1  94.95 61.01   34.0   11.45
## 6   49.9  35.9  86.15 53.66   32.6   14.78

A Principal Components Analysis was conducted for five personality variables at the state level. The variables are Extraversion (rz_ext), Agreeableness (rz_agr), Conscientiousness (rz_cns), Neuroticism (rz_neu), and Openness (rz_opn).

# Personality variables
# c("rz_ext", "rz_agr", "rz_cns", "rz_neu", "rz_opn")

jdat.pc <- prcomp(jdat[, c("rz_ext", "rz_agr", "rz_cns", "rz_neu", "rz_opn")],
                  center = TRUE, scale. = TRUE, retx = TRUE)

summary(jdat.pc)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5
## Standard deviation     1.7632 1.0031 0.68075 0.53194 0.37220
## Proportion of Variance 0.6218 0.2012 0.09268 0.05659 0.02771
## Cumulative Proportion  0.6218 0.8230 0.91570 0.97229 1.00000

PC1 explains 62% of the variance and PC2 a further 20%; together the first two components account for a cumulative 82% of the variance.
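As a quick check (a minimal sketch; prop_var is just an illustrative name), these proportions can be recomputed directly from the component standard deviations and should match the summary above:

# eigenvalue (sdev^2) divided by total variance gives the proportion of variance (illustrative sketch)
prop_var <- jdat.pc$sdev^2 / sum(jdat.pc$sdev^2)
round(prop_var, 4)
round(cumsum(prop_var), 4)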

jdat.pc$rotation
##               PC1         PC2        PC3         PC4         PC5
## rz_ext  0.4727570 -0.28812904  0.5841284 -0.31910990 -0.50044514
## rz_agr  0.5046479  0.05736142 -0.1897925  0.79135796 -0.28243859
## rz_cns  0.5265494  0.09646826  0.3029963  0.01126307  0.78835652
## rz_neu -0.2324309 -0.88700949  0.1094530  0.31628415  0.21719671
## rz_opn -0.4365237  0.34292763  0.7204069  0.41443808 -0.03320681
# biplot of the first two components, labeling points with state abbreviations (column 1)
biplot(jdat.pc, scale = 0, xlabs = jdat[, 1])

The biplot suggests that Agreeableness and Conscientiousness are highly correlated: their loading arrows point in nearly the same direction (the rz_agr and rz_cns loadings on PC1 and PC2 above are almost identical).
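As a quick spot check of that reading (a sketch; the correlation matrix further below reports 0.777 for this pair), the pairwise correlation can be computed directly:

# correlation between state-level Agreeableness and Conscientiousness (spot check)
round(cor(jdat$rz_agr, jdat$rz_cns), 3)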

# Scree plot

screeplot(jdat.pc, type = "l", main = "Scree Plot")

# reference line at eigenvalue = 1 (Kaiser criterion for retaining components)
abline(h = 1)

jdat.pc$sdev^2
## [1] 3.1088923 1.0061839 0.4634242 0.2829631 0.1385365
# Different function to obtain the eigenvalues: PCA() from the FactoMineR package
library(FactoMineR)

jdat.play <- PCA(jdat[, c("rz_ext", "rz_agr", "rz_cns", "rz_neu", "rz_opn")], scale.unit = TRUE, graph = FALSE)

eigenvalues <- jdat.play$eig

head(eigenvalues[,1:2])
##        eigenvalue percentage of variance
## comp 1  3.1088923              62.177847
## comp 2  1.0061839              20.123677
## comp 3  0.4634242               9.268483
## comp 4  0.2829631               5.659262
## comp 5  0.1385365               2.770731
# Histograms of the scores on the first two components

hist(jdat.pc$x[,1])

hist(jdat.pc$x[,2])

cor(jdat.pc$x[,1],jdat.pc$x[,2])
## [1] -1.745457e-16
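The near-zero correlation is expected: principal component scores are orthogonal by construction. As a sanity check (a sketch), the full correlation matrix of the scores should be, up to rounding, the identity:

# correlations among all five sets of component scores
round(cor(jdat.pc$x), 3)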
# Put the scores in a separate data frame so they can be merged back into the original data set

scores <- data.frame(jdat.pc$x)

jdat <- cbind(jdat, scores)
# now overlaying US Census regions on the PC1-PC2 score plot

ggplot(jdat, aes(PC1, PC2, fill = region)) +
  stat_ellipse(geom = "polygon", col = "black", alpha = 0.5) +
  geom_point(shape = 21, col = "black")

Region 1 is the Northeast, Region 2 the Midwest, Region 3 the South, and Region 4 the West. All four ellipses overlap; Region 3 is wholly contained within Region 4, and Region 1 diverges from the other regions more than Regions 2, 3, and 4 diverge from one another.

It appears you would want “fill” to be a variable with only a few categorical values, such as region.
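To make the legend more readable, one option (a sketch, assuming the region codes follow the Census ordering described above; region_name is an illustrative column name) is to convert region to a labeled factor before plotting:

# convert the region codes to a labeled factor (illustrative; assumes Census ordering 1-4)
jdat$region_name <- factor(jdat$region, levels = c("1", "2", "3", "4"),
                           labels = c("Northeast", "Midwest", "South", "West"))

ggplot(jdat, aes(PC1, PC2, fill = region_name)) +
  stat_ellipse(geom = "polygon", col = "black", alpha = 0.5) +
  geom_point(shape = 21, col = "black")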

I tried mapping division as well, but it is not clear what the division codes represent (they are likely the nine Census divisions nested within the four regions).

# now overlaying US Census divisions on the PC1-PC2 score plot

ggplot(jdat, aes(PC1, PC2, fill = division)) +
  stat_ellipse(geom = "polygon", col = "black", alpha = 0.5) +
  geom_point(shape = 21, col = "black")
## Too few points to calculate an ellipse

The plot by division shows less separation between groups than the plot by region.
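The ellipse warning above is consistent with some divisions containing only a handful of states; a quick count (a sketch) shows how many states fall in each division:

# number of states per division; stat_ellipse() needs at least four points per group
table(jdat$division)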

# correlations among the original personality variables

round(cor(jdat[,c("rz_ext", "rz_agr", "rz_cns", "rz_neu", "rz_opn")]), 3)
##        rz_ext rz_agr rz_cns rz_neu rz_opn
## rz_ext  1.000  0.622  0.772 -0.098 -0.581
## rz_agr  0.622  1.000  0.777 -0.363 -0.634
## rz_cns  0.772  0.777  1.000 -0.426 -0.582
## rz_neu -0.098 -0.363 -0.426  1.000  0.082
## rz_opn -0.581 -0.634 -0.582  0.082  1.000
# Plot a US map (plot_usmap() is from the usmap package)
library(usmap)

# Get centroids for placing state labels
centroid_labels <- usmapdata::centroid_labels("states")
# Rename state abbreviation to match centroid_labels
jdat <- rename(jdat, "abbr" = STATE_ABBR)
# Join data to centroids
data_labels <- merge(centroid_labels, jdat, by = "abbr")
# producing map of PC1 values for States

map1dat <- data_labels %>%
  select(fips, PC1)

plot_usmap(data = map1dat, values = "PC1")+
labs(title="Junkins Index", 
       subtitle="PC1", 
        caption = "Source: Junkins Data") +
    theme(panel.background = element_rect(colour = "black"))+
    scale_fill_continuous(low = "white", high ="darkred", 
                          name = "Value on Index",label = scales::comma) + 
    theme(legend.position = "right") +
    guides(fill = "none") +
  geom_text(data = data_labels, ggplot2::aes(
    x = x, y = y,
    label = scales::number(PC1, scale = 1, accuracy = .1)), color = "black")

# producing map of PC2 values for States

map2dat <- data_labels %>%
  select(fips, PC2)

plot_usmap(data = map2dat, values = "PC2")+
labs(title="Junkins Index", 
       subtitle="PC2", 
        caption = "Source: Junkins Data") +
    theme(panel.background = element_rect(colour = "black"))+
    scale_fill_continuous(low = "white", high ="darkred", 
                          name = "Value on Index",label = scales::comma) + 
    theme(legend.position = "right") +
    guides(fill = "none") +
  geom_text(data = data_labels, ggplot2::aes(
    x = x, y = y,
    label = scales::number(PC2, scale = 1, accuracy = .1)), color = "black")

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
# I need to center and scale my variables

# c("rz_ext", "rz_agr", "rz_cns", "rz_neu", "rz_opn")

jdat$Extraversion <- scale(jdat$rz_ext, center = TRUE, scale = TRUE)
jdat$Agreeableness <- scale(jdat$rz_agr, center = TRUE, scale = TRUE)
jdat$Conscientiousness <- scale(jdat$rz_cns, center = TRUE, scale = TRUE)
jdat$Neuroticism <- scale(jdat$rz_neu, center = TRUE, scale = TRUE)
jdat$Openness <- scale(jdat$rz_opn, center = TRUE, scale = TRUE)
STATEDATA <- subset(jdat, select = c("Extraversion", "Agreeableness", "Conscientiousness", "Neuroticism", "Openness"))

head(STATEDATA)
##   Extraversion Agreeableness Conscientiousness Neuroticism   Openness
## 1   -3.5516082    -1.8638115        -2.2141392  -1.0510473  1.1440545
## 2    1.2637407     1.8008589         1.6336407  -0.3856869 -1.4936254
## 3   -0.2303799    -0.2705892        -1.1716923   1.4182409 -0.9828509
## 4    0.3826497    -0.1203004         0.1758058  -0.9443718  0.3598524
## 5   -1.1916867    -0.6321759        -1.3140644  -0.3682058  1.7968028
## 6   -0.1036377    -1.1703564        -0.4406755  -0.9736510  1.5013215
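The five calls to scale() above could also be collapsed into one step (a sketch; STATEDATA2 is an illustrative name, and scale() centers and standardizes every column at once, returning a matrix):

# Alternative: standardize all five personality columns in one call (illustrative)
STATEDATA2 <- scale(jdat[, c("rz_ext", "rz_agr", "rz_cns", "rz_neu", "rz_opn")])
colnames(STATEDATA2) <- c("Extraversion", "Agreeableness", "Conscientiousness",
                          "Neuroticism", "Openness")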

Run analyses for K-Means OR Hierarchical Clustering similar to those in class. Determine the optimal number of clusters based on one of these two approaches.

#DETERMINE HOW MANY CLUSTERS IS OPTIMAL
#########################################

#plot number of clusters vs. total within sum of squares
fviz_nbclust(STATEDATA, kmeans, method = "wss")

#calculate gap statistic based on number of clusters
gap_stat <- clusGap(STATEDATA,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)

#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
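The prompt above also allows hierarchical clustering; that route could be sketched on the same standardized data as follows (not run here; d, hc, and hclusters are illustrative names, using Ward linkage on Euclidean distances and a cut at k = 3 to mirror the k-means solution below):

# Hierarchical alternative (sketch)
d <- dist(STATEDATA, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = jdat$abbr)    # dendrogram labeled with state abbreviations
hclusters <- cutree(hc, k = 3)  # cut the tree into three clusters
table(hclusters)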

##########################################
#PERFORM K-MEANS CLUSTERING WITH OPTIMAL K
##########################################

#make this example reproducible
set.seed(1)

#perform k-means clustering with k = 3 clusters
km <- kmeans(STATEDATA, centers = 3, nstart = 25)

#view results
km
## K-means clustering with 3 clusters of sizes 24, 13, 13
## 
## Cluster means:
##   Extraversion Agreeableness Conscientiousness Neuroticism   Openness
## 1    0.6809430     0.6450706         0.8641922  -0.3815411 -0.5791547
## 2   -0.9838597    -0.6627354        -0.7672295  -0.5966224  0.7617837
## 3   -0.2732658    -0.5281641        -0.8282022   1.3010059  0.3074249
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  2  1  3  1  2  2  3  1  1  1  2  1  1  1  1  1  1  3  3  2  3  1  1  1  1  2 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
##  1  1  1  3  3  2  2  3  3  3  2  1  3  1  1  1  2  1  2  3  2  1  3  2 
## 
## Within cluster sum of squares by cluster:
## [1] 52.86982 43.97169 22.71133
##  (between_SS / total_SS =  51.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
#plot results of final k-means model
fviz_cluster(km, data = STATEDATA)

#find mean of each cluster
aggregate(STATEDATA, by=list(cluster=km$cluster), mean)
##   cluster Extraversion Agreeableness Conscientiousness Neuroticism   Openness
## 1       1    0.6809430     0.6450706         0.8641922  -0.3815411 -0.5791547
## 2       2   -0.9838597    -0.6627354        -0.7672295  -0.5966224  0.7617837
## 3       3   -0.2732658    -0.5281641        -0.8282022   1.3010059  0.3074249

Show how your values of PC1 and PC2 differ across your clusters.

jdat <- cbind(jdat, cluster = km$cluster)

# k-means cluster numbers are arbitrary, so relabel them to put the clusters in a consistent order for plotting
jdat$newcluster <- car::recode(jdat$cluster,"1=3;2=1;3=2")

boxplot(jdat$PC1 ~ jdat$newcluster,
        col = 'steelblue',
        xlab = 'Cluster',
        ylab = 'Value on PC1')

boxplot(jdat$PC2 ~ jdat$newcluster,
        col = 'steelblue',
        xlab = 'Cluster',
        ylab = 'Value on PC2')
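The cluster differences shown in the boxplots can also be summarized numerically (a sketch) as the mean PC1 and PC2 scores within each relabeled cluster:

# mean component scores by (relabeled) cluster
aggregate(cbind(PC1, PC2) ~ newcluster, data = jdat, FUN = mean)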

#add cluster assignment to the original data
final_data <- cbind(jdat, kcluster = km$cluster)

#view final data
head(final_data)
##   abbr  statename state region division    rz_ext    rz_agr    rz_cns    rz_neu
## 1   AK    Alabama     1      3        6 -0.125762 -0.104588 -0.109966 -0.049075
## 2   AL     Alaska     2      4        9  0.043042  0.100483  0.080650 -0.015238
## 3   AR    Arizona     4      4        8 -0.009335 -0.015433 -0.058324  0.076501
## 4   AZ   Arkansas     5      3        7  0.012155 -0.007023  0.008430 -0.043650
## 5   CA California     6      4        9 -0.043034 -0.035667 -0.065377 -0.014349
## 6   CO   Colorado     8      4        8 -0.004892 -0.065783 -0.022110 -0.045139
##      rz_opn   mzextra  mzagree  mzconsc   mzneuro   mzopen  fzextra  fzagree
## 1  0.051630 -0.004242 0.412499 0.301710 -0.326062 0.427654 0.139469 0.534365
## 2 -0.112345  0.112153 0.504055 0.357393 -0.292434 0.394611 0.250319 0.635946
## 3 -0.080592  0.090463 0.468918 0.313488 -0.242351 0.435278 0.200209 0.574798
## 4  0.002879  0.100366 0.469782 0.343585 -0.307043 0.452290 0.219097 0.578078
## 5  0.092209  0.058111 0.463655 0.294992 -0.271535 0.496399 0.187948 0.557243
## 6  0.073840  0.084083 0.433223 0.310563 -0.315733 0.477640 0.211284 0.552301
##    fzconsc   fzneuro   fzopen    E_LT30    A_LT30    C_LT30    N_LT30    O_LT30
## 1 0.337558 -0.013555 0.419930 -0.140225 -0.133313 -0.142583 -0.041884  0.064560
## 2 0.440848  0.012200 0.300435  0.051194  0.140257  0.130334 -0.030951 -0.109746
## 3 0.384941  0.061924 0.311068 -0.009714 -0.012163 -0.048157  0.072481 -0.036030
## 4 0.413102 -0.013868 0.365238 -0.003936  0.007324  0.020526 -0.047188  0.014119
## 5 0.376993 -0.004257 0.411690 -0.070760 -0.023765 -0.059610 -0.013238  0.082416
## 6 0.406930 -0.010579 0.407388 -0.007388 -0.066719 -0.012239 -0.044091  0.078307
##      E_GT30    A_GT30    C_GT30    N_GT30    O_GT30    GenD_E    GenD_A
## 1 -0.093362 -0.040240 -0.036899 -0.065184  0.022665 -0.143711 -0.121866
## 2  0.017092 -0.026130 -0.077508  0.034782 -0.120618 -0.138166 -0.131891
## 3 -0.008526 -0.022415 -0.080035  0.085087 -0.175739 -0.109746 -0.105880
## 4  0.051829 -0.042397 -0.021394 -0.034928 -0.024835 -0.118731 -0.108296
## 5  0.027033 -0.065745 -0.079949 -0.017157  0.116955 -0.129837 -0.093588
## 6  0.001056 -0.063551 -0.045642 -0.047637  0.063192 -0.127201 -0.119078
##      GenD_C    GenD_N   GenD_O    AgeD_E    AgeD_A    AgeD_C    AgeD_N
## 1 -0.035848 -0.312507 0.007724 -0.046863 -0.093073 -0.105684  0.023300
## 2 -0.083455 -0.304634 0.094176  0.034101  0.166388  0.207842 -0.065733
## 3 -0.071453 -0.304275 0.124210 -0.001189  0.010252  0.031878 -0.012606
## 4 -0.069517 -0.293175 0.087052 -0.055765  0.049721  0.041920 -0.012260
## 5 -0.082001 -0.267278 0.084709 -0.097794  0.041980  0.020339  0.003918
## 6 -0.096367 -0.305154 0.070252 -0.008443 -0.003168  0.033403  0.003546
##      AgeD_O    TFR     alpha     peak     stop ageFB t_ageFM nevermar  divorce
## 1  0.041896 2.3470 13.338400 23.90560 2.854100  24.3   25.85 0.316093 0.016276
## 2  0.010872 1.8715 11.279618 24.63356 4.435501  23.6   26.60 0.291652 0.017909
## 3  0.139709 2.0030 12.428397 23.17632 4.598583  23.0   25.65 0.263799 0.018978
## 4  0.038953 2.0680 10.913909 25.19252 2.924860  24.0   26.80 0.316186 0.014391
## 5 -0.034539 1.9475  7.012688 28.85349 2.941833  25.6   28.30 0.360036 0.012347
## 6  0.015115 1.9240  6.950724 28.27968 3.288943  25.7   27.15 0.306618 0.015716
##   cohabit abortion t_nmf unintprg famplnpw med_inc perAA perHisp perFem perBA
## 1     8.2     12.0  31.9       53      147   64576   3.7     5.5   47.9  27.2
## 2     4.8     12.0  45.0       55      147   40474  26.2     3.9   51.5  21.7
## 3     5.3      8.7  39.1       56      152   38307  15.4     6.4   50.9  19.1
## 4     7.7     15.2  41.3       51      151   46789   4.1    29.6   50.3  26.3
## 5     8.0     27.6  33.9       56      245   57708   6.2    37.6   50.3  30.1
## 6     8.1     15.7  29.8       48       80   54046   4.0    20.7   49.9  35.9
##   perUrb voteO vryrel relcons        PC1        PC2        PC3        PC4
## 1  66.02 38.74   56.3   42.76 -4.0405808  2.0274333 -1.6825893 -0.2248166
## 2  59.04 37.89   27.9   18.75  3.1080828 -0.2733236 -0.2268482  0.2992498
## 3  56.16 45.12   35.7   18.08 -0.7630252 -1.6572132 -0.9910559 -0.1125770
## 4  89.81 38.86   52.1   39.93  0.2751785  0.8608766  0.4554930 -0.3648812
## 5  94.95 61.01   34.0   11.45 -2.2730888  1.1231070  0.2798554  0.4934071
## 6  86.15 53.66   32.6   14.78 -1.3007065  1.2986988  1.0030575 -0.5838080
##          PC5 Extraversion Agreeableness Conscientiousness Neuroticism
## 1  0.2919919   -3.5516082    -1.8638115        -2.2141392  -1.0510473
## 2  0.1126550    1.2637407     1.8008589         1.6336407  -0.3856869
## 3 -0.3913193   -0.2303799    -0.2705892        -1.1716923   1.4182409
## 4 -0.2359841    0.3826497    -0.1203004         0.1758058  -0.9443718
## 5 -0.4006657   -1.1916867    -0.6321759        -1.3140644  -0.3682058
## 6 -0.2263186   -0.1036377    -1.1703564        -0.4406755  -0.9736510
##     Openness cluster newcluster kcluster
## 1  1.1440545       2          1        2
## 2 -1.4936254       1          3        1
## 3 -0.9828509       3          2        3
## 4  0.3598524       1          3        1
## 5  1.7968028       2          1        2
## 6  1.5013215       2          1        2

Comment on your overall findings and what you would do differently were you to start over on Assignments 2 and 3.

I would have started with more variables, not just the personality traits. I could run additional PCA models to determine which variables to remove.

I need to better understand how to choose the number of clusters and how to interpret each cluster. I ran the function with 3 clusters, but the diagnostics suggest I could have chosen as many as 7. Removing variables with high collinearity might reduce the optimal number of clusters.
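One additional diagnostic that could help with that choice (a sketch, reusing the factoextra helper already loaded above) is the average silhouette width across candidate numbers of clusters:

# average silhouette width by number of clusters; the peak suggests a k to consider
fviz_nbclust(STATEDATA, kmeans, method = "silhouette")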

Cluster 1 seems to show high extraversion, agreeableness, and conscientiousness with low neuroticism and openness.

Cluster 2 seems to show negative values in all personality traits, except openness.

Cluster 3 has negative extraversion, agreeableness, and conscientiousness, with high neuroticism and positive openness.

I used cbind to add the kcluster assignment back to the original data frame, which lets me associate each state with a cluster. In the future, I can create tables and charts that show the states in each cluster.
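As a first step toward those tables (a sketch, using the relabeled clusters and the state abbreviations), the states in each cluster can be listed directly:

# state abbreviations grouped by relabeled cluster
split(jdat$abbr, jdat$newcluster)

# number of states per cluster
table(jdat$newcluster)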