title: “Assignment 2”

author: “Jacob M. Souch”

format: html

editor: visual


Assignment 2

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(readxl)

library(car)
## Loading required package: carData
library(ggplot2)

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(ggpubr)

library(ggrepel)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2
## ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ✔ purrr   1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some()   masks car::some()
Junkins <- read_excel("C:/Users/jacob/OneDrive - University of Texas at San Antonio/Courses/Stats for Demographic Data 2/Homework 1/Junkins Data.xlsx") %>% as.data.frame()



Junkins$STATE_ABBR <- NULL





Junkins$statabb <- case_when(state.name == Junkins$statename ~ state.abb)



#Junkins <-filter(Junkins, STATEAB!="DC")
Policy.pc <- prcomp(Junkins[,c( "cohabit","relcons","abortion","famplnpw","perBA")], center = TRUE,scale. = TRUE,retx=T,rownames=Junkins[,62])
## Warning: In prcomp.default(Junkins[, c("cohabit", "relcons", "abortion", 
##     "famplnpw", "perBA")], center = TRUE, scale. = TRUE, retx = T, 
##     rownames = Junkins[, 62]) :
##  extra argument 'rownames' will be disregarded
plot(Policy.pc)

biplot(Policy.pc, scale = 0,xlabs=Junkins[,62])

Policy.pc
## Standard deviations (1, .., p=5):
## [1] 1.4387920 1.0312699 0.9018942 0.8171225 0.6206913
## 
## Rotation (n x k) = (5 x 5):
##                  PC1         PC2         PC3        PC4         PC5
## cohabit   0.47098683 -0.03012803  0.70152112  0.3240984  0.42437255
## relcons  -0.42826473 -0.29362850  0.53860864 -0.6600198  0.06816326
## abortion  0.50712785 -0.21560143 -0.41479365 -0.5195491  0.50433360
## famplnpw  0.05113282 -0.92751848 -0.08018794  0.2920140 -0.21305551
## perBA     0.57876378  0.07810319  0.19820438 -0.3226929 -0.71799490
Policy.pc$rotation
##                  PC1         PC2         PC3        PC4         PC5
## cohabit   0.47098683 -0.03012803  0.70152112  0.3240984  0.42437255
## relcons  -0.42826473 -0.29362850  0.53860864 -0.6600198  0.06816326
## abortion  0.50712785 -0.21560143 -0.41479365 -0.5195491  0.50433360
## famplnpw  0.05113282 -0.92751848 -0.08018794  0.2920140 -0.21305551
## perBA     0.57876378  0.07810319  0.19820438 -0.3226929 -0.71799490
summary(Policy.pc)
## Importance of components:
##                          PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.439 1.0313 0.9019 0.8171 0.62069
## Proportion of Variance 0.414 0.2127 0.1627 0.1335 0.07705
## Cumulative Proportion  0.414 0.6267 0.7894 0.9230 1.00000
centroid_labels <- usmapdata::centroid_labels("states")

Only PC1 and PC2 have SDs greater than 1. This means that only PC1 and PC2 are salient to subsequent analysis and interpretation.

plot(Policy.pc)

plot(Junkins[,40:50], pch=20)

p <- biplot(Policy.pc, scale = 0,xlabs=Junkins[,1])

#ggbiplot(Policy.pc, labels =  rownames(Junkins))





library(ggfortify)

ggplot2::autoplot(Policy.pc, label = TRUE, loadings.label = TRUE)

screeplot(Policy.pc, type = "l", main = "Scree Plot")

abline(h=1)

Interpretation

Principal Component Analysis (PCA) was used to analyze total fertility rate, divorce rate, unintentional pregnancy rate, total age at first marriage, and bachelor’s degree attainment at the state-level. The Scree plot suggests that two-components are sufficient to explain variation. PC1 and PC2 cumulatively explain 62.7% of the variation in the features of the data.

The density plot and ellipse plots suggest that the strongest divergence occurs between the Northeast and the South.

PC1

The loadings on PC1 are similar loading (all loadings are >0.4) with the exception of famplnpw (family planning expenditures) (0.05). PC1 appears to be dependent educational attainment (0.57) and cohabitation behaviors (0.47). Given these loading values, PC1 may be capturing a measure of a progressive lifestyle. It follows that non-coastal states like Missouri and Kentucky score highly in PC1.

scores<-as.data.frame(Policy.pc$x)



Policy<-cbind(scores,Junkins)



Policy$fips <- sprintf("%02d", as.numeric(as.character(Policy$state)))
ggplot(Policy, aes(PC1, PC2, col = Species, fill = region)) +

  stat_ellipse(geom = "polygon", col = "black", alpha = 0.5) +

  geom_point(shape = 21, col = "black")

ggplot(Policy, aes(PC1, PC2, col = region, fill = region))+

  stat_density_2d(geom = "contour")+

  geom_point()

library(usmap)

library(ggplot2)





plot_usmap(data =  Policy, values = "PC1", labels = TRUE)+

scale_fill_continuous(low = "blue", high ="orange")+

  ggtitle("PC1 at State-Level")

plot_usmap(data =  Policy, values = "PC2", labels = TRUE)+

scale_fill_continuous(low = "blue", high ="orange")+

  ggtitle("PC2 at State-Level")

PC2

The loadings on PC2 are more varied compared to PC1. Familyplanpw loads highly on to PC2 (-0.93), followed by relcons (-0.21), perBA (-0.29). PC2 appears to be capturing conservative family orientation (associated with a lack of family planning and abortion). It follows that states like California would score lowly on this component.

Assignment 3

str(Junkins)
## 'data.frame':    50 obs. of  62 variables:
##  $ statename: chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ state    : num  1 2 4 5 6 8 9 10 12 13 ...
##  $ region   : chr  "3" "4" "4" "3" ...
##  $ division : chr  "6" "9" "8" "7" ...
##  $ rz_ext   : num  -0.12576 0.04304 -0.00933 0.01215 -0.04303 ...
##  $ rz_agr   : num  -0.10459 0.10048 -0.01543 -0.00702 -0.03567 ...
##  $ rz_cns   : num  -0.10997 0.08065 -0.05832 0.00843 -0.06538 ...
##  $ rz_neu   : num  -0.0491 -0.0152 0.0765 -0.0437 -0.0143 ...
##  $ rz_opn   : num  0.05163 -0.11235 -0.08059 0.00288 0.09221 ...
##  $ mzextra  : num  -0.00424 0.11215 0.09046 0.10037 0.05811 ...
##  $ mzagree  : num  0.412 0.504 0.469 0.47 0.464 ...
##  $ mzconsc  : num  0.302 0.357 0.313 0.344 0.295 ...
##  $ mzneuro  : num  -0.326 -0.292 -0.242 -0.307 -0.272 ...
##  $ mzopen   : num  0.428 0.395 0.435 0.452 0.496 ...
##  $ fzextra  : num  0.139 0.25 0.2 0.219 0.188 ...
##  $ fzagree  : num  0.534 0.636 0.575 0.578 0.557 ...
##  $ fzconsc  : num  0.338 0.441 0.385 0.413 0.377 ...
##  $ fzneuro  : num  -0.01355 0.0122 0.06192 -0.01387 -0.00426 ...
##  $ fzopen   : num  0.42 0.3 0.311 0.365 0.412 ...
##  $ E_LT30   : num  -0.14022 0.05119 -0.00971 -0.00394 -0.07076 ...
##  $ A_LT30   : num  -0.13331 0.14026 -0.01216 0.00732 -0.02377 ...
##  $ C_LT30   : num  -0.1426 0.1303 -0.0482 0.0205 -0.0596 ...
##  $ N_LT30   : num  -0.0419 -0.031 0.0725 -0.0472 -0.0132 ...
##  $ O_LT30   : num  0.0646 -0.1097 -0.036 0.0141 0.0824 ...
##  $ E_GT30   : num  -0.09336 0.01709 -0.00853 0.05183 0.02703 ...
##  $ A_GT30   : num  -0.0402 -0.0261 -0.0224 -0.0424 -0.0657 ...
##  $ C_GT30   : num  -0.0369 -0.0775 -0.08 -0.0214 -0.0799 ...
##  $ N_GT30   : num  -0.0652 0.0348 0.0851 -0.0349 -0.0172 ...
##  $ O_GT30   : num  0.0227 -0.1206 -0.1757 -0.0248 0.117 ...
##  $ GenD_E   : num  -0.144 -0.138 -0.11 -0.119 -0.13 ...
##  $ GenD_A   : num  -0.1219 -0.1319 -0.1059 -0.1083 -0.0936 ...
##  $ GenD_C   : num  -0.0358 -0.0835 -0.0715 -0.0695 -0.082 ...
##  $ GenD_N   : num  -0.313 -0.305 -0.304 -0.293 -0.267 ...
##  $ GenD_O   : num  0.00772 0.09418 0.12421 0.08705 0.08471 ...
##  $ AgeD_E   : num  -0.04686 0.0341 -0.00119 -0.05577 -0.09779 ...
##  $ AgeD_A   : num  -0.0931 0.1664 0.0103 0.0497 0.042 ...
##  $ AgeD_C   : num  -0.1057 0.2078 0.0319 0.0419 0.0203 ...
##  $ AgeD_N   : num  0.0233 -0.06573 -0.01261 -0.01226 0.00392 ...
##  $ AgeD_O   : num  0.0419 0.0109 0.1397 0.039 -0.0345 ...
##  $ TFR      : num  2.35 1.87 2 2.07 1.95 ...
##  $ alpha    : num  13.34 11.28 12.43 10.91 7.01 ...
##  $ peak     : num  23.9 24.6 23.2 25.2 28.9 ...
##  $ stop     : num  2.85 4.44 4.6 2.92 2.94 ...
##  $ ageFB    : num  24.3 23.6 23 24 25.6 25.7 27.2 25 25 24.5 ...
##  $ t_ageFM  : num  25.9 26.6 25.6 26.8 28.3 ...
##  $ nevermar : num  0.316 0.292 0.264 0.316 0.36 ...
##  $ divorce  : num  0.0163 0.0179 0.019 0.0144 0.0123 ...
##  $ cohabit  : num  8.2 4.8 5.3 7.7 8 8.1 6.5 7 7.6 6.1 ...
##  $ abortion : num  12 12 8.7 15.2 27.6 15.7 24.6 40 27.2 19.2 ...
##  $ t_nmf    : num  31.9 45 39.1 41.3 33.9 29.8 32 42.5 40.5 39.6 ...
##  $ unintprg : num  53 55 56 51 56 48 51 60 59 57 ...
##  $ famplnpw : num  147 147 152 151 245 80 65 143 92 142 ...
##  $ med_inc  : num  64576 40474 38307 46789 57708 ...
##  $ perAA    : num  3.7 26.2 15.4 4.1 6.2 4 10.1 21.4 16 30.5 ...
##  $ perHisp  : num  5.5 3.9 6.4 29.6 37.6 20.7 13.4 8.2 22.5 8.8 ...
##  $ perFem   : num  47.9 51.5 50.9 50.3 50.3 49.9 51.3 51.6 51.1 51.2 ...
##  $ perBA    : num  27.2 21.7 19.1 26.3 30.1 35.9 35.2 27.7 25.9 27.2 ...
##  $ perUrb   : num  66 59 56.2 89.8 95 ...
##  $ voteO    : num  38.7 37.9 45.1 38.9 61 ...
##  $ vryrel   : num  56.3 27.9 35.7 52.1 34 32.6 31.4 34.1 39 48.5 ...
##  $ relcons  : num  42.8 18.8 18.1 39.9 11.4 ...
##  $ statabb  : chr  "AL" "AK" "AZ" "AR" ...
Junkins_nums <- select_if(Junkins, is.numeric)  



par(mfcol=c(5,3),mai=c(0.5,0.5,0.5,0))



pairs(Junkins_nums[,1:20])

pairs(Junkins_nums[,21:30])

pairs(Junkins_nums[,31:40])

pairs(Junkins_nums[,41:50])

pairs(Junkins_nums[,51:58])

library(cluster)



#define linkage methods

m <- c( "average", "single", "complete", "ward")

names(m) <- c( "average", "single", "complete", "ward")





StateDataZ <- Junkins %>% select("cohabit","relcons","abortion","famplnpw","perBA") %>% scale() %>% as.data.frame()



#function to compute agglomerative coefficient

ac <- function(x) {

  agnes(StateDataZ, method = x)$ac

}



#calculate agglomerative coefficient for each clustering linkage method

sapply(m, ac)
##   average    single  complete      ward 
## 0.7501353 0.6654925 0.8154407 0.8889082
clust <- agnes(StateDataZ, method = "ward")



pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram",labels = Junkins$statename) 

According to the agglomerative coefficient for the clustering linkage methods, “Ward” is the optimal method of choice.

The dendrogram shows two main branches.

#install.packages( "factoextra")

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
gap_stat0 <- clusGap(StateDataZ, FUN = hcut, nstart = 25, K.max = 10, B = 50)

gap_stat1 <- clusGap(StateDataZ, FUN = kmeans, nstart = 25, K.max = 10, B = 50)





fviz_gap_stat(gap_stat0)

fviz_gap_stat(gap_stat1)

According to the GAP statistic plot, two cluster is the optimal number of clusters.

d <- dist(StateDataZ, method = "euclidean")

final_clust <- hclust(d, method = "ward.D2" )

groups <- cutree(final_clust, k=2)

table(groups)
## groups
##  1  2 
## 16 34
aggregate(StateDataZ, by=list(groups), mean)
##   Group.1    cohabit    relcons   abortion   famplnpw      perBA
## 1       1 -0.7527539  1.0632674 -0.5637586  0.4066386 -0.8010273
## 2       2  0.3542371 -0.5003611  0.2652982 -0.1913594  0.3769540
StateDataZ <- cbind(StateDataZ, hcluster = groups, scores)



#StateDataZ$newhcluster <- car::recode(StateDataZ$hcluster,"1=3;3=1;2=2")



boxplot(StateDataZ$PC1 ~ StateDataZ$hcluster,

                col='steelblue',

        xlab='Cluster',

        ylab='Value on PC1') 

boxplot(StateDataZ$PC2 ~ StateDataZ$hcluster,

                col='steelblue',

        xlab='Cluster',

        ylab='Value on PC2')

StateDataZ %>%

  ggplot(aes(x = PC1, col = factor(hcluster)))+

  geom_boxplot()+
theme_bw()+
  geom_text(aes(y = jitter(rnorm(50,0), amount = 1), label = state.abb), size =2.5, col = "black")+
  ggtitle("PC1 Scores by Cluster")

StateDataZ %>%

  ggplot(aes(x = PC2, col = factor(hcluster)))+

  geom_boxplot()+
theme_bw()+
  geom_text(aes(y = jitter(rnorm(50,0), amount = 1), label = state.abb), size =2.5, col = "black")+
    ggtitle("PC2 Scores by Cluster")

According to the plot above, cluster 2 scores higher in PC2 than cluster 1.

#Distribution of PC1

StateDataZ %>% ggplot(aes(x = PC1,col = state.region))+

  geom_boxplot()+

  geom_text(aes(y = jitter(rnorm(50,0), amount = 1), label = state.abb), size =2.5, col = "black")+

  theme_bw()+

  ggtitle("Distribution of PC1 By State and Region")+

  ylab("")+

  xlab("PC1 Score")+

    theme(

      axis.text.y=element_blank(),

      axis.ticks.y=element_blank())

#Distribution of PC2

StateDataZ %>% ggplot(aes(x = PC2,col = state.region))+

  geom_boxplot()+

  geom_text(aes(y = jitter(rnorm(50,0), amount = 1), label = state.abb), size =2.5, col = "black")+

  theme_bw()+

  ggtitle("Distribution of PC2 By State and Region")+

  ylab("")+

  xlab("PC2 Score")

If I were to do this assignment again, I’d begin with a more complete review of the literature and select variables for analysis more carefully. I think the variables that I chose for PCA could be supported with more evidence and that would be conducive to the subsequent cluster analysis. I appreciated learning new visualization techniques and exploring the conceptual relationships between hierarchical clustering, PCA, and k-means.