Introduction

Our group examines US county-level data on eviction and demographics (e.g. eviction filings, poverty rate, and racial composition) and their relationship to COVID-19 statistics such as case and death rates. Using supervised and unsupervised learning, we explore data obtained from USAFacts (https://usafacts.org/visualizations/coronavirus-covid-19-spread-map), Eviction Lab (https://evictionlab.org/), and the US Census as of 4/30/2020 (https://www.census.gov/popclock/).

Given the wide range of learning techniques we explore in our project, each model addresses a different question.

Table of Contents

Supervised Learning:
1. LM (simple, long, Logit)
2. KNN
3. Single Decision Trees, Bagging, Boosting and Random Forests
4. LDA

Unsupervised Learning:
1. PCA
2. Hierarchical
3. K-Means

Overall Analysis:
* Supervised vs Unsupervised
* Supervised
* Unsupervised

Load Data: COVID-19 and eviction data (appended).

covid.eviction <- read_csv("covid_eviction2.csv")

[SUPERVISED LEARNING]

1. LM Models

Simple LM

Question: Do eviction filing rates have an effect on COVID case rates at the county level?

A simple lm model uses only our independent variable of interest, eviction filing rates, on the right hand side.

lm1 <- lm(county_covid_cases_rate ~ `eviction-filing-rate`,
          data = covid.eviction)
summary(lm1)
## 
## Call:
## lm(formula = county_covid_cases_rate ~ `eviction-filing-rate`, 
##     data = covid.eviction)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.004502 -0.001107 -0.000784 -0.000126  0.061371 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.121e-03  7.062e-05  15.875  < 2e-16 ***
## `eviction-filing-rate` 9.185e-05  1.213e-05   7.573 4.95e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003119 on 2707 degrees of freedom
##   (432 observations deleted due to missingness)
## Multiple R-squared:  0.02075,    Adjusted R-squared:  0.02039 
## F-statistic: 57.36 on 1 and 2707 DF,  p-value: 4.955e-14

The lm output gives us the R-squared and adjusted R-squared values, which represent how well our model explains variation in the outcome we are predicting.

Adjusted R-squared:

The adjusted R-squared value here is low, meaning that while the eviction filing rate is statistically significant, it accounts for only about 2% of the variation in COVID case rates. To improve the lm model, we can consider whether other variables change our coefficient on eviction or explain COVID cases better than eviction does.

F-statistic:

The F-test, which tells us whether our R-squared value is significantly different from 0, is significant, which leads us to conclude that our model overall is significant.
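Both of these quantities can also be pulled directly from the summary object rather than read off the printed output (a small sketch using lm1 as defined above):

s1 <- summary(lm1)
s1$adj.r.squared   # adjusted R-squared, about 0.02
s1$fstatistic      # F statistic with its numerator and denominator degrees of freedom
# p-value of the overall F-test
pf(s1$fstatistic[1], s1$fstatistic[2], s1$fstatistic[3], lower.tail = FALSE)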

Magnitude of Coefficients

Another way to assess the efficacy of our linear model is by looking at the magnitude of our coefficients. While significant at below the 0.1% level, it’s difficult to know what the coefficient of 9.185e-05 actually means.
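For a concrete reading of the slope (a rough sketch using the printed coefficient, not part of the original output): a 10-point increase in the eviction filing rate corresponds to a predicted increase of about 10 * 9.185e-05 = 0.0009 in the county case rate, i.e. roughly 0.09 percentage points.

coef(lm1)[2] * 10   # predicted change in case rate for a county with a 10-point higher filing rate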

A graph and summary statistics can help us visualize what this slope actually means:

covid.eviction %>%
  ggplot(aes(x = `eviction-filing-rate`, 
             y = county_covid_cases_rate)) +
  geom_point() +
  geom_abline(slope = 9.185e-05, intercept = 1.121e-03, col = 'blue')

Even though our coefficient is significant, plotting the fitted line against the actual values shows that the points are heavily clustered at low COVID and eviction rates, with a few outliers skewing the fitted model. From a visual perspective, this lm does not fit the data well, so I wouldn’t trust it to predict COVID case rates from eviction filing rates alone. Perhaps a non-linear model would work better, but we can also try adding control and interaction terms to see whether the model changes.

sumstats <- stat.desc(covid.eviction %>%
                    dplyr::select(`poverty-rate`, `renter-occupied-households`, `pct-renter-occupied`, `median-gross-rent`, `median-household-income`, `median-property-value`, `rent-burden`, `pct-white`, `pct-af-am`, `pct-hispanic`, `pct-am-ind`, `pct-asian`, `pct-multiple`, `eviction-filing-rate`, county_covid_cases_rate, county_covid_death_rate))
kable(as.data.frame(sumstats[c(4, 8, 9, 5, 13), ]))
Summary statistics (one row per variable):

Variable                     min          median       mean         max           std.dev
poverty-rate                 0.000000     11.310000    12.203381    44.320000     5.748387
renter-occupied-households   15.00        2888.00      14440.12     1792186.00    55529.87
pct-renter-occupied          7.350000     27.290000    28.586609    96.300000     8.155774
median-gross-rent            275.0000     653.0000     700.7557     1827.0000     191.6848
median-household-income      19328.00     45112.50     46836.91     123453.00     12247.72
median-property-value        32300.00     111900.00    133647.50    902500.00     77987.08
rent-burden                  10.000000    28.800000    28.509392    50.000000     4.568922
pct-white                    0.95000      84.65000     77.30806     99.76000      19.87860
pct-af-am                    0.000000     2.090000     8.881694     85.950000     14.391028
pct-hispanic                 0.000000     3.740000     8.828389     98.710000     13.496955
pct-am-ind                   0.000000     0.260000     1.737275     87.380000     7.158680
pct-asian                    0.000000     0.540000     1.259465     41.640000     2.659009
pct-multiple                 0.000000     1.470000     1.808488     23.450000     1.700525
eviction-filing-rate         0.000000     1.570000     3.079255     109.160000    4.941657
county_covid_cases_rate      0.0000000    0.0006014    0.0014880    0.0625000     0.0032366
county_covid_death_rate      0.0000000    0.0100302    0.0367911    1.0000000     0.0699260
Looking at the summary statistics, we see that COVID case rates range from 0 to 6.25% of county populations. Eviction filing rates range from 0 to 109.16 (total eviction filings/number of renter-occupied households). Our lm does not estimate COVID case rates above around 1%, meaning our model would never predict a case rate above roughly 1%, when in reality we observe much higher rates. This further suggests that a simple linear model has real limitations for predicting on our data.
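As a quick check on that last claim (a one-line sketch using the fitted model above):

max(fitted(lm1))   # largest fitted case rate over the observed filing rates, about 0.011 (roughly 1%)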

How can we improve this linear model? If we still care about eviction rates, we can look at which other variables explain additional variation in COVID cases and whether our eviction rate variable remains significant. Adding control variables and interaction terms can add more depth to our understanding of the relationships in our data.

Let’s look at the correlation between our variables:

corr <- round(cor(covid.eviction %>%
                    dplyr::select(`poverty-rate`, `renter-occupied-households`, `pct-renter-occupied`, `median-gross-rent`, `median-household-income`, `median-property-value`, `rent-burden`, `pct-white`, `pct-af-am`, `pct-hispanic`, `pct-am-ind`, `pct-asian`, `pct-multiple`, `eviction-filing-rate`, county_covid_cases_rate, county_covid_death_rate), use = 'complete.obs'), 2)

corr[upper.tri(corr)] <- ""
corr <- as.data.frame(corr)
print(kable(corr), type = "latex", comment = FALSE)
Correlations (lower triangle; columns follow the same variable order as the rows):

                                  (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)   (10)   (11)   (12)   (13)   (14)   (15)   (16)
(1)  poverty-rate                 1
(2)  renter-occupied-households  -0.01   1
(3)  pct-renter-occupied          0.31   0.36   1
(4)  median-gross-rent           -0.38   0.37   0.28   1
(5)  median-household-income     -0.74   0.19  -0.09   0.74   1
(6)  median-property-value       -0.42   0.34   0.22   0.84   0.70   1
(7)  rent-burden                  0.38   0.15   0.31   0.23  -0.24   0.15   1
(8)  pct-white                   -0.50  -0.26  -0.49  -0.23   0.12  -0.06  -0.26   1
(9)  pct-af-am                    0.53   0.09   0.31   0.01  -0.30  -0.12   0.37  -0.67   1
(10) pct-hispanic                 0.10   0.22   0.25   0.21   0.08   0.11  -0.02  -0.58  -0.09   1
(11) pct-am-ind                   0.19  -0.04   0.08  -0.06  -0.06  -0.05  -0.14  -0.25  -0.10  -0.03   1
(12) pct-asian                   -0.16   0.47   0.39   0.62   0.46   0.61   0.15  -0.28   0.01   0.18  -0.03   1
(13) pct-multiple                -0.02   0.06   0.19   0.19   0.10   0.15   0.03  -0.12  -0.11  -0.04   0.32   0.25   1
(14) eviction-filing-rate         0.06   0.14   0.26   0.30   0.09   0.13   0.25  -0.28   0.39  -0.03  -0.07   0.15   0.07   1
(15) county_covid_cases_rate      0.07   0.15   0.16   0.18   0.09   0.16   0.08  -0.23   0.20   0.11  -0.03   0.17  -0.07   0.12   1
(16) county_covid_death_rate     -0.01   0.03  -0.02   0.02   0.01   0.01   0.05  -0.01   0.03  -0.02  -0.03   0.02   0.03   0.04   0.03   1

The variables most highly correlated with county-level COVID case rates are pct-white, median-gross-rent, pct-renter-occupied, and pct-asian; correlations with the death rate are all close to zero. While eviction filing rates are not very highly correlated with COVID case rates, we can still include them in our lm model.

Long LM Model Using High Correlation Variables

lm2 <- lm(county_covid_cases_rate ~ `eviction-filing-rate` + `pct-white` + `pct-asian` +  `pct-renter-occupied` + `median-gross-rent`,
          data = covid.eviction)
summary(lm2)
## 
## Call:
## lm(formula = county_covid_cases_rate ~ `eviction-filing-rate` + 
##     `pct-white` + `pct-asian` + `pct-renter-occupied` + `median-gross-rent`, 
##     data = covid.eviction)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.005676 -0.000982 -0.000497  0.000042  0.061098 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.669e-03  5.150e-04   3.240  0.00121 ** 
## `eviction-filing-rate`  3.587e-05  1.288e-05   2.784  0.00541 ** 
## `pct-white`            -2.548e-05  3.516e-06  -7.246 5.57e-13 ***
## `pct-asian`             4.692e-05  3.072e-05   1.527  0.12680    
## `pct-renter-occupied`   6.971e-06  8.817e-06   0.791  0.42924    
## `median-gross-rent`     1.982e-06  4.150e-07   4.774 1.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003036 on 2702 degrees of freedom
##   (433 observations deleted due to missingness)
## Multiple R-squared:  0.0738, Adjusted R-squared:  0.07209 
## F-statistic: 43.06 on 5 and 2702 DF,  p-value: < 2.2e-16

Adjusted R-squared:

As we can see in the output, choosing independent variables based on correlations alone helped somewhat but did not drastically increase the adjusted R-squared, and it left us with a couple of insignificant variables. We could keep adding and removing variables by hand until the fit seemed optimal, but that would take a lot of time, so we use a stepwise selection function to do it for us.

An additional strength of the stepwise function is that it optimizes models based on AIC rather than relying on R-squared and F-stats alone.

Stepwise Long LM with Interaction Terms

covid.eviction.naomit <- na.omit(covid.eviction %>%
                                   dplyr::select(`poverty-rate`, `renter-occupied-households`, `pct-renter-occupied`, `median-gross-rent`, `median-household-income`, `median-property-value`, `rent-burden`, `pct-white`, `pct-af-am`, `pct-hispanic`, `pct-am-ind`, `pct-asian`, `pct-multiple`, `eviction-filing-rate`, county_covid_cases_rate))
  
full.lm <- lm(county_covid_cases_rate ~.,
              data = covid.eviction.naomit)

step.lm <- stepAIC(full.lm, 
                   scope = . ~ .^2,
                   direction = "both",
                   trace = FALSE)
summary(step.lm)
## 
## Call:
## lm(formula = county_covid_cases_rate ~ `poverty-rate` + `renter-occupied-households` + 
##     `pct-renter-occupied` + `median-gross-rent` + `median-household-income` + 
##     `median-property-value` + `rent-burden` + `pct-white` + `pct-af-am` + 
##     `pct-hispanic` + `pct-am-ind` + `pct-asian` + `pct-multiple` + 
##     `eviction-filing-rate` + `pct-hispanic`:`eviction-filing-rate` + 
##     `pct-af-am`:`eviction-filing-rate` + `renter-occupied-households`:`pct-multiple` + 
##     `median-property-value`:`rent-burden` + `renter-occupied-households`:`pct-hispanic` + 
##     `renter-occupied-households`:`median-property-value` + `median-gross-rent`:`pct-asian` + 
##     `pct-hispanic`:`pct-asian` + `median-household-income`:`median-property-value` + 
##     `renter-occupied-households`:`pct-asian` + `median-gross-rent`:`median-property-value` + 
##     `poverty-rate`:`pct-af-am` + `pct-hispanic`:`pct-multiple` + 
##     `pct-am-ind`:`pct-asian` + `rent-burden`:`pct-asian` + `median-gross-rent`:`pct-hispanic` + 
##     `median-property-value`:`pct-hispanic` + `median-household-income`:`pct-hispanic` + 
##     `renter-occupied-households`:`rent-burden` + `renter-occupied-households`:`median-household-income` + 
##     `renter-occupied-households`:`pct-renter-occupied` + `median-gross-rent`:`median-household-income` + 
##     `median-household-income`:`rent-burden` + `pct-af-am`:`pct-multiple` + 
##     `pct-white`:`pct-af-am` + `poverty-rate`:`renter-occupied-households` + 
##     `renter-occupied-households`:`pct-af-am` + `renter-occupied-households`:`pct-white` + 
##     `renter-occupied-households`:`pct-am-ind` + `pct-asian`:`eviction-filing-rate` + 
##     `median-property-value`:`pct-asian` + `pct-renter-occupied`:`median-household-income` + 
##     `renter-occupied-households`:`eviction-filing-rate`, data = covid.eviction.naomit)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.009185 -0.000838 -0.000373  0.000218  0.059674 
## 
## Coefficients:
##                                                          Estimate
## (Intercept)                                             6.886e-03
## `poverty-rate`                                          1.497e-05
## `renter-occupied-households`                            2.620e-06
## `pct-renter-occupied`                                  -8.218e-05
## `median-gross-rent`                                     1.172e-05
## `median-household-income`                              -1.201e-07
## `median-property-value`                                -4.871e-08
## `rent-burden`                                          -2.206e-04
## `pct-white`                                            -1.564e-05
## `pct-af-am`                                            -2.419e-05
## `pct-hispanic`                                         -6.666e-06
## `pct-am-ind`                                           -1.008e-05
## `pct-asian`                                             1.986e-03
## `pct-multiple`                                          2.013e-05
## `eviction-filing-rate`                                 -7.374e-06
## `pct-hispanic`:`eviction-filing-rate`                   1.050e-05
## `pct-af-am`:`eviction-filing-rate`                     -1.979e-06
## `renter-occupied-households`:`pct-multiple`            -4.199e-08
## `median-property-value`:`rent-burden`                   9.215e-10
## `renter-occupied-households`:`pct-hispanic`            -3.036e-08
## `renter-occupied-households`:`median-property-value`   -6.135e-14
## `median-gross-rent`:`pct-asian`                        -1.098e-06
## `pct-hispanic`:`pct-asian`                              1.832e-05
## `median-household-income`:`median-property-value`       7.413e-13
## `renter-occupied-households`:`pct-asian`               -3.150e-08
## `median-gross-rent`:`median-property-value`            -2.119e-11
## `poverty-rate`:`pct-af-am`                              1.933e-06
## `pct-hispanic`:`pct-multiple`                          -1.177e-05
## `pct-am-ind`:`pct-asian`                               -1.390e-05
## `rent-burden`:`pct-asian`                              -4.126e-05
## `median-gross-rent`:`pct-hispanic`                     -2.292e-07
## `median-property-value`:`pct-hispanic`                  3.425e-10
## `median-household-income`:`pct-hispanic`                2.630e-09
## `renter-occupied-households`:`rent-burden`              4.941e-09
## `renter-occupied-households`:`median-household-income`  2.524e-12
## `renter-occupied-households`:`pct-renter-occupied`      1.249e-09
## `median-gross-rent`:`median-household-income`          -1.435e-10
## `median-household-income`:`rent-burden`                 3.038e-09
## `pct-af-am`:`pct-multiple`                             -1.012e-05
## `pct-white`:`pct-af-am`                                 6.964e-07
## `poverty-rate`:`renter-occupied-households`             3.419e-09
## `renter-occupied-households`:`pct-af-am`               -2.975e-08
## `renter-occupied-households`:`pct-white`               -2.933e-08
## `renter-occupied-households`:`pct-am-ind`              -2.535e-08
## `pct-asian`:`eviction-filing-rate`                      2.254e-05
## `median-property-value`:`pct-asian`                     7.876e-10
## `pct-renter-occupied`:`median-household-income`         1.634e-09
## `renter-occupied-households`:`eviction-filing-rate`    -7.586e-10
##                                                        Std. Error t value
## (Intercept)                                             2.042e-02   0.337
## `poverty-rate`                                          2.240e-05   0.668
## `renter-occupied-households`                            6.564e-07   3.991
## `pct-renter-occupied`                                   3.204e-05  -2.565
## `median-gross-rent`                                     2.507e-06   4.676
## `median-household-income`                               4.651e-08  -2.582
## `median-property-value`                                 1.127e-08  -4.323
## `rent-burden`                                           5.205e-05  -4.238
## `pct-white`                                             2.028e-04  -0.077
## `pct-af-am`                                             2.041e-04  -0.119
## `pct-hispanic`                                          2.048e-04  -0.033
## `pct-am-ind`                                            2.034e-04  -0.050
## `pct-asian`                                             3.652e-04   5.437
## `pct-multiple`                                          2.111e-04   0.095
## `eviction-filing-rate`                                  2.779e-05  -0.265
## `pct-hispanic`:`eviction-filing-rate`                   2.363e-06   4.441
## `pct-af-am`:`eviction-filing-rate`                      6.511e-07  -3.039
## `renter-occupied-households`:`pct-multiple`             7.783e-09  -5.395
## `median-property-value`:`rent-burden`                   3.013e-10   3.058
## `renter-occupied-households`:`pct-hispanic`             6.437e-09  -4.716
## `renter-occupied-households`:`median-property-value`    2.827e-14  -2.170
## `median-gross-rent`:`pct-asian`                         2.069e-07  -5.309
## `pct-hispanic`:`pct-asian`                              3.232e-06   5.668
## `median-household-income`:`median-property-value`       1.448e-13   5.118
## `renter-occupied-households`:`pct-asian`                6.455e-09  -4.880
## `median-gross-rent`:`median-property-value`             5.728e-12  -3.700
## `poverty-rate`:`pct-af-am`                              8.137e-07   2.376
## `pct-hispanic`:`pct-multiple`                           5.353e-06  -2.200
## `pct-am-ind`:`pct-asian`                                4.157e-06  -3.345
## `rent-burden`:`pct-asian`                               8.104e-06  -5.092
## `median-gross-rent`:`pct-hispanic`                      4.855e-08  -4.722
## `median-property-value`:`pct-hispanic`                  1.217e-10   2.815
## `median-household-income`:`pct-hispanic`                5.561e-10   4.729
## `renter-occupied-households`:`rent-burden`              8.819e-10   5.603
## `renter-occupied-households`:`median-household-income`  3.765e-13   6.704
## `renter-occupied-households`:`pct-renter-occupied`      3.622e-10   3.450
## `median-gross-rent`:`median-household-income`           4.233e-11  -3.391
## `median-household-income`:`rent-burden`                 1.316e-09   2.308
## `pct-af-am`:`pct-multiple`                              4.965e-06  -2.037
## `pct-white`:`pct-af-am`                                 3.209e-07   2.170
## `poverty-rate`:`renter-occupied-households`             9.163e-10   3.732
## `renter-occupied-households`:`pct-af-am`                6.429e-09  -4.628
## `renter-occupied-households`:`pct-white`                6.443e-09  -4.553
## `renter-occupied-households`:`pct-am-ind`               6.672e-09  -3.799
## `pct-asian`:`eviction-filing-rate`                      7.942e-06   2.838
## `median-property-value`:`pct-asian`                     3.353e-10   2.349
## `pct-renter-occupied`:`median-household-income`         6.703e-10   2.438
## `renter-occupied-households`:`eviction-filing-rate`     3.964e-10  -1.913
##                                                        Pr(>|t|)    
## (Intercept)                                            0.735913    
## `poverty-rate`                                         0.504106    
## `renter-occupied-households`                           6.76e-05 ***
## `pct-renter-occupied`                                  0.010383 *  
## `median-gross-rent`                                    3.08e-06 ***
## `median-household-income`                              0.009867 ** 
## `median-property-value`                                1.60e-05 ***
## `rent-burden`                                          2.34e-05 ***
## `pct-white`                                            0.938556    
## `pct-af-am`                                            0.905666    
## `pct-hispanic`                                         0.974039    
## `pct-am-ind`                                           0.960478    
## `pct-asian`                                            5.93e-08 ***
## `pct-multiple`                                         0.924061    
## `eviction-filing-rate`                                 0.790760    
## `pct-hispanic`:`eviction-filing-rate`                  9.33e-06 ***
## `pct-af-am`:`eviction-filing-rate`                     0.002394 ** 
## `renter-occupied-households`:`pct-multiple`            7.46e-08 ***
## `median-property-value`:`rent-burden`                  0.002248 ** 
## `renter-occupied-households`:`pct-hispanic`            2.52e-06 ***
## `renter-occupied-households`:`median-property-value`   0.030078 *  
## `median-gross-rent`:`pct-asian`                        1.19e-07 ***
## `pct-hispanic`:`pct-asian`                             1.60e-08 ***
## `median-household-income`:`median-property-value`      3.31e-07 ***
## `renter-occupied-households`:`pct-asian`               1.12e-06 ***
## `median-gross-rent`:`median-property-value`            0.000220 ***
## `poverty-rate`:`pct-af-am`                             0.017574 *  
## `pct-hispanic`:`pct-multiple`                          0.027927 *  
## `pct-am-ind`:`pct-asian`                               0.000834 ***
## `rent-burden`:`pct-asian`                              3.79e-07 ***
## `median-gross-rent`:`pct-hispanic`                     2.45e-06 ***
## `median-property-value`:`pct-hispanic`                 0.004914 ** 
## `median-household-income`:`pct-hispanic`               2.38e-06 ***
## `renter-occupied-households`:`rent-burden`             2.33e-08 ***
## `renter-occupied-households`:`median-household-income` 2.47e-11 ***
## `renter-occupied-households`:`pct-renter-occupied`     0.000570 ***
## `median-gross-rent`:`median-household-income`          0.000706 ***
## `median-household-income`:`rent-burden`                0.021086 *  
## `pct-af-am`:`pct-multiple`                             0.041704 *  
## `pct-white`:`pct-af-am`                                0.030093 *  
## `poverty-rate`:`renter-occupied-households`            0.000194 ***
## `renter-occupied-households`:`pct-af-am`               3.87e-06 ***
## `renter-occupied-households`:`pct-white`               5.54e-06 ***
## `renter-occupied-households`:`pct-am-ind`              0.000148 ***
## `pct-asian`:`eviction-filing-rate`                     0.004569 ** 
## `median-property-value`:`pct-asian`                    0.018881 *  
## `pct-renter-occupied`:`median-household-income`        0.014835 *  
## `renter-occupied-households`:`eviction-filing-rate`    0.055803 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002753 on 2657 degrees of freedom
## Multiple R-squared:  0.2508, Adjusted R-squared:  0.2376 
## F-statistic: 18.93 on 47 and 2657 DF,  p-value: < 2.2e-16

According to the step function, this is the optimal formula for an lm with interaction terms and controls. But this model shows that nearly every variable is significant and significantly interacts with another. What are we supposed to do with this information? If we have all variable information available for a given county, it may give us a precise prediction of covid case rates. However, the many NAs in our dataset mean that a lot of counties would not have enough of these variables observed to make a prediction using this model (rows are removed entirely if any NAs are observed). This may hint that our model is overfitting our data.
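One way to probe the overfitting concern is to compare out-of-fold prediction error for the stepwise interaction model against the simple one-variable lm. A minimal 5-fold cross-validation sketch, assuming step.lm and covid.eviction.naomit as defined above (this was not part of the original analysis):

set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(covid.eviction.naomit)))

# out-of-fold root mean squared error for a given model formula
cv.rmse <- function(f) {
  sapply(1:5, function(fold) {
    fit  <- lm(f, data = covid.eviction.naomit[folds != fold, ])
    pred <- predict(fit, newdata = covid.eviction.naomit[folds == fold, ])
    sqrt(mean((covid.eviction.naomit$county_covid_cases_rate[folds == fold] - pred)^2))
  })
}

mean(cv.rmse(formula(step.lm)))                                   # interaction model
mean(cv.rmse(county_covid_cases_rate ~ `eviction-filing-rate`))   # simple model

If the interaction model's out-of-fold error is no better than the simple model's, the extra terms are mostly fitting noise.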

In addition, multiple linear regression models are difficult to visualize. A 3-D plane could be constructed to view a model with two independent variables, but beyond that there isn’t much we can do to understand our data visually. Interaction terms also make the entire model less interpretable: we can no longer interpret the coefficient on any variable “holding all else constant.” So while we may gain in-sample precision, we lose interpretability, and likely the ability to predict accurately on new data.

Without interaction terms, the step function actually isn’t too bad at fitting a model. But again, we can’t really visualize it and really just have to hope that it can produce accurate predictions.

Adjusted R-squared

Our Adjusted R-squared value increased to 0.23, meaning variations in the model explain about 23% of variations in Covid case rates, but our F-statistic decreased. It’s still significant, but it may suggest that our model overall isn’t as strong.

F-statistic

Our F-statistic is reduced but still statistically significant.

Overall, using a stepwise function to find the best lm is not optimal because it results in overfitting, chooses models based on AIC rather than other significance measures, and as a result includes insignificant variables.

Stepwise Long LM without Interaction Terms

step.lm2 <- stepAIC(full.lm,
                   direction = "both",
                   trace = FALSE)
summary(step.lm2)
## 
## Call:
## lm(formula = county_covid_cases_rate ~ `poverty-rate` + `renter-occupied-households` + 
##     `pct-renter-occupied` + `median-household-income` + `median-property-value` + 
##     `pct-af-am` + `pct-hispanic` + `pct-multiple`, data = covid.eviction.naomit)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.007660 -0.000970 -0.000457  0.000087  0.060444 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -2.614e-03  5.426e-04  -4.818 1.53e-06 ***
## `poverty-rate`                6.604e-05  1.754e-05   3.765  0.00017 ***
## `renter-occupied-households`  3.092e-09  1.148e-09   2.693  0.00713 ** 
## `pct-renter-occupied`         1.423e-05  8.920e-06   1.595  0.11090    
## `median-household-income`     4.037e-08  9.122e-09   4.426 9.99e-06 ***
## `median-property-value`       5.317e-09  1.190e-09   4.470 8.15e-06 ***
## `pct-af-am`                   4.242e-05  5.041e-06   8.414  < 2e-16 ***
## `pct-hispanic`                1.446e-05  4.927e-06   2.935  0.00336 ** 
## `pct-multiple`               -1.504e-04  3.505e-05  -4.290 1.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002977 on 2696 degrees of freedom
## Multiple R-squared:  0.1112, Adjusted R-squared:  0.1085 
## F-statistic: 42.15 on 8 and 2696 DF,  p-value: < 2.2e-16

Adjusted R-squared

Our Adjusted R-squared value increased from 0.02 in our simple lm to 0.11 here. It is still lower than the 0.24 value we got from our long lm including interaction terms, meaning it explains less of the variation. However, with fewer variables and no interaction terms, the coefficients are relatively easy to interpret and we still gain some ability to explain variation in COVID case rates.

F-statistic:

Our F-statistic jumps up again, suggesting that this model is still relatively strong and our R-squared is significantly different from 0.

Significance of our initial independent variable of interest

Eviction filing rate was dropped entirely by the stepwise selection (it no longer appears in the model), which highlights how this process can identify important variables for us rather than leaving us to guess. However, we are then no longer answering our original question and cannot draw meaningful conclusions about our hypotheses.

One downside of all these linear models with a continuous dependent variable is that we cannot measure classification accuracy the way we can with other types of predictive models. To allow that, we can fit a logistic regression (logit) on a binary dependent variable: whether a county's COVID case rate is greater than the state average.

Simple Logistic Regression

logit.model <- glm(covidCases_greater_state ~ `eviction-filing-rate`, 
                   family = 'binomial',
                   data = covid.eviction)

summary(logit.model)
## 
## Call:
## glm(formula = covidCases_greater_state ~ `eviction-filing-rate`, 
##     family = "binomial", data = covid.eviction)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7567  -0.7690  -0.7347   1.3781   1.7120  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.202928   0.053890 -22.322  < 2e-16 ***
## `eviction-filing-rate`  0.055702   0.009084   6.132 8.67e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3101.5  on 2680  degrees of freedom
## Residual deviance: 3061.3  on 2679  degrees of freedom
##   (460 observations deleted due to missingness)
## AIC: 3065.3
## 
## Number of Fisher Scoring iterations: 4

AIC

In this logit model, our eviction filing rate is again significant. The AIC of 3065.3 summarizes model quality with a penalty for the number of parameters, but it is only meaningful relative to other models fit to the same data: lower AIC indicates a better tradeoff of fit and complexity, so this value on its own tells us little about how strong the model is.
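As a sketch of that relative comparison, we can refit an intercept-only logit on the same rows the model above actually used (using only objects defined above):

# intercept-only (null) logit on the rows used by logit.model
null.logit <- glm(covidCases_greater_state ~ 1,
                  family = 'binomial',
                  data = model.frame(logit.model))
AIC(null.logit, logit.model)   # the model with the lower AIC is preferred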

Because we only have one independent variable, the model is pretty easy to visualize:

covid.eviction %>%
  ggplot(aes(x = `eviction-filing-rate`, 
             y = covidCases_greater_state)) +
  geom_point(alpha = 0.25) +
  stat_smooth(method = "glm", 
              method.args = list(family = "binomial"), 
              se = FALSE)

Viewing the plot, a few things stand out. For the most part, the distribution of 1s and 0s does not look very different across eviction filing rates, and we still have one outlier at an eviction filing rate above 100. Also, the fitted curve never drops much below 0.20, so the model is never very confident that a given county's case rate is below the state average. Is this logit model good at helping us predict COVID outcomes?

I will split the data into training and test sets, make predictions on the test set, round them to 0 or 1, and compare them with the actual covidCases_greater_state values to measure accuracy.

# subsetting original data to train & test
# (subset the columns of interest and drop NAs *before* sampling, so the
#  sampled row indices refer to the same data frame we then split)
covid.eviction2 <- covid.eviction[, c(21, 41)] %>%
  na.omit()

ran <- sample(1:nrow(covid.eviction2), 0.9 * nrow(covid.eviction2))

train <- covid.eviction2[ran, ]
test <- covid.eviction2[-ran, 1]
# train.cl <- covid.eviction2[ran, 2]
test.cl <- covid.eviction2[-ran, 2]
  
# # making sure dependent variable is read as factor
# train.cl$covidCases_greater_state <- as.factor(train.cl$covidCases_greater_state)
# test.cl$covidCases_greater_state <- as.factor(test.cl$covidCases_greater_state)

train.logit.model <- glm(covidCases_greater_state ~ `eviction-filing-rate`, 
                   family = 'binomial',
                   data = train)

# creating column of predictions
logit.predict <- test %>%
  mutate(prediction = round(predict.glm(train.logit.model,
                          newdata = test,
                          type = 'response')))

# confusion matrix
cm1 <- confusionMatrix(as.factor(logit.predict$prediction), as.factor(test.cl$covidCases_greater_state))
logit.accuracy <- as.data.table(cm1$overall)[1,]
logit.kappa <- as.data.table(cm1$overall)[2,]

print(cm1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 186  82
##          1   1   3
##                                          
##                Accuracy : 0.6949         
##                  95% CI : (0.6364, 0.749)
##     No Information Rate : 0.6875         
##     P-Value [Acc > NIR] : 0.4253         
##                                          
##                   Kappa : 0.0405         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.99465        
##             Specificity : 0.03529        
##          Pos Pred Value : 0.69403        
##          Neg Pred Value : 0.75000        
##              Prevalence : 0.68750        
##          Detection Rate : 0.68382        
##    Detection Prevalence : 0.98529        
##       Balanced Accuracy : 0.51497        
##                                          
##        'Positive' Class : 0              
## 

Accuracy

While the accuracy of our logistic regression model looks decent at about 0.69, it is barely above the no-information rate of 0.6875 (P-value 0.43), and the confusion matrix shows a lot of false negatives: counties above the state average predicted as below it. We would ideally want those minimized if we were trying to flag the counties under the most stress from COVID, so this model would underreport counties with case rates higher than state levels.
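If false negatives are the costlier mistake, one option is to classify at a threshold below 0.5 instead of simply rounding. A sketch using the objects above (the 0.3 cutoff is an arbitrary illustrative value):

# classify a county as 'greater than state average' whenever its predicted probability exceeds 0.3
logit.predict.30 <- test %>%
  mutate(prediction = as.integer(predict.glm(train.logit.model,
                                             newdata = test,
                                             type = 'response') > 0.3))

table(Prediction = logit.predict.30$prediction,
      Reference = test.cl$covidCases_greater_state)

Lowering the cutoff flags more counties as above the state average, trading false negatives for additional false positives.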

Precision:

logit.precision <- precision(cm1$table)
print(logit.precision)
## [1] 0.6940299

Relatively high precision (positive predictive value) means that the fraction of relevant predictions (true positives) out of all positive predictions (true positives + false positives) is fairly high. False positives do sit in that denominator, but we are not too worried about them in a COVID model; false positives are arguably less costly than false negatives. Because this value is still below 1, not all of the results retrieved by our model were relevant.

Recall:

logit.recall <- recall(cm1$table)
print(logit.recall)
## [1] 0.9946524

Recall measures how well our model retrieves the relevant instances (true positives out of true positives + false negatives). We care about false negatives because we would not want the model to miss areas under strain from COVID. Our very high recall value shows that the model retrieved almost all relevant instances.
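To make these definitions concrete, both values can be recomputed by hand from the confusion matrix above (a sketch; note that caret is treating 0 as the 'positive' class here, per the printed output):

tab <- cm1$table
tab[1, 1] / sum(tab[1, ])   # precision: true positives / all positive predictions, about 0.69
tab[1, 1] / sum(tab[, 1])   # recall: true positives / all actual positives, about 0.99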

F-measure:

Since we really can’t cay for sure whether it’s better to have false negatives (county doesn’t receive attention) or false positives (diversion of attention from counties that actually need it to ones that don’t), we can weigh the overall precision and recall of the model using a beta of 1, meaning recall and precision are equally weighted.

logit.f <- F_meas(cm1$table)
logit.matrix <-
  data.frame(
    Measure = c("Accuracy", "Precision", "Recall", "F-measure", "Kappa"),
    Value = c(
      as.numeric(logit.accuracy),
      as.numeric(logit.precision),
      as.numeric(logit.recall),
      as.numeric(logit.f),
      as.numeric(logit.kappa)
    )
  )
print(logit.f)
## [1] 0.8175824

Our F-measure is about 0.82. Considering 1 is the upper limit of this measure, that is reasonably good, but it becomes more informative when compared with the F-measures from our KNN models.

Cohen’s Kappa:

Kappa compares observed accuracy to the accuracy we would expect if classification were random. Kappa ranges from -1 to 1, and our value of about 0.04 indicates almost no agreement beyond chance, meaning our model does little better than random guessing at matching predictions to the true observations.
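As a sketch of that calculation from the confusion matrix above, the chance-expected accuracy comes from the row and column marginals:

tab <- cm1$table
p.observed <- sum(diag(tab)) / sum(tab)
p.expected <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
(p.observed - p.expected) / (1 - p.expected)   # about 0.04, matching cm1$overall["Kappa"]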

2. KNN

For my KNN classifier, I use accuracy as the initial metric for optimization. Larger values of K smooth the decision boundary but respond less to genuine local structure, while smaller values of K respond more to variation (and noise) in the data. To balance this tradeoff, I plot the accuracy of the KNN classifier for K = 1 to 10 and choose the smallest K at which the marginal gain in accuracy tapers off.

# create function to normalize axis to improve distance calculations
normalize.function <- function(x){
  (x - min(x))/(max(x) - min(x))
}

# extracting columns of interest based on lm models, in this case eviction filing rates and percentage of residents who are african american
covid.eviction2 <- covid.eviction[, c(11, 21, 41)] %>%
  na.omit()

# applying normalization function to variables of interest
ce2 <- covid.eviction2
ce2$pct.af.am <- normalize.function(ce2$`pct-af-am`)
ce2$eviction.filing.rate <- normalize.function(ce2$`eviction-filing-rate`)

# run loop through ks 1-20 and graph to find optimal K (where additional k has significant marginal effect on accuracy)
store <- NULL
accuracy <- NULL

# because this seems to take forever, I'm limiting k to 10
for(k in 1:10){
  
  for(i in 1:nrow(ce2)){
    
    train <- ce2[-i,]
    test <- ce2[i,]
    
    prediction <- knn(train = train %>%
                        dplyr::select(pct.af.am, eviction.filing.rate),
                      test = test %>%
                        dplyr::select(pct.af.am, eviction.filing.rate),
                      train$covidCases_greater_state,
                      k)
      
    store[i] <- ifelse(prediction == ce2$covidCases_greater_state[i],
                       1, 
                       0)
    
  }
  
  accuracy[k] <- mean(store)
  
    }

k <- (1:10)

find.k <- data.frame(k, accuracy)

find.k %>%
  ggplot(aes(x = k, y = accuracy)) +
  geom_point() +
  geom_line()

We’ll go with k = 5, 7, and 10.

# create a grid of normalized values within the range of our training data for visualization:
eviction.norm.seq <- seq(from = 0, to = 1, by = .01)
afam.norm.seq <- seq(from = 0, to = 1, by = .01)
norm.grid <- expand.grid(eviction.norm.seq, afam.norm.seq)
colnames(norm.grid)[c(1,2)] <- c('eviction.filing.rate', 'pct.af.am') 

knn.grid <- norm.grid %>%
  mutate(prediction = knn(train = ce2 %>%
                            dplyr::select(pct.af.am, eviction.filing.rate),
                          test = norm.grid,
                          ce2$covidCases_greater_state,
                          5))

knn.grid %>%
  ggplot(aes(x = eviction.filing.rate,
             y = pct.af.am)) +
  geom_point(aes(color = factor(prediction)),
             size = 2,
             alpha = 0.1) +
  geom_point(data = ce2,
             mapping = aes(x = eviction.filing.rate,
                           y = pct.af.am,
                           color = factor(covidCases_greater_state),
                           alpha = 0.5),
             size = 3) +
  xlab('County Eviction Filing Rate') +
  ylab('County Percent African American Residents') +
  ggtitle('(K = 5) How well does our KNN model predict county COVID cases?') +
  scale_color_discrete(name = "Prediction", labels = c("Greater than State Avg", "Less than State Avg"))

knn.grid <- norm.grid %>%
  mutate(prediction = knn(train = ce2 %>%
                            dplyr::select(pct.af.am, eviction.filing.rate),
                          test = norm.grid,
                          ce2$covidCases_greater_state,
                          7))

knn.grid %>%
  ggplot(aes(x = eviction.filing.rate,
             y = pct.af.am)) +
  geom_point(aes(color = factor(prediction)),
             size = 2,
             alpha = 0.1) +
  geom_point(data = ce2,
             mapping = aes(x = eviction.filing.rate,
                           y = pct.af.am,
                           color = factor(covidCases_greater_state),
                           alpha = 0.5),
             size = 3) +
  xlab('County Eviction Filing Rate') +
  ylab('County Percent African American Residents') +
  ggtitle('(K = 7) How well does our KNN model predict county COVID cases?') +
  scale_color_discrete(name = "Prediction", labels = c("Greater than State Avg", "Less than State Avg"))

knn.grid <- norm.grid %>%
  mutate(prediction = knn(train = ce2 %>%
                            dplyr::select(pct.af.am, eviction.filing.rate),
                          test = norm.grid,
                          ce2$covidCases_greater_state,
                          10))

knn.grid %>%
  ggplot(aes(x = eviction.filing.rate,
             y = pct.af.am)) +
  geom_point(aes(color = factor(prediction)),
             size = 2,
             alpha = 0.1) +
  geom_point(data = ce2,
             mapping = aes(x = eviction.filing.rate,
                           y = pct.af.am,
                           color = factor(covidCases_greater_state),
                           alpha = 0.5),
             size = 3) +
  xlab('County Eviction Filing Rate') +
  ylab('County Percent African American Residents') +
  ggtitle('(K = 10) How well does our KNN model predict county COVID cases?') +
  scale_color_discrete(name = "Prediction", labels = c("Greater than State Avg", "Less than State Avg"))

Looking at these plots, we can see a lot of blue predictions between normalized eviction filing rates of 0.50 and 1.00, where there is no original data, for all three values of K. That suggests our model would falsely predict a lot of ‘Less than state avg’ in places where we have no original data to confirm it, likely driven by the single outlier in eviction filing rates near 1.00 (normalized). More generally, a KNN model is less helpful for making predictions in regions it wasn't trained on. Applied to our real-world example, if testing is concentrated in counties with low eviction rates, or if eviction rates really are this drastically skewed, KNN may not be a great predictor of COVID case rates in underrepresented areas.

In addition, because we have so much data and it is largely clustered in one third of the grid, it is hard to see what is going on with the predictions behind the actual data points. Our two classes are not clearly separated, which makes it difficult for a KNN model to find a meaningful decision boundary.
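One way to check whether the single extreme eviction filing rate is compressing the normalized scale is to cap it before re-normalizing. A sketch (the 99th-percentile cap and the name ce2.capped are illustrative choices, not part of the analysis above):

ce2.capped <- covid.eviction2
cap <- quantile(ce2.capped$`eviction-filing-rate`, 0.99)
ce2.capped$eviction.filing.rate <- normalize.function(pmin(ce2.capped$`eviction-filing-rate`, cap))
ce2.capped$pct.af.am <- normalize.function(ce2.capped$`pct-af-am`)
summary(ce2.capped$eviction.filing.rate)   # the bulk of the data now spreads over more of the 0-1 range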

How good is this KNN model by other measures?

Confusion Matrix for KNN

# subsetting original data to train & test
ran <- sample(1:nrow(ce2), 0.9 * nrow(ce2))

train <- ce2[ran, c(4,5)]
test <- ce2[-ran, c(4,5)]
train.cl <- ce2[ran, 3]
test.cl <- ce2[-ran, 3]

# making sure dependent variable is read as factor
train.cl$covidCases_greater_state <- as.factor(train.cl$covidCases_greater_state)
test.cl$covidCases_greater_state <- as.factor(test.cl$covidCases_greater_state)

# creating column of predictions
knn.predict.k5 <- test %>%
  mutate(prediction = knn(train = train,
                          test = test,
                          train.cl$covidCases_greater_state,
                          k = 5))

# creating column of predictions
knn.predict.k7 <- test %>%
  mutate(prediction = knn(train = train,
                          test = test,
                          train.cl$covidCases_greater_state,
                          k = 7))

# creating column of predictions
knn.predict.k10 <- test %>%
  mutate(prediction = knn(train = train,
                          test = test,
                          train.cl$covidCases_greater_state,
                          k = 10))

cm.k5 <- confusionMatrix(knn.predict.k5$prediction, test.cl$covidCases_greater_state)
accuracy.k5 <- as.data.table(cm.k5$overall)[1]
precision.k5 <- precision(cm.k5$table)
recall.k5 <- recall(cm.k5$table)
f.k5 <- F_meas(cm.k5$table)
kappa.k5 <- as.data.table(cm.k5$overall)[2]

cm.k7 <- confusionMatrix(knn.predict.k7$prediction, test.cl$covidCases_greater_state)
accuracy.k7 <- as.data.table(cm.k7$overall)[1,]
precision.k7 <- precision(cm.k7$table)
recall.k7 <- recall(cm.k7$table)
f.k7 <- F_meas(cm.k7$table)
kappa.k7 <- as.data.table(cm.k7$overall)[2,]

cm.k10 <- confusionMatrix(knn.predict.k10$prediction, test.cl$covidCases_greater_state)
accuracy.k10 <- as.data.table(cm.k10$overall)[1,]
precision.k10 <- precision(cm.k10$table)
recall.k10 <- recall(cm.k10$table)
f.k10 <- F_meas(cm.k10$table)
kappa.k10 <- as.data.table(cm.k10$overall)[2,]

kappas <- c(kappa.k5, kappa.k7, kappa.k10)
kappas <- unlist(kappas, use.names = FALSE)
  
KNN.matrix <- data.frame(K = c(5, 7, 10), 
  # wrap in as.numeric() so the data.table objects collapse to plain numbers
  # instead of expanding into duplicated Accuracy columns
  Accuracy = c(as.numeric(accuracy.k5), as.numeric(accuracy.k7), as.numeric(accuracy.k10)),
  Precision = c(precision.k5, precision.k7, precision.k10),
  Recall = c(recall.k5, recall.k7, recall.k10),
  F_measure = c(f.k5, f.k7, f.k10),
  Kappa = kappas)

kable(KNN.matrix)
  K   Accuracy    Precision   Recall      F_measure   Kappa
  5   0.6765799   0.7191489   0.8894737   0.7952941   0.0648152
  7   0.6691450   0.7112971   0.8947368   0.7925408   0.0260364
 10   0.6765799   0.7085020   0.9210526   0.8009153   0.0122399

Accuracy:

Accuracy is nearly identical across K = 5, 7, and 10, differing by less than a percentage point. Intuitively, a smaller K looks only at ‘closer’ neighbors, which are more likely to resemble the point being predicted, but here the choice of K makes little practical difference.

However, even the highest accuracy of about 0.68 is not very high: a sizable fraction of the predictions made by these models is incorrect. All of these accuracies are also lower than that of our logistic regression model.

Precision:

Precision seems to decrease as K increases, suggesting a lower k is preferable when aiming for precision/a higher fraction of our retrieved instances to be relevant. These precision values are all higher than that of our logit model.

Recall:

Recall increases as K increases, suggesting a higher k is preferable if we care more about a high recall/retrieving more of the total relevant instances. These recall values are all well below that of our logit model.

F-measure:

The F-measure is highest at K = 10. If we weigh precision and recall equally, our F-measures suggest that K = 10 is optimal, as it best balances the two. These F-measures are all a little lower than that of our logit model.

Cohen’s Kappa:

The Kappas don’t seem to follow any particular pattern (they change every time the code is run), making it difficult to assess the true value of Kappa. Our kappas are all pretty low (below 0.20). Kappas of this value are considered ot have little to no agreement, meaning despite our other favorabl measures, these KNN models may not be much better at predicting than random chance. However, these kappas are all well above the negative kappa produced by our logit model.

3. Single Decision Trees, Bagging, Boosting and Random Forests

In this section, I use decision trees to predict a county's COVID-19 case and death outcomes from its eviction filing rate and other demographic variables. I compare 4 types of models: a single decision tree, bagging, boosting, and random forests. Each model is trained on the training data, and the testing data is used to evaluate its performance.

Create Training and Testing Data

Create two datasets: one for training the different models and the other for testing them.

#generate index for training data
index <- sample(nrow(covid.eviction),
                size = nrow(covid.eviction)*0.7,
                replace = FALSE)

#create training data
covid.eviction.train <- covid.eviction[index,]

#create testing data
covid.eviction.test <- covid.eviction[-index,]

Single Decision Trees

Create 4 decision tree models, one for each of the 4 outcome variables: covidCases_greater_state, covidCases_greater_US, covidDeaths_greater_state, covidDeaths_greater_US. The following also shows the decision tree generated by each model.

#single decision tree model for covidCases_greater_state
tree.case.state <- rpart(
  factor(covidCases_greater_state) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train
)

fancyRpartPlot(tree.case.state)

#single decision tree model for covidCases_greater_US
tree.case.US <- rpart(
  factor(covidCases_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train
)

fancyRpartPlot(tree.case.US)

#single decision tree model for covidDeaths_greater_state
tree.death.state <- rpart(
  factor(covidDeaths_greater_state) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train
)

fancyRpartPlot(tree.death.state)

#single decision tree model for covidDeaths_greater_US
tree.death.US <- rpart(
  factor(covidDeaths_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train
)

tree.death.US
## n=1990 (208 observations deleted due to missingness)
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1990 442 0 (0.7778894 0.2221106) *

Among the four decision tree models, the last one, which uses covidDeaths_greater_US as the outcome, does not produce a tree at all: it stops at the root node and does not split further. Below, I refit this model on county-level data from individual states to check whether that produces decision trees.
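One possible explanation is rpart's default complexity parameter (cp = 0.01), which discards any split that does not improve the overall fit by at least roughly 1%. A quick check, sketched below on the same training data (not part of the state-level comparison that follows), is whether relaxing cp produces any splits at all:

tree.death.US.cp <- rpart(
  factor(covidDeaths_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  cp = 0.001   # much looser than the default 0.01
)

printcp(tree.death.US.cp)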

#single decision tree model of NY
tree.death.US.NY <- rpart(
  factor(covidDeaths_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  subset = State == "NY",
  minsplit = 10
)

fancyRpartPlot(tree.death.US.NY)

#single decision tree model of TX
tree.death.US.TX <- rpart(
  factor(covidDeaths_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  subset = State == "TX",
  minsplit = 10
)

fancyRpartPlot(tree.death.US.TX)

#single decision tree model of WA
tree.death.US.WA <- rpart(
  factor(covidDeaths_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  subset = State == "WA",
  minsplit = 10
)

fancyRpartPlot(tree.death.US.WA)

#single decision tree model of MI
tree.death.US.MI <- rpart(
  factor(covidDeaths_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  subset = State == "MI",
  minsplit = 10
)

fancyRpartPlot(tree.death.US.MI)

When using state-level subsets of the data (here NY, TX, WA and MI), the models do generate decision trees. This suggests that, across the whole US, these independent variables are not good predictors of whether a county's COVID-19 death rate is greater than the US average, but within specific states some of them are.

Bagging

Create 4 bagging models, one for each outcome variable. I use the default number of trees in the built-in bagging function, which is 25 bootstrap replications. The output below also shows the out-of-bag estimate of the misclassification error.

#bagging model for covidCases_greater_state
bagging.case.state <- bagging(
  factor(covidCases_greater_state) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  coob = TRUE
)

bagging.case.state
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = factor(covidCases_greater_state) ~ 
##     `eviction-filing-rate` + `poverty-rate` + `pct-renter-occupied` + 
##         `median-household-income` + `pct-white` + `pct-af-am` + 
##         `pct-hispanic` + `pct-asian` + population, data = covid.eviction.train, 
##     coob = TRUE)
## 
## Out-of-bag estimate of misclassification error:  0.285
#bagging model for covidCases_greater_US
bagging.case.US <- bagging(
  factor(covidCases_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  coob = TRUE
)

bagging.case.US
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = factor(covidCases_greater_US) ~ 
##     `eviction-filing-rate` + `poverty-rate` + `pct-renter-occupied` + 
##         `median-household-income` + `pct-white` + `pct-af-am` + 
##         `pct-hispanic` + `pct-asian` + population, data = covid.eviction.train, 
##     coob = TRUE)
## 
## Out-of-bag estimate of misclassification error:  0.1052
#bagging model for covidDeaths_greater_state
bagging.death.state <- bagging(
  factor(covidDeaths_greater_state) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  coob = TRUE
)

bagging.death.state
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = factor(covidDeaths_greater_state) ~ 
##     `eviction-filing-rate` + `poverty-rate` + `pct-renter-occupied` + 
##         `median-household-income` + `pct-white` + `pct-af-am` + 
##         `pct-hispanic` + `pct-asian` + population, data = covid.eviction.train, 
##     coob = TRUE)
## 
## Out-of-bag estimate of misclassification error:  0.3995
#bagging model for covidDeaths_greater_US
bagging.death.US <- bagging(
  factor(covidDeaths_greater_US) ~
    `eviction-filing-rate` +
    `poverty-rate` +
    `pct-renter-occupied` +
    `median-household-income` +
    `pct-white` +
    `pct-af-am` +
    `pct-hispanic` +
    `pct-asian` +
    `population`,
  data = covid.eviction.train,
  coob = TRUE
)

bagging.death.US
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = factor(covidDeaths_greater_US) ~ 
##     `eviction-filing-rate` + `poverty-rate` + `pct-renter-occupied` + 
##         `median-household-income` + `pct-white` + `pct-af-am` + 
##         `pct-hispanic` + `pct-asian` + population, data = covid.eviction.train, 
##     coob = TRUE)
## 
## Out-of-bag estimate of misclassification error:  0.2494
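For context, each of these out-of-bag errors can be compared against the no-information baseline of always predicting the majority class for that outcome (a sketch using the training data defined above):

outcomes <- c("covidCases_greater_state", "covidCases_greater_US",
              "covidDeaths_greater_state", "covidDeaths_greater_US")
sapply(outcomes, function(v) 1 - max(prop.table(table(covid.eviction.train[[v]]))))

An out-of-bag error close to its baseline suggests the bagged trees add little beyond predicting the more common class.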

Boosting

There are 4 boosting models, one for each outcome variable. Each uses 25 trees (nIter = 25), the same number as the bootstrap replications used for bagging. The output also shows the weight of each tree.

#boosting model for covidCases_greater_state
boosting.case.state <- adaboost(
  covidCases_greater_state ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train),
  nIter = 25
)

boosting.case.state
## adaboost(formula = covidCases_greater_state ~ eviction.filing.rate + 
##     poverty.rate + pct.renter.occupied + median.household.income + 
##     pct.white + pct.af.am + pct.hispanic + pct.asian + population, 
##     data = data.frame(covid.eviction.train), nIter = 25)
## covidCases_greater_state ~ eviction.filing.rate + poverty.rate + 
##     pct.renter.occupied + median.household.income + pct.white + 
##     pct.af.am + pct.hispanic + pct.asian + population
## Dependent Variable: covidCases_greater_state
## No of trees:25
## The weights of the trees are:0.81892210.7514950.81427520.78087320.80587270.79017980.8122760.78499970.76382430.75464450.77189790.76983460.75556010.79214890.84378460.77555540.77705820.7664260.82206090.80255520.76458840.79073040.79344960.78695040.7854634
#boosting model for covidCases_greater_US
boosting.case.US <- adaboost(
  covidCases_greater_US ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train),
  nIter = 25
)

boosting.case.US
## adaboost(formula = covidCases_greater_US ~ eviction.filing.rate + 
##     poverty.rate + pct.renter.occupied + median.household.income + 
##     pct.white + pct.af.am + pct.hispanic + pct.asian + population, 
##     data = data.frame(covid.eviction.train), nIter = 25)
## covidCases_greater_US ~ eviction.filing.rate + poverty.rate + 
##     pct.renter.occupied + median.household.income + pct.white + 
##     pct.af.am + pct.hispanic + pct.asian + population
## Dependent Variable: covidCases_greater_US
## No of trees:25
## The weights of the trees are:1.285671.0654610.99808490.98762610.96722590.94577050.95881950.97353950.94595820.92786580.98979920.98515221.0231390.9428650.92164380.92306010.91927860.99492790.81120260.88392520.98740150.92419430.86019680.96369180.9800169
#boosting model for covidDeaths_greater_state
boosting.death.state <- adaboost(
  covidDeaths_greater_state ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train),
  nIter = 25
)

boosting.death.state
## adaboost(formula = covidDeaths_greater_state ~ eviction.filing.rate + 
##     poverty.rate + pct.renter.occupied + median.household.income + 
##     pct.white + pct.af.am + pct.hispanic + pct.asian + population, 
##     data = data.frame(covid.eviction.train), nIter = 25)
## covidDeaths_greater_state ~ eviction.filing.rate + poverty.rate + 
##     pct.renter.occupied + median.household.income + pct.white + 
##     pct.af.am + pct.hispanic + pct.asian + population
## Dependent Variable: covidDeaths_greater_state
## No of trees:25
## The weights of the trees are:0.75139970.69123680.69189090.76828320.71337990.66519550.72987410.711540.68640280.65924330.74793440.76263510.6929780.77336970.72569690.71797140.74808430.65977240.70756560.69515450.72493990.72987070.71514940.66255970.6844139
#boosting model for covidDeaths_greater_US
boosting.death.US <- adaboost(
  covidDeaths_greater_US ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train),
  nIter = 25
)

boosting.death.US
## adaboost(formula = covidDeaths_greater_US ~ eviction.filing.rate + 
##     poverty.rate + pct.renter.occupied + median.household.income + 
##     pct.white + pct.af.am + pct.hispanic + pct.asian + population, 
##     data = data.frame(covid.eviction.train), nIter = 25)
## covidDeaths_greater_US ~ eviction.filing.rate + poverty.rate + 
##     pct.renter.occupied + median.household.income + pct.white + 
##     pct.af.am + pct.hispanic + pct.asian + population
## Dependent Variable: covidDeaths_greater_US
## No of trees:25
## The weights of the trees are:0.84498840.81408270.75223760.77164290.76266030.78946920.77796750.77496920.72050490.72998390.77517510.80746410.78934450.77233880.79160170.72659990.77894020.79134190.80630840.7848630.79005190.82651260.78793380.80420930.7887148
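
The 25 boosting iterations above were not tuned. As a quick sensitivity check (a sketch with a hypothetical iteration count, not part of the comparison below), one model could be refit with more trees and evaluated on the test set in the same way:

#sensitivity check (sketch): the cases-greater-than-state model with 50 iterations
boosting.case.state.50 <- adaboost(
  covidCases_greater_state ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train),
  nIter = 50
)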

Random Forests

Again, there are 4 random forest models, one for each outcome variable. Each is a classification forest with 500 trees and 3 variables tried at each split. The output below shows the out-of-bag (OOB) error estimate and the confusion matrix for each model.

#random forest model for covidCases_greater_state
rf.case.state <- randomForest(
  factor(covidCases_greater_state) ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train) %>%
    na.omit()
)

rf.case.state
## 
## Call:
##  randomForest(formula = factor(covidCases_greater_state) ~ eviction.filing.rate +      poverty.rate + pct.renter.occupied + median.household.income +      pct.white + pct.af.am + pct.hispanic + pct.asian + population,      data = data.frame(covid.eviction.train) %>% na.omit()) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 25.13%
## Confusion matrix:
##     0  1 class.error
## 0 249 21  0.07777778
## 1  74 34  0.68518519
#random forest model for covidCases_greater_US
rf.case.US <- randomForest(
  factor(covidCases_greater_US) ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train) %>%
    na.omit()
)

rf.case.US
## 
## Call:
##  randomForest(formula = factor(covidCases_greater_US) ~ eviction.filing.rate +      poverty.rate + pct.renter.occupied + median.household.income +      pct.white + pct.af.am + pct.hispanic + pct.asian + population,      data = data.frame(covid.eviction.train) %>% na.omit()) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 13.49%
## Confusion matrix:
##     0  1 class.error
## 0 299 11  0.03548387
## 1  40 28  0.58823529
#random forest model for covidDeaths_greater_state
rf.death.state <- randomForest(
  factor(covidDeaths_greater_state) ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train) %>%
    na.omit()
)

rf.death.state
## 
## Call:
##  randomForest(formula = factor(covidDeaths_greater_state) ~ eviction.filing.rate +      poverty.rate + pct.renter.occupied + median.household.income +      pct.white + pct.af.am + pct.hispanic + pct.asian + population,      data = data.frame(covid.eviction.train) %>% na.omit()) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 38.36%
## Confusion matrix:
##     0  1 class.error
## 0 185 45   0.1956522
## 1 100 48   0.6756757
#random forest model for covidDeaths_greater_US
rf.death.US <- randomForest(
  factor(covidDeaths_greater_US) ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train) %>%
    na.omit()
)

rf.death.US
## 
## Call:
##  randomForest(formula = factor(covidDeaths_greater_US) ~ eviction.filing.rate +      poverty.rate + pct.renter.occupied + median.household.income +      pct.white + pct.af.am + pct.hispanic + pct.asian + population,      data = data.frame(covid.eviction.train) %>% na.omit()) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 29.37%
## Confusion matrix:
##     0  1 class.error
## 0 246 21  0.07865169
## 1  90 21  0.81081081
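
The 3 variables tried at each split correspond to randomForest's default mtry for classification, floor(sqrt(p)), which is 3 for our 9 predictors. As a side check (a sketch with a hypothetical mtry value, not part of the comparison below), one forest could be refit with a different mtry to see how sensitive the OOB error is to that choice:

#side check (sketch): refit the cases-greater-than-state forest with mtry = 5
rf.case.state.mtry5 <- randomForest(
  factor(covidCases_greater_state) ~
    eviction.filing.rate +
    poverty.rate +
    pct.renter.occupied +
    median.household.income +
    pct.white +
    pct.af.am +
    pct.hispanic +
    pct.asian +
    population,
  data = data.frame(covid.eviction.train) %>%
    na.omit(),
  mtry = 5
)

rf.case.state.mtry5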

Test the Models on Testing Data

Predict outcomes on testing data using models

We predict each outcome variable on the testing data using the models generated above. For the single decision trees, predict() returns class probabilities here, so we classify an observation as 1 when its predicted probability of class 1 exceeds 0.5.

#predictions for single decision tree models
tree.case.state.preds <-
  ifelse(predict(tree.case.state, covid.eviction.test)[, 2] > 0.5, 1, 0)
tree.case.US.preds <-
  ifelse(predict(tree.case.US, covid.eviction.test)[, 2] > 0.5, 1, 0)
tree.death.state.preds <-
  ifelse(predict(tree.death.state, covid.eviction.test)[, 2] > 0.5, 1, 0)
tree.death.US.preds <-
  ifelse(predict(tree.death.US, covid.eviction.test)[, 2] > 0.5, 1, 0)

#predictions for bagging models
bagging.case.state.preds <-
  predict(bagging.case.state, covid.eviction.test)
bagging.case.US.preds <-
  predict(bagging.case.US, covid.eviction.test)
bagging.death.state.preds <-
  predict(bagging.death.state, covid.eviction.test)
bagging.death.US.preds <-
  predict(bagging.death.US, covid.eviction.test)

#predictions for boosting models
boosting.case.state.preds <-
  predict(boosting.case.state, data.frame(covid.eviction.test))$class
boosting.case.US.preds <-
  predict(boosting.case.US, data.frame(covid.eviction.test))$class
boosting.death.state.preds <-
  predict(boosting.death.state, data.frame(covid.eviction.test))$class
boosting.death.US.preds <-
  predict(boosting.death.US, data.frame(covid.eviction.test))$class

#predictions for random forest models
rf.case.state.preds <-
  predict(rf.case.state, data.frame(covid.eviction.test))
rf.case.US.preds <-
  predict(rf.case.US, data.frame(covid.eviction.test))
rf.death.state.preds <-
  predict(rf.death.state, data.frame(covid.eviction.test))
rf.death.US.preds <-
  predict(rf.death.US, data.frame(covid.eviction.test))

Function that corrects a confusion matrix that is not 2x2

This function corrects a confusion matrix that is not 2x2. In our data, the tree.death.US model predicts every outcome to be 0, so its confusion matrix is 1x2. The function pads it to a 2x2 matrix so that all confusion matrices have the same dimensions.

correct.cm <- function(confusion.matrix) {
  #Inputs:
  #confusion.matrix: confusion matrix, ideally 2x2, but could be 1x2 or 2x1
  
  #Output: return a corrected 2x2 confusion matrix as a table
  
  #check if confusion matrix is 2x2
  if (nrow(confusion.matrix) == 2 & ncol(confusion.matrix) == 2) {
    #if 2x2, nothing needs to be corrected
    corrected <- confusion.matrix
   
    #if 1x2 (missing row)
  } else if (nrow(confusion.matrix) == 1) {
    #check the rowname that exists vs missing
    if (rownames(confusion.matrix)[1] == 0) {
      #if outcome == 1 row is missing, add the row (0, 0) to the bottom
      corrected <- rbind(confusion.matrix, c(0, 0))
      rownames(corrected) <- c(0, 1)
      #if outcome == 0 row is missing, add the row (0, 0) to the top
    } else if (rownames(confusion.matrix)[1] == 1) {
      corrected <- rbind(c(0, 0), confusion.matrix)
      rownames(corrected) <- c(0, 1)
    }
    
    #if 2x1 (missing column)
  } else if (ncol(confusion.matrix) == 1) {
    #check the colname that exists vs missing
    if (colnames(confusion.matrix)[1] == 0) {
      #if outcome == 1 col is missing, add the col (0, 0) to the right
      corrected <- cbind(confusion.matrix, c(0, 0))
      colnames(corrected) <- c(0, 1)
      #if outcome == 0 col is missing, add the col (0, 0) to the left
    } else if (colnames(confusion.matrix)[1] == 1) {
      corrected <- cbind(c(0, 0), confusion.matrix)
      colnames(corrected) <- c(0, 1)
    }
  }
 
   #make sure that the final corrected matrix is in a table format
  corrected <- as.table(corrected)
  
  #return corrected 2x2 confusion matrix
  return(corrected)
}
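
To see what the function does, here is a quick hypothetical sanity check (the toy table below is made up for illustration): a 1x2 table containing only the outcome-0 row comes back padded with a row of zeros for outcome 1.

#hypothetical toy example: every prediction is 0, so the raw table is 1x2
toy.preds <- factor(c(0, 0, 0), levels = 0)
toy.actual <- c(0, 1, 1)
toy.cm <- table(toy.preds, toy.actual)   #1x2 table

correct.cm(toy.cm)                       #returns a padded 2x2 table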

Create confusion matrices for each model

We create confusion matrices for each of the models and use correct.cm() to pad any 1x2 or 2x1 matrices to 2x2, so that all the matrices have the same dimensions.

#vector of 4 different model types
model.types <- c("tree", "bagging", "boosting", "rf")

#vector of 4 different outcome variables
outcome.vars <-
  c("case.state", "case.US", "death.state", "death.US")

#for loop to generate confusion matrices for each outcome variable using 4 different models 
for (i in model.types) {
  for (j in outcome.vars) {
    
    #creates string variable with the confusion matrix name (eg. "cm.tree.case.state")
    cm <- paste("cm", i, j, sep = ".")
    
    #evaluates the prediction of each model (eg. tree.case.state.preds)
    preds <- eval(parse(text = paste(i, j, "preds", sep = ".")))
    
    #for each outcome variable, it assigns the confusion matrix table to 'cm'
    if (j == "case.state") {
      assign(cm,
             table(preds, covid.eviction.test$covidCases_greater_state))
    } else if (j == "case.US") {
      assign(cm,
             table(preds, covid.eviction.test$covidCases_greater_US))
    } else if (j == "death.state") {
      assign(cm,
             table(preds, covid.eviction.test$covidDeaths_greater_state))
    } else if (j == "death.US") {
      assign(cm,
             table(preds, covid.eviction.test$covidDeaths_greater_US))
    }
    
    #each confusion matrix generated above is corrected to a 2x2 table using the correct.cm() function
    assign(cm, correct.cm(eval(as.name(cm))))
  }
}
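
For example, after the loop runs, each corrected 2x2 matrix is available by name and can be printed directly (output not shown here):

#inspect a couple of the generated, corrected confusion matrices
cm.tree.case.state
cm.rf.death.US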

Accuracy Function

This function uses a 2x2 confusion matrix as an input to calculate the accuracy.

accuracy <- function(confusion.matrix) {
  #Inputs:
  #confusion.matrix: confusion matrix of an algorithm (dimensions: 2x2)
  
  #Output: return accuracy
  
  #calculate accuracy (observed agreement)
  accuracy <-
    (confusion.matrix[1, 1] + confusion.matrix[2, 2]) / sum(confusion.matrix)
  
  return(accuracy)
}

Cohen’s Kappa Function

This function uses a 2x2 confusion matrix as an input to calculate Cohen’s Kappa.

kappa <- function(confusion.matrix) {
  #Inputs:
  #confusion.matrix: confusion matrix of an algorithm (dimensions: 2x2)
  
  #Output: return Cohen's kappa
  
  #calculate observed agreement (accuracy)
  Po <-
    (confusion.matrix[1, 1] + confusion.matrix[2, 2]) / sum(confusion.matrix)
  
  #calculate probability of agreement by random chance
  Pe <- 1 / sum(confusion.matrix) ^ 2 *
    (sum(confusion.matrix[1,] * sum(confusion.matrix[, 1])) +
       sum(confusion.matrix[2,] * sum(confusion.matrix[, 2])))
  
  #calculate Cohen's Kappa
  kappa = (Po - Pe) / (1 - Pe)
  
  return(kappa)
}
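
As a quick hypothetical check of both helper functions (the matrix below is made up for illustration), a fairly concordant 2x2 table should give an accuracy well above the chance-agreement rate and therefore a kappa well above 0:

#made-up 2x2 confusion matrix: rows = predictions, columns = actual outcomes
check.cm <- as.table(matrix(c(40, 10, 5, 45), nrow = 2,
                            dimnames = list(c("0", "1"), c("0", "1"))))

accuracy(check.cm)   #(40 + 45) / 100 = 0.85
kappa(check.cm)      #Pe = 0.5 here, so kappa = (0.85 - 0.5) / (1 - 0.5) = 0.7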

Create a table to compare all models

We calculate the accuracy, kappa, precision, recall, and F statistics for each model, and we use these to generate a comparison table showing how each type of model performs on each outcome variable.

#vector of 4 different outcome variables
outcome.vars <- c("case.state", "case.US", "death.state", "death.US")

#null vectors for single decision tree models' accuracy, kappa, precision, recall and f
tree.accuracy <- NULL
tree.precision <- NULL
tree.recall <- NULL
tree.f <- NULL
tree.kappa <- NULL

#vector for single decision tree models' accuracy values
tree.accuracy <- c(
  accuracy(cm.tree.case.state),
  accuracy(cm.tree.case.US),
  accuracy(cm.tree.death.state),
  accuracy(cm.tree.death.US)
)

#vector for precision
tree.precision <- c(
  precision(confusionMatrix(cm.tree.case.state)$table),
  precision(confusionMatrix(cm.tree.case.US)$table),
  precision(confusionMatrix(cm.tree.death.state)$table),
  precision(confusionMatrix(cm.tree.death.US)$table)
)

#vector for recall
tree.recall <- c(
  recall(confusionMatrix(cm.tree.case.state)$table),
  recall(confusionMatrix(cm.tree.case.US)$table),
  recall(confusionMatrix(cm.tree.death.state)$table),
  recall(confusionMatrix(cm.tree.death.US)$table)
)

#vector for f stats
tree.f <- c(
  F_meas(confusionMatrix(cm.tree.case.state)$table),
  F_meas(confusionMatrix(cm.tree.case.US)$table),
  F_meas(confusionMatrix(cm.tree.death.state)$table),
  F_meas(confusionMatrix(cm.tree.death.US)$table)
)

#vector for single decision tree models' kappa values
tree.kappa <- c(
  kappa(cm.tree.case.state),
  kappa(cm.tree.case.US),
  kappa(cm.tree.death.state),
  kappa(cm.tree.death.US)
)

#null vectors for bagging models' accuracy, kappa, precision, recall and f
bagging.accuracy <- NULL
bagging.precision <- NULL
bagging.recall <- NULL
bagging.f <- NULL
bagging.kappa <- NULL

#vector for bagging models' accuracy values
bagging.accuracy <- c(
  accuracy(cm.bagging.case.state),
  accuracy(cm.bagging.case.US),
  accuracy(cm.bagging.death.state),
  accuracy(cm.bagging.death.US)
)

#vector for precision
bagging.precision <- c(
  precision(confusionMatrix(cm.bagging.case.state)$table),
  precision(confusionMatrix(cm.bagging.case.US)$table),
  precision(confusionMatrix(cm.bagging.death.state)$table),
  precision(confusionMatrix(cm.bagging.death.US)$table)
)

#vector for recall
bagging.recall <- c(
  recall(confusionMatrix(cm.bagging.case.state)$table),
  recall(confusionMatrix(cm.bagging.case.US)$table),
  recall(confusionMatrix(cm.bagging.death.state)$table),
  recall(confusionMatrix(cm.bagging.death.US)$table)
)

#vector for f stats
bagging.f <- c(
  F_meas(confusionMatrix(cm.bagging.case.state)$table),
  F_meas(confusionMatrix(cm.bagging.case.US)$table),
  F_meas(confusionMatrix(cm.bagging.death.state)$table),
  F_meas(confusionMatrix(cm.bagging.death.US)$table)
)

#vector for bagging models' kappa values
bagging.kappa <- c(
  kappa(cm.bagging.case.state),
  kappa(cm.bagging.case.US),
  kappa(cm.bagging.death.state),
  kappa(cm.bagging.death.US)
)

#null vectors for boosting models' accuracy, kappa, precision, recall and f
boosting.accuracy <- NULL
boosting.precision <- NULL
boosting.recall <- NULL
boosting.f <- NULL
boosting.kappa <- NULL

#vector for boosting models' accuracy values
boosting.accuracy <- c(
  accuracy(cm.boosting.case.state),
  accuracy(cm.boosting.case.US),
  accuracy(cm.boosting.death.state),
  accuracy(cm.boosting.death.US)
)

#vector for precision
boosting.precision <- c(
  precision(confusionMatrix(cm.boosting.case.state)$table),
  precision(confusionMatrix(cm.boosting.case.US)$table),
  precision(confusionMatrix(cm.boosting.death.state)$table),
  precision(confusionMatrix(cm.boosting.death.US)$table)
)

#vector for recall
boosting.recall <- c(
  recall(confusionMatrix(cm.boosting.case.state)$table),
  recall(confusionMatrix(cm.boosting.case.US)$table),
  recall(confusionMatrix(cm.boosting.death.state)$table),
  recall(confusionMatrix(cm.boosting.death.US)$table)
)

#vector for f stats
boosting.f <- c(
  F_meas(confusionMatrix(cm.boosting.case.state)$table),
  F_meas(confusionMatrix(cm.boosting.case.US)$table),
  F_meas(confusionMatrix(cm.boosting.death.state)$table),
  F_meas(confusionMatrix(cm.boosting.death.US)$table)
)

#vector for boosting models' kappa values
boosting.kappa <- c(
  kappa(cm.boosting.case.state),
  kappa(cm.boosting.case.US),
  kappa(cm.boosting.death.state),
  kappa(cm.boosting.death.US)
)

#null vectors for random forest models' accuracy, kappa, precision, recall and f
rf.accuracy <- NULL
rf.precision <- NULL
rf.recall <- NULL
rf.f <- NULL
rf.kappa <- NULL

#vector for random forest models' accuracy values
rf.accuracy <- c(
  accuracy(cm.rf.case.state),
  accuracy(cm.rf.case.US),
  accuracy(cm.rf.death.state),
  accuracy(cm.rf.death.US)
)

#vector for precision
rf.precision <- c(
  precision(confusionMatrix(cm.rf.case.state)$table),
  precision(confusionMatrix(cm.rf.case.US)$table),
  precision(confusionMatrix(cm.rf.death.state)$table),
  precision(confusionMatrix(cm.rf.death.US)$table)
)

#vector for recall
rf.recall <- c(
  recall(confusionMatrix(cm.rf.case.state)$table),
  recall(confusionMatrix(cm.rf.case.US)$table),
  recall(confusionMatrix(cm.rf.death.state)$table),
  recall(confusionMatrix(cm.rf.death.US)$table)
)

#vector for f stats
rf.f <- c(
  F_meas(confusionMatrix(cm.rf.case.state)$table),
  F_meas(confusionMatrix(cm.rf.case.US)$table),
  F_meas(confusionMatrix(cm.rf.death.state)$table),
  F_meas(confusionMatrix(cm.rf.death.US)$table)
)

#vector for random forest models' kappa values
rf.kappa <- c(
  kappa(cm.rf.case.state),
  kappa(cm.rf.case.US),
  kappa(cm.rf.death.state),
  kappa(cm.rf.death.US)
)

#generate data frame of accuracy and kappa values of each models for different outcome variables
comparison <- data.frame(
  outcome.vars,
  tree.accuracy, bagging.accuracy, boosting.accuracy, rf.accuracy,
  tree.precision, bagging.precision, boosting.precision, rf.precision,
  tree.recall, bagging.recall, boosting.recall, rf.recall,
  tree.f, bagging.f, boosting.f, rf.f,
  tree.kappa, bagging.kappa, boosting.kappa, rf.kappa
)

kable(comparison, digits = 3)
outcome.vars tree.accuracy bagging.accuracy boosting.accuracy rf.accuracy tree.precision bagging.precision boosting.precision rf.precision tree.recall bagging.recall boosting.recall rf.recall tree.f bagging.f boosting.f rf.f tree.kappa bagging.kappa boosting.kappa rf.kappa
case.state 0.743 0.723 0.720 0.749 0.763 0.770 0.776 0.773 0.940 0.886 0.869 0.935 0.843 0.824 0.820 0.847 0.179 0.186 0.205 0.195
case.US 0.883 0.891 0.887 0.877 0.895 0.899 0.906 0.904 0.985 0.988 0.974 0.966 0.938 0.942 0.939 0.934 0.080 0.144 0.213 0.099
death.state 0.534 0.553 0.505 0.556 0.564 0.581 0.540 0.571 0.693 0.693 0.711 0.809 0.622 0.632 0.614 0.670 0.031 0.075 -0.041 0.051
death.US 0.773 0.651 0.744 0.742 0.773 0.785 0.786 0.771 1.000 0.754 0.920 0.949 0.872 0.769 0.847 0.851 0.000 0.051 0.084 -0.027

Looking at the comparison table above, boosting and random forest models generally have the highest accuracy, and boosting models have slightly higher kappa values than random forests. Bagging models also tend to edge out single decision trees, most consistently on kappa. This is as expected: decision trees have low bias but high variance, and bagging lowers variance by averaging over many bootstrapped trees, which should push most of its measures above those of a single tree. Boosting, on the other hand, reduces bias by training successive trees with more weight on misclassified observations. Random forests, like bagging, reduce variance, but they also create a more robust model by randomizing the subset of variables considered at each split.
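
To see which predictors drive these results, the randomForest package's importance() and varImpPlot() functions can be applied to any of the forests fit above; a brief sketch using the cases-greater-than-state model (not part of the comparison table):

#which predictors matter most in the cases-greater-than-state forest?
importance(rf.case.state)    #mean decrease in Gini for each predictor
varImpPlot(rf.case.state)    #same information, plotted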

4. Linear Discriminant Analysis (LDA)

Question: How can we use Linear Discriminant Analysis to separate counties whose COVID case and death rates are greater than their state averages, using demographic and eviction data?

Understanding the underlying characteristics of the counties most severely impacted by COVID can inform early public health and policy intervention. Using LDA, we will be able to predict whether a county has a COVID case or death rate greater than its state average. Like the decision tree algorithms above, LDA is suited to predicting a categorical outcome, and it is particularly useful when there is a clear or extreme separation between categories. Additionally, LDA is often compared in the literature to Principal Component Analysis (PCA), as both rely on linear transformations to reduce data dimensionality.

First, I apply LDA to predict whether a county’s COVID case rate is greater than its state average. I begin by splitting a subsetted dataset into training and testing sets. I train the model with the “training” dataset and then predict with the “testing” dataset.

County Case Rate > State

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

# LDA for COUNTY death RATE greater than STATE
cVs.case <- na.omit(covid.eviction %>%
  dplyr::select( -US_covid_cases_rate, -covidDeaths_greater_state, 
         -US_covid_death_rate, -covidCases_greater_US, 
         -county_covid_cases_rate, -county_covid_death_rate,
         -state_covid_cases_rate, -state_covid_death_rate,
         -covidDeaths, -covidCases, -evictions_greater_state,
         -evictions_greater_US, -covidDeaths_greater_US, -GEOID, -year,
         -`low-flag`, -imputed, -subbed, -stateFIPS, -County, -state_mean_filing_rate,
          -US_mean_filing_rate))

#let's create training and testing datasets
index <- sample(nrow(cVs.case),
                size = nrow(cVs.case),
                replace = TRUE)

cVs.case.train <- cVs.case[index,]
cVs.case.test <- cVs.case[-index,]

#Now, let's build an LDA model with the training dataset
lda.case <- lda(covidCases_greater_state ~ .,
            data = cVs.case.train  %>%
              dplyr::select(-State),
            family = "binomal")

lda.case
## Call:
## lda(covidCases_greater_state ~ ., data = cVs.case.train %>% dplyr::select(-State), 
##     family = "binomal")
## 
## Prior probabilities of groups:
##         0         1 
## 0.7127958 0.2872042 
## 
## Group means:
##   `poverty-rate` `renter-occupied-households` `pct-renter-occupied`
## 0       11.92104                     7814.057              27.02306
## 1       12.24450                    30919.268              30.28418
##   `median-gross-rent` `median-household-income` `median-property-value`
## 0            671.8706                  45631.52                125137.0
## 1            754.0000                  48571.06                149535.2
##   `rent-burden` `pct-white` `pct-af-am` `pct-hispanic` `pct-am-ind`
## 0      28.58925    82.57388    6.855639       6.801193     1.009246
## 1      28.90880    70.57965   12.914176      10.897207     1.619260
##   `pct-asian` `pct-nh-pi` `pct-multiple` `pct-other` `eviction-filings`
## 0   0.8638661  0.05126618       1.771362  0.07324142           498.6961
## 1   1.9567039  0.08772346       1.827179  0.11874302          1499.8422
##   evictions `eviction-rate` `eviction-filing-rate` population
## 0  208.6376        1.491536               3.036263   61469.88
## 1  635.8045        1.874036               4.021466  206843.54
## 
## Coefficients of linear discriminants:
##                                        LD1
## `poverty-rate`               -3.475221e-02
## `renter-occupied-households` -2.148129e-05
## `pct-renter-occupied`         3.044254e-02
## `median-gross-rent`           5.447759e-04
## `median-household-income`     8.118736e-06
## `median-property-value`      -2.867088e-07
## `rent-burden`                -3.887208e-02
## `pct-white`                   1.911383e+01
## `pct-af-am`                   1.916687e+01
## `pct-hispanic`                1.914105e+01
## `pct-am-ind`                  1.919100e+01
## `pct-asian`                   1.921143e+01
## `pct-nh-pi`                   1.944107e+01
## `pct-multiple`                1.902952e+01
## `pct-other`                   1.963228e+01
## `eviction-filings`           -1.650881e-05
## evictions                     1.078086e-04
## `eviction-rate`               8.749388e-02
## `eviction-filing-rate`       -6.126884e-02
## population                    4.058340e-06
#Check the pi_k values
lda.case$prior
##         0         1 
## 0.7127958 0.2872042
#Now, let's predict using the testing dataset
cVs.case.predictions <- predict(lda.case, cVs.case.test)
cVs.case.predictions  <- data.frame(cVs.case.predictions )
cVs.case.predictions  <- cVs.case.predictions  %>%
  dplyr::select(-posterior.0, -posterior.1, -LD1)

cVs.case.comparison <- data.frame(cVs.case.test$covidCases_greater_state,
                               cVs.case.predictions$class)

#Let's examine the accuracy of this model
cVs.case.accuracy <- cVs.case.comparison %>%
  mutate(acc = cVs.case.test.covidCases_greater_state == cVs.case.predictions.class) 

LDA.case.accuracy <- cVs.case.accuracy %>%
  summarize(acc.rate = sum(acc)/n())

LDA.case.accuracy
##    acc.rate
## 1 0.7480748
#Build a confusion Matrix to view distribution of case predictions
testy <- cVs.case.accuracy[,1]
yhat <- cVs.case.accuracy[,2]

LDA.case.confusion <- table(yhat,testy)
confusionMatrix(LDA.case.confusion, mode = "everything")
## Confusion Matrix and Statistics
## 
##     testy
## yhat   0   1
##    0 622 166
##    1  63  58
##                                          
##                Accuracy : 0.7481         
##                  95% CI : (0.7185, 0.776)
##     No Information Rate : 0.7536         
##     P-Value [Acc > NIR] : 0.6659         
##                                          
##                   Kappa : 0.1975         
##                                          
##  Mcnemar's Test P-Value : 1.58e-11       
##                                          
##             Sensitivity : 0.9080         
##             Specificity : 0.2589         
##          Pos Pred Value : 0.7893         
##          Neg Pred Value : 0.4793         
##               Precision : 0.7893         
##                  Recall : 0.9080         
##                      F1 : 0.8445         
##              Prevalence : 0.7536         
##          Detection Rate : 0.6843         
##    Detection Prevalence : 0.8669         
##       Balanced Accuracy : 0.5835         
##                                          
##        'Positive' Class : 0              
## 

Accuracy and Confusion Matrix

Our LDA model predicting whether a county’s COVID case rate is higher or lower than its state average has an accuracy of approximately 74.81%. Compared with the other supervised learning algorithms explored in this project, this accuracy is about average. To get a better sense of the distribution of correct and incorrect predictions, I generated a 2x2 confusion matrix of predicted values.

Of the 685 counties in the testing sample with COVID case rates lower than their state average, 622 were correctly predicted by the LDA. Predictions for counties with COVID case rates higher than their state average were notably less accurate: only 58 of 224 such counties were correctly classified. This suggests that counties with higher case rates do not necessarily behave similarly to one another, within or across states.

While LDA does not perform especially well at separating counties by case rate, perhaps it will be more accurate at predicting whether a county’s COVID death rate is higher than its state average.

County Death Rate > State

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

cVs.death <- na.omit(covid.eviction %>%
  dplyr::select( -US_covid_cases_rate, -covidCases_greater_state, 
         -US_covid_death_rate, -covidCases_greater_US, 
         -county_covid_cases_rate, -county_covid_death_rate,
         -state_covid_cases_rate, -state_covid_death_rate,
         -covidDeaths, -covidCases, -evictions_greater_state,
         -evictions_greater_US, -covidDeaths_greater_US, -GEOID, -year,
         -`low-flag`, -imputed, -subbed, -stateFIPS, -County, -state_mean_filing_rate,
          -US_mean_filing_rate))

#let's create training and testing datasets
index <- sample(nrow(cVs.death),
                size = nrow(cVs.death),
                replace = TRUE)

cVs.death.train <- cVs.death[index,]
cVs.death.test <- cVs.death[-index,]

#Now, let's build an LDA model with the training dataset
lda.death <- lda(covidDeaths_greater_state ~ .,
            data = cVs.death.train  %>%
              dplyr::select(-State),
            family = "binomal")

lda.death
## Call:
## lda(covidDeaths_greater_state ~ ., data = cVs.death.train %>% 
##     dplyr::select(-State), family = "binomal")
## 
## Prior probabilities of groups:
##         0         1 
## 0.5850091 0.4149909 
## 
## Group means:
##   `poverty-rate` `renter-occupied-households` `pct-renter-occupied`
## 0       11.99575                     22873.10              28.44856
## 1       11.41855                     28358.64              28.94731
##   `median-gross-rent` `median-household-income` `median-property-value`
## 0            752.7844                  48550.11                148666.9
## 1            770.1013                  49108.15                149059.0
##   `rent-burden` `pct-white` `pct-af-am` `pct-hispanic` `pct-am-ind`
## 0      29.90281    79.06509    11.00925       6.017469    0.7951875
## 1      30.11806    79.51031    11.56872       5.313921    0.2718502
##   `pct-asian` `pct-nh-pi` `pct-multiple` `pct-other` `eviction-filings`
## 0    1.395188  0.02981250       1.536094   0.1519062           1455.178
## 1    1.609648  0.02740088       1.583172   0.1145815           2047.264
##   evictions `eviction-rate` `eviction-filing-rate` population
## 0  470.7250        1.869250               4.326625   165466.4
## 1  904.8414        2.189648               4.396300   200452.3
## 
## Coefficients of linear discriminants:
##                                        LD1
## `poverty-rate`               -1.308292e-01
## `renter-occupied-households` -5.610164e-05
## `pct-renter-occupied`         3.543660e-02
## `median-gross-rent`           2.982754e-03
## `median-household-income`    -4.498392e-05
## `median-property-value`      -6.810636e-07
## `rent-burden`                -9.589843e-04
## `pct-white`                  -1.428805e+01
## `pct-af-am`                  -1.426422e+01
## `pct-hispanic`               -1.431684e+01
## `pct-am-ind`                 -1.431770e+01
## `pct-asian`                  -1.416137e+01
## `pct-nh-pi`                  -1.697024e+01
## `pct-multiple`               -1.423297e+01
## `pct-other`                  -1.654646e+01
## `eviction-filings`           -5.848143e-05
## evictions                     9.457463e-04
## `eviction-rate`               1.520519e-02
## `eviction-filing-rate`       -8.492298e-02
## population                    5.905832e-06
#Check the pi_k values
lda.death$prior
##         0         1 
## 0.5850091 0.4149909
#Now, let's predict using the testing dataset
cVs.death.predictions <- predict(lda.death, cVs.death.test)
cVs.death.predictions  <- data.frame(cVs.death.predictions )
cVs.death.predictions  <- cVs.death.predictions  %>%
  dplyr::select(-posterior.0, -posterior.1, -LD1)

cVs.death.comparison <- data.frame(cVs.death.test$covidDeaths_greater_state,
                               cVs.death.predictions$class)

#Let's examine the accuracy of this model
cVs.death.accuracy <- cVs.death.comparison %>%
  mutate(acc = cVs.death.test.covidDeaths_greater_state == cVs.death.predictions.class) 

LDA.death.accuracy <- cVs.death.accuracy %>%
  summarize(acc.rate = sum(acc)/n())

LDA.death.accuracy
##    acc.rate
## 1 0.6206897
#Build a confusion Matrix to view distribution of death predictions
testy <- cVs.death.accuracy[,1]
yhat <- cVs.death.accuracy[,2]

LDA.death.confusion <- table(yhat,testy)
confusionMatrix(LDA.death.confusion, mode = "everything")
## Confusion Matrix and Statistics
## 
##     testy
## yhat   0   1
##    0 110  59
##    1  18  16
##                                           
##                Accuracy : 0.6207          
##                  95% CI : (0.5501, 0.6877)
##     No Information Rate : 0.6305          
##     P-Value [Acc > NIR] : 0.6439          
##                                           
##                   Kappa : 0.082           
##                                           
##  Mcnemar's Test P-Value : 5.154e-06       
##                                           
##             Sensitivity : 0.8594          
##             Specificity : 0.2133          
##          Pos Pred Value : 0.6509          
##          Neg Pred Value : 0.4706          
##               Precision : 0.6509          
##                  Recall : 0.8594          
##                      F1 : 0.7407          
##              Prevalence : 0.6305          
##          Detection Rate : 0.5419          
##    Detection Prevalence : 0.8325          
##       Balanced Accuracy : 0.5364          
##                                           
##        'Positive' Class : 0               
## 

Accuracy and Confusion Matrix

Our LDA model predicting whether a county’s COVID death rate is higher or lower than its state average has an accuracy of approximately 62.07%. This is lower than the accuracy of our COVID case model, as well as most of the other supervised learning algorithms used in this project. Again, I generate a 2x2 confusion matrix to explore the distribution of correct and incorrect predictions.

Of the 128 counties in the testing sample with COVID death rates lower than their state average, 110 were correctly predicted by the LDA. As with the COVID case model, predictions for counties with COVID death rates higher than their state average were largely inaccurate: only 16 of 75 such counties were correctly classified. Once again, this suggests that counties hit hard by COVID, in both case and death rates, do not necessarily behave similarly. Without a clear separation between counties with death rates above and below their state average, LDA is likely not the most effective modeling technique for this question.

Comparing the performance of all supervised learning techniques in a table:

#create vector of each measures
measure = c("Accuracy", "Precision", "Recall", "F-measure", "Kappa")

#logit model vector
logit.vector <- logit.matrix[, 2]

#KNN vectors
KNN.5.vector <- c(as.numeric(accuracy.k5), as.numeric(precision.k5), as.numeric(recall.k5), as.numeric(f.k5), as.numeric(kappa.k5))

KNN.7.vector <- c(as.numeric(accuracy.k7), as.numeric(precision.k7), as.numeric(recall.k7), as.numeric(f.k7), as.numeric(kappa.k7))

KNN.10.vector <- c(as.numeric(accuracy.k10), as.numeric(precision.k10), as.numeric(recall.k10), as.numeric(f.k10), as.numeric(kappa.k10))

#tree vector
tree.vector <- c(tree.accuracy[1], tree.precision[1], tree.recall[1], tree.f[1], tree.kappa[1])

#bagging vector
bagging.vector <- c(bagging.accuracy[1], bagging.precision[1], bagging.recall[1], bagging.f[1], bagging.kappa[1])

#boosting vector
boosting.vector <- c(boosting.accuracy[1], boosting.precision[1], boosting.recall[1], boosting.f[1], boosting.kappa[1])

#rf vector
rf.vector <- c(rf.accuracy[1], rf.precision[1], rf.recall[1], rf.f[1], rf.kappa[1])

# LDA accuracy, precision, recall, f and kappa
LDA.case.accuracy <- as.numeric(LDA.case.accuracy)
LDA.case.precision <- precision(LDA.case.confusion)
LDA.case.recall <- recall(LDA.case.confusion)
LDA.case.f <- F_meas(LDA.case.confusion)
LDA.case.kappa <- as.numeric(confusionMatrix(LDA.case.confusion, mode = "everything")$overall[2])

LDA.vector <- c(LDA.case.accuracy, LDA.case.precision, LDA.case.recall, LDA.case.f, LDA.case.kappa)

comparison.supervised <- data.frame(
  Measure = measure,
  Logit = logit.vector,
  KNN.5 = KNN.5.vector,
  KNN.7 = KNN.7.vector,
  KNN.10 = KNN.10.vector,
  Tree = tree.vector,
  Bagging = bagging.vector,
  Boosting = boosting.vector,
  Random.Forest = rf.vector,
  LDA = LDA.vector
)

kable(comparison.supervised, digits = 3)
Measure Logit KNN.5 KNN.7 KNN.10 Tree Bagging Boosting Random.Forest LDA
Accuracy 0.695 0.677 0.669 0.677 0.743 0.723 0.720 0.749 0.748
Precision 0.694 0.719 0.711 0.709 0.763 0.770 0.776 0.773 0.789
Recall 0.995 0.889 0.895 0.921 0.940 0.886 0.869 0.935 0.908
F-measure 0.818 0.795 0.793 0.801 0.843 0.824 0.820 0.847 0.845
Kappa 0.040 0.065 0.026 0.012 0.179 0.186 0.205 0.195 0.198

Overall analysis of Supervised Learning techniques:

The table above compares the 5 measures (accuracy, precision, recall, F statistic, and kappa) across all the supervised learning techniques we used to predict whether COVID-19 case rates are greater than the state average in a given county, using its eviction filing rate and other demographic variables. Random forest and LDA models appear to be the most effective at predicting this outcome, with high values across all 5 measures.

[UNSUPERVISED LEARNING]

1. PCA

Principal Component Analysis is a useful tool for unsupervised machine learning. Like Linear Discriminant Analysis, PCA relies on linear transformations to summarize relationships among variables. By creating “principal components” that are weighted linear combinations of the n original variables, PCA reduces the dimensionality of a dataset.

While other unsupervised learning techniques such as hierarchical or k-means clustering lend themselves more naturally to visualizing relationships between data points, they do little to explain which variables drive that relatedness. This is where PCA is particularly useful in unsupervised machine learning. By projecting n variables onto a two-dimensional plane, we can examine which variables may be most important in differentiating observations from one another.

In order for this analysis to be completely unsupervised, I do not examine any COVID data in my PCA. Instead, I use PCA to examine baseline differences between States, prior to the COVID pandemic.

Main PCA - All Variables

#Create "Clean" dataset on the COUNTY level (547 obs)
#This dataset does NOT contain any COVID data
evict.clean <- na.omit(covid.eviction %>%
                                dplyr::select(-GEOID, -year, -County, -stateFIPS, -`low-flag`,
                               -imputed, -subbed, -evictions_greater_state,
                               -evictions_greater_US, -covidCases_greater_US, 
                               -covidDeaths_greater_US, -US_mean_filing_rate,
                                -US_covid_cases_rate, -covidDeaths_greater_state, 
                               -US_covid_death_rate, -covidCases_greater_state, 
                               -county_covid_cases_rate, -county_covid_death_rate,
                               -state_covid_cases_rate, -state_covid_death_rate,
                               -covidDeaths, -covidCases))

#This dataset DOES contain COVID data (we'll return to it later)
evict.COVID.clean <- na.omit(covid.eviction %>%
                                dplyr::select(-GEOID, -year, -County, -stateFIPS, -`low-flag`,
                               -imputed, -subbed, 
                               -evictions_greater_US, 
                                -US_mean_filing_rate,
                                -US_covid_cases_rate,  
                               -US_covid_death_rate,  
                               -county_covid_cases_rate, -county_covid_death_rate,
                               -state_covid_cases_rate, -state_covid_death_rate,
                               -covidDeaths, -covidCases))

#Collapse EVICT dataset on the STATE level
evict.bystate <- evict.clean %>%
  group_by(evict.clean$State) %>%
  summarise_all(mean) %>%
  dplyr::select(-State)

evict.bystate2 <- data.frame(evict.bystate, row.names = 1)

#Collapse EVICT COVID dataset on the STATE level
evict.covid.bystate <- evict.COVID.clean %>%
  group_by(evict.COVID.clean$State) %>%
  summarise_all(mean)

#evict.covid.bystate2 <- data.frame(evict.covid.bystate, row.names = 1)


#PCA on a STATE level
pca.state <- prcomp(evict.bystate2, scale = TRUE)

pca.state
## Standard deviations (1, .., p=21):
##  [1] 2.802205e+00 1.937767e+00 1.594893e+00 1.441739e+00 1.197255e+00
##  [6] 9.357001e-01 8.766515e-01 7.065048e-01 6.615112e-01 4.972058e-01
## [11] 4.479475e-01 3.501365e-01 2.502060e-01 2.145646e-01 2.022351e-01
## [16] 1.382946e-01 8.944997e-02 8.396634e-02 4.480825e-02 1.192441e-03
## [21] 7.183349e-05
## 
## Rotation (n x k) = (21 x 21):
##                                    PC1          PC2         PC3
## poverty.rate               -0.03627044  0.193911629  0.47162638
## renter.occupied.households  0.31286351  0.008918725 -0.12107947
## pct.renter.occupied         0.29229955 -0.028127575  0.07232266
## median.gross.rent           0.33192045 -0.119537264 -0.10544043
## median.household.income     0.26090418 -0.139238737 -0.33258077
## median.property.value       0.31192854 -0.181698085 -0.09774482
## rent.burden                 0.16541609  0.041182675  0.17601757
## pct.white                  -0.23176348  0.012112970 -0.38464628
## pct.af.am                   0.15977122  0.331799290  0.26734263
## pct.hispanic                0.09228317 -0.122870730  0.05268475
## pct.am.ind                 -0.04830155 -0.040877160  0.08598101
## pct.asian                   0.19625978 -0.355169983  0.22670207
## pct.nh.pi                   0.09625013 -0.359363285  0.33648418
## pct.multiple                0.10396610 -0.353473493  0.32150176
## pct.other                   0.24157757 -0.017610931 -0.20189492
## eviction.filings            0.22523913  0.243352390 -0.02980383
## evictions                   0.28234628  0.170876118 -0.05841257
## eviction.rate               0.06389254  0.287988844  0.16955842
## eviction.filing.rate        0.19480574  0.326326207  0.04795626
## population                  0.30849964 -0.019173081 -0.13984234
## state_mean_filing_rate      0.19432213  0.327690073  0.04662827
##                                    PC4          PC5         PC6
## poverty.rate                0.24679181  0.196039595 -0.07495693
## renter.occupied.households  0.20479251  0.119056441 -0.12662985
## pct.renter.occupied         0.22046473  0.033979035 -0.31284598
## median.gross.rent          -0.07153794  0.065866164  0.05951807
## median.household.income    -0.11467315 -0.068595829 -0.04271553
## median.property.value      -0.05235355  0.063276665 -0.09016968
## rent.burden                -0.20833768  0.467729280  0.40619474
## pct.white                  -0.22938260  0.075017966 -0.16920103
## pct.af.am                  -0.02606450  0.165135809 -0.16167225
## pct.hispanic                0.44293336 -0.116178752  0.63821132
## pct.am.ind                  0.46594826 -0.370657602 -0.16277508
## pct.asian                  -0.17017806 -0.043961271  0.01709534
## pct.nh.pi                  -0.21741434 -0.135622166 -0.09213628
## pct.multiple               -0.20158500 -0.168227034 -0.13373447
## pct.other                   0.05629288  0.200233425  0.10182440
## eviction.filings           -0.10327628 -0.414998040  0.05308637
## evictions                   0.17060301  0.005057141 -0.31495078
## eviction.rate              -0.15198398  0.212383749 -0.12834565
## eviction.filing.rate       -0.22946428 -0.324120693  0.16600787
## population                  0.17229728  0.122815848  0.05133285
## state_mean_filing_rate     -0.22868512 -0.323149399  0.16641140
##                                     PC7          PC8          PC9
## poverty.rate               -0.200018136  0.231812110 -0.017066154
## renter.occupied.households -0.137598284 -0.139524364 -0.257172040
## pct.renter.occupied        -0.204275928 -0.047740187  0.080942609
## median.gross.rent           0.049561143  0.021371413  0.013072389
## median.household.income     0.081364991 -0.055090627  0.187953577
## median.property.value      -0.020713300  0.052158050  0.007898211
## rent.burden                -0.037272846  0.245508801 -0.458141889
## pct.white                  -0.027467451  0.105467169 -0.263375544
## pct.af.am                  -0.181671930 -0.027588810  0.355055174
## pct.hispanic                0.093567757 -0.340214621  0.150333530
## pct.am.ind                  0.381376994  0.451762554 -0.308200640
## pct.asian                   0.027989771  0.003046890 -0.047977312
## pct.nh.pi                   0.022189844 -0.011269794  0.037887817
## pct.multiple                0.093588972  0.009391279 -0.054069974
## pct.other                   0.227586050  0.612647730  0.469673453
## eviction.filings           -0.312756343  0.069574836 -0.123668619
## evictions                   0.110846205 -0.203649841 -0.098306102
## eviction.rate               0.721488602 -0.300498784 -0.029135833
## eviction.filing.rate        0.063320523  0.073667150 -0.027328039
## population                 -0.003164021 -0.057230234 -0.337524680
## state_mean_filing_rate      0.063044463  0.073709232 -0.028573556
##                                   PC10        PC11        PC12        PC13
## poverty.rate                0.15853173  0.14037964  0.15332247 -0.48085524
## renter.occupied.households  0.02164894  0.21849746  0.17192459  0.03098548
## pct.renter.occupied         0.27633341 -0.44554579  0.32200672  0.46884435
## median.gross.rent          -0.18525823 -0.20818268 -0.04307170 -0.17942237
## median.household.income    -0.34612465  0.03154570 -0.10506453 -0.05466424
## median.property.value      -0.05507582 -0.42640783  0.19148886 -0.54990399
## rent.burden                -0.05918501 -0.26852455 -0.27963631  0.20515731
## pct.white                   0.27989054 -0.02522701  0.10441093 -0.10140951
## pct.af.am                  -0.47035790  0.04727821 -0.11709409  0.11869945
## pct.hispanic                0.18717803 -0.10106537 -0.02922714 -0.06032156
## pct.am.ind                 -0.31529398 -0.15054181 -0.05037368  0.04078990
## pct.asian                  -0.06668359  0.26944206  0.18696575  0.03828586
## pct.nh.pi                   0.13928162  0.11201297 -0.14096307 -0.19197445
## pct.multiple                0.15519241  0.03544589 -0.10532205  0.24963147
## pct.other                   0.34387880  0.21469902 -0.03815981  0.09151555
## eviction.filings            0.20226084 -0.03262977 -0.18412192 -0.11441434
## evictions                   0.26217061  0.08061384 -0.65710077 -0.08358866
## eviction.rate               0.11973604 -0.07163397  0.18977344 -0.06833581
## eviction.filing.rate        0.02899472  0.02235171  0.18446975  0.03411882
## population                 -0.10773270  0.50388325  0.22391232  0.05123312
## state_mean_filing_rate      0.02736023  0.02092365  0.18505536  0.03339604
##                                   PC14         PC15        PC16
## poverty.rate               -0.47555871 -0.098574188  0.03094468
## renter.occupied.households  0.06744942  0.111251190 -0.28185317
## pct.renter.occupied        -0.06433967 -0.253828237  0.07619448
## median.gross.rent          -0.09966354  0.114545953  0.65592000
## median.household.income    -0.58629338 -0.316080874 -0.33410411
## median.property.value       0.22356521  0.246890929 -0.19341521
## rent.burden                -0.03349325 -0.127143109 -0.14132710
## pct.white                  -0.10096608 -0.078763460  0.08113354
## pct.af.am                   0.18903885  0.128705083 -0.04321656
## pct.hispanic               -0.05352332 -0.004397341 -0.04382986
## pct.am.ind                  0.06492537 -0.066365845 -0.04569019
## pct.asian                   0.09121247 -0.408470668 -0.02149976
## pct.nh.pi                   0.32040271 -0.292641459  0.04727099
## pct.multiple               -0.38325528  0.624212077 -0.07551472
## pct.other                   0.11248095  0.066724695 -0.07163111
## eviction.filings            0.13200676  0.056430937 -0.36187518
## evictions                  -0.02674420 -0.124690422  0.19763150
## eviction.rate               0.03233596 -0.009639362 -0.20467673
## eviction.filing.rate       -0.07336191 -0.045944882  0.16249165
## population                  0.08190346  0.163541651  0.15844810
## state_mean_filing_rate     -0.07468657 -0.043296702  0.15859766
##                                    PC17         PC18         PC19
## poverty.rate               -0.056385346  0.008407050 -0.003290466
## renter.occupied.households  0.092075415 -0.676879477 -0.262583432
## pct.renter.occupied        -0.075308187  0.171386559 -0.050537625
## median.gross.rent          -0.435353744 -0.303006312  0.044698543
## median.household.income    -0.056190646  0.120530756 -0.171719917
## median.property.value       0.347393975  0.198078474  0.099281562
## rent.burden                 0.053364327  0.035827197 -0.062811922
## pct.white                  -0.023835239  0.001537587 -0.045784353
## pct.af.am                   0.001420509  0.033336976 -0.025211347
## pct.hispanic                0.017446844  0.010306572 -0.033049138
## pct.am.ind                 -0.031458591 -0.033626145 -0.029010364
## pct.asian                   0.094292951 -0.153546438  0.634957401
## pct.nh.pi                  -0.114704094  0.071251006 -0.608637253
## pct.multiple                0.050688792  0.016254273  0.019243796
## pct.other                  -0.022904919 -0.047371487  0.002082552
## eviction.filings           -0.542706610  0.053282871  0.218174133
## evictions                   0.313749599  0.050217546  0.160119654
## eviction.rate              -0.287892373  0.001799867  0.042295150
## eviction.filing.rate        0.272378239 -0.034537049 -0.113749226
## population                 -0.120409287  0.569543779 -0.067722984
## state_mean_filing_rate      0.273687679 -0.032891014 -0.109518261
##                                     PC20          PC21
## poverty.rate               -0.0001739192 -5.490040e-05
## renter.occupied.households  0.0003336271  7.516048e-05
## pct.renter.occupied         0.0007947018 -3.878454e-05
## median.gross.rent           0.0017479039  2.014422e-04
## median.household.income    -0.0005861530 -1.724724e-04
## median.property.value      -0.0022863062 -5.730169e-05
## rent.burden                -0.0007625800 -3.348582e-05
## pct.white                   0.0062923171 -7.100854e-01
## pct.af.am                   0.0042469140 -5.213305e-01
## pct.hispanic                0.0034518497 -3.917219e-01
## pct.am.ind                  0.0010108039 -1.521266e-01
## pct.asian                   0.0012088337 -1.578753e-01
## pct.nh.pi                   0.0046025072 -6.533948e-02
## pct.multiple                0.0006203815 -1.345733e-01
## pct.other                   0.0007253216 -4.895284e-03
## eviction.filings           -0.0010028829 -1.984173e-05
## evictions                   0.0003482971  6.256770e-05
## eviction.rate              -0.0004230036  1.897596e-05
## eviction.filing.rate       -0.7064151084 -6.201809e-03
## population                 -0.0003202347 -1.704483e-04
## state_mean_filing_rate      0.7077232395  6.195809e-03
#How much variation is each principal component describing?
pca.state$sdev^2
##  [1] 7.852353e+00 3.754941e+00 2.543684e+00 2.078612e+00 1.433420e+00
##  [6] 8.755346e-01 7.685178e-01 4.991490e-01 4.375971e-01 2.472136e-01
## [11] 2.006570e-01 1.225955e-01 6.260307e-02 4.603799e-02 4.089904e-02
## [16] 1.912540e-02 8.001296e-03 7.050347e-03 2.007779e-03 1.421915e-06
## [21] 5.160050e-09
#What about percent of variation?
pca.state$sdev^2/sum(pca.state$sdev^2)
##  [1] 3.739216e-01 1.788067e-01 1.211278e-01 9.898151e-02 6.825811e-02
##  [6] 4.169213e-02 3.659609e-02 2.376900e-02 2.083796e-02 1.177208e-02
## [11] 9.555094e-03 5.837883e-03 2.981098e-03 2.192285e-03 1.947573e-03
## [16] 9.107335e-04 3.810141e-04 3.357308e-04 9.560854e-05 6.771022e-08
## [21] 2.457167e-10
#Because our PCA is primarily composed of PCs 1 and 2, let's plot those onto a biplot
biplot(pca.state, scale = 0,
       arrow.len = 0)

Variation: Main PCA

Our first PCA is a function of all 21 variables from the housing and evictions dataset. The first principal component, PC1, has a variance of 7.85, or approximately 37.39% of the model’s total variation. The second principal component, PC2, has a variance of 3.75 and accounts for approximately 17.88% of the total variation. Because p = 21, some principal components contribute almost no variance to the model. Next I graph a biplot of PC1 and PC2. Clearly, projecting 21 dimensions into 2 does not make for a quality graphic! From what we can parse from this plot, most states appear to behave quite similarly, but there are some clear outliers. DC appears to lead on both the “evictions” and “eviction filings” dimensions, closely followed by Maryland (MD). DC is also quite high on the “median household income” and “rent burden” dimensions. This may be because, while DC is only a city, it is treated equivalently to the states in this analysis: other states with major cities, such as New York (NYC) or Illinois (Chicago), also have rural and suburban areas included in their state averages, whereas DC is almost entirely urban.
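
To make the drop-off in explained variance easier to see, a simple scree plot of the proportions computed above can help (a minimal base-R sketch, not part of the original output):

#scree plot: proportion of total variance explained by each principal component
pve <- pca.state$sdev^2 / sum(pca.state$sdev^2)
plot(pve, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")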

In order to create a slightly more interpretable graphic, I create two variable subsets of data : “Housing & Evictions” and “Population Statistics: Race, Ethnicity, Poverty”. By performing PCA of these two subsets, I will be able to gain more insight on which variables in these subcategories create the most variance, and thus are most impactful.

Subsetted PCAs: Housing & Eviction and Population Statistics

#Let's create the "Housing & Evictions" subset
#I'm also going to focus on "rates" instead of "counts" 
#I also drop "State Filing Rate" because it's identical to eviction.filing.rate
#when we collapse to the state level

housing.sub <- evict.bystate2 %>%
  dplyr::select(pct.renter.occupied, median.gross.rent,
         median.property.value, eviction.rate, eviction.filing.rate)

#PCA with HOUSING & EVICTIONS subsets
pca.housing <- prcomp(housing.sub, scale = TRUE)

#Let's look at the percentage variances of each of our 5 new "Housing" PCs
pca.housing$sdev^2/sum(pca.housing$sdev^2)
## [1] 0.550186658 0.274089555 0.099863068 0.068878306 0.006982412
#Let's graph a biplot with our 2 primary PCs: PC1 and PC2
biplot(pca.housing, scale = 0,
       arrow.len = 0)

#Return to PCA output to see which variables drive the main PCs.
pca.housing
## Standard deviations (1, .., p=5):
## [1] 1.6585938 1.1706613 0.7066225 0.5868488 0.1868477
## 
## Rotation (n x k) = (5 x 5):
##                             PC1        PC2         PC3        PC4
## pct.renter.occupied   0.5126549  0.1246095 -0.24647753 -0.8085559
## median.gross.rent     0.5732881  0.0953457  0.03335204  0.4395208
## median.property.value 0.5645345  0.2138619 -0.07706653  0.3394145
## eviction.rate         0.0863060 -0.7476249 -0.64440333  0.1335842
## eviction.filing.rate  0.2870154 -0.6088528  0.71898649 -0.1414633
##                               PC5
## pct.renter.occupied   -0.08451955
## median.gross.rent     -0.68407535
## median.property.value  0.71723252
## eviction.rate          0.02253587
## eviction.filing.rate   0.09983491

Variation & Analysis: Housing and Evictions PCA

After running PCA on the housing and evictions subset, I examine the two most impactful principal components. PC1 accounts for 55.02% of the model’s variance, and is mostly a function of “percent renter occupied,” “median gross rent,” and “median property value.” PC2 describes 27.41% of the model’s variance, and is mainly a function of “eviction rate” and “eviction filing rate.” I then graph a biplot of PC1 and PC2.

From this biplot, we can see two clear clusters of variables. The “eviction” variables (eviction rate and eviction filing rate) run nearly perpendicular to the rest of the “housing” variables (percent renter occupied, median gross rent, etc.). This is somewhat surprising to me, as I would have assumed that variables such as median property value and median rent would be correlated with eviction rates. For robustness, I re-run the “housing” PCA to include “median household income,” a variable otherwise included in the “population statistics” subset (see the sketch below). “Median household income” clusters with the rest of the rent / property value dimensions, running perpendicular to the “eviction” variables.

While this plot is still somewhat difficult to read, we again see DC as an outlier. Both DC and Hawaii (HI) are high on the “housing” dimensions: they both have higher median property values and higher median gross rents, and they also appear to have higher percentages of renters.

Delaware and Maryland have comparable median costs of renting. However, they have notably higher eviction filing and eviction rates, with DC and Virginia close behind. South Carolina is the furthest out on the “eviction” axis.
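
The robustness check mentioned above would look something like the following (a sketch, not the exact code used; median.household.income is the column name used in the population subset below):

#Sketch of the robustness check: add median household income to the housing subset
housing.sub.mhi <- evict.bystate2 %>%
  dplyr::select(pct.renter.occupied, median.gross.rent, median.property.value,
                eviction.rate, eviction.filing.rate, median.household.income)
pca.housing.mhi <- prcomp(housing.sub.mhi, scale = TRUE)
biplot(pca.housing.mhi, scale = 0, arrow.len = 0)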

Next, I conduct PCA with the “population statistics” subset.

#PCA with POPULATION subsets

pop.sub <- evict.bystate2 %>%
  dplyr::select(poverty.rate, median.household.income, pct.white, pct.af.am, pct.hispanic,
         pct.am.ind, pct.asian, pct.nh.pi, pct.multiple, pct.other, population)

pca.pop <- prcomp(pop.sub, scale = TRUE)

#Let's look at the percentage variances 
pca.pop$sdev^2/sum(pca.pop$sdev^2)
##  [1] 3.320943e-01 2.080871e-01 1.978896e-01 1.336183e-01 5.615873e-02
##  [6] 3.602707e-02 2.501731e-02 5.206918e-03 4.497380e-03 1.403138e-03
## [11] 6.937790e-10
#Let's graph a biplot with our 2 primary PCs: PC1 and PC2
biplot(pca.pop, scale = 0,
       arrow.len = 0)

#Unlike our "housing" PCA, we do not have as clearly defined dimensional clusters

#Let's return to the initial Population PCA output to see which variables drive the main PCs.
pca.pop
## Standard deviations (1, .., p=11):
##  [1] 1.911292e+00 1.512930e+00 1.475393e+00 1.212354e+00 7.859682e-01
##  [6] 6.295219e-01 5.245859e-01 2.393242e-01 2.224212e-01 1.242357e-01
## [11] 8.735885e-05
## 
## Rotation (n x k) = (11 x 11):
##                                 PC1         PC2         PC3         PC4
## poverty.rate            -0.08885636  0.47737504 -0.38808552  0.12567648
## median.household.income  0.33271606 -0.46952252  0.02750817  0.03679128
## pct.white               -0.35543838 -0.26362114  0.39138443  0.04333396
## pct.af.am                0.04534129  0.11882276 -0.49214688  0.50043697
## pct.hispanic             0.19598569  0.04292771 -0.24072931 -0.57297663
## pct.am.ind              -0.02026225  0.20176573 -0.11451159 -0.61716663
## pct.asian                0.48081273  0.12978169  0.19100438  0.07883918
## pct.nh.pi                0.38797602  0.30506801  0.30744289  0.09604678
## pct.multiple             0.38989303  0.29111535  0.30716458  0.05805093
## pct.other                0.26395119 -0.37444572 -0.24405760  0.02828610
## population               0.33286373 -0.30104319 -0.31186713 -0.04158548
##                                  PC5         PC6         PC7          PC8
## poverty.rate             0.076796738  0.31607039  0.35648322  0.535112088
## median.household.income  0.109691856 -0.30832943 -0.22693472  0.634244942
## pct.white                0.107559928  0.17475081  0.28967314  0.108312020
## pct.af.am                0.083914158 -0.28506248 -0.30940475 -0.161073389
## pct.hispanic            -0.620974695  0.09290939 -0.13199346  0.037379752
## pct.am.ind               0.696541009 -0.18940078 -0.09380170  0.009802344
## pct.asian               -0.006170493  0.02467350  0.18637972  0.208000495
## pct.nh.pi                0.020876785  0.08835586 -0.04538770  0.102988154
## pct.multiple             0.114534829  0.02619172 -0.02901405 -0.357391233
## pct.other                0.276022924  0.76719907 -0.21089891 -0.137402411
## population               0.053846899 -0.22590536  0.73036969 -0.267843728
##                                  PC9         PC10         PC11
## poverty.rate            -0.279229617  0.024900248 6.843108e-05
## median.household.income -0.316694459  0.074683206 8.333609e-05
## pct.white               -0.051132411  0.066866386 7.101220e-01
## pct.af.am                0.096071687  0.014192033 5.213159e-01
## pct.hispanic            -0.070723464  0.044466178 3.917317e-01
## pct.am.ind               0.108126511 -0.009972579 1.521342e-01
## pct.asian                0.444766286 -0.638623223 1.580266e-01
## pct.nh.pi                0.323682822  0.722995528 6.526634e-02
## pct.multiple            -0.699237108 -0.126910271 1.345437e-01
## pct.other                0.039796997  0.002725753 4.886822e-03
## population              -0.009697701  0.200941221 1.226055e-05
AA.housing.sub <- evict.bystate2 %>%
  dplyr::select(pct.renter.occupied, median.gross.rent,
         median.property.value, eviction.rate, eviction.filing.rate, pct.af.am)

pca.AA.housing <- prcomp(AA.housing.sub, scale = TRUE)

biplot(pca.AA.housing, scale = 0, arrow.len = 0)

#Interestingly, pct.af.am appears to move in the same direction as the eviction rates.

Variation & Analysis: Population Statistics PCA

In the “Population Statistics” PCA, PC1 accounts for 33.21% of the model’s variance, and is mostly a function of percent Asian, percent multi-racial, percent Pacific Islander, and population size. PC2 describes 20.81% of the model’s variance, and is mainly a function of percent “other” race, median household income, and poverty rate.
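
Rather than scanning the full rotation matrix above, the dominant loadings on each component can be sorted by absolute value; a minimal sketch using the pca.pop object above (output not shown):

#Largest loadings (in absolute value) on the first two population PCs
round(sort(abs(pca.pop$rotation[, "PC1"]), decreasing = TRUE), 2)
round(sort(abs(pca.pop$rotation[, "PC2"]), decreasing = TRUE), 2)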

The biplot of PC1 and PC2 shows Hawaii (HI) to be a clear outlier, with its large Asian, Pacific Islander, and mixed-race populations. Both DC and California (CA) also appear to be more racially diverse, while relatively low on the “poverty” dimension, which is led by Mississippi (MS) and Arizona (AZ).

Notably, I was surprised to find that percent African American was not more important in each PC, as the social science literature traditionally finds it closely related to a variety of life-outcome variables. To see whether the negligible role of percent African American is specific to the “Population Statistics” subset, I add it to the “Housing” PCA and examine the results. Interestingly, percent African American does appear correlated with eviction rates and eviction filing rates. However, the principal components for which percent African American plays a large role contribute minimal variance to the overall model.
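
That last claim can be checked directly from the pca.AA.housing object defined above; a minimal sketch (output not shown):

#Variance shares and loadings of the housing + pct.af.am PCA
round(pca.AA.housing$sdev^2/sum(pca.AA.housing$sdev^2), 3)
round(pca.AA.housing$rotation, 2)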

2. HIERARCHICAL clustering:

What states are dissimilar from one another, and what are the potential features that cause them to be so?

Data Cleaning:

For the purpose of our exploratory project, we remove the “evictions” variable, as we weight “eviction filings” more heavily.

covid.eviction2 <- covid.eviction[,-2]
covid.eviction2.clust <- covid.eviction2[,-c(1, 18:24, 30:43)] #remove unnecessary columns

covid.eviction2.clust <- covid.eviction2.clust %>%
  group_by(State) %>%
  summarise_all(mean, na.rm = TRUE)

Fundamental Calculations

Most column names stay the same (cf. the covid.eviction2 dataset).

#National benchmarks: cases per capita (329584132 = total US population used here)
#and deaths per case
Us.covid.rate <- covid.eviction2 %>%
  summarize(US_covid_cases_rate = sum(covidCases)/329584132,
            US_covid_death_rate = sum(covidDeaths)/sum(covidCases))

covid.eviction2.clust <- covid.eviction2.clust %>%
  mutate(US_covid_cases_rate = Us.covid.rate$US_covid_cases_rate,
         US_covid_death_rate = Us.covid.rate$US_covid_death_rate)

State.covid.rate <- covid.eviction2.clust %>%
  group_by(State) %>%
  summarize(state_covid_cases_rate = covidCases/population,
            state_covid_death_rate = covidDeaths/covidCases)

covid.eviction2.clust <- full_join(covid.eviction2.clust, State.covid.rate,
                                  by = "State")

covid.eviction2.clust <- covid.eviction2.clust %>%
  mutate(covidCases_greater_US = ifelse(state_covid_cases_rate > US_covid_cases_rate,
                                           1,
                                           0)) %>%
  mutate(covidDeaths_greater_US = ifelse(state_covid_death_rate >
                                          US_covid_death_rate,
                                         1,
                                         0))

In hierarchical clustering, we use continuous, quantitative variables only. With the graphs below, we show why we choose complete linkage, since hierarchical clustering is interpreted through the “degree of dissimilarity” amongst the states. We use Euclidean distance as the dissimilarity measure in the clustering process.

State Covid CASE Rates:

lower than the US level

c.lower.state <- covid.eviction2.clust %>%
  filter(covidCases_greater_US == 0)

c.lower.state.subset <- c.lower.state[,-c(18, 22:27)]

c.lower.state.scaled <- scale(c.lower.state.subset %>%
                              dplyr::select(-State))
c.lower.state.dist <- dist(c.lower.state.scaled)

c.low.case.hc.single = hclust(c.lower.state.dist, method = "single")
c.low.case.hc.avg = hclust(c.lower.state.dist, method = "average")
c.low.case.hc.complete = hclust(c.lower.state.dist, method = "complete")

c.low.case.hc.complete
## 
## Call:
## hclust(d = c.lower.state.dist, method = "complete")
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 39
c.low.case.hc.single %>%
  as.dendrogram() %>%
  place_labels(c.lower.state$State) %>%
  set("labels_cex", 0.5) %>% 
  color_labels(k = 12) %>%
  set("branches_lwd", 5) %>%
  color_branches(k =12) %>%
  plot()

c.low.case.hc.avg %>%
  as.dendrogram() %>%
  place_labels(c.lower.state$State) %>%
  set("labels_cex", 0.5) %>% 
  color_labels(k = 6) %>%
  set("branches_lwd", 5) %>%
  color_branches(k =6) %>%
  plot()

c.low.case.hc.complete %>%
  as.dendrogram() %>%
  place_labels(c.lower.state$State) %>%
  set("labels_cex", 0.5) %>% 
  color_labels(k = 6) %>%
  set("branches_lwd", 5) %>%
  color_branches(k =6) %>%
  plot()

Before going further into the different linkage methods, one thing to note from all three dendrograms is the sheer number of states included: 39 states have covid case rates below the national rate, which may be reassuring.
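
A quick tabulation of the indicator built earlier confirms the split; a minimal sketch (output not shown):

#0 = state case rate below the US level, 1 = above
table(covid.eviction2.clust$covidCases_greater_US)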

Usage of linkage techniques

Euclidean distance is used as the measure of dissimilarity.

[Comparisons of linkage methods] In all four graphs we see one commonality: Hawaii is not connected to any other state until the very end. In all three linkage methods (single, average, and complete), the state acts as an “outlier” and is the most dissimilar to the remaining states. This may be due to geographic isolation; as an island chain, Hawaii may simply not share many attributes with the other states.

The complete method is by far the best of the three linkages: under single and average linkage the dendrograms are more stretched out, with many states chained into one big group before splitting into subgroups (e.g. the violet-colored states from OR to VT under single linkage, and the purple-colored states from SC to VA under average linkage). Since we want to view the states in the big picture, having them grouped into smaller subsets lets us analyze the dendrogram more efficiently. Thus we proceed with complete linkage, which is also the most commonly used method.
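
One additional, quantitative way to compare the linkages (output not shown) is the cophenetic correlation, which measures how faithfully each dendrogram preserves the original pairwise distances; a minimal sketch using the objects above:

#Cophenetic correlation: closer to 1 means the dendrogram better preserves
#the original Euclidean distances
cor(c.lower.state.dist, cophenetic(c.low.case.hc.single))
cor(c.lower.state.dist, cophenetic(c.low.case.hc.avg))
cor(c.lower.state.dist, cophenetic(c.low.case.hc.complete))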

[Analysis of the Complete Linkage] Like HI, CA and AK are not joined with other states until much later, at a height of around 10. It is also interesting that some states are linked based on their physical proximity (e.g. NM and TX in purple), although this does not hold for all linkages (e.g. FL and NV in green). Let’s look further into these two examples.

NM & TX
c.lower.state.subset %>%
  filter(State == "NM" | State == "TX" | State == "SC" | State == "SD")
## # A tibble: 4 x 20
##   State `poverty-rate` `renter-occupie… `pct-renter-occ… `median-gross-r…
##   <chr>          <dbl>            <dbl>            <dbl>            <dbl>
## 1 NM              15.6            8348              27.4             678.
## 2 SC              16.3           13619.             29.0             690.
## 3 SD              11.0            1730.             29.1             556.
## 4 TX              13.1           14899.             28.4             699.
## # … with 15 more variables: `median-household-income` <dbl>,
## #   `median-property-value` <dbl>, `rent-burden` <dbl>, `pct-white` <dbl>,
## #   `pct-af-am` <dbl>, `pct-hispanic` <dbl>, `pct-am-ind` <dbl>,
## #   `pct-asian` <dbl>, `pct-nh-pi` <dbl>, `pct-multiple` <dbl>,
## #   `pct-other` <dbl>, `eviction-filings` <dbl>, covidCases <dbl>,
## #   covidDeaths <dbl>, population <dbl>

We compare NM and TX with SC (in the same color) and SD (grouped into another subset until the two subsets merge at a height of around 7). From the data, some of the variables that likely drove NM and TX to be matched are: poverty rate, eviction filings, percentage of Hispanics, covidDeaths, and covidCases. Of course, because we are looking at a subset of just four states rather than comparing all states with one another, this should be treated merely as a sample of what variables could have factored into the dendrogram and scatterplot. If this observation generalizes to the full dataset used for this hierarchical clustering, however, it suggests that geographical proximity influences the clustering. That is, variables like poverty rate and percentage of Hispanics may be similar for states located close to each other due to shared socioeconomic factors.
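
The visual impression can be checked against the dissimilarities actually used in the clustering, i.e. the pairwise Euclidean distances between these four states in the scaled data; a minimal sketch (output not shown), assuming the rows of c.lower.state.scaled are in the same order as c.lower.state$State, which is how they were built:

#Pairwise Euclidean distances (on scaled variables) between the four states
idx <- which(c.lower.state$State %in% c("NM", "TX", "SC", "SD"))
four.states <- c.lower.state.scaled[idx, ]
rownames(four.states) <- c.lower.state$State[idx]
round(as.matrix(dist(four.states)), 2)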

FL & NV
c.lower.state.subset %>%
  filter(State == "FL" | State == "NV" | State == "AZ" | State == "HI")
## # A tibble: 4 x 20
##   State `poverty-rate` `renter-occupie… `pct-renter-occ… `median-gross-r…
##   <chr>          <dbl>            <dbl>            <dbl>            <dbl>
## 1 AZ             15.8            63476.             33.1             754.
## 2 FL             13.4            43616.             28.6             846.
## 3 HI              7.12           40438.             51.2            1207 
## 4 NV              9.94           28223.             32.5             763.
## # … with 15 more variables: `median-household-income` <dbl>,
## #   `median-property-value` <dbl>, `rent-burden` <dbl>, `pct-white` <dbl>,
## #   `pct-af-am` <dbl>, `pct-hispanic` <dbl>, `pct-am-ind` <dbl>,
## #   `pct-asian` <dbl>, `pct-nh-pi` <dbl>, `pct-multiple` <dbl>,
## #   `pct-other` <dbl>, `eviction-filings` <dbl>, covidCases <dbl>,
## #   covidDeaths <dbl>, population <dbl>

We use a similar method to compare FL and NV with two other states: AZ is chosen because it is not connected to FL and NV until a height of around 6, and HI is chosen because it is interesting to see what dissimilar values make it an outlier, although HI is not the focus of our comparison between FL and NV.

In comparison, FL and NV seem to share similar percentages of white, Hispanic, and “other” residents, which may have factored into their early clustering; again, though, this comparison is not holistic, as we are only looking at four states.

This four-state comparison is still important, however, as it shows that one should not simply assume geographical proximity is the primary driver of clustering. Specifically, AZ and NV are neighboring states, but NV’s values for variables including poverty rate, renter-occupied households, population, and covidCases are far smaller than AZ’s.

c.low.case.clust.df <- c.lower.state.scaled
rownames(c.low.case.clust.df) <- c.lower.state$State
c.low.case.clust <- cutree(c.low.case.hc.complete, k = 6)
fviz_cluster(list(data = c.low.case.clust.df, cluster = c.low.case.clust),
             labelsize = 8, title = "Cluster plot when k = 6")

Another way to show the same results as the previous dendrogram is with the scatterplot shown here. We set k to 6, resulting in 6 clusters. One difference to note between the dendrogram and the scatterplot is that this plot reflects an early stage of the dendrogram, before HI, AK, and CA are merged into subsets; at that stage these three states act as “outliers.” HI sits farthest from the main subgroups, followed by CA and AK. One drawback is that it is difficult to identify individual states where their labels overlap in the graph.

*Note: The dimensions of this plot represent percentages of variation from the original dataset. The first principal component accounts for 38.5% and the second for 23.8% of the variation, so the two-dimensional view captures only about 62% of the total variation and is therefore a rougher summary of the data than a projection capturing, say, 70% or more would be.
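
The dimension percentages in the fviz_cluster() plot come from projecting the (already scaled) data onto its first two principal components, so they can be reproduced directly; a sketch, assuming fviz_cluster()’s default PCA-based projection (output not shown):

#Reproduce the Dim1/Dim2 percentages reported by fviz_cluster()
pca.low.case <- prcomp(c.low.case.clust.df)
summary(pca.low.case)$importance["Proportion of Variance", 1:2]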

Now that we have extensively explored the different linkage methods and taken a closer look at certain states, we will examine the three other cases more concisely, using complete linkage only.

Higher than the US Level (CASE rate)

c.higher.state <- covid.eviction2.clust %>%
  filter(covidCases_greater_US == 1)

c.higher.state.subset <- c.higher.state[,-c(18, 22:27)]

c.higher.state.scaled <- scale(c.higher.state.subset %>%
                              dplyr::select(-State))
c.higher.state.dist <- dist(c.higher.state.scaled)
c.high.case.hc.complete = hclust(c.higher.state.dist)

c.high.case.hc.complete %>%
  as.dendrogram() %>%
  place_labels(c.higher.state$State) %>%
  set("labels_cex", 0.5) %>% 
  color_labels(k = 4) %>%
  set("branches_lwd", 5) %>%
  color_branches(k = 4) %>%
  plot() 

c.high.case.hc.complete %>%
  as.dendrogram() %>%
  place_labels(c.higher.state$State) %>%
  set("labels_cex", 0.5) %>% 
  color_labels(k = 6) %>%
  set("branches_lwd", 5) %>%
  color_branches(k = 6) %>%
  plot()

c.high.case.clust.df <- c.higher.state.scaled
rownames(c.high.case.clust.df) <- c.higher.state$State
c.high.case.clust <- cutree(c.high.case.hc.complete, k = 6)
fviz_cluster(list(data = c.high.case.clust.df, cluster = c.high.case.clust),
             labelsize = 8, title = "Cluster plot when k = 6")

fviz_cluster(list(data = c.high.case.clust.df, cluster = cutree(c.high.case.hc.complete, k = 4)),
             labelsize = 8, title = "Cluster plot when k = 4")

With fewer states in the group whose covid case rates exceed the US level, the dendrogram is a better visualization than the scatterplot, where most clusters contain only one or two states. Of course, if we change k from 6 to 4, there are fewer clusters and the scatterplot becomes more visually organized, now linking MI, IL, PA, and LA into one group that is split apart in the k = 6 graph. This observation will be further discussed in the “K-Means clustering” section.

When looking at the dendrogram colored with k = 4 (changing k here changes only the coloring, not the underlying clustering, as the second graph shows), the far RHS contains Northeastern states, from NJ to MA. However, some other Northeastern states are clustered with non-Northeastern states, e.g. RI with DE, IL, and PA. Moreover, DC is the most dissimilar state amongst these 12 states. DC appearing as an outlier was previously seen in the PCA.

Covid DEATH Rates

lower than the US level

d.lower.state <- covid.eviction2.clust %>%
  filter(covidDeaths_greater_US == 0)

d.lower.state.subset <- d.lower.state[,-c(18, 22:27)]

d.lower.state.scaled <- scale(d.lower.state.subset %>%
                              dplyr::select(-State))
d.lower.state.dist <- dist (d.lower.state.scaled)
d.low.case.hc.complete = hclust(d.lower.state.dist)

d.low.case.hc.complete %>%
  as.dendrogram() %>%
  place_labels(d.lower.state$State) %>%
  set("labels_cex", 0.5) %>% 
  color_labels(k = 6) %>%
  set("branches_lwd", 5) %>%
  color_branches(k = 6) %>%
  plot()

d.low.case.clust.df <- d.lower.state.scaled
rownames(d.low.case.clust.df) <- d.lower.state$State
d.low.case.clust <- cutree(d.low.case.hc.complete, k = 6)
fviz_cluster(list(data = d.low.case.clust.df, cluster = d.low.case.clust),
             labelsize = 8, title = "Cluster plot when k = 6")

fviz_cluster(list(data = d.low.case.clust.df, cluster = cutree(d.low.case.hc.complete, k = 12)),
             labelsize = 8, title = "Cluster plot when k =12")

When k increases from 6 to 12, overlaps appear amongst some clusters. If we keep k = 6, there are no overlaps and fewer single-state clusters (e.g. HI, AK, MD, and DC). This illustrates the overfitting that occurs when k is too large.

As when state covid case rates were lower than the US level, the majority of US states are found to have death rates lower than the US level. This means that, under complete linkage, there are again multiple clusters. One difference, however, is that the greatest outlier is no longer HI but DC. Let’s look at possible factors.

d.lower.state.subset %>%
  filter(State == "HI" | State == "DC" | State == "AK" | State == "MT" | State == "MD" | State == "CA")
## # A tibble: 6 x 20
##   State `poverty-rate` `renter-occupie… `pct-renter-occ… `median-gross-r…
##   <chr>          <dbl>            <dbl>            <dbl>            <dbl>
## 1 AK              9.70            3617.             37.3             985.
## 2 CA             11.7           102611.             38.9            1084.
## 3 DC             14.3           175121              58.8            1327 
## 4 HI              7.12           40438.             51.2            1207 
## 5 MD              8.12           32392.             29.2            1105.
## 6 MT             10.6             2701.             29.3             607.
## # … with 15 more variables: `median-household-income` <dbl>,
## #   `median-property-value` <dbl>, `rent-burden` <dbl>, `pct-white` <dbl>,
## #   `pct-af-am` <dbl>, `pct-hispanic` <dbl>, `pct-am-ind` <dbl>,
## #   `pct-asian` <dbl>, `pct-nh-pi` <dbl>, `pct-multiple` <dbl>,
## #   `pct-other` <dbl>, `eviction-filings` <dbl>, covidCases <dbl>,
## #   covidDeaths <dbl>, population <dbl>

We look at several states, each drawn from a different cluster. DC has the highest poverty rate, renter-occupied households, percentage renter-occupied, median gross rent, median household income, median property value, percentage African American and “other,” population, eviction filings, and, most importantly, covid cases and covid deaths. Let’s look more deeply.

cor(d.lower.state.subset %>%
      dplyr::select(-State))
##                            poverty-rate renter-occupied-households
## poverty-rate                1.000000000                 0.07277332
## renter-occupied-households  0.072773318                 1.00000000
## pct-renter-occupied         0.044948935                 0.77056515
## median-gross-rent          -0.311665959                 0.71929556
## median-household-income    -0.705880323                 0.47153943
## median-property-value      -0.339907040                 0.70907563
## rent-burden                 0.401997702                 0.29114876
## pct-white                  -0.374832985                -0.53024951
## pct-af-am                   0.589839632                 0.39032902
## pct-hispanic                0.170809415                 0.30311029
## pct-am-ind                 -0.034281667                -0.07424363
## pct-asian                  -0.246004014                 0.27544848
## pct-nh-pi                  -0.205924806                 0.10700117
## pct-multiple               -0.267515405                 0.14453732
## pct-other                  -0.065000721                 0.59517265
## eviction-filings           -0.058631412                 0.49296783
## covidCases                 -0.009363216                 0.87568911
## covidDeaths                 0.047397862                 0.88647147
## population                  0.034412132                 0.94599832
##                            pct-renter-occupied median-gross-rent
## poverty-rate                        0.04494893       -0.31166596
## renter-occupied-households          0.77056515        0.71929556
## pct-renter-occupied                 1.00000000        0.74350720
## median-gross-rent                   0.74350720        1.00000000
## median-household-income             0.50288094        0.81010306
## median-property-value               0.82256657        0.93069366
## rent-burden                         0.14198611        0.32292800
## pct-white                          -0.69594644       -0.54516966
## pct-af-am                           0.33620732        0.22506403
## pct-hispanic                        0.17998077        0.17983003
## pct-am-ind                          0.12534008        0.03049678
## pct-asian                           0.64472731        0.58549803
## pct-nh-pi                           0.52263337        0.39652246
## pct-multiple                        0.60514121        0.51679059
## pct-other                           0.46477044        0.69297319
## eviction-filings                    0.33621277        0.54055846
## covidCases                          0.64774635        0.67262176
## covidDeaths                         0.67428218        0.62631349
## population                          0.66400546        0.72815997
##                            median-household-income median-property-value
## poverty-rate                           -0.70588032            -0.3399070
## renter-occupied-households              0.47153943             0.7090756
## pct-renter-occupied                     0.50288094             0.8225666
## median-gross-rent                       0.81010306             0.9306937
## median-household-income                 1.00000000             0.7754446
## median-property-value                   0.77544462             1.0000000
## rent-burden                            -0.17105096             0.2339430
## pct-white                              -0.18439376            -0.4860586
## pct-af-am                              -0.07984152             0.1386167
## pct-hispanic                           -0.02715056             0.1249404
## pct-am-ind                              0.15005609            -0.0443156
## pct-asian                               0.44113888             0.6836618
## pct-nh-pi                               0.28378609             0.5429371
## pct-multiple                            0.41688938             0.6084067
## pct-other                               0.51443234             0.5697893
## eviction-filings                        0.50287895             0.4374098
## covidCases                              0.54589976             0.6467986
## covidDeaths                             0.49050288             0.6203958
## population                              0.44587823             0.6703180
##                             rent-burden  pct-white   pct-af-am
## poverty-rate                0.401997702 -0.3748330  0.58983963
## renter-occupied-households  0.291148763 -0.5302495  0.39032902
## pct-renter-occupied         0.141986111 -0.6959464  0.33620732
## median-gross-rent           0.322928000 -0.5451697  0.22506403
## median-household-income    -0.171050962 -0.1843938 -0.07984152
## median-property-value       0.233943028 -0.4860586  0.13861673
## rent-burden                 1.000000000 -0.2404073  0.43980795
## pct-white                  -0.240407337  1.0000000 -0.50385789
## pct-af-am                   0.439807946 -0.5038579  1.00000000
## pct-hispanic                0.108707434 -0.5344351 -0.14067839
## pct-am-ind                 -0.472556737 -0.2630996 -0.23188857
## pct-asian                   0.074617666 -0.5314680 -0.06194069
## pct-nh-pi                   0.007560009 -0.4289804 -0.10956085
## pct-multiple               -0.036280954 -0.4855319 -0.14279919
## pct-other                   0.378818110 -0.3720039  0.33725550
## eviction-filings            0.186290941 -0.2708154  0.43273453
## covidCases                  0.219313798 -0.3769576  0.49223834
## covidDeaths                 0.182977138 -0.3911354  0.51859480
## population                  0.368236483 -0.5379308  0.28823649
##                            pct-hispanic  pct-am-ind   pct-asian
## poverty-rate                 0.17080941 -0.03428167 -0.24600401
## renter-occupied-households   0.30311029 -0.07424363  0.27544848
## pct-renter-occupied          0.17998077  0.12534008  0.64472731
## median-gross-rent            0.17983003  0.03049678  0.58549803
## median-household-income     -0.02715056  0.15005609  0.44113888
## median-property-value        0.12494038 -0.04431560  0.68366184
## rent-burden                  0.10870743 -0.47255674  0.07461767
## pct-white                   -0.53443507 -0.26309959 -0.53146802
## pct-af-am                   -0.14067839 -0.23188857 -0.06194069
## pct-hispanic                 1.00000000  0.12923775  0.08115287
## pct-am-ind                   0.12923775  1.00000000  0.05010387
## pct-asian                    0.08115287  0.05010387  1.00000000
## pct-nh-pi                    0.02182028 -0.03156152  0.96105871
## pct-multiple                 0.01381515  0.23695370  0.96069387
## pct-other                    0.20913965 -0.10143897  0.14249579
## eviction-filings            -0.01611527 -0.10788488  0.05995943
## covidCases                   0.09292683 -0.12983077  0.09156726
## covidDeaths                  0.08447927 -0.11494891  0.07954214
## population                   0.39465602 -0.06743394  0.33126526
##                               pct-nh-pi pct-multiple    pct-other
## poverty-rate               -0.205924806 -0.267515405 -0.065000721
## renter-occupied-households  0.107001168  0.144537323  0.595172648
## pct-renter-occupied         0.522633374  0.605141205  0.464770444
## median-gross-rent           0.396522463  0.516790594  0.692973192
## median-household-income     0.283786090  0.416889379  0.514432345
## median-property-value       0.542937088  0.608406686  0.569789274
## rent-burden                 0.007560009 -0.036280954  0.378818110
## pct-white                  -0.428980410 -0.485531865 -0.372003912
## pct-af-am                  -0.109560850 -0.142799189  0.337255503
## pct-hispanic                0.021820275  0.013815146  0.209139648
## pct-am-ind                 -0.031561520  0.236953705 -0.101438973
## pct-asian                   0.961058714  0.960693865  0.142495794
## pct-nh-pi                   1.000000000  0.929258295 -0.003897428
## pct-multiple                0.929258295  1.000000000  0.055467989
## pct-other                  -0.003897428  0.055467989  1.000000000
## eviction-filings           -0.052360883 -0.010126408  0.414549352
## covidCases                 -0.053374421 -0.008870551  0.704445124
## covidDeaths                -0.053692246 -0.013662155  0.585406719
## population                  0.150732596  0.185143664  0.601656889
##                            eviction-filings   covidCases covidDeaths
## poverty-rate                    -0.05863141 -0.009363216  0.04739786
## renter-occupied-households       0.49296783  0.875689111  0.88647147
## pct-renter-occupied              0.33621277  0.647746354  0.67428218
## median-gross-rent                0.54055846  0.672621761  0.62631349
## median-household-income          0.50287895  0.545899759  0.49050288
## median-property-value            0.43740978  0.646798626  0.62039585
## rent-burden                      0.18629094  0.219313798  0.18297714
## pct-white                       -0.27081544 -0.376957582 -0.39113540
## pct-af-am                        0.43273453  0.492238341  0.51859480
## pct-hispanic                    -0.01611527  0.092926835  0.08447927
## pct-am-ind                      -0.10788488 -0.129830767 -0.11494891
## pct-asian                        0.05995943  0.091567261  0.07954214
## pct-nh-pi                       -0.05236088 -0.053374421 -0.05369225
## pct-multiple                    -0.01012641 -0.008870551 -0.01366216
## pct-other                        0.41454935  0.704445124  0.58540672
## eviction-filings                 1.00000000  0.595967014  0.61185905
## covidCases                       0.59596701  1.000000000  0.97320474
## covidDeaths                      0.61185905  0.973204742  1.00000000
## population                       0.43808336  0.755681644  0.73295405
##                             population
## poverty-rate                0.03441213
## renter-occupied-households  0.94599832
## pct-renter-occupied         0.66400546
## median-gross-rent           0.72815997
## median-household-income     0.44587823
## median-property-value       0.67031803
## rent-burden                 0.36823648
## pct-white                  -0.53793082
## pct-af-am                   0.28823649
## pct-hispanic                0.39465602
## pct-am-ind                 -0.06743394
## pct-asian                   0.33126526
## pct-nh-pi                   0.15073260
## pct-multiple                0.18514366
## pct-other                   0.60165689
## eviction-filings            0.43808336
## covidCases                  0.75568164
## covidDeaths                 0.73295405
## population                  1.00000000

The correlation matrix helps explain why the variables listed above all take such high values in DC relative to other states. The variables covidCases and covidDeaths are highly and positively correlated with renter-occupied households, percentage renter-occupied, median gross rent, median household income, eviction filings, percentage African American, and median property value, all with correlation coefficients at or above roughly 0.5. In other words, a large value on any one of these variables tends to come with high covid case and death counts, which ultimately explains why DC is the biggest outlier in this subset.
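
Rather than reading the full matrix, the covid columns can be pulled out and sorted; a minimal sketch (output not shown):

#Correlations of every variable with covidCases and covidDeaths, sorted
cor.mat <- cor(d.lower.state.subset %>% dplyr::select(-State))
round(sort(cor.mat[, "covidCases"], decreasing = TRUE), 2)
round(sort(cor.mat[, "covidDeaths"], decreasing = TRUE), 2)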

Higher than the US level

d.higher.state <- covid.eviction2.clust %>%
  filter(covidDeaths_greater_US == 1)

d.higher.state.subset <- d.higher.state[,-c(18, 22:27)]

d.higher.state.scaled <- scale(d.higher.state.subset %>%
                              dplyr::select(-State))
d.higher.state.dist <- dist (d.higher.state.scaled)
d.high.case.hc.complete = hclust(d.higher.state.dist)

d.high.case.hc.complete %>%
  as.dendrogram() %>%
  place_labels(d.higher.state$State) %>%
  set("labels_cex", 0.5) %>% 
  color_labels(k = 6) %>%
  set("branches_lwd", 5) %>%
  color_branches(k = 6) %>%
  plot()

d.high.case.clust.df <- d.higher.state.scaled
rownames(d.high.case.clust.df) <- d.higher.state$State
d.high.case.clust <- cutree(d.high.case.hc.complete, k = 6)
fviz_cluster(list(data = d.high.case.clust.df, cluster = d.high.case.clust),
             labelsize = 8)

As when the state case rate was higher than the US level, this group contains only a handful of states (to be specific, one fewer). At this point, we know that the scatterplot isn’t efficient for visually observing the clusters, unless we want to know the level of variation. Even that is of limited use, since the high 61.4% of variation captured by the principal component already indicates great dispersion.

As can be seen in the dendrogram, the states on the left-hand side (LHS) of the graph (NJ, MA, CT, NY) are all adjoining Northeastern states, while those on the right-hand side (RHS) are southern or midwestern states. Geographical proximity may be reflected in the dissimilarity heights on the y-axis: given the slightly shorter heights on the LHS than on the RHS, the former states are, in general, more similar to each other than the latter, even though the subsets {CT, NY} and {MI, MN} start merging at the same height in the dendrogram.

Analysis of Hierarchical Clustering

First and foremost, we have flexibility in choosing the linkage method for hierarchical clustering, from single to complete. As we have seen across four different cases, hierarchical clustering lets us choose not only how many clusters to highlight but also how to visualize them, by varying k. Another strength of this approach is that it lets us read the “distance of dissimilarity” directly from the heights on the y-axis.

One disadvantage of hierarchical clustering is that it is difficult to pinpoint which factors drive the splits. We can only guess at the variables that may have played a role, such as geographical proximity, but that does not hold true in all cases, as we observed in the first case. There could be other underlying socioeconomic factors, but with hierarchical clustering we cannot confirm this with full certainty; neither the dendrograms nor the scatterplots from fviz_cluster() identify them.

3. K-MEANS clustering

How can states be clustered in certain groups? Does higher k value mean better separation amongst the states?

From hierarchical clustering, we saw that clustering worked best when k ranged from 4 to 6. We attempted hierarchical clustering before k-means clustering precisely because it is hard to find an “optimal” k. K-means is also awkward here because each of the 51 states is a single observation with no pre-existing grouping; this is unlike the starbucks data from the openintro library or the iris dataset, where we start with known groups.

Therefore, we will explore the dataset focusing on case and death rates in general. That is, instead of looking at states within a given level (e.g. case rate < US level), we categorize all states by whether their case and death rates are higher or lower than the US level.

Metric of optimization for K-Means clustering

We designate the within-cluster variation across k clusters as our metric of optimization. K-means assigns data points to clusters, so it is important to confirm that points grouped together actually share similar attributes, which is what low within-cluster variation indicates. As in hierarchical clustering, we use Euclidean distance.

Creating Levels (depending on case and death rates):

km.covid.eviction2.clust <- covid.eviction2.clust[,-c(18, 22:25)]
km.covid <- km.covid.eviction2.clust %>%
  mutate(levels = ifelse(covidCases_greater_US == 0 & covidDeaths_greater_US == 0, "low case",
                         ifelse(covidCases_greater_US == 1 & covidDeaths_greater_US == 0, "high case but low death",
                                ifelse(covidCases_greater_US == 1 & covidDeaths_greater_US == 1, "high case and deaths",
                                       ifelse(covidCases_greater_US == 0 & covidDeaths_greater_US == 1, "low case but high death", NA)))))

Determining the “optimal” range of k

# Let's make a graph showing k vs the total within cluster variation
tot <- NULL
for(i in 1:20){
 km <- kmeans(km.covid %>% dplyr::select(-State, -levels),
              i)
 tot[i] <- km$tot.withinss/i #averaging by the # of clusters
}
plot(tot)

From the elbow plot above, the optimal k appears to be 2 or 3, where the (averaged) within-cluster variation is already low without overfitting the data (cf. k = 20). These values are lower than what we saw in hierarchical clustering. Just as very high k overfits, such low values of k most likely “underfit” – that is, they will not clearly divide the data points into groups.
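
As a cross-check on the hand-rolled elbow plot, factoextra (the package behind fviz_cluster()) provides fviz_nbclust(); a sketch (output not shown), noting that it plots the raw total within-cluster sum of squares rather than the per-cluster average used in the loop above, and that the data are scaled here:

#Alternative elbow plot; "wss" = total within-cluster sum of squares
fviz_nbclust(scale(km.covid %>% dplyr::select(-State, -levels)),
             kmeans, method = "wss", k.max = 20)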

Let’s explore the data when k = 2, 4, and 6, and see whether a value greater than 2 provides a better clustering:

km.scaled <- scale(km.covid %>%
                         dplyr::select(-State, -levels))
km.scaled_2 <- kmeans(km.scaled, 2)
km.scaled_4 <- kmeans(km.scaled, 4)
km.scaled_6 <- kmeans(km.scaled, 6)
km.scaled_2$centers
##   poverty-rate renter-occupied-households pct-renter-occupied
## 1   -0.6708387                  1.4429261           1.0991644
## 2    0.1636192                 -0.3519332          -0.2680889
##   median-gross-rent median-household-income median-property-value
## 1         1.6575033               1.4903277             1.5942448
## 2        -0.4042691              -0.3634945            -0.3888402
##   rent-burden  pct-white   pct-af-am pct-hispanic  pct-am-ind  pct-asian
## 1   0.7652106 -0.7094592  0.34323210   0.28693779 -0.39259653  1.0452630
## 2  -0.1866367  0.1730388 -0.08371515  -0.06998483  0.09575525 -0.2549422
##    pct-nh-pi pct-multiple  pct-other eviction-filings covidCases
## 1  0.5641097    0.5560995  1.4087911        1.0181066   1.550509
## 2 -0.1375877   -0.1356340 -0.3436076       -0.2483187  -0.378173
##   covidDeaths population covidCases_greater_US covidDeaths_greater_US
## 1   1.4232602  1.5423991             1.3181641              0.6619762
## 2  -0.3471366 -0.3761949            -0.3215034             -0.1614576
km.covid %>%
  mutate(cluster = km.scaled_4$cluster) %>%
  ggplot(aes(x = `eviction-filings`,
             y = `median-household-income`)) +
  geom_point(aes(color = factor(cluster)),
             size = 3)

km.covid %>%
  mutate(cluster = km.scaled_2$cluster) %>%
  ggplot(aes(x = `pct-white`,
             y = `median-household-income`)) +
  geom_point(aes(color = factor(cluster)),
             size = 3)

km.covid %>%
  mutate(cluster = km.scaled_4$cluster) %>%
  ggplot(aes(x = `pct-white`,
             y = `median-household-income`)) +
  geom_point(aes(color = factor(cluster)),
             size = 3)

km.covid %>%
  mutate(cluster = km.scaled_6$cluster) %>%
  ggplot(aes(x = `pct-white`,
             y = `median-household-income`)) +
  geom_point(aes(color = factor(cluster)),
             size = 3)

After many attempts at finding a visualization that separates the data points into clear groups, we settled on a scatterplot of pct-white against median-household-income, which appears to be the best visualization so far, compared to, for example, the first graph of eviction-filings against median-household-income.

Right off the bat, looking at the last three graphs of pct-white against median-household-income, we can see an overlap when k = 2: there is a red point within the general region of the blue points. That overlap still exists when k = 4 and k = 6. Perhaps k = 4 is the best value of the three, since there is only one overlap, while neither underfitting nor overfitting the data the way k = 2 and k = 6 do. Overall, however, we can affirm that, regardless of how high or low k is, it is difficult to visually find a “perfect” separation of the 51 data points.

# How well did k-means do?
table(km.scaled_2$cluster, km.covid$levels)
##    
##     high case and deaths high case but low death low case
##   1                    4                       4        2
##   2                    2                       2       35
##    
##     low case but high death
##   1                       0
##   2                       2
table(km.scaled_4$cluster, km.covid$levels)
##    
##     high case and deaths high case but low death low case
##   1                    0                       0        1
##   2                    0                       0        5
##   3                    2                       2       31
##   4                    4                       4        0
##    
##     low case but high death
##   1                       0
##   2                       0
##   3                       2
##   4                       0
table(km.scaled_6$cluster, km.covid$levels)
##    
##     high case and deaths high case but low death low case
##   1                    0                       0        1
##   2                    4                       0        0
##   3                    0                       4        1
##   4                    0                       0        4
##   5                    1                       2       24
##   6                    1                       0        7
##    
##     low case but high death
##   1                       0
##   2                       0
##   3                       0
##   4                       0
##   5                       2
##   6                       0
#Use Euclidean distance to judge how compact the clusters are
km.scaled_2$tot.withinss # total within-cluster sum of squared distances
## [1] 721.3713
km.scaled_4$tot.withinss
## [1] 510.6829
km.scaled_6$tot.withinss
## [1] 374.5778

As the total within-cluster variation shows, k = 2 has the highest variation – almost twice that of k = 6, and not far from the sum of the k = 4 and k = 6 values. It is no surprise that the higher the k, the lower the within-cluster variation, and from that alone we might assume that a higher k leads to more accurate clustering. However, the tables show otherwise: looking at the “low case” column, low-case states are already spread across several clusters when k = 4, and across even more clusters when k = 6. Not only does this contradict our assumption, it also serves as an example of overfitting, and it shows again how hard it is to determine the optimal value of k.
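
To make the metric concrete, tot.withinss is the sum of squared Euclidean distances from each scaled observation to its assigned cluster center; a minimal sketch verifying this for k = 2 (output not shown):

#Recompute the k = 2 within-cluster sum of squares by hand
wss.by.hand <- sum(sapply(seq_len(nrow(km.scaled)), function(i) {
  sum((km.scaled[i, ] - km.scaled_2$centers[km.scaled_2$cluster[i], ])^2)
}))
wss.by.hand #should match km.scaled_2$tot.withinss above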

With this total within-variation analysis, along with the scatterplots colored by cluster, we can see that even what we assumed would be the “optimal” value of k (here, 6) is not necessarily optimal after all. Therefore, although k-means clustering is one of the most common clustering methods, it does not apply efficiently to this research question, which aims to analyze the interrelatedness of eviction-related socioeconomic factors and covid-related outcomes.

Overall Analysis

1. Supervised vs Unsupervised Learning

Supervised learning is beneficial for analyzing covid-related data in the short term, because it allows us to predict which regions would be expected to have high or low occurrences. This, in turn, would help healthcare or government officials target and address issues in those areas.

Meanwhile, in unsupervised learning, we focused more on how regions (e.g. counties and/or states) differ from one another, and on what variables play a role in that differentiation.

The practical value of combining supervised and unsupervised learning is that it allows us to observe the characteristics of a region that would potentially drive high or low levels of outbreaks and/or deaths caused by Covid-19. For example, if we observed a strong association between housing instability and covid cases (or deaths), then policy makers could approach the covid issue by addressing a potential root cause: housing instability and/or eviction.

2. Supervised Learning

As discussed previously in the supervised learning section, when comparing the different supervised learning techniques that we have learned this semester, random forests and LDA models seem to be the most effective in predicting outcomes accurately in the case of our data.

3. Unsupervised learning

Firstly, PCA allows us to decrease the dimensionality of the data, because a single component can stand in for multiple variables. On the other hand, high-dimensional data can be difficult to visualize, similar to k-means clustering, since we cannot analyze each variable individually in the projection. Despite this limitation in visual presentation, PCA can give insight into what drives the dendrograms that we create.

Hierarchical clustering is geared more towards examining levels of dissimilarity and allows us to observe which states are more similar to, or different from, one another. It partially answers our research question about the relevance of socioeconomic factors like eviction rate, poverty rate, and income to covid case and death rates, but only up to our assumptions, such as the effect of geographical proximity on clustering; we cannot confirm those relationships from hierarchical clustering alone. Moreover, we are limited to observing one of the four conditions at a time (e.g. covid case rate < US level).

With k-means clustering, not only is it hard to determine the “optimal” k, it is also time consuming to create a “perfect” graph with clear clusters, especially since the k-means visualization shows organized clusters in only two dimensions while there are multiple variables we hope to observe. Each k has its drawbacks, regardless of how high or low it is; overlaps exist even when k is as low as 2, so perhaps we should accept that, or look for a technique better suited than k-means clustering. One thing k-means clustering offers that hierarchical clustering does not, however, is a more holistic look at the dataset, i.e. all four combinations of covid case and death levels at once.

Overall, when it comes to answering our research questions about the relationships between eviction and demographic variables and covid-related data, some combination of PCA and hierarchical clustering seems to be the best route for exploring the data. The main reason is that PCA helps us understand what drives the dendrogram, which in turn presents the distances of dissimilarity amongst states and provides a better visualization than PCA alone does. Even though we only examined samples of states within this clustering method, our study can serve as a pilot for a larger study that looks more deeply into what determines the levels of dissimilarity amongst states, beyond geographical proximity and the socioeconomic factors behind eviction filings.