MATH-239 Final Paper

Introduction

In 2020, the world fell into a pandemic that has killed millions of people, and uprooted everyone’s everyday lives. It has been nearly 2 years since the pandemic began, and now there is enough time and data collected to examine January 2021’s COVID patterns. Here, students will produce a model in order to find a relationship between COVID deaths and controllable factors. The datasets used were COVID hospital data on a state scale and census data for state family median income and state population, which was used to create per capita data, so all states could be compared. The hospital COVID data had 117 variables, and that had to be narrowed significantly. After a comparison of many types of data and their correlation to the impacts of COVID (primarily COVID deaths), the most relevant and correlated factors were found to be a column that counted the number of hospitals in a state that reported being short staffed, and inpatients COVID positive. Using a linear regression model to find the impact of short staffed hospitals on COVID death rate, the model will show the importance of staffing hospitals in pandemic times. The model will show that there is a relationship between short staffed hospitals and COVID deaths, which means to minimize COVID deaths, it is worthwhile to invest in staffing these overburdened hospitals and distributing patients more evenly (though that is a massive task).

Body

Data

In order to analyze relationships between COVID deaths and other factors, hospital data–a data set that measures a range of variables–census population data, and median family income (from the census) data were combined. After wrangling the data into only January averages of variables that appeared usable for each of the 50 states and the District of Columbia, the 117 variables were narrowed down to 15 relevant variables (Figure 1) and put into a pairs plot (Figure 2).

Methods

After wrangling the data into only January averages of data that appeared usable for each of the 50 states and the District of Columbia, the 117 variables were narrowed down to 5 relevant variables and put into a pairs plot (Figure 2). The variables shown to have highest correlation were picked out so that they could be accurately modeled. The highest two correlations were 1) .903 for CDPC (COVID Deaths Per Capita) and Percent Inpatient Beds Used COVID which is the percent of the total state hospital beds occupied by COVID patients; and 2) .900 for CDPC and Percent Inpatients COVID, which is the percent of hospital inpatients with COVID in the state. These explanatory variables measure similar things, but their relationship with COVID numbers made them the best predictors of COVID deaths. Since neither of these variables would make sense to be measured categorically, nor are they directly fundable, variables for Model 1 were chosen to be CDPC and staffing shortage level (extremely, somewhat, and converted to staffing shortage per capita to make the data more comparable between states). Staffing shortage data made more sense to split into categorical levels as the top ⅓ of states were identified as “extremely” so funding and resource allocation could be directed where it’s most needed. Model 1 (model1) shows the relationship between COVID deaths per capita and Short Staffing per capita. The scatter plot in Figure 3 of the appendix illustrates this relationship, and the linear model (Figure 4) indicates a residual standard error of 2.956e-06 on 48 degrees of freedom with a P-value of 0.3964. The QQ plot (Figure 5) shows a fairly normal distribution with heavy tails, indicating that there are notable observations with values 3 standard deviations above or below the mean. This tells us that this may not be an entirely normal distribution, and that the range of values will have a wider spread than expected.

Analysis

From past sociological research, researchers hypothesized that there might be some valuable correlation between income level and inpatients with covid COVID. The box plot (Figure 9) shows states split into high, medium, and low categorical variables. The model (Figure 10) produces a P-value of 0.5086 indicating that there is very little significance between these variables. The residual standard error is 0.07136 on 46 degrees of freedom, and there is an adjusted R squared value of -0.01326. All of this means that this model as a whole is not very significant, and might not be the best relationship to pursue in order to better predict and understand COVID patterns. In an attempt to investigate where resources might be better allocated to reduce the impacts of COVID, the short staffing variable was mutated into a binary categorical variable (extremely short staffed states and somewhat short staffed states). The upper third of states was designated extremely short staffed, and the bottom ⅔ were designated as somewhat. Variables were factored, giving states in the “extremely” category a value of 1, and a value of 0 to those in “somewhat”. This logistic regression model (Figure 8) produced a p-value of 0.01059, indicating that there is some statistical significance in this model. There is a null deviance of 64.104 on 49 degrees of freedom and a residual deviance of 56.212 on 48 degrees of freedom.

Results

In this study, the short staffing relationship with COVID deaths per capita having an R squared -0.01079 of residual standard error of 2.956e-06 on 48 degrees of freedom with a P-value of 0.3964 means that if tracking short staffing is possible then prediction of COVID deaths is possible. This means that if staffing cannot be changed, then maybe refrigeration cars need to be, or people can prepare in other ways for the losses that are statistically probable. In the context of the massively populous earth and current international shipping delays, good global health models are needed to mitigate impacts on people living through these challenging times.

Conclusion

With the context of the past two years bringing wave after wave of COVID, the importance of modeling states that are failing and succeeding at lowering COVID deaths is somewhat self explanatory. In the United States, congress has raised military budget again–against the advice of military leaders–despite there being no threat to American military forces. This study illuminates relevant patterns in data related to COVID, showing that if congress were to reallocate their spending and put more funding towards staffing hospitals more thoroughly, there is a potential for greater efficiency and in how funds are allocated throughout the country. This study could be added to by combining variables to more accurately predict and model COVID. That being said, COVID is an extremely complex and dynamic issue that could not be accurately predicted with these simple models, further research might use more advanced statistical methods to see if this increases predictability.

Appendix

Full code for everything discussed throughout

Loading packages & data we need into the global environment

#Install and call packages/datasets

#install.packages("dplyr")
#install.packages("tidyr")
#install.packages("ggplot2")
#install.packages("tidyverse")
#install.packages("GGally")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.4     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(GGally)

## Warning: package 'GGally' was built under R version 4.1.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ISLR)

library(readxl)

Hospital_capacity_RAW <- read_excel("C:/Users/jessi/Downloads/Hospital_capacity_RAW.xls")

StateMedInc <- read_excel("C:/Users/jessi/Downloads/StateMedInc.xls")

StatePop <-
  read_excel("C:/Users/Jessi/Downloads/StatePop.xlsx")

Data Wrangling

Narrowing variables down

SelectDataHosp <-Hospital_capacity_RAW %>%
  select(state,
         date,
         critical_staffing_shortage_today_yes,
         critical_staffing_shortage_today_no,
         critical_staffing_shortage_today_not_reported,
         hospital_onset_covid,
         inpatient_beds,
         inpatient_beds_used,
         inpatient_beds_used_covid,
         staffed_adult_icu_bed_occupancy,
         staffed_icu_adult_patients_confirmed_and_suspected_covid,
         staffed_icu_adult_patients_confirmed_covid,
         total_adult_patients_hospitalized_confirmed_and_suspected_covid,
         total_adult_patients_hospitalized_confirmed_covid,
         total_pediatric_patients_hospitalized_confirmed_and_suspected_covid,
         total_pediatric_patients_hospitalized_confirmed_covid,
         total_staffed_adult_icu_beds,
         inpatient_beds_utilization,
         percent_of_inpatients_with_covid,
         deaths_covid)

Narrowing data to include only January 2021 averages

Data <- SelectDataHosp %>%
  filter(year(date)==2021, 
         month(date)==1)

#Getting January 2021 average number of hospitals with staffing shortages by state each day 
shortStaffMeans <- Data %>%
  group_by(state)%>%
  summarize(mean(critical_staffing_shortage_today_yes, na.rm = TRUE))

#Renaming the columns
colnames(shortStaffMeans)<- c("state", "shortStaff")
colnames(StateMedInc) <- c("StateName", "medFamInc", "state")

#Getting January 2021 average daily COVID deaths by state
covidDeathMean <- SelectDataHosp%>% 
  group_by(state)%>%
  filter(year(date)==2021, 
         month(date)==1)%>%
  summarize(mean(deaths_covid, na.rm=TRUE)) 

#Renaming the columns 
colnames(covidDeathMean) <- c("state", "avgCovidDeaths")

#Percent beds used by COVID patients
percentBedsCovid <- Data%>%
  group_by(state)%>%
  filter(year(date)==2021, 
         month(date)==1)%>%
  summarize(mean(inpatient_beds_used_covid/inpatient_beds_used, na.rm=TRUE)) 
percentBedsCovid<- percentBedsCovid[-c(40, 48),]
colnames(percentBedsCovid) <- c("state", "percentUsedBedsCovid")

#Avg. confirmed COVID cases in hospitals in each state
totalConfirmedCovid <- Data%>%
  group_by(state)%>%
  filter(year(date)==2021, 
         month(date)==1)%>%
  summarize(mean(total_adult_patients_hospitalized_confirmed_covid + total_pediatric_patients_hospitalized_confirmed_covid))
totalConfirmedCovid<- totalConfirmedCovid[-c(40, 48),]
colnames(totalConfirmedCovid) <- c("state", "totalConfirmedCovidPatients")

#Percent of inpatients with COVID
percentCovid <- Data%>%
  group_by(state)%>%
  filter(year(date)==2021, 
         month(date)==1)%>%
  summarize(mean(percent_of_inpatients_with_covid))
percentCovid<- percentCovid[-c(40, 48),]
colnames(percentCovid) <- c("state", "PercentInpatientsCovid")

Mutating median family income data to split values into 3 categorical levels (high, med, low):

#Creating median family income levels 
StateMedInc%>%
  arrange(medFamInc)

## # A tibble: 51 x 3
##    StateName      medFamInc state
##    <chr>              <dbl> <chr>
##  1 Mississippi        45081 MS   
##  2 West Virginia      46711 WV   
##  3 Arkansas           47597 AR   
##  4 Louisiana          49469 LA   
##  5 New Mexico         49754 NM   
##  6 Alabama            50536 AL   
##  7 Kentucky           50589 KY   
##  8 Oklahoma           52919 OK   
##  9 South Carolina     53199 SC   
## 10 Tennessee          53320 TN   
## # ... with 41 more rows

low<-StateMedInc%>%
  filter(medFamInc <=56602)%>%
  mutate(incLev="low")

med<-StateMedInc%>%
  filter(medFamInc > 56602 & medFamInc <=65886)%>%
  mutate(incLev="med")

high<-StateMedInc%>%
  filter(medFamInc >= 67167)%>%
  mutate(incLev="high")

stateLev<-rbind(low, med, high)
View(stateLev)

Joining data:

#Creating final usable data set
covidDeathMean<- covidDeathMean[-c(40,48),]
shortStaffMeans<- shortStaffMeans[-c(40, 48),]

joinedData <- left_join(covidDeathMean, stateLev)

## Joining, by = "state"

joinedData <- left_join(joinedData, shortStaffMeans)

## Joining, by = "state"

joinedData <- left_join(joinedData, percentBedsCovid)

## Joining, by = "state"

joinedData <- left_join(joinedData, totalConfirmedCovid)

## Joining, by = "state"

joinedData <- left_join(joinedData, percentCovid)

## Joining, by = "state"

joinedData <- left_join(joinedData, StatePop)

## Joining, by = "StateName"

#Renaming columns 
colnames(joinedData)<-c("state", "avgCovidDeaths", "stateName", "medFamInc", "incLev", "shortStaff", "percentUsedBedsCovid", "totalConfirmedCovidPatients", "percentInpatientsCovid", "rank", "census2020", "percentTotalUSPop")

Mutating the necessary variables to reflect value per capita so state-to-state comparison is possible:

covid<-joinedData%>%
  mutate(avgCDPC=avgCovidDeaths/census2020) 

covid<-joinedData%>%
  mutate(avgSSPC=shortStaff/census2020)

avgSSPC <- covid%>%
  select("shortStaff", "census2020")%>%
  mutate(avgSSPC=shortStaff/census2020)


avgCDPC <- covid%>%
  select("avgCovidDeaths", "census2020")%>%
  mutate(avgCDPC=avgCovidDeaths/census2020)

Joining into final data set, “covid”:

covid <- left_join(covid, avgSSPC)

## Joining, by = c("shortStaff", "census2020", "avgSSPC")

covid <- left_join(covid, avgCDPC)

## Joining, by = c("avgCovidDeaths", "census2020")

Mutating staffing shortage data into two levels (extremely, somewhat) and factoring the data:

#Dichotomize the Data
delta <- covid%>%
  select("state", "avgSSPC")

delta%>% 
  arrange(avgSSPC)

## # A tibble: 51 x 2
##    state     avgSSPC
##    <chr>       <dbl>
##  1 KY    0.000000179
##  2 UT    0.000000221
##  3 FL    0.000000400
##  4 NY    0.000000442
##  5 OR    0.000000472
##  6 NC    0.000000516
##  7 MD    0.000000522
##  8 NJ    0.00000102 
##  9 NV    0.00000104 
## 10 CO    0.00000106 
## # ... with 41 more rows

extremely<-covid%>%
  filter(avgSSPC > 4.39264e-06)%>%
  mutate(staffLev=1)

somewhat<-covid%>%
  filter(avgSSPC <=4.39264e-06)%>%
  mutate(staffLev=0)

shortageLev<-rbind(somewhat, extremely)

covid<- left_join(covid, shortageLev)

## Joining, by = c("state", "avgCovidDeaths", "stateName", "medFamInc", "incLev", "shortStaff", "percentUsedBedsCovid", "totalConfirmedCovidPatients", "percentInpatientsCovid", "rank", "census2020", "percentTotalUSPop", "avgSSPC", "avgCDPC")

Working with data

Selected variables:

names(covid)

##  [1] "state"                       "avgCovidDeaths"             
##  [3] "stateName"                   "medFamInc"                  
##  [5] "incLev"                      "shortStaff"                 
##  [7] "percentUsedBedsCovid"        "totalConfirmedCovidPatients"
##  [9] "percentInpatientsCovid"      "rank"                       
## [11] "census2020"                  "percentTotalUSPop"          
## [13] "avgSSPC"                     "avgCDPC"                    
## [15] "staffLev"

Figure 1

Pairs plot:

#Data-set we'll use for pairs plot 
COVID <- covid%>%
  select("medFamInc", "percentUsedBedsCovid", "percentInpatientsCovid", "avgSSPC", "avgCDPC")

#create a pairs plot to find correlations
ggpairs(COVID)+ theme_bw()

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing non-finite values (stat_density).

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing non-finite values (stat_density).

Figure 2

Exploratory data analysis:

ggplot(covid, aes(x = avgSSPC, y = avgCDPC, na.rm=TRUE))+
  geom_point(color="red")+
  scale_x_log10()+
  scale_y_log10()

## Warning: Removed 1 rows containing missing values (geom_point).

Figure 3

Creating linear model (short staff/cap & covid deaths/cap:

model1 <- lm(avgCDPC~avgSSPC, data = covid)
summary(model1)

## 
## Call:
## lm(formula = avgCDPC ~ avgSSPC, data = covid)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.070e-06 -2.677e-06  5.900e-08  2.084e-06  6.943e-06 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.649e-06  5.879e-07   7.908 3.03e-10 ***
## avgSSPC     9.120e-02  1.066e-01   0.856    0.396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.956e-06 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01503,    Adjusted R-squared:  -0.005492 
## F-statistic: 0.7324 on 1 and 48 DF,  p-value: 0.3964

Figure 4

qq plot:

qqnorm(model1$residuals)
qqline(model1$residuals)

Figure 5

Plotting and adding a simple linear regression line:

qplot(x = avgCDPC, y = staffLev, data = shortageLev, 
      geom = "point", alpha = I(.5), ylab = "Staff Shortage Level") +
  stat_smooth(method = "lm",
              se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Figure 6

qplot(x = avgCDPC, y = staffLev, data = shortageLev, 
      geom = "point", alpha = I(.5), ylab = "Staff Shortage Level") +
  stat_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Figure 7

Creating a model for this relationship:

model2 <- glm(staffLev ~ avgCDPC, data = shortageLev, family = "binomial")
summary(model2)

## 
## Call:
## glm(formula = staffLev ~ avgCDPC, family = "binomial", data = shortageLev)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4741  -0.8683  -0.5582   1.0430   1.9882  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -2.330e+00  7.584e-01  -3.072  0.00213 **
## avgCDPC      3.113e+05  1.218e+05   2.556  0.01059 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 64.104  on 49  degrees of freedom
## Residual deviance: 56.212  on 48  degrees of freedom
## AIC: 60.212
## 
## Number of Fisher Scoring iterations: 3

Figure 8

ggplot(data=covid, aes(y=percentInpatientsCovid, fill=incLev))+
  geom_boxplot()

Figure 9

modA<-lm(percentInpatientsCovid~incLev, data=covid)
summary(modA)

## 
## Call:
## lm(formula = percentInpatientsCovid ~ incLev, data = covid)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.111767 -0.043854 -0.006698  0.035112  0.210524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17022    0.01712   9.943 3.05e-13 ***
## incLevlow    0.02574    0.02421   1.063    0.293    
## incLevmed   -0.00119    0.02421  -0.049    0.961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07058 on 48 degrees of freedom
## Multiple R-squared:  0.03187,    Adjusted R-squared:  -0.008472 
## F-statistic:  0.79 on 2 and 48 DF,  p-value: 0.4597

anova(modA)

## Analysis of Variance Table
## 
## Response: percentInpatientsCovid
##           Df   Sum Sq   Mean Sq F value Pr(>F)
## incLev     2 0.007872 0.0039359    0.79 0.4597
## Residuals 48 0.239147 0.0049822

Figure 10