COVID 19 - The spread amongst the states

Research question

The question I am looking to address is as follows: Is there a difference between the spread of COVID 19 in different states in the United States?

This is clearly a topical research question given current events, and if there are observable differences between pairs of states, we can begin to ask followup questions regarding the differences and similarities between certain states that may impact the spread of COVID 19.

library(readr)
library(stringr)
library(dplyr)

file_url <- 'https://raw.githubusercontent.com/sbellows1/607/master/Hospitalization_all_locs.csv'

df <- read_csv(url(file_url))

library(rvest)
url <- 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population'

table <- url %>% read_html() %>% html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>% html_table(fill = T) %>% as.data.frame()


url2 <- 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density'

xpath1 <- '//*[@id="mw-content-text"]/div/table[1]'

dens_table <- url2 %>% read_html() %>% html_nodes(xpath = xpath1) %>% html_table(fill = T) %>% as.data.frame()

Data

Each data point is a forecast for the number of hospital beds required for COVID patients for a given day in a given state. There is some uncertainty in this forecast as it is modeled based on COVID deaths, so an upper bound and lower bound for the confidence interval is given. The variable of interested is the mean hospital beds required on a given day in a given state, which is a continuous numeric variable. This is an observational study, as obviously we are not assigning people to be COVID patients or not.

There is no reason to believe that the data would not be representative of the state, so the results should be generalizable to comparing entire state populations. The results of course cannot determine causation.

I am using supplementary data that gives the population and population density of each state, citing Census Bureau estimates.

I will be analyzing the differences between the states of California, New York, Texas, Michigan, Alaska, Florida, and Kansas. This choice is somewhat arbitrary but is intended to capture a wide variety of states in terms of political leaning and size/population density to hopefully identify states that have differences.

states <- c('California', 'New York', 'Texas', 'Michigan', 'Washington', 'Florida', 'Kansas')

df_states <- df %>% filter(location_name %in% states)

pop_table <- table[2:nrow(table),]
pop_table <- pop_table %>% filter(State %in% states) %>% select(c(State, Census.population))

I will be using the average forecasted number of hospital beds in the state as a proxy for the spread of COVID. Some issues to address : not all states are on the same time scale - Alaska likely started seeing cases later than the other states and other oddities. Not all states are on the same population scale : 50 cases in Wyoming is very different than 50 cases in California. Lastly, not all states have similar population densities, which may effect the spread of the virus. New York and California became hotbeds for COVID likely partially due to the high population density.

beds <- df_states %>% select(c(location_name, date, allbed_mean))

The data goes up until 4/29, everything past that point is projected and will not be used. I will count the first day having more than 10 mean covid beds required as day 1 for each state so as to avoid temporal issues with the spread of the disease. The choice of the number 10 is arbitrary and simply indicates some preponderance of cases.

Data Cleaning

The first step is to filter the data and the wikipedia tables to only include the states of interested. I also filter the data to only include points where at least 10 COVID hospital beds were required (which we will call day 1). Lastly, I only include dates before May 4th, as data after that point is hypothetical and more rooted in projection than reality. I then select only the state and hospital bed columns.

I now relabel the date column to a numeric count so that all states are on an even playing field (Day 1 is the first day with more than 10 required hospital beds for all states).

library(tidyr)
library(ggplot2)

beds <- beds %>% filter(allbed_mean > 10) %>% filter(as.Date(date) <= as.Date('2020-05-04'))
reps <- c()
for (i in 1:length(states)){
  df <- beds %>% filter (location_name == states[i])
  df$date <- 1:nrow(df)
  reps <- append(reps, nrow(df))
  if (i == 1){
    full_df <- df
  } else {
    full_df <- rbind(full_df, df)
  }
}

full_df %>% ggplot(aes(x = date, y = allbed_mean, col = location_name)) + geom_point() + geom_line() + xlab('Day') + ylab('COVID Hospital Beds Required')

EDA

We can see from viewing the plot that on an absolute level there is a clear difference between the states, and quite different trajectories as well. Let us plot population adjusted and population density adjusted curves as well.

pop_table$Census.population <- pop_table$Census.population %>% str_replace_all(",", '')

pops <- c()
for (i in 1:length(states)){
  partial_df <- pop_table %>% filter(State == states[i])
  pops <- append(pops, rep(as.integer(partial_df$Census.population), reps[i]))
}
  


pop_adjusted_df <- full_df
pop_adjusted_df$allbed_mean <- pop_adjusted_df$allbed_mean/pops

pop_adjusted_df %>% ggplot(aes(x = date, y = allbed_mean * mean(as.integer(pop_table$Census.population)), col = location_name)) + geom_point() + geom_line() + ylab('Covid Beds (population adjusted)') + xlab('Day')

When adjusting for the population size of the state, it is clear that there is more similarity than the initial graph showed. However, there is still different shape present, as some states have a parabolic shape whereas some states continue to increase.

dens_table <- dens_table %>% filter(State.etc. %in% states) %>% select(c(State.etc., Population.density.2))

dens <- c()
for (i in 1:length(states)){
  partial_df <- dens_table %>% filter(State.etc. == states[i])
  dens <- append(dens, rep(as.integer(partial_df$Population.density.2), reps[i]))
}

dens_adjusted_df <- full_df
dens_adjusted_df$allbed_mean <- dens_adjusted_df$allbed_mean/dens

dens_adjusted_df %>% ggplot(aes(x = date, y = allbed_mean * mean(as.integer(dens_table$Population.density.2)), col = location_name)) + geom_point() + geom_line() + ylab('Covid Beds (population density adjusted)') + xlab('Day')

Data Transformation

In order to perform the chi-square goodness of fit the data must be transformed to a wide format, with the “day of pandemic” columns going along the x axis and the states along the y axis.

chi_normal_df <- full_df %>% pivot_wider(names_from = date, values_from = allbed_mean)
chi_normal_df <- chi_normal_df %>% as.data.frame()

Inference

There are no real requirements to perform a chi-square goodness of fit test. Our null hypothesis is that there are no differences between the states - the difference in the number of hospital beds required is due to random chance. The alternative hypothesis is that there is a difference between the states.

The chi-square goodness of fit is appropriate, as we want to measure if the distribution of growth is equal across days. Therefore, we would like to see a similar proportion of the total cases on day 1, 2, 3, etc., which is what the goodness of fit test performs.

normal_chi <- chisq.test(chi_normal_df[,2:44])
normal_chi

## 
##  Pearson's Chi-squared test
## 
## data:  chi_normal_df[, 2:44]
## X-squared = 20635, df = 252, p-value < 2.2e-16

With a p-value coming in at close to 0, it is clear that there are differences between the states, it is not purely random chance. We will perform a follow up comparison between all pairs of states to see which states specifically are different. Because we will be performing 21 comparisons, we will update our alpha from .05 to .05/21 so as not to inflate our type I error.

chi_sqs <- list()

for (i in 1:nrow(chi_normal_df)){
  for (j in 1:nrow(chi_normal_df)){
    if (i < j){
      partial_df <- chi_normal_df[c(i,j), 2:44]
      chisq_test <- chisq.test(partial_df)
      chi_sqs <- append(chi_sqs, chisq_test$p.value)
    }
  }
}
chi_sqs_tf <- chi_sqs <= .05/21
chi_sqs_tf

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE

Using the absolute counts for hospital beds, we see that all 21 pairs of states are statistically different, even when using the Bonferroni correction.

tworow <- full_df %>% filter(location_name %in% c('California', 'New York')) %>% filter(date %in% c(1:50))

cal_prop <- (tworow %>% filter(location_name == 'California') %>% .$allbed_mean %>% sum())/sum(tworow$allbed_mean)
exp <- c()
for (i in 1:nrow(tworow)){
  new_df <- tworow %>% filter(date == tworow$date[i])
  if (tworow$location_name[i] == 'California'){
    exp <- append(exp, sum(new_df$allbed_mean) * cal_prop)
  } else {
    exp <- append(exp, sum(new_df$allbed_mean) * (1 - cal_prop))
  }
  
}

tworow$expected <- exp

tworow %>% ggplot() + geom_line(aes(x = date, y = allbed_mean, col = location_name)) + geom_point(aes(x = date, y = allbed_mean, col = location_name)) + geom_line(aes(x = date, y = expected, col = location_name)) + xlab('Day') + ylab('COVID Hospital Beds Required') + ggtitle('Counts vs Expected Counts')

Let us see if adjusting the counts for state population makes a difference. This makes sense, as more populous states should have more cases, so we will adjust for total population size. I then multiply all counts by the average population so as not to shrink the size of the counts.

pop_adjusted_df$allbed_mean <- pop_adjusted_df$allbed_mean * mean(as.numeric(pop_table$Census.population))

pop_chi_df <- pop_adjusted_df %>% pivot_wider(names_from = date, values_from = allbed_mean)
pop_chi_df <- pop_chi_df %>% as.data.frame()

pop_chi <- chisq.test(pop_chi_df[,2:39])
pop_chi

## 
##  Pearson's Chi-squared test
## 
## data:  pop_chi_df[, 2:39]
## X-squared = 23702, df = 222, p-value < 2.2e-16

When adjusting for population, we still see an extremely low p-value, meaning at least one pair of states are significantly different.

pop_chi_sqs <- list()

for (i in 1:nrow(pop_chi_df)){
  for (j in 1:nrow(pop_chi_df)){
    if (i < j){
      partial_df <- pop_chi_df[c(i,j), 2:44]
      chisq_test <- chisq.test(partial_df)
      pop_chi_sqs <- append(pop_chi_sqs, chisq_test$p.value)
    }
  }
}

pop_chi_sqs_tf <- pop_chi_sqs <= .05/21
pop_chi_sqs_tf

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE

Adjusting for population, we still see significant difference in all pairs of states, even when employing the Bonferroni correction. However, we can see in the graphs above that the distributions are at least more similar, even if they are still statistically significant.

tworow <- pop_adjusted_df %>% filter(location_name %in% c('Florida', 'Michigan')) %>% filter(date %in% c(1:50))

cal_prop <- (tworow %>% filter(location_name == 'Florida') %>% .$allbed_mean %>% sum())/sum(tworow$allbed_mean)
exp <- c()
for (i in 1:nrow(tworow)){
  new_df <- tworow %>% filter(date == tworow$date[i])
  if (tworow$location_name[i] == 'Florida'){
    exp <- append(exp, sum(new_df$allbed_mean) * cal_prop)
  } else {
    exp <- append(exp, sum(new_df$allbed_mean) * (1 - cal_prop))
  }
  
}

tworow$expected <- exp

tworow %>% ggplot() + geom_line(aes(x = date, y = allbed_mean, col = location_name)) + geom_point(aes(x = date, y = allbed_mean, col = location_name)) + geom_line(aes(x = date, y = expected, col = location_name)) + xlab('Day') + ylab('COVID Beds (Population Adjsuted)') + ggtitle('Counts vs Expected Counts')

The last thing I was interested in is if adjusting for population density will affect the chi square significance. If normalizing by the population density creates non significant results, it may indicate that population density plays a crucial role in COVID 19. (This is not a causal claim, simply a lead for further analysis). Similarly to population, I multiply all counts by the average density so as not to shrink the counts.

dens_adjusted_df$allbed_mean <- dens_adjusted_df$allbed_mean * mean(as.integer(dens_table$Population.density.2))

dens_chi_df <- dens_adjusted_df %>% pivot_wider(names_from = date, values_from = allbed_mean)
dens_chi_df <- dens_chi_df %>% as.data.frame()

dens_chi <- chisq.test(dens_chi_df[,2:44])
dens_chi

## 
##  Pearson's Chi-squared test
## 
## data:  dens_chi_df[, 2:44]
## X-squared = 26166, df = 252, p-value < 2.2e-16

dens_chi_sqs <- list()

for (i in 1:nrow(dens_chi_df)){
  for (j in 1:nrow(dens_chi_df)){
    if (i < j){
      partial_df <- dens_chi_df[c(i,j), 2:44]
      chisq_test <- chisq.test(partial_df)
      dens_chi_sqs <- append(dens_chi_sqs, chisq_test$p.value)
    }
  }
}

dens_chi_sqs_tf <- dens_chi_sqs <= .05/21
dens_chi_sqs_tf

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE

Adjusting for density, all pairs are still statistically significant.

tworow <- dens_adjusted_df %>% filter(location_name %in% c('Washington', 'Texas')) %>% filter(date %in% c(1:45))

cal_prop <- (tworow %>% filter(location_name == 'Washington') %>% .$allbed_mean %>% sum())/sum(tworow$allbed_mean)
exp <- c()
for (i in 1:nrow(tworow)){
  new_df <- tworow %>% filter(date == tworow$date[i])
  if (tworow$location_name[i] == 'Washington'){
    exp <- append(exp, sum(new_df$allbed_mean) * cal_prop)
  } else {
    exp <- append(exp, sum(new_df$allbed_mean) * (1 - cal_prop))
  }
  
}

tworow$expected <- exp

tworow %>% ggplot() + geom_line(aes(x = date, y = allbed_mean, col = location_name)) + geom_point(aes(x = date, y = allbed_mean, col = location_name)) + geom_line(aes(x = date, y = expected, col = location_name)) + xlab('Day') + ylab('COVID Beds (Density Adjusted)') + ggtitle('Counts vs Expected Counts')

Conclusion

There is a statistically significant difference in the spread of COVID cases between all 21 pairs of states examined, meaning that all 7 states has significantly different trajectories of COVID cases. These differences cannot be attributed to differences in population or population density as the statistical significance persists even when the data is normalized for population and population density. The difference also is not attributed to different time frames, as the timeframe was normalized to only begin one the mean hospital beds required reached 10. This leads to the natural question: What accounts for the difference in the spread of COVID cases? This is beyond the scope of this data, but understanding the answer to this question could lead to a better understanding of COVID and more effective policy decisions regarding the disease.

Final Project