This analysis explores COVID-19 death trends in the United States, with Florida as the focal point. The exploration of Florida begins by analyzing the trend in positive cases, visualizing positive and negative cases, and inspecting the total number of deaths. Florida is then compared to California and Texas, the two states whose total number of positive COVID-19 cases is closest to Florida's. The two main factors compared between the states are the average percent increase in deaths per day and the average deaths per day. The data used in this analysis comes from https://covidtracking.com/ and covers the time frame of 2020-03-04 to 2020-08-07.
The questions asked during the analysis were:
How does Florida compare to other states in the U.S. in terms of positive COVID-19 cases, total number of deaths, death rate, etc.?
Which states currently (2020-08-07) have the most positive cases?
Since positive cases and deaths are positively correlated, which state has the highest death rate? Can we hypothesize that the state with the most total deaths has the highest death rate?
Based on linear models, can we predict which states will have the most deaths in the long run?
The data analysis will be broken down by the following sections:
Florida Data
Data for All States
Narrow Focus - Florida, California, and Texas
Florida vs California
Florida vs Texas
Florida vs California vs Texas
Below is the data on Florida and we will look at the following:
Positive Cases/Trends
Total Positive Cases vs Total Negative Cases
Death Trends
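The plots in this section use two data frames, FL_data and FL_data_by_month, whose construction is not shown. The following is a minimal sketch of how they might be derived, not the original code. It assumes that `data` has the columns date, state, positive, negative, and death, and that the monthly totals are the largest cumulative value seen in each month. The toy values are synthetic and only illustrate the shape of the data. The sketch is written in base R so that it runs without extra packages; the chunks in this report additionally rely on the tidyverse (ggplot2, dplyr, tibble) and on scales for comma.

```r
# Synthetic stand-in for the full data set (illustration only, not real counts)
data <- data.frame(
  date     = as.Date(c("2020-03-30", "2020-03-31", "2020-04-01", "2020-04-02", "2020-04-01")),
  state    = c("FL", "FL", "FL", "FL", "CA"),
  positive = c(100, 120, 150, 190, 900),
  negative = c(1000, 1100, 1300, 1500, 9000),
  death    = c(1, 2, 2, 4, 20)
)

# Keep only the Florida rows
FL_data <- subset(data, state == "FL")

# Month number taken from the date (3 = March, 4 = April, ...)
FL_data$month <- as.integer(format(FL_data$date, "%m"))

# Assumption: counts are cumulative, so the monthly total is the maximum in that month
FL_data_by_month <- aggregate(cbind(positive, negative) ~ month, data = FL_data, FUN = max)
names(FL_data_by_month) <- c("month", "total_positive_cases", "total_negative_cases")

print(FL_data_by_month)
```

If the monthly totals were instead meant to be sums of daily new cases, the aggregation function would need to change accordingly.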
Positive Cases/Trends in Florida:
ggplot(data = FL_data, mapping = aes(x = date, y = positive)) +
geom_point(position = "jitter", alpha = 1/5) +
scale_y_continuous(labels = comma) + #overrides scientific notation for y values
geom_smooth() + #adds a smoothed trend line (not a straight line of best fit) through the points
labs(x = "Date",
y = "Number of Positive Cases",
title = "Number of Positive Cases by Date in Florida",
subtitle = "March 2020 - August 2020",
caption = "Source: https://covidtracking.com/") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) #used to center the title and subtitle
Line Chart to Show Total Positive vs Total Negative Cases in Florida:
colors <- c("positive_cases" = "red", "negative_cases" = "green") #added to make custom/manual legend for line chart
ggplot(data = FL_data_by_month) +
geom_line(mapping = aes(x = month, y = total_positive_cases, color = "positive_cases"), size = 1) +
geom_line(mapping = aes(x = month, y = total_negative_cases, color = "negative_cases"), size = 1) +
scale_y_continuous(labels = comma) +
labs(x = "Month",
y = "Number of Cases",
title = "Positive vs Negative Cases by Month in Florida",
subtitle = "March 2020 - August 2020",
caption = "Source: https://covidtracking.com/",
color = "Legend") +
scale_color_manual(values = colors) +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "right")
Death Trends in Florida:
ggplot(data = FL_data, mapping = aes(x = date, y = death)) +
geom_point(position = "jitter", alpha = 1/5) +
scale_y_continuous(labels = comma) + #overrides scientific notation for y values
geom_smooth() + #adds a smoothed trend line (not a straight line of best fit) through the points
labs(x = "Date",
y = "Number of Deaths",
title = "Number of Deaths by Date in Florida",
subtitle = "March 2020 - August 2020",
caption = "Source: https://covidtracking.com/") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The data below shows total cases (positive and negative) and total deaths for all states.
(The tables are restricted to the top 10 states for readability.)
total_cases_all <- data %>%
filter(date == "2020-08-07") %>%
select(date : negative) %>%
mutate(total_testings = positive + negative) %>%
arrange(desc(positive))
knitr::kable(total_cases_all[1:10, ], caption = "Total Number of Positive and Negative Cases for All States (Top 10)")
| date | state | positive | negative | total_testings |
|---|---|---|---|---|
| 2020-08-07 | CA | 538416 | 8058466 | 8596882 |
| 2020-08-07 | FL | 518075 | 3378864 | 3896939 |
| 2020-08-07 | TX | 474524 | 3526008 | 4000532 |
| 2020-08-07 | NY | 419642 | 5949333 | 6368975 |
| 2020-08-07 | GA | 209004 | 1544125 | 1753129 |
| 2020-08-07 | IL | 191808 | 2794110 | 2985918 |
| 2020-08-07 | AZ | 185053 | 819792 | 1004845 |
| 2020-08-07 | NJ | 184061 | 2068731 | 2252792 |
| 2020-08-07 | NC | 132812 | 1807000 | 1939812 |
| 2020-08-07 | LA | 128746 | 1376256 | 1505002 |
# Look at total number of deaths per state (from inception to 2020-08-07)
death_cases <- data %>%
filter(date == "2020-08-07") %>%
arrange(desc(death)) %>%
select(date : state, death)
knitr::kable(death_cases[1:10, ], caption = "Total Number of Deaths for All States (Top 10)")
| date | state | death |
|---|---|---|
| 2020-08-07 | NY | 25190 |
| 2020-08-07 | NJ | 15860 |
| 2020-08-07 | CA | 10011 |
| 2020-08-07 | MA | 8709 |
| 2020-08-07 | TX | 8096 |
| 2020-08-07 | FL | 8051 |
| 2020-08-07 | IL | 7822 |
| 2020-08-07 | PA | 7297 |
| 2020-08-07 | MI | 6524 |
| 2020-08-07 | CT | 4441 |
We will narrow our focus to Florida, California, and Texas to show total cases (positive and negative) and total deaths for each respective state:
states_of_interest <- c("FL", "CA", "TX")
total_cases <- data %>%
filter(date == "2020-08-07", state %in% states_of_interest) %>%
select(date : negative) %>%
mutate(total_cases = positive + negative) %>%
arrange(desc(positive))
knitr::kable(total_cases, caption = "Total Positive Cases in California, Florida, and Texas")
| date | state | positive | negative | total_cases |
|---|---|---|---|---|
| 2020-08-07 | CA | 538416 | 8058466 | 8596882 |
| 2020-08-07 | FL | 518075 | 3378864 | 3896939 |
| 2020-08-07 | TX | 474524 | 3526008 | 4000532 |
As shown in the table above, California has more positive cases than Florida, and Florida has more positive cases than Texas.
California, Florida, and Texas are the three states with the most positive cases in the U.S.
Because California and Texas have the total positive case counts closest to Florida's, we will compare Florida to these two states and test whether the average percent increase in deaths per day and the average deaths per day differ statistically.
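The choice of comparison states can be checked with a short calculation. The sketch below is an illustration rather than part of the original analysis: it takes the positive counts of the five leading states from the top-10 table and ranks the other states by how far their count is from Florida's. The vector names `positive`, `gap`, and `closest` are introduced here for the example.

```r
# Positive case counts as of 2020-08-07, copied from the top-10 table
positive <- c(CA = 538416, FL = 518075, TX = 474524, NY = 419642, GA = 209004)

# Absolute gap between each state's count and Florida's
gap <- abs(positive - positive["FL"])

# The two states (other than Florida) with the smallest gap
closest <- names(sort(gap[names(gap) != "FL"]))[1:2]
print(closest)  # "CA" "TX"
```

California is about 20,000 cases above Florida and Texas about 44,000 below, while New York is already roughly 98,000 away.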
As of 2020-08-07, California accounts for the most deaths (10,011), followed by Texas (8,096) and then Florida (8,051). The plot below shows deaths over time in the three states:
FL_TX_CA_data <- data %>%
filter(state == "FL" | state == "TX" | state == "CA") %>%
arrange(state, date) %>%
select(date : state, death)
ggplot(data = FL_TX_CA_data, mapping = aes(x = date, y = death, color = state)) +
geom_point() +
geom_smooth() +
labs(x = "Date",
y = "Deaths",
title = "Deaths Over Time in California, Florida, and Texas",
subtitle = "March 2020 - August 2020",
caption = "Source: https://covidtracking.com/") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
We will start by comparing the mean cumulative death counts of Florida and California. To do so, we will conduct an independent two-sample t-test and generate a confidence interval to see whether the mean death counts of Florida and California, two of the states with the most positive cases in the U.S., differ.
This t-test will tell us whether the difference in mean deaths between Florida and California is statistically significant.
Null Hypothesis (Ho): The mean death counts for Florida and California are the same (their difference is equal to zero)
Alternative Hypothesis (Ha): The mean death counts for Florida and California are different (their difference is not equal to zero)
To test the hypothesis, we need a data set that contains only Florida and California:
FL_CA_data <- data %>%
filter(state == "FL" | state == "CA") %>%
arrange(state, date) %>%
select(date : state, death)
We will then visualize the spread of deaths in the two states with a box plot. California shows more variability than Florida over the same time frame, as seen in the width of its interquartile range:
ggplot(data = FL_CA_data) +
geom_boxplot(aes(x = state, y = death)) +
coord_flip() +
labs(x = "State",
y = "Deaths",
title = "Deaths in Florida vs California",
subtitle = "March 2020 - August 2020",
caption = "Source: https://covidtracking.com/") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Now we can run the t-test, keeping in mind that the hypothesized difference under the null is zero, the alternative is two-sided, the confidence level is 95%, the variances are not assumed equal (as suggested by the box plot above), and the two populations are independent.
t.test(FL_CA_data$death ~ FL_CA_data$state, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE) #full argument names; the test is unpaired by default
##
## Welch Two Sample t-test
##
## data: FL_CA_data$death by FL_CA_data$state
## t = 4.569, df = 266.95, p-value = 7.496e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 767.5099 1929.8791
## sample estimates:
## mean in group CA mean in group FL
## 3877.174 2528.480
The output tells us that the p-value (7.496e-06) is close to zero, so we reject the null hypothesis that there is no difference in mean deaths. The 95% confidence interval for the difference in means runs from 767.51 to 1929.88; because zero is not within this interval, the result agrees with rejecting the null hypothesis at the 5% significance level.
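The two decision rules used above (p-value below 0.05, and zero outside the 95% confidence interval) always agree for a two-sided t-test. The sketch below illustrates this on synthetic data; the data frame `toy` and its values are made up for the example and are not the COVID-19 data. It also shows how the p-value and interval can be read from the test object instead of the printed output.

```r
set.seed(1)  # make the synthetic sample reproducible

# Synthetic two-group sample with unequal spread (illustration only)
toy <- data.frame(
  state = rep(c("CA", "FL"), each = 50),
  death = c(rnorm(50, mean = 3900, sd = 3000),
            rnorm(50, mean = 2500, sd = 2400))
)

# Welch two-sample t-test: unequal variances, two-sided, 95% confidence level
res <- t.test(death ~ state, data = toy, var.equal = FALSE, conf.level = 0.95)

# Pull the numbers out of the result object
p_value <- res$p.value
ci      <- res$conf.int

# Rule 1: reject when the p-value is below alpha = 0.05
reject_null <- p_value < 0.05
# Rule 2: reject when zero lies outside the 95% confidence interval
ci_excludes_zero <- ci[1] > 0 || ci[2] < 0

print(c(reject_null = reject_null, ci_excludes_zero = ci_excludes_zero))
```

Passing the data frame through the data argument, as done here, keeps the formula short; the call in this report uses the `$` form instead, which gives the same test.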
Now that we have confirmed that the mean difference is statistically significant, we can compare the death rates of Florida and California.
We will tidy our data to only show death statistics in California:
death_CA <- FL_CA_data %>%
filter(state == "CA") %>%
mutate(diff_day = date - lag(date),
diff_death = death - lag(death),
rate_percent = diff_death/lag(death) * 100)
Now we can calculate the average rate of growth for deaths and average deaths per day in California:
average_death_rate_CA <- mean(death_CA$rate_percent, na.rm = TRUE)
The average increase in deaths is 5.81% per day in California.
average_deaths_per_day_CA <- mean(death_CA$diff_death, na.rm = TRUE)
The average deaths per day in California is 67.61.
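To make the two averages concrete, here is a small worked example on a made-up cumulative death series (the five values are hypothetical). It reproduces the diff_death and rate_percent columns in base R, where `c(NA, head(death, -1))` plays the role of `lag(death)`.

```r
# Hypothetical cumulative death counts on five consecutive days
death <- c(10, 12, 15, 15, 18)

# Previous day's count (base R equivalent of dplyr's lag)
prev <- c(NA, head(death, -1))

# New deaths per day: NA, 2, 3, 0, 3
diff_death <- death - prev

# Percent increase over the previous day: NA, 20, 25, 0, 20
rate_percent <- diff_death / prev * 100

avg_deaths_per_day <- mean(diff_death, na.rm = TRUE)    # 2
avg_percent_inc    <- mean(rate_percent, na.rm = TRUE)  # 16.25

# Guard: a previous count of 0 gives Inf, which na.rm = TRUE does not drop,
# so keeping only finite values is a safer way to average
avg_percent_inc_safe <- mean(rate_percent[is.finite(rate_percent)])

print(c(avg_deaths_per_day, avg_percent_inc, avg_percent_inc_safe))
```

The first day has no previous value, so both columns start with NA and that row is dropped from the averages.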
Now we will repeat the same process for Florida:
death_FL <- FL_CA_data %>% #FL_TX_data is not created until the Florida vs Texas section
filter(state == "FL") %>%
mutate(diff_day = date - lag(date),
diff_death = death - lag(death),
rate_percent = diff_death/lag(death) * 100)
Calculate the average rate of growth for deaths and average deaths per day in Florida:
average_death_rate_FL <- mean(death_FL$rate_percent, na.rm = TRUE)
The average increase in deaths is 6.13% per day in Florida.
average_deaths_per_day_FL <- mean(death_FL$diff_death, na.rm = TRUE)
The average deaths per day in Florida is 54.02.
We will use the above averages in an ANOVA (Analysis of Variance) test for statistical significance at the end of the analysis.
Again, we will start by comparing the mean cumulative death counts of Florida and Texas. We will conduct another independent two-sample t-test and generate a confidence interval to see whether the mean death counts of Florida and Texas differ.
This t-test will tell us whether the difference in mean deaths between Florida and Texas is statistically significant.
Null Hypothesis (Ho): The mean death counts for Florida and Texas are the same (their difference is equal to zero)
Alternative Hypothesis (Ha): The mean death counts for Florida and Texas are different (their difference is not equal to zero)
To test the hypothesis, we need a data set that contains only Florida and Texas:
FL_TX_data <- data %>%
filter(state == "FL" | state == "TX") %>%
arrange(state, date) %>%
select(date : state, death)
We will then visualize the spread of deaths in the two states with a box plot. As we can see, Florida has more variability than Texas over the same time frame:
ggplot(data = FL_TX_data) +
geom_boxplot(aes(x = state, y = death)) +
coord_flip() +
labs(x = "State",
y = "Deaths",
title = "Deaths in Florida vs Texas",
subtitle = "March 2020 - August 2020",
caption = "Source: https://covidtracking.com/") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Now we can run the t-test, keeping in mind that the hypothesized difference under the null is zero, the alternative is two-sided, the confidence level is 95%, the variances are not assumed equal (as suggested by the box plot above), and the two populations are independent.
t.test(FL_TX_data$death ~ FL_TX_data$state, mu = 0, alternative = "two.sided", conf.level = 0.95, var.equal = FALSE) #full argument names; the test is unpaired by default
##
## Welch Two Sample t-test
##
## data: FL_TX_data$death by FL_TX_data$state
## t = 2.3893, df = 291.19, p-value = 0.01752
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 98.05174 1014.56104
## sample estimates:
## mean in group FL mean in group TX
## 2528.480 1972.174
The output tells us that the p-value (0.01752) is less than alpha = 5%, so we reject our null hypothesis. The 95% confidence interval for the difference in means runs from 98.05 to 1014.56; because zero is not in this interval, the result agrees with rejecting the null hypothesis at the 5% significance level.
The box plot comparing Texas and Florida shows outliers between roughly 6,000 and 8,000 deaths, which points to a spike in deaths. Based on that, could we hypothesize that the rate of death is greater in Texas than in Florida?
We will tidy our data to only show death statistics in Texas:
death_TX <- FL_TX_data %>%
filter(state == "TX") %>%
mutate(diff_day = date - lag(date),
diff_death = death - lag(death),
rate_percent = diff_death/lag(death) * 100)
Calculate the average rate of growth for deaths and average deaths per day in Texas:
average_death_rate_TX <- mean(death_TX$rate_percent, na.rm = TRUE)
The average increase in deaths is 7.14% per day in Texas.
average_deaths_per_day_TX <- mean(death_TX$diff_death, na.rm = TRUE)
The average deaths per day in Texas is 56.61.
Note: from the previous comparison between FL and CA, Florida’s average death rate and average deaths per day are as shown below:
average_death_rate_FL <- mean(death_FL$rate_percent, na.rm = TRUE)
The average increase in deaths is 6.13% per day in Florida.
average_deaths_per_day_FL <- mean(death_FL$diff_death, na.rm = TRUE)
The average deaths per day in Florida is 54.02.
We will utilize the above averages in an ANOVA (Analysis of Variance) test to compare statistical significance, between all three states, in the next section!
We will compare the data for Florida, California, and Texas with regard to the average percent increase in deaths per day and the average deaths per day.
We will organize all relevant data in a table and visualize deaths per day on a scatter plot with a smoothed trend line.
Lastly, we will compare all three states on the above two factors (average percent increase in deaths per day and average deaths per day) and test for statistical significance with an ANOVA (Analysis of Variance) test!
Below is a table that shows the relevant statistics for all three states:
(positive cases, average percent increase in deaths per day, and average deaths per day)
three_state_statistics <- tribble(
~state, ~positive_cases, ~avg_percent_inc_death, ~avg_deaths_per_day,
"CA", total_cases_all$positive[1], average_death_rate_CA, average_deaths_per_day_CA,
"FL", total_cases_all$positive[2], average_death_rate_FL, average_deaths_per_day_FL,
"TX", total_cases_all$positive[3], average_death_rate_TX, average_deaths_per_day_TX
)
knitr::kable(three_state_statistics, caption = "Relevant State Statistics")
| state | positive_cases | avg_percent_inc_death | avg_deaths_per_day |
|---|---|---|---|
| CA | 538416 | 5.811746 | 67.61486 |
| FL | 518075 | 6.130288 | 54.02013 |
| TX | 474524 | 7.136418 | 56.60839 |
It is interesting to see that despite having the most cases, California has the lowest average percent increase in deaths out of the three states.
It is also interesting to see that Texas, with the fewest positive cases of the three, has the highest average percent increase in deaths.
We will plot a smoothed, non-linear trend line for the relationship between Y = deaths per day, X1 = date, and X2 = state (TX, FL, and CA). This will show how daily deaths changed over 5 months in Florida, California, and Texas.
death_FL_TX_CA <- rbind(death_FL, death_TX, death_CA) # Combining the three data frames together to visualize death rates
ggplot(data = death_FL_TX_CA, mapping = aes(x = date, y = diff_death, color = state)) +
geom_point() +
geom_smooth(se = FALSE) + #adds a smoothed trend line of deaths per day for each state
labs(x = "Date",
y = "Deaths per Day",
title = "Rapid Growth of Texas Death Rate",
subtitle = "March 2020 - August 2020",
caption = "Source: https://covidtracking.com/") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
Based on the average percent increase of deaths per day, Texas has the highest at 7.14% per day, followed by Florida at 6.13% per day, and California with the lowest at 5.81% per day.
Is this difference statistically significant?
If the averages are not equal, we can run further tests to see which one(s) are different.
Our Null Hypothesis (Ho) is: The average percent increase of deaths per day between the three states is not significantly different.
Our Alternative Hypothesis (Ha) is: At least one of the states will have a difference in average percent increase of deaths per day. (the ANOVA test will not tell us which one differs)
ANOVA1 <- aov(death_FL_TX_CA$rate_percent ~ death_FL_TX_CA$state)
summary(ANOVA1) # To run a more informative summary
## Df Sum Sq Mean Sq F value Pr(>F)
## death_FL_TX_CA$state 2 139 69.25 0.548 0.579
## Residuals 437 55242 126.41
## 31 observations deleted due to missingness
Our p-value is 0.579 (57.9%), well above 0.05. Therefore, we fail to reject our null hypothesis. This means that the average percent increase of deaths per day between the three states is not significantly different.
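plot(TukeyHSD(ANOVA1)) # Plot the Tukey pairwise confidence intervals for the differences in mean percent increase
A pairwise interval that contains zero indicates that the two states' means do not differ significantly. This visual shows that the means of the percent increase in deaths per day do not statistically differ from each other.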
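The same ANOVA and Tukey results can also be read as numbers rather than from the printed table and plot. The sketch below is an illustration on synthetic data: the data frame `toy`, its group means, and its spread are assumptions chosen only to resemble the setting, not the real series.

```r
set.seed(42)  # make the synthetic sample reproducible

# Synthetic daily percent increases for three states (illustration only)
toy <- data.frame(
  state = factor(rep(c("CA", "FL", "TX"), each = 40)),
  rate_percent = c(rnorm(40, mean = 5.8, sd = 11),
                   rnorm(40, mean = 6.1, sd = 11),
                   rnorm(40, mean = 7.1, sd = 11))
)

# One-way ANOVA: does the mean percent increase differ between states?
fit <- aov(rate_percent ~ state, data = toy)

# p-value of the F test, taken from the ANOVA table
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Tukey HSD: one row per pair of states with diff, lwr, upr, and adjusted p-value
tukey <- TukeyHSD(fit)
pairs <- tukey$state

# A pair differs significantly only if its interval excludes zero
pairs_differ <- pairs[, "lwr"] > 0 | pairs[, "upr"] < 0

print(p_value)
print(pairs)
print(pairs_differ)
```

With three states there are three pairwise comparisons (FL-CA, TX-CA, TX-FL), and this is the follow-up step that tells us which pair differs when the ANOVA itself rejects the null hypothesis.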
Our Null Hypothesis (Ho) is: The average deaths per day between the 3 states are not significantly different.
Our Alternative Hypothesis (Ha) is: At least one of the states will have a difference in average deaths per day. (the ANOVA test will not tell us which one differs)
ANOVA2 <- aov(death_FL_TX_CA$diff_death ~ death_FL_TX_CA$state)
summary(ANOVA2) # To run a more informative summary
## Df Sum Sq Mean Sq F value Pr(>F)
## death_FL_TX_CA$state 2 15414 7707 1.932 0.146
## Residuals 437 1743544 3990
## 31 observations deleted due to missingness
Our p-value is 0.146 (14.6%), above 0.05. Therefore, we fail to reject our null hypothesis. This means that the average deaths per day between the 3 states are not significantly different.
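plot(TukeyHSD(ANOVA2)) # Plot the Tukey pairwise confidence intervals for the differences in mean deaths per day
A pairwise interval that contains zero indicates that the two states' means do not differ significantly. This visual shows that the average deaths per day do not statistically differ from each other.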
The death rate and total deaths within a state are essential to understanding the conditions of, and the response to, the current global pandemic, COVID-19. The rising death counts shown in Florida, California, and Texas continue to contribute to the grim reality we live in today. In this analysis, Texas shows the highest average percent increase in deaths per day (7.14%), exceeding states with similar or substantially more positive cases.
However, as shown in the statistical analysis, the differences in mean death rates among the three independent populations are not significant. Further analysis is therefore needed to understand the growth in deaths that all three populations share.
Opportunities for further analysis could be focusing on the above states and researching the following:
The effectiveness of the state lockdowns
The consequences of premature state reopenings
The protocol adherence of the state residents
The process of recording COVID related deaths
The recovery rate of positive COVID-19 cases