Introduction

This dataset, taken from the list of studies removed by the Trump administration, records provisional COVID-19 death counts and rates. Most notably, this virus caused the 2020 lockdown, which continues to affect economic, healthcare, and political decisions to this day. Even though the national lockdown is over and vaccination is available, the disease continues to spread.

My question, then, is whether racial groups differ in their Covid rates now that vaccination is widespread. Racial disparities continue to fester in the US, and public health strategy might need to pivot if certain racial groups are suffering disproportionately from Covid.

The Code

To start, I loaded all the packages I needed and read in my data:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tibble)
library(tidyr)
library(FSA)
## Warning: package 'FSA' was built under R version 4.5.3
## ## FSA v0.10.1. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.
covidDeaths <- read.csv("Provisional_COVID-19_death_counts_and_rates_by_month,_jurisdiction_of_residence,_and_demographic_characteristics_20260415.csv")

For this data, I decided to look specifically at the adjusted annual COVID rate (aa_COVID_rate_ann). The CDC has already put in the work to adjust the rate in every region surveyed, which should save some time cleaning the data later. I also looked at the raw COVID death counts, but those aren’t helpful without total population data, which is not available in this dataset. The rate takes population into account for us.
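To illustrate why a rate is easier to compare than a raw count, here is a minimal sketch with made-up numbers. The groups, death counts, and populations are hypothetical, and the per-100,000 scaling is an assumption about how such rates are commonly reported, not something taken from this dataset.

```r
# Hypothetical counts and populations, for illustration only
toy_counts <- data.frame(
  group      = c("Group A", "Group B"),
  deaths     = c(500, 100),
  population = c(10000000, 1000000)
)

# Crude rate per 100,000 people (the scaling is an assumption)
toy_counts$rate_per_100k <- toy_counts$deaths / toy_counts$population * 100000

toy_counts
```

In this toy example Group A has five times as many deaths as Group B but half the rate, which is why the count alone would be misleading.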

Cleaning the Data

This dataset includes all deaths from Covid from its arrival in the US through 2026, broken down by sex, age, race, and other demographic information. To get only the race data from 2025, I had to filter on the “group” and “year” columns.

However, this filtered data had a glaring issue: any unknown or unavailable data was simply left blank, which R could not handle. So I had to replace the blanks with an “NA” value, which R can work around.

Even after this, it would take much more fiddling to get R to recognize these numbers as numbers. To lock in the data and explore it in Excel while troubleshooting, I wrote the filtered data out to a CSV.

covidRace <- filter(covidDeaths, group=="Race")
covidRace2025 <- filter(covidRace, year == "2,025")

covidRace2025$aa_COVID_rate_ann[covidRace2025$aa_COVID_rate_ann == ""] <- NA

write.csv(covidRace2025, file = "C:/Users/whelanr1/OneDrive - La Salle University/Biostat/ABDLabs/ABDLabs/Final Project/covidRace2025.csv")

Now, after reading the CSV file back into a different R document, I found the problem. The Covid rate values were being read as characters instead of numbers, which turned out to be a relatively easy fix after an embarrassingly long and arduous troubleshooting session.

covidRace2025 <- read.csv("C:/Users/whelanr1/OneDrive - La Salle University/Biostat/ABDLabs/ABDLabs/Final Project/covidRace2025.csv")

covidRace2025$aa_COVID_rate_ann <- as.numeric(covidRace2025$aa_COVID_rate_ann)
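As a rough sketch of the same clean-up in one step, the helper below converts a character vector to numeric, treating blanks as NA. The function name `to_numeric` is hypothetical, and the comma-stripping step is an assumption: the year column shows values like "2,025", so other columns might carry thousands separators too, but that has not been checked here.

```r
# Hypothetical helper: convert character values to numeric,
# treating blanks as NA and removing thousands separators such as "2,025".
to_numeric <- function(x) {
  x <- trimws(as.character(x))
  x[x == ""] <- NA                          # blank fields become NA
  as.numeric(gsub(",", "", x, fixed = TRUE)) # strip commas, then convert
}

# Made-up values for illustration
raw_rates <- c("4.2", "", "1,250.5", "0")
clean_rates <- to_numeric(raw_rates)
clean_rates
```

Applied to the real data, this would look like `covidRace2025$aa_COVID_rate_ann <- to_numeric(covidRace2025$aa_COVID_rate_ann)`.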

We can now explore the data!

Exploring the Data

ggplot(covidRace2025, aes(x=aa_COVID_rate_ann)) +
  geom_histogram() +
  facet_wrap(~ subgroup1)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 438 rows containing non-finite outside the scale range
## (`stat_bin()`).

The values are too spread out to see the shape of each distribution, let alone compare the groups to one another, so I decided to apply a logarithmic transformation to see if that would normalize each group.

covidRace2025$log_aa_COVID_rate_ann <- log10(covidRace2025$aa_COVID_rate_ann)

ggplot(covidRace2025, aes(x=log_aa_COVID_rate_ann)) +
  geom_histogram() +
  facet_wrap(~ subgroup1)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 677 rows containing non-finite outside the scale range
## (`stat_bin()`).
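One plausible reason the second histogram drops more rows than the first (677 versus 438) is that any rate of exactly 0 becomes -Inf under log10(), while NA stays NA. The sketch below uses made-up values to show the effect; it is not a check of the actual dataset.

```r
# Made-up rates: a zero, two positive values, and a missing value
toy_rates <- c(0, 0.5, 10, NA)

toy_log <- log10(toy_rates)
toy_log              # the zero becomes -Inf, the NA stays NA

# is.finite() is FALSE for both -Inf and NA, so both would be dropped by ggplot
finite_flags <- is.finite(toy_log)
finite_flags

# Count how many values are lost only because of the log transform
sum(is.infinite(toy_log))
```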

This appears to have worked, most notably for the Non-Hispanic White population, which went from a sweeping right-skewed distribution to a roughly normal one. All the groups that show up in the histograms look normal, but that needs to be tested.

To start, we will split the data into the 7 different racial groups surveyed.

hispanic <- 
  filter(covidRace2025, subgroup1 == "Hispanic")
Non_Hispanic_American_Indian_or_Alaska_Native <- 
  filter(covidRace2025, subgroup1 == "Non-Hispanic American Indian or Alaska Native")
Non_Hispanic_Asian <- 
  filter(covidRace2025, subgroup1 == "Non-Hispanic Asian")
Non_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander <- 
  filter(covidRace2025, subgroup1 == "Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander")
Non_Hispanic_Black <- 
  filter(covidRace2025, subgroup1 == "Non-Hispanic Black")
Non_Hispanic_White <- 
  filter(covidRace2025, subgroup1 == "Non-Hispanic White")
Non_Hispanic_Native_Hawaiian_or_Other_Pacific_Islander <- 
  filter(covidRace2025, subgroup1 == "Non-Hispanic Native Hawaiian or Other Pacific Islander")

#Filtering______________________________________________________________________
# Drop rows where the log rate is infinite (log10 of 0 is -Inf); NA rows are kept

filteredhispanic <- hispanic %>% 
  filter(!is.infinite(log_aa_COVID_rate_ann))

filteredNon_Hispanic_American_Indian_or_Alaska_Native <- Non_Hispanic_American_Indian_or_Alaska_Native %>% 
  filter(!is.infinite(log_aa_COVID_rate_ann))

filteredNon_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander <- Non_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander %>% 
  filter(!is.infinite(log_aa_COVID_rate_ann))

filteredNon_Hispanic_Black <- Non_Hispanic_Black %>% 
  filter(!is.infinite(log_aa_COVID_rate_ann))

filteredNon_Hispanic_White <- Non_Hispanic_White %>% 
  filter(!is.infinite(log_aa_COVID_rate_ann))

filteredNon_Hispanic_Native_Hawaiian_or_Other_Pacific_Islander <- Non_Hispanic_Native_Hawaiian_or_Other_Pacific_Islander %>% 
  filter(!is.infinite(log_aa_COVID_rate_ann))

filteredNon_Hispanic_Asian <- Non_Hispanic_Asian %>% 
  filter(!is.infinite(log_aa_COVID_rate_ann))
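The split-then-filter steps above can also be sketched more compactly with base R's split(), which returns one data frame per group in a named list. This is an alternative layout, not what the analysis used, and the toy data frame and its `log_rate` column are made up for illustration.

```r
# Toy data standing in for covidRace2025 (values are invented)
toy_groups <- data.frame(
  subgroup1 = c("Hispanic", "Hispanic", "Non-Hispanic Asian", "Non-Hispanic Asian"),
  log_rate  = c(0.6, -Inf, 0.5, NA)
)

# Drop only infinite values; NA is kept, matching the !is.infinite() filter
toy_filtered <- toy_groups[!is.infinite(toy_groups$log_rate), ]

# One list element per racial group, named by the group label
by_group <- split(toy_filtered, toy_filtered$subgroup1)

names(by_group)
group_sizes <- sapply(by_group, nrow)
group_sizes
```

Each group is then available as, for example, `by_group[["Hispanic"]]`, and any per-group step can be run over the whole list with lapply() or sapply().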

Then we can run a QQ plot for all groups.

qqnorm(filteredhispanic$log_aa_COVID_rate_ann)
qqline(filteredhispanic$log_aa_COVID_rate_ann)

#I'm going to remove this group from further testing, because there is only one data point in this set, which is insufficient for a qq plot or further statistical testing. 
#qqnorm(filteredNon_Hispanic_American_Indian_or_Alaska_Native$log_aa_COVID_rate_ann)
#qqline(filteredNon_Hispanic_American_Indian_or_Alaska_Native$log_aa_COVID_rate_ann)

qqnorm(filteredNon_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander$log_aa_COVID_rate_ann)
qqline(filteredNon_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander$log_aa_COVID_rate_ann)

qqnorm(filteredNon_Hispanic_Black$log_aa_COVID_rate_ann)
qqline(filteredNon_Hispanic_Black$log_aa_COVID_rate_ann)

qqnorm(filteredNon_Hispanic_White$log_aa_COVID_rate_ann)
qqline(filteredNon_Hispanic_White$log_aa_COVID_rate_ann)

#I am removing the Non-Hispanic Native Hawaiian or Other Pacific Islander group because there was insufficient data to form a qqplot
#Plus this group may have overlap with the Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander group
#So this data will hopefully be covered
#qqnorm(filteredNon_Hispanic_Native_Hawaiian_or_Other_Pacific_Islander$log_aa_COVID_rate_ann)
#qqline(filteredNon_Hispanic_Native_Hawaiian_or_Other_Pacific_Islander$log_aa_COVID_rate_ann)

qqnorm(filteredNon_Hispanic_Asian$log_aa_COVID_rate_ann)
qqline(filteredNon_Hispanic_Asian$log_aa_COVID_rate_ann)

At this point, we can start to see that some groups do not have enough data to draw any strong conclusions from, and therefore had to be removed from the comparison. This means all further interpretation assumes 5 comparison groups, no longer including the Non-Hispanic Native Hawaiian or Other Pacific Islander group and the Non-Hispanic American Indian or Alaska Native group.

All the remaining plots, with the exception of White and Black datasets, are too flimsy to call normal, so the ideal statistical test should be non-parametric.
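A QQ plot is a visual judgment, so one possible supplement is a Shapiro-Wilk test per group. This is a swapped-in technique that the analysis above did not run, and the toy data below are simulated purely for illustration.

```r
set.seed(1)

# Simulated log rates: Group A roughly normal, Group B right-skewed
toy_normality <- data.frame(
  subgroup1 = rep(c("Group A", "Group B"), each = 30),
  log_rate  = c(rnorm(30, mean = 0.6, sd = 0.2),
                rexp(30, rate = 2))
)

# shapiro.test() is in base R's stats package; a small p-value suggests non-normality
shapiro_p <- sapply(
  split(toy_normality$log_rate, toy_normality$subgroup1),
  function(x) shapiro.test(x)$p.value
)
shapiro_p
```

With small groups the test has little power, so it is best read alongside the QQ plots rather than in place of them.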

The histograms also show that the variance differs greatly between groups (the White population has a far wider variance than most other groups, likely due to more data being available). The ideal statistical test should therefore also tolerate unequal variances.
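The spread comparison above was made by eye from the histograms. As a hedged sketch of how it could be quantified, the code below computes per-group standard deviations and runs a Fligner-Killeen test, a rank-based test of equal variances from base R. This was not part of the original analysis, and the data are simulated.

```r
set.seed(2)

# Simulated log rates: Group C is given a deliberately wider spread
toy_spread <- data.frame(
  subgroup1 = factor(rep(c("Group A", "Group B", "Group C"), each = 20)),
  log_rate  = c(rnorm(20, sd = 0.1), rnorm(20, sd = 0.1), rnorm(20, sd = 0.6))
)

# Standard deviation within each group
group_sd <- tapply(toy_spread$log_rate, toy_spread$subgroup1, sd)
group_sd

# Fligner-Killeen test of homogeneity of variances (stats package)
fligner_result <- fligner.test(log_rate ~ subgroup1, data = toy_spread)
fligner_result$p.value
```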

Statistical Test

The best fit test for this data is a Kruskal-Wallis rank-sum test, which functions as a non-parametric ANOVA.

A Kruskal-Wallis test assumes independent observations, a qualitative explanatory variable, and a quantitative response variable.

Independence is satisfied because each individual is counted once and sorted into only one racial group.

The qualitative explanatory variable is one of 7 racial groups and the quantitative response variable is the adjusted annual Covid rate.

It should be noted that Kruskal-Wallis only needs similarly shaped distributions, including similar spread, when the result is interpreted as a difference in medians. I am instead interpreting it as a test of whether rates in one group tend to be higher than in another, so homoscedasticity isn’t required.

kruskal.test(log_aa_COVID_rate_ann ~ subgroup1,
             data = covidRace2025)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  log_aa_COVID_rate_ann by subgroup1
## Kruskal-Wallis chi-squared = 295.95, df = 6, p-value < 2.2e-16

I am assigning a standard alpha value of 0.05. The p-value here is far below that alpha, meaning the distribution of rates differs significantly between at least one group and another.
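To make the test statistic less of a black box, here is a small worked sketch that computes the Kruskal-Wallis H by hand from ranks and compares it with kruskal.test(). The toy data are made up and contain no ties, so no tie correction is needed; the real dataset almost certainly has ties, which kruskal.test() handles internally.

```r
# Made-up rates for three groups of three observations, with no tied values
toy_kw <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  rate  = c(1.2, 3.4, 2.2, 5.1, 4.8, 6.0, 7.5, 8.1, 9.3)
)

N         <- nrow(toy_kw)
ranks     <- rank(toy_kw$rate)                   # rank all values together
rank_sums <- tapply(ranks, toy_kw$group, sum)    # sum of ranks per group
n_i       <- tapply(ranks, toy_kw$group, length) # group sizes

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_i) - 3 * (N + 1)
H   # 7.2 for this toy data

# The built-in test should report the same statistic
kw <- kruskal.test(rate ~ group, data = toy_kw)
unname(kw$statistic)
```

Because the statistic depends only on ranks, applying it to the log-transformed rates or the raw rates gives the same result, as long as the same rows are included.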

Further Investigation

The main limitation of a Kruskal-Wallis rank-sum test is that it can’t show which groups are different, only that at least one racial group differs from another. This isn’t useful in the case of public health, as we can’t see who needs help. To glean a more useful interpretation of the data, we can follow up with a post-hoc statistical test.

I will use the Dunn test, as it is the most frequently used after a Kruskal-Wallis.

dunnTest(aa_COVID_rate_ann ~ subgroup1,
  data = covidRace2025,
  method = "holm"
)
## Warning: subgroup1 was coerced to a factor.
## Warning: Some rows deleted from 'x' and 'g' because missing data.
##   Kruskal-Wallis rank sum test
## 
## data: x and g
## Kruskal-Wallis chi-squared = 295.947, df = 6, p-value = 0
## 
## 
##                      Dunn's Pairwise Comparison of x by g                     
##                                     (Holm)                                    
## Col Mean-│
## Row Mean │   Hispanic   Non-Hisp   Non-Hisp   Non-Hisp   Non-Hisp   Non-Hisp
## ─────────┼──────────────────────────────────────────────────────────────────
## Non-Hisp │   7.344009
##          │     0.0000*
##          │
## Non-Hisp │   3.481793  -3.369413
##          │     0.0045*    0.0045*
##          │
## Non-Hisp │   3.452645  -3.448584  -0.049382
##          │     0.0044*    0.0039*    0.9606 
##          │
## Non-Hisp │  -1.394598  -9.178380  -4.957359  -4.937635
##          │     0.6525     0.0000*    0.0000*    0.0000*
##          │
## Non-Hisp │   8.194862   0.260437   3.860012   3.951819   10.27098
##          │     0.0000*    1.0000     0.0011*    0.0009*    0.0000*
##          │
## Non-Hisp │  -2.145733  -11.78508  -6.256913  -6.251691  -0.536037  -13.84768
##          │     0.1595     0.0000*    0.0000*    0.0000*    1.0000     0.0000*
## 
## FWER = 0.05
## Reject Ho if p ≤ FWER with stopping rule, where p = Pr(|Z| ≥ |z|)
## Dunn (1964) Kruskal-Wallis multiple comparison
## 
##   p-values adjusted with the Holm method.
##                                                                                                                Comparison
## 1                                                                Hispanic - Non-Hispanic American Indian or Alaska Native
## 2                                                                                           Hispanic - Non-Hispanic Asian
## 3                                                      Non-Hispanic American Indian or Alaska Native - Non-Hispanic Asian
## 4                                                Hispanic - Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander
## 5           Non-Hispanic American Indian or Alaska Native - Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander
## 6                                      Non-Hispanic Asian - Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander
## 7                                                                                           Hispanic - Non-Hispanic Black
## 8                                                      Non-Hispanic American Indian or Alaska Native - Non-Hispanic Black
## 9                                                                                 Non-Hispanic Asian - Non-Hispanic Black
## 10                                     Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander - Non-Hispanic Black
## 11                                                      Hispanic - Non-Hispanic Native Hawaiian or Other Pacific Islander
## 12                 Non-Hispanic American Indian or Alaska Native - Non-Hispanic Native Hawaiian or Other Pacific Islander
## 13                                            Non-Hispanic Asian - Non-Hispanic Native Hawaiian or Other Pacific Islander
## 14 Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander - Non-Hispanic Native Hawaiian or Other Pacific Islander
## 15                                            Non-Hispanic Black - Non-Hispanic Native Hawaiian or Other Pacific Islander
## 16                                                                                          Hispanic - Non-Hispanic White
## 17                                                     Non-Hispanic American Indian or Alaska Native - Non-Hispanic White
## 18                                                                                Non-Hispanic Asian - Non-Hispanic White
## 19                                     Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander - Non-Hispanic White
## 20                                                                                Non-Hispanic Black - Non-Hispanic White
## 21                                            Non-Hispanic Native Hawaiian or Other Pacific Islander - Non-Hispanic White
##               Z      P.unadj        P.adj
## 1    7.34400936 2.072886e-13 3.316618e-12
## 2    3.48179306 4.980684e-04 4.482616e-03
## 3   -3.36941339 7.532836e-04 4.519702e-03
## 4    3.45264519 5.551186e-04 4.440949e-03
## 5   -3.44858497 5.635321e-04 3.944725e-03
## 6   -0.04938255 9.606144e-01 9.606144e-01
## 7   -1.39459854 1.631369e-01 6.525477e-01
## 8   -9.17838086 4.376218e-20 7.877192e-19
## 9   -4.95735908 7.145784e-07 9.289519e-06
## 10  -4.93763574 7.907537e-07 9.489044e-06
## 11   8.19486225 2.508803e-16 4.264965e-15
## 12   0.26043748 7.945263e-01 1.000000e+00
## 13   3.86001209 1.133814e-04 1.133814e-03
## 14   3.95181994 7.755909e-05 8.531500e-04
## 15  10.27098788 9.522450e-25 1.809266e-23
## 16  -2.14573333 3.189426e-02 1.594713e-01
## 17 -11.78508999 4.659223e-32 9.318445e-31
## 18  -6.25691350 3.926711e-10 5.890066e-09
## 19  -6.25169163 4.060305e-10 5.684427e-09
## 20  -0.53603793 5.919323e-01 1.000000e+00
## 21 -13.84768992 1.313693e-43 2.758756e-42

This chart of p-values is large, but the main number to look for is the adjusted p-value (P.adj). After setting aside the two previously omitted groups, all remaining groups are significantly different from each other at alpha = 0.05 except for Non-Hispanic Asian - Non-Hispanic Asian, Native Hawaiian or Other Pacific Islander (p = 0.961), Hispanic - Non-Hispanic Black (p = 0.653), Hispanic - Non-Hispanic White (p = 0.159), and Non-Hispanic Black - Non-Hispanic White (p = 1.000).
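The Holm adjustment used for these p-values can be sketched by hand on three made-up values: sort the p-values, multiply the smallest by the number of tests, the next by one fewer, and so on, never letting an adjusted value fall below the one before it. The sketch compares this with base R's p.adjust(); the adjusted values printed by dunnTest() may not match p.adjust() exactly, since implementations can differ in how they handle that last step.

```r
# Three made-up unadjusted p-values
p_raw <- c(0.01, 0.04, 0.03)
m     <- length(p_raw)
ord   <- order(p_raw)                       # positions from smallest to largest

# Multiply by m, m - 1, ..., 1, keep the sequence non-decreasing, cap at 1
holm_sorted <- pmin(1, cummax((m - seq_len(m) + 1) * p_raw[ord]))

# Put the adjusted values back in the original order
holm_manual <- numeric(m)
holm_manual[ord] <- holm_sorted
holm_manual

# Base R equivalent
holm_builtin <- p.adjust(p_raw, method = "holm")
holm_builtin
```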

This is interesting, but the goal of the exercise was to see which groups had a higher mortality rate from Covid in 2025, and we still cannot see that without a clear comparison of the numbers.

Comparison

Now that we know which groups are statistically different from one another, we can go back to the divided data frames, calculate the mean of each group, and compare.

meanhispanic <- mean(filteredhispanic$aa_COVID_rate_ann, na.rm = TRUE)
#meanNon_Hispanic_American_Indian_or_Alaska_Native <- mean(Non_Hispanic_American_Indian_or_Alaska_Native$aa_COVID_rate_ann, na.rm = TRUE)
meanNon_Hispanic_Asian <- mean(filteredNon_Hispanic_Asian$aa_COVID_rate_ann, na.rm = TRUE)
meanNon_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander <- mean(filteredNon_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander$aa_COVID_rate_ann, na.rm = TRUE)
meanNon_Hispanic_Black <- mean(filteredNon_Hispanic_Black$aa_COVID_rate_ann, na.rm = TRUE)
meanNon_Hispanic_White <- mean(filteredNon_Hispanic_White$aa_COVID_rate_ann, na.rm = TRUE)
#meanNon_Hispanic_Native_Hawaiian_or_Other_Pacific_Islander <- mean(Non_Hispanic_Native_Hawaiian_or_Other_Pacific_Islander$aa_COVID_rate_ann, na.rm = TRUE)

#Confidence Interval____________________________________________________________

t.test(filteredhispanic$aa_COVID_rate_ann)$conf.int
## [1] 4.000833 5.372140
## attr(,"conf.level")
## [1] 0.95
t.test(filteredNon_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander$aa_COVID_rate_ann)$conf.int
## [1] 3.208728 4.433377
## attr(,"conf.level")
## [1] 0.95
t.test(filteredNon_Hispanic_Black$aa_COVID_rate_ann)$conf.int
## [1] 5.709834 7.513422
## attr(,"conf.level")
## [1] 0.95
t.test(filteredNon_Hispanic_White$aa_COVID_rate_ann)$conf.int
## [1] 4.770594 6.091421
## attr(,"conf.level")
## [1] 0.95
t.test(filteredNon_Hispanic_Asian$aa_COVID_rate_ann)$conf.int
## [1] 3.243084 4.568027
## attr(,"conf.level")
## [1] 0.95

To get a useful comparison of the means, I’ll throw together a simple data frame of the calculated means.

means_comparison <- data.frame(
  Race = c("Hispanic", "Non_Hispanic_Asian", "Non_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander", "Non_Hispanic_Black", "Non_Hispanic_White"), 
  Mean = c(meanhispanic, meanNon_Hispanic_Asian, meanNon_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander, meanNon_Hispanic_Black, meanNon_Hispanic_White),
  ConfIntLower = c(4.000833, 3.243084, 3.208728, 5.709834, 4.770594),
  ConfIntHigher = c(5.372140, 4.568027, 4.433377, 7.513422, 6.091421)
)
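The confidence limits above were copied in by hand from the t.test() output. As a hedged alternative, the sketch below builds the same kind of table directly from the data, so the numbers cannot drift out of sync. The helper name `summarise_group` and the toy data are made up for illustration; the column names mirror the data frame above.

```r
# Made-up rates for two groups
toy_rates_by_group <- data.frame(
  subgroup1 = rep(c("Group A", "Group B"), each = 5),
  rate      = c(3.1, 4.0, 3.6, 4.4, 3.9, 6.2, 5.8, 7.1, 6.5, 6.9)
)

# Hypothetical helper: mean plus 95% t-based confidence interval for one group
summarise_group <- function(x) {
  ci <- t.test(x)$conf.int
  c(Mean = mean(x), ConfIntLower = ci[1], ConfIntHigher = ci[2])
}

# One row per group: sapply() returns groups as columns, so transpose
summary_matrix <- t(sapply(
  split(toy_rates_by_group$rate, toy_rates_by_group$subgroup1),
  summarise_group
))

toy_means <- data.frame(Race = rownames(summary_matrix),
                        summary_matrix,
                        row.names = NULL)

# order() sorts ascending, so this lists groups from lowest to highest mean
toy_means[order(toy_means$Mean), ]
```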

Now we can put them in order from least to most.

means_comparison[order(means_comparison$Mean),]
##                                                           Race     Mean
## 3 Non_Hispanic_Asian_Native_Hawaiian_or_Other_Pacific_Islander 3.821053
## 2                                           Non_Hispanic_Asian 3.905556
## 1                                                     Hispanic 4.686486
## 5                                           Non_Hispanic_White 5.431008
## 4                                           Non_Hispanic_Black 6.611628
##   ConfIntLower ConfIntHigher
## 3     3.208728      4.433377
## 2     3.243084      4.568027
## 1     4.000833      5.372140
## 5     4.770594      6.091421
## 4     5.709834      7.513422

Finally, we have a definitive list, from lowest to highest, of the annual mortality rate from COVID-19 by race in the United States. Bear in mind that the Hispanic, White, and Black communities do not differ significantly from one another in annual rate, but the Asian, Native Hawaiian, and Pacific Islander communities have a significantly lower rate.

ggplot(means_comparison, aes(x = Race, y = Mean, fill = Race)) +
  geom_col() +
  geom_errorbar(aes(ymin = ConfIntLower, ymax = ConfIntHigher)) +
  ylim(0, 8)

Conclusion and Discussion

At the end of this road, Asian, Native Hawaiian, and Pacific Islander communities have a significantly lower annual Covid rate than the other racial groups tested in the US.

This is interesting, and does not align with what I expected. Hispanic and Non-Hispanic Black communities tend to live in more impoverished or underserved areas. Given that, and the history of medical abuse and trauma impacting Black communities especially, I anticipated Covid rates to be highest for the Non-Hispanic Black group. I expected the Non-Hispanic White population to have the lowest rate, simply because wealthy areas tend to be predominantly White and have better access to healthcare and public health interventions. However, there was no statistical evidence of a difference among the Black, White, and Hispanic populations in annual Covid rate.

The main significant differences show up between the Asian groups and the other racial groups tested. The Asian groups had a significantly lower adjusted annual Covid rate. Further analysis into why would be interesting. I would want to look at whether there are cultural reasons for lower transmission, such as an increased tendency towards masking or vaccination.