Effects of Varying Demographics on COVID-19 Cases in California

Avantika Singh

August 14, 2020

Introduction

The COVID-19 pandemic continues to alter our lives. Rising cases and deaths across the world mean that many people can no longer visit their loved ones. So far there is no scientific solution to the pandemic, such as a vaccine. This analysis examines data sets containing several variables that may be correlated with the rise of cases in various California counties, in the hope of adding insight into a disease that affects all of us but is still poorly understood. Identifying which factors are most strongly associated with rising cases may also support future discoveries that bring us one step closer to a cure.

The data set for my analysis is available at: https://data.ca.gov/group/covid-19

Setup

library(ggplot2)
library(dplyr)
library(lubridate)
library(RColorBrewer)
library(readr)
library(knitr)
library(maps)
library(sf)
library(tmap)
library(tmaptools)
library(leaflet)

Data import:

df1 <- read.csv("statewide_cases.csv")
df2 <- read.csv("case_demographics_age.csv")
df3 <- read.csv("case_demographics_sex.csv")
df4 <- read.csv("case_demographics_ethnicity.csv")

I downloaded the CSV file of COVID-19 cases in California, which is broken down by county and covers March 18, 2020 to July 12, 2020. From it, I rank the counties from the highest to the lowest rise in cases and show the ranking in a data visualization. I then use the CSV files of cases by age, sex, and ethnicity to visualize the differences between groups and relate them to the increase in cases in the counties.

Below is a summary of the data of statewide cases:

kable(head(df1))

| county | totalcountconfirmed | totalcountdeaths | newcountconfirmed | newcountdeaths | date |
|---|---|---|---|---|---|
| Santa Clara | 151 | 6 | 151 | 6 | 2020-03-18 |
| Santa Clara | 183 | 8 | 32 | 2 | 2020-03-19 |
| Santa Clara | 246 | 8 | 63 | 0 | 2020-03-20 |
| Santa Clara | 269 | 10 | 23 | 2 | 2020-03-21 |
| Santa Clara | 284 | 13 | 15 | 3 | 2020-03-22 |
| Santa Clara | 336 | 13 | 52 | 0 | 2020-03-23 |
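
As a quick sanity check on this table, newcountconfirmed should equal the day-over-day change in totalcountconfirmed. The sketch below rebuilds the six Santa Clara rows shown above by hand (the `santa_clara` name is introduced here only for illustration) and compares the two columns in base R. It is a minimal check, not part of the original analysis.

```r
# Rebuild the six Santa Clara rows from the preview above
santa_clara <- data.frame(
  date = as.Date("2020-03-18") + 0:5,
  totalcountconfirmed = c(151, 183, 246, 269, 284, 336),
  newcountconfirmed = c(151, 32, 63, 23, 15, 52)
)

# Day-over-day change in the cumulative count (one value fewer than rows)
daily_change <- diff(santa_clara$totalcountconfirmed)

# Compare with the reported new cases, skipping the first day,
# which has no previous row to difference against
consistent <- all(daily_change == santa_clara$newcountconfirmed[-1])
print(consistent)
```

The same comparison could be run per county on the full file to flag days where the two columns disagree.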

Summary of cases by age:

df2 <- df2 %>% select(age_group, totalpositive, date)
kable(head(df2))

| age_group | totalpositive | date |
|---|---|---|
| 0-17 | 120 | 2020-04-02 |
| 18-49 | 5302 | 2020-04-02 |
| 50-64 | 2879 | 2020-04-02 |
| 65 and Older | 2342 | 2020-04-02 |
| Unknown | 58 | 2020-04-02 |
| 0-17 | 137 | 2020-04-03 |
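
Raw counts are easier to compare across age groups once they are turned into shares of the day's total. The sketch below does this for the five 2020-04-02 rows shown above, typed in by hand. The `age_apr2` and `share_pct` names are assumptions introduced for this example.

```r
# The five age-group rows for 2020-04-02 from the preview above
age_apr2 <- data.frame(
  age_group = c("0-17", "18-49", "50-64", "65 and Older", "Unknown"),
  totalpositive = c(120, 5302, 2879, 2342, 58),
  stringsAsFactors = FALSE
)

# Share of that day's cumulative positives in each age group, in percent
total_positive <- sum(age_apr2$totalpositive)
age_apr2$share_pct <- round(100 * age_apr2$totalpositive / total_positive, 1)

# Age group with the largest share
largest_group <- age_apr2$age_group[which.max(age_apr2$share_pct)]
print(age_apr2)
print(largest_group)
```

On the full file, the same calculation would be grouped by date, for example with `group_by(date)` and `mutate()` from dplyr.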

Summary of cases by sex:

df3 <- df3 %>% select(sex, totalpositive2, date)
kable(head(df3))

| sex | totalpositive2 | date |
|---|---|---|
| Female | 5015 | 2020-04-02 |
| Male | 5547 | 2020-04-02 |
| Unknown | 139 | 2020-04-02 |
| Female | 5674 | 2020-04-03 |
| Male | 6202 | 2020-04-03 |
| Unknown | 150 | 2020-04-03 |

Summary of cases by ethnicity:

kable(head(df4))

| race_ethnicity | cases | case_percentage | deaths | death_percentage | percent_ca_population | date |
|---|---|---|---|---|---|---|
| Latino | 5276 | 35.99 | 170 | 28.38 | 38.9 | 2020-04-13 |
| Latino | 5910 | 37.18 | 203 | 29.72 | 38.9 | 2020-04-14 |
| Latino | 6433 | 37.80 | 226 | 29.70 | 38.9 | 2020-04-15 |
| Latino | 7013 | 38.51 | 254 | 29.85 | 38.9 | 2020-04-16 |
| Latino | 7627 | 39.41 | 281 | 30.58 | 38.9 | 2020-04-17 |
| Latino | 8195 | 40.28 | 314 | 31.24 | 38.9 | 2020-04-18 |
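
One way to judge whether a group is over- or under-represented among cases is to divide case_percentage by percent_ca_population: a ratio above 1 means the group's share of cases exceeds its share of the population. The sketch below applies this to the six Latino rows shown above. The `representation_ratio` column is a name introduced here, and this ratio is only one possible measure, not one used elsewhere in this report.

```r
# The six Latino rows from the preview above
latino <- data.frame(
  date = as.Date("2020-04-13") + 0:5,
  case_percentage = c(35.99, 37.18, 37.80, 38.51, 39.41, 40.28),
  percent_ca_population = 38.9
)

# Share of cases relative to share of the California population
latino$representation_ratio <-
  latino$case_percentage / latino$percent_ca_population

# TRUE on dates where the group's case share exceeds its population share
latino$over_represented <- latino$representation_ratio > 1
print(latino)
```

In these six rows the ratio starts below 1 and rises above 1 by the end of the week, which is the kind of shift the ratio makes visible.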

Data Analysis

The following scatter plot shows the trend in cases across California: for each county, it plots the percentage growth in cases per week.

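# Weekly case growth per county, exported from a spreadsheet to CSV
data...Sheet1 <- read.csv("~/Downloads/data - Sheet1.csv")
# read.csv() already returns a data frame, so no conversion is needed
ggplot(data = data...Sheet1, aes(x = County, y = Case.Growth.week)) + geom_point()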
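
The report does not state how Case.Growth.week was calculated. One plausible definition, sketched below in base R, is the percentage change in a county's cumulative confirmed cases between two dates one week apart. The `weekly_growth` helper, the county names, and all numbers here are made up for illustration. Only the `county`, `date`, and `totalcountconfirmed` column names come from statewide_cases.csv.

```r
# Toy cumulative counts for two hypothetical counties (made-up numbers)
cases <- data.frame(
  county = rep(c("County A", "County B"), each = 2),
  date = as.Date(rep(c("2020-07-05", "2020-07-12"), times = 2)),
  totalcountconfirmed = c(100, 150, 200, 220),
  stringsAsFactors = FALSE
)

# Percentage growth in cumulative cases between two dates, per county
weekly_growth <- function(df, start, end) {
  first <- df[df$date == start, c("county", "totalcountconfirmed")]
  last <- df[df$date == end, c("county", "totalcountconfirmed")]
  both <- merge(first, last, by = "county", suffixes = c("_start", "_end"))
  both$growth_pct <- 100 *
    (both$totalcountconfirmed_end - both$totalcountconfirmed_start) /
    both$totalcountconfirmed_start
  # Highest growth first
  both[order(-both$growth_pct), c("county", "growth_pct")]
}

growth <- weekly_growth(cases, as.Date("2020-07-05"), as.Date("2020-07-12"))
print(growth)
```

A percentage measure like this is sensitive to small starting counts, so a county with very few cases can show a large growth rate from a handful of new cases.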

This graph shows that Colusa County had the highest weekly percentage growth in cases. The remaining counties, from highest to lowest growth, were Glenn, Tuolumne, Del Norte, Butte, Lake, Madera, Mendocino, Marin, Imperial, Sutter, Merced, San Joaquin, Stanislaus, Yuba, Kings, Fresno, Monterey, Siskiyou, Orange, Ventura, Napa, San Bernardino, San Benito, Placer, Sonoma, Solano, Shasta, El Dorado, Riverside, Yolo, Amador, Kern, Calaveras, Sacramento, Tulare, Contra Costa, San Luis Obispo, Santa Cruz, Nevada, Los Angeles, San Diego, Alameda, Plumas, Santa Barbara, Humboldt, Santa Clara, San Mateo, San Francisco, Mariposa, Trinity, Inyo, Mono, Tehama, Sierra, Modoc, Lassen, and Alpine.

For the purposes of this study, we look at the two counties with the highest weekly growth rate (Colusa and Glenn), the two counties with the lowest weekly growth rate (Lassen and Alpine), and the county with the median growth rate, which happens to be Riverside.
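
The selection described above (top two, bottom two, and a middle county) can be sketched in a few lines of base R. The data frame below uses made-up county names and growth values. Only the `County` and `Case.Growth.week` column names come from the plotting code. With an even number of counties there are two middle positions, and this sketch assumes the lower-ranked of the two is taken.

```r
# Hypothetical growth values for seven made-up counties
growth <- data.frame(
  County = c("A", "B", "C", "D", "E", "F", "G"),
  Case.Growth.week = c(12, 40, 3, 25, 8, 1, 30),
  stringsAsFactors = FALSE
)

# Rank counties from highest to lowest weekly growth
ranked <- growth[order(-growth$Case.Growth.week), ]
n <- nrow(ranked)

highest_two <- head(ranked$County, 2)
lowest_two <- tail(ranked$County, 2)
# Middle position; for even n this picks the lower-ranked middle row
middle <- ranked$County[ceiling((n + 1) / 2)]

print(highest_two)
print(lowest_two)
print(middle)
```

Applied to a ranking of 58 counties, this rule picks position 30, which is where Riverside appears in the list above.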

Because the data sets are large and varied, we focus mostly on ethnicity in the five selected counties to see whether COVID-19 disproportionately affects a particular racial group.

This is the data for growth in cases by ethnicity for Colusa County:

`data...Colusa.(1)` <- read.csv("~/Downloads/data - Colusa (1).csv")
ggplot(data = `data...Colusa.(1)`, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

Colusa County has the highest case growth, and the chart shows that most of the cases contributing to this growth come from the Hispanic and Northwestern groups.

This is the data for growth in cases by ethnicity for Glenn County:

data...Glenn <- read.csv("~/Downloads/data - Glenn.csv")
ggplot(data = data...Glenn, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

This graph shows something different from what we expected. While it has been hypothesized that COVID-19 affects people of a Hispanic background more heavily, this graph for Glenn County, which has the second-highest growth rate, shows a very high percentage of cases among white people instead. The second-largest group is Northwestern, which is similar to what we saw in the first graph.

This is the data for growth in cases by ethnicity for Riverside County:

data...Riverside <- read.csv("~/Downloads/data - Riverside.csv")
ggplot(data = data...Riverside, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

In the graph for Riverside, once again the ethnic group contributing the most to the case growth is Northwestern.

This is the data for growth in cases by ethnicity for Lassen County:

data...Lassen <- read.csv("~/Downloads/data - Lassen.csv")
ggplot(data = data...Lassen, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

As in Glenn County, we see a high percentage of cases among white people. However, it is important to note that the second-highest contributing ethnicity is, once again, Northwestern.

This is the data for growth in cases by ethnicity for Alpine County:

data...Alpine <- read.csv("~/Downloads/data - Alpine.csv")
ggplot(data = data...Alpine, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

For Alpine County, the highest percentage of contributing cases comes from the Northwestern population, which is largely consistent with the rest of our charts and findings.
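
Reading the leading group off five separate charts can be cross-checked by stacking the county tables and picking the row with the largest Percentage in each. The sketch below shows the idea on made-up data for two hypothetical counties. The `Ethnicity` and `Percentage` column names match the plotting code, while the `County` column and all values are assumptions added for this example.

```r
# Made-up percentages for two hypothetical counties
eth <- data.frame(
  County = rep(c("County X", "County Y"), each = 3),
  Ethnicity = rep(c("Group 1", "Group 2", "Group 3"), times = 2),
  Percentage = c(50, 30, 20, 25, 60, 15),
  stringsAsFactors = FALSE
)

# For each county, keep the row with the largest Percentage
by_county <- split(eth, eth$County)
leading <- do.call(
  rbind,
  lapply(by_county, function(d) d[which.max(d$Percentage), ])
)
rownames(leading) <- NULL
print(leading)

# How often each group is the leading one across counties
lead_counts <- table(leading$Ethnicity)
print(lead_counts)
```

With the real county files, each one would first need a `County` column added before the tables are combined with `rbind()`.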

Conclusion

Based on the information collected, COVID-19 does not appear to affect ethnic groups at random. In most of our charts, Northwesterns, or people of American-Indian descent, account for a large share of cases, which suggests they may be more prone to catching COVID-19.

One limitation of this project is that we have no way of knowing how external factors affected the rise in cases. For example, during the pandemic and the shelter-in-place order, some cities still had gatherings for the Black Lives Matter protests. Counties where the protests happened may have seen a spike in cases, since more people came into contact with each other and the virus was more likely to spread. A second limitation is that we still don’t know much about COVID-19: new discoveries are being made all the time, and any of them could affect the outcome and purpose of this project.