Effects of Varying Demographics on COVID-19 Cases in California

Avantika Singh

August 14, 2020

Introduction

Our lives are continuously being altered by the COVID-19 pandemic. Many people can no longer visit their loved ones because of rising cases and deaths across the world, and there is still no scientific solution to the pandemic, such as a vaccine. This analysis examines data sets containing several variables that may correlate with the rise in cases across California counties, with the aim of providing insight into a disease that affects all of us but about which little is known. Taking it a step further, identifying which factors are most strongly associated with rising case counts may contribute to future discoveries that bring us one step closer to a cure.

The data set for my analysis is available at: https://data.ca.gov/group/covid-19

Setup

library(ggplot2)
library(dplyr)
library(lubridate)
library(RColorBrewer)
library(readr)
library(knitr)
library(maps)
library(sf)
library(tmap)
library(tmaptools)
library(leaflet)

Data import:

df1 <- read.csv("statewide_cases.csv")
df2 <- read.csv("case_demographics_age.csv")
df3 <- read.csv("case_demographics_sex.csv")
df4 <- read.csv("case_demographics_ethnicity.csv")

I start from the CSV file of COVID-19 cases in California, which is organized by county and covers March 18, 2020 to July 12, 2020. From it I rank the counties from the highest to the lowest rise in cases and show the result in a data visualization. I then use the CSV files of cases by age, sex, and ethnicity, visualize the differences between groups, and relate them to the increase in cases in the counties.

Below is a summary of the data of statewide cases:

# Show the first rows of the statewide cases data
kable(head(df1))

county	totalcountconfirmed	totalcountdeaths	newcountconfirmed	newcountdeaths	date
Santa Clara	151	6	151	6	2020-03-18
Santa Clara	183	8	32	2	2020-03-19
Santa Clara	246	8	63	0	2020-03-20
Santa Clara	269	10	23	2	2020-03-21
Santa Clara	284	13	15	3	2020-03-22
Santa Clara	336	13	52	0	2020-03-23

Summary of cases by age:

# Keep only the columns needed for the age summary
df2 <- df2 %>% select(age_group, totalpositive, date)
kable(head(df2))

age_group	totalpositive	date
0-17	120	2020-04-02
18-49	5302	2020-04-02
50-64	2879	2020-04-02
65 and Older	2342	2020-04-02
Unknown	58	2020-04-02
0-17	137	2020-04-03
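The cumulative counts above are easier to compare as shares of a single day's total. The following is a minimal sketch in base R that uses only the 2020-04-02 rows shown in the table; the `share_pct` column name is introduced here for illustration and is not part of the data set.

```r
# Rows for 2020-04-02, copied from the table above
age_day <- data.frame(
  age_group = c("0-17", "18-49", "50-64", "65 and Older", "Unknown"),
  totalpositive = c(120, 5302, 2879, 2342, 58)
)

# Each age group's share of that day's cumulative positives, in percent
age_day$share_pct <- round(100 * age_day$totalpositive / sum(age_day$totalpositive), 1)
age_day
```

On that date the 18-49 group accounts for roughly half of the recorded positives.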

Summary of cases by sex:

# Keep only the columns needed for the sex summary
df3 <- df3 %>% select(sex, totalpositive2, date)
kable(head(df3))

sex	totalpositive2	date
Female	5015	2020-04-02
Male	5547	2020-04-02
Unknown	139	2020-04-02
Female	5674	2020-04-03
Male	6202	2020-04-03
Unknown	150	2020-04-03

Summary of cases by ethnicity:

# Show the first rows of the ethnicity data
kable(head(df4))

race_ethnicity	cases	case_percentage	deaths	death_percentage	percent_ca_population	date
Latino	5276	35.99	170	28.38	38.9	2020-04-13
Latino	5910	37.18	203	29.72	38.9	2020-04-14
Latino	6433	37.80	226	29.70	38.9	2020-04-15
Latino	7013	38.51	254	29.85	38.9	2020-04-16
Latino	7627	39.41	281	30.58	38.9	2020-04-17
Latino	8195	40.28	314	31.24	38.9	2020-04-18
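One way to read this table is to compare a group's share of cases with its share of the state population. The sketch below does this for the six Latino rows shown above, in base R. The `case_to_pop_ratio` column is an illustrative name, not part of the data set; a ratio above 1 would mean the group is over-represented among cases relative to its population share.

```r
# Latino rows copied from the table above
eth <- data.frame(
  race_ethnicity = "Latino",
  case_percentage = c(35.99, 37.18, 37.80, 38.51, 39.41, 40.28),
  percent_ca_population = 38.9,
  date = as.Date("2020-04-13") + 0:5
)

# Share of cases divided by share of the California population
eth$case_to_pop_ratio <- round(eth$case_percentage / eth$percent_ca_population, 3)
eth[, c("date", "case_to_pop_ratio")]
```

In these rows the ratio moves from below 1 to above 1 between April 16 and April 17.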

Data Analysis

The following scatter plot shows the trend in cases across California: the percentage growth in cases per week for each county.

# Weekly case growth per county, read from a separately prepared spreadsheet export
growth <- read.csv("~/Downloads/data - Sheet1.csv")
ggplot(data = growth, aes(x = County, y = Case.Growth.week)) + geom_point()

This graph shows that Colusa County had the highest weekly percentage growth in cases. After Colusa, the counties rank from highest to lowest growth as follows: Glenn, Tuolumne, Del Norte, Butte, Lake, Madera, Mendocino, Marin, Imperial, Sutter, Merced, San Joaquin, Stanislaus, Yuba, Kings, Fresno, Monterey, Siskiyou, Orange, Ventura, Napa, San Bernardino, San Benito, Placer, Sonoma, Solano, Shasta, El Dorado, Riverside, Yolo, Amador, Kern, Calaveras, Sacramento, Tulare, Contra Costa, San Luis Obispo, Santa Cruz, Nevada, Los Angeles, San Diego, Alameda, Plumas, Santa Barbara, Humboldt, Santa Clara, San Mateo, San Francisco, Mariposa, Trinity, Inyo, Mono, Tehama, Sierra, Modoc, Lassen, Alpine.
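The report does not state how the weekly growth column was calculated. One plausible definition, shown here only as an assumption, is the percent change in cumulative confirmed cases between a county's latest date and a fixed number of days earlier. The helper `pct_growth` is hypothetical, and the demonstration uses the six Santa Clara rows from the statewide summary with a 5-day window, because only six days are shown there.

```r
# Percent growth in cumulative confirmed cases over the last `days` days, per county.
# Assumes columns county, totalcountconfirmed and date, as in statewide_cases.csv.
# This is an assumed definition, not necessarily the one used for the plot above.
pct_growth <- function(cases, days = 7) {
  cases$date <- as.Date(cases$date)
  per_county <- split(cases, cases$county)
  growth <- sapply(per_county, function(d) {
    latest <- max(d$date)
    now <- d$totalcountconfirmed[d$date == latest]
    before <- d$totalcountconfirmed[d$date == latest - days]
    # Return NA when either day is missing or duplicated, or the baseline is zero
    if (length(now) != 1 || length(before) != 1 || before == 0) return(NA_real_)
    100 * (now - before) / before
  })
  data.frame(County = names(growth), Case.Growth.week = unname(growth))
}

# Santa Clara rows copied from the statewide summary table
santa_clara <- data.frame(
  county = "Santa Clara",
  totalcountconfirmed = c(151, 183, 246, 269, 284, 336),
  date = as.Date("2020-03-18") + 0:5
)
santa_clara_growth <- pct_growth(santa_clara, days = 5)
santa_clara_growth
```

With the full file, calling `pct_growth(df1)` would use the default 7-day window and return one row per county.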

For the purposes of this study, we will look at the two counties with the highest weekly growth rate (Colusa and Glenn), the two counties with the lowest weekly growth rate (Lassen and Alpine), and the county with the median growth rate, which is Riverside.
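The selection can be reproduced from the ranking alone. The sketch below stores the counties in the order listed above and picks the first two, the last two, and a middle position. With 58 counties there is no single middle entry; taking position `ceiling((n + 1) / 2)` is an assumption made here because it matches the choice of Riverside, and position 29 (El Dorado) would be equally defensible.

```r
# Counties in descending order of weekly case growth, as listed in the text
ranked <- c(
  "Colusa", "Glenn", "Tuolumne", "Del Norte", "Butte", "Lake", "Madera",
  "Mendocino", "Marin", "Imperial", "Sutter", "Merced", "San Joaquin",
  "Stanislaus", "Yuba", "Kings", "Fresno", "Monterey", "Siskiyou", "Orange",
  "Ventura", "Napa", "San Bernardino", "San Benito", "Placer", "Sonoma",
  "Solano", "Shasta", "El Dorado", "Riverside", "Yolo", "Amador", "Kern",
  "Calaveras", "Sacramento", "Tulare", "Contra Costa", "San Luis Obispo",
  "Santa Cruz", "Nevada", "Los Angeles", "San Diego", "Alameda", "Plumas",
  "Santa Barbara", "Humboldt", "Santa Clara", "San Mateo", "San Francisco",
  "Mariposa", "Trinity", "Inyo", "Mono", "Tehama", "Sierra", "Modoc",
  "Lassen", "Alpine"
)

n <- length(ranked)
middle <- ceiling((n + 1) / 2)  # upper of the two middle positions when n is even

# Two highest, the middle one, and two lowest
picked <- ranked[c(1, 2, middle, n - 1, n)]
picked
```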

Because the data sets are large and varied, we will focus mostly on ethnicity in the five selected counties and examine whether COVID-19 disproportionately affects a particular racial or ethnic group.

This is the data for growth in cases by ethnicity for Colusa County:

`data...Colusa.(1)` <- read.csv("~/Downloads/data - Colusa (1).csv")
ggplot(data = `data...Colusa.(1)`, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

In Colusa County, which has the highest case growth, most of the cases contributing to this growth come from Hispanic and Northwestern groups.

This is the data for growth in cases by ethnicity for Glenn County:

data...Glenn <- read.csv("~/Downloads/data - Glenn.csv")
ggplot(data = data...Glenn, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

This graph shows something different from what we expected. Although there is a hypothesis that COVID-19 has affected many more people of Hispanic background, the graph for Glenn County, which has the second-highest growth rate, shows a very high percentage of cases among white people instead. The second-largest group is Northwestern, which is similar to what we saw in the first graph.

This is the data for growth in cases by ethnicity for Riverside County:

data...Riverside <- read.csv("~/Downloads/data - Riverside.csv")
ggplot(data = data...Riverside, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

In the graph for Riverside, once again the ethnic group contributing the most to the case growth is Northwestern.

This is the data for growth in cases by ethnicity for Lassen County:

data...Lassen <- read.csv("~/Downloads/data - Lassen.csv")
ggplot(data = data...Lassen, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

As in Glenn County, a high percentage of cases comes from white people. However, it is important to note that the second-highest contributing ethnicity is, once again, Northwestern.

This is the data for growth in cases by ethnicity for Alpine County:

data...Alpine <- read.csv("~/Downloads/data - Alpine.csv")
ggplot(data = data...Alpine, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)

For Alpine County, the highest percentage of contributing cases comes from the Northwestern population, which is largely consistent with the rest of our charts and findings.
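The five charts are easier to compare when drawn on a shared scale. The following sketch stacks the county files and draws one faceted bar chart. It assumes each file has the `Ethnicity` and `Percentage` columns used above; `combine_counties` and `plot_counties` are hypothetical helper names, and the plot is only drawn if the files exist at the paths used earlier.

```r
library(ggplot2)

# Read each county's file and stack the rows, tagging them with the county name.
# Assumes every file has the Ethnicity and Percentage columns used above.
combine_counties <- function(paths) {
  parts <- lapply(names(paths), function(county) {
    d <- read.csv(paths[[county]])
    d$County <- county
    d[, c("County", "Ethnicity", "Percentage")]
  })
  do.call(rbind, parts)
}

# One panel per county, all panels sharing the same y-axis
plot_counties <- function(combined) {
  ggplot(combined, aes(x = Ethnicity, y = Percentage)) +
    geom_col(width = 0.5) +
    facet_wrap(~ County) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

paths <- c(
  Colusa = "~/Downloads/data - Colusa (1).csv",
  Glenn = "~/Downloads/data - Glenn.csv",
  Riverside = "~/Downloads/data - Riverside.csv",
  Lassen = "~/Downloads/data - Lassen.csv",
  Alpine = "~/Downloads/data - Alpine.csv"
)

# Draw the figure only when all five files are available
if (all(file.exists(path.expand(paths)))) {
  print(plot_counties(combine_counties(paths)))
}
```

`geom_col()` is equivalent to `geom_bar(stat = "identity")`, and the shared y-axis makes differences between counties visible at a glance.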

Conclusion

From the information collected, COVID-19 does not appear to affect ethnic groups at random. In most of our charts, Northwesterns, or people of American Indian descent, account for a large share of cases, which suggests that they may be more likely to contract COVID-19.

This project has two main limitations. First, we have no way of knowing how external factors affected the rise in cases. For example, during the pandemic and the shelter-in-place order, some cities still had gatherings for the Black Lives Matter protests, and counties where protests took place may have seen a spike in cases because more people came into contact with each other and the virus was more likely to spread. Second, since we still don’t know much about COVID-19, new discoveries are being made all the time, and any of them could affect the outcome and purpose of this project.