Our lives are continuously being altered by the COVID-19 pandemic. Many people can no longer visit their loved ones because of the rise in cases and deaths across the world, and we have not yet developed a scientific solution to the pandemic, such as a vaccine. This analysis investigates data sets containing variables that may correlate with the rise in cases in various California counties, in the hope of adding insight into a disease that affects all of us but about which little is known. Taking it a step further, identifying which factors are most strongly associated with rising case counts may support future discoveries that bring us one step closer to finding a cure.
The data set for my analysis is available at: https://data.ca.gov/group/covid-19
library(ggplot2)
library(dplyr)
library(lubridate)
library(RColorBrewer)
library(readr)
library(knitr)
library(maps)
library(sf)
library(tmap)
library(tmaptools)
library(leaflet)
Data import:
df1 <- read.csv("statewide_cases.csv")
df2 <- read.csv("case_demographics_age.csv")
df3 <- read.csv("case_demographics_sex.csv")
df4 <- read.csv("case_demographics_ethnicity.csv")
I will download the CSV file of COVID-19 cases in California, which is organized by county and covers March 18, 2020 to July 12, 2020. From it, I will rank the counties from the highest to the lowest rise in cases and show the result in a data visualization. Then I will download the CSV files for cases by age, sex, and ethnicity, and use data visualizations to show the differences between groups and relate them to the increase in cases in the counties.
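As a rough sketch of the ranking step described above, the following assumes the column names shown in the statewide file (`county`, `totalcountconfirmed`, `date`) and defines growth as the percentage change in cumulative confirmed cases over the most recent seven days. The helper name `weekly_growth` and the toy numbers are hypothetical, and the analysis in this report may define growth differently.

```r
# Hypothetical helper (not part of the original analysis): percentage growth
# in cumulative confirmed cases over the last `days` days, per county.
weekly_growth <- function(cases, days = 7) {
  cases$date <- as.Date(cases$date)
  per_county <- split(cases, cases$county)
  growth <- sapply(per_county, function(d) {
    d <- d[order(d$date), ]
    latest <- d$totalcountconfirmed[nrow(d)]
    # Most recent record dated at least `days` before the latest date
    cutoff <- max(d$date) - days
    earlier <- d$totalcountconfirmed[d$date <= cutoff]
    if (length(earlier) == 0 || tail(earlier, 1) == 0) return(NA_real_)
    100 * (latest - tail(earlier, 1)) / tail(earlier, 1)
  })
  result <- data.frame(county = names(growth), growth_pct = unname(growth))
  # Highest growth first; counties without enough history (NA) go last
  result[order(result$growth_pct, decreasing = TRUE), ]
}

# Toy input with made-up numbers, shaped like statewide_cases.csv
toy <- data.frame(
  county = rep(c("A", "B"), each = 2),
  totalcountconfirmed = c(100, 150, 200, 220),
  date = rep(c("2020-07-05", "2020-07-12"), times = 2)
)
ranked <- weekly_growth(toy)
print(ranked)
```

Applied to the real file, the call would be `weekly_growth(df1)`. Counties with fewer than seven days of history, or with zero cases a week earlier, come back as `NA` rather than an infinite growth rate.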
Below is a summary of the data of statewide cases:
# All columns of the statewide data are kept
kable(head(df1))
| county | totalcountconfirmed | totalcountdeaths | newcountconfirmed | newcountdeaths | date |
|---|---|---|---|---|---|
| Santa Clara | 151 | 6 | 151 | 6 | 2020-03-18 |
| Santa Clara | 183 | 8 | 32 | 2 | 2020-03-19 |
| Santa Clara | 246 | 8 | 63 | 0 | 2020-03-20 |
| Santa Clara | 269 | 10 | 23 | 2 | 2020-03-21 |
| Santa Clara | 284 | 13 | 15 | 3 | 2020-03-22 |
| Santa Clara | 336 | 13 | 52 | 0 | 2020-03-23 |
Summary of cases by age:
df2 <- df2 %>% select(age_group, totalpositive, date)
kable(head(df2))
| age_group | totalpositive | date |
|---|---|---|
| 0-17 | 120 | 2020-04-02 |
| 18-49 | 5302 | 2020-04-02 |
| 50-64 | 2879 | 2020-04-02 |
| 65 and Older | 2342 | 2020-04-02 |
| Unknown | 58 | 2020-04-02 |
| 0-17 | 137 | 2020-04-03 |
Summary of cases by sex:
df3 <- df3 %>% select(sex, totalpositive2, date)
kable(head(df3))
| sex | totalpositive2 | date |
|---|---|---|
| Female | 5015 | 2020-04-02 |
| Male | 5547 | 2020-04-02 |
| Unknown | 139 | 2020-04-02 |
| Female | 5674 | 2020-04-03 |
| Male | 6202 | 2020-04-03 |
| Unknown | 150 | 2020-04-03 |
Summary of cases by ethnicity:
# All columns of the ethnicity data are kept
kable(head(df4))
| race_ethnicity | cases | case_percentage | deaths | death_percentage | percent_ca_population | date |
|---|---|---|---|---|---|---|
| Latino | 5276 | 35.99 | 170 | 28.38 | 38.9 | 2020-04-13 |
| Latino | 5910 | 37.18 | 203 | 29.72 | 38.9 | 2020-04-14 |
| Latino | 6433 | 37.80 | 226 | 29.70 | 38.9 | 2020-04-15 |
| Latino | 7013 | 38.51 | 254 | 29.85 | 38.9 | 2020-04-16 |
| Latino | 7627 | 39.41 | 281 | 30.58 | 38.9 | 2020-04-17 |
| Latino | 8195 | 40.28 | 314 | 31.24 | 38.9 | 2020-04-18 |
The following scatter plot shows the trend in cases for each county in California, measured as the weekly percentage growth in cases.
# Weekly case growth per county
data...Sheet1 <- read.csv("~/Downloads/data - Sheet1.csv")
# Plot directly from the imported sheet; df1 still holds the statewide cases
ggplot(data = data...Sheet1, aes(x = County, y = Case.Growth.week)) + geom_point()
This graph shows that Colusa County had the highest percentage growth in cases over a week. Following Colusa County, from highest to lowest growth, were Glenn, Tuolumne, Del Norte, Butte, Lake, Madera, Mendocino, Marin, Imperial, Sutter, Merced, San Joaquin, Stanislaus, Yuba, Kings, Fresno, Monterey, Siskiyou, Orange, Ventura, Napa, San Bernardino, San Benito, Placer, Sonoma, Solano, Shasta, El Dorado, Riverside, Yolo, Amador, Kern, Calaveras, Sacramento, Tulare, Contra Costa, San Luis Obispo, Santa Cruz, Nevada, Los Angeles, San Diego, Alameda, Plumas, Santa Barbara, Humboldt, Santa Clara, San Mateo, San Francisco, Mariposa, Trinity, Inyo, Mono, Tehama, Sierra, Modoc, Lassen, and Alpine.
For the purposes of this study, we will look at the two counties with the highest weekly growth rate (Colusa and Glenn), the two counties with the lowest weekly growth rate (Lassen and Alpine), and the county with the median growth rate, Riverside.
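The selection rule above can be sketched as follows. The data frame mirrors the `County` and `Case.Growth.week` columns used in the scatter plot, but the county names and values here are placeholders, and the helper `select_counties` is hypothetical. Treating the row just past the midpoint of the sorted table as the "median" county is an assumption, chosen because it is consistent with Riverside sitting at position 30 of the 58 counties listed above.

```r
# Hypothetical helper: pick the top two, bottom two, and middle county
# by weekly growth. Assumes at least five counties.
select_counties <- function(growth) {
  ranked <- growth[order(growth$Case.Growth.week, decreasing = TRUE), ]
  n <- nrow(ranked)
  # For an even count this takes the lower of the two middle rows
  middle <- floor(n / 2) + 1
  picks <- c(1, 2, middle, n - 1, n)
  data.frame(
    County = ranked$County[picks],
    role = c("highest", "second highest", "median", "second lowest", "lowest")
  )
}

# Placeholder data, not real county figures
toy_growth <- data.frame(
  County = paste("County", LETTERS[1:7]),
  Case.Growth.week = c(12, 40, 5, 33, 21, 8, 17)
)
selected <- select_counties(toy_growth)
print(selected)
```

With the real sheet, `select_counties(data...Sheet1)` would return the five counties to study, which makes the choice reproducible if the growth figures are updated.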
Because the data sets are large and varied, we will focus mostly on ethnicity in the five selected counties and examine whether COVID-19 disproportionately affects a particular racial or ethnic group.
This is the data for growth in cases by ethnicity for Colusa County:
`data...Colusa.(1)` <- read.csv("~/Downloads/data - Colusa (1).csv")
ggplot(data = `data...Colusa.(1)`, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)
Colusa County has the highest case growth, and the graph shows that most of the cases contributing to this growth come from the Hispanic and Northwestern groups.
This is the data for growth in cases by ethnicity for Glenn County:
data...Glenn <- read.csv("~/Downloads/data - Glenn.csv")
ggplot(data = data...Glenn, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)
This graph shows something somewhat different from what we expected. Although there is a hypothesis that COVID-19 has been affecting many more people of Hispanic background, this graph for Glenn County, which has the second-highest growth rate, shows a very high percentage of cases among white people instead. The second-largest group is Northwestern, which is similar to what we saw in the first graph.
This is the data for growth in cases by ethnicity for Riverside County:
data...Riverside <- read.csv("~/Downloads/data - Riverside.csv")
ggplot(data = data...Riverside, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)
In the graph for Riverside County, the ethnic group contributing the most to case growth is once again Northwestern.
This is the data for growth in cases by ethnicity for Lassen County:
data...Lassen <- read.csv("~/Downloads/data - Lassen.csv")
ggplot(data = data...Lassen, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)
Similar to Glenn County, we see a high percentage of cases among white people. However, it is important to note that the second-highest contributing group is once again Northwestern.
This is the data for growth in cases by ethnicity for Alpine County:
data...Alpine <- read.csv("~/Downloads/data - Alpine.csv")
ggplot(data = data...Alpine, aes(x=Ethnicity, y=Percentage)) + geom_bar(stat = "identity", width = 0.5)
For Alpine County, the highest percentage of contributing cases comes from the Northwestern population, which is largely consistent with the rest of our charts and findings.
From the information collected, COVID-19 does not appear to affect ethnic groups at random. In most of our charts, Northwesterns, or people of American Indian descent, appear more prone to catching COVID-19.
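One possible way to check whether a group is over-represented, rather than reading bar heights alone, is to compare its share of cases with its share of the population. The sketch below does this for the statewide Latino rows shown in the ethnicity summary table earlier, using the `case_percentage` and `percent_ca_population` columns. The column name `representation_ratio` is an assumption introduced here, and this ratio is not the method used for the county charts above.

```r
# Rows copied from the statewide ethnicity summary table
# (Latino, 2020-04-13 to 2020-04-18)
latino <- data.frame(
  date = as.Date("2020-04-13") + 0:5,
  case_percentage = c(35.99, 37.18, 37.80, 38.51, 39.41, 40.28),
  percent_ca_population = rep(38.9, 6)
)

# A ratio above 1 means the group's share of cases exceeds
# its share of the population
latino$representation_ratio <-
  latino$case_percentage / latino$percent_ca_population

print(latino)
```

The same calculation could be applied to each county's ethnicity file if a matching population-share column were available, which the county CSVs used here may not include.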
One limitation of this project is that we have no way of knowing how external factors affect the rise in cases. For example, during the pandemic and the shelter-in-place order, some cities still had gatherings for the Black Lives Matter protests, and counties where the protests happened may have seen a spike in cases because more people came into contact with each other and the virus was more likely to spread. Another limitation is that we still don’t know much about COVID-19: new discoveries are being made all the time, and any of them could affect the outcome and purpose of this project.