DATA 110 Project 1: COVID-19 Dataset (2021)

Author

Olivia Yuengling

Source: Johns Hopkins Whiting School of Engineering (Original visualization: Coronavirus COVID-19 (2019-nCoV) (arcgis.com))

Introduction

The data covers several variables from the COVID-19 pandemic in 2021: the month, the number of confirmed cases and deaths, and the US state or territory. Each dataset is a snapshot taken on the first day of a month, and we will combine the twelve snapshots into a single data frame (covid21_df).

In this project I intend to identify when the COVID-19 pandemic peaked in 2021 for each US region. I will create two graphs: one for the number of confirmed COVID-19 cases and one for the number of deaths caused by COVID-19.

Stage I: Loading & Cleaning the Data

Loading the Libraries and Datasets

To start the project, we need to load the packages we will use to clean the data and create the graphs: tidyverse, dplyr, and ggplot2. (dplyr and ggplot2 are part of the core tidyverse, as the startup message below shows, so loading them separately is optional.)

After loading the packages, we load each individual dataset taken from the Johns Hopkins Whiting School of Engineering, then merge them into a single data frame.

# loading necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)

# loads all of the COVID-19 datasets from the first day of every month in 2021
jan <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/01-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (12): Lat, Long_, Confirmed, Deaths, Recovered, Active, FIPS, Incident_...
lgl   (4): People_Hospitalized, Hospitalization_Rate, People_Tested, Mortali...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/02-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (12): Lat, Long_, Confirmed, Deaths, Recovered, Active, FIPS, Incident_...
lgl   (4): People_Hospitalized, Hospitalization_Rate, People_Tested, Mortali...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/03-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (12): Lat, Long_, Confirmed, Deaths, Recovered, Active, FIPS, Incident_...
lgl   (4): People_Hospitalized, Hospitalization_Rate, People_Tested, Mortali...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/04-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/05-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/06-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/07-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/08-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/09-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/10-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/11-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec <- read_csv("C:/Users/omyue/Downloads/COVID-19 Datasets/12-01-2021.csv")
Rows: 58 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): Province_State, Country_Region, ISO3
dbl  (10): Lat, Long_, Confirmed, Deaths, FIPS, Incident_Rate, Total_Test_Re...
lgl   (6): Recovered, Active, People_Hospitalized, Hospitalization_Rate, Peo...
dttm  (1): Last_Update
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# merges the monthly datasets into a single data frame
covid21_df <- rbind(jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec)
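The twelve read_csv() calls above follow one file-name pattern, so they could also be written as a loop. The sketch below is only an illustration under assumptions: it writes tiny stand-in CSV files to a temporary folder (the folder name, the toy values, and the covid_demo name are all hypothetical) so that it runs without the real downloads.

```r
library(readr)
library(dplyr)
library(purrr)

# Hypothetical folder: a temporary directory stands in for the real download folder
data_dir <- file.path(tempdir(), "covid_demo")
dir.create(data_dir, showWarnings = FALSE)

# File names follow the same MM-01-2021.csv pattern as the real files
file_names <- sprintf("%02d-01-2021.csv", 1:12)

# Write tiny stand-in files (made-up values) so the example is self-contained
walk(file_names, function(f) {
  toy <- tibble(
    Province_State = c("Maryland", "Ohio"),
    Confirmed = c(10, 20),
    Deaths = c(1, 2),
    Date = as.Date(sub(".csv", "", f, fixed = TRUE), format = "%m-%d-%Y")
  )
  write_csv(toy, file.path(data_dir, f))
})

# Read every file and stack the results into one data frame
covid_demo <- file.path(data_dir, file_names) |>
  map(\(p) read_csv(p, show_col_types = FALSE)) |>
  bind_rows()
```

With the real data, data_dir would point at the folder holding the downloaded files and the block that writes the stand-in files would be dropped. Setting show_col_types = FALSE also silences the column specification message printed after each read.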

Creating US Regions Column

A graph showing all 50 states plus the US territories would be too cluttered for viewers, so we will organize the data by region. We will have five regions: Northeast, Midwest, South, West, and Other. As the code below shows, Alaska, Hawaii, and Guam are placed in the West and Puerto Rico and the Virgin Islands in the South; the “Other” category catches every remaining entry that is not listed in one of the four regions.

# creates the individual US regions for the dataframe
covid21_df <- covid21_df %>%
  mutate(region = ifelse(Province_State %in% c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania"), "Northeast",
                ifelse(Province_State %in% c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"), "Midwest",
                ifelse(Province_State %in% c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "District of Columbia", "West Virginia", "Alabama", "Kentucky", "Mississippi", "Tennessee", "Arkansas", "Louisiana", "Oklahoma", "Texas", "Virgin Islands", "Puerto Rico"), "South",
                   ifelse(Province_State %in% c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming", "Alaska", "California", "Hawaii", "Oregon", "Washington", "Guam"), "West", "Other")))))
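As an alternative to typing every state name, base R ships the vectors state.name and state.region, which can serve as a lookup table. This is a sketch, not the method used in this project: state.region calls the Midwest "North Central", and it covers only the 50 states, so the District of Columbia and the territories would land in "Other" here unless they were added by hand. The toy data frame and the region_lookup name are assumptions for illustration.

```r
library(dplyr)

# Named vector: state name -> census region (built into base R's datasets package)
region_lookup <- setNames(as.character(state.region), state.name)

# Rename "North Central" to match the "Midwest" label used in this project
region_lookup[region_lookup == "North Central"] <- "Midwest"

# Toy rows standing in for covid21_df
toy <- tibble(
  Province_State = c("Maryland", "Ohio", "California", "Maine", "American Samoa")
)

# Look up each name; anything not found becomes "Other"
toy <- toy |>
  mutate(region = coalesce(unname(region_lookup[Province_State]), "Other"))
```

A quick check such as count(toy, region) shows how many rows each region received, which helps catch names that fell into "Other" by mistake.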

Creating the Months Column & Assigning Month Names

Now let’s create a new column that extracts the month from the date in our data. After extracting the month number, we convert it to a factor whose labels are the month names (e.g., 10 = October).

# extracts the month value from the dataset and creates a new column
covid21_df$month <- month(covid21_df$Date)

# changes month number to corresponding name in the column month
covid21_df$month <- factor(covid21_df$month, levels = 1:12, labels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
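The two steps above can also be combined by indexing base R's built-in month.name vector with the month number. The sketch below uses three made-up dates rather than the project data; the names dates and month_names are assumptions for illustration.

```r
library(lubridate)

# Three example dates standing in for covid21_df$Date
dates <- as.Date(c("2021-01-01", "2021-10-01", "2021-12-01"))

# month() returns 1..12; month.name holds "January" .. "December"
# Setting levels = month.name keeps the months in calendar order on a plot axis
month_names <- factor(month.name[month(dates)], levels = month.name)
```

Keeping all twelve levels in calendar order matters for the charts later on, because ggplot2 orders a factor axis by its levels rather than alphabetically.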

Narrowing the Data Set

I usually like to create a new dataset containing only the variables I need for the graph, so I can see everything in front of me.

# creates a new data frame with the month, region, confirmed cases, and deaths caused by COVID
covid21 <- covid21_df |> # puts the selected columns into a new df
  select(month, region, Confirmed, Deaths) # selects necessary variables

covid21 <- covid21 |>
  group_by(region, month) |> # groups by the region and month
  # creates new columns with the summed cases (in hundred thousands) and deaths (in thousands) for each region and month
  summarise(cases = ((sum(Confirmed))/10^5), deaths = ((sum(Deaths))/10^3))
`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.
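To make the grouping and scaling concrete, here is a small sketch on made-up numbers (the toy data frame and its values are assumptions, not project data). It also passes .groups = "drop", which is one way to avoid the grouping message shown above.

```r
library(dplyr)

# Four made-up state rows, two per region, all in January
toy <- tibble(
  region    = c("South", "South", "West", "West"),
  month     = factor(rep("January", 4), levels = month.name),
  Confirmed = c(300000, 200000, 150000, 50000),
  Deaths    = c(4000, 2000, 1500, 500)
)

toy_summary <- toy |>
  group_by(region, month) |>
  summarise(
    cases  = sum(Confirmed) / 1e5, # cases in hundred thousands
    deaths = sum(Deaths) / 1e3,    # deaths in thousands
    .groups = "drop"               # return an ungrouped data frame
  )
```

Here the South row works out to 500,000 / 100,000 = 5 for cases and 6,000 / 1,000 = 6 for deaths, and the West row to 2 and 2.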

Stage II: Linear Regression and Equations

Is there a correlation between the number of cases and the number of deaths? Let’s explore by fitting a linear regression.

Let’s Visualize!

# sets x & y variables
cases_deaths <- ggplot(covid21_df, aes(x = Confirmed, y = Deaths)) + 
  geom_point()+ # plots points on the graph
  geom_smooth(method = "lm", color = "red") # creates regression line and sets color
cases_deaths # prints the graph
`geom_smooth()` using formula = 'y ~ x'

Interesting…

The graph suggests that the more confirmed cases there are, the more deaths there are. The points fall close to the regression line, which indicates a strong positive correlation between the two variables.

Finding the Variables for the Equation

linearreg <-  lm(Deaths ~ Confirmed, data = covid21_df) # fits the model; its coefficients are the numbers we need for the equation
summary(linearreg)

Call:
lm(formula = Deaths ~ Confirmed, data = covid21_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-16137.7  -1089.2   -239.5    391.4  21215.0 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.695e+02  1.661e+02   1.021    0.308    
Confirmed   1.666e-02  1.671e-04  99.700   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3484 on 694 degrees of freedom
Multiple R-squared:  0.9347,    Adjusted R-squared:  0.9346 
F-statistic:  9940 on 1 and 694 DF,  p-value: < 2.2e-16

What Variables Do We Need?

In a linear regression equation (y = ax + b) we need the slope (a) and the y-intercept (b). Both appear in the Estimate column of the output above.

Note

deaths = 169.5 + 0.017(cases)

Hold on, what does this equation mean? For every additional confirmed case, the number of deaths is expected to increase by about 0.017, or roughly 17 deaths per 1,000 cases. The intercept, 169.5, is the predicted number of deaths when the number of cases is zero; its p-value (0.308) shows that it is not significantly different from zero.
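As a worked example of the equation, the sketch below plugs a hypothetical case count into it. The coefficients are copied from the summary(linearreg) output above; the function name predict_deaths and the input of 100,000 cases are assumptions for illustration.

```r
# Coefficients copied from the regression output above
intercept <- 169.5   # (Intercept) estimate
slope     <- 0.01666 # Confirmed estimate (1.666e-02)

# Predicted deaths for a given number of confirmed cases
predict_deaths <- function(cases) intercept + slope * cases

# Hypothetical input: 100,000 cases gives 169.5 + 1666 = 1835.5 predicted deaths
predict_deaths(100000)
```

With the fitted model in memory, predict(linearreg, newdata = data.frame(Confirmed = 100000)) should give nearly the same answer using the unrounded coefficients.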

Are the Results Significant?

To check whether the results are significant, we look at the p-value. Studies typically use a significance level of α = 0.05, or 5 percent. The p-value for our model is less than 2.2e-16 (0.00000000000000022), which is far below 0.05, so the relationship between cases and deaths is statistically significant. The p-value tells us that the relationship is unlikely to be due to chance; the strength of the correlation is shown by the R-squared value of 0.9347, which means that the number of cases explains about 93% of the variation in deaths.
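Rather than reading the p-value and R-squared off the printed summary, they can be pulled out of the summary object. The sketch below does this on a small made-up dataset (x, y, and toy_fit are assumptions, not the project data) and also checks that, for a regression with one predictor, R-squared equals the squared correlation.

```r
# Made-up data: a line with small deterministic wiggles
x <- 1:10
y <- 3 + 2 * x + c(0.5, -0.5, 0.3, -0.3, 0.1, -0.1, 0.4, -0.4, 0.2, -0.2)

toy_fit <- lm(y ~ x)
toy_fit_summary <- summary(toy_fit)

# p-value for the slope: row "x", column "Pr(>|t|)" of the coefficient table
p_slope <- coef(toy_fit_summary)["x", "Pr(>|t|)"]

# Share of the variation in y explained by x
r_squared <- toy_fit_summary$r.squared

# Pearson correlation; its square matches r_squared in simple regression
r <- cor(x, y)
```

Applied to the project model, the same calls on summary(linearreg) would use the row name "Confirmed" in place of "x".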

Stage III: Plotting the Graphs

Plotting the Cases Graph

To create the graph we are going to use ggplot2. First we set the variables for the x- and y-axes using the data frame “covid21”. Inside aes() we map month to the x-axis and cases to the y-axis, group the points by region, and color each point according to its region.

To plot the points we use geom_point(), and we connect the points within each region using geom_line().

To distinguish the regions visually, we assign a color to each one using scale_color_manual(). This function lets us build our own color palette by placing the desired colors into a vector, which is passed to the “values = ___” argument.

case_chart <- ggplot(covid21, aes(x = month, y = cases, color = region, group = region)) + # sets variables, assigns color, and groups by region
  labs(title = "COVID-19 Cases per US Region (2021)") + # creates the title label
  geom_point(size = 2.5) + # plots the points by month and region
  geom_line(linewidth = 1.5, alpha = 0.5) + # connects the points (linewidth replaces size for lines as of ggplot2 3.4.0)
  scale_color_manual(values = c("gold", "purple", "green", "#00bbf9", "hotpink")) + # sets the color palette
  xlab("Month") + # creates the title for the x-axis
  ylab("Cases (Hundred Thousands)") + # creates the title for the y-axis
  theme_bw(base_size = 7) # applies the black-and-white theme with a base font size of 7
case_chart # prints the graph
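Because the palette vector is unnamed, scale_color_manual() hands out the colors in the order of the region values, which for a character column is alphabetical (Midwest, Northeast, Other, South, West). A named vector makes that pairing explicit. The sketch below is an illustration on made-up data; region_colors, toy, and toy_chart are hypothetical names.

```r
library(ggplot2)

# Same five colors, each tied to a region by name
region_colors <- c(
  Midwest   = "gold",
  Northeast = "purple",
  Other     = "green",
  South     = "#00bbf9",
  West      = "hotpink"
)

# Made-up values for two regions and two months
toy <- data.frame(
  month  = factor(rep(c("January", "February"), times = 2), levels = month.name),
  region = rep(c("South", "West"), each = 2),
  cases  = c(10, 18, 5, 6)
)

toy_chart <- ggplot(toy, aes(x = month, y = cases, color = region, group = region)) +
  geom_point(size = 2.5) +
  geom_line(linewidth = 1.5, alpha = 0.5) + # linewidth needs ggplot2 3.4.0 or later
  scale_color_manual(values = region_colors) # colors are matched to regions by name
```

With names attached, each region keeps its color even if a region is missing from the data or the factor order changes.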

Cases Chart Results

As shown, the southern region of the United States has the highest number of cases, as its line sits above those of the other four regions. The “Other” region appears flat near zero because the few entries in it have very low case counts compared with the other four regions.

So in what month did COVID-19 cases peak in 2021? The points for every region except “Other” have their highest y-value in December. Note that Confirmed is a cumulative count, so each line can only rise over the year; December marks the highest running total rather than the month with the most new cases, which would require comparing month-to-month increases.
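Since the Confirmed column in this data is a running total, one way to look for the month with the most new cases is to subtract each month's total from the next. The sketch below shows the idea on made-up numbers shaped like covid21; it is not a result from the project data, and toy, new_cases, and peaks are hypothetical names.

```r
library(dplyr)

# Made-up cumulative totals for two regions over four months
toy <- tibble(
  region = rep(c("South", "West"), each = 4),
  month  = factor(rep(month.name[1:4], times = 2), levels = month.name),
  cases  = c(10, 18, 21, 30, 5, 6, 12, 13)
)

# Increase since the previous snapshot, computed within each region
new_cases <- toy |>
  group_by(region) |>
  arrange(month, .by_group = TRUE) |>
  mutate(new = cases - lag(cases)) |> # NA for the first month of each region
  ungroup()

# The snapshot month with the largest increase in each region
peaks <- new_cases |>
  filter(!is.na(new)) |>
  group_by(region) |>
  slice_max(new, n = 1) |>
  ungroup()
```

Because each snapshot is taken on the first day of a month, the increase recorded at a given month reflects cases added during the month before it.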

Deaths Chart

Now for the deaths chart, we follow the same process as for the cases chart, with a few small changes.

First, we change the y-axis variable from cases to deaths. Because deaths are scaled by THOUSANDS instead of HUNDRED THOUSANDS, we change the y-axis label to “Deaths (Thousands)”. After making these adjustments we print the results.

deaths_chart <- ggplot(covid21, aes(x = month, y = deaths, color = region, group = region)) + # sets variables, assigns color, and groups by region
  labs(title = "COVID-19 Deaths per US Region (2021)") + # sets title label
  geom_point(size = 2.5) + # plots points on the graph by month and region
  geom_line(linewidth = 1.5, alpha = 0.5) + # connects the points (linewidth replaces size for lines as of ggplot2 3.4.0)
  scale_color_manual(values = c("gold", "purple", "green", "#00bbf9", "hotpink")) + # sets the color palette
  xlab("Month") + # assigns label to the x-axis
  ylab("Deaths (Thousands)") + # assigns label to the y-axis
  theme_bw(base_size = 7) # applies the black-and-white theme with a base font size of 7
deaths_chart # prints the graph

Death Chart Results

The results we have attained from the deaths graph are similar to the cases chart, but there are some notable differences.

In the cases graph, the Northeast had the fewest cases of the four mainland regions (excluding “Other”). In the deaths graph, however, the Northeast appears to have more deaths from COVID-19 than the Midwest and the West, both of which had more cases than the Northeast.

What has remained the same is that the South still has the highest totals of any region, and the “Other” region still appears flat because its values are close to zero.

Summary and Conclusion

Cleaning the Data

To start the project I had to clean up the data by merging the 12 given datasets into one, which I did with the function rbind(). After merging the datasets I extracted the month from the given date and created a new column containing only the month number. Because I wanted the x-axis of the plots to show the names of the months instead of the numbers, I used the “factor” function to label the month numbers with their respective names.

After I gathered all of the information I needed for my plots, I narrowed the variables down into a smaller data frame called “covid21” to use for the plots. I used “select()” to keep only the variables I would need, then “group_by()” and “summarise()” to total them by region and month. I also divided the cases by 100,000 and the deaths by 1,000 so the axis values would be easier to read.

Plotting the Graphs (Multi-Line Plot)

For both graphs, I used ggplot2. I used the aes() function to assign the x-variable (month) and y-variable (cases/deaths), to separate the lines and points by region, and to color them by region. I used geom_point() and geom_line() to plot the points and connect them, which makes the trends easier to see. I used the “size = __” argument (“linewidth” for lines in ggplot2 3.4.0 and later) to adjust the point and line thickness so the lines appear bolder to viewers. I then manually assigned a color to each region using “scale_color_manual()”, entering the color names and hex code into a vector for the “values = ___” argument. I set the labels for the graph using “xlab()”, “ylab()”, and “labs()”, then set the theme and font size using “theme_bw()”. After setting up the graph, I printed it by calling the variable that the plot was assigned to.

Conclusion

I enjoyed playing around with the variables in the project, and it furthered my understanding of the COVID-19 pandemic in 2021. For example, I learned when the pandemic peaked in 2021 and how cases and deaths caused by COVID-19 are related.

When I started the project, my professor provided the code for extracting the month and for assigning a region to each US state or territory. Other than that, I was able to code and run everything in the project quite smoothly, which surprised me.

The largest problem I ran into was deciding which colors to use in the multi-line graphs so that they would look aesthetically pleasing while still letting viewers clearly distinguish the regions. I ended up with a color palette of green, yellow, blue, pink, and purple, and I am happy with it. I wish I had been able to change the font for the graphs, but it would not have made much of a difference, so I left it as it was.