COVID-19

Author

Hana Rose

Introduction

The covid dataset contains a global record of daily COVID-19 cases and deaths in 2019 and 2020 across different countries, continents, and populations, with data collected by the European Centre for Disease Prevention and Control (ECDC). For this visualization and analysis, I will be investigating the total cases and deaths of the 10 countries with the most cases and the 10 countries with the most deaths in the year 2020.

Quantitative Variables: 1. Cases 2. Deaths

Categorical Variables: 1. Country 2. Year

Load Libraries

library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.4.3
library(dplyr)
Warning: package 'dplyr' was built under R version 4.4.3

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(readr)
Warning: package 'readr' was built under R version 4.4.3
library(viridis)
Warning: package 'viridis' was built under R version 4.4.3
Loading required package: viridisLite
library(highcharter)
Warning: package 'highcharter' was built under R version 4.4.3
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use

Load Dataset

setwd("C:/Users/Hana Rose/OneDrive/Data 110")
covid <- read.csv("covid.csv")

Check for NAs and Anomalies

# Check for missing values
summary(covid)
    Date.Day       Date.Month       Date.Year      Data.Cases      
 Min.   : 1.00   Min.   : 1.000   Min.   :2019   Min.   : -8261.0  
 1st Qu.: 8.00   1st Qu.: 4.000   1st Qu.:2020   1st Qu.:     0.0  
 Median :16.00   Median : 7.000   Median :2020   Median :    13.0  
 Mean   :15.85   Mean   : 6.407   Mean   :2020   Mean   :   898.4  
 3rd Qu.:24.00   3rd Qu.: 9.000   3rd Qu.:2020   3rd Qu.:   206.0  
 Max.   :31.00   Max.   :12.000   Max.   :2020   Max.   :102507.0  
  Data.Deaths       Location.Country   Location.Code      Data.Population     
 Min.   :-1918.00   Length:53629       Length:53629       Min.   :        -1  
 1st Qu.:    0.00   Class :character   Class :character   1st Qu.:   1324820  
 Median :    0.00   Mode  :character   Mode  :character   Median :   7813207  
 Mean   :   22.87                                         Mean   :  41645789  
 3rd Qu.:    3.00                                         3rd Qu.:  28608715  
 Max.   : 4928.00                                         Max.   :1433783692  
 Location.Continent   Data.Rate        
 Length:53629       Min.   :-147.4196  
 Class :character   1st Qu.:   0.2692  
 Mode  :character   Median :   4.6528  
                    Mean   :  44.3829  
                    3rd Qu.:  34.9054  
                    Max.   :1900.8362  
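
The summary above shows negative minimums in several numeric columns. As a minimal sketch, missing and negative values can also be counted per column explicitly. The `toy` data frame and its values below are hypothetical stand-ins, not rows from covid.csv.

```r
# Hypothetical toy data standing in for the covid data frame
toy <- data.frame(
  Data.Cases = c(10, -5, 0, NA),
  Data.Deaths = c(1, 0, -2, 3),
  Location.Country = c("A", "B", "C", "D")
)

# Count NA values in every column
na_counts <- colSums(is.na(toy))

# Count negative values in the numeric columns only
numeric_cols <- sapply(toy, is.numeric)
negative_counts <- sapply(toy[numeric_cols], function(x) sum(x < 0, na.rm = TRUE))

print(na_counts)
print(negative_counts)
```

Running the same two counts on `covid` would show how many rows each problem affects before any filtering is applied.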

Clean Data

covid_clean <- covid %>%
  filter(across(everything(), ~ . >= 0)) # Remove rows with values less than 0 across all columns
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
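
The deprecation warning above recommends `if_all()`. A minimal sketch of that form follows, restricted to numeric columns with `where(is.numeric)`. That restriction is an assumption on my part: the original call also compares the character columns against 0, so the two versions may keep a different number of rows if any character column contains NA. The `toy` data is hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for the covid data frame
toy <- data.frame(
  Data.Cases = c(10, -5, 7),
  Data.Deaths = c(1, 0, -2),
  Location.Country = c("A", "B", "C")
)

# Keep only rows where every numeric column is non-negative
toy_clean <- toy %>%
  filter(if_all(where(is.numeric), ~ .x >= 0))

print(toy_clean)
```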

Select the 10 Countries With the Most Cases in the Year 2020

total_cases <- covid_clean %>%  # Funnel the data for total cases into the new dataframe using the covid_clean dataset
  filter(Date.Year == 2020) %>%  # Filter for rows in the year column where the year is 2020
  group_by(Location.Country) %>%  # Group the data by country 
  summarize(total_cases = sum(Data.Cases, na.rm = TRUE)) %>%  # Exclude NAs and calculate the total cases for each country
  arrange(desc(total_cases)) %>%  # Arrange the countries in descending order of total cases
  top_n(10)  # Select the 10 countries with the most cases
Selecting by total_cases
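
The "Selecting by total_cases" message appears because `top_n()` was not told which column to rank by. As a sketch of an alternative, `slice_max()` (which supersedes `top_n()` in current dplyr) names the ranking column explicitly and returns the rows already sorted in descending order. The `toy_totals` data and the choice of `n = 2` are hypothetical.

```r
library(dplyr)

# Hypothetical per-country totals
toy_totals <- data.frame(
  Location.Country = c("A", "B", "C", "D"),
  total_cases = c(50, 200, 10, 120)
)

# Keep the 2 countries with the largest totals, ranked by total_cases
top2 <- toy_totals %>%
  slice_max(total_cases, n = 2)

print(top2)
```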

Correct Country Name Formatting for Cases Plot

total_cases$Location.Country <- gsub("_", " ", total_cases$Location.Country) # Remove underscores from all rows in the country column and replace them with spaces

Bar Plot for the 10 Countries With the Most COVID-19 Cases (2020)

# Create the bar plot for cases
ggplot(total_cases, aes(x = factor(2020), y = total_cases, fill = Location.Country)) +  # Create a plot using the total_cases dataframe and set up the variables for the axes and fill
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +  # Make the plot a bar chart, adjust the width of the bars, and set bars side-by-side
  scale_y_continuous(labels = scales::comma) +  # Add commas to the quantitative y-axis labels
  labs(  # Add labels
    x = "Year",  # Label the x-axis
    y = "Total COVID-19 Cases",  # Label the y-axis 
    fill = "Country",  # Label the fill for the legend
    title = "Top 10 Total COVID-19 Cases by Country"  # Give the graph a title
  ) +
  theme_minimal(base_size = 14) +  # Set theme and font size
  theme(  # Customize various theme elements.
    plot.background = element_rect(fill = "black"),  # Set background color
    axis.text = element_text(color = "white"),  # Set the axis text color 
    axis.title = element_text(color = "white"),  # Set the axis title color
    legend.title = element_text(color = "white"),  # Set legend title color
    legend.text = element_text(color = "white"),  # Set legend text color 
    panel.grid.major = element_blank(),  # Remove major grid lines
    panel.grid.minor = element_blank(),  # Remove minor grid lines
    axis.text.x = element_text(size = 12, color = "white"),  # Set x-axis text color and font size
    plot.title = element_text(color = "white", size = 16, face = "bold", hjust = 0.5)  # Set plot title color, size, face, and position
  ) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(10, "Set3"))  # Set the fill colors using the RColorBrewer package

Select the 10 Countries With the Most Deaths in the Year 2020

total_deaths <- covid_clean %>% # Funnel the data for total deaths into the new dataframe using the covid_clean dataset
  filter(Date.Year == 2020) %>% # Filter for rows in the year column where the year is 2020
  group_by(Location.Country) %>% # Group the data by country 
  summarize(total_deaths = sum(Data.Deaths, na.rm = TRUE)) %>% # Exclude NAs and calculate the total deaths for each country
  arrange(desc(total_deaths)) %>% # Arrange the countries in descending order of total deaths
  top_n(10) # Select the 10 countries with the most deaths
Selecting by total_deaths

Correct Country Name Formatting for Deaths Plot

total_deaths$Location.Country <- gsub("_", " ", total_deaths$Location.Country) # Remove underscores from all rows in the country column of total_deaths and replace them with spaces

Bar Plot for the 10 Countries With the Most COVID-19 Deaths (2020)

# Create the bar plot for deaths
ggplot(total_deaths, aes(x = factor(2020), y = total_deaths, fill = Location.Country)) +  # Create a plot using the total_deaths dataframe and set up the variables for the axes and fill
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) +  # Make the plot a bar chart, adjust the width of the bars, and set bars side-by-side
  scale_y_continuous(labels = scales::comma) +  # Add commas to the quantitative y-axis labels
  labs(  # Add labels
    x = "Year",  # Label the x-axis
    y = "Total COVID-19 Deaths",  # Label the y-axis 
    fill = "Country",  # Label the fill for the legend
    title = "Top 10 Total COVID-19 Deaths by Country"  # Give the graph a title
  ) +
  theme_minimal(base_size = 14) +  # Set theme and font size
  theme(  # Customize various theme elements.
    plot.background = element_rect(fill = "black"),  # Set background color
    axis.text = element_text(color = "white"),  # Set the axis text color 
    axis.title = element_text(color = "white"),  # Set the axis title color
    legend.title = element_text(color = "white"),  # Set legend title color
    legend.text = element_text(color = "white"),  # Set legend text color 
    panel.grid.major = element_blank(),  # Remove major grid lines
    panel.grid.minor = element_blank(),  # Remove minor grid lines
    axis.text.x = element_text(size = 12, color = "white"),  # Set x-axis text color and font size
    plot.title = element_text(color = "white", size = 16, face = "bold", hjust = 0.5)  # Set plot title color, size, face, and position
  ) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(10, "Set3"))  # Set the fill colors using the RColorBrewer package

Multiple Linear Regression Analysis: How Do Population and Deaths Predict Cases of COVID-19?

Multiple Linear Regression Model

model <- lm(Data.Cases ~ Data.Population + Data.Deaths, data = covid_clean) # Fit a linear regression model that uses population and deaths to predict cases
summary(model) # Display the coefficients, significance tests, and goodness of fit

Call:
lm(formula = Data.Cases ~ Data.Population + Data.Deaths, data = covid_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-110175    -129     -32     -15   70161 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.938e+01  1.483e+01   1.307    0.191    
Data.Population 5.713e-06  9.560e-08  59.763   <2e-16 ***
Data.Deaths     2.809e+01  1.204e-01 233.220   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3290 on 53480 degrees of freedom
Multiple R-squared:  0.5715,    Adjusted R-squared:  0.5715 
F-statistic: 3.566e+04 on 2 and 53480 DF,  p-value: < 2.2e-16

Multiple Linear Regression Equation

coefficients <- coef(model) # Extract the coefficients from the regression model
paste("Cases =", round(coefficients[1], 2), "+", round(coefficients[2], 2), "* Population +", round(coefficients[3], 2), "* Deaths") # Round the coefficients to the second decimal place and place the rounded coefficients of the cases, population, and deaths variables into the string
[1] "Cases = 19.38 + 0 * Population + 28.09 * Deaths"
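
Rounding to two decimal places makes the population coefficient display as 0, even though the summary reports it as 5.713e-06. A minimal sketch of one alternative is to format that coefficient in scientific notation with `formatC()`. The values below are copied by hand from the summary output above rather than read from `model`, so that the sketch runs on its own.

```r
# Coefficient values copied from the summary(model) output above
coefs <- c(intercept = 1.938e+01, population = 5.713e-06, deaths = 2.809e+01)

# Fixed notation for the large coefficients, scientific for the tiny one
equation <- paste(
  "Cases =", formatC(coefs[["intercept"]], format = "f", digits = 2),
  "+", formatC(coefs[["population"]], format = "e", digits = 2), "* Population",
  "+", formatC(coefs[["deaths"]], format = "f", digits = 2), "* Deaths"
)

print(equation)
```

In the report itself, `coef(model)` could replace the hand-copied vector.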

Diagnostic Plots

par(mfrow = c(2,2))  # Arrange the plots into 2x2 grids
plot(model)  # Display the residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage plots

Multiple Linear Regression Analysis

The multiple linear regression results indicate that both population (p < 2e-16) and deaths (p < 2e-16) are statistically significant predictors of reported COVID-19 cases.

According to the goodness of fit (adjusted R^2 = 0.5715), about 57.15% of the variation in daily reported COVID-19 cases across the records in the dataset is explained by population and deaths.

This shows that population and deaths have a moderately strong association with COVID-19 cases. However, it’s important to clarify that this doesn’t indicate causation.
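
To make the goodness-of-fit figures concrete, here is a small sketch of how R^2 and adjusted R^2 are computed from a fitted model's residuals. It uses R's built-in `mtcars` data purely as a stand-in, so the numbers it produces are unrelated to the covid model.

```r
# Built-in mtcars data used only as a stand-in; not the covid data
fit <- lm(mpg ~ wt + hp, data = mtcars)

ss_res <- sum(residuals(fit)^2)                   # Unexplained variation
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # Total variation
n <- nrow(mtcars)                                 # Number of observations
p <- 2                                            # Number of predictors

r2 <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both values should match what summary() reports
print(c(r2 = r2, adj_r2 = adj_r2))
print(c(summary(fit)$r.squared, summary(fit)$adj.r.squared))
```

With more than 53,000 observations and only two predictors, the adjustment is negligible, which is why the multiple and adjusted R^2 values in the summary above are identical to four decimal places.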

Essay

Cleaning the Data

To clean my dataset, I first used the summary() function on covid to check for any missing or unusual values. While there weren’t any NA or -Inf/Inf values, there were strange entries, such as negative values for cases, deaths, and population. To remove the problematic rows, I used the filter() function from the dplyr package to drop any row containing a value below 0.

About the Visualizations

The first visualization shows the total cases of COVID-19 in the year 2020 for the 10 countries with the most cases that year. The countries are represented by multicolored vertical bars, the height of which indicates the total number of COVID-19 infection cases for that country in that year. The year, 2020, is displayed at the bottom across the x-axis. Unsurprisingly, countries with larger populations tend to have more COVID-19 infections. Interestingly, though, India has a lower number of cases than the USA despite having a significantly larger population. This may mean India has differences from the USA in reporting or testing, population density, government intervention, social and cultural response, or immunity.

The second visualization shows the total deaths from COVID-19 in the year 2020 for the 10 countries with the most deaths that year. The countries are represented by multicolored vertical bars, the height of which indicates the total number of COVID-19 deaths for that country in that year. The year, 2020, is displayed at the bottom across the x-axis. Unsurprisingly, countries with larger populations tend to have more COVID-19 deaths. Interestingly, though, India has a lower number of deaths than the USA despite having a significantly larger population. This may mean India has differences from the USA in reporting or testing, population density, government intervention, social and cultural response, or immunity.

About the Process

Initially, I was going to make an animated, interactive choropleth world map of infection rates (Data.Cases / Data.Population) and mortality rates (Data.Deaths / Data.Population) during the 12 months of the pandemic in 2020. This map would have had a red-to-white gradient color key for infection and mortality rates, with each country's color shifting over time to visualize fluctuations and concentrations of infection and mortality rates globally. Users would have been able to select individual months to explore and hover over the map with a tooltip to gain specific insights on each country.

Unfortunately, I ran into two major problems: the computed infection rate and mortality rate values were far too small to derive meaningful relationships from, and the map would not have used three colors to distinguish categorical variables, as listed in the rubric. As a result, I abandoned this idea, turned to a bar chart visualization at the last minute, and changed my variables to population, cases, and deaths. I decided instead to explore population and deaths as predictors of cases, since these variables had enough data to work with and would produce meaningful results, but I find the results more intuitive and less insightful than exploring mortality and infection rates across different countries could have been. I was also going to add a hover tooltip in highcharter so that users could see the population of each country when they hover over the bars, but I ran out of time. I think the population tooltip would have made the graphs much more relevant and provided a more comprehensive picture of the data.
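
The raw ratios described above come out very small because counts are divided by whole populations. One common convention, shown here only as a sketch and not as something this project did, is to scale the ratio to a rate per 100,000 people. The `toy` values are hypothetical.

```r
# Hypothetical toy data; not rows from covid.csv
toy <- data.frame(
  Location.Country = c("A", "B"),
  Data.Cases = c(500, 1200),
  Data.Population = c(1000000, 60000000)
)

# Raw ratio of cases to population (tiny decimals)
toy$raw_rate <- toy$Data.Cases / toy$Data.Population

# The same ratio expressed per 100,000 people
toy$rate_per_100k <- toy$raw_rate * 100000

print(toy)
```

The scaling does not change the ranking of countries; it only moves the values into a range that is easier to read on a color scale or axis.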