The COVID dataset contains a global record of COVID-19 cases and deaths over the course of two years (2019 and 2020) across different countries, continents, and populations, with data collected daily by the European Centre for Disease Prevention and Control (ECDC). For this visualization and analysis, I investigate the total cases and deaths of the 10 countries with the most cases and the 10 countries with the most deaths in the year 2020.
Quantitative Variables: 1. Cases 2. Deaths
Categorical Variables: 1. Country 2. Year
Load Libraries
library(ggplot2)
library(dplyr)
library(readr)
library(viridis)
library(highcharter)
    Date.Day       Date.Month       Date.Year      Data.Cases
 Min.   : 1.00   Min.   : 1.000   Min.   :2019   Min.   : -8261.0
 1st Qu.: 8.00   1st Qu.: 4.000   1st Qu.:2020   1st Qu.:     0.0
 Median :16.00   Median : 7.000   Median :2020   Median :    13.0
 Mean   :15.85   Mean   : 6.407   Mean   :2020   Mean   :   898.4
 3rd Qu.:24.00   3rd Qu.: 9.000   3rd Qu.:2020   3rd Qu.:   206.0
 Max.   :31.00   Max.   :12.000   Max.   :2020   Max.   :102507.0

  Data.Deaths       Location.Country   Location.Code      Data.Population
 Min.   :-1918.00   Length:53629       Length:53629       Min.   :        -1
 1st Qu.:    0.00   Class :character   Class :character   1st Qu.:   1324820
 Median :    0.00   Mode  :character   Mode  :character   Median :   7813207
 Mean   :   22.87                                         Mean   :  41645789
 3rd Qu.:    3.00                                         3rd Qu.:  28608715
 Max.   : 4928.00                                         Max.   :1433783692

 Location.Continent   Data.Rate
 Length:53629         Min.   :-147.4196
 Class :character     1st Qu.:   0.2692
 Mode  :character     Median :   4.6528
                      Mean   :  44.3829
                      3rd Qu.:  34.9054
                      Max.   :1900.8362
Clean Data
covid_clean <- covid %>%
  filter(if_all(everything(), ~ . >= 0)) # Keep only rows where every column is >= 0; if_all() replaces the deprecated across() inside filter()
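To make the effect of this filter concrete, here is a minimal sketch on a made-up toy data frame (the column names and values are assumptions for illustration, not taken from the covid dataset). It shows that a row is kept only when every column passes the `>= 0` test, so a single negative entry removes the whole row.

```r
library(dplyr)

# Toy data frame with made-up values; rows 2 and 3 each contain one negative entry
toy <- data.frame(
  cases  = c(10, -5, 3, 0),
  deaths = c(1, 0, -2, 0),
  pop    = c(1000, 2000, 3000, 4000)
)

# Keep a row only when all of its columns are non-negative
toy_clean <- toy %>%
  filter(if_all(everything(), ~ .x >= 0))

toy_clean
```

Only the first and last rows survive here, because each of the middle rows has one negative value.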
Select the 10 Countries With the Most Cases in the Year 2020
total_cases <- covid_clean %>% # Funnel the data for total cases into a new data frame using the covid_clean dataset
  filter(Date.Year == 2020) %>% # Keep rows in the year column where the year is 2020
  group_by(Location.Country) %>% # Group the data by country
  summarize(total_cases = sum(Data.Cases, na.rm = TRUE)) %>% # Exclude NAs and calculate the total cases for each country
  arrange(desc(total_cases)) %>% # Arrange the countries in descending order of total cases
  slice_max(total_cases, n = 10) # Select the 10 countries with the most cases (slice_max() supersedes top_n())
Correct Country Name Formatting for Cases Plot
total_cases$Location.Country <- gsub("_", " ", total_cases$Location.Country) # Replace underscores with spaces in every row of the country column
Bar Plot for the 10 Countries With the Most COVID-19 Cases (2020)
# Create the bar plot for cases
ggplot(total_cases, aes(x = factor(2020), y = total_cases, fill = Location.Country)) + # Create a plot using the total_cases data frame and set up the variables for the axes and fill
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) + # Make the plot a bar chart, adjust the width of the bars, and set bars side by side
  scale_y_continuous(labels = scales::comma) + # Add commas to the quantitative y-axis labels
  labs( # Add labels
    x = "Year", # Label the x-axis
    y = "Total COVID-19 Cases", # Label the y-axis
    fill = "Country", # Label the fill for the legend
    title = "Top 10 Total COVID-19 Cases by Country" # Give the graph a title
  ) +
  theme_minimal(base_size = 14) + # Set theme and font size
  theme( # Customize various theme elements
    plot.background = element_rect(fill = "black"), # Set background color
    axis.text = element_text(color = "white"), # Set the axis text color
    axis.title = element_text(color = "white"), # Set the axis title color
    legend.title = element_text(color = "white"), # Set legend title color
    legend.text = element_text(color = "white"), # Set legend text color
    panel.grid.major = element_blank(), # Remove major grid lines
    panel.grid.minor = element_blank(), # Remove minor grid lines
    axis.text.x = element_text(size = 12, color = "white"), # Set x-axis text color and font size
    plot.title = element_text(color = "white", size = 16, face = "bold", hjust = 0.5) # Set plot title color, size, face, and position
  ) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(10, "Set3")) # Set the fill colors using the RColorBrewer package
Select the 10 Countries With the Most Deaths in the Year 2020
total_deaths <- covid_clean %>% # Funnel the data for total deaths into a new data frame using the covid_clean dataset
  filter(Date.Year == 2020) %>% # Keep rows in the year column where the year is 2020
  group_by(Location.Country) %>% # Group the data by country
  summarize(total_deaths = sum(Data.Deaths, na.rm = TRUE)) %>% # Exclude NAs and calculate the total deaths for each country
  arrange(desc(total_deaths)) %>% # Arrange the countries in descending order of total deaths
  slice_max(total_deaths, n = 10) # Select the 10 countries with the most deaths (slice_max() supersedes top_n())
Correct Country Name Formatting for Deaths Plot
total_deaths$Location.Country <- gsub("_", " ", total_deaths$Location.Country) # Replace underscores with spaces in every row of the country column (read from total_deaths, not total_cases, so each death total keeps its own country label)
Bar Plot for the 10 Countries With the Most COVID-19 Deaths (2020)
# Create the bar plot for deaths
ggplot(total_deaths, aes(x = factor(2020), y = total_deaths, fill = Location.Country)) + # Create a plot using the total_deaths data frame and set up the variables for the axes and fill
  geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7) + # Make the plot a bar chart, adjust the width of the bars, and set bars side by side
  scale_y_continuous(labels = scales::comma) + # Add commas to the quantitative y-axis labels
  labs( # Add labels
    x = "Year", # Label the x-axis
    y = "Total COVID-19 Deaths", # Label the y-axis
    fill = "Country", # Label the fill for the legend
    title = "Top 10 Total COVID-19 Deaths by Country" # Give the graph a title
  ) +
  theme_minimal(base_size = 14) + # Set theme and font size
  theme( # Customize various theme elements
    plot.background = element_rect(fill = "black"), # Set background color
    axis.text = element_text(color = "white"), # Set the axis text color
    axis.title = element_text(color = "white"), # Set the axis title color
    legend.title = element_text(color = "white"), # Set legend title color
    legend.text = element_text(color = "white"), # Set legend text color
    panel.grid.major = element_blank(), # Remove major grid lines
    panel.grid.minor = element_blank(), # Remove minor grid lines
    axis.text.x = element_text(size = 12, color = "white"), # Set x-axis text color and font size
    plot.title = element_text(color = "white", size = 16, face = "bold", hjust = 0.5) # Set plot title color, size, face, and position
  ) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(10, "Set3")) # Set the fill colors using the RColorBrewer package
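Because both plots place all ten bars over a single x value (2020), the ranking has to be read from the legend. One possible alternative, sketched below on made-up numbers, puts the country on the axis and sorts the bars with `reorder()`. This is not what the plots above do; the data frame `toy` and its values are hypothetical.

```r
library(ggplot2)

# Made-up totals for three placeholder countries
toy <- data.frame(
  country = c("Country A", "Country B", "Country C"),
  total   = c(300, 1200, 750)
)

# Sort the bars by total and flip the axes so country names read horizontally
p <- ggplot(toy, aes(x = reorder(country, total), y = total)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Country", y = "Total (made-up values)", title = "Sorted bar chart sketch")

p
```

The trade-off is that this layout gives up the one-color-per-country fill, which may matter if the color legend is a requirement.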
Multiple Linear Regression Analysis: How Do Population and Deaths Predict COVID-19 Cases?
Multiple Linear Regression Model
model <- lm(Data.Cases ~ Data.Population + Data.Deaths, data = covid_clean) # Fit a linear regression model that uses population and deaths to predict cases
summary(model) # Display the coefficients, significance tests, and goodness of fit
Call:
lm(formula = Data.Cases ~ Data.Population + Data.Deaths, data = covid_clean)

Residuals:
    Min      1Q  Median      3Q     Max
-110175    -129     -32     -15   70161

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     1.938e+01  1.483e+01   1.307    0.191
Data.Population 5.713e-06  9.560e-08  59.763   <2e-16 ***
Data.Deaths     2.809e+01  1.204e-01 233.220   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3290 on 53480 degrees of freedom
Multiple R-squared:  0.5715,  Adjusted R-squared:  0.5715
F-statistic: 3.566e+04 on 2 and 53480 DF,  p-value: < 2.2e-16
Multiple Linear Regression Equation
coefficients <- coef(model) # Extract the coefficients from the regression model
paste("Cases =", round(coefficients[1], 2), "+", signif(coefficients[2], 4), "* Population +", round(coefficients[3], 2), "* Deaths") # Build the equation string; the population coefficient is on the order of 1e-06, so it is kept to 4 significant digits instead of being rounded to 0
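As a worked example of reading the fitted equation, the sketch below plugs the coefficients printed in the summary(model) output into the formula by hand. The input values (a population of 10 million and 5 deaths) are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Coefficients copied from the summary(model) output
b0     <- 1.938e+01  # (Intercept)
b_pop  <- 5.713e-06  # Data.Population
b_dead <- 2.809e+01  # Data.Deaths

# Hypothetical inputs, not taken from the dataset
population <- 1e7
deaths     <- 5

# Cases = b0 + b_pop * Population + b_dead * Deaths
predicted_cases <- b0 + b_pop * population + b_dead * deaths
predicted_cases

# Rounding the population coefficient to 2 decimals turns it into 0,
# while 4 significant digits keep it intact
round(b_pop, 2)
signif(b_pop, 4)
```

The three terms are 19.38, 57.13, and 140.45, which sum to a predicted 216.96 cases for that record. The population coefficient looks tiny only because population is measured in single people.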
par(mfrow = c(2, 2)) # Arrange the plots in a 2x2 grid
plot(model) # Display the residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage plots
Multiple Linear Regression Analysis
The multiple linear regression results indicate that both population (p < 2e-16) and deaths (p < 2e-16) are statistically significant predictors of COVID-19 case counts.
According to the goodness of fit (adjusted R^2 = 0.5715), about 57.15% of the variation in daily reported COVID-19 cases is explained by population and deaths.
This shows that population and deaths have a moderately strong association with COVID-19 case counts. However, it’s important to clarify that this doesn’t indicate causation.
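To show where an R^2 figure like this comes from, here is a small sketch that recomputes R^2 and adjusted R^2 by hand and compares them with what summary() reports. It uses R's built-in mtcars data as a stand-in, since the covid data is not bundled here; the variable choices (mpg, wt, hp) are for illustration only.

```r
# Fit a two-predictor model on the built-in mtcars data (stand-in example)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2 <- 1 - ss_res / ss_tot

# Adjusted R^2 penalizes for the number of predictors p given n observations
n <- nrow(mtcars)
p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Both values should agree with the summary() output
all.equal(r2, summary(fit)$r.squared)
all.equal(adj_r2, summary(fit)$adj.r.squared)
```

With more than 53,000 rows and only two predictors, as in the model above, the adjustment is negligible, which is why the multiple and adjusted R^2 both print as 0.5715.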
Essay
Cleaning the Data
To clean my dataset, I first used the summary(covid) function to check for any missing or unusual values. While there weren’t any NA or -Inf/Inf values, there were a lot of strange entries, like negative values for cases, deaths, and population. To remove the problematic rows, I used the filter() function from the dplyr package to drop every row containing a value below 0.
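Before dropping rows, it can help to see which columns the negative values come from and how many rows are affected. The following is a minimal base R sketch on a made-up toy data frame; the column names and values are assumptions for illustration, not the real dataset.

```r
# Toy data frame with made-up values
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  cases   = c(10, -5, 3, 0),
  deaths  = c(1, 0, -2, -1),
  stringsAsFactors = FALSE
)

# Identify the numeric columns so text columns are left out of the check
numeric_cols <- sapply(toy, is.numeric)

# Count negative entries in each numeric column
neg_counts <- sapply(toy[numeric_cols], function(x) sum(x < 0, na.rm = TRUE))
neg_counts

# Count rows that contain at least one negative value
rows_with_neg <- sum(rowSums(toy[numeric_cols] < 0, na.rm = TRUE) > 0)
rows_with_neg
```

In this toy example one negative value sits in cases and two in deaths, and three of the four rows would be removed by a row-wise filter.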
About the Visualizations
The first visualization shows the total COVID-19 cases in the year 2020 for the 10 countries with the most cases that year. The countries are represented by multicolored vertical bars, the height of which indicates the total number of COVID-19 infection cases for that country in that year. The year, 2020, is displayed at the bottom on the x-axis. Unsurprisingly, countries with larger populations tend to have more COVID-19 infections. Interestingly, though, India has a lower number of cases than the USA despite having a significantly larger population. This may mean India differs from the USA in reporting or testing, population density, government intervention, social and cultural response, or immunity.
The second visualization shows the total COVID-19 deaths in the year 2020 for the 10 countries with the most deaths that year. The countries are represented by multicolored vertical bars, the height of which indicates the total number of COVID-19 deaths for that country in that year. The year, 2020, is displayed at the bottom on the x-axis. Unsurprisingly, countries with larger populations tend to have more COVID-19 deaths. Interestingly, though, India has a lower number of deaths than the USA despite having a significantly larger population. This may mean India differs from the USA in reporting or testing, population density, government intervention, social and cultural response, or immunity.
About the Process
Initially, I planned to build an animated, interactive choropleth world map of infection rates (Data.Cases / Data.Population) and mortality rates (Data.Deaths / Data.Population) over the 12 months of 2020. The map would have used a red-to-white color gradient for each country's infection and mortality rates, shifting over time to show how the rates fluctuated and where they concentrated globally. Users would have been able to select individual months and hover over the map to see a tooltip with details for each country. Unfortunately, I ran into two major problems: the derived infection rate and mortality rate variables were far too small to draw meaningful relationships from, and the map would not have had three colors distinguishing categorical variables, as the rubric requires. As a result, I abandoned this idea, turned to a bar chart visualization at the last minute, and changed my variables to population, cases, and deaths. I decided to explore population and deaths as predictors of cases, since these variables had enough data to work with and would produce meaningful results, but I find the results more intuitive and less insightful than a comparison of mortality and infection rates across countries could have been. I was also going to add a hover tooltip in highcharter so that users could see the population of each country when hovering over the bars, but I ran out of time. I think the population tooltip would have made the graphs much more relevant and given a more complete picture of the data.
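One common way around very small per-person rates is to scale them to a rate per 100,000 people. The sketch below shows the arithmetic on a made-up toy data frame; the countries, counts, and populations are hypothetical, and this is only an illustration of the scaling, not a result from the covid dataset.

```r
# Toy data frame with made-up values
toy <- data.frame(
  country    = c("A", "B"),
  cases      = c(500, 12000),
  deaths     = c(10, 300),
  population = c(2e6, 6e7)
)

# Raw per-person rates are tiny fractions
toy$infection_rate <- toy$cases / toy$population

# Scaling by 100,000 gives numbers that are easier to read and compare
toy$cases_per_100k  <- toy$cases  / toy$population * 1e5
toy$deaths_per_100k <- toy$deaths / toy$population * 1e5

toy
```

Country B has 24 times as many cases as country A in this toy data, but its rate per 100,000 is lower (20 versus 25), which is the kind of comparison raw totals hide.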