Final Project (COVID-19 Cases and Deaths in Russia and the
World)
Introduction:
This project addresses three questions raised by the COVID-19
pandemic. Data sets from R packages and Kaggle are used to answer the
questions below.
Questions:
1. Check for the spread of the COVID-19 virus in Russia.
2. Using a linear regression model, check the data factors that may
have led to the increase in COVID-19 deaths in Russia.
3. What are the top 10 countries with the highest death rates?
# Install the necessary packages (run once)
# install.packages(c("tidyverse", "COVID19"))
# Load the packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# ggplot2, dplyr, and readr are already attached by tidyverse
library(COVID19)
Load the data set
covidData <- read_csv("CovidDataWorld.csv")
## Rows: 224 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, Other, Total Recovered, Active Cases
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# view the contents of the data
covidData
## # A tibble: 224 × 11
## `Country, Other` `Total Cases` `Total Deaths` `Total Recover…` `Active Cases`
## <chr> <dbl> <dbl> <chr> <chr>
## 1 USA 42634054 688486 32,347,726 9,597,842
## 2 India 33381728 444278 32,598,424 339,026
## 3 Brazil 21069017 589277 20,173,064 306,676
## 4 UK 7339009 134805 5,907,029 1,297,175
## 5 Russia 7214520 195835 6,452,398 566,287
## 6 France 6934732 115894 6,595,374 223,464
## 7 Turkey 6767008 60903 6,262,690 443,415
## 8 Iran 5378408 116072 4,682,704 579,632
## 9 Argentina 5234851 114101 5,087,120 33,630
## 10 Colombia 4936052 125782 4,774,661 35,609
## # … with 214 more rows, and 6 more variables:
## # `Serious / Critical Condition` <dbl>, `Total Cases / 1M Population` <dbl>,
## # `Deaths / 1M Population` <dbl>, `Total Tests` <dbl>,
## # `Tests / 1M Population` <dbl>, Population <dbl>
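The column specification above reads `Total Recovered` and `Active Cases` as character because the values contain thousands separators and "N/A" entries. A minimal sketch of converting such strings to numbers with `readr::parse_number()`, using three values copied from the printed tibble (the names `recovered_raw` and `recovered_num` are illustrative):

```r
library(readr)

# Values copied from the printed tibble; "N/A" marks a missing entry
recovered_raw <- c("32,347,726", "32,598,424", "N/A")

# parse_number() drops the grouping commas; na = "N/A" maps that string to NA
recovered_num <- parse_number(recovered_raw, na = "N/A")
recovered_num
```

The same call could be applied inside `mutate()` to the full columns before doing any arithmetic on them.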
# Keep only the row for Russia in a separate data set called Russia_data
# (this one-row summary is replaced by the time series loaded for question 1)
Russia_data <- covidData |>
  filter(`Country, Other` == "Russia")
view(Russia_data)
1. Check for the spread of the virus in Russia based on the
confirmed cases and the number of reported deaths.
# Load the time series for Russia from the COVID-19 Data Hub
Russia_data <- COVID19::covid19(country = "Russia")
##
## We have invested a lot of time and effort in creating COVID-19 Data
## Hub, please cite the following when using it:
##
## Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open
## Source Software 5(51):2376, doi: 10.21105/joss.02376
##
## The implementation details and the latest version of the data are
## described in:
##
## Guidotti, E., (2022), "A worldwide epidemiological database for
## COVID-19 at fine-grained spatial resolution", Sci Data 9(1):112, doi:
## 10.1038/s41597-022-01245-1
##
## To print citations in BibTeX format use:
## > print(citation('COVID19'), bibtex=TRUE)
##
## To hide this message use 'verbose = FALSE'.
# Plot cumulative confirmed cases and deaths over time
ggplot(Russia_data, aes(x = date)) +
  geom_line(aes(y = confirmed, color = "Confirmed"), size = 1) +
  geom_line(aes(y = deaths, color = "Deaths"), size = 1) +
  # Name the colors so each series is matched explicitly
  scale_color_manual(values = c(Confirmed = "red", Deaths = "blue")) +
  labs(title = "Cumulative COVID-19 Cases and Deaths in Russia, 2020-2023",
       x = "Date", y = "Count")
## Warning: Removed 86 row(s) containing missing values (geom_path).
## Warning: Removed 134 row(s) containing missing values (geom_path).
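The `confirmed` and `deaths` columns returned by `covid19()` are cumulative totals, so the lines above generally keep rising. If daily counts are wanted, one option is to difference each series. A minimal sketch on a small made-up data frame (the numbers and the `new_*` column names are illustrative, not taken from the data set):

```r
library(dplyr)

# Illustrative cumulative counts, not real data
cumulative <- tibble(
  date      = as.Date("2020-03-01") + 0:4,
  confirmed = c(10, 15, 15, 30, 42),
  deaths    = c(0, 1, 1, 2, 4)
)

# Daily new counts are the difference from the previous day's cumulative total
daily <- cumulative |>
  mutate(
    new_confirmed = confirmed - lag(confirmed),
    new_deaths    = deaths - lag(deaths)
  )
daily
```

The first row has no previous day, so its daily values are `NA`.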

2. Using a linear regression model, check the data factors that may
have led to the increase in COVID-19 deaths in Russia.
# Fit a linear regression model
linear_model <- lm(deaths ~ confirmed, data = Russia_data)
# Scatter plot of deaths against confirmed cases with the fitted line
ggplot(Russia_data, aes(x = confirmed, y = deaths)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Cumulative COVID-19 Deaths vs. Confirmed Cases in Russia, 2020-2023",
       x = "Confirmed Cases",
       y = "Deaths")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 134 rows containing non-finite values (stat_smooth).
## Warning: Removed 134 rows containing missing values (geom_point).

# Display the linear regression model
print(linear_model)
##
## Call:
## lm(formula = deaths ~ confirmed, data = Russia_data)
##
## Coefficients:
## (Intercept) confirmed
## 2.047e+04 1.879e-02
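The printed coefficients define the fitted line deaths = 2.047e+04 + 1.879e-02 * confirmed. A short sketch of reading the slope as a rate, using only the rounded values printed above (the helper `predict_deaths` is hypothetical; with the model object available, `predict(linear_model, newdata = ...)` would use the unrounded coefficients instead):

```r
# Rounded coefficients copied from the printed model
intercept <- 2.047e4
slope     <- 1.879e-2

# Fitted deaths for a given cumulative number of confirmed cases
predict_deaths <- function(confirmed) intercept + slope * confirmed

# Additional deaths associated with 1,000 additional confirmed cases
deaths_per_1000_cases <- slope * 1000
deaths_per_1000_cases

# Fitted value at one million confirmed cases
predict_deaths(1e6)
```

Read this way, the slope says that each additional 1,000 confirmed cases is associated with roughly 19 additional deaths in the fitted model.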
# To answer the third question, build a smaller data set with the 10 rows that have the highest total deaths
# (note that the "Total:" row is the worldwide aggregate, not a country)
library(dplyr)
top_10_deaths <- covidData |>
group_by(`Total Deaths`) |>
arrange(desc(`Total Deaths`)) |>
head(10)
top_10_deaths
## # A tibble: 10 × 11
## # Groups: Total Deaths [10]
## `Country, Other` `Total Cases` `Total Deaths` `Total Recover…` `Active Cases`
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Total: 227844736 4684246 204,514,139 18,646,351
## 2 USA 42634054 688486 32,347,726 9,597,842
## 3 Brazil 21069017 589277 20,173,064 306,676
## 4 India 33381728 444278 32,598,424 339,026
## 5 Mexico 3549229 270346 2,897,667 381,216
## 6 Peru 2164380 198891 N/A N/A
## 7 Russia 7214520 195835 6,452,398 566,287
## 8 Indonesia 4181309 139919 3,968,152 73,238
## 9 UK 7339009 134805 5,907,029 1,297,175
## 10 Italy 4623155 130167 4,376,646 116,342
## # … with 6 more variables: `Serious / Critical Condition` <dbl>,
## # `Total Cases / 1M Population` <dbl>, `Deaths / 1M Population` <dbl>,
## # `Total Tests` <dbl>, `Tests / 1M Population` <dbl>, Population <dbl>
3. What are the top 10 countries with the highest death rates?
# Death rate = total deaths / total cases, computed for the 10 rows selected above
top_10_deaths |>
  mutate(death_rate = `Total Deaths` / `Total Cases`) |>
  arrange(desc(death_rate)) |>
  head(10)
## # A tibble: 10 × 12
## # Groups: Total Deaths [10]
## `Country, Other` `Total Cases` `Total Deaths` `Total Recover…` `Active Cases`
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Peru 2164380 198891 N/A N/A
## 2 Mexico 3549229 270346 2,897,667 381,216
## 3 Indonesia 4181309 139919 3,968,152 73,238
## 4 Italy 4623155 130167 4,376,646 116,342
## 5 Brazil 21069017 589277 20,173,064 306,676
## 6 Russia 7214520 195835 6,452,398 566,287
## 7 Total: 227844736 4684246 204,514,139 18,646,351
## 8 UK 7339009 134805 5,907,029 1,297,175
## 9 USA 42634054 688486 32,347,726 9,597,842
## 10 India 33381728 444278 32,598,424 339,026
## # … with 7 more variables: `Serious / Critical Condition` <dbl>,
## # `Total Cases / 1M Population` <dbl>, `Deaths / 1M Population` <dbl>,
## # `Total Tests` <dbl>, `Tests / 1M Population` <dbl>, Population <dbl>,
## # death_rate <dbl>
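The ranking above is computed only within the ten rows with the most total deaths, and it includes the worldwide `Total:` row. A sketch of ranking by death rate after dropping that aggregate row, using a few rows copied from the output above (the short column names and the names `covid_sample` and `top_rates` are illustrative; on the full data the same steps would apply to `covidData` with its original column names):

```r
library(dplyr)

# Rows copied from the printed output; column names shortened for the sketch
covid_sample <- tibble(
  country      = c("Total:", "Peru", "Mexico", "Indonesia", "Italy", "Brazil", "Russia"),
  total_cases  = c(227844736, 2164380, 3549229, 4181309, 4623155, 21069017, 7214520),
  total_deaths = c(4684246, 198891, 270346, 139919, 130167, 589277, 195835)
)

# Drop the aggregate row, compute deaths per confirmed case, keep the top 3
top_rates <- covid_sample |>
  filter(country != "Total:") |>
  mutate(death_rate = total_deaths / total_cases) |>
  slice_max(death_rate, n = 3, with_ties = FALSE)
top_rates
```

Applying the filter before selecting the top rows, and ranking on `death_rate` rather than on total deaths, would let countries with fewer deaths but a higher ratio enter the list.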
# Bar plot of the 10 countries with the highest number of deaths
# (the worldwide "Total:" row is excluded so that only countries are plotted)
top_10_countries <- covidData |>
  filter(`Country, Other` != "Total:") |>
  slice_max(`Total Deaths`, n = 10, with_ties = FALSE)
ggplot(data = top_10_countries,
       aes(x = reorder(`Country, Other`, `Total Deaths`), y = `Total Deaths`,
           fill = `Country, Other`)) +
  geom_col() +
  scale_fill_manual(values = RColorBrewer::brewer.pal(10, "Paired")) +
  coord_flip() +
  labs(x = "Country", y = "Total Deaths",
       title = "Countries with the Highest Number of COVID-19 Deaths")

Conclusion
The line graph for the first question shows that, although the virus
spread widely in Russia (the red line, confirmed cases), reported
deaths (the blue line) stayed far below the number of confirmed cases,
which suggests that Russia found ways to limit the death toll.

For the second question, I created a scatter plot alongside a linear
model to examine whether the number of confirmed cases, the factor
considered here, explains the number of deaths. The scatter plot shows
individual data points, and the blue line is the best-fit line
determined by the model. This line shows how changes in the number of
confirmed cases relate to changes in the number of deaths: the fitted
slope of 1.879e-02 corresponds to roughly 19 additional deaths per
1,000 additional confirmed cases.

For the third question, I looked at the countries with the highest
death figures. The worldwide total shows that about 4.7 million people
had been killed by the virus. On a country-by-country basis, the
country with the highest number of deaths was the United States of
America (688,486), while Peru had the highest ratio of deaths to cases
among the ten rows examined.