Final Project (COVID-19 Cases and Deaths in Russia and the World)

Introduction:

This project addresses three questions raised by the COVID-19 pandemic. It uses data sets from R packages and Kaggle to try to answer the questions below.

Questions:

1. Check the spread of the COVID-19 virus in Russia.

2. Using a linear regression model, check which factors in the data may have led to the increase in COVID-19 deaths in Russia.

3. What are the top 10 countries with the highest death rates?

# Install the necessary packages
# Load in the packages

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.7      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.0 
## ✔ readr   2.1.2      ✔ forcats 0.5.1 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(COVID19)
library(dplyr)
library(readr)

Load the data set

covidData <- read_csv("CovidDataWorld.csv")
## Rows: 224 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, Other, Total Recovered, Active Cases
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# view the contents of the data
covidData
## # A tibble: 224 × 11
##    `Country, Other` `Total Cases` `Total Deaths` `Total Recover…` `Active Cases`
##    <chr>                    <dbl>          <dbl> <chr>            <chr>         
##  1 USA                   42634054         688486 32,347,726       9,597,842     
##  2 India                 33381728         444278 32,598,424       339,026       
##  3 Brazil                21069017         589277 20,173,064       306,676       
##  4 UK                     7339009         134805 5,907,029        1,297,175     
##  5 Russia                 7214520         195835 6,452,398        566,287       
##  6 France                 6934732         115894 6,595,374        223,464       
##  7 Turkey                 6767008          60903 6,262,690        443,415       
##  8 Iran                   5378408         116072 4,682,704        579,632       
##  9 Argentina              5234851         114101 5,087,120        33,630        
## 10 Colombia               4936052         125782 4,774,661        35,609        
## # … with 214 more rows, and 6 more variables:
## #   `Serious / Critical Condition` <dbl>, `Total Cases / 1M Population` <dbl>,
## #   `Deaths / 1M Population` <dbl>, `Total Tests` <dbl>,
## #   `Tests / 1M Population` <dbl>, Population <dbl>
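The column specification above shows that `Total Recovered` and `Active Cases` were read as character columns, most likely because of the thousands separators and entries such as "N/A" (which appears for Peru further down). A minimal sketch of one way to convert such values to numbers, assuming readr's `parse_number()` with "N/A" treated as missing; the `raw_values` vector is a hypothetical example built from values shown in this report:

```r
library(readr)

# Example strings in the same format as the character columns (illustrative only)
raw_values <- c("32,347,726", "339,026", "N/A")

# parse_number() drops the grouping commas; "N/A" is declared as a missing value
parsed_values <- parse_number(raw_values, na = c("", "NA", "N/A"))
parsed_values
```

The same call could be applied inside `mutate()` to the two character columns if they are needed for calculations.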
# Filter the data into a new data set, Russia_data, that contains only the row for Russia

Russia_data <- covidData |>
 filter(`Country, Other`=="Russia")
view(Russia_data)

1. Check the spread of the virus in Russia based on the confirmed cases and the number of reported deaths.

# Download the time series for Russia from the COVID-19 Data Hub (this overwrites the Russia_data created above)

Russia_data <- COVID19::covid19(country = "Russia")
## 
## We have invested a lot of time and effort in creating COVID-19 Data
## Hub, please cite the following when using it:
## 
##   Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open
##   Source Software 5(51):2376, doi: 10.21105/joss.02376
## 
## The implementation details and the latest version of the data are
## described in:
## 
##   Guidotti, E., (2022), "A worldwide epidemiological database for
##   COVID-19 at fine-grained spatial resolution", Sci Data 9(1):112, doi:
##   10.1038/s41597-022-01245-1
## 
## To print citations in BibTeX format use:
##  > print(citation('COVID19'), bibtex=TRUE)
## 
## To hide this message use 'verbose = FALSE'.
# Plot the cumulative confirmed cases and deaths over time
ggplot(Russia_data, aes(x = date)) +
  geom_line(aes(y = confirmed, color = "Confirmed"), size = 1) +
  geom_line(aes(y = deaths, color = "Deaths"), size = 1) +
  # Name the colors so each series keeps its color regardless of ordering
  scale_color_manual(values = c(Confirmed = "red", Deaths = "blue")) +
  labs(title = "Cumulative COVID-19 Cases and Deaths in Russia from 2020-2023", x = "Date", y = "Count")
## Warning: Removed 86 row(s) containing missing values (geom_path).
## Warning: Removed 134 row(s) containing missing values (geom_path).
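The `confirmed` and `deaths` columns in the COVID-19 Data Hub are cumulative totals, so daily counts have to be derived by differencing consecutive days. A minimal sketch of that step, assuming a small made-up data frame `toy_data` in place of `Russia_data`; the column names `daily_confirmed` and `daily_deaths` are hypothetical:

```r
library(dplyr)

# Made-up cumulative counts for five consecutive days (not real data)
toy_data <- data.frame(
  date = as.Date("2020-03-01") + 0:4,
  confirmed = c(10, 15, 15, 30, 42),
  deaths = c(0, 1, 1, 2, 5)
)

# Daily value = today's cumulative total minus yesterday's total
daily_data <- toy_data |>
  arrange(date) |>
  mutate(
    daily_confirmed = confirmed - lag(confirmed),
    daily_deaths = deaths - lag(deaths)
  )
daily_data
```

The first row has no previous day, so its daily values are `NA`.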

2. Using a linear regression model, check which factors in the data may have led to the increase in COVID-19 deaths in Russia.

# Fit a linear regression model
linear_model <- lm(deaths ~ confirmed, data = Russia_data)

# Plot deaths against confirmed cases, with the fitted regression line
ggplot(Russia_data, aes(x = confirmed, y = deaths)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Cumulative COVID-19 Deaths vs. Confirmed Cases in Russia from 2020-2023",
       x = "Confirmed Cases",
       y = "Deaths")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 134 rows containing non-finite values (stat_smooth).
## Warning: Removed 134 rows containing missing values (geom_point).

# Display the linear regression model
print(linear_model)
## 
## Call:
## lm(formula = deaths ~ confirmed, data = Russia_data)
## 
## Coefficients:
## (Intercept)    confirmed  
##   2.047e+04    1.879e-02
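To read these coefficients, recall that the fitted line is deaths = intercept + slope × confirmed. A short sketch that plugs the printed (rounded) coefficients into that equation; `predict_deaths()` is a hypothetical helper for illustration and is not part of the analysis above:

```r
# Rounded coefficients copied from the printed model output
intercept <- 2.047e+04
slope <- 1.879e-02

# Hypothetical helper: deaths predicted by the fitted line for a given case count
predict_deaths <- function(confirmed) {
  intercept + slope * confirmed
}

# Predicted deaths at one million confirmed cases
predict_deaths(1e6)

# The slope expressed as additional deaths per 100 additional confirmed cases
extra_deaths_per_100_cases <- slope * 100
extra_deaths_per_100_cases
```

In the real analysis, `coef(linear_model)` would give the unrounded values and `summary(linear_model)` would also report standard errors and R-squared.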
# To answer the third question, build a smaller data set with the 10 rows that have the highest total deaths
# (as the output shows, this also picks up the aggregate "Total:" row)

top_10_deaths <- covidData |>
  group_by(`Total Deaths`) |>
  arrange(desc(`Total Deaths`)) |>
  head(10)

top_10_deaths
## # A tibble: 10 × 11
## # Groups:   Total Deaths [10]
##    `Country, Other` `Total Cases` `Total Deaths` `Total Recover…` `Active Cases`
##    <chr>                    <dbl>          <dbl> <chr>            <chr>         
##  1 Total:               227844736        4684246 204,514,139      18,646,351    
##  2 USA                   42634054         688486 32,347,726       9,597,842     
##  3 Brazil                21069017         589277 20,173,064       306,676       
##  4 India                 33381728         444278 32,598,424       339,026       
##  5 Mexico                 3549229         270346 2,897,667        381,216       
##  6 Peru                   2164380         198891 N/A              N/A           
##  7 Russia                 7214520         195835 6,452,398        566,287       
##  8 Indonesia              4181309         139919 3,968,152        73,238        
##  9 UK                     7339009         134805 5,907,029        1,297,175     
## 10 Italy                  4623155         130167 4,376,646        116,342       
## # … with 6 more variables: `Serious / Critical Condition` <dbl>,
## #   `Total Cases / 1M Population` <dbl>, `Deaths / 1M Population` <dbl>,
## #   `Total Tests` <dbl>, `Tests / 1M Population` <dbl>, Population <dbl>
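The first row of the output above, `Total:`, is the worldwide aggregate rather than a country. A minimal sketch of one way to leave it out before ranking, assuming dplyr's `slice_max()`; `toy_covid` is a small stand-in table built from a few rows of the output above, and `top_3_deaths` is a hypothetical name:

```r
library(dplyr)

# A few rows copied from the output above, with simplified column names
toy_covid <- data.frame(
  country = c("Total:", "USA", "Brazil", "India", "Mexico"),
  total_deaths = c(4684246, 688486, 589277, 444278, 270346),
  stringsAsFactors = FALSE
)

# Drop the aggregate row, then keep the rows with the most deaths
top_3_deaths <- toy_covid |>
  filter(country != "Total:") |>
  slice_max(total_deaths, n = 3)
top_3_deaths
```

With the full data set, the same two steps with `n = 10` would return ten actual countries.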

3. What are the top 10 countries with the highest death rates?

top_10_deaths |>
  mutate(death_rate = `Total Deaths` / `Total Cases`) |>
  arrange(desc(death_rate)) |>
  head(10)
## # A tibble: 10 × 12
## # Groups:   Total Deaths [10]
##    `Country, Other` `Total Cases` `Total Deaths` `Total Recover…` `Active Cases`
##    <chr>                    <dbl>          <dbl> <chr>            <chr>         
##  1 Peru                   2164380         198891 N/A              N/A           
##  2 Mexico                 3549229         270346 2,897,667        381,216       
##  3 Indonesia              4181309         139919 3,968,152        73,238        
##  4 Italy                  4623155         130167 4,376,646        116,342       
##  5 Brazil                21069017         589277 20,173,064       306,676       
##  6 Russia                 7214520         195835 6,452,398        566,287       
##  7 Total:               227844736        4684246 204,514,139      18,646,351    
##  8 UK                     7339009         134805 5,907,029        1,297,175     
##  9 USA                   42634054         688486 32,347,726       9,597,842     
## 10 India                 33381728         444278 32,598,424       339,026       
## # … with 7 more variables: `Serious / Critical Condition` <dbl>,
## #   `Total Cases / 1M Population` <dbl>, `Deaths / 1M Population` <dbl>,
## #   `Total Tests` <dbl>, `Tests / 1M Population` <dbl>, Population <dbl>,
## #   death_rate <dbl>
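The `death_rate` column here is total deaths divided by total confirmed cases, which is often called the case fatality ratio. A sketch of the same calculation expressed as a percentage, using three rows copied from the output above; `cfr_data` and `cfr_percent` are hypothetical names:

```r
library(dplyr)

# Three rows copied from the output above, with simplified column names
cfr_data <- data.frame(
  country = c("Russia", "Peru", "USA"),
  total_cases = c(7214520, 2164380, 42634054),
  total_deaths = c(195835, 198891, 688486),
  stringsAsFactors = FALSE
)

# Deaths per 100 confirmed cases, sorted from highest to lowest
cfr_ranked <- cfr_data |>
  mutate(cfr_percent = 100 * total_deaths / total_cases) |>
  arrange(desc(cfr_percent))
cfr_ranked
```

This ratio depends on how many cases each country detected, so it is not the same as deaths per capita (the `Deaths / 1M Population` column).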
# Bar plot of the 10 countries with the highest number of deaths
# Exclude the aggregate "Total:" row so that only countries are ranked
top_10_countries <- covidData |>
  filter(`Country, Other` != "Total:") |>
  slice_max(`Total Deaths`, n = 10, with_ties = FALSE)

ggplot(data = top_10_countries,
       aes(x = reorder(`Country, Other`, `Total Deaths`), y = `Total Deaths`,
           fill = `Country, Other`)) +
  geom_col() +
  scale_fill_manual(values = RColorBrewer::brewer.pal(10, "Paired")) +
  coord_flip() +
  labs(x = "Country", y = "Total Deaths",
       title = "Countries with the Highest Number of COVID-19 Deaths")

Conclusion

Based on the analysis and visualizations above, the first line graph suggests that although the virus was widespread in Russia (the red line shows confirmed cases), Russia appears to have found a way to control it, since the number of deaths (the blue line) remained much lower than the number of confirmed cases.

For the second question, I created a scatter plot with a linear model to examine how deaths relate to confirmed cases. The scatter plot shows the individual data points, and the blue line is the best-fit line determined by the model. This line helps us understand how changes in the number of confirmed cases relate to changes in the number of deaths: the fitted slope of 1.879e-02 corresponds to roughly 1.9 additional deaths per 100 additional confirmed cases.

For the third question, I looked at the countries with the highest death counts and death rates. The worldwide total shows that over 4 million people were killed by this virus. On a country-by-country basis, the country with the highest number of deaths in this data set was the United States of America, while Peru had the highest ratio of deaths to confirmed cases among the rows with the most deaths.