In this R project, you’ll step into the role of a data analyst exploring the global COVID-19 pandemic using real-world data. Leveraging R and the powerful dplyr library, you’ll manipulate, filter, and aggregate a comprehensive dataset containing information on COVID-19 cases, tests, and hospitalizations across different countries. By applying data wrangling techniques such as grouping and summarizing, you’ll uncover which countries have the highest rates of positive COVID-19 tests relative to their testing numbers. This hands-on project will not only strengthen your R programming skills and analytical thinking but also provide valuable experience in deriving actionable insights from real-world health data – a crucial skill in today’s data-driven healthcare landscape.

  1. Load and explore the COVID-19 dataset.
library(readxl)
library(dplyr)
library(ggplot2)
covid_data <- read_excel('C:/Users/mickp/OneDrive/Desktop/R Stuff/covid project excel.xlsx')
  1. Filter and select relevant data using dplyr functions.
glimpse(covid_data)
## Rows: 27,641
## Columns: 12
## $ Date             <dttm> 2020-01-16, 2020-01-17, 2020-01-18, 2020-01-20, 2020…
## $ Country_Region   <chr> "Iceland", "Iceland", "Iceland", "South Korea", "Unit…
## $ Province_State   <chr> "All States", "All States", "All States", "All States…
## $ positive         <dbl> 3, 4, 7, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, NA, NA, NA,…
## $ active           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hospitalized     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hospitalizedCurr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ recovered        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ death            <dbl> NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA…
## $ total_tested     <dbl> NA, NA, NA, 4, 0, 0, 0, 0, 0, 0, 27, 0, 0, 0, NA, NA,…
## $ daily_tested     <dbl> NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 5, 0, 0, 0, NA, …
## $ daily_positive   <dbl> NA, 1, 3, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, NA, NA…
dim(covid_data)
## [1] 27641    12
head(covid_data)
## # A tibble: 6 × 12
##   Date                Country_Region Province_State positive active hospitalized
##   <dttm>              <chr>          <chr>             <dbl>  <dbl>        <dbl>
## 1 2020-01-16 00:00:00 Iceland        All States            3     NA           NA
## 2 2020-01-17 00:00:00 Iceland        All States            4     NA           NA
## 3 2020-01-18 00:00:00 Iceland        All States            7     NA           NA
## 4 2020-01-20 00:00:00 South Korea    All States            1     NA           NA
## 5 2020-01-22 00:00:00 United States  All States            0     NA           NA
## 6 2020-01-22 00:00:00 United States  Massachusetts         0     NA           NA
## # ℹ 6 more variables: hospitalizedCurr <dbl>, recovered <dbl>, death <dbl>,
## #   total_tested <dbl>, daily_tested <dbl>, daily_positive <dbl>
covid_df_all_states <- covid_data %>%
  filter(Province_State == 'All States') %>%
  select(-Province_State)
#Selecting the columns with cumulative numbers 

covid_df_all_states_daily <- covid_df_all_states %>%
  select(Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)

head(covid_df_all_states_daily)
## # A tibble: 6 × 6
##   Date                Country_Region active hospitalizedCurr daily_tested
##   <dttm>              <chr>           <dbl>            <dbl>        <dbl>
## 1 2020-01-16 00:00:00 Iceland            NA               NA           NA
## 2 2020-01-17 00:00:00 Iceland            NA               NA           NA
## 3 2020-01-18 00:00:00 Iceland            NA               NA           NA
## 4 2020-01-20 00:00:00 South Korea        NA               NA           NA
## 5 2020-01-22 00:00:00 United States      NA               NA           NA
## 6 2020-01-23 00:00:00 United States      NA               NA            0
## # ℹ 1 more variable: daily_positive <dbl>
sum(covid_df_all_states_daily$daily_tested, na.rm = TRUE)
## [1] 349586788
  1. Aggregate data by country and calculate summary statistics.
covid_daily_sum <- covid_df_all_states_daily %>%
  group_by(Country_Region) %>%
  summarise(tested = sum(daily_tested, na.rm = TRUE),
            positive = sum(daily_positive, na.rm = TRUE),
            active = sum(active),
            hospitalized = sum(hospitalizedCurr)) %>%
  arrange(desc(tested))

summary(covid_daily_sum)
##  Country_Region         tested             positive           active        
##  Length:146         Min.   :        0   Min.   :      0   Min.   :  296559  
##  Class :character   1st Qu.:     9032   1st Qu.:      0   1st Qu.:  501572  
##  Mode  :character   Median :    43021   Median :    370   Median : 1846922  
##                     Mean   :  2394430   Mean   :  93481   Mean   : 4760791  
##                     3rd Qu.:   424318   3rd Qu.:   5810   3rd Qu.: 4025622  
##                     Max.   :136937092   Max.   :9850413   Max.   :17176595  
##                                                           NA's   :137       
##   hospitalized    
##  Min.   : 129018  
##  1st Qu.: 697050  
##  Median :1265082  
##  Mean   :1265082  
##  3rd Qu.:1833114  
##  Max.   :2401146  
##  NA's   :144
  1. Identify top countries by testing numbers and positive case ratios.

Countries with highest testing numbers

covid_top_ten <- head(covid_daily_sum, 10)
  
covid_top_ten
## # A tibble: 10 × 5
##    Country_Region    tested positive   active hospitalized
##    <chr>              <dbl>    <dbl>    <dbl>        <dbl>
##  1 United States  136937092  9850413       NA           NA
##  2 India          106267322    60959       NA           NA
##  3 Italy           17370389   934875 17176595      2401146
##  4 Russia          11319603   432269       NA           NA
##  5 Canada           9873530   259992       NA           NA
##  6 Australia        8874298        0       NA           NA
##  7 Israel           4915043      402       NA           NA
##  8 Turkey           4351655   221499  4025622           NA
##  9 Peru             3578707    59497       NA           NA
## 10 Brazil           3474441    10321       NA           NA

Countries with highest positive case ratio (minimum cases 10,000)

high_pos_ratio <- covid_daily_sum %>%
  filter(tested > 10000) %>%
  mutate(positive_ratio = positive / tested) %>%
  arrange(desc(positive_ratio))

high_pos_ratio %>%
  select(Country_Region, tested, positive, positive_ratio)
## # A tibble: 108 × 4
##    Country_Region  tested positive positive_ratio
##    <chr>            <dbl>    <dbl>          <dbl>
##  1 Costa Rica      320327   116361          0.363
##  2 Scotland         13186     3491          0.265
##  3 Armenia         438837   106424          0.243
##  4 Bangladesh     2442470   420235          0.172
##  5 Czechia        2557224   411220          0.161
##  6 Piedmont         33041     5073          0.154
##  7 Ecuador          36679     5289          0.144
##  8 Lombardy         64852     9279          0.143
##  9 United Kingdom 1460486   163418          0.112
## 10 Belgium         511055    54209          0.106
## # ℹ 98 more rows
  1. Create vectors and matrices to store key findings
covid_filtered <- covid_daily_sum %>%
  filter(positive > 10000)

countries <- covid_filtered$Country_Region
ratio_vector <- covid_filtered$positive / covid_filtered$tested
tested_vector <- covid_daily_sum$tested
positive_vector <- covid_daily_sum$positive
active_vector <- covid_daily_sum$active
hospitalized_vector <- covid_daily_sum$hospitalized

names(tested_vector) <- countries
names(positive_vector) <- countries
names(active_vector) <- countries
names(hospitalized_vector) <- countries
names(ratio_vector) <- countries
matrix <- cbind(
  tested = tested_vector,
  positive = positive_vector,
  active = active_vector,
  hospitalized = hospitalized_vector,
  positive_ratio = ratio_vector
)
## Warning in cbind(tested = tested_vector, positive = positive_vector, active =
## active_vector, : number of rows of result is not a multiple of vector length
## (arg 5)
head(matrix)
##                  tested positive   active hospitalized positive_ratio
## United States 136937092  9850413       NA           NA   0.0719338556
## India         106267322    60959       NA           NA   0.0005736382
## Italy          17370389   934875 17176595      2401146   0.0538200382
## Russia         11319603   432269       NA           NA   0.0381876467
## Canada          9873530   259992       NA           NA   0.0263322236
## Turkey          8874298        0       NA           NA   0.0508999450

Compile results into a comprehensive list structure

covid_summary_list <- list(
  countries = countries,
  tested = tested_vector,
  positive = positive_vector,
  active = active_vector,
  hospitalized = hospitalized_vector,
  positive_ratio <- ratio_vector,
  summary_matrix = matrix
)

summary(covid_summary_list)
##                Length Class  Mode     
## countries       26    -none- character
## tested         146    -none- numeric  
## positive       146    -none- numeric  
## active         146    -none- numeric  
## hospitalized   146    -none- numeric  
##                 26    -none- numeric  
## summary_matrix 730    -none- numeric
  1. Create relevant visualizations.
ggplot(covid_top_ten, aes(x = reorder(Country_Region, tested), y = tested)) +
  geom_col(fill = 'steelblue') +
    scale_y_continuous(labels =  scales::comma) +
  labs(title = 'Top Ten Most Tested Countries (As of Nov 5th, 2020)', x = 'Country', y = 'Total Tests')

top_ratio <- covid_daily_sum %>%
  filter(tested > 10000) %>%  
  mutate(positive_ratio = positive / tested) %>%
  arrange(desc(positive_ratio)) %>%
  slice_head(n = 10)

ggplot(top_ratio, aes(x = reorder(Country_Region, positive_ratio), y = positive_ratio)) +
geom_col(fill = 'orange') +
labs(title = 'Top Ten Highest Positive Case Ratios by Country (As of Nov 5th 2020)', x = 'Countries', y = 'Positive Case Ratio') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

This plot should not be interpreted as showing the countries with the most COVID-19 cases overall. Many of the countries with high positive case ratios have relatively small populations, and may not appear among the global leaders in total testing. However, it is plausible that these countries conducted high levels of testing per capita, which would not be evident in total counts. For instance, Costa Rica was praised in the early stages of the pandemic for its proactive monitoring strategies, even as it showed one of the highest positive case ratios. Ultimately, a definitive understanding of global positive case ratios would require consistent, accurate data reporting from all countries and an analysis that adjusts for population size.