In this R project, you’ll step into the role of a data analyst exploring the global COVID-19 pandemic using real-world data. Leveraging R and the powerful dplyr library, you’ll manipulate, filter, and aggregate a comprehensive dataset containing information on COVID-19 cases, tests, and hospitalizations across different countries. By applying data wrangling techniques such as grouping and summarizing, you’ll uncover which countries have the highest rates of positive COVID-19 tests relative to their testing numbers. This hands-on project will not only strengthen your R programming skills and analytical thinking but also provide valuable experience in deriving actionable insights from real-world health data – a crucial skill in today’s data-driven healthcare landscape.
library(readxl)
library(dplyr)
library(ggplot2)
covid_data <- read_excel('C:/Users/mickp/OneDrive/Desktop/R Stuff/covid project excel.xlsx')
glimpse(covid_data)
## Rows: 27,641
## Columns: 12
## $ Date <dttm> 2020-01-16, 2020-01-17, 2020-01-18, 2020-01-20, 2020…
## $ Country_Region <chr> "Iceland", "Iceland", "Iceland", "South Korea", "Unit…
## $ Province_State <chr> "All States", "All States", "All States", "All States…
## $ positive <dbl> 3, 4, 7, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, NA, NA, NA,…
## $ active <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hospitalized <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hospitalizedCurr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ recovered <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ death <dbl> NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA…
## $ total_tested <dbl> NA, NA, NA, 4, 0, 0, 0, 0, 0, 0, 27, 0, 0, 0, NA, NA,…
## $ daily_tested <dbl> NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 5, 0, 0, 0, NA, …
## $ daily_positive <dbl> NA, 1, 3, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, NA, NA…
dim(covid_data)
## [1] 27641 12
head(covid_data)
## # A tibble: 6 × 12
## Date Country_Region Province_State positive active hospitalized
## <dttm> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-16 00:00:00 Iceland All States 3 NA NA
## 2 2020-01-17 00:00:00 Iceland All States 4 NA NA
## 3 2020-01-18 00:00:00 Iceland All States 7 NA NA
## 4 2020-01-20 00:00:00 South Korea All States 1 NA NA
## 5 2020-01-22 00:00:00 United States All States 0 NA NA
## 6 2020-01-22 00:00:00 United States Massachusetts 0 NA NA
## # ℹ 6 more variables: hospitalizedCurr <dbl>, recovered <dbl>, death <dbl>,
## # total_tested <dbl>, daily_tested <dbl>, daily_positive <dbl>
covid_df_all_states <- covid_data %>%
filter(Province_State == 'All States') %>%
select(-Province_State)
#Selecting the columns with cumulative numbers
covid_df_all_states_daily <- covid_df_all_states %>%
select(Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)
head(covid_df_all_states_daily)
## # A tibble: 6 × 6
## Date Country_Region active hospitalizedCurr daily_tested
## <dttm> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-16 00:00:00 Iceland NA NA NA
## 2 2020-01-17 00:00:00 Iceland NA NA NA
## 3 2020-01-18 00:00:00 Iceland NA NA NA
## 4 2020-01-20 00:00:00 South Korea NA NA NA
## 5 2020-01-22 00:00:00 United States NA NA NA
## 6 2020-01-23 00:00:00 United States NA NA 0
## # ℹ 1 more variable: daily_positive <dbl>
sum(covid_df_all_states_daily$daily_tested, na.rm = TRUE)
## [1] 349586788
covid_daily_sum <- covid_df_all_states_daily %>%
group_by(Country_Region) %>%
summarise(tested = sum(daily_tested, na.rm = TRUE),
positive = sum(daily_positive, na.rm = TRUE),
active = sum(active),
hospitalized = sum(hospitalizedCurr)) %>%
arrange(desc(tested))
summary(covid_daily_sum)
## Country_Region tested positive active
## Length:146 Min. : 0 Min. : 0 Min. : 296559
## Class :character 1st Qu.: 9032 1st Qu.: 0 1st Qu.: 501572
## Mode :character Median : 43021 Median : 370 Median : 1846922
## Mean : 2394430 Mean : 93481 Mean : 4760791
## 3rd Qu.: 424318 3rd Qu.: 5810 3rd Qu.: 4025622
## Max. :136937092 Max. :9850413 Max. :17176595
## NA's :137
## hospitalized
## Min. : 129018
## 1st Qu.: 697050
## Median :1265082
## Mean :1265082
## 3rd Qu.:1833114
## Max. :2401146
## NA's :144
Countries with highest testing numbers
covid_top_ten <- head(covid_daily_sum, 10)
covid_top_ten
## # A tibble: 10 × 5
## Country_Region tested positive active hospitalized
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 United States 136937092 9850413 NA NA
## 2 India 106267322 60959 NA NA
## 3 Italy 17370389 934875 17176595 2401146
## 4 Russia 11319603 432269 NA NA
## 5 Canada 9873530 259992 NA NA
## 6 Australia 8874298 0 NA NA
## 7 Israel 4915043 402 NA NA
## 8 Turkey 4351655 221499 4025622 NA
## 9 Peru 3578707 59497 NA NA
## 10 Brazil 3474441 10321 NA NA
Countries with highest positive case ratio (minimum cases 10,000)
high_pos_ratio <- covid_daily_sum %>%
filter(tested > 10000) %>%
mutate(positive_ratio = positive / tested) %>%
arrange(desc(positive_ratio))
high_pos_ratio %>%
select(Country_Region, tested, positive, positive_ratio)
## # A tibble: 108 × 4
## Country_Region tested positive positive_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 Costa Rica 320327 116361 0.363
## 2 Scotland 13186 3491 0.265
## 3 Armenia 438837 106424 0.243
## 4 Bangladesh 2442470 420235 0.172
## 5 Czechia 2557224 411220 0.161
## 6 Piedmont 33041 5073 0.154
## 7 Ecuador 36679 5289 0.144
## 8 Lombardy 64852 9279 0.143
## 9 United Kingdom 1460486 163418 0.112
## 10 Belgium 511055 54209 0.106
## # ℹ 98 more rows
covid_filtered <- covid_daily_sum %>%
filter(positive > 10000)
countries <- covid_filtered$Country_Region
ratio_vector <- covid_filtered$positive / covid_filtered$tested
tested_vector <- covid_daily_sum$tested
positive_vector <- covid_daily_sum$positive
active_vector <- covid_daily_sum$active
hospitalized_vector <- covid_daily_sum$hospitalized
names(tested_vector) <- countries
names(positive_vector) <- countries
names(active_vector) <- countries
names(hospitalized_vector) <- countries
names(ratio_vector) <- countries
matrix <- cbind(
tested = tested_vector,
positive = positive_vector,
active = active_vector,
hospitalized = hospitalized_vector,
positive_ratio = ratio_vector
)
## Warning in cbind(tested = tested_vector, positive = positive_vector, active =
## active_vector, : number of rows of result is not a multiple of vector length
## (arg 5)
head(matrix)
## tested positive active hospitalized positive_ratio
## United States 136937092 9850413 NA NA 0.0719338556
## India 106267322 60959 NA NA 0.0005736382
## Italy 17370389 934875 17176595 2401146 0.0538200382
## Russia 11319603 432269 NA NA 0.0381876467
## Canada 9873530 259992 NA NA 0.0263322236
## Turkey 8874298 0 NA NA 0.0508999450
Compile results into a comprehensive list structure
covid_summary_list <- list(
countries = countries,
tested = tested_vector,
positive = positive_vector,
active = active_vector,
hospitalized = hospitalized_vector,
positive_ratio <- ratio_vector,
summary_matrix = matrix
)
summary(covid_summary_list)
## Length Class Mode
## countries 26 -none- character
## tested 146 -none- numeric
## positive 146 -none- numeric
## active 146 -none- numeric
## hospitalized 146 -none- numeric
## 26 -none- numeric
## summary_matrix 730 -none- numeric
ggplot(covid_top_ten, aes(x = reorder(Country_Region, tested), y = tested)) +
geom_col(fill = 'steelblue') +
scale_y_continuous(labels = scales::comma) +
labs(title = 'Top Ten Most Tested Countries (As of Nov 5th, 2020)', x = 'Country', y = 'Total Tests')
top_ratio <- covid_daily_sum %>%
filter(tested > 10000) %>%
mutate(positive_ratio = positive / tested) %>%
arrange(desc(positive_ratio)) %>%
slice_head(n = 10)
ggplot(top_ratio, aes(x = reorder(Country_Region, positive_ratio), y = positive_ratio)) +
geom_col(fill = 'orange') +
labs(title = 'Top Ten Highest Positive Case Ratios by Country (As of Nov 5th 2020)', x = 'Countries', y = 'Positive Case Ratio') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This plot should not be interpreted as showing the countries with the most COVID-19 cases overall. Many of the countries with high positive case ratios have relatively small populations, and may not appear among the global leaders in total testing. However, it is plausible that these countries conducted high levels of testing per capita, which would not be evident in total counts. For instance, Costa Rica was praised in the early stages of the pandemic for its proactive monitoring strategies, even as it showed one of the highest positive case ratios. Ultimately, a definitive understanding of global positive case ratios would require consistent, accurate data reporting from all countries and an analysis that adjusts for population size.