This is an exploratory analysis of the coronavirus data set obtained from the Our World in Data (OWID) website. The data covers the period from 2019-12-31 up to and including 2020-08-11.
Since actions speak louder than words, let’s dive straight into it!
library(readr)
d <- read_csv("~/Documents/Personal/R-projects/owid-covid-data.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## iso_code = col_character(),
## continent = col_character(),
## location = col_character(),
## date = col_date(format = ""),
## new_tests = col_logical(),
## total_tests = col_logical(),
## total_tests_per_thousand = col_logical(),
## new_tests_per_thousand = col_logical(),
## new_tests_smoothed = col_logical(),
## new_tests_smoothed_per_thousand = col_logical(),
## tests_per_case = col_logical(),
## positive_rate = col_logical(),
## tests_units = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 108700 parsing failures.
## row col expected actual file
## 1169 new_tests 1/0/T/F/TRUE/FALSE 2.0 '~/Documents/Personal/R-projects/owid-covid-data.csv'
## 1169 total_tests 1/0/T/F/TRUE/FALSE 2.0 '~/Documents/Personal/R-projects/owid-covid-data.csv'
## 1169 total_tests_per_thousand 1/0/T/F/TRUE/FALSE 0.0 '~/Documents/Personal/R-projects/owid-covid-data.csv'
## 1169 new_tests_per_thousand 1/0/T/F/TRUE/FALSE 0.0 '~/Documents/Personal/R-projects/owid-covid-data.csv'
## 1169 tests_units 1/0/T/F/TRUE/FALSE people tested '~/Documents/Personal/R-projects/owid-covid-data.csv'
## .... ........................ .................. ............. .....................................................
## See problems(...) for more details.
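The parsing failures above most likely arise because readr guesses each column's type from the first rows of the file (1000 by default), and the testing columns are empty there, so they are guessed as logical. Here is a minimal sketch of one way around this, using a tiny made-up CSV rather than the OWID file: declare the types of the affected columns explicitly through `col_types`. The file contents and the name `d_demo` are assumptions for illustration only.

```r
library(readr)

# Tiny made-up CSV (not the OWID file): the testing columns start out empty,
# which is what misleads type guessing on the real data.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "location,date,new_tests,tests_units",
  "A,2020-01-01,,",
  "A,2020-01-02,,",
  "B,2020-01-01,2.0,people tested"
), tmp)

# Declare the problematic columns explicitly; all other columns are still guessed.
d_demo <- read_csv(
  tmp,
  col_types = cols(
    new_tests   = col_double(),
    tests_units = col_character()
  )
)

str(d_demo$new_tests)    # numeric, with NA for the empty cells
nrow(problems(d_demo))   # number of parsing failures
```

Raising the `guess_max` argument of `read_csv()` is another option when listing every affected column by hand is impractical.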
#head(d, n = 1)
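Note that there are a lot of columns containing missing values (the testing columns were also misread as logical, as the parsing failures above show). We will not use these columns in this analysis, so we will get rid of them!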
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
d_clean = d %>% select(-(new_tests:stringency_index)) %>%
select(-extreme_poverty, -female_smokers, -male_smokers)
#head(d_clean, n =1)
Note how we have reduced the number of columns from 36 to 23.
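Rather than dropping columns by eye, the share of missing values per column can be computed first. The sketch below uses a made-up toy data frame standing in for `d`; the 0.5 cut-off is an arbitrary assumption, not a rule used in this post.

```r
# Made-up toy data frame standing in for `d` (values are illustrative only).
toy <- data.frame(
  location    = c("A", "A", "B", "B"),
  total_cases = c(1, 5, NA, 3),
  new_tests   = c(NA, NA, NA, 10)
)

# Share of missing values per column, sorted from most to least missing.
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
na_share

# Keep only columns with at most half of their values missing
# (the 0.5 cut-off is an arbitrary choice for this sketch).
toy_clean <- toy[, na_share[names(toy)] <= 0.5, drop = FALSE]
names(toy_clean)
```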
Okay, let’s gather some descriptive statistics on the distribution of cases across the globe.
#library(Hmisc)
#describe(d_clean$total_cases)
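Since the Hmisc call above is commented out, here is a minimal base R sketch of the same idea. The vector is made up and merely stands in for `d_clean$total_cases`.

```r
# Made-up values standing in for d_clean$total_cases (illustrative only).
total_cases <- c(12, 150, NA, 3200, 47, 98000, 560)

# Minimum, quartiles, mean, maximum and the number of NAs in one call.
summary(total_cases)

# Selected quantiles; na.rm = TRUE drops the missing value first.
q <- quantile(total_cases, probs = c(0.1, 0.5, 0.9), na.rm = TRUE)
q
```

Case counts are usually heavily right-skewed, so the median and quantiles tend to be more informative than the mean.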
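Enough numbers! Let’s plot the total number of cases per million inhabitants across time and look at every continent separately. This allows us to gain a general understanding of when and where the virus has spread. One caveat: geom_density_ridges() estimates the density of the date values within each continent, so the height of a ridge reflects how many rows the data set holds for that period rather than the number of cases per million itself.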
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
library(ggridges)
continents_overtime = ggplot(d_clean, aes(x = date, y = continent, fill=total_cases_per_million, color = continent)) +
geom_density_ridges(alpha = 0.5) +
theme_ridges() +
theme(legend.position = "none")
continents_overtime
## Picking joint bandwidth of 10.6
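To show the case numbers themselves over time, one option is to aggregate per continent and date and draw a line chart. The sketch below is an alternative to the ridgeline plot, not a reproduction of it; the toy data is made up and only borrows the column names of `d_clean`, and `median_cases_pm` is a hypothetical name.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data with the same column names as d_clean (values are illustrative).
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 6),
  location  = rep(c("A1", "A2", "E1", "E2"), each = 3),
  date      = rep(as.Date("2020-03-01") + 0:2, times = 4),
  total_cases_per_million = c(1, 2, 4,  3, 5, 9,  0, 2, 10,  1, 6, 20)
)

# Median across countries, per continent and date.
by_continent <- toy %>%
  group_by(continent, date) %>%
  summarise(median_cases_pm = median(total_cases_per_million, na.rm = TRUE)) %>%
  ungroup()

# One line per continent.
cases_overtime <- ggplot(by_continent,
                         aes(x = date, y = median_cases_pm, color = continent)) +
  geom_line()
cases_overtime
```

The median is used here so that a single hard-hit country does not dominate its continent; a population-weighted mean would be another reasonable choice.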
Beautiful! Now let’s look at the distribution of cases per country in every continent. For this we need to select the last date only.
d_lastdate = d_clean %>% filter(date == "2020-08-11") %>% as.data.frame()
unique(d_lastdate$date)
## [1] "2020-08-11"
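Hard-coding the date works for this snapshot, but a country without a row for that exact day silently drops out. A minimal sketch of an alternative, on made-up toy data, keeps each country's most recent row instead; `latest` is a hypothetical name.

```r
library(dplyr)

# Made-up toy data: country B has not reported for the final day.
toy <- data.frame(
  location = c("A", "A", "B", "B", "B"),
  date     = as.Date(c("2020-08-10", "2020-08-11",
                       "2020-08-08", "2020-08-09", "2020-08-10")),
  total_cases_per_million = c(10, 12, 3, 4, 5)
)

# Keep the most recent row of every country.
latest <- toy %>%
  group_by(location) %>%
  filter(date == max(date)) %>%
  ungroup() %>%
  as.data.frame()
latest
```

The trade-off is that the rows no longer all refer to the same day, which matters when countries are compared directly.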
Sweet, now let’s move on.
country_continent <- d_lastdate %>%
ggplot( aes(x=continent, y= total_cases_per_million, color = continent)) +
# draw the violin first: its default white fill would otherwise cover the boxplot
geom_violin() +
geom_boxplot(width = 0.1, alpha=0.01) +
geom_point(alpha=0.3) +
theme(legend.position="none")
country_continent
This beaut above is a violin chart. It shows three things: the number of (known) cases per million, the various continents, and the distribution of cases across countries within these continents.
Now let’s see whether population density has anything to do with the spread of the virus.
popdens_total_cases = ggplot(d_lastdate, aes(x= population_density,
y= total_cases_per_million, color = continent)) +
geom_point() + stat_smooth(method = "lm", color = "black") +
ylim(0, 300) +
xlim(0,300)
popdens_total_cases
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 167 rows containing non-finite values (stat_smooth).
## Warning: Removed 167 rows containing missing values (geom_point).
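Whoa! The line in the middle shows an upward trend, which suggests that more densely populated countries tend to report more cases per million. Keep in mind, though, that the axis limits removed 167 rows from both the points and the fit, so the trend only describes the countries that fall inside the plotted range, and a correlation between countries does not prove that density drives the spread.
Let’s now explore the number of deaths per million while keeping both the gross domestic product (GDP) per capita and life expectancy in mind.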
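The "Removed 167 rows" warnings come from xlim() and ylim(): they discard every point outside the limits before the regression line is fitted. As a sketch of the difference, the made-up toy data below is plotted twice, once with scale limits and once with coord_cartesian(), which only zooms the view and leaves the fit on the full data.

```r
library(ggplot2)

# Made-up toy data (illustrative only); two points lie outside the 0-300 window.
toy <- data.frame(
  x = c(10, 50, 100, 200, 400, 800),
  y = c(20, 60, 90, 250, 900, 5000)
)

# Scale limits: out-of-range points are dropped, the line is fitted on the rest.
p_clip <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, color = "black") +
  xlim(0, 300) +
  ylim(0, 300)

# Coordinate zoom: the line is fitted on all points, only the view is cropped.
p_zoom <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, color = "black") +
  coord_cartesian(xlim = c(0, 300), ylim = c(0, 300))

# Number of toy points that survive the scale limits.
n_inside <- sum(toy$x <= 300 & toy$y <= 300)
n_inside
```

For skewed variables such as these, log scales (scale_x_log10() and scale_y_log10()) are another way to show every country without cropping.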
gdp_lexp_tdeaths = ggplot(d_lastdate, aes(x= gdp_per_capita, y=life_expectancy, size = total_deaths_per_million, color = continent)) +
geom_point(alpha =0.5) +
ylim(50,90) +
xlim(0,75000)
gdp_lexp_tdeaths
## Warning: Removed 31 rows containing missing values (geom_point).
It looks like Europe has the highest numbers of deaths per million, even though these countries are among the richest and have the highest life expectancy. Does this have anything to do with the share of elderly citizens?
aged_65_total_case_deaths = ggplot(d_lastdate, aes(x= aged_65_older,
y= total_cases_per_million, size= total_deaths_per_million, color = continent)) +
geom_point(alpha=0.5) +
ylim(0,20000)
aged_65_total_case_deaths
## Warning: Removed 29 rows containing missing values (geom_point).
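To put a number on the relationship in the plot above, a rank correlation is one option. The sketch below uses made-up values that only borrow the column names of `d_lastdate`; Spearman's method is chosen here as an assumption because it is less sensitive to the skew in the death counts.

```r
# Made-up toy data (illustrative only), including one missing value.
toy <- data.frame(
  aged_65_older            = c(3, 5, 8, 12, 18, 20, NA),
  total_deaths_per_million = c(2, 10, 6, 40, 150, 90, 30)
)

# Spearman rank correlation; rows with a missing value are dropped.
rho <- cor(toy$aged_65_older, toy$total_deaths_per_million,
           method = "spearman", use = "complete.obs")
rho
```

A high value would describe countries, not individuals, so on its own it says little about any one person's risk.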
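It looks like countries with a larger share of people aged 65 and older tend to have more deaths per million, which is consistent with older people having an increased risk of dying from the virus. Let’s compare the northwestern part of Europe to the rest of Europe (South/East).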
EU = d_clean %>% filter(continent=="Europe") %>% as.data.frame()
# store the flag as a column of EU so that it stays aligned with the rows being plotted
EU$N_EU = EU$location %in% c("United Kingdom", "Ireland", "Luxembourg", "Belgium", "Netherlands", "Denmark","Iceland", "Switzerland","Austria", "Germany", "Sweden", "Norway", "Finland")
plot_n_eu = ggplot(EU, aes(x = date, y = total_deaths_per_million, color = N_EU)) +
geom_smooth()
plot_n_eu
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 92 rows containing non-finite values (stat_smooth).
On average, more people per million died of the coronavirus in northwestern Europe than in the rest of Europe.
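The smoothed curves average over countries, so a short table per group can help check the comparison. The sketch below uses made-up numbers and hypothetical country labels; only the `N_EU` flag and `total_deaths_per_million` mirror names used above.

```r
library(dplyr)

# Made-up toy data for a single date (illustrative only).
toy <- data.frame(
  location = c("N1", "N2", "N3", "S1", "S2", "S3"),
  N_EU     = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  total_deaths_per_million = c(600, 50, 300, 580, 20, 100)
)

# Number of countries, median and maximum deaths per million in each group.
region_summary <- toy %>%
  group_by(N_EU) %>%
  summarise(
    countries     = n(),
    median_deaths = median(total_deaths_per_million, na.rm = TRUE),
    max_deaths    = max(total_deaths_per_million, na.rm = TRUE)
  )
region_summary
```

Reporting the median next to the maximum shows whether a group's average is driven by a few countries.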