(a) Loading the csv and casting the
date column to the date type
We noticed immediately that loading the data-set from the url takes some time, therefore we saved it locally in our repository and saved running time. We accepted the trade-off of having the data less updated, but by doing so, we also managed to understand the results of our code better, since the output changed only because of what we wrote and due to the daily updates in the data.
covid19 = read.csv("owid-covid-data.csv")
# Modifying class of date column
covid19$date = as.Date(covid19$date)
class(covid19$date)
## [1] "Date"
(b) Listing five top dates
After examining the data set, we understood that there are values in
the locations column that refer to categorical location,
such as High income countries. We subset only these
categories into a new data set, and simply selected the top 5 dates in
terms of highest values of new_cases.
high_income_covid19 = filter(covid19, location == "High income") # temp df
#top 5 new_cases
high_income_covid19 %>%
select(date, new_cases) %>%
arrange(desc(new_cases)) %>%
top_n(5)
## Selecting by new_cases
## date new_cases
## 1 2022-01-19 2830533
## 2 2022-01-18 2737804
## 3 2022-01-10 2589365
## 4 2022-01-21 2558959
## 5 2022-01-26 2532201
#top 5 new_deaths
data.table(high_income_covid19 %>%
select(date, new_deaths) %>%
arrange(desc(new_deaths)) %>%
top_n(5))
## Selecting by new_deaths
## date new_deaths
## 1: 2021-01-20 11047
## 2: 2021-01-26 10279
## 3: 2021-01-12 10170
## 4: 2021-01-27 10079
## 5: 2021-01-08 9967
#top 5 new_vaccinations
high_income_covid19 %>%
select(date, new_vaccinations) %>%
arrange(desc(new_vaccinations)) %>%
top_n(5)
## Selecting by new_vaccinations
## date new_vaccinations
## 1 2021-08-04 11980707
## 2 2021-12-17 9748326
## 3 2021-06-10 9538010
## 4 2021-06-11 9504763
## 5 2021-06-09 9452211
The tables show that the top 5 dates with the largest amount of
new_cases recorded were all in January 2022. The 5 dates
with the highest new_deaths were recorded in January 2021.
Regarding the five dates with the most vaccinations, there
is a dispersal in the second half of 2021.
(a) Writing a plot function for high/low income countries
We developed a function that receives a data frame, and filters within it the information on the categories of high / low income according to the input column while excluding NA values and non-numeric values. Inside the function, we used ggplot to display a plot as required.
plot_income <- function(df, col_name, log=FALSE){
income_locations <- c("High income","Low income") #specifying the wanted locations
df %>%
select(location, date, all_of(col_name)) %>% # selecting wanted columns and filtering them
filter(location %in% income_locations) %>%
ggplot(aes_string(x="date",y=all_of(col_name), color="location"),
width = 40, height = 20, units = "cm") +
labs(x="Date", # using string functions to "prettify" the labels
y=str_to_title(str_replace_all(as.character(all_of(col_name)), '_', ' ')),
title = paste(str_to_title(str_replace_all(as.character(all_of(col_name)), '_', ' ')),
"over Time Per Income Category"),
color='Income Category') +
geom_line()
}
(b) Plotting with the function with different columns
(i) plotting new_cases_per_million vs
plotting logged new_cases_smoothed_per_million
# first plot - new_cases_per_million
plot_income(covid19, "new_cases_per_million")
# second plot - logged_new_cases_smoothed_per_million
covid19$logged_new_cases_smoothed_per_million <- log(covid19$new_cases_smoothed_per_million)
plot_income(covid19, "logged_new_cases_smoothed_per_million")
We presented the information in the plot as asked and immediately noticed that the logged & smoothed version of the data we plotted is much more convenient to interpret. Smoothing and using log down-sizes the scale of the data without changing the trends With this sort of plotting it is much simpler to visualize the trends of the column we inspect.
We learn from the second graph that the general trends of disease spread were relatively identical between the two income-types of countries.
(ii) plotting
logged_new_deathes_smoothed and
logged_new_vaccinations_smoothed_per_million
We executed the same function using the log function for a plot the
new_deaths_smoothed and
new_vaccinations_smoothed_per_million columns.
# third plot - logged smoothed_new_deathes
covid19$logged_new_deathes_smoothed <- log(covid19$new_deaths_smoothed)
plot_income(covid19, "logged_new_deathes_smoothed")
# fourth plot - logged new_vaccinations_smoothed_per_million
covid19$logged_new_vaccinations_smoothed_per_million <- log(covid19$new_vaccinations_smoothed_per_million)
plot_income(covid19, "logged_new_vaccinations_smoothed_per_million")
It can be seen in the third graph referring to deaths that the trends
are relatively identical in 2020 and 2022 so
far, but reversed in 2021. The fourth plot showing the
extent of vaccinations presents how High income countries
vaccinated the populations earlier and in a sharp and rapid process,
compared to Low income countries, which started vaccinating
later and at a lower pace.
(a) Defining the currnet data frame
applies fun to all columns specified in .SDcols while grouping by the columns specified in by.
In order to create the current table with most updated data for each
country we used the data.table and functions that it has.
We create a function for the columns that takeing the last element is
not NA by location. After using that function, every last row of each
location has the most updated data for that specific location, at this
point we group the locations by the last date (which is the last row for
each country). Afterwards we make sure to select only the needed
columns.
current = data.table(covid19[,c("location", "continent", "total_cases_per_million", "total_deaths_per_million", "total_vaccinations_per_hundred","people_fully_vaccinated_per_hundred", "total_boosters_per_hundred", "excess_mortality_cumulative_per_million")])
current = current[, lapply(.SD, function(x) last(x[!is.na(x)])), by= location]
(b) Analyzing
total_deaths_per_million
(i) Histogram of the data
current %>%
ggplot(aes(total_deaths_per_million)) +
geom_histogram(binwidth = 30, color = "dark blue") +
xlab("Total Deaths Per Million") +
ylab("Count")+
ggtitle("Histogram Of Total Deaths Per Million")
The histogram does not look close to the normal distribution, it has a right tail and the center of the values is in the left side. Many data-points are on the far left, few are on the far right.
(ii) Skewness and kurtosis of the data
ske = skewness(current$total_deaths_per_million, na.rm = TRUE)
kur = kurtosis(current$total_deaths_per_million, na.rm = TRUE)
As we learned that distributions can exhibit positive or negative skewness to varying degrees and it’s formula is:
\[skewness: \tilde{\mu}_3 \equiv E [(\frac{(x-\mu)}{\sigma})^3]\] The histogram has a Positive skewness = 1.3099515 and this indicates a long right tail. We learned about the kurtosis as well, distributions can exhibit positive or negative kurtosis in various values and it’s formulated as:
\[\tilde{\mu}_4 \equiv E [(\frac{(x-\mu)}{\sigma})^4]\]
\[kurtosis: \kappa \equiv \tilde{\mu}_4 - 3\]
The histogram has a positive kurtosis = 1.6398022 and it indicates a long tail (compared to the normal distribution).
(c) Plotting linear model of
total_deaths_per_million vs
total_cases_per_million
reg = lm(total_deaths_per_million ~ total_cases_per_million, data = current)
ggplot(data = current, mapping = aes(x=total_cases_per_million , y= total_deaths_per_million), col = "blue") +
geom_point() +
geom_abline(intercept = reg$coefficients[1], slope = reg$coefficients[2], col = "blue") +
xlab("Total Cases Per Million") +
ylab("Total Deaths Per Million") +
ggtitle("Total Deaths Vs Total Cases Per Million")
The slope tells us how much we can expect Y to change as X increases, and can be used to estimate an average rate of change. The greater the magnitude of the slope, the steeper the line and the greater the rate of change. The slope of the regression is 0.0041153 and it means that there is a rate of 0.0041153 in change of total deaths as total cases change. The intercept is 581.912673.
(a) Analyzing total number of vaccinations per hundred in all the countries by continent:
(i) Box-plot of the data
First we made sure to select only locations with continents from
current data frame. Then we box-plotted the data:
continents = c("Africa", "Asia", "Europe", "North America", "Oceania", "South America")
curr = subset(current, continent %in% continents) # Dropping out the lines without continent.
curr %>%
ggplot(aes(x=continent, y=total_vaccinations_per_hundred, fill=continent)) +
labs(x="Continents",
y= "Total Vaccinations Per Hundred",
fill="Continents",
title = "Boxplot of Total Vaccinations per hunderd for Each Continent")+
geom_boxplot()
We got a box-plot of all the continent’s colored differently. For each continent we can see the spread of the results, the values range between 25th and 75th quartiles, the median, and whether are any outlines. We see that in most continents vaccination ratios spread at the range 100-200, while Africa stands down with lower range of vaccination ratio.
(ii) Finding 2 outlier countries in the data
We used simple selection with filters to output the outliers which we saw in the boxplot above.
# First outlier
max_location = curr$location[curr$total_vaccinations_per_hundred == max(curr$total_vaccinations_per_hundred)]
max_vaccination = curr$total_vaccinations_per_hundred[curr$location == max_location]
# Second outlier
North_America = curr %>%
select(location, continent, total_vaccinations_per_hundred) %>%
filter(continent == "North America") %>%
arrange(desc(total_vaccinations_per_hundred)) %>%
head(1)
max_north_america = as.character(North_America[1])
max_vaccination_na = as.numeric(North_America[3])
Looking at the results of the box-plot, we can notice 4 outliers in
Africa, one in Europe, and one in North America. The maximum
total_vaccinations_per_hundred in the data is the country
Gibraltar” from Europe and it’s value is 353.04.
Another outline is the maximum
total_vaccinations_per_hundred in North America, which is
Cuba with 313.99 vaccinations per hundred.
(b) Defining and plotting the Booster Ratio
# Defining a new column - booster_ratio
covid19$booster_ratio = covid19$total_boosters/covid19$people_fully_vaccinated
# Plot of `booster_ratio` as a function of time
covid19 %>%
filter(location %in% continents) %>%
ggplot(aes(x=date, y=booster_ratio, color=location)) +
geom_line(size=1.2, alpha=0.5) +
xlim(as.Date(c('1/1/2021','13/4/2022'),format="%d/%m/%Y"))+
labs(x="Date",
y= "Booster Ratio",
color="Continents") +
ggtitle("Booster Ratio Per Continent Over Time")
We can notice that the third vaccine (the booster) started for the first time around January 2021 in South America and Europe, and later on in July 2021 in Asia and then in North America, South America, Europe, Africa and Oceania. There is an anomaly in the booster ratio of South America in January 2021 which is in contrast of what we expected.
(a) Monthly and yearly boxplot of
new_cases
We filterd out the locations without continents and then selected the desierd values using group_by function. Then we ploted with ggplot the boxplot.
#Assigning a new df for Grouping the data to months,years and locations
months_df <- covid19 %>%
filter(continent != "") %>% # avoiding calculating the continents twice and other non-countries
select(location, date, new_cases_per_million,new_deaths,new_vaccinations) %>%
mutate(year_month = format(as.Date(date), "%Y-%m")) %>% # Grouping by month and year
group_by(location,year_month) %>%
summarise(monthly_new_cases_per_mill = sum(new_cases_per_million, na.rm = TRUE),
monthly_new_deaths = sum(new_deaths, na.rm = TRUE),
monthly_new_vaccinations = sum(new_vaccinations, na.rm = TRUE), .groups = "keep")
# first plot - Monthly New Cases Per Million
months_df %>%
ggplot(aes(x=year_month,
y=monthly_new_cases_per_mill,
fill=year_month,)) +
labs(x="Year & Month",
y= "Monthly New Cases Per Million",
color="Year & Month") +
theme(axis.text.x=element_text(angle = 90, hjust = 0))+
geom_boxplot(show.legend = FALSE, ) +
ggtitle("Wrold Monthly New Cases Per Million Distributions Over Time")
As seen in the plot above, the ratio of monthly new cases per million in the world grows with time.
(b) Plotting two more disticutions and concluding them
Here we applied the same code from (a)
#second plot - new_deaths:
months_df %>%
ggplot(aes(x=year_month,
y=monthly_new_deaths,
fill=year_month,)) +
labs(x="Year & Month",
y= "Monthly New Deaths",
color="Year & Month") +
theme(axis.text.x=element_text(angle = 90, hjust = 0))+
geom_boxplot(show.legend = FALSE, ) +
ggtitle("Wrold Monthly New Deaths Distributions Over Time")
#third plot - new_vaccinations:
months_df %>%
ggplot(aes(x=year_month,
y=monthly_new_vaccinations,
fill=year_month,)) +
labs(x="Year & Month",
y= "Monthly New Vaccinations",
color="Year & Month") +
theme(axis.text.x=element_text(angle = 90, hjust = 0))+
geom_boxplot(show.legend = FALSE, ) +
ggtitle("Wrold Monthly New Vaccinations Distributions Over Time")
We see a significant variance in all plots, which is quite expected since the world is diversified. From the first plot, it can be seen that in recent months there has been a significant increase in the new cases, as remembered the varient of omicron appeared in these months. From the second plot, it can be seen that there is actually a greater concentration in the middle of the period, i.e. at the beginning of 2021, of deaths. From the third plot, it can be seen that the vaccines began in the middle of the period, i.e at the beginning of 2021, and as we remember in these months, vaccinations also began in Israel.
(i) Adding an R_cases column to the
covid19 data frame
First we created a temporary data frame with colunmns
location, date and
new_cases_smoothed, then we used the lag
function to create a new column of the new_cases_smoothed 7
days before the date value, and then we calculated the R_case using a
function we built and apply.
# Defining temporary table with specific columns
temp_df <- covid19 %>%
select(location, date,
new_cases_smoothed)
n <- length(covid19$location)
# Adding a column of new_cases_smoothed_7_days_ago using lag function
temp_df <- temp_df %>%
mutate(location_2 = lag(location, n=7),
new_cases_smoothed_7_days_ago = lag(new_cases_smoothed, n=7),
R_cases = numeric(length=n))
temp_df$new_cases_smoothed = as.numeric(temp_df$new_cases_smoothed)
temp_df$new_cases_smoothed_7_days_ago = as.numeric(temp_df$new_cases_smoothed_7_days_ago)
# Defining a function that assigns an R_cases value to a cell (using apply later)
caclculate_R_cases <- function(df_row){
if(is.na(df_row["new_cases_smoothed"]) | is.na(df_row["new_cases_smoothed_7_days_ago"])){
df_row["R_cases"] <- NA
}else if(as.numeric(df_row["new_cases_smoothed_7_days_ago"]) == 0){
df_row["R_cases"] <- NA
}else if(df_row["location"]!=df_row["location_2"]){
df_row["R_cases"] <- NA
}else{
df_row["R_cases"] <- as.numeric(df_row["new_cases_smoothed"]) / as.numeric(df_row["new_cases_smoothed_7_days_ago"])
}
}
# Applying on a vector of R_cases assigning it as column in covid19 data frame
R_cases <- apply(X=temp_df,MARGIN = 1,FUN = caclculate_R_cases)
covid19$R_cases = R_cases
temp_df$R_cases = R_cases #we will use this later
(ii) Plotting the R_cases values for
Israel, UK, and US
We made a simple plot using ggplot.
locations_vec <- c("Israel", "United Kingdom", "United States")
# plot with outlires included (no filtering)
covid19 %>% select(location, R_cases,date) %>%
filter(location %in% locations_vec) %>%
ggplot(aes(x = date, y = R_cases, color = location)) +
geom_point() +
geom_line(alpha=0.5)
The plot is not clear because of some outlines in US’s data. We decided to plot it again filtering the outline as seen below:
# plot with outlines excluded
covid19 %>% select(location, R_cases,date) %>%
filter(location %in% locations_vec) %>%
ggplot(aes(x = date, y = R_cases, color = location)) + ylim(0,10) +
geom_point(size=1.5, alpha = 0.6) +
geom_line(alpha=0.5)
(iii) listing in a table the number of days of spread
We filterd out and counted with dyplr the requested data.
spreading_days <- temp_df %>%
filter(location %in% locations_vec) %>%
filter(R_cases > 1) %>%
count(location,
sort = TRUE,
name = "num of disease spreading days")
spreading_days
## location num of disease spreading days
## 1 United Kingdom 443
## 2 Israel 400
## 3 United States 395
United kingdom had the most days out of the three.
We used the requested functions for displaying the map-plots as follows:
(i) Defining relevant objects
sPDF <- joinCountryData2Map(current, joinCode = "NAME"
, nameJoinColumn = "location")
## 218 codes from your data successfully matched countries in the map
## 24 codes from your data failed to match with a country code in the map
## 25 codes from the map weren't represented in your data
par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
(ii) Mapping the world by
total_deaths_per_million
mapCountryData(sPDF, nameColumnToPlot="total_deaths_per_million",catMethod = "fixedWidth", missingCountryCol = "white", mapTitle = "Total Deaths Per Million", aspect = "variable", colourPalette = "magma")
# Presenting in a table
current %>%
select(location, total_deaths_per_million) %>%
arrange(desc(total_deaths_per_million)) %>%
head(3)
## location total_deaths_per_million
## 1: Peru 6370.855
## 2: Bulgaria 5327.220
## 3: Bosnia and Herzegovina 4823.716
We got 3 countries Peru, Bulgaria and
Bosnia and Herzegovina in red color and the rest of the
countries show values that are relatively intermediate or low.
(iii) Mapping the world by
total_vaccinations_per_hundred
mapCountryData(sPDF, nameColumnToPlot="total_vaccinations_per_hundred",catMethod = "fixedWidth", missingCountryCol = "white", mapTitle = "Total Vaccinations Per Hundred map", aspect = "variable",colourPalette = "terrain")
# Presenting in a table
current %>%
select(location, total_vaccinations_per_hundred) %>%
arrange(desc(total_vaccinations_per_hundred)) %>%
head(3)
## location total_vaccinations_per_hundred
## 1: Gibraltar 353.04
## 2: Cuba 313.99
## 3: Chile 266.58
It can be seen from the map that the number of vaccinated in
Africa is very low, Russia and
India the number of vaccinated is lower than the median
while most of America is above the median.
(iv) Mapping the world by
excess_mortality_cumulative_per_million
mapCountryData(sPDF, nameColumnToPlot="excess_mortality_cumulative_per_million",catMethod = "fixedWidth", missingCountryCol = "white", mapTitle = "Excess Mortality Cumulative Per Million", aspect = "variable",colourPalette = "magma")
# Presenting in a table
current %>%
select(location, excess_mortality_cumulative_per_million) %>%
arrange(desc(excess_mortality_cumulative_per_million)) %>%
head(3)
## location excess_mortality_cumulative_per_million
## 1: Bulgaria 9573.960
## 2: Serbia 8158.629
## 3: Russia 7975.083
It can be seen that compared to previous years, in
Russia and Eastern-European countries, the excess rate is
very large (it can be amused that this is due to under-reporting in
previous years but there may be other reasons). In contrast, in
countries like Australia and Greenland the
excess is negative.
(a) Creating a cross_correlation function
cross_corr <- function(df, country, column_x, column_y){
tempdf = df %>%
filter(location == country) %>%
select(date, all_of(column_x), all_of(column_y)) #creating a temporary data frame
X = tempdf[all_of(column_x)]
Y = tempdf[all_of(column_y)]
result = vector("numeric" , 121 ) #creating a new vector of the result
names(result) = c(-60:60) # define the names of the values
delta_t= -60 # temporary value for the loop
for (i in 1:121){
if (delta_t<0){ #Calculating correlation with a time difference as in the definition
result[i] = cor(Y[(1+abs(delta_t)):nrow(X),1],
X[1:(nrow(Y)-abs(delta_t)),1],
method = "pearson",
use = "pairwise.complete.obs")
}else {
result[i] = cor(X[(1+abs(delta_t)):nrow(Y),1],
Y[1:(nrow(X)-abs(delta_t)),1],
method = "pearson",
use = "pairwise.complete.obs")
}
if(delta_t < 60){ #the range of the time difference is -60:60
delta_t= delta_t+1
}else{
return(result)
}
}
}
(b) Plotting the output of our function with world new cases and deaths (smoothed)
#Defining a vector of the data and plotting it
world_cc <- cross_corr(covid19,"World","new_cases_smoothed","new_deaths_smoothed")
plot(x=names(world_cc), y=world_cc, type = "h", main = "Cross-Correlation of World New Cases and Deaths (Smoothed)", xlab = "Time Difference", ylab = "Correlation Level")
# Finding the maximal value and it's time difference (presented in the text)
max_corr = max(world_cc)
max_time_diff = names(world_cc[world_cc==max_corr])
The results show that the cross correlation is maximized around the time delay of -13
(a) Calculating and plotting death_rate
per continents and entire world
For the presenting the entire world, we chose to present the values
of the location World, after we made sure that it does
represents the data of the entire world and read it in the OWID
Changelog.
covid19$death_rate = covid19$total_deaths / covid19$total_cases
covid19["continent"][covid19["location"] == "World"] <- "World"
covid19 %>%
filter(continent != "") %>%
group_by(continent, date) %>%
summarize(avg = mean(death_rate, na.rm = TRUE), .groups = 'keep') %>%
group_by(continent) %>%
filter(date >= '2021-01-01') %>%
ggplot(aes(x=date, y=avg, color=continent)) +
geom_line(size=0.5) +
ggtitle("Average Death Rate Over Time") +
labs(x="Date",y = "Average Death Rate",color = "Continents")
The plot shows significant decrease Death Rates since the end of year 2021. We suppose that it is correlated with type of variants that were most common at a several time. Many people were infected with the Omicron variant which spread at the begging of 2021, but it was considers least fatal and that is why the death rate decreased.
(b) Calculating and plotting
total_vaccinations_per_hundred per continents
and entire world
We did the same as in (a).
covid19 %>%
filter(continent != "") %>%
group_by(continent, date) %>%
summarize(avg = mean(total_vaccinations_per_hundred, na.rm = TRUE), .groups = 'keep') %>%
filter(date >= '2021-01-01') %>%
ggplot(aes(x=date, y=avg, color=continent)) +
geom_line(size=0.5) +
ggtitle("Average Total Vaccinations Per Hundred Over Time") +
labs(x="Date",y = "Average Total Vaccinations Per Hundred",color = "Continents")
We do see that the plots suggest a negative correlation between vaccination and death rate,
(a) Scatter plot of the current date
total_deaths_per_million vs. the
excess_mortality_cumulative_per_million
We added a diff column to current data
frame and filled it the the absalute value of the difference between the
excess and reported mortality rates. afterwards we plotted the data and
used geom_text_repel to mark the countries that hada
difference bigger then 2000 of the rates.
current$diff = abs(current$excess_mortality_cumulative_per_million-current$total_deaths_per_million) # Defining the difference between the columns (with abs)
a = ifelse(current$diff >= 2000, "dark blue", "black")
ggplot(current, aes(x=excess_mortality_cumulative_per_million, y=total_deaths_per_million)) +
geom_point(col = a) +
geom_text_repel(size = 3, data = current %>%
filter(diff >= 2000 | diff <= -2000),
aes(label=location), col = "dark blue") + # Marking on the graph in color all the countries with diff >= 2000
geom_abline(slope = 1, intercept = 0, col = "blue") +
geom_abline(slope = 1, intercept = 2000, col = "red") +
geom_abline(slope = 1, intercept = -2000, col = "green") +
ggtitle("Total Death ~ Excess Mortality Cumulative (Per Million)")+
xlab("Excess Mortality Cumulative (Per Million)") +
ylab("Total Deaths (Per Million)")
We can see that there are quite a few countires that has significant
diff values.
(b) Plotting 3 excising countries
From the named countries in the previous plot chose 3 random countries and made sure they had over 50 records and their plot is blow.
country = c("South Africa", "Bulgaria", "Lithuania")
covid19 %>%
filter(location %in% country) %>%
select(location, date, excess_mortality_cumulative_per_million,total_deaths_per_million) %>%
group_by(location) %>%
ggplot(aes(x=date, y=total_deaths_per_million, colour=location)) +
geom_line(size=0.5) +
geom_point(size=0.5, aes(y=excess_mortality_cumulative_per_million)) +
labs(y = "Total Deaths Per Million (line),
Excess Mortality Cumulative Per Million (dots)", x = "Date", colour = "Location")
The plot shows that the difference between reported deaths and excess rates grows in time. We see that until July 2020 the difference wasn’t significant. Until that time we see the most excess mortality not explained by Covid-19.