JHU DSS Developing Data Products Week 3 Assignment

01/09/2021

Introduction

Completed as part of the Johns Hopkins University Data Science Specialization, Developing Data Products course.

The task for this assignment was simply to “create a web page presentation using R Markdown that features a plot created with Plotly”.

For this task, I display on a map the total number of Covid-19 deaths in each country up to August 24th 2020 and up to August 24th 2021. For comparison, I also add the growth between the two dates for each country (2021 deaths as a multiple of 2020 deaths).

Getting and Cleaning the Data

I will access three datasets: one with the Covid-19 figures for 24 August 2020, one with the figures for 24 August 2021, and one with the coordinates of each country.

A little further processing is required, as some of the country names don’t match between the datasets and in a few cases coordinates are missing. I first edit the (factor) country names and then add the missing coordinates, which gives the final clean coordinate dataset.

I will display the first few entries of each dataframe.

Getting and Cleaning the Data

Downloading and storing 3 separate datasets.

library(plotly)
library(dplyr)

URL2020 <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/08-24-2020.csv"
URL2021 <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/08-24-2021.csv"
URLcountries <- "https://raw.githubusercontent.com/albertyw/avenews/master/old/data/average-latitude-longitude-countries.csv"
download.file(URL2020, "./08-24-2020.csv")
download.file(URL2021, "./08-24-2021.csv")
download.file(URLcountries, "./countries.csv")

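# stringsAsFactors = TRUE reads the country names as factors (the default
# before R 4.0), which the levels() calls on the following slides rely on
data2020 <- read.csv("./08-24-2020.csv", stringsAsFactors = TRUE)
data2021 <- read.csv("./08-24-2021.csv", stringsAsFactors = TRUE)
countries <- read.csv("./countries.csv", stringsAsFactors = TRUE)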

Getting and Cleaning the Data

Changing factor level names of countries with differing names.

levels(countries$Country)[levels(countries$Country) ==
    "United States"] <- "US"
levels(countries$Country)[levels(countries$Country) ==
    "Myanmar"] <- "Burma"
levels(countries$Country)[levels(countries$Country) ==
    "Cape Verde"] <- "Cabo Verde"
levels(countries$Country)[levels(countries$Country) ==
    "Congo"] <- "Congo (Brazzaville)"
levels(countries$Country)[levels(countries$Country) ==
    "Congo, The Democratic Republic of the"] <- "Congo (Kinshasa)"
levels(countries$Country)[levels(countries$Country) ==
    "Czech Republic"] <- "Czechia"
levels(countries$Country)[levels(countries$Country) ==
    "Swaziland"] <- "Eswatini"
levels(countries$Country)[levels(countries$Country) ==
    "Iran, Islamic Republic of"] <- "Iran"

Getting and Cleaning the Data

levels(countries$Country)[levels(countries$Country) ==
    "Korea, Republic of"] <- "Korea, South"
levels(countries$Country)[levels(countries$Country) ==
    "Lao People's Democratic Republic"] <- "Laos"
levels(countries$Country)[levels(countries$Country) ==
    "Libyan Arab Jamahiriya"] <- "Libya"
levels(countries$Country)[levels(countries$Country) ==
    "Moldova, Republic of"] <- "Moldova"
levels(countries$Country)[levels(countries$Country) ==
    "Russian Federation"] <- "Russia"
levels(countries$Country)[levels(countries$Country) ==
    "Syrian Arab Republic"] <- "Syria"
levels(data2020$Country_Region)[levels(data2020$Country_Region) ==
    "Taiwan*"] <- "Taiwan"
levels(data2021$Country_Region)[levels(data2021$Country_Region) ==
    "Taiwan*"] <- "Taiwan"
levels(countries$Country)[levels(countries$Country) ==
    "Tanzania, United Republic of"] <- "Tanzania"

Getting and Cleaning the Data

levels(countries$Country)[levels(countries$Country) ==
    "Palestinian Territory"] <- "West Bank and Gaza"

Getting and Cleaning the Data

Adding coordinates for countries that were not included in the original dataframe that had latitude and longitude information.

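# ISO codes: MK = North Macedonia, SS = South Sudan; XK is the
# user-assigned (unofficial) code commonly used for Kosovo
missCountries <- data.frame(ISO.3166.Country.Code = c("MK",
    "XK", "SS"), Country = c("North Macedonia", "Kosovo",
    "South Sudan"), Latitude = c(41.6086, 42.6026,
    6.877), Longitude = c(21.7453, 20.903, 31.307))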

countries <- rbind(countries, missCountries)
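
Before adding rows by hand, it helps to list which countries in the Covid-19 data have no match in the coordinates data. This is a minimal sketch with made-up name vectors; `covid_names` and `coord_names` are hypothetical stand-ins for `data2020$Country_Region` and `countries$Country`.

```r
# Hypothetical country names from the Covid-19 data
covid_names <- c("Kosovo", "France", "South Sudan", "US")

# Hypothetical country names from the coordinates data
coord_names <- c("France", "US")

# Names present in the Covid-19 data but absent from the coordinates data
missing_coords <- setdiff(covid_names, coord_names)
missing_coords
```

On the real data the same call would be made on the two country columns after converting them with `as.character()`.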

Getting and Cleaning the Data

Displaying first entries of each data frame.

head(data2020)[, c(4, 6, 7, 8, 9)]

##        Country_Region       Lat     Long_ Confirmed Deaths
## 1         Afghanistan  33.93911  67.70995     38045   1390
## 2             Albania  41.15330  20.16830      8605    254
## 3             Algeria  28.03390   1.65960     41858   1446
## 4             Andorra  42.50630   1.52180      1060     53
## 5              Angola -11.20270  17.87390      2222    100
## 6 Antigua and Barbuda  17.06080 -61.79640        94      3

Getting and Cleaning the Data

head(data2021)[, c(4, 6, 7, 8, 9)]

##        Country_Region       Lat     Long_ Confirmed Deaths
## 1         Afghanistan  33.93911  67.70995    152660   7083
## 2             Albania  41.15330  20.16830    140521   2480
## 3             Algeria  28.03390   1.65960    192626   5063
## 4             Andorra  42.50630   1.52180     15003    130
## 5              Angola -11.20270  17.87390     46340   1166
## 6 Antigua and Barbuda  17.06080 -61.79640      1540     43

Getting and Cleaning the Data

head(countries)

##   ISO.3166.Country.Code              Country Latitude Longitude
## 1                    AD              Andorra    42.50      1.50
## 2                    AE United Arab Emirates    24.00     54.00
## 3                    AF          Afghanistan    33.00     65.00
## 4                    AG  Antigua and Barbuda    17.05    -61.80
## 5                    AI             Anguilla    18.25    -63.17
## 6                    AL              Albania    41.00     20.00

Data Processing

Next, using for loops, I will combine the Covid-19 deaths of countries that are reported as multiple regions into a single observation per country.

country_names_2020 <- unique(as.character(data2020$Country_Region))

# Start from an empty data frame with the final column names
df1 <- data.frame(Country = character(), Deaths_2020 = numeric())
for (x in country_names_2020) {
    # Sum the deaths over every region reported for country x
    death_total_2020 <- sum(data2020$Deaths[data2020$Country_Region ==
        x])
    df1 <- rbind(df1, data.frame(Country = x,
        Deaths_2020 = death_total_2020))
}

Data Processing

country_names_2021 <- unique(as.character(data2021$Country_Region))

# Start from an empty data frame with the final column names
df2 <- data.frame(Country = character(), Deaths_2021 = numeric())
for (x in country_names_2021) {
    # Sum the deaths over every region reported for country x
    death_total_2021 <- sum(data2021$Deaths[data2021$Country_Region ==
        x])
    df2 <- rbind(df2, data.frame(Country = x,
        Deaths_2021 = death_total_2021))
}
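
The per-country totals built by the loops above can also be computed in a single step with base R's `aggregate()`. The sketch below uses a small made-up data frame (`toy`) rather than the real daily report, so the numbers are illustrative only.

```r
# Made-up data: some countries are split over several regional rows
toy <- data.frame(
    Country_Region = c("US", "US", "France", "France", "Chad"),
    Deaths = c(10, 5, 3, 4, 1)
)

# Sum Deaths within each country; the formula interface drops rows with NA
totals <- aggregate(Deaths ~ Country_Region, data = toy, FUN = sum)
names(totals) <- c("Country", "Deaths_2020")
totals
```

The result has one row per country, in the same two-column shape as `df1`.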

Data Processing

head(df1)

##               Country Deaths_2020
## 1         Afghanistan        1390
## 2             Albania         254
## 3             Algeria        1446
## 4             Andorra          53
## 5              Angola         100
## 6 Antigua and Barbuda           3

Data Processing

head(df2)

##               Country Deaths_2021
## 1         Afghanistan        7083
## 2             Albania        2480
## 3             Algeria        5063
## 4             Andorra         130
## 5              Angola        1166
## 6 Antigua and Barbuda          43

Data Processing

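Creating a growth column whilst merging all the datasets into one. Note that the column, although named Percentage_Growth, holds the ratio of 2021 deaths to 2020 deaths (a multiple, not a percentage).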

df3 <- merge(df1, df2, all = FALSE) %>%
    mutate(Percentage_Growth = signif(Deaths_2021/Deaths_2020,
        3))

df4 <- merge(countries, df3, all = FALSE) %>%
    select(-2) %>%
    rename(lat = 2, lng = 3)
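
The `Percentage_Growth` column above is computed as `Deaths_2021/Deaths_2020`, so it is a growth multiple. As a worked comparison, the sketch below uses the Afghanistan figures from the earlier tables to show that multiple next to a percentage increase in the usual sense. The variable names are illustrative and the percentage formula is an alternative, not what the analysis uses.

```r
# Figures for Afghanistan taken from the tables above
deaths_2020 <- 1390
deaths_2021 <- 7083

# What the Percentage_Growth column stores: 2021 deaths as a multiple of 2020 deaths
growth_ratio <- signif(deaths_2021 / deaths_2020, 3)

# Percentage increase in the usual sense (an alternative, not used in the analysis)
pct_increase <- signif((deaths_2021 - deaths_2020) / deaths_2020 * 100, 3)

c(growth_ratio = growth_ratio, pct_increase = pct_increase)
```

Here the multiple is 5.1 while the percentage increase is about 410.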

Data Processing

head(df4)

##               Country    lat   lng Deaths_2020 Deaths_2021 Percentage_Growth
## 1         Afghanistan  33.00  65.0        1390        7083              5.10
## 2             Albania  41.00  20.0         254        2480              9.76
## 3             Algeria  28.00   3.0        1446        5063              3.50
## 4             Andorra  42.50   1.5          53         130              2.45
## 5              Angola -12.50  18.5         100        1166             11.70
## 6 Antigua and Barbuda  17.05 -61.8           3          43             14.30

The Map

I want to use colour to show which countries suffered the largest relative increases in Covid-19 deaths. The growth column contains some NA values (for example, when a country has 0 deaths at both dates), some extremely large values and some infinite values (0 deaths in 2020), so I will create an extra dataframe that acts as a colour key. In it, all values above 100 (including infinity) are set to 100 and all NA values to 0.

df5 <- df4

df5$Percentage_Growth[is.na(df5$Percentage_Growth)] <- 0
df5$Percentage_Growth[df5$Percentage_Growth > 100] <- 100
df5$Percentage_Growth[df5$Percentage_Growth == Inf] <- 100
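
The same capping can be written more compactly with `pmin()`. This is a sketch on a made-up vector (`growth` is a hypothetical stand-in for `df4$Percentage_Growth`), showing how NaN, Inf and very large values end up in the colour key.

```r
# Made-up growth values, including the awkward cases
growth <- c(5.1, NaN, Inf, 250, 2)

colour_key <- growth
# 0/0 gives NaN, which is.na() also catches; map these to 0
colour_key[is.na(colour_key)] <- 0
# Cap everything at 100; Inf is capped as well, so no separate step is needed
colour_key <- pmin(colour_key, 100)
colour_key
```

The NA values have to be replaced first, because `pmin()` would otherwise pass them through unchanged.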

The Map

The following figure shows, by means of the colour key, the countries with the largest relative increase in Covid-19 deaths between August 2020 and August 2021. These are countries that may have fared well in the first 5 months of the pandemic but subsequently suffered as the virus spread worldwide.

*Many of the countries at the upper end of the colour spectrum will of course be countries that had few or no deaths in those first 5 months; their relative increase is large even if their number of deaths in 2021 is comparatively low.

*With many countries showing increases of between 2 and 10 times, it is difficult to compare countries within this range. However, the chief goal of the colour code is to highlight the extremes.

fig <- plot_ly(df4, type = "scattergeo", mode = "markers",
    lat = df4$lat, lon = df4$lng, color = df5$Percentage_Growth,
    text = ~paste0(Country, "<br> Deaths 2020: ", Deaths_2020,
        "<br> Deaths 2021: ", Deaths_2021, "<br> Percentage Growth: ",
        Percentage_Growth))
fig