Overview

This final project will consist of webscraping a global COVID-19 dataset from a public wiki page to conduct various data analysis tasks on the data set.

There are 10 main tasks that make up this final project.

Loading Relevant Libraries

library(httr)
library(rvest)

TASK 1: Getting a COVID-19 pandemic Wiki page using HTTP request from the following url.

URL: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country

get_wiki_covid19_page <- GET("https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country")

Calling the get_wiki_covid19_page function to get a http response with the target html page.

get_wiki_covid19_page
Response [https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country]
  Date: 2023-02-18 12:56
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 425 kB
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-langu...
<head>
<meta charset="UTF-8"/>
<title>Template:COVID-19 testing by country - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header...
"wgCSPNonce":false,"wgCanonicalNamespace":"Template","wgCanonicalSpecialPageName":false
"CS1 Khmer-language sources (km)","CS1 uses Chinese-language script (zh)","CS1 Chinese-...
"CS1 foreign language sources (ISO 639-2)","CS1 Burmese-language sources (my)","CS1 Dut...
"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDispl

TASK 2: Extracting the COVID-19 testing data table from the wiki HTML page.

root_node <- read_html("https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country")

Creating a table node highlighting available tables on the webpage.

table_node <- html_nodes(root_node,"table")
table_node

The website shows that the table titled “COVID-19 testing statistics by country” is the second table or table [2] visible in the table node below.

[1] <table class="box-Update plainlinks ombox ombox-content ambox-Update" role="pr ...
[2] <table class="wikitable plainrowheaders sortable collapsible autocollapse">\n< ...
[3] <table class="plainlinks ombox mbox-small ombox-notice" role="presentation"><t ...

Returning table [2] with html_function and then transforming it into a dataframe using the as.data.frame function.

table <- html_table(table_node[2])
table_df <- as.data.frame(html_table(table_node[2]))

Returning the abbreviated dataframe only showing 10 rows.

table_df[1:10,]
Country.or.region     Date.a.      Tested Units.b. Confirmed.cases.
1          Afghanistan 17 Dec 2020     154,767  samples           49,621
2              Albania 18 Feb 2021     428,654  samples           96,838
3              Algeria  2 Nov 2020     230,553  samples           58,574
4              Andorra 23 Feb 2022     300,307  samples           37,958
5               Angola  2 Feb 2021     399,228  samples           20,981
6  Antigua and Barbuda  6 Mar 2021      15,268  samples              832
7            Argentina 16 Apr 2022  35,716,069  samples        9,060,495
8              Armenia 29 May 2022   3,099,602  samples          422,963
9            Australia  9 Sep 2022  78,548,492  samples       10,112,229
10             Austria  1 Feb 2023 205,817,752  samples        5,789,991
   Confirmed..tested.. Tested..population.. Confirmed..population..   Ref.
1                 32.1                 0.40                    0.13    [1]
2                 22.6                 15.0                     3.4    [2]
3                 25.4                 0.53                    0.13 [3][4]
4                 12.6                  387                    49.0    [5]
5                  5.3                  1.3                   0.067    [6]
6                  5.4                 15.9                    0.86    [7]
7                 25.4                 78.3                    20.0    [8]
8                 13.6                  105                    14.3    [9]
9                 12.9                  313                    40.3   [10]
10                 2.8                2,312                    65.0   [11

Returning the structure of the dataframe(table_df) showing all columns to be in character format.

str(table_df)
'data.frame':   173 obs. of  9 variables:
 $ Country.or.region      : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ Date.a.                : chr  "17 Dec 2020" "18 Feb 2021" "2 Nov 2020" "23 Feb 2022" ...
 $ Tested                 : chr  "154,767" "428,654" "230,553" "300,307" ...
 $ Units.b.               : chr  "samples" "samples" "samples" "samples" ...
 $ Confirmed.cases.       : chr  "49,621" "96,838" "58,574" "37,958" ...
 $ Confirmed..tested..    : chr  "32.1" "22.6" "25.4" "12.6" ...
 $ Tested..population..   : chr  "0.40" "15.0" "0.53" "387" ...
 $ Confirmed..population..: chr  "0.13" "3.4" "0.13" "49.0" ...
 $ Ref.                   : chr  "[1]" "[2]" "[3][4]" "[5]" ...

TASK 3: Pre-processing and exporting the extracted data frame.

Below I have used a nested function suggested by the course to pre-process the dataframe (table_df) by removing irrelevant columns, renaming columns, and converting columns into proper data types.

preprocess_covid_data_frame <- function() {
    
    shape <- dim(table_df)
    
    # Remove the World row
    table_df<-table_df[!(table_df$`Country.or.region`=="World"),]
    # Remove the last row
    table_df <- table_df[-173,]
    
    # We dont need the Units and Ref columns, so can be removed
    table_df["Ref."] <- NULL
    table_df["Units.b."] <- NULL
    
    # Renaming the columns
    names(table_df) <- c("country", "date", "tested", "confirmed", 
                         "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")
    
    # Convert column data types
    table_df$country <- as.factor(table_df$country)
    table_df$date <- as.factor(table_df$date)
    table_df$tested <- as.numeric(gsub(",","",table_df$tested))
    table_df$confirmed <- as.numeric(gsub(",","",table_df$confirmed))
    table_df$'confirmed.tested.ratio' <- as.numeric(gsub(",","",table_df$`confirmed.tested.ratio`))
    table_df$'tested.population.ratio' <- as.numeric(gsub(",","",table_df$`tested.population.ratio`))
    table_df$'confirmed.population.ratio' <- as.numeric(gsub(",","",table_df$`confirmed.population.ratio`))
    
    return(table_df)
  }

Creating a new dataframe from the pre-processing nested function.

preprocess_covid_data_frame <- preprocess_covid_data_frame()  

Below are the first 10 rows of the preprocess_covid_data_frame, showing the changes to the column names, which consist of removing of unecessary rows and columns and changing the column formats.

preprocess_covid_data_frame[1:10,]
               country        date    tested confirmed confirmed.tested.ratio
1          Afghanistan 17 Dec 2020    154767     49621                   32.1
2              Albania 18 Feb 2021    428654     96838                   22.6
3              Algeria  2 Nov 2020    230553     58574                   25.4
4              Andorra 23 Feb 2022    300307     37958                   12.6
5               Angola  2 Feb 2021    399228     20981                    5.3
6  Antigua and Barbuda  6 Mar 2021     15268       832                    5.4
7            Argentina 16 Apr 2022  35716069   9060495                   25.4
8              Armenia 29 May 2022   3099602    422963                   13.6
9            Australia  9 Sep 2022  78548492  10112229                   12.9
10             Austria  1 Feb 2023 205817752   5789991                    2.8
   tested.population.ratio confirmed.population.ratio
1                     0.40                      0.130
2                    15.00                      3.400
3                     0.53                      0.130
4                   387.00                     49.000
5                     1.30                      0.067
6                    15.90                      0.860
7                    78.30                     20.000
8                   105.00                     14.300
9                   313.00                     40.300
10                 2312.00                     65.000

Checking the updated column format through the str function which shows the columns have been formatted accordingly.

str(preprocess_covid_data_frame)
'data.frame':   172 obs. of  7 variables:
 $ country                   : Factor w/ 172 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ date                      : Factor w/ 141 levels "1 Aug 2021","1 Feb 2023",..: 31 37 49 60 44 123 29 90 141 2 ...
 $ tested                    : num  154767 428654 230553 300307 399228 ...
 $ confirmed                 : num  49621 96838 58574 37958 20981 ...
 $ confirmed.tested.ratio    : num  32.1 22.6 25.4 12.6 5.3 5.4 25.4 13.6 12.9 2.8 ...
 $ tested.population.ratio   : num  0.4 15 0.53 387 1.3 ...
 $ confirmed.population.ratio: num  0.13 3.4 0.13 49 0.067 0.86 20 14.3 40.3 65 

After making changes to the dataframe(table_df), the write.csv() function is used to save the csv file into a file.

write.csv(preprocess_covid_data_frame, file = "covid.csv")

The below code is run to confirm that the file (covid.csv) has been successfully saved in the directory.

# Get working directory
wd <- getwd()
# Get exported 
file_path <- paste(wd, sep="", "/covid.csv")
# File path
print(file_path)
file.exists(file_path)
> file.exists(file_path)
[1] TRUE

Opening the covid.csv file into a new dataframe.

covid_data_frame_csv <- read.csv("covid.csv", header=TRUE, sep=",")

TASK 4: Getting a subset of the extracted data frame for the 5th to 10th rows from the data frame with only country and confirmed columns selected.

covid_data_frame_subset <- subset(covid_data_frame_csv[(5:10),c("country","confirmed")])

Returning the subset.

covid_data_frame_subset
               country confirmed
5               Angola     20981
6  Antigua and Barbuda       832
7            Argentina   2195722
8              Armenia    177104
9            Australia     29130
10             Austria    488007

TASK 5: Calculating worldwide COVID testing positive ratio by dividing the total confirmed and total tested cases worldwide.

Calculating the total confirmed cases worldwide

tot_confirmed_cases <- sum(covid_data_frame_csv$confirmed)
tot_confirmed_cases
[1] 117313932

Calculating the total tested cases worldwide

tot_tested_cases <- sum(covid_data_frame_csv$tested)
tot_tested_cases
[1] 1698581244

Calculating the positive ratio between the total confirmed cases and the total tested cases.

positive_ratio <- (tot_confirmed_cases/tot_tested_cases)
positive_ratio
[1] 0.06906584

TASK 6: Getting a list of countries who have reported their COVID-19 testing data and sorting them alphabetically in ascending and descending order.

Returning the country column.

country_column <- covid_data_frame_csv$country

Checking the class of the column to ensure it can be easily sorted.

class(country_column)
[1] "character"

Sorting the countries AtoZ and returning a sample result.

sort(country_column)
  [1] "Afghanistan"            "Albania"                "Algeria"               
  [4] "Andorra"                "Angola"                 "Antigua and Barbuda"   
  [7] "Argentina"              "Armenia"                "Australia"             
 [10] "Austria"                "Azerbaijan"             "Bahamas" 

Sorting the countries Z to A and returning a sample result.

sort(country_column, decreasing = TRUE)
[1] "Zimbabwe"               "Zambia"                 "Vietnam"               
  [4] "Venezuela"              "Uzbekistan"             "Uruguay"               
  [7] "United States"          "United Kingdom"         "United Arab Emirates"  
 [10] "Ukraine"                "Uganda"                 "Turkey"  

TASK 7: Identifying country names using a regular expression to find any countries that start with ‘United’

countries_United <- grep('United.+',(covid_data_frame_csv$country), value = TRUE)
print(countries_United)
[1] "United Arab Emirates" "United Kingdom"       "United States"

TASK 8: Creating a subset and comparing the COVID-19 test data between two countries by country name, confirmed cases and confirmed-population-ratio columns.

Subset 1: Finland

finland_subset <- subset(covid_data_frame_csv[55,c("country","confirmed","confirmed.population.ratio")])
finland_subset
   country confirmed confirmed.population.ratio
55 Finland    371135                        6.7

Subset 2: Hungary

subset_hungary <- subset(covid_data_frame_csv[71,c("country","confirmed","confirmed.population.ratio")])

subset_hungary
   country confirmed confirmed.population.ratio
71 Hungary   1909948                       19.8

TASK 9: Comparing which one of the selected countries has a larger ratio of confirmed cases to population through IF and ELSE statements.

if(finland_subset$confirmed > subset_hungary$confirmed){print("Finland has more confirmed cases")}else{
    print("Hungary has more confirmed cases")}
[1] "Hungary has more confirmed cases"

TASK 10: Finding countries with confirmed to population ratio rate less than a set threshold of less than 1% as it may indicate the risk of those countries being relatively low.

Setting the threshold.

threshold = 0.01

Using an IF - ELSE statements to return the country with the higher confirmed covid cases based on the threshold.

countries_less_threshold <- if(threshold == '0.01'){
    subset(covid_data_frame_csv, confirmed.population.ratio <= 0.01)}else{
    subset(covid_data_frame_csv, confirmed.population.ratio)}

Returning the result into a new dataframe by creating a subset only returning the contry name and confirmed population ratio.

countries_result <- subset(countries_less_threshold, select = c("country","confirmed.population.ratio"))
countries_result
        country confirmed.population.ratio
28      Burundi                    0.00740
34     China[c]                    0.00610
89         Laos                    0.00063
119 North Korea                    0.00000
156    Tanzania                    0.00085