This final project will consist of webscraping a global COVID-19 dataset from a public wiki page to conduct various data analysis tasks on the data set.
There are 10 main tasks that make up this final project.
library(httr)
library(rvest)
TASK 1: Getting a COVID-19 pandemic Wiki page using HTTP request from the following url.
URL: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country
get_wiki_covid19_page <- GET("https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country")
Calling the get_wiki_covid19_page function to get a http response with the target html page.
get_wiki_covid19_page
Response [https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country]
Date: 2023-02-18 12:56
Status: 200
Content-Type: text/html; charset=UTF-8
Size: 425 kB
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-langu...
<head>
<meta charset="UTF-8"/>
<title>Template:COVID-19 testing by country - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header...
"wgCSPNonce":false,"wgCanonicalNamespace":"Template","wgCanonicalSpecialPageName":false
"CS1 Khmer-language sources (km)","CS1 uses Chinese-language script (zh)","CS1 Chinese-...
"CS1 foreign language sources (ISO 639-2)","CS1 Burmese-language sources (my)","CS1 Dut...
"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDispl
TASK 2: Extracting the COVID-19 testing data table from the wiki HTML page.
root_node <- read_html("https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country")
Creating a table node highlighting available tables on the webpage.
table_node <- html_nodes(root_node,"table")
table_node
The website shows that the table titled “COVID-19 testing statistics by country” is the second table or table [2] visible in the table node below.
[1] <table class="box-Update plainlinks ombox ombox-content ambox-Update" role="pr ...
[2] <table class="wikitable plainrowheaders sortable collapsible autocollapse">\n< ...
[3] <table class="plainlinks ombox mbox-small ombox-notice" role="presentation"><t ...
Returning table [2] with html_function and then transforming it into a dataframe using the as.data.frame function.
table <- html_table(table_node[2])
table_df <- as.data.frame(html_table(table_node[2]))
Returning the abbreviated dataframe only showing 10 rows.
table_df[1:10,]
Country.or.region Date.a. Tested Units.b. Confirmed.cases.
1 Afghanistan 17 Dec 2020 154,767 samples 49,621
2 Albania 18 Feb 2021 428,654 samples 96,838
3 Algeria 2 Nov 2020 230,553 samples 58,574
4 Andorra 23 Feb 2022 300,307 samples 37,958
5 Angola 2 Feb 2021 399,228 samples 20,981
6 Antigua and Barbuda 6 Mar 2021 15,268 samples 832
7 Argentina 16 Apr 2022 35,716,069 samples 9,060,495
8 Armenia 29 May 2022 3,099,602 samples 422,963
9 Australia 9 Sep 2022 78,548,492 samples 10,112,229
10 Austria 1 Feb 2023 205,817,752 samples 5,789,991
Confirmed..tested.. Tested..population.. Confirmed..population.. Ref.
1 32.1 0.40 0.13 [1]
2 22.6 15.0 3.4 [2]
3 25.4 0.53 0.13 [3][4]
4 12.6 387 49.0 [5]
5 5.3 1.3 0.067 [6]
6 5.4 15.9 0.86 [7]
7 25.4 78.3 20.0 [8]
8 13.6 105 14.3 [9]
9 12.9 313 40.3 [10]
10 2.8 2,312 65.0 [11
Returning the structure of the dataframe(table_df) showing all columns to be in character format.
str(table_df)
'data.frame': 173 obs. of 9 variables:
$ Country.or.region : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Date.a. : chr "17 Dec 2020" "18 Feb 2021" "2 Nov 2020" "23 Feb 2022" ...
$ Tested : chr "154,767" "428,654" "230,553" "300,307" ...
$ Units.b. : chr "samples" "samples" "samples" "samples" ...
$ Confirmed.cases. : chr "49,621" "96,838" "58,574" "37,958" ...
$ Confirmed..tested.. : chr "32.1" "22.6" "25.4" "12.6" ...
$ Tested..population.. : chr "0.40" "15.0" "0.53" "387" ...
$ Confirmed..population..: chr "0.13" "3.4" "0.13" "49.0" ...
$ Ref. : chr "[1]" "[2]" "[3][4]" "[5]" ...
TASK 3: Pre-processing and exporting the extracted data frame.
Below I have used a nested function suggested by the course to pre-process the dataframe (table_df) by removing irrelevant columns, renaming columns, and converting columns into proper data types.
preprocess_covid_data_frame <- function() {
shape <- dim(table_df)
# Remove the World row
table_df<-table_df[!(table_df$`Country.or.region`=="World"),]
# Remove the last row
table_df <- table_df[-173,]
# We dont need the Units and Ref columns, so can be removed
table_df["Ref."] <- NULL
table_df["Units.b."] <- NULL
# Renaming the columns
names(table_df) <- c("country", "date", "tested", "confirmed",
"confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")
# Convert column data types
table_df$country <- as.factor(table_df$country)
table_df$date <- as.factor(table_df$date)
table_df$tested <- as.numeric(gsub(",","",table_df$tested))
table_df$confirmed <- as.numeric(gsub(",","",table_df$confirmed))
table_df$'confirmed.tested.ratio' <- as.numeric(gsub(",","",table_df$`confirmed.tested.ratio`))
table_df$'tested.population.ratio' <- as.numeric(gsub(",","",table_df$`tested.population.ratio`))
table_df$'confirmed.population.ratio' <- as.numeric(gsub(",","",table_df$`confirmed.population.ratio`))
return(table_df)
}
Creating a new dataframe from the pre-processing nested function.
preprocess_covid_data_frame <- preprocess_covid_data_frame()
Below are the first 10 rows of the preprocess_covid_data_frame, showing the changes to the column names, which consist of removing of unecessary rows and columns and changing the column formats.
preprocess_covid_data_frame[1:10,]
country date tested confirmed confirmed.tested.ratio
1 Afghanistan 17 Dec 2020 154767 49621 32.1
2 Albania 18 Feb 2021 428654 96838 22.6
3 Algeria 2 Nov 2020 230553 58574 25.4
4 Andorra 23 Feb 2022 300307 37958 12.6
5 Angola 2 Feb 2021 399228 20981 5.3
6 Antigua and Barbuda 6 Mar 2021 15268 832 5.4
7 Argentina 16 Apr 2022 35716069 9060495 25.4
8 Armenia 29 May 2022 3099602 422963 13.6
9 Australia 9 Sep 2022 78548492 10112229 12.9
10 Austria 1 Feb 2023 205817752 5789991 2.8
tested.population.ratio confirmed.population.ratio
1 0.40 0.130
2 15.00 3.400
3 0.53 0.130
4 387.00 49.000
5 1.30 0.067
6 15.90 0.860
7 78.30 20.000
8 105.00 14.300
9 313.00 40.300
10 2312.00 65.000
Checking the updated column format through the str function which shows the columns have been formatted accordingly.
str(preprocess_covid_data_frame)
'data.frame': 172 obs. of 7 variables:
$ country : Factor w/ 172 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
$ date : Factor w/ 141 levels "1 Aug 2021","1 Feb 2023",..: 31 37 49 60 44 123 29 90 141 2 ...
$ tested : num 154767 428654 230553 300307 399228 ...
$ confirmed : num 49621 96838 58574 37958 20981 ...
$ confirmed.tested.ratio : num 32.1 22.6 25.4 12.6 5.3 5.4 25.4 13.6 12.9 2.8 ...
$ tested.population.ratio : num 0.4 15 0.53 387 1.3 ...
$ confirmed.population.ratio: num 0.13 3.4 0.13 49 0.067 0.86 20 14.3 40.3 65
After making changes to the dataframe(table_df), the write.csv() function is used to save the csv file into a file.
write.csv(preprocess_covid_data_frame, file = "covid.csv")
The below code is run to confirm that the file (covid.csv) has been successfully saved in the directory.
# Get working directory
wd <- getwd()
# Get exported
file_path <- paste(wd, sep="", "/covid.csv")
# File path
print(file_path)
file.exists(file_path)
> file.exists(file_path)
[1] TRUE
Opening the covid.csv file into a new dataframe.
covid_data_frame_csv <- read.csv("covid.csv", header=TRUE, sep=",")
TASK 4: Getting a subset of the extracted data frame for the 5th to 10th rows from the data frame with only country and confirmed columns selected.
covid_data_frame_subset <- subset(covid_data_frame_csv[(5:10),c("country","confirmed")])
Returning the subset.
covid_data_frame_subset
country confirmed
5 Angola 20981
6 Antigua and Barbuda 832
7 Argentina 2195722
8 Armenia 177104
9 Australia 29130
10 Austria 488007
TASK 5: Calculating worldwide COVID testing positive ratio by dividing the total confirmed and total tested cases worldwide.
Calculating the total confirmed cases worldwide
tot_confirmed_cases <- sum(covid_data_frame_csv$confirmed)
tot_confirmed_cases
[1] 117313932
Calculating the total tested cases worldwide
tot_tested_cases <- sum(covid_data_frame_csv$tested)
tot_tested_cases
[1] 1698581244
Calculating the positive ratio between the total confirmed cases and the total tested cases.
positive_ratio <- (tot_confirmed_cases/tot_tested_cases)
positive_ratio
[1] 0.06906584
TASK 6: Getting a list of countries who have reported their COVID-19 testing data and sorting them alphabetically in ascending and descending order.
Returning the country column.
country_column <- covid_data_frame_csv$country
Checking the class of the column to ensure it can be easily sorted.
class(country_column)
[1] "character"
Sorting the countries AtoZ and returning a sample result.
sort(country_column)
[1] "Afghanistan" "Albania" "Algeria"
[4] "Andorra" "Angola" "Antigua and Barbuda"
[7] "Argentina" "Armenia" "Australia"
[10] "Austria" "Azerbaijan" "Bahamas"
Sorting the countries Z to A and returning a sample result.
sort(country_column, decreasing = TRUE)
[1] "Zimbabwe" "Zambia" "Vietnam"
[4] "Venezuela" "Uzbekistan" "Uruguay"
[7] "United States" "United Kingdom" "United Arab Emirates"
[10] "Ukraine" "Uganda" "Turkey"
TASK 7: Identifying country names using a regular expression to find any countries that start with ‘United’
countries_United <- grep('United.+',(covid_data_frame_csv$country), value = TRUE)
print(countries_United)
[1] "United Arab Emirates" "United Kingdom" "United States"
TASK 8: Creating a subset and comparing the COVID-19 test data between two countries by country name, confirmed cases and confirmed-population-ratio columns.
Subset 1: Finland
finland_subset <- subset(covid_data_frame_csv[55,c("country","confirmed","confirmed.population.ratio")])
finland_subset
country confirmed confirmed.population.ratio
55 Finland 371135 6.7
Subset 2: Hungary
subset_hungary <- subset(covid_data_frame_csv[71,c("country","confirmed","confirmed.population.ratio")])
subset_hungary
country confirmed confirmed.population.ratio
71 Hungary 1909948 19.8
TASK 9: Comparing which one of the selected countries has a larger ratio of confirmed cases to population through IF and ELSE statements.
if(finland_subset$confirmed > subset_hungary$confirmed){print("Finland has more confirmed cases")}else{
print("Hungary has more confirmed cases")}
[1] "Hungary has more confirmed cases"
TASK 10: Finding countries with confirmed to population ratio rate less than a set threshold of less than 1% as it may indicate the risk of those countries being relatively low.
Setting the threshold.
threshold = 0.01
Using an IF - ELSE statements to return the country with the higher confirmed covid cases based on the threshold.
countries_less_threshold <- if(threshold == '0.01'){
subset(covid_data_frame_csv, confirmed.population.ratio <= 0.01)}else{
subset(covid_data_frame_csv, confirmed.population.ratio)}
Returning the result into a new dataframe by creating a subset only returning the contry name and confirmed population ratio.
countries_result <- subset(countries_less_threshold, select = c("country","confirmed.population.ratio"))
countries_result
country confirmed.population.ratio
28 Burundi 0.00740
34 China[c] 0.00610
89 Laos 0.00063
119 North Korea 0.00000
156 Tanzania 0.00085