Overview:

In this project, we will focus on analyzing COVID-19 case data using the R programming language. The main objective is to obtain updated COVID-19 data using web scraping techniques and manipulate this data to extract valuable information and generate informative visualizations.

The first objective of the project is data collection. We will use web scraping techniques to extract COVID-19 data from reliable sources on the web. This will allow us to obtain updated information on the number of confirmed cases, recoveries, deaths, and other relevant indicators.

Once the data is collected, we will perform various manipulation and cleaning operations to ensure the information is in a suitable format for analysis. This will include removing missing values, transforming variables, and creating new derived columns.

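#install.packages("httr")
#install.packages("rvest")
library(httr)   # HTTP requests: GET(), status_code(), content()
library(rvest)  # HTML parsing: read_html(), html_nodes(), html_table()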

Get a COVID-19 pandemic Wiki page using HTTP request

First, let’s develop a function to use an HTTP request to retrieve a public COVID-19 Wikipedia page.

Before writing the function, you can access this public page via the following URL: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country using a web browser.

The goal is to obtain the HTML page using an HTTP request with the httr library.

Invoke the get_wiki_covid_page function to obtain an HTTP response containing the desired HTML page.

get_wiki_covid_page <- function() {
  url <- "https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country"
  response <- GET(url)

  # Warn on a non-200 status instead of silently continuing
  if (status_code(response) != 200) {
    warning(paste("Error:", status_code(response)))
  }

  # Return the response so the caller can inspect it or extract its content
  response
}

get_wiki_covid_page()

Extract COVID-19 testing data table from the wiki HTML page

The task involves extracting the COVID-19 testing data table from a Wikipedia HTML page. This process typically includes sending an HTTP request to retrieve the HTML content of the page, parsing the HTML to locate the specific table, and then extracting the relevant data for further analysis.

get_covid_testing_data <- function() {
  url <- "https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country"
  page <- read_html(url)
  
  # Get all the tables in the HTML root node
  tables <- html_nodes(page, "table")
  
  # Extract the second table (index 2)
  covid_testing_table <- html_table(tables[[2]], fill = TRUE)
  
  # Convert the table to a data frame
  covid_testing_df <- as.data.frame(covid_testing_table)
  
  # Return the first rows of the data frame
  head(covid_testing_df)
}

# Call the function and get COVID-19 testing data
get_covid_testing_data()
##     Country or region     Date[a]  Tested Units[b] Confirmed(cases)
## 1         Afghanistan 17 Dec 2020 154,767  samples           49,621
## 2             Albania 18 Feb 2021 428,654  samples           96,838
## 3             Algeria  2 Nov 2020 230,553  samples           58,574
## 4             Andorra 23 Feb 2022 300,307  samples           37,958
## 5              Angola  2 Feb 2021 399,228  samples           20,981
## 6 Antigua and Barbuda  6 Mar 2021  15,268  samples              832
##   Confirmed /tested,% Tested /population,% Confirmed /population,%   Ref.
## 1                32.1                 0.40                    0.13    [1]
## 2                22.6                 15.0                     3.4    [2]
## 3                25.4                 0.53                    0.13 [3][4]
## 4                12.6                  387                    49.0    [5]
## 5                 5.3                  1.3                   0.067    [6]
## 6                 5.4                 15.9                    0.86    [7]

Pre-process and export the extracted data frame

The objective is to perform pre-processing on the data frame extracted in the previous step and subsequently export it as a CSV file.

# Get COVID-19 test data
url <- "https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country"
page <- read_html(url)

# Get all the tables in the HTML root node
tables <- html_nodes(page, "table")

# Extract the second table (index 2)
covid_testing_table <- html_table(tables[[2]], fill = TRUE)

# Convert the table to a data frame
covid_testing_df3 <- as.data.frame(covid_testing_table)

# Remove the last row (keeps the 172 country rows)
covid_testing_df3 <- covid_testing_df3[1:172, ]
# We don't need the Units and Ref. columns, so they can be removed
covid_testing_df3["Ref."] <- NULL
covid_testing_df3["Units[b]"] <- NULL
# Rename the columns
names(covid_testing_df3) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")
# Convert column data types

covid_testing_df3$country <- as.factor(covid_testing_df3$country)
covid_testing_df3$date <- as.factor(covid_testing_df3$date)
covid_testing_df3$tested <- as.numeric(gsub(",", "", covid_testing_df3$tested))
covid_testing_df3$confirmed <- as.numeric(gsub(",", "", covid_testing_df3$confirmed))
covid_testing_df3$`confirmed.tested.ratio` <- as.numeric(gsub(",", "", covid_testing_df3$`confirmed.tested.ratio`))
covid_testing_df3$`tested.population.ratio` <- as.numeric(gsub(",", "", covid_testing_df3$`tested.population.ratio`))
covid_testing_df3$`confirmed.population.ratio` <- as.numeric(gsub(",", "", covid_testing_df3$`confirmed.population.ratio`))
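# Export the cleaned data frame as a CSV file (the file name matches the read.csv call used later)
write.csv(covid_testing_df3, file = "covid_testing_data3.csv", row.names = FALSE)
# Summary
summary(covid_testing_df3)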
##                 country             date         tested         
##  Afghanistan        :  1   2 Feb 2023 :  6   Min.   :     3880  
##  Albania            :  1   1 Feb 2023 :  4   1st Qu.:   512037  
##  Algeria            :  1   31 Jan 2023:  4   Median :  3029859  
##  Andorra            :  1   1 Mar 2021 :  3   Mean   : 31377219  
##  Angola             :  1   23 Jul 2021:  3   3rd Qu.: 12386725  
##  Antigua and Barbuda:  1   29 Jan 2023:  3   Max.   :929349291  
##  (Other)            :166   (Other)    :149                      
##    confirmed        confirmed.tested.ratio tested.population.ratio
##  Min.   :       0   Min.   : 0.00          Min.   :   0.006       
##  1st Qu.:   37839   1st Qu.: 5.00          1st Qu.:   9.475       
##  Median :  281196   Median :10.05          Median :  46.950       
##  Mean   : 2508340   Mean   :11.25          Mean   : 175.504       
##  3rd Qu.: 1278105   3rd Qu.:15.25          3rd Qu.: 156.500       
##  Max.   :90749469   Max.   :46.80          Max.   :3223.000       
##                                                                   
##  confirmed.population.ratio
##  Min.   : 0.000            
##  1st Qu.: 0.425            
##  Median : 6.100            
##  Mean   :12.769            
##  3rd Qu.:16.250            
##  Max.   :74.400            
## 

In this project, several stages of data processing and manipulation were carried out to analyze information on COVID-19 testing. First, the columns Ref. and Units[b] were identified and removed from the covid_testing_df3 dataframe, as they were not necessary for the analysis. This initial cleaning helps to simplify the dataset and focus on the relevant columns.

Next, the columns of the dataframe were renamed to improve clarity and facilitate data interpretation. The new column names are country, date, tested, confirmed, confirmed.tested.ratio, tested.population.ratio, and confirmed.population.ratio. This step ensures that the column names are descriptive and consistent with the data content.

Finally, data type conversions were performed to ensure that each column has the appropriate data type. The “country” and “date” columns were converted to factors, while the numeric columns were converted to numeric values by removing commas. This conversion is essential for performing accurate statistical analyses and avoiding errors in data processing.
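
The conversions described above can be tried on a small hand-made data frame. The sketch below is illustrative only: the `toy` data frame and the `clean_number` helper are hypothetical names, not part of the original script. It also shows two optional extras that the script above does not perform: parsing the date strings into `Date` values instead of factors (the `%b` month format assumes an English locale), and stripping footnote markers such as `[c]` from country names.

```r
# Hypothetical helper: drop thousands separators, then convert to numeric
clean_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Toy data using two rows copied from the scraped table shown earlier
toy <- data.frame(
  country = c("Afghanistan", "Algeria"),
  date    = c("17 Dec 2020", "2 Nov 2020"),
  tested  = c("154,767", "230,553"),
  stringsAsFactors = FALSE
)
toy$tested <- clean_number(toy$tested)

# Optional: parse dates instead of storing them as factors.
# %b matches abbreviated month names in the current locale (English assumed).
toy$date <- as.Date(toy$date, format = "%d %b %Y")

# Optional: strip footnote markers such as "[c]" or "[f][g]" from names
raw_names   <- c("China[c]", "France[f][g]", "Sweden")
clean_names <- gsub("\\[[^]]*\\]", "", raw_names)

print(toy)
print(clean_names)
```

Storing dates as `Date` values rather than factors would allow sorting and filtering by reporting date, which may matter here because the countries reported on very different dates.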

# Importing the dataset

covid_data_frame_csv <- read.csv("covid_testing_data3.csv", header=TRUE, sep=",")

Get a subset of the extracted data frame

The objective is to extract rows 5 through 10 from the data frame, selecting only the country and confirmed columns.

# Get the 5th to 10th rows, selecting only the `country` and `confirmed` columns
subset_data <- covid_data_frame_csv[5:10, c("country", "confirmed")]

# Print the subset of data
print(subset_data)
##                country confirmed
## 5               Angola     20981
## 6  Antigua and Barbuda       832
## 7            Argentina   9060495
## 8              Armenia    422963
## 9            Australia  10112229
## 10             Austria   5789991

Calculate worldwide COVID testing positive ratio

The goal is to get the total confirmed cases and total tests worldwide, and then estimate the overall positive ratio as confirmed cases / tested cases.

# Get the total confirmed cases worldwide
total_confirmed <- sum(covid_data_frame_csv$confirmed, na.rm = TRUE)

# Get the total tested cases worldwide
total_tested <- sum(covid_data_frame_csv$tested, na.rm = TRUE)

# Calculate the positive ratio
positive_ratio <- total_confirmed / total_tested
# Print the results
cat("Total confirmed cases worldwide:", total_confirmed, "\n")
## Total confirmed cases worldwide: 431434555
cat("Total tested cases worldwide:", total_tested, "\n")
## Total tested cases worldwide: 5396881644
cat("Worldwide COVID-19 testing positive ratio:", positive_ratio, "\n")
## Worldwide COVID-19 testing positive ratio: 0.07994145
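
Note that the pooled ratio above (total confirmed divided by total tested) is not the same as the average of the per-country ratios, which `summary()` reported earlier as about 11.25%. The sketch below illustrates the difference on made-up numbers; the `toy` data frame and its values are hypothetical and not drawn from the dataset.

```r
# Hypothetical figures for two countries of very different size
toy <- data.frame(
  country   = c("A", "B"),
  tested    = c(1000000, 1000),
  confirmed = c(50000, 400)
)

# Pooled ratio: every test counts equally, so large countries dominate
pooled_ratio <- sum(toy$confirmed) / sum(toy$tested)

# Mean of per-country ratios: every country counts equally
mean_of_ratios <- mean(toy$confirmed / toy$tested)

cat("Pooled ratio:  ", round(pooled_ratio, 4), "\n")
cat("Mean of ratios:", round(mean_of_ratios, 4), "\n")
```

Which of the two is appropriate depends on the question: the pooled ratio describes tests worldwide, while the mean of ratios describes a typical country.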

Get a list of countries that reported their testing data

The goal is to get a sorted list of the countries that have reported their COVID-19 testing data.

# Get the 'country' column
country_list <- covid_data_frame_csv$country

# Check the class of the 'country' column
print(class(country_list))
## [1] "character"
# Ensure the 'country' column is character type (no change here, since it is already character)
country_list <- as.character(country_list)

# Sort the countries A to Z
sorted_country_list <- sort(country_list)

# Sort the countries Z to A
sorted_country_list2 <- sort(country_list , decreasing = TRUE)
print(sorted_country_list2)
##   [1] "Zimbabwe"               "Zambia"                 "Vietnam"               
##   [4] "Venezuela"              "Uzbekistan"             "Uruguay"               
##   [7] "United States"          "United Kingdom"         "United Arab Emirates"  
##  [10] "Ukraine"                "Uganda"                 "Turkey"                
##  [13] "Tunisia"                "Trinidad and Tobago"    "Togo"                  
##  [16] "Thailand"               "Tanzania"               "Taiwan[m]"             
##  [19] "Switzerland[l]"         "Sweden"                 "Sudan"                 
##  [22] "Sri Lanka"              "Spain"                  "South Sudan"           
##  [25] "South Korea"            "South Africa"           "Slovenia"              
##  [28] "Slovakia"               "Singapore"              "Serbia"                
##  [31] "Senegal"                "Saudi Arabia"           "San Marino"            
##  [34] "Saint Vincent"          "Saint Lucia"            "Saint Kitts and Nevis" 
##  [37] "Rwanda"                 "Russia"                 "Romania"               
##  [40] "Qatar"                  "Portugal"               "Poland"                
##  [43] "Philippines"            "Peru"                   "Paraguay"              
##  [46] "Papua New Guinea"       "Panama"                 "Palestine"             
##  [49] "Pakistan"               "Oman"                   "Norway"                
##  [52] "Northern Cyprus[k]"     "North Macedonia"        "North Korea"           
##  [55] "Nigeria"                "Niger"                  "New Zealand"           
##  [58] "New Caledonia"          "Netherlands"            "Nepal"                 
##  [61] "Namibia"                "Myanmar"                "Mozambique"            
##  [64] "Morocco"                "Montenegro"             "Mongolia"              
##  [67] "Moldova[j]"             "Mexico"                 "Mauritius"             
##  [70] "Mauritania"             "Malta"                  "Mali"                  
##  [73] "Maldives"               "Malaysia"               "Malawi"                
##  [76] "Madagascar"             "Luxembourg[i]"          "Lithuania"             
##  [79] "Libya"                  "Liberia"                "Lesotho"               
##  [82] "Lebanon"                "Latvia"                 "Laos"                  
##  [85] "Kyrgyzstan"             "Kuwait"                 "Kosovo"                
##  [88] "Kenya"                  "Kazakhstan"             "Jordan"                
##  [91] "Japan"                  "Jamaica"                "Ivory Coast"           
##  [94] "Italy"                  "Israel"                 "Ireland"               
##  [97] "Iraq"                   "Iran"                   "Indonesia"             
## [100] "India"                  "Iceland"                "Hungary"               
## [103] "Honduras"               "Haiti"                  "Guyana"                
## [106] "Guinea-Bissau"          "Guinea"                 "Guatemala"             
## [109] "Grenada"                "Greenland"              "Greece"                
## [112] "Ghana"                  "Germany"                "Georgia[h]"            
## [115] "Gambia"                 "Gabon"                  "France[f][g]"          
## [118] "Finland"                "Fiji"                   "Faroe Islands"         
## [121] "Ethiopia"               "Eswatini"               "Estonia"               
## [124] "Equatorial Guinea"      "El Salvador"            "Egypt"                 
## [127] "Ecuador"                "DR Congo"               "Dominican Republic"    
## [130] "Dominica"               "Djibouti"               "Denmark[e]"            
## [133] "Czechia"                "Cyprus[d]"              "Cuba"                  
## [136] "Croatia"                "Costa Rica"             "Colombia"              
## [139] "China[c]"               "Chile"                  "Chad"                  
## [142] "Canada"                 "Cameroon"               "Cambodia"              
## [145] "Burundi"                "Burkina Faso"           "Bulgaria"              
## [148] "Brunei"                 "Brazil"                 "Botswana"              
## [151] "Bosnia and Herzegovina" "Bolivia"                "Bhutan"                
## [154] "Benin"                  "Belize"                 "Belgium"               
## [157] "Belarus"                "Barbados"               "Bangladesh"            
## [160] "Bahrain"                "Bahamas"                "Azerbaijan"            
## [163] "Austria"                "Australia"              "Armenia"               
## [166] "Argentina"              "Antigua and Barbuda"    "Angola"                
## [169] "Andorra"                "Algeria"                "Albania"               
## [172] "Afghanistan"

Identify country names with a specific pattern

The goal is to use a regular expression to find any countries whose names start with "United".

# Anchor the pattern with ^ so that only names beginning with "United" match
united_countries <- regexpr("^United.*", country_list)
regmatches(country_list, united_countries)
## [1] "United Arab Emirates" "United Kingdom"       "United States"
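
An unanchored pattern such as `United.*` matches "United" anywhere in a name, whereas `^United` matches only at the start. A minimal sketch of the difference, using `grep()` with `value = TRUE`, is shown below. The `sample_names` vector is a hypothetical stand-in for `country_list`, and "Not United Example" is an invented entry included only to show the effect of the anchor.

```r
# Hypothetical stand-in for country_list
sample_names <- c("United Kingdom", "Not United Example", "United States", "Uruguay")

# Unanchored: matches "United" anywhere in the string
contains_united <- grep("United", sample_names, value = TRUE)
print(contains_united)

# Anchored with ^: matches only names that begin with "United"
starts_with_united <- grep("^United", sample_names, value = TRUE)
print(starts_with_united)
```

For a fixed prefix like this one, base R's `startsWith(sample_names, "United")` would be another option that avoids regular expressions altogether.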

Pick two countries you are interested in, and then review their testing data

The goal is to compare the COVID-19 testing data of two countries. Select the two corresponding rows from the data frame, keeping the country, confirmed, and confirmed.population.ratio columns.

country1 <- subset(covid_data_frame_csv, country == 'Colombia', select = c(country, confirmed, confirmed.population.ratio))
country2 <- subset(covid_data_frame_csv, country == 'Croatia', select = c(country, confirmed, confirmed.population.ratio))

# Colombia vs. Croatia

comparison <- rbind(country1, country2)
print(comparison)
##     country confirmed confirmed.population.ratio
## 35 Colombia   6314769                       13.1
## 37  Croatia   1267798                       31.1

Compare which one of the selected countries has a larger ratio of confirmed cases to population

The goal is to find out which of the two selected countries has the larger ratio of confirmed cases to population, which may indicate a higher COVID-19 infection risk in that country.

# Compare the ratios
if (country1$confirmed.population.ratio > country2$confirmed.population.ratio) {
    print(country1)
} else {
    print(country2)
}
##    country confirmed confirmed.population.ratio
## 37 Croatia   1267798                       31.1
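
The `if`/`else` comparison works for exactly two single-row data frames. A minimal sketch of a more general approach, assuming the rows are first combined into one data frame, uses `which.max()` to pick the row with the largest ratio. The values below are copied from the output above; the `combined` and `highest` names are illustrative.

```r
# Rows copied from the comparison output above
combined <- data.frame(
  country = c("Colombia", "Croatia"),
  confirmed = c(6314769, 1267798),
  confirmed.population.ratio = c(13.1, 31.1),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the first maximum and discards NA values
highest <- combined[which.max(combined$confirmed.population.ratio), ]
print(highest)
```

The same line works unchanged if more countries are added to `combined`.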

Comparing the rates of confirmed COVID-19 cases relative to the population across countries has important implications for public health. A higher proportion of confirmed cases may indicate greater community transmission and a higher risk of outbreaks. This can help health authorities identify high-priority areas for interventions and resources.

Countries with a high proportion of confirmed cases may need to strengthen control measures, such as social distancing, mask use, and hand hygiene, to reduce the spread of the virus. Furthermore, active surveillance and diagnostic testing are crucial to quickly detect and isolate new cases, especially in countries with missing data.

Find countries with a confirmed-to-population ratio below a threshold

The goal is to find out which countries have a confirmed-to-population ratio of less than 0.1%, which may indicate that the risk in those countries is relatively low.

# Define the threshold (the ratio column is expressed in percent, so 0.1 means 0.1%)
threshold <- 0.1

# Find countries with ratio less than the threshold
filtered_countries <- subset(covid_data_frame_csv, confirmed.population.ratio < threshold, select = c(country, confirmed, confirmed.population.ratio))

# Display the filtered countries
print(filtered_countries)
##              country confirmed confirmed.population.ratio
## 5             Angola     20981                    0.06700
## 19             Benin      7884                    0.06700
## 25            Brunei       338                    0.07400
## 27      Burkina Faso     12123                    0.05800
## 28           Burundi       884                    0.00740
## 32              Chad      4020                    0.02900
## 34          China[c]     87655                    0.00610
## 45          DR Congo     25961                    0.02900
## 57             Gabon     25325                    0.08200
## 89              Laos        45                    0.00063
## 97        Madagascar     19831                    0.07600
## 101             Mali     14449                    0.07100
## 104        Mauritius       494                    0.03900
## 115    New Caledonia       136                    0.05000
## 117            Niger      4740                    0.02100
## 118          Nigeria    155657                    0.07600
## 119      North Korea         0                    0.00000
## 127 Papua New Guinea       961                    0.01100
## 149      South Sudan     10688                    0.08400
## 152            Sudan     23316                    0.05300
## 156         Tanzania       509                    0.00085
## 157         Thailand     26162                    0.03800
## 162           Uganda     39979                    0.08700

The results show that 23 countries meet this criterion, including Angola (0.067), Benin (0.067), and Laos (0.00063). However, the figures were reported on different dates and some countries may have incomplete data, which could affect the accuracy of the assessment. For a more complete conclusion, complete and up-to-date data from all countries would be ideal.
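
Since the `confirmed.population.ratio` column is expressed in percent (the source table header reads "Confirmed /population,%"), a threshold of `0.1` corresponds to 0.1% of the population. The sketch below, using a few rows copied from the outputs above, makes the unit explicit and sorts the result from lowest to highest ratio. The `toy`, `threshold_percent`, and `low_ratio` names are illustrative.

```r
# A few rows copied from the outputs shown earlier
toy <- data.frame(
  country = c("Angola", "Colombia", "Croatia", "Laos"),
  confirmed.population.ratio = c(0.067, 13.1, 31.1, 0.00063),
  stringsAsFactors = FALSE
)

# Threshold in percent of the population, matching the column's unit
threshold_percent <- 0.1

# subset() keeps only rows where the condition is TRUE (FALSE and NA are dropped)
low_ratio <- subset(toy, confirmed.population.ratio < threshold_percent)

# Order from lowest to highest ratio
low_ratio <- low_ratio[order(low_ratio$confirmed.population.ratio), ]
print(low_ratio)
```

Sorting the filtered rows makes it easier to see which countries sit closest to zero and which sit just under the threshold.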

Identifying countries with a low ratio of confirmed cases to the population can have several important implications for public health. A low ratio may indicate that the risk of community transmission is relatively low in these countries. This can help health authorities prioritize resources and efforts in areas with the greatest need.

Countries with low ratios can focus on maintaining and reinforcing existing preventive measures, such as social distancing, mask use, and hand hygiene, to prevent a surge in cases. Active surveillance and diagnostic testing are crucial to quickly detect and isolate new cases. This is especially important in countries with missing data, where the lack of information can obscure the true magnitude of the problem.

Even if the proportion is low, these countries must be prepared for potential outbreaks. This includes having contingency plans, adequate hospital capacity, and sufficient medical supplies. Continuing to educate the population about the importance of preventive measures and vaccination is essential to maintaining control of the pandemic and preventing future outbreaks.