In this project, we will focus on analyzing COVID-19 case data using the R programming language. The main objective is to obtain updated COVID-19 data using web scraping techniques and manipulate this data to extract valuable information and generate informative visualizations.
The first objective of the project is data collection. We will use web scraping techniques to extract COVID-19 data from reliable sources on the web. This will allow us to obtain updated information on the number of confirmed cases, recoveries, deaths, and other relevant indicators.
Once the data is collected, we will perform various manipulation and cleaning operations to ensure the information is in a suitable format for analysis. This will include removing missing values, transforming variables, and creating new derived columns.
# install.packages("httr")
# install.packages("rvest")
# Load the packages used for HTTP requests and HTML parsing
library(httr)
library(rvest)
COVID-19 pandemic Wiki page using HTTP request
First, let’s develop a function to use an HTTP request to retrieve a public COVID-19 Wikipedia page.
Before writing the function, you can access this public page via the following URL: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country using a web browser.
The goal is to obtain the HTML page using an HTTP request with the
httr library.
Invoke the get_wiki_covid_page function to obtain an
HTTP response containing the desired HTML page.
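get_wiki_covid_page <- function() {
  url <- "https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country"
  # Send the HTTP GET request
  response <- GET(url)
  if (status_code(response) == 200) {
    # Extract the HTML body as text (named html_text to avoid shadowing content())
    html_text <- content(response, "text", encoding = "UTF-8")
    print(html_text)
  } else {
    print(paste("Error:", status_code(response)))
  }
  # Return the response invisibly so it can be reused by the caller
  invisible(response)
}
get_wiki_covid_page()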
The task involves extracting the COVID-19 testing data table from a Wikipedia HTML page. This process typically includes sending an HTTP request to retrieve the HTML content of the page, parsing the HTML to locate the specific table, and then extracting the relevant data for further analysis.
get_covid_testing_data <- function() {
  url <- "https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country"
  page <- read_html(url)
  # Get all the tables in the HTML root node
  tables <- html_nodes(page, "table")
  # Extract the second table (index 2)
  covid_testing_table <- html_table(tables[[2]], fill = TRUE)
  # Convert the table to a data frame
  covid_testing_df <- as.data.frame(covid_testing_table)
  # Return the first rows of the data frame
  head(covid_testing_df)
}
# Call the function and get COVID-19 testing data
get_covid_testing_data()
## Country or region Date[a] Tested Units[b] Confirmed(cases)
## 1 Afghanistan 17 Dec 2020 154,767 samples 49,621
## 2 Albania 18 Feb 2021 428,654 samples 96,838
## 3 Algeria 2 Nov 2020 230,553 samples 58,574
## 4 Andorra 23 Feb 2022 300,307 samples 37,958
## 5 Angola 2 Feb 2021 399,228 samples 20,981
## 6 Antigua and Barbuda 6 Mar 2021 15,268 samples 832
## Confirmed /tested,% Tested /population,% Confirmed /population,% Ref.
## 1 32.1 0.40 0.13 [1]
## 2 22.6 15.0 3.4 [2]
## 3 25.4 0.53 0.13 [3][4]
## 4 12.6 387 49.0 [5]
## 5 5.3 1.3 0.067 [6]
## 6 5.4 15.9 0.86 [7]
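The extraction steps above can be tried offline on a small inline HTML snippet. The sketch below is only an illustration: the two-table page is made up, and it assumes the rvest package is installed. It shows how html_nodes() collects every table and how [[2]] picks one by position, which is why the index has to be re-checked if the layout of the Wikipedia page changes.

```r
library(rvest)

# Hypothetical two-table page standing in for the Wikipedia HTML
html <- "<html><body>
<table><tr><th>note</th></tr><tr><td>navigation box</td></tr></table>
<table>
  <tr><th>Country</th><th>Tested</th></tr>
  <tr><td>Afghanistan</td><td>154,767</td></tr>
  <tr><td>Albania</td><td>428,654</td></tr>
</table>
</body></html>"

# Parse the literal HTML string and collect all <table> nodes
page <- read_html(html)
tables <- html_nodes(page, "table")

# Select the table by position, as the function above does
demo_df <- as.data.frame(html_table(tables[[2]]))
print(demo_df)
```

Note that the Tested column is still text at this point because of the thousands separators; it has to be converted explicitly before any arithmetic.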
The objective is to perform pre-processing on the data frame extracted in the previous step and subsequently export it as a CSV file.
# Get COVID-19 test data
url <- "https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country"
page <- read_html(url)
# Get all the tables in the HTML root node
tables <- html_nodes(page, "table")
# Extract the second table (index 2)
covid_testing_table <- html_table(tables[[2]], fill = TRUE)
# Convert the table to a data frame
covid_testing_df3 <- as.data.frame(covid_testing_table)
# Remove the last row
covid_testing_df3 <- covid_testing_df3[1:172, ]
# We don't need the Units and Ref columns, so they can be removed
covid_testing_df3["Ref."] <- NULL
covid_testing_df3["Units[b]"] <- NULL
# Renaming the columns
names(covid_testing_df3) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")
# Convert column data types
covid_testing_df3$country <- as.factor(covid_testing_df3$country)
covid_testing_df3$date <- as.factor(covid_testing_df3$date)
covid_testing_df3$tested <- as.numeric(gsub(",", "", covid_testing_df3$tested))
covid_testing_df3$confirmed <- as.numeric(gsub(",", "", covid_testing_df3$confirmed))
covid_testing_df3$`confirmed.tested.ratio` <- as.numeric(gsub(",", "", covid_testing_df3$`confirmed.tested.ratio`))
covid_testing_df3$`tested.population.ratio` <- as.numeric(gsub(",", "", covid_testing_df3$`tested.population.ratio`))
covid_testing_df3$`confirmed.population.ratio` <- as.numeric(gsub(",", "", covid_testing_df3$`confirmed.population.ratio`))
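# Export the cleaned data frame as a CSV file
write.csv(covid_testing_df3, "covid_testing_data3.csv", row.names = FALSE)
# Summary
summary(covid_testing_df3)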
## country date tested
## Afghanistan : 1 2 Feb 2023 : 6 Min. : 3880
## Albania : 1 1 Feb 2023 : 4 1st Qu.: 512037
## Algeria : 1 31 Jan 2023: 4 Median : 3029859
## Andorra : 1 1 Mar 2021 : 3 Mean : 31377219
## Angola : 1 23 Jul 2021: 3 3rd Qu.: 12386725
## Antigua and Barbuda: 1 29 Jan 2023: 3 Max. :929349291
## (Other) :166 (Other) :149
## confirmed confirmed.tested.ratio tested.population.ratio
## Min. : 0 Min. : 0.00 Min. : 0.006
## 1st Qu.: 37839 1st Qu.: 5.00 1st Qu.: 9.475
## Median : 281196 Median :10.05 Median : 46.950
## Mean : 2508340 Mean :11.25 Mean : 175.504
## 3rd Qu.: 1278105 3rd Qu.:15.25 3rd Qu.: 156.500
## Max. :90749469 Max. :46.80 Max. :3223.000
##
## confirmed.population.ratio
## Min. : 0.000
## 1st Qu.: 0.425
## Median : 6.100
## Mean :12.769
## 3rd Qu.:16.250
## Max. :74.400
##
In this project, several stages of data processing and manipulation
were carried out to analyze information on COVID-19 testing. First, the
columns Ref. and Units[b] were
identified and removed from the covid_testing_df3
dataframe, as they were not necessary for the analysis. This initial
cleaning helps to simplify the dataset and focus on the relevant
columns.
Next, the columns of the dataframe were renamed to improve clarity and facilitate data interpretation. The new column names are country, date, tested, confirmed, confirmed.tested.ratio, tested.population.ratio, and confirmed.population.ratio. This step ensures that the column names are descriptive and consistent with the data content.
Finally, data type conversions were performed to ensure that each column has the appropriate data type. The “country” and “date” columns were converted to factors, while the numeric columns were converted to numeric values by removing commas. This conversion is essential for performing accurate statistical analyses and avoiding errors in data processing.
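The three cleaning steps described above (dropping columns, renaming, and type conversion) can be sketched on a tiny data frame. The rows below are a hypothetical stand-in for the scraped table with only two of its columns kept, so the names raw_df and clean_df are illustrative and not part of the original analysis.

```r
# Hypothetical raw rows mimicking the scraped table (all values are strings)
raw_df <- data.frame(
  "Country or region" = c("Afghanistan", "Albania"),
  "Tested"            = c("154,767", "428,654"),
  "Units[b]"          = c("samples", "samples"),
  "Ref."              = c("[1]", "[2]"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Drop the columns that are not needed, then rename the rest
clean_df <- raw_df[, !(names(raw_df) %in% c("Units[b]", "Ref."))]
names(clean_df) <- c("country", "tested")

# Without removing the commas, the conversion yields NA
na_result <- suppressWarnings(as.numeric("154,767"))
print(is.na(na_result))

# Strip thousands separators before converting to numeric
clean_df$country <- as.factor(clean_df$country)
clean_df$tested  <- as.numeric(gsub(",", "", clean_df$tested))

str(clean_df)
```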
# Importing the dataset
covid_data_frame_csv <- read.csv("covid_testing_data3.csv", header=TRUE, sep=",")
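The import above presumes that the cleaned data frame was written to covid_testing_data3.csv beforehand. A minimal sketch of that round trip follows, using a temporary file and a hypothetical two-row data frame instead of the real data. It also illustrates why the country column is reported as character later on: factor columns are stored as plain text in a CSV file and are read back as character vectors when stringsAsFactors is FALSE (the default since R 4.0).

```r
# Hypothetical cleaned data with a factor column
demo_df <- data.frame(
  country = factor(c("Afghanistan", "Albania")),
  tested  = c(154767, 428654)
)

# Write without row names so no extra index column appears on import
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_df, csv_path, row.names = FALSE)

# Read the file back; the factor comes back as a character vector
reloaded <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
print(class(reloaded$country))
print(reloaded$tested)
```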
The objective is to extract rows 5 through 10 from the data frame,
selecting only the country and confirmed
columns.
# Get the 5th to 10th rows, selecting only the `country` and `confirmed` columns
subset_data <- covid_data_frame_csv[5:10, c("country", "confirmed")]
# Print the subset of data
print(subset_data)
## country confirmed
## 5 Angola 20981
## 6 Antigua and Barbuda 832
## 7 Argentina 9060495
## 8 Armenia 422963
## 9 Australia 10112229
## 10 Austria 5789991
The goal is to get the total confirmed and tested cases worldwide,
and to estimate the overall positive ratio as
confirmed cases / tested cases.
# Get the total confirmed cases worldwide
total_confirmed <- sum(covid_data_frame_csv$confirmed, na.rm = TRUE)
# Get the total tested cases worldwide
total_tested <- sum(covid_data_frame_csv$tested, na.rm = TRUE)
# Calculate the positive ratio
positive_ratio <- total_confirmed / total_tested
# Print the results
cat("Total confirmed cases worldwide:", total_confirmed, "\n")
## Total confirmed cases worldwide: 431434555
cat("Total tested cases worldwide:", total_tested, "\n")
## Total tested cases worldwide: 5396881644
cat("Worldwide COVID-19 testing positive ratio:", positive_ratio, "\n")
## Worldwide COVID-19 testing positive ratio: 0.07994145
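The ratio printed above is a pooled figure: total confirmed cases divided by total tests, which works out to roughly 8%. It is lower than the mean of the per-country confirmed.tested.ratio values shown in the summary (11.25), because pooling weights each country by its test volume. The sketch below reproduces the arithmetic from the printed totals and then uses a made-up two-country example (the toy data frame is an assumption for illustration) to show how the two measures can diverge.

```r
# Totals as printed in the output above
total_confirmed <- 431434555
total_tested    <- 5396881644

pooled_ratio <- total_confirmed / total_tested
cat("Pooled positive ratio (%):", round(pooled_ratio * 100, 2), "\n")

# Hypothetical two-country example: pooling weights by test volume,
# while a plain mean of per-country ratios treats countries equally
toy <- data.frame(confirmed = c(10, 500), tested = c(100, 10000))
pooled_toy <- sum(toy$confirmed) / sum(toy$tested)   # 510 / 10100
mean_toy   <- mean(toy$confirmed / toy$tested)       # (0.10 + 0.05) / 2
cat("Toy pooled:", round(pooled_toy, 4), " toy mean:", mean_toy, "\n")
```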
The goal is to get a sorted list of the countries that have reported their COVID-19 testing data.
# Get the 'country' column
country_list <- covid_data_frame_csv$country
# Check the class of the 'country' column
print(class(country_list))
## [1] "character"
# Convert the 'country' column into character type
country_list <- as.character(country_list)
# Sort the countries A to Z
sorted_country_list <- sort(country_list)
# Sort the countries Z to A
sorted_country_list2 <- sort(country_list , decreasing = TRUE)
print(sorted_country_list2)
## [1] "Zimbabwe" "Zambia" "Vietnam"
## [4] "Venezuela" "Uzbekistan" "Uruguay"
## [7] "United States" "United Kingdom" "United Arab Emirates"
## [10] "Ukraine" "Uganda" "Turkey"
## [13] "Tunisia" "Trinidad and Tobago" "Togo"
## [16] "Thailand" "Tanzania" "Taiwan[m]"
## [19] "Switzerland[l]" "Sweden" "Sudan"
## [22] "Sri Lanka" "Spain" "South Sudan"
## [25] "South Korea" "South Africa" "Slovenia"
## [28] "Slovakia" "Singapore" "Serbia"
## [31] "Senegal" "Saudi Arabia" "San Marino"
## [34] "Saint Vincent" "Saint Lucia" "Saint Kitts and Nevis"
## [37] "Rwanda" "Russia" "Romania"
## [40] "Qatar" "Portugal" "Poland"
## [43] "Philippines" "Peru" "Paraguay"
## [46] "Papua New Guinea" "Panama" "Palestine"
## [49] "Pakistan" "Oman" "Norway"
## [52] "Northern Cyprus[k]" "North Macedonia" "North Korea"
## [55] "Nigeria" "Niger" "New Zealand"
## [58] "New Caledonia" "Netherlands" "Nepal"
## [61] "Namibia" "Myanmar" "Mozambique"
## [64] "Morocco" "Montenegro" "Mongolia"
## [67] "Moldova[j]" "Mexico" "Mauritius"
## [70] "Mauritania" "Malta" "Mali"
## [73] "Maldives" "Malaysia" "Malawi"
## [76] "Madagascar" "Luxembourg[i]" "Lithuania"
## [79] "Libya" "Liberia" "Lesotho"
## [82] "Lebanon" "Latvia" "Laos"
## [85] "Kyrgyzstan" "Kuwait" "Kosovo"
## [88] "Kenya" "Kazakhstan" "Jordan"
## [91] "Japan" "Jamaica" "Ivory Coast"
## [94] "Italy" "Israel" "Ireland"
## [97] "Iraq" "Iran" "Indonesia"
## [100] "India" "Iceland" "Hungary"
## [103] "Honduras" "Haiti" "Guyana"
## [106] "Guinea-Bissau" "Guinea" "Guatemala"
## [109] "Grenada" "Greenland" "Greece"
## [112] "Ghana" "Germany" "Georgia[h]"
## [115] "Gambia" "Gabon" "France[f][g]"
## [118] "Finland" "Fiji" "Faroe Islands"
## [121] "Ethiopia" "Eswatini" "Estonia"
## [124] "Equatorial Guinea" "El Salvador" "Egypt"
## [127] "Ecuador" "DR Congo" "Dominican Republic"
## [130] "Dominica" "Djibouti" "Denmark[e]"
## [133] "Czechia" "Cyprus[d]" "Cuba"
## [136] "Croatia" "Costa Rica" "Colombia"
## [139] "China[c]" "Chile" "Chad"
## [142] "Canada" "Cameroon" "Cambodia"
## [145] "Burundi" "Burkina Faso" "Bulgaria"
## [148] "Brunei" "Brazil" "Botswana"
## [151] "Bosnia and Herzegovina" "Bolivia" "Bhutan"
## [154] "Benin" "Belize" "Belgium"
## [157] "Belarus" "Barbados" "Bangladesh"
## [160] "Bahrain" "Bahamas" "Azerbaijan"
## [163] "Austria" "Australia" "Armenia"
## [166] "Argentina" "Antigua and Barbuda" "Angola"
## [169] "Andorra" "Algeria" "Albania"
## [172] "Afghanistan"
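Several names in the list above still carry Wikipedia footnote markers such as [m] or [f][g]. These markers would make an exact lookup like country == "China" fail, since the stored value is "China[c]". One possible way to strip them, shown here on a few names copied from the output, is a gsub() call with a pattern for a single lowercase letter in square brackets. This is a sketch of an optional cleaning step, not something the analysis above performs.

```r
# A few names copied from the sorted output above
countries <- c("Taiwan[m]", "Switzerland[l]", "France[f][g]", "Sweden")

# Remove every bracketed single-letter marker such as "[m]"
clean_names <- gsub("\\[[a-z]\\]", "", countries)
print(clean_names)
```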
The goal is to use a regular expression to find any countries whose
names start with United.
united_countries <- regexpr("United.*", country_list)
regmatches(country_list, united_countries)
## [1] "United Arab Emirates" "United Kingdom" "United States"
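The pattern United.* is not anchored, so it matches "United" anywhere in a name and regmatches() returns only the matched part. That happens to give the right answer for this data, but a name containing "United" in the middle would also be picked up. The sketch below contrasts this with an anchored pattern using ^ and grepl(); the entry "Not United Example" is made up purely to show the difference.

```r
# Sample names; "Not United Example" is a made-up entry for illustration
country_list <- c("Ukraine", "United Arab Emirates", "United Kingdom",
                  "United States", "Not United Example")

# Unanchored: matches "United" anywhere and returns only the matched part
m <- regexpr("United.*", country_list)
unanchored <- regmatches(country_list, m)
print(unanchored)

# Anchored with ^: keeps whole names that begin with "United"
anchored <- country_list[grepl("^United", country_list)]
print(anchored)
```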
The goal is to compare the COVID-19 test data between two countries.
To do so, select two rows from the data frame and keep the
country, confirmed, and
confirmed.population.ratio columns.
country1 <- subset(covid_data_frame_csv, country == 'Colombia', select = c(country, confirmed, confirmed.population.ratio))
country2 <- subset(covid_data_frame_csv, country == 'Croatia', select = c(country, confirmed, confirmed.population.ratio))
# Colombia vs. Croatia
comparison <- rbind(country1, country2)
print(comparison)
## country confirmed confirmed.population.ratio
## 35 Colombia 6314769 13.1
## 37 Croatia 1267798 31.1
The goal is to find out which of the two selected countries has the larger ratio of confirmed cases to population, which may indicate a higher COVID-19 infection risk in that country.
# Compare the ratios
if (country1$confirmed.population.ratio > country2$confirmed.population.ratio) {
print(country1)
} else {
print(country2)
}
## country confirmed confirmed.population.ratio
## 37 Croatia 1267798 31.1
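The if/else comparison above works for exactly two rows. An alternative sketch, assuming the rows are already combined in one data frame, uses which.max() to pick the row with the largest ratio, and it extends to any number of countries. The data frame below simply re-enters the two rows from the printed output.

```r
# The two rows from the comparison output above
comparison <- data.frame(
  country = c("Colombia", "Croatia"),
  confirmed = c(6314769, 1267798),
  confirmed.population.ratio = c(13.1, 31.1),
  stringsAsFactors = FALSE
)

# which.max returns the index of the largest ratio, so this scales
# to any number of rows without if/else branches
higher <- comparison[which.max(comparison$confirmed.population.ratio), ]
print(higher)
```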
Comparing the rates of confirmed COVID-19 cases relative to the population across countries has important implications for public health. A higher proportion of confirmed cases may indicate greater community transmission and a higher risk of outbreaks. This can help health authorities identify high-priority areas for interventions and resources.
Countries with a high proportion of confirmed cases may need to strengthen control measures, such as social distancing, mask use, and hand hygiene, to reduce the spread of the virus. Furthermore, active surveillance and diagnostic testing are crucial to quickly detect and isolate new cases, especially in countries with missing data.
The goal is to find out which countries have a confirmed-to-population ratio below 0.1% (the ratio column is expressed in percent), which may indicate that the risk in those countries is relatively low.
# Define the threshold (the ratio column is in percent, so 0.1 means 0.1%)
threshold <- 0.1
# Find countries with ratio less than the threshold
filtered_countries <- subset(covid_data_frame_csv, confirmed.population.ratio < threshold, select = c(country, confirmed, confirmed.population.ratio))
# Display the filtered countries
print(filtered_countries)
## country confirmed confirmed.population.ratio
## 5 Angola 20981 0.06700
## 19 Benin 7884 0.06700
## 25 Brunei 338 0.07400
## 27 Burkina Faso 12123 0.05800
## 28 Burundi 884 0.00740
## 32 Chad 4020 0.02900
## 34 China[c] 87655 0.00610
## 45 DR Congo 25961 0.02900
## 57 Gabon 25325 0.08200
## 89 Laos 45 0.00063
## 97 Madagascar 19831 0.07600
## 101 Mali 14449 0.07100
## 104 Mauritius 494 0.03900
## 115 New Caledonia 136 0.05000
## 117 Niger 4740 0.02100
## 118 Nigeria 155657 0.07600
## 119 North Korea 0 0.00000
## 127 Papua New Guinea 961 0.01100
## 149 South Sudan 10688 0.08400
## 152 Sudan 23316 0.05300
## 156 Tanzania 509 0.00085
## 157 Thailand 26162 0.03800
## 162 Uganda 39979 0.08700
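Because the ratio column holds percentages (the scraped header reads "Confirmed /population,%"), the numeric threshold has to be read in percent as well. The sketch below uses four rows taken from the outputs above to show the difference between a threshold of 0.1 and a threshold of 1; the data frame name demo is an assumption for illustration.

```r
# Rows taken from the outputs above; ratios are percentages
demo <- data.frame(
  country = c("Colombia", "Angola", "Antigua and Barbuda", "Laos"),
  confirmed.population.ratio = c(13.1, 0.067, 0.86, 0.00063),
  stringsAsFactors = FALSE
)

# threshold = 0.1 keeps countries below 0.1% of the population
below_0_1 <- subset(demo, confirmed.population.ratio < 0.1)$country

# threshold = 1 would keep countries below 1% of the population
below_1 <- subset(demo, confirmed.population.ratio < 1)$country

print(below_0_1)
print(below_1)
```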
The results show that 23 countries meet this criterion, among them Angola and Benin (0.067% each) and Laos (0.00063%). North Korea appears with 0 confirmed cases, and the reporting dates differ widely between countries, so these figures are not directly comparable. In addition, several countries have missing data, which could affect the accuracy of the assessment. For a more complete conclusion, complete and up-to-date data from all countries would be ideal.
Identifying countries with a low ratio of confirmed cases to the population can have several important implications for public health. A low ratio may indicate that the risk of community transmission is relatively low in these countries. This can help health authorities prioritize resources and efforts in areas with the greatest need.
Countries with low ratios can focus on maintaining and reinforcing existing preventive measures, such as social distancing, mask use, and hand hygiene, to prevent a surge in cases. Active surveillance and diagnostic testing are crucial to quickly detect and isolate new cases. This is especially important in countries with missing data, where the lack of information can obscure the true magnitude of the problem.
Even if the proportion is low, these countries must be prepared for potential outbreaks. This includes having contingency plans, adequate hospital capacity, and sufficient medical supplies. Continuing to educate the population about the importance of preventive measures and vaccination is essential to maintaining control of the pandemic and preventing future outbreaks.