Code
library(rio)
library(tidyverse)
library(spData)
library(sf)
library(tmap)
library(knitr)
library(DT)Laboratory Report
library(rio)
library(tidyverse)
library(spData)
library(sf)
library(tmap)
library(knitr)
library(DT)For the purpose of this report, I used a dataset containing daily updated data on the COVID-19 pandemic prepared by the European Centre for Disease Prevention and Control. .
data <- rio::import("COVID-19-geographic-disbtribution-worldwide-2020-12-14.xlsx")
data <- as_tibble(data)The data was loaded using the rio, which allows for easy reading of data from various file formats and then converted to a tibble object.
Additionally, for data visualization on the map, I used the World dataset, from the tmap package.This dataset contains information related to demographics, economics, and social indicators for various countries around the world, along with their georeferences.
data("World")| variable | explanation |
|---|---|
| dateRep | Date of reporting (dd/mm/yyyy) |
| day | Day |
| month | Month |
| year | Year |
| cases | Number of newly reported cases |
| deaths | Number of newly reported deaths |
| countriesAndterritories | Country name |
| geoId | 2-letter ISO code |
| countriesAndterritoryCode | 3-letter ISO code |
| popData2020 | Population in 2020 |
| continentExp | Continent name |
| variable | explanation |
|---|---|
| iso_3 | 3-letter ISO code |
| name | Country name |
| sovereignt | Country status (sovereignty) |
| continent | Continent name |
| area | Area |
| pop_est | Population |
| pop_est_dens | Population density |
| economy | Economic category |
| income_grp | Income group |
| gdp_cap_est | GDP per capita |
| life_exp | Life expectancy |
| well_being | Well-being index |
| footprint | Ecological footprint |
| inequality | Social inequality index (Gini) |
| HPI | Happiness index |
| geometry | MULTIPOLYGON |
countries <-
c("PL",
"DE",
"SK",
"CZ",
"UA",
"BY",
"LT")
europe <-
data |> filter(continentExp == "Europe") |> arrange(dateRep)
poland_neighbors <-
europe |> filter(geoId %in% countries)
results <-
poland_neighbors |> filter(cases > 0) |> filter(dateRep < as.POSIXct("2020-12-31")) |> group_by(countriesAndTerritories) |> summarise(
iso_a3=unique(countryterritoryCode),
max_daily_cases = max(cases),
max_daily_deaths =
max(deaths),
avg_daily_cases =
round(mean(cases),2),
avg_daily_deaths =
round(mean(deaths),2),
total_cases =
sum(cases),
total_deaths =
sum(deaths),
infection_rate_per_1000_population=round(1000*(sum(cases)/max(popData2019)),2),
mortality_rate_per_1000_population =
round(1000*(sum(deaths) / max(popData2019)), 2),
percentage_infected=paste(round(100*((sum(cases) / max(popData2019))), 2), "%"),
mortality_rate=paste(round(100*(sum(deaths) / sum(cases)), 2),"%"),
mortality_level = cut((100*(sum(deaths) / sum(cases))), breaks=c(0,1,1.7,max(100*(sum(deaths)/ sum(cases)))), labels = c("low", "medium", "high")))
countries <- results$countriesAndTerritories
results <- results |> select(-countriesAndTerritories)
row.names(results) <- countries
datatable(results,class = 'cell-border stripe')The summary presents various statistical information regarding the development of the epidemic. The obtained statistics represent:
max_daily_cases: the highest daily increase in cases in a given country (highest value for Poland),
max_daily_deaths: the highest daily increase in deaths in a given country (highest value for Poland),
avg_daily_cases: the average number of daily cases reported in a given country (highest value for Germany),
avg_daily_deaths: the average number of daily deaths reported in a given country (highest value for Poland),
total_cases: the total number of cases recorded in a given country (highest value for Germany),
total_deaths: the total number of deaths recorded in a given country (highest value for Poland).
The above statistics do not fully reflect the epidemic situation in the analyzed countries (countries differ significantly in population size). Therefore, coefficients were introduced to better represent the scale of the epidemic in a given country. These include:
infection_rate_per_1000_population: the total number of cases recorded in a given country per 1000 residents (highest value in the Czech Republic)
mortality_rate_per_1000_population: the total number of deaths recorded in a given country per 1000 residents (highest value in the Czech Republic)
Dodatkowo wyliczone zostały:
percentage_infected: expressed as a percentage indicating the number of infected people relative to the total population (highest value in the Czech Republic)
mortality_rate: expressed as a percentage indicating the number of deaths relative to the number of infections (highest value in Poland), based on this statistic, a factor variable was created: factor: mortality_level, , which has 3 levels of differentiation (low, medium, high)
The presented data indicate that statistically, the Czech Republic had the greatest problem (high number of infected people relative to the population), while Poland struggled the most with the epidemic, as it recorded the highest mortality rate despite not having the highest number of infected people.
The chart shows daily COVID-19 cases in the analyzed countries in May 2020. The bar colors represent different countries, highlighting each country’s contribution to the total number of cases on a given day. The dashed horizontal line indicates the day with the highest number of cases.
#| echo: true
#| code-fold: true
#| output: true
polska <- poland_neighbors |> group_by(dateRep) |> mutate(suma=sum(cases))
do_wykresu <- polska |> filter(cases>0) |> filter(dateRep >
as.Date("2020-05-01") & dateRep < as.Date("2020-06-01")) |> arrange(cases)
lim <- as.POSIXct(c("2020-05-01", "2020-05-31"))
do_wykresu |>
ggplot(aes(x = dateRep, y = cases, fill = countriesAndTerritories)) +
geom_bar(stat = "identity") +
labs(title = "Daily COVID-19 Cases in Poland and Neighboring Countries", subtitle = "May 2020",
x = "Day",
y = "Number of Cases",
fill = "Country",
caption = "Data source: ECDC") +
geom_hline(yintercept = max(do_wykresu$suma), color = "black", linetype=2, linewidth=1) +
annotate("text", x = as.POSIXct("2020-05-23"), y = max(do_wykresu$cases),
label = "Maximum Cases", vjust = -16, color = "red", size=3.5) +
scale_y_continuous(expand = expand_scale(c(0,0.4)))+scale_x_datetime(limits = lim,date_labels = "%e",expand=expand_scale(c(0.01,-0.01)), date_breaks = "2 days")+theme_bw()+theme(title = element_text(
face = "italic",
color = "black",
size = 12,
hjust = 0.5))Wykres przedstawia wykaz rejestrowanej ilości zakażeń w Polsce w okresie od 1 lipca do 30 sierpnia 2020 roku. Zaznaczone na czerwono punkty obrazują zaobserwowaną zależność spadków ilości potwierdzonych przypadków w poniedziałki każdego tygodnia. Moe bz to spowodowane mniejszą ilością wykonywanych testów w dniach poprzedzających (dni wolne od pracy- sobota i niedziela)
#| echo: true
#| code-fold: true
#| output: true
polska <- poland_neighbors |> filter(countriesAndTerritories=="Poland" & dateRep>as.Date("2020-07-01") & dateRep<as.Date("2020-08-31"))
lim <- as.POSIXct(c("2020-07-01", "2020-08-31"))
poniedzialki <- polska |> filter(weekdays(polska$dateRep)=="poniedziałek")
polska |> ggplot(aes(x = dateRep, y = cases)) + geom_point() + geom_line() + geom_vline(xintercept = poniedzialki$dateRep, size = 0.1, color = "red") +
ylim(200, 1500) +
annotate(
"text",
x = poniedzialki$dateRep,
y = 1300,
label = "MONDAY",
vjust = -0.5,
color = "red",
size = 3.5,
angle = 90,
fontface = "bold"
) + scale_x_datetime(
limits = lim,
date_labels = "%a",
expand = expand_scale(c(0.01, -0.01)),
breaks = "1 day"
) + guides(x = guide_axis(angle = 90)) + annotate(
geom = "point",
x = as.POSIXct(poniedzialki$dateRep),
y = poniedzialki$cases,
colour = "red",
size = 4,
alpha = 0.5
) + labs(title = "COVID-19 Cases in Poland",
subtitle = "Period: July 1 - August 30, 2020",
x = "Day of the Week",
y = "Number of Cases",
caption = "Data source: ECDC") + theme_bw()To better represent the data, visualizations in the form of maps were created using the tmap tmap package.
The following maps illustrate the mortality statistics and a factor variable dividing countries into three categories based on the level of this statistic.
data("World")
combined_data <- left_join(results, World, by = "iso_a3")
combined_data <- sf::st_as_sf(combined_data)
combined_data <- na.omit(combined_data)
tmap_arrange(
tm_shape(combined_data) +
tm_fill("mortality_rate", palette = "viridis") +
tm_borders() +
tmap_options(max.categories = 5),
tm_shape(combined_data) +
tm_fill("mortality_level", palette = "viridis") +
tm_borders() +
tmap_options(max.categories = 5),
nrow = 2
)The following interactive map allows for a more engaging presentation of the data. It displays the percentage of infected people relative to the total population of each country.
data("World")
polaczone_dane <- left_join(results, World, by = "iso_a3")
polaczone_dane <- sf::st_as_sf(polaczone_dane)
polaczone_dane <- na.omit(polaczone_dane)
tmap_mode("view")
tm_shape(polaczone_dane)+tm_fill("percentage_infected",palette = "viridis" )+tm_borders()+tmap_options(max.categories = 5)Saving data files with different numbers of rows into a single tibble object creates a challenge. This problem is solved by filling in the missing rows with NA values.
The following loop checks the number of rows in the files and saves them to a vector, allowing us to determine the largest value.
files = dir()
lengths = c()
for (i in seq(1, length(files))) {
if (endsWith(files[i], ".csv")) {
data <- read.csv(files[i])
lengths <- append(lengths, nrow(data))
}
}By obtaining information about the maximum number of rows in the loaded files, we create a table object of type tibble with this dimension.
table = tibble(.rows = max(lengths))The following loop reads data from each CSV file into the environment, where it is temporarily stored in the data. An inner for loop iterates over the columns of the data object, and for each column, a new column is added to table using mutate function from dplyr package. The column name is dynamically assigned based on the original column name in the CSV file. If the number of rows in data is less than intabele it is padded with NA values.
If the sum value exceeds 3, the loop is terminated, meaning only three CSV files will be read.
sum = 0
for (l in seq(1, length(files))) {
if (endsWith(files[l], ".csv")) {
sum = sum + 1
if (sum <= 3) {
data <- read.csv(files[l])
for (column in seq(1, length(data))) {
column_name <- colnames(data)[column]
table <- table |>
mutate(!!column_name := c(data[, column], rep(NA, (
nrow(table) - nrow(data)))))
}
} else {
break
}
}
}Wynikiem działania pętli jest obiekt typu tibble:
# A tibble: 150 × 13
X Time demand Plant Type Treatment conc uptake Sepal.Length
<int> <int> <dbl> <chr> <chr> <chr> <int> <dbl> <dbl>
1 1 1 8.3 Qn1 Quebec nonchilled 95 16 5.1
2 2 2 10.3 Qn1 Quebec nonchilled 175 30.4 4.9
3 3 3 19 Qn1 Quebec nonchilled 250 34.8 4.7
4 4 4 16 Qn1 Quebec nonchilled 350 37.2 4.6
5 5 5 15.6 Qn1 Quebec nonchilled 500 35.3 5
6 6 7 19.8 Qn1 Quebec nonchilled 675 39.2 5.4
7 7 NA NA Qn1 Quebec nonchilled 1000 39.7 4.6
8 8 NA NA Qn2 Quebec nonchilled 95 13.6 5
9 9 NA NA Qn2 Quebec nonchilled 175 27.3 4.4
10 10 NA NA Qn2 Quebec nonchilled 250 37.1 4.9
# ℹ 140 more rows
# ℹ 4 more variables: Sepal.Width <dbl>, Petal.Length <dbl>, Petal.Width <dbl>,
# Species <chr>