Script-Based Data Processing

Laboratory Report

Author

KB

Published

September 11, 2024

Abstract:
This report presents the processing and analysis of data related to the development of the COVID-19 pandemic in Poland and neighboring countries. Statistics are summarized, including infection and mortality rates. The data is then visualized using charts and maps. The report also includes a function for calculating the mortality rate. The final conclusions focus on differences between countries, allowing for a better understanding of the epidemic situation.

1 Analysis of the COVID-19 Pandemic Development in Poland and Neighboring Countries

1.1 Required Packages

Code
library(rio)
library(tidyverse)
library(spData)
library(sf)
library(tmap)
library(knitr)
library(DT)

1.2 Data

1.2.1 Data Acquisition

For the purpose of this report, I used a dataset containing daily updated data on the COVID-19 pandemic prepared by the European Centre for Disease Prevention and Control. .

Code
data <- rio::import("COVID-19-geographic-disbtribution-worldwide-2020-12-14.xlsx")
data <- as_tibble(data)

The data was loaded using the rio, which allows for easy reading of data from various file formats and then converted to a tibble object.

Additionally, for data visualization on the map, I used the World dataset, from the tmap package.This dataset contains information related to demographics, economics, and social indicators for various countries around the world, along with their georeferences.

Code
data("World")

1.2.2 Data Description

Data Set Explanations
variable explanation
dateRep Date of reporting (dd/mm/yyyy)
day Day
month Month
year Year
cases Number of newly reported cases
deaths Number of newly reported deaths
countriesAndterritories Country name
geoId 2-letter ISO code
countriesAndterritoryCode 3-letter ISO code
popData2020 Population in 2020
continentExp Continent name
Geographic Data Set Explanations
variable explanation
iso_3 3-letter ISO code
name Country name
sovereignt Country status (sovereignty)
continent Continent name
area Area
pop_est Population
pop_est_dens Population density
economy Economic category
income_grp Income group
gdp_cap_est GDP per capita
life_exp Life expectancy
well_being Well-being index
footprint Ecological footprint
inequality Social inequality index (Gini)
HPI Happiness index
geometry MULTIPOLYGON

1.3 Statistical Summary

1.3.1 Table of Calculated Statistics

Code
countries <-
  c("PL",
    "DE",
    "SK",
    "CZ",
    "UA",
    "BY",
    "LT")

europe <-
  data |>  filter(continentExp == "Europe") |> arrange(dateRep)

poland_neighbors <-
  europe |> filter(geoId %in% countries)

results <-
  poland_neighbors |> filter(cases > 0) |> filter(dateRep < as.POSIXct("2020-12-31")) |> group_by(countriesAndTerritories) |> summarise(
  iso_a3=unique(countryterritoryCode),
  max_daily_cases = max(cases),
  max_daily_deaths =
    max(deaths),
  avg_daily_cases =
    round(mean(cases),2),
  avg_daily_deaths =
    round(mean(deaths),2),
  total_cases =
    sum(cases),
  total_deaths =
    sum(deaths),
  infection_rate_per_1000_population=round(1000*(sum(cases)/max(popData2019)),2),
  mortality_rate_per_1000_population =
    round(1000*(sum(deaths) / max(popData2019)), 2),
  percentage_infected=paste(round(100*((sum(cases) / max(popData2019))), 2), "%"),
  mortality_rate=paste(round(100*(sum(deaths) / sum(cases)), 2),"%"), 
  mortality_level = cut((100*(sum(deaths) / sum(cases))), breaks=c(0,1,1.7,max(100*(sum(deaths)/ sum(cases)))), labels = c("low", "medium", "high"))) 
                   
  

countries <- results$countriesAndTerritories
results <- results |> select(-countriesAndTerritories)
row.names(results) <- countries
datatable(results,class = 'cell-border stripe')

1.3.2 Comment

The summary presents various statistical information regarding the development of the epidemic. The obtained statistics represent:

  • max_daily_cases: the highest daily increase in cases in a given country (highest value for Poland),

  • max_daily_deaths: the highest daily increase in deaths in a given country (highest value for Poland),

  • avg_daily_cases: the average number of daily cases reported in a given country (highest value for Germany),

  • avg_daily_deaths: the average number of daily deaths reported in a given country (highest value for Poland),

  • total_cases: the total number of cases recorded in a given country (highest value for Germany),

  • total_deaths: the total number of deaths recorded in a given country (highest value for Poland).

The above statistics do not fully reflect the epidemic situation in the analyzed countries (countries differ significantly in population size). Therefore, coefficients were introduced to better represent the scale of the epidemic in a given country. These include:

  • infection_rate_per_1000_population: the total number of cases recorded in a given country per 1000 residents (highest value in the Czech Republic)

  • mortality_rate_per_1000_population: the total number of deaths recorded in a given country per 1000 residents (highest value in the Czech Republic)

Dodatkowo wyliczone zostały:

  • percentage_infected: expressed as a percentage indicating the number of infected people relative to the total population (highest value in the Czech Republic)

  • mortality_rate: expressed as a percentage indicating the number of deaths relative to the number of infections (highest value in Poland), based on this statistic, a factor variable was created: factor: mortality_level, , which has 3 levels of differentiation (low, medium, high)

1.3.3 Conclusions

The presented data indicate that statistically, the Czech Republic had the greatest problem (high number of infected people relative to the population), while Poland struggled the most with the epidemic, as it recorded the highest mortality rate despite not having the highest number of infected people.


1.4 Data Visualization

1.4.1 Charts

The chart shows daily COVID-19 cases in the analyzed countries in May 2020. The bar colors represent different countries, highlighting each country’s contribution to the total number of cases on a given day. The dashed horizontal line indicates the day with the highest number of cases.

Code
#| echo: true
#| code-fold: true
#| output: true

polska <- poland_neighbors |> group_by(dateRep) |> mutate(suma=sum(cases))
do_wykresu <- polska |> filter(cases>0) |> filter(dateRep > 
                                                             as.Date("2020-05-01") & dateRep < as.Date("2020-06-01")) |> arrange(cases)
lim <- as.POSIXct(c("2020-05-01", "2020-05-31"))

do_wykresu |> 
  ggplot(aes(x = dateRep, y = cases, fill = countriesAndTerritories)) +
  geom_bar(stat = "identity") +
  labs(title = "Daily COVID-19 Cases in Poland and Neighboring Countries", subtitle = "May 2020",
       x = "Day",
       y = "Number of Cases",
       fill = "Country",
       caption = "Data source: ECDC") +
  geom_hline(yintercept = max(do_wykresu$suma), color = "black", linetype=2, linewidth=1) +
  annotate("text", x = as.POSIXct("2020-05-23"), y = max(do_wykresu$cases), 
           label = "Maximum Cases", vjust = -16, color = "red", size=3.5) +
  scale_y_continuous(expand = expand_scale(c(0,0.4)))+scale_x_datetime(limits = lim,date_labels = "%e",expand=expand_scale(c(0.01,-0.01)), date_breaks = "2 days")+theme_bw()+theme(title = element_text(
  face = "italic",
  color = "black",
  size = 12,
  hjust = 0.5))

Wykres przedstawia wykaz rejestrowanej ilości zakażeń w Polsce w okresie od 1 lipca do 30 sierpnia 2020 roku. Zaznaczone na czerwono punkty obrazują zaobserwowaną zależność spadków ilości potwierdzonych przypadków w poniedziałki każdego tygodnia. Moe bz to spowodowane mniejszą ilością wykonywanych testów w dniach poprzedzających (dni wolne od pracy- sobota i niedziela)

Code
#| echo: true
#| code-fold: true
#| output: true

polska <- poland_neighbors |> filter(countriesAndTerritories=="Poland" & dateRep>as.Date("2020-07-01") & dateRep<as.Date("2020-08-31"))
lim <- as.POSIXct(c("2020-07-01", "2020-08-31"))
poniedzialki <- polska |> filter(weekdays(polska$dateRep)=="poniedziałek")
polska |> ggplot(aes(x = dateRep, y = cases)) + geom_point() + geom_line() + geom_vline(xintercept = poniedzialki$dateRep, size = 0.1, color = "red") +
  ylim(200, 1500) +
  annotate(
    "text",
    x = poniedzialki$dateRep,
    y = 1300,
    label = "MONDAY",
    vjust = -0.5,
    color = "red",
    size = 3.5,
    angle = 90,
    fontface = "bold"
  ) + scale_x_datetime(
    limits = lim,
    date_labels = "%a",
    expand = expand_scale(c(0.01, -0.01)),
    breaks  = "1 day"
  ) + guides(x = guide_axis(angle = 90)) +  annotate(
    geom = "point",
    x = as.POSIXct(poniedzialki$dateRep),
    y = poniedzialki$cases,
    colour = "red",
    size = 4,
    alpha = 0.5
  ) + labs(title = "COVID-19 Cases in Poland",
       subtitle = "Period: July 1 - August 30, 2020",
       x = "Day of the Week",
       y = "Number of Cases",
       caption = "Data source: ECDC") + theme_bw()

1.4.2 Mapy

To better represent the data, visualizations in the form of maps were created using the tmap tmap package.

The following maps illustrate the mortality statistics and a factor variable dividing countries into three categories based on the level of this statistic.

Code
data("World")
combined_data <- left_join(results, World, by = "iso_a3")
combined_data <- sf::st_as_sf(combined_data)
combined_data <- na.omit(combined_data)

tmap_arrange(
  tm_shape(combined_data) + 
    tm_fill("mortality_rate", palette = "viridis") + 
    tm_borders() + 
    tmap_options(max.categories = 5),
  tm_shape(combined_data) + 
    tm_fill("mortality_level", palette = "viridis") + 
    tm_borders() + 
    tmap_options(max.categories = 5),
  nrow = 2
)

The following interactive map allows for a more engaging presentation of the data. It displays the percentage of infected people relative to the total population of each country.

Code
data("World")
polaczone_dane <- left_join(results, World, by = "iso_a3")
polaczone_dane <- sf::st_as_sf(polaczone_dane)
polaczone_dane <- na.omit(polaczone_dane)
tmap_mode("view")
tm_shape(polaczone_dane)+tm_fill("percentage_infected",palette = "viridis" )+tm_borders()+tmap_options(max.categories = 5)

2 Extra Task: Loop for Reading Exactly Three CSV Files

2.1 Introduction

Saving data files with different numbers of rows into a single tibble object creates a challenge. This problem is solved by filling in the missing rows with NA values.

2.2 Solution

The following loop checks the number of rows in the files and saves them to a vector, allowing us to determine the largest value.

Code
files = dir()
lengths = c()
for (i in seq(1, length(files))) {
  if (endsWith(files[i], ".csv")) {
    data <- read.csv(files[i])
    lengths <- append(lengths, nrow(data))
  }
}

By obtaining information about the maximum number of rows in the loaded files, we create a table object of type tibble with this dimension.

Code
table = tibble(.rows = max(lengths))

The following loop reads data from each CSV file into the environment, where it is temporarily stored in the data. An inner for loop iterates over the columns of the data object, and for each column, a new column is added to table using mutate function from dplyr package. The column name is dynamically assigned based on the original column name in the CSV file. If the number of rows in data is less than intabele it is padded with NA values.

If the sum value exceeds 3, the loop is terminated, meaning only three CSV files will be read.

Code
sum = 0
for (l in seq(1, length(files))) {
  if (endsWith(files[l], ".csv")) {
    sum = sum + 1
    if (sum <= 3) {
      data <- read.csv(files[l])
      for (column in seq(1, length(data))) {
        column_name <- colnames(data)[column]
        table <- table |>
          mutate(!!column_name := c(data[, column], rep(NA, (
            nrow(table) - nrow(data)))))
      }
    } else {
      break
    }
  }
}

Wynikiem działania pętli jest obiekt typu tibble:

# A tibble: 150 × 13
       X  Time demand Plant Type   Treatment   conc uptake Sepal.Length
   <int> <int>  <dbl> <chr> <chr>  <chr>      <int>  <dbl>        <dbl>
 1     1     1    8.3 Qn1   Quebec nonchilled    95   16            5.1
 2     2     2   10.3 Qn1   Quebec nonchilled   175   30.4          4.9
 3     3     3   19   Qn1   Quebec nonchilled   250   34.8          4.7
 4     4     4   16   Qn1   Quebec nonchilled   350   37.2          4.6
 5     5     5   15.6 Qn1   Quebec nonchilled   500   35.3          5  
 6     6     7   19.8 Qn1   Quebec nonchilled   675   39.2          5.4
 7     7    NA   NA   Qn1   Quebec nonchilled  1000   39.7          4.6
 8     8    NA   NA   Qn2   Quebec nonchilled    95   13.6          5  
 9     9    NA   NA   Qn2   Quebec nonchilled   175   27.3          4.4
10    10    NA   NA   Qn2   Quebec nonchilled   250   37.1          4.9
# ℹ 140 more rows
# ℹ 4 more variables: Sepal.Width <dbl>, Petal.Length <dbl>, Petal.Width <dbl>,
#   Species <chr>