For the take-home part of the MSDS 401 Final Exam, you are tasked with analyzing data on new daily covid-19 cases and deaths in European Union (EU) and European Economic Area (EEA) countries. A data file may be downloaded here, or you may use the provided read.csv() code in the ‘setup’ code chunk below to read the data directly from the web csv. Either approach is acceptable; the data should be the same.

Once you have defined a data frame with the daily case and death and country data, you are asked to: (1) perform an Exploratory Data Analysis (EDA), (2) perform some hypothesis testing, (3) perform some correlation testing, and (4) fit and describe a linear regression model. Each of these four (4) items is further explained below and “code chunks” have been created for you in which to add your R code, just as with the R and Data Analysis Assignments. You may add additional code chunks, as needed. You should make comments in the code chunks or add clarifying text between code chunks that you think further your work.

A data dictionary for the dataset is available here.

Definitions:


1. Descriptive Statistics

Perform an Exploratory Data Analysis (EDA). Your EDA is exactly that: yours. Your knit .html should include the visualizations and summary tables that you find valuable while exploring this dataset. However, at minimum, your EDA must include the following:

#Create 'incidence_rate' and 'fatality_rate' vectors and add them to the dataframe
data <- data %>%
  mutate(
    incidence_rate = (cases / popData2020) * 100000,  # Daily new cases per 100K population
    fatality_rate = (deaths / popData2020) * 100000   # Daily new deaths per 100K population
  )

#Plot incidence rate trends over time for a subset of 5 countries
countries_subset <- c("Germany", "France", "Italy", "Spain", "Sweden")

# Incidence rate trend plot
incidence_plot <- data %>%
  filter(countriesAndTerritories %in% countries_subset) %>%  # Filter data for the selected countries
  ggplot(aes(x = dateRep, y = incidence_rate, color = countriesAndTerritories)) +
  geom_line() +
  labs(
    title = "Daily Incidence Rate over Time",
    x = "Date",
    y = "Incidence Rate (per 100K)",
    color = "Country"
  ) +
  theme_minimal()

#Plot fatality rate trends over time for the same subset of countries
fatality_plot <- data %>%
  filter(countriesAndTerritories %in% countries_subset) %>%  # Filter data for the selected countries
  ggplot(aes(x = dateRep, y = fatality_rate, color = countriesAndTerritories)) +
  geom_line() +
  labs(
    title = "Daily Fatality Rate over Time",
    x = "Date",
    y = "Fatality Rate (per 100K)",
    color = "Country"
  ) +
  theme_minimal()

# 4. Summarize total cases, total deaths, and calculate case fatality rate per country
summary_table <- data %>%
  group_by(countriesAndTerritories) %>%  # Group data by country
  summarise(
    total_cases = sum(cases, na.rm = TRUE),  # Calculate total confirmed cases
    total_deaths = sum(deaths, na.rm = TRUE),  # Calculate total deaths
    case_fatality_rate = (total_deaths / total_cases) * 100  # Calculate case fatality rate (%)
  ) %>%
  arrange(desc(case_fatality_rate))  # Arrange countries by descending fatality rate

# Visualize case fatality rate by country using a horizontal bar plot
case_fatality_plot <- summary_table %>%
  ggplot(aes(x = reorder(countriesAndTerritories, -case_fatality_rate), 
             y = case_fatality_rate, fill = case_fatality_rate)) +
  geom_bar(stat = "identity") +
  coord_flip() +  # Flip the coordinates for horizontal bars
  labs(
    title = "Case Fatality Rate by Country",
    x = "Country",
    y = "Case Fatality Rate (%)",
    fill = "Fatality Rate (%)"
  ) +
  theme_minimal()

#Display the plots and summary table
grid.arrange(incidence_plot, fatality_plot, nrow = 2)  # Arrange the two trend plots vertically

# Print the summary table
summary_table
## # A tibble: 30 × 4
##    countriesAndTerritories total_cases total_deaths case_fatality_rate
##    <fct>                         <int>        <int>              <dbl>
##  1 Bulgaria                    1275481        37790              2.96 
##  2 Hungary                     2141513        47938              2.24 
##  3 Romania                     3246580        67179              2.07 
##  4 Poland                      6189562       118050              1.91 
##  5 Croatia                     1244692        17085              1.37 
##  6 Czechia                     4152997        41524              1.00 
##  7 Spain                      13564823       114110              0.841
##  8 Slovakia                    2535554        20511              0.809
##  9 Sweden                      2604866        20407              0.783
## 10 Italy                      23359680       178633              0.765
## # ℹ 20 more rows

2. Inferential Statistics

Select two (2) countries of your choosing and compare their incidence or fatality rates using hypothesis testing. At minimum, your work should include the following:

#Select two countries and filter the data
country1 <- "Germany"
country2 <- "France"

data_subset <- data %>%
  filter(countriesAndTerritories %in% c(country1, country2))

# Visualization: Compare daily incidence rates
incidence_compare_plot <- data_subset %>%
  ggplot(aes(x = dateRep, y = incidence_rate, color = countriesAndTerritories)) +
  geom_line() +
  labs(
    title = "Comparison of Daily Incidence Rates",
    x = "Date",
    y = "Incidence Rate (per 100K)",
    color = "Country"
  ) +
  theme_minimal()

print(incidence_compare_plot)

# Null Hypothesis
# H0: The mean daily incidence rates of Germany and France are equal.
# H1: The mean daily incidence rates of Germany and France are not equal.

# Normality Test (Shapiro-Wilk)
normality_test <- data_subset %>%
  group_by(countriesAndTerritories) %>%
  summarise(shapiro_p_value = shapiro.test(incidence_rate)$p.value)

print(normality_test)
## # A tibble: 2 × 2
##   countriesAndTerritories shapiro_p_value
##   <fct>                             <dbl>
## 1 France                         1.50e-44
## 2 Germany                        8.88e-41
# Interpretation: If p > 0.05, the data does not significantly deviate from normality.

# Variance Equality Test (Manual Calculation)
# Compute variance for each group
variance_summary <- data_subset %>%
  group_by(countriesAndTerritories) %>%
  summarise(variance = var(incidence_rate, na.rm = TRUE))

print(variance_summary)
## # A tibble: 2 × 2
##   countriesAndTerritories variance
##   <fct>                      <dbl>
## 1 France                     9861.
## 2 Germany                    4645.
# If variances are similar, the assumption of equal variances is reasonable.

# Perform Two-Sample T-Test
# Since we lack leveneTest(), manually decide equal variance based on variance_summary results
equal_variance <- abs(diff(variance_summary$variance)) < 0.5 * mean(variance_summary$variance)  # Example threshold

t_test <- t.test(
  incidence_rate ~ countriesAndTerritories,
  data = data_subset,
  var.equal = equal_variance,  # Use equal variance based on above calculation
  conf.level = 0.95
)

print(t_test)
## 
##  Welch Two Sample t-test
## 
## data:  incidence_rate by countriesAndTerritories
## t = 2.9662, df = 1782.1, p-value = 0.003055
## alternative hypothesis: true difference in means between group France and group Germany is not equal to 0
## 95 percent confidence interval:
##   3.82468 18.75366
## sample estimates:
##  mean in group France mean in group Germany 
##              54.06142              42.77224
#  Conclusion Based on Alpha
alpha <- 0.05  # Selected alpha level
if (t_test$p.value < alpha) {
  cat("At alpha =", alpha, ", we reject the null hypothesis. There is a significant difference in incidence rates between Germany and France.\n")
} else {
  cat("At alpha =", alpha, ", we fail to reject the null hypothesis. There is no significant difference in incidence rates between Germany and France.\n")
}
## At alpha = 0.05 , we reject the null hypothesis. There is a significant difference in incidence rates between Germany and France.

3. Correlation

Considering all countries, explore the relationship between incidence rates and fatality rates. At minimum, your work should include the following:

# Prepare the data by selecting incidence_rate and fatality_rate, ignoring country and date
correlation_data <- data %>%
  filter(!is.na(incidence_rate) & !is.na(fatality_rate)) %>%
  select(incidence_rate, fatality_rate)

# Visualize the distributions of incidence_rate and fatality_rate
# Incidence rate histogram
incidence_hist <- ggplot(correlation_data, aes(x = incidence_rate)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Daily Incidence Rates",
    x = "Incidence Rate (per 100K)",
    y = "Frequency"
  ) +
  theme_minimal()

# Fatality rate histogram
fatality_hist <- ggplot(correlation_data, aes(x = fatality_rate)) +
  geom_histogram(bins = 30, fill = "salmon", color = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Daily Fatality Rates",
    x = "Fatality Rate (per 100K)",
    y = "Frequency"
  ) +
  theme_minimal()

# Scatter plot showing relationship between incidence_rate and fatality_rate
scatter_plot <- ggplot(correlation_data, aes(x = incidence_rate, y = fatality_rate)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Scatter Plot of Incidence vs Fatality Rates",
    x = "Incidence Rate (per 100K)",
    y = "Fatality Rate (per 100K)"
  ) +
  theme_minimal()

# Display the histograms and scatter plot
grid.arrange(incidence_hist, fatality_hist, scatter_plot, nrow = 3)

# Statement on the correlation coefficient
# The Pearson correlation coefficient is most appropriate as both variables are continuous.
# Pearson correlation assumes a linear relationship and measures the strength and direction of association.

# Calculate the correlation coefficient
correlation_result <- cor.test(correlation_data$incidence_rate, correlation_data$fatality_rate, method = "pearson")

print(correlation_result)
## 
##  Pearson's product-moment correlation
## 
## data:  correlation_data$incidence_rate and correlation_data$fatality_rate
## t = 19.085, df = 28342, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1011333 0.1241216
## sample estimates:
##       cor 
## 0.1126425
# Interpretation:
cat("Pearson correlation coefficient:", correlation_result$estimate, "\n")
## Pearson correlation coefficient: 0.1126425
if (correlation_result$p.value < 0.05) {
  cat("The correlation is statistically significant (p-value =", correlation_result$p.value, ").\n")
} else {
  cat("The correlation is not statistically significant (p-value =", correlation_result$p.value, ").\n")
}
## The correlation is statistically significant (p-value = 1.0821e-80 ).

4. Regression

Here, we will fit a model on data from twenty (20) countries considering total new cases as a function of population, population density and gross domestic product (GDP) per capita. Note that the GDP per capita is given in “purchasing power standard,” which considers the costs of goods and services in a country relative to incomes in that country; i.e. we will consider this as appropriately standardized.

Code is given below defining a new data frame, ‘model_df,’ which provides the total area and standardized GDP per capita for the twenty (20) countries for our model fit. You are responsible for creating a vector of the total new cases across the time frame of the dataset, for each of those countries, and adding that vector to our ’model_df” data frame.

# The code below creates a new data frame, 'model_df,' that includes the area,
# GDP per capita, population and population density for the twenty (20)
# countries of interest. All you should need to do is execute this code, as is.

# You do not need to add code in this chunk. You will need to add code in the
# 'regression_b,' 'regression_c' and 'regression_d' code chunks.

twenty_countries <- c("Austria", "Belgium", "Bulgaria", "Cyprus", "Denmark",
                      "Finland", "France", "Germany", "Hungary", "Ireland",
                      "Latvia", "Lithuania", "Malta", "Norway", "Poland",
                      "Portugal", "Romania", "Slovakia", "Spain", "Sweden")

sq_km <- c(83858, 30510, 110994, 9251, 44493, 338145, 551695, 357386, 93030,
           70273, 64589, 65300, 316, 385178, 312685, 88416, 238397, 49036,
           498511, 450295)

gdp_pps <- c(128, 118, 51, 91, 129, 111, 104, 123, 71, 190, 69, 81, 100, 142,
             71, 78, 65, 71, 91, 120)

model_df <- data %>%
  select(c(countriesAndTerritories, popData2020)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  distinct(countriesAndTerritories, .keep_all = TRUE) %>%
  add_column(sq_km, gdp_pps) %>%
  mutate(pop_dens = popData2020 / sq_km) %>%
  rename(country = countriesAndTerritories, pop = popData2020)

model_df
##      country      pop  sq_km gdp_pps   pop_dens
## 1    Austria  8901064  83858     128  106.14448
## 2    Belgium 11522440  30510     118  377.66109
## 3   Bulgaria  6951482 110994      51   62.62935
## 4     Cyprus   888005   9251      91   95.99016
## 5    Denmark  5822763  44493     129  130.86919
## 6    Finland  5525292 338145     111   16.34001
## 7     France 67320216 551695     104  122.02434
## 8    Germany 83166711 357386     123  232.70836
## 9    Hungary  9769526  93030      71  105.01479
## 10   Ireland  4964440  70273     190   70.64506
## 11    Latvia  1907675  64589      69   29.53560
## 12 Lithuania  2794090  65300      81   42.78851
## 13     Malta   514564    316     100 1628.36709
## 14    Norway  5367580 385178     142   13.93532
## 15    Poland 37958138 312685      71  121.39418
## 16  Portugal 10295909  88416      78  116.44848
## 17   Romania 19328838 238397      65   81.07836
## 18  Slovakia  5457873  49036      71  111.30339
## 19     Spain 47332614 498511      91   94.94798
## 20    Sweden 10327589 450295     120   22.93516

Next, we need to add one (1) more column to our ‘model_df’ data frame. Specifically, one that has the total number of new cases for each of the twenty (20) countries. We calculate the total number of new cases by summing all the daily new cases, for each country, across all the days in the dataset.

### The following code will be removed for students to complete the work themselves.

total_cases <- data %>%
  select(c(countriesAndTerritories, cases)) %>%
  group_by(countriesAndTerritories) %>%
  dplyr::summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
  filter(countriesAndTerritories %in% twenty_countries) %>%
  select(total_cases)

model_df <- model_df %>%
  add_column(total_cases)

model_df
##      country      pop  sq_km gdp_pps   pop_dens total_cases
## 1    Austria  8901064  83858     128  106.14448     5402162
## 2    Belgium 11522440  30510     118  377.66109     4607296
## 3   Bulgaria  6951482 110994      51   62.62935     1275481
## 4     Cyprus   888005   9251      91   95.99016      596297
## 5    Denmark  5822763  44493     129  130.86919     3219571
## 6    Finland  5525292 338145     111   16.34001     1335318
## 7     France 67320216 551695     104  122.02434    36612627
## 8    Germany 83166711 357386     123  232.70836    35287690
## 9    Hungary  9769526  93030      71  105.01479     2141513
## 10   Ireland  4964440  70273     190   70.64506     1670377
## 11    Latvia  1907675  64589      69   29.53560      949252
## 12 Lithuania  2794090  65300      81   42.78851     1265985
## 13     Malta   514564    316     100 1628.36709      115243
## 14    Norway  5367580 385178     142   13.93532     1463991
## 15    Poland 37958138 312685      71  121.39418     6189562
## 16  Portugal 10295909  88416      78  116.44848     5514482
## 17   Romania 19328838 238397      65   81.07836     3246580
## 18  Slovakia  5457873  49036      71  111.30339     2535554
## 19     Spain 47332614 498511      91   94.94798    13564823
## 20    Sweden 10327589 450295     120   22.93516     2604866

Now, we will fit our model using the data in ‘model_df.’ We are interested in explaining total cases (response) as a function of population (explanatory), population density (explanatory), and GDP (explanatory).

At minimum, your modeling work should including the following:

# Describe the 'model_df' data frame
# Print structure and summary of 'model_df' to understand data types and summary statistics
str(model_df)  # Structure of the data frame
## 'data.frame':    20 obs. of  6 variables:
##  $ country    : Factor w/ 30 levels "Austria","Belgium",..: 1 2 3 5 7 9 10 11 13 15 ...
##  $ pop        : int  8901064 11522440 6951482 888005 5822763 5525292 67320216 83166711 9769526 4964440 ...
##  $ sq_km      : num  83858 30510 110994 9251 44493 ...
##  $ gdp_pps    : num  128 118 51 91 129 111 104 123 71 190 ...
##  $ pop_dens   : num  106.1 377.7 62.6 96 130.9 ...
##  $ total_cases: int  5402162 4607296 1275481 596297 3219571 1335318 36612627 35287690 2141513 1670377 ...
summary(model_df)  # Summary statistics for numeric columns
##      country        pop               sq_km           gdp_pps     
##  Austria : 1   Min.   :  514564   Min.   :   316   Min.   : 51.0  
##  Belgium : 1   1st Qu.: 5266795   1st Qu.: 60701   1st Qu.: 71.0  
##  Bulgaria: 1   Median : 7926273   Median : 90723   Median : 95.5  
##  Cyprus  : 1   Mean   :17305840   Mean   :192118   Mean   :100.2  
##  Denmark : 1   3rd Qu.:13474040   3rd Qu.:342955   3rd Qu.:120.8  
##  Finland : 1   Max.   :83166711   Max.   :551695   Max.   :190.0  
##  (Other) :14                                                      
##     pop_dens        total_cases      
##  Min.   :  13.94   Min.   :  115243  
##  1st Qu.:  57.67   1st Qu.: 1320359  
##  Median : 100.50   Median : 2570210  
##  Mean   : 179.14   Mean   : 6479934  
##  3rd Qu.: 121.55   3rd Qu.: 5430242  
##  Max.   :1628.37   Max.   :36612627  
## 
# Fit the linear regression model
# Response: total_cases
# Explanatory variables: pop (population), pop_dens (population density), gdp_pps (GDP per capita)
model <- lm(total_cases ~ pop + pop_dens + gdp_pps, data = model_df)

# Output model summary
model_summary <- summary(model)
print(model_summary)
## 
## Call:
## lm(formula = total_cases ~ pop + pop_dens + gdp_pps, data = model_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -8241030 -1086127    -9952  1651494  8715365 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.859e+06  2.749e+06  -1.404    0.179    
## pop          4.268e-01  3.625e-02  11.774 2.71e-09 ***
## pop_dens     6.499e+02  2.399e+03   0.271    0.790    
## gdp_pps      2.831e+04  2.524e+04   1.121    0.279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3655000 on 16 degrees of freedom
## Multiple R-squared:  0.8982, Adjusted R-squared:  0.8791 
## F-statistic: 47.03 on 3 and 16 DF,  p-value: 3.686e-08
#  Evaluate the fit of the model
# Extract key details from the model summary
r_squared <- model_summary$r.squared  # R-squared
coefficients <- model_summary$coefficients  # Coefficients table
p_values <- coefficients[, 4]  # P-values for each coefficient

# Output key findings
cat("Model R-squared:", r_squared, "\n")
## Model R-squared: 0.8981509
cat("Statistically significant coefficients (p < 0.05):\n")
## Statistically significant coefficients (p < 0.05):
significant_vars <- rownames(coefficients)[p_values < 0.05]
if (length(significant_vars) > 0) {
  cat(significant_vars, "\n")
} else {
  cat("None of the coefficients are statistically significant at alpha = 0.05.\n")
}
## pop
# Recommendation for model reduction
# If none or few coefficients are significant, suggest a reduced model
if (length(significant_vars) < ncol(model_df) - 1) {
  cat("Consider fitting a reduced model by removing non-significant variables.\n")
} else {
  cat("All variables appear important; no reduction is recommended.\n")
}
## Consider fitting a reduced model by removing non-significant variables.

The last thing we will do is use our model to predict the total new cases of two (2) countries not included in our model fit. At minimum, your work should include:

# Define the 'newdata' data frame (already provided)
newdata <- data.frame(
  country = c("Luxembourg", "Netherlands"),
  pop = c(626108, 17407585),
  gdp_pps = c(261, 130),
  pop_dens = c(626108, 17407585) / c(2586, 41540)
)

# Retrieve the actual total new cases for Luxembourg and Netherlands
actual_cases <- data %>%
  filter(countriesAndTerritories %in% newdata$country) %>%  # Filter for the two countries
  group_by(countriesAndTerritories) %>%  # Group by country
  summarise(actual_total_cases = sum(cases, na.rm = TRUE))  # Calculate total cases

print(actual_cases)
## # A tibble: 2 × 2
##   countriesAndTerritories actual_total_cases
##   <fct>                                <int>
## 1 Luxembourg                          301031
## 2 Netherlands                        8494705
# Predict the total new cases using the fitted model
predicted_cases <- predict(model, newdata = newdata)  # Use the 'predict()' function to get predictions

# Add predictions to the 'newdata' data frame
newdata <- newdata %>%
  mutate(predicted_total_cases = predicted_cases)

# Compare actual and predicted cases
comparison <- newdata %>%
  left_join(actual_cases, by = c("country" = "countriesAndTerritories"))

print(comparison)
##       country      pop gdp_pps pop_dens predicted_total_cases
## 1  Luxembourg   626108     261 242.1145               3953237
## 2 Netherlands 17407585     130 419.0560               7522800
##   actual_total_cases
## 1             301031
## 2            8494705
#  Evaluate model performance
# Calculate residuals: actual - predicted
comparison <- comparison %>%
  mutate(residual = actual_total_cases - predicted_total_cases)

# Print the comparison table
print(comparison)
##       country      pop gdp_pps pop_dens predicted_total_cases
## 1  Luxembourg   626108     261 242.1145               3953237
## 2 Netherlands 17407585     130 419.0560               7522800
##   actual_total_cases   residual
## 1             301031 -3652206.4
## 2            8494705   971905.4
# Short statement on model performance
cat("Model Performance for Luxembourg and Netherlands:\n")
## Model Performance for Luxembourg and Netherlands:
cat("Actual vs Predicted Total Cases:\n")
## Actual vs Predicted Total Cases:
print(comparison[, c("country", "actual_total_cases", "predicted_total_cases", "residual")])
##       country actual_total_cases predicted_total_cases   residual
## 1  Luxembourg             301031               3953237 -3652206.4
## 2 Netherlands            8494705               7522800   971905.4
if (mean(abs(comparison$residual)) < mean(abs(model$residuals))) {
  cat("The model performed comparably well for the new countries.\n")
} else {
  cat("The model's predictions for the new countries show higher residuals compared to the training data.\n")
}
## The model's predictions for the new countries show higher residuals compared to the training data.