For the take-home part of the MSDS 401 Final Exam, you are tasked with analyzing data on new daily covid-19 cases and deaths in European Union (EU) and European Economic Area (EEA) countries. A data file may be downloaded here, or you may use the provided read.csv() code in the ‘setup’ code chunk below to read the data directly from the web csv. Either approach is acceptable; the data should be the same.
Once you have defined a data frame with the daily case and death and country data, you are asked to: (1) perform an Exploratory Data Analysis (EDA), (2) perform some hypothesis testing, (3) perform some correlation testing, and (4) fit and describe a linear regression model. Each of these four (4) items is further explained below and “code chunks” have been created for you in which to add your R code, just as with the R and Data Analysis Assignments. You may add additional code chunks, as needed. You should make comments in the code chunks or add clarifying text between code chunks that you think further your work.
A data dictionary for the dataset is available here.
“Incidence rate” is equal to new daily cases per 100K individuals. Country population estimates can be found in ‘popData2020.’ You will calculate a daily incidence rate in item (1), for each country, that we will explore further in items (2) and (3).
“Fatality rate” is equal to new daily deaths per 100K individuals. Country population estimates can be found in ‘popData2020.’ You will calculate a daily fatality rate in item (1), for each country, that we will explore further in items (2) and (3).
Perform an Exploratory Data Analysis (EDA). Your EDA is exactly that: yours. Your knit .html should include the visualizations and summary tables that you find valuable while exploring this dataset. However, at minimum, your EDA must include the following:
#Create 'incidence_rate' and 'fatality_rate' vectors and add them to the dataframe
data <- data %>%
mutate(
incidence_rate = (cases / popData2020) * 100000, # Daily new cases per 100K population
fatality_rate = (deaths / popData2020) * 100000 # Daily new deaths per 100K population
)
#Plot incidence rate trends over time for a subset of 5 countries
countries_subset <- c("Germany", "France", "Italy", "Spain", "Sweden")
# Incidence rate trend plot
incidence_plot <- data %>%
filter(countriesAndTerritories %in% countries_subset) %>% # Filter data for the selected countries
ggplot(aes(x = dateRep, y = incidence_rate, color = countriesAndTerritories)) +
geom_line() +
labs(
title = "Daily Incidence Rate over Time",
x = "Date",
y = "Incidence Rate (per 100K)",
color = "Country"
) +
theme_minimal()
#Plot fatality rate trends over time for the same subset of countries
fatality_plot <- data %>%
filter(countriesAndTerritories %in% countries_subset) %>% # Filter data for the selected countries
ggplot(aes(x = dateRep, y = fatality_rate, color = countriesAndTerritories)) +
geom_line() +
labs(
title = "Daily Fatality Rate over Time",
x = "Date",
y = "Fatality Rate (per 100K)",
color = "Country"
) +
theme_minimal()
# 4. Summarize total cases, total deaths, and calculate case fatality rate per country
summary_table <- data %>%
group_by(countriesAndTerritories) %>% # Group data by country
summarise(
total_cases = sum(cases, na.rm = TRUE), # Calculate total confirmed cases
total_deaths = sum(deaths, na.rm = TRUE), # Calculate total deaths
case_fatality_rate = (total_deaths / total_cases) * 100 # Calculate case fatality rate (%)
) %>%
arrange(desc(case_fatality_rate)) # Arrange countries by descending fatality rate
# Visualize case fatality rate by country using a horizontal bar plot
case_fatality_plot <- summary_table %>%
ggplot(aes(x = reorder(countriesAndTerritories, -case_fatality_rate),
y = case_fatality_rate, fill = case_fatality_rate)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip the coordinates for horizontal bars
labs(
title = "Case Fatality Rate by Country",
x = "Country",
y = "Case Fatality Rate (%)",
fill = "Fatality Rate (%)"
) +
theme_minimal()
#Display the plots and summary table
grid.arrange(incidence_plot, fatality_plot, nrow = 2) # Arrange the two trend plots vertically
# Print the summary table
summary_table
## # A tibble: 30 × 4
## countriesAndTerritories total_cases total_deaths case_fatality_rate
## <fct> <int> <int> <dbl>
## 1 Bulgaria 1275481 37790 2.96
## 2 Hungary 2141513 47938 2.24
## 3 Romania 3246580 67179 2.07
## 4 Poland 6189562 118050 1.91
## 5 Croatia 1244692 17085 1.37
## 6 Czechia 4152997 41524 1.00
## 7 Spain 13564823 114110 0.841
## 8 Slovakia 2535554 20511 0.809
## 9 Sweden 2604866 20407 0.783
## 10 Italy 23359680 178633 0.765
## # ℹ 20 more rows
Select two (2) countries of your choosing and compare their incidence or fatality rates using hypothesis testing. At minimum, your work should include the following:
#Select two countries and filter the data
country1 <- "Germany"
country2 <- "France"
data_subset <- data %>%
filter(countriesAndTerritories %in% c(country1, country2))
# Visualization: Compare daily incidence rates
incidence_compare_plot <- data_subset %>%
ggplot(aes(x = dateRep, y = incidence_rate, color = countriesAndTerritories)) +
geom_line() +
labs(
title = "Comparison of Daily Incidence Rates",
x = "Date",
y = "Incidence Rate (per 100K)",
color = "Country"
) +
theme_minimal()
print(incidence_compare_plot)
# Null Hypothesis
# H0: The mean daily incidence rates of Germany and France are equal.
# H1: The mean daily incidence rates of Germany and France are not equal.
# Normality Test (Shapiro-Wilk)
normality_test <- data_subset %>%
group_by(countriesAndTerritories) %>%
summarise(shapiro_p_value = shapiro.test(incidence_rate)$p.value)
print(normality_test)
## # A tibble: 2 × 2
## countriesAndTerritories shapiro_p_value
## <fct> <dbl>
## 1 France 1.50e-44
## 2 Germany 8.88e-41
# Interpretation: If p > 0.05, the data does not significantly deviate from normality.
# Variance Equality Test (Manual Calculation)
# Compute variance for each group
variance_summary <- data_subset %>%
group_by(countriesAndTerritories) %>%
summarise(variance = var(incidence_rate, na.rm = TRUE))
print(variance_summary)
## # A tibble: 2 × 2
## countriesAndTerritories variance
## <fct> <dbl>
## 1 France 9861.
## 2 Germany 4645.
# If variances are similar, the assumption of equal variances is reasonable.
# Perform Two-Sample T-Test
# Since we lack leveneTest(), manually decide equal variance based on variance_summary results
equal_variance <- abs(diff(variance_summary$variance)) < 0.5 * mean(variance_summary$variance) # Example threshold
t_test <- t.test(
incidence_rate ~ countriesAndTerritories,
data = data_subset,
var.equal = equal_variance, # Use equal variance based on above calculation
conf.level = 0.95
)
print(t_test)
##
## Welch Two Sample t-test
##
## data: incidence_rate by countriesAndTerritories
## t = 2.9662, df = 1782.1, p-value = 0.003055
## alternative hypothesis: true difference in means between group France and group Germany is not equal to 0
## 95 percent confidence interval:
## 3.82468 18.75366
## sample estimates:
## mean in group France mean in group Germany
## 54.06142 42.77224
# Conclusion Based on Alpha
alpha <- 0.05 # Selected alpha level
if (t_test$p.value < alpha) {
cat("At alpha =", alpha, ", we reject the null hypothesis. There is a significant difference in incidence rates between Germany and France.\n")
} else {
cat("At alpha =", alpha, ", we fail to reject the null hypothesis. There is no significant difference in incidence rates between Germany and France.\n")
}
## At alpha = 0.05 , we reject the null hypothesis. There is a significant difference in incidence rates between Germany and France.
Considering all countries, explore the relationship between incidence rates and fatality rates. At minimum, your work should include the following:
# Prepare the data by selecting incidence_rate and fatality_rate, ignoring country and date
correlation_data <- data %>%
filter(!is.na(incidence_rate) & !is.na(fatality_rate)) %>%
select(incidence_rate, fatality_rate)
# Visualize the distributions of incidence_rate and fatality_rate
# Incidence rate histogram
incidence_hist <- ggplot(correlation_data, aes(x = incidence_rate)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black", alpha = 0.7) +
labs(
title = "Distribution of Daily Incidence Rates",
x = "Incidence Rate (per 100K)",
y = "Frequency"
) +
theme_minimal()
# Fatality rate histogram
fatality_hist <- ggplot(correlation_data, aes(x = fatality_rate)) +
geom_histogram(bins = 30, fill = "salmon", color = "black", alpha = 0.7) +
labs(
title = "Distribution of Daily Fatality Rates",
x = "Fatality Rate (per 100K)",
y = "Frequency"
) +
theme_minimal()
# Scatter plot showing relationship between incidence_rate and fatality_rate
scatter_plot <- ggplot(correlation_data, aes(x = incidence_rate, y = fatality_rate)) +
geom_point(alpha = 0.5) +
labs(
title = "Scatter Plot of Incidence vs Fatality Rates",
x = "Incidence Rate (per 100K)",
y = "Fatality Rate (per 100K)"
) +
theme_minimal()
# Display the histograms and scatter plot
grid.arrange(incidence_hist, fatality_hist, scatter_plot, nrow = 3)
# Statement on the correlation coefficient
# The Pearson correlation coefficient is most appropriate as both variables are continuous.
# Pearson correlation assumes a linear relationship and measures the strength and direction of association.
# Calculate the correlation coefficient
correlation_result <- cor.test(correlation_data$incidence_rate, correlation_data$fatality_rate, method = "pearson")
print(correlation_result)
##
## Pearson's product-moment correlation
##
## data: correlation_data$incidence_rate and correlation_data$fatality_rate
## t = 19.085, df = 28342, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1011333 0.1241216
## sample estimates:
## cor
## 0.1126425
# Interpretation:
cat("Pearson correlation coefficient:", correlation_result$estimate, "\n")
## Pearson correlation coefficient: 0.1126425
if (correlation_result$p.value < 0.05) {
cat("The correlation is statistically significant (p-value =", correlation_result$p.value, ").\n")
} else {
cat("The correlation is not statistically significant (p-value =", correlation_result$p.value, ").\n")
}
## The correlation is statistically significant (p-value = 1.0821e-80 ).
Here, we will fit a model on data from twenty (20) countries considering total new cases as a function of population, population density and gross domestic product (GDP) per capita. Note that the GDP per capita is given in “purchasing power standard,” which considers the costs of goods and services in a country relative to incomes in that country; i.e. we will consider this as appropriately standardized.
Code is given below defining a new data frame, ‘model_df,’ which provides the total area and standardized GDP per capita for the twenty (20) countries for our model fit. You are responsible for creating a vector of the total new cases across the time frame of the dataset, for each of those countries, and adding that vector to our ’model_df” data frame.
# The code below creates a new data frame, 'model_df,' that includes the area,
# GDP per capita, population and population density for the twenty (20)
# countries of interest. All you should need to do is execute this code, as is.
# You do not need to add code in this chunk. You will need to add code in the
# 'regression_b,' 'regression_c' and 'regression_d' code chunks.
twenty_countries <- c("Austria", "Belgium", "Bulgaria", "Cyprus", "Denmark",
"Finland", "France", "Germany", "Hungary", "Ireland",
"Latvia", "Lithuania", "Malta", "Norway", "Poland",
"Portugal", "Romania", "Slovakia", "Spain", "Sweden")
sq_km <- c(83858, 30510, 110994, 9251, 44493, 338145, 551695, 357386, 93030,
70273, 64589, 65300, 316, 385178, 312685, 88416, 238397, 49036,
498511, 450295)
gdp_pps <- c(128, 118, 51, 91, 129, 111, 104, 123, 71, 190, 69, 81, 100, 142,
71, 78, 65, 71, 91, 120)
model_df <- data %>%
select(c(countriesAndTerritories, popData2020)) %>%
filter(countriesAndTerritories %in% twenty_countries) %>%
distinct(countriesAndTerritories, .keep_all = TRUE) %>%
add_column(sq_km, gdp_pps) %>%
mutate(pop_dens = popData2020 / sq_km) %>%
rename(country = countriesAndTerritories, pop = popData2020)
model_df
## country pop sq_km gdp_pps pop_dens
## 1 Austria 8901064 83858 128 106.14448
## 2 Belgium 11522440 30510 118 377.66109
## 3 Bulgaria 6951482 110994 51 62.62935
## 4 Cyprus 888005 9251 91 95.99016
## 5 Denmark 5822763 44493 129 130.86919
## 6 Finland 5525292 338145 111 16.34001
## 7 France 67320216 551695 104 122.02434
## 8 Germany 83166711 357386 123 232.70836
## 9 Hungary 9769526 93030 71 105.01479
## 10 Ireland 4964440 70273 190 70.64506
## 11 Latvia 1907675 64589 69 29.53560
## 12 Lithuania 2794090 65300 81 42.78851
## 13 Malta 514564 316 100 1628.36709
## 14 Norway 5367580 385178 142 13.93532
## 15 Poland 37958138 312685 71 121.39418
## 16 Portugal 10295909 88416 78 116.44848
## 17 Romania 19328838 238397 65 81.07836
## 18 Slovakia 5457873 49036 71 111.30339
## 19 Spain 47332614 498511 91 94.94798
## 20 Sweden 10327589 450295 120 22.93516
Next, we need to add one (1) more column to our ‘model_df’ data frame. Specifically, one that has the total number of new cases for each of the twenty (20) countries. We calculate the total number of new cases by summing all the daily new cases, for each country, across all the days in the dataset.
### The following code will be removed for students to complete the work themselves.
total_cases <- data %>%
select(c(countriesAndTerritories, cases)) %>%
group_by(countriesAndTerritories) %>%
dplyr::summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
filter(countriesAndTerritories %in% twenty_countries) %>%
select(total_cases)
model_df <- model_df %>%
add_column(total_cases)
model_df
## country pop sq_km gdp_pps pop_dens total_cases
## 1 Austria 8901064 83858 128 106.14448 5402162
## 2 Belgium 11522440 30510 118 377.66109 4607296
## 3 Bulgaria 6951482 110994 51 62.62935 1275481
## 4 Cyprus 888005 9251 91 95.99016 596297
## 5 Denmark 5822763 44493 129 130.86919 3219571
## 6 Finland 5525292 338145 111 16.34001 1335318
## 7 France 67320216 551695 104 122.02434 36612627
## 8 Germany 83166711 357386 123 232.70836 35287690
## 9 Hungary 9769526 93030 71 105.01479 2141513
## 10 Ireland 4964440 70273 190 70.64506 1670377
## 11 Latvia 1907675 64589 69 29.53560 949252
## 12 Lithuania 2794090 65300 81 42.78851 1265985
## 13 Malta 514564 316 100 1628.36709 115243
## 14 Norway 5367580 385178 142 13.93532 1463991
## 15 Poland 37958138 312685 71 121.39418 6189562
## 16 Portugal 10295909 88416 78 116.44848 5514482
## 17 Romania 19328838 238397 65 81.07836 3246580
## 18 Slovakia 5457873 49036 71 111.30339 2535554
## 19 Spain 47332614 498511 91 94.94798 13564823
## 20 Sweden 10327589 450295 120 22.93516 2604866
Now, we will fit our model using the data in ‘model_df.’ We are interested in explaining total cases (response) as a function of population (explanatory), population density (explanatory), and GDP (explanatory).
At minimum, your modeling work should including the following:
# Describe the 'model_df' data frame
# Print structure and summary of 'model_df' to understand data types and summary statistics
str(model_df) # Structure of the data frame
## 'data.frame': 20 obs. of 6 variables:
## $ country : Factor w/ 30 levels "Austria","Belgium",..: 1 2 3 5 7 9 10 11 13 15 ...
## $ pop : int 8901064 11522440 6951482 888005 5822763 5525292 67320216 83166711 9769526 4964440 ...
## $ sq_km : num 83858 30510 110994 9251 44493 ...
## $ gdp_pps : num 128 118 51 91 129 111 104 123 71 190 ...
## $ pop_dens : num 106.1 377.7 62.6 96 130.9 ...
## $ total_cases: int 5402162 4607296 1275481 596297 3219571 1335318 36612627 35287690 2141513 1670377 ...
summary(model_df) # Summary statistics for numeric columns
## country pop sq_km gdp_pps
## Austria : 1 Min. : 514564 Min. : 316 Min. : 51.0
## Belgium : 1 1st Qu.: 5266795 1st Qu.: 60701 1st Qu.: 71.0
## Bulgaria: 1 Median : 7926273 Median : 90723 Median : 95.5
## Cyprus : 1 Mean :17305840 Mean :192118 Mean :100.2
## Denmark : 1 3rd Qu.:13474040 3rd Qu.:342955 3rd Qu.:120.8
## Finland : 1 Max. :83166711 Max. :551695 Max. :190.0
## (Other) :14
## pop_dens total_cases
## Min. : 13.94 Min. : 115243
## 1st Qu.: 57.67 1st Qu.: 1320359
## Median : 100.50 Median : 2570210
## Mean : 179.14 Mean : 6479934
## 3rd Qu.: 121.55 3rd Qu.: 5430242
## Max. :1628.37 Max. :36612627
##
# Fit the linear regression model
# Response: total_cases
# Explanatory variables: pop (population), pop_dens (population density), gdp_pps (GDP per capita)
model <- lm(total_cases ~ pop + pop_dens + gdp_pps, data = model_df)
# Output model summary
model_summary <- summary(model)
print(model_summary)
##
## Call:
## lm(formula = total_cases ~ pop + pop_dens + gdp_pps, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8241030 -1086127 -9952 1651494 8715365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.859e+06 2.749e+06 -1.404 0.179
## pop 4.268e-01 3.625e-02 11.774 2.71e-09 ***
## pop_dens 6.499e+02 2.399e+03 0.271 0.790
## gdp_pps 2.831e+04 2.524e+04 1.121 0.279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3655000 on 16 degrees of freedom
## Multiple R-squared: 0.8982, Adjusted R-squared: 0.8791
## F-statistic: 47.03 on 3 and 16 DF, p-value: 3.686e-08
# Evaluate the fit of the model
# Extract key details from the model summary
r_squared <- model_summary$r.squared # R-squared
coefficients <- model_summary$coefficients # Coefficients table
p_values <- coefficients[, 4] # P-values for each coefficient
# Output key findings
cat("Model R-squared:", r_squared, "\n")
## Model R-squared: 0.8981509
cat("Statistically significant coefficients (p < 0.05):\n")
## Statistically significant coefficients (p < 0.05):
significant_vars <- rownames(coefficients)[p_values < 0.05]
if (length(significant_vars) > 0) {
cat(significant_vars, "\n")
} else {
cat("None of the coefficients are statistically significant at alpha = 0.05.\n")
}
## pop
# Recommendation for model reduction
# If none or few coefficients are significant, suggest a reduced model
if (length(significant_vars) < ncol(model_df) - 1) {
cat("Consider fitting a reduced model by removing non-significant variables.\n")
} else {
cat("All variables appear important; no reduction is recommended.\n")
}
## Consider fitting a reduced model by removing non-significant variables.
The last thing we will do is use our model to predict the total new cases of two (2) countries not included in our model fit. At minimum, your work should include:
# Define the 'newdata' data frame (already provided)
newdata <- data.frame(
country = c("Luxembourg", "Netherlands"),
pop = c(626108, 17407585),
gdp_pps = c(261, 130),
pop_dens = c(626108, 17407585) / c(2586, 41540)
)
# Retrieve the actual total new cases for Luxembourg and Netherlands
actual_cases <- data %>%
filter(countriesAndTerritories %in% newdata$country) %>% # Filter for the two countries
group_by(countriesAndTerritories) %>% # Group by country
summarise(actual_total_cases = sum(cases, na.rm = TRUE)) # Calculate total cases
print(actual_cases)
## # A tibble: 2 × 2
## countriesAndTerritories actual_total_cases
## <fct> <int>
## 1 Luxembourg 301031
## 2 Netherlands 8494705
# Predict the total new cases using the fitted model
predicted_cases <- predict(model, newdata = newdata) # Use the 'predict()' function to get predictions
# Add predictions to the 'newdata' data frame
newdata <- newdata %>%
mutate(predicted_total_cases = predicted_cases)
# Compare actual and predicted cases
comparison <- newdata %>%
left_join(actual_cases, by = c("country" = "countriesAndTerritories"))
print(comparison)
## country pop gdp_pps pop_dens predicted_total_cases
## 1 Luxembourg 626108 261 242.1145 3953237
## 2 Netherlands 17407585 130 419.0560 7522800
## actual_total_cases
## 1 301031
## 2 8494705
# Evaluate model performance
# Calculate residuals: actual - predicted
comparison <- comparison %>%
mutate(residual = actual_total_cases - predicted_total_cases)
# Print the comparison table
print(comparison)
## country pop gdp_pps pop_dens predicted_total_cases
## 1 Luxembourg 626108 261 242.1145 3953237
## 2 Netherlands 17407585 130 419.0560 7522800
## actual_total_cases residual
## 1 301031 -3652206.4
## 2 8494705 971905.4
# Short statement on model performance
cat("Model Performance for Luxembourg and Netherlands:\n")
## Model Performance for Luxembourg and Netherlands:
cat("Actual vs Predicted Total Cases:\n")
## Actual vs Predicted Total Cases:
print(comparison[, c("country", "actual_total_cases", "predicted_total_cases", "residual")])
## country actual_total_cases predicted_total_cases residual
## 1 Luxembourg 301031 3953237 -3652206.4
## 2 Netherlands 8494705 7522800 971905.4
if (mean(abs(comparison$residual)) < mean(abs(model$residuals))) {
cat("The model performed comparably well for the new countries.\n")
} else {
cat("The model's predictions for the new countries show higher residuals compared to the training data.\n")
}
## The model's predictions for the new countries show higher residuals compared to the training data.