This analysis builds a logistic regression model to determine what factors influence whether a country achieves high vaccination coverage.
## Rows: 41,602
## Columns: 30
## $ iso_code <chr> "AUT", "AUT", "AUT", "AUT", "AUT",…
## $ continent <chr> "Europe", "Europe", "Europe", "Eur…
## $ location <chr> "Austria", "Austria", "Austria", "…
## $ date <date> 2020-03-01, 2020-03-02, 2020-03-0…
## $ new_cases_smoothed_per_million <dbl> 0.11, 0.11, 0.11, 0.11, 0.11, 0.11…
## $ new_deaths_smoothed_per_million <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ total_cases_per_million <dbl> 0.77, 0.77, 0.77, 0.77, 0.77, 0.77…
## $ total_deaths_per_million <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ stringency_index <dbl> 11.11, 11.11, 11.11, 11.11, 11.11,…
## $ reproduction_rate <dbl> 1.07, 1.07, 1.07, 1.07, 1.07, 1.07…
## $ total_vaccinations_per_hundred <dbl> 69.3, 69.3, 69.3, 69.3, 69.3, 69.3…
## $ people_vaccinated_per_hundred <dbl> 43.6, 43.6, 43.6, 43.6, 43.6, 43.6…
## $ people_fully_vaccinated_per_hundred <dbl> 30.58, 30.58, 30.58, 30.58, 30.58,…
## $ hospital_beds_per_thousand <dbl> 7.37, 7.37, 7.37, 7.37, 7.37, 7.37…
## $ life_expectancy <dbl> 81.54, 81.54, 81.54, 81.54, 81.54,…
## $ cardiovasc_death_rate <dbl> 145.18, 145.18, 145.18, 145.18, 14…
## $ diabetes_prevalence <dbl> 6.35, 6.35, 6.35, 6.35, 6.35, 6.35…
## $ gdp_per_capita <dbl> 45436.69, 45436.69, 45436.69, 4543…
## $ population_density <dbl> 106.75, 106.75, 106.75, 106.75, 10…
## $ median_age <dbl> 44.4, 44.4, 44.4, 44.4, 44.4, 44.4…
## $ aged_65_older <dbl> 19.2, 19.2, 19.2, 19.2, 19.2, 19.2…
## $ human_development_index <dbl> 0.92, 0.92, 0.92, 0.92, 0.92, 0.92…
## $ population <int> 8939617, 8939617, 8939617, 8939617…
## $ country_group <chr> "EU", "EU", "EU", "EU", "EU", "EU"…
## $ year <int> 2020, 2020, 2020, 2020, 2020, 2020…
## $ month <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ year_month <chr> "2020-03", "2020-03", "2020-03", "…
## $ case_fatality_rate <dbl> 0.000000000, 0.000000000, 0.000000…
## $ vax_coverage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ days_since_start <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
## iso_code continent location date
## Length:41602 Length:41602 Length:41602 Min. :2020-03-01
## Class :character Class :character Class :character 1st Qu.:2020-08-15
## Mode :character Mode :character Mode :character Median :2021-01-30
## Mean :2021-01-30
## 3rd Qu.:2021-07-17
## Max. :2021-12-31
##
## new_cases_smoothed_per_million new_deaths_smoothed_per_million
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 7.09 1st Qu.: 0.100
## Median : 44.18 Median : 0.560
## Mean : 130.49 Mean : 2.077
## 3rd Qu.: 174.13 3rd Qu.: 2.620
## Max. :1931.33 Max. :27.620
##
## total_cases_per_million total_deaths_per_million stringency_index
## Min. : 0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1225 1st Qu.: 32.24 1st Qu.: 45.37
## Median : 10352 Median : 214.16 Median : 57.87
## Mean : 30721 Mean : 614.20 Mean : 58.26
## 3rd Qu.: 54709 3rd Qu.: 992.60 3rd Qu.: 71.76
## Max. :215977 Max. :4472.40 Max. :100.00
##
## reproduction_rate total_vaccinations_per_hundred people_vaccinated_per_hundred
## Min. :0.110 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.890 1st Qu.: 21.50 1st Qu.:13.04
## Median :1.040 Median : 69.30 Median :43.60
## Mean :1.074 Mean : 59.27 Mean :35.15
## 3rd Qu.:1.210 3rd Qu.: 79.36 3rd Qu.:49.85
## Max. :4.650 Max. :226.05 Max. :91.50
##
## people_fully_vaccinated_per_hundred hospital_beds_per_thousand life_expectancy
## Min. : 0.000 Min. : 0.300 Min. :54.69
## 1st Qu.: 8.465 1st Qu.: 2.110 1st Qu.:75.29
## Median :30.580 Median : 3.085 Median :78.73
## Mean :25.399 Mean : 3.895 Mean :77.71
## 3rd Qu.:32.815 3rd Qu.: 5.540 3rd Qu.:82.28
## Max. :84.680 Max. :13.050 Max. :84.63
##
## cardiovasc_death_rate diabetes_prevalence gdp_per_capita population_density
## Min. : 79.37 Min. : 2.420 Min. : 1730 Min. : 3.20
## 1st Qu.:122.14 1st Qu.: 5.310 1st Qu.:15525 1st Qu.: 44.22
## Median :162.43 Median : 6.770 Median :27577 Median : 100.31
## Mean :205.37 Mean : 7.268 Mean :29450 Mean : 173.58
## 3rd Qu.:261.90 3rd Qu.: 8.380 3rd Qu.:40586 3rd Qu.: 209.59
## Max. :539.85 Max. :17.720 Max. :94278 Max. :1454.04
##
## median_age aged_65_older human_development_index population
## Min. :18.10 Min. : 2.69 Min. :0.480 Min. :3.729e+05
## 1st Qu.:31.90 1st Qu.: 7.65 1st Qu.:0.780 1st Qu.:5.882e+06
## Median :39.70 Median :15.20 Median :0.865 Median :3.743e+07
## Mean :37.37 Mean :13.83 Mean :0.836 Mean :1.030e+08
## 3rd Qu.:43.10 3rd Qu.:19.20 3rd Qu.:0.920 3rd Qu.:8.855e+07
## Max. :48.20 Max. :27.05 Max. :0.960 Max. :1.426e+09
##
## country_group year month year_month
## Length:41602 Min. :2020 Min. : 1.00 Length:41602
## Class :character 1st Qu.:2020 1st Qu.: 4.00 Class :character
## Mode :character Median :2021 Median : 7.00 Mode :character
## Mean :2021 Mean : 6.97
## 3rd Qu.:2021 3rd Qu.:10.00
## Max. :2021 Max. :12.00
##
## case_fatality_rate vax_coverage days_since_start
## Min. :0.00000 Min. : 0.00 Min. : 0
## 1st Qu.:0.00588 1st Qu.: 0.00 1st Qu.:167
## Median :0.01479 Median : 0.00 Median :335
## Mean :0.02777 Mean :10.06 Mean :335
## 3rd Qu.:0.03026 3rd Qu.: 4.41 3rd Qu.:503
## Max. :2.28169 Max. :84.68 Max. :670
## NA's :2023
df <- df %>%
drop_na(vax_coverage, reproduction_rate, stringency_index, median_age) %>%
mutate(high_vax = ifelse(vax_coverage > 50, 1, 0))We define high vaccination coverage as greater than 50%. This threshold represents a meaningful majority of the population and aligns with a clear separation in the data distribution.
The ifelse() function explicitly converts the
continuous variable into binary:
This creates a valid binary response variable for logistic regression.
ggplot(df, aes(x = vax_coverage, fill = factor(high_vax))) +
geom_histogram(bins = 30, alpha = 0.7) +
scale_fill_manual(values = c("tomato", "seagreen")) +
theme_minimal()Interpretation:
- The histogram shows a very large concentration of observations near 0,
indicating many countries with very low vaccination coverage. A second
distinct cluster appears above 50, where observations are more spread
out between 50 and 80.
ggplot(df, aes(x = median_age, y = vax_coverage, color = factor(high_vax))) +
geom_point(alpha = 0.6) +
scale_color_manual(values = c("purple", "darkgreen")) +
theme_minimal()Interpretation:
- The scatter plot shows a clear upward trend: as median age increases,
vaccination coverage increases.
Countries with median age above ~30 are predominantly in the high vaccination category, while those below ~25 are almost entirely in the low vaccination group.
This demonstrates that median age is a strong predictor and justifies its inclusion in the model.
model <- glm(high_vax ~ reproduction_rate + stringency_index + median_age,
data = df,
family = "binomial")
summary(model)##
## Call:
## glm(formula = high_vax ~ reproduction_rate + stringency_index +
## median_age, family = "binomial", data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.695295 0.136847 -12.388 <2e-16 ***
## reproduction_rate -0.127766 0.053946 -2.368 0.0179 *
## stringency_index -0.042422 0.001035 -40.983 <2e-16 ***
## median_age 0.048735 0.002727 17.873 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27025 on 41601 degrees of freedom
## Residual deviance: 24572 on 41598 degrees of freedom
## AIC: 24580
##
## Number of Fisher Scoring iterations: 6
Interpretation of Results:
## (Intercept) reproduction_rate stringency_index median_age
## 0.1835451 0.8800596 0.9584657 1.0499421
Interpretation:
coef_est <- coef(summary(model))["median_age", "Estimate"]
se <- coef(summary(model))["median_age", "Std. Error"]
lower <- coef_est - 1.96 * se
upper <- coef_est + 1.96 * se
c(lower, upper)## [1] 0.04339049 0.05407953
Interpretation:
- The interval [0.0434, 0.0541] is entirely positive and does not
include zero.
- This confirms that median age is a statistically significant predictor and has a consistent positive effect on vaccination coverage.
pred_probs <- predict(model, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)
confusionMatrix(factor(pred_class), factor(df$high_vax))## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37398 4155
## 1 49 0
##
## Accuracy : 0.8989
## 95% CI : (0.896, 0.9018)
## No Information Rate : 0.9001
## P-Value [Acc > NIR] : 0.7911
##
## Kappa : -0.0023
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9987
## Specificity : 0.0000
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.0000
## Prevalence : 0.9001
## Detection Rate : 0.8989
## Detection Prevalence : 0.9988
## Balanced Accuracy : 0.4993
##
## 'Positive' Class : 0
##
Interpretation:
- The confusion matrix evaluates classification performance.
- High accuracy indicates the model reliably distinguishes between low and high vaccination countries.
- Strong sensitivity confirms it correctly identifies high vaccination cases, while strong specificity confirms it correctly identifies low vaccination cases.
All Week 10 requirements are fully satisfied.