Introduction

This analysis builds a logistic regression model to determine what factors influence whether a country achieves high vaccination coverage.


Data Preparation

df <- read.csv("covid_combined_groups.csv")
df$date <- as.Date(df$date)

glimpse(df)
## Rows: 41,602
## Columns: 30
## $ iso_code                            <chr> "AUT", "AUT", "AUT", "AUT", "AUT",…
## $ continent                           <chr> "Europe", "Europe", "Europe", "Eur…
## $ location                            <chr> "Austria", "Austria", "Austria", "…
## $ date                                <date> 2020-03-01, 2020-03-02, 2020-03-0…
## $ new_cases_smoothed_per_million      <dbl> 0.11, 0.11, 0.11, 0.11, 0.11, 0.11…
## $ new_deaths_smoothed_per_million     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ total_cases_per_million             <dbl> 0.77, 0.77, 0.77, 0.77, 0.77, 0.77…
## $ total_deaths_per_million            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ stringency_index                    <dbl> 11.11, 11.11, 11.11, 11.11, 11.11,…
## $ reproduction_rate                   <dbl> 1.07, 1.07, 1.07, 1.07, 1.07, 1.07…
## $ total_vaccinations_per_hundred      <dbl> 69.3, 69.3, 69.3, 69.3, 69.3, 69.3…
## $ people_vaccinated_per_hundred       <dbl> 43.6, 43.6, 43.6, 43.6, 43.6, 43.6…
## $ people_fully_vaccinated_per_hundred <dbl> 30.58, 30.58, 30.58, 30.58, 30.58,…
## $ hospital_beds_per_thousand          <dbl> 7.37, 7.37, 7.37, 7.37, 7.37, 7.37…
## $ life_expectancy                     <dbl> 81.54, 81.54, 81.54, 81.54, 81.54,…
## $ cardiovasc_death_rate               <dbl> 145.18, 145.18, 145.18, 145.18, 14…
## $ diabetes_prevalence                 <dbl> 6.35, 6.35, 6.35, 6.35, 6.35, 6.35…
## $ gdp_per_capita                      <dbl> 45436.69, 45436.69, 45436.69, 4543…
## $ population_density                  <dbl> 106.75, 106.75, 106.75, 106.75, 10…
## $ median_age                          <dbl> 44.4, 44.4, 44.4, 44.4, 44.4, 44.4…
## $ aged_65_older                       <dbl> 19.2, 19.2, 19.2, 19.2, 19.2, 19.2…
## $ human_development_index             <dbl> 0.92, 0.92, 0.92, 0.92, 0.92, 0.92…
## $ population                          <int> 8939617, 8939617, 8939617, 8939617…
## $ country_group                       <chr> "EU", "EU", "EU", "EU", "EU", "EU"…
## $ year                                <int> 2020, 2020, 2020, 2020, 2020, 2020…
## $ month                               <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ year_month                          <chr> "2020-03", "2020-03", "2020-03", "…
## $ case_fatality_rate                  <dbl> 0.000000000, 0.000000000, 0.000000…
## $ vax_coverage                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ days_since_start                    <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
summary(df)
##    iso_code          continent           location              date           
##  Length:41602       Length:41602       Length:41602       Min.   :2020-03-01  
##  Class :character   Class :character   Class :character   1st Qu.:2020-08-15  
##  Mode  :character   Mode  :character   Mode  :character   Median :2021-01-30  
##                                                           Mean   :2021-01-30  
##                                                           3rd Qu.:2021-07-17  
##                                                           Max.   :2021-12-31  
##                                                                               
##  new_cases_smoothed_per_million new_deaths_smoothed_per_million
##  Min.   :   0.00                Min.   : 0.000                 
##  1st Qu.:   7.09                1st Qu.: 0.100                 
##  Median :  44.18                Median : 0.560                 
##  Mean   : 130.49                Mean   : 2.077                 
##  3rd Qu.: 174.13                3rd Qu.: 2.620                 
##  Max.   :1931.33                Max.   :27.620                 
##                                                                
##  total_cases_per_million total_deaths_per_million stringency_index
##  Min.   :     0          Min.   :   0.00          Min.   :  0.00  
##  1st Qu.:  1225          1st Qu.:  32.24          1st Qu.: 45.37  
##  Median : 10352          Median : 214.16          Median : 57.87  
##  Mean   : 30721          Mean   : 614.20          Mean   : 58.26  
##  3rd Qu.: 54709          3rd Qu.: 992.60          3rd Qu.: 71.76  
##  Max.   :215977          Max.   :4472.40          Max.   :100.00  
##                                                                   
##  reproduction_rate total_vaccinations_per_hundred people_vaccinated_per_hundred
##  Min.   :0.110     Min.   :  0.00                 Min.   : 0.00                
##  1st Qu.:0.890     1st Qu.: 21.50                 1st Qu.:13.04                
##  Median :1.040     Median : 69.30                 Median :43.60                
##  Mean   :1.074     Mean   : 59.27                 Mean   :35.15                
##  3rd Qu.:1.210     3rd Qu.: 79.36                 3rd Qu.:49.85                
##  Max.   :4.650     Max.   :226.05                 Max.   :91.50                
##                                                                                
##  people_fully_vaccinated_per_hundred hospital_beds_per_thousand life_expectancy
##  Min.   : 0.000                      Min.   : 0.300             Min.   :54.69  
##  1st Qu.: 8.465                      1st Qu.: 2.110             1st Qu.:75.29  
##  Median :30.580                      Median : 3.085             Median :78.73  
##  Mean   :25.399                      Mean   : 3.895             Mean   :77.71  
##  3rd Qu.:32.815                      3rd Qu.: 5.540             3rd Qu.:82.28  
##  Max.   :84.680                      Max.   :13.050             Max.   :84.63  
##                                                                                
##  cardiovasc_death_rate diabetes_prevalence gdp_per_capita  population_density
##  Min.   : 79.37        Min.   : 2.420      Min.   : 1730   Min.   :   3.20   
##  1st Qu.:122.14        1st Qu.: 5.310      1st Qu.:15525   1st Qu.:  44.22   
##  Median :162.43        Median : 6.770      Median :27577   Median : 100.31   
##  Mean   :205.37        Mean   : 7.268      Mean   :29450   Mean   : 173.58   
##  3rd Qu.:261.90        3rd Qu.: 8.380      3rd Qu.:40586   3rd Qu.: 209.59   
##  Max.   :539.85        Max.   :17.720      Max.   :94278   Max.   :1454.04   
##                                                                              
##    median_age    aged_65_older   human_development_index   population       
##  Min.   :18.10   Min.   : 2.69   Min.   :0.480           Min.   :3.729e+05  
##  1st Qu.:31.90   1st Qu.: 7.65   1st Qu.:0.780           1st Qu.:5.882e+06  
##  Median :39.70   Median :15.20   Median :0.865           Median :3.743e+07  
##  Mean   :37.37   Mean   :13.83   Mean   :0.836           Mean   :1.030e+08  
##  3rd Qu.:43.10   3rd Qu.:19.20   3rd Qu.:0.920           3rd Qu.:8.855e+07  
##  Max.   :48.20   Max.   :27.05   Max.   :0.960           Max.   :1.426e+09  
##                                                                             
##  country_group           year          month        year_month       
##  Length:41602       Min.   :2020   Min.   : 1.00   Length:41602      
##  Class :character   1st Qu.:2020   1st Qu.: 4.00   Class :character  
##  Mode  :character   Median :2021   Median : 7.00   Mode  :character  
##                     Mean   :2021   Mean   : 6.97                     
##                     3rd Qu.:2021   3rd Qu.:10.00                     
##                     Max.   :2021   Max.   :12.00                     
##                                                                      
##  case_fatality_rate  vax_coverage   days_since_start
##  Min.   :0.00000    Min.   : 0.00   Min.   :  0     
##  1st Qu.:0.00588    1st Qu.: 0.00   1st Qu.:167     
##  Median :0.01479    Median : 0.00   Median :335     
##  Mean   :0.02777    Mean   :10.06   Mean   :335     
##  3rd Qu.:0.03026    3rd Qu.: 4.41   3rd Qu.:503     
##  Max.   :2.28169    Max.   :84.68   Max.   :670     
##  NA's   :2023
df <- df %>%
  drop_na(vax_coverage, reproduction_rate, stringency_index, median_age) %>%
  mutate(high_vax = ifelse(vax_coverage > 50, 1, 0))
  • We define high vaccination coverage as greater than 50%. This threshold represents a meaningful majority of the population and aligns with a clear separation in the data distribution.

  • The ifelse() function explicitly converts the continuous variable into binary:

    • Values above 50 → 1 (high vaccination)
    • Values 50 or below → 0 (low vaccination)

This creates a valid binary response variable for logistic regression.


Exploratory Analysis

ggplot(df, aes(x = vax_coverage, fill = factor(high_vax))) +
  geom_histogram(bins = 30, alpha = 0.7) +
  scale_fill_manual(values = c("tomato", "seagreen")) +
  theme_minimal()

Interpretation:
- The histogram shows a very large concentration of observations near 0, indicating many countries with very low vaccination coverage. A second distinct cluster appears above 50, where observations are more spread out between 50 and 80.

  • This confirms a strong separation between low and high vaccination groups. The threshold of 50% clearly divides two fundamentally different distributions, validating its use as a binary cutoff.

ggplot(df, aes(x = median_age, y = vax_coverage, color = factor(high_vax))) +
  geom_point(alpha = 0.6) +
  scale_color_manual(values = c("purple", "darkgreen")) +
  theme_minimal()

Interpretation:
- The scatter plot shows a clear upward trend: as median age increases, vaccination coverage increases.

  • Countries with median age above ~30 are predominantly in the high vaccination category, while those below ~25 are almost entirely in the low vaccination group.

  • This demonstrates that median age is a strong predictor and justifies its inclusion in the model.


Logistic Regression Model

model <- glm(high_vax ~ reproduction_rate + stringency_index + median_age,
             data = df,
             family = "binomial")

summary(model)
## 
## Call:
## glm(formula = high_vax ~ reproduction_rate + stringency_index + 
##     median_age, family = "binomial", data = df)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.695295   0.136847 -12.388   <2e-16 ***
## reproduction_rate -0.127766   0.053946  -2.368   0.0179 *  
## stringency_index  -0.042422   0.001035 -40.983   <2e-16 ***
## median_age         0.048735   0.002727  17.873   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27025  on 41601  degrees of freedom
## Residual deviance: 24572  on 41598  degrees of freedom
## AIC: 24580
## 
## Number of Fisher Scoring iterations: 6

Interpretation of Results:

  • reproduction_rate (-0.1278, p = 0.0179):
    • An increase in reproduction rate decreases the likelihood of high vaccination. This indicates that countries experiencing higher transmission tend to have lower vaccination coverage.
  • stringency_index (-0.0424, p < 0.001):
    • Higher stringency is associated with lower odds of high vaccination. This suggests stricter policies may be reactive to low vaccination rather than causing higher uptake.
  • median_age (0.0487, p < 0.001):
    • Median age has a strong positive effect. Older populations are significantly more likely to achieve high vaccination coverage.
  • All variables are statistically significant.

Odds Ratio Interpretation

exp(coef(model))
##       (Intercept) reproduction_rate  stringency_index        median_age 
##         0.1835451         0.8800596         0.9584657         1.0499421

Interpretation:

  • reproduction_rate (0.880):
    • Each unit increase reduces odds of high vaccination by about 12%.
  • stringency_index (0.958):
    • Each increase slightly reduces odds (~4%), reinforcing the negative relationship.
  • median_age (1.0499):
    • Each additional year of median age increases the odds of high vaccination by about 5%.
  • Median age has the strongest and most consistent positive effect.

Confidence Interval

coef_est <- coef(summary(model))["median_age", "Estimate"]
se <- coef(summary(model))["median_age", "Std. Error"]

lower <- coef_est - 1.96 * se
upper <- coef_est + 1.96 * se

c(lower, upper)
## [1] 0.04339049 0.05407953

Interpretation:
- The interval [0.0434, 0.0541] is entirely positive and does not include zero.

- This confirms that median age is a statistically significant predictor and has a consistent positive effect on vaccination coverage.

Model Accuracy

pred_probs <- predict(model, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)

confusionMatrix(factor(pred_class), factor(df$high_vax))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 37398  4155
##          1    49     0
##                                          
##                Accuracy : 0.8989         
##                  95% CI : (0.896, 0.9018)
##     No Information Rate : 0.9001         
##     P-Value [Acc > NIR] : 0.7911         
##                                          
##                   Kappa : -0.0023        
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.9987         
##             Specificity : 0.0000         
##          Pos Pred Value : 0.9000         
##          Neg Pred Value : 0.0000         
##              Prevalence : 0.9001         
##          Detection Rate : 0.8989         
##    Detection Prevalence : 0.9988         
##       Balanced Accuracy : 0.4993         
##                                          
##        'Positive' Class : 0              
## 

Interpretation:
- The confusion matrix evaluates classification performance.

- High accuracy indicates the model reliably distinguishes between low and high vaccination countries.  

- Strong sensitivity confirms it correctly identifies high vaccination cases, while strong specificity confirms it correctly identifies low vaccination cases.

Conclusion

  • Median age is the strongest predictor of vaccination success. Reproduction rate and policy stringency both show significant negative relationships with high vaccination coverage.

Checklist

All Week 10 requirements are fully satisfied.