Vaccine Hesitancy in Germany: An Exploratory Multi-Method Analysis of the ISSP 2021 Dataset

Author

Theresa Nink

Published

March 10, 2026

1 Introduction

Vaccination is one of the most important public health measures for stopping the spread of communicable diseases and prevents millions of deaths each year. The COVID-19 pandemic recently demonstrated its enormous value: vaccination campaigns saved an estimated 14 million or more lives globally in their first year alone (Watson et al., 2022). However, the pandemic also revealed the persistent challenge of vaccine hesitancy, which is defined as a “delay in acceptance or refusal of vaccination despite availability of vaccination services” (MacDonald, 2015). Research on vaccine hesitancy has identified many correlates, including socioeconomic, demographic, trust-related, political, religious and informational factors (De Araújo et al., 2024; Cénat et al., 2022; Aw et al., 2021; Jennings et al., 2021).

This analysis leverages tree-based machine learning algorithms as well as logistic regression to explore the prevalence and determinants of vaccine hesitancy in Germany using data from the International Social Survey Programme 2021 Health module. Importantly, this study seeks to identify factors that reliably correlate with hesitancy, not to predict individual cases.

Research question: Which sociodemographic and attitudinal factors are associated with vaccine hesitancy among the German population?

Four complementary modeling approaches are used. First, a decision tree is fit on a balanced subsample for structural visualization; then a Random Forest is applied to estimate variable importance. Logistic regression serves as the primary interpretive model, and LASSO regularization is employed to check the robustness of variable selection. Convergence of results across the four methods strengthens confidence in the identified associations.

2 Data & Preprocessing

2.1 Data Source

The ISSP 2021 “Health and Healthcare II” dataset (ZA8000) is a cross-national survey conducted by the International Social Survey Programme, available via GESIS. It covers health-related attitudes, behaviors, experiences, and sociodemographic information across 30 countries. In Germany, data were collected between June 14 and August 18, 2021, during the COVID-19 vaccination campaign.

The original dataset contains 44,549 observations. For this analysis, only the German subsample is used.

2.2 Outcome Variable: Vaccine Hesitancy

Vaccine hesitancy was operationalized by constructing a binary dependent variable, Vac_hesitant, from two Likert-scaled survey items (1 = Strongly agree, 5 = Strongly disagree):

“Overall, vaccinations do more harm than good.”
“It is better to develop immunity by getting ill than by getting vaccinated.”

A respondent was classified as hesitant if they agreed (score 1 or 2) with at least one item. The combined indicator captures both explicit rejection of vaccines and a preference for natural immunity, which makes it a more sensitive and comprehensive measure of hesitancy than a single item. Moreover, the OR-rule was chosen in alignment with the WHO definition of vaccine hesitancy, which emphasizes doubt about vaccination rather than just refusal (MacDonald, 2015).

A stricter AND-rule (agreement with both items) was examined retrospectively as a sensitivity check and yielded a prevalence of only 2.6%, compared to 13% under the OR-rule. The large difference suggests that many respondents express partial rather than full hesitancy.
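The two classification rules can be sketched as follows. This is a minimal illustration rather than the code used in the analysis: the function names and the toy responses are hypothetical, and only the coding (1 = Strongly agree, 5 = Strongly disagree; agreement = score 1 or 2) is taken from the text above.

```python
# Hypothetical helper names; coding follows the text: 1 = Strongly agree ... 5 = Strongly disagree.
AGREE_SCORES = {1, 2}

def hesitant_or(harm_item, immunity_item):
    """OR-rule: agreement with at least one of the two items."""
    return harm_item in AGREE_SCORES or immunity_item in AGREE_SCORES

def hesitant_and(harm_item, immunity_item):
    """AND-rule (sensitivity check): agreement with both items."""
    return harm_item in AGREE_SCORES and immunity_item in AGREE_SCORES

# Toy responses (made up for illustration), as (harm_item, immunity_item) pairs.
responses = [(1, 4), (2, 2), (3, 5), (5, 1), (4, 4)]
or_flags = [hesitant_or(a, b) for a, b in responses]
and_flags = [hesitant_and(a, b) for a, b in responses]
print(sum(or_flags), sum(and_flags))
```

Every respondent flagged by the AND-rule is also flagged by the OR-rule, so the AND-rule prevalence can never exceed the OR-rule prevalence.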

Validation: To confirm that the binary classification is internally consistent, an inverted sum score (range 2–10, higher = more hesitant) was computed from both items. Boxplots show a clear separation of the groups: the median score of respondents classified as hesitant was 7 (IQR 6 to 8), compared to a median of 3 (IQR 2 to 4) among non-hesitant respondents (Figure 1). The AUC of the sum score as a predictor of the binary classification was 0.989 and the Spearman correlation was 0.565, confirming internal consistency.

Validation of binary outcome against the underlying continuous score. Clear group separation confirms internal consistency.
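The validation logic can be illustrated with a short sketch. It assumes that "inverted" means each item is reversed as 6 minus the response before summing, which matches the stated range of 2 to 10; the function names and the toy item pairs are hypothetical.

```python
def inverted_sum_score(harm_item, immunity_item):
    """Reverse each 1-5 item (6 - x) and add them: range 2-10, higher = more hesitant."""
    return (6 - harm_item) + (6 - immunity_item)

def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy item pairs (made up): respondents classified as hesitant / non-hesitant by the OR-rule.
hesitant = [(1, 4), (2, 2), (5, 1)]
non_hesitant = [(3, 5), (4, 4), (3, 3)]
pos = [inverted_sum_score(a, b) for a, b in hesitant]
neg = [inverted_sum_score(a, b) for a, b in non_hesitant]
print(pos, neg, round(auc(pos, neg), 3))
```

Under the OR-rule, a hesitant respondent scores at least 5 and a non-hesitant respondent at most 6, so the two groups can overlap only at scores 5 and 6. This is why an AUC close to, but not exactly, 1 is expected by construction.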

2.3 Predictor Variables

Fourteen predictors were selected based on the existing literature on vaccine hesitancy (De Araújo et al., 2024; Aw et al., 2021), covering trust, information-seeking, and sociodemographic factors.

2.4 Missing Data

Missing values per predictor before imputation.
Variable	Missing (%)
inc	20.9
info_health_stress	8.5
not_reliable_online	8.3
info_health	8.0
info_health_vaccines	7.9
eduY	2.9
conf_hcs	2.2
religion	2.0
trust_doc	1.1
rural	1.1
age	0.8
insurance	0.8
sex	0.4
Vac_hesitant	0.0
politics	0.0

Income (20.9%) and political party (34.4%) had the highest non-response. For political party, non-response was treated as a separate category (no answer) rather than imputed, because missingness of 34% is likely not at random (MNAR) and imputing it could introduce systematic bias. This is why politics appears with 0.0% missing in the table above.

For all other variables, Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) was applied. Complete-case analysis was not appropriate here, since removing all rows with any missing value would have further reduced the already small hesitant group (n=205), making it harder for the models to learn the minority class signal.
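The idea behind predictive mean matching can be sketched for a single variable with one predictor. This is a deliberately simplified illustration and not the procedure run in the analysis: full MICE cycles through all incomplete variables and draws the regression parameters with noise, both of which are omitted here. The function names, the donor count k, the seed and the toy data are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries in y with observed donor values chosen by predicted-mean distance."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_line([xi for xi, _ in obs], [yi for _, yi in obs])
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = a + b * xi
        # Donors: the k observed cases whose predicted values are closest to the target.
        donors = sorted(obs, key=lambda o: abs((a + b * o[0]) - target))[:k]
        completed.append(rng.choice(donors)[1])
    return completed

# Toy data (made up): years of education and monthly income with two missing incomes.
edu = [9, 10, 12, 13, 16, 18, 11, 17]
inc = [1800, 2100, 3000, None, 4200, 5000, None, 4600]
print(pmm_impute(edu, inc))
```

Because imputed values are always borrowed from observed donors, PMM cannot produce implausible values such as negative incomes.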

Limitation: Ideally, all five imputed datasets should be pooled via Rubin’s Rules. A single imputed dataset is used here for pragmatic reasons, which slightly underestimates imputation uncertainty.
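For reference, pooling with Rubin's Rules can be sketched as follows. The function name and the five coefficient estimates are made up for illustration and are not results from this analysis.

```python
import math

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's Rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled point estimate
    u_bar = sum(variances) / m                                # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    total = u_bar + (1 + 1 / m) * b                           # total variance
    return q_bar, math.sqrt(total)

# Hypothetical log-odds coefficients from five imputed datasets, each with SE = 0.10.
estimates = [0.40, 0.46, 0.44, 0.50, 0.45]
variances = [0.10 ** 2] * 5
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(round(pooled_estimate, 3), round(pooled_se, 4))
```

The pooled standard error exceeds the standard error from any single dataset because it adds the between-imputation variance. That added term is exactly the uncertainty a single imputed dataset leaves out.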

3 Exploratory Data Analysis

Final sample: n = 1571

Hesitant: n = 205 (13%)

The final sample included 1,571 respondents, of which about 13% were classified as vaccine hesitant (n=205 vs. n=1,366 non-hesitant). This class imbalance is important: it makes the standard classification threshold of 0.5 inappropriate, which is addressed in Section 5.

 Vac_hesitant  info_health    info_health_stress info_health_vaccines
 No :1366     Min.   :1.000   Min.   :1.000      Min.   :1.000       
 Yes: 205     1st Qu.:4.000   1st Qu.:1.000      1st Qu.:2.000       
              Median :5.000   Median :2.000      Median :3.000       
              Mean   :4.306   Mean   :1.954      Mean   :2.832       
              3rd Qu.:6.000   3rd Qu.:3.000      3rd Qu.:4.000       
              Max.   :6.000   Max.   :5.000      Max.   :5.000       
                                                                     
    conf_hcs       trust_doc     not_reliable_online      age       
 Min.   :1.000   Min.   :1.000   Min.   :1.000       Min.   :18.00  
 1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000       1st Qu.:39.00  
 Median :2.000   Median :2.000   Median :2.000       Median :56.00  
 Mean   :2.334   Mean   :2.228   Mean   :2.062       Mean   :53.97  
 3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2.000       3rd Qu.:67.00  
 Max.   :5.000   Max.   :5.000   Max.   :5.000       Max.   :94.00  
                                                                    
     sex           eduY       rural        inc           insurance    religion
 male  :738   Min.   : 6.00   1:266   Min.   :    0   statuary:1351   0:738   
 female:833   1st Qu.:10.00   2:264   1st Qu.: 1875   privat  : 203   1:833   
              Median :12.00   3:568   Median : 3250   others  :  17           
              Mean   :13.08   4:473   Mean   : 3519                           
              3rd Qu.:16.00           3rd Qu.: 4250                           
              Max.   :30.00           Max.   :11250                           
                                                                              
      politics  
 no answer:541  
 CDU_CSU  :344  
 SPD      :197  
 Gruene   :194  
 FDP      : 91  
 Linke    : 70  
 (Other)  :134

   CDU_CSU        SPD     Gruene        FDP      Linke        AfD     others 
0.21896881 0.12539784 0.12348822 0.05792489 0.04455761 0.04137492 0.04392107 
 no answer 
0.34436665

        1         2         3         4 
0.1693189 0.1680458 0.3615532 0.3010821

  Vac_hesitant      age
1           No 54.41947
2          Yes 50.94146

  Vac_hesitant     eduY
1           No 13.22694
2          Yes 12.08780

  Vac_hesitant      inc
1           No 3592.419
2          Yes 3032.790

  Vac_hesitant conf_hcs
1           No 2.282577
2          Yes 2.678049

  Vac_hesitant trust_doc
1           No  2.183016
2          Yes  2.526829

  Vac_hesitant info_health
1           No    4.319912
2          Yes    4.209756

        
          No Yes
  male   625 113
  female 741  92

        
                 No        Yes
  male   0.39783577 0.07192871
  female 0.47167409 0.05856143

Sociodemographic variables by vaccine hesitancy status.

Key bivariate patterns: Hesitant individuals were on average younger (51 vs. 54 years), had fewer years of education (12 vs. 13), lower household income (€3,033 vs. €3,592), and reported lower confidence in the healthcare system. Males made up 55% of the hesitant group despite being 47% of the sample.
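The shares quoted for sex can be recomputed directly from the cross-tabulation above. The snippet below is only an arithmetic check; the variable names are illustrative.

```python
# Counts taken from the sex-by-hesitancy cross-tabulation above.
counts = {"male": {"No": 625, "Yes": 113}, "female": {"No": 741, "Yes": 92}}

total = sum(sum(row.values()) for row in counts.values())
hesitant_total = sum(row["Yes"] for row in counts.values())

male_share_sample = sum(counts["male"].values()) / total
male_share_hesitant = counts["male"]["Yes"] / hesitant_total
# Row-wise prevalence: share of hesitant respondents within each sex.
prevalence_by_sex = {sex: row["Yes"] / sum(row.values()) for sex, row in counts.items()}

print(round(male_share_sample, 3), round(male_share_hesitant, 3))
print({sex: round(p, 3) for sex, p in prevalence_by_sex.items()})
```

This reproduces the 47% and 55% stated above. Within each sex, the prevalence of hesitancy is about 15% among men and 11% among women.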

4 Modeling

This exploratory analysis leverages four complementary modeling methods to examine vaccine hesitancy: First a decision tree was fit for structural visualization on a balanced subsample, then Random Forest was applied for estimation of variable importance. Additionally, logistic regression served as the primary interpretive model and Lasso regularization was employed to check the robustness of variable selection. The tree-based methods capture nonlinear interactions, while the logistic regression provides interpretability (Gonzales et al., 2025).

Because the aim of this study is to identify which factors are robustly associated with hesitancy across multiple modeling approaches, not to optimize held-out accuracy, all models were fit without a train/test split. Convergence of results across the four complementary methods strengthens confidence in the identified associations.

4.1 Decision Tree

A classification tree provides an interpretable structural visualization of how the data naturally splits by hesitancy. Because 13% prevalence causes a tree trained on the full dataset to predict “No” for everyone (null classifier), the tree was trained on a balanced subsample (2:1 ratio, n=615) created by downsampling the majority class.
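Downsampling to a 2:1 ratio can be sketched as follows. The function name and the seed are hypothetical; only the class sizes are taken from the text.

```python
import random

def downsample_majority(labels, ratio=2, seed=42):
    """Return row indices keeping all "Yes" cases and ratio-times as many random "No" cases."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == "Yes"]
    majority = [i for i, y in enumerate(labels) if y == "No"]
    kept_majority = rng.sample(majority, ratio * len(minority))
    return sorted(minority + kept_majority)

# Class sizes from the text: 205 hesitant and 1,366 non-hesitant respondents.
labels = ["Yes"] * 205 + ["No"] * 1366
idx = downsample_majority(labels)
print(len(idx))
```

With 205 hesitant respondents, a 2:1 ratio keeps 410 non-hesitant respondents, which gives the subsample size of n=615 reported above.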

Cross-validation estimates the test error for varying tree sizes and allows the selection of the size that minimizes the misclassification error. Based on this, a pruned tree with 3 terminal nodes was selected.

Warning

The tree probabilities reflect the balanced subsample, not the actual population prevalence of 13%. They should not be interpreted as individual risk estimates.

Pruned decision tree (k=3). Confidence in the healthcare system is the primary split, age the secondary split among low-confidence respondents.


Classification tree:
snip.tree(tree = tree_full, nodes = c(2L, 6L))
Variables actually used in tree construction:
[1] "conf_hcs" "age"     
Number of terminal nodes:  3 
Residual mean deviance:  1.186 = 725.7 / 612 
Misclassification error rate: 0.3008 = 185 / 615

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 615 782.9 No ( 0.6667 0.3333 )  
  2) conf_hcs < 2.5 392 429.5 No ( 0.7628 0.2372 ) *
  3) conf_hcs > 2.5 223 309.1 Yes ( 0.4978 0.5022 )  
    6) age < 32.5 38  41.6 Yes ( 0.2368 0.7632 ) *
    7) age > 32.5 185 254.5 No ( 0.5514 0.4486 ) *

The pruned tree first partitioned the data based on confidence in the healthcare system (conf_hcs < 2.5) into high confidence (n=392, 23.7% hesitant) and low confidence groups (n=223, 50.2% hesitant). Among the low confidence group, a secondary split on age < 32.5 identified a subgroup of younger respondents with elevated levels of hesitancy (n=38, 76.3% hesitant). Older respondents with low confidence showed heterogeneous hesitancy (n=185, 44.9% hesitant), so this node was classified as non-hesitant.
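Read as a decision rule, the printed tree corresponds to the following sketch. The split points, node numbers, node sizes and proportions are copied from the tree output above; the function name is hypothetical.

```python
def pruned_tree(conf_hcs, age):
    """Return (terminal node, predicted class, share hesitant in the balanced subsample)."""
    if conf_hcs < 2.5:
        return 2, "No", 0.2372
    if age < 32.5:
        return 6, "Yes", 0.7632
    return 7, "No", 0.4486

print(pruned_tree(conf_hcs=2, age=40))
print(pruned_tree(conf_hcs=4, age=25))
print(pruned_tree(conf_hcs=4, age=60))

# Consistency check: node sizes times node shares recover the hesitant cases in the subsample.
node_sizes = {2: 392, 6: 38, 7: 185}
node_shares = {2: 0.2372, 6: 0.7632, 7: 0.4486}
hesitant_cases = sum(node_sizes[n] * node_shares[n] for n in node_sizes)
print(round(hesitant_cases))
```

The check returns the 205 hesitant respondents of the balanced subsample. As noted in the warning above, the shares describe that subsample and not the population.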

4.2 Random Forest

A Random Forest reduces the high variance of single trees by averaging 500 trees, each trained on a different bootstrap sample. It was used here primarily for variable importance estimation. Class imbalance was addressed with the classwt argument, which weights hesitant cases more heavily so that their misclassification is penalized more strongly. The ratio of 1:7 was chosen arbitrarily.

                             No       Yes MeanDecreaseAccuracy MeanDecreaseGini
info_health           7.4276926 -1.969417            6.2384365        21.745185
info_health_stress    2.1076788 -2.723988            0.9056667        19.529888
info_health_vaccines  2.4499340  3.663732            3.7833481        20.830708
conf_hcs             -0.7068989 10.109145            3.5389721        15.557737
trust_doc            -1.6854717  4.605351            0.4584593        13.723953
not_reliable_online  -1.9266738  2.749082           -0.6656229        17.751487
age                   1.7018877  4.020248            3.1533378        52.603601
sex                   5.4377831 -1.473130            4.5414719         7.550102
eduY                 -4.6838264  4.827433           -2.2762681        45.092369
rural                -1.2388007  1.182659           -0.6423402        24.911631
inc                  -5.1480047  7.756024           -1.6591864        42.008797
insurance             2.3034340 -1.619731            1.5906532         6.543936
religion             -1.5115466 -1.644803           -2.0263348         7.612280
politics             -4.2532064  7.026977           -1.3568009        44.824857

Variable importance by Mean Decrease in Accuracy (MDA). Gini importance is not shown as it is biased toward continuous variables.

The five most important predictors by MDA were: info_health, sex, info_health_vaccines, conf_hcs, and age. These findings are consistent with the results of the decision tree and logistic regression.
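The principle behind Mean Decrease in Accuracy is permutation importance: shuffle one predictor and measure how much accuracy drops. The sketch below is a simplified illustration with a hypothetical toy classifier and made-up data. A Random Forest computes this per tree on out-of-bag observations, which is not reproduced here.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, n_repeats=20, seed=1):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats

def toy_model(row):
    """Hypothetical classifier that only looks at conf_hcs."""
    return "Yes" if row["conf_hcs"] >= 3 else "No"

# Toy data (made up): one informative predictor and one pure-noise predictor.
pairs = [(1, 3), (2, 1), (4, 2), (5, 4), (1, 5), (4, 1), (2, 4), (5, 2)]
rows = [{"conf_hcs": c, "noise": n} for c, n in pairs]
labels = [toy_model(r) for r in rows]

imp_conf = permutation_importance(toy_model, rows, labels, "conf_hcs")
imp_noise = permutation_importance(toy_model, rows, labels, "noise")
print(imp_conf, imp_noise)
```

A predictor the model does not rely on yields an importance of zero. In the table above, values near zero or below it (for example for eduY or religion) likewise indicate that permuting the variable did not reduce out-of-bag accuracy.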

4.3 Logistic Regression

Binary logistic regression was the primary interpretive model. It produces odds ratios (OR) with 95% confidence intervals, hence presenting directly interpretable effect estimates that connect to established epidemiological literature. The model was fit on the full imputed dataset considering all 14 predictors.

Odds ratios with 95% confidence intervals. Red points indicate significant predictors (CI does not cross 1). Log scale on x-axis.

Key findings from logistic regression:

Confidence in the healthcare system was the strongest trust-related predictor: each one-unit increase on the 5-point scale (higher values indicate lower confidence) was associated with 58% higher odds of hesitancy (OR = 1.58, p < .001)
Vaccine-specific information-seeking was protective: higher frequency was associated with 21% lower odds (OR = 0.79, p < .001)
Female sex was associated with 38% lower odds (OR = 0.62, p = .004)
AfD voters showed more than twice the odds of hesitancy compared to CDU/CSU voters (OR = 2.31, p = .022)
Private health insurance was associated with higher odds compared to statutory insurance, which is counterintuitive given that private insurance holders tend to have higher education and income, both of which were negatively associated with hesitancy
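The percentages in this list follow directly from the odds ratios, as the following sketch shows. The function names are illustrative; the odds ratios are the ones reported above.

```python
import math

def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds per one-unit increase of the predictor."""
    return (odds_ratio - 1) * 100

def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient (log-odds scale) that corresponds to an odds ratio."""
    return math.log(odds_ratio)

# Odds ratios reported in the list above.
reported = {
    "conf_hcs": 1.58,
    "info_health_vaccines": 0.79,
    "sex (female)": 0.62,
    "politics (AfD)": 2.31,
}
for name, or_value in reported.items():
    print(name, round(percent_change_in_odds(or_value)), round(log_odds_coefficient(or_value), 3))
```

Odds ratios above 1 correspond to positive coefficients and higher odds of hesitancy. Odds ratios below 1 correspond to negative coefficients and lower odds.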

4.4 LASSO Regularization

LASSO applies L1 regularization, which can shrink the coefficients of irrelevant predictors to exactly zero. It was employed to test the robustness of variable selection. Model complexity was chosen by the one-standard-error criterion (lambda.1se), which selects the largest lambda value whose cross-validation error is still within one standard error of the minimum error and therefore results in a simpler model. This approach was chosen over lambda.min because it favors parsimony at a minimal cost in prediction error and is more stable against sampling variability.
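The one-standard-error rule can be sketched as follows. The cross-validation curve is made up for illustration, and the function name is hypothetical. The names lambda_min and lambda_1se mirror the lambda.min and lambda.1se terms used in the text.

```python
def select_lambda_1se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation results."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    cutoff = cv_mean[i_min] + cv_se[i_min]
    # Largest penalty whose mean CV error does not exceed the cutoff.
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= cutoff)
    return lambdas[i_min], lambda_1se

# Toy cross-validation curve (made up for illustration).
lambdas = [0.100, 0.050, 0.020, 0.010, 0.005]
cv_mean = [0.780, 0.742, 0.731, 0.728, 0.730]
cv_se = [0.015, 0.014, 0.013, 0.013, 0.013]
print(select_lambda_1se(lambdas, cv_mean, cv_se))
```

In this toy curve the minimum error occurs at lambda = 0.010, but the rule selects the stronger penalty 0.020. Its error is statistically indistinguishable from the minimum, and a stronger penalty sets more coefficients to zero.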

Predictors retained by LASSO (lambda.1se):

 [1] "info_health"          "info_health_vaccines" "conf_hcs"            
 [4] "trust_doc"            "not_reliable_online"  "age"                 
 [7] "sexfemale"            "eduY"                 "rural4"              
[10] "inc"                  "politicsLinke"        "politicsAfD"

LASSO retained 12 of 23 terms. Notably, politicsAfD was retained with a positive coefficient (0.380) and politicsLinke with a negative coefficient (-0.386), while all other political affiliations were shrunk to zero. This pattern indicates that right-wing populist affiliation (AfD) is associated with higher hesitancy, while left-wing affiliation (Linke) is associated with lower hesitancy. The elimination of the remaining party terms suggests that their hesitancy rates do not differ meaningfully from the CDU/CSU reference category once other predictors are controlled for.

5 Model Performance & Threshold Analysis

Performance metrics at standard (0.5) and Youden-optimal thresholds.
Model	Threshold	Accuracy	Sensitivity	Specificity	Bal_Acc	AUC
Logistic Regression	0.500	0.875	0.078	0.995	0.536	0.733
Logistic Regression	0.155	0.745	0.610	0.766	0.688	0.733
Random Forest	0.500	0.870	0.010	0.999	0.504	0.662
Random Forest	0.101	0.566	0.722	0.542	0.632	0.662

At the standard threshold of 0.5, both models classified nearly everyone as non-hesitant, yielding about 87% accuracy but near-zero sensitivity. This is a consequence of the imbalanced outcome: always predicting the majority class is already “accurate” in 87% of cases, but such a model cannot identify hesitant individuals.
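Why accuracy is misleading here can be seen from a classifier that labels every respondent as non-hesitant. The sketch below uses the class sizes of the final sample; the function name is illustrative.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and balanced accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return accuracy, sensitivity, specificity, balanced_accuracy

# A classifier that labels all 1,571 respondents as non-hesitant (205 hesitant, 1,366 not).
acc, sens, spec, bal = classification_metrics(tp=0, fn=205, tn=1366, fp=0)
print(round(acc, 3), sens, spec, bal)
```

This trivial classifier reaches an accuracy of about 0.87 with a sensitivity of 0 and a balanced accuracy of 0.5. Those are almost exactly the values both models show at the 0.5 threshold in the table above.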

The Youden-optimal threshold, which maximizes the sum of sensitivity and specificity along the ROC curve (Youden's J = sensitivity + specificity − 1), was 0.155 for logistic regression and 0.101 for Random Forest. It substantially improved sensitivity (61% for logistic regression, 72% for Random Forest) at the cost of specificity and raised balanced accuracy by more than 12 percentage points for both models.
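A minimal sketch of the threshold search is given below. The predicted probabilities and labels are made up, and the function name is hypothetical. ROC software may place candidate thresholds differently, for example at midpoints between observed values.

```python
def youden_threshold(probabilities, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for t in sorted(set(probabilities)):
        # Cases with probability >= t are predicted hesitant.
        tp = sum(1 for p, y in zip(probabilities, labels) if p >= t and y == 1)
        tn = sum(1 for p, y in zip(probabilities, labels) if p < t and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j

# Toy predicted probabilities (made up); 1 = hesitant, 0 = non-hesitant.
probs = [0.05, 0.08, 0.10, 0.12, 0.16, 0.20, 0.30, 0.55]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(youden_threshold(probs, labels))
```

As in the analysis, the selected threshold lies far below 0.5, because predicted probabilities cluster near the low base rate of the minority class.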

The Random Forest OOB AUC (0.662) is more conservative than the in-sample logistic regression AUC (0.733). These values are not directly comparable, because the logistic regression is evaluated on the same data it was trained on, making it optimistic.

ROC curves for logistic regression and random forest. Note: the two AUC values use different evaluation strategies and are not directly comparable.

6 Convergence Across Methods

The table below summarizes predictor relevance across all four methods. A predictor is considered relevant if it was significant in logistic regression (p < .05), retained by LASSO, ranked among the top 5 by RF MDA, or was used as a split in the pruned tree.

Convergence of predictor relevance across four analytic methods.
Predictor	Logit sig.	Lasso kept	RF top 5 (MDA)	Tree used
conf_hcs	***	yes	yes	yes
trust_doc	*	yes	no	no
info_health_vaccines	***	yes	yes	no
info_health	*	yes	yes	no
not_reliable_online	*	yes	no	no
age	***	yes	yes	yes
eduY	**	yes	no	no
inc	**	yes	no	no
sex (female)	**	yes	yes	no
insurance (private)	*	no	no	no
politics (AfD)	*	yes	no	no
politics (Linke)	–	yes	no	no
rural (4)	–	yes	no	no

conf_hcs and age are the most consistent predictors across all methods. info_health, info_health_vaccines, and sex converge across three methods. These represent the most reliable correlates of vaccine hesitancy in this dataset.
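The convergence count can be reproduced by tallying the table row by row. The flags below are transcribed from the table above, with any significance level counted as relevant; the variable names are illustrative.

```python
# Relevance flags copied from the convergence table:
# (logit significant, LASSO kept, RF top 5 by MDA, used in pruned tree).
table = {
    "conf_hcs": (True, True, True, True),
    "trust_doc": (True, True, False, False),
    "info_health_vaccines": (True, True, True, False),
    "info_health": (True, True, True, False),
    "not_reliable_online": (True, True, False, False),
    "age": (True, True, True, True),
    "eduY": (True, True, False, False),
    "inc": (True, True, False, False),
    "sex (female)": (True, True, True, False),
    "insurance (private)": (True, False, False, False),
    "politics (AfD)": (True, True, False, False),
    "politics (Linke)": (False, True, False, False),
    "rural (4)": (False, True, False, False),
}

support = {name: sum(flags) for name, flags in table.items()}
for name, n in sorted(support.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {n} of 4")
```

The tally gives four of four methods for conf_hcs and age, and three of four for info_health, info_health_vaccines and sex.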

Moreover, years of education, income, trust in doctors, difficulty identifying reliable information online, private health insurance and voting for the AfD emerged as significant variables in the regression-based models.

7 Discussion

7.1 Key Findings

Approximately 13% of the analyzed subsample expressed vaccine hesitancy. This is plausible given the estimates of other studies examining vaccine hesitancy in Germany during a similar period, which range from 8% to 29% (Fobiwe et al., 2022; Lincoln et al., 2022; Steinert et al., 2022). Importantly, these studies focus on COVID-19-specific vaccine hesitancy, while the ISSP 2021 assessed general attitudes towards vaccination within a broad social survey context. Therefore, the general vaccine attitudes captured in the ISSP dataset may be more representative of stable opinions, whereas COVID-19-targeted surveys likely reflect situation-dependent hesitancy.

Across all four modeling approaches, confidence in the healthcare system and age emerged as the most consistent correlates of vaccine hesitancy. This aligns with established theoretical frameworks emphasizing institutional trust as a central determinant of vaccine acceptance (MacDonald, 2015), and with previous studies focusing on Germany (Fobiwe et al., 2022).

The information-related variables showed a nuanced pattern: active information-seeking about vaccines was associated with lower hesitancy, while difficulty evaluating reliable online health information was associated with higher hesitancy. This distinction matters for public health communication, since it suggests that the problem is not information overload per se, but the inability to assess its quality.

The association between AfD voting and hesitancy (OR = 2.31) is consistent with prior research linking right-wing populist party support with vaccine skepticism in Germany (Jäckle & Timmis, 2023). In contrast, the finding that female sex was associated with lower hesitancy diverges from some international literature (De Araújo et al., 2024; Aw et al., 2021), suggesting potential country-specific patterns worth exploring in future research.

7.2 Public Health Implications

These findings point to two main levers for reducing vaccine hesitancy in Germany:

Trust in health institutions is the most modifiable factor identified. Public health communication that reinforces confidence in the healthcare system, for example through transparent risk communication, visible accountability, and consistent messaging, may be more effective than information campaigns alone.
Digital health literacy matters independently of how much people search online. Interventions that help people evaluate the quality of health information may reduce hesitancy among those who are already engaged online.

The political polarization around vaccine attitudes also suggests that hesitancy is partly driven by broader trust deficits in institutions. Oppositional groups may require different communication strategies than those aimed at the general population.

7.3 Limitations

Important

Key limitations to keep in mind:

Outcome operationalization: The binary hesitancy indicator is derived from two general attitude items, not a direct behavioral measure. Agreement with one item does not necessarily translate into actual vaccine refusal.
Single imputed dataset: Using one of five MICE-imputed datasets rather than pooling via Rubin’s Rules slightly underestimates imputation uncertainty.
No external validation: All models were estimated on the full imputed dataset without a held-out test set. AUC values should not be interpreted as generalizable predictive performance.
Cross-sectional design: Causal interpretations cannot be drawn. The associations identified are correlates, not determinants in a causal sense.
Sample representativeness: The German ISSP subsample may not be fully representative of the general population, particularly given high political non-response (34.4%).

8 References

Aw, J., Seng, J. J. B., Seah, S. S. Y., & Low, L. L. (2021). COVID-19 Vaccine Hesitancy—A Scoping Review of Literature in High-Income Countries. Vaccines, 9(8), 900. https://doi.org/10.3390/vaccines9080900

Cénat, J. M., Noorishad, P., Farahi, S. M. M. M., Darius, W. P., Aouame, A. M. E., Onesi, O., Broussard, C., Furyk, S. E., Yaya, S., Caulley, L., Chomienne, M., Etowa, J., & Labelle, P. R. (2022). Prevalence and factors related to COVID-19 vaccine hesitancy and unwillingness in Canada: A systematic review and meta-analysis. Journal of Medical Virology, 95(1), e28156. https://doi.org/10.1002/jmv.28156

De Araújo, J. S. T., Delpino, F. M., De Paula Andrade-Gonçalves, R. L., Aragão, F. B. A., Ferezin, L. P., Santos, D. A., Neto, N. C. D., Nascimento, M. C. D., Moreira, S. P. T., Ribeiro, G. F., Alves, R. F. D. S., & Arcêncio, R. A. (2024). Determinants of COVID-19 Vaccine Acceptance and hesitancy: A Systematic review. Vaccines, 12(12), 1352. https://doi.org/10.3390/vaccines12121352

Fobiwe, J. P., et al. (2022). Influences on attitudes regarding COVID-19 vaccination in Germany. Vaccines, 10(5), 658. https://doi.org/10.3390/vaccines10050658

ISSP Research Group (2024). ISSP 2021 — Health and Health Care II (ZA8000, Version 2.0.0). GESIS, Cologne. https://doi.org/10.4232/5.ZA8000.2.0.0

Jäckle, S., & Timmis, J. K. (2023). Left–Right-Position, party affiliation and regional differences explain low COVID-19 vaccination rates in Germany. Microbial Biotechnology, 16(3), 662–677. https://doi.org/10.1111/1751-7915.14210

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). Springer.

Jennings, W., Stoker, G., Bunting, H., Valgarðsson, V. O., Gaskell, J., Devine, D., McKay, L., & Mills, M. C. (2021). Lack of trust, conspiracy beliefs, and social media use predict COVID-19 vaccine hesitancy. Vaccines, 9(6), 593. https://doi.org/10.3390/vaccines9060593

Lincoln, T. M., Schlier, B., Strakeljahn, F., Gaudiano, B. A., So, S. H., Kingston, J., Morris, E. M. J., & Ellett, L. (2022). Taking a machine learning approach to optimize prediction of vaccine hesitancy in high income countries. Scientific Reports, 12(1), 2055. https://doi.org/10.1038/s41598-022-05915-3

MacDonald, N. E. (2015). Vaccine hesitancy: Definition, scope and determinants. Vaccine, 33(34), 4161–4164. https://doi.org/10.1016/j.vaccine.2015.04.036

Mienye, I. D., & Jere, N. (2024). A survey of Decision trees: Concepts, algorithms, and applications. IEEE Access, 12, 86716–86727. https://doi.org/10.1109/access.2024.3416838

Steinert, J. I., Sternberg, H., Prince, H., Fasolo, B., Galizzi, M. M., Büthe, T., & Veltri, G. A. (2022). COVID-19 vaccine hesitancy in eight European countries: Prevalence, determinants, and heterogeneity. Science Advances, 8(17), eabm9825. https://doi.org/10.1126/sciadv.abm9825

Watson, O. J., Barnsley, G., Toor, J., Hogan, A. B., Winskill, P., & Ghani, A. C. (2022). Global impact of the first year of COVID-19 vaccination: a mathematical modelling study. The Lancet Infectious Diseases, 22(9), 1293–1302. https://doi.org/10.1016/s1473-3099(22)00320-6

Wulff, J., & Ejlskov, L. (2017). Multiple imputation by chained equations in praxis: Guidelines and review. VBN Forskningsportal (Aalborg Universitet), 15(1), 41–56. https://vbn.aau.dk/da/publications/8abf732c-281b-4192-b647-af9346db1290