Vaccine Hesitancy in Germany: An Exploratory Multi-Method Analysis of the ISSP 2021 Dataset
1 Introduction
Vaccination is one of the most important public health measures for stopping the spread of communicable diseases and prevents millions of deaths each year. The COVID-19 pandemic demonstrated its value: vaccination is estimated to have averted more than 14 million deaths globally in the first year of the vaccination campaigns alone (Watson et al., 2022). However, the pandemic also revealed the persistent challenge of vaccine hesitancy, defined as “delay in acceptance or refusal of vaccination despite availability of vaccination services” (MacDonald, 2015). Research on vaccine hesitancy has identified many correlates, including socioeconomic, demographic, trust-related, political, religious and informational factors (De Araújo et al., 2024; Cénat et al., 2022; Aw et al., 2021; Jennings et al., 2021).
This analysis leverages tree-based machine learning algorithms as well as logistic regression to explore the prevalence and determinants of vaccine hesitancy in Germany using data from the International Social Survey Programme 2021 Health module. Importantly, this study seeks to identify factors that reliably correlate with hesitancy, not to predict individual cases.
Research question: Which sociodemographic and attitudinal factors are associated with vaccine hesitancy among the German population?
Four complementary modeling approaches are used. First, a decision tree is fit on a balanced subsample for structural visualization; then a Random Forest is applied to estimate variable importance. Logistic regression serves as the primary interpretive model, and Lasso regularization is employed to check the robustness of variable selection. Convergence of results across the four methods strengthens confidence in the identified associations.
2 Data & Preprocessing
2.1 Data Source
The ISSP 2021 “Health and Healthcare II” dataset (ZA8000) is a cross-national survey conducted by the International Social Survey Programme, available via GESIS. It covers health-related attitudes, behaviors, and experiences as well as sociodemographic information across 30 countries. In Germany, data were collected between June 14 and August 18, 2021, during the COVID-19 vaccination campaign.
The original dataset contains 44,549 observations. For this analysis, only the German subsample is used.
2.2 Outcome Variable: Vaccine Hesitancy
Vaccine hesitancy was operationalized by constructing a binary dependent variable Vac_hesitant using two Likert-scaled survey items (1 = Strongly agree, 5 = Strongly disagree):
- “Overall, vaccinations do more harm than good.”
- “It is better to develop immunity by getting ill than by getting vaccinated.”
A respondent was classified as hesitant if they agreed (score 1 or 2) with at least one item. The combined indicator captures both explicit rejection of vaccines and a preference for natural immunity, which makes it a more sensitive and comprehensive measure of hesitancy than a single item. Moreover, the OR rule was chosen in alignment with the WHO definition of vaccine hesitancy, which emphasizes doubt about vaccination rather than just refusal (MacDonald, 2015).
A stricter AND-rule (agreement with both items) was retrospectively examined as a sensitivity check and yields a prevalence of only 2.6%, compared to 13% under the OR-rule. The large difference suggests many respondents express partial rather than full hesitancy.
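The difference between the two rules can be made concrete with a minimal sketch. The function name `is_hesitant` and the five example response pairs are hypothetical and only illustrate the coding logic described above; they are not the ISSP data.

```python
def is_hesitant(harm_item: int, immunity_item: int, rule: str = "or") -> bool:
    """Classify one respondent from two Likert items (1 = strongly agree, 5 = strongly disagree)."""
    agrees_harm = harm_item <= 2          # agrees that vaccinations do more harm than good
    agrees_immunity = immunity_item <= 2  # agrees that immunity through illness is better
    if rule == "or":
        return agrees_harm or agrees_immunity
    if rule == "and":
        return agrees_harm and agrees_immunity
    raise ValueError("rule must be 'or' or 'and'")


# Hypothetical (item 1, item 2) response pairs, not taken from the survey
respondents = [(2, 4), (1, 1), (3, 3), (5, 2), (4, 5)]

or_rate = sum(is_hesitant(a, b, "or") for a, b in respondents) / len(respondents)
and_rate = sum(is_hesitant(a, b, "and") for a, b in respondents) / len(respondents)
print(f"OR rule: {or_rate:.0%}, AND rule: {and_rate:.0%}")
```

By construction the AND rule can never classify more respondents as hesitant than the OR rule, so the gap between the two prevalences reflects respondents who agree with exactly one item.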
Validation: To confirm that the binary classification is internally consistent, an inverted sum score (range 2–10, higher = more hesitant) was computed from both items. Boxplots show a clear separation of the groups: the median score of respondents classified as hesitant was 7 (IQR 6 to 8), compared to a median score of 3 (IQR 2 to 4) among non-hesitant respondents (Figure 1). The AUC of the sum score as a predictor of the binary classification was 0.989 and the Spearman correlation was 0.565, confirming internal consistency.
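The validation logic can be sketched as follows. The inversion formula (6 minus each item) is an assumption consistent with the stated 2–10 range, and the response pairs are hypothetical. The AUC is computed as the share of hesitant/non-hesitant pairs in which the hesitant respondent has the higher score, with ties counted as one half.

```python
def inverted_sum_score(item1: int, item2: int) -> int:
    """Reverse both 1-5 items (assumed: 6 - x) and add them; the result ranges from 2 to 10."""
    return (6 - item1) + (6 - item2)


def auc(scores_pos, scores_neg) -> float:
    """Probability that a random positive case outranks a random negative case (ties = 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical response pairs; hesitant = agreement (1 or 2) with at least one item
hesitant = [(1, 1), (2, 4), (5, 2)]
non_hesitant = [(3, 3), (4, 5), (5, 5)]

pos_scores = [inverted_sum_score(a, b) for a, b in hesitant]      # [10, 6, 5]
neg_scores = [inverted_sum_score(a, b) for a, b in non_hesitant]  # [6, 3, 2]
auc_value = auc(pos_scores, neg_scores)
print(round(auc_value, 3))
```

The toy data also show why the AUC stays below 1 even though both measures come from the same items: a respondent who agrees with one item but strongly disagrees with the other, such as (5, 2), is hesitant under the OR rule yet can have a lower sum score than a neutral respondent such as (3, 3).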
2.3 Predictor Variables
Fourteen predictors were selected based on the existing literature on vaccine hesitancy (De Araújo et al., 2024; Aw et al., 2021), covering trust, information-seeking, and sociodemographic factors.
2.4 Missing Data
| Variable | Missing (%) |
|---|---|
| inc | 20.9 |
| info_health_stress | 8.5 |
| not_reliable_online | 8.3 |
| info_health | 8.0 |
| info_health_vaccines | 7.9 |
| eduY | 2.9 |
| conf_hcs | 2.2 |
| religion | 2.0 |
| trust_doc | 1.1 |
| rural | 1.1 |
| age | 0.8 |
| insurance | 0.8 |
| sex | 0.4 |
| Vac_hesitant | 0.0 |
| politics | 0.0 |
Income (20.9%) and political party (34.4%) had the highest non-response. Political party appears with 0.0% in the table because its non-response was treated as a separate category (no answer) rather than imputed: non-response at this level (34%) is likely not random (MNAR), and imputing it could introduce systematic bias.
For all other variables, Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) was applied. Complete-case analysis was not appropriate here, since removing all rows with any missing value would have further reduced the already small hesitant group (n=205) and made it harder for the models to learn the minority class signal.
Limitation: Ideally, all five imputed datasets should be pooled via Rubin’s Rules. A single imputed dataset is used here for pragmatic reasons, which slightly underestimates imputation uncertainty.
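For reference, pooling via Rubin's Rules combines the m per-dataset estimates into one estimate whose variance adds the between-imputation variance to the average within-imputation variance. The sketch below uses hypothetical coefficients and standard errors, not results from this analysis, and the function name `pool_rubin` is illustrative.

```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's Rules."""
    m = len(estimates)
    pooled = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(estimates)                   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between          # total variance
    return pooled, total ** 0.5


# Hypothetical log-odds estimates of one predictor from five imputed datasets
estimates = [0.44, 0.46, 0.47, 0.45, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_est, pooled_se = pool_rubin(estimates, std_errors)
print(round(pooled_est, 3), round(pooled_se, 4))
```

The pooled standard error is never smaller than the average single-dataset standard error, which is why relying on one imputed dataset understates uncertainty.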
3 Exploratory Data Analysis
Final sample: n = 1,571
Hesitant: n = 205 (13%)
The final sample included 1,571 respondents, of whom about 13% were classified as vaccine hesitant (n=205 vs. n=1,366 non-hesitant). This class imbalance is important: the standard classification threshold of 0.5 is poorly suited to such data, which is addressed in Section 5.
Summary of numeric variables:

| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| info_health | 1.000 | 4.000 | 5.000 | 4.306 | 6.000 | 6.000 |
| info_health_stress | 1.000 | 1.000 | 2.000 | 1.954 | 3.000 | 5.000 |
| info_health_vaccines | 1.000 | 2.000 | 3.000 | 2.832 | 4.000 | 5.000 |
| conf_hcs | 1.000 | 2.000 | 2.000 | 2.334 | 3.000 | 5.000 |
| trust_doc | 1.000 | 2.000 | 2.000 | 2.228 | 2.000 | 5.000 |
| not_reliable_online | 1.000 | 1.000 | 2.000 | 2.062 | 2.000 | 5.000 |
| age | 18.00 | 39.00 | 56.00 | 53.97 | 67.00 | 94.00 |
| eduY | 6.00 | 10.00 | 12.00 | 13.08 | 16.00 | 30.00 |
| inc | 0 | 1875 | 3250 | 3519 | 4250 | 11250 |

Summary of categorical variables:

| Variable | Level | n |
|---|---|---|
| Vac_hesitant | No | 1366 |
| Vac_hesitant | Yes | 205 |
| sex | male | 738 |
| sex | female | 833 |
| rural | 1 | 266 |
| rural | 2 | 264 |
| rural | 3 | 568 |
| rural | 4 | 473 |
| insurance | statuary | 1351 |
| insurance | privat | 203 |
| insurance | others | 17 |
| religion | 0 | 738 |
| religion | 1 | 833 |
| politics | no answer | 541 |
| politics | CDU_CSU | 344 |
| politics | SPD | 197 |
| politics | Gruene | 194 |
| politics | FDP | 91 |
| politics | Linke | 70 |
| politics | (Other) | 134 |
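Proportions by political party:

| politics | Proportion |
|---|---|
| CDU_CSU | 0.21896881 |
| SPD | 0.12539784 |
| Gruene | 0.12348822 |
| FDP | 0.05792489 |
| Linke | 0.04455761 |
| AfD | 0.04137492 |
| others | 0.04392107 |
| no answer | 0.34436665 |

Proportions by rural category:

| rural | Proportion |
|---|---|
| 1 | 0.1693189 |
| 2 | 0.1680458 |
| 3 | 0.3615532 |
| 4 | 0.3010821 |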
Group means by hesitancy status:

| Variable | Vac_hesitant = No | Vac_hesitant = Yes |
|---|---|---|
| age | 54.41947 | 50.94146 |
| eduY | 13.22694 | 12.08780 |
| inc | 3592.419 | 3032.790 |
| conf_hcs | 2.282577 | 2.678049 |
| trust_doc | 2.183016 | 2.526829 |
| info_health | 4.319912 | 4.209756 |
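Sex by hesitancy status (counts and proportions of the total sample):

| sex | No (n) | Yes (n) | No (proportion) | Yes (proportion) |
|---|---|---|---|---|
| male | 625 | 113 | 0.39783577 | 0.07192871 |
| female | 741 | 92 | 0.47167409 | 0.05856143 |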
Key bivariate patterns: Hesitant individuals were on average younger (51 vs. 54 years), had fewer years of education (12 vs. 13), lower household income (€3,033 vs. €3,592), and reported lower confidence in the healthcare system (mean conf_hcs 2.68 vs. 2.28, where higher scores indicate lower confidence). Men made up 55% of the hesitant group despite being 47% of the sample.
4 Modeling
This exploratory analysis leverages four complementary modeling methods to examine vaccine hesitancy: First a decision tree was fit for structural visualization on a balanced subsample, then Random Forest was applied for estimation of variable importance. Additionally, logistic regression served as the primary interpretive model and Lasso regularization was employed to check the robustness of variable selection. The tree-based methods capture nonlinear interactions, while the logistic regression provides interpretability (Gonzales et al., 2025).
Because the aim of this study is to identify which factors are robustly associated with hesitancy across multiple modeling approaches, not to optimize held-out accuracy, all models were fit without a train/test split. Convergence of results across the four complementary methods strengthens confidence in the identified associations.
4.1 Decision Tree
A classification tree provides an interpretable structural visualization of how the data naturally splits by hesitancy. Because 13% prevalence causes a tree trained on the full dataset to predict “No” for everyone (null classifier), the tree was trained on a balanced subsample (2:1 ratio, n=615) created by downsampling the majority class.
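A minimal sketch of the downsampling step, assuming respondents are represented by integer IDs and that the majority class is sampled without replacement. The function name `downsample` and the seed are illustrative, not the original procedure.

```python
import random


def downsample(majority_ids, minority_ids, ratio=2, seed=1):
    """Keep all minority cases and draw ratio * len(minority) majority cases without replacement."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept_majority = rng.sample(majority_ids, ratio * len(minority_ids))
    return kept_majority + list(minority_ids)


# Placeholder IDs matching the reported class sizes (205 hesitant, 1,366 non-hesitant)
minority = list(range(205))
majority = list(range(205, 205 + 1366))

balanced = downsample(majority, minority)
print(len(balanced))
```

With 205 hesitant cases and a 2:1 ratio, the subsample holds 410 non-hesitant and 205 hesitant respondents, which matches the reported n=615.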
Cross-validation estimates the test error for varying tree sizes and allows selection of the size that minimizes the misclassification error. Based on this, a pruned tree with 3 terminal nodes was selected.
The tree probabilities reflect the balanced subsample, not the actual population prevalence of 13%. They should not be interpreted as individual risk estimates.
Classification tree:
snip.tree(tree = tree_full, nodes = c(2L, 6L))
Variables actually used in tree construction:
[1] "conf_hcs" "age"
Number of terminal nodes: 3
Residual mean deviance: 1.186 = 725.7 / 612
Misclassification error rate: 0.3008 = 185 / 615
node), split, n, deviance, yval, (yprob)
* denotes terminal node
1) root 615 782.9 No ( 0.6667 0.3333 )
2) conf_hcs < 2.5 392 429.5 No ( 0.7628 0.2372 ) *
3) conf_hcs > 2.5 223 309.1 Yes ( 0.4978 0.5022 )
6) age < 32.5 38 41.6 Yes ( 0.2368 0.7632 ) *
7) age > 32.5 185 254.5 No ( 0.5514 0.4486 ) *
The pruned tree first partitioned the data based on confidence in the healthcare system (conf_hcs < 2.5) into high confidence (n=392, 23.7% hesitant) and low confidence groups (n=223, 50.2% hesitant). Among the low confidence group, a secondary split on age < 32.5 identified a subgroup of younger respondents with elevated levels of hesitancy (n=38, 76.3% hesitant). Older respondents with low confidence showed a heterogeneous pattern (n=185, 44.9% hesitant).
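The printed tree can be read as three nested rules. The sketch below encodes the split points and leaf shares from the output above; the function name `pruned_tree_leaf` is illustrative, and the returned shares refer to the balanced subsample, not to population risk.

```python
def pruned_tree_leaf(conf_hcs: float, age: float):
    """Return (predicted class, hesitant share in the balanced subsample) for one respondent."""
    if conf_hcs < 2.5:       # node 2: higher confidence in the healthcare system
        return ("No", 0.2372)
    if age < 32.5:           # node 6: lower confidence, younger respondents
        return ("Yes", 0.7632)
    return ("No", 0.4486)    # node 7: lower confidence, older respondents


print(pruned_tree_leaf(conf_hcs=2, age=40))
print(pruned_tree_leaf(conf_hcs=4, age=25))
print(pruned_tree_leaf(conf_hcs=4, age=60))
```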
4.2 Random Forest
A Random Forest reduces the high variance of single trees by averaging 500 trees, each trained on a different bootstrap sample. It was used here primarily for variable importance estimation. Class imbalance was addressed with the classwt argument, which penalizes misclassification of hesitant cases more heavily. The ratio of 1:7 was chosen arbitrarily.
No Yes MeanDecreaseAccuracy MeanDecreaseGini
info_health 7.4276926 -1.969417 6.2384365 21.745185
info_health_stress 2.1076788 -2.723988 0.9056667 19.529888
info_health_vaccines 2.4499340 3.663732 3.7833481 20.830708
conf_hcs -0.7068989 10.109145 3.5389721 15.557737
trust_doc -1.6854717 4.605351 0.4584593 13.723953
not_reliable_online -1.9266738 2.749082 -0.6656229 17.751487
age 1.7018877 4.020248 3.1533378 52.603601
sex 5.4377831 -1.473130 4.5414719 7.550102
eduY -4.6838264 4.827433 -2.2762681 45.092369
rural -1.2388007 1.182659 -0.6423402 24.911631
inc -5.1480047 7.756024 -1.6591864 42.008797
insurance 2.3034340 -1.619731 1.5906532 6.543936
religion -1.5115466 -1.644803 -2.0263348 7.612280
politics -4.2532064 7.026977 -1.3568009 44.824857
The five most important predictors by MDA were: info_health, sex, info_health_vaccines, conf_hcs, and age. These findings are consistent with the results of the decision tree and logistic regression.
4.3 Logistic Regression
Binary logistic regression was the primary interpretive model. It produces odds ratios (OR) with 95% confidence intervals, hence presenting directly interpretable effect estimates that connect to established epidemiological literature. The model was fit on the full imputed dataset considering all 14 predictors.
Key findings from logistic regression:
- Confidence in the healthcare system was the strongest trust-related predictor: each one-unit increase on the 5-point scale (higher values indicate lower confidence) was associated with 58% higher odds of hesitancy (OR = 1.58, p < .001)
- Vaccine-specific information-seeking was protective: higher frequency was associated with 21% lower odds (OR = 0.79, p < .001)
- Female sex was associated with 38% lower odds (OR = 0.62, p = .004)
- AfD voters showed more than twice the odds of hesitancy compared to CDU/CSU voters (OR = 2.31, p = .022)
- Private health insurance was associated with higher odds compared to statutory insurance, which is counterintuitive given that private insurance holders tend to have higher education and income, both of which were negatively associated with hesitancy.
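The percentages in the list above follow directly from the odds ratios: an odds ratio is the exponential of the logistic regression coefficient, and the percentage change in odds is (OR − 1) × 100. A short sketch using the reported odds ratios (the helper names are illustrative):

```python
import math


def odds_ratio_from_coef(beta: float) -> float:
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(beta)


def pct_change_in_odds(odds_ratio: float) -> float:
    """Percentage change in the odds per one-unit increase (or versus the reference category)."""
    return (odds_ratio - 1) * 100


# Odds ratios as reported in the key findings
reported = {"conf_hcs": 1.58, "info_health_vaccines": 0.79, "sex (female)": 0.62, "politics (AfD)": 2.31}

for name, odds_ratio in reported.items():
    print(f"{name}: OR = {odds_ratio:.2f}, change in odds = {pct_change_in_odds(odds_ratio):+.0f}%")
```

Note that these are changes in odds, not in probability; with a prevalence of about 13% the two are similar in magnitude but not identical.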
4.4 LASSO Regularization
LASSO applies L1 regularization, which shrinks the coefficients of irrelevant predictors to exactly zero. It was employed to test the robustness of variable selection. Model complexity was chosen by the one-standard-error criterion (lambda.1se), which selects the largest lambda whose cross-validation error is still within one standard error of the minimum error and therefore yields a simpler model. This approach was preferred over lambda.min because it trades a minimal loss in prediction error for parsimony and is more stable against sampling variability.
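The one-standard-error rule can be sketched in a few lines. The lambda grid, cross-validation errors, and standard errors below are hypothetical values chosen for illustration, and `pick_lambda_1se` is an illustrative helper rather than a library function.

```python
def pick_lambda_1se(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation results."""
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])  # index of the lowest CV error
    cutoff = cv_mean[i_min] + cv_se[i_min]                      # minimum error plus one standard error
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= cutoff]
    return lambdas[i_min], max(eligible)                        # largest lambda = strongest shrinkage


# Hypothetical cross-validation results
lambdas = [0.100, 0.050, 0.020, 0.010, 0.005]
cv_mean = [0.80, 0.74, 0.71, 0.70, 0.705]
cv_se = [0.02, 0.02, 0.02, 0.02, 0.02]

lambda_min, lambda_1se = pick_lambda_1se(lambdas, cv_mean, cv_se)
print(lambda_min, lambda_1se)
```

Because a larger lambda means stronger shrinkage, lambda.1se is never smaller than lambda.min and typically retains fewer predictors.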
Predictors retained by LASSO (lambda.1se):
[1] "info_health" "info_health_vaccines" "conf_hcs"
[4] "trust_doc" "not_reliable_online" "age"
[7] "sexfemale" "eduY" "rural4"
[10] "inc" "politicsLinke" "politicsAfD"
LASSO retained 12 of 23 terms. Notably, politicsAfD was retained with a positive coefficient (0.380) and politicsLinke with a negative coefficient (-0.386), while all other political categories were shrunk to zero. This pattern indicates that right-wing populist affiliation (AfD) is associated with higher hesitancy, while left-wing affiliation (Linke) is associated with lower hesitancy. The elimination of the remaining party categories suggests that their hesitancy rates do not differ meaningfully from the CDU/CSU reference category once other predictors are controlled for.
5 Model Performance & Threshold Analysis
| Model | Threshold | Accuracy | Sensitivity | Specificity | Bal_Acc | AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.500 | 0.875 | 0.078 | 0.995 | 0.536 | 0.733 |
| Logistic Regression | 0.155 | 0.745 | 0.610 | 0.766 | 0.688 | 0.733 |
| Random Forest | 0.500 | 0.870 | 0.010 | 0.999 | 0.504 | 0.662 |
| Random Forest | 0.101 | 0.566 | 0.722 | 0.542 | 0.632 | 0.662 |
At the standard threshold of 0.5, both models classified nearly everyone as non-hesitant, yielding ~87% accuracy but near-zero sensitivity. This is a result of the imbalanced outcome: the model learns that always predicting the majority class is “accurate”, but it cannot identify hesitant individuals.
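The arithmetic behind this pattern can be checked with a small sketch. The confusion-matrix counts for logistic regression are back-calculated from the rounded sensitivity and specificity in the table and the class sizes (205 hesitant, 1,366 non-hesitant), so they are approximations rather than reported values.

```python
def metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, specificity and balanced accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Null classifier: predicts "No" for all 1,571 respondents
null_model = metrics(tp=0, fn=205, tn=1366, fp=0)

# Logistic regression at threshold 0.5 (counts back-calculated, approximate)
logit_05 = metrics(tp=16, fn=189, tn=1359, fp=7)

for name, m in [("null", null_model), ("logit @ 0.5", logit_05)]:
    print(name, {key: round(value, 3) for key, value in m.items()})
```

A classifier that never predicts hesitancy already reaches about 87% accuracy, which is why balanced accuracy is the more informative summary here.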
The Youden-optimal threshold, which maximizes the sum of sensitivity and specificity along the ROC curve, was 0.155 for logistic regression and about 0.10 for Random Forest. It substantially improved sensitivity (61% for logistic regression, 72% for Random Forest) at the cost of specificity, and raised balanced accuracy by more than 12 percentage points for both models (from 0.536 to 0.688 and from 0.504 to 0.632, respectively).
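A minimal sketch of the threshold search, assuming Youden's J = sensitivity + specificity − 1 and that a case is classified as hesitant when its predicted probability is at or above the threshold. The predicted probabilities and labels are hypothetical toy values, and `youden_threshold` is an illustrative helper.

```python
def youden_threshold(probs, labels):
    """Return the threshold maximizing Youden's J and the J value itself."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(probs)):  # every observed probability is a candidate threshold
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:            # keep the first (lowest) threshold in case of ties
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical predicted probabilities and true classes (1 = hesitant)
probs = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.30, 0.40, 0.60]
labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

threshold, j_value = youden_threshold(probs, labels)
print(threshold, round(j_value, 3))
```

As in the analysis, the optimal cut-off lies well below 0.5 when predicted probabilities cluster near the low prevalence of the positive class.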
The Random Forest OOB AUC (0.662) is more conservative than the in-sample logistic regression AUC (0.733). These values are not directly comparable, because the logistic regression is evaluated on the same data it was trained on, making it optimistic.
6 Convergence Across Methods
The table below summarizes predictor relevance across all four methods. A predictor is considered relevant if it was significant in logistic regression (p < .05), retained by LASSO, ranked among the top 5 by RF MDA, or was used as a split in the pruned tree.
| Predictor | Logit sig. | Lasso kept | RF top 5 (MDA) | Tree used |
|---|---|---|---|---|
| conf_hcs | *** | yes | yes | yes |
| trust_doc | * | yes | no | no |
| info_health_vaccines | *** | yes | yes | no |
| info_health | * | yes | yes | no |
| not_reliable_online | * | yes | no | no |
| age | *** | yes | yes | yes |
| eduY | ** | yes | no | no |
| inc | ** | yes | no | no |
| sex (female) | ** | yes | yes | no |
| insurance (private) | * | no | no | no |
| politics (AfD) | * | yes | no | no |
| politics (Linke) | – | yes | no | no |
| rural (4) | – | yes | no | no |
conf_hcs and age are the most consistent predictors across all methods. info_health, info_health_vaccines, and sex converge across three methods. These represent the most reliable correlates of vaccine hesitancy in this dataset.
Moreover, years of education, income, trust in doctors, difficulty identifying reliable information online, private health insurance and voting for the AfD emerged as significant variables in the logistic regression; all of these except private health insurance were also retained by LASSO.
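The convergence summary can be reproduced by counting, for each predictor, how many of the four criteria in the table are met. The dictionary below transcribes the table (logit significant, LASSO kept, RF top 5, tree used); the variable names are illustrative.

```python
# (logit significant, LASSO kept, RF top 5 by MDA, used in pruned tree), transcribed from the table
evidence = {
    "conf_hcs": (True, True, True, True),
    "trust_doc": (True, True, False, False),
    "info_health_vaccines": (True, True, True, False),
    "info_health": (True, True, True, False),
    "not_reliable_online": (True, True, False, False),
    "age": (True, True, True, True),
    "eduY": (True, True, False, False),
    "inc": (True, True, False, False),
    "sex (female)": (True, True, True, False),
    "insurance (private)": (True, False, False, False),
    "politics (AfD)": (True, True, False, False),
    "politics (Linke)": (False, True, False, False),
    "rural (4)": (False, True, False, False),
}

# Number of methods supporting each predictor, sorted from most to least consistent
support = {name: sum(flags) for name, flags in evidence.items()}
for name, count in sorted(support.items(), key=lambda item: -item[1]):
    print(f"{name}: {count} of 4 methods")
```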
7 Discussion
7.1 Key Findings
Approximately 13% of the analyzed subsample expressed vaccine hesitancy. This is plausible given estimates from other studies of vaccine hesitancy in Germany during a similar period, which range from 8% to 29% (Fobiwe et al., 2022; Lincoln et al., 2022; Steinert et al., 2022). Importantly, these studies focus on COVID-19-specific vaccine hesitancy, while the ISSP 2021 assessed general attitudes towards vaccination within a broad social survey context. Therefore, the general vaccine attitudes captured in the ISSP dataset may be more representative of stable opinions, whereas COVID-19-targeted surveys likely reflect situation-dependent hesitancy.
Across all four modeling approaches, confidence in the healthcare system and age emerged as the most consistent correlates of vaccine hesitancy. This aligns with established theoretical frameworks emphasizing institutional trust as a central determinant of vaccine acceptance (MacDonald, 2015), and with previous studies focusing on Germany (Fobiwe et al., 2022).
The information-related variables showed a nuanced pattern: active information-seeking about vaccines was associated with lower hesitancy, while difficulty evaluating reliable online health information was associated with higher hesitancy. This distinction matters for public health communication, since it suggests that the problem is not information overload per se, but the inability to assess its quality.
The association between AfD voting and hesitancy (OR = 2.31) is consistent with prior research linking right-wing populist party support with vaccine skepticism in Germany (Jäckle & Timmis, 2023). In contrast, the finding that female sex was associated with lower hesitancy diverges from some international literature (De Araújo et al., 2024; Aw et al., 2021), suggesting potential country-specific patterns worth exploring in future research.
7.2 Public Health Implications
These findings point to two main levers for reducing vaccine hesitancy in Germany:
Trust in health institutions is the most modifiable factor identified. Public health communication that reinforces confidence in the healthcare system, for example through transparent risk communication, visible accountability, and consistent messaging, may be more effective than information campaigns alone.
Digital health literacy matters independently of how much people search online. Interventions that help people evaluate the quality of health information may reduce hesitancy among those who are already engaged online.
The political polarization around vaccine attitudes also suggests that hesitancy is partly driven by broader trust deficits in institutions. Oppositional groups may require different communication strategies than those aimed at the general population.
7.3 Limitations
Key limitations to keep in mind:
- Outcome operationalization: The binary hesitancy indicator is derived from two general attitude items, not a direct behavioral measure. Agreement with one item does not necessarily translate into actual vaccine refusal.
- Single imputed dataset: Using one of five MICE-imputed datasets rather than pooling via Rubin’s Rules slightly underestimates imputation uncertainty.
- No external validation: All models were estimated on the full imputed dataset without a held-out test set. AUC values should not be interpreted as generalizable predictive performance.
- Cross-sectional design: Causal interpretations cannot be drawn. The associations identified are correlates, not determinants in a causal sense.
- Sample representativeness: The German ISSP subsample may not be fully representative of the general population, particularly given high political non-response (34.4%).
8 References
Aw, J., Seng, J. J. B., Seah, S. S. Y., & Low, L. L. (2021). COVID-19 Vaccine Hesitancy—A Scoping Review of Literature in High-Income Countries. Vaccines, 9(8), 900. https://doi.org/10.3390/vaccines9080900
Cénat, J. M., Noorishad, P., Farahi, S. M. M. M., Darius, W. P., Aouame, A. M. E., Onesi, O., Broussard, C., Furyk, S. E., Yaya, S., Caulley, L., Chomienne, M., Etowa, J., & Labelle, P. R. (2022). Prevalence and factors related to COVID-19 vaccine hesitancy and unwillingness in Canada: A systematic review and meta-analysis. Journal of Medical Virology, 95(1), e28156. https://doi.org/10.1002/jmv.28156
De Araújo, J. S. T., Delpino, F. M., De Paula Andrade-Gonçalves, R. L., Aragão, F. B. A., Ferezin, L. P., Santos, D. A., Neto, N. C. D., Nascimento, M. C. D., Moreira, S. P. T., Ribeiro, G. F., Alves, R. F. D. S., & Arcêncio, R. A. (2024). Determinants of COVID-19 Vaccine Acceptance and hesitancy: A Systematic review. Vaccines, 12(12), 1352. https://doi.org/10.3390/vaccines12121352
Fobiwe, J. P., et al. (2022). Influences on attitudes regarding COVID-19 vaccination in Germany. Vaccines, 10(5), 658. https://doi.org/10.3390/vaccines10050658
ISSP Research Group (2024). ISSP 2021 — Health and Health Care II (ZA8000, Version 2.0.0). GESIS, Cologne. https://doi.org/10.4232/5.ZA8000.2.0.0
Jäckle, S., & Timmis, J. K. (2023). Left–Right-Position, party affiliation and regional differences explain low COVID-19 vaccination rates in Germany. Microbial Biotechnology, 16(3), 662–677. https://doi.org/10.1111/1751-7915.14210
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). Springer.
Jennings, W., Stoker, G., Bunting, H., Valgarðsson, V. O., Gaskell, J., Devine, D., McKay, L., & Mills, M. C. (2021). Lack of trust, conspiracy beliefs, and social media use predict COVID-19 vaccine hesitancy. Vaccines, 9(6), 593. https://doi.org/10.3390/vaccines9060593
Lincoln, T. M., Schlier, B., Strakeljahn, F., Gaudiano, B. A., So, S. H., Kingston, J., Morris, E. M. J., & Ellett, L. (2022). Taking a machine learning approach to optimize prediction of vaccine hesitancy in high income countries. Scientific reports, 12(1), 2055. https://doi.org/10.1038/s41598-022-05915-3
MacDonald, N. E. (2015). Vaccine hesitancy: Definition, scope and determinants. Vaccine, 33(34), 4161–4164. https://doi.org/10.1016/j.vaccine.2015.04.036
Mienye, I. D., & Jere, N. (2024). A survey of Decision trees: Concepts, algorithms, and applications. IEEE Access, 12, 86716–86727. https://doi.org/10.1109/access.2024.3416838
Steinert, J. I., Sternberg, H., Prince, H., Fasolo, B., Galizzi, M. M., Büthe, T., & Veltri, G. A. (2022). COVID-19 vaccine hesitancy in eight European countries: Prevalence, determinants, and heterogeneity. Science Advances, 8(17), eabm9825. https://doi.org/10.1126/sciadv.abm9825
Watson, O. J., Barnsley, G., Toor, J., Hogan, A. B., Winskill, P., & Ghani, A. C. (2022). Global impact of the first year of COVID-19 vaccination: a mathematical modelling study. The Lancet Infectious Diseases, 22(9), 1293–1302. https://doi.org/10.1016/s1473-3099(22)00320-6
Wulff, J., & Ejlskov, L. (2017). Multiple imputation by chained equations in praxis: Guidelines and review. VBN Forskningsportal (Aalborg Universitet), 15(1), 41–56. https://vbn.aau.dk/da/publications/8abf732c-281b-4192-b647-af9346db1290