# Define a function to install and load required R packages
install_load <- function(packages) {
# Retrieve a list of all currently installed packages on your system
installed_package_list <- installed.packages()[, "Package"]
# Identify packages from the input list that are NOT yet installed
missing_package <- packages[!packages %in% installed_package_list]
# If there are any missing packages, install them along with their dependencies
if(length(missing_package) > 0) {
install.packages(missing_package, dependencies = TRUE)
}
# Load all specified packages into the current R session
# 'invisible' is used to suppress printing output from lapply
# 'character.only = TRUE' tells library() to interpret package names as strings
invisible(lapply(packages, library, character.only = TRUE))
}
# Create a vector of packages required for this analysis
# tidyverse: for data manipulation and visualization
# caret: for machine learning workflows, train/test split, and model evaluation
# xgboost: for gradient boosting modeling
# randomForest: for random forest modeling
# pROC: for computing and plotting ROC curves and AUC
# glmnet: for regularized regression (LASSO, Ridge)
# SuperLearner: for ensemble modeling
my_packages <- c("tidyverse", "caret", "xgboost", "randomForest", "pROC", "glmnet", "SuperLearner")
# Call the function to ensure all packages are installed and loaded
install_load(my_packages)Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines Using R
Overview
This report presents a comprehensive analysis and predictive modeling pipeline for vaccine uptake, focusing on both seasonal and H1N1 vaccines from DrivenData Flu Shot Learning Competition. The primary objective is to build a model that accurately predict respondent vaccination behavior. This report is intended for R programmers and data scientists familiar with statistical modeling and machine learning, providing them with a reproducible, end-to-end workflow that covers data pre-processing, predictive models building, hyperparameter tuning, and evaluation.
Throughout the report, we leverage popular R packages such as tidyverse for data manipulation, caret for modeling infrastructure, xgboost and randomForest for machine learning, and pROC for model evaluation. The pipeline emphasizes techniques that can enhance predictive performance, including handling of missing values, encoding of categorical variables, generation of interaction features, and optimization of model hyperparameters.
This document is designed not only to present results but also to serve as a practical guide for R programmers to replicate and extend analysis, experiment with alternative models, and potentially improve predictive performance toward achieving higher AUC scores.
Installing all Required Packages
The code below ensures that all the R packages required for the analysis and modeling pipeline are installed and loaded automatically. By using a function to handle both installation and loading, the workflow becomes reproducible and robust, meaning anyone running the script won’t encounter errors due to missing packages. It also simplifies the code, avoiding repeated calls to install.packages() or library().
The packages chosen cover data manipulation (tidyverse), modeling and evaluation (caret, xgboost, randomForest, glmnet, SuperLearner), and performance metrics (pROC).
Loading the Datasets
In this step, we load the training and test datasets provided by the DrivenData “Flu Shot Learning” competition. The datasets include:
training_features– Contains demographic, health, and behavioral features for each respondent.training_labels– Contains the target variables indicating whether each respondent received the H1N1 or seasonal flu vaccine.test_features– Contains features for respondents in the test set, which we will use to generate predictions for submission.
We then merge the training features with the training labels using the respondent_id as the key. This creates a single dataset called data that includes both predictors and outcomes. This merged dataset will be used for feature engineering, model training, and evaluation.
The reason for merging at this stage is to ensure that each row in our dataset has both the explanatory variables (features) and the response variables (labels), which is necessary for supervised learning models like XGBoost, Random Forest, and Logistic Regression.
# Load the training features dataset from the DrivenData competition URL
# This dataset contains demographic, health, and behavioral variables for each respondent
training_features <- read.csv("https://drivendata-prod.s3.amazonaws.com/data/66/public/training_set_features.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYSN7TAHVS%2F20251111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251111T154753Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=a2738e023bdaf520499879a6b413e9a27035f2e827004311e7e3fad054ea68d8")
# Load the training labels dataset from the DrivenData competition URL
# This dataset contains the target variables (H1N1 and seasonal flu vaccine uptake) for each respondent
training_labels <- read.csv("https://drivendata-prod.s3.amazonaws.com/data/66/public/training_set_labels.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYSN7TAHVS%2F20251111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251111T154753Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=06ee84122c9a6bb285fce6c5bb69b0be488770e63502028b5855ab4832e3c219")
# Load the test features dataset from the DrivenData competition URL
# This dataset contains the same predictor variables as the training set, but for respondents without target labels
test_features <- read.csv("https://drivendata-prod.s3.amazonaws.com/data/66/public/test_set_features.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYSN7TAHVS%2F20251111%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251111T154753Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=239bbc286abeebd975a15913a6cb046f209a95b8813c6b3a5d5df201d7a1ff4b")
# Merge the training features with the corresponding labels using 'respondent_id' as the key
# This creates a single dataset combining predictors and targets, which is necessary for supervised learning
data <- merge(training_features, training_labels, by = "respondent_id")Data Check
Before proceeding with model building, we first explore the dataset to understand its structure and quality. Using summary statistics, we examine the distribution of each variable and identify potential anomalies. The dataset dimensions and structure are checked to confirm the number of observations and variable types. We also assess missing values, both column-wise and overall, to identify any gaps that may require handling. These initial checks ensure that the data is well-understood, clean, and ready for feature engineering and model training.
# Get a summary of each variable in the dataset, including min, max, mean, quartiles for numeric variables,
# and counts for categorical variables. This helps understand variable distributions and detect anomalies.
summary(data) respondent_id h1n1_concern h1n1_knowledge behavioral_antiviral_meds
Min. : 0 Min. :0.000 Min. :0.000 Min. :0.00000
1st Qu.: 6676 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.00000
Median :13353 Median :2.000 Median :1.000 Median :0.00000
Mean :13353 Mean :1.618 Mean :1.263 Mean :0.04884
3rd Qu.:20030 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.00000
Max. :26706 Max. :3.000 Max. :2.000 Max. :1.00000
NA's :92 NA's :116 NA's :71
behavioral_avoidance behavioral_face_mask behavioral_wash_hands
Min. :0.0000 Min. :0.00000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.0000
Median :1.0000 Median :0.00000 Median :1.0000
Mean :0.7256 Mean :0.06898 Mean :0.8256
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.00000 Max. :1.0000
NA's :208 NA's :19 NA's :42
behavioral_large_gatherings behavioral_outside_home behavioral_touch_face
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :1.0000
Mean :0.3586 Mean :0.3373 Mean :0.6773
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :87 NA's :82 NA's :128
doctor_recc_h1n1 doctor_recc_seasonal chronic_med_condition
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.2203 Mean :0.3297 Mean :0.2833
3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :2160 NA's :2160 NA's :971
child_under_6_months health_worker health_insurance
Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:1.0000
Median :0.00000 Median :0.0000 Median :1.0000
Mean :0.08259 Mean :0.1119 Mean :0.8797
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.00000 Max. :1.0000 Max. :1.0000
NA's :820 NA's :804 NA's :12274
opinion_h1n1_vacc_effective opinion_h1n1_risk opinion_h1n1_sick_from_vacc
Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.:1.000 1st Qu.:1.000
Median :4.000 Median :2.000 Median :2.000
Mean :3.851 Mean :2.343 Mean :2.358
3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :5.000 Max. :5.000 Max. :5.000
NA's :391 NA's :388 NA's :395
opinion_seas_vacc_effective opinion_seas_risk opinion_seas_sick_from_vacc
Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:4.000 1st Qu.:2.000 1st Qu.:1.000
Median :4.000 Median :2.000 Median :2.000
Mean :4.026 Mean :2.719 Mean :2.118
3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :5.000 Max. :5.000 Max. :5.000
NA's :462 NA's :514 NA's :537
age_group education race sex
Length:26707 Length:26707 Length:26707 Length:26707
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
income_poverty marital_status rent_or_own employment_status
Length:26707 Length:26707 Length:26707 Length:26707
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
hhs_geo_region census_msa household_adults household_children
Length:26707 Length:26707 Min. :0.0000 Min. :0.0000
Class :character Class :character 1st Qu.:0.0000 1st Qu.:0.0000
Mode :character Mode :character Median :1.0000 Median :0.0000
Mean :0.8865 Mean :0.5346
3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :3.0000 Max. :3.0000
NA's :249 NA's :249
employment_industry employment_occupation h1n1_vaccine seasonal_vaccine
Length:26707 Length:26707 Min. :0.0000 Min. :0.0000
Class :character Class :character 1st Qu.:0.0000 1st Qu.:0.0000
Mode :character Mode :character Median :0.0000 Median :0.0000
Mean :0.2125 Mean :0.4656
3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
# Display the dimensions of the dataset: number of rows (observations) and columns (variables)
dim(data)[1] 26707 38
# Display the structure of the dataset, including variable types (numeric, factor, etc.) and first few values
str(data)'data.frame': 26707 obs. of 38 variables:
$ respondent_id : int 0 1 2 3 4 5 6 7 8 9 ...
$ h1n1_concern : int 1 3 1 1 2 3 0 1 0 2 ...
$ h1n1_knowledge : int 0 2 1 1 1 1 0 0 2 1 ...
$ behavioral_antiviral_meds : int 0 0 0 0 0 0 0 0 0 0 ...
$ behavioral_avoidance : int 0 1 1 1 1 1 0 1 1 1 ...
$ behavioral_face_mask : int 0 0 0 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : int 0 1 0 1 1 1 0 1 1 0 ...
$ behavioral_large_gatherings: int 0 0 0 1 1 0 0 0 1 1 ...
$ behavioral_outside_home : int 1 1 0 0 0 0 0 0 1 0 ...
$ behavioral_touch_face : int 1 1 0 0 1 1 0 1 1 1 ...
$ doctor_recc_h1n1 : int 0 0 NA 0 0 0 0 1 0 0 ...
$ doctor_recc_seasonal : int 0 0 NA 1 0 1 0 0 0 0 ...
$ chronic_med_condition : int 0 0 1 1 0 0 0 1 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_insurance : int 1 1 NA NA NA NA NA 1 NA 1 ...
$ opinion_h1n1_vacc_effective: int 3 5 3 3 3 5 4 5 4 4 ...
$ opinion_h1n1_risk : int 1 4 1 3 3 2 1 2 1 2 ...
$ opinion_h1n1_sick_from_vacc: int 2 4 1 5 2 1 1 1 1 2 ...
$ opinion_seas_vacc_effective: int 2 4 4 5 3 5 4 4 4 4 ...
$ opinion_seas_risk : int 1 2 1 4 1 4 2 2 2 2 ...
$ opinion_seas_sick_from_vacc: int 2 4 2 1 4 4 1 1 1 2 ...
$ age_group : chr "55 - 64 Years" "35 - 44 Years" "18 - 34 Years" "65+ Years" ...
$ education : chr "< 12 Years" "12 Years" "College Graduate" "12 Years" ...
$ race : chr "White" "White" "White" "White" ...
$ sex : chr "Female" "Male" "Male" "Female" ...
$ income_poverty : chr "Below Poverty" "Below Poverty" "<= $75,000, Above Poverty" "Below Poverty" ...
$ marital_status : chr "Not Married" "Not Married" "Not Married" "Not Married" ...
$ rent_or_own : chr "Own" "Rent" "Own" "Rent" ...
$ employment_status : chr "Not in Labor Force" "Employed" "Employed" "Not in Labor Force" ...
$ hhs_geo_region : chr "oxchjgsf" "bhuqouqj" "qufhixun" "lrircsnp" ...
$ census_msa : chr "Non-MSA" "MSA, Not Principle City" "MSA, Not Principle City" "MSA, Principle City" ...
$ household_adults : int 0 0 2 0 1 2 0 2 1 0 ...
$ household_children : int 0 0 0 0 0 3 0 0 0 0 ...
$ employment_industry : chr "" "pxcmvdjn" "rucpziij" "" ...
$ employment_occupation : chr "" "xgwztkwe" "xtkaffoo" "" ...
$ h1n1_vaccine : int 0 0 0 0 0 0 0 1 0 0 ...
$ seasonal_vaccine : int 0 1 0 1 0 0 0 1 0 0 ...
# Check for missing values in each column by summing NA values column-wise
colSums(is.na(data)) respondent_id h1n1_concern
0 92
h1n1_knowledge behavioral_antiviral_meds
116 71
behavioral_avoidance behavioral_face_mask
208 19
behavioral_wash_hands behavioral_large_gatherings
42 87
behavioral_outside_home behavioral_touch_face
82 128
doctor_recc_h1n1 doctor_recc_seasonal
2160 2160
chronic_med_condition child_under_6_months
971 820
health_worker health_insurance
804 12274
opinion_h1n1_vacc_effective opinion_h1n1_risk
391 388
opinion_h1n1_sick_from_vacc opinion_seas_vacc_effective
395 462
opinion_seas_risk opinion_seas_sick_from_vacc
514 537
age_group education
0 0
race sex
0 0
income_poverty marital_status
0 0
rent_or_own employment_status
0 0
hhs_geo_region census_msa
0 0
household_adults household_children
249 249
employment_industry employment_occupation
0 0
h1n1_vaccine seasonal_vaccine
0 0
# Get the total number of missing values in the entire dataset
sum(is.na(data))[1] 23219
The code below is designed to provide a visual exploration of the numerical variables in the dataset. Understanding the distribution and spread of each feature is essential for feature engineering and model building. The box_hist function automatically generates both a boxplot and a histogram for each variable, which helps in identifying outliers, skewness, and overall distribution patterns. Boxplots highlight extreme values or potential outliers, while histograms provide a sense of the frequency distribution of the variable. This type of exploratory data analysis is an important step before modeling.
# Exclude the first column (usually an ID column) to focus on features
plot <- data[, -1]
plot <- plot[sapply(plot, is.numeric)] #only numeric columns
str(plot) # Check the structure of the new 'plot' dataset to ensure it's ready for visualization'data.frame': 26707 obs. of 25 variables:
$ h1n1_concern : int 1 3 1 1 2 3 0 1 0 2 ...
$ h1n1_knowledge : int 0 2 1 1 1 1 0 0 2 1 ...
$ behavioral_antiviral_meds : int 0 0 0 0 0 0 0 0 0 0 ...
$ behavioral_avoidance : int 0 1 1 1 1 1 0 1 1 1 ...
$ behavioral_face_mask : int 0 0 0 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : int 0 1 0 1 1 1 0 1 1 0 ...
$ behavioral_large_gatherings: int 0 0 0 1 1 0 0 0 1 1 ...
$ behavioral_outside_home : int 1 1 0 0 0 0 0 0 1 0 ...
$ behavioral_touch_face : int 1 1 0 0 1 1 0 1 1 1 ...
$ doctor_recc_h1n1 : int 0 0 NA 0 0 0 0 1 0 0 ...
$ doctor_recc_seasonal : int 0 0 NA 1 0 1 0 0 0 0 ...
$ chronic_med_condition : int 0 0 1 1 0 0 0 1 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_insurance : int 1 1 NA NA NA NA NA 1 NA 1 ...
$ opinion_h1n1_vacc_effective: int 3 5 3 3 3 5 4 5 4 4 ...
$ opinion_h1n1_risk : int 1 4 1 3 3 2 1 2 1 2 ...
$ opinion_h1n1_sick_from_vacc: int 2 4 1 5 2 1 1 1 1 2 ...
$ opinion_seas_vacc_effective: int 2 4 4 5 3 5 4 4 4 4 ...
$ opinion_seas_risk : int 1 2 1 4 1 4 2 2 2 2 ...
$ opinion_seas_sick_from_vacc: int 2 4 2 1 4 4 1 1 1 2 ...
$ household_adults : int 0 0 2 0 1 2 0 2 1 0 ...
$ household_children : int 0 0 0 0 0 3 0 0 0 0 ...
$ h1n1_vaccine : int 0 0 0 0 0 0 0 1 0 0 ...
$ seasonal_vaccine : int 0 1 0 1 0 0 0 1 0 0 ...
# Define a function to plot boxplots and histograms for each variable
box_hist <- function(data) {
for(var in names(data)) { # Loop over each column name in the dataset
par(mfrow = c(1,2)) # Set plotting layout to 1 row, 2 columns (side-by-side plots)
boxplot(data[[var]], # Create boxplot for the current variable
main = paste("Box plot of ", var),
ylab = var)
hist(data[[var]], # Create histogram for the current variable
main = paste("Histogram of ", var),
xlab = var)
}
par(mfrow = c(1,1)) # Reset plotting layout to default (1 row, 1 column) after loop
}
# Call the function to generate visualizations for all features
box_hist(plot)Recoding Values
In this section, we recode categorical variables into numeric form. Many machine learning algorithms, including XGBoost, require numeric inputs to perform computations. By converting ordered categorical variables such as employment status, education level, income, and age group into numeric values, we retain their natural ordering while making them compatible with model training. The numeric coding preserves the inherent ranking of categories which can help the model interpret relative differences between levels.
# See unique values for employment_status
unique(data$employment_status)[1] "Not in Labor Force" "Employed" "Unemployed"
[4] ""
# Recode employment_status into numeric, preserving order
data$employment_status <- as.numeric(factor(data$employment_status,
levels = c("Not in Labor Force",
"Unemployed",
"Employed")))
# check unique values
unique(data$education)[1] "< 12 Years" "12 Years" "College Graduate" "Some College"
[5] ""
# Recode education into numeric, preserving the education hierarchy
data$education <- as.numeric(factor(data$education,
levels = c("< 12 Years",
"12 Years",
"Some College",
"College Graduate")))
# Check unique values for income_poverty
unique(data$income_poverty)[1] "Below Poverty" "<= $75,000, Above Poverty"
[3] "> $75,000" ""
# Recode income_poverty into numeric, keeping logical order
data$income_poverty <- as.numeric(factor(data$income_poverty,
levels = c("Below Poverty",
"<= $75,000, Above Poverty",
"> $75,000")))
# Check unique values for age_group
unique(data$age_group)[1] "55 - 64 Years" "35 - 44 Years" "18 - 34 Years" "65+ Years"
[5] "45 - 54 Years"
# Recode age_group into numeric, preserving age progression
data$age_group <- as.numeric(factor(data$age_group,
levels = c("18 - 34 Years",
"35 - 44 Years",
"45 - 54 Years",
"55 - 64 Years",
"65+ Years")))Checking Numeric and Categorical Variables
In this section, we inspect the types of columns in the dataset, distinguishing between numeric and character variables. Understanding the variable types is important because many machine learning algorithms require numeric input, and character columns often need to be recoded or encoded.
# Quickly check which columns are numeric using sapply
sapply(data, is.numeric) respondent_id h1n1_concern
TRUE TRUE
h1n1_knowledge behavioral_antiviral_meds
TRUE TRUE
behavioral_avoidance behavioral_face_mask
TRUE TRUE
behavioral_wash_hands behavioral_large_gatherings
TRUE TRUE
behavioral_outside_home behavioral_touch_face
TRUE TRUE
doctor_recc_h1n1 doctor_recc_seasonal
TRUE TRUE
chronic_med_condition child_under_6_months
TRUE TRUE
health_worker health_insurance
TRUE TRUE
opinion_h1n1_vacc_effective opinion_h1n1_risk
TRUE TRUE
opinion_h1n1_sick_from_vacc opinion_seas_vacc_effective
TRUE TRUE
opinion_seas_risk opinion_seas_sick_from_vacc
TRUE TRUE
age_group education
TRUE TRUE
race sex
FALSE FALSE
income_poverty marital_status
TRUE FALSE
rent_or_own employment_status
FALSE TRUE
hhs_geo_region census_msa
FALSE FALSE
household_adults household_children
TRUE TRUE
employment_industry employment_occupation
FALSE FALSE
h1n1_vaccine seasonal_vaccine
TRUE TRUE
# Define a function to return names of numeric columns
numeric_variables <- function(data) {
numeric_list <- list() # Initialize empty list to store numeric columns
# Loop through each column in the data
for(var in names(data)) {
var_data <- data[[var]] # Extract column
if(is.numeric(var_data)){ # Check if column is numeric
numeric_list[[var]] <- var_data # Store numeric column in list
}
}
return(names(numeric_list)) # Return names of numeric columns
}
# Apply function to data; confirms there are 30 numeric variables
numeric_variables(data) [1] "respondent_id" "h1n1_concern"
[3] "h1n1_knowledge" "behavioral_antiviral_meds"
[5] "behavioral_avoidance" "behavioral_face_mask"
[7] "behavioral_wash_hands" "behavioral_large_gatherings"
[9] "behavioral_outside_home" "behavioral_touch_face"
[11] "doctor_recc_h1n1" "doctor_recc_seasonal"
[13] "chronic_med_condition" "child_under_6_months"
[15] "health_worker" "health_insurance"
[17] "opinion_h1n1_vacc_effective" "opinion_h1n1_risk"
[19] "opinion_h1n1_sick_from_vacc" "opinion_seas_vacc_effective"
[21] "opinion_seas_risk" "opinion_seas_sick_from_vacc"
[23] "age_group" "education"
[25] "income_poverty" "employment_status"
[27] "household_adults" "household_children"
[29] "h1n1_vaccine" "seasonal_vaccine"
# Quickly check which columns are character using sapply
sapply(data, is.character) respondent_id h1n1_concern
FALSE FALSE
h1n1_knowledge behavioral_antiviral_meds
FALSE FALSE
behavioral_avoidance behavioral_face_mask
FALSE FALSE
behavioral_wash_hands behavioral_large_gatherings
FALSE FALSE
behavioral_outside_home behavioral_touch_face
FALSE FALSE
doctor_recc_h1n1 doctor_recc_seasonal
FALSE FALSE
chronic_med_condition child_under_6_months
FALSE FALSE
health_worker health_insurance
FALSE FALSE
opinion_h1n1_vacc_effective opinion_h1n1_risk
FALSE FALSE
opinion_h1n1_sick_from_vacc opinion_seas_vacc_effective
FALSE FALSE
opinion_seas_risk opinion_seas_sick_from_vacc
FALSE FALSE
age_group education
FALSE FALSE
race sex
TRUE TRUE
income_poverty marital_status
FALSE TRUE
rent_or_own employment_status
TRUE FALSE
hhs_geo_region census_msa
TRUE TRUE
household_adults household_children
FALSE FALSE
employment_industry employment_occupation
TRUE TRUE
h1n1_vaccine seasonal_vaccine
FALSE FALSE
# Define a function to return names of character columns
character_variables <- function(data) {
character_list <- list() # Initialize empty list to store character columns
# Loop through each column in the data
for(var in names(data)) {
var_data <- data[[var]] # Extract column
if(is.character(var_data)){ # Check if column is character
character_list[[var]] <- var_data # Store character column in list
}
}
return(names(character_list)) # Return names of character columns
}
# Apply function to data; confirms there are 8 character variables
character_variables(data) [1] "race" "sex" "marital_status"
[4] "rent_or_own" "hhs_geo_region" "census_msa"
[7] "employment_industry" "employment_occupation"
Handling NAs with Median Imputation
In real-world datasets, missing values are common, especially for numeric variables. Many machine learning algorithms cannot handle missing values directly, so we need to impute them.
This code defines a function replace_numeric_variables that automatically replaces all missing numeric values (NAs) with the median of their respective column. The median is chosen over the mean because it is robust to outliers, ensuring that extreme values do not skew the imputation. After defining the function, it is applied to the dataset data, producing a cleaned version where numeric missing values are replaced.
# Function to replace all numeric variables with their median if they contain NA values
replace_numeric_variables <- function(data) {
# Loop over every column in the data frame
for(var in names(data)) {
# Check if the column is numeric and contains any missing values
if(is.numeric(data[[var]]) & anyNA(data[[var]])) {
# Compute the median of the column, ignoring NAs
col_median <- median(data[[var]], na.rm = TRUE)
# Replace NA values in the column with the computed median
data[[var]][is.na(data[[var]])] <- col_median
}
}
# Return the modified dataset with NAs replaced
return(data)
}
# Apply the function to the data
data <- replace_numeric_variables(data)Dropping Character Variables
In many machine learning workflows, models like XGBoost, Random Forest, and most numerical algorithms cannot directly handle character (string) variables. Converting all variables to numeric or factor is required for modeling.
In this step, we identify all character columns in the dataset and remove them, leaving only numeric variables for modeling. This simplifies the dataset and ensures compatibility.
# Identify character columns in the dataset
char_cols <- sapply(data, is.character)
# Check how many character columns exist
sum(char_cols)[1] 8
# Drop all character columns from the dataset
data <- data[ , !char_cols]
# Inspect the structure of the updated dataset to confirm character columns are removed
str(data)'data.frame': 26707 obs. of 30 variables:
$ respondent_id : int 0 1 2 3 4 5 6 7 8 9 ...
$ h1n1_concern : int 1 3 1 1 2 3 0 1 0 2 ...
$ h1n1_knowledge : int 0 2 1 1 1 1 0 0 2 1 ...
$ behavioral_antiviral_meds : num 0 0 0 0 0 0 0 0 0 0 ...
$ behavioral_avoidance : int 0 1 1 1 1 1 0 1 1 1 ...
$ behavioral_face_mask : num 0 0 0 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : int 0 1 0 1 1 1 0 1 1 0 ...
$ behavioral_large_gatherings: num 0 0 0 1 1 0 0 0 1 1 ...
$ behavioral_outside_home : int 1 1 0 0 0 0 0 0 1 0 ...
$ behavioral_touch_face : int 1 1 0 0 1 1 0 1 1 1 ...
$ doctor_recc_h1n1 : int 0 0 0 0 0 0 0 1 0 0 ...
$ doctor_recc_seasonal : int 0 0 0 1 0 1 0 0 0 0 ...
$ chronic_med_condition : num 0 0 1 1 0 0 0 1 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_insurance : int 1 1 1 1 1 1 1 1 1 1 ...
$ opinion_h1n1_vacc_effective: num 3 5 3 3 3 5 4 5 4 4 ...
$ opinion_h1n1_risk : int 1 4 1 3 3 2 1 2 1 2 ...
$ opinion_h1n1_sick_from_vacc: num 2 4 1 5 2 1 1 1 1 2 ...
$ opinion_seas_vacc_effective: int 2 4 4 5 3 5 4 4 4 4 ...
$ opinion_seas_risk : int 1 2 1 4 1 4 2 2 2 2 ...
$ opinion_seas_sick_from_vacc: num 2 4 2 1 4 4 1 1 1 2 ...
$ age_group : num 4 2 1 5 3 5 4 3 3 4 ...
$ education : num 1 2 4 2 3 2 1 3 4 2 ...
$ income_poverty : num 1 1 2 1 2 2 2 2 3 2 ...
$ employment_status : num 1 3 3 1 3 3 3 3 3 1 ...
$ household_adults : num 0 0 2 0 1 2 0 2 1 0 ...
$ household_children : num 0 0 0 0 0 3 0 0 0 0 ...
$ h1n1_vaccine : int 0 0 0 0 0 0 0 1 0 0 ...
$ seasonal_vaccine : int 0 1 0 1 0 0 0 1 0 0 ...
Building XGBoost Model for Seasonal and H1n1 vaccines Intake Probabilities
The goal of this section is to build a predictive model for seasonal_vaccine and h1n1_vaccine target variables using the XGBoost algorithm. XGBoost is chosen because it efficiently handles structured, medium-sized data, captures complex nonlinear relationships and feature interactions, provides robust regularization, and directly outputs probabilities suitable for AUC evaluation. Its combination of predictive performance, flexibility, and scalability makes it an excellent choice for modeling vaccination uptake in this dataset.
# -----------------------------
# Building XGBoost models
# -----------------------------
set.seed(123) # Set a random seed for reproducibility of results
# -----------------------------
# Seasonal vaccine model
# -----------------------------
target_seasonal_vaccine <- data$seasonal_vaccine
# Extract the target variable (binary outcome) for seasonal vaccine
seasonal_predictors <- data %>%
select(-respondent_id, -seasonal_vaccine, -h1n1_vaccine)
# Select predictor variables by removing ID and both target columns
matrix_seasonal_predictors <- as.matrix(seasonal_predictors)
# Convert the predictors to a numeric matrix required by XGBoost
# Split the data into training and testing sets (80%-20%)
train_index <- createDataPartition(target_seasonal_vaccine, p = 0.8, list = FALSE)
data_train <- matrix_seasonal_predictors[train_index, ]
data_test <- matrix_seasonal_predictors[-train_index, ]
target_seasonal_vaccine_train <- target_seasonal_vaccine[train_index]
target_seasonal_vaccine_test <- target_seasonal_vaccine[-train_index]
# Convert datasets to XGBoost DMatrix format
xgb_train <- xgb.DMatrix(data = data_train, label = target_seasonal_vaccine_train)
xgb_test <- xgb.DMatrix(data = data_test, label = target_seasonal_vaccine_test)
# Define XGBoost parameters
params <- list(
objective = "binary:logistic", # Binary classification with logistic regression
eval_metric = "auc", # Evaluation metric: Area Under the ROC Curve
eta = 0.05, # Learning rate (shrinkage) to prevent overfitting
max_depth = 6, # Maximum depth of trees (complexity control)
subsample = 0.8, # Fraction of data used per tree (stochastic sampling)
colsample_bytree = 0.8 # Fraction of features used per tree (regularization)
)
# Train the XGBoost model
xgb_model <- xgb.train(
params = params, # Model parameters
data = xgb_train, # Training dataset
nrounds = 500, # Maximum number of boosting rounds
watchlist = list(train = xgb_train, test = xgb_test),
# Monitor performance on training and test sets
early_stopping_rounds = 30, # Stop if test metric does not improve for 30 rounds
verbose = 0 # Suppress detailed output
)
# Predict on the test set
xgb_pred <- predict(xgb_model, data_test)
# Evaluate model performance using AUC
auc_xgb <- auc(target_seasonal_vaccine_test, xgb_pred)
print(paste("XGB model: ", auc_xgb))[1] "XGB model: 0.845697471443582"
# -----------------------------
# H1N1 vaccine model
# -----------------------------
set.seed(123) # Reset seed for reproducibility
target_h1n1_vaccine <- data$h1n1_vaccine
# Extract the target variable for H1N1 vaccine
h1n1_predictors <- data %>%
select(-respondent_id, -seasonal_vaccine, -h1n1_vaccine)
# Predictor variables for H1N1 (exclude ID and both targets)
matrix_h1n1_predictors <- as.matrix(h1n1_predictors)
# Convert to matrix for XGBoost
# Split into training and testing sets (80%-20%)
train_index_2 <- createDataPartition(target_h1n1_vaccine, p = 0.8, list = FALSE)
data_train_2 <- matrix_h1n1_predictors[train_index_2, ]
data_test_2 <- matrix_h1n1_predictors[-train_index_2, ]
target_h1n1_vaccine_train <- target_h1n1_vaccine[train_index_2]
target_h1n1_vaccine_test <- target_h1n1_vaccine[-train_index_2]
# Convert to XGBoost DMatrix
xgb_train_2 <- xgb.DMatrix(data = data_train_2, label = target_h1n1_vaccine_train)
xgb_test_2 <- xgb.DMatrix(data = data_test_2, label = target_h1n1_vaccine_test)
# Train XGBoost model for H1N1
xgb_model_2 <- xgb.train(
params = params,
data = xgb_train_2,
nrounds = 500,
watchlist = list(train = xgb_train_2, test = xgb_test_2),
early_stopping_rounds = 30,
verbose = 0
)
# Predict on the test set
xgb_pred_2 <- predict(xgb_model_2, data_test_2)
# Evaluate performance using AUC
auc_xgb_2 <- auc(target_h1n1_vaccine_test, xgb_pred_2)
print(paste("XGB model: ", auc_xgb_2))[1] "XGB model: 0.829001194529438"
Applying the XGBoost Model to the Test Features Dataset
After training the XGBoost models for predicting seasonal and H1N1 vaccine uptake, the next step is to apply these models to the test dataset to generate predictions. This process involves several key steps: data inspection, variable recoding, missing value handling, and prediction generation.
################################################################################
###################################################################################
# Apply the trained XGBoost models to the test features dataset
# Inspect the structure of the test dataset
str(test_features) # Shows variable names, types, and example values'data.frame': 26708 obs. of 36 variables:
$ respondent_id : int 26707 26708 26709 26710 26711 26712 26713 26714 26715 26716 ...
$ h1n1_concern : int 2 1 2 1 3 2 2 2 1 2 ...
$ h1n1_knowledge : int 2 1 2 1 1 2 2 1 1 2 ...
$ behavioral_antiviral_meds : int 0 0 0 0 1 0 0 0 0 0 ...
$ behavioral_avoidance : int 1 0 0 0 1 1 1 1 1 0 ...
$ behavioral_face_mask : int 0 0 1 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : int 1 0 1 0 1 1 1 1 1 1 ...
$ behavioral_large_gatherings: int 1 0 1 0 1 1 1 0 1 0 ...
$ behavioral_outside_home : int 0 0 1 0 1 0 0 0 1 0 ...
$ behavioral_touch_face : int 1 0 1 0 1 1 1 1 1 1 ...
$ doctor_recc_h1n1 : int 0 0 0 1 0 0 1 0 0 0 ...
$ doctor_recc_seasonal : int 0 0 0 1 0 0 1 0 0 0 ...
$ chronic_med_condition : int 0 0 0 1 0 0 0 0 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 1 1 0 0 0 0 ...
$ health_insurance : int 1 0 NA 1 1 1 NA 1 NA 1 ...
$ opinion_h1n1_vacc_effective: int 5 4 5 4 5 4 4 5 3 4 ...
$ opinion_h1n1_risk : int 1 1 4 2 2 4 2 1 1 2 ...
$ opinion_h1n1_sick_from_vacc: int 1 1 2 2 4 1 4 2 1 2 ...
$ opinion_seas_vacc_effective: int 5 4 5 4 4 5 3 4 3 5 ...
$ opinion_seas_risk : int 1 1 4 4 4 5 1 1 1 4 ...
$ opinion_seas_sick_from_vacc: int 1 1 4 2 2 1 2 1 1 2 ...
$ age_group : chr "35 - 44 Years" "18 - 34 Years" "55 - 64 Years" "65+ Years" ...
$ education : chr "College Graduate" "12 Years" "College Graduate" "12 Years" ...
$ race : chr "Hispanic" "White" "White" "White" ...
$ sex : chr "Female" "Male" "Male" "Female" ...
$ income_poverty : chr "> $75,000" "Below Poverty" "> $75,000" "<= $75,000, Above Poverty" ...
$ marital_status : chr "Not Married" "Not Married" "Married" "Married" ...
$ rent_or_own : chr "Rent" "Rent" "Own" "Own" ...
$ employment_status : chr "Employed" "Employed" "Employed" "Not in Labor Force" ...
$ hhs_geo_region : chr "mlyzmhmf" "bhuqouqj" "lrircsnp" "lrircsnp" ...
$ census_msa : chr "MSA, Not Principle City" "Non-MSA" "Non-MSA" "MSA, Not Principle City" ...
$ household_adults : int 1 3 1 1 0 0 1 1 1 0 ...
$ household_children : int 0 0 0 0 1 2 0 0 0 0 ...
$ employment_industry : chr "atmlpfrs" "atmlpfrs" "nduyfdeo" "" ...
$ employment_occupation : chr "hfxkjkmi" "xqwwgdyp" "pvmttkik" "" ...
# Recode categorical variables into numeric form for model compatibility
str(test_features) # Check structure before recoding'data.frame': 26708 obs. of 36 variables:
$ respondent_id : int 26707 26708 26709 26710 26711 26712 26713 26714 26715 26716 ...
$ h1n1_concern : int 2 1 2 1 3 2 2 2 1 2 ...
$ h1n1_knowledge : int 2 1 2 1 1 2 2 1 1 2 ...
$ behavioral_antiviral_meds : int 0 0 0 0 1 0 0 0 0 0 ...
$ behavioral_avoidance : int 1 0 0 0 1 1 1 1 1 0 ...
$ behavioral_face_mask : int 0 0 1 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : int 1 0 1 0 1 1 1 1 1 1 ...
$ behavioral_large_gatherings: int 1 0 1 0 1 1 1 0 1 0 ...
$ behavioral_outside_home : int 0 0 1 0 1 0 0 0 1 0 ...
$ behavioral_touch_face : int 1 0 1 0 1 1 1 1 1 1 ...
$ doctor_recc_h1n1 : int 0 0 0 1 0 0 1 0 0 0 ...
$ doctor_recc_seasonal : int 0 0 0 1 0 0 1 0 0 0 ...
$ chronic_med_condition : int 0 0 0 1 0 0 0 0 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 1 1 0 0 0 0 ...
$ health_insurance : int 1 0 NA 1 1 1 NA 1 NA 1 ...
$ opinion_h1n1_vacc_effective: int 5 4 5 4 5 4 4 5 3 4 ...
$ opinion_h1n1_risk : int 1 1 4 2 2 4 2 1 1 2 ...
$ opinion_h1n1_sick_from_vacc: int 1 1 2 2 4 1 4 2 1 2 ...
$ opinion_seas_vacc_effective: int 5 4 5 4 4 5 3 4 3 5 ...
$ opinion_seas_risk : int 1 1 4 4 4 5 1 1 1 4 ...
$ opinion_seas_sick_from_vacc: int 1 1 4 2 2 1 2 1 1 2 ...
$ age_group : chr "35 - 44 Years" "18 - 34 Years" "55 - 64 Years" "65+ Years" ...
$ education : chr "College Graduate" "12 Years" "College Graduate" "12 Years" ...
$ race : chr "Hispanic" "White" "White" "White" ...
$ sex : chr "Female" "Male" "Male" "Female" ...
$ income_poverty : chr "> $75,000" "Below Poverty" "> $75,000" "<= $75,000, Above Poverty" ...
$ marital_status : chr "Not Married" "Not Married" "Married" "Married" ...
$ rent_or_own : chr "Rent" "Rent" "Own" "Own" ...
$ employment_status : chr "Employed" "Employed" "Employed" "Not in Labor Force" ...
$ hhs_geo_region : chr "mlyzmhmf" "bhuqouqj" "lrircsnp" "lrircsnp" ...
$ census_msa : chr "MSA, Not Principle City" "Non-MSA" "Non-MSA" "MSA, Not Principle City" ...
$ household_adults : int 1 3 1 1 0 0 1 1 1 0 ...
$ household_children : int 0 0 0 0 1 2 0 0 0 0 ...
$ employment_industry : chr "atmlpfrs" "atmlpfrs" "nduyfdeo" "" ...
$ employment_occupation : chr "hfxkjkmi" "xqwwgdyp" "pvmttkik" "" ...
# View unique values of 'employment_status' to understand categories
unique(test_features$employment_status)[1] "Employed" "Not in Labor Force" "Unemployed"
[4] ""
# Convert 'employment_status' to numeric factor with defined order
test_features$employment_status <- as.numeric(
factor(test_features$employment_status,
levels = c("Not in Labor Force", "Unemployed", "Employed"))
)
str(test_features) # Check structure after recoding 'employment_status''data.frame': 26708 obs. of 36 variables:
$ respondent_id : int 26707 26708 26709 26710 26711 26712 26713 26714 26715 26716 ...
$ h1n1_concern : int 2 1 2 1 3 2 2 2 1 2 ...
$ h1n1_knowledge : int 2 1 2 1 1 2 2 1 1 2 ...
$ behavioral_antiviral_meds : int 0 0 0 0 1 0 0 0 0 0 ...
$ behavioral_avoidance : int 1 0 0 0 1 1 1 1 1 0 ...
$ behavioral_face_mask : int 0 0 1 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : int 1 0 1 0 1 1 1 1 1 1 ...
$ behavioral_large_gatherings: int 1 0 1 0 1 1 1 0 1 0 ...
$ behavioral_outside_home : int 0 0 1 0 1 0 0 0 1 0 ...
$ behavioral_touch_face : int 1 0 1 0 1 1 1 1 1 1 ...
$ doctor_recc_h1n1 : int 0 0 0 1 0 0 1 0 0 0 ...
$ doctor_recc_seasonal : int 0 0 0 1 0 0 1 0 0 0 ...
$ chronic_med_condition : int 0 0 0 1 0 0 0 0 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 1 1 0 0 0 0 ...
$ health_insurance : int 1 0 NA 1 1 1 NA 1 NA 1 ...
$ opinion_h1n1_vacc_effective: int 5 4 5 4 5 4 4 5 3 4 ...
$ opinion_h1n1_risk : int 1 1 4 2 2 4 2 1 1 2 ...
$ opinion_h1n1_sick_from_vacc: int 1 1 2 2 4 1 4 2 1 2 ...
$ opinion_seas_vacc_effective: int 5 4 5 4 4 5 3 4 3 5 ...
$ opinion_seas_risk : int 1 1 4 4 4 5 1 1 1 4 ...
$ opinion_seas_sick_from_vacc: int 1 1 4 2 2 1 2 1 1 2 ...
$ age_group : chr "35 - 44 Years" "18 - 34 Years" "55 - 64 Years" "65+ Years" ...
$ education : chr "College Graduate" "12 Years" "College Graduate" "12 Years" ...
$ race : chr "Hispanic" "White" "White" "White" ...
$ sex : chr "Female" "Male" "Male" "Female" ...
$ income_poverty : chr "> $75,000" "Below Poverty" "> $75,000" "<= $75,000, Above Poverty" ...
$ marital_status : chr "Not Married" "Not Married" "Married" "Married" ...
$ rent_or_own : chr "Rent" "Rent" "Own" "Own" ...
$ employment_status : num 3 3 3 1 3 3 1 2 1 1 ...
$ hhs_geo_region : chr "mlyzmhmf" "bhuqouqj" "lrircsnp" "lrircsnp" ...
$ census_msa : chr "MSA, Not Principle City" "Non-MSA" "Non-MSA" "MSA, Not Principle City" ...
$ household_adults : int 1 3 1 1 0 0 1 1 1 0 ...
$ household_children : int 0 0 0 0 1 2 0 0 0 0 ...
$ employment_industry : chr "atmlpfrs" "atmlpfrs" "nduyfdeo" "" ...
$ employment_occupation : chr "hfxkjkmi" "xqwwgdyp" "pvmttkik" "" ...
# View unique values of 'education'
unique(test_features$education)[1] "College Graduate" "12 Years" "Some College" "< 12 Years"
[5] ""
# Convert 'education' to numeric factor with defined order
test_features$education <- as.numeric(
factor(test_features$education,
levels = c("< 12 Years", "12 Years", "Some College", "College Graduate"))
)
str(test_features) # Check structure after recoding 'education''data.frame': 26708 obs. of 36 variables:
$ respondent_id : int 26707 26708 26709 26710 26711 26712 26713 26714 26715 26716 ...
$ h1n1_concern : int 2 1 2 1 3 2 2 2 1 2 ...
$ h1n1_knowledge : int 2 1 2 1 1 2 2 1 1 2 ...
$ behavioral_antiviral_meds : int 0 0 0 0 1 0 0 0 0 0 ...
$ behavioral_avoidance : int 1 0 0 0 1 1 1 1 1 0 ...
$ behavioral_face_mask : int 0 0 1 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : int 1 0 1 0 1 1 1 1 1 1 ...
$ behavioral_large_gatherings: int 1 0 1 0 1 1 1 0 1 0 ...
$ behavioral_outside_home : int 0 0 1 0 1 0 0 0 1 0 ...
$ behavioral_touch_face : int 1 0 1 0 1 1 1 1 1 1 ...
$ doctor_recc_h1n1 : int 0 0 0 1 0 0 1 0 0 0 ...
$ doctor_recc_seasonal : int 0 0 0 1 0 0 1 0 0 0 ...
$ chronic_med_condition : int 0 0 0 1 0 0 0 0 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 1 1 0 0 0 0 ...
$ health_insurance : int 1 0 NA 1 1 1 NA 1 NA 1 ...
$ opinion_h1n1_vacc_effective: int 5 4 5 4 5 4 4 5 3 4 ...
$ opinion_h1n1_risk : int 1 1 4 2 2 4 2 1 1 2 ...
$ opinion_h1n1_sick_from_vacc: int 1 1 2 2 4 1 4 2 1 2 ...
$ opinion_seas_vacc_effective: int 5 4 5 4 4 5 3 4 3 5 ...
$ opinion_seas_risk : int 1 1 4 4 4 5 1 1 1 4 ...
$ opinion_seas_sick_from_vacc: int 1 1 4 2 2 1 2 1 1 2 ...
$ age_group : chr "35 - 44 Years" "18 - 34 Years" "55 - 64 Years" "65+ Years" ...
$ education : num 4 2 4 2 2 4 4 3 2 3 ...
$ race : chr "Hispanic" "White" "White" "White" ...
$ sex : chr "Female" "Male" "Male" "Female" ...
$ income_poverty : chr "> $75,000" "Below Poverty" "> $75,000" "<= $75,000, Above Poverty" ...
$ marital_status : chr "Not Married" "Not Married" "Married" "Married" ...
$ rent_or_own : chr "Rent" "Rent" "Own" "Own" ...
$ employment_status : num 3 3 3 1 3 3 1 2 1 1 ...
$ hhs_geo_region : chr "mlyzmhmf" "bhuqouqj" "lrircsnp" "lrircsnp" ...
$ census_msa : chr "MSA, Not Principle City" "Non-MSA" "Non-MSA" "MSA, Not Principle City" ...
$ household_adults : int 1 3 1 1 0 0 1 1 1 0 ...
$ household_children : int 0 0 0 0 1 2 0 0 0 0 ...
$ employment_industry : chr "atmlpfrs" "atmlpfrs" "nduyfdeo" "" ...
$ employment_occupation : chr "hfxkjkmi" "xqwwgdyp" "pvmttkik" "" ...
# View unique values of 'income_poverty'
unique(test_features$income_poverty)[1] "> $75,000" "Below Poverty"
[3] "<= $75,000, Above Poverty" ""
# Convert 'income_poverty' to numeric factor with defined order
test_features$income_poverty <- as.numeric(
factor(test_features$income_poverty,
levels = c("Below Poverty", "<= $75,000, Above Poverty", "> $75,000"))
)
str(test_features) # Check structure after recoding 'income_poverty''data.frame': 26708 obs. of 36 variables:
$ respondent_id : int 26707 26708 26709 26710 26711 26712 26713 26714 26715 26716 ...
$ h1n1_concern : int 2 1 2 1 3 2 2 2 1 2 ...
$ h1n1_knowledge : int 2 1 2 1 1 2 2 1 1 2 ...
$ behavioral_antiviral_meds : int 0 0 0 0 1 0 0 0 0 0 ...
$ behavioral_avoidance : int 1 0 0 0 1 1 1 1 1 0 ...
$ behavioral_face_mask : int 0 0 1 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : int 1 0 1 0 1 1 1 1 1 1 ...
$ behavioral_large_gatherings: int 1 0 1 0 1 1 1 0 1 0 ...
$ behavioral_outside_home : int 0 0 1 0 1 0 0 0 1 0 ...
$ behavioral_touch_face : int 1 0 1 0 1 1 1 1 1 1 ...
$ doctor_recc_h1n1 : int 0 0 0 1 0 0 1 0 0 0 ...
$ doctor_recc_seasonal : int 0 0 0 1 0 0 1 0 0 0 ...
$ chronic_med_condition : int 0 0 0 1 0 0 0 0 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 1 1 0 0 0 0 ...
$ health_insurance : int 1 0 NA 1 1 1 NA 1 NA 1 ...
$ opinion_h1n1_vacc_effective: int 5 4 5 4 5 4 4 5 3 4 ...
$ opinion_h1n1_risk : int 1 1 4 2 2 4 2 1 1 2 ...
$ opinion_h1n1_sick_from_vacc: int 1 1 2 2 4 1 4 2 1 2 ...
$ opinion_seas_vacc_effective: int 5 4 5 4 4 5 3 4 3 5 ...
$ opinion_seas_risk : int 1 1 4 4 4 5 1 1 1 4 ...
$ opinion_seas_sick_from_vacc: int 1 1 4 2 2 1 2 1 1 2 ...
$ age_group : chr "35 - 44 Years" "18 - 34 Years" "55 - 64 Years" "65+ Years" ...
$ education : num 4 2 4 2 2 4 4 3 2 3 ...
$ race : chr "Hispanic" "White" "White" "White" ...
$ sex : chr "Female" "Male" "Male" "Female" ...
$ income_poverty : num 3 1 3 2 2 3 NA 2 NA 3 ...
$ marital_status : chr "Not Married" "Not Married" "Married" "Married" ...
$ rent_or_own : chr "Rent" "Rent" "Own" "Own" ...
$ employment_status : num 3 3 3 1 3 3 1 2 1 1 ...
$ hhs_geo_region : chr "mlyzmhmf" "bhuqouqj" "lrircsnp" "lrircsnp" ...
$ census_msa : chr "MSA, Not Principle City" "Non-MSA" "Non-MSA" "MSA, Not Principle City" ...
$ household_adults : int 1 3 1 1 0 0 1 1 1 0 ...
$ household_children : int 0 0 0 0 1 2 0 0 0 0 ...
$ employment_industry : chr "atmlpfrs" "atmlpfrs" "nduyfdeo" "" ...
$ employment_occupation : chr "hfxkjkmi" "xqwwgdyp" "pvmttkik" "" ...
# View unique values of 'age_group'
unique(test_features$age_group)[1] "35 - 44 Years" "18 - 34 Years" "55 - 64 Years" "65+ Years"
[5] "45 - 54 Years"
# Convert 'age_group' to numeric factor with defined order
test_features$age_group <- as.numeric(
factor(test_features$age_group,
levels = c("18 - 34 Years", "35 - 44 Years", "45 - 54 Years",
"55 - 64 Years", "65+ Years"))
)
# Function to replace all numeric variables that have missing values with the median
replace_numeric_variables <- function(data) {
for(var in names(data)) {
# Check if variable is numeric and contains any NA values
if(is.numeric(data[[var]]) & anyNA(data[[var]])) {
# Calculate median of the column excluding NAs
col_median <- median(data[[var]], na.rm = TRUE)
# Replace NA values with the column median
data[[var]][is.na(data[[var]])] <- col_median
}
}
return(data) # Return the modified dataset
}
# Apply the median replacement function to the test dataset
test_features <- replace_numeric_variables(test_features)
# Verify that there are no remaining missing values
sum(is.na(test_features)) # Should return 0[1] 0
#####
# Drop character variables because XGBoost cannot handle them directly
# Identify character columns
char_cols <- sapply(test_features, is.character)
# Count how many character columns exist
sum(char_cols)[1] 8
# Remove all character columns
test_features <- test_features[ , !char_cols]
# Inspect the structure of the cleaned test dataset
str(test_features)'data.frame': 26708 obs. of 28 variables:
$ respondent_id : int 26707 26708 26709 26710 26711 26712 26713 26714 26715 26716 ...
$ h1n1_concern : int 2 1 2 1 3 2 2 2 1 2 ...
$ h1n1_knowledge : num 2 1 2 1 1 2 2 1 1 2 ...
$ behavioral_antiviral_meds : int 0 0 0 0 1 0 0 0 0 0 ...
$ behavioral_avoidance : int 1 0 0 0 1 1 1 1 1 0 ...
$ behavioral_face_mask : int 0 0 1 0 0 0 0 0 0 0 ...
$ behavioral_wash_hands : num 1 0 1 0 1 1 1 1 1 1 ...
$ behavioral_large_gatherings: num 1 0 1 0 1 1 1 0 1 0 ...
$ behavioral_outside_home : num 0 0 1 0 1 0 0 0 1 0 ...
$ behavioral_touch_face : num 1 0 1 0 1 1 1 1 1 1 ...
$ doctor_recc_h1n1 : num 0 0 0 1 0 0 1 0 0 0 ...
$ doctor_recc_seasonal : num 0 0 0 1 0 0 1 0 0 0 ...
$ chronic_med_condition : num 0 0 0 1 0 0 0 0 0 1 ...
$ child_under_6_months : int 0 0 0 0 0 0 0 0 0 0 ...
$ health_worker : int 0 0 0 0 1 1 0 0 0 0 ...
$ health_insurance : num 1 0 1 1 1 1 1 1 1 1 ...
$ opinion_h1n1_vacc_effective: num 5 4 5 4 5 4 4 5 3 4 ...
$ opinion_h1n1_risk : num 1 1 4 2 2 4 2 1 1 2 ...
$ opinion_h1n1_sick_from_vacc: int 1 1 2 2 4 1 4 2 1 2 ...
$ opinion_seas_vacc_effective: num 5 4 5 4 4 5 3 4 3 5 ...
$ opinion_seas_risk : int 1 1 4 4 4 5 1 1 1 4 ...
$ opinion_seas_sick_from_vacc: int 1 1 4 2 2 1 2 1 1 2 ...
$ age_group : num 2 1 4 5 2 3 5 4 4 5 ...
$ education : num 4 2 4 2 2 4 4 3 2 3 ...
$ income_poverty : num 3 1 3 2 2 3 2 2 2 3 ...
$ employment_status : num 3 3 3 1 3 3 1 2 1 1 ...
$ household_adults : int 1 3 1 1 0 0 1 1 1 0 ...
$ household_children : int 0 0 0 0 1 2 0 0 0 0 ...
# Convert test features to numeric matrix for XGBoost prediction
x_test_final <- as.matrix(
test_features %>% select(-respondent_id) # Exclude ID column from predictors
)
# Generate predicted probabilities for seasonal vaccine using the trained model
seasonal_probs <- predict(xgb_model, x_test_final)
# Generate predicted probabilities for H1N1 vaccine using the trained model
h1n1_probs <- predict(xgb_model_2, x_test_final)
# Combine predictions with respondent IDs to create submission dataframe
submission <- data.frame(
respondent_id = test_features$respondent_id, # ID column
h1n1_vaccine = h1n1_probs, # Predicted H1N1 probabilities
seasonal_vaccine = seasonal_probs # Predicted seasonal vaccine probabilities
)
# Save submission file as CSV (no row names)
write.csv(submission, "submissions_2.csv", row.names = FALSE)Conclusion
In this report, we successfully built and applied a predictive model to estimate the likelihood of individuals receiving the H1N1 and seasonal vaccines. Using XGBoost, a robust gradient boosting algorithm, allowed us to capture complex, non-linear relationships between demographic, socioeconomic, and health-related predictors and vaccination behavior.
We carefully preprocessed both training and test datasets by handling missing values, encoding categorical variables, and converting features into numeric matrices suitable for modeling. The models were evaluated using the AUC metric, indicating strong discriminatory performance. Finally, predictions were generated for the test dataset and compiled into a submission-ready format.
Further improvements could include hyperparameter tuning, feature engineering, and incorporating additional behavioral or temporal data to enhance prediction accuracy.