Introduction

The aim of the project was to determine whether there are demographic correlates of vaccine stance that can be further validated as predictive through a machine-learning approach. Due to procedural issues, data collection could not proceed as originally intended, and the resulting final sample consists of fewer than 100 data points. The workbook below therefore serves more as a tutorial and worked example of methodology than as a data analysis or a basis for inference.

Vaccine stance was originally intended to encompass a combination of attitude and status, e.g. whether a person took the vaccine despite being hesitant, or did not take it because of being categorically opposed. However, due to the scarcity of data, the difficulty of determining a precise stance, and the predominance of vaccinated respondents with compound attitudes, the final ‘target’ selected for modelling was a binary stance reflecting hesitancy or lack thereof. NOTE that even this could not be determined with clarity, because many respondents answered a conditional question despite not having met the condition in question, further preventing any inference from the collected data.

The stance variable was constructed from the following survey question: hesitancy was coded as 1, and 0 otherwise.

Q11 If you have had the opportunity to be vaccinated (COVID-19), but you have not decided to accept the vaccine, what was your primary motivation?
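
As a sketch only (an assumption about the approach: the actual coding happens inside parse_survey from the sourced files, and the answer-to-code mapping is listed in the extra analyses at the end), the target could be derived roughly like this:

# Hypothetical sketch of the stance coding: Q11 answers signalling hesitancy -> 1.
hesitant_answers = c(
    'There is insufficient information about the side effects of the vaccine, I am open to future vaccines which are better developed',
    'I have not made up my mind yet')
df['stance_target'] = as.numeric( df[['Q11']] %in% hesitant_answers)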

Overview of approach

Once the data has been examined in general, a typical modelling procedure would follow the steps below (a minimal sketch on toy data is given after the list):

  1. train-test split: putting aside a subset of the data (size and balancing of the subset guided by exploratory analysis), so that the predictiveness of the algorithm and the generalisability of the inferred relations can later be verified on the test set
  2. feature selection / engineering: looking further at the training data with an eye to what might make for good features / predictive factors. NOTE: be mindful when creating features that, in general, the same transformation will need to be applied to the test set, so provision must be made for discrepancies - methods such as calls to unique are unsafe!
  3. pipeline: coding up the transformations to be applied to the raw data, ideally in a reusable manner
  4. train: performing parameter fitting of a model (usually some form of loss minimisation)
  5. test: evaluating the performance of the model on the unseen data
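
As referenced above, here is a minimal, self-contained sketch of those five steps on toy data (illustrative only: the toy data frame, its columns and the fake target are made up; the real workbook below uses the survey data and its helper functions):

# Minimal end-to-end sketch on fake data: split, engineer, fit, evaluate.
suppressMessages(library(caret))
set.seed(1)
toy   = data.frame( x1 = rnorm(100), x2 = sample( c(0,1), 100, replace= TRUE))
toy$y = as.numeric( toy$x1 + rnorm(100) > 0)                       # fake binary target
idx   = createDataPartition( toy$y, p= .6, list= FALSE)            # 1. train-test split
tr    = toy[ idx, ]; te = toy[ -idx, ]
tr$x1_sq = tr$x1^2                                                 # 2. engineer a feature...
te$x1_sq = te$x1^2                                                 # 3. ...and reuse the same transform on test
fit   = glm( y ~ x1 + x2 + x1_sq, family= binomial, data= tr)      # 4. train
mean( (predict( fit, te, type= 'response') > 0.5) == te$y)         # 5. test-set accuracy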

Summary of findings

The clearest (correlative) signal could be seen in three demographic features: age group, gender, and household size.

Geographical factors, being more granular, needed to be pooled together but appeared less indicative of stance. Religious affiliation, being a protected characteristic, was deemed unfit for modelling and was not looked at. Information source, being a multiple-choice question, was difficult to use at full granularity / in combinations of responses; a single most-correlated factor appeared to be .., although the signal here was not as clear as in the purely demographic features.

NOTE, however, that the modelling did not validate the predictiveness of the correlations observed above. These correlations, visible in the training set, allowed the logistic regression model to be fitted to a relatively good level of approx. 0.59 AUC. However, when attempting to generalise to the test set, that same model and inferred relationship suffered a significant performance drop, to 0.52 AUC. That indicates overfitting - the signal was present only in the train set. Considering the extremely small sample size, it is to be expected that random patterns can emerge after splitting the data down further.

Workbook

Libraries and source code

Read in all useful libraries, as well as any functions written for the project. Note that the absolute path will need to change to reflect the local file structure!

options(warn=-1)

suppressMessages(library(stringr))
suppressMessages(library(jsonlite))
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(purrr))
suppressMessages(library(repr))
suppressMessages(library(caret))
suppressMessages(library(pROC))
suppressMessages(library(janitor))
suppressMessages(library(rgdal))
suppressMessages(library(sf))
suppressMessages(library(sp))

The easiest way is to set the working directory once; source code and data can then be loaded via relative paths.

setwd('C:/Users/glenn/Desktop/Feburary 2022/Grant managment/vaccine hesitancy/vaccinepaper')
source("analysis2.r")
source("modelling.r")
#source("namespace.r")

Read data

data_name <-  './MyHealthChoiceGhana_August 25, 2022_01.37.csv'
admin_name <- './gadm40_GHA_shp/gadm40_GHA_1.shp'

df = parse_survey( data_name)
## [1] "Raw data has 85 rows and 31 cols."
## [1] "Filtered data has 62 rows and 17 cols..."

Train-test split

Before doing any modelling and feature engineering, we need to set aside a test set that we are not going to touch until the very end. This ensures there is no ‘cheating’ / ‘data leakage’. NOTE we are also setting the random seed; this is so that we can reproduce results - and it also prevents us from going through the whole procedure twice with a slightly different train set, which again would be a form of leakage.

set.seed(22)
train_index = createDataPartition(df$stance_target, p= .6, list= FALSE)
df_train = df[train_index,]
df_test = df[-train_index,]

print(paste('Train set has',length(df_train$stance_target),'examples and',sum(df_train$stance_target),'are the target'))
## [1] "Train set has 38 examples and 28 are the target"
print(paste('Test set has',length(df_test$stance_target),'examples and',sum(df_test$stance_target),'are the target'))
## [1] "Test set has 24 examples and 18 are the target"

Feature Engineering

Look at the training data and see how the target splits by each feature, to inform selection / engineering of suitable predictors. We’ll also create two lists: one for pipeline functions to be applied to transform the data, another for columns that can be used for modelling as they are, raw, without transforming.

pipe = list()
model_columns = list()
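
The idea is that every agreed transformation gets appended to pipe, so it can later be replayed on the test set. A minimal sketch of what such a replay helper could look like (an assumption only; the actual apply_pipeline used at the end lives in the sourced modelling.r):

# Hypothetical sketch: apply every stored transform, in order, to a data frame.
apply_pipeline_sketch = function( pipe, data) {
    for (f in pipe) {
        data = f( data)
    }
    data
}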

Let’s look at an illustrative example first, Q3 (Age-group)

df_train['stance'] = 'decided'
df_train[df_train['stance_target']==1,'stance'] = 'hesitant'

options(repr.plot.width=5,repr.plot.height=4)
plot_target_cat_split( df_train, 'stance', 'Q3')

Another way of visualizing

Since the target is binary here and we can use 0’s and 1’s, my personal preference for guiding engineering is to plot proportions and add a shaded error area (rather than using stacked bars, which are less readable at a glance); it is also extra useful to look at an added random ‘fake target’ column, to see whether simple random assignment could show trends similar to the actual data.
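
For illustration, such a proportion-with-error-band view could be built roughly like this (a minimal sketch under my own assumptions; it is not the actual plot_target_prop_by_cat3 implementation, which lives in the sourced files):

# Sketch: hesitant proportion per Q3 category with a simple standard-error band.
prop_df = df_train %>%
    group_by( Q3) %>%
    summarise( n = n(), p = mean( stance_target)) %>%
    mutate( se = sqrt( p * (1 - p) / n))
ggplot( prop_df, aes( x= Q3, y= p, group= 1)) +
    geom_ribbon( aes( ymin= pmax( p - se, 0), ymax= pmin( p + se, 1)), alpha= 0.3) +
    geom_line() +
    geom_point() +
    labs( y= 'proportion hesitant')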

df_train['random'] = sample(c(0,1), length(df_train[['stance_target']]), replace= TRUE)
df_train['random_target'] = df_train['stance_target']
df_train[df_train['random']==1,'random_target'] = 1 - df_train[df_train['random']==1,'random_target']

options(repr.plot.width=7,repr.plot.height=5)
plot_target_prop_by_cat3( df_train, 'stance_target', 'Q3', use_random= TRUE)

Seeing here that: in the age groups above 35 there are quite few samples and large variance, so the signal is unreliable; there appears to be a trend of increasing hesitancy with age, but it is quite weak and not that dissimilar to random… What might be a good solution: pool together everyone >35 and see how that looks. We need to encode the data as numbers, so we use here the Q3_num ordering that was created in the parsing step.

df_train['age_fea'] = df_train['Q3_num']
df_train[which(df_train['Q3_num']>=3),'age_fea'] =3

options(repr.plot.width=7,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'age_fea', use_random= TRUE)

That’s our first proper feature - and after pooling the low-sample end of the spectrum, the signal looks clearer - and, dare I say, present. A point of ‘housekeeping’: whatever processing we apply to the training data, we will need to reproduce on the test data, so once we have decided on a feature, we can e.g. make a reusable function that takes in a data frame and transforms it. NOTE that any parameters, e.g. if we want to normalise the distribution, MUST come from the training data and not be calculated on the passed-in data frame!
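
As a sketch, such a reusable transform for the pooled age feature could look like the function below (mirroring the ad-hoc pooling above; note that in this workbook only the family-size and Q8 features are ultimately pushed into the modelling columns further down):

# Hypothetical reusable transform for the pooled age feature.
add_age_feature = function( data) {
    data['age_fea'] = data['Q3_num']
    data[ which( data[ 'Q3_num'] >= 3), 'age_fea'] = 3   # pool all groups above the third
    data
}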

We will repeat this for each potential feature.

Region: Q1 and Q2

Both of these we will most likely need to pool together, as there are definitely too few samples per region to be usable - we would also run into problems of identifiability. Since these are geographical regions, one option could be pooling neighbours together, or pooling by some otherwise known relation (e.g. predominantly urban, predominantly rural, etc.).
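
A minimal sketch of such pooling on the raw Q1 labels (the grouping below is purely illustrative, not a claim about how the regions should actually be grouped; the label spellings are assumed to match those shown for Q2 further down):

# Hypothetical grouping of Q1 regions into coarse pools.
region_group = c( 'Greater Accra Region' = 'pool_a', 'Volta Region' = 'pool_b')
df_train['Q1_pooled'] = ifelse( df_train[['Q1']] %in% names( region_group),
                                region_group[ df_train[['Q1']]],
                                'other')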

options(repr.plot.width=9,repr.plot.height=5)

Let’s check if the most populous regions have any signal on their own, before thinking about combining - since we already one-hot-encoded each region, we can just look up which region corresponds to which column in the lookups’ .json files.

options(repr.plot.width=9,repr.plot.height=5)

Iterating through Accra and Volta, it seems that perhaps Greater Accra vs. the others might have some signal, so we add that.

options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'Q1_14', use_random= TRUE)

Similar approach for Q2 (Place of work)

However, here the uneven spread of answers across regions is even more pronounced, and there does not seem to be much signal in the most represented one, Volta, so we decide against using Q2 entirely.

df_train %>% group_by( 
        !!rlang::sym( 'Q2')
    ) %>% count(
    )
## # A tibble: 7 x 2
## # Groups:   Q2 [7]
##   Q2                       n
##   <chr>                <int>
## 1 Ashanti Region           3
## 2 Eastern Region           2
## 3 empty                    3
## 4 Greater Accra Region     5
## 5 Northern Region          1
## 6 Volta Region            23
## 7 Western Region           1
options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'Q2_14', use_random= TRUE)

Next one to look at is Q4 (Gender)

That one, being gender, has better balancing and also appears to have signal, so we can use it as is.

options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'Q4', use_random= TRUE)

Next: Q5 (Family size)

Here, similarly to Q3, we have a natural ordering of family size, so we could pool together by size if needed.

options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'Q5', use_random= TRUE)

df_train %>% group_by( 
        !!rlang::sym( 'Q5')
    ) %>% count(
    )
## # A tibble: 4 x 2
## # Groups:   Q5 [4]
##   Q5        n
##   <chr> <int>
## 1 1         6
## 2 2-4       9
## 3 5 -7     16
## 4 8+        7

There appears to be signal, but since some answers have few samples, the feature might work better pooled.

add_fam_feature = function( data) {
    # binary family-size feature: 1 for larger households (Q5_num >= 3,
    # i.e. the '5 -7' and '8+' answers), 0 otherwise
    data['fam_fea'] = 0
    data[ which( data[ 'Q5_num'] >= 3), 'fam_fea'] = 1
    data
}
pipe = append( pipe, add_fam_feature)
model_columns = append( model_columns, 'fam_fea')

df_train = add_fam_feature( df_train)
options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'fam_fea', use_random= TRUE)

Next: Q6 (Education)

The education question, while it could potentially be treated as ordered (years spent in education), appears quite skewed towards respondents with a Bachelor’s degree; further, once the under-represented answers are pooled together, there seems to be hardly any difference in stance between degree-holders and non-holders.

options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'Q6', use_random= TRUE)

options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'Q6_3', use_random= TRUE)

Moving onto the multiple-choice Q8 (Ranking of source of COVID-19 information)

NOTE we are skipping Q7, since it has been flagged as a sensitive / protected characteristic and should not be used for modelling.

Since this is a multiple-choice question (with many possible combinations of answers), we shall look at each one-hot-encoded option. In addition, we could construct extra features, such as a numerical feature counting how many answers were selected (here everyone had to choose 2 options, so this is not relevant), or a boolean feature for a particular combination.
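
A minimal sketch of such extra features (illustrative only and not used for the final model; the helper name and the particular pair of one-hot columns are assumptions):

# Hypothetical extra features from the multiple-choice Q8.
add_q8_extra_features = function( data) {
    q8_cols = names(data)[ grep( 'Q8_', names(data))]
    data['q8_n_selected'] = rowSums( data[ q8_cols])         # how many options were ticked
    data['q8_combo_fea'] = 0                                 # a particular pair of options
    data[ which( data[ 'Q8_2'] == 1 & data[ 'Q8_4'] == 1), 'q8_combo_fea'] = 1
    data
}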

options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'Q8_4', use_random= TRUE)

options(repr.plot.width=9,repr.plot.height=5)
plot_target_prop_by_cat( df_train, 'stance_target', 'Q8_2', use_random= TRUE)

There seems to be - possibly - some signal, although when trying for more granularity (e.g. a particular combination) things can flip (e.g. social media alone seems anti-correlated, but social media in combination with news or internet searches becomes positively correlated). My gut feeling is that there is not much information in Q8 and the correlations are spurious (note also how the random feature can exhibit what look like correlations of similar magnitude), but we will add the two ‘strongest’-looking options:

These are: social media and news/internet searches.

model_columns = append( model_columns, 'Q8_4')
model_columns = append( model_columns, 'Q8_2')

Remaining questions should not be used for modelling

All the remaining questions, Q9 - Q13, are intrinsically target-related and should not be used for modelling, as they directly or indirectly probe vaccine status and stance.

We can try the most basic model, without many frills

With such a scarcity of data, it is probably better not to have too many features, which would just risk overfitting.

We can try a little bit of upsampling + balancing / bootstrapping

Again, it is good to set the seed for reproducibility.

# and for passing data to models we will also need the target:
model_columns_ = append( model_columns, 'stance_target')

# let's upsample our training set a bit, and try balancing:
set.seed(33)
up_specs = c(50,50)
names(up_specs) = c(0,1)
re_id = bootstrap_samples( df_train, 'stance_target', up_specs)

data.train = df_train[ unlist(re_id), unlist(model_columns_)]
data.train[ , unlist(model_columns)] = (data.train[ , unlist(model_columns)] - scaler_m ) / scaler_st
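
Two notes on the chunk above. bootstrap_samples, scaler_m and scaler_st are not defined in the visible code and presumably come from the sourced modelling.r. As a rough, assumption-laden sketch: a bootstrap helper could sample row indices with replacement, per target class, up to the requested counts, and the scalers would be the per-column means and standard deviations computed on the training features only (so they can be reused unchanged on the test set):

# Hypothetical sketch of a bootstrap_samples-style helper.
bootstrap_samples_sketch = function( data, target_col, specs) {
    lapply( names( specs), function( cls) {
        idx = which( data[[ target_col]] == as.numeric( cls))
        sample( idx, specs[[ cls]], replace= TRUE)        # resample with replacement
    })
}

# Hypothetical scalers: per-column mean / sd of the upsampled training features.
# (If applied manually to a data frame, column-wise scaling is safest via sweep().)
scaler_m  = apply( df_train[ unlist( re_id), unlist( model_columns)], 2, mean)
scaler_st = apply( df_train[ unlist( re_id), unlist( model_columns)], 2, sd)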

Fit a logistic regression model

NOTE that this is the most basic linear model, without regularisation (i.e. without penalising model complexity); R has packages where regularisation is available.

model = glm(
    stance_target ~ ., family= binomial(link= 'logit'), data= data.train
)
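
For comparison only (not used further in this workbook), a regularised alternative could be fitted with the glmnet package, assuming it is installed; here a ridge (L2) penalty, with the penalty strength chosen by cross-validation:

# Hypothetical regularised counterpart: ridge-penalised logistic regression.
suppressMessages(library(glmnet))
x_tr = as.matrix( data.train[ , unlist( model_columns)])
y_tr = data.train[ , 'stance_target']
cv_fit = cv.glmnet( x_tr, y_tr, family= 'binomial', alpha= 0)   # alpha = 0 -> ridge
proba_reg = predict( cv_fit, newx= x_tr, s= 'lambda.min', type= 'response')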

And examine the performance on the train set:

# NOTE: predict.glm returns the linear predictor (log-odds) by default; use
# type = 'response' for probabilities. The fixed 0.5 cut-off below is therefore
# on the log-odds scale; the rank-based ROC / AUC is unaffected by this.
proba = predict( model, data.train[,unlist(model_columns)])
pred = as.numeric(proba > 0.5)
targ = data.train[,'stance_target']

confusionMatrix( as.factor(pred), as.factor(targ))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 45 42
##          1  5  8
##                                           
##                Accuracy : 0.53            
##                  95% CI : (0.4276, 0.6306)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.3086          
##                                           
##                   Kappa : 0.06            
##                                           
##  Mcnemar's Test P-Value : 1.512e-07       
##                                           
##             Sensitivity : 0.9000          
##             Specificity : 0.1600          
##          Pos Pred Value : 0.5172          
##          Neg Pred Value : 0.6154          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4500          
##    Detection Prevalence : 0.8700          
##       Balanced Accuracy : 0.5300          
##                                           
##        'Positive' Class : 0               
## 
roc(targ ~ proba, plot = TRUE, print.auc = TRUE)

## 
## Call:
## roc.formula(formula = targ ~ proba, plot = TRUE, print.auc = TRUE)
## 
## Data: proba in 50 controls (targ 0) < 50 cases (targ 1).
## Area under the curve: 0.5904

The proof of the pudding, i.e. the real test of the model…

df_test = apply_pipeline( pipe, df_test)
data.test = df_test[ , unlist(model_columns_)]
data.test[ , unlist(model_columns)] = (data.test[ , unlist(model_columns)] - scaler_m ) / scaler_st

df_test['stance'] = 'decided'
df_test[df_test['stance_target']==1,'stance'] = 'hesitant'

proba_test = predict( model, data.test[, unlist(model_columns)])
pred_test = as.numeric(proba_test > 0.5)
targ_test = data.test[ , 'stance_target']

confusionMatrix( as.factor(pred_test), as.factor(targ_test))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  5 14
##          1  1  4
##                                          
##                Accuracy : 0.375          
##                  95% CI : (0.188, 0.5941)
##     No Information Rate : 0.75           
##     P-Value [Acc > NIR] : 0.999980       
##                                          
##                   Kappa : 0.0323         
##                                          
##  Mcnemar's Test P-Value : 0.001946       
##                                          
##             Sensitivity : 0.8333         
##             Specificity : 0.2222         
##          Pos Pred Value : 0.2632         
##          Neg Pred Value : 0.8000         
##              Prevalence : 0.2500         
##          Detection Rate : 0.2083         
##    Detection Prevalence : 0.7917         
##       Balanced Accuracy : 0.5278         
##                                          
##        'Positive' Class : 0              
## 
roc(targ_test ~ proba_test, plot = TRUE, print.auc = TRUE)

## 
## Call:
## roc.formula(formula = targ_test ~ proba_test, plot = TRUE, print.auc = TRUE)
## 
## Data: proba_test in 6 controls (targ_test 0) > 18 cases (targ_test 1).
## Area under the curve: 0.5231
summary(model)
## 
## Call:
## glm(formula = stance_target ~ ., family = binomial(link = "logit"), 
##     data = data.train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.43380  -1.08825  -0.05279   1.15768   1.31405  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.06287    0.20825   0.302    0.763
## fam_fea     -0.17944    0.20992  -0.855    0.393
## Q8_4         0.05067    0.24556   0.206    0.837
## Q8_2        -0.21980    0.25104  -0.876    0.381
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.63  on 99  degrees of freedom
## Residual deviance: 136.23  on 96  degrees of freedom
## AIC: 144.23
## 
## Number of Fisher Scoring iterations: 4

Adding extra analyses

#Visual data assessment: categorical splits of categorical target per question
#Just a stacked bar chart, to see how different vaccine results are distributed in each answer category (for single-choice questions!) 




admin <- readOGR( admin_name) 
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\glenn\Desktop\Feburary 2022\Grant managment\vaccine hesitancy\vaccinepaper\gadm40_GHA_shp\gadm40_GHA_1.shp", layer: "gadm40_GHA_1"
## with 10 features
## It has 11 fields
admin@data <- admin@data[c('NAME_1')]
df = parse_survey( data_name)
## [1] "Raw data has 85 rows and 31 cols."
## [1] "Filtered data has 62 rows and 17 cols..."
#head(df[,c('Q9','Q9num','target')])

# Age group

plot_target_cat_split( df, "Q9", "Q3")

# Gender

plot_target_cat_split( df, "Q9", "Q4")

# Family size

plot_target_cat_split( df, "Q9", "Q5")

# unvacc x Age group

plot_target_cat_split3( df, "Q3", "Q13")

#plot_target_cat_split3( df, "Q13", "Q3")
plot_target_cat_split4( df, "Q3", "Q13")

#+ My religious beliefs do not support this option  (0) 
#+ There is insufficient information about the side effects of the vaccine, I am open to future vaccines which are better developed  (1) 
#+ I am young and of good health so I should not be affected severely if infected with the virus  (0) 
#+ I am protecting my immunity through other means instead  (0) 
#+ I have not made up my mind yet  (1) 

#hist(df[['Q9num']])

df['stance'] = 'decided'
df[df['stance_target']==1,'stance'] = 'hesitant'


#Sanity checking
#Making sure that the one-hot encoding worked ok in R - sum should be the same as length of data, and no one-hotted column should have more than 1 unique value from the original column assigned




cols_to_check = setdiff(CAT_COLS, MUL_COLS)
for (col in cols_to_check){
    one_hotted_cols = names(df)[grep(paste(col,'_',sep=''),names(df))]
    check_sum = sum( df[one_hotted_cols])
    print(paste('Column',col,'check sum:',check_sum))
    for (col_o in one_hotted_cols) {
        if ( length( unique(df[ df[col_o] == 1, col])) > 1) {
            print( col_o)
        }
    }
}
## [1] "Column Q1 check sum: 62"
## [1] "Column Q2 check sum: 62"
## [1] "Column Q4 check sum: 62"
## [1] "Column Q6 check sum: 62"
## [1] "Column Q7 check sum: 62"
## [1] "Column Q9 check sum: 62"
## [1] "Column Q10 check sum: 62"
## [1] "Column Q11 check sum: 62"
plot(admin)

lat_lon_names <- c('LocationLongitude','LocationLatitude')

data_sp = SpatialPoints(
    df[, lat_lon_names], proj4str=CRS('+init=epsg:4326')
)
data_sp = SpatialPointsDataFrame(data_sp,data.frame(df[,'stance_target']))

#within_column = st_within( st_as_sf( data_sp), st_as_sf( admin))
#swap_idx = partial( swap_factors, dict= admin[['NAME_1']])

#df['lon_lat_region'] = unlist( map( within_column, swap_idx))

# Count how many survey points fall within each region polygon (note: despite
# the name, 'target_sum' is a count of respondents, not a sum of the target).
contains_column = st_contains( st_as_sf( admin), st_as_sf( data_sp))
admin@data[['target_sum']] = unlist( map( contains_column, length))

plot( st_as_sf( admin[,'target_sum']))

#Just a stacked bar chart, to see how different vaccine results are distributed in each answer category (for single-choice questions!)




options(repr.plot.width=9,repr.plot.height=6)
plot_target_cat_split( df[c(grep('Volta', df[['Q1']]), grep('Greater Acc', df[['Q1']])),], 'stance', 'Q1')

options(repr.plot.width=9,repr.plot.height=6)
plot_target_cat_split( df, 'stance', 'Q3')

save.image("vaccineRMD.RData")