Introduction

Hello there!

In this project I am going to use logistic regression to model attendance of lawful/peaceful demonstrations in the past 12 months as the outcome variable. The survey question that produced this variable reads as follows:

“Now I’d like you to look at this card. I’m going to read out some different forms of political action that people can take, and I’d like you to tell me, for each one, whether you have actually done any of these things, whether you might do it or would never, under any circumstances, do it.”

Exploratory Data Analysis

First of all, let's have a look at the data. It comes from Wave 7 of the World Values Survey (WVS), the Russian Federation sample:

https://www.worldvaluessurvey.org/WVSDocumentationWV7.jsp

require(sjPlot)
require(dplyr)
require(janitor)
require(VIM)
require(polycor)
require(ggcorrplot)
library(knitr)
library(kableExtra)
library(pscl)
require(equatiomatic)
require(broom)
require(caret)
library(car)
library(Amelia)
library(mlbench)
require(mice)
require(DMwR)
options(scipen = 999)

df <- readRDS("ruwvs")
df$age <- as.numeric(df$age)
str(df)
## 'data.frame':    1810 obs. of  10 variables:
##  $ Q211      : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 2 2 2 ...
##  $ intinpol  : Factor w/ 4 levels "Very interested",..: 4 2 2 3 3 3 3 2 3 4 ...
##  $ eduR      : Factor w/ 4 levels "Primary","Secondary",..: 3 4 4 4 4 2 2 2 4 2 ...
##  $ townsize  : Factor w/ 5 levels "Under 5,000",..: 5 5 5 5 5 4 4 4 5 5 ...
##  $ settlement: Factor w/ 5 levels "Capital city",..: 1 1 1 2 2 2 2 2 1 1 ...
##  $ income    : Factor w/ 10 levels "Lower step","second step",..: 8 6 6 3 4 3 6 4 5 9 ...
##  $ region    : Factor w/ 8 levels "RU: North West Federal District",..: 2 2 2 4 4 2 2 2 2 2 ...
##  $ age       : num  21 43 27 32 46 54 25 63 36 13 ...
##  $ income1   : num  7 5 5 2 3 2 5 3 4 8 ...
##  $ eduT      : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 2 1 ...

As can be seen above, the dataset has 1,810 records and 10 variables. The outcome variable, coded as Q211, is binary, with 0 for 'no' and 1 for 'yes' answers. Most variables are categorical (factors); the two numeric ones are the respondent's age and their income level recoded to discrete numeric values.
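
The dataset comes already recoded, but, judging by the response options in the quoted question, the recode from the raw WVS item to this binary outcome would look roughly like the sketch below (the raw column name q211_raw is hypothetical):

# hypothetical recode: "Have done" -> 1, "Might do" / "Would never do" -> 0
df$Q211 <- factor(ifelse(df$q211_raw == "Have done", 1, 0), levels = c(0, 1))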

The distribution of answers for the outcome variable is the following.

tabyl(df$Q211)

As you can see, there is a minor class imbalance: more respondents reported not attending demonstrations, roughly 60% versus 40%. It is not a critical imbalance, though.
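
For the record, these shares can be double-checked with base R:

round(prop.table(table(df$Q211)) * 100, 1)   # percent of '0' and '1' answers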

Here are the summary statistics for each predictor variable, grouped by the categories of the outcome variable.

psych::describeBy(df, df$Q211)
## 
##  Descriptive statistics by group 
## group: 0
##             vars    n  mean    sd median trimmed   mad min max range  skew
## Q211*          1 1072  1.00  0.00      1    1.00  0.00   1   1     0   NaN
## intinpol*      2 1066  2.76  0.86      3    2.78  1.48   1   4     3 -0.11
## eduR*          3 1065  3.01  0.82      3    3.04  1.48   1   4     3 -0.25
## townsize*      4 1072  3.31  1.51      4    3.38  1.48   1   5     4 -0.32
## settlement*    5 1072  3.16  1.27      3    3.14  1.48   1   5     4  0.34
## income*        6 1022  4.80  1.93      5    4.78  1.48   1  10     9  0.13
## region*        7 1072  4.24  2.28      4    4.18  2.97   1   8     7  0.27
## age            8 1072 30.11 17.28     28   29.17 19.27   3  76    73  0.41
## income1        9 1022  3.80  1.93      4    3.78  1.48   0   9     9  0.13
## eduT*         10 1065  1.32  0.47      1    1.27  0.00   1   2     1  0.78
##             kurtosis   se
## Q211*            NaN 0.00
## intinpol*      -0.77 0.03
## eduR*          -0.93 0.03
## townsize*      -1.36 0.05
## settlement*    -1.15 0.04
## income*        -0.36 0.06
## region*        -1.20 0.07
## age            -0.79 0.53
## income1        -0.36 0.06
## eduT*          -1.39 0.01
## ------------------------------------------------------------ 
## group: 1
##             vars   n  mean    sd median trimmed   mad min max range  skew
## Q211*          1 738  2.00  0.00      2    2.00  0.00   2   2     0   NaN
## intinpol*      2 738  2.50  0.82      3    2.50  1.48   1   4     3 -0.03
## eduR*          3 735  3.08  0.82      3    3.12  1.48   1   4     3 -0.32
## townsize*      4 738  3.28  1.59      4    3.35  1.48   1   5     4 -0.33
## settlement*    5 738  3.01  1.40      3    3.01  1.48   1   5     4  0.31
## income*        6 725  4.73  1.94      5    4.72  1.48   1  10     9  0.07
## region*        7 738  3.87  2.21      4    3.74  2.97   1   8     7  0.52
## age            8 738 30.84 16.90     30   30.19 19.27   3  75    72  0.27
## income1        9 725  3.73  1.94      4    3.72  1.48   0   9     9  0.07
## eduT*         10 735  1.36  0.48      1    1.33  0.00   1   2     1  0.58
##             kurtosis   se
## Q211*            NaN 0.00
## intinpol*      -0.53 0.03
## eduR*          -1.00 0.03
## townsize*      -1.45 0.06
## settlement*    -1.23 0.05
## income*        -0.45 0.07
## region*        -1.03 0.08
## age            -0.83 0.62
## income1        -0.45 0.07
## eduT*          -1.67 0.02

So, okay, we see the numbers, but it is a bit hard to grasp the overall picture at a glance. Still, I can tell that the median age of respondents who attended lawful demonstrations is a bit higher (30 versus 28 years), their income is, on average, a little lower (3.73 versus 3.8 on a scale from 0 to 9), and their interest in politics is, on average, higher: the mean code is 2.5 versus 2.76 on a scale from 1 ('very interested') to 4 ('not at all interested').
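
For reference, the same group-level figures can be pulled out more compactly with a dplyr summary (a minimal sketch; as.numeric(intinpol) reproduces the 1-4 coding used by describeBy):

# group medians / means for the variables discussed above
df %>%
  group_by(Q211) %>%
  summarise(
    median_age    = median(age, na.rm = TRUE),
    mean_income   = mean(income1, na.rm = TRUE),
    mean_intinpol = mean(as.numeric(intinpol), na.rm = TRUE),
    n             = n()
  )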

The differences between these two groups of respondents (those who attended demonstrations and those who did not) seem minor. I will try to visualize the bivariate distributions of those characteristics to, perhaps, get a clearer picture.

df_hetcor <- hetcor(df)
df_cors <- round(df_hetcor$correlations, 2)
ggcorrplot(df_cors, type = "lower", ggtheme = ggplot2::theme_bw, colors =c("turquoise4", "white", "red3"))

The plot above shows the correlation mosaic, with the strongest negative correlations between pairs of variables colored turquoise and the strongest positive ones colored red. The largest coefficients in absolute value are observed between settlement and townsize, and between the income variable coded as categorical and as continuous (income and income1). Both are expected: the type of settlement (settlement) and the size of a settlement (townsize) depend on each other. The takeaway is that, when it comes to regression, there is a high risk of multicollinearity if both variables from such a pair enter the same model.

It can also be seen that the variables with the largest absolute correlations with the outcome variable (Q211) are interest in politics (intinpol), the respondent's education level (eduR), type of settlement (settlement), region, and the simplified higher-education indicator (eduT). I will consider them as candidate predictors for the modeling. But first I will visualize each of them against the outcome, bivariately.
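
To see which variables those are without reading the mosaic cell by cell, the correlations with Q211 can be sorted directly:

# absolute correlations of every variable with the outcome, largest first
sort(abs(df_cors[, "Q211"]), decreasing = TRUE)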

Now, on to the bivariate distributions. I will start with the categorical predictors. First, a barplot showing the share of answers in each category of the outcome variable by respondents' interest in politics.

ggplot(df) + 
  geom_bar(aes(x = intinpol, fill = Q211), position = "fill", col = "gray34") +
  theme_bw() +
  labs(title = "Participation in Lawful Demonstrations Versus Interest in Politics",
       subtitle = "Respondent Level",
       x = "Interest in Politics",
       y = "Share of Observations",  # position = "fill" stacks proportions, not counts
       fill = "Participated in Lawful Demonstration") +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod1"),
                    labels = c("no", "yes")) +
  theme(axis.text.x = element_text(angle = 90))

The relationship observed here is fairly clear: the more a person is interested in politics, the more likely they are to take part in lawful demonstrations.

Next is the respondent's income level, coded as ordered categories. On the graph displayed below I do not see any particular trend: income level does not appear to have any manifest relationship with participation in peaceful demonstrations. Perhaps I should not include it in a regression model.

ggplot(df) + 
  geom_bar(aes(x = income, fill = Q211), position = "fill", col = "gray34") +
  theme_bw() +
  labs(title = "Participation in Lawful Demonstrations Versus Income",
       subtitle = "Respondent Level",
       x = "Income Level",
       y = "Share of Observations",
       fill = "Participated in Lawful Demonstration") +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod1"),
                    labels = c("no", "yes")) +
  theme(axis.text.x = element_text(angle = 90))

Respondents' education level. Again, no particular visual association between the two variables: for every category of the education variable, the shares of answers in the two categories of the demonstration-attendance variable look about the same.

ggplot(df) + 
  geom_bar(aes(x = eduR, fill = Q211), position = "fill", col = "gray34") +
  theme_bw() +
  labs(title = "Participation in Lawful Demonstrations Versus Education",
       subtitle = "Respondent Level",
       x = "Education Level",
       y = "Share of Observations",
       fill = "Participated in Lawful Demonstration") +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod1"),
                    labels = c("no", "yes")) +
  theme(axis.text.x = element_text(angle = 90))

If we look at the simplified version of the respondent's education variable (one either has higher education or not), we see that among people who reported having higher education, the share of those who attended peaceful demonstrations is larger.

ggplot(df) + 
  geom_bar(aes(x = eduT, fill = Q211), position = "fill", col = "gray34") +
  theme_bw() +
  labs(title = "Participation in Lawful Demonstrations Versus Education",
       subtitle = "Respondent level, simplified",
       x = "Higher Education, 'yes'/'no'",
       y = "Share of Observations",
       fill = "Participated in Lawful Demonstration") +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod1"), 
                    labels = c("no", "yes")) 

Finally, the region variable versus the outcome variable.

ggplot(df) + 
  geom_bar(aes(x = region, fill = Q211), position = "fill", col = "gray34") +
  theme_bw() +
  labs(title = "Participation in Lawful Demonstrations Versus Region",
       subtitle = "Country Level",
       x = "Region",
       y = "Share of Observations",
       fill = "Participated in Lawful\nDemonstration") +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod1"),
                    labels = c("no", "yes")) +
  theme(axis.text.x = element_text(angle = 90))

It seems that the Central Federal District has the highest share of people who attended lawful demonstrations, while the South Federal District has the lowest.

Now, on to the numeric variables.

age = ggplot(df, aes(x = Q211, y = age, fill = Q211)) +
  geom_boxplot() +
  labs(y="Age", x = "",  
       title="Respondent's Age") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size=13)) +
  scale_fill_manual(values=c("cornflowerblue", "goldenrod1")) 

inc = ggplot(df, aes(x = Q211, y = income1, fill = Q211)) +
  geom_boxplot() +
  labs(y="Income", x = "", fill = "Hasthe respondent attended lawful demonstration?",  
       title="Respondent's Income") +
  theme_bw() +
  theme(legend.position = "none", axis.text = element_text(size=13)) +
  scale_fill_manual(values=c("cornflowerblue", "goldenrod1"))

gridExtra::grid.arrange(age, inc, ncol=2)

Again, there seems to be no particular association between the respondent's income level and attending peaceful demonstrations, even when income is taken as a numeric variable. For the respondent's age, though, I do see some difference between the distributions in the two categories of the outcome variable.

Logistic Regression

Thus, based on the EDA above, I will try different combinations of predictors for a logistic regression model, so that:

  1. only variables that showed some pattern against the outcome variable get into a model;

  2. there are not too many predictors;

  3. there is at least one numeric variable.

Then I will compare the resulting models by their AIC values and pick the best one, i.e., the one with the lowest AIC.

Choosing the best model

logmo1 <- glm(Q211 ~ intinpol + eduR + age, family = binomial, data = df)  
logmo2 <- glm(Q211 ~ intinpol + eduT + age, family = binomial, data = df)  
logmo3 <- glm(Q211 ~ intinpol + region + age, family = binomial, data = df)  
logmo4 <- glm(Q211 ~ intinpol + settlement + age, family = binomial, data = df)  


logmo7 <- glm(Q211 ~ intinpol + eduR + income1, family = binomial, data = df)  
logmo8 <- glm(Q211 ~ intinpol + eduT + income1, family = binomial, data = df)  

logmo11 <- glm(Q211 ~ intinpol + region + income1, family = binomial, data = df) 
logmo12 <- glm(Q211 ~ intinpol + settlement + income1, family = binomial, data = df) 


model <- c("Model1", "Model2", "Model3", "Model4",
           "Model7", "Model8", "Model11", "Model12")
AIC <- c(logmo1$aic, logmo2$aic, logmo3$aic, logmo4$aic, logmo7$aic, logmo8$aic, logmo11$aic, logmo12$aic)

aic_step1 <- as.data.frame(cbind(model, AIC))
kable(aic_step1) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
  row_spec(8, bold = T)
model     AIC
Model1    2395.28
Model2    2391.56
Model3    2381.68
Model4    2373.54
Model7    2320.75
Model8    2316.86
Model11   2308.06
Model12   2298.97

Okay, the last model, model12, has the lowest AIC value. I will take it for further analysis.

Assessing the model

Now, let's assess the model's quality properly, starting with the statistical significance of its predictors.

anova(logmo12, test="Chisq")

All of them are significant, with p-values below the 0.05 alpha level, although income1 is on the border.

pR2(logmo12)
## fitting null model for pseudo-r2
##            llh        llhNull             G2       McFadden           r2ML 
## -1140.48375313 -1183.41142791    85.85534957     0.03627451     0.04806377 
##           r2CU 
##     0.06470619

The McFadden pseudo R-squared is about 0.04, which is nowhere near the 0.2-0.4 range usually taken to indicate an excellent fit, so the model is not particularly good.
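
For reference, McFadden's pseudo R-squared compares the log-likelihood of the fitted model with that of the intercept-only (null) model; plugging in the llh and llhNull values from the output above:

\[ R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}} = 1 - \frac{-1140.48}{-1183.41} \approx 0.036 \]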

Model’s interpretation

Still, this is the best model I got from my chosen combinations of predictors, so I will proceed with it.

Now, the next step would be to construct the model’s equation. So here it is:

extract_eq(logmo12, wrap = T)

\[ \begin{aligned} \log\left[ \frac { P( \operatorname{Q211} = \operatorname{1} ) }{ 1 - P( \operatorname{Q211} = \operatorname{1} ) } \right] &= \alpha + \beta_{1}(\operatorname{intinpol}_{\operatorname{Somewhat\ interested}}) + \beta_{2}(\operatorname{intinpol}_{\operatorname{Not\ very\ interested}}) + \beta_{3}(\operatorname{intinpol}_{\operatorname{Not\ at\ all\ interested}})\ + \\ &\quad \beta_{4}(\operatorname{settlement}_{\operatorname{Regional\ center}}) + \beta_{5}(\operatorname{settlement}_{\operatorname{District\ center}}) + \beta_{6}(\operatorname{settlement}_{\operatorname{Another\ city,\ town\ (not\ a\ regional\ or\ district\ center)}}) + \beta_{7}(\operatorname{settlement}_{\operatorname{Village}})\ + \\ &\quad \beta_{8}(\operatorname{income1}) + \epsilon \end{aligned} \]

tab_model(logmo12)
Dependent variable: Q211

Predictors                                                            Odds Ratios   CI            p
(Intercept)                                                           3.53          2.12 – 5.95   <0.001
intinpol [Somewhat interested]                                        0.65          0.45 – 0.95   0.025
intinpol [Not very interested]                                        0.58          0.40 – 0.83   0.004
intinpol [Not at all interested]                                      0.27          0.17 – 0.41   <0.001
settlement [Regional center]                                          0.47          0.32 – 0.68   <0.001
settlement [District center]                                          0.32          0.22 – 0.47   <0.001
settlement [Another city, town (not a regional or district center)]   0.32          0.16 – 0.61   0.001
settlement [Village]                                                  0.43          0.29 – 0.63   <0.001
income1                                                               0.95          0.90 – 1.00   0.068

Observations   1743
R2 Tjur        0.048

I guess the interpretation of the odds ratios would be the following (see the sketch below for how these percentages follow from the model coefficients):

  1. intinpol: the reference category includes people who are very interested in politics. Compared to that reference:
  • if people are somewhat interested in politics, their odds of participating in peaceful demonstrations decrease by 35% (= 1 - 0.65);

  • if people are not very interested in politics, their odds of participating in peaceful demonstrations decrease by 42% (= 1 - 0.58);

  • if people are not at all interested in politics, their odds of participating in peaceful demonstrations decrease by 73% (= 1 - 0.27).

  2. settlement: the reference category includes people from the capital city. Compared to that reference:
  • if people are from a regional center, their odds of participating in peaceful demonstrations decrease by 53% (= 1 - 0.47);

  • if people are from a district center, their odds of participating in peaceful demonstrations decrease by 68% (= 1 - 0.32);

  • if people are from another city or town (not a regional or district center), their odds of participating in peaceful demonstrations decrease by 68% (= 1 - 0.32);

  • if people are from a village, their odds of participating in peaceful demonstrations decrease by 57% (= 1 - 0.43).

  3. income1: its estimate is not significant at the 0.05 level, yet we can still suggest an interpretation: when income increases by one unit, the odds of participating in peaceful demonstrations decrease by about 5%.
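
A minimal sketch of how the percentages above can be obtained directly from the fitted model (the figures quoted come from the tab_model output):

# odds ratios and the implied percent change in the odds for each term
or <- exp(coef(logmo12))
round(cbind(odds_ratio = or, pct_change = (or - 1) * 100), 2)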

Prediction

The next thing to do is to see how well this logistic regression model predicts new data. One way is to use the augment() function, which returns fitted values and residuals, including standardized ones.

augment(logmo12)

Another way is to use the predict() function on data the model has not "seen" yet, i.e., a part of the dataset set aside before the model is built.

Following this thought, I need to split my data into two parts and rebuild the model. A conventional split for such purposes is 80/20: the first part is used for training the model and the second for testing it, i.e., producing predicted values of the outcome.

#accuracy calculation
set.seed(9629)
ind = createDataPartition(df$Q211, p = 0.20, list = F) #test 20%, train 80%
df.test = df[ind,] 
df.train = df[-ind,]

logmo_train <- glm(Q211 ~ intinpol + settlement + income1, family = binomial,na.action = na.omit, data = df.train) 

df.test$predicted = predict(logmo_train, df.test, type="response")
df.test$pred0.5 <- as.factor(ifelse(df.test$predicted >= 0.5,"1","0"))
confusionMatrix(df.test$pred0.5, df.test$Q211)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 188 122
##          1  18  24
##                                              
##                Accuracy : 0.6023             
##                  95% CI : (0.549, 0.6538)    
##     No Information Rate : 0.5852             
##     P-Value [Acc > NIR] : 0.2766             
##                                              
##                   Kappa : 0.0859             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.9126             
##             Specificity : 0.1644             
##          Pos Pred Value : 0.6065             
##          Neg Pred Value : 0.5714             
##              Prevalence : 0.5852             
##          Detection Rate : 0.5341             
##    Detection Prevalence : 0.8807             
##       Balanced Accuracy : 0.5385             
##                                              
##        'Positive' Class : 0                  
## 

The output above is a confusion matrix showing which values of the outcome were "guessed" correctly and incorrectly on the test set, along with accuracy and Kappa values. The model's accuracy is relatively low: it predicts the true value correctly in only about 60% of cases, barely above the no-information rate. Judging by the confusion matrix, I can also say the model performs quite badly when it comes to predicting the ones, i.e., cases where a respondent attended lawful demonstrations (specificity of only 0.16, with "0" as the positive class).
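
As a cross-check, the headline figures follow directly from the four cells of the matrix above (recall that caret treats "0" as the positive class here):

tp <- 188; fn <- 18    # actual 0: predicted 0 / predicted 1
fp <- 122; tn <- 24    # actual 1: predicted 0 / predicted 1

(tp + tn) / (tp + fn + fp + tn)   # accuracy            = 212/352 ~ 0.602
tp / (tp + fn)                    # sensitivity         = 188/206 ~ 0.913
tn / (tn + fp)                    # specificity         =  24/146 ~ 0.164
(tp + fn) / (tp + fn + fp + tn)   # no information rate = 206/352 ~ 0.585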

Model’s diagnostics

Wrapping up the model analysis with diagnostics. Let's check whether the linearity assumption holds.

residualPlots(logmo12)

##            Test stat Pr(>|Test stat|)
## intinpol                             
## settlement                           
## income1       0.5141           0.4734

The curvature test above is not statistically significant for income1 (the only numeric predictor), and the residual plots also look fine. So the linearity assumption holds for this model.

Now let me check for multicollinearity.

vif(logmo12) # everything is ok
##                GVIF Df GVIF^(1/(2*Df))
## intinpol   1.041525  3        1.006804
## settlement 1.033021  4        1.004069
## income1    1.043325  1        1.021433

The VIF values (GVIF) for all predictors are below 2, which means no multicollinearity is observed.

influenceIndexPlot(logmo12)

Finally, outliers. The plots above show that there are some outliers present in the data, yet none of them appear to be influential, as the Cook's distance values are negligible.
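
A quick numeric check of the same point, using the common (but arbitrary) 4/n rule of thumb for flagging potentially influential observations:

cd <- cooks.distance(logmo12)   # Cook's distances for the fitted model
summary(cd)
sum(cd > 4 / length(cd))        # observations above the 4/n cutoff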

So we are good, no violations found.

Data Imputation

Now, to the most special part. The data used so far contains missing values (NAs). It would be interesting to know whether the model gets better when those values are either removed or imputed artificially.

Visualizing missing data

Let's see what this "missingness" looks like.

missmap(df, col=c("blue", "red"), legend = T, main = "Missing values vs observed")

It seems that only about 1% of the values in the data are NAs.

df %>% aggr(combined = TRUE, numbers = TRUE)

From the plot above we can see that those NAs sit in the interest-in-politics, education, and income variables.
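
The same information in numbers, as the share of missing values per variable:

round(colMeans(is.na(df)) * 100, 1)   # percent of NAs in each column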

Trying several imputation methods

There are several ways to handle missing data. One is simply to drop the incomplete rows via listwise deletion. Others replace missing values with artificially generated ones.

I will try listwise deletion and two imputation algorithms: k-nearest neighbours (kNN) and MICE.

df_omit <- na.omit(df) # listwise deletion

dfImp_knn <- knnImputation(df) #k-nearest neighbours imputation

df_mc <- df %>% select(-income)
mice_data <- mice::mice(df_mc, m = 5, method = 'cart', seed = 500) #mice imputation
## 
##  iter imp variable
##   1   1  intinpol  eduR  income1  eduT
##   1   2  intinpol  eduR  income1  eduT
##   1   3  intinpol  eduR  income1  eduT
##   1   4  intinpol  eduR  income1  eduT
##   1   5  intinpol  eduR  income1  eduT
##   2   1  intinpol  eduR  income1  eduT
##   2   2  intinpol  eduR  income1  eduT
##   2   3  intinpol  eduR  income1  eduT
##   2   4  intinpol  eduR  income1  eduT
##   2   5  intinpol  eduR  income1  eduT
##   3   1  intinpol  eduR  income1  eduT
##   3   2  intinpol  eduR  income1  eduT
##   3   3  intinpol  eduR  income1  eduT
##   3   4  intinpol  eduR  income1  eduT
##   3   5  intinpol  eduR  income1  eduT
##   4   1  intinpol  eduR  income1  eduT
##   4   2  intinpol  eduR  income1  eduT
##   4   3  intinpol  eduR  income1  eduT
##   4   4  intinpol  eduR  income1  eduT
##   4   5  intinpol  eduR  income1  eduT
##   5   1  intinpol  eduR  income1  eduT
##   5   2  intinpol  eduR  income1  eduT
##   5   3  intinpol  eduR  income1  eduT
##   5   4  intinpol  eduR  income1  eduT
##   5   5  intinpol  eduR  income1  eduT
dfImp_mice <- mice::complete(mice_data, 1)


anyNA(df_omit)
## [1] FALSE
anyNA(dfImp_knn)
## [1] FALSE
anyNA(dfImp_mice)
## [1] FALSE

Looks like all three managed to handle NAs.
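
As a side note, instead of carrying only the first completed dataset forward, mice can fit the model on all five imputations and pool the estimates via Rubin's rules; a minimal sketch using the same predictors as in logmo12:

# fit the model on each of the m = 5 completed datasets and pool the results
fit_all <- with(mice_data, glm(Q211 ~ intinpol + settlement + income1, family = binomial))
summary(mice::pool(fit_all))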

Comparing models’ coefficients

The next step is to compare the model built on the initial dataset with the models built on the datasets without NAs.

logmo_naomit <- glm(Q211 ~ intinpol + settlement + income1, family = binomial, data = df_omit) 

logmo_knn <- glm(Q211 ~ intinpol + settlement + income1, family = binomial, data = dfImp_knn) 

logmo_mice <- glm(Q211 ~ intinpol + settlement + income1, family = binomial, data = dfImp_mice) 

model_ <- c("Listwise Deletion Model", "kNN Imputed Model", "MICE Imputed Model")
AIC_ <- c(logmo_naomit$aic, logmo_knn$aic, logmo_mice$aic)

aic_step2 <- as.data.frame(cbind(model_, AIC_))
kable(aic_step2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
model_                    AIC_
Listwise Deletion Model   2290.29
kNN Imputed Model         2376.64
MICE Imputed Model        2376.35

Well, first of all, judging by the AIC values, the model built on the listwise-deleted data seems to work better than the models built on the two imputed datasets. Secondly, it is also better than the model built on the original data, whose AIC was about 2299.

Let’s turn to the odds ratio as the point for comparison.

tab_model(logmo12, logmo_naomit, logmo_knn, logmo_mice)
Dependent variable: Q211. Models, left to right: original data (logmo12), listwise deletion, kNN imputation, MICE imputation.

                                   Original data            Listwise deletion        kNN imputed              MICE imputed
Predictors                         OR   CI           p      OR   CI           p      OR   CI           p      OR   CI           p
(Intercept)                        3.53 2.12 – 5.95  <0.001 3.51 2.10 – 5.91  <0.001 3.32 2.00 – 5.56  <0.001 3.34 2.02 – 5.58  <0.001
intinpol [Somewhat interested]     0.65 0.45 – 0.95  0.025  0.66 0.45 – 0.95  0.028  0.72 0.50 – 1.03  0.075  0.72 0.50 – 1.03  0.074
intinpol [Not very interested]     0.58 0.40 – 0.83  0.004  0.58 0.40 – 0.83  0.003  0.62 0.43 – 0.89  0.010  0.62 0.43 – 0.89  0.009
intinpol [Not at all interested]   0.27 0.17 – 0.41  <0.001 0.27 0.17 – 0.41  <0.001 0.28 0.18 – 0.43  <0.001 0.28 0.18 – 0.42  <0.001
settlement [Regional center]       0.47 0.32 – 0.68  <0.001 0.47 0.32 – 0.69  <0.001 0.45 0.31 – 0.66  <0.001 0.45 0.31 – 0.66  <0.001
settlement [District center]       0.32 0.22 – 0.47  <0.001 0.32 0.22 – 0.47  <0.001 0.32 0.22 – 0.46  <0.001 0.32 0.22 – 0.46  <0.001
settlement [Another city/town]     0.32 0.16 – 0.61  0.001  0.32 0.16 – 0.61  0.001  0.31 0.16 – 0.59  <0.001 0.31 0.16 – 0.59  <0.001
settlement [Village]               0.43 0.29 – 0.63  <0.001 0.43 0.29 – 0.63  <0.001 0.42 0.28 – 0.61  <0.001 0.42 0.28 – 0.61  <0.001
income1                            0.95 0.90 – 1.00  0.068  0.95 0.91 – 1.00  0.075  0.95 0.90 – 1.00  0.042  0.95 0.90 – 1.00  0.037

Observations                       1743                     1736                     1810                     1810
R2 Tjur                            0.048                    0.048                    0.048                    0.048

('Another city/town' stands for 'Another city, town (not a regional or district center)'.)

So, in the table above, the original-data model goes first, followed by the listwise-deletion model, the kNN-imputed model, and the MICE-imputed model, respectively.

The odds ratios are about the same across all of the models. Interestingly, though, in the last two models (kNN and MICE) the income1 variable is significant, which means the interpretation of its odds ratio now makes sense. On the other hand, one of the categories of the interest-in-politics variable ('Somewhat interested') became insignificant in the models built on kNN- and MICE-imputed data.

The Tjur R-squared values are the same across the models, which makes it hard to tell from this value alone whether any particular way of handling NAs is better. It is somewhat possible with AIC, though, as I mentioned a bit earlier.

That's basically it. I hope it was interesting. Thanks for joining me in this little adventure.