About Project and Data

The key purpose of this project is to investigate the probability of participation in peaceful demonstrations.

To do that, a binary logistic regression will be fitted to data from the WVS 2017 survey conducted in Russia.

The outcome variable is whether a person could attend peaceful demonstrations (1 - ‘have done or might do’) or not (0 - ‘would never do’).
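For reference, a binary logistic regression models the log-odds of the outcome as a linear function of the predictors (the fitted version of this equation is derived later via extract_eq):

\[ \log\left[ \frac{ P(Y = 1) }{ 1 - P(Y = 1) } \right] = \alpha + \beta_{1} x_{1} + \dots + \beta_{k} x_{k} \]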

# remove scientific notation
options(scipen=999)

#our universe
library(tidyverse)
# modeling
library(caret)
library(car)
library(equatiomatic)
library(broom)
library(jtools)
library(ggstance)
# plotting data
library(DataExplorer)
library(ggpubr)
library(scales)
library(RColorBrewer)
# tables & statistics
library(knitr)
library(sjPlot)
library(summarytools)
library(psych)
library(lmtest)
library(MASS)
library(descr)
library(pROC)
library(polycor)
library(corrgram)
library(survey)
library(margins)
# data imputation
library(VIM)
library(mice)

# helper: statistical mode (most frequent value) of a vector
mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}


data <- readRDS("D:/4th Course/Advanced Data Analysis/ruwvs")

Descriptive Statistics

Let us first explore the general patterns in the data and see whether any improvements are needed.

1. What types of variables do we have?

sapply(data, class)
##       Q211   intinpol       eduR   townsize settlement     income     region 
##   "factor"   "factor"   "factor"   "factor"   "factor"   "factor"   "factor" 
##        age    income1       eduT 
##   "factor"  "numeric"   "factor"

There are two variables related to income, and likewise two related to education level. To reduce redundancy, let’s keep income (a factor, not continuous) and eduT (a binary indicator of higher education) instead of income1 (continuous) and eduR (a factor with 4 levels).

Currently, the age variable is a factor with 85 levels, so it is better to convert it to a numeric type.

Also, let’s rename the dependent variable to visit_dem (just for convenience, so as not to forget the meaning of our Y).

data$visit_dem <- data$Q211
data <- data %>% dplyr::select(-income1, -eduR, -Q211)
data$age <- as.numeric(as.character(data$age))

2. What are the distributions of the numerical variables?

As we know, there is only one numeric variable in the dataset. Let’s plot its distribution.

ggplot() +
  geom_histogram(data = data, aes(x = age), fill = "#ff8a00", color = "#e56a26", alpha = 0.5) +
  theme_minimal() +
  labs(title = "Age Variable Distribution")

describe(data$age)

The distribution is slightly right-skewed. The minimum age is 19 and the maximum is 91. The mean is very close to the median.

3. What are the distributions of the dependent variable per category?

Let’s visualize how the dependent variable is distributed across the categories of each factor variable.

data_omit <- data %>% na.omit()

# plot 1 - intinpol
databar <- data %>% dplyr::select(intinpol, visit_dem) %>%
  na.omit() %>% group_by(intinpol) %>%
  arrange(intinpol) %>%
  count(visit_dem) %>%
  mutate(percent = n/sum(n))

databar$visit_dem <- ifelse(databar$visit_dem == "0", "No", "Yes")

pl <- ggplot(data = databar, mapping = aes(x = intinpol, y = percent, fill = visit_dem)) +
  geom_col() +
  geom_text(mapping = aes(label = percent(percent)),
            size = 3,
            position = position_stack(vjust = 0.5)) +
  labs(x = " ",
       y = " ",
       title = "being interested in politics",
       fill = "by attendance") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("#B8A9C9", "#D6D4E0")) + coord_flip() + theme_classic()

# plot 2 - townsize
databar1 <- data %>% dplyr::select(townsize, visit_dem) %>%
  na.omit() %>% group_by(townsize) %>%
  arrange(townsize) %>%
  count(visit_dem) %>%
  mutate(percent = n/sum(n))

databar1$visit_dem <- ifelse(databar1$visit_dem == "0", "No", "Yes")

pl1 <- ggplot(data = databar1, mapping = aes(x = townsize, y = percent, fill = visit_dem)) +
  geom_col() +
  geom_text(mapping = aes(label = percent(percent)),
            size = 3,
            position = position_stack(vjust = 0.5)) +
  labs(x = " ",
       y = " ",
       title = "sizes of town",
       fill = "by attendance") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("#B8A9C9", "#D6D4E0")) + coord_flip() + theme_classic()

# plot 3 - settlement
databar2 <- data %>% dplyr::select(settlement, visit_dem) %>%
  na.omit() %>% group_by(settlement) %>%
  arrange(settlement) %>%
  count(visit_dem) %>%
  mutate(percent = n/sum(n))

databar2$visit_dem <- ifelse(databar2$visit_dem == "0", "No", "Yes")

pl2 <- ggplot(data = databar2, mapping = aes(x = settlement, y = percent, fill = visit_dem)) +
  geom_col() +
  geom_text(mapping = aes(label = percent(percent)),
            size = 3,
            position = position_stack(vjust = 0.5)) +
  labs(x = " ",
       y = " ",
       title = "type of settlement",
       fill = "by attendance") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("#B8A9C9", "#D6D4E0")) + coord_flip() + theme_classic()

# plot 4 - income
databar3 <- data %>% dplyr::select(income, visit_dem) %>%
  na.omit() %>% group_by(income) %>%
  arrange(income) %>%
  count(visit_dem) %>%
  mutate(percent = n/sum(n))

databar3$visit_dem <- ifelse(databar3$visit_dem == "0", "No", "Yes")

pl3 <- ggplot(data = databar3, mapping = aes(x = income, y = percent, fill = visit_dem)) +
  geom_col() +
  geom_text(mapping = aes(label = percent(percent)),
            size = 3,
            position = position_stack(vjust = 0.5)) +
  labs(x = " ",
       y = " ",
       title = "income steps",
       fill = "by attendance") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("#B8A9C9", "#D6D4E0")) + coord_flip() + theme_classic()

# plot 5 - region
databar4 <- data %>% dplyr::select(region, visit_dem) %>%
  na.omit() %>% group_by(region) %>%
  arrange(region) %>%
  count(visit_dem) %>%
  mutate(percent = n/sum(n))

databar4$visit_dem <- ifelse(databar4$visit_dem == "0", "No", "Yes")

pl4 <- ggplot(data = databar4, mapping = aes(x = region, y = percent, fill = visit_dem)) +
  geom_col() +
  geom_text(mapping = aes(label = percent(percent)),
            size = 3,
            position = position_stack(vjust = 0.5)) +
  labs(x = " ",
       y = " ",
       title = "regions",
       fill = "by attendance") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("#B8A9C9", "#D6D4E0")) + coord_flip() + theme_classic()

# plot 6 - eduT
databar5 <- data %>% dplyr::select(eduT, visit_dem) %>%
  na.omit() %>% group_by(eduT) %>%
  arrange(eduT) %>%
  count(visit_dem) %>%
  mutate(percent = n/sum(n))

databar5$visit_dem <- ifelse(databar5$visit_dem == "0", "No", "Yes")
databar5$eduT <- ifelse(databar5$eduT == "0", "No", "Yes")

pl5 <- ggplot(data = databar5, mapping = aes(x = eduT, y = percent, fill = visit_dem)) +
  geom_col() +
  geom_text(mapping = aes(label = percent(percent)),
            size = 3,
            position = position_stack(vjust = 0.5)) +
  labs(x = " ",
       y = " ",
       title = "having tertiary education",
       fill = "by attendance") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_manual(values = c("#B8A9C9", "#D6D4E0")) + coord_flip() + theme_classic()

ggarrange(pl, pl1,
          ncol = 2, nrow = 1, common.legend = TRUE, legend = "bottom")

pl2

ggarrange(pl5,
          ncol = 1, nrow = 1, common.legend = TRUE, legend = "bottom")

pl4

ggarrange(pl3, 
          ncol = 1, nrow = 1, common.legend = TRUE, legend = "bottom")

rm(pl, pl1, pl2, pl3, pl4, pl5,
   databar, databar1, databar2, databar3, databar4, databar5)

These plots are highly informative from a descriptive perspective, as they reveal a variety of patterns in the data. The general observation is that the share of respondents who attend (or might attend) peaceful demonstrations varies considerably across the categories of each variable.

ggplot(data = data_omit, aes(x = age, y = visit_dem, fill = visit_dem)) +
  geom_violin(trim = FALSE) + 
  labs(title = "Distributions of Attendance per Numeric Variable of Age", 
       x = "Age", y = "Attending peaceful demonstrations")+
  geom_boxplot(width=0.1) +
  scale_fill_manual(values=c("#B8A9C9", "#D6D4E0")) + theme_minimal() + coord_flip() +
  theme(plot.title = element_text(hjust = 0.5))

Looking at the shapes of the distributions, the share of older people (roughly 50-75) appears visually larger among those who attend demonstrations than among those who do not. At the same time, the median age is approximately the same in both categories of the attendance variable.

4. How many missing values do we have?

plot_missing(data)

md.pattern(data)

##      townsize settlement region age visit_dem intinpol eduT income   
## 1736        1          1      1   1         1        1    1      1  0
## 59          1          1      1   1         1        1    1      0  1
## 7           1          1      1   1         1        1    0      1  1
## 2           1          1      1   1         1        1    0      0  2
## 4           1          1      1   1         1        0    1      1  1
## 1           1          1      1   1         1        0    1      0  2
## 1           1          1      1   1         1        0    0      0  3
##             0          0      0   0         0        6   10     63 79

In our case, we have three variables with missing values: intinpol, eduT, and income (all factors). This means that a distance-based imputation such as kNN must use a dissimilarity measure suitable for categorical data (VIM’s kNN relies on an extension of Gower’s distance for mixed-type variables).
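As a minimal sketch of the idea (a hypothetical helper for illustration, not the actual VIM internals), a Gower-type dissimilarity between two rows averages per-variable contributions: a 0/1 mismatch for factors and a range-scaled absolute difference for numerics.

gower_dist <- function(df, i, j) {
  # per-variable contributions: factors -> 0/1 mismatch,
  # numerics -> |x - y| scaled by the variable's range; NAs are skipped
  d <- sapply(df, function(col) {
    x <- col[i]; y <- col[j]
    if (is.na(x) || is.na(y)) return(NA_real_)
    if (is.numeric(col)) abs(x - y) / diff(range(col, na.rm = TRUE))
    else as.numeric(x != y)
  })
  mean(d, na.rm = TRUE)
}

gower_dist(data, 1, 2)  # dissimilarity between respondents 1 and 2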

According to the first plot, all variables, even those with missing values, are flagged as “Good”. Hypothetically, this means it would not be critical to simply drop the missing values and shrink the dataset accordingly.

The second output shows the counts numerically: for example, the income variable has 63 missing observations. Overall, about 4.36% of the observations are affected by missingness.

Summing up, the ways of dealing with missing data (kNN, MICE, and na.omit) will be compared after the logistic regression modeling.

Binary Logistic Regression Modeling

There are 5 predictors in the first model. According to the ANOVA chi-square test, the eduT and age variables are insignificant; these variables hardly play a role in predicting Y.

model1 <- glm(visit_dem ~ intinpol + townsize + settlement + eduT + age, data = data_omit, family = "binomial")

anova(model1, test = "Chisq")

Let’s look at the details. As for model fit, Pseudo-R² (McFadden) is 0.06, Pseudo-R² (Cragg-Uhler) is 0.10, AIC is 2254.75, and BIC is 2331.18. The intercept is 3.33, and the coefficient values vary widely. As stated before, eduT1 (p-value = 0.13) and age (p-value = 0.91) are not significant.

summ(model1, vifs = T)
Observations 1736
Dependent variable visit_dem
Type Generalized linear model
Family binomial
Link logit
χ²(13) 131.18
Pseudo-R² (Cragg-Uhler) 0.10
Pseudo-R² (McFadden) 0.06
AIC 2254.75
BIC 2331.18
Est. S.E. z val. p VIF
(Intercept) 3.33 0.51 6.56 0.00 NA
intinpolSomewhat interested -0.40 0.19 -2.09 0.04 1.05
intinpolNot very interested -0.53 0.19 -2.79 0.01 1.05
intinpolNot at all interested -1.30 0.22 -5.83 0.00 1.05
townsize5000-20000 -1.20 0.27 -4.43 0.00 39.74
townsize20000-100000 -1.09 0.35 -3.10 0.00 39.74
townsize100000-500000 -1.67 0.38 -4.44 0.00 39.74
townsize500000 and more -2.40 0.42 -5.76 0.00 39.74
settlementRegional center -0.89 0.20 -4.43 0.00 39.97
settlementDistrict center -2.14 0.29 -7.29 0.00 39.97
settlementAnother city, town (not a regional or district center) -2.33 0.48 -4.81 0.00 39.97
settlementVillage -2.94 0.44 -6.72 0.00 39.97
eduT1 0.17 0.11 1.52 0.13 1.08
age 0.00 0.00 0.12 0.91 1.07
Standard errors: MLE

In addition, the rather high VIF values point to possible multicollinearity between townsize and settlement (we will come back to this a bit later).

Now let’s create another model that includes all variables from the dataset to predict the probability of attending demonstrations.

model2 <- glm(visit_dem ~ ., data = data_omit, family = "binomial")
anova(model2, test = "Chisq")

In this case, the significance of eduT is somewhat higher than in the previous model. age and the two new variables, income and region, turn out to be insignificant: their roles in the prediction are minor.

summ(model2, vifs = T)
Observations 1736
Dependent variable visit_dem
Type Generalized linear model
Family binomial
Link logit
χ²(29) 152.54
Pseudo-R² (Cragg-Uhler) 0.11
Pseudo-R² (McFadden) 0.06
AIC 2265.39
BIC 2429.17
Est. S.E. z val. p VIF
(Intercept) 3.26 0.62 5.25 0.00 NA
intinpolSomewhat interested -0.40 0.20 -2.04 0.04 1.15
intinpolNot very interested -0.57 0.20 -2.90 0.00 1.15
intinpolNot at all interested -1.38 0.23 -6.04 0.00 1.15
townsize5000-20000 -0.98 0.29 -3.37 0.00 53.07
townsize20000-100000 -0.84 0.38 -2.21 0.03 53.07
townsize100000-500000 -1.41 0.41 -3.43 0.00 53.07
townsize500000 and more -2.17 0.46 -4.74 0.00 53.07
settlementRegional center -0.79 0.24 -3.29 0.00 75.62
settlementDistrict center -2.07 0.33 -6.26 0.00 75.62
settlementAnother city, town (not a regional or district center) -2.27 0.53 -4.30 0.00 75.62
settlementVillage -2.66 0.51 -5.19 0.00 75.62
incomesecond step -0.02 0.30 -0.08 0.94 1.52
incomeThird step -0.37 0.27 -1.35 0.18 1.52
incomeFourth step -0.43 0.27 -1.59 0.11 1.52
incomeFifth step -0.25 0.26 -0.95 0.34 1.52
incomeSixth step -0.52 0.28 -1.89 0.06 1.52
incomeSeventh step -0.28 0.29 -0.97 0.33 1.52
incomeEight step -0.43 0.33 -1.30 0.19 1.52
incomeNineth step -1.01 0.51 -1.99 0.05 1.52
incomeTenth step -0.38 0.61 -0.63 0.53 1.52
regionRU: Central Federal District 0.29 0.21 1.41 0.16 2.45
regionRU: North Caucasian federal district 0.05 0.29 0.19 0.85 2.45
regionRU: Volga; Privolzhsky Federal District 0.32 0.21 1.56 0.12 2.45
regionRU: Urals Federal District -0.13 0.25 -0.52 0.60 2.45
regionRU: Far East Federal District -0.05 0.31 -0.15 0.88 2.45
regionRU: Siberian Federal District 0.36 0.22 1.61 0.11 2.45
regionRU: South Federal District -0.07 0.26 -0.28 0.78 2.45
age -0.00 0.00 -0.42 0.68 1.17
eduT1 0.21 0.12 1.78 0.08 1.17
Standard errors: MLE

The model yields more coefficients because the two “new” factor variables add 18 levels in total (16 non-reference coefficients). Although Pseudo-R² (McFadden) is the same as in the first model, Pseudo-R² (Cragg-Uhler) is higher by 0.01. The AIC of this model is 2265.39 (previously 2254.75).

Let’s create a third model. It again includes all variables as predictors (except the dependent variable, of course), but this time we exclude income due to its insignificance and rather large number of levels.

model3 <- glm(visit_dem ~ . -income, data = data_omit, family = "binomial")
anova(model3, test = "Chisq")

Still, only three variables play a role in the prediction: intinpol, townsize, and settlement.

summ(model3, vifs = T)
Observations 1736
Dependent variable visit_dem
Type Generalized linear model
Family binomial
Link logit
χ²(20) 142.22
Pseudo-R² (Cragg-Uhler) 0.11
Pseudo-R² (McFadden) 0.06
AIC 2257.71
BIC 2372.36
Est. S.E. z val. p VIF
(Intercept) 2.85 0.57 5.01 0.00 NA
intinpolSomewhat interested -0.43 0.20 -2.19 0.03 1.10
intinpolNot very interested -0.59 0.19 -3.02 0.00 1.10
intinpolNot at all interested -1.36 0.23 -6.01 0.00 1.10
townsize5000-20000 -0.96 0.29 -3.32 0.00 51.31
townsize20000-100000 -0.77 0.38 -2.05 0.04 51.31
townsize100000-500000 -1.36 0.41 -3.33 0.00 51.31
townsize500000 and more -2.12 0.45 -4.66 0.00 51.31
settlementRegional center -0.77 0.24 -3.24 0.00 72.27
settlementDistrict center -2.05 0.33 -6.25 0.00 72.27
settlementAnother city, town (not a regional or district center) -2.23 0.53 -4.25 0.00 72.27
settlementVillage -2.58 0.51 -5.08 0.00 72.27
regionRU: Central Federal District 0.25 0.20 1.20 0.23 2.19
regionRU: North Caucasian federal district 0.01 0.29 0.03 0.98 2.19
regionRU: Volga; Privolzhsky Federal District 0.26 0.20 1.28 0.20 2.19
regionRU: Urals Federal District -0.17 0.24 -0.70 0.49 2.19
regionRU: Far East Federal District -0.06 0.30 -0.19 0.85 2.19
regionRU: Siberian Federal District 0.34 0.22 1.57 0.12 2.19
regionRU: South Federal District -0.16 0.26 -0.61 0.54 2.19
age 0.00 0.00 0.05 0.96 1.08
eduT1 0.17 0.11 1.50 0.13 1.10
Standard errors: MLE

The Pseudo-R² (Cragg-Uhler) and Pseudo-R² (McFadden) values are identical to those of the second model. We also observe lower AIC and BIC, equal to 2257.71 and 2372.36 respectively.

Choosing the Best Model

So far we have created three models. To choose the best among them, a likelihood ratio test is performed.

lrtest(model1, model2, model3) 

Based on the results, the best model is the second one, as it has the highest log-likelihood, -1102.7. (Frankly, these values suggest that none of the models is particularly strong, but we will carry on.)
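As a quick sanity check (a sketch reproducing lrtest by hand, not part of the original output), the LR statistic for two nested models is twice the difference in their log-likelihoods, referred to a χ² distribution with degrees of freedom equal to the difference in parameter counts:

# likelihood-ratio test of model1 (restricted) vs model2 (full) "by hand"
lr_stat <- as.numeric(2 * (logLik(model2) - logLik(model1)))
lr_df   <- attr(logLik(model2), "df") - attr(logLik(model1), "df")
pchisq(lr_stat, df = lr_df, lower.tail = FALSE)  # p-value of the LR test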

At this point, the best model has been chosen based on the likelihood ratio test. However, we previously found several variables that do not contribute to the prediction. Let’s therefore perform stepwise model selection by AIC. We may not gain any improvement in fit, but the algorithm can simplify the model without affecting its performance much.

Stepwise Model Selection and Evaluation

glm.new <- stepAIC(model2, trace = 0)
summ(glm.new, vifs = TRUE)
Observations 1736
Dependent variable visit_dem
Type Generalized linear model
Family binomial
Link logit
χ²(12) 131.17
Pseudo-R² (Cragg-Uhler) 0.10
Pseudo-R² (McFadden) 0.06
AIC 2252.76
BIC 2323.74
Est. S.E. z val. p VIF
(Intercept) 3.35 0.48 6.99 0.00 NA
intinpolSomewhat interested -0.40 0.19 -2.10 0.04 1.04
intinpolNot very interested -0.54 0.19 -2.81 0.00 1.04
intinpolNot at all interested -1.30 0.22 -5.88 0.00 1.04
townsize5000-20000 -1.20 0.27 -4.44 0.00 39.64
townsize20000-100000 -1.09 0.35 -3.10 0.00 39.64
townsize100000-500000 -1.67 0.38 -4.44 0.00 39.64
townsize500000 and more -2.40 0.42 -5.76 0.00 39.64
settlementRegional center -0.89 0.20 -4.44 0.00 39.68
settlementDistrict center -2.14 0.29 -7.29 0.00 39.68
settlementAnother city, town (not a regional or district center) -2.33 0.48 -4.82 0.00 39.68
settlementVillage -2.94 0.44 -6.72 0.00 39.68
eduT1 0.17 0.11 1.52 0.13 1.05
Standard errors: MLE

The algorithm removed income, age, and region from the model. Moreover, AIC and BIC have decreased, and Pseudo-R² (Cragg-Uhler) has dropped slightly.

Model Fit of model2:

  • χ²(29) = 152.54, p = 0.00
  • Pseudo-R² (Cragg-Uhler) = 0.11
  • Pseudo-R² (McFadden) = 0.06
  • AIC = 2265.39, BIC = 2429.17

Model Fit of glm.new:

  • χ²(12) = 131.17, p = 0.00
  • Pseudo-R² (Cragg-Uhler) = 0.10
  • Pseudo-R² (McFadden) = 0.06
  • AIC = 2252.76, BIC = 2323.74

Going further, let’s evaluate the model:

LogRegR2(glm.new)
## Chi2                 131.1695 
## Df                   12 
## Sig.                 0 
## Cox and Snell Index  0.07277447 
## Nagelkerke Index     0.0979611 
## McFadden's R2        0.05562897

McFadden’s R², at 0.06, shows that the model captures little of the variation in the outcome. The Cox and Snell index, as well as the Nagelkerke index (the Nagelkerke pseudo-R² is close to 0.1), is somewhat higher.
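For reference, these indices can be reproduced from the log-likelihoods of the fitted and intercept-only models (a minimal sketch of the standard formulas; update(glm.new, . ~ 1) refits the null model):

# pseudo-R² indices from the fitted vs. null log-likelihoods
ll_full <- as.numeric(logLik(glm.new))
ll_null <- as.numeric(logLik(update(glm.new, . ~ 1)))
n <- nobs(glm.new)

mcfadden   <- 1 - ll_full / ll_null
cox_snell  <- 1 - exp(2 * (ll_null - ll_full) / n)
nagelkerke <- cox_snell / (1 - exp(2 * ll_null / n))
c(McFadden = mcfadden, CoxSnell = cox_snell, Nagelkerke = nagelkerke)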

Despite these results, we proceed with the model anyway and try to make predictions.

Predictions

To test how well the model works, let’s create a small new data set with several hypothetical combinations of predictor values. As we have only factor variables, frequency tables are computed first.

summarytools::freq(data_omit)
## Frequencies  
## data_omit$intinpol  
## Type: Factor  
## 
##                               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------------- ------ --------- -------------- --------- --------------
##             Very interested    142      8.18           8.18      8.18           8.18
##         Somewhat interested    618     35.60          43.78     35.60          43.78
##         Not very interested    684     39.40          83.18     39.40          83.18
##       Not at all interested    292     16.82         100.00     16.82         100.00
##                        <NA>      0                               0.00         100.00
##                       Total   1736    100.00         100.00    100.00         100.00
## 
## data_omit$townsize  
## Type: Factor  
## 
##                         Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------- ------ --------- -------------- --------- --------------
##           Under 5,000    372     21.43          21.43     21.43          21.43
##            5000-20000    194     11.18          32.60     11.18          32.60
##          20000-100000    273     15.73          48.33     15.73          48.33
##         100000-500000    329     18.95          67.28     18.95          67.28
##       500000 and more    568     32.72         100.00     32.72         100.00
##                  <NA>      0                               0.00         100.00
##                 Total   1736    100.00         100.00    100.00         100.00
## 
## data_omit$settlement  
## Type: Factor  
## 
##                                                                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------------------------------------------------ ------ --------- -------------- --------- --------------
##                                                 Capital city    148      8.53           8.53      8.53           8.53
##                                              Regional center    538     30.99          39.52     30.99          39.52
##                                              District center    527     30.36          69.87     30.36          69.87
##       Another city, town (not a regional or district center)     54      3.11          72.98      3.11          72.98
##                                                      Village    469     27.02         100.00     27.02         100.00
##                                                         <NA>      0                               0.00         100.00
##                                                        Total   1736    100.00         100.00    100.00         100.00
## 
## data_omit$income  
## Type: Factor  
## 
##                      Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------ ------ --------- -------------- --------- --------------
##         Lower step     82      4.72           4.72      4.72           4.72
##        second step    136      7.83          12.56      7.83          12.56
##         Third step    261     15.03          27.59     15.03          27.59
##        Fourth step    272     15.67          43.26     15.67          43.26
##         Fifth step    382     22.00          65.26     22.00          65.26
##         Sixth step    276     15.90          81.16     15.90          81.16
##       Seventh step    182     10.48          91.65     10.48          91.65
##        Eight  step    103      5.93          97.58      5.93          97.58
##        Nineth step     27      1.56          99.14      1.56          99.14
##         Tenth step     15      0.86         100.00      0.86         100.00
##               <NA>      0                               0.00         100.00
##              Total   1736    100.00         100.00    100.00         100.00
## 
## data_omit$region  
## Type: Factor  
## 
##                                                 Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------------------------------------- ------ --------- -------------- --------- --------------
##               RU: North West Federal District    168      9.68           9.68      9.68           9.68
##                  RU: Central Federal District    477     27.48          37.15     27.48          37.15
##          RU: North Caucasian federal district     91      5.24          42.40      5.24          42.40
##       RU: Volga; Privolzhsky Federal District    362     20.85          63.25     20.85          63.25
##                    RU: Urals Federal District    154      8.87          72.12      8.87          72.12
##                 RU: Far East Federal District     79      4.55          76.67      4.55          76.67
##                 RU: Siberian Federal District    233     13.42          90.09     13.42          90.09
##                    RU: South Federal District    172      9.91         100.00      9.91         100.00
##                                          <NA>      0                               0.00         100.00
##                                         Total   1736    100.00         100.00    100.00         100.00
## 
## data_omit$eduT  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           0   1163     66.99          66.99     66.99          66.99
##           1    573     33.01         100.00     33.01         100.00
##        <NA>      0                               0.00         100.00
##       Total   1736    100.00         100.00    100.00         100.00
## 
## data_omit$visit_dem  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           0   1013     58.35          58.35     58.35          58.35
##           1    723     41.65         100.00     41.65         100.00
##        <NA>      0                               0.00         100.00
##       Total   1736    100.00         100.00    100.00         100.00

The table above is quite detailed: it shows precisely how many observations fall into each category of every factor variable. Based on it, we construct the following test cases:

df <- data.frame(
  intinpol = c("Very interested", "Not at all interested", "Not very interested"),
  townsize = c("5000-20000", "100000-500000", "500000 and more"),
  settlement = c("Another city, town (not a regional or district center)",
                 "Village",
                 "Regional center"),
  eduT = c("0", "0", "1"),
  visit_dem = c("1", "0", "0"))

df <- lapply(df, as.factor) %>% as.data.frame()

head(df)

1st Attempt

Now let’s finally predict probabilities of attending demonstrations using our logistic regression model.

# Prediction (type = "response" returns probabilities rather than log-odds)
logreg.pred <- predict(glm.new, df, type = "response")
logreg.pred <- factor(ifelse(logreg.pred > 0.5, "1", "0"))
cm <- confusionMatrix(data = logreg.pred, reference = df$visit_dem)
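Note that predict() on a glm object returns the linear predictor (log-odds) unless type = "response" is specified; the two scales are connected by the logistic function (a quick sketch):

eta <- predict(glm.new, df)                     # log-odds
p   <- predict(glm.new, df, type = "response")  # probabilities
all.equal(p, 1 / (1 + exp(-eta)))               # TRUE: p = logistic(eta)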

# Function for plotting confusion matrix
draw_confusion_matrix <- function(cm) {

  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

  # create the matrix
  rect(150, 430, 240, 370, col='#3F97D0')
  text(195, 435, 'Class1', cex=1.2)
  rect(250, 430, 340, 370, col='#F7AD50')
  text(295, 435, 'Class2', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#F7AD50')
  rect(250, 305, 340, 365, col='#3F97D0')
  text(140, 400, 'Class1', cex=1.2, srt=90)
  text(140, 335, 'Class2', cex=1.2, srt=90)

  # add in the cm results
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}

# Confusion Matrix
draw_confusion_matrix(cm)

As we had only 3 observations, the model misclassified just one of them, giving an accuracy of 0.667. However, it is hard to judge whether this is a good or a bad result: on the one hand there is only a single misclassification, on the other it is 1 out of 3 observations (33% of the data).

2nd Attempt

Let’s make another prediction, this time splitting the data into train and test samples and refitting the model on the training data. We do this only to assess accuracy, not to re-estimate the model.

# Splitting the data
set.seed(999) 
partition = sample(seq_len(nrow(data_omit)), size = nrow(data_omit)*0.2)
data.test = data_omit[partition,]
data.train = data_omit[-partition,]
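Since the outcome classes are somewhat imbalanced (58% vs 42%), a stratified split that preserves the class proportions in both samples can be preferable; caret offers one (a sketch of an alternative, not what is used below; data.train2/data.test2 are hypothetical names):

# alternative: stratified 80/20 split on the outcome
idx <- caret::createDataPartition(data_omit$visit_dem, p = 0.8, list = FALSE)
data.train2 <- data_omit[idx, ]
data.test2  <- data_omit[-idx, ]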

# Rebuild the model
model_new <- glm(formula = visit_dem ~ intinpol + townsize + settlement + eduT, 
                 family = "binomial", data = data.train)

# New prediction (again on the probability scale)
set.seed(999)
logreg.pred2 <- predict(model_new, data.test, type = "response")
logreg.pred2 <- factor(ifelse(logreg.pred2 > 0.5, "1", "0"))
cm2 <- confusionMatrix(data = logreg.pred2, reference = data.test$visit_dem)

# New Confusion Matrix
draw_confusion_matrix(cm2)

Reminder: Class 1 is “0” (‘would never do’), class 2 is “1” (‘have done or might do’)

As can be seen, the model is prone to mistakes. The predictions are far from exact and can scarcely be considered reliable. There are many Type I errors: in 132 cases a person is predicted to attend peaceful demonstrations when in fact they would not.

Let’s also look at the area under the ROC curve. The AUC is 0.55, which is a rather poor result (barely more than half of the square, i.e. little better than chance).

# note: the ROC below is built from hard 0/1 class labels, so it effectively
# has a single operating point (see the probability-based sketch further down)
roc_qda <- roc(response = data.test$visit_dem, 
               predictor = factor(logreg.pred2, ordered = TRUE), plot = FALSE)

plot(roc_qda, col="red", lwd=3, main="ROC curve QDA")

auc(roc_qda)
## Area under the curve: 0.5498
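Because the ROC above is computed from 0/1 class labels, it effectively has a single operating point. Feeding the predicted probabilities instead would trace the full curve (a sketch; prob2 and roc_prob are hypothetical names):

# ROC from predicted probabilities rather than hard labels
prob2    <- predict(model_new, data.test, type = "response")
roc_prob <- roc(response = data.test$visit_dem, predictor = prob2)
auc(roc_prob)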

The next step is the marginal model plot. Looking at it, we notice a fairly smooth sigmoid curve, which indicates that there is no strict separation between the two categories of the visit_dem variable.

car::marginalModelPlots(glm.new)

More Info about Model & Diagnostics

As the model contains only factor predictors, linearity is not a concern. What about multicollinearity, then?

Let’s come back to the VIFs, i.e. the variance inflation factor procedure.

car::vif(glm.new)
##                 GVIF Df GVIF^(1/(2*Df))
## intinpol    1.035042  3        1.005757
## townsize   39.644726  4        1.584066
## settlement 39.679290  4        1.584238
## eduT        1.052978  1        1.026147
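A note on reading this output (following Fox and Monette’s suggestion for generalized VIFs): GVIF^(1/(2·Df)) is comparable to the square root of an ordinary VIF, so squaring it, e.g. 1.584² ≈ 2.51 for townsize, puts it on the familiar VIF scale. That looks mild, but the raw GVIFs near 40 already betray that townsize and settlement nearly duplicate each other.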

Two variables, townsize and settlement, appear to be harmful, in a sense, for our prediction. Let’s build a heterogeneous correlation matrix on the data used to fit the glm.new model.

dat.cor <- hetcor(data_omit)
dat.cor <- as.matrix(dat.cor$correlations)
corrgram(dat.cor)

kable(dat.cor)
              intinpol    townsize  settlement      income      region         age        eduT   visit_dem
intinpol     1.0000000   0.0329305  -0.0178030  -0.1355609  -0.0300549  -0.1170213  -0.0899508  -0.1999725
townsize     0.0329305   1.0000000  -0.9778675   0.1048237  -0.0758219  -0.0946728   0.2701519  -0.0019435
settlement  -0.0178030  -0.9778675   1.0000000  -0.0931952   0.2616517   0.0837240  -0.2670877  -0.1086444
income      -0.1355609   0.1048237  -0.0931952   1.0000000   0.0380709  -0.2760198   0.3422549  -0.0230005
region      -0.0300549  -0.0758219   0.2616517   0.0380709   1.0000000  -0.0010649  -0.0542256  -0.1076701
age         -0.1170213  -0.0946728   0.0837240  -0.2760198  -0.0010649   1.0000000  -0.2299950   0.0215622
eduT        -0.0899508   0.2701519  -0.2670877   0.3422549  -0.0542256  -0.2299950   1.0000000   0.0866297
visit_dem   -0.1999725  -0.0019435  -0.1086444  -0.0230005  -0.1076701   0.0215622   0.0866297   1.0000000

The correlation between settlement and townsize is -0.978! This is a serious problem and the source of the multicollinearity.

The Wald test allows us to assess the statistical significance of each coefficient in the model. If the test fails to reject the null hypothesis, removing the variable should not substantially harm the fit of the model.

survey::regTermTest(glm.new, "settlement", method = "Wald")
## Wald test for settlement
##  in glm(formula = visit_dem ~ intinpol + townsize + settlement + 
##     eduT, family = "binomial", data = data_omit)
## F =  14.94539  on  4  and  1723  df: p= 0.0000000000051788
survey::regTermTest(glm.new, "townsize", method = "Wald")
## Wald test for townsize
##  in glm(formula = visit_dem ~ intinpol + townsize + settlement + 
##     eduT, family = "binomial", data = data_omit)
## F =  10.97278  on  4  and  1723  df: p= 0.0000000086855

However, in both cases we reject the null hypothesis; hence, deleting either variable will worsen the model fit. The procedure is thus less helpful than it could be. Let’s nevertheless decide to remove townsize from the model (guided only by the crude fact that its p-value is slightly larger).

Before doing that, though, we will still build two new models and look at their parameters: one model without townsize, another without settlement.

glm.new2 <- glm(formula = visit_dem ~ intinpol + settlement + 
    eduT, family = "binomial", data = data_omit)
glm.new3 <- glm(formula = visit_dem ~ intinpol + townsize + 
    eduT, family = "binomial", data = data_omit)

The models are non-nested, which is why AICs are compared. Here EDF stands for the equivalent degrees of freedom of the fit. Both models have the same EDF; however, the model without townsize has the lower AIC.

print("EDF & AIC of the Model without `townsize`")
## [1] "EDF & AIC of the Model without `townsize`"
stats::extractAIC(glm.new2)
## [1]    9.000 2292.348
print("EDF & AIC of the Model without `settlement`")
## [1] "EDF & AIC of the Model without `settlement`"
stats::extractAIC(glm.new3)
## [1]    9.000 2308.732
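For reference, the second element returned by extractAIC follows the usual definition AIC = 2k − 2·logLik and, for glm objects, matches stats::AIC (a quick check, assuming the fits above):

# AIC = 2k - 2*logLik; for glm objects this equals extractAIC()'s 2nd element
AIC(glm.new2)  # expected: 2292.348
AIC(glm.new3)  # expected: 2308.732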

Also, once the collinear pair is separated, the remaining GVIFs are low and nearly identical: the GVIF of settlement in glm.new2 (1.0577) exceeds the GVIF of townsize in glm.new3 (1.0564) by only about 0.0013.

print("Model without `townsize`")
## [1] "Model without `townsize`"
car::vif(glm.new2)
##                GVIF Df GVIF^(1/(2*Df))
## intinpol   1.019966  3        1.003300
## settlement 1.057724  4        1.007040
## eduT       1.045827  1        1.022657
print("Model without `settlement`")
## [1] "Model without `settlement`"
car::vif(glm.new3)
##              GVIF Df GVIF^(1/(2*Df))
## intinpol 1.017308  3        1.002864
## townsize 1.056424  4        1.006885
## eduT     1.047227  1        1.023341

Let’s compare each of these models with the model in which both townsize and settlement are present.

anova(glm.new, glm.new2)
anova(glm.new, glm.new3)

Here we see confirmation of what the Wald tests suggested earlier: the model fit has become worse.

Overall, let’s take glm.new2 as the best one. Its equation:

extract_eq(glm.new2, wrap = T)

\[ \begin{aligned} \log\left[ \frac { P( \operatorname{visit\_dem} = \operatorname{1} ) }{ 1 - P( \operatorname{visit\_dem} = \operatorname{1} ) } \right] &= \alpha + \beta_{1}(\operatorname{intinpol}_{\operatorname{Somewhat\ interested}}) + \beta_{2}(\operatorname{intinpol}_{\operatorname{Not\ very\ interested}}) + \beta_{3}(\operatorname{intinpol}_{\operatorname{Not\ at\ all\ interested}})\ + \\ &\quad \beta_{4}(\operatorname{settlement}_{\operatorname{Regional\ center}}) + \beta_{5}(\operatorname{settlement}_{\operatorname{District\ center}}) + \beta_{6}(\operatorname{settlement}_{\operatorname{Another\ city,\ town\ (not\ a\ regional\ or\ district\ center)}}) + \beta_{7}(\operatorname{settlement}_{\operatorname{Village}})\ + \\ &\quad \beta_{8}(\operatorname{eduT}_{\operatorname{1}}) + \epsilon \end{aligned} \]

Just out of curiosity, we can also plot the (scaled) coefficients of all three models. As can be seen, the patterns for intinpol and eduT are rather stable, in contrast to the other variables, whose inclusion varies across the models.

plot_summs(glm.new, glm.new2, glm.new3, scale = TRUE, plot.distributions = FALSE)

Now let’s see whether we can improve our “best model” by performing data imputation.

Data Imputation

We have three variables with missing values: intinpol, eduT, and income. Note that the first two are present in our logistic regression model. Let’s apply two approaches to handling missing values: kNN imputation and MICE.

data_knn <- kNN(data, k = 9, variable = c("intinpol", "eduT", "income")) %>% 
             dplyr::select(-intinpol_imp, -eduT_imp, -income_imp)
 
data_mice <- mice(data, m = 5, method = "pmm", seed = 999)
## 
##  iter imp variable
##   1   1  intinpol  income  eduT
##   1   2  intinpol  income  eduT
##   1   3  intinpol  income  eduT
##   1   4  intinpol  income  eduT
##   1   5  intinpol  income  eduT
##   2   1  intinpol  income  eduT
##   2   2  intinpol  income  eduT
##   2   3  intinpol  income  eduT
##   2   4  intinpol  income  eduT
##   2   5  intinpol  income  eduT
##   3   1  intinpol  income  eduT
##   3   2  intinpol  income  eduT
##   3   3  intinpol  income  eduT
##   3   4  intinpol  income  eduT
##   3   5  intinpol  income  eduT
##   4   1  intinpol  income  eduT
##   4   2  intinpol  income  eduT
##   4   3  intinpol  income  eduT
##   4   4  intinpol  income  eduT
##   4   5  intinpol  income  eduT
##   5   1  intinpol  income  eduT
##   5   2  intinpol  income  eduT
##   5   3  intinpol  income  eduT
##   5   4  intinpol  income  eduT
##   5   5  intinpol  income  eduT
data_mice <- complete(data_mice)

And then visualize them.

intinpol variable

intinpol0 <- ggplot() +
  geom_bar(data = data, aes(x = intinpol), fill = "#d69bca", color = "#7d498b", alpha = 0.7) +
  theme_bw() + coord_flip()

intinpol1 <- ggplot() +
  geom_bar(data = data_knn, aes(x = intinpol), fill = "#d69bca", color = "#7d498b", alpha = 0.7) +
  theme_bw() + coord_flip()

intinpol2 <- ggplot() +
  geom_bar(data = data_mice, aes(x = intinpol), fill = "#d69bca", color = "#7d498b", alpha = 0.7) +
  theme_bw() + coord_flip()

intinpol3 <- ggplot() +
  geom_bar(data = data_omit, aes(x = intinpol), fill = "#d69bca", color = "#7d498b", alpha = 0.7) +
  theme_bw() + coord_flip()

ggarrange(intinpol0, intinpol1, intinpol2, intinpol3,
          labels = c("Initial", "KNN", "MICE", "NA_OMIT"),
          ncol = 2, nrow = 2)

rm(intinpol0, intinpol1, intinpol2, intinpol3)

income variable

income0 <- ggplot() +
  geom_bar(data = data, aes(x = income), fill = "#79e062", color = "#2dbc0d", alpha = 0.7) +
  theme_bw() + coord_flip()

income1 <- ggplot() +
  geom_bar(data = data_knn, aes(x = income), fill = "#79e062", color = "#2dbc0d", alpha = 0.7) +
  theme_bw() + coord_flip()

income2 <- ggplot() +
  geom_bar(data = data_mice, aes(x = income), fill = "#79e062", color = "#2dbc0d", alpha = 0.7) +
  theme_bw() + coord_flip()

income3 <- ggplot() +
  geom_bar(data = data_omit, aes(x = income), fill = "#79e062", color = "#2dbc0d", alpha = 0.7) +
  theme_bw() + coord_flip()

ggarrange(income0, income1, income2, income3,
          labels = c("Initial", "KNN", "MICE", "NA_OMIT"),
          ncol = 2, nrow = 2)

rm(income0, income1, income2, income3)

eduT variable

eduT0 <- ggplot() +
  geom_bar(data = data, aes(x = eduT), fill = "#418dcb", color = "#105a93", alpha = 0.7) + theme_bw()

eduT1 <- ggplot() +
  geom_bar(data = data_knn, aes(x = eduT), fill = "#418dcb", color = "#105a93", alpha = 0.7) + theme_bw()

eduT2 <- ggplot() +
  geom_bar(data = data_mice, aes(x = eduT), fill = "#418dcb", color = "#105a93", alpha = 0.7) + theme_bw()

eduT3 <- ggplot() +
  geom_bar(data = data_omit, aes(x = eduT), fill = "#418dcb", color = "#105a93", alpha = 0.7) + theme_bw()

ggarrange(eduT0, eduT1, eduT2, eduT3,
          labels = c("Initial", "KNN", "MICE", "NA_OMIT"),
          ncol = 4, nrow = 1)

rm(eduT0, eduT1, eduT2, eduT3)

Visually, hardly any differences can be found. The only thing to keep in mind is that with na.omit we truncate the dataset and, as a result, have fewer observations.

Let’s use the new datasets with the “filled-in” missing values, fit the same model on each, and plot the coefficients to compare them.

glm.new2 <- glm(formula = visit_dem ~ intinpol + settlement + eduT, family = "binomial", data = data_omit)

gml.knn <- glm(formula = visit_dem ~ intinpol + settlement + eduT, family = "binomial", data = data_knn)

gml.mice <- glm(formula = visit_dem ~ intinpol + settlement + eduT, family = "binomial", data = data_mice)

plot_summs(glm.new2, gml.knn, gml.mice, scale = F, plot.distributions = F)

Well, we can see that there are almost no differences in the coefficients. According to AIC, the model based on na.omit looks better than those based on the MICE- and kNN-imputed data, although AICs computed on datasets of different sizes are not strictly comparable.

print("EDF & AIC of the Model Based on `data_omit`")
## [1] "EDF & AIC of the Model Based on `data_omit`"
stats::extractAIC(glm.new2)
## [1]    9.000 2292.348
print("EDF & AIC of the Model Based on `data_knn`")
## [1] "EDF & AIC of the Model Based on `data_knn`"
stats::extractAIC(gml.knn)
## [1]    9.000 2378.421
print("EDF & AIC of the Model Based on `data_mice`")
## [1] "EDF & AIC of the Model Based on `data_mice`"
stats::extractAIC(gml.mice)
## [1]    9.000 2378.408

Interpretation

Now let’s come back to the model which we previously defined as the best one for us.

Model equation: (once again just as a recall)

extract_eq(glm.new2, wrap = T)

\[ \begin{aligned} \log\left[ \frac { P( \operatorname{visit\_dem} = \operatorname{1} ) }{ 1 - P( \operatorname{visit\_dem} = \operatorname{1} ) } \right] &= \alpha + \beta_{1}(\operatorname{intinpol}_{\operatorname{Somewhat\ interested}}) + \beta_{2}(\operatorname{intinpol}_{\operatorname{Not\ very\ interested}}) + \beta_{3}(\operatorname{intinpol}_{\operatorname{Not\ at\ all\ interested}})\ + \\ &\quad \beta_{4}(\operatorname{settlement}_{\operatorname{Regional\ center}}) + \beta_{5}(\operatorname{settlement}_{\operatorname{District\ center}}) + \beta_{6}(\operatorname{settlement}_{\operatorname{Another\ city,\ town\ (not\ a\ regional\ or\ district\ center)}}) + \beta_{7}(\operatorname{settlement}_{\operatorname{Village}})\ + \\ &\quad \beta_{8}(\operatorname{eduT}_{\operatorname{1}}) + \epsilon \end{aligned} \]

Then, let’s interpret the coefficients of the model we finally arrived at after this rather laborious work. (The odds ratios below are computed for glm.new3, the variant with townsize; the reading of glm.new2’s settlement coefficients would be analogous.)

exp(cbind(OR = coef(glm.new3), confint(glm.new3)))
##                                      OR     2.5 %    97.5 %
## (Intercept)                   1.5366507 1.0457209 2.2674403
## intinpolSomewhat interested   0.6389872 0.4401982 0.9241354
## intinpolNot very interested   0.5719963 0.3952609 0.8246473
## intinpolNot at all interested 0.2726478 0.1772869 0.4164381
## townsize5000-20000            0.4647049 0.3188976 0.6717430
## townsize20000-100000          0.7196573 0.5212682 0.9915057
## townsize100000-500000         0.7292332 0.5356522 0.9913108
## townsize500000 and more       0.8491115 0.6468099 1.1144884
## eduT1                         1.2161928 0.9853271 1.5007225

So, we have 3 variables in the final model.

levels of intinpol variable:

levels(data_omit$intinpol)
## [1] "Very interested"       "Somewhat interested"   "Not very interested"  
## [4] "Not at all interested"
  • if a person is “Somewhat interested” in politics, the odds of attending peaceful demonstrations change by a factor of 0.64; in other words, they decrease by 36% in comparison with people who are “Very interested” [95% CI = 0.44; 0.92].

  • if a person is “Not very interested” in politics, the odds of attending peaceful demonstrations change by a factor of 0.57; in other words, they decrease by 43% in comparison with people who are “Very interested” [95% CI = 0.40; 0.82].

  • if a person is “Not at all interested” in politics, the odds of attending peaceful demonstrations change by a factor of 0.27; in other words, they decrease by 73% in comparison with people who are “Very interested” [95% CI = 0.18; 0.42].

Here we can see a tendency: the more interested a person is in politics, the more likely he or she is to take part in peaceful demonstrations. Let’s proceed.
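As a worked example of reading these odds ratios: the percent change in odds is (OR − 1) × 100, so for “Somewhat interested” we get (0.64 − 1) × 100 = −36%, i.e. the odds are 36% lower than for the reference category.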

levels of townsize variable:

levels(data_omit$townsize)
## [1] "Under 5,000"     "5000-20000"      "20000-100000"    "100000-500000"  
## [5] "500000 and more"
  • if the town size is 5000-20000, the odds of attending peaceful demonstrations change by a factor of 0.46; in other words, they decrease by 54% in comparison with towns of “Under 5,000” [95% CI = 0.32; 0.67].

  • if the town size is 20000-100000, the odds change by a factor of 0.72, i.e. they decrease by 28% in comparison with “Under 5,000” [95% CI = 0.52; 0.99].

  • if the town size is 100000-500000, the odds change by a factor of 0.73, i.e. they decrease by 27% in comparison with “Under 5,000” [95% CI = 0.54; 0.99].

  • if the town size is 500000 and more, the odds change by a factor of 0.85, i.e. they decrease by 15% in comparison with “Under 5,000” [95% CI = 0.65; 1.11]; note that this interval includes 1, so the effect is not statistically significant.

levels of eduT variable:

levels(data_omit$eduT)
## [1] "0" "1"
  • if a person has higher (tertiary) education, the odds of attending peaceful demonstrations change by a factor of 1.22; in other words, they increase by 22% compared with people without higher education [95% CI = 0.99; 1.50]; this interval also includes 1, so the effect is not statistically significant.

Let’s explore the average marginal effects: how much, in percentage points, each variable shifts the predicted probability of the outcome.

mar <- margins(glm.new3, type = "response")
summary(mar)
plot(mar)

Only two terms have p-values above 0.05: eduT = 1 and townsize = “500000 and more”. The remaining terms have statistically significant marginal effects. Noticeably, each marginal effect comes with a rather wide confidence interval.

  • If intinpol is “Not at all interested”, the predicted probability of the outcome is about 30 percentage points lower than for the reference category, “Very interested”.

  • If intinpol is “Not very interested”, the predicted probability is about 14 percentage points lower than for “Very interested”.

  • If intinpol is “Somewhat interested”, the predicted probability is about 11 percentage points lower than for “Very interested”.

  • If townsize is “100000-500000”, the predicted probability is about 8 percentage points lower than for the reference category, “Under 5000”.

  • If townsize is “20000-100000”, the predicted probability is about 8 percentage points lower than for “Under 5000”.

  • If townsize is “5000-20000”, the predicted probability is about 17 percentage points lower than for “Under 5000”.
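For a factor level, the average marginal effect is simply the mean difference in predicted probabilities when every observation is set to that level versus the reference level (a sketch reproducing the logic by hand; the copies d0 and d1 are hypothetical names):

# AME of intinpol = "Not at all interested" vs the reference level
d1 <- d0 <- data_omit
d0$intinpol <- factor("Very interested",       levels = levels(data_omit$intinpol))
d1$intinpol <- factor("Not at all interested", levels = levels(data_omit$intinpol))
mean(predict(glm.new3, d1, type = "response") -
     predict(glm.new3, d0, type = "response"))  # ≈ -0.30, per the summary above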

Conclusion

Although the model’s predictive performance is not as good as it could be, some conclusions can still be drawn. One is that a strong interest in politics plays an important role in whether a person attends demonstrations. Also, people with higher education are somewhat more likely to attend such events than people without tertiary education (though this effect is not statistically significant). Additionally, a person from a small town (under 5,000 inhabitants) is more likely to attend a demonstration than, for example, a person living in a city of 500,000 or more (for the latter, the odds are about 15% lower). At the same time, factors such as income and age do not play important roles. Overall, adding other variables would be a sensible way to improve the predictions and to identify further factors behind attendance at peaceful demonstrations.