The outcome variable is Attending lawful demonstrations in the past 12 months (the question is “Now I’d like you to look at this card. I’m going to read out some different forms of political action that people can take, and I’d like you to tell me, for each one, whether you have actually done any of these things, whether you might do it or would never, under any circumstances, do it. Attending peaceful demonstrations”)
We will model participation in lawful demonstrations Q211 (as a form of political activity) by the respondent’s education, type of settlement, size of settlement, income, age, and interest in politics, controlling for the federal district.
df <- readRDS("ruwvs")
#str(df)
df$age <- as.numeric(df$age)
summary(df)
## Q211 intinpol eduR
## 0:1072 Very interested :148 Primary : 35
## 1: 738 Somewhat interested :638 Secondary :469
## Not very interested :713 Post-secondary:692
## Not at all interested:305 Tertiary :604
## NA's : 6 NA's : 10
##
##
## townsize
## Under 5,000 :391
## 5000-20000 :203
## 20000-100000 :284
## 100000-500000 :342
## 500000 and more:590
##
##
## settlement
## Capital city :149
## Regional center :569
## District center :542
## Another city, town (not a regional or district center): 58
## Village :492
##
##
## income region
## Fifth step :385 RU: Central Federal District :489
## Sixth step :278 RU: Volga; Privolzhsky Federal District:382
## Fourth step :275 RU: Siberian Federal District :239
## Third step :262 RU: South Federal District :183
## Seventh step:183 RU: North West Federal District :177
## (Other) :364 RU: Urals Federal District :156
## NA's : 63 (Other) :184
## age income1 eduT
## Min. : 3.00 Min. :0.000 0 :1196
## 1st Qu.:16.00 1st Qu.:2.000 1 : 604
## Median :28.00 Median :4.000 NA's: 10
## Mean :30.41 Mean :3.773
## 3rd Qu.:43.00 3rd Qu.:5.000
## Max. :76.00 Max. :9.000
## NA's :63
Respondents’ socio-demographic charactreristics
Age histogram below is slightly right-skewed, which means that the sample has mostly been filled by younger generation. The average respondent’s age is 30 years old. As for the data about respondents’ income - it is normally disctrubuted, without strong deviations to one or another side. In education area the majority of respondents have post-secondary (38.25) or tetriary (33.4%) education. 26% of respondents have secondary education and only 2% of the sample have just primary education.
library(dplyr)
library(ggplot2)
# respondent's background
#age
ggplot() +
geom_histogram(data = df,
aes(x = age),
binwidth = 5,
fill = "cornflowerblue",
col= "black",
alpha = 0.5) +
labs(subtitle="mean=30.41 (white solid line), median=28 (white dashed line),\nmin.=3, 1st Qu.=16, 3rd Qu.=43, max.=76",
title = "Respondents' age",
x = "",
y = "Number of respondents") +
geom_vline(aes(xintercept = mean(df$age)), linetype="solid", color="white", size=1.5) +
geom_vline(aes(xintercept = median(df$age)), linetype="dashed", color="white", size=1.5) +
theme_bw()
#income
ggplot() +
geom_histogram(data=df,
aes(x=income1),
binwidth = 1,
fill = "cornflowerblue",
col= "black",
alpha = 0.5) +
labs(subtitle="mean=3.7 (white solid line), median=4 (white dashed line),\nmin.=0, 1st Qu.=2, 3rd Qu.=5, max.=9",
title = "Respondents' income (10 level scale)",
x = "",
y = "Number of respondents") +
geom_vline(aes(xintercept = mean(df$income1)), linetype="solid", color="white", size=1.5) +
geom_vline(aes(xintercept = median(df$income1)), linetype="dashed", color="white", size=1.5) +
theme_bw()
#education
ggplot(data = df, aes(x = eduR)) +
geom_bar(aes(y = prop.table(..count..) * 100),
position = "dodge", color="black", fill="cornflowerblue",alpha = 0.7, width=0.8) +
labs(title = "Respondents' levels of education", x = "", y = "percent", caption = "WVS 2017 Russia")+
geom_text(aes(y = prop.table(..count..) * 100 + 0.5,
label = paste0(round(prop.table(..count..) * 100,digits= 2), '%')),
stat = 'count',
position = position_dodge(),
size = 4) +
theme_bw()
Geographical characteristics:
All respondents are from Russia, mostly from Central Federal District (27%). Distribution over other districs can be seen on the graph below. The majority of them are from district (29.9%) or regional centers(31.4%) and villages(27.2%); respondnts’ towns are mostly large (32.6% have townsize 500.000+) or, in contrast, small (21.6% have townsize under 5.000).
# respondents' place of living
#region
df <- within(df, region <- factor(region,
levels=names(sort(table(region),
decreasing=FALSE))))
ggplot(data = df, aes(x = region)) +
geom_bar(aes(y = prop.table(..count..) * 100),
position = "dodge", color="black", fill="cornflowerblue", alpha = 0.7, width=0.8) +
coord_flip() +
labs(title = "Respondents' region", x = "", y = "percent", caption = "WVS 2017 Russia")+
geom_text(aes(y = prop.table(..count..) * 100 - 1.5,
label = paste0(round(prop.table(..count..) * 100,digits= 2), '%')),
stat = 'count',
#position = position_dodge(),
size = 4) +
theme_bw()
#settlement
ggplot(data = df, aes(x = settlement)) +
geom_bar(aes(y = prop.table(..count..) * 100),
position = "dodge", color="black", fill="cornflowerblue", alpha = 0.7, width=0.8) +
coord_flip() +
labs(title = "Respondents' settlement", x = "", y = "percent", caption = "WVS 2017 Russia")+
geom_text(aes(y = prop.table(..count..) * 100 - 1.5,
label = paste0(round(prop.table(..count..) * 100,digits= 2), '%')),
stat = 'count',
#position = position_dodge(),
size = 4) +
theme_bw()
#townsize
ggplot(data = df, aes(x = townsize)) +
geom_bar(aes(y = prop.table(..count..) * 100),
position = "dodge", color="black", fill="cornflowerblue",alpha = 0.7, width=0.8) +
coord_flip() +
labs(title = "Respondents' town's size", x = "", y = "percent", caption = "WVS 2017 Russia")+
geom_text(aes(y = ..count../sum(..count..)*100 - 1.5,
label = paste0(round(..count../sum(..count..)*100,digits= 2), '%')),
stat = 'count',
#position = position_dodge(),
size = 4) +
theme_bw()
When it comes to respondents’ interest in politics it can be seen that 2/5 of the sample are not really interested in politics. Taking into account those, who are not interested in politics at all, it takes 56% of the sample of those, who ignore politics.
# interest in politics and political activity
ggplot(df,
aes(x = intinpol, y=..count../sum(..count..)*100,
fill = Q211)) +
geom_bar(position = "stack")+
labs(title = "Respondents' interest in politics", x = "", y = "percent", fill = "Politically active?", caption = "WVS 2017 Russia")+
geom_text(aes(y = prop.table(..count..) * 100 +1,
label = paste0(round(prop.table(..count..) * 100,digits= 2), '%')),
stat = 'count',
position = position_stack(vjust = 0.5),
size = 4) +
coord_flip() +
theme_bw() +
scale_fill_manual(values=c("sienna3", "skyblue3"))
library(polycor)
df_cor <- hetcor(df)
df_cor
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## Q211 intinpol eduR townsize settlement
## Q211 1 Polychoric Polychoric Polychoric Polychoric
## intinpol -0.2 1 Polychoric Polychoric Polychoric
## eduR 0.07058 -0.1151 1 Polychoric Polychoric
## townsize -0.002576 0.03294 0.2528 1 Polychoric
## settlement -0.1093 -0.0178 -0.2516 -0.9847 1
## income -0.023 -0.1356 0.3034 0.1048 -0.09324
## region 0.1525 -0.018 0.1033 0.05894 -0.2217
## age 0.02156 -0.117 -0.2265 -0.0947 0.08369
## income1 -0.02156 -0.1332 0.3006 0.1057 -0.09373
## eduT 0.08664 -0.08995 <NA> 0.2702 -0.267
## income region age income1 eduT
## Q211 Polychoric Polychoric Polyserial Polyserial Polychoric
## intinpol Polychoric Polychoric Polyserial Polyserial Polychoric
## eduR Polychoric Polychoric Polyserial Polyserial Polychoric
## townsize Polychoric Polychoric Polyserial Polyserial Polychoric
## settlement Polychoric Polychoric Polyserial Polyserial Polychoric
## income 1 Polychoric Polyserial Polyserial Polychoric
## region 0.0376 1 Polyserial Polyserial Polychoric
## age -0.276 0.03911 1 Pearson Polyserial
## income1 0.9999 0.03637 -0.2745 1 Polyserial
## eduT 0.3423 0.1176 -0.23 0.3373 1
##
## Standard Errors:
## Q211 intinpol eduR townsize settlement income region
## Q211
## intinpol 0.03144
## eduR 0.03305 0.02771
## townsize 0.03279 0.02776 0.02654
## settlement 0.03207 0.02738 0.0262 0.001632
## income 0.03074 0.02554 0.02368 0.02599 0.0256
## region 0.03103 0.02677 0.02712 0.02681 0.02561 0.02528
## age 0.03032 0.02527 0.02439 0.02562 0.02548 0.02182 0.02508
## income1 0.03032 0.02513 0.02326 0.02559 0.02524 0.002691 0.02494
## eduT 0.03911 0.03323 0 0.03175 0.03127 0.02811 0.03218
## age income1
## Q211
## intinpol
## eduR
## townsize
## settlement
## income
## region
## age
## income1 0.0222
## eduT 0.03033 0.02741
##
## n = 1736
##
## P-values for Tests of Bivariate Normality:
## Q211 intinpol eduR townsize settlement
## Q211
## intinpol 0.02881
## eduR 0.7348 0.1433
## townsize 0.001121 0.06721 0.1324
## settlement 9.393e-06 0.008892 0.00703 7.572e-12
## income 0.6202 0.006957 7.566e-07 0.01495 8.763e-06
## region 0.2317 1.557e-08 8.174e-08 3.28e-25 5.092e-110
## age 2.023e-10 7.045e-10 4.085e-31 1.423e-280 1.666e-31
## income1 3.558e-42 2.126e-42 3.611e-42 7.56899999860163e-315 8.22e-66
## eduT <NA> 0.2291 0 0.5738 0.1498
## income region age income1
## Q211
## intinpol
## eduR
## townsize
## settlement
## income
## region 1.23e-15
## age 1.07e-11 3.381e-213
## income1 NaN 8.684e-253 8.425e-52
## eduT 8.759e-05 0.0003211 1.846e-20 2.362e-42
library(ggcorrplot)
df_corplot <- round(df_cor$correlations, 2)
ggcorrplot(df_corplot, type = "lower")
First, we see strong correlation between two income variables and between townsize and settlement. This means that for the model we should choose one of these two variables. Thus I will exclude “townsize” variable, because it is less corralated with the outcome than settlement, and factor income variable, as it is less convenient with the same correlational level. Also I will exclude “eduR” variable, leaving binary variable to the regression for the same reason.
#general model
model_full <- glm(Q211 ~ age + income1 + eduT + settlement + region + intinpol, data = df, family="binomial"(link=logit))
summary(model_full) #AIC: 2283.4
##
## Call:
## glm(formula = Q211 ~ age + income1 + eduT + settlement + region +
## intinpol, family = binomial(link = logit), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8052 -1.0272 -0.7765 1.2232 2.0811
##
## Coefficients:
## Estimate
## (Intercept) 0.8221265
## age -0.0009979
## income1 -0.0544330
## eduT1 0.1788241
## settlementRegional center -0.4597112
## settlementDistrict center -0.8721639
## settlementAnother city, town (not a regional or district center) -0.9810863
## settlementVillage -0.5133248
## regionRU: North Caucasian federal district 0.1437896
## regionRU: Urals Federal District -0.1227691
## regionRU: North West Federal District 0.1589217
## regionRU: South Federal District -0.3330271
## regionRU: Siberian Federal District 0.3835438
## regionRU: Volga; Privolzhsky Federal District 0.3173954
## regionRU: Central Federal District 0.4590180
## intinpolSomewhat interested -0.4609568
## intinpolNot very interested -0.6365091
## intinpolNot at all interested -1.4088843
## Std. Error
## (Intercept) 0.4070988
## age 0.0031580
## income1 0.0289287
## eduT1 0.1138132
## settlementRegional center 0.2254509
## settlementDistrict center 0.2170352
## settlementAnother city, town (not a regional or district center) 0.3581382
## settlementVillage 0.2284048
## regionRU: North Caucasian federal district 0.3372543
## regionRU: Urals Federal District 0.3013116
## regionRU: North West Federal District 0.2952891
## regionRU: South Federal District 0.3011301
## regionRU: Siberian Federal District 0.2818394
## regionRU: Volga; Privolzhsky Federal District 0.2694427
## regionRU: Central Federal District 0.2716458
## intinpolSomewhat interested 0.1937967
## intinpolNot very interested 0.1931379
## intinpolNot at all interested 0.2251833
## z value
## (Intercept) 2.019
## age -0.316
## income1 -1.882
## eduT1 1.571
## settlementRegional center -2.039
## settlementDistrict center -4.019
## settlementAnother city, town (not a regional or district center) -2.739
## settlementVillage -2.247
## regionRU: North Caucasian federal district 0.426
## regionRU: Urals Federal District -0.407
## regionRU: North West Federal District 0.538
## regionRU: South Federal District -1.106
## regionRU: Siberian Federal District 1.361
## regionRU: Volga; Privolzhsky Federal District 1.178
## regionRU: Central Federal District 1.690
## intinpolSomewhat interested -2.379
## intinpolNot very interested -3.296
## intinpolNot at all interested -6.257
## Pr(>|z|)
## (Intercept) 0.043438
## age 0.752001
## income1 0.059886
## eduT1 0.116135
## settlementRegional center 0.041443
## settlementDistrict center 5.86e-05
## settlementAnother city, town (not a regional or district center) 0.006155
## settlementVillage 0.024612
## regionRU: North Caucasian federal district 0.669850
## regionRU: Urals Federal District 0.683678
## regionRU: North West Federal District 0.590446
## regionRU: South Federal District 0.268759
## regionRU: Siberian Federal District 0.173558
## regionRU: Volga; Privolzhsky Federal District 0.238809
## regionRU: Central Federal District 0.091073
## intinpolSomewhat interested 0.017381
## intinpolNot very interested 0.000982
## intinpolNot at all interested 3.93e-10
##
## (Intercept) *
## age
## income1 .
## eduT1
## settlementRegional center *
## settlementDistrict center ***
## settlementAnother city, town (not a regional or district center) **
## settlementVillage *
## regionRU: North Caucasian federal district
## regionRU: Urals Federal District
## regionRU: North West Federal District
## regionRU: South Federal District
## regionRU: Siberian Federal District
## regionRU: Volga; Privolzhsky Federal District
## regionRU: Central Federal District .
## intinpolSomewhat interested *
## intinpolNot very interested ***
## intinpolNot at all interested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2247.4 on 1718 degrees of freedom
## (74 observations deleted due to missingness)
## AIC: 2283.4
##
## Number of Fisher Scoring iterations: 4
#and find the best model, given your data
From the summary of the first general logistic regression model it can be seen that significant relation provide settlement and interest in politics. For that reason I have tried model with only two such variables - it provides significant relation of predictors with the outcome, but its AIC is much higher.
# settlement and interest in politics only
model_2 <- glm(Q211 ~ settlement + intinpol, data = df, family="binomial"(link=logit))
summary(model_2) #AIC: 2371.6
##
## Call:
## glm(formula = Q211 ~ settlement + intinpol, family = binomial(link = logit),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6126 -1.0390 -0.7803 1.2656 1.8061
##
## Coefficients:
## Estimate
## (Intercept) 0.9823
## settlementRegional center -0.7572
## settlementDistrict center -1.1291
## settlementAnother city, town (not a regional or district center) -1.1371
## settlementVillage -0.8334
## intinpolSomewhat interested -0.3538
## intinpolNot very interested -0.4835
## intinpolNot at all interested -1.2584
## Std. Error
## (Intercept) 0.2339
## settlementRegional center 0.1922
## settlementDistrict center 0.1939
## settlementAnother city, town (not a regional or district center) 0.3287
## settlementVillage 0.1946
## intinpolSomewhat interested 0.1853
## intinpolNot very interested 0.1838
## intinpolNot at all interested 0.2144
## z value
## (Intercept) 4.199
## settlementRegional center -3.939
## settlementDistrict center -5.822
## settlementAnother city, town (not a regional or district center) -3.459
## settlementVillage -4.283
## intinpolSomewhat interested -1.909
## intinpolNot very interested -2.631
## intinpolNot at all interested -5.871
## Pr(>|z|)
## (Intercept) 2.68e-05
## settlementRegional center 8.19e-05
## settlementDistrict center 5.81e-09
## settlementAnother city, town (not a regional or district center) 0.000541
## settlementVillage 1.85e-05
## intinpolSomewhat interested 0.056267
## intinpolNot very interested 0.008517
## intinpolNot at all interested 4.34e-09
##
## (Intercept) ***
## settlementRegional center ***
## settlementDistrict center ***
## settlementAnother city, town (not a regional or district center) ***
## settlementVillage ***
## intinpolSomewhat interested .
## intinpolNot very interested **
## intinpolNot at all interested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2440.9 on 1803 degrees of freedom
## Residual deviance: 2355.6 on 1796 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 2371.6
##
## Number of Fisher Scoring iterations: 4
Then I have run model with the same two significant variables (settlement and interest in politics) and added age variable in one model and income in another. AIC is better in the model with income, although income is insignificant here, as well as age. However, for now I will take income variable into the model.
# age + settlement + intinpol
model_3_1 <- glm(Q211 ~ age + settlement + intinpol, data = df, family="binomial"(link=logit))
summary(model_3_1) #AIC: 2373.5
##
## Call:
## glm(formula = Q211 ~ age + settlement + intinpol, family = binomial(link = logit),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6198 -1.0444 -0.7783 1.2683 1.8098
##
## Coefficients:
## Estimate
## (Intercept) 0.9581686
## age 0.0006893
## settlementRegional center -0.7562321
## settlementDistrict center -1.1300942
## settlementAnother city, town (not a regional or district center) -1.1311374
## settlementVillage -0.8355760
## intinpolSomewhat interested -0.3512133
## intinpolNot very interested -0.4791006
## intinpolNot at all interested -1.2532920
## Std. Error
## (Intercept) 0.2552441
## age 0.0029240
## settlementRegional center 0.1923113
## settlementDistrict center 0.1940143
## settlementAnother city, town (not a regional or district center) 0.3296424
## settlementVillage 0.1948539
## intinpolSomewhat interested 0.1856856
## intinpolNot very interested 0.1847034
## intinpolNot at all interested 0.2154344
## z value
## (Intercept) 3.754
## age 0.236
## settlementRegional center -3.932
## settlementDistrict center -5.825
## settlementAnother city, town (not a regional or district center) -3.431
## settlementVillage -4.288
## intinpolSomewhat interested -1.891
## intinpolNot very interested -2.594
## intinpolNot at all interested -5.818
## Pr(>|z|)
## (Intercept) 0.000174
## age 0.813632
## settlementRegional center 8.41e-05
## settlementDistrict center 5.72e-09
## settlementAnother city, town (not a regional or district center) 0.000600
## settlementVillage 1.80e-05
## intinpolSomewhat interested 0.058566
## intinpolNot very interested 0.009490
## intinpolNot at all interested 5.97e-09
##
## (Intercept) ***
## age
## settlementRegional center ***
## settlementDistrict center ***
## settlementAnother city, town (not a regional or district center) ***
## settlementVillage ***
## intinpolSomewhat interested .
## intinpolNot very interested **
## intinpolNot at all interested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2440.9 on 1803 degrees of freedom
## Residual deviance: 2355.5 on 1795 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 2373.5
##
## Number of Fisher Scoring iterations: 4
# income + settlement + intinpol
model_3_2 <- glm(Q211 ~ income1 + settlement + intinpol, data = df, family="binomial"(link=logit))
summary(model_3_2) #AIC: 2299
##
## Call:
## glm(formula = Q211 ~ income1 + settlement + intinpol, family = binomial(link = logit),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7385 -1.0498 -0.8044 1.2668 1.8603
##
## Coefficients:
## Estimate
## (Intercept) 1.26179
## income1 -0.04819
## settlementRegional center -0.75756
## settlementDistrict center -1.13768
## settlementAnother city, town (not a regional or district center) -1.14404
## settlementVillage -0.85176
## intinpolSomewhat interested -0.42486
## intinpolNot very interested -0.54943
## intinpolNot at all interested -1.32210
## Std. Error
## (Intercept) 0.26340
## income1 0.02642
## settlementRegional center 0.19453
## settlementDistrict center 0.19587
## settlementAnother city, town (not a regional or district center) 0.33594
## settlementVillage 0.19767
## intinpolSomewhat interested 0.19018
## intinpolNot very interested 0.18854
## intinpolNot at all interested 0.21939
## z value
## (Intercept) 4.790
## income1 -1.824
## settlementRegional center -3.894
## settlementDistrict center -5.808
## settlementAnother city, town (not a regional or district center) -3.406
## settlementVillage -4.309
## intinpolSomewhat interested -2.234
## intinpolNot very interested -2.914
## intinpolNot at all interested -6.026
## Pr(>|z|)
## (Intercept) 1.66e-06
## income1 0.06815
## settlementRegional center 9.85e-05
## settlementDistrict center 6.31e-09
## settlementAnother city, town (not a regional or district center) 0.00066
## settlementVillage 1.64e-05
## intinpolSomewhat interested 0.02549
## intinpolNot very interested 0.00357
## intinpolNot at all interested 1.68e-09
##
## (Intercept) ***
## income1 .
## settlementRegional center ***
## settlementDistrict center ***
## settlementAnother city, town (not a regional or district center) ***
## settlementVillage ***
## intinpolSomewhat interested *
## intinpolNot very interested **
## intinpolNot at all interested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2366.8 on 1742 degrees of freedom
## Residual deviance: 2281.0 on 1734 degrees of freedom
## (67 observations deleted due to missingness)
## AIC: 2299
##
## Number of Fisher Scoring iterations: 4
# age + income + settlement + intinpol
#model_3_3 <- glm(Q211 ~ income1 + settlement + intinpol + age, data = df, family="binomial"(link=logit))
#summary(model_3_3) #AIC: 2300, neither age, nor income gave here significant result, so I do not add this model into further comparison
Than I’ve decided which variables can I also try, to make the model better (because the smallest AIC is still provides the general model with lots of insignificant predictors, which I do not really like). I have added binary education variable. AIC is still bigger than from the full model, and education do not give significant result, BUT in this case income becomes significant predictor, which I do like.
# income + settlement + intinpol + edu
model_4 <- glm(Q211 ~ income1 + eduT + settlement + intinpol, data = df, family="binomial"(link=logit))
summary(model_4) #AIC: 2289
##
## Call:
## glm(formula = Q211 ~ income1 + eduT + settlement + intinpol,
## family = binomial(link = logit), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7889 -1.0431 -0.8065 1.2501 1.8931
##
## Coefficients:
## Estimate
## (Intercept) 1.20299
## income1 -0.05731
## eduT1 0.17166
## settlementRegional center -0.72626
## settlementDistrict center -1.10399
## settlementAnother city, town (not a regional or district center) -1.11736
## settlementVillage -0.80521
## intinpolSomewhat interested -0.41660
## intinpolNot very interested -0.55018
## intinpolNot at all interested -1.30746
## Std. Error
## (Intercept) 0.26584
## income1 0.02733
## eduT1 0.11166
## settlementRegional center 0.19527
## settlementDistrict center 0.19716
## settlementAnother city, town (not a regional or district center) 0.33658
## settlementVillage 0.20051
## intinpolSomewhat interested 0.19050
## intinpolNot very interested 0.18888
## intinpolNot at all interested 0.21972
## z value
## (Intercept) 4.525
## income1 -2.097
## eduT1 1.537
## settlementRegional center -3.719
## settlementDistrict center -5.599
## settlementAnother city, town (not a regional or district center) -3.320
## settlementVillage -4.016
## intinpolSomewhat interested -2.187
## intinpolNot very interested -2.913
## intinpolNot at all interested -5.951
## Pr(>|z|)
## (Intercept) 6.03e-06
## income1 0.035981
## eduT1 0.124219
## settlementRegional center 0.000200
## settlementDistrict center 2.15e-08
## settlementAnother city, town (not a regional or district center) 0.000901
## settlementVillage 5.92e-05
## intinpolSomewhat interested 0.028748
## intinpolNot very interested 0.003581
## intinpolNot at all interested 2.67e-09
##
## (Intercept) ***
## income1 *
## eduT1
## settlementRegional center ***
## settlementDistrict center ***
## settlementAnother city, town (not a regional or district center) ***
## settlementVillage ***
## intinpolSomewhat interested *
## intinpolNot very interested **
## intinpolNot at all interested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2269.9 on 1726 degrees of freedom
## (74 observations deleted due to missingness)
## AIC: 2289.9
##
## Number of Fisher Scoring iterations: 4
I have compared all the models by AIC. The smallest AIC has the full model, however there only two predictors are significant. For that reason I have started to try other model and compare AIC of those. In the result I can say that I prefer the last model, where income, settlement and interest in politics are significant predictors and one insignificant predictor - education. Below there is the table with all models’ AIC scores. I have also added a table with predictors where asterisk shows if the predictor was significant in that model. Below there will be the model equation.
# the lower the better
#AIC(model_full) # 2283.4
#AIC(model_2) # 2371.6
#AIC(model_3_1) # 2373.5
#AIC(model_3_2) # 2299
#AIC(model_4) # 2289
model <- c("model_full", "model_2", "model_3_1", "model_3_2", "model_4")
predictors <- c("interest in politics* + settlement* + income + age + edu + region", "interest in politics* + settlement*", "interest in politics* + settlement* + age", "interest in politics* + settlement* + income", "interest in politics* + settlement* + income* + edu")
AIC_score <- c("2283.4", "2371.6", "2373.5", "2299", "2289")
AIC_scores <- data.frame(model, predictors, AIC_score)
library(knitr)
library(kableExtra)
kable(AIC_scores) %>%
kable_styling(c("bordered"))%>%
add_header_above(c("Comparison of logistic regression models by AIC"=3)) %>%
row_spec(5, background = "lightsteelblue")
| model | predictors | AIC_score |
|---|---|---|
| model_full | interest in politics* + settlement* + income + age + edu + region | 2283.4 |
| model_2 | interest in politics* + settlement* | 2371.6 |
| model_3_1 | interest in politics* + settlement* + age | 2373.5 |
| model_3_2 | interest in politics* + settlement* + income | 2299 |
| model_4 | interest in politics* + settlement* + income* + edu | 2289 |
#install.packages("equatiomatic")
library(equatiomatic)
equatiomatic::extract_eq(model_4, use_coefs = TRUE)
\[ \log\left[ \frac { P( \operatorname{Q211} = \operatorname{1} ) }{ 1 - P( \operatorname{Q211} = \operatorname{1} ) } \right] = 1.2 - 0.06(\operatorname{income1}) + 0.17(\operatorname{eduT}_{\operatorname{1}}) - 0.73(\operatorname{settlement}_{\operatorname{Regional\ center}}) - 1.1(\operatorname{settlement}_{\operatorname{District\ center}}) - 1.12(\operatorname{settlement}_{\operatorname{Another\ city,\ town\ (not\ a\ regional\ or\ district\ center)}}) - 0.81(\operatorname{settlement}_{\operatorname{Village}}) - 0.42(\operatorname{intinpol}_{\operatorname{Somewhat\ interested}}) - 0.55(\operatorname{intinpol}_{\operatorname{Not\ very\ interested}}) - 1.31(\operatorname{intinpol}_{\operatorname{Not\ at\ all\ interested}}) + \epsilon \]
So, as we remember, the significant predictors here are interest in politics, settlement and income. According to the odds ratio (below) the following can be concluded.
exp(cbind(OR = coef(model_4), confint(model_4)))
## OR
## (Intercept) 3.3300552
## income1 0.9443019
## eduT1 1.1872700
## settlementRegional center 0.4837158
## settlementDistrict center 0.3315453
## settlementAnother city, town (not a regional or district center) 0.3271412
## settlementVillage 0.4469928
## intinpolSomewhat interested 0.6592834
## intinpolNot very interested 0.5768451
## intinpolNot at all interested 0.2705063
## 2.5 %
## (Intercept) 1.9869893
## income1 0.8948839
## eduT1 0.9536530
## settlementRegional center 0.3284886
## settlementDistrict center 0.2242568
## settlementAnother city, town (not a regional or district center) 0.1666414
## settlementVillage 0.3004486
## intinpolSomewhat interested 0.4528583
## intinpolNot very interested 0.3974378
## intinpolNot at all interested 0.1751707
## 97.5 %
## (Intercept) 5.6391781
## income1 0.9961249
## eduT1 1.4775889
## settlementRegional center 0.7070315
## settlementDistrict center 0.4862669
## settlementAnother city, town (not a regional or district center) 0.6265489
## settlementVillage 0.6600712
## intinpolSomewhat interested 0.9566525
## intinpolNot very interested 0.8342848
## intinpolNot at all interested 0.4148800
Income:
Settlement:
Interest in politics:
Below there is the calculation the predicted probability of participating in lawful demonstrations at each value of interest in politics, holding income at its mean. At this step I have decided to stop with two predictors only. From the graph below it can bу seen that the probability of participation in peaceful demonstrations is lower with hogher income of respondents, as well as decrease in respondent’s interest in politics give negative effect, too (respondent, who are not interested in politics are less likely to participate in demostrations and, vice versa, for respondents, who are interested in politics, probability to participate in lawful demonstrations is higher than of those, who are not interested.)
newdata1 <- with(df, data.frame(income1 = mean(income1, na.rm=TRUE),
intinpol = c("Very interested", "Somewhat interested", "Not very interested", "Not at all interested")))
newdata1
model_new <- glm(Q211 ~ income1 + intinpol, data = df, family="binomial"(link=logit))
#summary(model_new)
newdata1$intinpolP <- predict(model_new, newdata = newdata1, type = "response")
#newdata1
newdata2 <- with(df, data.frame(income1 = rep(seq(from = 1, to = 8, length.out = 100), 4),
intinpol = factor(rep(c("Very interested", "Somewhat interested", "Not very interested", "Not at all interested"), each = 100))))
newdata3 <- cbind(newdata2, predict(model_new, newdata = newdata2, type = "response", se = T))
#newdata3
newdata3 <- within(newdata3, {
PredictedProb <- fit
LL <- fit - (1.96 * se.fit)
UL <- fit + (1.96 * se.fit)
})
#head(newdata3)
ggplot(newdata3, aes(x = income1, y = PredictedProb)) +
geom_ribbon(aes(ymin = LL, ymax = UL, fill = intinpol), alpha = 0.2) +
geom_line(aes(colour = intinpol), size = 0.75) +
ylim(0, 1) +
theme_bw()
Accuracy for the model, that I have chosen is 0.6, which is not really great, but still not bad. In other words, the percentage of correct predictions is 60.48%. This tells us that for the 1164 observations (respondent) used in the model, the model correctly predicted whether or not somebody attended lawful demonstration 60.48% of the time.
library(caret)
set.seed(3456)
trainIndex <- createDataPartition(df$Q211, p = .67,
list = FALSE,
times = 1)
Train <- df[ trainIndex,] %>% na.omit()
Test <- df[trainIndex,] %>% na.omit()
#Train
#Test
model_train <- glm(Q211 ~ income1 + eduT + settlement + intinpol, data = Train, family=binomial)
#summary(model_train)
Test$model_prob <- predict(model_train, Test, type = "response")
Test <- Test %>% mutate(model_pred = 1*(model_prob > .53) + 0,
visit_binary = 1*(Q211 == "1") + 0)
Test <- Test %>% mutate(accurate = 1*(model_pred == visit_binary))
#str(Test)
sum(Test$accurate)/nrow(Test)
## [1] 0.6043103
library(car)
car::residualPlots(model_4)
## Test stat Pr(>|Test stat|)
## income1 0.6031 0.4374
## eduT
## settlement
## intinpol
car::marginalModelPlots(model_4)
vif(model_4)
## GVIF Df GVIF^(1/(2*Df))
## income1 1.111424 1 1.054241
## eduT 1.113826 1 1.055380
## settlement 1.066610 4 1.008093
## intinpol 1.042529 3 1.006966
Looking at missing data pattern:
df <- readRDS("ruwvs")
#install.packages("mice")
library(mice)
md.pattern(df)
## Q211 townsize settlement region age intinpol eduR eduT income income1
## 1736 1 1 1 1 1 1 1 1 1 1
## 59 1 1 1 1 1 1 1 1 0 0
## 7 1 1 1 1 1 1 0 0 1 1
## 2 1 1 1 1 1 1 0 0 0 0
## 4 1 1 1 1 1 0 1 1 1 1
## 1 1 1 1 1 1 0 1 1 0 0
## 1 1 1 1 1 1 0 0 0 0 0
## 0 0 0 0 0 6 10 10 63 63
##
## 1736 0
## 59 2
## 7 2
## 2 4
## 4 1
## 1 3
## 1 5
## 152
df_MICE <- mice(df,m=5,maxit=50,meth='pmm',seed=500)
summary(df_MICE)
## Class: mids
## Number of multiple imputations: 5
## Imputation methods:
## Q211 intinpol eduR townsize settlement income
## "" "pmm" "pmm" "" "" "pmm"
## region age income1 eduT
## "" "" "" "pmm"
## PredictorMatrix:
## Q211 intinpol eduR townsize settlement income region age
## Q211 0 1 1 1 1 1 1 1
## intinpol 1 0 1 1 1 1 1 1
## eduR 1 1 0 1 1 1 1 1
## townsize 1 1 1 0 1 1 1 1
## settlement 1 1 1 1 0 1 1 1
## income 1 1 1 1 1 0 1 1
## income1 eduT
## Q211 0 1
## intinpol 0 1
## eduR 0 1
## townsize 0 1
## settlement 0 1
## income 0 1
## Number of logged events: 1223
## it im dep meth
## 1 0 0 collinear
## 2 1 1 intinpol pmm
## 3 1 1 eduR pmm
## 4 1 1 income pmm
## 5 1 1 income pmm
## 6 1 1 eduT pmm
## out
## 1 income1
## 2 age17, age35, age89, age92, age93, age94, age95, age96, age98, age99, age100, age103
## 3 age17, age40, age89, age92, age93, age94, age95, age96, age98, age99, age100, age103
## 4 age17, age88, age89, age92, age93, age94, age95, age96, age98, age99, age100, age103, eduT1
## 5 * A ridge penalty had to be used to calculate the inverse crossproduct of the predictor matrix. Please remove duplicate variables or unique respondent names/numbers from the imputation model. It may be advisable to check the fraction of missing information (fmi) to evaluate the validity of the imputation model
## 6 eduRTertiary, age17, age35, age89, age92, age93, age94, age95, age96, age98, age99, age100, age103
completedData <- complete(df_MICE,1)
xyplot(df_MICE,Q211 ~ income1 + eduT + settlement + intinpol,pch=18,cex=1)
stripplot(df_MICE, pch = 20, cex = 1.2)
The similar can be done with VIM also:
library(VIM)
aggr_plot <- aggr(df, col=c('cornflowerblue','sienna3'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## income 0.034806630
## income1 0.034806630
## eduR 0.005524862
## eduT 0.005524862
## intinpol 0.003314917
## Q211 0.000000000
## townsize 0.000000000
## settlement 0.000000000
## region 0.000000000
## age 0.000000000
df_kNN <- kNN(df, variable = colnames(df), k = 6)
summary(df_kNN) # no NAs - great!
## Q211 intinpol eduR
## 0:1072 Very interested :148 Primary : 36
## 1: 738 Somewhat interested :639 Secondary :470
## Not very interested :715 Post-secondary:699
## Not at all interested:308 Tertiary :605
##
##
##
## townsize
## Under 5,000 :391
## 5000-20000 :203
## 20000-100000 :284
## 100000-500000 :342
## 500000 and more:590
##
##
## settlement
## Capital city :149
## Regional center :569
## District center :542
## Another city, town (not a regional or district center): 58
## Village :492
##
##
## income region
## Fifth step :395 RU: Central Federal District :489
## Sixth step :291 RU: Volga; Privolzhsky Federal District:382
## Fourth step :278 RU: Siberian Federal District :239
## Third step :266 RU: South Federal District :183
## Seventh step:200 RU: North West Federal District :177
## second step :136 RU: Urals Federal District :156
## (Other) :244 (Other) :184
## age income1 eduT Q211_imp intinpol_imp
## 35 : 47 Min. :0.000 0:1205 Mode :logical Mode :logical
## 40 : 46 1st Qu.:2.000 1: 605 FALSE:1810 FALSE:1804
## 27 : 44 Median :4.000 TRUE :6
## 33 : 44 Mean :3.823
## 36 : 44 3rd Qu.:5.000
## 30 : 43 Max. :9.000
## (Other):1542
## eduR_imp townsize_imp settlement_imp income_imp
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1800 FALSE:1810 FALSE:1810 FALSE:1747
## TRUE :10 TRUE :63
##
##
##
##
## region_imp age_imp income1_imp eduT_imp
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1810 FALSE:1810 FALSE:1747 FALSE:1800
## TRUE :63 TRUE :10
##
##
##
##
After comparing models with and without impulation I may say that the results stay the same, the model has not been significantly improved after the imputation process.
summary(model_4) # with na.omit, AIC: 2289.9
##
## Call:
## glm(formula = Q211 ~ income1 + eduT + settlement + intinpol,
## family = binomial(link = logit), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7889 -1.0431 -0.8065 1.2501 1.8931
##
## Coefficients:
## Estimate
## (Intercept) 1.20299
## income1 -0.05731
## eduT1 0.17166
## settlementRegional center -0.72626
## settlementDistrict center -1.10399
## settlementAnother city, town (not a regional or district center) -1.11736
## settlementVillage -0.80521
## intinpolSomewhat interested -0.41660
## intinpolNot very interested -0.55018
## intinpolNot at all interested -1.30746
## Std. Error
## (Intercept) 0.26584
## income1 0.02733
## eduT1 0.11166
## settlementRegional center 0.19527
## settlementDistrict center 0.19716
## settlementAnother city, town (not a regional or district center) 0.33658
## settlementVillage 0.20051
## intinpolSomewhat interested 0.19050
## intinpolNot very interested 0.18888
## intinpolNot at all interested 0.21972
## z value
## (Intercept) 4.525
## income1 -2.097
## eduT1 1.537
## settlementRegional center -3.719
## settlementDistrict center -5.599
## settlementAnother city, town (not a regional or district center) -3.320
## settlementVillage -4.016
## intinpolSomewhat interested -2.187
## intinpolNot very interested -2.913
## intinpolNot at all interested -5.951
## Pr(>|z|)
## (Intercept) 6.03e-06
## income1 0.035981
## eduT1 0.124219
## settlementRegional center 0.000200
## settlementDistrict center 2.15e-08
## settlementAnother city, town (not a regional or district center) 0.000901
## settlementVillage 5.92e-05
## intinpolSomewhat interested 0.028748
## intinpolNot very interested 0.003581
## intinpolNot at all interested 2.67e-09
##
## (Intercept) ***
## income1 *
## eduT1
## settlementRegional center ***
## settlementDistrict center ***
## settlementAnother city, town (not a regional or district center) ***
## settlementVillage ***
## intinpolSomewhat interested *
## intinpolNot very interested **
## intinpolNot at all interested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2269.9 on 1726 degrees of freedom
## (74 observations deleted due to missingness)
## AIC: 2289.9
##
## Number of Fisher Scoring iterations: 4
model_4_knn <- glm(Q211 ~ income1 + eduT + settlement + intinpol, data = df_kNN, family="binomial"(link=logit))
summary(model_4_knn) #with kNN imputation - coefficients are the same, but AIC became larger (2373.2)
##
## Call:
## glm(formula = Q211 ~ income1 + eduT + settlement + intinpol,
## family = binomial(link = logit), data = df_kNN)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7843 -1.0383 -0.7887 1.2587 1.9423
##
## Coefficients:
## Estimate
## (Intercept) 1.21509
## income1 -0.07292
## eduT1 0.14909
## settlementRegional center -0.77185
## settlementDistrict center -1.12085
## settlementAnother city, town (not a regional or district center) -1.13235
## settlementVillage -0.83998
## intinpolSomewhat interested -0.34073
## intinpolNot very interested -0.49130
## intinpolNot at all interested -1.30567
## Std. Error
## (Intercept) 0.26283
## income1 0.02697
## eduT1 0.10998
## settlementRegional center 0.19368
## settlementDistrict center 0.19618
## settlementAnother city, town (not a regional or district center) 0.32855
## settlementVillage 0.19904
## intinpolSomewhat interested 0.18627
## intinpolNot very interested 0.18459
## intinpolNot at all interested 0.21592
## z value
## (Intercept) 4.623
## income1 -2.704
## eduT1 1.356
## settlementRegional center -3.985
## settlementDistrict center -5.713
## settlementAnother city, town (not a regional or district center) -3.447
## settlementVillage -4.220
## intinpolSomewhat interested -1.829
## intinpolNot very interested -2.662
## intinpolNot at all interested -6.047
## Pr(>|z|)
## (Intercept) 3.78e-06
## income1 0.006851
## eduT1 0.175235
## settlementRegional center 6.74e-05
## settlementDistrict center 1.11e-08
## settlementAnother city, town (not a regional or district center) 0.000568
## settlementVillage 2.44e-05
## intinpolSomewhat interested 0.067355
## intinpolNot very interested 0.007777
## intinpolNot at all interested 1.47e-09
##
## (Intercept) ***
## income1 **
## eduT1
## settlementRegional center ***
## settlementDistrict center ***
## settlementAnother city, town (not a regional or district center) ***
## settlementVillage ***
## intinpolSomewhat interested .
## intinpolNot very interested **
## intinpolNot at all interested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2447.2 on 1809 degrees of freedom
## Residual deviance: 2352.8 on 1800 degrees of freedom
## AIC: 2372.8
##
## Number of Fisher Scoring iterations: 4
model_4_mice <- glm(Q211 ~ income1 + eduT + settlement + intinpol, data = df_MICE$data, family="binomial"(link=logit))
summary(model_4_mice) #coefficient and AIC are the same as in the model without imputation
##
## Call:
## glm(formula = Q211 ~ income1 + eduT + settlement + intinpol,
## family = binomial(link = logit), data = df_MICE$data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7889 -1.0431 -0.8065 1.2501 1.8931
##
## Coefficients:
## Estimate
## (Intercept) 1.20299
## income1 -0.05731
## eduT1 0.17166
## settlementRegional center -0.72626
## settlementDistrict center -1.10399
## settlementAnother city, town (not a regional or district center) -1.11736
## settlementVillage -0.80521
## intinpolSomewhat interested -0.41660
## intinpolNot very interested -0.55018
## intinpolNot at all interested -1.30746
## Std. Error
## (Intercept) 0.26584
## income1 0.02733
## eduT1 0.11166
## settlementRegional center 0.19527
## settlementDistrict center 0.19716
## settlementAnother city, town (not a regional or district center) 0.33658
## settlementVillage 0.20051
## intinpolSomewhat interested 0.19050
## intinpolNot very interested 0.18888
## intinpolNot at all interested 0.21972
## z value
## (Intercept) 4.525
## income1 -2.097
## eduT1 1.537
## settlementRegional center -3.719
## settlementDistrict center -5.599
## settlementAnother city, town (not a regional or district center) -3.320
## settlementVillage -4.016
## intinpolSomewhat interested -2.187
## intinpolNot very interested -2.913
## intinpolNot at all interested -5.951
## Pr(>|z|)
## (Intercept) 6.03e-06
## income1 0.035981
## eduT1 0.124219
## settlementRegional center 0.000200
## settlementDistrict center 2.15e-08
## settlementAnother city, town (not a regional or district center) 0.000901
## settlementVillage 5.92e-05
## intinpolSomewhat interested 0.028748
## intinpolNot very interested 0.003581
## intinpolNot at all interested 2.67e-09
##
## (Intercept) ***
## income1 *
## eduT1
## settlementRegional center ***
## settlementDistrict center ***
## settlementAnother city, town (not a regional or district center) ***
## settlementVillage ***
## intinpolSomewhat interested *
## intinpolNot very interested **
## intinpolNot at all interested ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2269.9 on 1726 degrees of freedom
## (74 observations deleted due to missingness)
## AIC: 2289.9
##
## Number of Fisher Scoring iterations: 4