##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max range skew
## Q211* 1 1013 1.00 0.00 1 1.00 0.00 1 1 0 NaN
## intinpol* 2 1013 2.75 0.86 3 2.77 1.48 1 4 3 -0.10
## townsize* 3 1013 3.31 1.51 4 3.38 1.48 1 5 4 -0.32
## settlement* 4 1013 3.16 1.27 3 3.14 1.48 1 5 4 0.33
## region* 5 1013 4.25 2.29 4 4.19 2.97 1 8 7 0.26
## age 6 1013 45.34 17.20 43 44.46 19.27 18 91 73 0.38
## income1 7 1013 3.80 1.93 4 3.77 1.48 0 9 9 0.13
## eduT* 8 1013 1.31 0.46 1 1.26 0.00 1 2 1 0.83
## kurtosis se
## Q211* NaN 0.00
## intinpol* -0.77 0.03
## townsize* -1.35 0.05
## settlement* -1.13 0.04
## region* -1.22 0.07
## age -0.82 0.54
## income1 -0.37 0.06
## eduT* -1.32 0.01
## ------------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max range skew
## Q211* 1 723 2.00 0.00 2 2.00 0.00 2 2 0 NaN
## intinpol* 2 723 2.50 0.82 3 2.50 1.48 1 4 3 -0.02
## townsize* 3 723 3.30 1.59 4 3.37 1.48 1 5 4 -0.34
## settlement* 4 723 3.00 1.40 3 3.00 1.48 1 5 4 0.32
## region* 5 723 3.85 2.19 4 3.72 2.97 1 8 7 0.53
## age 6 723 45.93 16.86 45 45.30 19.27 18 90 72 0.26
## income1 7 723 3.73 1.94 4 3.72 1.48 0 9 9 0.07
## eduT* 8 723 1.36 0.48 1 1.32 0.00 1 2 1 0.58
## kurtosis se
## Q211* NaN 0.00
## intinpol* -0.54 0.03
## townsize* -1.44 0.06
## settlement* -1.22 0.05
## region* -1.01 0.08
## age -0.85 0.63
## income1 -0.45 0.07
## eduT* -1.66 0.02
Overall, our variable of interest, attending lawful demonstrations (Q211), has two levels: 723 citizens would attend such demonstrations and 1013 would not.
Interestingly, in most cases people who would attend lawful demonstrations have a higher median age than those who would not (overall, 45 versus 43 years).
The largest share of citizens who would attend lawful demonstrations live in regional centers, district centers, or villages.
Those who would attend lawful demonstrations typically report the fifth step of the income scale, regardless of their level of education.
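A grouped summary of this kind can be sketched in base R. The data frame `demo` below is simulated and purely hypothetical (it is not the survey data), and `aggregate()` is only a stand-in for whichever function produced the table above.

```r
# Hypothetical stand-in data: a binary outcome and one numeric covariate.
set.seed(1)
demo <- data.frame(
  Q211 = factor(rep(c(0, 1), times = c(60, 40))),
  age  = round(runif(100, min = 18, max = 90))
)

# Group sizes for the two levels of the outcome.
group_n <- table(demo$Q211)

# Mean, standard deviation and median of age within each group.
age_by_group <- aggregate(
  age ~ Q211, data = demo,
  FUN = function(x) c(mean = mean(x), sd = sd(x), median = median(x))
)

print(group_n)
print(age_by_group)
```

Comparing the group medians in such a table is what supports statements like the one about age above.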
##
## Call:
## glm(formula = Q211 ~ intinpol + settlement + eduT + income1 +
## age + region, family = "binomial", data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8052 -1.0272 -0.7765 1.2232 2.0811
##
## Coefficients:
## Estimate
## (Intercept) 0.9960171
## intinpolSomewhat interested -0.4609568
## intinpolNot very interested -0.6365091
## intinpolNot at all interested -1.4088843
## settlementRegional center -0.4597112
## settlementDistrict center -0.8721639
## settlementAnother city, town (not a regional or district center) -0.9810863
## settlementVillage -0.5133248
## eduT1 0.1788241
## income1 -0.0544330
## age -0.0009979
## regionRU: Central Federal District 0.3000963
## regionRU: North Caucasian federal district -0.0151321
## regionRU: Volga; Privolzhsky Federal District 0.1584737
## regionRU: Urals Federal District -0.2816908
## regionRU: Far East Federal District -0.1589217
## regionRU: Siberian Federal District 0.2246221
## regionRU: South Federal District -0.4919488
## Std. Error
## (Intercept) 0.3734018
## intinpolSomewhat interested 0.1937967
## intinpolNot very interested 0.1931379
## intinpolNot at all interested 0.2251833
## settlementRegional center 0.2254509
## settlementDistrict center 0.2170352
## settlementAnother city, town (not a regional or district center) 0.3581382
## settlementVillage 0.2284048
## eduT1 0.1138132
## income1 0.0289287
## age 0.0031580
## regionRU: Central Federal District 0.2042000
## regionRU: North Caucasian federal district 0.2849057
## regionRU: Volga; Privolzhsky Federal District 0.2011626
## regionRU: Urals Federal District 0.2401377
## regionRU: Far East Federal District 0.2952891
## regionRU: Siberian Federal District 0.2140290
## regionRU: South Federal District 0.2419097
## z value
## (Intercept) 2.667
## intinpolSomewhat interested -2.379
## intinpolNot very interested -3.296
## intinpolNot at all interested -6.257
## settlementRegional center -2.039
## settlementDistrict center -4.019
## settlementAnother city, town (not a regional or district center) -2.739
## settlementVillage -2.247
## eduT1 1.571
## income1 -1.882
## age -0.316
## regionRU: Central Federal District 1.470
## regionRU: North Caucasian federal district -0.053
## regionRU: Volga; Privolzhsky Federal District 0.788
## regionRU: Urals Federal District -1.173
## regionRU: Far East Federal District -0.538
## regionRU: Siberian Federal District 1.049
## regionRU: South Federal District -2.034
## Pr(>|z|)
## (Intercept) 0.007644 **
## intinpolSomewhat interested 0.017381 *
## intinpolNot very interested 0.000982 ***
## intinpolNot at all interested 3.93e-10 ***
## settlementRegional center 0.041443 *
## settlementDistrict center 5.86e-05 ***
## settlementAnother city, town (not a regional or district center) 0.006155 **
## settlementVillage 0.024612 *
## eduT1 0.116135
## income1 0.059886 .
## age 0.752001
## regionRU: Central Federal District 0.141665
## regionRU: North Caucasian federal district 0.957642
## regionRU: Volga; Privolzhsky Federal District 0.430820
## regionRU: Urals Federal District 0.240780
## regionRU: Far East Federal District 0.590446
## regionRU: Siberian Federal District 0.293951
## regionRU: South Federal District 0.041991 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2247.4 on 1718 degrees of freedom
## AIC: 2283.4
##
## Number of Fisher Scoring iterations: 4
The null hypothesis is that the given variable has no effect on attending lawful demonstrations. The alternative hypothesis is that the given variable does have an effect on attending lawful demonstrations.
Let us first look at the coefficients whose p-value is below 0.05. If intinpol, settlement, and region (South Federal District only) had no effect, the probability of observing a result like this or a more extreme one would be less than 5%. Therefore, we reject the null hypothesis for them and conclude that these variables are significantly associated with attending lawful demonstrations.
Next, we look at the variables whose p-value is above 0.05: age, eduT, and income1. If these variables had no effect, the probability of observing a result like this or a more extreme one would be more than 5%. Therefore, we fail to reject the null hypothesis: there is no evidence that these variables are significantly associated with attending lawful demonstrations.
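The decision rule described above can be sketched as follows. The data are simulated and the names `sim`, `interest`, and `attend` are hypothetical; only `glm()` with a binomial family mirrors the call shown in the output.

```r
set.seed(42)
n <- 500
sim <- data.frame(
  interest = factor(sample(c("high", "low"), n, replace = TRUE)),
  age      = round(runif(n, 18, 90))
)
# Simulate an outcome that depends on interest but not on age.
eta <- 0.5 - 1.0 * (sim$interest == "low")
sim$attend <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(attend ~ interest + age, family = binomial, data = sim)

# Coefficient table: estimate, standard error, z value and p-value.
coef_table <- summary(fit)$coefficients
p_values <- coef_table[, "Pr(>|z|)"]

# Coefficients for which the null hypothesis is rejected at the 5% level.
print(names(p_values)[p_values < 0.05])
```

Each p-value belongs to one coefficient, so for a factor with several levels the test compares each level with the reference level rather than testing the factor as a whole.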
Next, we want to decide which variables to include in the final model, dropping insignificant ones to get rid of the noise they can introduce.
##
## Call:
## glm(formula = Q211 ~ intinpol + settlement + eduT + income1 +
## region, family = "binomial", data = df2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8079 -1.0261 -0.7772 1.2216 2.0782
##
## Coefficients:
## Estimate
## (Intercept) 0.93837
## intinpolSomewhat interested -0.45765
## intinpolNot very interested -0.63021
## intinpolNot at all interested -1.39997
## settlementRegional center -0.45944
## settlementDistrict center -0.87407
## settlementAnother city, town (not a regional or district center) -0.97458
## settlementVillage -0.51604
## eduT1 0.18247
## income1 -0.05216
## regionRU: Central Federal District 0.29634
## regionRU: North Caucasian federal district -0.01422
## regionRU: Volga; Privolzhsky Federal District 0.15736
## regionRU: Urals Federal District -0.28383
## regionRU: Far East Federal District -0.16103
## regionRU: Siberian Federal District 0.22390
## regionRU: South Federal District -0.49261
## Std. Error
## (Intercept) 0.32573
## intinpolSomewhat interested 0.19349
## intinpolNot very interested 0.19207
## intinpolNot at all interested 0.22336
## settlementRegional center 0.22548
## settlementDistrict center 0.21698
## settlementAnother city, town (not a regional or district center) 0.35755
## settlementVillage 0.22827
## eduT1 0.11323
## income1 0.02802
## regionRU: Central Federal District 0.20388
## regionRU: North Caucasian federal district 0.28489
## regionRU: Volga; Privolzhsky Federal District 0.20116
## regionRU: Urals Federal District 0.24005
## regionRU: Far East Federal District 0.29521
## regionRU: Siberian Federal District 0.21403
## regionRU: South Federal District 0.24191
## z value
## (Intercept) 2.881
## intinpolSomewhat interested -2.365
## intinpolNot very interested -3.281
## intinpolNot at all interested -6.268
## settlementRegional center -2.038
## settlementDistrict center -4.028
## settlementAnother city, town (not a regional or district center) -2.726
## settlementVillage -2.261
## eduT1 1.611
## income1 -1.862
## regionRU: Central Federal District 1.453
## regionRU: North Caucasian federal district -0.050
## regionRU: Volga; Privolzhsky Federal District 0.782
## regionRU: Urals Federal District -1.182
## regionRU: Far East Federal District -0.545
## regionRU: Siberian Federal District 1.046
## regionRU: South Federal District -2.036
## Pr(>|z|)
## (Intercept) 0.00397 **
## intinpolSomewhat interested 0.01802 *
## intinpolNot very interested 0.00103 **
## intinpolNot at all interested 3.66e-10 ***
## settlementRegional center 0.04159 *
## settlementDistrict center 5.62e-05 ***
## settlementAnother city, town (not a regional or district center) 0.00642 **
## settlementVillage 0.02378 *
## eduT1 0.10707
## income1 0.06261 .
## regionRU: Central Federal District 0.14609
## regionRU: North Caucasian federal district 0.96018
## regionRU: Volga; Privolzhsky Federal District 0.43407
## regionRU: Urals Federal District 0.23706
## regionRU: Far East Federal District 0.58542
## regionRU: Siberian Federal District 0.29552
## regionRU: South Federal District 0.04172 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2357.9 on 1735 degrees of freedom
## Residual deviance: 2247.5 on 1719 degrees of freedom
## AIC: 2281.5
##
## Number of Fisher Scoring iterations: 4
We dropped age. Substantively, both younger and older people can be equally involved in lawful demonstrations, depending on the topic at stake. However, two more variables had insignificant p-values, so let's take a closer look at building the model step by step.
It appears that adding eduT does not contribute to our model, so the last step to consider is dropping it. Although the p-value of income1 was insignificant, it contributes to the model, so this variable is kept.
The final model to discuss is the one without eduT and age.
According to AIC, the second model (with eduT) has the lowest value (2281.5), so we can say that it is the best one.
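A step-by-step comparison of nested models can be sketched with a likelihood-ratio test and AIC. The data and the names `m_small` and `m_large` are hypothetical. For a 0/1 outcome, AIC equals the residual deviance plus twice the number of coefficients, which agrees with the output above: 2247.5 + 2 × 17 = 2281.5 for the second model.

```r
set.seed(7)
n <- 400
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n)   # pure noise: unrelated to the outcome
)
sim$y <- rbinom(n, 1, plogis(-0.3 + 0.8 * sim$x1))

m_small <- glm(y ~ x1,      family = binomial, data = sim)
m_large <- glm(y ~ x1 + x2, family = binomial, data = sim)

# Likelihood-ratio test: does adding x2 reduce the deviance significantly?
lrt <- anova(m_small, m_large, test = "Chisq")
print(lrt)

# AIC adds a penalty of 2 per estimated coefficient; lower is better.
aic_values <- AIC(m_small, m_large)
print(aic_values)
```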
Let's look at variable importance in more detail:
Conclusion: The plot shows that all the variables are important for our model (importance > 0), so the final model is the second one.
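One common importance score for a logistic regression is the absolute z statistic of each coefficient. Whether the plot above used this measure is an assumption; the sketch below only illustrates the idea on simulated, hypothetical data.

```r
set.seed(3)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.2 + 1.0 * sim$x1 + 0.1 * sim$x2))

fit <- glm(y ~ x1 + x2, family = binomial, data = sim)

# Absolute z statistic per coefficient, intercept excluded, largest first.
z <- summary(fit)$coefficients[-1, "z value"]
importance <- sort(abs(z), decreasing = TRUE)
print(importance)
```

With a score of this kind, the ranking of the variables is usually more informative than the fact that a score is above zero.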
## Q211 ~ intinpol + settlement + eduT + income1 + region
\[ \begin{aligned}
\log\left[ \frac { P( \operatorname{Q211} = \operatorname{1} ) }{ 1 - P( \operatorname{Q211} = \operatorname{1} ) } \right] &= \alpha + \beta_{1}(\operatorname{intinpol}_{\operatorname{Somewhat\ interested}}) + \beta_{2}(\operatorname{intinpol}_{\operatorname{Not\ very\ interested}})\ + \\
&\quad \beta_{3}(\operatorname{intinpol}_{\operatorname{Not\ at\ all\ interested}}) + \beta_{4}(\operatorname{settlement}_{\operatorname{Regional\ center}}) + \beta_{5}(\operatorname{settlement}_{\operatorname{District\ center}})\ + \\
&\quad \beta_{6}(\operatorname{settlement}_{\operatorname{Another\ city,\ town\ (not\ a\ regional\ or\ district\ center)}}) + \beta_{7}(\operatorname{settlement}_{\operatorname{Village}}) + \beta_{8}(\operatorname{eduT}_{\operatorname{1}})\ + \\
&\quad \beta_{9}(\operatorname{income1}) + \beta_{10}(\operatorname{region}_{\operatorname{RU:\ Central\ Federal\ District}}) + \beta_{11}(\operatorname{region}_{\operatorname{RU:\ North\ Caucasian\ federal\ district}})\ + \\
&\quad \beta_{12}(\operatorname{region}_{\operatorname{RU:\ Volga;\ Privolzhsky\ Federal\ District}}) + \beta_{13}(\operatorname{region}_{\operatorname{RU:\ Urals\ Federal\ District}}) + \beta_{14}(\operatorname{region}_{\operatorname{RU:\ Far\ East\ Federal\ District}})\ + \\
&\quad \beta_{15}(\operatorname{region}_{\operatorname{RU:\ Siberian\ Federal\ District}}) + \beta_{16}(\operatorname{region}_{\operatorname{RU:\ South\ Federal\ District}})
\end{aligned} \]
## OR
## (Intercept) 2.5557994
## intinpolSomewhat interested 0.6327662
## intinpolNot very interested 0.5324778
## intinpolNot at all interested 0.2466042
## settlementRegional center 0.6316389
## settlementDistrict center 0.4172492
## settlementAnother city, town (not a regional or district center) 0.3773508
## settlementVillage 0.5968790
## eduT1 1.2001831
## income1 0.9491730
## regionRU: Central Federal District 1.3449239
## regionRU: North Caucasian federal district 0.9858775
## regionRU: Volga; Privolzhsky Federal District 1.1704140
## regionRU: Urals Federal District 0.7528927
## regionRU: Far East Federal District 0.8512669
## regionRU: Siberian Federal District 1.2509413
## regionRU: South Federal District 0.6110300
## 2.5 %
## (Intercept) 1.3532650
## intinpolSomewhat interested 0.4320207
## intinpolNot very interested 0.3644688
## intinpolNot at all interested 0.1585151
## settlementRegional center 0.4047389
## settlementDistrict center 0.2716470
## settlementAnother city, town (not a regional or district center) 0.1848084
## settlementVillage 0.3803199
## eduT1 0.9611032
## income1 0.8982903
## regionRU: Central Federal District 0.9033133
## regionRU: North Caucasian federal district 0.5614864
## regionRU: Volga; Privolzhsky Federal District 0.7901501
## regionRU: Urals Federal District 0.4692820
## regionRU: Far East Federal District 0.4732048
## regionRU: Siberian Federal District 0.8232027
## regionRU: South Federal District 0.3792655
## 97.5 %
## (Intercept) 4.8565439
## intinpolSomewhat interested 0.9233994
## intinpolNot very interested 0.7747165
## intinpolNot at all interested 0.3808177
## settlementRegional center 0.9804156
## settlementDistrict center 0.6365019
## settlementAnother city, town (not a regional or district center) 0.7539667
## settlementVillage 0.9314048
## eduT1 1.4983501
## income1 1.0026211
## regionRU: Central Federal District 2.0103054
## regionRU: North Caucasian federal district 1.7188931
## regionRU: Volga; Privolzhsky Federal District 1.7398356
## regionRU: Urals Federal District 1.2039454
## regionRU: Far East Federal District 1.5099103
## regionRU: Siberian Federal District 1.9064432
## regionRU: South Federal District 0.9800786
For now, we can draw conclusions about the direction of the effects. However, we will not interpret the effects of education and income, since they appear to be insignificant.
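Odds ratios are obtained by exponentiating the model coefficients; for example, exp(0.93837) ≈ 2.556 reproduces the intercept row above. The sketch below uses simulated, hypothetical data and Wald intervals from `confint.default()`. The intervals in the table above may come from a different method (for instance profile likelihood), so their limits need not match a Wald calculation exactly.

```r
set.seed(11)
n <- 400
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.4 - 0.7 * sim$x))

fit <- glm(y ~ x, family = binomial, data = sim)

# Exponentiate coefficients and Wald interval limits to get odds ratios.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(or_table)
```

An odds ratio below 1 corresponds to a negative coefficient, and an interval that contains 1 corresponds to a coefficient that is not significant at the 5% level.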
## [1] 3.770737
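Conclusion: Living in the Central, Volga (Privolzhsky), or Siberian Federal District, being interested in politics, or living in a capital city is associated with a higher probability of attending lawful demonstrations. Note that the confidence intervals for these three regional odds ratios include 1, so the regional differences are not statistically significant.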
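As a worked example of reading an odds ratio, take the coefficient for intinpol "Not at all interested" in the second model:

\[ \operatorname{OR} = e^{\beta_{3}} = e^{-1.39997} \approx 0.247 \]

Holding the other predictors fixed, the estimated odds of attending are therefore about 75% lower for respondents who are not at all interested in politics than for the reference category of intinpol.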
## xvals yvals upper lower
## 1 0.000 0.5361993 0.6270888 0.4453099
## 2 0.375 0.5313313 0.6198545 0.4428080
## 3 0.750 0.5264572 0.6128251 0.4400894
## 4 1.125 0.5215782 0.6060223 0.4371340
## 5 1.500 0.5166950 0.5994672 0.4339228
## 6 1.875 0.5118086 0.5931795 0.4304377
## 7 2.250 0.5069200 0.5871773 0.4266627
## 8 2.625 0.5020300 0.5814754 0.4225846
## 9 3.000 0.4971397 0.5760856 0.4181938
## 10 3.375 0.4922499 0.5710151 0.4134846
## 11 3.750 0.4873615 0.5662669 0.4084562
## 12 4.125 0.4824756 0.5618393 0.4031120
## 13 4.500 0.4775931 0.5577261 0.3974601
## 14 4.875 0.4727148 0.5539169 0.3915127
## 15 5.250 0.4678417 0.5503982 0.3852853
## 16 5.625 0.4629748 0.5471533 0.3787963
## 17 6.000 0.4581149 0.5441638 0.3720660
## 18 6.375 0.4532629 0.5414101 0.3651158
## 19 6.750 0.4484199 0.5388721 0.3579676
## 20 7.125 0.4435865 0.5365296 0.3506434
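Overall, higher income is associated with a lower predicted probability of attending lawful demonstrations: it falls from about 0.54 at the lowest income value to about 0.44 at the highest value shown.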
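A table of predicted probabilities over an income grid can be sketched with `predict()`. The data are simulated and hypothetical. As a design choice that may differ from how the table above was produced, the interval here is built on the log-odds scale and then back-transformed, which keeps the limits between 0 and 1.

```r
set.seed(5)
n <- 500
sim <- data.frame(income = sample(0:9, n, replace = TRUE))
sim$y <- rbinom(n, 1, plogis(0.1 - 0.05 * sim$income))

fit <- glm(y ~ income, family = binomial, data = sim)

# Predict on the link (log-odds) scale so the interval is symmetric there.
grid <- data.frame(income = seq(0, 9, length.out = 20))
pred <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)

# Back-transform the fit and approximate 95% limits to probabilities.
grid$prob  <- plogis(pred$fit)
grid$lower <- plogis(pred$fit - 1.96 * pred$se.fit)
grid$upper <- plogis(pred$fit + 1.96 * pred$se.fit)
print(head(grid))
```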
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 838 500
## 1 175 223
##
## Accuracy : 0.6112
## 95% CI : (0.5878, 0.6342)
## No Information Rate : 0.5835
## P-Value [Acc > NIR] : 0.0102
##
## Kappa : 0.145
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8272
## Specificity : 0.3084
## Pos Pred Value : 0.6263
## Neg Pred Value : 0.5603
## Prevalence : 0.5835
## Detection Rate : 0.4827
## Detection Prevalence : 0.7707
## Balanced Accuracy : 0.5678
##
## 'Positive' Class : 0
##
Here, predicted and observed outcomes are compared in a confusion matrix. A classification is correct when the predicted class of an observation coincides with its observed class. Otherwise, the prediction is wrong and the observation is misclassified.
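The headline statistics can be recomputed by hand from the four counts of the first confusion matrix above. This is a minimal base R sketch; the object name `cm` is introduced here for illustration.

```r
# Counts copied from the first confusion matrix above
# (rows = prediction, columns = reference; the positive class is 0).
cm <- matrix(c(838, 175, 500, 223), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # share of observed 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])  # share of observed 1s predicted as 1
nir         <- max(colSums(cm)) / sum(cm)     # no-information rate

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = nir), 4))
```

These values agree with the reported accuracy (0.6112), sensitivity (0.8272), specificity (0.3084), and no-information rate (0.5835).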
Next, we want to check whether our model is overfitted or not.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 156 83
## 1 56 45
##
## Accuracy : 0.5912
## 95% CI : (0.5368, 0.6439)
## No Information Rate : 0.6235
## P-Value [Acc > NIR] : 0.90052
##
## Kappa : 0.0912
##
## Mcnemar's Test P-Value : 0.02743
##
## Sensitivity : 0.7358
## Specificity : 0.3516
## Pos Pred Value : 0.6527
## Neg Pred Value : 0.4455
## Prevalence : 0.6235
## Detection Rate : 0.4588
## Detection Prevalence : 0.7029
## Balanced Accuracy : 0.5437
##
## 'Positive' Class : 0
##
The model's performance on the test sample remains about the same (accuracy of 0.59 versus 0.61), which suggests that we have avoided overfitting.
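An overfitting check of this kind can be sketched by holding out part of the data. The 20% split, the 0.5 cut-off, and the helper `accuracy()` are assumptions made for illustration, and the data are simulated.

```r
set.seed(123)
n <- 600
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.3 + 0.9 * sim$x))

# Hold out 20% of the rows as a test sample.
test_idx <- sample(seq_len(n), size = 0.2 * n)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

fit <- glm(y ~ x, family = binomial, data = train)

# Classify with a 0.5 probability cut-off and compute accuracy.
accuracy <- function(model, data) {
  prob <- predict(model, newdata = data, type = "response")
  mean(as.integer(prob > 0.5) == data$y)
}

acc_train <- accuracy(fit, train)
acc_test  <- accuracy(fit, test)
print(c(train = acc_train, test = acc_test))
```

A test accuracy far below the training accuracy would point to overfitting; similar values suggest the model generalises about as well as it fits.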
## Area under the curve: 0.5397
The model shows poor quality, since the AUC (0.54) is well below 0.8 and close to 0.5, the value expected from random guessing.
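To make the meaning of the AUC concrete, it can be computed in base R from ranks. The function name `auc_rank` and the four-observation example are hypothetical; this is a sketch of the rank-based formula, not necessarily how the value above was obtained.

```r
# AUC as the probability that a randomly chosen positive case gets a
# higher score than a randomly chosen negative case (ties count half).
auc_rank <- function(score, label) {
  r  <- rank(score)                 # average ranks handle ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Tiny example: 3 of the 4 positive/negative pairs are ordered correctly.
score <- c(0.1, 0.4, 0.35, 0.8)
label <- c(0,   0,   1,    1)
print(auc_rank(score, label))  # 0.75
```

A value of 0.5 means the scores order positive and negative cases no better than chance, which is why an AUC near 0.5 indicates weak discrimination.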
## Test stat Pr(>|Test stat|)
## intinpol
## settlement
## eduT
## income1 0.7792 0.3774
## region
As we can see, the test for income1 is not significant (p = 0.38), so there is no evidence of non-linearity and the linearity assumption holds.
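One way to check linearity in the log-odds for a numeric predictor is to add a squared term and test whether it improves the fit. The sketch below does this on simulated, hypothetical data; it illustrates the idea and is not claimed to be the exact procedure behind the table above.

```r
set.seed(9)
n <- 500
sim <- data.frame(income = sample(0:9, n, replace = TRUE))
# The simulated outcome is truly linear in income on the log-odds scale.
sim$y <- rbinom(n, 1, plogis(0.2 - 0.06 * sim$income))

m_linear <- glm(y ~ income,               family = binomial, data = sim)
m_curved <- glm(y ~ income + I(income^2), family = binomial, data = sim)

# A small p-value would indicate curvature, i.e. a violated assumption.
curvature_test <- anova(m_linear, m_curved, test = "Chisq")
print(curvature_test)
```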
## GVIF Df GVIF^(1/(2*Df))
## intinpol 1.092152 3 1.014800
## settlement 1.659764 4 1.065383
## eduT 1.130039 1 1.063033
## income1 1.151644 1 1.073147
## region 1.709112 7 1.039026
There is no multicollinearity, since all GVIF values are well below 10, so the assumption holds.
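The third column of the table is the GVIF raised to the power 1/(2·Df), which makes terms with different degrees of freedom comparable. The following sketch recomputes it from the first two columns; the vector names are introduced here for illustration.

```r
# GVIF and degrees of freedom copied from the table above.
gvif <- c(intinpol = 1.092152, settlement = 1.659764, eduT = 1.130039,
          income1 = 1.151644, region = 1.709112)
df   <- c(intinpol = 3, settlement = 4, eduT = 1, income1 = 1, region = 7)

# Adjusted value that is comparable across terms with different Df.
adjusted <- gvif^(1 / (2 * df))
print(round(adjusted, 6))
```

All adjusted values are close to 1, which is consistent with the conclusion that the predictors are not strongly collinear.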