library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data frame culture92 contains information about leisure practices and social profiles of the USSR respondents in 1992:

book_shop – how often the respondents visits book shops (1 - never, 4 - several times a month)

Tasks:

Recode book shop attendance into a binary variable with levels ‘does not visit’ and ‘visits’ (1p) Predict the probability of book shop attendance by gender, age, educational level, and educational level of respondent’s mother (1p) Interpret coefficients and effects of the predictors (4p) Interpret model fit (3p)

load('culture92.RData')

1 Variables

Recode book shop attendance into a binary variable with levels ‘does not visit’ and ‘visits’ (1p)

culture92$bs_att <- ifelse(culture92$book_shop == 1, 'does not visit', 
                      ifelse(culture92$book_shop == 2, 'visits',
                             ifelse(culture92$book_shop == 3, 'visits',
                                    ifelse(culture92$book_shop == 4, 'visits', NA))))

table(culture92$book_shop)
## 
##    1    2    3    4 
## 1076  863 1023  558
table(culture92$bs_att)
## 
## does not visit         visits 
##           1076           2444
culture_t <- select(culture92, c(bs_att, edu_lvl, edu_lvl_m, sex, age))
sapply(culture_t, function(x) sum(is.na(x)))
##    bs_att   edu_lvl edu_lvl_m       sex       age 
##       104        23        62         8        20
culture <- na.omit(culture_t)

table1(~bs_att + sex + 
         age + edu_lvl + edu_lvl_m, data = culture)
Overall
(N=3417)
bs_att
does not visit 1037 (30.3%)
visits 2380 (69.7%)
sex
male 1549 (45.3%)
female 1868 (54.7%)
age
Mean (SD) 40.8 (16.1)
Median [Min, Max] 39.0 [16.0, 85.0]
edu_lvl
tertiary 845 (24.7%)
secondary 1906 (55.8%)
below secondary 666 (19.5%)
edu_lvl_m
tertiary 366 (10.7%)
secondary 1086 (31.8%)
below secondary 1965 (57.5%)

In the table we can see that there are more book shop visitors than non-visitors (70% vs 30%). The gender distribution is almost equal (55% female and 45% male). Age of the sample varies from 16 to 85, with a mean at 39 y.o. The biggest group among educational levels is secondary. While among mothers of respondents the below secondary level is more frequent.

2 Model

Predict the probability of book shop attendance by gender, age, educational level, and educational level of respondent’s mother (1p)

2.1 log of odds

culture$bs_att_1 <- ifelse(culture$bs_att == 'does not visit', 0, 1)

mod1 <- glm(bs_att_1 ~ sex + age + edu_lvl + edu_lvl_m, data = culture, family = binomial)
summary(mod1)
## 
## Call:
## glm(formula = bs_att_1 ~ sex + age + edu_lvl + edu_lvl_m, family = binomial, 
##     data = culture)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               2.872860   0.198001  14.509  < 2e-16 ***
## sexfemale                 0.280000   0.081860   3.420 0.000625 ***
## age                      -0.014301   0.002701  -5.294 1.20e-07 ***
## edu_lvlsecondary         -1.017640   0.120025  -8.479  < 2e-16 ***
## edu_lvlbelow secondary   -2.150527   0.137096 -15.686  < 2e-16 ***
## edu_lvl_msecondary       -0.251198   0.172883  -1.453 0.146225    
## edu_lvl_mbelow secondary -0.702015   0.167562  -4.190 2.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4194.6  on 3416  degrees of freedom
## Residual deviance: 3674.6  on 3410  degrees of freedom
## AIC: 3688.6
## 
## Number of Fisher Scoring iterations: 4

Interpretation: As we can notice, all of the predictors are significant (sex, age, secondary and below secondary levels of edu and below secondary mother’s edu level), except secondary education of mother (compared to tertiary).

Sex is positively associated with visiting a book shop. Being a female increases the log of odds of visiting a book shop over not visiting by 0.28, compared to being a male.

Age is negatively associated with visiting a book shop. With the increase of age by 1 the log of odds of visiting a book shop over not visiting decreases by 0.01.

Having a secondary education level decreases the log of odds of visiting a book shop over not visiting by 1.01 compared to the tertiary education level. Having a below secondary education level decreases the log of odds of visiting a book shop over not visiting by 2.15 compared to the tertiary education level.

Having a mother with below secondary education level decreases the log of odds of visiting a book shop over not visiting by 0.70 compared to having a mother with tertiary education level.

2.2 marginal effects

margins::margins_summary(mod1)
##                    factor     AME     SE        z      p   lower   upper
##                       age -0.0026 0.0005  -5.3717 0.0000 -0.0035 -0.0016
##  edu_lvl_mbelow secondary -0.1228 0.0266  -4.6125 0.0000 -0.1750 -0.0706
##        edu_lvl_msecondary -0.0402 0.0268  -1.5009 0.1334 -0.0928  0.0123
##    edu_lvlbelow secondary -0.4154 0.0239 -17.3751 0.0000 -0.4622 -0.3685
##          edu_lvlsecondary -0.1577 0.0159  -9.9023 0.0000 -0.1889 -0.1264
##                 sexfemale  0.0503 0.0147   3.4297 0.0006  0.0216  0.0791

Interpretation: Increase in age by 1 on average decreases the probability of of visiting a book shop by 0.0026.

Having a mom with below secondary edu level on average decreases the probability of of visiting a book shop by 0.12 compared to mom with tertiary edu.

Having a below secondary edu level on average decreases the probability of of visiting a book shop by 0.41 compared to tertiary edu. Having a secondary edu level on average decreases the probability of of visiting a book shop by 0.15 compared to tertiary edu.

Being a female on average increases the probability of of visiting a book shop by 0.05 compared to being a male.

2.3 margins visualization

sjPlot::plot_model(mod1, type = 'pred', terms = c('age'), axis.title = c('age','Probability of visiting a book shop', title = ''))
## Data were 'prettified'. Consider using `terms="age [all]"` to get smooth
##   plots.

sjPlot::plot_model(mod1, type = 'pred', terms = c('sex'), axis.title = c('sex','Probability of visiting a book shop', title = ''))

sjPlot::plot_model(mod1, type = 'pred', terms = c('edu_lvl'), axis.title = c('edu_lvl','Probability of visiting a book shop', title = ''))

sjPlot::plot_model(mod1, type = 'pred', terms = c('edu_lvl_m'), axis.title = c('edu_lvl_m','Probability of visiting a book shop', title = ''))

Interpretation: As we can see on the plot with age, the probability to visit a book shop decreases when people get older. Probability of female visiting a book shop is higher than male. Having the highest level of education also have the highest probability to go to a book shop, while for people with below secondary education the probability is low (around 55%). And for mother’s education, her having below secondary education lowers the probability to visit a book shop, while for tertiary and secondary education the probabilities are not that different.

3 Model fit

3.1 pseudo R2

pscl::pR2(mod1)[4]
## fitting null model for pseudo-r2
##  McFadden 
## 0.1239779

Interpretation: The pseudo R2 is low - 0.12, the good fit estimates are from 0.2 to 0.4. However, 0.12 is not critical. So our model is alright, but could have been better.

3.2 pcp

pscl::hitmiss(mod1)
## Classification Threshold = 0.5 
##        y=0  y=1
## yhat=0 371  195
## yhat=1 666 2185
## Percent Correctly Predicted = 74.8%
## Percent Correctly Predicted = 35.78%, for y = 0
## Percent Correctly Predicted = 91.81%  for y = 1
## Null Model Correctly Predicts 69.65%
## [1] 74.80246 35.77628 91.80672

Interpretation: As we can see from percent of correct predictions, the overall value is 75%, it is not bad. But if we look closely the model is good at predicting visiting the book shops (92% correctly), and quite bad at predicting non-visiting behavior (36% correctly).

3.3 sensitivity & specificity

sens_t <- pROC::roc(culture$bs_att_1, predict(mod1, culture, type = "response")) 
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(sens_t)

fitted = predict(mod1, culture, type = "response")
pROC::auc(culture$bs_att_1, fitted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7284

Interpretation: On the plot we can notice that the line of our model does not coincide with diagonal one, which is good, that means that model predicts book shop visiting better than a random prediction. But the curve is not very far from the diagonal line. And if we look at area under the curve estimate, we can see that it’s value is 0.73, which is a good one.

Overall, the model is not bad. It is really good at predicting the visiting book shops behavior, but not as good at predicting non-visiting behavior. So it may need some adjustments to predict not visiting a book shop better.