library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Data frame culture92 contains information about leisure practices and social profiles of the USSR respondents in 1992:
book_shop – how often the respondents visits book shops (1 - never, 4 - several times a month)
Tasks:
Recode book shop attendance into a binary variable with levels ‘does not visit’ and ‘visits’ (1p) Predict the probability of book shop attendance by gender, age, educational level, and educational level of respondent’s mother (1p) Interpret coefficients and effects of the predictors (4p) Interpret model fit (3p)
load('culture92.RData')
Recode book shop attendance into a binary variable with levels ‘does not visit’ and ‘visits’ (1p)
culture92$bs_att <- ifelse(culture92$book_shop == 1, 'does not visit',
ifelse(culture92$book_shop == 2, 'visits',
ifelse(culture92$book_shop == 3, 'visits',
ifelse(culture92$book_shop == 4, 'visits', NA))))
table(culture92$book_shop)
##
## 1 2 3 4
## 1076 863 1023 558
table(culture92$bs_att)
##
## does not visit visits
## 1076 2444
culture_t <- select(culture92, c(bs_att, edu_lvl, edu_lvl_m, sex, age))
sapply(culture_t, function(x) sum(is.na(x)))
## bs_att edu_lvl edu_lvl_m sex age
## 104 23 62 8 20
culture <- na.omit(culture_t)
table1(~bs_att + sex +
age + edu_lvl + edu_lvl_m, data = culture)
Overall (N=3417) |
|
---|---|
bs_att | |
does not visit | 1037 (30.3%) |
visits | 2380 (69.7%) |
sex | |
male | 1549 (45.3%) |
female | 1868 (54.7%) |
age | |
Mean (SD) | 40.8 (16.1) |
Median [Min, Max] | 39.0 [16.0, 85.0] |
edu_lvl | |
tertiary | 845 (24.7%) |
secondary | 1906 (55.8%) |
below secondary | 666 (19.5%) |
edu_lvl_m | |
tertiary | 366 (10.7%) |
secondary | 1086 (31.8%) |
below secondary | 1965 (57.5%) |
In the table we can see that there are more book shop visitors than non-visitors (70% vs 30%). The gender distribution is almost equal (55% female and 45% male). Age of the sample varies from 16 to 85, with a mean at 39 y.o. The biggest group among educational levels is secondary. While among mothers of respondents the below secondary level is more frequent.
Predict the probability of book shop attendance by gender, age, educational level, and educational level of respondent’s mother (1p)
culture$bs_att_1 <- ifelse(culture$bs_att == 'does not visit', 0, 1)
mod1 <- glm(bs_att_1 ~ sex + age + edu_lvl + edu_lvl_m, data = culture, family = binomial)
summary(mod1)
##
## Call:
## glm(formula = bs_att_1 ~ sex + age + edu_lvl + edu_lvl_m, family = binomial,
## data = culture)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.872860 0.198001 14.509 < 2e-16 ***
## sexfemale 0.280000 0.081860 3.420 0.000625 ***
## age -0.014301 0.002701 -5.294 1.20e-07 ***
## edu_lvlsecondary -1.017640 0.120025 -8.479 < 2e-16 ***
## edu_lvlbelow secondary -2.150527 0.137096 -15.686 < 2e-16 ***
## edu_lvl_msecondary -0.251198 0.172883 -1.453 0.146225
## edu_lvl_mbelow secondary -0.702015 0.167562 -4.190 2.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4194.6 on 3416 degrees of freedom
## Residual deviance: 3674.6 on 3410 degrees of freedom
## AIC: 3688.6
##
## Number of Fisher Scoring iterations: 4
Interpretation: As we can notice, all of the predictors are significant (sex, age, secondary and below secondary levels of edu and below secondary mother’s edu level), except secondary education of mother (compared to tertiary).
Sex is positively associated with visiting a book shop. Being a female increases the log of odds of visiting a book shop over not visiting by 0.28, compared to being a male.
Age is negatively associated with visiting a book shop. With the increase of age by 1 the log of odds of visiting a book shop over not visiting decreases by 0.01.
Having a secondary education level decreases the log of odds of visiting a book shop over not visiting by 1.01 compared to the tertiary education level. Having a below secondary education level decreases the log of odds of visiting a book shop over not visiting by 2.15 compared to the tertiary education level.
Having a mother with below secondary education level decreases the log of odds of visiting a book shop over not visiting by 0.70 compared to having a mother with tertiary education level.
margins::margins_summary(mod1)
## factor AME SE z p lower upper
## age -0.0026 0.0005 -5.3717 0.0000 -0.0035 -0.0016
## edu_lvl_mbelow secondary -0.1228 0.0266 -4.6125 0.0000 -0.1750 -0.0706
## edu_lvl_msecondary -0.0402 0.0268 -1.5009 0.1334 -0.0928 0.0123
## edu_lvlbelow secondary -0.4154 0.0239 -17.3751 0.0000 -0.4622 -0.3685
## edu_lvlsecondary -0.1577 0.0159 -9.9023 0.0000 -0.1889 -0.1264
## sexfemale 0.0503 0.0147 3.4297 0.0006 0.0216 0.0791
Interpretation: Increase in age by 1 on average decreases the probability of of visiting a book shop by 0.0026.
Having a mom with below secondary edu level on average decreases the probability of of visiting a book shop by 0.12 compared to mom with tertiary edu.
Having a below secondary edu level on average decreases the probability of of visiting a book shop by 0.41 compared to tertiary edu. Having a secondary edu level on average decreases the probability of of visiting a book shop by 0.15 compared to tertiary edu.
Being a female on average increases the probability of of visiting a book shop by 0.05 compared to being a male.
sjPlot::plot_model(mod1, type = 'pred', terms = c('age'), axis.title = c('age','Probability of visiting a book shop', title = ''))
## Data were 'prettified'. Consider using `terms="age [all]"` to get smooth
## plots.
sjPlot::plot_model(mod1, type = 'pred', terms = c('sex'), axis.title = c('sex','Probability of visiting a book shop', title = ''))
sjPlot::plot_model(mod1, type = 'pred', terms = c('edu_lvl'), axis.title = c('edu_lvl','Probability of visiting a book shop', title = ''))
sjPlot::plot_model(mod1, type = 'pred', terms = c('edu_lvl_m'), axis.title = c('edu_lvl_m','Probability of visiting a book shop', title = ''))
Interpretation: As we can see on the plot with age, the probability to visit a book shop decreases when people get older. Probability of female visiting a book shop is higher than male. Having the highest level of education also have the highest probability to go to a book shop, while for people with below secondary education the probability is low (around 55%). And for mother’s education, her having below secondary education lowers the probability to visit a book shop, while for tertiary and secondary education the probabilities are not that different.
pscl::pR2(mod1)[4]
## fitting null model for pseudo-r2
## McFadden
## 0.1239779
Interpretation: The pseudo R2 is low - 0.12, the good fit estimates are from 0.2 to 0.4. However, 0.12 is not critical. So our model is alright, but could have been better.
pscl::hitmiss(mod1)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 371 195
## yhat=1 666 2185
## Percent Correctly Predicted = 74.8%
## Percent Correctly Predicted = 35.78%, for y = 0
## Percent Correctly Predicted = 91.81% for y = 1
## Null Model Correctly Predicts 69.65%
## [1] 74.80246 35.77628 91.80672
Interpretation: As we can see from percent of correct predictions, the overall value is 75%, it is not bad. But if we look closely the model is good at predicting visiting the book shops (92% correctly), and quite bad at predicting non-visiting behavior (36% correctly).
sens_t <- pROC::roc(culture$bs_att_1, predict(mod1, culture, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(sens_t)
fitted = predict(mod1, culture, type = "response")
pROC::auc(culture$bs_att_1, fitted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7284
Interpretation: On the plot we can notice that the line of our model does not coincide with diagonal one, which is good, that means that model predicts book shop visiting better than a random prediction. But the curve is not very far from the diagonal line. And if we look at area under the curve estimate, we can see that it’s value is 0.73, which is a good one.
Overall, the model is not bad. It is really good at predicting the visiting book shops behavior, but not as good at predicting non-visiting behavior. So it may need some adjustments to predict not visiting a book shop better.