library(base)
library(foreign)
library(dplyr)
library(DT)
library(knitr)
library(kableExtra)
library(summarytools)
library(pscl)
library(DescTools)
library(margins)
library(ggplot2)
library(gridExtra)
library(grid)
library(sjPlot)
dataset = read.spss("~/Downloads/HBSC_2014_5_countries.sav",
to.data.frame=TRUE) %>% dplyr::select(sex, AGE, alcltm, beenbullied, friendtalk, famtalk, m96) %>% rename(friendafter8 = m96)
dataset$friendafter8[dataset$friendafter8 == "Daily (specified in own variable)"] <- "Daily"
In order to answer the question of why children might drink alcohol and why they might bully others, I decided to use variables such as talk about problems with friends (friendtalk) and whether students meet their friends after 8pm (friendafter8) and whether students might talk to their parents about problems (famtalk). In addition to this, I used information about gender and age.
See the table below for the variables and the questions that respondents were asked.
Variable <- c("alcltm",
"beenbullied", "sex",
"age",
"friendtalk",
"famtalk",
"friendafter8")
Type <- c("Dependent", "Dependent", "Control", "Control", "Independent", "Independent", "Independent")
Question <- c("On how many days (if any) have you drunk alcohol? Please tick one box for each line. In your lifetime",
"How often have you been bullied at school in the past couple of months?",
"Are you a boy or a girl?",
"Age",
"I can talk about my problems with my friends",
"I can talk about my problems with my family",
"How often do you meet your friends outside school time after 8 o’clock in the evening?")
descr <- data.frame(Variable, Type, Question)
descr %>% kable() %>% kable_classic_2("hover", html_font = "Cambria") %>%
column_spec(1, bold = T)
| Variable | Type | Question |
|---|---|---|
| alcltm | Dependent | On how many days (if any) have you drunk alcohol? Please tick one box for each line. In your lifetime |
| beenbullied | Dependent | How often have you been bullied at school in the past couple of months? |
| sex | Control | Are you a boy or a girl? |
| age | Control | Age |
| friendtalk | Independent | I can talk about my problems with my friends |
| famtalk | Independent | I can talk about my problems with my family |
| friendafter8 | Independent | How often do you meet your friends outside school time after 8 o’clock in the evening? |
I also need to check that all variables are correctly coded and written, examine the number of missing values and the distribution of each variable.
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | |||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sex [factor] |
|
|
0 (0.0%) | ||||||||||||||||||||||||||||||||||||
| AGE [numeric] |
|
121 distinct values | 264 (1.0%) | ||||||||||||||||||||||||||||||||||||
| alcltm [factor] |
|
|
2581 (9.9%) | ||||||||||||||||||||||||||||||||||||
| beenbullied [factor] |
|
|
434 (1.7%) | ||||||||||||||||||||||||||||||||||||
| friendtalk [factor] |
|
|
667 (2.6%) | ||||||||||||||||||||||||||||||||||||
| famtalk [factor] |
|
|
591 (2.3%) | ||||||||||||||||||||||||||||||||||||
| friendafter8 [factor] |
|
|
2815 (10.8%) |
I decided to change the friendtalk and famtalk variables by making them numeric, where 1 is Very strongly disagree and 7 is Very strongly agree. Also, in the variable alcltm almost 10% are missing values. The other variables have smaller percentages, but may cause problems in the analysis. Still these percentages are small, so I decided to use the imputation method.
Also, before answering the question of why children drink, I convert the variable alcltm and leave only 2 values: 0 - Never, 1 - all other options. I do the same with the havebullied variable (0 - Haven’t, 1 - all other options). In addition, there are no observations for Daily in the variable describing how often children meet their friends after 8. Therefore, I change the number of urons in the variable, leaving only three levels: Hardly ever or never, Less than weekly and Weekly.
levels(dataset$friendtalk) = c("1", "2", "3", "4", "5", "6", "7")
dataset$friendtalk = as.numeric(as.character(dataset$friendtalk))
levels(dataset$famtalk) = c("1", "2", "3", "4", "5", "6", "7")
dataset$famtalk = as.numeric(as.character(dataset$famtalk))
dataset$drinking01 <- ifelse(dataset$alcltm == "Never", 0, 1)
dataset$aggressor01 <- ifelse(dataset$beenbullied == "Haven't", 0, 1)
dataset$friendafter8 = as.factor(as.character(dataset$friendafter8))
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
dataset = dataset %>% mutate_if(is.factor, funs(replace(.,is.na(.), Mode(na.omit(.))))) %>% mutate_if(is.numeric, funs(replace(.,is.na(.), Mode(na.omit(.)))))
I can check to see if there are still have any missing variables:
kable_classic_2(kable(paste0(round(((nrow(dataset) - nrow(na.omit(dataset)))/nrow(dataset))*100,2), "%"), col.names = "Percent of NA") %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE))
| Percent of NA |
|---|
| 0% |
Also I visualize the relationship between independent variables and the dependent variables.
grid.arrange(arrangeGrob(
dataset %>%
ggplot(aes(x = alcltm, fill = sex)) + geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
theme_linedraw(),
dataset %>%
ggplot(aes(x = alcltm, fill = friendafter8)) + geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
theme_linedraw(), nrow = 2))
grid.arrange(arrangeGrob(
dataset %>%
ggplot(aes(x = alcltm, y = friendtalk)) +
geom_boxplot() + geom_jitter(size = 1, alpha = 0.4, color = "coral1")+
theme_linedraw(),
dataset %>%
ggplot(aes(x = alcltm, y = famtalk)) +
geom_boxplot() + geom_jitter(size = 1, alpha = 0.4, color = "coral1") +
theme_linedraw(), nrow = 2))
grid.arrange(arrangeGrob(
dataset %>%
ggplot(aes(x = beenbullied, fill = sex)) + geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
theme_linedraw(),
dataset %>%
ggplot(aes(x = beenbullied, fill = friendafter8)) + geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
theme_linedraw(), nrow = 2))
grid.arrange(arrangeGrob(
dataset %>%
ggplot(aes(x = beenbullied, y = friendtalk)) +
geom_boxplot() + geom_jitter(size = 1, alpha = 0.4, color = "coral1")+
theme_linedraw(),
dataset %>%
ggplot(aes(x = beenbullied, y = famtalk)) +
geom_boxplot() + geom_jitter(size = 1, alpha = 0.4, color = "coral1") +
theme_linedraw(), nrow = 2))
model = glm(drinking01 ~ sex + AGE + friendtalk + famtalk + friendafter8, family = binomial(link = probit), data = dataset)
summary(model)
##
## Call:
## glm(formula = drinking01 ~ sex + AGE + friendtalk + famtalk +
## friendafter8, family = binomial(link = probit), data = dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2414 -0.8553 -0.4815 0.9414 2.5643
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.779437 0.084420 -56.615 < 2e-16 ***
## sexGirl -0.234544 0.017488 -13.412 < 2e-16 ***
## AGE 0.348133 0.005593 62.250 < 2e-16 ***
## friendtalk 0.038597 0.004910 7.861 3.83e-15 ***
## famtalk -0.086793 0.004890 -17.750 < 2e-16 ***
## friendafter8Less than weekly 0.183937 0.020765 8.858 < 2e-16 ***
## friendafter8Weekly 0.423613 0.022901 18.498 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34836 on 26101 degrees of freedom
## Residual deviance: 28531 on 26095 degrees of freedom
## AIC: 28545
##
## Number of Fisher Scoring iterations: 4
## tab_model(model, title = "drinking01 ~ sex + AGE + friendtalk + famtalk + friendafter8")
Results:
ALl predictors are statistically significant in this model as their p-value is below the 0.05 significance level.
Sex: for girls log of odds of drinking is less by 0.23 compared to boys. That means: This means that it is more likely that a boy will drink than that a girl will drink.
Age: with each year of age log of odds of drinking increases by 0.34. That means: the older a child gets, the more likely he or she will drink alcohol.
Friendtalk: with each point of opportunity to speak out about a problem with friends log of odds of drinking increases by 0.04. Thus, if a school student can share problems with his or her friends, the likelihood that he or she will drink increases.
Famtalk: with each point of opportunity to speak out about a problem with parents log of odds of drinking descreases by 0.09. Thus, if a school student can share his or her problems with his or her parents, he or she is less likely to drink alcohol.
Friendafter8(Less than weekly): for school children who meet up with friends after 8 less than weekly log of odds of drinking is grater by 0.18 compared to children who don’t meet up with friends after 8.
Friendafter8(Weekly): for school children who meet up with friends after 8 weekly log of odds of drinking is grater by 0.42 compared to children who don’t meet up with friends after 8.
summary(margins(model)) %>% kable() %>% kable_classic_2("hover", html_font = "Cambria")
| factor | AME | SE | z | p | lower | upper |
|---|---|---|---|---|---|---|
| AGE | 0.1078042 | 0.0013230 | 81.484599 | 0 | 0.1052112 | 0.1103972 |
| famtalk | -0.0268767 | 0.0014890 | -18.049989 | 0 | -0.0297951 | -0.0239583 |
| friendafter8Less than weekly | 0.0581703 | 0.0066470 | 8.751324 | 0 | 0.0451423 | 0.0711982 |
| friendafter8Weekly | 0.1368891 | 0.0075957 | 18.022036 | 0 | 0.1220019 | 0.1517763 |
| friendtalk | 0.0119520 | 0.0015157 | 7.885653 | 0 | 0.0089813 | 0.0149226 |
| sexGirl | -0.0727200 | 0.0053916 | -13.487587 | 0 | -0.0832874 | -0.0621526 |
Results:
Sex: if the school student is a girl, there is a 7% decrease in the probability that student will drink
Age: with each year of age probability of child drinking increases by 10%.
Friendtalk: with each level of contact with parents about student’s problems increases, the probability of a student drinking increases by 1.2%.
Famtalk: with each level of contact with parents about student’s problems increases, the probability of a student drinking decreases by 2.7%.
Friendafter8(Less than weekly): if the school children meet up with friends after 8 less than weekly, there is a 5.8% increase in the probability that he/she will drink
Friendafter8(Weekly): if the school children meet up with friends after 8 weekly, there is a 13.9% increase in the probability that he/she will drink
PseudoR2(model)
## McFadden
## 0.1810072
hitmiss(model)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 12980 4122
## yhat=1 3025 5975
## Percent Correctly Predicted = 72.62%
## Percent Correctly Predicted = 81.1%, for y = 0
## Percent Correctly Predicted = 59.18% for y = 1
## Null Model Correctly Predicts 61.32%
## [1] 72.61896 81.09966 59.17599
Results:
For this model, pseudo R2 estimation (0.1810072) doesn’t belong to the 0.2-0.4 range, so the fit of it cannot be called perfect.
Also, this model correctly predicts 72.62% of results, which seem to be a quite nice result. However percent correctly predicted for y = 1 equals to 59.18%. It is not bad but not perfect. Besides this, I can say that this model has a good fit, because it results are not much overeducated.
model1 = glm(aggressor01 ~ sex + AGE + friendtalk + famtalk + friendafter8, family = binomial(link = probit), data = dataset)
summary(model1)
##
## Call:
## glm(formula = aggressor01 ~ sex + AGE + friendtalk + famtalk +
## friendafter8, family = binomial(link = probit), data = dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2967 -0.8094 -0.7236 1.3422 1.9268
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.868059 0.079519 10.916 < 2e-16 ***
## sexGirl -0.047855 0.016996 -2.816 0.00487 **
## AGE -0.052117 0.005265 -9.899 < 2e-16 ***
## friendtalk -0.073794 0.004551 -16.215 < 2e-16 ***
## famtalk -0.056807 0.004731 -12.007 < 2e-16 ***
## friendafter8Less than weekly -0.036130 0.020690 -1.746 0.08078 .
## friendafter8Weekly -0.073407 0.023313 -3.149 0.00164 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30843 on 26101 degrees of freedom
## Residual deviance: 30170 on 26095 degrees of freedom
## AIC: 30184
##
## Number of Fisher Scoring iterations: 4
## tab_model(model1, title = "aggressor01 ~ sex + AGE + friendtalk + famtalk + friendafter8")
Results:
Most of the predictors (excluding friendafter(Less than weekly), p-value = 0.08) are statistically significant in this model as their p-value is below the 0.05 significance level.
Sex: for girls log of odds of bully others is less by 0.04 compared to boys. That means: This means that it is more likely that a boy will bully others than a girl.
Age: with each year of log of odds of bully others descrases by 0.05. That means: the older a child gets, the less likely he or she will bully others.
Friendtalk: with each point of opportunity to speak about a problem with friends log of odds of bully others decreases by 0.07. Thus, if a school student can share problems with his or her friends, the likelihood that he or she will bully others decreases.
Famtalk: with each point of opportunity to speak about a problem with parents log of odds of bully others descreases by 0.06. Thus, if a school student can share his or her problems with his or her parents, he or she is less likely to bully other.
Friendafter8(Weekly): for school children who meet up with friends after 8 weekly log of odds of bully others is less by 0.07 compared to children who don’t meet up with friends after 8.
summary(margins(model1)) %>% kable() %>% kable_classic_2("hover", html_font = "Cambria")
| factor | AME | SE | z | p | lower | upper |
|---|---|---|---|---|---|---|
| AGE | -0.0170643 | 0.0017153 | -9.948588 | 0.0000000 | -0.0204262 | -0.0137025 |
| famtalk | -0.0185999 | 0.0015374 | -12.098071 | 0.0000000 | -0.0216131 | -0.0155866 |
| friendafter8Less than weekly | -0.0118633 | 0.0067647 | -1.753694 | 0.0794830 | -0.0251220 | 0.0013954 |
| friendafter8Weekly | -0.0238396 | 0.0074860 | -3.184570 | 0.0014497 | -0.0385119 | -0.0091674 |
| friendtalk | -0.0241618 | 0.0014692 | -16.445193 | 0.0000000 | -0.0270414 | -0.0212821 |
| sexGirl | -0.0156765 | 0.0055688 | -2.815047 | 0.0048770 | -0.0265913 | -0.0047618 |
Results:
Sex: if the school student is a girl, there is a 1% decrease in the probability that student will bully others.
Age: with each year of age probability that student’s will bully others decreases by 1.7%.
Friendtalk: with each level of contact with parents about student’s problems increases, the probability that student’s will bully others decreases by 2.4%.
Famtalk: with each level of contact with parents about student’s problems increases, the probability that student’s will bully others decreases by 1.8%.
Friendafter8(Less than weekly): if the school children meet up with friends after 8 less than weekly, there is a 1.2% decrease in the probability that he/she will bully others.
Friendafter8(Weekly): if the school children meet up with friends after 8 weekly, there is a 2.3% increase in the probability that he/she will bully others decreases by 2.4%.
PseudoR2(model1)
## McFadden
## 0.02184272
hitmiss(model1)
## Classification Threshold = 0.5
## y=0 y=1
## yhat=0 18657 7137
## yhat=1 195 113
## Percent Correctly Predicted = 71.91%
## Percent Correctly Predicted = 98.97%, for y = 0
## Percent Correctly Predicted = 1.559% for y = 1
## Null Model Correctly Predicts 72.22%
## [1] 71.910198 98.965627 1.558621
Results:
For this model, pseudo R2 estimation (0.02184272) belongs to the 0.2-0.4 range, so this fit of it can be called perfect.
Also, this model correctly predicts 71.91% of results, which seem to be a quite nice result. However percent correctly predicted for y = 1 equals to 1.559% (need to be very careful), and the number of ≈72% just reflected the real share of this category in the variable. So, I cannot say that this model has a good fit, it resulted to be overeducated.
After this analysis, the questions can be answered as follows:
In general, the older students get, the more likely they become to drink. Also, this applies more to boys than to girls. However, if students are able to talk to their parents about their problems, they are less likely to drink. On the other hand, if a student is able to discuss problems with his or her friends and meets with them after 8, the likelihood of drinking increases.
In the case of bullying, it is observed that if the student has the opportunity to discuss problems with parents and friends, and if he or she is seen after 8 with friends, on a weekly basis, the student is less likely to bully others. Also, this likelihood decreases with the age of the student (the older he or she gets, the less likely he or she is to bully) and if the student is a girl, she is less likely to bully others