library(base)
library(foreign)
library(dplyr)
library(DT)
library(knitr)
library(kableExtra)
library(summarytools)
library(pscl)
library(DescTools)
library(margins)
library(ggplot2)
library(gridExtra)
library(grid)
library(sjPlot)

dataset = read.spss("~/Downloads/HBSC_2014_5_countries.sav", 
                    to.data.frame=TRUE) %>% dplyr::select(sex, AGE, alcltm, beenbullied, friendtalk, famtalk, m96) %>% rename(friendafter8 = m96)
dataset$friendafter8[dataset$friendafter8 == "Daily (specified in own variable)"] <- "Daily"

1. Data preparation and descriptive statistics

In order to answer the question of why children might drink alcohol and why they might bully others, I decided to use variables such as talk about problems with friends (friendtalk) and whether students meet their friends after 8pm (friendafter8) and whether students might talk to their parents about problems (famtalk). In addition to this, I used information about gender and age.

See the table below for the variables and the questions that respondents were asked.

Variable <- c("alcltm",
                        "beenbullied", "sex",
                        "age",
                        "friendtalk",
                        "famtalk",
                        "friendafter8")
Type <- c("Dependent", "Dependent", "Control", "Control", "Independent", "Independent", "Independent")
Question <- c("On how many days (if any) have you drunk alcohol? Please tick one box for each line. In your lifetime",
                "How often have you been bullied at school in the past couple of months?",
              "Are you a boy or a girl?",
                "Age",
                "I can talk about my problems with my friends",
                "I can talk about my problems with my family",
                "How often do you meet your friends outside school time after 8 o’clock in the evening?")

descr <- data.frame(Variable, Type, Question)

descr %>% kable() %>% kable_classic_2("hover", html_font = "Cambria") %>% 
  column_spec(1, bold = T)
Variable Type Question
alcltm Dependent On how many days (if any) have you drunk alcohol? Please tick one box for each line. In your lifetime
beenbullied Dependent How often have you been bullied at school in the past couple of months?
sex Control Are you a boy or a girl?
age Control Age
friendtalk Independent I can talk about my problems with my friends
famtalk Independent I can talk about my problems with my family
friendafter8 Independent How often do you meet your friends outside school time after 8 o’clock in the evening?

I also need to check that all variables are correctly coded and written, examine the number of missing values and the distribution of each variable.

Variable Stats / Values Freqs (% of Valid) Graph Missing
sex [factor]
1. Boy
2. Girl
12960(49.7%)
13142(50.3%)
0 (0.0%)
AGE [numeric]
Mean (sd) : 13.5 (1.7)
min ≤ med ≤ max:
10.5 ≤ 13.5 ≤ 16.5
IQR (CV) : 3.3 (0.1)
121 distinct values 264 (1.0%)
alcltm [factor]
1. Never
2. 1-2 days
3. 3-5 days
4. 6-9 days
5. 10-19 days
6. 20-29 days
7. 30 days (or more)
13424(57.1%)
3798(16.1%)
1676(7.1%)
1245(5.3%)
1193(5.1%)
570(2.4%)
1615(6.9%)
2581 (9.9%)
beenbullied [factor]
1. Haven't
2. Once or twice
3. 2-3 times per month
4. Once/week
5. Several times/week
18418(71.8%)
4477(17.4%)
1143(4.5%)
708(2.8%)
922(3.6%)
434 (1.7%)
friendtalk [factor]
1. Very strongly disagree
2. 2
3. 3
4. 4
5. 5
6. 6
7. Very strongly agree
1889(7.4%)
1071(4.2%)
1108(4.4%)
2435(9.6%)
3129(12.3%)
5356(21.1%)
10447(41.1%)
667 (2.6%)
famtalk [factor]
1. Very strongly disagree
2. 2
3. 3
4. 4
5. 5
6. 6
7. Very strongly agree
1578(6.2%)
1015(4.0%)
1092(4.3%)
2307(9.0%)
2921(11.4%)
5087(19.9%)
11511(45.1%)
591 (2.3%)
friendafter8 [factor]
1. Hardly ever or never
2. Less than weekly
3. Weekly
4. Daily (specified in own v
12392(53.2%)
6109(26.2%)
4786(20.6%)
0(0.0%)
2815 (10.8%)

I decided to change the friendtalk and famtalk variables by making them numeric, where 1 is Very strongly disagree and 7 is Very strongly agree. Also, in the variable alcltm almost 10% are missing values. The other variables have smaller percentages, but may cause problems in the analysis. Still these percentages are small, so I decided to use the imputation method.

Also, before answering the question of why children drink, I convert the variable alcltm and leave only 2 values: 0 - Never, 1 - all other options. I do the same with the havebullied variable (0 - Haven’t, 1 - all other options). In addition, there are no observations for Daily in the variable describing how often children meet their friends after 8. Therefore, I change the number of urons in the variable, leaving only three levels: Hardly ever or never, Less than weekly and Weekly.

levels(dataset$friendtalk) = c("1", "2", "3", "4", "5", "6", "7")
dataset$friendtalk = as.numeric(as.character(dataset$friendtalk))

levels(dataset$famtalk) = c("1", "2", "3", "4", "5", "6", "7")
dataset$famtalk = as.numeric(as.character(dataset$famtalk))

dataset$drinking01 <- ifelse(dataset$alcltm == "Never", 0, 1)
dataset$aggressor01 <- ifelse(dataset$beenbullied == "Haven't", 0, 1)

dataset$friendafter8 = as.factor(as.character(dataset$friendafter8))

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

dataset =  dataset %>% mutate_if(is.factor, funs(replace(.,is.na(.), Mode(na.omit(.)))))  %>% mutate_if(is.numeric, funs(replace(.,is.na(.), Mode(na.omit(.))))) 

I can check to see if there are still have any missing variables:

kable_classic_2(kable(paste0(round(((nrow(dataset) - nrow(na.omit(dataset)))/nrow(dataset))*100,2), "%"), col.names = "Percent of NA") %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE))
Percent of NA
0%

Visualize correlation

Also I visualize the relationship between independent variables and the dependent variables.

For alcltm

grid.arrange(arrangeGrob(
 dataset %>%
  ggplot(aes(x = alcltm, fill = sex)) + geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme_linedraw(),

dataset %>%
  ggplot(aes(x = alcltm, fill = friendafter8)) + geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme_linedraw(), nrow = 2))

grid.arrange(arrangeGrob(
dataset %>%
  ggplot(aes(x = alcltm, y = friendtalk)) +
  geom_boxplot() + geom_jitter(size = 1, alpha = 0.4, color = "coral1")+
  theme_linedraw(),

dataset %>%
  ggplot(aes(x = alcltm, y = famtalk)) +
  geom_boxplot() + geom_jitter(size = 1, alpha = 0.4, color = "coral1") +
  theme_linedraw(), nrow = 2))

For beenbullied

 grid.arrange(arrangeGrob(
dataset %>%
  ggplot(aes(x = beenbullied, fill = sex)) + geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme_linedraw(), 

dataset %>%
  ggplot(aes(x = beenbullied, fill = friendafter8)) + geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme_linedraw(), nrow = 2))

 grid.arrange(arrangeGrob(
dataset %>%
  ggplot(aes(x = beenbullied, y = friendtalk)) +
  geom_boxplot() + geom_jitter(size = 1, alpha = 0.4, color = "coral1")+
  theme_linedraw(),

dataset %>%
  ggplot(aes(x = beenbullied, y = famtalk)) +
  geom_boxplot() + geom_jitter(size = 1, alpha = 0.4, color = "coral1") +
  theme_linedraw(), nrow = 2))

2. Why do children drink alcohol?

Model and interpretation

model = glm(drinking01 ~ sex + AGE + friendtalk + famtalk + friendafter8, family = binomial(link = probit), data = dataset) 

summary(model) 
## 
## Call:
## glm(formula = drinking01 ~ sex + AGE + friendtalk + famtalk + 
##     friendafter8, family = binomial(link = probit), data = dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2414  -0.8553  -0.4815   0.9414   2.5643  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -4.779437   0.084420 -56.615  < 2e-16 ***
## sexGirl                      -0.234544   0.017488 -13.412  < 2e-16 ***
## AGE                           0.348133   0.005593  62.250  < 2e-16 ***
## friendtalk                    0.038597   0.004910   7.861 3.83e-15 ***
## famtalk                      -0.086793   0.004890 -17.750  < 2e-16 ***
## friendafter8Less than weekly  0.183937   0.020765   8.858  < 2e-16 ***
## friendafter8Weekly            0.423613   0.022901  18.498  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34836  on 26101  degrees of freedom
## Residual deviance: 28531  on 26095  degrees of freedom
## AIC: 28545
## 
## Number of Fisher Scoring iterations: 4
## tab_model(model, title = "drinking01 ~ sex + AGE + friendtalk + famtalk + friendafter8")

Results:

  • ALl predictors are statistically significant in this model as their p-value is below the 0.05 significance level.

  • Sex: for girls log of odds of drinking is less by 0.23 compared to boys. That means: This means that it is more likely that a boy will drink than that a girl will drink.

  • Age: with each year of age log of odds of drinking increases by 0.34. That means: the older a child gets, the more likely he or she will drink alcohol.

  • Friendtalk: with each point of opportunity to speak out about a problem with friends log of odds of drinking increases by 0.04. Thus, if a school student can share problems with his or her friends, the likelihood that he or she will drink increases.

  • Famtalk: with each point of opportunity to speak out about a problem with parents log of odds of drinking descreases by 0.09. Thus, if a school student can share his or her problems with his or her parents, he or she is less likely to drink alcohol.

  • Friendafter8(Less than weekly): for school children who meet up with friends after 8 less than weekly log of odds of drinking is grater by 0.18 compared to children who don’t meet up with friends after 8.

  • Friendafter8(Weekly): for school children who meet up with friends after 8 weekly log of odds of drinking is grater by 0.42 compared to children who don’t meet up with friends after 8.

summary(margins(model)) %>% kable() %>% kable_classic_2("hover", html_font = "Cambria")
factor AME SE z p lower upper
AGE 0.1078042 0.0013230 81.484599 0 0.1052112 0.1103972
famtalk -0.0268767 0.0014890 -18.049989 0 -0.0297951 -0.0239583
friendafter8Less than weekly 0.0581703 0.0066470 8.751324 0 0.0451423 0.0711982
friendafter8Weekly 0.1368891 0.0075957 18.022036 0 0.1220019 0.1517763
friendtalk 0.0119520 0.0015157 7.885653 0 0.0089813 0.0149226
sexGirl -0.0727200 0.0053916 -13.487587 0 -0.0832874 -0.0621526

Results:

  • Sex: if the school student is a girl, there is a 7% decrease in the probability that student will drink

  • Age: with each year of age probability of child drinking increases by 10%.

  • Friendtalk: with each level of contact with parents about student’s problems increases, the probability of a student drinking increases by 1.2%.

  • Famtalk: with each level of contact with parents about student’s problems increases, the probability of a student drinking decreases by 2.7%.

  • Friendafter8(Less than weekly): if the school children meet up with friends after 8 less than weekly, there is a 5.8% increase in the probability that he/she will drink

  • Friendafter8(Weekly): if the school children meet up with friends after 8 weekly, there is a 13.9% increase in the probability that he/she will drink

Interpret the model fit

PseudoR2(model)
##  McFadden 
## 0.1810072
hitmiss(model)
## Classification Threshold = 0.5 
##          y=0  y=1
## yhat=0 12980 4122
## yhat=1  3025 5975
## Percent Correctly Predicted = 72.62%
## Percent Correctly Predicted = 81.1%, for y = 0
## Percent Correctly Predicted = 59.18%  for y = 1
## Null Model Correctly Predicts 61.32%
## [1] 72.61896 81.09966 59.17599

Results:

  • For this model, pseudo R2 estimation (0.1810072) doesn’t belong to the 0.2-0.4 range, so the fit of it cannot be called perfect.

  • Also, this model correctly predicts 72.62% of results, which seem to be a quite nice result. However percent correctly predicted for y = 1 equals to 59.18%. It is not bad but not perfect. Besides this, I can say that this model has a good fit, because it results are not much overeducated.

3. Why do children bully others?

Model and interpretation

model1 = glm(aggressor01 ~ sex + AGE + friendtalk + famtalk + friendafter8, family = binomial(link = probit), data = dataset)
summary(model1)
## 
## Call:
## glm(formula = aggressor01 ~ sex + AGE + friendtalk + famtalk + 
##     friendafter8, family = binomial(link = probit), data = dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2967  -0.8094  -0.7236   1.3422   1.9268  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   0.868059   0.079519  10.916  < 2e-16 ***
## sexGirl                      -0.047855   0.016996  -2.816  0.00487 ** 
## AGE                          -0.052117   0.005265  -9.899  < 2e-16 ***
## friendtalk                   -0.073794   0.004551 -16.215  < 2e-16 ***
## famtalk                      -0.056807   0.004731 -12.007  < 2e-16 ***
## friendafter8Less than weekly -0.036130   0.020690  -1.746  0.08078 .  
## friendafter8Weekly           -0.073407   0.023313  -3.149  0.00164 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30843  on 26101  degrees of freedom
## Residual deviance: 30170  on 26095  degrees of freedom
## AIC: 30184
## 
## Number of Fisher Scoring iterations: 4
## tab_model(model1, title = "aggressor01 ~ sex + AGE + friendtalk + famtalk + friendafter8")

Results:

  • Most of the predictors (excluding friendafter(Less than weekly), p-value = 0.08) are statistically significant in this model as their p-value is below the 0.05 significance level.

  • Sex: for girls log of odds of bully others is less by 0.04 compared to boys. That means: This means that it is more likely that a boy will bully others than a girl.

  • Age: with each year of log of odds of bully others descrases by 0.05. That means: the older a child gets, the less likely he or she will bully others.

  • Friendtalk: with each point of opportunity to speak about a problem with friends log of odds of bully others decreases by 0.07. Thus, if a school student can share problems with his or her friends, the likelihood that he or she will bully others decreases.

  • Famtalk: with each point of opportunity to speak about a problem with parents log of odds of bully others descreases by 0.06. Thus, if a school student can share his or her problems with his or her parents, he or she is less likely to bully other.

  • Friendafter8(Weekly): for school children who meet up with friends after 8 weekly log of odds of bully others is less by 0.07 compared to children who don’t meet up with friends after 8.

summary(margins(model1)) %>% kable() %>% kable_classic_2("hover", html_font = "Cambria")
factor AME SE z p lower upper
AGE -0.0170643 0.0017153 -9.948588 0.0000000 -0.0204262 -0.0137025
famtalk -0.0185999 0.0015374 -12.098071 0.0000000 -0.0216131 -0.0155866
friendafter8Less than weekly -0.0118633 0.0067647 -1.753694 0.0794830 -0.0251220 0.0013954
friendafter8Weekly -0.0238396 0.0074860 -3.184570 0.0014497 -0.0385119 -0.0091674
friendtalk -0.0241618 0.0014692 -16.445193 0.0000000 -0.0270414 -0.0212821
sexGirl -0.0156765 0.0055688 -2.815047 0.0048770 -0.0265913 -0.0047618

Results:

  • Sex: if the school student is a girl, there is a 1% decrease in the probability that student will bully others.

  • Age: with each year of age probability that student’s will bully others decreases by 1.7%.

  • Friendtalk: with each level of contact with parents about student’s problems increases, the probability that student’s will bully others decreases by 2.4%.

  • Famtalk: with each level of contact with parents about student’s problems increases, the probability that student’s will bully others decreases by 1.8%.

  • Friendafter8(Less than weekly): if the school children meet up with friends after 8 less than weekly, there is a 1.2% decrease in the probability that he/she will bully others.

  • Friendafter8(Weekly): if the school children meet up with friends after 8 weekly, there is a 2.3% increase in the probability that he/she will bully others decreases by 2.4%.

Interpret the model fit

PseudoR2(model1)
##   McFadden 
## 0.02184272
hitmiss(model1) 
## Classification Threshold = 0.5 
##          y=0  y=1
## yhat=0 18657 7137
## yhat=1   195  113
## Percent Correctly Predicted = 71.91%
## Percent Correctly Predicted = 98.97%, for y = 0
## Percent Correctly Predicted = 1.559%  for y = 1
## Null Model Correctly Predicts 72.22%
## [1] 71.910198 98.965627  1.558621

Results:

  • For this model, pseudo R2 estimation (0.02184272) belongs to the 0.2-0.4 range, so this fit of it can be called perfect.

  • Also, this model correctly predicts 71.91% of results, which seem to be a quite nice result. However percent correctly predicted for y = 1 equals to 1.559% (need to be very careful), and the number of ≈72% just reflected the real share of this category in the variable. So, I cannot say that this model has a good fit, it resulted to be overeducated.

Conclusions

After this analysis, the questions can be answered as follows:

  1. In general, the older students get, the more likely they become to drink. Also, this applies more to boys than to girls. However, if students are able to talk to their parents about their problems, they are less likely to drink. On the other hand, if a student is able to discuss problems with his or her friends and meets with them after 8, the likelihood of drinking increases.

  2. In the case of bullying, it is observed that if the student has the opportunity to discuss problems with parents and friends, and if he or she is seen after 8 with friends, on a weekly basis, the student is less likely to bully others. Also, this likelihood decreases with the age of the student (the older he or she gets, the less likely he or she is to bully) and if the student is a girl, she is less likely to bully others