Two measures of bullying behavior have been constructed.
The first uses self-report on a 9-item Illinois Bully Scale.
The second is peer nomination in which children list any person they perceive as a bully.
The total number of nominations a person receives is a measure of bullying behavior.
Source: Espelage, D.L., Holt, M.K., & Henkel, R.R. (2003). Examination of Peer-Group Contextual Effects on Aggression During Early Adolescence. Child Development, 74, 205-220.
Column 1: Score on the Illinois Bully Scale
Column 2: Total number of peer nominations
# load packages
pacman::p_load(tidyverse, MASS, pscl)
# input data
dta <- read.table("C:/Users/Ching-Fang Wu/Documents/data/bullying.txt", h=T)
# first 8 lines
head(dta,8)
## Score Nomination
## 1 1.56 0
## 2 1.56 0
## 3 1.11 0
## 4 1.56 0
## 5 1.22 4
## 6 1.33 0
## 7 1.11 0
## 8 1.00 0
# check data structure
str(dta)
## 'data.frame': 291 obs. of 2 variables:
## $ Score : num 1.56 1.56 1.11 1.56 1.22 1.33 1.11 1 1.11 1.22 ...
## $ Nomination: int 0 0 0 0 4 0 0 0 0 19 ...
# proportion of zeros
sum(dta$Nomination < 1)/length(dta$Nomination)
## [1] 0.5635739
未被提名為霸凌者比率0.5636
# mean
colMeans(dta)
## Score Nomination
## 1.658110 2.487973
#variance
apply(dta, 2, var) #1 indicates rows, 2 indicates columns
## Score Nomination
## 0.4842037 41.2645100
possion分配有Equi-dispersion的特性:E(Nomination)=Var(Nomination),這裡Var(Nomination)>E(Nomination),可能有Over-dispersion狀況。
# use the Freedman-Diaconis rule for bin width
bd <- 2*IQR(dta$Nomination)/(dim(dta)[1]^(1/3))
ot <- theme_set(theme_bw())
# histogram of nomination
ggplot(dta, aes(Nomination)) +
stat_bin(binwidth = bd) +
labs(x = "Number of nominations", y = "Count")
被同儕提名次數“0”的人數很多,表示大多數的人都不是霸凌者。但是“0”值次數過高,有可能是真的,也可能是假的。
ggplot(dta, aes(Score, Nomination)) +
geom_jitter(alpha = 0.2) +
stat_smooth(method = "glm", method.args = list(family = poisson)) + # poisson fit
labs(y = "Number of peer nomination", x = "Bully score")
## `geom_smooth()` using formula 'y ~ x'
summary(m0 <- glm(Nomination ~ Score, family = poisson, data = dta))
##
## Call:
## glm(formula = Nomination ~ Score, family = poisson, data = dta)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.4995 -1.8246 -1.5952 -0.1552 11.6806
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.66308 0.08871 -7.475 7.75e-14 ***
## Score 0.81434 0.03504 23.239 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2205.2 on 290 degrees of freedom
## Residual deviance: 1774.6 on 289 degrees of freedom
## AIC: 2160.3
##
## Number of Fisher Scoring iterations: 6
#zero-inflated poisson適用「高比例零值」的計數型資料型態
summary(m1 <- zeroinfl(Nomination ~ Score | Score, data = dta))
##
## Call:
## zeroinfl(formula = Nomination ~ Score | Score, data = dta)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.37068 -0.68873 -0.59685 -0.05357 9.87491
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.49604 0.09987 4.967 6.81e-07 ***
## Score 0.59609 0.03942 15.120 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4967 0.3463 4.322 1.54e-05 ***
## Score -0.7716 0.1967 -3.924 8.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 8
## Log-likelihood: -780.6 on 4 Df
Count model coefficients中, Score增加一單位,被認為是霸凌者比率增加0.59609
Zero-inflation model coefficients中, Score增加一單位,被認為是霸凌者的比率減少0.7716
summary(m2 <- zeroinfl(Nomination ~ Score | Score, data = dta, dist = "negbin"))
##
## Call:
## zeroinfl(formula = Nomination ~ Score | Score, data = dta, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.58338 -0.48698 -0.38705 0.02297 6.96982
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6737 0.4354 -1.547 0.122
## Score 0.8819 0.2052 4.297 1.73e-05 ***
## Log(theta) -1.0658 0.1792 -5.947 2.73e-09 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.954 2.484 1.189 0.234
## Score -3.419 2.260 -1.513 0.130
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 0.3445
## Number of iterations in BFGS optimization: 21
## Log-likelihood: -494.7 on 5 Df
Count model coefficients中, Score增加一單位,被認為是霸凌者的比率增加0.8819
Zero-inflation model coefficients中, Score增加一單位,被認為是霸凌者的比率減少3.419
vuong(m0, m1)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -4.891816 model2 > model1 4.9955e-07
## AIC-corrected -4.858930 model2 > model1 5.9011e-07
## BIC-corrected -4.798531 model2 > model1 7.9917e-07
vuong(m1, m2)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -4.242754 model2 > model1 1.104e-05
## AIC-corrected -4.242754 model2 > model1 1.104e-05
## BIC-corrected -4.242754 model2 > model1 1.104e-05
m1較佳,因為m1的AIC小於m2。
# fortify data with fitted values and residuals
dta_m <- data.frame(dta, fit1 = predict(m1, type = "response"),
fit2 = predict(m2, type = "response"),
r1 = resid(m1, type="pearson"),
r2 = resid(m2, type="pearson"))
ggplot(dta_m, aes(fit1, r1)) +
geom_point(pch=20) +
geom_point(aes(fit2, r2), pch=1) +
geom_hline(yintercept = 0) +
labs(x = "Fitted values", y = "Residuals",
title = "Poisson vs. Negative Binomial")
# fortify data with model fits and CIs
dta_m2 <- data.frame(dta, yhat = predict(m2))
# model fit over observations
ggplot(dta_m2, aes(Score, Nomination)) +
geom_point(pch = 20, alpha = .2) +
geom_line(aes(Score, yhat), col = "magenta", lwd=rel(1)) +
stat_smooth(method = "glm", method.args = list(family = poisson)) +
labs(x = "Bully Score", y = "Predicted number of nomination")
## `geom_smooth()` using formula 'y ~ x'