1 Introduction

Two measures of bullying behavior have been constructed.

The first uses self-report on a 9-item Illinois Bully Scale.

The second is peer nomination in which children list any person they perceive as a bully.

The total number of nominations a person receives is a measure of bullying behavior.

Source: Espelage, D.L., Holt, M.K., & Henkel, R.R. (2003). Examination of Peer-Group Contextual Effects on Aggression During Early Adolescence. Child Development, 74, 205-220.

Column 1: Score on the Illinois Bully Scale
Column 2: Total number of peer nominations

2 Data management

# load packages
pacman::p_load(tidyverse, MASS, pscl)

# input data
dta <- read.table("C:/Users/Ching-Fang Wu/Documents/data/bullying.txt", header = TRUE)

# first 8 lines
head(dta,8)

##   Score Nomination
## 1  1.56          0
## 2  1.56          0
## 3  1.11          0
## 4  1.56          0
## 5  1.22          4
## 6  1.33          0
## 7  1.11          0
## 8  1.00          0

# check data structure
str(dta)

## 'data.frame':    291 obs. of  2 variables:
##  $ Score     : num  1.56 1.56 1.11 1.56 1.22 1.33 1.11 1 1.11 1.22 ...
##  $ Nomination: int  0 0 0 0 4 0 0 0 0 19 ...

# proportion of zeros
mean(dta$Nomination == 0)

## [1] 0.5635739

The proportion of children who were never nominated as a bully is 0.5636.

# mean
colMeans(dta)

##      Score Nomination 
##   1.658110   2.487973

# variance
apply(dta, 2, var) # MARGIN = 2 applies var to each column (1 would mean rows)

##      Score Nomination 
##  0.4842037 41.2645100

The Poisson distribution has the equi-dispersion property: E(Nomination) = Var(Nomination). Here Var(Nomination) = 41.26 is far larger than E(Nomination) = 2.49, so the data may be over-dispersed.
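As a rough check of the equi-dispersion claim, the summary statistics printed above can be plugged in directly. This is a minimal sketch that hard-codes the reported mean, variance, and zero proportion instead of reading `dta`; the variable names are illustrative only.

```r
# Values copied from the output above (not recomputed from the data)
m_nom  <- 2.487973    # mean of Nomination
v_nom  <- 41.2645100  # variance of Nomination
p_zero <- 0.5635739   # observed proportion of zeros

# Variance-to-mean ratio: equals 1 under a Poisson distribution
disp_ratio <- v_nom / m_nom

# Proportion of zeros a Poisson with this mean would produce: exp(-mean)
p_zero_pois <- dpois(0, lambda = m_nom)

round(c(ratio = disp_ratio,
        expected_zero = p_zero_pois,
        observed_zero = p_zero), 3)
```

The ratio comes out near 16.6, and a Poisson with this mean would give only about 8% zeros against the 56% observed, which points to both over-dispersion and excess zeros.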

# use the Freedman-Diaconis rule for bin width
bd <- 2*IQR(dta$Nomination)/(dim(dta)[1]^(1/3))
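The bin-width rule used here is 2 * IQR(x) / n^(1/3). The sketch below wraps it in a hypothetical helper, `fd_binwidth()`, and applies it to a toy vector so the arithmetic can be followed by hand; neither the helper nor the toy data come from the original analysis.

```r
# Freedman-Diaconis bin width: 2 * IQR(x) / n^(1/3)
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]              # drop missing values before counting n
  2 * IQR(x) / length(x)^(1/3)
}

# Toy data (hypothetical): IQR(1:8) = 3.5 and 8^(1/3) = 2,
# so the bin width is 2 * 3.5 / 2 = 3.5
toy <- 1:8
fd_binwidth(toy)
```

With zero-heavy counts the IQR can shrink toward 0, so it is worth checking that the resulting width is positive before passing it to `stat_bin()`.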

3 Data Visualization

ot <- theme_set(theme_bw())

# histogram of nomination
ggplot(dta, aes(Nomination)) + 
 stat_bin(binwidth = bd) + 
 labs(x = "Number of nominations", y = "Count")

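Many children received zero peer nominations, which suggests that most are not perceived as bullies. The number of zeros is very high, however, and some of these zeros may be genuine while others may not be.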

3.1 Poisson fit

ggplot(dta, aes(Score, Nomination)) +
 geom_jitter(alpha = 0.2) +
 stat_smooth(method = "glm", method.args = list(family = poisson))  + # poisson fit
 labs(y = "Number of peer nominations", x = "Bully score")

## `geom_smooth()` using formula 'y ~ x'

4 Modeling

4.1 m0: Poisson model

summary(m0 <- glm(Nomination ~ Score, family = poisson, data = dta))

## 
## Call:
## glm(formula = Nomination ~ Score, family = poisson, data = dta)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.4995  -1.8246  -1.5952  -0.1552  11.6806  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.66308    0.08871  -7.475 7.75e-14 ***
## Score        0.81434    0.03504  23.239  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2205.2  on 290  degrees of freedom
## Residual deviance: 1774.6  on 289  degrees of freedom
## AIC: 2160.3
## 
## Number of Fisher Scoring iterations: 6
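Two quick readings of this summary can be worked out from the printed numbers. The sketch below hard-codes the residual deviance, its degrees of freedom, and the Score coefficient from the output above; the variable names are illustrative.

```r
# Values copied from summary(m0) above
resid_dev <- 1774.6
resid_df  <- 289
b_score   <- 0.81434

# Residual deviance / df: values well above 1 suggest over-dispersion
dev_ratio <- resid_dev / resid_df

# Rate ratio: multiplicative change in the expected count
# for a one-unit increase in Score
irr_score <- exp(b_score)

round(c(dev_ratio = dev_ratio, irr_score = irr_score), 2)
```

With the fitted object at hand, the same quantities should be available as `deviance(m0) / df.residual(m0)` and `exp(coef(m0))`. A ratio around 6 indicates that the Poisson standard errors are probably too small.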

4.2 m1: zero-inflated Poisson

# a zero-inflated Poisson model suits count data with a high proportion of zeros
summary(m1 <- zeroinfl(Nomination ~ Score | Score, data = dta))

## 
## Call:
## zeroinfl(formula = Nomination ~ Score | Score, data = dta)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.37068 -0.68873 -0.59685 -0.05357  9.87491 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.49604    0.09987   4.967 6.81e-07 ***
## Score        0.59609    0.03942  15.120  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.4967     0.3463   4.322 1.54e-05 ***
## Score        -0.7716     0.1967  -3.924 8.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 8 
## Log-likelihood: -780.6 on 4 Df

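In the count model, a one-unit increase in Score raises the log of the expected number of nominations by 0.59609, which multiplies the expected count by exp(0.59609) ≈ 1.82 among children who are not structural zeros.

In the zero-inflation model, a one-unit increase in Score lowers the log-odds of being a structural zero (a child who is never nominated) by 0.7716, which multiplies those odds by exp(-0.7716) ≈ 0.46.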
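To see how the two parts of the zero-inflated Poisson combine, the sketch below rebuilds predictions by hand from the coefficients printed in `summary(m1)`. The helper `zip_predict()` and the Score values 1, 2, 3 are hypothetical, chosen only for illustration.

```r
# Coefficients copied from summary(m1) above
b_count <- c(intercept = 0.49604, score = 0.59609)  # count part, log link
b_zero  <- c(intercept = 1.4967,  score = -0.7716)  # zero part, logit link

zip_predict <- function(score) {
  # Poisson mean for children who are not structural zeros
  mu  <- exp(b_count[["intercept"]] + b_count[["score"]] * score)
  # Probability of being a structural zero
  pi0 <- plogis(b_zero[["intercept"]] + b_zero[["score"]] * score)
  data.frame(score  = score,
             pi0    = pi0,
             mu     = mu,
             p_zero = pi0 + (1 - pi0) * exp(-mu),  # total P(Y = 0)
             mean_y = (1 - pi0) * mu)              # overall expected count
}

zip_predict(c(1, 2, 3))
```

As Score rises, `pi0` falls and `mean_y` rises, matching the signs of the two Score coefficients. With the fitted object, `predict(m1, type = "zero")`, `predict(m1, type = "count")`, and `predict(m1, type = "response")` should return the corresponding quantities.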

4.3 m2: zero-inflated negative binomial

summary(m2 <- zeroinfl(Nomination ~ Score | Score, data = dta, dist = "negbin"))

## 
## Call:
## zeroinfl(formula = Nomination ~ Score | Score, data = dta, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58338 -0.48698 -0.38705  0.02297  6.96982 
## 
## Count model coefficients (negbin with log link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.6737     0.4354  -1.547    0.122    
## Score         0.8819     0.2052   4.297 1.73e-05 ***
## Log(theta)   -1.0658     0.1792  -5.947 2.73e-09 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    2.954      2.484   1.189    0.234
## Score         -3.419      2.260  -1.513    0.130
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.3445 
## Number of iterations in BFGS optimization: 21 
## Log-likelihood: -494.7 on 5 Df

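In the count model, a one-unit increase in Score raises the log of the expected number of nominations by 0.8819, which multiplies the expected count by exp(0.8819) ≈ 2.42.

In the zero-inflation model, a one-unit increase in Score lowers the log-odds of being a structural zero by 3.419 (odds multiplied by exp(-3.419) ≈ 0.033), but this coefficient is not statistically significant (p = 0.130).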
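The `Log(theta)` row is what lets the negative binomial absorb over-dispersion. Under the usual parameterization, Var(Y) = mu + mu^2 / theta, so a small theta means a variance far above the mean. The sketch below uses the printed `Log(theta)`; the helper `nb_var()` and the `mu` values are illustrative assumptions.

```r
# Log(theta) copied from summary(m2) above
log_theta <- -1.0658
theta <- exp(log_theta)   # should match the reported Theta = 0.3445

# Negative binomial variance function: mu + mu^2 / theta
nb_var <- function(mu, theta) mu + mu^2 / theta

mu <- c(1, 2, 5)          # illustrative means, not fitted values
data.frame(mu,
           poisson_var = mu,               # Poisson: variance equals mean
           negbin_var  = nb_var(mu, theta))
```

As theta grows large the extra term vanishes and the negative binomial approaches the Poisson.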

5 Comparing non-nested models

vuong(m0, m1)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -4.891816 model2 > model1 4.9955e-07
## AIC-corrected         -4.858930 model2 > model1 5.9011e-07
## BIC-corrected         -4.798531 model2 > model1 7.9917e-07

vuong(m1, m2)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A   p-value
## Raw                   -4.242754 model2 > model1 1.104e-05
## AIC-corrected         -4.242754 model2 > model1 1.104e-05
## BIC-corrected         -4.242754 model2 > model1 1.104e-05
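The p-values in these tables appear to be one-sided standard normal tail probabilities of the z-statistics. The sketch below reproduces them from the raw statistics copied from the two outputs; it is a check on the printed numbers, not a re-run of `vuong()`.

```r
# Raw Vuong z-statistics copied from the output above
z_m0_m1 <- -4.891816   # vuong(m0, m1)
z_m1_m2 <- -4.242754   # vuong(m1, m2)

# One-sided p-value under N(0, 1); a negative z favors the second model
vuong_p <- function(z) pnorm(abs(z), lower.tail = FALSE)

c(m0_vs_m1 = vuong_p(z_m0_m1), m1_vs_m2 = vuong_p(z_m1_m2))
```

Both statistics are negative and far in the tail, so in each comparison the second model listed is favored.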

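m2 (zero-inflated negative binomial) is the preferred model. Both Vuong tests favor model2, so the zero-inflated Poisson (m1) outperforms the Poisson (m0), and the zero-inflated negative binomial (m2) outperforms m1. The log-likelihoods agree: -494.7 on 5 Df for m2 versus -780.6 on 4 Df for m1, which gives m2 the smaller AIC.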
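The AIC values can be recomputed from the reported log-likelihoods with AIC = 2k - 2 logLik, where k is the number of estimated parameters. The sketch below uses the numbers printed in the summaries above; `aic_from_loglik()` is a hypothetical helper.

```r
# AIC = 2 * (number of parameters) - 2 * log-likelihood
aic_from_loglik <- function(loglik, k) 2 * k - 2 * loglik

aic_m0 <- 2160.3                        # printed in summary(m0)
aic_m1 <- aic_from_loglik(-780.6, 4)    # zero-inflated Poisson
aic_m2 <- aic_from_loglik(-494.7, 5)    # zero-inflated negative binomial

c(m0 = aic_m0, m1 = aic_m1, m2 = aic_m2)
```

Smaller is better, so the ordering is m2, then m1, then m0. With the fitted objects, `AIC(m0, m1, m2)` should give the same comparison.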

# fortify data with fitted values and residuals
dta_m <- data.frame(dta, fit1 = predict(m1, type = "response"), 
                      fit2 = predict(m2, type = "response"), 
                      r1 = resid(m1, type="pearson"),
                      r2 = resid(m2, type="pearson"))

6 Residual plots

ggplot(dta_m, aes(fit1, r1)) +
 geom_point(pch=20) +
 geom_point(aes(fit2, r2), pch=1)  +
 geom_hline(yintercept = 0) +
 labs(x = "Fitted values", y = "Residuals", 
      title = "Poisson vs. Negative Binomial")

# fortify data with m2 fitted values (predict() on a zeroinfl fit defaults to type = "response")
dta_m2 <- data.frame(dta, yhat = predict(m2))

# model fit over observations
ggplot(dta_m2, aes(Score, Nomination)) +
 geom_point(pch = 20, alpha = .2) +
 geom_line(aes(Score, yhat), col = "magenta", lwd=rel(1)) +
 stat_smooth(method = "glm", method.args = list(family = poisson)) +
 labs(x = "Bully Score", y = "Predicted number of nominations")

## `geom_smooth()` using formula 'y ~ x'

Week9 In-class exercise 1 - Count regression

Ching-Fang Wu

Sun Nov 08 22:08:07 2020