Replicate the analysis reported in the bullying example.
Two measures of bullying behavior have been constructed. The first uses self-report on a 9-item Illinois Bully Scale. The second is peer nomination in which children list any person they perceive as a bully. The total number of nominations a person receives is a measure of bullying behavior.
Source: Espelage, D.L., Holt, M.K., & Henkel, R.R. (2003). Examination of Peer-Group Contextual Effects on Aggression During Early Adolescence. Child Development, 74, 205-220.
Column 1: Score on the Illinois Bully Scale Column 2: Total number of peer nominations
# load packages
pacman::p_load(tidyverse, MASS, pscl)
# input data
dta <- read.table("./data/bullying.txt", h=T)
# first 6 lines
head(dta)
Score Nomination
1 1.56 0
2 1.56 0
3 1.11 0
4 1.56 0
5 1.22 4
6 1.33 0
# check data structure
str(dta)
'data.frame': 291 obs. of 2 variables:
$ Score : num 1.56 1.56 1.11 1.56 1.22 1.33 1.11 1 1.11 1.22 ...
$ Nomination: int 0 0 0 0 4 0 0 0 0 19 ...
# proportion of zeros
sum(dta$Nomination < 1)/length(dta$Nomination)
[1] 0.56357
#
colMeans(dta)
Score Nomination
1.6581 2.4880
#
apply(dta, 2, var)
Score Nomination
0.4842 41.2645
# use the Freedman-Diaconis rule for bin width
bd <- 2*IQR(dta$Nomination)/(dim(dta)[1]^(1/3))
#
ot <- theme_set(theme_bw())
# histogram of nomination
ggplot(dta, aes(Nomination)) +
stat_bin(binwidth = bd) +
labs(x = "Number of nominations", y = "Count")
# poisson fit
ggplot(dta, aes(Score, Nomination)) +
geom_jitter(alpha = 0.2) +
stat_smooth(method = "glm", method.args = list(family = poisson)) +
labs(y = "Number of peer nomination", x = "Bully score")
In my visual inspection, the number of peer nominations increased as the bully score increases.
# poisson model
summary(m0 <- glm(Nomination ~ Score, family = poisson, data = dta))
Call:
glm(formula = Nomination ~ Score, family = poisson, data = dta)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.499 -1.825 -1.595 -0.155 11.681
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6631 0.0887 -7.47 7.7e-14
Score 0.8143 0.0350 23.24 < 2e-16
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2205.2 on 290 degrees of freedom
Residual deviance: 1774.6 on 289 degrees of freedom
AIC: 2160
Number of Fisher Scoring iterations: 6
# zero-inflated poisson
summary(m1 <- zeroinfl(Nomination ~ Score | Score, data = dta))
Call:
zeroinfl(formula = Nomination ~ Score | Score, data = dta)
Pearson residuals:
Min 1Q Median 3Q Max
-2.3707 -0.6887 -0.5968 -0.0536 9.8749
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.4960 0.0999 4.97 6.8e-07
Score 0.5961 0.0394 15.12 < 2e-16
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.497 0.346 4.32 1.5e-05
Score -0.772 0.197 -3.92 8.7e-05
Number of iterations in BFGS optimization: 8
Log-likelihood: -781 on 4 Df
# zero-inflated negative binomial
summary(m2 <- zeroinfl(Nomination ~ Score | Score, data = dta,
dist = "negbin"))
Call:
zeroinfl(formula = Nomination ~ Score | Score, data = dta, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.583 -0.487 -0.387 0.023 6.970
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.674 0.435 -1.55 0.12
Score 0.882 0.205 4.30 1.7e-05
Log(theta) -1.066 0.179 -5.95 2.7e-09
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.95 2.48 1.19 0.23
Score -3.42 2.26 -1.51 0.13
Theta = 0.344
Number of iterations in BFGS optimization: 21
Log-likelihood: -495 on 5 Df
# comparing non-nested models
vuong(m0, m1)
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw -4.8918 model2 > model1 5.00e-07
AIC-corrected -4.8589 model2 > model1 5.90e-07
BIC-corrected -4.7985 model2 > model1 7.99e-07
vuong(m1, m2)
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw -4.2428 model2 > model1 1.1e-05
AIC-corrected -4.2428 model2 > model1 1.1e-05
BIC-corrected -4.2428 model2 > model1 1.1e-05
# fortify data with fitted values and residuals
dta_m <- data.frame(dta, fit1 = predict(m1, type = "response"),
fit2 = predict(m2, type = "response"),
r1 = resid(m1, type="pearson"),
r2 = resid(m2, type="pearson"))
# residual plots
ggplot(dta_m, aes(fit1, r1)) +
geom_point(pch=20) +
geom_point(aes(fit2, r2), pch=1) +
geom_hline(yintercept = 0) +
labs(x = "Fitted values", y = "Residuals",
title = "Poisson vs. Negative Binomial")
sjPlot::tab_model(m2, show.obs=F, show.r2=F, show.se=F)
| Â | Nomination | |||
|---|---|---|---|---|
| Predictors | Incidence Rate Ratios | CI | p | |
| Count Model | ||||
| (Intercept) | 0.51 | 0.22 – 1.20 | 0.122 | |
| Score | 2.42 | 1.62 – 3.61 | <0.001 | |
| Zero-Inflated Model | ||||
| (Intercept) | 19.18 | 0.15 – 2494.37 | 0.234 | |
| Score | 0.03 | 0.00 – 2.75 | 0.130 | |
The peer nominations increases by between 62% to 261 % per the bully score. For no peer nominations, the bully score did not significantly associate with it.
# fortify data with model fits and CIs
dta_m2 <- data.frame(dta, yhat = predict(m2))
# model fit over observations
ggplot(dta_m2, aes(Score, Nomination)) +
geom_point(pch = 20, alpha = .2) +
geom_line(aes(Score, yhat), col = "magenta", lwd=rel(1)) +
stat_smooth(method = "glm", method.args = list(family = poisson)) +
labs(x = "Bully Score", y = "Predicted number of nomination")
Complete the analysis reported in the youth employment example by improvin on the R script and adding comments and interpretation.
A sample of 978 young men aged 20 to 22 years was extracted from the National Longitudinal Survey of Youth (NLSY). The outcome of interest is whether a youth’s reported major activity in 1985 was (1) working, (2) in school, or (3) inactive. Six additional variables were also recorded.
Source: Powers, D.A., & Xie, Y. (2000). Statistical Methods for Categorical Data Analysis. p. 232.
Column 1: Employment status: School = In school, Work = Working, Idle = Inactive Column 2: Ethnicity, Black or Others Column 3: Family structure, Non-intact or Otherwise Column 4: Father’s education, College or Otherwise Column 5: Family income in 1979 in thousands Column 6: Local unemployment rate in 1980 Column 7: Standardized score on the Armed Services Aptitude Battery Test
#
pacman::p_load(tidyverse, MASS, nnet)
#
dta <- read.table("./data/youth_employment.txt", h=T)
#
head(dta)
Employment Race Family DadEdu Income Local ASAB
1 Work Others Non-Intact Otherwise 0.4909 6.9 -0.110
2 School Others Non-Intact Otherwise 0.6940 4.8 0.452
3 Work Others Otherwise Otherwise 1.1000 6.5 0.967
4 Work Others Non-Intact College 1.5000 3.8 1.667
5 Work Others Non-Intact Otherwise 0.2544 6.9 0.000
6 School Others Non-Intact Otherwise 0.9391 5.4 0.000
#
str(dta)
'data.frame': 978 obs. of 7 variables:
$ Employment: chr "Work" "School" "Work" "Work" ...
$ Race : chr "Others" "Others" "Others" "Others" ...
$ Family : chr "Non-Intact" "Non-Intact" "Otherwise" "Non-Intact" ...
$ DadEdu : chr "Otherwise" "Otherwise" "Otherwise" "College" ...
$ Income : num 0.491 0.694 1.1 1.5 0.254 ...
$ Local : num 6.9 4.8 6.5 3.8 6.9 5.4 4.2 6.2 6.2 6.9 ...
$ ASAB : num -0.11 0.452 0.967 1.667 0 ...
ftable(dta, row.vars = c('Race', 'Family', 'DadEdu'), col.vars = 'Employment')
Employment Idle School Work
Race Family DadEdu
Black Non-Intact College 3 10 2
Otherwise 39 40 27
Otherwise College 6 8 3
Otherwise 28 60 26
Others Non-Intact College 5 12 18
Otherwise 47 43 54
Otherwise College 23 66 72
Otherwise 84 161 141
#
summary(m0 <- multinom(Employment ~ ., data = dta))
# weights: 24 (14 variable)
initial value 1074.442818
iter 10 value 1024.081197
iter 20 value 1017.247769
iter 20 value 1017.247768
iter 20 value 1017.247768
final value 1017.247768
converged
Call:
multinom(formula = Employment ~ ., data = dta)
Coefficients:
(Intercept) RaceOthers FamilyOtherwise DadEduOtherwise Income Local
School 0.28184 -0.22896 0.54744 -0.24139 0.26834 0.012477
Work 0.32681 0.44408 0.13436 -0.17970 0.40712 -0.071149
ASAB
School 0.17655
Work 0.30788
Std. Errors:
(Intercept) RaceOthers FamilyOtherwise DadEduOtherwise Income Local
School 0.40303 0.19630 0.18614 0.23548 0.20910 0.034602
Work 0.42378 0.21858 0.19227 0.24110 0.21098 0.037389
ASAB
School 0.10650
Work 0.11021
Residual Deviance: 2034.5
AIC: 2062.5
sjPlot::tab_model(m0, show.obs=F, show.r2=F)
| Â | Employment | |||
|---|---|---|---|---|
| Predictors | Odds Ratios | CI | p | Response |
| (Intercept) | 1.33 | 0.60 – 2.92 | 0.484 | School |
| Race [Others] | 0.80 | 0.54 – 1.17 | 0.243 | School |
| Family [Otherwise] | 1.73 | 1.20 – 2.49 | 0.003 | School |
| DadEdu [Otherwise] | 0.79 | 0.50 – 1.25 | 0.305 | School |
| Income | 1.31 | 0.87 – 1.97 | 0.199 | School |
| Local | 1.01 | 0.95 – 1.08 | 0.718 | School |
| ASAB | 1.19 | 0.97 – 1.47 | 0.097 | School |
| (Intercept) | 1.39 | 0.60 – 3.18 | 0.441 | Work |
| Race [Others] | 1.56 | 1.02 – 2.39 | 0.042 | Work |
| Family [Otherwise] | 1.14 | 0.78 – 1.67 | 0.485 | Work |
| DadEdu [Otherwise] | 0.84 | 0.52 – 1.34 | 0.456 | Work |
| Income | 1.50 | 0.99 – 2.27 | 0.054 | Work |
| Local | 0.93 | 0.87 – 1.00 | 0.057 | Work |
| ASAB | 1.36 | 1.10 – 1.69 | 0.005 | Work |
The odds of male black youth reporting at work versus inactive as a major activity were 0.64 (.42, .98) times those of others.
The odds of reporting being in school versus inactive for young men from intact families were 1.72 (=1/.58) times as high as the odds for those from nonintact families.
A one unit change in the ASVAB score increases the odds for working (vs. inactive) by 43% = (exp(0.36)-1) × 100.
dta$Race <- relevel(factor(dta$Race), ref="Others", levels=c("Black", "Others"))
dta$Family<-relevel(factor(dta$Family), ref="Otherwise")
dta$DadEdu<-relevel(factor(dta$DadEdu), ref="Otherwise")
dta_c <- dta %>%
mutate(Incf = cut(Income,
breaks = quantile(Income, c(0, 1/3, 2/3, 1)),
labels = c('Low', 'Middle', 'High')),
ASABf = cut(ASAB,
breaks = quantile(ASAB, c(0, 1/3, 2/3, 1)),
labels = c('Low', 'Middle', 'High')),
Localf = cut(Local,
breaks = quantile(Local, c(0, 1/2, 1)),
labels = c('Low', 'High')))
#
with(dta_c, ftable(Incf, ASABf, Localf, Employment))
Employment Idle School Work
Incf ASABf Localf
Low Low Low 26 39 19
High 43 34 20
Middle Low 21 22 16
High 11 26 9
High Low 3 13 8
High 4 6 3
Middle Low Low 13 21 26
High 11 24 10
Middle Low 24 30 35
High 5 20 13
High Low 8 23 30
High 9 17 13
High Low Low 6 8 9
High 4 6 5
Middle Low 19 23 16
High 4 14 16
High Low 13 45 64
High 11 26 29
dta_wn <- subset(dta, Employment != 'School') %>% mutate(resp = ifelse(Employment=='Work', 1, 0))
dta_sn <- subset(dta, Employment != 'Work') %>% mutate(resp = ifelse(Employment=='School', 1, 0))
m_wn <- glm(resp ~ Race+Family+DadEdu+Income+Local+ASAB, data=dta_wn, family=binomial)
m_sn <- glm(resp ~ Race+Family+DadEdu+Income+Local+ASAB, data=dta_sn, family=binomial)
sjPlot::tab_model(m_wn, m_sn, show.obs=F, show.r2=F)
| Â | resp | resp | ||||
|---|---|---|---|---|---|---|
| Predictors | Odds Ratios | CI | p | Odds Ratios | CI | p |
| (Intercept) | 2.02 | 1.00 – 4.10 | 0.051 | 1.52 | 0.78 – 2.94 | 0.218 |
| Race [Black] | 0.64 | 0.41 – 0.99 | 0.046 | 1.25 | 0.85 – 1.84 | 0.265 |
| Family [Non-Intact] | 0.90 | 0.62 – 1.33 | 0.609 | 0.57 | 0.40 – 0.82 | 0.003 |
| DadEdu [College] | 1.19 | 0.74 – 1.92 | 0.471 | 1.28 | 0.81 – 2.04 | 0.303 |
| Income | 1.72 | 1.10 – 2.73 | 0.019 | 1.22 | 0.84 – 1.81 | 0.316 |
| Local | 0.92 | 0.85 – 0.99 | 0.025 | 1.01 | 0.95 – 1.09 | 0.708 |
| ASAB | 1.34 | 1.07 – 1.67 | 0.010 | 1.22 | 0.99 – 1.51 | 0.067 |
m1 <- nnet::multinom(Employment ~ Race+Family+DadEdu, data=dta, Hess=TRUE, trace=F)
Reproduce the results of analysis reported in the mental impairment and SES example by adapting the R script in the lecture note and from the rating experiment example.
The mental health status of a sample of 1,660 young New York residents in midtown Manhattan was classified by their parents’ socioeconomic status (A=High, F=Low).
Source: Srole, L. & Fisher, A.K. (1978). Mental health in the metropolis. The midtown Manhattan study. Reported in Agresti, A. (1989). Tutorial on modeling ordered categorical response data. Psychological Bulletin, 105, 290-301.
Column 1: Subject ID Column 2: SES, numerical code A=6, … , F=1 Column 3: SES, reference code for A Column 4: SES, reference code for B Column 5: SES, reference code for C Column 6: SES, reference code for D Column 7: SES, reference code for E Column 9: Mental impairment, well=1, mild=2, moderate=3, impaired=4
#
pacman::p_load(tidyverse, MASS, HH)
#
dta <- read.table("data/mentalSES.txt", h = T)
head(dta)
Id ses A B C D E mental
1 1 6 1 0 0 0 0 1
2 2 6 1 0 0 0 0 1
3 3 6 1 0 0 0 0 1
4 4 6 1 0 0 0 0 1
5 5 6 1 0 0 0 0 1
6 6 6 1 0 0 0 0 1
dta$mental <- factor(dta$mental, levels = c("1", "2", "3", "4"),
labels = c("Well", "Mild", "Moderate", "Impaired"))
dta$SES <- factor(dta$ses, levels = c("6", "5", "4", "3", "2", "1"),
labels = c("A", "B", "C", "D", "E", "F"))
str(dta)
'data.frame': 1660 obs. of 9 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ ses : int 6 6 6 6 6 6 6 6 6 6 ...
$ A : int 1 1 1 1 1 1 1 1 1 1 ...
$ B : int 0 0 0 0 0 0 0 0 0 0 ...
$ C : int 0 0 0 0 0 0 0 0 0 0 ...
$ D : int 0 0 0 0 0 0 0 0 0 0 ...
$ E : int 0 0 0 0 0 0 0 0 0 0 ...
$ mental: Factor w/ 4 levels "Well","Mild",..: 1 1 1 1 1 1 1 1 1 1 ...
$ SES : Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
with(dta, ftable(SES, mental))
mental Well Mild Moderate Impaired
SES
A 64 94 58 46
B 57 94 54 40
C 57 105 65 60
D 72 141 77 94
E 36 97 54 78
F 21 71 54 71
dta_c <- as.data.frame(with(dta, ftable(SES, mental)))
dta_c <- dta_c %>%
group_by(SES) %>%
mutate(Total = sum(Freq), Prob = Freq/Total)
head(dta_c)
# A tibble: 6 x 5
# Groups: SES [6]
SES mental Freq Total Prob
<fct> <fct> <int> <int> <dbl>
1 A Well 64 262 0.244
2 B Well 57 245 0.233
3 C Well 57 287 0.199
4 D Well 72 384 0.188
5 E Well 36 265 0.136
6 F Well 21 217 0.0968
HH::likert(t(with(dta, table(mental, SES))), main="", ylab="SES")
library(VGAM)
m0 <- vglm(ordered(mental) ~ SES, data=dta,
family=cumulative(parallel=TRUE))
summary(m0)
Call:
vglm(formula = ordered(mental) ~ SES, family = cumulative(parallel = TRUE),
data = dta)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -1.204 0.119 -10.14 < 2e-16
(Intercept):2 0.495 0.115 4.32 1.6e-05
(Intercept):3 1.504 0.120 12.52 < 2e-16
SESB 0.017 0.161 0.11 0.91605
SESC -0.208 0.155 -1.35 0.17819
SESD -0.299 0.145 -2.06 0.03928
SESE -0.567 0.158 -3.59 0.00033
SESF -0.824 0.167 -4.93 8.0e-07
Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
logitlink(P[Y<=3])
Residual deviance: 4449.4 on 4972 degrees of freedom
Log-likelihood: -2224.7 on 4972 degrees of freedom
Number of Fisher scoring iterations: 4
No Hauck-Donner effect found in any of the estimates
Exponentiated coefficients:
SESB SESC SESD SESE SESF
1.01712 0.81207 0.74156 0.56737 0.43874
m1 <- vglm(ordered(mental) ~ SES, data=dta,
family=cumulative)
VGAM::lrtest(m0, m1)
Likelihood ratio test
Model 1: ordered(mental) ~ SES
Model 2: ordered(mental) ~ SES
#Df LogLik Df Chisq Pr(>Chisq)
1 4972 -2225
2 4962 -2221 -10 7.83 0.65
#
summary(m2 <- polr(mental ~ SES, data = dta,
method = "probit", Hess = TRUE))
Call:
polr(formula = mental ~ SES, data = dta, Hess = TRUE, method = "probit")
Coefficients:
Value Std. Error t value
SESB -0.00929 0.0953 -0.0976
SESC 0.12664 0.0916 1.3823
SESD 0.18624 0.0861 2.1643
SESE 0.35077 0.0938 3.7404
SESF 0.50498 0.0991 5.0946
Intercepts:
Value Std. Error t value
Well|Mild -0.724 0.070 -10.353
Mild|Moderate 0.307 0.068 4.477
Moderate|Impaired 0.920 0.070 13.068
Residual Deviance: 4447.15
AIC: 4463.15
confint(m2)
2.5 % 97.5 %
SESB -0.196022 0.17743
SESC -0.052902 0.30623
SESD 0.017621 0.35494
SESE 0.167043 0.53465
SESF 0.310830 0.69938