1 In-class exercises 1:

Replicate the analysis reported in the bullying example.

Two measures of bullying behavior have been constructed. The first uses self-report on a 9-item Illinois Bully Scale. The second is peer nomination in which children list any person they perceive as a bully. The total number of nominations a person receives is a measure of bullying behavior.

Source: Espelage, D.L., Holt, M.K., & Henkel, R.R. (2003). Examination of Peer-Group Contextual Effects on Aggression During Early Adolescence. Child Development, 74, 205-220.

Column 1: Score on the Illinois Bully Scale Column 2: Total number of peer nominations

1.1 Load data file

# load packages
pacman::p_load(tidyverse, MASS, pscl)

# input data
dta <- read.table("./data/bullying.txt", h=T)

# first 6 lines
head(dta)

  Score Nomination
1  1.56          0
2  1.56          0
3  1.11          0
4  1.56          0
5  1.22          4
6  1.33          0

# check data structure
str(dta)

'data.frame':   291 obs. of  2 variables:
 $ Score     : num  1.56 1.56 1.11 1.56 1.22 1.33 1.11 1 1.11 1.22 ...
 $ Nomination: int  0 0 0 0 4 0 0 0 0 19 ...

1.2 Summaries

# proportion of zeros
sum(dta$Nomination < 1)/length(dta$Nomination)

[1] 0.56357

#
colMeans(dta)

     Score Nomination 
    1.6581     2.4880

#
apply(dta, 2, var)

     Score Nomination 
    0.4842    41.2645

1.3 Data Visualization

# use the Freedman-Diaconis rule for bin width
bd <- 2*IQR(dta$Nomination)/(dim(dta)[1]^(1/3))

#
ot <- theme_set(theme_bw())

# histogram of nomination
ggplot(dta, aes(Nomination)) + 
 stat_bin(binwidth = bd) + 
 labs(x = "Number of nominations", y = "Count")

# poisson fit 
ggplot(dta, aes(Score, Nomination)) +
 geom_jitter(alpha = 0.2) +
 stat_smooth(method = "glm", method.args = list(family = poisson)) +
 labs(y = "Number of peer nomination", x = "Bully score")

In my visual inspection, the number of peer nominations increased as the bully score increases.

1.4 Model: Poisson Model

# poisson model 
summary(m0 <- glm(Nomination ~ Score, family = poisson, data = dta))


Call:
glm(formula = Nomination ~ Score, family = poisson, data = dta)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.499  -1.825  -1.595  -0.155  11.681  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.6631     0.0887   -7.47  7.7e-14
Score         0.8143     0.0350   23.24  < 2e-16

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2205.2  on 290  degrees of freedom
Residual deviance: 1774.6  on 289  degrees of freedom
AIC: 2160

Number of Fisher Scoring iterations: 6

1.5 Model 1: Zero-Inflated Poisson

# zero-inflated poisson
summary(m1 <- zeroinfl(Nomination ~ Score | Score, data = dta))


Call:
zeroinfl(formula = Nomination ~ Score | Score, data = dta)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.3707 -0.6887 -0.5968 -0.0536  9.8749 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.4960     0.0999    4.97  6.8e-07
Score         0.5961     0.0394   15.12  < 2e-16

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.497      0.346    4.32  1.5e-05
Score         -0.772      0.197   -3.92  8.7e-05

Number of iterations in BFGS optimization: 8 
Log-likelihood: -781 on 4 Df

1.6 Model 2: Zero-Inflated Negative Binomial

# zero-inflated negative binomial
summary(m2 <- zeroinfl(Nomination ~ Score | Score, data = dta, 
                              dist = "negbin"))


Call:
zeroinfl(formula = Nomination ~ Score | Score, data = dta, dist = "negbin")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-0.583 -0.487 -0.387  0.023  6.970 

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.674      0.435   -1.55     0.12
Score          0.882      0.205    4.30  1.7e-05
Log(theta)    -1.066      0.179   -5.95  2.7e-09

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)     2.95       2.48    1.19     0.23
Score          -3.42       2.26   -1.51     0.13

Theta = 0.344 
Number of iterations in BFGS optimization: 21 
Log-likelihood: -495 on 5 Df

1.7 Model Comparison

# comparing non-nested models
vuong(m0, m1)

Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A  p-value
Raw                     -4.8918 model2 > model1 5.00e-07
AIC-corrected           -4.8589 model2 > model1 5.90e-07
BIC-corrected           -4.7985 model2 > model1 7.99e-07

vuong(m1, m2)

Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A p-value
Raw                     -4.2428 model2 > model1 1.1e-05
AIC-corrected           -4.2428 model2 > model1 1.1e-05
BIC-corrected           -4.2428 model2 > model1 1.1e-05

1.8 Model diagnosis

# fortify data with fitted values and residuals
dta_m <- data.frame(dta, fit1 = predict(m1, type = "response"), 
                      fit2 = predict(m2, type = "response"), 
                      r1 = resid(m1, type="pearson"),
                      r2 = resid(m2, type="pearson")) 

# residual plots                                               
ggplot(dta_m, aes(fit1, r1)) +
 geom_point(pch=20) +
 geom_point(aes(fit2, r2), pch=1)  +
 geom_hline(yintercept = 0) +
 labs(x = "Fitted values", y = "Residuals", 
      title = "Poisson vs. Negative Binomial")

sjPlot::tab_model(m2, show.obs=F, show.r2=F, show.se=F)

	Nomination
Predictors	Incidence Rate Ratios	CI	p
Count Model
(Intercept)	0.51	0.22 – 1.20	0.122
Score	2.42	1.62 – 3.61	<0.001
Zero-Inflated Model
(Intercept)	19.18	0.15 – 2494.37	0.234
Score	0.03	0.00 – 2.75	0.130

The peer nominations increases by between 62% to 261 % per the bully score. For no peer nominations, the bully score did not significantly associate with it.

1.9 Prediction

# fortify data with model fits and CIs
dta_m2 <- data.frame(dta, yhat = predict(m2)) 

# model fit over observations
ggplot(dta_m2, aes(Score, Nomination)) +
 geom_point(pch = 20, alpha = .2) +
 geom_line(aes(Score, yhat), col = "magenta", lwd=rel(1)) +
 stat_smooth(method = "glm", method.args = list(family = poisson)) +
 labs(x = "Bully Score", y = "Predicted number of nomination")

2 In-class Exercises 2:

Complete the analysis reported in the youth employment example by improvin on the R script and adding comments and interpretation.

A sample of 978 young men aged 20 to 22 years was extracted from the National Longitudinal Survey of Youth (NLSY). The outcome of interest is whether a youth’s reported major activity in 1985 was (1) working, (2) in school, or (3) inactive. Six additional variables were also recorded.

Source: Powers, D.A., & Xie, Y. (2000). Statistical Methods for Categorical Data Analysis. p. 232.

Column 1: Employment status: School = In school, Work = Working, Idle = Inactive Column 2: Ethnicity, Black or Others Column 3: Family structure, Non-intact or Otherwise Column 4: Father’s education, College or Otherwise Column 5: Family income in 1979 in thousands Column 6: Local unemployment rate in 1980 Column 7: Standardized score on the Armed Services Aptitude Battery Test

#
pacman::p_load(tidyverse, MASS, nnet)

#
dta <- read.table("./data/youth_employment.txt", h=T)

#
head(dta)

  Employment   Race     Family    DadEdu Income Local   ASAB
1       Work Others Non-Intact Otherwise 0.4909   6.9 -0.110
2     School Others Non-Intact Otherwise 0.6940   4.8  0.452
3       Work Others  Otherwise Otherwise 1.1000   6.5  0.967
4       Work Others Non-Intact   College 1.5000   3.8  1.667
5       Work Others Non-Intact Otherwise 0.2544   6.9  0.000
6     School Others Non-Intact Otherwise 0.9391   5.4  0.000

#
str(dta)

'data.frame':   978 obs. of  7 variables:
 $ Employment: chr  "Work" "School" "Work" "Work" ...
 $ Race      : chr  "Others" "Others" "Others" "Others" ...
 $ Family    : chr  "Non-Intact" "Non-Intact" "Otherwise" "Non-Intact" ...
 $ DadEdu    : chr  "Otherwise" "Otherwise" "Otherwise" "College" ...
 $ Income    : num  0.491 0.694 1.1 1.5 0.254 ...
 $ Local     : num  6.9 4.8 6.5 3.8 6.9 5.4 4.2 6.2 6.2 6.9 ...
 $ ASAB      : num  -0.11 0.452 0.967 1.667 0 ...

2.1 Decrptive Table

ftable(dta, row.vars = c('Race', 'Family', 'DadEdu'), col.vars = 'Employment')

                            Employment Idle School Work
Race   Family     DadEdu                               
Black  Non-Intact College                 3     10    2
                  Otherwise              39     40   27
       Otherwise  College                 6      8    3
                  Otherwise              28     60   26
Others Non-Intact College                 5     12   18
                  Otherwise              47     43   54
       Otherwise  College                23     66   72
                  Otherwise              84    161  141

2.2 Model: Multinominal

#
summary(m0 <- multinom(Employment ~ ., data = dta))

# weights:  24 (14 variable)
initial  value 1074.442818 
iter  10 value 1024.081197
iter  20 value 1017.247769
iter  20 value 1017.247768
iter  20 value 1017.247768
final  value 1017.247768 
converged

Call:
multinom(formula = Employment ~ ., data = dta)

Coefficients:
       (Intercept) RaceOthers FamilyOtherwise DadEduOtherwise  Income     Local
School     0.28184   -0.22896         0.54744        -0.24139 0.26834  0.012477
Work       0.32681    0.44408         0.13436        -0.17970 0.40712 -0.071149
          ASAB
School 0.17655
Work   0.30788

Std. Errors:
       (Intercept) RaceOthers FamilyOtherwise DadEduOtherwise  Income    Local
School     0.40303    0.19630         0.18614         0.23548 0.20910 0.034602
Work       0.42378    0.21858         0.19227         0.24110 0.21098 0.037389
          ASAB
School 0.10650
Work   0.11021

Residual Deviance: 2034.5 
AIC: 2062.5

sjPlot::tab_model(m0, show.obs=F, show.r2=F)

	Employment
Predictors	Odds Ratios	CI	p	Response
(Intercept)	1.33	0.60 – 2.92	0.484	School
Race [Others]	0.80	0.54 – 1.17	0.243	School
Family [Otherwise]	1.73	1.20 – 2.49	0.003	School
DadEdu [Otherwise]	0.79	0.50 – 1.25	0.305	School
Income	1.31	0.87 – 1.97	0.199	School
Local	1.01	0.95 – 1.08	0.718	School
ASAB	1.19	0.97 – 1.47	0.097	School
(Intercept)	1.39	0.60 – 3.18	0.441	Work
Race [Others]	1.56	1.02 – 2.39	0.042	Work
Family [Otherwise]	1.14	0.78 – 1.67	0.485	Work
DadEdu [Otherwise]	0.84	0.52 – 1.34	0.456	Work
Income	1.50	0.99 – 2.27	0.054	Work
Local	0.93	0.87 – 1.00	0.057	Work
ASAB	1.36	1.10 – 1.69	0.005	Work

The odds of male black youth reporting at work versus inactive as a major activity were 0.64 (.42, .98) times those of others.

The odds of reporting being in school versus inactive for young men from intact families were 1.72 (=1/.58) times as high as the odds for those from nonintact families.

A one unit change in the ASVAB score increases the odds for working (vs. inactive) by 43% = (exp(0.36)-1) × 100.

2.3 Decrptive Table

dta$Race <- relevel(factor(dta$Race), ref="Others", levels=c("Black", "Others"))
dta$Family<-relevel(factor(dta$Family), ref="Otherwise")
dta$DadEdu<-relevel(factor(dta$DadEdu), ref="Otherwise")
dta_c <- dta %>%
  mutate(Incf = cut(Income,
                    breaks = quantile(Income, c(0, 1/3, 2/3, 1)),
                    labels = c('Low', 'Middle', 'High')),
         ASABf = cut(ASAB,
                    breaks = quantile(ASAB, c(0, 1/3, 2/3, 1)),
                    labels = c('Low', 'Middle', 'High')),
         Localf = cut(Local, 
                    breaks = quantile(Local, c(0, 1/2, 1)),
                    labels = c('Low', 'High')))

#
with(dta_c, ftable(Incf, ASABf, Localf, Employment))

                     Employment Idle School Work
Incf   ASABf  Localf                            
Low    Low    Low                 26     39   19
              High                43     34   20
       Middle Low                 21     22   16
              High                11     26    9
       High   Low                  3     13    8
              High                 4      6    3
Middle Low    Low                 13     21   26
              High                11     24   10
       Middle Low                 24     30   35
              High                 5     20   13
       High   Low                  8     23   30
              High                 9     17   13
High   Low    Low                  6      8    9
              High                 4      6    5
       Middle Low                 19     23   16
              High                 4     14   16
       High   Low                 13     45   64
              High                11     26   29

2.4 One category as baseline

dta_wn <- subset(dta, Employment != 'School') %>% mutate(resp = ifelse(Employment=='Work', 1, 0))
dta_sn <- subset(dta, Employment != 'Work') %>% mutate(resp = ifelse(Employment=='School', 1, 0))
m_wn <- glm(resp ~ Race+Family+DadEdu+Income+Local+ASAB, data=dta_wn, family=binomial)
m_sn <- glm(resp ~ Race+Family+DadEdu+Income+Local+ASAB, data=dta_sn, family=binomial)

sjPlot::tab_model(m_wn, m_sn, show.obs=F, show.r2=F)

	resp			resp
Predictors	Odds Ratios	CI	p	Odds Ratios	CI	p
(Intercept)	2.02	1.00 – 4.10	0.051	1.52	0.78 – 2.94	0.218
Race [Black]	0.64	0.41 – 0.99	0.046	1.25	0.85 – 1.84	0.265
Family [Non-Intact]	0.90	0.62 – 1.33	0.609	0.57	0.40 – 0.82	0.003
DadEdu [College]	1.19	0.74 – 1.92	0.471	1.28	0.81 – 2.04	0.303
Income	1.72	1.10 – 2.73	0.019	1.22	0.84 – 1.81	0.316
Local	0.92	0.85 – 0.99	0.025	1.01	0.95 – 1.09	0.708
ASAB	1.34	1.07 – 1.67	0.010	1.22	0.99 – 1.51	0.067

m1 <- nnet::multinom(Employment ~ Race+Family+DadEdu, data=dta, Hess=TRUE, trace=F)

3 In-class Exercises 3:

Reproduce the results of analysis reported in the mental impairment and SES example by adapting the R script in the lecture note and from the rating experiment example.

The mental health status of a sample of 1,660 young New York residents in midtown Manhattan was classified by their parents’ socioeconomic status (A=High, F=Low).

Source: Srole, L. & Fisher, A.K. (1978). Mental health in the metropolis. The midtown Manhattan study. Reported in Agresti, A. (1989). Tutorial on modeling ordered categorical response data. Psychological Bulletin, 105, 290-301.

Column 1: Subject ID Column 2: SES, numerical code A=6, … , F=1 Column 3: SES, reference code for A Column 4: SES, reference code for B Column 5: SES, reference code for C Column 6: SES, reference code for D Column 7: SES, reference code for E Column 9: Mental impairment, well=1, mild=2, moderate=3, impaired=4

3.1 Load data file

#
pacman::p_load(tidyverse, MASS, HH)

#
dta <- read.table("data/mentalSES.txt", h = T)

head(dta)

  Id ses A B C D E mental
1  1   6 1 0 0 0 0      1
2  2   6 1 0 0 0 0      1
3  3   6 1 0 0 0 0      1
4  4   6 1 0 0 0 0      1
5  5   6 1 0 0 0 0      1
6  6   6 1 0 0 0 0      1

dta$mental <- factor(dta$mental, levels = c("1", "2", "3", "4"),
                     labels = c("Well", "Mild", "Moderate", "Impaired"))
dta$SES <- factor(dta$ses, levels = c("6", "5", "4", "3", "2", "1"),
                  labels = c("A", "B", "C", "D", "E", "F"))

str(dta)

'data.frame':   1660 obs. of  9 variables:
 $ Id    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ses   : int  6 6 6 6 6 6 6 6 6 6 ...
 $ A     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ B     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ C     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ D     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ E     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mental: Factor w/ 4 levels "Well","Mild",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SES   : Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

with(dta, ftable(SES, mental))

    mental Well Mild Moderate Impaired
SES                                   
A            64   94       58       46
B            57   94       54       40
C            57  105       65       60
D            72  141       77       94
E            36   97       54       78
F            21   71       54       71

dta_c <- as.data.frame(with(dta, ftable(SES, mental)))

dta_c <- dta_c %>%
  group_by(SES) %>%
  mutate(Total = sum(Freq), Prob = Freq/Total)
head(dta_c)

# A tibble: 6 x 5
# Groups:   SES [6]
  SES   mental  Freq Total   Prob
  <fct> <fct>  <int> <int>  <dbl>
1 A     Well      64   262 0.244 
2 B     Well      57   245 0.233 
3 C     Well      57   287 0.199 
4 D     Well      72   384 0.188 
5 E     Well      36   265 0.136 
6 F     Well      21   217 0.0968

3.2 Likert scale plotting

HH::likert(t(with(dta, table(mental, SES))), main="", ylab="SES")

3.3 Cumulative logit model for ordinal responses

library(VGAM)
m0 <- vglm(ordered(mental) ~ SES, data=dta, 
                 family=cumulative(parallel=TRUE))

summary(m0)


Call:
vglm(formula = ordered(mental) ~ SES, family = cumulative(parallel = TRUE), 
    data = dta)

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)
(Intercept):1   -1.204      0.119  -10.14  < 2e-16
(Intercept):2    0.495      0.115    4.32  1.6e-05
(Intercept):3    1.504      0.120   12.52  < 2e-16
SESB             0.017      0.161    0.11  0.91605
SESC            -0.208      0.155   -1.35  0.17819
SESD            -0.299      0.145   -2.06  0.03928
SESE            -0.567      0.158   -3.59  0.00033
SESF            -0.824      0.167   -4.93  8.0e-07

Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]), 
logitlink(P[Y<=3])

Residual deviance: 4449.4 on 4972 degrees of freedom

Log-likelihood: -2224.7 on 4972 degrees of freedom

Number of Fisher scoring iterations: 4 

No Hauck-Donner effect found in any of the estimates


Exponentiated coefficients:
   SESB    SESC    SESD    SESE    SESF 
1.01712 0.81207 0.74156 0.56737 0.43874

3.4 Model Comparison

m1 <- vglm(ordered(mental) ~ SES, data=dta, 
                 family=cumulative)

VGAM::lrtest(m0, m1)

Likelihood ratio test

Model 1: ordered(mental) ~ SES
Model 2: ordered(mental) ~ SES
   #Df LogLik  Df Chisq Pr(>Chisq)
1 4972  -2225                     
2 4962  -2221 -10  7.83       0.65

3.5 Model 2

#
summary(m2 <- polr(mental ~ SES, data = dta,  
                           method = "probit", Hess = TRUE))

Call:
polr(formula = mental ~ SES, data = dta, Hess = TRUE, method = "probit")

Coefficients:
        Value Std. Error t value
SESB -0.00929     0.0953 -0.0976
SESC  0.12664     0.0916  1.3823
SESD  0.18624     0.0861  2.1643
SESE  0.35077     0.0938  3.7404
SESF  0.50498     0.0991  5.0946

Intercepts:
                  Value   Std. Error t value
Well|Mild          -0.724   0.070    -10.353
Mild|Moderate       0.307   0.068      4.477
Moderate|Impaired   0.920   0.070     13.068

Residual Deviance: 4447.15 
AIC: 4463.15

confint(m2)

         2.5 %  97.5 %
SESB -0.196022 0.17743
SESC -0.052902 0.30623
SESD  0.017621 0.35494
SESE  0.167043 0.53465
SESF  0.310830 0.69938

Generalized Linear Models II: In-class exercises

Hao-Lun Fu

2020-11-09