1 Inclass 1

1.1 Info

Problem: Replicate the analysis reported in the brightness discrimination example.
Data: Consider the outcome of a brightness discrimination experiment.
On each trial, two patches of light were presented: one was a standard with an intensity of 35 units, and another was a variable intensity comparison light.
The subject was asked to indicate which one of the lights was brighter.

Column 1: Stimulus condition
Column 2: Intensity of the variable (comparison) light
Column 3: Number of comparisons
Column 4: Number of times comparison light was judged brighter

1.2 Data input

# Loading package
pacman::p_load(tidyr, tidyverse, nlme, car, afex, emmeans, lattice, faraway, boot, ggplot2, kableExtra, MASS)

# input data
dta1 <- read.table("brightness.txt", h=T)

# top six row 
head(dta1)

##   Stimulus Intensity Trial Yes
## 1        1        10    49   2
## 2        2        20    48   3
## 3        3        30    48  13
## 4        4        40    50  31
## 5        5        50    47  38
## 6        6        60    49  48

# observed proportions
dta1$Prop <-  dta1$Yes/dta1$Trial

The proportion of stimulus condition of 6 is the highest (97.95%).

1.3 GLM

The relationship between stimulus intensity and the probability of correct discrimination be
Pr{Y = y | S = I} = Φ(α + β × I),
where I is the stimulus intensity and α, β are parameters.

# use glm to fit a psychophysical function
summary(m0 <- glm(cbind(Yes, Trial - Yes) ~ Intensity, data = dta1, 
                  family = binomial(link = "probit")))

## 
## Call:
## glm(formula = cbind(Yes, Trial - Yes) ~ Intensity, family = binomial(link = "probit"), 
##     data = dta1)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
##  1.0388  -0.7613  -0.3124   0.4486  -0.5979   0.7050  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.874139   0.286787  -10.02   <2e-16 ***
## Intensity    0.077472   0.007367   10.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 184.5899  on 5  degrees of freedom
## Residual deviance:   2.8119  on 4  degrees of freedom
## AIC: 26.549
## 
## Number of Fisher Scoring iterations: 4

1.4 Model fit

# model fit
1 - pchisq(m0$deviance, m0$df.residual)

## [1] 0.5897832

# Hosmer-Lemeshow Goodness of Fit
library(ResourceSelection)

## Warning: package 'ResourceSelection' was built under R version 4.0.3

## ResourceSelection 0.3-5   2019-07-22

hoslem.test(m0$y, fitted(m0))

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  m0$y, fitted(m0)
## X-squared = NaN, df = 8, p-value = NA

Estimated π_i = Φ(-2.874 + 0.077 × I_i)

The model may not fit the data well?

# new values in the intensity range 
newxval <- with(dta1, data.frame(Intensity = seq(5,70, by = 0.5)))

# predicted probabilities
phat <- predict(m0, newdata = newxval, type = "response")

1.5 Plot

1.5.1 Fitted curve

ot <- theme_set(theme_bw())

# fitted psychophysical curve
ggplot() +
  geom_line(aes(x = newxval[,1], y = phat)) +
  geom_point(aes(x = dta1$Intensity, y = dta1$Prop)) +
  geom_vline(xintercept = 35, lty = 2, col = "gray") +
  geom_hline(yintercept = .5, lty = 2, col = "gray") +
  labs(x = "Intensity", y = "Probability of correnct responses")

The stimulus intensity is positively associated with correct responses of brightness discrimination.

1.5.2 lecture code

plot(Yes/Trial ~ Intensity, data=dta1,
     bty='o',
     xlab="Intensity", 
     ylab="Probability of correnct responses")

m_mnr <- glm(cbind(Yes, Trial-Yes) ~ Intensity, 
             data=dta1, 
             family=binomial(link="probit"))
grid()
lines(dta1$Yes, fitted(m0, type="response"))

1.6 Reference

Gegenfurtner, K. (1992). Praxis: Brent’s algorithm for function minimization. Behavior Research Methods, Instruments, and Computers. 24(4), 560-564.

2 Inclass 2

2.1 Info

Problem: The patient was classified by psychiatrists as either a ‘case’ or a ‘non-case’. The question of interest was whether GHQ score, together with patient’s gender, could be used to detect psychiatric cases. Perform an appropriate analysis.
Data: Silvapulle (1981) reported a study on the relation between psychiatric diagnosis and the score on a 12-item General Health Questionnaire (GHQ) for 120 patients admitted to a hospital.
Download the ghq.rda data file to your “data” folder and load it to your R session with the command [load(“data/ghq.rda”)]

Column 1: Gender ID
Column 2: GHQ score
Column 3: Number of cases
Column 4: Number of non-cases

2.2 Data input

load(file = "ghq.rda")

str(ghq)

## 'data.frame':    22 obs. of  4 variables:
##  $ sex: Factor w/ 2 levels "men","women": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ghq: int  0 1 2 3 4 5 6 7 8 9 ...
##  $ c  : int  0 0 1 0 1 3 0 2 0 0 ...
##  $ nc : int  18 8 1 0 0 0 0 0 0 0 ...

dta2 <- ghq %>% 
  mutate(Total = c + nc) %>%
  dplyr::rename(Gender = sex,
                Score = ghq, 
                Case=c, 
                Noncase=nc)

#library(kableExtra)
#knitr::kable(t(aggregate(cbind(Case, Noncase) ~ Gender, data=dta2, sum)))

kbl(t(aggregate(cbind(Case, Noncase) ~ Gender, data=dta2, sum)))%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Gender	men	women
Case	8	22
Noncase	27	63

The case-noncase ratio is around 1:3 in both gender (no gender effect?)

kbl(t(aggregate(cbind(Case, Noncase) ~ Score, data=dta2, sum)))%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Score	0	1	2	3	4	5	6	7	8	9	10
Case	2	2	5	3	3	6	1	3	3	1	1
Noncase	60	22	6	1	1	0	0	0	0	0	0

The higher GHQ score may indicate more psychiatric case, the cut point may around over 4 to 5. (should consider the score effect)

2.3 Plot

2.3.1 Probability plot?

The gender and GHQ effect

ggplot(dta2, aes(x=Score, y=Case/Total, color=Gender)) + 
 geom_point(pch=1, size=rel(2))+
 stat_smooth(method="glm", formula=y ~ x, 
             method.args=list(family=binomial),
             aes(weight=Total), 
             se=FALSE) +
 scale_color_manual(values=c("black", "red3"))+
 scale_x_continuous(breaks=seq(0, 10, by=1))+
 scale_y_continuous(breaks=seq(0, 1, by=.1))+
 labs(x="Score on GHQ", 
      y="Proportion of cases") +
 theme_minimal()+
 theme(legend.position=c(.1, .9))

## Warning: Removed 5 rows containing non-finite values (stat_smooth).

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: Removed 5 rows containing missing values (geom_point).

Limit sample size in male. Male may have a higher risk to become the case than female?

2.4 Analysis of deviance

m0 <- glm(cbind(Case, Noncase) ~ Score + Gender + Score:Gender, data=dta2, family=binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

m1 <- glm(cbind(Case, Noncase) ~ Score + Gender, data=dta2, family=binomial)

m2 <- glm(cbind(Case, Noncase) ~ Score, data=dta2, family=binomial)

kbl(anova(m2, m1, m0, test="Chisq"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
15	5.743961	NA	NA	NA
14	4.941883	1	0.8020783	0.3704727
13	1.542230	1	3.3996529	0.0652101

The gender effect can be ignore due to the result of model comparison, but the interaction effect should be considered.

2.4.1 Test for score effect

m3 <- glm(cbind(Case, Noncase) ~ Gender, data=dta2, family=binomial(link=logit))

kbl(anova(m3, m1, test="Chisq"))%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
15	83.053819	NA	NA	NA
14	4.941883	1	78.11194	0

The Score effect should be considered.

2.5 Goodness of fit

(m1$null.deviance - m1$deviance)/m1$null.deviance

## [1] 0.9405854

1 - pchisq(m1$deviance, m1$df.residual)

## [1] 0.9866052

ResourceSelection::hoslem.test(m1$y, fitted(m1))

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  m1$y, fitted(m1)
## X-squared = 11431, df = 8, p-value < 2.2e-16

2.6 Parameter estimates

sjPlot::tab_model(m1, show.r2=FALSE, show.obs=FALSE, show.se=TRUE)

	cbind(Case, Noncase)
Predictors	Odds Ratios	std. Error	CI	p
(Intercept)	0.02	0.98	0.00 – 0.08	<0.001
Score	4.19	0.29	2.57 – 8.20	<0.001
Gender [women]	2.21	0.93	0.42 – 17.73	0.393

For a given GHQ score, the odds of a female being a case is between 0.42 to 17.73 of the corresponding odds for a female. (without statistical meaning)

Odds “m to n” in favor of an event mean we expect the event will occur m times for every n times it does not occur.

For every 100 male cases, we expect between 42 to 177 female cases. (without statistical meaning)

Conditional on gender, one point increases in GHQ is associated with the increase of 157% to 720% in the odds for being a case (2.57 to 8.2 times).

The odds of being a case with an increase of 1 GHQ unit moves up from 1:1 to between 2.57 and 8.20 to 1.

2.7 Linear predictors

dta2_p <- dta2 %>%
  mutate(Prop = Case/Total,
         phat = predict(m1, newdata = dta2[, c(1,2)], type="response"),
         elgit = log((Case+1/120)/(Noncase+(1/120))),
         .std.resid = (glm.diag(m1)$rp))
dta2_m1 <- broom::augment(m1, newdata = dta2_p, 
                          type.residuals = c("deviance"))

#rstandard(m1, type = c("deviance"))
#residuals(m1, type = c("deviance"))

ggplot(dta2_m1, aes(Score, .fitted, color = Gender)) + 
  geom_line(alpha = .5) +
  geom_point(data = dta2_p,
             aes(Score, elgit)) +
  scale_color_manual(values = c("black", "red3")) +
  labs(x = "Score on GHQ", y = "Empirical and fitted log-odds")+
  theme_minimal()+
  theme(legend.position = c(.1, .9))

There is no interaction between gender and score.

2.8 Model fit probability plot

curve(faraway::ilogit(x), -6, 6,bty='n', 
      xlab=expression(eta), ylab="Probability")
grid()

beta <- coef(m1)
with(dta2, plot(Score, Case/Total,
                col = Gender, 
                bty = 'n', 
                xlab = "GHQ score"))
grid()
curve(ilogit(beta[1] + beta[2]*x), add=T, col=1)
curve(ilogit(beta[1] + beta[2]*x + beta[3]), add=T, col=2)

The female have higher probability to become the case.

2.9 Residual plot

ggplot(dta2_m1, 
       aes(Score, .std.resid, color = Gender)) + 
  geom_point(alpha=.5) +
  scale_y_continuous(breaks=seq(-3, 3, by=1), 
                     limits=c(-3.3, 3.3))+
  scale_color_manual(values=c("black", "red3"))+
  labs(x="Score on GHQ",
       y="Standardized residuals")+
  theme_minimal()+
  theme(legend.position = c(.9, .9))

The standardized residuals are scatter between -1 to 1. (model overfit?)

2.10 Reference

Silvapulle, M. (1981). On the Existence of Maximum Likelihood Estimators for the Binomial Response Models. Journal of the Royal Statistical Society. Series B (Methodological), 43(3), 310-313.

3 Inclass 3

3.1 Info

Problem: Reproduce the results of analysis (SAS output) reported in the victims example.
Data: A recent General Social Survey asked subjects, “Within the past 12 months, how many people have you known personally that were victims of homicide?”
The responses by race, for those who identified themselves as black or as white, were collected.

Column 1: Number of victims personally known to the respondent
Column 2: Race ID

3.2 Data input

dta3 <- read.csv("victims.csv")

dta3 <- mutate(dta3, 
               Race = factor(Race))
str(dta3)

## 'data.frame':    1308 obs. of  2 variables:
##  $ nV  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Race: Factor w/ 2 levels "Black","White": 2 2 2 2 2 2 2 2 2 2 ...

head(dta3)

##   nV  Race
## 1  0 White
## 2  0 White
## 3  0 White
## 4  0 White
## 5  0 White
## 6  0 White

3.3 Descriptive data

3.3.1 Frequency

t1 <- table(dta3$nV, dta3$Race)

# ftable(mytable)

kbl(t1)%>%
  kable_styling(bootstrap_options = c("striped", "hover")
              , full_width = F)%>%
  add_header_above(c("Number of Victims", "Race" = 2))

Number of Victims	Race
	Black	White
0	119	1070
1	16	60
2	12	14
3	7	4
4	3	0
5	2	0
6	0	1

The sampling is unbalance that White is more than the Black.
It seems those who identified themselves as black were reported more victims than the white.(should consider the Race effect)

3.3.2 Mean & variance

dta3mean <- aggregate(nV ~ Race, data = dta3, mean)
dta3var <- aggregate(nV ~ Race, data = dta3, var)
t2 <- Hmisc::Merge(dta3mean, dta3var, id = ~ Race)

##          Vars Obs Unique IDs IDs in #1 IDs not in #1
## dta3mean    2   2          2        NA            NA
## dta3var     2   2          2         2             0
## Merged      3   2          2         2             0
## 
## Number of unique IDs in any data frame : 2 
## Number of unique IDs in all data frames: 2

colnames(t2) <- c("Race", "mean", "var")

kbl(t2)%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Race	mean	var
Black	0.5220126	1.1498288
White	0.0922541	0.1552448

The mean of the reported number of victims showed the reporter of black race is higher than the White.

3.4 Plot

3.4.1 Bar chart

ggplot(dta3, 
       aes(nV)) +
 geom_histogram(binwidth = 1, fill=8, col=1)+
 facet_grid(~Race) +
 labs(x="Number of victims", 
      y="Number of the person who response") +
 theme_minimal()+
 theme(legend.position = c(.95, .9))

Race effect

3.4.2 XXX

dta3$nVc <- dta3$nV
dta3$nVc[dta3$nVc >= 0 ] <- "1"
tail(dta3)

##      nV  Race nVc
## 1303  4 Black   1
## 1304  4 Black   1
## 1305  4 Black   1
## 1306  5 Black   1
## 1307  5 Black   1
## 1308  6 White   1

dta3$nVc <- as.integer(dta3$nVc, ref="0")
dta3$Race <- relevel(dta3$Race, ref = "White")

ggplot(dta3, 
       aes(Race, nV, color=Race)) +
 stat_smooth(method="glm",
             formula= nV ~ Race,
             method.args=list(family=poisson),
             size=rel(.5)) +
 geom_point(pch=20, alpha=.5) +
 #facet_grid(. ~ Race) +
 labs(y="Days absent", 
      x="Math score") +
 theme_minimal()+
 theme(legend.position = c(.95, .9))

3.5 GLM

3.5.1 Poisson

dta3$nVc <- dta3$nV
dta3$nVc[dta3$nVc >= 0 ] <- "1"
tail(dta3)

##      nV  Race nVc
## 1303  4 Black   1
## 1304  4 Black   1
## 1305  4 Black   1
## 1306  5 Black   1
## 1307  5 Black   1
## 1308  6 White   1

dta3$nVc <- as.integer(dta3$nVc, ref="0")

summary(m0p <- glm(nV ~ Race, data=dta3, family=poisson(link=log)))

## 
## Call:
## glm(formula = nV ~ Race, family = poisson(link = log), data = dta3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0218  -0.4295  -0.4295  -0.4295   6.1874  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.38321    0.09713  -24.54   <2e-16 ***
## RaceBlack    1.73314    0.14657   11.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 962.80  on 1307  degrees of freedom
## Residual deviance: 844.71  on 1306  degrees of freedom
## AIC: 1122
## 
## Number of Fisher Scoring iterations: 6

3.5.1.1 Parameter estimates

sjPlot::tab_model(m0p, show.se=T, show.r2=F, show.obs=F)

	nV
Predictors	Incidence Rate Ratios	std. Error	CI	p
(Intercept)	0.09	0.10	0.08 – 0.11	<0.001
Race [Black]	5.66	0.15	4.24 – 7.53	<0.001

3.5.1.2 Goodness of fit and overdispersion

1 - pchisq(deviance(m0p), df.residual(m0p))

## [1] 1

performance::check_overdispersion(m0p)

## # Overdispersion test
## 
##        dispersion ratio =    1.746
##   Pearson's Chi-Squared = 2279.873
##                 p-value =  < 0.001

## Overdispersion detected.

Overdispersion

plot(log(fitted(m0p)), log((dta3$nV-fitted(m0p))^2), 
     xlab=expression(hat(mu)),
     ylab=expression((y-hat(mu))^2))
grid()
abline(0, 1, col=2)

3.5.1.3 Dispersion parameter

kbl(as.data.frame(summary(m0p, dispersion=1.746)$coef))%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-2.383208	0.1283420	-18.569195	0
RaceBlack	1.733145	0.1936693	8.948991	0

drop1(m0, test="F")

## Warning in drop1.glm(m0, test = "F"): F test assumes 'quasibinomial' family

## Single term deletions
## 
## Model:
## cbind(Case, Noncase) ~ Score + Gender + Score:Gender
##              Df Deviance    AIC F value    Pr(>F)    
## <none>            1.5422 22.018                      
## Score:Gender  1   4.9419 23.418  28.657 0.0001313 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.5.2 Negative binominal

summary(m0n <- glm.nb(nV ~ Race, data = dta3))

## 
## Call:
## glm.nb(formula = nV ~ Race, data = dta3, init.theta = 0.2023119205, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7184  -0.3899  -0.3899  -0.3899   3.5072  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3832     0.1172 -20.335  < 2e-16 ***
## RaceBlack     1.7331     0.2385   7.268 3.66e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.2023) family taken to be 1)
## 
##     Null deviance: 471.57  on 1307  degrees of freedom
## Residual deviance: 412.60  on 1306  degrees of freedom
## AIC: 1001.8
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.2023 
##           Std. Err.:  0.0409 
## 
##  2 x log-likelihood:  -995.7980

exp(-2.3832) = 0.092
exp(-2.3832 + 1.7331) = 0.522

WHY same as the mean?

White: log(μ_i) = 0.092 + 4.9429×0.0922 ≈ 0.126
Black: log(μ_i) = 0.522 + 4.9429×0.5222 ≈ 1.869

The Black race tend to known more victims of homicide.

MASS::stepAIC(glm.nb(nV ~ Race, data = dta3), 
               scope=list(upper = ~ Race, lower= ~ 1), trace=FALSE)

## 
## Call:  glm.nb(formula = nV ~ Race, data = dta3, init.theta = 0.2023119205, 
##     link = log)
## 
## Coefficients:
## (Intercept)    RaceBlack  
##      -2.383        1.733  
## 
## Degrees of Freedom: 1307 Total (i.e. Null);  1306 Residual
## Null Deviance:       471.6 
## Residual Deviance: 412.6     AIC: 1002

sjPlot::tab_model(m0n, show.df =T, show.se=T, show.r2=T, show.obs=T, show.aic = T, show.fstat = T,
df.method = "wald", transform=NULL)

	nV
Predictors	Log-Mean	std. Error	CI	p	df
(Intercept)	-2.38	0.12	-2.61 – -2.15	<0.001	Inf
Race [Black]	1.73	0.24	1.27 – 2.20	<0.001	Inf
Observations	1308
R² Nagelkerke	0.146
AIC	1001.798

3.6 Reference

Agresti, A. (2002). Categorical Data Analysis. p. 561.

In-class Exercises: Generalized linear models

Chang-Chun Chen

2020-10-28