Problem: Replicate the analysis reported in the brightness discrimination example.
Data: Consider the outcome of a brightness discrimination experiment.
On each trial, two patches of light were presented: one was a standard with an intensity of 35 units, and another was a variable intensity comparison light.
The subject was asked to indicate which one of the lights was brighter.
Column 1: Stimulus condition
Column 2: Intensity of the variable (comparison) light
Column 3: Number of comparisons
Column 4: Number of times comparison light was judged brighter
# Loading package
pacman::p_load(tidyr, tidyverse, nlme, car, afex, emmeans, lattice, faraway, boot, ggplot2, kableExtra, MASS)
# input data
dta1 <- read.table("brightness.txt", h=T)
# top six row
head(dta1)
## Stimulus Intensity Trial Yes
## 1 1 10 49 2
## 2 2 20 48 3
## 3 3 30 48 13
## 4 4 40 50 31
## 5 5 50 47 38
## 6 6 60 49 48
The proportion of stimulus condition of 6 is the highest (97.95%).
The relationship between stimulus intensity and the probability of correct discrimination be
Pr{Y = y | S = I} = Φ(α + β × I),
where I is the stimulus intensity and α, β are parameters.
# use glm to fit a psychophysical function
summary(m0 <- glm(cbind(Yes, Trial - Yes) ~ Intensity, data = dta1,
family = binomial(link = "probit")))
##
## Call:
## glm(formula = cbind(Yes, Trial - Yes) ~ Intensity, family = binomial(link = "probit"),
## data = dta1)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## 1.0388 -0.7613 -0.3124 0.4486 -0.5979 0.7050
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.874139 0.286787 -10.02 <2e-16 ***
## Intensity 0.077472 0.007367 10.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 184.5899 on 5 degrees of freedom
## Residual deviance: 2.8119 on 4 degrees of freedom
## AIC: 26.549
##
## Number of Fisher Scoring iterations: 4
## [1] 0.5897832
## Warning: package 'ResourceSelection' was built under R version 4.0.3
## ResourceSelection 0.3-5 2019-07-22
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: m0$y, fitted(m0)
## X-squared = NaN, df = 8, p-value = NA
Estimated πi = Φ(-2.874 + 0.077 × Ii)
The model may not fit the data well?
ot <- theme_set(theme_bw())
# fitted psychophysical curve
ggplot() +
geom_line(aes(x = newxval[,1], y = phat)) +
geom_point(aes(x = dta1$Intensity, y = dta1$Prop)) +
geom_vline(xintercept = 35, lty = 2, col = "gray") +
geom_hline(yintercept = .5, lty = 2, col = "gray") +
labs(x = "Intensity", y = "Probability of correnct responses")
The stimulus intensity is positively associated with correct responses of brightness discrimination.
Gegenfurtner, K. (1992). Praxis: Brent’s algorithm for function minimization. Behavior Research Methods, Instruments, and Computers. 24(4), 560-564.
Column 1: Gender ID
Column 2: GHQ score
Column 3: Number of cases
Column 4: Number of non-cases
## 'data.frame': 22 obs. of 4 variables:
## $ sex: Factor w/ 2 levels "men","women": 1 1 1 1 1 1 1 1 1 1 ...
## $ ghq: int 0 1 2 3 4 5 6 7 8 9 ...
## $ c : int 0 0 1 0 1 3 0 2 0 0 ...
## $ nc : int 18 8 1 0 0 0 0 0 0 0 ...
dta2 <- ghq %>%
mutate(Total = c + nc) %>%
dplyr::rename(Gender = sex,
Score = ghq,
Case=c,
Noncase=nc)
#library(kableExtra)
#knitr::kable(t(aggregate(cbind(Case, Noncase) ~ Gender, data=dta2, sum)))
kbl(t(aggregate(cbind(Case, Noncase) ~ Gender, data=dta2, sum)))%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Gender | men | women |
Case | 8 | 22 |
Noncase | 27 | 63 |
The case-noncase ratio is around 1:3 in both gender (no gender effect?)
kbl(t(aggregate(cbind(Case, Noncase) ~ Score, data=dta2, sum)))%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Score | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Case | 2 | 2 | 5 | 3 | 3 | 6 | 1 | 3 | 3 | 1 | 1 |
Noncase | 60 | 22 | 6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
The higher GHQ score may indicate more psychiatric case, the cut point may around over 4 to 5. (should consider the score effect)
The gender and GHQ effect
ggplot(dta2, aes(x=Score, y=Case/Total, color=Gender)) +
geom_point(pch=1, size=rel(2))+
stat_smooth(method="glm", formula=y ~ x,
method.args=list(family=binomial),
aes(weight=Total),
se=FALSE) +
scale_color_manual(values=c("black", "red3"))+
scale_x_continuous(breaks=seq(0, 10, by=1))+
scale_y_continuous(breaks=seq(0, 1, by=.1))+
labs(x="Score on GHQ",
y="Proportion of cases") +
theme_minimal()+
theme(legend.position=c(.1, .9))
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: Removed 5 rows containing missing values (geom_point).
Limit sample size in male. Male may have a higher risk to become the case than female?
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
m1 <- glm(cbind(Case, Noncase) ~ Score + Gender, data=dta2, family=binomial)
m2 <- glm(cbind(Case, Noncase) ~ Score, data=dta2, family=binomial)
kbl(anova(m2, m1, m0, test="Chisq"))%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
15 | 5.743961 | NA | NA | NA |
14 | 4.941883 | 1 | 0.8020783 | 0.3704727 |
13 | 1.542230 | 1 | 3.3996529 | 0.0652101 |
The gender effect can be ignore due to the result of model comparison, but the interaction effect should be considered.
m3 <- glm(cbind(Case, Noncase) ~ Gender, data=dta2, family=binomial(link=logit))
kbl(anova(m3, m1, test="Chisq"))%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
---|---|---|---|---|
15 | 83.053819 | NA | NA | NA |
14 | 4.941883 | 1 | 78.11194 | 0 |
The Score effect should be considered.
## [1] 0.9405854
## [1] 0.9866052
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: m1$y, fitted(m1)
## X-squared = 11431, df = 8, p-value < 2.2e-16
cbind(Case, Noncase) | ||||
---|---|---|---|---|
Predictors | Odds Ratios | std. Error | CI | p |
(Intercept) | 0.02 | 0.98 | 0.00 – 0.08 | <0.001 |
Score | 4.19 | 0.29 | 2.57 – 8.20 | <0.001 |
Gender [women] | 2.21 | 0.93 | 0.42 – 17.73 | 0.393 |
For a given GHQ score, the odds of a female being a case is between 0.42 to 17.73 of the corresponding odds for a female. (without statistical meaning)
Odds “m to n” in favor of an event mean we expect the event will occur m times for every n times it does not occur.
For every 100 male cases, we expect between 42 to 177 female cases. (without statistical meaning)
Conditional on gender, one point increases in GHQ is associated with the increase of 157% to 720% in the odds for being a case (2.57 to 8.2 times).
The odds of being a case with an increase of 1 GHQ unit moves up from 1:1 to between 2.57 and 8.20 to 1.
dta2_p <- dta2 %>%
mutate(Prop = Case/Total,
phat = predict(m1, newdata = dta2[, c(1,2)], type="response"),
elgit = log((Case+1/120)/(Noncase+(1/120))),
.std.resid = (glm.diag(m1)$rp))
dta2_m1 <- broom::augment(m1, newdata = dta2_p,
type.residuals = c("deviance"))
ggplot(dta2_m1, aes(Score, .fitted, color = Gender)) +
geom_line(alpha = .5) +
geom_point(data = dta2_p,
aes(Score, elgit)) +
scale_color_manual(values = c("black", "red3")) +
labs(x = "Score on GHQ", y = "Empirical and fitted log-odds")+
theme_minimal()+
theme(legend.position = c(.1, .9))
There is no interaction between gender and score.
beta <- coef(m1)
with(dta2, plot(Score, Case/Total,
col = Gender,
bty = 'n',
xlab = "GHQ score"))
grid()
curve(ilogit(beta[1] + beta[2]*x), add=T, col=1)
curve(ilogit(beta[1] + beta[2]*x + beta[3]), add=T, col=2)
The female have higher probability to become the case.
ggplot(dta2_m1,
aes(Score, .std.resid, color = Gender)) +
geom_point(alpha=.5) +
scale_y_continuous(breaks=seq(-3, 3, by=1),
limits=c(-3.3, 3.3))+
scale_color_manual(values=c("black", "red3"))+
labs(x="Score on GHQ",
y="Standardized residuals")+
theme_minimal()+
theme(legend.position = c(.9, .9))
The standardized residuals are scatter between -1 to 1. (model overfit?)
Silvapulle, M. (1981). On the Existence of Maximum Likelihood Estimators for the Binomial Response Models. Journal of the Royal Statistical Society. Series B (Methodological), 43(3), 310-313.
Problem: Reproduce the results of analysis (SAS output) reported in the victims example.
Data: A recent General Social Survey asked subjects, “Within the past 12 months, how many people have you known personally that were victims of homicide?”
The responses by race, for those who identified themselves as black or as white, were collected.
Column 1: Number of victims personally known to the respondent
Column 2: Race ID
## 'data.frame': 1308 obs. of 2 variables:
## $ nV : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Race: Factor w/ 2 levels "Black","White": 2 2 2 2 2 2 2 2 2 2 ...
## nV Race
## 1 0 White
## 2 0 White
## 3 0 White
## 4 0 White
## 5 0 White
## 6 0 White
t1 <- table(dta3$nV, dta3$Race)
# ftable(mytable)
kbl(t1)%>%
kable_styling(bootstrap_options = c("striped", "hover")
, full_width = F)%>%
add_header_above(c("Number of Victims", "Race" = 2))
Black | White | |
---|---|---|
0 | 119 | 1070 |
1 | 16 | 60 |
2 | 12 | 14 |
3 | 7 | 4 |
4 | 3 | 0 |
5 | 2 | 0 |
6 | 0 | 1 |
The sampling is unbalance that White is more than the Black.
It seems those who identified themselves as black were reported more victims than the white.(should consider the Race effect)
dta3mean <- aggregate(nV ~ Race, data = dta3, mean)
dta3var <- aggregate(nV ~ Race, data = dta3, var)
t2 <- Hmisc::Merge(dta3mean, dta3var, id = ~ Race)
## Vars Obs Unique IDs IDs in #1 IDs not in #1
## dta3mean 2 2 2 NA NA
## dta3var 2 2 2 2 0
## Merged 3 2 2 2 0
##
## Number of unique IDs in any data frame : 2
## Number of unique IDs in all data frames: 2
colnames(t2) <- c("Race", "mean", "var")
kbl(t2)%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Race | mean | var |
---|---|---|
Black | 0.5220126 | 1.1498288 |
White | 0.0922541 | 0.1552448 |
The mean of the reported number of victims showed the reporter of black race is higher than the White.
ggplot(dta3,
aes(nV)) +
geom_histogram(binwidth = 1, fill=8, col=1)+
facet_grid(~Race) +
labs(x="Number of victims",
y="Number of the person who response") +
theme_minimal()+
theme(legend.position = c(.95, .9))
Race effect
## nV Race nVc
## 1303 4 Black 1
## 1304 4 Black 1
## 1305 4 Black 1
## 1306 5 Black 1
## 1307 5 Black 1
## 1308 6 White 1
dta3$nVc <- as.integer(dta3$nVc, ref="0")
dta3$Race <- relevel(dta3$Race, ref = "White")
ggplot(dta3,
aes(Race, nV, color=Race)) +
stat_smooth(method="glm",
formula= nV ~ Race,
method.args=list(family=poisson),
size=rel(.5)) +
geom_point(pch=20, alpha=.5) +
#facet_grid(. ~ Race) +
labs(y="Days absent",
x="Math score") +
theme_minimal()+
theme(legend.position = c(.95, .9))
## nV Race nVc
## 1303 4 Black 1
## 1304 4 Black 1
## 1305 4 Black 1
## 1306 5 Black 1
## 1307 5 Black 1
## 1308 6 White 1
dta3$nVc <- as.integer(dta3$nVc, ref="0")
summary(m0p <- glm(nV ~ Race, data=dta3, family=poisson(link=log)))
##
## Call:
## glm(formula = nV ~ Race, family = poisson(link = log), data = dta3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0218 -0.4295 -0.4295 -0.4295 6.1874
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.38321 0.09713 -24.54 <2e-16 ***
## RaceBlack 1.73314 0.14657 11.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 962.80 on 1307 degrees of freedom
## Residual deviance: 844.71 on 1306 degrees of freedom
## AIC: 1122
##
## Number of Fisher Scoring iterations: 6
nV | ||||
---|---|---|---|---|
Predictors | Incidence Rate Ratios | std. Error | CI | p |
(Intercept) | 0.09 | 0.10 | 0.08 – 0.11 | <0.001 |
Race [Black] | 5.66 | 0.15 | 4.24 – 7.53 | <0.001 |
## [1] 1
## # Overdispersion test
##
## dispersion ratio = 1.746
## Pearson's Chi-Squared = 2279.873
## p-value = < 0.001
## Overdispersion detected.
Overdispersion
plot(log(fitted(m0p)), log((dta3$nV-fitted(m0p))^2),
xlab=expression(hat(mu)),
ylab=expression((y-hat(mu))^2))
grid()
abline(0, 1, col=2)
kbl(as.data.frame(summary(m0p, dispersion=1.746)$coef))%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -2.383208 | 0.1283420 | -18.569195 | 0 |
RaceBlack | 1.733145 | 0.1936693 | 8.948991 | 0 |
## Warning in drop1.glm(m0, test = "F"): F test assumes 'quasibinomial' family
## Single term deletions
##
## Model:
## cbind(Case, Noncase) ~ Score + Gender + Score:Gender
## Df Deviance AIC F value Pr(>F)
## <none> 1.5422 22.018
## Score:Gender 1 4.9419 23.418 28.657 0.0001313 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## glm.nb(formula = nV ~ Race, data = dta3, init.theta = 0.2023119205,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7184 -0.3899 -0.3899 -0.3899 3.5072
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3832 0.1172 -20.335 < 2e-16 ***
## RaceBlack 1.7331 0.2385 7.268 3.66e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.2023) family taken to be 1)
##
## Null deviance: 471.57 on 1307 degrees of freedom
## Residual deviance: 412.60 on 1306 degrees of freedom
## AIC: 1001.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.2023
## Std. Err.: 0.0409
##
## 2 x log-likelihood: -995.7980
exp(-2.3832) = 0.092
exp(-2.3832 + 1.7331) = 0.522
WHY same as the mean?
White: log(μi) = 0.092 + 4.9429×0.0922 ≈ 0.126
Black: log(μi) = 0.522 + 4.9429×0.5222 ≈ 1.869
The Black race tend to known more victims of homicide.
##
## Call: glm.nb(formula = nV ~ Race, data = dta3, init.theta = 0.2023119205,
## link = log)
##
## Coefficients:
## (Intercept) RaceBlack
## -2.383 1.733
##
## Degrees of Freedom: 1307 Total (i.e. Null); 1306 Residual
## Null Deviance: 471.6
## Residual Deviance: 412.6 AIC: 1002
sjPlot::tab_model(m0n, show.df =T, show.se=T, show.r2=T, show.obs=T, show.aic = T, show.fstat = T,
df.method = "wald", transform=NULL)
nV | |||||
---|---|---|---|---|---|
Predictors | Log-Mean | std. Error | CI | p | df |
(Intercept) | -2.38 | 0.12 | -2.61 – -2.15 | <0.001 | Inf |
Race [Black] | 1.73 | 0.24 | 1.27 – 2.20 | <0.001 | Inf |
Observations | 1308 | ||||
R2 Nagelkerke | 0.146 | ||||
AIC | 1001.798 |
Agresti, A. (2002). Categorical Data Analysis. p. 561.