Problem: Use an appropriate model to examine how the response depends on respondent’s gender (0 = male, 1 = female), age, and academic status (1 = academic or 0 = otherwise).
Data: In a survey people were asked, “Is addiction a disease or are addicts weak-willed?” The response was in three categories, 0 = “addicts are weak-willed,” 1 = “addiction is a disease,” or 2 = “both alternatives hold”. The data set is available as addiction{catdata}.
# Loading package
pacman::p_load(tidyr, tidyverse, nlme, car, afex, emmeans, lattice, faraway, boot, ggplot2, kableExtra, MASS, pscl, nnet, mgcv)
library(catdata)
data(addiction, package = "catdata")
head(addiction)
## ill gender age university
## 1 1 1 61 0
## 2 0 1 43 0
## 3 2 0 44 0
## 4 0 1 21 1
## 5 0 0 33 0
## 6 1 0 83 0
addiction$ill <- factor((addiction$ill), labels=c("weak-willed", "disease", "both"), ordered=T)
addiction$gender <- factor((addiction$gender), labels=c("Female", "Male"), ordered=T)
addiction$university <- factor((addiction$university), labels=c("Otherwise", "Academic"), ordered=T)
str(addiction)
## 'data.frame': 712 obs. of 4 variables:
## $ ill : Ord.factor w/ 3 levels "weak-willed"<..: 2 1 3 1 1 2 1 2 2 2 ...
## $ gender : Ord.factor w/ 2 levels "Female"<"Male": 2 2 1 2 1 1 1 1 2 1 ...
## $ age : int 61 43 44 21 33 83 29 61 37 19 ...
## $ university: Ord.factor w/ 2 levels "Otherwise"<"Academic": 1 1 1 2 1 1 1 1 1 1 ...
## gender Female Male
## ill
## weak-willed 99 103
## disease 137 179
## both 79 90
Both male and female more likely to state “addiction is a disease”.
## university Otherwise Academic
## ill
## weak-willed 178 23
## disease 194 121
## both 123 44
For those with university degree more likely to state “addiction is a disease” than those without.
newdta1 <- addiction %>%
mutate(resp_n=ifelse(ill=='weak-willed', 1, 0),
resp_f=ifelse(ill=='disease', 1, 0),
resp_p=ifelse(ill=='both', 1, 0))
ggplot(newdta1, aes(age, resp_n)) +
stat_smooth(method="glm", formula=y ~ x,
method.args=list(family=binomial),
alpha=.2, se=FALSE, col=3) +
stat_smooth(aes(y=resp_f),
method="glm", formula=y ~ x,
method.args=list(family=binomial),
alpha=.2, se=FALSE, col=2) +
stat_smooth(aes(y=resp_p), method="glm",
formula=y ~ x,
method.args=list(family=binomial),
alpha=.2, se=FALSE, col=4) +
facet_grid(gender~university)+
scale_y_continuous(limits=c(0,1), breaks=seq(0, 1, by=.1))+
guides(color=guide_legend(reverse=T))+
labs(x="age",
y="Probability of ill",
title="weak-willed=Green, disease=Red, both=Blue") +
theme_minimal()+
theme(legend.position = c(.9, .2))
## Warning: Removed 26 rows containing non-finite values (stat_smooth).
## Warning: Removed 26 rows containing non-finite values (stat_smooth).
## Warning: Removed 26 rows containing non-finite values (stat_smooth).
WHY with NA…
# all
m0 <- nnet::multinom(ill ~ gender*age*university,
data=addiction, Hess=TRUE, trace=F)
# simple
m1 <- nnet::multinom(ill ~ gender+age+university,
data=addiction, Hess=TRUE, trace=F)
knitr::kable(anova(m0, m1, test="Chisq"))
Model | Resid. df | Resid. Dev | Test | Df | LR stat. | Pr(Chi) |
---|---|---|---|---|---|---|
gender + age + university | 1356 | 1350.417 | NA | NA | NA | |
gender * age * university | 1348 | 1342.698 | 1 vs 2 | 8 | 7.718734 | 0.4614161 |
ill | ||||
---|---|---|---|---|
Predictors | Odds Ratios | CI | p | Response |
(Intercept) | 0.88 | 0.55 – 1.40 | 0.581 | disease |
gender [linear] | 1.36 | 1.04 – 1.78 | 0.024 | disease |
age | 1.03 | 1.02 – 1.04 | <0.001 | disease |
university [linear] | 3.15 | 2.22 – 4.47 | <0.001 | disease |
(Intercept) | 0.26 | 0.15 – 0.45 | <0.001 | both |
gender [linear] | 1.23 | 0.90 – 1.66 | 0.192 | both |
age | 1.04 | 1.03 – 1.06 | <0.001 | both |
university [linear] | 2.13 | 1.42 – 3.18 | <0.001 | both |
Problem: Analyze the data in the physical activity example with an appropriate model.
Data: The data were collected during face-to-face home-based interviews with 357 female participants. Each participant’s height, weight as well as her answer to the following question were recorded.
“Vigorous physical activities usually make you breathe hard or feel tired most of the time. Examples of vigorous activities include: jogging, fast dancing, soccer, fast swimming, fast biking, and Stairmaster. How many days in a typical week do you do vigorous physical activities for 20 minutes or more?”
BMI was calculated using the Quetelet index (kg/m2), which is considered both a convenient and reliable indicator of overweight and obesity.
Column 1: Participant’s BMI index
Column 2: Number of times of vigorous activities per week
## bmi count
## 1 22 0
## 2 31 0
## 3 29 0
## 4 33 0
## 5 18 0
## 6 25 0
## 'data.frame': 357 obs. of 2 variables:
## $ bmi : int 22 31 29 33 18 25 28 30 29 16 ...
## $ count: int 0 0 0 0 0 0 0 0 0 0 ...
## bmi count
## ave 26.44538 0.5938375
## var 14.88816 2.2194001
ggplot(dta2, aes(x=bmi, y=count))+
geom_bar(stat="identity")+
labs(x="BMI", y="Count")+
theme_minimal()+
theme(legend.position = c(.1,.9))
# Poisson fit
ggplot(dta2, aes(x=bmi, y=count)) +
geom_jitter() +
stat_smooth(method="glm",
method.args = list(family = "poisson")) +
labs(y="Count of BMI", x="BMI")
## `geom_smooth()` using formula 'y ~ x'
##
## Call:
## glm(formula = count ~ bmi, family = "poisson", data = dta2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8404 -1.1270 -0.9570 -0.8127 4.8654
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.27048 0.43383 5.234 1.66e-07 ***
## bmi -0.10898 0.01729 -6.302 2.94e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 795.57 on 356 degrees of freedom
## Residual deviance: 756.44 on 355 degrees of freedom
## AIC: 946.99
##
## Number of Fisher Scoring iterations: 6
##
## Call:
## zeroinfl(formula = count ~ bmi, data = dta2)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.7046 -0.4250 -0.3568 -0.2989 4.8343
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.73817 0.53913 3.224 0.00126 **
## bmi -0.02276 0.02165 -1.051 0.29321
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.51385 0.95813 -1.580 0.11411
## bmi 0.11588 0.03736 3.101 0.00193 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 7
## Log-likelihood: -281.4 on 4 Df
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -7.822786 model2 > model1 2.5834e-15
## AIC-corrected -7.740484 model2 > model1 4.9520e-15
## BIC-corrected -7.580912 model2 > model1 1.7157e-14
## Waiting for profiling to be done...
## Estimate 2.5 % 97.5 %
## (Intercept) 2.2704806 1.4099338 3.11111748
## bmi -0.1089812 -0.1428471 -0.07502992
## Estimate 2.5 % 97.5 %
## count_(Intercept) 1.73816504 0.68148156 2.79484853
## count_bmi -0.02275706 -0.06519126 0.01967714
## zero_(Intercept) -1.51384962 -3.39175256 0.36405331
## zero_bmi 0.11588088 0.04264803 0.18911372
## Estimate 2.5 % 97.5 %
## (Intercept) 9.6840538 4.0956844 22.4461134
## bmi 0.8967472 0.8668866 0.9277157
## Estimate 2.5 % 97.5 %
## count_(Intercept) 5.6868986 1.97680431 16.360151
## count_bmi 0.9774999 0.93688825 1.019872
## zero_(Intercept) 0.2200612 0.03364965 1.439151
## zero_bmi 1.1228621 1.04357053 1.208178
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: count
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 356 795.57
## bmi 1 39.13 355 756.44
dta2_m<- data.frame(dta2,
fit0=predict(m0_dta2, type="response"),
fit1=predict(m1_dta2, type="response"),
r0=resid(m0_dta2, type="pearson"),
r1=resid(m1_dta2, type="pearson"))
ggplot(dta2_m, aes(fit0, r0))+
geom_point(pch=1)+
geom_point(aes(fit1, r1),pch=3)+
geom_hline(yintercept = 0)+
labs(x="Fitted values", y="Residuals")
Slymen, D.J., Ayala, G.X., Arredondo, E.M., & Elder, J.P. (2006). A demonstration of modeling count data with an application to physical activity. Epidemiologic Perspectives & Innovations, 3:3, 1-9.
Problem: Analyze the effects of gender and age on the level of pain, (R1 = 0, no pain, …, 5 = severe pain), before treatment started. You may consult the package vignette for an analysis on the variable R4 by issuing the command in your R session:
Data: In a clinical study a sample of 127 patients with sport related injuries have been treated with two different therapies (chosen by random design). After 3,7 and 10 days of treatment the pain occuring during knee movement was observed. The dataset is available as knee{catdata}. Analyze the effects of gender and age on the level of pain, (R1 = 0, no pain, …, 5 = severe pain), before treatment started. You may consult the package vignette for an analysis on the variable R4 by issuing the command in your R session:
data(knee, package=“catdata”)
vignette(“ordinal-knee1”)
## N Th Age Sex R1 R2 R3 R4
## 1 1 1 28 1 4 4 4 4
## 2 2 1 32 1 4 4 4 4
## 3 3 1 41 1 3 3 3 3
## 4 4 2 21 1 4 3 3 2
## 5 5 2 34 1 4 3 3 2
## 6 6 1 24 1 3 3 3 2
## 'data.frame': 127 obs. of 8 variables:
## $ N : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Th : int 1 1 1 2 2 1 2 2 2 1 ...
## $ Age: int 28 32 41 21 34 24 28 40 24 39 ...
## $ Sex: num 1 1 1 1 1 1 1 1 0 0 ...
## $ R1 : int 4 4 3 4 4 3 4 3 4 4 ...
## $ R2 : int 4 4 3 3 3 3 3 2 4 4 ...
## $ R3 : int 4 4 3 3 3 3 3 2 4 4 ...
## $ R4 : int 4 4 3 2 2 2 2 2 3 3 ...
## No id variables; using all as measure variables
names(dta_3) <- c("MeasureTime", "Painlevel")
summary(m0_dta_3 <-
polr(factor(dta_3$Painlevel) ~ factor(dta_3$MeasureTime),
method = "probit",Hess=T))
## Call:
## polr(formula = factor(dta_3$Painlevel) ~ factor(dta_3$MeasureTime),
## Hess = T, method = "probit")
##
## Coefficients:
## Value Std. Error t value
## factor(dta_3$MeasureTime)R2 -0.3476 0.1335 -2.604
## factor(dta_3$MeasureTime)R3 -0.5326 0.1342 -3.968
## factor(dta_3$MeasureTime)R4 -0.7395 0.1354 -5.460
##
## Intercepts:
## Value Std. Error t value
## 1|2 -1.1634 0.1073 -10.8458
## 2|3 -0.7974 0.1042 -7.6538
## 3|4 -0.1022 0.1003 -1.0183
## 4|5 0.9509 0.1075 8.8467
##
## Residual Deviance: 1523.003
## AIC: 1537.003