1 Homework 1

1.1 Info

  • Problem: Use an appropriate model to examine how the response depends on respondent’s gender (0 = male, 1 = female), age, and academic status (1 = academic or 0 = otherwise).

  • Data: In a survey people were asked, “Is addiction a disease or are addicts weak-willed?” The response was in three categories, 0 = “addicts are weak-willed,” 1 = “addiction is a disease,” or 2 = “both alternatives hold”. The data set is available as addiction{catdata}.

1.2 Data input

# Loading package
pacman::p_load(tidyr, tidyverse, nlme, car, afex, emmeans, lattice, faraway, boot, ggplot2, kableExtra, MASS, pscl, nnet, mgcv)

library(catdata)

data(addiction, package = "catdata")

head(addiction)
##   ill gender age university
## 1   1      1  61          0
## 2   0      1  43          0
## 3   2      0  44          0
## 4   0      1  21          1
## 5   0      0  33          0
## 6   1      0  83          0
addiction$ill <- factor((addiction$ill), labels=c("weak-willed", "disease", "both"), ordered=T)
addiction$gender <- factor((addiction$gender), labels=c("Female", "Male"), ordered=T)
addiction$university <- factor((addiction$university), labels=c("Otherwise", "Academic"), ordered=T)

str(addiction)
## 'data.frame':    712 obs. of  4 variables:
##  $ ill       : Ord.factor w/ 3 levels "weak-willed"<..: 2 1 3 1 1 2 1 2 2 2 ...
##  $ gender    : Ord.factor w/ 2 levels "Female"<"Male": 2 2 1 2 1 1 1 1 2 1 ...
##  $ age       : int  61 43 44 21 33 83 29 61 37 19 ...
##  $ university: Ord.factor w/ 2 levels "Otherwise"<"Academic": 1 1 1 2 1 1 1 1 1 1 ...
with(addiction, ftable(ill, gender))
##             gender Female Male
## ill                           
## weak-willed            99  103
## disease               137  179
## both                   79   90

Both male and female more likely to state “addiction is a disease”.

with(addiction, ftable(ill, university))
##             university Otherwise Academic
## ill                                      
## weak-willed                  178       23
## disease                      194      121
## both                         123       44

For those with university degree more likely to state “addiction is a disease” than those without.

HH::likert(t(with(addiction, table(ill, gender))), main="", ylab="gender")

HH::likert(t(with(addiction, table(ill, university))), main="", ylab="Academic status")

newdta1 <- addiction %>% 
  mutate(resp_n=ifelse(ill=='weak-willed', 1, 0),
         resp_f=ifelse(ill=='disease', 1, 0),
         resp_p=ifelse(ill=='both', 1, 0))

ggplot(newdta1, aes(age, resp_n)) + 
 stat_smooth(method="glm", formula=y ~ x, 
             method.args=list(family=binomial),
             alpha=.2, se=FALSE, col=3) +
 stat_smooth(aes(y=resp_f),
             method="glm", formula=y ~ x, 
             method.args=list(family=binomial),
             alpha=.2, se=FALSE, col=2) +
 stat_smooth(aes(y=resp_p), method="glm", 
             formula=y ~ x, 
             method.args=list(family=binomial),
             alpha=.2, se=FALSE, col=4) +
 facet_grid(gender~university)+
 scale_y_continuous(limits=c(0,1), breaks=seq(0, 1, by=.1))+
 guides(color=guide_legend(reverse=T))+
 labs(x="age", 
      y="Probability of ill",
      title="weak-willed=Green, disease=Red, both=Blue") +
 theme_minimal()+
 theme(legend.position = c(.9, .2))
## Warning: Removed 26 rows containing non-finite values (stat_smooth).

## Warning: Removed 26 rows containing non-finite values (stat_smooth).

## Warning: Removed 26 rows containing non-finite values (stat_smooth).

WHY with NA…

# all
m0 <- nnet::multinom(ill ~ gender*age*university, 
                     data=addiction, Hess=TRUE, trace=F)
# simple
m1 <- nnet::multinom(ill ~ gender+age+university, 
                     data=addiction, Hess=TRUE, trace=F)

knitr::kable(anova(m0, m1, test="Chisq"))
Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
gender + age + university 1356 1350.417 NA NA NA
gender * age * university 1348 1342.698 1 vs 2 8 7.718734 0.4614161
sjPlot::tab_model(m1, show.r2=FALSE, show.obs=FALSE)
  ill
Predictors Odds Ratios CI p Response
(Intercept) 0.88 0.55 – 1.40 0.581 disease
gender [linear] 1.36 1.04 – 1.78 0.024 disease
age 1.03 1.02 – 1.04 <0.001 disease
university [linear] 3.15 2.22 – 4.47 <0.001 disease
(Intercept) 0.26 0.15 – 0.45 <0.001 both
gender [linear] 1.23 0.90 – 1.66 0.192 both
age 1.04 1.03 – 1.06 <0.001 both
university [linear] 2.13 1.42 – 3.18 <0.001 both

2 Homework 2

2.1 Info

  • Problem: Analyze the data in the physical activity example with an appropriate model.

  • Data: The data were collected during face-to-face home-based interviews with 357 female participants. Each participant’s height, weight as well as her answer to the following question were recorded.
    “Vigorous physical activities usually make you breathe hard or feel tired most of the time. Examples of vigorous activities include: jogging, fast dancing, soccer, fast swimming, fast biking, and Stairmaster. How many days in a typical week do you do vigorous physical activities for 20 minutes or more?”

BMI was calculated using the Quetelet index (kg/m2), which is considered both a convenient and reliable indicator of overweight and obesity.

Column 1: Participant’s BMI index
Column 2: Number of times of vigorous activities per week

2.2 Data input

dta2 <- read.table("physical_activity.txt", h=T)

head(dta2)
##   bmi count
## 1  22     0
## 2  31     0
## 3  29     0
## 4  33     0
## 5  18     0
## 6  25     0
str(dta2)
## 'data.frame':    357 obs. of  2 variables:
##  $ bmi  : int  22 31 29 33 18 25 28 30 29 16 ...
##  $ count: int  0 0 0 0 0 0 0 0 0 0 ...
t <- rbind(colMeans(dta2), apply(dta2, 2, FUN = var))
rownames(t) <- c("ave", "var")
t
##          bmi     count
## ave 26.44538 0.5938375
## var 14.88816 2.2194001
ggplot(dta2, aes(x=bmi, y=count))+
  geom_bar(stat="identity")+
  labs(x="BMI", y="Count")+
  theme_minimal()+
  theme(legend.position = c(.1,.9))

# Poisson fit
ggplot(dta2, aes(x=bmi, y=count)) + 
  geom_jitter() + 
  stat_smooth(method="glm", 
              method.args = list(family = "poisson")) +
  labs(y="Count of BMI", x="BMI")
## `geom_smooth()` using formula 'y ~ x'

summary(m0_dta2 <- glm(count~bmi, data=dta2, family = "poisson"))
## 
## Call:
## glm(formula = count ~ bmi, family = "poisson", data = dta2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8404  -1.1270  -0.9570  -0.8127   4.8654  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.27048    0.43383   5.234 1.66e-07 ***
## bmi         -0.10898    0.01729  -6.302 2.94e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 795.57  on 356  degrees of freedom
## Residual deviance: 756.44  on 355  degrees of freedom
## AIC: 946.99
## 
## Number of Fisher Scoring iterations: 6
summary(m1_dta2 <- zeroinfl(count ~ bmi, data=dta2))
## 
## Call:
## zeroinfl(formula = count ~ bmi, data = dta2)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7046 -0.4250 -0.3568 -0.2989  4.8343 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.73817    0.53913   3.224  0.00126 **
## bmi         -0.02276    0.02165  -1.051  0.29321   
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.51385    0.95813  -1.580  0.11411   
## bmi          0.11588    0.03736   3.101  0.00193 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 7 
## Log-likelihood: -281.4 on 4 Df
vuong(m0_dta2, m1_dta2)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -7.822786 model2 > model1 2.5834e-15
## AIC-corrected         -7.740484 model2 > model1 4.9520e-15
## BIC-corrected         -7.580912 model2 > model1 1.7157e-14
#parameter estimates
show(m0_est <- cbind(Estimate=coef(m0_dta2), confint(m0_dta2)))
## Waiting for profiling to be done...
##               Estimate      2.5 %      97.5 %
## (Intercept)  2.2704806  1.4099338  3.11111748
## bmi         -0.1089812 -0.1428471 -0.07502992
show(m1_est <- cbind(Estimate=coef(m1_dta2), confint(m1_dta2)))
##                      Estimate       2.5 %     97.5 %
## count_(Intercept)  1.73816504  0.68148156 2.79484853
## count_bmi         -0.02275706 -0.06519126 0.01967714
## zero_(Intercept)  -1.51384962 -3.39175256 0.36405331
## zero_bmi           0.11588088  0.04264803 0.18911372
exp(m0_est)
##              Estimate     2.5 %     97.5 %
## (Intercept) 9.6840538 4.0956844 22.4461134
## bmi         0.8967472 0.8668866  0.9277157
exp(m1_est)
##                    Estimate      2.5 %    97.5 %
## count_(Intercept) 5.6868986 1.97680431 16.360151
## count_bmi         0.9774999 0.93688825  1.019872
## zero_(Intercept)  0.2200612 0.03364965  1.439151
## zero_bmi          1.1228621 1.04357053  1.208178
anova(m0_dta2, m1_dta2)
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: count
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev
## NULL                   356     795.57
## bmi   1    39.13       355     756.44
dta2_m<- data.frame(dta2, 
                    fit0=predict(m0_dta2, type="response"),
                    fit1=predict(m1_dta2, type="response"),
                    r0=resid(m0_dta2, type="pearson"),
                    r1=resid(m1_dta2, type="pearson"))
ggplot(dta2_m, aes(fit0, r0))+
  geom_point(pch=1)+
  geom_point(aes(fit1, r1),pch=3)+
  geom_hline(yintercept = 0)+
  labs(x="Fitted values", y="Residuals")

2.3 Reference

Slymen, D.J., Ayala, G.X., Arredondo, E.M., & Elder, J.P. (2006). A demonstration of modeling count data with an application to physical activity. Epidemiologic Perspectives & Innovations, 3:3, 1-9.

3 Homework 3

3.1 Info

  • Problem: Analyze the effects of gender and age on the level of pain, (R1 = 0, no pain, …, 5 = severe pain), before treatment started. You may consult the package vignette for an analysis on the variable R4 by issuing the command in your R session:

  • Data: In a clinical study a sample of 127 patients with sport related injuries have been treated with two different therapies (chosen by random design). After 3,7 and 10 days of treatment the pain occuring during knee movement was observed. The dataset is available as knee{catdata}. Analyze the effects of gender and age on the level of pain, (R1 = 0, no pain, …, 5 = severe pain), before treatment started. You may consult the package vignette for an analysis on the variable R4 by issuing the command in your R session:
    data(knee, package=“catdata”)
    vignette(“ordinal-knee1”)

3.2 Data input

data(knee, package="catdata")

head(knee)
##   N Th Age Sex R1 R2 R3 R4
## 1 1  1  28   1  4  4  4  4
## 2 2  1  32   1  4  4  4  4
## 3 3  1  41   1  3  3  3  3
## 4 4  2  21   1  4  3  3  2
## 5 5  2  34   1  4  3  3  2
## 6 6  1  24   1  3  3  3  2
str(knee)
## 'data.frame':    127 obs. of  8 variables:
##  $ N  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Th : int  1 1 1 2 2 1 2 2 2 1 ...
##  $ Age: int  28 32 41 21 34 24 28 40 24 39 ...
##  $ Sex: num  1 1 1 1 1 1 1 1 0 0 ...
##  $ R1 : int  4 4 3 4 4 3 4 3 4 4 ...
##  $ R2 : int  4 4 3 3 3 3 3 2 4 4 ...
##  $ R3 : int  4 4 3 3 3 3 3 2 4 4 ...
##  $ R4 : int  4 4 3 2 2 2 2 2 3 3 ...
dta_3 <- reshape2::melt(knee[,5:8], key=, value=value)
## No id variables; using all as measure variables
names(dta_3) <- c("MeasureTime", "Painlevel")
summary(m0_dta_3 <- 
          polr(factor(dta_3$Painlevel) ~ factor(dta_3$MeasureTime), 
               method = "probit",Hess=T))
## Call:
## polr(formula = factor(dta_3$Painlevel) ~ factor(dta_3$MeasureTime), 
##     Hess = T, method = "probit")
## 
## Coefficients:
##                               Value Std. Error t value
## factor(dta_3$MeasureTime)R2 -0.3476     0.1335  -2.604
## factor(dta_3$MeasureTime)R3 -0.5326     0.1342  -3.968
## factor(dta_3$MeasureTime)R4 -0.7395     0.1354  -5.460
## 
## Intercepts:
##     Value    Std. Error t value 
## 1|2  -1.1634   0.1073   -10.8458
## 2|3  -0.7974   0.1042    -7.6538
## 3|4  -0.1022   0.1003    -1.0183
## 4|5   0.9509   0.1075     8.8467
## 
## Residual Deviance: 1523.003 
## AIC: 1537.003