Luka Pascal Cavazza, 19. February 2023

IMB; Multivariate Analysis;

HOMEWORK 4

Post-Coma Recovery of IQ

My unit of observation are the people that got injured and were consequently in a coma (and woke up). The sample size is 331 observations.

Description of variables:

  • id: patients ID number
  • days: number of days post coma at which IQs were measured
  • duration: duration of the coma in days
  • sex: gender
  • age: in years at the time of injury
  • piq: performance (i.e., mathematical) IQ
  • viq: verbal IQ

Source of the dataset is as mentioned before the carData library.

Research question

The basic idea is to see whether explanatory variables like duration of coma, age and the time at which the mesurement was made affects (performance) IQ the probability of recovery.

I’d just like to note that the first part of the homework will be basically the same just so that I have usable data (variables) again and I’ll also copy the descriptions, just so I know what I’ll be looking at if I return to this sometime in the future.

So let’s see what we are dealing with.

library(carData)
data(Wong)

head(Wong)
##     id days duration  sex      age piq viq
## 1 3358   30        4 Male 20.67077  87  89
## 2 3535   16       17 Male 55.28816  95  77
## 3 3547   40        1 Male 55.91513  95 116
## 4 3592   13       10 Male 61.66461  59  73
## 5 3728   19        6 Male 30.12731  67  73
## 6 3790   13        3 Male 57.06229  76  69

The first thing that I notice is how age is written. It’s obvious that it was calculated automatically but I’ll change it so it makes more sense.

Wong$age <- round(Wong$age, 0) #rounding the age column to the nearest number
data <- Wong

I just got an idea. Let’s see if there are any duplicates within ID and then we can see if there is any progress between 1st and 2nd test score.

#install.packages("misty")
library(misty)
## |-------------------------------------|
## | misty 0.4.7 (2023-01-06)            |
## | Miscellaneous Functions T. Yanagida |
## |-------------------------------------|
data2M <- df.duplicated(data, id, first = TRUE, keep.all = TRUE,  #extracting the duplicate values in ID in a new data frame data 2 measurements
              from.last = FALSE, keep.row.names = TRUE,
              check = TRUE)

dataU <- df.unique(data, id, keep.all = TRUE, #extracting unique data in a new data frame data unique
          from.last = FALSE, keep.row.names = TRUE,
          check = TRUE)

sum(duplicated(data2M$id)) #checking if it works, should be 224 observations
## [1] 131

This is a bit weird, but I believe the problem is some people were tested more than twice and some just once. I hope this will not be too big of a problem.

It would make sense to make a categorical variable with 3 levels, if people were in a coma up to 10 days I will assign them a value of 1, for 10 to 50 days a value of 2 and above 50 a value of 3

createDurationFactor <- function(duration) { #creating a duration factor so I have 3 levels
  return (ifelse (duration < 10, "short", ifelse(duration > 50, "long", "semi")))
}

dataU$DurationF <- createDurationFactor(dataU$duration)

Now I can move on to the remaining part.

I still want to change my factors.

dataU$sexF <- factor(dataU$sex, #this isn't really needed
                          labels = c("Male", "Female"),
                          levels = c("Male", "Female"))

dataU$DurationF <- factor(dataU$DurationF, 
                          labels = c("short", "semi", "long"),
                          levels = c("short", "semi", "long"))
head(dataU)
##     id days duration  sex age piq viq DurationF sexF
## 1 3358   30        4 Male  21  87  89     short Male
## 2 3535   16       17 Male  55  95  77      semi Male
## 3 3547   40        1 Male  56  95 116     short Male
## 4 3592   13       10 Male  62  59  73      semi Male
## 5 3728   19        6 Male  30  67  73     short Male
## 6 3790   13        3 Male  57  76  69     short Male

Graphical representation of relationships among analysed variables with a scatterplot

Below is the graphical representation of relationships among my variables presented in a scatterplot matrix.

scatterplotMatrix(dataU[, c(-1,-4,-8, -9)], #removing ID since it doesn't play any role here in my opinion, removed sex as it's not a numerical variable and I could also remove days but I decided not to
                  smooth = FALSE)

From this I can see that mostly younger people (from the previous HW we can recall it was males) had incidents that resulted in a coma. Also what is interesting to me is the peaks of verbal IQ and performance IQ are inverted but I don’t really know how to explain that. We can also see that days and duration are positively correlated for example and vice versa for age and duration.

Descriptive stats

summary(dataU[c("days", "duration", "age", "piq", "viq")])
##       days             duration           age             piq        
##  Min.   :   13.00   Min.   :  0.00   Min.   : 7.00   Min.   : 50.00  
##  1st Qu.:   41.75   1st Qu.:  1.00   1st Qu.:22.00   1st Qu.: 74.00  
##  Median :   71.00   Median :  7.00   Median :27.00   Median : 84.00  
##  Mean   :  374.53   Mean   : 13.98   Mean   :32.34   Mean   : 83.33  
##  3rd Qu.:  171.50   3rd Qu.: 14.00   3rd Qu.:41.00   3rd Qu.: 93.00  
##  Max.   :11628.00   Max.   :255.00   Max.   :80.00   Max.   :127.00  
##       viq        
##  Min.   : 64.00  
##  1st Qu.: 83.00  
##  Median : 91.00  
##  Mean   : 92.09  
##  3rd Qu.:101.25  
##  Max.   :131.00

Let’s check the variance

var(dataU$age)
## [1] 206.0748
var(dataU$days)
## [1] 1744224

Regression

Let’s perform the logistic regression.

fit1 <- glm(DurationF ~ 1, #logistic regression
            family = binomial,
            data = dataU)

summary(fit1)
## 
## Call:
## glm(formula = DurationF ~ 1, family = binomial, data = dataU)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.986  -0.986  -0.986   1.382   1.382  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.4684     0.1453  -3.223  0.00127 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.58  on 199  degrees of freedom
## Residual deviance: 266.58  on 199  degrees of freedom
## AIC: 268.58
## 
## Number of Fisher Scoring iterations: 4
exp(cbind(odds = fit1$coefficients, confint.default(fit1))) #calculation of the odds for Y=1
##                  odds     2.5 %    97.5 %
## (Intercept) 0.6260163 0.4708595 0.8322999
head(fitted(fit1)) #estimated probability for each pearson for Y=1
##     1     2     3     4     5     6 
## 0.385 0.385 0.385 0.385 0.385 0.385
Pseudo_R2_fit1 <-  123/200 #short duration
Pseudo_R2_fit1
## [1] 0.615

The higher the number, the better the model (correctly classified). I would say this one is OK but not great as probability is 61,5%.

The probability that the duration of a coma is short is 38,5% (Y=1) which is quite a lot.

Now I’ll try throwing an explanatory variable (age) in there with the duration being the dependent variable.

fit2 <- glm(DurationF ~ age,
            family = binomial, 
            data = dataU)
summary(fit2)
## 
## Call:
## glm(formula = DurationF ~ age, family = binomial, data = dataU)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1852  -1.0442  -0.8179   1.2709   1.9476  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.35832    0.37350   0.959   0.3374  
## age         -0.02615    0.01114  -2.347   0.0189 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.58  on 199  degrees of freedom
## Residual deviance: 260.62  on 198  degrees of freedom
## AIC: 264.62
## 
## Number of Fisher Scoring iterations: 4

p-value is below alpha=5% but not below 0,001 so it’s significant, but the results may not be that good. Also now I’m figuring it would be better if I would divide duration into 2 categories, not 3…

anova(fit1, fit2, test = "Chi") #which model is better?
## Analysis of Deviance Table
## 
## Model 1: DurationF ~ 1
## Model 2: DurationF ~ age
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       199     266.58                       
## 2       198     260.62  1   5.9588  0.01464 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: simple model is better. H1: complex model is better.

As we can see the deviance is lower, so I would reject the null hypothesis at p value (alpha<5%).

I’ll calculate the odds ratio again as well as the confidence interval.

exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
##                    OR     2.5 %    97.5 %
## (Intercept) 1.4309295 0.6881628 2.9753998
## age         0.9741843 0.9531363 0.9956971

Explanation: if age is increased by 1 year, the odds for a short coma duration are equal to 0,97 times the initial odds. How about I add some more.

fit3 <- glm(DurationF ~ age + days + piq + viq,
            family = binomial, 
            data = dataU)

vif(fit3)
##      age     days      piq      viq 
## 1.044053 1.047315 1.570109 1.549326
mean(vif(fit3))
## [1] 1.302701

I could have some problem with multicollinearity with piq and viq (expected), others are close to 1 so that’s OK.

I’ll check for some outliers.

dataU$StdResid <- rstandard(fit3)
dataU$CookDist <- cooks.distance(fit3)

hist(dataU$StdResid, 
     main = "Histogram of standardized residuals",
     ylab = "Frequency", 
     xlab = "Standardized residuals")

Now this surprised me, I don’t really know how to proceed from now on. I can’t find an explanation for this to be honest. I’ll try to work my way through the slides.

head(dataU[order(-dataU$CookDist), c("id", "CookDist")],5)
##       id   CookDist
## 331 5964 0.47721648
## 330 5142 0.13607880
## 258 5811 0.05084331
## 286  405 0.04985172
## 2   3535 0.03644547
dataU <- dataU[-331, -330]

Estimation of two new models

fit2 <- glm(DurationF ~ age,
            family = binomial, 
            data = dataU)

fit3 <- glm(DurationF ~ age + days + piq + viq,
            family = binomial, 
            data = dataU)

H0: Simple model (fit2 / Model 2) is better H1: Complex model (fit3 / Model 3) is better

# Comparison of models based on -2LL statistics 
anova(fit2, fit3, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: DurationF ~ age
## Model 2: DurationF ~ age + days + piq + viq
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       198     260.62                          
## 2       195     242.63  3   17.997 0.0004404 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finding: based on the ANOVA test, H0 is rejected at p-value < 0.001 meaning that the complex model is better fit.

summary(fit3)
## 
## Call:
## glm(formula = DurationF ~ age + days + piq + viq, family = binomial, 
##     data = dataU)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0150  -0.9383  -0.6691   1.1599   2.0847  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.3839024  1.2403270   1.922 0.054606 .  
## age         -0.0179656  0.0114911  -1.563 0.117948    
## days         0.0001736  0.0001300   1.335 0.181896    
## piq         -0.0573929  0.0153625  -3.736 0.000187 ***
## viq          0.0259957  0.0151489   1.716 0.086161 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.58  on 199  degrees of freedom
## Residual deviance: 242.63  on 195  degrees of freedom
## AIC: 252.63
## 
## Number of Fisher Scoring iterations: 4

The only statistically significant variable is piq.

Values of estimates from summary on e^x

exp(cbind(OR = fit3$coefficients, confint.default(fit3)))
##                     OR     2.5 %      97.5 %
## (Intercept) 10.8471502 0.9539979 123.3343052
## age          0.9821948 0.9603210   1.0045668
## days         1.0001736 0.9999187   1.0004285
## piq          0.9442230 0.9162162   0.9730859
## viq          1.0263365 0.9963111   1.0572668

If piq is increased by 1% point, the odds (Y=1) are equal to 0,944 times the initial odds (p<0,001), ceteris paribus.