My unit of observation are the people that got injured and were consequently in a coma (and woke up). The sample size is 331 observations.
Source of the dataset is as mentioned before the carData library.
The basic idea is to see whether explanatory variables like duration of coma, age and the time at which the mesurement was made affects (performance) IQ the probability of recovery.
I’d just like to note that the first part of the homework will be basically the same just so that I have usable data (variables) again and I’ll also copy the descriptions, just so I know what I’ll be looking at if I return to this sometime in the future.
So let’s see what we are dealing with.
library(carData)
data(Wong)
head(Wong)
## id days duration sex age piq viq
## 1 3358 30 4 Male 20.67077 87 89
## 2 3535 16 17 Male 55.28816 95 77
## 3 3547 40 1 Male 55.91513 95 116
## 4 3592 13 10 Male 61.66461 59 73
## 5 3728 19 6 Male 30.12731 67 73
## 6 3790 13 3 Male 57.06229 76 69
The first thing that I notice is how age is written. It’s obvious that it was calculated automatically but I’ll change it so it makes more sense.
Wong$age <- round(Wong$age, 0) #rounding the age column to the nearest number
data <- Wong
I just got an idea. Let’s see if there are any duplicates within ID and then we can see if there is any progress between 1st and 2nd test score.
#install.packages("misty")
library(misty)
## |-------------------------------------|
## | misty 0.4.7 (2023-01-06) |
## | Miscellaneous Functions T. Yanagida |
## |-------------------------------------|
data2M <- df.duplicated(data, id, first = TRUE, keep.all = TRUE, #extracting the duplicate values in ID in a new data frame data 2 measurements
from.last = FALSE, keep.row.names = TRUE,
check = TRUE)
dataU <- df.unique(data, id, keep.all = TRUE, #extracting unique data in a new data frame data unique
from.last = FALSE, keep.row.names = TRUE,
check = TRUE)
sum(duplicated(data2M$id)) #checking if it works, should be 224 observations
## [1] 131
This is a bit weird, but I believe the problem is some people were tested more than twice and some just once. I hope this will not be too big of a problem.
It would make sense to make a categorical variable with 3 levels, if people were in a coma up to 10 days I will assign them a value of 1, for 10 to 50 days a value of 2 and above 50 a value of 3
createDurationFactor <- function(duration) { #creating a duration factor so I have 3 levels
return (ifelse (duration < 10, "short", ifelse(duration > 50, "long", "semi")))
}
dataU$DurationF <- createDurationFactor(dataU$duration)
Now I can move on to the remaining part.
I still want to change my factors.
dataU$sexF <- factor(dataU$sex, #this isn't really needed
labels = c("Male", "Female"),
levels = c("Male", "Female"))
dataU$DurationF <- factor(dataU$DurationF,
labels = c("short", "semi", "long"),
levels = c("short", "semi", "long"))
head(dataU)
## id days duration sex age piq viq DurationF sexF
## 1 3358 30 4 Male 21 87 89 short Male
## 2 3535 16 17 Male 55 95 77 semi Male
## 3 3547 40 1 Male 56 95 116 short Male
## 4 3592 13 10 Male 62 59 73 semi Male
## 5 3728 19 6 Male 30 67 73 short Male
## 6 3790 13 3 Male 57 76 69 short Male
Below is the graphical representation of relationships among my variables presented in a scatterplot matrix.
scatterplotMatrix(dataU[, c(-1,-4,-8, -9)], #removing ID since it doesn't play any role here in my opinion, removed sex as it's not a numerical variable and I could also remove days but I decided not to
smooth = FALSE)
From this I can see that mostly younger people (from the previous HW we
can recall it was males) had incidents that resulted in a coma. Also
what is interesting to me is the peaks of verbal IQ and performance IQ
are inverted but I don’t really know how to explain that. We can also
see that days and duration are positively correlated for example and
vice versa for age and duration.
summary(dataU[c("days", "duration", "age", "piq", "viq")])
## days duration age piq
## Min. : 13.00 Min. : 0.00 Min. : 7.00 Min. : 50.00
## 1st Qu.: 41.75 1st Qu.: 1.00 1st Qu.:22.00 1st Qu.: 74.00
## Median : 71.00 Median : 7.00 Median :27.00 Median : 84.00
## Mean : 374.53 Mean : 13.98 Mean :32.34 Mean : 83.33
## 3rd Qu.: 171.50 3rd Qu.: 14.00 3rd Qu.:41.00 3rd Qu.: 93.00
## Max. :11628.00 Max. :255.00 Max. :80.00 Max. :127.00
## viq
## Min. : 64.00
## 1st Qu.: 83.00
## Median : 91.00
## Mean : 92.09
## 3rd Qu.:101.25
## Max. :131.00
Let’s check the variance
var(dataU$age)
## [1] 206.0748
var(dataU$days)
## [1] 1744224
Let’s perform the logistic regression.
fit1 <- glm(DurationF ~ 1, #logistic regression
family = binomial,
data = dataU)
summary(fit1)
##
## Call:
## glm(formula = DurationF ~ 1, family = binomial, data = dataU)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.986 -0.986 -0.986 1.382 1.382
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4684 0.1453 -3.223 0.00127 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.58 on 199 degrees of freedom
## Residual deviance: 266.58 on 199 degrees of freedom
## AIC: 268.58
##
## Number of Fisher Scoring iterations: 4
exp(cbind(odds = fit1$coefficients, confint.default(fit1))) #calculation of the odds for Y=1
## odds 2.5 % 97.5 %
## (Intercept) 0.6260163 0.4708595 0.8322999
head(fitted(fit1)) #estimated probability for each pearson for Y=1
## 1 2 3 4 5 6
## 0.385 0.385 0.385 0.385 0.385 0.385
Pseudo_R2_fit1 <- 123/200 #short duration
Pseudo_R2_fit1
## [1] 0.615
The higher the number, the better the model (correctly classified). I would say this one is OK but not great as probability is 61,5%.
The probability that the duration of a coma is short is 38,5% (Y=1) which is quite a lot.
Now I’ll try throwing an explanatory variable (age) in there with the duration being the dependent variable.
fit2 <- glm(DurationF ~ age,
family = binomial,
data = dataU)
summary(fit2)
##
## Call:
## glm(formula = DurationF ~ age, family = binomial, data = dataU)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1852 -1.0442 -0.8179 1.2709 1.9476
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.35832 0.37350 0.959 0.3374
## age -0.02615 0.01114 -2.347 0.0189 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.58 on 199 degrees of freedom
## Residual deviance: 260.62 on 198 degrees of freedom
## AIC: 264.62
##
## Number of Fisher Scoring iterations: 4
p-value is below alpha=5% but not below 0,001 so it’s significant, but the results may not be that good. Also now I’m figuring it would be better if I would divide duration into 2 categories, not 3…
anova(fit1, fit2, test = "Chi") #which model is better?
## Analysis of Deviance Table
##
## Model 1: DurationF ~ 1
## Model 2: DurationF ~ age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 199 266.58
## 2 198 260.62 1 5.9588 0.01464 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: simple model is better. H1: complex model is better.
As we can see the deviance is lower, so I would reject the null hypothesis at p value (alpha<5%).
I’ll calculate the odds ratio again as well as the confidence interval.
exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
## OR 2.5 % 97.5 %
## (Intercept) 1.4309295 0.6881628 2.9753998
## age 0.9741843 0.9531363 0.9956971
Explanation: if age is increased by 1 year, the odds for a short coma duration are equal to 0,97 times the initial odds. How about I add some more.
fit3 <- glm(DurationF ~ age + days + piq + viq,
family = binomial,
data = dataU)
vif(fit3)
## age days piq viq
## 1.044053 1.047315 1.570109 1.549326
mean(vif(fit3))
## [1] 1.302701
I could have some problem with multicollinearity with piq and viq (expected), others are close to 1 so that’s OK.
I’ll check for some outliers.
dataU$StdResid <- rstandard(fit3)
dataU$CookDist <- cooks.distance(fit3)
hist(dataU$StdResid,
main = "Histogram of standardized residuals",
ylab = "Frequency",
xlab = "Standardized residuals")
Now this surprised me, I don’t really know how to proceed from now on. I
can’t find an explanation for this to be honest. I’ll try to work my way
through the slides.
head(dataU[order(-dataU$CookDist), c("id", "CookDist")],5)
## id CookDist
## 331 5964 0.47721648
## 330 5142 0.13607880
## 258 5811 0.05084331
## 286 405 0.04985172
## 2 3535 0.03644547
dataU <- dataU[-331, -330]
fit2 <- glm(DurationF ~ age,
family = binomial,
data = dataU)
fit3 <- glm(DurationF ~ age + days + piq + viq,
family = binomial,
data = dataU)
H0: Simple model (fit2 / Model 2) is better H1: Complex model (fit3 / Model 3) is better
# Comparison of models based on -2LL statistics
anova(fit2, fit3, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: DurationF ~ age
## Model 2: DurationF ~ age + days + piq + viq
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 198 260.62
## 2 195 242.63 3 17.997 0.0004404 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finding: based on the ANOVA test, H0 is rejected at p-value < 0.001 meaning that the complex model is better fit.
summary(fit3)
##
## Call:
## glm(formula = DurationF ~ age + days + piq + viq, family = binomial,
## data = dataU)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0150 -0.9383 -0.6691 1.1599 2.0847
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3839024 1.2403270 1.922 0.054606 .
## age -0.0179656 0.0114911 -1.563 0.117948
## days 0.0001736 0.0001300 1.335 0.181896
## piq -0.0573929 0.0153625 -3.736 0.000187 ***
## viq 0.0259957 0.0151489 1.716 0.086161 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 266.58 on 199 degrees of freedom
## Residual deviance: 242.63 on 195 degrees of freedom
## AIC: 252.63
##
## Number of Fisher Scoring iterations: 4
The only statistically significant variable is piq.
Values of estimates from summary on e^x
exp(cbind(OR = fit3$coefficients, confint.default(fit3)))
## OR 2.5 % 97.5 %
## (Intercept) 10.8471502 0.9539979 123.3343052
## age 0.9821948 0.9603210 1.0045668
## days 1.0001736 0.9999187 1.0004285
## piq 0.9442230 0.9162162 0.9730859
## viq 1.0263365 0.9963111 1.0572668
If piq is increased by 1% point, the odds (Y=1) are equal to 0,944 times the initial odds (p<0,001), ceteris paribus.