Luka Pascal Cavazza, 1. February 2023

IMB; Multivariate Analysis;

HOMEWORK 3

Post-Coma Recovery of IQ

Since this dataset seemed to work quite well for the last homework I’ll be using it again.

So my unit of observation are the people that got injured and were consequently in a coma (and woke up), the sample size is 331 observations.

Description of variables:

  • id: patients ID number
  • days: number of days post coma at which IQs were measured
  • duration: duration of the coma in days
  • sex: gender
  • age: in years at the time of injury
  • piq: performance (i.e., mathematical) IQ
  • viq: verbal IQ

Source of the dataset is as mentioned before the carData library.

The basic idea is to see whether duration of coma affects IQ and if it can recover over time (if there will be more than one measurement per patient).

I’d just like to note that the first part of the homework will be basically the same just so that I have usable data (variables) again and I’ll also copy the descriptions, just so I know what I’ll be looking at if I return to this sometime in the future.

So let’s see what we are dealing with.

library(carData)
data(Wong)

head(Wong)
##     id days duration  sex      age piq viq
## 1 3358   30        4 Male 20.67077  87  89
## 2 3535   16       17 Male 55.28816  95  77
## 3 3547   40        1 Male 55.91513  95 116
## 4 3592   13       10 Male 61.66461  59  73
## 5 3728   19        6 Male 30.12731  67  73
## 6 3790   13        3 Male 57.06229  76  69

The first thing that I notice is how age is written. It’s obvious that it was calculated automatically but I’ll change it so it makes more sense.

Wong$age <- round(Wong$age, 0) #rounding the age column to the nearest number
data <- Wong

I just got an idea. Let’s see if there are any duplicates within ID and then we can see if there is any progress between 1st and 2nd test score.

#install.packages("misty")
library(misty)
## |-------------------------------------|
## | misty 0.4.7 (2023-01-06)            |
## | Miscellaneous Functions T. Yanagida |
## |-------------------------------------|
data2M <- df.duplicated(data, id, first = TRUE, keep.all = TRUE,  #extracting the duplicate values in ID in a new data frame data 2 measurements
              from.last = FALSE, keep.row.names = TRUE,
              check = TRUE)

dataU <- df.unique(data, id, keep.all = TRUE, #extracting unique data in a new data frame data unique
          from.last = FALSE, keep.row.names = TRUE,
          check = TRUE)

sum(duplicated(data2M$id)) #checking if it works, should be 224 observations
## [1] 131

This is a bit weird, but I believe the problem is some people were tested more than twice and some just once. I hope this will not be too big of a problem.

It would make sense to make a categorical variable with 3 levels, if people were in a coma up to 10 days I will assign them a value of 1, for 10 to 50 days a value of 2 and above 50 a value of 3

createDurationFactor <- function(duration) { #creating a duration factor so I have 3 levels
  return (ifelse (duration < 10, 1, ifelse(duration > 50, 3, 2)))
}

dataU$DurationF <- createDurationFactor(dataU$duration)

Now I can move on to the remaining part.

Graphical representation of relationships among analysed variables with a scatterplot

Below is the graphical representation of relationships among my variables presented in a scatterplot matrix.

scatterplotMatrix(dataU[, c(-1,-4,-8)], #removing ID since it doesn't play any role here in my opinion, removed sex as it's not a numerical variable and I could also remove days but I decided not to
                  smooth = FALSE)

From this I can see that mostly younger people (from the previous HW we can recall it was males) had incidents that resulted in a coma. Also what is interesting to me is the peaks of verbal IQ and performance IQ are inverted but I don’t really know how to explain that. We can also see that days and duration are positively correlated for example and vice versa for age and duration.

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:misty':
## 
##     na.pattern
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(dataU[ , c(-1,-4,-8)]))
##           days duration   age   piq   viq
## days      1.00     0.16 -0.19  0.02 -0.05
## duration  0.16     1.00 -0.19 -0.19 -0.02
## age      -0.19    -0.19  1.00  0.11  0.03
## piq       0.02    -0.19  0.11  1.00  0.58
## viq      -0.05    -0.02  0.03  0.58  1.00
## 
## n= 200 
## 
## 
## P
##          days   duration age    piq    viq   
## days            0.0244   0.0072 0.7377 0.5180
## duration 0.0244          0.0058 0.0076 0.7428
## age      0.0072 0.0058          0.1099 0.6395
## piq      0.7377 0.0076   0.1099        0.0000
## viq      0.5180 0.7428   0.6395 0.0000

From the correlation matrix we can see that verbal IQ and duration are strongly correlated (but not performance IQ, which is really interesting).

The regression model

With this I’d like to explain how the duration of the coma, time of measurement, age and verbal IQ affect performance IQ. I could also make fit2 to check how these things affect verbal IQ but I think I’ll just stick to one for now.

fit1 <- lm(piq ~ days + DurationF + age + viq, 
           data = dataU)

I believe all of the variables are relevant when explaining the dependent variable. Duration factor could potentially be used as just as duration but I want to see if it can be used this way.

Let’s check standardized residuals and Cook’s distances as I’ll need this later.

dataU$StdResid <- round(rstandard(fit1), 3) 
dataU$CooksD <- round(cooks.distance(fit1), 3)

Checking assumptions

Linearity, normal distribution, multicollinearity and homoskedasticity will be checked in this part.

Linearity

plot(fit1, 1) #1 for just residuals vs. fitted

I’m quite surprised as I think this is as linear as it gets. Great.

Normal distribution

Let’s see if the data is normally distributed.

hist(dataU$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

This looks quite normally distributed but I’ll still run the Shapiro test. The cut point is at 5% but as I have 224 observations I’ll be happy even if it’s close.

shapiro.test(dataU$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  dataU$StdResid
## W = 0.9789, p-value = 0.004164

This is perfect.

H0: Variable is normally distributed H1: Variable is not normally distributed

As already mentioned above I have more than 100 observations so normal distribution is not significantly important.

Multicollinearity

Let’s check for multicollinearity, if it’ll be above 5 I might have a problem.

vif(fit1)
##      days DurationF       age       viq 
##  1.059351  1.062360  1.068020  1.004232
mean(vif(fit1))
## [1] 1.048491

This is great, things are going great up till this point.

Homoskedasticity

If p-value will be above 5% I have a problem with homoskedasticity.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##              Data               
##  -------------------------------
##  Response : piq 
##  Variables: fitted values of piq 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    5.419691 
##  Prob > Chi2   =    0.01991089

Amazing. I’m very satisfied with this dataset.

summary(fit1)
## 
## Call:
## lm(formula = piq ~ days + DurationF + age + viq, data = dataU)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.570  -6.150  -0.300   6.438  31.886 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.5917130  6.2993584   5.015 1.19e-06 ***
## days         0.0010906  0.0006019   1.812   0.0715 .  
## DurationF   -5.4263518  1.3300048  -4.080 6.56e-05 ***
## age          0.0664058  0.0555971   1.194   0.2338    
## viq          0.6189730  0.0599057  10.332  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.89 on 195 degrees of freedom
## Multiple R-squared:  0.4032, Adjusted R-squared:  0.3909 
## F-statistic: 32.93 on 4 and 195 DF,  p-value: < 2.2e-16

Because p-values for days (barely) and age are too high we cannot say they statistically affect performance IQ.