Since this dataset seemed to work quite well for the last homework I’ll be using it again.
So my unit of observation are the people that got injured and were consequently in a coma (and woke up), the sample size is 331 observations.
Description of variables:
Source of the dataset is as mentioned before the carData library.
The basic idea is to see whether duration of coma affects IQ and if it can recover over time (if there will be more than one measurement per patient).
I’d just like to note that the first part of the homework will be basically the same just so that I have usable data (variables) again and I’ll also copy the descriptions, just so I know what I’ll be looking at if I return to this sometime in the future.
So let’s see what we are dealing with.
library(carData)
data(Wong)
head(Wong)
## id days duration sex age piq viq
## 1 3358 30 4 Male 20.67077 87 89
## 2 3535 16 17 Male 55.28816 95 77
## 3 3547 40 1 Male 55.91513 95 116
## 4 3592 13 10 Male 61.66461 59 73
## 5 3728 19 6 Male 30.12731 67 73
## 6 3790 13 3 Male 57.06229 76 69
The first thing that I notice is how age is written. It’s obvious that it was calculated automatically but I’ll change it so it makes more sense.
Wong$age <- round(Wong$age, 0) #rounding the age column to the nearest number
data <- Wong
I just got an idea. Let’s see if there are any duplicates within ID and then we can see if there is any progress between 1st and 2nd test score.
#install.packages("misty")
library(misty)
## |-------------------------------------|
## | misty 0.4.7 (2023-01-06) |
## | Miscellaneous Functions T. Yanagida |
## |-------------------------------------|
data2M <- df.duplicated(data, id, first = TRUE, keep.all = TRUE, #extracting the duplicate values in ID in a new data frame data 2 measurements
from.last = FALSE, keep.row.names = TRUE,
check = TRUE)
dataU <- df.unique(data, id, keep.all = TRUE, #extracting unique data in a new data frame data unique
from.last = FALSE, keep.row.names = TRUE,
check = TRUE)
sum(duplicated(data2M$id)) #checking if it works, should be 224 observations
## [1] 131
This is a bit weird, but I believe the problem is some people were tested more than twice and some just once. I hope this will not be too big of a problem.
It would make sense to make a categorical variable with 3 levels, if people were in a coma up to 10 days I will assign them a value of 1, for 10 to 50 days a value of 2 and above 50 a value of 3
createDurationFactor <- function(duration) { #creating a duration factor so I have 3 levels
return (ifelse (duration < 10, 1, ifelse(duration > 50, 3, 2)))
}
dataU$DurationF <- createDurationFactor(dataU$duration)
Now I can move on to the remaining part.
Below is the graphical representation of relationships among my variables presented in a scatterplot matrix.
scatterplotMatrix(dataU[, c(-1,-4,-8)], #removing ID since it doesn't play any role here in my opinion, removed sex as it's not a numerical variable and I could also remove days but I decided not to
smooth = FALSE)
From this I can see that mostly younger people (from the previous HW we
can recall it was males) had incidents that resulted in a coma. Also
what is interesting to me is the peaks of verbal IQ and performance IQ
are inverted but I don’t really know how to explain that. We can also
see that days and duration are positively correlated for example and
vice versa for age and duration.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:misty':
##
## na.pattern
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(dataU[ , c(-1,-4,-8)]))
## days duration age piq viq
## days 1.00 0.16 -0.19 0.02 -0.05
## duration 0.16 1.00 -0.19 -0.19 -0.02
## age -0.19 -0.19 1.00 0.11 0.03
## piq 0.02 -0.19 0.11 1.00 0.58
## viq -0.05 -0.02 0.03 0.58 1.00
##
## n= 200
##
##
## P
## days duration age piq viq
## days 0.0244 0.0072 0.7377 0.5180
## duration 0.0244 0.0058 0.0076 0.7428
## age 0.0072 0.0058 0.1099 0.6395
## piq 0.7377 0.0076 0.1099 0.0000
## viq 0.5180 0.7428 0.6395 0.0000
From the correlation matrix we can see that verbal IQ and duration are strongly correlated (but not performance IQ, which is really interesting).
With this I’d like to explain how the duration of the coma, time of measurement, age and verbal IQ affect performance IQ. I could also make fit2 to check how these things affect verbal IQ but I think I’ll just stick to one for now.
fit1 <- lm(piq ~ days + DurationF + age + viq,
data = dataU)
I believe all of the variables are relevant when explaining the dependent variable. Duration factor could potentially be used as just as duration but I want to see if it can be used this way.
Let’s check standardized residuals and Cook’s distances as I’ll need this later.
dataU$StdResid <- round(rstandard(fit1), 3)
dataU$CooksD <- round(cooks.distance(fit1), 3)
Linearity, normal distribution, multicollinearity and homoskedasticity will be checked in this part.
plot(fit1, 1) #1 for just residuals vs. fitted
I’m quite surprised as I think this is as linear as it gets. Great.
Let’s see if the data is normally distributed.
hist(dataU$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
This looks quite normally distributed but I’ll still run the Shapiro
test. The cut point is at 5% but as I have 224 observations I’ll be
happy even if it’s close.
shapiro.test(dataU$StdResid)
##
## Shapiro-Wilk normality test
##
## data: dataU$StdResid
## W = 0.9789, p-value = 0.004164
This is perfect.
H0: Variable is normally distributed H1: Variable is not normally distributed
As already mentioned above I have more than 100 observations so normal distribution is not significantly important.
Let’s check for multicollinearity, if it’ll be above 5 I might have a problem.
vif(fit1)
## days DurationF age viq
## 1.059351 1.062360 1.068020 1.004232
mean(vif(fit1))
## [1] 1.048491
This is great, things are going great up till this point.
If p-value will be above 5% I have a problem with homoskedasticity.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : piq
## Variables: fitted values of piq
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 5.419691
## Prob > Chi2 = 0.01991089
Amazing. I’m very satisfied with this dataset.
summary(fit1)
##
## Call:
## lm(formula = piq ~ days + DurationF + age + viq, data = dataU)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.570 -6.150 -0.300 6.438 31.886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.5917130 6.2993584 5.015 1.19e-06 ***
## days 0.0010906 0.0006019 1.812 0.0715 .
## DurationF -5.4263518 1.3300048 -4.080 6.56e-05 ***
## age 0.0664058 0.0555971 1.194 0.2338
## viq 0.6189730 0.0599057 10.332 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.89 on 195 degrees of freedom
## Multiple R-squared: 0.4032, Adjusted R-squared: 0.3909
## F-statistic: 32.93 on 4 and 195 DF, p-value: < 2.2e-16
Because p-values for days (barely) and age are too high we cannot say they statistically affect performance IQ.