data()
data(package = .packages(all.available=TRUE))
#install.packages("carData")
library(carData)
mydata <- force(Wong)
head(mydata)
## id days duration sex age piq viq
## 1 3358 30 4 Male 20.67077 87 89
## 2 3535 16 17 Male 55.28816 95 77
## 3 3547 40 1 Male 55.91513 95 116
## 4 3592 13 10 Male 61.66461 59 73
## 5 3728 19 6 Male 30.12731 67 73
## 6 3790 13 3 Male 57.06229 76 69
nrow(mydata)
## [1] 331
Wong, P. P., Monette, G., and Weiner, N. I. (2001) Mathematical models of cognitive recovery. Brain Injury, 15, 519–530.
any(is.na(mydata))
## [1] FALSE
There are no missing values.
Let’s first see which category is the more common one (has more units)
summary(mydata$sex)
## Female Male
## 71 260
Male has more units and will be put as the first one while creating a factor. (dummy value = 0)
*Note: This was not necessary thing to do, however since I have a categorical variable with only two categories, I wanted to show how I would do it if I had more than two categories, and use the most common one as a reference group (and create more dummy variables).
mydata$sexF <- factor(mydata$sex,
levels = c("Male", "Female"),
labels = c ("M" , "F"))
head(mydata,3)
## id days duration sex age piq viq sexF
## 1 3358 30 4 Male 20.67077 87 89 M
## 2 3535 16 17 Male 55.28816 95 77 M
## 3 3547 40 1 Male 55.91513 95 116 M
summary(mydata [ , c(-1,-4) ]) #Excluding ID and sex because it it repetitive
## days duration age piq
## Min. : 13.0 Min. : 0.0 Min. : 6.513 Min. : 50.00
## 1st Qu.: 59.0 1st Qu.: 1.0 1st Qu.:21.737 1st Qu.: 77.00
## Median : 150.0 Median : 7.0 Median :26.877 Median : 87.00
## Mean : 481.9 Mean : 14.3 Mean :31.853 Mean : 87.56
## 3rd Qu.: 416.0 3rd Qu.: 16.0 3rd Qu.:40.923 3rd Qu.: 97.00
## Max. :11628.0 Max. :255.0 Max. :80.033 Max. :133.00
## viq sexF
## Min. : 64.00 M:260
## 1st Qu.: 85.00 F: 71
## Median : 94.00
## Mean : 94.96
## 3rd Qu.:105.00
## Max. :132.00
Since variable id is not unique for every patient, I will make an ID unique for every unit and put it as a first variable.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
mutate(ID = row_number())
mydata <- mydata %>%
select(ID, everything())
head(mydata,3)
## ID id days duration sex age piq viq sexF
## 1 1 3358 30 4 Male 20.67077 87 89 M
## 2 2 3535 16 17 Male 55.28816 95 77 M
## 3 3 3547 40 1 Male 55.91513 95 116 M
How the number of days after coma at which IQs were measured, duration of the coma, age and verbal IQ affect the performance IQ after coma?
Dependent variable: piq (performance IQ)
Explanatory variables:
Additional question: Will adding gender make any statistically significant difference (will it explain the performance IQ better)?
Assumptions:
Requirements:
Let’s fit the first model and check assumptions and requirements.
The first thing I will do is to make my dependent variable piq a second variable, right after ID, for easier analysis.
library(dplyr)
mydata <- mydata %>%
select(ID, piq, everything())
head(mydata, 3)
## ID piq id days duration sex age viq sexF
## 1 1 87 3358 30 4 Male 20.67077 89 M
## 2 2 95 3535 16 17 Male 55.28816 77 M
## 3 3 95 3547 40 1 Male 55.91513 116 M
The first thing we will do is scatterplot matrix, which will help us determine whether the linearity assumption is met (looking the first row for linearity and correlation between explanatory variables for multicolinearity)
library(car)
## Warning: package 'car' was built under R version 4.3.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
scatterplotMatrix(mydata[ , c(-1, -3, -6, -9)], #Excluding ID and categorical variables
smooth = FALSE)
From the first row, I can conclude that the linearity assumption is met, since I cannot see any non-linear shapes in the scatterplots. (This is also confirmed later with the scatterplot of standardized residuals and standardized fitted values)
From the scatterplots among explanatory variables, I can conclude that multicolinearity should not be a problem, but this will be further confirmed by the variance inflation factor (VIF).
fit1 <- lm(piq ~ days + duration + age + viq,
data = mydata)
vif(fit1)
## days duration age viq
## 1.061496 1.072845 1.042438 1.002976
mean(vif(fit1))
## [1] 1.044939
As predicted, all the values are below 5, and the mean of variance inflation factor is close to one.
No perfect multicolineartity assumption is met, as well as the not too strong mulitolinearity requirement.
Now I will check if the errors are normally distributed with the histogram of standardized residuals, and if there are any outliers (with the help of standardized residuals) or units with high impact (with the help of cooks distances)
mydata$stdResid <- round(rstandard(fit1), 3) #standardized residuals
mydata$CooksD <- round(cooks.distance(fit1), 3) #Cooks distances
hist (mydata$stdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
I will further check the normality with Shapiro Wilk normality test, because I am not confident to conclude from the histogram.
*Note: This is not necessary since the sample size is large enough, however for the educational purposes I will continue with this analysis
shapiro.test(mydata$stdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$stdResid
## W = 0.98558, p-value = 0.002163
H0: Errors are normally distributed.
H1: Errors are not normally distributed.
We reject the null hypothesis (p=0.003) and conclude that errors are not normally distributed, however, this does not matter since the sample size is sufficiently big.
Additionally, from the histogram of standardized residuals, it can be seen that there are some outlier since some values are smaller than -3 and higher than 3. We will filter those units out
hist(mydata$CooksD,
xlab = "Cooks DIstances",
ylab = "Frequency",
main = "Histogram of Cooks distances")
All values are below 1, but there are some gaps suggesting that there are some units of high impact and these units need to be removed from the data.
head(mydata[order(mydata$stdResid), c("ID","stdResid")],3) #Highest values of standardized residuals
## ID stdResid
## 96 96 -3.491
## 286 286 -3.446
## 70 70 -2.969
head(mydata[order(-mydata$stdResid), c("ID", "stdResid")], 3) #Lowest values of standardized residuals
## ID stdResid
## 196 196 3.049
## 153 153 2.986
## 299 299 2.660
The outliers are the ones with IDs 96, 286 and 196.
head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 8)
## ID CooksD
## 305 305 0.465
## 331 331 0.188
## 234 234 0.075
## 311 311 0.062
## 286 286 0.034
## 330 330 0.034
## 92 92 0.026
## 329 329 0.026
I will remove the units which cause this gaps (units of high impact), and those are units with IDs 305, 331.
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(96, 286, 196, 305, 331))
I will fit again since we have less units then before and with the help of scatterplot of standardized fitted values and standardized residuals, I will check the homoskedasticity assumption
fit1 <- lm (piq ~ days + duration + age + viq,
data = mydata)
mydata$stdFitted <- scale(fit1$fitted.values)
library(car)
scatterplot(stdResid ~ stdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
smooth = FALSE,
regLine = FALSE,
boxplots = FALSE,
data=mydata)
From the scatterplot, I would say that homoskedasticity is violated but let’s check with the Breusch Pagan test.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : piq
## Variables: fitted values of piq
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.78612
## Prob > Chi2 = 0.05167933
H0: The variance is constant - homoskedasticity
H1: The variance is not constant - heteroskedasticity
We cannot reject the null hypothesis (p>5%) and conclude that assumption that variance are constant is met.
Since homoskedasticity is not violated, we do not need to use robust standard errors now.
library(estimatr)
fit1<- lm(piq ~ days + duration + age + viq,
data = mydata)
summary(fit1)
##
## Call:
## lm(formula = piq ~ days + duration + age + viq, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.386 -6.249 -0.600 6.054 32.554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.7931099 4.1825567 5.210 3.38e-07 ***
## days 0.0018549 0.0006247 2.969 0.00321 **
## duration -0.1533643 0.0274455 -5.588 4.91e-08 ***
## age -0.0020573 0.0428495 -0.048 0.96174
## viq 0.7079263 0.0417880 16.941 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.53 on 321 degrees of freedom
## Multiple R-squared: 0.5008, Adjusted R-squared: 0.4946
## F-statistic: 80.52 on 4 and 321 DF, p-value: < 2.2e-16
H0: 𝛽1 = 0
H1: 𝛽1 =/= 0
We reject the null hypothesis at p=0.004. Number of days post coma at which IQs were measured have a statistically significant effect on the performance IQ.
If number of days post coma at which IQs were measured increase by 1 day, the performance IQ after coma will increase on average by 0.002 IQ points (p=0.004), assuming all other explanatory variables remain unchanged.
The coefficient is positive as expected.
H0: 𝛽2 = 0
H1: 𝛽2 =/= 0
We reject the null hypothesis at p<0.001. Duration of the coma have a statistically significant effect on the performance IQ after coma.
If duration of the coma increases by 1 day, the performance IQ after coma will decrease on average by 0.153 IQ points (p<0.001), assuming all other explanatory variables remain unchanged.
The coefficient is negative as expected.
H0: 𝛽3 = 0
H1: 𝛽3 =/= 0
We cannot reject the null hypothesis, since p-value is bigger than 5%.
We cannot say that the age of a patient who was in coma has an effect on the patient’s performance IQ after coma.
H0: 𝛽4 = 0
H1: 𝛽4 =/= 0
We reject the null hypothesis at p<0.001. Verbal IQ post coma have a statistically significant effect on the performance IQ after coma.
If the verbal IQ of a patient after coma increases by 1 IQ point, the performance IQ after coma will increase on average by 0.708 IQ points (p<0.001), assuming all other explanatory variables remain unchanged.
The coefficient is positive as expected.
50.08% of the variability of performance IQ can be explained by the variability (or the linear effect) of all other explanatory variables (days, duration, age, and viq)
H0: 𝜌2 = 0 or 𝛽1 = 𝛽2 = 𝛽3 = 𝛽4 = 0
H1: 𝜌2 > 0 or at least one 𝛽𝑗 is different from 0
We reject the null hypothesis at p<0.001 and conclude that at least some of the explanatory variables impact performance IQ after coma.
sqrt(summary(fit1)$r.squared)
## [1] 0.7077027
This the square root of the R squared. It tells us the relationship between all the variables in the model.
The relationship between the piq (dependent variable), days, duration, age and viq (explanatory) variables is strong.
Note* - we do not include sign here (positive/negative) since it has many dimensions (relationship between 5 variables)
library(lm.beta)
lm.beta(fit1)
##
## Call:
## lm(formula = piq ~ days + duration + age + viq, data = mydata)
##
## Standardized Coefficients::
## (Intercept) days duration age viq
## NA 0.12313715 -0.23324299 -0.00192616 0.67007769
We look for the highest value of standardized coefficient in absolute values. The variable viq (verbal IQ) has the highest impact on the dependent variable (piq-performance IQ)
Adding variable sexF (factor) in the model. It is a categorical - dummy variable.
Note* - I will not check the assumptions, adding this variable is just for the purpose of explaining the coefficient of the dummy variable and comparing the two models.
fit2 <- lm(piq ~ days + duration + age + viq + sexF,
data = mydata)
summary(fit2)
##
## Call:
## lm(formula = piq ~ days + duration + age + viq + sexF, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.063 -5.988 -0.944 6.192 31.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.3012073 4.2162352 5.052 7.36e-07 ***
## days 0.0018807 0.0006254 3.007 0.00285 **
## duration -0.1510237 0.0275644 -5.479 8.68e-08 ***
## age 0.0003072 0.0429322 0.007 0.99429
## viq 0.7088216 0.0418070 16.955 < 2e-16 ***
## sexFF 1.3399207 1.4313506 0.936 0.34992
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.53 on 320 degrees of freedom
## Multiple R-squared: 0.5022, Adjusted R-squared: 0.4944
## F-statistic: 64.57 on 5 and 320 DF, p-value: < 2.2e-16
Although p-values are changed, the coefficient that were statistically significant remained statistically significant, and the variable age still does not impact the dependent variable.
The signs of the coefficients remain unchanged.
H0: 𝛽5 = 0
H1: 𝛽5 =/= 0
We cannot reject the null hypothesis, since p-value is bigger than 5%.
We cannot say that the gender of a patient who was in coma has an effect on the patient’s performance IQ after coma.
However, for the educational purposes, I am going to explain this coefficient.
Given the values of other explanatory variables, the females have on average performance IQ (after coma) by 1.34 IQ points higher compared to the males. (p-value)
anova(fit1, fit2) #Comparison of two regression models that differ in the number of explanatory variables
## Analysis of Variance Table
##
## Model 1: piq ~ days + duration + age + viq
## Model 2: piq ~ days + duration + age + viq + sexF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 321 35564
## 2 320 35466 1 97.125 0.8763 0.3499
H0: Both models are equally good. (*Note: in this case we use the simple one-with less variables)
H1: Complex model is better. (complex meaning more variables)
We cannot reject the null hypothesis, since the p-value is bigger that 5%.
We can conclude that the models are equally good, and because of this we will do the simple model.