Anastasija Šofranac MVA HW3 - Regression Analysis

Research Question: How the number of days after coma (at which IQs were measured), durartion of the coma, age and verbal IQ affect the performance IQ of a patient after coma?

Additional question: Will adding gender make any statistically significant difference (will it explain the performance IQ better)?

Data

data()
data(package = .packages(all.available=TRUE))

#install.packages("carData")
library(carData)
mydata <- force(Wong)

head(mydata)

##     id days duration  sex      age piq viq
## 1 3358   30        4 Male 20.67077  87  89
## 2 3535   16       17 Male 55.28816  95  77
## 3 3547   40        1 Male 55.91513  95 116
## 4 3592   13       10 Male 61.66461  59  73
## 5 3728   19        6 Male 30.12731  67  73
## 6 3790   13        3 Male 57.06229  76  69

nrow(mydata)

## [1] 331

Explanation of data

Unit of observation: patient (who was in a coma)
Sample size: 331 (large sample)
Variables:
- id: patient ID number.
- days: number of days post coma at which IQs were measured (numerical variable, unit od measurement - days). Explanatory variable
- duration: duration of the coma (numerical variable, unit od measurement - days). Explanatory variable
- sex: factor with levels Female and Male (categorical variable with two levels - dummy). Explanatory variable
- age: age at the time of injury (numerical variable, unit od measurement - years). Explanatory variable
- piq: performance IQ after coma (numerical variable, unit od measurement - IQ points). Dependent variable
- viq: verbal IQ after coma (numerical variable, unit od measurement - IQ points). Explanatory variable

Source

Wong, P. P., Monette, G., and Weiner, N. I. (2001) Mathematical models of cognitive recovery. Brain Injury, 15, 519–530.

Data manipulation and Descriptive statistics

Checking for missing values

any(is.na(mydata))

## [1] FALSE

There are no missing values.

Converting categorical variable sex into factors

Let’s first see which category is the more common one (has more units)

summary(mydata$sex)

## Female   Male 
##     71    260

Male has more units and will be put as the first one while creating a factor. (dummy value = 0)

*Note: This was not necessary thing to do, however since I have a categorical variable with only two categories, I wanted to show how I would do it if I had more than two categories, and use the most common one as a reference group (and create more dummy variables).

mydata$sexF <- factor(mydata$sex,
                         levels = c("Male", "Female"),
                         labels = c ("M" , "F"))

head(mydata,3)

##     id days duration  sex      age piq viq sexF
## 1 3358   30        4 Male 20.67077  87  89    M
## 2 3535   16       17 Male 55.28816  95  77    M
## 3 3547   40        1 Male 55.91513  95 116    M

Statistical description

summary(mydata [ , c(-1,-4) ]) #Excluding ID and sex because it it repetitive

##       days            duration          age              piq        
##  Min.   :   13.0   Min.   :  0.0   Min.   : 6.513   Min.   : 50.00  
##  1st Qu.:   59.0   1st Qu.:  1.0   1st Qu.:21.737   1st Qu.: 77.00  
##  Median :  150.0   Median :  7.0   Median :26.877   Median : 87.00  
##  Mean   :  481.9   Mean   : 14.3   Mean   :31.853   Mean   : 87.56  
##  3rd Qu.:  416.0   3rd Qu.: 16.0   3rd Qu.:40.923   3rd Qu.: 97.00  
##  Max.   :11628.0   Max.   :255.0   Max.   :80.033   Max.   :133.00  
##       viq         sexF   
##  Min.   : 64.00   M:260  
##  1st Qu.: 85.00   F: 71  
##  Median : 94.00          
##  Mean   : 94.96          
##  3rd Qu.:105.00          
##  Max.   :132.00

Mean 481.9 (variable days): On average, patients’ IQ was measured 481.9 days after the coma.
Min 6.5 (variable age): The minimum value of 6.5 years indicates that the youngest patient that was in coma was 6.5 years old.
Median 7.0 (variable duration) : Half of the patients were in come up to 7 days, while the remaining 50% of patients were in coma for a longer time.
Max 132.0 (variable viq): The maximum value of IQ measured on a patient after the coma was 132.0 IQ points.
Value 71 (variable sexF) : The number of female patients that were in coma is 71.

Creating a new variable ID

Since variable id is not unique for every patient, I will make an ID unique for every unit and put it as a first variable.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>%
  mutate(ID = row_number())

mydata <- mydata %>%
  select(ID, everything())

head(mydata,3)

##   ID   id days duration  sex      age piq viq sexF
## 1  1 3358   30        4 Male 20.67077  87  89    M
## 2  2 3535   16       17 Male 55.28816  95  77    M
## 3  3 3547   40        1 Male 55.91513  95 116    M

Defining multiple regression model

How the number of days after coma at which IQs were measured, duration of the coma, age and verbal IQ affect the performance IQ after coma?

Dependent variable: piq (performance IQ)

Explanatory variables:

days: I would expect positive coefficient, meaning that with each additional day after coma at which IQs were measured, the higher performance IQ will be.
duration: I would expect negative coefficient, meaning that if duration (number of days) of the coma increases, the performance IQ will be smaller.
age: I would expect a negative coefficient, meaning that younger patients will be able to recover better after coma, and as age increases, performance IQ will be smaller.
viq: I would expect positive coefficient, meaning that if person has higher verbal IQ it will positively affect performance IQ. So with each additional verbal IQ point, performance IQ will increase.

Additional question: Will adding gender make any statistically significant difference (will it explain the performance IQ better)?

sex: I would not expect any difference between two genders, however we will see if this coefficient will be statistically significant or not and if there would be any difference in genders.

Regression Analysis

Assumptions:

Linearity in parameters.
The expected value of the errors equals 0.
Homoskedasticity (equal variances).
Normal distribution of errors.
Errors are independent. This assumption is already met. (Each unit measuer once, no time series, no hierarchical data).
No perfect multicolinearity.
The number of units is greater than the number of estimated parameters: 𝑛 > 𝑘. This assumption is already met.

Requirements:

The dependent variable is numerical, the explanatory variables can be numerical or dichotomous. This requirement is fulfilled.
Each explanatory variable must vary. None of the variables has nonzero variance so this assumption is met as well.
Outliers and units of high impact are removed from the data.
Not too strong multicolinearity.

Let’s fit the first model and check assumptions and requirements.

The first thing I will do is to make my dependent variable piq a second variable, right after ID, for easier analysis.

library(dplyr)

mydata <- mydata %>%
  select(ID, piq, everything())

head(mydata, 3)

##   ID piq   id days duration  sex      age viq sexF
## 1  1  87 3358   30        4 Male 20.67077  89    M
## 2  2  95 3535   16       17 Male 55.28816  77    M
## 3  3  95 3547   40        1 Male 55.91513 116    M

The first thing we will do is scatterplot matrix, which will help us determine whether the linearity assumption is met (looking the first row for linearity and correlation between explanatory variables for multicolinearity)

library(car)

## Warning: package 'car' was built under R version 4.3.2

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

scatterplotMatrix(mydata[ , c(-1, -3, -6, -9)],  #Excluding ID and categorical variables
                  smooth = FALSE)

From the first row, I can conclude that the linearity assumption is met, since I cannot see any non-linear shapes in the scatterplots. (This is also confirmed later with the scatterplot of standardized residuals and standardized fitted values)

From the scatterplots among explanatory variables, I can conclude that multicolinearity should not be a problem, but this will be further confirmed by the variance inflation factor (VIF).

fit1 <- lm(piq ~ days + duration + age + viq,
           data = mydata)

vif(fit1)

##     days duration      age      viq 
## 1.061496 1.072845 1.042438 1.002976

mean(vif(fit1))

## [1] 1.044939

As predicted, all the values are below 5, and the mean of variance inflation factor is close to one.

No perfect multicolineartity assumption is met, as well as the not too strong mulitolinearity requirement.

Now I will check if the errors are normally distributed with the histogram of standardized residuals, and if there are any outliers (with the help of standardized residuals) or units with high impact (with the help of cooks distances)

mydata$stdResid <- round(rstandard(fit1), 3) #standardized residuals 
mydata$CooksD <- round(cooks.distance(fit1), 3) #Cooks distances

hist (mydata$stdResid,
      xlab = "Standardized residuals",
      ylab = "Frequency",
      main = "Histogram of standardized residuals")

I will further check the normality with Shapiro Wilk normality test, because I am not confident to conclude from the histogram.

*Note: This is not necessary since the sample size is large enough, however for the educational purposes I will continue with this analysis

shapiro.test(mydata$stdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$stdResid
## W = 0.98558, p-value = 0.002163

H0: Errors are normally distributed.

H1: Errors are not normally distributed.

We reject the null hypothesis (p=0.003) and conclude that errors are not normally distributed, however, this does not matter since the sample size is sufficiently big.

Additionally, from the histogram of standardized residuals, it can be seen that there are some outlier since some values are smaller than -3 and higher than 3. We will filter those units out

hist(mydata$CooksD,
     xlab = "Cooks DIstances",
     ylab = "Frequency",
     main = "Histogram of Cooks distances")

All values are below 1, but there are some gaps suggesting that there are some units of high impact and these units need to be removed from the data.

head(mydata[order(mydata$stdResid), c("ID","stdResid")],3) #Highest values of standardized residuals

##      ID stdResid
## 96   96   -3.491
## 286 286   -3.446
## 70   70   -2.969

head(mydata[order(-mydata$stdResid), c("ID", "stdResid")], 3)  #Lowest values of standardized residuals

##      ID stdResid
## 196 196    3.049
## 153 153    2.986
## 299 299    2.660

The outliers are the ones with IDs 96, 286 and 196.

head(mydata[order(-mydata$CooksD), c("ID", "CooksD")], 8)

##      ID CooksD
## 305 305  0.465
## 331 331  0.188
## 234 234  0.075
## 311 311  0.062
## 286 286  0.034
## 330 330  0.034
## 92   92  0.026
## 329 329  0.026

I will remove the units which cause this gaps (units of high impact), and those are units with IDs 305, 331.

library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(96, 286, 196, 305, 331))

I will fit again since we have less units then before and with the help of scatterplot of standardized fitted values and standardized residuals, I will check the homoskedasticity assumption

fit1 <- lm (piq ~ days + duration + age + viq,
           data = mydata)

mydata$stdFitted <- scale(fit1$fitted.values)

library(car)
scatterplot(stdResid ~ stdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            smooth = FALSE,
            regLine = FALSE,
            boxplots = FALSE,
            data=mydata)

From the scatterplot, I would say that homoskedasticity is violated but let’s check with the Breusch Pagan test.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit1)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##              Data               
##  -------------------------------
##  Response : piq 
##  Variables: fitted values of piq 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.78612 
##  Prob > Chi2   =    0.05167933

H0: The variance is constant - homoskedasticity

H1: The variance is not constant - heteroskedasticity

We cannot reject the null hypothesis (p>5%) and conclude that assumption that variance are constant is met.

Since homoskedasticity is not violated, we do not need to use robust standard errors now.

library(estimatr)

fit1<- lm(piq ~ days + duration + age + viq,
                        data = mydata)

summary(fit1)

## 
## Call:
## lm(formula = piq ~ days + duration + age + viq, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.386  -6.249  -0.600   6.054  32.554 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.7931099  4.1825567   5.210 3.38e-07 ***
## days         0.0018549  0.0006247   2.969  0.00321 ** 
## duration    -0.1533643  0.0274455  -5.588 4.91e-08 ***
## age         -0.0020573  0.0428495  -0.048  0.96174    
## viq          0.7079263  0.0417880  16.941  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.53 on 321 degrees of freedom
## Multiple R-squared:  0.5008, Adjusted R-squared:  0.4946 
## F-statistic: 80.52 on 4 and 321 DF,  p-value: < 2.2e-16

Explanation of the coefficients (+ checking if they are statistically significant)

days:

H0: 𝛽1 = 0

H1: 𝛽1 =/= 0

We reject the null hypothesis at p=0.004. Number of days post coma at which IQs were measured have a statistically significant effect on the performance IQ.

If number of days post coma at which IQs were measured increase by 1 day, the performance IQ after coma will increase on average by 0.002 IQ points (p=0.004), assuming all other explanatory variables remain unchanged.

The coefficient is positive as expected.

duration:

H0: 𝛽2 = 0

H1: 𝛽2 =/= 0

We reject the null hypothesis at p<0.001. Duration of the coma have a statistically significant effect on the performance IQ after coma.

If duration of the coma increases by 1 day, the performance IQ after coma will decrease on average by 0.153 IQ points (p<0.001), assuming all other explanatory variables remain unchanged.

The coefficient is negative as expected.

age:

H0: 𝛽3 = 0

H1: 𝛽3 =/= 0

We cannot reject the null hypothesis, since p-value is bigger than 5%.

We cannot say that the age of a patient who was in coma has an effect on the patient’s performance IQ after coma.

viq:

H0: 𝛽4 = 0

H1: 𝛽4 =/= 0

We reject the null hypothesis at p<0.001. Verbal IQ post coma have a statistically significant effect on the performance IQ after coma.

If the verbal IQ of a patient after coma increases by 1 IQ point, the performance IQ after coma will increase on average by 0.708 IQ points (p<0.001), assuming all other explanatory variables remain unchanged.

The coefficient is positive as expected.

Coefficient of determination

R - squared = 0.5008

50.08% of the variability of performance IQ can be explained by the variability (or the linear effect) of all other explanatory variables (days, duration, age, and viq)

Test of Significance of regression

F-statistics (ANOVA)

H0: 𝜌2 = 0 or 𝛽1 = 𝛽2 = 𝛽3 = 𝛽4 = 0

H1: 𝜌2 > 0 or at least one 𝛽𝑗 is different from 0

We reject the null hypothesis at p<0.001 and conclude that at least some of the explanatory variables impact performance IQ after coma.

Coefficient of multiple correlation

sqrt(summary(fit1)$r.squared)

## [1] 0.7077027

This the square root of the R squared. It tells us the relationship between all the variables in the model.

The relationship between the piq (dependent variable), days, duration, age and viq (explanatory) variables is strong.

Note* - we do not include sign here (positive/negative) since it has many dimensions (relationship between 5 variables)

Which variable has the highest impact?

library(lm.beta)
lm.beta(fit1)

## 
## Call:
## lm(formula = piq ~ days + duration + age + viq, data = mydata)
## 
## Standardized Coefficients::
## (Intercept)        days    duration         age         viq 
##          NA  0.12313715 -0.23324299 -0.00192616  0.67007769

We look for the highest value of standardized coefficient in absolute values. The variable viq (verbal IQ) has the highest impact on the dependent variable (piq-performance IQ)

Adding new explanatory variable

Adding variable sexF (factor) in the model. It is a categorical - dummy variable.

Note* - I will not check the assumptions, adding this variable is just for the purpose of explaining the coefficient of the dummy variable and comparing the two models.

fit2 <- lm(piq ~ days + duration + age + viq + sexF,
                        data = mydata)

summary(fit2)

## 
## Call:
## lm(formula = piq ~ days + duration + age + viq + sexF, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.063  -5.988  -0.944   6.192  31.549 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.3012073  4.2162352   5.052 7.36e-07 ***
## days         0.0018807  0.0006254   3.007  0.00285 ** 
## duration    -0.1510237  0.0275644  -5.479 8.68e-08 ***
## age          0.0003072  0.0429322   0.007  0.99429    
## viq          0.7088216  0.0418070  16.955  < 2e-16 ***
## sexFF        1.3399207  1.4313506   0.936  0.34992    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.53 on 320 degrees of freedom
## Multiple R-squared:  0.5022, Adjusted R-squared:  0.4944 
## F-statistic: 64.57 on 5 and 320 DF,  p-value: < 2.2e-16

Although p-values are changed, the coefficient that were statistically significant remained statistically significant, and the variable age still does not impact the dependent variable.

The signs of the coefficients remain unchanged.

sexF:

H0: 𝛽5 = 0

H1: 𝛽5 =/= 0

We cannot reject the null hypothesis, since p-value is bigger than 5%.

We cannot say that the gender of a patient who was in coma has an effect on the patient’s performance IQ after coma.

However, for the educational purposes, I am going to explain this coefficient.

Given the values of other explanatory variables, the females have on average performance IQ (after coma) by 1.34 IQ points higher compared to the males. (p-value)

Comparing the two models

anova(fit1, fit2) #Comparison of two regression models that differ in the number of explanatory variables

## Analysis of Variance Table
## 
## Model 1: piq ~ days + duration + age + viq
## Model 2: piq ~ days + duration + age + viq + sexF
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    321 35564                           
## 2    320 35466  1    97.125 0.8763 0.3499

H0: Both models are equally good. (*Note: in this case we use the simple one-with less variables)

H1: Complex model is better. (complex meaning more variables)

We cannot reject the null hypothesis, since the p-value is bigger that 5%.

We can conclude that the models are equally good, and because of this we will do the simple model.

Anastasija Šofranac MVA HW3 - Regression Analysis

2024-01-22

Research Question: How the number of days after coma (at which IQs were measured), durartion of the coma, age and verbal IQ affect the performance IQ of a patient after coma?

Additional question: Will adding gender make any statistically significant difference (will it explain the performance IQ better)?

Data

Explanation of data

Source

Data manipulation and Descriptive statistics

Checking for missing values

Converting categorical variable sex into factors

Statistical description

Creating a new variable ID

Defining multiple regression model

Regression Analysis

Explanation of the coefficients (+ checking if they are statistically significant)

Coefficient of determination

Test of Significance of regression

Coefficient of multiple correlation

Which variable has the highest impact?

Adding new explanatory variable

Comparing the two models