3. Homework

R Markdown

Importnig data to R and showing it with head() function

Units of observation are patients, sample size is 331.

Variables description with units of measurement:

duration: duration of the coma (in days).
sex: a factor with levels Female and Male.
age: age at the time of injury (in years)
viq:verbal IQ.
piq: performance (i.e., mathematical) IQ.

Main goal of the data analysis: Main goal of the data analysis is to figure out if length of comma have effect on performace IQ.

Source: Library called carData

#data(package = .packages(all.available = TRUE))
# library(boot)
# mydata <- channing
# mydata <- melanoma
# library(carData)
# mydata <- Freedman                      
mydata <- Wong                          


mydata <- na.omit(mydata)        
mydata <- mydata %>%  select(c(-"id"))  
mydata <- mydata[c(5,1,2,4,6,3)]
#str(mydata)
head(mydata)

##   piq days duration      age viq  sex
## 1  87   30        4 20.67077  89 Male
## 2  95   16       17 55.28816  77 Male
## 3  95   40        1 55.91513 116 Male
## 4  59   13       10 61.66461  73 Male
## 5  67   19        6 30.12731  73 Male
## 6  76   13        3 57.06229  69 Male

library(pastecs)
round(stat.desc(mydata[ ,-c(6)]), 2)

##                   piq       days duration      age      viq
## nbr.val        331.00     331.00   331.00   331.00   331.00
## nbr.null         0.00       0.00    46.00     0.00     0.00
## nbr.na           0.00       0.00     0.00     0.00     0.00
## min             50.00      13.00     0.00     6.51    64.00
## max            133.00   11628.00   255.00    80.03   132.00
## range           83.00   11615.00   255.00    73.52    68.00
## sum          28981.00  159493.00  4732.00 10543.41 31433.00
## median          87.00     150.00     7.00    26.88    94.00
## mean            87.56     481.85    14.30    31.85    94.96
## SE.mean          0.83      62.50     1.43     0.76     0.77
## CI.mean.0.95     1.64     122.95     2.82     1.50     1.52
## var            228.96 1292958.09   678.08   192.32   197.49
## std.dev         15.13    1137.08    26.04    13.87    14.05
## coef.var         0.17       2.36     1.82     0.44     0.15

Scatterplot

From the scatterplot we can see people there is obvious correlation between all variables except age and sex. Based on the plot I anticipate problems with heteroskedasticity, but no other problems

library(car)
scatterplotMatrix(mydata, 
                  smooth = FALSE)

From correlation we can see there is strongest correlation between piq and viq, which is normal. Between independend variables there is not too strong correlation.

library(Hmisc)
rcorr(as.matrix(mydata[ , -c(6)]))

##            piq  days duration   age  viq
## piq       1.00  0.04    -0.17  0.07 0.65
## days      0.04  1.00     0.22 -0.13 0.00
## duration -0.17  0.22     1.00 -0.17 0.00
## age       0.07 -0.13    -0.17  1.00 0.05
## viq       0.65  0.00     0.00  0.05 1.00
## 
## n= 331 
## 
## 
## P
##          piq    days   duration age    viq   
## piq             0.4950 0.0018   0.2312 0.0000
## days     0.4950        0.0000   0.0152 0.9414
## duration 0.0018 0.0000          0.0022 0.9914
## age      0.2312 0.0152 0.0022          0.3311
## viq      0.0000 0.9414 0.9914   0.3311

First regression model

I choose this model and this variables for the explanation of the performance IQ, because the theory predict that those all variables that impact the IQ after the coma (days since wakening from coma, length of coma, age, sex and VIQ)

fit1 <- lm(piq ~ days + duration + age + viq+sex ,
           data = mydata)

Multicolinearity test:

Mean is 1.04 (<5) which means there is no problems with multicolinearity.

vif(fit1)

##     days duration      age      viq      sex 
## 1.065519 1.082507 1.046219 1.003372 1.017660

mean(vif(fit1))

## [1] 1.043055

Normality test and removing outliers

Shapiro-Wilk normality test:

\(H_0:\) errors are normally distributed

\(H_1:\) errors not are normally distributed

From the Shapiro-Wilk normality test we can conclude that errors are not normally distributed (p=0.002), but this is not problem, because we have more than 331 observations.

mydata$StdResid <- round(rstandard(fit1), 3) 
mydata$CooksD <- round(cooks.distance(fit1), 3) 

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.98604, p-value = 0.002751

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

head(mydata[order(mydata$StdResid),], 6)

##     piq days duration      age viq  sex StdResid CooksD
## 96   50   49       35 18.71595 101 Male   -3.457  0.020
## 286  66  986        0 21.47023 116 Male   -3.408  0.030
## 70   61   50        7 24.38056 104 Male   -2.927  0.010
## 144  51  151        0 20.08761  86 Male   -2.764  0.012
## 136  66   96       18 26.56263 105 Male   -2.444  0.007
## 118  63  120       10 37.27584  99 Male   -2.426  0.005

Based on data table above we remove outliers those close to 3 and \(|>3|\)

mydata <- mydata[-c(96,70,286), ]

head(mydata[order(-mydata$CooksD),], 6)

##     piq  days duration      age viq    sex StdResid CooksD
## 305  78  1093      255 16.56126  84   Male    2.471  0.380
## 331  71 11038        0 12.54483  73   Male   -1.496  0.155
## 234  67   295      130 28.27379 117   Male   -2.147  0.062
## 311  90  1569      180 28.08214 101   Male    1.436  0.051
## 92   76    64        5 76.65982 106 Female   -2.041  0.036
## 153 133   146        1 21.68378 111 Female    2.858  0.029

Based on data table above we remove outliers that thave biggest Cooks distances:

mydata <- mydata[-c(331,305,234), ]

New fit

We make new fit because we removed some outliers

fit1 <- lm(piq ~ days + duration + age + viq+sex,
           data = mydata)

If we look at standardresidual we can see there is no more outliers all residals are \(|<3|\).

Shapiro-Wilk normality test:

\(H_0:\) errors are normally distributed

\(H_1:\) errors not are normally distributed

From the Shapiro-Wilk normality test we cannot reject NULL hypothesis so we cannot state that errors are not normally distributed (p=0.06).

mydata$StdResid <- round(rstandard(fit1), 3) 
mydata$CooksD <- round(cooks.distance(fit1), 3)  #units of high impact

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.99169, p-value = 0.06411

Normality and heteroskedasticity check

mydata$StdFittedValues <- scale(fit1$fitted.values) # we use scale cause we standardize

Based on the scatterplot we can conclude that there is no problem with linearity. We can assume that there is no problem with heteroskedasticity, but to be sure, we will make a Breusch Pagan Test for Heteroskedasticity.

library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = TRUE,
            smooth = FALSE)

Because \(p > 5%\) we cannot reject null hypothesis, which means that there is no problem with heteroskedasticity.

library(olsrr)
ols_test_breusch_pagan(fit1)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##              Data               
##  -------------------------------
##  Response : piq 
##  Variables: fitted values of piq 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    3.718967 
##  Prob > Chi2   =    0.05379765

summary(fit1)

## 
## Call:
## lm(formula = piq ~ days + duration + age + viq + sex, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.275  -6.162  -1.140   6.493  31.682 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.5717488  4.3553868   4.953 1.19e-06 ***
## days         0.0010870  0.0005353   2.031   0.0431 *  
## duration    -0.1073556  0.0235474  -4.559 7.32e-06 ***
## age         -0.0059261  0.0435249  -0.136   0.8918    
## viq          0.7254271  0.0425524  17.048  < 2e-16 ***
## sexMale     -1.7281687  1.4534042  -1.189   0.2353    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.68 on 320 degrees of freedom
## Multiple R-squared:  0.4958, Adjusted R-squared:  0.4879 
## F-statistic: 62.93 on 5 and 320 DF,  p-value: < 2.2e-16

sqrt(summary(fit1)$r.squared)

## [1] 0.7041077

For each day since coma, on average total IQ will increase by 0.00108 ceteris paribus (\(p < 0.001\))

For each day in coma, on average total IQ will decrease by 0.107 ceteris paribus (\(p = 4.3\)%)

If viq increases by one, on average total IQ will decrease by 0.725 ceteris paribus (\(p < 0.001\))

Age and sex are not statistically significant, p-values \(> 5\). We cannot confirm there sex or age affect on the IQ after coma.

Because \(R^2= 0.496\), 49,6% of variability of piq is explained by linear effect of days since coma, days in coma, age, viq, and sex.

Multiple correlation coefficient = 0.7041077 - linear relationship between piq and 5 explanatory variables is strong.

F-statistics (tells us how good is regression model):

\(H_0: \rho^2 = 0\) errors are normally distributed

\(H_1: \rho^2 > 0\) errors not are normally distributed

Because \(p<0.001\) we can reject null hypothesis -> There is linear relationship between piq and at least one explanatory variable.