Units of observation are patients, sample size is 331.
Variables description with units of measurement:
Main goal of the data analysis: Main goal of the data analysis is to figure out if length of comma have effect on performace IQ.
Source: Library called carData
#data(package = .packages(all.available = TRUE))
# library(boot)
# mydata <- channing
# mydata <- melanoma
# library(carData)
# mydata <- Freedman
mydata <- Wong
mydata <- na.omit(mydata)
mydata <- mydata %>% select(c(-"id"))
mydata <- mydata[c(5,1,2,4,6,3)]
#str(mydata)
head(mydata)
## piq days duration age viq sex
## 1 87 30 4 20.67077 89 Male
## 2 95 16 17 55.28816 77 Male
## 3 95 40 1 55.91513 116 Male
## 4 59 13 10 61.66461 73 Male
## 5 67 19 6 30.12731 73 Male
## 6 76 13 3 57.06229 69 Male
library(pastecs)
round(stat.desc(mydata[ ,-c(6)]), 2)
## piq days duration age viq
## nbr.val 331.00 331.00 331.00 331.00 331.00
## nbr.null 0.00 0.00 46.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 50.00 13.00 0.00 6.51 64.00
## max 133.00 11628.00 255.00 80.03 132.00
## range 83.00 11615.00 255.00 73.52 68.00
## sum 28981.00 159493.00 4732.00 10543.41 31433.00
## median 87.00 150.00 7.00 26.88 94.00
## mean 87.56 481.85 14.30 31.85 94.96
## SE.mean 0.83 62.50 1.43 0.76 0.77
## CI.mean.0.95 1.64 122.95 2.82 1.50 1.52
## var 228.96 1292958.09 678.08 192.32 197.49
## std.dev 15.13 1137.08 26.04 13.87 14.05
## coef.var 0.17 2.36 1.82 0.44 0.15
From the scatterplot we can see people there is obvious correlation between all variables except age and sex. Based on the plot I anticipate problems with heteroskedasticity, but no other problems
library(car)
scatterplotMatrix(mydata,
smooth = FALSE)
From correlation we can see there is strongest correlation between piq and viq, which is normal. Between independend variables there is not too strong correlation.
library(Hmisc)
rcorr(as.matrix(mydata[ , -c(6)]))
## piq days duration age viq
## piq 1.00 0.04 -0.17 0.07 0.65
## days 0.04 1.00 0.22 -0.13 0.00
## duration -0.17 0.22 1.00 -0.17 0.00
## age 0.07 -0.13 -0.17 1.00 0.05
## viq 0.65 0.00 0.00 0.05 1.00
##
## n= 331
##
##
## P
## piq days duration age viq
## piq 0.4950 0.0018 0.2312 0.0000
## days 0.4950 0.0000 0.0152 0.9414
## duration 0.0018 0.0000 0.0022 0.9914
## age 0.2312 0.0152 0.0022 0.3311
## viq 0.0000 0.9414 0.9914 0.3311
I choose this model and this variables for the explanation of the performance IQ, because the theory predict that those all variables that impact the IQ after the coma (days since wakening from coma, length of coma, age, sex and VIQ)
fit1 <- lm(piq ~ days + duration + age + viq+sex ,
data = mydata)
Multicolinearity test:
Mean is 1.04 (<5) which means there is no problems with multicolinearity.
vif(fit1)
## days duration age viq sex
## 1.065519 1.082507 1.046219 1.003372 1.017660
mean(vif(fit1))
## [1] 1.043055
Shapiro-Wilk normality test:
\(H_0:\) errors are normally distributed
\(H_1:\) errors not are normally distributed
From the Shapiro-Wilk normality test we can conclude that errors are not normally distributed (p=0.002), but this is not problem, because we have more than 331 observations.
mydata$StdResid <- round(rstandard(fit1), 3)
mydata$CooksD <- round(cooks.distance(fit1), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.98604, p-value = 0.002751
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata[order(mydata$StdResid),], 6)
## piq days duration age viq sex StdResid CooksD
## 96 50 49 35 18.71595 101 Male -3.457 0.020
## 286 66 986 0 21.47023 116 Male -3.408 0.030
## 70 61 50 7 24.38056 104 Male -2.927 0.010
## 144 51 151 0 20.08761 86 Male -2.764 0.012
## 136 66 96 18 26.56263 105 Male -2.444 0.007
## 118 63 120 10 37.27584 99 Male -2.426 0.005
Based on data table above we remove outliers those close to 3 and \(|>3|\)
mydata <- mydata[-c(96,70,286), ]
head(mydata[order(-mydata$CooksD),], 6)
## piq days duration age viq sex StdResid CooksD
## 305 78 1093 255 16.56126 84 Male 2.471 0.380
## 331 71 11038 0 12.54483 73 Male -1.496 0.155
## 234 67 295 130 28.27379 117 Male -2.147 0.062
## 311 90 1569 180 28.08214 101 Male 1.436 0.051
## 92 76 64 5 76.65982 106 Female -2.041 0.036
## 153 133 146 1 21.68378 111 Female 2.858 0.029
Based on data table above we remove outliers that thave biggest Cooks distances:
mydata <- mydata[-c(331,305,234), ]
We make new fit because we removed some outliers
fit1 <- lm(piq ~ days + duration + age + viq+sex,
data = mydata)
If we look at standardresidual we can see there is no more outliers all residals are \(|<3|\).
Shapiro-Wilk normality test:
\(H_0:\) errors are normally distributed
\(H_1:\) errors not are normally distributed
From the Shapiro-Wilk normality test we cannot reject NULL hypothesis so we cannot state that errors are not normally distributed (p=0.06).
mydata$StdResid <- round(rstandard(fit1), 3)
mydata$CooksD <- round(cooks.distance(fit1), 3) #units of high impact
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.99169, p-value = 0.06411
mydata$StdFittedValues <- scale(fit1$fitted.values) # we use scale cause we standardize
Based on the scatterplot we can conclude that there is no problem with linearity. We can assume that there is no problem with heteroskedasticity, but to be sure, we will make a Breusch Pagan Test for Heteroskedasticity.
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = TRUE,
smooth = FALSE)
Because \(p > 5%\) we cannot reject
null hypothesis, which means that there is no problem with
heteroskedasticity.
library(olsrr)
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -------------------------------
## Response : piq
## Variables: fitted values of piq
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 3.718967
## Prob > Chi2 = 0.05379765
summary(fit1)
##
## Call:
## lm(formula = piq ~ days + duration + age + viq + sex, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.275 -6.162 -1.140 6.493 31.682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.5717488 4.3553868 4.953 1.19e-06 ***
## days 0.0010870 0.0005353 2.031 0.0431 *
## duration -0.1073556 0.0235474 -4.559 7.32e-06 ***
## age -0.0059261 0.0435249 -0.136 0.8918
## viq 0.7254271 0.0425524 17.048 < 2e-16 ***
## sexMale -1.7281687 1.4534042 -1.189 0.2353
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.68 on 320 degrees of freedom
## Multiple R-squared: 0.4958, Adjusted R-squared: 0.4879
## F-statistic: 62.93 on 5 and 320 DF, p-value: < 2.2e-16
sqrt(summary(fit1)$r.squared)
## [1] 0.7041077
For each day since coma, on average total IQ will increase by 0.00108 ceteris paribus (\(p < 0.001\))
For each day in coma, on average total IQ will decrease by 0.107 ceteris paribus (\(p = 4.3\)%)
If viq increases by one, on average total IQ will decrease by 0.725 ceteris paribus (\(p < 0.001\))
Age and sex are not statistically significant, p-values \(> 5\). We cannot confirm there sex or age affect on the IQ after coma.
Because \(R^2= 0.496\), 49,6% of variability of piq is explained by linear effect of days since coma, days in coma, age, viq, and sex.
Multiple correlation coefficient = 0.7041077 - linear relationship between piq and 5 explanatory variables is strong.
F-statistics (tells us how good is regression model):
\(H_0: \rho^2 = 0\) errors are normally distributed
\(H_1: \rho^2 > 0\) errors not are normally distributed
Because \(p<0.001\) we can reject null hypothesis -> There is linear relationship between piq and at least one explanatory variable.