hmdata <- read.csv("~/R Bootcamp/R Take Home Exam/oasis_cross-sectional.csv")
head(hmdata)
## ID M.F Hand Age Educ SES MMSE CDR eTIV nWBV ASF Delay
## 1 OAS1_0001_MR1 F R 74 2 3 29 0.0 1344 0.743 1.306 N/A
## 2 OAS1_0002_MR1 F R 55 4 1 29 0.0 1147 0.810 1.531 N/A
## 3 OAS1_0003_MR1 F R 73 4 3 27 0.5 1454 0.708 1.207 N/A
## 4 OAS1_0004_MR1 M R 28 NA NA NA NA 1588 0.803 1.105 N/A
## 5 OAS1_0005_MR1 M R 18 NA NA NA NA 1737 0.848 1.010 N/A
## 6 OAS1_0006_MR1 F R 24 NA NA NA NA 1131 0.862 1.551 N/A
The unit of observation is an individual, while the original sample size was 436. Since the targeted population are people above 40 years old, the cleaned data consists of 262 subjects. The data was obtained from OASIS cross-sectional on brain imaging.
Variable description:
The purpose of this study is to develop a model that will predict early Alzheimer’s. Since the dependent variable is a dichotomous categorical variable, binary logistic regression will be used to estimate the likelihood of a person to have dementia given the age, education, MMSE and eTIV scores and gender.
As mentioned the targeted age is above 40, which was filtered as well as the irrelevant variables for the research were removed.
hmdata <- hmdata[hmdata$Age >= 40, ]
hmdata <- subset(hmdata, select = -c(Hand, Delay, SES, nWBV, ASF))
colnames(hmdata)[colnames(hmdata)== "M.F"] <- "Gender"
Next the missing values per column/variable need to be determined and imputated based on median. Median was used based on the research of other N/A values for similar data sets on brain function.
colSums(is.na(hmdata))
## ID Gender Age Educ MMSE CDR eTIV
## 0 0 0 29 29 29 0
hmdata$Educ <- impute(hmdata$Educ, median)
hmdata$SES <- impute(hmdata$SES, median)
hmdata$MMSE <- impute(hmdata$MMSE, median)
hmdata$CDR <- impute(hmdata$CDR, median)
head(hmdata)
## ID Gender Age Educ MMSE CDR eTIV
## 1 OAS1_0001_MR1 F 74 2 29 0.0 1344
## 2 OAS1_0002_MR1 F 55 4 29 0.0 1147
## 3 OAS1_0003_MR1 F 73 4 27 0.5 1454
## 9 OAS1_0010_MR1 M 74 5 30 0.0 1636
## 10 OAS1_0011_MR1 F 52 3 30 0.0 1321
## 12 OAS1_0013_MR1 F 81 5 30 0.0 1664
colSums(is.na(hmdata))
## ID Gender Age Educ MMSE CDR eTIV
## 0 0 0 0 0 0 0
hmdata$GenderF <- factor(hmdata$Gender,
levels = c("F", "M"),
labels = c("Female", "Male"))
hmdata$Group <- as.factor(ifelse(hmdata$CDR > 0.4, "1", "0"))
hmdata$Group <- factor(hmdata$Group,
levels = c(0, 1),
labels = c("Nondemented", "Demented"))
hmdata$EducF <- factor(hmdata$Educ,
levels = c(1,2,3,4,5),
labels = c("Less than highschool", "High school grad.", "Some college", "College grad.", "Beyond college"))
summary(hmdata[ , c(-1, -2, -6)]) #Descriptive statistics
##
## 29 values imputed to 3
##
##
## 29 values imputed to 29
## Age Educ MMSE eTIV GenderF
## Min. :40.00 Min. :1.000 Min. :14.00 Min. :1123 Female:174
## 1st Qu.:60.25 1st Qu.:2.000 1st Qu.:26.00 1st Qu.:1350 Male : 88
## Median :72.00 Median :3.000 Median :29.00 Median :1446
## Mean :69.94 Mean :3.156 Mean :27.26 Mean :1460
## 3rd Qu.:80.00 3rd Qu.:4.000 3rd Qu.:30.00 3rd Qu.:1554
## Max. :96.00 Max. :5.000 Max. :30.00 Max. :1992
## Group EducF
## Nondemented:162 Less than highschool:23
## Demented :100 High school grad. :64
## Some college :75
## College grad. :49
## Beyond college :51
##
fit0 <- glm(Group ~ 1,
family = binomial,
data = hmdata)
summary(fit0)
##
## Call:
## glm(formula = Group ~ 1, family = binomial, data = hmdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9806 -0.9806 -0.9806 1.3879 1.3879
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4824 0.1272 -3.793 0.000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 348.4 on 261 degrees of freedom
## Residual deviance: 348.4 on 261 degrees of freedom
## AIC: 350.4
##
## Number of Fisher Scoring iterations: 4
b0= -0.48
exp(cbind(odds = fit0$coefficients, confint.default(fit0)))
## odds 2.5 % 97.5 %
## (Intercept) 0.617284 0.4811002 0.7920168
head(fitted(fit0))
## 1 2 3 9 10 12
## 0.3816794 0.3816794 0.3816794 0.3816794 0.3816794 0.3816794
It is more likely that people will not have dementia. 38% will have dementia.
Pseudo_R2_fit1 <- 162/262 #Proportion of correctly classified units
Pseudo_R2_fit1
## [1] 0.6183206
It will correctly place 62% of people just by chance because the probability is 0.38.
fit1 <- glm(Group ~ Age, #adding age to the model
family = binomial,
data = hmdata)
summary(fit1)
##
## Call:
## glm(formula = Group ~ Age, family = binomial, data = hmdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6977 -0.9599 -0.4737 1.1219 1.7124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.80473 0.93242 -6.225 4.8e-10 ***
## Age 0.07421 0.01255 5.911 3.4e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 348.4 on 261 degrees of freedom
## Residual deviance: 302.6 on 260 degrees of freedom
## AIC: 306.6
##
## Number of Fisher Scoring iterations: 4
Something happened with the deviance (-2LL) - it decreased and is closer to 0 (should be better when adding more variables).
H0: simple model is better H1: complex model is better x^2=(simple -2LL)-(complex -2LL)
anova(fit0, fit1, test = "Chi") #Comparision of models based on -2LL statistics
## Analysis of Deviance Table
##
## Model 1: Group ~ 1
## Model 2: Group ~ Age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 261 348.4
## 2 260 302.6 1 45.798 1.311e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject null hypothesis (p<0.001), meaning that complex model is better.
exp(cbind(OR = fit1$coefficients, confint.default(fit1))) #Odds ratio for Y=1 (with 95% CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.003013274 0.0004845782 0.01873757
## Age 1.077032079 1.0508546928 1.10386156
If Age is increased by 1 year, the odds for having Alzheimer’s is equal to 1.08 times the initial odds, under the assumptions that the remaining explanatory variables do not change. The older a person is, more likely they are to have Alzheimer’s
fit2 <- glm(Group ~ Age + EducF + GenderF + eTIV + MMSE,
family = binomial,
data = hmdata)
vif(fit2)
## GVIF Df GVIF^(1/(2*Df))
## Age 1.140231 1 1.067816
## EducF 1.188671 4 1.021840
## GenderF 1.576215 1 1.255474
## eTIV 1.601244 1 1.265403
## MMSE 1.103967 1 1.050698
There is no problem with multicollinearity since the results are all close to one - below 2.
hmdata$StdResid <- rstandard(fit2)
hmdata$Cook <- cooks.distance(fit2)
hist(hmdata$StdResid,
main = "Histogram of standardized residuals",
ylab = "Frequency",
xlab = "Standardized residuals")
head(hmdata[order(-hmdata$Cook), c("ID", "Cook")])
## ID Cook
## 61 OAS1_0065_MR1 0.06162871
## 412 OAS1_0453_MR1 0.05530700
## 62 OAS1_0066_MR1 0.04752483
## 321 OAS1_0354_MR1 0.04142318
## 288 OAS1_0317_MR1 0.03464295
## 365 OAS1_0402_MR1 0.03142155
Based on the Cooks D there are no units standing out (no big gap) that would have high impact.
H0: simple model is better H1: complex model is better
anova(fit1, fit2, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: Group ~ Age
## Model 2: Group ~ Age + EducF + GenderF + eTIV + MMSE
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 260 302.60
## 2 253 174.62 7 127.98 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can reject null hypothesis (p<0.001), and conclude that complex model is better. There are 7 degrees of freedom due to the multiple dummy variables for EducF.
summary(fit2)
##
## Call:
## glm(formula = Group ~ Age + EducF + GenderF + eTIV + MMSE, family = binomial,
## data = hmdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4904 -0.4948 -0.2766 0.1790 2.4901
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.535855 4.430550 3.732 0.00019 ***
## Age 0.042957 0.018091 2.374 0.01757 *
## EducFHigh school grad. 0.811602 0.748170 1.085 0.27802
## EducFSome college -0.164008 0.792264 -0.207 0.83600
## EducFCollege grad. 0.519736 0.772678 0.673 0.50117
## EducFBeyond college 0.073491 0.793440 0.093 0.92620
## GenderFMale 0.456955 0.491957 0.929 0.35297
## eTIV 0.001668 0.001451 1.149 0.25048
## MMSE -0.831398 0.134109 -6.199 5.67e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 348.40 on 261 degrees of freedom
## Residual deviance: 174.62 on 253 degrees of freedom
## AIC: 192.62
##
## Number of Fisher Scoring iterations: 6
exp(cbind(OR = fit2$coefficients, confint.default(fit2)))
## OR 2.5 % 97.5 %
## (Intercept) 1.518556e+07 2571.2208036 8.968547e+10
## Age 1.043893e+00 1.0075273 1.081571e+00
## EducFHigh school grad. 2.251511e+00 0.5195540 9.757029e+00
## EducFSome college 8.487353e-01 0.1796371 4.010039e+00
## EducFCollege grad. 1.681584e+00 0.3698402 7.645802e+00
## EducFBeyond college 1.076259e+00 0.2272687 5.096756e+00
## GenderFMale 1.579258e+00 0.6021413 4.141978e+00
## eTIV 1.001669e+00 0.9988242 1.004522e+00
## MMSE 4.354399e-01 0.3347917 5.663459e-01
We can see that only age and MMSE show significance, which is why these are the only two that will be explained: - Age: Given that all values of other variables stay the same, the odds of having alzheimer’s of people that are older is 1.04 higher than the odds of younger people. - MMSE: If MMSE score increases by 1 point, the odds that they have Alzheimer’s equals to 0.44 times the initial odds, under the assumptions that the remaining explanatory variables do not change (p<0.001).
hmdata$EstProb <- fitted(fit2) #Estimates of probabilities for Y=1: P(Y=1)
head(hmdata)
## ID Gender Age Educ MMSE CDR eTIV GenderF Group
## 1 OAS1_0001_MR1 F 74 2 29 0.0 1344 Female Nondemented
## 2 OAS1_0002_MR1 F 55 4 29 0.0 1147 Female Nondemented
## 3 OAS1_0003_MR1 F 73 4 27 0.5 1454 Female Demented
## 9 OAS1_0010_MR1 M 74 5 30 0.0 1636 Male Nondemented
## 10 OAS1_0011_MR1 F 52 3 30 0.0 1321 Female Nondemented
## 12 OAS1_0013_MR1 F 81 5 30 0.0 1664 Female Nondemented
## EducF StdResid Cook EstProb
## 1 High school grad. -0.6902051 8.028790e-04 0.20700612
## 2 College grad. -0.3517005 1.922871e-04 0.05843509
## 3 College grad. 1.1399531 6.054350e-03 0.54198595
## 9 Beyond college -0.5189092 4.794228e-04 0.12252959
## 10 Some college -0.1788196 9.950257e-06 0.01577414
## 12 Beyond college -0.4951221 5.710055e-04 0.11122837
hmdata$Classification <- ifelse(test = hmdata$EstProb < 0.50,
yes = "NO",
no = "YES")
#If estimated probability is below 0.50, person is classified into a group of people without Alzheimer's
hmdata$ClassificationF <- factor(hmdata$Classification,
levels = c("NO", "YES"),
labels = c("NO", "YES"))
ClassificationTable <- table(hmdata$Group, hmdata$ClassificationF)
ClassificationTable
##
## NO YES
## Nondemented 147 15
## Demented 25 75
Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(hmdata)
Pseudo_R2_fit2
## [1] 0.8473282
In reality, 15 people didn’t have Alzheimer’s even though they were classified as if they did. Proportion of correctly classified units increased from 0.62 to 0.85.