Homework 4

hmdata <- read.csv("~/R Bootcamp/R Take Home Exam/oasis_cross-sectional.csv")

head(hmdata)

##              ID M.F Hand Age Educ SES MMSE CDR eTIV  nWBV   ASF Delay
## 1 OAS1_0001_MR1   F    R  74    2   3   29 0.0 1344 0.743 1.306   N/A
## 2 OAS1_0002_MR1   F    R  55    4   1   29 0.0 1147 0.810 1.531   N/A
## 3 OAS1_0003_MR1   F    R  73    4   3   27 0.5 1454 0.708 1.207   N/A
## 4 OAS1_0004_MR1   M    R  28   NA  NA   NA  NA 1588 0.803 1.105   N/A
## 5 OAS1_0005_MR1   M    R  18   NA  NA   NA  NA 1737 0.848 1.010   N/A
## 6 OAS1_0006_MR1   F    R  24   NA  NA   NA  NA 1131 0.862 1.551   N/A

Cleaning and describing data

The unit of observation is an individual, while the original sample size was 436. Since the targeted population are people above 40 years old, the cleaned data consists of 262 subjects. The data was obtained from OASIS cross-sectional on brain imaging.

Variable description:

ID: Identification of a subject
M.F: Gender (Male/Female)
Hand: Dominant hand (left/right)
Age: Age in years
Educ: Education level ( 1: less than high school grad., 2: high school grad., 3: some college, 4: college grad., 5: beyond college)
SES - Socioeconomic status (1-5)
MMSE - Mini Mental State Examination
CDR - Clinical Dementia Rating (0 = normal, 0.5 = very mild dementia, 1 = mild dementia, 2 = moderate dementia, 3 = severe dementia)
eTIV - Estimated Total Intracranial Volume
nWBV - Normalize Whole Brain Volume
ASF Atlas Scaling Factor
Delay - the number of days between two medical visits.

The purpose of this study is to develop a model that will predict early Alzheimer’s. Since the dependent variable is a dichotomous categorical variable, binary logistic regression will be used to estimate the likelihood of a person to have dementia given the age, education, MMSE and eTIV scores and gender.

As mentioned the targeted age is above 40, which was filtered as well as the irrelevant variables for the research were removed.

hmdata <- hmdata[hmdata$Age >= 40, ]

hmdata <- subset(hmdata, select = -c(Hand, Delay, SES, nWBV, ASF))

colnames(hmdata)[colnames(hmdata)== "M.F"] <- "Gender"

Next the missing values per column/variable need to be determined and imputated based on median. Median was used based on the research of other N/A values for similar data sets on brain function.

colSums(is.na(hmdata))

##     ID Gender    Age   Educ   MMSE    CDR   eTIV 
##      0      0      0     29     29     29      0

hmdata$Educ <- impute(hmdata$Educ, median)

hmdata$SES <- impute(hmdata$SES, median)

hmdata$MMSE <- impute(hmdata$MMSE, median)

hmdata$CDR <- impute(hmdata$CDR, median)

head(hmdata)

##               ID Gender Age Educ MMSE CDR eTIV
## 1  OAS1_0001_MR1      F  74    2   29 0.0 1344
## 2  OAS1_0002_MR1      F  55    4   29 0.0 1147
## 3  OAS1_0003_MR1      F  73    4   27 0.5 1454
## 9  OAS1_0010_MR1      M  74    5   30 0.0 1636
## 10 OAS1_0011_MR1      F  52    3   30 0.0 1321
## 12 OAS1_0013_MR1      F  81    5   30 0.0 1664

colSums(is.na(hmdata))

##     ID Gender    Age   Educ   MMSE    CDR   eTIV 
##      0      0      0      0      0      0      0

hmdata$GenderF <- factor(hmdata$Gender,
                       levels = c("F", "M"),
                       labels = c("Female", "Male"))

hmdata$Group <- as.factor(ifelse(hmdata$CDR > 0.4, "1", "0"))
hmdata$Group <- factor(hmdata$Group,
                       levels = c(0, 1),
                       labels = c("Nondemented", "Demented"))

hmdata$EducF <- factor(hmdata$Educ,
                       levels = c(1,2,3,4,5),
                       labels = c("Less than highschool", "High school grad.", "Some college", "College grad.", "Beyond college"))

Binary logistic regression

summary(hmdata[ , c(-1, -2, -6)]) #Descriptive statistics

## 
##  29 values imputed to 3 
## 
## 
##  29 values imputed to 29

##       Age             Educ            MMSE            eTIV        GenderF   
##  Min.   :40.00   Min.   :1.000   Min.   :14.00   Min.   :1123   Female:174  
##  1st Qu.:60.25   1st Qu.:2.000   1st Qu.:26.00   1st Qu.:1350   Male  : 88  
##  Median :72.00   Median :3.000   Median :29.00   Median :1446               
##  Mean   :69.94   Mean   :3.156   Mean   :27.26   Mean   :1460               
##  3rd Qu.:80.00   3rd Qu.:4.000   3rd Qu.:30.00   3rd Qu.:1554               
##  Max.   :96.00   Max.   :5.000   Max.   :30.00   Max.   :1992               
##          Group                      EducF   
##  Nondemented:162   Less than highschool:23  
##  Demented   :100   High school grad.   :64  
##                    Some college        :75  
##                    College grad.       :49  
##                    Beyond college      :51  
##

fit0 <- glm(Group ~ 1, 
            family = binomial, 
            data = hmdata)

summary(fit0)

## 
## Call:
## glm(formula = Group ~ 1, family = binomial, data = hmdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9806  -0.9806  -0.9806   1.3879   1.3879  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.4824     0.1272  -3.793 0.000149 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.4  on 261  degrees of freedom
## Residual deviance: 348.4  on 261  degrees of freedom
## AIC: 350.4
## 
## Number of Fisher Scoring iterations: 4

b0= -0.48

exp(cbind(odds = fit0$coefficients, confint.default(fit0)))

##                 odds     2.5 %    97.5 %
## (Intercept) 0.617284 0.4811002 0.7920168

head(fitted(fit0))

##         1         2         3         9        10        12 
## 0.3816794 0.3816794 0.3816794 0.3816794 0.3816794 0.3816794

It is more likely that people will not have dementia. 38% will have dementia.

Pseudo_R2_fit1 <- 162/262 #Proportion of correctly classified units
Pseudo_R2_fit1

## [1] 0.6183206

It will correctly place 62% of people just by chance because the probability is 0.38.

fit1 <- glm(Group ~ Age,  #adding age to the model
            family = binomial, 
            data = hmdata)

summary(fit1)

## 
## Call:
## glm(formula = Group ~ Age, family = binomial, data = hmdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6977  -0.9599  -0.4737   1.1219   1.7124  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.80473    0.93242  -6.225  4.8e-10 ***
## Age          0.07421    0.01255   5.911  3.4e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.4  on 261  degrees of freedom
## Residual deviance: 302.6  on 260  degrees of freedom
## AIC: 306.6
## 
## Number of Fisher Scoring iterations: 4

Something happened with the deviance (-2LL) - it decreased and is closer to 0 (should be better when adding more variables).

H0: simple model is better H1: complex model is better x^2=(simple -2LL)-(complex -2LL)

anova(fit0, fit1, test = "Chi") #Comparision of models based on -2LL statistics

## Analysis of Deviance Table
## 
## Model 1: Group ~ 1
## Model 2: Group ~ Age
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       261      348.4                          
## 2       260      302.6  1   45.798 1.311e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject null hypothesis (p<0.001), meaning that complex model is better.

exp(cbind(OR = fit1$coefficients, confint.default(fit1))) #Odds ratio for Y=1 (with 95% CI)

##                      OR        2.5 %     97.5 %
## (Intercept) 0.003013274 0.0004845782 0.01873757
## Age         1.077032079 1.0508546928 1.10386156

If Age is increased by 1 year, the odds for having Alzheimer’s is equal to 1.08 times the initial odds, under the assumptions that the remaining explanatory variables do not change. The older a person is, more likely they are to have Alzheimer’s

Full model

fit2 <- glm(Group ~ Age + EducF + GenderF + eTIV + MMSE, 
            family = binomial, 
            data = hmdata)
vif(fit2)

##             GVIF Df GVIF^(1/(2*Df))
## Age     1.140231  1        1.067816
## EducF   1.188671  4        1.021840
## GenderF 1.576215  1        1.255474
## eTIV    1.601244  1        1.265403
## MMSE    1.103967  1        1.050698

There is no problem with multicollinearity since the results are all close to one - below 2.

Identifying potential outliers

hmdata$StdResid <- rstandard(fit2)
hmdata$Cook <- cooks.distance(fit2)

hist(hmdata$StdResid,
     main = "Histogram of standardized residuals",
     ylab = "Frequency",
     xlab = "Standardized residuals")

head(hmdata[order(-hmdata$Cook), c("ID", "Cook")])

##                ID       Cook
## 61  OAS1_0065_MR1 0.06162871
## 412 OAS1_0453_MR1 0.05530700
## 62  OAS1_0066_MR1 0.04752483
## 321 OAS1_0354_MR1 0.04142318
## 288 OAS1_0317_MR1 0.03464295
## 365 OAS1_0402_MR1 0.03142155

Based on the Cooks D there are no units standing out (no big gap) that would have high impact.

H0: simple model is better H1: complex model is better

anova(fit1, fit2, test = "Chi")

## Analysis of Deviance Table
## 
## Model 1: Group ~ Age
## Model 2: Group ~ Age + EducF + GenderF + eTIV + MMSE
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       260     302.60                          
## 2       253     174.62  7   127.98 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject null hypothesis (p<0.001), and conclude that complex model is better. There are 7 degrees of freedom due to the multiple dummy variables for EducF.

Explanation of results

summary(fit2)

## 
## Call:
## glm(formula = Group ~ Age + EducF + GenderF + eTIV + MMSE, family = binomial, 
##     data = hmdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4904  -0.4948  -0.2766   0.1790   2.4901  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            16.535855   4.430550   3.732  0.00019 ***
## Age                     0.042957   0.018091   2.374  0.01757 *  
## EducFHigh school grad.  0.811602   0.748170   1.085  0.27802    
## EducFSome college      -0.164008   0.792264  -0.207  0.83600    
## EducFCollege grad.      0.519736   0.772678   0.673  0.50117    
## EducFBeyond college     0.073491   0.793440   0.093  0.92620    
## GenderFMale             0.456955   0.491957   0.929  0.35297    
## eTIV                    0.001668   0.001451   1.149  0.25048    
## MMSE                   -0.831398   0.134109  -6.199 5.67e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.40  on 261  degrees of freedom
## Residual deviance: 174.62  on 253  degrees of freedom
## AIC: 192.62
## 
## Number of Fisher Scoring iterations: 6

exp(cbind(OR = fit2$coefficients, confint.default(fit2)))

##                                  OR        2.5 %       97.5 %
## (Intercept)            1.518556e+07 2571.2208036 8.968547e+10
## Age                    1.043893e+00    1.0075273 1.081571e+00
## EducFHigh school grad. 2.251511e+00    0.5195540 9.757029e+00
## EducFSome college      8.487353e-01    0.1796371 4.010039e+00
## EducFCollege grad.     1.681584e+00    0.3698402 7.645802e+00
## EducFBeyond college    1.076259e+00    0.2272687 5.096756e+00
## GenderFMale            1.579258e+00    0.6021413 4.141978e+00
## eTIV                   1.001669e+00    0.9988242 1.004522e+00
## MMSE                   4.354399e-01    0.3347917 5.663459e-01

We can see that only age and MMSE show significance, which is why these are the only two that will be explained: - Age: Given that all values of other variables stay the same, the odds of having alzheimer’s of people that are older is 1.04 higher than the odds of younger people. - MMSE: If MMSE score increases by 1 point, the odds that they have Alzheimer’s equals to 0.44 times the initial odds, under the assumptions that the remaining explanatory variables do not change (p<0.001).

hmdata$EstProb <- fitted(fit2) #Estimates of probabilities for Y=1: P(Y=1)
head(hmdata)

##               ID Gender Age Educ MMSE CDR eTIV GenderF       Group
## 1  OAS1_0001_MR1      F  74    2   29 0.0 1344  Female Nondemented
## 2  OAS1_0002_MR1      F  55    4   29 0.0 1147  Female Nondemented
## 3  OAS1_0003_MR1      F  73    4   27 0.5 1454  Female    Demented
## 9  OAS1_0010_MR1      M  74    5   30 0.0 1636    Male Nondemented
## 10 OAS1_0011_MR1      F  52    3   30 0.0 1321  Female Nondemented
## 12 OAS1_0013_MR1      F  81    5   30 0.0 1664  Female Nondemented
##                EducF   StdResid         Cook    EstProb
## 1  High school grad. -0.6902051 8.028790e-04 0.20700612
## 2      College grad. -0.3517005 1.922871e-04 0.05843509
## 3      College grad.  1.1399531 6.054350e-03 0.54198595
## 9     Beyond college -0.5189092 4.794228e-04 0.12252959
## 10      Some college -0.1788196 9.950257e-06 0.01577414
## 12    Beyond college -0.4951221 5.710055e-04 0.11122837

hmdata$Classification <- ifelse(test = hmdata$EstProb < 0.50, 
                                yes = "NO", 
                                no = "YES")

#If estimated probability is below 0.50, person is classified into a group of people without Alzheimer's

hmdata$ClassificationF <- factor(hmdata$Classification,
                                 levels = c("NO", "YES"), 
                                 labels = c("NO", "YES"))

ClassificationTable <- table(hmdata$Group, hmdata$ClassificationF) 

ClassificationTable

##              
##                NO YES
##   Nondemented 147  15
##   Demented     25  75

Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(hmdata) 
Pseudo_R2_fit2

## [1] 0.8473282

In reality, 15 people didn’t have Alzheimer’s even though they were classified as if they did. Proportion of correctly classified units increased from 0.62 to 0.85.