Introduction

This dataset was created for educational purposes. This dataset contains health information for 2,149 Alzheimer patients, with a total of 35 variables.

rabie_el_kharoua_2024, title={Alzheimer’s Disease Dataset}, url={https://www.kaggle.com/dsv/8668279}, DOI={10.34740/KAGGLE/DSV/8668279}, publisher={Kaggle}, author={Rabie El Kharoua}, year={2024}

Preparation

Load the necessary packages and dataset.

dim(data)

## [1] 2149   35

Check for NA values.

sum(is.na(data))

## [1] 0

Check the dataset

summary(data)

##    PatientID         Age            Gender         Ethnicity     
##  Min.   :4751   Min.   :60.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:5288   1st Qu.:67.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :5825   Median :75.00   Median :1.0000   Median :0.0000  
##  Mean   :5825   Mean   :74.91   Mean   :0.5063   Mean   :0.6975  
##  3rd Qu.:6362   3rd Qu.:83.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :6899   Max.   :90.00   Max.   :1.0000   Max.   :3.0000  
##  EducationLevel       BMI           Smoking       AlcoholConsumption 
##  Min.   :0.000   Min.   :15.01   Min.   :0.0000   Min.   : 0.002003  
##  1st Qu.:1.000   1st Qu.:21.61   1st Qu.:0.0000   1st Qu.: 5.139810  
##  Median :1.000   Median :27.82   Median :0.0000   Median : 9.934412  
##  Mean   :1.287   Mean   :27.66   Mean   :0.2885   Mean   :10.039442  
##  3rd Qu.:2.000   3rd Qu.:33.87   3rd Qu.:1.0000   3rd Qu.:15.157931  
##  Max.   :3.000   Max.   :39.99   Max.   :1.0000   Max.   :19.989293  
##  PhysicalActivity    DietQuality        SleepQuality    FamilyHistoryAlzheimers
##  Min.   :0.003616   Min.   :0.009385   Min.   : 4.003   Min.   :0.0000         
##  1st Qu.:2.570626   1st Qu.:2.458455   1st Qu.: 5.483   1st Qu.:0.0000         
##  Median :4.766424   Median :5.076087   Median : 7.116   Median :0.0000         
##  Mean   :4.920202   Mean   :4.993138   Mean   : 7.051   Mean   :0.2522         
##  3rd Qu.:7.427899   3rd Qu.:7.558625   3rd Qu.: 8.563   3rd Qu.:1.0000         
##  Max.   :9.987429   Max.   :9.998346   Max.   :10.000   Max.   :1.0000         
##  CardiovascularDisease    Diabetes        Depression       HeadInjury    
##  Min.   :0.0000        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000        Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.1443        Mean   :0.1508   Mean   :0.2006   Mean   :0.0926  
##  3rd Qu.:0.0000        3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000        Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##   Hypertension      SystolicBP     DiastolicBP     CholesterolTotal
##  Min.   :0.0000   Min.   : 90.0   Min.   : 60.00   Min.   :150.1   
##  1st Qu.:0.0000   1st Qu.:112.0   1st Qu.: 74.00   1st Qu.:190.3   
##  Median :0.0000   Median :134.0   Median : 91.00   Median :225.1   
##  Mean   :0.1489   Mean   :134.3   Mean   : 89.85   Mean   :225.2   
##  3rd Qu.:0.0000   3rd Qu.:157.0   3rd Qu.:105.00   3rd Qu.:262.0   
##  Max.   :1.0000   Max.   :179.0   Max.   :119.00   Max.   :300.0   
##  CholesterolLDL   CholesterolHDL  CholesterolTriglycerides      MMSE          
##  Min.   : 50.23   Min.   :20.00   Min.   : 50.41           Min.   : 0.005312  
##  1st Qu.: 87.20   1st Qu.:39.10   1st Qu.:137.58           1st Qu.: 7.167602  
##  Median :123.34   Median :59.77   Median :230.30           Median :14.441660  
##  Mean   :124.34   Mean   :59.46   Mean   :228.28           Mean   :14.755132  
##  3rd Qu.:161.73   3rd Qu.:78.94   3rd Qu.:314.84           3rd Qu.:22.161028  
##  Max.   :199.97   Max.   :99.98   Max.   :399.94           Max.   :29.991381  
##  FunctionalAssessment MemoryComplaints BehavioralProblems      ADL           
##  Min.   :0.00046      Min.   :0.000    Min.   :0.0000     Min.   : 0.001288  
##  1st Qu.:2.56628      1st Qu.:0.000    1st Qu.:0.0000     1st Qu.: 2.342836  
##  Median :5.09444      Median :0.000    Median :0.0000     Median : 5.038973  
##  Mean   :5.08005      Mean   :0.208    Mean   :0.1568     Mean   : 4.982958  
##  3rd Qu.:7.54698      3rd Qu.:0.000    3rd Qu.:0.0000     3rd Qu.: 7.581490  
##  Max.   :9.99647      Max.   :1.000    Max.   :1.0000     Max.   : 9.999747  
##    Confusion      Disorientation   PersonalityChanges DifficultyCompletingTasks
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000     Min.   :0.0000           
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000     1st Qu.:0.0000           
##  Median :0.0000   Median :0.0000   Median :0.0000     Median :0.0000           
##  Mean   :0.2052   Mean   :0.1582   Mean   :0.1508     Mean   :0.1587           
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000     3rd Qu.:0.0000           
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000     Max.   :1.0000           
##  Forgetfulness      Diagnosis      DoctorInCharge    
##  Min.   :0.0000   Min.   :0.0000   Length:2149       
##  1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :0.3015   Mean   :0.3537                     
##  3rd Qu.:1.0000   3rd Qu.:1.0000                     
##  Max.   :1.0000   Max.   :1.0000

head(data, 5)

##   PatientID Age Gender Ethnicity EducationLevel      BMI Smoking
## 1      4751  73      0         0              2 22.92775       0
## 2      4752  89      0         0              0 26.82768       0
## 3      4753  73      0         3              1 17.79588       0
## 4      4754  74      1         0              1 33.80082       1
## 5      4755  89      0         0              0 20.71697       0
##   AlcoholConsumption PhysicalActivity DietQuality SleepQuality
## 1          13.297218         6.327112   1.3472143     9.025679
## 2           4.542524         7.619885   0.5187671     7.151293
## 3          19.555085         7.844988   1.8263347     9.673574
## 4          12.209266         8.428001   7.4356041     8.392554
## 5          18.454356         6.310461   0.7954975     5.597238
##   FamilyHistoryAlzheimers CardiovascularDisease Diabetes Depression HeadInjury
## 1                       0                     0        1          1          0
## 2                       0                     0        0          0          0
## 3                       1                     0        0          0          0
## 4                       0                     0        0          0          0
## 5                       0                     0        0          0          0
##   Hypertension SystolicBP DiastolicBP CholesterolTotal CholesterolLDL
## 1            0        142          72         242.3668       56.15090
## 2            0        115          64         231.1626      193.40800
## 3            0         99         116         284.1819      153.32276
## 4            0        118         115         159.5822       65.36664
## 5            0         94         117         237.6022       92.86970
##   CholesterolHDL CholesterolTriglycerides      MMSE FunctionalAssessment
## 1       33.68256                162.18914 21.463532             6.518877
## 2       79.02848                294.63091 20.613267             7.118696
## 3       69.77229                 83.63832  7.356249             5.895077
## 4       68.45749                277.57736 13.991127             8.965106
## 5       56.87430                291.19878 13.517609             6.045039
##   MemoryComplaints BehavioralProblems        ADL Confusion Disorientation
## 1                0                  0 1.72588346         0              0
## 2                0                  0 2.59242413         0              0
## 3                0                  0 7.11954774         0              1
## 4                0                  1 6.48122586         0              0
## 5                0                  0 0.01469122         0              0
##   PersonalityChanges DifficultyCompletingTasks Forgetfulness Diagnosis
## 1                  0                         1             0         0
## 2                  0                         0             1         0
## 3                  0                         1             0         0
## 4                  0                         0             0         0
## 5                  1                         1             0         0
##   DoctorInCharge
## 1      XXXConfid
## 2      XXXConfid
## 3      XXXConfid
## 4      XXXConfid
## 5      XXXConfid

Drop the PatientID and DoctorInCharge variables

data <- data[,-c(1,35)]

EDA

We learn about the distribution of the observations on several variables.

# Diagnosis distribution, how many patients are diagnoses with Alzheimer
data %>% ggplot(aes(x = as.factor(Diagnosis),
                    fill = as.factor(Diagnosis))) + 
  geom_bar() + xlab("Diagnosis (1 for Postive, 0 for Negative)") +
  labs(fill = "Diagnosis")

data %>% ggplot(aes(x = as.factor(Gender),
                    fill = as.factor(Gender))) + 
  geom_bar() + xlab("Genders (1 for Female, 0 for Male)") +
  labs(fill = "Gender")

data %>% ggplot(aes(x = as.factor(EducationLevel),
                    fill = as.factor(EducationLevel))) + 
  geom_bar() + xlab("Education Level (0 = None, 1 = High School, 2 = Bachelor's, 3 = Higher)") +
  labs(fill = "Education Level")

data %>% ggplot(aes(x = as.factor(Ethnicity),
                    fill = as.factor(Ethnicity))) +
  geom_bar() + 
  xlab("Ethnicity (0 = Caucasian, 1 = African American, 2 = Asian, 3 = Other)") + 
  labs(fill = "Ethnicity")

Machine Learning Models

Variables selection

We check the correlation of other variables with Diagnosis

cor(data[-33], data$Diagnosis)

##                                    [,1]
## Age                       -0.0054883776
## Gender                    -0.0209747103
## Ethnicity                 -0.0147822954
## EducationLevel            -0.0439658496
## BMI                        0.0263428130
## Smoking                   -0.0048652352
## AlcoholConsumption        -0.0076179684
## PhysicalActivity           0.0059450425
## DietQuality                0.0085057713
## SleepQuality              -0.0565480860
## FamilyHistoryAlzheimers   -0.0328997475
## CardiovascularDisease      0.0314902464
## Diabetes                  -0.0315076042
## Depression                -0.0058929070
## HeadInjury                -0.0214114257
## Hypertension               0.0350800344
## SystolicBP                -0.0156152313
## DiastolicBP                0.0052926798
## CholesterolTotal           0.0063944630
## CholesterolLDL            -0.0319758259
## CholesterolHDL             0.0425840413
## CholesterolTriglycerides   0.0226718770
## MMSE                      -0.2371256071
## FunctionalAssessment      -0.3648983066
## MemoryComplaints           0.3067423937
## BehavioralProblems         0.2243504008
## ADL                       -0.3323459163
## Confusion                 -0.0191857197
## Disorientation            -0.0246481569
## PersonalityChanges        -0.0206274625
## DifficultyCompletingTasks  0.0090685818
## Forgetfulness             -0.0003542898

For these models, we end up choosing variables with moderate correlation with Diagnosis, which are MMSE, Functional, Memory, Behavioral and ADL (23, 24, 25, 26, 27).

ml_data <- data[c(23:27, 33)]

We check the correlation between the independent variables, to make sure that there is no collinearity.

cor(ml_data[-6])

##                             MMSE FunctionalAssessment MemoryComplaints
## MMSE                 1.000000000           0.02493215      0.007651651
## FunctionalAssessment 0.024932154           1.00000000      0.002320470
## MemoryComplaints     0.007651651           0.00232047      1.000000000
## BehavioralProblems   0.025408172          -0.02194094     -0.009765303
## ADL                  0.003358908           0.05390426     -0.037510563
##                      BehavioralProblems          ADL
## MMSE                        0.025408172  0.003358908
## FunctionalAssessment       -0.021940939  0.053904259
## MemoryComplaints           -0.009765303 -0.037510563
## BehavioralProblems          1.000000000  0.043375578
## ADL                         0.043375578  1.000000000

Conclusion: There seems to be no correlation between the independent variables, we will include all of them in or models.

Model Selection

We will select a model by using a simple method. We will split the data into training and testing set. The training dataset will contain 80% of the original data, and the testing set will contain the remaining.

set.seed(1)
train <- sample(nrow(ml_data), floor(0.8*nrow(data)))

Logistic Regression

glm.fit <- glm(Diagnosis ~ ., data = ml_data[train,], family = binomial)

The error rate of the models will be calculated by the correct prediction made on the testing set.

glm.er <- function(fit, test) {
  glm.pred <- rep(0, nrow(test))
  glm.prob <- predict(fit, test, type = "response")
  glm.pred[glm.prob > 0.5] <- 1
  er <- mean(glm.pred == test$Diagnosis)
  return(er)
} # This function will find the rate of correct prediction

round(glm.er(glm.fit, ml_data[-train,]),3)*100

## [1] 83.5

The linear logistic model have at 83.5% rate of correct prediction, we will use this as a baseline. We will now try to include polynomials and interactions in the logistic model. It turns out that including squared MMSE will increase the prediction rate.

glm.fit.i <- glm(Diagnosis ~ poly(MMSE,2) + FunctionalAssessment + MemoryComplaints +
                   BehavioralProblems + ADL, 
                 data = ml_data[train,],
                 family = binomial)
round(glm.er(glm.fit.i, ml_data[-train,]),3)*100

## [1] 85.8

The prediction rate is 85.8% by including both MMSE and MMSE^2 in the model. In all the models below, we will also include these variables to compare the prediction rate.

LDA vs QDA

lda.fit <- lda(Diagnosis ~ poly(MMSE,2) + FunctionalAssessment + MemoryComplaints +
                   BehavioralProblems + ADL, 
                 data = ml_data[train,])

We calculate the prediction rate similarly to logistic model, with slight modification.

lda.prob <- predict(lda.fit, ml_data[-train,])
round(mean(lda.prob$class == ml_data[-train,]$Diagnosis),3)*100

## [1] 85.8

# Unlike logistic, LDA has classification function

A similar prediction rate of 85.8%, which makes sense for logistic and LDA models to give to similar prediction rate. Now we test QDA model.

qda.fit <- qda(Diagnosis ~ MMSE + FunctionalAssessment + MemoryComplaints +
                 BehavioralProblems + ADL, 
               data = ml_data[train,])
qda.prob <- predict(qda.fit, data[-train,])
round(mean(qda.prob$class == ml_data[-train,]$Diagnosis),3)*100

## [1] 80.2

QDA gives worse rate than LDA and logistic, which means that the true regression model is linear.

KNN

K-nearest neighbors method is used for classification. First, we will prepare the inputs for the knn function.

knn.train <- ml_data[train,][-6]
knn.test <- ml_data[-train,][-6]
knn.class <- ml_data[train,][,6]

The KNN function with K = 10 can be written as above, the prediction rate is calculated similarly as above.

set.seed(1) # In case there are multiple clusters with the same distances.
knn.pred <- knn(knn.train, knn.test, knn.class, k = 10)
round(mean(knn.pred == ml_data[-train,]$Diagnosis),3)*100

## [1] 80.7

The KNN function will be tested with multiple values of k.

k_values = c(1,5,10,20,50,100)
prediction_rate = rep(0, length(k_values))
for (i in 1:length(k_values)) {
  set.seed(1)
  knn.pred <- knn(knn.train, knn.test, knn.class, k = k_values[i])
  prediction_rate[i] <- round(mean(knn.pred == ml_data[-train,]$Diagnosis),3)*100
}

# Transformation
knn.table <- as.data.frame(cbind(k_values, prediction_rate))
names(knn.table) <- c("k", "Prediction Rate")
knn.table

##     k Prediction Rate
## 1   1            82.1
## 2   5            80.9
## 3  10            80.7
## 4  20            80.2
## 5  50            81.4
## 6 100            78.6

As we can see above, the prediction rate of KNN for this dataset is lower than linear logistic and LDA model.

Conclusion

Linear logistic and LDA model give the highest prediction rate. We will use the linear logistic model moving forward.

Cross Validation vs Bootstrap

We have selected the best model above (linear logistic model) by splitting the original dataset into training (80%) and testing set (20%). Now we will use cross validation and bootstrap as the methods to validate the prediction rate. Note that this cross validation method only works on logistic model.

LOOCV

cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5) # For classification models
loocv.err <- cv.glm(ml_data[train,], glm.fit, cost = cost)
round(1 - loocv.err$delta[1],3)*100

## [1] 85.1

The prediction rate when tested with LOOCV is 85.1%.

k-CV

k.cv.err <- cv.glm(ml_data[train,], glm.fit, cost = cost, K = 10) # 10 k-CV
round(1 - k.cv.err$delta[1],3)*100

## [1] 85

The prediction rate when tested with 10 k-CV is 85%.

Boostrap

rsq <- function(formula, data, indices) {
  d <- data[indices,]
  fit <- glm(formula, data = d, family = binomial)
  pred <- rep(0, nrow(d))
  prob <- predict(fit)
  pred[prob > 0.5] <- 1
  return(round(mean(d$Diagnosis == pred),3)*100)
} # Define a function to calculate the prediction rate

# Repeat bootstrap 1000 times
boot(data = ml_data, statistic = rsq, R = 1000, formula = Diagnosis ~ poly(MMSE,2) + 
       FunctionalAssessment + MemoryComplaints + BehavioralProblems + ADL)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = ml_data, statistic = rsq, R = 1000, formula = Diagnosis ~ 
##     poly(MMSE, 2) + FunctionalAssessment + MemoryComplaints + 
##         BehavioralProblems + ADL)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     86.3 -0.0884   0.7926372

The prediction rate is 86.3% after testing the model through 1000 bootstrap instances.

Several ML Models on Kaggle Alzheimer Dataset

Danh Dang

2025-03-11