This dataset was created for educational purposes. This dataset contains health information for 2,149 Alzheimer patients, with a total of 35 variables.
rabie_el_kharoua_2024, title={Alzheimer’s Disease Dataset}, url={https://www.kaggle.com/dsv/8668279}, DOI={10.34740/KAGGLE/DSV/8668279}, publisher={Kaggle}, author={Rabie El Kharoua}, year={2024}
Load the necessary packages and dataset.
dim(data)
## [1] 2149 35
Check for NA values.
sum(is.na(data))
## [1] 0
Check the dataset
summary(data)
## PatientID Age Gender Ethnicity
## Min. :4751 Min. :60.00 Min. :0.0000 Min. :0.0000
## 1st Qu.:5288 1st Qu.:67.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median :5825 Median :75.00 Median :1.0000 Median :0.0000
## Mean :5825 Mean :74.91 Mean :0.5063 Mean :0.6975
## 3rd Qu.:6362 3rd Qu.:83.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :6899 Max. :90.00 Max. :1.0000 Max. :3.0000
## EducationLevel BMI Smoking AlcoholConsumption
## Min. :0.000 Min. :15.01 Min. :0.0000 Min. : 0.002003
## 1st Qu.:1.000 1st Qu.:21.61 1st Qu.:0.0000 1st Qu.: 5.139810
## Median :1.000 Median :27.82 Median :0.0000 Median : 9.934412
## Mean :1.287 Mean :27.66 Mean :0.2885 Mean :10.039442
## 3rd Qu.:2.000 3rd Qu.:33.87 3rd Qu.:1.0000 3rd Qu.:15.157931
## Max. :3.000 Max. :39.99 Max. :1.0000 Max. :19.989293
## PhysicalActivity DietQuality SleepQuality FamilyHistoryAlzheimers
## Min. :0.003616 Min. :0.009385 Min. : 4.003 Min. :0.0000
## 1st Qu.:2.570626 1st Qu.:2.458455 1st Qu.: 5.483 1st Qu.:0.0000
## Median :4.766424 Median :5.076087 Median : 7.116 Median :0.0000
## Mean :4.920202 Mean :4.993138 Mean : 7.051 Mean :0.2522
## 3rd Qu.:7.427899 3rd Qu.:7.558625 3rd Qu.: 8.563 3rd Qu.:1.0000
## Max. :9.987429 Max. :9.998346 Max. :10.000 Max. :1.0000
## CardiovascularDisease Diabetes Depression HeadInjury
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1443 Mean :0.1508 Mean :0.2006 Mean :0.0926
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Hypertension SystolicBP DiastolicBP CholesterolTotal
## Min. :0.0000 Min. : 90.0 Min. : 60.00 Min. :150.1
## 1st Qu.:0.0000 1st Qu.:112.0 1st Qu.: 74.00 1st Qu.:190.3
## Median :0.0000 Median :134.0 Median : 91.00 Median :225.1
## Mean :0.1489 Mean :134.3 Mean : 89.85 Mean :225.2
## 3rd Qu.:0.0000 3rd Qu.:157.0 3rd Qu.:105.00 3rd Qu.:262.0
## Max. :1.0000 Max. :179.0 Max. :119.00 Max. :300.0
## CholesterolLDL CholesterolHDL CholesterolTriglycerides MMSE
## Min. : 50.23 Min. :20.00 Min. : 50.41 Min. : 0.005312
## 1st Qu.: 87.20 1st Qu.:39.10 1st Qu.:137.58 1st Qu.: 7.167602
## Median :123.34 Median :59.77 Median :230.30 Median :14.441660
## Mean :124.34 Mean :59.46 Mean :228.28 Mean :14.755132
## 3rd Qu.:161.73 3rd Qu.:78.94 3rd Qu.:314.84 3rd Qu.:22.161028
## Max. :199.97 Max. :99.98 Max. :399.94 Max. :29.991381
## FunctionalAssessment MemoryComplaints BehavioralProblems ADL
## Min. :0.00046 Min. :0.000 Min. :0.0000 Min. : 0.001288
## 1st Qu.:2.56628 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 2.342836
## Median :5.09444 Median :0.000 Median :0.0000 Median : 5.038973
## Mean :5.08005 Mean :0.208 Mean :0.1568 Mean : 4.982958
## 3rd Qu.:7.54698 3rd Qu.:0.000 3rd Qu.:0.0000 3rd Qu.: 7.581490
## Max. :9.99647 Max. :1.000 Max. :1.0000 Max. : 9.999747
## Confusion Disorientation PersonalityChanges DifficultyCompletingTasks
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2052 Mean :0.1582 Mean :0.1508 Mean :0.1587
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Forgetfulness Diagnosis DoctorInCharge
## Min. :0.0000 Min. :0.0000 Length:2149
## 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :0.0000 Median :0.0000 Mode :character
## Mean :0.3015 Mean :0.3537
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
head(data, 5)
## PatientID Age Gender Ethnicity EducationLevel BMI Smoking
## 1 4751 73 0 0 2 22.92775 0
## 2 4752 89 0 0 0 26.82768 0
## 3 4753 73 0 3 1 17.79588 0
## 4 4754 74 1 0 1 33.80082 1
## 5 4755 89 0 0 0 20.71697 0
## AlcoholConsumption PhysicalActivity DietQuality SleepQuality
## 1 13.297218 6.327112 1.3472143 9.025679
## 2 4.542524 7.619885 0.5187671 7.151293
## 3 19.555085 7.844988 1.8263347 9.673574
## 4 12.209266 8.428001 7.4356041 8.392554
## 5 18.454356 6.310461 0.7954975 5.597238
## FamilyHistoryAlzheimers CardiovascularDisease Diabetes Depression HeadInjury
## 1 0 0 1 1 0
## 2 0 0 0 0 0
## 3 1 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## Hypertension SystolicBP DiastolicBP CholesterolTotal CholesterolLDL
## 1 0 142 72 242.3668 56.15090
## 2 0 115 64 231.1626 193.40800
## 3 0 99 116 284.1819 153.32276
## 4 0 118 115 159.5822 65.36664
## 5 0 94 117 237.6022 92.86970
## CholesterolHDL CholesterolTriglycerides MMSE FunctionalAssessment
## 1 33.68256 162.18914 21.463532 6.518877
## 2 79.02848 294.63091 20.613267 7.118696
## 3 69.77229 83.63832 7.356249 5.895077
## 4 68.45749 277.57736 13.991127 8.965106
## 5 56.87430 291.19878 13.517609 6.045039
## MemoryComplaints BehavioralProblems ADL Confusion Disorientation
## 1 0 0 1.72588346 0 0
## 2 0 0 2.59242413 0 0
## 3 0 0 7.11954774 0 1
## 4 0 1 6.48122586 0 0
## 5 0 0 0.01469122 0 0
## PersonalityChanges DifficultyCompletingTasks Forgetfulness Diagnosis
## 1 0 1 0 0
## 2 0 0 1 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 1 1 0 0
## DoctorInCharge
## 1 XXXConfid
## 2 XXXConfid
## 3 XXXConfid
## 4 XXXConfid
## 5 XXXConfid
Drop the PatientID and DoctorInCharge variables
data <- data[,-c(1,35)]
We learn about the distribution of the observations on several variables.
# Diagnosis distribution, how many patients are diagnoses with Alzheimer
data %>% ggplot(aes(x = as.factor(Diagnosis),
fill = as.factor(Diagnosis))) +
geom_bar() + xlab("Diagnosis (1 for Postive, 0 for Negative)") +
labs(fill = "Diagnosis")
data %>% ggplot(aes(x = as.factor(Gender),
fill = as.factor(Gender))) +
geom_bar() + xlab("Genders (1 for Female, 0 for Male)") +
labs(fill = "Gender")
data %>% ggplot(aes(x = as.factor(EducationLevel),
fill = as.factor(EducationLevel))) +
geom_bar() + xlab("Education Level (0 = None, 1 = High School, 2 = Bachelor's, 3 = Higher)") +
labs(fill = "Education Level")
data %>% ggplot(aes(x = as.factor(Ethnicity),
fill = as.factor(Ethnicity))) +
geom_bar() +
xlab("Ethnicity (0 = Caucasian, 1 = African American, 2 = Asian, 3 = Other)") +
labs(fill = "Ethnicity")
We check the correlation of other variables with Diagnosis
cor(data[-33], data$Diagnosis)
## [,1]
## Age -0.0054883776
## Gender -0.0209747103
## Ethnicity -0.0147822954
## EducationLevel -0.0439658496
## BMI 0.0263428130
## Smoking -0.0048652352
## AlcoholConsumption -0.0076179684
## PhysicalActivity 0.0059450425
## DietQuality 0.0085057713
## SleepQuality -0.0565480860
## FamilyHistoryAlzheimers -0.0328997475
## CardiovascularDisease 0.0314902464
## Diabetes -0.0315076042
## Depression -0.0058929070
## HeadInjury -0.0214114257
## Hypertension 0.0350800344
## SystolicBP -0.0156152313
## DiastolicBP 0.0052926798
## CholesterolTotal 0.0063944630
## CholesterolLDL -0.0319758259
## CholesterolHDL 0.0425840413
## CholesterolTriglycerides 0.0226718770
## MMSE -0.2371256071
## FunctionalAssessment -0.3648983066
## MemoryComplaints 0.3067423937
## BehavioralProblems 0.2243504008
## ADL -0.3323459163
## Confusion -0.0191857197
## Disorientation -0.0246481569
## PersonalityChanges -0.0206274625
## DifficultyCompletingTasks 0.0090685818
## Forgetfulness -0.0003542898
For these models, we end up choosing variables with moderate correlation with Diagnosis, which are MMSE, Functional, Memory, Behavioral and ADL (23, 24, 25, 26, 27).
ml_data <- data[c(23:27, 33)]
We check the correlation between the independent variables, to make sure that there is no collinearity.
cor(ml_data[-6])
## MMSE FunctionalAssessment MemoryComplaints
## MMSE 1.000000000 0.02493215 0.007651651
## FunctionalAssessment 0.024932154 1.00000000 0.002320470
## MemoryComplaints 0.007651651 0.00232047 1.000000000
## BehavioralProblems 0.025408172 -0.02194094 -0.009765303
## ADL 0.003358908 0.05390426 -0.037510563
## BehavioralProblems ADL
## MMSE 0.025408172 0.003358908
## FunctionalAssessment -0.021940939 0.053904259
## MemoryComplaints -0.009765303 -0.037510563
## BehavioralProblems 1.000000000 0.043375578
## ADL 0.043375578 1.000000000
Conclusion: There seems to be no correlation between the independent variables, we will include all of them in or models.
We will select a model by using a simple method. We will split the data into training and testing set. The training dataset will contain 80% of the original data, and the testing set will contain the remaining.
set.seed(1)
train <- sample(nrow(ml_data), floor(0.8*nrow(data)))
glm.fit <- glm(Diagnosis ~ ., data = ml_data[train,], family = binomial)
The error rate of the models will be calculated by the correct prediction made on the testing set.
glm.er <- function(fit, test) {
glm.pred <- rep(0, nrow(test))
glm.prob <- predict(fit, test, type = "response")
glm.pred[glm.prob > 0.5] <- 1
er <- mean(glm.pred == test$Diagnosis)
return(er)
} # This function will find the rate of correct prediction
round(glm.er(glm.fit, ml_data[-train,]),3)*100
## [1] 83.5
The linear logistic model have at 83.5% rate of correct prediction, we will use this as a baseline. We will now try to include polynomials and interactions in the logistic model. It turns out that including squared MMSE will increase the prediction rate.
glm.fit.i <- glm(Diagnosis ~ poly(MMSE,2) + FunctionalAssessment + MemoryComplaints +
BehavioralProblems + ADL,
data = ml_data[train,],
family = binomial)
round(glm.er(glm.fit.i, ml_data[-train,]),3)*100
## [1] 85.8
The prediction rate is 85.8% by including both MMSE and MMSE^2 in the model. In all the models below, we will also include these variables to compare the prediction rate.
lda.fit <- lda(Diagnosis ~ poly(MMSE,2) + FunctionalAssessment + MemoryComplaints +
BehavioralProblems + ADL,
data = ml_data[train,])
We calculate the prediction rate similarly to logistic model, with slight modification.
lda.prob <- predict(lda.fit, ml_data[-train,])
round(mean(lda.prob$class == ml_data[-train,]$Diagnosis),3)*100
## [1] 85.8
# Unlike logistic, LDA has classification function
A similar prediction rate of 85.8%, which makes sense for logistic and LDA models to give to similar prediction rate. Now we test QDA model.
qda.fit <- qda(Diagnosis ~ MMSE + FunctionalAssessment + MemoryComplaints +
BehavioralProblems + ADL,
data = ml_data[train,])
qda.prob <- predict(qda.fit, data[-train,])
round(mean(qda.prob$class == ml_data[-train,]$Diagnosis),3)*100
## [1] 80.2
QDA gives worse rate than LDA and logistic, which means that the true regression model is linear.
K-nearest neighbors method is used for classification. First, we will prepare the inputs for the knn function.
knn.train <- ml_data[train,][-6]
knn.test <- ml_data[-train,][-6]
knn.class <- ml_data[train,][,6]
The KNN function with K = 10 can be written as above, the prediction rate is calculated similarly as above.
set.seed(1) # In case there are multiple clusters with the same distances.
knn.pred <- knn(knn.train, knn.test, knn.class, k = 10)
round(mean(knn.pred == ml_data[-train,]$Diagnosis),3)*100
## [1] 80.7
The KNN function will be tested with multiple values of k.
k_values = c(1,5,10,20,50,100)
prediction_rate = rep(0, length(k_values))
for (i in 1:length(k_values)) {
set.seed(1)
knn.pred <- knn(knn.train, knn.test, knn.class, k = k_values[i])
prediction_rate[i] <- round(mean(knn.pred == ml_data[-train,]$Diagnosis),3)*100
}
# Transformation
knn.table <- as.data.frame(cbind(k_values, prediction_rate))
names(knn.table) <- c("k", "Prediction Rate")
knn.table
## k Prediction Rate
## 1 1 82.1
## 2 5 80.9
## 3 10 80.7
## 4 20 80.2
## 5 50 81.4
## 6 100 78.6
As we can see above, the prediction rate of KNN for this dataset is lower than linear logistic and LDA model.
Linear logistic and LDA model give the highest prediction rate. We will use the linear logistic model moving forward.
We have selected the best model above (linear logistic model) by splitting the original dataset into training (80%) and testing set (20%). Now we will use cross validation and bootstrap as the methods to validate the prediction rate. Note that this cross validation method only works on logistic model.
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5) # For classification models
loocv.err <- cv.glm(ml_data[train,], glm.fit, cost = cost)
round(1 - loocv.err$delta[1],3)*100
## [1] 85.1
The prediction rate when tested with LOOCV is 85.1%.
k.cv.err <- cv.glm(ml_data[train,], glm.fit, cost = cost, K = 10) # 10 k-CV
round(1 - k.cv.err$delta[1],3)*100
## [1] 85
The prediction rate when tested with 10 k-CV is 85%.
rsq <- function(formula, data, indices) {
d <- data[indices,]
fit <- glm(formula, data = d, family = binomial)
pred <- rep(0, nrow(d))
prob <- predict(fit)
pred[prob > 0.5] <- 1
return(round(mean(d$Diagnosis == pred),3)*100)
} # Define a function to calculate the prediction rate
# Repeat bootstrap 1000 times
boot(data = ml_data, statistic = rsq, R = 1000, formula = Diagnosis ~ poly(MMSE,2) +
FunctionalAssessment + MemoryComplaints + BehavioralProblems + ADL)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = ml_data, statistic = rsq, R = 1000, formula = Diagnosis ~
## poly(MMSE, 2) + FunctionalAssessment + MemoryComplaints +
## BehavioralProblems + ADL)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 86.3 -0.0884 0.7926372
The prediction rate is 86.3% after testing the model through 1000 bootstrap instances.