Alzheimer’s disease is an irreversible, progressive brain disorder that slowly destroys memory and cognitive ability. It is currently ranked as the sixth leading cause of death in the United States. About 34 million people worldwide have Alzheimer’s disease (AD), and it’s prevalence is expected to triple over the next 40 years.2
Dementia is a general term for the loss of cognitive functioning, memory, reasoning, and behavioral abilities, to an extent that it interferes with daily life and activities. During the 20th century Alzheimer’s was regarded as a clinical syndrome of dementia, not as a pathological process. In 2011 Alzheimer’s was redefined as a disease itself, based on evidence from biomarkers and neuroimaging.12 Alzheimer’s is the most common cause of dementia in older adults, accounting for 60-80% of cases.4
AD is characterized by neuro-fiber tangles (tau tangles), beta-amyloid protein plaques, synaptic loss and neuron atrophy in the brain.3 Before the early 2000s autopsy was the only way to definitively diagnose if a person had Alzheimer’s. Advances in research, lab and imaging tests are now available to identify Alzheimer’s biomarkers.1 Blood tests for beta-amyloid protein, genetic tests, cognitive tests, neurological exams, behavioral and mood exams and brain imaging are being used in the diagnosis of Alzheimer’s and dementia.
Mild cognitive impairment (MCI) is a transitional stage between age-related cognitive decline and Alzheimer’s disease when symptoms begin to appear during the pre-dementia stage. During this phase of AD, the pathology continues to progress, but clinical symptoms remain limited. Performance of more complex tasks is affected, but general independence is preserved. Only in the third phase of AD, the phase of dementia (now called “major neurocognitive disorder”) do we see the characteristic independence-destroying changes in cognitive and behavioral functioning.12 For effective treatment, it would be important to identify MCI patients at high risk for conversion to AD. Magnetic resonance imaging (MRI)-based imaging may predict the MCI-to-AD conversion from one to three years before the clinical diagnosis.5
Alzheimer’s disease progression
Longitudinal studies of Alzheimer’s disease are instrumental in identifying risk factors, disease mechanisms and progression, and treatment and preventive measures. One of the difficulties with longitudinal studies is controlling for variables of lifestyle in the subjects. The Nun Study directed by David A. Snowdon, initiated in 1986, used a relatively homogeneous cohort.
The Nun Study is a longitudinal study of aging and Alzheimer’s disease in Catholic sisters who were members of the School Sisters of Notre Dame congregation. Supported by the National Institute on Aging of the National Institutes of Health it is the largest long-term investigation of a brain donor population in the world.6,7
The study includes 3,926 Sisters of Notre Dame congregation born between 1886 and 1916, most of the women had joined the order at about age 20. The participants were 75 to 102 years of age at the beginning of the study, and the oldest member had survived to 107 years of age by 2002. 678 agreed to participate in annual assessments of their mental and physical function and activities of daily living through 1994. Because of the relatively homogeneous adult lifestyle of the sisters, the findings are not confounded by factors such as smoking and alcohol use, reproductive history, marital status, living arrangements, income, or social isolation that tend to be found in other adult populations.7
Three basic sources of data are available about the participants. 1. Convent archives provide information about potential early and middle-life risk factors for Alzheimer disease and other disorders. 2. Annual examinations document changes in the cognitive and physical function of each participant during old age. 3. Because each sister agreed to brain donation at death, the structure and pathology of the brain can be related to early and middle-life risk factors and to late-life cognitive and physical function.
Earlier results from the research showed that the more highly educated sisters lived about 4 years longer with better mental and physical functioning than the sisters who had less than a bachelor’s degree. The less educated sisters had twice the mortality rates of the more educated nuns.
MRI imaging of the subject’s brains, both in-vivo and post-mortem, have been used to show that volumetric measures of the hippocampus may be used to diagnose Alzheimer’s before symptoms are present.11 The study indicates that the prevalence of Alzheimer’s disease at death peaks at ages 85 and 89 and declines after age 90.7 This conclusion helped to encourage research to show that dementia is not a normal part of aging.1
Since the nuns’ lifestyles were so similar, the study concluded that the findings about education were not so much linked to poor circumstances in adult life as to something that occurred in youth. Idea density measures were derived from handwritten autobiographies from the convent archives for 180 U.S. born sisters who took their religious vows during 1931-1943. The autobiographies were written within two years before the sisters formally joined the congregation, between the age of 18 and 32 (mean age = 22).
The findings suggest that it is possible to identify persons who are at risk for developing late-life cognitive impairment by measuring linguistic ability (idea density) in early adulthood, a measure which is associated with general knowledge and vocabulary. Early-life idea density was significantly related to the categories of late-life cognitive function, including mild cognitive impairments.
Low idea density was associated with greater impairment and was significantly associated with lower brain weight, higher degree of cerebral atrophy and the likelihood of meeting neuropathologic criteria for Alzheimer’s disease. The odds ratio of 5.3 for mild cognitive impairment indicates that the likelihood of low idea density in this group was approximately five times that of the cognitively intact group. There was a very strong relationship between the likelihood of low idea density and the presence of dementia.8
Positivity and longevity
Three indicators of longer life were found when coding the sister’s autobiographies, noting the amount of positive sentences, positive words, and the variety of positive emotions used. Lower positivity in writing correlated with greater mortality. A growing body of literature has shown positive and negative emotion-related attitudes and states to be associated with physical health, mental health, and longevity.9
Findings of the Nun Study indicate multiple factors about the expression of Alzheimer’s symptoms and traits, and show that age and disease don’t always lead to cognitive impairment. The Nun Study influenced further research, studies and discoveries, including the Religious Orders Study, funded by the National Institute on Aging in 1993. It continued until 2016 and gathered 22 years of clinical data on more than 1,100 men and women members of religious clergy.15
Neuroimaging has meant a major breakthrough for the diagnosis and treatment of neurodegenerative diseases. Mathematical and statistical tools have developed and evolved together with an increasing development and use of neuroimaging. Imaging techniques provide insights of the brain, allowing the study of the structural and functional changes that can be linked to neurodegenerative diseases. This means a huge amount of data where automatic tools can help to identify patterns, reduce noise and enhance knowledge of brain functioning.10
The anatomy of every person’s brain is slightly different in shape and size. In order to compare images of different subjects, we need to eliminate these particularities and transform the images so that group analysis or comparison can be performed. Individual images are mapped from their individual subject space (current anatomy) to a reference space, a common anatomical reference that allows the comparison. This procedure is known as spatial normalization or registration.10
How effective is brain volume for diagnosing dementia in Alzheimer’s patients? To investigate this question I’ll use a longitudinal study dataset from Open Access Series of Imaging Studies (OASIS) to look at the relationships of variables to brain volume. How age relates to brain volume, and how aging affects brain volume. What differences there are in brain volume by gender and does normalizing brain volume eliminate those differences. Then using logistic regression I’ll see how brain volume performs in prediction models for dementia.
OASIS makes neuroimaging datasets freely available to the scientific community to facilitate discoveries in basic and clinical neuroscience. The OASIS Alzheimer’s longitudinal study data is available from the OASIS website.
Data Source: https://www.oasis-brains.org/,
https://central.xnat.org/app/template/XDATScreen_report_xnat_projectData.vm/search_element/xnat:projectData/search_field/xnat:projectData.ID/search_value/CENTRAL_OASIS_LONG
MRI Data in Nondemented and Demented Older Adults: This set consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit.
OASIS Longitudinal Dataset
| Variable | Description | Type |
|---|---|---|
| Group | Dementia Classification | Categorical |
| Visit | Visit number in study | Discrete |
| MRDelay | Number of days between visits | Discrete |
| M/F | Gender | Categorical |
| Hand | all subjects are right handed | Constant=R |
| Age | Age in years | Discrete |
| EDUC | Years of Education | Discrete |
| SES | Socioeconomic Status | Categorical |
| MMSE | Mini Mental State Examination | Discrete |
| CDR | Clinical Dementia Rating | Discrete |
| eTIV | Estimated Total Intracranial Volume | Continuous |
| nWBV | Normalized Whole Brain Volume | Continuous |
| ASF | Atlas Scaling Factor | Continuous |
As assessed by the Hollingshead Index of Socioeconomic Status (1975) and classified into categories from 1 (highest) to 5 (lowest). The Hollingshead Four Factor Index of Socioeconomic Status is a survey designed to measure social status of an individual based on four domains: marital status, retired/employed status, educational attainment, and occupational prestige. It has been one of the most frequently used measures of SES.13
The Mini-Mental State Exam (MMSE) a.k.a the Folstein Test is a 30-point test used to measure thinking ability or cognitive impairment. It is the most widely used test for assessing dementia.3
During the MMSE, a health professional asks a patient a series of questions designed to test a range of everyday mental skills. The maximum MMSE score of 30 points shows normal cognitive functioning. On average, the MMSE score of a person with Alzheimer’s declines about two to four points each year.4
| MMSE Score | Level of Dementia |
|---|---|
| 24 - 30 | Normal cognition; no dementia |
| 19 - 23 | Mild dementia |
| 10 - 18 | Moderate dementia |
| 9 or less | Severe dementia |
The Clinical Dementia Rating (CDR) is a 5-point scale used to characterize six domains of cognitive and functional performance applicable to Alzheimer disease and related dementias: Memory, Orientation, Judgment & Problem Solving, Community Affairs, Home & Hobbies, and Personal Care. The necessary information to make each rating is obtained through a semi-structured interview of the patient and a reliable informant or collateral source (e.g., family member).4
| CDR Score | Description |
|---|---|
| 0 | Normal |
| 0.5 | Very Mild dementia |
| 1 | Mild Dementia |
| 2 | Moderate Dementia |
| 3 | Severe Dementia |
The following plots show that the distributions of age by subjects CDR scores is close to normal for scores in the Normal to Mild Dementia ranges, and the age ranges are similar. The distribution for CDR = Moderate Dementia (2) is heavily right skewed, but there are only 3 subjects with CDR = 2 in the data. There are no subjects in the data with CDR = Severe Dementia (3).
g1 <- ggdotplot(data = df, x = "cdr", y="age", fill="cdr", size = 0.75,
palette = c("forestgreen", "blue2", "#D95F02", "darkred" ),
main="Age by CDR", ylab = "Age (years)", xlab = FALSE) +
guides(fill="none")
g2 <- ggboxplot(data = df, x = "cdr", y="age", fill="cdr",
orientation = "v", outlier.shape = 22, size = 1.1,
palette = c("forestgreen", "blue2", "#D95F02", "darkred" ),
xlab = "CDR", ylab = "Age (years)") +
guides(fill="none")
plot_grid(g1, g2, ncol = 1)
|
|
|
|
|
|
df %>% select(age, educ, ses, mmse, cdr, etiv, nwbv, asf) %>%
summary()
## age educ ses mmse cdr
## Min. :60.00 Min. : 6.0 Min. :1.00 Min. : 4.00 Min. :0.0000
## 1st Qu.:71.00 1st Qu.:12.0 1st Qu.:2.00 1st Qu.:27.00 1st Qu.:0.0000
## Median :77.00 Median :15.0 Median :2.00 Median :29.00 Median :0.0000
## Mean :77.01 Mean :14.6 Mean :2.46 Mean :27.34 Mean :0.2909
## 3rd Qu.:82.00 3rd Qu.:16.0 3rd Qu.:3.00 3rd Qu.:30.00 3rd Qu.:0.5000
## Max. :98.00 Max. :23.0 Max. :5.00 Max. :30.00 Max. :2.0000
## NA's :19 NA's :2
## etiv nwbv asf
## Min. :1106 Min. :0.6440 Min. :0.876
## 1st Qu.:1357 1st Qu.:0.7000 1st Qu.:1.099
## Median :1470 Median :0.7290 Median :1.194
## Mean :1488 Mean :0.7296 Mean :1.195
## 3rd Qu.:1597 3rd Qu.:0.7560 3rd Qu.:1.293
## Max. :2004 Max. :0.8370 Max. :1.587
##
gdf <- ggplot(df, legend = "none") +
guides(fill = "none") +
scale_fill_manual(values = RColorBrewer::brewer.pal(5, "Dark2")) +
xlab(NULL) +
ylab(NULL)
theme_light()
p1 <- gdf + geom_bar(aes(group, fill = group)) +
labs(title = "Group")
p2 <- gdf + geom_bar(aes(m.f, fill = m.f)) +
scale_fill_manual(values = c("royalblue", "lightcoral")) +
labs(title = "Gender")
p3 <- gdf + geom_histogram(aes(age),
fill = "#1B9E77", color = "black")+
labs(title = "Age")
p4 <- gdf + geom_histogram(aes(educ),
fill = "#D95F02", color = "black")+
labs(title = "Education (years)")
p5 <- gdf + geom_bar(aes(ses, fill = as.factor(ses)), na.rm = TRUE) +
scale_x_reverse() +
labs(title = "SES")
p6 <- gdf + geom_bar(aes(as.factor(cdr), fill = as.factor(cdr))) +
labs(title = "CDR")
#plot_grid(p1, p2, p3, p4)
p7 <- gdf + geom_histogram(aes(mmse), na.rm = TRUE,
fill = "#7570B3", color = "black") +
labs(title = "MMSE")
p8 <- gdf + geom_histogram(aes(etiv), na.rm = TRUE,
fill = "#E7298A", color = "black") +
labs(title = "eTIV")
p9 <- gdf + geom_histogram(aes(nwbv), na.rm = TRUE,
fill = "#66A61E", color = "black")+
labs(title = "nWBV")
plot_grid(p1, p2, p3, p4, p5, p6, p7, p8, p9)
The number of participants fell off sharply after the second visit. There are more observations with CDR of Normal Cognition (CDR(0) = 206) than the dementia values combined (CDR(> 0) = 167), this could indicate a class bias for logistic regression. MMSE is left skewed corresponding with the right skew in CDR. The eTIV distribution is right skewed with mean 18 cm3 greater than the median. The nWBV distribution is almost normal with mean 0.006 cm3 greater than the median, and the minimum and maximum values \(\tilde =\) 0.1 from the mean.
There are more women than men in the data. There are more subjects with higher socio-economic status than lower SES, and more subjects with 12 or more years of education than with less than a high school education.
In order to continue with my analysis I’ll create binary variables for dementia and gender from the categorical variables group and m.f.
# Assign 0 to dementia for Converted where cdr = 0 or group is Nondemented, otherwise set dementia to 1
# Assign 0 to gender for Female, 1 to gender for Male
df <- df %>% mutate(dementia = if_else (group == "Nondemented" | (group == "Converted" & cdr == 0), 0, 1),
gender = if_else(m.f == "F", 0, 1))
pacman::p_load(GGally)
pcor <- df %>% select(cdr, mmse, gender, age, educ, ses, nwbv, etiv, dementia) # select the columns for correlation
# create a correlation plot
a1 <- ggcorr(pcor, nbreaks = 5,
low = "blue4",
mid = "bisque3",
high = "darkred", # set the number of levels and the colors
hjust = .55, size = 4, # nudge and format the axis labels
label = TRUE, label_size = 4, label_color = "white", # format the coefficient
name = "Significance") + # legend title
labs(title = "Variable Correlation") # plot title
theme(plot.title.position = "plot",
plot.title = element_text(hjust = .5, vjust = 0, size = rel(2)), # position and format the title and caption
plot.caption.position = "plot")
a1
There is a strong correlation between CDR and the dementia diagnosis, this is expected since CDR is the determining variable for dementia in the dataset. There is a significant negative correlation between CDR and MMSE, which is also expected since MMSE is one of the factors used in determining the CDR score. And consequently, MMSE has an influence on the dementia diagnosis.
Education has a negative relationship with SES, indicating that more years of education results in a lower value for SES, which translates to a higher socio-economic status because SES has a reversed scale. Education has a moderate positive correlation to eTIV and a moderate negative correlation to dementia.
Age has a moderate negative correlation with nWBV, but doesn’t appear to have an influence on the dementia diagnosis.
Gender has a moderate positive correlation to eTIV, but a moderate negative correlation to nWBV, which may show that the normalization of brain volume in the dataset isn’t completely effective.
There are only a few missing values for MMSE and SES, in about 5% of the observations. Assigning imputed values to these rows should not have a significant effect on the statistics.
There is only one subject with missing MMSE scores, the value is missing from her 2nd and 3rd visits.
smmry1 <- summary(df$mmse)
df %>% filter(is.na(mmse)) %>%
select(subject.id, group, visit, m.f, age, educ, ses, mmse, cdr) %>% # show the rows missing MMSE values
kable() %>% kable_minimal()
| subject.id | group | visit | m.f | age | educ | ses | mmse | cdr |
|---|---|---|---|---|---|---|---|---|
| OAS2_0181 | Demented | 2 | F | 75 | 12 | NA | NA | 1 |
| OAS2_0181 | Demented | 3 | F | 77 | 12 | NA | NA | 1 |
mmse1 <- df %>% filter(subject.id == "OAS2_0181" & visit == 1) %>% # get the MMSE value for the first visit
select(mmse)
mmse_a <- df %>% # get the average MMSE for similar subjects
filter(age > 73 & age <= 78 & gender == 0 &
educ == 12 & group == 'Demented') %>%
summarise(round(mean(mmse, na.rm = TRUE))) %>% as.integer()
Her MMSE = 26 on her first visit. The average MMSE for women with dementia, in the same age range and education years is 25. We saw earlier that MMSE scores can decrease for Alzheimer’s patients. So setting this subject’s missing MMSE scores to this average value appears reasonable.
df$mmse[is.na(df$mmse)] <- mmse_a
smmry2 <- summary(df$mmse)
smmry2["NA's"] <- sum(is.na(df$mmse))
tibble(names(smmry1), smmry1, smmry2) %>%
kable(col.names = c("", "Before", "After"), caption = "MMSE Summary Statistics") %>%
kable_minimal(full_width = FALSE, position="l")
| Before | After | |
|---|---|---|
| Min. | 4.00 | 4.00 |
| 1st Qu. | 27.00 | 27.00 |
| Median | 29.00 | 29.00 |
| Mean | 27.34 | 27.33 |
| 3rd Qu. | 30.00 | 30.00 |
| Max. | 30.00 | 30.00 |
| NA’s | 2.00 | 0.00 |
There is a 0.01 difference in the mean of MMSE after imputing the value.
There is a relationship between years of education and socio-economic status. Subjects with more years of education have a higher SES.
gdf <- ggplot(df, legend = "none") +
guides(fill = "none") +
scale_fill_manual(values = RColorBrewer::brewer.pal(5, "Dark2")) +
ylab(NULL) +
theme_light()
gdf + geom_jitter(aes(educ, ses, color = factor(ses)),
size = 3.5, shape = 18, na.rm = TRUE) +
scale_y_reverse() + # SES goes from 1 = highest - 5 = lowest
labs(title = "Socio-economic Status by Education",
x = "Education (years)", y = "SES") +
guides(color = "none") +
scale_color_brewer(palette = "Dark2")
I’ll use the values of education to derive the 19 missing SES values. Selecting the range of SES values from observations with the same range in education levels as the rows with missing SES values. Then randomly assign an SES value from that range to the NAs in the dataset.
smmry1 <- summary(df$ses)
sna <- df %>% filter(is.na(ses)) # find the null SES rows
ids <- unique(sna$subject.id) # get the unique subject IDs for the n/a rows
# impute missing data values for SES, use the education level range for the n/a rows
# to give the range of SES values to use for random assignment
x <- df %>% drop_na(ses) %>% # drop rows with SES = n/a
filter(educ >= min(sna$educ) & educ <= max(sna$educ)) # find rows with EDUC in range for n/a SES rows
x1 <- sort(unique(x$ses), decreasing = TRUE) # get the range of SES values for n/a SES rows
x2 <- sample.int(x1, length(ids) + 1, replace = TRUE) # create vector of random sample from SES range
for (i in 1:length(ids)) { # assign SES for each unique Subject ID
for (n in 1:length(sna$ses)) {
if (sna$subject.id[n] == ids[i]) sna$ses[n] <- x2[i]
}
}
df$ses[is.na(df$ses)] <- sna$ses # fill n/a SES values in the data
smmry2 <- summary(df$ses)
smmry2["NA's"] <- sum(is.na(df$ses))
tibble(names(smmry1), smmry1, smmry2) %>%
kable(col.names = c("", "Before", "After"), caption = "SES Summary Statistics") %>%
kable_minimal(full_width = FALSE, position="l")
| Before | After | |
|---|---|---|
| Min. | 1.00 | 1.000 |
| 1st Qu. | 2.00 | 2.000 |
| Median | 2.00 | 2.000 |
| Mean | 2.46 | 2.453 |
| 3rd Qu. | 3.00 | 3.000 |
| Max. | 5.00 | 5.000 |
| NA’s | 19.00 | 0.000 |
There is a 0.007 difference in the mean of SES after imputing the 19 data values. To verify that these small differences have not affected the correlations, I’ll compare the correlation coefficients of the imputed data to the original correlation coefficients.
a2 <- ggcorr(pcor) # create correlation after imputing data
tmp <- ifelse(sum(a1[["data"]][["coefficient"]] - a2[["data"]][["coefficient"]]) # check that the coefficients
== 0, "True", "False") # haven't changed
cat("The correlation values are equal = ", tmp)
## The correlation values are equal = True
The variable correlations have not changed due to the imputed data values.
To look at the changes in brain volume over time I’ll use the six subjects that have five visits in the dataset. One subject was diagnosed with dementia at the beginning of the study and made the five visits in a little over 3 three years. One subject converted to dementia over the course of seven years. The remaining 4 subjects did not have dementia and made the five visits over the course of 6 1/2 to seven years.
ids <- df %>% filter(visit == 5) %>% select(subject.id)
dfl <- semi_join(df, ids, by = "subject.id")
dfl %>% select(subject.id, dementia, m.f, age, educ, visit, mr.delay, nwbv) %>% # show the rows for subjects with 5 visits
filter(visit == 1 | visit == 5) %>% # show the first and last visits
kable() %>% kable_minimal(full_width = FALSE, position = "l")
| subject.id | dementia | m.f | age | educ | visit | mr.delay | nwbv |
|---|---|---|---|---|---|---|---|
| OAS2_0017 | 0 | M | 80 | 12 | 1 | 0 | 0.752 |
| OAS2_0017 | 0 | M | 86 | 12 | 5 | 2400 | 0.761 |
| OAS2_0036 | 0 | F | 69 | 13 | 1 | 0 | 0.789 |
| OAS2_0036 | 0 | F | 75 | 13 | 5 | 2369 | 0.778 |
| OAS2_0048 | 1 | M | 66 | 16 | 1 | 0 | 0.711 |
| OAS2_0048 | 1 | M | 69 | 16 | 5 | 1233 | 0.676 |
| OAS2_0070 | 0 | M | 80 | 17 | 1 | 0 | 0.728 |
| OAS2_0070 | 0 | M | 86 | 17 | 5 | 2386 | 0.705 |
| OAS2_0073 | 0 | F | 70 | 14 | 1 | 0 | 0.787 |
| OAS2_0073 | 0 | F | 77 | 14 | 5 | 2517 | 0.769 |
| OAS2_0127 | 0 | M | 79 | 18 | 1 | 0 | 0.729 |
| OAS2_0127 | 1 | M | 86 | 18 | 5 | 2639 | 0.669 |
g1 <- ggline(data = dfl, x = "visit", y="nwbv", color="subject.id", size = 1,
palette = "Dark2",
main="nWBV by Visit",xlab = "Visit", ylab = "nWBV (cm^3)") +
guides(color = "none")
g2 <- ggline(data = dfl, x = "mr.delay", y="nwbv", color="subject.id", size = 1,
palette = "Dark2",
main="nWBV by MRI Delay",xlab = "MRI Delay (days)", ylab = "nWBV (cm^3)") +
guides(color = "none")
g3 <- ggline(data = dfl, x = "age", y="nwbv", color="subject.id", size = 1,
palette = "Dark2",
main="nWBV by Age",xlab = "Age", ylab = "nWBV (cm^3)") %>%
ggpar(legend.title = "Subject ID", legend = "right",
font.legend = 12) # create legend to use as 4th plot
g4 <- get_legend(g3) %>% as_ggplot() # grab the color legend from g3 as a plot
g3 <- g3 + guides(color = "none") # suppress g3 legend
plot_grid(g1, g2, g3, g4)
Subject OAS2_0048 who had dementia at the beginning of the study had a rapid decline in nWBV over the three years he participated. Subject OAS2_0127 who converted to dementia also shows a steep decline in nWBV over the course of seven years. In this small sample an nWBV = 0.72 seems to be the conversion point for a dementia diagnosis which corresponds closely with the mean nWBV for the entire dataset.
We saw earlier that the distribution of nWBV for the whole dataset is normally distributed. The distributions of nWBV grouped by dementia diagnosis are also normally distributed. Is there a difference in means of nWBV for negative and positive dementia diagnosis?
g <- df %>% ggplot(aes(nwbv, fill=factor(dementia))) +
scale_fill_manual(name = "Dementia", values = c("plum4", "firebrick")) +
theme_light() +
theme(axis.text.y = element_blank())
g1 <- g + geom_histogram(alpha=0.7) +
labs(title = "nWBV by Dementia", x = "nWBV (cm^3)", y=NULL) +
guides(fill = "legend") +
theme(axis.text.y = element_blank(), legend.position = "left")
g2 <- g + geom_boxplot(alpha=0.8) +
labs(title = "nWBV by Dementia", x = "nWBV (cm^3)") +
guides(fill="none")
plot_grid(g1, g2, rel_widths = c(1,0.8))
dd <- df %>% filter(dementia == 1) # get all records with dementia diagnosis
dn <- df %>% filter(dementia == 0)
t.test(dn$nwbv, dd$nwbv, conf.level = 0.01) # get a t.test confidence interval
##
## Welch Two Sample t-test
##
## data: dn$nwbv and dd$nwbv
## t = 6.7968, df = 368.12, p-value = 4.325e-11
## alternative hypothesis: true difference in means is not equal to 0
## 1 percent confidence interval:
## 0.02443871 0.02452907
## sample estimates:
## mean of x mean of y
## 0.7403990 0.7159152
The t-test gives a p-value close to zero, we are 99% confident that the difference in means of nWBV by dementia diagnosis is between 0.0244 and -0.0245 cm3. This is a small difference.
However, because this is a longitudinal dataset, the observations are not all independent. In order to have independent observations I’ll select the first observation for each subject. This data is also normally distributed with summary statistics close to that of the entire dataset.
df_v1 <- filter(df, visit == 1) %>% # use the data for the first visit
mutate(dementia = factor(dementia),
factor(gender))
g <- df_v1 %>% ggplot(aes(nwbv, fill=factor(dementia))) +
scale_fill_manual(name = "Dementia", values = c("plum4", "firebrick")) +
theme_light() +
theme(axis.text.y = element_blank())
g1 <- g + geom_histogram(alpha=0.7) +
labs(title = "nWBV by Dementia", x = "nWBV (cm^3)", y=NULL) +
guides(fill = "legend") +
theme(axis.text.y = element_blank(), legend.position = "left")
g2 <- g + geom_boxplot(alpha=0.8) +
labs(title = "nWBV by Dementia", x = "nWBV (cm^3)") +
guides(fill="none")
plot_grid(g1, g2, rel_widths = c(1,0.8))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
tibble(Statistic = names(summary(df$nwbv)), All.Records= summary(df$nwbv), Visit.1=summary(df_v1$nwbv)) %>%
kable(caption = "nWBV Distribution") %>%
kable_minimal(full_width=FALSE, position="l")
| Statistic | All.Records | Visit.1 |
|---|---|---|
| Min. | 0.6440 | 0.6600 |
| 1st Qu. | 0.7000 | 0.7100 |
| Median | 0.7290 | 0.7350 |
| Mean | 0.7296 | 0.7361 |
| 3rd Qu. | 0.7560 | 0.7578 |
| Max. | 0.8370 | 0.8370 |
dd <- df_v1 %>% filter(dementia == 1) # get all records with dementia diagnosis
dn <- df_v1 %>% filter(dementia == 0)
t.test(dn$nwbv, dd$nwbv, conf.level = 0.01) # get a t.test confidence interval
##
## Welch Two Sample t-test
##
## data: dn$nwbv and dd$nwbv
## t = 3.5017, df = 147.08, p-value = 0.0006125
## alternative hypothesis: true difference in means is not equal to 0
## 1 percent confidence interval:
## 0.01974163 0.01988371
## sample estimates:
## mean of x mean of y
## 0.7446588 0.7248462
The t-test for difference in means still gives a small p-value of 0.0006 < 0.01. We are 99% confident that the difference in mean nWBV for dementia diagnosis is between 0.0197 and 0.0199 cm3, which is a smaller interval than when using all of the records. This might indicate that nWBV not good predictor of dementia by itself.
gv1 <- ggplot(df_v1, legend = "none") +
guides(fill = "none") +
scale_fill_manual(values = RColorBrewer::brewer.pal(5, "Dark2")) +
ylab(NULL) +
theme_light()
gv1 + geom_boxplot(aes(nwbv, factor(cdr), color = factor(cdr)),
size = 1.5) +
labs(title = "CDR by nWBV", x = "nWBV", y = "CDR") +
guides(color = "none") +
scale_color_brewer(palette = "Dark2")
t1 <- df_v1 %>% group_by(cdr) %>%
summarise(mean(nwbv), min(nwbv), max(nwbv)) %>% round(3) %>%
kable(caption = "nWBV Statistics by CDR", col.names = c("CDR", "Mean", "Min.", "Max.")) %>%
column_spec(c(2,3,4), width = "3em")
t2 <- df_v1 %>% group_by(dementia) %>%
summarise(round(mean(nwbv),3), min(nwbv), max(nwbv)) %>%
kable(caption = "nWBV Statistics by dementia diagnosis", col.names = c("Dementia", "Mean", "Min.", "Max.")) %>%
column_spec(c(2,3,4), width = "3em")
kables(list(t1, t2)) %>% kable_minimal()
|
|
tmp <- df_v1 %>% filter(dementia == 0 & nwbv < 0.72) %>% count()
The minimum nWBV measurements for each category of CDR and Dementia are all similar. There are 23 non-dementia subjects with nWBV < 0.72, which is 27% of the non-dementia observations.
The distribution of ages by dementia diagnosis are similar. Age may not have strong significance in the prediction of dementia.
df_v1 %>% group_by(dementia) %>%
summarise(min(age), median(age), mean(age), max(age)) %>%
kable() %>% kable_minimal(full_width = FALSE, position = 'l')
| dementia | min(age) | median(age) | mean(age) | max(age) |
|---|---|---|---|---|
| 0 | 60 | 76 | 75.82353 | 93 |
| 1 | 61 | 75 | 74.95385 | 96 |
# create a quick plot of the participant ages in each group
qplot(age, data = df_v1, fill=dementia, geom = "dotplot",
facets = "dementia", xlab = "Age", main = "Age by Dementia") +
scale_fill_manual("Dementia", labels = c("No", "Yes"), values = c("forestgreen", "darkred")) +
theme_light()
We would expect that there is a difference in the average estimated total brain volume (eTIV) between men and women. A histogram of the distributions of eTIV for men and women shows a shift in the distributions.
g1 <- gv1 + geom_density(aes(etiv), fill = 'wheat') +
labs(title = "eTIV", x = "eTIV (cm^3)")
g2 <- gv1 + geom_density(aes(etiv, fill=m.f), alpha = 0.7) +
labs(title = "eTIV by Gender", x = "eTIV (cm^3)") +
scale_fill_manual(name = "Gender", values = c("lightcoral", "royalblue")) +
guides(fill = "legend")
plot_grid(g1, g2, rel_widths = c(0.8, 1))
df_v1 %>% group_by(m.f) %>%
summarise(mean(etiv), min(etiv), max(etiv)) %>%
kable(caption = "eTIV Statistics by Gender", col.names = c("M/F", "Mean", "Min.", "Max")) %>%
kable_minimal(full_width = FALSE, position = 'l')
| M/F | Mean | Min. | Max |
|---|---|---|---|
| F | 1390.852 | 1123 | 1732 |
| M | 1593.048 | 1307 | 1987 |
The mean eTIV for men is almost 200 cm3 greater than that of women.
men <- df_v1 %>% filter(gender == 0)
wmn <- df_v1 %>% filter(gender == 1)
t.test(men$etiv, wmn$etiv)
##
## Welch Two Sample t-test
##
## data: men$etiv and wmn$etiv
## t = -8.015, df = 103.43, p-value = 1.758e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -252.2256 -152.1667
## sample estimates:
## mean of x mean of y
## 1390.852 1593.048
As expected, a test for the difference in means has a p-value close to zero. A null hypotheses that the average eTIV for men equals the average eTIV for women is rejected.
The distribution of nWBV normalizes the range of brain volume, but swaps the outliers from women to men and switches the larger median and mean to women.
x <- ggboxplot(df_v1, x = "m.f", y = 'etiv', # use ggpubr plots
fill = "m.f", palette = c("royalblue", "lightcoral"),
ylab = "eTIV (cm^3)", xlab = "Gender", legend = "none")
y <- ggboxplot(df_v1, x = "m.f", y = 'nwbv',
fill = "m.f", palette = c("royalblue", "lightcoral"),
ylab = "nWBV (cm^3)", xlab = "Gender", legend = "none")
pg <- plot_grid(x, y)
title <- ggdraw() + draw_label("eTIV and nWBV by Gender", fontface='bold')
plot_grid(title, pg, ncol=1, rel_heights=c(0.1, 1))
df_v1 %>% group_by(m.f) %>%
summarise(median(nwbv), mean(nwbv), min(nwbv), max(nwbv)) %>%
kable(caption = "nWBV Statistics by Gender", col.names = c("M/F", "Median", "Mean", "Min.", "Max.")) %>%
kable_minimal(full_width = FALSE, position = "l")
| M/F | Median | Mean | Min. | Max. |
|---|---|---|---|---|
| F | 0.7465 | 0.7434432 | 0.662 | 0.822 |
| M | 0.7270 | 0.7256129 | 0.660 | 0.837 |
t.test(men$nwbv, wmn$nwbv)
##
## Welch Two Sample t-test
##
## data: men$nwbv and wmn$nwbv
## t = 3.0272, df = 131.14, p-value = 0.002972
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.006178494 0.029482064
## sample estimates:
## mean of x mean of y
## 0.7434432 0.7256129
The p-value = 0.003 < 0.01 for the difference in means of nWBV for men and women is small. The null hypotheses that mean nWBV is equal for men and women is rejected. There is a difference between the average nWBV for men and the average nWBV for women. But it is less significant than the difference in average eTIV scores.
There doesn’t seem to be a linear relationship between age and eTIV until about age 77, then a negative linear relationship kicks in. There’s a strong descending linear relationship between age and nWBV.
g1 <- gv1 + geom_point(aes(age, etiv)) +
geom_smooth(aes(age, etiv), size = 1.5, se=FALSE) +
labs(title = "eTIV by Age", y = "eTIV", x = "Age") +
scale_color_brewer(palette = "Dark2")
g2 <- gv1 + geom_point(aes(age, nwbv)) +
geom_smooth(aes(age, nwbv), size = 1.5, se=FALSE) +
labs(title = "nWBV by Age", y = "nWBV", x = "Age") +
scale_color_brewer(palette = "Dark2")
plot_grid(g1, g2)
The diagnostic plots for eTIV by Age show a good linear relationship in the Residuals to Fitted plot. The residuals are close to normally distributed, have equal variance and there are no influential points affecting the regression line. Age is a good predictor for eTIV in this data. The intercept is 1491.357 cm3, for every year in age eTIV goes down 0.224 cm3.
eTIV = 1491.357 cm3 - 0.224(Age) cm3
etiv_age <- lm(df_v1$etiv ~ df_v1$age)
par(mfrow=c(2,2))
plot(etiv_age, lwd = 2)
summary(etiv_age)
##
## Call:
## lm(formula = df_v1$etiv ~ df_v1$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -351.98 -126.90 -20.65 89.94 515.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1491.3568 144.2796 10.337 <2e-16 ***
## df_v1$age -0.2244 1.9029 -0.118 0.906
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 175.3 on 148 degrees of freedom
## Multiple R-squared: 9.395e-05, Adjusted R-squared: -0.006662
## F-statistic: 0.01391 on 1 and 148 DF, p-value: 0.9063
The diagnostic plots for nWBV by age also show good linearity, normal distribution, homoscedasticity and no influential points. They do seem to show the reversal of distributions by gender that we saw earlier. The intercept for nWBV by age is 0.9395 cm3 and decreases by 0.0027 cm3 for every year of age.
nWBV cm3 = 0.9395 cm3 - 0.0027(Age) cm3
nwbv_age <- lm(df_v1$nwbv ~ df_v1$age)
par(mfrow=c(2,2))
plot(nwbv_age, lwd = 2)
summary(nwbv_age)
##
## Call:
## lm(formula = df_v1$nwbv ~ df_v1$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.071152 -0.022550 -0.000974 0.025509 0.072759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9395002 0.0249893 37.596 < 2e-16 ***
## df_v1$age -0.0026963 0.0003296 -8.181 1.18e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03036 on 148 degrees of freedom
## Multiple R-squared: 0.3114, Adjusted R-squared: 0.3067
## F-statistic: 66.93 on 1 and 148 DF, p-value: 1.183e-13
g <- df_v1 %>% ggplot() +
scale_color_manual(values = c("forestgreen", "blue3", "#D95F02", "darkred"), name = "CDR") +
theme_light()
fcdr = factor(df_v1$cdr)
g1 <- g + geom_jitter(aes(x = fcdr, y = educ, col = fcdr), alpha = 0.7) +
guides(color = "none") +
labs(x = "CDR", y = "Education (years)")
g2 <- g + geom_jitter(aes(x = fcdr, y = ses, col = fcdr), alpha = 0.7) +
scale_y_reverse() +
labs(x = "CDR", y = "SES")
pg <- plot_grid(g1, g2)
title <- ggdraw() + draw_label("Distributions of Education and Socio-Economic Status (SES)", fontface='bold')
plot_grid(title, pg, ncol=1, rel_heights=c(0.1, 1))
g1 <- g + geom_jitter(aes(x = fcdr, y = etiv, col = fcdr), alpha = 0.7) +
labs(x = "CDR", y = "eTIV")
g2 <- g + geom_jitter(aes(x = fcdr, y = nwbv, col = fcdr), alpha = 0.7) +
labs(x = "CDR", y = "nWBV")
pg <- plot_grid(g1, g2)
title <- ggdraw() + draw_label("Distributions of eTIV and nWBV", fontface='bold')
plot_grid(title, pg, ncol=1, rel_heights=c(0.1, 1))
The spreads of education and SES are similar across the CDR categories. eTIV has close to equal spreads across categories, while nWBV has a definite downward trend as CDR score increases.
CDR is the determinant for the dementia variable in this dataset, so it will be left out of the logistic models. MMSE is also a dementia scale, several models with and without MMSE will be evaluated.
attach(df_v1)
# Full model
fit1 <- glm(dementia ~ nwbv + gender + age + educ + ses + mmse, family = binomial())
s1 <- summary(fit1)
# remove SES
fit2 <- glm(dementia ~ nwbv + gender + age + educ + mmse, family = binomial())
s2 <- summary(fit2)
# remove gender
fit3 <- glm(dementia ~ nwbv + age + educ + mmse, family = binomial())
s3 <- summary(fit3)
# nWBV by itself
fit4 <- glm(dementia ~ nwbv, family = binomial())
s4 <- summary(fit4)
# MMSE alone
fit5 <- glm(dementia ~ mmse, family = binomial())
s5 <- summary(fit5)
# nWBV with MMSE
fit6 <- glm(dementia ~ mmse + nwbv, family = binomial())
s6 <- summary(fit6)
# all except nWBV
fit7 <- glm(dementia ~ nwbv + age + educ, family = binomial())
s7 <- summary(fit7)
# all except MMSE
fit8 <- glm(dementia ~ nwbv + gender + age + educ, family = binomial())
s8 <- summary(fit8)
detach(df_v1)
models <- tibble(Model = c('1. nwbv + gender + age + educ + ses + mmse', '2. nwbv + gender + age + educ + mmse',
'3. nwbv + age + educ + mmse', '4. nwbv', '5. mmse', '6. mmse + nwbv',
'7. nwbv + age + educ', '8. nwbv + gender + age + educ'),
AIC = c(s1$aic, s2$aic, s3$aic, s4$aic, s5$aic, s6$aic, s7$aic, s8$aic),
Slope = c(s1$coefficients[2,'Estimate'], s2$coefficients[2,'Estimate'],
s3$coefficients[2,'Estimate'], s4$coefficients[2,'Estimate'],
s5$coefficients[2,'Estimate'], s6$coefficients[2,'Estimate'],
s7$coefficients[2,'Estimate'], s8$coefficients[2,'Estimate']),
Iterations = c(s1$iter, s2$iter, s3$iter, s4$iter, s5$iter, s6$iter, s7$iter, s8$iter))
models <- arrange(models, AIC)
models$Model <- as.character(models$Model)
kable(models) %>% kable_minimal()
| Model | AIC | Slope | Iterations |
|---|---|---|---|
|
113.2108 | -28.1499386 | 7 |
|
114.3282 | -25.9066172 | 7 |
|
115.6736 | -26.1404662 | 7 |
|
124.6907 | -0.9732704 | 6 |
|
125.9090 | -0.9530976 | 6 |
|
176.7322 | -25.7180304 | 4 |
|
179.1653 | -29.2572729 | 4 |
|
197.8611 | -16.2470829 | 4 |
Model 3 is the best performing model. MMSE has the greatest significance with a p-value close to zero. Age has the next greatest significance at p-value = 0.001, and nWBV is moderately significant with p-value = 0.004.
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 67.8811677 | 14.3633465 | 4.726000 | 0.0000023 |
| nwbv | -28.1499386 | 9.8537614 | -2.856771 | 0.0042797 |
| age | -0.1636427 | 0.0510379 | -3.206301 | 0.0013445 |
| educ | -0.2178868 | 0.0970770 | -2.244473 | 0.0248020 |
| mmse | -1.1323991 | 0.2170506 | -5.217213 | 0.0000002 |
nWBV itself is the least effective predictor for dementia in this dataset. MMSE alone is a better predictor than nWBV, and is also slightly better than MMSE and NWBV combined. SES and gender are not significant for models which include both nWBV and MMSE, but in model 8, gender does have some significance with a p-value = 0.036.
s3
##
## Call:
## glm(formula = dementia ~ nwbv + age + educ + mmse, family = binomial())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2137 -0.5755 -0.2432 0.2231 2.2887
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 67.88117 14.36335 4.726 2.29e-06 ***
## nwbv -28.14994 9.85376 -2.857 0.00428 **
## age -0.16364 0.05104 -3.206 0.00134 **
## educ -0.21789 0.09708 -2.244 0.02480 *
## mmse -1.13240 0.21705 -5.217 1.82e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 205.27 on 149 degrees of freedom
## Residual deviance: 103.21 on 145 degrees of freedom
## AIC: 113.21
##
## Number of Fisher Scoring iterations: 7
table(df_v1$dementia) %>%
kable(caption = "Class Bias Check", col.names = c("Dementia", "Freq")) %>%
kable_minimal(full_width=FALSE, position="l")
| Dementia | Freq |
|---|---|
| 0 | 85 |
| 1 | 65 |
There is a small imbalance in the dementia classes, 57% non-dementia and 43% dementia. I’ll sample the observations for training models.
# Create Training Data
input_ones <- df_v1[which(df_v1$dementia == 1), ] # all 1's
input_zeros <- df_v1[which(df_v1$dementia == 0), ] # all 0's
set.seed(99) # for repeatability of samples
# Pick as many 0's as 1's
input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones)) # 1's for training
input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_ones)) # 0's for training.
training_ones <- input_ones[input_ones_training_rows, ]
training_zeros <- input_zeros[input_zeros_training_rows, ]
trainingData <- rbind(training_ones, training_zeros) # row bind the 1's and 0's
# Create Test Data
test_ones <- input_ones[-input_ones_training_rows, ]
test_zeros <- input_zeros[-input_zeros_training_rows, ]
testData <- rbind(test_ones, test_zeros) # row bind the 1's and 0's
table(trainingData$dementia) %>%
kable(caption = "Class Bias Check", col.names = c("Dementia", "Freq")) %>%
kable_minimal(full_width=FALSE, position="l")
| Dementia | Freq |
|---|---|
| 0 | 45 |
| 1 | 45 |
m1 <- glm(dementia ~ nwbv + age + educ + mmse, data = trainingData,
family = binomial(link = "logit"))
predicted_train <- predict(m1, trainingData, type = 'response')
predicted_test <- predict(m1, testData, type = 'response')
summary(m1)
##
## Call:
## glm(formula = dementia ~ nwbv + age + educ + mmse, family = binomial(link = "logit"),
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.17752 -0.70747 -0.08252 0.52700 2.00395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 53.4315 14.7197 3.630 0.000283 ***
## nwbv -19.3602 10.4013 -1.861 0.062699 .
## age -0.1120 0.0551 -2.032 0.042164 *
## educ -0.1915 0.1155 -1.658 0.097284 .
## mmse -0.9924 0.2413 -4.113 3.91e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 124.766 on 89 degrees of freedom
## Residual deviance: 73.939 on 85 degrees of freedom
## AIC: 83.939
##
## Number of Fisher Scoring iterations: 6
pacman::p_load(car)
vif(m1) %>% kable(col.names = "GVIF") %>% kable_minimal(full_width = FALSE, position = 'l')
| GVIF | |
|---|---|
| nwbv | 1.541680 |
| age | 1.837715 |
| educ | 1.044635 |
| mmse | 1.275545 |
m2 <- glm(dementia ~ nwbv, data = trainingData,
family = binomial(link = "logit"))
predicted_train2 <- predict(m2, trainingData, type = 'response')
predicted_test2 <- predict(m2, testData, type = 'response')
summary(m2)
##
## Call:
## glm(formula = dementia ~ nwbv, family = binomial(link = "logit"),
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.67646 -1.08407 0.07664 1.12451 1.53630
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.341 4.836 2.552 0.0107 *
## nwbv -16.842 6.597 -2.553 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 124.77 on 89 degrees of freedom
## Residual deviance: 117.49 on 88 degrees of freedom
## AIC: 121.49
##
## Number of Fisher Scoring iterations: 4
m3 <- glm(dementia ~ mmse, data = trainingData,
family = binomial(link = "logit"))
predicted_train3 <- predict(m3, trainingData, type = 'response')
predicted_test3 <- predict(m3, testData, type = 'response')
summary(m3)
##
## Call:
## glm(formula = dementia ~ mmse, family = binomial(link = "logit"),
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0352 -0.8794 -0.2852 0.5191 1.9091
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 25.2215 5.7776 4.365 1.27e-05 ***
## mmse -0.8956 0.2019 -4.435 9.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 124.77 on 89 degrees of freedom
## Residual deviance: 81.44 on 88 degrees of freedom
## AIC: 85.44
##
## Number of Fisher Scoring iterations: 6
plotROC(testData$dementia, predicted_test)
c1 <- Concordance(testData$dementia, predicted_test)
c1[["Sensitivity"]] <- sensitivity(testData$dementia, predicted_test)
c1[["Specificity"]] <- round(specificity(testData$dementia, predicted_test),3)
c1[["AUROC"]] <- AUROC(testData$dementia, predicted_test)
# Model 2 ROC: dementia ~ nwbv
roc2 <- plotROC(testData$dementia, predicted_test2)
c2 <- Concordance(testData$dementia, predicted_test2)
c2[["Sensitivity"]] <- sensitivity(testData$dementia, predicted_test2)
c2[["Specificity"]] <- round(specificity(testData$dementia, predicted_test2),3)
c2[["AUROC"]] <- AUROC(testData$dementia, predicted_test2)
# Model 3 ROC: dementia ~ mmse
plotROC(testData$dementia, predicted_test3)
c3 <- Concordance(testData$dementia, predicted_test3)
c3[["Sensitivity"]] <- sensitivity(testData$dementia, predicted_test3)
c3[["Specificity"]] <- round(specificity(testData$dementia, predicted_test3),3)
c3[["AUROC"]] <- AUROC(testData$dementia, predicted_test3)
The best model from logistic regression also has the best performance in the training models with AUROC = 95.75%. MMSE alone outperformed nWBV alone, with 91.5% and 63.31% AUROC respectively.
tibble(Statistic = names(c1), Model1 = c1, Model2 = c2, Model3 = c3) %>%
filter(Statistic != "Tied" & Statistic != "Pairs") %>%
kable() %>% kable_minimal(full_width = FALSE, position = 'l')
| Statistic | Model1 | Model2 | Model3 |
|---|---|---|---|
| Concordance | 0.95875 | 0.6225 | 0.87875 |
| Discordance | 0.04125 | 0.3775 | 0.12125 |
| Sensitivity | 0.9 | 0.6 | 0.85 |
| Specificity | 0.9 | 0.65 | 0.825 |
| AUROC | 0.9575 | 0.633125 | 0.915 |
To see how well the models work for categorizing the data to CDR, I’ll create a decision tree model and use k-fold cross validation to estimate the model.
df1 <- df_v1 %>% mutate(cdr = as.factor(cdr)) # factor CDR
n_train <- round(0.7 * nrow(df1)) # 70% of length of main data set as integer
train_indices <- sample(1:nrow(df1), n_train) # create a vector with random indices
train <- df1[train_indices, ] # training data
test <- df1[-train_indices, ] # test data
k <- 5
splitPlan <- kWayCrossValidation(nrow(df1), k, NULL, NULL) # create 5-fold cross validation plan
If MMSE is in the prediction formula then the decision tree uses only MMSE for classification of the data to CDR, with 80% accuracy. Without MMSE the prediction formula of nWBV + age + education gives the best performance with 62.22% accuracy, which is rather poor performance.
formula <- cdr ~ nwbv + age + educ # formula for predicting CDR
opt_cp <- 0 #list with optimal CP parameters
for(i in 1:k) {
split <- splitPlan[[i]]
#training simple decision tree model
model_cv <- rpart(formula = formula,
data = train,
method = "class")
#get the best CP value
opt_cp[i] <- model_cv$cptable[which.min(model_cv$cptable[,"xerror"]),"CP"]
}
#training the model with optimal CP parameter on whole training data set
model_dt <- rpart(formula = formula,
data = train,
method = "class",
cp = mean(opt_cp))
#plot decision tree model
prp(x = model_dt, type=1,
extra = 1,
faclen = 0)
#testing the model
prediction_dt <- predict(object = model_cv,
newdata = test,
type = "class")
#print confusion matrix
caret::confusionMatrix(data = prediction_dt,
reference = test$cdr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 0.5 1
## 0 16 5 1
## 0.5 9 12 2
## 1 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6222
## 95% CI : (0.4654, 0.7623)
## No Information Rate : 0.5556
## P-Value [Acc > NIR] : 0.2276
##
## Kappa : 0.2943
##
## Mcnemar's Test P-Value : 0.2464
##
## Statistics by Class:
##
## Class: 0 Class: 0.5 Class: 1
## Sensitivity 0.6400 0.7059 0.00000
## Specificity 0.7000 0.6071 1.00000
## Pos Pred Value 0.7273 0.5217 NaN
## Neg Pred Value 0.6087 0.7727 0.93333
## Prevalence 0.5556 0.3778 0.06667
## Detection Rate 0.3556 0.2667 0.00000
## Detection Prevalence 0.4889 0.5111 0.00000
## Balanced Accuracy 0.6700 0.6565 0.50000
As suspected normalized whole brain volume (nWBV) by itself is not a good predictor for dementia, it had the poorest AIC = 197.861 of the logistic models. MMSE on it’s own was a better predictor with AIC = 124.691. Adding nWBV, MMSE, age and education together gave the best prediction model with AIC = 113.211.
The training model for nWBV alone also had poor performance with AUROC = 63.312%. MMSE alone had very good performance in the training model with AUROC = 91.5%. The best prediction model also had the best results in the training model with AUROC = 95.75%.
final <- filter(models, substr(Model,1,1) == "3" | substr(Model,1,1) == "5" | substr(Model,1,1) == "4" ) %>%
select(Model, AIC)
final$AUROC <- c(c1$AUROC, c3$AUROC, c2$AUROC )
kable(final) %>% kable_minimal(full_width=FALSE, position="l")
| Model | AIC | AUROC |
|---|---|---|
|
113.2108 | 0.957500 |
|
124.6907 | 0.915000 |
|
197.8611 | 0.633125 |
Research documented for the Alzheimer’s Disease Neuroimaging Initiative (ADNI) shows that imaging plus bio-marker data are better predictors than imaging alone, and that cognitive measures are generally stronger predictors of Alzheimer’s dementia conversion than imaging measures.14 Using measurements of hippocampus volume instead of whole brain volume may give more accurate models, as discussed in findings from the Nun Study.11 Hippocampus measurements were not available in the OASIS data.
I’m pleased with how this project came together, I feel there is thorough analysis of the relationship between several variables and brain volume. I found the investigation of gender to brain volume interesting, particularly the reversal of the distributions for eTIV and nWBV.
I also found the logistic regression and decision tree categorization interesting, I didn’t expect nWBV alone to be the least effective predictor of dementia, although I wasn’t surprised that MMSE was better than nWBV. If I were to continue with this project, I’d try other categorization methods to see if they improved the prediction of CDR. And I’d like see to how hippocampus measurements and biomarker data would work in prediction models.