The setup section simply installs the packages and reads the dataset from the local machine. If you would like to run this code on your own device, download the CSV file from https://www.kaggle.com/brsdincer/alzheimer-features
Note that the path may have to be changed in order to load the file into RStudio.
The plotting function helps automate plotting the data. It makes two plots, one scatter plot (geom_point) and one jitter plot (geom_jitter), and fits a linear model on both. A linear model is not a good fit in every case, but we can assess the quality of the fit by looking at, for example, the Q-Q plot. Both plots may be valuable for understanding the data: the scatter plot gives a more accurate depiction of the data, whereas the jitter plot captures the clustering of the data.
This plotting function was mainly used in the Extra analyses section which can be accessed via the source code.
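As a rough illustration of the kind of helper described above, here is a minimal sketch. The function name `plot_scatter_and_jitter`, its arguments, and the jitter widths are assumptions made for this example and are not taken from the source code; the built-in `mtcars` data stands in for `alz`.

```r
library(ggplot2)

# Hypothetical helper (not the author's function): draws the same x/y pair
# as a scatter plot and as a jitter plot, each with a linear fit overlaid.
plot_scatter_and_jitter <- function(df, x, y) {
  base <- ggplot(df, aes(x = .data[[x]], y = .data[[y]])) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = x, y = y)
  list(
    scatter = base + geom_point() +
      ggtitle(paste(y, "vs", x, "(scatter)")),
    jitter = base + geom_jitter(width = 0.2, height = 0.2) +
      ggtitle(paste(y, "vs", x, "(jitter)"))
  )
}

# Usage on a built-in dataset, standing in for alz
plots <- plot_scatter_and_jitter(mtcars, "cyl", "mpg")
plots$scatter
plots$jitter

# Q-Q plot of the residuals, to judge whether the linear fit is reasonable
fit <- lm(mpg ~ cyl, data = mtcars)
qqnorm(resid(fit))
qqline(resid(fit))
```

Returning both plots in a list keeps the two views side by side, so the exact positions (scatter) and the density of overlapping points (jitter) can be compared for the same pair of variables.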
First we have a basic summary of the dataset. It looks at different demographics, how many people fall into those categories, and the mean, maximum, and minimum values of certain numerical measures.
Looking at the different classifications based on sex, upon first glance it may seem that there is a relatively even distribution of males and females across each dementia group. However, with a Chi-squared test for homogeneity, we get p = 3.346e-06, which is statistically significant. Looking at the visual, we can glean that there is an unequal distribution of sex across the dementia groupings and that, according to this dataset, males are more likely to be in the demented group whereas females are more likely to be in the nondemented group.
# Demographic summary, computed on complete cases only
demog_df <- alz %>%
  na.omit() %>%
  select(Group, AGE, SEX, EDUC, SES) %>%
  summarize(total_count = n(),
            num_males = sum(SEX == "M"),
            num_females = sum(SEX == "F"),
            num_demented = sum(Group == "Demented"),
            num_nondemented = sum(Group == "Nondemented"),
            num_converted = sum(Group == "Converted"),
            mean_age = mean(AGE),
            min_age = min(AGE),
            max_age = max(AGE),
            mean_ses = mean(SES),
            min_ses = min(SES),
            max_ses = max(SES),
            mean_edu = mean(EDUC),
            min_edu = min(EDUC),
            max_edu = max(EDUC)) %>%
  pivot_longer(everything(), values_to = "Value", names_to = "Type")
demog_df[,-1] <- round(demog_df[,-1],0)
demog_df
## # A tibble: 15 x 2
## Type Value
## <chr> <dbl>
## 1 total_count 354
## 2 num_males 150
## 3 num_females 204
## 4 num_demented 127
## 5 num_nondemented 190
## 6 num_converted 37
## 7 mean_age 77
## 8 min_age 60
## 9 max_age 98
## 10 mean_ses 2
## 11 min_ses 1
## 12 max_ses 5
## 13 mean_edu 15
## 14 min_edu 6
## 15 max_edu 23
alz_group_sex <- alz %>%
  group_by(Group, SEX) %>%
  summarize(total = n())
ggplot(data = alz_group_sex, aes(x = Group, y = total, fill = SEX)) +
  geom_bar(position = "stack", stat = "identity") +
  labs(title = "Distribution of sex across dementia groups") +
  xlab("Dementia Group") + ylab("Count")
alz_gs_chi <- data.frame(c(86, 60), c(61, 129), c(13, 24))
rownames(alz_gs_chi) <- c('M','F')
colnames(alz_gs_chi) <- c('Demented','Nondemented', 'Converted')
chisq.test(alz_gs_chi)
##
## Pearson's Chi-squared test
##
## data: alz_gs_chi
## X-squared = 25.216, df = 2, p-value = 3.346e-06
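The contingency table above is typed in by hand. As a sketch of an alternative, the same test can be run on a table built with `table()`, and the expected counts and standardized residuals can be inspected to see which cells drive the result. The `counts` and `toy` objects below are hypothetical stand-ins rebuilt from the counts shown above; with the real data, `table(alz$SEX, alz$Group)` would presumably be used instead.

```r
# Rebuild one row per observation from the counts used above (illustration
# only; with the real data the table would come straight from alz).
counts <- data.frame(
  SEX   = rep(c("M", "F"), times = 3),
  Group = rep(c("Demented", "Nondemented", "Converted"), each = 2),
  n     = c(86, 60, 61, 129, 13, 24)
)
toy <- counts[rep(seq_len(nrow(counts)), counts$n), c("SEX", "Group")]

sex_by_group <- table(toy$SEX, toy$Group)
res <- chisq.test(sex_by_group)

res$statistic            # X-squared, matching the value reported above
round(res$expected, 1)   # counts expected if sex were distributed evenly
round(res$stdres, 2)     # standardized residuals for each cell
```

Cells with large positive standardized residuals hold more observations than an even distribution would predict, which gives a numerical counterpart to the pattern seen in the bar chart.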
This output of the ggpairs function may seem overwhelming, but it is extremely valuable in what it can tell us about the data. For the sake of being concise, we will only look at a select subset of variables and their correlations.
We do not see a statistically significant correlation between age and MMSE or CDR scores.
We do see a statistically significant correlation between education years and both MMSE and CDR scores. The correlation for MMSE is positive (what we would expect, since more education is associated with better scores and less severe dementia symptoms), and the correlation for CDR is negative (which is also intuitive, since lower CDR scores are indicative of less severe dementia symptoms).
We see a statistically significant correlation between socioeconomic status and MMSE, but not between SES and CDR score. This is interesting: a higher MMSE score indicates a less demented outcome, yet MMSE score is negatively correlated with SES, meaning the higher the SES score (the better off), the lower the MMSE score, which may seem counterintuitive.
We see a correlation between nWBV and both MMSE and CDR scores. For MMSE, it is positive, suggesting a higher volume score on the nWBV is related to better outcomes on the MMSE (higher MMSE score) and for CDR, it is negative, suggesting a higher volume score is related to better outcomes on the CDR (lower CDR score).
Interestingly enough, there is no correlation between eTIV and MMSE or CDR, nor between ASF and MMSE or CDR. Since these are also measures of brain volume, it is counterintuitive that nWBV should have a statistically significant correlation with our outcome measures while the other measures of brain volume do not.
ggpairs(alz)
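To check a single pair from the matrix in more detail, `cor.test` reports the correlation together with its p-value and confidence interval. The sketch below runs on simulated values; the `toy` data frame and the numbers used to generate it are assumptions for illustration, not the real `alz` data.

```r
set.seed(42)

# Simulated stand-in for two columns of alz (hypothetical values)
n <- 200
educ <- round(rnorm(n, mean = 15, sd = 3))
mmse <- pmin(30, round(24 + 0.25 * educ + rnorm(n, sd = 3)))
toy <- data.frame(EDUC = educ, MMSE = mmse)

ct <- cor.test(toy$EDUC, toy$MMSE, method = "pearson")
ct$estimate   # Pearson correlation coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation
ct$conf.int   # 95% confidence interval for the correlation
```

With the real data, the same call on `alz$EDUC` and `alz$MMSE` would be expected to reproduce the coefficient shown in the corresponding ggpairs panel, provided rows with missing values are handled the same way.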
We will run a multiple regression looking at outcomes of MMSE based on the variables SES, education, sex, and dementia group.
Looking below, only group has a significant effect on MMSE in the full model; however, sex and education both have significant effects when group is not a factor. So let’s split up the data by group and analyze each subset, since we already know there is a statistically significant difference in the distribution of sex across groups.
model_full <- lm(MMSE ~ SES + EDUC + SEX + Group, data = alz)
model_no_group <- lm(MMSE ~ SES + EDUC + SEX, data = alz)
summary(model_full)
##
## Call:
## lm(formula = MMSE ~ SES + EDUC + SEX + Group, data = alz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.4276 -0.9232 0.4215 0.9841 5.7881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.620071 1.565241 17.646 < 2e-16 ***
## SES 0.020063 0.203299 0.099 0.921
## EDUC 0.066247 0.079353 0.835 0.404
## SEXM -0.009236 0.329483 -0.028 0.978
## GroupDemented -4.263301 0.572774 -7.443 7.78e-13 ***
## GroupNondemented 0.558042 0.537294 1.039 0.300
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.916 on 348 degrees of freedom
## (19 observations deleted due to missingness)
## Multiple R-squared: 0.3919, Adjusted R-squared: 0.3832
## F-statistic: 44.85 on 5 and 348 DF, p-value: < 2.2e-16
summary(model_no_group)
##
## Call:
## lm(formula = MMSE ~ SES + EDUC + SEX, data = alz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0712 -0.4343 1.1261 2.1518 4.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.76286 1.90063 13.029 < 2e-16 ***
## SES -0.08578 0.24482 -0.350 0.726266
## EDUC 0.23264 0.09617 2.419 0.016068 *
## SEXM -1.32810 0.38938 -3.411 0.000723 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.604 on 350 degrees of freedom
## (19 observations deleted due to missingness)
## Multiple R-squared: 0.06561, Adjusted R-squared: 0.0576
## F-statistic: 8.192 on 3 and 350 DF, p-value: 2.768e-05
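Because the two models are nested, they can also be compared directly with an F-test via `anova()`. This is a sketch on simulated data: the `toy` data frame, its effect sizes, and the model names `m_small` and `m_full` are assumptions for illustration only.

```r
set.seed(1)

# Simulated stand-in for alz (hypothetical values, not the real data)
n <- 300
toy <- data.frame(
  SES   = sample(1:5, n, replace = TRUE),
  EDUC  = sample(6:23, n, replace = TRUE),
  SEX   = sample(c("F", "M"), n, replace = TRUE),
  Group = sample(c("Converted", "Demented", "Nondemented"), n, replace = TRUE)
)
toy$MMSE <- 28 - 4 * (toy$Group == "Demented") + rnorm(n, sd = 2)

m_small <- lm(MMSE ~ SES + EDUC + SEX, data = toy)
m_full  <- lm(MMSE ~ SES + EDUC + SEX + Group, data = toy)

# F-test: does adding Group explain significantly more variance?
cmp <- anova(m_small, m_full)
cmp
```

`anova()` requires both models to be fit to the same rows. In the output above both models dropped the same number of observations (19), so the comparison should carry over to `model_no_group` and `model_full`.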
Here we make 3 separate datasets, one for individuals of each group.
alz.d <- filter(alz, Group == "Demented")
alz.n <- filter(alz, Group == "Nondemented")
alz.c <- filter(alz, Group == "Converted")
For the sake of being concise, we will only look at a select subset of variables and their correlations.
For the demented group, the only statistically significant correlations involving outcome measures are CDR with SES (a negative correlation, which is what we expect given the nature of the measures) and both MMSE and CDR with nWBV (positive and negative correlations, respectively, which is what is expected).
Interestingly enough, for the nondemented group, there are no correlations between any of the outcome measures (MMSE, CDR) and the factors (all other variables). A possible explanation is that once individuals are classified as nondemented, the other factors (age, sex, brain volume measures) have no significant impact on how well they do on the outcome measures (MMSE, CDR). Perhaps once one is classified as nondemented, a higher score on these outcome measures is arbitrary (i.e. once you reach a certain score to be nondemented, a better score is not really a significant measure of higher brain health and functioning).
For the converted group, there are no correlations between any of the outcome measures (MMSE, CDR) and the factors (all other variables). This may be attributed to the small sample size of the group (n = 37), or to the unstable nature of the group (were previously assessed as demented but then the assessment was later changed to nondemented).
# ggpairs for demented group
alz.d %>% select(-Group) %>%
  ggpairs()
# ggpairs for nondemented group
alz.n %>% select(-Group) %>%
  ggpairs()
# ggpairs for converted group
alz.c %>% select(-Group) %>%
  ggpairs()
We will use a decision tree algorithm, given by the rpart function, to create a predictive model for both MMSE and CDR scores based on a subset of the factors (those that have previously shown statistically significant correlations with the outcome measures). To read a plot like this, start at a branch with a boolean statement (a question). Answering ‘yes’ or ‘true’ brings you to the left branch, and ‘no’ or ‘false’ brings you to the right branch. The bottom leaves are predicted outcomes, with the topmost number being the predicted score on the given outcome.
A quick note on importance: “You can view the importance of each variable in the model by referencing the variable.importance attribute of the resulting rpart object. From the rpart documentation, ‘An overall measure of variable importance is the sum of the goodness of split measures for each split for which it was the primary variable…’” (source: https://www.gormanalysis.com/blog/decision-trees-in-r-using-rpart/). Thus importance shows us a weighted measure of how much each factor has an effect on the outcome measure.
Here are some things of note that can be gleaned from the tree decision model: For the MMSE tree:
Levels 1, 2, 3, 5 are expected (those nondemented/converted and those demented with higher brain volume will have better MMSE scores).
Level 4 should be noted. A younger age (i.e. less than 70) was a predictive factor that led to a lower MMSE score. This may seem counterintuitive, but it may make sense that those participating in the study who are younger may have more acute symptoms (which led them to participate in the study). This same explanation is proposed for the split on level 9, where ages less than 81 have a lower predicted MMSE score.
For the CDR tree:
We can see the algorithm grouped the converted and nondemented individuals together in level 1, but split them out in a later branch, suggesting there is a difference between those initially regarded as nondemented and those who only later were regarded as nondemented (i.e. the converted group). We see that those in the nondemented group had the lowest (best) expected score, while the converted group was predicted to have a higher (worse) score.
If the individual is not placed in the demented group, we see that the only factor that predicts CDR score is grouping. This aligns with the correlation analysis using ggpairs: in the datasets split out by group, the converted and nondemented groups showed no correlations between any of the outcome measures (MMSE, CDR) and the factors (all other variables).
If one is classified as demented, we see that a higher brain volume (measured by nWBV) predicts a better CDR score which is to be expected.
Similar to what was seen in the MMSE tree, those with older ages were actually predicted to have better CDR scores (see level 7). As explained in the MMSE tree, this may be sampling bias (i.e. the younger participants participated in this study due to acute symptoms).
Looking at level 14, we see that being male is a predictor for a lower (better) CDR score. This is extremely interesting since in a previous analysis of demographics, we found that males were more likely to be in the demented group whereas females were more likely to be in the nondemented group. Since this branch is already looking only at demented patients, though, the two findings are compatible: it appears that females in this study are more likely to be nondemented, but within the group of demented patients, males tended to have better outcomes (less severe dementia symptoms) than females.
We see that group by far has the highest importance, which makes sense since the classification of each individual into a group was based on the outcome measures of MMSE and CDR.
Taking out the group factor and re-running the decision tree algorithm was attempted, but it was apparent that no other factor had a strong enough effect on the outcome variables (MMSE, CDR), so the resulting model was very weak, with low predictive power.
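One way to put a number on how weak or strong a tree is would be the complexity table that rpart stores on the fitted object, which includes a cross-validated error. The sketch below uses the built-in `mtcars` data as a stand-in for `alz`; the formula and variable names are placeholders, not the models fitted in this analysis.

```r
library(rpart)

set.seed(1)  # the cross-validation inside rpart is random

# Built-in data standing in for alz
fit <- rpart(mpg ~ cyl + disp + hp + wt, data = mtcars)

# One row per tree size: "rel error" is the training error and "xerror"
# the cross-validated error, both relative to a tree with no splits.
cp <- fit$cptable
cp

# Rough share of variance explained under cross-validation
1 - min(cp[, "xerror"])
```

An `xerror` that stays close to 1 means the tree predicts hardly better than the overall mean, which is the kind of result that would mark a model as having low predictive power.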
set.seed(1)
tree_mmse <- rpart(MMSE ~ Group + SEX + AGE + EDUC + SES + nWBV, alz)
tree_cdr <- rpart(CDR ~ Group + SEX + AGE + EDUC + SES + nWBV, alz)
fancyRpartPlot(tree_mmse, main = "MMSE tree plot based on Group, Sex, Age, Education, SES, and nWBV", palettes = "GnBu")
fancyRpartPlot(tree_cdr, main = "CDR tree plot based on Group, Sex, Age, Education, SES, and nWBV", palettes = "OrRd")
# MMSE importance
tree_mmse$variable.importance
## Group nWBV AGE EDUC SEX SES
## 1882.78686 883.91788 690.21739 219.67098 183.25935 77.04816
# CDR importance
tree_cdr$variable.importance
## Group nWBV EDUC SEX AGE SES
## 36.664282 5.226756 4.002895 3.548081 1.398588 1.236868
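The raw importance values are on different scales for the two trees, so they are easier to compare once rescaled to percentages of the total. A minimal sketch, again on the built-in `mtcars` data rather than `alz` (the formula is a placeholder):

```r
library(rpart)

# Built-in data standing in for alz
fit <- rpart(mpg ~ cyl + disp + hp + wt, data = mtcars)

# Rescale importance so the values sum to 100
imp <- fit$variable.importance
imp_pct <- round(100 * imp / sum(imp), 1)
imp_pct
```

Applied to `tree_mmse$variable.importance` and `tree_cdr$variable.importance`, the same rescaling would show the share of total importance carried by Group in each tree.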
For additional analyses not included in this page, please see the source code.
Conclusions have been made for each individual analysis. A more holistic look at all of the analyses and a summary of what we gleaned can be found in the Discussion section of the paper.