HW-3_Alzheimer-features

library(readr)
library(psych)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(readxl)
library(tidyr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:psych':
## 
##     logit

library(ggpubr)

Alzheimer <- read_csv("~/Library/CloudStorage/OneDrive-Personal/IMB/Multivariate analysis/ALzheimer features - R/alzheimer.csv")

## Rows: 373 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, M/F
## dbl (8): Age, EDUC, SES, MMSE, CDR, eTIV, nWBV, ASF
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Alzheimer)

## # A tibble: 6 × 10
##   Group       `M/F`   Age  EDUC   SES  MMSE   CDR  eTIV  nWBV   ASF
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Nondemented M        87    14     2    27   0    1987 0.696 0.883
## 2 Nondemented M        88    14     2    30   0    2004 0.681 0.876
## 3 Demented    M        75    12    NA    23   0.5  1678 0.736 1.05 
## 4 Demented    M        76    12    NA    28   0.5  1738 0.713 1.01 
## 5 Demented    M        80    12    NA    22   0.5  1698 0.701 1.03 
## 6 Nondemented F        88    18     3    28   0    1215 0.71  1.44

I downloaded this data set from the site Kaggle (https://www.kaggle.com/datasets/brsdincer/alzheimer-features), published under the title ” Alzheimer Features For Analysis”.

I decided to exclude the socioeconomic status (SES) variable from my firther analysis and check if we have any Not Applicable or duplicated values which I would then exclude from further statistics. For further analysis I am using adapted data set with dropped 21 NA variables that were included in the original data set.

AlzheimerData <- drop_na(Alzheimer)

colnames(AlzheimerData)[2]<- c("Gender")
head(AlzheimerData)

## # A tibble: 6 × 10
##   Group       Gender   Age  EDUC   SES  MMSE   CDR  eTIV  nWBV   ASF
##   <chr>       <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Nondemented M         87    14     2    27   0    1987 0.696 0.883
## 2 Nondemented M         88    14     2    30   0    2004 0.681 0.876
## 3 Nondemented F         88    18     3    28   0    1215 0.71  1.44 
## 4 Nondemented F         90    18     3    27   0    1200 0.718 1.46 
## 5 Nondemented M         80    12     4    28   0    1689 0.712 1.04 
## 6 Nondemented M         83    12     4    29   0.5  1701 0.711 1.03

Description of the data - data set AlzheimerData:

In this dataset we have 371 observations of 9 variables.

Group –> Class (Demented, Nondemented, Converted)

Gender –> M -> Male, F -> Female

Age –> Age

EDUC –> Years of Education

SES - Socio Economic Status (1 - low, 2 - medium, 3 - high, 4 - prestige)

MMSE –> Mini Mental State Examination: The Mini‐Mental State Examination (Folstein 1975), or MMSE, is a simple pen‐and‐paper test of cognitive function based on a total possible score of 30 points; it includes tests of orientation, concentration, attention, verbal memory, naming and visuospatial skills.The maximum score on the test is 30 and scores lower than 23 points is a cutpoint for dementia.

CDR –> Clinical Dementia Rating: The Clinical Dementia Rating (CDR®) Dementia Staging Instrument in one aspect is a 5-point scale used to characterize six domains of cognitive and functional performance applicable to Alzheimer disease and related dementias: Memory, Orientation, Judgment & Problem Solving, Community Affairs, Home & Hobbies, and Personal Care. The necessary information to make each rating is obtained through a semi-structured interview of the patient and a reliable informant or collateral source (e.g., family member) referred to as the CDR® Assessment Protocol. Ratings are assigned on a 0–5 point scale, (0 = absent; 0.5 = questionable; 1= present, but mild; 2 = moderate; 3 = severe; 4 = profound; 5 = terminal).

eTIV –> Estimated Total Tntracranial Volume: Intracranial volume (ICV) is an important normalization measure used in morphometric analyses to correct for head size in studies of Alzheimer Disease (AD).It is usually measured with an MRI.Measured im mililiters.

nWBV –> Normalized Whole Brain Volume: Measuring total intracranial volume (TIV) allows whole-brain and regional volumetric measures to be normalized for head size. The TIV can be defined as the volume within the cranium, including the brain, meninges, and CSF (Cerebral spinal fluid).

ASF –> Atlas Scaling Factor: The Atlas Scaling Factor (ASF) was computed as the determinant of the affine transform connecting each individual to the atlas-representative template. The ASF represents the whole-brain volume expansion (or contraction) required to register each individual to the template.

The research question regarding this data set that I am thinking of using also in further analysis is How these variables are connected and wether there is a correlation between wether the individual is in group Demented, Nondemented or Converted

# With this code I changed all non-numeric variables in the data set into factors which were in this case Group and Gender.

AlzheimerData1 <- as.data.frame(unclass(AlzheimerData),
                                stringsAsFactors = TRUE)

str(AlzheimerData1)

## 'data.frame':    354 obs. of  10 variables:
##  $ Group : Factor w/ 3 levels "Converted","Demented",..: 3 3 3 3 3 3 3 3 3 2 ...
##  $ Gender: Factor w/ 2 levels "F","M": 2 2 1 1 2 2 2 1 1 2 ...
##  $ Age   : num  87 88 88 90 80 83 85 93 95 68 ...
##  $ EDUC  : num  14 14 18 18 12 12 12 14 14 12 ...
##  $ SES   : num  2 2 3 3 4 4 4 2 2 2 ...
##  $ MMSE  : num  27 30 28 27 28 29 30 30 29 27 ...
##  $ CDR   : num  0 0 0 0 0 0.5 0 0 0 0.5 ...
##  $ eTIV  : num  1987 2004 1215 1200 1689 ...
##  $ nWBV  : num  0.696 0.681 0.71 0.718 0.712 0.711 0.705 0.698 0.703 0.806 ...
##  $ ASF   : num  0.883 0.876 1.444 1.462 1.039 ...

library(ggplot2)
#install.packages("ggcorrplot")
library(ggcorrplot)

## 
## Attaching package: 'ggcorrplot'

## The following object is masked from 'package:rstatix':
## 
##     cor_pmat

# Example data set
data("Alzheimerdata1")

## Warning in data("Alzheimerdata1"): data set 'Alzheimerdata1' not found

# Compute correlation matrix
cor_matrix <- cor(AlzheimerData1[ , c(-1, -2)])

# Plot the heatmap using ggcorrplot library
ggcorrplot(cor_matrix, method="square", hc.order = TRUE, type = "lower",
           lab = TRUE, lab_size = 3, colors = "lightblue")

To get a general idea about how the variables, I made a correlation matrix on all. In the next steps I only included the variables I worked with.

Research question: Can and if yes, how can MMSE score be explained with variables such as age, eTIV and demented state group (variable “Group”)?

In the literature research that is not the most recent one, the MMSE score was explained with age, educational level and sex*. I decided to research the correlation of age, sex (Gender)and education firstly because of the secondaria data research and then additionally the status of dementia (Group) since the Mini-Mental state exam is one of the exams and tests we can use to diagnose the dementia state so consequently if the patient is considered demented, if would be expected that has an effect on MMSE score. Additionally I checked the conection with eTIV, since one of the symptoms of dementia is that the brain volumen decreases as the disease progresses.

*Mendes, L. P. D. S., Malta, F. F., Ennes, T. D. O., Ribeiro-Samora, G. A., Dias, R. C., Rocha, B. L. C., Rodrigues, M., Borges, L. F., & Parreira, V. F. (2019). Prediction equation for the mini-mental state examination: influence of education, age, and sex. Fisioterapia E Pesquisa, 26(1), 37–43. https://doi.org/10.1590/1809-2950/17030126012019

library(car)
scatterplotMatrix(AlzheimerData1[ , c(-1, -2, -5, -7, -10, -9, -11)], 
                  smooth = FALSE)

AlzheimerData1$GenderFactor <- factor(AlzheimerData1$Gender,
                    levels = c("M", "F"),
                    labels = c("0", "1"))

AlzheimerDataCOR <- AlzheimerData1[ , c(-5, -2, -7, -9, -10)]

head(AlzheimerDataCOR)

##         Group Age EDUC MMSE eTIV GenderFactor
## 1 Nondemented  87   14   27 1987            0
## 2 Nondemented  88   14   30 2004            0
## 3 Nondemented  88   18   28 1215            1
## 4 Nondemented  90   18   27 1200            1
## 5 Nondemented  80   12   28 1689            0
## 6 Nondemented  83   12   29 1701            0

Dscription of used variables in the model (data: AlzheimerDataCOR)

The dependent variable: MMSE: Mini Mental State Examination The Mini‐Mental State Examination (Folstein 1975), or MMSE, is a simple pen‐and‐paper test of cognitive function based on a total possible score of 30 points; it includes tests of orientation, concentration, attention, verbal memory, naming and visuospatial skills.The maximum score on the test is 30 and scores lower than 23 points is a cutpoint for dementia.

Explanatory variables: Group: Wether the participant is described as demented, converted or nondemented

Age: Age of participant

EDUC: Years of education

eTIV: Estimated intracranial volume

GenderFactor: Sex of participant

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(AlzheimerDataCOR[c(-1, -6)]))

##        Age  EDUC  MMSE  eTIV
## Age   1.00 -0.02  0.06  0.04
## EDUC -0.02  1.00  0.18  0.27
## MMSE  0.06  0.18  1.00 -0.02
## eTIV  0.04  0.27 -0.02  1.00
## 
## n= 354 
## 
## 
## P
##      Age    EDUC   MMSE   eTIV  
## Age         0.6395 0.2944 0.4511
## EDUC 0.6395        0.0005 0.0000
## MMSE 0.2944 0.0005        0.7155
## eTIV 0.4511 0.0000 0.7155

fit1 <- lm(MMSE ~  Age + eTIV + Group + EDUC + GenderFactor,
           data = AlzheimerDataCOR)

summary(fit1)

## 
## Call:
## lm(formula = MMSE ~ Age + eTIV + Group + EDUC + GenderFactor, 
##     data = AlzheimerDataCOR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2791  -0.8976   0.3421   1.0905   5.9834 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      28.715919   2.435702  11.790  < 2e-16 ***
## Age               0.007930   0.020117   0.394    0.694    
## eTIV             -0.001156   0.001134  -1.020    0.309    
## GroupDemented    -4.222506   0.565343  -7.469 6.61e-13 ***
## GroupNondemented  0.644824   0.530132   1.216    0.225    
## EDUC              0.075301   0.057438   1.311    0.191    
## GenderFactor1    -0.231140   0.404725  -0.571    0.568    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.915 on 347 degrees of freedom
## Multiple R-squared:  0.3939, Adjusted R-squared:  0.3834 
## F-statistic: 37.58 on 6 and 347 DF,  p-value: < 2.2e-16

Based on F-statistics we can reject H0 at p < 0.001. We found a linear relationship between dependent and at least one explanatory variable. Model is “good”.

Multiple.R2 is 0.39 –> 39% of variability of MMSE is explained by linear effect of Age, eTIV, Education, Gender and dementia state variable Group.

sqrt(summary(fit1)$r.squared)

## [1] 0.6275875

#I will now additionally test multicolinearity of the variables used.
vif(fit1)

##                  GVIF Df GVIF^(1/(2*Df))
## Age          1.025755  1        1.012796
## eTIV         1.649366  1        1.284276
## Group        1.207450  2        1.048256
## EDUC         1.148938  1        1.071885
## GenderFactor 1.665947  1        1.290716

mean(vif(fit1))

## [1] 1.227026

Conclusion: All are bellow square root of 5 - which means they are not so strongly correlated beetween eachother.

AlzheimerDataCOR$StdResid <- round(rstandard(fit1), 3) #To get standardized residuals - rounded to 3 decimal numbers
AlzheimerDataCOR$CooksD <- round(cooks.distance(fit1), 3) #We use Cooks distances to find units that have a big impact

hist(AlzheimerDataCOR$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

hist(AlzheimerDataCOR$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

H0: The graph IS normally distributed H1:The graph IS NOT normally distributed

shapiro.test(AlzheimerDataCOR$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  AlzheimerDataCOR$StdResid
## W = 0.84873, p-value < 2.2e-16

We can reject the H0 at p < 0.05, we see that the graph is not normally distributed and there are some residuals outside of normal distribution which we will find in next step.

head(AlzheimerDataCOR[order(AlzheimerDataCOR$StdResid),], 10)

##        Group Age EDUC MMSE eTIV GenderFactor StdResid CooksD
## 96  Demented  69   16    4 1701            0   -7.009  0.108
## 95  Demented  68   16    7 1714            0   -5.968  0.085
## 84  Demented  69   14   15 1331            0   -3.312  0.033
## 240 Demented  78   14   15 1401            1   -3.216  0.019
## 21  Demented  77   16   16 1590            0   -2.922  0.014
## 166 Demented  73   12   16 1478            1   -2.779  0.018
## 165 Demented  71   12   16 1492            1   -2.771  0.020
## 213 Demented  71   16   17 1562            0   -2.573  0.012
## 101 Demented  72   12   17 1483            0   -2.505  0.012
## 223 Demented  65   12   17 1520            0   -2.477  0.016

head(AlzheimerDataCOR[order(-AlzheimerDataCOR$StdResid),], 10)

##        Group Age EDUC MMSE eTIV GenderFactor StdResid CooksD
## 132 Demented  61   18   30 1957            0    2.081  0.029
## 12  Demented  66   12   30 1447            1    2.073  0.013
## 133 Demented  62   18   30 1928            0    2.061  0.025
## 325 Demented  75   16   30 1891            0    2.046  0.015
## 296 Demented  73   12   30 1343            1    2.006  0.008
## 198 Demented  66   12   30 1446            0    1.991  0.010
## 168 Demented  68   15   30 1556            0    1.946  0.007
## 269 Demented  75   18   30 1651            0    1.889  0.008
## 208 Demented  86   15   30 1498            0    1.877  0.008
## 118 Demented  78   14   30 1315            0    1.856  0.010

head(AlzheimerDataCOR[order(-AlzheimerDataCOR$CooksD),], 10)

##         Group Age EDUC MMSE eTIV GenderFactor StdResid CooksD
## 96   Demented  69   16    4 1701            0   -7.009  0.108
## 95   Demented  68   16    7 1714            0   -5.968  0.085
## 84   Demented  69   14   15 1331            0   -3.312  0.033
## 132  Demented  61   18   30 1957            0    2.081  0.029
## 133  Demented  62   18   30 1928            0    2.061  0.025
## 165  Demented  71   12   16 1492            1   -2.771  0.020
## 222  Demented  62   12   17 1525            0   -2.471  0.019
## 240  Demented  78   14   15 1401            1   -3.216  0.019
## 110 Converted  87   18   24 1275            1   -1.768  0.018
## 166  Demented  73   12   16 1478            1   -2.779  0.018

I cleaned the standardised residuals and Cooks distance and corrected the graphs in next step to get the corrected fit2.

AlzheimerDataCOR <- AlzheimerDataCOR[AlzheimerDataCOR$StdResid >= -3,]
head(AlzheimerDataCOR[order(AlzheimerDataCOR$StdResid),], 10)

##        Group Age EDUC MMSE eTIV GenderFactor StdResid CooksD
## 21  Demented  77   16   16 1590            0   -2.922  0.014
## 166 Demented  73   12   16 1478            1   -2.779  0.018
## 165 Demented  71   12   16 1492            1   -2.771  0.020
## 213 Demented  71   16   17 1562            0   -2.573  0.012
## 101 Demented  72   12   17 1483            0   -2.505  0.012
## 223 Demented  65   12   17 1520            0   -2.477  0.016
## 222 Demented  62   12   17 1525            0   -2.471  0.019
## 319 Demented  80   12   17 1755            0   -2.424  0.016
## 306 Demented  75   12   18 1479            1   -2.092  0.010
## 131 Demented  82    8   18 1464            1   -2.025  0.017

AlzheimerDataCOR <- AlzheimerDataCOR[AlzheimerDataCOR$CooksD <= 0.084,]
head(AlzheimerDataCOR[order(-AlzheimerDataCOR$CooksD),], 10)

##         Group Age EDUC MMSE eTIV GenderFactor StdResid CooksD
## 132  Demented  61   18   30 1957            0    2.081  0.029
## 133  Demented  62   18   30 1928            0    2.061  0.025
## 165  Demented  71   12   16 1492            1   -2.771  0.020
## 222  Demented  62   12   17 1525            0   -2.471  0.019
## 110 Converted  87   18   24 1275            1   -1.768  0.018
## 166  Demented  73   12   16 1478            1   -2.779  0.018
## 131  Demented  82    8   18 1464            1   -2.025  0.017
## 223  Demented  65   12   17 1520            0   -2.477  0.016
## 319  Demented  80   12   17 1755            0   -2.424  0.016
## 325  Demented  75   16   30 1891            0    2.046  0.015

fit2 <- lm(MMSE ~  Age + eTIV + Group + EDUC + GenderFactor,
           data = AlzheimerDataCOR)

# Here I again ckecked the cleaned standardised residuals and Cook's distances and showed them in a histogram.
AlzheimerDataCOR$StdResid <- round(rstandard(fit2), 3) 
AlzheimerDataCOR$CooksD <- round(cooks.distance(fit2), 3)

hist(AlzheimerDataCOR$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

hist(AlzheimerDataCOR$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

H0: The graph IS normally distributed H1:The graph IS NOT normally distributed

shapiro.test(AlzheimerDataCOR$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  AlzheimerDataCOR$StdResid
## W = 0.92232, p-value = 1.7e-12

We can reject the H0 at p < 0.05.

AlzheimerDataCOR$StdFittedValues <- scale(fit2$fitted.values)

scatterplot(y = AlzheimerDataCOR$StdResid, x = AlzheimerDataCOR$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

H0: Variance is constant (homoskedasticity) H1: Variance is not constant

I checked that with Breusch Pagan Test for Heteroskedasticity.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit2)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data               
##  --------------------------------
##  Response : MMSE 
##  Variables: fitted values of MMSE 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    212.1317 
##  Prob > Chi2   =    4.707554e-48

In this model we have heteroskedasticity this is why in the last step for corrected model I will use lm_robust function that will correct standard errors.

library(estimatr)
fit3 <- lm_robust(MMSE ~  Age + eTIV + Group + EDUC + GenderFactor,
           data = AlzheimerDataCOR)
summary(fit3)

## 
## Call:
## lm_robust(formula = MMSE ~ Age + eTIV + Group + EDUC + GenderFactor, 
##     data = AlzheimerDataCOR)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                    Estimate Std. Error t value  Pr(>|t|)  CI Lower  CI Upper
## (Intercept)      29.2205191   1.944796 15.0250 1.468e-39 25.395291 33.045748
## Age              -0.0088187   0.016645 -0.5298 5.966e-01 -0.041557  0.023919
## eTIV             -0.0007766   0.000996 -0.7798 4.361e-01 -0.002736  0.001182
## GroupDemented    -3.8137156   0.422494 -9.0267 1.296e-17 -4.644721 -2.982710
## GroupNondemented  0.5945651   0.278395  2.1357 3.341e-02  0.046989  1.142142
## EDUC              0.0961349   0.038976  2.4665 1.413e-02  0.019474  0.172796
## GenderFactor1    -0.2996906   0.380448 -0.7877 4.314e-01 -1.047995  0.448614
##                   DF
## (Intercept)      343
## Age              343
## eTIV             343
## GroupDemented    343
## GroupNondemented 343
## EDUC             343
## GenderFactor1    343
## 
## Multiple R-squared:  0.4342 ,    Adjusted R-squared:  0.4243 
## F-statistic: 31.39 on 6 and 343 DF,  p-value: < 2.2e-16

Explaining the final model and F - statistics

The secondary research from 2019* by de Souza Mendes et. al. showed that MMSE can be exlplained with the educational level, age, and sex explained 38% of the total variance of the MMSE score (p<0.0001) and resulted in the following equation: MMSE=23.350+0.265(years of schooling)-0.042(age)+1.323(sex).

Based on F-statistics we can reject H0 at p < 0.001. We found a linear relationship between dependent and at least one explanatory variable. Model is “good”. Given that the lower score on MMSE and specifically scores under 23 signify dementia, it is expected that the correlation will be negative with the group of demented. Multiple.R2 is 0.43 –> 43% of variability of MMSE is explained by linear effect of Age, eTIV, Education, Gender and dementia state variable Group.

Age: In our model we did not find any signifficant effect of age on MMSE, which is contrasting to the secondary research.

eTIV: In our model we did not find any signifficant effect of estimated total intracranial volume (eTIV) on MMSE.

Group Demented: If the participant is in the group demented, the score on the MMSE on average decreases by 3.8 points at p < 0.001

Group Nondemented: If the participant is in the group nondemented, the score on the MMSE on average increases by 0.59 points at p < 0.05

EDUC: If the years of education of participant increase by 1 year, on average the score on MMSE of this participant increases for 0.1 point at p < 0.05.

Gender factor (sex): In our model we did not find any statistically signifficant effect of sex on MMSE, which is contrasting to the secondary research.

I did not get the same findings of statistically explanatory variables than the researcers did from the secondary source*, except for education (in our case years of education and not level), which I find interesting.

HW-3_Alzheimer-features

Inja Dovšak

2023-02-02

Research question: Can and if yes, how can MMSE score be explained with variables such as age, eTIV and demented state group (variable “Group”)?

Dscription of used variables in the model (data: AlzheimerDataCOR)

Explaining the final model and F - statistics