HAAF Principal Components Analysis

In order to reduce the dimensionality and try to figure out what’s going on, here I use PCA, followed by manova and anova. I only apply it to the control population, which is the typical patients on Day 1.

I filtered the available data so that I only had the Day 1, Typical participants. The function below gave me the flexibility to filter for any group of interest.

###Running PCA, either comparing Day 1 to Day 2, Day 1 to Diabetic, or Day 2 to Diabetic (Group = 0 for typical 1 for diabetic)
PCAgroup<-function(Day1, Day2, Group1, Group2, df){
  df_selected<-subset(df, (df$Day == Day1 | df$Day == Day2))
  df_selected<-subset(df_selected, (df$Group == Group1 | df$Group == Group2))
  df_selected<-na.omit(df_selected)
  return(df_selected)
}

I wanted to perform PCA on the control population, and then see if I could use the components to support our idea that the Thalamus was a notably active region. We can see that 6 components explain ~90% of the variance in the data.

dftest<-PCAgroup(1,1,0,0,df)
df_pca <- prcomp(dftest[,5:46], scale = TRUE, rotate = "varimax")
summary(df_pca)

## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5    PC6
## Standard deviation     5.3934 1.86236 1.45394 1.11922 0.95717 0.9166
## Proportion of Variance 0.6926 0.08258 0.05033 0.02982 0.02181 0.0200
## Cumulative Proportion  0.6926 0.77516 0.82549 0.85532 0.87713 0.8971
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.82065 0.69241 0.68181 0.62671 0.61100 0.54695
## Proportion of Variance 0.01603 0.01141 0.01107 0.00935 0.00889 0.00712
## Cumulative Proportion  0.91317 0.92458 0.93565 0.94500 0.95389 0.96102
##                           PC13    PC14    PC15    PC16   PC17    PC18
## Standard deviation     0.48261 0.47068 0.41133 0.38683 0.3722 0.32789
## Proportion of Variance 0.00555 0.00527 0.00403 0.00356 0.0033 0.00256
## Cumulative Proportion  0.96656 0.97184 0.97586 0.97943 0.9827 0.98528
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.30552 0.28786 0.26415 0.24837 0.22390 0.21283
## Proportion of Variance 0.00222 0.00197 0.00166 0.00147 0.00119 0.00108
## Cumulative Proportion  0.98751 0.98948 0.99114 0.99261 0.99380 0.99488
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.20216 0.18725 0.16539 0.16250 0.14329 0.12870
## Proportion of Variance 0.00097 0.00083 0.00065 0.00063 0.00049 0.00039
## Cumulative Proportion  0.99585 0.99669 0.99734 0.99797 0.99846 0.99885
##                           PC31    PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.12135 0.09561 0.08652 0.07388 0.06453 0.04939
## Proportion of Variance 0.00035 0.00022 0.00018 0.00013 0.00010 0.00006
## Cumulative Proportion  0.99920 0.99942 0.99960 0.99973 0.99983 0.99989
##                           PC37    PC38    PC39    PC40    PC41     PC42
## Standard deviation     0.04435 0.03515 0.03040 0.02200 0.01247 2.39e-15
## Proportion of Variance 0.00005 0.00003 0.00002 0.00001 0.00000 0.00e+00
## Cumulative Proportion  0.99993 0.99996 0.99998 1.00000 1.00000 1.00e+00

These components don’t really differ by glucose level, which is what we would have hoped to see. However, PC3 does show the pattern we would expect to see.

Also, I wanted to see what comprised the components, so I calculated what percentage of each region mapped onto each component

#http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
var_coord_func <- function(loadings, comp.sdev){
  loadings*comp.sdev}
loadings <- df_pca$rotation
sdev <- df_pca$sdev
var.coord <- t(apply(loadings, 1, var_coord_func, sdev)) 
var.cos2 <- var.coord^2
# Compute contributions of each brain region to the component
#result is between 0 and 1...should sum to 1
comp.cos2 <- apply(var.cos2, 2, sum)
contrib <- function(var.cos2, comp.cos2){var.cos2/comp.cos2}
var.contrib <- t(apply(var.cos2,1, contrib, comp.cos2))

Most seem to be a pretty big mix of regions. However, Component 3 is largely dictated by the thalamus (and a few other regions). Component 3 also seems to affect the glucose level, as we saw in the previous figure. So maybe there is something here.

Now that we know what’s in all of the components, let’s see if they actually are able to distinguish between glucose levels. To do this I used a manova and and anova. Threshold for acceptance after a Bonferroni correction is p < .0083

pcaOutput2 <- as.data.frame(df_pca$x)
pcaOutput2$groups <- dftest$Glucose.level
dependent.vars <- cbind(pcaOutput2$PC1, pcaOutput2$PC2, pcaOutput2$PC3, pcaOutput2$PC4, pcaOutput2$PC5, pcaOutput2$PC6)
summary(manova(dependent.vars~dftest$Glucose.level))

##                      Df  Pillai approx F num Df den Df   Pr(>F)   
## dftest$Glucose.level  1 0.40635   3.9929      6     35 0.003792 **
## Residuals            40                                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The manova shows that we are able to distinguish between glucose level on the basis of the principal components. Now we use an ANOVA to see which of the components are significant. It would probably be better to use a non-parametric test here, but this is where we are right now…

summary(aov(dependent.vars~dftest$Glucose.level))  #component 3 is the only significant one

##  Response 1 :
##                      Df Sum Sq Mean Sq F value Pr(>F)
## dftest$Glucose.level  1    6.0  5.9983  0.2022 0.6554
## Residuals            40 1186.6 29.6655               
## 
##  Response 2 :
##                      Df Sum Sq Mean Sq F value Pr(>F)
## dftest$Glucose.level  1    0.0  0.0000       0 0.9975
## Residuals            40  142.2  3.5551               
## 
##  Response 3 :
##                      Df Sum Sq Mean Sq F value    Pr(>F)    
## dftest$Glucose.level  1 26.908 26.9077  18.009 0.0001267 ***
## Residuals            40 59.764  1.4941                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                      Df Sum Sq Mean Sq F value  Pr(>F)  
## dftest$Glucose.level  1  4.202  4.2024  3.5647 0.06629 .
## Residuals            40 47.156  1.1789                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 5 :
##                      Df Sum Sq Mean Sq F value Pr(>F)
## dftest$Glucose.level  1  0.191 0.19119  0.2046 0.6534
## Residuals            40 37.372 0.93429               
## 
##  Response 6 :
##                      Df Sum Sq Mean Sq F value Pr(>F)
## dftest$Glucose.level  1  0.136 0.13600  0.1585 0.6926
## Residuals            40 34.312 0.85781

When we perform the same process for Day 2 and diabetics, there’s some stuff that’s kind of interesting to observe.

For the Day 2 participants, the first 5 components take care of 90% of the variance. For them, the change in glucose level is observable for PC4, rather than PC3.

What’s kind of interesting about this is that principal component 3 (the significant one that was largely comprised of the thalamus) looks similar to day 1, but it’s not significant (p = .267). Instead, principal component 4 is significant (p = .0012).

## Warning: Removed 1 rows containing missing values (position_stack).

## Warning: Removed 1 rows containing missing values (position_stack).

The diabetic participants look a lot more like Day 1 than Day 2. It takes 9 components to get to 90% of the variance. Components 2 (p = 0.005) and 5 (p = 0.0005) are significant.

## Warning: Removed 1 rows containing missing values (position_stack).

HAAF Principal Components Analysis

Julie Wisch

August 2, 2018