A recent factor analysis project (as discussed previously here, here, and here) gave me an opportunity to experiment with some different ways of visualizing highly multidimensional data sets. Factor analysis results are often presented in tables of factor loadings, which are good when you want the numerical details, but bad when you want to convey larger-scale patterns – loadings of 0.91 and 0.19 look similar in a table but very different in a graph. The original data were 17 measures of spoken language processing from 99 participants with aphasia. I’ve posted a data file that contains the pairwise correlations among the measures and the factor loadings:

load(url("http://dmirman.github.io/FAex.Rdata"))

A bar graph of factor loadings

summary(loadings)

##  Semantic Recognition Speech Production Speech Recognition
##  Min.   :-0.1918      Min.   :-0.8924   Min.   :-0.08145  
##  1st Qu.: 0.1781      1st Qu.: 0.1082   1st Qu.: 0.21305  
##  Median : 0.3753      Median : 0.2195   Median : 0.34497  
##  Mean   : 0.4158      Mean   : 0.2623   Mean   : 0.36232  
##  3rd Qu.: 0.7106      3rd Qu.: 0.4951   3rd Qu.: 0.49335  
##  Max.   : 0.9222      Max.   : 0.8416   Max.   : 0.82656  
##                                                           
##  Semantic Errors                                    Test   
##  Min.   :-0.933956   PNT: Semantic Errors             : 1  
##  1st Qu.:-0.010944   Phoneme Discrimination (No Delay): 1  
##  Median : 0.093386   Phoneme Discrimination (Delay)   : 1  
##  Mean   : 0.008282   Auditory Lexical Decision        : 1  
##  3rd Qu.: 0.135101   Rhyme Discrimination             : 1  
##  Max.   : 0.287418   Rhyme Probe Test                 : 1  
##                      (Other)                          :11

First, let’s make a nice graph of how the measures load on each of the factors. We’ll want them more or less grouped by factor, but the best option I could come up with was to manually order them:

Ord <- c(17, 16, 13, 15, 12,  2, 3,  5,  9,  8, 11,  7, 14,  6,  4,  10,  1)
loadings$Test <- reorder(loadings$Test, Ord)
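To see what reorder() does with a hand-written ordering vector, here is a toy sketch. The three-level factor and the vector toy_ord are made up for illustration and are not part of the data file. With one value per level, reorder() sorts the factor levels by those values, and ggplot draws the levels in that order (after coord_flip(), the first level ends up at the bottom):

```r
# Toy illustration with a hypothetical factor and ordering vector
f <- factor(c("a", "b", "c"))
toy_ord <- c(3, 1, 2)  # "a" gets rank 3, "b" rank 1, "c" rank 2
f2 <- reorder(f, toy_ord)
levels(f2)
## [1] "b" "c" "a"
```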

The factors are in separate columns, so we need to melt the data into a “long” form for plotting (you’ll need to have loaded the reshape2 package):

library(reshape2) #provides melt()
loadings.m <- melt(loadings, id.vars="Test", 
                   measure.vars=c("Semantic Recognition", "Speech Production", 
                                  "Speech Recognition", "Semantic Errors"), 
                   variable.name="Factor", value.name="Loading")
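If the wide-to-long step is unfamiliar, this small sketch shows the same melt() call on a made-up two-test, two-factor data frame (the names toy, F1, and F2 are hypothetical). Each factor column becomes a block of rows, with the column name stored in Factor and the cell value in Loading:

```r
library(reshape2)
# Hypothetical wide data: one row per test, one column per factor
toy <- data.frame(Test = c("A", "B"),
                  F1 = c(0.9, 0.1),
                  F2 = c(-0.2, 0.8))
toy.m <- melt(toy, id.vars = "Test",
              variable.name = "Factor", value.name = "Loading")
toy.m
##   Test Factor Loading
## 1    A     F1     0.9
## 2    B     F1     0.1
## 3    A     F2    -0.2
## 4    B     F2     0.8
```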

Now make the bar graph (which became Figure 1 in our first paper on these data). Explanations of each bit of the ggplot code are inserted as comments:

library(ggplot2)
#For each test, plot the loading as length and fill color of a bar
# note that the length will be the absolute value of the loading but the 
# fill color will be the signed value, more on this below
ggplot(loadings.m, aes(Test, abs(Loading), fill=Loading)) + 
  facet_wrap(~ Factor, nrow=1) + #place the factors in separate facets
  geom_bar(stat="identity") + #make the bars
  coord_flip() + #flip the axes so the test names can be horizontal  
  #define the fill color gradient: blue=positive, red=negative
  scale_fill_gradient2(name = "Loading", 
                       high = "blue", mid = "white", low = "red", 
                       midpoint=0, guide="none") + #guide="none" omits the legend
  ylab("Loading Strength") + #improve y-axis label
  theme_bw(base_size=10) #use a black-and-white theme with set font size

This is Figure 1 in Mirman et al., 2015, Nature Communications

A few words about the fill color gradient: the contribution of a measure to a factor is reflected by the absolute value of the loading; the sign just reflects the scale direction of the measurement. For most of the measures, a higher score indicates better performance (for example, percent correct), but there are two measures for which higher scores indicate poorer performance (phonological and semantic errors), which therefore produce negative loadings. Having those bars face the opposite direction would create a lot of wasted white space. Using the absolute value of the loading makes all of the bars face the same direction, so I used fill color to indicate the negative loadings instead. I set the negative loadings to be red so they stand out more. I set white as the mid-point color and 0 as the mid-point so that near-zero loadings (which are less important) would be desaturated and therefore less visually salient.
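The same idea can be tried on a tiny made-up example before applying it to the real loadings. In this sketch the data frame toy and its three loadings are hypothetical: one strong positive loading, one near-zero loading, and one strong negative loading. Bar length uses abs(Loading) and fill uses the signed value:

```r
library(ggplot2)
# Hypothetical loadings: strong positive, near zero, strong negative
toy <- data.frame(Test = c("Accuracy A", "Accuracy B", "Error rate"),
                  Loading = c(0.85, 0.10, -0.90))
p <- ggplot(toy, aes(Test, abs(Loading), fill = Loading)) +
  geom_bar(stat = "identity") +  # bar length = absolute loading
  coord_flip() +
  # signed loading sets the color: red = negative, blue = positive
  scale_fill_gradient2(high = "blue", mid = "white", low = "red",
                       midpoint = 0, guide = "none") +
  ylab("Loading Strength")
p
```

All three bars point the same way. The "Error rate" bar should come out red, and the near-zero bar should be close to white.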

Plotting the correlations

For our second paper on these results, I thought it would be good to present more detailed information about the patterns of pairwise correlations that produced these factors and their loadings. Rather than present a giant 17x17 table of correlation values, I plotted the correlation matrix:

summary(corrs)

##  Camel and Cactus Test Pyramids and Palm Trees Test
##  Min.   :-0.2787       Min.   :-0.3053             
##  1st Qu.: 0.2279       1st Qu.: 0.3457             
##  Median : 0.4063       Median : 0.4896             
##  Mean   : 0.4275       Mean   : 0.4273             
##  3rd Qu.: 0.6526       3rd Qu.: 0.5825             
##  Max.   : 1.0000       Max.   : 1.0000             
##                                                    
##  Peabody Picture Vocabulary Test Synonymy Triplets
##  Min.   :-0.2059                 Min.   :-0.2315  
##  1st Qu.: 0.4704                 1st Qu.: 0.4320  
##  Median : 0.5593                 Median : 0.5083  
##  Mean   : 0.5003                 Mean   : 0.4823  
##  3rd Qu.: 0.6285                 3rd Qu.: 0.6669  
##  Max.   : 1.0000                 Max.   : 1.0000  
##                                                   
##  Semantic Category Discrimination Phoneme Discrimination (No Delay)
##  Min.   :-0.2421                  Min.   :-0.1585                  
##  1st Qu.: 0.3807                  1st Qu.: 0.4110                  
##  Median : 0.5196                  Median : 0.5083                  
##  Mean   : 0.4649                  Mean   : 0.4628                  
##  3rd Qu.: 0.6323                  3rd Qu.: 0.5554                  
##  Max.   : 1.0000                  Max.   : 1.0000                  
##                                                                    
##  Phoneme Discrimination (Delay) Rhyme Discrimination
##  Min.   :-0.2265                Min.   :-0.3202     
##  1st Qu.: 0.4896                1st Qu.: 0.3530     
##  Median : 0.5369                Median : 0.4685     
##  Mean   : 0.4981                Mean   : 0.4213     
##  3rd Qu.: 0.6079                3rd Qu.: 0.5165     
##  Max.   : 1.0000                Max.   : 1.0000     
##                                                     
##  Philadelphia Repetition Test Nonword Repetition Philadelphia Naming Test
##  Min.   :-0.7304              Min.   :-0.5950    Min.   :-0.5375         
##  1st Qu.: 0.3102              1st Qu.: 0.2800    1st Qu.: 0.5013         
##  Median : 0.3833              Median : 0.4509    Median : 0.5709         
##  Mean   : 0.3628              Mean   : 0.3863    Mean   : 0.4808         
##  3rd Qu.: 0.4930              3rd Qu.: 0.5373    3rd Qu.: 0.6009         
##  Max.   : 1.0000              Max.   : 1.0000    Max.   : 1.0000         
##                                                                          
##  Immediate Serial Recall Span Semantic Category Probe Test
##  Min.   :-0.5292              Min.   :-0.2689             
##  1st Qu.: 0.4284              1st Qu.: 0.3596             
##  Median : 0.5067              Median : 0.5369             
##  Mean   : 0.4622              Mean   : 0.4586             
##  3rd Qu.: 0.6121              3rd Qu.: 0.6349             
##  Max.   : 1.0000              Max.   : 1.0000             
##                                                           
##  Rhyme Probe Test  Auditory Lexical Decision PNT: Phonological Errors
##  Min.   :-0.3093   Min.   :-0.2354           Min.   :-0.7304         
##  1st Qu.: 0.4581   1st Qu.: 0.4105           1st Qu.:-0.3202         
##  Median : 0.4704   Median : 0.4320           Median :-0.2265         
##  Mean   : 0.4409   Mean   : 0.4121           Mean   :-0.2196         
##  3rd Qu.: 0.5567   3rd Qu.: 0.5165           3rd Qu.:-0.1422         
##  Max.   : 1.0000   Max.   : 1.0000           Max.   : 1.0000         
##                                                                      
##  PNT: Semantic Errors                                Test   
##  Min.   :-0.30533     PNT: Semantic Errors             : 1  
##  1st Qu.:-0.23152     Phoneme Discrimination (No Delay): 1  
##  Median :-0.15854     Phoneme Discrimination (Delay)   : 1  
##  Mean   :-0.08027     Auditory Lexical Decision        : 1  
##  3rd Qu.:-0.08912     Rhyme Discrimination             : 1  
##  Max.   : 1.00000     Rhyme Probe Test                 : 1  
##                       (Other)                          :11

This needs to be melted into a long form so that each test pair and their correlation is a single row in the data frame:

corrs.m <- melt(corrs, id="Test", variable.name="Test2", value.name="Correlation")
summary(corrs.m)

##                                 Test    
##  PNT: Semantic Errors             : 17  
##  Phoneme Discrimination (No Delay): 17  
##  Phoneme Discrimination (Delay)   : 17  
##  Auditory Lexical Decision        : 17  
##  Rhyme Discrimination             : 17  
##  Rhyme Probe Test                 : 17  
##  (Other)                          :187  
##                                Test2      Correlation     
##  Camel and Cactus Test            : 17   Min.   :-0.7304  
##  Pyramids and Palm Trees Test     : 17   1st Qu.: 0.2800  
##  Peabody Picture Vocabulary Test  : 17   Median : 0.4691  
##  Synonymy Triplets                : 17   Mean   : 0.3758  
##  Semantic Category Discrimination : 17   3rd Qu.: 0.5990  
##  Phoneme Discrimination (No Delay): 17   Max.   : 1.0000  
##  (Other)                          :187

head(corrs.m)

##                                Test                 Test2 Correlation
## 1             Camel and Cactus Test Camel and Cactus Test   1.0000000
## 2      Pyramids and Palm Trees Test Camel and Cactus Test   0.8282492
## 3   Peabody Picture Vocabulary Test Camel and Cactus Test   0.6285245
## 4                 Synonymy Triplets Camel and Cactus Test   0.7558075
## 5  Semantic Category Discrimination Camel and Cactus Test   0.6525854
## 6 Phoneme Discrimination (No Delay) Camel and Cactus Test   0.3861599

It will be easier to understand the graph if the tests that went into each correlation follow the same order:

corrs.m$Test2 <- reorder(corrs.m$Test2, rep(Ord, each=17))
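The rep(Ord, each=17) part works because the melted data frame lists Test2 in blocks: 17 rows for the first column, then 17 for the next, and so on. Repeating each ordering value 17 times gives every row in a block the same value. Here is a toy sketch with three levels instead of 17 (the names long and toy_ord are made up for illustration):

```r
# Hypothetical long-form column: each level appears in a block of 3 rows
toy_ord <- c(3, 1, 2)
long <- data.frame(Test2 = factor(rep(c("a", "b", "c"), each = 3)))
# one ordering value per row, constant within each block
long$Test2 <- reorder(long$Test2, rep(toy_ord, each = 3))
levels(long$Test2)
## [1] "b" "c" "a"
```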

Plot the correlation matrix:

library(grid) #for adjusting plot margins
#place the tests on the x- and y-axes, 
#fill the elements with the strength of the correlation
ggplot(corrs.m, aes(Test2, Test, fill=abs(Correlation))) + 
  geom_tile() + #rectangles for each correlation
  #add actual correlation value in the rectangle
  geom_text(aes(label = round(Correlation, 2)), size=2.5) + 
  theme_bw(base_size=10) + #black and white theme with set font size
  #rotate x-axis labels so they don't overlap, 
  #get rid of unnecessary axis titles
  #adjust plot margins
  theme(axis.text.x = element_text(angle = 90), 
        axis.title.x=element_blank(), 
        axis.title.y=element_blank(), 
        plot.margin = unit(c(3, 1, 0, 0), "mm")) +
  #set correlation fill gradient
  scale_fill_gradient(low="white", high="red") + 
  guides(fill="none") #omit unnecessary gradient legend

Now here is the fun part: can we add a stacked bar graph of factor loadings so that we can see how the factor analysis transformed these pairwise correlations into factors? First, we need to make that stacked bar graph. The code will be very similar to the bar graph in the previous section, but we’ll use fill color of the stacked bars to distinguish the factors instead of putting them in separate facets. We’ll also assign the ggplot object to a variable for making the combo graph:

p1 <- last_plot() #store the correlation matrix plot object for later
#make the stacked bar graph
p2 <- ggplot(loadings.m, aes(Test, abs(Loading), fill=Factor)) + 
  geom_bar(stat="identity") + coord_flip() + 
  ylab("Loading Strength") + theme_bw(base_size=10) + 
  #remove labels and tweak margins for combining with the correlation matrix plot
  theme(axis.text.y = element_blank(), 
        axis.title.y = element_blank(), 
        plot.margin = unit(c(3,1,39,-3), "mm"))
p2

Now we can use the grid.arrange() function from the gridExtra package to put both plots in one figure. The function takes the plot objects (p1 and p2) and various optional arrangement parameters. In this case, we’ll tell it to put the plots side-by-side in two columns (ncol=2) and that the left column (which will contain the big correlation matrix) should be twice as wide as the right column (widths=c(2, 1)):

library(gridExtra)
grid.arrange(p1, p2, ncol=2, widths=c(2, 1))

This is Figure 2 in Mirman et al., in press, Neuropsychologia

The plot.margin manipulation in p1 and p2 was to minimize margin white space and to align the stacked bars with the rows of the correlation matrix. The main problem is that the x-axis tick labels are very long for the correlation matrix, so the bottom edge of that panel is much lower than the bottom edge of the bars panel. To adjust for that, I raised the bottom margin of the bars plot by 39mm – that specific number was determined by brute force trial-and-error, and it is specific to the size of the figure (in this case: 11 x 8.5).
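Because the margin value only lines up at one figure size, it can help to fix that size when saving. The following is a minimal sketch, not the code used for the published figure. It uses stand-in plots q1 and q2 in place of p1 and p2, a made-up file name, and it assumes the 11 x 8.5 size is in inches (the default unit for ggsave()):

```r
library(ggplot2)
library(gridExtra)
# Stand-in plots; in the post these would be p1 and p2
toy <- data.frame(x = c("a", "b"), y = c(1, 2))
q1 <- ggplot(toy, aes(x, y)) + geom_bar(stat = "identity")
q2 <- ggplot(toy, aes(x, y)) + geom_bar(stat = "identity") + coord_flip()
# arrangeGrob() builds the same layout as grid.arrange() without drawing it
combo <- arrangeGrob(q1, q2, ncol = 2, widths = c(2, 1))
# Assumption: 11 x 8.5 is in inches; the file name is hypothetical
out_file <- file.path(tempdir(), "combo_figure.pdf")
ggsave(out_file, combo, width = 11, height = 8.5)
```

If the output size changes, the bottom margin in p2 would likely need to be re-tuned.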
