A recent factor analysis project (as discussed previously here, here, and here) gave me an opportunity to experiment with some different ways of visualizing highly multidimensional data sets. Factor analysis results are often presented in tables of factor loadings, which are good when you want the numerical details, but bad when you want to convey larger-scale patterns – loadings of 0.91 and 0.19 look similar in a table but very different in a graph. The original data were 17 measures of spoken language processing from 99 participants with aphasia. I’ve posted a data file that contains the pairwise correlations among the measures and the factor loadings:
load(url("http://dmirman.github.io/FAex.Rdata"))
summary(loadings)
## Semantic Recognition Speech Production Speech Recognition
## Min. :-0.1918 Min. :-0.8924 Min. :-0.08145
## 1st Qu.: 0.1781 1st Qu.: 0.1082 1st Qu.: 0.21305
## Median : 0.3753 Median : 0.2195 Median : 0.34497
## Mean : 0.4158 Mean : 0.2623 Mean : 0.36232
## 3rd Qu.: 0.7106 3rd Qu.: 0.4951 3rd Qu.: 0.49335
## Max. : 0.9222 Max. : 0.8416 Max. : 0.82656
##
## Semantic Errors Test
## Min. :-0.933956 PNT: Semantic Errors : 1
## 1st Qu.:-0.010944 Phoneme Discrimination (No Delay): 1
## Median : 0.093386 Phoneme Discrimination (Delay) : 1
## Mean : 0.008282 Auditory Lexical Decision : 1
## 3rd Qu.: 0.135101 Rhyme Discrimination : 1
## Max. : 0.287418 Rhyme Probe Test : 1
## (Other) :11
First, let’s make a nice graph of how the measures load on each of the factors. We’ll want them more or less grouped by factor, but the best option I could come up with was to manually order them:
Ord <- c(17, 16, 13, 15, 12, 2, 3, 5, 9, 8, 11, 7, 14, 6, 4, 10, 1)
loadings$Test <- reorder(loadings$Test, Ord)
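To see what reorder() does with a hand-built ordering vector, here is a minimal sketch on a made-up three-level factor (the names toy_tests and toy_ord are hypothetical and not part of the data file). Each element of the ordering vector is the desired position of the corresponding test, and reorder() sorts the factor levels by those values:

```r
# Toy factor with three tests; the default levels are alphabetical: A, B, C
toy_tests <- factor(c("A", "B", "C"))

# Desired position of each test: A goes third, B first, C second
toy_ord <- c(3, 1, 2)

# reorder() sorts the levels by the values supplied for them
toy_tests <- reorder(toy_tests, toy_ord)
levels(toy_tests)
# "B" "C" "A"
```

The data values themselves are unchanged; only the level order (and therefore the plotting order) is affected.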
The factors are in separate columns, so we need to melt the data into a “long” form for plotting (you’ll need to have loaded the reshape2 package):
library(reshape2) #provides melt()
loadings.m <- melt(loadings, id.vars="Test",
                   measure.vars=c("Semantic Recognition", "Speech Production",
                                  "Speech Recognition", "Semantic Errors"),
                   variable.name="Factor", value.name="Loading")
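As a minimal sketch of what the wide-to-long step does, here is melt() applied to a small made-up table (toy_wide and its F1/F2 columns are hypothetical stand-ins for the real loadings):

```r
library(reshape2)

# Hypothetical wide table: one row per test, one column per factor
toy_wide <- data.frame(Test = c("A", "B", "C"),
                       F1 = c(0.9, 0.1, -0.5),
                       F2 = c(0.2, 0.8, 0.3))

# Long form: one row per test-by-factor combination
toy_long <- melt(toy_wide, id.vars = "Test",
                 variable.name = "Factor", value.name = "Loading")
toy_long
```

The three tests and two factor columns become six rows, with the former column names stored in the new Factor column, which is what lets ggplot facet or fill by factor.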
Now make the bar graph (which became Figure 1 in our first paper on these data). Explanations of each bit of the ggplot code are inserted as comments:
library(ggplot2)
#For each test, plot the loading as length and fill color of a bar
# note that the length will be the absolute value of the loading but the
# fill color will be the signed value, more on this below
ggplot(loadings.m, aes(Test, abs(Loading), fill=Loading)) +
  facet_wrap(~ Factor, nrow=1) + #place the factors in separate facets
  geom_bar(stat="identity") + #make the bars
  coord_flip() + #flip the axes so the test names can be horizontal
  #define the fill color gradient: blue=positive, red=negative
  scale_fill_gradient2(name = "Loading",
                       high = "blue", mid = "white", low = "red",
                       midpoint=0, guide="none") +
  ylab("Loading Strength") + #improve y-axis label
  theme_bw(base_size=10) #use a black-and-white theme with set font size
A few words about the fill color gradient: the contribution of a measure to a factor is reflected by the absolute value of the loading; the sign just reflects the scale direction of the measurement. For most of the measures, a higher score indicates better performance (for example, percent correct), but there are two measures for which higher scores indicate poorer performance (phonological and semantic errors), which therefore produce negative loadings. Having those bars face the opposite direction would create a lot of wasted white space. Using the absolute value of the loading makes all of the bars face the same direction, and fill color indicates which loadings are negative. I set the negative loadings to be red so they stand out more. I set white as the mid-point color and 0 as the mid-point so that near-zero loadings (which are less important) would be desaturated and therefore less visually salient.
For our second paper on these results, I thought it would be good to present more detailed information about the patterns of pairwise correlations that produced these factors and their loadings. Rather than present a giant 17x17 table of correlation values, I plotted the correlation matrix:
summary(corrs)
## Camel and Cactus Test Pyramids and Palm Trees Test
## Min. :-0.2787 Min. :-0.3053
## 1st Qu.: 0.2279 1st Qu.: 0.3457
## Median : 0.4063 Median : 0.4896
## Mean : 0.4275 Mean : 0.4273
## 3rd Qu.: 0.6526 3rd Qu.: 0.5825
## Max. : 1.0000 Max. : 1.0000
##
## Peabody Picture Vocabulary Test Synonymy Triplets
## Min. :-0.2059 Min. :-0.2315
## 1st Qu.: 0.4704 1st Qu.: 0.4320
## Median : 0.5593 Median : 0.5083
## Mean : 0.5003 Mean : 0.4823
## 3rd Qu.: 0.6285 3rd Qu.: 0.6669
## Max. : 1.0000 Max. : 1.0000
##
## Semantic Category Discrimination Phoneme Discrimination (No Delay)
## Min. :-0.2421 Min. :-0.1585
## 1st Qu.: 0.3807 1st Qu.: 0.4110
## Median : 0.5196 Median : 0.5083
## Mean : 0.4649 Mean : 0.4628
## 3rd Qu.: 0.6323 3rd Qu.: 0.5554
## Max. : 1.0000 Max. : 1.0000
##
## Phoneme Discrimination (Delay) Rhyme Discrimination
## Min. :-0.2265 Min. :-0.3202
## 1st Qu.: 0.4896 1st Qu.: 0.3530
## Median : 0.5369 Median : 0.4685
## Mean : 0.4981 Mean : 0.4213
## 3rd Qu.: 0.6079 3rd Qu.: 0.5165
## Max. : 1.0000 Max. : 1.0000
##
## Philadelphia Repetition Test Nonword Repetition Philadelphia Naming Test
## Min. :-0.7304 Min. :-0.5950 Min. :-0.5375
## 1st Qu.: 0.3102 1st Qu.: 0.2800 1st Qu.: 0.5013
## Median : 0.3833 Median : 0.4509 Median : 0.5709
## Mean : 0.3628 Mean : 0.3863 Mean : 0.4808
## 3rd Qu.: 0.4930 3rd Qu.: 0.5373 3rd Qu.: 0.6009
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
##
## Immediate Serial Recall Span Semantic Category Probe Test
## Min. :-0.5292 Min. :-0.2689
## 1st Qu.: 0.4284 1st Qu.: 0.3596
## Median : 0.5067 Median : 0.5369
## Mean : 0.4622 Mean : 0.4586
## 3rd Qu.: 0.6121 3rd Qu.: 0.6349
## Max. : 1.0000 Max. : 1.0000
##
## Rhyme Probe Test Auditory Lexical Decision PNT: Phonological Errors
## Min. :-0.3093 Min. :-0.2354 Min. :-0.7304
## 1st Qu.: 0.4581 1st Qu.: 0.4105 1st Qu.:-0.3202
## Median : 0.4704 Median : 0.4320 Median :-0.2265
## Mean : 0.4409 Mean : 0.4121 Mean :-0.2196
## 3rd Qu.: 0.5567 3rd Qu.: 0.5165 3rd Qu.:-0.1422
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
##
## PNT: Semantic Errors Test
## Min. :-0.30533 PNT: Semantic Errors : 1
## 1st Qu.:-0.23152 Phoneme Discrimination (No Delay): 1
## Median :-0.15854 Phoneme Discrimination (Delay) : 1
## Mean :-0.08027 Auditory Lexical Decision : 1
## 3rd Qu.:-0.08912 Rhyme Discrimination : 1
## Max. : 1.00000 Rhyme Probe Test : 1
## (Other) :11
This needs to be melted into a long form so that each pair of tests and its correlation occupies a single row in the data frame:
corrs.m <- melt(corrs, id="Test", variable.name="Test2", value.name="Correlation")
summary(corrs.m)
## Test
## PNT: Semantic Errors : 17
## Phoneme Discrimination (No Delay): 17
## Phoneme Discrimination (Delay) : 17
## Auditory Lexical Decision : 17
## Rhyme Discrimination : 17
## Rhyme Probe Test : 17
## (Other) :187
## Test2 Correlation
## Camel and Cactus Test : 17 Min. :-0.7304
## Pyramids and Palm Trees Test : 17 1st Qu.: 0.2800
## Peabody Picture Vocabulary Test : 17 Median : 0.4691
## Synonymy Triplets : 17 Mean : 0.3758
## Semantic Category Discrimination : 17 3rd Qu.: 0.5990
## Phoneme Discrimination (No Delay): 17 Max. : 1.0000
## (Other) :187
head(corrs.m)
## Test Test2 Correlation
## 1 Camel and Cactus Test Camel and Cactus Test 1.0000000
## 2 Pyramids and Palm Trees Test Camel and Cactus Test 0.8282492
## 3 Peabody Picture Vocabulary Test Camel and Cactus Test 0.6285245
## 4 Synonymy Triplets Camel and Cactus Test 0.7558075
## 5 Semantic Category Discrimination Camel and Cactus Test 0.6525854
## 6 Phoneme Discrimination (No Delay) Camel and Cactus Test 0.3861599
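For readers starting from raw scores rather than a saved correlation table, the sketch below shows one way to build a data frame with the same shape as corrs and melt it. This is an assumption about a reasonable workflow, not the code that produced the posted file; the built-in mtcars data stands in for the test scores, and all toy_ names are hypothetical:

```r
library(reshape2)

# Stand-in for raw scores: four numeric columns from the built-in mtcars data
toy_scores <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pairwise correlations, stored as a data frame with the row labels in a column
toy_cm <- cor(toy_scores)
toy_corrs <- data.frame(Test = rownames(toy_cm), toy_cm, check.names = FALSE)

# One row per pair of measures
toy_corrs.m <- melt(toy_corrs, id.vars = "Test",
                    variable.name = "Test2", value.name = "Correlation")
head(toy_corrs.m)
```

With four measures the melted table has 16 rows, one for every cell of the 4x4 matrix, including the diagonal entries where a measure is paired with itself.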
It will be easier to understand the graph if the tests that went into each correlation follow the same order:
corrs.m$Test2 <- reorder(corrs.m$Test2, rep(Ord, each=17))
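The rep(Ord, each=17) call relies on melt() stacking the columns one after another, so the first 17 rows of corrs.m all share the first value of Test2, the next 17 share the second, and so on. A minimal sketch with a made-up three-test case (toy_test2 and toy_ord are hypothetical names):

```r
# After melting a 3 x 3 table, each column name is repeated for 3 consecutive rows
toy_test2 <- factor(rep(c("A", "B", "C"), each = 3))

# Desired position of each test: A third, B first, C second
toy_ord <- c(3, 1, 2)

# Repeat each position 3 times so it lines up with the blocks of rows
toy_test2 <- reorder(toy_test2, rep(toy_ord, each = 3))
levels(toy_test2)
# "B" "C" "A"
```

If the melted rows were in any other order, the repeated ordering vector would no longer line up with the right tests, so this shortcut should only be applied directly to the output of melt().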
Plot the correlation matrix:
library(grid) #for adjusting plot margins
#place the tests on the x- and y-axes,
#fill the elements with the strength of the correlation
ggplot(corrs.m, aes(Test2, Test, fill=abs(Correlation))) +
  geom_tile() + #rectangles for each correlation
  #add actual correlation value in the rectangle
  geom_text(aes(label = round(Correlation, 2)), size=2.5) +
  theme_bw(base_size=10) + #black and white theme with set font size
  #rotate x-axis labels so they don't overlap,
  #get rid of unnecessary axis titles
  #adjust plot margins
  theme(axis.text.x = element_text(angle = 90),
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        plot.margin = unit(c(3, 1, 0, 0), "mm")) +
  #set correlation fill gradient
  scale_fill_gradient(low="white", high="red") +
  guides(fill="none") #omit unnecessary gradient legend
Now here is the fun part: can we add a stacked bar graph of factor loadings so that we can see how the factor analysis transformed these pairwise correlations into factors? First, we need to make that stacked bar graph. The code will be very similar to the bar graph in the previous section, but we’ll use fill color of the stacked bars to distinguish the factors instead of putting them in separate facets. We’ll also assign the ggplot object to a variable for making the combo graph:
p1 <- last_plot() #store the correlation matrix plot object for later
#make the stacked bar graph
p2 <- ggplot(loadings.m, aes(Test, abs(Loading), fill=Factor)) +
  geom_bar(stat="identity") + coord_flip() +
  ylab("Loading Strength") + theme_bw(base_size=10) +
  #remove labels and tweak margins for combining with the correlation matrix plot
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        plot.margin = unit(c(3,1,39,-3), "mm"))
p2
Now we can use the grid.arrange() function from the gridExtra package to put both plots in one figure. The function takes the plot objects (p1 and p2) and various optional arrangement parameters. In this case, we’ll tell it to put the plots side-by-side in two columns (ncol=2) and that the left column (which will contain the big correlation matrix) should be twice as wide as the right column (widths=c(2, 1)):
library(gridExtra)
grid.arrange(p1, p2, ncol=2, widths=c(2, 1))
The plot.margin manipulation in p1 and p2 was to minimize margin white space and to align the stacked bars with the rows of the correlation matrix. The main problem is that the x-axis tick labels are very long for the correlation matrix, so the bottom edge of that panel is much lower than the bottom edge of the bars panel. To adjust for that, I raised the bottom margin of the bars plot by 39mm. That specific number was determined by brute-force trial and error, and it is specific to the size of the figure (in this case: 11 x 8.5).
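Because the margin values only line up at one figure size, it can help to write the combined figure to a file at fixed dimensions rather than relying on the size of the plotting window. The following is a minimal sketch, not the code used for the paper: the two toy plots stand in for p1 and p2, the output file name is a placeholder, and reading 11 x 8.5 as inches is an assumption.

```r
library(ggplot2)
library(gridExtra)

# Toy stand-ins for the correlation matrix (p1) and the stacked bars (p2)
toy_p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
toy_p2 <- ggplot(mtcars, aes(factor(cyl))) + geom_bar() + coord_flip()

# arrangeGrob() builds the same layout as grid.arrange() without drawing it
toy_combo <- arrangeGrob(toy_p1, toy_p2, ncol = 2, widths = c(2, 1))

# Write the layout at a fixed size (assumed here to be 11 x 8.5 inches)
out_file <- file.path(tempdir(), "combo_figure.pdf")
ggsave(out_file, toy_combo, width = 11, height = 8.5, units = "in")
```

Fixing the output size this way means a hand-tuned margin only has to be found once for that size, though it would still need re-tuning if the dimensions or the axis label lengths change.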