data = read.csv('toothpaste.csv')Explore Data
The survey consists of responses to six seven-point likert-scale items with anchors, strongly disagree and strongly agree.
* prevents_cavities: It is important to buy a toothpaste that prevents cavities
* shiny_teeth: I like a toothpaste that gives shiny teeth
* strengthens_gums: A toothpaste should strengthen your gums
* freshens_breath: I prefer a toothpaste that freshens breath
* prevent_decay_not_imp: Prevention of tooth decay is not an important benefit offered by a toothpaste
* attractive_teeth: The most important consideration in buying a toothpaste is attractive teeth
survey = data[,2:7]
head(survey)## prevents_cavities shiny_teeth strengthens_gums freshens_breath
## 1 7 3 6 4
## 2 1 3 2 4
## 3 6 2 7 4
## 4 4 5 4 6
## 5 1 2 2 3
## 6 6 3 6 4
## prevent_decay_not_imp attractive_teeth
## 1 2 4
## 2 5 4
## 3 1 3
## 4 2 5
## 5 6 2
## 6 2 4
summary(survey)## prevents_cavities shiny_teeth strengthens_gums freshens_breath
## Min. :1.000 Min. :2.0 Min. :1.0 Min. :2.0
## 1st Qu.:2.000 1st Qu.:3.0 1st Qu.:2.0 1st Qu.:3.0
## Median :4.000 Median :4.0 Median :4.0 Median :4.0
## Mean :3.933 Mean :3.9 Mean :4.1 Mean :4.1
## 3rd Qu.:6.000 3rd Qu.:5.0 3rd Qu.:6.0 3rd Qu.:5.0
## Max. :7.000 Max. :7.0 Max. :7.0 Max. :7.0
## prevent_decay_not_imp attractive_teeth
## Min. :1.0 Min. :2.000
## 1st Qu.:2.0 1st Qu.:3.000
## Median :3.5 Median :4.000
## Mean :3.5 Mean :4.167
## 3rd Qu.:5.0 3rd Qu.:4.750
## Max. :7.0 Max. :7.000
Suitability for Factor Analysis
Correlations
We examine bivariate correlations among responses to the six survey questions.
- A correlation matrix with some large and some small correlations is ideal for factor analysis.
- If all correlations are small, there is no way to group variables. On the other hand, if all correlations are large, then all the variables will load onto the same factor.
It is common to assume responses to itemized rating scales, the kind used for this survey, are on an interval scale. Therefore, we will compute pearson’s product moment correlation. Note, if we treat the data as being ordinal (as some researchers would argue), we would examine polychoric correlations.
round(cor(survey), 3)## prevents_cavities shiny_teeth strengthens_gums
## prevents_cavities 1.000 -0.053 0.873
## shiny_teeth -0.053 1.000 -0.155
## strengthens_gums 0.873 -0.155 1.000
## freshens_breath -0.086 0.572 -0.248
## prevent_decay_not_imp -0.858 0.020 -0.778
## attractive_teeth 0.004 0.640 -0.018
## freshens_breath prevent_decay_not_imp
## prevents_cavities -0.086 -0.858
## shiny_teeth 0.572 0.020
## strengthens_gums -0.248 -0.778
## freshens_breath 1.000 -0.007
## prevent_decay_not_imp -0.007 1.000
## attractive_teeth 0.640 -0.136
## attractive_teeth
## prevents_cavities 0.004
## shiny_teeth 0.640
## strengthens_gums -0.018
## freshens_breath 0.640
## prevent_decay_not_imp -0.136
## attractive_teeth 1.000
Since we are only interested in the pattern of correlations, we can look at a heatmap of correlations.
library(corrplot)
corrplot(cor(data[,2:7]),type = 'lower',,col = c('red','white','green'),method = 'square',diag = F)Here is a plot using ggcorrplot
library(ggcorrplot)
ggcorrplot(cor(data[,2:7]),colors = c('red','white','green'),type = 'lower')Bartlett’s Test of Sphericity
Looks to see if there are at least some non-zero correlations by comparing correlation matrix to an identity matrix. A significant test indicates suitability for factor analysis.
library(psych)
cortest.bartlett(cor(survey),n = 30)## $chisq
## [1] 111.3138
##
## $p.value
## [1] 9.017094e-17
##
## $df
## [1] 15
KMO Measure of Sampling Adequacy (MSA)
Compares partial correlation matrix to pairwise correlation matrix. A partial correlation is a correlation after partialing out all other correlations. If the variables are strongly related, partial correlations should be small and MSA close to 1. If MSA > 0.5, data is suitable for factor analysis.
KMO(r = cor(survey))## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(survey))
## Overall MSA = 0.66
## MSA for each item =
## prevents_cavities shiny_teeth strengthens_gums
## 0.62 0.70 0.68
## freshens_breath prevent_decay_not_imp attractive_teeth
## 0.64 0.77 0.56
Determine Number of Factors
It is critical to have an a priori expectation of the number of factors.
Now, let us examine some data-driven methods used to corroborate an priori solution or select from a set of a priori solutions.
Scree Plot
Line graph of eigen values for each factor. Ideal number of factors is indicated by a sudden change in the line graph or what is known as the elbow.
scree(cor(survey),factors = T,pc = F)Eigen Value
According to the eigen-value criterion, all factors with eigen value greater than 1 are selected.
data.frame(factor = 1:ncol(survey), eigen = scree(cor(survey),factors = T,pc = F)$fv)## factor eigen
## 1 1 2.53970272
## 2 2 1.27058351
## 3 3 0.01617005
## 4 4 -0.05397856
## 5 5 -0.55820744
## 6 6 -0.67456755
Parallel Analysis
Simulate a dataset with same variables and observations as original dataset. Compute correlation matrix and eigen values. Now, compare eigen values from simulated data to original data. Select factors with eigen values in the original data greater than eigen values in the simulated data.
fa.parallel(survey,fa='fa',fm = 'pa')## Parallel analysis suggests that the number of factors = 2 and the number of components = NA
Total Variance Explained
To ensure that the factors represents the original variables sufficiently well, the total variance explained by factors should be greater than 70%.
Since the three above tests corroborated the a priori two-factor solution, we will now run an exploratory factor analysis using principal axis factoring with two factors. Next, we examine the Cumulative Variance explained by the two factors.
result = fa(r = survey,nfactors = 2,fm = 'pa',rotate = 'none')
result$Vaccounted## PA1 PA2
## SS loadings 2.5700755 1.8667388
## Proportion Var 0.4283459 0.3111231
## Cumulative Var 0.4283459 0.7394691
## Proportion Explained 0.5792615 0.4207385
## Cumulative Proportion 0.5792615 1.0000000
Extracted Communalities
- Communality reflects the amount of variance in a variable that can be explained by the factors.
- Larger the communality, the more of the variable is captured by the factor solution.
- On the other hand, a small communality implies most of the variance in the variable was not captured. Ideally, communality of each variable must be greater than 0.7, but a communality greater than 0.5 may be seen as acceptable.
data.frame(communality = result$communality)## communality
## prevents_cavities 0.9266887
## shiny_teeth 0.5625047
## strengthens_gums 0.8370333
## freshens_breath 0.6016147
## prevent_decay_not_imp 0.7898264
## attractive_teeth 0.7191465
Mapping Variables to Factors
Each variable is represented as a linear combination of factors. An ideal factor solution is where each variable is expected to load on (i.e., related to) only one factor. Such a result is easy to interpret. In practice, each variable may load on many factors. This may still be acceptable so long as the loading on one factor is large and on all other factors is small.
When the pattern of loadings does not show a clear preference of a variable for a factor, rotating the axes may help generate a clear mapping. There are two broad types of axes rotation
* Orthogonal: Axes are rotated while constraining them to be at right angles. E.g., varimax, quartimax, equimax
* Oblique: Axes are allowed to have any angle between them. E.g., oblimin, promax
Here is the pattern of loadings for the unrotated solution (i.e., rotation=’none)
print(result$loadings, cut=0)##
## Loadings:
## PA1 PA2
## prevents_cavities 0.948 0.168
## shiny_teeth -0.207 0.721
## strengthens_gums 0.914 0.039
## freshens_breath -0.247 0.735
## prevent_decay_not_imp -0.850 -0.260
## attractive_teeth -0.101 0.842
##
## PA1 PA2
## SS loadings 2.570 1.867
## Proportion Var 0.428 0.311
## Cumulative Var 0.428 0.739
To make the matrix of loadings easier to interpret, let us exclude small loadings, say below 0.15. Note, four of the six variables load on both factors.
print(result$loadings, cut=0.15)##
## Loadings:
## PA1 PA2
## prevents_cavities 0.948 0.168
## shiny_teeth -0.207 0.721
## strengthens_gums 0.914
## freshens_breath -0.247 0.735
## prevent_decay_not_imp -0.850 -0.260
## attractive_teeth 0.842
##
## PA1 PA2
## SS loadings 2.570 1.867
## Proportion Var 0.428 0.311
## Cumulative Var 0.428 0.739
Now, let’s examine the matrix after an orthogonal rotation using varimax. Is the mapping of variables to factors more clear?
fa_varimax = fa(r = survey,nfactors = 2,fm = 'pa',rotate = 'varimax')
print(fa_varimax$loadings,cut=0.15)##
## Loadings:
## PA1 PA2
## prevents_cavities 0.962
## shiny_teeth 0.748
## strengthens_gums 0.902 -0.155
## freshens_breath 0.771
## prevent_decay_not_imp -0.886
## attractive_teeth 0.844
##
## PA1 PA2
## SS loadings 2.539 1.898
## Proportion Var 0.423 0.316
## Cumulative Var 0.423 0.739
Now, let’s try an oblique rotation using oblimin. What do you think of the mapping of variables to factors?
fa_oblimin = fa(r = survey,nfactors = 2,fm = 'pa',rotate = 'oblimin')
print(fa_oblimin$loadings,cut=0.15)##
## Loadings:
## PA1 PA2
## prevents_cavities 0.948 0.168
## shiny_teeth -0.207 0.721
## strengthens_gums 0.914
## freshens_breath -0.247 0.735
## prevent_decay_not_imp -0.850 -0.260
## attractive_teeth 0.842
##
## PA1 PA2
## SS loadings 2.570 1.867
## Proportion Var 0.428 0.311
## Cumulative Var 0.428 0.739
Interpretation
Review pattern of factor loadings from the rotated matrix to interpret the factor analysis solution. The meaning of each factor is derived from the variables loading on it. So, let’s review the variables to describe each factor.
Since oblimin offered the clearest mapping of variables to factors, that matrix is used. To make interpretation easier, matrix is sorted.
What need is reflected in the first factor, PA1? What need is reflected by the second factor, PA2? Are these consistent with our a priori expectation?
print(fa_oblimin$loadings,cut=0.15, sort=T)##
## Loadings:
## PA1 PA2
## prevents_cavities 0.948 0.168
## strengthens_gums 0.914
## prevent_decay_not_imp -0.850 -0.260
## shiny_teeth -0.207 0.721
## freshens_breath -0.247 0.735
## attractive_teeth 0.842
##
## PA1 PA2
## SS loadings 2.570 1.867
## Proportion Var 0.428 0.311
## Cumulative Var 0.428 0.739
fa.diagram(fa_oblimin,sort = T)Representing the Factor
If the goal is to use the factors in further analysis, then they may be represented in one of three ways
Average scores of variables reflecting the factor
factor1_avg = rowMeans(survey[,c('prevents_cavities','strengthens_gums','prevent_decay_not_imp')])
factor2_avg = rowMeans(survey[,c('attractive_teeth','freshens_breath','shiny_teeth')]) Weighted average of variables reflecting the factor, where weights are the factor loadings
factor1_score = fa_oblimin$scores[,'PA1']
factor2_score = fa_oblimin$scores[,'PA2']Pick a variable as a representative of the factor. Here, we are selecting the variable with the largest factor loading.
factor1_surrogate = survey[,'prevents_cavities']
factor2_surrogate = survey[,'attractive_teeth']