Summary of:
Levshina, N. (2015). 18. Multidimensional analysis of register variation: Principal Component Analysis and Factor Analysis. In How to do Linguistics with R: Data exploration and statistical analysis (pp. 351-366). John Benjamins Publishing Company.
18.1 Multidimensional analysis of register variation
Registers
- language varieties associated with a particular situation of use
- e.g. face-to-face conversations, emails, textbooks, fictional novels, lectures and Twitter messages
Situations can be characterized by such parameters as
- the channel of communication (speech, writing or signing)
- relationship between the participants (social status, personal relationships)
- communicative purpose (transfer of information, persuasion, entertainment, etc.)
- settings (private or public place of communication)
Registers are also associated with specific linguistic features
- face-to-face conversation
- a high proportion of first and second person pronouns
- a low proportion of nouns and adjectives
Douglas Biber (1988)
- Factor Analysis
- ‘Involved versus Informational Production’
- ‘Narrative versus Non-narrative Concerns’
- the registers and specific texts can be mapped onto the register space according to their linguistic features
- face-to-face and telephone conversations
- academic prose
Exploratory Factor Analysis (FA) and Principal Components Analysis (PCA)
- used to simplify the data structure and classify variables and objects
- FA is more appropriate for detecting theoretically relevant underlying dimensions in the data
- PCA and FA usually yield similar results (Field et al. 2012:760)
- if the variables are strongly correlated
- if the number of variables is large
18.2. Case study: Register variation in the British National Corpus
18.2.1 The data and research question
library(Rling); library(psych); library(FactoMineR)
Data structure
data(reg_bnc)
str(reg_bnc)
## 'data.frame': 69 obs. of 12 variables:
## $ Reg : Factor w/ 6 levels "Acad","Fiction",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Ncomm : num 0.17 0.205 0.206 0.136 0.133 ...
## $ Nprop : num 0.02697 0.02498 0.0468 0.0112 0.00985 ...
## $ Vpres : num 0.0355 0.0391 0.0366 0.0485 0.0452 ...
## $ Vpast : num 0.0219 0.0298 0.0236 0.0189 0.0198 ...
## $ P1 : num 0.0347 0.0208 0.018 0.0276 0.0455 ...
## $ P2 : num 0.01832 0.01137 0.00775 0.03749 0.03703 ...
## $ Adj : num 0.0536 0.0585 0.0596 0.0407 0.0446 ...
## $ ConjCoord: num 0.0395 0.034 0.0335 0.0339 0.0384 ...
## $ ConjSub : num 0.031 0.0276 0.0232 0.0315 0.0283 ...
## $ Interject: num 0.00997 0.00414 0.00226 0.02173 0.04298 ...
## $ Num : num 0.0206 0.0192 0.0277 0.0414 0.0164 ...
- number of BNC subsections = 69
- all variables are numeric vectors except Reg, which is a factor with six levels (the metaregisters)
rownames(reg_bnc)
## [1] "S_brdcst_disc" "S_brdcst_doc" "S_brdcst_news" "S_classroom"
## [5] "S_consult" "S_conv" "S_courtroom" "S_demonstratn"
## [9] "S_interv_oral" "S_interview" "S_lect_arts" "S_lect_com"
## [13] "S_lect_law" "S_lect_natsci" "S_lect_socsci" "S_meeting"
## [17] "S_parliament" "S_pub_debate" "S_sermon" "S_spch-script"
## [21] "S_spch+script" "S_sportslive" "S_tutorial" "S_unclass"
## [25] "W_ac_engin" "W_ac_hum_arts" "W_ac_law_edu" "W_ac_medicine"
## [29] "W_ac_nat_sci" "W_ac_soc_sci" "W_admin" "W_advert"
## [33] "W_biography" "W_commerce" "W_email" "W_essay_schl"
## [37] "W_essay_univ" "W_fict_drama" "W_fict_poetry" "W_fict_prose"
## [41] "W_hansard" "W_inst_doc" "W_instruction" "W_let_pers"
## [45] "W_let_prof" "W_misc" "W_new_arts1" "W_news_arts2"
## [49] "W_news_com" "W_news_edit" "W_news_misc" "W_news_o_com"
## [53] "W_news_o_rep" "W_news_o_sci" "W_news_o_soc" "W_news_o_sprt"
## [57] "W_news_rprt" "W_news_sci" "W_news_script" "W_news_soc"
## [61] "W_news_sprt" "W_news_tabld" "W_nonac_arts" "W_nonac_engin"
## [65] "W_nonac_law" "W_nonac_med" "W_nonac_nat" "W_nonac_soc"
## [69] "W_religion"
The main questions
- Can we identify interpretable dimensions of register variation on the basis of the data?
- What are the relationships between the metaregisters with regard to these dimensions?
18.2.2 Principal Components Analysis
Two important conditions
- the variables should be intercorrelated
- otherwise, we cannot reduce the data to a smaller number of underlying components
- the correlations should not be too high
- very high correlations cause inaccurate estimates in FA (not a problem for PCA)
- a rule of thumb
- the absolute values of the correlations should be neither below 0.3 nor above 0.9
The correlations between the variables
round(cor(reg_bnc[,-1]),2)
## Ncomm Nprop Vpres Vpast P1 P2 Adj ConjCoord ConjSub
## Ncomm 1.00 0.23 -0.41 -0.21 -0.83 -0.75 0.86 -0.13 -0.52
## Nprop 0.23 1.00 -0.34 0.36 -0.37 -0.50 0.13 -0.45 -0.68
## Vpres -0.41 -0.34 1.00 -0.46 0.42 0.50 -0.35 0.21 0.48
## Vpast -0.21 0.36 -0.46 1.00 0.03 -0.11 -0.16 0.07 -0.22
## P1 -0.83 -0.37 0.42 0.03 1.00 0.80 -0.79 0.23 0.57
## P2 -0.75 -0.50 0.50 -0.11 0.80 1.00 -0.70 0.31 0.57
## Adj 0.86 0.13 -0.35 -0.16 -0.79 -0.70 1.00 0.04 -0.39
## ConjCoord -0.13 -0.45 0.21 0.07 0.23 0.31 0.04 1.00 0.26
## ConjSub -0.52 -0.68 0.48 -0.22 0.57 0.57 -0.39 0.26 1.00
## Interject -0.67 -0.39 0.41 0.02 0.70 0.79 -0.62 0.18 0.36
## Num 0.21 0.28 -0.28 -0.13 -0.25 -0.16 0.03 -0.41 -0.28
## Interject Num
## Ncomm -0.67 0.21
## Nprop -0.39 0.28
## Vpres 0.41 -0.28
## Vpast 0.02 -0.13
## P1 0.70 -0.25
## P2 0.79 -0.16
## Adj -0.62 0.03
## ConjCoord 0.18 -0.41
## ConjSub 0.36 -0.28
## Interject 1.00 -0.09
## Num -0.09 1.00
- the variable Num has many correlations with absolute values slightly under 0.3
- multicollinearity should not be a concern, since no correlation exceeds 0.9 in absolute value (the largest is 0.86, between Ncomm and Adj)
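The rule of thumb above can be checked mechanically rather than by eye. The following is a minimal sketch: `check_cor_range()` is a hypothetical helper (not part of Rling, psych or FactoMineR), and the built-in `mtcars` data stands in for `reg_bnc[,-1]` so that the example runs without the Rling package.

```r
# Hypothetical helper: for each variable, count the correlations with the
# other variables whose absolute value falls below `low` or above `high`.
check_cor_range <- function(x, low = 0.3, high = 0.9) {
  r <- abs(cor(x))
  diag(r) <- NA  # ignore each variable's correlation with itself
  data.frame(
    variable = colnames(r),
    n_weak   = rowSums(r < low,  na.rm = TRUE),  # candidates for "too weakly correlated"
    n_strong = rowSums(r > high, na.rm = TRUE),  # candidates for multicollinearity
    row.names = NULL
  )
}

# Stand-in data; with the chapter's data one would pass reg_bnc[, -1] instead
cor_check <- check_cor_range(mtcars)
print(cor_check)
```

A variable with a large `n_weak` count behaves like Num in the notes above; any non-zero `n_strong` count would point to a pair worth inspecting before running FA.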
Bartlett test
- null hypothesis : “the variables are not correlated”
cortest.bartlett() in the package psych
cortest.bartlett(reg_bnc[,-1])
## R was not square, finding R from data
## $chisq
## [1] 536.3401
##
## $p.value
## [1] 4.109611e-80
##
## $df
## [1] 55
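The very small p-value means the null hypothesis of uncorrelated variables is rejected, so the data are suitable for PCA and FA. As a rough illustration of what the test computes, the sketch below applies the usual chi-square approximation to the determinant of the correlation matrix. `bartlett_sphericity()` is a hypothetical name, and `mtcars` is only a stand-in dataset; it happens to have 11 numeric variables as well, so the degrees of freedom come out as 11 * 10 / 2 = 55, as in the output above.

```r
# Bartlett's test of sphericity from the correlation matrix.
# H0: the correlation matrix is an identity matrix (variables uncorrelated).
bartlett_sphericity <- function(x) {
  n <- nrow(x)                 # number of observations
  p <- ncol(x)                 # number of variables
  R <- cor(x)
  chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq,
       df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Stand-in data; with the chapter's data one would pass reg_bnc[, -1]
bt <- bartlett_sphericity(mtcars)
str(bt)
```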
PCA
PCA() from the FactoMineR package:
reg.pca<-PCA(reg_bnc,quali.sup=1,graph=FALSE)
- quali.sup=1 tells R that the first variable, Reg, should be regarded as a qualitative supplementary variable
- supplementary variables do not contribute to the construction of principal components
- types of supplementary variables: qualitative and quantitative
- graph=FALSE suppresses the immediate creation of graphical output
- PCA is usually performed on standardized scores rather than original ones because it is sensitive to scaling differences
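To see why standardization matters, the sketch below compares PCA on raw and on standardized variables with base R's `prcomp()`. The built-in `mtcars` data is a stand-in whose variables are measured on very different scales; `share_pc1()` is a hypothetical helper. FactoMineR's `PCA()` standardizes by default through its `scale.unit` argument, so the analysis in these notes corresponds to the scaled variant.

```r
pca_raw    <- prcomp(mtcars, scale. = FALSE)  # PCA on the covariance matrix
pca_scaled <- prcomp(mtcars, scale. = TRUE)   # PCA on the correlation matrix

# Share of the total variance captured by the first component
share_pc1 <- function(p) p$sdev[1]^2 / sum(p$sdev^2)
shares <- c(raw = share_pc1(pca_raw), scaled = share_pc1(pca_scaled))
print(round(shares, 3))

# In the raw analysis the variable with the largest weight on PC1 is simply
# the one with the largest variance
names(which.max(abs(pca_raw$rotation[, 1])))
names(which.max(sapply(mtcars, var)))
```

Without scaling, the first component mostly reflects whichever variable has the largest numeric spread, which is why the raw share is inflated.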
Eigenvalues
- proportions of variance explained by each component and cumulative explained variance
- an eigenvalue shows how much of the total variance is explained by each component
- the higher the correlations between a component and the variables, the greater the component’s eigenvalue
- the eigenvalue of every additional component is smaller than the previous one
head(reg.pca$eig)
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 5.0682936 46.075396 46.07540
## comp 2 1.8722103 17.020094 63.09549
## comp 3 1.3758435 12.507669 75.60316
## comp 4 0.7900757 7.182506 82.78566
## comp 5 0.6451271 5.864791 88.65046
## comp 6 0.4217144 3.833768 92.48422
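With standardized variables the eigenvalues sum to the number of variables, so each percentage is simply the eigenvalue divided by that number: in the output above, 5.0683 / 11 is about 46.08%. A minimal sketch of the same table computed by hand follows, using the built-in `mtcars` data as a stand-in for `reg_bnc[,-1]`; the object names are illustrative.

```r
X  <- mtcars                       # stand-in for reg_bnc[, -1]
ev <- eigen(cor(X))$values         # eigenvalues of the correlation matrix

eig_table <- cbind(
  eigenvalue = ev,
  percentage = 100 * ev / sum(ev),
  cumulative = cumsum(100 * ev / sum(ev))
)
rownames(eig_table) <- paste("comp", seq_along(ev))
head(round(eig_table, 3))

# The same eigenvalues are the squared standard deviations of the components
pc_var <- prcomp(X, scale. = TRUE)$sdev^2
```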
The optimal numbers of components
- retain components whose eigenvalues are greater than 1 (the Kaiser criterion) or 0.7 (Field et al. 2012:762-784)
- inspect the values visually using a screeplot
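Applied to the eigenvalues printed above, the two thresholds give different answers. The sketch below copies the first six eigenvalues from the `reg.pca$eig` output; the remaining five are smaller still, so they do not affect either count. `n_retained()` is a hypothetical helper.

```r
# First six eigenvalues copied from the reg.pca$eig output above
eig <- c(5.0682936, 1.8722103, 1.3758435, 0.7900757, 0.6451271, 0.4217144)

# Number of components whose eigenvalue exceeds a given threshold
n_retained <- function(eig, threshold) sum(eig > threshold)

kaiser  <- n_retained(eig, 1)    # eigenvalue > 1
lenient <- n_retained(eig, 0.7)  # eigenvalue > 0.7
c(kaiser = kaiser, lenient = lenient)
# kaiser = 3, lenient = 4
```

The notes go on to interpret three components, which matches the stricter threshold.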
barplot(reg.pca$eig[,2], names=1:nrow(reg.pca$eig), xlab="components", ylab="Percentage of explained variance")
PCA plot
choix="var" is used to represent the variables
plot(reg.pca,choix="var",cex=0.8)
- the plot displays the first two components
- the angle between a vector and an axis indicates how strongly the variable is correlated with that dimension
- the smaller the angle, the stronger the correlation
- if two vectors point in almost the same direction, the corresponding variables are highly correlated and may therefore represent the same underlying theoretical construct
- the length of the vector reflects how much variation in the variable is captured by this low-dimensional display with the maximum length of 1
- quality of the representation of a variable on the plane
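The geometry of the variable plot can be reproduced numerically: for standardized data, a variable's coordinates on the plane are its correlations with the two components, and the squared length of its vector is the share of its variance captured by the plane. The sketch below uses base R's `prcomp()` on the built-in `mtcars` data as a stand-in; it is an illustration of the idea, not FactoMineR's own code.

```r
pca    <- prcomp(mtcars, scale. = TRUE)
scores <- pca$x[, 1:2]                       # observation scores on PC1 and PC2

# Variable coordinates = correlations between variables and component scores
var_coord <- cor(mtcars, scores)

# Equivalent computation: eigenvector entries times component standard deviations
var_coord_alt <- sweep(pca$rotation[, 1:2], 2, pca$sdev[1:2], `*`)

# Vector length (at most 1) and quality of representation on the plane
vec_length <- sqrt(rowSums(var_coord^2))
quality    <- vec_length^2

# Cosine of the angle between each vector and the first axis
cos_axis1 <- var_coord[, 1] / vec_length

round(cbind(var_coord, length = vec_length, quality = quality), 2)
```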
the first component relates to involved versus informational communication
the correlation coefficients
dimdesc(reg.pca, axes=1)
## $Dim.1
## $Dim.1$quanti
## correlation p.value
## P2 0.9117524 1.365059e-27
## P1 0.8958585 2.694053e-25
## Interject 0.7913207 5.876448e-16
## ConjSub 0.7268571 1.540034e-12
## Vpres 0.6203029 1.311435e-08
## ConjCoord 0.3461531 3.574209e-03
## Num -0.3236023 6.681150e-03
## Nprop -0.5825157 1.513793e-07
## Adj -0.7699620 1.056008e-14
## Ncomm -0.8551296 8.636119e-21
##
## $Dim.1$quali
## R2 p.value
## Reg 0.7783745 2.391373e-19
##
## $Dim.1$category
## Estimate p.value
## Spok 3.100121 9.721141e-20
## Acad -1.256963 4.616021e-02
## NonacProse -1.272090 4.425996e-02
## News -1.427441 4.513794e-05
- 1st component
  - the top positively correlated features are the 1st and 2nd person pronouns and interjections
  - the strongest negative correlations are observed for common nouns and adjectives
  - by default, the function returns only those estimates that are significant (at the level of 0.05)
  - the function also returns the estimates of regression coefficients for the categories of the qualitative supplementary variable (the metaregister)
  - the largest positive estimate is observed for the spoken data (involvement)
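The kind of information that dimdesc() lists can be approximated in base R. The sketch below is not a reimplementation of dimdesc(): it assumes that the quantitative part amounts to a correlation test between each variable and the component scores, and that the R2 for a qualitative variable corresponds to a one-way linear model of the scores on that factor. The built-in `iris` data is a stand-in, with `Species` playing the role of the supplementary variable Reg.

```r
pca  <- prcomp(iris[, 1:4], scale. = TRUE)
dim1 <- pca$x[, 1]                          # scores on the first component

# Quantitative variables: correlation with the component and its p-value
quanti <- t(sapply(iris[, 1:4], function(v) {
  ct <- cor.test(v, dim1)
  c(correlation = unname(ct$estimate), p.value = ct$p.value)
}))
quanti <- quanti[order(-quanti[, "correlation"]), ]
print(signif(quanti, 4))

# Qualitative variable: share of the component's variance explained by the groups
fit <- lm(dim1 ~ Species, data = iris)
r2  <- summary(fit)$r.squared
r2
```

The sign of a principal component is arbitrary, so the direction of the correlations may flip between implementations; only their relative pattern is meaningful.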
dimdesc(reg.pca, axes=2)
## $Dim.2
## $Dim.2$quanti
## correlation p.value
## Adj 0.5234173 3.936813e-06
## ConjCoord 0.4633356 6.092970e-05
## Ncomm 0.3965029 7.440364e-04
## Vpres 0.3489208 3.299614e-03
## ConjSub 0.3451253 3.681245e-03
## Num -0.3367338 4.667194e-03
## Nprop -0.6146862 1.925513e-08
## Vpast -0.6342292 4.893429e-09
##
## $Dim.2$quali
## R2 p.value
## Reg 0.2553784 0.001918148
##
## $Dim.2$category
## Estimate p.value
## Acad 1.225288 0.0136778790
## News -1.081269 0.0006610448
- 2nd component
  - positive: adjectives and coordinating conjunctions
  - negative: past forms of verbs and proper nouns
  - description vs. reporting of past events (tentatively)
  - the academic texts: significantly correlated with positive values
  - the news: significantly correlated with negative values
PCA plot (dim 2, 3) and correlation coefficients (dim 3)
plot(reg.pca,axes=c(2,3),choix="var",cex=0.8)
dimdesc(reg.pca, axes=3)
## $Dim.3
## $Dim.3$quanti
## correlation p.value
## Vpast 0.6962689 3.086827e-11
## ConjCoord 0.5988047 5.481959e-08
## Vpres -0.2928527 1.460723e-02
## Num -0.6430702 2.549810e-09
##
## $Dim.3$category
## Estimate p.value
## Fiction 1.527773 0.006017634
- 3rd component
  - narrative and non-narrative texts
  - positive correlation: past tense verbs, coordinating conjunctions
  - negative correlation: numerals
  - it distinguishes fiction from all other registers
Plot the individual subsections onto the space
#dim 1,2
plot(reg.pca, cex=0.8, col.ind="grey",col.quali="black")
#dim 2,3
plot(reg.pca,axes=c(2,3),cex=0.8,col.ind="grey",col.quali="black")
- BNC subsections that belong to the same metaregister tend to cluster together
Plot confidence ellipses around the centroid
- to estimate the amount of overlap of the prototypes of the registers
plotellipses(reg.pca,label="quali")
plotellipses(reg.pca,axes=c(2,3),label="quali")
- the size of confidence ellipses depends on the number of points
  - in the case of fiction, the ellipse is large because this category is represented by only three BNC subsections
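The dependence of ellipse size on the number of points can be illustrated with the textbook Hotelling-type confidence region for a bivariate mean. This is a sketch under that assumption and not necessarily the computation plotellipses() performs; `mean_ellipse_area()` is a hypothetical helper and the covariance matrix `S` is made up for illustration.

```r
# Area of the confidence ellipse for the mean of a bivariate sample with
# covariance matrix S and n observations (Hotelling T^2 region, requires n > 2)
mean_ellipse_area <- function(S, n, level = 0.95) {
  stopifnot(n > 2)
  crit <- 2 * (n - 1) / (n * (n - 2)) * qf(level, 2, n - 2)
  pi * crit * sqrt(det(S))
}

S <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)   # illustrative covariance matrix

areas <- sapply(c(3, 10, 30), function(n) mean_ellipse_area(S, n))
names(areas) <- c("n=3", "n=10", "n=30")
print(signif(areas, 3))
```

With the same spread of points, a category with only three members gets a far larger region than one with thirty, which is the situation of fiction in these data.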
18.2.3 Factor Analysis
- PCA
- the main purpose of PCA is to find as few orthogonal (uncorrelated) components as possible while maximizing the total explained variance
- to reduce dimensionality
- FA
- more widely used for exploring theoretical constructs, or latent variables, which are called factors
- unlike PCA, FA ‘rotates’ the factors, trying to increase the loadings of the variables on several common factors
Factor Analysis (FA)
reg.fa<-factanal(reg_bnc[,-1],factors=3)
reg.fa
##
## Call:
## factanal(x = reg_bnc[, -1], factors = 3)
##
## Uniquenesses:
## Ncomm Nprop Vpres Vpast P1 P2 Adj
## 0.120 0.335 0.510 0.005 0.175 0.192 0.102
## ConjCoord ConjSub Interject Num
## 0.496 0.438 0.416 0.726
##
## Loadings:
## Factor1 Factor2 Factor3
## Ncomm -0.927 -0.125
## Nprop -0.214 -0.458 -0.640
## Vpres 0.417 0.539 0.159
## Vpast 0.138 -0.983
## P1 0.868 0.118 0.240
## P2 0.796 0.259 0.327
## Adj -0.940 0.107
## ConjCoord 0.709
## ConjSub 0.480 0.336 0.467
## Interject 0.716 0.101 0.248
## Num -0.109 -0.508
##
## Factor1 Factor2 Factor3
## SS loadings 4.127 1.682 1.676
## Proportion Var 0.375 0.153 0.152
## Cumulative Var 0.375 0.528 0.680
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 65.18 on 25 degrees of freedom.
## The p-value is 1.95e-05
- factor loadings
  - the more strongly a variable loads onto a factor, the more strongly this variable defines the factor
  - analogous to correlation coefficients between the variables and factors
  - as a rule of thumb, loadings with absolute values greater than 0.3 are considered important
- Factor 1 : involved vs. informational communication
- Factor 2 : narrative vs. non-narrative dimension
- Factor 3 : ???
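The summary rows of the factanal() output follow directly from the loadings: "SS loadings" is the column sum of squared loadings, "Proportion Var" divides it by the number of variables (in the output above, 4.127 / 11 is about 0.375), and for an orthogonal rotation each uniqueness is approximately one minus the variable's sum of squared loadings. The sketch below checks these relations on the built-in `ability.cov` covariance data, which is a stand-in and unrelated to the chapter's data.

```r
fa <- factanal(factors = 2, covmat = ability.cov)   # default varimax rotation
L  <- unclass(loadings(fa))                         # plain loadings matrix

communality <- rowSums(L^2)          # variance of each variable explained by the factors
ss_loadings <- colSums(L^2)          # "SS loadings" row
prop_var    <- ss_loadings / nrow(L) # "Proportion Var" row

round(cbind(uniqueness = fa$uniquenesses,
            one_minus_communality = 1 - communality), 3)
round(rbind(ss_loadings, prop_var, cum_var = cumsum(prop_var)), 3)

# Loadings that pass the 0.3 rule of thumb
salient <- abs(L) > 0.3
salient
```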
the BNC subsections and their factor scores
- scores="Bartlett" asks factanal() to return Bartlett's weighted least-squares factor scores
reg.fa<-factanal(reg_bnc[,-1],factors=3,scores="Bartlett")
plot(reg.fa$scores[,2:3],type="n")
text(reg.fa$scores[,2:3],rownames(reg_bnc),cex=0.7)
- it seems that the third factor relates to more or less elaborate development of ideas
  - elaborate development of ideas (fiction, interviews, lectures on arts)
  - vs. concise, informationally dense communication (news, emails)
- Are three factors enough?
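- the p-value in the bottom line of the reg.fa output tests the hypothesis that three factors are sufficient
- a p-value smaller than 0.05 indicates that the number of factors is insufficient; here p = 1.95e-05, so by this test three factors do not fully account for the data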
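The degrees of freedom of this test depend only on the number of variables p and the number of factors k, as ((p - k)^2 - (p + k)) / 2, which gives 25 for 11 variables and 3 factors, as in the output above. The sketch below shows that formula and how the test results can be read from a fitted object, again using the built-in `ability.cov` data as a stand-in; `fa_dof()` is a hypothetical helper.

```r
# Degrees of freedom of the likelihood-ratio test for k factors and p variables
fa_dof <- function(p, k) ((p - k)^2 - (p + k)) / 2
fa_dof(11, 3)
# 25

# Fit models with an increasing number of factors and collect the test results
fits  <- lapply(1:2, function(k) factanal(factors = k, covmat = ability.cov))
dofs  <- sapply(fits, function(f) f$dof)
pvals <- sapply(fits, function(f) unname(f$PVAL))

data.frame(factors = 1:2, dof = dofs, p.value = signif(pvals, 3))
```

A common way to use the table is to pick the smallest number of factors whose p-value is above 0.05, while keeping in mind that the test is sensitive to sample size.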
Varimax and Promax
- two popular kinds of rotation
- Varimax
- a type of orthogonal rotation
- factors are considered uncorrelated
- Promax
- a type of oblique rotation
- allows for some degree of correlation between factors
- If we expect to find really clear-cut, unique factors, it is better to use Varimax.
- If one expects the resulting factors to be closely related, one can try Promax.
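The difference between the two rotations shows up in the rotation matrix. The sketch below applies base R's `varimax()` and `promax()` to unrotated loadings from the built-in `ability.cov` data (a stand-in dataset): the varimax rotation matrix is orthogonal, whereas the promax one implies non-zero correlations between the factors.

```r
fa0 <- factanal(factors = 2, covmat = ability.cov, rotation = "none")
L0  <- loadings(fa0)                 # unrotated loadings

vm <- varimax(L0)                    # orthogonal rotation
pm <- promax(L0)                     # oblique rotation

# Orthogonal rotation: t(T) %*% T is the identity matrix
vm_check <- crossprod(vm$rotmat)
round(vm_check, 3)

# Oblique rotation: implied correlations between the rotated factors
phi <- solve(crossprod(pm$rotmat))
round(phi, 3)
```

If the off-diagonal values of `phi` are close to zero, the oblique solution adds little over varimax; sizeable values suggest that correlated factors describe the data better.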
reg.fa<-factanal(reg_bnc[,-1],factors=3,rotation="promax")
reg.fa$loadings
##
## Loadings:
## Factor1 Factor2 Factor3
## Ncomm -1.005 -0.195
## Nprop -0.718 0.258
## Vpres 0.331 -0.472
## Vpast 0.259 0.113 1.066
## P1 0.869
## P2 0.735 0.205
## Adj -1.106 0.350 -0.102
## ConjCoord -0.220 0.853 0.216
## ConjSub 0.314 0.448 -0.159
## Interject 0.697 0.132
## Num -0.596 -0.238
##
## Factor1 Factor2 Factor3
## SS loadings 4.347 2.012 1.615
## Proportion Var 0.395 0.183 0.147
## Cumulative Var 0.395 0.578 0.725
18.3 Summary
- Multidimensional analysis of register variation is not the only possible application of PCA and FA in linguistics
- For example, one can use the scores of components or factors as input in regression analysis to solve the problem of multicollinearity or to simplify the model.
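A minimal sketch of this use case follows: correlated predictors are replaced by per-observation component scores, which are uncorrelated by construction and can be entered into a regression. The built-in `mtcars` data and the choice of predictors are illustrative assumptions, unrelated to the chapter's data.

```r
# Four strongly intercorrelated predictors of fuel consumption
predictors <- mtcars[, c("disp", "hp", "wt", "cyl")]
round(cor(predictors), 2)

# Replace them with the scores on the first two principal components
pca    <- prcomp(predictors, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])
scores$mpg <- mtcars$mpg

# The component scores are uncorrelated, so multicollinearity is no longer an issue
pc_cor <- cor(scores$PC1, scores$PC2)

fit <- lm(mpg ~ PC1 + PC2, data = scores)
summary(fit)$coefficients
```

The price of this simplification is interpretability: the coefficients now describe components, which have to be interpreted through their loadings.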
How to report results of PCA and FA
- sample size,
- the number of variables and the procedure
- how you made the decision about the number of components/factors, as well as the rotation method and the p-value
- crucially, one should include a table with the factor loadings for each variable
- all relevant biplots should be provided as well if the purpose is also to obtain a classification of observations