Summary of:

Levshina, N. (2015). 18. Multidimensional analysis of register variation: Principal Component Analysis and Factor Analysis. How to do Linguistics with R: Data exploration and statistical analysis (pp. 351-366). John Benjamins Publishing Company.


18.1 Multidimensional analysis of register variation

Registers


Situations can be characterized by such parameters as


Registers are also associated with specific linguistic features


Douglas Biber (1988)


Exploratory Factor Analysis (FA) and Principal Components Analysis (PCA)


18.2 Case study: Register variation in the British National Corpus

18.2.1 The data and research question

library(Rling);library(psych);library(FactoMineR)

Data structure

data(reg_bnc)
str(reg_bnc)
## 'data.frame':    69 obs. of  12 variables:
##  $ Reg      : Factor w/ 6 levels "Acad","Fiction",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Ncomm    : num  0.17 0.205 0.206 0.136 0.133 ...
##  $ Nprop    : num  0.02697 0.02498 0.0468 0.0112 0.00985 ...
##  $ Vpres    : num  0.0355 0.0391 0.0366 0.0485 0.0452 ...
##  $ Vpast    : num  0.0219 0.0298 0.0236 0.0189 0.0198 ...
##  $ P1       : num  0.0347 0.0208 0.018 0.0276 0.0455 ...
##  $ P2       : num  0.01832 0.01137 0.00775 0.03749 0.03703 ...
##  $ Adj      : num  0.0536 0.0585 0.0596 0.0407 0.0446 ...
##  $ ConjCoord: num  0.0395 0.034 0.0335 0.0339 0.0384 ...
##  $ ConjSub  : num  0.031 0.0276 0.0232 0.0315 0.0283 ...
##  $ Interject: num  0.00997 0.00414 0.00226 0.02173 0.04298 ...
##  $ Num      : num  0.0206 0.0192 0.0277 0.0414 0.0164 ...
  • number of BNC subsections (rows) = 69
  • all variables are numeric vectors except Reg, which is a factor with six levels (the metaregisters)
rownames(reg_bnc)
##  [1] "S_brdcst_disc" "S_brdcst_doc"  "S_brdcst_news" "S_classroom"  
##  [5] "S_consult"     "S_conv"        "S_courtroom"   "S_demonstratn"
##  [9] "S_interv_oral" "S_interview"   "S_lect_arts"   "S_lect_com"   
## [13] "S_lect_law"    "S_lect_natsci" "S_lect_socsci" "S_meeting"    
## [17] "S_parliament"  "S_pub_debate"  "S_sermon"      "S_spch-script"
## [21] "S_spch+script" "S_sportslive"  "S_tutorial"    "S_unclass"    
## [25] "W_ac_engin"    "W_ac_hum_arts" "W_ac_law_edu"  "W_ac_medicine"
## [29] "W_ac_nat_sci"  "W_ac_soc_sci"  "W_admin"       "W_advert"     
## [33] "W_biography"   "W_commerce"    "W_email"       "W_essay_schl" 
## [37] "W_essay_univ"  "W_fict_drama"  "W_fict_poetry" "W_fict_prose" 
## [41] "W_hansard"     "W_inst_doc"    "W_instruction" "W_let_pers"   
## [45] "W_let_prof"    "W_misc"        "W_new_arts1"   "W_news_arts2" 
## [49] "W_news_com"    "W_news_edit"   "W_news_misc"   "W_news_o_com" 
## [53] "W_news_o_rep"  "W_news_o_sci"  "W_news_o_soc"  "W_news_o_sprt"
## [57] "W_news_rprt"   "W_news_sci"    "W_news_script" "W_news_soc"   
## [61] "W_news_sprt"   "W_news_tabld"  "W_nonac_arts"  "W_nonac_engin"
## [65] "W_nonac_law"   "W_nonac_med"   "W_nonac_nat"   "W_nonac_soc"  
## [69] "W_religion"

The main question

  • Can we identify interpretable dimensions of register variation on the basis of the data?
  • What are the relationships between the metaregisters with regard to these dimensions?

18.2.2 Principal Components Analysis

Two important conditions

  1. the variables should be intercorrelated
    • otherwise, we cannot reduce the data to a smaller number of underlying components
  2. the correlations should not be too high
    • very high correlations cause inaccurate estimates in FA (not a problem for PCA)
    • a rule of thumb
      • the absolute values of the correlations should be no lower than 0.3 and no higher than 0.9
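
The rule of thumb above can be checked mechanically. The following is a minimal sketch in base R; the helper name `screen_correlations` and the simulated data are assumptions made for illustration and do not come from the chapter.

```r
# Hypothetical helper: for each variable, count how many of its correlations
# with the other variables fall below `low` or above `high` in absolute value.
screen_correlations <- function(x, low = 0.3, high = 0.9) {
  r <- abs(cor(x))
  diag(r) <- NA                      # ignore self-correlations
  data.frame(
    variable = colnames(r),
    n_weak   = rowSums(r < low,  na.rm = TRUE),
    n_strong = rowSums(r > high, na.rm = TRUE),
    row.names = NULL
  )
}

# Simulated data (an assumption for illustration): b nearly duplicates a,
# while c is unrelated to both.
set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))
screen_correlations(toy)
```

With the chapter's data loaded, the same helper could be called as `screen_correlations(reg_bnc[, -1])`. A variable with many weak correlations is a candidate for closer inspection, and any strong pair would be a concern for FA.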

The correlations between the variables

round(cor(reg_bnc[,-1]),2)
##           Ncomm Nprop Vpres Vpast    P1    P2   Adj ConjCoord ConjSub
## Ncomm      1.00  0.23 -0.41 -0.21 -0.83 -0.75  0.86     -0.13   -0.52
## Nprop      0.23  1.00 -0.34  0.36 -0.37 -0.50  0.13     -0.45   -0.68
## Vpres     -0.41 -0.34  1.00 -0.46  0.42  0.50 -0.35      0.21    0.48
## Vpast     -0.21  0.36 -0.46  1.00  0.03 -0.11 -0.16      0.07   -0.22
## P1        -0.83 -0.37  0.42  0.03  1.00  0.80 -0.79      0.23    0.57
## P2        -0.75 -0.50  0.50 -0.11  0.80  1.00 -0.70      0.31    0.57
## Adj        0.86  0.13 -0.35 -0.16 -0.79 -0.70  1.00      0.04   -0.39
## ConjCoord -0.13 -0.45  0.21  0.07  0.23  0.31  0.04      1.00    0.26
## ConjSub   -0.52 -0.68  0.48 -0.22  0.57  0.57 -0.39      0.26    1.00
## Interject -0.67 -0.39  0.41  0.02  0.70  0.79 -0.62      0.18    0.36
## Num        0.21  0.28 -0.28 -0.13 -0.25 -0.16  0.03     -0.41   -0.28
##           Interject   Num
## Ncomm         -0.67  0.21
## Nprop         -0.39  0.28
## Vpres          0.41 -0.28
## Vpast          0.02 -0.13
## P1             0.70 -0.25
## P2             0.79 -0.16
## Adj           -0.62  0.03
## ConjCoord      0.18 -0.41
## ConjSub        0.36 -0.28
## Interject      1.00 -0.09
## Num           -0.09  1.00
  • the variable Num has many correlations with absolute values slightly under 0.3
  • multicollinearity should not be a concern, since no correlation exceeds 0.9 in absolute value (the highest is 0.86, between Ncomm and Adj)

Bartlett test

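  • null hypothesis: “the variables are not correlated”; a p-value below 0.05 rejects it, which indicates that the variables are sufficiently intercorrelated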
  • cortest.bartlett() in the package psych
cortest.bartlett(reg_bnc[,-1])
## R was not square, finding R from data
## $chisq
## [1] 536.3401
## 
## $p.value
## [1] 4.109611e-80
## 
## $df
## [1] 55
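
To make the reported statistic less of a black box, here is a sketch of Bartlett's test of sphericity computed by hand in base R. The function name `bartlett_sphericity` is hypothetical, and the built-in `USArrests` data stand in for the chapter's data, which require the Rling package.

```r
# Bartlett's test of sphericity computed from a correlation matrix R
# that was estimated on n observations of p variables.
bartlett_sphericity <- function(R, n) {
  p     <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Built-in data set used purely for illustration.
bartlett_sphericity(cor(USArrests), n = nrow(USArrests))

# With 11 variables, as in reg_bnc, the degrees of freedom are
# 11 * 10 / 2 = 55, which matches the $df value in the output above.
```

The statistic grows as the determinant of the correlation matrix approaches zero, that is, as the variables become more strongly intercorrelated.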

PCA

  • PCA() from the FactoMineR package:
reg.pca<-PCA(reg_bnc,quali.sup=1,graph=FALSE)
  • quali.sup=1 tells R that the first variable, Reg, should be regarded as a qualitative supplementary variable.
    • supplementary variables do not contribute to the construction of the principal components
    • there are two types of supplementary variables: qualitative and quantitative
  • graph=FALSE
    • suppresses the immediate creation of graphical output
  • PCA is usually performed on standardized scores rather than the original ones because it is sensitive to differences in scale
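
The effect of standardization can be seen in a small base R sketch. It uses `prcomp()` and the built-in `USArrests` data as stand-ins (both are assumptions for illustration; the chapter itself uses `PCA()`, which standardizes by default through its `scale.unit = TRUE` argument).

```r
# Built-in data whose columns are on very different scales (illustration only).
X <- USArrests

pca_raw <- prcomp(X, scale. = FALSE)   # PCA on the covariance matrix
pca_std <- prcomp(X, scale. = TRUE)    # PCA on the correlation matrix

# Share of the total variance attributed to the first component in each case.
share <- function(p) p$sdev[1]^2 / sum(p$sdev^2)
c(raw = share(pca_raw), standardized = share(pca_std))

# With standardization, the component variances are the eigenvalues of the
# correlation matrix, and they sum to the number of variables.
all.equal(pca_std$sdev^2, eigen(cor(X))$values)
sum(pca_std$sdev^2)
```

Without standardization, the variable with the largest raw variance dominates the first component, which is rarely what one wants when the variables are measured on different scales.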

Eigenvalues

  • proportions of variance explained by each component and cumulative explained variance
  • an eigenvalue shows how much of the total variance is explained by each component
    • the higher the correlations between a component and the variables, the greater the component’s eigenvalue
    • the eigenvalue of every additional component is smaller than the previous one
head(reg.pca$eig)
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  5.0682936              46.075396                          46.07540
## comp 2  1.8722103              17.020094                          63.09549
## comp 3  1.3758435              12.507669                          75.60316
## comp 4  0.7900757               7.182506                          82.78566
## comp 5  0.6451271               5.864791                          88.65046
## comp 6  0.4217144               3.833768                          92.48422
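
The percentage columns follow directly from the eigenvalues. As a sketch, the arithmetic can be reproduced from the printed values; it assumes standardized variables, so that the total variance equals the number of variables (11).

```r
# Eigenvalues of the first six components, copied from the output above.
eig <- c(5.0682936, 1.8722103, 1.3758435, 0.7900757, 0.6451271, 0.4217144)
n_vars <- 11   # number of standardized variables = total variance

pct <- 100 * eig / n_vars
data.frame(component  = seq_along(eig),
           eigenvalue = eig,
           percent    = round(pct, 2),
           cumulative = round(cumsum(pct), 2))
```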

The optimal numbers of components

  • retain components whose eigenvalues are greater than 1 (the Kaiser criterion) or 0.7 (Field et al. 2012: 762-784)
  • inspect the values visually using a scree plot
barplot(reg.pca$eig[,2], names=1:nrow(reg.pca$eig), xlab="components", ylab="Percentage of explained variance")
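
As a rough illustration, the two numeric thresholds can be applied to the eigenvalues printed earlier. Only the first six eigenvalues are available in these notes; since eigenvalues decrease, the remaining ones cannot change the counts. The line-style plot is an alternative presentation assumed here, not the chapter's bar plot.

```r
# Eigenvalues of the first six components, copied from head(reg.pca$eig).
eig <- c(5.0682936, 1.8722103, 1.3758435, 0.7900757, 0.6451271, 0.4217144)

n_kaiser  <- sum(eig > 1)     # Kaiser criterion
n_relaxed <- sum(eig > 0.7)   # the more lenient 0.7 threshold
c(kaiser = n_kaiser, relaxed = n_relaxed)

# A line-style scree plot: look for the "elbow" where the curve flattens.
plot(eig, type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = c(1, 0.7), lty = c(2, 3))
```

The two criteria disagree by one component here, which is why the visual inspection is a useful complement.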


PCA plot

  • choix="var" is used to represent the variables
plot(reg.pca,choix="var",cex=0.8)

  • the plot displays the first two components

  • the angles between the vectors and the axes indicate how strongly the variables are correlated with the dimensions
    • the smaller the angle, the stronger the correlation
    • if two vectors point in almost the same direction, the corresponding variables are highly correlated and may therefore represent the same underlying theoretical construct

    • the length of a vector reflects how much of the variation in the variable is captured by this low-dimensional display; the maximum length is 1
      • this is the quality of the representation of the variable on the plane
  • the first component contrasts involved and informational communication
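
The geometry of the variable plot can be reproduced in base R. This is a sketch with `prcomp()` on the built-in `USArrests` data (an assumption for illustration, not the chapter's data or function): the coordinates of each arrow are the correlations of the variable with the components.

```r
pca <- prcomp(USArrests, scale. = TRUE)   # built-in data, illustration only

# Correlation of each original variable with each component.
var_cor <- cor(USArrests, pca$x)

# The same values follow from the eigenvectors scaled by the component
# standard deviations.
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")
all.equal(var_cor, var_coord, check.attributes = FALSE)

# Squared length of each arrow on the plane of components 1 and 2
# (the quality of representation); it cannot exceed 1.
quality_12 <- rowSums(var_coord[, 1:2]^2)
round(quality_12, 3)

# Cosine of the angle between two arrows, compared with the actual
# correlation between the two variables.
v <- var_coord[, 1:2]
cos_ma <- sum(v["Murder", ] * v["Assault", ]) /
  sqrt(sum(v["Murder", ]^2) * sum(v["Assault", ]^2))
c(cosine = cos_ma, correlation = cor(USArrests$Murder, USArrests$Assault))
```

The cosine only approximates the correlation, and the approximation is better when both variables have arrows close to length 1.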


The correlation coefficients

dimdesc(reg.pca, axes=1)
## $Dim.1
## $Dim.1$quanti
##           correlation      p.value
## P2          0.9117524 1.365059e-27
## P1          0.8958585 2.694053e-25
## Interject   0.7913207 5.876448e-16
## ConjSub     0.7268571 1.540034e-12
## Vpres       0.6203029 1.311435e-08
## ConjCoord   0.3461531 3.574209e-03
## Num        -0.3236023 6.681150e-03
## Nprop      -0.5825157 1.513793e-07
## Adj        -0.7699620 1.056008e-14
## Ncomm      -0.8551296 8.636119e-21
## 
## $Dim.1$quali
##            R2      p.value
## Reg 0.7783745 2.391373e-19
## 
## $Dim.1$category
##             Estimate      p.value
## Spok        3.100121 9.721141e-20
## Acad       -1.256963 4.616021e-02
## NonacProse -1.272090 4.425996e-02
## News       -1.427441 4.513794e-05
  • 1st component
    • the top positively correlated features are the 1st and 2nd person pronouns and interjections
    • the strongest negative correlations are observed for common nouns and adjectives
    • by default, the function returns only those estimates that are significant (at the 0.05 level)
    • the function also returns estimates of regression coefficients for the categories of the qualitative supplementary variable (the metaregister)
    • the largest positive estimate is observed for the spoken data (involvement)
dimdesc(reg.pca, axes=2)
## $Dim.2
## $Dim.2$quanti
##           correlation      p.value
## Adj         0.5234173 3.936813e-06
## ConjCoord   0.4633356 6.092970e-05
## Ncomm       0.3965029 7.440364e-04
## Vpres       0.3489208 3.299614e-03
## ConjSub     0.3451253 3.681245e-03
## Num        -0.3367338 4.667194e-03
## Nprop      -0.6146862 1.925513e-08
## Vpast      -0.6342292 4.893429e-09
## 
## $Dim.2$quali
##            R2     p.value
## Reg 0.2553784 0.001918148
## 
## $Dim.2$category
##       Estimate      p.value
## Acad  1.225288 0.0136778790
## News -1.081269 0.0006610448
  • 2nd component
    • positive: adjectives and coordinating conjunctions
    • negative: past-tense verb forms and proper nouns
    • description vs. reporting of past events (tentatively)
    • the academic texts are significantly associated with positive values
    • the news texts are significantly associated with negative values
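
The kind of information shown in the `$quanti` and `$quali` parts of the output can be approximated in base R. The sketch below uses `prcomp()` and the built-in `iris` data, with `Species` playing the role of the supplementary factor; these are assumptions for illustration, and `dimdesc()` itself may differ in details such as how the category estimates are coded.

```r
# Built-in iris data: four measurements plus a grouping factor (Species).
pca  <- prcomp(iris[, 1:4], scale. = TRUE)
dim1 <- pca$x[, 1]   # scores on the first component (sign is arbitrary)

# Correlation of one variable with the first component, with its p-value.
ct <- cor.test(iris$Petal.Length, dim1)
c(correlation = unname(ct$estimate), p.value = ct$p.value)

# Share of variance in the component scores accounted for by the factor:
# the R-squared of a one-way ANOVA of the scores on the groups.
fit <- lm(dim1 ~ Species, data = iris)
summary(fit)$r.squared
```

Because the sign of a principal component is arbitrary, only the pattern of positive versus negative correlations is meaningful, not which end of the axis is labeled positive.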

PCA plot (dim 2, 3) and correlation coefficients (dim 3)

plot(reg.pca,axes=c(2,3),choix="var",cex=0.8)

dimdesc(reg.pca, axes=3)
## $Dim.3
## $Dim.3$quanti
##           correlation      p.value
## Vpast       0.6962689 3.086827e-11
## ConjCoord   0.5988047 5.481959e-08
## Vpres      -0.2928527 1.460723e-02
## Num        -0.6430702 2.549810e-09
## 
## $Dim.3$category
##         Estimate     p.value
## Fiction 1.527773 0.006017634
  • 3rd component
    • narrative vs. non-narrative texts
    • positive correlations: past-tense verbs, coordinating conjunctions
    • negative correlations: numerals (and, more weakly, present-tense verbs)
    • it distinguishes fiction from all other registers

Plot the individual subsections in the component space

#dim 1,2
plot(reg.pca, cex=0.8, col.ind="grey",col.quali="black")

#dim 2,3
plot(reg.pca,axes=c(2,3),cex=0.8,col.ind="grey",col.quali="black")

  • BNC subsections that belong to the same metaregister tend to cluster together

Plot confidence ellipses around the centroid

  • to estimate the amount of overlap of the prototypes of the registers
plotellipses(reg.pca,label="quali")

plotellipses(reg.pca,axes=c(2,3),label="quali")

  • the size of a confidence ellipse depends on the number of points: the fewer the points, the larger the ellipse
    • fiction has a large ellipse because this category is represented by only three BNC subsections

18.2.3 Factor Analysis

  • PCA
    • the main purpose of PCA is to find as few orthogonal (uncorrelated) components as possible while maximizing the total explained variance
    • i.e., to reduce dimensionality
  • FA
    • more widely used for exploring theoretical constructs, or latent variables, which are called factors
    • unlike PCA, FA ‘rotates’ the factors, trying to increase the loadings of the variables on several common factors

Factor Analysis (FA)

reg.fa<-factanal(reg_bnc[,-1],factors=3)
reg.fa
## 
## Call:
## factanal(x = reg_bnc[, -1], factors = 3)
## 
## Uniquenesses:
##     Ncomm     Nprop     Vpres     Vpast        P1        P2       Adj 
##     0.120     0.335     0.510     0.005     0.175     0.192     0.102 
## ConjCoord   ConjSub Interject       Num 
##     0.496     0.438     0.416     0.726 
## 
## Loadings:
##           Factor1 Factor2 Factor3
## Ncomm     -0.927          -0.125 
## Nprop     -0.214  -0.458  -0.640 
## Vpres      0.417   0.539   0.159 
## Vpast      0.138  -0.983         
## P1         0.868   0.118   0.240 
## P2         0.796   0.259   0.327 
## Adj       -0.940           0.107 
## ConjCoord                  0.709 
## ConjSub    0.480   0.336   0.467 
## Interject  0.716   0.101   0.248 
## Num       -0.109          -0.508 
## 
##                Factor1 Factor2 Factor3
## SS loadings      4.127   1.682   1.676
## Proportion Var   0.375   0.153   0.152
## Cumulative Var   0.375   0.528   0.680
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 65.18 on 25 degrees of freedom.
## The p-value is 1.95e-05
  • factor loading
    • the stronger a variable loads onto a factor, the more strongly this variable defines the factor.
      • analogous to correlation coefficients between the variables and factors
    • As a rule of thumb, loadings with absolute (positive or negative) values greater than 0.3 are considered to be important
  • Factor 1 : involved vs. informational communication
  • Factor 2 : narrative vs. non-narrative dimension
  • Factor 3 : ???
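
The 0.3 rule of thumb and the "Proportion Var" line can be worked through with the printed numbers. This is only a sketch based on the rounded output above; ConjCoord is left out of the Factor 1 vector because its loading was suppressed in the printout (the default print method hides absolute values below 0.1).

```r
# Factor 1 loadings as printed above (ConjCoord omitted, see the lead-in).
f1 <- c(Ncomm = -0.927, Nprop = -0.214, Vpres = 0.417, Vpast = 0.138,
        P1 = 0.868, P2 = 0.796, Adj = -0.940, ConjSub = 0.480,
        Interject = 0.716, Num = -0.109)

# Variables that pass the 0.3 rule of thumb, strongest first.
salient <- f1[abs(f1) > 0.3]
salient[order(-abs(salient))]

# "Proportion Var" is the sum of squared loadings of a factor divided by
# the number of variables (11 here).
ss_loadings <- c(Factor1 = 4.127, Factor2 = 1.682, Factor3 = 1.676)
round(ss_loadings / 11, 3)
round(cumsum(ss_loadings / 11), 3)
```

The salient variables split into pronouns, interjections, present-tense verbs and subordinating conjunctions on the positive side and common nouns and adjectives on the negative side, which is the basis for the involved vs. informational reading of Factor 1.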

The subsections and their scores

  • scores="Bartlett" makes factanal() return factor scores for each BNC subsection, computed with Bartlett's method

reg.fa<-factanal(reg_bnc[,-1],factors=3,scores="Bartlett")
plot(reg.fa$scores[,2:3],type="n")
text(reg.fa$scores[,2:3],rownames(reg_bnc),cex=0.7)

  • it seems that the third factor contrasts
    • more elaborate development of ideas (fiction, interviews, lectures on arts)
    • vs. concise, informationally dense communication (news, emails)
  • Are three factors enough?
    • see the p-value in the bottom line of the reg.fa output
    • a p-value smaller than 0.05 indicates that the number of factors is insufficient; here p = 1.95e-05, so three factors do not fully account for the data
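
The test in the bottom lines of the factanal() output can be retraced by hand. The sketch below recomputes the degrees of freedom and the p-value from the printed statistic; the variable names are chosen here for illustration.

```r
p <- 11   # observed variables
k <- 3    # factors

# Degrees of freedom of the likelihood-ratio test for a k-factor model.
df <- ((p - k)^2 - (p + k)) / 2
df

# Upper-tail chi-square probability for the statistic reported above.
p_value <- pchisq(65.18, df = df, lower.tail = FALSE)
p_value
p_value < 0.05   # TRUE means the k-factor model is rejected
```

The null hypothesis here is that k factors are sufficient, so, unlike in many other tests, a small p-value is bad news for the model.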

Varimax and Promax

  • two popular kinds of rotation
  • Varimax
    • a type of orthogonal rotation
    • factors are considered uncorrelated
  • Promax
    • a type of oblique rotation
    • allows for some degree of correlation between factors
  • If we expect to find really clear-cut, unique factors, it is better to use Varimax.
  • If one expects the resulting factors to be closely related, one can try Promax.
reg.fa<-factanal(reg_bnc[,-1],factors=3,rotation="promax")
reg.fa$loadings
## 
## Loadings:
##           Factor1 Factor2 Factor3
## Ncomm     -1.005          -0.195 
## Nprop             -0.718   0.258 
## Vpres      0.331          -0.472 
## Vpast      0.259   0.113   1.066 
## P1         0.869                 
## P2         0.735   0.205         
## Adj       -1.106   0.350  -0.102 
## ConjCoord -0.220   0.853   0.216 
## ConjSub    0.314   0.448  -0.159 
## Interject  0.697   0.132         
## Num               -0.596  -0.238 
## 
##                Factor1 Factor2 Factor3
## SS loadings      4.347   2.012   1.615
## Proportion Var   0.395   0.183   0.147
## Cumulative Var   0.395   0.578   0.725
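
The difference between the two rotations can be illustrated with the base R functions `varimax()` and `promax()` applied to a small loading matrix. The matrix `L` below is hypothetical and made up for this sketch; it is not derived from the BNC data.

```r
# Hypothetical unrotated loadings for six variables on two factors.
L <- matrix(c(0.70, 0.60, 0.65,  0.50,  0.55,  0.60,
              0.50, 0.40, 0.45, -0.50, -0.45, -0.55),
            ncol = 2,
            dimnames = list(paste0("v", 1:6), c("F1", "F2")))

vm <- varimax(L)   # orthogonal rotation
pm <- promax(L)    # oblique rotation

# An orthogonal rotation matrix satisfies t(T) %*% T = I, and it leaves
# each variable's communality (row sum of squared loadings) unchanged.
round(crossprod(vm$rotmat), 6)
all.equal(rowSums(unclass(vm$loadings)^2), rowSums(L^2))

# Factor correlations implied by the oblique rotation. Off-diagonal values
# far from 0 suggest that the factors are related.
phi <- solve(crossprod(pm$rotmat))
round(phi, 3)
```

Under an oblique rotation the loadings are regression-like pattern coefficients rather than correlations, which is why some of them can exceed 1 in absolute value, as for Ncomm, Adj and Vpast in the Promax output above.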

18.3 Summary


How to report results of PCA and FA