18. Multidimensional analysis of register variation

A summary of:

Levshina, N. (2015). 18. Multidimensional analysis of register variation: Principal Component Analysis and Factor Analysis. In How to do Linguistics with R: Data exploration and statistical analysis (pp. 351-366). John Benjamins Publishing Company.

18.1 Multidimensional analysis of register variation

Registers

language varieties associated with a particular situation of use
e.g. face-to-face conversations, emails, textbooks, fictional novels, lectures and Twitter messages

Situations can be characterized by such parameters as

the channel of communication (speech, writing or signing)
relationship between the participants (social status, personal relationships)
communicative purpose (transfer of information, persuasion, entertainment, etc.)
settings (private or public place of communication)

Registers are also associated with specific linguistic features

face-to-face conversation
- a high proportion of first and second person pronouns
- a low proportion of nouns and adjectives

Douglas Biber (1988)

Factor Analysis reveals dimensions of register variation, e.g.
- ‘Involved versus Informational Production’
- ‘Narrative versus Non-narrative Concerns’
the registers and specific texts can be mapped onto the register space according to their linguistic features
- face-to-face and telephone conversations: highly involved
- academic prose: highly informational

Exploratory Factor Analysis (FA) and Principal Components Analysis (PCA)

used to simplify the data structure and classify variables and objects
FA is more appropriate for detecting theoretically relevant underlying dimensions in the data
PCA and FA usually yield similar results (Field et al. 2012:760)
- if the variables are strongly correlated
- if the number of variables is large

18.2. Case study: Register variation in the British National Corpus

18.2.1 The data and research question

library(Rling);library(psych);library(FactoMineR)

Data structure

data(reg_bnc)
str(reg_bnc)

## 'data.frame':    69 obs. of  12 variables:
##  $ Reg      : Factor w/ 6 levels "Acad","Fiction",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Ncomm    : num  0.17 0.205 0.206 0.136 0.133 ...
##  $ Nprop    : num  0.02697 0.02498 0.0468 0.0112 0.00985 ...
##  $ Vpres    : num  0.0355 0.0391 0.0366 0.0485 0.0452 ...
##  $ Vpast    : num  0.0219 0.0298 0.0236 0.0189 0.0198 ...
##  $ P1       : num  0.0347 0.0208 0.018 0.0276 0.0455 ...
##  $ P2       : num  0.01832 0.01137 0.00775 0.03749 0.03703 ...
##  $ Adj      : num  0.0536 0.0585 0.0596 0.0407 0.0446 ...
##  $ ConjCoord: num  0.0395 0.034 0.0335 0.0339 0.0384 ...
##  $ ConjSub  : num  0.031 0.0276 0.0232 0.0315 0.0283 ...
##  $ Interject: num  0.00997 0.00414 0.00226 0.02173 0.04298 ...
##  $ Num      : num  0.0206 0.0192 0.0277 0.0414 0.0164 ...

number of BNC subsections (rows) = 69
all variables are numeric vectors except Reg, which is a factor with 6 levels (the metaregisters)

rownames(reg_bnc)

##  [1] "S_brdcst_disc" "S_brdcst_doc"  "S_brdcst_news" "S_classroom"  
##  [5] "S_consult"     "S_conv"        "S_courtroom"   "S_demonstratn"
##  [9] "S_interv_oral" "S_interview"   "S_lect_arts"   "S_lect_com"   
## [13] "S_lect_law"    "S_lect_natsci" "S_lect_socsci" "S_meeting"    
## [17] "S_parliament"  "S_pub_debate"  "S_sermon"      "S_spch-script"
## [21] "S_spch+script" "S_sportslive"  "S_tutorial"    "S_unclass"    
## [25] "W_ac_engin"    "W_ac_hum_arts" "W_ac_law_edu"  "W_ac_medicine"
## [29] "W_ac_nat_sci"  "W_ac_soc_sci"  "W_admin"       "W_advert"     
## [33] "W_biography"   "W_commerce"    "W_email"       "W_essay_schl" 
## [37] "W_essay_univ"  "W_fict_drama"  "W_fict_poetry" "W_fict_prose" 
## [41] "W_hansard"     "W_inst_doc"    "W_instruction" "W_let_pers"   
## [45] "W_let_prof"    "W_misc"        "W_new_arts1"   "W_news_arts2" 
## [49] "W_news_com"    "W_news_edit"   "W_news_misc"   "W_news_o_com" 
## [53] "W_news_o_rep"  "W_news_o_sci"  "W_news_o_soc"  "W_news_o_sprt"
## [57] "W_news_rprt"   "W_news_sci"    "W_news_script" "W_news_soc"   
## [61] "W_news_sprt"   "W_news_tabld"  "W_nonac_arts"  "W_nonac_engin"
## [65] "W_nonac_law"   "W_nonac_med"   "W_nonac_nat"   "W_nonac_soc"  
## [69] "W_religion"

The main questions

Can we identify interpretable dimensions of register variation on the basis of the data?
What are the relationships between the metaregisters with regard to these dimensions?

18.2.2 Principal Components Analysis

Two important conditions

the variables should be intercorrelated
- otherwise, we cannot reduce the data to a smaller number of underlying components
the correlations should not be too high
- very high correlations cause inaccurate estimates in FA (not a problem for PCA)
- a rule of thumb
  - the absolute values of the correlations should be neither lower than 0.3 nor higher than 0.9

The correlations between the variables

round(cor(reg_bnc[,-1]),2)

##           Ncomm Nprop Vpres Vpast    P1    P2   Adj ConjCoord ConjSub
## Ncomm      1.00  0.23 -0.41 -0.21 -0.83 -0.75  0.86     -0.13   -0.52
## Nprop      0.23  1.00 -0.34  0.36 -0.37 -0.50  0.13     -0.45   -0.68
## Vpres     -0.41 -0.34  1.00 -0.46  0.42  0.50 -0.35      0.21    0.48
## Vpast     -0.21  0.36 -0.46  1.00  0.03 -0.11 -0.16      0.07   -0.22
## P1        -0.83 -0.37  0.42  0.03  1.00  0.80 -0.79      0.23    0.57
## P2        -0.75 -0.50  0.50 -0.11  0.80  1.00 -0.70      0.31    0.57
## Adj        0.86  0.13 -0.35 -0.16 -0.79 -0.70  1.00      0.04   -0.39
## ConjCoord -0.13 -0.45  0.21  0.07  0.23  0.31  0.04      1.00    0.26
## ConjSub   -0.52 -0.68  0.48 -0.22  0.57  0.57 -0.39      0.26    1.00
## Interject -0.67 -0.39  0.41  0.02  0.70  0.79 -0.62      0.18    0.36
## Num        0.21  0.28 -0.28 -0.13 -0.25 -0.16  0.03     -0.41   -0.28
##           Interject   Num
## Ncomm         -0.67  0.21
## Nprop         -0.39  0.28
## Vpres          0.41 -0.28
## Vpast          0.02 -0.13
## P1             0.70 -0.25
## P2             0.79 -0.16
## Adj           -0.62  0.03
## ConjCoord      0.18 -0.41
## ConjSub        0.36 -0.28
## Interject      1.00 -0.09
## Num           -0.09  1.00

the variable Num has many correlations with absolute values slightly under 0.3
Multicollinearity should not be a concern, since no correlation exceeds 0.9 in absolute value (the largest is 0.86, between Ncomm and Adj)
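The rule of thumb can also be checked programmatically rather than by eye. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in for reg_bnc (which requires the Rling package), and the helper name screen_correlations is made up for this example.

```r
# Hypothetical helper: for each variable, count how many of its correlations
# with the other variables fall outside the 0.3-0.9 band.
screen_correlations <- function(x, low = 0.3, high = 0.9) {
  r <- abs(cor(x))
  diag(r) <- NA                                  # ignore self-correlations
  data.frame(
    n_weak   = rowSums(r < low,  na.rm = TRUE),  # |r| below the lower bound
    n_strong = rowSums(r > high, na.rm = TRUE),  # |r| above the upper bound
    max_abs  = apply(r, 1, max, na.rm = TRUE)    # strongest correlation
  )
}

# Stand-in data; with Rling installed one would pass reg_bnc[, -1] instead.
screen <- screen_correlations(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
print(screen)
```

A variable with a large n_weak is a candidate for exclusion, and any non-zero n_strong points to possible multicollinearity.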

Bartlett test

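null hypothesis: “the variables are not correlated”
cortest.bartlett() in the package psych
a p-value below 0.05 rejects the null hypothesis, which is the desired outcome before running PCA or FA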

cortest.bartlett(reg_bnc[,-1])

## R was not square, finding R from data

## $chisq
## [1] 536.3401
## 
## $p.value
## [1] 4.109611e-80
## 
## $df
## [1] 55
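To make the test less of a black box, the statistic can be recomputed by hand. The sketch below assumes the usual formula for Bartlett's test of sphericity, chi-square = -(n - 1 - (2p + 5)/6) * log(det(R)) with p(p - 1)/2 degrees of freedom; the function name bartlett_sphericity is hypothetical and mtcars is used as stand-in data.

```r
# Hypothetical re-implementation of Bartlett's test of sphericity.
bartlett_sphericity <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x)                       # number of observations
  p <- ncol(x)                       # number of variables
  R <- cor(x)                        # correlation matrix
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq,
       p.value = pchisq(chisq, df, lower.tail = FALSE),
       df = df)
}

# Stand-in data; with Rling installed one would pass reg_bnc[, -1] instead.
res <- bartlett_sphericity(mtcars[, 1:7])
str(res)
```

With the 11 numeric variables of reg_bnc the same formula gives 11 * 10 / 2 = 55 degrees of freedom, which agrees with the output above.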

PCA

PCA() from the FactoMineR package:

reg.pca<-PCA(reg_bnc,quali.sup=1,graph=FALSE)

quali.sup=1 tells R that the first variable, Reg, should be regarded as a qualitative supplementary variable.
- supplementary variables do not contribute to the construction of the principal components
- there are two types of supplementary variables: qualitative and quantitative
graph=FALSE
- suppresses the immediate creation of graphical output
PCA is usually performed on standardized scores rather than original ones because it is sensitive to scaling differences
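The effect of standardization can be illustrated with base R. This is a sketch on stand-in data (four mtcars variables with very different scales) using prcomp() rather than FactoMineR's PCA(), which standardizes by default (scale.unit = TRUE).

```r
# Stand-in data: four mtcars variables measured on very different scales.
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

pca_raw <- prcomp(x, scale. = FALSE)  # PCA on the original scores
pca_std <- prcomp(x, scale. = TRUE)   # PCA on standardized scores

# Coefficients of the first component: on the original scores it is
# dominated by the variable with the largest variance (disp).
round(cbind(raw = pca_raw$rotation[, 1],
            standardized = pca_std$rotation[, 1]), 2)

# On standardized scores the eigenvalues are those of the correlation
# matrix, and they add up to the number of variables.
eig_std <- pca_std$sdev^2
all.equal(eig_std, eigen(cor(x))$values)
sum(eig_std)
```

On the original scores a variable can dominate the first component simply because it is measured in larger units, which is why standardized scores are the usual choice.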

Eigenvalues

proportions of variance explained by each component and cumulative explained variance
an eigenvalue shows how much of the total variance is explained by each component
- the higher the correlations between a component and the variables, the greater the component’s eigenvalue
- the eigenvalue of every additional component is smaller than the previous one

head(reg.pca$eig)

##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  5.0682936              46.075396                          46.07540
## comp 2  1.8722103              17.020094                          63.09549
## comp 3  1.3758435              12.507669                          75.60316
## comp 4  0.7900757               7.182506                          82.78566
## comp 5  0.6451271               5.864791                          88.65046
## comp 6  0.4217144               3.833768                          92.48422
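The second and third columns follow directly from the first. Because the 11 variables are standardized, the total variance is 11, so each percentage is the eigenvalue divided by 11. A minimal check, with the eigenvalues copied from the output above:

```r
# Eigenvalues of the first six components, copied from head(reg.pca$eig)
eig <- c(5.0682936, 1.8722103, 1.3758435, 0.7900757, 0.6451271, 0.4217144)
p <- 11                      # number of standardized variables = total variance

pct <- 100 * eig / p         # percentage of variance per component
cum_pct <- cumsum(pct)       # cumulative percentage of variance

round(cbind(eigenvalue = eig, pct, cum_pct), 3)
```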

The optimal number of components

retain components whose eigenvalues are greater than 1 (the Kaiser criterion) or 0.7 (Field et al. 2012:762-784); here components 1-3 exceed 1, and component 4 (0.79) exceeds only 0.7
inspect the values visually using a scree plot

barplot(reg.pca$eig[,2], names=1:nrow(reg.pca$eig), xlab="components", ylab="Percentage of explained variance")
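Alongside the scree plot, the two eigenvalue thresholds can be applied directly. The sketch below uses the six eigenvalues printed above; since eigenvalues decrease, the five components not shown are all smaller than 0.42 and cannot change the counts.

```r
# Eigenvalues of the first six components, copied from head(reg.pca$eig)
eig <- c(5.0682936, 1.8722103, 1.3758435, 0.7900757, 0.6451271, 0.4217144)

n_kaiser <- sum(eig > 1)     # Kaiser criterion: 3 components
n_relaxed <- sum(eig > 0.7)  # more lenient 0.7 threshold: 4 components

c(kaiser = n_kaiser, relaxed = n_relaxed)
```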

PCA plot

choix="var" is used to represent the variables

plot(reg.pca,choix="var",cex=0.8)

the plot displays the first two components
the angles between the vectors and the axes indicate how strongly the variables are correlated with the dimensions
- the smaller the angle, the stronger the correlation
- if two vectors point in almost the same direction, the corresponding variables are highly correlated and may therefore represent the same underlying theoretical construct
- the length of a vector reflects how much of the variable's variation is captured by this low-dimensional display; the maximum length is 1
  - i.e. the quality of the representation of the variable on the plane
the first component relates to the contrast between involved and informational communication

the correlation coefficients

dimdesc(reg.pca, axes=1)

## $Dim.1
## $Dim.1$quanti
##           correlation      p.value
## P2          0.9117524 1.365059e-27
## P1          0.8958585 2.694053e-25
## Interject   0.7913207 5.876448e-16
## ConjSub     0.7268571 1.540034e-12
## Vpres       0.6203029 1.311435e-08
## ConjCoord   0.3461531 3.574209e-03
## Num        -0.3236023 6.681150e-03
## Nprop      -0.5825157 1.513793e-07
## Adj        -0.7699620 1.056008e-14
## Ncomm      -0.8551296 8.636119e-21
## 
## $Dim.1$quali
##            R2      p.value
## Reg 0.7783745 2.391373e-19
## 
## $Dim.1$category
##             Estimate      p.value
## Spok        3.100121 9.721141e-20
## Acad       -1.256963 4.616021e-02
## NonacProse -1.272090 4.425996e-02
## News       -1.427441 4.513794e-05

1st component
- top positively correlated features are the 1st and 2nd person pronouns and interjections
- the strongest negative correlations are observed for common nouns and adjectives
- by default, the function returns only those estimates that are significant (at the level of 0.05)
- the function also returns the estimates of regression coefficients for the categories of the qualitative supplementary variable (metaregister)
- the largest positive estimate is observed for the spoken data (involvement)

dimdesc(reg.pca, axes=2)

## $Dim.2
## $Dim.2$quanti
##           correlation      p.value
## Adj         0.5234173 3.936813e-06
## ConjCoord   0.4633356 6.092970e-05
## Ncomm       0.3965029 7.440364e-04
## Vpres       0.3489208 3.299614e-03
## ConjSub     0.3451253 3.681245e-03
## Num        -0.3367338 4.667194e-03
## Nprop      -0.6146862 1.925513e-08
## Vpast      -0.6342292 4.893429e-09
## 
## $Dim.2$quali
##            R2     p.value
## Reg 0.2553784 0.001918148
## 
## $Dim.2$category
##       Estimate      p.value
## Acad  1.225288 0.0136778790
## News -1.081269 0.0006610448

2nd component
- positive: adjectives and coordinate conjunctions
- negative: past forms of verbs and proper nouns
- description vs. reporting of past events (tentatively)
- the academic texts: significantly correlated with the positive values
- the news: significantly correlated with negative values
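The correlations reported by dimdesc() for the first two dimensions also determine how long each arrow is in the variable plot: for a standardized PCA the arrow's coordinates are the correlations with the two axes, so its length is the square root of the sum of their squares. A small worked example for three variables that appear in both outputs above:

```r
# Correlations with Dim.1 and Dim.2, copied from the dimdesc() output
dim1 <- c(Adj = -0.7699620, Ncomm = -0.8551296, Nprop = -0.5825157)
dim2 <- c(Adj =  0.5234173, Ncomm =  0.3965029, Nprop = -0.6146862)

quality <- dim1^2 + dim2^2      # share of each variable's variance shown on the plane
arrow_length <- sqrt(quality)   # length of the arrow, at most 1

round(cbind(quality, arrow_length), 3)
```

All three arrows are close to the unit circle, so these variables are well represented by the first two components.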

PCA plot (dimensions 2 and 3) and correlation coefficients (dimension 3)

plot(reg.pca,axes=c(2,3),choix="var",cex=0.8)

dimdesc(reg.pca, axes=3)

## $Dim.3
## $Dim.3$quanti
##           correlation      p.value
## Vpast       0.6962689 3.086827e-11
## ConjCoord   0.5988047 5.481959e-08
## Vpres      -0.2928527 1.460723e-02
## Num        -0.6430702 2.549810e-09
## 
## $Dim.3$category
##         Estimate     p.value
## Fiction 1.527773 0.006017634

3rd component
- narrative and non-narrative texts
- positive correlation: past tense verbs, coordinating conjunctions
- negative correlation: numerals
- It distinguishes fiction from all other registers

Plot the individual subsections onto the space

#dim 1,2
plot(reg.pca, cex=0.8, col.ind="grey",col.quali="black")

#dim 2,3
plot(reg.pca,axes=c(2,3),cex=0.8,col.ind="grey",col.quali="black")

BNC subsections that belong to the same metaregister tend to cluster together

Plot confidence ellipses around the centroid

to estimate the amount of overlap of the prototypes of the registers

plotellipses(reg.pca,label="quali")

plotellipses(reg.pca,axes=c(2,3),label="quali")

the size of confidence ellipses depends on the number of points
- in the case of fiction, the ellipse is large because this category is represented by only three BNC subsections
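A back-of-the-envelope calculation shows why the number of points matters. The sketch assumes, purely for illustration, that the points are equally spread in every metaregister, so that the uncertainty about a centroid shrinks roughly with 1/sqrt(n); it also assumes that all 24 subsections with the prefix S_ are labelled Spok.

```r
# Number of BNC subsections per metaregister (counts read off rownames(reg_bnc);
# the Spok count is an assumption based on the S_ prefix)
n_sub <- c(Spok = 24, Fiction = 3)

# Relative width of a confidence region around the centroid,
# assuming equal spread within each metaregister
relative_width <- 1 / sqrt(n_sub)

# How much wider the fiction ellipse would be than the spoken one
ratio <- unname(relative_width["Fiction"] / relative_width["Spok"])
ratio
```

Under these assumptions the fiction ellipse would be about 2.8 times as wide as the spoken one, even before any difference in actual spread is taken into account.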

18.2.3 Factor Analysis

PCA
- the main purpose of PCA is to find as few orthogonal (uncorrelated) components as possible while maximizing the total explained variance
- to reduce dimensionality
FA
- more widely used for exploring theoretical constructs, or latent variables, which are called factors
- unlike PCA, FA ‘rotates’ the factors, trying to increase the loadings of variables on several common factors

Factor Analysis(FA)

reg.fa<-factanal(reg_bnc[,-1],factors=3)
reg.fa

## 
## Call:
## factanal(x = reg_bnc[, -1], factors = 3)
## 
## Uniquenesses:
##     Ncomm     Nprop     Vpres     Vpast        P1        P2       Adj 
##     0.120     0.335     0.510     0.005     0.175     0.192     0.102 
## ConjCoord   ConjSub Interject       Num 
##     0.496     0.438     0.416     0.726 
## 
## Loadings:
##           Factor1 Factor2 Factor3
## Ncomm     -0.927          -0.125 
## Nprop     -0.214  -0.458  -0.640 
## Vpres      0.417   0.539   0.159 
## Vpast      0.138  -0.983         
## P1         0.868   0.118   0.240 
## P2         0.796   0.259   0.327 
## Adj       -0.940           0.107 
## ConjCoord                  0.709 
## ConjSub    0.480   0.336   0.467 
## Interject  0.716   0.101   0.248 
## Num       -0.109          -0.508 
## 
##                Factor1 Factor2 Factor3
## SS loadings      4.127   1.682   1.676
## Proportion Var   0.375   0.153   0.152
## Cumulative Var   0.375   0.528   0.680
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 65.18 on 25 degrees of freedom.
## The p-value is 1.95e-05

factor loading
- the stronger a variable loads onto a factor, the more strongly this variable defines the factor.
  - analogous to correlation coefficients between the variables and factors
- As a rule of thumb, loadings with absolute (positive or negative) values greater than 0.3 are considered to be important
Factor 1 : involved vs. informational communication
Factor 2 : narrative vs. non-narrative dimension
Factor 3 : ???
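The other parts of the printed output can be derived from the loadings. The sketch below uses the built-in ability.cov data as a stand-in for reg_bnc and shows that the uniqueness of a variable is (approximately, up to the optimizer's tolerance) one minus the sum of its squared loadings, and that "SS loadings" and "Proportion Var" are column sums of squared loadings.

```r
# Stand-in data: the built-in ability.cov covariance list (six ability tests).
fa <- factanal(covmat = ability.cov, factors = 2)
L <- unclass(loadings(fa))            # plain matrix of loadings

communality <- rowSums(L^2)           # variance of each variable explained by the factors
ss_loadings <- colSums(L^2)           # "SS loadings" row of the printed output
prop_var <- ss_loadings / nrow(L)     # "Proportion Var" row

round(cbind(communality, uniqueness = fa$uniquenesses), 3)
round(rbind(ss_loadings, prop_var), 3)

# Variables that pass the 0.3 rule of thumb on each factor
abs(L) > 0.3
```

In the reg_bnc output above the same arithmetic holds: 4.127 / 11 = 0.375 for Factor 1.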

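the BNC subsections and their factor scores

scores="Bartlett" asks factanal() to compute factor scores for each observation (Bartlett's weighted least-squares method)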

reg.fa<-factanal(reg_bnc[,-1],factors=3,scores="Bartlett")
plot(reg.fa$scores[,2:3],type="n")
text(reg.fa$scores[,2:3],rownames(reg_bnc),cex=0.7)

it seems that the third factor relates to
- more or less elaborate development of ideas (fiction, interviews, lectures on arts)
- vs. concise, informationally dense communication (news, emails)
Are three factors enough?
- see the p-value in the bottom line of the printed reg.fa output
- a p-value smaller than 0.05 indicates that the number of factors is insufficient
- here p = 1.95e-05, so by this test three factors do not fully account for the data
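The figures in the bottom lines of the factanal() output can be reproduced by hand. The sketch assumes the standard degrees-of-freedom formula for a maximum-likelihood factor model with p variables and k factors, ((p - k)^2 - (p + k)) / 2, and takes the chi-square statistic as printed above.

```r
p <- 11          # number of variables in reg_bnc[, -1]
k <- 3           # number of factors
chisq <- 65.18   # chi-square statistic copied from the output

df <- ((p - k)^2 - (p + k)) / 2                   # degrees of freedom of the test
p_value <- pchisq(chisq, df, lower.tail = FALSE)  # upper-tail probability

c(df = df, p_value = p_value)
```

The degrees of freedom come out as 25, as in the output, and the p-value is far below 0.05.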

Varimax and Promax

two popular kinds of rotation
Varimax
- a type of orthogonal rotation
- factors are considered uncorrelated
Promax
- a type of oblique rotation
- allows for some degree of correlation between factors
If we expect to find really clear-cut, unique factors, it is better to use Varimax.
If one expects the resulting factors to be closely related, one can try Promax.
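The difference between the two rotations can be seen in their rotation matrices. The sketch below uses the built-in ability.cov data as a stand-in and the base R functions varimax() and promax(); the factor correlation matrix is computed as the inverse of the cross-product of the Promax rotation matrix.

```r
# Stand-in data: unrotated two-factor solution for ability.cov
fa_none <- factanal(covmat = ability.cov, factors = 2, rotation = "none")
L <- unclass(loadings(fa_none))

vm <- varimax(L)   # orthogonal rotation
pm <- promax(L)    # oblique rotation

# Varimax: the rotation matrix is orthogonal, so the factors stay uncorrelated
round(crossprod(vm$rotmat), 3)

# Promax: correlations between the factors implied by the oblique rotation
phi <- solve(crossprod(pm$rotmat))
round(phi, 3)
```

If the off-diagonal values of phi are close to zero, the two rotations tell much the same story and the simpler Varimax solution is usually easier to report.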

reg.fa<-factanal(reg_bnc[,-1],factors=3,rotation="promax")
reg.fa$loadings

## 
## Loadings:
##           Factor1 Factor2 Factor3
## Ncomm     -1.005          -0.195 
## Nprop             -0.718   0.258 
## Vpres      0.331          -0.472 
## Vpast      0.259   0.113   1.066 
## P1         0.869                 
## P2         0.735   0.205         
## Adj       -1.106   0.350  -0.102 
## ConjCoord -0.220   0.853   0.216 
## ConjSub    0.314   0.448  -0.159 
## Interject  0.697   0.132         
## Num               -0.596  -0.238 
## 
##                Factor1 Factor2 Factor3
## SS loadings      4.347   2.012   1.615
## Proportion Var   0.395   0.183   0.147
## Cumulative Var   0.395   0.578   0.725

18.3 Summary

Multidimensional analysis of register variation is not the only possible application of PCA and FA in linguistics.
For example, one can use the scores of components or factors as input in regression analysis to solve the problem of multicollinearity or to simplify the model.
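The regression use case can be sketched as follows. This is an illustration on stand-in data (mtcars), not an analysis from the chapter: four intercorrelated predictors are replaced by the observations' scores on the first two principal components, which are uncorrelated by construction.

```r
# Stand-in data: predict mpg from four intercorrelated predictors.
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]
pca <- prcomp(predictors, scale. = TRUE)

# Component scores are uncorrelated with each other
round(cor(pca$x), 3)

# Use the scores on the first two components as regressors
dat <- data.frame(mpg = mtcars$mpg, pca$x[, 1:2])
fit <- lm(mpg ~ PC1 + PC2, data = dat)
summary(fit)$r.squared
```

The price of this simplification is interpretability: the coefficients now refer to components, so the component loadings are needed to relate them back to the original variables.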

How to report results of PCA and FA

the sample size
the number of variables and the procedure
how you decided on the number of components/factors, as well as the rotation method and the p-value
crucially, a table with the factor loadings for each variable
all relevant biplots, if the purpose is also to obtain a classification of observations