Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

options(Encoding = "UTF-8")
library(Rling)
library(dplyr)
library(rgl)
library(MASS)
library(psych)

The Data

The data is provided as liwc_house_conflict.csv. We collected over 1000 different speeches given on the floor of the US House of Representatives that discussed different wartime conflicts with Iraq, Kuwait, Russia, Syria, Iran, and a few others. This data was then processed with the Linguistic Inquiry and Word Count (LIWC) software, which provides a linguistic frequency analysis for many categories.

You should pick 15-20 categories that you think might cluster together and/or be interesting to examine for their register relatedness. You can learn more about the categories by checking out the attached manual starting on page four. Do not use the “total” categories together with their subgroups, or you might get a singular matrix error. You might also consider running a quick summary on your chosen categories, to make sure they are not effectively zero frequency (e.g., most of the informal language categories will be very small percentages because of the setting of the speeches).

Import your data and create a data frame here with only the categories you are interested in.

For this analysis, the following categories were chosen from the file:

Analytic - Analytical Thinking
Clout - Clout
Authentic - Authentic
Tone - Emotional Tone
WPS - Words per sentence
Sixltr - Words > 6 letters
Dic - Dictionary Words
posemo - Positive Emotion
negemo - Negative Emotion
female - Female references
male - Male references
see - See
hear - Hear
feel - Feel
work - Work
leisure - Leisure
home - Home
money - Money
relig - Religion
death - Death

main_data <- read.csv("liwc_house_conflict.csv")
# The first column name is imported with byte-order-mark debris, so rename it by position
names(main_data)[1] <- "Filename"
rownames(main_data) <- main_data[,1]
data <- main_data %>% dplyr::select(Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,posemo,negemo,female,male,see,hear,feel,work,leisure,home,money,relig,death)
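The quick summary suggested in the instructions can be sketched as follows. This is a minimal illustration on a made-up data frame: the `swear` column, the toy values, and the 0.1 cutoff are assumptions for demonstration only, not part of the assignment data.

```r
# Toy stand-in for the selected LIWC columns (values are invented percentages)
set.seed(1)
toy <- data.frame(
  work  = runif(50, min = 2, max = 8),
  money = runif(50, min = 1, max = 4),
  swear = c(rep(0, 48), 0.1, 0.2)   # hypothetical near-zero category
)

# Mean percentage per category
col_means <- colMeans(toy)

# Flag categories whose mean falls below an arbitrary cutoff (assumed: 0.1)
low_freq <- names(col_means)[col_means < 0.1]
print(low_freq)
```

Categories flagged this way would be candidates to drop before computing distances or correlations.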

Calculate an MDS

Calculate an MDS on your data - you can use 1, 2, or 3 dimensions.
First, you will need to create distance scores for this analysis to run. Fill in YOUR.DF with the name of the data.frame you created above.

MDS takes pairwise distances between observations as its input, so we first need to calculate those distances. Here, we use Euclidean distances between the speeches and run the MDS on them. MDS solutions are usually kept to 1-3 dimensions so that they stay easy to plot and interpret. Let’s start with 2 dimensions so that we can plot the result in 2-D space.

##create distance scores
distances = dist(data, method = "euclidean")

##run the MDS
mds = cmdscale(distances, #distances
               k = 2, #number of dimensions
               eig = TRUE #calculate the eigenvalues
               )
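To make the distance step concrete, here is a small sketch on three invented points rather than the speech data. The points are chosen so that the Euclidean distances can be checked by hand.

```r
# Three toy observations on a line, so the distances are easy to verify
pts <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

# Euclidean distance: square root of the summed squared differences
d <- dist(pts, method = "euclidean")
print(d)   # a-b = 5, a-c = 10, b-c = 5

# Because the points are collinear, one MDS dimension reproduces the distances
coords <- cmdscale(d, k = 1)
max_error <- max(abs(dist(coords) - d))
print(max_error)
```

The same logic applies to the speeches: each speech is a point in 20-dimensional LIWC space, and the MDS looks for a low-dimensional layout that keeps the pairwise distances as intact as possible.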

Eigenvalues and Scree Plots

Eigenvalues represent the amount of variance accounted for by each dimension. Let’s create a scree plot using the eigenvalues to determine the optimal number of dimensions.

barplot(mds$eig, #plot the eigenvalues
        xlab = "Dimensions", 
        ylab = "Eigenvalue", 
        main = "Scree plot")

The scree plot is hard to read with this many dimensions. Let’s look at the top 10 eigenvalues.

head(sort(mds$eig, decreasing = TRUE),10)

##  [1] 692168.6052 287521.2862 260030.5055 124748.6872  20714.8966
##  [6]  14382.0462   8388.9154   2806.6626   1449.8033    702.8467

Based on these values, the eigenvalues drop sharply after the fourth dimension (from about 124,749 to about 20,715) and then level out, which suggests that up to 4 dimensions are worth considering.
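One way to make the size of each drop explicit is to compare every eigenvalue with the one before it. The sketch below reuses the ten values printed above; the ratio is just one possible heuristic, not part of the assignment.

```r
# Top 10 eigenvalues as printed above
eig_top <- c(692168.6052, 287521.2862, 260030.5055, 124748.6872, 20714.8966,
             14382.0462, 8388.9154, 2806.6626, 1449.8033, 702.8467)

# Ratio of each eigenvalue to the one before it; small ratios mark steep drops
drop_ratio <- eig_top[-1] / eig_top[-length(eig_top)]
names(drop_ratio) <- paste0(1:9, "->", 2:10)
print(round(drop_ratio, 2))

# Position of the steepest relative drop
steepest <- which.min(drop_ratio)
print(steepest)
```

The steepest relative drop is between dimensions 4 and 5, where the eigenvalue falls to about 17% of the previous one.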

Plot the MDS

Include a plot of the results from the MDS.

{
  plot(mds$points, #plot the MDS dimension points
      type = "n", #blank canvas plot
      main = "MDS of US House of Representatives Speeches")
  
  text(mds$points, #plot the dimensions
       labels = rownames(main_data), #label them with the names
       cex = 0.6) #text sizing
}

We can see a lot of overlap here and a few speeches that seem to stand out.

Let’s try running a 3-D MDS and see the plot.

##run the 3D MDS
mds3 = cmdscale(distances, #distances
               k = 3, #number of dimensions
               eig = TRUE #calculate the eigenvalues
               )

#plot the 3D graph
{
  plot3d(mds3$points, type = "n")
  text3d(mds3$points, texts = rownames(main_data), cex = 0.6)
}

This plot can be rotated and zoomed to see where each of the 1040 documents falls with respect to these dimensions.

Although we can’t easily visualize it, let’s also fit a 4D MDS for further analysis.

mds4 = cmdscale(distances, #distances
               k = 4, #number of dimensions
               eig = TRUE #calculate the eigenvalues
               )

Goodness of fit

Let’s also look at the goodness of fit, which measures how well each model matches the data, for all three models.

#2D model
mds$GOF

## [1] 0.6920615 0.6920615

#3D model
mds3$GOF

## [1] 0.8757493 0.8757493

#4D model
mds4$GOF

## [1] 0.9638729 0.9638729

The 4D model has the best fit: at 0.96, its goodness of fit is very close to the maximum of 1.
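The goodness-of-fit value can be reproduced from the eigenvalues. According to the `cmdscale()` documentation, the first GOF entry is the sum of the first k eigenvalues divided by the sum of the absolute values of all eigenvalues, and the second entry divides by the sum of the positive eigenvalues only. The sketch below checks the first entry on simulated data, not on the speeches.

```r
set.seed(42)
toy <- matrix(rnorm(30 * 5), nrow = 30)   # 30 simulated observations, 5 variables
d <- dist(toy)

fit <- cmdscale(d, k = 2, eig = TRUE)

# Share of the eigenvalue total captured by the first two dimensions
manual_gof <- sum(fit$eig[1:2]) / sum(abs(fit$eig))

print(c(manual = manual_gof, reported = fit$GOF[1]))
```

This also explains why the two GOF values printed above are identical: Euclidean distances produce no meaningfully negative eigenvalues, so both denominators agree.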

Stress

Let’s also look at the stress, which is a measure of mismatch, for all three models.

#2D model
sqrt(sum((distances - dist(mds$points))^2)/sum(distances^2))

## [1] 0.2749429

#3D model
sqrt(sum((distances - dist(mds3$points))^2)/sum(distances^2))

## [1] 0.1167261

#4D model
sqrt(sum((distances - dist(mds4$points))^2)/sum(distances^2))

## [1] 0.0336854

The 4D model has the lowest stress (about 0.03), which is in the excellent category.
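Since the same stress formula is typed out three times above, it could be wrapped in a small helper. The function name `stress` and the simulated data are assumptions for illustration; the formula itself is the one used in this report.

```r
# Stress as used above: sqrt of squared mismatch over squared original distances
stress <- function(d, points) {
  sqrt(sum((d - dist(points))^2) / sum(d^2))
}

set.seed(42)
toy <- matrix(rnorm(30 * 5), nrow = 30)   # simulated data, not the speeches
d <- dist(toy)

# Stress for 1 to 4 dimensions
stress_by_k <- sapply(1:4, function(k) stress(d, cmdscale(d, k = k)))
names(stress_by_k) <- paste0(1:4, "D")
print(round(stress_by_k, 3))
```

With Euclidean input, stress can only go down as dimensions are added, so the useful question is where the improvement stops being worth the extra dimension.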

Create a Shepard Plot

Include a Shepard Plot to see if any large mis-fit exists. You don’t have to interpret where, but just to see if we might expect to see some poor item loadings in the PCA/EFA.

A Shepard plot compares the actual distances with the modeled distances, so any mismatch shows up as points that stray from the diagonal. Let’s create the Shepard plot for the 4D model, which was deemed better than the 2D and 3D models based on goodness of fit and stress.

# Create the numbers
sh = Shepard(distances, #real Euclidean distances
             mds4$points) #modeled numbers

{
  plot(sh, main = "Shepard Plot", pch = ".")
  lines(sh$x, sh$yf, type = "S")
}

Based on this plot, the speeches seem to be extremely well modeled by 4 dimensions. There aren’t any major outliers, and all the points lie very close to the diagonal.
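The visual impression from a Shepard plot can be backed up with numbers. The sketch below uses simulated data and two ad hoc summaries (a correlation and the largest gap) that are my own additions, not a standard part of the method.

```r
library(MASS)   # provides Shepard()

set.seed(42)
toy <- matrix(rnorm(30 * 5), nrow = 30)   # simulated data, not the speeches
d <- dist(toy)
fit <- cmdscale(d, k = 4)

sh <- Shepard(d, fit)

# x = original distances, y = modeled distances (both sorted by x)
fit_cor <- cor(sh$x, sh$y)
worst_gap <- max(abs(sh$x - sh$y))
print(c(correlation = fit_cor, largest_gap = worst_gap))
```

A correlation near 1 and a small largest gap correspond to points hugging the diagonal in the plot.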

Before you start

Include Bartlett’s test and the KMO statistic to determine if you have adequate correlations and sampling before running a PCA/EFA.

We use Bartlett’s test to check the correlations. It tests whether the correlation matrix differs from an identity matrix, that is, whether the variables are correlated at all.

We use the Kaiser-Meyer-Olkin (KMO) statistic to check sampling adequacy.

# Bartlett's test
correlations = cor(data)
cortest.bartlett(correlations, n = nrow(data))

## $chisq
## [1] 5189.859
## 
## $p.value
## [1] 0
## 
## $df
## [1] 190

# KMO Stat
KMO(correlations)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = correlations)
## Overall MSA =  0.55
## MSA for each item = 
##  Analytic     Clout Authentic      Tone       WPS    Sixltr       Dic 
##      0.73      0.71      0.41      0.41      0.57      0.72      0.62 
##    posemo    negemo    female      male       see      hear      feel 
##      0.29      0.34      0.62      0.59      0.75      0.76      0.82 
##      work   leisure      home     money     relig     death 
##      0.67      0.57      0.50      0.66      0.74      0.67

From Bartlett’s test, the p-value is significant, indicating large enough correlations. In other words, the correlation matrix differs significantly from an identity matrix.
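For readers who want to see what the test computes, here is a hand-rolled sketch of the standard Bartlett sphericity formula in base R. The helper name `bartlett_sphericity` and the simulated data are assumptions; the report itself relies on `cortest.bartlett()` from psych.

```r
# Hand-rolled Bartlett's test of sphericity (hypothetical helper, base R only)
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

set.seed(42)
n <- 200
shared <- rnorm(n)
# Four simulated variables that share a common source, so they correlate
toy <- sapply(1:4, function(i) shared + rnorm(n))

result <- bartlett_sphericity(cor(toy), n)
print(result)

# With 20 variables, as in the analysis above, the degrees of freedom are
print(20 * (20 - 1) / 2)
```

The last line gives 190, which matches the `$df` value in the output above.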

The overall MSA of 0.55 from the KMO test is poor, and four items (Authentic, Tone, posemo, and negemo) have MSA values below 0.5.
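The per-item values in the KMO output can be screened programmatically. The sketch below copies the printed MSA values and applies a 0.5 floor, which is a commonly cited rule of thumb and an assumption here rather than a requirement of the assignment.

```r
# Per-item MSA values as printed in the KMO output above
msa <- c(Analytic = 0.73, Clout = 0.71, Authentic = 0.41, Tone = 0.41,
         WPS = 0.57, Sixltr = 0.72, Dic = 0.62, posemo = 0.29, negemo = 0.34,
         female = 0.62, male = 0.59, see = 0.75, hear = 0.76, feel = 0.82,
         work = 0.67, leisure = 0.57, home = 0.50, money = 0.66,
         relig = 0.74, death = 0.67)

# Items below the assumed 0.5 floor, weakest first
weak_items <- sort(msa[msa < 0.5])
print(weak_items)
```

These items would be the first candidates to reconsider if the solution turns out to be hard to interpret.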

Let’s look at the variances in our dataset.

psych::describe(data)

The variances seem to be sufficient. Let’s leave all the variables in the dataset for further analysis.

How many factors/components?

Explore how many factors/components you should use.
Include a parallel analysis and scree plot.
Sum the Kaiser criterion.
Go with the smaller number of items or the most agreement between different criteria.

Finding factors/components

number_items = fa.parallel(data,
                           fm = "ml", #type of math
                           fa = "both") #look at both efa/pca

## Parallel analysis suggests that the number of factors =  7  and the number of components =  5

Parallel analysis suggests that the number of factors = 7 and the number of components = 5.

Kaiser Criterion

sum(number_items$fa.values > 1)

## [1] 2

sum(number_items$fa.values > 0.7)

## [1] 2

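The Kaiser criterion suggests retaining 2: two eigenvalues exceed both the traditional cutoff of 1 and the more lenient cutoff of 0.7.

Based on the scree plots and the Kaiser criterion, and following the guideline above to go with the smaller number, I chose to go with 2 components, even though parallel analysis suggests 5.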
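The two retention rules can be illustrated in base R on simulated data with a known two-component structure. This is a simplified sketch: the parallel analysis here compares against the 95th percentile of eigenvalues from random normal data, which is only one of several variants and is not necessarily what `fa.parallel()` does internally.

```r
set.seed(123)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
# Six simulated variables: three driven by f1, three by f2 (loadings of 0.8)
toy <- cbind(sapply(1:3, function(i) 0.8 * f1 + 0.6 * rnorm(n)),
             sapply(1:3, function(i) 0.8 * f2 + 0.6 * rnorm(n)))

# Kaiser criterion: count eigenvalues of the correlation matrix above 1
obs_eig <- eigen(cor(toy))$values
kaiser_n <- sum(obs_eig > 1)

# Simplified parallel analysis: eigenvalues from random data of the same shape
rand_eig <- replicate(100, eigen(cor(matrix(rnorm(n * 6), nrow = n)))$values)
threshold <- apply(rand_eig, 1, quantile, probs = 0.95)
parallel_n <- sum(obs_eig > threshold)

print(c(kaiser = kaiser_n, parallel = parallel_n))
```

On clean simulated data the two rules agree; on the speech data they disagree, which is why the choice of 2 components is a judgment call.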

Simple structure - run the PCA/EFA

Choose to run either a PCA or an EFA.
Include the saved fa or principal code, but then be sure to print out the results, so the summary is on your report.
Plot the results from your analysis.

For this analysis, I chose to run PCA.

#PCA
PCA_fit = principal(data,
                    nfactors = 2, #number of components
                    rotate = "none")

PCA_fit

## Principal Components Analysis
## Call: principal(r = data, nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##             PC1   PC2     h2   u2 com
## Analytic  -0.72 -0.12 0.5307 0.47 1.1
## Clout      0.56  0.13 0.3277 0.67 1.1
## Authentic  0.17  0.14 0.0486 0.95 1.9
## Tone      -0.04  0.93 0.8619 0.14 1.0
## WPS       -0.28  0.15 0.1018 0.90 1.5
## Sixltr    -0.78 -0.08 0.6151 0.38 1.0
## Dic        0.72  0.20 0.5574 0.44 1.2
## posemo     0.00  0.57 0.3217 0.68 1.0
## negemo     0.05 -0.74 0.5525 0.45 1.0
## female     0.17 -0.01 0.0280 0.97 1.0
## male       0.28 -0.11 0.0899 0.91 1.3
## see        0.27  0.07 0.0758 0.92 1.2
## hear       0.40  0.10 0.1678 0.83 1.1
## feel       0.22 -0.05 0.0493 0.95 1.1
## work      -0.69  0.17 0.5099 0.49 1.1
## leisure    0.10  0.01 0.0102 0.99 1.0
## home      -0.03  0.05 0.0035 1.00 1.5
## money     -0.34  0.23 0.1671 0.83 1.8
## relig      0.25  0.01 0.0639 0.94 1.0
## death      0.20 -0.49 0.2760 0.72 1.3
## 
##                        PC1  PC2
## SS loadings           3.15 2.21
## Proportion Var        0.16 0.11
## Cumulative Var        0.16 0.27
## Proportion Explained  0.59 0.41
## Cumulative Proportion 0.59 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.08 
##  with the empirical chi square  2575.01  with prob <  0 
## 
## Fit based upon off diagonal values = 0.73

fa.plot(PCA_fit, labels = colnames(data))

fa.diagram(PCA_fit)
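The variance rows at the bottom of the PCA output follow directly from the SS loadings. The sketch below redoes that arithmetic with the printed values, which are rounded, so the results match the output only to two decimals.

```r
# SS loadings as printed in the PCA output above
ss_loadings <- c(PC1 = 3.15, PC2 = 2.21)
n_vars <- 20   # number of LIWC categories in the analysis

# Share of total variance: each standardized variable contributes 1 unit
prop_var <- ss_loadings / n_vars

# Share of the variance captured by the retained components only
prop_explained <- ss_loadings / sum(ss_loadings)

print(round(rbind(prop_var, prop_explained), 2))
```

Together the two components account for about 27% of the total variance, so most of the variation in the 20 categories is left unexplained by this solution.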

Adequate solution

Examine the fit indice(s). Are they any good? How might you interpret them?

Root mean square of the residuals

PCA_fit$rms

## [1] 0.08072

The root mean square of the residuals (0.08) seems acceptable. Ideally we want it to be below 0.06, but it is still below 0.1. Because this is a badness-of-fit statistic, a low value indicates that the model is adequate.
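To show where this statistic comes from, here is a base R sketch of the usual definition: the root mean square of the off-diagonal differences between the observed correlations and those reproduced by the retained components. The helper `pca_rmsr` is hypothetical, and I am assuming that the value psych reports is computed in an equivalent way.

```r
# RMSR for a k-component PCA: root mean square of the off-diagonal residuals
pca_rmsr <- function(R, k) {
  e <- eigen(R)
  # Unrotated loadings: eigenvectors scaled by the square root of eigenvalues
  loadings <- e$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(e$values[1:k]), k)
  residual <- R - loadings %*% t(loadings)
  sqrt(mean(residual[lower.tri(residual)]^2))
}

set.seed(42)
toy <- matrix(rnorm(200 * 6), nrow = 200)     # simulated data, not the speeches
toy[, 2] <- toy[, 1] + rnorm(200, sd = 0.5)   # add some correlation
R <- cor(toy)

rmsr_by_k <- sapply(c(2, 6), function(k) pca_rmsr(R, k))
names(rmsr_by_k) <- c("2 components", "6 components")
print(round(rmsr_by_k, 4))
```

Keeping all components reproduces the correlation matrix exactly, so the residual shrinks to zero; with fewer components it measures how much correlation is left unexplained.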

Examine the results - what do they appear to tell you? Are there groupings of variables in these analyses that might explain different structures/jargons/registers in language we find in Congress?

Looking at the PCA fit, the following observations can be made for each type of dimension:
- Summary Variables: Speakers seem to usually use a lot of emotional tone, but lack authenticity.
- Language Metrics: The words per sentence seem to be low, as is expected of verbal communication, but the usage of big words is high, as expected in formal conversations.
- Affect Words: More negative sentiment seems to be represented in these speeches than positive sentiment.
- Social Words: There seem to be more references to males than females.
- Perceptual Processes: The perception of hearing is represented more in these speeches than the perception of feeling.
- Personal Concerns: There seems to be a high emphasis on work and death and a low emphasis on leisure and home.

Based on the plot of the PCA, we can see which variables load on which component:
- Summary Variables: Analytic, Authentic, and Clout load on the first component, while Tone loads on the second component.
- Language Metrics: All the language metrics seem to load on the first component.
- Affect Words: Both positive and negative words load on the second component.
- Social Words: Both male and female words load on the first component.
- Perceptual Processes: All the perceptual words seem to load on the first component.
- Personal Concerns: Work, leisure, money, and religion load on the first component, while death and home load on the second component.

Based on the plot, we can also see that the following variables cluster together:
- Cluster 1: Work, six-letter words, and Analytic.
- Cluster 2: Money and words per sentence.
- Cluster 3: Tone and positive words.
- Cluster 4: Death and negative words.
- Cluster 5: Authentic, Clout, dictionary words, male, female, see, hear, feel, home, religion, and leisure.

Based on the component analysis diagram, the following observations can be made about the 2 principal components:
- Component 1 has more dictionary words, clout-related words, and words related to the perception of hearing, and fewer six-letter words, analytic words, and work and money references.
- Component 2 has more tone-related words and positive words, and fewer death references and negative words.

ANLY540 - Analysis of Human Language - Assignment 9: MDS/PCA/EFA

Suraj Kumaran

2019-08-04