In-class demo of correlations and corpus counts


Part A. Building a perfect positive correlation

Let’s start with setting the alpha and beta coefficients of the line. We’ll create a dataframe called pos for “positive correlation”:

# Coefficient terms of line
alpha <- 4
beta <- 2

# Create pos data frame
# Create array of points for x
pos <- data.frame(0:19); colnames(pos) <- 'x'

# Use Alpha and Beta to generate y
pos$y <- alpha + (pos$x * beta)

head(pos)
##   x  y
## 1 0  4
## 2 1  6
## 3 2  8
## 4 3 10
## 5 4 12
## 6 5 14

Next we can compute the linear model with the lm() function, which takes two arguments: a formula y ~ x (predicting y from x) and a dataframe.

lm(
   formula,
   data
)
# Compute the linear model
lm(y ~ x, data = pos)
## 
## Call:
## lm(formula = y ~ x, data = pos)
## 
## Coefficients:
## (Intercept)            x  
##           4            2

The model recovers the intercept (4) and slope (2) we set above.



Plot scatterplot and add regression line

Now let’s create a plot of the points in pos and a linear regression line using the alpha and beta coefficients.

1. Use values to create scatterplot

As discussed in previous classes, the function plot() can be used with a dataframe or other object to produce a scatterplot.

plot()

The line is drawn with the function abline(), which takes our coefficients as arguments:

abline()
plot(pos)
abline(alpha, beta)

2. Use lm function to compute linear relationship

Next, we’ll again make a scatterplot of the points in pos. This time, however, we’ll plot the output of the linear regression by feeding the abline function the model object returned by lm(). As seen below, abline accepts a fitted model, in addition to raw coefficient values. Let’s also jazz it up a bit by plotting a slightly thicker red line, using the col = "red" and lwd = 4 (line width) arguments:

abline(
   model,
   col,
   lwd
)
plot(pos)
abline(lm(y ~ x, data = pos), col = "red", lwd = 4)

Phew! Looks the same as before!

Correlations

We can use the cor() function to compute the correlation between x and y. As expected, it’s a perfect correlation (r = 1), by design.

cor(
   x,
   y
)
cor(pos$x, pos$y)
## [1] 1

Real data just don’t look so uniform, alas. There’s usually some kind of noise due to unexplained variance, measurement error, etc. Linear models capture this noise with an error term \(\epsilon\) added to the predictors in the formula:

\[y \sim x + \epsilon \]
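As an aside, here’s a minimal sketch of what more realistic noise might look like, using rnorm() to draw normally distributed errors (the name noisy and the value sd = 4 are just illustrative choices):

# Sketch only: simulate Gaussian noise around the same line
set.seed(1)                      # make the random draw reproducible
noisy <- data.frame(x = 0:19)
noisy$y <- alpha + noisy$x * beta + rnorm(20, mean = 0, sd = 4)
cor(noisy$x, noisy$y)            # high, but no longer exactly 1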

For simplicity, I’ve chosen to make the error extremely regular: alternating values between 4 and -4 for visual effect. We’ll just add this information to the pos dataframe. Note that I’ve used the rep function to create a vector by repeating the values c(4, -4) 10 times (times = 10).

rep(
   c(4,-4),
   times = 10
)

Then we’ll create a new column y2 representing the new value of y from the linear model plus the error.

pos$error <- rep(c(4, -4), times = 10)

head(pos)
##   x  y error
## 1 0  4     4
## 2 1  6    -4
## 3 2  8     4
## 4 3 10    -4
## 5 4 12     4
## 6 5 14    -4
# Compute alternative y using error
pos$y2 <- alpha + (pos$x * beta) + pos$error

head(pos)
##   x  y error y2
## 1 0  4     4  8
## 2 1  6    -4  2
## 3 2  8     4 12
## 4 3 10    -4  6
## 5 4 12     4 16
## 6 5 14    -4 10

We can again use the cor() function to compute the correlation between x and y2, our y plus the error term.

How do you expect the correlation to compare to the previous correlation? Try it out here and check your answer below:

# Add correlation here
Answer:

We’ll again use the cor() function, which takes two arguments:

cor(
x, 
y
)
cor(pos$x, pos$y2)
## [1] 0.9420074

Let’s try plotting the data using method 1, with plot and abline as before. We’ve also added blue lines to show the residuals using segments():

plot(y2 ~ x, data = pos)
abline(alpha, beta, col = "red", lwd = 2)

# Add lines to show residuals (using segments())
segments(pos$x,                 # x position from
    pos$y,                      # y position from
    pos$x,                      # x position to
    pos$y + pos$error,          # y position to
    lwd = 2, col = "blue"       # line width and color
    )

The best fitting line will be the one that minimizes the distance between the predicted value \(\hat{y}\) (pronounced y hat), and the observed value \(y\).

To get a feel for how this works, we’ll compare the squared residuals \(e\) from two candidate lines:

1. Compute the squared distance between the observed and predicted values for y and add each value to the dataframe:

\[ e = (y - \hat{y})^2 \]

pos$resid <- (pos$y - pos$y2)^2

sum(pos$resid) # 320
## [1] 320

The sum of squares is 320.

2. Create data from an alternative regression line, setting alpha2 and beta2 as below:

alpha2 <- 4.5
beta2 <- 1.8


# Use alpha2 and beta2 to generate y.alt
pos$y.alt <- alpha2 + (pos$x * beta2)

head(pos)
##   x  y error y2 resid y.alt
## 1 0  4     4  8    16   4.5
## 2 1  6    -4  2    16   6.3
## 3 2  8     4 12    16   8.1
## 4 3 10    -4  6    16   9.9
## 5 4 12     4 16    16  11.7
## 6 5 14    -4 10    16  13.5
# Calculate residuals for alternate line
pos$resid2 <- (pos$y.alt - pos$y2)^2
sum(pos$resid2) # 369.8
## [1] 369.8

The sum of squares is 369.8.

Comparing the summed residuals from each line, we see that the alternative line has a larger sum of squared residuals, and so fits the data less well.
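As a quick sanity check, we can let lm() fit the noisy data itself; by construction, the least squares line can do no worse than either line above. This sketch uses the built-in resid() accessor:

# Sketch: lm() finds the line minimizing the sum of squared residuals
fit <- lm(y2 ~ x, data = pos)
sum(resid(fit)^2)    # should be <= 320 (and certainly < 369.8)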



Part B. Getting practice using lm() and cor()

Instructions: First, try this on your own, consulting a neighbor if you get stuck. We’ll discuss the answers as a class.

Load the father.son data set included in the UsingR library with install.packages('UsingR') and library().

# install.packages('UsingR')
library(UsingR)


Q1. Plot the relationship between sheight (y) and fheight (x)

# Enter the plot here
Answer:

We can use the plot() function with the formula for predicting sheight from fheight in the father.son data.

plot(sheight ~ fheight, data = father.son)


Q2. Compute a linear model; assign it to the variable name fs.lm.

# Complete model specification here
# fs.lm  <- 
Answer:

We can use lm() with the formula as above.

fs.lm <- lm(sheight ~ fheight, data = father.son)

fs.lm


Q3. Try printing out the results of fs.lm using summary() and report the intercept and slope of the line.

# Type answer here
Answer:

We can print the full model results with summary():

# summary(fs.lm)

# Report the intercept and the slope of the line:
# Intercept: fs.lm$coeff[1]   # 33.8866
# Slope:     fs.lm$coeff[2]
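Equivalently, a small sketch with the coef() accessor, which returns both coefficients at once as a named vector:

# coef() is another way to get at fs.lm$coeff
coef(fs.lm)    # (Intercept) and the fheight slope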


Q4. Use abline() to add the regression line from fs.lm to the plot. Use red to make the line stand out.

# Type answer here
Answer:

We can pass the model object directly to abline(), as in Part A.

# abline(fs.lm,  col = "red", lwd = 3)


Q5. Compute the correlation between sheight and fheight using cor().

# Type answer here
Answer:

We can use cor() with the two columns as arguments.

with(father.son, cor(sheight, fheight))
## [1] 0.5013383


Q6. Taking our understanding of correlations a little further, we’ll use cor.test() to determine if the correlation is significant. First, read up on the function using ?cor.test().

?cor.test()
Answer:

We can call cor.test() with the same arguments as cor().

# Test correlation for significance using cor.test()

# with(father.son, cor.test(sheight, fheight))


# Add correlation coefficient to the plot. Use locator() to tell R exactly where to put the text. Fill in the ??? with the real correlation value.

# text(locator(1), "r = 0.5", cex = 1.4, col = "red")


# text(65, 75, "r = 0.5", cex = 1.4, col = "blue")

# ?text
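If you’d rather not type the value by hand, here’s a sketch that pulls the correlation estimate out of the cor.test() result (its $estimate component) and pastes it into the label:

# Sketch: build the text label from the test object itself
r <- with(father.son, cor.test(sheight, fheight))$estimate
text(65, 75, paste("r =", round(r, 2)), cex = 1.4, col = "blue")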



Part C. Exploring some annotated corpus data

The “much less” coordinator is argued to head an ellipsis construction that places the correlate and remnant in contrastive focus, and presupposes a contextually salient scale between the correlate and the remnant.

Examples of the “much less” construction:

  1. I don’t eat JEL.ly, much less PEA.nut butter. (NP remnant)
  2. I don’t EAT jelly, much less MAKE it. (VP remnant)
  3. I don’t EAT JEL.ly, much less on CRACK.ers. (Sprouted PP remnant)
# Load ML dataset from github
ml <- read.csv('https://raw.githubusercontent.com/jaharris/jaharris.github.io/master/static/files/data/much-less-COCA-lab.csv', header = T)

# NB. Data from COCA on the coordinator 'much less'. Annotated with Katy Carlson, Morehead State University. 

head(ml)
##   Number Year Genre          Source
## 1    197 2006  ACAD  ChurchHistory 
## 2    288 2003  ACAD ForeignAffairs 
## 3    513 1996   MAG        America 
## 4    552 1995  ACAD   AsianAffairs 
## 5    563 1994   MAG       Atlantic 
## 6    579 1994  ACAD    HispanicRev 
##                                                                                                                                                                                              Sentence
## 1                              conflict with Catholic orthodoxy. # Indeed, Aude was not the first -- much less the only! -- medieval Christian to puzzle over how the simultaneously human and divine
## 2                              these trends are, China still faces serious obstacles to becoming a high-profile, much less a dominant, player in the international community. For the moment, China's
## 3 patristic texts demonstrates just the reverse-that the Fathers never arrived at a dominant, much less a unanimous interpretation of Psalm 1. Doctrinal Aberration? Other attacks raise the question
## 4                    the Saffron Surge should not be interpreted as an aberrant, transient, or much less a perverse political trend that can be arrested by a determined reassertion of the Nehruvian
## 5                                      there is no corresponding vignette to give children a picture of an amicable, much less a long-lasting, marriage. (Susan Wilson believes that you " can't beat
## 6                                      to be able to assert that Pizarnik is not writing a specifically erotic, much less a pornographic, poetry (whatever the distinction between the two might be),
##   Syntax Contrast Position Other Local Licensor.category Licensor
## 1 Adj/DP      Adj    in NP  <NA>     1               Neg     <NA>
## 2 Adj/DP      Adj    in NP  <NA>     1               NEC     <NA>
## 3 Adj/DP      Adj    in NP  <NA>     1               Neg    never
## 4 Adj/DP      Adj    in NP  <NA>     1               Neg     <NA>
## 5 Adj/DP      Adj    in NP  <NA>     1               Neg     <NA>
## 6 Adj/DP      Adj    in NP  <NA>     1               Neg     <NA>
##                 Note Note2 Syntax2 Contrast2
## 1               <NA>  <NA>     Adj       Adj
## 2 Not Enough Context  <NA>     Adj       Adj
## 3               <NA>  <NA>     Adj       Adj
## 4               <NA>  <NA>     Adj       Adj
## 5               <NA>  <NA>     Adj       Adj
## 6               <NA>  <NA>     Adj       Adj


Q1. Check basics of the dataset

First, let’s look at counts of the different remnant types (Syntax2) with xtabs() or table():

xtabs(~ ml$Syntax2)
## ml$Syntax2
## Adj Adv Det  NP  PP  SC   V  VP 
##  33  20   1 736 233  42  60 519

The output shows the counts per syntactic category (Adj, Adv, Det, etc.).


Q2. We can see that three categories account for the vast majority of the data. Use sort() to see more clearly which ones they are.

# Type answer here
Answer:

sort(xtabs(~ ml$Syntax2))
## ml$Syntax2
## Det Adv Adj  SC   V  PP  VP  NP 
##   1  20  33  42  60 233 519 736


Q3. Create a data frame ml2 with the 3 most common categories. Be sure to drop unused levels using droplevels(). Hint: Use %in% with subset() as in:

1 %in% c(1,2,3)
# Type answer here
# ml2 <- 
Answer:

ml2 <- droplevels(subset(ml, Syntax2 %in% c('NP', 'PP', 'VP')))
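A quick sketch to confirm the subset worked, assuming Syntax2 was read in as a factor (which the droplevels() step presupposes); the row count should match the three counts above (736 + 233 + 519 = 1488):

# Check the result of subset() + droplevels()
levels(ml2$Syntax2)    # "NP" "PP" "VP"
nrow(ml2)              # 1488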


Q4. Look at remnant counts of Syntax2 again and convert the table output to percentages.

# Type answer here
Answer:

# Look at counts of Syntax2 again
ml2.m <- with(ml2, xtabs(~ Syntax2)); ml2.m
## Syntax2
##  NP  PP  VP 
## 736 233 519

# Convert to percentages of the full dataset (ml), so the three
# categories sum to about 91%, not 100%
ml2.p <- round(100 * ml2.m / nrow(ml)); ml2.p
## Syntax2
## NP PP VP 
## 45 14 32
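Alternatively, a sketch with prop.table(), which normalizes within ml2 itself so the three categories sum to 100:

# Percentages out of the 1488 rows kept in ml2
round(100 * prop.table(ml2.m))    # NP ~49, PP ~16, VP ~35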


Q5. Use xtabs() or table() to compare Syntax2 by Local (whether the correlate is local or not)

# Type answer here
Answer:

# Use xtabs() or table() to compare Syntax2 by Local
ml2.m <- with(ml2, xtabs(~ Syntax2 + Local)); ml2.m
##        Local
## Syntax2   0   1
##      NP 153 583
##      PP  38 195
##      VP  49 470

ml2.m <- with(ml2, table(Syntax2, Local)); ml2.m
##        Local
## Syntax2   0   1
##      NP 153 583
##      PP  38 195
##      VP  49 470


Q6. Run a chisq.test on the result

# Type answer here
Answer:

chisq.test(ml2.m)
## 
##  Pearson's Chi-squared test
## 
## data:  ml2.m
## X-squared = 28.975, df = 2, p-value = 5.106e-07
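To see which cells drive the effect, here’s a quick sketch using the test’s standardized residuals (the $stdres component of the chisq.test() result); cells with |standardized residual| > 2 depart notably from independence:

# Inspect which Syntax2 x Local cells contribute most
chisq.test(ml2.m)$stdres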


Q7. Plot ml2.m. Add a legend and title. Use gaudy, vibrant colors. Be sure that the x-axis is grouped by the remnant names (not Locality).

# Type answer here
Answer:

barplot(t(ml2.m), 
        beside = T, 
        main = "Remnant type by locality",
        col = c("darkmagenta", "springgreen"))

legend("topright", 
       c("Non Local", "Local"), 
       bty = "n", 
       fill = c("darkmagenta", "springgreen"), 
       title = "Locality"
       )