Let’s start by setting the alpha and beta coefficients of the line. We’ll create a dataframe called pos for “positive correlation”:
# Coefficient terms of line
alpha <- 4
beta <- 2
# Create pos data frame
# Create array of points for x
pos <- data.frame(0:19); colnames(pos) <- 'x'
# Use Alpha and Beta to generate y
pos$y <- alpha + (pos$x * beta)
head(pos)
## x y
## 1 0 4
## 2 1 6
## 3 2 8
## 4 3 10
## 5 4 12
## 6 5 14
Next we can compute the linear model with the lm() function, which takes two arguments: a formula y ~ x (predicting y from x) and a dataframe.
lm(
formula,
data
)
# Compute the linear model
lm(y ~ x, data = pos)
##
## Call:
## lm(formula = y ~ x, data = pos)
##
## Coefficients:
## (Intercept) x
## 4 2
The model recovers exactly the intercept (4) and slope (2) we used to generate the data.
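If we want the coefficients as plain numbers for reuse, we can extract them with coef() (a small aside, not part of the original exercise):
# Extract the fitted coefficients as a named numeric vector
coef(lm(y ~ x, data = pos))
## (Intercept) x
## 4 2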
Now let’s create a plot of the points in pos and a linear regression line using the alpha and beta coefficients.
As discussed in previous classes, the function plot() can be used with a dataframe or other object to produce a scatterplot.
plot()
The line is drawn with the function abline(), to which we’ll pass our coefficients as arguments:
abline()
plot(pos)
abline(alpha, beta)
Next, we’ll again make a scatterplot of the points in pos. This time, however, we’ll plot the output of the linear regression by passing the fitted model to abline(). As seen below, abline() accepts a model object, in addition to raw coefficient values. Let’s also jazz it up a bit by plotting a slightly thicker red line, using the col = "red" and lwd = 4 (line width) arguments:
abline(
model,
col,
lwd
)
plot(pos)
abline(lm(y ~ x, data = pos), col = "red", lwd = 4)
Phew! Looks the same as before!
We can use the cor() function to compute the correlation between x and y. As expected, it’s a perfect correlation (r = 1), by design.
cor(
x,
y
)
cor(pos$x, pos$y)
## [1] 1
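As a quick aside (not part of the original exercise), flipping the sign of the slope gives the mirror image: a perfect negative correlation, r = -1.
# A mirror-image line with a negative slope
neg <- data.frame(x = 0:19)
neg$y <- alpha - (neg$x * beta)
cor(neg$x, neg$y)
## [1] -1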
Real data just don’t look so uniform, alas. There’s usually some kind of noise due to unexplained variance, measurement error, etc. Linear models capture this noise with an error term \(\epsilon\) added to the predictors in the formula:
\[y \sim x + \epsilon \]
For simplicity, I’ve chosen to make the error extremely regular, alternating between 4 and -4, for visual effect. We’ll just add this information to the pos dataframe. Note that I’ve used the rep() function to create a vector by providing a list of values c(4, -4) repeated 10 times (times = 10).
rep(
c(4,-4),
times = 10
)
Then we’ll create a new column, y2, representing the new value of y with the error term added.
pos$error <- rep(c(4, -4), times = 10)
head(pos)
## x y error
## 1 0 4 4
## 2 1 6 -4
## 3 2 8 4
## 4 3 10 -4
## 5 4 12 4
## 6 5 14 -4
# Compute alternative y using error
pos$y2 <- alpha + (pos$x * beta) + pos$error
head(pos)
## x y error y2
## 1 0 4 4 8
## 2 1 6 -4 2
## 3 2 8 4 12
## 4 3 10 -4 6
## 5 4 12 4 16
## 6 5 14 -4 10
We can again use the cor() function to compute the correlation between x and our y + error term.
How do you expect the correlation to compare to the previous correlation? Try it out here and check your answer below:
# Add correlation here
We’ll again use the cor() function, which takes two arguments:
cor(
x,
y
)
cor(pos$x, pos$y2)
## [1] 0.9420074
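The alternating ±4 error was chosen for visual effect; with random noise, say from rnorm() (a sketch, not part of the original exercise), the same principle holds: the more noise relative to the slope, the further the correlation falls from 1.
# Simulate random noise instead of the regular +/-4 pattern
set.seed(1) # make the random draws reproducible
y.noisy <- alpha + (pos$x * beta) + rnorm(20, mean = 0, sd = 8)
cor(pos$x, y.noisy) # smaller sd pushes r back toward 1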
Let’s try plotting the data with plot() and abline() as before, this time using the formula interface (y2 ~ x). We’ve also added blue lines to show the residuals, using segments():
plot(y2 ~ x, data = pos)
abline(alpha, beta, col = "red", lwd = 2)
# Add lines to show residuals (using segments())
segments(pos$x, # x position from (x0)
pos$y, # y position from (y0)
pos$x, # x position to (x1)
pos$y + pos$error, # y position to (y1)
lwd = 2, col = "blue" # thick blue lines for the residuals
)
The best-fitting line will be the one that minimizes the distance between the predicted value \(\hat{y}\) (pronounced “y hat”) and the observed value \(y\).
To get a feel for how this works, we’ll compute the squared residuals \(e\) for two candidate lines:
\[ e = (y - \hat{y})^2 \]
pos$resid <- (pos$y - pos$y2)^2
sum(pos$resid) # 320
## [1] 320
The sum of squares is 320. Now let’s try an alternative line with a slightly different intercept and slope:
alpha2 <- 4.5
beta2 <- 1.8
# Use alpha2 and beta2 to generate y.alt
pos$y.alt <- alpha2 + (pos$x * beta2)
head(pos)
## x y error y2 resid y.alt
## 1 0 4 4 8 16 4.5
## 2 1 6 -4 2 16 6.3
## 3 2 8 4 12 16 8.1
## 4 3 10 -4 6 16 9.9
## 5 4 12 4 16 16 11.7
## 6 5 14 -4 10 16 13.5
# Calculate residuals for alternate line
pos$resid2 <- (pos$y.alt - pos$y2)^2
sum(pos$resid2) # 369.8
## [1] 369.8
The sum of squares is 369.8.
Comparing the summed squared residuals from each line, we see that the alternative leaves more residual variance, and so fits the data less well.
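In fact, lm() picks the intercept and slope that minimize the sum of squared residuals. As a quick check (a sketch, not part of the original exercise), the fitted line’s residual sum of squares comes in just under the 320 we got from the true coefficients:
# Sum of squared residuals for the lm() fit to y2
sum(resid(lm(y2 ~ x, data = pos))^2)
## [1] 317.594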
Instructions: First, try this on your own, consulting a neighbor if you get stuck. We’ll discuss the answers as a class.
Load the father.son data set included in the UsingR library with install.packages('UsingR') and library().
# install.packages('UsingR')
library(UsingR)
# Enter the plot here
We can use the plot() function with the formula for predicting sheight from fheight in the father.son data.
plot(sheight ~ fheight, data = father.son)
# Complete model specification here
# fs.lm <-
We can use lm() with the same formula as above.
fs.lm <- lm(sheight ~ fheight, data = father.son)
fs.lm
# Type answer here
We can inspect the fitted model with summary() and report the coefficients.
# summary(fs.lm)
# Report the Intercept and the Slope of the line.
# Intercept
# 33.8866
# fs.lm$coeff[1]
# Slope:
# fs.lm$coeff[2]
# Type answer here
We can add the regression line to the plot by passing the fitted model to abline().
# abline(fs.lm, col = "red", lwd = 3)
# Type answer here
We can compute the correlation between father and son heights with cor().
with(father.son, cor(sheight, fheight))
## [1] 0.5013383
?cor.test
We can test the correlation for significance with cor.test().
# Test correlation for significance using cor.test()
# with(father.son, cor.test(sheight, fheight))
# Add correlation coefficient to the plot. Use locator() to tell R exactly where to put the text. Fill in the ??? with the real correlation value.
# text(locator(1), "r = 0.5", cex = 1.4, col = "red")
# text(65, 75, "r = 0.5", cex = 1.4, col = "blue")
# ?text
The “much less” coordinator is argued to head an ellipsis construction that places the correlate and remnant in contrastive focus, and presupposes a contextually salient scale between the correlate and the remnant.
# Load ML dataset from github
ml <- read.csv('https://raw.githubusercontent.com/jaharris/jaharris.github.io/master/static/files/data/much-less-COCA-lab.csv', header = T)
# NB. Data from COCA on the coordinator 'much less'. Annotated with Katy Carlson, Morehead State University.
head(ml)
## Number Year Genre Source
## 1 197 2006 ACAD ChurchHistory
## 2 288 2003 ACAD ForeignAffairs
## 3 513 1996 MAG America
## 4 552 1995 ACAD AsianAffairs
## 5 563 1994 MAG Atlantic
## 6 579 1994 ACAD HispanicRev
## Sentence
## 1 conflict with Catholic orthodoxy. # Indeed, Aude was not the first -- much less the only! -- medieval Christian to puzzle over how the simultaneously human and divine
## 2 these trends are, China still faces serious obstacles to becoming a high-profile, much less a dominant, player in the international community. For the moment, China's
## 3 patristic texts demonstrates just the reverse-that the Fathers never arrived at a dominant, much less a unanimous interpretation of Psalm 1. Doctrinal Aberration? Other attacks raise the question
## 4 the Saffron Surge should not be interpreted as an aberrant, transient, or much less a perverse political trend that can be arrested by a determined reassertion of the Nehruvian
## 5 there is no corresponding vignette to give children a picture of an amicable, much less a long-lasting, marriage. (Susan Wilson believes that you " can't beat
## 6 to be able to assert that Pizarnik is not writing a specifically erotic, much less a pornographic, poetry (whatever the distinction between the two might be),
## Syntax Contrast Position Other Local Licensor.category Licensor
## 1 Adj/DP Adj in NP <NA> 1 Neg <NA>
## 2 Adj/DP Adj in NP <NA> 1 NEC <NA>
## 3 Adj/DP Adj in NP <NA> 1 Neg never
## 4 Adj/DP Adj in NP <NA> 1 Neg <NA>
## 5 Adj/DP Adj in NP <NA> 1 Neg <NA>
## 6 Adj/DP Adj in NP <NA> 1 Neg <NA>
## Note Note2 Syntax2 Contrast2
## 1 <NA> <NA> Adj Adj
## 2 Not Enough Context <NA> Adj Adj
## 3 <NA> <NA> Adj Adj
## 4 <NA> <NA> Adj Adj
## 5 <NA> <NA> Adj Adj
## 6 <NA> <NA> Adj Adj
#### Q1. Check basics of the dataset
First, let’s look at counts of the different remnant types (Syntax2) with xtabs() or table().
xtabs(~ ml$Syntax2)
## ml$Syntax2
## Adj Adv Det NP PP SC V VP
## 33 20 1 736 233 42 60 519
The counts are organized by the syntactic category of the remnant (Adj, Adv, Det, etc.).
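table() gives the same counts, just without the xtabs-style header:
# table() is an equivalent way to tally the counts
table(ml$Syntax2)
##
## Adj Adv Det NP PP SC V VP
## 33 20 1 736 233 42 60 519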
# Type answer here
sort(xtabs(~ ml$Syntax2))
## ml$Syntax2
## Det Adv Adj SC V PP VP NP
## 1 20 33 42 60 233 519 736
# %in% tests membership: this returns TRUE
1 %in% c(1, 2, 3)
# Type answer here
# ml2 <-
ml2 <- droplevels(subset(ml, Syntax2 %in% c('NP', 'PP', 'VP')))
# Type answer here
# Look at counts of Syntax2 again
ml2.m <- with(ml2, xtabs(~ Syntax2)); ml2.m
## Syntax2
## NP PP VP
## 736 233 519
# Convert to percentages of all tokens in ml
ml2.p <- round(100 * ml2.m / nrow(ml)); ml2.p
## Syntax2
## NP PP VP
## 45 14 32
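Note that the denominator above is nrow(ml), the full dataset, which is why the three percentages sum to less than 100. To get proportions within ml2 instead, prop.table() divides by the table’s own total (a small aside, not part of the original exercise):
# Percentages within ml2: these sum to 100
round(100 * prop.table(ml2.m))
## Syntax2
## NP PP VP
## 49 16 35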
# Type answer here
# Use xtabs() or table() to compare Syntax2 by Local
ml2.m <- with(ml2, xtabs(~ Syntax2 + Local)); ml2.m
## Local
## Syntax2 0 1
## NP 153 583
## PP 38 195
## VP 49 470
ml2.m <- with(ml2, table(Syntax2, Local)); ml2.m
## Local
## Syntax2 0 1
## NP 153 583
## PP 38 195
## VP 49 470
# Type answer here
chisq.test(ml2.m)
##
## Pearson's Chi-squared test
##
## data: ml2.m
## X-squared = 28.975, df = 2, p-value = 5.106e-07
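The test tells us that remnant category and locality are associated, but not where the association comes from. To see which cells drive the effect, we can inspect the Pearson residuals stored in the test result (a quick follow-up, not part of the original exercise); large positive values mark cells with more observations than independence would predict:
# Pearson residuals: (observed - expected) / sqrt(expected)
round(chisq.test(ml2.m)$residuals, 2)
## Local
## Syntax2 0 1
## NP 3.15 -1.38
## PP 0.07 -0.03
## VP -3.79 1.66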
# Type answer here
barplot(t(ml2.m),
beside = T,
col = c("darkmagenta", "springgreen"))
legend("topright",
c("Non Local", "Local"),
bty = "n",
fill = c("darkmagenta", "springgreen"),
title = "Locality"
)