Getting started - loading some data to model.
library(tidyverse)
-- Attaching packages -------------------------------------------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.4 v purrr 0.3.4
v tibble 3.1.2 v dplyr 1.0.6
v tidyr 1.1.3 v stringr 1.4.0
v readr 1.4.0 v forcats 0.5.1
-- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
From the datasets package we’ll load up some data called Anscombe’s quartet.
a <- datasets::anscombe
There are 11 observations of 8 variables, which form 4 pairs of x and y data, i.e., each x variable is associated with (or predicts) its corresponding y variable.
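To take a quick peek at the raw values first (a small addition, just base R):
head(a) # first six rows of the quartet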
Let’s quickly summarize these variables numerically. A quick-and-dirty way is to call the summary() function on the data frame.
summary(a)
x1 x2 x3 x4 y1 y2 y3
Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260 Min. :3.100 Min. : 5.39
1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315 1st Qu.:6.695 1st Qu.: 6.25
Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580 Median :8.140 Median : 7.11
Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501 Mean :7.501 Mean : 7.50
3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570 3rd Qu.:8.950 3rd Qu.: 7.98
Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840 Max. :9.260 Max. :12.74
y4
Min. : 5.250
1st Qu.: 6.170
Median : 7.040
Mean : 7.501
3rd Qu.: 8.190
Max. :12.500
You’ll notice that all of the x variables have the same mean (9), and all of the y variables have the same mean (about 7.5).
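To double-check that numerically (a quick sketch; the standard deviations turn out to be essentially identical as well):
sapply(a, mean) # means of every column
sapply(a, sd)   # standard deviations of every column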
Let’s run a few simple linear models.
m1 <- lm(y1 ~ x1, data = a)
summary(m1)
Call:
lm(formula = y1 ~ x1, data = a)
Residuals:
Min 1Q Median 3Q Max
-1.92127 -0.45577 -0.04136 0.70941 1.83882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
x1 0.5001 0.1179 4.241 0.00217 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
m2 <- lm(y2 ~ x2, data = a)
summary(m2)
Call:
lm(formula = y2 ~ x2, data = a)
Residuals:
Min 1Q Median 3Q Max
-1.9009 -0.7609 0.1291 0.9491 1.2691
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.001 1.125 2.667 0.02576 *
x2 0.500 0.118 4.239 0.00218 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
m3 <- lm(y3 ~ x3, data = a)
summary(m3)
Call:
lm(formula = y3 ~ x3, data = a)
Residuals:
Min 1Q Median 3Q Max
-1.1586 -0.6146 -0.2303 0.1540 3.2411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0025 1.1245 2.670 0.02562 *
x3 0.4997 0.1179 4.239 0.00218 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
m4 <- lm(y4 ~ x4, data = a)
summary(m4)
Call:
lm(formula = y4 ~ x4, data = a)
Residuals:
Min 1Q Median 3Q Max
-1.751 -0.831 0.000 0.809 1.839
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0017 1.1239 2.671 0.02559 *
x4 0.4999 0.1178 4.243 0.00216 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
Wow, it looks like the models all say the same thing about the relationship between the x variable and the y variable: the estimated slope (beta) for x is about 0.5 in every case. So each one-unit increase in x is associated with an increase of about 0.5 units in y.
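One way to see this side by side is to collect the coefficients from all four models (a quick base-R sketch; the column names come from m1, but the second column is each model’s slope):
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4)
t(sapply(models, coef)) # one row per model: intercept and slope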
But of course, we really should have plotted our data first to see whether the models make sense. Let’s take a look:
Pair 1:
a %>% ggplot(aes(x = x1, y = y1))+
geom_point()
Hmm, this looks pretty linear…
a %>% ggplot(aes(x = x1, y = y1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") + # red: the fitted regression line
  geom_smooth(se = FALSE) # blue: ggplot's default loess smoother
Let’s look at the other pairs…
a %>% ggplot(aes(x = x2, y = y2)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_smooth(se = FALSE)
a %>% ggplot(aes(x = x3, y = y3)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_smooth(se = FALSE)
a %>% ggplot(aes(x = x4, y = y4)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_smooth(se = FALSE)
These data famously illustrate the importance of plotting. If you don’t, you might miss important aspects of the relationships among variables and end up using models that do not make much sense.
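As an aside, if you want all four scatterplots in a single figure, one option (a sketch using tidyr's pivot_longer(); the names_pattern below is specific to this dataset's column names) is to reshape to long format and facet:
a %>%
  pivot_longer(everything(),
               names_to = c(".value", "pair"),
               names_pattern = "(.)(.)") %>% # split "x1" into value type "x" and pair "1"
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~ pair)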
We’ll load some data used in Ch. 3-4: human ratings of the emotional valence of words.
war <- read_csv("warriner_2013_emotional_valence.csv")
-- Column specification ----------------------------------------------------------------------------------------
cols(
Word = col_character(),
Val = col_double()
)
Let’s find the words with the lowest and highest valence:
filter(war, Val == min(Val) | Val == max(Val))
These words illustrate the meaning of the scale… one word associated with extremely negative (repulsive) feelings, and another word that brings most people a lot of joy.
So what are the worst words in the data?
arrange(war, Val) # ascending order: lowest valence first
How about the best?
arrange(war, desc(Val)) # desc() sorts in descending order, so the highest-valence words come first
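If you only want the first few rows rather than the whole table, you can chain in slice_head() (a small sketch):
war %>% arrange(Val) %>% slice_head(n = 10) # the ten lowest-valence words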
Now for some descriptive statistics:
mean(war$Val)
[1] 5.063847
sd(war$Val)
[1] 1.274892
quantile(war$Val)
0% 25% 50% 75% 100%
1.26 4.25 5.20 5.95 8.53
Let’s look a bit closer at how much of the distribution falls within one standard deviation of the mean. For a roughly normal distribution, the interval from mean - SD to mean + SD covers about 68% of the values, so its endpoints should sit near the 16th and 84th percentiles.
mean(war$Val) - sd(war$Val)
[1] 3.788955
mean(war$Val) + sd(war$Val)
[1] 6.338738
quantile(war$Val, c(0.16, 0.84))
16% 84%
3.67 6.32
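To see this visually, one option (a quick sketch; the binwidth is an arbitrary choice) is a histogram with the mean and mean ± 1 SD marked:
war %>% ggplot(aes(x = Val)) +
  geom_histogram(binwidth = 0.25) +
  geom_vline(xintercept = mean(war$Val), color = "red") +
  geom_vline(xintercept = mean(war$Val) + c(-1, 1) * sd(war$Val),
             color = "red", linetype = "dashed")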
Similarly, for the median:
median(war$Val)
[1] 5.2
quantile(war$Val, 0.5)
50%
5.2
Now let’s simulate some data, starting with 50 random numbers drawn from a standard normal distribution (note that without set.seed(), your numbers will differ from the ones shown here):
x <- rnorm(50)
head(x)
[1] -1.02849246 -0.26713045 0.05819758 -0.77565856 1.71131100 -1.59169993
Now we create y values from a known linear relationship with x (intercept 10, slope 3):
y <- 10 + 3*x
Let’s look:
plot(x, y, pch = 19)
Too perfect! This is perfect prediction without error; we can say the relationship between X and Y is deterministic.
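Since y is an exact linear function of x with a positive slope, the correlation is exactly 1:
cor(x, y) # exactly 1: no scatter around the line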
We need to add some noise if this is going to be realistic.
error <- rnorm(50)
y <- 10 + 3*x + error
Now let’s look:
plot(x, y, pch = 19)
Still a pretty tight relationship, but not deterministic.
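As a quick check, the correlation is still high but no longer exactly 1:
cor(x, y)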
Let’s run a regression:
xmd1 <- lm(y ~ x)
xmd1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
9.937 3.084
summary(xmd1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.74043 -0.69724 0.06523 0.71081 1.76953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.9373 0.1332 74.62 <2e-16 ***
x 3.0835 0.1447 21.30 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.94 on 48 degrees of freedom
Multiple R-squared: 0.9044, Adjusted R-squared: 0.9024
F-statistic: 453.9 on 1 and 48 DF, p-value: < 2.2e-16
Fitted values are what the model predicts for Y; you can access them with fitted(). Residuals are the difference between the actual values of Y and what the model predicts (observed minus fitted); you can access them with residuals().
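For example (a short sketch using the model we just fit):
head(fitted(xmd1))     # the model's predicted y values
head(residuals(xmd1))  # observed y minus fitted y
head(y - fitted(xmd1)) # the same thing, computed by hand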
Some helpful functions from the broom package:
library(broom)
tidy(xmd1) #for model parameters
glance(xmd1) #for model fit
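broom also has augment(), which returns the data with fitted values, residuals, and other per-observation quantities attached (handy for residual plots):
augment(xmd1) # adds .fitted, .resid, etc.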
That’s all for now!