Getting started - loading some data to model.

library(tidyverse)
-- Attaching packages -------------------------------------------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.4     v purrr   0.3.4
v tibble  3.1.2     v dplyr   1.0.6
v tidyr   1.1.3     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1
-- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Anscombe’s Quartet

From the datasets package we’ll load up some data called Anscombe’s quartet.

a <- datasets::anscombe

There are 11 observations of 8 variables which form 4 pairs of x and y data, e.g., x is associated with or predicts y.

Let’s quickly summarize these variables numerically - a quick and dirty way is to just use the summary() function on a dataframe.

summary(a)
       x1             x2             x3             x4           y1               y2              y3       
 Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260   Min.   :3.100   Min.   : 5.39  
 1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25  
 Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580   Median :8.140   Median : 7.11  
 Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501   Mean   :7.501   Mean   : 7.50  
 3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98  
 Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840   Max.   :9.260   Max.   :12.74  
       y4        
 Min.   : 5.250  
 1st Qu.: 6.170  
 Median : 7.040  
 Mean   : 7.501  
 3rd Qu.: 8.190  
 Max.   :12.500  

You’ll notice that all of the X variables have the same mean, and all of the Y variables have the same mean.

Let’s run a few simple linear models.

m1 <- lm(y1 ~ x1, data = a)
summary(m1)

Call:
lm(formula = y1 ~ x1, data = a)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0001     1.1247   2.667  0.02573 * 
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6665,    Adjusted R-squared:  0.6295 
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
m2 <- lm(y2 ~ x2, data = a)
summary(m2)

Call:
lm(formula = y2 ~ x2, data = a)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9009 -0.7609  0.1291  0.9491  1.2691 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.001      1.125   2.667  0.02576 * 
x2             0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6662,    Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
m3 <- lm(y3 ~ x3, data = a)
summary(m3)

Call:
lm(formula = y3 ~ x3, data = a)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1586 -0.6146 -0.2303  0.1540  3.2411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0025     1.1245   2.670  0.02562 * 
x3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6663,    Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
m4 <- lm(y4 ~ x4, data = a)
summary(m4)

Call:
lm(formula = y4 ~ x4, data = a)

Residuals:
   Min     1Q Median     3Q    Max 
-1.751 -0.831  0.000  0.809  1.839 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0017     1.1239   2.671  0.02559 * 
x4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6667,    Adjusted R-squared:  0.6297 
F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

Wow, looks like the models all say the same thing about the relationship between the X variable and the Y variable. The value of the beta for X is about .5. So, each unit increase in X results in a .5 unit increase in Y.

But of course, we should plot our data first to see if the models make sense. Let’s take a look:

Pair 1:

a %>% ggplot(aes(x = x1, y = y1))+
  geom_point()

Hmm, this looks pretty linear…

a %>% ggplot(aes(x = x1, y = y1))+
  geom_point()+
  geom_smooth(method = "lm", se = F, color = "red")+
  geom_smooth(se = F)

Let’s look at the other pairs…

a %>% ggplot(aes(x = x2, y = y2))+
  geom_point()+
  geom_smooth(method = "lm", se = F, color = "red")+
  geom_smooth(se = F)

a %>% ggplot(aes(x = x3, y = y3))+
  geom_point()+
  geom_smooth(method = "lm", se = F, color = "red")+
  geom_smooth(se = F)

a %>% ggplot(aes(x = x4, y = y4))+
  geom_point()+
  geom_smooth(method = "lm", se = F, color = "red")+
  geom_smooth(se = F)

These data famously illustrate the importance of plotting. If you don’t, you might miss important aspects of the relationships among variables and end up using models that do not make much sense.

Examples from Winter Ch. 3-4

Chapter 3 - A review of means and descriptive stats

We’ll load some data used in Ch. 3-4, which is some human ratings of emotional valence of words.

war <- read_csv("warriner_2013_emotional_valence.csv")

-- Column specification ----------------------------------------------------------------------------------------
cols(
  Word = col_character(),
  Val = col_double()
)

Let’s find the words with the lowest and highest valence:

filter(war, Val == min(Val) | Val == max(Val))

These words illustrate the meaning of the scale… one word associated with extremely negative (repulsive) feelings, and another word that brings most people a lot of joy.

So what are the worst words in the data?

arrange(war, Val) #ascending

How about the best?

arrange(war, desc(Val)) #the desc() function goes in descending order

Now for some descriptive statistics:

mean(war$Val)
[1] 5.063847
sd(war$Val)
[1] 1.274892
quantile(war$Val)
  0%  25%  50%  75% 100% 
1.26 4.25 5.20 5.95 8.53 

Let’s look a bit closer at the percentage of the distribution.

mean(war$Val) - sd(war$Val)
[1] 3.788955
mean(war$Val) + sd(war$Val)
[1] 6.338738
quantile(war$Val, c(0.16, 0.84))
 16%  84% 
3.67 6.32 

Similarly, for the median:

median(war$Val)
[1] 5.2
quantile(war$Val, 0.5)
50% 
5.2 

Chapter 4 - simulating a linear model

Starting off with 50 random numbers

x <- rnorm(50)
head(x)
[1] -1.02849246 -0.26713045  0.05819758 -0.77565856  1.71131100 -1.59169993

Now creating Y values that are based on some idea of a relationship with X:

y <- 10 + 3*x

Let’s look:

plot(x, y, pch = 19)

Too perfect! This is perfect prediction without error; we can say the relationship between X and Y is deterministic.

We need to add some noise if this is going to be realistic.

error <- rnorm(50)
y <- 10 + 3*x + error

Now let’s look:

plot(x, y, pch = 19)

Still a pretty tight relationship, but not deterministic.

Let’s run a regression:

xmd1 <- lm(y ~ x)

xmd1

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
      9.937        3.084  
summary(xmd1)

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.74043 -0.69724  0.06523  0.71081  1.76953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.9373     0.1332   74.62   <2e-16 ***
x             3.0835     0.1447   21.30   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.94 on 48 degrees of freedom
Multiple R-squared:  0.9044,    Adjusted R-squared:  0.9024 
F-statistic: 453.9 on 1 and 48 DF,  p-value: < 2.2e-16

Fitted values are what the model predicts for Y. You can access them with fitted(). Similarly, residuals are the difference between what the model predicts for Y and the actual values of Y. You can access them with residuals().

Some helpful tidyverse functions:

library(broom)
tidy(xmd1) #for model parameters
glance(xmd1) #for model fit

That’s all for now!

LS0tDQp0aXRsZTogIlNpbXBsZSBMaW5lYXIgTW9kZWxzIg0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KR2V0dGluZyBzdGFydGVkIC0gbG9hZGluZyBzb21lIGRhdGEgdG8gbW9kZWwuDQoNCmBgYHtyfQ0KbGlicmFyeShkYXRhc2V0cykNCmxpYnJhcnkodGlkeXZlcnNlKQ0KYGBgDQoNCiMgQW5zY29tYmUncyBRdWFydGV0DQoNCkZyb20gdGhlIGBkYXRhc2V0c2AgcGFja2FnZSB3ZSdsbCBsb2FkIHVwIHNvbWUgZGF0YSBjYWxsZWQgQW5zY29tYmUncyBxdWFydGV0Lg0KDQpgYGB7cn0NCmEgPC0gZGF0YXNldHM6OmFuc2NvbWJlDQpgYGANCg0KVGhlcmUgYXJlIDExIG9ic2VydmF0aW9ucyBvZiA4IHZhcmlhYmxlcyB3aGljaCBmb3JtIDQgcGFpcnMgb2YgeCBhbmQgeSBkYXRhLCBlLmcuLCB4IGlzIGFzc29jaWF0ZWQgd2l0aCBvciBwcmVkaWN0cyB5Lg0KDQpMZXQncyBxdWlja2x5IHN1bW1hcml6ZSB0aGVzZSB2YXJpYWJsZXMgbnVtZXJpY2FsbHkgLSBhIHF1aWNrIGFuZCBkaXJ0eSB3YXkgaXMgdG8ganVzdCB1c2UgdGhlIGBzdW1tYXJ5KClgIGZ1bmN0aW9uIG9uIGEgZGF0YWZyYW1lLg0KDQpgYGB7cn0NCnN1bW1hcnkoYSkNCmBgYA0KDQpZb3UnbGwgbm90aWNlIHRoYXQgYWxsIG9mIHRoZSBYIHZhcmlhYmxlcyBoYXZlIHRoZSBzYW1lIG1lYW4sIGFuZCBhbGwgb2YgdGhlIFkgdmFyaWFibGVzIGhhdmUgdGhlIHNhbWUgbWVhbi4NCg0KTGV0J3MgcnVuIGEgZmV3IHNpbXBsZSBsaW5lYXIgbW9kZWxzLg0KDQpgYGB7cn0NCm0xIDwtIGxtKHkxIH4geDEsIGRhdGEgPSBhKQ0Kc3VtbWFyeShtMSkNCg0KbTIgPC0gbG0oeTIgfiB4MiwgZGF0YSA9IGEpDQpzdW1tYXJ5KG0yKQ0KDQptMyA8LSBsbSh5MyB+IHgzLCBkYXRhID0gYSkNCnN1bW1hcnkobTMpDQoNCm00IDwtIGxtKHk0IH4geDQsIGRhdGEgPSBhKQ0Kc3VtbWFyeShtNCkNCmBgYA0KDQpXb3csIGxvb2tzIGxpa2UgdGhlIG1vZGVscyBhbGwgc2F5IHRoZSBzYW1lIHRoaW5nIGFib3V0IHRoZSByZWxhdGlvbnNoaXAgYmV0d2VlbiB0aGUgWCB2YXJpYWJsZSBhbmQgdGhlIFkgdmFyaWFibGUuIFRoZSB2YWx1ZSBvZiB0aGUgYmV0YSBmb3IgWCBpcyBhYm91dCAuNS4gU28sIGVhY2ggdW5pdCBpbmNyZWFzZSBpbiBYIHJlc3VsdHMgaW4gYSAuNSB1bml0IGluY3JlYXNlIGluIFkuDQoNCkJ1dCBvZiBjb3Vyc2UsIHdlIHNob3VsZCBwbG90IG91ciBkYXRhIGZpcnN0IHRvIHNlZSBpZiB0aGUgbW9kZWxzIG1ha2Ugc2Vuc2UuIExldCdzIHRha2UgYSBsb29rOg0KDQpQYWlyIDE6DQpgYGB7cn0NCmEgJT4lIGdncGxvdChhZXMoeCA9IHgxLCB5ID0geTEpKSsNCiAgZ2VvbV9wb2ludCgpDQpgYGANCg0KSG1tLCB0aGlzIGxvb2tzIHByZXR0eSBsaW5lYXIuLi4NCg0KYGBge3J9DQphICU+JSBnZ3Bsb3QoYWVzKHggPSB4MSwgeSA9IHkxKSkrDQogIGdlb21fcG9pbnQoKSsNCiAgZ2VvbV9zbW9vdGgobWV0aG9kID0gImxtIiwgc2UgPSBGLCBjb2xvciA9ICJyZWQiKSsNCiAgZ2VvbV9zbW9vdGgoc2UgPSBGKQ0KYGBgDQoNCkxldCdzIGxvb2sgYXQgdGhlIG90aGVyIHBhaXJzLi4uDQoNCmBgYHtyfQ0KYSAlPiUgZ2dwbG90KGFlcyh4ID0geDIsIHkgPSB5MikpKw0KICBnZW9tX3BvaW50KCkrDQogIGdlb21fc21vb3RoKG1ldGhvZCA9ICJsbSIsIHNlID0gRiwgY29sb3IgPSAicmVkIikrDQogIGdlb21fc21vb3RoKHNlID0gRikNCmBgYA0KDQpgYGB7cn0NCmEgJT4lIGdncGxvdChhZXMoeCA9IHgzLCB5ID0geTMpKSsNCiAgZ2VvbV9wb2ludCgpKw0KICBnZW9tX3Ntb290aChtZXRob2QgPSAibG0iLCBzZSA9IEYsIGNvbG9yID0gInJlZCIpKw0KICBnZW9tX3Ntb290aChzZSA9IEYpDQpgYGANCg0KDQoNCmBgYHtyfQ0KYSAlPiUgZ2dwbG90KGFlcyh4ID0geDQsIHkgPSB5NCkpKw0KICBnZW9tX3BvaW50KCkrDQogIGdlb21fc21vb3RoKG1ldGhvZCA9ICJsbSIsIHNlID0gRiwgY29sb3IgPSAicmVkIikrDQogIGdlb21fc21vb3RoKHNlID0gRikNCmBgYA0KDQpUaGVzZSBkYXRhIGZhbW91c2x5IGlsbHVzdHJhdGUgdGhlIGltcG9ydGFuY2Ugb2YgcGxvdHRpbmcuIElmIHlvdSBkb24ndCwgeW91IG1pZ2h0IG1pc3MgaW1wb3J0YW50IGFzcGVjdHMgb2YgdGhlIHJlbGF0aW9uc2hpcHMgYW1vbmcgdmFyaWFibGVzIGFuZCBlbmQgdXAgdXNpbmcgbW9kZWxzIHRoYXQgZG8gbm90IG1ha2UgbXVjaCBzZW5zZS4NCg0KIyBFeGFtcGxlcyBmcm9tIFdpbnRlciBDaC4gMy00DQoNCiMjIENoYXB0ZXIgMyAtIEEgcmV2aWV3IG9mIG1lYW5zIGFuZCBkZXNjcmlwdGl2ZSBzdGF0cw0KDQpXZSdsbCBsb2FkIHNvbWUgZGF0YSB1c2VkIGluIENoLiAzLTQsIHdoaWNoIGlzIHNvbWUgaHVtYW4gcmF0aW5ncyBvZiBlbW90aW9uYWwgdmFsZW5jZSBvZiB3b3Jkcy4NCg0KYGBge3J9DQp3YXIgPC0gcmVhZF9jc3YoIndhcnJpbmVyXzIwMTNfZW1vdGlvbmFsX3ZhbGVuY2UuY3N2IikNCmBgYA0KDQpMZXQncyBmaW5kIHRoZSB3b3JkcyB3aXRoIHRoZSBsb3dlc3QgYW5kIGhpZ2hlc3QgdmFsZW5jZToNCg0KYGBge3J9DQpmaWx0ZXIod2FyLCBWYWwgPT0gbWluKFZhbCkgfCBWYWwgPT0gbWF4KFZhbCkpDQpgYGANCg0KVGhlc2Ugd29yZHMgaWxsdXN0cmF0ZSB0aGUgbWVhbmluZyBvZiB0aGUgc2NhbGUuLi4gb25lIHdvcmQgYXNzb2NpYXRlZCB3aXRoIGV4dHJlbWVseSBuZWdhdGl2ZSAocmVwdWxzaXZlKSBmZWVsaW5ncywgYW5kIGFub3RoZXIgd29yZCB0aGF0IGJyaW5ncyBtb3N0IHBlb3BsZSBhIGxvdCBvZiBqb3kuIA0KDQpTbyB3aGF0IGFyZSB0aGUgd29yc3Qgd29yZHMgaW4gdGhlIGRhdGE/DQoNCmBgYHtyfQ0KYXJyYW5nZSh3YXIsIFZhbCkgI2FzY2VuZGluZw0KYGBgDQoNCkhvdyBhYm91dCB0aGUgYmVzdD8NCg0KYGBge3J9DQphcnJhbmdlKHdhciwgZGVzYyhWYWwpKSAjdGhlIGRlc2MoKSBmdW5jdGlvbiBnb2VzIGluIGRlc2NlbmRpbmcgb3JkZXINCmBgYA0KDQpOb3cgZm9yIHNvbWUgZGVzY3JpcHRpdmUgc3RhdGlzdGljczoNCg0KYGBge3J9DQptZWFuKHdhciRWYWwpDQpzZCh3YXIkVmFsKQ0KcXVhbnRpbGUod2FyJFZhbCkNCmBgYA0KDQpMZXQncyBsb29rIGEgYml0IGNsb3NlciBhdCB0aGUgcGVyY2VudGFnZSBvZiB0aGUgZGlzdHJpYnV0aW9uLg0KDQpgYGB7cn0NCm1lYW4od2FyJFZhbCkgLSBzZCh3YXIkVmFsKQ0KbWVhbih3YXIkVmFsKSArIHNkKHdhciRWYWwpDQpxdWFudGlsZSh3YXIkVmFsLCBjKDAuMTYsIDAuODQpKQ0KYGBgDQoNClNpbWlsYXJseSwgZm9yIHRoZSBtZWRpYW46DQoNCmBgYHtyfQ0KbWVkaWFuKHdhciRWYWwpDQpxdWFudGlsZSh3YXIkVmFsLCAwLjUpDQpgYGANCg0KIyMgQ2hhcHRlciA0IC0gc2ltdWxhdGluZyBhIGxpbmVhciBtb2RlbA0KDQpTdGFydGluZyBvZmYgd2l0aCA1MCByYW5kb20gbnVtYmVycw0KDQpgYGB7cn0NCnggPC0gcm5vcm0oNTApDQpoZWFkKHgpDQpgYGANCg0KTm93IGNyZWF0aW5nIFkgdmFsdWVzIHRoYXQgYXJlIGJhc2VkIG9uIHNvbWUgaWRlYSBvZiBhIHJlbGF0aW9uc2hpcCB3aXRoIFg6DQoNCmBgYHtyfQ0KeSA8LSAxMCArIDMqeA0KYGBgDQoNCkxldCdzIGxvb2s6DQoNCmBgYHtyfQ0KcGxvdCh4LCB5LCBwY2ggPSAxOSkNCmBgYA0KDQpUb28gcGVyZmVjdCEgVGhpcyBpcyBwZXJmZWN0IHByZWRpY3Rpb24gd2l0aG91dCBlcnJvcjsgd2UgY2FuIHNheSB0aGUgcmVsYXRpb25zaGlwIGJldHdlZW4gWCBhbmQgWSBpcyBkZXRlcm1pbmlzdGljLg0KDQpXZSBuZWVkIHRvIGFkZCBzb21lIG5vaXNlIGlmIHRoaXMgaXMgZ29pbmcgdG8gYmUgcmVhbGlzdGljLg0KDQpgYGB7cn0NCmVycm9yIDwtIHJub3JtKDUwKQ0KeSA8LSAxMCArIDMqeCArIGVycm9yDQpgYGANCg0KTm93IGxldCdzIGxvb2s6DQoNCmBgYHtyfQ0KcGxvdCh4LCB5LCBwY2ggPSAxOSkNCmBgYA0KU3RpbGwgYSBwcmV0dHkgdGlnaHQgcmVsYXRpb25zaGlwLCBidXQgbm90IGRldGVybWluaXN0aWMuIA0KDQpMZXQncyBydW4gYSByZWdyZXNzaW9uOg0KDQpgYGB7cn0NCnhtZDEgPC0gbG0oeSB+IHgpDQoNCnhtZDENCg0Kc3VtbWFyeSh4bWQxKQ0KYGBgDQoNCkZpdHRlZCB2YWx1ZXMgYXJlIHdoYXQgdGhlIG1vZGVsIHByZWRpY3RzIGZvciBZLiBZb3UgY2FuIGFjY2VzcyB0aGVtIHdpdGggYGZpdHRlZCgpYC4gU2ltaWxhcmx5LCByZXNpZHVhbHMgYXJlIHRoZSBkaWZmZXJlbmNlIGJldHdlZW4gd2hhdCB0aGUgbW9kZWwgcHJlZGljdHMgZm9yIFkgYW5kIHRoZSBhY3R1YWwgdmFsdWVzIG9mIFkuIFlvdSBjYW4gYWNjZXNzIHRoZW0gd2l0aCBgcmVzaWR1YWxzKClgLg0KDQpTb21lIGhlbHBmdWwgdGlkeXZlcnNlIGZ1bmN0aW9uczoNCg0KYGBge3J9DQpsaWJyYXJ5KGJyb29tKQ0KdGlkeSh4bWQxKSAjZm9yIG1vZGVsIHBhcmFtZXRlcnMNCmdsYW5jZSh4bWQxKSAjZm9yIG1vZGVsIGZpdA0KYGBgDQoNClRoYXQncyBhbGwgZm9yIG5vdyE=