Pop Quiz: Anscombe's Quartet

Erin

2022-03-23

Introduction

This dataset is called Anscombe's Quartet. It is a synthetic dataset that was created to show the importance of looking beyond the p-value and R-squared value when doing an analysis. Because the data is synthetic, there are no missing values and no data quality issues.

> head(four)
  x123   y1   y2    y3 x4   y4
1   10 8.04 9.14  7.46  8 6.58
2    8 6.95 8.14  6.77  8 5.76
3   13 7.58 8.74 12.74  8 7.71
4    9 8.81 8.77  7.11  8 8.84
5   11 8.33 9.26  7.81  8 8.47
6   14 9.96 8.10  8.84  8 7.04
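The report does not show how `four` was loaded. As a minimal sketch, assuming `four` matches R's built-in `anscombe` data frame with the shared x column renamed to `x123` (an assumption, not something stated in the source), the data can be rebuilt and checked for missing values like this:

```r
# Assumption: `four` is R's built-in `anscombe` data with x1 (= x2 = x3)
# kept once under the name x123. The source does not confirm this.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

# The first three x columns are identical, so one column can stand in for all.
stopifnot(identical(anscombe$x1, anscombe$x2),
          identical(anscombe$x2, anscombe$x3))

# Count missing values per column; every count should be zero.
missing_per_column <- colSums(is.na(four))
print(missing_per_column)
print(dim(four))
```

If the assumption holds, `head(four)` reproduces the six rows shown above.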

Scatter Plots

All four scatter plots show a positive correlation. It is important to notice that the plots look very different from one another, yet the fitted regressions have nearly identical R^2 and p-values. This will be demonstrated when linear regression is used.

> scatterplot(y1~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, data=four)

> scatterplot(y2~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, data=four)

> scatterplot(y3~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, data=four)

> scatterplot(y4~x4, regLine=FALSE, smooth=FALSE, boxplots=FALSE, data=four)
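The claim that all four pairs are positively correlated can be checked numerically. The sketch below uses base R's `cor()` and assumes `four` can be rebuilt from the built-in `anscombe` data (the name `xy_pairs` is introduced here only for illustration):

```r
# Assumption: `four` is the built-in `anscombe` data with x1 renamed to x123.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

# The x/y pairs plotted in the four scatter plots.
xy_pairs <- list(c("x123", "y1"), c("x123", "y2"),
                 c("x123", "y3"), c("x4", "y4"))

# Pearson correlation for each pair.
r <- sapply(xy_pairs, function(p) cor(four[[p[1]]], four[[p[2]]]))
names(r) <- sapply(xy_pairs, function(p) paste(p[2], "~", p[1]))
print(round(r, 3))
```

All four correlations come out positive and close to one another, even though the plots look nothing alike.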

Outliers

Y1

  • The distribution appears roughly symmetric, and so plausibly normal, as the whiskers are approximately the same length
  • There are no large outliers
> Boxplot( ~ y1, data=four, id=list(method="y"))

Y2

  • This value is not normally distributed, as the bottom whisker is much longer than the top one
  • There is one large outlier at 3.1 (observation 8)
  • The distribution is negatively (left) skewed
> Boxplot( ~ y2, data=four, id=list(method="y"))

[1] "8"

Y3

  • Apart from the outlier, the distribution appears roughly symmetric, as the whiskers are approximately the same length
  • There is one large outlier at 12.74 (observation 3)
> Boxplot( ~ y3, data=four, id=list(method="y"))

[1] "3"

Y4

  • Apart from the outlier, the distribution appears roughly symmetric, as the whiskers are approximately the same length
  • There is one large outlier at 12.5 (observation 8)
> Boxplot( ~ y4, data=four, id=list(method="y"))

[1] "8"
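The row labels printed under each box plot can be cross-checked in base R. This is a sketch using `boxplot.stats()`, which flags points more than 1.5 × IQR beyond the hinges; it assumes `four` can be rebuilt from the built-in `anscombe` data, and it is not necessarily the exact rule `Boxplot()` applies internally:

```r
# Assumption: `four` is the built-in `anscombe` data with x1 renamed to x123.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

# For each y variable, list the row number and value of every flagged point.
outliers <- lapply(four[c("y1", "y2", "y3", "y4")], function(v) {
  flagged <- v %in% boxplot.stats(v)$out
  data.frame(row = which(flagged), value = v[flagged])
})
print(outliers)
```

Under that assumption this flags row 8 of y2 (3.10), row 3 of y3 (12.74) and row 8 of y4 (12.50), and nothing for y1.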

Summary of Variables

The relationship between the mean and the median also gives a rough indication of the shape of each distribution.

As a rule of thumb, if \(mean > median\) the distribution is positively skewed, and if \(mean < median\) it is negatively skewed.
  • X123 is symmetric as \(9=9\)
  • Y1 is negatively skewed as \(7.501<7.58\)
  • Y2 is negatively skewed as \(7.501<8.14\)
  • Y3 is positively skewed as \(7.5>7.11\)
  • X4 is positively skewed as \(9>8\)
  • Y4 is positively skewed as \(7.501>7.04\)
> summary(four)
      x123            y1               y2              y3              x4    
 Min.   : 4.0   Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 8  
 1st Qu.: 6.5   1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 8  
 Median : 9.0   Median : 7.580   Median :8.140   Median : 7.11   Median : 8  
 Mean   : 9.0   Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 9  
 3rd Qu.:11.5   3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8  
 Max.   :14.0   Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :19  
       y4        
 Min.   : 5.250  
 1st Qu.: 6.170  
 Median : 7.040  
 Mean   : 7.501  
 3rd Qu.: 8.190  
 Max.   :12.500  
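The mean-versus-median comparison can be automated. The following is a rough sketch of that rule of thumb, assuming `four` can be rebuilt from the built-in `anscombe` data; the names `gap` and `direction` and the tolerance of `1e-8` are illustrative choices, not part of the original analysis:

```r
# Assumption: `four` is the built-in `anscombe` data with x1 renamed to x123.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

# Mean minus median for every column.
gap <- sapply(four, function(v) mean(v) - median(v))

# Rule of thumb: positive gap -> positive skew, negative gap -> negative skew.
direction <- ifelse(abs(gap) < 1e-8, "symmetric",
                    ifelse(gap > 0, "positive skew", "negative skew"))
names(direction) <- names(gap)

print(data.frame(mean_minus_median = round(gap, 3), direction))
```

This is only a heuristic; with 11 observations a single extreme value can decide the sign of the gap.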

Linear Regression

It is evident that the \(Pr(>|t|)\), adjusted \(R^2\), residual standard error and p-values are nearly identical across the four regression analyses. This is important because
  • each of these variables has a different distribution
  • each pair looks different on a scatter plot
  • these statistics are not necessarily the end of the analysis, and a well-rounded analysis should always be done
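To see how close the four fits are, the key statistics can be collected into one table. This is a sketch in base R, assuming `four` can be rebuilt from the built-in `anscombe` data; the column names in `comparison` are chosen here for readability:

```r
# Assumption: `four` is the built-in `anscombe` data with x1 renamed to x123.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

formulas <- list(y1 ~ x123, y2 ~ x123, y3 ~ x123, y4 ~ x4)

# Fit each model and pull out the statistics reported by summary().
comparison <- t(sapply(formulas, function(f) {
  s <- summary(lm(f, data = four))
  c(intercept     = unname(coef(s)[1, "Estimate"]),
    slope         = unname(coef(s)[2, "Estimate"]),
    slope_p       = unname(coef(s)[2, "Pr(>|t|)"]),
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    resid_se      = s$sigma)
}))
rownames(comparison) <- sapply(formulas, deparse)

print(round(comparison, 4))
```

The rows agree to roughly two or three decimal places, which is why the numbers alone cannot distinguish the four datasets.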
The analysis of the linear regression outputs will include
  • Residuals
    • the difference between the actual and the predicted values, used to see how well the model fits
  • The Coefficients
    • used to calculate the linear expression for the regression model
    • Standard error
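As a small illustration of what the Residuals block in each summary reports, the sketch below computes residuals by hand as observed minus fitted values (the convention R uses) and compares them with `resid()`. It assumes `four` can be rebuilt from the built-in `anscombe` data and uses the y1 model as the example:

```r
# Assumption: `four` is the built-in `anscombe` data with x1 renamed to x123.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

fit1 <- lm(y1 ~ x123, data = four)

# Residual = observed value minus the value predicted by the fitted line.
manual_resid <- four$y1 - fitted(fit1)
same <- isTRUE(all.equal(unname(manual_resid), unname(resid(fit1))))
print(same)

# The five-number summary shown under "Residuals:" in summary(fit1).
resid_quartiles <- quantile(resid(fit1))
print(round(resid_quartiles, 5))
```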

Because the estimates are nearly identical, the same linear expression and standard errors can be used for all four regression models (the values below are taken from the Y1 model)

  • linear expression for the regression model is \[ y=0.5001x+3.0001\]
  • With one standard error added to or subtracted from each coefficient the formula is
    • max \[ y=0.618x+4.1248\]
    • min \[ y=0.3822x+1.8754\]
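The expressions above can be derived directly from a fitted model. This sketch takes the y1 model as the example and assumes `four` can be rebuilt from the built-in `anscombe` data; the row labels in `bounds` are illustrative names:

```r
# Assumption: `four` is the built-in `anscombe` data with x1 renamed to x123.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

fit1 <- lm(y1 ~ x123, data = four)
coef_table <- coef(summary(fit1))

est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Coefficients shifted up and down by one standard error.
bounds <- rbind(estimate     = est,
                plus_one_se  = est + se,
                minus_one_se = est - se)
print(round(bounds, 4))
```

For a conventional interval, `confint(fit1)` returns 95% confidence limits, which are wider than plus or minus one standard error.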

Y1 and X123

Since the median residual (-0.041) is not exactly 0, the residuals are not perfectly symmetric. However, the minimum (-1.92) and maximum (1.84) residuals are similar in magnitude, so it can be concluded that the model fits reasonably well.

> fourpt1.2 <- lm(y1~x123, data=four)
> summary(fourpt1.2)

Call:
lm(formula = y1 ~ x123, data = four)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0001     1.1247   2.667  0.02573 * 
x123          0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6665,    Adjusted R-squared:  0.6295 
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Y2 and X123

Since the median residual (0.129) is not 0, the residuals are not perfectly symmetric. However, the minimum (-1.90) and maximum (1.27) residuals are broadly similar in magnitude, so judged by the residual summary alone the model fits somewhat.

> RegModel.3 <- lm(y2~x123, data=four)
> summary(RegModel.3)

Call:
lm(formula = y2 ~ x123, data = four)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9009 -0.7609  0.1291  0.9491  1.2691 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.001      1.125   2.667  0.02576 * 
x123           0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6662,    Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

Y3 and X123

Since the median residual (-0.230) is not 0, the residuals are not symmetric. The minimum (-1.16) and maximum (3.24) residuals are also far apart in magnitude, so it can be concluded that the model does not fit well.

> fourpt2.4 <- lm(y3~x123, data=four)
> summary(fourpt2.4)

Call:
lm(formula = y3 ~ x123, data = four)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1586 -0.6146 -0.2303  0.1540  3.2411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0025     1.1245   2.670  0.02562 * 
x123          0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6663,    Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
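The lopsided residuals in this model can be traced to a single observation. As a sketch, assuming `four` can be rebuilt from the built-in `anscombe` data (the name `fit3` is introduced here and stands in for `fourpt2.4`), the point with the largest absolute residual can be located like this:

```r
# Assumption: `four` is the built-in `anscombe` data with x1 renamed to x123.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

fit3 <- lm(y3 ~ x123, data = four)
res3 <- resid(fit3)

# Row with the largest absolute residual, and the data behind it.
worst_row <- unname(which.max(abs(res3)))
print(worst_row)
print(four[worst_row, c("x123", "y3")])
print(round(res3[worst_row], 4))
```

Under that assumption the row is observation 3, the same point the box plot of y3 flagged as an outlier.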

Y4 and X4

Since the median residual is 0, the residuals appear centered and symmetric. Given that the minimum (-1.751) and maximum (1.839) residuals are similar in magnitude, the residual summary suggests that the model fits somewhat.

> fourpt3.5 <- lm(y4~x4, data=four)
> summary(fourpt3.5)

Call:
lm(formula = y4 ~ x4, data = four)

Residuals:
   Min     1Q Median     3Q    Max 
-1.751 -0.831  0.000  0.809  1.839 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0017     1.1239   2.671  0.02559 * 
x4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6667,    Adjusted R-squared:  0.6297 
F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
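One possible reading of the zero median residual in the last model is that it comes from a single point rather than from a well-behaved fit. The sketch below, assuming `four` can be rebuilt from the built-in `anscombe` data (`fit4` stands in for `fourpt3.5`), checks the residual of the only observation whose x4 differs from 8:

```r
# Assumption: `four` is the built-in `anscombe` data with x1 renamed to x123.
four <- with(anscombe, data.frame(x123 = x1, y1, y2, y3, x4, y4))

fit4 <- lm(y4 ~ x4, data = four)
res4 <- resid(fit4)

# x4 takes only two distinct values: 8 (ten rows) and 19 (one row).
print(table(four$x4))

# The lone x4 = 19 row: the fitted line passes through it, so its residual is ~0.
lone_row <- which(four$x4 == 19)
print(lone_row)
print(round(res4[lone_row], 10))
print(round(median(res4), 10))
```

With five residuals below zero and five above, that one near-zero residual is also the median, so the tidy residual summary says little about whether a line is appropriate here.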