This dataset is Anscombe's Quartet, a synthetic dataset created to show the importance of looking beyond the p-value and R-squared value when doing analysis. Because the dataset is synthetic, there are no missing values and no data quality issues.
> head(four)
x123 y1 y2 y3 x4 y4
1 10 8.04 9.14 7.46 8 6.58
2 8 6.95 8.14 6.77 8 5.76
3 13 7.58 8.74 12.74 8 7.71
4 9 8.81 8.77 7.11 8 8.84
5 11 8.33 9.26 7.81 8 8.47
6 14 9.96 8.10 8.84 8 7.04
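The import step for `four` is not shown, so the following is only a minimal sketch of one way to rebuild it. It assumes that `four` comes from R's built-in `anscombe` data frame, with the three identical columns x1, x2 and x3 collapsed into a single `x123` column. The first rows agree with the `head(four)` output above.

```r
# Assumption: `four` is derived from the built-in `anscombe` data frame.
# x1, x2 and x3 are identical there, so one copy is kept as x123.
stopifnot(identical(anscombe$x1, anscombe$x2),
          identical(anscombe$x2, anscombe$x3))

four <- data.frame(
  x123 = anscombe$x1,
  y1   = anscombe$y1,
  y2   = anscombe$y2,
  y3   = anscombe$y3,
  x4   = anscombe$x4,
  y4   = anscombe$y4
)

head(four)             # first six rows, for comparison with the output above
colSums(is.na(four))   # number of missing values per column
```

The `colSums(is.na(four))` call checks the claim that the data contain no missing values.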
In all four scatter plots, y is positively correlated with x. It is important to notice that the plots all look different, yet the four regressions have nearly the same R^2 and p-value. This will be demonstrated when linear regression is used.
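The positive correlation can also be checked numerically. The sketch below computes the Pearson correlation for each x/y pair. It assumes `four` matches the built-in `anscombe` data (rebuilt here so the block runs on its own); the helper names `pairs_xy` and `r` are illustrative only.

```r
# Assumption: `four` matches the built-in `anscombe` data.
four <- data.frame(
  x123 = anscombe$x1,
  y1 = anscombe$y1, y2 = anscombe$y2, y3 = anscombe$y3,
  x4 = anscombe$x4, y4 = anscombe$y4
)

# Each response paired with its predictor
pairs_xy <- list(y1 = c("x123", "y1"), y2 = c("x123", "y2"),
                 y3 = c("x123", "y3"), y4 = c("x4", "y4"))

# Pearson correlation for each pair
r <- sapply(pairs_xy, function(p) cor(four[[p[1]]], four[[p[2]]]))
round(r, 3)
```

All four correlations come out close to 0.82, even though the plots look very different.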
> scatterplot(y1~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, data=four)
> scatterplot(y2~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, data=four)
> scatterplot(y3~x123, regLine=FALSE, smooth=FALSE, boxplots=FALSE, data=four)
> scatterplot(y4~x4, regLine=FALSE, smooth=FALSE, boxplots=FALSE, data=four)
> Boxplot( ~ y1, data=four, id=list(method="y"))
> Boxplot( ~ y2, data=four, id=list(method="y"))
[1] "8"
> Boxplot( ~ y3, data=four, id=list(method="y"))
[1] "3"
> Boxplot( ~ y4, data=four, id=list(method="y"))
[1] "8"
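Comparing the mean with the median of each variable gives a rough indication of skew:
X123 is symmetric as mean = median (\(9=9\)); this alone does not establish that it is normally distributed.
Y1 is negatively skewed as mean < median (\(7.501<7.58\)).
Y2 is negatively skewed as \(7.501<8.14\).
Y3 is positively skewed as mean > median (\(7.5>7.11\)).
X4 is positively skewed as \(9>8\).
Y4 is positively skewed as \(7.501>7.04\).
> summary(four)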
x123 y1 y2 y3 x4
Min. : 4.0 Min. : 4.260 Min. :3.100 Min. : 5.39 Min. : 8
1st Qu.: 6.5 1st Qu.: 6.315 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 8
Median : 9.0 Median : 7.580 Median :8.140 Median : 7.11 Median : 8
Mean : 9.0 Mean : 7.501 Mean :7.501 Mean : 7.50 Mean : 9
3rd Qu.:11.5 3rd Qu.: 8.570 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8
Max. :14.0 Max. :10.840 Max. :9.260 Max. :12.74 Max. :19
y4
Min. : 5.250
1st Qu.: 6.170
Median : 7.040
Mean : 7.501
3rd Qu.: 8.190
Max. :12.500
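The mean-versus-median comparison can be computed for every column at once. This is a minimal sketch, assuming `four` matches the built-in `anscombe` data; `gap` is an illustrative name. The sign of mean minus median is only a rough heuristic for the direction of skew, not a formal test.

```r
# Assumption: `four` matches the built-in `anscombe` data.
four <- data.frame(
  x123 = anscombe$x1,
  y1 = anscombe$y1, y2 = anscombe$y2, y3 = anscombe$y3,
  x4 = anscombe$x4, y4 = anscombe$y4
)

# Positive gap: mean above median (suggests positive skew)
# Negative gap: mean below median (suggests negative skew)
gap <- sapply(four, function(col) mean(col) - median(col))

data.frame(
  mean   = round(sapply(four, mean), 3),
  median = sapply(four, median),
  mean_minus_median = round(gap, 3)
)
```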
Since the residual median (-0.04136) is not 0, the residuals are not perfectly symmetrical. However, the minimum and maximum residuals are similar in size (-1.92 and 1.84), which suggests that the model fits reasonably well.
> fourpt1.2 <- lm(y1~x123, data=four)
> summary(fourpt1.2)
Call:
lm(formula = y1 ~ x123, data = four)
Residuals:
Min 1Q Median 3Q Max
-1.92127 -0.45577 -0.04136 0.70941 1.83882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
x123 0.5001 0.1179 4.241 0.00217 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
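Since the residual median (0.1291) is not 0, the residuals are not perfectly symmetrical. The minimum and maximum residuals (-1.90 and 1.27) are roughly similar in size, so from this summary alone the model appears to fit somewhat, even though the scatter plot of y2 shows a curved pattern that a straight line cannot capture.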
> RegModel.3 <- lm(y2~x123, data=four)
> summary(RegModel.3)
Call:
lm(formula = y2 ~ x123, data = four)
Residuals:
Min 1Q Median 3Q Max
-1.9009 -0.7609 0.1291 0.9491 1.2691
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.001 1.125 2.667 0.02576 *
x123 0.500 0.118 4.239 0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
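One way to check whether a straight line is adequate for y2 is to compare it against a fit with a squared term. This is not part of the original analysis, only a hedged illustration. It assumes `four` matches the built-in `anscombe` data, and the names `linear` and `quadratic` are illustrative.

```r
# Assumption: `four` matches the built-in `anscombe` data.
four <- data.frame(
  x123 = anscombe$x1,
  y1 = anscombe$y1, y2 = anscombe$y2, y3 = anscombe$y3,
  x4 = anscombe$x4, y4 = anscombe$y4
)

# Straight-line fit, as in the report
linear <- lm(y2 ~ x123, data = four)

# Same fit with an added squared term to allow curvature
quadratic <- lm(y2 ~ x123 + I(x123^2), data = four)

summary(linear)$r.squared
summary(quadratic)$r.squared

# F-test comparing the two nested models
anova(linear, quadratic)
```

A much higher R-squared for the quadratic fit indicates that the straight line misses systematic curvature, which the residual quantiles alone do not reveal.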
Since the residual median (-0.2303) is not 0, the residuals are not perfectly symmetrical. The minimum and maximum residuals (-1.16 and 3.24) are not similar, so the model does not fit well; the large positive residual points to an outlier.
> fourpt2.4 <- lm(y3~x123, data=four)
> summary(fourpt2.4)
Call:
lm(formula = y3 ~ x123, data = four)
Residuals:
Min 1Q Median 3Q Max
-1.1586 -0.6146 -0.2303 0.1540 3.2411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0025 1.1245 2.670 0.02562 *
x123 0.4997 0.1179 4.239 0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
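The observation behind the large maximum residual can be located directly. The sketch below assumes `four` matches the built-in `anscombe` data; `fit3`, `res` and `worst` are illustrative names.

```r
# Assumption: `four` matches the built-in `anscombe` data.
four <- data.frame(
  x123 = anscombe$x1,
  y1 = anscombe$y1, y2 = anscombe$y2, y3 = anscombe$y3,
  x4 = anscombe$x4, y4 = anscombe$y4
)

fit3 <- lm(y3 ~ x123, data = four)
res  <- residuals(fit3)

# Row with the largest absolute residual
worst <- which.max(abs(res))

four[worst, c("x123", "y3")]  # the observation itself
res[worst]                    # its residual
```

The row found this way is the same row that the y3 boxplot flags as an outlier.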
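Since the residual median is 0, the residuals are centered and roughly symmetrical. Given that the minimum and maximum residuals (-1.75 and 1.84) are similar in size, the model appears to fit somewhat. Note, however, that x4 equals 8 for every observation except one (x4 = 19), so the slope is determined entirely by that single point.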
> fourpt3.5 <- lm(y4~x4, data=four)
> summary(fourpt3.5)
Call:
lm(formula = y4 ~ x4, data = four)
Residuals:
Min 1Q Median 3Q Max
-1.751 -0.831 0.000 0.809 1.839
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0017 1.1239 2.671 0.02559 *
x4 0.4999 0.1178 4.243 0.00216 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
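The four fits can be placed side by side to see how closely the summary statistics agree. The sketch below also reports the largest hat value (leverage) in each model as one possible diagnostic; that column is an addition for illustration and not part of the original analysis. It assumes `four` matches the built-in `anscombe` data, and `models` and `comparison` are illustrative names.

```r
# Assumption: `four` matches the built-in `anscombe` data.
four <- data.frame(
  x123 = anscombe$x1,
  y1 = anscombe$y1, y2 = anscombe$y2, y3 = anscombe$y3,
  x4 = anscombe$x4, y4 = anscombe$y4
)

# The same four regressions as in the report
models <- list(
  y1 = lm(y1 ~ x123, data = four),
  y2 = lm(y2 ~ x123, data = four),
  y3 = lm(y3 ~ x123, data = four),
  y4 = lm(y4 ~ x4,   data = four)
)

# One row per model: slope, R-squared, slope p-value, largest leverage
comparison <- t(sapply(models, function(m) {
  s <- summary(m)
  c(slope        = unname(coef(m)[2]),
    r_squared    = s$r.squared,
    p_value      = unname(s$coefficients[2, "Pr(>|t|)"]),
    max_leverage = max(hatvalues(m)))
}))

round(comparison, 4)
```

The slope, R-squared and p-value columns are nearly identical across the four rows. The leverage column separates the fourth model from the others, because its single point at x4 = 19 fixes the fitted line.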