This is practice code for “Modeling Techniques in Predictive Analytics” by Thomas W. Miller. More information about the book is available here: http://acornpub.co.kr/book/predictive-analytics-modeling

Chapter 1. Analytics and Data Science

Anscombe's Quartet in R

  • Anscombe, F. J. (1973, February)
  • Graphs in Statistical Analysis
  • The American Statistician, 27(1), 17-21

1. Define the anscombe data frame

anscombe <- data.frame(
   x1 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
   x2 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
   x3 = c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5),
   x4 = c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8),
   y1 = c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68),
   y2 = c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74),
   y3 = c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73),
   y4 = c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)
)

2. Show results from four regression analyses

with(anscombe, print(summary(lm(y1 ~ x1))))
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
with(anscombe, print(summary(lm(y2 ~ x2))))
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
with(anscombe, print(summary(lm(y3 ~ x3))))
## 
## Call:
## lm(formula = y3 ~ x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
with(anscombe, print(summary(lm(y4 ~ x4))))
## 
## Call:
## lm(formula = y4 ~ x4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165

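All four models have nearly identical regression coefficients (intercept about 3.00, slope about 0.50). Their standard errors, R-squared values (about 0.667), and F-statistics also match almost exactly.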
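To make the comparison easier to read than four separate summaries, the key numbers can be collected into one small table. The following is a minimal sketch, not code from the book. It assumes the copy of the quartet that ships with base R (`datasets::anscombe`, which uses the same column names x1..x4 and y1..y4), and the names `quartet` and `fit_summary` are made up for this illustration.

```r
# Assumption: use the built-in copy of the quartet so the block runs on its own.
quartet <- datasets::anscombe

# Fit y_i ~ x_i for each of the four sets and keep three numbers per fit.
fit_summary <- sapply(1:4, function(i) {
  model_formula <- reformulate(paste0("x", i), response = paste0("y", i))
  fit <- lm(model_formula, data = quartet)
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
})
colnames(fit_summary) <- paste("Set", c("I", "II", "III", "IV"))

# One column per data set, one row per statistic.
print(round(fit_summary, 3))
```

Reading across each row shows how little the four fits differ, which is the point of the printed summaries above.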

3. Place four plots on one page using standard R graphics

  • Use the same horizontal and vertical axis limits in all four plots so that they can be compared directly
par(mfrow = c(2, 2), mar = c(3, 3, 3, 1))
with(anscombe, plot(x1, y1, xlim = c(2, 20), ylim = c(2, 14),
  pch = 19, col = "darkblue", cex = 2, las = 1))
title("Set I")
with(anscombe, plot(x2, y2, xlim = c(2, 20), ylim = c(2, 14),
  pch = 19, col = "darkblue", cex = 2, las = 1))
title("Set II")
with(anscombe, plot(x3, y3, xlim = c(2, 20), ylim = c(2, 14),
  pch = 19, col = "darkblue", cex = 2, las = 1))
title("Set III")
with(anscombe, plot(x4, y4, xlim = c(2, 20), ylim = c(2, 14),
  pch = 19, col = "darkblue", cex = 2, las = 1))
title("Set IV")  # label the fourth panel like the other three

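As the output in step 2 shows, these four data sets yield nearly the same fitted regression model, yet the plots reveal completely different patterns. Set I is roughly linear with scatter, Set II follows a curve, Set III is a tight line with a single outlier, and Set IV has every x equal to 8 except one point at x = 19 that determines the slope on its own.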
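Drawing the fitted line on each panel makes the mismatch between model and data visible at a glance. The sketch below is one possible way to do this, not the book's code. It assumes the built-in `datasets::anscombe` copy of the data, and the names `quartet`, `set_labels`, and `fits` are hypothetical. It reuses the axis limits and point style from step 3 and adds the line with `abline()`.

```r
# Assumption: use the built-in copy of the quartet so the block runs on its own.
quartet <- datasets::anscombe
set_labels <- c("I", "II", "III", "IV")
fits <- vector("list", 4)  # keep the fitted models for later inspection

# Remember the current graphics settings so they can be restored afterwards.
old_par <- par(mfrow = c(2, 2), mar = c(3, 3, 3, 1))
for (i in 1:4) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  fits[[i]] <- lm(y ~ x)
  plot(x, y, xlim = c(2, 20), ylim = c(2, 14),
       pch = 19, col = "darkblue", cex = 2, las = 1)
  abline(fits[[i]], col = "red", lwd = 2)  # least-squares line for this set
  title(paste("Set", set_labels[i]))
}
par(old_par)
```

The red line is practically the same in every panel, while only the points in Set I scatter around it in the way a linear model assumes.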