ANLY 512 - Problem Set 2

Questions

Anscombe's quartet is a set of four $x, y$ data sets published by Francis Anscombe in the 1973 paper "Graphs in Statistical Analysis". For this first question, load the anscombe data that ships with library(datasets) in R and assign it to a new object called data.

data <- anscombe

Summarise the data by calculating the mean and variance of each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)

library(fBasics)

## Warning: package 'fBasics' was built under R version 3.4.4

## Loading required package: timeDate

## Warning: package 'timeDate' was built under R version 3.4.3

## Loading required package: timeSeries

## Warning: package 'timeSeries' was built under R version 3.4.4

colMeans(data)

##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909

colVars(data)

##        x1        x2        x3        x4        y1        y2        y3 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620 
##        y4 
##  4.123249
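
The same column summaries can also be computed with base R alone, without the fBasics dependency. The following is a minimal sketch; the object name `col_summary` is an illustrative choice and not part of the assignment.

```r
# Base R alternative to colMeans()/colVars(); `col_summary` is a hypothetical name.
data <- datasets::anscombe

col_summary <- data.frame(
  mean     = sapply(data, mean),  # column means
  variance = sapply(data, var)    # sample variances (n - 1 denominator)
)
print(round(col_summary, 4))
```

Each row of `col_summary` corresponds to one column of the anscombe data, so the means and variances can be compared side by side.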

cor_x1_y1 <- cor(data$x1,data$y1)
cat("The correlation between x1 and y1 is",cor_x1_y1,"\n")

## The correlation between x1 and y1 is 0.8164205

cor_x2_y2 <- cor(data$x2,data$y2)
cat("The correlation between x2 and y2 is",cor_x2_y2,"\n")

## The correlation between x2 and y2 is 0.8162365

cor_x3_y3 <- cor(data$x3,data$y3)
cat("The correlation between x3 and y3 is",cor_x3_y3,"\n")

## The correlation between x3 and y3 is 0.8162867

cor_x4_y4 <- cor(data$x4,data$y4)
cat("The correlation between x4 and y4 is",cor_x4_y4,"\n")

## The correlation between x4 and y4 is 0.8165214
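
The four correlations above repeat the same call with different column names, so they could also be computed in one pass. This is only a sketch; the vector name `pair_cor` is an assumption made for illustration.

```r
data <- datasets::anscombe

# Correlation of each x_i with its matching y_i, for i = 1..4.
# `pair_cor` is a hypothetical name, not used elsewhere in this problem set.
pair_cor <- sapply(1:4, function(i) {
  cor(data[[paste0("x", i)]], data[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, "_y", 1:4)
print(round(pair_cor, 4))
```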

Create scatter plots for each $x, y$ pair of data.

plot(data$x1,data$y1,main="Scatter plot of x1 and y1")

plot(data$x2,data$y2,main="Scatter plot of x2 and y2")

plot(data$x3,data$y3,main="Scatter plot of x3 and y3")

plot(data$x4,data$y4,main="Scatter plot of x4 and y4")

Now change the symbols on the scatter plots to solid circles and plot them together as a four-panel graphic.

par(mfrow=c(2,2))
plot(data$x1,data$y1,main="Scatter plot of x1 and y1",pch=19)
plot(data$x2,data$y2,main="Scatter plot of x2 and y2",pch=19)
plot(data$x3,data$y3,main="Scatter plot of x3 and y3",pch=19)
plot(data$x4,data$y4,main="Scatter plot of x4 and y4",pch=19)
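
The four-panel layout can also be drawn in a loop, which additionally restores the previous graphics settings once the panels are done. This is a sketch under the assumption that the axis labels should simply be the column names; `old_par`, `x_name` and `y_name` are illustrative names.

```r
data <- datasets::anscombe

old_par <- par(mfrow = c(2, 2))   # save the current layout, switch to a 2 x 2 grid
for (i in 1:4) {
  x_name <- paste0("x", i)
  y_name <- paste0("y", i)
  plot(data[[x_name]], data[[y_name]],
       pch = 19,                   # solid circles
       xlab = x_name, ylab = y_name,
       main = paste("Scatter plot of", x_name, "and", y_name))
}
par(old_par)                       # restore the previous layout
```

Restoring `par()` keeps later plots in the same session from being squeezed into the 2 x 2 grid.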

Now fit a linear model to each data set using the lm() function.

# A formula reads response ~ predictor, so each model below fits x as a function of y.
model_x1y1 <- lm(data$x1~data$y1)
model_x2y2 <- lm(data$x2~data$y2)
model_x3y3 <- lm(data$x3~data$y3)
model_x4y4 <- lm(data$x4~data$y4)
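
The four fits can also be kept together in a named list, built with the `data =` argument of lm() rather than `data$` columns. This is a sketch only: it keeps the same direction as the models above (x as response, y as predictor), and the list name `models` is an assumption.

```r
data <- datasets::anscombe

# Same direction as the models above: x_i is the response, y_i the predictor.
# `models` is a hypothetical name for the list of fits.
models <- lapply(1:4, function(i) {
  f <- reformulate(paste0("y", i), response = paste0("x", i))  # e.g. x1 ~ y1
  lm(f, data = data)
})
names(models) <- paste0("model_x", 1:4, "y", 1:4)

# Inspect one fit by name.
print(coef(models[["model_x1y1"]]))
```

Keeping the models in a list makes it easy to apply the same function, such as summary(), to all four at once with lapply().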

Now combine the last two tasks. Create a four panel scatter plot matrix that has both the data points and the regression lines. (hint: the model objects will carry over chunks!)

par(mfrow=c(2,2))
# abline() is called after plot(); it needs a y ~ x fit (vertical ~ horizontal), while model_x*y* fit x ~ y.
plot(data$x1,data$y1,main="Scatter plot of x1 and y1",pch=19)
abline(lm(y1~x1,data=data))
plot(data$x2,data$y2,main="Scatter plot of x2 and y2",pch=19)
abline(lm(y2~x2,data=data))
plot(data$x3,data$y3,main="Scatter plot of x3 and y3",pch=19)
abline(lm(y3~x3,data=data))
plot(data$x4,data$y4,main="Scatter plot of x4 and y4",pch=19)
abline(lm(y4~x4,data=data))

Now compare the model fits for each model object.

summary(model_x1y1)

## Call:
## lm(formula = data$x1 ~ data$y1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6522 -1.5117 -0.2657  1.2341  3.8946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9975     2.4344  -0.410  0.69156   
## data$y1       1.3328     0.3142   4.241  0.00217 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

summary(model_x2y2)

## Call:
## lm(formula = data$x2 ~ data$y2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8516 -1.4315 -0.3440  0.8467  4.2017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9948     2.4354  -0.408  0.69246   
## data$y2       1.3325     0.3144   4.239  0.00218 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

summary(model_x3y3)

## Call:
## lm(formula = data$x3 ~ data$y3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9869 -1.3733 -0.0266  1.3200  3.2133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0003     2.4362  -0.411  0.69097   
## data$y3       1.3334     0.3145   4.239  0.00218 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

summary(model_x4y4)

## Call:
## lm(formula = data$x4 ~ data$y4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7859 -1.4122 -0.1853  1.4551  3.3329 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0036     2.4349  -0.412  0.68985   
## data$y4       1.3337     0.3143   4.243  0.00216 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165
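
Reading four full summaries side by side is tedious, so the key quantities can be collected into one small table. The sketch below refits the same x-on-y models so that it is self-contained; the matrix name `fit_stats` and its column names are illustrative assumptions.

```r
data <- datasets::anscombe

# One row per model: coefficients, R-squared and residual standard error.
# `fit_stats` is a hypothetical name for the comparison table.
fit_stats <- t(sapply(1:4, function(i) {
  fit <- lm(data[[paste0("x", i)]] ~ data[[paste0("y", i)]])
  s <- summary(fit)
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = s$r.squared,
    sigma     = s$sigma)
}))
rownames(fit_stats) <- paste0("model_x", 1:4, "y", 1:4)
print(round(fit_stats, 4))
```

Laid out this way, the near-identical rows show how little the fitted numbers distinguish the four data sets.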

In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

As the outputs of question 2 show, the four pairs of data have nearly identical summary statistics: their means, variances, and correlations are all very close. However, the scatter plots reveal that the pairs have completely different distributions. Summary statistics alone can therefore be very misleading, and it is dangerous to summarize a data set without visualizing it. The linear fits make the same point: the fitted coefficients and R-squared values are almost the same for all four models, yet the points do not follow the same pattern around the regression line. In short, data visualization is essential for understanding the whole story of a data set. Rather than relying on summary statistics alone, we need to combine them with visualization in order to draw accurate conclusions.

Anscombe’s quartet

Shibo Feng

2018-06-19
