ANLY 512 - Problem Set 2

Anscombe’s quartet

Xizi Tong 2018-04-20

Questions

Anscombe's quartet is a set of 4 \(x,y\) data sets that were published by Francis Anscombe in a 1973 paper, "Graphs in Statistical Analysis". For this first question, load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.

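library(datasets)
data("anscombe")
View(anscombe)    # optional: opens the data viewer in an interactive session
data <- anscombe  # copy the built-in data set into a new object called data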

Summarise the data by calculating the mean and variance for each column, and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)

library('fBasics')

basicStats(data)

##                    x1        x2        x3        x4        y1        y2
## nobs        11.000000 11.000000 11.000000 11.000000 11.000000 11.000000
## NAs          0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## Minimum      4.000000  4.000000  4.000000  8.000000  4.260000  3.100000
## Maximum     14.000000 14.000000 14.000000 19.000000 10.840000  9.260000
## 1. Quartile  6.500000  6.500000  6.500000  8.000000  6.315000  6.695000
## 3. Quartile 11.500000 11.500000 11.500000  8.000000  8.570000  8.950000
## Mean         9.000000  9.000000  9.000000  9.000000  7.500909  7.500909
## Median       9.000000  9.000000  9.000000  8.000000  7.580000  8.140000
## Sum         99.000000 99.000000 99.000000 99.000000 82.510000 82.510000
## SE Mean      1.000000  1.000000  1.000000  1.000000  0.612541  0.612568
## LCL Mean     6.771861  6.771861  6.771861  6.771861  6.136083  6.136024
## UCL Mean    11.228139 11.228139 11.228139 11.228139  8.865735  8.865795
## Variance    11.000000 11.000000 11.000000 11.000000  4.127269  4.127629
## Stdev        3.316625  3.316625  3.316625  3.316625  2.031568  2.031657
## Skewness     0.000000  0.000000  0.000000  2.466911 -0.048374 -0.978693
## Kurtosis    -1.528926 -1.528926 -1.528926  4.520661 -1.199123 -0.514319
##                    y3        y4
## nobs        11.000000 11.000000
## NAs          0.000000  0.000000
## Minimum      5.390000  5.250000
## Maximum     12.740000 12.500000
## 1. Quartile  6.250000  6.170000
## 3. Quartile  7.980000  8.190000
## Mean         7.500000  7.500909
## Median       7.110000  7.040000
## Sum         82.500000 82.510000
## SE Mean      0.612196  0.612242
## LCL Mean     6.135943  6.136748
## UCL Mean     8.864057  8.865070
## Variance     4.122620  4.123249
## Stdev        2.030424  2.030579
## Skewness     1.380120  1.120774
## Kurtosis     1.240044  0.628751

correlationTest(data$x1,data$y1,method = 'pearson',title = 'x1 and y1 correlation')

## 
## Title:
##  x1 and y1 correlation
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8164
##   STATISTIC:
##     t: 4.2415
##   P VALUE:
##     Alternative Two-Sided: 0.00217 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001085 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4244, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5113, 1
## 
## Description:
##  Mon Apr 30 14:45:22 2018

correlationTest(data$x2,data$y2,method = 'pearson',title = 'x2 and y2 correlation')

## 
## Title:
##  x2 and y2 correlation
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8162
##   STATISTIC:
##     t: 4.2386
##   P VALUE:
##     Alternative Two-Sided: 0.002179 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001089 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4239, 0.9506
##          Less: -1, 0.9387
##       Greater: 0.5109, 1
## 
## Description:
##  Mon Apr 30 14:45:22 2018

correlationTest(data$x3,data$y3,method = 'pearson',title = 'x3 and y3 correlation')

## 
## Title:
##  x3 and y3 correlation
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8163
##   STATISTIC:
##     t: 4.2394
##   P VALUE:
##     Alternative Two-Sided: 0.002176 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001088 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4241, 0.9507
##          Less: -1, 0.9387
##       Greater: 0.511, 1
## 
## Description:
##  Mon Apr 30 14:45:22 2018

correlationTest(data$x4,data$y4,method = 'pearson',title = 'x4 and y4 correlation')

## 
## Title:
##  x4 and y4 correlation
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8165
##   STATISTIC:
##     t: 4.243
##   P VALUE:
##     Alternative Two-Sided: 0.002165 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001082 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4246, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5115, 1
## 
## Description:
##  Mon Apr 30 14:45:22 2018
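
As a cross-check on the fBasics output above, the same three quantities can be computed with base R alone. This is a minimal sketch, not part of the original answer; the names `means`, `variances`, and `pair_cor` are introduced here purely for illustration.

```r
data(anscombe)  # built-in data set from the datasets package

# Column-wise mean and variance for x1..x4 and y1..y4.
means <- sapply(anscombe, mean)
variances <- sapply(anscombe, var)

# Pearson correlation for each matched pair (x1, y1), ..., (x4, y4).
pair_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, "~y", 1:4)

print(round(rbind(mean = means, variance = variances), 3))
print(round(pair_cor, 4))
```

Looping over the index 1..4 avoids repeating the same call four times and keeps the pairing of each x with its own y explicit.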

Create scatter plots for each \(x, y\) pair of data.

library(ggplot2)

pl1=ggplot(data,aes(x=x1,y=y1))+geom_point(color='red')
pl1

pl2=ggplot(data,aes(x=x2,y=y2))+geom_point(color='red')
pl2

pl3=ggplot(data,aes(x=x3,y=y3))+geom_point(color='red')
pl3

pl4=ggplot(data,aes(x=x4,y=y4))+geom_point(color='red')
pl4

Now change the symbols on the scatter plots to solid circles and plot them together as a 4-panel graphic.

library('gridExtra')

grid.arrange(pl1,pl2,pl3,pl4, top="ScatterPlot")
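
An alternative to arranging four separate plots is to stack the data into long format and let ggplot2 draw the panels with facet_wrap(). The sketch below assumes ggplot2 is installed; the names `long`, `set`, and `p` are illustrative and do not appear in the original answer. It also sets the point shape explicitly, since shape 16 is a solid circle.

```r
library(ggplot2)
data(anscombe)

# Stack the four (x, y) pairs into one long data frame with a set label.
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = paste("Set", i),
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

# shape = 16 is a solid circle; facet_wrap() draws one panel per set.
p <- ggplot(long, aes(x = x, y = y)) +
  geom_point(shape = 16, color = "red") +
  facet_wrap(~ set, ncol = 2) +
  ggtitle("ScatterPlot")
print(p)
```

With a single faceted plot, all four panels share the same axis scales by default, which makes the data sets easier to compare by eye.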

Now fit a linear model to each data set using the lm() function.

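# plot() draws the scatter plot; abline() overlays the line fitted by lm().
# plot() returns NULL, so its result is not assigned to an object.
plot(data$x1,data$y1)

abline(lm(y1~x1,data=data))

plot(data$x2,data$y2)

abline(lm(y2~x2,data=data))

plot(data$x3,data$y3)

abline(lm(y3~x3,data=data))

plot(data$x4,data$y4)

abline(lm(y4~x4,data=data))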

Now combine the last two tasks. Create a four-panel scatter plot matrix that has both the data points and the regression lines. (Hint: the model objects will carry over between chunks!)

par(mfrow=c(2,2))  # 2 x 2 layout: the next four plots fill the panels row by row
plot(data$x1,data$y1)
abline(lm(y1~x1,data=data))

plot(data$x2,data$y2)
abline(lm(y2~x2,data=data))

plot(data$x3,data$y3)
abline(lm(y3~x3,data=data))

plot(data$x4,data$y4)
abline(lm(y4~x4,data=data))
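
The four-panel figure can also be built in a loop that stores the fitted models, so the same objects can be reused later. This is a sketch under the assumption that keeping the models in a list (here called `models`, a name not used in the original answer) is acceptable; reformulate() builds each formula y_i ~ x_i from the column names.

```r
data(anscombe)

# Fit y_i ~ x_i for i = 1..4 and keep the fitted model objects in a list.
models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

old_par <- par(mfrow = c(2, 2))  # 2 x 2 panel layout
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       pch = 19,                 # solid circles
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Data set", i))
  abline(models[[i]], col = "blue")  # fitted regression line
}
par(old_par)                     # restore the previous layout
```

Restoring the graphics parameters at the end keeps the 2 x 2 layout from leaking into later plots.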

Now compare the model fits for each model object.

summary(lm(y1~x1,data))$r.squared

## [1] 0.6665425

summary(lm(y2~x2,data))$r.squared

## [1] 0.666242

summary(lm(y3~x3,data))$r.squared

## [1] 0.666324

summary(lm(y4~x4,data))$r.squared

## [1] 0.6667073
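
R-squared is only one aspect of the fit. A side-by-side table of the coefficients and residual standard error makes the comparison more complete. The sketch below is illustrative only; the name `fits` and the column labels are chosen here and are not part of the original answer.

```r
data(anscombe)

# One row per model: intercept, slope, R-squared, residual standard error.
fits <- t(sapply(1:4, function(i) {
  m <- lm(reformulate(paste0("x", i), response = paste0("y", i)),
          data = anscombe)
  s <- summary(m)
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r.squared = s$r.squared,
    sigma     = s$sigma)
}))
rownames(fits) <- paste0("model", 1:4)

print(round(fits, 4))
```

All four rows should come out nearly identical, which is the point of the quartet: the numerical fit summaries cannot tell the data sets apart.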

In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

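The four x variables share the same mean (9) and variance (11), and the four y variables have nearly the same mean (about 7.50) and variance (about 4.12 to 4.13). Each pair has a correlation of about 0.816, and the four regressions give almost the same fitted line with an R-squared of about 0.67. Yet the scatter plots look completely different. Without visualizing the data, we would be fooled into thinking the four data sets are the same, when each one actually tells a very different story.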