ANLY 512 - Problem Set 2

Questions

Anscombe's quartet is a set of four \(x, y\) data sets published by Francis Anscombe in his 1973 paper "Graphs in Statistical Analysis". For this first question, load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.

library(datasets)
summary(anscombe)

##        x1             x2             x3             x4    
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19  
##        y1               y2              y3              y4        
##  Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median : 7.580   Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :12.500

data <- anscombe
x1 <- data[,1]
x2 <- data[,2]
x3 <- data[,3]
x4 <- data[,4]
y1 <- data[,5]
y2 <- data[,6]
y3 <- data[,7]
y4 <- data[,8]

Summarise the data by calculating the mean and variance for each column, and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)

mean(x1)

## [1] 9

mean(x2)

## [1] 9

mean(x3)

## [1] 9

mean(x4)

## [1] 9

var(x1)

## [1] 11

var(x2)

## [1] 11

var(x3)

## [1] 11

var(x4)

## [1] 11
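The calls above cover only the x columns, while the question also asks for the y columns and for the correlation within each pair. A minimal sketch of one way to get all of these in one pass is shown below. It uses base R rather than fBasics so that it has no extra dependency, and the names col_means, col_vars and pair_cor are introduced here purely for illustration.

```r
library(datasets)

# Mean and variance of every column (x1..x4, y1..y4) in one pass
col_means <- sapply(anscombe, mean)
col_vars  <- sapply(anscombe, var)

# Correlation within each (x_i, y_i) pair
pair_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, ".y", 1:4)

round(col_means, 2)
round(col_vars, 2)
round(pair_cor, 3)
```

Because sapply() walks over the columns of the data frame, the same two lines handle both the x and the y variables, which makes it easy to see how closely the four data sets agree.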

Create scatter plots for each \(x, y\) pair of data.

plot(x1,y1,main = 'Scatter plot for x1 and y1')

plot(x2,y2,main = 'Scatter plot for x2 and y2')

plot(x3,y3,main = 'Scatter plot for x3 and y3')

plot(x4,y4,main = 'Scatter plot for x4 and y4')

Now change the symbols on the scatter plots to solid circles and plot them together as a four-panel graphic.

par(mfrow=c(2,2))
plot(x1,y1,pch = 19,main = 'Scatter plot for x1 and y1',col = 'green')
plot(x2,y2,pch = 19,main = 'Scatter plot for x2 and y2',col = 'blue')
plot(x3,y3,pch = 19,main = 'Scatter plot for x3 and y3',col = 'purple')
plot(x4,y4,pch = 19,main = 'Scatter plot for x4 and y4',col = 'red')

Now fit a linear model to each data set using the lm() function.

M1 <- lm(y1~x1)
M1

## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Coefficients:
## (Intercept)           x1  
##      3.0001       0.5001

M2 <- lm(y2~x2)
M2

## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Coefficients:
## (Intercept)           x2  
##       3.001        0.500

M3 <- lm(y3~x3)
M3

## 
## Call:
## lm(formula = y3 ~ x3)
## 
## Coefficients:
## (Intercept)           x3  
##      3.0025       0.4997

M4 <- lm(y4~x4)
M4

## 
## Call:
## lm(formula = y4 ~ x4)
## 
## Coefficients:
## (Intercept)           x4  
##      3.0017       0.4999
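To compare the four fits side by side, the coefficients can be gathered into a single table. The sketch below refits the models inside a list so that it runs on its own. The names models and coef_table are illustrative, and reformulate() is used only as a convenient way to build each formula from the column names.

```r
library(datasets)

# Fit y_i ~ x_i for i = 1..4 and keep the models in a named list
models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
names(models) <- paste0("M", 1:4)

# One row per model: intercept and slope
coef_table <- t(sapply(models, coef))
colnames(coef_table) <- c("intercept", "slope")
round(coef_table, 4)
```

Keeping the models in a list rather than in four separate objects is a design choice that makes later steps (plotting, summarising) a matter of looping over one object.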

Now combine the last two tasks. Create a four-panel scatter plot matrix that has both the data points and the regression lines. (Hint: the model objects carry over between chunks!)

par(mfrow=c(2,2))
plot(x1,y1,pch = 19,main = 'Scatter plot for x1 and y1 in Anscombe dataset',col='green')
abline(M1,col = 'red')
plot(x2,y2,pch = 19,main = 'Scatter plot for x2 and y2 in Anscombe dataset',col = 'purple')
abline(M2, col = 'red')
plot(x3,y3,pch = 19,main = 'Scatter plot for x3 and y3 in Anscombe dataset',col = 'blue')
abline(M3, col = 'red')
plot(x4,y4,pch = 19,main = 'Scatter plot for x4 and y4 in Anscombe dataset',col='brown')
abline(M4, col = 'red')
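The same four-panel figure can also be produced with a loop, which avoids repeating the plotting call four times. This is a sketch under a few assumptions: the shared axis limits (xlim, ylim) are chosen here by hand to cover all four data sets so that the panels are directly comparable, and the list fits is a name introduced only for this example.

```r
library(datasets)

fits <- vector("list", 4)      # store the fitted models as we go
op <- par(mfrow = c(2, 2))     # 2 x 2 layout; remember the old settings

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fits[[i]] <- lm(y ~ x)

  # Same axis limits in every panel so the shapes can be compared directly
  plot(x, y, pch = 19, xlim = c(3, 20), ylim = c(2, 14),
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste0("Anscombe data set ", i))
  abline(fits[[i]], col = "red")
}

par(op)                        # restore the previous graphics settings
```

With identical axes and an identical red line in every panel, the differences between the point clouds stand out immediately.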

Now compare the model fits for each model object.

anova(M1)

## Analysis of Variance Table
##
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

anova(M2)

## Analysis of Variance Table
##
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

anova(M3)

## Analysis of Variance Table
##
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

anova(M4)

## Analysis of Variance Table
##
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
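The four ANOVA tables can be condensed into one comparison by pulling a few fit statistics out of summary(). The sketch below refits the models so that it is self-contained. The name fit_stats and the choice of statistics (R-squared, residual standard error, F value) are assumptions made for illustration, not the only reasonable way to compare the fits.

```r
library(datasets)

# One row per model with a few headline fit statistics
fit_stats <- t(sapply(1:4, function(i) {
  fit <- lm(reformulate(paste0("x", i), response = paste0("y", i)),
            data = anscombe)
  s <- summary(fit)
  c(r_squared = s$r.squared,
    resid_se  = s$sigma,
    f_value   = unname(s$fstatistic["value"]))
}))
rownames(fit_stats) <- paste0("M", 1:4)

round(fit_stats, 3)
```

Placing the statistics in adjacent rows makes the point of the exercise concrete: by these numerical measures the four models are practically indistinguishable.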

In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

Anscombe’s Quartet consists of four distinct datasets of x and y values. The linear model fits and the analysis of variance show that the intercepts, slopes, sums of squares, and F statistics are nearly identical for all four datasets. The four-panel scatter plots with regression lines, on the other hand, show that the datasets are entirely different. This illustrates the importance of data visualization for drawing accurate conclusions about a dataset.

Anscombe’s quartet

Diksha Gupta

2017-06-04