Questions

  1. Anscombe's quartet is a set of four \(x, y\) data sets published by Francis Anscombe in his 1973 paper "Graphs in Statistical Analysis". For this first question, load the anscombe data that is part of the datasets package in R, and assign that data to a new object called data.
data=anscombe
View(data)
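View() opens the interactive data viewer, so it has no effect in a knitted document; head() prints the first rows inline instead:

head(data)   # first six rows of the anscombe data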
  2. Summarise the data by calculating the mean and variance for each column, and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package! A compact alternative follows the correlation tests below.)
summary(data)
##        x1             x2             x3             x4    
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19  
##        y1               y2              y3              y4        
##  Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median : 7.580   Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :12.500
var(data$x1)
## [1] 11
var(data$x2)
## [1] 11
var(data$x3)
## [1] 11
var(data$x4)
## [1] 11
var(data$y1)
## [1] 4.127269
var(data$y2)
## [1] 4.127629
var(data$y3)
## [1] 4.12262
var(data$y4)
## [1] 4.123249
library(fBasics)
## Loading required package: timeDate
## Loading required package: timeSeries
cor.test(data$x1,data$y1)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x1 and data$y1
## t = 4.2415, df = 9, p-value = 0.00217
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4243912 0.9506933
## sample estimates:
##       cor 
## 0.8164205
cor.test(data$x2,data$y2)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x2 and data$y2
## t = 4.2386, df = 9, p-value = 0.002179
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4239389 0.9506402
## sample estimates:
##       cor 
## 0.8162365
cor.test(data$x3,data$y3)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x3 and data$y3
## t = 4.2394, df = 9, p-value = 0.002176
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4240623 0.9506547
## sample estimates:
##       cor 
## 0.8162867
cor.test(data$x4,data$y4)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x4 and data$y4
## t = 4.243, df = 9, p-value = 0.002165
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4246394 0.9507224
## sample estimates:
##       cor 
## 0.8165214
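As the hint suggests, basicStats() from fBasics reports the mean, variance, and several other statistics for every column in a single call, and sapply() collects the four pairwise correlations compactly. A sketch, assuming fBasics is already installed:

basicStats(data)   # the Mean and Variance rows answer the first part of the question

# correlation for each (xi, yi) pair in one call
sapply(1:4, function(i) cor(data[[paste0("x", i)]], data[[paste0("y", i)]]))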
  3. Create scatter plots for each \(x, y\) pair of data.
plot(data$x1,data$y1)

plot(data$x2,data$y2)

plot(data$x3,data$y3)

plot(data$x4,data$y4)
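Since the four plots differ only in the column index, a loop avoids the repetition. A sketch using paste0() to build the column names:

for (i in 1:4) {
  plot(data[[paste0("x", i)]], data[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}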

  4. Now change the symbols on the scatter plots to solid circles and plot them together as a four-panel graphic.
par(mfrow=c(2,2))              # arrange the four plots in a 2 x 2 grid
plot(data$x1,data$y1,pch=19)   # pch=19 is a solid circle (pch=18 is a diamond)
plot(data$x2,data$y2,pch=19)
plot(data$x3,data$y3,pch=19)
plot(data$x4,data$y4,pch=19)

  5. Now fit a linear model to each data set using the lm() function.
lm1=lm(data$x1~data$y1)
summary(lm1)
## 
## Call:
## lm(formula = data$x1 ~ data$y1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6522 -1.5117 -0.2657  1.2341  3.8946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9975     2.4344  -0.410  0.69156   
## data$y1       1.3328     0.3142   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
lm2=lm(data$x2~data$y2)
summary(lm2)
## 
## Call:
## lm(formula = data$x2 ~ data$y2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8516 -1.4315 -0.3440  0.8467  4.2017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9948     2.4354  -0.408  0.69246   
## data$y2       1.3325     0.3144   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
lm3=lm(data$x3~data$y3)
summary(lm3)
## 
## Call:
## lm(formula = data$x3 ~ data$y3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9869 -1.3733 -0.0266  1.3200  3.2133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0003     2.4362  -0.411  0.69097   
## data$y3       1.3334     0.3145   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
lm4=lm(data$x4~data$y4)
summary(lm4)
## 
## Call:
## lm(formula = data$x4 ~ data$y4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7859 -1.4122 -0.1853  1.4551  3.3329 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0036     2.4349  -0.412  0.68985   
## data$y4       1.3337     0.3143   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
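The four fits follow the same pattern, so they can also be produced programmatically. A sketch that keeps the same x-on-y orientation used above, with reformulate() building each formula:

models = lapply(1:4, function(i) {
  lm(reformulate(paste0("y", i), response = paste0("x", i)), data = data)
})
lapply(models, summary)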
  6. Now combine the last two tasks. Create a four-panel scatter-plot matrix that has both the data points and the regression lines. (Hint: the model objects carry over between chunks!)
par(mfrow=c(2,2))
plot(data$x1,data$y1)
abline(lm1,col="purple")
plot(data$x2,data$y2)
abline(lm2,col="red")
plot(data$x3,data$y3)
abline(lm3,col="blue")
plot(data$x4,data$y4)
abline(lm4,col="orange")
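One caveat: each model above regresses x on y, so abline(lm1) draws the x-on-y coefficients as if they described y as a function of x, which is not the fitted relationship. A sketch of the correction, inverting the coefficients before overlaying the line on plot(x1, y1); the same inversion applies to lm2 through lm4:

b = coef(lm1)                       # b[1] = intercept, b[2] = slope of x1 ~ y1
abline(a = -b[1]/b[2], b = 1/b[2], col = "purple")   # same line in (x, y) coordinates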

  7. Now compare the model fits for each model object.
anova(lm1)
## Analysis of Variance Table
## 
## Response: data$x1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## data$y1    1  73.32  73.320   17.99 0.00217 **
## Residuals  9  36.68   4.076                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm2)
## Analysis of Variance Table
## 
## Response: data$x2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## data$y2    1 73.287  73.287  17.966 0.002179 **
## Residuals  9 36.713   4.079                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm3)
## Analysis of Variance Table
## 
## Response: data$x3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## data$y3    1 73.296  73.296  17.972 0.002176 **
## Residuals  9 36.704   4.078                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm4)
## Analysis of Variance Table
## 
## Response: data$x4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## data$y4    1 73.338  73.338  18.003 0.002165 **
## Residuals  9 36.662   4.074                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
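The four tables are easier to compare side by side. A sketch that gathers R-squared, the residual standard error, and the F statistic from each model object:

fits = list(lm1, lm2, lm3, lm4)
data.frame(
  model     = paste0("lm", 1:4),
  r.squared = sapply(fits, function(m) summary(m)$r.squared),
  sigma     = sapply(fits, function(m) summary(m)$sigma),
  F.value   = sapply(fits, function(m) summary(m)$fstatistic[["value"]])
)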

  8. In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

From the summary statistics, the four data sets look nearly interchangeable: the variances of x1 through x4 are identical, the means and variances of y1 through y4 agree to two or three decimal places, and all four correlations are about 0.816. The scatter plots tell a different story. The first data set shows the even scatter around a line that a linear model assumes, the second is clearly curved, the third is almost perfectly linear apart from a single outlier, and the fourth is driven entirely by one extreme point. The lesson of Anscombe’s Quartet is that identical summary statistics can conceal radically different data, so visualizing the data gives us much clearer information and a far better understanding than numerical summaries alone.