RichyVarghese_ANLY512-90-O-2019-LateSpring

Questions

Anscombe's quartet is a set of four \(x,y\) data sets published by Francis Anscombe in the 1973 paper "Graphs in Statistical Analysis". For this first question, load the anscombe data, which is part of library(datasets) in R, and assign it to a new object called data.

data <- anscombe

Summarise the data by calculating the mean and variance for each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)

library(fBasics)

## Loading required package: timeDate

## Loading required package: timeSeries

basicStats(data)

##                    x1        x2        x3        x4        y1        y2
## nobs        11.000000 11.000000 11.000000 11.000000 11.000000 11.000000
## NAs          0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## Minimum      4.000000  4.000000  4.000000  8.000000  4.260000  3.100000
## Maximum     14.000000 14.000000 14.000000 19.000000 10.840000  9.260000
## 1. Quartile  6.500000  6.500000  6.500000  8.000000  6.315000  6.695000
## 3. Quartile 11.500000 11.500000 11.500000  8.000000  8.570000  8.950000
## Mean         9.000000  9.000000  9.000000  9.000000  7.500909  7.500909
## Median       9.000000  9.000000  9.000000  8.000000  7.580000  8.140000
## Sum         99.000000 99.000000 99.000000 99.000000 82.510000 82.510000
## SE Mean      1.000000  1.000000  1.000000  1.000000  0.612541  0.612568
## LCL Mean     6.771861  6.771861  6.771861  6.771861  6.136083  6.136024
## UCL Mean    11.228139 11.228139 11.228139 11.228139  8.865735  8.865795
## Variance    11.000000 11.000000 11.000000 11.000000  4.127269  4.127629
## Stdev        3.316625  3.316625  3.316625  3.316625  2.031568  2.031657
## Skewness     0.000000  0.000000  0.000000  2.466911 -0.048374 -0.978693
## Kurtosis    -1.528926 -1.528926 -1.528926  4.520661 -1.199123 -0.514319
##                    y3        y4
## nobs        11.000000 11.000000
## NAs          0.000000  0.000000
## Minimum      5.390000  5.250000
## Maximum     12.740000 12.500000
## 1. Quartile  6.250000  6.170000
## 3. Quartile  7.980000  8.190000
## Mean         7.500000  7.500909
## Median       7.110000  7.040000
## Sum         82.500000 82.510000
## SE Mean      0.612196  0.612242
## LCL Mean     6.135943  6.136748
## UCL Mean     8.864057  8.865070
## Variance     4.122620  4.123249
## Stdev        2.030424  2.030579
## Skewness     1.380120  1.120774
## Kurtosis     1.240044  0.628751

correlationTest(data$x1,data$y1)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8164
##   STATISTIC:
##     t: 4.2415
##   P VALUE:
##     Alternative Two-Sided: 0.00217 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001085 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4244, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5113, 1
## 
## Description:
##  Thu Apr 18 22:01:59 2019

correlationTest(data$x2,data$y2)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8162
##   STATISTIC:
##     t: 4.2386
##   P VALUE:
##     Alternative Two-Sided: 0.002179 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001089 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4239, 0.9506
##          Less: -1, 0.9387
##       Greater: 0.5109, 1
## 
## Description:
##  Thu Apr 18 22:01:59 2019

correlationTest(data$x3,data$y3)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8163
##   STATISTIC:
##     t: 4.2394
##   P VALUE:
##     Alternative Two-Sided: 0.002176 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001088 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4241, 0.9507
##          Less: -1, 0.9387
##       Greater: 0.511, 1
## 
## Description:
##  Thu Apr 18 22:01:59 2019

correlationTest(data$x4,data$y4)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8165
##   STATISTIC:
##     t: 4.243
##   P VALUE:
##     Alternative Two-Sided: 0.002165 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001082 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4246, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5115, 1
## 
## Description:
##  Thu Apr 18 22:01:59 2019
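The same three quantities can also be computed without fBasics. The following is a minimal sketch in base R, assuming only the built-in anscombe data; the names col_means, col_vars and pair_cor are illustrative and not part of the original assignment.

```r
# Assumes the built-in anscombe data set from the datasets package.
data <- datasets::anscombe

# Column-wise mean and variance (sapply applies the function to each column).
col_means <- sapply(data, mean)
col_vars  <- sapply(data, var)

# Pearson correlation for each matched pair (x1,y1), ..., (x4,y4).
pair_cor <- sapply(1:4, function(i) {
  cor(data[[paste0("x", i)]], data[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, ",y", 1:4)

print(round(rbind(mean = col_means, variance = col_vars), 3))
print(round(pair_cor, 4))
```

Collecting the results in named vectors makes it easy to see side by side that the four pairs share nearly the same mean, variance and correlation.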

Create scatter plots for each \(x, y\) pair of data.

plot(data$x1, data$y1, main = "Scatter plot - x1,y1")

plot(data$x2, data$y2, main = "Scatter plot - x2,y2")

plot(data$x3, data$y3, main = "Scatter plot - x3,y3")

plot(data$x4, data$y4, main = "Scatter plot - x4,y4")

Now change the symbols on the scatter plots to solid circles and plot them together as a four-panel graphic.

par(mfrow = c(2, 2))  # arrange the next four plots in a 2 x 2 grid
plot(data$x1, data$y1, main = "Scatter plot - x1,y1", pch = 19)
plot(data$x2, data$y2, main = "Scatter plot - x2,y2", pch = 19)
plot(data$x3, data$y3, main = "Scatter plot - x3,y3", pch = 19)
plot(data$x4, data$y4, main = "Scatter plot - x4,y4", pch = 19)
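The four near-identical plot() calls can also be written as a loop. This is only a sketch of one possible alternative, assuming the built-in anscombe data; the variables old_par, x_name and y_name are introduced here for illustration.

```r
data <- datasets::anscombe

# Switch to a 2 x 2 layout and keep the previous settings for later.
old_par <- par(mfrow = c(2, 2))

for (i in 1:4) {
  x_name <- paste0("x", i)
  y_name <- paste0("y", i)
  plot(data[[x_name]], data[[y_name]],
       main = paste0("Scatter plot - ", x_name, ",", y_name),
       xlab = x_name, ylab = y_name,
       pch = 19)  # pch = 19 draws solid circles
}

# Restore the layout that was active before the loop.
par(old_par)
```

Restoring the saved settings with par(old_par) keeps later plots from inheriting the 2 x 2 layout.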

Now fit a linear model to each data set using the lm() function.

fit1 <- lm(data$y1 ~ data$x1)
fit2 <- lm(data$y2 ~ data$x2)
fit3 <- lm(data$y3 ~ data$x3)
fit4 <- lm(data$y4 ~ data$x4)
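As a hedged alternative, the models can be fitted with the data argument of lm(), which keeps the term names short (x1 rather than data$x1). The sketch below assumes the built-in anscombe data; fits and coef_table are illustrative names, and the formulas are built with reformulate() from the stats package.

```r
data <- datasets::anscombe

# Fit y_i ~ x_i for i = 1..4, building each formula from the column names.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})
names(fits) <- paste0("fit", 1:4)

# One row per model: intercept and slope.
coef_table <- t(sapply(fits, coef))
colnames(coef_table) <- c("intercept", "slope")
print(round(coef_table, 3))
```

Putting the coefficients in one table shows directly how close the four fitted lines are to each other.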

Now combine the last two tasks. Create a four-panel scatter plot matrix that has both the data points and the regression lines. (Hint: the model objects carry over between chunks!)

par(mfrow = c(2, 2))  # 2 x 2 grid; each abline() adds the fitted line to the plot drawn just before it
plot(data$x1, data$y1, main = "Scatter plot - x1,y1", pch = 19)
abline(fit1)
plot(data$x2, data$y2, main = "Scatter plot - x2,y2", pch = 19)
abline(fit2)
plot(data$x3, data$y3, main = "Scatter plot - x3,y3", pch = 19)
abline(fit3)
plot(data$x4, data$y4, main = "Scatter plot - x4,y4", pch = 19)
abline(fit4)

Now compare the model fits for each model object.

anova(fit1)

## Analysis of Variance Table
## 
## Response: data$y1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## data$x1    1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit2)

## Analysis of Variance Table
## 
## Response: data$y2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## data$x2    1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit3)

## Analysis of Variance Table
## 
## Response: data$y3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## data$x3    1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(fit4)

## Analysis of Variance Table
## 
## Response: data$y4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## data$x4    1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
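The four ANOVA tables can be condensed into a single comparison. The following is a minimal sketch, assuming the built-in anscombe data and refitting the models inside the loop so that it stands alone; fit_stats is an illustrative name. It pulls R-squared, the residual standard error and the F statistic from summary() for each model.

```r
data <- datasets::anscombe

fit_stats <- t(sapply(1:4, function(i) {
  fit <- lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
  s <- summary(fit)
  c(r_squared = s$r.squared,                      # share of variance explained
    sigma     = s$sigma,                          # residual standard error
    f_value   = unname(s$fstatistic["value"]))    # same F as in anova()
}))
rownames(fit_stats) <- paste0("fit", 1:4)

print(round(fit_stats, 3))
```

Because the rows are almost identical, the table makes the point that these fit statistics cannot distinguish the four data sets.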

In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

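Anscombe’s quartet consists of four datasets, each an x,y pair with 11 data points, whose summary statistics are nearly identical: every x has mean 9 and variance 11, every y has a mean of about 7.50 and a variance of about 4.12 to 4.13, every pair has a correlation of about 0.816, and the four linear models produce almost the same ANOVA table. When plotted, however, the datasets look very different: x1,y1 has a loose linear relationship, x2,y2 has a non-linear relationship, x3,y3 has a tight linear relationship with one outlier, and x4,y4 shows no relationship apart from a single outlying point (x4 is 8 for every point but one). The lesson is that summary statistics alone can hide the structure of the data, so the data should be plotted before a numerical summary or model is trusted.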

Anscombe’s quartet

Richy Varghese

2019-04-18
