Objectives

The objectives of this problem set is to orient you to a number of activities in R. And to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question create a code chunk or text response that completes/answers the activity or question requested. Finally, upon completion name your final output .html file as: YourName_ANLY512-Section-Year-Semester.html and upload it to the “Problem Set 2” assignmenet on Moodle.

Questions

  1. Anscombes quartet is a set of 4 \(x,y\) data sets that were published by Francis Anscombe in a 1973 paper Graphs in statistical analysis. For this first question load the anscombe data that is part of the library(datasets) in R. And assign that data to a new object called data.
data <-anscombe
data
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
  1. Summarise the data by calculating the mean, variance, for each column and the correlation between each pair (eg. x1 and y1, x2 and y2, etc) (Hint: use the fBasics() package!)
mean(data$x1)
## [1] 9
var(data$x1)
## [1] 11
mean(data$x2)
## [1] 9
var(data$x2)
## [1] 11
mean(data$x3)
## [1] 9
var(data$x3)
## [1] 11
mean(data$x4)
## [1] 9
var(data$x4)
## [1] 11
mean(data$y1)
## [1] 7.500909
var(data$y1)
## [1] 4.127269
mean(data$y2)
## [1] 7.500909
var(data$y2)
## [1] 4.127629
mean(data$y3)
## [1] 7.5
var(data$y3)
## [1] 4.12262
mean(data$y4)
## [1] 7.500909
var(data$y4)
## [1] 4.123249
library("fBasics")
## Warning: package 'fBasics' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.4
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.4.4
correlationTest(data$x1, data$y1)
## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8164
##   STATISTIC:
##     t: 4.2415
##   P VALUE:
##     Alternative Two-Sided: 0.00217 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001085 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4244, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5113, 1
## 
## Description:
##  Wed May 02 20:14:01 2018
correlationTest(data$x2, data$y2)
## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8162
##   STATISTIC:
##     t: 4.2386
##   P VALUE:
##     Alternative Two-Sided: 0.002179 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001089 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4239, 0.9506
##          Less: -1, 0.9387
##       Greater: 0.5109, 1
## 
## Description:
##  Wed May 02 20:14:01 2018
correlationTest(data$x3, data$y3)
## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8163
##   STATISTIC:
##     t: 4.2394
##   P VALUE:
##     Alternative Two-Sided: 0.002176 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001088 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4241, 0.9507
##          Less: -1, 0.9387
##       Greater: 0.511, 1
## 
## Description:
##  Wed May 02 20:14:01 2018
correlationTest(data$x4, data$y4)
## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8165
##   STATISTIC:
##     t: 4.243
##   P VALUE:
##     Alternative Two-Sided: 0.002165 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001082 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4246, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5115, 1
## 
## Description:
##  Wed May 02 20:14:01 2018
  1. Create scatter plots for each \(x, y\) pair of data.
plot(data$x1, data$y1, main = "Scater Plot - x1,y1")

plot(data$x2, data$y2, main = "Scater Plot - x2,y2")

plot(data$x3, data$y3, main = "Scater Plot - x3,y3")

plot(data$x4, data$y4, main = "Scater Plot - x4,y4")

  1. Now change the symbols on the scatter plots to solid circles and plot them together as a 4 panel graphic
par(mfrow= c(2,2))
plot(data$x1, data$y1, main = "Scater Plot - x1,y1", pch = 20)
plot(data$x2, data$y2, main = "Scater Plot - x2,y2", pch = 20)
plot(data$x3, data$y3, main = "Scater Plot - x3,y3", pch = 20)
plot(data$x4, data$y4, main = "Scater Plot - x4,y4", pch = 20)

  1. Now fit a linear model to each data set using the lm() function.
lm1 <- lm(data$y1 ~ data$x1)
summary(lm1)
## 
## Call:
## lm(formula = data$y1 ~ data$x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## data$x1       0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
lm2 <- lm(data$y2 ~ data$x2)
summary(lm2)
## 
## Call:
## lm(formula = data$y2 ~ data$x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## data$x2        0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
lm3 <- lm(data$y3 ~ data$x3)
summary(lm3)
## 
## Call:
## lm(formula = data$y3 ~ data$x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## data$x3       0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
lm4 <- lm(data$y4 ~ data$x4)
summary(lm4)
## 
## Call:
## lm(formula = data$y4 ~ data$x4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## data$x4       0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
  1. Now combine the last two tasks. Create a four panel scatter plot matrix that has both the data points and the regression lines. (hint: the model objects will carry over chunks!)
par(mfrow= c(2,2))

plot(data$x1, data$y1, main = "Scater Plot - x1,y1", pch = 20)
abline(lm1, col = "red")

plot(data$x2, data$y2, main = "Scater Plot - x2,y2", pch = 20)
abline(lm2, col = "red")

plot(data$x3, data$y3, main = "Scater Plot - x3,y3", pch = 20)
abline(lm3, col = "red")

plot(data$x4, data$y4, main = "Scater Plot - x4,y4", pch = 20)
abline(lm4, col = "red")

  1. Now compare the model fits for each model object.
anova(lm1)

Analysis of Variance Table

Response: data\(y1 Df Sum Sq Mean Sq F value Pr(>F) data\)x1 1 27.510 27.5100 17.99 0.00217 ** Residuals 9 13.763 1.5292
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1

anova(lm2)

Analysis of Variance Table

Response: data\(y2 Df Sum Sq Mean Sq F value Pr(>F) data\)x2 1 27.500 27.5000 17.966 0.002179 ** Residuals 9 13.776 1.5307
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1

anova(lm3)

Analysis of Variance Table

Response: data\(y3 Df Sum Sq Mean Sq F value Pr(>F) data\)x3 1 27.470 27.4700 17.972 0.002176 ** Residuals 9 13.756 1.5285
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1

anova(lm4)

Analysis of Variance Table

Response: data\(y4 Df Sum Sq Mean Sq F value Pr(>F) data\)x4 1 27.490 27.4900 18.003 0.002165 ** Residuals 9 13.742 1.5269
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ‘’ 1

  1. In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

Anscombe’s Quartet contains four datasets of same kind of data.Dataset has x and y columns with eleven pair of values in it. we find that summary statistics are mostly similar for each column values and corelations. The dataset explains the story of data and why summary statistics are not enough to analyze data.The datasets looks identical in most ways. But plotting on x, y plane it shows that datasets are not similar.Each dataset has a different story. For example, dataset 1 follows a linear relationship vaguely. dataset 2 doesn’t follow linear relationship, the y has smooth curve reltionship with x.dataset 3 has a tight linear relationship between x and y, except for large outlier.dataset 4 has x constant but one outlier. summary statistics may fail to tell a story of data but visualizations are helpful to give a clear picture on what is going on with data.