Objectives

The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html and upload it to the “Problem Set 2” assignment on Moodle.

Questions

  1. Anscombe’s quartet is a set of four \(x,y\) data sets published by Francis Anscombe in the 1973 paper “Graphs in Statistical Analysis”. For this first question, load the anscombe data that is part of library(datasets) in R and assign it to a new object called data.
data <- anscombe
  1. Summarise the data by calculating the mean and variance of each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)
library(fBasics)
## Warning: package 'fBasics' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.3
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.4.4
colMeans(data)
##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
colVars(data)
##        x1        x2        x3        x4        y1        y2        y3 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620 
##        y4 
##  4.123249
cor_x1_y1 <- cor(data$x1,data$y1)
cat("The correlation between x1 and y1 is",cor_x1_y1,"\n")
## The correlation between x1 and y1 is 0.8164205
cor_x2_y2 <- cor(data$x2,data$y2)
cat("The correlation between x2 and y2 is",cor_x2_y2,"\n")
## The correlation between x2 and y2 is 0.8162365
cor_x3_y3 <- cor(data$x3,data$y3)
cat("The correlation between x3 and y3 is",cor_x3_y3,"\n")
## The correlation between x3 and y3 is 0.8162867
cor_x4_y4 <- cor(data$x4,data$y4)
cat("The correlation between x4 and y4 is",cor_x4_y4,"\n")
## The correlation between x4 and y4 is 0.8165214
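The same summaries can also be gathered in one pass. The following is a minimal sketch using base R only; the names `pair_summary`, `mean_x`, `var_x`, `mean_y`, `var_y` and `cor_xy` are illustrative choices, not part of the assignment.

```r
# Sketch: mean, variance and correlation for each (x, y) pair in one table.
data <- datasets::anscombe

pair_summary <- t(sapply(1:4, function(i) {
  x <- data[[paste0("x", i)]]   # column x1, x2, x3 or x4
  y <- data[[paste0("y", i)]]   # matching column y1, y2, y3 or y4
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y))
}))
rownames(pair_summary) <- paste0("pair", 1:4)

# One row per pair, one column per statistic.
round(pair_summary, 3)
```

Putting the four pairs side by side makes it easier to see how closely the statistics agree.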
  1. Create scatter plots for each \(x, y\) pair of data.
plot(data$x1,data$y1,main="Scatter plot of y1 vs. x1")

plot(data$x2,data$y2,main="Scatter plot of y2 vs. x2")

plot(data$x3,data$y3,main="Scatter plot of y3 vs. x3")

plot(data$x4,data$y4,main="Scatter plot of y4 vs. x4")

  1. Now change the symbols on the scatter plots to solid circles and plot them together as a four-panel graphic.
par(mfrow=c(2,2))
plot(data$x1,data$y1,main="Scatter plot of y1 vs. x1",pch=19)
plot(data$x2,data$y2,main="Scatter plot of y2 vs. x2",pch=19)
plot(data$x3,data$y3,main="Scatter plot of y3 vs. x3",pch=19)
plot(data$x4,data$y4,main="Scatter plot of y4 vs. x4",pch=19)

  1. Now fit a linear model to each data set using the lm() function.
# Note: a formula reads response ~ predictor, so each model below regresses x on y.
model_x1y1 <- lm(data$x1~data$y1)
model_x2y2 <- lm(data$x2~data$y2)
model_x3y3 <- lm(data$x3~data$y3)
model_x4y4 <- lm(data$x4~data$y4)
  1. Now combine the last two tasks. Create a four-panel scatter plot matrix that has both the data points and the regression lines. (Hint: the model objects will carry over between chunks!)
par(mfrow=c(2,2))
# abline() must be called after plot(), and it needs a fit of y on x.
# The model_x*y* objects have x as the response, so refit y ~ x for the line.
for (i in 1:4) {
  x <- data[[paste0("x",i)]]
  y <- data[[paste0("y",i)]]
  plot(x,y,main=paste0("Scatter plot of y",i," vs. x",i),pch=19,xlab=paste0("x",i),ylab=paste0("y",i))
  abline(lm(y~x))
}

  1. Now compare the model fits for each model object.
summary(model_x1y1)

## 
## Call:
## lm(formula = data$x1 ~ data$y1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6522 -1.5117 -0.2657  1.2341  3.8946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.9975     2.4344  -0.410  0.69156
## data$y1       1.3328     0.3142   4.241  0.00217 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

summary(model_x2y2)

## 
## Call:
## lm(formula = data$x2 ~ data$y2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8516 -1.4315 -0.3440  0.8467  4.2017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.9948     2.4354  -0.408  0.69246
## data$y2       1.3325     0.3144   4.239  0.00218 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

summary(model_x3y3)

## 
## Call:
## lm(formula = data$x3 ~ data$y3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9869 -1.3733 -0.0266  1.3200  3.2133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -1.0003     2.4362  -0.411  0.69097
## data$y3       1.3334     0.3145   4.239  0.00218 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

summary(model_x4y4)

## 
## Call:
## lm(formula = data$x4 ~ data$y4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7859 -1.4122 -0.1853  1.4551  3.3329 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -1.0036     2.4349  -0.412  0.68985
## data$y4       1.3337     0.3143   4.243  0.00216 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165
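Reading four full summaries side by side is tedious, so the key fit statistics can be collected into one table. This is a rough sketch, not the required method: it refits the models in the same direction as the summaries above (x as the response), and the names `fits`, `comparison`, `intercept`, `slope`, `r_squared` and `sigma` are assumptions made for illustration.

```r
# Sketch: collect the headline numbers from each fit into one comparison table.
data <- datasets::anscombe

# Refit x_i ~ y_i for i = 1..4, matching the direction of the models above.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("y", i), response = paste0("x", i)), data = data)
})

comparison <- t(sapply(fits, function(fit) {
  s <- summary(fit)
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = s$r.squared,   # multiple R-squared
    sigma     = s$sigma)       # residual standard error
}))
rownames(comparison) <- paste0("model_x", 1:4, "y", 1:4)

round(comparison, 4)
```

Each row corresponds to one model object, which makes it clear that the four fits are numerically almost indistinguishable.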

  1. In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

As the outputs of question 2 show, the four pairs of data have nearly identical summary statistics: their means, variances, and correlations are all very close. However, the scatter plots show that the pairs follow completely different patterns. The fitted regression models are also almost the same for all four pairs, even though the points around each line are arranged very differently. Summary statistics alone can therefore be very misleading, and it is risky to describe a data set using only its statistical summary. In short, data visualization is essential for understanding the whole story of a data set: we need to combine the statistical summary with visualization in order to draw accurate conclusions.
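One further way to see the point is to plot the residuals of each fit against x. The sketch below is only an illustration under stated assumptions: it fits y on x (the direction used for the plotted regression lines), and the name `resids` is hypothetical.

```r
# Sketch: residuals-vs-x plots for the four pairs, fitting y on x.
data <- datasets::anscombe

par(mfrow = c(2, 2))
resids <- lapply(1:4, function(i) {
  x <- data[[paste0("x", i)]]
  y <- data[[paste0("y", i)]]
  r <- resid(lm(y ~ x))                 # residuals of the straight-line fit
  plot(x, r, pch = 19,
       main = paste("Residuals: pair", i),
       xlab = paste0("x", i), ylab = "residual")
  abline(h = 0, lty = 2)                # reference line at zero
  r
})
```

Although the fit statistics are nearly equal, the residual plots differ visibly between pairs, which is the kind of structure a numeric summary does not reveal.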