Objectives

The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Finally, upon completion, upload your document to rpubs.com and share the link in the “Problem Set 2” assignment on Moodle.

Questions

  1. Anscombe’s quartet is a set of four \(x, y\) data sets published by Francis Anscombe in his 1973 paper “Graphs in Statistical Analysis”. For this first question, load the anscombe data that is part of library(datasets) in R and assign it to a new object called data.
data<-anscombe
  1. Summarise the data by calculating the mean and variance for each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)
library(fBasics)
colStats(anscombe,FUN = mean)
##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
colStats(anscombe,FUN = var)
##        x1        x2        x3        x4        y1        y2        y3 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620 
##        y4 
##  4.123249
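The same column summaries can also be obtained without an extra package. The following is a minimal base R sketch, not part of the original answer; the name `col_summary` is only illustrative.

```r
# Mean and variance of every column of the built-in anscombe data frame.
# sapply() applies the function column by column and returns a 2 x 8 matrix.
col_summary <- sapply(anscombe, function(col) {
  c(mean = mean(col), variance = var(col))
})

# Round for display; rows are "mean" and "variance", columns are x1..y4.
round(col_summary, 4)
```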
cor(anscombe[,1:4],anscombe[,5:8])
##            y1         y2         y3         y4
## x1  0.8164205  0.8162365  0.8162867 -0.3140467
## x2  0.8164205  0.8162365  0.8162867 -0.3140467
## x3  0.8164205  0.8162365  0.8162867 -0.3140467
## x4 -0.5290927 -0.7184365 -0.3446610  0.8165214
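The matrix above contains all 16 cross-correlations, while the question only asks for the four matched pairs, which sit on its diagonal. As a hedged sketch (the name `pair_cor` is an assumption made for illustration), the matched pairs can be computed directly:

```r
# Correlate x1 with y1, x2 with y2, and so on, one pair at a time.
# mapply() walks the x columns and the y columns in parallel.
pair_cor <- mapply(cor, anscombe[, 1:4], anscombe[, 5:8])

# Equivalent: take the diagonal of the full cross-correlation matrix.
pair_cor_diag <- diag(cor(anscombe[, 1:4], anscombe[, 5:8]))

round(pair_cor, 4)
```

All four matched pairs have a correlation of roughly 0.816, which matches the diagonal of the output above.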
  1. Create scatter plots for each \(x, y\) pair of data.
attach(anscombe)

plot(x1,y1)

plot(x2,y2)

plot(x3,y3)

plot(x4,y4)

  1. Now change the symbols on the scatter plots to solid circles and plot them together as a four-panel graphic.
par(mfrow=c(2,2))

plot(x1,y1,pch=19)
plot(x2,y2,pch=19)
plot(x3,y3,pch=19)
plot(x4,y4,pch=19)

par(mfrow=c(1,1))
  1. Now fit a linear model to each data set using the lm() function.
fit1 <- lm(y1~x1)
fit2 <- lm(y2~x2)
fit3 <- lm(y3~x3)
fit4 <- lm(y4~x4)
  1. Now combine the last two tasks. Create a four-panel scatter plot matrix that has both the data points and the regression lines. (Hint: the model objects carry over between chunks!)
par(mfrow=c(2,2))

plot(x1,y1,pch=19)
abline(fit1)

plot(x2,y2,pch=19)
abline(fit2)

plot(x3,y3,pch=19)
abline(fit3)

plot(x4,y4,pch=19)
abline(fit4)
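The four panels can also be drawn in a loop that avoids attach() and restores the graphics settings afterwards. This is only a sketch of one possible approach; the helper names `x_name`, `y_name`, `model` and `old_par` are illustrative and not part of the original answer.

```r
# Remember the current layout, then switch to a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))

for (i in 1:4) {
  x_name <- paste0("x", i)
  y_name <- paste0("y", i)

  # Build the formula y_i ~ x_i and fit it against the data frame directly.
  model <- lm(reformulate(x_name, response = y_name), data = anscombe)

  # Scatter plot with solid circles, then overlay the fitted line.
  plot(anscombe[[x_name]], anscombe[[y_name]], pch = 19,
       xlab = x_name, ylab = y_name, main = paste("Dataset", i))
  abline(model)
}

# Restore the layout that was active before the loop.
par(old_par)
```

Passing `data = anscombe` keeps the column lookup explicit, so the code does not depend on what happens to be attached to the search path.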

  1. Now compare the model fits for each model object.
summary(fit1)

Call:
lm(formula = y1 ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92127 -0.45577 -0.04136  0.70941  1.83882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0001     1.1247   2.667  0.02573 * 
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

summary(fit2)

Call:
lm(formula = y2 ~ x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9009 -0.7609  0.1291  0.9491  1.2691 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    3.001      1.125   2.667  0.02576 * 
x2             0.500      0.118   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

summary(fit3)

Call:
lm(formula = y3 ~ x3)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1586 -0.6146 -0.2303  0.1540  3.2411 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0025     1.1245   2.670  0.02562 * 
x3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

summary(fit4)

Call:
lm(formula = y4 ~ x4)

Residuals:
   Min     1Q Median     3Q    Max 
-1.751 -0.831  0.000  0.809  1.839 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   3.0017     1.1239   2.671  0.02559 * 
x4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165

The four models fit almost identically: in each one the slope is about 0.50 and is significant at the 1% level (p ≈ 0.002), the adjusted R^2 is about 0.63, and the residual standard error is about 1.24.
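To make this comparison easier to read than four separate summaries, the key statistics can be collected into one table. The sketch below is one possible way to do this; `fits` and `fit_table` are names chosen here for illustration, and the models are refitted with `data = anscombe` so that the block runs on its own.

```r
# Refit y_i ~ x_i for i = 1..4 directly against the anscombe data frame.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Pull intercept, slope, R^2, adjusted R^2 and residual standard error
# out of each fit and stack them into a 4 x 5 matrix.
fit_table <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    r_squared = s$r.squared,
    adj_r_sq  = s$adj.r.squared,
    sigma     = s$sigma)
}))
rownames(fit_table) <- paste0("fit", 1:4)

round(fit_table, 4)
```

Each row of the table should agree with the corresponding summary() output above, which makes the near-identical fits visible at a glance.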

  1. In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

In short, Anscombe argued that although research tends to value numerical calculation over visualization, graphs can reveal features, trends, and problems in the data that cannot be seen in the numerical results alone.

He used the linear regression model as his example. Four kinds of data features may go undetected by numerical calculation but are clear in graphs:

  1. A few residuals are much larger than the others (the outlier problem).

Solve: investigate the outliers in detail and set them aside in a subset for further exploration instead of simply deleting them, because they sometimes bring interesting and useful insights.

  2. A curved relationship between the residuals and the fitted values.

Solve: transform y (for example, take a log) or add an extra term in x to the formula (for example, a squared term).

  3. A progressive change in the variability of the residuals as the fitted values increase.

Solve: transform y.

  4. A skewed distribution of the residuals.

Solve: transform y.
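The residual patterns listed above can be checked on the quartet itself. The following sketch plots residuals against fitted values for each dataset and, as one illustration of the "add a term in x" remedy, refits dataset 2 with a squared term. The names `fits`, `max_abs_resid`, `quad_fit2` and `quad_r_squared` are assumptions introduced for this example.

```r
# Refit the four simple regressions against the data frame.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Residuals versus fitted values, one panel per dataset.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(fitted(fits[[i]]), resid(fits[[i]]), pch = 19,
       xlab = "Fitted values", ylab = "Residuals",
       main = paste("Dataset", i))
  abline(h = 0, lty = 2)  # reference line at zero residual
}
par(old_par)

# Largest absolute residual per dataset; an unusually large value
# points to a possible outlier.
max_abs_resid <- sapply(fits, function(f) max(abs(resid(f))))

# Dataset 2 shows curvature, so try adding a squared term in x2.
quad_fit2 <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
quad_r_squared <- summary(quad_fit2)$r.squared
```

In this sketch the largest residual belongs to dataset 3, consistent with the Max value in summary(fit3), and the quadratic refit of dataset 2 explains far more of the variance than the straight line does.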

Finally, the four-dataset example shows that even when the fitted summary statistics are all nearly identical, the actual shape of the data can vary widely. This reiterates that a statistical analysis needs both numerical calculation and visualization.