ANLY 512 - Problem Set 2

Objectives

The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes or answers the activity or question requested. Finally, upon completion, post your assignment on RPubs and upload a link to it to the “Problem Set 2” assignment on Moodle.

Questions

Anscombe’s quartet is a set of four \(x, y\) data sets that were published by Francis Anscombe in a 1973 paper, Graphs in Statistical Analysis. For this first question, load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.

I installed ggplot2 to facilitate my work

library(datasets)
head(anscombe)

##   x1 x2 x3 x4   y1   y2    y3   y4
## 1 10 10 10  8 8.04 9.14  7.46 6.58
## 2  8  8  8  8 6.95 8.14  6.77 5.76
## 3 13 13 13  8 7.58 8.74 12.74 7.71
## 4  9  9  9  8 8.81 8.77  7.11 8.84
## 5 11 11 11  8 8.33 9.26  7.81 8.47
## 6 14 14 14  8 9.96 8.10  8.84 7.04

data = anscombe

Summarise the data by calculating the mean and variance for each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)

library(coefplot)

## Warning: package 'coefplot' was built under R version 3.3.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.3.3

## Warning: Installed Rcpp (0.12.7) different from Rcpp used to build dplyr (0.12.12).
## Please reinstall dplyr to avoid random crashes or undefined behavior.

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

sapply(data,mean)

##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909

sapply(data,var)

##        x1        x2        x3        x4        y1        y2        y3 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620 
##        y4 
##  4.123249

cor(data)

##            x1         x2         x3         x4         y1         y2
## x1  1.0000000  1.0000000  1.0000000 -0.5000000  0.8164205  0.8162365
## x2  1.0000000  1.0000000  1.0000000 -0.5000000  0.8164205  0.8162365
## x3  1.0000000  1.0000000  1.0000000 -0.5000000  0.8164205  0.8162365
## x4 -0.5000000 -0.5000000 -0.5000000  1.0000000 -0.5290927 -0.7184365
## y1  0.8164205  0.8164205  0.8164205 -0.5290927  1.0000000  0.7500054
## y2  0.8162365  0.8162365  0.8162365 -0.7184365  0.7500054  1.0000000
## y3  0.8162867  0.8162867  0.8162867 -0.3446610  0.4687167  0.5879193
## y4 -0.3140467 -0.3140467 -0.3140467  0.8165214 -0.4891162 -0.4780949
##            y3         y4
## x1  0.8162867 -0.3140467
## x2  0.8162867 -0.3140467
## x3  0.8162867 -0.3140467
## x4 -0.3446610  0.8165214
## y1  0.4687167 -0.4891162
## y2  0.5879193 -0.4780949
## y3  1.0000000 -0.1554718
## y4 -0.1554718  1.0000000

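In the correlation matrix above, the diagonal entries are 1 because every variable is perfectly correlated with itself. The correlations for the pairs of interest are x1-y1: 0.8164, x2-y2: 0.8162, x3-y3: 0.8163, and x4-y4: 0.8165, so all four pairs have nearly the same correlation (about 0.816).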
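Because the question asks only for the correlation within each matched pair, the four relevant values can be extracted directly instead of being read off the full matrix. The following is a minimal sketch in base R; `pair_cor` is a hypothetical name introduced here for illustration.

```r
library(datasets)

data <- anscombe

# Correlation for each matched pair (x1 with y1, ..., x4 with y4).
# pair_cor is an illustrative name, not part of the original answer.
pair_cor <- sapply(1:4, function(i) {
  cor(data[[paste0("x", i)]], data[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, "-y", 1:4)
round(pair_cor, 4)
```

Each entry should match the corresponding off-diagonal cell of the cor(data) output.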

Create scatter plots for each \(x, y\) pair of data.

plot(data$x1, data$y1)

plot(data$x2, data$y2)

plot(data$x3, data$y3)

plot(data$x4, data$y4)

Now change the symbols on the scatter plots to solid circles and plot them together as a 4 panel graphic

I used “pch” to alter the plotting character. I could have used “cex” or “col” to alter the size or color, respectively.
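As a small illustration of those three graphical parameters, the sketch below plots the first data set with a solid circle, a larger symbol, and a different color. The specific values (pch = 19, cex = 1.5, col = "blue") are arbitrary choices for illustration and are not part of the original answer.

```r
library(datasets)

data <- anscombe

# pch = 19 draws solid circles, cex scales the symbol size,
# and col sets the symbol color (values chosen only for illustration).
plot(data$x1, data$y1, pch = 19, cex = 1.5, col = "blue",
     xlab = "x1", ylab = "y1")
```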

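par(mfrow = c(2, 2))  # arrange the next four plots in a 2x2 grid
plot(data$x1, data$y1, pch = 20)
plot(data$x2, data$y2, pch = 20)
plot(data$x3, data$y3, pch = 20)
plot(data$x4, data$y4, pch = 20)

My first attempt joined the four plot() calls with +, which only adds the NULL values that plot() returns (printing numeric(0)) and does not arrange the panels. Setting par(mfrow = c(2, 2)) before plotting places the plots in a 2x2 matrix.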

Now fit a linear model to each data set using the lm() function.

# Regress y on x so each fitted line matches the plot axes (x horizontal, y vertical)
plot(data$x1, data$y1)
abline(lm(y1 ~ x1, data = data))

plot(data$x2, data$y2)
abline(lm(y2 ~ x2, data = data))

plot(data$x3, data$y3)
abline(lm(y3 ~ x3, data = data))

plot(data$x4, data$y4)
abline(lm(y4 ~ x4, data = data))

Now combine the last two tasks. Create a four panel scatter plot matrix that has both the data points and the regression lines. (hint: the model objects will carry over chunks!)

par(mfrow = c(2, 2))  # four panels in a 2x2 grid

plot(data$x1, data$y1, pch = 20)
abline(lm(y1 ~ x1, data = data))

plot(data$x2, data$y2, pch = 20)
abline(lm(y2 ~ x2, data = data))

plot(data$x3, data$y3, pch = 20)
abline(lm(y3 ~ x3, data = data))

plot(data$x4, data$y4, pch = 20)
abline(lm(y4 ~ x4, data = data))

7. Now compare the model fits for each model object.

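All four models have almost the same fitted line (roughly y = 3 + 0.5x) and the same correlation (about 0.816), but the plots show that the fits differ greatly in quality.
(1) The x1-y1 plot shows a positive, roughly linear relationship, with the points scattered around the regression line. A linear model is reasonable here.
(2) The points in the x2-y2 plot follow a smooth curve, not a straight line. The relationship could be quadratic, so a linear model is the wrong form.
(3) If we ignore the outlier in the x3-y3 plot, the remaining points lie almost exactly on a straight line; however, that line is not the regression line, because the outlier pulls the fitted slope upward.
(4) In the x4-y4 plot, x4 takes the same value (8) for every point but one, so y4 varies while x4 stays fixed. The fitted slope is determined entirely by the single point with a different x4 value, and the data say little about how y4 depends on x4.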
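To back the visual comparison with numbers, the fitted coefficients and R-squared values of the four models can be placed side by side. This is a minimal sketch in base R; `fits` and `fit_summary` are hypothetical names introduced here, and the models regress each y on its matching x.

```r
library(datasets)

data <- anscombe

# Fit y_i ~ x_i for each of the four data sets.
# `fits` and `fit_summary` are illustrative names.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})

# Collect intercept, slope, and R-squared for each model in one table.
fit_summary <- t(sapply(fits, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared)
}))
rownames(fit_summary) <- paste0("model", 1:4)
round(fit_summary, 3)
```

The rows should be nearly identical, which is the point: the numerical fit summaries cannot distinguish the four data sets, while the plots can.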

8. In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

Anscombe was concerned that many key factors are not considered when using the theoretical description (A). He argues that we should also examine the relationship between residuals and fitted values, and he lists the things to look for in a plot of residuals against fitted values or against \(x_i\). He notes that outliers sometimes yield valuable information, and that it is a good idea to remove them, analyze the remaining data, and then study the outliers separately. He goes on to discuss regression analysis more generally. Anscombe’s quartet (Fig. 1-4) illustrates the point: the four data sets have nearly identical means, variances, and correlations, yet their scatter plots look completely different. Summary statistics alone can therefore mislead, and plotting the data is essential for understanding x-y relationships, spotting outliers, and judging whether a regression model is appropriate.
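The residual-versus-fitted plots mentioned above can be drawn for the four data sets as follows. This is a minimal sketch in base R under the assumption that each y is regressed on its matching x; the axis labels and panel titles are choices made here for illustration.

```r
library(datasets)

data <- anscombe

op <- par(mfrow = c(2, 2))  # 2x2 panel layout
for (i in 1:4) {
  x <- data[[paste0("x", i)]]
  y <- data[[paste0("y", i)]]
  fit <- lm(y ~ x)
  plot(fitted(fit), resid(fit), pch = 20,
       xlab = "Fitted values", ylab = "Residuals",
       main = paste("Data set", i))
  abline(h = 0, lty = 2)  # reference line at zero residual
}
par(op)  # restore the previous graphics settings
```

A well-behaved linear fit should show residuals scattered around zero without a pattern; curvature or a single extreme residual in a panel signals the kinds of problems described in the comparison above.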