ANLY 512 - Problem Set 2

Questions

Anscombe's quartet is a set of four \(x, y\) data sets published by Francis Anscombe in his 1973 paper "Graphs in Statistical Analysis". For this first question, load the anscombe data that is part of library(datasets) in R and assign it to a new object called data.

data("anscombe")
data <- anscombe
data

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Summarise the data by calculating the mean and variance for each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)

library(fBasics)

# Summary Data - Means
colMeans(data)

##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909

# Summary Data - Variances
colVars(data)

##        x1        x2        x3        x4        y1        y2        y3 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620 
##        y4 
##  4.123249

# Correlation between x1 and y1
cor(data$x1, data$y1)

## [1] 0.8164205

# Correlation between x2 and y2
cor(data$x2, data$y2)

## [1] 0.8162365

# Correlation between x3 and y3
cor(data$x3, data$y3)

## [1] 0.8162867

# Correlation between x4 and y4
cor(data$x4, data$y4)

## [1] 0.8165214
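
The four separate cor() calls above can also be collapsed into one loop. The following is a minimal sketch in base R, assuming only the built-in anscombe data; the helper name summarise_pair and the table name pair_summary are illustrative and not part of the original answer.

```r
# Base R sketch: one row of summary statistics per (x, y) pair.
data("anscombe")

# summarise_pair is a hypothetical helper introduced for this example.
summarise_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y))
}

# sapply() returns one column per pair; t() turns pairs into rows.
pair_summary <- t(sapply(1:4, summarise_pair))
rownames(pair_summary) <- paste0("x", 1:4, "_y", 1:4)
round(pair_summary, 4)
```

Putting the statistics side by side in one table makes it easier to see how little they differ from pair to pair.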

Create scatter plots for each \(x, y\) pair of data.

# plot() draws as a side effect and returns NULL, so the result is not assigned.

# Scatterplot x1_y1
plot(data$x1, data$y1, main = "Scatterplot x1_y1", xlab = "x1", ylab = "y1")

# Scatterplot x2_y2
plot(data$x2, data$y2, main = "Scatterplot x2_y2", xlab = "x2", ylab = "y2")

# Scatterplot x3_y3
plot(data$x3, data$y3, main = "Scatterplot x3_y3", xlab = "x3", ylab = "y3")

# Scatterplot x4_y4
plot(data$x4, data$y4, main = "Scatterplot x4_y4", xlab = "x4", ylab = "y4")

Now change the symbols on the scatter plots to solid circles and plot them together as a 4 panel graphic

library(ggplot2)
library(ggthemes)
library(gridExtra)

plot1 <- ggplot(data, aes(x1, y1)) + geom_point() + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("Scatterplot x1 vs y1")
plot2 <- ggplot(data, aes(x2, y2)) + geom_point() + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("Scatterplot x2 vs y2")
plot3 <- ggplot(data, aes(x3, y3)) + geom_point() + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("Scatterplot x3 vs y3")
plot4 <- ggplot(data, aes(x4, y4)) + geom_point() + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("Scatterplot x4 vs y4")
grid.arrange(plot1, plot2, plot3, plot4)
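
The same four-panel layout can be sketched with base graphics instead of ggplot2. This is an alternative, not the approach used in the answer above: par(mfrow = c(2, 2)) splits the device into a 2 x 2 grid, and pch = 19 selects the solid-circle symbol. The function name plot_quartet is hypothetical.

```r
data("anscombe")

# plot_quartet is a hypothetical helper introduced for this example.
plot_quartet <- function(df = anscombe) {
  # Split the device into a 2 x 2 grid and restore the old settings on exit.
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))
  for (i in 1:4) {
    xn <- paste0("x", i)
    yn <- paste0("y", i)
    plot(df[[xn]], df[[yn]], pch = 19,  # pch = 19 draws solid circles
         main = paste("Scatterplot", xn, "vs", yn), xlab = xn, ylab = yn)
  }
  invisible(NULL)
}

plot_quartet()
```

Restoring the previous par() settings on exit keeps later plots in the session from being squeezed into the 2 x 2 grid.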

Now fit a linear model to each data set using the lm() function.

plot1 <- ggplot(data, aes(x1, y1)) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x1_y1 with lm")
plot1

plot2 <- ggplot(data, aes(x2, y2)) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x2_y2 with lm")
plot2

plot3 <- ggplot(data, aes(x3, y3)) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x3_y3 with lm")
plot3

plot4 <- ggplot(data, aes(x4, y4)) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x4_y4 with lm")
plot4
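
The plots above draw the regression lines through geom_smooth(method = "lm") and do not store any model objects. As a hedged sketch of what fitting with lm() directly might look like, the block below fits one model per pair and collects the coefficients; the names models and coefs are illustrative assumptions.

```r
data("anscombe")

# Fit y_i ~ x_i for i = 1..4; reformulate() builds each formula from strings.
models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
names(models) <- paste0("model", 1:4)

# Collect intercept and slope of each fit into one matrix (one row per model).
coefs <- t(sapply(models, function(m) unname(coef(m))))
colnames(coefs) <- c("intercept", "slope")
round(coefs, 3)
```

All four fits should come out with an intercept close to 3 and a slope close to 0.5, which is why the regression lines look the same in every panel.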

Now combine the last two tasks. Create a four panel scatter plot matrix that has both the data points and the regression lines. (hint: the model objects will carry over chunks!)

grid.arrange(plot1, plot2, plot3, plot4)
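
An alternative to arranging four separate plot objects is to reshape the data into long format and let ggplot2 draw the panels with facet_wrap(). This is a sketch under that assumption; the data frame long and its columns set, x and y are names invented for the example.

```r
library(ggplot2)
data("anscombe")

# Stack the four (x, y) pairs into one long data frame with a set label.
long <- data.frame(
  set = rep(paste("set", 1:4), each = nrow(anscombe)),
  x   = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y   = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# One panel per set, each with points and its own least-squares line.
p <- ggplot(long, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ set, ncol = 2)
p
```

Faceting puts all panels on shared axes by default, which makes the differences between the four data sets easier to compare.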

Now compare the model fits for each model object.

plot1_glm <- ggplot(data, aes(x1, y1)) + geom_point() + geom_smooth(se = FALSE, method = "glm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x1_y1 with glm")
plot2_glm <- ggplot(data, aes(x2, y2)) + geom_point() + geom_smooth(se = FALSE, method = "glm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x2_y2 with glm")
plot3_glm <- ggplot(data, aes(x3, y3)) + geom_point() + geom_smooth(se = FALSE, method = "glm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x3_y3 with glm")
plot4_glm <- ggplot(data, aes(x4, y4)) + geom_point() + geom_smooth(se = FALSE, method = "glm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x4_y4 with glm")

plot1_loess <- ggplot(data, aes(x1, y1)) + geom_point() + geom_smooth(se = FALSE, method = "loess") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x1_y1 with loess")
plot2_loess <- ggplot(data, aes(x2, y2)) + geom_point() + geom_smooth(se = FALSE, method = "loess") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x2_y2 with loess")
plot3_loess <- ggplot(data, aes(x3, y3)) + geom_point() + geom_smooth(se = FALSE, method = "loess") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x3_y3 with loess")
plot4_loess <- ggplot(data, aes(x4, y4)) + geom_point() + geom_smooth(se = FALSE, method = "loess") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x4_y4 with loess")

grid.arrange(plot1, plot1_glm, plot1_loess, plot2,plot2_glm, plot2_loess, plot3, plot3_glm, plot3_loess, plot4, plot4_glm, plot4_loess)

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 7.945

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 0.003025

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 7.945

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.055

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 122.21

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger

## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 5)

I fitted each data set with three smoothing methods: lm, glm, and loess. For x1_y1, the lm and glm trend lines follow the data points closely. For x2_y2, loess clearly provides the better fit, because the points lie on a curve rather than a straight line. x3_y3 also gets a very good fit with loess, although the linear fits are acceptable apart from the single outlier. In every panel the lm and glm lines coincide, because glm() with its default Gaussian family and identity link reduces to ordinary least squares. x4_y4 only gets a fit with lm and glm; loess fails (see the warnings above) because ten of the eleven x4 values are identical, which leaves no usable local neighborhood.
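
The visual comparison above can be backed up with numbers. The sketch below, which is an addition and not part of the original answer, extracts R-squared from each lm() fit and checks that glm() with its default settings returns the same coefficients as lm() for the first pair. The names fit_pair and r_squared are hypothetical.

```r
data("anscombe")

# fit_pair is a hypothetical helper: fit y_i ~ x_i with the given function.
fit_pair <- function(i, fit_fun = lm) {
  fit_fun(reformulate(paste0("x", i), response = paste0("y", i)),
          data = anscombe)
}

# R-squared of the straight-line fit for each pair.
r_squared <- sapply(1:4, function(i) summary(fit_pair(i))$r.squared)
names(r_squared) <- paste0("x", 1:4, "_y", 1:4)
round(r_squared, 3)

# glm() defaults to family = gaussian, so it should match lm().
all.equal(coef(fit_pair(1, lm)), coef(fit_pair(1, glm)))
```

Because R-squared is the squared correlation in simple linear regression, all four values land near 0.67, so this statistic alone cannot distinguish the four data sets.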

In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

The summary statistics of Anscombe’s Quartet are almost identical across the four data sets: each x column has the same mean and variance, each y column has essentially the same mean and variance, and each [x ; y] pair has nearly the same correlation. Taken alone, these numbers would lead us to believe that the four [x ; y] sets are very similar and perhaps even interchangeable. The plots, however, show that they are in fact very different data sets. These differences would have been very hard to detect without data visualization. Anscombe’s Quartet is therefore a strong argument for data visualization, especially in response to the claim that numbers and summary statistics are more accurate than graphs. It shows that visualization is an important tool for data analysis, to be used together with summary statistics and other tools to get a more complete understanding of the data and what it represents.

Anscombe’s quartet

Ahien Clementine Djouka

2017-09-12
