“Graphics reveal data,” as Edward R. Tufte stated in his seminal book “The Visual Display of Quantitative Information” (Cheshire, CT: Graphics Press, 1983).
Tufte’s book was very influential in guiding the revolution in data visualization and infographics that took place during the 1980s and 1990s, the decades of the personal computer.
Tufte’s work contributed to the renaissance of a discipline that has always been considered an important part of statistics. Both Fisher and Tukey, two of the fathers of modern statistics, paid great attention to exploratory data analysis and its visualization. John Wilder Tukey, in particular, established ‘Exploratory Data Analysis’ (Reading, MA: Addison-Wesley, 1977) as a major topic in statistics. From the early 1970s, chapters of his work-in-progress book were circulated in the statistical community to make colleagues aware of this important area.
A cautionary tale on data visualization was told by Anscombe in 1973: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. The American Statistician 27(1): 17–21.
Anscombe constructed four datasets with nearly identical statistical properties that nevertheless look very different when plotted. The variables in the four datasets share essentially the same means and variances, the same correlation, and the same regression coefficients; yet the graphs differ strikingly, casting doubt on any statistical inference drawn from the summary numbers alone. The take-home message was: always plot the data!
## The data
data(anscombe)
## How they look 'statistically'
dim(anscombe)
## [1] 11 8
str(anscombe)
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
attach(anscombe) ### usually listed among the "don't do it!" practices, but it is so convenient...
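## Aside: a safer alternative to attach() is with(), which evaluates an
## expression inside the data frame without modifying the search path, e.g.
with(anscombe, cor(x1, y1))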
## The 'pander' package is helpful to display data in markdown; it is not needed at the usual R console
library(pander)
## Let's see the complete dataset
pander(anscombe)
| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.1 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.1 | 5.39 | 12.5 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
## And now the summary statistics
pander(summary(anscombe[,c(1,2,3,4)]))
| x1 | x2 | x3 | x4 |
|---|---|---|---|
| Min. : 4.0 | Min. : 4.0 | Min. : 4.0 | Min. : 8 |
| 1st Qu.: 6.5 | 1st Qu.: 6.5 | 1st Qu.: 6.5 | 1st Qu.: 8 |
| Median : 9.0 | Median : 9.0 | Median : 9.0 | Median : 8 |
| Mean : 9.0 | Mean : 9.0 | Mean : 9.0 | Mean : 9 |
| 3rd Qu.:11.5 | 3rd Qu.:11.5 | 3rd Qu.:11.5 | 3rd Qu.: 8 |
| Max. :14.0 | Max. :14.0 | Max. :14.0 | Max. :19 |
pander(summary(anscombe[,c(5,6,7,8)]))
| y1 | y2 | y3 | y4 |
|---|---|---|---|
| Min. : 4.26 | Min. :3.10 | Min. : 5.39 | Min. : 5.25 |
| 1st Qu.: 6.32 | 1st Qu.:6.70 | 1st Qu.: 6.25 | 1st Qu.: 6.17 |
| Median : 7.58 | Median :8.14 | Median : 7.11 | Median : 7.04 |
| Mean : 7.50 | Mean :7.50 | Mean : 7.50 | Mean : 7.50 |
| 3rd Qu.: 8.57 | 3rd Qu.:8.95 | 3rd Qu.: 7.98 | 3rd Qu.: 8.19 |
| Max. :10.84 | Max. :9.26 | Max. :12.74 | Max. :12.50 |
So, the means are identical for all the ‘x’ variables and for all the ‘y’ variables, but the medians are not exactly the same
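As a quick convenience check (not part of the original workflow), the medians of all eight columns can be compared in one call:
## Medians of all eight variables at once
round(sapply(anscombe, median), 2)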
Let's look at the spread of, and the association between, the variables: standard deviation, variance, covariance, and correlation
## Let's calculate the parameters and assign them to vectors
## For convenience the vector for the first set of numbers is named dataset1 (and so on),
## because it will supply the row names when we assemble the data frame of all calculated parameters
## Standard deviation
dataset1 <- sd(x1); dataset2 <- sd(x2); dataset3 <- sd(x3); dataset4 <- sd(x4)
sd.y1 <- sd(y1); sd.y2 <- sd(y2); sd.y3 <- sd(y3); sd.y4 <- sd(y4)
## Variance
var.x1 <- var(x1); var.x2 <- var(x2); var.x3 <- var(x3); var.x4 <- var(x4)
var.y1 <- var(y1); var.y2 <- var(y2); var.y3 <- var(y3); var.y4 <- var(y4)
## Covariance
cv1 <- cov(x1,y1); cv2 <- cov(x2,y2); cv3 <- cov(x3,y3); cv4 <- cov(x4,y4)
## Correlation
cr1 <- cor(x1,y1); cr2 <- cor(x2,y2); cr3 <- cor(x3,y3); cr4 <- cor(x4,y4)
## Let's combine the data into a dataframe
sd.x <- rbind(dataset1, dataset2, dataset3, dataset4)
sd.y <- rbind(sd.y1, sd.y2, sd.y3, sd.y4)
cv <- rbind(cv1, cv2, cv3, cv4)
cr <- rbind(cr1, cr2, cr3, cr4)
dat.par <- as.data.frame(cbind(sd.x, sd.y, cv, cr))
names(dat.par) <- c("SD x", "SD y", "Covariance", "Correlation") ## we assign proper names to the column headers
pander(round(dat.par,3))
|  | SD x | SD y | Covariance | Correlation |
|---|---|---|---|---|
| dataset1 | 3.317 | 2.032 | 5.501 | 0.816 |
| dataset2 | 3.317 | 2.032 | 5.5 | 0.816 |
| dataset3 | 3.317 | 2.03 | 5.497 | 0.816 |
| dataset4 | 3.317 | 2.031 | 5.499 | 0.817 |
Quite impressive, isn’t it?
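As an aside, the same four statistics can also be obtained more compactly with a single sapply() call over the dataset index; a minimal sketch of this alternative (not the approach used above):
## Compute SD, covariance and correlation for each of the four x/y pairs in one call
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(sd.x = sd(x), sd.y = sd(y), cov = cov(x, y), cor = cor(x, y))
})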
library(asbio)
## Warning: package 'asbio' was built under R version 3.1.3
## Loading required package: tcltk
windows(height=3.50)
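## Note: windows() opens a graphics device on Windows only; on other systems
## dev.new() (or quartz()/x11()) can be used instead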
op <- par(mfrow=c(1,4),mar=c (0,0,2,3), oma = c(5, 4.2, 0, 0))
with(anscombe, plot(x1, y1, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x1, y1),2)))))); abline(3,0.5)
with(anscombe, plot(x2, y2, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x2, y2),2)))))); abline(3,0.5)
with(anscombe, plot(x3, y3, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x3, y3),2)))))); abline(3,0.5)
with(anscombe, plot(x4, y4, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x4, y4),2)))))); abline(3,0.5)
mtext(expression(italic(x)),side=1, outer = TRUE, line = 3)
mtext(expression(italic(y)),side=2, outer = TRUE, line = 2.6)
mtext("(a)",side=3, at = -42, line = .5)
mtext("(b)",side=3, at = -26, line = .5)
mtext("(c)",side=3, at = -10.3, line = .5)
mtext("(d)",side=3, at = 5.5, line = .5)
par(op)
So, the correlation coefficient is virtually the same in all four datasets, but the plots clearly are not.
How about the regression?
## Let's set the models
fit1 <- lm(y1~x1,data=anscombe)
fit2 <- lm(y2~x2,data=anscombe)
fit3 <- lm(y3~x3,data=anscombe)
fit4 <- lm(y4~x4,data=anscombe)
## Now we will extract the regression coefficients for both the intercept and the predictor
## Intercept
# dataset 1
dataset1 <- round(fit1$coefficients[1],3)
# dataset 2
dataset2 <- round(fit2$coefficients[1],3)
# dataset 3
dataset3 <- round(fit3$coefficients[1],3)
# dataset 4
dataset4 <- round(fit4$coefficients[1],3)
## Predictor
# dataset 1
d1 <- round(fit1$coefficients[2],3)
# dataset 2
d2 <- round(fit2$coefficients[2],3)
# dataset 3
d3 <- round(fit3$coefficients[2],3)
# dataset 4
d4 <- round(fit4$coefficients[2],3)
## Let's combine the data into a dataframe
int <- rbind(dataset1,dataset2,dataset3,dataset4)
pred <- rbind(d1,d2,d3,d4)
dat.reg <- as.data.frame(cbind(int,pred))
names(dat.reg) <- c("intercept","predictor") ## we assign proper names to the column headers
## Let's see the data
pander(dat.reg)
|  | intercept | predictor |
|---|---|---|
| dataset1 | 3 | 0.5 |
| dataset2 | 3.001 | 0.5 |
| dataset3 | 3.002 | 0.5 |
| dataset4 | 3.002 | 0.5 |
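As an extra check (not part of the tables above), we can also compare the coefficients of determination of the four fits; they too turn out to be virtually identical:
## R-squared of the four regression models
sapply(list(fit1, fit2, fit3, fit4), function(m) round(summary(m)$r.squared, 3))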
Well, always plot the data, indeed…
We can boxplot the variables to see how they are distributed according to their main summary statistics.
The box-and-whisker plot was created by John W. Tukey, and it is used to show the distribution of a variable in a dataset.
I have added a boxplot of simulated data with an annotation of its components.
To make the plots easier to read, I have modified the display of the outliers with outcex (which controls the size of the outlier points), pch (which controls the shape of the points), and outcol (which controls their color). I have also changed the size of the x-axis annotation with cex.axis and of the x-axis label of the simulated variable with cex.lab.
par(mfrow=c(1,3))
boxplot(anscombe[,c(1,2,3,4)], col = "gold", outcex=1.5, pch=19, outcol="red", cex.axis=1.5)
boxplot(anscombe[,c(5,6,7,8)], col = "pink", outcex=1.5, pch=19, outcol="red", cex.axis=1.5)
set.seed(123) # to have the example reproducible as far as the simulated data is concerned
rr<-c(rnorm(100), -3)
bx <- boxplot(rr, xlab = "Simulated variable", col="yellow", outcex=1.5, pch=19, outcol="red", cex.lab = 1.5)
f <- c(bx$stats[1], bx$stats[2], bx$stats[3], bx$stats[4], bx$stats[5], bx$out[1])
text(rep(1.25, 6), f, labels = c("minimum", "lower quartile", "median", "upper quartile", "maximum", "outlier"),
adj = c(0, 0), cex = 1.3)
So, do not trust overly simple statistics (the mean and the standard deviation) too readily: the median may be a more reliable indicator of how a variable is distributed.
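A tiny simulated example (not part of the Anscombe data) makes the point about robustness:
## One extreme value shifts the mean considerably, but barely moves the median
z <- c(4, 5, 6, 7, 8)
mean(z); median(z)   # both are 6
z[5] <- 80           # replace the largest value with an outlier
mean(z); median(z)   # the mean jumps to 20.4, the median stays at 6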
Anyway, how does this infamous quartet look when plotted?
## This plot is based on the 'anscombe' dataset included in the 'datasets' package (default in R)
require(stats); require(graphics)
##-- now some "magic" to do the 4 regressions in a loop:
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
## or ff[[2]] <- as.name(paste0("y", i))
## ff[[3]] <- as.name(paste0("x", i))
mods[[i]] <- lmi <- lm(ff, data = anscombe)
print(anova(lmi))
}
## Analysis of Variance Table
##
## Response: y1
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 27.5 27.51 18 0.0022 **
## Residuals 9 13.8 1.53
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y2
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 27.5 27.50 18 0.0022 **
## Residuals 9 13.8 1.53
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y3
## Df Sum Sq Mean Sq F value Pr(>F)
## x3 1 27.5 27.47 18 0.0022 **
## Residuals 9 13.8 1.53
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y4
## Df Sum Sq Mean Sq F value Pr(>F)
## x4 1 27.5 27.49 18 0.0022 **
## Residuals 9 13.7 1.53
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0))
for(i in 1:4) {
ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
xlim = c(3, 19), ylim = c(3, 13))
abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
par(op)
A linear regression model seems appropriate for the first dataset only; the second dataset is clearly curvilinear; in the third dataset a single outlier pulls the regression line away from the trend of the other points; as for the fourth dataset, where the slope is determined entirely by one extreme observation… well, I do not know what to say.
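To illustrate how much that single outlier matters (an extra check, not in the original analysis), the third model can be refitted without the offending case, i.e. the third row of the dataset (where y3 = 12.74):
## Influence of the outlier in dataset 3: refit without row 3 and compare coefficients
fit3.reduced <- lm(y3 ~ x3, data = anscombe[-3, ])
coef(fit3)          # with the outlier: intercept ~ 3, slope ~ 0.5
coef(fit3.reduced)  # without it, the remaining points lie almost exactly on a line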
We can check the regression diagnostics to see how well each model actually fits.
First model
par(mfrow = c(2, 2))
plot(fit1)
Second model
par(mfrow = c(2, 2))
plot(fit2)
Third model
par(mfrow = c(2, 2))
plot(fit3)
Fourth model
par(mfrow = c(2, 2))
plot(fit4)
## Warning: not plotting observations with leverage one:
## 8
## Warning: not plotting observations with leverage one:
## 8
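The warnings refer to observation 8 of the fourth dataset (the only case with x4 = 19): its leverage is exactly one, so the fitted line is forced to pass through it. This can be verified numerically (a quick check, not in the original):
## Hat values (leverages) for the fourth model: observation 8 has leverage 1
round(hatvalues(fit4), 3)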
The first model was reasonably acceptable on the basis of the regression diagnostics; the second and third models were poor; the fourth model was not even wrong.
So, plot the data and check the regression diagnostics before saying anything.
The Anscombe quartet has only eleven cases per dataset, and it is very unlikely that anyone would use a dataset of 11 cases for any serious analysis…
However, it is always better to have a look at the data before taking any initiative.
More on this in the next parts of this tutorial.
I hope you’ve enjoyed this!
Have a nice day!