June 30, 2015

Always plot the data! - Part one.

Graphics reveal data, as Edward R. Tufte states in his seminal book “The Visual Display of Quantitative Information” (Cheshire, CT: Graphics Press, 1983).

Tufte’s book was very influential in guiding the revolution in data visualization and infographics that took place during the 1980s and 1990s, the decades of the personal computer.

Tufte’s work contributed to the renaissance of a discipline that had always been considered an important part of statistics. Both Fisher and Tukey, two of the fathers of modern statistics, paid great attention to exploratory data analysis and its visualization. John Wilder Tukey, in particular, established ‘Exploratory Data Analysis’ (Reading, MA: Addison-Wesley, 1977) as a major topic in statistics. From the early 1970s, chapters of his work-in-progress book were circulated in the statistical community to make colleagues aware of this important area.

A cautionary tale on data visualization was told by Anscombe in 1973: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. The American Statistician 27 (1): 17–21.

Anscombe constructed four datasets that have nearly identical statistical properties but look very different when plotted. Essentially, the variables in these four datasets share the same means and variances, and the same correlation and regression coefficients. Nevertheless, they are very different when graphed, which casts doubt on any statistical inference drawn from the summary numbers alone. The take-home message was: always plot the data!

Let’s see the data.
Note that I used {r message=FALSE, warning=FALSE} in the chunk header to prevent the package startup warnings from being printed in the markdown.
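
For reference, this is what such a chunk header looks like in the R Markdown source (a minimal sketch; the library() call is only an illustration):

```{r message=FALSE, warning=FALSE}
library(pander)   # startup messages and warnings are suppressed in the rendered output
```
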
## The data

data(anscombe)

## How they look 'statistically'

dim(anscombe)
## [1] 11  8
str(anscombe)
## 'data.frame':    11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
attach(anscombe)   ### usually flagged as a "don't do it!", but it is so convenient...
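## Aside (my addition): a safer alternative to attach() is with(), which
## scopes the data frame to a single expression and keeps the search path clean,
## e.g. with(anscombe, cor(x1, y1)) instead of cor(x1, y1) after attach()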

## The 'pander' package is helpful to display data in markdown, though not needed in the usual R console

library(pander)

## Let's see the complete dataset

pander(anscombe)
| x1 | x2 | x3 | x4 |    y1 |   y2 |    y3 |    y4 |
|---:|---:|---:|---:|------:|-----:|------:|------:|
| 10 | 10 | 10 |  8 |  8.04 | 9.14 |  7.46 |  6.58 |
|  8 |  8 |  8 |  8 |  6.95 | 8.14 |  6.77 |  5.76 |
| 13 | 13 | 13 |  8 |  7.58 | 8.74 | 12.74 |  7.71 |
|  9 |  9 |  9 |  8 |  8.81 | 8.77 |  7.11 |  8.84 |
| 11 | 11 | 11 |  8 |  8.33 | 9.26 |  7.81 |  8.47 |
| 14 | 14 | 14 |  8 |  9.96 | 8.1  |  8.84 |  7.04 |
|  6 |  6 |  6 |  8 |  7.24 | 6.13 |  6.08 |  5.25 |
|  4 |  4 |  4 | 19 |  4.26 | 3.1  |  5.39 | 12.5  |
| 12 | 12 | 12 |  8 | 10.84 | 9.13 |  8.15 |  5.56 |
|  7 |  7 |  7 |  8 |  4.82 | 7.26 |  6.42 |  7.91 |
|  5 |  5 |  5 |  8 |  5.68 | 4.74 |  5.73 |  6.89 |
## And now the summary statistics

pander(summary(anscombe[,c(1,2,3,4)]))
|         |   x1 |   x2 |   x3 |  x4 |
|---------|-----:|-----:|-----:|----:|
| Min.    |  4.0 |  4.0 |  4.0 |   8 |
| 1st Qu. |  6.5 |  6.5 |  6.5 |   8 |
| Median  |  9.0 |  9.0 |  9.0 |   8 |
| Mean    |  9.0 |  9.0 |  9.0 |   9 |
| 3rd Qu. | 11.5 | 11.5 | 11.5 |   8 |
| Max.    | 14.0 | 14.0 | 14.0 |  19 |
pander(summary(anscombe[,c(5,6,7,8)]))
|         |    y1 |   y2 |    y3 |    y4 |
|---------|------:|-----:|------:|------:|
| Min.    |  4.26 | 3.10 |  5.39 |  5.25 |
| 1st Qu. |  6.32 | 6.70 |  6.25 |  6.17 |
| Median  |  7.58 | 8.14 |  7.11 |  7.04 |
| Mean    |  7.50 | 7.50 |  7.50 |  7.50 |
| 3rd Qu. |  8.57 | 8.95 |  7.98 |  8.19 |
| Max.    | 10.84 | 9.26 | 12.74 | 12.50 |

So, the means are the same for both the ‘x’ and the ‘y’ variables, but the medians are not exactly the same.
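
As a quick check (my own addition), all eight medians can be computed in one line; given the summaries above, it should print something like this:

sapply(anscombe, median)
##   x1   x2   x3   x4   y1   y2   y3   y4 
## 9.00 9.00 9.00 8.00 7.58 8.14 7.11 7.04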

Let’s see the variance and the other measures of spread and association.

## Let's calculate the parameters and assign them to vectors
## For convenience we use a vector named dataset1 (and so on) for the first set of numbers,
## because these names will become the row names when we build the data frame of calculated parameters

## Standard deviation

dataset1 <- sd(x1); dataset2 <- sd(x2); dataset3 <- sd(x3); dataset4 <- sd(x4)


sd.y1 <- sd(y1); sd.y2 <- sd(y2); sd.y3 <- sd(y3); sd.y4 <- sd(y4)

## Variance

var.x1 <- var(x1); var.x2 <- var(x2); var.x3 <- var(x3); var.x4 <- var(x4)


var.y1 <- var(y1); var.y2 <- var(y2); var.y3 <- var(y3); var.y4 <- var(y4)


## Covariance

cv1 <- cov(x1,y1); cv2 <- cov(x2,y2); cv3 <- cov(x3,y3); cv4 <- cov(x4,y4)

## Correlation

cr1 <- cor(x1,y1); cr2 <- cor(x2,y2); cr3 <- cor(x3,y3); cr4 <- cor(x4,y4)



## Let's combine the data into a dataframe

sd.x <- rbind(dataset1, dataset2, dataset3, dataset4)
sd.y <- rbind(sd.y1, sd.y2, sd.y3, sd.y4)
cv <- rbind(cv1, cv2, cv3, cv4)
cr <- rbind(cr1, cr2, cr3, cr4)

dat.par <- as.data.frame(cbind(sd.x, sd.y, cv, cr))
names(dat.par) <- c("SD x", "SD y", "Covariance", "Correlation")        ## assign proper names to the column headers

pander(round(dat.par,3)) 
|          | SD x  | SD y  | Covariance | Correlation |
|----------|------:|------:|-----------:|------------:|
| dataset1 | 3.317 | 2.032 |      5.501 |       0.816 |
| dataset2 | 3.317 | 2.032 |      5.500 |       0.816 |
| dataset3 | 3.317 | 2.030 |      5.497 |       0.816 |
| dataset4 | 3.317 | 2.031 |      5.499 |       0.817 |
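
For the record, the same table can be produced more compactly (a sketch of mine, equivalent to the step-by-step version above):

## build the four (x, y) pairs once, then compute each statistic in a loop
xy <- lapply(1:4, function(i) anscombe[, paste0(c("x", "y"), i)])
dat.par2 <- t(sapply(xy, function(d) c("SD x" = sd(d[, 1]), "SD y" = sd(d[, 2]),
                                       "Covariance" = cov(d[, 1], d[, 2]),
                                       "Correlation" = cor(d[, 1], d[, 2]))))
rownames(dat.par2) <- paste0("dataset", 1:4)
pander(round(dat.par2, 3))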

Quite impressive, isn’t it?

I’ll use the ‘asbio’ package to demonstrate the point about the correlation between the two variables in each dataset.
This time I didn’t use {r message=FALSE, warning=FALSE}, so the warning about the package is printed in the markdown.
library(asbio)
## Warning: package 'asbio' was built under R version 3.1.3
## Loading required package: tcltk
windows(height=3.50)
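## Note (my addition): windows() works only on Windows; on other platforms,
## dev.new(height = 3.5) should open an equivalent graphics device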

op <- par(mfrow=c(1,4),mar=c (0,0,2,3), oma = c(5, 4.2, 0, 0))

with(anscombe, plot(x1, y1, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ", .(round(cor(x1, y1), 2)))))); abline(3, 0.5)
with(anscombe, plot(x2, y2, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ", .(round(cor(x2, y2), 2)))))); abline(3, 0.5)
with(anscombe, plot(x3, y3, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ", .(round(cor(x3, y3), 2)))))); abline(3, 0.5)
with(anscombe, plot(x4, y4, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ", .(round(cor(x4, y4), 2)))))); abline(3, 0.5)
mtext(expression(italic(x)),side=1, outer = TRUE, line = 3)
mtext(expression(italic(y)),side=2, outer = TRUE, line = 2.6)
mtext("(a)",side=3, at = -42, line = .5)
mtext("(b)",side=3, at = -26, line = .5)
mtext("(c)",side=3, at = -10.3, line = .5)
mtext("(d)",side=3, at = 5.5, line = .5)

(Plot: panels (a)–(d), scatterplots of the four datasets with the line y = 3 + 0.5x; each title reports r = 0.82.)

par(op)

So, same correlation coefficient, but the plots are definitely not the same.

How about the regression?

## Let's set the models
    
fit1 <- lm(y1~x1,data=anscombe)
fit2 <- lm(y2~x2,data=anscombe)
fit3 <- lm(y3~x3,data=anscombe)
fit4 <- lm(y4~x4,data=anscombe)


## Now we will extract the regression coefficients for both the intercept and the predictor

## Intercept

# dataset 1

dataset1 <- round(fit1$coefficients[1],3)

# dataset 2

dataset2 <- round(fit2$coefficients[1],3)

# dataset 3

dataset3 <- round(fit3$coefficients[1],3)

# dataset 4

dataset4 <- round(fit4$coefficients[1],3)


## Predictor

# dataset 1

d1 <- round(fit1$coefficients[2],3)

# dataset 2

d2 <- round(fit2$coefficients[2],3)

# dataset 3

d3 <- round(fit3$coefficients[2],3)

# dataset 4

d4 <- round(fit4$coefficients[2],3)


## Let's combine the data into a dataframe

int <- rbind(dataset1,dataset2,dataset3,dataset4)
pred <- rbind(d1,d2,d3,d4)

dat.reg <- as.data.frame(cbind(int,pred))
names(dat.reg) <- c("intercept","predictor")    ## assign proper names to the column headers

## Let's see the data 

pander(dat.reg)
|          | intercept | predictor |
|----------|----------:|----------:|
| dataset1 | 3         | 0.5       |
| dataset2 | 3.001     | 0.5       |
| dataset3 | 3.002     | 0.5       |
| dataset4 | 3.002     | 0.5       |
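
The same coefficients can also be extracted in one step with coef(), the standard accessor for lm objects (a compact sketch):

## one row per model: intercept and slope
coefs <- t(sapply(list(fit1, fit2, fit3, fit4), coef))
rownames(coefs) <- paste0("dataset", 1:4)
colnames(coefs) <- c("intercept", "predictor")
pander(round(coefs, 3))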

Well… always plot the data!

We can boxplot the variables to see how they are distributed according to their main summary statistics.

The box-and-whisker plot was created by John W. Tukey, and it is used to show the distribution of a variable in a dataset

I added a boxplot of simulated data, annotated to illustrate its meaning.

To facilitate reading, I’ve modified the display of the outliers with outcex (which controls the size of the outlier points in the boxplot), pch (which controls the shape of the points), and outcol (which controls their color). I also changed the size of the x-axis annotation with cex.axis, and of the x-axis label for the simulated variable with cex.lab.

par(mfrow=c(1,3))
boxplot(anscombe[,c(1,2,3,4)], col = "gold", outcex=1.5, pch=19, outcol="red", cex.axis=1.5)
boxplot(anscombe[,c(5,6,7,8)], col = "pink", outcex=1.5, pch=19, outcol="red", cex.axis=1.5)

set.seed(123)       # to have the example reproducible as far as the simulated data is concerned
rr<-c(rnorm(100), -3)

bx <- boxplot(rr, xlab = "Simulated variable", col="yellow", outcex=1.5, pch=19, outcol="red", cex.lab = 1.5)
f <- c(bx$stats[1], bx$stats[2], bx$stats[3], bx$stats[4], bx$stats[5], bx$out[1])   ## the five box statistics plus the outlier

text(rep(1.25, 6), f, labels = c("minimum", "lower quartile", "median", "upper quartile", "maximum", "outlier"), 
    adj = c(0, 0), cex = 1.3)

(Plot: boxplots of the four x variables and the four y variables, plus the annotated boxplot of the simulated variable.)

So, do not put blind trust in the easy statistics (mean and standard deviation): the median may be a more reliable indicator of how a variable is distributed.
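
A tiny illustration of why (my own addition): a single extreme value drags the mean, but barely moves the median.

x <- c(1, 2, 3, 4, 5)
mean(x); median(x)    ## both are 3
x[5] <- 500           ## replace one value with an outlier
mean(x); median(x)    ## the mean jumps to 102, the median is still 3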

Anyway, what does this infamous quartet actually look like?

## This plot is based on the 'anscombe' dataset included in the 'datasets' package (default in R)

require(stats); require(graphics)

##-- now some "magic" to do the 4 regressions in a loop:
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
  print(anova(lmi))
}
## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value Pr(>F)   
## x1         1   27.5   27.51      18 0.0022 **
## Residuals  9   13.8    1.53                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y2
##           Df Sum Sq Mean Sq F value Pr(>F)   
## x2         1   27.5   27.50      18 0.0022 **
## Residuals  9   13.8    1.53                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y3
##           Df Sum Sq Mean Sq F value Pr(>F)   
## x3         1   27.5   27.47      18 0.0022 **
## Residuals  9   13.8    1.53                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y4
##           Df Sum Sq Mean Sq F value Pr(>F)   
## x4         1   27.5   27.49      18 0.0022 **
## Residuals  9   13.7    1.53                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)

(Plot: “Anscombe’s 4 Regression data sets”, four scatterplots with their fitted regression lines.)

par(op)
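
By the way, the formula “magic” in that loop (assigning to ff[2:3]) can also be written with reformulate(), which builds a formula from character strings; a quick sketch:

## reformulate("x1", response = "y1") creates the formula y1 ~ x1
for(i in 1:4) {
  fmla <- reformulate(paste0("x", i), response = paste0("y", i))
  print(coef(lm(fmla, data = anscombe)))
}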

A linear regression model seems appropriate for the first dataset only; the second dataset is clearly curvilinear; in the third dataset a single outlier pulls the regression line in the wrong direction; in the fourth dataset… well, x is constant except for one extreme point that determines the whole fit, so I do not know what to say.

We can check the fit of the models with the standard regression diagnostic plots to see how well each one performed.

First model

par(mfrow = c(2, 2))
plot(fit1)

(Plot: regression diagnostics for fit1.)

Second model

par(mfrow = c(2, 2))
plot(fit2)

(Plot: regression diagnostics for fit2.)

Third model

par(mfrow = c(2, 2))
plot(fit3)

(Plot: regression diagnostics for fit3.)

Fourth model

par(mfrow = c(2, 2))
plot(fit4)
## Warning: not plotting observations with leverage one:
##   8
## Warning: not plotting observations with leverage one:
##   8

(Plot: regression diagnostics for fit4.)

The first model was reasonably acceptable on the basis of the regression diagnostics; the second and third models were poor; the fourth model was not even wrong.

So, plot the data and check the regression diagnostics before saying anything.
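
The numerical summaries tell the same misleading story (a sketch of mine): all four models report essentially the same R-squared (about 0.67, consistent with r = 0.82 above), so it is the diagnostic plots, not the fit statistics, that reveal the differences.

sapply(list(fit1, fit2, fit3, fit4), function(m) summary(m)$r.squared)
## all four values are approximately 0.67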

The Anscombe quartet has only eleven cases, and it is very unlikely that anyone would use a dataset with 11 cases to do anything serious…

However, it is always better to have a look at the data before taking any initiative.

More on this in the next parts of this tutorial.

I hope you’ve enjoyed this!

Have a nice day!

Twitter: @AntoViral

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.