December 11, 2015

Searching for truth in mtcars

A core task in data analysis is looking for relationships among the variables in our data. As an example, let's look at one of our favorite datasets in R: mtcars.

Anscombe's quartet - data exploration

R comes with anscombe, which includes four different x-y datasets. Let's explore the descriptive stats.

dim(anscombe)
## [1] 11  8
summary(anscombe)
##        x1             x2             x3             x4    
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19  
##        y1               y2              y3              y4        
##  Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median : 7.580   Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :12.500

Anscombe's quartet - correlations

The means look very similar – how about correlations?

attach(anscombe)
cor(x1,y1)
## [1] 0.8164205
cor(x2,y2)
## [1] 0.8162365
cor(x3,y3)
## [1] 0.8162867
cor(x4,y4)
## [1] 0.8165214
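
The same four correlations can be computed in one pass without attach(), which avoids leaving x1, y1, and the other columns on the search path. This is a minimal sketch; the names pair_cor and "pair1" to "pair4" are illustrative and not part of the original slides.

```r
# Correlation of each x_i with its matching y_i, indexing columns by name
pair_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(pair_cor) <- paste0("pair", 1:4)

# All four values round to 0.82, matching the output above
round(pair_cor, 2)
```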

Anscombe's quartet - plots

These datasets definitely look very similar. Let's plot them to verify the relationship.
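
One way to draw the four scatterplots with base graphics is sketched below. The shared axis limits, the panel titles, and the red least-squares line are assumptions added for illustration; the original slide's plotting code is not shown here.

```r
# Draw the four x-y pairs in a 2 x 2 grid on shared axes
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
slopes <- numeric(4)
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)              # least-squares line for this pair
  slopes[i] <- coef(fit)[["x"]]
  plot(x, y, xlim = c(3, 20), ylim = c(2, 14),
       main = paste("Dataset", i), pch = 19)
  abline(fit, col = "red")
}
par(old_par)                    # restore the previous layout

# The fitted slopes are nearly identical even though the shapes differ
round(slopes, 2)
```

Using the same axis limits in every panel makes the differences in shape easy to compare by eye.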

Sexism at UC Berkeley

In 1973, UC Berkeley was sued for bias against women who had applied for graduate school. In that year, men were much more likely to be admitted than women (about 45% versus 30% in the table below), and the difference was too large to attribute to chance.

##           Gender
## Admit      Male Female
##   Admitted 1198    557
##   Rejected 1493   1278
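
The table above can be reproduced from the built-in UCBAdmissions dataset by summing over departments. This is a sketch; the variable names totals and rates are illustrative.

```r
# UCBAdmissions is a 2 x 2 x 6 table: Admit x Gender x Dept
totals <- apply(UCBAdmissions, c(1, 2), sum)  # sum over departments
totals

# Share of applicants admitted within each gender (columns sum to 1)
rates <- prop.table(totals, margin = 2)
round(rates["Admitted", ], 3)
```

With these counts, the admitted share is roughly 0.445 for men and 0.304 for women.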

Sexism at UC Berkeley??

You decide!

  • In R: load UCBAdmissions from {datasets}

  • In Excel: open "UCBAdmissions.csv"

  • Explore the data with various graphs and find the truth!

Simpson's paradox

A trend that appears in aggregated data can disappear, or even reverse, when the data are split into groups.

Simpson's paradox at UC Berkeley
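
A sketch of the per-department view, using the Dept dimension of UCBAdmissions. The names by_dept and applicants are illustrative. In this dataset, women have the higher admission rate in four of the six departments (A, B, D, and F), even though their overall rate is lower.

```r
# Admission rate by gender within each department:
# proportions are taken within each Gender x Dept cell
by_dept <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
round(by_dept, 2)

# Number of applicants by gender and department, to see where each group applied
applicants <- apply(UCBAdmissions, c(2, 3), sum)
applicants

# Departments where the female admission rate exceeds the male rate
names(which(by_dept["Female", ] > by_dept["Male", ]))
```

Comparing by_dept with applicants shows where the aggregate gap comes from: most women in this dataset applied to the departments with the lowest admission rates.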

Further resources & exploration