A core task in data analysis is looking for relationships among variables in our data. As an example, let's look at one of our favorite datasets in R: mtcars
December 11, 2015
R comes with anscombe, which includes four different x-y datasets. Let's explore the descriptive stats.
dim(anscombe)
## [1] 11 8
summary(anscombe)
##        x1             x2             x3             x4
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19
##        y1               y2              y3              y4
##  Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 5.250
##  1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170
##  Median : 7.580   Median :8.140   Median : 7.11   Median : 7.040
##  Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 7.501
##  3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190
##  Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :12.500
The means look very similar – how about correlations?
attach(anscombe)
cor(x1,y1)
## [1] 0.8164205
cor(x2,y2)
## [1] 0.8162365
cor(x3,y3)
## [1] 0.8162867
cor(x4,y4)
## [1] 0.8165214
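The four calls above can also be collapsed into one loop that does not rely on `attach()`. This is only a sketch: the variable name `pairs_cor` and the labels `set1` to `set4` are made up for illustration.

```r
# Correlation for each x-y pair in anscombe, looked up by column name
pairs_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(pairs_cor) <- paste0("set", 1:4)  # hypothetical labels
round(pairs_cor, 3)
```

Indexing the columns by name avoids putting `x1`, `y1`, and the rest on the search path, where they could mask or be masked by other objects.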
These datasets look very similar by the numbers. Let's plot them to verify the relationship.
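One way to draw the plots, sketched with base graphics. The grid layout, axis limits, colors, and the `fits` list are illustrative choices, not taken from the original post.

```r
# 2x2 grid of scatterplots, one per Anscombe dataset, each with its fitted line
fits <- list()
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fits[[i]] <- lm(y ~ x)  # least-squares line for this pair
  plot(x, y, main = paste("Dataset", i),
       xlim = c(3, 20), ylim = c(2, 14), pch = 19)  # shared axes for comparison
  abline(fits[[i]], col = "red")
}
par(op)  # restore the previous graphics settings

# Intercept and slope of each fitted line, one column per dataset
round(sapply(fits, coef), 2)
```

The fitted lines come out nearly identical across the four panels, while the point clouds look nothing alike. That is the reason to plot before trusting summary statistics.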
In 1973, UC Berkeley was sued for bias against women who had applied for graduate school. In that year, men were much more likely to be admitted than women, and the difference was too large to attribute to chance.
##           Gender
## Admit      Male Female
##   Admitted 1198    557
##   Rejected 1493   1278
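These totals can be reproduced from R's built-in `UCBAdmissions` table, which covers the six largest departments. The sketch below collapses the table over departments and then converts counts to rates; `by_gender` and `rates` are names chosen here for illustration.

```r
# UCBAdmissions is a 3-way table: Admit x Gender x Dept
# Sum over departments to get the Admit x Gender totals
by_gender <- margin.table(UCBAdmissions, c(1, 2))
by_gender

# Share admitted and rejected within each gender (each column sums to 1)
rates <- prop.table(by_gender, margin = 2)
round(rates, 3)
```

Comparing the `Admitted` row of `rates` across the two columns gives the overall gap in admission rates that the claim of bias rests on.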
You decide!
In R: load UCBAdmissions from {datasets}
In Excel: open "UCBAdmissions.csv"
Explore the data with various graphs and find the truth!
This is Simpson's paradox: a trend that appears in aggregated data can disappear or even reverse when the data are split into groups.
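To see the reversal in the Berkeley data, compare admission rates within each department rather than overall. This is a minimal sketch using the built-in `UCBAdmissions` table; the names `admitted`, `applied`, and `dept_rates` are illustrative.

```r
# Admission rate for each gender within each department
admitted <- UCBAdmissions["Admitted", , ]         # Gender x Dept: admitted applicants
applied  <- margin.table(UCBAdmissions, c(2, 3))  # Gender x Dept: all applicants
dept_rates <- admitted / applied
round(dept_rates, 2)

# Departments where the female admission rate exceeds the male rate
names(which(dept_rates["Female", ] > dept_rates["Male", ]))
```

In this table, women have the higher admission rate in four of the six departments, even though their overall rate is lower. The aggregate gap comes largely from which departments men and women applied to, since the departments differ sharply in how selective they are.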
Other investigations of Simpson's paradox include:
- US median wage decline
- the "hot hand" effect in basketball
- Good for women, good for men, bad for people (article on medical intervention research)
- R package 'Simpsons'
- Kidney stone treatment, MLB batting averages, low birth weight among tobacco smoking mothers (see wiki link below).
Other resources:
- Wikipedia - Simpson's paradox
- Wikipedia - Anscombe's quartet
- Interactive demo from UC Berkeley, created using D3.js and AngularJS