Exploring the Abalone dataset with ggplot2

I did a brief exploration of the Abalone dataset as a way of learning about the ggplot2 package.

First, let's read the dataset into a data frame and add column names:

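# stringsAsFactors = TRUE keeps sex as a factor (the default is FALSE since R 4.0),
# which is what the summary() output below assumes
abalone <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", 
    header = FALSE, stringsAsFactors = TRUE)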
names(abalone) <- c("sex", "length", "diameter", "height", "weight.whole", "weight.shucked", 
    "weight.viscera", "weight.shell", "rings")

Let's take a quick look at the data:

summary(abalone)
##  sex          length         diameter         height       weight.whole  
##  F:1307   Min.   :0.075   Min.   :0.055   Min.   :0.000   Min.   :0.002  
##  I:1342   1st Qu.:0.450   1st Qu.:0.350   1st Qu.:0.115   1st Qu.:0.442  
##  M:1528   Median :0.545   Median :0.425   Median :0.140   Median :0.799  
##           Mean   :0.524   Mean   :0.408   Mean   :0.140   Mean   :0.829  
##           3rd Qu.:0.615   3rd Qu.:0.480   3rd Qu.:0.165   3rd Qu.:1.153  
##           Max.   :0.815   Max.   :0.650   Max.   :1.130   Max.   :2.825  
##  weight.shucked  weight.viscera    weight.shell        rings      
##  Min.   :0.001   Min.   :0.0005   Min.   :0.0015   Min.   : 1.00  
##  1st Qu.:0.186   1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 8.00  
##  Median :0.336   Median :0.1710   Median :0.2340   Median : 9.00  
##  Mean   :0.359   Mean   :0.1806   Mean   :0.2388   Mean   : 9.93  
##  3rd Qu.:0.502   3rd Qu.:0.2530   3rd Qu.:0.3290   3rd Qu.:11.00  
##  Max.   :1.488   Max.   :0.7600   Max.   :1.0050   Max.   :29.00

Note that sex has three levels: F (female), I (infant), and M (male).
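
If you prefer readable labels over one-letter codes, one option is to recode the values as a factor with explicit labels. This is only a sketch on a made-up toy vector (`sex_codes` and `sex_labels` are names I invented for illustration); the rest of this post keeps the original codes.

```r
# Toy vector of sex codes (made up for illustration, not taken from the dataset)
sex_codes <- c("M", "F", "I", "F")

# Map the one-letter codes to readable labels; levels and labels are paired by position
sex_labels <- factor(sex_codes, levels = c("F", "I", "M"),
                     labels = c("Female", "Infant", "Male"))

# Count how many observations fall under each label
print(table(sex_labels))
```

On the real data, the same call would presumably be applied to abalone$sex.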

Let's briefly explore rings (the number of shell rings, a proxy for age) using a density plot, and see how the distribution differs by sex:

library(ggplot2)
ggplot(abalone) + aes(rings, color = sex) + geom_density()

That's interesting: the density curves for females and males are almost identical, while infants clearly have fewer rings. (And just by chance, red was mapped to Female, blue was mapped to Male, and green was mapped to Infant… how funny!)
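
One way to back up a visual impression like this is to compare a summary statistic of rings across the sexes. The following is a minimal sketch, assuming a data frame with `sex` and `rings` columns; `rings_by_sex` is a hypothetical helper and `toy` is made-up data, not the real measurements.

```r
# Hypothetical helper: median ring count for each sex
rings_by_sex <- function(df) {
    aggregate(rings ~ sex, data = df, FUN = median)
}

# Toy data, made up for illustration only
toy <- data.frame(
    sex = c("F", "F", "I", "I", "M", "M"),
    rings = c(10, 12, 6, 8, 9, 13)
)

# One row per sex, sorted alphabetically (F, I, M)
toy_medians <- rings_by_sex(toy)
print(toy_medians)
```

Calling rings_by_sex(abalone) would give the corresponding medians for the real dataset.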

Okay, let's try stacking those instead:

ggplot(abalone) + aes(rings, fill = sex) + geom_density(position = "stack")

I think the first plot was more informative: once the curves are stacked, it is hard to compare the shapes of the individual distributions.

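Let's move on and look at the relationship between length and rings, and how it varies by sex. This plot uses several ggplot2 options to look more polished: custom axis labels and a title, a boxed legend placed in the top-left corner of the plot area, and full names for the legend entries: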

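ggplot(abalone) + aes(length, rings, color = sex) + geom_point() +
    labs(x = "Shell Length", y = "Number of Rings", title = "Number of Rings vs Length",
        color = "Sex of Abalone") +
    # boxed legend in the top-left corner of the plot area (a numeric legend.position
    # is deprecated as of ggplot2 3.5.0 in favor of legend.position.inside)
    theme(legend.position = c(0, 1), legend.justification = c(0, 1),
        legend.background = element_rect(fill = "white", color = "black")) +
    # labels follow the alphabetical level order F, I, M
    scale_color_hue(labels = c("Female", "Infant", "Male"))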

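You could also accomplish something similar by faceting, which gives each sex its own panel. Here I also add a linear fit to each panel and hide the now-redundant legend: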

ggplot(abalone) + aes(length, rings, color = sex) + geom_point() +
    labs(x = "Shell Length", y = "Number of Rings", title = "Number of Rings vs Length") +
    facet_grid(. ~ sex, labeller = label_both) +
    stat_smooth(method = "lm", se = FALSE) +
    theme(legend.position = "none")
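
The lines drawn by stat_smooth(method = "lm") are separate linear regressions of rings on length, one per sex. To see the slopes as numbers rather than lines, something like the following sketch should work; `slope_by_sex` is a hypothetical helper and `toy` is made-up data chosen so the fits are exact.

```r
# Hypothetical helper: slope of rings ~ length, fitted separately for each sex
slope_by_sex <- function(df) {
    sapply(split(df, df$sex), function(d) {
        coef(lm(rings ~ length, data = d))[["length"]]
    })
}

# Toy data, made up for illustration only (each group lies exactly on a line)
toy <- data.frame(
    sex = rep(c("F", "I", "M"), each = 3),
    length = c(0.4, 0.5, 0.6, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6),
    rings = c(8, 10, 12, 4, 5, 6, 9, 10, 11)
)

# Named numeric vector with one slope per sex
toy_slopes <- slope_by_sex(toy)
print(toy_slopes)
```

With the real data, slope_by_sex(abalone) would report how many additional rings each group gains per unit of shell length.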
