Scatter plots are a basic tool of the trade in exploratory data analysis. They are easy to make and give us a bird's-eye view of the entire data set. They can reveal even more interesting patterns when color is used to represent additional variables.
library(gcookbook) # For the data set
library(ggplot2)
Let us quickly examine the data:
str(heightweight)
## 'data.frame': 236 obs. of 5 variables:
## $ sex : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
## $ ageYear : num 11.9 12.9 12.8 13.4 15.9 ...
## $ ageMonth: int 143 155 153 161 191 171 185 142 160 140 ...
## $ heightIn: num 56.3 62.3 63.3 59 62.5 62.5 59 56.5 62 53.8 ...
## $ weightLb: num 85 105 108 92 112 ...
head(heightweight)
## sex ageYear ageMonth heightIn weightLb
## 1 f 11.92 143 56.3 85.0
## 2 f 12.92 155 62.3 105.0
## 3 f 12.75 153 63.3 108.0
## 4 f 13.42 161 59.0 92.0
## 5 f 15.92 191 62.5 112.5
## 6 f 14.25 171 62.5 112.0
This data set gives the height (inches), weight (pounds), age and sex of 236 pre-teen and teenage boys and girls. We first make a default scatter plot of height against age.
ggplot(heightweight, aes(x = ageYear, y = heightIn)) + geom_point(size = 3)
As expected, height increases roughly linearly with age. We also see a good amount of variation in the heights, from about 50 to 72 inches. What could explain this variation? Is there any difference in height between boys and girls?
To answer this question, we color each dot by the gender of the child.
# theme() styles this plot only; theme_update() would change the global theme instead
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = sex)) +
  geom_point(size = 4) +
  theme(panel.background = element_rect(fill = "darkgray", colour = "lightgray"))
This plot tells us many things:
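A visual reading of the plot can be checked against a numerical summary. The following is a minimal sketch, assuming that a simple mean by sex and a linear model with age and sex as predictors are adequate for a first look; the names `height_by_sex` and `fit` are introduced here for illustration only.

```r
library(gcookbook)  # provides the heightweight data set

# Mean height (inches) for girls ("f") and boys ("m")
height_by_sex <- aggregate(heightIn ~ sex, data = heightweight, FUN = mean)
print(height_by_sex)

# Height as a function of age and sex; the "sexm" coefficient estimates
# the boy-girl height difference at a given age
fit <- lm(heightIn ~ ageYear + sex, data = heightweight)
print(coef(fit))
```

The model is deliberately simple: it assumes the same growth rate for both sexes, which the plot itself may or may not support.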
Next, we redo the scatter plot with the color of each dot representing the weight of the child:
# theme() styles this plot only; theme_update() would change the global theme instead
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = weightLb)) +
  geom_point(size = 4) +
  scale_colour_gradientn(colours = c("darkred", "orange", "yellow", "white")) +
  theme(panel.background = element_rect(fill = "darkgray", colour = "lightgray"))
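Dark red marks the lightest children and white the heaviest, with orange and yellow representing intermediate weights. Because the color encodes raw weight, it has to be read against the child's height and age: a dark red dot high on the plot suggests a child who may be underweight, and a white dot low on the plot suggests a tendency towards obesity. This can help child health specialists quickly identify children who may need attention.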
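The extremes of the color scale can also be pulled out as a short table. This is a minimal sketch, assuming that the few lightest and heaviest children are the ones of interest; the cut-off `n_show` and the name `by_weight` are hypothetical choices made for illustration.

```r
library(gcookbook)  # provides the heightweight data set

# Sort the children from lightest to heaviest
by_weight <- heightweight[order(heightweight$weightLb), ]

n_show <- 5  # hypothetical cut-off, adjust as needed

# Lightest children (dark red end of the scale)
print(head(by_weight, n_show))

# Heaviest children (white end of the scale)
print(tail(by_weight, n_show))
```

Keeping the age and height columns in the output makes it possible to judge each weight in context rather than in isolation.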
If you also want to show the sex of each child on the same plot, the shape of the dots can be used to distinguish boys from girls:
# 17 = filled triangle (girls), 19 = filled circle (boys)
ptshape <- ifelse(heightweight$sex == "f", 17, 19)
ggplot(heightweight, aes(x = ageYear, y = heightIn, colour = weightLb)) +
  geom_point(size = 4, shape = ptshape) +
  scale_colour_gradientn(colours = c("darkred", "orange", "yellow", "white")) +
  theme(panel.background = element_rect(fill = "darkgray", colour = "lightgray"))
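Passing a fixed `shape = ptshape` vector to `geom_point()` draws the shapes but gives no legend explaining them. An alternative sketch, assuming a legend is wanted, maps `sex` to shape inside `aes()` and picks the same two shapes with `scale_shape_manual()`. It uses `theme()` so the styling applies to this plot only; the variable name `p` is introduced for illustration.

```r
library(gcookbook)  # provides the heightweight data set
library(ggplot2)

p <- ggplot(heightweight,
            aes(x = ageYear, y = heightIn, colour = weightLb, shape = sex)) +
  geom_point(size = 4) +
  # Same shapes as before: 17 = filled triangle, 19 = filled circle
  scale_shape_manual(values = c(f = 17, m = 19)) +
  scale_colour_gradientn(colours = c("darkred", "orange", "yellow", "white")) +
  theme(panel.background = element_rect(fill = "darkgray", colour = "lightgray"))

print(p)
```

Mapping the variable rather than precomputing shape codes also keeps the plot correct if the data frame is later filtered or reordered.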