- Load data from a URL.
- Describe the anatomy of a box plot.
- Use box plots to compare mean and median values between groups.
- Order groups by mean value.
- Subset data.
- Describe some pitfalls of using bar charts to display or compare means.
Bar charts are useful for displaying count data, for example, but are often used to portray statistical information that they don’t represent well. In this lesson we’ll learn to ‘Kick the bar chart habit’ by creating box plots as an alternative to bar charts. This lesson uses data from a multi-system survey of mouse physiology in 8 inbred founder strains and 54 F1 hybrids of the Collaborative Cross. The study is described in Lenarcic et al, 2012. For more information about this data set, see the CGDpheno3 data at Mouse Phenome Database.
Load the ggplot and scales libraries. You’ll need to install the packages first if you haven’t done so already. Install them from the Packages tab, or use the install.packages() command. Use double quotes around the package name.
install.packages("ggplot2")
You only need to install a package once to download it into your machine’s library. Once you have installed the package on your machine, you need to load the library in order to use the functions contained in the package.
library(ggplot2)
When you load a library you’ll get a warning message indicating the R version in which the library was built. If it’s different from the R version that you’re running, you might occasionally run into problems depending on the library and the functions it contains. To find out what version of R you have, type
version
## _
## platform x86_64-apple-darwin13.4.0
## arch x86_64
## os darwin13.4.0
## system x86_64, darwin13.4.0
## status
## major 3
## minor 3.3
## year 2017
## month 03
## day 06
## svn rev 72310
## language R
## version.string R version 3.3.3 (2017-03-06)
## nickname Another Canoe
The version of R is given as version.string, followed by the nickname for the version.
Load the data from this shortened URL. Mind the double quotes.
cc_data <- read.csv(file = "http://bit.ly/CGDPheno3")
Explore the data variables. The first 4 columns contain strain, sex, and ID numbers. The remaining contain phenotype measurements with abbreviated names.
names(cc_data)
## [1] "strain" "sex" "id"
## [4] "mouse_num" "CHOL" "HDL"
## [7] "GLU" "TG" "WBC"
## [10] "pctLYMP" "pctNEUT" "pctMONO"
## [13] "pctBASO" "pctEOS" "pctLUC"
## [16] "RBC" "pctRetic" "RDW"
## [19] "MCH" "MCHC" "CHCM"
## [22] "HDW" "MCV" "cHGB"
## [25] "mHGB" "HCT" "PLT"
## [28] "MPV" "pulse" "pulse_std"
## [31] "systolic_BP" "systolic_BP_std" "CV"
## [34] "HR" "HRV" "R_amplitude"
## [37] "RS_amplitude" "N" "PQ"
## [40] "PR" "QRS" "QT"
## [43] "QT_dispersion" "QTc" "QTc_dispersion"
## [46] "RR" "ST" "BMC"
## [49] "BMD" "bone_area" "total_area"
## [52] "LTM" "pct_fat" "TTM"
## [55] "bw"
How many mice?
dim(cc_data)
## [1] 642 55
How many mice of each sex?
table(cc_data$sex)
##
## f m
## 321 321
How many mice of each strain?
table(cc_data$strain)
##
## 129S1/SvImJ 129SAF1 129SB6F1 129SCASTF1 129SNODF1 129SNZOF1
## 10 10 10 10 5 21
## 129SPWKF1 129SWSBF1 A/J A129SF1 AB6F1 ACASTF1
## 10 10 10 10 10 10
## ANODF1 ANZOF1 APWKF1 AWSBF1 B6129SF1/J B6AF1/J
## 10 10 10 10 10 10
## B6CASTF1 B6NODF1 B6NZOF1 B6PWKF1 B6WSBF1 C57BL/6J
## 10 10 10 11 10 10
## CAST/EiJ CAST129SF1 CASTAF1 CASTB6F1 CASTNODF1 CASTNZOF1
## 11 10 10 10 10 10
## CASTPWKF1 CASTWSBF1 NOD/ShiLtJ NOD129SF1 NODAF1 NODB6F1
## 10 10 10 10 10 10
## NODCASTF1 NODNZOF1 NODPWKF1 NODWSBF1 NZO/HlLtJ NZO129SF1
## 10 10 12 10 10 10
## NZOAF1 NZOB6F1 NZONODF1 NZOWSBF1 PWK/PhJ PWK129SF1
## 9 10 10 10 10 12
## PWKAF1 PWKB6F1 PWKCASTF1 PWKNODF1 PWKNZOF1 PWKWSBF1
## 10 10 10 10 10 10
## WSB/EiJ WSB129SF1 WSBAF1 WSBB6F1 WSBCASTF1 WSBNODF1
## 17 10 10 10 11 13
## WSBNZOF1 WSBPWKF1
## 10 10
How many mice of each strain by sex?
table(cc_data$sex, cc_data$strain)
##
## 129S1/SvImJ 129SAF1 129SB6F1 129SCASTF1 129SNODF1 129SNZOF1 129SPWKF1
## f 5 5 5 5 5 7 5
## m 5 5 5 5 0 14 5
##
## 129SWSBF1 A/J A129SF1 AB6F1 ACASTF1 ANODF1 ANZOF1 APWKF1 AWSBF1
## f 5 5 5 5 5 5 5 5 5
## m 5 5 5 5 5 5 5 5 5
##
## B6129SF1/J B6AF1/J B6CASTF1 B6NODF1 B6NZOF1 B6PWKF1 B6WSBF1 C57BL/6J
## f 5 5 5 5 5 6 5 5
## m 5 5 5 5 5 5 5 5
##
## CAST/EiJ CAST129SF1 CASTAF1 CASTB6F1 CASTNODF1 CASTNZOF1 CASTPWKF1
## f 5 5 5 5 5 5 5
## m 6 5 5 5 5 5 5
##
## CASTWSBF1 NOD/ShiLtJ NOD129SF1 NODAF1 NODB6F1 NODCASTF1 NODNZOF1
## f 5 5 5 5 5 5 5
## m 5 5 5 5 5 5 5
##
## NODPWKF1 NODWSBF1 NZO/HlLtJ NZO129SF1 NZOAF1 NZOB6F1 NZONODF1 NZOWSBF1
## f 7 5 5 5 4 5 5 5
## m 5 5 5 5 5 5 5 5
##
## PWK/PhJ PWK129SF1 PWKAF1 PWKB6F1 PWKCASTF1 PWKNODF1 PWKNZOF1 PWKWSBF1
## f 5 7 5 5 5 5 5 5
## m 5 5 5 5 5 5 5 5
##
## WSB/EiJ WSB129SF1 WSBAF1 WSBB6F1 WSBCASTF1 WSBNODF1 WSBNZOF1 WSBPWKF1
## f 7 5 5 5 6 7 5 5
## m 10 5 5 5 5 6 5 5
How do the first few rows of data look? Note the NAs in the data. These are missing values and can complicate analyses unless specifically addressed.
head(cc_data)
## strain sex id mouse_num CHOL HDL GLU TG WBC pctLYMP
## 1 129S1/SvImJ f CCF1F01 92297 100 86.8 140 86 11.17 84.4
## 2 129S1/SvImJ f CCF1F02 92298 104 94.3 143 56 12.84 81.4
## 3 129S1/SvImJ f CCF1F03 92299 111 86.7 126 80 9.67 78.2
## 4 129S1/SvImJ f CCF1F04 92300 107 92.6 161 80 9.84 90.4
## 5 129S1/SvImJ f CCF1F05 92301 115 97.7 155 84 10.29 89.3
## 6 129S1/SvImJ m CCF1M01 92302 137 122.9 167 90 8.90 85.1
## pctNEUT pctMONO pctBASO pctEOS pctLUC RBC pctRetic RDW MCH MCHC CHCM
## 1 11.1 0.7 0.3 3.0 0.6 9.89 2.7 15.1 16.2 33.8 31.9
## 2 15.2 1.4 0.3 1.1 0.6 10.43 2.5 14.5 15.8 33.2 31.8
## 3 17.5 1.5 0.3 1.9 0.6 9.79 4.1 16.4 15.7 33.2 31.8
## 4 5.9 1.2 0.1 1.8 0.4 10.22 3.4 15.0 15.9 33.0 31.4
## 5 6.0 1.2 0.2 2.8 0.4 10.70 2.4 14.2 15.9 33.4 31.8
## 6 10.7 1.3 0.2 2.1 0.6 10.51 2.9 13.8 15.6 32.7 31.3
## HDW MCV cHGB mHGB HCT PLT MPV pulse pulse_std systolic_BP
## 1 1.88 47.8 15.1 16.0 47.3 771 6.8 NA NA NA
## 2 1.95 47.6 15.8 16.5 49.7 643 8.5 NA NA NA
## 3 2.10 47.4 14.8 15.4 46.4 762 7.5 NA NA NA
## 4 1.97 48.1 15.4 16.2 49.2 923 8.3 NA NA NA
## 5 1.98 47.8 16.3 17.1 51.1 783 8.3 NA NA NA
## 6 1.77 47.5 15.7 16.4 50.0 791 7.8 NA NA NA
## systolic_BP_std CV HR HRV R_amplitude RS_amplitude N PQ PR QRS QT
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA NA
## QT_dispersion QTc QTc_dispersion RR ST BMC BMD bone_area total_area
## 1 NA NA NA NA NA 0.31 0.07 5.84 9.92
## 2 NA NA NA NA NA 0.26 0.07 4.83 9.77
## 3 NA NA NA NA NA 0.27 0.07 4.90 9.77
## 4 NA NA NA NA NA 0.25 0.06 4.89 10.95
## 5 NA NA NA NA NA 0.26 0.07 4.93 10.83
## 6 NA NA NA NA NA 0.27 0.06 5.12 11.63
## LTM pct_fat TTM bw
## 1 8.00 13.91 9.71 21.89
## 2 7.10 19.88 9.63 24.33
## 3 7.95 19.56 10.54 22.25
## 4 7.65 23.81 10.50 22.57
## 5 7.45 22.98 10.44 23.07
## 6 9.20 21.93 12.41 27.44
Use the ggplot() function, which is found in the ggplot2 library. Quick reminder of ggplot syntax: ggplot(data, mapping) + layer(). Plot red blood cells by strain.
ggplot(data = cc_data, mapping = aes(x = strain, y = RBC)) +
geom_boxplot()
In a boxplot, the upper whisker extends to the highest value within 1.5 * inter-quartile range (IQR, or distance between first and third quartiles) and the lower whisker extends to the lowest value within 1.5 * IQR of the hinge. Data beyond the end of the whiskers (outliers) are plotted as points.
It’s difficult to distinguish the strain names on the x-axis, so flip the coordinates to place strain on the y-axis and red blood cells on the x-axis.
ggplot(data = cc_data, mapping = aes(x = strain, y = RBC)) +
geom_boxplot() +
coord_flip()
Sort the strains by mean red blood cells. Do this by re-ordering strains within the mapping function aes(). Save the plot as a variable.
ggplot(data = cc_data, mapping = aes(x = reorder(strain, RBC, FUN = "mean", na.rm = TRUE), y = RBC)) +
geom_boxplot() +
coord_flip()
Add a point indicating the mean RBC value for each strain. Add a statistical summary layer to do this. Specify the color, shape and size of the point marking the mean.
ggplot(data = cc_data, mapping = aes(x = reorder(strain, RBC, FUN = "mean", na.rm = TRUE), y = RBC)) +
geom_boxplot() +
coord_flip() +
stat_summary(fun.y = "mean", geom = "point", colour = "mediumpurple4", shape = 15, size = 2)
You should see a purple square indicating the mean red blood cell value for each strain. Is it the same as the median value for each strain? Notice that the mean value is sensitive to outliers, while the median value is not sensitive to outliers. Find the boxplot for WSBCASTF1. Notice that a single data point with a value greater than 11.5 pulls the mean value for this strain far over to the right.
Plot the data points over each boxplot. Since ggplot builds a plot layer by layer, the boxplot layer should come before the data points so as not to obscure them.
ggplot(data = cc_data, mapping = aes(x = reorder(strain, RBC, FUN = "mean", na.rm = TRUE), y = RBC)) +
geom_boxplot() +
geom_point() +
coord_flip() +
stat_summary(fun.y = "mean", geom = "point", colour = "mediumpurple4", shape = 15, size = 2)
Color the data points by sex. Save the plot as a variable. To view the plot, type the name of the variable.
rbc_boxplot <- ggplot(data = cc_data, mapping = aes(x = reorder(strain, RBC, FUN = "mean", na.rm = TRUE),
y = RBC)) +
geom_boxplot() +
geom_point(aes(colour = sex)) +
coord_flip() +
stat_summary(fun.y = "mean", geom = "point", colour = "mediumpurple4", shape = 15, size = 2)
rbc_boxplot
Add axis labels. Redefine the plot variable.
rbc_boxplot <- rbc_boxplot +
xlab("strain") +
ylab("red blood cell count (n/uL)")
rbc_boxplot
Add a title. Redefine the plot variable.
rbc_boxplot <- rbc_boxplot +
ggtitle("Red Blood Cell Distribution by Strain")
rbc_boxplot
Select a subset of the strains. Choose strains with the highest and lowest mean and median red blood cell counts. Include the parent strains of the F1s.
subset.cc_data <- subset(cc_data, strain %in% c("ACASTF1", "APWKF1", "CAST/EiJ", "PWK/PhJ",
"A/J", "NODAF1", "NOD/ShiLtJ") == TRUE)
Create boxplots from the subset.
ggplot(data = subset.cc_data, mapping = aes(x = strain, y = RBC)) +
geom_boxplot()
Order by mean RBC value as before. Save the plot as a variable.
subset_boxplot <- ggplot(data = subset.cc_data,
mapping = aes(x = reorder(strain, RBC, FUN = "mean", na.rm = TRUE),
y = RBC)) +
geom_boxplot()
subset_boxplot
This time there’s no need to flip the axes since the strain names are legible on the x-axis. Plot the data points by sex.The boxplots have already been drawn and saved in the variable subset_boxplot. Layer the data points on top of the boxplots.
subset_boxplot <- subset_boxplot +
geom_point(aes(colour = sex))
subset_boxplot
Add a purple square indicating the mean RBC value for each strain.
subset_boxplot <- subset_boxplot +
stat_summary(fun.y = "mean", geom = "point", colour = "mediumpurple4", shape = 15, size = 2)
subset_boxplot
Add x and y axis labels.
subset_boxplot <- subset_boxplot +
xlab("strain") +
ylab("red blood cell count (n/uL)")
subset_boxplot
Add a title.
subset_boxplot <- subset_boxplot +
ggtitle("Red Blood Cell Distribution by Strain")
subset_boxplot
Output the plot to a PDF file. Set width and height. Turn off the output to pdf with the dev.off() command.
pdf("subset-boxplot.pdf", width= 8, height = 9)
print(subset_boxplot)
dev.off()
## quartz_off_screen
## 2
Picture a set of bar charts indicating the mean and s.e.m. for each strain. Which plot communicates more information, a bar chart or a box plot?
Code Challenge: Choose another phenotype to plot as boxplots by strain. Order boxplots by mean phenotype value. Flip the coordinates if necessary to make strain names legible. Add a point indicating the mean strain value. Add data points over the boxplots. Add axis labels and a plot title.