NOTE: today’s workshop material was written in rmarkdown and rendered with knitr. You can find the rmarkdown cheatsheet here: [https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf]
One of the most important parts of statistical analysis, if not the most important, is conveying your results to your audience in an effective and meaningful way. However, data visualisation is not just about presenting final models. Graphics are also useful for exploratory analyses and data cleaning (e.g. spotting outliers, weird distributions, etc).
But why R when I can produce graphs in SPSS, GraphPad or similar?
It is true that graphics can be created in other statistical packages. However, they are usually pretty basic and don’t pop!
Example Scatterplot in SPSS
R is different in that you have the flexibility to fine-tune your message…
… you can use it to display 3D distributions…
… you can produce maps with results from spatial analysis…
… you can add extra information to your graphs…
… you can even overlay one type of graph on another.
Several R packages can be used to create data visualisations. Today you’ll be learning how to use ggplot2, a tidyverse package (Pallavi mentioned in workshop 2 that you can pipe your cleaned data into a ggplot function).
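To make the piping idea concrete, here is a minimal sketch. It assumes the dplyr pipe (%>%) is the one used in workshop 2, and the filter step is only an illustration of a "cleaning" step, not part of the original material.

```r
library(dplyr)    # provides filter() and the %>% pipe
library(ggplot2)

# Keep only the cars with more than four cylinders, then pipe the result
# straight into ggplot(). Pipes (%>%) connect data steps; + adds plot layers.
p <- mtcars %>%
  filter(cyl > 4) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

p
```

Because the data arrive through the pipe, ggplot() is called without a data argument.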
A grammar of a language defines the rules of structuring words and phrases into meaningful expressions. ggplot2 implements its own grammar: a consistent language for describing and building visualisations. The good thing about having a consistent language is that you can learn faster as the options are similar across the ggplot2 functions. One thing to remember is even though the grammar is consistent, you still need to be clever about which types of plots you choose.
“This is easy to see by analogy to the English language: good grammar is just the first step in creating a good sentence”. - Wickham (2010)
You can learn more about the Grammar of Graphics [http://vita.had.co.nz/papers/layered-grammar.pdf]
Today you will be shown the basics of the ggplot2 grammar of graphics and produce an array of different plots. There will also be some handy tricks shown along the way for tweaking them.
In the examples today, I will show you some of the components that make up a plot, including the data, aesthetics, geoms, scales, facets and themes.
Every ggplot graphic has the same template beginning:
ggplot(data, aes(...))
Geometric objects (aka geoms) are the actual marks we put on a plot. Examples include:
- points (geom_point, for scatter plots, dot plots, etc)
- lines (geom_line, for time series, trend lines, etc)
- boxplots (geom_boxplot, for, well, boxplots!)
- histograms (geom_histogram)
- density curves (geom_density)
- violin plots (geom_violin)
A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.
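The template and the + operator can be sketched as follows. This is a minimal illustration using the built-in mtcars data; the object names p, p_points and p_both are made up for the example.

```r
library(ggplot2)

# Template: data and aesthetic mappings go in ggplot(); no geom has been added yet
p <- ggplot(mtcars, aes(x = wt, y = mpg))

# Add one geom with + to draw the points...
p_points <- p + geom_point()

# ...or add two geoms to layer a line over the same points
p_both <- p + geom_point() + geom_line()

p_both
```

Each + adds a layer, so the same starting object p can be reused to try different geoms.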
You can get a list of available geometric objects using the code below:
help.search("geom_", package = "ggplot2")
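An alternative that returns the geom names as a character vector is sketched below. It assumes ggplot2 is already attached with library(); geoms is just an illustrative variable name.

```r
library(ggplot2)

# List every object exported by ggplot2 whose name starts with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")

# Show the first few names
head(geoms)
```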
Aesthetics are required and will change depending on the chosen geom. The aesthetics used in today’s examples include x, y, colour and fill.
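One distinction worth seeing in code is mapping an aesthetic to a variable versus fixing it to a single value. The sketch below uses the built-in mtcars data; the plot names and the chosen colour, size and alpha values are arbitrary.

```r
library(ggplot2)

# Mapped aesthetic: colour varies with a variable, so it goes inside aes()
# and ggplot2 adds a legend for it
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg, colour = as.factor(cyl))) +
  geom_point()

# Fixed aesthetic: every point gets the same value, so it goes outside aes()
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "purple", size = 3, alpha = 0.5)

p_mapped
p_fixed
```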
Some things to remember:
- ggplot2 expects your data in a data.frame, not a matrix
library(ggplot2)
#let's use the mtcars dataset as an example. We'll assign it to the data frame 'x' to make coding easier, then look at the structure.
x <- mtcars
x
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
str(x)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
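The str() output shows that grouping variables such as cyl, vs and am are stored as numbers, which is why the plots in this workshop wrap them in as.factor(). An optional alternative, sketched here, is to convert them once with readable labels. The copy x2 is a hypothetical name, and the labels follow the coding given in ?mtcars.

```r
# Work on a copy so the original data stay untouched (x2 is a made-up name)
x2 <- mtcars

# Coding from ?mtcars: am 0 = automatic, 1 = manual; vs 0 = V-shaped, 1 = straight
x2$cyl <- factor(x2$cyl)
x2$am  <- factor(x2$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
x2$vs  <- factor(x2$vs, levels = c(0, 1), labels = c("V-shaped", "Straight"))

# Check the converted columns
str(x2[, c("cyl", "am", "vs")])
table(x2$am)
```

With labelled factors, legends and axes pick up the labels automatically, so the labels arguments in the scale functions become optional.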
#call the base function and add the relevant "geom".
ggplot(x, aes(x=hp)) +
geom_histogram()
ggplot(x, aes(x=hp)) +
geom_density()
#you can perform these with multiple factors.
ggplot(x, aes(x=hp, colour=as.factor(am))) +
geom_density()
ggplot(x, aes(x=mpg, y=disp, colour=as.factor(cyl)))+
geom_point()
#you can add main and axes titles
ggplot(x, aes(x=mpg, y=disp, colour=as.factor(cyl)))+
geom_point() + xlab("Miles per Gallon") + ylab("Displacement (cu.in)") +
ggtitle("MPG and Displacement by Number of Car Cylinders")
#and update the legend easily
ggplot(x, aes(x=mpg, y=disp, colour=as.factor(cyl)))+
geom_point() + xlab("Miles per Gallon") + ylab("Displacement (cu.in)") +
ggtitle("MPG and Displacement by Number of Car Cylinders") +
scale_color_discrete(name = "Number of Cylinders",
labels=c("4", "6", "8"))
#you can also change the background to get rid of some noise too!
ggplot(x, aes(x=mpg, y=disp, colour=as.factor(cyl)))+
geom_point() + xlab("Miles per Gallon") +
ylab("Displacement (cu.in)") +
ggtitle("MPG and Displacement by Number of Car Cylinders") +
scale_color_discrete(name = "Number of Cylinders",
labels=c("4", "6", "8")) +
theme_bw()
#want to add a regression line, no worries ;)
ggplot(x, aes(x=mpg, y=disp, colour=as.factor(cyl)))+
geom_point() + xlab("Miles per Gallon") +
ylab("Displacement (cu.in)") +
ggtitle("MPG and Displacement by Number of Car Cylinders") +
scale_color_discrete(name = "Number of Cylinders",
labels=c("4", "6", "8")) +
theme_bw() +
geom_smooth(method = "lm",
formula = y ~ x + log(x), se = FALSE,
color = "purple")
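Where the colour mapping sits decides how many lines geom_smooth() fits. The sketch below is an illustration with straight-line fits (formula y ~ x, a simplification of the formula used above) on the built-in mtcars data; the object names are made up.

```r
library(ggplot2)

# Colour mapped in ggplot(): every layer inherits it, so one line is fitted per cylinder group
p_by_group <- ggplot(mtcars, aes(x = mpg, y = disp, colour = as.factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Colour mapped only in geom_point(): the smoother sees no groups, so one overall line is fitted
p_overall <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point(aes(colour = as.factor(cyl))) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_by_group
p_overall
```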
#or facet on another variable and move the legend into the empty space left by the facets.
ggplot(x, aes(x=mpg, y=disp, colour=as.factor(cyl)))+
geom_point() + xlab("Miles per Gallon") +
ylab("Displacement (cu.in)") +
ggtitle("MPG and Displacement by Number of Car Cylinders") +
scale_color_discrete(name = "Number of Cylinders",
labels=c("4", "6", "8")) +
facet_wrap(~ gear, nrow = 2) +
theme(legend.position = c(0.8, 0.2),
legend.background = element_rect(fill="lightgrey",
size=0.5, linetype="solid",
colour ="purple"))
#using the same dataset but we'll change the variables.
ggplot(x, aes(x=as.factor(vs), y=qsec, colour=as.factor(am))) +
geom_boxplot()
#and as above, we can make some updates to our graph, including labelling the groups on the x-axis.
ggplot(x, aes(x=as.factor(vs), y=qsec, fill=as.factor(am))) +
geom_boxplot() + xlab("Engine Shape") + ylab("1/4 Mile Time (s)") +
ggtitle("1/4 Mile Time by Engine Shape and Transmission") +
scale_fill_discrete(name = "Transmission",
labels=c("Automatic", "Manual")) +
theme_linedraw() +
scale_x_discrete(labels=c("0" = "V-shaped", "1" = "Straight"))
#what if you wanted to see the actual points over the top of the boxplots? (I've made these 50% transparent with the alpha argument.)
ggplot(x, aes(x=as.factor(vs), y=qsec)) +
geom_boxplot(aes(fill=as.factor(am))) +
geom_point(aes(x=as.factor(vs), y=qsec), position="jitter", alpha=0.5) +
xlab("Engine Shape") + ylab("1/4 Mile Time (s)") +
ggtitle("1/4 Mile Time by Engine Shape and Transmission") +
scale_fill_discrete(name = "Transmission",
labels=c("Automatic", "Manual")) +
theme_bw() +
scale_x_discrete(labels=c("0" = "V-shaped", "1" = "Straight"))
#what about turning the boxplots horizontal...
ggplot(x, aes(x=as.factor(vs), y=qsec)) +
geom_boxplot(aes(fill=as.factor(am))) +
geom_point(aes(x=as.factor(vs), y=qsec), position="jitter") +
xlab("Engine Shape") + ylab("1/4 Mile Time (s)") +
ggtitle("1/4 Mile Time by Engine Shape and Transmission") +
scale_fill_discrete(name = "Transmission",
labels=c("Automatic", "Manual")) +
theme_bw() +
coord_flip()
You can also combine multiple plots for publication. To do this, assign each plot to its own variable, then use gridExtra to bind them together.
a <- ggplot(x, aes(x=as.factor(vs), y=qsec)) +
geom_boxplot(aes(fill=as.factor(am))) +
geom_point(aes(x=as.factor(vs), y=qsec), position="jitter", alpha=0.5) +
xlab("Engine Shape") + ylab("1/4 Mile Time (s)") +
ggtitle("1/4 Mile Time by Engine Shape and Transmission") +
scale_fill_discrete(name = "Transmission",
labels=c("Automatic", "Manual")) +
theme_bw()
b <- ggplot(x, aes(x=mpg, y=disp, colour=as.factor(cyl)))+
geom_point() + xlab("Miles per Gallon") + ylab("Displacement (cu.in)") +
ggtitle("MPG and Displacement by Number of Car Cylinders") +
scale_color_discrete(name = "Number of Cylinders",
labels=c("4", "6", "8")) +
theme_bw()
#this plot maps colour (not fill), so the legend is edited with scale_colour_discrete
c <- ggplot(x, aes(x=hp, colour=as.factor(am))) +
geom_density() + scale_colour_discrete(name = "Transmission",
labels=c("Automatic", "Manual")) +
ggtitle("Gross Horsepower by Transmission")
d <- ggplot(x, aes(x=hp)) +
geom_histogram() +
theme(legend.title = element_blank()) +
ggtitle("Gross Horsepower")
library(gridExtra)
grid.arrange(a,b,c,d, ncol=2)
The ggsave() function allows you to export a plot created with ggplot. You can specify the dimensions and resolution of your plot by adjusting the appropriate arguments (width, height and dpi) to create high quality graphics for publication. To save the combined plot above, we first assign it to a variable plotFinal, then tell ggsave to save that plot in png format to your working directory.
plotFinal <- grid.arrange(a,b,c,d, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave(filename = "W3_finalCombinedPlot.png", plot = plotFinal, width = 12, height = 10)
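The call above leaves the resolution and units at their defaults. The sketch below sets units and dpi explicitly. It is a standalone illustration: the plot, the temporary output path and the chosen size are assumptions, not part of the workshop files.

```r
library(ggplot2)

# A small example plot to export
p <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 25)

# Hypothetical output path in a temporary folder; swap in your own file name
out_file <- file.path(tempdir(), "example_plot.png")

# Save at 12 cm x 10 cm and 300 dots per inch
ggsave(filename = out_file, plot = p,
       width = 12, height = 10, units = "cm", dpi = 300)

# Confirm the file was written
file.exists(out_file)
```

Journals often state a minimum dpi and a figure width, so setting both in the call keeps the export reproducible.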
Now that you’ve seen some of the ggplot2 functions, it’s your turn to try the following.
If you don’t have a dataset with you today, maybe try using one of these:
- iris - Edgar Anderson’s Iris Data
- ChickWeight - Weight versus age of chicks on different diets
- esoph - Smoking, Alcohol and (O)esophageal Cancer
- rock - Measurements on Petroleum Rock Samples
- sleep - Student’s Sleep Data
- swiss - Swiss Fertility and Socioeconomic Indicators (1888) Data
- women - Average Heights and Weights for American Women
They are available in base R so you can save them in an object like we did at the start: x <- iris for example.
For your dataset, perform a visual exploratory data analysis to determine the shape of your data. This may include a number of different types of plots, depending on your data.
Create a scatterplot with a regression curve. Include complete formatting of the plot.
Facet the above scatterplot on a factor in your dataset.
What does a violin plot look like?
Have a go at changing the colours used in the figure.
Choose any of your graphs. Alter the axis ticks and labels.
Maybe have a go with some of the plots above with your data?
Visit [http://www.cookbook-r.com/Graphs/] for any extra pointers.
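As a starting point for the colour and axis exercises, here is a hedged sketch using the built-in mtcars data. The colours, tick positions and label angle are arbitrary choices for illustration, so adapt them to your own plot.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg, y = disp, colour = as.factor(cyl))) +
  geom_point() +
  # Choose your own colours, one per group
  scale_colour_manual(name = "Number of Cylinders",
                      values = c("4" = "darkorange", "6" = "purple", "8" = "steelblue")) +
  # Place x-axis ticks every 5 mpg between 10 and 35
  scale_x_continuous(breaks = seq(10, 35, by = 5)) +
  # Rotate the x-axis tick labels by 45 degrees
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p
```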