Note: These are exercises from Wickham (2016, 2nd ed).
library(ggplot2)
To view the data set, type mpg. To inspect it more comfortably, use View(mpg) (note the capital V).
mpg
Let’s look at the components for creating a chart with ggplot2, using a scatterplot as the example.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
The pattern shown here is fundamental to ggplot2:
* data and aesthetic mappings are provided in ggplot(), then
* layers are added with +.
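The two steps can also be kept apart by storing the plot in a variable, which makes the "data and mappings first, layers second" pattern visible. This is a minimal sketch; the variable names base and scatter are made up for illustration, and inspecting the layers element is just one informal way to see what + has added.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only. No layer has been added yet,
# so printing this object would show empty axes without any points.
base <- ggplot(mpg, aes(x = displ, y = hwy))
length(base$layers)

# Step 2: adding a layer with + returns a new plot object with one layer.
scatter <- base + geom_point()
length(scatter$layers)

# Typing the object's name at the top level draws the plot.
scatter
```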
A short version of the above is
ggplot(mpg, aes(displ, hwy)) +
geom_point()
This produces exactly the same output as the longer version above.
The solutions are provided by running the code, but make sure you first try to figure things out for yourself. You can run commands in the Console, or by creating an R Script, typing commands, selecting with the mouse what you want to execute, and clicking Run.
How would you describe the relationship between cty, the average city mileage, and hwy, the average highway mileage?
ggplot(mpg, aes(cty, hwy)) +
geom_point()
The point to make here is that this strongly linear relationship is, of course, caused by another factor, engine size. That is to say, the description “the higher/lower cty, the higher/lower hwy” is to be taken as strictly descriptive.
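One way to look into the role of engine size is to colour the points by displ, the engine displacement. The sketch below is only an informal check, not a formal analysis; the variable names r_cty_hwy and p_confound are made up for illustration.

```r
library(ggplot2)

# Strength of the linear association between city and highway mileage.
r_cty_hwy <- cor(mpg$cty, mpg$hwy)
r_cty_hwy

# Same scatterplot as above, with engine displacement mapped to colour.
# If engine size drives both variables, the colour should change
# systematically along the diagonal band of points.
p_confound <- ggplot(mpg, aes(cty, hwy, colour = displ)) +
  geom_point()
p_confound
```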
ggplot(mpg, aes(cty, hwy)) + geom_point()
ggplot(diamonds, aes(carat, price)) + geom_point()
ggplot(economics, aes(date, unemploy)) + geom_line()
ggplot(mpg, aes(cty)) + geom_histogram()
To add variables to a plot, we need to map them onto aesthetics. In two dimensions, we can use the x and y axes, as shown in the scatterplots above. To add a third (or fourth, etc.) variable, we need to use aesthetics such as shape, colour, and size. (In the examples, class is the type of car, such as pickup; drv is the drivetrain, that is, front-wheel (f), rear-wheel (r), or four-wheel (4) drive; and cyl is the number of cylinders.)
As always, try to imagine what the plot will look like in each case before you click the Run Current Chunk button.
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
ggplot(mpg, aes(displ, hwy, shape = drv)) +
geom_point()
ggplot(mpg, aes(displ, hwy, size = cyl)) +
geom_point()
Do you find the scales provided by ggplot2 useful in these instances? You probably do, because they allow you to translate the aesthetics (colour, size, shape) back into values of the variable. ggplot2 is also pretty smart about its choice of scales, but of course these defaults can all be overridden.
To set the colour to a specific value, such as “blue”, the colour needs to be outside of the aes() expression, in a layer (remember, layers are described following the + sign):
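As a small illustration of overriding a default scale, the sketch below swaps the default colour palette for a ColorBrewer palette and renames the legend. The palette "Set1" and the legend title are arbitrary choices made for this example, and p_scale is a made-up variable name.

```r
library(ggplot2)

# Replace the default discrete colour scale and give the legend a new title.
p_scale <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  scale_colour_brewer(palette = "Set1") +
  labs(colour = "Type of car")
p_scale
```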
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue")
Question: ggplot2 does not provide a scale (a legend) with this graph. Why? Is this a bug? Should you attempt to provide one manually?
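One way to explore this question is to compare setting the colour with mapping it. In the sketch below, the variable names mapped and set_blue are made up for illustration. Run both and compare the colour of the points and the presence of a legend before reading the comments.

```r
library(ggplot2)

# Mapping: inside aes(), "blue" is treated as data, that is, as a variable
# with the single value "blue". ggplot2 then chooses a colour through its
# default scale and draws a legend for that variable.
mapped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = "blue"))
mapped

# Setting: outside aes(), "blue" is a fixed property of the layer.
# No variable is involved, so there is nothing for a legend to explain.
set_blue <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")
set_blue
```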
From Wickham 2.4.1, page 16. Formulate them in a more closed format, so that students get at least one concrete task before exploring their own combinations, which should definitely be encouraged.
A boxplot, or box-and-whisker plot, summarizes the distribution of scores of a numeric variable, here separately for each level of a categorical variable.
ggplot(mpg, aes(drv, hwy)) +
geom_boxplot()
Histograms and frequency polygons show the distribution of scores of a single numeric variable.
ggplot(mpg, aes(hwy)) +
geom_histogram()
ggplot(mpg, aes(hwy)) +
geom_freqpoly()
Note that the y axis shows counts, that is, frequencies, not values (scores).
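When run without arguments, geom_histogram() and geom_freqpoly() fall back on 30 bins and print a message suggesting that you pick a better value with binwidth. The sketch below sets the bin width explicitly; the values 2.5 and 1 are example choices worth experimenting with, and the variable names are made up for illustration.

```r
library(ggplot2)

# binwidth is given in the units of the x variable (here, miles per gallon).
hist_wide <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2.5)
hist_wide

# A narrower bin width shows more detail, but also more noise.
poly_narrow <- ggplot(mpg, aes(hwy)) +
  geom_freqpoly(binwidth = 1)
poly_narrow
```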
A bar chart is the analogue of a histogram for discrete variables.
ggplot(mpg, aes(manufacturer)) +
geom_bar()
As a final example, we look at time series plots. Here, the x axis shows time (e.g., years), and the y axis shows measurements for numeric variables, or counts (frequencies) for categorical data. For the next example we use the economics data set, which contains basic economic data for the US over time, such as the number of unemployed people (unemploy).
ggplot(economics, aes(date, unemploy / pop)) +
geom_line()
Question: How did we overcome the problem that the original data set, economics, does not contain values of the unemployment rate directly, but only the number of unemployed people and the population size?
To show just one example of overriding defaults, let’s look at the axis labels. While ‘date’ is clear enough, the label ‘unemploy/pop’ is a bit mysterious. Let’s change it to “unemployment rate”.
ggplot(economics, aes(date, unemploy / pop)) +
geom_line() +
ylab("unemployment rate")
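An equivalent route, sketched here as one possible alternative, is to compute the rate as a new column first and then plot that column. The column name unemp_rate and the variable names econ and p_rate are made up for this example. labs() sets both axis labels in a single call.

```r
library(ggplot2)

# Add the unemployment rate as its own column (hypothetical name unemp_rate).
econ <- transform(economics, unemp_rate = unemploy / pop)

# Plot the precomputed column and label both axes explicitly.
p_rate <- ggplot(econ, aes(date, unemp_rate)) +
  geom_line() +
  labs(x = "date", y = "unemployment rate")
p_rate
```

Computing inside aes() keeps the code short, while a separate column is handy when the derived variable is needed in several plots.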