In this session we will demonstrate various techniques for visualising data using R. This will also include the correct interpretation and understanding of the different plotting techniques. We will begin by showing how to do simple data visualisations in R before moving onto introducing ggplot2
, which allows for better data visualisations.
First, start by opening RStudio and a new R script by going to File -> New File -> R Script
. Save this file as DASession1.R into a directory you will have constant access to throughout the course. We shall now load into R all of the libraries we will need for this session. This can be done by typing the following into your R script:
library(ggplot2)
library(nycflights13)
The libraries can be loaded into R by highlighting them in your script and then clicking on the Run
button located in the top right of the script window. The first library ggplot2
allows us to use functions within that package in order to create nicer data visualisations. The second library nycflights13
contains data on flights from New York City in 2013 that we shall be examining.
Before visualising any data set, we first need to know what it consists of. For example, the contents of the flights
data within the nycflights13
library can be observed using the following commands:
head(flights, n = 3)
This prints to the R console the first n = 3
rows of the flights
data set, displaying each of the variables within said data set. We now know the data set contains 19 variables, as well as their names. A quick check on the size of a data set can be obtained using:
dim(flights)
[1] 336776 19
which displays the dimensions of the data set. Thus, here we have 336776 rows and 19 columns worth of data.
To reduce the amount of data we will be working with and make things a little easier, lets only look at Alaska Airlines flights leaving from New York City in 2013. This can be done by subsetting the data in such a way that we only observe flights from Alaska Airlines (carrier code AS), as follows:
Alaska <- flights[flights$carrier == "AS", ]
This essentially picks out all of the rows within the flights
data set for which the carrier code is AS
and discards the rest, thus creating a new data set entitled Alaska
. To observe the Alaska
data we can use the following commands:
head(Alaska, n = 2)
dim(Alaska)
[1] 714 19
Thus, we have reduced the number of data points for each variable down to 714. In the next session we will look at more sophisticated ways of manipulating data sets. Now, lets go on to look at different visualisations of our Alaska
data set, starting with scatterplots.
The first data visualisation technique we introduce is the Scatterplot (or bivariate plot), which allows for two variables to be plotted against one another, with one plotted on the x-axis, and the other on the y-axis. This allows us to examine if there is any relationship between the two variables, such as positive or negative correlation, and whether the relationship appears linear or not.
Lets say we wanted to observe the relationship between departure and arrival delays. We can do that in R using the built-in plot()
function as follows:
plot(Alaska$dep_delay, Alaska$arr_delay)
The $ sign allows us to select specific variables from our Alaska
data set, such that we have plotted the departure delays (dep_delay
) on the x-axis, and arrival delays (arr_delay
) on the y-axis. Now, lets start by making the scatterplot clearer by adding more informative axis labels. This can be done as follows:
plot(Alaska$dep_delay, Alaska$arr_delay, xlab = "Departure delay (minutes)",
ylab = "Arrival delay (minutes)")
Now we have given our axes more informative labels by passing the additional arugments xlab
and ylab
to the plot()
function. To see additional arugments that can be passed to the plot()
function, we can observe its corresponding help file using:
?plot
Lets finish off our scatterplot by giving it an informative title and changing the style of the plotted points:
plot(Alaska$dep_delay, Alaska$arr_delay, xlab = "Departure delay (minutes)",
ylab = "Arrival delay (minutes)",
main = "Alaska Airlines flights leaving NYC in 2013", pch = 16)
The argument main
allows for a title to be added to the scatterplot, while the arugment pch
changes the plotting character. There are 25 different plotting characters to choose from. An easy way to look at all of them is to plot them:
plot(1:25, pch = 1:25)
Lastly, before we move onto creating a nicer version of our scatterplot using ggplot()
, can you describe the relationship observed between deparature and arrival delays?
ggplot
We will now take advantage of the ggplot2
library to produce better looking visualisations of our data. First, let us set up the plotting region for our scatterplot of departure against arrival delays as follows:
ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay))
Here, we have set up our plotting region by giving to the ggplot()
function:
Alaska
by setting data = Alaska
.aes(x = dep_delay, y = arr_delay)
, where aes()
relates to the plots aesthetics. That is,
dep_delay
maps to the x
coordinate; andarr_delay
maps to the y
coordinate.In order to include the points on the scatterplot we now need to add an additonal layer using the +
command. The points are then added as follows:
ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
where geom_point()
specifies that the geometric objects to add to our plotting region are points.
When adding layers using ggplot
it should be noted that:
+
command should come at the end of lines, otherwise R will produce and error.+
command. This is so your code will be nice and clear with each layer given its own line of code. This is handy for code debugging.We can change the axes labels and include a title on our plot by adding another layer as follows:
ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point() +
labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)",
title = "Alaska Airlines flights leaving NYC in 2013")
Before moving onto the next section think about the following questions:
From our scatterplot it is clear to see that the vast majority of the points lie close to zero for both departure and arrival delays. This can make it difficult at times to observe what is going on. This is due to so many points being plotted very close to each other, and often plotted over one another in such a way that its impossible to count how many points are actually plotted. This is referred to as over-plotting. Using ggplot()
, there are two ways we can address this problem:
alpha
argument.geom_jitter()
command.We shall first alter the transparancy of the points and see if this improves the situation. This is done as follows:
ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.2) +
labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)",
title = "Alaska Airlines flights leaving NYC in 2013")
The alpha
command ranges between 0 and 1, where 0 relates to 100% transparency, while 1 (default) sets the points to be 100% opaque. By changing the transparancy levels of the points we can observe clusters of points that are close to one another as they will be darker than areas of the plot with fewer points clustered together. Try playing around with different levels of alpha
.
Jittering - The idea behind jittering is that each point is randomly moved, or nudged, slightly from its original position in such a way that clusters of points with the same coordinates can be observed, instead of being plotted on top of one another. To understand this, lets create a small data set consisting purely of zeros, such that:
jitter.example <- matrix(0, nrow = 10, ncol = 2)
This basically creates a 10 by 2 matrix of zeros. You can look at it in the console by simply typing:
jitter.example
Now, ggplot
only works with data frames and not matrices, so we need to convert jitter.example
into a data frame. This can be done using:
jitter.example <- as.data.frame(jitter.example)
There are functions within R that can be used to determine whether an object is a matrix or data frame. See:
?is.matrix
?is.data.frame
Now, lets plot our toy example:
ggplot(data = jitter.example, mapping = aes(x = V1, y = V2)) +
geom_point()
Note that since changing jitter.example
into a data frame, the columns have been given the default variable names V1
and V2
. From the plot, if you had never seen our toy example before you would think only a single value was plotted due to them all being zero. If we shift each of the points slightly using jittering we will be able to see them more clearly:
ggplot(data = jitter.example, mapping = aes(x = V1, y = V2)) +
geom_jitter(width = 0.1, height = 0.1)
Note that geom_jitter()
has replaced geom_point()
. Now we can clearly see all 10 observations plotted. The amount of horizontal and vertical jittering of the points is controlled by the width
and height
arguments within geom_jitter()
. Try playing around with different width
and height
values. Which values of the width
and height
do you feel would be inappropriate?
Now that we understand the idea behind jittering, lets produce a jittered scatterplot of the Alaska
data:
ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_jitter(width = 30, height = 30) +
labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)",
title = "Alaska Airlines flights leaving NYC in 2013")
Now we can see more of the points plotted within the cluster of points around (0,0). However, since this cluster this so large, it can be argued that jittering has not helped much here. Also, it is important to add enough jitter in order to separate the overlapping of points, however, not so much that we lose any pattern observed within the points. It should also be noted that jittering does not change the actual values of the points within the data set, it is merely used to help with visualising the data.
Histograms allow us to look at the statistical distribution of a variable. They show us how many values of a variable fall within specified bins. These bins give ranges of values for which the variable lies. The bins can be altered, that is, by changing their width, or by increasing the number of bins, such that we see the distribution at a higher resolution.
Here, lets take a look at the weather
data set that is within the nycflights13
library. This data set contains hourly weather data from three airports (LGA, JFK and EWR) in New York City in 2013. We can look at its contents via:
head(weather, n = 3)
Now, if we want to look at the distribution of the hourly temperature, we can do that using the hist()
function as follows:
hist(weather$temp)
We can also store and obtain further information from the histogram that does not get displayed by default as follows:
hist.info <- hist(weather$temp, plot = FALSE)
hist.info
$breaks
[1] 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
[18] 95 100 105
$counts
[1] 57 259 491 883 2096 2927 2257 2116 1645 2023 2701 2170 2027 2241
[15] 1512 432 241 34 2
$density
[1] 4.365474e-04 1.983610e-03 3.760435e-03 6.762656e-03 1.605269e-02
[6] 2.241709e-02 1.728575e-02 1.620587e-02 1.259861e-02 1.549360e-02
[11] 2.068622e-02 1.661944e-02 1.552424e-02 1.716321e-02 1.158000e-02
[16] 3.308570e-03 1.845753e-03 2.603967e-04 1.531745e-05
$mids
[1] 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 52.5 57.5 62.5
[12] 67.5 72.5 77.5 82.5 87.5 92.5 97.5 102.5
$xname
[1] "weather$temp"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
Here, the breaks
refer to the cutoff points for the bins, resulting in R deciding to display 19 bins by default for our hourly temperature data. The counts
tells us the number of observed values within each bin. The density
allows for the density distribution to be plotted instead of the frequency, while mids
provides the midpoints of each bin. The variable name is given by xname
, whether the bins are equidistant by equidist
, and the object class is provided last (histogram in this case).
We can now use the information provided about the bins to look at the hourly temperature distribution on a finer or coarser scale using the breaks
command as follows:
hist(weather$temp, breaks = 30, main = "Breaks = 30")
hist(weather$temp, breaks = 10, main = "Breaks = 10")
Why does the frequency differ between the two plotted histograms?
Note, when giving R the number of breaks you want to use, it does not neccessarily always give you exactly what you would like. Finer control over the breakpoints between the bins can be obtained by passing a vector into the breaks
argument specifying the exact locations of the breakpoints. For example, if we only wanted breaks to occur at 10, 30, 50, 80, 100 and 110 we could pass the following vector to the breaks
argument:
hist(weather$temp, breaks = c(10, 30, 50, 80, 100, 110))
Do these breaks make sense?
Finally, similarly to the plot()
function, we can clean up the labels on the histogram so that they are more informative using:
hist(weather$temp, breaks = 30, xlab = "Temperature (Hourly)",
main = "Hourly temperatures from NYC in 2013")
Before moving on, consider the following questions:
ggplot
Again, we can create better looking visualisations of our data using ggplot
, which also goes for histograms. To create a histogram using ggplot
we use the geom_histogram()
command, or layer, instead of geom_point()
as with scatterplots. We can create a histogram of the hourly temperature data in NYC in 2013 using ggplot
as follows:
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here, ggplot
tells us that it used 30 bins when creating the histogram. It should be noted that the histogram does not look exactly the same as the one we previous created with breaks=30
. This is because the algorithm used for specifying the bins is not exactly the same. We can adjust the bins in ggplot
using two different approaches. Either by adjusting the
bins
argument; orbinwidth
argument.Lets first start by specifying the number of bins as follows:
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(bins = 60, color = "white")
What does changing the number of bins tell us about the distribution of the hourly temperature levels?
Note, we also specified the outline colour of the bins to make it easier to differentiate between them. The colour of the bins themselves can be changed by including the fill
argument. The colour options available can be found by typing the following into the R console:
colors()
Instead of specifying the number of bins, we can specify their width using binwidth
as follows:
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(binwidth = 5, color = "white")
Finally, we can give the histogram a title and clean up the labels to them more informative:
ggplot(data = weather, mapping = aes(x = temp)) +
geom_histogram(bins = 60, color = "white", fill = "skyblue") +
labs(x = "Temperature (Hourly)",
title = "Hourly temperatures from NYC in 2013")
Before moving on to the next section, think about the following questions:
Another way to look at the distribution of a variable is using a boxplot. A boxplot makes use of the standard five-number summary, that is
Keeping with the hourly temperature data, the five-number summary can be obtained in R using the following command:
summary(weather$temp)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
10.94 39.92 55.40 55.26 69.98 100.04 1
This provides us with the five-number summary, as well as the mean
hourly temperature. There is one missing value in the hourly temperature data, which is represented in R by NA
. What does having similar median
and mean
values say about the distribution of a variable?
The boxplot of the hourly temperature data can be obtained as follows:
boxplot(weather$temp, ylab = "Hourly Temperature",
main = "Hourly temperatures from NYC in 2013")
The elements of the boxplot relating to the five-number summary have also been labelled. Other features of the boxplot are:
Boxplots are useful visualisations when comparing the distribution of a numerical variable split across groups (or a categorical variable). For example, we could look at how the hourly temperature changes by month, where month is our categorical, or grouping, variable. This can be done as follows:
boxplot(temp ~ month, data = weather, ylab = "Hourly Temperature",
main = "Hourly temperatures from NYC in 2013")
Here, we have given to the boxplot function the formula temp ~ month
, which basically means we would like boxplots of hourly temperature (temp
) separately for each month
. Note, we also have to specify that temp
and month
come from the weather
data set. After splitting the hourly temperatures by month, we now see points extending beyond the whiskers of the boxplots. These are known as outliers, and may be thought of as unusually small or large values. However, the definition of an outlier here is somewhat arbitrary as they are defined by the length of the whiskers, which are no more than 1.5 x IQR.
Finally, lets make the boxplot more informative by renaming the labels for each of the boxplots:
boxplot(temp ~ month, data = weather, ylab = "Hourly Temperature",
main = "Hourly temperatures from NYC in 2013", xlab = "Month",
names = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
How does the hourly temperature change throughout the year?
ggplot
To create boxplots using ggplot
we use the geom_boxplot()
function. If we want to look at boxplots of a variable separately for a categorical variable then we need to introduce the factor()
function. This converts a numerical variable into a categorical one, essentially creating labels or categories from the numeric values. For example, the month
variable within the weather
data set is a numerical variable taking on the values 1,,12, for each month. However, it makes more sense to convert this into a categorical variable using the factor()
function, such that:
weather$month
[1] 1 1 1 1 1 1 1 1 1 1
becomes
factor(weather$month)
[1] 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
with levels, or categories, 1,,12 for each month. Hence, the boxplots can be produced using ggplot
as follows:
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot(fill = "steelblue") +
labs(x = "Month", y = "Temperature (Hourly)",
title = "Hourly temperatures from NYC in 2013 by month") +
scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
Note, we have introduced a new function scale_x_discrete()
, which is used to rename the labels of the boxplots. This function is used as our categorical variables are discrete in nature.
Before moving on, consider the following questions:
Barplots, or barcharts, are used to visualise the distributions of categorical variables. This essentially provides us with the frequencies of categories within a categorical variable. Lets take a look at the distribution of airline carriers within the flights
data that flew out of New York City in 2013. We can do this by creating a table containing the number of flights from each airline carrier as follows:
carrier.freq <- table(flights$carrier)
9E AA AS B6 DL EV F9 FL HA MQ OO UA
18460 32729 714 54635 48110 54173 685 3260 342 26397 32 58665
US VX WN YV
20536 5162 12275 601
Turning the above into a barplot provides an easier visualisation of the information provided by the table, and is done as follows:
barplot(carrier.freq)
Note, the names of the airlines relating to the carrier codes can be found by typing airlines
into the R console. Using this information and the barplot, find the name of the airline which had the most flights out of New York City in 2013?
Barplots can also be used to compare two categorical variables. For instance, lets say we wanted to look at the number of flights that flew out of New York York in 2013 from each carrier and from each airport (LGA, JFK and EWR). To obtain a table of this information we simply include the flights origin to our previous table as follows:
carrier.origin <- table(flights$origin, flights$carrier)
9E AA AS B6 DL EV F9 FL HA MQ OO
EWR 1268 3487 714 6557 4342 43939 0 0 0 2276 6
JFK 14651 13783 0 42076 20701 1408 0 0 342 7193 0
LGA 2541 15459 0 6002 23067 8826 685 3260 0 16928 26
UA US VX WN YV
EWR 46087 4405 1566 6188 0
JFK 4534 2995 3596 0 0
LGA 8044 13136 0 6087 601
To turn this information into a barplot we use the same method as before:
barplot(carrier.origin, legend.text = TRUE)
where the additional argument legend.text=TRUE
includes the legend highlighting the three airports. This is what is referred to as a Stacked barplot since the bars for each origin
are simply stacked on top of one another for each of the carriers. Which origin did JetBlue Airways appear to favour in 2013?
ggplot
To create barplots using ggplot
we use the geom_col()
function. However, as mentioned earlier, ggplot
expects the data passed to it to be a data frame, and so we will first transform our previous tables into data frames as follows:
carrier.freq <- as.data.frame(carrier.freq)
colnames(carrier.freq) <- c("carrier", "number")
carrier.origin <- as.data.frame(carrier.origin)
colnames(carrier.origin) <- c("origin", "carrier", "number")
The names of the columns in our data frames have also been updated using the colnames()
function. Take a look at your new data frames within the R console. First, lets plot the data relating to the carriers who flew out of New York City in 2013. This is done using the following code:
ggplot(data = carrier.freq, mapping = aes(x = carrier, y = number)) +
geom_col()
Add to the code above to give your barplot new labels and a title similar to the one displayed in the notes.
The barplot for comparing two categorical vairables is very similar in this case, where we simply pass the additional fill
argument to the aes()
function. Including the fill
argument lets ggplot
plot know that we want to split the barplot according to an additional categorical variable, which is origin
in this case. The barplot is then given by:
ggplot(data = carrier.origin, mapping = aes(x = carrier, y = number, fill = origin)) +
geom_col() +
labs(x = "Carrier", y = "Count",
title = "Carriers who flew out of New York City in 2013")
This provides us with a visually nice barplot to present our carrier information by airport of origin. However, there are also alternative barplots to the stacked barplot. One alternative to a stacked barplot is the side-by-side (or dodged) barplot, which, as suggested by its name, places the bars next to each another instead of on top of one another. This can be produced as follows:
ggplot(data = carrier.origin, mapping = aes(x = carrier, y = number, fill = origin)) +
geom_col(position = "dodge") +
labs(x = "Carrier", y = "Count",
title = "Carriers who flew out of New York City in 2013")
This is done by passing to the geom_col()
function the position
of the barplots, which in this case is dodged
. Before moving on, consider the following two questions:
Lastly, lets take a look at what is referred to as a faceted barplot. They provide an easier way to compare the carrier distributions by origin, and can be obtained as follows:
ggplot(data = carrier.origin, mapping = aes(x = carrier, y = number, fill = origin)) +
geom_col() +
facet_wrap(~ origin, ncol = 1) +
labs(x = "Carrier", y = "Count",
title = "Carriers who flew out of New York City in 2013")
Here we include the facet_wrap()
function, where we want separate barplots by origin
, and hence we use ~ origin
. We also choose to have them plotted in one column via ncol = 1
. This makes it easier to compare their distributions now that they are not stacked on top or beside one another.
Before moving on to the next section, consider the following questions:
Linegraphs are typically used when looking at time series data, that is, when we have information on how a variable changes over time. Hence, there is a natural ordering to the data when observing how something changes over time, and therefore, linegraphs should be avoided if there is no sequential ordering of a variable. Lets again look at the hourly temperature data, but this time only for Newark Inernational Airport in January. This can be done by first subsetting the data as follows:
Newark.Jan <- weather[weather$origin == "EWR" & weather$month == 1, ]
Take a look at the new data set for hourly temperatures at Newark International Airport in January using functions metioned earlier. We can now produce a linegraph of the hourly temperature as follows:
plot(Newark.Jan$time_hour, Newark.Jan$temp, type = "l",
xlab = "Time (Hours)", ylab = "Temperature",
main = "Hourly Temperature at Newark Airport in January 2013")
Here, we have passed to the plot()
function the additional type=l
argument, indicating that we would like a linegraph to be produced. By default, the plot()
function plots the data as points. Before moving on, consider the following questions:
time_hour
been plotted on the x-axis and not hour
?ggplot
To produce linegraphs using ggplot
we use the geom_line()
command. Hence, our linegraph for the hourly temperatures at Newark International Airport in January 2013 can be created as follows:
ggplot(data = Newark.Jan, mapping = aes(x = time_hour, y = temp)) +
geom_line() +
labs(x = "Time (Hours)", y = "Temperature",
title = "Hourly Temperature at Newark Airport in January 2013")
Before moving on to the next section, consider the following:
From the flights
data set, subset the data for the airline carrier JetBlue Airways
and produce a scatterplot of their departure delays against arrival delays using ggplot
. Interpret the scatterplot.
Produce a histogram of the hourly temperature from Newark Liberty International (EWR) Airport in 2013 using ggplot
. How does the temperature distribution compare with that from all airports in New York City in 2013?
For John F. Kennedy Airport, produce boxplots (using a single ggplot
command) of the hourly temperature for the months May, June, July, August and September. How does the hourly temperature change during this period?
Hint: You can subset data across multiple variables using the & (AND) and %in% (IN) operators. For example, flights[flights$carrier == “US” & flights$origin %in% c(“LGA”, “EWR”), ].
Take a look at the mtcars
data set within the datasets
library relating to data extracted from the 1974 Motor Trend US magazine. Using ggplot
, produce a faceted barplot of the categorical variables relating to the number of cylinders (cyl
) and the automobiles transmission (am
). Interpret the barplot.
Produce a linegraph of the hourly temperature at LAGuardia (LGA) Airport for the month of October 2013. Interpret the linegraph.