In this session we will demonstrate various techniques for visualising data using R. This will also include the correct interpretation and understanding of the different plotting techniques. We will begin by showing how to do simple data visualisations in R before moving onto introducing ggplot2, which allows for better data visualisations.

Getting started

First, start by opening RStudio and a new R script by going to File -> New File -> R Script. Save this file as DASession1.R into a directory you will have constant access to throughout the course. We shall now load into R all of the libraries we will need for this session. This can be done by typing the following into your R script:

library(ggplot2)
library(nycflights13)

The libraries can be loaded into R by highlighting them in your script and then clicking on the Run button located in the top right of the script window. The first library ggplot2 allows us to use functions within that package in order to create nicer data visualisations. The second library nycflights13 contains data on flights from New York City in 2013 that we shall be examining.

Viewing the data

Before visualising any data set, we first need to know what it consists of. For example, the contents of the flights data within the nycflights13 library can be observed using the following commands:

head(flights, n = 3)

This prints to the R console the first n = 3 rows of the flights data set, displaying each of the variables within said data set. We now know the data set contains 19 variables, as well as their names. A quick check on the size of a data set can be obtained using:

dim(flights)
[1] 336776     19

which displays the dimensions of the data set. Thus, here we have 336776 rows and 19 columns worth of data.

To reduce the amount of data we will be working with and make things a little easier, lets only look at Alaska Airlines flights leaving from New York City in 2013. This can be done by subsetting the data in such a way that we only observe flights from Alaska Airlines (carrier code AS), as follows:

Alaska <- flights[flights$carrier == "AS", ]

This essentially picks out all of the rows within the flights data set for which the carrier code is AS and discards the rest, thus creating a new data set entitled Alaska. To observe the Alaska data we can use the following commands:

head(Alaska, n = 2)
dim(Alaska)
[1] 714  19

Thus, we have reduced the number of data points for each variable down to 714. In the next session we will look at more sophisticated ways of manipulating data sets. Now, lets go on to look at different visualisations of our Alaska data set, starting with scatterplots.

Visualising the data

Scatterplots

The first data visualisation technique we introduce is the Scatterplot (or bivariate plot), which allows for two variables to be plotted against one another, with one plotted on the x-axis, and the other on the y-axis. This allows us to examine if there is any relationship between the two variables, such as positive or negative correlation, and whether the relationship appears linear or not.

Lets say we wanted to observe the relationship between departure and arrival delays. We can do that in R using the built-in plot() function as follows:

plot(Alaska$dep_delay, Alaska$arr_delay)

The $ sign allows us to select specific variables from our Alaska data set, such that we have plotted the departure delays (dep_delay) on the x-axis, and arrival delays (arr_delay) on the y-axis. Now, lets start by making the scatterplot clearer by adding more informative axis labels. This can be done as follows:

plot(Alaska$dep_delay, Alaska$arr_delay, xlab = "Departure delay (minutes)",
     ylab = "Arrival delay (minutes)")

Now we have given our axes more informative labels by passing the additional arugments xlab and ylab to the plot() function. To see additional arugments that can be passed to the plot() function, we can observe its corresponding help file using:

?plot

Lets finish off our scatterplot by giving it an informative title and changing the style of the plotted points:

plot(Alaska$dep_delay, Alaska$arr_delay, xlab = "Departure delay (minutes)",
     ylab = "Arrival delay (minutes)",
     main = "Alaska Airlines flights leaving NYC in 2013", pch = 16)

The argument main allows for a title to be added to the scatterplot, while the arugment pch changes the plotting character. There are 25 different plotting characters to choose from. An easy way to look at all of them is to plot them:

plot(1:25, pch = 1:25)

Lastly, before we move onto creating a nicer version of our scatterplot using ggplot(), can you describe the relationship observed between deparature and arrival delays?

Scatterplots using ggplot

We will now take advantage of the ggplot2 library to produce better looking visualisations of our data. First, let us set up the plotting region for our scatterplot of departure against arrival delays as follows:

ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay))

Here, we have set up our plotting region by giving to the ggplot() function:

  1. The data Alaska by setting data = Alaska.
  2. The mapping of the coordinates for the axes using aes(x = dep_delay, y = arr_delay), where aes() relates to the plots aesthetics. That is,
    • dep_delay maps to the x coordinate; and
    • arr_delay maps to the y coordinate.

In order to include the points on the scatterplot we now need to add an additonal layer using the + command. The points are then added as follows:

ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point()

where geom_point() specifies that the geometric objects to add to our plotting region are points.

When adding layers using ggplot it should be noted that:

  • the + command should come at the end of lines, otherwise R will produce and error.
  • when adding additional layers it is a good idea to take a new line after each + command. This is so your code will be nice and clear with each layer given its own line of code. This is handy for code debugging.

We can change the axes labels and include a title on our plot by adding another layer as follows:

ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point() +
  labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)",
       title = "Alaska Airlines flights leaving NYC in 2013") 

Before moving onto the next section think about the following questions:

  • Why do departure and arrival delays appear to have a positive relationship?
  • Why is there a cluster of points at (0,0)? What does this indicate?

Over-plotting

From our scatterplot it is clear to see that the vast majority of the points lie close to zero for both departure and arrival delays. This can make it difficult at times to observe what is going on. This is due to so many points being plotted very close to each other, and often plotted over one another in such a way that its impossible to count how many points are actually plotted. This is referred to as over-plotting. Using ggplot(), there are two ways we can address this problem:

  1. adjust the transparancy of the plotted points using the alpha argument.
  2. jitter the points using the geom_jitter() command.

We shall first alter the transparancy of the points and see if this improves the situation. This is done as follows:

ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2) +
  labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)",
       title = "Alaska Airlines flights leaving NYC in 2013") 

The alpha command ranges between 0 and 1, where 0 relates to 100% transparency, while 1 (default) sets the points to be 100% opaque. By changing the transparancy levels of the points we can observe clusters of points that are close to one another as they will be darker than areas of the plot with fewer points clustered together. Try playing around with different levels of alpha.

Jittering - The idea behind jittering is that each point is randomly moved, or nudged, slightly from its original position in such a way that clusters of points with the same coordinates can be observed, instead of being plotted on top of one another. To understand this, lets create a small data set consisting purely of zeros, such that:

jitter.example <- matrix(0, nrow = 10, ncol = 2)

This basically creates a 10 by 2 matrix of zeros. You can look at it in the console by simply typing:

jitter.example

Now, ggplot only works with data frames and not matrices, so we need to convert jitter.example into a data frame. This can be done using:

jitter.example <- as.data.frame(jitter.example)

There are functions within R that can be used to determine whether an object is a matrix or data frame. See:

?is.matrix
?is.data.frame

Now, lets plot our toy example:

ggplot(data = jitter.example, mapping = aes(x = V1, y = V2)) + 
  geom_point()

Note that since changing jitter.example into a data frame, the columns have been given the default variable names V1 and V2. From the plot, if you had never seen our toy example before you would think only a single value was plotted due to them all being zero. If we shift each of the points slightly using jittering we will be able to see them more clearly:

ggplot(data = jitter.example, mapping = aes(x = V1, y = V2)) + 
  geom_jitter(width = 0.1, height = 0.1)

Note that geom_jitter() has replaced geom_point(). Now we can clearly see all 10 observations plotted. The amount of horizontal and vertical jittering of the points is controlled by the width and height arguments within geom_jitter(). Try playing around with different width and height values. Which values of the width and height do you feel would be inappropriate?

Now that we understand the idea behind jittering, lets produce a jittered scatterplot of the Alaska data:

ggplot(data = Alaska, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_jitter(width = 30, height = 30) +
  labs(x = "Departure delay (minutes)", y = "Arrival delay (minutes)",
       title = "Alaska Airlines flights leaving NYC in 2013") 

Now we can see more of the points plotted within the cluster of points around (0,0). However, since this cluster this so large, it can be argued that jittering has not helped much here. Also, it is important to add enough jitter in order to separate the overlapping of points, however, not so much that we lose any pattern observed within the points. It should also be noted that jittering does not change the actual values of the points within the data set, it is merely used to help with visualising the data.

Histograms

Histograms allow us to look at the statistical distribution of a variable. They show us how many values of a variable fall within specified bins. These bins give ranges of values for which the variable lies. The bins can be altered, that is, by changing their width, or by increasing the number of bins, such that we see the distribution at a higher resolution.

Here, lets take a look at the weather data set that is within the nycflights13 library. This data set contains hourly weather data from three airports (LGA, JFK and EWR) in New York City in 2013. We can look at its contents via:

head(weather, n = 3)

Now, if we want to look at the distribution of the hourly temperature, we can do that using the hist() function as follows:

hist(weather$temp)

We can also store and obtain further information from the histogram that does not get displayed by default as follows:

hist.info <- hist(weather$temp, plot = FALSE)
hist.info
$breaks
 [1]  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90
[18]  95 100 105

$counts
 [1]   57  259  491  883 2096 2927 2257 2116 1645 2023 2701 2170 2027 2241
[15] 1512  432  241   34    2

$density
 [1] 4.365474e-04 1.983610e-03 3.760435e-03 6.762656e-03 1.605269e-02
 [6] 2.241709e-02 1.728575e-02 1.620587e-02 1.259861e-02 1.549360e-02
[11] 2.068622e-02 1.661944e-02 1.552424e-02 1.716321e-02 1.158000e-02
[16] 3.308570e-03 1.845753e-03 2.603967e-04 1.531745e-05

$mids
 [1]  12.5  17.5  22.5  27.5  32.5  37.5  42.5  47.5  52.5  57.5  62.5
[12]  67.5  72.5  77.5  82.5  87.5  92.5  97.5 102.5

$xname
[1] "weather$temp"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

Here, the breaks refer to the cutoff points for the bins, resulting in R deciding to display 19 bins by default for our hourly temperature data. The counts tells us the number of observed values within each bin. The density allows for the density distribution to be plotted instead of the frequency, while mids provides the midpoints of each bin. The variable name is given by xname, whether the bins are equidistant by equidist, and the object class is provided last (histogram in this case).

We can now use the information provided about the bins to look at the hourly temperature distribution on a finer or coarser scale using the breaks command as follows:

hist(weather$temp, breaks = 30, main = "Breaks = 30")
hist(weather$temp, breaks = 10, main = "Breaks = 10")

Why does the frequency differ between the two plotted histograms?

Note, when giving R the number of breaks you want to use, it does not neccessarily always give you exactly what you would like. Finer control over the breakpoints between the bins can be obtained by passing a vector into the breaks argument specifying the exact locations of the breakpoints. For example, if we only wanted breaks to occur at 10, 30, 50, 80, 100 and 110 we could pass the following vector to the breaks argument:

hist(weather$temp, breaks = c(10, 30, 50, 80, 100, 110))

Do these breaks make sense?

Finally, similarly to the plot() function, we can clean up the labels on the histogram so that they are more informative using:

hist(weather$temp, breaks = 30, xlab = "Temperature (Hourly)",
     main = "Hourly temperatures from NYC in 2013")

Before moving on, consider the following questions:

  1. What does the histogram tell us about the hourly temperature levels from NYC in 2013?
  2. What is an appropriate number of bins to specify?

Histograms using ggplot

Again, we can create better looking visualisations of our data using ggplot, which also goes for histograms. To create a histogram using ggplot we use the geom_histogram() command, or layer, instead of geom_point() as with scatterplots. We can create a histogram of the hourly temperature data in NYC in 2013 using ggplot as follows:

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here, ggplot tells us that it used 30 bins when creating the histogram. It should be noted that the histogram does not look exactly the same as the one we previous created with breaks=30. This is because the algorithm used for specifying the bins is not exactly the same. We can adjust the bins in ggplot using two different approaches. Either by adjusting the

  • number of bins using the bins argument; or
  • the width of the bins using the binwidth argument.

Lets first start by specifying the number of bins as follows:

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 60, color = "white")

What does changing the number of bins tell us about the distribution of the hourly temperature levels?

Note, we also specified the outline colour of the bins to make it easier to differentiate between them. The colour of the bins themselves can be changed by including the fill argument. The colour options available can be found by typing the following into the R console:

colors()

Instead of specifying the number of bins, we can specify their width using binwidth as follows:

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white")

Finally, we can give the histogram a title and clean up the labels to them more informative:

ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 60, color = "white", fill = "skyblue") +
    labs(x = "Temperature (Hourly)",
          title = "Hourly temperatures from NYC in 2013") 

Before moving on to the next section, think about the following questions:

  1. How would you describe the distribution of hourly temperature? Is it symmetrical, or skewed?
  2. What would you say the mean or median hourly temperature is from looking at the histogram?

Boxplots

Another way to look at the distribution of a variable is using a boxplot. A boxplot makes use of the standard five-number summary, that is

  • the minimum value;
  • the first quartile or 25th percentile;
  • the median (or 2nd quartile / 50th percentile);
  • the third quartile or 75th percentile;
  • and the maximum value.

Keeping with the hourly temperature data, the five-number summary can be obtained in R using the following command:

summary(weather$temp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  10.94   39.92   55.40   55.26   69.98  100.04       1 

This provides us with the five-number summary, as well as the mean hourly temperature. There is one missing value in the hourly temperature data, which is represented in R by NA. What does having similar median and mean values say about the distribution of a variable?

The boxplot of the hourly temperature data can be obtained as follows:

boxplot(weather$temp, ylab = "Hourly Temperature",
        main = "Hourly temperatures from NYC in 2013")

The elements of the boxplot relating to the five-number summary have also been labelled. Other features of the boxplot are:

  • the interquartile range (IQR), that is, the height of the box (3rd - 1st quartiles), and is a measure of the spread of the middle 50% of values. Longer boxes are indicative of greater variability.
  • The dashed lines are known as whiskers and represent values less than the 25th percentile, and greater than the 75th percentile, respectively. The longer the whiskers, the more variability there is outside of the middle 50%.

Boxplots are useful visualisations when comparing the distribution of a numerical variable split across groups (or a categorical variable). For example, we could look at how the hourly temperature changes by month, where month is our categorical, or grouping, variable. This can be done as follows:

boxplot(temp ~ month, data = weather, ylab = "Hourly Temperature",
        main = "Hourly temperatures from NYC in 2013")

Here, we have given to the boxplot function the formula temp ~ month, which basically means we would like boxplots of hourly temperature (temp) separately for each month. Note, we also have to specify that temp and month come from the weather data set. After splitting the hourly temperatures by month, we now see points extending beyond the whiskers of the boxplots. These are known as outliers, and may be thought of as unusually small or large values. However, the definition of an outlier here is somewhat arbitrary as they are defined by the length of the whiskers, which are no more than 1.5 x IQR.

Finally, lets make the boxplot more informative by renaming the labels for each of the boxplots:

boxplot(temp ~ month, data = weather, ylab = "Hourly Temperature",
        main = "Hourly temperatures from NYC in 2013", xlab = "Month",
        names = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

How does the hourly temperature change throughout the year?

Boxplots using ggplot

To create boxplots using ggplot we use the geom_boxplot() function. If we want to look at boxplots of a variable separately for a categorical variable then we need to introduce the factor() function. This converts a numerical variable into a categorical one, essentially creating labels or categories from the numeric values. For example, the month variable within the weather data set is a numerical variable taking on the values 1,,12, for each month. However, it makes more sense to convert this into a categorical variable using the factor() function, such that:

weather$month
 [1] 1 1 1 1 1 1 1 1 1 1

becomes

factor(weather$month)
 [1] 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4 5 6 7 8 9 10 11 12

with levels, or categories, 1,,12 for each month. Hence, the boxplots can be produced using ggplot as follows:

ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Month", y = "Temperature (Hourly)",
        title = "Hourly temperatures from NYC in 2013 by month")  +
  scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

Note, we have introduced a new function scale_x_discrete(), which is used to rename the labels of the boxplots. This function is used as our categorical variables are discrete in nature.

Before moving on, consider the following questions:

  • Which months have the highest variability in temperature?
  • Are there any values that stick out?

Barplots

Barplots, or barcharts, are used to visualise the distributions of categorical variables. This essentially provides us with the frequencies of categories within a categorical variable. Lets take a look at the distribution of airline carriers within the flights data that flew out of New York City in 2013. We can do this by creating a table containing the number of flights from each airline carrier as follows:

carrier.freq <- table(flights$carrier)

   9E    AA    AS    B6    DL    EV    F9    FL    HA    MQ    OO    UA 
18460 32729   714 54635 48110 54173   685  3260   342 26397    32 58665 
   US    VX    WN    YV 
20536  5162 12275   601 

Turning the above into a barplot provides an easier visualisation of the information provided by the table, and is done as follows:

barplot(carrier.freq)

Note, the names of the airlines relating to the carrier codes can be found by typing airlines into the R console. Using this information and the barplot, find the name of the airline which had the most flights out of New York City in 2013?

Barplots can also be used to compare two categorical variables. For instance, lets say we wanted to look at the number of flights that flew out of New York York in 2013 from each carrier and from each airport (LGA, JFK and EWR). To obtain a table of this information we simply include the flights origin to our previous table as follows:

carrier.origin <- table(flights$origin, flights$carrier)
     
         9E    AA    AS    B6    DL    EV    F9    FL    HA    MQ    OO
  EWR  1268  3487   714  6557  4342 43939     0     0     0  2276     6
  JFK 14651 13783     0 42076 20701  1408     0     0   342  7193     0
  LGA  2541 15459     0  6002 23067  8826   685  3260     0 16928    26
     
         UA    US    VX    WN    YV
  EWR 46087  4405  1566  6188     0
  JFK  4534  2995  3596     0     0
  LGA  8044 13136     0  6087   601

To turn this information into a barplot we use the same method as before:

barplot(carrier.origin, legend.text = TRUE)

where the additional argument legend.text=TRUE includes the legend highlighting the three airports. This is what is referred to as a Stacked barplot since the bars for each origin are simply stacked on top of one another for each of the carriers. Which origin did JetBlue Airways appear to favour in 2013?

Barplots using ggplot

To create barplots using ggplot we use the geom_col() function. However, as mentioned earlier, ggplot expects the data passed to it to be a data frame, and so we will first transform our previous tables into data frames as follows:

carrier.freq <- as.data.frame(carrier.freq)
colnames(carrier.freq) <- c("carrier", "number")
carrier.origin <- as.data.frame(carrier.origin)
colnames(carrier.origin) <- c("origin", "carrier", "number")

The names of the columns in our data frames have also been updated using the colnames() function. Take a look at your new data frames within the R console. First, lets plot the data relating to the carriers who flew out of New York City in 2013. This is done using the following code:

ggplot(data = carrier.freq, mapping = aes(x = carrier, y = number)) +
  geom_col()

Add to the code above to give your barplot new labels and a title similar to the one displayed in the notes.

The barplot for comparing two categorical vairables is very similar in this case, where we simply pass the additional fill argument to the aes() function. Including the fill argument lets ggplot plot know that we want to split the barplot according to an additional categorical variable, which is origin in this case. The barplot is then given by:

ggplot(data = carrier.origin, mapping = aes(x = carrier, y = number, fill = origin)) +
  geom_col() +
  labs(x = "Carrier", y = "Count",
        title = "Carriers who flew out of New York City in 2013") 

This provides us with a visually nice barplot to present our carrier information by airport of origin. However, there are also alternative barplots to the stacked barplot. One alternative to a stacked barplot is the side-by-side (or dodged) barplot, which, as suggested by its name, places the bars next to each another instead of on top of one another. This can be produced as follows:

ggplot(data = carrier.origin, mapping = aes(x = carrier, y = number, fill = origin)) +
  geom_col(position = "dodge") +
  labs(x = "Carrier", y = "Count",
        title = "Carriers who flew out of New York City in 2013") 

This is done by passing to the geom_col() function the position of the barplots, which in this case is dodged. Before moving on, consider the following two questions:

  • What could be considered a positive of using a side-by-side barplot as opposed to a stacked barplot?
  • What could be a disadvantage of using a side-by-side barplot?

Lastly, lets take a look at what is referred to as a faceted barplot. They provide an easier way to compare the carrier distributions by origin, and can be obtained as follows:

ggplot(data = carrier.origin, mapping = aes(x = carrier, y = number, fill = origin)) +
  geom_col() +
  facet_wrap(~ origin, ncol = 1) +
  labs(x = "Carrier", y = "Count",
        title = "Carriers who flew out of New York City in 2013") 

Here we include the facet_wrap() function, where we want separate barplots by origin, and hence we use ~ origin. We also choose to have them plotted in one column via ncol = 1. This makes it easier to compare their distributions now that they are not stacked on top or beside one another.

Before moving on to the next section, consider the following questions:

  • What is the difference between histograms and barplots?
  • Why can’t we use histograms to visualise categorical variables?

Linegraphs

Linegraphs are typically used when looking at time series data, that is, when we have information on how a variable changes over time. Hence, there is a natural ordering to the data when observing how something changes over time, and therefore, linegraphs should be avoided if there is no sequential ordering of a variable. Lets again look at the hourly temperature data, but this time only for Newark Inernational Airport in January. This can be done by first subsetting the data as follows:

Newark.Jan <- weather[weather$origin == "EWR" & weather$month == 1, ]

Take a look at the new data set for hourly temperatures at Newark International Airport in January using functions metioned earlier. We can now produce a linegraph of the hourly temperature as follows:

plot(Newark.Jan$time_hour, Newark.Jan$temp, type = "l",
     xlab = "Time (Hours)", ylab = "Temperature",
     main = "Hourly Temperature at Newark Airport in January 2013")

Here, we have passed to the plot() function the additional type=l argument, indicating that we would like a linegraph to be produced. By default, the plot() function plots the data as points. Before moving on, consider the following questions:

  • In the code for the linegraph above, why has time_hour been plotted on the x-axis and not hour?
  • Why are linegraphs frequently used when time is the explanatory variable?

Linegraphs using ggplot

To produce linegraphs using ggplot we use the geom_line() command. Hence, our linegraph for the hourly temperatures at Newark International Airport in January 2013 can be created as follows:

ggplot(data = Newark.Jan, mapping = aes(x = time_hour, y = temp)) +
  geom_line() +
    labs(x = "Time (Hours)", y = "Temperature",
        title = "Hourly Temperature at Newark Airport in January 2013") 

Before moving on to the next section, consider the following:

  • Why should linegraphs be avoided when there is no sequential ordering of the data?
  • Why are linegraphs used when time is the explanatory variable?
  • Describe the trend in hourly temperature at Newark International Airport in January 2013.

Tasks

  1. From the flights data set, subset the data for the airline carrier JetBlue Airways and produce a scatterplot of their departure delays against arrival delays using ggplot. Interpret the scatterplot.

  2. Produce a histogram of the hourly temperature from Newark Liberty International (EWR) Airport in 2013 using ggplot. How does the temperature distribution compare with that from all airports in New York City in 2013?

  3. For John F. Kennedy Airport, produce boxplots (using a single ggplot command) of the hourly temperature for the months May, June, July, August and September. How does the hourly temperature change during this period?

Hint: You can subset data across multiple variables using the & (AND) and %in% (IN) operators. For example, flights[flights$carrier == “US” & flights$origin %in% c(“LGA”, “EWR”), ].

  1. Take a look at the mtcars data set within the datasets library relating to data extracted from the 1974 Motor Trend US magazine. Using ggplot, produce a faceted barplot of the categorical variables relating to the number of cylinders (cyl) and the automobiles transmission (am). Interpret the barplot.

  2. Produce a linegraph of the hourly temperature at LAGuardia (LGA) Airport for the month of October 2013. Interpret the linegraph.