Complete all Questions and submit final PDF under Assignments in Canvas.

The Goal

In our last lab, we were introduced to working with data in R. We learned about selecting a single column, selecting a row within that column, making tables, and creating subsets. Today, we are going to focus on the next step, which is creating visualizations of the data.

In order to make graphs for publications, presentations, or industry, statisticians tend to use a package called ggplot2 rather than the more standard visualizations in base R. In this lab, we are going to practice using this powerful package to make graphics. Throughout this course, we will use ggplot2 to create visualizations for class and for our projects.

For an excellent introduction to using ggplot2, look here. This is a resource I would suggest you consult throughout the class.

Open up an RMarkdown file, and delete everything after Line 10.

Change line 9 to say: knitr::opts_chunk$set(echo = FALSE)

The Data

Let's try a new data set today. This data set is about the Kentucky Derby, a horse race that takes place in Kentucky every year. Create a new chunk, and paste the following, and then hit play.

derbyplus <- read.csv("https://raw.githubusercontent.com/proback/BYSH/master/data/derbyplus.csv")

Now that we have the data, let's learn how to visualize it.

Introducing ggplot2

The package ggplot2 is a powerful collection of R functions for creating flexible, professional graphics. A package is a collection of R functions that relate to one another. When you load a package into R, you give R access to all of the functions within the package. We will be using several packages as we move through this class.

The first step to using ggplot2 is to install the package. Go to the top of your RStudio window and find "Tools". From there, click on "Install Packages." In the blank box, type in ggplot2, and hit install. The computer should automatically begin to install ggplot2 along with any other packages it needs, but this may take a minute.
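
If you prefer typing to clicking, the same installation can be done from the Console. The following is only a minimal sketch: the mirror address given to repos is an assumption (any CRAN mirror should work, and RStudio normally sets one for you), and the requireNamespace() check simply skips the download when the package is already on your computer.

```r
# Install ggplot2 only if it is not already available on this computer.
# The repos value is an assumed CRAN mirror, not something this lab requires.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
```

Run this in the Console rather than in a chunk, so that R does not try to reinstall the package every time you knit.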

Now, some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, copy and paste the following into a chunk and hit play. Nothing will seem to happen, and that's okay.

library(rlang)

Once you have installed the ggplot2 package, you need to tell R that you would like to begin using its functions. To do this, create a chunk in your RMarkdown, copy and paste the following, and hit play.

library(ggplot2)

You are now ready to begin using the functions inside the ggplot2 package!

One quick note. If you knit your Markdown right now, you will notice that the command library(ggplot2), as well as any accompanying output, appears in your document. We don't really want that. To suppress it, go to the top of the R chunk. You should see ```{r}. Change this to ```{r, echo = FALSE}, and knit your document again. This option tells R not to echo, or print, the code of the chunk in the PDF. If output from the chunk still appears, try ```{r, include = FALSE}, which still runs the chunk but hides both its code and its output.

EDA: One Variable

One of the variables in this data set is the condition of the track: fast, slow, or good. To start off with, let's create a visual to see the distribution of these categories. Paste the following code in a chunk and hit play.

ggplot(derbyplus, aes(x=condition)) + geom_bar(fill='blue')

The creation of plots in ggplot2 requires building the plot in layers. First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code. Once we have built the background, we are ready to plot our data. The command we will use for this depends upon the data type we are working with. In this case, we are dealing with a categorical variable, so we want a bar plot. The command we use to build a bar plot is geom_bar(). In this same layer, we specify that we want the bars to be filled in blue.

Notice that we use a plus sign in the code to add each layer to the graph. We add the background and then the bars to make the final graph.
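
To see the layers one at a time, we can save each step under its own name. This is only a sketch: it uses mtcars, a data set that comes with R, as a stand-in for derbyplus, and the names background and with_bars are made up for illustration.

```r
library(ggplot2)

# mtcars is built into R; it stands in for derbyplus here.
# cyl is stored as a number, so factor() tells R to treat it as categories.
background <- ggplot(mtcars, aes(x = factor(cyl)))  # layer 1: axes and grid only
with_bars  <- background + geom_bar(fill = "blue")  # layer 2: bars added on top

background  # prints an empty grid with the x axis set up
with_bars   # prints the finished bar plot
```

Printing background on its own shows that ggplot() draws no data. The bars only appear once geom_bar() has been added with the plus sign.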

Labels and Titles

We can add on another layer for the x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.

ggplot(derbyplus, aes(x=condition)) + geom_bar(fill='blue') + labs(title="Figure 1: Track Condition", x = "Type of Track Condition", y = "Count of Races")

This ggplot syntax actually mimics the ways humans would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add a label. Thinking through the steps in this manner will help you understand the syntax of ggplots.

  1. Using ggplot2, make a plot of the different numbers of starters (starters) in this data set. Make your plot purple, and title it "Figure 1: Number of Starters in each Race". Label the x-axis "Number of Starters."

Other Plots

There are a variety of plots we can use to explore the distribution of a single variable, including histograms and box plots.

  2. Create a box plot of speed. Title your plot "Figure 2: Speed", and label the x axis "Speed". Hint: Instead of a geom_bar, we want geom_boxplot.
  3. Create a histogram of speed. Title your plot "Figure 3: Speed", and label the x axis "Winning Speed in MPH" and the y-axis "Number of Races". Hint: Instead of a geom_bar, we want geom_histogram.

When we make this plot, R prints a message telling us that we have not chosen the number of bins, so it used the default of 30. When we make histograms, we get to decide how many bins we want, meaning how the data are grouped before we plot them. We can use the bins = argument inside geom_histogram() to set this. Now that we have a handle on how to create plots, we can start to personalize them by choosing the number of bins, the colors, and so on. For example, I like this plot.

ggplot(derbyplus, aes(starters)) + geom_histogram(color="purple", fill = "yellow", bins = 10)
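
To get a feel for what the number of bins does, it can help to draw the same variable twice. The sketch below is an illustration only: it uses mtcars, a data set built into R, in place of derbyplus, and the names few_bins and many_bins are made up. The function layer_data() is part of ggplot2 and returns the numbers behind a layer, one row per bin.

```r
library(ggplot2)

# mtcars is built into R and is used here only as a stand-in data set.
few_bins  <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 5)
many_bins <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 20)

few_bins   # a coarse picture: a few wide bars
many_bins  # a finer picture: many narrow bars

# Count the bins that ggplot2 actually built for each plot
nrow(layer_data(few_bins))
nrow(layer_data(many_bins))
```

Too few bins can hide the shape of the distribution, and too many can make it look noisy, so it is worth trying a few values before settling on one.
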
  4. Create a histogram of speed with gold bars outlined in black. Use 15 bins. Title your plot Figure 4: Speed, and label the x axis "Winning Speed in MPH" and the y axis "Number of Races".

EDA: Comparing two variables

The plot we just made is neat and professional, but the power of ggplot2 comes into play when we start relating variables to one another.

Let's examine the relationship between the number of starters and the speed of the winner. With two numeric variables, we need to use a scatter plot for this. In ggplot2, we could create the necessary plot using the following code:

ggplot(derbyplus, aes(x=starters, y = speed)) + geom_point()

Just as before, we have two layers. The first draws the background and the second adds on the data. Here, in the first layer, we specify both the x and y axis of the graph, as we have two variables that we are working with.

  5. Create a scatter plot for speed and starters, and make the dots purple. (Hint: This time we are not specifying a fill, but a color). Title your plot Figure 5: Starters Vs. Speed, and label the x axis "Number of Starters" and the y axis "Winning Speed in MPH".

Adding Lines and Curves

One question we tended to ask in STA 212 (MST 256) is whether or not the relationship between two variables appeared linear. If so, this tended to suggest that a least squares linear regression (LSLR) model might be a reasonable model to consider.

  6. Based on the graph in Question 5, describe the relationship between X = number of starters and Y = winning speed. Based on your description, what choice of f(X) do you think it would be reasonable to consider? Write out the function.

We can also add a line to the graph if we don't want to have to imagine it. To add a line or curve onto the graph, we add on another layer. The layer is added with the code stat_smooth(). With method = "lm" and formula = y ~ x, this code will (1) fit an LSLR line and (2) plot it on top of your scatter plot. Setting se = FALSE hides the shaded confidence band around the line.

ggplot(derbyplus, aes(x=starters, y = speed)) + geom_point() + stat_smooth(method = "lm", formula = y ~ x, size = 1, se = FALSE)

If we want to draw a curve (like a polynomial regression model), we can use the same code structure. However, we have to tell stat_smooth(), the layer that draws the trend of the data, to use a polynomial regression model for the relationship between Y and X. We do that through the formula argument, as follows:

ggplot(derbyplus, aes(x=starters, y = speed)) + geom_point() + stat_smooth(method = "lm", formula = y ~ poly(x,2), size = 1)
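
It may help to see that stat_smooth(method = "lm") draws the same model that lm() fits. The sketch below is an illustration under assumptions: mtcars, a data set built into R, stands in for derbyplus, and the names line_fit, curve_fit, and p are made up. Inside stat_smooth(), the formula is always written in terms of x and y, no matter what the columns are called.

```r
library(ggplot2)

# mtcars stands in for derbyplus; wt is car weight and mpg is fuel economy.
line_fit  <- lm(mpg ~ wt, data = mtcars)           # straight line (LSLR)
curve_fit <- lm(mpg ~ poly(wt, 2), data = mtcars)  # second order polynomial

coef(line_fit)  # intercept and slope of the straight line drawn below

# One scatter plot with both fits layered on top of the points
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "red")
p
```

Each stat_smooth() call is its own layer, so a line and a curve can sit on the same plot when we want to compare them.
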
  7. Show the graph with a sixth order polynomial curve added on.

Comparing more than two variables

At this point in the lab, we have explored the type of track condition, as well as the number of starters and the winning speed. What if we want to make a plot that shows the relationship among all three variables? We could, perhaps, make a scatter plot, and then just use colors to separate out the different types of track conditions. To do this, we include a color argument inside aes(), where we specify the axes.

ggplot(derbyplus, aes(x=starters, y = speed, color = condition)) + geom_point()

Here, the command color=condition tells R that we want each dot to be colored in a way that corresponds to the track condition.
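
Where the color argument goes matters. The sketch below contrasts the two placements. It is an illustration only: mtcars, a data set built into R, stands in for derbyplus, factor(cyl) plays the role of the categorical variable, and the names mapped and fixed are made up.

```r
library(ggplot2)

# Inside aes(): color depends on the data, so ggplot2 chooses one color
# per category and adds a legend.
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Outside aes(): color is a fixed setting, so every dot gets the same
# color and no legend is drawn.
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "darkgreen")

mapped
fixed
```

A good rule of thumb is that anything that should change with the data goes inside aes(), and anything that should look the same for every point goes outside it.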

  8. What disadvantages do you see in the plot we have created?

There is another option that allows us to compare the three variables. Instead of trying to create one plot, we can create a faceted plot. Faceted plots take a particular variable, such as the type of track condition, and create plots that are divided by that variable. To see an example, run the code below.

ggplot(derbyplus, aes(x=starters, y = speed, color = condition)) + geom_point() + facet_wrap( ~ condition, ncol=3)

This is the same code we used above, with the addition of the layer facet_wrap( ~ condition, ncol=3). Let's break this addition down. The command facet_wrap tells R that we are going to separate our graphs based on some categorical variable. The specific variable is then chosen with the code ~ condition. We can then specify how we want the graphs to be arranged. Here, we allow 3 columns (ncol=3).
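
To see how the layout argument changes the picture, here is a small sketch on a different data set. It is an illustration only: mtcars, which is built into R, stands in for derbyplus, cyl (which takes the values 4, 6, and 8) plays the role of the categorical variable, and the names side_by_side and stacked are made up.

```r
library(ggplot2)

# Three panels in one row, one panel per value of cyl
side_by_side <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, ncol = 3)

# The same three panels in a single column, stacked on top of one another
stacked <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, ncol = 1)

side_by_side
stacked
```

By default, all of the panels share the same x and y axes, which makes it easier to compare the groups with one another.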

  9. What command would you use if you wanted only two columns? Show the resultant plot, and change the x-axis to say "Number of Starters" and the y-axis to say "Winning Speed".
  10. What command would you use if you wanted to add fitted LSLR lines to the facet plot? Show the resultant plot, and make sure the axes are appropriately labeled.
  11. What command would you use if you wanted to add fitted second order polynomial regression curves to the facet plot? Show the resultant plot, and make sure the axes are appropriately labeled.

Next Steps

We have now explored EDA and how to make visualizations using ggplot2. We will use this all the time in our class. There are other cool add-ons, like plotly, that we can use as well!

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF document to Canvas. No other formats will be accepted. Make sure you look through the final PDF to make sure everything has knit correctly.

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2021 January 8.
The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.
The data set used in this lab is part of the data provided as accompanying data sets for the online textbook Broadening Your Statistical Horizons. The data were accessed through the book GitHub repository.