STA 112 Lab 1

Complete all Questions and submit your final PDF or HTML (either works!) under Assignments in Canvas.

Goal

Today we are going to be exploring the foundations of coding in R. As we move through this course, you will get experience with different things that R can do. For today, we want to start getting familiar with the setup and how to run code using R.

We are specifically going to focus on how to perform exploratory data analysis, or EDA, using R. This means we are going to learn how to use R to create visualizations using data.

Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.

Loading the Data

Before we can create any visualizations, we need data. We will work with data on a variety of subjects in this course. In statistics, sometimes we work with data that is very recent and sometimes we work with data from the past to detect trends that might inform future decisions. Today, we will be working with historical data.

To load the data, you need to create a Code Chunk (chunk for short). Remember that to do that, there are two options:

Option 1: Look at the top of your Markdown file, and find Code. Click it. From the drop-down menu, choose Insert Chunk. Click it! A gray box should appear in your Markdown file. This is a chunk.
Option 2: In the top gray bar of your Markdown file, look for a small green C (shown below). Click that, and choose R! A gray box should appear in your Markdown file. This is a chunk.

When you are done, you should have an empty code chunk: a gray box whose first line is three backticks followed by {r} and whose last line is three backticks.

Once you have created your chunk, copy and paste the following inside the chunk. Note: You will need to scroll in the box below to get the whole code that you need to copy.

derbyplus <- read.csv("https://raw.githubusercontent.com/proback/BYSH/master/data/derbyplus.csv")

Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button (press play). This tells R to run the code!

Once you have pressed play, look at the upper right hand panel of your RStudio screen (your Environment tab). Do you see that you have an object called derbyplus? This means that you have loaded the data set you need!

Next to the name derbyplus, you will see that R tells us the data set has 122 observations and 5 variables. This means that the data set has 122 rows and 5 columns.

If you click on the name derbyplus in the Environment, a spreadsheet showing the data set itself will appear. You can scroll through this spreadsheet to see the entire data set.
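The information shown in the Environment tab can also be checked with code. The sketch below is only an illustration: it builds a tiny made-up data set called toy from text typed into the chunk (the values are invented, not real Derby results), so it runs without an internet connection. The same commands should work if you replace toy with derbyplus.

```r
# A tiny made-up data set (invented values, NOT real Derby results).
# read.csv() can read comma-separated text directly through its text argument.
toy <- read.csv(text = "condition,speed,starters
fast,53.1,12
good,52.4,9
slow,51.0,7
fast,53.8,15")

dim(toy)    # number of rows, then number of columns
nrow(toy)   # number of rows (observations)
ncol(toy)   # number of columns (variables)
head(toy)   # the first few rows (up to 6)
```

For example, dim(derbyplus) should report the same numbers of observations and variables that the Environment tab shows.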

Answering the lab questions

Now that we have the data loaded, let’s answer our first lab question. Remember, to create a lab question, start a new line in your RMarkdown document, type two pound signs (##), hit the space bar, and then type Question 1. You should see ## Question 1. Then, hit enter (or return if you are on a Mac) twice, and you are ready to answer the question!

The ## makes your question numbers show up in bold so I can easily find them for grading. Putting a * on each side of your text puts it in italics. Let’s try it!

Data Summary

Our data for today relates to the Kentucky Derby, a famous horse race that takes place every year in Kentucky. Our data contain information on the winning horse of each Derby.

Our focus for today is on exploratory data analysis, or EDA. This means using graphs and tables to explore a data set. One powerful tool we use in EDA is a data summary. This is a way to very quickly get an idea of what information is contained in the data set.

To create a data summary, create a chunk, type the following, and then press play:

summary(derbyplus)

Welcome to your first R command! In R, the format for coding is command(object). In other words, the command tells R what you want to do. The command summary tells R we want to create a summary of a data set. The object piece tells R what object or data set you want the command to work with. So, summary(derbyplus) tells R we want to see a summary of the derbyplus data set.
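The command(object) pattern is not specific to summary. As a small sketch, the chunk below applies a few different commands to the same object. The object speeds is a made-up list of numbers created just for this illustration; it is not part of the derbyplus data.

```r
# Made-up speeds in feet per second (invented values for illustration).
speeds <- c(51.0, 52.4, 53.1, 53.8)

summary(speeds)  # command = summary, object = speeds
mean(speeds)     # command = mean: the average of the values
max(speeds)      # command = max: the largest value
length(speeds)   # command = length: how many values there are
```

In each line, the word before the parentheses is the command and the name inside the parentheses is the object the command works on.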

Question 1

What years of Derby races do we have information on in the derbyplus data set?

Question 2

Find the variable speed in your data summary. What was the fastest speed and what was the slowest speed?

We can see that the summary gives us a quick overview of what variables are in the data set and what information those variables provide. The summary is really useful, but sometimes we still have to look at the data itself to answer certain questions.

Question 3

Click on the data set in your Environment (upper right hand panel) to open it. What is the name of the horse who won the Kentucky Derby in 1907?

EDA: Visualizing the Data

Now that we have started to explore our data set using a data summary, let’s try data visualization. Data visualization means using graphs and tables to explore data. In this course, we will learn several different types of graphs, as well as exploring what information each displays and when we should use them. To get started making graphs in R, there is one thing we need to do.

Installing ggplot2

To create plots in R, we need to install the ggplot2 package. A package is a collection of R functions that relate to one another. When you load a package into R, you give R access to all of the functions within the package. We will be using several packages as we move through this class.

Go to the top of your RStudio window and find “Tools”. From there, click on “Install Packages.” In the blank box, type in ggplot2, and hit install. The computer should automatically begin to load in the packages that you need, but this may take a minute.

Once you have installed the package, create a chunk in your RMarkdown. Inside of the chunk, paste the code library(ggplot2), and hit play. You are now ready to begin creating graphs.

Note: Some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, run the code library(rlang). Once this is done, install ggplot2 again.
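If you prefer typing to clicking, the Tools menu step corresponds to the R function install.packages(). The sketch below installs ggplot2 only when it is missing and then loads it. Running the installation from the Console, rather than from a chunk that gets knit, is a common convention and not something this lab requires.

```r
# Install ggplot2 only if it is not already on this computer.
# requireNamespace() returns TRUE when the package is installed.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Installing happens once; loading with library() happens in every new session.
library(ggplot2)
```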

Plotting One Variable

There are 5 different variables in our data set, but we are going to start by focusing on just one: track condition. This is a variable that represents how good the track surface was for the race. Once we have chosen a variable, the type of plot we are going to make depends on the type of variable we are dealing with.

Question 4

Open up your data and find the variable condition. Based on the values you see recorded, is track condition a numeric variable or categorical variable? Explain.

Once we have identified the variable type, we are able to determine what tools we might use to explore the variable. In this case, one type of plot we might use is called a bar plot. To create the plot, paste the following code in a chunk and hit play.

ggplot(derbyplus, aes(x=condition)) + geom_bar(fill='blue')

When we look at a plot we generally want to know two things to start with:

1. What values are possible for the variable?
2. How often does each of these possible values occur in the data?

Question 5

What are the possible values for the track condition variable? In other words, how many different track conditions are there?

Question 6

According to your bar plot, which track condition occurred the most often?

Now that we have explored the plot, let’s think a little more on the code that was used to create it:

ggplot(derbyplus, aes(x=condition)) + geom_bar(fill='blue')

This ggplot syntax seems a little strange, but it actually mimics the ways humans would draw a graph by hand. First, you draw the axes. Then, you add on your data. Thinking through the steps in this manner will help you understand the syntax.

Notice that to add each layer to the graph, in the code we use a plus sign. We add the background AND THEN the bars to make the final graph. Let’s break that down in more detail:

ggplot(derbyplus, aes(x=condition)): This part of the code creates the axes and the background of the plot. The two arguments are the data set we are using (derbyplus) and the variable(s) that will be placed on the axis/axes, which go inside aes(). The name aes is short for aesthetics, meaning the visual features of the plot that are tied to variables. In this case, we specified only that the x-axis would contain information on track condition (aes(x=condition)).
geom_bar(fill='blue'): Once the axes are set, we add on (+) the actual data. In this case, we add bars (geom_bar). We also specify that we want those bars to be filled in with blue. This part of the code will change depending on the type of plot and the colors you want to use.
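One way to see the two layers is to save the first layer under a name and add the second layer afterward. This is only a sketch: the data set toy and the names background and bars are made up for this illustration, and the values are invented.

```r
library(ggplot2)

# Made-up track conditions (invented values for illustration).
toy <- data.frame(condition = c("fast", "fast", "good", "slow", "fast", "good"))

# Layer 1: the background and axes only. Printing this shows an empty plot.
background <- ggplot(toy, aes(x = condition))
background

# Layer 2: add bars on top of the background with the plus sign.
bars <- background + geom_bar(fill = "blue")
bars
```

Running background alone should draw the axes with no bars, which is the "draw the axes first" step. Running bars should draw the finished bar plot.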

Question 7

Create a bar plot of track condition, but this time change the color of bars to red. Show your code and your plot.

Labels and Titles

We can add on another layer to our plot: x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.

ggplot(derbyplus, aes(x=condition)) + 
     geom_bar(fill='blue') + 
     labs(title="The title you want", 
     x = "The x-axis label", y = "The y-axis label")

Question 8

Adapt the code above to (1) make the bars on your plot purple, (2) title your plot “Figure 1: Number of Races with Conditions of Each Type”, (3) label the x-axis “Track Condition”, and (4) label the y-axis “Count of Races”.

Considering Speed: Histograms

Now that we have explored the track condition, let’s try another variable. A second variable in our data set is speed, which represents the speed, in feet per second, of the horse that won the Derby in a particular year.

Question 9

Is speed a numeric variable or a categorical variable? Hint: You can look back at your data summary or open up the data set and take a look at the variable to help you answer this question!

To visualize the distribution of the variable speed, we are going to use a plot called a histogram. Paste the following code in a chunk and hit play.

ggplot(derbyplus, aes(x=speed)) + 
    geom_histogram(bins = 10, fill='cyan', color= 'white')
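The bins argument sets how many bars the histogram is divided into. As a sketch of its effect, the chunk below draws the same made-up speeds (invented values, not real Derby results) with a small and a larger number of bins. The names toy, few_bins, and more_bins are made up for this illustration.

```r
library(ggplot2)

# Made-up speeds in feet per second (invented values for illustration).
toy <- data.frame(
  speed = c(49.8, 50.5, 51.0, 51.2, 52.4, 52.9, 53.1, 53.3, 53.8, 54.1)
)

# Few bins: wide bars that show only the rough shape.
few_bins <- ggplot(toy, aes(x = speed)) +
  geom_histogram(bins = 3, fill = "cyan", color = "white")
few_bins

# More bins: narrow bars that show more detail.
more_bins <- ggplot(toy, aes(x = speed)) +
  geom_histogram(bins = 8, fill = "cyan", color = "white")
more_bins
```

There is no single correct number of bins. Trying a few values is a reasonable way to find a plot that shows the shape of the data clearly.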

Question 10

Change the code above to make the bars of the histogram gold, with black outlines around the bars. Title it “Speed of Winning Horse in Each Race”. Label the x-axis “Speed of Winner.” Show both your code and the plot as part of your answer. Hint: There are two colors in the histogram code I gave you. Use that to help you figure out which color is specified by fill and by color in the code.

EDA: Comparing Variables

So far, we have made plots of two variables: speed and track condition. Making plots of one variable is called univariate analysis, meaning we only looked at one variable at a time. Now, we are going to move into multivariate analysis by making plots that look at two or more variables at a time.

Comparing Speed and Number of Starters

Suppose we are asked to explore the relationship between X = the number of horses that ran in a race (starters) and Y = the speed of the winning horse (speed). This means that we need to look at two different variables in the data set: starters and speed.

Both winning speed and number of starters are numeric variables, so, as we learned in class, a scatter plot is a good choice to explore the relationship between the two variables. To make a scatter plot, create a new chunk and run the following code:

ggplot(derbyplus, aes(x=starters, y = speed)) + geom_point()

Just as before, we have two layers. The first draws the background and the second adds on the data. Here, in the first layer, we specify both the x and y axis of the graph, as we have two variables that we are working with.

Question 11

What part of the code specifically tells R to create a scatter plot? Hint: Think about the difference in the code we have already used to create a bar plot and a histogram.

Question 12

Using the code above Question 11, create the scatter plot. Add labels to your scatter plot (this means a title and labels on both the x and y axis) and change the dots to make them blue. Show your result. Hint: This time we are not specifying a fill, but a color.

Now that we have our graph, we are going to describe the relationship we see in the graph.

Question 13

Based on what you see in the plot, is the relationship between the number of horses who ran in the race and the winning speed a weak, moderate, or strong relationship? Explain your choice.

Question 14

Based on what you see in the plot, is the relationship between the number of horses who ran in the race and the winning speed a linear relationship or a curved relationship?

Comparing Speed and Condition (Method 1)

Okay, so we use a scatter plot to visualize the relationship between two numeric variables. What if we want to visualize the relationship between one numeric variable and one categorical variable?

We are now asked to explore how Y = the winning speed of the horse changes based on X = the track condition. Our X variable is categorical and our Y variable is numeric. What kind of plot do we make now?? Let’s see.

Create a new chunk and run the following code:

ggplot(derbyplus, aes(x=condition, y = speed, fill = condition)) + geom_boxplot()

This code creates a special type of box plot called a side by side box plot. Basically, we get a box plot of the winning speed for horses who ran races in each track condition. So, the red box plot that we see to the left of the plot represents the distribution of winning speeds for horses who ran their race under fast track conditions.

How did we create this? Putting x=condition inside aes() gives us one box for each track condition, and the argument fill = condition tells R to color each box in the plot according to the track condition.
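The line in the middle of each box marks the median of that group. As a rough numeric companion to the plot, the sketch below computes the median speed for each track condition using tapply() from base R. It uses a made-up data set called toy with invented values, and it uses the $ symbol, which picks one column out of a data set.

```r
# Made-up data (invented values for illustration).
toy <- data.frame(
  condition = c("fast", "fast", "fast", "good", "good", "slow", "slow", "slow"),
  speed     = c(53.1, 53.8, 54.0, 52.4, 52.0, 51.0, 50.2, 51.4)
)

# tapply(values, groups, command) applies the command to the values in each group.
# toy$speed is the speed column; toy$condition is the condition column.
tapply(toy$speed, toy$condition, median)

# The same idea with summary() gives the minimum, quartiles, and maximum per group.
tapply(toy$speed, toy$condition, summary)
```

Comparing these group medians is the numeric version of comparing the middle lines of the boxes in the plot.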

Question 15

Based on what you see in the plot, compare the distribution of winning speeds for races run under fast track conditions versus slow track conditions. What do you notice is different about the two box plots?

Comparing Speed and Condition with a Facet Plot

In addition to side by side box plots, we can use faceted plots to compare a numeric and categorical variable, or even two numeric variables and a categorical variable.

Faceted plots take a particular variable, such as the type of track condition, and create plots that are divided by that variable. To see an example, run the code below:

ggplot(derbyplus, aes(x=speed, fill = condition)) + 
       geom_histogram(bins = 10, color = "black") + 
       facet_wrap( ~ condition, ncol=3)

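This is nearly the same code we used to make a histogram of speed. The changes are fill = condition, which colors each histogram by track condition, black outlines on the bars, and the new line facet_wrap( ~ condition, ncol=3). Let’s break this new line down.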

The command facet_wrap tells R that we are going to separate our graphs based on some categorical variable. The specific variable is then chosen with the code ~condition. We are then able to specify how we want the graphs to be stacked. We want to allow 3 columns (ncol=3).
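Faceting is not limited to histograms. As a sketch of the "two numeric variables and a categorical variable" case mentioned earlier, the chunk below adds facet_wrap to a scatter plot. The data set toy and the name faceted are made up for this illustration, and the values are invented.

```r
library(ggplot2)

# Made-up data (invented values for illustration).
toy <- data.frame(
  condition = c("fast", "fast", "fast", "good", "good", "slow", "slow", "slow"),
  starters  = c(12, 15, 18, 9, 14, 7, 10, 13),
  speed     = c(53.1, 53.8, 54.0, 52.4, 52.0, 51.0, 50.2, 51.4)
)

# One scatter plot panel per track condition, arranged in 3 columns.
faceted <- ggplot(toy, aes(x = starters, y = speed)) +
  geom_point() +
  facet_wrap(~ condition, ncol = 3)
faceted
```

Each panel should show only the points from races run under that track condition, which makes it easier to compare the relationship across conditions.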

Question 16

What command would you use if you wanted only two columns? Show the resultant plot, and change the x-axis to say “Winning Speed”.

Question 17

Create a facet plot for number of starters by track condition. Show the resultant plot, and change the x-axis to say “Number of Starters”.

This last question is not related to the rest of our lab content, but it will be needed for our content next week!!

Question 18

Complete this quick survey. You get full points for filling it out, and your name is only used so I know who completed the survey. Just write “completed” as the answer to this question, and once I check to verify you completed the survey, you get full credit!

Link: https://docs.google.com/forms/d/e/1FAIpQLSfc0Mh04w6pcbcZo-CHBIDLTQ00snH6BF5tLZ8lJAoU-iRLag/viewform?usp=header

Wrapping it Up

Okay, so at this point we have done a lot of EDA work!! Suppose someone asked you to describe what you had found during EDA. In other words, what is the data about, and what relationships have you found among the three variables (speed, condition, and number of starters)? At this point, we could do that! That will help us make decisions to fit models, as we will start doing in class very soon.

Today we have learned to create:

1. a histogram (we use these to visualize the distribution of one numeric variable)
2. a bar plot (we use these to visualize the distribution of one categorical variable)
3. a scatter plot (we use these to visualize the relationship between two numeric variables)
4. a side by side box plot (we use these to visualize the relationship between a categorical variable and a numeric variable)
5. faceted graphs (we use these to split a plot into separate panels, one for each value of a categorical variable)

As we go through the semester, we will use these types of graphs over and over again to help us answer questions using data. Please refer back to this lab as you need to when you create visualizations of data!

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. Look through the knitted document to check that every answer, code chunk, and plot appears as you expect. Then submit the PDF or HTML document on Canvas. No other formats will be accepted.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 August 25.

The data set used in this lab is part of the data provided as accompanying data sets for the online textbook Broadening Your Statistical Horizons. The data were accessed through the book GitHub repository.