Complete all Questions and submit final PDF under Assignments in Canvas.

The Goal

Whenever we get a data set, the first thing we do is to explore it. This exploration involves three main steps: (1) data summary, (2) data cleaning, and (3) data visualization. The first two steps are called data wrangling, and their task is to get the data ready to work with. The third step is called exploratory data analysis, or EDA.

The goal of EDA is to discover unusual features in the data, to explore the distribution of variables in the data, and to discover relationships between variables in the data. In our course, the answers to these questions will help us decide on the models we will use to answer questions using data.

For today, we are going to focus on data summary and data visualization, and we will talk about data wrangling in a later lab.

Before we start EDA, though, let's talk about how we will actually be working with data in this course. Working with data involves using statistical computing, which just means computer software specifically designed to help us work with data. There are a lot of statistical computing languages, including SAS, Stata, Python, and MATLAB. For our course, we are going to be using R.

R is a computing language that is used in both academia and in industry, so knowing R is a nice thing to add to your CV as you apply for internships, jobs or graduate school. Today, we are going to start familiarizing ourselves with R through the process of exploring data. Don't worry if you have never done computing before - we are going to learn from the ground up in this course!

Creating a Markdown Document

Okay, let's get started. In our course, we are going to be using a tool called RMarkdown to help us perform computing in R. For clarity, R is the computing language, and RMarkdown is the tool in which we will be using that language.

Steps:

In the video, I showed everyone how to access R online and how to download it onto your computer. However you chose to do it (online or on your computer), go ahead and open up RStudio.
To create an RMarkdown document, look at the top of your RStudio screen. Below the word "File" in the upper toolbar, you should see a symbol that looks like a small piece of paper with a green plus sign in the upper left corner. Click on it.
Choose the option RMarkdown, which should be the third option presented.
You will be prompted to name your document. Make sure the title you choose matches the lab title. In this case, use Lab 1.
For the author, write your full name.
Finally, select PDF as your preferred output format.
And then hit okay!

A file is going to open up on your screen. This is an RMarkdown template - every time you create a new RMarkdown file, this is what you will see.

Before doing anything else, find the Knit option at the top of your RStudio window.

Click on the drop-down arrow and choose Knit to PDF.
At this point, you will be prompted to save your document. I would highly recommend saving the document somewhere you can find later.
When you name your document, make sure to choose a name without spaces. For instance, choose "Lab1" over "Lab 1". Including spaces will make it difficult for the computer to save your document.

After you have saved the document, your Markdown will start to compile. After a few seconds, you should see your PDF document in a new pop-up window.

NOTE for PC Users: If you are using a PC, you may get a prompt asking you to update or install packages. Agree to the requests! You may have to do this a few times (meaning a few little windows may pop up), but once you have everything updated, you will not have to do this again.

Close your PDF and go back to your RMarkdown file. At this stage, we are still looking at the template. Let's delete everything that is not necessary for our work today. For every Markdown you create in this class, this includes everything from line 12 to the end of the document. Go ahead and delete everything from line 12 to the end. DO NOT delete anything above line 12. You need those commands in order for Markdown to knit.

Now that we've cleared away what we don't need, we are ready to begin putting in our own commands.

Loading the Data

The very first step in starting any new Markdown file is to load in the data that you need. If you do not load the data inside of your Markdown file, your Markdown will not Knit. Go to line 12 on your Markdown file. Here, we are going to insert a code chunk so that we can load the data.

Look at the upper right hand corner of your Markdown file and find Insert. Click Insert and choose R. On line 12 of your Markdown file, a gray box will appear. This box is called a code chunk, or chunk. Anything you type inside of the code chunk will be recognized by the computer as code. For this particular chunk, we want to load in data. To do that, enter the following command in the R chunk. You can either type it in manually or copy and paste it from this document.

derbyplus <- read.csv("https://raw.githubusercontent.com/proback/BYSH/master/data/derbyplus.csv")

Now, when you paste the command above, the words will appear in your gray box, but nothing else appears to happen. This is because we have not yet run the command. This command will actually tell R "hey, go to this website and download a data set called derbyplus", but R won't do that until we tell it to do so.

To actually tell R to go and download the data, hit the green triangle button to the right of your chunk. We will call this the play button. When you hit play, R runs all the commands in the chunk. In this case, it runs the one line command that instructs R to access a GitHub site and download some data.

Take a look at the data tab area, i.e., the upper right hand section of the RStudio window called Environment. You should see that the word derbyplus is now listed in this space. This is R's way of telling you that it has successfully loaded a data set called derbyplus. If you look to the right of the name derbyplus, you will see that R tells us the data set has 122 observations and 5 variables. If you click on the name derbyplus in the workspace, a spreadsheet showing the data set itself will appear. You can scroll through this spreadsheet to see the entire data set.
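
The same facts that the Environment tab displays can also be checked from inside a chunk. The sketch below is an optional extra, not a required part of the lab. It repeats the loading command so that it can run on its own, and it uses the base R functions dim, names, and head.

```r
# Load the data (same command as above) so this chunk runs on its own.
derbyplus <- read.csv("https://raw.githubusercontent.com/proback/BYSH/master/data/derbyplus.csv")

# Number of rows (observations) and columns (variables).
dim(derbyplus)

# Names of the variables (columns).
names(derbyplus)

# First three rows, as a quick peek at the spreadsheet.
head(derbyplus, 3)
```

If the data loaded correctly, the two numbers reported by dim should match the observations and variables listed in the Environment tab.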

Every single time you make a Markdown file, you should have a chunk at the top that tells RMarkdown where to find the data it will need for the analysis you want to perform. Every. Single. Time.

Answering the lab questions

Now that you have loaded the data, you are ready to start the lab. Go to line 16 of your Markdown. Notice that this leaves one line of blank space between your code chunk and what you are about to type. This space between text and code chunks is necessary in order for your document to format properly.

On line 16, you are going to create a section header for the first lab question. Type two # symbols (##) on line 16, followed by a space and then Question 1. In other words, line 16 should have ## Question 1. This creates a new section in your Markdown called Question 1. Go ahead and knit; see that you have created a section.

You can create as many sections as you like. Under a section header, you have the ability to type, just like you would in a Word document, so you can type the responses to questions. You can also insert code chunks. Remember, a new chunk can be inserted by clicking on the Insert Chunk button (drop-down menu under Chunks on the upper right corner of your Markdown document).

Knit your document now, and make sure that so far, your document contains one section, Question 1. We are now ready to begin the lab.

Data Wrangling: Data Summary

Our data for today relate to the Kentucky Derby, a famous horse race that takes place every year in Kentucky. Our data contain information on the winning horse of each Derby over a long span of years.

Good to know...but what years of Derby races do we have information on?

One powerful data exploration tool is a data summary. To make a data summary, create a new chunk of code. In that chunk you are going to run the command summary(derbyplus). This will produce a summary of the derbyplus data set. (a) Show the results of running the summary command. (b) What years of Derby races do we have information on?
The summary command is really useful, but sometimes we have to look at the data itself to answer certain questions. Click on the data set in your data tab (the Environment section in the upper right hand panel) to open it. What is the name of the horse who won the Kentucky Derby in 1905?

What we have just done is one step in data summary. Before we start working with a data set, we need to be able to state:

How many rows and how many columns the data set has
What each row represents
What range of values are possible for each column

Look back at the output from your summary command (summary(derbyplus)), and find the variable condition. This represents the condition of the track during the race. How many options are there for track condition, and what are they?
Note: If your results for summary contain the word character, try running table(derbyplus$condition) instead to help you answer this question!
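
To see why summary and table can give different kinds of answers, here is a small sketch on a made-up data set. The data frame toy and its columns surface and minutes are invented for illustration only; they are not part of the Derby data.

```r
# A made-up data set for illustration only (not the Derby data).
toy <- data.frame(
  surface = c("dry", "wet", "dry", "dry"),  # text (character) column
  minutes = c(2.1, 2.4, 2.0, 2.2),          # numeric column
  stringsAsFactors = FALSE
)

# How many rows and how many columns?
nrow(toy)
ncol(toy)

# summary() reports min, mean, max, etc. for numeric columns,
# but only the word "character" for text columns.
summary(toy)

# table() counts how many rows fall into each category.
surface_counts <- table(toy$surface)
surface_counts

# range() gives the smallest and largest value of a numeric column.
range(toy$minutes)
```

The same pattern applies to any data set: use summary for a first overview, and table when a column holds categories stored as text.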

We'll learn more about how we can use the data summary step of the analysis process as we go through the class, and for today, we are going to skip over the data cleaning step (we'll come back to it!). For now, let's move on to data visualization.

EDA: Visualizing the Data

Visualizing involves making a variety of plots to explore the data.

The first step to creating plots in R is to install the ggplot2 package. A package is a collection of R code that belongs together. When you load a package into R, you give R access to all of the functions within the package. We will be using several packages as we move through this class.

Go to the top of your RStudio window and find "Tools". From there, click on "Install Packages." In the blank box, type in ggplot2, and hit install. The computer should automatically begin to install the packages that you need, but this may take a minute.

Note: Some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, run the code library(rlang). Once this is done, install ggplot2 again.

Once you have installed the package, create a chunk in your RMarkdown. Inside of the chunk, paste the code library(ggplot2), and hit play. You are now ready to begin using the functions inside the ggplot2 package!
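
If you prefer typing to clicking, the install step can also be done with code. The sketch below is an optional alternative to the Tools menu, not a required part of the lab. It assumes your computer can reach the internet, and the install line is best run in the Console rather than in a chunk you plan to knit.

```r
# Install ggplot2 only if it is not already on this computer.
# Run this step in the Console, not in a chunk you plan to knit.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package so its functions are available.
library(ggplot2)
```

Installing happens once per computer, while library(ggplot2) needs to be run in every new Markdown file that makes plots.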

EDA: Plotting One Variable

Making a plot

One of the variables in this data set is the condition of the track.

Based on what we saw in our data summary, is track condition a numeric or categorical variable?

To visualize this variable, we are going to use a plot. Paste the following code in a chunk and hit play.

ggplot(derbyplus, aes(x=condition)) + geom_bar(fill='blue')

Show your plot as part of your answer. What is the name of the plot you have created? Which track condition occurred the most often?

This ggplot syntax may seem a little strange, but it actually mimics the way a person would draw a graph by hand. First, you draw the axes. Then, you add on your data. Thinking through the steps in this manner will help you understand the syntax of ggplots.

Notice that to add each layer to the graph, in the code we use a plus sign. We add the background AND THEN the bars to make the final graph. Let's break that down in more detail:

ggplot(derbyplus, aes(x=condition)): This part of the code creates the axes and the background of the plot. The two arguments are the data set we are using (derbyplus) and the variable(s) that will be used to define the axis or axes, which go inside aes() (short for aesthetics). In this case, we defined only that the x-axis would contain information on track condition (aes(x=condition)).
geom_bar(fill='blue'): Once the axes are set, we add on (+) the actual data. In this case, we add bars (geom_bar). We also specify that we want those bars to be filled in blue. This code will change depending on the type of plot and the colors you want to use.
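
The two layers described above can also be drawn one at a time. The sketch below uses a made-up data set (the data frame toy and its column surface are invented for illustration, not taken from the Derby data) so that you can see what each layer contributes on its own.

```r
library(ggplot2)

# A made-up data set for illustration only (not the Derby data).
toy <- data.frame(surface = c("dry", "wet", "dry", "dry"))

# Layer 1: the axes and background only. No data is drawn yet.
base_plot <- ggplot(toy, aes(x = surface))
base_plot

# Layer 2: add bars on top of the background with a plus sign.
bar_plot <- base_plot + geom_bar(fill = "blue")
bar_plot
```

Running base_plot by itself shows an empty graph with a labeled x-axis; the bars only appear once the geom_bar layer is added.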

Create the same plot as in the previous question, but change the color of bars to red. Show your code and your plot.

Labels and Titles

We can add on another layer to our plot: x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.

ggplot(derbyplus, aes(x=condition)) + geom_bar(fill='blue') + labs(title="Number of Races with Conditions of Each Type", x = "Type of Track Condition", y = "Count of Races")

Make a bar plot of the different numbers of starters (starters) in this data set. Make your plot purple, and title it "Number of Starters in each Race". Label the x-axis "Number of Starters" and the y-axis "Count of Races".

Histograms

Another variable in this data set is the speed of the winner.

Based on what we saw in our data summary, is speed a numeric or categorical variable?

To visualize this variable, we are going to use a plot called a histogram. Paste the following code in a chunk and hit play.

ggplot(derbyplus, aes(x=speed)) + geom_histogram(bins = 10, fill='cyan', color= 'white')

Aside from the color of the graph, what about this code is different from the code we used to make the plot in Questions 5 and 6? Hint: There should be three differences.

Make your histogram gold, with black outlines around the bars. Title it "Speed of Winning Horse in Each Race". Label the x-axis "Speed of Winner." Show both your code and the plot as part of your answer. Hints: There are two colors in the histogram code I gave you. Use that to help you figure out which color is specified by 'fill' and by 'color' in the code.

EDA: Comparing Variables

So far, we have made plots of three variables: speed, condition, and number of starters. Making plots of one variable is called univariate analysis, meaning we only looked at one variable at a time. Now, we are going to move into multivariate analysis by making plots that look at two or more variables at a time.

Comparing Speed and Number of Starters

We have already determined that both speed and number of starters are numeric variables. Let's make a plot that compares the two! Create a new chunk and run the following code:

ggplot(derbyplus, aes(x=starters, y = speed)) + geom_point()

Just as before, we have two layers. The first draws the background and the second adds on the data. Here, in the first layer, we specify both the x and y axis of the graph, as we have two variables that we are working with.

In the graph we have just made, we use geom_point to add on the second layer. Based on this, what kind of plot does geom_point tell R to make?

(a) Add labels to your graph. (b) How would we change the dots to make them blue? Show your result. Hint: This time we are not specifying a fill, but a color.

Based on what you see in the plot, do you think there is a relationship between number of starters and winning speed? Explain.

Comparing Speed and Condition (Method 1)

Okay, so we use a scatter plot to visualize the relationship between two numeric variables. What if we want to visualize the relationship between one numeric variable and one categorical variable?

Create a new chunk and run the following code:

ggplot(derbyplus, aes(x=condition, y = speed, fill = condition)) + geom_boxplot()

The command fill = condition tells R to color each box in the plot according to the track condition.
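
Notice that this time fill sits inside aes() instead of inside the geom. The sketch below contrasts the two placements on a made-up data set (the data frame toy and its column surface are invented for illustration, not taken from the Derby data). It uses bars rather than the plot above, but the idea is the same.

```r
library(ggplot2)

# A made-up data set for illustration only (not the Derby data).
toy <- data.frame(surface = c("dry", "wet", "dry", "dry", "wet"))

# Fixed color: fill is set outside aes(), so every bar is the same color.
fixed_fill <- ggplot(toy, aes(x = surface)) + geom_bar(fill = "blue")
fixed_fill

# Mapped color: fill is set inside aes(), so each category gets its
# own color and R adds a legend explaining the colors.
mapped_fill <- ggplot(toy, aes(x = surface, fill = surface)) + geom_bar()
mapped_fill
```

A quick rule of thumb: a color name in quotes goes inside the geom, while a variable name goes inside aes().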

What kind of plot have you created?

Based on what you see in the plot, do you think there is a relationship between track condition and winning speed? Explain.

Comparing Speed and Condition with a Facet Plot

In addition to the plot we just created, we can use facet plots to compare a numeric and categorical variable, or even two numeric variables and a categorical variable.

Faceted plots take a particular variable, such as the type of track condition, and create plots that are divided by that variable. To see an example, run the code below.

ggplot(derbyplus, aes(x=speed, fill = condition)) + geom_histogram(bins = 10, color = "black") +  facet_wrap( ~ condition, ncol=3)

This is nearly the same code we used to make a histogram of speed. The bars are now filled according to track condition (fill = condition), and we have added the line facet_wrap( ~ condition, ncol=3). Let's break this addition down.

The command facet_wrap tells R that we are going to separate our graphs based on some categorical variable. The specific variable is then chosen with the code ~condition. We are then able to specify how we want the graphs to be stacked. We want to allow 3 columns (ncol=3).

What command would you use if you wanted only two columns? Show the resultant plot, and change the x-axis to say "Winning Speed".

Create a faceted plot for number of starters by track condition. Show the resultant plot, and change the x-axis to say "Number of Starters".
Based on what you see in the plot, do you think there is a relationship between track condition and number of starters? Explain.

Wrapping it Up

Okay, so at this point we have done a lot of EDA work! Suppose someone asked you to describe what you had found during EDA. In other words, what are the data about, and what relationships have you found among the three variables (speed, condition, and number of starters)? At this point, we could do that! That will help us make decisions about which models to fit, as we will start doing in class very soon.

Turning in your assignment

You must submit a PDF or HTML document to Canvas. No other formats will be accepted.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2021 August 21.

The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.

The data set used in this lab is part of the data provided as accompanying data sets for the online textbook Broadening Your Statistical Horizons. The data were accessed through the book GitHub repository.

STA 112 Lab 1: Exploratory Data Analysis