Whenever we get a data set, the first thing we do is to explore it. This exploration involves three main steps: (1) data summary, (2) data cleaning, and (3) data visualization. The first two steps are called data wrangling, and their task is to get the data ready to work with. The third step is called exploratory data analysis, or EDA.
The goal of EDA is to discover unusual features in the data, to explore the distribution of variables in the data, and to discover relationships between variables in the data. In our course, the answers to these questions will help us decide on the models we will use to answer questions using data.
For today, we are going to focus on data summary and data visualization, and we will talk about data cleaning in a later lab.
Before we start EDA, though, let's talk about how we will actually be working with data in this course. Working with data involves statistical computing, which just means using computer software specifically designed to help us work with data. There are a lot of statistical computing languages, including SAS, Stata, Python, and MATLAB. For our course, we are going to be using R.
R is a computing language that is used in both academia and in industry, so knowing R is a nice thing to add to your CV as you apply for internships, jobs or graduate school. Today, we are going to start familiarizing ourselves with R through the process of exploring data. Don't worry if you have never done computing before - we are going to learn from the ground up in this course!
Okay, let's get started. In our course, we are going to be using a tool called RMarkdown to help us perform computing in R. For clarity, R is the computing language, and RMarkdown is the tool in which we will be using that language.
Steps:
A file is going to open up on your screen. This is an RMarkdown template - every time you create a new RMarkdown file, this is what you will see.
Before doing anything else, find the Knit option at the top of your RStudio window.
After you have saved the document, your Markdown will start to compile. After a few seconds, you should see your PDF document in a new pop-up window.
NOTE for PC Users: If you are using a PC, you may get a prompt asking you to update or install packages. Agree to the requests! You may have to do this a few times (meaning a few little windows may pop up), but once you have everything updated, you will not have to do this again.
Close your PDF and go back to your RMarkdown file. At this stage, we are still looking at the template. Let's delete everything that is not necessary for our work today. For every Markdown you create in this class, this includes everything from line 12 to the end of the document. Go ahead and delete everything from line 12 to the end. DO NOT delete anything above line 12. You need those commands in order for Markdown to knit.
Now that we've cleared away what we don't need, we are ready to begin putting in our own commands.
The very first step in starting any new Markdown file is to load in the data that you need. If you do not load the data inside of your Markdown file, your Markdown will not Knit. Go to line 12 on your Markdown file. Here, we are going to insert a code chunk so that we can load the data.
Look at the upper right-hand corner of your Markdown file and find Insert. Click Insert and choose R. On line 12 of your Markdown file, a gray box will appear. This box is called a code chunk, or chunk. Anything you type inside of the code chunk the computer will recognize as code. For this particular chunk, we want to load in data. To do that, enter the following command in the R chunk. You can either type it in manually or copy and paste it from this document.
derbyplus <- read.csv("https://raw.githubusercontent.com/proback/BYSH/master/data/derbyplus.csv")
Now, when you paste the command above, the words will appear in your gray box, but nothing else appears to happen. This is because we have not yet run the command. This command will actually tell R "hey, go to this website and download a data set called derbyplus", but R won't do that until we tell it to do so.
To actually tell R to go and download the data, hit the green triangle button to the right of your chunk. We will call this the play button. When you hit play, R runs all the commands in the chunk. In this case, it runs the one-line command that instructs R to access a GitHub site and download some data.
Take a look at the data tab area, i.e., the upper right-hand section of the RStudio window called Environment. You should see that the word derbyplus is now listed in this space. This is R's way of telling you that it has successfully loaded a data set called derbyplus. If you look to the right of the name derbyplus, you will see that R tells us the data set has 122 observations and 5 variables. If you click on the name derbyplus in the workspace, a spreadsheet showing the data set itself will appear. You can scroll through this spreadsheet to see the entire data set.
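The information shown in the Environment pane can also be checked with code. The sketch below uses a tiny stand-in data frame with made-up values so that it runs without an internet connection. The name toy, the condition labels, and the numbers are assumptions for illustration only, not the real Derby data. The same functions should work on derbyplus once it has been loaded.

```r
# Tiny stand-in data frame with made-up values (NOT the real Derby data).
toy <- data.frame(
  year      = c(2001, 2002, 2003, 2004),
  condition = c("dry", "wet", "dry", "wet"),
  speed     = c(53.1, 52.4, 51.0, 53.8),
  starters  = c(17, 18, 16, 18)
)

dim(toy)    # number of rows (observations) and columns (variables)
names(toy)  # the variable names
head(toy)   # the first few rows, like the top of the spreadsheet view
```

Replacing toy with derbyplus in these three calls should report the same number of observations and variables that the Environment pane lists.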
Every single time you make a Markdown file, you should have a chunk at the top that tells RMarkdown where to find the data it will need for the analysis you want to perform. Every. Single. Time.
Now that you have loaded the data, you are ready to start the lab. Go to line 16 of your Markdown. Notice that this leaves one line of blank space between your code chunk and what you are about to type. This space between text and code chunks is necessary in order for your document to format properly.
On line 16, you are going to create a section header for the first lab question. Type two # signs on line 16, followed by a space, and then type Question 1. In other words, line 16 should have ## Question 1. This creates a new section in your Markdown called Question 1. Go ahead and knit; see that you have created a section.
You can create as many sections as you like. Under a section header, you have the ability to type, just like you would in a Word document, so you can type the responses to questions. You can also insert code chunks. Remember, a new chunk can be inserted by clicking on the Insert Chunk button (drop-down menu under Chunks in the upper right corner of your Markdown document).
Knit your document now, and make sure that so far, your document contains one section, Question 1. We are now ready to begin the lab.
Our data for today relates to the Kentucky Derby, a famous horse race that takes place every year in Kentucky. Our data contain information on the winning horse of each Derby since it began.
Good to know...but what years of Derby races do we have information on?
Create a chunk and run the command summary(derbyplus). This will produce a summary of the derbyplus data set. (a) Show the results of running the summary command. (b) What years of Derby races do we have information on?
What we have just done is one step in data summary. Before we start working with a data set, we need to be able to state:
Look at the results of the summary command (summary(derbyplus)), and find the variable condition. This represents the condition of the track during the race. How many options are there for track condition, and what are they? Note: If your results for summary contain the word character, try running table(derbyplus$condition) instead to help you answer this question!
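Whether summary shows counts for a variable depends on how R has stored it. Since R version 4.0, read.csv stores text columns as character by default, and summary reports very little for character data. The sketch below shows the difference on a small made-up column. The name toy and its values are assumptions for illustration, not the real Derby data.

```r
# Made-up track conditions, stored as plain text (character).
toy <- data.frame(
  condition = c("dry", "wet", "dry", "dry", "wet"),
  stringsAsFactors = FALSE
)

summary(toy$condition)          # reports only length, class, and mode
table(toy$condition)            # counts how often each category appears
summary(factor(toy$condition))  # converting to a factor also gives counts
```

Either table or a conversion with factor gives one count per category, which is what we need in order to list the options a categorical variable can take.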
We'll learn more about how we can use the data summary step of the analysis process as we go through the class, and for today, we are going to skip over the data cleaning step (we'll come back to it!). For now, let's move on to data visualization.
Visualizing involves making a variety of plots to explore the data.
The first step to creating plots in R is to install the ggplot2 package. A package is a collection of R code and functions that relate to one another. When you load a package into R, you give R access to all of the functions within the package. We will be using several packages as we move through this class.
Go to the top of your RStudio window and find "Tools". From there, click on "Install Packages." In the blank box, type in ggplot2, and hit install. The computer should automatically begin to load in the packages that you need, but this may take a minute.
Note: Some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, run the code library(rlang). Once this is done, install ggplot2 again.
Once you have installed the package, create a chunk in your RMarkdown. Inside of the chunk, paste the code library(ggplot2), and hit play. You are now ready to begin using the functions inside the ggplot2 package!
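As an alternative to the Tools menu, a package can also be installed with code. The sketch below uses the base R functions requireNamespace and install.packages to install ggplot2 only when it is missing, and then loads it. Installing needs an internet connection, and it is usually safer to run the install step once from the Console rather than inside a chunk that gets knitted.

```r
# Install ggplot2 only if it is not already installed.
# install.packages() downloads from the internet, so run it once,
# ideally from the Console rather than in a knitted chunk.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Loading the package is needed in every new R session and in every
# Markdown file that uses ggplot2 functions.
library(ggplot2)
```

Installing happens once per computer, while library(ggplot2) has to be run each time, which is why the library line belongs in a chunk in your Markdown file.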
One of the variables in this data set is the condition of the track.
To visualize this variable, we are going to use a plot called a bar plot. Paste the following code in a chunk and hit play.
ggplot(derbyplus, aes(x=condition)) + geom_bar(fill='blue')
This ggplot syntax seems a little strange, but it actually mimics the way a person would draw a graph by hand. First, you draw the axes. Then, you add on your data. Thinking through the steps in this manner will help you understand the syntax of ggplots.
Notice that to add each layer to the graph, in the code we use a plus sign. We add the background AND THEN the bars to make the final graph. Let's break that down in more detail:
ggplot(derbyplus, aes(x=condition)): This part of the code creates the axes and the background of the plot. The two arguments are the data set we are using (derbyplus) and the variable(s) that will be used to define the axis/axes (aes). In this case, we defined only that the x-axis would contain information on track condition (aes(x=condition)).
geom_bar(fill='blue'): Once the axes are set, we add on (+) the actual data. In this case, we add bars (geom_bar). We also specify that we want those bars to be filled in blue. This code will change depending on the type of plot and the colors you want to use.
We can add on another layer to our plot: x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.
ggplot(derbyplus, aes(x=condition)) + geom_bar(fill='blue') + labs(title="Number of Races with Conditions of Each Type", x = "Type of Track Condition", y = "Count of Races")
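One way to see the layering idea is to save the plot after each step and look at the pieces one at a time. The sketch below does this on a small made-up data frame. The names toy, p_axes, p_bars, and p_final and the values are assumptions for illustration, not the real Derby data.

```r
library(ggplot2)

# Tiny stand-in data with made-up values (NOT the real Derby data).
toy <- data.frame(condition = c("dry", "wet", "dry", "dry", "wet"))

# Step 1: axes and background only (no data drawn yet).
p_axes <- ggplot(toy, aes(x = condition))

# Step 2: add the bars on top of the axes.
p_bars <- p_axes + geom_bar(fill = "blue")

# Step 3: add the title and axis labels.
p_final <- p_bars + labs(title = "Count of Each Condition (toy data)",
                         x = "Condition", y = "Count")

p_axes   # empty axes
p_bars   # axes plus bars
p_final  # axes, bars, title, and labels
```

Each plus sign adds one more piece to what was already there, so the final plot can be built up and checked one step at a time.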
Create a plot of the number of starters (starters) in this data set. Make your plot purple, and title it "Number of Starters in each Race". Label the x-axis "Number of Starters" and the y-axis "Count of Races".
Another variable in this data set is the speed of the winner.
To visualize this variable, we are going to use a plot called a histogram. Paste the following code in a chunk and hit play.
ggplot(derbyplus, aes(x=speed)) + geom_histogram(bins = 10, fill='cyan', color= 'white')
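In the histogram code, bins sets how many bars are used to cover the range of the data, fill sets the color inside the bars, and color sets the outline of each bar. The sketch below compares a coarse and a finer histogram on made-up speeds. The names toy, p_coarse, and p_fine and the values are assumptions for illustration, not the real Derby data.

```r
library(ggplot2)

# Made-up speeds for illustration only (NOT the real Derby data).
toy <- data.frame(speed = c(50.2, 51.0, 51.4, 51.9, 52.3, 52.5,
                            52.8, 53.0, 53.1, 53.6, 54.0, 54.4))

# Few bins: a coarse picture of the distribution.
p_coarse <- ggplot(toy, aes(x = speed)) +
  geom_histogram(bins = 4, fill = "cyan", color = "white")

# More bins: a finer picture, but each bar holds fewer observations.
p_fine <- ggplot(toy, aes(x = speed)) +
  geom_histogram(bins = 10, fill = "cyan", color = "white")

p_coarse
p_fine
```

There is no single correct number of bins. Trying a few values is a reasonable way to check that the shape you see does not depend on one particular choice.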
So far, we have made plots of three variables: speed, condition, and number of starters. Making plots of one variable is called univariate analysis, meaning we only looked at one variable at a time. Now, we are going to move into multivariate analysis by making plots that look at two or more variables at a time.
We have already determined that both speed and number of starters are numeric variables. Let's make a plot that compares the two! Create a new chunk and run the following code:
ggplot(derbyplus, aes(x=starters, y = speed)) + geom_point()
Just as before, we have two layers. The first draws the background and the second adds on the data. Here, in the first layer, we specify both the x and y axis of the graph, as we have two variables that we are working with.
We used geom_point to add on the second layer. Based on this, what kind of plot does geom_point tell R to make?
Note: for the points, we do not use a fill, but a color.
Okay, so we use a scatter plot to visualize the relationship between two numeric variables. What if we want to visualize the relationship between one numeric variable and one categorical variable?
Create a new chunk and run the following code:
ggplot(derbyplus, aes(x=condition, y = speed, fill = condition)) + geom_boxplot()
The command fill = condition tells R to color each box in the plot according to the track condition.
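The line inside each box of a box plot marks the median of that group, so the picture can be checked against a quick numeric summary. The sketch below computes the median speed for each condition with the base R functions tapply and aggregate, using made-up values. The name toy and its numbers are assumptions for illustration, not the real Derby data.

```r
# Made-up values for illustration only (NOT the real Derby data).
toy <- data.frame(
  condition = c("dry", "dry", "dry", "wet", "wet", "wet"),
  speed     = c(53.0, 54.0, 52.0, 51.0, 50.0, 52.0)
)

# Median speed within each condition: the line inside each box.
group_medians <- tapply(toy$speed, toy$condition, median)
group_medians

# The same summary, returned as a data frame.
aggregate(speed ~ condition, data = toy, FUN = median)
```

Comparing these numbers across groups is the numeric counterpart of comparing the heights of the middle lines in the box plot.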
In addition to the plot we just created, we can use facet plots to compare a numeric and categorical variable, or even two numeric variables and a categorical variable.
Faceted plots take a particular variable, such as the type of track condition, and create plots that are divided by that variable. To see an example, run the code below.
ggplot(derbyplus, aes(x=speed, fill = condition)) + geom_histogram(bins = 10, color = "black") + facet_wrap( ~ condition, ncol=3)
This is nearly the same code we used to make a histogram of speed (here the bars are filled by condition and outlined in black), with the addition of the line facet_wrap( ~ condition, ncol=3). Let's break this addition down.
The command facet_wrap tells R that we are going to separate our graphs based on some categorical variable. The specific variable is then chosen with the code ~condition. We are then able to specify how we want the graphs to be arranged. We want to allow 3 columns (ncol=3).
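The ncol argument only changes how the panels are arranged, not what is inside them. The sketch below draws the same faceted histogram twice on made-up data, once with three columns and once with a single column. The names toy, base_plot, p_side, and p_stacked, the condition labels, and the values are all assumptions for illustration, not the real Derby data.

```r
library(ggplot2)

# Made-up values for illustration only (NOT the real Derby data).
toy <- data.frame(
  condition = rep(c("dry", "damp", "wet"), each = 4),
  speed     = c(53.0, 54.0, 52.5, 53.5,
                52.0, 52.4, 51.8, 53.1,
                51.0, 50.2, 51.6, 52.0)
)

base_plot <- ggplot(toy, aes(x = speed, fill = condition)) +
  geom_histogram(bins = 5, color = "black")

# Three columns: the panels sit side by side in one row.
p_side <- base_plot + facet_wrap(~ condition, ncol = 3)

# One column: the panels are stacked on top of each other,
# which lines up the x-axes for easier comparison.
p_stacked <- base_plot + facet_wrap(~ condition, ncol = 1)

p_side
p_stacked
```

Stacking in one column can make it easier to compare where each distribution sits along the x-axis, while side-by-side panels make it easier to compare bar heights.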
Okay, so at this point we have done a lot of EDA work!! Suppose someone asked you to describe what you had found during EDA. In other words, what is the data about, and what relationships have you found among the three variables (speed, condition, and number of starters)? At this point, we could do that! That will help us make decisions to fit models, as we will start doing in class very soon.