STA 214 Lab 1
Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.
Goal
We have all learned R in STA 112, but it may have been a while for some of us. Today, we are going to refresh our knowledge of using R to do key data work. We are also going to practice working with categorical response variables, as this will help us as we move into our work with logistic regression.
Getting Started:
Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.
Clearing the Work space
For most of us, we used R in STA 112. This means that when you open RStudio, the “data environment” (the upper right hand panel of RStudio) where all our data sets are stored may be very full! If your Environment is blank, don’t worry about it! If it’s not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.
To clear all of the data in the upper right hand panel, look at the top of the panel. Beside “Import Dataset” there is an image that looks like a small broom. Pushing that will “clean” your space, meaning that it will remove all the contents within the panel. It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.
Loading the Data
Now that our Environment is clear, we need to load a new data set into it. For our lab today, we will be working with a data set on animal adoption that you can find on Canvas. We will access data from a variety of sources in this course, but today we will be loading our data from a csv file.
To load the data, complete the following steps:
- Step 1: Go to Canvas and download the data for Lab 1.
- Step 2: Go back to R and look at the upper right hand panel of your RStudio screen (the Environment tab).
- Step 3: Find “Import Dataset” or “Import” and click on it.
- Step 4: Choose “Text (base)” or “From CSV” (it will depend on your computer).
- Step 5: Find your data (AdoptedData.csv) in the list that comes up. Choose it!
- Step 6: Now, look at the bottom right hand panel of your screen. You should see a line of code with something like AdoptedData <- read.csv("AdoptedData.csv") or AdoptedData <- read_csv("AdoptedData.csv").
- Step 7: Copy that ENTIRE line of code.
- Step 8: In your Markdown file, insert a code chunk. Look at the top of your Markdown file, and find Insert. Click it. From the drop-down menu, choose R. Click it! A gray box should appear in your Markdown file. This is a code chunk.
- Step 9: Paste the line from Step 6 into this gray code chunk, and press the green arrow (the play button).
Now, look on the upper right hand panel of your RStudio screen. See how you now have a data set called AdoptedData? Great! We are ready to go!
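If you are curious what that line of code is doing, here is a small sketch you can run without the lab file. It assumes nothing about AdoptedData.csv: the file, the animal names, and the values below are all made up for illustration, and ToyData is a hypothetical name.

```r
# Toy stand-in for a downloaded file: write a tiny CSV to a temporary location
# (the names and values are invented; they are not from the lab data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Name,AdoptedLessThan30",
             "Biscuit,Less than 30",
             "Pepper,At least 30",
             "Mango,Less than 30"), toy_path)

# read.csv() turns the file into a data frame, just like the Import button does
ToyData <- read.csv(toy_path)

dim(ToyData)    # number of rows, then number of columns
names(ToyData)  # the column names
```

Running dim() and names() right after loading is a quick way to check that the data came in with the number of rows and columns you expected.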
Exploratory Data Analysis
Our data set for today has \(n = 1000\) rows and 10 columns. Each row represents an animal that was brought into the same animal shelter for adoption within the same year. All of these animals were adopted by the end of the year.
One of the important things for shelters to know is how long in general it takes an animal to be adopted. This helps them plan for funds they need, as well as determining how many pet foster parents might be needed.
The shelter has recorded whether each animal was adopted in (1) less than 30 days or (2) if it took 30 days or more for the animal to be adopted. We will refer to this information as adoption time, and the information is stored in the AdoptedLessThan30 column in the data set.
Question 1
Did it take the first animal in the data set (a) less than 30 days or (b) at least 30 days to be adopted?
Hint: You can find this information either by opening the data set, or by printing out the first row.
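As a reminder of what printing a row looks like, here is a minimal sketch on a made-up data frame. ToyData and its values are hypothetical stand-ins, not the lab data.

```r
# A tiny made-up data frame (hypothetical names and values)
ToyData <- data.frame(
  Name = c("Biscuit", "Pepper", "Mango"),
  AdoptedLessThan30 = c("Less than 30", "At least 30", "Less than 30")
)

head(ToyData, 1)  # print only the first row
ToyData[1, ]      # the same row, using [row, column] indexing
```

Both lines show the same row, so use whichever you find easier to remember.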
We can look at individual rows, but typically when our variable of interest is categorical as it is today, we are interested in looking at a table of the variable of interest. This helps us see the distribution of the variable, which means what values are possible for the variable and how often each value occurs in the data set.
To build a table in R, we use the following:
table(AdoptedData$AdoptedLessThan30)
Remember that R works using the structure command(object). The command is what you want R to actually do. In this case, we will use table as our command because we want to create a table. The object is what you want the command to be executed on. In other words, we want a table of what? In our case, we want a table of a single column in our data set.
In R, to tell the computer we wish to work with a single column in a data set (like AdoptedLessThan30), we use $. So, to tell the computer “Please go into the data set and look at only this specific column”, we use AdoptedData$AdoptedLessThan30.
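To see the command(object) structure and the $ in action without touching the lab data, here is a small sketch on a made-up column. ToyData and its four values are hypothetical.

```r
# Made-up data: four animals and their adoption time category
ToyData <- data.frame(
  AdoptedLessThan30 = c("Less than 30", "At least 30",
                        "Less than 30", "Less than 30")
)

# command = table, object = one column pulled out of the data set with $
toy_counts <- table(ToyData$AdoptedLessThan30)
toy_counts
```

The table lists each value that appears in the column along with how many rows have that value.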
Question 2
How many animals were adopted in less than 30 days?
As we recalled in our first class, we can report more than just raw counts using this table.
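One way to move from counts to proportions is the base R function prop.table(). The sketch below uses made-up values, and toy_counts and toy_props are hypothetical names.

```r
# Made-up adoption time categories for four animals
toy_counts <- table(c("Less than 30", "At least 30",
                      "Less than 30", "Less than 30"))

# prop.table() divides each count by the total number of observations
toy_props <- prop.table(toy_counts)

# Multiply by 100 to report percentages, rounded to one decimal place
round(100 * toy_props, 1)
```

The proportions always add up to 1, which is a handy check that nothing was left out.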
Question 3
What percent of animals in the data set were adopted in less than 30 days?
Question 4
How many times more animals in the data set were adopted in less than 30 days than were adopted in at least 30 days (meaning 30 days or more)?
All of these different numeric values allow us to describe the distribution of our variable of interest. We can also use them to describe trends in the data to those who might be interested, like staff at the shelter.
In this course, we will talk about considerations when we are speaking to data stakeholders, which means people who are interested in the results of our analysis. This means we will pick up some tips to consider when presenting results. Let’s get one now.
Professionalism: Note 1
One thing to note is that the name of our current variable of interest is very cumbersome: AdoptedLessThan30. When we are writing about our conclusions to a client or anyone who is interested in our results, we never use the name of the variable that is given in R. The only exception is if that variable name is a proper word (like height or weight).
Question 5
So, it is not professional to say that 310 animals in the data set have AdoptedLessThan30 equal to “At least 30”. How could this be phrased professionally?
In addition to how we write about data results, we also need to think about the format of any output that we show. For instance, we have created a table to analyze the data so far, but that table is raw R output and is not formatted very professionally. Luckily, there is a very cool code that helps us convert an R table into a professionally formatted table:
knitr::kable( )
If you put code that produces a table inside of the ( ), the table will be formatted professionally, ready for a report.
knitr::kable( table(AdoptedData$AdoptedLessThan30) )
You can even add a title:
knitr::kable( table(AdoptedData$AdoptedLessThan30), caption = "The Title You Want")
Go ahead and run this code in R. Hmm…that doesn’t look so pretty. Why are we excited about this?? Knit your document, and take a look at the table you see there.
Ah, there we go. It looks great, except that the column names are strange. Let’s fix those:
knitr::kable( table(AdoptedData$AdoptedLessThan30), caption = "The Title You Want", col.names = c("Column 1 Name", "Column 2 Name"))
Question 6
Create a table of the AdoptedLessThan30 variable that is professionally formatted using knitr::kable(). Title your table “Table: Days until adoption”, with the first column labelled “Days until Adoption” and the second labelled “Number of Animals”. Show this table as the answer to this question.
We will learn more of these tips as we move throughout the course. The key is to remember to always apply these tips when you are creating output for an analysis that will be seen by anyone other than just you.
Creating a Bar Graph
Now that we have made a table of our response variable of interest, let’s try a graph. With categorical data, we tend to use a plot called a bar graph, and we will be using a package in R called ggplot2 to create one. If you have already installed ggplot2, you can skip the next section.
Installing ggplot2
To make visualizations in this course, we will be working with a group of functions in the ggplot2 package in R. The package ggplot2 is a powerful collection of R functions for creating flexible, professional graphics.
The first step to using ggplot2 is to install the ggplot2 package. Go to the top of your RStudio window and find “Tools”. From there, click on “Install Packages.” In the blank box, type in ggplot2, and hit install. The computer should automatically begin to load in the packages that you need, but this may take a minute.
Note that this process of installing a package is one you need to do only once. Think of this as teaching R a new set of skills. Once it knows the skills, you don’t have to teach it again.
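If you are not sure whether you have already taught R a particular skill, you can ask it. The sketch below uses a small helper, is_installed, which is a hypothetical name made up for this example; it relies on the base R function requireNamespace().

```r
# Hypothetical helper: returns TRUE if a package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # "stats" ships with R, so it is always available

# If is_installed("ggplot2") comes back FALSE, run this ONCE in the Console:
# install.packages("ggplot2")
```

The install line is left as a comment on purpose. Installing is a one-time step, so it belongs in the Console rather than in a chunk that runs every time you knit.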
Now, some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, copy and paste the following into a chunk and hit play. Nothing will seem to happen, and that’s okay.
library(rlang)
Creating a Bar Plot
Once you have installed the ggplot2 package, you need to tell R that you would like to begin using its functions by loading the library. Remember that we said installing a package is like teaching R a skill? Loading a library is how we tell R we want it to use those skills. To do this, create a chunk in your RMarkdown, copy and paste the following, and hit play.
suppressMessages(library(ggplot2))
Note that this process of loading a library is one you have to do ONCE each time we start a lab or project. This tells R “Hey, remember those skills we taught you? Use them.”
To create a bar plot for the adoption time, paste the following code in a chunk and hit play.
ggplot(AdoptedData, aes(x=AdoptedLessThan30)) +
geom_bar(fill='blue', col = 'white')
The creation of plots in ggplot2 requires building the plot in layers. First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code.
ggplot()
Once we have built the background, we need to tell R what data we want to use to create the graph. We need to tell it the name of the data set we want to use (AdoptedData) and then the variable we want to work with (AdoptedLessThan30). The aes part of the code stands for “aesthetics”, which is how we tell R which variables go where on the graph. Here, we want adoption time to be on the x-axis of our graph.
ggplot(AdoptedData, aes(x=AdoptedLessThan30))
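The next step is to tell R what type of graph we want to make using this variable. In this case, we want a bar plot. The command we use to build a bar plot is + geom_bar(). Note that the + goes at the end of a line, not at the start of the next one; otherwise R thinks the first line is already finished.
ggplot(AdoptedData, aes(x=AdoptedLessThan30)) +
geom_bar()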
Now, we want to specify some things about our bar plot. We want the bars to be filled (fill) in blue and we want them outlined (col) in white.
ggplot(AdoptedData, aes(x=AdoptedLessThan30)) +
geom_bar(fill="blue", col = "white")
Let’s try it.
Question 7
Create a bar plot of the column indicating whether or not the animal was fostered before being adopted. Make the bars of the graph gold and outline them in black. Show your result.
We can add on another layer to our plot by adding x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.
ggplot(AdoptedData, aes(x=AdoptedLessThan30)) +
geom_bar(fill='blue', col = 'white') +
labs(title="Figure 1:", x = "Number of Days before Adoption", y = "Number of Animals")
This ggplot syntax actually mimics the ways humans would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add a label. Thinking through the steps in this manner will help you understand the syntax of this package.
Question 8
Copy the code you used to make the graph in Question 7. Now, add the title “Figure 2:” and add appropriate labels to the x and y axis.
One VERY important thing to remember when we make plots is to make sure the axes are easily interpretable by your reader. You do not want to use default variable names, like “FosterYes”. Instead, we want clear labels like “Was the animal fostered?”.
This is a requirement for all graphs you make in this course - you must label your axes appropriately and title your graphs.
Examining Relationships
At this point, we have created both a table and a graph of adoption time. Now, we want to start to see if there are other variables in the data set that might be related to adoption time. In other words, we are going to add an explanatory variable.
Let’s start with foster status. Does adoption time differ depending on whether or not the animal was in foster care? To find this out, we need to expand our initial table.
table( AdoptedData$AdoptedLessThan30, AdoptedData$WasFostered)
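To see how a two-variable table is laid out, here is a minimal sketch on six made-up animals. The values (including “Yes” and “No” for foster status) are assumptions for illustration only, and the names toy_two_way and toy_col_props are hypothetical.

```r
# Made-up data: adoption time and foster status for six animals
ToyData <- data.frame(
  AdoptedLessThan30 = c("Less than 30", "Less than 30", "At least 30",
                        "Less than 30", "At least 30", "At least 30"),
  WasFostered = c("No", "No", "No", "Yes", "Yes", "Yes")
)

# The first variable goes in the rows, the second in the columns
toy_two_way <- table(ToyData$AdoptedLessThan30, ToyData$WasFostered)
toy_two_way

# margin = 2 converts counts to proportions within each column,
# so each foster status group adds up to 1
toy_col_props <- prop.table(toy_two_way, margin = 2)
round(toy_col_props, 2)
```

Proportions within each column are one option for comparing groups of different sizes; raw counts alone can be misleading when one group is much larger than the other.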
Question 9
Using our techniques from before, convert this table into professional format and include a title.
Question 10
Based on the data, is there a practically significant difference in adoption time (at least 30 or less than 30) depending on if the animal was fostered? Use at least one numeric quantity to justify your answer.
Now, one thing to note is that we are exploring for an association, not a causal effect. In other words, we are not exploring whether being in foster care causes an animal to be adopted in more or less time. Animals that are adopted faster might be less likely to end up needing foster families, for instance. This means that all we are looking for is a relationship between the two variables, but we are not exploring whether one variable causes changes in another.
Okay, so we have used tables to explore the relationship between the two categorical variables. However, what if we get numeric data involved?
Question 11
Create a table to explore the relationship between the age of the animal in years and adoption time. Explain why this table might be difficult to use to explore this relationship.
When we have one categorical and one numeric variable, we need to use a different tool instead of a table. This is something we will find as we move throughout the course - the tools we use are very dependent upon the type of variables we are working with.
Side by side box plots
When we have one categorical variable and one numeric variable, one tool we can use to explore the relationship between the two is a side-by-side box plot.
ggplot(AdoptedData, aes(x = AdoptedLessThan30, y = InitialAge_Years)) +
geom_boxplot()
The only change we made from the bar plot is that we now use geom_boxplot instead of geom_bar, and we now specify the variable for both the x and y axis.
Question 12
Adapt the code above to change the color of the boxes to any color except white, and outline the boxes in any color except black or white. Add labels and a title, and show your final plot.
Question 13
Based on the graph, does there seem to be a practically significant difference in adoption time (whether or not the animal was adopted in less than 30 days) depending on the age of the animal in years? Explain your answer.
Ideally, we would like to be able to explore this and other relationships in more depth. This is where logistic regression will come in, and we will work on that next class.
Just for Fun: Themes
We have completed the core content for today’s lab, but I did want to take a minute to let us play a bit in R. When you create graphs, there are so many options and ways for you to personalize the graphs to suit your style and the style of the individuals you are presenting for. This is why in R there are really cool packages called themes or palettes that allow you to personalize your graphs.
For instance, let’s say I want my graph to look like Barbie drew it (yep, it’s exactly what you think):
# Install the packages
remotes::install_github("MatthewBJane/theme_park")
library(ThemePark)
If you are prompted during the install process to type something, type 3. Once you have finished installing, you need to put a # in front of remotes::install_github("MatthewBJane/theme_park"). If you do not do this, your document will not knit.
If you get stuck, let me know!! This part is just for fun, so if it causes any frustration just stop and let me know!
# Make the plot
ggplot(AdoptedData, aes(x = AdoptedLessThan30, y = InitialAge_Years)) +
geom_boxplot(color = barbie_theme_colors["medium"]) +
labs( x = "Adoption Time", y = "Age in Years", title = "Figure 3:") +
theme_barbie()
If you run the code below, you will see other themes in this package. Try a few out!!
head(themepark_themes)
Question 14
Which theme from the list do you like best and why? (There is no single right answer here, I just want to know!!)
Next Time
Today we have explored how we can use graphs and tables to look at the relationship between a binary response variable and either a categorical or a numeric explanatory variable. Next time, we are going to work on how to use models to formalize these relationships. This will also allow us to explore more complex relationships, as well as make predictions.