STA 214 Lab 1

Complete all questions and submit your final PDF or HTML file (either works!) under Assignments in Canvas.

Goal

We have all learned R in STA 112, but it may have been a while for some of us. Today, we are going to refresh our knowledge of using R to do key data work. We are also going to practice working with categorical response variables, as this will help us as we move into our work with logistic regression.

Getting Started:

Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.

Clearing the Workspace

Most of us used R in STA 112. This means that when you open RStudio, the “data environment” (the upper right hand panel of RStudio) where all our data sets are stored may be very full! If your Environment is blank, don’t worry about it! If it’s not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.

To clear all of the data in the upper right hand panel, look at the top of the panel. Beside “Import Dataset” there is an image that looks like a small broom. Pushing that will “clean” your space, meaning that it will remove all the contents within the panel. It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.

Loading the Data

Now that our Environment is clear, we need to load a new data set into it. For our lab today, we will be working with a data set on animal adoption that you can find on Canvas. We will access data from a variety of sources in this course, but today we will be loading our data from a csv file.

To load the data, complete the following steps:

  • Step 1: Go to Canvas and download the data for Lab 1.
  • Step 2: Go back to R and look at the upper right hand panel of your RStudio screen (the Environment tab).
  • Step 3: Find “Import Dataset” or “Import” and click on it.
  • Step 4: Choose “Text (base)” or “From CSV” (it will depend on your computer).
  • Step 5: Find your data (AdoptedData.csv) in the list that comes up. Choose it!
  • Step 6: Now, look at the bottom right hand panel of your screen. You should see a line of code with something like AdoptedData <- read.csv("AdoptedData.csv") or AdoptedData <- read_csv("AdoptedData.csv").
  • Step 7: Copy that ENTIRE line of code.
  • Step 8: In your Markdown file, insert a code chunk. Look at the top of your Markdown file, and find Insert. Click it. From the drop-down menu, choose R. Click it! A gray box should appear in your Markdown file. This is a code chunk.
  • Step 9: Paste the line from Step 6 into this gray code chunk, and press the green arrow (the play button).

Now, look on the upper right hand panel of your RStudio screen. See how you now have a data set called AdoptedData? Great! We are ready to go!
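If you would like to double check that an import worked, something like the sketch below can help. So that it runs on its own, it first writes a tiny made-up CSV file to a temporary location. The file, the AnimalID column, and the three rows are invented for illustration and are not the real AdoptedData.csv. With the real data, you would only need the dim() and head() lines, applied to AdoptedData.

```r
# Hypothetical stand-in for the real file: a tiny CSV in a temporary location
toy_path <- tempfile(fileext = ".csv")
writeLines(c("AnimalID,AdoptedLessThan30",
             "1,Less than 30",
             "2,At least 30",
             "3,Less than 30"),
           toy_path)

# Same pattern as the line of code RStudio generates for you
ToyData <- read.csv(toy_path)

dim(ToyData)      # number of rows, then number of columns
head(ToyData, 1)  # print only the first row
```

If the numbers from dim() match what you expect for your data set, the import most likely went fine.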

Exploratory Data Analysis

Our data set for today has \(n = 1000\) rows and 10 columns. Each row represents an animal that was brought into the same animal shelter for adoption within the same year. All of these animals were adopted by the end of the year.

One of the important things for shelters to know is how long in general it takes an animal to be adopted. This helps them plan for funds they need, as well as determining how many pet foster parents might be needed.

The shelter has recorded whether each animal was adopted in (1) less than 30 days or (2) 30 days or more. We will refer to this information as adoption time, and it is stored in the AdoptedLessThan30 column in the data set.

Question 1

Did it take the first animal in the data set (a) less than 30 days or (b) at least 30 days to be adopted?

Hint: You can find this information either by opening the data set, or by printing out the first row.

We can look at individual rows, but typically when our variable of interest is categorical as it is today, we are interested in looking at a table of the variable of interest. This helps us see the distribution of the variable, which means what values are possible for the variable and how often each value occurs in the data set.

To build a table in R, we use the following:

table(AdoptedData$AdoptedLessThan30)

Remember that R works using the structure command(object). The command is what you want R to actually do. In this case, we will use table as our command because we want to create a table. The object is what you want the command to be executed on. In other words, we want a table of what? In our case, we want a table of a single column in our data set.

In R, to tell the computer we wish to work with a single column in a data set (like AdoptedLessThan30), we use $. So, to tell the computer “Please go into the data set and look at only this specific column”, we use AdoptedData$AdoptedLessThan30.
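To see the command(object) structure and the $ working together, here is a minimal sketch on a made-up data set. ToyData and its five values are invented for illustration; the category labels in the real data may differ.

```r
# Made-up data frame standing in for AdoptedData (values are invented)
ToyData <- data.frame(
  AdoptedLessThan30 = c("Less than 30", "At least 30", "Less than 30",
                        "Less than 30", "At least 30")
)

# The $ pulls one column out of the data set
adoption_column <- ToyData$AdoptedLessThan30

# command(object): table() counts how often each value occurs
toy_counts <- table(adoption_column)
toy_counts
```

Storing the column in its own object first is not required. It is only done here to show that table() receives a single column, which is exactly what AdoptedData$AdoptedLessThan30 hands to it in one step.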

Question 2

How many animals were adopted in less than 30 days?

As we recalled in our first class, we can report more than just raw counts using this table.

Question 3

What percent of animals in the data set were adopted in less than 30 days?

Question 4

How many times more animals in the data set were adopted in less than 30 days than were adopted in at least 30 days (meaning 30 days or more)?

All of these different numeric values allow us to describe the distribution of our variable of interest. We can also use this to describe trends in the data to those who might be interested, like staff at the shelter.
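One possible way to get these kinds of values from a table is sketched below. The counts are invented (6 and 2) so the arithmetic is easy to follow, and they are not the lab's real numbers. prop.table() is a base R function that divides each count by the total.

```r
# Invented counts for illustration only (not the real data)
toy_counts <- table(c(rep("Less than 30", 6), rep("At least 30", 2)))

# Proportions: each count divided by the total number of animals
toy_props <- prop.table(toy_counts)

# Percentages, rounded to one decimal place
round(100 * toy_props, 1)

# How many times more animals are in one category than the other
toy_ratio <- toy_counts[["Less than 30"]] / toy_counts[["At least 30"]]
toy_ratio
```

You can also do these calculations by hand from the raw counts. The code is simply a way to check your work.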

In this course, we will talk about considerations when we are speaking to data stakeholders, which means people who are interested in the results of our analysis. This means we will pick up some tips to consider when presenting results. Let’s get one now.

Professionalism: Note 1

One thing to note is that the name of our current variable of interest is very cumbersome: AdoptedLessThan30. When we are writing about our conclusions to a client or anyone who is interested in our results, we never use the name of the variable that is given in R. The only exception is if that variable name is a proper word (like height or weight).

Question 5

So, it is not professional to say that 310 animals in the data set have AdoptedLessThan30 equal to “At least 30”. How could this be phrased professionally?

In addition to how we write about data results, we also need to think about the format of any output that we show. For instance, we have created a table to analyze the data so far, but that table is raw R output and is not formatted very professionally. Luckily, there is a very cool function that helps us convert an R table into a professionally formatted table:

knitr::kable( )

If you put code that produces a table inside of the ( ), the table will be formatted professionally, ready for a report.

knitr::kable( table(AdoptedData$AdoptedLessThan30) )

You can even add a title:

knitr::kable( table(AdoptedData$AdoptedLessThan30), caption ="The Title You Want")

Go ahead and run this code in R. Hmm…that doesn’t look so pretty. Why are we excited about this?? Knit your document, and take a look at the table you see there.

Ah, there we go. It looks great, except that the column names are strange. Let’s fix those:

knitr::kable( table(AdoptedData$AdoptedLessThan30), caption ="The Title You Want", col.names = c("Column 1 Name", "Column 2 Name"))

Question 6

Create a table of the AdoptedLessThan30 variable that is professionally formatted using knitr::kable(). Title your table “Table: Days until adoption”, with the first column labelled “Days until Adoption” and the second labelled “Number of Animals”. Show this table as the answer to this question.

We will learn more of these tips as we move throughout the course. The key is to remember to always apply these tips when you are creating output for an analysis that will be seen by anyone other than just you.

Creating a Bar Graph

Now that we have made a table of our response variable of interest, let’s try a graph. With categorical data, we tend to use a plot called a bar graph, and we will be using a package in R called ggplot2 to create one. If you have already installed ggplot2, you can skip the next section.

Installing ggplot2

To make visualizations in this course, we will be working with a group of functions in the ggplot2 package in R. The package ggplot2 is a powerful collection of R functions for creating flexible, professional graphics.

The first step to using ggplot2 is to install the ggplot2 package. Go to the top of your RStudio window and find “Tools”. From there, click on “Install Packages.” In the blank box, type in ggplot2, and hit install. The computer should automatically begin to load in the packages that you need, but this may take a minute.

Note that this process of installing a package is one you need to do only once. Think of this as teaching R a new set of skills. Once it knows the skills, you don’t have to teach it again.

Now, some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, copy and paste the following into a chunk and hit play. Nothing will seem to happen, and that’s okay.

library(rlang)

Creating a Bar Plot

Once you have installed the ggplot2 package, you need to tell R that you would like to begin using its functions by loading the library. Remember that we said installing a package is like teaching R a skill? Loading a library is how we tell R we want it to use those skills. To do this, create a chunk in your RMarkdown, copy and paste the following, and hit play.

suppressMessages(library(ggplot2))

Note that this process of loading a library is one you have to do ONCE each time we start a lab or project. This tells R “Hey, remember those skills we taught you? Use them.”

To create a bar plot for the adoption time, paste the following code in a chunk and hit play.

ggplot(AdoptedData, aes(x=AdoptedLessThan30)) +  
  geom_bar(fill='blue', col = 'white')

The creation of plots in ggplot2 requires building the plot in layers. First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code.

ggplot()

Once we have built the background, we need to tell R what data we want to use to create the graph. We need to tell it the name of the data set we want to use (AdoptedData) and then the variable we want to work with (AdoptedLessThan30). The aes part of the code stands for “aesthetics”, which is where we tell R which variable goes where. Here, we want adoption time to be on the x-axis of our graph.

ggplot(AdoptedData, aes(x=AdoptedLessThan30))

The next step is to tell R what type of graph we want to make using this variable. In this case, we want a bar plot. The command we use to build a bar plot is + geom_bar(). Note that the + needs to go at the end of a line, not at the start of the next one.

ggplot(AdoptedData, aes(x=AdoptedLessThan30)) +
  geom_bar()

Now, we want to specify some things about our bar plot. We want the bars to be filled (fill) in blue and we want them outlined (col) in white.

ggplot(AdoptedData, aes(x=AdoptedLessThan30)) +  
  geom_bar(fill="blue", col = "white")

Let’s try it.

Question 7

Create a bar plot of the column indicating whether or not the animal was fostered before being adopted. Make the bars of the graph gold and outline them in black. Show your result.

We can add on another layer to our plot by adding x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.

ggplot(AdoptedData, aes(x=AdoptedLessThan30)) +  
  geom_bar(fill='blue', col = 'white') + 
  labs(title="Figure 1:", x = "Number of Days before Adoption", y = "Number of Animals") 

This ggplot syntax actually mimics the way a person would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add labels. Thinking through the steps in this manner will help you understand the syntax of this package.
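To make the layering idea concrete, the sketch below builds a bar plot one layer at a time and stores each stage in an object. It assumes ggplot2 is already installed. ToyData, toy_plot, and the labels are all made up for illustration.

```r
library(ggplot2)

# Made-up data standing in for AdoptedData (values are invented)
ToyData <- data.frame(
  AdoptedLessThan30 = c("Less than 30", "At least 30",
                        "Less than 30", "Less than 30")
)

# Layer 1: the background, the data set, and the x-axis variable
toy_plot <- ggplot(ToyData, aes(x = AdoptedLessThan30))

# Layer 2: the bars
toy_plot <- toy_plot + geom_bar(fill = "blue", col = "white")

# Layer 3: the title and axis labels
toy_plot <- toy_plot +
  labs(title = "Toy Figure", x = "Adoption Time", y = "Number of Animals")

# Printing the object draws the graph
toy_plot
```

You do not have to store the plot in an object in your own work. Chaining everything together with + in a single chunk, as in the lab code, produces the same kind of graph.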

Question 8

Copy the code you used to make the graph in Question 7. Now, add the title “Figure 2:” and add appropriate labels to the x and y axes.

One VERY important thing to remember when we make plots is to make sure the axes are easily interpretable by your reader. You do not want to use default variable names, like “FosterYes”. Instead, we want clear labels like “Was the animal fostered?”.

This is a requirement for all graphs you make in this course - you must label your axes appropriately and title your graphs.

Examining Relationships

At this point, we have created both a table and a graph of adoption time. Now, we want to start to see if there are other variables in the data set that might be related to adoption time. In other words, we are going to add an explanatory variable.

Let’s start with foster status. Does adoption time differ depending on whether or not the animal was in foster care? To find this out, we need to expand our initial table.

table( AdoptedData$AdoptedLessThan30, AdoptedData$WasFostered)
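If it helps to see how a two-variable table is laid out, here is a small sketch on made-up data. The six animals and the "Yes"/"No" labels are invented, and the real WasFostered column may use different labels. The margin = 2 option of prop.table() is one possible way to turn the counts into proportions within each column.

```r
# Made-up data standing in for AdoptedData (values and labels are invented)
ToyData <- data.frame(
  AdoptedLessThan30 = c("Less than 30", "Less than 30", "At least 30",
                        "Less than 30", "At least 30", "At least 30"),
  WasFostered       = c("No", "No", "No", "Yes", "Yes", "Yes")
)

# Rows come from the first variable, columns from the second
toy_two_way <- table(ToyData$AdoptedLessThan30, ToyData$WasFostered)
toy_two_way

# Proportions within each column (margin = 2), so each column sums to 1
toy_col_props <- prop.table(toy_two_way, margin = 2)
round(toy_col_props, 2)
```

Comparing proportions within each column, rather than raw counts, keeps the comparison fair when the two groups have different numbers of animals.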

Question 9

Using our techniques from before, convert this table into professional format and include a title.

Question 10

Based on the data, is there a practically significant difference in adoption time (at least 30 or less than 30) depending on if the animal was fostered? Use at least one numeric quantity to justify your answer.

Now, one thing to note is that we are exploring an association, not causation. In other words, we are not exploring whether being in foster care causes an animal to be adopted in more or less time. Animals that are adopted faster might be less likely to end up needing foster families, for instance. This means that all we are looking for is a relationship between the two variables, but we are not exploring whether one variable causes changes in another.

Okay, so we have used tables to explore the relationship between the two categorical variables. However, what if we get numeric data involved?

Question 11

Create a table to explore the relationship between the age of the animal in years and adoption time. Explain why this table might be difficult to use to explore this relationship.

When we have one categorical and one numeric variable, we need to use a different tool instead of a table. This is something we will find as we move throughout the course - the tools we use are very dependent upon the type of variables we are working with.

Side by side box plots

When we have one categorical variable and one numeric variable, one tool we can use to explore the relationship between the two is a side-by-side box plot.

ggplot(AdoptedData, aes(x = AdoptedLessThan30, y = InitialAge_Years)) +
  geom_boxplot()

The only changes we made from the bar plot are that we now use geom_boxplot instead of geom_bar, and we now specify a variable for both the x and y axes.
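If you want a number to go along with what you see in a side-by-side box plot, one option is to compute the median of the numeric variable within each group, since the line in the middle of each box marks that group's median. The sketch below uses made-up ages; ToyData and its values are invented for illustration.

```r
# Made-up ages (in years) for illustration only
ToyData <- data.frame(
  AdoptedLessThan30 = c("Less than 30", "Less than 30", "Less than 30",
                        "At least 30", "At least 30", "At least 30"),
  InitialAge_Years  = c(1, 2, 3, 4, 6, 8)
)

# Median age within each adoption-time group
toy_medians <- tapply(ToyData$InitialAge_Years,
                      ToyData$AdoptedLessThan30,
                      median)
toy_medians
```

tapply() is a base R function that applies a command (here, median) separately to each group.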

Question 12

Adapt the code above to change the color of the boxes to any color except white, and outline the boxes in any color except black or white. Add labels and a title, and show your final plot.

Question 13

Based on the graph, does there seem to be a practically significant difference in adoption time (whether or not the animal was adopted in less than 30 days) depending on the age of the animal in years? Explain your answer.

Ideally, we would like to be able to explore this and other relationships in more depth. This is where logistic regression will come in, and we will work on that next class.

Just for Fun: Themes

We have completed the core content for today’s lab, but I did want to take a minute to let us play a bit in R. When you create graphs, there are so many options and ways for you to personalize the graphs to suit your style and the style of the individuals you are presenting for. This is why in R there are really cool packages called themes or palettes that allow you to personalize your graphs.

For instance, let’s say I want my graph to look like Barbie drew it (yep, it’s exactly what you think):

# Install the packages
remotes::install_github("MatthewBJane/theme_park")
library(ThemePark)

If you are prompted during the install process to type something, type 3. Once you have finished installing, you need to put a # in front of remotes::install_github("MatthewBJane/theme_park"). If you do not do this, your document will not knit.

If you get stuck, let me know!! This part is just for fun, so if it causes any frustration just stop and let me know!

# Make the plot 
ggplot(AdoptedData, aes(x = AdoptedLessThan30, y = InitialAge_Years)) +
  geom_boxplot(color = barbie_theme_colors["medium"]) + 
  labs( x = "Adoption Time", y = "Age in Years", title = "Figure 3:") + 
  theme_barbie()

If you run the code below, you will see other themes in this package. Try a few out!!

head(themepark_themes)

Question 14

Which theme from the list do you like best and why? (There is no single right answer here, I just want to know!!)

Next Time

Today we have explored how we can use graphs and tables to look at the relationship between a binary response variable and either a categorical or a numeric explanatory variable. Next time, we are going to work on how to use models to formalize these relationships. This will also allow us to explore more complex relationships, as well as make predictions.