STA 111 Lab 1

Complete all questions and submit your final PDF or HTML file (either works!) under Assignments in Canvas.

Goal

Today we are going to be exploring the foundations of coding in R. As we move through this course, you will get experience with different things that R can do. For today, we want to start getting familiar with the setup and how to run code using R.

Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.

The First Data Set

The first thing we need to start any analysis is data. We will work with data on a variety of subjects in this class, but to get started, we are going to look at a historical data set. In statistics, sometimes we work with data that is very recent and sometimes we work with data from the past to detect trends that might inform future decisions. For today, we are going to look at a data set that records the number of discoveries that the World Almanac and Book of Facts designates as great discoveries each year from 1860 to 1959.

To load the data, you need to create a Code Chunk (chunk for short). Remember that to do that, you go to Code at the very upper left of your RMarkdown window and then choose Insert Chunk.

Once you have created your chunk, type the following inside and press play (the green triangle on the right hand side of the chunk).

data("discoveries")

Once you have pressed play, look at the upper right hand panel of your RStudio screen (your Environment tab). Do you see that you have an object called discoveries? This means that you have loaded the data set you need!
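If the Environment tab is hard to spot, a chunk can confirm the same thing. The sketch below is only one possible check; class and length are standard R commands that this lab has not introduced yet.

```r
data("discoveries")   # load the built-in data set
class(discoveries)    # "ts": R stores these data as a time series
length(discoveries)   # 100 values, one for each year
```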

Viewing the Data

Now that we have the data loaded, let’s answer our first lab question. Remember, to create a lab question, type two # symbols in your RMarkdown document, hit the space bar, and then type Question 1. You should see ## Question 1. Then hit enter (or return if you are on a Mac) twice, and you are ready to answer the question!

Question 1

Create a new chunk. In this chunk, type discoveries and press play. Some information about the discoveries will appear. What years of data do we have information on?

Question 2

Based on the data, how many great discoveries were there in 1863?

Asking Questions Using Data

Now, the data at the moment is just a list of 100 numbers. Each of these numbers represents the number of great discoveries recorded by the World Almanac and Book of Facts in a given year. Our focus in this course is to learn what we can do with data like this. What kinds of questions can we answer using this data set? How can we communicate the answers to those questions in a way others can clearly understand?

For example, one question we could explore using these data is: What was the largest number of great discoveries recorded by the World Almanac and Book of Facts?

Question 3

What is another question we could explore using this data set? Note: There is no one right answer here!

Let’s focus on this first question: What was the largest number of great discoveries recorded by the World Almanac and Book of Facts? Now, we could look at the 100 numbers and find the biggest, but this is tedious, and becomes even more so when we have bigger data sets. This is when we start to use R to help us.

Create a chunk, type the following, and then press play:

max(discoveries)

Welcome to your first R command! In R, the format for coding is command(object). The command tells R what you want to do, and the object tells R what object or data set you want the command to work with. So, max(discoveries) tells R we want to see the largest value in the discoveries data set.
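To see the command(object) pattern on something small, here is a sketch using a made-up object. The name practice and its four numbers are invented for illustration and are not part of the lab data; c() is the standard R command for combining values into one object.

```r
# A small made-up set of numbers, used only for illustration
practice <- c(4, 9, 2, 7)

max(practice)      # command = max, object = practice; gives 9
length(practice)   # command = length, object = practice; gives 4
```

The same pattern applies whether the object holds four numbers or one hundred.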

Question 4

What was the largest number of great discoveries recorded by the World Almanac and Book of Facts?

Okay, that’s one research question. Let’s try another: Was the largest number of great discoveries made in the 19th century (before 1900) or in the 20th century (1900 or later)? It turns out we can use R to help us figure this out as well. To answer our question, we will use a plot.

Create a chunk, paste in the following, and press play.

plot(discoveries)

Do you notice we have the same structure here? We have a command (plot) and an object (discoveries).

Question 5

Was the largest number of great discoveries made in the 19th century (before 1900) or in the 20th century (1900 or later)?

This plot we have created is just one example of a data visualization, or a way to use graphs and tables to explore data. In this course, we will learn several different types of plots, as well as exploring what information each displays and when we should use them. Another skill we will practice is interpreting plots.

Question 6

Suppose you are asked to describe what information this graph is showing. How would you reply? Note: There is no one right answer, but a good way to start is to comment on what information is on the x axis and what information is on the y axis, and then comment on any general story you think can be told based on the graph.

So, we can create a plot, and we can find the largest (max) value in the data set. What else can we do?

Question 7

What is the average (mean) number of great discoveries recorded in this data set? Hint: Think about the structure of coding we have learned. You need a command and an object.

Question 8

What is the smallest (min) number of great discoveries recorded in this data set?

As we move through the course, we will see how we can use statistical computing to help us answer many different types of questions using data. Let’s see how this works on a new data set.

A Second Data Set

Let’s load a second data set into R. Create a chunk, paste in the following, and press play.

data("faithful")

This data set provides information on eruption cycles of a very famous geyser called Old Faithful. This geyser can be seen in Yellowstone National Park in Wyoming and draws visitors from all over the world as it shoots water more than 100 feet into the air. The geyser is called Old Faithful because it erupts about every 40 to 120 minutes, which is very regular for a geyser!

This data set we have loaded (faithful) is what we call a rectangular data set, which just means data that you can put in the form of a spreadsheet. There are rows and columns, typically where each row represents one case and each column represents one variable.

Our discoveries data set was not rectangular; it was a single sequence of numbers, one for each year, rather than a table with rows and columns.
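Rows and columns can also be counted with commands. The sketch below uses a different built-in data set, women (heights and weights of 15 women), so that it does not give away anything about faithful. The commands nrow, ncol and names are standard R commands that this lab has not introduced yet.

```r
data("women")   # a different built-in rectangular data set

nrow(women)     # number of rows (cases): 15
ncol(women)     # number of columns (variables): 2
names(women)    # the column names: "height" "weight"
```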

Now that we have the data, it’s always a good idea to explore the data set. Start by clicking on the faithful data set in your Environment Tab.

Question 9

How many rows are in this data set? How many columns?

Recall that each row in a data set is called a case, and each column is called a variable. The case tells us what we are recording information on. For these data, each case is one eruption cycle of the Old Faithful geyser in Yellowstone National Park in Wyoming.

Types of Variables

For each eruption cycle (each case) we have information on two different variables:

  • eruptions: how long, in minutes, did the eruption last?
  • waiting: how many minutes until the geyser had another eruption?

There are two main types of variables.

Numeric variables are variables that we want to treat as numbers. It should make sense to add, subtract, multiply, etc., these variables. Some examples include height, weight, temperature, number of textbooks in your backpack, etc.

Categorical variables are variables that we want to treat as categories. One example might be your class: are you a first year, sophomore, junior or senior? The answer to this question assigns you to a category.
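As a rough illustration of the difference, here is a sketch with two made-up variables. The names backpack_books and class_year and their values are invented for this example; class and table are standard R commands.

```r
# Made-up responses from four students, for illustration only
backpack_books <- c(2, 0, 3, 1)
class_year <- c("first year", "junior", "senior", "junior")

class(backpack_books)   # "numeric"
mean(backpack_books)    # arithmetic makes sense here: 1.5

class(class_year)       # "character": R treats these as labels
table(class_year)       # counts how many responses fall in each category
```

Averaging makes sense for the number of textbooks, while for class year the natural summary is a count per category.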

Question 10

How many numeric variables do we have in the faithful data set? How many categorical variables?

When deciding whether a variable is numeric or categorical, be careful not to just look at the name of the variable. Instead, you want to look at the variable itself. What do the responses look like?

Considering One Variable

What if we want to look at just one variable in the data set? We know how to look at the whole data set, but suppose I just want to look at the waiting variable. This tells us how long a visitor to Yellowstone would have to wait in between eruptions of Old Faithful. We can access a single column of a data set using a $.

faithful$waiting

The structure of this command is dataset$column. We are telling R “hey R, take a look at the faithful data set. Go inside the faithful data set ($) and take a look at the column called waiting.”
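The dataset$column pattern works the same way on any rectangular data set. As a sketch, here it is on the built-in women data set, which has columns named height and weight; this data set is not otherwise used in the lab.

```r
data("women")          # built-in data set with columns height and weight

women$height           # prints every value in the height column
length(women$height)   # how many values are in that column: 15
```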

Question 11

What command would access just the column in the data set that tells us how long each eruption lasted?

Question 12

What command would tell us the longest (max) eruption duration in this data set? Hint: We are putting together things we have learned. We need a command (max) and then the data we want the command to run on.

Back to Data Visualization

In our discoveries data set, we only had one variable. Now, we have two. This means we can (1) create a plot of just eruptions, (2) create a plot of just waiting times and (3) create a plot comparing the eruptions and waiting times. The type of graph we need is going to depend on what question we want to answer using the data.

For now, let’s try (3). To create a scatter plot with the waiting time on the x axis and the eruption duration on the y axis, we need:

plot(x = faithful$waiting, y = faithful$eruptions)

This gives R a command, plot. What do we want to plot? On the x-axis, we want to plot faithful$waiting and on the y-axis we want faithful$eruptions.

If you put the command above in a code chunk and hit play, the plot appears in your document. If you want the plot to close, hit the x in the upper right hand corner of the plot.

Question 13

For an eruption cycle with a waiting time of 80 minutes, how long do you expect the eruption to have lasted, in minutes? Feel free to give a range (for instance, between 0 and 1 minute).

Looking at the plot, we notice that the names of the x and y axes are a little strange. By default, R names these axes with the exact names of the variables. We can change the label for the axes like so.

plot(x = faithful$waiting, y = faithful$eruptions,
     xlab = "The X Axis Label", ylab = "The Y Axis Label")

We can also change the color of the dots in the graph.

plot(x = faithful$waiting, y = faithful$eruptions,
     xlab = "The X Axis Label", ylab = "The Y Axis Label", col = "blue")
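The same options carry over to other data sets. As a sketch, here is a labelled, colored scatter plot of the built-in women data set (height in inches, weight in pounds). The main option, which adds a title above the plot, is a standard plot option that this lab has not introduced; the label and title text are just one possible choice.

```r
data("women")   # built-in data set: height (inches) and weight (pounds)

plot(x = women$height, y = women$weight,
     xlab = "Height (inches)", ylab = "Weight (pounds)",
     main = "Height and Weight", col = "darkgreen")
```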

Question 14

By adapting the code above, create a scatter plot where (1) the x axis is labelled Waiting Time in Minutes, (2) the y axis is labelled Eruption Duration in Minutes and (3) the color of the dots is red.

Answering questions about the data

Question 15

An individual who is visiting the park wants to know if your work so far indicates that longer waiting times are associated with longer eruptions. Based on the plot, does it look like this is the case for these 272 eruptions? Explain your reasoning.

Summary

In this lab, we have started to explore how to use R to work with data. As we move through the course, we will continue to use R to apply the concepts we learn in class on different data sets.

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 May 27.