Today we are going to explore the foundations of coding in R. As we move through this course, you will gain experience with the different things R can do. For today, we want to get familiar with the setup and with how to run code in R.
Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.
The first thing we need to start any analysis is data. We will work with data on a variety of subjects in this class, but to get started, we are going to look at a historical data set. In statistics, sometimes we work with data that is very recent and sometimes we work with data from the past to detect trends that might inform future decisions. For today, we are going to look at a data set that explores the number of discoveries that the World Almanac and Book of Facts designate as great discoveries each year from 1860 to 1959.
To load the data, you need to create a Code Chunk (chunk for short). Remember that to do that, you go to Code at the very upper left of your RMarkdown window and then choose Insert Chunk.
Once you have created your chunk, type the following inside and press play (the green triangle on the right hand side of the chunk).
data("discoveries")
Once you have pressed play, look at the upper right hand panel of your RStudio screen (your Environment tab). Do you see that you have an object called discoveries? This means that you have loaded the data set you need!
Now that we have the data loaded, let's answer our first lab question. Remember, to create a lab question you type ## in your RMarkdown document, hit the space bar, and then type Question 1. This means you should see ## Question 1. Then, hit enter (or return if you are on a Mac) twice, and you are ready to answer the question!
Create a new chunk. In this chunk, type discoveries and press play. Some information about the discoveries will appear. What years of data do we have information on?
discoveries
Based on the data, how many great discoveries were there in 1863?
Now, the data at the moment is just a list of 100 numbers. Each of these numbers represents the number of great discoveries recorded by the World Almanac and Book of Facts in a given year. Our focus in this course is to learn what we can do with data like this. What kinds of questions can we answer using this data set? How can we communicate the answers to those questions in a way others can clearly understand?
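If you would like to check the structure of this object for yourself, a few base R commands can help. The chunk below is only a sketch of one way to do it; the year 1870 is picked arbitrarily for illustration.

```r
# Load the built-in data set (safe to run again)
data("discoveries")

# How many values are stored? The lab text says there should be 100.
length(discoveries)

# What kind of object is this? Base R stores it as a time series ("ts").
class(discoveries)

# First and last year covered by the series
start(discoveries)
end(discoveries)

# Pull out the value for a single year (1870 is only an example)
window(discoveries, start = 1870, end = 1870)
```

Changing the year in the last line lets you look up any year between the start and end of the series.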
For example, one question we could explore using these data is: What was the largest number of great discoveries recorded by the World Almanac and Book of Facts?
What is another question we could explore using this data set? Note: There is no one right answer here!
Let's focus on this first question: What was the largest number of great discoveries recorded by the World Almanac and Book of Facts? Now, we could look at the 100 numbers and find the biggest, but this is tedious, and becomes even more so when we have bigger data sets. This is when we start to use R to help us.
Create a chunk, type the following, and then press play:
max(discoveries)
Welcome to your first R command! In R, the basic format for coding is command(object). The command tells R what you want to do; here, max tells R to find the largest value. The object tells R which object or data set the command should work with. So, max(discoveries) tells R we want to see the largest value in the discoveries data set.
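The same command(object) pattern works with many other commands. Here is a short sketch using two more base R commands; the name largest is made up for this example and is not part of the lab.

```r
data("discoveries")

# Each line follows the same pattern: command(object)
length(discoveries)   # how many values are in the object
sum(discoveries)      # total of all the values

# You can also store a result under a name and reuse it later.
# "largest" is a name chosen just for this example.
largest <- max(discoveries)
largest
```

Storing a result with <- is optional here, but it becomes handy once you want to use the same number in more than one place.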
What was the largest number of great discoveries recorded by the World Almanac and Book of Facts?
Okay, that's one research question. Let's try another: Was the largest number of great discoveries made in the 19th century (before 1900) or in the 20th century (1900 or later)? It turns out we can use R to help us figure this out as well. To answer our question, we will use a plot.
Create a chunk, paste in the following, and press play.
plot(discoveries)
Do you notice we have the same structure here? We have a command (plot) and an object (discoveries).
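Commands can also take extra, optional pieces of information after the object. As a sketch, the chunk below adds axis labels and a title to the same plot, then draws a dashed line at 1900. The label text is our own wording, not something required by the lab.

```r
data("discoveries")

# Same command and object as before, plus optional labels
plot(discoveries,
     xlab = "Year",
     ylab = "Number of great discoveries",
     main = "Great discoveries per year")

# Add a dashed vertical line at 1900 to separate the two centuries
abline(v = 1900, lty = 2)
```

The dashed line makes it easier to see which side of 1900 a given peak falls on.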
Was the largest number of great discoveries made in the 19th century (before 1900) or in the 20th century (1900 or later)?
This plot we have created is just one example of a data visualization, or a way to use graphs and tables to explore data. In this course, we will learn several different types of plots, as well as exploring what information each displays and when we should use them. Another skill we will practice is interpreting plots.
Suppose you are asked to describe what information this graph is showing. How would you reply? Note: There is no one right answer, but a good way to start is to comment on what information is on the x axis and what information is on the y axis, and then comment on any general story you think can be told based on the graph.
So, we can create a plot, and we can find the largest (max) value in the data set. What else can we do? This week we have learned two measures of central tendency: the mean and the median. Let's compute both. R has a built-in command for each, mean and median. In the code chunk, type the command and put the object inside the parentheses (). In our case, the object is discoveries.
What is the average (mean) number of great discoveries recorded in this data set? Hint: Think about the structure of coding we have learned. You need a command and an object.
mean(discoveries)
What is the median of the great discoveries recorded in this data set?
What is the smallest (min) number of great discoveries recorded in this data set? Hint: Use the min() command in the chunk.
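Once you have answered these questions one command at a time, you may want a quick way to check your work. One option, sketched below, is the base R command summary(), which reports several of these values together.

```r
data("discoveries")

# summary() reports the minimum, quartiles, median, mean, and maximum at once
summary(discoveries)
```

The values labeled Min., Median, Mean, and Max. should agree with what min(), median(), mean(), and max() gave you, up to rounding in the printed output.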
This week we have learned how to plot data using a histogram. Let's visualize the discoveries data. We need to use the hist() function to create a histogram.
hist(discoveries)
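From the histogram, can you say how many bins were used? We did not specify a number, so R used a default that it judged appropriate. What is your opinion of that choice?
What is the modality of the histogram? What is the skewness of the histogram?
This week we have learned the importance of selecting an appropriate number of bins. Let's try four different values for the breaks argument: 4, 6, 10, and 20. Note that R treats a single number given to breaks as a suggestion and may adjust it so that the bin edges fall on round values.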
hist(discoveries, breaks=4, main="With breaks=4")
hist(discoveries, breaks=6, main="With breaks=6")
hist(discoveries, breaks=10, main="With breaks=10")
hist(discoveries, breaks=20, main="With breaks=20")
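Because a single number passed to breaks is only a suggestion to R, it can help to check what R actually did. The sketch below uses plot = FALSE to get the histogram's details without drawing it, then draws all four histograms in one 2-by-2 grid. The names h4 and old_par are made up for this example.

```r
data("discoveries")

# Build the histogram with breaks = 4 but do not draw it
h4 <- hist(discoveries, breaks = 4, plot = FALSE)
h4$breaks          # where R placed the bin edges
length(h4$counts)  # how many bins R actually used

# Draw the four histograms in a 2-by-2 grid for easier comparison
old_par <- par(mfrow = c(2, 2))
hist(discoveries, breaks = 4,  main = "With breaks=4")
hist(discoveries, breaks = 6,  main = "With breaks=6")
hist(discoveries, breaks = 10, main = "With breaks=10")
hist(discoveries, breaks = 20, main = "With breaks=20")
par(old_par)  # restore the previous plotting layout
```

Seeing the four plots side by side makes it easier to judge which setting shows the shape of the data most clearly.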
Of the four settings, which one gives the best result? Explain your reasoning.
Let's load a second data set into R. Create a chunk, paste in the following, and press play.
data("faithful")
This data set provides information on eruption cycles of a very famous geyser called Old Faithful. This geyser can be seen in Yellowstone National Park in Wyoming and draws visitors from all over the world as it shoots water more than 100 feet into the air. The geyser is called Old Faithful because it erupts about every 40 to 120 minutes, which is very regular for a geyser!
This data set we have loaded (faithful) is what we call a rectangular data set or data matrix, which just means data that you can put in the form of a spreadsheet. In R we call it a data frame. There are rows and columns, where typically each row represents one case or one observation and each column represents one variable or characteristic of the data set.
Our discoveries data set was not rectangular; it was a single sequence of 100 values, one per year, rather than a table with rows and columns.
Now that we have the data, it's always a good idea to explore the data set. Start by clicking on the faithful data set in your Environment Tab.
How many rows are in this data set? How many columns?
Recall that each row in a data set is called a case, and each column is called a variable. The case tells us what we are recording information on. For these data, each case is one eruption cycle of the Old Faithful geyser in Yellowstone National Park in Wyoming.
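Clicking on the data set in the Environment tab is one way to explore it. As an alternative sketch, the base R commands below report the same information in a code chunk.

```r
data("faithful")

nrow(faithful)    # number of rows (cases)
ncol(faithful)    # number of columns (variables)
names(faithful)   # the column names
head(faithful)    # the first six rows of the data set
```

These follow the familiar command(object) structure, with the whole data frame as the object.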
For each eruption cycle (each case) we have information on two different variables:
eruptions: how long, in minutes, did the eruption last? We expect decimals, since an eruption rarely lasts a whole number of minutes.
waiting: how many minutes until the geyser had another eruption?
There are two main types of variables.
Numeric variables are variables that we want to treat as numbers. It should make sense to add, subtract, multiply, etc., these variables. Some examples include height, weight, temperature, number of textbooks in your backpack, etc.
Categorical variables are variables that we want to treat as categories. One example might be your class: are you a first year, sophomore, junior or senior? The answer to this question assigns you to a category.
How many numeric variables do we have in the faithful data set? How many categorical variables?
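To see how R itself distinguishes the two types, here is a small sketch using a made-up data frame. The data frame students and its columns height_in and class_year are hypothetical and exist only for this illustration.

```r
# A small made-up data frame (not part of the lab data)
students <- data.frame(
  height_in  = c(64, 70, 67, 72),
  class_year = factor(c("first year", "junior", "senior", "junior"))
)

# str() shows how R stores each column: "num" for numeric, "Factor" for categorical
str(students)

# summary() treats the two types differently:
# numeric columns get min, mean, max, and so on; categorical columns get counts
summary(students)
```

You can run str() on any data frame, including the ones in this lab, to see how R stores each of its variables.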
What if we want to look at just one variable in the data set? We know how to look at the whole data set, but suppose we just want to look at the waiting variable. This tells us how long a visitor to Yellowstone would have to wait in between eruptions of Old Faithful. We can access a single column of a data set using the $ symbol: type the name of the data set, then $, then the name of the column.
faithful$waiting
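A single column pulled out with $ behaves like any other object, so the command(object) structure still applies. The sketch below uses the waiting column; the axis label and title are our own wording.

```r
data("faithful")

# The object is now one column of the data frame
mean(faithful$waiting)   # average wait between eruptions, in minutes

# A histogram of the same column
hist(faithful$waiting,
     xlab = "Waiting time (minutes)",
     main = "Waiting time between eruptions")
```

Any command we have used on discoveries can be used the same way on a single column.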
What command would tell us the largest (max) value of the eruptions variable in this data set? Hint: We are putting together things we have learned. We need a command (max) and then the data we want the command to run on, which here is a single column.