In this course, and in statistics in general, we need to use statistical software to help us process, visualize, and model using data. We will be working with R for this semester.
Throughout the course, we will explore different things you can do using R. For today, the goal is just to start getting comfortable with the grammar of R and what we can do with it.
Before starting this lab, make sure you have set up your RMarkdown by watching the videos on Canvas.
Go ahead and open up an RMarkdown file, and delete everything after Line 10. Knit to make sure you can, and let me know if you get stuck.
Okay, our Markdown file is open. Cool. Now what? Well, now we need data. Everything we do in statistics starts with data.
The very first step in starting any new Markdown file is to load in the data that you need. If you do not load the data inside of your Markdown file, your Markdown will not Knit.
Today, we will be loading our data from the internet. R has some commands that allow us to do that, and we will explore those first.
To start off with, we need to install a package, since the data set we will be using today comes inside an R package called palmerpenguins. Look at the top of your RStudio screen and find "Tools". From the drop-down menu, choose "Install Packages". In the white box, type palmerpenguins, and then hit Install. Step 1 complete.
Now that we have the package installed, it is time to load our data. To do that, we need to create a space in our RMarkdown file to create code. Remember, Markdown can do three things: (1) create regular text, (2) run code, and (3) create math equations. Right now, we want (2).
To tell RMarkdown we are about to give it code, we need to create a chunk. Look at the top of your Markdown file, and find Insert. Click it. From the drop-down menu, choose R. Click it! A gray box should appear in your Markdown file. This is a chunk.
Anything we put inside a chunk will be treated as computer code. Let's put in the code to load the data now.
library(palmerpenguins)
data("penguins")
Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button (press play).
Now, look on the upper right hand panel of your RStudio screen. See how you now have a data set called penguins? Great! We are ready to go!
Now that you have loaded the data, you are ready to actually start the lab. Go to line 17 of your Markdown. Notice that this leaves one line of blank space between your code chunk where we loaded the penguins data and what you are about to type. This space between text and code chunks is necessary in order for your document to format properly.
On line 17, you are going to create a section header for the first lab question. Type two # signs on line 17, put a space, and then type Question 1. In other words, line 17 should read ## Question 1. This creates a new section in your Markdown called Question 1. Go ahead and knit; see that you have created a section.
You can create as many sections as you like. Under a section header, you have the ability to type, just like you would in a word document, so you can type the responses to questions. You can also insert a code chunk and run code. Remember, a new chunk can be inserted by clicking on the Insert button and choosing R.
Knit your document now, and make sure that so far, your document contains one section, Question 1. We are now ready to begin the lab.
On your Markdown, hit enter twice to leave a blank line between ## Question 1 and where you type. Copy the question below, and paste it into your Markdown, and then put an * at the beginning and at the end of the question. This will put the question in italics, and allow you to see the question statement when you knit. Repeat this process for all the questions in your lab.
Now, let's say I want to look at one particular column in this data set, in this case species. How can I tell R I only want to see this one column? To get a single column, we type the name of the data set, a dollar sign, and then the column we want. So, for our data, we create a new code chunk, and inside we type the following, and hit play.
penguins$species
Go ahead and knit your Markdown file. Whoa, see all the data output in the document?? Pages of it?? We have given a command that tells R to print out all of the information on species and islands, which is a lot. We don't want all that in what we turn in. So, to tell R not to run the code, put a # in front of the commands.
#penguins$species
Knit your Markdown again, and you should not see the long string of output anymore.
Now we know how to isolate a particular column in R. What if I wanted to know the species of the 200th penguin in the data set? I don't really want to print out the whole species column and find the 200th entry. Luckily, we don't have to. Create a new chunk and try this:
penguins$species[200]
The [ ] part of the command allows us to pull particular elements from the column. We can even pull things like the first 5 rows.
penguins$species[1:5]
So far, we can see that R allows us to select different parts of a data set using $ and [ ]. This is handy if we want to explore a particular data point.
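As a quick illustration of $ and [ ], here is a minimal sketch on a tiny made-up data set. The data frame toy and its values are hypothetical (they are not part of penguins), but the same commands work on penguins$species.

```r
# A small, made-up data frame (hypothetical values, not the penguins data)
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie", "Gentoo"),
  island  = c("Dream", "Biscoe", "Dream", "Biscoe", "Torgersen", "Biscoe")
)

# dataset$column pulls out a single column
all_species <- toy$species

# [ ] picks entries by position: here, the 4th entry
fourth <- toy$species[4]

# 1:3 means "positions 1 through 3"
first_three <- toy$species[1:3]

print(fourth)
print(first_three)
```

Because the toy data has only six rows, you can check each result by eye against the values typed in at the top.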
Let's think practically. If I want to know about the species or islands in this data set, do we really want to see hundreds of rows of information on species or islands? Probably not. What we actually want to do is look at some sort of summary of the data. For categorical data like this, a table is the most useful summary.
There are several ways to make tables in R, but we will discuss two. The first is very direct. We tell R we want to use the penguins data set and the variable species, by using the code penguins$species (dataset$variable). Then, we use the table(whatWeWantToMakeATableWith) command to actually make the table.
table(penguins$species)
However, this makes a table that is not particularly pretty or professional when you knit. A second option that does create professional tables is:
knitr::kable(table(penguins$species), col.names=c("Species", "Count") )
The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit. See how nicely the table gets formatted?
The col.names part of the code above sets the column headers of the table.
Okay, so now we can summarize the data by looking at the islands and the species. What if we want to look at them together? In other words, what if I want to know which species of penguins are on which islands? Can I do that?
knitr::kable(table(penguins$species, penguins$island))
Why would this be important? Well, when we are building models, we might need to know whether or not our groups were balanced before choosing a model. What if we had only male penguins from Dream Island, and then tried to fit a model looking at the relationship between penguin sex and bill length on this island? We couldn't do it, because we only have information on male penguins. This is why taking the time to perform exploratory data analysis, and really dig into the data, is so important.
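The balance check described above can be sketched on a tiny made-up data set. The data frame toy and its values are hypothetical, chosen so that one group is missing entirely; the real penguins data may look quite different.

```r
# Made-up data (hypothetical): no female penguins were recorded on Dream
toy <- data.frame(
  island = c("Dream", "Dream", "Dream", "Biscoe", "Biscoe", "Biscoe"),
  sex    = c("male", "male", "male", "female", "male", "female")
)

# Rows are islands, columns are sexes; each cell is a count
counts <- table(toy$island, toy$sex)
print(counts)

# We can look up one cell by its row and column names.
# A count of 0 flags a group we have no information about.
dream_females <- counts["Dream", "female"]
print(dream_females)
```

A zero in the table is exactly the situation described above: with no female penguins from Dream, we could not compare the sexes on that island.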
Okay, so now we can look at individual columns, and rows within columns, and we can make tables. What if we have a client who only cares about penguins from the Gentoo species? This means I need to select from my data set the rows for just those penguins.
To do this, I want to assign the rows of data that are about Gentoo penguins to a new data set. Create a new chunk and run the following.
GentooOnly <- subset(penguins, species == "Gentoo")
What happens when you run this? Seemingly nothing, but take a look in the upper right hand panel of your RStudio screen. There is a new data set there, called GentooOnly, and if you open it, you will notice that all of the penguins in the data are Gentoo.
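Let's build this command up one piece at a time.
We start with the data set, penguins.
We want a subset of it: subset(penguins)
We choose the rows using the species column: subset(penguins, species)
We keep only the rows where species is equal to Gentoo: subset(penguins, species == "Gentoo")
Finally, we store the result (<-) in a new data set called GentooOnly: GentooOnly <- subset(penguins, species == "Gentoo")
Note that subset(penguins, species) is not a complete command on its own; it only shows how the final command is assembled.
You can also adapt this to include more than one requirement. Perhaps we want only penguins who are Gentoo penguins and (&) have a bill length over 30 mm.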
GentooBill30 <- subset(penguins, species == "Gentoo" & bill_length_mm > 30)
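To see what each requirement does, here is a minimal sketch on a tiny made-up data set. The data frame toy, its values, and the names gentoo_only and gentoo_bill30 are hypothetical; the column names are borrowed from penguins so the commands look the same.

```r
# Made-up data (hypothetical values, not the penguins data)
toy <- data.frame(
  species        = c("Gentoo", "Gentoo", "Adelie", "Gentoo", "Adelie"),
  bill_length_mm = c(46.1, 29.5, 39.1, 50.0, 28.0)
)

# One requirement: keep only the Gentoo rows
gentoo_only <- subset(toy, species == "Gentoo")

# Two requirements joined by &: Gentoo AND bill length over 30 mm
gentoo_bill30 <- subset(toy, species == "Gentoo" & bill_length_mm > 30)

# nrow() counts how many rows were kept
print(nrow(gentoo_only))
print(nrow(gentoo_bill30))
```

Adding the second requirement can only keep the same number of rows or fewer, since every row now has to pass both checks.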
A client wants to know if the average body mass of male penguins of the Gentoo species is greater than the average body mass of male penguins of the Adelie species. To get the mean of a column in R, we need mean(thingWeWantTheMeanOf). There are a lot of commands like this: sd( ) for standard deviation, var( ) for variance, min( ) for the smallest value and max( ) for the largest.
However, if we want to take the mean, we need to make sure there is no missing data in the data set. This means we don't want to see NA in any of the columns. To check, we can run a quick summary.
summary(penguins)
It looks like there is missing information on a lot of our variables! This means we have missing data, or certain pieces of information that we do not have for specific penguins in our data. There is a whole field devoted to how we handle this, but for now, let's remove the missing data.
penguins <- na.omit(penguins)
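To see why missing data matters for the mean, here is a minimal sketch on a tiny made-up data set. The data frame toy and its values are hypothetical; only the column names are borrowed from penguins.

```r
# Made-up data (hypothetical) with one missing body mass
toy <- data.frame(
  species     = c("Adelie", "Gentoo", "Gentoo", "Adelie"),
  body_mass_g = c(3750, NA, 5000, 3250)
)

# With a missing value in the column, mean() returns NA
mean_with_na <- mean(toy$body_mass_g)
print(mean_with_na)

# na.omit() drops every row that contains at least one NA
toy_complete <- na.omit(toy)
print(nrow(toy_complete))

# Now the mean can be computed from the remaining rows
mean_complete <- mean(toy_complete$body_mass_g)
print(mean_complete)
```

Notice that na.omit() removes the whole row, not just the missing cell, so the penguin with the missing body mass disappears from every column.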
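The penguins <- part of the code stores the result back under the name penguins, so the data set we use from here on no longer contains the rows with missing data.
We can also do things like add, subtract, or multiply two quantities in R.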
mean(penguins$bill_length_mm) - 5
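Subtraction is also how we can compare two averages. The sketch below combines subset( ), mean( ), and subtraction on a tiny made-up data set. The data frame toy, its values, and the names gentoo, adelie, and mass_diff are hypothetical, so the number it produces says nothing about the real penguins.

```r
# Made-up data (hypothetical values, not the penguins data)
toy <- data.frame(
  species     = c("Gentoo", "Gentoo", "Adelie", "Adelie"),
  body_mass_g = c(5500, 5300, 4000, 4100)
)

# Split the data into one data set per species
gentoo <- subset(toy, species == "Gentoo")
adelie <- subset(toy, species == "Adelie")

# Subtract one average from the other.
# A positive result means the Gentoo average is larger.
mass_diff <- mean(gentoo$body_mass_g) - mean(adelie$body_mass_g)
print(mass_diff)
```

Storing each subset under its own name keeps the final line short and makes it easy to reuse the same subsets with sd( ), min( ), or max( ).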
For today, we started to explore this penguin data set, and some of what we can do in R. As we move through our semester, we will use the types of commands we learned today over and over. We will also be learning new commands and new things that we can do with R. Next on our list is visualization, and we will tackle that in the next lab.