barplot(), summary()
Link to install info (TODO: MAKE THIS AN ACTUAL URL)
In statistics, we often use computers to analyze data. There are a lot of programs that can help you do statistical analyses. One of the most popular (and powerful) is called R. R is a “statistical computing environment” that is designed for manipulating data, doing calculations, and making graphical displays. You work in R by writing R code.
That might sound scary, but don’t worry: this is not a programming class. Over the course of the semester, you’ll learn how to edit and write some basic R code to help you analyze data to answer research questions. Our goal in lab is to help you learn the basics of R and R coding, but through the lens of answering statistical questions.
R Studio Screenshot
At some point you may accidentally open the “R GUI”, which is not R Studio. If that happens, close it and open R Studio instead.
“R GUI” screenshot – don’t open this; use R Studio instead!
There are a lot of “R” words floating around. What’s going on?
This is an R Markdown document. R Markdown lets you combine text, R code, and plots in one pretty, reproducible report. If you’re curious about this, you can find more details on using R Markdown at http://rmarkdown.rstudio.com.
R Markdown runs code contained in “chunks”. A chunk looks like this:
print("Hello world!")
## [1] "Hello world!"
Notice that the code, print("Hello world!"), is contained between two lines of three backticks (```, right below the esc key on a US English keyboard) – this is how R Markdown knows where your chunks start and stop.
When you click the Knit button in R Studio, a document will be generated that includes both the text you wrote and the output of any R code chunks embedded in the document.
You can make text bold by surrounding it with two asterisks (**) and italic by surrounding it with one asterisk (*).
At its most basic, R is a fancy calculator. TODO: LINK TO HOW TO DO MATH IN R
You can just run a single chunk by clicking the green “play” button in the upper right corner of the chunk.
5 * 7
## [1] 35
When you run the chunk, you’ll see a [1] before the output of 35. The [1] just tells you that 35 is the first (and here, only) value in the output, so you can ignore it. The result is 35.
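Beyond multiplication, R supports the other usual arithmetic operations. The sketch below uses made-up numbers that are not part of the lab data. The names on the left of each `<-` (total, difference, and so on) are hypothetical labels chosen for this illustration; `<-` simply stores a result under that name so it can be used later.

```r
# Basic arithmetic in R (all numbers here are made up for illustration)
total      <- 12 + 30   # addition
difference <- 12 - 30   # subtraction
quotient   <- 20 / 4    # division
squared    <- 3^2       # exponent: 3 to the power of 2
root       <- sqrt(49)  # square root

# Parentheses control the order of operations, just like in algebra
grouped <- (2 + 3) * 4

# Print all six results together
print(c(total, difference, quotient, squared, root, grouped))
```

Typing any of the right-hand sides on its own (for example, `20 / 4`) in a chunk and running it prints the result directly, just like `5 * 7` above.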
Try it yourself! In the chunk below, compute 50 divided by 9. You’ll notice the chunk contains the text # Write code here!. This is called a “comment” – it’s not code that R runs, it’s just there to explain your code. Feel free to delete and replace it, or start a new line and type there. See what happens!
# Write code here!
We’re going to start by working with a data set with data on 344 penguins collected from 3 islands in the Palmer Archipelago in Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network, and the data were prepared by Dr. Allison Horst.
Illustration of penguin species. Artwork by @allison_horst.
TODO: Link to codebook
We’ll talk more about the specifics next week, but for now, know that the code below creates a data set called penguins, and that the data come from the URL in the innermost parentheses.
TODO: FORK PALMERPENGUINS, REMOVE NA’S
penguins <- read.csv(url("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"))
penguins <- penguins[complete.cases(penguins), ]
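The second line above removes any penguin with a missing measurement. As a rough sketch of what that line does, here is the same idea applied to a tiny made-up data set (the name toy and its values are hypothetical, chosen only for illustration). R marks a missing value as NA.

```r
# A made-up toy data set: the second row has a missing value (NA)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3750, NA, 5000)
)

# complete.cases() gives TRUE for each row that has no missing values
keep <- complete.cases(toy)
print(keep)

# Keep only the complete rows
# (inside the square brackets, rows go before the comma and columns after it)
toy_complete <- toy[keep, ]
print(nrow(toy_complete))
```

This is presumably why the row labels in the head() output jump from 3 to 5: a row with a missing value was dropped, and R keeps the original row labels for the rows that remain.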
Let’s see what’s in the data. We can peek at the first few (6, specifically) rows of the data using the head() function:
head(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## sex year
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 5 female 2007
## 6 male 2007
## 7 female 2007
We read that line as “head of penguins”. Remember that penguins is what we named our data set. We can see that penguins contains a number of variables, like species, island, and more.
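A few other base R functions are handy for getting oriented in a data set. The sketch below uses a small hand-typed copy of a few columns from the first three rows shown above, so it runs without downloading anything; the name penguins_small is hypothetical. In your own chunks you would use penguins instead.

```r
# A hand-typed copy of a few columns from the first three rows shown above
penguins_small <- data.frame(
  species = c("Adelie", "Adelie", "Adelie"),
  island = c("Torgersen", "Torgersen", "Torgersen"),
  bill_length_mm = c(39.1, 39.5, 40.3),
  flipper_length_mm = c(181, 186, 195)
)

# Show only the first 2 rows instead of the default 6
print(head(penguins_small, n = 2))

# List the names of the variables (columns)
print(names(penguins_small))

# Count the rows (observations) and columns (variables)
print(nrow(penguins_small))
print(ncol(penguins_small))
```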
Illustration of penguin bill measurements. Artwork by @allison_horst.
Let’s explore our penguins data by making a plot that will help us visualize a categorical variable. We’ll start by looking at the number of penguins observed of each species.
barplot(table(penguins$species),
xlab = "Species",
ylab = "Frequency",
main = "Bar Chart of Number of Penguins of Each Species Observed",
col = c("darkorange1", "mediumorchid2", "darkcyan"))
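The barplot() call above wraps two smaller steps: penguins$species pulls the species variable out of the data set, and table() counts how many times each category appears. Those counts become the bar heights. Here is a minimal sketch of the counting step on its own, using a short made-up list of species labels (the name species_toy and its values are hypothetical).

```r
# Made-up species labels, standing in for penguins$species
species_toy <- c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie")

# table() counts how many times each category appears
counts <- table(species_toy)
print(counts)

# Look up the count for a single category by name
print(counts[["Adelie"]])
```

Passing counts to barplot() would draw one bar per category, in the same way the chunk above does for the real data.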
In addition to analyzing data visually, we can also use R to summarize data numerically. We’ll use the summary() function to do that for a given variable. Here, we’ll summarize the flipper_length_mm variable, which is the length of the penguins’ flippers (in millimeters).
summary(penguins$flipper_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 172 190 197 201 213 231
R gives us 6 numbers: the minimum (shortest) flipper length, the first quartile, the median (middle) flipper length, the mean (average) flipper length, the third quartile, and the maximum (longest) flipper length. We’ll talk about each of these in more detail later in the course!
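Each of these six numbers can also be computed on its own. As a small sketch, the code below uses only the flipper lengths of the six penguins shown in the head() output earlier, so the results will differ from the summary of the full data set; the name flippers is hypothetical.

```r
# Flipper lengths (mm) of the six penguins shown in the head() output
flippers <- c(181, 186, 195, 193, 190, 181)

# Each piece of summary() has its own function
print(min(flippers))     # shortest flipper length
print(median(flippers))  # middle flipper length
print(mean(flippers))    # average flipper length
print(max(flippers))     # longest flipper length

# quantile() gives the first and third quartiles
print(quantile(flippers, probs = c(0.25, 0.75)))

# summary() reports all six numbers at once
print(summary(flippers))
```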
With a group of up to three people (including you), complete the following exercises.
Group Members: Replace this text with the names of your group members.
1. Which of the variables in penguins are quantitative and which are categorical?
Answer: Replace this text with your answer.
2. Make the same graph for the number of penguins observed on each island in the Palmer Archipelago, using the island variable in the penguins data set. Hint: be sure to change the axis labels and title too! If you have time, play around with the colors! (Things like “red”, “yellow”, and “blue” work, but see here for a list of possible color names in R if you’re more adventurous.)
# Get started by copying and pasting the code from the speciesPlot chunk above! (Remember that this text is a comment, so it's not run by R; you can delete it if you want.)
3. Recreate the numerical summaries for the bill_length_mm variable. What is the mean (average) bill length?
# Get started by copying and pasting the code from the flipperSummaries chunk above! (Remember that this text is a comment, so it's not run by R; you can delete it if you want.)
Answer: Replace this text with your answer to the question about mean bill length.
Now you’ll dive deeper into the analysis. Maybe you’ll discuss a different application, or preview the next week’s topic. You’ll also have the chance to draw substantive conclusions from your analysis, and to think more deeply about what it is you’ve just done, and why.
Talk About It 1: What’s something you learned today? It can be about statistics, R, penguins, or anything else related to STATS 250.
Write 1-2 sentences about your answer here
Talk About It 2: How do you think statistics can help you in your major or future career?
Write 1-2 sentences about your answer here
Talk About It 3: What was the best part of lab for you today? What was a challenging part?
Write 1-2 sentences about your answer here
When you’ve finished the lab, click the Knit button one last time. Then, in the folder where you saved this .Rmd file when you downloaded it from Canvas, you’ll see an HTML file with the same name (for example, lab01.html). This is what you will upload to Canvas as your submission for Lab 1.
TODO: Screenshots!