Learning Objectives

Statistical Learning Objectives

  1. Practice interpreting histograms and boxplots
  2. Revisit the five-number summary of a variable
  3. Explore scatterplots? TODO: MAYBE NOT??

R Learning Objectives

  1. Learn how to make a histogram in R
  2. Learn how to make a boxplot in R

Functions covered in this lab

  1. hist()
  2. boxplot()

Weekly Advice

When learning how to use a new program like R, the best advice we can give you is to try things. The worst thing that can happen is that you get an error that you can learn from. The way to learn things in this context is to play around, have fun, and make mistakes.


Lab Tutorial

We’re back to hanging out with our penguin friends.

penguins <- read.csv(url("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"))

Boxplots in R

The command for making a boxplot in R is pretty simple: it’s just boxplot(). To make a boxplot of a single variable, just give R the name of the data set, a dollar sign ($), then the name of the variable. Also provide the arguments main and ylab for a plot title and y-axis label.

boxplot(penguins$body_mass_g,
        main = "Boxplot of Penguin Body Mass",
        ylab = "Body mass (g)")
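
A boxplot is basically a picture of the five-number summary (minimum, first quartile, median, third quartile, maximum). As a rough sketch, here's one way to get those numbers directly. To keep the example self-contained, it uses only the first ten body masses from the data, typed in by hand, so the results won't match the full data set.

```r
# First ten body masses (in grams) from penguins, typed in by hand.
# This is a small illustrative subset, not the full data set.
mass <- c(3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250)

# fivenum() returns the minimum, lower hinge, median, upper hinge, and maximum.
# (The hinges are usually very close to the quartiles.)
# It ignores missing values (NA) by default.
five <- fivenum(mass)
five
## [1] 3250 3475 3650 3800 4675

# summary() gives similar information, plus the mean and a count of NAs.
summary(mass)
```

One thing to keep in mind: by default, boxplot() draws any value more than 1.5 times the IQR beyond the box as a separate point, so the whiskers don't always reach the minimum and maximum.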

Sometimes we’re interested in comparing two or more groups using “side-by-side” boxplots. We can compare the different species of penguins’ body masses in this way, still using the boxplot function.

boxplot(penguins$body_mass_g ~ penguins$species,
        main = "Boxplots of Penguin Body Mass by Species",
        ylab = "Body mass (g)",
        xlab = "Species")

Notice the ~ (tilde) in the code here. Think of this as the word “by”. In the code above, we’re making boxplots of penguin body mass by penguin species. Notice that we’re specifying the data set for each of these variables! If we didn’t, R wouldn’t know where to look for the variable.

We could also make boxplots of body mass by sex:

boxplot(penguins$body_mass_g ~ penguins$sex,
        main = "Boxplot of Penguin Body Mass by Sex",
        ylab = "Body mass (g)",
        xlab = "Sex")

We could also avoid typing penguins$ twice by giving boxplot() an argument called data. Here’s how that would look:

boxplot(body_mass_g ~ sex,
        data = penguins,
        main = "Boxplot of Penguin Body Mass by Sex",
        ylab = "Body mass (g)",
        xlab = "Sex")

Be careful to note that this only works when you’re giving R a “formula” (the thing involving a tilde ~ – we’ll talk more about formulas later in the course).
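
If you're curious what the formula is doing behind the scenes, here's a small sketch. It builds a tiny data frame by hand from the first six rows of penguins (the name mini is made up for this example) and asks boxplot() to return its calculations without drawing anything, using plot = FALSE.

```r
# A hand-typed stand-in for the first six rows of penguins
mini <- data.frame(
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650),
  sex = c("male", "female", "female", NA, "female", "male")
)

# "body mass BY sex", looking for both variables inside mini
b <- boxplot(body_mass_g ~ sex, data = mini, plot = FALSE)

b$names  # the groups R found
## [1] "female" "male"
b$n      # how many non-missing body masses are in each group
## [1] 3 2
```

Notice that the penguin with a missing sex doesn't show up in either group.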

Per usual, we can add colors to a plot using the col argument. Here’s that species plot again:

boxplot(penguins$body_mass_g ~ penguins$species,
        main = "Boxplots of Penguin Body Mass by Species",
        ylab = "Body mass (g)",
        xlab = "Species",
        col = c("darkorange1", "mediumorchid2", "darkcyan")
)
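
Not sure whether a color name will work? One way to check (a quick sketch, not the only way) is the colors() function, which returns every color name R knows about:

```r
# The three colors used in the species plot
my_colors <- c("darkorange1", "mediumorchid2", "darkcyan")

# Is each one a color name that R recognizes?
known <- my_colors %in% colors()
known
## [1] TRUE TRUE TRUE

# Peek at the first few built-in color names
head(colors())
```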

Histograms in R

Histograms in R are also pretty easy – you just use the hist() function.

hist(penguins$body_mass_g)

So here we’ve got a histogram. Notice that we didn’t provide the main, xlab, and ylab arguments that we’d normally use for a plot title and an axis label, but R still gave us a title and labels. This is nice, but the labels are horrible: nobody (other than you) knows what penguins$body_mass_g means, so we don’t want to use that as a title or axis label. The moral of the story is to always provide main, xlab, and ylab arguments when making a plot!

Here’s something better:

hist(penguins$body_mass_g,
     main = "Histogram of Penguin Body Mass",
     xlab = "Body Mass (g)")

With histograms, it’s often helpful to change the number of bins to get a different view of the data. We can sort of control the number of bins using the breaks argument.

hist(penguins$body_mass_g,
     main = "Histogram of Penguin Body Mass",
     xlab = "Body Mass (g)",
     breaks = 20)

So now we’ve got a lot more bins than in the original plot. There might not be exactly 20, though. R uses the breaks argument as a suggestion only – it’ll try to give you what you want, but (1) no promises and (2) it will prefer what it thinks is prettier. Your best strategy here is to play around with the number you give as breaks until you get close to what you want.
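
To see the "suggestion" behavior for yourself, you can ask hist() how many bins it actually used. This sketch uses a handful of body masses typed in by hand (so the counts won't match the full data), and plot = FALSE tells hist() to skip the drawing and just return its calculations. It also shows one way to get exactly the bins you want: give breaks a vector of bin edges instead of a single number. The edges used here (3000 to 5000 in steps of 250) are just an assumption that happens to cover these values.

```r
# First ten body masses (in grams) from penguins, typed in by hand
mass <- c(3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250)

# A single number is only a suggestion
h <- hist(mass, breaks = 20, plot = FALSE)
length(h$counts)   # the number of bins R actually chose

# A vector of bin edges is used exactly as given
edges <- seq(3000, 5000, by = 250)
h_exact <- hist(mass, breaks = edges, plot = FALSE)
length(h_exact$counts)
## [1] 8
```

If you go the vector route, make sure the edges cover the whole range of your data, or R will give you an error.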

How to find help in R

R has built-in “documentation” for every function. If you want to find that documentation, you can Google it, but that takes too long. So it’s better to use R’s built-in help! In the R console, just type a question mark ? followed by the name of the function you want help with, then hit enter. For example, ?hist will bring up the documentation for the hist() function.

The most useful feature of help is a list of a function’s arguments and what they do. You may not be able to fully understand some of the terms in the documentation just yet, but try it out and your lab instructor will be able to help!
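
If you just want a quick reminder of a function's arguments without opening the whole help page, args() is one option. Here's a small sketch using read.csv():

```r
# Print the argument list of read.csv(), including default values
args(read.csv)

# Or get just the argument names
arg_names <- names(formals(read.csv))
arg_names

# For the full documentation, use ?read.csv or help("read.csv")
```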

Remember how we can use R as a calculator?

36 / 6
## [1] 6

What if we want R to remember the result of our calculation? We can give the result a name by assigning it to something.

x <- 36 / 6

We read that code as “x gets 36 / 6”. The arrow is made using the less-than symbol (<, shift + comma on a US English keyboard) and a hyphen.

Now, we’ve stored the result as x, and R will remember that x is 6. You can see in the environment pane in RStudio (top right) that there’s now a “value” called x and it’s 6. You can also access the value of x by typing x into R. Check it out:

x
## [1] 6

NOTE: R is “case-sensitive”, which means that upper-case letters are different from lower-case letters. Notice what happens when we ask R for the value of X:

X
## Error in eval(expr, envir, enclos): object 'X' not found

R doesn’t like this! Notice that there’s an error that says object 'X' not found. This means that R doesn’t know about something called X (upper-case X), because we haven’t assigned anything to X. Be careful about upper and lower case letters in R! (Also, notice that we made a mistake here, and nothing bad happened! R told us we were wrong and got a little angry, but that’s okay, because R is a computer program and doesn’t actually have feelings).

Reading in Data

We can also assign data to names in R. Remember last week? We ran this code:

penguins <- read.csv(url("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"))

Here, we stored the result of read.csv() as an object called penguins. penguins is what’s called a “data frame” in R (in lecture, we call it a “data matrix”). So, now when we want to use the data, we can just tell R about penguins instead of having to write that whole read.csv() line again and again.

Now, when you look in the environment pane of RStudio (the “Environment” tab in the top right pane), you’ll see a “data” object called “penguins” – this is our data frame!

What’s a CSV file?

CSV stands for “comma separated values” and is a commonly used file type for storing data. Here’s a sample of what a .csv file looks like:

write.csv(head(penguins))
## "","species","island","bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g","sex","year"
## "1","Adelie","Torgersen",39.1,18.7,181,3750,"male",2007
## "2","Adelie","Torgersen",39.5,17.4,186,3800,"female",2007
## "3","Adelie","Torgersen",40.3,18,195,3250,"female",2007
## "4","Adelie","Torgersen",NA,NA,NA,NA,NA,2007
## "5","Adelie","Torgersen",36.7,19.3,193,3450,"female",2007
## "6","Adelie","Torgersen",39.3,20.6,190,3650,"male",2007

Each row of the file (rows are denoted by ## in the above output, but ## is not in the actual file) is an “observation” or “case”, and consists of one or more variables whose values are separated by commas (hey, look at that). The first row contains the variable names. You don’t need to know these specific details, but it will be helpful to understand what a .csv file is for these labs.

When you use the function read.csv(), R expects you to tell it where a .csv file is (notice that the name of the function matches the file type – .csv!). So, inside those parentheses, you can give R a file location or a URL that tells it about a .csv file.

Try it: Go to the URL inside of read.csv() above – what do you see?
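
read.csv() works the same way on a file on your computer as it does on a URL. As a sketch, the code below writes a tiny .csv file to a temporary location and reads it back in. The file contents are a trimmed-down, hand-typed version of the sample above (fewer rows and columns), and the names csv_path and mini are made up for this example.

```r
# Make a small .csv file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "species,island,body_mass_g,sex,year",
  "Adelie,Torgersen,3750,male,2007",
  "Adelie,Torgersen,3800,female,2007",
  "Adelie,Torgersen,NA,NA,2007"
), csv_path)

# Read it in: the first row of the file becomes the variable names
mini <- read.csv(csv_path)
mini

names(mini)
dim(mini)   # 3 rows, 5 columns
## [1] 3 5
```

Notice that the NA entries in the file are read in as missing values, just like in penguins.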

Exploring Data “Structure”

So now that we’ve got a data frame loaded into R, let’s see what’s in it. We got a preview of this last week, but let’s more officially explore it.

Remember from last lab that we can see the first 6 rows of the data by using a function called head():

head(penguins)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007
## 6   male 2007

What if we just wanted to know the names of the variables that are contained in penguins? We can use the names() function:

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

This can be useful to remind ourselves of how the variable names are spelled/formatted in R. Remember how R is case sensitive? It’s important to format these variable names exactly, because R isn’t as smart as you are and so you need to tell it exactly what to do.

Something to notice about the variable names above is that words are separated by underscores (_) – this is because R does not like spaces in variable names. When giving things names in R, you can only use a combination of letters, numbers, periods, and underscores, and the names have to start with a letter or a period. People tend to use underscores or periods instead of spaces.

Watch what happens when you try to assign something to a “bad” name:

tik tok <- 12
## Error: <text>:1:5: unexpected symbol
## 1: tik tok
##         ^
4eva <- 4 * 2 
## Error: <text>:1:1: unexpected input
## 1: 4ev
##     ^
_hi_mom <- 5^2
## Error: <text>:1:1: unexpected input
## 1: _
##     ^

The errors saying “unexpected symbol” or “unexpected input” are R’s way of telling you that these names are not allowed, and that you should use a different name. Here’s how we’d correct these:

tiktok <- 12
forever <- 4 * 2
dear_mother <- 5^2
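
If you're ever unsure whether a name is allowed, one option is base R's make.names() function, which shows what R would turn a name into to make it legal. This is just a quick way to check; names you pick yourself will usually be nicer to read.

```r
# Three "bad" names and one good one
fixed <- make.names(c("tik tok", "4eva", "_hi_mom", "tiktok"))
fixed
## [1] "tik.tok"  "X4eva"    "X_hi_mom" "tiktok"
```

The space became a period, and the names that started with a number or an underscore got an X stuck on the front. The name that was already fine came back unchanged.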

Let’s now explore the “structure” of the data (similar to what we did with ‘head()’, but with a twist). To see a data frame’s structure, we can use the function str() (pronounced “stir”, not “straight to Rick’s”):

str(penguins)
## 'data.frame':    344 obs. of  8 variables:
##  $ species          : chr  "Adelie" "Adelie" "Adelie" "Adelie" ...
##  $ island           : chr  "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
##  $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
##  $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
##  $ sex              : chr  "male" "female" "female" NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...

This tells you the number of rows (“observations”), the number of columns (“variables”), the name and type of each variable, and gives you a preview of what each variable looks like.

If you really only want the “dimension” of the data frame (i.e., how many rows and how many columns), you can use the dim() function:

dim(penguins)
## [1] 344   8

The results are given in the order “rows, columns” because data is Really Cool (rows, columns).
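
If you only need one of the two numbers, nrow() and ncol() return them separately. Here's a quick sketch on a small hand-made data frame (mini is a made-up name, standing in for penguins):

```r
# A tiny data frame with 3 rows and 2 columns
mini <- data.frame(
  species = c("Adelie", "Adelie", "Adelie"),
  body_mass_g = c(3750, 3800, 3250)
)

dim(mini)    # rows, then columns
## [1] 3 2
nrow(mini)   # just the number of rows (cases)
## [1] 3
ncol(mini)   # just the number of columns (variables)
## [1] 2
```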

Frequency Tables

Remember when we made this bar chart last week?

barplot(table(penguins$species),
     xlab = "Species",
     ylab = "Frequency",
     main = "Bar Chart of Number of Penguins of Each Species Observed",
     col = c("darkorange1", "mediumorchid2", "darkcyan"))

In order to make this plot, we had to give R a “frequency table” of the variable species. This is a way to count how many observations (rows) there are that correspond to each value of species. To make a frequency table, we use the table() function:

table(penguins$species)
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124

So, there are 124 Gentoo penguins in the data.

Notice that inside the table function, we have something that looks a little weird. We wrote penguins$species. This is how we tell R to use the species variable inside the data frame penguins. The dollar sign ($) tells R to look inside the data frame penguins for the column called species.

It’s very important that you tell R which data frame the variable you’re interested in is from. Let’s see what happens when we don’t:

species
## Error in eval(expr, envir, enclos): object 'species' not found

Because we don’t have anything called species in our environment (there’s nothing called species in the environment pane), R doesn’t know what we’re talking about! species only exists inside of penguins.

We can also make “two-way” frequency tables (sometimes called “contingency tables”) to summarize counts for two categorical variables:

table(penguins$species, penguins$island)
##            
##             Biscoe Dream Torgersen
##   Adelie        44    56        52
##   Chinstrap      0    68         0
##   Gentoo       124     0         0

So it looks like all the Gentoo penguins in our data live on Biscoe island.

Remember that data is really cool, so the first variable you give to table() is in the rows of the table, and the second is in the columns.

Notice that we separated the two variables inside of table() with a comma – it’s important to remember this!
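
Counts are often easier to compare as proportions. As a sketch, the code below rebuilds the two-way table above by typing in its counts by hand (the names tab and row_props are made up for this example), then uses addmargins() to add row and column totals and prop.table() to turn each row into proportions. With the real data loaded, you could skip the typing and start from table(penguins$species, penguins$island) instead.

```r
# Counts copied from the two-way table above:
# rows are species, columns are islands
tab <- as.table(matrix(
  c( 44, 56, 52,
      0, 68,  0,
    124,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
))

# Add a "Sum" row and column with the totals
addmargins(tab)

# Proportions within each row (margin = 1 means "by rows")
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```

Each row of row_props adds up to 1, so you can read it as "of the penguins of this species, what proportion live on each island?"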


Try It!

With a group of up to three people (including you), complete the following exercises.

Group Members: Replace this text with the names of your group members.

In this Try It, we’ll be using a data set we’ll call a2trees that contains information about a subset of the top 9 types of trees planted around Ann Arbor. The city maintains an interactive map of all such trees at this link – you should check it out.

1. We’ll start by reading in the data. The data are stored in a file called a2trees_clean.csv. You’ll need to give the name of this file (in quotes!) to the appropriate R function. When you read it in, call the data set a2trees.

Answer:

# Replace this comment with code required for Try It 1.

2. How many variables are in a2trees? How many cases are there? Using code, find the names of the variables in the data set.

Answer:

# Replace this comment with code required for Try It 2. (Remember that this text is a comment, so R ignores it; you can delete it if you want.)

Replace this text with the number of variables and number of cases in a2trees.

3. Make a table of the variable in a2trees which represents a tree’s health. How many trees in the data set are in very good health? What do you notice about the order in which R presents the categories?

Answer:

# Replace this comment with code required for Try It 3. (Remember that this text is a comment, so R ignores it; you can delete it if you want.)

Replace this text with your answers to the questions.

4. Make a bar graph of the variable you tabulated in Try It 3. You can use the code from the first chunk in the Frequency Tables section as a starting point. Be sure to change the axis labels and title! If you have time, play around with the colors! (Things like “red”, “yellow”, “blue” work, but see here for a list of possible color names in R if you’re more adventurous.)

Answer:

# Get started by copying and pasting the code from the speciesPlot chunk above! (Remember that this text is a comment, so R ignores it; you can delete it if you want.)

5. Make a two-way contingency table for tree health and common genus.

Answer:

# Replace this comment with code required for Try It 5. (Remember that this text is a comment, so R ignores it; you can delete it if you want.)

6. Based on the data, what is the botanical genus (biological classification) of maple trees?

Answer:

# Replace this comment with any code you need to answer the question

Replace this text with your written answer to the question.

Dive Deeper

In the Try It, you played around a little with data about trees in Ann Arbor. Now, we’re going to have you dive a little deeper.

Replace this text with your written answer to the question.

2. Based on your answer to #2 above, speculate on how these data were collected, and why.

Replace this text with your written answer to the question.

3. Make a frequency table of the common genus of the trees in the data. Based on your answer to Try It #5 (the two-way frequency table), what proportion of elm trees are in good health?

Write 1-2 sentences about your answer here

4. Could this data set be used to answer the research question “Are there more maple or oak trees in Nichols Arboretum¹”? Why or why not?

Write 1-2 sentences about your answer here

5. Could this data set be used to answer the research question “What is the average height of trees on public land designated as ‘landmark trees’ in Ann Arbor”? Why or why not?

Write 1-2 sentences about your answer here

Wrap-Up and Submission

When you’ve finished the lab, click the Knit button one last time. Then, in the folder where you saved this .Rmd file when you downloaded it from Canvas, you’ll see an HTML file with the same name (for example, lab02.html). This is what you will upload to Canvas as your submission for Lab 2.



  1. “The Arb” is a University-owned nature area, and if you’re on campus it’s a great place for a peaceful, socially-distanced walk!