Overview

Welcome to Intro to R Part II! In this lesson, you will…

  1. Learn how to import and view a data set in R
  2. Generate and interpret descriptive statistics
  3. Generate and interpret plots using ggplot2
  4. Use the pipe operator %>% as a way of chaining together multiple operations

1 Importing Data

Open RStudio and create a new script (call it lesson.R, or whatever you like). Save the script.

Then download some data. Head to the Cambridge Open Data Portal and search for “Dogs of Cambridge.” This dataset lists the name, breed, and approximate location of dogs in Cambridge and is based on dog license data! Scroll down to “Columns in this Dataset” to get a feel for the data structure and variables. Then, export the data as a .csv file and save it in the same folder as your R script.

Before we load this dataset into R, we need to tell R the name and location of the folder containing the data (this is called the working directory). Use the dropdown menu to set the working directory manually: Session > Set Working Directory > To Source File Location.
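
You can also check the working directory from the console. The lines below are a minimal sketch using only base R functions. They do not change any settings; they report the current working directory and list the .csv files R can see there.

```r
# Print the folder R is currently using as the working directory
current_dir <- getwd()
print(current_dir)

# List the .csv files in that folder; the downloaded file should appear here
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)
```

If the file you downloaded is not listed, the working directory is probably not the folder where you saved the data.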

Now we are ready to load the data. Run the following statement to load the Dogs of Cambridge data into R. This statement uses the read.csv function and stores the data as an object.

DogData <- read.csv("Dogs_of_Cambridge.csv")

Now, let’s take a look at the data. We can do this using the View() function from our previous lab, or the head() function, which shows the first six rows of the dataset.

head(DogData)
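
A few other base R functions are handy for a first look at a data frame. The sketch below runs on a small made-up data frame, toy_dogs, so it works without the downloaded file. The rows are invented for illustration; only the column names follow the ones used in this lesson.

```r
# Toy data frame (made-up rows) with the column names used in this lesson
toy_dogs <- data.frame(
  Dog_Name = c("Luna", "Max", "Luna", "Bella"),
  Dog_Breed = c("Poodle", "Beagle", "Mixed Breed", "Pug"),
  Neighborhood = c("Riverside", "Baldwin", "Riverside", "The Port")
)

dim(toy_dogs)      # number of rows, then number of columns
names(toy_dogs)    # column (variable) names
str(toy_dogs)      # type and first few values of each column
head(toy_dogs, 2)  # first two rows instead of the default six
```

The same calls should work on DogData once it is loaded.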

2 Summary Statistics

A statistic is any numerical summary of a data set. For instance, what is the most popular dog breed in Cambridge? Or, which neighborhood has the most dogs? Let’s work through a few examples. To do this, we will need to access specific variables within the data frame. To access a variable inside a data frame, we use the $ operator. Try this code and see what happens. Then access a few other variables of your choosing.

DogData$Dog_Breed

Now we are ready to find some summary statistics. Let’s start by figuring out the most popular dog name in Cambridge. First, let’s use the table() function to figure out the frequencies (i.e., the number of observations) of particular dog names.

table(DogData$Dog_Name)

Now, let’s practice what we learned in the last session by saving this table as an object.

dog_names <- table(DogData$Dog_Name)

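Then, we use the which.max() function to figure out the most popular dog name! This function returns the position of the largest value in the table, labeled with the corresponding name. Note that the number printed beneath the name is that position (its place in the sorted table of names), not the number of dogs; to get the count itself, use max(dog_names). What is the most popular dog name in Cambridge?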

which.max(dog_names)
## Luna 
## 1229
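
To see the difference between the position of the maximum and the maximum itself, here is a small sketch on made-up names (toy_names is invented for illustration and is not part of the Cambridge data).

```r
# Six made-up dog names: Luna appears 3 times, Max 2 times, Bella once
toy_names <- c("Luna", "Max", "Luna", "Bella", "Luna", "Max")
toy_table <- table(toy_names)

which.max(toy_table)                # position of the largest count, labeled with its name
max(toy_table)                      # the largest count itself
names(which.max(toy_table))         # just the name
sort(toy_table, decreasing = TRUE)  # all names, most frequent first
```

Here which.max() reports Luna at position 2 (the table is ordered Bella, Luna, Max), while max() reports the count, 3.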

On your own, see if you can figure out which Cambridge neighborhood has the most dogs! How many dogs are in that neighborhood?

3 Plots

There are many ways to visualize data in statistical plots. In this lab, we’ll focus on the bar plot, which we use to visualize group-level summary statistics, like counts, averages, or proportions. Let’s make a bar plot that shows us the number of dogs in each neighborhood. As we’ll see in this example, most of the work in making really good plots lies in editing: removing distractions and adding some flair until you have a clear and attractive plot.

In this lesson, we’ll learn how to make plots with a package called ggplot2. There are lots of ways to make plots in R, but ggplot2 is a particularly useful package for making beautiful plots. We won’t be able to thoroughly cover ggplot2 today, but you can find LOTS of ggplot2 resources on the internet (for example: https://ggplot2.tidyverse.org/#learning-ggplot2).

So let’s start by using the library() function to load the tidyverse package that we installed in the last session. Loading tidyverse automatically loads the ggplot2 package. As we progress through this section, we are also going to #comment our code, so that we get in the habit of leaving clear comments for ourselves (and others)!

# Load the tidyverse package
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Now, if we want to visualize how many dogs are in each neighborhood, we’ll need to plot the Neighborhood variable. Here’s the basic ggplot2 code for a bar plot that shows us the number of dogs per neighborhood. Run this code in your own console. What do you notice?

# Create a simple bar plot
ggplot(DogData) + 
  geom_bar(aes(x=Neighborhood))

This plot shows us the number of dogs per neighborhood, but we will need to do some editing to make this plot clear! We can start by cleaning up the x axis. Let’s add another piece of code:

# Stagger the x axis for clarity
ggplot(DogData) + 
  geom_bar(aes(x=Neighborhood))+
  scale_x_discrete(guide = guide_axis(n.dodge=3))

Now we have a bar plot with no overlapping x-axis labels. That looks better! Now, let’s add some labels and make the title bold.

# Add labels to the barplot
ggplot(DogData) + 
  geom_bar(aes(x=Neighborhood))+
  scale_x_discrete(guide = guide_axis(n.dodge=3))+ 
  labs(x="Neighborhoods in Cambridge",
       y="Number of Dogs",
       title="Dogs per Neighborhood in Cambridge")

# Make the title bold
ggplot(DogData) + 
  geom_bar(aes(x=Neighborhood))+
  scale_x_discrete(guide = guide_axis(n.dodge=3))+ 
  labs(x="Neighborhoods in Cambridge",
       y="Number of Dogs",
       title="Dogs per Neighborhood in Cambridge")+
  theme(plot.title = element_text(face="bold"))
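
Because ggplot2 builds plots by adding pieces with +, a plot can also be saved as an object and extended step by step instead of retyping the full code each time. The sketch below uses a made-up data frame (toy_dogs) and hypothetical object names (base_plot, labeled_plot, final_plot) so that it runs on its own.

```r
library(ggplot2)

# Made-up data: three neighborhoods with 3, 2, and 1 dogs
toy_dogs <- data.frame(
  Neighborhood = c("Riverside", "Baldwin", "Riverside",
                   "The Port", "Riverside", "Baldwin")
)

# Step 1: the basic bar plot, saved as an object
base_plot <- ggplot(toy_dogs) +
  geom_bar(aes(x = Neighborhood))

# Step 2: add labels to the saved plot
labeled_plot <- base_plot +
  labs(x = "Neighborhoods in Cambridge",
       y = "Number of Dogs",
       title = "Dogs per Neighborhood in Cambridge")

# Step 3: make the title bold
final_plot <- labeled_plot +
  theme(plot.title = element_text(face = "bold"))

# Typing the object's name draws the plot
final_plot
```

Each saved object is a complete plot, so you can print base_plot, labeled_plot, and final_plot to compare the stages.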

That’s better, but we can still improve this plot. Let’s reduce the number of categories and focus on the neighborhoods with the most dogs. To do this, we are going to introduce another useful package: dplyr. Like ggplot2, dplyr is automatically loaded when we load tidyverse. So we don’t need to install any new packages.

To filter the dataset so that it includes only the neighborhoods with the most dogs, we are going to introduce pipes (%>%). Pipes allow us to combine multiple operations in a single “pipeline” by taking the output from one function and feeding it to the first argument of the next function. That probably sounds confusing, so let’s take a break from our plot to look at a simple example. Try the following code:

# Compute the natural logarithm of 100 
log(100)
## [1] 4.60517
# Compute the natural logarithm of 100 with a pipe
100 %>% log()
## [1] 4.60517

What happened here? We took function(argument) and rewrote it as argument %>% function(). For a simple calculation with a single step, the difference is minimal. But for more complex calculations, the difference can be substantial.
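
The benefit shows up once there is more than one step. As a small sketch: take the square roots of three numbers and add them up, first with nested function calls and then with pipes. The object names nested and piped are just for illustration.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

# Nested version: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped version: read from left to right, one step at a time
piped <- c(1, 4, 9) %>%
  sqrt() %>%
  sum()

nested  # 6
piped   # 6
```

Both versions compute the same thing (1 + 2 + 3 = 6); the piped version lists the steps in the order they happen.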

For example, let’s return to our plot and use pipes to filter the dataset to only include neighborhoods with more than 250 dogs. Let’s save that new dataset in an object called DogData250.

# Filter dataset to include neighborhoods > 250 dogs
DogData250 <- DogData %>%
  group_by(Neighborhood) %>%
  filter(n() > 250) %>%
  ungroup()
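
To see what each step of this pipeline does, it can help to run the same pattern on a tiny made-up data frame where the answer is easy to check by eye. In this sketch, toy_dogs and toy_big are hypothetical names, and the threshold is lowered from 250 to 2 to suit the toy data.

```r
library(dplyr)

# Made-up data: Riverside has 3 dogs, Baldwin has 2, The Port has 1
toy_dogs <- data.frame(
  Dog_Name = c("Luna", "Max", "Bella", "Rex", "Milo", "Daisy"),
  Neighborhood = c("Riverside", "Riverside", "Riverside",
                   "Baldwin", "Baldwin", "The Port")
)

# Keep only the rows from neighborhoods with more than 2 dogs
toy_big <- toy_dogs %>%
  group_by(Neighborhood) %>%  # treat each neighborhood as its own group
  filter(n() > 2) %>%         # n() is the number of rows in the current group
  ungroup()                   # remove the grouping so later steps see one table

toy_big
```

Only the three Riverside rows remain, because Riverside is the only neighborhood with more than two dogs.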

How do we know this new dataset includes the neighborhoods with more than 250 dogs? Let’s check! Let’s use pipes to arrange the neighborhoods by number of dogs. Then, let’s use unique() to show us the neighborhoods that are included in DogData250. Only the neighborhoods with more than 250 dogs should be in DogData250.

# Sort neighborhoods by number of dogs
table(DogData$Neighborhood)  %>% as.data.frame() %>% arrange(desc(Freq))
##                     Var1 Freq
## 1         West Cambridge  606
## 2        North Cambridge  596
## 3      Neighborhood Nine  577
## 4          Cambridgeport  438
## 5          Mid-Cambridge  404
## 6         East Cambridge  397
## 7              Riverside  197
## 8  Wellington-Harrington  168
## 9               The Port  161
## 10               Baldwin  152
## 11       Strawberry Hill  124
## 12   Cambridge Highlands   91
## 13            Area 2/MIT   31
# Use unique() to see which neighborhoods are in DogData250
unique(DogData250$Neighborhood)
## [1] "North Cambridge"   "Neighborhood Nine" "Mid-Cambridge"    
## [4] "Cambridgeport"     "West Cambridge"    "East Cambridge"
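
As an aside, dplyr has its own counting function, count(), which produces a sorted frequency table in one step. The sketch below shows it on made-up data; toy_dogs and toy_counts are hypothetical names used only for this illustration.

```r
library(dplyr)

# Made-up data: Riverside has 3 dogs, Baldwin has 2, The Port has 1
toy_dogs <- data.frame(
  Neighborhood = c("Riverside", "Baldwin", "Riverside",
                   "The Port", "Riverside", "Baldwin")
)

# Count rows per neighborhood and sort from most to fewest dogs
toy_counts <- toy_dogs %>%
  count(Neighborhood, sort = TRUE)

toy_counts  # the counts are stored in a column named n
```

The same pattern should work on DogData in place of toy_dogs.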

So now we have a dataset that contains only the Cambridge neighborhoods with more than 250 dogs. Let’s return to our plot and make our barplot with this dataset.

# Make a barplot with this new dataset
ggplot(DogData250) + 
  geom_bar(aes(x=Neighborhood))+
  scale_x_discrete(guide = guide_axis(n.dodge=2))+ 
  labs(x="Neighborhoods in Cambridge",
       y="Number of Dogs",
       title="Dog-Crazy Neighborhoods in Cambridge")+
  theme(plot.title = element_text(face="bold"))

Great. The final step here is to introduce color. You can really geek out over colors in ggplot2, but we’ll keep it simple for today. We will use a library of color palettes that are color-blind-friendly. This library of color palettes is called viridis, so we’ll need to install this package.

# Install the viridis package (only needed once)
install.packages("viridis")
# Load the viridis package
library(viridis)  
## Loading required package: viridisLite
# Make a barplot with this new dataset and color-blind-friendly colors
ggplot(DogData250) + 
  geom_bar(aes(x=Neighborhood, fill=Neighborhood))+ 
  scale_fill_viridis(discrete=TRUE)+
  scale_x_discrete(guide = guide_axis(n.dodge=2))+ 
  labs(x="Neighborhoods in Cambridge",
       y="Number of Dogs",
       title="Dog-Crazy Neighborhoods in Cambridge")+
  theme(plot.title = element_text(face="bold"))
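
One detail worth knowing: bars are colored through the fill aesthetic, so the palette has to be applied with a fill scale rather than a color scale. The sketch below illustrates this on made-up data using scale_fill_viridis_d(), a viridis scale for discrete variables that is built into ggplot2. The names toy_dogs, toy_plot, and bar_fills are hypothetical.

```r
library(ggplot2)

# Made-up data: three neighborhoods with 3, 2, and 1 dogs
toy_dogs <- data.frame(
  Neighborhood = c("Riverside", "Baldwin", "Riverside",
                   "The Port", "Riverside", "Baldwin")
)

toy_plot <- ggplot(toy_dogs) +
  geom_bar(aes(x = Neighborhood, fill = Neighborhood)) +
  scale_fill_viridis_d()  # a fill scale, matching the fill aesthetic above

# Draw the plot
toy_plot

# Inspect the colors ggplot2 assigned to the bars (one per neighborhood)
bar_fills <- unique(layer_data(toy_plot)$fill)
bar_fills
```

If the bars come out in ggplot2's default colors instead of the viridis palette, check that the scale function matches the aesthetic used inside aes().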