Welcome to Intro to R Part II! In this lesson, you will make plots with ggplot2 and learn pipes (%>%) as a way of chaining together multiple operations.
Open RStudio and create a new script (call it lesson.R, or whatever you like). Save the script.
Then download some data. Head to the Cambridge Open Data Portal. Access the portal, and search for “Dogs of Cambridge.” This dataset displays the name, breed, and approximate location of dogs in Cambridge and is based on dog license data! Scroll down to “Columns in this Dataset” to get a feel for the data structure and variables. Then, export the data as a .csv file and save the data in the same folder as your R script.
Before we load this dataset into R, we need to tell R the name and location of the folder containing the data (this is called the working directory). Use the dropdown menu to set the working directory manually: Session > Set Working Directory > To Source File Location.
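If you would rather set the working directory from your script than from the menu, base R's getwd() and setwd() do the same job. This is only a minimal sketch: the folder path in the comment is a placeholder, not a real location, so substitute your own.

```r
# Print the current working directory
current_dir <- getwd()
print(current_dir)

# To change it, uncomment the next line and substitute your own folder
# (the path below is a placeholder, not a real location):
# setwd("path/to/your/lesson/folder")

# List the .csv files R can see in the working directory
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)
```

If the exported dog data does not show up in that list, the working directory is probably not the folder where you saved the file.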
Now we are ready to load the data. Run the following statement to
load the Dogs of Cambridge data into R. This statement uses the
read.csv function and stores the data as an
object.
DogData <- read.csv("Dogs_of_Cambridge.csv")
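To see what read.csv produces without depending on the downloaded file, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The two rows are invented for illustration; only the column names Dog_Name and Dog_Breed are borrowed from this lesson.

```r
# Write a tiny, made-up CSV to a temporary file
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Dog_Name,Dog_Breed",
             "Luna,Poodle",
             "Max,Beagle"), toy_path)

# read.csv returns a data frame: the first line supplies the column names,
# and every following line becomes one row
toy_dogs <- read.csv(toy_path)
class(toy_dogs)
nrow(toy_dogs)
```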
Now, let’s take a look at the data. We can do this using the
View() function from our previous lab, or the
head() function, which will show the first six lines of the
dataset.
head(DogData)
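Besides head(), a few other base R functions give a quick overview of a data frame. The sketch below uses a small made-up data frame (the rows are invented; the column names follow the ones used in this lesson), but the same calls should work on DogData.

```r
# A small, made-up data frame standing in for the real data
toy_dogs <- data.frame(
  Dog_Name = c("Luna", "Max", "Bella"),
  Dog_Breed = c("Poodle", "Beagle", "Mixed Breed"),
  Neighborhood = c("Riverside", "Baldwin", "Riverside")
)

dim(toy_dogs)      # number of rows, then number of columns
names(toy_dogs)    # column names
str(toy_dogs)      # type and first few values of each column
head(toy_dogs, 2)  # first two rows instead of the default six
```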
A statistic is any numerical summary of a data set. For instance,
what is the most popular dog breed in Cambridge? Or, which neighborhood
has the most dogs? Let’s work through a few examples. To do this, we
will need to access specific variables within the dataframe. To access a
variable from inside a dataset, we use the character $. Try
this code and see what happens. Then access a few other variables of
your choosing.
DogData$Dog_Breed
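The $ operator pulls one column out of the data frame as a plain vector, which you can then pass to other functions. Here is a minimal sketch on a made-up data frame (the rows are invented for illustration):

```r
# A small, made-up data frame standing in for the real data
toy_dogs <- data.frame(
  Dog_Name = c("Luna", "Max", "Bella"),
  Dog_Breed = c("Poodle", "Beagle", "Mixed Breed"),
  Neighborhood = c("Riverside", "Baldwin", "Riverside")
)

# $ returns the column as a vector
breeds <- toy_dogs$Dog_Breed

# Double square brackets with the column name do the same thing
breeds_again <- toy_dogs[["Dog_Breed"]]
identical(breeds, breeds_again)

# Because the result is a vector, ordinary vector functions work on it
length(breeds)
unique(toy_dogs$Neighborhood)
```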
Now we are ready to find some summary statistics. Let’s start by
figuring out the most popular dog name in Cambridge. First, let’s use
the table() function to figure out the frequencies (i.e.,
the number of observations) of particular dog names.
table(DogData$Dog_Name)
Now, let’s practice what we learned in the last session by saving this table as an object.
dog_names <- table(DogData$Dog_Name)
Then, we use the which.max() function to figure out the
most popular dog name! This function returns the position of the largest
count in the table, labeled with the corresponding name. The number printed
under the name is that position in the table, not the number of dogs. What
is the most popular dog name in Cambridge?
which.max(dog_names)
## Luna
## 1229
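To get the count itself rather than the position, you can index the table with the result of which.max(), or use max(). The sketch below uses a short made-up vector of names so the numbers are easy to check by hand.

```r
# A short, made-up vector of dog names
toy_names <- c("Luna", "Max", "Luna", "Bella", "Luna", "Max")

# table() counts each name; the entries are ordered alphabetically:
# Bella = 1, Luna = 3, Max = 2
name_counts <- table(toy_names)

# which.max() gives the position of the largest count (Luna is entry 2)
top <- which.max(name_counts)
names(top)

# Index the table with that position to get the count itself
name_counts[top]

# max() returns the largest count directly
max(name_counts)

# Sorting the whole table shows the full ranking
sort(name_counts, decreasing = TRUE)
```

The same pattern, for example max(dog_names), should tell you how many dogs share the most popular name in the real data.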
On your own, see if you can figure out which Cambridge neighborhood has the most dogs! How many dogs are in that neighborhood?
There are many ways to visualize data in statistical plots. In this lab, we’ll focus on the bar plot, which we use to visualize group-level summary statistics, like counts, averages, or proportions. Let’s make a bar plot that shows us the number of dogs in each neighborhood. As we’ll see in this example, most of the work in making really good plots lies in editing: removing distractions and adding some flair until you have a clear and attractive plot.
In this lesson, we’ll learn how to make plots with a package called
ggplot2. There are lots of ways to make plots in R, but
ggplot2 is a particularly useful package for making
beautiful plots. We won’t be able to thoroughly cover
ggplot2 today, but you can find LOTS of
ggplot2 resources on the internet (for example: https://ggplot2.tidyverse.org/#learning-ggplot2).
So let’s start by using the library() function to load the
tidyverse package that we installed in the last session.
Loading tidyverse automatically loads the ggplot2 package. As we
progress through this section, we are also going to
#comment our code, so that we get in the habit of leaving
clear comments for ourselves (and others)!
# Load the tidyverse package
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Now, if we want to visualize how many dogs are in each neighborhood, we’ll need to plot the Neighborhood variable. Here’s the basic ggplot2 code for a bar plot that shows us the number of dogs per neighborhood. Run this code in your own console. What do you notice?
# Create a simple bar plot
ggplot(DogData) +
  geom_bar(aes(x = Neighborhood))
This plot shows us the number of dogs per neighborhood, but we will need to do some editing to make this plot clear! We can start by cleaning up the x axis. Let’s add another piece of code:
# Stagger the x axis for clarity
ggplot(DogData) +
  geom_bar(aes(x = Neighborhood)) +
  scale_x_discrete(guide = guide_axis(n.dodge = 3))
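Staggering is one option; another common approach is to rotate the axis labels with theme(). This is a minimal sketch on a made-up data frame (the rows are invented; only the column name Neighborhood comes from this lesson), and the 45-degree angle is just a starting point to adjust.

```r
library(ggplot2)

# A small, made-up data frame standing in for the real data
toy_dogs <- data.frame(
  Neighborhood = c("Riverside", "Baldwin", "Riverside", "The Port")
)

# Rotate the x-axis labels by 45 degrees and right-align them
# so that each label ends under its own bar
p <- ggplot(toy_dogs) +
  geom_bar(aes(x = Neighborhood)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```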
Now we have a bar plot with no overlapping x-axis labels. That looks better! Now, let’s add some labels and make the title bold.
# Add labels to the barplot
ggplot(DogData) +
  geom_bar(aes(x = Neighborhood)) +
  scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
  labs(x = "Neighborhoods in Cambridge",
       y = "Number of Dogs",
       title = "Dogs per Neighborhood in Cambridge")
# Make the title bold
ggplot(DogData) +
  geom_bar(aes(x = Neighborhood)) +
  scale_x_discrete(guide = guide_axis(n.dodge = 3)) +
  labs(x = "Neighborhoods in Cambridge",
       y = "Number of Dogs",
       title = "Dogs per Neighborhood in Cambridge") +
  theme(plot.title = element_text(face = "bold"))
That’s better, but we can still improve this plot. Let’s reduce the
number of categories and focus on the neighborhoods with the most dogs.
To do this, we are going to introduce another useful package:
dplyr. Like ggplot2, dplyr is
automatically loaded when we load tidyverse. So we don’t
need to install any new packages.
To filter the dataset to include only the neighborhoods with the most dogs, we
are going to introduce pipes (%>%).
Pipes allow us to combine multiple operations in a single “pipeline” by
taking the output from one function and feeding it to the first argument
of the next function. That probably sounds confusing, so let’s take a
break from our plot to look at a simple example. Try the following
code:
# Compute the natural logarithm of 100
log(100)
## [1] 4.60517
# Compute the natural logarithm of 100 with a pipe
100 %>% log()
## [1] 4.60517
What happened here? We took function(argument) and rewrote this as argument %>% function(). For a simple calculation with only two steps, the difference is minimal. But for more complex kinds of calculations, the difference can be substantial.
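To see why pipes help once there are several steps, compare a nested call with the same calculation written as a pipeline. This is a small sketch with made-up numbers; it loads dplyr, which makes %>% available.

```r
library(dplyr)

# Some made-up numbers
values <- c(1, 4, 9)

# Nested version: read from the inside out
# (square roots first, then the sum, then rounding)
nested <- round(sum(sqrt(values)), 1)

# Piped version: read from left to right, one step at a time
piped <- values %>%
  sqrt() %>%
  sum() %>%
  round(1)

# Both versions give the same answer
nested == piped
```

Each %>% passes the result on its left as the first argument of the function on its right, so round(1) here means round(result, 1).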
For example, let’s return to our plot and use pipes to filter the
dataset to only include neighborhoods with more than 250 dogs. Let’s
save that new dataset in an object called DogData250.
# Filter dataset to include neighborhoods > 250 dogs
DogData250 <- DogData %>%
  group_by(Neighborhood) %>%
  filter(n() > 250) %>%
  ungroup()
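The group_by() and filter(n() > ...) pattern is easier to follow on a dataset small enough to count by hand. The sketch below uses a made-up data frame (the rows are invented) and a threshold of 1 instead of 250.

```r
library(dplyr)

# Made-up data: Riverside has 3 dogs, The Port has 2, Baldwin has 1
toy_dogs <- data.frame(
  Dog_Name = c("Luna", "Max", "Bella", "Rex", "Milo", "Coco"),
  Neighborhood = c("Riverside", "Riverside", "Riverside",
                   "Baldwin", "The Port", "The Port")
)

# Within each neighborhood, n() is the number of rows in that group,
# so filter(n() > 1) keeps every row from groups with more than one dog
toy_filtered <- toy_dogs %>%
  group_by(Neighborhood) %>%
  filter(n() > 1) %>%
  ungroup()

# Baldwin is gone; the five rows from the other two neighborhoods remain
unique(toy_filtered$Neighborhood)
nrow(toy_filtered)
```

Note that the result still has one row per dog. The filter removes whole groups; it does not collapse them into counts.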
How do we know this new dataset includes only the neighborhoods with more
than 250 dogs? Let’s check! Let’s use pipes to arrange the neighborhoods
by number of dogs. Then, let’s use unique() to show us the
neighborhoods that are included in DogData250. Only the
neighborhoods with more than 250 dogs should be in
DogData250.
# Sort neighborhoods by number of dogs
table(DogData$Neighborhood) %>% as.data.frame() %>% arrange(desc(Freq))
## Var1 Freq
## 1 West Cambridge 606
## 2 North Cambridge 596
## 3 Neighborhood Nine 577
## 4 Cambridgeport 438
## 5 Mid-Cambridge 404
## 6 East Cambridge 397
## 7 Riverside 197
## 8 Wellington-Harrington 168
## 9 The Port 161
## 10 Baldwin 152
## 11 Strawberry Hill 124
## 12 Cambridge Highlands 91
## 13 Area 2/MIT 31
# Use unique() to see which neighborhoods are in DogData250
unique(DogData250$Neighborhood)
## [1] "North Cambridge" "Neighborhood Nine" "Mid-Cambridge"
## [4] "Cambridgeport" "West Cambridge" "East Cambridge"
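As an alternative to converting a table() into a data frame, dplyr's count() produces a sorted frequency table in one step. This is a sketch on a made-up data frame (the rows are invented), not a replacement for the check above.

```r
library(dplyr)

# Made-up data: Riverside has 3 dogs, The Port has 2, Baldwin has 1
toy_dogs <- data.frame(
  Dog_Name = c("Luna", "Max", "Bella", "Rex", "Milo", "Coco"),
  Neighborhood = c("Riverside", "Riverside", "Riverside",
                   "Baldwin", "The Port", "The Port")
)

# count() returns one row per neighborhood with the frequency in column n;
# sort = TRUE puts the largest groups first
toy_counts <- toy_dogs %>%
  count(Neighborhood, sort = TRUE)
toy_counts

# Keep only the neighborhoods above a threshold (1 here, 250 in the lesson)
toy_counts %>%
  filter(n > 1)
```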
So now we have a dataset with Cambridge neighborhoods with more than 250 dogs. Let’s return to our plot and make our barplot with this dataset.
# Make a barplot with this new dataset
ggplot(DogData250) +
  geom_bar(aes(x = Neighborhood)) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  labs(x = "Neighborhoods in Cambridge",
       y = "Number of Dogs",
       title = "Dog-Crazy Neighborhoods in Cambridge") +
  theme(plot.title = element_text(face = "bold"))
Great. The final step here is to introduce color. You can really geek
out over colors in ggplot2, but we’ll keep it simple for today. We will
use a library of color palettes that are color-blind-friendly. This
library of color palettes is called viridis, so we’ll need
to install this package.
# Install the viridis package (only needed once)
install.packages("viridis")
# Load the viridis package
library(viridis)
## Loading required package: viridisLite
# Make a barplot with this new dataset and color-blind-friendly colors
# (the bars are colored via the fill aesthetic, so we need a fill scale)
ggplot(DogData250) +
  geom_bar(aes(x = Neighborhood, fill = Neighborhood)) +
  scale_fill_viridis(discrete = TRUE) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  labs(x = "Neighborhoods in Cambridge",
       y = "Number of Dogs",
       title = "Dog-Crazy Neighborhoods in Cambridge") +
  theme(plot.title = element_text(face = "bold"))
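Because the bars are colored through the fill aesthetic, the palette has to be set with a fill scale; a color scale only affects outlines. ggplot2 also ships its own viridis scales, so a similar result is possible without loading the viridis package. The sketch below uses a made-up data frame (the rows are invented) and ggplot2's scale_fill_viridis_d(), where the _d suffix stands for discrete.

```r
library(ggplot2)

# A small, made-up data frame standing in for DogData250
toy_dogs <- data.frame(
  Neighborhood = c("Riverside", "Baldwin", "Riverside", "The Port")
)

# fill = Neighborhood gives each bar its own color;
# scale_fill_viridis_d() draws those colors from the viridis palette
p <- ggplot(toy_dogs) +
  geom_bar(aes(x = Neighborhood, fill = Neighborhood)) +
  scale_fill_viridis_d()
p
```

With the viridis package loaded, scale_fill_viridis(discrete = TRUE) plays the same role.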