knitr::opts_chunk$set(echo = TRUE)
When we measure things, we get lots of numbers. Sometimes so many that your head explodes just thinking about them. One of the most helpful things you can do to begin to make sense of these numbers is to look at them in graphical form.
In this lab, you will learn the following:
We will continue to use the data for NYC film permits (Film_Permits.csv) used in the previous lab instructions. We will also try to answer the earlier questions by graphing the data.
library(data.table)
nyc_films <-fread("Film_Permits.csv")
colnames(nyc_films)
## [1] "Event ID" "EventType" "StartDateTime"
## [4] "EndDateTime" "EnteredOn" "EventAgency"
## [7] "ParkingHeld" "Borough" "CommunityBoard(s)"
## [10] "PolicePrecinct(s)" "Category" "SubCategoryName"
## [13] "Country" "ZipCode(s)"
We will also use the data containing information from 2020-2022 police-reported motor vehicle collisions in NYC (Motor_Vehicle_Collisions_Crashes.csv). The data is available here: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95
Most of the film permits were requested for which of the five boroughs of NYC?
We can find out by plotting the data using a bar graph. But to plot the data, we first summarized it. In the previous lab, we learned to make frequency tables with the number of film permits made in each borough and then make different bars to represent the counts (the frequency of each borough).
Here is the code used in the previous lab (copied and pasted!):
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
counts <- nyc_films %>%
group_by(Borough) %>%
summarize(count_of_permits = length(Borough))
counts
## # A tibble: 5 × 2
## Borough count_of_permits
## <chr> <int>
## 1 Bronx 2280
## 2 Brooklyn 22087
## 3 Manhattan 34200
## 4 Queens 11484
## 5 Staten Island 938
We can also produce plots using a fantastic package called
ggplot2. It is very powerful once you get the hang of it,
and when you do, you can make all sorts of interesting graphs. Here’s
the code to make the bar graph.
library(ggplot2)
ggplot(counts, aes(x = Borough, y = count_of_permits )) +
geom_bar(stat="identity")
The bar graph shows that most film permits were requested for the borough of Manhattan.
In your Lab’s directory, there is a file called Motor_Vehicle_Collisions_Crashes.csv. Perform the following tasks:
Load the file into R, find out the columns’ names and read the top lines of the file.
Graph the number of crashes per borough
Enter your code below:
Let’s make the bar graph of the data regarding the variable
Category.
counts <- nyc_films %>%
group_by(Category) %>%
summarize(count_of_permits = length(Category))
ggplot(counts, aes(x = Category, y = count_of_permits )) +
geom_bar(stat="identity")
OK, this figure might look a bit weird because the labels on the bottom are running into each other. We’ll fix that in a bit. First, let’s notice the changes.
Borough was changed to Category. That
was the main thing.
Note that none of the library() commands are used
again, and I didn’t re-run the very early code to get the data. R
already has those things in its memory, so we don’t need to do that
again. If you ever clear the memory of R, you will need to reload those
things.
Fine, so how do we fix the graph? Good question. One way to find out is to search using Google.
Googling your questions is an excellent way to learn R. It’s what everybody does these days…[goes to Google…].
The trick is to add the last line to the code. I just copy-pasted it from the solution found on stack overflow (I hope you will become friends with stack overflow, there are many solutions there to all of your questions)
counts <- nyc_films %>%
group_by(Category) %>%
summarize(count_of_permits = length(Category))
ggplot(counts, aes(x = Category, y = count_of_permits )) +
geom_bar(stat="identity")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Before we go further, let’s check some basic properties of
ggplot2 to give you a sense of how it works. This will make
more sense later with other labs, so come back here to remind
yourself.
We’ll do just a bit of basic, then move on to making more graphs by copying and pasting.
The ggplot function uses layers. What are these layers? Well, it draws things from the bottom up. It lays down one layer of graphics; then, you can keep adding on top, drawing more things. So the idea is like: Layer 1 + Layer 2 + Layer 3, and so on. If you want Layer 3 to be Layer 2, then you switch them in the code.
Here is a way of thinking about ggplot code:
ggplot(name_of_data, aes(x = name_of_x_variable, y = name_of_y_variable)) +
geom_layer()+
geom_layer()+
geom_layer()
Let’s focus on the \(+\) signs in
the above description. What we are doing with the plus signs is adding
layers to the plot. The layers get added in the order that they are
written. If you look back to our previous code, you will see we added a
geom_bar layer; then, we added another layer to change the
rotation of the words on the x-axis. This is how it works.
BUT WAIT? How are we supposed to know what to add? This is nuts! We’ll give you lots of examples where you can copy and paste, and they will work. That’s how you’ll learn. If you want to read the help manual, you can do that too. It’s on the ggplot2 website. This will become useful after you already know what you are doing; before that, it will probably just seem confusing. However, it is pretty neat to look at and see all the different things you can do; it’s very powerful.
Let’s get the hang of adding things to the graph that let us change some stuff we might want to change. For example, how do you add a title? Or change the labels on the axes? Or add different colors, change the font size, or change the background? You can adjust these things by adding other lines to the existing code.
The last graph had count_of_permits as the label on the
y-axis. That doesn’t look right. ggplot2 automatically took the title
from the column and made it the name on the y-axis. We can change that
by adding ylab("what we want"). We do this by adding a
\(+\) to the last line, then adding
ylab().
ggplot(counts, aes(x = Category, y = count_of_permits )) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits")
Let’s slightly modify the x label too:
ggplot(counts, aes(x = Category, y = count_of_permits )) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of Film")
Let’s give our graph a title.
ggplot(counts, aes(x = Category, y = count_of_permits )) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film Permits in NYC by Category")
Let’s make the bars different colors. To do this, we add new code to
the inside of the aes() part:
ggplot(counts, aes(x = Category, y = count_of_permits, color=Category )) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category")
Let’s make the bars different colors. To do this, we add new code to
the inside of the aes() part… Notice that I’ve started
using new lines to make the code more readable.
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category")
If we want to move the legend to appear below the graph, we can add the following:
theme(legend.position="bottom")
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category") +
theme(legend.position="bottom")
Sometimes you don’t want a legend at all, to remove it, add the line:
theme(legend.position="none")
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category") +
theme(legend.position="none")
The rest is often just visual preference. For example, the graph
above has this grey grid behind the bars. Use
theme_classic() to remove the grid for a clean, classic
no-nonsense look.
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category") +
theme(legend.position="none") +
theme_classic()
Interesting, theme_classic() is misbehaving a little
bit. We have some of our layers out of order; let’s re-order. I just
moved theme_classic() to underneath the
geom_bar() line. Now everything gets drawn properly.
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category") +
theme(legend.position="none")
Changing font size is often something you want to do. ggplot2 can do
this in different ways. I suggest using the base_size
option inside theme_classic(). You set one number for the
largest font size in the graph, and everything else gets scaled to fit
that number. It’s convenient. Look for the inside of
theme_classic().
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme_classic(base_size = 15) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category") +
theme(legend.position="none")
Or make things small… to see what happens.
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme_classic(base_size = 10) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category") +
theme(legend.position="none")
Plot bar graph for the data set Motor_Vehicle_Collisions_Crashes.csv again. But this time, your graph should include a white background, colored bars, properly labeled horizontal and vertical axes, a title, and no legend.
Enter your code below:
That’s enough of the ggplot2 basics for now. You will discover that many things are possible with ggplot2. It is amazing. We will get back to answering some questions about the data with graphs. But now that we have built the code to make the graphs, we need to copy-paste and make a few minor changes, and boom, we have our chart.
Let’s graph the data for SubCategoryName in the
nyc_films.
# get the counts (this is a comment. It's just here for you to read)
counts <- nyc_films %>%
group_by(SubCategoryName) %>%
summarize(count_of_permits = length(SubCategoryName))
# make the plot
ggplot(counts, aes(x = SubCategoryName, y = count_of_permits,
color=SubCategoryName,
fill= SubCategoryName )) +
geom_bar(stat="identity") +
theme_classic(base_size = 10) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Sub-category of film") +
ggtitle("Number of Film permits in NYC by Sub-category") +
theme(legend.position="none")
It shows that “episodic series” is the most common. Using a graph like this gave us our answer super fast.
Let’s see one more thing about ggplot2. It’s called
facet_wrap().
We know that some films are made in different boroughs and the same films are made in different categories, but do different boroughs have different patterns for the kinds of categories of films they request permits for? Are there more TV shows in Brooklyn? How do we find out? Watch just like this:
# get the counts (This is a comment. It's just here for you to read)
counts <- nyc_films %>%
group_by(Borough,Category) %>%
summarize(count_of_permits = length(Category))
## `summarise()` has grouped output by 'Borough'. You can override using the
## `.groups` argument.
# make the plot
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme_classic(base_size = 10) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in NYC by Category and Borough") +
theme(legend.position="none") +
facet_wrap(~Borough, ncol=3)
We did two critical things. First, we added Borough and
Category into the group_by() function. This
automatically gives separate counts for each film category for each
Borough. Then we added facet_wrap(~Borough, ncol=3) to the
end of the plot, and it automatically drew us 5 different bar graphs,
one for each Borough! That was fast. Imagine doing that by hand.
Plot the number of crashes by the variable NUMBER OF PERSONS INJURED and separated by boroughs
Hint: Remember that R doesn’t accept spaces in the name of the variables.
Enter your code below:
The library dplyr allows us to filter data by a single
borough (see Lab 3). Let’s use the above example about plotting
categories, but we only plan the data for Manhattan this time.
# get the counts of categories for Manhattan (this is a comment. It's just here for you to read)
counts <- nyc_films %>%
group_by(Category) %>%
filter(Borough=='Manhattan')%>%
summarize(count_of_permits = length(Category))
# make the plot
ggplot(counts, aes(x = Category, y = count_of_permits,
color=Category,
fill= Category )) +
geom_bar(stat="identity") +
theme_classic(base_size = 10) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab("Number of Film Permits") +
xlab("Category of film") +
ggtitle("Number of Film permits in Manhattan by Category") +
theme(legend.position="none")
Note that this graph is the same as obtained for Manhattan as in the previous graph with facet wrap.
Plot the NUMBER OF PERSONS INJURED for Manhattan with legend.
Enter your code below:
The material used in this document contains excerpts and modifications from: