This code through explores how you can produce data visualizations using ggplot2. The ggplot2 package is a frequent choice among R users as an ‘elegant’ tool for creating data analysis graphics. According to its website, it was developed by Hadley Wickham, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and RStudio. Exploratory data visualization is often described as one of the greatest strengths of R.
Specifically, we’ll explain and demonstrate how to produce charts and tables with a time series data set. We produce several graphs, all of them bar charts, with ggplot2 and use data transformation functions and other additions to shape them. The data we use comes from the GitHub repository Gun Violence Data by the user jamesqo, and originally from the Gun Violence Archive’s website. The GVA is a “not for profit corporation formed in 2013 to provide free online public access to accurate information about gun-related violence in the United States” according to its website.
This topic is valuable because data visualization tools are highly effective at communicating trends in data. We also will be using data regarding a highly relevant topic, which is gun violence in the United States. Conclusions regarding this data set might be used to better inform gun policy in the United States.
Specifically, you’ll learn how to isolate variables of interest and communicate their trends. You will also see how a data set may be used for different purposes. In this case, our data set includes data points on gun violence perpetrators, victims, and the circumstances of the crime.
Here, we’ll show how to view a summary of a large data set (239,677 observations in total) and produce various data visualizations using ggplot2.
This is based on the work of ggplot2, which was developed by Hadley Wickham, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and RStudio.
A basic example shows how we can communicate trends in data through a table (which will be called ‘Yearly’) and a bar chart (‘chart’).
See the table below.
# Table of total shootings by year
Yearly <- data %>%
group_by(year) %>%
summarize(Shootings = n())
head(Yearly) # We use head() to view the table
I should note that the collection of this data shows abnormalities in the years 2013 and 2018. There were not actually fewer shootings in those years, but some incidents may have NA values for dates and thus go uncounted.
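
One way to check whether missing dates play a role is to tally rows per year with the NA group left in. The sketch below is only an illustration: `toy` is a made-up stand-in for `data`, its values are invented, and only the column name `year` is taken from the code above.

```r
library(dplyr)

# Hypothetical stand-in for `data`: one row per incident, some with a
# missing year (values invented for illustration)
toy <- data.frame(year = c(2013, 2014, 2014, NA, 2015, NA, 2015, 2015))

# count() keeps NA as its own group, so missing years stay visible
year_counts <- toy %>% count(year, name = "Shootings")
year_counts

# Number of rows that cannot be assigned to any year
n_missing <- sum(is.na(toy$year))
n_missing
```

If `n_missing` is large relative to the yearly totals, the NA explanation becomes more plausible; if it is small, the low counts probably have another cause.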
Now suppose we want to make viewing easier. We might turn to a chart or graph instead. In this case, we will use the ggplot2 function ggplot(), which needs the data frame and the x and y variables. We include fill= so that the chart has a legend. Then we tell R which type of graph to draw (a bar chart with given y-values) by adding + geom_col(). We also add labels for the title and axes.
Note that we have an obvious outlier in the year 2013, where the number of shootings is 278. If we want a better view of the upper region of the graph, or find another reason to discard the 2013 observations, we can drop that year from the graph.
# Chart of shootings by year
chart <- ggplot(na.omit(Yearly), # Note that we omit the NAs of the df
aes(fill=Legend, x=year, y=Shootings))
chart + geom_col(aes(fill="Shootings")) +
labs(title="Shootings by Year, 2013-2018", x ="Year", y = "Total Shootings")
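
If we did decide to drop 2013, one option is to filter it out before plotting. This is a minimal sketch rather than the code behind the charts in this code through: `Yearly_demo` is a hypothetical stand-in built from the yearly totals reported here, and the legend set-up is left out for brevity.

```r
library(dplyr)
library(ggplot2)

# Yearly totals as reported in this code through
Yearly_demo <- data.frame(
  year = 2013:2018,
  Shootings = c(278, 51854, 53579, 58763, 61401, 13802)
)

# Keep every year except the 2013 outlier
Yearly_no2013 <- Yearly_demo %>% filter(year != 2013)

chart_no2013 <- ggplot(Yearly_no2013, aes(x = year, y = Shootings)) +
  geom_col() +
  labs(title = "Shootings by Year, 2014-2018",
       x = "Year", y = "Total Shootings")
```

Printing `chart_no2013` draws the bar chart with five bars instead of six.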
Lastly, we might want to include some text on our graph with the raw numbers, as 2013 has a very low number of observations. Below is how you’d annotate the bar chart. The ‘x’ and ‘y’ denote the position on the respective axes. I’ve already input the values for our labels. You might want to customize the position of each label; it’s up to you how to proceed.
# Chart with y-value labels
chart + geom_col(aes(fill="Shootings")) +
labs(title="Shootings by Year, 2013-2018",
x ="Year",
y = "Total Shootings") +
annotate("text", x=2013:2018, y=10000,
label = c("278", "51854","53579","58763","61401", "13802"))
Next, I will put these values on top of each bar. Note that the y-values follow no obvious pattern, which means I have to pick and specify a y-value for each year’s label.
We might customize it a little more, re-positioning our labels, adjusting our axes, and adding a theme.
chart + geom_col(aes(fill="Shootings")) +
labs(title="Shootings by Year, 2013-2018",
x ="Year",
y = "Total Shootings") +
annotate("text",
x=2013:2018,
y=c(3000, 54000, 56000,61000, 64000, 16000),
label = c("278","51,854","53,579","58,763","61,401", "13,802")) +
scale_x_continuous(n.breaks = 6) +
scale_y_continuous(n.breaks = 15) +
theme_minimal()
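
Picking each label position by hand works, but the positions can also come from the data. The sketch below swaps annotate() for geom_text(), which is a different technique from the one used above; `Yearly_demo` and its `label` column are hypothetical names introduced for this example.

```r
library(ggplot2)

# Yearly totals as reported in this code through
Yearly_demo <- data.frame(
  year = 2013:2018,
  Shootings = c(278, 51854, 53579, 58763, 61401, 13802)
)

# Build the labels from the data, with thousands separators
Yearly_demo$label <- format(Yearly_demo$Shootings, big.mark = ",", trim = TRUE)

chart_labeled <- ggplot(Yearly_demo, aes(x = year, y = Shootings)) +
  geom_col() +
  # vjust = -0.5 places each label just above the top of its bar
  geom_text(aes(label = label), vjust = -0.5) +
  scale_x_continuous(n.breaks = 6) +
  theme_minimal()
```

Because each label inherits its bar's x and y values, the labels stay in place if the underlying counts change.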
## Advanced Examples
More specifically, ggplot2 can be built upon to create more complex graphs. You may have noticed that we can add various elements to our graph with a simple ‘+’, whether text annotations, position specifications, or chart elements. The sky is the limit with ggplot2.
To demonstrate this, we will take the data set a bit further and create a stacked bar chart of the total number of people hurt by gun violence. This stacked bar chart shows not only the total but also the number of casualties and the split between injuries and casualties.
We start out by creating two new data frames (‘Casualties’ and ‘Injured’) that aggregate n_killed and n_injured by year. Notice that we are removing the NAs (na.rm = TRUE) so we can get a number from our summation function (FUN=sum). If the NAs are not removed, the sum will come back NA.
Next we merge the two data frames. I have also created a new variable that adds the number injured and the number killed to get the total number of people hurt by gun violence. This becomes our y-axis value.
# Aggregates
Casualties <- aggregate(data$n_killed, # Aggregated casualties
by=list(year=data$year),
FUN=sum,
na.rm = TRUE)
Injured <- aggregate(data$n_injured, # Aggregated injured
by=list(year=data$year),
FUN=sum,
na.rm = TRUE)
Casualties <- rename(Casualties, # Renaming
casualties = x)
Injured <- rename(Injured,
injured = x)
Victims <- merge(Casualties, Injured) # Merging aggregates
# Creating a variable that sums the year's casualties and injuries
Victims$total <- rowSums(Victims[,c("casualties", "injured")],
                          na.rm=TRUE)
Now we take the merged casualty and injury data frame ‘Victims’ and use it for our plot. We name this plot ‘chart1’, which uses the ggplot2 function geom_col. Notice that I plugged our total and casualty values into the geom_col set-up rather than injuries and casualties; simply plotting injuries distorted the graph. I could tell because, as we’ll see, in any year between 2013 and 2018 roughly 1/3 of gun violence victims are killed. Drawing casualties over the total displays the data correctly and communicates that between 2013 and 2018, most victims of gun violence were injured (and presumably survived).
# Creating a bar chart of total hurt by gun violence, 2013-2018
chart1 <- ggplot(Victims, aes(fill=Legend, y=total, x=year)) +
geom_col(aes(y=total, fill="Injuries")) +
geom_col(aes(y=casualties, fill="Casualties"), # Specifying casualties
position="stack") +
labs(title=
"Shooting Injuries and Casualties by Year, 2013-2018",
x ="Year", y = "Total Hurt") +
scale_x_continuous(n.breaks = 6) + # Axis adjustments
scale_y_continuous(n.breaks = 13) +
  theme_minimal() # Theme
What’s more, we can calculate relevant statistics and then use the same methods to communicate a point more effectively.
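
As an aside on the stacked chart above: it draws the casualties bar over the total bar. A different technique, not the one used above, is to reshape the data to long format and let position = "stack" place the pieces on top of each other. The sketch below uses invented numbers; `Victims_demo`, `Victims_long`, and `count` are hypothetical names, and tidyr is an extra package this code through does not otherwise load.

```r
library(ggplot2)
library(tidyr)

# Invented yearly totals, for illustration only
Victims_demo <- data.frame(
  year = 2014:2016,
  casualties = c(10, 12, 15),
  injured = c(20, 25, 30)
)

# One row per year and outcome
Victims_long <- pivot_longer(
  Victims_demo,
  cols = c(casualties, injured),
  names_to = "Legend",
  values_to = "count"
)

chart_stacked <- ggplot(Victims_long, aes(x = year, y = count, fill = Legend)) +
  # Bars that share a year are stacked, so each bar's height is the total
  geom_col(position = "stack") +
  labs(x = "Year", y = "Total Hurt") +
  theme_minimal()
```

With this layout the injured and casualty counts are plotted directly, and the total is implied by the bar height.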
Let’s say that we want to know the death rate. A percentage tells us the share of victims who died.
# Creating a ratio of casualties/(total hurt) to see the death rate
Victims$percent <- Victims$casualties/Victims$total * 100
chart2 <- ggplot(Victims, # Plot
                 aes(fill=Legend,
                     x=year,
                     y=percent))
# na.rm and show.legend are geom_col() arguments; ggplot() ignores them
chart2 + geom_col(aes(y=percent, fill="Death Rate (%)"),
                  na.rm=TRUE, # Drop rows with missing values silently
                  show.legend=TRUE) +
labs(title=
"Death Rate of Gun Violence, 2013-2018",
x ="Year",
y = "Death Rate (%)")
We can also view the raw numbers of death rate by year in a table and add those percentage values to chart2.
Victims %>% # Summarized table
group_by(year) %>%
  summarize(percent)
Now we can add these percentage values to our bar chart, along with some other adjustments: axis scaling and a theme change.
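
The label text itself can be generated rather than typed by hand. Here is a minimal sketch using base R's sprintf(); `rates_demo` and its counts are invented for illustration and are not taken from the data set.

```r
# Invented counts, for illustration only
rates_demo <- data.frame(
  year = 2014:2016,
  casualties = c(10, 12, 15),
  total = c(30, 37, 45)
)
rates_demo$percent <- rates_demo$casualties / rates_demo$total * 100

# "%.1f" keeps one decimal place; "%%" prints a literal percent sign
rates_demo$label <- sprintf("%.1f%%", rates_demo$percent)
rates_demo$label
```

A vector like `rates_demo$label` can then be passed as the label argument of annotate(), which avoids mismatches between the plotted values and the printed text.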
# Chart2 with adjustments and percentages
chart2 + geom_col(aes(y=percent, fill="% Killed")) +
labs(title=
"Death Rate of Gun Violence Victims, 2013-2018",
x ="Year",
y = "Death Rate (%)") +
scale_x_continuous(n.breaks = 6) +
scale_y_continuous(n.breaks = 10) +
annotate("text",
x = 2013:2018,
y = c(23, 34, 32, 31.5, 32, 35),
label = c("24.5%", "35.3%", "33.3%", "33%", "33.6%", "36.4%")) +
  theme_minimal()
So within our sample, we find that:
Of shootings in 2013, 24.6% of gun violence victims were casualties. This is the smallest death rate of all the years in our sample.
Most recently, in 2018, 36.4% of gun violence victims were casualties. This is the largest death rate of all the years in our sample, followed closely by 2014, which was 35.3%.
The death rate has not steadily increased over 2013-2018 as we might think. I have no particular reason to believe it would, but I would have suspected that the number of shootings has increased over time, which our first chart does not clearly show. I believe that this could be partially due to the unbalanced number of observations in our sample (particularly for years 2013 and 2018) as well as the NA values that were omitted.
Our bar chart shows a non-linear trend. The death rate increased until 2014, decreased from 2014 to 2016, and then increased again.
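
The direction of the trend can be checked with a quick year-over-year difference. The sketch below uses the rounded death rates printed on the chart labels above; `percent`, `change`, and `rose` are names introduced only for this example.

```r
# Rounded death rates as printed on the chart labels above
percent <- c(`2013` = 24.5, `2014` = 35.3, `2015` = 33.3,
             `2016` = 33.0, `2017` = 33.6, `2018` = 36.4)

# Change from each year to the next, in percentage points
change <- diff(percent)
change

# TRUE where the rate rose compared with the previous year
rose <- change > 0
rose
```

The result is TRUE for 2014, FALSE for 2015 and 2016, and TRUE for 2017 and 2018, which matches the up, down, up pattern described above.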
Most notably, ggplot2 is valuable for identifying trends, which we will show through grouping and ranking. Noticing trends is especially helpful in policy, as we will be identifying which states’ civilians are at the highest risk and which states are most ‘responsible’ for our country’s deaths caused by gun violence. Obviously we cannot conclude these things from this data set or research alone, but it’s a start.
We will now create yearly data frames grouped by state, and then merge those.
Then we take aggregates of those killed by year and state to produce an aggregate for total casualties per year.
We then take each state’s number of casualties and compare it to the total to see which states account for the most deaths.
# Filtering by state for yearly data frames
state2013 <- data %>%
group_by(state) %>%
filter(year==2013)%>%
summarise(y2013=n())
state2014 <- data %>%
group_by(state) %>%
filter(year==2014)%>%
summarise(y2014=n())
state2015 <- data %>%
group_by(state) %>%
filter(year==2015)%>%
summarise(y2015=n())
state2016 <- data %>%
group_by(state) %>%
filter(year==2016)%>%
summarise(y2016=n())
state2017 <- data %>%
group_by(state) %>%
filter(year==2017)%>%
summarise(y2017=n())
state2018 <- data %>%
group_by(state) %>%
filter(year==2018)%>%
summarise(y2018=n())
# Now merging these together
by_state <- merge(state2013, # Merging
state2014,
by="state",
all.y=TRUE)
by_state <- merge(by_state,
state2015,
by="state",
all.y=TRUE)
by_state <- merge(by_state,
state2016,
by="state",
all.y=TRUE)
by_state <- merge(by_state,
state2017,
by="state",
all.y=TRUE)
by_state <- merge(by_state,
state2018,
by="state",
all.y=TRUE)
# Collecting aggregates for each state's yearly value
state <- aggregate(data$n_killed,
by=list(year=data$year,
state=data$state),
FUN=sum,
na.rm = TRUE)
state <- rename(state, casualties=x) # Rename variable
# National year total across all states
yeartotal <- aggregate(state$casualties,
by=list(year=state$year),
FUN=sum,
na.rm = TRUE)
yeartotal <- rename(yeartotal, yrtotal=x) # Rename variable
yr_state <- merge(state, yeartotal, # Merging again
by="year", all.y=TRUE)
# Now finding each state's portion of casualties per the total
yr_state$percent <- (yr_state$casualties/yr_state$yrtotal) * 100 # % variable
Now we can generate lists and produce visualizations of this share per state. We will explore the states with the largest share of national deaths in 2017 and 2018, narrowing this down to the top 10 states.
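
The share calculation above can also be sketched as a single grouped mutate(), which skips the separate yearly-total data frame and the merge. This is an alternative illustration, not the code used here; `state_demo` and `share_demo` are hypothetical and the counts are invented.

```r
library(dplyr)

# Invented state-by-year death counts, for illustration only
state_demo <- data.frame(
  year = c(2017, 2017, 2017, 2018, 2018),
  state = c("A", "B", "C", "A", "B"),
  casualties = c(50, 30, 20, 60, 40)
)

# Within each year, divide each state's count by that year's total
share_demo <- state_demo %>%
  group_by(year) %>%
  mutate(yrtotal = sum(casualties),
         percent = casualties / yrtotal * 100) %>%
  ungroup()
share_demo
```

Within each year the `percent` column sums to 100, which is a handy sanity check on the real data as well.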
We create a yearly subset and then order it highest to lowest. Then we slice it to the first 10 observations.
# States ranked, year 2018.
yr2018 <- subset (yr_state, # Subset for 2018
year == 2018)
yr2018_2 <- yr2018[order(yr2018$percent, # Order by highest
                         decreasing = TRUE), ]
yr2018_2 <- yr2018_2 %>% # Limit to top 10
  slice(1:10)
Now onto producing a bar chart and seeing the states. It turns out that ggplot2 sorts character values on an axis alphabetically, regardless of the row order we set earlier, so we have to specify the order of the states on the x-axis. I looked at the data frame and copied the order from there, as the 10 observations in the yr2018_2 data frame are organized highest to lowest.
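
The axis order can also be derived from the data instead of being copied by hand. The sketch below is one hedged alternative to the approach used in this code through: it uses slice_max() to take the top rows and reorder() to sort the axis by value. `top_demo`, `top3`, and the shares are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Invented shares, for illustration only
top_demo <- data.frame(
  state = c("Ohio", "Texas", "Florida", "Alabama"),
  percent = c(4, 9, 7, 3)
)

# slice_max() keeps the n rows with the largest percent, highest first
top3 <- top_demo %>% slice_max(percent, n = 3)

# reorder(state, -percent) orders the axis by descending percent
# instead of alphabetically
chart_ranked <- ggplot(top3, aes(x = reorder(state, -percent), y = percent)) +
  geom_col(aes(fill = state)) +
  labs(x = "State", y = "Percent of National Casualties (%)")
```

With this approach the same code works for any year without retyping the state names.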
# ggplot2 will alphabetically sort characters on an axis
ggplot(yr2018_2, aes(x=state,
percent)) + # Input x-axis to avoid ggplot2 alphabetical sorting
scale_x_discrete(limits = c("Texas",
"California",
"Florida",
"Illinois",
"Ohio",
"Pennsylvania",
"Alabama",
"Louisiana",
"Missouri",
"Georgia")) +
  geom_col(aes(fill = state)) + # geom_col() already uses stat = "identity"
  labs(title= "States with the highest portion of US gun violence casualties in 2018",
       x="State",
       y="Percent of National Casualties (%)") +
  theme(axis.text.x =
          element_text(angle = 35,
                       vjust=0.75))
It is worth noting that we can produce multiple graphs or images and then combine them through other means, such as a GIF made with gganimate, to see the change over time in states’ share of gun violence casualties. I performed the same commands as above with the year 2017 to see what the ranking looked like.
# Same approach but year 2017 instead.
yr2017 <- subset(yr_state, year == 2017) # Subset for 2017
yr2017_2 <- yr2017[order # Order by highest
(yr2017$percent,
decreasing = TRUE), ]
yr2017_2<- yr2017_2 %>% # Slice to top 10
slice(1:10)
ggplot(yr2017_2, aes(x = state,
percent)) + # Input x-axis to avoid ggplot2 alphabetical sorting
scale_x_discrete(limits = c("California",
"Texas",
"Illinois",
"Florida",
"Ohio",
"Georgia",
"Pennsylvania",
"Missouri",
"North Carolina",
"Louisiana")) +
  geom_col(aes(fill = state)) + # geom_col() already uses stat = "identity"
  labs(title=
         "States with the highest portion of US gun violence casualties in 2017",
       x="State",
       y="Percent of National Casualties (%)") +
  theme(axis.text.x =
          element_text(angle = 35,
                       vjust=0.70))
So in conclusion, we not only gain familiarity with ggplot2 but also obtain some interesting insights about gun violence in the US.
Learn more about ggplot2 with the following:
Resource I ggplot2 - Tidyverse
Resource II CRAN - Package ggplot2
Resource III ggplot2 package - RDocumentation
This code through references and cites the following sources:
Hadley Wickham (2016). Source I. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Hadley Wickham (2016). Source II. Function reference • ggplot2
Rafael A. Irizarry (2019). Source III. Chapter 8 ggplot2 | Introduction to Data Science - rafalab
Gun Violence Archive (2022). Source IV. About | Gun Violence Archive