We begin by exploring the Washington Post’s data on fatal police shootings in the US during 2015 and 2016. The data was downloaded from https://github.com/washingtonpost/data-police-shootings, and further details about it are available at the Post’s site.
First, we read the data into R and examine its structure.
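# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0.0), as in the output below
shootings <- read.csv("fatal-police-shootings-data.csv", stringsAsFactors = TRUE)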
str(shootings)
## 'data.frame': 1692 obs. of 14 variables:
## $ id : int 3 4 5 8 9 11 13 15 16 17 ...
## $ name : Factor w/ 1682 levels " Austin Wilburly Reid",..: 1558 1029 826 1113 1165 960 954 179 124 1025 ...
## $ date : Factor w/ 588 levels "2015-01-02","2015-01-03",..: 1 1 2 3 3 3 4 5 5 5 ...
## $ manner_of_death : Factor w/ 3 levels "beaten","shot",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ armed : Factor w/ 59 levels "","ax","baseball bat",..: 22 22 56 55 39 22 22 22 56 55 ...
## $ age : int 53 47 23 32 39 18 22 35 34 47 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 1 2 ...
## $ race : Factor w/ 7 levels "","A","B","H",..: 2 7 4 7 4 7 4 7 7 3 ...
## $ city : Factor w/ 1037 levels "Abingdon","Acworth",..: 849 15 1011 818 297 376 164 38 129 470 ...
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 48 38 17 5 6 37 4 17 13 39 ...
## $ signs_of_mental_illness: Factor w/ 2 levels "False","True": 2 1 1 2 1 1 1 1 1 1 ...
## $ threat_level : Factor w/ 3 levels "attack","other",..: 1 1 2 1 1 1 1 1 2 1 ...
## $ flee : Factor w/ 5 levels "","Car","Foot",..: 4 4 4 4 4 4 2 4 4 4 ...
## $ body_camera : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 2 1 ...
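The Factor columns in this output depend on how read.csv treats text. Since R 4.0.0 the default is stringsAsFactors = FALSE, which reads such columns as character vectors instead. The following is a minimal sketch of the difference, using a few made-up rows (not the Post's data) passed through the text argument in place of a file:

```r
# Made-up rows standing in for the CSV file; these are not the Post's data.
csv_text <- "id,armed,race
1,gun,W
2,knife,B
3,gun,"

# Same input, read two ways
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$race)     # character vector
class(as_factor$race)   # factor
levels(as_factor$race)  # the empty string becomes a level of its own
```

The empty string appearing as a level is the same thing we see in the race, armed, and flee variables above.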
We may wish to separate the data into 2015 and 2016 data. The dates are given in the format YYYY-MM-DD. We start by creating a new variable, year, which holds just the first four characters of the date. Additionally, we create a month variable which holds only the 6th and 7th characters of the date variable.
shootings$year <- as.numeric(substr(shootings$date, 1, 4))
shootings$month <- as.numeric(substr(shootings$date, 6, 7))
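The substr approach works because every date has the same fixed-width layout. As a cross-check, the same values can be obtained by parsing the strings as dates. This is a sketch on made-up dates, not a replacement for the code above:

```r
# Made-up dates in the same YYYY-MM-DD format as the date variable.
dates <- factor(c("2015-01-02", "2015-11-30", "2016-03-15"))

# substr() approach, as in the text: pick characters by position
year_substr  <- as.numeric(substr(dates, 1, 4))
month_substr <- as.numeric(substr(dates, 6, 7))

# Alternative: parse as Date objects, then format the pieces we want
parsed     <- as.Date(as.character(dates), format = "%Y-%m-%d")
year_date  <- as.numeric(format(parsed, "%Y"))
month_date <- as.numeric(format(parsed, "%m"))

# Both approaches should agree
identical(year_substr, year_date)
identical(month_substr, month_date)
```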
Next, we notice that the values (or levels) of the race variable are currently “” “A” “B” “H” “N” “O” and “W,” which is not particularly informative. By examining the Post’s website, we can infer that these codes stand for “Unknown,” “Asian,” “Black,” “Hispanic,” “Native American,” “Other,” and “White.” Let’s change the names in our dataset so that automatically created labels will be more informative.
levels(shootings$race) <- c("Unknown","Asian","Black","Hispanic","Native American","Other","White")
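Assigning to levels() is positional: the first new name replaces the first existing level, and so on, so the new names must be listed in exactly the order that levels() reports. A sketch on made-up values shows this next to an alternative that spells out the code-to-label pairing (the vectors codes and labels are names introduced here for illustration):

```r
# Made-up values using the same one-letter codes as the race variable.
race <- factor(c("W", "B", "", "H", "W", "A", "N", "O"))
levels(race)  # shows the order in which new names must be supplied

codes  <- c("", "A", "B", "H", "N", "O", "W")
labels <- c("Unknown", "Asian", "Black", "Hispanic",
            "Native American", "Other", "White")

# Positional renaming, as in the text
race_pos <- race
levels(race_pos) <- labels

# Explicit pairing of each code with its label
race_named <- factor(race, levels = codes, labels = labels)

# Both give the same result when the order matches
identical(as.character(race_pos), as.character(race_named))
```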
Now that we have cleaned up the data a little, let’s subset out the 2015 data and attach it to our workspace.
shootings.2015 <- subset(shootings,year==2015)
attach(shootings.2015)
Now, if we type race on the command line, we will see the contents of the variable race for the dataset shootings.2015.
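It helps to know that attach places a copy of the data frame on R's search path, so later changes to the data frame are not seen through the attached names until it is attached again. The sketch below uses a made-up data frame (toy is a name introduced here) and also shows with(), which gives the same kind of access for a single expression:

```r
# Made-up data frame standing in for shootings.2015.
toy <- data.frame(race = c("White", "Black", "White"),
                  age  = c(40, 25, 31))

attach(toy)
race                    # found through the search path
toy$age <- toy$age + 1  # modify the data frame after attaching
attached_age <- age     # still the old values: attach() used a copy
detach(toy)

# with() evaluates one expression inside the current data frame
mean_age <- with(toy, mean(age))
```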
Let’s take a look at the variable armed which indicates what weapon (if any) was held by the subject at the time of the shooting. This is a categorical variable, so let’s start with a barplot.
barplot(table(armed),horiz=T,las=2,cex.names=.5)
There are a lot of categories here, and it is hard to get much information from our graph. Many of the categories have a very small number of observations (only one each for flagpole and stapler), and some (e.g., spear, pole, pitchfork, rock) have zero observations. These levels are included because the 2016 data does contain observations of these types of weapons, but in the 2015 data no such observations exist. We can get rid of levels with zero observations using the droplevels command. Afterward, we need to reattach the modified data and re-create our plot.
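Before applying this to the shootings data, here is a small sketch of the behavior on a made-up factor: subsetting keeps every original level, even those with no remaining observations, and droplevels removes them.

```r
# Made-up factor; "spear" plays the role of a weapon seen only in 2016.
armed_toy <- factor(c("gun", "knife", "gun", "spear"))

# Subsetting removes the observation but keeps the level
armed_sub <- armed_toy[armed_toy != "spear"]
table(armed_sub)              # spear is still listed, with a count of 0

# droplevels() discards levels that have no observations
armed_clean <- droplevels(armed_sub)
table(armed_clean)            # only gun and knife remain
```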
shootings.2015 <- droplevels(shootings.2015)
attach(shootings.2015)
## The following objects are masked from shootings.2015 (pos = 3):
##
## age, armed, body_camera, city, date, flee, gender, id,
## manner_of_death, month, name, race, signs_of_mental_illness,
## state, threat_level, year
barplot(table(armed),horiz=T,las=2,cex.names=.5)
This helps some, but not very much. Let’s restrict our attention to the top 5 categories. To do this, we first need to determine which are the top 5 categories. The obvious way to do this is to create a table of the armed variable and then sort that table. Then applying the names command will give us a list of the weapon types sorted from most frequent to least frequent. We can then select and save the first 5 items in this list.
armedTable <- table(armed)
sortedArmedTable <- sort(armedTable,decreasing=T)
names(sortedArmedTable)
## [1] "gun" "knife"
## [3] "unarmed" "vehicle"
## [5] "toy weapon" "undetermined"
## [7] "machete" "unknown weapon"
## [9] "box cutter" "sword"
## [11] "hammer" "metal pipe"
## [13] "guns and explosives" "Taser"
## [15] "blunt object" "crossbow"
## [17] "hatchet" "metal stick"
## [19] "screwdriver" ""
## [21] "ax" "baseball bat"
## [23] "baseball bat and fireplace poker" "bean-bag gun"
## [25] "beer bottle" "brick"
## [27] "carjack" "chain"
## [29] "contractor's level" "cordless drill"
## [31] "flagpole" "gun and knife"
## [33] "lawn mower blade" "meat cleaver"
## [35] "metal hand tool" "metal object"
## [37] "metal pole" "nail gun"
## [39] "sharp object" "shovel"
## [41] "stapler" "straight edge razor"
commonArms <- names(sortedArmedTable)[1:5]
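The same sort-then-select pattern can be tried on a small made-up vector. The sketch below also shows head() as an equivalent way to take the first few entries. Note that when counts are tied, which category lands in the final slot depends on how sort orders the ties.

```r
# Made-up weapon values with distinct counts, so the ordering is unambiguous.
weapons <- c(rep("gun", 4), rep("knife", 3), rep("unarmed", 2), "machete")

# Count each category, then sort from most to least frequent
counts <- sort(table(weapons), decreasing = TRUE)

# Two equivalent ways to keep the names of the top 3 categories
top3     <- names(counts)[1:3]
top3_alt <- names(head(counts, 3))
```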
Now that we have selected the 5 most common categories in the armed variable, we can subset our data to include only observations whose value for the armed variable is contained in our list of commonArms. Afterward, we will again use the droplevels command to remove empty categories, attach the new data, and recreate our barplot.
restrictedData <- subset(shootings.2015,armed %in% commonArms)
restrictedData <- droplevels(restrictedData)
attach(restrictedData)
## The following objects are masked from shootings.2015 (pos = 3):
##
## age, armed, body_camera, city, date, flee, gender, id,
## manner_of_death, month, name, race, signs_of_mental_illness,
## state, threat_level, year
## The following objects are masked from shootings.2015 (pos = 4):
##
## age, armed, body_camera, city, date, flee, gender, id,
## manner_of_death, month, name, race, signs_of_mental_illness,
## state, threat_level, year
barplot(sort(table(armed)),horiz=T,las=2,main="Weapon Held by Civilians in Fatal Police Shootings in 2015\n (only 5 most common weapons shown)",col="lightblue",cex.names=.5)
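The subsetting step above relies on the %in% operator, which returns TRUE for each value found in a given set. A sketch on a made-up data frame (toy and common are names introduced here for illustration):

```r
# Made-up data frame standing in for shootings.2015.
toy <- data.frame(
  armed = factor(c("gun", "knife", "spear", "gun", "stapler")),
  age   = c(30, 41, 25, 52, 38)
)
common <- c("gun", "knife")

# One logical value per row: is this weapon in the common set?
keep <- toy$armed %in% common

# Keep matching rows, then drop the now-empty levels
kept <- droplevels(subset(toy, armed %in% common))
levels(kept$armed)
```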
We may wish to further break this down by race. Let’s subset our data according to the race of the civilian who was shot and killed.
barplot(table(race),las=2)
A quick barplot of race confirms what you probably already knew: the three most common racial groups in this data are white, Hispanic, and black. So let’s focus on those 3 groups and create a dataset for each of them. Then we can create a table which shows the frequency of each type of weapon for each of these groups.
blackVic <- subset(restrictedData,race=="Black")
whiteVic <- subset(restrictedData,race=="White")
hispVic <- subset(restrictedData,race=="Hispanic")
armedByRace <- rbind(table(blackVic$armed),table(whiteVic$armed),table(hispVic$armed))
barplot(armedByRace,col=c("Blue","Red","Green"),las=2,horiz=F,main="Weapon Held by Civilians in Fatal Police Shootings in 2015\n (only 5 most common weapons shown)",cex.names=.7)
legend("topright",legend=c("Black Civilian","White Civilian","Hispanic Civilian"),col=c("Blue","Red","Green"),pch=15)
Perhaps we would prefer to see this plot as a side-by-side barplot:
barplot(armedByRace,col=c("Blue","Red","Green"),las=2,horiz=F,main="Weapon Held by Civilians in Fatal Police Shootings in 2015\n (only 5 most common weapons shown)",cex.names=.7,beside=T)
legend("topright",legend=c("Black Civilian","White Civilian","Hispanic Civilian"),col=c("Blue","Red","Green"),pch=15)
Perhaps we would prefer to see it divided the other way: the breakdown of armed status within races, rather than the breakdown of races within weapon groups. This will require quite a bit more typing, as we now need to create 5 new datasets, one for each of the 5 most common weapon groups. Let’s just look at the top 3: gun, knife, and unarmed.
gunData <- subset(restrictedData, armed == "gun")
knifeData <- subset(restrictedData, armed == "knife")
unarmedData <- subset(restrictedData, armed == "unarmed")
raceByArmed <- rbind(table(gunData$race),table(knifeData$race),table(unarmedData$race))
barplot(raceByArmed,col=c("Black","Red","Green"),las=2,horiz=F,main="Race and Armed Status of Civilians in Fatal Police Shootings in 2015\n (only 3 most common weapons shown)",cex.names=.7)
legend("topleft",legend=c("Armed with Gun","Armed with Knife","Unarmed"),col=c("Black","Red","Green"),pch=15)
barplot(raceByArmed,col=c("Black","Red","Green"),las=2,horiz=F,main="Race and Armed Status of Civilians in Fatal Police Shootings in 2015\n (only 3 most common weapons shown)",cex.names=.7, beside=T)
legend("topleft",legend=c("Armed with Gun","Armed with Knife","Unarmed"),col=c("Black","Red","Green"),pch=15)
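As an aside, count matrices like the two built by hand above can also be produced by calling table() with two variables, and t() swaps the direction. This is a sketch on made-up data, not the approach taken in the text. Unlike the rbind version, it includes every level of both variables, with rows in the order of the factor levels.

```r
# Made-up data frame standing in for restrictedData.
toy <- data.frame(
  race  = factor(c("Black", "White", "White", "Hispanic", "Black", "White")),
  armed = factor(c("gun", "gun", "knife", "unarmed", "knife", "gun"))
)

# Two-way table: one row per race, one column per weapon
armedByRaceToy <- table(toy$race, toy$armed)

# Transpose to get one row per weapon, one column per race
raceByArmedToy <- t(armedByRaceToy)

# legend.text = TRUE labels the bars using the row names of the table
barplot(armedByRaceToy, beside = TRUE, legend.text = TRUE, las = 2)
```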
Dividing our data different ways is getting to be quite a bit of work. The package ggplot2 allows us to create graphics in a more flexible way, which takes away much of this work. We will discuss the details later in the semester. For now, take a look at the ggplot code that generates the same graphs as above.
First, a simple bar plot of the armed status.
library(ggplot2)
ggplot(restrictedData,aes(x=armed))+geom_bar()+theme_bw()
Now, let’s group this by race.
ggplot(restrictedData,aes(x=armed,fill=race))+geom_bar()+theme_bw()
Side-by-side instead:
ggplot(restrictedData,aes(x=armed,fill=race))+geom_bar(position="dodge")+theme_bw()
Go the other way (race by arms instead of arms by race), stacked and side-by-side:
ggplot(restrictedData,aes(x=race,fill=armed))+geom_bar()+theme_bw()
ggplot(restrictedData,aes(x=race,fill=armed))+geom_bar(position="dodge")+theme_bw()
Perhaps we are interested in seeing how these two variables, armed and race, interact with a third variable, flee, which describes if and how the civilian was fleeing when the shooting occurred.
ggplot(restrictedData,aes(x=armed,fill=race))+geom_bar()+facet_grid(flee~.)+theme_bw()
ggplot(restrictedData,aes(x=race,fill=armed))+geom_bar()+facet_grid(flee~.)+theme_bw()
We can even look at how armed and race interact with flee and another variable, threat_level, at the same time.
ggplot(restrictedData,aes(x=armed,fill=race))+geom_bar()+facet_grid(flee~threat_level)+theme_bw()
ggplot(restrictedData,aes(x=race,fill=armed))+geom_bar()+facet_grid(flee~threat_level)+theme_bw()
Of course, you can see that many of these displays are ineffective and need substantial modification before they are ready for prime time. But you should also be able to see how quickly we were able to create different ways of dividing and displaying the data using ggplot, compared to the amount of typing that would have been required to make similar displays in the base graphics package.
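As one example of the kind of modification meant here, titles and axis labels can be added with labs(). The sketch below uses a made-up data frame rather than restrictedData, and the label text is only a suggestion:

```r
library(ggplot2)

# Made-up data frame standing in for restrictedData.
toy <- data.frame(
  race  = c("Black", "White", "White", "Hispanic", "Black", "White"),
  armed = c("gun", "gun", "knife", "unarmed", "knife", "gun")
)

# Side-by-side bars with a title and readable axis and legend labels
p <- ggplot(toy, aes(x = armed, fill = race)) +
  geom_bar(position = "dodge") +
  labs(title = "Weapon Held by Civilians (made-up data)",
       x = "Weapon", y = "Number of shootings", fill = "Race") +
  theme_bw()

print(p)
```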