Load general libraries:
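The library chunk isn't shown in the original; a minimal set covering everything used below:
# tidyverse bundles dplyr, tidyr, stringr, and ggplot2,
# all of which this analysis relies on
library(tidyverse)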
Data Set 1: Poverty in the United States
This dataset describes the distribution of poverty in the United States across several different dimensions. The discussion on the forum only mentioned race or gender, so I will truncate the data to only include race. (It wouldn’t make sense to combine race and gender data frames anyway, since the table does not inform us about their intersection.)
Data Cleaning
The data is multi-indexed with a complicated header; it's a mess. I'll start the cleaning by simply skipping the header lines.
# install.packages('readxl')
library(readxl)
pov_table <- read_excel('pov_table3.xls', skip=13, col_names=FALSE)
head(pov_table)
## # A tibble: 6 x 13
## X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9 X__10 X__11
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 White…………… 245985 27113 547 11 0.2 247272 26436 714 10.7 0.3
## 2 White, no… 195221 17263 493 8.8 0.3 195256 16993 571 8.7 0.3
## 3 Black…………… 41962 9234 388 22 0.9 42474 8993 373 21.2 0.9
## 4 Asian…………… 18879 1908 175 10.1 0.9 19475 1953 190 10 1
## 5 Hispanic … 57556 11137 399 19.4 0.7 59053 10790 423 18.3 0.7
## 6 <NA> NA NA NA NA NA NA NA NA NA NA
## # ... with 2 more variables: X__12 <chr>, X__13 <chr>
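The extraction step isn't shown; a minimal sketch, assuming the five race rows sit at the top of the sheet:
# keep only the five race rows (their position is an assumption)
race <- pov_table %>% slice(1:5)
head(race)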
## # A tibble: 5 x 13
## X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9 X__10 X__11
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 White…………… 245985 27113 547 11 0.2 247272 26436 714 10.7 0.3
## 2 White, no… 195221 17263 493 8.8 0.3 195256 16993 571 8.7 0.3
## 3 Black…………… 41962 9234 388 22 0.9 42474 8993 373 21.2 0.9
## 4 Asian…………… 18879 1908 175 10.1 0.9 19475 1953 190 10 1
## 5 Hispanic … 57556 11137 399 19.4 0.7 59053 10790 423 18.3 0.7
## # ... with 2 more variables: X__12 <chr>, X__13 <chr>
The two years are 'presented horizontally,' and the simplest way to deal with that is brute force. Slice up the data frame:
race_2016 <- race[,1:6]
race_2017 <- race[,c(1,7:11)]
race_colnames <- c('race', 'total', 'in_poverty', 'me', 'perc', 'perc_me')
colnames(race_2016) <- race_colnames
colnames(race_2017) <- race_colnames
race_2016$year <- as.factor(2016)
race_2017$year <- as.factor(2017)
race_long <- rbind(race_2016, race_2017)
head(race_long, 7) # note year column at far right
## # A tibble: 7 x 7
## race total in_poverty me perc perc_me year
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 White………………………………………… 245985 27113 547 11 0.2 2016
## 2 White, not Hispanic……………………… 195221 17263 493 8.8 0.3 2016
## 3 Black………………………………………… 41962 9234 388 22 0.9 2016
## 4 Asian………………………………………… 18879 1908 175 10.1 0.9 2016
## 5 Hispanic (any race)……………………… 57556 11137 399 19.4 0.7 2016
## 6 White………………………………………… 247272 26436 714 10.7 0.3 2017
## 7 White, not Hispanic……………………… 195256 16993 571 8.7 0.3 2017
Next, clean up the superfluous characters. The race column is annoying because it alternates between the ellipsis character (…) and runs of periods (...), but this is still easily handled with a regex.
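The cleaning chunk isn't shown; a minimal sketch, where the exact pattern is an assumption:
# strip any trailing mix of ellipsis characters and periods
race_long$race <- str_remove(race_long$race, '[….]+$')
head(race_long)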
## # A tibble: 6 x 7
## race total in_poverty me perc perc_me year
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 White 245985 27113 547 11 0.2 2016
## 2 White, not Hispanic 195221 17263 493 8.8 0.3 2016
## 3 Black 41962 9234 388 22 0.9 2016
## 4 Asian 18879 1908 175 10.1 0.9 2016
## 5 Hispanic (any race) 57556 11137 399 19.4 0.7 2016
## 6 White 247272 26436 714 10.7 0.3 2017
Analysis
Since this data is concerned with temporal variation, let's create a line graph to get a sense of it.
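The plotting code isn't shown; a minimal sketch, assuming the poverty rate (perc) belongs on the y-axis:
ggplot(race_long, aes(year, perc, group=race, color=race)) +
  geom_line() +
  labs(x="Year", y="Percent in Poverty")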
It’s not a great chart.
There’s only two years available in the data, so its utility is obviously limited.
We do see, however, that the populations suffering from the most poverty have decreased incidence of poverty in 2017 compared to the prior year. Non-Hispanic whites and Asians have the lowest incidence, which is nearly the same in 2017 as in 2016.
Data Set 2: Law Enforcement Officers Killed
This dataset consists of two files: officer deaths are recorded in one file if accidental, and in another if 'feloniously killed.'
I will clean up these files and combine them, creating a master dataset of officer deaths from 2008 to 2017.
Data Cleaning
Let’s start with accidental officer deaths:
Get rid of the first row and the Total column, which give sums.
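Mirroring the treatment of the felonious file below:
accidental <- accidental %>% slice(-1) # drop the totals row
accidental$Total <- NULL # drop the Total column
head(accidental)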
## # A tibble: 6 x 11
## Area `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NORTHEAST 11 6 8 8 9 5 8 5 5
## 2 New Engl… 1 2 3 2 4 0 1 0 1
## 3 Connecti… 1 0 2 0 0 0 0 0 0
## 4 Maine 0 0 0 1 0 0 0 0 0
## 5 Massachu… 0 2 1 1 3 0 1 0 1
## 6 New Hamp… 0 0 0 0 0 0 0 0 0
## # ... with 1 more variable: `2017` <dbl>
Now we have to decide what we actually want to extract from this data: officers accidentally killed by region (Midwest, Northeast, etc.), or by state. Since the state-level breakdown is more informative, I will focus on producing state-level data.
First, filter out the rows that contain region headings and sums:
filter_out <- 'NORTHEAST|MIDWEST|SOUTH|WEST|PUERTO RICO AND OTHER OUTLYING AREAS|Mountain|New England|Middle Atlantic|East North Central|West North Central|South Atlantic|East South Central|West South Central|Pacific'
accidental <- accidental %>%
filter(str_detect(Area, filter_out) == FALSE)
colnames(accidental)[1] <- 'state'
head(accidental)
## # A tibble: 6 x 11
## state `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Connecti… 1 0 2 0 0 0 0 0 0
## 2 Maine 0 0 0 1 0 0 0 0 0
## 3 Massachu… 0 2 1 1 3 0 1 0 1
## 4 New Hamp… 0 0 0 0 0 0 0 0 0
## 5 Rhode Is… 0 0 0 0 1 0 0 0 0
## 6 Vermont 0 0 0 0 0 0 0 0 0
## # ... with 1 more variable: `2017` <dbl>
Convert from wide to long:
accidental_long <- accidental %>%
gather(year, fatalities, `2008`:`2017`) %>%
mutate(type='accidental') %>%
arrange(state, year)
head(accidental_long)
## # A tibble: 6 x 4
## state year fatalities type
## <chr> <chr> <dbl> <chr>
## 1 Alabama 2008 2 accidental
## 2 Alabama 2009 1 accidental
## 3 Alabama 2010 0 accidental
## 4 Alabama 2011 2 accidental
## 5 Alabama 2012 1 accidental
## 6 Alabama 2013 3 accidental
Now, prepare the felonious officer deaths in the same way:
felon <- read_xls('table-1.xls', skip=3) %>%
slice(-1) %>%
filter(str_detect(Area, filter_out) == FALSE)
felon$Total <- NULL
colnames(felon)[1] <- 'state'
felon_long <- felon %>%
gather(year, fatalities, `2008`:`2017`) %>%
mutate(type='felonious') %>%
arrange(state, year)
head(felon_long)
## # A tibble: 6 x 4
## state year fatalities type
## <chr> <chr> <dbl> <chr>
## 1 Alabama 2008 0 felonious
## 2 Alabama 2009 4 felonious
## 3 Alabama 2010 1 felonious
## 4 Alabama 2011 1 felonious
## 5 Alabama 2012 2 felonious
## 6 Alabama 2013 0 felonious
Join the accidental and felonious data sets:
officer_deaths <- rbind(accidental_long, felon_long) %>%
arrange(state, year, type)
head(officer_deaths, 10)
## # A tibble: 10 x 4
## state year fatalities type
## <chr> <chr> <dbl> <chr>
## 1 Alabama 2008 2 accidental
## 2 Alabama 2008 0 felonious
## 3 Alabama 2009 1 accidental
## 4 Alabama 2009 4 felonious
## 5 Alabama 2010 0 accidental
## 6 Alabama 2010 1 felonious
## 7 Alabama 2011 2 accidental
## 8 Alabama 2011 1 felonious
## 9 Alabama 2012 1 accidental
## 10 Alabama 2012 2 felonious
Analysis
The obvious analysis is to consider officer deaths, felonious and accidental, over time. We can use a line graph for that:
plot_deaths <- officer_deaths %>%
group_by(year, type) %>%
summarize(deaths=sum(fatalities))
head(plot_deaths)
## # A tibble: 6 x 3
## # Groups: year [3]
## year type deaths
## <chr> <chr> <dbl>
## 1 2008 accidental 68
## 2 2008 felonious 41
## 3 2009 accidental 48
## 4 2009 felonious 48
## 5 2010 accidental 72
## 6 2010 felonious 55
ggplot(plot_deaths, aes(year, deaths, group=type, color=type)) +
geom_line() +
labs(x="Year", y="Officer Deaths") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Two things strike me about these numbers. First, about half the time more officers are killed by accident, and a quick Google search suggests these are specifically car accidents. Second, the total number of officers killed is always less than 150 per year.
I would’ve suspected it was a lot larger. To check my intuition, I texted my dad to ask him to estimate how many police officers died on duty every year. He suggested one or two thousand! That is, my dad thought the United States is more dangerous for police officers than Iraq was at peak violence for American soldiers (c. 2006)!
This data, combined with my poll of \(n = 1\), suggests Americans may overestimate the dangers law enforcement officers face.
Data Set 3: Arms Exports
This is the data set I submitted to the forum, the Arms Transfers Database.
It is maintained by the Stockholm International Peace Research Institute, and is meant to be a catalog of all international arms sales/transfers since 1950.
The specific dataset under consideration is the exports dataset. Here, the quantity of arms exports is operationalized as the Institute's trend indicator values (TIV). TIV are an attempt to use a common unit to represent arms transfers, rather than trying to account for the relative value of, e.g., ten thousand machine guns versus ten thousand assault rifles. They are based on the production cost of weapons systems (as opposed to market value), as well as the condition of the weapon (new, used, or refurbished).
Data Cleaning
The data contains the total TIV for all countries’ exports between 1950 and 2017.
Note: The file extension is .xls, but it appears to be a regular CSV file.
raw <- read.csv('TIV-Export-All-1950-2017.csv.xls', skip=10, header=TRUE,
stringsAsFactors=FALSE)
head(raw)
## X X1950 X1951 X1952 X1953 X1954 X1955 X1956 X1957 X1958 X1959
## 1 Albania NA NA NA NA NA NA NA NA NA NA
## 2 Algeria NA NA NA NA NA NA NA NA NA NA
## 3 Angola NA NA NA NA NA NA NA NA NA NA
## 4 Argentina NA NA NA NA NA NA NA NA NA NA
## 5 Aruba NA NA NA NA NA NA NA NA NA NA
## 6 Australia NA NA 36 NA NA 125 NA NA NA NA
## X1960 X1961 X1962 X1963 X1964 X1965 X1966 X1967 X1968 X1969 X1970 X1971
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA 6 NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA 6 NA 4 NA NA 0 8 NA 0 NA
## 5 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 NA 2 2 NA NA NA NA NA NA 26 17 36
## X1972 X1973 X1974 X1975 X1976 X1977 X1978 X1979 X1980 X1981 X1982 X1983
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA 6 NA NA NA
## 3 NA NA NA NA NA NA NA 3 0 NA NA NA
## 4 11 NA NA NA 1 NA NA 8 1 24 3 1
## 5 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 6 72 92 131 113 62 21 22 15 15 24 31
## X1984 X1985 X1986 X1987 X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA 8 NA NA NA
## 4 NA NA NA 4 1 NA 6 0 NA 11 4 5
## 5 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 62 13 57 89 18 18 158 85 NA 30 36 36
## X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA 2 NA NA NA NA NA
## 4 NA NA NA NA 2 6 2 NA NA NA 2 NA
## 5 18 NA NA NA NA NA NA NA NA NA NA NA
## 6 16 20 4 NA NA 43 47 44 2 49 14 18
## X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015 X2016 X2017 Total
## 1 NA NA NA 0 NA NA NA NA NA NA 0
## 2 NA NA NA NA NA NA NA NA 0 NA 12
## 3 NA NA NA NA NA NA NA NA NA NA 13
## 4 NA NA NA 1 NA NA NA NA NA NA 113
## 5 NA NA NA NA NA NA NA NA NA NA 18
## 6 25 80 115 143 45 54 97 87 134 97 2583
Each row represents a country, and each column a year. (Note that R has appended an 'X' to each year column name.) Let's remove the superfluous Total column and row, and label the first column.
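The cleanup chunk isn't shown; a minimal sketch, assuming the Total row is the last row of the sheet:
raw <- raw[-nrow(raw), ] # drop the Total row (its position is an assumption)
raw$Total <- NULL # drop the Total column
colnames(raw)[1] <- 'country'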
Now to make the data tidy. Each row must be a country-year observation. This is easily accomplished with tidyr:
exports <- gather(raw, year, exports, X1950:X2017, na.rm=TRUE) %>%
mutate(year=as.integer(str_remove(year, 'X'))) %>%
arrange(country, year)
head(exports)
## country year exports
## 1 Albania 2011 0
## 2 Algeria 1968 6
## 3 Algeria 1980 6
## 4 Algeria 2016 0
## 5 Angola 1979 3
## 6 Angola 1980 0
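The structure, presumably printed with str():
str(exports)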
## 'data.frame': 2384 obs. of 3 variables:
## $ country: chr "Albania" "Algeria" "Algeria" "Algeria" ...
## $ year : int 2011 1968 1980 2016 1979 1980 1992 2002 1962 1964 ...
## $ exports: int 0 6 6 0 3 0 8 2 6 4 ...
Analysis
This dataset can be summarized both over time and across space. I will do a sampling of both below.
Time: Arms Exports Over Time
The simplest way to aggregate arms exports over time would be to sum exports by year:
yearly_exports <- exports %>%
group_by(year) %>%
summarise(total_exports=sum(exports))
ggplot(yearly_exports, aes(year, total_exports)) +
geom_line()It’s difficult to pick out specific events from this chart, but in some way it is a concise representation of international politics during the Cold War, 9/11, and after.
Over Time, by Region
We might be able to see more from this data by breaking the aggregation into smaller, though still macro, chunks. The International Organization for Standardization maintains many standards, including how countries are divided into regions. Fortunately, their classification is readily available on GitHub:
regions <- read.csv('https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv', stringsAsFactors=FALSE) %>%
select(name, region)
head(regions)
## name region
## 1 Afghanistan Asia
## 2 Åland Islands Europe
## 3 Albania Europe
## 4 Algeria Africa
## 5 American Samoa Oceania
## 6 Andorra Europe
Combine with existing data:
regional_exports <- exports %>%
inner_join(regions, by=c('country'='name')) %>%
group_by(year, region) %>%
summarise(total_exports=sum(exports))
Now, plot historically as above, but including multiple lines corresponding to regions.
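The plotting code isn't shown; a minimal sketch:
ggplot(regional_exports, aes(year, total_exports, group=region, color=region)) +
  geom_line() +
  labs(x="Year", y="TIV")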
This gives us a bit more definition. The vast majority of arms exports went to Europe and Asia. Indeed, I think you can see the Vietnam War on the Asian line, which entailed massive aid to South Vietnam, from the ramp-up in 1965 through Nixon's Vietnamization of the war in the first half of the 1970s.
Space: Top Arms Exporters
Who were the top 10 arms dealers?
countries <- exports %>%
group_by(country) %>%
summarise(total_exports=sum(exports)) %>%
arrange(desc(total_exports))
library(scales)
ggplot(head(countries, 10), aes(reorder(country, -total_exports), total_exports)) +
geom_bar(stat='identity') +
labs(x="Country", y="TIV") +
scale_y_continuous(labels = comma) + # no scientific notation on y-axis
theme(axis.text.x=element_text(angle=45, hjust=1))
A specialist would probably demand this graph be broken into two for correct inferences: one for before the collapse of the Soviet Union in 1991, and one for after.