Load general libraries:
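The library chunk isn't shown in the original; a minimal set covering everything used below:
# tidyverse bundles dplyr, tidyr, stringr, and ggplot2,
# all of which this analysis relies on
library(tidyverse)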
Data Set 1: Poverty in the United States
This dataset describes the distribution of poverty in the United States across several different dimensions. The discussion on the forum only mentioned race or gender, so I will truncate the data to only include race. (It wouldn’t make sense to combine race and gender data frames anyway, since the table does not inform us about their intersection.)
Data Cleaning
The data is multi-indexed with a complicated header; it's a mess. I'll start the cleaning by simply skipping the header lines.
# install.packages('readxl')
library(readxl)
pov_table <- read_excel('pov_table3.xls', skip=13, col_names=FALSE)
head(pov_table)
## # A tibble: 6 x 13
## X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9 X__10 X__11
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 White…………… 245985 27113 547 11 0.2 247272 26436 714 10.7 0.3
## 2 White, no… 195221 17263 493 8.8 0.3 195256 16993 571 8.7 0.3
## 3 Black…………… 41962 9234 388 22 0.9 42474 8993 373 21.2 0.9
## 4 Asian…………… 18879 1908 175 10.1 0.9 19475 1953 190 10 1
## 5 Hispanic … 57556 11137 399 19.4 0.7 59053 10790 423 18.3 0.7
## 6 <NA> NA NA NA NA NA NA NA NA NA NA
## # ... with 2 more variables: X__12 <chr>, X__13 <chr>
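The extraction step isn't shown; a minimal sketch, assuming the five race rows sit at the top of the sheet:
# keep only the five race rows (their position is an assumption)
race <- pov_table %>% slice(1:5)
head(race)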
## # A tibble: 5 x 13
## X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9 X__10 X__11
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 White…………… 245985 27113 547 11 0.2 247272 26436 714 10.7 0.3
## 2 White, no… 195221 17263 493 8.8 0.3 195256 16993 571 8.7 0.3
## 3 Black…………… 41962 9234 388 22 0.9 42474 8993 373 21.2 0.9
## 4 Asian…………… 18879 1908 175 10.1 0.9 19475 1953 190 10 1
## 5 Hispanic … 57556 11137 399 19.4 0.7 59053 10790 423 18.3 0.7
## # ... with 2 more variables: X__12 <chr>, X__13 <chr>
The two years are 'presented horizontally,' and the simplest way to deal with that is brute force. Slice up the data frame:
race_2016 <- race[,1:6]
race_2017 <- race[,c(1,7:11)]
race_colnames <- c('race', 'total', 'in_poverty', 'me', 'perc', 'perc_me')
colnames(race_2016) <- race_colnames
colnames(race_2017) <- race_colnames
race_2016$year <- as.factor(2016)
race_2017$year <- as.factor(2017)
race_long <- rbind(race_2016, race_2017)
head(race_long, 7) # note year column at far right
## # A tibble: 7 x 7
## race total in_poverty me perc perc_me year
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 White………………………………………… 245985 27113 547 11 0.2 2016
## 2 White, not Hispanic……………………… 195221 17263 493 8.8 0.3 2016
## 3 Black………………………………………… 41962 9234 388 22 0.9 2016
## 4 Asian………………………………………… 18879 1908 175 10.1 0.9 2016
## 5 Hispanic (any race)……………………… 57556 11137 399 19.4 0.7 2016
## 6 White………………………………………… 247272 26436 714 10.7 0.3 2017
## 7 White, not Hispanic……………………… 195256 16993 571 8.7 0.3 2017
Next, clean up the superfluous characters. The race column is annoying because it alternates between the ellipsis character (…) and runs of periods (...), but this is still easily handled with a regex.
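The cleaning chunk isn't shown; a minimal sketch, where the exact pattern is an assumption:
# strip any trailing mix of ellipsis characters and periods
race_long$race <- str_remove(race_long$race, '[….]+$')
head(race_long)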
## # A tibble: 6 x 7
## race total in_poverty me perc perc_me year
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 White 245985 27113 547 11 0.2 2016
## 2 White, not Hispanic 195221 17263 493 8.8 0.3 2016
## 3 Black 41962 9234 388 22 0.9 2016
## 4 Asian 18879 1908 175 10.1 0.9 2016
## 5 Hispanic (any race) 57556 11137 399 19.4 0.7 2016
## 6 White 247272 26436 714 10.7 0.3 2017
Analysis
Since this data is concerned with temporal variation, let's create a line graph to get a sense of it.
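The plotting code isn't shown; a minimal sketch, assuming the poverty rate (perc) belongs on the y-axis:
ggplot(race_long, aes(year, perc, group=race, color=race)) +
  geom_line() +
  labs(x="Year", y="Percent in Poverty")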
It’s not a great chart.
There’s only two years available in the data, so its utility is obviously limited.
We do see, however, that the populations suffering from the most poverty have decreased incidence of poverty in 2017 compared to the prior year. Non-Hispanic whites and Asians have the lowest incidence, which is nearly the same in 2017 as in 2016.
Data Set 2: Law Enforcement Officers Killed
This dataset consists of two files: officer deaths are recorded in one file if accidental, and in another if 'feloniously killed.'
I will clean up these files and combine them, creating a master dataset of officer deaths from 2008 to 2017.
Data Cleaning
Let’s start with accidental officer deaths:
Get rid of the first row and the Total column, which give sums.
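Mirroring the treatment of the felonious file below:
accidental <- accidental %>% slice(-1) # drop the totals row
accidental$Total <- NULL # drop the Total column
head(accidental)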
## # A tibble: 6 x 11
## Area `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NORTHEAST 11 6 8 8 9 5 8 5 5
## 2 New Engl… 1 2 3 2 4 0 1 0 1
## 3 Connecti… 1 0 2 0 0 0 0 0 0
## 4 Maine 0 0 0 1 0 0 0 0 0
## 5 Massachu… 0 2 1 1 3 0 1 0 1
## 6 New Hamp… 0 0 0 0 0 0 0 0 0
## # ... with 1 more variable: `2017` <dbl>
Now we have to decide what we actually want to extract from this data: officers accidentally killed by region (Midwest, Northeast, etc.), or by state. Since the state-level breakdown is more informative, I will focus on producing state-level data.
First, filter out the rows that contain region headings and sums:
filter_out <- 'NORTHEAST|MIDWEST|SOUTH|WEST|PUERTO RICO AND OTHER OUTLYING AREAS|Mountain|New England|Middle Atlantic|East North Central|West North Central|South Atlantic|East South Central|West South Central|Pacific'
accidental <- accidental %>%
filter(str_detect(Area, filter_out) == FALSE)
colnames(accidental)[1] <- 'state'
head(accidental)
## # A tibble: 6 x 11
## state `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Connecti… 1 0 2 0 0 0 0 0 0
## 2 Maine 0 0 0 1 0 0 0 0 0
## 3 Massachu… 0 2 1 1 3 0 1 0 1
## 4 New Hamp… 0 0 0 0 0 0 0 0 0
## 5 Rhode Is… 0 0 0 0 1 0 0 0 0
## 6 Vermont 0 0 0 0 0 0 0 0 0
## # ... with 1 more variable: `2017` <dbl>
Convert from wide to long:
accidental_long <- accidental %>%
gather(year, fatalities, `2008`:`2017`) %>%
mutate(type='accidental') %>%
arrange(state, year)
head(accidental_long)
## # A tibble: 6 x 4
## state year fatalities type
## <chr> <chr> <dbl> <chr>
## 1 Alabama 2008 2 accidental
## 2 Alabama 2009 1 accidental
## 3 Alabama 2010 0 accidental
## 4 Alabama 2011 2 accidental
## 5 Alabama 2012 1 accidental
## 6 Alabama 2013 3 accidental
Now, prepare the felonious officer deaths in the same way:
felon <- read_xls('table-1.xls', skip=3) %>%
slice(-1) %>%
filter(str_detect(Area, filter_out) == FALSE)
felon$Total <- NULL
colnames(felon)[1] <- 'state'
felon_long <- felon %>%
gather(year, fatalities, `2008`:`2017`) %>%
mutate(type='felonious') %>%
arrange(state, year)
head(felon_long)
## # A tibble: 6 x 4
## state year fatalities type
## <chr> <chr> <dbl> <chr>
## 1 Alabama 2008 0 felonious
## 2 Alabama 2009 4 felonious
## 3 Alabama 2010 1 felonious
## 4 Alabama 2011 1 felonious
## 5 Alabama 2012 2 felonious
## 6 Alabama 2013 0 felonious
Join the accidental and felonious data sets:
officer_deaths <- rbind(accidental_long, felon_long) %>%
arrange(state, year, type)
head(officer_deaths, 10)
## # A tibble: 10 x 4
## state year fatalities type
## <chr> <chr> <dbl> <chr>
## 1 Alabama 2008 2 accidental
## 2 Alabama 2008 0 felonious
## 3 Alabama 2009 1 accidental
## 4 Alabama 2009 4 felonious
## 5 Alabama 2010 0 accidental
## 6 Alabama 2010 1 felonious
## 7 Alabama 2011 2 accidental
## 8 Alabama 2011 1 felonious
## 9 Alabama 2012 1 accidental
## 10 Alabama 2012 2 felonious
Analysis
The obvious analysis is to consider officer deaths, felonious and accidental, over time. We can use a line graph for that:
plot_deaths <- officer_deaths %>%
group_by(year, type) %>%
summarize(deaths=sum(fatalities))
head(plot_deaths)
## # A tibble: 6 x 3
## # Groups: year [3]
## year type deaths
## <chr> <chr> <dbl>
## 1 2008 accidental 68
## 2 2008 felonious 41
## 3 2009 accidental 48
## 4 2009 felonious 48
## 5 2010 accidental 72
## 6 2010 felonious 55
ggplot(plot_deaths, aes(year, deaths, group=type, color=type)) +
geom_line() +
labs(x="Year", y="Officer Deaths") +
theme(axis.text.x=element_text(angle=45, hjust=1))
Two things strike me about these numbers. First, about half the time more officers are killed by accident, and a quick Google search suggests these are specifically car accidents. Second, the total number of officers killed is always less than 150 per year.
I would’ve suspected it was a lot larger. To check my intuition, I texted my dad to ask him to estimate how many police officers died on duty every year. He suggested one or two thousand! That is, my dad thought the United States is more dangerous for police officers than Iraq was at peak violence for American soldiers (c. 2006)!
This data, combined with my poll of \(n = 1\), suggests Americans may overestimate the dangers law enforcement officers face.
Data Set 3: Arms Exports
This is the data set I submitted to the forum, the Arms Transfers Database.
It is maintained by the Stockholm International Peace Research Institute, and is meant to be a catalog of all international arms sales/transfers since 1950.
The specific dataset under consideration is the exports dataset. Here, the quantity of arms exports is operationalized as the Institute's trend indicator values (TIV). TIV are an attempt to use a common unit to represent arms transfers, rather than trying to account for the relative value of, e.g., ten thousand machine guns versus ten thousand assault rifles. They are based on the production cost of weapons systems (as opposed to market value), as well as the condition of the weapon (new, used, or refurbished).
Data Cleaning
The data contains the total TIV for all countries’ exports between 1950 and 2017.
Note: The file extension is .xls, but it appears to be a regular CSV file.
raw <- read.csv('TIV-Export-All-1950-2017.csv.xls', skip=10, header=TRUE,
stringsAsFactors=FALSE)
head(raw)
## X X1950 X1951 X1952 X1953 X1954 X1955 X1956 X1957 X1958 X1959
## 1 Albania NA NA NA NA NA NA NA NA NA NA
## 2 Algeria NA NA NA NA NA NA NA NA NA NA
## 3 Angola NA NA NA NA NA NA NA NA NA NA
## 4 Argentina NA NA NA NA NA NA NA NA NA NA
## 5 Aruba NA NA NA NA NA NA NA NA NA NA
## 6 Australia NA NA 36 NA NA 125 NA NA NA NA
## X1960 X1961 X1962 X1963 X1964 X1965 X1966 X1967 X1968 X1969 X1970 X1971
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA 6 NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA 6 NA 4 NA NA 0 8 NA 0 NA
## 5 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 NA 2 2 NA NA NA NA NA NA 26 17 36
## X1972 X1973 X1974 X1975 X1976 X1977 X1978 X1979 X1980 X1981 X1982 X1983
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA 6 NA NA NA
## 3 NA NA NA NA NA NA NA 3 0 NA NA NA
## 4 11 NA NA NA 1 NA NA 8 1 24 3 1
## 5 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 6 72 92 131 113 62 21 22 15 15 24 31
## X1984 X1985 X1986 X1987 X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA 8 NA NA NA
## 4 NA NA NA 4 1 NA 6 0 NA 11 4 5
## 5 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 62 13 57 89 18 18 158 85 NA 30 36 36
## X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA 2 NA NA NA NA NA
## 4 NA NA NA NA 2 6 2 NA NA NA 2 NA
## 5 18 NA NA NA NA NA NA NA NA NA NA NA
## 6 16 20 4 NA NA 43 47 44 2 49 14 18
## X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015 X2016 X2017 Total
## 1 NA NA NA 0 NA NA NA NA NA NA 0
## 2 NA NA NA NA NA NA NA NA 0 NA 12
## 3 NA NA NA NA NA NA NA NA NA NA 13
## 4 NA NA NA 1 NA NA NA NA NA NA 113
## 5 NA NA NA NA NA NA NA NA NA NA 18
## 6 25 80 115 143 45 54 97 87 134 97 2583
Each row represents a country, and each column a year. (Note that R has appended an 'X' to each year column name.) Let's remove the superfluous Total column and row, and label the first column.
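The cleanup chunk isn't shown; a minimal sketch, assuming the Total row is the last row of the sheet:
raw <- raw[-nrow(raw), ] # drop the Total row (its position is an assumption)
raw$Total <- NULL # drop the Total column
colnames(raw)[1] <- 'country'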
Now to make the data tidy. Each row must be a country-year observation. This is easily accomplished with tidyr:
exports <- gather(raw, year, exports, X1950:X2017, na.rm=TRUE) %>%
mutate(year=as.integer(str_remove(year, 'X'))) %>%
arrange(country, year)
head(exports)
## country year exports
## 1 Albania 2011 0
## 2 Algeria 1968 6
## 3 Algeria 1980 6
## 4 Algeria 2016 0
## 5 Angola 1979 3
## 6 Angola 1980 0
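The structure, presumably printed with str():
str(exports)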
## 'data.frame': 2384 obs. of 3 variables:
## $ country: chr "Albania" "Algeria" "Algeria" "Algeria" ...
## $ year : int 2011 1968 1980 2016 1979 1980 1992 2002 1962 1964 ...
## $ exports: int 0 6 6 0 3 0 8 2 6 4 ...
Analysis
This dataset can be summarized both over time and across space. I will do a sampling of both below.
Time: Arms Exports Over Time
The simplest way to aggregate arms exports over time would be to sum exports by year:
yearly_exports <- exports %>%
group_by(year) %>%
summarise(total_exports=sum(exports))
ggplot(yearly_exports, aes(year, total_exports)) +
geom_line()It’s difficult to pick out specific events from this chart, but in some way it is a concise representation of international politics during the Cold War, 9/11, and after.
Over Time, by Region
We might be able to see more from this data by breaking the aggregation into smaller, though still macro, chunks. The International Organization for Standardization maintains many standards, including how countries are divided into regions. Fortunately, their classification is readily available on GitHub:
regions <- read.csv('https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv', stringsAsFactors=FALSE) %>%
select(name, region)
head(regions)
## name region
## 1 Afghanistan Asia
## 2 Åland Islands Europe
## 3 Albania Europe
## 4 Algeria Africa
## 5 American Samoa Oceania
## 6 Andorra Europe
Combine with existing data:
regional_exports <- exports %>%
inner_join(regions, by=c('country'='name')) %>%
group_by(year, region) %>%
summarise(total_exports=sum(exports))
Now, plot historically as above, but including multiple lines corresponding to regions.
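The plotting code isn't shown; a minimal sketch:
ggplot(regional_exports, aes(year, total_exports, group=region, color=region)) +
  geom_line() +
  labs(x="Year", y="TIV")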
This gives us a bit more definition. The vast majority of arms exports went to Europe and Asia. Indeed, I think you can see the Vietnam War on the Asian line, which entailed massive aid to South Vietnam, from the ramp-up in 1965 through Nixon's Vietnamization of the war in the first half of the 1970s.
Space: Top Arms Exporters
Who were the top 10 arms dealers?
countries <- exports %>%
group_by(country) %>%
summarise(total_exports=sum(exports)) %>%
arrange(desc(total_exports))
library(scales)
ggplot(head(countries, 10), aes(reorder(country, -total_exports), total_exports)) +
geom_bar(stat='identity') +
labs(x="Country", y="TIV") +
scale_y_continuous(labels = comma) + # no scientific notation on y-axis
theme(axis.text.x=element_text(angle=45, hjust=1))
A specialist would probably demand this graph be broken into two for correct inferences: one for before the collapse of the Soviet Union in 1991, and one for after.