Preparing Data for Downstream Data Analysis


Introduction

Overview

The aim of this project is to clean data, transform it in to a “tidy”" state and then perform some analysis. The data chosen, from fellow students’ suggestions, are population figures for the United States from the 2010 Census and student alcohol consumption from a school in Portugal.

Relevence

The three main components of tidy data, outlined by Hadley Wickham in his paper, Tidy Data are:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

This data structure is a great rule of thumb in order to perform analysis. A lot of data made for human consumption violates these three tenets and is not conducive to computational analysis. Being able to transform this data effectively (in the case of R: using tools such as dplyr and tidyr) is an ever-present challenge for data scientists.

Libraries

Outside of formatting packages, the libraries used for this project are:

  • data.table: Light-weight, fast development for data structure; not necessarily needed for this data set, but recommended under best practices.
  • dplyr: For data manipulation.
  • tidyr: To produce tidy data from the dplyr pipeline.
  • ggplot2: Provides some advanced plotting utilities.

Loading and Cleaning the Data

2010 Census Data

The 2010 census data, suggested by Brandon O’Hara, was released by the U.S Census Bureau on December 2014. The data provides annual estimates of the resident population for the United States, regions, states, and Puerto Rico from April 1, 2010 to July 1, 2014. The raw data, which was cleaned of descriptive attachments and merged cells, is available here. After uploading the data, the ten first observations are:

file_link <- 
  "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/pop2012.csv"

# Returns a data table
census2010  <- fread(file_link, header = TRUE, na.strings = "", data.table = TRUE)
knitr::kable(head(census2010,10))
area census_2010 raw_2010 10-Jul 11-Jul 12-Jul 13-Jul 14-Jul
United States 308,745,538 308,758,105 309,347,057 311,721,632 314,112,078 316,497,531 318,857,056
Northeast 55,317,240 55,318,348 55,381,690 55,635,670 55,832,038 56,028,220 56,152,333
Midwest 66,927,001 66,929,898 66,972,390 67,149,657 67,331,458 67,567,871 67,745,108
South 114,555,744 114,562,951 114,871,231 116,089,908 117,346,322 118,522,802 119,771,934
West 71,945,553 71,946,908 72,121,746 72,846,397 73,602,260 74,378,638 75,187,681
.Alabama 4,779,736 4,780,127 4,785,822 4,801,695 4,817,484 4,833,996 4,849,377
.Alaska 710,231 710,249 713,856 722,572 731,081 737,259 736,732
.Arizona 6,392,017 6,392,310 6,411,999 6,472,867 6,556,236 6,634,997 6,731,484
.Arkansas 2,915,918 2,915,958 2,922,297 2,938,430 2,949,300 2,958,765 2,966,369
.California 37,253,956 37,254,503 37,336,011 37,701,901 38,062,780 38,431,393 38,802,500

The 2010 adjusted and raw variables are adjusted census data from April 2010 and the original data gathered, respectively. The Jul-10 to Jul-14 variables are projections based off the adjusted data.

From the raw data, shown below, states are preceded by a period. A simple regex expression will remove these:

census2010$area <- sub("^\\.", "", census2010$area)

knitr::kable(head(census2010, 10))
area census_2010 raw_2010 10-Jul 11-Jul 12-Jul 13-Jul 14-Jul
United States 308,745,538 308,758,105 309,347,057 311,721,632 314,112,078 316,497,531 318,857,056
Northeast 55,317,240 55,318,348 55,381,690 55,635,670 55,832,038 56,028,220 56,152,333
Midwest 66,927,001 66,929,898 66,972,390 67,149,657 67,331,458 67,567,871 67,745,108
South 114,555,744 114,562,951 114,871,231 116,089,908 117,346,322 118,522,802 119,771,934
West 71,945,553 71,946,908 72,121,746 72,846,397 73,602,260 74,378,638 75,187,681
Alabama 4,779,736 4,780,127 4,785,822 4,801,695 4,817,484 4,833,996 4,849,377
Alaska 710,231 710,249 713,856 722,572 731,081 737,259 736,732
Arizona 6,392,017 6,392,310 6,411,999 6,472,867 6,556,236 6,634,997 6,731,484
Arkansas 2,915,918 2,915,958 2,922,297 2,938,430 2,949,300 2,958,765 2,966,369
California 37,253,956 37,254,503 37,336,011 37,701,901 38,062,780 38,431,393 38,802,500

The first five rows are calculated rows and should not be in the table. To tabulate them manually, especially the regional areas, would be too cumbersome. They will remain in the table due to their availability.

Student Drinking

The alcohol consumption for two Portuguese schools (“GP” - Gabriel Pereira or “MS” - Mousinho da Silveira) is available by area of study, math and Portuguese. This data was recommended by Oluwakemi Omotunde and the data dictionary is available here

#Data for math students
st_math_link <- "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/student/student-mat.csv"
student_math <- fread(st_math_link, header = TRUE, na.strings = "", data.table = TRUE)
knitr::kable(head(student_math))
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 6 5 6 6
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 4 5 5 6
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 3 yes no yes no yes yes yes no 4 3 2 2 3 3 10 7 8 10
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes yes yes yes yes yes yes 3 2 2 1 1 5 2 15 14 15
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes yes no yes yes no no 4 3 2 1 2 5 4 6 10 10
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes yes yes yes yes yes no 5 4 2 1 2 5 10 15 15 15
#Data for Portuguese students
st_port_link <- "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/student/student-por.csv"
student_port <- fread(st_port_link, header = TRUE, na.strings = "", data.table = TRUE)

student_port$sex <- sub("0", "F", student_port$sex)

knitr::kable(head(student_port))
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 4 0 11 11
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 2 9 11 11
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 0 yes no no no yes yes yes no 4 3 2 2 3 3 6 12 13 12
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes no yes yes yes yes yes 3 2 2 1 1 5 0 14 14 14
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes no no yes yes no no 4 3 2 1 2 5 0 11 13 13
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes no yes yes yes yes no 5 4 2 1 2 5 6 12 12 13

Due to the behavior of fread, sex was coerced into logical values for the first couple of rows. This error was handles by the regex expression above.

Stop and Frisk

In metropolitan areas, namely New York City, stop and frisk has become a divisive practice since its ramp-up at the turn of the century. Some have lauded its effectiveness and others have scrutinized its unfair focus on young, minority men. Answering such things is beyond the scope of this assignment, but there are some interesting observations that can be garnered. The NYPD keeps a a log of stop and frisk data between 2003 and 2015 with a “wide” list of variables. The data dictionary is available for download here. The data for the stop and frisk practice in NYC is loaded below:

file_link <- 
   "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/stopNfrisk/2015snf.csv"

# Returns a data table
snf2015  <- fread(file_link, header = TRUE, na.strings = " ", data.table = TRUE)
knitr::kable(head(snf2015,10))
year pct ser_num datestop timestop recstat inout trhsloc perobs crimsusp perstop typeofid explnstp othpers arstmade arstoffn sumissue sumoffen compyear comppct offunif officrid frisked searched contrabn adtlrept pistol riflshot asltweap knifcuti machgun othrweap pf_hands pf_wall pf_grnd pf_drwep pf_ptwep pf_baton pf_hcuff pf_pepsp pf_other radio ac_rept ac_inves rf_vcrim rf_othsw ac_proxm rf_attir cs_objcs cs_descr cs_casng cs_lkout rf_vcact cs_cloth cs_drgtr ac_evasv ac_assoc cs_furtv rf_rfcmp ac_cgdir rf_verbl cs_vcrim cs_bulge cs_other ac_incid ac_time rf_knowl ac_stsnd ac_other sb_hdobj sb_outln sb_admis sb_other repcmd revcmd rf_furt rf_bulg offverb offshld forceuse sex race dob age ht_feet ht_inch weight haircolr eyecolor build othfeatr addrtyp rescode premtype premname addrnum stname stinter crossst aptnum city state zip addrpct sector beat post xcoord ycoord dettypCM lineCM detailCM
2015 61 18 1012015 315 1 O P 2 FELONY 10 V Y N N N 0 0 N Y N N N N N N N N N N N N N N N N N N N Y N N N Y N N Y Y N N N N N N N Y N N N N N Y Y N N N N N N N 186 186 N N V S M W NA 33 5 11 190 BR BR M L NA NA STREET AVENUE W EAST 28 STREET NA BROOKLYN NA NA 61 E 7 NA 1000091 156314 CM 1 14
2015 22 5 1152015 1747 A O P 1 FELONY 4 V Y N N N 0 0 N Y N N N N N N N N N Y N N N N N N N N Y Y N Y Y Y N N Y N N N N N Y N Y N N N N N Y N N N N N N N N N 483 483 N N V S SF M B NA 14 5 8 140 BK BR T L NA NA 119 W. 104 ST TRANSVERSE ROAD NUMBER FOUR WEST DRIVE NA MANHATTAN NA NA 22 E 14 NA 994617 227693 CM 1 20
2015 20 36 1292015 1745 1 O P 1 MISD 16 V Y N N N 0 0 N N N N N N N N N N N Y N N N N N N N N Y Y N N N N N N N N N N N N N N Y N N N N N Y N N N N N N N N N 483 483 N N V S SF M B NA 14 5 3 140 BK BR T L NA NA STREET 472 COLUMBUS AVENUE WEST 83 STREET WEST 82 STREET NA MANHATTAN NA NA 20 H NA 16 991510 225019 CM 1 78
2015 20 38 1292015 1745 1 O P 1 MIDS 16 V Y Y N N 0 0 N N N N N N N N N N N N N N N N N N N N Y Y N N N N N N N N N N N N Y N Y N N N N N Y N N N N N N N N N 483 483 N N V S M B NA 14 5 9 180 BK BR M L NA NA STREET 472 COLUMBUS AVENUE WEST 83 STREET WEST 82 STREET NA MANHATTAN NA NA 20 H NA 16 991510 225019 CM 1 78
2015 20 41 1292015 1745 1 O P 1 MISD 16 V Y Y N N 0 0 N N N N N N N N N N N Y N N N N N N N N Y Y N N N N N N N N N N N N Y N Y N N N N N Y N N N N N N N N N 483 483 N N V S SF M B NA 13 5 10 160 BK BR M L NA NA STREET 472 COLUMBUS AVENUE WEST 83 STREET WEST 82 STREET NA MANHATTAN NA NA 20 H NA 16 991510 225019 CM 1 78
2015 20 39 1292015 1745 1 O P 1 MISD 16 V Y Y N N 0 0 N N N N N N N N N N N Y N N N N N N N N Y Y N N N N N N N N N N N N Y N Y N N N N N Y N N N N Y N N N N 483 483 N N V S SF M W NA 13 5 2 130 BK BR M L NA NA STREET 472 COLUMBUS AVENUE WEST 83 STREET WEST 82 STREET NA MANHATTAN NA NA 20 H NA 16 991510 225019 CM 1 78
2015 67 122 2062015 2155 1 O P 2 FEL 5 R Y N N N 0 0 N Y N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N N Y N N N N N N Y Y N N N N N N N 186 186 Y N V S M B NA 25 5 9 160 BK BR T L NA NA EAST 93 STREET CLARKSON AVENUE NA BROOKLYN NA NA 67 M NA NA 1005554 179416 CM 1 20
2015 7 37 2072015 2115 1 I H 1 FEL 5 V Y N N N 0 0 N Y N N N N N N N N N N N N N N N N N N N N Y Y N Y N N Y Y Y N N N N Y N N N N N N N Y Y Y N N N N N N 804 804 Y N V S M B NA 15 5 7 150 BR BR T L NA NA LOBBY 521 ROOSEVELT DRIVE RIVINGTON STREET DELANCEY STREET NA MANHATTAN NA NA 7 E 6 NA 990820 200020 CM 1 85
2015 7 42 2122015 1405 1 I T 9 FEL 3 P Y N N N 0 0 N Y N N N N N N N N N N N N N N N Y N N N N N N N N N N N Y Y N N N N N N Y N N N N N Y N N N N N N N N 863 863 Y N V S OR M B NA 23 5 9 160 BK BR T L NA NA MEZZ ESSEX STREET DELANCEY STREET NA MANHATTAN NA NA 7 B 2 NA 987521 201066 CM 1 45
2015 68 9 2142015 200 A O P 1 FEONY 10 V Y Y N N 0 0 Y N N N N N N N N N N N N N N N N N N N Y Y N N N N N N Y N N N N N N N N N N N N N Y N N N N N N N N N 68 68 N N M W NA 16 5 10 150 BR GR M L NA NA STREET 142 93 STREET RIDGE BOULEVARD MARINE AVENUE NA BROOKLYN NA NA 68 G NA NA 974057 164979 CM 1 14

Transform to Tidy Data

2010 Census Data

The format of the census data violates the rule that each observation forms a row. The time series format puts the population projections for July 2010 to July 2014 in one row. Also, the 2010 raw data will not be necessary in our future analysis, so we will remove this as well. By piping the data through some tidyr tools we get the desired result:

census2010 <- census2010 %>%
    gather("year", "pop", 4:8) %>%
    subset(select = c(1,2,4,5)) %>%
    arrange(area)

census2010$census_2010 <- as.numeric(gsub(",", "", census2010$census_2010))
census2010$pop <- as.numeric(gsub(",", "", census2010$pop))
knitr::kable(head(census2010,10))
area census_2010 year pop
Alabama 4779736 10-Jul 4785822
Alabama 4779736 11-Jul 4801695
Alabama 4779736 12-Jul 4817484
Alabama 4779736 13-Jul 4833996
Alabama 4779736 14-Jul 4849377
Alaska 710231 10-Jul 713856
Alaska 710231 11-Jul 722572
Alaska 710231 12-Jul 731081
Alaska 710231 13-Jul 737259
Alaska 710231 14-Jul 736732

Due to the formatting of the data, population values are strings. The regex and casting above coerces the values in to numeric values.

Student Drinking

The issue with the student data is that it is split over two tables depending on what class the student is taking, i.e. math or Portuguese. These tables need to be joined under one table.

# Add major variable to distinguish between 
student_math[,"major"] <- "math"
student_port[,"major"] <- "Portuguese"

student <- rbind(student_math, student_port)

# Size of nearly formed table
nrow(student)
[1] 1044

All the data is combined under the one table, student, with an added variable of major to distinguish between the students in the respective field of study.

Stop and Frisk

The stop and frisk data is not necessarily “untidy” in the sense that it violates any of the components mentioned in the introduction, but it could become more structured for analysis with the tools supplied in dplyr and tidyr. We want to specify and/or generalize some data; namely:

  • Generalize precinct names into boroughs.
  • Isolate observations where an individual was stopped and frisked
  • Specify reason for stop andreason for frisk* into two variables (multiple reasons allowed).
  • Generalize “found” items to be: firearm, non-firearm weapon or contraband (most severe case overrides others, if applicable)
  • Generalize type of force applied into lethal, non-lethal or `none: lethal = gun or baton; non-lethal = physical force; else = none.

This will morph the wide dataset (22563 x 113) into a long dataset (529320 x 7).

snf2015 <- snf2015 %>%
    # Generalize police borough
    mutate(pol_borough = ifelse(
        pct %in% 0:34, "ManH", ifelse(
            pct %in% 35:52, "Bronx", ifelse(
                pct %in% 53:94, "BK", ifelse(
                    pct %in% 95:115, "Qns", ifelse(
                        pct %in% 116:123, "StnI", "other")))))) %>%
    # Generalize "stop"
    gather(stop_reason, stop_applicable, 
           c(49:52, 54, 55, 58, 62:64)) %>%
    # Generalize "frisk"
    gather(frisk_reason, frisk_applicable,
           c(rf_vcrim, rf_othsw, rf_attir, rf_vcact, rf_rfcmp,
             rf_verbl, rf_knowl, rf_furt, rf_bulg)) %>%
    # Isolate reasons for "stop" and\or "frisk"
    subset(stop_applicable == "Y" | frisk_applicable == "Y") %>%
    # Generalize "found" items
    mutate(found = ifelse(
        pistol == "Y" | riflshot == "Y" | asltweap == "Y"|
            machgun == "Y", "firearm", ifelse(
                knifcuti == "Y" | othrweap == "Y",
                "non-firearm_weapon", ifelse(
                    contrabn == "Y", "contraband","none")))) %>%
    #Generalize force used
    mutate(force = ifelse(
        pf_drwep == "Y" | pf_ptwep == "Y" | pf_baton == "Y", "lethal", ifelse(
            pf_hands == "Y" | pf_wall == "Y" | pf_grnd == "Y" | pf_hcuff == "Y" |
                pf_pepsp == "Y" | pf_other == "Y", "non-lethal", "none"))) %>%
    select(pol_borough, stop_reason, frisk_reason, found, force, forceuse, arstmade)

knitr::kable(head(snf2015,10))
pol_borough stop_reason frisk_reason found force forceuse arstmade
ManH cs_objcs rf_vcrim none non-lethal SF N
ManH cs_objcs rf_vcrim none none N
BK cs_objcs rf_vcrim none none N
StnI cs_objcs rf_vcrim none non-lethal OR N
ManH cs_objcs rf_vcrim none none N
Qns cs_objcs rf_vcrim none non-lethal SW N
ManH cs_objcs rf_vcrim contraband non-lethal DS Y
ManH cs_objcs rf_vcrim none non-lethal SW N
BK cs_objcs rf_vcrim none non-lethal OT N
BK cs_objcs rf_vcrim none none N

Analysis with Transformed Data

Census Data

The projected population growths by region was a recommended area to look in to. The plot below shows the population projection form July, 2010 to July 2011 for the respective regions.

census_region <- filter(census2010,
                        area == "Northeast" |
                            area == "Midwest" |
                            area == "South" |
                            area == "West")


ggplot(census_region, aes(x = year, y = pop)) +
    geom_point(size = 3, aes(color = area))

All the rates for population seem to be positive, or at least flat between years. The region with the largest growth is the south. The regional growth is not surprising, but the population differences going on the state level could be drowning out the behavior for the region. Let us look at the differences in the projected population in 2014 relative to the 2010 Census data and look at the largest population growth and the smallest.

census_diff <- filter(census2010, year == "14-Jul" &
                          area != "Northeast" &
                          area != "Midwest" &
                          area != "South" &
                          area != "West" &
                          area != "United States") %>%
    mutate(pop_growth = pop - census_2010) %>%
    select(area, pop_growth) %>%
    arrange(-pop_growth)

# Largest growth
knitr::kable(head(census_diff))
area pop_growth
Texas 1811397
California 1548544
Florida 1091987
Georgia 409690
North Carolina 408481
New York 368125
# Smallest growth
knitr::kable(tail(census_diff))
area pop_growth
47 New Hampshire 10343
48 Rhode Island 2606
49 Maine 1728
50 Vermont 821
51 West Virginia -2668
52 Puerto Rico -177392

Texas, California and Florida have the largest population growth and Puerto Rico, West Virginia and Vermont has the least growth. There are many interesting topics that one could dive into form these observations. For the large gains, Is Texas there because of immigration, is California there because of the tech-boom, is Florida there because of baby-boomers retiring? For the lowest ,on the other hand, is Puerto Rico there because of their poor economic situation, is West Virginia there because of the decline of the the coal mines and job prospects?

If you are curious how other states fared. Took a look at the interactive table below.

datatable(census_diff)

Student Drinking

The relationship to alcohol consumption and the setting one lives in, i.e. rural (R) vs urban (U), was brought up by the student who requested this data. In order to do some analysis, the student’s semester grades will be averaged, grade_avg and the level of drinking the student participated in during the weekday and weekend is added into one variable, alc_sum. These variables will be formed by using tidyr tools to operate on through the dplyr pipeline.

student <- student %>%
    mutate(grade_avg = (G1 + G2 + G3)/3) %>%
    mutate(alc_sum = (Dalc+Walc)) %>%
    select(school, sex, address, alc_sum, major, grade_avg)

The summary statistics for drinking behavior (on a scale from 2 to 10, with 10 being the heaviest consumption) for the rural students is:

# Summary for rural students
summary(student$alc_sum[student$address == "R"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.000   3.000   3.965   5.000  10.000 

And for the urban student it is:

# Summary for urban students
summary(student$alc_sum[student$address == "U"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.000   3.000   3.709   5.000  10.000 

These summary statistics do not bring much insight. Yes, the average consumption for the rural student is more, but the remainder of the statistics are the same. We will plot the density to see if it brings any insights.

ggplot(student, aes(alc_sum)) +
    geom_density(fill = "grey") + facet_wrap(~ address)

From the density plot, it does appear that the percentage of rural (R) students drink more than urban (U) students. This is evident from the sharp, uni-modal point at alc_sum = 2 for the urban students and its shallow spread to the right. The rural student’s density has a nearly linear spread and is not as sharply concave as their fellow urban classmates’ density.

It is apparent that the rural students do consume more alcohol, but let us see if this has any relationship to performance at school. Using summary statistics again on the average semester grade, grade_avg:

# Summary for rural students
summary(student$grade_avg[student$address == "R"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.333   9.000  10.670  10.610  13.000  18.670 

And for the urban student it is:

# Summary for urban students
summary(student$grade_avg[student$address == "U"])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.333   9.667  11.670  11.510  13.670  19.330 

From the summary statistics, it is clear that the urban (U) students outperform their rural (R) classmates.

Since the urban students drink less and outperform their counterparts, is it safe to assume that they do because they drink less alcohol on average during the week? It most definitely does not; correlation does imply causation. To dive in a little further, let us see if there is a clear correlation between the two types of students given the variables discussed above using boxplots.

ggplot(student, aes(factor(alc_sum), grade_avg)) +
    geom_boxplot() + facet_wrap(~ address)

Strictly looking at the urban (U) student, there is a clear negative relationship between large alcohol consumption and grades. Looking at the rural (R) students, however, there is no clear relationship. Correlation may not imply causation, but no correlation does imply no causation.

There are a lot of mitigating factors in the student dataset that could be effecting the performance of the rural students. Since there could be less mitigating factors for urban students, factors such as alcohol consumption may have a stronger effect on school performance. Factors such as levels of the parents’ education and stressful environments at home due to economic conditions may have a larger impact, but it does not appear to be due to alcohol.

Stop and Frisk

First we will look at a plot of the stop and frisk rates for the respective boroughs and then a plot of the reason the individual was stopped.

ggplot(snf2015, aes(factor(pol_borough))) + geom_bar()

ggplot(snf2015, aes(factor(pol_borough), fill = stop_reason)) + geom_bar(position = "dodge")

For the reasons of being stopped above, let us see what was the result of the search, i.e., was anything found.

ggplot(snf2015, aes(factor(pol_borough),fill = found)) + geom_bar(position = "dodge")

count(snf2015,found)
# A tibble: 4 × 2
               found      n
               <chr>  <int>
1         contraband  24089
2            firearm   5310
3 non-firearm_weapon  23852
4               none 476069

It is obvious that nothing was found from most of these stops; there is a found rate of 10.060266%. Let us see the times when something was actually found from these stops based off the reason for search. Also, let’s see what the the highest success rate of finding something based off the stop.

ggplot(subset(snf2015, found != "none"), aes(factor(stop_reason), fill = found)) + geom_bar(position = "dodge")

snf_table <- as.data.frame.matrix(table(snf2015$stop_reason, snf2015$found))
snf_names <- rownames(snf_table)
snf_table <- snf_table %>%
    mutate(success = rowSums(snf_table[,1:3])/rowSums(snf_table)) %>%
    mutate(freq_rank = frank(-rowSums(snf_table))) %>%
    arrange(-success)
rownames(snf_table) <- snf_names
knitr::kable(snf_table)
contraband firearm non-firearm_weapon none success freq_rank
cs_bulge 1980 425 2649 26793 0.1586963 9
cs_casng 1679 687 3300 30938 0.1547918 7
cs_cloth 2888 353 1548 30550 0.1355160 8
cs_descr 3619 807 3159 66567 0.1022899 3
cs_drgtr 1315 363 1473 28097 0.1008385 10
cs_furtv 4311 747 3628 78762 0.0993276 2
cs_lkout 1564 376 1830 34256 0.0991427 6
cs_objcs 1847 366 1629 39494 0.0886561 5
cs_other 2760 784 2852 83552 0.0711078 1
cs_vcrim 2126 402 1784 57060 0.0702601 4

The above table is sorted by the level of success. freq_rank is the rank for how often that that is the reason for the stop. It is important to point out that the most common reasons for stops, e.g., other, furtive behavior and matches a description have a lower success rate for finding something than an observed bulge, casing an area or clothing.

Let us see if the reason for frisking, carried out after the stop, has the same behavior:

ggplot(subset(snf2015, found != "none"), aes(factor(frisk_reason), fill = found)) + geom_bar(position = "dodge")

snf_table <- as.data.frame.matrix(table(snf2015$frisk_reason, snf2015$found))
snf_names <- rownames(snf_table)
snf_table <- snf_table %>%
    mutate(success = rowSums(snf_table[,1:3])/rowSums(snf_table)) %>%
    mutate(freq_rank = frank(-rowSums(snf_table))) %>%
    arrange(-success)
rownames(snf_table) <- snf_names
knitr::kable(snf_table)
contraband firearm non-firearm_weapon none success freq_rank
rf_attir 4249 768 4606 57543 0.1432719 3
rf_bulg 2497 667 3802 45332 0.1331982 5
rf_furt 4123 980 3588 80748 0.0971724 1
rf_knowl 1726 324 1544 34990 0.0931474 9
rf_othsw 2050 413 1805 42598 0.0910682 8
rf_rfcmp 2186 395 1856 44574 0.0905307 7
rf_vcact 2628 598 2033 52838 0.0905210 4
rf_vcrim 2025 432 1948 45205 0.0887926 6
rf_verbl 2605 733 2670 72241 0.0767805 2

The reason for frisking does not follow the same behavior; 2 of the top 3 frequencies have the highest success rate.

From the data above, it appears that stop and frisk has about a 10% success rate. The most common reasons for stopping appear arbitrary and do not yield as large as a success rate as those which are more concrete. A trade-off for the NYPD might be to fore-go some of the low-success, vague reasons for stopping individuals to gain a better relationship in the neighborhoods in which they are.

Liam M. Byrne

2016-10-16