Preparing Data for Downstream Data Analysis
Introduction
Overview
The aim of this project is to clean data, transform it in to a “tidy”" state and then perform some analysis. The data chosen, from fellow students’ suggestions, are population figures for the United States from the 2010 Census and student alcohol consumption from a school in Portugal.
Relevence
The three main components of tidy data, outlined by Hadley Wickham in his paper, Tidy Data are:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
This data structure is a great rule of thumb in order to perform analysis. A lot of data made for human consumption violates these three tenets and is not conducive to computational analysis. Being able to transform this data effectively (in the case of R: using tools such as dplyr and tidyr) is an ever-present challenge for data scientists.
Libraries
Outside of formatting packages, the libraries used for this project are:
data.table: Light-weight, fast development for data structure; not necessarily needed for this data set, but recommended under best practices.
dplyr: For data manipulation.tidyr: To produce tidy data from thedplyrpipeline.ggplot2: Provides some advanced plotting utilities.
Loading and Cleaning the Data
2010 Census Data
The 2010 census data, suggested by Brandon O’Hara, was released by the U.S Census Bureau on December 2014. The data provides annual estimates of the resident population for the United States, regions, states, and Puerto Rico from April 1, 2010 to July 1, 2014. The raw data, which was cleaned of descriptive attachments and merged cells, is available here. After uploading the data, the ten first observations are:
file_link <-
"https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/pop2012.csv"
# Returns a data table
census2010 <- fread(file_link, header = TRUE, na.strings = "", data.table = TRUE)
knitr::kable(head(census2010,10))| area | census_2010 | raw_2010 | 10-Jul | 11-Jul | 12-Jul | 13-Jul | 14-Jul |
|---|---|---|---|---|---|---|---|
| United States | 308,745,538 | 308,758,105 | 309,347,057 | 311,721,632 | 314,112,078 | 316,497,531 | 318,857,056 |
| Northeast | 55,317,240 | 55,318,348 | 55,381,690 | 55,635,670 | 55,832,038 | 56,028,220 | 56,152,333 |
| Midwest | 66,927,001 | 66,929,898 | 66,972,390 | 67,149,657 | 67,331,458 | 67,567,871 | 67,745,108 |
| South | 114,555,744 | 114,562,951 | 114,871,231 | 116,089,908 | 117,346,322 | 118,522,802 | 119,771,934 |
| West | 71,945,553 | 71,946,908 | 72,121,746 | 72,846,397 | 73,602,260 | 74,378,638 | 75,187,681 |
| .Alabama | 4,779,736 | 4,780,127 | 4,785,822 | 4,801,695 | 4,817,484 | 4,833,996 | 4,849,377 |
| .Alaska | 710,231 | 710,249 | 713,856 | 722,572 | 731,081 | 737,259 | 736,732 |
| .Arizona | 6,392,017 | 6,392,310 | 6,411,999 | 6,472,867 | 6,556,236 | 6,634,997 | 6,731,484 |
| .Arkansas | 2,915,918 | 2,915,958 | 2,922,297 | 2,938,430 | 2,949,300 | 2,958,765 | 2,966,369 |
| .California | 37,253,956 | 37,254,503 | 37,336,011 | 37,701,901 | 38,062,780 | 38,431,393 | 38,802,500 |
The 2010 adjusted and raw variables are adjusted census data from April 2010 and the original data gathered, respectively. The Jul-10 to Jul-14 variables are projections based off the adjusted data.
From the raw data, shown below, states are preceded by a period. A simple regex expression will remove these:
census2010$area <- sub("^\\.", "", census2010$area)
knitr::kable(head(census2010, 10))| area | census_2010 | raw_2010 | 10-Jul | 11-Jul | 12-Jul | 13-Jul | 14-Jul |
|---|---|---|---|---|---|---|---|
| United States | 308,745,538 | 308,758,105 | 309,347,057 | 311,721,632 | 314,112,078 | 316,497,531 | 318,857,056 |
| Northeast | 55,317,240 | 55,318,348 | 55,381,690 | 55,635,670 | 55,832,038 | 56,028,220 | 56,152,333 |
| Midwest | 66,927,001 | 66,929,898 | 66,972,390 | 67,149,657 | 67,331,458 | 67,567,871 | 67,745,108 |
| South | 114,555,744 | 114,562,951 | 114,871,231 | 116,089,908 | 117,346,322 | 118,522,802 | 119,771,934 |
| West | 71,945,553 | 71,946,908 | 72,121,746 | 72,846,397 | 73,602,260 | 74,378,638 | 75,187,681 |
| Alabama | 4,779,736 | 4,780,127 | 4,785,822 | 4,801,695 | 4,817,484 | 4,833,996 | 4,849,377 |
| Alaska | 710,231 | 710,249 | 713,856 | 722,572 | 731,081 | 737,259 | 736,732 |
| Arizona | 6,392,017 | 6,392,310 | 6,411,999 | 6,472,867 | 6,556,236 | 6,634,997 | 6,731,484 |
| Arkansas | 2,915,918 | 2,915,958 | 2,922,297 | 2,938,430 | 2,949,300 | 2,958,765 | 2,966,369 |
| California | 37,253,956 | 37,254,503 | 37,336,011 | 37,701,901 | 38,062,780 | 38,431,393 | 38,802,500 |
The first five rows are calculated rows and should not be in the table. To tabulate them manually, especially the regional areas, would be too cumbersome. They will remain in the table due to their availability.
Student Drinking
The alcohol consumption for two Portuguese schools (“GP” - Gabriel Pereira or “MS” - Mousinho da Silveira) is available by area of study, math and Portuguese. This data was recommended by Oluwakemi Omotunde and the data dictionary is available here
#Data for math students
st_math_link <- "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/student/student-mat.csv"
student_math <- fread(st_math_link, header = TRUE, na.strings = "", data.table = TRUE)
knitr::kable(head(student_math))| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 3 | yes | no | yes | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | yes | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | yes | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
| GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 | no | yes | yes | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 |
#Data for Portuguese students
st_port_link <- "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/student/student-por.csv"
student_port <- fread(st_port_link, header = TRUE, na.strings = "", data.table = TRUE)
student_port$sex <- sub("0", "F", student_port$sex)
knitr::kable(head(student_port))| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 0 | yes | no | no | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | no | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | no | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |
| GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 | no | yes | no | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 6 | 12 | 12 | 13 |
Due to the behavior of fread, sex was coerced into logical values for the first couple of rows. This error was handles by the regex expression above.
Stop and Frisk
In metropolitan areas, namely New York City, stop and frisk has become a divisive practice since its ramp-up at the turn of the century. Some have lauded its effectiveness and others have scrutinized its unfair focus on young, minority men. Answering such things is beyond the scope of this assignment, but there are some interesting observations that can be garnered. The NYPD keeps a a log of stop and frisk data between 2003 and 2015 with a “wide” list of variables. The data dictionary is available for download here. The data for the stop and frisk practice in NYC is loaded below:
file_link <-
"https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/stopNfrisk/2015snf.csv"
# Returns a data table
snf2015 <- fread(file_link, header = TRUE, na.strings = " ", data.table = TRUE)
knitr::kable(head(snf2015,10))| year | pct | ser_num | datestop | timestop | recstat | inout | trhsloc | perobs | crimsusp | perstop | typeofid | explnstp | othpers | arstmade | arstoffn | sumissue | sumoffen | compyear | comppct | offunif | officrid | frisked | searched | contrabn | adtlrept | pistol | riflshot | asltweap | knifcuti | machgun | othrweap | pf_hands | pf_wall | pf_grnd | pf_drwep | pf_ptwep | pf_baton | pf_hcuff | pf_pepsp | pf_other | radio | ac_rept | ac_inves | rf_vcrim | rf_othsw | ac_proxm | rf_attir | cs_objcs | cs_descr | cs_casng | cs_lkout | rf_vcact | cs_cloth | cs_drgtr | ac_evasv | ac_assoc | cs_furtv | rf_rfcmp | ac_cgdir | rf_verbl | cs_vcrim | cs_bulge | cs_other | ac_incid | ac_time | rf_knowl | ac_stsnd | ac_other | sb_hdobj | sb_outln | sb_admis | sb_other | repcmd | revcmd | rf_furt | rf_bulg | offverb | offshld | forceuse | sex | race | dob | age | ht_feet | ht_inch | weight | haircolr | eyecolor | build | othfeatr | addrtyp | rescode | premtype | premname | addrnum | stname | stinter | crossst | aptnum | city | state | zip | addrpct | sector | beat | post | xcoord | ycoord | dettypCM | lineCM | detailCM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015 | 61 | 18 | 1012015 | 315 | 1 | O | P | 2 | FELONY | 10 | V | Y | N | N | N | 0 | 0 | N | Y | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | Y | N | N | Y | Y | N | N | N | N | N | N | N | Y | N | N | N | N | N | Y | Y | N | N | N | N | N | N | N | 186 | 186 | N | N | V | S | M | W | NA | 33 | 5 | 11 | 190 | BR | BR | M | L | NA | NA | STREET | AVENUE W | EAST 28 STREET | NA | BROOKLYN | NA | NA | 61 | E | 7 | NA | 1000091 | 156314 | CM | 1 | 14 | |||||||
| 2015 | 22 | 5 | 1152015 | 1747 | A | O | P | 1 | FELONY | 4 | V | Y | N | N | N | 0 | 0 | N | Y | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | Y | Y | N | Y | Y | Y | N | N | Y | N | N | N | N | N | Y | N | Y | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | 483 | 483 | N | N | V | S | SF | M | B | NA | 14 | 5 | 8 | 140 | BK | BR | T | L | NA | NA | 119 W. 104 ST | TRANSVERSE ROAD NUMBER FOUR | WEST DRIVE | NA | MANHATTAN | NA | NA | 22 | E | 14 | NA | 994617 | 227693 | CM | 1 | 20 | ||||||
| 2015 | 20 | 36 | 1292015 | 1745 | 1 | O | P | 1 | MISD | 16 | V | Y | N | N | N | 0 | 0 | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | Y | Y | N | N | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | 483 | 483 | N | N | V | S | SF | M | B | NA | 14 | 5 | 3 | 140 | BK | BR | T | L | NA | NA | STREET | 472 | COLUMBUS AVENUE | WEST 83 STREET | WEST 82 STREET | NA | MANHATTAN | NA | NA | 20 | H | NA | 16 | 991510 | 225019 | CM | 1 | 78 | ||||
| 2015 | 20 | 38 | 1292015 | 1745 | 1 | O | P | 1 | MIDS | 16 | V | Y | Y | N | N | 0 | 0 | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | Y | Y | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | Y | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | 483 | 483 | N | N | V | S | M | B | NA | 14 | 5 | 9 | 180 | BK | BR | M | L | NA | NA | STREET | 472 | COLUMBUS AVENUE | WEST 83 STREET | WEST 82 STREET | NA | MANHATTAN | NA | NA | 20 | H | NA | 16 | 991510 | 225019 | CM | 1 | 78 | |||||
| 2015 | 20 | 41 | 1292015 | 1745 | 1 | O | P | 1 | MISD | 16 | V | Y | Y | N | N | 0 | 0 | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | Y | Y | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | Y | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | 483 | 483 | N | N | V | S | SF | M | B | NA | 13 | 5 | 10 | 160 | BK | BR | M | L | NA | NA | STREET | 472 | COLUMBUS AVENUE | WEST 83 STREET | WEST 82 STREET | NA | MANHATTAN | NA | NA | 20 | H | NA | 16 | 991510 | 225019 | CM | 1 | 78 | ||||
| 2015 | 20 | 39 | 1292015 | 1745 | 1 | O | P | 1 | MISD | 16 | V | Y | Y | N | N | 0 | 0 | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | Y | Y | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | Y | N | N | N | N | N | Y | N | N | N | N | Y | N | N | N | N | 483 | 483 | N | N | V | S | SF | M | W | NA | 13 | 5 | 2 | 130 | BK | BR | M | L | NA | NA | STREET | 472 | COLUMBUS AVENUE | WEST 83 STREET | WEST 82 STREET | NA | MANHATTAN | NA | NA | 20 | H | NA | 16 | 991510 | 225019 | CM | 1 | 78 | ||||
| 2015 | 67 | 122 | 2062015 | 2155 | 1 | O | P | 2 | FEL | 5 | R | Y | N | N | N | 0 | 0 | N | Y | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N | N | Y | Y | N | N | N | N | N | N | N | 186 | 186 | Y | N | V | S | M | B | NA | 25 | 5 | 9 | 160 | BK | BR | T | L | NA | NA | EAST 93 STREET | CLARKSON AVENUE | NA | BROOKLYN | NA | NA | 67 | M | NA | NA | 1005554 | 179416 | CM | 1 | 20 | ||||||||
| 2015 | 7 | 37 | 2072015 | 2115 | 1 | I | H | 1 | FEL | 5 | V | Y | N | N | N | 0 | 0 | N | Y | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | Y | Y | N | Y | N | N | Y | Y | Y | N | N | N | N | Y | N | N | N | N | N | N | N | Y | Y | Y | N | N | N | N | N | N | 804 | 804 | Y | N | V | S | M | B | NA | 15 | 5 | 7 | 150 | BR | BR | T | L | NA | NA | LOBBY | 521 | ROOSEVELT DRIVE | RIVINGTON STREET | DELANCEY STREET | NA | MANHATTAN | NA | NA | 7 | E | 6 | NA | 990820 | 200020 | CM | 1 | 85 | |||||
| 2015 | 7 | 42 | 2122015 | 1405 | 1 | I | T | 9 | FEL | 3 | P | Y | N | N | N | 0 | 0 | N | Y | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | N | N | Y | Y | N | N | N | N | N | N | Y | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | 863 | 863 | Y | N | V | S | OR | M | B | NA | 23 | 5 | 9 | 160 | BK | BR | T | L | NA | NA | MEZZ | ESSEX STREET | DELANCEY STREET | NA | MANHATTAN | NA | NA | 7 | B | 2 | NA | 987521 | 201066 | CM | 1 | 45 | ||||||
| 2015 | 68 | 9 | 2142015 | 200 | A | O | P | 1 | FEONY | 10 | V | Y | Y | N | N | 0 | 0 | Y | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | Y | Y | N | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | N | N | N | N | Y | N | N | N | N | N | N | N | N | N | 68 | 68 | N | N | M | W | NA | 16 | 5 | 10 | 150 | BR | GR | M | L | NA | NA | STREET | 142 | 93 STREET | RIDGE BOULEVARD | MARINE AVENUE | NA | BROOKLYN | NA | NA | 68 | G | NA | NA | 974057 | 164979 | CM | 1 | 14 |
Transform to Tidy Data
2010 Census Data
The format of the census data violates the rule that each observation forms a row. The time series format puts the population projections for July 2010 to July 2014 in one row. Also, the 2010 raw data will not be necessary in our future analysis, so we will remove this as well. By piping the data through some tidyr tools we get the desired result:
census2010 <- census2010 %>%
gather("year", "pop", 4:8) %>%
subset(select = c(1,2,4,5)) %>%
arrange(area)
census2010$census_2010 <- as.numeric(gsub(",", "", census2010$census_2010))
census2010$pop <- as.numeric(gsub(",", "", census2010$pop))
knitr::kable(head(census2010,10))| area | census_2010 | year | pop |
|---|---|---|---|
| Alabama | 4779736 | 10-Jul | 4785822 |
| Alabama | 4779736 | 11-Jul | 4801695 |
| Alabama | 4779736 | 12-Jul | 4817484 |
| Alabama | 4779736 | 13-Jul | 4833996 |
| Alabama | 4779736 | 14-Jul | 4849377 |
| Alaska | 710231 | 10-Jul | 713856 |
| Alaska | 710231 | 11-Jul | 722572 |
| Alaska | 710231 | 12-Jul | 731081 |
| Alaska | 710231 | 13-Jul | 737259 |
| Alaska | 710231 | 14-Jul | 736732 |
Due to the formatting of the data, population values are strings. The regex and casting above coerces the values in to numeric values.
Student Drinking
The issue with the student data is that it is split over two tables depending on what class the student is taking, i.e. math or Portuguese. These tables need to be joined under one table.
# Add major variable to distinguish between
student_math[,"major"] <- "math"
student_port[,"major"] <- "Portuguese"
student <- rbind(student_math, student_port)
# Size of nearly formed table
nrow(student)[1] 1044
All the data is combined under the one table, student, with an added variable of major to distinguish between the students in the respective field of study.
Stop and Frisk
The stop and frisk data is not necessarily “untidy” in the sense that it violates any of the components mentioned in the introduction, but it could become more structured for analysis with the tools supplied in dplyr and tidyr. We want to specify and/or generalize some data; namely:
- Generalize precinct names into boroughs.
- Isolate observations where an individual was stopped and frisked
- Specify reason for stop andreason for frisk* into two variables (multiple reasons allowed).
- Generalize “found” items to be:
firearm,non-firearm weaponorcontraband(most severe case overrides others, if applicable) - Generalize type of
forceapplied intolethal,non-lethalor `none: lethal = gun or baton; non-lethal = physical force; else = none.
This will morph the wide dataset (22563 x 113) into a long dataset (529320 x 7).
snf2015 <- snf2015 %>%
# Generalize police borough
mutate(pol_borough = ifelse(
pct %in% 0:34, "ManH", ifelse(
pct %in% 35:52, "Bronx", ifelse(
pct %in% 53:94, "BK", ifelse(
pct %in% 95:115, "Qns", ifelse(
pct %in% 116:123, "StnI", "other")))))) %>%
# Generalize "stop"
gather(stop_reason, stop_applicable,
c(49:52, 54, 55, 58, 62:64)) %>%
# Generalize "frisk"
gather(frisk_reason, frisk_applicable,
c(rf_vcrim, rf_othsw, rf_attir, rf_vcact, rf_rfcmp,
rf_verbl, rf_knowl, rf_furt, rf_bulg)) %>%
# Isolate reasons for "stop" and\or "frisk"
subset(stop_applicable == "Y" | frisk_applicable == "Y") %>%
# Generalize "found" items
mutate(found = ifelse(
pistol == "Y" | riflshot == "Y" | asltweap == "Y"|
machgun == "Y", "firearm", ifelse(
knifcuti == "Y" | othrweap == "Y",
"non-firearm_weapon", ifelse(
contrabn == "Y", "contraband","none")))) %>%
#Generalize force used
mutate(force = ifelse(
pf_drwep == "Y" | pf_ptwep == "Y" | pf_baton == "Y", "lethal", ifelse(
pf_hands == "Y" | pf_wall == "Y" | pf_grnd == "Y" | pf_hcuff == "Y" |
pf_pepsp == "Y" | pf_other == "Y", "non-lethal", "none"))) %>%
select(pol_borough, stop_reason, frisk_reason, found, force, forceuse, arstmade)
knitr::kable(head(snf2015,10))| pol_borough | stop_reason | frisk_reason | found | force | forceuse | arstmade |
|---|---|---|---|---|---|---|
| ManH | cs_objcs | rf_vcrim | none | non-lethal | SF | N |
| ManH | cs_objcs | rf_vcrim | none | none | N | |
| BK | cs_objcs | rf_vcrim | none | none | N | |
| StnI | cs_objcs | rf_vcrim | none | non-lethal | OR | N |
| ManH | cs_objcs | rf_vcrim | none | none | N | |
| Qns | cs_objcs | rf_vcrim | none | non-lethal | SW | N |
| ManH | cs_objcs | rf_vcrim | contraband | non-lethal | DS | Y |
| ManH | cs_objcs | rf_vcrim | none | non-lethal | SW | N |
| BK | cs_objcs | rf_vcrim | none | non-lethal | OT | N |
| BK | cs_objcs | rf_vcrim | none | none | N |
Analysis with Transformed Data
Census Data
The projected population growths by region was a recommended area to look in to. The plot below shows the population projection form July, 2010 to July 2011 for the respective regions.
census_region <- filter(census2010,
area == "Northeast" |
area == "Midwest" |
area == "South" |
area == "West")
ggplot(census_region, aes(x = year, y = pop)) +
geom_point(size = 3, aes(color = area))All the rates for population seem to be positive, or at least flat between years. The region with the largest growth is the south. The regional growth is not surprising, but the population differences going on the state level could be drowning out the behavior for the region. Let us look at the differences in the projected population in 2014 relative to the 2010 Census data and look at the largest population growth and the smallest.
census_diff <- filter(census2010, year == "14-Jul" &
area != "Northeast" &
area != "Midwest" &
area != "South" &
area != "West" &
area != "United States") %>%
mutate(pop_growth = pop - census_2010) %>%
select(area, pop_growth) %>%
arrange(-pop_growth)
# Largest growth
knitr::kable(head(census_diff))| area | pop_growth |
|---|---|
| Texas | 1811397 |
| California | 1548544 |
| Florida | 1091987 |
| Georgia | 409690 |
| North Carolina | 408481 |
| New York | 368125 |
# Smallest growth
knitr::kable(tail(census_diff))| area | pop_growth | |
|---|---|---|
| 47 | New Hampshire | 10343 |
| 48 | Rhode Island | 2606 |
| 49 | Maine | 1728 |
| 50 | Vermont | 821 |
| 51 | West Virginia | -2668 |
| 52 | Puerto Rico | -177392 |
Texas, California and Florida have the largest population growth and Puerto Rico, West Virginia and Vermont has the least growth. There are many interesting topics that one could dive into form these observations. For the large gains, Is Texas there because of immigration, is California there because of the tech-boom, is Florida there because of baby-boomers retiring? For the lowest ,on the other hand, is Puerto Rico there because of their poor economic situation, is West Virginia there because of the decline of the the coal mines and job prospects?
If you are curious how other states fared. Took a look at the interactive table below.
datatable(census_diff)Student Drinking
The relationship to alcohol consumption and the setting one lives in, i.e. rural (R) vs urban (U), was brought up by the student who requested this data. In order to do some analysis, the student’s semester grades will be averaged, grade_avg and the level of drinking the student participated in during the weekday and weekend is added into one variable, alc_sum. These variables will be formed by using tidyr tools to operate on through the dplyr pipeline.
student <- student %>%
mutate(grade_avg = (G1 + G2 + G3)/3) %>%
mutate(alc_sum = (Dalc+Walc)) %>%
select(school, sex, address, alc_sum, major, grade_avg)The summary statistics for drinking behavior (on a scale from 2 to 10, with 10 being the heaviest consumption) for the rural students is:
# Summary for rural students
summary(student$alc_sum[student$address == "R"]) Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 3.000 3.965 5.000 10.000
And for the urban student it is:
# Summary for urban students
summary(student$alc_sum[student$address == "U"]) Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 3.000 3.709 5.000 10.000
These summary statistics do not bring much insight. Yes, the average consumption for the rural student is more, but the remainder of the statistics are the same. We will plot the density to see if it brings any insights.
ggplot(student, aes(alc_sum)) +
geom_density(fill = "grey") + facet_wrap(~ address)From the density plot, it does appear that the percentage of rural (R) students drink more than urban (U) students. This is evident from the sharp, uni-modal point at alc_sum = 2 for the urban students and its shallow spread to the right. The rural student’s density has a nearly linear spread and is not as sharply concave as their fellow urban classmates’ density.
It is apparent that the rural students do consume more alcohol, but let us see if this has any relationship to performance at school. Using summary statistics again on the average semester grade, grade_avg:
# Summary for rural students
summary(student$grade_avg[student$address == "R"]) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.333 9.000 10.670 10.610 13.000 18.670
And for the urban student it is:
# Summary for urban students
summary(student$grade_avg[student$address == "U"]) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.333 9.667 11.670 11.510 13.670 19.330
From the summary statistics, it is clear that the urban (U) students outperform their rural (R) classmates.
Since the urban students drink less and outperform their counterparts, is it safe to assume that they do because they drink less alcohol on average during the week? It most definitely does not; correlation does imply causation. To dive in a little further, let us see if there is a clear correlation between the two types of students given the variables discussed above using boxplots.
ggplot(student, aes(factor(alc_sum), grade_avg)) +
geom_boxplot() + facet_wrap(~ address)Strictly looking at the urban (U) student, there is a clear negative relationship between large alcohol consumption and grades. Looking at the rural (R) students, however, there is no clear relationship. Correlation may not imply causation, but no correlation does imply no causation.
There are a lot of mitigating factors in the student dataset that could be effecting the performance of the rural students. Since there could be less mitigating factors for urban students, factors such as alcohol consumption may have a stronger effect on school performance. Factors such as levels of the parents’ education and stressful environments at home due to economic conditions may have a larger impact, but it does not appear to be due to alcohol.
Stop and Frisk
First we will look at a plot of the stop and frisk rates for the respective boroughs and then a plot of the reason the individual was stopped.
ggplot(snf2015, aes(factor(pol_borough))) + geom_bar()ggplot(snf2015, aes(factor(pol_borough), fill = stop_reason)) + geom_bar(position = "dodge")For the reasons of being stopped above, let us see what was the result of the search, i.e., was anything found.
ggplot(snf2015, aes(factor(pol_borough),fill = found)) + geom_bar(position = "dodge")count(snf2015,found)# A tibble: 4 × 2
found n
<chr> <int>
1 contraband 24089
2 firearm 5310
3 non-firearm_weapon 23852
4 none 476069
It is obvious that nothing was found from most of these stops; there is a found rate of 10.060266%. Let us see the times when something was actually found from these stops based off the reason for search. Also, let’s see what the the highest success rate of finding something based off the stop.
ggplot(subset(snf2015, found != "none"), aes(factor(stop_reason), fill = found)) + geom_bar(position = "dodge")snf_table <- as.data.frame.matrix(table(snf2015$stop_reason, snf2015$found))
snf_names <- rownames(snf_table)
snf_table <- snf_table %>%
mutate(success = rowSums(snf_table[,1:3])/rowSums(snf_table)) %>%
mutate(freq_rank = frank(-rowSums(snf_table))) %>%
arrange(-success)
rownames(snf_table) <- snf_names
knitr::kable(snf_table)| contraband | firearm | non-firearm_weapon | none | success | freq_rank | |
|---|---|---|---|---|---|---|
| cs_bulge | 1980 | 425 | 2649 | 26793 | 0.1586963 | 9 |
| cs_casng | 1679 | 687 | 3300 | 30938 | 0.1547918 | 7 |
| cs_cloth | 2888 | 353 | 1548 | 30550 | 0.1355160 | 8 |
| cs_descr | 3619 | 807 | 3159 | 66567 | 0.1022899 | 3 |
| cs_drgtr | 1315 | 363 | 1473 | 28097 | 0.1008385 | 10 |
| cs_furtv | 4311 | 747 | 3628 | 78762 | 0.0993276 | 2 |
| cs_lkout | 1564 | 376 | 1830 | 34256 | 0.0991427 | 6 |
| cs_objcs | 1847 | 366 | 1629 | 39494 | 0.0886561 | 5 |
| cs_other | 2760 | 784 | 2852 | 83552 | 0.0711078 | 1 |
| cs_vcrim | 2126 | 402 | 1784 | 57060 | 0.0702601 | 4 |
The above table is sorted by the level of success. freq_rank is the rank for how often that that is the reason for the stop. It is important to point out that the most common reasons for stops, e.g., other, furtive behavior and matches a description have a lower success rate for finding something than an observed bulge, casing an area or clothing.
Let us see if the reason for frisking, carried out after the stop, has the same behavior:
ggplot(subset(snf2015, found != "none"), aes(factor(frisk_reason), fill = found)) + geom_bar(position = "dodge")snf_table <- as.data.frame.matrix(table(snf2015$frisk_reason, snf2015$found))
snf_names <- rownames(snf_table)
snf_table <- snf_table %>%
mutate(success = rowSums(snf_table[,1:3])/rowSums(snf_table)) %>%
mutate(freq_rank = frank(-rowSums(snf_table))) %>%
arrange(-success)
rownames(snf_table) <- snf_names
knitr::kable(snf_table)| contraband | firearm | non-firearm_weapon | none | success | freq_rank | |
|---|---|---|---|---|---|---|
| rf_attir | 4249 | 768 | 4606 | 57543 | 0.1432719 | 3 |
| rf_bulg | 2497 | 667 | 3802 | 45332 | 0.1331982 | 5 |
| rf_furt | 4123 | 980 | 3588 | 80748 | 0.0971724 | 1 |
| rf_knowl | 1726 | 324 | 1544 | 34990 | 0.0931474 | 9 |
| rf_othsw | 2050 | 413 | 1805 | 42598 | 0.0910682 | 8 |
| rf_rfcmp | 2186 | 395 | 1856 | 44574 | 0.0905307 | 7 |
| rf_vcact | 2628 | 598 | 2033 | 52838 | 0.0905210 | 4 |
| rf_vcrim | 2025 | 432 | 1948 | 45205 | 0.0887926 | 6 |
| rf_verbl | 2605 | 733 | 2670 | 72241 | 0.0767805 | 2 |
The reason for frisking does not follow the same behavior; 2 of the top 3 frequencies have the highest success rate.
From the data above, it appears that stop and frisk has about a 10% success rate. The most common reasons for stopping appear arbitrary and do not yield as large as a success rate as those which are more concrete. A trade-off for the NYPD might be to fore-go some of the low-success, vague reasons for stopping individuals to gain a better relationship in the neighborhoods in which they are.