Preparing Data for Downstream Data Analysis

Introduction
Loading and Cleaning the Data
Transform to Tidy Data
Analysis with Transformed Data

Introduction

Overview

The aim of this project is to clean data, transform it in to a “tidy”" state and then perform some analysis. The data chosen, from fellow students’ suggestions, are population figures for the United States from the 2010 Census and student alcohol consumption from a school in Portugal.

Relevence

The three main components of tidy data, outlined by Hadley Wickham in his paper, Tidy Data are:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

This data structure is a great rule of thumb in order to perform analysis. A lot of data made for human consumption violates these three tenets and is not conducive to computational analysis. Being able to transform this data effectively (in the case of R: using tools such as dplyr and tidyr) is an ever-present challenge for data scientists.

Libraries

Outside of formatting packages, the libraries used for this project are:

data.table: Light-weight, fast development for data structure; not necessarily needed for this data set, but recommended under best practices.
dplyr: For data manipulation.
tidyr: To produce tidy data from the dplyr pipeline.
ggplot2: Provides some advanced plotting utilities.

Loading and Cleaning the Data

2010 Census Data

The 2010 census data, suggested by Brandon O’Hara, was released by the U.S Census Bureau on December 2014. The data provides annual estimates of the resident population for the United States, regions, states, and Puerto Rico from April 1, 2010 to July 1, 2014. The raw data, which was cleaned of descriptive attachments and merged cells, is available here. After uploading the data, the ten first observations are:

file_link <- 
  "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/pop2012.csv"

# Returns a data table
census2010  <- fread(file_link, header = TRUE, na.strings = "", data.table = TRUE)
knitr::kable(head(census2010,10))

area	census_2010	raw_2010	10-Jul	11-Jul	12-Jul	13-Jul	14-Jul
United States	308,745,538	308,758,105	309,347,057	311,721,632	314,112,078	316,497,531	318,857,056
Northeast	55,317,240	55,318,348	55,381,690	55,635,670	55,832,038	56,028,220	56,152,333
Midwest	66,927,001	66,929,898	66,972,390	67,149,657	67,331,458	67,567,871	67,745,108
South	114,555,744	114,562,951	114,871,231	116,089,908	117,346,322	118,522,802	119,771,934
West	71,945,553	71,946,908	72,121,746	72,846,397	73,602,260	74,378,638	75,187,681
.Alabama	4,779,736	4,780,127	4,785,822	4,801,695	4,817,484	4,833,996	4,849,377
.Alaska	710,231	710,249	713,856	722,572	731,081	737,259	736,732
.Arizona	6,392,017	6,392,310	6,411,999	6,472,867	6,556,236	6,634,997	6,731,484
.Arkansas	2,915,918	2,915,958	2,922,297	2,938,430	2,949,300	2,958,765	2,966,369
.California	37,253,956	37,254,503	37,336,011	37,701,901	38,062,780	38,431,393	38,802,500

The 2010 adjusted and raw variables are adjusted census data from April 2010 and the original data gathered, respectively. The Jul-10 to Jul-14 variables are projections based off the adjusted data.

From the raw data, shown below, states are preceded by a period. A simple regex expression will remove these:

census2010$area <- sub("^\\.", "", census2010$area)

knitr::kable(head(census2010, 10))

area	census_2010	raw_2010	10-Jul	11-Jul	12-Jul	13-Jul	14-Jul
United States	308,745,538	308,758,105	309,347,057	311,721,632	314,112,078	316,497,531	318,857,056
Northeast	55,317,240	55,318,348	55,381,690	55,635,670	55,832,038	56,028,220	56,152,333
Midwest	66,927,001	66,929,898	66,972,390	67,149,657	67,331,458	67,567,871	67,745,108
South	114,555,744	114,562,951	114,871,231	116,089,908	117,346,322	118,522,802	119,771,934
West	71,945,553	71,946,908	72,121,746	72,846,397	73,602,260	74,378,638	75,187,681
Alabama	4,779,736	4,780,127	4,785,822	4,801,695	4,817,484	4,833,996	4,849,377
Alaska	710,231	710,249	713,856	722,572	731,081	737,259	736,732
Arizona	6,392,017	6,392,310	6,411,999	6,472,867	6,556,236	6,634,997	6,731,484
Arkansas	2,915,918	2,915,958	2,922,297	2,938,430	2,949,300	2,958,765	2,966,369
California	37,253,956	37,254,503	37,336,011	37,701,901	38,062,780	38,431,393	38,802,500

The first five rows are calculated rows and should not be in the table. To tabulate them manually, especially the regional areas, would be too cumbersome. They will remain in the table due to their availability.

Student Drinking

The alcohol consumption for two Portuguese schools (“GP” - Gabriel Pereira or “MS” - Mousinho da Silveira) is available by area of study, math and Portuguese. This data was recommended by Oluwakemi Omotunde and the data dictionary is available here

#Data for math students
st_math_link <- "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/student/student-mat.csv"
student_math <- fread(st_math_link, header = TRUE, na.strings = "", data.table = TRUE)
knitr::kable(head(student_math))

school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian	traveltime	studytime	failures	schoolsup	famsup	paid	activities	nursery	higher	internet	romantic	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
GP	F	18	U	GT3	A	4	4	at_home	teacher	course	mother	2	2	0	yes	no	no	no	yes	yes	no	no	4	3	4	1	1	3	6	5	6	6
GP	F	17	U	GT3	T	1	1	at_home	other	course	father	1	2	0	no	yes	no	no	no	yes	yes	no	5	3	3	1	1	3	4	5	5	6
GP	F	15	U	LE3	T	1	1	at_home	other	other	mother	1	2	3	yes	no	yes	no	yes	yes	yes	no	4	3	2	2	3	3	10	7	8	10
GP	F	15	U	GT3	T	4	2	health	services	home	mother	1	3	0	no	yes	yes	yes	yes	yes	yes	yes	3	2	2	1	1	5	2	15	14	15
GP	F	16	U	GT3	T	3	3	other	other	home	father	1	2	0	no	yes	yes	no	yes	yes	no	no	4	3	2	1	2	5	4	6	10	10
GP	M	16	U	LE3	T	4	3	services	other	reputation	mother	1	2	0	no	yes	yes	yes	yes	yes	yes	no	5	4	2	1	2	5	10	15	15	15

#Data for Portuguese students
st_port_link <- "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/student/student-por.csv"
student_port <- fread(st_port_link, header = TRUE, na.strings = "", data.table = TRUE)

student_port$sex <- sub("0", "F", student_port$sex)

knitr::kable(head(student_port))

school	sex	age	address	famsize	Pstatus	Medu	Fedu	Mjob	Fjob	reason	guardian	traveltime	studytime	schoolsup	famsup	paid	activities	nursery	higher	internet	romantic	famrel	freetime	goout	Dalc	Walc	health	absences	G1	G2	G3
GP	F	18	U	GT3	A	4	4	at_home	teacher	course	mother	2	2	yes	no	no	no	yes	yes	no	no	4	3	4	1	1	3	4	0	11	11
GP	F	17	U	GT3	T	1	1	at_home	other	course	father	1	2	no	yes	no	no	no	yes	yes	no	5	3	3	1	1	3	2	9	11	11
GP	F	15	U	LE3	T	1	1	at_home	other	other	mother	1	2	yes	no	no	no	yes	yes	yes	no	4	3	2	2	3	3	6	12	13	12
GP	F	15	U	GT3	T	4	2	health	services	home	mother	1	3	no	yes	no	yes	yes	yes	yes	yes	3	2	2	1	1	5	0	14	14	14
GP	F	16	U	GT3	T	3	3	other	other	home	father	1	2	no	yes	no	no	yes	yes	no	no	4	3	2	1	2	5	0	11	13	13
GP	M	16	U	LE3	T	4	3	services	other	reputation	mother	1	2	no	yes	no	yes	yes	yes	yes	no	5	4	2	1	2	5	6	12	12	13

Due to the behavior of fread, sex was coerced into logical values for the first couple of rows. This error was handles by the regex expression above.

Stop and Frisk

In metropolitan areas, namely New York City, stop and frisk has become a divisive practice since its ramp-up at the turn of the century. Some have lauded its effectiveness and others have scrutinized its unfair focus on young, minority men. Answering such things is beyond the scope of this assignment, but there are some interesting observations that can be garnered. The NYPD keeps a a log of stop and frisk data between 2003 and 2015 with a “wide” list of variables. The data dictionary is available for download here. The data for the stop and frisk practice in NYC is loaded below:

file_link <- 
   "https://raw.githubusercontent.com/Liam-O/DATA607/master/Project2/stopNfrisk/2015snf.csv"

# Returns a data table
snf2015  <- fread(file_link, header = TRUE, na.strings = " ", data.table = TRUE)
knitr::kable(head(snf2015,10))

year	pct	ser_num	datestop	timestop	recstat	inout	trhsloc	perobs	crimsusp	perstop	typeofid	explnstp	othpers	arstmade	sumissue	offunif	frisked	searched	contrabn	adtlrept	pistol	riflshot	asltweap	knifcuti	machgun	othrweap	pf_hands	pf_wall	pf_grnd	pf_drwep	pf_ptwep	pf_baton	pf_hcuff	pf_pepsp	pf_other	radio	ac_rept	ac_inves	rf_vcrim	rf_othsw	ac_proxm	rf_attir	cs_objcs	cs_descr	cs_casng	cs_lkout	rf_vcact	cs_cloth	cs_drgtr	ac_evasv	ac_assoc	cs_furtv	rf_rfcmp	ac_cgdir	rf_verbl	cs_vcrim	cs_bulge	cs_other	ac_incid	ac_time	rf_knowl	ac_stsnd	ac_other	sb_hdobj	sb_outln	sb_admis	sb_other	repcmd	revcmd	rf_furt	rf_bulg	offverb	offshld	forceuse	sex	race	dob	age	ht_feet	ht_inch	weight	haircolr	eyecolor	build	addrtyp	rescode	premtype	premname	addrnum	stname	stinter	crossst	aptnum	city	state	zip	addrpct	sector	beat	post	xcoord	ycoord	dettypCM	lineCM	detailCM
2015	61	18	1012015	315	1	O	P	2	FELONY	10	V	Y	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	N	Y	N	N	Y	Y	N	N	N	N	N	N	N	Y	N	N	N	N	N	Y	Y	N	N	N	N	N	N	N	186	186	N	N	V	S		M	W	NA	33	5	11	190	BR	BR	M	L	NA	NA	STREET			AVENUE W	EAST 28 STREET	NA	BROOKLYN	NA	NA	61	E	7	NA	1000091	156314	CM	1	14
2015	22	5	1152015	1747	A	O	P	1	FELONY	4	V	Y	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	Y	Y	N	Y	Y	Y	N	N	Y	N	N	N	N	N	Y	N	Y	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	483	483	N	N	V	S	SF	M	B	NA	14	5	8	140	BK	BR	T	L	NA	NA	119 W. 104 ST			TRANSVERSE ROAD NUMBER FOUR	WEST DRIVE	NA	MANHATTAN	NA	NA	22	E	14	NA	994617	227693	CM	1	20
2015	20	36	1292015	1745	1	O	P	1	MISD	16	V	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	Y	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	483	483	N	N	V	S	SF	M	B	NA	14	5	3	140	BK	BR	T	L	NA	NA	STREET	472	COLUMBUS AVENUE	WEST 83 STREET	WEST 82 STREET	NA	MANHATTAN	NA	NA	20	H	NA	16	991510	225019	CM	1	78
2015	20	38	1292015	1745	1	O	P	1	MIDS	16	V	Y	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	Y	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	Y	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	483	483	N	N	V	S		M	B	NA	14	5	9	180	BK	BR	M	L	NA	NA	STREET	472	COLUMBUS AVENUE	WEST 83 STREET	WEST 82 STREET	NA	MANHATTAN	NA	NA	20	H	NA	16	991510	225019	CM	1	78
2015	20	41	1292015	1745	1	O	P	1	MISD	16	V	Y	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	Y	Y	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	Y	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	483	483	N	N	V	S	SF	M	B	NA	13	5	10	160	BK	BR	M	L	NA	NA	STREET	472	COLUMBUS AVENUE	WEST 83 STREET	WEST 82 STREET	NA	MANHATTAN	NA	NA	20	H	NA	16	991510	225019	CM	1	78
2015	20	39	1292015	1745	1	O	P	1	MISD	16	V	Y	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	Y	Y	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	Y	N	N	N	N	N	Y	N	N	N	N	Y	N	N	N	N	483	483	N	N	V	S	SF	M	W	NA	13	5	2	130	BK	BR	M	L	NA	NA	STREET	472	COLUMBUS AVENUE	WEST 83 STREET	WEST 82 STREET	NA	MANHATTAN	NA	NA	20	H	NA	16	991510	225019	CM	1	78
2015	67	122	2062015	2155	1	O	P	2	FEL	5	R	Y	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	N	N	N	N	Y	Y	N	N	N	N	N	N	N	186	186	Y	N	V	S		M	B	NA	25	5	9	160	BK	BR	T	L	NA	NA				EAST 93 STREET	CLARKSON AVENUE	NA	BROOKLYN	NA	NA	67	M	NA	NA	1005554	179416	CM	1	20
2015	7	37	2072015	2115	1	I	H	1	FEL	5	V	Y	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	Y	N	Y	N	N	Y	Y	Y	N	N	N	N	Y	N	N	N	N	N	N	N	Y	Y	Y	N	N	N	N	N	N	804	804	Y	N	V	S		M	B	NA	15	5	7	150	BR	BR	T	L	NA	NA	LOBBY	521	ROOSEVELT DRIVE	RIVINGTON STREET	DELANCEY STREET	NA	MANHATTAN	NA	NA	7	E	6	NA	990820	200020	CM	1	85
2015	7	42	2122015	1405	1	I	T	9	FEL	3	P	Y	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	N	N	Y	Y	N	N	N	N	N	N	Y	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	863	863	Y	N	V	S	OR	M	B	NA	23	5	9	160	BK	BR	T	L	NA	NA	MEZZ			ESSEX STREET	DELANCEY STREET	NA	MANHATTAN	NA	NA	7	B	2	NA	987521	201066	CM	1	45
2015	68	9	2142015	200	A	O	P	1	FEONY	10	V	Y	Y	N	N	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	Y	N	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	N	N	N	N	Y	N	N	N	N	N	N	N	N	N	68	68	N	N				M	W	NA	16	5	10	150	BR	GR	M	L	NA	NA	STREET	142	93 STREET	RIDGE BOULEVARD	MARINE AVENUE	NA	BROOKLYN	NA	NA	68	G	NA	NA	974057	164979	CM	1	14

Transform to Tidy Data

2010 Census Data

The format of the census data violates the rule that each observation forms a row. The time series format puts the population projections for July 2010 to July 2014 in one row. Also, the 2010 raw data will not be necessary in our future analysis, so we will remove this as well. By piping the data through some tidyr tools we get the desired result:

census2010 <- census2010 %>%
    gather("year", "pop", 4:8) %>%
    subset(select = c(1,2,4,5)) %>%
    arrange(area)

census2010$census_2010 <- as.numeric(gsub(",", "", census2010$census_2010))
census2010$pop <- as.numeric(gsub(",", "", census2010$pop))
knitr::kable(head(census2010,10))

area	census_2010	year	pop
Alabama	4779736	10-Jul	4785822
Alabama	4779736	11-Jul	4801695
Alabama	4779736	12-Jul	4817484
Alabama	4779736	13-Jul	4833996
Alabama	4779736	14-Jul	4849377
Alaska	710231	10-Jul	713856
Alaska	710231	11-Jul	722572
Alaska	710231	12-Jul	731081
Alaska	710231	13-Jul	737259
Alaska	710231	14-Jul	736732

Due to the formatting of the data, population values are strings. The regex and casting above coerces the values in to numeric values.

Student Drinking

The issue with the student data is that it is split over two tables depending on what class the student is taking, i.e. math or Portuguese. These tables need to be joined under one table.

# Add major variable to distinguish between 
student_math[,"major"] <- "math"
student_port[,"major"] <- "Portuguese"

student <- rbind(student_math, student_port)

# Size of nearly formed table
nrow(student)

[1] 1044

All the data is combined under the one table, student, with an added variable of major to distinguish between the students in the respective field of study.

Stop and Frisk

The stop and frisk data is not necessarily “untidy” in the sense that it violates any of the components mentioned in the introduction, but it could become more structured for analysis with the tools supplied in dplyr and tidyr. We want to specify and/or generalize some data; namely:

Generalize precinct names into boroughs.
Isolate observations where an individual was stopped and frisked
Specify reason for stop andreason for frisk* into two variables (multiple reasons allowed).
Generalize “found” items to be: firearm, non-firearm weapon or contraband (most severe case overrides others, if applicable)
Generalize type of force applied into lethal, non-lethal or `none: lethal = gun or baton; non-lethal = physical force; else = none.

This will morph the wide dataset (22563 x 113) into a long dataset (529320 x 7).

snf2015 <- snf2015 %>%
    # Generalize police borough
    mutate(pol_borough = ifelse(
        pct %in% 0:34, "ManH", ifelse(
            pct %in% 35:52, "Bronx", ifelse(
                pct %in% 53:94, "BK", ifelse(
                    pct %in% 95:115, "Qns", ifelse(
                        pct %in% 116:123, "StnI", "other")))))) %>%
    # Generalize "stop"
    gather(stop_reason, stop_applicable, 
           c(49:52, 54, 55, 58, 62:64)) %>%
    # Generalize "frisk"
    gather(frisk_reason, frisk_applicable,
           c(rf_vcrim, rf_othsw, rf_attir, rf_vcact, rf_rfcmp,
             rf_verbl, rf_knowl, rf_furt, rf_bulg)) %>%
    # Isolate reasons for "stop" and\or "frisk"
    subset(stop_applicable == "Y" | frisk_applicable == "Y") %>%
    # Generalize "found" items
    mutate(found = ifelse(
        pistol == "Y" | riflshot == "Y" | asltweap == "Y"|
            machgun == "Y", "firearm", ifelse(
                knifcuti == "Y" | othrweap == "Y",
                "non-firearm_weapon", ifelse(
                    contrabn == "Y", "contraband","none")))) %>%
    #Generalize force used
    mutate(force = ifelse(
        pf_drwep == "Y" | pf_ptwep == "Y" | pf_baton == "Y", "lethal", ifelse(
            pf_hands == "Y" | pf_wall == "Y" | pf_grnd == "Y" | pf_hcuff == "Y" |
                pf_pepsp == "Y" | pf_other == "Y", "non-lethal", "none"))) %>%
    select(pol_borough, stop_reason, frisk_reason, found, force, forceuse, arstmade)

knitr::kable(head(snf2015,10))

pol_borough	stop_reason	frisk_reason	found	force	forceuse	arstmade
ManH	cs_objcs	rf_vcrim	none	non-lethal	SF	N
ManH	cs_objcs	rf_vcrim	none	none		N
BK	cs_objcs	rf_vcrim	none	none		N
StnI	cs_objcs	rf_vcrim	none	non-lethal	OR	N
ManH	cs_objcs	rf_vcrim	none	none		N
Qns	cs_objcs	rf_vcrim	none	non-lethal	SW	N
ManH	cs_objcs	rf_vcrim	contraband	non-lethal	DS	Y
ManH	cs_objcs	rf_vcrim	none	non-lethal	SW	N
BK	cs_objcs	rf_vcrim	none	non-lethal	OT	N
BK	cs_objcs	rf_vcrim	none	none		N

Analysis with Transformed Data

Census Data

The projected population growths by region was a recommended area to look in to. The plot below shows the population projection form July, 2010 to July 2011 for the respective regions.

census_region <- filter(census2010,
                        area == "Northeast" |
                            area == "Midwest" |
                            area == "South" |
                            area == "West")


ggplot(census_region, aes(x = year, y = pop)) +
    geom_point(size = 3, aes(color = area))

All the rates for population seem to be positive, or at least flat between years. The region with the largest growth is the south. The regional growth is not surprising, but the population differences going on the state level could be drowning out the behavior for the region. Let us look at the differences in the projected population in 2014 relative to the 2010 Census data and look at the largest population growth and the smallest.

census_diff <- filter(census2010, year == "14-Jul" &
                          area != "Northeast" &
                          area != "Midwest" &
                          area != "South" &
                          area != "West" &
                          area != "United States") %>%
    mutate(pop_growth = pop - census_2010) %>%
    select(area, pop_growth) %>%
    arrange(-pop_growth)

# Largest growth
knitr::kable(head(census_diff))

area	pop_growth
Texas	1811397
California	1548544
Florida	1091987
Georgia	409690
North Carolina	408481
New York	368125

# Smallest growth
knitr::kable(tail(census_diff))

	area	pop_growth
47	New Hampshire	10343
48	Rhode Island	2606
49	Maine	1728
50	Vermont	821
51	West Virginia	-2668
52	Puerto Rico	-177392

Texas, California and Florida have the largest population growth and Puerto Rico, West Virginia and Vermont has the least growth. There are many interesting topics that one could dive into form these observations. For the large gains, Is Texas there because of immigration, is California there because of the tech-boom, is Florida there because of baby-boomers retiring? For the lowest ,on the other hand, is Puerto Rico there because of their poor economic situation, is West Virginia there because of the decline of the the coal mines and job prospects?

If you are curious how other states fared. Took a look at the interactive table below.

datatable(census_diff)

Student Drinking

The relationship to alcohol consumption and the setting one lives in, i.e. rural (R) vs urban (U), was brought up by the student who requested this data. In order to do some analysis, the student’s semester grades will be averaged, grade_avg and the level of drinking the student participated in during the weekday and weekend is added into one variable, alc_sum. These variables will be formed by using tidyr tools to operate on through the dplyr pipeline.

student <- student %>%
    mutate(grade_avg = (G1 + G2 + G3)/3) %>%
    mutate(alc_sum = (Dalc+Walc)) %>%
    select(school, sex, address, alc_sum, major, grade_avg)

The summary statistics for drinking behavior (on a scale from 2 to 10, with 10 being the heaviest consumption) for the rural students is:

# Summary for rural students
summary(student$alc_sum[student$address == "R"])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.000   3.000   3.965   5.000  10.000

And for the urban student it is:

# Summary for urban students
summary(student$alc_sum[student$address == "U"])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.000   3.000   3.709   5.000  10.000

These summary statistics do not bring much insight. Yes, the average consumption for the rural student is more, but the remainder of the statistics are the same. We will plot the density to see if it brings any insights.

ggplot(student, aes(alc_sum)) +
    geom_density(fill = "grey") + facet_wrap(~ address)

From the density plot, it does appear that the percentage of rural (R) students drink more than urban (U) students. This is evident from the sharp, uni-modal point at alc_sum = 2 for the urban students and its shallow spread to the right. The rural student’s density has a nearly linear spread and is not as sharply concave as their fellow urban classmates’ density.

It is apparent that the rural students do consume more alcohol, but let us see if this has any relationship to performance at school. Using summary statistics again on the average semester grade, grade_avg:

# Summary for rural students
summary(student$grade_avg[student$address == "R"])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.333   9.000  10.670  10.610  13.000  18.670

And for the urban student it is:

# Summary for urban students
summary(student$grade_avg[student$address == "U"])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.333   9.667  11.670  11.510  13.670  19.330

From the summary statistics, it is clear that the urban (U) students outperform their rural (R) classmates.

Since the urban students drink less and outperform their counterparts, is it safe to assume that they do because they drink less alcohol on average during the week? It most definitely does not; correlation does imply causation. To dive in a little further, let us see if there is a clear correlation between the two types of students given the variables discussed above using boxplots.

ggplot(student, aes(factor(alc_sum), grade_avg)) +
    geom_boxplot() + facet_wrap(~ address)

Strictly looking at the urban (U) student, there is a clear negative relationship between large alcohol consumption and grades. Looking at the rural (R) students, however, there is no clear relationship. Correlation may not imply causation, but no correlation does imply no causation.

There are a lot of mitigating factors in the student dataset that could be effecting the performance of the rural students. Since there could be less mitigating factors for urban students, factors such as alcohol consumption may have a stronger effect on school performance. Factors such as levels of the parents’ education and stressful environments at home due to economic conditions may have a larger impact, but it does not appear to be due to alcohol.

Stop and Frisk

First we will look at a plot of the stop and frisk rates for the respective boroughs and then a plot of the reason the individual was stopped.

ggplot(snf2015, aes(factor(pol_borough))) + geom_bar()

ggplot(snf2015, aes(factor(pol_borough), fill = stop_reason)) + geom_bar(position = "dodge")

For the reasons of being stopped above, let us see what was the result of the search, i.e., was anything found.

ggplot(snf2015, aes(factor(pol_borough),fill = found)) + geom_bar(position = "dodge")

count(snf2015,found)

# A tibble: 4 × 2
               found      n
               <chr>  <int>
1         contraband  24089
2            firearm   5310
3 non-firearm_weapon  23852
4               none 476069

It is obvious that nothing was found from most of these stops; there is a found rate of 10.060266%. Let us see the times when something was actually found from these stops based off the reason for search. Also, let’s see what the the highest success rate of finding something based off the stop.

ggplot(subset(snf2015, found != "none"), aes(factor(stop_reason), fill = found)) + geom_bar(position = "dodge")

snf_table <- as.data.frame.matrix(table(snf2015$stop_reason, snf2015$found))
snf_names <- rownames(snf_table)
snf_table <- snf_table %>%
    mutate(success = rowSums(snf_table[,1:3])/rowSums(snf_table)) %>%
    mutate(freq_rank = frank(-rowSums(snf_table))) %>%
    arrange(-success)
rownames(snf_table) <- snf_names
knitr::kable(snf_table)

	contraband	firearm	non-firearm_weapon	none	success	freq_rank
cs_bulge	1980	425	2649	26793	0.1586963	9
cs_casng	1679	687	3300	30938	0.1547918	7
cs_cloth	2888	353	1548	30550	0.1355160	8
cs_descr	3619	807	3159	66567	0.1022899	3
cs_drgtr	1315	363	1473	28097	0.1008385	10
cs_furtv	4311	747	3628	78762	0.0993276	2
cs_lkout	1564	376	1830	34256	0.0991427	6
cs_objcs	1847	366	1629	39494	0.0886561	5
cs_other	2760	784	2852	83552	0.0711078	1
cs_vcrim	2126	402	1784	57060	0.0702601	4

The above table is sorted by the level of success. freq_rank is the rank for how often that that is the reason for the stop. It is important to point out that the most common reasons for stops, e.g., other, furtive behavior and matches a description have a lower success rate for finding something than an observed bulge, casing an area or clothing.

Let us see if the reason for frisking, carried out after the stop, has the same behavior:

ggplot(subset(snf2015, found != "none"), aes(factor(frisk_reason), fill = found)) + geom_bar(position = "dodge")

snf_table <- as.data.frame.matrix(table(snf2015$frisk_reason, snf2015$found))
snf_names <- rownames(snf_table)
snf_table <- snf_table %>%
    mutate(success = rowSums(snf_table[,1:3])/rowSums(snf_table)) %>%
    mutate(freq_rank = frank(-rowSums(snf_table))) %>%
    arrange(-success)
rownames(snf_table) <- snf_names
knitr::kable(snf_table)

	contraband	firearm	non-firearm_weapon	none	success	freq_rank
rf_attir	4249	768	4606	57543	0.1432719	3
rf_bulg	2497	667	3802	45332	0.1331982	5
rf_furt	4123	980	3588	80748	0.0971724	1
rf_knowl	1726	324	1544	34990	0.0931474	9
rf_othsw	2050	413	1805	42598	0.0910682	8
rf_rfcmp	2186	395	1856	44574	0.0905307	7
rf_vcact	2628	598	2033	52838	0.0905210	4
rf_vcrim	2025	432	1948	45205	0.0887926	6
rf_verbl	2605	733	2670	72241	0.0767805	2

The reason for frisking does not follow the same behavior; 2 of the top 3 frequencies have the highest success rate.

From the data above, it appears that stop and frisk has about a 10% success rate. The most common reasons for stopping appear arbitrary and do not yield as large as a success rate as those which are more concrete. A trade-off for the NYPD might be to fore-go some of the low-success, vague reasons for stopping individuals to gain a better relationship in the neighborhoods in which they are.