Introduction

Credit: NYC DOT

Topic

When the New York City Department of Transportation maintains roadways in a timely manner, few people thank them. But when conditions deteriorate, serious collisions can occur, resulting in loss of life.

Managing a city operation is a challenge. With the increase in data collection and reporting, NYC employees and residents have powerful tools to answer questions and identify problem spots. The result of this data analysis could impact funding, hiring, citywide initiatives, and project management at all levels.

New York City 311 is a free public service that allows individuals to register complaints or inquiries on city conditions. Comparing the roadway condition complaints to Census data has great potential for statistical discovery.

Hypothesis

Comparing majority black and majority white populations, is there a correlation between race and per capita roadway condition complaints across zip codes?

Using mean income per zip code, is there a correlation between income and roadway condition complaints?

$H_0$: Race and income are not correlated with roadway condition complaints per capita.

$H_a$: Race and/or income are correlated with roadway condition complaints per capita.

Libraries

Show code to view required libraries.

library(BHH2)
library(devtools)
library(DT)
library(ggplot2)
library(jsonlite)
library(knitr)
library(openintro)
library(plotly)
library(plyr)
library(psych)
library(dplyr)
library(RCurl)
library(reshape2)
library(rmarkdown)
library(shiny)
library(stringr)
library(tidyr)

library(XML)
library(scales)
library(RColorBrewer)
library(leaflet)
library(httr)

Cases

The following is a summary of case statistics. For this project, each case is a zip code, representing all 311 complaints recorded for that zip code during the selected time period.

Statistic<-c("8,500,000",
             "1/1/2013 to 5/12/2016", 
             "325,025", 
             "281,680",
             "274,047",
             "769",
             "180",
             "182")

Description<-c("New York City population, estimated.",
               "Selected date range of 311 complaints.",
               "Street condition complaints in date range.",
               "Street condition complaints with a reported zip code.",
               "Asphalt complaints.",
               "Missing markings complaints.",
               "Faded markings complaints.", 
               "Number of zip codes in NYC, considered as cases.")

test1<-data.frame(Statistic,Description)

kable(test1)

Statistic	Description
8,500,000	New York City population, estimated.
1/1/2013 to 5/12/2016	Selected date range of 311 complaints.
325,025	Street condition complaints in date range.
281,680	Street condition complaints with a reported zip code.
274,047	Asphalt complaints.
769	Missing markings complaints.
180	Faded markings complaints.
182	Number of zip codes in NYC, considered as cases.
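As a quick consistency check on the figures above, the share of street condition complaints that carry a zip code and the average number of complaints per case can be computed directly. This is a minimal sketch; the variable names are illustrative and are not part of the original analysis.

```r
# Figures copied from the case summary table above
complaints_total    <- 325025  # street condition complaints in date range
complaints_with_zip <- 281680  # complaints with a reported zip code
zip_count           <- 182     # zip codes treated as cases

# Share of complaints that can be assigned to a zip code
share_with_zip <- complaints_with_zip / complaints_total

# Average number of usable complaints per case (zip code)
complaints_per_zip <- complaints_with_zip / zip_count

round(share_with_zip, 3)
round(complaints_per_zip, 1)
```

The per-zip average agrees with the mean of the complaints column reported in the summary statistics (1547.7).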

Sources

New York City 311 Data

NYC Open Data

311 Service Requests from 2010 to Present

US Income Data

US Census Factfinder Portal

2010-2014 American Community Survey 5-Year Estimates

Selected economic characteristics

US Population and Demographic Data

US Census Factfinder Portal

2010-2014 American Community Survey 5-Year Estimates

ACS DEMOGRAPHIC AND HOUSING ESTIMATES

Data Characteristics

Data Collection

The roadway data was collected by New York City 311 through phone calls and web forms from individuals contacting the city to complain or inquire about a roadway situation. Phone calls are entered directly into the 311 database by phone operators.

The US Census demographic and population data was collected by the US government in a 2010-2014 survey. From the Census website: “The American Community Survey (ACS) is a mandatory, ongoing statistical survey that samples a small percentage of the population every year.”

Study Type

This is an observational study as no experiment was created. Some datasets used for this analysis are from sampling done by the US Census.

Response

The response variable is the number of complaints per capita for each zip code. This is a numeric variable.
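The response variable can be sketched as a simple ratio of complaints to population. The example below uses the first five zip codes, with counts copied from the `str(sdata)` output shown in the Summary section; the data frame name `cases` is illustrative only.

```r
# First five zip codes, with counts copied from the str(sdata) output
cases <- data.frame(
  zip1       = c("ZIP_10001", "ZIP_10002", "ZIP_10003", "ZIP_10004", "ZIP_10005"),
  complaints = c(1521, 1575, 1794, 479, 395),
  pop        = c(22767, 79894, 57068, 3024, 7570)
)

# Complaints per capita: total complaints divided by total population
cases$pcomp <- cases$complaints / cases$pop

round(cases$pcomp, 4)
```

Rounded to four decimals, these values match the `pcomp` column of the primary dataset.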

Explanatory

The proposed explanatory variables are percent race by zip code and mean income by zip code, both numeric.

Generalizability

The data includes sample demographic statistics across New York City. The complaint data is limited to the individuals who interact with that particular government service. However, because of the volume of 311 records (more than 280,000 complaints with a reported zip code across 182 zip codes), a substantial share of the city is represented in these cases. The results are therefore treated as generalizable to the city.

Causality

There are many confounding factors related to race, income and social services. Therefore, it may not be possible to infer that race or income alone affects interaction with local government. Additionally, a complaint is not necessarily an indication of a poor roadway condition. Certain populations may complain less while still experiencing poor conditions.

Transformation

Street Conditions

The following is a list of options for street condition complaints when making a report to 311.

For this project, seven of the 26 complaint types were selected. The five asphalt-related complaint types were combined into a single asphalt category. The remaining two were the two options for marking complaints: faded and missing.

complaintlist<-getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/complainttypes.csv")

df2<- data.frame(read.csv(text=complaintlist))

datatable(df2, options = list( pageLength = 5, lengthMenu = c(5, 10, 40),   initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#01975b', 'color': '#fff'});",
    "}")), rownames=TRUE)

311 Raw Data

This project spans 311 complaints from January 1, 2013 through May 12, 2016.

311 receives millions of complaints each year across all city agencies: Health, Environmental Protection, Sanitation, Police, and others.

For this project, the data was filtered as follows:

Only Department of Transportation data.
Only complaints which had a zip code as part of the complaint, to make data analysis possible.
Only complaints which were for a street condition.
Only the five primarily asphalt-related street conditions and two markings-related street conditions.

Asphalt conditions

Pothole
Rough, Pitted or Cracked Roads
Hummock
Cave-in
Failed Street Repair

Marking conditions

Line/Marking - After Repaving
Line/Marking - Faded

To further simplify, all asphalt conditions were combined into one roadway surface complaint group, named “asphalt.”

Missing values were filled in as 0.
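The filtering and grouping steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used for the project: the toy rows and the column names (`agency`, `complaint_type`, `descriptor`, `zip`) are hypothetical stand-ins for the raw 311 export.

```r
# Hypothetical raw 311 rows; column names and values are illustrative only
raw <- data.frame(
  agency         = c("DOT", "DOT", "DOT", "DOT", "NYPD", "DOT"),
  complaint_type = c("Street Condition", "Street Condition", "Street Condition",
                     "Street Condition", "Noise", "Street Condition"),
  descriptor     = c("Pothole", "Cave-in", "Line/Marking - Faded",
                     "Pothole", "Loud Music", "Pothole"),
  zip            = c("10001", "10001", "10001", "10002", "10001", NA),
  stringsAsFactors = FALSE
)

asphalt_types <- c("Pothole", "Rough, Pitted or Cracked Roads", "Hummock",
                   "Cave-in", "Failed Street Repair")

# Keep DOT street condition complaints that have a zip code
keep <- raw$agency == "DOT" &
  raw$complaint_type == "Street Condition" &
  !is.na(raw$zip) & raw$zip != ""
filtered <- raw[keep, ]

# Map the seven selected descriptors onto three groups; anything else becomes NA
filtered$group <- ifelse(
  filtered$descriptor %in% asphalt_types, "asphalt",
  ifelse(filtered$descriptor == "Line/Marking - After Repaving", "missing",
         ifelse(filtered$descriptor == "Line/Marking - Faded", "faded", NA))
)
filtered <- filtered[!is.na(filtered$group), ]
filtered$group <- factor(filtered$group, levels = c("asphalt", "missing", "faded"))

# Count by zip code and group; combinations with no complaints appear as 0
counts <- as.data.frame.matrix(table(filtered$zip, filtered$group))
counts
```

Fixing the factor levels before calling `table()` is what yields a 0 for a zip code with no complaints in a group, rather than a missing column.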

Sample view of 311 data

sample311<-getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/sample311.csv")

df1<- data.frame(read.csv(text=sample311))

datatable(df1, options = list( pageLength = 5, lengthMenu = c(5, 10),  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#00838f', 'color': '#fff'});",
    "}")), rownames=TRUE)

Zip Code by Borough

A list of zip codes and their associated borough was extracted from the 311 data.

Table of zip codes by borough.

Table of zip codes by neighborhood name (PDF).

Income by Zip Code

Income information by zip code was retrieved from the US Census.

The data was transformed as follows, via the US Census portal and dataset processing.

Downloaded New York State zip codes only.
Selected columns using metadata.
Converted zip column to display zip code only.
Added “ZIP_” prefix to prevent the zip code from being coerced to a numeric value.
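The last two steps can be sketched as follows. The label format `"ZCTA5 10001"` and the leading-zero value are assumptions made for illustration; the actual Census export may differ.

```r
# Hypothetical Census geography labels (format assumed for illustration)
label <- c("ZCTA5 10001", "ZCTA5 10002", "ZCTA5 06390")

# Keep the zip code only
zip <- sub("^ZCTA5 ", "", label)

# Prefix so that CSV readers keep the column as text
zip1 <- paste0("ZIP_", zip)
zip1

# Without the prefix, numeric coercion silently drops a leading zero
as.numeric(zip)
```

A prefixed key also merges cleanly with other tables, since both sides are compared as text.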

The relevant attribute column names were as follows:

HC01_VC90 - Estimate; INCOME AND BENEFITS (IN 2014 INFLATION-ADJUSTED DOLLARS) - With earnings - Mean earnings (dollars)

This attribute corresponded to mean income per zip code.

Demographics by Zip Code

Population and demographic information by zip code were retrieved from the US Census and transformed as per the previous section.

The relevant attribute column names were as follows:

HC01_VC03 - Estimate; SEX AND AGE - Total population
HC03_VC49 - Percent; RACE - One race - White
HC03_VC50 - Percent; RACE - One race - Black or African American

These attributes corresponded to total population per zip code, percent white per zip code, percent black per zip code.

Combined Dataset

The datasets were merged and transformed to display the required data. The combined datasets were uploaded to GitHub for reproducibility.
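A minimal sketch of the merge step, assuming both tables share the `zip1` key. The values for the first three zip codes are copied from the `str(sdata)` output; `ZIP_99999` is a hypothetical Census-only row added to show how unmatched keys behave.

```r
# Complaint counts by zip code (values from the str(sdata) output)
complaints <- data.frame(
  zip1       = c("ZIP_10001", "ZIP_10002", "ZIP_10003"),
  complaints = c(1521, 1575, 1794)
)

# Census attributes by zip code; ZIP_99999 is a made-up row with no complaints
census <- data.frame(
  zip1   = c("ZIP_10001", "ZIP_10002", "ZIP_10003", "ZIP_99999"),
  income = c(159847, 75330, 159651, 50000),
  pop    = c(22767, 79894, 57068, 1000),
  pwhite = c(60.5, 31.2, 77.2, 50.0),
  pblack = c(11.4, 7.6, 4.6, 10.0)
)

# Inner join on the zip key: only zip codes present in both tables are kept
combined <- merge(complaints, census, by = "zip1")
combined
```

With the default inner join, the Census-only row is dropped. Passing `all = TRUE` to `merge()` would keep it with `NA` in the complaints column.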

Attribute Names

The following is attribute information for the combined primary dataset.

dict <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/sdatadictionary.csv")

dictionary<- data.frame(read.csv(text=dict))

datatable(dictionary, options = list( pageLength = 5, lengthMenu = c(5, 10),  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#fff9c4', 'color': '#000000'});",
    "}")), rownames=TRUE)

Load Data

Show code to view data loading.

# data by zip code
sumzipdata <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/zipdatacsv2.csv")

sdata<- data.frame(read.csv(text=sumzipdata))


# data by month
monthdata<- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/monthlycomplaints.csv")

mdata<- data.frame(read.csv(text=monthdata))

Combined dataset of complaints and demographics by zip code.

Analysis

Summary

Structure and summary statistics of the primary dataset.

See above attribute table for variable descriptions.

# Structure
str(sdata)

## 'data.frame':    182 obs. of  14 variables:
##  $ zip1      : Factor w/ 182 levels "ZIP_10001","ZIP_10002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ borough   : Factor w/ 5 levels "BRONX","BROOKLYN",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ asphalt   : int  1497 1530 1753 465 389 209 685 903 624 1473 ...
##  $ missing   : int  1 7 3 0 0 0 1 7 1 4 ...
##  $ faded     : int  23 38 38 14 6 4 20 13 16 30 ...
##  $ complaints: int  1521 1575 1794 479 395 213 706 923 641 1507 ...
##  $ income    : int  159847 75330 159651 186872 187011 163900 415435 99435 163529 196497 ...
##  $ pop       : int  22767 79894 57068 3024 7570 2950 6748 61806 30708 52941 ...
##  $ pwhite    : num  60.5 31.2 77.2 77.4 74.9 71.5 67.2 59.8 72.4 79.3 ...
##  $ pblack    : num  11.4 7.6 4.6 2.3 3.7 3.2 7 6.8 6.5 3.8 ...
##  $ pcomp     : num  0.0668 0.0197 0.0314 0.1584 0.0522 ...
##  $ pcompa    : num  0.0658 0.0192 0.0307 0.1538 0.0514 ...
##  $ pcompm    : num  4.39e-05 8.76e-05 5.26e-05 0.00 0.00 ...
##  $ pcompf    : num  0.00101 0.000476 0.000666 0.00463 0.000793 ...

summary(sdata)

##         zip1              borough      asphalt          missing      
##  ZIP_10001:  1   BRONX        :26   Min.   :   1.0   Min.   : 0.000  
##  ZIP_10002:  1   BROOKLYN     :39   1st Qu.: 902.2   1st Qu.: 1.000  
##  ZIP_10003:  1   MANHATTAN    :45   Median :1345.0   Median : 3.000  
##  ZIP_10004:  1   QUEENS       :60   Mean   :1505.8   Mean   : 4.225  
##  ZIP_10005:  1   STATEN ISLAND:12   3rd Qu.:1887.8   3rd Qu.: 5.000  
##  ZIP_10006:  1                      Max.   :7603.0   Max.   :24.000  
##  (Other)  :176                                                       
##      faded          complaints         income            pop        
##  Min.   :  0.00   Min.   :   1.0   Min.   : 37028   Min.   :  1252  
##  1st Qu.: 13.25   1st Qu.: 925.5   1st Qu.: 65729   1st Qu.: 25476  
##  Median : 24.50   Median :1368.5   Median : 79395   Median : 41636  
##  Mean   : 37.71   Mean   :1547.7   Mean   : 96711   Mean   : 46314  
##  3rd Qu.: 37.75   3rd Qu.:1934.5   3rd Qu.:104391   3rd Qu.: 64742  
##  Max.   :447.00   Max.   :7730.0   Max.   :415435   Max.   :110385  
##                                                                     
##      pwhite          pblack          pcomp               pcompa         
##  Min.   : 1.90   Min.   : 0.00   Min.   :0.0006639   Min.   :0.0006639  
##  1st Qu.:24.55   1st Qu.: 2.80   1st Qu.:0.0198602   1st Qu.:0.0192186  
##  Median :49.55   Median : 7.95   Median :0.0311590   Median :0.0305364  
##  Mean   :48.28   Mean   :21.74   Mean   :0.0387768   Mean   :0.0377392  
##  3rd Qu.:71.80   3rd Qu.:32.27   3rd Qu.:0.0493438   3rd Qu.:0.0480794  
##  Max.   :98.80   Max.   :92.20   Max.   :0.1583995   Max.   :0.1537698  
##                                                                         
##      pcompm              pcompf         
##  Min.   :0.000e+00   Min.   :0.0000000  
##  1st Qu.:2.574e-05   1st Qu.:0.0003332  
##  Median :6.423e-05   Median :0.0005900  
##  Mean   :9.194e-05   Mean   :0.0009456  
##  3rd Qu.:1.145e-04   3rd Qu.:0.0009929  
##  Max.   :4.984e-04   Max.   :0.0117003  
##

# Selected statistics

Statistic<-c(46314,
             0.0388, 
             48.28, 
             21.74,
             49.55,
             7.95,
             96710,
             79395
             )

Description<-c("Mean population per NYC zip code.",
               "Mean street condition complaints per capita.",
               "Mean percent white population per zip code.",
               "Mean percent black population per zip code.",
               "Median percent white population per zip code.",
               "Median percent black population per zip code.",
               "Mean income per zip code.",
               "Median income per zip code."
               )

test2<-data.frame(Statistic,Description)

kable(test2, digits=2)

Statistic	Description
46314.00	Mean population per NYC zip code.
0.04	Mean street condition complaints per capita.
48.28	Mean percent white population per zip code.
21.74	Mean percent black population per zip code.
49.55	Median percent white population per zip code.
7.95	Median percent black population per zip code.
96710.00	Mean income per zip code.
79395.00	Median income per zip code.

# Distribution of Road Condition Complaints Per Capita
hist(sdata$pcomp, col="#fff9c4", breaks=10, main="Distribution of Road Condition Complaints Per Capita", xlab="Complaints Per Capita")

Income

# Box Plot of Income and Population by Zip Code
boxplot(sdata$income, sdata$pop, names=c("Income","Population"), col=c("#c5e1a5","#ffe0b2"), main="Box Plot of Income and Population by Zip Code")

# Distribution of Income
hist(sdata$income, col="#c5e1a5", breaks=20, main="Distribution of Income", xlab="Income")

# Road Condition Complaints vs. Income
minc<- lm(pcomp ~ income, data = sdata)
plot(sdata$pcomp ~ sdata$income, col="#81d4fa", main="Road Condition Complaints vs. Income", xlab="Income", ylab="Complaints Per Capita")
abline(minc)

# Road Condition Complaints vs. Income, with xlim1
minc1<- lm(pcomp ~ income, data = sdata)
plot(sdata$pcomp ~ sdata$income, col="#1976d2", main="Road Condition Complaints vs. Income, $20,000-$200,000", xlab="Income", ylab="Complaints Per Capita", xlim=c(20000,200000))
abline(minc1)

# Road Condition Complaints vs. Income, with xlim2
minc2<- lm(pcomp ~ income, data = sdata)
plot(sdata$pcomp ~ sdata$income, col="#01579b", main="Road Condition Complaints vs. Income, $30,000-$100,000", xlab="Income", ylab="Complaints Per Capita", xlim=c(30000,100000), ylim=c(0,0.10))
abline(minc2)

Race

# Correlation for Percent White
wcor<-cor(sdata$pcomp, sdata$pwhite)

# Correlation for Percent Black
bcor<-cor(sdata$pcomp, sdata$pblack)

# Distribution of Percent Whites in Zip Code
hist(sdata$pwhite, col="#ffcc80", breaks=20, main="White Population", xlab="Distribution of Percent Whites in Zip Code", ylim=c(0,70))

# Distribution of Percent Blacks in Zip Code
hist(sdata$pblack, col="#26a69a", breaks=20, main="Black Population", xlab="Distribution of Percent Blacks in Zip Code", ylim=c(0,70))

# Road Condition Complaints vs. Percentage Whites
mwhite<- lm(pcomp ~ pwhite, data = sdata)
plot(sdata$pcomp ~ sdata$pwhite, col="#ffcc80", main="Road Condition Complaints vs. Percentage Whites", xlab="Percent White Population per Zip Code", ylab="Complaints Per Capita")
abline(mwhite)

# Road Condition Complaints vs. Percentage Blacks
mblack<- lm(pcomp ~ pblack, data = sdata)
plot(sdata$pcomp ~ sdata$pblack, col="#26a69a", main="Road Condition Complaints vs. Percentage Blacks", xlab="Percent Black Population per Zip Code", ylab="Complaints Per Capita")
abline(mblack)

# Residuals for Percent White
plot(mwhite$residuals ~ sdata$pcomp, main="Residuals for Percent White", col="#ffcc80", xlab="Per Capita Complaints", ylab="Residuals")
abline(h = 0, lty = 3)  # reference line at zero residual

hist(mwhite$residuals, main="Residuals for Percent White", ylim=c(0,80), xlab="Residuals" , col="#ffcc80")

qqnorm(mwhite$residuals, col="#ffcc80")
qqline(mwhite$residuals, col="#ffcc80")

# Residuals for Percent Black
plot(mblack$residuals ~ sdata$pcomp, main="Residuals for Percent Black", col="#26a69a", xlab="Per Capita Complaints", ylab="Residuals")
abline(h = 0, lty = 3)  # reference line at zero residual

hist(mblack$residuals, main="Residuals for Percent Black", ylim=(c(0,80)), xlab="Residuals", col="#26a69a")

qqnorm(mblack$residuals, col="#26a69a")
qqline(mblack$residuals, col="#26a69a")

The correlation between percent white population and per capita roadway complaints is 0.29.
The correlation between percent black population and per capita roadway complaints is -0.21.
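These correlations can be cross-checked against the `lm()` summaries reported in the Testing section. In simple linear regression the correlation is the signed square root of R-squared, and the test statistic for "no correlation" follows from it. The sketch below uses only the printed R-squared values and slope signs; the variable names are illustrative.

```r
n <- 182  # zip codes (cases)

# R-squared and slope sign taken from the lm() summaries for pwhite and pblack
r_squared  <- c(pwhite = 0.0817, pblack = 0.04554)
slope_sign <- c(pwhite = 1, pblack = -1)

# In simple linear regression, r is the signed square root of R-squared
r <- slope_sign * sqrt(r_squared)

# t statistic for H0: no correlation, on n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(r, 2)
round(t_stat, 2)
p_value
```

Up to rounding of the printed inputs, the t statistics agree with the slope t values in the model summaries. With the data loaded, `cor.test(sdata$pcomp, sdata$pwhite)` would perform the same test directly.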

Borough

# Box Plots by Borough

# Per Capita Complaints by Borough
# Per Capita Complaints by Borough, Asphalt Only
# Per Capita Complaints by Borough, Missing Markings Only
# Per Capita Complaints by Borough, Faded Markings Only


boxplot(
    sdata$pcomp[sdata$borough=="BRONX"],
    sdata$pcomp[sdata$borough=="BROOKLYN"],
    sdata$pcomp[sdata$borough=="MANHATTAN"],
    sdata$pcomp[sdata$borough=="QUEENS"],
    sdata$pcomp[sdata$borough=="STATEN ISLAND"],
    col=(c("#E8EAF6","#c5cae9","#9fa8da","#7986cb","#5c6bc0")),
    names=(c("BRONX","BKLYN","MANHATTAN","QUEENS","STATEN IS.")), 
    main="Per Capita Complaints by Borough", ylab="Per Capita Complaints", xlab="Borough")

boxplot(
    sdata$pcompa[sdata$borough=="BRONX"],
    sdata$pcompa[sdata$borough=="BROOKLYN"],
    sdata$pcompa[sdata$borough=="MANHATTAN"],
    sdata$pcompa[sdata$borough=="QUEENS"],
    sdata$pcompa[sdata$borough=="STATEN ISLAND"],
    col=(c("#E8EAF6","#c5cae9","#9fa8da","#7986cb","#5c6bc0")),
    names=(c("BRONX","BKLYN","MANHATTAN","QUEENS","STATEN IS.")), 
    main="Per Capita Complaints by Borough, Asphalt Only", ylab="Per Capita Asphalt Complaints", xlab="Borough")

boxplot(
    sdata$pcompm[sdata$borough=="BRONX"],
    sdata$pcompm[sdata$borough=="BROOKLYN"],
    sdata$pcompm[sdata$borough=="MANHATTAN"],
    sdata$pcompm[sdata$borough=="QUEENS"],
    sdata$pcompm[sdata$borough=="STATEN ISLAND"],
    col=(c("#E8EAF6","#c5cae9","#9fa8da","#7986cb","#5c6bc0")),
    names=(c("BRONX","BKLYN","MANHATTAN","QUEENS","STATEN IS.")), 
    main="Per Capita Complaints by Borough, Missing Markings Only", ylab="Per Capita Missing Markings Complaints", xlab="Borough")

boxplot(
    sdata$pcompf[sdata$borough=="BRONX"],
    sdata$pcompf[sdata$borough=="BROOKLYN"],
    sdata$pcompf[sdata$borough=="MANHATTAN"],
    sdata$pcompf[sdata$borough=="QUEENS"],
    sdata$pcompf[sdata$borough=="STATEN ISLAND"],
    col=(c("#E8EAF6","#c5cae9","#9fa8da","#7986cb","#5c6bc0")),
    names=(c("BRONX","BKLYN","MANHATTAN","QUEENS","STATEN IS.")), 
    main="Per Capita Complaints by Borough, Faded Markings Only", ylab="Per Capita Faded Markings Complaints", xlab="Borough", ylim=c(0,0.006))

The following compares per capita complaints across boroughs by splitting the variable into borough groups.

# means by borough

by(sdata$pcomp, sdata$borough, mean)

## sdata$borough: BRONX
## [1] 0.02962233
## -------------------------------------------------------- 
## sdata$borough: BROOKLYN
## [1] 0.03189598
## -------------------------------------------------------- 
## sdata$borough: MANHATTAN
## [1] 0.03309111
## -------------------------------------------------------- 
## sdata$borough: QUEENS
## [1] 0.0421183
## -------------------------------------------------------- 
## sdata$borough: STATEN ISLAND
## [1] 0.08558736

by(sdata$pcomp, sdata$borough, summary)

## sdata$borough: BRONX
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002635 0.015800 0.018430 0.029620 0.032140 0.143400 
## -------------------------------------------------------- 
## sdata$borough: BROOKLYN
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01028 0.02526 0.03020 0.03190 0.03709 0.06315 
## -------------------------------------------------------- 
## sdata$borough: MANHATTAN
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0006639 0.0149300 0.0211700 0.0330900 0.0361900 0.1584000 
## -------------------------------------------------------- 
## sdata$borough: QUEENS
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002696 0.025920 0.036960 0.042120 0.052200 0.151200 
## -------------------------------------------------------- 
## sdata$borough: STATEN ISLAND
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06368 0.07676 0.08626 0.08559 0.09400 0.11880
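As a worked check, the borough means above can be combined with the number of zip codes per borough (from the `summary(sdata)` output) to recover the overall mean of per capita complaints. The vector names below are illustrative.

```r
# Zip codes per borough, from the summary(sdata) output
borough_n <- c(BRONX = 26, BROOKLYN = 39, MANHATTAN = 45,
               QUEENS = 60, "STATEN ISLAND" = 12)

# Mean per capita complaints per borough, from the by() output above
borough_mean <- c(BRONX = 0.02962233, BROOKLYN = 0.03189598,
                  MANHATTAN = 0.03309111, QUEENS = 0.0421183,
                  "STATEN ISLAND" = 0.08558736)

# Weighted mean across boroughs recovers the overall mean of pcomp
overall <- weighted.mean(borough_mean, borough_n)
overall

# Each borough's mean relative to the overall mean
round(borough_mean / overall, 2)
```

The weighted mean agrees with the overall `pcomp` mean in the summary statistics (0.0387768), and the ratio shows Staten Island at more than twice the citywide average.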

Time

ggplot(
  mdata, aes(x = yyyymm, y = tcomplaints, fill=tcomplaints)) + 
  geom_bar(stat="identity") +
  ggtitle("Complaints by Month")+ 
  theme(axis.text=element_text(angle=90))+
  labs(x="Month",y="Complaints")

Outliers

There were several interesting outliers. The box plot of mean income by zip code had many outliers in the upper range, as expected for New York City.

The box plots of per capita complaints by borough showed more outliers in Manhattan at the upper range. This could indicate that Manhattan has higher density in general, or a dedicated group of concerned citizens.

The per capita complaints for Staten Island were markedly higher than for the other boroughs. Staten Island has historically had issues with road conditions. The borough is geographically separated from the other four and has many sprawling suburban areas that are difficult to service in a timely manner.

There were high income outliers for zip codes. In one case, a Westchester County zip code was included in the NYC zip code list, possibly due to a shared border street with the Bronx. (ZIP_10803, mean income 232513, Bronx.)

A high income outlier for Queens was a zip code in Long Island City, across the river from Manhattan, where dozens of new skyscraper condominiums have risen in recent years on the site of previously industrial neighborhoods. (ZIP_11109, mean income 168940, Queens.)

A high outlier for per capita complaints was ZIP_10004 in Lower Manhattan. Interestingly, this zip code includes the NYC DOT headquarters, which could be skewing the data if employees use 311 web forms as part of their job or out of personal interest.

Some zip codes in the US Census dataset showed as 0 income or NA income. Researching these zip codes identified these as parks, airports, large office buildings, and in one case the former World Trade Center zip code, which has been discontinued.

Testing

Satisfying conditions for inference

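The conditions for inference appear to be approximately satisfied. The sample size (182 zip codes) is greater than 30, and the distributions are unimodal, although per capita complaints and the regression residuals are right-skewed rather than strictly normal. The Census figures come from a random sample, while the 311 complaints are self-reported rather than randomly sampled.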
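One way to check the normality condition is to compare the mean with the median and run a Shapiro-Wilk test. The sketch below uses simulated right-skewed values, not the project's data, and the names `x` and `normality` are illustrative. With the dataset loaded, the same calls could be applied to `sdata$pcomp` or to the model residuals.

```r
set.seed(606)  # arbitrary seed so the sketch is repeatable

# Simulated right-skewed values standing in for a per capita rate;
# these are NOT the project's data
x <- rexp(182, rate = 25)

# A mean above the median hints at right skew
c(mean = mean(x), median = median(x))

# Shapiro-Wilk test: a small p-value is evidence against normality
normality <- shapiro.test(x)
normality$p.value

# A normal Q-Q plot gives the same information visually:
# qqnorm(x); qqline(x)
```

A rejected normality test does not by itself invalidate the regression at this sample size, but it is a reason to read the p-values with some caution.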

Confidence interval

summary(mwhite)

## 
## Call:
## lm(formula = pcomp ~ pwhite, data = sdata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.051881 -0.016013 -0.006274  0.011839  0.110939 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.438e-02  4.115e-03   5.923 1.57e-08 ***
## pwhite      2.983e-04  7.453e-05   4.002 9.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02693 on 180 degrees of freedom
## Multiple R-squared:  0.0817, Adjusted R-squared:  0.0766 
## F-statistic: 16.02 on 1 and 180 DF,  p-value: 9.171e-05

summary(mblack)

## 
## Call:
## lm(formula = pcomp ~ pblack, data = sdata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.043073 -0.017971 -0.006732  0.010870  0.115067 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.387e-02  2.677e-03  16.391  < 2e-16 ***
## pblack      -2.344e-04  7.998e-05  -2.931  0.00382 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02746 on 180 degrees of freedom
## Multiple R-squared:  0.04554,    Adjusted R-squared:  0.04024 
## F-statistic: 8.588 on 1 and 180 DF,  p-value: 0.003823

summary(minc)

## 
## Call:
## lm(formula = pcomp ~ income, data = sdata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.055829 -0.017399 -0.007053  0.011395  0.111364 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.992e-02  4.252e-03   7.036 3.99e-11 ***
## income      9.160e-08  3.851e-08   2.379   0.0184 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02767 on 180 degrees of freedom
## Multiple R-squared:  0.03047,    Adjusted R-squared:  0.02509 
## F-statistic: 5.658 on 1 and 180 DF,  p-value: 0.01842
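The summaries above report estimates and standard errors but not intervals. As a minimal sketch, 95% confidence intervals for the three slopes can be rebuilt from the printed values, using a t critical value on 180 residual degrees of freedom. The data frame `coefs` is illustrative and holds only numbers copied from the output above.

```r
# Slope estimates and standard errors copied from the lm() summaries
coefs <- data.frame(
  term      = c("pwhite", "pblack", "income"),
  estimate  = c(2.983e-04, -2.344e-04, 9.160e-08),
  std_error = c(7.453e-05, 7.998e-05, 3.851e-08)
)

df_resid <- 180                      # residual degrees of freedom
t_crit   <- qt(0.975, df = df_resid) # two-sided 95% critical value

# Interval: estimate plus or minus critical value times standard error
coefs$lower <- coefs$estimate - t_crit * coefs$std_error
coefs$upper <- coefs$estimate + t_crit * coefs$std_error

# TRUE when the interval does not contain zero
coefs$excludes_zero <- coefs$lower > 0 | coefs$upper < 0
coefs
```

With the fitted models in memory, `confint(mwhite)`, `confint(mblack)` and `confint(minc)` return the same intervals without manual arithmetic.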

Inference

# summary

load(url("http://bit.ly/dasi_gss_ws_cl"))
source("http://bit.ly/dasi_inference")

inference(y = sdata$pwhite, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")

## Single mean 
## Summary statistics:

## mean = 48.2846 ;  sd = 26.8587 ;  n = 182 
## Standard error = 1.9909 
## 95 % Confidence interval = ( 44.3825 , 52.1867 )

inference(y = sdata$pblack, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")

## Single mean 
## Summary statistics:

## mean = 21.7357 ;  sd = 25.516 ;  n = 182 
## Standard error = 1.8914 
## 95 % Confidence interval = ( 18.0287 , 25.4427 )
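The two intervals above can be reproduced from the printed mean, standard deviation and sample size. The reported numbers are consistent with a normal critical value (about 1.96); this is inferred from the output, not from the source of the `inference()` function. The helper `ci_mean` is hypothetical and written only for this illustration.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean
ci_mean <- function(m, s, n, level = 0.95) {
  se <- s / sqrt(n)                 # standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)  # two-sided critical value
  c(lower = m - z * se, upper = m + z * se, se = se)
}

# Inputs copied from the inference() output above
white_ci <- ci_mean(m = 48.2846, s = 26.8587, n = 182)
black_ci <- ci_mean(m = 21.7357, s = 25.516,  n = 182)

round(white_ci, 4)
round(black_ci, 4)
```

Note that these intervals describe the mean percent white and percent black population per zip code, not the relationship between race and complaints.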

ANOVA summary statistics for income

inference(y = sdata$pcomp, x = sdata$income, est = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_37028 = 1, mean_37028 = 0.0499, sd_37028 = NA
## n_38229 = 1, mean_38229 = 0.0132, sd_38229 = NA
## n_39218 = 1, mean_39218 = 0.0242, sd_39218 = NA
## n_39913 = 1, mean_39913 = 0.0111, sd_39913 = NA
## n_39915 = 1, mean_39915 = 0.012, sd_39915 = NA
## n_40428 = 1, mean_40428 = 0.0175, sd_40428 = NA
## n_41298 = 1, mean_41298 = 0.0172, sd_41298 = NA
## n_41669 = 1, mean_41669 = 0.0208, sd_41669 = NA
## n_43997 = 1, mean_43997 = 0.0185, sd_43997 = NA
## n_44255 = 1, mean_44255 = 0.025, sd_44255 = NA
## n_44756 = 1, mean_44756 = 0.0141, sd_44756 = NA
## n_46712 = 1, mean_46712 = 0.0155, sd_46712 = NA
## n_47834 = 1, mean_47834 = 0.0244, sd_47834 = NA
## n_49360 = 1, mean_49360 = 0.0166, sd_49360 = NA
## n_50538 = 1, mean_50538 = 0.0103, sd_50538 = NA
## n_52952 = 1, mean_52952 = 0.0107, sd_52952 = NA
## n_53132 = 1, mean_53132 = 0.035, sd_53132 = NA
## n_53563 = 1, mean_53563 = 0.0139, sd_53563 = NA
## n_54186 = 1, mean_54186 = 0.022, sd_54186 = NA
## n_54293 = 1, mean_54293 = 0.0283, sd_54293 = NA
## n_55024 = 1, mean_55024 = 0.0155, sd_55024 = NA
## n_56086 = 1, mean_56086 = 0.0329, sd_56086 = NA
## n_56291 = 1, mean_56291 = 0.0204, sd_56291 = NA
## n_56512 = 1, mean_56512 = 0.0251, sd_56512 = NA
## n_56796 = 1, mean_56796 = 0.0279, sd_56796 = NA
## n_56809 = 1, mean_56809 = 0.0141, sd_56809 = NA
## n_56880 = 1, mean_56880 = 0.0183, sd_56880 = NA
## n_57737 = 1, mean_57737 = 0.0143, sd_57737 = NA
## n_60092 = 1, mean_60092 = 0.017, sd_60092 = NA
## n_60266 = 1, mean_60266 = 0.0267, sd_60266 = NA
## n_60304 = 1, mean_60304 = 0.0176, sd_60304 = NA
## n_61004 = 1, mean_61004 = 0.0319, sd_61004 = NA
## n_61271 = 1, mean_61271 = 0.0632, sd_61271 = NA
## n_61349 = 1, mean_61349 = 0.0231, sd_61349 = NA
## n_61402 = 1, mean_61402 = 0.01, sd_61402 = NA
## n_61546 = 1, mean_61546 = 0.0346, sd_61546 = NA
## n_61763 = 1, mean_61763 = 0.017, sd_61763 = NA
## n_61799 = 1, mean_61799 = 0.008, sd_61799 = NA
## n_61943 = 1, mean_61943 = 0.0194, sd_61943 = NA
## n_62336 = 1, mean_62336 = 0.0121, sd_62336 = NA
## n_62591 = 1, mean_62591 = 0.0254, sd_62591 = NA
## n_62728 = 1, mean_62728 = 0.0233, sd_62728 = NA
## n_63353 = 1, mean_63353 = 0.0332, sd_63353 = NA
## n_63897 = 1, mean_63897 = 0.0205, sd_63897 = NA
## n_65091 = 1, mean_65091 = 0.0124, sd_65091 = NA
## n_65678 = 1, mean_65678 = 0.031, sd_65678 = NA
## n_65882 = 1, mean_65882 = 0.0372, sd_65882 = NA
## n_66180 = 1, mean_66180 = 0.017, sd_66180 = NA
## n_66430 = 1, mean_66430 = 0.0328, sd_66430 = NA
## n_66793 = 1, mean_66793 = 0.0219, sd_66793 = NA
## n_67059 = 1, mean_67059 = 0.0255, sd_67059 = NA
## n_67535 = 1, mean_67535 = 0.0375, sd_67535 = NA
## n_67943 = 1, mean_67943 = 0.0219, sd_67943 = NA
## n_68241 = 1, mean_68241 = 0.0306, sd_68241 = NA
## n_68255 = 1, mean_68255 = 0.029, sd_68255 = NA
## n_68575 = 1, mean_68575 = 0.0179, sd_68575 = NA
## n_69019 = 1, mean_69019 = 0.0251, sd_69019 = NA
## n_69547 = 1, mean_69547 = 0.0453, sd_69547 = NA
## n_69607 = 1, mean_69607 = 0.0258, sd_69607 = NA
## n_69811 = 1, mean_69811 = 0.016, sd_69811 = NA
## n_70002 = 1, mean_70002 = 0.0363, sd_70002 = NA
## n_70144 = 1, mean_70144 = 0.057, sd_70144 = NA
## n_71785 = 1, mean_71785 = 0.0334, sd_71785 = NA
## n_71786 = 1, mean_71786 = 0.0176, sd_71786 = NA
## n_72470 = 1, mean_72470 = 0.07, sd_72470 = NA
## n_72612 = 1, mean_72612 = 0.0339, sd_72612 = NA
## n_72951 = 1, mean_72951 = 0.0369, sd_72951 = NA
## n_73020 = 1, mean_73020 = 0.0423, sd_73020 = NA
## n_73138 = 1, mean_73138 = 0.0273, sd_73138 = NA
## n_73444 = 1, mean_73444 = 0.0373, sd_73444 = NA
## n_73767 = 1, mean_73767 = 0.0277, sd_73767 = NA
## n_73822 = 1, mean_73822 = 0.0245, sd_73822 = NA
## n_73998 = 1, mean_73998 = 0.049, sd_73998 = NA
## n_74239 = 1, mean_74239 = 0.0174, sd_74239 = NA
## n_74273 = 1, mean_74273 = 0.0478, sd_74273 = NA
## n_74457 = 1, mean_74457 = 0.0882, sd_74457 = NA
## n_74755 = 1, mean_74755 = 0.0418, sd_74755 = NA
## n_74934 = 1, mean_74934 = 0.027, sd_74934 = NA
## n_75330 = 1, mean_75330 = 0.0197, sd_75330 = NA
## n_75415 = 1, mean_75415 = 0.036, sd_75415 = NA
## n_75885 = 1, mean_75885 = 0.0244, sd_75885 = NA
## n_76050 = 1, mean_76050 = 0.0262, sd_76050 = NA
## n_76210 = 1, mean_76210 = 0.0459, sd_76210 = NA
## n_76273 = 1, mean_76273 = 0.029, sd_76273 = NA
## n_77253 = 1, mean_77253 = 0.0301, sd_77253 = NA
## n_77364 = 2, mean_77364 = 0.028, sd_77364 = 0.0031
## n_77415 = 1, mean_77415 = 0.0621, sd_77415 = NA
## n_77727 = 1, mean_77727 = 0.0493, sd_77727 = NA
## n_77796 = 1, mean_77796 = 0.0481, sd_77796 = NA
## n_79341 = 1, mean_79341 = 0.045, sd_79341 = NA
## n_79448 = 1, mean_79448 = 0.0704, sd_79448 = NA
## n_79488 = 1, mean_79488 = 0.0641, sd_79488 = NA
## n_79830 = 1, mean_79830 = 0.0313, sd_79830 = NA
## n_79996 = 1, mean_79996 = 0.026, sd_79996 = NA
## n_80156 = 1, mean_80156 = 0.0259, sd_80156 = NA
## n_80774 = 1, mean_80774 = 0.0637, sd_80774 = NA
## n_82566 = 1, mean_82566 = 0.0804, sd_82566 = NA
## n_82651 = 1, mean_82651 = 0.0344, sd_82651 = NA
## n_83316 = 1, mean_83316 = 0.0477, sd_83316 = NA
## n_84058 = 1, mean_84058 = 0.1266, sd_84058 = NA
## n_84284 = 1, mean_84284 = 0.0877, sd_84284 = NA
## n_84338 = 1, mean_84338 = 0.0417, sd_84338 = NA
## n_84388 = 1, mean_84388 = 0.0341, sd_84388 = NA
## n_86014 = 1, mean_86014 = 0.0203, sd_86014 = NA
## n_86324 = 1, mean_86324 = 0.0212, sd_86324 = NA
## n_86328 = 1, mean_86328 = 0.0548, sd_86328 = NA
## n_86741 = 1, mean_86741 = 0.0794, sd_86741 = NA
## n_86896 = 1, mean_86896 = 0.0494, sd_86896 = NA
## n_88043 = 1, mean_88043 = 0.0454, sd_88043 = NA
## n_88110 = 1, mean_88110 = 0.0412, sd_88110 = NA
## n_88151 = 1, mean_88151 = 0.0496, sd_88151 = NA
## n_88272 = 1, mean_88272 = 0.0462, sd_88272 = NA
## n_88723 = 1, mean_88723 = 0.0701, sd_88723 = NA
## n_88783 = 1, mean_88783 = 0.0848, sd_88783 = NA
## n_88819 = 1, mean_88819 = 0.0444, sd_88819 = NA
## n_89901 = 1, mean_89901 = 0.0123, sd_89901 = NA
## n_90629 = 1, mean_90629 = 0.0521, sd_90629 = NA
## n_91077 = 1, mean_91077 = 0.0379, sd_91077 = NA
## n_91492 = 1, mean_91492 = 0.0526, sd_91492 = NA
## n_92584 = 1, mean_92584 = 0.0301, sd_92584 = NA
## n_93502 = 1, mean_93502 = 0.0537, sd_93502 = NA
## n_94318 = 1, mean_94318 = 0.0543, sd_94318 = NA
## n_95273 = 1, mean_95273 = 0.0883, sd_95273 = NA
## n_96252 = 1, mean_96252 = 0.1434, sd_96252 = NA
## n_96420 = 1, mean_96420 = 0.052, sd_96420 = NA
## n_96432 = 1, mean_96432 = 0.0367, sd_96432 = NA
## n_96441 = 1, mean_96441 = 0.0638, sd_96441 = NA
## n_96565 = 1, mean_96565 = 0.0279, sd_96565 = NA
## n_96581 = 1, mean_96581 = 0.0821, sd_96581 = NA
## n_96722 = 1, mean_96722 = 0.0347, sd_96722 = NA
## n_99435 = 1, mean_99435 = 0.0149, sd_99435 = NA
## n_99557 = 1, mean_99557 = 0.0549, sd_99557 = NA
## n_101351 = 1, mean_101351 = 0.1024, sd_101351 = NA
## n_103311 = 1, mean_103311 = 0.0327, sd_103311 = NA
## n_103853 = 1, mean_103853 = 0.0246, sd_103853 = NA
## n_104570 = 1, mean_104570 = 0.0949, sd_104570 = NA
## n_105838 = 1, mean_105838 = 0.0937, sd_105838 = NA
## n_107497 = 1, mean_107497 = 0.0954, sd_107497 = NA
## n_108378 = 1, mean_108378 = 0.011, sd_108378 = NA
## n_111811 = 1, mean_111811 = 0.0688, sd_111811 = NA
## n_116523 = 1, mean_116523 = 0.1188, sd_116523 = NA
## n_116573 = 1, mean_116573 = 7e-04, sd_116573 = NA
## n_116979 = 1, mean_116979 = 0.0038, sd_116979 = NA
## n_122437 = 1, mean_122437 = 0.0027, sd_122437 = NA
## n_124132 = 1, mean_124132 = 0.042, sd_124132 = NA
## n_126846 = 1, mean_126846 = 0.1512, sd_126846 = NA
## n_127438 = 1, mean_127438 = 0.0728, sd_127438 = NA
## n_130084 = 1, mean_130084 = 0.048, sd_130084 = NA
## n_131372 = 1, mean_131372 = 0.0152, sd_131372 = NA
## n_139515 = 1, mean_139515 = 0.0358, sd_139515 = NA
## n_141608 = 1, mean_141608 = 0.0273, sd_141608 = NA
## n_144485 = 1, mean_144485 = 0.0329, sd_144485 = NA
## n_144821 = 1, mean_144821 = 0.0866, sd_144821 = NA
## n_145975 = 1, mean_145975 = 0.0549, sd_145975 = NA
## n_159651 = 1, mean_159651 = 0.0314, sd_159651 = NA
## n_159847 = 1, mean_159847 = 0.0668, sd_159847 = NA
## n_159984 = 1, mean_159984 = 0.0478, sd_159984 = NA
## n_163529 = 1, mean_163529 = 0.0209, sd_163529 = NA
## n_163900 = 1, mean_163900 = 0.0722, sd_163900 = NA
## n_165258 = 1, mean_165258 = 0.0313, sd_165258 = NA
## n_166068 = 1, mean_166068 = 0.0647, sd_166068 = NA
## n_168940 = 1, mean_168940 = 0.0136, sd_168940 = NA
## n_175270 = 1, mean_175270 = 0.0157, sd_175270 = NA
## n_176533 = 1, mean_176533 = 0.0362, sd_176533 = NA
## n_184393 = 1, mean_184393 = 0.0255, sd_184393 = NA
## n_185322 = 1, mean_185322 = 0.0112, sd_185322 = NA
## n_186872 = 1, mean_186872 = 0.1584, sd_186872 = NA
## n_187011 = 1, mean_187011 = 0.0522, sd_187011 = NA
## n_190868 = 1, mean_190868 = 0.0261, sd_190868 = NA
## n_192865 = 1, mean_192865 = 0.0241, sd_192865 = NA
## n_195093 = 1, mean_195093 = 0.0184, sd_195093 = NA
## n_196198 = 1, mean_196198 = 0.0191, sd_196198 = NA
## n_196497 = 1, mean_196497 = 0.0285, sd_196497 = NA
## n_200596 = 1, mean_200596 = 0.0332, sd_200596 = NA
## n_212092 = 1, mean_212092 = 0.0619, sd_212092 = NA
## n_214973 = 1, mean_214973 = 8e-04, sd_214973 = NA
## n_226647 = 1, mean_226647 = 0.0618, sd_226647 = NA
## n_230004 = 1, mean_230004 = 0.008, sd_230004 = NA
## n_232513 = 1, mean_232513 = 0.0026, sd_232513 = NA
## n_340196 = 1, mean_340196 = 0.0053, sd_340196 = NA
## n_415435 = 1, mean_415435 = 0.1046, sd_415435 = NA

## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##            Df  Sum Sq    Mean Sq F value  Pr(>F)
## x         180 0.14215 0.00078972   80.32 0.08872
## Residuals   1 0.00001 0.00000983
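The table above reports 180 degrees of freedom for `x` and a single residual degree of freedom, and nearly every group in the listing has `n = 1`. This is the pattern produced when a numeric variable with almost no repeated values is used as a grouping factor. A minimal sketch, using synthetic data rather than the project's data and with all variable names assumed for illustration, reproduces the effect and contrasts it with fitting the same variable as a numeric predictor:

```r
# Synthetic illustration only: these values are NOT the project's data.
set.seed(606)

# 182 hypothetical zip codes with 181 distinct income values (one repeat).
income <- seq(20000, by = 2000, length.out = 181)
income <- c(income, income[1])

# Hypothetical per capita complaint rate, unrelated to income by construction.
complaints <- 0.05 + rnorm(length(income), sd = 0.02)

# Income as a grouping factor: one group per distinct value.
fit_factor <- aov(complaints ~ factor(income))
df.residual(fit_factor)    # 1

# Income as a numeric predictor: one slope, n - 2 residual degrees of freedom.
fit_numeric <- lm(complaints ~ income)
df.residual(fit_numeric)   # 180
```

With only one residual degree of freedom, the F test has very little information for estimating within-group variance, so a regression on the numeric variable may be the more informative test here.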

Conclusion

Summary

The initial question was whether road condition complaints are correlated with income or race. Based on plots, linear models, and statistical tests, both race and income appeared to be correlated with per capita complaints.

The validity of the data was supported by summary statistics for the chosen variables, which showed p-values below 0.05, and by normality checks and Q-Q plots of the residuals.

Further exploration beyond the original scope of the hypothesis revealed significant differences in complaints by borough, as well as clear seasonal trends in complaint volume. Staten Island has a higher rate of complaints per capita than the other boroughs. Early spring months have the greatest volume of complaints, which is plausible: they follow snow plow contact, salt treatment, and snowmelt, and they coincide with an increase in outdoor activity.

Insights

There may be self-selection among 311 complainants. Some people may not want to call 311, for example because they are not comfortable speaking on the phone, do not have time during their workday, or do not speak English. 311 does provide service in multiple languages, but callers may not know this.

311 complaints also come from web forms, which are limited to individuals with computers and internet access. 311 does have a mobile app, but its scope is limited: some of these complaint categories are not available in the app and require the web interface or a phone call to report.

Complaints with blank zip codes were excluded to make data analysis possible. However, because the zip code is voluntary information, excluding the complaints that lack it may omit disenfranchised neighborhoods or individuals.

Median income might have been a better indicator than mean income, since very high incomes in large cities pull the mean upward. In reality, close to 50% of NYC residents live below or near the poverty line, a very different story from a mean income of $90,000.
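The effect of a few very high incomes on the mean can be seen in a small example. The figures below are hypothetical household incomes for a single zip code, chosen only to illustrate the skew; they are not drawn from the Census data used in this project.

```r
# Hypothetical household incomes (USD) for one zip code; not project data.
incomes <- c(18000, 22000, 25000, 31000, 40000, 52000, 75000, 1200000)

mean(incomes)    # 182875: pulled upward by the single very high income
median(incomes)  # 35500: closer to what a typical household earns
```

Seven of the eight households earn well under the mean, which is why the median is often preferred for skewed income distributions.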

Missing and faded markings might mean the same thing to 311 callers, even though DOT handles them through significantly different processes. For operational purposes, I was looking for data that accurately distinguishes the two.

Grouping race percentage by zip code may not be a reliable indicator of the behavior of individuals of a given race. For example, black residents of majority-white zip codes may be more likely to complain to 311 than black residents of majority-black zip codes, but this level of detail is lost in grouping.

Future Research

These two datasets are extremely rich, and much more research could be done.

Incorporating more Census demographic data would allow a statistical summary of more variables and could support predictive models of roadway complaints by population characteristics. Such models could inform city policies and help the city reach underserved or disenfranchised populations.

Monthly complaints could be broken down by borough to see whether conditions are going unresolved or worsening at different rates in different boroughs.
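A minimal sketch of such a breakdown in base R follows. The records and the column names `created` and `borough` are assumptions made for illustration, not the project's actual schema or data.

```r
# Hypothetical complaint records; column names are assumptions, not the project's schema.
complaints <- data.frame(
  created = as.Date(c("2015-03-02", "2015-03-15", "2015-03-20",
                      "2015-04-01", "2015-04-11", "2015-04-25")),
  borough = c("BRONX", "QUEENS", "QUEENS", "BRONX", "BRONX", "QUEENS")
)

# Label each complaint with its year and month, then count by month and borough.
complaints$month <- format(complaints$created, "%Y-%m")
counts <- unclass(table(complaints$month, complaints$borough))

# Month-over-month change in complaint counts, one column per borough.
change <- diff(counts)
change
```

A borough whose changes stay positive over many months would be a candidate for closer review, though raw counts would still need to be scaled by population or roadway area before boroughs are compared.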

Crowdsourced reports from such a large population are bound to carry a meaningful signal. To determine the impact of these complaints, however, data from the Department of Transportation would be required: for example, the percentage of roadway area paved per month, or the percentage of total marking footage installed per month. This could show whether the city government is responding to complaints proportionally.
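One way to frame proportionality is to compare each month's share of complaints with its share of completed work. The sketch below uses invented monthly totals and a hypothetical `lane_miles_paved` column; no real DOT or 311 figures are shown, and the ratio is only one of several possible measures.

```r
# Hypothetical monthly totals; neither series is real DOT or 311 data.
monthly <- data.frame(
  month            = month.abb[1:6],
  complaints       = c(120, 150, 310, 280, 190, 140),
  lane_miles_paved = c(2, 3, 9, 12, 10, 6)
)

# Share of the six-month total that falls in each month.
monthly$complaint_share <- monthly$complaints / sum(monthly$complaints)
monthly$paving_share    <- monthly$lane_miles_paved / sum(monthly$lane_miles_paved)

# Ratio above 1: the month's share of paving exceeds its share of complaints.
monthly$response_ratio <- monthly$paving_share / monthly$complaint_share
monthly
```

Because repairs typically lag the complaints that prompt them, a fuller analysis would likely compare complaints against work completed in later months rather than in the same month.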

A citizen with even a beginner's knowledge of R can produce significant findings that inform their participation in local government. City agencies have expanded their data science and GIS work in the hope of harnessing these large streams of data. A targeted approach with a limited scope allows for achievable goals. Combining and analyzing these types of datasets can lead to programs that reduce injury and fatality rates citywide.

Rendering of Chrystie Street, Manhattan, proposed bike lane. Credit: NYC DOT

Data 606 Final Project: New York City Road Condition Complaints and Demographics

Armenoush Aslanian-Persico
