Credit: NYC DOT
When the New York City Department of Transportation maintains roadways in a timely manner, few people thank them. But when conditions deteriorate, serious collisions can occur, resulting in loss of life.
Managing a city operation is a challenge. With the increase in data collection and reporting, NYC employees and residents have powerful tools to answer questions and identify problem spots. The result of this data analysis could impact funding, hiring, citywide initiatives, and project management at all levels.
New York City 311 is a free public service that allows individuals to register complaints or inquiries on city conditions. Comparing the roadway condition complaints to Census data has great potential for statistical discovery.
Comparing majority black and majority white populations, is there a correlation between race and per capita roadway condition complaints across zip codes?
Using mean income per zip code, is there a correlation between income and roadway condition complaints?
\(H_0\): Race and income are not correlated with roadway condition complaints per capita.
\(H_a\): Race and/or income are correlated with roadway condition complaints per capita.
Show code to view required libraries.
library(BHH2)
library(devtools)
library(DT)
library(ggplot2)
library(jsonlite)
library(knitr)
library(openintro)
library(plotly)
library(plyr)
library(psych)
library(dplyr)
library(RCurl)
library(reshape2)
library(rmarkdown)
library(shiny)
library(stringr)
library(tidyr)
library(XML)
library(scales)
library(RColorBrewer)
library(leaflet)
library(httr)
The following is a summary of case statistics. For this project, each case is a zip code. Each case is a group of all 311 complaints for that zip code for the selected time period.
Statistic<-c("8,500,000",
"1/1/2013 to 5/12/2016",
"325,025",
"281,680",
"274,047",
"769",
"6,864",
"182")
Description<-c("New York City population, estimated.",
"Selected date range of 311 complaints.",
"Street condition complaints in date range.",
"Street condition complaints with a reported zip code.",
"Asphalt complaints.",
"Missing markings complaints.",
"Faded markings complaints.",
"Number of zip codes in NYC, considered as cases.")
test1<-data.frame(Statistic,Description)
kable(test1)
| Statistic | Description |
|---|---|
| 8,500,000 | New York City population, estimated. |
| 1/1/2013 to 5/12/2016 | Selected date range of 311 complaints. |
| 325,025 | Street condition complaints in date range. |
| 281,680 | Street condition complaints with a reported zip code. |
| 274,047 | Asphalt complaints. |
| 769 | Missing markings complaints. |
| 6,864 | Faded markings complaints. |
| 182 | Number of zip codes in NYC, considered as cases. |
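As a rough illustration of how these counts relate to one another, the following sketch computes the share of complaints that carry a zip code, the share of those that concern asphalt, and a citywide complaints-per-resident rate. The variable names are chosen for illustration only; the numbers are copied from the table above.

```r
# Counts copied from the case statistics table.
total_complaints   <- 325025   # street condition complaints in date range
with_zip           <- 281680   # complaints with a reported zip code
asphalt_complaints <- 274047   # asphalt complaints
nyc_population     <- 8500000  # estimated city population

share_with_zip <- with_zip / total_complaints    # usable share of all complaints
share_asphalt  <- asphalt_complaints / with_zip  # asphalt share of usable complaints
citywide_rate  <- with_zip / nyc_population      # complaints per resident, citywide

round(c(share_with_zip = share_with_zip,
        share_asphalt  = share_asphalt,
        citywide_rate  = citywide_rate), 4)
```

The citywide rate is a population-weighted figure, so it need not equal the unweighted mean of the per-zip-code rates reported later.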
New York City 311 Data
311 Service Requests from 2010 to Present
US Income Data
2010-2014 American Community Survey 5-Year Estimates
Selected economic characteristics
US Population and Demographic Data
2010-2014 American Community Survey 5-Year Estimates
ACS DEMOGRAPHIC AND HOUSING ESTIMATES
Data Collection
The roadway data was collected by New York City 311 through phone calls and web forms from individuals contacting the city to complain or inquire about a roadway situation. Phone calls are entered directly into the 311 database by phone operators.
The US Census demographic and population data was collected by the US government in a 2010-2014 survey. From the Census website: “The American Community Survey (ACS) is a mandatory, ongoing statistical survey that samples a small percentage of the population every year.”
Study Type
This is an observational study because no experiment was conducted. Some datasets used for this analysis come from sampling done by the US Census.
Response
The response variable is the number of complaints per capita for each zip code. This is a numeric variable.
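A minimal sketch of how the response variable can be derived, assuming it is simply the complaint count divided by the population of the zip code. The counts and populations for the first five zip codes are copied from the str(sdata) output later in this report; the data frame name toy is illustrative.

```r
# First five zip codes, values copied from the str(sdata) output in this report.
toy <- data.frame(
  zip1       = c("ZIP_10001", "ZIP_10002", "ZIP_10003", "ZIP_10004", "ZIP_10005"),
  complaints = c(1521, 1575, 1794, 479, 395),
  pop        = c(22767, 79894, 57068, 3024, 7570)
)

# Complaints per capita: total complaints divided by total population.
toy$pcomp <- toy$complaints / toy$pop
round(toy$pcomp, 4)
```

The rounded results agree with the pcomp column shown in the dataset structure (0.0668, 0.0197, 0.0314, 0.1584, 0.0522).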
Explanatory
The proposed explanatory variables are percent race by zip code and mean income by zip code, both numeric.
Generalizability
The data includes sample demographic statistics across New York City. The complaint data is limited to the individuals who interact with that particular government service. However, due to the volume of 311 records, a significant percentage of the city population is being represented in these cases. Therefore, the sample data is generalizable to the city.
Causality
Race, income, and access to social services are interrelated in many ways. Therefore, it may not be possible to infer that race or income alone affects interaction with local government. Additionally, a complaint is not necessarily an indication of a poor roadway condition. Certain populations may complain less while still experiencing poor conditions.
The following is a list of options for street condition complaints when making a report to 311.
For this project, seven of the 26 complaints were selected. The five asphalt-related complaints were combined into the asphalt complaint category. The remaining two were the two options for marking complaints: faded and missing.
complaintlist<-getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/complainttypes.csv")
df2<- data.frame(read.csv(text=complaintlist))
datatable(df2, options = list( pageLength = 5, lengthMenu = c(5, 10, 40), initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#01975b', 'color': '#fff'});",
"}")), rownames=TRUE)
This project spans 311 complaints from January 1, 2013 through May 12, 2016.
311 receives millions of complaints each year across all city agencies: Health, Environmental Protection, Sanitation, Police, and others.
For this project, the data was filtered as follows:
Only Department of Transportation data.
Only complaints which had a zip code as part of the complaint, to make data analysis possible.
Only complaints which were for a street condition.
Only the five primarily asphalt-related street conditions and two markings-related street conditions.
Asphalt conditions
Marking conditions
To further simplify, all asphalt conditions were combined into one roadway surface complaint group, named “asphalt.”
Missing values were filled in as 0.
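The filtering steps above can be sketched on a toy table. The column names (agency, complaint, descriptor, zip) and the descriptor labels are illustrative assumptions and are not the exact fields of the 311 export; only the logic of the steps is taken from the description.

```r
# Toy 311 records. Column names and descriptor labels are hypothetical.
requests <- data.frame(
  agency     = c("DOT", "DOT", "DOT", "DOT", "DSNY", "DOT"),
  complaint  = c("Street Condition", "Street Condition", "Street Condition",
                 "Street Light Condition", "Dirty Conditions", "Street Condition"),
  descriptor = c("Pothole", "Cave-in", "Line/Marking - Faded",
                 "Street Light Out", "Trash", "Pothole"),
  zip        = c("10001", "10001", "10002", NA, "10001", "10002"),
  stringsAsFactors = FALSE
)

asphalt_types <- c("Pothole", "Cave-in")  # hypothetical asphalt descriptors
marking_map   <- c("Line/Marking - Faded"   = "faded",
                   "Line/Marking - Missing" = "missing")

# Keep DOT street condition complaints that have a zip code and a selected type.
keep <- requests$agency == "DOT" &
  requests$complaint == "Street Condition" &
  !is.na(requests$zip) &
  requests$descriptor %in% c(asphalt_types, names(marking_map))
filtered <- requests[keep, ]

# Combine all asphalt descriptors into one group; keep the two marking groups.
filtered$group <- ifelse(filtered$descriptor %in% asphalt_types,
                         "asphalt", marking_map[filtered$descriptor])
filtered$group <- factor(filtered$group, levels = c("asphalt", "missing", "faded"))

# Cross-tabulating by zip code yields 0 for combinations with no complaints.
counts <- as.data.frame.matrix(table(filtered$zip, filtered$group))
counts
```

Fixing the factor levels before tabulating is what produces explicit zero counts, which corresponds to filling missing values with 0.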
Sample view of 311 data
sample311<-getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/sample311.csv")
df1<- data.frame(read.csv(text=sample311))
datatable(df1, options = list( pageLength = 5, lengthMenu = c(5, 10), initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#00838f', 'color': '#fff'});",
"}")), rownames=TRUE)
A list of zip codes and their associated borough was extracted from the 311 data.
Table of zip codes by borough.
Table of zip codes by neighborhood name (PDF).
Income information by zip code was retrieved from the US Census.
The data was transformed as follows, via the US Census portal and dataset processing.
The relevant attribute column names were as follows:
This attribute corresponded to mean income per zip code.
Population and demographic information by zip code were retrieved from the US Census and transformed as per the previous section.
The relevant attribute column names were as follows:
These attributes corresponded to total population per zip code, percent white per zip code, percent black per zip code.
The datasets were merged and transformed to display the required data. The combined datasets were uploaded to GitHub for reproducibility.
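The merge step might look like the following sketch, assuming each source table is keyed by the same zip code column. The three input data frame names are hypothetical; the values for the first three zip codes are copied from the str(sdata) output later in this report.

```r
# Toy inputs; the data frame names are hypothetical.
complaints_by_zip <- data.frame(
  zip1       = c("ZIP_10001", "ZIP_10002", "ZIP_10003"),
  complaints = c(1521, 1575, 1794)
)
income_by_zip <- data.frame(
  zip1   = c("ZIP_10001", "ZIP_10002", "ZIP_10003"),
  income = c(159847, 75330, 159651)
)
demo_by_zip <- data.frame(
  zip1   = c("ZIP_10001", "ZIP_10002", "ZIP_10003"),
  pop    = c(22767, 79894, 57068),
  pwhite = c(60.5, 31.2, 77.2),
  pblack = c(11.4, 7.6, 4.6)
)

# Inner joins on the shared key keep only zip codes present in every source.
combined <- merge(merge(complaints_by_zip, income_by_zip, by = "zip1"),
                  demo_by_zip, by = "zip1")
combined
```

An inner join drops zip codes that lack income or demographic data, which is one way cases such as parks or airports could fall out of the combined dataset.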
Attribute Names
The following is attribute information for the combined primary dataset.
dict <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/sdatadictionary.csv")
dictionary<- data.frame(read.csv(text=dict))
datatable(dictionary, options = list( pageLength = 5, lengthMenu = c(5, 10), initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#fff9c4', 'color': '#000000'});",
"}")), rownames=TRUE)
Load Data
Show code to view data loading.
# data by zip code
sumzipdata <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/zipdatacsv2.csv")
sdata<- data.frame(read.csv(text=sumzipdata))
# data by month
monthdata<- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-606-Project/master/monthlycomplaints.csv")
mdata<- data.frame(read.csv(text=monthdata))
Combined dataset of complaints and demographics by zip code.
Structure and summary statistics of the primary dataset.
See above attribute table for variable descriptions.
# Structure
# Summary
str(sdata)
## 'data.frame': 182 obs. of 14 variables:
## $ zip1 : Factor w/ 182 levels "ZIP_10001","ZIP_10002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ borough : Factor w/ 5 levels "BRONX","BROOKLYN",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ asphalt : int 1497 1530 1753 465 389 209 685 903 624 1473 ...
## $ missing : int 1 7 3 0 0 0 1 7 1 4 ...
## $ faded : int 23 38 38 14 6 4 20 13 16 30 ...
## $ complaints: int 1521 1575 1794 479 395 213 706 923 641 1507 ...
## $ income : int 159847 75330 159651 186872 187011 163900 415435 99435 163529 196497 ...
## $ pop : int 22767 79894 57068 3024 7570 2950 6748 61806 30708 52941 ...
## $ pwhite : num 60.5 31.2 77.2 77.4 74.9 71.5 67.2 59.8 72.4 79.3 ...
## $ pblack : num 11.4 7.6 4.6 2.3 3.7 3.2 7 6.8 6.5 3.8 ...
## $ pcomp : num 0.0668 0.0197 0.0314 0.1584 0.0522 ...
## $ pcompa : num 0.0658 0.0192 0.0307 0.1538 0.0514 ...
## $ pcompm : num 4.39e-05 8.76e-05 5.26e-05 0.00 0.00 ...
## $ pcompf : num 0.00101 0.000476 0.000666 0.00463 0.000793 ...
summary(sdata)
## zip1 borough asphalt missing
## ZIP_10001: 1 BRONX :26 Min. : 1.0 Min. : 0.000
## ZIP_10002: 1 BROOKLYN :39 1st Qu.: 902.2 1st Qu.: 1.000
## ZIP_10003: 1 MANHATTAN :45 Median :1345.0 Median : 3.000
## ZIP_10004: 1 QUEENS :60 Mean :1505.8 Mean : 4.225
## ZIP_10005: 1 STATEN ISLAND:12 3rd Qu.:1887.8 3rd Qu.: 5.000
## ZIP_10006: 1 Max. :7603.0 Max. :24.000
## (Other) :176
## faded complaints income pop
## Min. : 0.00 Min. : 1.0 Min. : 37028 Min. : 1252
## 1st Qu.: 13.25 1st Qu.: 925.5 1st Qu.: 65729 1st Qu.: 25476
## Median : 24.50 Median :1368.5 Median : 79395 Median : 41636
## Mean : 37.71 Mean :1547.7 Mean : 96711 Mean : 46314
## 3rd Qu.: 37.75 3rd Qu.:1934.5 3rd Qu.:104391 3rd Qu.: 64742
## Max. :447.00 Max. :7730.0 Max. :415435 Max. :110385
##
## pwhite pblack pcomp pcompa
## Min. : 1.90 Min. : 0.00 Min. :0.0006639 Min. :0.0006639
## 1st Qu.:24.55 1st Qu.: 2.80 1st Qu.:0.0198602 1st Qu.:0.0192186
## Median :49.55 Median : 7.95 Median :0.0311590 Median :0.0305364
## Mean :48.28 Mean :21.74 Mean :0.0387768 Mean :0.0377392
## 3rd Qu.:71.80 3rd Qu.:32.27 3rd Qu.:0.0493438 3rd Qu.:0.0480794
## Max. :98.80 Max. :92.20 Max. :0.1583995 Max. :0.1537698
##
## pcompm pcompf
## Min. :0.000e+00 Min. :0.0000000
## 1st Qu.:2.574e-05 1st Qu.:0.0003332
## Median :6.423e-05 Median :0.0005900
## Mean :9.194e-05 Mean :0.0009456
## 3rd Qu.:1.145e-04 3rd Qu.:0.0009929
## Max. :4.984e-04 Max. :0.0117003
##
# Selected statistics
Statistic<-c(46314,
0.0388,
48.28,
21.74,
49.55,
7.95,
96710,
79395
)
Description<-c("Mean population per NYC zip code.",
"Mean street condition complaints per capita.",
"Mean percent white population per zip code.",
"Mean percent black population per zip code.",
"Median percent white population per zip code.",
"Median percent black population per zip code.",
"Mean income per zip code.",
"Median income per zip code."
)
test2<-data.frame(Statistic,Description)
kable(test2, digits=2)
| Statistic | Description |
|---|---|
| 46314.00 | Mean population per NYC zip code. |
| 0.04 | Mean street condition complaints per capita. |
| 48.28 | Mean percent white population per zip code. |
| 21.74 | Mean percent black population per zip code. |
| 49.55 | Median percent white population per zip code. |
| 7.95 | Median percent black population per zip code. |
| 96710.00 | Mean income per zip code. |
| 79395.00 | Median income per zip code. |
# Distribution of Road Condition Complaints Per Capita
hist(sdata$pcomp, col="#fff9c4", breaks=10, main="Distribution of Road Condition Complaints Per Capita", xlab="Complaints Per Capita")
# Box Plot of Income and Population by Zip Code
boxplot(sdata$income, sdata$pop, names=c("Income","Population"), col=c("#c5e1a5","#ffe0b2"), main="Box Plot of Income and Population by Zip Code")
# Distribution of Income
hist(sdata$income, col="#c5e1a5", breaks=20, main="Distribution of Income", xlab="Income")
# Road Condition Complaints vs. Income
minc<- lm(pcomp ~ income, data = sdata)
plot(sdata$pcomp ~ sdata$income, col="#81d4fa", main="Road Condition Complaints vs. Income", xlab="Income", ylab="Complaints Per Capita")
abline(minc)
# Road Condition Complaints vs. Income, with xlim1
minc1<- lm(pcomp ~ income, data = sdata)
plot(sdata$pcomp ~ sdata$income, col="#1976d2", main="Road Condition Complaints vs. Income, $20,000-$200,000", xlab="Income", ylab="Complaints Per Capita", xlim=c(20000,200000))
abline(minc1)
# Road Condition Complaints vs. Income, with xlim2
minc2<- lm(pcomp ~ income, data = sdata)
plot(sdata$pcomp ~ sdata$income, col="#01579b", main="Road Condition Complaints vs. Income, $30,000-$100,000", xlab="Income", ylab="Complaints Per Capita", xlim=c(30000,100000), ylim=c(0,0.10))
abline(minc2)
# Correlation for Percent White
wcor<-cor(sdata$pcomp, sdata$pwhite)
# Correlation for Percent Black
bcor<-cor(sdata$pcomp, sdata$pblack)
# Distribution of Percent Whites in Zip Code
hist(sdata$pwhite, col="#ffcc80", breaks=20, main="White Population", xlab="Distribution of Percent Whites in Zip Code", ylim=c(0,70))
# Distribution of Percent Blacks in Zip Code
hist(sdata$pblack, col="#26a69a", breaks=20, main="Black Population", xlab="Distribution of Percent Blacks in Zip Code", ylim=c(0,70))
# Road Condition Complaints vs. Percentage Whites
mwhite<- lm(pcomp ~ pwhite, data = sdata)
plot(sdata$pcomp ~ sdata$pwhite, col="#ffcc80", main="Road Condition Complaints vs. Percentage Whites", xlab="Percent White Population per Zip Code", ylab="Complaints Per Capita")
abline(mwhite)
# Road Condition Complaints vs. Percentage Blacks
mblack<- lm(pcomp ~ pblack, data = sdata)
plot(sdata$pcomp ~ sdata$pblack, col="#26a69a", main="Road Condition Complaints vs. Percentage Blacks", xlab="Percent Black Population per Zip Code", ylab="Complaints Per Capita")
abline(mblack)
# Residuals for Percent White
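# Residuals are plotted against the explanatory variable, with a reference line at zero
plot(mwhite$residuals ~ sdata$pwhite, main="Residuals for Percent White", col="#ffcc80", xlab="Percent White Population per Zip Code", ylab="Residuals")
abline(h = 0, lty = 3)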
hist(mwhite$residuals, main="Residuals for Percent White", ylim=c(0,80), xlab="Residuals" , col="#ffcc80")
qqnorm(mwhite$residuals, col="#ffcc80")
qqline(mwhite$residuals, col="#ffcc80")
# Residuals for Percent Black
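# Residuals are plotted against the explanatory variable, with a reference line at zero
plot(mblack$residuals ~ sdata$pblack, main="Residuals for Percent Black", col="#26a69a", xlab="Percent Black Population per Zip Code", ylab="Residuals")
abline(h = 0, lty = 3)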
hist(mblack$residuals, main="Residuals for Percent Black", ylim=(c(0,80)), xlab="Residuals", col="#26a69a")
qqnorm(mblack$residuals, col="#26a69a")
qqline(mblack$residuals, col="#26a69a")
The correlation between percent white population and per capita roadway complaints is 0.29.
The correlation between percent black population and per capita roadway complaints is -0.21.
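These correlations can be cross-checked against the linear model summaries reported later. For a simple regression with one explanatory variable, the correlation equals the square root of R-squared, with the sign of the slope. The sketch below uses the R-squared values from summary(mwhite) and summary(mblack); the variable names are illustrative.

```r
# R-squared values and slope signs copied from summary(mwhite) and summary(mblack).
r2_white <- 0.0817   # slope for pwhite is positive
r2_black <- 0.04554  # slope for pblack is negative

r_white <- sqrt(r2_white)   # positive slope, positive correlation
r_black <- -sqrt(r2_black)  # negative slope, negative correlation

round(c(white = r_white, black = r_black), 2)
```

Both values round to the correlations stated above. A significance test for a correlation is also available through cor.test() in base R.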
# Box Plots by Borough
# Per Capita Complaints by Borough
# Per Capita Complaints by Borough, Asphalt Only
# Per Capita Complaints by Borough, Missing Markings Only
# Per Capita Complaints by Borough, Faded Markings Only
boxplot(
sdata$pcomp[sdata$borough=="BRONX"],
sdata$pcomp[sdata$borough=="BROOKLYN"],
sdata$pcomp[sdata$borough=="MANHATTAN"],
sdata$pcomp[sdata$borough=="QUEENS"],
sdata$pcomp[sdata$borough=="STATEN ISLAND"],
col=(c("#E8EAF6","#c5cae9","#9fa8da","#7986cb","#5c6bc0")),
names=(c("BRONX","BKLYN","MANHATTAN","QUEENS","STATEN IS.")),
main="Per Capita Complaints by Borough", ylab="Per Capita Complaints", xlab="Borough")
boxplot(
sdata$pcompa[sdata$borough=="BRONX"],
sdata$pcompa[sdata$borough=="BROOKLYN"],
sdata$pcompa[sdata$borough=="MANHATTAN"],
sdata$pcompa[sdata$borough=="QUEENS"],
sdata$pcompa[sdata$borough=="STATEN ISLAND"],
col=(c("#E8EAF6","#c5cae9","#9fa8da","#7986cb","#5c6bc0")),
names=(c("BRONX","BKLYN","MANHATTAN","QUEENS","STATEN IS.")),
main="Per Capita Complaints by Borough, Asphalt Only", ylab="Per Capita Asphalt Complaints", xlab="Borough")
boxplot(
sdata$pcompm[sdata$borough=="BRONX"],
sdata$pcompm[sdata$borough=="BROOKLYN"],
sdata$pcompm[sdata$borough=="MANHATTAN"],
sdata$pcompm[sdata$borough=="QUEENS"],
sdata$pcompm[sdata$borough=="STATEN ISLAND"],
col=(c("#E8EAF6","#c5cae9","#9fa8da","#7986cb","#5c6bc0")),
names=(c("BRONX","BKLYN","MANHATTAN","QUEENS","STATEN IS.")),
main="Per Capita Complaints by Borough, Missing Markings Only", ylab="Per Capita Missing Markings Complaints", xlab="Borough")
boxplot(
sdata$pcompf[sdata$borough=="BRONX"],
sdata$pcompf[sdata$borough=="BROOKLYN"],
sdata$pcompf[sdata$borough=="MANHATTAN"],
sdata$pcompf[sdata$borough=="QUEENS"],
sdata$pcompf[sdata$borough=="STATEN ISLAND"],
col=(c("#E8EAF6","#c5cae9","#9fa8da","#7986cb","#5c6bc0")),
names=(c("BRONX","BKLYN","MANHATTAN","QUEENS","STATEN IS.")),
main="Per Capita Complaints by Borough, Faded Markings Only", ylab="Per Capita Faded Markings Complaints", xlab="Borough", ylim=c(0,0.006))
The means of the distributions can be compared by splitting the per capita complaint variable into borough groups.
# means by borough
by(sdata$pcomp, sdata$borough, mean)
## sdata$borough: BRONX
## [1] 0.02962233
## --------------------------------------------------------
## sdata$borough: BROOKLYN
## [1] 0.03189598
## --------------------------------------------------------
## sdata$borough: MANHATTAN
## [1] 0.03309111
## --------------------------------------------------------
## sdata$borough: QUEENS
## [1] 0.0421183
## --------------------------------------------------------
## sdata$borough: STATEN ISLAND
## [1] 0.08558736
by(sdata$pcomp, sdata$borough, summary)
## sdata$borough: BRONX
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002635 0.015800 0.018430 0.029620 0.032140 0.143400
## --------------------------------------------------------
## sdata$borough: BROOKLYN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01028 0.02526 0.03020 0.03190 0.03709 0.06315
## --------------------------------------------------------
## sdata$borough: MANHATTAN
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0006639 0.0149300 0.0211700 0.0330900 0.0361900 0.1584000
## --------------------------------------------------------
## sdata$borough: QUEENS
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002696 0.025920 0.036960 0.042120 0.052200 0.151200
## --------------------------------------------------------
## sdata$borough: STATEN ISLAND
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06368 0.07676 0.08626 0.08559 0.09400 0.11880
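As a consistency check, the overall mean of per capita complaints should equal the average of the borough means weighted by the number of zip codes in each borough. The sketch below uses the borough counts from summary(sdata) and the borough means printed above; the vector names are illustrative.

```r
# Zip codes per borough (from summary(sdata)) and borough means (from by()).
borough_n <- c(BRONX = 26, BROOKLYN = 39, MANHATTAN = 45,
               QUEENS = 60, `STATEN ISLAND` = 12)
borough_mean <- c(0.02962233, 0.03189598, 0.03309111, 0.0421183, 0.08558736)

# Weighted average of borough means, weighted by number of zip codes.
overall <- sum(borough_n * borough_mean) / sum(borough_n)
round(overall, 7)

# Ratio of the Staten Island mean to the next highest borough mean (Queens).
round(borough_mean[5] / borough_mean[4], 2)
```

The weighted average reproduces the overall mean of pcomp reported in the dataset summary (0.0387768).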
ggplot(
mdata, aes(x = yyyymm, y = tcomplaints, fill=tcomplaints)) +
geom_bar(stat="identity") +
ggtitle("Complaints by Month")+
theme(axis.text=element_text(angle=90))+
labs(x="Month",y="Complaints")
There were several interesting outliers. The graph of mean income had many outliers in the upper ranges, as expected for New York City.
The box plots of per capita complaints by borough showed more outliers for Manhattan at the upper range. This could reflect Manhattan's higher density in general, or a dedicated group of concerned citizens.
The per capita complaints for Staten Island were noticeably greater than in the other boroughs (mean 0.086, versus 0.030 to 0.042). Staten Island has historically had issues with road conditions. It is geographically separated from the other four boroughs and has many sprawling suburban areas, which are difficult to service in a timely manner.
There were high income outliers for zip codes. In one case, a Westchester County zip code was included in the NYC zip code list, possibly due to a shared border street with the Bronx. (ZIP_10803, mean income 232513, Bronx.)
A high income outlier for Queens was a zip code in Long Island City, across the river from Manhattan, where dozens of new skyscraper condominiums have risen in recent years on the site of previously industrial neighborhoods. (ZIP_11109, mean income 168940, Queens.)
A high outlier for per capita complaints was ZIP_10004 in Lower Manhattan. Interestingly, this zip code includes the NYC DOT headquarters, which could potentially be skewing the data due to employees using 311 web forms as part of their job or interest.
Some zip codes in the US Census dataset showed as 0 income or NA income. Researching these zip codes identified these as parks, airports, large office buildings, and in one case the former World Trade Center zip code, which has been discontinued.
Satisfying conditions for inference
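The conditions for inference appear to be approximately satisfied, with caveats. The sample size (182 zip codes) is greater than 30. The distribution of per capita complaints is unimodal but right-skewed (median 0.031, mean 0.039, maximum 0.158), and the model residuals below show the same skew. The zip codes are not a random sample, so independence between cases is assumed rather than guaranteed.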
Confidence interval
summary(mwhite)
##
## Call:
## lm(formula = pcomp ~ pwhite, data = sdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.051881 -0.016013 -0.006274 0.011839 0.110939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.438e-02 4.115e-03 5.923 1.57e-08 ***
## pwhite 2.983e-04 7.453e-05 4.002 9.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02693 on 180 degrees of freedom
## Multiple R-squared: 0.0817, Adjusted R-squared: 0.0766
## F-statistic: 16.02 on 1 and 180 DF, p-value: 9.171e-05
summary(mblack)
##
## Call:
## lm(formula = pcomp ~ pblack, data = sdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.043073 -0.017971 -0.006732 0.010870 0.115067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.387e-02 2.677e-03 16.391 < 2e-16 ***
## pblack -2.344e-04 7.998e-05 -2.931 0.00382 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02746 on 180 degrees of freedom
## Multiple R-squared: 0.04554, Adjusted R-squared: 0.04024
## F-statistic: 8.588 on 1 and 180 DF, p-value: 0.003823
summary(minc)
##
## Call:
## lm(formula = pcomp ~ income, data = sdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.055829 -0.017399 -0.007053 0.011395 0.111364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.992e-02 4.252e-03 7.036 3.99e-11 ***
## income 9.160e-08 3.851e-08 2.379 0.0184 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02767 on 180 degrees of freedom
## Multiple R-squared: 0.03047, Adjusted R-squared: 0.02509
## F-statistic: 5.658 on 1 and 180 DF, p-value: 0.01842
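The model summaries report estimates and standard errors but not the confidence intervals themselves. A minimal sketch of how 95% intervals for the slopes could be derived from the printed values, assuming the usual t-based interval with 180 residual degrees of freedom. The data frame coefs and its column names are illustrative; with the fitted models available, confint() in base R would give the same kind of interval directly.

```r
# Estimates and standard errors copied from the three model summaries above.
coefs <- data.frame(
  term     = c("pwhite", "pblack", "income"),
  estimate = c(2.983e-04, -2.344e-04, 9.160e-08),
  std_err  = c(7.453e-05, 7.998e-05, 3.851e-08)
)
df_resid <- 180  # residual degrees of freedom reported for each model

# t statistic: estimate divided by its standard error.
coefs$t_value <- coefs$estimate / coefs$std_err

# Two-sided 95% interval: estimate plus or minus critical value times standard error.
t_crit <- qt(0.975, df = df_resid)
coefs$ci_low  <- coefs$estimate - t_crit * coefs$std_err
coefs$ci_high <- coefs$estimate + t_crit * coefs$std_err
coefs
```

None of the three intervals contains zero, which is consistent with the p-values below 0.05 in the summaries.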
Inference
# summary
load(url("http://bit.ly/dasi_gss_ws_cl"))
source("http://bit.ly/dasi_inference")
inference(y = sdata$pwhite, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")
## Single mean
## Summary statistics:
## mean = 48.2846 ; sd = 26.8587 ; n = 182
## Standard error = 1.9909
## 95 % Confidence interval = ( 44.3825 , 52.1867 )
inference(y = sdata$pblack, est = "mean", type = "ci", null = 0, alternative = "twosided", method = "theoretical")
## Single mean
## Summary statistics:
## mean = 21.7357 ; sd = 25.516 ; n = 182
## Standard error = 1.8914
## 95 % Confidence interval = ( 18.0287 , 25.4427 )
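The intervals above can be reproduced by hand from the printed mean, standard deviation, and sample size. The sketch below assumes a normal critical value, which matches the reported intervals to about three decimal places; the helper function ci_mean is hypothetical and is not part of the inference() script.

```r
# Hypothetical helper: confidence interval for a single mean using a normal critical value.
ci_mean <- function(mean, sd, n, level = 0.95) {
  se <- sd / sqrt(n)                 # standard error of the mean
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(se = se, lower = mean - z * se, upper = mean + z * se)
}

# Summary statistics copied from the inference() output above.
ci_white <- ci_mean(48.2846, 26.8587, 182)
ci_black <- ci_mean(21.7357, 25.516, 182)
round(rbind(pwhite = ci_white, pblack = ci_black), 4)
```

These are intervals for the mean percent white and percent black population across zip codes, not for the relationship between race and complaints.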
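ANOVA summary statistics for income
Note that inference() treats the numeric income variable as categorical in this call, so almost every zip code forms its own group (n = 1, with no standard deviation). The resulting F-test leaves only 1 residual degree of freedom and its p-value (0.089) is not informative. The linear model minc above is the relevant test for income.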
inference(y = sdata$pcomp, x = sdata$income, est = "mean", type = "ht", null = 0, alternative = "greater", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_37028 = 1, mean_37028 = 0.0499, sd_37028 = NA
## n_38229 = 1, mean_38229 = 0.0132, sd_38229 = NA
## n_39218 = 1, mean_39218 = 0.0242, sd_39218 = NA
## n_39913 = 1, mean_39913 = 0.0111, sd_39913 = NA
## n_39915 = 1, mean_39915 = 0.012, sd_39915 = NA
## n_40428 = 1, mean_40428 = 0.0175, sd_40428 = NA
## n_41298 = 1, mean_41298 = 0.0172, sd_41298 = NA
## n_41669 = 1, mean_41669 = 0.0208, sd_41669 = NA
## n_43997 = 1, mean_43997 = 0.0185, sd_43997 = NA
## n_44255 = 1, mean_44255 = 0.025, sd_44255 = NA
## n_44756 = 1, mean_44756 = 0.0141, sd_44756 = NA
## n_46712 = 1, mean_46712 = 0.0155, sd_46712 = NA
## n_47834 = 1, mean_47834 = 0.0244, sd_47834 = NA
## n_49360 = 1, mean_49360 = 0.0166, sd_49360 = NA
## n_50538 = 1, mean_50538 = 0.0103, sd_50538 = NA
## n_52952 = 1, mean_52952 = 0.0107, sd_52952 = NA
## n_53132 = 1, mean_53132 = 0.035, sd_53132 = NA
## n_53563 = 1, mean_53563 = 0.0139, sd_53563 = NA
## n_54186 = 1, mean_54186 = 0.022, sd_54186 = NA
## n_54293 = 1, mean_54293 = 0.0283, sd_54293 = NA
## n_55024 = 1, mean_55024 = 0.0155, sd_55024 = NA
## n_56086 = 1, mean_56086 = 0.0329, sd_56086 = NA
## n_56291 = 1, mean_56291 = 0.0204, sd_56291 = NA
## n_56512 = 1, mean_56512 = 0.0251, sd_56512 = NA
## n_56796 = 1, mean_56796 = 0.0279, sd_56796 = NA
## n_56809 = 1, mean_56809 = 0.0141, sd_56809 = NA
## n_56880 = 1, mean_56880 = 0.0183, sd_56880 = NA
## n_57737 = 1, mean_57737 = 0.0143, sd_57737 = NA
## n_60092 = 1, mean_60092 = 0.017, sd_60092 = NA
## n_60266 = 1, mean_60266 = 0.0267, sd_60266 = NA
## n_60304 = 1, mean_60304 = 0.0176, sd_60304 = NA
## n_61004 = 1, mean_61004 = 0.0319, sd_61004 = NA
## n_61271 = 1, mean_61271 = 0.0632, sd_61271 = NA
## n_61349 = 1, mean_61349 = 0.0231, sd_61349 = NA
## n_61402 = 1, mean_61402 = 0.01, sd_61402 = NA
## n_61546 = 1, mean_61546 = 0.0346, sd_61546 = NA
## n_61763 = 1, mean_61763 = 0.017, sd_61763 = NA
## n_61799 = 1, mean_61799 = 0.008, sd_61799 = NA
## n_61943 = 1, mean_61943 = 0.0194, sd_61943 = NA
## n_62336 = 1, mean_62336 = 0.0121, sd_62336 = NA
## n_62591 = 1, mean_62591 = 0.0254, sd_62591 = NA
## n_62728 = 1, mean_62728 = 0.0233, sd_62728 = NA
## n_63353 = 1, mean_63353 = 0.0332, sd_63353 = NA
## n_63897 = 1, mean_63897 = 0.0205, sd_63897 = NA
## n_65091 = 1, mean_65091 = 0.0124, sd_65091 = NA
## n_65678 = 1, mean_65678 = 0.031, sd_65678 = NA
## n_65882 = 1, mean_65882 = 0.0372, sd_65882 = NA
## n_66180 = 1, mean_66180 = 0.017, sd_66180 = NA
## n_66430 = 1, mean_66430 = 0.0328, sd_66430 = NA
## n_66793 = 1, mean_66793 = 0.0219, sd_66793 = NA
## n_67059 = 1, mean_67059 = 0.0255, sd_67059 = NA
## n_67535 = 1, mean_67535 = 0.0375, sd_67535 = NA
## n_67943 = 1, mean_67943 = 0.0219, sd_67943 = NA
## n_68241 = 1, mean_68241 = 0.0306, sd_68241 = NA
## n_68255 = 1, mean_68255 = 0.029, sd_68255 = NA
## n_68575 = 1, mean_68575 = 0.0179, sd_68575 = NA
## n_69019 = 1, mean_69019 = 0.0251, sd_69019 = NA
## n_69547 = 1, mean_69547 = 0.0453, sd_69547 = NA
## n_69607 = 1, mean_69607 = 0.0258, sd_69607 = NA
## n_69811 = 1, mean_69811 = 0.016, sd_69811 = NA
## n_70002 = 1, mean_70002 = 0.0363, sd_70002 = NA
## n_70144 = 1, mean_70144 = 0.057, sd_70144 = NA
## n_71785 = 1, mean_71785 = 0.0334, sd_71785 = NA
## n_71786 = 1, mean_71786 = 0.0176, sd_71786 = NA
## n_72470 = 1, mean_72470 = 0.07, sd_72470 = NA
## n_72612 = 1, mean_72612 = 0.0339, sd_72612 = NA
## n_72951 = 1, mean_72951 = 0.0369, sd_72951 = NA
## n_73020 = 1, mean_73020 = 0.0423, sd_73020 = NA
## n_73138 = 1, mean_73138 = 0.0273, sd_73138 = NA
## n_73444 = 1, mean_73444 = 0.0373, sd_73444 = NA
## n_73767 = 1, mean_73767 = 0.0277, sd_73767 = NA
## n_73822 = 1, mean_73822 = 0.0245, sd_73822 = NA
## n_73998 = 1, mean_73998 = 0.049, sd_73998 = NA
## n_74239 = 1, mean_74239 = 0.0174, sd_74239 = NA
## n_74273 = 1, mean_74273 = 0.0478, sd_74273 = NA
## n_74457 = 1, mean_74457 = 0.0882, sd_74457 = NA
## n_74755 = 1, mean_74755 = 0.0418, sd_74755 = NA
## n_74934 = 1, mean_74934 = 0.027, sd_74934 = NA
## n_75330 = 1, mean_75330 = 0.0197, sd_75330 = NA
## n_75415 = 1, mean_75415 = 0.036, sd_75415 = NA
## n_75885 = 1, mean_75885 = 0.0244, sd_75885 = NA
## n_76050 = 1, mean_76050 = 0.0262, sd_76050 = NA
## n_76210 = 1, mean_76210 = 0.0459, sd_76210 = NA
## n_76273 = 1, mean_76273 = 0.029, sd_76273 = NA
## n_77253 = 1, mean_77253 = 0.0301, sd_77253 = NA
## n_77364 = 2, mean_77364 = 0.028, sd_77364 = 0.0031
## n_77415 = 1, mean_77415 = 0.0621, sd_77415 = NA
## n_77727 = 1, mean_77727 = 0.0493, sd_77727 = NA
## n_77796 = 1, mean_77796 = 0.0481, sd_77796 = NA
## n_79341 = 1, mean_79341 = 0.045, sd_79341 = NA
## n_79448 = 1, mean_79448 = 0.0704, sd_79448 = NA
## n_79488 = 1, mean_79488 = 0.0641, sd_79488 = NA
## n_79830 = 1, mean_79830 = 0.0313, sd_79830 = NA
## n_79996 = 1, mean_79996 = 0.026, sd_79996 = NA
## n_80156 = 1, mean_80156 = 0.0259, sd_80156 = NA
## n_80774 = 1, mean_80774 = 0.0637, sd_80774 = NA
## n_82566 = 1, mean_82566 = 0.0804, sd_82566 = NA
## n_82651 = 1, mean_82651 = 0.0344, sd_82651 = NA
## n_83316 = 1, mean_83316 = 0.0477, sd_83316 = NA
## n_84058 = 1, mean_84058 = 0.1266, sd_84058 = NA
## n_84284 = 1, mean_84284 = 0.0877, sd_84284 = NA
## n_84338 = 1, mean_84338 = 0.0417, sd_84338 = NA
## n_84388 = 1, mean_84388 = 0.0341, sd_84388 = NA
## n_86014 = 1, mean_86014 = 0.0203, sd_86014 = NA
## n_86324 = 1, mean_86324 = 0.0212, sd_86324 = NA
## n_86328 = 1, mean_86328 = 0.0548, sd_86328 = NA
## n_86741 = 1, mean_86741 = 0.0794, sd_86741 = NA
## n_86896 = 1, mean_86896 = 0.0494, sd_86896 = NA
## n_88043 = 1, mean_88043 = 0.0454, sd_88043 = NA
## n_88110 = 1, mean_88110 = 0.0412, sd_88110 = NA
## n_88151 = 1, mean_88151 = 0.0496, sd_88151 = NA
## n_88272 = 1, mean_88272 = 0.0462, sd_88272 = NA
## n_88723 = 1, mean_88723 = 0.0701, sd_88723 = NA
## n_88783 = 1, mean_88783 = 0.0848, sd_88783 = NA
## n_88819 = 1, mean_88819 = 0.0444, sd_88819 = NA
## n_89901 = 1, mean_89901 = 0.0123, sd_89901 = NA
## n_90629 = 1, mean_90629 = 0.0521, sd_90629 = NA
## n_91077 = 1, mean_91077 = 0.0379, sd_91077 = NA
## n_91492 = 1, mean_91492 = 0.0526, sd_91492 = NA
## n_92584 = 1, mean_92584 = 0.0301, sd_92584 = NA
## n_93502 = 1, mean_93502 = 0.0537, sd_93502 = NA
## n_94318 = 1, mean_94318 = 0.0543, sd_94318 = NA
## n_95273 = 1, mean_95273 = 0.0883, sd_95273 = NA
## n_96252 = 1, mean_96252 = 0.1434, sd_96252 = NA
## n_96420 = 1, mean_96420 = 0.052, sd_96420 = NA
## n_96432 = 1, mean_96432 = 0.0367, sd_96432 = NA
## n_96441 = 1, mean_96441 = 0.0638, sd_96441 = NA
## n_96565 = 1, mean_96565 = 0.0279, sd_96565 = NA
## n_96581 = 1, mean_96581 = 0.0821, sd_96581 = NA
## n_96722 = 1, mean_96722 = 0.0347, sd_96722 = NA
## n_99435 = 1, mean_99435 = 0.0149, sd_99435 = NA
## n_99557 = 1, mean_99557 = 0.0549, sd_99557 = NA
## n_101351 = 1, mean_101351 = 0.1024, sd_101351 = NA
## n_103311 = 1, mean_103311 = 0.0327, sd_103311 = NA
## n_103853 = 1, mean_103853 = 0.0246, sd_103853 = NA
## n_104570 = 1, mean_104570 = 0.0949, sd_104570 = NA
## n_105838 = 1, mean_105838 = 0.0937, sd_105838 = NA
## n_107497 = 1, mean_107497 = 0.0954, sd_107497 = NA
## n_108378 = 1, mean_108378 = 0.011, sd_108378 = NA
## n_111811 = 1, mean_111811 = 0.0688, sd_111811 = NA
## n_116523 = 1, mean_116523 = 0.1188, sd_116523 = NA
## n_116573 = 1, mean_116573 = 7e-04, sd_116573 = NA
## n_116979 = 1, mean_116979 = 0.0038, sd_116979 = NA
## n_122437 = 1, mean_122437 = 0.0027, sd_122437 = NA
## n_124132 = 1, mean_124132 = 0.042, sd_124132 = NA
## n_126846 = 1, mean_126846 = 0.1512, sd_126846 = NA
## n_127438 = 1, mean_127438 = 0.0728, sd_127438 = NA
## n_130084 = 1, mean_130084 = 0.048, sd_130084 = NA
## n_131372 = 1, mean_131372 = 0.0152, sd_131372 = NA
## n_139515 = 1, mean_139515 = 0.0358, sd_139515 = NA
## n_141608 = 1, mean_141608 = 0.0273, sd_141608 = NA
## n_144485 = 1, mean_144485 = 0.0329, sd_144485 = NA
## n_144821 = 1, mean_144821 = 0.0866, sd_144821 = NA
## n_145975 = 1, mean_145975 = 0.0549, sd_145975 = NA
## n_159651 = 1, mean_159651 = 0.0314, sd_159651 = NA
## n_159847 = 1, mean_159847 = 0.0668, sd_159847 = NA
## n_159984 = 1, mean_159984 = 0.0478, sd_159984 = NA
## n_163529 = 1, mean_163529 = 0.0209, sd_163529 = NA
## n_163900 = 1, mean_163900 = 0.0722, sd_163900 = NA
## n_165258 = 1, mean_165258 = 0.0313, sd_165258 = NA
## n_166068 = 1, mean_166068 = 0.0647, sd_166068 = NA
## n_168940 = 1, mean_168940 = 0.0136, sd_168940 = NA
## n_175270 = 1, mean_175270 = 0.0157, sd_175270 = NA
## n_176533 = 1, mean_176533 = 0.0362, sd_176533 = NA
## n_184393 = 1, mean_184393 = 0.0255, sd_184393 = NA
## n_185322 = 1, mean_185322 = 0.0112, sd_185322 = NA
## n_186872 = 1, mean_186872 = 0.1584, sd_186872 = NA
## n_187011 = 1, mean_187011 = 0.0522, sd_187011 = NA
## n_190868 = 1, mean_190868 = 0.0261, sd_190868 = NA
## n_192865 = 1, mean_192865 = 0.0241, sd_192865 = NA
## n_195093 = 1, mean_195093 = 0.0184, sd_195093 = NA
## n_196198 = 1, mean_196198 = 0.0191, sd_196198 = NA
## n_196497 = 1, mean_196497 = 0.0285, sd_196497 = NA
## n_200596 = 1, mean_200596 = 0.0332, sd_200596 = NA
## n_212092 = 1, mean_212092 = 0.0619, sd_212092 = NA
## n_214973 = 1, mean_214973 = 8e-04, sd_214973 = NA
## n_226647 = 1, mean_226647 = 0.0618, sd_226647 = NA
## n_230004 = 1, mean_230004 = 0.008, sd_230004 = NA
## n_232513 = 1, mean_232513 = 0.0026, sd_232513 = NA
## n_340196 = 1, mean_340196 = 0.0053, sd_340196 = NA
## n_415435 = 1, mean_415435 = 0.1046, sd_415435 = NA
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
##            Df  Sum Sq    Mean Sq F value  Pr(>F)
## x         180 0.14215 0.00078972   80.32 0.08872
## Residuals   1 0.00001 0.00000983
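In the output above, nearly every income value forms its own group (n = 1, sd = NA), so the factor uses 180 degrees of freedom and only one is left for the residuals. A minimal sketch of an alternative is to treat income as a continuous variable and test the correlation directly. The data below are simulated, and the names `zip_data`, `mean_income`, and `complaints_per_capita` are hypothetical stand-ins, not the project's actual objects.

```r
# Simulated stand-in data; NOT the project's 311 or Census values.
set.seed(42)
n <- 182
zip_data <- data.frame(
  mean_income = runif(n, min = 30000, max = 250000)
)
# Hypothetical relationship plus noise, for illustration only.
zip_data$complaints_per_capita <-
  0.02 + 2e-07 * zip_data$mean_income + rnorm(n, sd = 0.02)

# Pearson correlation test with income as a numeric variable.
income_cor <- cor.test(zip_data$mean_income, zip_data$complaints_per_capita)

# Simple linear regression: one slope, n - 2 residual degrees of freedom.
income_fit <- lm(complaints_per_capita ~ mean_income, data = zip_data)

print(income_cor$estimate)
print(anova(income_fit))
```

With income as a numeric predictor, the model spends one degree of freedom on the slope and keeps the rest for estimating residual variance, which is what makes the F test informative.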
The initial question was to determine if a correlation exists between road condition complaints and income or race. Using plots, linear modeling, and statistical analysis, both race and income did appear to be correlated with per capita complaints.
The validity of the results was supported by summary statistics for the chosen variables, which showed p-values below 0.05, and by normality checks and Q-Q plots of the residuals.
Further data exploration beyond the original scope of the hypothesis revealed significant differences in complaints by borough, as well as clear seasonal trends in complaint volume. Staten Island has more complaints per capita than the other boroughs. Early spring months have the greatest volume of complaints, which is plausible because they follow snow plow contact, salt treatment, and snowmelt, and coincide with an increase in outdoor activity.
311 complainants may be self-selected. Some people may not want to call 311, for example if they are not comfortable speaking on the phone, do not have time in their workday, or do not speak English. 311 does provide service in multiple languages, but callers may not know this.
311 complaints also come from web forms, which are limited to individuals with computers and internet access. 311 does have a mobile app, but it is limited: some of these complaint categories are not available in the app and require the web interface or a phone call to report.
Complaints with blank zip codes were excluded to make data analysis possible. However, because the zip code is voluntary information, excluding complaints where the complainant did not provide it may omit disenfranchised neighborhoods or individuals.
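One way to gauge how much this exclusion matters is to measure the share of records dropped, overall and within each complaint type. The sketch below uses simulated records; the data frame `complaints` and its columns `complaint_type` and `zip` are hypothetical and do not reflect the actual 311 field names.

```r
# Simulated complaint records; the columns are hypothetical.
complaints <- data.frame(
  complaint_type = c("Pothole", "Pothole", "Faded markings",
                     "Pothole", "Faded markings", "Missing markings"),
  zip = c("10001", "", "11201", NA, "10301", ""),
  stringsAsFactors = FALSE
)

# Flag records whose zip code is missing or blank.
complaints$zip_missing <- is.na(complaints$zip) | trimws(complaints$zip) == ""

# Share of records that would be dropped overall.
share_dropped <- mean(complaints$zip_missing)

# Share dropped within each complaint type, to spot uneven exclusion.
dropped_by_type <- tapply(complaints$zip_missing, complaints$complaint_type, mean)

# Keep only records that can be joined to Census data by zip code.
complaints_with_zip <- complaints[!complaints$zip_missing, ]

print(share_dropped)
print(dropped_by_type)
```

If the dropped share differs sharply between complaint types or boroughs, the exclusion is unlikely to be random and the per capita figures should be read with that in mind.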
Median income might have been a better indicator than mean income, since very high incomes in large cities pull the average upward. In reality, close to 50% of NYC residents live below or near the poverty line, a very different story from a mean income of $90,000.
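The effect of a few very high incomes on the mean can be seen with a small made-up example. The values below are hypothetical and chosen only to show the skew; they are not Census figures.

```r
# Simulated household incomes for one zip code (hypothetical values).
incomes <- c(22000, 28000, 31000, 35000, 40000,
             45000, 52000, 60000, 75000, 1200000)

mean_income <- mean(incomes)     # pulled upward by the single very high income
median_income <- median(incomes) # middle of the distribution

print(c(mean = mean_income, median = median_income))
```

Here one household raises the mean to 158,800 while the median stays at 42,500, which is much closer to what a typical household in this example earns.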
Missing and faded markings might mean the same thing to 311 callers, even though DOT handles them through significantly different processes. For operational purposes, accurate data on each category was needed, and the complaint data may not provide it.
Grouping race percentage by zip code may not be a reliable indicator of the behavior of individuals of a given race. For example, black residents of majority-white zip codes may be more likely to complain to 311 than black residents of majority-black zip codes, but this level of detail is lost in grouping.
These two datasets are extremely rich and much more research could be done.
Incorporating more Census demographic data could allow for a statistical summary of more variables and help develop predictive models of roadway complaints by population characteristics. Such models could inform city policies and help the city reach underserved or disenfranchised populations.
Monthly complaints could be broken down into borough to see if some boroughs are showing unresolved or increasing conditions at a different rate than other boroughs.
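A minimal sketch of that breakdown, using base R and simulated records, is shown below. The data frame `complaints` and its columns `created` and `borough` are hypothetical names, not the actual 311 fields.

```r
# Simulated complaints; dates and boroughs are hypothetical.
complaints <- data.frame(
  created = as.Date(c("2015-01-15", "2015-01-20", "2015-02-03",
                      "2015-02-10", "2015-03-05", "2015-03-18",
                      "2015-03-25")),
  borough = c("BRONX", "STATEN ISLAND", "BRONX",
              "BRONX", "STATEN ISLAND", "STATEN ISLAND", "BRONX")
)

# Reduce each date to a year-month label.
complaints$month <- format(complaints$created, "%Y-%m")

# Cross-tabulate: one row per month, one column per borough.
monthly_by_borough <- table(complaints$month, complaints$borough)

# Month-over-month change in complaint counts for each borough.
monthly_change <- apply(monthly_by_borough, 2, diff)

print(monthly_by_borough)
print(monthly_change)
```

Dividing each column by the borough's population would put the boroughs on a per capita basis before comparing their trends.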
Crowdsourced reports from such a large population are likely to carry useful signal. To determine the impact of these complaint numbers, data from the Department of Transportation would be required, for example the percentage of roadway area paved per month or the percentage of marking footage installed per month. This could show whether the city government is responding to complaints proportionally.
A citizen with even a beginning knowledge of R can make significant findings to inform their participation in local government. City agencies have increased data science and GIS approaches in the hopes of harnessing these large streams of data. A targeted approach with a limited scope allows for achievable goals. Combining and analyzing these types of datasets can result in programs to reduce injury and fatality rates citywide.