Because the data looked odd and counties like Kings and Manhattan came in at around $40K income per capita, we decided to make adjustments for the impoverished - knowing that they could skew the data.
bea_2010_length <- length(read_lines("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-Final-Project/master/bea_income_per_capita/ipc_2010.csv"))
bea_2010 <- read.csv( "https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-Final-Project/master/bea_income_per_capita/ipc_2010.csv", nrows = (bea_2010_length-16), skip = 4, stringsAsFactors = FALSE)
colnames(bea_2010) <- c("fips", "GeoName", "income_per_capita_2010")
bea_2010 <- subset(bea_2010, select = c(1,3))
bea_2010$income_per_capita_2010 <- as.numeric(bea_2010$income_per_capita_2010)
Let’s take a look at counties in NYC (Queens, Bronx, Manhattan, Brooklyn) and compare with counties from South Florida (Broward, Miami-Dade, Monroe, Collier):
kable(bea_2010[bea_2010$fips %in% c(12086,12011,12087,12021),])
fips | income_per_capita_2010 | |
---|---|---|
350 | 12011 | 40713 |
355 | 12021 | 63677 |
387 | 12086 | 38162 |
388 | 12087 | 62358 |
kable(bea_2010[bea_2010$fips %in% c(36047,36081,36061,36005),])
fips | income_per_capita_2010 | |
---|---|---|
1854 | 36005 | 29556 |
1875 | 36047 | 35416 |
1882 | 36061 | 113643 |
1892 | 36081 | 36661 |
We looked into the data to determine how these numbers could show such a surprising picture for these two areas.
We consider census data (source: https://www.census.gov/did/www/saipe/downloads/estmod14/readme.txt) on poverty for each county. Disclaimer: we only consider estimates from 2014 (latest year available is 2015).
cls=c(t.fips='character')
poverty<-read.csv('~/Documents/CUNY/data_class/project-final/treatedMedian2014.csv',stringsAsFactors = F,header=T,colClasses = cls)
Let’s take a look at the size of population deemed ‘poor’ per county for each of these areas:
#gross income of county by year
kable(poverty[poverty$t.fips %in% c(12086,12011,12087,12021),c(1,6)])
all.poverty | t.fips | |
---|---|---|
335 | 268418 | 12011 |
340 | 49211 | 12021 |
372 | 535148 | 12086 |
373 | 10626 | 12087 |
kable(poverty[poverty$t.fips %in% c(36047,36081,36061,36005),c(1,6)])
all.poverty | t.fips | |
---|---|---|
1863 | 440989 | 36005 |
1884 | 607086 | 36047 |
1891 | 280877 | 36061 |
1901 | 352481 | 36081 |
We will apply an adjustment calculation in order to arrive at an adjusted income per capita. The calculation looks like this
\(total.personal.income/(census.population - total.poverty.population)\)
We remove the size of the poor population divide this into the total income generated by that county.
#change all the names to prep for merge_all()
names(personal.inc)<-c("fips","GeoName","total.pers.inc")
names(poverty)[6]<-"fips"
n.l<-names(census.trim)
names(census.trim)<-sapply(n.l,paste0,".census")
names(census.trim)[2]<-c("fips")
#run the merge_all calc
df<-list(personal.inc,poverty,census.trim)
que<-Reduce(merge,df)
g.stack<-que %>%
mutate(adj.per.capita=as.numeric(total.pers.inc)/(X2014.census-all.poverty), adj.per.capita=adj.per.capita*1000)
#names(g.stack)
Our reference point is the other sterilized summary statistics, the median.
#South FLorida
kable(g.stack[g.stack$fips %in% c(12086,12011,12087,12021),c(1,6)])
fips | med.inc | |
---|---|---|
325 | 12011 | 51485 |
330 | 12021 | 58403 |
362 | 12086 | 42754 |
363 | 12087 | 57411 |
#New York City
kable(g.stack[g.stack$fips %in% c(36047,36081,36061,36005),c(1,6)])
fips | med.inc | |
---|---|---|
1828 | 36005 | 33687 |
1849 | 36047 | 47547 |
1856 | 36061 | 75459 |
1866 | 36081 | 56866 |
Percentiles on the size of impact from the adjustment reveal that the adjusted figures tended to be to the upside when compared with median.
summary(g.stack$adj.per.capita-g.stack$med.inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -47410 -4564 1024 1063 6367 137400
With our adjustments and data in hand, we build an algorithm to cluster and classify counties as high income and other (or low). This is all done in R and was time consuming.
Disclaimer: We classify counties using only 2014 data: both poverty and total personal income.
load('~/Documents/CUNY/data_class/project-final/final_algo.RData')
#merge to original data table
full<-merge(g.stack,cluster.df,by='fips')
#compute y/y growth
test<-full %>%
mutate('2011.g'=1-(X2011.census/base.2010.census), '2012.g'=1-(X2012.census/X2011.census), '2013.g'=1-(X2013.census/X2012.census), '2014.g'=1-(X2012.census/X2011.census))
test[test$label=="eligible","label"]<-"low"
#main portion of table that is interesting:
full.t<-test[,c(1,19:23)]
#recode
full.t[full.t$label=="eligible","label"]<-"low"
colnames(full.t)[3:6]<-c('g.2011','g.2012','g.2013','g.2014')
#long format
ma<-full.t %>%
gather(year,growth.rate,g.2011:g.2014)
For the sake of argument, we run this same plot using only median income for 2014.
ggplot(test,aes(x=label,y=med.inc)) +geom_boxplot() +ggtitle("US Census Median Income")
ggplot(test,aes(x=label,y=adj.per.capita)) +geom_boxplot() +ggtitle("Adjusted Income Per Capita Measure")
With our labels in hand we are ready to apply summary statistics and determine whether income per capita indeed has a positive relationships to growth.
Run some visualization… box plots classed on label
ggplot(data=ma,aes(x=year,y=growth.rate, fill=label)) + geom_boxplot() +ggtitle('Observed Growth Rate Grouped by Year and Income Class')
There does not appear to be any positive difference between the two. In fact, “high” label counties perform poorer than “low”.
We run a regression on the factors and investigate fit and coefficient on our high/low parameter.
We handle year by converting it to a numeric to capture general population growth as we move through time. What’s most important is
How legit is this regression?
#tr has same length as the residuals and fitted values
summary(lm.obj)
##
## Call:
## lm(formula = growth.rate ~ yr + label.f, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.240860 -0.005777 0.000862 0.006705 0.188590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8577898 0.2186607 -3.923 8.8e-05 ***
## yr 0.0004227 0.0001087 3.891 1e-04 ***
## label.flow 0.0067264 0.0005005 13.440 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0135 on 12341 degrees of freedom
## Multiple R-squared: 0.01562, Adjusted R-squared: 0.01546
## F-statistic: 97.89 on 2 and 12341 DF, p-value: < 2.2e-16
model.df<-data.frame(actual=tr$growth.rate,fitted.values=lm.obj$fitted.values,residuals=lm.obj$residuals,stringsAsFactors=F)
plot(lm.obj)
plot(lm.obj$residuals)
#can plot against a 0 line
ggplot(data=model.df,aes(x=fitted.values, y=residuals))+geom_point()
ggplot(data=model.df,aes(x=actual, y=residuals))+geom_point()
ggplot(data=model.df,aes(x=actual, y=fitted.values))+geom_point()
#plots of linear models
The study was a good exercise in getting acquainted with BEA, census data as well as county shapefiles. We learned about the pitfalls aboue working with aggregated data like income per capita.
Improvements:
1) Conduct the study with a more narrowed scope such as region or state. Or conversely, create a hierarchy across the nation - giving due respect to mobility across states.
2) Treat the earnings data differently or work on only one end of the spectrum of earnings.
3) Introduce more variables.
4) Conduct a longer study.
5) Run a clustering algorithm versus our proprietary version