Because the data looked odd and counties like Kings and Manhattan came in at around $40K income per capita, we decided to make adjustments for the impoverished - knowing that they could skew the data.

bea_2010_length <- length(read_lines("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-Final-Project/master/bea_income_per_capita/ipc_2010.csv"))

bea_2010 <- read.csv( "https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-Final-Project/master/bea_income_per_capita/ipc_2010.csv", nrows = (bea_2010_length-16), skip = 4, stringsAsFactors = FALSE)

colnames(bea_2010) <- c("fips", "GeoName", "income_per_capita_2010")

bea_2010 <- subset(bea_2010, select = c(1,3))

bea_2010$income_per_capita_2010 <- as.numeric(bea_2010$income_per_capita_2010)

Let’s take a look at counties in NYC (Queens, Bronx, Manhattan, Brooklyn) and compare with counties from South Florida (Broward, Miami-Dade, Monroe, Collier):

kable(bea_2010[bea_2010$fips %in% c(12086,12011,12087,12021),])
fips income_per_capita_2010
350 12011 40713
355 12021 63677
387 12086 38162
388 12087 62358
kable(bea_2010[bea_2010$fips %in% c(36047,36081,36061,36005),])
fips income_per_capita_2010
1854 36005 29556
1875 36047 35416
1882 36061 113643
1892 36081 36661

We looked into the data to determine how these numbers could show such a surprising picture for these two areas.

We consider census data (source: https://www.census.gov/did/www/saipe/downloads/estmod14/readme.txt) on poverty for each county. Disclaimer: we only consider estimates from 2014 (latest year available is 2015).

cls=c(t.fips='character')
poverty<-read.csv('~/Documents/CUNY/data_class/project-final/treatedMedian2014.csv',stringsAsFactors = F,header=T,colClasses = cls)

Let’s take a look at the size of population deemed ‘poor’ per county for each of these areas:

#gross income of county by year
kable(poverty[poverty$t.fips %in% c(12086,12011,12087,12021),c(1,6)])
all.poverty t.fips
335 268418 12011
340 49211 12021
372 535148 12086
373 10626 12087
kable(poverty[poverty$t.fips %in% c(36047,36081,36061,36005),c(1,6)])
all.poverty t.fips
1863 440989 36005
1884 607086 36047
1891 280877 36061
1901 352481 36081

We will apply an adjustment calculation in order to arrive at an adjusted income per capita. The calculation looks like this

\(total.personal.income/(census.population - total.poverty.population)\)

We remove the size of the poor population divide this into the total income generated by that county.

#change all the names to prep for merge_all()
names(personal.inc)<-c("fips","GeoName","total.pers.inc")
names(poverty)[6]<-"fips"
n.l<-names(census.trim)
names(census.trim)<-sapply(n.l,paste0,".census")
names(census.trim)[2]<-c("fips")
#run the merge_all calc
df<-list(personal.inc,poverty,census.trim)
que<-Reduce(merge,df)

g.stack<-que %>%
  mutate(adj.per.capita=as.numeric(total.pers.inc)/(X2014.census-all.poverty), adj.per.capita=adj.per.capita*1000)
#names(g.stack)

Our reference point is the other sterilized summary statistics, the median.

#South FLorida
kable(g.stack[g.stack$fips %in% c(12086,12011,12087,12021),c(1,6)])
fips med.inc
325 12011 51485
330 12021 58403
362 12086 42754
363 12087 57411
#New York City
kable(g.stack[g.stack$fips %in% c(36047,36081,36061,36005),c(1,6)])
fips med.inc
1828 36005 33687
1849 36047 47547
1856 36061 75459
1866 36081 56866

Percentiles on the size of impact from the adjustment reveal that the adjusted figures tended to be to the upside when compared with median.

summary(g.stack$adj.per.capita-g.stack$med.inc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -47410   -4564    1024    1063    6367  137400

With our adjustments and data in hand, we build an algorithm to cluster and classify counties as high income and other (or low). This is all done in R and was time consuming.
Disclaimer: We classify counties using only 2014 data: both poverty and total personal income.

load('~/Documents/CUNY/data_class/project-final/final_algo.RData')
#merge to original data table
full<-merge(g.stack,cluster.df,by='fips')
#compute y/y growth
test<-full %>%
  mutate('2011.g'=1-(X2011.census/base.2010.census), '2012.g'=1-(X2012.census/X2011.census), '2013.g'=1-(X2013.census/X2012.census), '2014.g'=1-(X2012.census/X2011.census))

test[test$label=="eligible","label"]<-"low"
#main portion of table that is interesting:
full.t<-test[,c(1,19:23)]
#recode
full.t[full.t$label=="eligible","label"]<-"low"
colnames(full.t)[3:6]<-c('g.2011','g.2012','g.2013','g.2014')
#long format
ma<-full.t %>%
  gather(year,growth.rate,g.2011:g.2014)

For the sake of argument, we run this same plot using only median income for 2014.

ggplot(test,aes(x=label,y=med.inc)) +geom_boxplot() +ggtitle("US Census Median Income")

ggplot(test,aes(x=label,y=adj.per.capita)) +geom_boxplot() +ggtitle("Adjusted Income Per Capita Measure")

With our labels in hand we are ready to apply summary statistics and determine whether income per capita indeed has a positive relationships to growth.

Run some visualization… box plots classed on label

ggplot(data=ma,aes(x=year,y=growth.rate, fill=label)) + geom_boxplot() +ggtitle('Observed Growth Rate Grouped by Year and Income Class')

There does not appear to be any positive difference between the two. In fact, “high” label counties perform poorer than “low”.

Regression

We run a regression on the factors and investigate fit and coefficient on our high/low parameter.

We handle year by converting it to a numeric to capture general population growth as we move through time. What’s most important is

How legit is this regression?

#tr has same length as the residuals and fitted values
summary(lm.obj)
## 
## Call:
## lm(formula = growth.rate ~ yr + label.f, data = tr)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.240860 -0.005777  0.000862  0.006705  0.188590 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8577898  0.2186607  -3.923  8.8e-05 ***
## yr           0.0004227  0.0001087   3.891    1e-04 ***
## label.flow   0.0067264  0.0005005  13.440  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0135 on 12341 degrees of freedom
## Multiple R-squared:  0.01562,    Adjusted R-squared:  0.01546 
## F-statistic: 97.89 on 2 and 12341 DF,  p-value: < 2.2e-16
model.df<-data.frame(actual=tr$growth.rate,fitted.values=lm.obj$fitted.values,residuals=lm.obj$residuals,stringsAsFactors=F)

plot(lm.obj)

plot(lm.obj$residuals)

#can plot against a 0 line
ggplot(data=model.df,aes(x=fitted.values, y=residuals))+geom_point()

ggplot(data=model.df,aes(x=actual, y=residuals))+geom_point()

ggplot(data=model.df,aes(x=actual, y=fitted.values))+geom_point()

#plots of linear models

Conclusion

The study was a good exercise in getting acquainted with BEA, census data as well as county shapefiles. We learned about the pitfalls aboue working with aggregated data like income per capita.
Improvements:
1) Conduct the study with a more narrowed scope such as region or state. Or conversely, create a hierarchy across the nation - giving due respect to mobility across states.
2) Treat the earnings data differently or work on only one end of the spectrum of earnings.
3) Introduce more variables.
4) Conduct a longer study.
5) Run a clustering algorithm versus our proprietary version