Income Per Capita Study

Because the data looked odd and counties like Kings and Manhattan came in at around $40K income per capita, we decided to make adjustments for the impoverished - knowing that they could skew the data.

bea_2010_length <- length(read_lines("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-Final-Project/master/bea_income_per_capita/ipc_2010.csv"))

bea_2010 <- read.csv( "https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-Final-Project/master/bea_income_per_capita/ipc_2010.csv", nrows = (bea_2010_length-16), skip = 4, stringsAsFactors = FALSE)

colnames(bea_2010) <- c("fips", "GeoName", "income_per_capita_2010")

bea_2010 <- subset(bea_2010, select = c(1,3))

bea_2010$income_per_capita_2010 <- as.numeric(bea_2010$income_per_capita_2010)

Let’s take a look at counties in NYC (Queens, Bronx, Manhattan, Brooklyn) and compare with counties from South Florida (Broward, Miami-Dade, Monroe, Collier):

kable(bea_2010[bea_2010$fips %in% c(12086,12011,12087,12021),])

	fips	income_per_capita_2010
350	12011	40713
355	12021	63677
387	12086	38162
388	12087	62358

kable(bea_2010[bea_2010$fips %in% c(36047,36081,36061,36005),])

	fips	income_per_capita_2010
1854	36005	29556
1875	36047	35416
1882	36061	113643
1892	36081	36661

We looked into the data to determine how these numbers could show such a surprising picture for these two areas.

We consider census data (source: https://www.census.gov/did/www/saipe/downloads/estmod14/readme.txt) on poverty for each county. Disclaimer: we only consider estimates from 2014 (latest year available is 2015).

cls=c(t.fips='character')
poverty<-read.csv('~/Documents/CUNY/data_class/project-final/treatedMedian2014.csv',stringsAsFactors = F,header=T,colClasses = cls)

Let’s take a look at the size of population deemed ‘poor’ per county for each of these areas:

#gross income of county by year
kable(poverty[poverty$t.fips %in% c(12086,12011,12087,12021),c(1,6)])

	all.poverty	t.fips
335	268418	12011
340	49211	12021
372	535148	12086
373	10626	12087

kable(poverty[poverty$t.fips %in% c(36047,36081,36061,36005),c(1,6)])

	all.poverty	t.fips
1863	440989	36005
1884	607086	36047
1891	280877	36061
1901	352481	36081

We will apply an adjustment calculation in order to arrive at an adjusted income per capita. The calculation looks like this

$total.personal.income/(census.population - total.poverty.population)$

We remove the size of the poor population divide this into the total income generated by that county.

#change all the names to prep for merge_all()
names(personal.inc)<-c("fips","GeoName","total.pers.inc")
names(poverty)[6]<-"fips"
n.l<-names(census.trim)
names(census.trim)<-sapply(n.l,paste0,".census")
names(census.trim)[2]<-c("fips")
#run the merge_all calc
df<-list(personal.inc,poverty,census.trim)
que<-Reduce(merge,df)

g.stack<-que %>%
  mutate(adj.per.capita=as.numeric(total.pers.inc)/(X2014.census-all.poverty), adj.per.capita=adj.per.capita*1000)
#names(g.stack)

Our reference point is the other sterilized summary statistics, the median.

#South FLorida
kable(g.stack[g.stack$fips %in% c(12086,12011,12087,12021),c(1,6)])

	fips	med.inc
325	12011	51485
330	12021	58403
362	12086	42754
363	12087	57411

#New York City
kable(g.stack[g.stack$fips %in% c(36047,36081,36061,36005),c(1,6)])

	fips	med.inc
1828	36005	33687
1849	36047	47547
1856	36061	75459
1866	36081	56866

Percentiles on the size of impact from the adjustment reveal that the adjusted figures tended to be to the upside when compared with median.

summary(g.stack$adj.per.capita-g.stack$med.inc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -47410   -4564    1024    1063    6367  137400

With our adjustments and data in hand, we build an algorithm to cluster and classify counties as high income and other (or low). This is all done in R and was time consuming.
Disclaimer: We classify counties using only 2014 data: both poverty and total personal income.

load('~/Documents/CUNY/data_class/project-final/final_algo.RData')
#merge to original data table
full<-merge(g.stack,cluster.df,by='fips')
#compute y/y growth
test<-full %>%
  mutate('2011.g'=1-(X2011.census/base.2010.census), '2012.g'=1-(X2012.census/X2011.census), '2013.g'=1-(X2013.census/X2012.census), '2014.g'=1-(X2012.census/X2011.census))

test[test$label=="eligible","label"]<-"low"
#main portion of table that is interesting:
full.t<-test[,c(1,19:23)]
#recode
full.t[full.t$label=="eligible","label"]<-"low"
colnames(full.t)[3:6]<-c('g.2011','g.2012','g.2013','g.2014')
#long format
ma<-full.t %>%
  gather(year,growth.rate,g.2011:g.2014)

For the sake of argument, we run this same plot using only median income for 2014.

ggplot(test,aes(x=label,y=med.inc)) +geom_boxplot() +ggtitle("US Census Median Income")

ggplot(test,aes(x=label,y=adj.per.capita)) +geom_boxplot() +ggtitle("Adjusted Income Per Capita Measure")

With our labels in hand we are ready to apply summary statistics and determine whether income per capita indeed has a positive relationships to growth.

Run some visualization… box plots classed on label

ggplot(data=ma,aes(x=year,y=growth.rate, fill=label)) + geom_boxplot() +ggtitle('Observed Growth Rate Grouped by Year and Income Class')

There does not appear to be any positive difference between the two. In fact, “high” label counties perform poorer than “low”.

Regression

We run a regression on the factors and investigate fit and coefficient on our high/low parameter.

We handle year by converting it to a numeric to capture general population growth as we move through time. What’s most important is

How legit is this regression?

#tr has same length as the residuals and fitted values
summary(lm.obj)

## 
## Call:
## lm(formula = growth.rate ~ yr + label.f, data = tr)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.240860 -0.005777  0.000862  0.006705  0.188590 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8577898  0.2186607  -3.923  8.8e-05 ***
## yr           0.0004227  0.0001087   3.891    1e-04 ***
## label.flow   0.0067264  0.0005005  13.440  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0135 on 12341 degrees of freedom
## Multiple R-squared:  0.01562,    Adjusted R-squared:  0.01546 
## F-statistic: 97.89 on 2 and 12341 DF,  p-value: < 2.2e-16

model.df<-data.frame(actual=tr$growth.rate,fitted.values=lm.obj$fitted.values,residuals=lm.obj$residuals,stringsAsFactors=F)

plot(lm.obj)

plot(lm.obj$residuals)

#can plot against a 0 line
ggplot(data=model.df,aes(x=fitted.values, y=residuals))+geom_point()

ggplot(data=model.df,aes(x=actual, y=residuals))+geom_point()

ggplot(data=model.df,aes(x=actual, y=fitted.values))+geom_point()

#plots of linear models

Conclusion

The study was a good exercise in getting acquainted with BEA, census data as well as county shapefiles. We learned about the pitfalls aboue working with aggregated data like income per capita.
Improvements:
1) Conduct the study with a more narrowed scope such as region or state. Or conversely, create a hierarchy across the nation - giving due respect to mobility across states.
2) Treat the earnings data differently or work on only one end of the spectrum of earnings.
3) Introduce more variables.
4) Conduct a longer study.
5) Run a clustering algorithm versus our proprietary version

Income Per Capita Study

Raphael Nash & Luis Calleja

December 21, 2016

Regression

Conclusion