Beta - project 2

In this analysis we calculate year-over-year population growth rates by state and region. The dataset is from the US Census Bureau. Each row is one geographic area (the nation, a region, or a state) with a population estimate for each of five years, 2010 through 2014.

Other analyses I could choose:

Year-over-year change in population by state
Year-over-year change in population by region of the US
Overall US population growth
Population rates for some condition of interest, calculated using this file

library(tidyr)
library(dplyr)
#install.packages('gdata')
library(gdata) #provides read.xls()
library(stringr)

Import the raw data file as uploaded by a fellow classmate.

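pop<-read.xls("~/Documents/CUNY/data_class/project2/NST-EST2014-01.xlsx",header=FALSE, stringsAsFactors=FALSE)
#drop the first three rows, which sit above the header row
pop.1<-pop[-(1:3),]
#with stringsAsFactors=FALSE the columns are read as character (the numbers contain commas), so we convert them below
#sapply(pop.1,class)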

Some light data conversions and column renaming.

names(pop.1)<-pop.1[1,] #rename columns
names(pop.1)[1]<-'geographic.area'
pop.2<-pop.1[-1,]

#remove commas and convert vectors to numeric
g<-data.frame(lapply(pop.2[,-1],function(x) { as.numeric(gsub(',','',x))}),stringsAsFactors=FALSE)

#sapply(g,class)
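#add the geographic.area column back to the dataframe
g$geographic.area<-pop.2$geographic.area

#data.frame() prefixed the year columns with "X"; str_extract_all() returns the digits as strings, so this assignment restores the year names
names(g)[3:7]<-unlist(str_extract_all(names(g),"[[:digit:]]{2,}"))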
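
As a quick illustration of the comma-stripping step, the sketch below applies the same gsub()/as.numeric() pattern to a small hand-made data frame. The data frame `raw`, its column names, and the helper `to_num` are made up for this example; only the population strings echo values that appear in the file.

```r
# Toy data frame standing in for pop.2 (hypothetical, not the real file)
raw <- data.frame(
  geographic.area = c(".Alabama", ".Alaska"),
  X2010 = c("4,785,822", "713,856"),
  X2011 = c("4,801,695", "722,572"),
  stringsAsFactors = FALSE
)

# Strip thousands separators, then convert the cleaned strings to numeric
to_num <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Apply the conversion to every column except the first (the area name)
clean <- data.frame(lapply(raw[, -1], to_num))

# Put the area names back alongside the numeric columns
clean$geographic.area <- raw$geographic.area

sapply(clean, class)
```

Without the gsub() call, as.numeric() would return NA for every value that contains a comma, so checking the column classes (and looking for unexpected NAs) is a cheap sanity check after this step.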

Having a look at the data after this preliminary cleanup:

head(g)

##      Census Estimates.Base      2010      2011      2012      2013
## 1 308745538      308758105 309347057 311721632 314112078 316497531
## 2  55317240       55318348  55381690  55635670  55832038  56028220
## 3  66927001       66929898  66972390  67149657  67331458  67567871
## 4 114555744      114562951 114871231 116089908 117346322 118522802
## 5  71945553       71946908  72121746  72846397  73602260  74378638
## 6   4779736        4780127   4785822   4801695   4817484   4833996
##        2014 geographic.area
## 1 318857056   United States
## 2  56152333       Northeast
## 3  67745108         Midwest
## 4 119771934           South
## 5  75187681            West
## 6   4849377        .Alabama

names(g)

## [1] "Census"          "Estimates.Base"  "2010"            "2011"           
## [5] "2012"            "2013"            "2014"            "geographic.area"

For the first study, only the state and yearly population measurements are relevant. I break those out here.

#states start with ".", so we filter rows (observations on state) here.
te<-g %>%
  filter(grepl("^\\.",geographic.area))

## Warning: failed to assign NativeSymbolInfo for env since env is already
## defined in the 'lazyeval' namespace

#carve out the data relevant to this study
g.1<-te %>%
  select(-c(Census,Estimates.Base))
head(g.1)

##       2010     2011     2012     2013     2014 geographic.area
## 1  4785822  4801695  4817484  4833996  4849377        .Alabama
## 2   713856   722572   731081   737259   736732         .Alaska
## 3  6411999  6472867  6556236  6634997  6731484        .Arizona
## 4  2922297  2938430  2949300  2958765  2966369       .Arkansas
## 5 37336011 37701901 38062780 38431393 38802500     .California
## 6  5048575  5119661  5191709  5272086  5355866       .Colorado
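
The row filter relies on the leading "." that marks state rows in this file. A minimal base R sketch of the same pattern match, on a hand-made vector of area names (the vector `areas` is an assumption for illustration; stripping the dot is an optional cleanup that this analysis does not perform):

```r
# A few area labels as they appear in the geographic.area column
areas <- c("United States", "Northeast", ".Alabama", ".Alaska")

# "^\\." matches a literal dot at the start of the string
is_state <- grepl("^\\.", areas)
states <- areas[is_state]

# Optional cleanup (not done in this analysis): drop the leading dot
state_names <- sub("^\\.", "", states)

states
state_names
```

The backslashes matter: an unescaped "." in a regular expression matches any character, so "^." would keep every non-empty label, including the national and regional rows.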

Put the data into long format, with one variable per column, which creates a row for each state and year combination. This facilitates both aggregate and state-level studies. Because we are measuring growth, we then group by state (‘geographic.area’). Lastly, we apply “lag()” to calculate the year-over-year change in population.

#adjust column names because the gather() function doesn't accept numeric arguments
names(g.1)<-sapply(names(g.1),function(x) {ifelse(str_detect(x,"[[:digit:]]{2,}"),paste("x",x,sep=""),x)})

j<-g.1 %>%
  gather(year,pop,x2010:x2014) %>%
  arrange(geographic.area,year) %>%
  group_by(geographic.area) %>%
  mutate(growth=((pop/lag(pop))-1)*100)
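
To make the growth formula concrete, here is a base R sketch of the same arithmetic for a single state, using Alabama's estimates from the table above. The vector `prev` plays the role that lag(pop) plays inside mutate(); the variable names are chosen for this example only.

```r
# Alabama's yearly estimates, copied from the table above
pop <- c(x2010 = 4785822, x2011 = 4801695, x2012 = 4817484,
         x2013 = 4833996, x2014 = 4849377)

# Previous year's value for each year; the first year has no predecessor
prev <- c(NA, head(pop, -1))

# Percent change from the previous year: (pop_t / pop_{t-1} - 1) * 100
growth <- (pop / prev - 1) * 100

round(growth, 3)
```

The first element is NA because 2010 has no earlier year to compare against. That is why later steps need na.rm=TRUE when averaging, and why each state contributes four growth values rather than five.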

Let’s look at average growth throughout this time period.

final<-j %>%
  ungroup() %>%
  group_by(geographic.area) %>%
  summarise(mean.growth=mean(growth,na.rm=TRUE)) %>%
  arrange(mean.growth)

Let’s take a look at the slowest-growing and fastest-growing states:

head(final);tail(final)

## # A tibble: 6 x 2
##   geographic.area mean.growth
##             <chr>       <dbl>
## 1  .West Virginia -0.05188928
## 2        .Vermont  0.03077813
## 3   .Rhode Island  0.04974570
## 4          .Maine  0.05134652
## 5       .Illinois  0.07876954
## 6       .Michigan  0.08439948

## # A tibble: 6 x 2
##         geographic.area mean.growth
##                   <chr>       <dbl>
## 1              .Florida    1.352912
## 2                 .Utah    1.485513
## 3             .Colorado    1.488156
## 4                .Texas    1.653153
## 5 .District of Columbia    2.148099
## 6         .North Dakota    2.333476

The average growth rates are positive for every state except West Virginia. However, does every state's growth rate also show an upward (positive) linear trend over this period?

ext<-function(x) {x$coefficients[2][[1]]}

#convert year to numeric, run a linear regression on the groups, extract the relevant coefficient
j.1<-j %>%
  mutate(num.year=as.numeric(unlist(str_extract_all(year,"[[:digit:]]{2,}")))) %>%
  group_by(geographic.area) %>%
  do(model = lm(growth~num.year, data = .)) %>%
  mutate(growth.coefficient=ext(model)) %>%
  select(-model) %>%
  arrange(growth.coefficient)
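
For a single state, the per-group regression boils down to fitting one line through four growth values and reading off its slope. The sketch below uses made-up growth figures (not taken from the data) that fall by exactly 0.2 percentage points per year, so the recovered slope is easy to verify by eye.

```r
# Hypothetical growth series for one state (percent per year, made up)
toy <- data.frame(num.year = 2011:2014,
                  growth = c(1.0, 0.8, 0.6, 0.4))

# Same model form as in the pipeline: growth regressed on year
fit <- lm(growth ~ num.year, data = toy)

# Extract the slope by name; ext() above picks the same value by position
slope <- coef(fit)[["num.year"]]
slope
```

A negative slope here means the growth rate is shrinking over time, not that the population is shrinking: a state can still be gaining residents every year while its slope is negative. With only four points per state, these slopes are best read as rough indicators.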

Conclusion

Arguably, these are the states with the lowest growth momentum (the most negative trend in growth rate):

head(j.1)

## # A tibble: 6 x 2
##         geographic.area growth.coefficient
##                   <chr>              <dbl>
## 1               .Alaska         -0.4209915
## 2 .District of Columbia         -0.3161597
## 3           .New Mexico         -0.2332545
## 4              .Wyoming         -0.1795914
## 5             .Virginia         -0.1228522
## 6          .Connecticut         -0.1128097

States with the highest growth momentum:

tail(j.1)

## # A tibble: 6 x 2
##   geographic.area growth.coefficient
##             <chr>              <dbl>
## 1   .Rhode Island         0.08289272
## 2 .South Carolina         0.14113235
## 3        .Arizona         0.14281292
## 4          .Idaho         0.18472671
## 5   .North Dakota         0.23822845
## 6         .Nevada         0.34084222

Future studies

No doubt much of the growth comes from immigration and births, and not necessarily from interstate migration. To somewhat normalize for this, and assuming that death rates are equal across all states, which states are growing faster than the national average?

To reduce bias, we can remove the state in question from the mean calculation.
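
One possible way to set up that comparison is a leave-one-out mean: for each state, average the growth rates of all the other states. The sketch below uses four hypothetical states with made-up rates. It also assumes a simple unweighted mean of state rates as the benchmark, which is not the same thing as the national growth rate (that would require weighting by population).

```r
# Made-up mean growth rates (percent) for four hypothetical states
mean.growth <- c(A = 0.5, B = 1.0, C = 1.6, D = 3.0)
n <- length(mean.growth)

# Mean of the other states: subtract each state's own value from the total
loo.mean <- (sum(mean.growth) - mean.growth) / (n - 1)

# Flag states growing faster than the average of everyone else
above <- mean.growth > loo.mean

data.frame(mean.growth, loo.mean, above)
```

Excluding the state in question keeps a very fast-growing state from raising the benchmark it is compared against, an effect that matters more when the number of groups is small.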

Luis Calleja

October 3, 2016
