In this analysis we calculate year-over-year population growth rates by state and region. The dataset is from the US Census Bureau. Each row is one geographic area (the nation, a region, or a state) with a point-in-time population estimate for each of five years, 2010 through 2014.
Other analyses, should I choose to pursue them:
Year over year change in population by state
Year over year change in population by region of the US
Overall US population growth
Using this file to calculate population rates for some condition you are interested in.
library(tidyr);library(dplyr);
#install.packages('gdata')
library(gdata)
library(stringr)
Import the raw data file as uploaded by a fellow classmate.
pop<-read.xls("~/Documents/CUNY/data_class/project2/NST-EST2014-01.xlsx",header=FALSE, stringsAsFactors=FALSE)
pop.1<-pop[-(1:3),]
#with stringsAsFactors=FALSE each column in the raw data frame is read as character, so we convert to numeric below
#sapply(pop.1,class)
Some light data conversions and column renaming.
names(pop.1)<-pop.1[1,] #rename columns
names(pop.1)[1]<-'geographic.area'
pop.2<-pop.1[-1,]
#remove commas and convert vectors to numeric
g<-data.frame(lapply(pop.2[,-1],function(x) { as.numeric(gsub(',','',x))}),stringsAsFactors=FALSE)
#sapply(g,class)
#rebind vector to dataframe
g$geographic.area<-pop.2$geographic.area
#str_extract all returns strings, so this assignment should work
names(g)[3:7]<-unlist(str_extract_all(names(g),"[[:digit:]]{2,}"))
Having a look at the data after this preliminary cleanup:
head(g)
##      Census Estimates.Base      2010      2011      2012      2013
## 1 308745538      308758105 309347057 311721632 314112078 316497531
## 2  55317240       55318348  55381690  55635670  55832038  56028220
## 3  66927001       66929898  66972390  67149657  67331458  67567871
## 4 114555744      114562951 114871231 116089908 117346322 118522802
## 5  71945553       71946908  72121746  72846397  73602260  74378638
## 6   4779736        4780127   4785822   4801695   4817484   4833996
##        2014 geographic.area
## 1 318857056   United States
## 2  56152333       Northeast
## 3  67745108         Midwest
## 4 119771934           South
## 5  75187681            West
## 6   4849377        .Alabama
names(g)
## [1] "Census" "Estimates.Base" "2010" "2011"
## [5] "2012" "2013" "2014" "geographic.area"
For the first study, only the state and yearly population measurements are relevant. I break those out here.
#states start with ".", so we filter rows (observations on state) here.
te<-g %>%
filter(grepl("^\\.",geographic.area))
## Warning: failed to assign NativeSymbolInfo for env since env is already
## defined in the 'lazyeval' namespace
#carve out the data relevant to this study
g.1<-te %>%
select(-c(Census,Estimates.Base))
head(g.1)
##       2010     2011     2012     2013     2014 geographic.area
## 1  4785822  4801695  4817484  4833996  4849377        .Alabama
## 2   713856   722572   731081   737259   736732         .Alaska
## 3  6411999  6472867  6556236  6634997  6731484        .Arizona
## 4  2922297  2938430  2949300  2958765  2966369       .Arkansas
## 5 37336011 37701901 38062780 38431393 38802500     .California
## 6  5048575  5119661  5191709  5272086  5355866       .Colorado
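The same leading-dot convention can also separate regions from states, which would be a starting point for the regional comparison. Below is a minimal sketch in base R, using area names taken from the output above. The object names `areas`, `is.state`, `states`, and `regions` are illustrative and not part of the original analysis.

```r
# Area names as they appear in the cleaned data (a small sample).
areas <- c("United States", "Northeast", "Midwest", "South", "West",
           ".Alabama", ".Alaska")

# States are the rows whose name starts with a literal dot.
is.state <- grepl("^\\.", areas)

# Strip the leading dot to get clean state names.
states <- sub("^\\.", "", areas[is.state])

# Everything else is a region, except the national total.
regions <- setdiff(areas[!is.state], "United States")

print(states)
print(regions)
```

Stripping the dot is optional for the growth calculations, but it makes the final tables easier to read.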
Put the data into long format, with one variable per column, which creates a row for each state and year combination. This facilitates both aggregate and state-level studies. Because we are measuring growth, we then group by state (‘geographic.area’). Lastly, we apply “lag()” to calculate the year-over-year percentage change in population.
#adjust column names because the gather() function doesn't accept numeric arguments
names(g.1)<-sapply(names(g.1),function(x) {ifelse(str_detect(x,"[[:digit:]]{2,}"),paste("x",x,sep=""),x)})
j<-g.1 %>%
gather(year,pop,x2010:x2014) %>%
arrange(geographic.area,year) %>%
group_by(geographic.area) %>%
mutate(growth=((pop/lag(pop))-1)*100)
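For readers on a newer tidyr, where `gather()` has been superseded by `pivot_longer()`, the same reshaping and growth calculation might look like the sketch below. It assumes tidyr 1.1.0 or later (for `names_transform`) and uses a hypothetical two-state subset named `wide`, with populations copied from the table above. Because `pivot_longer()` handles the numeric column names directly, the `x` prefix is not needed here.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-state subset; populations are taken from head(g.1).
wide <- data.frame(
  geographic.area = c(".Alabama", ".Alaska"),
  `2010` = c(4785822, 713856),
  `2011` = c(4801695, 722572),
  `2012` = c(4817484, 731081),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

long <- wide %>%
  # One row per state and year; year is converted to integer on the way.
  pivot_longer(cols = -geographic.area, names_to = "year", values_to = "pop",
               names_transform = list(year = as.integer)) %>%
  arrange(geographic.area, year) %>%
  group_by(geographic.area) %>%
  # Percentage change versus the previous year within each state.
  mutate(growth = (pop / lag(pop) - 1) * 100) %>%
  ungroup()

print(long)
```

The first year of each state has no previous value, so its growth is `NA`. This is why the later averages use `na.rm=TRUE`.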
Let’s look at average growth throughout this time period.
final<-j %>%
ungroup() %>%
group_by(geographic.area) %>%
summarise(mean.growth=mean(growth,na.rm=TRUE)) %>%
arrange(mean.growth)
Let’s take a look at the slowest-growing and fastest-growing states:
head(final);tail(final)
## # A tibble: 6 x 2
## geographic.area mean.growth
## <chr> <dbl>
## 1 .West Virginia -0.05188928
## 2 .Vermont 0.03077813
## 3 .Rhode Island 0.04974570
## 4 .Maine 0.05134652
## 5 .Illinois 0.07876954
## 6 .Michigan 0.08439948
## # A tibble: 6 x 2
## geographic.area mean.growth
## <chr> <dbl>
## 1 .Florida 1.352912
## 2 .Utah 1.485513
## 3 .Colorado 1.488156
## 4 .Texas 1.653153
## 5 .District of Columbia 2.148099
## 6 .North Dakota 2.333476
The average growth rates are positive for every state except West Virginia. However, does each state's growth rate also show an upward (positive) linear trend throughout this time period?
ext<-function(x) {x$coefficients[2][[1]]}
#convert year to numeric, run a linear regression on the groups, extract the relevant coefficient
j.1<-j %>%
mutate(num.year=as.numeric(unlist(str_extract_all(year,"[[:digit:]]{2,}")))) %>%
group_by(geographic.area) %>%
do(model = lm(growth~num.year, data = .)) %>%
mutate(growth.coefficient=ext(model)) %>%
select(-model) %>%
arrange(growth.coefficient)
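To make the coefficient concrete, here is a small sketch with made-up growth rates (the `toy` data frame is hypothetical, not Census data). The slope of `growth~num.year` is the average change in the growth rate per year, in percentage points, so a positive value means growth is accelerating rather than merely positive.

```r
# Made-up growth rates that rise by 0.2 percentage points each year.
toy <- data.frame(
  num.year = 2011:2014,
  growth   = c(1.0, 1.2, 1.4, 1.6)
)

fit <- lm(growth ~ num.year, data = toy)

# coef() returns the intercept and the slope; we want the slope on num.year.
slope <- coef(fit)[["num.year"]]
print(slope)
```

Note that each state contributes only four growth observations (2011 through 2014, since 2010 has no prior year), so these slopes are rough indicators rather than precise estimates.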
Arguably, the states with the lowest growth momentum:
head(j.1)
## # A tibble: 6 x 2
## geographic.area growth.coefficient
## <chr> <dbl>
## 1 .Alaska -0.4209915
## 2 .District of Columbia -0.3161597
## 3 .New Mexico -0.2332545
## 4 .Wyoming -0.1795914
## 5 .Virginia -0.1228522
## 6 .Connecticut -0.1128097
States with highest growth momentum:
tail(j.1)
## # A tibble: 6 x 2
## geographic.area growth.coefficient
## <chr> <dbl>
## 1 .Rhode Island 0.08289272
## 2 .South Carolina 0.14113235
## 3 .Arizona 0.14281292
## 4 .Idaho 0.18472671
## 5 .North Dakota 0.23822845
## 6 .Nevada 0.34084222
Much of this growth presumably comes from immigration and births rather than interstate migration. To normalize for this somewhat, and assuming that death rates are equal across all states, we can ask: which states are growing faster than the national average?
To reduce bias, we can remove the state in question from the mean calculation.
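One way to sketch this leave-one-out comparison is shown below, assuming a long data frame shaped like `j` with one growth value per state and year. The `toy` data, the state labels `.A`, `.B`, `.C`, and the column names `others.mean` and `excess.growth` are all hypothetical. The benchmark here is an unweighted mean of the other states' growth rates, which is an assumption: it is not the same as a population-weighted national rate.

```r
library(dplyr)

# Hypothetical growth rates (percent) for three states over two years.
toy <- data.frame(
  geographic.area = rep(c(".A", ".B", ".C"), each = 2),
  year   = rep(c("x2011", "x2012"), times = 3),
  growth = c(1.0, 2.0, 2.0, 3.0, 3.0, 7.0),
  stringsAsFactors = FALSE
)

relative <- toy %>%
  group_by(year) %>%
  mutate(
    # Mean growth of all OTHER states in the same year (leave-one-out).
    others.mean   = (sum(growth) - growth) / (n() - 1),
    # Positive values mean the state outgrew the rest of the country.
    excess.growth = growth - others.mean
  ) %>%
  ungroup()

print(relative)
```

Subtracting each state's own value from the yearly sum avoids recomputing a separate mean per state, and it keeps a single fast-growing state from inflating the benchmark it is compared against.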