The dataset contains annual estimates of the resident population from April 1, 2010 to July 1, 2018. The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions. See Geographic Terms and Definitions at
http://www.census.gov/programs-surveys/popest/guidance-geographies/terms-and-definitions.html for a list of the states that are included in each region and division. All geographic boundaries for the 2018 population estimates series except statistical area delineations are as of January 1, 2018. For population estimates methodology
The data analysis and visualizations in this assignment use the following packages.
library("tidyr")
library("dplyr")
library("kableExtra")
library("ggplot2")
library("stringr")
library("lubridate")
The data is stored in .csv format and was uploaded to GitHub. As shown below, the raw data is not in a form that is easy to analyze, so the data set needed to be tidied.
theURL <- "https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/nst-est2018-01.csv"
RawFile <- read.csv(file = theURL, header = TRUE, stringsAsFactors = FALSE)
# Table structure
glimpse(RawFile)
## Observations: 56
## Variables: 12
## $ State <chr> "United States", "Northeast", "Midwest", "South...
## $ Census <int> 308745538, 55317240, 66927001, 114555744, 71945...
## $ Estimates.Base <int> 308758105, 55318430, 66929743, 114563045, 71946...
## $ X2010 <int> 309326085, 55380645, 66974749, 114867066, 72103...
## $ X2011 <int> 311580009, 55600532, 67152631, 116039399, 72787...
## $ X2012 <int> 313874218, 55776729, 67336937, 117271075, 73489...
## $ X2013 <int> 316057727, 55907823, 67564135, 118393244, 74192...
## $ X2014 <int> 318386421, 56015864, 67752238, 119657737, 74960...
## $ X2015 <int> 320742673, 56047587, 67869139, 121037542, 75788...
## $ X2016 <int> 323071342, 56058789, 67996917, 122401186, 76614...
## $ X2017 <int> 325147121, 56072676, 68156035, 123598424, 77319...
## $ X2018 <int> 327167434, 56111079, 68308744, 124753948, 77993...
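The X prefix on the year columns above comes from read.csv, which by default passes column names through make.names() so that they are syntactically valid R names. A minimal sketch on a small inline table (the toy values and the names csv_text, with_prefix, and as_written are invented for illustration and are not part of the data set):

```r
# Toy CSV with numeric column headers (illustrative values only)
csv_text <- "State,2010,2011\nExampleState,100,110"

# Default behaviour: headers such as "2010" are not valid R names,
# so read.csv rewrites them (here by prepending "X")
with_prefix <- read.csv(text = csv_text, stringsAsFactors = FALSE)
names(with_prefix)

# check.names = FALSE keeps the headers exactly as written in the file
as_written <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                       check.names = FALSE)
names(as_written)
```

Keeping the default is usually convenient, because names like X2010 can be used in select() without backticks; the prefix only has to be stripped once the years become values.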
The data needs to be cleaned and reshaped before it can be analyzed.
# First rows of the raw data
head(RawFile)
## State Census Estimates.Base X2010 X2011 X2012
## 1 United States 308745538 308758105 309326085 311580009 313874218
## 2 Northeast 55317240 55318430 55380645 55600532 55776729
## 3 Midwest 66927001 66929743 66974749 67152631 67336937
## 4 South 114555744 114563045 114867066 116039399 117271075
## 5 West 71945553 71946887 72103625 72787447 73489477
## 6 Alabama 4779736 4780138 4785448 4798834 4815564
## X2013 X2014 X2015 X2016 X2017 X2018
## 1 316057727 318386421 320742673 323071342 325147121 327167434
## 2 55907823 56015864 56047587 56058789 56072676 56111079
## 3 67564135 67752238 67869139 67996917 68156035 68308744
## 4 118393244 119657737 121037542 122401186 123598424 124753948
## 5 74192525 74960582 75788405 76614450 77319986 77993663
## 6 4830460 4842481 4853160 4864745 4875120 4887871
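# Keep the state name and the yearly estimate columns
subset_db_1 <- RawFile %>% select(State, X2010, X2011, X2012, X2013, X2014, X2015, X2016, X2017, X2018)
# Reshape from wide to long: one row per state and year
subset_db_2 <- gather(subset_db_1, "Year", "Count", -State)
# Strip the "X" prefix from the year labels
subset_db_2$Year <- str_replace_all(subset_db_2$Year, "X", "")
head(subset_db_2)
## State Year Count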
## 1 United States 2010 309326085
## 2 Northeast 2010 55380645
## 3 Midwest 2010 66974749
## 4 South 2010 114867066
## 5 West 2010 72103625
## 6 Alabama 2010 4785448
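gather() still works, but tidyr's documentation describes it as superseded by pivot_longer(). As a hedged alternative, the same wide-to-long reshaping can be sketched on a small made-up data frame (the wide object and its values are hypothetical, and this assumes tidyr 1.1.0 or later for names_transform):

```r
library(tidyr)

# Hypothetical wide table with the same shape as subset_db_1
wide <- data.frame(
  State = c("A", "B"),
  X2010 = c(10, 20),
  X2011 = c(11, 21),
  stringsAsFactors = FALSE
)

# One row per State/Year; the "X" prefix is dropped and Year is
# converted to integer in the same call
long <- pivot_longer(
  wide,
  cols = -State,
  names_to = "Year",
  names_prefix = "X",
  values_to = "Count",
  names_transform = list(Year = as.integer)
)
long
```

Unlike gather(), pivot_longer() returns the rows grouped by the original row (all years for one state, then the next state), so the row order differs from the output shown above even though the content is the same.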
# National total for each year
subset_db_US <- subset_db_2 %>%
  select(State, Year, Count) %>%
  filter(State == "United States") %>%
  group_by(State, Year) %>%
  summarise(Total = sum(as.numeric(Count)))
* Bar plot showing the U.S. population by year from 2010 to 2018
barplot(subset_db_US$Total, names.arg = subset_db_US$Year, xlab = "Years", ylab = "Population #", main = "U.S National Population from 2010", border = "red", density = c(90, 70, 50, 40, 30, 20, 10))
* Segmentation of U.S. regions by population count
subset_db_Regions <- subset_db_2 %>%
  select(State, Year, Count) %>%
  filter(State %in% c("Northeast", "South", "West", "Midwest") & Year == 2018) %>%
  group_by(State) %>%
  summarise(Total = sum(as.numeric(Count)))
subset_db_Regions
## # A tibble: 4 x 2
## State Total
## <chr> <dbl>
## 1 Midwest 68308744
## 2 Northeast 56111079
## 3 South 124753948
## 4 West 77993663
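To make the regional comparison easier to read, the totals can also be expressed as shares of the national population. The sketch below uses base R only and hard-codes the four 2018 totals from the table above; the names regions_2018 and share are introduced here for illustration:

```r
# 2018 regional totals copied from the table above
regions_2018 <- c(
  Midwest   = 68308744,
  Northeast = 56111079,
  South     = 124753948,
  West      = 77993663
)

# Share of the combined population held by each region, in percent
share <- regions_2018 / sum(regions_2018)
round(100 * share, 1)

# Name of the most populous region
names(which.max(regions_2018))
```

The four regional totals add up to the 2018 United States figure in the raw data (327167434), which is a quick consistency check on the filter.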
Conclusion:
The U.S. national population increased every year from 2010 to 2018, and in 2018 the South had a larger population than any other region.
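The year-over-year claim can be checked directly rather than read off the bar plot. A minimal base R sketch, assuming the national estimates are typed in from the United States row of the raw data (the names us_pop, yoy_change, and yoy_pct are introduced here for illustration):

```r
# National estimates for 2010 to 2018, copied from the United States row
us_pop <- c(
  `2010` = 309326085, `2011` = 311580009, `2012` = 313874218,
  `2013` = 316057727, `2014` = 318386421, `2015` = 320742673,
  `2016` = 323071342, `2017` = 325147121, `2018` = 327167434
)

# Absolute change from each year to the next
yoy_change <- diff(us_pop)

# TRUE only if the population grew in every single year
all(yoy_change > 0)

# Percentage change relative to the previous year
yoy_pct <- 100 * yoy_change / head(us_pop, -1)
round(yoy_pct, 2)
```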