Project 2 - Data Cleansing & Analysis, DataSet 2

Overview

The dataset contains annual estimates of the resident population from April 1, 2010 to July 1, 2018. The estimates are based on the 2010 Census and reflect changes to the April 1, 2010 population due to the Count Question Resolution program and geographic program revisions. See Geographic Terms and Definitions at http://www.census.gov/programs-surveys/popest/guidance-geographies/terms-and-definitions.html for a list of the states that are included in each region and division. All geographic boundaries for the 2018 population estimates series except statistical area delineations are as of January 1, 2018. For population estimates methodology

R Packages Used

The following packages were used for data manipulation, analysis, and visualization.

library("tidyr")
library("dplyr")
library("kableExtra")
library("ggplot2")
library("stringr")
library("lubridate")

The DataSet

The data is stored in .csv format and was uploaded to GitHub. As the output below shows, the raw data is not in a form that is easy to analyze, so it needed to be tidied.

theURL <- "https://raw.githubusercontent.com/DataScienceAR/Cuny-Assignments/master/Data-607/Data-Sets/nst-est2018-01.csv"
# read.csv() already returns a data frame, so no extra data.frame() wrapper is needed
RawFile <- read.csv(file = theURL, header = TRUE, stringsAsFactors = FALSE)

# Table structure
glimpse(RawFile)

## Observations: 56
## Variables: 12
## $ State          <chr> "United States", "Northeast", "Midwest", "South...
## $ Census         <int> 308745538, 55317240, 66927001, 114555744, 71945...
## $ Estimates.Base <int> 308758105, 55318430, 66929743, 114563045, 71946...
## $ X2010          <int> 309326085, 55380645, 66974749, 114867066, 72103...
## $ X2011          <int> 311580009, 55600532, 67152631, 116039399, 72787...
## $ X2012          <int> 313874218, 55776729, 67336937, 117271075, 73489...
## $ X2013          <int> 316057727, 55907823, 67564135, 118393244, 74192...
## $ X2014          <int> 318386421, 56015864, 67752238, 119657737, 74960...
## $ X2015          <int> 320742673, 56047587, 67869139, 121037542, 75788...
## $ X2016          <int> 323071342, 56058789, 67996917, 122401186, 76614...
## $ X2017          <int> 325147121, 56072676, 68156035, 123598424, 77319...
## $ X2018          <int> 327167434, 56111079, 68308744, 124753948, 77993...
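
The X prefix on the year columns above is added by read.csv(), which by default (check.names = TRUE) converts headers into syntactically valid R names. The sketch below illustrates this on a small inline CSV; the inline text and the names csv_text, with_prefix, and without_prefix are made up for the illustration, and it assumes the original file's headers are bare years such as 2010.

```r
# Hypothetical two-column extract; the Alabama values are taken from the table in this report
csv_text <- "State,2010,2011\nAlabama,4785448,4798834"

# Default behaviour: numeric headers are made syntactic by prepending "X"
with_prefix <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# check.names = FALSE keeps the headers exactly as written in the file
without_prefix <- read.csv(text = csv_text, check.names = FALSE,
                           stringsAsFactors = FALSE)

names(with_prefix)
names(without_prefix)
```

Keeping the default is reasonable here, since names such as X2010 can be used in select() without backticks; the prefix then has to be stripped after reshaping.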

Data Manipulation

The data needs to be cleansed and manipulated for it to be presentable for analysis.

Untidy DataSet

The first six rows of the raw data:

# Preview the raw data
head(RawFile)

##           State    Census Estimates.Base     X2010     X2011     X2012
## 1 United States 308745538      308758105 309326085 311580009 313874218
## 2     Northeast  55317240       55318430  55380645  55600532  55776729
## 3       Midwest  66927001       66929743  66974749  67152631  67336937
## 4         South 114555744      114563045 114867066 116039399 117271075
## 5          West  71945553       71946887  72103625  72787447  73489477
## 6       Alabama   4779736        4780138   4785448   4798834   4815564
##       X2013     X2014     X2015     X2016     X2017     X2018
## 1 316057727 318386421 320742673 323071342 325147121 327167434
## 2  55907823  56015864  56047587  56058789  56072676  56111079
## 3  67564135  67752238  67869139  67996917  68156035  68308744
## 4 118393244 119657737 121037542 122401186 123598424 124753948
## 5  74192525  74960582  75788405  76614450  77319986  77993663
## 6   4830460   4842481   4853160   4864745   4875120   4887871

Cleansing the data

Subsetting the data to the required columns

# Keep State and the yearly estimates; drop Census and Estimates.Base
subset_db_1 <- RawFile %>% select(State, X2010:X2018)

Collapsing the wide table into a long table

# One row per State and Year, with the estimate in Count
subset_db_2 <- gather(subset_db_1, "Year", "Count", -State)

Removing the X prefix from the year values

subset_db_2$Year <- str_replace_all(subset_db_2$Year, "X", "")
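
As an alternative sketch, the reshaping and the prefix removal can be done in one call with tidyr's pivot_longer(), which supersedes gather(). This is not the method used in this report; the toy table wide and the result long are names made up for the illustration, with values copied from the 2010 and 2011 columns shown earlier.

```r
library(tidyr)

# Toy wide table with two rows and two year columns
wide <- data.frame(
  State = c("United States", "Alabama"),
  X2010 = c(309326085, 4785448),
  X2011 = c(311580009, 4798834),
  stringsAsFactors = FALSE
)

# Collapse the year columns and strip the "X" prefix in the same step
long <- pivot_longer(
  wide,
  cols = starts_with("X"),
  names_to = "Year",
  names_prefix = "X",
  values_to = "Count"
)

# Store the year as an integer rather than a string
long$Year <- as.integer(long$Year)
long
```

Note that pivot_longer() orders the result row by row (all years for one state, then the next state), whereas gather() stacks the result column by column, so the row order differs from the tidy table in this report.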

Tidy DataSet

head(subset_db_2)

##           State Year     Count
## 1 United States 2010 309326085
## 2     Northeast 2010  55380645
## 3       Midwest 2010  66974749
## 4         South 2010 114867066
## 5          West 2010  72103625
## 6       Alabama 2010   4785448

Analysis

Aggregating at the U.S. national level

# Keep only the national rows and total the population per year
subset_db_US <- subset_db_2 %>%
  select(State, Year, Count) %>%
  filter(State == "United States") %>%
  group_by(State, Year) %>%
  summarise(Total = sum(as.numeric(Count)))

Bar plot showing the U.S. population for each year from 2010 to 2018

# One bar per year, labelled with the year
barplot(subset_db_US$Total,
        names.arg = subset_db_US$Year,
        xlab = "Years",
        ylab = "Population #",
        main = "U.S National Population from 2010",
        border = "red",
        density = c(90, 70, 50, 40, 30, 20, 10))
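
The trend in the bar plot can also be checked numerically. The sketch below computes the year-over-year change in base R; the data frame us and the columns Change and PctChange are names introduced for the illustration, and the totals are the United States values from the raw table shown earlier.

```r
# National totals for 2010 to 2018, copied from the United States row
us <- data.frame(
  Year = 2010:2018,
  Total = c(309326085, 311580009, 313874218, 316057727, 318386421,
            320742673, 323071342, 325147121, 327167434)
)

# Absolute and percentage change versus the previous year (NA for 2010)
us$Change <- c(NA, diff(us$Total))
us$PctChange <- round(100 * us$Change / c(NA, head(us$Total, -1)), 2)

# TRUE only if every year is larger than the one before it
increased_every_year <- all(us$Change[-1] > 0)

# Growth over the whole period, 2010 to 2018
total_growth <- us$Total[nrow(us)] - us$Total[1]

us
```

With these figures every annual change is positive, and the population grows by 17,841,349 between the 2010 and 2018 estimates.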

Segmentation of U.S. regions by population count in 2018

# Year is stored as a string after reshaping, so compare against "2018"
subset_db_Regions <- subset_db_2 %>%
  select(State, Year, Count) %>%
  filter(State %in% c("Northeast", "South", "West", "Midwest") & Year == "2018") %>%
  group_by(State) %>%
  summarise(Total = sum(as.numeric(Count)))
subset_db_Regions

## # A tibble: 4 x 2
##   State         Total
##   <chr>         <dbl>
## 1 Midwest    68308744
## 2 Northeast  56111079
## 3 South     124753948
## 4 West       77993663
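
To make the regional comparison easier to read, each region's share of the total can be computed from the table above. This is a minimal base R sketch; the data frame regions, the column SharePct, and the constant us_2018 are names introduced for the illustration, with values copied from the outputs in this report.

```r
# 2018 regional totals from the tibble above
regions <- data.frame(
  State = c("Midwest", "Northeast", "South", "West"),
  Total = c(68308744, 56111079, 124753948, 77993663),
  stringsAsFactors = FALSE
)

# 2018 national estimate from the United States row of the raw table
us_2018 <- 327167434

# Share of each region in percent, rounded to one decimal place
regions$SharePct <- round(100 * regions$Total / sum(regions$Total), 1)

# Largest region first
regions <- regions[order(-regions$Total), ]

# Sanity check: the four regions should add up to the national figure
regions_sum_matches_us <- sum(regions$Total) == us_2018

regions
```

The four regional totals add up exactly to the national estimate, and the South accounts for roughly 38% of the 2018 population.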

Conclusion:

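The U.S. national population increased every year from 2010 to 2018, rising from 309,326,085 to 327,167,434. In 2018 the South was the most populous region (124,753,948), ahead of the West (77,993,663), the Midwest (68,308,744), and the Northeast (56,111,079).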