Data Description: Population Data

I chose to work with population data posted by Arun Reddy. His description:

I found a census dataset for Puerto Rico population for 2010 to 2018. The dataset gives the Estimates of the Total Resident Population and Resident Population Age 18 Years and Older for the United States, States, and Puerto Rico

The dataset gives the population metrics like estimated, change in population from the estimate, National Rank in populations for each year from 2010 to 2018.

So there are 57 rows with 61 columns, so the dataset is wider than the length.

This is a good example of untidy data can be cleansed and make it more presentable. The following steps can be performed to cleanse the data.

  1. All the 4 metrics by years are spread out by column-wise can be changed into rows.

  2. Row names/Column name which includes the year as a concatenation can be well formatted to make more readable.

  3. Some of the column names don’t have the right data type like population change, national rank is factorial data type which is unnecessary.


Tidying

Let’s read population data from GitHub, check descriptives with summary command. Output is limited by customized CSS in Markdown.

USpop<-read.csv("https://raw.githubusercontent.com/sigmasigmaiota/USpopulation/master/Population DataSet.csv")

#Look at summary table.
summary(USpop)
##        X          SUMLEV      REGION    DIVISION      STATE      
##  Min.   : 1   Min.   :10.00   0: 1   5      : 9   Min.   : 0.00  
##  1st Qu.:15   1st Qu.:40.00   1:10   8      : 8   1st Qu.:12.00  
##  Median :29   Median :40.00   2:13   4      : 7   Median :27.00  
##  Mean   :29   Mean   :38.07   3:18   1      : 6   Mean   :27.18  
##  3rd Qu.:43   3rd Qu.:40.00   4:14   0      : 5   3rd Qu.:41.00  
##  Max.   :57   Max.   :40.00   X: 1   3      : 5   Max.   :72.00  
##                                      (Other):17                  
##          NAME    ESTIMATESBASE2010   POPESTIMATE2010    
##  Alabama   : 1   Min.   :   563773   Min.   :   564483  
##  Alaska    : 1   1st Qu.:  1853001   1st Qu.:  1854214  
##  Arizona   : 1   Median :  4625381   Median :  4635656  
##  Arkansas  : 1   Mean   : 16315798   Mean   : 16345610  
##  California: 1   3rd Qu.:  9535736   3rd Qu.:  9574293  
##  Colorado  : 1   Max.   :308758105   Max.   :309326085  
##  (Other)   :51                                          
##  POPESTIMATE2011     POPESTIMATE2012     POPESTIMATE2013    
##  Min.   :   567224   Min.   :   576270   Min.   :   582123  
##  1st Qu.:  1856074   1st Qu.:  1856764   1st Qu.:  1865414  
##  Median :  4671422   Median :  4717112   Median :  4764153  
##  Mean   : 16463487   Mean   : 16583459   Mean   : 16697654  
##  3rd Qu.:  9656754   3rd Qu.:  9749123   3rd Qu.:  9843599  
##  Max.   :311580009   Max.   :313874218   Max.   :316057727  
##                                                             
##  POPESTIMATE2014     POPESTIMATE2015     POPESTIMATE2016    
##  Min.   :   582548   Min.   :   585668   Min.   :   584290  
##  1st Qu.:  1879522   1st Qu.:  1891507   1st Qu.:  1905924  
##  Median :  4823793   Median :  4853160   Median :  4864745  
##  Mean   : 16819195   Mean   : 16942126   Mean   : 17063518  
##  3rd Qu.:  9930589   3rd Qu.:  9932573   3rd Qu.:  9951890  
##  Max.   :318386421   Max.   :320742673   Max.   :323071342  
##                                                             
##  POPESTIMATE2017     POPESTIMATE2018      NPOPCHG_2010   
##  Min.   :   578934   Min.   :   577737   Min.   : -6582  
##  1st Qu.:  1917575   1st Qu.:  1929268   1st Qu.:  2570  
##  Median :  4875120   Median :  4887871   Median :  6457  
##  Mean   : 17171340   Mean   : 17275394   Mean   : 29812  
##  3rd Qu.:  9976447   3rd Qu.:  9995915   3rd Qu.: 18362  
##  Max.   :325147121   Max.   :327167434   Max.   :567980  
##                                                          
##   NPOPCHG_2011      NPOPCHG_2012      NPOPCHG_2013      NPOPCHG_2014    
##  Min.   : -42793   Min.   : -44244   Min.   : -41411   Min.   : -58203  
##  1st Qu.:   8898   1st Qu.:  10012   1st Qu.:   8774   1st Qu.:   7386  
##  Median :  21288   Median :  16893   Median :  18436   Median :  17240  
##  Mean   : 117877   Mean   : 119972   Mean   : 114195   Mean   : 121542  
##  3rd Qu.:  65723   3rd Qu.:  71221   3rd Qu.:  53494   3rd Qu.:  59023  
##  Max.   :2253924   Max.   :2294209   Max.   :2183509   Max.   :2328694  
##                                                                         
##   NPOPCHG_2015      NPOPCHG_2016      NPOPCHG_2017      NPOPCHG_2018    
##  Min.   : -61708   Min.   : -66671   Min.   : -81494   Min.   :-129848  
##  1st Qu.:   4089   1st Qu.:   3647   1st Qu.:   1365   1st Qu.:   3341  
##  Median :  14763   Median :  13364   Median :  14027   Median :  17827  
##  Mean   : 122931   Mean   : 121392   Mean   : 107822   Mean   : 104054  
##  3rd Qu.:  50831   3rd Qu.:  60116   3rd Qu.:  60505   3rd Qu.:  61216  
##  Max.   :2356252   Max.   :2328669   Max.   :2075779   Max.   :2020313  
##                                                                         
##   PPOPCHG_2010       PPOPCHG_2011      PPOPCHG_2012      PPOPCHG_2013    
##  Min.   :-0.12431   Min.   :-1.1499   Min.   :-1.2027   Min.   :-1.1394  
##  1st Qu.: 0.09317   1st Qu.: 0.2671   1st Qu.: 0.2745   1st Qu.: 0.2512  
##  Median : 0.17818   Median : 0.6560   Median : 0.6946   Median : 0.6524  
##  Mean   : 0.17896   Mean   : 0.6490   Mean   : 0.6857   Mean   : 0.6626  
##  3rd Qu.: 0.24368   3rd Qu.: 0.8968   3rd Qu.: 1.0440   3rd Qu.: 0.9712  
##  Max.   : 0.55154   Max.   : 2.3992   Max.   : 2.4408   Max.   : 2.9785  
##                                                                          
##   PPOPCHG_2014      PPOPCHG_2015      PPOPCHG_2016      PPOPCHG_2017     
##  Min.   :-1.6199   Min.   :-1.7457   Min.   :-1.9196   Min.   :-2.39231  
##  1st Qu.: 0.1930   1st Qu.: 0.1282   1st Qu.: 0.1233   1st Qu.: 0.05405  
##  Median : 0.5311   Median : 0.4792   Median : 0.4030   Median : 0.37844  
##  Mean   : 0.6018   Mean   : 0.6056   Mean   : 0.5883   Mean   : 0.51042  
##  3rd Qu.: 0.9699   3rd Qu.: 1.1043   3rd Qu.: 1.0899   3rd Qu.: 1.05089  
##  Max.   : 2.1306   Max.   : 2.2566   Max.   : 2.0156   Max.   : 2.13758  
##                                                                          
##   PPOPCHG_2018     NRANK_ESTBASE2010 NRANK_POPEST2010 NRANK_POPEST2011
##  Min.   :-3.9052   1      : 2        1      : 2       1      : 2      
##  1st Qu.: 0.1299   2      : 2        2      : 2       2      : 2      
##  Median : 0.3979   3      : 2        3      : 2       3      : 2      
##  Mean   : 0.4897   4      : 2        4      : 2       4      : 2      
##  3rd Qu.: 0.9723   X      : 2        X      : 2       X      : 2      
##  Max.   : 2.0854   10     : 1        10     : 1       10     : 1      
##                    (Other):46        (Other):46       (Other):46      
##  NRANK_POPEST2012 NRANK_POPEST2013 NRANK_POPEST2014 NRANK_POPEST2015
##  1      : 2       1      : 2       1      : 2       1      : 2      
##  2      : 2       2      : 2       2      : 2       2      : 2      
##  3      : 2       3      : 2       3      : 2       3      : 2      
##  4      : 2       4      : 2       4      : 2       4      : 2      
##  X      : 2       X      : 2       X      : 2       X      : 2      
##  10     : 1       10     : 1       10     : 1       10     : 1      
##  (Other):46       (Other):46       (Other):46       (Other):46      
##  NRANK_POPEST2016 NRANK_POPEST2017 NRANK_POPEST2018 NRANK_NPCHG2010
##  1      : 2       1      : 2       1      : 2       1      : 2     
##  2      : 2       2      : 2       2      : 2       2      : 2     
##  3      : 2       3      : 2       3      : 2       3      : 2     
##  4      : 2       4      : 2       4      : 2       4      : 2     
##  X      : 2       X      : 2       X      : 2       X      : 2     
##  10     : 1       10     : 1       10     : 1       10     : 1     
##  (Other):46       (Other):46       (Other):46       (Other):46     
##  NRANK_NPCHG2011 NRANK_NPCHG2012 NRANK_NPCHG2013 NRANK_NPCHG2014
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_NPCHG2015 NRANK_NPCHG2016 NRANK_NPCHG2017 NRANK_NPCHG2018
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2010 NRANK_PPCHG2011 NRANK_PPCHG2012 NRANK_PPCHG2013
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2014 NRANK_PPCHG2015 NRANK_PPCHG2016 NRANK_PPCHG2017
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2018
##  1      : 2     
##  2      : 2     
##  3      : 2     
##  4      : 2     
##  X      : 2     
##  10     : 1     
##  (Other):46


We’ll need to remove the divisions listed at the top of the dataset by creating a subset; let’s remove the first five rows, wherein data at the region and national level lives. The result is displayed using kableExtra.

#subset the data.
USpop.StateTerr<-USpop[6:57,]

require(kableExtra)
kable(head(USpop.StateTerr))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  scroll_box(width = "100%")
X SUMLEV REGION DIVISION STATE NAME ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014 POPESTIMATE2015 POPESTIMATE2016 POPESTIMATE2017 POPESTIMATE2018 NPOPCHG_2010 NPOPCHG_2011 NPOPCHG_2012 NPOPCHG_2013 NPOPCHG_2014 NPOPCHG_2015 NPOPCHG_2016 NPOPCHG_2017 NPOPCHG_2018 PPOPCHG_2010 PPOPCHG_2011 PPOPCHG_2012 PPOPCHG_2013 PPOPCHG_2014 PPOPCHG_2015 PPOPCHG_2016 PPOPCHG_2017 PPOPCHG_2018 NRANK_ESTBASE2010 NRANK_POPEST2010 NRANK_POPEST2011 NRANK_POPEST2012 NRANK_POPEST2013 NRANK_POPEST2014 NRANK_POPEST2015 NRANK_POPEST2016 NRANK_POPEST2017 NRANK_POPEST2018 NRANK_NPCHG2010 NRANK_NPCHG2011 NRANK_NPCHG2012 NRANK_NPCHG2013 NRANK_NPCHG2014 NRANK_NPCHG2015 NRANK_NPCHG2016 NRANK_NPCHG2017 NRANK_NPCHG2018 NRANK_PPCHG2010 NRANK_PPCHG2011 NRANK_PPCHG2012 NRANK_PPCHG2013 NRANK_PPCHG2014 NRANK_PPCHG2015 NRANK_PPCHG2016 NRANK_PPCHG2017 NRANK_PPCHG2018
6 6 40 3 6 1 Alabama 4780138 4785448 4798834 4815564 4830460 4842481 4853160 4864745 4875120 4887871 5310 13386 16730 14896 12021 10679 11585 10375 12751 0.1110847 0.2797230 0.3486264 0.3093303 0.2488583 0.2205275 0.2387104 0.2132691 0.2615525 23 23 23 23 23 23 24 24 24 24 28 32 26 31 32 30 28 31 27 37 38 34 33 35 36 34 34 34
7 7 40 4 9 2 Alaska 710249 713906 722038 730399 737045 736307 737547 741504 739786 737438 3657 8132 8361 6646 -738 1240 3957 -1718 -2348 0.5148898 1.1390855 1.1579723 0.9099136 -0.1001296 0.1684080 0.5365082 -0.2316913 -0.3173891 47 47 47 47 47 48 48 48 48 48 32 39 40 42 47 42 37 44 45 2 8 11 18 48 37 22 46 49
8 8 40 4 8 4 Arizona 6392288 6407774 6473497 6556629 6634999 6733840 6833596 6945452 7048876 7171646 15486 65723 83132 78370 98841 99756 111856 103424 122770 0.2422607 1.0256760 1.2841900 1.1952789 1.4896913 1.4814133 1.6368542 1.4890896 1.7416961 16 16 16 15 15 15 14 14 14 14 11 10 7 5 4 7 7 7 4 15 10 7 7 7 9 9 6 4
9 9 40 3 7 5 Arkansas 2916028 2921978 2940407 2952109 2959549 2967726 2978407 2990410 3002997 3013825 5950 18429 11702 7440 8177 10681 12003 12587 10828 0.2040447 0.6307029 0.3979721 0.2520232 0.2762921 0.3599052 0.4030007 0.4209122 0.3605731 32 32 32 32 32 32 33 32 32 33 26 25 35 41 37 29 27 26 30 22 28 31 38 31 30 26 24 28
10 10 40 4 9 6 California 37254523 37320903 37641823 37960782 38280824 38625139 38953142 39209127 39399349 39557045 66380 320920 318959 320042 344315 328003 255985 190222 157696 0.1781797 0.8598934 0.8473527 0.8430859 0.8994451 0.8491956 0.6571614 0.4851472 0.4002503 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 26 17 19 20 16 16 20 23 25
11 11 40 4 8 8 Colorado 5029316 5048281 5121771 5193721 5270482 5351218 5452107 5540921 5615902 5695564 18965 73490 71950 76761 80736 100889 88814 74981 79662 0.3770891 1.4557431 1.4047875 1.4779577 1.5318523 1.8853465 1.6289849 1.3532227 1.4185077 22 22 22 22 22 22 22 21 21 21 8 9 9 6 8 6 8 8 8 6 4 5 5 4 4 10 8 7


We now have information only on the state/territory level. Let’s transform and create a row for each year. Additionally, let’s omit the first two columns, as SUMLEV is consistent on the state/terr level. The result is displayed once again using kableExtra.

#remove first two columns
USpop.StateTerr[1:2]<-NULL
library(tidyr)
library(dplyr)
#Create row for each statistic.
USpop.YearList<-gather(USpop.StateTerr,"Statistic","Value",5:length(colnames(USpop.StateTerr)))

kable(head(USpop.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
REGION DIVISION STATE NAME Statistic Value
3 6 1 Alabama ESTIMATESBASE2010 4780138
4 9 2 Alaska ESTIMATESBASE2010 710249
4 8 4 Arizona ESTIMATESBASE2010 6392288
3 7 5 Arkansas ESTIMATESBASE2010 2916028
4 9 6 California ESTIMATESBASE2010 37254523
4 8 8 Colorado ESTIMATESBASE2010 5029316


Let’s extract the year from our new “Year” variable and create a “Statistic” variable; we’ll use stringr to extract numeric characters. The new variable is highlighted in light blue.

library(stringr)
USpop.YearList$Year<-as.numeric(str_extract(USpop.YearList$Statistic,"([0-9]+)"))
USpop.YearList$Statistic<-gsub("([0-9])|(_)","",USpop.YearList$Statistic)

#Rename value column to specifically address state/terr level data.
colnames(USpop.YearList)[6]<-"StateValue"

kable(head(USpop.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 7, bold = T, background = "LightCyan")
REGION DIVISION STATE NAME Statistic StateValue Year
3 6 1 Alabama ESTIMATESBASE 4780138 2010
4 9 2 Alaska ESTIMATESBASE 710249 2010
4 8 4 Arizona ESTIMATESBASE 6392288 2010
3 7 5 Arkansas ESTIMATESBASE 2916028 2010
4 9 6 California ESTIMATESBASE 37254523 2010
4 8 8 Colorado ESTIMATESBASE 5029316 2010


Let’s do the same for division information. Using the gather command, division data is added to our table; the change is marked in light blue.

USdiv.YearList<-gather(USpop[2:5,],"Statistic","Value",7:length(colnames(USpop)))

#Remove first two columns.
USdiv.YearList[1:2]<-NULL

#We also have no need for state or division information at Region level.
USdiv.YearList[2:3]<-NULL

#Rename NAME column.
colnames(USdiv.YearList)[2]<-"RegionName"

#Rename Value column in anticipation of merge with state data.
colnames(USdiv.YearList)[4]<-"RegionValue"

#Clean year and statistic.
USdiv.YearList$Year<-as.numeric(str_extract(USdiv.YearList$Statistic,"([0-9]+)"))
USdiv.YearList$Statistic<-gsub("([0-9])|(_)","",USdiv.YearList$Statistic)

#Shorten RegionName.
USdiv.YearList$RegionName<-gsub("( Region)","",USdiv.YearList$RegionName)

kable(head(USdiv.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")
REGION RegionName Statistic RegionValue Year
1 Northeast ESTIMATESBASE 55318430 2010
2 Midwest ESTIMATESBASE 66929743 2010
3 South ESTIMATESBASE 114563045 2010
4 West ESTIMATESBASE 71946887 2010
1 Northeast POPESTIMATE 55380645 2010
2 Midwest POPESTIMATE 66974749 2010


Nice. Let’s merge by REGION number, Year, and Statistic. We must use full_join in order to keep Puerto Rico, which is omitted from Region categorization.

USpop2<-full_join(USdiv.YearList,USpop.YearList,by=c("REGION","Statistic","Year"))

kable(head(USpop2))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
REGION RegionName Statistic RegionValue Year DIVISION STATE NAME StateValue
1 Northeast ESTIMATESBASE 55318430 2010 1 9 Connecticut 3574147
1 Northeast ESTIMATESBASE 55318430 2010 1 23 Maine 1328369
1 Northeast ESTIMATESBASE 55318430 2010 1 25 Massachusetts 6547790
1 Northeast ESTIMATESBASE 55318430 2010 1 33 New Hampshire 1316464
1 Northeast ESTIMATESBASE 55318430 2010 2 34 New Jersey 8791962
1 Northeast ESTIMATESBASE 55318430 2010 2 36 New York 19378124


We need to pair this with a list of divisions, which differs from Region. A list exists on wikipedia; I’ve edited from https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States.

DIVISION<-as.factor(c(1,2,3,4,5,6,7,8,9))
DivisionName<-c("New England","Mid-Atlantic","East North Central","West North Central","South Atlantic","East South Central","West South Central","Mountain","Pacific")
Divisions<-data.frame(DIVISION,DivisionName)


Merge Divisions with master dataset and check.

USpop3<-full_join(Divisions,USpop2,by=c("DIVISION"))

kable(head(USpop3))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")
DIVISION DivisionName REGION RegionName Statistic RegionValue Year STATE NAME StateValue
1 New England 1 Northeast ESTIMATESBASE 55318430 2010 9 Connecticut 3574147
1 New England 1 Northeast ESTIMATESBASE 55318430 2010 23 Maine 1328369
1 New England 1 Northeast ESTIMATESBASE 55318430 2010 25 Massachusetts 6547790
1 New England 1 Northeast ESTIMATESBASE 55318430 2010 33 New Hampshire 1316464
1 New England 1 Northeast ESTIMATESBASE 55318430 2010 44 Rhode Island 1052957
1 New England 1 Northeast ESTIMATESBASE 55318430 2010 50 Vermont 625744


Let’s look at Puerto Rico in particular.

PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]

kable(head(PR))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
DIVISION DivisionName REGION RegionName Statistic RegionValue Year STATE NAME StateValue
2857 X NA X NA ESTIMATESBASE NA 2010 72 Puerto Rico 3726157
2858 X NA X NA POPESTIMATE NA 2010 72 Puerto Rico 3721525
2859 X NA X NA POPESTIMATE NA 2011 72 Puerto Rico 3678732
2860 X NA X NA POPESTIMATE NA 2012 72 Puerto Rico 3634488
2861 X NA X NA POPESTIMATE NA 2013 72 Puerto Rico 3593077
2862 X NA X NA POPESTIMATE NA 2014 72 Puerto Rico 3534874


Puerto Rico is not ranked among the other states and has no division assignment. This is problematic for anyone completing comparative analysis on a national level. Since puerto rico has been assigned no region, there is no regional data available. Let’s recode NA as Puerto Rico in RegionName.

USpop3$RegionName[which(USpop3$NAME=="Puerto Rico")]<-"Puerto Rico"

#look at Puerto Rico again, replace subset.
PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]

kable(head(PR))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 4, bold = T, background = "LightCyan")
DIVISION DivisionName REGION RegionName Statistic RegionValue Year STATE NAME StateValue
2857 X NA X Puerto Rico ESTIMATESBASE NA 2010 72 Puerto Rico 3726157
2858 X NA X Puerto Rico POPESTIMATE NA 2010 72 Puerto Rico 3721525
2859 X NA X Puerto Rico POPESTIMATE NA 2011 72 Puerto Rico 3678732
2860 X NA X Puerto Rico POPESTIMATE NA 2012 72 Puerto Rico 3634488
2861 X NA X Puerto Rico POPESTIMATE NA 2013 72 Puerto Rico 3593077
2862 X NA X Puerto Rico POPESTIMATE NA 2014 72 Puerto Rico 3534874


Let’s add a column for National statistics.

#National data only.
USpop.Nat<-USpop[1,3:ncol(USpop)]

#gather, as before
USpop.Nat<-gather(USpop.Nat,"Statistic","Value",5:length(colnames(USpop.Nat)))

#Clean year and statistic.
USpop.Nat$Year<-as.numeric(str_extract(USpop.Nat$Statistic,"([0-9]+)"))
USpop.Nat$Statistic<-gsub("([0-9])|(_)","",USpop.Nat$Statistic)

#omit columns and unnecessary rows.
USpop.Nat<-USpop.Nat[,5:7]
USpop.Nat<-USpop.Nat[which(USpop.Nat$Value!="X"),]

#rename coluns in preparation for merge.
colnames(USpop.Nat)[2]<-"NationalValue"

#merge into master data.
USpop4<-full_join(USpop.Nat,USpop3,by=c("Statistic","Year"))

kable(head(USpop4))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")
Statistic NationalValue Year DIVISION DivisionName REGION RegionName RegionValue STATE NAME StateValue
ESTIMATESBASE 308758105 2010 1 New England 1 Northeast 55318430 9 Connecticut 3574147
ESTIMATESBASE 308758105 2010 1 New England 1 Northeast 55318430 23 Maine 1328369
ESTIMATESBASE 308758105 2010 1 New England 1 Northeast 55318430 25 Massachusetts 6547790
ESTIMATESBASE 308758105 2010 1 New England 1 Northeast 55318430 33 New Hampshire 1316464
ESTIMATESBASE 308758105 2010 1 New England 1 Northeast 55318430 44 Rhode Island 1052957
ESTIMATESBASE 308758105 2010 1 New England 1 Northeast 55318430 50 Vermont 625744


Let’s sort the data by type of statistic. I’ll use the subsets in the code below to render maps.

This concludes the tidying portion of the project; the original cleaning objective in Arun Reddy’s post has been accomplished. As there were no outlines provided for analysis, let’s identify
1. States with the greatest population growth in 2018.
2. States with the greatest decrease in population in 2018.
3. States with the greatest population perentage by nation and by region.

USpopESTIMATESBASE<-USpop4[which(USpop4$Statistic=="ESTIMATESBASE"),]
USpopESTIMATE<-USpop4[which(USpop4$Statistic=="POPESTIMATE"),]
USpopPOPCHG<-USpop4[which(USpop4$Statistic=="NPOPCHG"),]
USpopPPOPCHG<-USpop4[which(USpop4$Statistic=="PPOPCHG"),]

Results

Setup

Finally, let’s map our results; I’ve used ggplot2 to render, ggthemes and viridis to customize, various tools in the tidyverse package, as well as fiftystater, which needs to be installed via github. The packages mapdata and mapproj supply map data for puerto rico, which is excluded from the fiftystater package.
# install.packages("devtools") devtools::install_github("wmurphyrd/fiftystater")

Map: population increases in 2018

Let’s map population increases in 2018.

USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue<0)]<-0


require(ggplot2)
require(fiftystater)
require(ggthemes)
require(tidyverse)
require(viridis)

USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)

data("fifty_states")

library(mapdata)
library(mapproj)

#Puerto Rico color must be set manually; there has been a decrease in population, so we've matched the color representing zero growth.
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="lemonchiffon1")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=StateValue), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(-100000,-25000,-5000,0,5000,25000,100000,300000),
labels=c('-100K--25K','-25K--5K','-25K-0','0-5000','5k-25k','25K-100K','100K-300K','300K+'),begin=1,end=.25,option="magma")+
  guides(fill=guide_legend(title="Growth by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Population Increases by State, 2018")

library(grid)
library(grDevices)

png(file="Project2a.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()
## png 
##   2
knitr::include_graphics("Project2a.png")

The greatest estimated population increase in 2018, by far, occurred in Texas and Florida.


Map: population decreases in 2018

Let’s map population losses.

USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue>0)]<-0

USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="khaki3")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=StateValue), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(-300000,-100000,-25000,-5000,0),
labels=c('300K+','100K-300K','25K-100K','5K-25K','0-25K'),begin=1,end=0,option="cividis")+
  guides(fill=guide_legend(title="Decrease pop by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Population Decreases by State, 2018")


png(file="Project2b.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()
## png 
##   2
knitr::include_graphics("Project2b.png")

It’s easy to see that West Virginia and Louisiana is estimated to have experienced a decrease in population in 2018, as well as Illinois and New York. Puerto Rico is estimated to have experienced the greatest decrease in population.


Map: population by state

Let’s calculate percentage of national and division population resides in each state or territory using projected population data for 2018. Let’s map percentage of national population by state.

USpopESTIMATE2018<-USpop4[which(USpopESTIMATE$Year==2018),]
USpopESTIMATE2018$StateValue<-as.numeric(USpopESTIMATE2018$StateValue)
USpopESTIMATE2018$RegionValue<-as.numeric(USpopESTIMATE2018$RegionValue)
USpopESTIMATE2018$NationalValue<-as.numeric(USpopESTIMATE2018$NationalValue)
USpopESTIMATE2018$NatPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$NationalValue
USpopESTIMATE2018$RegPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$RegionValue

USpopESTIMATE2018$statefull<-tolower(USpopESTIMATE2018$NAME)

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey49")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=NatPer), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
  guides(fill=guide_legend(title="Nat Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("National Population Percentage by State, 2018")


png(file="Project2c.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()
## png 
##   2
knitr::include_graphics("Project2c.png")

It’s apparent that the states with greatest populations are California, Texas, Florida and New York. With New York’s estimated population decreasing, it’s curious that it still accounts for one of the greatest percentages of national population.


Map: population percentage by region

Let’s map state percentage by region.

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey98")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=RegPer), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
  guides(fill=guide_legend(title="Region Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Regional Population Percentage by State, 2018")

png(file="Project2d.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()
## png 
##   2
knitr::include_graphics("Project2d.png")

Omitting Puerto Rico due to the fact that there is no region or division assignment, we see that California, as expected, dominates the West Region, while New York dominates the Northeast to a lesser degree. Texas is most populous in the South, and Illinois carries the Midwest by only a slight margin.

Such data could be used to form an argument for representation in Congress. Should a state such as California, with 12.1% of the national population, be represented by only 2 senators?