Data Description: Population Data

I chose to work with population data posted by Arun Reddy. His description:

I found a census dataset for Puerto Rico population for 2010 to 2018. The dataset gives the Estimates of the Total Resident Population and Resident Population Age 18 Years and Older for the United States, States, and Puerto Rico

The dataset gives the population metrics like estimated, change in population from the estimate, National Rank in populations for each year from 2010 to 2018.

So there are 57 rows with 61 columns, so the dataset is wider than the length.

This is a good example of untidy data can be cleansed and make it more presentable. The following steps can be performed to cleanse the data.

All the 4 metrics by years are spread out by column-wise can be changed into rows.
Row names/Column name which includes the year as a concatenation can be well formatted to make more readable.
Some of the column names don’t have the right data type like population change, national rank is factorial data type which is unnecessary.

Tidying

Let’s read population data from GitHub, check descriptives with summary command. Output is limited by customized CSS in Markdown.

USpop<-read.csv("https://raw.githubusercontent.com/sigmasigmaiota/USpopulation/master/Population DataSet.csv")

#Look at summary table.
summary(USpop)

##        X          SUMLEV      REGION    DIVISION      STATE      
##  Min.   : 1   Min.   :10.00   0: 1   5      : 9   Min.   : 0.00  
##  1st Qu.:15   1st Qu.:40.00   1:10   8      : 8   1st Qu.:12.00  
##  Median :29   Median :40.00   2:13   4      : 7   Median :27.00  
##  Mean   :29   Mean   :38.07   3:18   1      : 6   Mean   :27.18  
##  3rd Qu.:43   3rd Qu.:40.00   4:14   0      : 5   3rd Qu.:41.00  
##  Max.   :57   Max.   :40.00   X: 1   3      : 5   Max.   :72.00  
##                                      (Other):17                  
##          NAME    ESTIMATESBASE2010   POPESTIMATE2010    
##  Alabama   : 1   Min.   :   563773   Min.   :   564483  
##  Alaska    : 1   1st Qu.:  1853001   1st Qu.:  1854214  
##  Arizona   : 1   Median :  4625381   Median :  4635656  
##  Arkansas  : 1   Mean   : 16315798   Mean   : 16345610  
##  California: 1   3rd Qu.:  9535736   3rd Qu.:  9574293  
##  Colorado  : 1   Max.   :308758105   Max.   :309326085  
##  (Other)   :51                                          
##  POPESTIMATE2011     POPESTIMATE2012     POPESTIMATE2013    
##  Min.   :   567224   Min.   :   576270   Min.   :   582123  
##  1st Qu.:  1856074   1st Qu.:  1856764   1st Qu.:  1865414  
##  Median :  4671422   Median :  4717112   Median :  4764153  
##  Mean   : 16463487   Mean   : 16583459   Mean   : 16697654  
##  3rd Qu.:  9656754   3rd Qu.:  9749123   3rd Qu.:  9843599  
##  Max.   :311580009   Max.   :313874218   Max.   :316057727  
##                                                             
##  POPESTIMATE2014     POPESTIMATE2015     POPESTIMATE2016    
##  Min.   :   582548   Min.   :   585668   Min.   :   584290  
##  1st Qu.:  1879522   1st Qu.:  1891507   1st Qu.:  1905924  
##  Median :  4823793   Median :  4853160   Median :  4864745  
##  Mean   : 16819195   Mean   : 16942126   Mean   : 17063518  
##  3rd Qu.:  9930589   3rd Qu.:  9932573   3rd Qu.:  9951890  
##  Max.   :318386421   Max.   :320742673   Max.   :323071342  
##                                                             
##  POPESTIMATE2017     POPESTIMATE2018      NPOPCHG_2010   
##  Min.   :   578934   Min.   :   577737   Min.   : -6582  
##  1st Qu.:  1917575   1st Qu.:  1929268   1st Qu.:  2570  
##  Median :  4875120   Median :  4887871   Median :  6457  
##  Mean   : 17171340   Mean   : 17275394   Mean   : 29812  
##  3rd Qu.:  9976447   3rd Qu.:  9995915   3rd Qu.: 18362  
##  Max.   :325147121   Max.   :327167434   Max.   :567980  
##                                                          
##   NPOPCHG_2011      NPOPCHG_2012      NPOPCHG_2013      NPOPCHG_2014    
##  Min.   : -42793   Min.   : -44244   Min.   : -41411   Min.   : -58203  
##  1st Qu.:   8898   1st Qu.:  10012   1st Qu.:   8774   1st Qu.:   7386  
##  Median :  21288   Median :  16893   Median :  18436   Median :  17240  
##  Mean   : 117877   Mean   : 119972   Mean   : 114195   Mean   : 121542  
##  3rd Qu.:  65723   3rd Qu.:  71221   3rd Qu.:  53494   3rd Qu.:  59023  
##  Max.   :2253924   Max.   :2294209   Max.   :2183509   Max.   :2328694  
##                                                                         
##   NPOPCHG_2015      NPOPCHG_2016      NPOPCHG_2017      NPOPCHG_2018    
##  Min.   : -61708   Min.   : -66671   Min.   : -81494   Min.   :-129848  
##  1st Qu.:   4089   1st Qu.:   3647   1st Qu.:   1365   1st Qu.:   3341  
##  Median :  14763   Median :  13364   Median :  14027   Median :  17827  
##  Mean   : 122931   Mean   : 121392   Mean   : 107822   Mean   : 104054  
##  3rd Qu.:  50831   3rd Qu.:  60116   3rd Qu.:  60505   3rd Qu.:  61216  
##  Max.   :2356252   Max.   :2328669   Max.   :2075779   Max.   :2020313  
##                                                                         
##   PPOPCHG_2010       PPOPCHG_2011      PPOPCHG_2012      PPOPCHG_2013    
##  Min.   :-0.12431   Min.   :-1.1499   Min.   :-1.2027   Min.   :-1.1394  
##  1st Qu.: 0.09317   1st Qu.: 0.2671   1st Qu.: 0.2745   1st Qu.: 0.2512  
##  Median : 0.17818   Median : 0.6560   Median : 0.6946   Median : 0.6524  
##  Mean   : 0.17896   Mean   : 0.6490   Mean   : 0.6857   Mean   : 0.6626  
##  3rd Qu.: 0.24368   3rd Qu.: 0.8968   3rd Qu.: 1.0440   3rd Qu.: 0.9712  
##  Max.   : 0.55154   Max.   : 2.3992   Max.   : 2.4408   Max.   : 2.9785  
##                                                                          
##   PPOPCHG_2014      PPOPCHG_2015      PPOPCHG_2016      PPOPCHG_2017     
##  Min.   :-1.6199   Min.   :-1.7457   Min.   :-1.9196   Min.   :-2.39231  
##  1st Qu.: 0.1930   1st Qu.: 0.1282   1st Qu.: 0.1233   1st Qu.: 0.05405  
##  Median : 0.5311   Median : 0.4792   Median : 0.4030   Median : 0.37844  
##  Mean   : 0.6018   Mean   : 0.6056   Mean   : 0.5883   Mean   : 0.51042  
##  3rd Qu.: 0.9699   3rd Qu.: 1.1043   3rd Qu.: 1.0899   3rd Qu.: 1.05089  
##  Max.   : 2.1306   Max.   : 2.2566   Max.   : 2.0156   Max.   : 2.13758  
##                                                                          
##   PPOPCHG_2018     NRANK_ESTBASE2010 NRANK_POPEST2010 NRANK_POPEST2011
##  Min.   :-3.9052   1      : 2        1      : 2       1      : 2      
##  1st Qu.: 0.1299   2      : 2        2      : 2       2      : 2      
##  Median : 0.3979   3      : 2        3      : 2       3      : 2      
##  Mean   : 0.4897   4      : 2        4      : 2       4      : 2      
##  3rd Qu.: 0.9723   X      : 2        X      : 2       X      : 2      
##  Max.   : 2.0854   10     : 1        10     : 1       10     : 1      
##                    (Other):46        (Other):46       (Other):46      
##  NRANK_POPEST2012 NRANK_POPEST2013 NRANK_POPEST2014 NRANK_POPEST2015
##  1      : 2       1      : 2       1      : 2       1      : 2      
##  2      : 2       2      : 2       2      : 2       2      : 2      
##  3      : 2       3      : 2       3      : 2       3      : 2      
##  4      : 2       4      : 2       4      : 2       4      : 2      
##  X      : 2       X      : 2       X      : 2       X      : 2      
##  10     : 1       10     : 1       10     : 1       10     : 1      
##  (Other):46       (Other):46       (Other):46       (Other):46      
##  NRANK_POPEST2016 NRANK_POPEST2017 NRANK_POPEST2018 NRANK_NPCHG2010
##  1      : 2       1      : 2       1      : 2       1      : 2     
##  2      : 2       2      : 2       2      : 2       2      : 2     
##  3      : 2       3      : 2       3      : 2       3      : 2     
##  4      : 2       4      : 2       4      : 2       4      : 2     
##  X      : 2       X      : 2       X      : 2       X      : 2     
##  10     : 1       10     : 1       10     : 1       10     : 1     
##  (Other):46       (Other):46       (Other):46       (Other):46     
##  NRANK_NPCHG2011 NRANK_NPCHG2012 NRANK_NPCHG2013 NRANK_NPCHG2014
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_NPCHG2015 NRANK_NPCHG2016 NRANK_NPCHG2017 NRANK_NPCHG2018
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2010 NRANK_PPCHG2011 NRANK_PPCHG2012 NRANK_PPCHG2013
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2014 NRANK_PPCHG2015 NRANK_PPCHG2016 NRANK_PPCHG2017
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2018
##  1      : 2     
##  2      : 2     
##  3      : 2     
##  4      : 2     
##  X      : 2     
##  10     : 1     
##  (Other):46

We’ll need to remove the divisions listed at the top of the dataset by creating a subset; let’s remove the first five rows, wherein data at the region and national level lives. The result is displayed using kableExtra.

#subset the data.
USpop.StateTerr<-USpop[6:57,]

require(kableExtra)
kable(head(USpop.StateTerr))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  scroll_box(width = "100%")

	X	SUMLEV	REGION	DIVISION	STATE	NAME	ESTIMATESBASE2010	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015	POPESTIMATE2016	POPESTIMATE2017	POPESTIMATE2018	NPOPCHG_2010	NPOPCHG_2011	NPOPCHG_2012	NPOPCHG_2013	NPOPCHG_2014	NPOPCHG_2015	NPOPCHG_2016	NPOPCHG_2017	NPOPCHG_2018	PPOPCHG_2010	PPOPCHG_2011	PPOPCHG_2012	PPOPCHG_2013	PPOPCHG_2014	PPOPCHG_2015	PPOPCHG_2016	PPOPCHG_2017	PPOPCHG_2018	NRANK_ESTBASE2010	NRANK_POPEST2010	NRANK_POPEST2011	NRANK_POPEST2012	NRANK_POPEST2013	NRANK_POPEST2014	NRANK_POPEST2015	NRANK_POPEST2016	NRANK_POPEST2017	NRANK_POPEST2018	NRANK_NPCHG2010	NRANK_NPCHG2011	NRANK_NPCHG2012	NRANK_NPCHG2013	NRANK_NPCHG2014	NRANK_NPCHG2015	NRANK_NPCHG2016	NRANK_NPCHG2017	NRANK_NPCHG2018	NRANK_PPCHG2010	NRANK_PPCHG2011	NRANK_PPCHG2012	NRANK_PPCHG2013	NRANK_PPCHG2014	NRANK_PPCHG2015	NRANK_PPCHG2016	NRANK_PPCHG2017	NRANK_PPCHG2018
6	6	40	3	6	1	Alabama	4780138	4785448	4798834	4815564	4830460	4842481	4853160	4864745	4875120	4887871	5310	13386	16730	14896	12021	10679	11585	10375	12751	0.1110847	0.2797230	0.3486264	0.3093303	0.2488583	0.2205275	0.2387104	0.2132691	0.2615525	23	23	23	23	23	23	24	24	24	24	28	32	26	31	32	30	28	31	27	37	38	34	33	35	36	34	34	34
7	7	40	4	9	2	Alaska	710249	713906	722038	730399	737045	736307	737547	741504	739786	737438	3657	8132	8361	6646	-738	1240	3957	-1718	-2348	0.5148898	1.1390855	1.1579723	0.9099136	-0.1001296	0.1684080	0.5365082	-0.2316913	-0.3173891	47	47	47	47	47	48	48	48	48	48	32	39	40	42	47	42	37	44	45	2	8	11	18	48	37	22	46	49
8	8	40	4	8	4	Arizona	6392288	6407774	6473497	6556629	6634999	6733840	6833596	6945452	7048876	7171646	15486	65723	83132	78370	98841	99756	111856	103424	122770	0.2422607	1.0256760	1.2841900	1.1952789	1.4896913	1.4814133	1.6368542	1.4890896	1.7416961	16	16	16	15	15	15	14	14	14	14	11	10	7	5	4	7	7	7	4	15	10	7	7	7	9	9	6	4
9	9	40	3	7	5	Arkansas	2916028	2921978	2940407	2952109	2959549	2967726	2978407	2990410	3002997	3013825	5950	18429	11702	7440	8177	10681	12003	12587	10828	0.2040447	0.6307029	0.3979721	0.2520232	0.2762921	0.3599052	0.4030007	0.4209122	0.3605731	32	32	32	32	32	32	33	32	32	33	26	25	35	41	37	29	27	26	30	22	28	31	38	31	30	26	24	28
10	10	40	4	9	6	California	37254523	37320903	37641823	37960782	38280824	38625139	38953142	39209127	39399349	39557045	66380	320920	318959	320042	344315	328003	255985	190222	157696	0.1781797	0.8598934	0.8473527	0.8430859	0.8994451	0.8491956	0.6571614	0.4851472	0.4002503	1	1	1	1	1	1	1	1	1	1	2	2	2	2	2	3	3	3	3	26	17	19	20	16	16	20	23	25
11	11	40	4	8	8	Colorado	5029316	5048281	5121771	5193721	5270482	5351218	5452107	5540921	5615902	5695564	18965	73490	71950	76761	80736	100889	88814	74981	79662	0.3770891	1.4557431	1.4047875	1.4779577	1.5318523	1.8853465	1.6289849	1.3532227	1.4185077	22	22	22	22	22	22	22	21	21	21	8	9	9	6	8	6	8	8	8	6	4	5	5	4	4	10	8	7

We now have information only on the state/territory level. Let’s transform and create a row for each year. Additionally, let’s omit the first two columns, as SUMLEV is consistent on the state/terr level. The result is displayed once again using kableExtra.

#remove first two columns
USpop.StateTerr[1:2]<-NULL
library(tidyr)
library(dplyr)
#Create row for each statistic.
USpop.YearList<-gather(USpop.StateTerr,"Statistic","Value",5:length(colnames(USpop.StateTerr)))

kable(head(USpop.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

REGION	DIVISION	STATE	NAME	Statistic	Value
3	6	1	Alabama	ESTIMATESBASE2010	4780138
4	9	2	Alaska	ESTIMATESBASE2010	710249
4	8	4	Arizona	ESTIMATESBASE2010	6392288
3	7	5	Arkansas	ESTIMATESBASE2010	2916028
4	9	6	California	ESTIMATESBASE2010	37254523
4	8	8	Colorado	ESTIMATESBASE2010	5029316

Let’s extract the year from our new “Year” variable and create a “Statistic” variable; we’ll use stringr to extract numeric characters. The new variable is highlighted in light blue.

library(stringr)
USpop.YearList$Year<-as.numeric(str_extract(USpop.YearList$Statistic,"([0-9]+)"))
USpop.YearList$Statistic<-gsub("([0-9])|(_)","",USpop.YearList$Statistic)

#Rename value column to specifically address state/terr level data.
colnames(USpop.YearList)[6]<-"StateValue"

kable(head(USpop.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 7, bold = T, background = "LightCyan")

REGION	DIVISION	STATE	NAME	Statistic	StateValue	Year
3	6	1	Alabama	ESTIMATESBASE	4780138	2010
4	9	2	Alaska	ESTIMATESBASE	710249	2010
4	8	4	Arizona	ESTIMATESBASE	6392288	2010
3	7	5	Arkansas	ESTIMATESBASE	2916028	2010
4	9	6	California	ESTIMATESBASE	37254523	2010
4	8	8	Colorado	ESTIMATESBASE	5029316	2010

Let’s do the same for division information. Using the gather command, division data is added to our table; the change is marked in light blue.

USdiv.YearList<-gather(USpop[2:5,],"Statistic","Value",7:length(colnames(USpop)))

#Remove first two columns.
USdiv.YearList[1:2]<-NULL

#We also have no need for state or division information at Region level.
USdiv.YearList[2:3]<-NULL

#Rename NAME column.
colnames(USdiv.YearList)[2]<-"RegionName"

#Rename Value column in anticipation of merge with state data.
colnames(USdiv.YearList)[4]<-"RegionValue"

#Clean year and statistic.
USdiv.YearList$Year<-as.numeric(str_extract(USdiv.YearList$Statistic,"([0-9]+)"))
USdiv.YearList$Statistic<-gsub("([0-9])|(_)","",USdiv.YearList$Statistic)

#Shorten RegionName.
USdiv.YearList$RegionName<-gsub("( Region)","",USdiv.YearList$RegionName)

kable(head(USdiv.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")

REGION	RegionName	Statistic	RegionValue	Year
1	Northeast	ESTIMATESBASE	55318430	2010
2	Midwest	ESTIMATESBASE	66929743	2010
3	South	ESTIMATESBASE	114563045	2010
4	West	ESTIMATESBASE	71946887	2010
1	Northeast	POPESTIMATE	55380645	2010
2	Midwest	POPESTIMATE	66974749	2010

Nice. Let’s merge by REGION number, Year, and Statistic. We must use full_join in order to keep Puerto Rico, which is omitted from Region categorization.

USpop2<-full_join(USdiv.YearList,USpop.YearList,by=c("REGION","Statistic","Year"))

kable(head(USpop2))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

REGION	RegionName	Statistic	RegionValue	Year	DIVISION	STATE	NAME	StateValue
1	Northeast	ESTIMATESBASE	55318430	2010	1	9	Connecticut	3574147
1	Northeast	ESTIMATESBASE	55318430	2010	1	23	Maine	1328369
1	Northeast	ESTIMATESBASE	55318430	2010	1	25	Massachusetts	6547790
1	Northeast	ESTIMATESBASE	55318430	2010	1	33	New Hampshire	1316464
1	Northeast	ESTIMATESBASE	55318430	2010	2	34	New Jersey	8791962
1	Northeast	ESTIMATESBASE	55318430	2010	2	36	New York	19378124

We need to pair this with a list of divisions, which differs from Region. A list exists on wikipedia; I’ve edited from https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States.

DIVISION<-as.factor(c(1,2,3,4,5,6,7,8,9))
DivisionName<-c("New England","Mid-Atlantic","East North Central","West North Central","South Atlantic","East South Central","West South Central","Mountain","Pacific")
Divisions<-data.frame(DIVISION,DivisionName)

Merge Divisions with master dataset and check.

USpop3<-full_join(Divisions,USpop2,by=c("DIVISION"))

kable(head(USpop3))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")

DIVISION	DivisionName	REGION	RegionName	Statistic	RegionValue	Year	STATE	NAME	StateValue
1	New England	1	Northeast	ESTIMATESBASE	55318430	2010	9	Connecticut	3574147
1	New England	1	Northeast	ESTIMATESBASE	55318430	2010	23	Maine	1328369
1	New England	1	Northeast	ESTIMATESBASE	55318430	2010	25	Massachusetts	6547790
1	New England	1	Northeast	ESTIMATESBASE	55318430	2010	33	New Hampshire	1316464
1	New England	1	Northeast	ESTIMATESBASE	55318430	2010	44	Rhode Island	1052957
1	New England	1	Northeast	ESTIMATESBASE	55318430	2010	50	Vermont	625744

Let’s look at Puerto Rico in particular.

PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]

kable(head(PR))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

	DIVISION	DivisionName	REGION	RegionName	Statistic	RegionValue	Year	STATE	NAME	StateValue
2857	X	NA	X	NA	ESTIMATESBASE	NA	2010	72	Puerto Rico	3726157
2858	X	NA	X	NA	POPESTIMATE	NA	2010	72	Puerto Rico	3721525
2859	X	NA	X	NA	POPESTIMATE	NA	2011	72	Puerto Rico	3678732
2860	X	NA	X	NA	POPESTIMATE	NA	2012	72	Puerto Rico	3634488
2861	X	NA	X	NA	POPESTIMATE	NA	2013	72	Puerto Rico	3593077
2862	X	NA	X	NA	POPESTIMATE	NA	2014	72	Puerto Rico	3534874

Puerto Rico is not ranked among the other states and has no division assignment. This is problematic for anyone completing comparative analysis on a national level. Since puerto rico has been assigned no region, there is no regional data available. Let’s recode NA as Puerto Rico in RegionName.

USpop3$RegionName[which(USpop3$NAME=="Puerto Rico")]<-"Puerto Rico"

#look at Puerto Rico again, replace subset.
PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]

kable(head(PR))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 4, bold = T, background = "LightCyan")

	DIVISION	DivisionName	REGION	RegionName	Statistic	RegionValue	Year	STATE	NAME	StateValue
2857	X	NA	X	Puerto Rico	ESTIMATESBASE	NA	2010	72	Puerto Rico	3726157
2858	X	NA	X	Puerto Rico	POPESTIMATE	NA	2010	72	Puerto Rico	3721525
2859	X	NA	X	Puerto Rico	POPESTIMATE	NA	2011	72	Puerto Rico	3678732
2860	X	NA	X	Puerto Rico	POPESTIMATE	NA	2012	72	Puerto Rico	3634488
2861	X	NA	X	Puerto Rico	POPESTIMATE	NA	2013	72	Puerto Rico	3593077
2862	X	NA	X	Puerto Rico	POPESTIMATE	NA	2014	72	Puerto Rico	3534874

Let’s add a column for National statistics.

#National data only.
USpop.Nat<-USpop[1,3:ncol(USpop)]

#gather, as before
USpop.Nat<-gather(USpop.Nat,"Statistic","Value",5:length(colnames(USpop.Nat)))

#Clean year and statistic.
USpop.Nat$Year<-as.numeric(str_extract(USpop.Nat$Statistic,"([0-9]+)"))
USpop.Nat$Statistic<-gsub("([0-9])|(_)","",USpop.Nat$Statistic)

#omit columns and unnecessary rows.
USpop.Nat<-USpop.Nat[,5:7]
USpop.Nat<-USpop.Nat[which(USpop.Nat$Value!="X"),]

#rename coluns in preparation for merge.
colnames(USpop.Nat)[2]<-"NationalValue"

#merge into master data.
USpop4<-full_join(USpop.Nat,USpop3,by=c("Statistic","Year"))

kable(head(USpop4))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")

Statistic	NationalValue	Year	DIVISION	DivisionName	REGION	RegionName	RegionValue	STATE	NAME	StateValue
ESTIMATESBASE	308758105	2010	1	New England	1	Northeast	55318430	9	Connecticut	3574147
ESTIMATESBASE	308758105	2010	1	New England	1	Northeast	55318430	23	Maine	1328369
ESTIMATESBASE	308758105	2010	1	New England	1	Northeast	55318430	25	Massachusetts	6547790
ESTIMATESBASE	308758105	2010	1	New England	1	Northeast	55318430	33	New Hampshire	1316464
ESTIMATESBASE	308758105	2010	1	New England	1	Northeast	55318430	44	Rhode Island	1052957
ESTIMATESBASE	308758105	2010	1	New England	1	Northeast	55318430	50	Vermont	625744

Let’s sort the data by type of statistic. I’ll use the subsets in the code below to render maps.

This concludes the tidying portion of the project; the original cleaning objective in Arun Reddy’s post has been accomplished. As there were no outlines provided for analysis, let’s identify
1. States with the greatest population growth in 2018.
2. States with the greatest decrease in population in 2018.
3. States with the greatest population perentage by nation and by region.

USpopESTIMATESBASE<-USpop4[which(USpop4$Statistic=="ESTIMATESBASE"),]
USpopESTIMATE<-USpop4[which(USpop4$Statistic=="POPESTIMATE"),]
USpopPOPCHG<-USpop4[which(USpop4$Statistic=="NPOPCHG"),]
USpopPPOPCHG<-USpop4[which(USpop4$Statistic=="PPOPCHG"),]

Results

Setup

Finally, let’s map our results; I’ve used ggplot2 to render, ggthemes and viridis to customize, various tools in the tidyverse package, as well as fiftystater, which needs to be installed via github. The packages mapdata and mapproj supply map data for puerto rico, which is excluded from the fiftystater package.
# install.packages("devtools") devtools::install_github("wmurphyrd/fiftystater")

Map: population increases in 2018

Let’s map population increases in 2018.

USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue<0)]<-0


require(ggplot2)
require(fiftystater)
require(ggthemes)
require(tidyverse)
require(viridis)

USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)

data("fifty_states")

library(mapdata)
library(mapproj)

#Puerto Rico color must be set manually; there has been a decrease in population, so we've matched the color representing zero growth.
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="lemonchiffon1")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=StateValue), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(-100000,-25000,-5000,0,5000,25000,100000,300000),
labels=c('-100K--25K','-25K--5K','-25K-0','0-5000','5k-25k','25K-100K','100K-300K','300K+'),begin=1,end=.25,option="magma")+
  guides(fill=guide_legend(title="Growth by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Population Increases by State, 2018")

library(grid)
library(grDevices)

png(file="Project2a.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()

## png 
##   2

knitr::include_graphics("Project2a.png")

The greatest estimated population increase in 2018, by far, occurred in Texas and Florida.

Map: population decreases in 2018

Let’s map population losses.

USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue>0)]<-0

USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="khaki3")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=StateValue), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(-300000,-100000,-25000,-5000,0),
labels=c('300K+','100K-300K','25K-100K','5K-25K','0-25K'),begin=1,end=0,option="cividis")+
  guides(fill=guide_legend(title="Decrease pop by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Population Decreases by State, 2018")


png(file="Project2b.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()

## png 
##   2

knitr::include_graphics("Project2b.png")

It’s easy to see that West Virginia and Louisiana is estimated to have experienced a decrease in population in 2018, as well as Illinois and New York. Puerto Rico is estimated to have experienced the greatest decrease in population.

Map: population by state

Let’s calculate percentage of national and division population resides in each state or territory using projected population data for 2018. Let’s map percentage of national population by state.

USpopESTIMATE2018<-USpop4[which(USpopESTIMATE$Year==2018),]
USpopESTIMATE2018$StateValue<-as.numeric(USpopESTIMATE2018$StateValue)
USpopESTIMATE2018$RegionValue<-as.numeric(USpopESTIMATE2018$RegionValue)
USpopESTIMATE2018$NationalValue<-as.numeric(USpopESTIMATE2018$NationalValue)
USpopESTIMATE2018$NatPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$NationalValue
USpopESTIMATE2018$RegPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$RegionValue

USpopESTIMATE2018$statefull<-tolower(USpopESTIMATE2018$NAME)

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey49")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=NatPer), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
  guides(fill=guide_legend(title="Nat Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("National Population Percentage by State, 2018")


png(file="Project2c.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()

## png 
##   2

knitr::include_graphics("Project2c.png")

It’s apparent that the states with greatest populations are California, Texas, Florida and New York. With New York’s estimated population decreasing, it’s curious that it still accounts for one of the greatest percentages of national population.

Map: population percentage by region

Let’s map state percentage by region.

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey98")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=RegPer), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
  guides(fill=guide_legend(title="Region Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Regional Population Percentage by State, 2018")

png(file="Project2d.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()

## png 
##   2

knitr::include_graphics("Project2d.png")

Omitting Puerto Rico due to the fact that there is no region or division assignment, we see that California, as expected, dominates the West Region, while New York dominates the Northeast to a lesser degree. Texas is most populous in the South, and Illinois carries the Midwest by only a slight margin.

Such data could be used to form an argument for representation in Congress. Should a state such as California, with 12.1% of the national population, be represented by only 2 senators?

Project 2

Stephen Jones

March 9, 2019