I found a census dataset for Puerto Rico population for 2010 to 2018. The dataset gives the Estimates of the Total Resident Population and Resident Population Age 18 Years and Older for the United States, States, and Puerto Rico
The dataset gives the population metrics like estimated, change in population from the estimate, National Rank in populations for each year from 2010 to 2018.
So there are 57 rows with 61 columns, so the dataset is wider than the length.
This is a good example of untidy data can be cleansed and make it more presentable. The following steps can be performed to cleanse the data.
All the 4 metrics by years are spread out by column-wise can be changed into rows.
Row names/Column name which includes the year as a concatenation can be well formatted to make more readable.
Some of the column names don’t have the right data type like population change, national rank is factorial data type which is unnecessary.
Let’s read population data from GitHub, check descriptives with summary command. Output is limited by customized CSS in Markdown.
USpop<-read.csv("https://raw.githubusercontent.com/sigmasigmaiota/USpopulation/master/Population DataSet.csv")
#Look at summary table.
summary(USpop)## X SUMLEV REGION DIVISION STATE
## Min. : 1 Min. :10.00 0: 1 5 : 9 Min. : 0.00
## 1st Qu.:15 1st Qu.:40.00 1:10 8 : 8 1st Qu.:12.00
## Median :29 Median :40.00 2:13 4 : 7 Median :27.00
## Mean :29 Mean :38.07 3:18 1 : 6 Mean :27.18
## 3rd Qu.:43 3rd Qu.:40.00 4:14 0 : 5 3rd Qu.:41.00
## Max. :57 Max. :40.00 X: 1 3 : 5 Max. :72.00
## (Other):17
## NAME ESTIMATESBASE2010 POPESTIMATE2010
## Alabama : 1 Min. : 563773 Min. : 564483
## Alaska : 1 1st Qu.: 1853001 1st Qu.: 1854214
## Arizona : 1 Median : 4625381 Median : 4635656
## Arkansas : 1 Mean : 16315798 Mean : 16345610
## California: 1 3rd Qu.: 9535736 3rd Qu.: 9574293
## Colorado : 1 Max. :308758105 Max. :309326085
## (Other) :51
## POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013
## Min. : 567224 Min. : 576270 Min. : 582123
## 1st Qu.: 1856074 1st Qu.: 1856764 1st Qu.: 1865414
## Median : 4671422 Median : 4717112 Median : 4764153
## Mean : 16463487 Mean : 16583459 Mean : 16697654
## 3rd Qu.: 9656754 3rd Qu.: 9749123 3rd Qu.: 9843599
## Max. :311580009 Max. :313874218 Max. :316057727
##
## POPESTIMATE2014 POPESTIMATE2015 POPESTIMATE2016
## Min. : 582548 Min. : 585668 Min. : 584290
## 1st Qu.: 1879522 1st Qu.: 1891507 1st Qu.: 1905924
## Median : 4823793 Median : 4853160 Median : 4864745
## Mean : 16819195 Mean : 16942126 Mean : 17063518
## 3rd Qu.: 9930589 3rd Qu.: 9932573 3rd Qu.: 9951890
## Max. :318386421 Max. :320742673 Max. :323071342
##
## POPESTIMATE2017 POPESTIMATE2018 NPOPCHG_2010
## Min. : 578934 Min. : 577737 Min. : -6582
## 1st Qu.: 1917575 1st Qu.: 1929268 1st Qu.: 2570
## Median : 4875120 Median : 4887871 Median : 6457
## Mean : 17171340 Mean : 17275394 Mean : 29812
## 3rd Qu.: 9976447 3rd Qu.: 9995915 3rd Qu.: 18362
## Max. :325147121 Max. :327167434 Max. :567980
##
## NPOPCHG_2011 NPOPCHG_2012 NPOPCHG_2013 NPOPCHG_2014
## Min. : -42793 Min. : -44244 Min. : -41411 Min. : -58203
## 1st Qu.: 8898 1st Qu.: 10012 1st Qu.: 8774 1st Qu.: 7386
## Median : 21288 Median : 16893 Median : 18436 Median : 17240
## Mean : 117877 Mean : 119972 Mean : 114195 Mean : 121542
## 3rd Qu.: 65723 3rd Qu.: 71221 3rd Qu.: 53494 3rd Qu.: 59023
## Max. :2253924 Max. :2294209 Max. :2183509 Max. :2328694
##
## NPOPCHG_2015 NPOPCHG_2016 NPOPCHG_2017 NPOPCHG_2018
## Min. : -61708 Min. : -66671 Min. : -81494 Min. :-129848
## 1st Qu.: 4089 1st Qu.: 3647 1st Qu.: 1365 1st Qu.: 3341
## Median : 14763 Median : 13364 Median : 14027 Median : 17827
## Mean : 122931 Mean : 121392 Mean : 107822 Mean : 104054
## 3rd Qu.: 50831 3rd Qu.: 60116 3rd Qu.: 60505 3rd Qu.: 61216
## Max. :2356252 Max. :2328669 Max. :2075779 Max. :2020313
##
## PPOPCHG_2010 PPOPCHG_2011 PPOPCHG_2012 PPOPCHG_2013
## Min. :-0.12431 Min. :-1.1499 Min. :-1.2027 Min. :-1.1394
## 1st Qu.: 0.09317 1st Qu.: 0.2671 1st Qu.: 0.2745 1st Qu.: 0.2512
## Median : 0.17818 Median : 0.6560 Median : 0.6946 Median : 0.6524
## Mean : 0.17896 Mean : 0.6490 Mean : 0.6857 Mean : 0.6626
## 3rd Qu.: 0.24368 3rd Qu.: 0.8968 3rd Qu.: 1.0440 3rd Qu.: 0.9712
## Max. : 0.55154 Max. : 2.3992 Max. : 2.4408 Max. : 2.9785
##
## PPOPCHG_2014 PPOPCHG_2015 PPOPCHG_2016 PPOPCHG_2017
## Min. :-1.6199 Min. :-1.7457 Min. :-1.9196 Min. :-2.39231
## 1st Qu.: 0.1930 1st Qu.: 0.1282 1st Qu.: 0.1233 1st Qu.: 0.05405
## Median : 0.5311 Median : 0.4792 Median : 0.4030 Median : 0.37844
## Mean : 0.6018 Mean : 0.6056 Mean : 0.5883 Mean : 0.51042
## 3rd Qu.: 0.9699 3rd Qu.: 1.1043 3rd Qu.: 1.0899 3rd Qu.: 1.05089
## Max. : 2.1306 Max. : 2.2566 Max. : 2.0156 Max. : 2.13758
##
## PPOPCHG_2018 NRANK_ESTBASE2010 NRANK_POPEST2010 NRANK_POPEST2011
## Min. :-3.9052 1 : 2 1 : 2 1 : 2
## 1st Qu.: 0.1299 2 : 2 2 : 2 2 : 2
## Median : 0.3979 3 : 2 3 : 2 3 : 2
## Mean : 0.4897 4 : 2 4 : 2 4 : 2
## 3rd Qu.: 0.9723 X : 2 X : 2 X : 2
## Max. : 2.0854 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46
## NRANK_POPEST2012 NRANK_POPEST2013 NRANK_POPEST2014 NRANK_POPEST2015
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_POPEST2016 NRANK_POPEST2017 NRANK_POPEST2018 NRANK_NPCHG2010
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_NPCHG2011 NRANK_NPCHG2012 NRANK_NPCHG2013 NRANK_NPCHG2014
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_NPCHG2015 NRANK_NPCHG2016 NRANK_NPCHG2017 NRANK_NPCHG2018
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_PPCHG2010 NRANK_PPCHG2011 NRANK_PPCHG2012 NRANK_PPCHG2013
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_PPCHG2014 NRANK_PPCHG2015 NRANK_PPCHG2016 NRANK_PPCHG2017
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_PPCHG2018
## 1 : 2
## 2 : 2
## 3 : 2
## 4 : 2
## X : 2
## 10 : 1
## (Other):46
We’ll need to remove the divisions listed at the top of the dataset by creating a subset; let’s remove the first five rows, wherein data at the region and national level lives. The result is displayed using kableExtra.
#subset the data.
USpop.StateTerr<-USpop[6:57,]
require(kableExtra)
kable(head(USpop.StateTerr))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
scroll_box(width = "100%")| X | SUMLEV | REGION | DIVISION | STATE | NAME | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | POPESTIMATE2016 | POPESTIMATE2017 | POPESTIMATE2018 | NPOPCHG_2010 | NPOPCHG_2011 | NPOPCHG_2012 | NPOPCHG_2013 | NPOPCHG_2014 | NPOPCHG_2015 | NPOPCHG_2016 | NPOPCHG_2017 | NPOPCHG_2018 | PPOPCHG_2010 | PPOPCHG_2011 | PPOPCHG_2012 | PPOPCHG_2013 | PPOPCHG_2014 | PPOPCHG_2015 | PPOPCHG_2016 | PPOPCHG_2017 | PPOPCHG_2018 | NRANK_ESTBASE2010 | NRANK_POPEST2010 | NRANK_POPEST2011 | NRANK_POPEST2012 | NRANK_POPEST2013 | NRANK_POPEST2014 | NRANK_POPEST2015 | NRANK_POPEST2016 | NRANK_POPEST2017 | NRANK_POPEST2018 | NRANK_NPCHG2010 | NRANK_NPCHG2011 | NRANK_NPCHG2012 | NRANK_NPCHG2013 | NRANK_NPCHG2014 | NRANK_NPCHG2015 | NRANK_NPCHG2016 | NRANK_NPCHG2017 | NRANK_NPCHG2018 | NRANK_PPCHG2010 | NRANK_PPCHG2011 | NRANK_PPCHG2012 | NRANK_PPCHG2013 | NRANK_PPCHG2014 | NRANK_PPCHG2015 | NRANK_PPCHG2016 | NRANK_PPCHG2017 | NRANK_PPCHG2018 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | 40 | 3 | 6 | 1 | Alabama | 4780138 | 4785448 | 4798834 | 4815564 | 4830460 | 4842481 | 4853160 | 4864745 | 4875120 | 4887871 | 5310 | 13386 | 16730 | 14896 | 12021 | 10679 | 11585 | 10375 | 12751 | 0.1110847 | 0.2797230 | 0.3486264 | 0.3093303 | 0.2488583 | 0.2205275 | 0.2387104 | 0.2132691 | 0.2615525 | 23 | 23 | 23 | 23 | 23 | 23 | 24 | 24 | 24 | 24 | 28 | 32 | 26 | 31 | 32 | 30 | 28 | 31 | 27 | 37 | 38 | 34 | 33 | 35 | 36 | 34 | 34 | 34 |
| 7 | 7 | 40 | 4 | 9 | 2 | Alaska | 710249 | 713906 | 722038 | 730399 | 737045 | 736307 | 737547 | 741504 | 739786 | 737438 | 3657 | 8132 | 8361 | 6646 | -738 | 1240 | 3957 | -1718 | -2348 | 0.5148898 | 1.1390855 | 1.1579723 | 0.9099136 | -0.1001296 | 0.1684080 | 0.5365082 | -0.2316913 | -0.3173891 | 47 | 47 | 47 | 47 | 47 | 48 | 48 | 48 | 48 | 48 | 32 | 39 | 40 | 42 | 47 | 42 | 37 | 44 | 45 | 2 | 8 | 11 | 18 | 48 | 37 | 22 | 46 | 49 |
| 8 | 8 | 40 | 4 | 8 | 4 | Arizona | 6392288 | 6407774 | 6473497 | 6556629 | 6634999 | 6733840 | 6833596 | 6945452 | 7048876 | 7171646 | 15486 | 65723 | 83132 | 78370 | 98841 | 99756 | 111856 | 103424 | 122770 | 0.2422607 | 1.0256760 | 1.2841900 | 1.1952789 | 1.4896913 | 1.4814133 | 1.6368542 | 1.4890896 | 1.7416961 | 16 | 16 | 16 | 15 | 15 | 15 | 14 | 14 | 14 | 14 | 11 | 10 | 7 | 5 | 4 | 7 | 7 | 7 | 4 | 15 | 10 | 7 | 7 | 7 | 9 | 9 | 6 | 4 |
| 9 | 9 | 40 | 3 | 7 | 5 | Arkansas | 2916028 | 2921978 | 2940407 | 2952109 | 2959549 | 2967726 | 2978407 | 2990410 | 3002997 | 3013825 | 5950 | 18429 | 11702 | 7440 | 8177 | 10681 | 12003 | 12587 | 10828 | 0.2040447 | 0.6307029 | 0.3979721 | 0.2520232 | 0.2762921 | 0.3599052 | 0.4030007 | 0.4209122 | 0.3605731 | 32 | 32 | 32 | 32 | 32 | 32 | 33 | 32 | 32 | 33 | 26 | 25 | 35 | 41 | 37 | 29 | 27 | 26 | 30 | 22 | 28 | 31 | 38 | 31 | 30 | 26 | 24 | 28 |
| 10 | 10 | 40 | 4 | 9 | 6 | California | 37254523 | 37320903 | 37641823 | 37960782 | 38280824 | 38625139 | 38953142 | 39209127 | 39399349 | 39557045 | 66380 | 320920 | 318959 | 320042 | 344315 | 328003 | 255985 | 190222 | 157696 | 0.1781797 | 0.8598934 | 0.8473527 | 0.8430859 | 0.8994451 | 0.8491956 | 0.6571614 | 0.4851472 | 0.4002503 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 26 | 17 | 19 | 20 | 16 | 16 | 20 | 23 | 25 |
| 11 | 11 | 40 | 4 | 8 | 8 | Colorado | 5029316 | 5048281 | 5121771 | 5193721 | 5270482 | 5351218 | 5452107 | 5540921 | 5615902 | 5695564 | 18965 | 73490 | 71950 | 76761 | 80736 | 100889 | 88814 | 74981 | 79662 | 0.3770891 | 1.4557431 | 1.4047875 | 1.4779577 | 1.5318523 | 1.8853465 | 1.6289849 | 1.3532227 | 1.4185077 | 22 | 22 | 22 | 22 | 22 | 22 | 22 | 21 | 21 | 21 | 8 | 9 | 9 | 6 | 8 | 6 | 8 | 8 | 8 | 6 | 4 | 5 | 5 | 4 | 4 | 10 | 8 | 7 |
We now have information only on the state/territory level. Let’s transform and create a row for each year. Additionally, let’s omit the first two columns, as SUMLEV is consistent on the state/terr level. The result is displayed once again using kableExtra.
#remove first two columns
USpop.StateTerr[1:2]<-NULL
library(tidyr)
library(dplyr)
#Create row for each statistic.
USpop.YearList<-gather(USpop.StateTerr,"Statistic","Value",5:length(colnames(USpop.StateTerr)))
kable(head(USpop.YearList))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| REGION | DIVISION | STATE | NAME | Statistic | Value |
|---|---|---|---|---|---|
| 3 | 6 | 1 | Alabama | ESTIMATESBASE2010 | 4780138 |
| 4 | 9 | 2 | Alaska | ESTIMATESBASE2010 | 710249 |
| 4 | 8 | 4 | Arizona | ESTIMATESBASE2010 | 6392288 |
| 3 | 7 | 5 | Arkansas | ESTIMATESBASE2010 | 2916028 |
| 4 | 9 | 6 | California | ESTIMATESBASE2010 | 37254523 |
| 4 | 8 | 8 | Colorado | ESTIMATESBASE2010 | 5029316 |
Let’s extract the year from our new “Year” variable and create a “Statistic” variable; we’ll use stringr to extract numeric characters. The new variable is highlighted in light blue.
library(stringr)
USpop.YearList$Year<-as.numeric(str_extract(USpop.YearList$Statistic,"([0-9]+)"))
USpop.YearList$Statistic<-gsub("([0-9])|(_)","",USpop.YearList$Statistic)
#Rename value column to specifically address state/terr level data.
colnames(USpop.YearList)[6]<-"StateValue"
kable(head(USpop.YearList))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 7, bold = T, background = "LightCyan")| REGION | DIVISION | STATE | NAME | Statistic | StateValue | Year |
|---|---|---|---|---|---|---|
| 3 | 6 | 1 | Alabama | ESTIMATESBASE | 4780138 | 2010 |
| 4 | 9 | 2 | Alaska | ESTIMATESBASE | 710249 | 2010 |
| 4 | 8 | 4 | Arizona | ESTIMATESBASE | 6392288 | 2010 |
| 3 | 7 | 5 | Arkansas | ESTIMATESBASE | 2916028 | 2010 |
| 4 | 9 | 6 | California | ESTIMATESBASE | 37254523 | 2010 |
| 4 | 8 | 8 | Colorado | ESTIMATESBASE | 5029316 | 2010 |
Let’s do the same for division information. Using the gather command, division data is added to our table; the change is marked in light blue.
USdiv.YearList<-gather(USpop[2:5,],"Statistic","Value",7:length(colnames(USpop)))
#Remove first two columns.
USdiv.YearList[1:2]<-NULL
#We also have no need for state or division information at Region level.
USdiv.YearList[2:3]<-NULL
#Rename NAME column.
colnames(USdiv.YearList)[2]<-"RegionName"
#Rename Value column in anticipation of merge with state data.
colnames(USdiv.YearList)[4]<-"RegionValue"
#Clean year and statistic.
USdiv.YearList$Year<-as.numeric(str_extract(USdiv.YearList$Statistic,"([0-9]+)"))
USdiv.YearList$Statistic<-gsub("([0-9])|(_)","",USdiv.YearList$Statistic)
#Shorten RegionName.
USdiv.YearList$RegionName<-gsub("( Region)","",USdiv.YearList$RegionName)
kable(head(USdiv.YearList))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 2, bold = T, background = "LightCyan")| REGION | RegionName | Statistic | RegionValue | Year |
|---|---|---|---|---|
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 |
| 2 | Midwest | ESTIMATESBASE | 66929743 | 2010 |
| 3 | South | ESTIMATESBASE | 114563045 | 2010 |
| 4 | West | ESTIMATESBASE | 71946887 | 2010 |
| 1 | Northeast | POPESTIMATE | 55380645 | 2010 |
| 2 | Midwest | POPESTIMATE | 66974749 | 2010 |
Nice. Let’s merge by REGION number, Year, and Statistic. We must use full_join in order to keep Puerto Rico, which is omitted from Region categorization.
USpop2<-full_join(USdiv.YearList,USpop.YearList,by=c("REGION","Statistic","Year"))
kable(head(USpop2))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| REGION | RegionName | Statistic | RegionValue | Year | DIVISION | STATE | NAME | StateValue |
|---|---|---|---|---|---|---|---|---|
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 1 | 9 | Connecticut | 3574147 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 1 | 23 | Maine | 1328369 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 1 | 25 | Massachusetts | 6547790 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 1 | 33 | New Hampshire | 1316464 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 2 | 34 | New Jersey | 8791962 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 2 | 36 | New York | 19378124 |
We need to pair this with a list of divisions, which differs from Region. A list exists on wikipedia; I’ve edited from https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States.
DIVISION<-as.factor(c(1,2,3,4,5,6,7,8,9))
DivisionName<-c("New England","Mid-Atlantic","East North Central","West North Central","South Atlantic","East South Central","West South Central","Mountain","Pacific")
Divisions<-data.frame(DIVISION,DivisionName)
Merge Divisions with master dataset and check.
USpop3<-full_join(Divisions,USpop2,by=c("DIVISION"))
kable(head(USpop3))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 2, bold = T, background = "LightCyan")| DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue |
|---|---|---|---|---|---|---|---|---|---|
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 9 | Connecticut | 3574147 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 23 | Maine | 1328369 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 25 | Massachusetts | 6547790 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 33 | New Hampshire | 1316464 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 44 | Rhode Island | 1052957 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 50 | Vermont | 625744 |
Let’s look at Puerto Rico in particular.
PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]
kable(head(PR))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))| DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2857 | X | NA | X | NA | ESTIMATESBASE | NA | 2010 | 72 | Puerto Rico | 3726157 |
| 2858 | X | NA | X | NA | POPESTIMATE | NA | 2010 | 72 | Puerto Rico | 3721525 |
| 2859 | X | NA | X | NA | POPESTIMATE | NA | 2011 | 72 | Puerto Rico | 3678732 |
| 2860 | X | NA | X | NA | POPESTIMATE | NA | 2012 | 72 | Puerto Rico | 3634488 |
| 2861 | X | NA | X | NA | POPESTIMATE | NA | 2013 | 72 | Puerto Rico | 3593077 |
| 2862 | X | NA | X | NA | POPESTIMATE | NA | 2014 | 72 | Puerto Rico | 3534874 |
Puerto Rico is not ranked among the other states and has no division assignment. This is problematic for anyone completing comparative analysis on a national level. Since puerto rico has been assigned no region, there is no regional data available. Let’s recode NA as Puerto Rico in RegionName.
USpop3$RegionName[which(USpop3$NAME=="Puerto Rico")]<-"Puerto Rico"
#look at Puerto Rico again, replace subset.
PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]
kable(head(PR))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 4, bold = T, background = "LightCyan")| DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2857 | X | NA | X | Puerto Rico | ESTIMATESBASE | NA | 2010 | 72 | Puerto Rico | 3726157 |
| 2858 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2010 | 72 | Puerto Rico | 3721525 |
| 2859 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2011 | 72 | Puerto Rico | 3678732 |
| 2860 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2012 | 72 | Puerto Rico | 3634488 |
| 2861 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2013 | 72 | Puerto Rico | 3593077 |
| 2862 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2014 | 72 | Puerto Rico | 3534874 |
Let’s add a column for National statistics.
#National data only.
USpop.Nat<-USpop[1,3:ncol(USpop)]
#gather, as before
USpop.Nat<-gather(USpop.Nat,"Statistic","Value",5:length(colnames(USpop.Nat)))
#Clean year and statistic.
USpop.Nat$Year<-as.numeric(str_extract(USpop.Nat$Statistic,"([0-9]+)"))
USpop.Nat$Statistic<-gsub("([0-9])|(_)","",USpop.Nat$Statistic)
#omit columns and unnecessary rows.
USpop.Nat<-USpop.Nat[,5:7]
USpop.Nat<-USpop.Nat[which(USpop.Nat$Value!="X"),]
#rename coluns in preparation for merge.
colnames(USpop.Nat)[2]<-"NationalValue"
#merge into master data.
USpop4<-full_join(USpop.Nat,USpop3,by=c("Statistic","Year"))
kable(head(USpop4))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 2, bold = T, background = "LightCyan")| Statistic | NationalValue | Year | DIVISION | DivisionName | REGION | RegionName | RegionValue | STATE | NAME | StateValue |
|---|---|---|---|---|---|---|---|---|---|---|
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 9 | Connecticut | 3574147 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 23 | Maine | 1328369 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 25 | Massachusetts | 6547790 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 33 | New Hampshire | 1316464 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 44 | Rhode Island | 1052957 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 50 | Vermont | 625744 |
Let’s sort the data by type of statistic. I’ll use the subsets in the code below to render maps.
This concludes the tidying portion of the project; the original cleaning objective in Arun Reddy’s post has been accomplished. As there were no outlines provided for analysis, let’s identify
1. States with the greatest population growth in 2018.
2. States with the greatest decrease in population in 2018.
3. States with the greatest population perentage by nation and by region.
USpopESTIMATESBASE<-USpop4[which(USpop4$Statistic=="ESTIMATESBASE"),]
USpopESTIMATE<-USpop4[which(USpop4$Statistic=="POPESTIMATE"),]
USpopPOPCHG<-USpop4[which(USpop4$Statistic=="NPOPCHG"),]
USpopPPOPCHG<-USpop4[which(USpop4$Statistic=="PPOPCHG"),]ggplot2 to render, ggthemes and viridis to customize, various tools in the tidyverse package, as well as fiftystater, which needs to be installed via github. The packages mapdata and mapproj supply map data for puerto rico, which is excluded from the fiftystater package.# install.packages("devtools") devtools::install_github("wmurphyrd/fiftystater")
Let’s map population increases in 2018.
USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue<0)]<-0
require(ggplot2)
require(fiftystater)
require(ggthemes)
require(tidyverse)
require(viridis)
USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)
data("fifty_states")
library(mapdata)
library(mapproj)
#Puerto Rico color must be set manually; there has been a decrease in population, so we've matched the color representing zero growth.
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0)
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="lemonchiffon1")+
coord_fixed(1.0)+
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())
Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) +
geom_map(aes(fill=StateValue), map=fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())+
scale_fill_viridis(breaks=c(-100000,-25000,-5000,0,5000,25000,100000,300000),
labels=c('-100K--25K','-25K--5K','-25K-0','0-5000','5k-25k','25K-100K','100K-300K','300K+'),begin=1,end=.25,option="magma")+
guides(fill=guide_legend(title="Growth by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
ggtitle("Population Increases by State, 2018")
library(grid)
library(grDevices)
png(file="Project2a.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1)
print(prmap,vp=v4)
dev.off()## png
## 2
knitr::include_graphics("Project2a.png")The greatest estimated population increase in 2018, by far, occurred in Texas and Florida.
Let’s map population losses.
USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue>0)]<-0
USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)
data("fifty_states")
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0)
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="khaki3")+
coord_fixed(1.0)+
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())
Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) +
geom_map(aes(fill=StateValue), map=fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())+
scale_fill_viridis(breaks=c(-300000,-100000,-25000,-5000,0),
labels=c('300K+','100K-300K','25K-100K','5K-25K','0-25K'),begin=1,end=0,option="cividis")+
guides(fill=guide_legend(title="Decrease pop by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
ggtitle("Population Decreases by State, 2018")
png(file="Project2b.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1)
print(prmap,vp=v4)
dev.off()## png
## 2
knitr::include_graphics("Project2b.png") It’s easy to see that West Virginia and Louisiana is estimated to have experienced a decrease in population in 2018, as well as Illinois and New York. Puerto Rico is estimated to have experienced the greatest decrease in population.
Let’s calculate percentage of national and division population resides in each state or territory using projected population data for 2018. Let’s map percentage of national population by state.
USpopESTIMATE2018<-USpop4[which(USpopESTIMATE$Year==2018),]
USpopESTIMATE2018$StateValue<-as.numeric(USpopESTIMATE2018$StateValue)
USpopESTIMATE2018$RegionValue<-as.numeric(USpopESTIMATE2018$RegionValue)
USpopESTIMATE2018$NationalValue<-as.numeric(USpopESTIMATE2018$NationalValue)
USpopESTIMATE2018$NatPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$NationalValue
USpopESTIMATE2018$RegPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$RegionValue
USpopESTIMATE2018$statefull<-tolower(USpopESTIMATE2018$NAME)
data("fifty_states")
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0)
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey49")+
coord_fixed(1.0)+
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())
Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) +
geom_map(aes(fill=NatPer), map=fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())+
scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
guides(fill=guide_legend(title="Nat Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
ggtitle("National Population Percentage by State, 2018")
png(file="Project2c.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1)
print(prmap,vp=v4)
dev.off()## png
## 2
knitr::include_graphics("Project2c.png")It’s apparent that the states with greatest populations are California, Texas, Florida and New York. With New York’s estimated population decreasing, it’s curious that it still accounts for one of the greatest percentages of national population.
Let’s map state percentage by region.
data("fifty_states")
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0)
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey98")+
coord_fixed(1.0)+
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())
Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) +
geom_map(aes(fill=RegPer), map=fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())+
scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
guides(fill=guide_legend(title="Region Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
ggtitle("Regional Population Percentage by State, 2018")
png(file="Project2d.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1)
print(prmap,vp=v4)
dev.off()## png
## 2
knitr::include_graphics("Project2d.png") Omitting Puerto Rico due to the fact that there is no region or division assignment, we see that California, as expected, dominates the West Region, while New York dominates the Northeast to a lesser degree. Texas is most populous in the South, and Illinois carries the Midwest by only a slight margin.
Such data could be used to form an argument for representation in Congress. Should a state such as California, with 12.1% of the national population, be represented by only 2 senators?