For this “tidying” we will utilize Christopher Ayre’s example dataset of Heating and Cooling Absorption.
Discussion board can be found here: https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&course_id=_1705328_1&nav=discussion_board&conf_id=_1845527_1&forum_id=_1908779_1&message_id=_31283025_1
Christopher provided the data in a .csv file, which I’ve uploaded to github:
heating <- read.csv("https://raw.githubusercontent.com/murphystout/data-607/master/heating_cooling.csv")
head(heating)
## color minute.0 minute.10 minute.20 minute.30 minute.40 minute.50
## 1 white 78 81 83 88 93 96
## 2 red 78 82 90 93 98 106
## 3 pink 78 82 84 90 96 99
## 4 black 78 88 92 98 108 116
## 5 green 78 81 85 91 95 102
## 6 white 98 96 93 80 78 78
## minute.60 phase
## 1 98 heating
## 2 109 heating
## 3 102 heating
## 4 121 heating
## 5 105 heating
## 6 78 cooling
Christopher adeptly pointed out several issues with the data set that make it “untidy”. These are:
The variable for “time elapsed” does not have its own column; instead there is a column for each ten-minute interval. These should be collapsed into a single “time” column.
Each color should have its own column. We will treat one “observation” as the set of temperature readings across all the colors at a given time, so color is a variable of that observation and the readings for one time belong on a single row.
Multiple observational units are observed in the same table. In particular the “heating” and “cooling” data are in one table.
After we have tidied it up, we will do some exploratory analysis and visualizations on the data.
For this step we will gather multiple timestamp columns into a single column.
As an intermediate step, let’s first rename the columns so that they take on numeric values; this will help with plotting later on.
library(tidyr)
colnames(heating) <- c("color",0,10,20,30,40,50,60,"phase")
heating <- gather(heating, time, temperature, "0":"60")
head(heating, 20)
## color phase time temperature
## 1 white heating 0 78
## 2 red heating 0 78
## 3 pink heating 0 78
## 4 black heating 0 78
## 5 green heating 0 78
## 6 white cooling 0 98
## 7 red cooling 0 109
## 8 pink cooling 0 102
## 9 black cooling 0 121
## 10 green cooling 0 105
## 11 white heating 10 81
## 12 red heating 10 82
## 13 pink heating 10 82
## 14 black heating 10 88
## 15 green heating 10 81
## 16 white cooling 10 96
## 17 red cooling 10 106
## 18 pink cooling 10 96
## 19 black cooling 10 108
## 20 green cooling 10 94
Finally, let’s make sure the time is stored as numeric values:
heating$time <- as.numeric(heating$time)
Now that we’ve gathered up the temperature columns, let’s spread out the colors so that there is one temperature reading for each color + timestamp combination.
heating <- spread(heating, color, temperature)
heating
## phase time black green pink red white
## 1 cooling 0 121 105 102 109 98
## 2 cooling 10 108 94 96 106 96
## 3 cooling 20 98 90 90 95 93
## 4 cooling 30 90 82 83 87 80
## 5 cooling 40 84 80 80 82 78
## 6 cooling 50 79 78 78 80 78
## 7 cooling 60 78 78 78 78 78
## 8 heating 0 78 78 78 78 78
## 9 heating 10 88 81 82 82 81
## 10 heating 20 92 85 84 90 83
## 11 heating 30 98 91 90 93 88
## 12 heating 40 108 95 96 98 93
## 13 heating 50 116 102 99 106 96
## 14 heating 60 121 105 102 109 98
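As an aside, newer versions of tidyr (1.0 and later) supersede gather() and spread() with pivot_longer() and pivot_wider(). A rough sketch of the same reshaping with those functions, assuming heating_wide is a stand-in for the renamed wide data frame we had before the gather step:
# pivot_longer() replaces gather(); names_transform converts the time names to numeric directly
heating_long <- pivot_longer(heating_wide, cols = `0`:`60`, names_to = "time",
                             values_to = "temperature",
                             names_transform = list(time = as.numeric))
# pivot_wider() replaces spread(), giving one column per color
heating_tidy <- pivot_wider(heating_long, names_from = color, values_from = temperature)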
The remaining issue is that the heating and cooling observations share one table. Thankfully, fixing this is as simple as subsetting the data based on the “phase” column; I saved it for the last step to save us from having to perform the tidying operations twice.
cooling <- subset(heating, phase == "cooling")
heating <- subset(heating, phase == "heating")
head(cooling)
## phase time black green pink red white
## 1 cooling 0 121 105 102 109 98
## 2 cooling 10 108 94 96 106 96
## 3 cooling 20 98 90 90 95 93
## 4 cooling 30 90 82 83 87 80
## 5 cooling 40 84 80 80 82 78
## 6 cooling 50 79 78 78 80 78
head(heating)
## phase time black green pink red white
## 8 heating 0 78 78 78 78 78
## 9 heating 10 88 81 82 82 81
## 10 heating 20 92 85 84 90 83
## 11 heating 30 98 91 90 93 88
## 12 heating 40 108 95 96 98 93
## 13 heating 50 116 102 99 106 96
Let’s plot these data in a line graph to get a visual representation of how the colors responded to heating and cooling:
plot(x = heating$time, y = heating$black, type = "l", col = "black", xlab = "Time Elapsed (Minutes)", ylab = "Temp (Fahrenheit)", main = "Heating/Color Absorption")
lines(x = heating$time, y = heating$red, col = "red")
lines(x = heating$time, y = heating$green, col = "green")
lines(x = heating$time, y = heating$pink, col = "pink")
lines(x = heating$time, y = heating$white, col = "grey")
plot(x = cooling$time, y = cooling$black, type = "l", col = "black", xlab = "Time Elapsed (Minutes)", ylab = "Temp (Fahrenheit)", main = "Cooling/Color Absorption")
lines(x = cooling$time, y = cooling$red, col = "red")
lines(x = cooling$time, y = cooling$green, col = "green")
lines(x = cooling$time, y = cooling$pink, col = "pink")
lines(x = cooling$time, y = cooling$white, col = "grey")
The graphs look neat, and we can see that black is the fastest heat absorber.
The graphs also look symmetrical, but now that we see it in this form, it might make sense to view the cooling and heating data in one graph.
However, the data require a bit more finagling to get this right:
1: Minute “60” of the heating data is equivalent to minute “0” of the cooling data, so one of the two duplicated readings should be dropped.
2: Minutes elapsed in the cooling data need to be increased by 60 in order to create one continuous time series.
Let’s do it!
# Remove minute 0 of the cooling dataset:
heat_cool <- cooling[-1,]
# Add 60 to the time column.
heat_cool$time <- as.numeric(heat_cool$time) + 60
# Stack this underneath the heating data
heat_cool <- rbind(heating, heat_cool)
heat_cool
## phase time black green pink red white
## 8 heating 0 78 78 78 78 78
## 9 heating 10 88 81 82 82 81
## 10 heating 20 92 85 84 90 83
## 11 heating 30 98 91 90 93 88
## 12 heating 40 108 95 96 98 93
## 13 heating 50 116 102 99 106 96
## 14 heating 60 121 105 102 109 98
## 2 cooling 70 108 94 96 106 96
## 3 cooling 80 98 90 90 95 93
## 4 cooling 90 90 82 83 87 80
## 5 cooling 100 84 80 80 82 78
## 6 cooling 110 79 78 78 80 78
## 7 cooling 120 78 78 78 78 78
Now we have a nice, neat and tidy dataset showing heating and cooling times. Let’s revisit those graphs we generated previously:
plot(x = heat_cool$time, y = heat_cool$black, type = "l", col = "black", xlab = "Time Elapsed (Minutes)", ylab = "Temp (Fahrenheit)", main = "Heating and Cooling/Color Absorption")
lines(x = heat_cool$time, y = heat_cool$red, col = "red")
lines(x = heat_cool$time, y = heat_cool$green, col = "green")
lines(x = heat_cool$time, y = heat_cool$pink, col = "pink")
lines(x = heat_cool$time, y = heat_cool$white, col = "grey")
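As an alternative sketch, the same combined series can be drawn with ggplot2 by returning the heat_cool data frame to long format first (assuming ggplot2 is installed):
library(ggplot2)
# One row per color/time reading, then one line per color
heat_cool_long <- gather(heat_cool, color, temperature, black:white)
ggplot(heat_cool_long, aes(x = time, y = temperature, colour = color)) +
  geom_line() +
  scale_colour_manual(values = c(black = "black", green = "green", pink = "pink", red = "red", white = "grey")) +
  labs(x = "Time Elapsed (Minutes)", y = "Temp (Fahrenheit)", title = "Heating and Cooling/Color Absorption")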
Let’s get a bit more quantitative. Let’s calculate the rates of heating and cooling for each of the colors:
heating_rate <- (heating[7,3:7] - heating[1,3:7])/60
heating_rate
## black green pink red white
## 14 0.7166667 0.45 0.4 0.5166667 0.3333333
cooling_rate <- (cooling[7,3:7] - cooling[1,3:7])/60
cooling_rate
## black green pink red white
## 7 -0.7166667 -0.45 -0.4 -0.5166667 -0.3333333
Since the starting and ending temperatures were equivalent, we see the overall heating and cooling rates to be symmetrical to one another.
According to this test, a color’s heating rate also dictates its cooling rate (or heat retention), at least on average over the full 120 minutes.
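A quick numeric check of that symmetry, using the heating_rate and cooling_rate rows computed above; the sums should come out to zero for every color:
# Element-wise sum of the two one-row data frames; zeros indicate symmetric rates
heating_rate + cooling_rate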
Looking at these visually:
heating_rate <- gather(heating_rate, color, temp)
barplot(heating_rate$temp, col = heating_rate$color, names.arg = heating_rate$color, main = 'Heating Rates (by Color)', xlab = "Color", ylab = "Rate (Degrees per minute)")
However, these are averages over the full hour. What does the temperature change look like within each 10-minute interval?
We can find this programmatically:
black_ht <- diff(heating$black)/10
green_ht <- diff(heating$green)/10
pink_ht <- diff(heating$pink)/10
red_ht <- diff(heating$red)/10
white_ht <- diff(heating$white)/10
black_cl <- diff(cooling$black)/10
green_cl <- diff(cooling$green)/10
pink_cl <- diff(cooling$pink)/10
red_cl <- diff(cooling$red)/10
white_cl <- diff(cooling$white)/10
ht_rates <- data.frame(black_ht, black_cl, green_ht, green_cl, pink_ht, pink_cl, red_ht, red_cl, white_ht, white_cl)
ht_rates
## black_ht black_cl green_ht green_cl pink_ht pink_cl red_ht red_cl
## 1 1.0 -1.3 0.3 -1.1 0.4 -0.6 0.4 -0.3
## 2 0.4 -1.0 0.4 -0.4 0.2 -0.6 0.8 -1.1
## 3 0.6 -0.8 0.6 -0.8 0.6 -0.7 0.3 -0.8
## 4 1.0 -0.6 0.4 -0.2 0.6 -0.3 0.5 -0.5
## 5 0.8 -0.5 0.7 -0.2 0.3 -0.2 0.8 -0.2
## 6 0.5 -0.1 0.3 0.0 0.3 0.0 0.3 -0.2
## white_ht white_cl
## 1 0.3 -0.2
## 2 0.2 -0.3
## 3 0.5 -1.3
## 4 0.5 -0.2
## 5 0.3 0.0
## 6 0.2 0.0
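The interval rates above were computed one column at a time; a more compact sketch, assuming columns 3 through 7 of the heating and cooling data frames hold the color temperatures, differences every color at once with sapply:
# Per-interval rates in degrees per minute, one column per color
heat_rates_mat <- sapply(heating[, 3:7], function(temps) diff(temps) / 10)
cool_rates_mat <- sapply(cooling[, 3:7], function(temps) diff(temps) / 10)
heat_rates_mat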
Let’s take a look at these visually:
plot(x = seq(10, 60, 10), y = black_ht, type = "l", col = "black", ylim = c(-1.5,1.5), main = "Heating and Cooling Rates", sub = "Positive Values are Heating Rates, Negative are Cooling Rates", xlab = "Time Elapsed (Minutes)", ylab = "Heating and Cooling Rates")
lines(seq(10, 60, 10),y = black_cl, col = "black")
lines(seq(10, 60, 10),y = green_ht, col = "green")
lines(seq(10, 60, 10),y = green_cl, col = "green")
lines(seq(10, 60, 10),y = pink_ht, col = "pink")
lines(seq(10, 60, 10),y = pink_cl, col = "pink")
lines(seq(10, 60, 10),y = red_ht, col = "red")
lines(seq(10, 60, 10),y = red_cl, col = "red")
lines(seq(10, 60, 10),y = white_ht, col = "grey")
lines(seq(10, 60, 10),y = white_cl, col = "grey")
This chart shows both heating and cooling rates. The heating rates are positive (top of chart), while the cooling rates are negative (bottom of chart).
Matching like colors can show you how that color behaved in its heating and cooling phase.
Since the overall averages were symmetrical (i.e., over the full 120-minute span), we might expect each smaller interval to be symmetrical too.
However, that doesn’t always appear to be the case in these data: note that the green line is often twice the magnitude of its counterpart.
We can also see that all colors tend to converge to low values at the end of both periods. Perhaps this speaks to a type of heating saturation paired with a similar flatline of cooling.
Some initial conclusions from our exploratory data analysis:
Black has the fastest heating rate, at ~0.72 degrees per minute. White has the slowest heating rate, at ~0.33 degrees per minute. This was probably suspected based on known heuristics, and the data seem to confirm it.
Heating and cooling rates were symmetrical over the 120-minute span. However, they don’t appear symmetric over the smaller 10-minute intervals; there is considerable variation in rates across those intervals.
Some questions it raised:
There seems to be wide variance in the temperature changes within the 10-minute intervals. Is this typical? Do temperature changes slow, or otherwise vary, depending on when they occur in the time series?
Heat absorption may well not be a linear process; perhaps a more detailed model is needed to really understand the dynamics of these rates.
In order to render the maps in this code, the fiftystater package must be installed via devtools. The following code chunk installs fiftystater.
require(devtools)
devtools::install_github("wmurphyrd/fiftystater")
I found a census dataset that includes Puerto Rico population for 2010 to 2018. The dataset gives estimates of the total resident population and the resident population age 18 years and older for the United States, the states, and Puerto Rico.
For each year from 2010 to 2018, it provides metrics such as the population estimate, the numeric and percentage change in population, and the national rank.
There are 57 rows and 61 columns, so the dataset is much wider than it is long.
This is a good example of how untidy data can be cleaned and made more presentable. The following steps will be performed to cleanse the data.
The four metrics, currently spread across one column per year, will be gathered into rows.
Column names that concatenate the metric with the year will be reformatted to make them more readable.
Some columns have the wrong data type; for example, population change and national rank are read in as factors, which is unnecessary.
Let’s read the population data from GitHub and check descriptives with the summary command. Output display is limited by customized CSS in the Markdown document.
USpop<-read.csv("https://raw.githubusercontent.com/sigmasigmaiota/USpopulation/master/Population DataSet.csv")
#Look at summary table.
summary(USpop)
## X SUMLEV REGION DIVISION STATE
## Min. : 1 Min. :10.00 0: 1 5 : 9 Min. : 0.00
## 1st Qu.:15 1st Qu.:40.00 1:10 8 : 8 1st Qu.:12.00
## Median :29 Median :40.00 2:13 4 : 7 Median :27.00
## Mean :29 Mean :38.07 3:18 1 : 6 Mean :27.18
## 3rd Qu.:43 3rd Qu.:40.00 4:14 0 : 5 3rd Qu.:41.00
## Max. :57 Max. :40.00 X: 1 3 : 5 Max. :72.00
## (Other):17
## NAME ESTIMATESBASE2010 POPESTIMATE2010
## Alabama : 1 Min. : 563773 Min. : 564483
## Alaska : 1 1st Qu.: 1853001 1st Qu.: 1854214
## Arizona : 1 Median : 4625381 Median : 4635656
## Arkansas : 1 Mean : 16315798 Mean : 16345610
## California: 1 3rd Qu.: 9535736 3rd Qu.: 9574293
## Colorado : 1 Max. :308758105 Max. :309326085
## (Other) :51
## POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013
## Min. : 567224 Min. : 576270 Min. : 582123
## 1st Qu.: 1856074 1st Qu.: 1856764 1st Qu.: 1865414
## Median : 4671422 Median : 4717112 Median : 4764153
## Mean : 16463487 Mean : 16583459 Mean : 16697654
## 3rd Qu.: 9656754 3rd Qu.: 9749123 3rd Qu.: 9843599
## Max. :311580009 Max. :313874218 Max. :316057727
##
## POPESTIMATE2014 POPESTIMATE2015 POPESTIMATE2016
## Min. : 582548 Min. : 585668 Min. : 584290
## 1st Qu.: 1879522 1st Qu.: 1891507 1st Qu.: 1905924
## Median : 4823793 Median : 4853160 Median : 4864745
## Mean : 16819195 Mean : 16942126 Mean : 17063518
## 3rd Qu.: 9930589 3rd Qu.: 9932573 3rd Qu.: 9951890
## Max. :318386421 Max. :320742673 Max. :323071342
##
## POPESTIMATE2017 POPESTIMATE2018 NPOPCHG_2010
## Min. : 578934 Min. : 577737 Min. : -6582
## 1st Qu.: 1917575 1st Qu.: 1929268 1st Qu.: 2570
## Median : 4875120 Median : 4887871 Median : 6457
## Mean : 17171340 Mean : 17275394 Mean : 29812
## 3rd Qu.: 9976447 3rd Qu.: 9995915 3rd Qu.: 18362
## Max. :325147121 Max. :327167434 Max. :567980
##
## NPOPCHG_2011 NPOPCHG_2012 NPOPCHG_2013 NPOPCHG_2014
## Min. : -42793 Min. : -44244 Min. : -41411 Min. : -58203
## 1st Qu.: 8898 1st Qu.: 10012 1st Qu.: 8774 1st Qu.: 7386
## Median : 21288 Median : 16893 Median : 18436 Median : 17240
## Mean : 117877 Mean : 119972 Mean : 114195 Mean : 121542
## 3rd Qu.: 65723 3rd Qu.: 71221 3rd Qu.: 53494 3rd Qu.: 59023
## Max. :2253924 Max. :2294209 Max. :2183509 Max. :2328694
##
## NPOPCHG_2015 NPOPCHG_2016 NPOPCHG_2017 NPOPCHG_2018
## Min. : -61708 Min. : -66671 Min. : -81494 Min. :-129848
## 1st Qu.: 4089 1st Qu.: 3647 1st Qu.: 1365 1st Qu.: 3341
## Median : 14763 Median : 13364 Median : 14027 Median : 17827
## Mean : 122931 Mean : 121392 Mean : 107822 Mean : 104054
## 3rd Qu.: 50831 3rd Qu.: 60116 3rd Qu.: 60505 3rd Qu.: 61216
## Max. :2356252 Max. :2328669 Max. :2075779 Max. :2020313
##
## PPOPCHG_2010 PPOPCHG_2011 PPOPCHG_2012 PPOPCHG_2013
## Min. :-0.12431 Min. :-1.1499 Min. :-1.2027 Min. :-1.1394
## 1st Qu.: 0.09317 1st Qu.: 0.2671 1st Qu.: 0.2745 1st Qu.: 0.2512
## Median : 0.17818 Median : 0.6560 Median : 0.6946 Median : 0.6524
## Mean : 0.17896 Mean : 0.6490 Mean : 0.6857 Mean : 0.6626
## 3rd Qu.: 0.24368 3rd Qu.: 0.8968 3rd Qu.: 1.0440 3rd Qu.: 0.9712
## Max. : 0.55154 Max. : 2.3992 Max. : 2.4408 Max. : 2.9785
##
## PPOPCHG_2014 PPOPCHG_2015 PPOPCHG_2016 PPOPCHG_2017
## Min. :-1.6199 Min. :-1.7457 Min. :-1.9196 Min. :-2.39231
## 1st Qu.: 0.1930 1st Qu.: 0.1282 1st Qu.: 0.1233 1st Qu.: 0.05405
## Median : 0.5311 Median : 0.4792 Median : 0.4030 Median : 0.37844
## Mean : 0.6018 Mean : 0.6056 Mean : 0.5883 Mean : 0.51042
## 3rd Qu.: 0.9699 3rd Qu.: 1.1043 3rd Qu.: 1.0899 3rd Qu.: 1.05089
## Max. : 2.1306 Max. : 2.2566 Max. : 2.0156 Max. : 2.13758
##
## PPOPCHG_2018 NRANK_ESTBASE2010 NRANK_POPEST2010 NRANK_POPEST2011
## Min. :-3.9052 1 : 2 1 : 2 1 : 2
## 1st Qu.: 0.1299 2 : 2 2 : 2 2 : 2
## Median : 0.3979 3 : 2 3 : 2 3 : 2
## Mean : 0.4897 4 : 2 4 : 2 4 : 2
## 3rd Qu.: 0.9723 X : 2 X : 2 X : 2
## Max. : 2.0854 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46
## NRANK_POPEST2012 NRANK_POPEST2013 NRANK_POPEST2014 NRANK_POPEST2015
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_POPEST2016 NRANK_POPEST2017 NRANK_POPEST2018 NRANK_NPCHG2010
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_NPCHG2011 NRANK_NPCHG2012 NRANK_NPCHG2013 NRANK_NPCHG2014
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_NPCHG2015 NRANK_NPCHG2016 NRANK_NPCHG2017 NRANK_NPCHG2018
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_PPCHG2010 NRANK_PPCHG2011 NRANK_PPCHG2012 NRANK_PPCHG2013
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_PPCHG2014 NRANK_PPCHG2015 NRANK_PPCHG2016 NRANK_PPCHG2017
## 1 : 2 1 : 2 1 : 2 1 : 2
## 2 : 2 2 : 2 2 : 2 2 : 2
## 3 : 2 3 : 2 3 : 2 3 : 2
## 4 : 2 4 : 2 4 : 2 4 : 2
## X : 2 X : 2 X : 2 X : 2
## 10 : 1 10 : 1 10 : 1 10 : 1
## (Other):46 (Other):46 (Other):46 (Other):46
## NRANK_PPCHG2018
## 1 : 2
## 2 : 2
## 3 : 2
## 4 : 2
## X : 2
## 10 : 1
## (Other):46
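Many columns (REGION, the rank variables, and so on) were read in as factors, which is one of the data-type issues noted earlier. One way to sidestep this, sketched below against the same file, is to disable factor conversion at read time (USpop_chr is a hypothetical name; readr::read_csv would behave similarly):
# Re-read with character columns instead of factors (read.csv defaults to factors before R 4.0)
USpop_chr <- read.csv("https://raw.githubusercontent.com/sigmasigmaiota/USpopulation/master/Population DataSet.csv",
                      stringsAsFactors = FALSE)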
We’ll need to remove the summary rows listed at the top of the dataset by creating a subset; let’s remove the first five rows, wherein data at the national and region level lives. The result is displayed using kableExtra.
#subset the data.
USpop.StateTerr<-USpop[6:57,]
require(kableExtra)
kable(head(USpop.StateTerr))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
scroll_box(width = "100%")
| | X | SUMLEV | REGION | DIVISION | STATE | NAME | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | POPESTIMATE2016 | POPESTIMATE2017 | POPESTIMATE2018 | NPOPCHG_2010 | NPOPCHG_2011 | NPOPCHG_2012 | NPOPCHG_2013 | NPOPCHG_2014 | NPOPCHG_2015 | NPOPCHG_2016 | NPOPCHG_2017 | NPOPCHG_2018 | PPOPCHG_2010 | PPOPCHG_2011 | PPOPCHG_2012 | PPOPCHG_2013 | PPOPCHG_2014 | PPOPCHG_2015 | PPOPCHG_2016 | PPOPCHG_2017 | PPOPCHG_2018 | NRANK_ESTBASE2010 | NRANK_POPEST2010 | NRANK_POPEST2011 | NRANK_POPEST2012 | NRANK_POPEST2013 | NRANK_POPEST2014 | NRANK_POPEST2015 | NRANK_POPEST2016 | NRANK_POPEST2017 | NRANK_POPEST2018 | NRANK_NPCHG2010 | NRANK_NPCHG2011 | NRANK_NPCHG2012 | NRANK_NPCHG2013 | NRANK_NPCHG2014 | NRANK_NPCHG2015 | NRANK_NPCHG2016 | NRANK_NPCHG2017 | NRANK_NPCHG2018 | NRANK_PPCHG2010 | NRANK_PPCHG2011 | NRANK_PPCHG2012 | NRANK_PPCHG2013 | NRANK_PPCHG2014 | NRANK_PPCHG2015 | NRANK_PPCHG2016 | NRANK_PPCHG2017 | NRANK_PPCHG2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 6 | 40 | 3 | 6 | 1 | Alabama | 4780138 | 4785448 | 4798834 | 4815564 | 4830460 | 4842481 | 4853160 | 4864745 | 4875120 | 4887871 | 5310 | 13386 | 16730 | 14896 | 12021 | 10679 | 11585 | 10375 | 12751 | 0.1110847 | 0.2797230 | 0.3486264 | 0.3093303 | 0.2488583 | 0.2205275 | 0.2387104 | 0.2132691 | 0.2615525 | 23 | 23 | 23 | 23 | 23 | 23 | 24 | 24 | 24 | 24 | 28 | 32 | 26 | 31 | 32 | 30 | 28 | 31 | 27 | 37 | 38 | 34 | 33 | 35 | 36 | 34 | 34 | 34 |
| 7 | 7 | 40 | 4 | 9 | 2 | Alaska | 710249 | 713906 | 722038 | 730399 | 737045 | 736307 | 737547 | 741504 | 739786 | 737438 | 3657 | 8132 | 8361 | 6646 | -738 | 1240 | 3957 | -1718 | -2348 | 0.5148898 | 1.1390855 | 1.1579723 | 0.9099136 | -0.1001296 | 0.1684080 | 0.5365082 | -0.2316913 | -0.3173891 | 47 | 47 | 47 | 47 | 47 | 48 | 48 | 48 | 48 | 48 | 32 | 39 | 40 | 42 | 47 | 42 | 37 | 44 | 45 | 2 | 8 | 11 | 18 | 48 | 37 | 22 | 46 | 49 |
| 8 | 8 | 40 | 4 | 8 | 4 | Arizona | 6392288 | 6407774 | 6473497 | 6556629 | 6634999 | 6733840 | 6833596 | 6945452 | 7048876 | 7171646 | 15486 | 65723 | 83132 | 78370 | 98841 | 99756 | 111856 | 103424 | 122770 | 0.2422607 | 1.0256760 | 1.2841900 | 1.1952789 | 1.4896913 | 1.4814133 | 1.6368542 | 1.4890896 | 1.7416961 | 16 | 16 | 16 | 15 | 15 | 15 | 14 | 14 | 14 | 14 | 11 | 10 | 7 | 5 | 4 | 7 | 7 | 7 | 4 | 15 | 10 | 7 | 7 | 7 | 9 | 9 | 6 | 4 |
| 9 | 9 | 40 | 3 | 7 | 5 | Arkansas | 2916028 | 2921978 | 2940407 | 2952109 | 2959549 | 2967726 | 2978407 | 2990410 | 3002997 | 3013825 | 5950 | 18429 | 11702 | 7440 | 8177 | 10681 | 12003 | 12587 | 10828 | 0.2040447 | 0.6307029 | 0.3979721 | 0.2520232 | 0.2762921 | 0.3599052 | 0.4030007 | 0.4209122 | 0.3605731 | 32 | 32 | 32 | 32 | 32 | 32 | 33 | 32 | 32 | 33 | 26 | 25 | 35 | 41 | 37 | 29 | 27 | 26 | 30 | 22 | 28 | 31 | 38 | 31 | 30 | 26 | 24 | 28 |
| 10 | 10 | 40 | 4 | 9 | 6 | California | 37254523 | 37320903 | 37641823 | 37960782 | 38280824 | 38625139 | 38953142 | 39209127 | 39399349 | 39557045 | 66380 | 320920 | 318959 | 320042 | 344315 | 328003 | 255985 | 190222 | 157696 | 0.1781797 | 0.8598934 | 0.8473527 | 0.8430859 | 0.8994451 | 0.8491956 | 0.6571614 | 0.4851472 | 0.4002503 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 26 | 17 | 19 | 20 | 16 | 16 | 20 | 23 | 25 |
| 11 | 11 | 40 | 4 | 8 | 8 | Colorado | 5029316 | 5048281 | 5121771 | 5193721 | 5270482 | 5351218 | 5452107 | 5540921 | 5615902 | 5695564 | 18965 | 73490 | 71950 | 76761 | 80736 | 100889 | 88814 | 74981 | 79662 | 0.3770891 | 1.4557431 | 1.4047875 | 1.4779577 | 1.5318523 | 1.8853465 | 1.6289849 | 1.3532227 | 1.4185077 | 22 | 22 | 22 | 22 | 22 | 22 | 22 | 21 | 21 | 21 | 8 | 9 | 9 | 6 | 8 | 6 | 8 | 8 | 8 | 6 | 4 | 5 | 5 | 4 | 4 | 10 | 8 | 7 |
We now have information only at the state/territory level. Let’s transform the data to create a row for each statistic and year. Additionally, let’s omit the first two columns (X and SUMLEV), as SUMLEV is constant at the state/territory level. The result is displayed once again using kableExtra.
#remove first two columns
USpop.StateTerr[1:2]<-NULL
library(tidyr)
library(dplyr)
#Create row for each statistic.
USpop.YearList<-gather(USpop.StateTerr,"Statistic","Value",5:length(colnames(USpop.StateTerr)))
kable(head(USpop.YearList))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| REGION | DIVISION | STATE | NAME | Statistic | Value |
|---|---|---|---|---|---|
| 3 | 6 | 1 | Alabama | ESTIMATESBASE2010 | 4780138 |
| 4 | 9 | 2 | Alaska | ESTIMATESBASE2010 | 710249 |
| 4 | 8 | 4 | Arizona | ESTIMATESBASE2010 | 6392288 |
| 3 | 7 | 5 | Arkansas | ESTIMATESBASE2010 | 2916028 |
| 4 | 9 | 6 | California | ESTIMATESBASE2010 | 37254523 |
| 4 | 8 | 8 | Colorado | ESTIMATESBASE2010 | 5029316 |
Let’s extract the year from the “Statistic” variable into a new “Year” variable; we’ll use stringr to pull out the numeric characters and then strip the year from “Statistic”. The new variable is highlighted in light blue.
library(stringr)
USpop.YearList$Year<-as.numeric(str_extract(USpop.YearList$Statistic,"([0-9]+)"))
USpop.YearList$Statistic<-gsub("([0-9])|(_)","",USpop.YearList$Statistic)
#Rename value column to specifically address state/terr level data.
colnames(USpop.YearList)[6]<-"StateValue"
kable(head(USpop.YearList))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 7, bold = T, background = "LightCyan")
| REGION | DIVISION | STATE | NAME | Statistic | StateValue | Year |
|---|---|---|---|---|---|---|
| 3 | 6 | 1 | Alabama | ESTIMATESBASE | 4780138 | 2010 |
| 4 | 9 | 2 | Alaska | ESTIMATESBASE | 710249 | 2010 |
| 4 | 8 | 4 | Arizona | ESTIMATESBASE | 6392288 | 2010 |
| 3 | 7 | 5 | Arkansas | ESTIMATESBASE | 2916028 | 2010 |
| 4 | 9 | 6 | California | ESTIMATESBASE | 37254523 | 2010 |
| 4 | 8 | 8 | Colorado | ESTIMATESBASE | 5029316 | 2010 |
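As an aside, in tidyr 1.0 and later the gather/str_extract/gsub sequence above can be collapsed into a single pivot_longer() call that splits the metric name and the year in one step. A sketch, assuming USpop.StateTerr as prepared above (note the metric names keep their internal underscore here):
# names_pattern splits e.g. "NPOPCHG_2010" into Statistic = "NPOPCHG" and Year = "2010";
# values_transform keeps the mixed numeric/rank columns as character
USpop.YearList2 <- pivot_longer(USpop.StateTerr, cols = 5:ncol(USpop.StateTerr),
                                names_to = c("Statistic", "Year"),
                                names_pattern = "(.*?)_?([0-9]{4})",
                                names_transform = list(Year = as.numeric),
                                values_to = "StateValue",
                                values_transform = list(StateValue = as.character))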
Let’s do the same for the region-level rows. Using the gather command, the region data is reshaped into its own table, to be merged in shortly; the change is marked in light blue.
USdiv.YearList<-gather(USpop[2:5,],"Statistic","Value",7:length(colnames(USpop)))
#Remove first two columns.
USdiv.YearList[1:2]<-NULL
#We also have no need for state or division information at Region level.
USdiv.YearList[2:3]<-NULL
#Rename NAME column.
colnames(USdiv.YearList)[2]<-"RegionName"
#Rename Value column in anticipation of merge with state data.
colnames(USdiv.YearList)[4]<-"RegionValue"
#Clean year and statistic.
USdiv.YearList$Year<-as.numeric(str_extract(USdiv.YearList$Statistic,"([0-9]+)"))
USdiv.YearList$Statistic<-gsub("([0-9])|(_)","",USdiv.YearList$Statistic)
#Shorten RegionName.
USdiv.YearList$RegionName<-gsub("( Region)","",USdiv.YearList$RegionName)
kable(head(USdiv.YearList))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 2, bold = T, background = "LightCyan")
| REGION | RegionName | Statistic | RegionValue | Year |
|---|---|---|---|---|
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 |
| 2 | Midwest | ESTIMATESBASE | 66929743 | 2010 |
| 3 | South | ESTIMATESBASE | 114563045 | 2010 |
| 4 | West | ESTIMATESBASE | 71946887 | 2010 |
| 1 | Northeast | POPESTIMATE | 55380645 | 2010 |
| 2 | Midwest | POPESTIMATE | 66974749 | 2010 |
Nice. Let’s merge by REGION number, Year, and Statistic. We must use full_join in order to keep Puerto Rico, which is omitted from Region categorization.
USpop2<-full_join(USdiv.YearList,USpop.YearList,by=c("REGION","Statistic","Year"))
kable(head(USpop2))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| REGION | RegionName | Statistic | RegionValue | Year | DIVISION | STATE | NAME | StateValue |
|---|---|---|---|---|---|---|---|---|
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 1 | 9 | Connecticut | 3574147 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 1 | 23 | Maine | 1328369 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 1 | 25 | Massachusetts | 6547790 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 1 | 33 | New Hampshire | 1316464 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 2 | 34 | New Jersey | 8791962 |
| 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 2 | 36 | New York | 19378124 |
We need to pair this with a list of division names, which differ from the regions. A list exists on Wikipedia; I’ve adapted it from https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States.
DIVISION<-as.factor(c(1,2,3,4,5,6,7,8,9))
DivisionName<-c("New England","Mid-Atlantic","East North Central","West North Central","South Atlantic","East South Central","West South Central","Mountain","Pacific")
Divisions<-data.frame(DIVISION,DivisionName)
Let’s merge Divisions with the master dataset and check the result.
USpop3<-full_join(Divisions,USpop2,by=c("DIVISION"))
kable(head(USpop3))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 2, bold = T, background = "LightCyan")
| DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue |
|---|---|---|---|---|---|---|---|---|---|
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 9 | Connecticut | 3574147 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 23 | Maine | 1328369 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 25 | Massachusetts | 6547790 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 33 | New Hampshire | 1316464 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 44 | Rhode Island | 1052957 |
| 1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 50 | Vermont | 625744 |
Let’s look at Puerto Rico in particular.
PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]
kable(head(PR))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue |
|---|---|---|---|---|---|---|---|---|---|---|
| 2857 | X | NA | X | NA | ESTIMATESBASE | NA | 2010 | 72 | Puerto Rico | 3726157 |
| 2858 | X | NA | X | NA | POPESTIMATE | NA | 2010 | 72 | Puerto Rico | 3721525 |
| 2859 | X | NA | X | NA | POPESTIMATE | NA | 2011 | 72 | Puerto Rico | 3678732 |
| 2860 | X | NA | X | NA | POPESTIMATE | NA | 2012 | 72 | Puerto Rico | 3634488 |
| 2861 | X | NA | X | NA | POPESTIMATE | NA | 2013 | 72 | Puerto Rico | 3593077 |
| 2862 | X | NA | X | NA | POPESTIMATE | NA | 2014 | 72 | Puerto Rico | 3534874 |
Puerto Rico is not ranked among the other states and has no division assignment. This is problematic for anyone completing comparative analysis on a national level. Since Puerto Rico has been assigned no region, there is no regional data available. Let’s recode the NA as “Puerto Rico” in RegionName.
USpop3$RegionName[which(USpop3$NAME=="Puerto Rico")]<-"Puerto Rico"
#look at Puerto Rico again, replace subset.
PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]
kable(head(PR))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 5, bold = T, background = "LightCyan")
| | DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue |
|---|---|---|---|---|---|---|---|---|---|---|
| 2857 | X | NA | X | Puerto Rico | ESTIMATESBASE | NA | 2010 | 72 | Puerto Rico | 3726157 |
| 2858 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2010 | 72 | Puerto Rico | 3721525 |
| 2859 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2011 | 72 | Puerto Rico | 3678732 |
| 2860 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2012 | 72 | Puerto Rico | 3634488 |
| 2861 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2013 | 72 | Puerto Rico | 3593077 |
| 2862 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2014 | 72 | Puerto Rico | 3534874 |
Let’s add a column for National statistics.
#National data only.
USpop.Nat<-USpop[1,3:ncol(USpop)]
#gather, as before
USpop.Nat<-gather(USpop.Nat,"Statistic","Value",5:length(colnames(USpop.Nat)))
#Clean year and statistic.
USpop.Nat$Year<-as.numeric(str_extract(USpop.Nat$Statistic,"([0-9]+)"))
USpop.Nat$Statistic<-gsub("([0-9])|(_)","",USpop.Nat$Statistic)
#omit columns and unnecessary rows.
USpop.Nat<-USpop.Nat[,5:7]
USpop.Nat<-USpop.Nat[which(USpop.Nat$Value!="X"),]
#Rename columns in preparation for merge.
colnames(USpop.Nat)[2]<-"NationalValue"
#merge into master data.
USpop4<-full_join(USpop.Nat,USpop3,by=c("Statistic","Year"))
kable(head(USpop4))%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
column_spec(column = 2, bold = T, background = "LightCyan")
| Statistic | NationalValue | Year | DIVISION | DivisionName | REGION | RegionName | RegionValue | STATE | NAME | StateValue |
|---|---|---|---|---|---|---|---|---|---|---|
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 9 | Connecticut | 3574147 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 23 | Maine | 1328369 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 25 | Massachusetts | 6547790 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 33 | New Hampshire | 1316464 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 44 | Rhode Island | 1052957 |
| ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 50 | Vermont | 625744 |
Let’s subset the data by type of statistic. I’ll use these subsets in the code below to render maps.
This concludes the tidying portion of the project; the original cleaning objective in Arun Reddy’s post has been accomplished. As no outline was provided for analysis, let’s identify:
1. States with the greatest population growth in 2018.
2. States with the greatest decrease in population in 2018.
3. States with the greatest population percentage by nation and by region.
USpopESTIMATESBASE<-USpop4[which(USpop4$Statistic=="ESTIMATESBASE"),]
USpopESTIMATE<-USpop4[which(USpop4$Statistic=="POPESTIMATE"),]
USpopPOPCHG<-USpop4[which(USpop4$Statistic=="NPOPCHG"),]
USpopPPOPCHG<-USpop4[which(USpop4$Statistic=="PPOPCHG"),]
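A compact alternative sketch: split() builds all of these subsets at once as a named list keyed by statistic.
# Named list with one data frame per statistic (ESTIMATESBASE, NPOPCHG, POPESTIMATE, PPOPCHG, ranks)
USpop.byStat <- split(USpop4, USpop4$Statistic)
names(USpop.byStat)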
The maps that follow use ggplot2 to render, ggthemes and viridis to customize, various tools from the tidyverse, and fiftystater, which must be installed from GitHub (see the devtools chunk at the top of this section). The packages mapdata and mapproj supply map data for Puerto Rico, which is excluded from the fiftystater package.
Let’s map population increases in 2018.
USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue<0)]<-0
require(ggplot2)
require(fiftystater)
require(ggthemes)
require(tidyverse)
require(viridis)
USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)
data("fifty_states")
library(mapdata)
library(mapproj)
#Puerto Rico color must be set manually; there has been a decrease in population, so we've matched the color representing zero growth.
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0)
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="lemonchiffon1")+
coord_fixed(1.0)+
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())
Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) +
geom_map(aes(fill=StateValue), map=fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())+
scale_fill_viridis(breaks=c(-100000,-25000,-5000,0,5000,25000,100000,300000),
labels=c('-100K--25K','-25K--5K','-25K-0','0-5000','5k-25k','25K-100K','100K-300K','300K+'),begin=1,end=.25,option="magma")+
guides(fill=guide_legend(title="Growth by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
ggtitle("Population Increases by State, 2018")
library(grid)
library(grDevices)
png(file="Project2a.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1)
print(prmap,vp=v4)
dev.off()
## png
## 2
knitr::include_graphics("Project2a.png")
The greatest estimated population increase in 2018, by far, occurred in Texas and Florida.
Let’s map population losses.
USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue>0)]<-0
USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)
data("fifty_states")
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0)
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="khaki3")+
coord_fixed(1.0)+
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())
Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) +
geom_map(aes(fill=StateValue), map=fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())+
scale_fill_viridis(breaks=c(-300000,-100000,-25000,-5000,0),
labels=c('300K+','100K-300K','25K-100K','5K-25K','0-25K'),begin=1,end=0,option="cividis")+
guides(fill=guide_legend(title="Decrease pop by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
ggtitle("Population Decreases by State, 2018")
png(file="Project2b.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1)
print(prmap,vp=v4)
dev.off()
## png
## 2
knitr::include_graphics("Project2b.png")
It’s easy to see that West Virginia and Louisiana are estimated to have experienced a decrease in population in 2018, as are Illinois and New York. Puerto Rico is estimated to have experienced the greatest decrease in population.
Let’s calculate the percentage of the national and regional population that resides in each state or territory, using the 2018 population estimates. First, let’s map the percentage of the national population by state.
USpopESTIMATE2018<-USpopESTIMATE[which(USpopESTIMATE$Year==2018),]
USpopESTIMATE2018$StateValue<-as.numeric(USpopESTIMATE2018$StateValue)
USpopESTIMATE2018$RegionValue<-as.numeric(USpopESTIMATE2018$RegionValue)
USpopESTIMATE2018$NationalValue<-as.numeric(USpopESTIMATE2018$NationalValue)
USpopESTIMATE2018$NatPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$NationalValue
USpopESTIMATE2018$RegPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$RegionValue
USpopESTIMATE2018$statefull<-tolower(USpopESTIMATE2018$NAME)
data("fifty_states")
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0)
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey49")+
coord_fixed(1.0)+
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())
Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) +
geom_map(aes(fill=NatPer), map=fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())+
scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
guides(fill=guide_legend(title="Nat Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
ggtitle("National Population Percentage by State, 2018")
png(file="Project2c.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1)
print(prmap,vp=v4)
dev.off()
## png
## 2
knitr::include_graphics("Project2c.png")
It’s apparent that the states with greatest populations are California, Texas, Florida and New York. With New York’s estimated population decreasing, it’s curious that it still accounts for one of the greatest percentages of national population.
Let’s map state percentage by region.
data("fifty_states")
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0)
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey98")+
coord_fixed(1.0)+
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())
Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) +
geom_map(aes(fill=RegPer), map=fifty_states) +
expand_limits(x = fifty_states$long, y = fifty_states$lat) +
coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
labs(x = "", y = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "transparent",colour = NA),
plot.background = element_rect(fill = "transparent",colour = NA),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
rect = element_blank())+
scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
guides(fill=guide_legend(title="Region Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
ggtitle("Regional Population Percentage by State, 2018")
png(file="Project2d.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map)
print(Total_plot,vp=v1)
print(prmap,vp=v4)
dev.off()
## png
## 2
knitr::include_graphics("Project2d.png")
If we omit Puerto Rico, which has no region or division assignment, we see that California, as expected, dominates the West region, while New York dominates the Northeast to a lesser degree. Texas is the most populous state in the South, and Illinois carries the Midwest by only a slight margin.
Such data could be used to form an argument for representation in Congress. Should a state such as California, with 12.1% of the national population, be represented by only 2 senators? On a local level, these data could be combined with demographic data to understand why some areas are experiencing negative growth; Puerto Rico, for example, was decimated by a recent hurricane.
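As a quick check of that 12.1% figure, using the USpopESTIMATE2018 subset built above:
# California's share of the 2018 national population estimate (39,557,045 / 327,167,434, roughly 0.121)
subset(USpopESTIMATE2018, NAME == "California")$NatPer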