Restaurant Menu - Austin Chan

This first dataset is a menu from a Chinese restaurant that has been transcribed into an Excel spreadsheet. The menu comes from Alexander Ng’s post in the week 5 discussion.

link: https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&course_id=_1705328_1&nav=discussion_board&conf_id=_1845527_1&forum_id=_1908779_1&message_id=_31289549_1

The menu has been organized into three different sections:

  • Wok Specialties
  • Combinations
  • Party Feast

Each section of the menu has its own distinct table structure and pricing model depending on the time of day and which combination of dishes a person orders.

In Alexander’s post, he asked the following analysis questions:

  • What is the average price of a dish at the restaurant?
  • How many dishes contain chicken?
  • How many dishes can be sold from this menu?

In order to answer these questions, the data needs to be tidied first. The following section will go through the basic tidying before answering the analysis questions.

Tidying the Menu

The first step in organizing this data is to combine all of the separate tables into one large table. The final table will contain these variables (sketched as an empty data frame after the list):

  • The name of the dishes
  • Dish descriptions
  • The time of day the dishes are served
  • The price of the dish
  • Which menu section the dish is a part of
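
For reference, here is that target structure sketched as an empty data frame. The MenuTemplate name is only for illustration and is not used in the tidying below, but the column names are the ones all three tidied sections will share.

#Sketch of the final table structure (illustration only).
MenuTemplate = data.frame(
  Dish        = character(),
  Description = character(),
  Price       = numeric(),
  ServedWhen  = character(),
  MenuSection = character(),
  stringsAsFactors = FALSE
)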

Load the spreadsheet and necessary packages

Since the menu was written as three separate sections, I transcribed each section into its own file and loaded the files individually. Since some of the item descriptions contain commas, I loaded the files as tab-delimited text files instead to prevent parsing issues.

install.packages("kableExtra", repos ="http://cran.rstudio.org")
## package 'kableExtra' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\mike\AppData\Local\Temp\RtmpwjmTQJ\downloaded_packages
library(kableExtra)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)


WokSpecialties = as_tibble(read.csv("https://raw.githubusercontent.com/austinchan1/Data-607---Data-Acquisition-and-Management/master/Project%202/WokSpecialties.txt", header = T, sep = "\t", stringsAsFactors = F, na.strings = ""))
Combo = as_tibble(read.csv("https://raw.githubusercontent.com/austinchan1/Data-607---Data-Acquisition-and-Management/master/Project%202/Combos.txt", header = T, sep = "\t", stringsAsFactors = F, na.strings = ""))
PartyFeast = as_tibble(read.csv("https://raw.githubusercontent.com/austinchan1/Data-607---Data-Acquisition-and-Management/master/Project%202/PartyFeast.txt", header = T, sep = "\t", stringsAsFactors = F, na.strings = ""))


#read.csv("https://raw.githubusercontent.com/murphystout/data-607/master/heating_cooling.csv")

Look at the data

Taking a look at the data, it is clear that the different sections do not follow a standardized format. Not every dish has a description and the Party Feast section has some menu items with cut off names that need to be removed. Also, there are some typos in the menu prices that need to be fixed.

WokSpecialties
## # A tibble: 12 x 3
##    Dish              Description                                    Price  
##    <chr>             <chr>                                          <chr>  
##  1 Orange Chicken o~ <NA>                                           "$11.9~
##  2 Happy Family      Shrimp, scallop, roast pork, beef, chicken, v~ "$12.9~
##  3 Steak Kew         Cubed prime filet with bamboo shoots, snow pe~ "$14.9~
##  4 Ginger Scallion ~ Steamed fish filet in a light soy sauce        "$12.9~
##  5 Roast Duck        Cantonese Style Half crispy roase duck with h~ "$13.9~
##  6 Crispy Shrimp     Large tail-on shrimp on a bet of field greens~ "$13.9~
##  7 Szechwan Fish     Marinated fresh tender pieces of fish tossed ~ "$13.9~
##  8 Shanghai Noon     Tender boneless chicken cubes, lobster, shrim~ "$14.9~
##  9 Walnut Chicken    <NA>                                           "$11.9~
## 10 Shrimp w. Black ~ <NA>                                           "$13.9~
## 11 Imperial Pork Ch~ Tender pieces of crispy pork chops tossed in ~ "$12.9~
## 12 Neptune's Delight Large shrimp, sea scallops, red snapper, dry ~ "$13.9~
Combo
## # A tibble: 25 x 5
##    ComboID Dish                         Description LunchPrice DinnerPrice
##    <chr>   <chr>                        <chr>       <chr>      <chr>      
##  1 C01     Chicken Chow Mein            <NA>        "$5.50 "   "$825 "    
##  2 C02     Crispy Chicken Wings         <NA>        "$5.50 "   "$8.25 "   
##  3 C03     Pork or Veggie Egg Foo Young <NA>        "$5.75 "   "$8.25 "   
##  4 C04     Moo Goo Gai Pan              <NA>        "$5.95 "   "$8.95 "   
##  5 C05     Chicken or Pork Lo Mein      <NA>        "$5.95 "   "$8.95 "   
##  6 C06     Honey Garlic Chicken         <NA>        "$5.95 "   "$8.95 "   
##  7 C07     Sweet & Sour Chicken or Pork <NA>        "$5.95 "   "$8.95 "   
##  8 C08     Chicken Teriyaki Stir Fry    <NA>        "$6.25 "   "$9.25 "   
##  9 C09     Beef with Mixed Vegetables   <NA>        "$6.25 "   "$9.25 "   
## 10 C10     Hunan Chicken or Beef        <NA>        "$6.25 "   "$9.25 "   
## # ... with 15 more rows
PartyFeast
## # A tibble: 12 x 1
##    Entrees                 
##    <chr>                   
##  1 Honey Garlic Chicken    
##  2 Sweet & Sour Chicken    
##  3 Pepper Steak            
##  4 Orange or Sesame Chicken
##  5 Bourbon Chicken         
##  6 Chicken or Pork Lo Mein 
##  7 General Tsao's Chicken  
##  8 Chicken w               
##  9 Beef with B             
## 10 Chicken w               
## 11 Buddha's D              
## 12 Shrimp Lo

Tidying the Wok Specialties section

The wok specialties table is already pretty organized, but it is missing a few columns. In order to fit the final table structure, the prices need to be numeric, the NA values for the dish descriptions need to be filled out, the time of day needs to be added, and the menu section needs to be specified.

Changing the prices from characters to numeric and removing the dollar sign are necessary for further analysis of the prices. The code below finds all dollar signs in the price column, replaces them with nothing, and then converts the vector to a numeric data type. The result then replaces the current price column.

WokSpecialties$Price = as.numeric(gsub("\\$", "", WokSpecialties$Price))

Adding the time of day and menu section is fairly straightforward. The time of day for all the wok specialties is “All Day” because the menu items have the same price for lunch and dinner. The menu section for all items is “Wok Specialties”.

WokSpecialties = WokSpecialties %>%
  mutate(ServedWhen = "All Day", MenuSection = "Wok Specialties")

WokSpecialties
## # A tibble: 12 x 5
##    Dish         Description                   Price ServedWhen MenuSection 
##    <chr>        <chr>                         <dbl> <chr>      <chr>       
##  1 Orange Chic~ <NA>                           12.0 All Day    Wok Special~
##  2 Happy Family Shrimp, scallop, roast pork,~  13.0 All Day    Wok Special~
##  3 Steak Kew    Cubed prime filet with bambo~  15.0 All Day    Wok Special~
##  4 Ginger Scal~ Steamed fish filet in a ligh~  13.0 All Day    Wok Special~
##  5 Roast Duck   Cantonese Style Half crispy ~  14.0 All Day    Wok Special~
##  6 Crispy Shri~ Large tail-on shrimp on a be~  14.0 All Day    Wok Special~
##  7 Szechwan Fi~ Marinated fresh tender piece~  14.0 All Day    Wok Special~
##  8 Shanghai No~ Tender boneless chicken cube~  15.0 All Day    Wok Special~
##  9 Walnut Chic~ <NA>                           12.0 All Day    Wok Special~
## 10 Shrimp w. B~ <NA>                           14.0 All Day    Wok Special~
## 11 Imperial Po~ Tender pieces of crispy pork~  13.0 All Day    Wok Special~
## 12 Neptune's D~ Large shrimp, sea scallops, ~  14.0 All Day    Wok Special~

The next step is slightly trickier. The NA values in the descriptions need to be filled out. This step involves replacing the NA value with the dish name. The following code filters the specialties by dishes with missing descriptions and then changes the description column to match the dish name. The table is then appended onto the Wok Specialties table, replacing the rows with missing descriptions.

WokSpecialtiesNAFilled = WokSpecialties %>%
  filter(is.na(Description) == TRUE) %>%
  mutate(Description = Dish)

WokSpecialtiesFinal = bind_rows(filter(WokSpecialties,is.na(Description) == FALSE),WokSpecialtiesNAFilled)

WokSpecialtiesFinal
## # A tibble: 12 x 5
##    Dish         Description                   Price ServedWhen MenuSection 
##    <chr>        <chr>                         <dbl> <chr>      <chr>       
##  1 Happy Family Shrimp, scallop, roast pork,~  13.0 All Day    Wok Special~
##  2 Steak Kew    Cubed prime filet with bambo~  15.0 All Day    Wok Special~
##  3 Ginger Scal~ Steamed fish filet in a ligh~  13.0 All Day    Wok Special~
##  4 Roast Duck   Cantonese Style Half crispy ~  14.0 All Day    Wok Special~
##  5 Crispy Shri~ Large tail-on shrimp on a be~  14.0 All Day    Wok Special~
##  6 Szechwan Fi~ Marinated fresh tender piece~  14.0 All Day    Wok Special~
##  7 Shanghai No~ Tender boneless chicken cube~  15.0 All Day    Wok Special~
##  8 Imperial Po~ Tender pieces of crispy pork~  13.0 All Day    Wok Special~
##  9 Neptune's D~ Large shrimp, sea scallops, ~  14.0 All Day    Wok Special~
## 10 Orange Chic~ Orange Chicken or Beef         12.0 All Day    Wok Special~
## 11 Walnut Chic~ Walnut Chicken                 12.0 All Day    Wok Special~
## 12 Shrimp w. B~ Shrimp w. Black Bean Sauce     14.0 All Day    Wok Special~

Tidying the Combinations section

The combinations section has the same issues as the wok specialties section, but also has the added complication of different prices depending on the time of day. Perhaps the easiest way to fix this combinations table is to separate it into a lunch combination menu and a dinner combination menu after filling out the description column.

The code below reformats the prices in the same way as the wok specialties menu.

Combo$LunchPrice = as.numeric(gsub("\\$", "", Combo$LunchPrice))
Combo$DinnerPrice = as.numeric(gsub("\\$", "", Combo$DinnerPrice))

The code below fills out the description column the same way as the wok specialties menu.

ComboNAFilled = Combo %>%
  filter(is.na(Description) == TRUE) %>%
  mutate(Description = Dish)

ComboWithDescription = bind_rows(filter(Combo,is.na(Description) == FALSE),ComboNAFilled)

ComboWithDescription
## # A tibble: 25 x 5
##    ComboID Dish                Description           LunchPrice DinnerPrice
##    <chr>   <chr>               <chr>                      <dbl>       <dbl>
##  1 C14     Crazy Combo         Chicken, Roast pork,~       6.5         9.5 
##  2 C01     Chicken Chow Mein   Chicken Chow Mein           5.5       825   
##  3 C02     Crispy Chicken Win~ Crispy Chicken Wings        5.5         8.25
##  4 C03     Pork or Veggie Egg~ Pork or Veggie Egg F~       5.75        8.25
##  5 C04     Moo Goo Gai Pan     Moo Goo Gai Pan             5.95        8.95
##  6 C05     Chicken or Pork Lo~ Chicken or Pork Lo M~       5.95        8.95
##  7 C06     Honey Garlic Chick~ Honey Garlic Chicken        5.95        8.95
##  8 C07     Sweet & Sour Chick~ Sweet & Sour Chicken~       5.95        8.95
##  9 C08     Chicken Teriyaki S~ Chicken Teriyaki Sti~       6.25        9.25
## 10 C09     Beef with Mixed Ve~ Beef with Mixed Vege~       6.25        9.25
## # ... with 15 more rows

The next step is to separate the menu into a lunch menu and dinner menu and then add the ServedWhen and MenuSection columns. From there, the lunch and dinner combos can be combined into one table.

The code below selects the relevant columns from the ComboWithDescription table into separate lunch and dinner tables. The next chunk of code appends the ServedWhen and MenuSection columns to the tables and specifies whether the ServedWhen column corresponds to lunch or dinner. The last bit of code combines the two tables into one large table and displays it.

LunchCombos = select(ComboWithDescription,Dish,Description,Price = LunchPrice)
DinnerCombos = select(ComboWithDescription,Dish,Description,Price = DinnerPrice)

LunchCombos = LunchCombos %>%
  mutate(ServedWhen = "Lunch", MenuSection = "Combination")

DinnerCombos = DinnerCombos %>%
  mutate(ServedWhen = "Dinner", MenuSection = "Combination")


AllCombos = arrange(bind_rows(LunchCombos,DinnerCombos),desc(Price))

AllCombos
## # A tibble: 50 x 5
##    Dish                Description             Price ServedWhen MenuSection
##    <chr>               <chr>                   <dbl> <chr>      <chr>      
##  1 Chicken Chow Mein   Chicken Chow Mein      825    Dinner     Combination
##  2 Happy Family        Happy Family           695    Lunch      Combination
##  3 Barbecue Spare Ribs Barbecue Spare Ribs      9.95 Dinner     Combination
##  4 Szechwan Shrimp or~ Szechwan Shrimp or Sc~   9.95 Dinner     Combination
##  5 Shrimp w. Lobster ~ Shrimp w. Lobster Sau~   9.95 Dinner     Combination
##  6 Happy Family        Happy Family             9.95 Dinner     Combination
##  7 Crazy Combo         Chicken, Roast pork, ~   9.5  Dinner     Combination
##  8 Chicken Teriyaki S~ Chicken Teriyaki Stir~   9.25 Dinner     Combination
##  9 Beef with Mixed Ve~ Beef with Mixed Veget~   9.25 Dinner     Combination
## 10 Hunan Chicken or B~ Hunan Chicken or Beef    9.25 Dinner     Combination
## # ... with 40 more rows
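
As an aside, the same split-and-recombine reshaping could also be done in a single gather() call from tidyr. The sketch below is illustration only: the hypothetical ComboLong object is not used in the rest of this write-up, and the price typo fix described next would still need to be applied to it.

#Aside (sketch only): gather the two price columns into one Price column keyed by ServedWhen.
ComboLong = ComboWithDescription %>%
  gather(ServedWhen, Price, LunchPrice, DinnerPrice) %>%
  mutate(ServedWhen = ifelse(ServedWhen == "LunchPrice", "Lunch", "Dinner"),
         MenuSection = "Combination") %>%
  select(Dish, Description, Price, ServedWhen, MenuSection)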

The last step is to fix the price typos. For a couple of dishes, the prices are missing their decimal point. It is unlikely that a plate of chicken chow mein would cost $825, so any extreme values need to be changed to include their decimals.

The code below selects dishes with prices over $100 and divides the price by 100 to get the decimal in the right place. The next chunk replaces the incorrect rows with the correct rows and displays the final combinations table.

AllCombosPriceFix = AllCombos %>%
  filter(Price > 100) %>%
  mutate(Price = Price/100)

AllCombosFinal = arrange(bind_rows(filter(AllCombos,Price <= 100),AllCombosPriceFix),desc(Price))

AllCombosFinal
## # A tibble: 50 x 5
##    Dish                Description             Price ServedWhen MenuSection
##    <chr>               <chr>                   <dbl> <chr>      <chr>      
##  1 Barbecue Spare Ribs Barbecue Spare Ribs      9.95 Dinner     Combination
##  2 Szechwan Shrimp or~ Szechwan Shrimp or Sca~  9.95 Dinner     Combination
##  3 Shrimp w. Lobster ~ Shrimp w. Lobster Sauce  9.95 Dinner     Combination
##  4 Happy Family        Happy Family             9.95 Dinner     Combination
##  5 Crazy Combo         Chicken, Roast pork, a~  9.5  Dinner     Combination
##  6 Chicken Teriyaki S~ Chicken Teriyaki Stir ~  9.25 Dinner     Combination
##  7 Beef with Mixed Ve~ Beef with Mixed Vegeta~  9.25 Dinner     Combination
##  8 Hunan Chicken or B~ Hunan Chicken or Beef    9.25 Dinner     Combination
##  9 Garlic Beef or Chi~ Garlic Beef or Chicken   9.25 Dinner     Combination
## 10 Bourbon Chicken     Bourbon Chicken          9.25 Dinner     Combination
## # ... with 40 more rows

Tidying the Party Feast section

The party feast section is pretty barren. It is only a list of dishes and nothing else. Some of the dishes have their names cut off. Those rows need to be removed before adding the additional columns onto the table.

Since only the first seven dishes on the menu have their full names, I subsetted the data to only include the first seven dishes. Afterwards, I added the additional columns to match the structure of the previous tables.

Since none of the dishes had descriptions, I matched the description to the dish name.

The pricing on the Party Feast is a little strange. The feast consists of any three dishes in the party feast menu for a total price of $25.95. Therefore, each dish individually would cost $8.65.

The party feast is available all day, so every dish is marked as served “All Day”.

The party feast menu items are part of the “Party Feast” menu section, so I marked them as such.

PartyFeastFixed = PartyFeast %>%
  slice(1:7) %>%
  rename(Dish = Entrees) %>%
  mutate(Description = Dish, Price = 25.95/3, ServedWhen = "All Day", MenuSection = "Party Feast")

PartyFeastFixed
## # A tibble: 7 x 5
##   Dish                   Description           Price ServedWhen MenuSection
##   <chr>                  <chr>                 <dbl> <chr>      <chr>      
## 1 Honey Garlic Chicken   Honey Garlic Chicken   8.65 All Day    Party Feast
## 2 Sweet & Sour Chicken   Sweet & Sour Chicken   8.65 All Day    Party Feast
## 3 Pepper Steak           Pepper Steak           8.65 All Day    Party Feast
## 4 Orange or Sesame Chic~ Orange or Sesame Chi~  8.65 All Day    Party Feast
## 5 Bourbon Chicken        Bourbon Chicken        8.65 All Day    Party Feast
## 6 Chicken or Pork Lo Me~ Chicken or Pork Lo M~  8.65 All Day    Party Feast
## 7 General Tsao's Chicken General Tsao's Chick~  8.65 All Day    Party Feast

Putting all the menus together

After tidying each individual menu section, I can finally put together all the menus and do some analysis.

FinalMenu = bind_rows(WokSpecialtiesFinal, AllCombosFinal, PartyFeastFixed)

FinalMenu
## # A tibble: 69 x 5
##    Dish        Description                    Price ServedWhen MenuSection 
##    <chr>       <chr>                          <dbl> <chr>      <chr>       
##  1 Happy Fami~ Shrimp, scallop, roast pork, ~  13.0 All Day    Wok Special~
##  2 Steak Kew   Cubed prime filet with bamboo~  15.0 All Day    Wok Special~
##  3 Ginger Sca~ Steamed fish filet in a light~  13.0 All Day    Wok Special~
##  4 Roast Duck  Cantonese Style Half crispy r~  14.0 All Day    Wok Special~
##  5 Crispy Shr~ Large tail-on shrimp on a bet~  14.0 All Day    Wok Special~
##  6 Szechwan F~ Marinated fresh tender pieces~  14.0 All Day    Wok Special~
##  7 Shanghai N~ Tender boneless chicken cubes~  15.0 All Day    Wok Special~
##  8 Imperial P~ Tender pieces of crispy pork ~  13.0 All Day    Wok Special~
##  9 Neptune's ~ Large shrimp, sea scallops, r~  14.0 All Day    Wok Special~
## 10 Orange Chi~ Orange Chicken or Beef          12.0 All Day    Wok Special~
## # ... with 59 more rows

Analysis Questions

What is the average price of a dish at the restaurant?

This question is a little vague, but I will try to answer it from different interpretations.

The first interpretation is the average price for all dishes and all menu sections. This average does not consider the time of day or whether the dish is included in a combination or in the party feast menu.

The code below calculates the average price for all dishes.

The simple average of all dishes is approximately $8.78.

summarise(FinalMenu,AveragePrice = mean(Price))
## # A tibble: 1 x 1
##   AveragePrice
##          <dbl>
## 1         8.78

The second interpretation is the average of a la carte items only for lunch and for dinner separately. This average considers the time of day and does not include the party feast because it is not a la carte.

The code below filters the menu by items served at different times of the day for only the wok specialties menu and the combination menu and then calculates the average price.

The average price for all a la carte dishes served during lunch is about $8.56.

The average price for all a la carte dishes served during dinner is about $10.57.

FinalMenu %>%
  filter(ServedWhen == "All Day" | ServedWhen == "Lunch", MenuSection == "Wok Specialties" | MenuSection == "Combination") %>%
  summarise(AverageLunchPrice = mean(Price))
## # A tibble: 1 x 1
##   AverageLunchPrice
##               <dbl>
## 1              8.56
FinalMenu %>%
  filter(ServedWhen == "All Day" | ServedWhen == "Dinner", MenuSection == "Wok Specialties" | MenuSection == "Combination") %>%
  summarise(AverageDinnerPrice = mean(Price))
## # A tibble: 1 x 1
##   AverageDinnerPrice
##                <dbl>
## 1               10.6

It appears that lunch is about two dollars cheaper on average compared to dinner, which is to be expected. However, I did not realize that the average dinner a la carte item would be over ten dollars. If ordering three or more dishes during dinner, it would be cheaper to order the Party Feast. While there is a limited selection for the party feast, it brings the price per dish down closer to the average lunch dish price.
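
As a quick back-of-the-envelope check of that claim, the sketch below uses the rounded averages from above (hard-coded numbers, not a new calculation from the data):

#Three average-priced a la carte dinner dishes versus the $25.95 Party Feast.
AverageDinnerDish = 10.57 #rounded AverageDinnerPrice from the summary above

3 * AverageDinnerDish #about $31.71 for three average a la carte dinner dishes
3 * AverageDinnerDish - 25.95 #about $5.76 saved by ordering the Party Feast instead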

How many dishes contain chicken?

This question is fairly straightforward. I am going to make the assumption that if the words “Chicken” or “Gai” (chicken in Chinese) are not included in the name or description of the dish, the dish does not contain chicken.

The code below filters the rows in the menu by whether they contain the words “chicken” or “gai” in the dish description or the dish name. The resulting table is filtered further into the distinct dish names, which are counted and shown below.

FinalMenu %>%
  filter(str_detect(tolower(Description),"chicken") | str_detect(tolower(Description),"gai") | str_detect(tolower(Dish),"chicken") | str_detect(tolower(Dish),"gai")) %>%
  distinct(Dish) %>%
  summarise(ChickenDishCount = length(Dish))
## # A tibble: 1 x 1
##   ChickenDishCount
##              <int>
## 1               23

There are 23 unique dishes that contain chicken.

How many items can be sold from this menu?

This question is a little vague. I am going to interpret the question as asking for the number of unique a la carte dishes that can be sold individually. In that case, the party feast will not be included because its dishes are not ordered a la carte.

The code below filters the menu by the wok specialties menu and the combination menu. The distinct dish names are selected and then counted and presented below.

FinalMenu %>%
  filter(MenuSection == "Wok Specialties" | MenuSection == "Combination") %>%
  distinct(Dish) %>%
  summarise(ItemsSoldCount = length(Dish))
## # A tibble: 1 x 1
##   ItemsSoldCount
##            <int>
## 1             36

Another interpretation is the number of total available dishes at the restaurant. This can be found by counting the unique dish names.

The code below selects all distinct dish names and counts them.

FinalMenu %>%
  distinct(Dish) %>%
  summarise(AllItemsAvailableCount = length(Dish))
## # A tibble: 1 x 1
##   AllItemsAvailableCount
##                    <int>
## 1                     38

Extra Analysis

I was curious how much dishes cost on average depending on the time of day they are available. Are lunch-only items significantly cheaper than items available during dinner or all day?

The code below groups the dishes by when they are served and then averages the price among the dishes in each group.

AveragePriceServedWhen = FinalMenu %>%
  group_by(ServedWhen) %>%
  summarise(AveragePrice = round(mean(Price), digits = 2))

ggplot(data = AveragePriceServedWhen, aes(x = ServedWhen, y = AveragePrice, fill = ServedWhen)) + geom_bar(stat = "identity") + geom_text(aes(label = AveragePrice), vjust = 1.6, color = "white", size = 3.5) + theme_minimal()

It appears that lunch-only menu items are significantly cheaper compared to dishes served All Day and at dinner. Including the All Day items in the lunch and dinner averages skews those averages upwards by about two dollars. However, when the All Day dishes are excluded from the averages, the average cost per dish goes down significantly. Lunch-only menu items are a great value compared to the cost of the dishes available at other times of the day.
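
To make that contrast concrete, the sketch below recomputes the lunch and dinner averages with and without the All Day dishes. The AlaCarte name is only used for this aside, and the two IncludingAllDay figures should reproduce the a la carte averages calculated earlier.

#Compare lunch/dinner averages with and without the All Day dishes (Party Feast excluded).
AlaCarte = filter(FinalMenu, MenuSection != "Party Feast")

data.frame(
  LunchOnly = round(mean(filter(AlaCarte, ServedWhen == "Lunch")$Price), 2),
  LunchIncludingAllDay = round(mean(filter(AlaCarte, ServedWhen != "Dinner")$Price), 2),
  DinnerOnly = round(mean(filter(AlaCarte, ServedWhen == "Dinner")$Price), 2),
  DinnerIncludingAllDay = round(mean(filter(AlaCarte, ServedWhen != "Lunch")$Price), 2)
)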

Conclusion

This Chinese restaurant menu had a lot of interesting challenges when tidying the data. The data was organized in three separate spreadsheets in an Excel document, each with a different structure. Most of the difficulty with tidying this data was coming up with a format that all three spreadsheets could conform to. Given that certain dishes were priced differently depending on the time of day, I added the ServedWhen column to indicate whether the dish price was for Lunch, Dinner, or All Day. Many dishes did not have descriptions, which were necessary to determine the ingredients of the dish. To fix this issue, I copied the dish name to the description column when a description was not provided. After organizing each individual menu section, I put all of the sections together into one large menu and performed the requested analysis.

The analysis was much easier to do after the data had been tidied. I found that the Lunch menu selections were about two dollars cheaper on average compared to the Dinner menu selections. For parties ordering three dishes, the Party Feast had prices comparable to the Lunch menu selections and was also available during dinner time. This means that ordering the Party Feast was significantly cheaper than ordering dinner items individually; however, the Party Feast had a limited selection compared to the full dinner menu. In terms of dish ingredients, chicken is present in 23 of the 38 dishes served at the restaurant (about 60%).

Overall, I think this restaurant has pretty good value and a decent variety of dishes.


Color and Heat Absorption Data - Michael Hayes

Tidying the Data

For this “tidying” we will utilize Christopher Ayre’s example dataset of Heating and Cooling Absorption.

Discussion board can be found here: https://bbhosted.cuny.edu/webapps/discussionboard/do/message?action=list_messages&course_id=_1705328_1&nav=discussion_board&conf_id=_1845527_1&forum_id=_1908779_1&message_id=_31283025_1

Step 1 - Load Dataset

Christopher provided the data in a .csv file, which I’ve uploaded to github:

heating <- read.csv("https://raw.githubusercontent.com/murphystout/data-607/master/heating_cooling.csv")

head(heating)
##   color minute.0 minute.10 minute.20 minute.30 minute.40 minute.50
## 1 white       78        81        83        88        93        96
## 2   red       78        82        90        93        98       106
## 3  pink       78        82        84        90        96        99
## 4 black       78        88        92        98       108       116
## 5 green       78        81        85        91        95       102
## 6 white       98        96        93        80        78        78
##   minute.60   phase
## 1        98 heating
## 2       109 heating
## 3       102 heating
## 4       121 heating
## 5       105 heating
## 6        78 cooling

Christopher adeptly pointed out several issues with the data set that make it “untidy”. These are:

  1. The variable for “time elapsed” does not have its own column. In this case we see a column for each ten minute interval. These should be collapsed into one column.

  2. Each color should have its own column. We will treat one “observation” as the set of temperatures across all the colors at a given time, so each color is a variable of that single observation and they should all be placed on one row.

  3. Multiple observational units are observed in the same table. In particular the “heating” and “cooling” data are in one table.

After we have tidied it up, we will do some exploratory analysis and visualizations on the data.

Step 2 - Gather Times

For this step we will gather multiple timestamp columns into a single column.

As an intermediate step, let’s first rename the columns so that they take on numerical values; this will help with plotting later on.

colnames(heating) <- c("color",0,10,20,30,40,50,60,"phase")

heating <- gather(heating, time, temperature, "0":"60")

head(heating, 20)
##    color   phase time temperature
## 1  white heating    0          78
## 2    red heating    0          78
## 3   pink heating    0          78
## 4  black heating    0          78
## 5  green heating    0          78
## 6  white cooling    0          98
## 7    red cooling    0         109
## 8   pink cooling    0         102
## 9  black cooling    0         121
## 10 green cooling    0         105
## 11 white heating   10          81
## 12   red heating   10          82
## 13  pink heating   10          82
## 14 black heating   10          88
## 15 green heating   10          81
## 16 white cooling   10          96
## 17   red cooling   10         106
## 18  pink cooling   10          96
## 19 black cooling   10         108
## 20 green cooling   10          94

Finally, let’s make sure the time is stored as a numeric value:

heating$time <- as.numeric(heating$time)

Step 3 - Spreading Colors to Columns

Now that we’ve gathered up the temperature columns, let’s spread the colors back out into columns so that there is one temp reading for each color + timestamp combination.

heating <- spread(heating, color, temperature)

heating
##      phase time black green pink red white
## 1  cooling    0   121   105  102 109    98
## 2  cooling   10   108    94   96 106    96
## 3  cooling   20    98    90   90  95    93
## 4  cooling   30    90    82   83  87    80
## 5  cooling   40    84    80   80  82    78
## 6  cooling   50    79    78   78  80    78
## 7  cooling   60    78    78   78  78    78
## 8  heating    0    78    78   78  78    78
## 9  heating   10    88    81   82  82    81
## 10 heating   20    92    85   84  90    83
## 11 heating   30    98    91   90  93    88
## 12 heating   40   108    95   96  98    93
## 13 heating   50   116   102   99 106    96
## 14 heating   60   121   105  102 109    98

Step 4 - Split observational units into separate tables.

Thankfully this is as simple as subsetting the data based on the “phase” column. I saved this for the last step to save us from having to perform the tidying operations twice.

cooling <- subset(heating, phase == "cooling")
heating <- subset(heating, phase == "heating")

head(cooling)
##     phase time black green pink red white
## 1 cooling    0   121   105  102 109    98
## 2 cooling   10   108    94   96 106    96
## 3 cooling   20    98    90   90  95    93
## 4 cooling   30    90    82   83  87    80
## 5 cooling   40    84    80   80  82    78
## 6 cooling   50    79    78   78  80    78
head(heating)
##      phase time black green pink red white
## 8  heating    0    78    78   78  78    78
## 9  heating   10    88    81   82  82    81
## 10 heating   20    92    85   84  90    83
## 11 heating   30    98    91   90  93    88
## 12 heating   40   108    95   96  98    93
## 13 heating   50   116   102   99 106    96

Step 5 - Exploratory Data Analysis

Let’s plot these data in a line graph to get a visual representation of how the colors responded to heating and cooling:

plot(x = heating$time, y = heating$black, type = "l", col = "black", xlab = "Time Elapsed (Minutes)", ylab = "Temp (Fahrenheit)", main = "Heating/Color Absorption")
lines(x = heating$time, y = heating$red, col = "red")
lines(x = heating$time, y = heating$green, col = "green")
lines(x = heating$time, y = heating$pink, col = "pink")
lines(x = heating$time, y = heating$white, col = "grey")

plot(x = cooling$time, y = cooling$black, type = "l", col = "black", xlab = "Time Elapsed (Minutes)", ylab = "Temp (Fahrenheit)", main = "Cooling/Color Absorption")
lines(x = cooling$time, y = cooling$red, col = "red")
lines(x = cooling$time, y = cooling$green, col = "green")
lines(x = cooling$time, y = cooling$pink, col = "pink")
lines(x = cooling$time, y = cooling$white, col = "grey")

Initial findings and Next steps

The graphs look neat, and we can see that black is the fastest heat absorber.

The graphs also look symmetrical, but now that we see it in this form, it might make sense to view the cooling and heating data in one graph.

However, the data requires a bit more finagling to get this correct, such as:

1: Minute “60” of the Heating data is equivalent to Minute “0” of the Cooling data.

2: Minutes elapsed in the Cooling data need to be increased by 60 in order to create one continuous time series.

Let’s do it!

Combining Heating and Cooling Data

# Remove minute 0 of the cooling dataset:

heat_cool <- cooling[-1,]

# Add 60 to the time column.

heat_cool$time <- as.numeric(heat_cool$time) + 60

# Stack this underneath the heating data

heat_cool <- rbind(heating, heat_cool)

heat_cool
##      phase time black green pink red white
## 8  heating    0    78    78   78  78    78
## 9  heating   10    88    81   82  82    81
## 10 heating   20    92    85   84  90    83
## 11 heating   30    98    91   90  93    88
## 12 heating   40   108    95   96  98    93
## 13 heating   50   116   102   99 106    96
## 14 heating   60   121   105  102 109    98
## 2  cooling   70   108    94   96 106    96
## 3  cooling   80    98    90   90  95    93
## 4  cooling   90    90    82   83  87    80
## 5  cooling  100    84    80   80  82    78
## 6  cooling  110    79    78   78  80    78
## 7  cooling  120    78    78   78  78    78

Now we have a nice, neat and tidy dataset showing heating and cooling times. Let’s revisit those graphs we generated previously:

plot(x = heat_cool$time, y = heat_cool$black, type = "l", col = "black", xlab = "Time Elapsed (Minutes)", ylab = "Temp (Fahrenheit)", main = "Heating and Cooling/Color Absorption")
lines(x = heat_cool$time, y = heat_cool$red, col = "red")
lines(x = heat_cool$time, y = heat_cool$green, col = "green")
lines(x = heat_cool$time, y = heat_cool$pink, col = "pink")
lines(x = heat_cool$time, y = heat_cool$white, col = "grey")

Calculating Rates (Hourly)

Let’s get a bit more quantitative. Let’s calculate the rates of heating and cooling for each of the colors:

heating_rate <- (heating[7,3:7] - heating[1,3:7])/60

heating_rate
##        black green pink       red     white
## 14 0.7166667  0.45  0.4 0.5166667 0.3333333
cooling_rate <- (cooling[7,3:7] - cooling[1,3:7])/60

cooling_rate
##        black green pink        red      white
## 7 -0.7166667 -0.45 -0.4 -0.5166667 -0.3333333

Since the starting and ending temperatures were equivalent, we see that the overall heating and cooling rates are symmetrical to one another.

According to this test, a color’s heating rate also dictates its cooling rate (or heat retention), at least on average over 120 minutes.
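
Because the two rate tables share the same five color columns, this symmetry can be checked directly. The line below is just a quick sketch: exact mirror images would make every element-wise sum zero.

# Sketch: sum the one-row heating and cooling rate tables; zeros indicate perfect symmetry.
round(heating_rate + cooling_rate, 3)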

Looking at these visually:

heating_rate <- gather(heating_rate, color, temp)

barplot(heating_rate$temp, col = heating_rate$color, names.arg = heating_rate$color, main = 'Heating Rates (by Color)', xlab = "Color", ylab = "Rate (Degrees per minute)")

However, this is looking at averages over the hour. What does the temp change look like within each 10-minute interval?

We can find this programmatically:

Calculating Rates (Per 10 Minutes)

black_ht <- diff(heating$black)/10
green_ht <- diff(heating$green)/10
pink_ht <- diff(heating$pink)/10
red_ht <- diff(heating$red)/10
white_ht <- diff(heating$white)/10

black_cl <- diff(cooling$black)/10
green_cl <- diff(cooling$green)/10
pink_cl <- diff(cooling$pink)/10
red_cl <- diff(cooling$red)/10
white_cl <- diff(cooling$white)/10

ht_rates <- data.frame(black_ht, black_cl, green_ht, green_cl, pink_ht, pink_cl, red_ht, red_cl, white_ht, white_cl)

ht_rates
##   black_ht black_cl green_ht green_cl pink_ht pink_cl red_ht red_cl
## 1      1.0     -1.3      0.3     -1.1     0.4    -0.6    0.4   -0.3
## 2      0.4     -1.0      0.4     -0.4     0.2    -0.6    0.8   -1.1
## 3      0.6     -0.8      0.6     -0.8     0.6    -0.7    0.3   -0.8
## 4      1.0     -0.6      0.4     -0.2     0.6    -0.3    0.5   -0.5
## 5      0.8     -0.5      0.7     -0.2     0.3    -0.2    0.8   -0.2
## 6      0.5     -0.1      0.3      0.0     0.3     0.0    0.3   -0.2
##   white_ht white_cl
## 1      0.3     -0.2
## 2      0.2     -0.3
## 3      0.5     -1.3
## 4      0.5     -0.2
## 5      0.3      0.0
## 6      0.2      0.0

Let’s take a look at these visually:

plot(x = seq(10, 60, 10), y = black_ht, type = "l", col = "black", ylim = c(-1.5,1.5), main = "Heating and Cooling Rates", sub = "Positive Values are Heating Rates, Negative are Cooling Rates", xlab = "Time Elapsed (Minutes)", ylab = "Heating and Cooling Rates")
lines(seq(10, 60, 10),y = black_cl, col = "black")
lines(seq(10, 60, 10),y = green_ht, col = "green")
lines(seq(10, 60, 10),y = green_cl, col = "green")
lines(seq(10, 60, 10),y = pink_ht, col = "pink")
lines(seq(10, 60, 10),y = pink_cl, col = "pink")
lines(seq(10, 60, 10),y = red_ht, col = "red")
lines(seq(10, 60, 10),y = red_cl, col = "red")
lines(seq(10, 60, 10),y = white_ht, col = "grey")
lines(seq(10, 60, 10),y = white_cl, col = "grey")

This chart shows both heating and cooling rates. The heating rates are positive (top of chart), while the cooling rates are negative (bottom of chart).

Matching like colors can show you how that color behaved in its heating and cooling phase.

Since our overall averages were very symmetrical (i.e., over the full 120-minute span), we might expect each smaller interval to be symmetrical too.

However, that doesn’t always appear to be the case in this data. Note that the green line is often twice the magnitude of its counterpart.

We can also see that all colors tend to converge to low values at the end of both periods. Perhaps this speaks to a type of heating saturation paired with a similar flatline of cooling.
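
To put rough numbers on that asymmetry, here is a sketch using the per-interval rate vectors computed above. For each color it sums the heating and cooling rates at the same interval index (the same alignment the chart uses), so values near zero would mean the two phases roughly mirror each other in that interval.

# Sketch: per-interval asymmetry by color; the further from zero, the less symmetric.
round(data.frame(
  black = black_ht + black_cl,
  green = green_ht + green_cl,
  pink  = pink_ht + pink_cl,
  red   = red_ht + red_cl,
  white = white_ht + white_cl
), 2)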

Some Conclusions and Questions for Future Analysis

Some initial conclusions from our exploratory data analysis:

  1. Black has the fastest heating rate, at ~0.72 degrees per minute. White has the slowest heating rate, at ~0.33 degrees per minute. This was probably suspected based on known heuristics, and the data seems to confirm it.

  2. Heating rates and cooling rates were symmetrical over the full 120-minute span. However, they don’t seem to be symmetric over the smaller 10-minute spans; there is a lot of variation in rates across those intervals.

Some questions it raised:

  1. There seems to be a wide variance in temperature changes across the 10-minute intervals. Is this typical? Do temperature changes “slow down” or otherwise change based on when they occur in the time series?

  2. Heat absorption may very well not be a linear process; perhaps a more detailed analysis is needed to really understand the dynamics of these rates.



In order to render the maps in this code, the fiftystater package must be installed via devtools. The following code chunk installs fiftystater.

require(devtools)
devtools::install_github("wmurphyrd/fiftystater")

Population Data - Stephen Jones

Data Description: Population Data

I chose to work with population data posted by Arun Reddy. His description:

I found a census dataset for Puerto Rico population for 2010 to 2018. The dataset gives the Estimates of the Total Resident Population and Resident Population Age 18 Years and Older for the United States, States, and Puerto Rico

The dataset gives population metrics like the population estimate, the change in population from the estimate, and the national rank in population for each year from 2010 to 2018.

There are 57 rows and 61 columns, so the dataset is wider than it is long.

This is a good example of how untidy data can be cleansed and made more presentable. The following steps can be performed to cleanse the data.

  1. All four metrics, which are spread out column-wise by year, can be changed into rows.

  2. Column names which include the year as a concatenation can be reformatted to be more readable.

  3. Some of the columns don’t have the right data type; for example, population change and national rank are stored as factors, which is unnecessary.


Tidying

Load data and examine

Let’s read the population data from GitHub and check descriptives with the summary command. Output is limited by customized CSS in Markdown.

USpop<-read.csv("https://raw.githubusercontent.com/sigmasigmaiota/USpopulation/master/Population DataSet.csv")

#Look at summary table.
summary(USpop)
##        X          SUMLEV      REGION    DIVISION      STATE      
##  Min.   : 1   Min.   :10.00   0: 1   5      : 9   Min.   : 0.00  
##  1st Qu.:15   1st Qu.:40.00   1:10   8      : 8   1st Qu.:12.00  
##  Median :29   Median :40.00   2:13   4      : 7   Median :27.00  
##  Mean   :29   Mean   :38.07   3:18   1      : 6   Mean   :27.18  
##  3rd Qu.:43   3rd Qu.:40.00   4:14   0      : 5   3rd Qu.:41.00  
##  Max.   :57   Max.   :40.00   X: 1   3      : 5   Max.   :72.00  
##                                      (Other):17                  
##          NAME    ESTIMATESBASE2010   POPESTIMATE2010    
##  Alabama   : 1   Min.   :   563773   Min.   :   564483  
##  Alaska    : 1   1st Qu.:  1853001   1st Qu.:  1854214  
##  Arizona   : 1   Median :  4625381   Median :  4635656  
##  Arkansas  : 1   Mean   : 16315798   Mean   : 16345610  
##  California: 1   3rd Qu.:  9535736   3rd Qu.:  9574293  
##  Colorado  : 1   Max.   :308758105   Max.   :309326085  
##  (Other)   :51                                          
##  POPESTIMATE2011     POPESTIMATE2012     POPESTIMATE2013    
##  Min.   :   567224   Min.   :   576270   Min.   :   582123  
##  1st Qu.:  1856074   1st Qu.:  1856764   1st Qu.:  1865414  
##  Median :  4671422   Median :  4717112   Median :  4764153  
##  Mean   : 16463487   Mean   : 16583459   Mean   : 16697654  
##  3rd Qu.:  9656754   3rd Qu.:  9749123   3rd Qu.:  9843599  
##  Max.   :311580009   Max.   :313874218   Max.   :316057727  
##                                                             
##  POPESTIMATE2014     POPESTIMATE2015     POPESTIMATE2016    
##  Min.   :   582548   Min.   :   585668   Min.   :   584290  
##  1st Qu.:  1879522   1st Qu.:  1891507   1st Qu.:  1905924  
##  Median :  4823793   Median :  4853160   Median :  4864745  
##  Mean   : 16819195   Mean   : 16942126   Mean   : 17063518  
##  3rd Qu.:  9930589   3rd Qu.:  9932573   3rd Qu.:  9951890  
##  Max.   :318386421   Max.   :320742673   Max.   :323071342  
##                                                             
##  POPESTIMATE2017     POPESTIMATE2018      NPOPCHG_2010   
##  Min.   :   578934   Min.   :   577737   Min.   : -6582  
##  1st Qu.:  1917575   1st Qu.:  1929268   1st Qu.:  2570  
##  Median :  4875120   Median :  4887871   Median :  6457  
##  Mean   : 17171340   Mean   : 17275394   Mean   : 29812  
##  3rd Qu.:  9976447   3rd Qu.:  9995915   3rd Qu.: 18362  
##  Max.   :325147121   Max.   :327167434   Max.   :567980  
##                                                          
##   NPOPCHG_2011      NPOPCHG_2012      NPOPCHG_2013      NPOPCHG_2014    
##  Min.   : -42793   Min.   : -44244   Min.   : -41411   Min.   : -58203  
##  1st Qu.:   8898   1st Qu.:  10012   1st Qu.:   8774   1st Qu.:   7386  
##  Median :  21288   Median :  16893   Median :  18436   Median :  17240  
##  Mean   : 117877   Mean   : 119972   Mean   : 114195   Mean   : 121542  
##  3rd Qu.:  65723   3rd Qu.:  71221   3rd Qu.:  53494   3rd Qu.:  59023  
##  Max.   :2253924   Max.   :2294209   Max.   :2183509   Max.   :2328694  
##                                                                         
##   NPOPCHG_2015      NPOPCHG_2016      NPOPCHG_2017      NPOPCHG_2018    
##  Min.   : -61708   Min.   : -66671   Min.   : -81494   Min.   :-129848  
##  1st Qu.:   4089   1st Qu.:   3647   1st Qu.:   1365   1st Qu.:   3341  
##  Median :  14763   Median :  13364   Median :  14027   Median :  17827  
##  Mean   : 122931   Mean   : 121392   Mean   : 107822   Mean   : 104054  
##  3rd Qu.:  50831   3rd Qu.:  60116   3rd Qu.:  60505   3rd Qu.:  61216  
##  Max.   :2356252   Max.   :2328669   Max.   :2075779   Max.   :2020313  
##                                                                         
##   PPOPCHG_2010       PPOPCHG_2011      PPOPCHG_2012      PPOPCHG_2013    
##  Min.   :-0.12431   Min.   :-1.1499   Min.   :-1.2027   Min.   :-1.1394  
##  1st Qu.: 0.09317   1st Qu.: 0.2671   1st Qu.: 0.2745   1st Qu.: 0.2512  
##  Median : 0.17818   Median : 0.6560   Median : 0.6946   Median : 0.6524  
##  Mean   : 0.17896   Mean   : 0.6490   Mean   : 0.6857   Mean   : 0.6626  
##  3rd Qu.: 0.24368   3rd Qu.: 0.8968   3rd Qu.: 1.0440   3rd Qu.: 0.9712  
##  Max.   : 0.55154   Max.   : 2.3992   Max.   : 2.4408   Max.   : 2.9785  
##                                                                          
##   PPOPCHG_2014      PPOPCHG_2015      PPOPCHG_2016      PPOPCHG_2017     
##  Min.   :-1.6199   Min.   :-1.7457   Min.   :-1.9196   Min.   :-2.39231  
##  1st Qu.: 0.1930   1st Qu.: 0.1282   1st Qu.: 0.1233   1st Qu.: 0.05405  
##  Median : 0.5311   Median : 0.4792   Median : 0.4030   Median : 0.37844  
##  Mean   : 0.6018   Mean   : 0.6056   Mean   : 0.5883   Mean   : 0.51042  
##  3rd Qu.: 0.9699   3rd Qu.: 1.1043   3rd Qu.: 1.0899   3rd Qu.: 1.05089  
##  Max.   : 2.1306   Max.   : 2.2566   Max.   : 2.0156   Max.   : 2.13758  
##                                                                          
##   PPOPCHG_2018     NRANK_ESTBASE2010 NRANK_POPEST2010 NRANK_POPEST2011
##  Min.   :-3.9052   1      : 2        1      : 2       1      : 2      
##  1st Qu.: 0.1299   2      : 2        2      : 2       2      : 2      
##  Median : 0.3979   3      : 2        3      : 2       3      : 2      
##  Mean   : 0.4897   4      : 2        4      : 2       4      : 2      
##  3rd Qu.: 0.9723   X      : 2        X      : 2       X      : 2      
##  Max.   : 2.0854   10     : 1        10     : 1       10     : 1      
##                    (Other):46        (Other):46       (Other):46      
##  NRANK_POPEST2012 NRANK_POPEST2013 NRANK_POPEST2014 NRANK_POPEST2015
##  1      : 2       1      : 2       1      : 2       1      : 2      
##  2      : 2       2      : 2       2      : 2       2      : 2      
##  3      : 2       3      : 2       3      : 2       3      : 2      
##  4      : 2       4      : 2       4      : 2       4      : 2      
##  X      : 2       X      : 2       X      : 2       X      : 2      
##  10     : 1       10     : 1       10     : 1       10     : 1      
##  (Other):46       (Other):46       (Other):46       (Other):46      
##  NRANK_POPEST2016 NRANK_POPEST2017 NRANK_POPEST2018 NRANK_NPCHG2010
##  1      : 2       1      : 2       1      : 2       1      : 2     
##  2      : 2       2      : 2       2      : 2       2      : 2     
##  3      : 2       3      : 2       3      : 2       3      : 2     
##  4      : 2       4      : 2       4      : 2       4      : 2     
##  X      : 2       X      : 2       X      : 2       X      : 2     
##  10     : 1       10     : 1       10     : 1       10     : 1     
##  (Other):46       (Other):46       (Other):46       (Other):46     
##  NRANK_NPCHG2011 NRANK_NPCHG2012 NRANK_NPCHG2013 NRANK_NPCHG2014
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_NPCHG2015 NRANK_NPCHG2016 NRANK_NPCHG2017 NRANK_NPCHG2018
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2010 NRANK_PPCHG2011 NRANK_PPCHG2012 NRANK_PPCHG2013
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2014 NRANK_PPCHG2015 NRANK_PPCHG2016 NRANK_PPCHG2017
##  1      : 2      1      : 2      1      : 2      1      : 2     
##  2      : 2      2      : 2      2      : 2      2      : 2     
##  3      : 2      3      : 2      3      : 2      3      : 2     
##  4      : 2      4      : 2      4      : 2      4      : 2     
##  X      : 2      X      : 2      X      : 2      X      : 2     
##  10     : 1      10     : 1      10     : 1      10     : 1     
##  (Other):46      (Other):46      (Other):46      (Other):46     
##  NRANK_PPCHG2018
##  1      : 2     
##  2      : 2     
##  3      : 2     
##  4      : 2     
##  X      : 2     
##  10     : 1     
##  (Other):46


Create state-level dataset

We’ll need to remove the national- and region-level rows listed at the top of the dataset by creating a subset; let’s remove the first five rows, wherein the data at the national and region level lives. The result is displayed using kableExtra.

#subset the data.
USpop.StateTerr<-USpop[6:57,]

require(kableExtra)
kable(head(USpop.StateTerr))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  scroll_box(width = "100%")
X SUMLEV REGION DIVISION STATE NAME ESTIMATESBASE2010 POPESTIMATE2010 POPESTIMATE2011 POPESTIMATE2012 POPESTIMATE2013 POPESTIMATE2014 POPESTIMATE2015 POPESTIMATE2016 POPESTIMATE2017 POPESTIMATE2018 NPOPCHG_2010 NPOPCHG_2011 NPOPCHG_2012 NPOPCHG_2013 NPOPCHG_2014 NPOPCHG_2015 NPOPCHG_2016 NPOPCHG_2017 NPOPCHG_2018 PPOPCHG_2010 PPOPCHG_2011 PPOPCHG_2012 PPOPCHG_2013 PPOPCHG_2014 PPOPCHG_2015 PPOPCHG_2016 PPOPCHG_2017 PPOPCHG_2018 NRANK_ESTBASE2010 NRANK_POPEST2010 NRANK_POPEST2011 NRANK_POPEST2012 NRANK_POPEST2013 NRANK_POPEST2014 NRANK_POPEST2015 NRANK_POPEST2016 NRANK_POPEST2017 NRANK_POPEST2018 NRANK_NPCHG2010 NRANK_NPCHG2011 NRANK_NPCHG2012 NRANK_NPCHG2013 NRANK_NPCHG2014 NRANK_NPCHG2015 NRANK_NPCHG2016 NRANK_NPCHG2017 NRANK_NPCHG2018 NRANK_PPCHG2010 NRANK_PPCHG2011 NRANK_PPCHG2012 NRANK_PPCHG2013 NRANK_PPCHG2014 NRANK_PPCHG2015 NRANK_PPCHG2016 NRANK_PPCHG2017 NRANK_PPCHG2018
6 6 40 3 6 1 Alabama 4780138 4785448 4798834 4815564 4830460 4842481 4853160 4864745 4875120 4887871 5310 13386 16730 14896 12021 10679 11585 10375 12751 0.1110847 0.2797230 0.3486264 0.3093303 0.2488583 0.2205275 0.2387104 0.2132691 0.2615525 23 23 23 23 23 23 24 24 24 24 28 32 26 31 32 30 28 31 27 37 38 34 33 35 36 34 34 34
7 7 40 4 9 2 Alaska 710249 713906 722038 730399 737045 736307 737547 741504 739786 737438 3657 8132 8361 6646 -738 1240 3957 -1718 -2348 0.5148898 1.1390855 1.1579723 0.9099136 -0.1001296 0.1684080 0.5365082 -0.2316913 -0.3173891 47 47 47 47 47 48 48 48 48 48 32 39 40 42 47 42 37 44 45 2 8 11 18 48 37 22 46 49
8 8 40 4 8 4 Arizona 6392288 6407774 6473497 6556629 6634999 6733840 6833596 6945452 7048876 7171646 15486 65723 83132 78370 98841 99756 111856 103424 122770 0.2422607 1.0256760 1.2841900 1.1952789 1.4896913 1.4814133 1.6368542 1.4890896 1.7416961 16 16 16 15 15 15 14 14 14 14 11 10 7 5 4 7 7 7 4 15 10 7 7 7 9 9 6 4
9 9 40 3 7 5 Arkansas 2916028 2921978 2940407 2952109 2959549 2967726 2978407 2990410 3002997 3013825 5950 18429 11702 7440 8177 10681 12003 12587 10828 0.2040447 0.6307029 0.3979721 0.2520232 0.2762921 0.3599052 0.4030007 0.4209122 0.3605731 32 32 32 32 32 32 33 32 32 33 26 25 35 41 37 29 27 26 30 22 28 31 38 31 30 26 24 28
10 10 40 4 9 6 California 37254523 37320903 37641823 37960782 38280824 38625139 38953142 39209127 39399349 39557045 66380 320920 318959 320042 344315 328003 255985 190222 157696 0.1781797 0.8598934 0.8473527 0.8430859 0.8994451 0.8491956 0.6571614 0.4851472 0.4002503 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 26 17 19 20 16 16 20 23 25
11 11 40 4 8 8 Colorado 5029316 5048281 5121771 5193721 5270482 5351218 5452107 5540921 5615902 5695564 18965 73490 71950 76761 80736 100889 88814 74981 79662 0.3770891 1.4557431 1.4047875 1.4779577 1.5318523 1.8853465 1.6289849 1.3532227 1.4185077 22 22 22 22 22 22 22 21 21 21 8 9 9 6 8 6 8 8 8 6 4 5 5 4 4 10 8 7


We now have information only on the state/territory level. Let’s transform the data so that each statistic and year gets its own row. Additionally, let’s omit the first two columns, as SUMLEV is consistent on the state/terr level. The result is displayed once again using kableExtra.

#remove first two columns
USpop.StateTerr[1:2]<-NULL
library(tidyr)
library(dplyr)
#Create row for each statistic.
USpop.YearList<-gather(USpop.StateTerr,"Statistic","Value",5:length(colnames(USpop.StateTerr)))

kable(head(USpop.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
REGION DIVISION STATE NAME Statistic Value
3 6 1 Alabama ESTIMATESBASE2010 4780138
4 9 2 Alaska ESTIMATESBASE2010 710249
4 8 4 Arizona ESTIMATESBASE2010 6392288
3 7 5 Arkansas ESTIMATESBASE2010 2916028
4 9 6 California ESTIMATESBASE2010 37254523
4 8 8 Colorado ESTIMATESBASE2010 5029316


Extract year from former column heading

Let’s extract the year from the former column heading, now stored in the “Statistic” variable, into a new “Year” variable; we’ll use stringr to extract the numeric characters. The new variable is highlighted in light blue.

library(stringr)
USpop.YearList$Year<-as.numeric(str_extract(USpop.YearList$Statistic,"([0-9]+)"))
USpop.YearList$Statistic<-gsub("([0-9])|(_)","",USpop.YearList$Statistic)

#Rename value column to specifically address state/terr level data.
colnames(USpop.YearList)[6]<-"StateValue"

kable(head(USpop.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 7, bold = T, background = "LightCyan")
REGION DIVISION STATE NAME Statistic StateValue Year
3 6 1 Alabama ESTIMATESBASE 4780138 2010
4 9 2 Alaska ESTIMATESBASE 710249 2010
4 8 4 Arizona ESTIMATESBASE 6392288 2010
3 7 5 Arkansas ESTIMATESBASE 2916028 2010
4 9 6 California ESTIMATESBASE 37254523 2010
4 8 8 Colorado ESTIMATESBASE 5029316 2010


Create region-level dataset

Let’s do the same for the region-level rows. Using the gather command, a region-level table is created from rows two through five; the renamed RegionName column is marked in light blue.

USdiv.YearList<-gather(USpop[2:5,],"Statistic","Value",7:length(colnames(USpop)))

#Remove first two columns.
USdiv.YearList[1:2]<-NULL

#We also have no need for state or division information at Region level.
USdiv.YearList[2:3]<-NULL

#Rename NAME column.
colnames(USdiv.YearList)[2]<-"RegionName"

#Rename Value column in anticipation of merge with state data.
colnames(USdiv.YearList)[4]<-"RegionValue"

#Clean year and statistic.
USdiv.YearList$Year<-as.numeric(str_extract(USdiv.YearList$Statistic,"([0-9]+)"))
USdiv.YearList$Statistic<-gsub("([0-9])|(_)","",USdiv.YearList$Statistic)

#Shorten RegionName.
USdiv.YearList$RegionName<-gsub("( Region)","",USdiv.YearList$RegionName)

kable(head(USdiv.YearList))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")
REGION RegionName Statistic RegionValue Year
1 Northeast ESTIMATESBASE 55318430 2010
2 Midwest ESTIMATESBASE 66929743 2010
3 South ESTIMATESBASE 114563045 2010
4 West ESTIMATESBASE 71946887 2010
1 Northeast POPESTIMATE 55380645 2010
2 Midwest POPESTIMATE 66974749 2010


Add region-level data

Nice. Let’s merge by REGION number, Year, and Statistic. We must use full_join in order to keep Puerto Rico, which is omitted from Region categorization.

USpop2<-full_join(USdiv.YearList,USpop.YearList,by=c("REGION","Statistic","Year"))

kable(head(USpop2))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
REGION RegionName Statistic RegionValue Year DIVISION STATE NAME StateValue
1 Northeast ESTIMATESBASE 55318430 2010 1 9 Connecticut 3574147
1 Northeast ESTIMATESBASE 55318430 2010 1 23 Maine 1328369
1 Northeast ESTIMATESBASE 55318430 2010 1 25 Massachusetts 6547790
1 Northeast ESTIMATESBASE 55318430 2010 1 33 New Hampshire 1316464
1 Northeast ESTIMATESBASE 55318430 2010 2 34 New Jersey 8791962
1 Northeast ESTIMATESBASE 55318430 2010 2 36 New York 19378124


Recode division into descriptive string

We need to pair this with a list of divisions, which differ from Regions. A list exists on Wikipedia; I’ve edited it from https://en.wikipedia.org/wiki/List_of_regions_of_the_United_States.

DIVISION<-as.factor(c(1,2,3,4,5,6,7,8,9))
DivisionName<-c("New England","Mid-Atlantic","East North Central","West North Central","South Atlantic","East South Central","West South Central","Mountain","Pacific")
Divisions<-data.frame(DIVISION,DivisionName)


Merge

Merge Divisions with master dataset and check.

USpop3<-full_join(Divisions,USpop2,by=c("DIVISION"))

kable(head(USpop3))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")
DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue
1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 9 | Connecticut | 3574147
1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 23 | Maine | 1328369
1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 25 | Massachusetts | 6547790
1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 33 | New Hampshire | 1316464
1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 44 | Rhode Island | 1052957
1 | New England | 1 | Northeast | ESTIMATESBASE | 55318430 | 2010 | 50 | Vermont | 625744


Puerto Rico: no division, no region

Let’s look at Puerto Rico in particular.

PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]

kable(head(PR))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
(row) | DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue
2857 | X | NA | X | NA | ESTIMATESBASE | NA | 2010 | 72 | Puerto Rico | 3726157
2858 | X | NA | X | NA | POPESTIMATE | NA | 2010 | 72 | Puerto Rico | 3721525
2859 | X | NA | X | NA | POPESTIMATE | NA | 2011 | 72 | Puerto Rico | 3678732
2860 | X | NA | X | NA | POPESTIMATE | NA | 2012 | 72 | Puerto Rico | 3634488
2861 | X | NA | X | NA | POPESTIMATE | NA | 2013 | 72 | Puerto Rico | 3593077
2862 | X | NA | X | NA | POPESTIMATE | NA | 2014 | 72 | Puerto Rico | 3534874


Puerto Rico is not categorized with the other states and has no division assignment, which is problematic for anyone doing comparative analysis at the national level. Since Puerto Rico has been assigned no region, there is no regional data available. Let’s recode the NA in RegionName to "Puerto Rico".

USpop3$RegionName[which(USpop3$NAME=="Puerto Rico")]<-"Puerto Rico"

#look at Puerto Rico again, replace subset.
PR<-USpop3[which(USpop3$NAME == "Puerto Rico"),]

kable(head(PR))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 5, bold = T, background = "LightCyan")
(row) | DIVISION | DivisionName | REGION | RegionName | Statistic | RegionValue | Year | STATE | NAME | StateValue
2857 | X | NA | X | Puerto Rico | ESTIMATESBASE | NA | 2010 | 72 | Puerto Rico | 3726157
2858 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2010 | 72 | Puerto Rico | 3721525
2859 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2011 | 72 | Puerto Rico | 3678732
2860 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2012 | 72 | Puerto Rico | 3634488
2861 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2013 | 72 | Puerto Rico | 3593077
2862 | X | NA | X | Puerto Rico | POPESTIMATE | NA | 2014 | 72 | Puerto Rico | 3534874


Add nation-level statistics

Let’s add a column for National statistics.

#National data only.
USpop.Nat<-USpop[1,3:ncol(USpop)]

#gather, as before
USpop.Nat<-gather(USpop.Nat,"Statistic","Value",5:length(colnames(USpop.Nat)))

#Clean year and statistic.
USpop.Nat$Year<-as.numeric(str_extract(USpop.Nat$Statistic,"([0-9]+)"))
USpop.Nat$Statistic<-gsub("([0-9])|(_)","",USpop.Nat$Statistic)

#omit columns and unnecessary rows.
USpop.Nat<-USpop.Nat[,5:7]
USpop.Nat<-USpop.Nat[which(USpop.Nat$Value!="X"),]

#Rename columns in preparation for merge.
colnames(USpop.Nat)[2]<-"NationalValue"

#merge into master data.
USpop4<-full_join(USpop.Nat,USpop3,by=c("Statistic","Year"))

kable(head(USpop4))%>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))%>%
  column_spec(column = 2, bold = T, background = "LightCyan")
Statistic | NationalValue | Year | DIVISION | DivisionName | REGION | RegionName | RegionValue | STATE | NAME | StateValue
ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 9 | Connecticut | 3574147
ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 23 | Maine | 1328369
ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 25 | Massachusetts | 6547790
ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 33 | New Hampshire | 1316464
ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 44 | Rhode Island | 1052957
ESTIMATESBASE | 308758105 | 2010 | 1 | New England | 1 | Northeast | 55318430 | 50 | Vermont | 625744


Create subsets by statistic

Let’s split the data into subsets by type of statistic. I’ll use these subsets in the code below to render maps.

This concludes the tidying portion of the project; the original cleaning objective in Arun Reddy’s post has been accomplished. As no analysis outline was provided, let’s identify:
1. States with the greatest population growth in 2018.
2. States with the greatest decrease in population in 2018.
3. States with the greatest population percentage by nation and by region.
A quick numeric ranking sketch follows the subsetting code below.

USpopESTIMATESBASE<-USpop4[which(USpop4$Statistic=="ESTIMATESBASE"),]
USpopESTIMATE<-USpop4[which(USpop4$Statistic=="POPESTIMATE"),]
USpopPOPCHG<-USpop4[which(USpop4$Statistic=="NPOPCHG"),]
USpopPPOPCHG<-USpop4[which(USpop4$Statistic=="PPOPCHG"),]
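As a numeric companion to the maps that follow (and to goals 1 and 2 above), here is a hedged sketch that ranks states by their 2018 net population change; it assumes StateValue parses cleanly to numeric, as the mapping code below also assumes.

#Rank state/territory net population change for 2018 (sketch, not part of the original code).
chg2018 <- USpopPOPCHG %>%
  mutate(StateValue = as.numeric(StateValue)) %>%
  filter(Year == 2018) %>%
  select(NAME, StateValue)

head(arrange(chg2018, desc(StateValue)), 5)  #greatest estimated increases
head(arrange(chg2018, StateValue), 5)        #greatest estimated decreases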

Results

Setup

Finally, let’s map our results. I’ve used ggplot2 to render the maps, ggthemes and viridis to customize them, various tools from the tidyverse package, and fiftystater, which needs to be installed from GitHub. The packages mapdata and mapproj supply map data for Puerto Rico, which is excluded from the fiftystater package.
# install.packages("devtools")
# devtools::install_github("wmurphyrd/fiftystater")
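For reference, the package loads that the chunks below rely on could be gathered in one place (they also appear piecemeal inside the first map chunk); a convenience sketch, assuming everything is already installed:

library(ggplot2)     #rendering
library(ggthemes)    #theme helpers
library(viridis)     #color scales
library(tidyverse)   #general data manipulation
library(fiftystater) #fifty_states polygon data (GitHub install above)
library(mapdata)     #'worldHires' map data, used for the Puerto Rico inset
library(mapproj)     #projections for coord_map()
library(grid)        #viewports for placing the inset
library(grDevices)   #png() device for saving the composed figure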

Map: population increases in 2018

Let’s map population increases in 2018.

USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue<0)]<-0


require(ggplot2)
require(fiftystater)
require(ggthemes)
require(tidyverse)
require(viridis)

USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)

data("fifty_states")

library(mapdata)
library(mapproj)

#Puerto Rico color must be set manually; there has been a decrease in population, so we've matched the color representing zero growth.
pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="lemonchiffon1")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=StateValue), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(-100000,-25000,-5000,0,5000,25000,100000,300000),
labels=c('-100K--25K','-25K--5K','-25K-0','0-5000','5k-25k','25K-100K','100K-300K','300K+'),begin=1,end=.25,option="magma")+
  guides(fill=guide_legend(title="Growth by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Population Increases by State, 2018")

library(grid)
library(grDevices)

png(file="Project2a.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()
## png 
##   2
knitr::include_graphics("Project2a.png")

The greatest estimated population increase in 2018, by far, occurred in Texas and Florida.


Map: population decreases in 2018

Let’s map population losses.

USpopPOPCHG2018<-USpop4[which(USpop4$Statistic=="NPOPCHG"&USpop4$Year==2018),]
USpopPOPCHG2018$StateValue<-as.numeric(USpopPOPCHG2018$StateValue)
USpopPOPCHG2018$StateValue[which(USpopPOPCHG2018$StateValue>0)]<-0

USpopPOPCHG2018$statefull<-tolower(USpopPOPCHG2018$NAME)

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopPOPCHG2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="khaki3")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopPOPCHG2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=StateValue), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(-300000,-100000,-25000,-5000,0),
labels=c('300K+','100K-300K','25K-100K','5K-25K','0-25K'),begin=1,end=0,option="cividis")+
  guides(fill=guide_legend(title="Decrease pop by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Population Decreases by State, 2018")


png(file="Project2b.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()
## png 
##   2
knitr::include_graphics("Project2b.png")

It’s easy to see that West Virginia and Louisiana are estimated to have experienced a decrease in population in 2018, as are Illinois and New York. Puerto Rico is estimated to have experienced the greatest decrease in population.


Map: population by state

Let’s calculate the percentage of the national and regional population that resides in each state or territory, using the projected population data for 2018. First, let’s map the percentage of national population by state.

#Subset the 2018 population estimates (index USpopESTIMATE rather than USpop4 so the row positions match).
USpopESTIMATE2018<-USpopESTIMATE[which(USpopESTIMATE$Year==2018),]
USpopESTIMATE2018$StateValue<-as.numeric(USpopESTIMATE2018$StateValue)
USpopESTIMATE2018$RegionValue<-as.numeric(USpopESTIMATE2018$RegionValue)
USpopESTIMATE2018$NationalValue<-as.numeric(USpopESTIMATE2018$NationalValue)
USpopESTIMATE2018$NatPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$NationalValue
USpopESTIMATE2018$RegPer <- USpopESTIMATE2018$StateValue/USpopESTIMATE2018$RegionValue

USpopESTIMATE2018$statefull<-tolower(USpopESTIMATE2018$NAME)

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey49")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=NatPer), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
  guides(fill=guide_legend(title="Nat Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("National Population Percentage by State, 2018")


png(file="Project2c.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()
## png 
##   2
knitr::include_graphics("Project2c.png")

It’s apparent that the states with the greatest populations are California, Texas, Florida, and New York. Even with New York’s estimated population decreasing, it still accounts for one of the largest shares of the national population.
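That reading can be checked numerically; a brief sketch, assuming the NatPer column computed above:

#Largest shares of national population, 2018 (sketch).
USpopESTIMATE2018 %>%
  select(NAME, NatPer) %>%
  arrange(desc(NatPer)) %>%
  head(5)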


Map: population percentage by region

Let’s map each state’s percentage of its region’s population.

data("fifty_states")

pr<-map_data('worldHires','Puerto Rico')
pr<-subset(pr,long<0) 
prmap<-ggplot(USpopESTIMATE2018)+geom_polygon(data=pr,aes(long,lat,group=group),fill="grey98")+
  coord_fixed(1.0)+
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-68, -65), ylim = c(18.6,17.8))+
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())

Total_plot<-ggplot(USpopESTIMATE2018, aes(map_id=statefull)) + 
  geom_map(aes(fill=RegPer), map=fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map(projection = "mercator", xlim = c(-125, -65), ylim = c(50,23)) +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        rect = element_blank())+
  scale_fill_viridis(breaks=c(.05,.1,.15,.20),
labels=c('0-5%','6-10%','11-15%','15% +'),begin=.3,end=1,option="cividis")+
  guides(fill=guide_legend(title="Region Pop % by state/terr",size="legend",title.theme=element_text(size=9,angle=0)))+
  ggtitle("Regional Population Percentage by State, 2018")

png(file="Project2d.png",w=4000,h=4000,res=500,bg="transparent")
grid.newpage()
v1<-viewport(width = 1, height = 1, x = 0.5, y = 0.5) #plot area for the main map
v4<-viewport(width = 0.12, height = 0.12, x = 0.48, y = 0.30) #plot area for the inset map
print(Total_plot,vp=v1) 
print(prmap,vp=v4)
dev.off()
## png 
##   2
knitr::include_graphics("Project2d.png")

Conclusions

If we omit Puerto Rico, which has no region or division assignment, we see that California, as expected, dominates the West region, while New York dominates the Northeast to a lesser degree. Texas is most populous in the South, and Illinois carries the Midwest by only a slight margin.
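These regional observations can be backed up with a short sketch, assuming the RegPer column computed earlier (Puerto Rico drops out because its RegionValue is NA):

#Most populous state within each region, by share of regional population (sketch).
USpopESTIMATE2018 %>%
  filter(!is.na(RegPer)) %>%
  group_by(RegionName) %>%
  top_n(1, RegPer) %>%
  select(RegionName, NAME, RegPer) %>%
  arrange(RegionName)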

Such data could be used to form an argument about representation in Congress: should a state such as California, with 12.1% of the national population, be represented by only two senators? At a local level, these data could be combined with demographic data to understand why some areas are experiencing negative growth; Puerto Rico, for example, was devastated by Hurricane Maria in 2017.