DATA 698 : Capstone Research Project

1 Overview

We are in totally different era of twenty first century, and it gives us very rare situation where any positive news would help the humankind. We want to use the historical CPI data and find the relation of it with the employment, in hope that we would have some positive news on employment by following the trend of the data in past.

We feel that lower CPI would result in more job opportunity, as the it gives space for more competition in small business across sectors.

2 Capstone Project on CPI and Employment

  1. The estimates of employment for 1998-2006 are based on the 2002 North American Industry Classification System (NAICS). The estimates for 2007-2010 are based on the 2007 NAICS. The estimates for 2011-2016 are based on the 2012 NAICS. The estimates for 2017 forward are based on the 2017 NAICS.
  2. Excludes limited partners.
  3. Under the 2007 NAICS, internet publishing and broadcasting was reclassified to other information services.
  • (NA) Not available.
  • (NM) Not meaningful.
    1. Not shown to avoid disclosure of confidential information; estimates are included in higher-level totals.
    1. Estimate for employment suppressed to cover corresponding estimate for earnings. Estimates for this item are included in the total.

Last updated: September 24, 2019- new statistics for 2018; revised statistics for 2014-2017.

3 Data Preparation

Load the required libraries

#packages <- c("R.rsp","pandocfilters","knitr","plyr","tidyr","dplyr","ggplot2","plotly","sqldf","MASS","reshape2","Amelia","mice","googleVis","stringi","ROCR","scatterplot3d","visdat","gridExtra","sjmisc")
#install.packages (packages, repos="http://cran.us.r-project.org", dependencies=TRUE, lib="C:/Program Files/R/R-3.6.3/library")
#invisible (lapply (packages, library, character.only = T))
#install.packages("devtools", lib="C:/Program Files/R/R-3.6.3/library")
#devtools::install_github("hadley/ggplot2")

library(tidyverse)
library(kableExtra)
library(lubridate)
library(forecast)
library(stringi)
library(dplyr)

3.1 Load Employment Datasets

file_path <- "../data/cpi/SAINC/"

EMP_DATA_FILE <- paste0(file_path,"EMP9818.csv")
EMP_DATA <- read_delim(EMP_DATA_FILE, delim = ",", col_names = TRUE, trim_ws= TRUE) 
#,skip = 6,col_types = list(col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character(),col_character()))
# names(SAP_INV_DAT) = c("DROP","I_BILLDATE","I_ORG","I_DC","I_DOCCA","I_BILLTYPE","I_ORDER_REA","I_DIV","I_MATYPE","I_BILL_QTY","BLANK")
# SAP_INV_DAT<- SAP_INV_DAT[-which(is.na(SAP_INV_DAT$I_ORG)),] %>% .[-str_which(trimws(.$I_ORG),'SOrg.|-----'),c(-1,-11)]
# 3
# SAP_INV_DAT$I_BILLDATE <- date(parse_datetime(SAP_INV_DAT$I_BILLDATE, "%m/%d/%Y"))
# 
# head(SAP_INV_DAT)
summary(EMP_DATA)
##    GeoFIPS            GeoName              Region       TableName        
##  Length:7084        Length:7084        Min.   :1.000   Length:7084       
##  Class :character   Class :character   1st Qu.:3.000   Class :character  
##  Mode  :character   Mode  :character   Median :5.000   Mode  :character  
##                                        Mean   :4.475                     
##                                        3rd Qu.:6.000                     
##                                        Max.   :8.000                     
##                                        NA's   :122                       
##     LineCode      IndustryClassification Description            Unit          
##  Min.   :  10.0   Length:7084            Length:7084        Length:7084       
##  1st Qu.: 517.0   Class :character       Class :character   Class :character  
##  Median : 712.5   Mode  :character       Mode  :character   Mode  :character  
##  Mean   : 872.1                                                               
##  3rd Qu.:1103.0                                                               
##  Max.   :2012.0                                                               
##  NA's   :4                                                                    
##      1998               1999               2000               2001          
##  Length:7084        Length:7084        Length:7084        Length:7084       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      2002               2003               2004               2005          
##  Length:7084        Length:7084        Length:7084        Length:7084       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      2006               2007               2008               2009          
##  Length:7084        Length:7084        Length:7084        Length:7084       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      2010               2011               2012               2013          
##  Length:7084        Length:7084        Length:7084        Length:7084       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      2014               2015               2016               2017          
##  Length:7084        Length:7084        Length:7084        Length:7084       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      2018          
##  Length:7084       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
DT::datatable(EMP_DATA)
# Reading the Category and detail Desc for ref.
EMP_DEF_FILE <- paste0(file_path,"SAEMP25N__definition.xml")
IND_DESC <- XML::xmlToDataFrame(EMP_DEF_FILE)
names(IND_DESC) <- c("LineCode","Description")
glimpse(IND_DESC)
## Rows: 118
## Columns: 3
## $ LineCode    <fct> 10, 20, 40, 50, 60, 70, 80, 90, 100, 101, 102, 103, 200...
## $ Description <fct> "Total employment (number of jobs)", "Wage and salary e...
## $ NA          <fct> "A count of jobs, both full-time and part-time. It incl...
head(IND_DESC,5)

3.2 Show Relevant Employment Data

# MN-MAIN  | NN - RANGE  | NL - LAST |
EMAP_CAT_LAST <- EMP_DATA[,c(2,5,6,7)] %>% 
                  filter(GeoName=="United States") %>% 
                  mutate(ISM = ifelse(IndustryClassification=="...","M","N")) %>% 
                  separate(IndustryClassification, c("From", "To"))  %>% 
                  mutate(ISML = ifelse(is.na(To)==T,"L","N")) %>% 
                  filter(ISML %in% c("L","N")) %>%  
                  select(.,LineCode,Description)
#DT::datatable(EMAP_CAT_LAST)

EMP_DAT_CAT <- EMP_DATA[,c(2,5,9:29)] %>% 
                left_join(EMAP_CAT_LAST,.,by="LineCode")
glimpse(EMP_DAT_CAT)
## Rows: 7,080
## Columns: 24
## $ LineCode    <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,...
## $ Description <chr> "Total employment (number of jobs)", "Total employment ...
## $ GeoName     <chr> "United States", "Alabama", "Alaska", "Arizona", "Arkan...
## $ `1998`      <chr> "158481200", "2361892", "382166", "2616288", "1445536",...
## $ `1999`      <chr> "161531300", "2378217", "381307", "2695892", "1460374",...
## $ `2000`      <chr> "165370800", "2392225", "389734", "2801510", "1482449",...
## $ `2001`      <chr> "165522200", "2376053", "394565", "2829002", "1482587",...
## $ `2002`      <chr> "165095100", "2364828", "402187", "2847095", "1478929",...
## $ `2003`      <chr> "165921500", "2371430", "405621", "2917121", "1482035",...
## $ `2004`      <chr> "168839700", "2425649", "413864", "3041476", "1505095",...
## $ `2005`      <chr> "172338400", "2486833", "421419", "3219820", "1537680",...
## $ `2006`      <chr> "175868600", "2545556", "431320", "3375218", "1567682",...
## $ `2007`      <chr> "179543700", "2604078", "439825", "3465075", "1582858",...
## $ `2008`      <chr> "179213900", "2582591", "443538", "3402808", "1579283",...
## $ `2009`      <chr> "173636700", "2479507", "442447", "3228493", "1542944",...
## $ `2010`      <chr> "172901700", "2460298", "443904", "3181571", "1541272",...
## $ `2011`      <chr> "176091700", "2497933", "450364", "3239045", "1561948",...
## $ `2012`      <chr> "178979700", "2503678", "459222", "3295537", "1565142",...
## $ `2013`      <chr> "182325100", "2523338", "461110", "3371219", "1569249",...
## $ `2014`      <chr> "186233800", "2551872", "461327", "3448173", "1587414",...
## $ `2015`      <chr> "190315800", "2586885", "461767", "3548174", "1610779",...
## $ `2016`      <chr> "193371900", "2619154", "457371", "3646604", "1629237",...
## $ `2017`      <chr> "196825300", "2653968", "456799", "3751283", "1644432",...
## $ `2018`      <chr> "200746000", "2691517", "459178", "3859137", "1663188",...
head(EMP_DAT_CAT,130)
#DT::datatable(EMP_DAT_CAT)

temp_name <- names(EMP_DAT_CAT [,c(4:24)])
EMP_DAT_CAT_SAN <-  data.frame(lapply(EMP_DAT_CAT [,c(4:24)],function(x) {
                      gsub("(D)|(T)",NA,x)
                    })) 
names(EMP_DAT_CAT_SAN) = temp_name
#DT::datatable(EMP_DAT_CAT_SAN)

EMP_DAT_CAT_SAN <- cbind(EMP_DAT_CAT[,c(1:3)],EMP_DAT_CAT_SAN)
#DT::datatable(EMP_DAT_CAT_SAN)

3.2.1 Employment in New York from Select Industry

EMP_DAT_CAT_SAN_NY <- EMP_DAT_CAT_SAN %>% 
                        filter(LineCode %in% c(705,1100,800,1600,1500)) %>% 
                        filter(GeoName=="New York") %>% 
                        reshape2::melt(
                          id.vars = c("LineCode","Description","GeoName"), 
                          variable.name = "Year", 
                          value.name = "Number of jobs")

EMP_DAT_CAT_SAN_NY %>%
  #sample_frac(0.33) %>% 
  DT::datatable()

3.3 Load CPI Data

file_path <- "../data/cpi/archive/"

## Read files named *.csv
filenames <- list.files(path=file_path,
                        pattern="*.csv",
                        full.names = TRUE)
#print(filenames)

##Create list of data frame names without the ".csv" part 
names <- gsub(pattern='\\.', 
              replacement='_', # remove the last 4 characters from filename (_clean)
              x=(sub(pattern = "(.*)\\..*$", replacement = "\\1", # remove extenson from filename (.csv)
                    basename(filenames) # get filename without the directory file path
                   )
                 )
              )
#print(names)

###Load all files
for(i in filenames){
  ## Create list of data frame names without the ".csv" part 
  name <- gsub(pattern='\\.', 
               replacement='_', # remove the last 4 characters from filename (_clean)
               x=(sub(pattern = "(.*)\\..*$", replacement = "\\1", # remove extenson from filename (.csv)
                      basename(i) # get filename without the directory file path
                     )
                 )
              )
   print(name)
   assign(x=name, value=data.frame(read.csv(i)))
   #DT::datatable(name)
}
## [1] "cu_area"
## [1] "cu_base"
## [1] "cu_data_0_Current"
## [1] "cu_data_1_AllItems"
## [1] "cu_data_10_OtherWest"
## [1] "cu_data_11_USFoodBeverage"
## [1] "cu_data_12_USHousing"
## [1] "cu_data_13_USApparel"
## [1] "cu_data_14_USTransportation"
## [1] "cu_data_15_USMedical"
## [1] "cu_data_16_USRecreation"
## [1] "cu_data_17_USEducationAndCommunication"
## [1] "cu_data_18_USOtherGoodsAndServices"
## [1] "cu_data_19_PopulationSize"
## [1] "cu_data_2_Summaries"
## [1] "cu_data_20_USCommoditiesServicesSpecial"
## [1] "cu_data_3_AsizeNorthEast"
## [1] "cu_data_4_AsizeNorthCentral"
## [1] "cu_data_5_AsizeSouth"
## [1] "cu_data_6_AsizeWest"
## [1] "cu_data_7_OtherNorthEast"
## [1] "cu_data_8_OtherNorthCentral"
## [1] "cu_data_9_OtherSouth"
## [1] "cu_footnote"
## [1] "cu_item"
## [1] "cu_period"
## [1] "cu_periodicity"
## [1] "cu_series"

3.4 Show CPI Reference Data

3.4.1 Area

3.4.2 Base

3.4.3 Item

3.4.4 Periodicity

3.4.5 Series

3.4.6 Period

3.5 Combined Reference Data

3.6 Show CPI Data from Select Industry

3.6.1 CPI - US Food Beverages

FoodItemCode <- cu_reference_data %>%
                    filter(grepl('SAF|SEF|SS0|SS1|SS20', item_code))
                    #dplyr::filter(item_code %like% 'SAH' | item_code %like% 'SEH')
#DT::datatable((FoodItemCode))

cu_data_USFoodBeverage = cu_data_0_Current %>% 
                              inner_join(FoodItemCode, by="series_id") %>% 
                              inner_join(cu_period, by="period") %>% 
                               filter(year %in% (1998:2018)) %>% 
                               select(c("series_id",
                                        "series_title",
                                        "year",
                                        "period",
                                        "period_abbr",
                                        "period_name",
                                        "value",
                                        "area_code",
                                        "area_name",
                                        "area_display_level",
                                        "area_selectable",
                                        "area_sort_sequence",
                                        "item_code",
                                        "item_name",
                                        "item_display_level",
                                        "item_selectable",
                                        "item_sort_sequence",
                                        "seasonal",
                                        "periodicity_code",
                                        "periodicity_name",
                                        "base_code",
                                        "base_name",
                                        "base_period"))
DT::datatable(head(cu_data_USFoodBeverage))

3.6.2 CPI - US Housing

HousingItemCode <- cu_reference_data %>%
                    #filter(grepl('SAH|SEH', item_code, fixed = TRUE))
                    filter(grepl('SAH|SEH', item_code))
                    #dplyr::filter(item_code %like% 'SAH' | item_code %like% 'SEH')
#DT::datatable((HousingItemCode))

cu_data_USHousing = cu_data_0_Current %>% 
                              inner_join(HousingItemCode, by="series_id") %>% 
                              inner_join(cu_period, by="period") %>% 
                               #filter(item_code %like% c('SAH','SEH')) %>%
                               filter(year %in% (1998:2018)) %>% 
                               select(c("series_id",
                                        "series_title",
                                        "year",
                                        "period",
                                        "period_abbr",
                                        "period_name",
                                        "value",
                                        "area_code",
                                        "area_name",
                                        "area_display_level",
                                        "area_selectable",
                                        "area_sort_sequence",
                                        "item_code",
                                        "item_name",
                                        "item_display_level",
                                        "item_selectable",
                                        "item_sort_sequence",
                                        "seasonal",
                                        "periodicity_code",
                                        "periodicity_name",
                                        "base_code",
                                        "base_name",
                                        "base_period"))
DT::datatable(head(cu_data_USHousing))

3.6.3 CPI - US Transportation

TransportItemCode <- cu_reference_data %>%
                    filter(grepl('SAT|SET|SS4|SS52|SS53|SAS24|SAS4', item_code))
                    #dplyr::filter(item_code %like% 'SAH' | item_code %like% 'SEH')
#DT::datatable((TransportItemCode))

cu_data_USTransportation = cu_data_0_Current %>% 
                              inner_join(TransportItemCode, by="series_id") %>% 
                              inner_join(cu_period, by="period") %>% 
                               filter(year %in% (1998:2018)) %>% 
                               select(c("series_id",
                                        "series_title",
                                        "year",
                                        "period",
                                        "period_abbr",
                                        "period_name",
                                        "value",
                                        "area_code",
                                        "area_name",
                                        "area_display_level",
                                        "area_selectable",
                                        "area_sort_sequence",
                                        "item_code",
                                        "item_name",
                                        "item_display_level",
                                        "item_selectable",
                                        "item_sort_sequence",
                                        "seasonal",
                                        "periodicity_code",
                                        "periodicity_name",
                                        "base_code",
                                        "base_name",
                                        "base_period"))
DT::datatable(head(cu_data_USTransportation))

3.6.4 CPI - US Medical

MedicalItemCode <- cu_reference_data %>%
                    filter(grepl('SAM|SEM|SS57', item_code))
                    #dplyr::filter(item_code %like% 'SAH' | item_code %like% 'SEH')
#DT::datatable((MedicalItemCode))

cu_data_USMedical = cu_data_0_Current %>% 
                              inner_join(MedicalItemCode, by="series_id") %>% 
                              inner_join(cu_period, by="period") %>% 
                               filter(year %in% (1998:2018)) %>% 
                               select(c("series_id",
                                        "series_title",
                                        "year",
                                        "period",
                                        "period_abbr",
                                        "period_name",
                                        "value",
                                        "area_code",
                                        "area_name",
                                        "area_display_level",
                                        "area_selectable",
                                        "area_sort_sequence",
                                        "item_code",
                                        "item_name",
                                        "item_display_level",
                                        "item_selectable",
                                        "item_sort_sequence",
                                        "seasonal",
                                        "periodicity_code",
                                        "periodicity_name",
                                        "base_code",
                                        "base_name",
                                        "base_period"))
DT::datatable(head(cu_data_USMedical))

3.6.5 CPI - US Education And Communication

EducationItemCode <- cu_reference_data %>%
                    filter(grepl('SAE|SEE|SSE|SS27', item_code))
                    #dplyr::filter(item_code %like% 'SAH' | item_code %like% 'SEH')
#DT::datatable((EducationItemCode))

cu_data_USEducationAndCommunication = cu_data_0_Current %>% 
                                        inner_join(EducationItemCode, by="series_id") %>% 
                                        inner_join(cu_period, by="period") %>% 
                                         filter(year %in% (1998:2018)) %>% 
                                         select(c("series_id",
                                                  "series_title",
                                                  "year",
                                                  "period",
                                                  "period_abbr",
                                                  "period_name",
                                                  "value",
                                                  "area_code",
                                                  "area_name",
                                                  "area_display_level",
                                                  "area_selectable",
                                                  "area_sort_sequence",
                                                  "item_code",
                                                  "item_name",
                                                  "item_display_level",
                                                  "item_selectable",
                                                  "item_sort_sequence",
                                                  "seasonal",
                                                  "periodicity_code",
                                                  "periodicity_name",
                                                  "base_code",
                                                  "base_name",
                                                  "base_period"))
DT::datatable(head(cu_data_USEducationAndCommunication))

4 Employment vs CPI in NY - by Industry

4.1 Food & Beverages

cu_emp_data_USFoodBeverage_NY = cu_data_USFoodBeverage %>% 
                                  filter(area_name=="New York-Northern New Jersey-Long Island", 
                                         period %in% c("M01","M02","M03","M04","M05","M06","M07","M08","M09","M10","M11","M12")) %>% 
                                  mutate(parent_item_code = 'SAF', 
                                         emp_line_code = 705) %>% 
                                   select(c("parent_item_code",
                                            "emp_line_code",
                                            "series_id",
                                            "series_title",
                                            "year",
                                            "period",
                                            "period_abbr",
                                            "period_name",
                                            "value",
                                            "area_code",
                                            "area_name",
                                            "area_display_level",
                                            "area_selectable",
                                            "area_sort_sequence",
                                            "item_code",
                                            "item_name",
                                            "item_display_level",
                                            "item_selectable",
                                            "item_sort_sequence",
                                            "seasonal",
                                            "periodicity_code",
                                            "periodicity_name",
                                            "base_code",
                                            "base_name",
                                            "base_period")) %>% 
                                  group_by(parent_item_code, emp_line_code, year, area_code, area_name) %>% 
                                  summarise(cpi_value = sum(value))
#cu_emp_data_USFoodBeverage_NY %>%
#  sample_frac(0.33) %>% 
#  DT::datatable()

#Food and beverage stores
#EMP_DAT_CAT_SAN_USFoodBeverage_NY = EMP_DAT_CAT_SAN_NY %>%
#                                      filter(LineCode==705)%>%
#                                      mutate(parent_line_code = 705)

cpi_emp_data_USFoodBeverage_NY = merge(EMP_DAT_CAT_SAN_NY, cu_emp_data_USFoodBeverage_NY, 
                                       by.x=c("LineCode","Year"), by.y=c("emp_line_code","year"))

cpi_emp_data_USFoodBeverage_NY %>%
#  sample_frac(0.33) %>% 
  DT::datatable()

4.2 Housing

cu_emp_data_USHousing_NY = cu_data_USHousing  %>% 
                            filter(area_name=="New York-Northern New Jersey-Long Island", 
                                   period %in% c("M01","M02","M03","M04","M05","M06","M07","M08","M09","M10","M11","M12")) %>% 
                            mutate(parent_item_code = 'SAH', 
                                   emp_line_code = 1100) %>% 
                            select(c("parent_item_code",
                                     "emp_line_code",
                                     "series_id",
                                     "series_title",
                                     "year",
                                     "period",
                                     "period_abbr",
                                     "period_name",
                                     "value",
                                     "area_code",
                                     "area_name",
                                     "area_display_level",
                                     "area_selectable",
                                     "area_sort_sequence",
                                     "item_code",
                                     "item_name",
                                     "item_display_level",
                                     "item_selectable",
                                     "item_sort_sequence",
                                     "seasonal",
                                     "periodicity_code",
                                     "periodicity_name",
                                     "base_code",
                                     "base_name",
                                     "base_period")) %>% 
                            group_by(parent_item_code, emp_line_code, year, area_code, area_name) %>% 
                            summarise(cpi_value = sum(value))
#cu_emp_data_USHousing_NY %>%
#  sample_frac(0.33) %>% 
#  DT::datatable()

#Real estate and rental and leasing
#EMP_DAT_CAT_SAN_USHousing_NY = EMP_DAT_CAT_SAN_NY %>%
#                                  filter(LineCode %in% c(1100,1101,1102,1103)) %>%
#                                  mutate(parent_line_code = 1100)

cpi_emp_data_USHousing_NY = merge(EMP_DAT_CAT_SAN_NY, cu_emp_data_USHousing_NY, 
                                  by.x=c("LineCode","Year"), by.y=c("emp_line_code","year"))

cpi_emp_data_USHousing_NY %>%
#  sample_frac(0.33) %>% 
  DT::datatable()

4.3 Transportation

cu_emp_data_USTransportation_NY = cu_data_USTransportation %>% 
                                    filter(area_name=="New York-Northern New Jersey-Long Island", 
                                           period %in% c("M01","M02","M03","M04","M05","M06","M07","M08","M09","M10","M11","M12")) %>% 
                                    mutate(parent_item_code = 'SAT', 
                                           emp_line_code = 800) %>% 
                                    select(c("parent_item_code",
                                             "emp_line_code",
                                             "series_id",
                                             "series_title",
                                             "year",
                                             "period",
                                             "period_abbr",
                                             "period_name",
                                             "value",
                                             "area_code",
                                             "area_name",
                                             "area_display_level",
                                             "area_selectable",
                                             "area_sort_sequence",
                                             "item_code",
                                             "item_name",
                                             "item_display_level",
                                             "item_selectable",
                                             "item_sort_sequence",
                                             "seasonal",
                                             "periodicity_code",
                                             "periodicity_name",
                                             "base_code",
                                             "base_name",
                                             "base_period")) %>% 
                                    group_by(parent_item_code, emp_line_code, year, area_code, area_name) %>% 
                                    summarise(cpi_value = sum(value))
#cu_emp_data_USTransportation_NY %>%
#  sample_frac(0.33) %>% 
#  DT::datatable()

#Transportation and warehousing
#EMP_DAT_CAT_SAN_USTransportation_NY = EMP_DAT_CAT_SAN_NY %>%
#                                        filter(LineCode %in% c(800,801,802,803,804,805,806,807,808,809,811)) %>%
#                                        mutate(parent_line_code = 800)

cpi_emp_data_USTransportation_NY = merge(EMP_DAT_CAT_SAN_NY, cu_emp_data_USTransportation_NY, 
                                         by.x=c("LineCode","Year"), by.y=c("emp_line_code","year"))

cpi_emp_data_USTransportation_NY %>%
#  sample_frac(0.33) %>% 
  DT::datatable()

4.4 Medical

cu_emp_data_USMedical_NY = cu_data_USMedical %>% 
                            filter(area_name=="New York-Northern New Jersey-Long Island", 
                                   period %in% c("M01","M02","M03","M04","M05","M06","M07","M08","M09","M10","M11","M12")) %>% 
                            mutate(parent_item_code = 'SAM', 
                                   emp_line_code = 1600) %>% 
                            select(c("parent_item_code",
                                     "emp_line_code",
                                     "series_id",
                                     "series_title",
                                     "year",
                                     "period",
                                     "period_abbr",
                                     "period_name",
                                     "value",
                                     "area_code",
                                     "area_name",
                                     "area_display_level",
                                     "area_selectable",
                                     "area_sort_sequence",
                                     "item_code",
                                     "item_name",
                                     "item_display_level",
                                     "item_selectable",
                                     "item_sort_sequence",
                                     "seasonal",
                                     "periodicity_code",
                                     "periodicity_name",
                                     "base_code",
                                     "base_name",
                                     "base_period")) %>% 
                            group_by(parent_item_code, emp_line_code, year, area_code, area_name) %>% 
                            summarise(cpi_value = sum(value))
#cu_emp_data_USMedical_NY %>%
#  sample_frac(0.33) %>% 
#  DT::datatable()

#Health care and social assistance
#EMP_DAT_CAT_SAN_USTransportation_NY = EMP_DAT_CAT_SAN_NY %>%
#                                        filter(LineCode %in% c(1600,1601,1602,1603,1604)) %>%
#                                        mutate(parent_line_code = 1600)

cpi_emp_data_USMedical_NY = merge(EMP_DAT_CAT_SAN_NY, cu_emp_data_USMedical_NY, 
                                  by.x=c("LineCode","Year"), by.y=c("emp_line_code","year"))

cpi_emp_data_USMedical_NY %>%
#  sample_frac(0.33) %>% 
  DT::datatable()

4.5 Education & Communication

cu_data_USEducationAndCommunication_NY = cu_data_USEducationAndCommunication %>% 
                                          filter(area_name=="New York-Northern New Jersey-Long Island", 
                                                 period %in% c("M01","M02","M03","M04","M05","M06","M07","M08","M09","M10","M11","M12")) %>% 
                                          mutate(parent_item_code = 'SAE', emp_line_code = 1500) %>% 
                                          select(c("parent_item_code",
                                                   "emp_line_code",
                                                   "series_id",
                                                   "series_title",
                                                   "year",
                                                   "period",
                                                   "period_abbr",
                                                   "period_name",
                                                   "value",
                                                   "area_code",
                                                   "area_name",
                                                   "area_display_level",
                                                   "area_selectable",
                                                   "area_sort_sequence",
                                                   "item_code",
                                                   "item_name",
                                                   "item_display_level",
                                                   "item_selectable",
                                                   "item_sort_sequence",
                                                   "seasonal",
                                                   "periodicity_code",
                                                   "periodicity_name",
                                                   "base_code",
                                                   "base_name",
                                                   "base_period")) %>% 
                                          group_by(parent_item_code, emp_line_code, year, area_code, area_name) %>% 
                                          summarise(cpi_value = sum(value))
#cu_data_USEducationAndCommunication_NY %>%
#  sample_frac(0.33) %>% 
#  DT::datatable()

#Educational services
#EMP_DAT_CAT_SAN_USEducationAndCommunication_NY = EMP_DAT_CAT_SAN_NY %>%
#                                                   filter(LineCode %in% c(1500)) %>%
#                                                   mutate(parent_line_code = 1500)

cpi_emp_data_USEducationAndCommunication_NY = merge(EMP_DAT_CAT_SAN_NY, cu_data_USEducationAndCommunication_NY, 
                                                    by.x=c("LineCode","Year"), by.y=c("emp_line_code","year"))

cpi_emp_data_USEducationAndCommunication_NY %>%
#  sample_frac(0.33) %>% 
  DT::datatable()
cpi_emp_data_USFoodBeverage_NY  %>% select(c("Year","Sector"= "Description","No_Jobs"="Number of jobs","cpi_value")) %>%   gather(key="Type", value ="Value", "cpi_value","No_Jobs") %>% ggplot(mapping = aes(x= Year, y=(as.numeric(Value)),fill = Type)) + 
geom_col(position = position_dodge()) +
   theme(axis.text.x = element_text(angle = 60, colour="gray",hjust = 1,size=rel(0.86)))+
  labs(title = "USSFoodBeverage Data for No. Of Jobs and CPI Value in NY",
       y = "Sector",x= "Years") 

cpi_emp_data_USFoodBeverage_NY  %>% select(c("Year","Sector"= "Description","No_Jobs"="Number of jobs")) %>% ggplot(mapping = aes(x= Year, y=(as.numeric(No_Jobs)))) + geom_col()+    theme(axis.text.x = element_text(angle = 60, colour="gray",hjust = 1,size=rel(0.86)))+
  labs(title = "USSFoodBeverage Data for No. Of Jobs  in NY",
       y = "Sector",x= "Years") 

cpi_emp_data_USFoodBeverage_NY  %>% select(c("Year","Sector"= "Description","cpi_value")) %>% ggplot(mapping = aes(x= Year, y=(as.numeric(cpi_value)))) + geom_col()+    theme(axis.text.x = element_text(angle = 60, colour="gray",hjust = 1,size=rel(0.86)))+
  labs(title = "USSFoodBeverage Data for CPI Value in NY",
       y = "Sector",x= "Years") 

# summary(as.numeric(cpi_emp_data_USFoodBeverage_NY$`Number of jobs`))
# summary(as.numeric(cpi_emp_data_USFoodBeverage_NY$cpi_value))
# 
# quantile(as.numeric(cpi_emp_data_USFoodBeverage_NY$`Number of jobs`))
# levels(quantile(as.numeric(cpi_emp_data_USFoodBeverage_NY$cpi_value)))

Debabrata Kabiraj, Joseph Simone and Rajwant Mishra

Oct 1, 2020