This document illustrates labor demand over the last few months across the country, in CA, and for a handful of selected jobs. The 10 most populated cities in each state were selected, and job advertisement header information for those cities was pulled from Indeed.com on a weekly basis from July 20, 2020 to August 17, 2020. Indeed.com tags each posting with a ‘days posted’ value of ‘Just posted’, ‘Today’, or a number of days up to 30. Anything posted more than 30 days ago is also categorized as 30 days, so postings tagged 30 were removed. What remains is a count of the postings that were fresh on each day, from June 21, 2020 (29 days before the first pull on July 20th) through the last pull on August 17, 2020. R was used to web scrape, calculate, and document the steps. The R Markdown file is in my GitHub repo on state licensing for LMTs at github.com/janjanjan2018.
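The date window described above can be checked with a few lines of base R. This is a minimal sketch on made-up values; `pull_date` and `days_ago` are illustrative names, not objects from the analysis.

```r
# Hypothetical example values: one weekly pull and a few 'days posted' labels.
pull_date <- as.Date("2020-07-20")
days_ago  <- c(0, 1, 14, 29, 30)

# Indeed.com lumps everything older than 30 days into "30",
# so those rows carry no reliable date and are dropped.
keep <- days_ago != 30

# Posting date = pull date minus the number of days since posting.
posted <- pull_date - days_ago[keep]
print(posted)

# The oldest posting date that can be recovered from the first pull.
earliest <- min(posted)
print(earliest)
```

The earliest recoverable date from the July 20th pull is June 21, 2020, which matches the start of the time series used in the charts.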

The images below are snapshots of the interactive charts, which you are encouraged to explore at my Tableau Public Server repository at tableau.com/janisharris. Click an image and it will open the chart in a separate window. You can keep that window open while reading the image notes here, and look through the charts to confirm what you see or to find something new about the labor demands of the country and our state.

The selected jobs that are compared to LMT job demand are nannies, nurses, personal assistants, security, and warehouse workers. The nurse data missed the August 3rd pull because the file was renamed when moving files and the pull never ran. The other weekly pulls partly make up for it, because each posting says how many days ago it was posted. You will see maps of the country, the state, and the 10 most populated cities according to worldpopulation.org, as well as bar charts, stacked bar charts, line charts, area line charts, and more. Note that on the area line chart it doesn’t matter which category is on top, only how much area of its color is shown. That area is the number of jobs posted on a given date for each job category in the time series.

combining the 5 weeks of data

machine learning with rpart, caret’s lm, and gbm

images

The following links are outside references for more information on the models that were used, or that were attempted but took too much computational power to run readily.

randomForest

rpart

gbm

e1071 caret

library(dplyr)#grouping,filtering, aggregate statistics
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate) # date objects
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(rpart) #recursive partitioning and regression trees
library(caret)#CART based,classification and regression training
## Loading required package: lattice
## Loading required package: ggplot2
library(MASS) # Modern Applied Statistics with S
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(gbm) #generalized boosted regression models similar to adaboost
## Loaded gbm 2.1.5
library(e1071) #SVM and statistics naive poisson etc models
library(party)
## Warning: package 'party' was built under R version 3.6.3
## Loading required package: grid
## Loading required package: mvtnorm
## Warning: package 'mvtnorm' was built under R version 3.6.3
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 3.6.3
## Loading required package: stats4
## Loading required package: strucchange
## Warning: package 'strucchange' was built under R version 3.6.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 3.6.3
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine

Preprocessing

lmt1 <- read.csv('LMT-7-20-2020.csv',sep=',',header=F,
                 na.strings=c('',' ','NA'),stringsAsFactors = F)

lmt2 <- read.csv('LMT-7-27-2020.csv',sep=',',header=F,
                 na.strings=c('',' ','NA'),stringsAsFactors = F)
lmt3 <- read.csv('LMT-8-3-2020.csv',sep=',',header=F,
                 na.strings=c('',' ','NA'),stringsAsFactors = F)
lmt4 <- read.csv('LMT-8-10-2020.csv',sep=',',header=F,
                 na.strings=c('',' ','NA'),stringsAsFactors = F)
lmt5 <- read.csv('LMT-8-17-2020.csv',sep=',',header=F,
                 na.strings=c('',' ','NA'),stringsAsFactors = F)

header <- read.csv('tableJobsHeader.csv')
colnames(lmt1) <- header$x
colnames(lmt2) <- header$x
colnames(lmt3) <- header$x
colnames(lmt4) <- header$x
colnames(lmt5) <- header$x

lmt1$state <- toupper(lmt1$state)
lmt2$state <- toupper(lmt2$state)
lmt3$state <- toupper(lmt3$state)
lmt4$state <- toupper(lmt4$state)
lmt5$state <- toupper(lmt5$state)
unique(lmt1$date_daysAgo)
##  [1] "30"          "14"          "10"          "12"          "3"          
##  [6] "27"          "11"          "6"           "4"           "Just posted"
## [11] "19"          "5"           "18"          "23"          "2"          
## [16] "17"          "25"          "26"          "13"          "24"         
## [21] "20"          "28"          "22"          "15"          "7"          
## [26] "21"          "Today"       "8"           "9"           "1"          
## [31] "16"          "29"
lmt1$date_daysAgo <- gsub('Just posted',0,lmt1$date_daysAgo)
lmt1$date_daysAgo <- gsub('Today',0,lmt1$date_daysAgo)

lmt2$date_daysAgo <- gsub('Just posted',0,lmt2$date_daysAgo)
lmt2$date_daysAgo <- gsub('Today',0,lmt2$date_daysAgo)

lmt3$date_daysAgo <- gsub('Just posted',0,lmt3$date_daysAgo)
lmt3$date_daysAgo <- gsub('Today',0,lmt3$date_daysAgo)

lmt4$date_daysAgo <- gsub('Just posted',0,lmt4$date_daysAgo)
lmt4$date_daysAgo <- gsub('Today',0,lmt4$date_daysAgo)

lmt5$date_daysAgo <- gsub('Just posted',0,lmt5$date_daysAgo)
lmt5$date_daysAgo <- gsub('Today',0,lmt5$date_daysAgo)
lmt1$date_daysAgo <- as.numeric(paste(lmt1$date_daysAgo))
lmt2$date_daysAgo <- as.numeric(paste(lmt2$date_daysAgo))
lmt3$date_daysAgo <- as.numeric(paste(lmt3$date_daysAgo))
lmt4$date_daysAgo <- as.numeric(paste(lmt4$date_daysAgo))
lmt5$date_daysAgo <- as.numeric(paste(lmt5$date_daysAgo))
lmt1s <- strsplit(lmt1$todaysDate,split=' ')
lmt1$wday <- as.character(lapply(lmt1s,'[',1))
lmt1$month   <- as.character(lapply(lmt1s,'[',2))
lmt1$dayMonth <- as.character(lapply(lmt1s, '[',3))
lmt1$timeHMS  <- as.character(lapply(lmt1s,'[',4))
lmt1$year <- as.character(lapply(lmt1s,'[',5))

lmt2s <- strsplit(lmt2$todaysDate,split=' ')
lmt2$wday <- as.character(lapply(lmt2s,'[',1))
lmt2$month   <- as.character(lapply(lmt2s,'[',2))
lmt2$dayMonth <- as.character(lapply(lmt2s, '[',3))
lmt2$timeHMS  <- as.character(lapply(lmt2s,'[',4))
lmt2$year <- as.character(lapply(lmt2s,'[',5))

lmt3s <- strsplit(lmt3$todaysDate,split=' ')
lmt3$wday <- as.character(lapply(lmt3s,'[',1))
lmt3$month   <- as.character(lapply(lmt3s,'[',2))
lmt3$dayMonth <- as.character(lapply(lmt3s, '[',3))
lmt3$timeHMS  <- as.character(lapply(lmt3s,'[',4))
lmt3$year <- as.character(lapply(lmt3s,'[',5))

lmt4s <- strsplit(lmt4$todaysDate,split=' ')
lmt4$wday <- as.character(lapply(lmt4s,'[',1))
lmt4$month   <- as.character(lapply(lmt4s,'[',2))
lmt4$dayMonth <- as.character(lapply(lmt4s, '[',3))
lmt4$timeHMS  <- as.character(lapply(lmt4s,'[',4))
lmt4$year <- as.character(lapply(lmt4s,'[',5))

lmt5s <- strsplit(lmt5$todaysDate,split=' ')
lmt5$wday <- as.character(lapply(lmt5s,'[',1))
lmt5$month   <- as.character(lapply(lmt5s,'[',2))
lmt5$dayMonth <- as.character(lapply(lmt5s, '[',3))
lmt5$timeHMS  <- as.character(lapply(lmt5s,'[',4))
lmt5$year <- as.character(lapply(lmt5s,'[',5))
JulAug <- rbind(lmt1,lmt2,lmt3,lmt4,lmt5)
# month.abb is base R's vector of abbreviated month names, 'Jan' to 'Dec';
# match() returns the month number, or NA when the label is not a month
JulAug$nMonth <- match(JulAug$month, month.abb)
JulAug$mdy <- paste(JulAug$nMonth,JulAug$dayMonth,JulAug$year,sep='/')
JulAug$mdy <- mdy(JulAug$mdy)
JulAug$jobPosted_nDays_ago <- JulAug$mdy-JulAug$date_daysAgo
JulAugNot30 <- subset(JulAug, JulAug$date_daysAgo != 30)
JulAugNot30 <- JulAugNot30[!duplicated(JulAugNot30),]
jobsPerStateCityDayPosted <- JulAugNot30 %>% 
  group_by(state,city,jobPosted_nDays_ago) %>% count(jobPosted_nDays_ago)
colnames(jobsPerStateCityDayPosted)[3:4] <- c('datePosted','LMT_jobsPosted')
write.csv(JulAugNot30,'JulAugNot30.csv',row.names=F)
write.csv(jobsPerStateCityDayPosted,'LMT_jobsPostedDateCityState.csv',
          row.names=F)
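The per-file steps above repeat the same cleaning five times. As a hedged sketch of one way to factor that out, the helper below applies the same conversions to a single data frame. `clean_pull` is a hypothetical name, and the toy `todaysDate` strings assume the 'weekday month day time year' layout implied by the split positions used above. Unlike the `gsub` calls, it only replaces labels that match exactly.

```r
# Hypothetical helper: apply the cleaning steps to one weekly pull.
clean_pull <- function(df) {
  # 'Just posted' and 'Today' both mean zero days ago.
  df$date_daysAgo[df$date_daysAgo %in% c("Just posted", "Today")] <- "0"
  df$date_daysAgo <- as.numeric(df$date_daysAgo)

  # Split the timestamp on single spaces and keep month, day, and year.
  parts <- strsplit(df$todaysDate, split = " ")
  month <- vapply(parts, `[`, character(1), 2)
  day   <- vapply(parts, `[`, character(1), 3)
  year  <- vapply(parts, `[`, character(1), 5)

  # month.abb is base R's c("Jan", ..., "Dec"); match() gives 1..12.
  df$mdy <- as.Date(paste(year, match(month, month.abb), day, sep = "-"),
                    format = "%Y-%m-%d")

  # Date the job was posted = pull date minus days since posting.
  df$jobPosted_nDays_ago <- df$mdy - df$date_daysAgo
  df
}

# Toy data standing in for one pull (values are made up).
toy <- data.frame(
  state = c("ca", "ca", "ny"),
  date_daysAgo = c("Just posted", "29", "30"),
  todaysDate = rep("Mon Jul 20 10:15:00 2020", 3),
  stringsAsFactors = FALSE
)
cleaned <- clean_pull(toy)

# Drop the ambiguous '30 days' bucket, as in the analysis above.
cleaned <- cleaned[cleaned$date_daysAgo != 30, ]
print(cleaned[, c("date_daysAgo", "jobPosted_nDays_ago")])
```

With a helper like this, the five pulls could be kept in a list, cleaned with `lapply`, and stacked with `do.call(rbind, ...)`.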

Let’s read in the available information on nanny, nurse (the August 3, 2020 pull was missed that week), personal assistant, security, and warehouse jobs available in the US.

nannyListings <- read.csv(
  './alternate jobs/nanny_jobsPostedDateCityState.csv',
                          sep=',',header=T, na.strings=c('',' ','NA'),
                          stringsAsFactors = T)

#reread in the file if starting at ***
lmtListings <- read.csv(
  'LMT_jobsPostedDateCityState.csv',
                          sep=',',header=T, na.strings=c('',' ','NA'),
                          stringsAsFactors = T)

nurseListings <- read.csv(
  './alternate jobs/nurse_jobsPostedDateCityState.csv',
                          sep=',',header=T, na.strings=c('',' ','NA'),
                          stringsAsFactors = T)

personalAssistantListings <- read.csv(
  './alternate jobs/personalAssistant_jobsPostedDateCityState.csv',
                          sep=',',header=T, na.strings=c('',' ','NA'),
                          stringsAsFactors = T)

securityListings <- read.csv(
  './alternate jobs/security_jobsPostedDateCityState.csv',
                          sep=',',header=T, na.strings=c('',' ','NA'),
                          stringsAsFactors = T)

warehouseListings <- read.csv(
  './alternate jobs/warehouse_jobsPostedDateCityState.csv',
                          sep=',',header=T, na.strings=c('',' ','NA'),
                          stringsAsFactors = T)
colnames(lmtListings)[4] <- 'jobsPosted'
lmtListings$job <- 'LMT'

colnames(nannyListings)[4] <- 'jobsPosted'
nannyListings$job <- 'nanny'

colnames(nurseListings)[4] <- 'jobsPosted'
nurseListings$job <- 'nurse'

colnames(personalAssistantListings)[4] <- 'jobsPosted'
personalAssistantListings$job <- 'personal assistant'

colnames(securityListings)[4] <- 'jobsPosted'
securityListings$job <- 'security'

colnames(warehouseListings)[4] <- 'jobsPosted'
warehouseListings$job <- 'warehouse'
allJobPostings <- rbind(lmtListings,nannyListings,
                        nurseListings, personalAssistantListings,
                        securityListings,warehouseListings)
allJobPostings$job <- as.factor(paste(allJobPostings$job))
write.csv(allJobPostings,'allJobsPosted.csv',row.names=F)
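The rename, tag, and stack pattern above can also be written once and applied to a named list. This is a minimal sketch on toy tables; the column name `nanny_jobsPosted` and all values are assumptions made for illustration.

```r
# Toy stand-ins for two of the per-job tables (values are made up).
lmt_toy <- data.frame(state = "CA", city = "anaheim",
                      datePosted = "2020-07-24", LMT_jobsPosted = 3)
nanny_toy <- data.frame(state = "CA", city = "long beach",
                        datePosted = "2020-08-04", nanny_jobsPosted = 5)

# The list names double as the job labels.
listings <- list(LMT = lmt_toy, nanny = nanny_toy)

# Rename the count column and tag each table with its job label.
tagged <- Map(function(df, job) {
  colnames(df)[4] <- "jobsPosted"
  df$job <- job
  df
}, listings, names(listings))

# Stack the tagged tables into one long table.
all_toy <- do.call(rbind, tagged)
rownames(all_toy) <- NULL
all_toy$job <- as.factor(all_toy$job)
print(all_toy)
```

Adding another job then only means adding one more entry to the list.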

Let’s also get the average hourly and annual salary for each city by reading in the cleaned postings and averaging the salary columns.

wages <- read.csv('JulAugNot30.csv',sep=',',header=T,
                  na.strings=c('',' ','NA'),stringsAsFactors = T)
colnames(wages)
##  [1] "city"                "state"               "jobSearched"        
##  [4] "jobTitle"            "HiringAgency"        "date_daysAgo"       
##  [7] "todaysDate"          "MinHourlySalary"     "MaxHourlySalary"    
## [10] "MinAnnualSalary"     "MaxAnnualSalary"     "wday"               
## [13] "month"               "dayMonth"            "timeHMS"            
## [16] "year"                "nMonth"              "mdy"                
## [19] "jobPosted_nDays_ago"
cityWages <- wages %>%
  group_by(state, city, jobSearched, jobPosted_nDays_ago) %>%
  summarise_at(vars(MinHourlySalary, MaxHourlySalary,
                    MinAnnualSalary, MaxAnnualSalary), mean)
write.csv(cityWages,'LMT_citywages.csv',row.names=F)
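One thing to keep in mind with the grouped mean above: `mean` returns `NA` for any group that contains a missing value unless `na.rm = TRUE` is passed. Whether that matters here depends on how many postings list no pay, which this document does not report. The sketch below uses toy numbers and base R's `tapply` rather than the dplyr pipeline.

```r
# Toy wages: the second anaheim posting lists no hourly pay.
toy_wages <- data.frame(
  city = c("anaheim", "anaheim", "fresno"),
  MinHourlySalary = c(20, NA, 12)
)

# Default behaviour: one NA makes the whole group mean NA.
strict <- tapply(toy_wages$MinHourlySalary, toy_wages$city, mean)

# Ignoring missing values averages only the postings that list pay.
lenient <- tapply(toy_wages$MinHourlySalary, toy_wages$city, mean,
                  na.rm = TRUE)

print(strict)
print(lenient)
```

In the dplyr call, the equivalent would be passing `na.rm = TRUE` after `mean` inside `summarise_at`.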

Let’s import the city wage data for LMTs and our five comparison jobs.

lmtCityWages <- read.csv('LMT_citywages.csv',sep=',',header=T,
                         na.strings=c('',' ','NA'),
                         stringsAsFactors = F)
nannyCityWages <- read.csv('./alternate jobs/nanny_citywages.csv',sep=',',header=T,
                         na.strings=c('',' ','NA'),
                         stringsAsFactors = F)

nurseCityWages <- read.csv('./alternate jobs/nurse_citywages.csv',sep=',',header=T,
                         na.strings=c('',' ','NA'),
                         stringsAsFactors = F)

personalAssistantCityWages <- read.csv('./alternate jobs/personalAssistant_citywages.csv',sep=',',header=T,
                         na.strings=c('',' ','NA'),
                         stringsAsFactors = F)

securityCityWages <- read.csv('./alternate jobs/security_citywages.csv',sep=',',header=T,
                         na.strings=c('',' ','NA'),
                         stringsAsFactors = F)

warehouseCityWages <- read.csv('./alternate jobs/warehouse_citywages.csv',sep=',',header=T,
                         na.strings=c('',' ','NA'),
                         stringsAsFactors = F)

Let’s combine all this data into one table.

BigDataWages <- rbind(lmtCityWages,
                      nannyCityWages,
                      nurseCityWages,
                      personalAssistantCityWages,
                      securityCityWages,
                      warehouseCityWages)
colnames(BigDataWages)[4] <- 'dateJobPosted'
write.csv(BigDataWages,'BigDataWages.csv', row.names=F)

Machine Learning

We use machine learning to predict the maximum hourly wage from the state, the job searched, and the minimum hourly wage.

set.seed(453423)

bigdatawages <- BigDataWages[complete.cases(BigDataWages),]
inTrain <- createDataPartition(y=bigdatawages$MinHourlySalary, p=0.7, list=FALSE)

trainingSet <- bigdatawages[inTrain,]
testingSet <- bigdatawages[-inTrain,]
dim(trainingSet)
## [1] 43867     8
dim(testingSet)
## [1] 18797     8
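For readers unfamiliar with the split above, here is a simplified stand-in written in base R. It is a plain random 70/30 split on a toy vector; `createDataPartition` additionally balances the split across the range of the outcome, which this sketch does not attempt.

```r
set.seed(453423)

# Toy outcome vector standing in for MinHourlySalary (made-up values).
y <- round(runif(1000, min = 8, max = 60))

# Simple random 70/30 split of the row indices.
n_train  <- floor(0.7 * length(y))
in_train <- sample(seq_along(y), size = n_train)

train_y <- y[in_train]
test_y  <- y[-in_train]

print(c(train = length(train_y), test = length(test_y)))

# Share of rows in the training set reported above: 43867 of 43867 + 18797.
print(round(43867 / (43867 + 18797), 3))
```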
set.seed(233445)
rpartFit <- train(MaxHourlySalary ~ state+MinHourlySalary+jobSearched,
                  data = trainingSet,
                  method = "rpart",
                  tuneLength = 9)
rpartPredictedHourly <- predict(rpartFit,testingSet)
results <- data.frame(cbind(testingSet,rpartPredictedHourly))
head(results,10)
##     state          city                jobSearched dateJobPosted
## 5      AK     anchorage licensed massage therapist    2020-07-28
## 12     AK knik-fairview licensed massage therapist    2020-07-23
## 17     AK knik-fairview licensed massage therapist    2020-08-02
## 18     AK knik-fairview licensed massage therapist    2020-08-07
## 58     AL    birmingham licensed massage therapist    2020-07-30
## 59     AL    birmingham licensed massage therapist    2020-08-01
## 72     AL        hoover licensed massage therapist    2020-07-26
## 80     AL    huntsville licensed massage therapist    2020-07-31
## 102    AR        conway licensed massage therapist    2020-07-28
## 105    AR        conway licensed massage therapist    2020-08-04
##     MinHourlySalary MaxHourlySalary MinAnnualSalary MaxAnnualSalary
## 5                20        20.00000        75000.00       120000.00
## 12               30        72.50000        75000.00       120000.00
## 17               20        72.50000        75000.00       120000.00
## 18               20        72.50000        75000.00       120000.00
## 58                8        40.00000        30000.00        60000.00
## 59                8        52.50000        30000.00        60000.00
## 72                8        48.33333        28333.33        56666.67
## 80                9        11.00000        15000.00        15000.00
## 102              13        15.00000        35000.00        50000.00
## 105              13        15.00000        35000.00        57000.00
##     rpartPredictedHourly
## 5               63.52607
## 12              63.52607
## 17              63.52607
## 18              63.52607
## 58              25.97129
## 59              25.97129
## 72              25.97129
## 80              25.97129
## 102             67.97146
## 105             67.97146
set.seed(233445)
lmFit <- train(MaxHourlySalary ~ state+MinHourlySalary+jobSearched,
                  data = trainingSet,
                  method = "lm")
lmPredictedHourly <- predict(lmFit,testingSet)
results2 <- data.frame(cbind(results,lmPredictedHourly))
write.csv(results2,'results2.csv',row.names=F)
head(results2,10)
##     state          city                jobSearched dateJobPosted
## 5      AK     anchorage licensed massage therapist    2020-07-28
## 12     AK knik-fairview licensed massage therapist    2020-07-23
## 17     AK knik-fairview licensed massage therapist    2020-08-02
## 18     AK knik-fairview licensed massage therapist    2020-08-07
## 58     AL    birmingham licensed massage therapist    2020-07-30
## 59     AL    birmingham licensed massage therapist    2020-08-01
## 72     AL        hoover licensed massage therapist    2020-07-26
## 80     AL    huntsville licensed massage therapist    2020-07-31
## 102    AR        conway licensed massage therapist    2020-07-28
## 105    AR        conway licensed massage therapist    2020-08-04
##     MinHourlySalary MaxHourlySalary MinAnnualSalary MaxAnnualSalary
## 5                20        20.00000        75000.00       120000.00
## 12               30        72.50000        75000.00       120000.00
## 17               20        72.50000        75000.00       120000.00
## 18               20        72.50000        75000.00       120000.00
## 58                8        40.00000        30000.00        60000.00
## 59                8        52.50000        30000.00        60000.00
## 72                8        48.33333        28333.33        56666.67
## 80                9        11.00000        15000.00        15000.00
## 102              13        15.00000        35000.00        50000.00
## 105              13        15.00000        35000.00        57000.00
##     rpartPredictedHourly lmPredictedHourly
## 5               63.52607          75.16865
## 12              63.52607          75.60783
## 17              63.52607          75.16865
## 18              63.52607          75.16865
## 58              25.97129          56.77021
## 59              25.97129          56.77021
## 72              25.97129          56.77021
## 80              25.97129          56.81412
## 102             67.97146          51.77988
## 105             67.97146          51.77988

gbm

set.seed(233445)
gbmFit <- gbm(MaxHourlySalary ~ as.factor(paste(state))+MinHourlySalary+as.factor(paste(jobSearched)),
                  data = trainingSet,
                  n.trees=500, interaction.depth=4,
              shrinkage=0.01,
              cv.folds=3)
## Distribution not specified, assuming gaussian ...
gbmPredictedHourly <- predict(gbmFit,testingSet)
## Using 500 trees...
results3 <- data.frame(cbind(results2,gbmPredictedHourly))
write.csv(results3,'results3.csv',row.names=F)
head(results3,10)
##     state          city                jobSearched dateJobPosted
## 5      AK     anchorage licensed massage therapist    2020-07-28
## 12     AK knik-fairview licensed massage therapist    2020-07-23
## 17     AK knik-fairview licensed massage therapist    2020-08-02
## 18     AK knik-fairview licensed massage therapist    2020-08-07
## 58     AL    birmingham licensed massage therapist    2020-07-30
## 59     AL    birmingham licensed massage therapist    2020-08-01
## 72     AL        hoover licensed massage therapist    2020-07-26
## 80     AL    huntsville licensed massage therapist    2020-07-31
## 102    AR        conway licensed massage therapist    2020-07-28
## 105    AR        conway licensed massage therapist    2020-08-04
##     MinHourlySalary MaxHourlySalary MinAnnualSalary MaxAnnualSalary
## 5                20        20.00000        75000.00       120000.00
## 12               30        72.50000        75000.00       120000.00
## 17               20        72.50000        75000.00       120000.00
## 18               20        72.50000        75000.00       120000.00
## 58                8        40.00000        30000.00        60000.00
## 59                8        52.50000        30000.00        60000.00
## 72                8        48.33333        28333.33        56666.67
## 80                9        11.00000        15000.00        15000.00
## 102              13        15.00000        35000.00        50000.00
## 105              13        15.00000        35000.00        57000.00
##     rpartPredictedHourly lmPredictedHourly gbmPredictedHourly
## 5               63.52607          75.16865           76.89773
## 12              63.52607          75.60783           76.32282
## 17              63.52607          75.16865           76.89773
## 18              63.52607          75.16865           76.89773
## 58              25.97129          56.77021           45.31054
## 59              25.97129          56.77021           45.31054
## 72              25.97129          56.77021           45.31054
## 80              25.97129          56.81412           45.31054
## 102             67.97146          51.77988           42.17769
## 105             67.97146          51.77988           42.17769

rpart

set.seed(233445)
rpartFunctionFit <- rpart(MaxHourlySalary ~
                           state+MinHourlySalary+jobSearched,
                  data = trainingSet, method='anova')
rpartFunctionPredictedHourly <- predict(rpartFunctionFit,testingSet)
results4 <- data.frame(cbind(results3,rpartFunctionPredictedHourly))
write.csv(results4,'results4.csv',row.names=F)
head(results4,10)
##     state          city                jobSearched dateJobPosted
## 5      AK     anchorage licensed massage therapist    2020-07-28
## 12     AK knik-fairview licensed massage therapist    2020-07-23
## 17     AK knik-fairview licensed massage therapist    2020-08-02
## 18     AK knik-fairview licensed massage therapist    2020-08-07
## 58     AL    birmingham licensed massage therapist    2020-07-30
## 59     AL    birmingham licensed massage therapist    2020-08-01
## 72     AL        hoover licensed massage therapist    2020-07-26
## 80     AL    huntsville licensed massage therapist    2020-07-31
## 102    AR        conway licensed massage therapist    2020-07-28
## 105    AR        conway licensed massage therapist    2020-08-04
##     MinHourlySalary MaxHourlySalary MinAnnualSalary MaxAnnualSalary
## 5                20        20.00000        75000.00       120000.00
## 12               30        72.50000        75000.00       120000.00
## 17               20        72.50000        75000.00       120000.00
## 18               20        72.50000        75000.00       120000.00
## 58                8        40.00000        30000.00        60000.00
## 59                8        52.50000        30000.00        60000.00
## 72                8        48.33333        28333.33        56666.67
## 80                9        11.00000        15000.00        15000.00
## 102              13        15.00000        35000.00        50000.00
## 105              13        15.00000        35000.00        57000.00
##     rpartPredictedHourly lmPredictedHourly gbmPredictedHourly
## 5               63.52607          75.16865           76.89773
## 12              63.52607          75.60783           76.32282
## 17              63.52607          75.16865           76.89773
## 18              63.52607          75.16865           76.89773
## 58              25.97129          56.77021           45.31054
## 59              25.97129          56.77021           45.31054
## 72              25.97129          56.77021           45.31054
## 80              25.97129          56.81412           45.31054
## 102             67.97146          51.77988           42.17769
## 105             67.97146          51.77988           42.17769
##     rpartFunctionPredictedHourly
## 5                       72.29220
## 12                      72.29220
## 17                      72.29220
## 18                      72.29220
## 58                      54.46697
## 59                      54.46697
## 72                      54.46697
## 80                      54.46697
## 102                     54.46697
## 105                     54.46697
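The tables above place the predictions side by side but do not summarize the error. As a hedged sketch of how that could be done, the block below computes RMSE and MAE per model from only the ten rows printed above. Ten rows say nothing reliable about overall model quality; the point is the calculation. The short column names are chosen here for illustration.

```r
# The ten test rows printed above: observed MaxHourlySalary and each
# model's prediction, copied from the output.
shown <- data.frame(
  observed = c(20, 72.5, 72.5, 72.5, 40, 52.5, 48.33333, 11, 15, 15),
  rpartCaret = c(63.52607, 63.52607, 63.52607, 63.52607, 25.97129,
                 25.97129, 25.97129, 25.97129, 67.97146, 67.97146),
  lm = c(75.16865, 75.60783, 75.16865, 75.16865, 56.77021,
         56.77021, 56.77021, 56.81412, 51.77988, 51.77988),
  gbm = c(76.89773, 76.32282, 76.89773, 76.89773, 45.31054,
          45.31054, 45.31054, 45.31054, 42.17769, 42.17769),
  rpartDirect = c(72.29220, 72.29220, 72.29220, 72.29220, 54.46697,
                  54.46697, 54.46697, 54.46697, 54.46697, 54.46697)
)

# Root mean squared error and mean absolute error.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))

models <- setdiff(names(shown), "observed")
scores <- data.frame(
  model = models,
  RMSE = sapply(models, function(m) rmse(shown$observed, shown[[m]])),
  MAE  = sapply(models, function(m) mae(shown$observed, shown[[m]])),
  row.names = NULL
)
print(scores)
```

On the full test set, the same two functions could be applied to `results4$MaxHourlySalary` and each prediction column. caret's `postResample` reports similar metrics.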

The random forest model below was too slow to complete, so no output is shown for it.

set.seed(233445)
rfFit <- train(MaxHourlySalary ~
                           state+MinHourlySalary+jobSearched,
                  data = trainingSet, method='rf')
rfPredictedHourly <- predict(rfFit,testingSet)
results5 <- data.frame(cbind(results4,rfPredictedHourly))
head(results5,10)

Images

Here are the images of the job listing counts and hourly wages per city and state, covering the 10 most populated cities in each state, for the selected jobs: massage therapists, nurses, nannies, personal assistants, security, and warehouse workers. Each image is a snapshot of the chart it represents on Tableau Public Server, at the link below the image.

image 1

Figure 1 outside window

Figure 1: The number of LMT job postings from 29 days before July 20, 2020 up to August 17, 2020 in the 10 most populated cities in the US. This bar chart shows the 10 cities, with July 27th as the start and August 17th as the end of the weekly pulls from Indeed.com. Each pull goes back 29 days rather than 30, to avoid postings that are actually older than 30 days but are grouped into 30 days ‘posted ago’ in each job advertisement’s timeline. On some days no jobs were posted in a city, and in some parts of the US that was true for many days. Other cities had many jobs posted on the same day.

image 2

Figure 2 outside window

Figure 2: This is the sum of the number of LMT jobs posted from June 24, 2020 through August 14, 2020, that is, from the last week of June through the first two weeks of August. Some of the 10 most populated cities in each state show darker scatter spots, indicating a lot of job postings relative to the rest of the US, such as in New Jersey, New York, Florida, Washington, Arizona, Texas, and southern CA.

image 3

Figure 3 outside window

Figure 3: This image is a map of the sum of LMT job postings in CA’s 10 most populated cities from June 24-July 12, 2020, before businesses in CA knew about the 2nd quarantine that would shut their gyms, massage spas or clinics, and nail or hair shops unless they could deliver those services outdoors. Restaurants and theaters were already closed for a week before this date in early July. The darker spots indicate that many jobs were posted advertising a need for LMTs in northern and southern CA in June and July.

image 4

Figure 4 outside window

Figure 4: The image above is the same map of CA shown in Figure 3, except for the dates. This time the dates span July 13-August 17, 2020, after CA businesses knew about the 2nd quarantine that would shut down their business, or cut services drastically if they could only accommodate a few customers outside. We see that a lot more southern CA businesses advertised for LMTs in July than in August.

image 5

Figure 5 outside window

Figure 5: The above image is a line chart by month and by city for the 10 most populated cities in CA from June 21-August 17, 2020, showing the rise and fall of advertised job postings for LMTs in California. Some days have none, and some seem to make up for missed job postings. The chart shows that in Anaheim on July 24, 2020, there were 16 jobs freshly posted on Indeed.com for LMTs.

image 6

Figure 6 outside window

Figure 6: The above image is a stacked bar chart that indicates by color the alternate and/or essential jobs used as a comparison to LMT job demand from June 21, 2020 through August 17, 2020. I didn’t just pull web data on LMTs every Monday, but on 22 different jobs. I selected these jobs for this document to illustrate the effects of the quarantine and the labor needs of businesses. Each of the 10 most populated cities in each state is shown as a stacked bar, where yellow is for warehouse, green is for security, light blue is for personal assistants, red is for nurses, orange is for nannies, and blue is for LMTs. When hovering, the sum of total job postings between June 21st and August 17th is displayed. We can see from the image above that Arizona and CA seem to have had similar labor demands for these jobs, but a few cities like Tucson, AZ and Bakersfield and Fresno, CA have little to no demand for personal assistants.

image 7

Figure 7 outside window

Figure 7: The image above is a line chart covering the span of this analysis, June 21-Aug 17, 2020, of the number of jobs posted each day for an LMT, nanny, nurse, personal assistant, security, or warehouse worker. There was a spike in demand for nannies on August 4th, with 94 jobs requesting a nanny posted that day in Arkansas’ 10 most populated cities. We know this from hovering over the spike and reading the information box that pops up.

image 8

Figure 8 outside window

Figure 8: The image above shows the demand by job in CA from June 21-July 14, 2020, before the 2nd quarantine started. If you scroll through the chart this image links to, you will see the scatter for all job categories: LMTs, nannies, nurses, personal assistants, security, and warehouse workers. The size of each point indicates the number of job postings for that category, and the color indicates the category, such as orange for nanny and blue for LMT. The scatter on the map shows that San Diego, CA had 29 jobs advertised for LMTs between June 21st and July 14th.

image 9

Figure 9 outside window

Figure 9: The image above is the same chart as Figure 8, except that the dates are July 15-August 17th, the days in our time series after the 2nd CA quarantine. Color again shows the job category and size shows the sum of job postings per category. The image shows that in San Diego there were 138 job advertisements for a nanny between July 15th and August 17th. Below that we see a uniform demand for nurses all over CA. The three-city blob colored red in the lower portion of CA is Los Angeles, Long Beach, and Anaheim, because those are among the 10 most populated cities in CA.

image 10

Figure 10 outside window

Figure 10: The above image looks very similar to Figure 7, except that this chart only shows the 10 most populated cities in CA for the span of our time series, June 21-August 17, 2020. The job categories are color coded, and we see the rise and fall of demand for each job over time. The point being hovered on represents a spike in warehouse demand in Bakersfield, CA on July 29th, with 13 jobs posted. Anaheim looks relatively flat in job demand for all categories until early August, when a huge spike in demand for nannies shifts the scale to approximately five times the average.

image 11

Figure 11 outside window

Figure 11: In the above image we see an area line chart color coded by job category. Think of an area chart as a map you lay down flat on the table to see how much of each category (or region, in the map analogy) is showing. For instance, the top area is blue, but that doesn’t mean blue has more job demand than the nurses, only that a tiny strip of it is showing. Blue indicates LMT job demand, red indicates nurses, light blue personal assistants, green security, and yellow warehouse labor. This chart is a city by city comparison of the 10 most populated cities in CA and the demand for these selected jobs from June 21-August 17, 2020. Again, the day with the most job demand in one category is August 4th, with 43 job advertisements for a nanny in Long Beach, CA.

image 12

Figure 12: The above image is of the chart that compares the selected jobs' average minimum hourly pay advertised on Indeed.com from June 21-August 17, 2020. The nanny role is hovered on to show the details: a minimum hourly pay of $10 and an average maximum hourly pay of $50, which corresponds to the huge spike in demand for nannies on Indeed on August 4, 2020. All 500 cities, meaning the 10 most populated cities in each of our 50 states, are shown if they had data available.

image 13

Figure 13: The above image is a line chart of the minimum hourly advertised pay for our selected jobs in the 10 most populated cities in CA, using Indeed data. The selected point is for nurses in San Jose on August 10, 2020, with a minimum hourly advertised pay of $28 and a maximum hourly pay of $91. At that point the nurse line overlaps with the LMT advertised hourly pay.

image 14

Figure 14: The above image is an area chart of the selected jobs' average hourly pay in the 10 most populated cities in CA by month, where each value is the average over all days in the month for each job in each city. Ignore the numeric scale on the x-axis, as it adds confusion and couldn't be removed, unless you estimate a value by subtracting a start point from an end point on the scale, as you would with the scale on a map. For instance, nanny is either a low or a very high paying job, ranging from $10-$52 an hour in Los Angeles. Whose family pays $52 an hour is something I would like to know. If you go up to Fresno, the pay range for a nanny is $10-$16, which seems more likely for households that aren't the wealthy elite, like celebrities earning millions a year plus royalties. Also notice that the nanny area for Los Angeles starts at approximately 55 on the scale and goes to about 65, which gives the minimum listed value of $10 when you subtract 55 from 65. Some cities don't have listings, such as Bakersfield and Fresno, which had no nanny or LMT jobs advertised in July or August, even though they are two of the 10 most populated cities in CA. (Apologies to the grammar sticklers for trivial errors such as inconsistent capitalization of city names; they don't change the point of the analysis.)


Some machine learning was also used to predict the advertised max hourly wage for each of these jobs, using the state, the minimum advertised hourly rate, and the job as predictors, and the predictions were compared to the actual advertised max hourly wage. Only complete cases, meaning instances without missing values, were used. The algorithms used were linear regression through the caret package; recursive partitioning and regression trees (rpart), both through the caret package and through the rpart function in the rpart package; and generalized boosted regression models (gbm) through the gbm function in the gbm package, set to 500 trees, a shrinkage of 0.01, and a tree depth (the interaction.depth parameter) of 4. The gbm algorithm of the caret package wasn't used. The random forest algorithm of caret and the separate randomForest function of the randomForest package were attempted, but they took much too long to process and either shut down or R stopped working.
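
The modeling setup described above could be sketched roughly as follows. This is a minimal sketch and not the original script: the `wages` data frame and its columns `state`, `job`, `minRate`, and `maxRate` are hypothetical stand-ins for the scraped data, base `lm` is used in place of caret's wrapper, and the gbm step only runs if the gbm package happens to be installed.

```r
# Hypothetical toy data standing in for the scraped Indeed wage table.
set.seed(1)
n <- 200
wages <- data.frame(
  state   = factor(sample(c('CA', 'AK', 'AR'), n, replace = TRUE)),
  job     = factor(sample(c('LMT', 'nurse', 'security'), n, replace = TRUE)),
  minRate = round(runif(n, 10, 40), 2)
)
wages$maxRate <- wages$minRate * 1.5 + rnorm(n, sd = 3)
wages$maxRate[sample(n, 20)] <- NA  # simulate ads with no posted maximum

# Keep only complete cases, as the text describes.
complete <- wages[complete.cases(wages), ]

# Linear regression with state, minimum rate, and job as predictors.
fit_lm <- lm(maxRate ~ state + minRate + job, data = complete)

# Regression tree from the rpart package.
library(rpart)
fit_rpart <- rpart(maxRate ~ state + minRate + job,
                   data = complete, method = 'anova')

# Collect the actual value and the predictions side by side.
preds <- data.frame(
  actual     = complete$maxRate,
  pred_lm    = predict(fit_lm, newdata = complete),
  pred_rpart = predict(fit_rpart, newdata = complete)
)

# Boosted model with the settings named in the text (500 trees,
# shrinkage 0.01, depth 4); skipped when gbm is not installed.
if (requireNamespace('gbm', quietly = TRUE)) {
  fit_gbm <- gbm::gbm(maxRate ~ state + minRate + job, data = complete,
                      distribution = 'gaussian', n.trees = 500,
                      shrinkage = 0.01, interaction.depth = 4)
  preds$pred_gbm <- predict(fit_gbm, newdata = complete, n.trees = 500)
}

head(preds)
```

Keeping the actual value and each model's prediction in one table mirrors how the charts place them side by side.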

When looking at the interactive charts, compare the actual maximum advertised hourly rate to the rates predicted by the algorithms above. The four prediction models and the actual value are shown side by side in our line charts of the selected jobs over the course of our time series from June 21-August 17, 2020.

image 15

Figure 15: Since we omitted the missing values, some of our chart also has gaps for those instances where there were any NA or missing values. We could go back later and use an ifelse function to impute the average, grouped by city and job, to fill in these blank areas with some information if we want to. The hover tool shows that from July to August in Arkansas, the generalized boosted regression model predicted the max advertised hourly rate for an LMT to be $42.20, but the actual max advertised hourly rate, shown if you select the red line below it, is $15.
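
The imputation idea mentioned in the caption could look something like the sketch below. The `rates` data frame, its column names, and its values are made up for illustration; only the approach (fill each NA with the average of its city and job group using ifelse) comes from the text.

```r
# Hypothetical rows: some ads have no advertised max hourly rate.
rates <- data.frame(
  city    = c('Anaheim', 'Anaheim', 'Anaheim', 'Fresno', 'Fresno'),
  job     = c('LMT', 'LMT', 'LMT', 'nanny', 'nanny'),
  maxRate = c(80, NA, 100, 16, NA)
)

# Average of the non-missing values within each city and job group,
# repeated for every row of that group.
group_mean <- ave(rates$maxRate, rates$city, rates$job,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Keep the observed value where there is one, else use the group average.
rates$maxRateImputed <- ifelse(is.na(rates$maxRate), group_mean, rates$maxRate)

rates
```

A group with no observed values at all would still come out as NaN, so those gaps would remain in the chart.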

image 16

Figure 16: In the above image the actual advertised hourly rate for security jobs jumps a lot on a daily basis. In Alaska it was $9 an hour on average around July 20 but jumped to $27 on average from the 22nd through the 24th. Also, the predicted advertised hourly rates for security in Alaska were at times less and at others more than the actual advertised max hourly rate for this state. This chart is different from the previous chart in Figure 15 because it shows the daily fluctuations of the advertised max hourly rate instead of the monthly ones.

image 17

Figure 17: In the above image we see the chart with only CA selected, showing the average actual maximum advertised hourly rate for nurses in CA on July 28. At $72.50 an hour this is one of the local minimums of the time series, and all advertised hourly rates after this date stay near this level for nurses. Earlier, in June and July, their actual advertised hourly rates were much higher, at around $150 an hour. The businesses might have been including overtime pay at double time when posting their ads hiring nurses in CA. We also see that this local minimum is lower than the machine learning algorithms' predicted advertised hourly rate for nurses. The global minimum is on July 10 at around $50.

image 18

Figure 18: In the above image the 10 most populated cities in CA are included in a line chart that, on some days, looks like scatter points. This is because the missing values for those dates weren't imputed with the average, as was done in the aggregate averaging by state, city, and job that produced the data for the salary charts shown previously. The hover box shows the average actual maximum hourly advertised rate for LMTs in Anaheim at $100 an hour, which is more than the max hourly rate predicted by our machine learning models. Earlier in the time series, in June and July, the machine learning algorithms predicted the max advertised hourly rate of LMTs to be much higher than the actual advertised rates. For any date before July 22, only the caret rpart model predicted a lower max hourly rate than the actual advertised max hourly rate for LMTs. We should look at the data to see which companies are advertising a high hourly rate for massage therapists in Anaheim, CA during a quarantine. I am assuming it is a mobile massage company. Let it be known that my hourly rate is much more affordable for clients.

The program used to web scrape the salary content took the minimum and maximum hourly rate from the salary tag across all businesses advertising in each city, for any number of days 'posted ago', and discarded each individual business's advertised rate range if one was posted. No other city information was added to those values, as each city produced its own csv file after the program looped through the 10 most populated cities of each state in a table of 500 cities in total. Each job produced 500 csv files with the imputed average of the minimum hourly rate advertised and the average of the maximum hourly rate advertised in all ads for that job in that city. But when looking at the csv file for Anaheim, CA in the folder of csv files web scraped on August 17, 2020, Zeel had $80/hour in their job advertisement title. None of the other companies had title information with advertised pay. I will change the web scrape script to account for this and include the salary information per post, if it is available, to compare to the calculated average min and max features.

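# Read the Anaheim LMT listings scraped on August 17, 2020;
# blank strings are treated as missing values.
AnaheimAug17 <- read.csv('licensed massage therapist_anaheim_ca_.csv',
                         sep = ',',
                         header = TRUE,
                         na.strings = c('', ' ', 'NA'),
                         stringsAsFactors = TRUE)
# Sort by days since posting, then list the distinct hiring agencies.
AnaheimAug17b <- AnaheimAug17[order(AnaheimAug17$date_daysAgo), ]
unique(AnaheimAug17b$HiringAgency)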
##  [1] Tabrizi Family ChiropracticCosta Mesa, CA 92626                      
##  [2] Sycamore Spa by HudsonLaguna Beach, CA 92651                         
##  [3] My 360 MassageLake Forest, CA 92630                                  
##  [4] Zeel3.5Orange County, CA                                             
##  [5] Massage Envy - La HabraLa Habra, CA 90631                            
##  [6] Basic Chiropractic & Leach Rehabilitation, Inc.Anaheim, CA 92805     
##  [7] Thrive Chiropractic4.4Anaheim, CA 92807 (Anaheim Hills area)         
##  [8] Massage Envy - Brea Downtown3.2Brea, CA+4 locations                  
##  [9] SAN PEDRO PAIN AND WELLNESSSan Pedro, CA 90731                       
## [10] Advantage Care Chiropractic4.0Brea, CA 92821                         
## [11] Massage Envy - PlacentiaPlacentia, CA 92870                          
## [12] APEX Chiropractic4.3Long Beach, CA 90808 (The Plaza area)            
## [13] Massage Heights - The BluffsNewport Beach, CA 92660                  
## [14] Massage Heights Irvine3.1Irvine, CA 92606 (West Park area)           
## [15] Massage Heights Chino Hills3.1Chino Hills, CA 91709                  
## [16] Career College3.8Carson, CA 90746                                    
## [17] HealthSource of AnaheimAnaheim, CA 92806 (Northeast Anaheim area)    
## [18] Rausch Physical Therapy and Sports PreformanceLaguna Niguel, CA 92677
## [19] Rausch Physical Therapy, INCLaguna Hills, CA 92653                   
## [20] Stretchlab Tustin3.7Tustin, CA 92782                                 
## [21] Health Atlast Long BeachLong Beach, CA 90815 (Los Altos area)        
## [22] Life Rx Wellness IncLos Angeles, CA                                  
## [23] Agape Wellness CenterCosta Mesa, CA 92626                            
## [24] Soothe3.7Santa Ana, CA                                               
## [25] Hand & Stone Massage and Facial Spa3.1Brea, CA 92821                 
## [26] International Bay Clubs LlcNewport Beach, CA 92663                   
## [27] Elements3.6Costa Mesa, CA 92626                                      
## [28] Elements3.6Brea, CA 92821                                            
## [29] Massage Heights3.1Newport Beach, CA 92660+2 locations                
## [30] BODY CENTRE SPAUpland, CA 91786                                      
## [31] Nelson ChiropracticSan Pedro, CA 90732                               
## [32] Hand and Stone3.1Brea, CA 92821                                      
## [33] Massage Heights3.1Newport Beach, CA 92660                            
## [34] JCPenney3.7Whittier, CA 90603+1 location                             
## [35] Massage Heights3.1Mission Viejo, CA 92692                            
## [36] StretchLab3.7Mission Viejo, CA 92692                                 
## [37] Massage Envy3.2Corona, CA 92883 (Wildrose area)                      
## 37 Levels: Advantage Care Chiropractic4.0Brea, CA 92821 ...

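The above list shows the 37 unique business and location entries that posted job advertisements for LMTs on Indeed.com in the 30 or more days leading up to August 17, 2020. Some businesses, such as Massage Heights and Massage Envy, appear more than once because each location is listed separately.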
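
One possible way to make the planned change to the scraper, capturing pay that appears in an ad title such as Zeel's $80/hour, is a small regular-expression helper. This is only a sketch: `extract_hourly` is a hypothetical function name, and the example strings are invented rather than copied from the scraped files.

```r
# Hypothetical helper: pull every dollar amount out of a title or salary string.
extract_hourly <- function(text) {
  # Match a dollar sign followed by digits and an optional decimal part.
  hits <- regmatches(text, gregexpr('\\$[0-9]+(\\.[0-9]+)?', text))[[1]]
  as.numeric(sub('\\$', '', hits))
}

# A title that carries a single hourly rate.
extract_hourly('Massage Therapist - $80/hour')

# A salary tag with a range; range() gives the min and max per post.
range(extract_hourly('$18 - $25 an hour'))

# A string with no pay information returns an empty numeric vector.
extract_hourly('Licensed Massage Therapist')
```

Storing these per-post values next to the city averages would make it possible to see which ad is behind an unusually high rate.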


Thank you for reading this document. I am certain you found something useful in it, along with answers to some questions you didn't know you had. Any ideas for other projects? Please leave a comment and I will respond.

Have a great rest of your day.