Predicting the vacancy rate in BC's workforce

This project predicts the percentage of firms reporting vacancies lasting more than four months. The data are first explored using boxplots and then modelled using a regression tree. Each step in the procedure is described below.

Load the data. The data set covers 2009 and is called wages for short. It is held in a public Dropbox folder.

wages <- read.csv("http://dl.dropbox.com/u/23281950/wages.csv")

Then attach the data

attach(wages)

Draw a boxplot of yearly average earnings by region

boxplot(YearAve ~ Region, main = "Yearly Earnings by Region", ylab = "Total Yearly Earnings", 
    xlab = "BC Regional Codes")

It looks as though the medians are pretty much the same, but region 5920 is more widely distributed, with one large outlier. Someone in that region has a fat paycheque!
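The same reading of the boxplot can be checked numerically. The sketch below is only an illustration: it uses a small made-up data frame (the earnings values and the region code 5910 are placeholders, not the 2009 data) to show how the per-region medians and the points that boxplot() would flag as outliers could be computed.

```r
# Made-up earnings for illustration only; not the 2009 wages data.
# Region 5910 is a placeholder code.
toy <- data.frame(
  Region  = rep(c("5910", "5920"), each = 5),
  YearAve = c(40000, 41000, 42000, 43000, 44000,
              39000, 41000, 42000, 43000, 120000)
)

# Median yearly earnings per region
med <- tapply(toy$YearAve, toy$Region, median)
med

# Points beyond 1.5 times the box height from the box edges,
# which is the default rule boxplot() uses to draw outliers
outliers <- tapply(toy$YearAve, toy$Region,
                   function(x) boxplot.stats(x)$out)
outliers
```

With the real data, the same calls on YearAve and Region would identify which observation in region 5920 is the outlier.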

Draw a boxplot of regional vacancies

boxplot(Vac ~ Region, main = "Regional Vacancies", ylab = "Percentage Vacancies")

Vacancies show a much wider spread across regions than earnings do.

Now draw a regression tree. First make sure the 'tree' package is installed by typing install.packages("tree"). You only need to do this once.

Now make the library available to R

library(tree)

This builds a regression tree called vactree using four of the available variables:

vactree <- tree(Vac ~ Region + TrainT + AvWage + Hours)
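To give a feel for what happens at each node of a regression tree, here is a rough sketch of a single split search on one numeric predictor. The helper best_cut and the data are made up for illustration, and the tree package's own search differs in detail; the idea shown is simply to try each candidate cut and keep the one that leaves the smallest total within-group sum of squares.

```r
# Hypothetical helper (not part of the tree package): find the single cut
# on x that minimises the summed within-group sum of squares of y
best_cut <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between adjacent values
  sse <- sapply(cuts, function(cut) {
    left  <- y[x < cut]
    right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]
}

# Made-up data: vacancies drop sharply once the wage passes 16
wage <- c(11, 12, 13, 14, 15, 17, 18, 19, 20, 21)
vac  <- c(30, 31, 29, 32, 30, 10, 11,  9, 12, 10)

best_cut(wage, vac)
```

A full tree repeats this kind of search over every predictor, then again within each of the two resulting groups.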

We can take a look at the tree

plot(vactree)

But this is a messy, untidy tree, so prune it down to five leaves!

prunevac <- prune.tree(vactree, best = 5)
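The size of five is chosen by hand here. One common alternative is to let cross-validation suggest a size. The sketch below is an assumption-laden illustration rather than the analysis above: it fits a tree to made-up data (the data frame toy and its values are invented), uses cv.tree to get the cross-validated deviance for each candidate size, and then prunes to the size with the lowest deviance using prune.tree.

```r
library(tree)

# Made-up data for illustration only; not the 2009 wages data
set.seed(1)
n <- 200
toy <- data.frame(
  AvWage = runif(n, 10, 30),
  Hours  = runif(n, 20, 45)
)
toy$Vac <- ifelse(toy$AvWage < 16, 30, 10) +
  ifelse(toy$Hours < 35, 5, 0) +
  rnorm(n, sd = 3)

fit <- tree(Vac ~ AvWage + Hours, data = toy)

# Cross-validated deviance for each tree size in the pruning sequence
cv <- cv.tree(fit)

# Pick the size with the smallest cross-validated deviance, then prune to it
best_size <- cv$size[which.min(cv$dev)]
pruned <- prune.tree(fit, best = best_size)
```

Plotting cv$size against cv$dev is a quick way to see whether a smaller tree would do almost as well.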

# Take a look at the newly pruned tree and add text labels

plot(prunevac)
text(prunevac)

The tree function has found the 'cuts' in the data which provide the greatest explanatory power. We could also use this for prediction: following one branch down the tree, jobs where the average wage is less than $16.45, the hours available are less than 38.2, and the average wage is less than $12.435 are predicted to have a vacancy rate of 32.46%.
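The branch just described can be written out as a plain rule. This is a sketch only: the function in_leaf and the example jobs are made up, while the thresholds are the ones quoted above. Since a wage below $12.435 is automatically below $16.45, two tests are enough.

```r
# Branch from the text: AvWage < 16.45, then Hours < 38.2, then AvWage < 12.435.
# The first condition is implied by the third, so it is dropped here.
# in_leaf is a hypothetical helper, not part of the tree package.
in_leaf <- function(AvWage, Hours) {
  AvWage < 12.435 & Hours < 38.2
}

# Made-up jobs for illustration
jobs <- data.frame(AvWage = c(11.50, 14.00, 12.00),
                   Hours  = c(35.0, 35.0, 40.0))
jobs$InLeaf <- in_leaf(jobs$AvWage, jobs$Hours)
jobs
```

Only jobs flagged TRUE fall in the leaf with the 32.46% predicted vacancy rate. With the fitted object, calling predict() on the pruned tree with a newdata data frame should perform the same lookup for every leaf at once.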