Load the data. The data set covers 2009 and is called wages for short. It is held in a public Dropbox folder.
wages <- read.csv("http://dl.dropbox.com/u/23281950/wages.csv")
Then attach the data so that its columns can be referred to by name
attach(wages)
Draw a boxplot of yearly average earnings by region
boxplot(YearAve ~ Region, main = "Yearly Earnings by Region", ylab = "Total Yearly Earnings",
xlab = "BC Regional Codes")
It looks as though the medians are pretty much the same, but region 5920 is more widely distributed, with one large outlier. Someone in that region has a fat pay-cheque!
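The same reading can be checked numerically. The sketch below uses a small made-up data frame (`wages_demo` is a hypothetical name, the values are invented, and region codes other than 5920 are placeholders) to show how medians, spreads and boxplot outliers might be computed per region:

```r
# Made-up data for illustration only; these are NOT the values in wages.csv.
# Region codes 5910 and 5930 are placeholders; only 5920 appears in the text.
wages_demo <- data.frame(
  Region  = factor(rep(c("5910", "5920", "5930"), each = 5)),
  YearAve = c(30, 32, 34, 36, 38,
              20, 30, 34, 45, 120,
              31, 33, 34, 35, 37)
)

# Median and interquartile range of yearly earnings within each region
region_median <- tapply(wages_demo$YearAve, wages_demo$Region, median)
region_iqr    <- tapply(wages_demo$YearAve, wages_demo$Region, IQR)

# Points that the boxplot rule (more than 1.5 box lengths beyond the box) draws as outliers
outliers_5920 <- boxplot.stats(wages_demo$YearAve[wages_demo$Region == "5920"])$out

print(region_median)
print(region_iqr)
print(outliers_5920)
```

With the real data, the same calls on YearAve and Region should show similar medians across regions and a larger spread for 5920, if the boxplot reading is right.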
Draw a boxplot of regional vacancies
boxplot(Vac ~ Region, main = "Regional Vacancies", ylab = "Percentage Vacancies")
This time the distributions are much wider.
Now draw a classification tree. First make sure the 'tree' package is installed by typing install.packages("tree"). You only need to do this once.
Now make the library available to R
library(tree)
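Since the install only needs to happen once, it can be made conditional so that the script is safe to re-run. A minimal sketch; the CRAN mirror URL is an assumption and any mirror would do:

```r
# Install 'tree' only if it is not already available, then load it.
# The repos URL is an assumed mirror; substitute your preferred one.
if (!requireNamespace("tree", quietly = TRUE)) {
  install.packages("tree", repos = "https://cloud.r-project.org")
}
library(tree)
```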
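This builds a tree called vactree that models Vac using four of the available variables (Region, TrainT, AvWage and Hours). Because Vac is a numeric percentage, tree() fits a regression tree rather than a true classification tree: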
vactree <- tree(Vac ~ Region + TrainT + AvWage + Hours)
We can take a look at the tree
plot(vactree)
But this is a messy, untidy tree, so prune it back to five terminal nodes with prune.tree():
prunevac <- prune.tree(vactree, best = 5)
# Take a look at the newly pruned tree and add the split labels
plot(prunevac)
text(prunevac)
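The size of five is picked by hand here. One possible way to let cross-validation suggest a size is sketched below on simulated data (the `demo` data frame and its values are invented; only the column names mirror the tutorial). In the tree package, cv.tree() reports the cross-validated deviance for each candidate size, and prune.tree() returns the pruned tree itself:

```r
library(tree)

# Simulated data for illustration only; the values are invented.
set.seed(1)
n <- 200
demo <- data.frame(
  AvWage = runif(n, 10, 30),
  Hours  = runif(n, 20, 45)
)
# Assumed relationship: low-wage jobs have a higher vacancy rate, plus noise
demo$Vac <- ifelse(demo$AvWage < 16, 30, 10) + rnorm(n, sd = 3)

demo_tree <- tree(Vac ~ AvWage + Hours, data = demo)

# Cross-validated deviance for each candidate tree size
cv_result <- cv.tree(demo_tree)
best_size <- cv_result$size[which.min(cv_result$dev)]

# Prune to the suggested size (at least 2 leaves, so the tree can be plotted)
demo_pruned <- prune.tree(demo_tree, best = max(best_size, 2))
plot(demo_pruned)
text(demo_pruned)
```

Cross-validation folds are random, so the suggested size can change from run to run unless a seed is set.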
The tree function has found the 'cuts' in the data which provide the greatest explanatory power. We could also use this for prediction: jobs where the average wage is less than $16.45, the hours available are less than 38.2, and the average wage is less than $12.435 will have a predicted vacancy rate of 32.46%.
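To make the reading of that branch concrete, here it is written out as a plain function. This is only a sketch: `predict_vacancy_branch` is a hypothetical helper, the thresholds and the 32.46 value are copied from the paragraph above, and the other leaves of the tree are not given, so anything off this branch returns NA.

```r
# Hypothetical helper that follows the single branch quoted in the text.
# Jobs that leave the branch return NA because the other leaf values
# are not stated.
predict_vacancy_branch <- function(av_wage, hours) {
  on_branch <- av_wage < 16.45 & hours < 38.2 & av_wage < 12.435
  ifelse(on_branch, 32.46, NA_real_)
}

# Three made-up jobs: only the first satisfies every split on the branch
branch_result <- predict_vacancy_branch(av_wage = c(11.50, 15.00, 11.50),
                                        hours   = c(35, 35, 40))
print(branch_result)
```

In practice, calling predict() on a fitted tree object with a data frame of new jobs should do the same lookup across every leaf.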