Load the data
wages <- read.csv("http://dl.dropbox.com/u/23281950/WagesPop.csv")
Use an ordinary linear regression model to estimate the relationship between vacancy rate, average wage and population. Average wage is statistically significant (p < 0.001), but population is not (p = 0.31), and the R-squared is very low (about 0.08).
x <- lm(Vac ~ AvWage + Population, data = wages)
summary(x)
##
## Call:
## lm(formula = Vac ~ AvWage + Population, data = wages)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -33.40 -13.77  -4.42  12.12  78.88
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  42.5929     3.3294   12.79  < 2e-16 ***
## AvWage       -0.7399     0.1461   -5.06  6.7e-07 ***
## Population   -0.0488     0.0482   -1.01     0.31
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20 on 343 degrees of freedom
## Multiple R-squared: 0.0781, Adjusted R-squared: 0.0727
## F-statistic: 14.5 on 2 and 343 DF, p-value: 8.8e-07
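The significance of each term and the R-squared can also be pulled out of the fitted object instead of being read off the printout. The following is a minimal sketch on synthetic stand-in data: the `sim` data frame and the coefficients used to generate it are invented for illustration and are not the WagesPop data.

```r
# Synthetic stand-in for the wages data; all values are invented for illustration.
set.seed(1)
n <- 346
sim <- data.frame(AvWage = runif(n, 5, 40), Population = runif(n, 0, 100))
sim$Vac <- 40 - 0.7 * sim$AvWage + rnorm(n, sd = 20)  # Population has no true effect here

fit <- lm(Vac ~ AvWage + Population, data = sim)
s <- summary(fit)

# coef(summary) is a matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
p_values <- coef(s)[, "Pr(>|t|)"]
significant <- p_values < 0.05  # conventional 5% threshold

print(round(p_values, 4))
print(significant)
print(s$r.squared)
```

Testing each term against a threshold this way makes it harder to misread a row of the coefficient table.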
It is possible that the relationship between the variables is nonlinear, which we can explore with the mgcv package.
# Install mgcv only if it is missing; naming a mirror avoids the "trying to use CRAN without setting a mirror" error
if (!requireNamespace("mgcv", quietly = TRUE)) install.packages("mgcv", repos = "https://cloud.r-project.org")
library(mgcv)
## This is mgcv 1.7-22. For overview type 'help("mgcv-package")'.
Fit the model with 'gam' (short for 'generalized additive model'), using a smooth term for each predictor.
t <- gam(Vac ~ s(AvWage) + s(Population, k = 5), data = wages)
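One way to judge whether the smooth terms earn their extra flexibility is to compare the additive model with the linear one by AIC. This is a sketch only, on synthetic data with an invented nonlinear wage effect. The names `sim`, `linear_fit` and `smooth_fit` are hypothetical, and the result says nothing about the WagesPop data.

```r
library(mgcv)  # recommended package, normally bundled with R

# Synthetic data with an invented nonlinear wage effect (illustration only)
set.seed(2)
n <- 346
sim <- data.frame(AvWage = runif(n, 5, 40), Population = runif(n, 0, 100))
sim$Vac <- 30 + 10 * sin(sim$AvWage / 5) - 0.3 * sim$AvWage + rnorm(n, sd = 5)

linear_fit <- lm(Vac ~ AvWage + Population, data = sim)
smooth_fit <- gam(Vac ~ s(AvWage) + s(Population, k = 5), data = sim)

# Lower AIC indicates a better trade-off between fit and complexity
print(AIC(linear_fit, smooth_fit))

# Adjusted R-squared of the additive model
print(summary(smooth_fit)$r.sq)
```

On the real data the same two calls to `AIC()` and `summary()` would apply to the `x` and `t` objects fitted above.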
Check the model fit.
gam.check(t)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradiant at convergence was 0.0005124 .
## The Hessian was positive definite.
## The estimated model rank was 14 (maximum possible: 14)
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
##                  k'   edf k-index p-value
## s(AvWage)     9.000 5.656   0.985    0.35
## s(Population) 4.000 1.250   0.839    0.00
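The p-value of 0.00 for s(Population) flags a possible problem with its basis dimension, although its edf (1.250) is well below k' (4.000). A common follow-up is to refit with a larger k and see whether the effective degrees of freedom change much. The sketch below does this on synthetic data; `sim`, `fit_k5` and `fit_k10` are hypothetical names, and the choice of k = 10 is an arbitrary assumption.

```r
library(mgcv)

# Synthetic stand-in data (illustration only, not the WagesPop data)
set.seed(3)
n <- 346
sim <- data.frame(AvWage = runif(n, 5, 40), Population = runif(n, 0, 100))
sim$Vac <- 30 + 10 * sin(sim$AvWage / 5) + rnorm(n, sd = 5)

fit_k5  <- gam(Vac ~ s(AvWage) + s(Population, k = 5),  data = sim)
fit_k10 <- gam(Vac ~ s(AvWage) + s(Population, k = 10), data = sim)

# s.table has one row per smooth; "edf" is the effective degrees of freedom
edf_k5  <- summary(fit_k5)$s.table[, "edf"]
edf_k10 <- summary(fit_k10)$s.table[, "edf"]
print(rbind(k5 = edf_k5, k10 = edf_k10))
```

If the edf for s(Population) barely moves when k is raised, the original basis was probably large enough.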
Visualise the fitted surface.
vis.gam(t, view = c("AvWage", "Population"))
## Warning: data length [31] is not a sub-multiple or multiple of the number
## of rows [30]
The vertical axis, labelled 'linear predictor', is the predicted vacancy rate. The surface has an interesting shape. Higher wages go with lower vacancy rates, as one would expect. There is a slight trough mid-way along the Population axis, but the surface otherwise runs roughly parallel to that axis, which implies that Population on its own has little impact on vacancy rates. Average wage has much more impact, and its effect is non-linear: about one-third of the way along the AvWage axis there is a 'bump' in the vacancy rate. It is possible that this marks the partition between 'McJobs' and more highly paid employment.
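A feature such as the 'bump' is easier to inspect on a one-dimensional slice than on a perspective plot. A minimal sketch, assuming synthetic data with an invented wage effect: predict over a grid of AvWage values while holding Population at its median. The names `sim`, `fit` and `grid`, and the rough band of two standard errors, are illustrative choices and not part of the original analysis.

```r
library(mgcv)

# Synthetic stand-in data (illustration only, not the WagesPop data)
set.seed(4)
n <- 346
sim <- data.frame(AvWage = runif(n, 5, 40), Population = runif(n, 0, 100))
sim$Vac <- 30 + 10 * sin(sim$AvWage / 5) + rnorm(n, sd = 5)
fit <- gam(Vac ~ s(AvWage) + s(Population, k = 5), data = sim)

# Grid over AvWage with Population held at its median
grid <- data.frame(
  AvWage = seq(min(sim$AvWage), max(sim$AvWage), length.out = 50),
  Population = median(sim$Population)
)
pred <- predict(fit, newdata = grid, se.fit = TRUE)
grid$fit   <- as.numeric(pred$fit)
grid$lower <- grid$fit - 2 * as.numeric(pred$se.fit)  # rough lower band
grid$upper <- grid$fit + 2 * as.numeric(pred$se.fit)  # rough upper band

print(head(grid))
# plot(fit, pages = 1) would draw each smooth with its confidence band
```

If the band around a bump is wide relative to its height, the feature may be noise and not a real change in the relationship.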