Linear Regression Assignment

This is my attempt at the linear regression assignment from the General Assembly Data Science course, which is based on the Job Salary Prediction competition on Kaggle.

We will first read in the training data, and convert some columns into factors:

train <- read.csv("train.csv", as.is = TRUE)
train$ContractType <- as.factor(train$ContractType)
train$ContractTime <- as.factor(train$ContractTime)
train$Category <- as.factor(train$Category)
train$SourceName <- as.factor(train$SourceName)
train$Company <- as.factor(train$Company)
train$LocationNormalized <- as.factor(train$LocationNormalized)
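
The six conversions above can also be written as a single step over a vector of column names. The following is a minimal sketch on a small made-up data frame (the name toy and its values are illustrative assumptions, not taken from train.csv):

```r
# Toy data frame standing in for train.csv (values are made up)
toy <- data.frame(
    ContractType = c("full_time", "part_time", "", "full_time"),
    ContractTime = c("permanent", "contract", "permanent", ""),
    SalaryNormalized = c(30000, 18000, 25000, 42000),
    stringsAsFactors = FALSE
)

# Columns to treat as categorical
factor.cols <- c("ContractType", "ContractTime")

# Convert every listed column to a factor in one step
toy[factor.cols] <- lapply(toy[factor.cols], as.factor)

str(toy)
levels(toy$ContractType)
```

Note that a blank string, if present in a column, becomes its own factor level.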

Data Exploration

Let's do two quick visualizations, to check that ContractType and ContractTime do have an effect on SalaryNormalized:

library(ggplot2)
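# geom_histogram() bins the continuous salary; in current ggplot2 versions
# geom_bar() would count exact values instead
ggplot(train, aes(SalaryNormalized)) + geom_histogram() + facet_grid(ContractType ~ 
    .)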
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: histograms of SalaryNormalized, one panel per ContractType level]

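# geom_histogram() bins the continuous salary; in current ggplot2 versions
# geom_bar() would count exact values instead
ggplot(train, aes(SalaryNormalized)) + geom_histogram() + facet_grid(ContractTime ~ 
    .)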
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: histograms of SalaryNormalized, one panel per ContractTime level]

Those both look useful. Next, let's run some simple stats on Category and SourceName:

tapply(train$SalaryNormalized, train$Category, mean)
##        Accounting & Finance Jobs                       Admin Jobs 
##                            37375                            19307 
##         Charity & Voluntary Jobs                 Consultancy Jobs 
##                            26604                            33914 
##           Creative & Design Jobs           Customer Services Jobs 
##                            25678                            18548 
##    Domestic help & Cleaning Jobs           Energy, Oil & Gas Jobs 
##                            14928                            45461 
##                 Engineering Jobs                    Graduate Jobs 
##                            33557                            26439 
##        Healthcare & Nursing Jobs      Hospitality & Catering Jobs 
##                            29675                            21567 
##            HR & Recruitment Jobs                          IT Jobs 
##                            30157                            43306 
##                       Legal Jobs       Logistics & Warehouse Jobs 
##                            28509                            21716 
##                 Maintenance Jobs               Manufacturing Jobs 
##                            29309                            27191 
##               Other/General Jobs PR, Advertising & Marketing Jobs 
##                            30815                            33402 
##                    Property Jobs                      Retail Jobs 
##                            29688                            26954 
##                       Sales Jobs             Scientific & QA Jobs 
##                            28374                            32911 
##                 Social work Jobs                    Teaching Jobs 
##                            32420                            27633 
##        Trade & Construction Jobs                      Travel Jobs 
##                            32006                            22214
table(train$Category)
## 
##        Accounting & Finance Jobs                       Admin Jobs 
##                              606                              151 
##         Charity & Voluntary Jobs                 Consultancy Jobs 
##                               23                               80 
##           Creative & Design Jobs           Customer Services Jobs 
##                               22                              257 
##    Domestic help & Cleaning Jobs           Energy, Oil & Gas Jobs 
##                               10                               31 
##                 Engineering Jobs                    Graduate Jobs 
##                             1152                               19 
##        Healthcare & Nursing Jobs      Hospitality & Catering Jobs 
##                             3149                              525 
##            HR & Recruitment Jobs                          IT Jobs 
##                              578                             1414 
##                       Legal Jobs       Logistics & Warehouse Jobs 
##                               88                              110 
##                 Maintenance Jobs               Manufacturing Jobs 
##                               20                              106 
##               Other/General Jobs PR, Advertising & Marketing Jobs 
##                              236                               88 
##                    Property Jobs                      Retail Jobs 
##                               44                               93 
##                       Sales Jobs             Scientific & QA Jobs 
##                              426                              129 
##                 Social work Jobs                    Teaching Jobs 
##                               53                              342 
##        Trade & Construction Jobs                      Travel Jobs 
##                              148                              100
tapply(train$SalaryNormalized, train$SourceName, mean)
##         accountancyagejobs.com           allhousingjobs.co.uk 
##                          48875                          31573 
##            Brand Republic Jobs              careerbuilder.com 
##                          53250                          31667 
##            careerstructure.com                 careworx.co.uk 
##                          85000                          29292 
##                    caterer.com               charityjob.co.uk 
##                          21711                          24250 
##               cv-library.co.uk                   cwjobs.co.uk 
##                          29851                          44798 
##           discoverretail.co.uk              eFinancialCareers 
##                          25667                          62723 
##               emptylemon.co.uk                    fish4.co.uk 
##                          40022                          31605 
##                     hays.co.uk                 hotrecruit.com 
##                          41992                          36102 
##                 Jobcentre Plus                      jobg8.com 
##                          19645                          39168 
##             jobs.cabincrew.com jobs.catererandhotelkeeper.com 
##                          27000                          22436 
##            jobs.guardian.co.uk    jobs.planningresource.co.uk 
##                          31187                          30380 
##           jobs.telegraph.co.uk                         Jobs24 
##                          41262                          33642 
##             jobs4medical.co.uk                   jobserve.com 
##                          24000                          46193 
##              jobsfinancial.com               jobsincredit.com 
##                          49009                          97500 
##          jobsineducation.co.uk         jobsinsocialwork.co.uk 
##                          30004                          23530 
##                  jobsite.co.uk               juniorbroker.com 
##                          55900                          16312 
##                leisurejobs.com               londonjobs.co.uk 
##                          21076                          25000 
##              michaelpage.co.uk          Multilingualvacancies 
##                          45000                          24768 
##          myjobs.cimaglobal.com                       MyUkJobs 
##                          47656                          24959 
##              nijobfinder.co.uk                     nijobs.com 
##                          26050                          35000 
##         nurseryworldjobs.co.uk         oilandgasjobsearch.com 
##                          30000                          33596 
##                 OilCareers.com          onlymarketingjobs.com 
##                          71350                          24833 
##         peoplemanagement.co.uk            Personneltoday Jobs 
##                          30000                          31750 
##              planetrecruit.com                   PR Week Jobs 
##                          43278                          35408 
##   professionalpensionsjobs.com             propertyjobs.co.uk 
##                          29250                          36667 
##                 randstadfp.com                  recruitni.com 
##                          43400                          33538 
##           rengineeringjobs.com               retailchoice.com 
##                          40260                          26001 
##                    rtmjobs.com              salestarget.co.uk 
##                          40000                          42200 
##        securityclearedjobs.com             simplyhrjobs.co.uk 
##                          37838                          22500 
##      simplymarketingjobs.co.uk          simplysalesjobs.co.uk 
##                          46250                          35833 
##                 staffnurse.com              strike-jobs.co.uk 
##                          32107                          36582 
##               targetjobs.co.uk               technojobs.co.uk 
##                          26682                          53400 
##          thecareerengineer.com              thegraduate.co.uk 
##                          34154                          22000 
##            theitjobboard.co.uk               theladders.co.uk 
##                          46431                          62499 
##              Third Sector Jobs                  totaljobs.com 
##                          30857                          32962 
##                 uksport.gov.uk            wileyjobnetwork.com 
##                          45000                          34000 
##                  workthing.com                     zartis.com 
##                          21375                          34680
table(train$SourceName)
## 
##         accountancyagejobs.com           allhousingjobs.co.uk 
##                              4                              1 
##            Brand Republic Jobs              careerbuilder.com 
##                              4                             24 
##            careerstructure.com                 careworx.co.uk 
##                              1                           2946 
##                    caterer.com               charityjob.co.uk 
##                            264                              2 
##               cv-library.co.uk                   cwjobs.co.uk 
##                            874                             52 
##           discoverretail.co.uk              eFinancialCareers 
##                              3                             46 
##               emptylemon.co.uk                    fish4.co.uk 
##                              4                            441 
##                     hays.co.uk                 hotrecruit.com 
##                            293                             94 
##                 Jobcentre Plus                      jobg8.com 
##                            122                             17 
##             jobs.cabincrew.com jobs.catererandhotelkeeper.com 
##                              1                            152 
##            jobs.guardian.co.uk    jobs.planningresource.co.uk 
##                             13                              3 
##           jobs.telegraph.co.uk                         Jobs24 
##                              2                             48 
##             jobs4medical.co.uk                   jobserve.com 
##                              1                             17 
##              jobsfinancial.com               jobsincredit.com 
##                             58                              1 
##          jobsineducation.co.uk         jobsinsocialwork.co.uk 
##                             99                             13 
##                  jobsite.co.uk               juniorbroker.com 
##                              2                              8 
##                leisurejobs.com               londonjobs.co.uk 
##                             84                              1 
##              michaelpage.co.uk          Multilingualvacancies 
##                              1                            139 
##          myjobs.cimaglobal.com                       MyUkJobs 
##                             32                           1559 
##              nijobfinder.co.uk                     nijobs.com 
##                            131                              4 
##         nurseryworldjobs.co.uk         oilandgasjobsearch.com 
##                              1                              3 
##                 OilCareers.com          onlymarketingjobs.com 
##                              5                              3 
##         peoplemanagement.co.uk            Personneltoday Jobs 
##                              1                              2 
##              planetrecruit.com                   PR Week Jobs 
##                            436                              3 
##   professionalpensionsjobs.com             propertyjobs.co.uk 
##                              2                              3 
##                 randstadfp.com                  recruitni.com 
##                              5                            218 
##           rengineeringjobs.com               retailchoice.com 
##                             60                             19 
##                    rtmjobs.com              salestarget.co.uk 
##                              1                              5 
##        securityclearedjobs.com             simplyhrjobs.co.uk 
##                             13                              1 
##      simplymarketingjobs.co.uk          simplysalesjobs.co.uk 
##                              8                              6 
##                 staffnurse.com              strike-jobs.co.uk 
##                             54                            188 
##               targetjobs.co.uk               technojobs.co.uk 
##                             17                              5 
##          thecareerengineer.com              thegraduate.co.uk 
##                            312                              1 
##            theitjobboard.co.uk               theladders.co.uk 
##                            643                              2 
##              Third Sector Jobs                  totaljobs.com 
##                              5                            403 
##                 uksport.gov.uk            wileyjobnetwork.com 
##                              1                              1 
##                  workthing.com                     zartis.com 
##                              4                              8
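
Group means are easier to judge when the group sizes sit next to them, since a mean over one or two postings says little. As a rough sketch, the two summaries can be combined into one table; the data frame toy and its values below are made up for illustration:

```r
# Made-up postings for illustration
toy <- data.frame(
    Category = c("IT Jobs", "IT Jobs", "Admin Jobs", "Admin Jobs", "Legal Jobs"),
    SalaryNormalized = c(40000, 50000, 18000, 20000, 30000)
)

# Mean salary and number of postings per category
means <- tapply(toy$SalaryNormalized, toy$Category, mean)
counts <- table(toy$Category)

# Put both in one data frame, matching the counts to the means by name
by.cat <- data.frame(
    Category = names(means),
    mean = as.vector(means),
    n = as.vector(counts[names(means)])
)

# Show the best-supported categories first
by.cat[order(-by.cat$n), ]
```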

Both could be useful, though I'll probably stay away from SourceName since there are many sources that only have a few observations. Next, let's look at Company and LocationNormalized:

# output suppressed
table(train$Company)
table(train$LocationNormalized)

Both have far too many distinct levels to be practical as factors in a linear model, so I'm going to stay away from these.
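
For reference, one common way to keep a factor with many sparse levels is to collapse the rare levels into a single "Other" level. This is not done in this write-up; the sketch below only illustrates the idea on a made-up factor, and the threshold min.count is an arbitrary assumption:

```r
# Toy factor standing in for a column with many sparse levels
company <- factor(c("A", "A", "A", "B", "B", "C", "D", "E"))

# Number of distinct levels
nlevels(company)

# Find the levels observed fewer than min.count times
min.count <- 2
counts <- table(company)
rare <- names(counts)[counts < min.count]

# Collapse the rare levels into a single "Other" level
lumped <- as.character(company)
lumped[lumped %in% rare] <- "Other"
lumped <- factor(lumped)

table(lumped)
```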

Creating New Features

I want to use location_tree.txt to create a much broader location feature:

con <- file("location_tree.txt", "r")
tree <- readLines(con)
close(con)

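# pre-allocate the column so rows without a usable match stay NA
train$Location <- NA_character_
for (i in seq_len(nrow(train))) {
    # get city name as plain text (LocationNormalized is a factor)
    loc <- as.character(train$LocationNormalized[i])
    # find the first line in the tree in which that city name appears
    line.id <- which(grepl(loc, tree))[1]
    # no line matched: leave NA instead of failing
    if (is.na(line.id)) next
    # use regular expressions to pull out the broad location
    r <- regexpr("~.+?~", tree[line.id])
    match <- regmatches(tree[line.id], r)
    # store the broad location; an empty match would otherwise raise
    # "replacement has length zero"
    if (length(match) == 1) train$Location[i] <- gsub("~", "", match)
}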

train$Location <- as.factor(train$Location)
table(train$Location)
## 
##            East Midlands          Eastern England                   London 
##                      349                      598                     1928 
##       North East England       North West England         Northern Ireland 
##                      214                      561                      125 
##                 Scotland       South East England       South West England 
##                      443                     4500                      514 
##                    Wales            West Midlands Yorkshire And The Humber 
##                      173                      366                      229
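
The lookup can also be written as a small function and applied to every place name at once. The sketch below assumes each tree line has the form "UK~Region~...~Town"; the three tree lines, the place names and the helper broad.location are all hypothetical and only illustrate the idea:

```r
# Hypothetical lines in the assumed "UK~Region~...~Town" format
tree <- c(
    "UK~London~Central London~Soho",
    "UK~South East England~Surrey~Guildford",
    "UK~Scotland~Glasgow"
)

# Look up the broad region for one place name; NA if nothing matches
broad.location <- function(loc, tree) {
    # fixed = TRUE treats the name as literal text, not a regular expression
    line.id <- which(grepl(loc, tree, fixed = TRUE))[1]
    if (is.na(line.id)) return(NA_character_)
    # the region is the second "~"-separated field
    parts <- strsplit(tree[line.id], "~", fixed = TRUE)[[1]]
    if (length(parts) < 2) return(NA_character_)
    parts[2]
}

places <- c("Guildford", "Soho", "Atlantis")
regions <- vapply(places, broad.location, character(1), tree = tree,
    USE.NAMES = FALSE)
regions
```

Returning NA for unmatched names keeps the result the same length as the input, so it can be assigned straight to a data frame column.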

I'm also going to make a simpler London vs non-London feature, to see if that performs better:

train$London <- as.factor(ifelse(train$Location == "London", "Yes", "No"))

Let's add some text features:

train$TitleSenior <- as.factor(ifelse(grepl("[Ss]enior", train$Title), "Yes", 
    "No"))
train$TitleManage <- as.factor(ifelse(grepl("[Mm]anage", train$Title), "Yes", 
    "No"))
train$DescripSenior <- as.factor(ifelse(grepl("[Ss]enior", train$FullDescription), 
    "Yes", "No"))
train$DescripManage <- as.factor(ifelse(grepl("[Mm]anage", train$FullDescription), 
    "Yes", "No"))

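The character classes [Ss] and [Mm] catch only two capitalisations each, so an all-caps title such as "SENIOR NURSE" would be missed. A possible variant, shown here on made-up titles, is to pass ignore.case = TRUE to grepl:

```r
# Made-up job titles for illustration
titles <- c("Senior Software Engineer", "Account manager", "SENIOR NURSE",
    "Care Assistant")

# ignore.case = TRUE matches any capitalisation of the keyword
senior <- grepl("senior", titles, ignore.case = TRUE)
manage <- grepl("manage", titles, ignore.case = TRUE)

# Same Yes/No encoding as the features above, with explicit level order
title.senior <- factor(ifelse(senior, "Yes", "No"), levels = c("No", "Yes"))
table(title.senior)
```
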
Enough feature engineering… let's do some modeling!

Modeling: Intercept-only

Since the number of observations is large, I'll split the training data into a training set and validation set:

set.seed(100)
train.index <- sample(nrow(train), 7000, replace = FALSE)
tr <- train[train.index, ]
val <- train[-train.index, ]
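
The same 70/30 split, sketched on a ten-row made-up data frame so the pieces are easy to inspect. The names toy, toy.tr and toy.val are illustrative assumptions:

```r
# Toy data frame with 10 rows (made-up values)
toy <- data.frame(id = 1:10, y = seq(10, 100, by = 10))

set.seed(100)
# Draw 70% of the row indices without replacement
n <- nrow(toy)
idx <- sample(n, size = round(0.7 * n))

# Positive indices select the training rows, negative indices the rest
toy.tr <- toy[idx, ]
toy.val <- toy[-idx, ]

# The two sets are disjoint and together cover every row
nrow(toy.tr)
nrow(toy.val)
length(intersect(toy.tr$id, toy.val$id))
```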

Let's check the RMSE for a model fit with just an intercept:

sqrt(mean((mean(tr$SalaryNormalized) - val$SalaryNormalized)^2))
## [1] 16201
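
An intercept-only model predicts the training mean for every row, which is why the expression above needs no call to lm. Since RMSE is computed again for every model that follows, it may help to wrap it in a function. A small sketch, with the helper name rmse and the salaries below being made up for illustration:

```r
# Root mean squared error between predictions and observed values
rmse <- function(pred, actual) {
    sqrt(mean((pred - actual)^2))
}

# Made-up salaries for illustration
tr.salary <- c(20000, 30000, 40000)
val.salary <- c(25000, 35000)

# Intercept-only baseline: predict the training mean for every validation row
baseline <- mean(tr.salary)
rmse(baseline, val.salary)
```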

Modeling: Linear Models

Let's create some simple linear models, and add more features each time to see how it affects RMSE:

lm.fit1 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category, data = tr)
summary(lm.fit1)
## 
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime + 
##     Category, data = tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38567  -8189  -2896   5623 129929 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                                 33567        765   43.90
## ContractTypefull_time                        -312        420   -0.74
## ContractTypepart_time                       -7431        804   -9.24
## ContractTimecontract                        12182        837   14.55
## ContractTimepermanent                        6243        547   11.41
## CategoryAdmin Jobs                         -16523       1560  -10.59
## CategoryCharity & Voluntary Jobs           -14438       3698   -3.90
## CategoryConsultancy Jobs                     -665       2169   -0.31
## CategoryCreative & Design Jobs             -16644       4429   -3.76
## CategoryCustomer Services Jobs             -18559       1283  -14.46
## CategoryDomestic help & Cleaning Jobs      -19525       5175   -3.77
## CategoryEnergy, Oil & Gas Jobs               9324       2989    3.12
## CategoryEngineering Jobs                    -5413        892   -6.07
## CategoryGraduate Jobs                       -7506       3419   -2.20
## CategoryHealthcare & Nursing Jobs           -2878        829   -3.47
## CategoryHospitality & Catering Jobs        -12671       1058  -11.98
## CategoryHR & Recruitment Jobs               -8041       1012   -7.95
## CategoryIT Jobs                              3378        867    3.90
## CategoryLegal Jobs                          -6589       1951   -3.38
## CategoryLogistics & Warehouse Jobs         -14141       1772   -7.98
## CategoryMaintenance Jobs                    -8282       3596   -2.30
## CategoryManufacturing Jobs                  -9867       1778   -5.55
## CategoryOther/General Jobs                  -9080       1389   -6.54
## CategoryPR, Advertising & Marketing Jobs    -4472       2045   -2.19
## CategoryProperty Jobs                       -6548       2933   -2.23
## CategoryRetail Jobs                        -11374       1978   -5.75
## CategorySales Jobs                          -9562       1115   -8.57
## CategoryScientific & QA Jobs                -5563       1693   -3.29
## CategorySocial work Jobs                    -6011       2703   -2.22
## CategoryTeaching Jobs                      -12028       1165  -10.32
## CategoryTrade & Construction Jobs           -4568       1596   -2.86
## CategoryTravel Jobs                        -18150       1840   -9.86
##                                          Pr(>|t|)    
## (Intercept)                               < 2e-16 ***
## ContractTypefull_time                     0.45781    
## ContractTypepart_time                     < 2e-16 ***
## ContractTimecontract                      < 2e-16 ***
## ContractTimepermanent                     < 2e-16 ***
## CategoryAdmin Jobs                        < 2e-16 ***
## CategoryCharity & Voluntary Jobs          9.6e-05 ***
## CategoryConsultancy Jobs                  0.75923    
## CategoryCreative & Design Jobs            0.00017 ***
## CategoryCustomer Services Jobs            < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs     0.00016 ***
## CategoryEnergy, Oil & Gas Jobs            0.00182 ** 
## CategoryEngineering Jobs                  1.4e-09 ***
## CategoryGraduate Jobs                     0.02815 *  
## CategoryHealthcare & Nursing Jobs         0.00052 ***
## CategoryHospitality & Catering Jobs       < 2e-16 ***
## CategoryHR & Recruitment Jobs             2.2e-15 ***
## CategoryIT Jobs                           9.9e-05 ***
## CategoryLegal Jobs                        0.00074 ***
## CategoryLogistics & Warehouse Jobs        1.7e-15 ***
## CategoryMaintenance Jobs                  0.02132 *  
## CategoryManufacturing Jobs                3.0e-08 ***
## CategoryOther/General Jobs                6.7e-11 ***
## CategoryPR, Advertising & Marketing Jobs  0.02882 *  
## CategoryProperty Jobs                     0.02560 *  
## CategoryRetail Jobs                       9.2e-09 ***
## CategorySales Jobs                        < 2e-16 ***
## CategoryScientific & QA Jobs              0.00102 ** 
## CategorySocial work Jobs                  0.02620 *  
## CategoryTeaching Jobs                     < 2e-16 ***
## CategoryTrade & Construction Jobs         0.00422 ** 
## CategoryTravel Jobs                       < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14500 on 6968 degrees of freedom
## Multiple R-squared:  0.196,  Adjusted R-squared:  0.193 
## F-statistic: 54.9 on 31 and 6968 DF,  p-value: <2e-16
lm.pred1 <- predict(lm.fit1, newdata = val)
sqrt(mean((lm.pred1 - val$SalaryNormalized)^2))
## [1] 14701
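
The fit, predict and score steps above repeat for each model, so they can be put in a loop over candidate formulas. The sketch below uses simulated data rather than the salary data; the column names, effect sizes and noise level are arbitrary assumptions chosen only to make the example run:

```r
set.seed(1)
# Simulated data: salary depends on a contract factor and a London flag
n <- 200
toy <- data.frame(
    Contract = factor(sample(c("contract", "permanent"), n, replace = TRUE)),
    London = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$Salary <- 30000 + 5000 * (toy$Contract == "contract") +
    8000 * (toy$London == "Yes") + rnorm(n, sd = 2000)

# 70/30 split into training and validation rows
idx <- sample(n, 140)
toy.tr <- toy[idx, ]
toy.val <- toy[-idx, ]

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Candidate formulas, the second adding one feature to the first
formulas <- list(
    Salary ~ Contract,
    Salary ~ Contract + London
)

# Fit on the training rows, score on the validation rows
val.rmse <- sapply(formulas, function(f) {
    fit <- lm(f, data = toy.tr)
    rmse(predict(fit, newdata = toy.val), toy.val$Salary)
})
names(val.rmse) <- sapply(formulas, function(f) deparse(f))
val.rmse
```

Scoring every model on the same validation rows keeps the RMSE values directly comparable.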

lm.fit2 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category + Location, 
    data = tr)
summary(lm.fit2)
## 
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime + 
##     Category + Location, data = tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38574  -8376  -2816   5425 129864 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                                 32886       1206   27.27
## ContractTypefull_time                        -768        423   -1.82
## ContractTypepart_time                       -7995        806   -9.91
## ContractTimecontract                        12056        856   14.09
## ContractTimepermanent                        5955        568   10.49
## CategoryAdmin Jobs                         -16713       1562  -10.70
## CategoryCharity & Voluntary Jobs           -14842       3679   -4.03
## CategoryConsultancy Jobs                     -773       2158   -0.36
## CategoryCreative & Design Jobs             -16679       4406   -3.79
## CategoryCustomer Services Jobs             -18071       1280  -14.12
## CategoryDomestic help & Cleaning Jobs      -19579       5153   -3.80
## CategoryEnergy, Oil & Gas Jobs               9411       2974    3.16
## CategoryEngineering Jobs                    -5266        894   -5.89
## CategoryGraduate Jobs                       -7287       3410   -2.14
## CategoryHealthcare & Nursing Jobs           -2827        832   -3.40
## CategoryHospitality & Catering Jobs        -13289       1057  -12.57
## CategoryHR & Recruitment Jobs               -8124       1019   -7.97
## CategoryIT Jobs                              3371        866    3.89
## CategoryLegal Jobs                          -6639       1942   -3.42
## CategoryLogistics & Warehouse Jobs         -14097       1770   -7.97
## CategoryMaintenance Jobs                    -9208       3582   -2.57
## CategoryManufacturing Jobs                  -9598       1776   -5.41
## CategoryOther/General Jobs                  -9404       1383   -6.80
## CategoryPR, Advertising & Marketing Jobs    -4736       2037   -2.33
## CategoryProperty Jobs                       -7065       2918   -2.42
## CategoryRetail Jobs                        -11257       1969   -5.72
## CategorySales Jobs                          -9714       1112   -8.73
## CategoryScientific & QA Jobs                -5522       1686   -3.27
## CategorySocial work Jobs                    -5906       2694   -2.19
## CategoryTeaching Jobs                      -12273       1168  -10.51
## CategoryTrade & Construction Jobs           -4323       1596   -2.71
## CategoryTravel Jobs                        -18255       1831   -9.97
## LocationEastern England                      2557       1180    2.17
## LocationLondon                               3574       1023    3.49
## LocationNorth East England                   -433       1513   -0.29
## LocationNorth West England                  -1921       1184   -1.62
## LocationNorthern Ireland                      118       1905    0.06
## LocationScotland                             1542       1260    1.22
## LocationSouth East England                    822        994    0.83
## LocationSouth West England                  -1994       1212   -1.65
## LocationWales                                -494       1637   -0.30
## LocationWest Midlands                        -328       1310   -0.25
## LocationYorkshire And The Humber            -2307       1490   -1.55
##                                          Pr(>|t|)    
## (Intercept)                               < 2e-16 ***
## ContractTypefull_time                     0.06949 .  
## ContractTypepart_time                     < 2e-16 ***
## ContractTimecontract                      < 2e-16 ***
## ContractTimepermanent                     < 2e-16 ***
## CategoryAdmin Jobs                        < 2e-16 ***
## CategoryCharity & Voluntary Jobs          5.5e-05 ***
## CategoryConsultancy Jobs                  0.72012    
## CategoryCreative & Design Jobs            0.00015 ***
## CategoryCustomer Services Jobs            < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs     0.00015 ***
## CategoryEnergy, Oil & Gas Jobs            0.00156 ** 
## CategoryEngineering Jobs                  4.0e-09 ***
## CategoryGraduate Jobs                     0.03261 *  
## CategoryHealthcare & Nursing Jobs         0.00069 ***
## CategoryHospitality & Catering Jobs       < 2e-16 ***
## CategoryHR & Recruitment Jobs             1.8e-15 ***
## CategoryIT Jobs                           0.00010 ***
## CategoryLegal Jobs                        0.00063 ***
## CategoryLogistics & Warehouse Jobs        1.9e-15 ***
## CategoryMaintenance Jobs                  0.01017 *  
## CategoryManufacturing Jobs                6.7e-08 ***
## CategoryOther/General Jobs                1.1e-11 ***
## CategoryPR, Advertising & Marketing Jobs  0.02007 *  
## CategoryProperty Jobs                     0.01550 *  
## CategoryRetail Jobs                       1.1e-08 ***
## CategorySales Jobs                        < 2e-16 ***
## CategoryScientific & QA Jobs              0.00106 ** 
## CategorySocial work Jobs                  0.02837 *  
## CategoryTeaching Jobs                     < 2e-16 ***
## CategoryTrade & Construction Jobs         0.00675 ** 
## CategoryTravel Jobs                       < 2e-16 ***
## LocationEastern England                   0.03024 *  
## LocationLondon                            0.00048 ***
## LocationNorth East England                0.77493    
## LocationNorth West England                0.10490    
## LocationNorthern Ireland                  0.95042    
## LocationScotland                          0.22114    
## LocationSouth East England                0.40820    
## LocationSouth West England                0.09985 .  
## LocationWales                             0.76307    
## LocationWest Midlands                     0.80249    
## LocationYorkshire And The Humber          0.12167    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14400 on 6957 degrees of freedom
## Multiple R-squared:  0.207,  Adjusted R-squared:  0.202 
## F-statistic: 43.2 on 42 and 6957 DF,  p-value: <2e-16
lm.pred2 <- predict(lm.fit2, newdata = val)
sqrt(mean((lm.pred2 - val$SalaryNormalized)^2))
## [1] 14619

lm.fit3 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category + London, 
    data = tr)
summary(lm.fit3)
## 
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime + 
##     Category + London, data = tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38479  -8319  -2935   5404 130243 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                                 32956        766   43.00
## ContractTypefull_time                        -664        421   -1.58
## ContractTypepart_time                       -7635        802   -9.53
## ContractTimecontract                        12459        835   14.92
## ContractTimepermanent                        6274        545   11.51
## CategoryAdmin Jobs                         -16253       1555  -10.46
## CategoryCharity & Voluntary Jobs           -14695       3685   -3.99
## CategoryConsultancy Jobs                     -642       2161   -0.30
## CategoryCreative & Design Jobs             -16529       4412   -3.75
## CategoryCustomer Services Jobs             -18328       1279  -14.33
## CategoryDomestic help & Cleaning Jobs      -19644       5156   -3.81
## CategoryEnergy, Oil & Gas Jobs               9674       2978    3.25
## CategoryEngineering Jobs                    -5134        890   -5.77
## CategoryGraduate Jobs                       -6754       3408   -1.98
## CategoryHealthcare & Nursing Jobs           -2857        826   -3.46
## CategoryHospitality & Catering Jobs        -12951       1054  -12.28
## CategoryHR & Recruitment Jobs               -7983       1008   -7.92
## CategoryIT Jobs                              3624        865    4.19
## CategoryLegal Jobs                          -6369       1944   -3.28
## CategoryLogistics & Warehouse Jobs         -13756       1767   -7.79
## CategoryMaintenance Jobs                    -9475       3587   -2.64
## CategoryManufacturing Jobs                  -9672       1772   -5.46
## CategoryOther/General Jobs                  -9149       1384   -6.61
## CategoryPR, Advertising & Marketing Jobs    -4469       2038   -2.19
## CategoryProperty Jobs                       -6815       2922   -2.33
## CategoryRetail Jobs                        -11087       1971   -5.63
## CategorySales Jobs                          -9517       1111   -8.56
## CategoryScientific & QA Jobs                -5855       1687   -3.47
## CategorySocial work Jobs                    -5555       2694   -2.06
## CategoryTeaching Jobs                      -12501       1162  -10.75
## CategoryTrade & Construction Jobs           -4313       1590   -2.71
## CategoryTravel Jobs                        -18265       1834   -9.96
## LondonYes                                    3243        446    7.27
##                                          Pr(>|t|)    
## (Intercept)                               < 2e-16 ***
## ContractTypefull_time                     0.11491    
## ContractTypepart_time                     < 2e-16 ***
## ContractTimecontract                      < 2e-16 ***
## ContractTimepermanent                     < 2e-16 ***
## CategoryAdmin Jobs                        < 2e-16 ***
## CategoryCharity & Voluntary Jobs          6.7e-05 ***
## CategoryConsultancy Jobs                  0.76648    
## CategoryCreative & Design Jobs            0.00018 ***
## CategoryCustomer Services Jobs            < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs     0.00014 ***
## CategoryEnergy, Oil & Gas Jobs            0.00117 ** 
## CategoryEngineering Jobs                  8.3e-09 ***
## CategoryGraduate Jobs                     0.04752 *  
## CategoryHealthcare & Nursing Jobs         0.00054 ***
## CategoryHospitality & Catering Jobs       < 2e-16 ***
## CategoryHR & Recruitment Jobs             2.8e-15 ***
## CategoryIT Jobs                           2.8e-05 ***
## CategoryLegal Jobs                        0.00106 ** 
## CategoryLogistics & Warehouse Jobs        7.9e-15 ***
## CategoryMaintenance Jobs                  0.00827 ** 
## CategoryManufacturing Jobs                5.0e-08 ***
## CategoryOther/General Jobs                4.1e-11 ***
## CategoryPR, Advertising & Marketing Jobs  0.02834 *  
## CategoryProperty Jobs                     0.01972 *  
## CategoryRetail Jobs                       1.9e-08 ***
## CategorySales Jobs                        < 2e-16 ***
## CategoryScientific & QA Jobs              0.00052 ***
## CategorySocial work Jobs                  0.03925 *  
## CategoryTeaching Jobs                     < 2e-16 ***
## CategoryTrade & Construction Jobs         0.00671 ** 
## CategoryTravel Jobs                       < 2e-16 ***
## LondonYes                                 3.9e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14400 on 6967 degrees of freedom
## Multiple R-squared:  0.202,  Adjusted R-squared:  0.199 
## F-statistic: 55.3 on 32 and 6967 DF,  p-value: <2e-16
lm.pred3 <- predict(lm.fit3, newdata = val)
sqrt(mean((lm.pred3 - val$SalaryNormalized)^2))
## [1] 14591

lm.fit4 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category + London + 
    TitleSenior + TitleManage + DescripSenior + DescripManage, data = tr)
summary(lm.fit4)
## 
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime + 
##     Category + London + TitleSenior + TitleManage + DescripSenior + 
##     DescripManage, data = tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38496  -8005  -2400   4293 133040 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                                 29345        776   37.81
## ContractTypefull_time                        -933        408   -2.29
## ContractTypepart_time                       -5017        785   -6.39
## ContractTimecontract                        13020        808   16.10
## ContractTimepermanent                        5704        528   10.81
## CategoryAdmin Jobs                         -14392       1507   -9.55
## CategoryCharity & Voluntary Jobs           -13264       3566   -3.72
## CategoryConsultancy Jobs                    -3851       2102   -1.83
## CategoryCreative & Design Jobs             -15475       4268   -3.63
## CategoryCustomer Services Jobs             -15938       1244  -12.82
## CategoryDomestic help & Cleaning Jobs      -17944       4987   -3.60
## CategoryEnergy, Oil & Gas Jobs              11461       2882    3.98
## CategoryEngineering Jobs                    -3579        866   -4.14
## CategoryGraduate Jobs                       -5769       3301   -1.75
## CategoryHealthcare & Nursing Jobs           -2605        806   -3.23
## CategoryHospitality & Catering Jobs        -12171       1025  -11.88
## CategoryHR & Recruitment Jobs               -7961        977   -8.15
## CategoryIT Jobs                              4524        838    5.40
## CategoryLegal Jobs                          -3820       1885   -2.03
## CategoryLogistics & Warehouse Jobs         -11697       1716   -6.82
## CategoryMaintenance Jobs                    -8682       3470   -2.50
## CategoryManufacturing Jobs                  -8819       1715   -5.14
## CategoryOther/General Jobs                  -8201       1340   -6.12
## CategoryPR, Advertising & Marketing Jobs    -5282       1976   -2.67
## CategoryProperty Jobs                       -8610       2829   -3.04
## CategoryRetail Jobs                        -12731       1912   -6.66
## CategorySales Jobs                          -8857       1077   -8.22
## CategoryScientific & QA Jobs                -4525       1635   -2.77
## CategorySocial work Jobs                    -6902       2611   -2.64
## CategoryTeaching Jobs                       -9912       1132   -8.76
## CategoryTrade & Construction Jobs           -3583       1543   -2.32
## CategoryTravel Jobs                        -18292       1776  -10.30
## LondonYes                                    3226        431    7.48
## TitleSeniorYes                               3693        788    4.69
## TitleManageYes                               7157        503   14.22
## DescripSeniorYes                             2324        532    4.37
## DescripManageYes                             2167        394    5.50
##                                          Pr(>|t|)    
## (Intercept)                               < 2e-16 ***
## ContractTypefull_time                     0.02219 *  
## ContractTypepart_time                     1.7e-10 ***
## ContractTimecontract                      < 2e-16 ***
## ContractTimepermanent                     < 2e-16 ***
## CategoryAdmin Jobs                        < 2e-16 ***
## CategoryCharity & Voluntary Jobs          0.00020 ***
## CategoryConsultancy Jobs                  0.06702 .  
## CategoryCreative & Design Jobs            0.00029 ***
## CategoryCustomer Services Jobs            < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs     0.00032 ***
## CategoryEnergy, Oil & Gas Jobs            7.0e-05 ***
## CategoryEngineering Jobs                  3.6e-05 ***
## CategoryGraduate Jobs                     0.08054 .  
## CategoryHealthcare & Nursing Jobs         0.00124 ** 
## CategoryHospitality & Catering Jobs       < 2e-16 ***
## CategoryHR & Recruitment Jobs             4.4e-16 ***
## CategoryIT Jobs                           6.9e-08 ***
## CategoryLegal Jobs                        0.04269 *  
## CategoryLogistics & Warehouse Jobs        1.0e-11 ***
## CategoryMaintenance Jobs                  0.01237 *  
## CategoryManufacturing Jobs                2.8e-07 ***
## CategoryOther/General Jobs                9.9e-10 ***
## CategoryPR, Advertising & Marketing Jobs  0.00752 ** 
## CategoryProperty Jobs                     0.00235 ** 
## CategoryRetail Jobs                       3.0e-11 ***
## CategorySales Jobs                        2.3e-16 ***
## CategoryScientific & QA Jobs              0.00565 ** 
## CategorySocial work Jobs                  0.00823 ** 
## CategoryTeaching Jobs                     < 2e-16 ***
## CategoryTrade & Construction Jobs         0.02027 *  
## CategoryTravel Jobs                       < 2e-16 ***
## LondonYes                                 8.3e-14 ***
## TitleSeniorYes                            2.8e-06 ***
## TitleManageYes                            < 2e-16 ***
## DescripSeniorYes                          1.3e-05 ***
## DescripManageYes                          3.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14000 on 6963 degrees of freedom
## Multiple R-squared:  0.255,  Adjusted R-squared:  0.251 
## F-statistic: 66.1 on 36 and 6963 DF,  p-value: <2e-16
lm.pred4 <- predict(lm.fit4, newdata = val)
sqrt(mean((lm.pred4 - val$SalaryNormalized)^2))
## [1] 14088

Interesting! Adding “Location” helped, but using “London” instead of “Location” was even better. The four text features helped a lot, too: validation RMSE fell from 14591 for lm.fit3 to 14088 for lm.fit4.
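
Every model in this write-up is scored with the same expression, sqrt(mean((pred - actual)^2)). As a minimal sketch, that expression could be wrapped in a small helper. The name rmse and the toy salaries below are my own assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: root mean squared error between predictions and truth
rmse <- function(pred, actual) {
    sqrt(mean((pred - actual)^2))
}

# Made-up salaries, only to show the call
toy.pred <- c(30000, 42000, 25000)
toy.actual <- c(28000, 45000, 25000)
toy.rmse <- rmse(toy.pred, toy.actual)
toy.rmse
```

With a helper like this, the validation score for lm.fit4 could be written as rmse(lm.pred4, val$SalaryNormalized).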

Modeling: Ridge Regression

Let's try ridge regression using the features from the best linear model:

# build model matrices for train and validation
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 1.9-5
x.tr <- model.matrix(SalaryNormalized ~ ContractType + ContractTime + Category + 
    London + TitleSenior + TitleManage + DescripSenior + DescripManage, data = tr)[, 
    -1]
y.tr <- tr$SalaryNormalized
x.val <- model.matrix(SalaryNormalized ~ ContractType + ContractTime + Category + 
    London + TitleSenior + TitleManage + DescripSenior + DescripManage, data = val)[, 
    -1]
y.val <- val$SalaryNormalized

# CV to obtain best lambda
set.seed(10)
rr.cv <- cv.glmnet(x.tr, y.tr, alpha = 0)
plot(rr.cv)

Plot: cross-validated mean squared error versus log(lambda) for ridge regression

rr.bestlam <- rr.cv$lambda.min
rr.goodlam <- rr.cv$lambda.1se

# predict validation set using best lambda and calculate RMSE
rr.fit <- glmnet(x.tr, y.tr, alpha = 0)
plot(rr.fit, xvar = "lambda", label = TRUE)

Plot: ridge coefficient paths versus log(lambda)

rr.pred <- predict(rr.fit, s = rr.bestlam, newx = x.val)
sqrt(mean((rr.pred - y.val)^2))
## [1] 14093

Ridge regression did about the same as the best linear model (validation RMSE of 14093 versus 14088).
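
One way to see why ridge can land so close to ordinary least squares is to look at its closed form on a small simulated data set. The sketch below is base R only and uses made-up data. It solves (X'X + lambda I) b = X'y directly, which is the textbook ridge estimate on centred data. glmnet standardises predictors and scales lambda differently, so the lambda values here are not comparable to rr.bestlam.

```r
set.seed(1)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- drop(2 + X %*% c(3, -1, 0.5) + rnorm(n))

# centre predictors and response so the intercept is not penalised
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# textbook ridge estimate for a given penalty
ridge_coef <- function(lambda) {
    drop(solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc)))
}

ols <- coef(lm(y ~ X))[-1]      # least-squares slopes
ridge.small <- ridge_coef(1e-08)  # almost no penalty
ridge.large <- ridge_coef(10000)  # heavy penalty

rbind(ols = unname(ols), ridge.small, ridge.large)
```

With a tiny penalty the ridge slopes match the least-squares slopes, and only a large penalty shrinks them noticeably. If cross-validation picks a small lambda, near-identical validation error is what you would expect.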

Modeling: Lasso

Next, let's try the lasso:

# CV to obtain best lambda
set.seed(10)
las.cv <- cv.glmnet(x.tr, y.tr, alpha = 1)
plot(las.cv)

Plot: cross-validated mean squared error versus log(lambda) for the lasso

las.bestlam <- las.cv$lambda.min
las.goodlam <- las.cv$lambda.1se

# predict validation set using best lambda and calculate RMSE
las.fit <- glmnet(x.tr, y.tr, alpha = 1)
plot(las.fit, xvar = "lambda", label = TRUE)

Plot: lasso coefficient paths versus log(lambda)

las.pred <- predict(las.fit, s = las.bestlam, newx = x.val)
sqrt(mean((las.pred - y.val)^2))
## [1] 14089

The lasso also did about the same as the best linear model (validation RMSE of 14089 versus 14088).
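
For intuition about how the lasso differs from ridge, here is a minimal sketch of soft-thresholding. In the special case of orthonormal predictors, with the objective (1/2)||y - Xb||^2 + lambda * ||b||_1, the lasso solution is the least-squares coefficient shrunk toward zero by lambda and set to exactly zero once it gets that small. The design matrix in this assignment is not orthonormal and glmnet parameterises its penalty differently, so this is an illustration with made-up coefficients rather than a reproduction of las.fit.

```r
# Soft-thresholding operator (lasso solution for orthonormal predictors)
soft_threshold <- function(b, lambda) {
    sign(b) * pmax(abs(b) - lambda, 0)
}

# Made-up least-squares coefficients
ols.coef <- c(7000, 3500, -900, 150)

light <- soft_threshold(ols.coef, lambda = 10)    # small penalty
heavy <- soft_threshold(ols.coef, lambda = 1000)  # large penalty

rbind(ols.coef, light, heavy)
```

A small penalty barely moves the coefficients, while a large one zeroes out the weakest ones. That is the sense in which the lasso performs variable selection.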

Modeling: Forward Stepwise

Let's try a forward stepwise approach (since there are too many variables for a best subset approach):

library(leaps)
fwd.fit <- regsubsets(SalaryNormalized ~ ContractType + ContractTime + Category + 
    London + TitleSenior + TitleManage + DescripSenior + DescripManage, data = tr, 
    nvmax = 36, method = "forward")
plot(fwd.fit, scale = "adjr2")

Plot: variables included in each forward stepwise model, ranked by adjusted R-squared


# loop through models of each size and compute validation RMSE for each model
# (clever approach from ISLR page 248)
fwd.errors <- rep(NA, 36)
val.mat <- model.matrix(SalaryNormalized ~ ContractType + ContractTime + Category + 
    London + TitleSenior + TitleManage + DescripSenior + DescripManage, data = val)
for (i in 1:36) {
    # extract the coefficients from the best model of that size
    coefi <- coef(fwd.fit, id = i)
    # multiply them into the matching columns of the validation model matrix
    # to predict
    pred <- val.mat[, names(coefi)] %*% coefi
    # compute the validation RMSE
    fwd.errors[i] <- sqrt(mean((y.val - pred)^2))
}

# find the best model
fwd.errors
##  [1] 15490 15056 14940 14786 14725 14638 14561 14515 14469 14374 14385
## [12] 14356 14297 14254 14245 14252 14229 14204 14163 14163 14144 14140
## [23] 14143 14139 14141 14135 14130 14128 14129 14126 14120 14120 14103
## [34] 14100 14090 14088
min(fwd.errors)
## [1] 14088
which.min(fwd.errors)
## [1] 36

Forward stepwise chose the model with all 36 variables, which is the same model as lm.fit4, so it matches the best linear model's validation RMSE of 14088.
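
The loop above predicts by hand because regsubsets objects have no predict() method. As a sanity check on that trick, here is a small sketch on simulated data showing that multiplying the named columns of a model matrix by the coefficients gives the same answer as predict() for an ordinary lm fit. All object names and data below are made up for illustration.

```r
set.seed(2)
toy <- data.frame(g = factor(rep(c("a", "b", "c"), length.out = 60)), x = rnorm(60))
toy$y <- 10 + 2 * toy$x + ifelse(toy$g == "c", 5, 0) + rnorm(60)
toy.tr <- toy[1:40, ]
toy.val <- toy[41:60, ]

# model matrix built once from the largest formula
toy.val.mat <- model.matrix(y ~ g + x, data = toy.val)

fit.full <- lm(y ~ g + x, data = toy.tr)
fit.small <- lm(y ~ x, data = toy.tr)

# pick the columns named in the coefficient vector, then multiply
predict_by_hand <- function(fit, mat) {
    coefi <- coef(fit)
    drop(mat[, names(coefi)] %*% coefi)
}

manual.full <- predict_by_hand(fit.full, toy.val.mat)
manual.small <- predict_by_hand(fit.small, toy.val.mat)
builtin.full <- predict(fit.full, newdata = toy.val)
builtin.small <- predict(fit.small, newdata = toy.val)

all.equal(unname(manual.full), unname(builtin.full))
all.equal(unname(manual.small), unname(builtin.small))
```

Indexing by names(coefi) is what lets one model matrix serve models of every size. A smaller model simply picks out fewer columns.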

Let's just go with the best linear model, and test it against the solution!

Test Model Against Solution

First, we need to read in the solution set and redo our feature engineering:

solution <- read.csv("solution.csv", as.is = TRUE)
solution$ContractType <- as.factor(solution$ContractType)
solution$ContractTime <- as.factor(solution$ContractTime)
solution$Category <- as.factor(solution$Category)
solution$SourceName <- as.factor(solution$SourceName)
solution$Company <- as.factor(solution$Company)
solution$LocationNormalized <- as.factor(solution$LocationNormalized)
for (i in 1:nrow(solution)) {
    loc <- solution$LocationNormalized[i]
    line.id <- which(grepl(loc, tree))[1]
    r <- regexpr("~.+?~", tree[line.id])
    match <- regmatches(tree[line.id], r)
    solution$Location[i] <- gsub("~", "", match)
}
## Error: replacement has length zero
solution$Location <- as.factor(solution$Location)
solution$London <- as.factor(ifelse(solution$Location == "London", "Yes", "No"))
solution$TitleSenior <- as.factor(ifelse(grepl("[Ss]enior", solution$Title), 
    "Yes", "No"))
solution$TitleManage <- as.factor(ifelse(grepl("[Mm]anage", solution$Title), 
    "Yes", "No"))
solution$DescripSenior <- as.factor(ifelse(grepl("[Ss]enior", solution$FullDescription), 
    "Yes", "No"))
solution$DescripManage <- as.factor(ifelse(grepl("[Mm]anage", solution$FullDescription), 
    "Yes", "No"))

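The location loop above stops with "replacement has length zero". That is most likely what happens when regmatches() comes back empty, for example when a location has no matching line in tree. Here is a minimal sketch of a guarded lookup that returns NA instead of failing. The helper name lookup_region and the toy tree lines are my own assumptions. The toy lines follow the tilde-separated layout that the loop's regular expression implies.

```r
# Hypothetical helper: second level of the first tree line that mentions loc,
# or NA when nothing matches
lookup_region <- function(loc, tree) {
    loc <- as.character(loc)
    hit <- grep(loc, tree, fixed = TRUE)
    if (length(hit) == 0) return(NA_character_)
    line <- tree[hit[1]]
    m <- regmatches(line, regexpr("~.+?~", line, perl = TRUE))
    if (length(m) == 0) return(NA_character_)
    gsub("~", "", m, fixed = TRUE)
}

# Made-up tree lines and locations
toy.tree <- c("UK~London~Central London~Westminster",
    "UK~South East England~Surrey~Guildford")
toy.locs <- factor(c("Westminster", "Guildford", "Atlantis"))

toy.regions <- vapply(as.character(toy.locs), lookup_region, character(1),
    tree = toy.tree, USE.NAMES = FALSE)
toy.regions
```

Rows that come back as NA would still need a decision before modeling, such as treating them as not London. I have not made that choice here.
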
The solution set has an extra Category, “Part time Jobs”, that is not present in the training set, so predict() would fail on those rows. I'll take the easiest approach and just recode it as “Admin Jobs”:

pt.index <- which(solution$Category == "Part time Jobs")
solution$Category[pt.index] <- "Admin Jobs"
solution$Category <- factor(solution$Category)
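
To show why the recoding is needed, here is a small sketch with made-up data. predict() on an lm fit stops when newdata contains a factor level the model never saw, and remapping that level to one the model knows lets the prediction go through. The column names, values, and the choice of fallback level are all illustrative assumptions. This variant remaps through character vectors rather than assigning into the factor as the code above does.

```r
toy.train <- data.frame(
    cat = factor(c("Admin", "Admin", "IT", "IT", "Sales", "Sales")),
    pay = c(19, 21, 44, 46, 30, 32)
)
toy.new <- data.frame(cat = factor(c("IT", "Part time", "Sales")))

toy.fit <- lm(pay ~ cat, data = toy.train)

# predict() raises an error on the unseen level
unseen.fails <- inherits(try(predict(toy.fit, newdata = toy.new), silent = TRUE),
    "try-error")

# remap anything unknown to a fallback level, then rebuild the factor
fallback <- "Admin"
cat.chr <- as.character(toy.new$cat)
cat.chr[!cat.chr %in% levels(toy.train$cat)] <- fallback
toy.new$cat <- factor(cat.chr, levels = levels(toy.train$cat))

toy.pred <- predict(toy.fit, newdata = toy.new)
toy.pred
```

The remapped rows simply inherit the fallback category's fitted value, so the choice of fallback matters if many rows are affected.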

To get some context, let's check the RMSE of an intercept-only model, which predicts the mean salary of the full training set for every job:

sqrt(mean((mean(train$SalaryNormalized) - solution$SalaryNormalized)^2))
## [1] 18761
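
The baseline above uses the training mean directly. As a quick sketch with made-up numbers, the following shows that this is the same thing as fitting lm() with only an intercept. The vectors are toy values, not the assignment data.

```r
# Made-up salaries (in thousands)
toy.train.y <- c(20, 30, 40)
toy.test.y <- c(25, 35, 60)

# an intercept-only fit estimates a single number: the training mean
toy.fit0 <- lm(toy.train.y ~ 1)
toy.intercept <- unname(coef(toy.fit0))

# baseline RMSE on the held-out values
toy.baseline.rmse <- sqrt(mean((mean(toy.train.y) - toy.test.y)^2))
c(intercept = toy.intercept, baseline.rmse = toy.baseline.rmse)
```

Any model worth keeping should beat this number on the same held-out data.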

Okay, let's train our linear model on the full training set, and predict on the solution set:

lm.fit5 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category + London + 
    TitleSenior + TitleManage + DescripSenior + DescripManage, data = train)
summary(lm.fit5)
## 
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime + 
##     Category + London + TitleSenior + TitleManage + DescripSenior + 
##     DescripManage, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39027  -7986  -2462   4268 132936 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                                 29519        652   45.31
## ContractTypefull_time                        -732        343   -2.14
## ContractTypepart_time                       -5287        656   -8.05
## ContractTimecontract                        13135        677   19.40
## ContractTimepermanent                        5211        441   11.82
## CategoryAdmin Jobs                         -14566       1283  -11.35
## CategoryCharity & Voluntary Jobs           -12866       2980   -4.32
## CategoryConsultancy Jobs                    -4710       1671   -2.82
## CategoryCreative & Design Jobs             -12191       3041   -4.01
## CategoryCustomer Services Jobs             -15904       1050  -15.15
## CategoryDomestic help & Cleaning Jobs      -17972       4464   -4.03
## CategoryEnergy, Oil & Gas Jobs               8583       2582    3.32
## CategoryEngineering Jobs                    -3997        721   -5.54
## CategoryGraduate Jobs                       -6272       3279   -1.91
## CategoryHealthcare & Nursing Jobs           -2790        677   -4.12
## CategoryHospitality & Catering Jobs        -12242        864  -14.16
## CategoryHR & Recruitment Jobs               -7477        820   -9.11
## CategoryIT Jobs                              4552        700    6.51
## CategoryLegal Jobs                          -3993       1607   -2.48
## CategoryLogistics & Warehouse Jobs         -11884       1462   -8.13
## CategoryMaintenance Jobs                    -9261       3191   -2.90
## CategoryManufacturing Jobs                  -9762       1476   -6.61
## CategoryOther/General Jobs                  -7664       1082   -7.09
## CategoryPR, Advertising & Marketing Jobs    -5584       1599   -3.49
## CategoryProperty Jobs                       -8731       2188   -3.99
## CategoryRetail Jobs                        -12782       1567   -8.16
## CategorySales Jobs                          -8444        889   -9.49
## CategoryScientific & QA Jobs                -5323       1373   -3.88
## CategorySocial work Jobs                    -6119       2016   -3.04
## CategoryTeaching Jobs                      -10094        974  -10.37
## CategoryTrade & Construction Jobs           -5542       1292   -4.29
## CategoryTravel Jobs                        -18335       1530  -11.98
## LondonYes                                    3660        363   10.09
## TitleSeniorYes                               4106        667    6.16
## TitleManageYes                               7111        419   16.96
## DescripSeniorYes                             1953        451    4.33
## DescripManageYes                             2380        330    7.21
##                                          Pr(>|t|)    
## (Intercept)                               < 2e-16 ***
## ContractTypefull_time                     0.03251 *  
## ContractTypepart_time                     8.9e-16 ***
## ContractTimecontract                      < 2e-16 ***
## ContractTimepermanent                     < 2e-16 ***
## CategoryAdmin Jobs                        < 2e-16 ***
## CategoryCharity & Voluntary Jobs          1.6e-05 ***
## CategoryConsultancy Jobs                  0.00484 ** 
## CategoryCreative & Design Jobs            6.1e-05 ***
## CategoryCustomer Services Jobs            < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs     5.7e-05 ***
## CategoryEnergy, Oil & Gas Jobs            0.00089 ***
## CategoryEngineering Jobs                  3.1e-08 ***
## CategoryGraduate Jobs                     0.05580 .  
## CategoryHealthcare & Nursing Jobs         3.8e-05 ***
## CategoryHospitality & Catering Jobs       < 2e-16 ***
## CategoryHR & Recruitment Jobs             < 2e-16 ***
## CategoryIT Jobs                           8.1e-11 ***
## CategoryLegal Jobs                        0.01297 *  
## CategoryLogistics & Warehouse Jobs        4.9e-16 ***
## CategoryMaintenance Jobs                  0.00371 ** 
## CategoryManufacturing Jobs                3.9e-11 ***
## CategoryOther/General Jobs                1.5e-12 ***
## CategoryPR, Advertising & Marketing Jobs  0.00048 ***
## CategoryProperty Jobs                     6.7e-05 ***
## CategoryRetail Jobs                       3.9e-16 ***
## CategorySales Jobs                        < 2e-16 ***
## CategoryScientific & QA Jobs              0.00011 ***
## CategorySocial work Jobs                  0.00240 ** 
## CategoryTeaching Jobs                     < 2e-16 ***
## CategoryTrade & Construction Jobs         1.8e-05 ***
## CategoryTravel Jobs                       < 2e-16 ***
## LondonYes                                 < 2e-16 ***
## TitleSeniorYes                            7.7e-10 ***
## TitleManageYes                            < 2e-16 ***
## DescripSeniorYes                          1.5e-05 ***
## DescripManageYes                          6.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14000 on 9963 degrees of freedom
## Multiple R-squared:  0.252,  Adjusted R-squared:  0.25 
## F-statistic: 93.5 on 36 and 9963 DF,  p-value: <2e-16
lm.pred5 <- predict(lm.fit5, newdata = solution)
sqrt(mean((lm.pred5 - solution$SalaryNormalized)^2))
## [1] 16470

On the solution set, RMSE drops from 18761 for the intercept-only model to 16470 for the linear model, a reduction of about 12%. That decrease is similar to what we saw with the validation set approach, which suggests the validation set gave a good preview of how we would eventually do on the solution set.

As for how to improve the model, I would bet that generating many more features from Title and FullDescription would reduce RMSE significantly. I was surprised by how much performance improved in lm.fit4 (over lm.fit3) just from adding four very simple text features.
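
As a rough sketch of what generating more text features could look like, the same grepl() pattern used for TitleSenior and TitleManage can be driven from a named vector of keywords. The toy titles and the extra “Lead” keyword are assumptions for illustration. I have not tested which keywords actually help on this data.

```r
# Made-up job titles
toy.jobs <- data.frame(Title = c("Senior Software Engineer", "Office Manager",
    "Care Assistant", "Lead Analyst"), stringsAsFactors = FALSE)

# feature name suffix -> regular expression (the last one is hypothetical)
keywords <- c(Senior = "[Ss]enior", Manage = "[Mm]anage", Lead = "[Ll]ead")

# add one Yes/No factor column per keyword
for (k in names(keywords)) {
    toy.jobs[[paste0("Title", k)]] <- factor(ifelse(grepl(keywords[[k]],
        toy.jobs$Title), "Yes", "No"), levels = c("No", "Yes"))
}
toy.jobs
```

Fixing the levels to c("No", "Yes") keeps the columns consistent between training and solution sets even when a keyword never appears in one of them.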