This is my attempt at the linear regression assignment from the General Assembly Data Science course, which is based on the Job Salary Prediction competition on Kaggle.
We will first read in the training data, and convert some columns into factors:
train <- read.csv("train.csv", as.is = TRUE)
train$ContractType <- as.factor(train$ContractType)
train$ContractTime <- as.factor(train$ContractTime)
train$Category <- as.factor(train$Category)
train$SourceName <- as.factor(train$SourceName)
train$Company <- as.factor(train$Company)
train$LocationNormalized <- as.factor(train$LocationNormalized)
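The six conversions above all follow the same pattern, so they could also be written as one `lapply()` call over a vector of column names. A minimal sketch on a made-up data frame (the `toy` rows are illustrative and not taken from train.csv):

```r
# toy stand-in for train.csv (made-up rows)
toy <- data.frame(ContractType = c("full_time", "part_time"),
                  Category = c("IT Jobs", "Admin Jobs"),
                  SalaryNormalized = c(40000, 18000),
                  stringsAsFactors = FALSE)

# convert every listed column to a factor in one step
factor.cols <- c("ContractType", "Category")
toy[factor.cols] <- lapply(toy[factor.cols], as.factor)
str(toy)
```

Columns that are not listed, such as the salary, keep their original type.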
Let's do two quick visualizations, to check that ContractType and ContractTime do have an effect on SalaryNormalized:
library(ggplot2)
ggplot(train, aes(SalaryNormalized)) + geom_bar() + facet_grid(ContractType ~ .)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(train, aes(SalaryNormalized)) + geom_bar() + facet_grid(ContractTime ~ .)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
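The stat_bin message above appears once per facet because no bin width was given. Choosing one explicitly silences it and makes the panels easier to compare. A sketch on simulated salaries; the width of 5000 and the `toy` data are assumptions for illustration, not values tuned on the real data:

```r
library(ggplot2)

# simulated, made-up salaries with a two-level contract column
set.seed(1)
toy <- data.frame(SalaryNormalized = round(rlnorm(300, meanlog = 10.3, sdlog = 0.4)),
                  ContractTime = sample(c("contract", "permanent"), 300, replace = TRUE))

# an explicit binwidth replaces the range/30 default
p <- ggplot(toy, aes(SalaryNormalized)) +
    geom_histogram(binwidth = 5000) +
    facet_grid(ContractTime ~ .)
print(p)
```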
Those both look useful. Next, let's run some simple stats on Category and SourceName:
tapply(train$SalaryNormalized, train$Category, mean)
##        Accounting & Finance Jobs                       Admin Jobs
##                            37375                            19307
##         Charity & Voluntary Jobs                 Consultancy Jobs
##                            26604                            33914
##           Creative & Design Jobs           Customer Services Jobs
##                            25678                            18548
##    Domestic help & Cleaning Jobs           Energy, Oil & Gas Jobs
##                            14928                            45461
##                 Engineering Jobs                    Graduate Jobs
##                            33557                            26439
##        Healthcare & Nursing Jobs      Hospitality & Catering Jobs
##                            29675                            21567
##            HR & Recruitment Jobs                          IT Jobs
##                            30157                            43306
##                       Legal Jobs       Logistics & Warehouse Jobs
##                            28509                            21716
##                 Maintenance Jobs               Manufacturing Jobs
##                            29309                            27191
##               Other/General Jobs PR, Advertising & Marketing Jobs
##                            30815                            33402
##                    Property Jobs                      Retail Jobs
##                            29688                            26954
##                       Sales Jobs             Scientific & QA Jobs
##                            28374                            32911
##                 Social work Jobs                    Teaching Jobs
##                            32420                            27633
##        Trade & Construction Jobs                      Travel Jobs
##                            32006                            22214
table(train$Category)
##
##        Accounting & Finance Jobs                       Admin Jobs
##                              606                              151
##         Charity & Voluntary Jobs                 Consultancy Jobs
##                               23                               80
##           Creative & Design Jobs           Customer Services Jobs
##                               22                              257
##    Domestic help & Cleaning Jobs           Energy, Oil & Gas Jobs
##                               10                               31
##                 Engineering Jobs                    Graduate Jobs
##                             1152                               19
##        Healthcare & Nursing Jobs      Hospitality & Catering Jobs
##                             3149                              525
##            HR & Recruitment Jobs                          IT Jobs
##                              578                             1414
##                       Legal Jobs       Logistics & Warehouse Jobs
##                               88                              110
##                 Maintenance Jobs               Manufacturing Jobs
##                               20                              106
##               Other/General Jobs PR, Advertising & Marketing Jobs
##                              236                               88
##                    Property Jobs                      Retail Jobs
##                               44                               93
##                       Sales Jobs             Scientific & QA Jobs
##                              426                              129
##                 Social work Jobs                    Teaching Jobs
##                               53                              342
##        Trade & Construction Jobs                      Travel Jobs
##                              148                              100
tapply(train$SalaryNormalized, train$SourceName, mean)
##         accountancyagejobs.com           allhousingjobs.co.uk
##                          48875                          31573
##            Brand Republic Jobs              careerbuilder.com
##                          53250                          31667
##            careerstructure.com                 careworx.co.uk
##                          85000                          29292
##                    caterer.com               charityjob.co.uk
##                          21711                          24250
##               cv-library.co.uk                   cwjobs.co.uk
##                          29851                          44798
##           discoverretail.co.uk              eFinancialCareers
##                          25667                          62723
##               emptylemon.co.uk                    fish4.co.uk
##                          40022                          31605
##                     hays.co.uk                 hotrecruit.com
##                          41992                          36102
##                 Jobcentre Plus                      jobg8.com
##                          19645                          39168
##             jobs.cabincrew.com jobs.catererandhotelkeeper.com
##                          27000                          22436
##            jobs.guardian.co.uk    jobs.planningresource.co.uk
##                          31187                          30380
##           jobs.telegraph.co.uk                         Jobs24
##                          41262                          33642
##             jobs4medical.co.uk                   jobserve.com
##                          24000                          46193
##              jobsfinancial.com               jobsincredit.com
##                          49009                          97500
##          jobsineducation.co.uk         jobsinsocialwork.co.uk
##                          30004                          23530
##                  jobsite.co.uk               juniorbroker.com
##                          55900                          16312
##                leisurejobs.com               londonjobs.co.uk
##                          21076                          25000
##              michaelpage.co.uk          Multilingualvacancies
##                          45000                          24768
##          myjobs.cimaglobal.com                       MyUkJobs
##                          47656                          24959
##              nijobfinder.co.uk                     nijobs.com
##                          26050                          35000
##         nurseryworldjobs.co.uk         oilandgasjobsearch.com
##                          30000                          33596
##                 OilCareers.com          onlymarketingjobs.com
##                          71350                          24833
##         peoplemanagement.co.uk            Personneltoday Jobs
##                          30000                          31750
##              planetrecruit.com                   PR Week Jobs
##                          43278                          35408
##   professionalpensionsjobs.com             propertyjobs.co.uk
##                          29250                          36667
##                 randstadfp.com                  recruitni.com
##                          43400                          33538
##           rengineeringjobs.com               retailchoice.com
##                          40260                          26001
##                    rtmjobs.com              salestarget.co.uk
##                          40000                          42200
##        securityclearedjobs.com             simplyhrjobs.co.uk
##                          37838                          22500
##      simplymarketingjobs.co.uk          simplysalesjobs.co.uk
##                          46250                          35833
##                 staffnurse.com              strike-jobs.co.uk
##                          32107                          36582
##               targetjobs.co.uk               technojobs.co.uk
##                          26682                          53400
##          thecareerengineer.com              thegraduate.co.uk
##                          34154                          22000
##            theitjobboard.co.uk               theladders.co.uk
##                          46431                          62499
##              Third Sector Jobs                  totaljobs.com
##                          30857                          32962
##                 uksport.gov.uk            wileyjobnetwork.com
##                          45000                          34000
##                  workthing.com                     zartis.com
##                          21375                          34680
table(train$SourceName)
##
##         accountancyagejobs.com           allhousingjobs.co.uk
##                              4                              1
##            Brand Republic Jobs              careerbuilder.com
##                              4                             24
##            careerstructure.com                 careworx.co.uk
##                              1                           2946
##                    caterer.com               charityjob.co.uk
##                            264                              2
##               cv-library.co.uk                   cwjobs.co.uk
##                            874                             52
##           discoverretail.co.uk              eFinancialCareers
##                              3                             46
##               emptylemon.co.uk                    fish4.co.uk
##                              4                            441
##                     hays.co.uk                 hotrecruit.com
##                            293                             94
##                 Jobcentre Plus                      jobg8.com
##                            122                             17
##             jobs.cabincrew.com jobs.catererandhotelkeeper.com
##                              1                            152
##            jobs.guardian.co.uk    jobs.planningresource.co.uk
##                             13                              3
##           jobs.telegraph.co.uk                         Jobs24
##                              2                             48
##             jobs4medical.co.uk                   jobserve.com
##                              1                             17
##              jobsfinancial.com               jobsincredit.com
##                             58                              1
##          jobsineducation.co.uk         jobsinsocialwork.co.uk
##                             99                             13
##                  jobsite.co.uk               juniorbroker.com
##                              2                              8
##                leisurejobs.com               londonjobs.co.uk
##                             84                              1
##              michaelpage.co.uk          Multilingualvacancies
##                              1                            139
##          myjobs.cimaglobal.com                       MyUkJobs
##                             32                           1559
##              nijobfinder.co.uk                     nijobs.com
##                            131                              4
##         nurseryworldjobs.co.uk         oilandgasjobsearch.com
##                              1                              3
##                 OilCareers.com          onlymarketingjobs.com
##                              5                              3
##         peoplemanagement.co.uk            Personneltoday Jobs
##                              1                              2
##              planetrecruit.com                   PR Week Jobs
##                            436                              3
##   professionalpensionsjobs.com             propertyjobs.co.uk
##                              2                              3
##                 randstadfp.com                  recruitni.com
##                              5                            218
##           rengineeringjobs.com               retailchoice.com
##                             60                             19
##                    rtmjobs.com              salestarget.co.uk
##                              1                              5
##        securityclearedjobs.com             simplyhrjobs.co.uk
##                             13                              1
##      simplymarketingjobs.co.uk          simplysalesjobs.co.uk
##                              8                              6
##                 staffnurse.com              strike-jobs.co.uk
##                             54                            188
##               targetjobs.co.uk               technojobs.co.uk
##                             17                              5
##          thecareerengineer.com              thegraduate.co.uk
##                            312                              1
##            theitjobboard.co.uk               theladders.co.uk
##                            643                              2
##              Third Sector Jobs                  totaljobs.com
##                              5                            403
##                 uksport.gov.uk            wileyjobnetwork.com
##                              1                              1
##                  workthing.com                     zartis.com
##                              4                              8
Both could be useful, though I'll probably stay away from SourceName since there are many sources that only have a few observations. Next, let's look at Company and LocationNormalized:
# output suppressed
table(train$Company)
table(train$LocationNormalized)
Way too many levels in both. I'm going to stay away from these.
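One way to keep a factor with many thinly populated levels (such as SourceName) usable is to look at counts and means side by side and then lump the rare levels into a single "Other" level. A sketch on made-up data; the threshold `min.n` and the column name `SourceLumped` are hypothetical choices, not something used in the models below:

```r
# made-up job sources and salaries
toy <- data.frame(SourceName = c("a.com", "a.com", "a.com", "b.com",
                                 "c.com", "b.com", "d.com"),
                  SalaryNormalized = c(30000, 32000, 31000, 45000,
                                       85000, 47000, 20000),
                  stringsAsFactors = TRUE)

# count and mean per level in one table (both are ordered by factor level)
src.stats <- data.frame(n = as.vector(table(toy$SourceName)),
                        mean = as.vector(tapply(toy$SalaryNormalized,
                                                toy$SourceName, mean)),
                        row.names = levels(toy$SourceName))
src.stats

# lump levels with fewer than min.n observations into "Other"
min.n <- 2
rare <- rownames(src.stats)[src.stats$n < min.n]
toy$SourceLumped <- as.character(toy$SourceName)
toy$SourceLumped[toy$SourceLumped %in% rare] <- "Other"
toy$SourceLumped <- factor(toy$SourceLumped)
table(toy$SourceLumped)
```

Whether a lumped feature actually helps would still have to be checked on the validation set.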
I want to use location_tree.txt to create a much broader location feature:
con <- file("location_tree.txt", "r")
tree <- readLines(con)
close(con)
for (i in 1:nrow(train)) {
# get city name
loc <- train$LocationNormalized[i]
# find the first line in the tree in which that city name appears
line.id <- which(grepl(loc, tree))[1]
# use regular expressions to pull out the broad location
r <- regexpr("~.+?~", tree[line.id])
match <- regmatches(tree[line.id], r)
# store the broad location
train$Location[i] <- gsub("~", "", match)
}
## Error: replacement has length zero
train$Location <- as.factor(train$Location)
table(train$Location)
##
##            East Midlands          Eastern England                   London
##                      349                      598                     1928
##       North East England       North West England         Northern Ireland
##                      214                      561                      125
##                 Scotland       South East England       South West England
##                      443                     4500                      514
##                    Wales            West Midlands Yorkshire And The Humber
##                      173                      366                      229
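The `replacement has length zero` error above is what R reports when the value being assigned is empty, which happens here whenever `regmatches()` finds nothing for a location. A guarded lookup can return NA in that case instead of stopping. The sketch below uses a made-up three-line tree in the `~`-separated layout that the regular expression above implies; the function name `broad.location` and the use of `fixed = TRUE` are assumptions for illustration. It also builds the whole vector before assigning it, because assigning to element 1 of a column that does not exist yet fills the entire column by recycling, so after an interrupted loop the later rows can end up holding the first row's value.

```r
# made-up tree lines (assumed layout: country~region~...~town)
tree.toy <- c("UK~London~Central London~Soho",
              "UK~South East England~Surrey~Dorking",
              "UK~Scotland~Glasgow")
locs <- c("Dorking", "Soho", "Atlantis", "Glasgow")

# return the second field of the first tree line containing loc, or NA
broad.location <- function(loc, tree) {
    line.id <- which(grepl(loc, tree, fixed = TRUE))[1]
    if (is.na(line.id)) return(NA_character_)
    r <- regexpr("~.+?~", tree[line.id])
    if (r == -1) return(NA_character_)
    gsub("~", "", regmatches(tree[line.id], r))
}

Location <- vapply(locs, broad.location, character(1),
                   tree = tree.toy, USE.NAMES = FALSE)
Location
```

Rows that come back as NA would still need a decision (drop them, or give them a fallback level) before modelling.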
I'm also going to make a simpler London vs non-London feature, to see if that performs better:
train$London <- as.factor(ifelse(train$Location == "London", "Yes", "No"))
Let's add some text features:
train$TitleSenior <- as.factor(ifelse(grepl("[Ss]enior", train$Title), "Yes", "No"))
train$TitleManage <- as.factor(ifelse(grepl("[Mm]anage", train$Title), "Yes", "No"))
train$DescripSenior <- as.factor(ifelse(grepl("[Ss]enior", train$FullDescription),
    "Yes", "No"))
train$DescripManage <- as.factor(ifelse(grepl("[Mm]anage", train$FullDescription),
    "Yes", "No"))
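Since each text feature is the same `grepl()` call with a different keyword, a small helper can generate them from a keyword list. A sketch only: the helper name `add.keyword.flags` and the generated column names are hypothetical, and `ignore.case = TRUE` is slightly broader than the `[Ss]enior` patterns above because it also matches all-caps titles.

```r
# add one Yes/No factor per keyword, named <column>.<keyword>
add.keyword.flags <- function(df, column, keywords) {
    for (kw in keywords) {
        hit <- grepl(kw, df[[column]], ignore.case = TRUE)
        df[[paste0(column, ".", kw)]] <- factor(ifelse(hit, "Yes", "No"),
                                                levels = c("No", "Yes"))
    }
    df
}

# made-up titles
toy <- data.frame(Title = c("Senior Engineer", "Sales manager", "Nurse"),
                  stringsAsFactors = FALSE)
toy <- add.keyword.flags(toy, "Title", c("senior", "manage", "director"))
toy
```

Fixing the levels to `c("No", "Yes")` keeps the factor consistent even when a keyword never appears.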
Enough feature engineering… let's do some modeling!
Since the number of observations is large, I'll split the training data into a training set and validation set:
set.seed(100)
train.index <- sample(10000, 7000, replace = FALSE)
tr <- train[train.index, ]
val <- train[-train.index, ]
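The split above hard-codes 10000 rows and 7000 training rows. The same 70/30 split can be derived from the data itself, which keeps working if the number of rows changes. A sketch on a made-up 20-row data frame:

```r
set.seed(100)
toy <- data.frame(id = 1:20, y = rnorm(20))   # made-up data

n <- nrow(toy)
train.index <- sample(n, size = round(0.7 * n))   # 70% of the rows, no replacement
tr.toy <- toy[train.index, ]
val.toy <- toy[-train.index, ]
c(train = nrow(tr.toy), validation = nrow(val.toy))
```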
Let's check the RMSE for a model fit with just an intercept:
sqrt(mean((mean(tr$SalaryNormalized) - val$SalaryNormalized)^2))
## [1] 16201
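The same RMSE expression is typed out for every model below, so it may be convenient to wrap it in a function. A minimal sketch (the name `rmse` is my own and the three salaries are made up):

```r
# root mean squared error between predictions and observed values
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

actual <- c(20000, 30000, 40000)   # made-up salaries
rmse(mean(actual), actual)         # intercept-only style baseline
rmse(actual, actual)               # perfect predictions give 0
```

In the baseline above, the single prediction is the mean of the training set and it is scored against the validation salaries.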
Let's create some simple linear models, and add more features each time to see how it affects RMSE:
lm.fit1 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category, data = tr)
summary(lm.fit1)
##
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime +
## Category, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38567 -8189 -2896 5623 129929
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 33567 765 43.90
## ContractTypefull_time -312 420 -0.74
## ContractTypepart_time -7431 804 -9.24
## ContractTimecontract 12182 837 14.55
## ContractTimepermanent 6243 547 11.41
## CategoryAdmin Jobs -16523 1560 -10.59
## CategoryCharity & Voluntary Jobs -14438 3698 -3.90
## CategoryConsultancy Jobs -665 2169 -0.31
## CategoryCreative & Design Jobs -16644 4429 -3.76
## CategoryCustomer Services Jobs -18559 1283 -14.46
## CategoryDomestic help & Cleaning Jobs -19525 5175 -3.77
## CategoryEnergy, Oil & Gas Jobs 9324 2989 3.12
## CategoryEngineering Jobs -5413 892 -6.07
## CategoryGraduate Jobs -7506 3419 -2.20
## CategoryHealthcare & Nursing Jobs -2878 829 -3.47
## CategoryHospitality & Catering Jobs -12671 1058 -11.98
## CategoryHR & Recruitment Jobs -8041 1012 -7.95
## CategoryIT Jobs 3378 867 3.90
## CategoryLegal Jobs -6589 1951 -3.38
## CategoryLogistics & Warehouse Jobs -14141 1772 -7.98
## CategoryMaintenance Jobs -8282 3596 -2.30
## CategoryManufacturing Jobs -9867 1778 -5.55
## CategoryOther/General Jobs -9080 1389 -6.54
## CategoryPR, Advertising & Marketing Jobs -4472 2045 -2.19
## CategoryProperty Jobs -6548 2933 -2.23
## CategoryRetail Jobs -11374 1978 -5.75
## CategorySales Jobs -9562 1115 -8.57
## CategoryScientific & QA Jobs -5563 1693 -3.29
## CategorySocial work Jobs -6011 2703 -2.22
## CategoryTeaching Jobs -12028 1165 -10.32
## CategoryTrade & Construction Jobs -4568 1596 -2.86
## CategoryTravel Jobs -18150 1840 -9.86
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## ContractTypefull_time 0.45781
## ContractTypepart_time < 2e-16 ***
## ContractTimecontract < 2e-16 ***
## ContractTimepermanent < 2e-16 ***
## CategoryAdmin Jobs < 2e-16 ***
## CategoryCharity & Voluntary Jobs 9.6e-05 ***
## CategoryConsultancy Jobs 0.75923
## CategoryCreative & Design Jobs 0.00017 ***
## CategoryCustomer Services Jobs < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs 0.00016 ***
## CategoryEnergy, Oil & Gas Jobs 0.00182 **
## CategoryEngineering Jobs 1.4e-09 ***
## CategoryGraduate Jobs 0.02815 *
## CategoryHealthcare & Nursing Jobs 0.00052 ***
## CategoryHospitality & Catering Jobs < 2e-16 ***
## CategoryHR & Recruitment Jobs 2.2e-15 ***
## CategoryIT Jobs 9.9e-05 ***
## CategoryLegal Jobs 0.00074 ***
## CategoryLogistics & Warehouse Jobs 1.7e-15 ***
## CategoryMaintenance Jobs 0.02132 *
## CategoryManufacturing Jobs 3.0e-08 ***
## CategoryOther/General Jobs 6.7e-11 ***
## CategoryPR, Advertising & Marketing Jobs 0.02882 *
## CategoryProperty Jobs 0.02560 *
## CategoryRetail Jobs 9.2e-09 ***
## CategorySales Jobs < 2e-16 ***
## CategoryScientific & QA Jobs 0.00102 **
## CategorySocial work Jobs 0.02620 *
## CategoryTeaching Jobs < 2e-16 ***
## CategoryTrade & Construction Jobs 0.00422 **
## CategoryTravel Jobs < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14500 on 6968 degrees of freedom
## Multiple R-squared: 0.196, Adjusted R-squared: 0.193
## F-statistic: 54.9 on 31 and 6968 DF, p-value: <2e-16
lm.pred1 <- predict(lm.fit1, newdata = val)
sqrt(mean((lm.pred1 - val$SalaryNormalized)^2))
## [1] 14701
lm.fit2 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category + Location,
data = tr)
summary(lm.fit2)
##
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime +
## Category + Location, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38574 -8376 -2816 5425 129864
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 32886 1206 27.27
## ContractTypefull_time -768 423 -1.82
## ContractTypepart_time -7995 806 -9.91
## ContractTimecontract 12056 856 14.09
## ContractTimepermanent 5955 568 10.49
## CategoryAdmin Jobs -16713 1562 -10.70
## CategoryCharity & Voluntary Jobs -14842 3679 -4.03
## CategoryConsultancy Jobs -773 2158 -0.36
## CategoryCreative & Design Jobs -16679 4406 -3.79
## CategoryCustomer Services Jobs -18071 1280 -14.12
## CategoryDomestic help & Cleaning Jobs -19579 5153 -3.80
## CategoryEnergy, Oil & Gas Jobs 9411 2974 3.16
## CategoryEngineering Jobs -5266 894 -5.89
## CategoryGraduate Jobs -7287 3410 -2.14
## CategoryHealthcare & Nursing Jobs -2827 832 -3.40
## CategoryHospitality & Catering Jobs -13289 1057 -12.57
## CategoryHR & Recruitment Jobs -8124 1019 -7.97
## CategoryIT Jobs 3371 866 3.89
## CategoryLegal Jobs -6639 1942 -3.42
## CategoryLogistics & Warehouse Jobs -14097 1770 -7.97
## CategoryMaintenance Jobs -9208 3582 -2.57
## CategoryManufacturing Jobs -9598 1776 -5.41
## CategoryOther/General Jobs -9404 1383 -6.80
## CategoryPR, Advertising & Marketing Jobs -4736 2037 -2.33
## CategoryProperty Jobs -7065 2918 -2.42
## CategoryRetail Jobs -11257 1969 -5.72
## CategorySales Jobs -9714 1112 -8.73
## CategoryScientific & QA Jobs -5522 1686 -3.27
## CategorySocial work Jobs -5906 2694 -2.19
## CategoryTeaching Jobs -12273 1168 -10.51
## CategoryTrade & Construction Jobs -4323 1596 -2.71
## CategoryTravel Jobs -18255 1831 -9.97
## LocationEastern England 2557 1180 2.17
## LocationLondon 3574 1023 3.49
## LocationNorth East England -433 1513 -0.29
## LocationNorth West England -1921 1184 -1.62
## LocationNorthern Ireland 118 1905 0.06
## LocationScotland 1542 1260 1.22
## LocationSouth East England 822 994 0.83
## LocationSouth West England -1994 1212 -1.65
## LocationWales -494 1637 -0.30
## LocationWest Midlands -328 1310 -0.25
## LocationYorkshire And The Humber -2307 1490 -1.55
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## ContractTypefull_time 0.06949 .
## ContractTypepart_time < 2e-16 ***
## ContractTimecontract < 2e-16 ***
## ContractTimepermanent < 2e-16 ***
## CategoryAdmin Jobs < 2e-16 ***
## CategoryCharity & Voluntary Jobs 5.5e-05 ***
## CategoryConsultancy Jobs 0.72012
## CategoryCreative & Design Jobs 0.00015 ***
## CategoryCustomer Services Jobs < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs 0.00015 ***
## CategoryEnergy, Oil & Gas Jobs 0.00156 **
## CategoryEngineering Jobs 4.0e-09 ***
## CategoryGraduate Jobs 0.03261 *
## CategoryHealthcare & Nursing Jobs 0.00069 ***
## CategoryHospitality & Catering Jobs < 2e-16 ***
## CategoryHR & Recruitment Jobs 1.8e-15 ***
## CategoryIT Jobs 0.00010 ***
## CategoryLegal Jobs 0.00063 ***
## CategoryLogistics & Warehouse Jobs 1.9e-15 ***
## CategoryMaintenance Jobs 0.01017 *
## CategoryManufacturing Jobs 6.7e-08 ***
## CategoryOther/General Jobs 1.1e-11 ***
## CategoryPR, Advertising & Marketing Jobs 0.02007 *
## CategoryProperty Jobs 0.01550 *
## CategoryRetail Jobs 1.1e-08 ***
## CategorySales Jobs < 2e-16 ***
## CategoryScientific & QA Jobs 0.00106 **
## CategorySocial work Jobs 0.02837 *
## CategoryTeaching Jobs < 2e-16 ***
## CategoryTrade & Construction Jobs 0.00675 **
## CategoryTravel Jobs < 2e-16 ***
## LocationEastern England 0.03024 *
## LocationLondon 0.00048 ***
## LocationNorth East England 0.77493
## LocationNorth West England 0.10490
## LocationNorthern Ireland 0.95042
## LocationScotland 0.22114
## LocationSouth East England 0.40820
## LocationSouth West England 0.09985 .
## LocationWales 0.76307
## LocationWest Midlands 0.80249
## LocationYorkshire And The Humber 0.12167
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14400 on 6957 degrees of freedom
## Multiple R-squared: 0.207, Adjusted R-squared: 0.202
## F-statistic: 43.2 on 42 and 6957 DF, p-value: <2e-16
lm.pred2 <- predict(lm.fit2, newdata = val)
sqrt(mean((lm.pred2 - val$SalaryNormalized)^2))
## [1] 14619
lm.fit3 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category + London,
data = tr)
summary(lm.fit3)
##
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime +
## Category + London, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38479 -8319 -2935 5404 130243
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 32956 766 43.00
## ContractTypefull_time -664 421 -1.58
## ContractTypepart_time -7635 802 -9.53
## ContractTimecontract 12459 835 14.92
## ContractTimepermanent 6274 545 11.51
## CategoryAdmin Jobs -16253 1555 -10.46
## CategoryCharity & Voluntary Jobs -14695 3685 -3.99
## CategoryConsultancy Jobs -642 2161 -0.30
## CategoryCreative & Design Jobs -16529 4412 -3.75
## CategoryCustomer Services Jobs -18328 1279 -14.33
## CategoryDomestic help & Cleaning Jobs -19644 5156 -3.81
## CategoryEnergy, Oil & Gas Jobs 9674 2978 3.25
## CategoryEngineering Jobs -5134 890 -5.77
## CategoryGraduate Jobs -6754 3408 -1.98
## CategoryHealthcare & Nursing Jobs -2857 826 -3.46
## CategoryHospitality & Catering Jobs -12951 1054 -12.28
## CategoryHR & Recruitment Jobs -7983 1008 -7.92
## CategoryIT Jobs 3624 865 4.19
## CategoryLegal Jobs -6369 1944 -3.28
## CategoryLogistics & Warehouse Jobs -13756 1767 -7.79
## CategoryMaintenance Jobs -9475 3587 -2.64
## CategoryManufacturing Jobs -9672 1772 -5.46
## CategoryOther/General Jobs -9149 1384 -6.61
## CategoryPR, Advertising & Marketing Jobs -4469 2038 -2.19
## CategoryProperty Jobs -6815 2922 -2.33
## CategoryRetail Jobs -11087 1971 -5.63
## CategorySales Jobs -9517 1111 -8.56
## CategoryScientific & QA Jobs -5855 1687 -3.47
## CategorySocial work Jobs -5555 2694 -2.06
## CategoryTeaching Jobs -12501 1162 -10.75
## CategoryTrade & Construction Jobs -4313 1590 -2.71
## CategoryTravel Jobs -18265 1834 -9.96
## LondonYes 3243 446 7.27
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## ContractTypefull_time 0.11491
## ContractTypepart_time < 2e-16 ***
## ContractTimecontract < 2e-16 ***
## ContractTimepermanent < 2e-16 ***
## CategoryAdmin Jobs < 2e-16 ***
## CategoryCharity & Voluntary Jobs 6.7e-05 ***
## CategoryConsultancy Jobs 0.76648
## CategoryCreative & Design Jobs 0.00018 ***
## CategoryCustomer Services Jobs < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs 0.00014 ***
## CategoryEnergy, Oil & Gas Jobs 0.00117 **
## CategoryEngineering Jobs 8.3e-09 ***
## CategoryGraduate Jobs 0.04752 *
## CategoryHealthcare & Nursing Jobs 0.00054 ***
## CategoryHospitality & Catering Jobs < 2e-16 ***
## CategoryHR & Recruitment Jobs 2.8e-15 ***
## CategoryIT Jobs 2.8e-05 ***
## CategoryLegal Jobs 0.00106 **
## CategoryLogistics & Warehouse Jobs 7.9e-15 ***
## CategoryMaintenance Jobs 0.00827 **
## CategoryManufacturing Jobs 5.0e-08 ***
## CategoryOther/General Jobs 4.1e-11 ***
## CategoryPR, Advertising & Marketing Jobs 0.02834 *
## CategoryProperty Jobs 0.01972 *
## CategoryRetail Jobs 1.9e-08 ***
## CategorySales Jobs < 2e-16 ***
## CategoryScientific & QA Jobs 0.00052 ***
## CategorySocial work Jobs 0.03925 *
## CategoryTeaching Jobs < 2e-16 ***
## CategoryTrade & Construction Jobs 0.00671 **
## CategoryTravel Jobs < 2e-16 ***
## LondonYes 3.9e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14400 on 6967 degrees of freedom
## Multiple R-squared: 0.202, Adjusted R-squared: 0.199
## F-statistic: 55.3 on 32 and 6967 DF, p-value: <2e-16
lm.pred3 <- predict(lm.fit3, newdata = val)
sqrt(mean((lm.pred3 - val$SalaryNormalized)^2))
## [1] 14591
lm.fit4 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category + London +
TitleSenior + TitleManage + DescripSenior + DescripManage, data = tr)
summary(lm.fit4)
##
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime +
## Category + London + TitleSenior + TitleManage + DescripSenior +
## DescripManage, data = tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38496 -8005 -2400 4293 133040
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 29345 776 37.81
## ContractTypefull_time -933 408 -2.29
## ContractTypepart_time -5017 785 -6.39
## ContractTimecontract 13020 808 16.10
## ContractTimepermanent 5704 528 10.81
## CategoryAdmin Jobs -14392 1507 -9.55
## CategoryCharity & Voluntary Jobs -13264 3566 -3.72
## CategoryConsultancy Jobs -3851 2102 -1.83
## CategoryCreative & Design Jobs -15475 4268 -3.63
## CategoryCustomer Services Jobs -15938 1244 -12.82
## CategoryDomestic help & Cleaning Jobs -17944 4987 -3.60
## CategoryEnergy, Oil & Gas Jobs 11461 2882 3.98
## CategoryEngineering Jobs -3579 866 -4.14
## CategoryGraduate Jobs -5769 3301 -1.75
## CategoryHealthcare & Nursing Jobs -2605 806 -3.23
## CategoryHospitality & Catering Jobs -12171 1025 -11.88
## CategoryHR & Recruitment Jobs -7961 977 -8.15
## CategoryIT Jobs 4524 838 5.40
## CategoryLegal Jobs -3820 1885 -2.03
## CategoryLogistics & Warehouse Jobs -11697 1716 -6.82
## CategoryMaintenance Jobs -8682 3470 -2.50
## CategoryManufacturing Jobs -8819 1715 -5.14
## CategoryOther/General Jobs -8201 1340 -6.12
## CategoryPR, Advertising & Marketing Jobs -5282 1976 -2.67
## CategoryProperty Jobs -8610 2829 -3.04
## CategoryRetail Jobs -12731 1912 -6.66
## CategorySales Jobs -8857 1077 -8.22
## CategoryScientific & QA Jobs -4525 1635 -2.77
## CategorySocial work Jobs -6902 2611 -2.64
## CategoryTeaching Jobs -9912 1132 -8.76
## CategoryTrade & Construction Jobs -3583 1543 -2.32
## CategoryTravel Jobs -18292 1776 -10.30
## LondonYes 3226 431 7.48
## TitleSeniorYes 3693 788 4.69
## TitleManageYes 7157 503 14.22
## DescripSeniorYes 2324 532 4.37
## DescripManageYes 2167 394 5.50
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## ContractTypefull_time 0.02219 *
## ContractTypepart_time 1.7e-10 ***
## ContractTimecontract < 2e-16 ***
## ContractTimepermanent < 2e-16 ***
## CategoryAdmin Jobs < 2e-16 ***
## CategoryCharity & Voluntary Jobs 0.00020 ***
## CategoryConsultancy Jobs 0.06702 .
## CategoryCreative & Design Jobs 0.00029 ***
## CategoryCustomer Services Jobs < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs 0.00032 ***
## CategoryEnergy, Oil & Gas Jobs 7.0e-05 ***
## CategoryEngineering Jobs 3.6e-05 ***
## CategoryGraduate Jobs 0.08054 .
## CategoryHealthcare & Nursing Jobs 0.00124 **
## CategoryHospitality & Catering Jobs < 2e-16 ***
## CategoryHR & Recruitment Jobs 4.4e-16 ***
## CategoryIT Jobs 6.9e-08 ***
## CategoryLegal Jobs 0.04269 *
## CategoryLogistics & Warehouse Jobs 1.0e-11 ***
## CategoryMaintenance Jobs 0.01237 *
## CategoryManufacturing Jobs 2.8e-07 ***
## CategoryOther/General Jobs 9.9e-10 ***
## CategoryPR, Advertising & Marketing Jobs 0.00752 **
## CategoryProperty Jobs 0.00235 **
## CategoryRetail Jobs 3.0e-11 ***
## CategorySales Jobs 2.3e-16 ***
## CategoryScientific & QA Jobs 0.00565 **
## CategorySocial work Jobs 0.00823 **
## CategoryTeaching Jobs < 2e-16 ***
## CategoryTrade & Construction Jobs 0.02027 *
## CategoryTravel Jobs < 2e-16 ***
## LondonYes 8.3e-14 ***
## TitleSeniorYes 2.8e-06 ***
## TitleManageYes < 2e-16 ***
## DescripSeniorYes 1.3e-05 ***
## DescripManageYes 3.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14000 on 6963 degrees of freedom
## Multiple R-squared: 0.255, Adjusted R-squared: 0.251
## F-statistic: 66.1 on 36 and 6963 DF, p-value: <2e-16
lm.pred4 <- predict(lm.fit4, newdata = val)
sqrt(mean((lm.pred4 - val$SalaryNormalized)^2))
## [1] 14088
Interesting! Adding “Location” helped (validation RMSE went from 14701 to 14619), but using “London” instead of “Location” was even better (14591). And those four text features helped a lot, too (14088).
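The four fits above repeat the same fit, predict and score steps, which can be collapsed into a loop over a list of formulas. A sketch using the built-in `mtcars` data as a stand-in, since the point is only the pattern; the formulas and the 22-row training split are arbitrary choices for illustration:

```r
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# arbitrary train/validation split of a built-in data set
set.seed(100)
idx <- sample(nrow(mtcars), 22)
tr.toy <- mtcars[idx, ]
val.toy <- mtcars[-idx, ]

# nested formulas, each adding one more predictor
formulas <- list(f1 = mpg ~ wt,
                 f2 = mpg ~ wt + hp,
                 f3 = mpg ~ wt + hp + qsec)

# fit on the training rows, score on the validation rows
val.rmse <- sapply(formulas, function(f) {
    fit <- lm(f, data = tr.toy)
    rmse(predict(fit, newdata = val.toy), val.toy$mpg)
})
round(val.rmse, 2)
```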
Let's try ridge regression using the features from the best linear model:
# build model matrices for train and validation
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 1.9-5
x.tr <- model.matrix(SalaryNormalized ~ ContractType + ContractTime + Category +
    London + TitleSenior + TitleManage + DescripSenior + DescripManage,
    data = tr)[, -1]
y.tr <- tr$SalaryNormalized
x.val <- model.matrix(SalaryNormalized ~ ContractType + ContractTime + Category +
    London + TitleSenior + TitleManage + DescripSenior + DescripManage,
    data = val)[, -1]
y.val <- val$SalaryNormalized
# CV to obtain best lambda
set.seed(10)
rr.cv <- cv.glmnet(x.tr, y.tr, alpha = 0)
plot(rr.cv)
rr.bestlam <- rr.cv$lambda.min
rr.goodlam <- rr.cv$lambda.1se
# predict validation set using best lambda and calculate RMSE
rr.fit <- glmnet(x.tr, y.tr, alpha = 0)
plot(rr.fit, xvar = "lambda", label = TRUE)
rr.pred <- predict(rr.fit, s = rr.bestlam, newx = x.val)
sqrt(mean((rr.pred - y.val)^2))
## [1] 14093
Ridge regression did about the same as the best linear model (validation RMSE of 14093 vs. 14088).
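To see what the ridge penalty does to the coefficients, here is the textbook closed form, beta = (X'X + lambda I)^(-1) X'y on standardized predictors, written in base R. This is a sketch of the idea only: glmnet parameterizes and scales lambda differently, so the numbers are not comparable to the fit above. The helper name `ridge.coef` and the use of `mtcars` are assumptions for illustration.

```r
# closed-form ridge coefficients on centered/scaled predictors
ridge.coef <- function(x, y, lambda) {
    x <- scale(x)        # center and scale each column
    y <- y - mean(y)     # center the response so no intercept is needed
    solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
}

x <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg

b0 <- ridge.coef(x, y, lambda = 0)     # lambda = 0 reproduces least squares
b10 <- ridge.coef(x, y, lambda = 10)   # a positive lambda shrinks the coefficients
cbind(ols = as.numeric(b0), ridge = as.numeric(b10))
```

When the unpenalized fit is not overfitting much, as the near-identical RMSE above suggests, shrinkage has little room to help.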
Next, let's try the lasso:
# CV to obtain best lambda
set.seed(10)
las.cv <- cv.glmnet(x.tr, y.tr, alpha = 1)
plot(las.cv)
las.bestlam <- las.cv$lambda.min
las.goodlam <- las.cv$lambda.1se
# predict validation set using best lambda and calculate RMSE
las.fit <- glmnet(x.tr, y.tr, alpha = 1)
plot(las.fit, xvar = "lambda", label = TRUE)
las.pred <- predict(las.fit, s = las.bestlam, newx = x.val)
sqrt(mean((las.pred - y.val)^2))
## [1] 14089
Again, lasso did about the same as the best linear model (validation RMSE of 14089 vs. 14088).
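The code above stores `lambda.1se` in `rr.goodlam` and `las.goodlam` but only ever predicts with `lambda.min`. With the lasso, the larger `lambda.1se` typically gives a sparser model, and counting the nonzero coefficients at both values shows how much. A sketch on simulated data in which only two of ten predictors matter (all names with a `.toy` suffix are made up):

```r
library(glmnet)

# simulated data: only the first two columns carry signal
set.seed(10)
n <- 200
p <- 10
x.toy <- matrix(rnorm(n * p), n, p)
y.toy <- 3 * x.toy[, 1] - 2 * x.toy[, 2] + rnorm(n)

cv.toy <- cv.glmnet(x.toy, y.toy, alpha = 1)

# number of nonzero coefficients (intercept excluded) at each lambda
nonzero <- sapply(c("lambda.min", "lambda.1se"), function(s) {
    sum(as.matrix(coef(cv.toy, s = s))[-1, 1] != 0)
})
nonzero
```

The same `s = "lambda.1se"` argument can be passed to `predict()` on a `cv.glmnet` object to score the sparser model on a validation set.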
Let's try a forward stepwise approach (since there are too many variables for a best subset approach):
library(leaps)
fwd.fit <- regsubsets(SalaryNormalized ~ ContractType + ContractTime + Category +
London + TitleSenior + TitleManage + DescripSenior + DescripManage, data = tr,
nvmax = 36, method = "forward")
plot(fwd.fit, scale = "adjr2")
# loop through models of each size and compute validation RMSE for each model
# (clever approach from ISLR page 248)
fwd.errors <- rep(NA, 36)
val.mat <- model.matrix(SalaryNormalized ~ ContractType + ContractTime + Category +
London + TitleSenior + TitleManage + DescripSenior + DescripManage, data = val)
for (i in 1:36) {
    # extract the coefficients from the best model of that size
    coefi <- coef(fwd.fit, id = i)
    # multiply them into the matching columns of the validation model matrix
    # to predict
    pred <- val.mat[, names(coefi)] %*% coefi
    # compute the validation RMSE
    fwd.errors[i] <- sqrt(mean((y.val - pred)^2))
}
# find the best model
fwd.errors
## [1] 15490 15056 14940 14786 14725 14638 14561 14515 14469 14374 14385
## [12] 14356 14297 14254 14245 14252 14229 14204 14163 14163 14144 14140
## [23] 14143 14139 14141 14135 14130 14128 14129 14126 14120 14120 14103
## [34] 14100 14090 14088
min(fwd.errors)
## [1] 14088
which.min(fwd.errors)
## [1] 36
Forward stepwise chose the model with all 36 variables, which uses the same predictors as the best linear model, so its validation RMSE (14088) is identical.
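For comparison, base R also offers forward selection through `step()`, which adds terms by AIC rather than scoring each model size on a validation set as above. It is a different criterion and works on whole terms instead of individual dummy columns, so it should be read as an alternative, not a reproduction of the regsubsets run. A sketch on the built-in `mtcars` data with an arbitrary candidate set:

```r
# start from the intercept-only model and add terms while AIC improves
null.fit <- lm(mpg ~ 1, data = mtcars)
fwd.aic <- step(null.fit,
                scope = list(lower = ~1, upper = ~wt + hp + qsec + drat),
                direction = "forward", trace = 0)
formula(fwd.aic)
```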
Let's just go with the best linear model, and test it against the solution!
First, we need to read in the solution set and redo our feature engineering:
solution <- read.csv("solution.csv", as.is = TRUE)
solution$ContractType <- as.factor(solution$ContractType)
solution$ContractTime <- as.factor(solution$ContractTime)
solution$Category <- as.factor(solution$Category)
solution$SourceName <- as.factor(solution$SourceName)
solution$Company <- as.factor(solution$Company)
solution$LocationNormalized <- as.factor(solution$LocationNormalized)
for (i in 1:nrow(solution)) {
loc <- solution$LocationNormalized[i]
line.id <- which(grepl(loc, tree))[1]
r <- regexpr("~.+?~", tree[line.id])
match <- regmatches(tree[line.id], r)
solution$Location[i] <- gsub("~", "", match)
}
## Error: replacement has length zero
solution$Location <- as.factor(solution$Location)
solution$London <- as.factor(ifelse(solution$Location == "London", "Yes", "No"))
solution$TitleSenior <- as.factor(ifelse(grepl("[Ss]enior", solution$Title),
"Yes", "No"))
solution$TitleManage <- as.factor(ifelse(grepl("[Mm]anage", solution$Title),
"Yes", "No"))
solution$DescripSenior <- as.factor(ifelse(grepl("[Ss]enior", solution$FullDescription),
"Yes", "No"))
solution$DescripManage <- as.factor(ifelse(grepl("[Mm]anage", solution$FullDescription),
"Yes", "No"))
The solution set has an extra Category (not present in the training set), so I'll take the easiest approach and just convert it to a different Category:
pt.index <- which(solution$Category == "Part time Jobs")
solution$Category[pt.index] <- "Admin Jobs"
solution$Category <- factor(solution$Category)
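Unseen levels like this one can be found programmatically by comparing the new data against the training levels, rather than by waiting for `predict()` to fail. A sketch with made-up vectors that mirrors the fix above:

```r
# made-up training levels and new values
train.cat <- factor(c("Admin Jobs", "IT Jobs", "Sales Jobs"))
new.cat <- c("IT Jobs", "Part time Jobs", "Sales Jobs")

# values in the new data that the model has never seen
unseen <- setdiff(unique(new.cat), levels(train.cat))
unseen

# map them to an existing level (same fallback as in the text),
# then rebuild the factor with exactly the training levels
new.cat[new.cat %in% unseen] <- "Admin Jobs"
new.cat <- factor(new.cat, levels = levels(train.cat))
new.cat
```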
To get some context, let's check the RMSE for a model fit with just the intercept from the full training set:
sqrt(mean((mean(train$SalaryNormalized) - solution$SalaryNormalized)^2))
## [1] 18761
Okay, let's train our linear model on the full training set, and predict on the solution set:
lm.fit5 <- lm(SalaryNormalized ~ ContractType + ContractTime + Category + London +
TitleSenior + TitleManage + DescripSenior + DescripManage, data = train)
summary(lm.fit5)
##
## Call:
## lm(formula = SalaryNormalized ~ ContractType + ContractTime +
## Category + London + TitleSenior + TitleManage + DescripSenior +
## DescripManage, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39027 -7986 -2462 4268 132936
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 29519 652 45.31
## ContractTypefull_time -732 343 -2.14
## ContractTypepart_time -5287 656 -8.05
## ContractTimecontract 13135 677 19.40
## ContractTimepermanent 5211 441 11.82
## CategoryAdmin Jobs -14566 1283 -11.35
## CategoryCharity & Voluntary Jobs -12866 2980 -4.32
## CategoryConsultancy Jobs -4710 1671 -2.82
## CategoryCreative & Design Jobs -12191 3041 -4.01
## CategoryCustomer Services Jobs -15904 1050 -15.15
## CategoryDomestic help & Cleaning Jobs -17972 4464 -4.03
## CategoryEnergy, Oil & Gas Jobs 8583 2582 3.32
## CategoryEngineering Jobs -3997 721 -5.54
## CategoryGraduate Jobs -6272 3279 -1.91
## CategoryHealthcare & Nursing Jobs -2790 677 -4.12
## CategoryHospitality & Catering Jobs -12242 864 -14.16
## CategoryHR & Recruitment Jobs -7477 820 -9.11
## CategoryIT Jobs 4552 700 6.51
## CategoryLegal Jobs -3993 1607 -2.48
## CategoryLogistics & Warehouse Jobs -11884 1462 -8.13
## CategoryMaintenance Jobs -9261 3191 -2.90
## CategoryManufacturing Jobs -9762 1476 -6.61
## CategoryOther/General Jobs -7664 1082 -7.09
## CategoryPR, Advertising & Marketing Jobs -5584 1599 -3.49
## CategoryProperty Jobs -8731 2188 -3.99
## CategoryRetail Jobs -12782 1567 -8.16
## CategorySales Jobs -8444 889 -9.49
## CategoryScientific & QA Jobs -5323 1373 -3.88
## CategorySocial work Jobs -6119 2016 -3.04
## CategoryTeaching Jobs -10094 974 -10.37
## CategoryTrade & Construction Jobs -5542 1292 -4.29
## CategoryTravel Jobs -18335 1530 -11.98
## LondonYes 3660 363 10.09
## TitleSeniorYes 4106 667 6.16
## TitleManageYes 7111 419 16.96
## DescripSeniorYes 1953 451 4.33
## DescripManageYes 2380 330 7.21
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## ContractTypefull_time 0.03251 *
## ContractTypepart_time 8.9e-16 ***
## ContractTimecontract < 2e-16 ***
## ContractTimepermanent < 2e-16 ***
## CategoryAdmin Jobs < 2e-16 ***
## CategoryCharity & Voluntary Jobs 1.6e-05 ***
## CategoryConsultancy Jobs 0.00484 **
## CategoryCreative & Design Jobs 6.1e-05 ***
## CategoryCustomer Services Jobs < 2e-16 ***
## CategoryDomestic help & Cleaning Jobs 5.7e-05 ***
## CategoryEnergy, Oil & Gas Jobs 0.00089 ***
## CategoryEngineering Jobs 3.1e-08 ***
## CategoryGraduate Jobs 0.05580 .
## CategoryHealthcare & Nursing Jobs 3.8e-05 ***
## CategoryHospitality & Catering Jobs < 2e-16 ***
## CategoryHR & Recruitment Jobs < 2e-16 ***
## CategoryIT Jobs 8.1e-11 ***
## CategoryLegal Jobs 0.01297 *
## CategoryLogistics & Warehouse Jobs 4.9e-16 ***
## CategoryMaintenance Jobs 0.00371 **
## CategoryManufacturing Jobs 3.9e-11 ***
## CategoryOther/General Jobs 1.5e-12 ***
## CategoryPR, Advertising & Marketing Jobs 0.00048 ***
## CategoryProperty Jobs 6.7e-05 ***
## CategoryRetail Jobs 3.9e-16 ***
## CategorySales Jobs < 2e-16 ***
## CategoryScientific & QA Jobs 0.00011 ***
## CategorySocial work Jobs 0.00240 **
## CategoryTeaching Jobs < 2e-16 ***
## CategoryTrade & Construction Jobs 1.8e-05 ***
## CategoryTravel Jobs < 2e-16 ***
## LondonYes < 2e-16 ***
## TitleSeniorYes 7.7e-10 ***
## TitleManageYes < 2e-16 ***
## DescripSeniorYes 1.5e-05 ***
## DescripManageYes 6.1e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14000 on 9963 degrees of freedom
## Multiple R-squared: 0.252, Adjusted R-squared: 0.25
## F-statistic: 93.5 on 36 and 9963 DF, p-value: <2e-16
lm.pred5 <- predict(lm.fit5, newdata = solution)
sqrt(mean((lm.pred5 - solution$SalaryNormalized)^2))
## [1] 16470
The decrease in RMSE over the intercept-only model is similar in both cases: about 13% on the validation set (16201 to 14088) and about 12% on the solution set (18761 to 16470). In other words, the validation set approach did a good job of predicting how we would eventually do on the solution set.
As for how to improve the model, I would bet that you could achieve significant reductions in RMSE by generating a ton of features from the Title and FullDescription. I was surprised by how much performance increased in lm.fit4 (over lm.fit3) just by adding four very simple text features.
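As a starting point for generating more text features, the most frequent words in the titles can be tabulated and turned into indicator columns. A rough sketch on five made-up titles; the cut-offs (words longer than two letters, appearing at least twice) are arbitrary assumptions:

```r
# made-up job titles
titles <- c("Senior Software Engineer", "Sales Manager", "Senior Nurse",
            "Software Developer", "Care Assistant / Nurse")

# split into lower-case words and drop very short ones
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 2]

# keep words that occur at least twice
word.counts <- sort(table(words), decreasing = TRUE)
top.words <- names(word.counts)[word.counts >= 2]

# one logical column per frequent word
flags <- sapply(top.words, function(w) grepl(w, titles, ignore.case = TRUE))
flags
```

On the real data the candidate list would be much longer, which is where the lasso from earlier could help decide which flags to keep.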