In what follows, I’m going to assume that you ran all of the code in the first OkCupid presentation.
Here, I’m going to show you how to make predictions based on different models for different subsets of the data.
For instance, let’s imagine that you want to make predictions based on offspring for users who answered the offspring question, then predictions based on job for users who are students or retired, and predictions based on income for all other users.
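To make this concrete before we touch the real data, here is a tiny self-contained sketch of the pattern we will repeat three times below. The toy_train and toy_test data frames are made up purely for illustration (they are not the OkCupid data); the idea is simply to fit a model on training data, predict for the subset of test rows that meet some condition, and write those predictions back into just those rows.
library(dplyr)  # already loaded if you ran the code from the first presentation

# Made-up toy data: three "users", one of whom didn't answer x
toy_train <- data.frame(x = c("a", "a", "b", "b"), y = c(1, 2, 3, 4))
toy_test  <- data.frame(x = c("a", "b", NA), y = 0)

# Fit on the toy training data, then overwrite y only where x was answered
m_toy <- lm(y ~ x, data = toy_train)
toy_test[!is.na(toy_test$x), "y"] <- predict(m_toy, toy_test %>% filter(!is.na(x)))
toy_test  # the NA row keeps its placeholder 0; the other rows get model predictions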
First, let’s make our offspring model:
m_offspring <- lm(age~offspring, data=train)
summary(m_offspring)
##
## Call:
## lm(formula = age ~ offspring, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.248 -6.241 -1.241 4.845 76.120
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 30.24116 0.14340
## offspringdoesn't have kids, and doesn't want any 2.28488 0.39611
## offspringdoesn't have kids, but might want them 0.56973 0.24682
## offspringdoesn't have kids, but wants them -0.08616 0.25687
## offspringdoesn't want kids 5.05769 0.27141
## offspringhas a kid 12.70280 0.32433
## offspringhas a kid, and might want more 6.01355 0.87260
## offspringhas a kid, and wants more 0.73181 1.46391
## offspringhas a kid, but doesn't want more 19.00711 0.74977
## offspringhas kids 18.04957 0.32279
## offspringhas kids, and might want more 7.46059 1.18250
## offspringhas kids, and wants more 2.29730 2.46200
## offspringhas kids, but doesn't want more 19.45148 0.60044
## offspringmight want kids 2.63927 0.93497
## offspringwants kids 0.65484 0.80549
## t value Pr(>|t|)
## (Intercept) 210.888 < 2e-16 ***
## offspringdoesn't have kids, and doesn't want any 5.768 8.20e-09 ***
## offspringdoesn't have kids, but might want them 2.308 0.02100 *
## offspringdoesn't have kids, but wants them -0.335 0.73732
## offspringdoesn't want kids 18.635 < 2e-16 ***
## offspringhas a kid 39.167 < 2e-16 ***
## offspringhas a kid, and might want more 6.892 5.79e-12 ***
## offspringhas a kid, and wants more 0.500 0.61715
## offspringhas a kid, but doesn't want more 25.351 < 2e-16 ***
## offspringhas kids 55.917 < 2e-16 ***
## offspringhas kids, and might want more 6.309 2.90e-10 ***
## offspringhas kids, and wants more 0.933 0.35078
## offspringhas kids, but doesn't want more 32.395 < 2e-16 ***
## offspringmight want kids 2.823 0.00477 **
## offspringwants kids 0.813 0.41625
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.862 on 12207 degrees of freedom
## (17751 observations deleted due to missingness)
## Multiple R-squared: 0.3255, Adjusted R-squared: 0.3247
## F-statistic: 420.7 on 14 and 12207 DF, p-value: < 2.2e-16
Next, let’s use this model to make predictions, this time on the test set. We could make predictions for everyone:
predict(m_offspring, test)
… but many of these predictions will be NAs. Instead, let’s filter out users with NA for offspring and make predictions for everyone else:
predict(m_offspring, test %>% filter(!is.na(offspring)))
Now, we want to assign these values to age in the test set but only to users who answered the offspring question:
test[!is.na(test$offspring), "age"] <- predict(m_offspring, test %>% filter(!is.na(offspring)))
You can now view the test set to confirm that we have made predictions only for users who completed the offspring question. All other users should still have 0 for age.
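One quick way to spot-check this is to cross-tabulate whether offspring is missing against whether age is still 0:
# Users with a missing offspring answer should still be at the placeholder 0;
# users who answered should now have a model prediction instead
table(missing_offspring = is.na(test$offspring), age_still_zero = test$age == 0)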
Next, let’s make a model based on jobs.
m_job <- lm(age~job, data=train)
summary(m_job)
##
## Call:
## lm(formula = age ~ job, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.307 -6.247 -1.985 4.050 84.050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9850 0.1392 222.579 < 2e-16
## jobartistic / musical / writer 2.6523 0.2373 11.175 < 2e-16
## jobbanking / financial / real estate 2.2625 0.3063 7.387 1.55e-13
## jobclerical / administrative 2.3825 0.4724 5.043 4.60e-07
## jobcomputer / hardware / software 1.5165 0.2337 6.489 8.78e-11
## jobconstruction / craftsmanship 4.7472 0.4200 11.302 < 2e-16
## jobeducation / academia 3.4224 0.2554 13.398 < 2e-16
## jobentertainment / media 1.3905 0.3052 4.556 5.23e-06
## jobexecutive / management 6.0516 0.2928 20.665 < 2e-16
## jobhospitality / travel 0.1515 0.3646 0.415 0.67779
## joblaw / legal services 3.1099 0.3697 8.413 < 2e-16
## jobmedicine / health 4.4713 0.2544 17.573 < 2e-16
## jobmilitary -4.4081 0.8962 -4.919 8.75e-07
## jobother 1.7941 0.2025 8.862 < 2e-16
## jobpolitical / government 3.8950 0.5023 7.755 9.12e-15
## jobrather not say 3.9594 0.6299 6.286 3.30e-10
## jobretired 19.3215 0.7838 24.651 < 2e-16
## jobsales / marketing / biz dev 0.8195 0.2385 3.436 0.00059
## jobscience / tech / engineering 0.7459 0.2309 3.230 0.00124
## jobstudent -6.0350 0.2292 -26.336 < 2e-16
## jobtransportation 4.6378 0.6965 6.659 2.81e-11
## jobunemployed -1.3610 0.8194 -1.661 0.09673
##
## (Intercept) ***
## jobartistic / musical / writer ***
## jobbanking / financial / real estate ***
## jobclerical / administrative ***
## jobcomputer / hardware / software ***
## jobconstruction / craftsmanship ***
## jobeducation / academia ***
## jobentertainment / media ***
## jobexecutive / management ***
## jobhospitality / travel
## joblaw / legal services ***
## jobmedicine / health ***
## jobmilitary ***
## jobother ***
## jobpolitical / government ***
## jobrather not say ***
## jobretired ***
## jobsales / marketing / biz dev ***
## jobscience / tech / engineering **
## jobstudent ***
## jobtransportation ***
## jobunemployed .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.028 on 29951 degrees of freedom
## Multiple R-squared: 0.09789, Adjusted R-squared: 0.09725
## F-statistic: 154.8 on 21 and 29951 DF, p-value: < 2.2e-16
If I think that job is only revealing for students and folks who are retired, I can make predictions just for OkCupid users who have one of these jobs and who didn’t answer the offspring question:
predict(m_job,
        test %>% filter(job %in% c("retired", "student") & is.na(offspring)))
and then assign these predictions only to users in that same subset:
# %in% returns FALSE for a missing job, matching how filter() drops those rows
test[test$job %in% c("retired", "student") & is.na(test$offspring), "age"] <-
  predict(m_job, test %>% filter(job %in% c("retired", "student") & is.na(offspring)))
Now, take another look at the test set.
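For example, you can count how many users still have the placeholder age of 0; these are the users that neither the offspring model nor the job model covered, and they are the ones the income model will handle next:
# Users still at age 0 haven't been covered by either model yet
sum(test$age == 0)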
Finally, let’s make predictions for all other users based on income.
m_income <- lm(age~logincome+listsincome, data=train)
summary(m_income)
##
## Call:
## lm(formula = age ~ logincome + listsincome, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.343 -6.291 -2.291 4.049 77.709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.29101 0.06073 531.75 <2e-16 ***
## logincome 5.33014 0.31112 17.13 <2e-16 ***
## listsincome -24.92898 1.48206 -16.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.455 on 29970 degrees of freedom
## Multiple R-squared: 0.009906, Adjusted R-squared: 0.00984
## F-statistic: 149.9 on 2 and 29970 DF, p-value: < 2.2e-16
… there’s one catch! This model uses variables (logincome and listsincome) that we created in the training set and that don’t yet exist in the test set. To make predictions on the test set, we’ll need to create those variables in the test set. We can simply copy and paste the code we used to create them and replace “train” with “test”:
test$listsincome[!is.na(test$income)] <- 1
test$listsincome[is.na(test$income)] <- 0
test$logincome <- ifelse(test$listsincome, log(test$income, base=10),0)
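Before predicting, it’s worth a quick sanity check that the new columns look right, since any NAs in them would turn into missing predictions:
# listsincome should be 0/1 for every user, and logincome should have no NAs
table(test$listsincome)
sum(is.na(test$logincome))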
Now, we can make our predictions and confirm that we’ve made predictions for every user:
test[test$age==0, "age"] <- predict(m_income, test %>% filter(age==0))
summary(test$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.95 30.29 32.29 32.72 32.29 50.31
The minimum age is now well above 0, so every user has received a prediction. If I’m now happy with my model, I can write my predictions to a file:
write.csv(test %>% select(ID, age), 'cross_predictions.csv', row.names=FALSE)
and email them to my teacher.
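If you want to double-check the file before sending it off, you can read it back in:
# Optional: read the predictions back in to confirm the file has the expected columns
head(read.csv('cross_predictions.csv'))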