In what follows, I’m going to assume that you ran all of the code in the first OkCupid presentation.

Here, I’m going to show you how to make predictions based on different models for different subsets of the data.

For instance, let’s imagine that you want to make predictions based on offspring for users who answered the offspring question. Next you want to make predictions based on job for users who are either students or retired and predictions based on income for all other users.

First, let’s make our offspring model

m_offspring <- lm(age~offspring, data=train)
summary(m_offspring)
## 
## Call:
## lm(formula = age ~ offspring, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.248  -6.241  -1.241   4.845  76.120 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                      30.24116    0.14340
## offspringdoesn't have kids, and doesn't want any  2.28488    0.39611
## offspringdoesn't have kids, but might want them   0.56973    0.24682
## offspringdoesn't have kids, but wants them       -0.08616    0.25687
## offspringdoesn't want kids                        5.05769    0.27141
## offspringhas a kid                               12.70280    0.32433
## offspringhas a kid, and might want more           6.01355    0.87260
## offspringhas a kid, and wants more                0.73181    1.46391
## offspringhas a kid, but doesn't want more        19.00711    0.74977
## offspringhas kids                                18.04957    0.32279
## offspringhas kids, and might want more            7.46059    1.18250
## offspringhas kids, and wants more                 2.29730    2.46200
## offspringhas kids, but doesn't want more         19.45148    0.60044
## offspringmight want kids                          2.63927    0.93497
## offspringwants kids                               0.65484    0.80549
##                                                  t value Pr(>|t|)    
## (Intercept)                                      210.888  < 2e-16 ***
## offspringdoesn't have kids, and doesn't want any   5.768 8.20e-09 ***
## offspringdoesn't have kids, but might want them    2.308  0.02100 *  
## offspringdoesn't have kids, but wants them        -0.335  0.73732    
## offspringdoesn't want kids                        18.635  < 2e-16 ***
## offspringhas a kid                                39.167  < 2e-16 ***
## offspringhas a kid, and might want more            6.892 5.79e-12 ***
## offspringhas a kid, and wants more                 0.500  0.61715    
## offspringhas a kid, but doesn't want more         25.351  < 2e-16 ***
## offspringhas kids                                 55.917  < 2e-16 ***
## offspringhas kids, and might want more             6.309 2.90e-10 ***
## offspringhas kids, and wants more                  0.933  0.35078    
## offspringhas kids, but doesn't want more          32.395  < 2e-16 ***
## offspringmight want kids                           2.823  0.00477 ** 
## offspringwants kids                                0.813  0.41625    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.862 on 12207 degrees of freedom
##   (17751 observations deleted due to missingness)
## Multiple R-squared:  0.3255, Adjusted R-squared:  0.3247 
## F-statistic: 420.7 on 14 and 12207 DF,  p-value: < 2.2e-16

Next, let’s make predictions, this time on the test set using this model. We could use this model to make predictions for everyone:

predict(m_offspring, test)

… but many of these predictions will be NAs. Instead let’s filter out users with NA for offspring and make predictions for everyone else.

predict(m_offspring, test %>% filter(!is.na(offspring)))

Now, we want to assign these values to age in the test set but only to users who answered the offspring question:

test[!is.na(test$offspring), "age"] <- predict(m_offspring, test %>% filter(!is.na(offspring)))

You can now view the test set to confirm that we have made predictions only for users who completed the offspring question. All other users should still have 0 for age.

Next, let’s make a model based on jobs.

m_job <- lm(age~job, data=train)
summary(m_job)
## 
## Call:
## lm(formula = age ~ job, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.307  -6.247  -1.985   4.050  84.050 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           30.9850     0.1392 222.579  < 2e-16
## jobartistic / musical / writer         2.6523     0.2373  11.175  < 2e-16
## jobbanking / financial / real estate   2.2625     0.3063   7.387 1.55e-13
## jobclerical / administrative           2.3825     0.4724   5.043 4.60e-07
## jobcomputer / hardware / software      1.5165     0.2337   6.489 8.78e-11
## jobconstruction / craftsmanship        4.7472     0.4200  11.302  < 2e-16
## jobeducation / academia                3.4224     0.2554  13.398  < 2e-16
## jobentertainment / media               1.3905     0.3052   4.556 5.23e-06
## jobexecutive / management              6.0516     0.2928  20.665  < 2e-16
## jobhospitality / travel                0.1515     0.3646   0.415  0.67779
## joblaw / legal services                3.1099     0.3697   8.413  < 2e-16
## jobmedicine / health                   4.4713     0.2544  17.573  < 2e-16
## jobmilitary                           -4.4081     0.8962  -4.919 8.75e-07
## jobother                               1.7941     0.2025   8.862  < 2e-16
## jobpolitical / government              3.8950     0.5023   7.755 9.12e-15
## jobrather not say                      3.9594     0.6299   6.286 3.30e-10
## jobretired                            19.3215     0.7838  24.651  < 2e-16
## jobsales / marketing / biz dev         0.8195     0.2385   3.436  0.00059
## jobscience / tech / engineering        0.7459     0.2309   3.230  0.00124
## jobstudent                            -6.0350     0.2292 -26.336  < 2e-16
## jobtransportation                      4.6378     0.6965   6.659 2.81e-11
## jobunemployed                         -1.3610     0.8194  -1.661  0.09673
##                                         
## (Intercept)                          ***
## jobartistic / musical / writer       ***
## jobbanking / financial / real estate ***
## jobclerical / administrative         ***
## jobcomputer / hardware / software    ***
## jobconstruction / craftsmanship      ***
## jobeducation / academia              ***
## jobentertainment / media             ***
## jobexecutive / management            ***
## jobhospitality / travel                 
## joblaw / legal services              ***
## jobmedicine / health                 ***
## jobmilitary                          ***
## jobother                             ***
## jobpolitical / government            ***
## jobrather not say                    ***
## jobretired                           ***
## jobsales / marketing / biz dev       ***
## jobscience / tech / engineering      ** 
## jobstudent                           ***
## jobtransportation                    ***
## jobunemployed                        .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.028 on 29951 degrees of freedom
## Multiple R-squared:  0.09789,    Adjusted R-squared:  0.09725 
## F-statistic: 154.8 on 21 and 29951 DF,  p-value: < 2.2e-16

If I think that job is only revealing for students and folks who are retired, I can make predictions only for OkCupid users who are in these jobs and who didn’t complete the offspring questions

predict(m_job, test %>% filter(
  (job=="retired" | job == "student") & is.na(offspring)
    )
  )

and then assign these predictions only for users falling into that same subset:

test[(test$job=="retired" | test$job == "student") & is.na(test$offspring), "age"] <- predict(m_job, test %>% filter(
  (job=="retired" | job == "student") & is.na(offspring)
    )
  )

Now, take another look at the test set.

Finally, let’s make predictions for all other users based on income.

m_income <- lm(age~logincome+listsincome, data=train)
summary(m_income)
## 
## Call:
## lm(formula = age ~ logincome + listsincome, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.343  -6.291  -2.291   4.049  77.709 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.29101    0.06073  531.75   <2e-16 ***
## logincome     5.33014    0.31112   17.13   <2e-16 ***
## listsincome -24.92898    1.48206  -16.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.455 on 29970 degrees of freedom
## Multiple R-squared:  0.009906,   Adjusted R-squared:  0.00984 
## F-statistic: 149.9 on 2 and 29970 DF,  p-value: < 2.2e-16

… there’s one catch! This model uses variables (logincome and listsincome) that we created in the training set and that don’t yet exist in the test set. To make predictions on the test set, we’ll need to create those variables in the test set. We can simply copy and paste the code we used to create them and replace “train” with “test”:

test$listsincome[!is.na(test$income)] <- 1
test$listsincome[is.na(test$income)] <- 0

test$logincome <- ifelse(test$listsincome, log(test$income, base=10),0)

Now, we can make our predictions and confirm that we’ve made predictions for every user:

test[test$age==0, "age"] <- predict(m_income, test %>% filter(age==0))
summary(test$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.95   30.29   32.29   32.72   32.29   50.31

If I’m now happy with my model, I can write my predictions to a file:

write.csv(test %>% select(ID, age), 'cross_predictions.csv', row.names=FALSE)

and email them to my teacher.