In the previous chapters, you fit various models to explain or predict an outcome variable of interest. However, how do we know which models to choose? Model assessment measures allow you to assess how well an explanatory model “fits” a set of data or how accurate a predictive model is. Based on these measures, you’ll learn about criteria for determining which models are “best”.
# Load the packages needed for this Chapter
library(dplyr)
library(moderndive)
library(ggplot2)
library(gridExtra)
library(knitr)   # provides kable() for the tables below
Let’s recap what we have learned so far. After learning some background modeling theory and terminology in Chapter 1, in Chapter 2 you modeled basic regressions using one explanatory or predictor \(x\) variable. In Chapter 3, you extended this by using two \(x\) variables. You created many models for both teaching score (from the evals dataset) and house_price (from the house_prices dataset).
However, you may be asking: how do I know which model to choose? In other words, which models are best? What do we mean by “best” and how do we assess this? In this chapter you will answer these questions via elementary model assessment and selection. In particular, you will assess the quality of the multiple regression models for Seattle house prices from Chapter 3.
But first, a brief refresher.
In Chapter 3 you studied two different multiple regression models for the outcome variable log10_price.
# Add log10 transformations for both variables
house_prices <- house_prices %>%
mutate(log10_price = log10(price),
log10_size = log10(sqft_living))
The first, using two numerical explanatory or predictor \(x\) variables: log10_size and yr_built.
# Model 1 - Two numerical:
model_price_1 <- lm(log10_price ~ log10_size + yr_built,
data = house_prices)
The other, using one numerical and one categorical \(x\) variable: log10_size and condition.
# Model 3 - One numerical and one categorical:
model_price_3 <- lm(log10_price ~ log10_size + condition,
data = house_prices)
If you wanted to explain or predict house prices and you had to choose from these two models, which would you select? Presumably the better one. As suggested earlier, this requires an explicit criterion for “better”. Have you seen one so far? Yes: the sum of squared residuals.
Recall that a residual is an observed \(y\) value minus its corresponding fitted or predicted value \(\hat{y}\): in our case, log10_price - log10_price_hat. Visually, residuals are the vertical distances between the blue points and their corresponding values on the regression plane. We have marked a small selection of them on the snapshot of the 3D visualisation below.
Furthermore, you learned that of all possible planes, the regression plane minimizes the sum of squared residuals. The sum of squared residuals is computed by squaring all 21,613 residuals and summing them. You saw that this quantity can be thought of as a measure of lack of fit, where larger values indicate a worse fit.
You computed this value explicitly in a previous section for model_price_1, which uses log10_size and yr_built as \(x\) variables.
# Model 1
get_regression_points(model_price_1) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(sum_sq_residuals = sum(sq_residuals)) %>%
kable(digits = 0,
align = "l",
caption = "Sum of squared residuals for Model 1")
| sum_sq_residuals |
|---|
| 585 |
You saw that this model’s sum of squared residuals was 585, a number that is a bit hard to make sense of on its own.
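As a quick cross-check, the same quantity can be computed directly from the fitted lm object using base R’s residuals() function; a minimal sketch, whose result should closely match 585 (any tiny difference is due to rounding of the residuals inside get_regression_points()).
# Cross-check: sum of squared residuals straight from the lm object
sum(residuals(model_price_1) ^ 2)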
However, let’s compute the sum of squared residuals for model_price_3 as well, which uses the categorical variable condition instead of the numerical variable year.
# Model 3
get_regression_points(model_price_3) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(sum_sq_residuals = sum(sq_residuals)) %>%
kable(digits = 0,
align = "l",
caption = "Sum of squared residuals for Model 3")
| sum_sq_residuals |
|---|
| 608 |
The sum of squared residuals is now 608. So it seems that Model 3, which uses the variable condition, has a greater lack of fit and is therefore worse, suggesting that Model 1, which uses yr_built, is better.
Let’s review how to compute the sum of squared residuals. You’ll do this for two more models.
First, compute the sum of squared residuals for model_price_2, which uses log10_size and bedrooms: mutate() a new column of squared residuals called sq_residuals, then summarise() sq_residuals with their sum. Call this sum sum_sq_residuals.
# Model 2
model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
data = house_prices)
# Calculate squared residuals
get_regression_points(model_price_2) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(sum_sq_residuals = sum(sq_residuals)) %>%
kable(digits = 0,
align = "l",
caption = "Sum of squared residuals for Model 2")
| sum_sq_residuals |
|---|
| 605 |
Now compute the sum of squared residuals for model_price_4 which uses the categorical variable waterfront instead of the numerical variable bedrooms.
# Model 4
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
data = house_prices)
# Calculate squared residuals
get_regression_points(model_price_4) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(sum_sq_residuals = sum(sq_residuals)) %>%
kable(digits = 0,
align = "l",
caption = "Sum of squared residuals for Model 4")
| sum_sq_residuals |
|---|
| 599 |
Let’s use these two sums of squared residuals to choose between these two models, or in other words, to perform model selection!
Based on these two values of the sum of squared residuals, which of these two models do you think is “better”, and hence which would you select?
- model_price_2, which uses log10_size and bedrooms?
- model_price_4, which uses log10_size and waterfront?

Since model_price_4’s value of 599 is smaller than model_price_2’s 605, select model_price_4. Given the choice of just these two models, the evidence suggests that using size and waterfront yields a better fit, so you should choose this one!
Now that you’ve reviewed the sum of squared residuals with an eye towards model assessment and selection, let’s learn about another measure of a model’s fit: the widely known R-squared.
\[ R^2 = 1 - \frac{\text{Var(residuals)}}{\text{Var}(y)} \] R-squared is another numerical summary of how well a model fits points. It is 1 minus the variance of the residuals over the variance of the outcome variable. If you’ve never heard of a variance, it’s another measure of variability/spread: it’s the standard deviation squared. Instead of focusing on the formula, however, let’s first focus on the intuition. While the sum of squared residuals is unbounded, meaning there is no theoretical upper limit to its value, R-squared is standardized to be between 0 and 1. And unlike the sum of squared residuals, where smaller values indicate better fit, larger values of R-squared indicate better fit. So 1 indicates a perfect fit, and 0 indicates a complete lack of fit, in other words no relationship between the outcome and the explanatory/predictor variables. Let’s explore these ideas visually.
Let’s revisit basic regression with one numerical variable and consider a set of points with a perfectly linear relationship. In other words, the points fall perfectly on a line. Recall that residuals are the vertical distances between the observed values, here the black points, and the corresponding fitted/predicted values on the blue regression line. Here, the residuals are all exactly 0. Thus the variance, or variation, of the residuals is 0, and thus \(R^2\) is equal to \(1-0\), which is 1.
Now consider points that don’t fall tightly on a line but rather exhibit a large amount of scatter. Unlike the previous example, there are now non-zero residuals of varying size, so the numerator of the fraction is greater than zero and R-squared will be smaller. Note that it is a mathematical fact that the variance of \(y\) is greater than or equal to the variance of the residuals, guaranteeing that R-squared is between 0 and 1.
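To make this concrete, here is a small simulated illustration, a sketch with made-up data rather than the course datasets: points that fall exactly on a line give an R-squared of 1, while adding scatter pushes it below 1.
# Toy illustration of R-squared (simulated data)
set.seed(76)
x <- 1:50
y_perfect <- 2 + 3 * x                       # points exactly on a line
y_noisy <- 2 + 3 * x + rnorm(50, sd = 40)    # the same line plus scatter

model_perfect <- lm(y_perfect ~ x)
model_noisy <- lm(y_noisy ~ x)

1 - var(residuals(model_perfect)) / var(y_perfect)  # essentially 1
1 - var(residuals(model_noisy)) / var(y_noisy)      # noticeably less than 1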
Using this fact, the numerical interpretation of R-squared is as follows: it is the proportion of the total variation in the outcome variable \(y\) that the model explains. Our models attempt to explain the variation in house prices. For example, what makes certain houses expensive and others not? The question is, how much of this variation can our models explain? If it’s 100%, then our model explains everything! If it’s 0%, then our model has no explanatory power.
Let’s compute the R-squared statistic for both models we saw in the last Chapter.
# Model 1: price as a function of size and year built
get_regression_points(model_price_1) %>%
summarise(r_squared = 1 - var(residual)/var(log10_price)) %>%
kable(digits = 2,
align = "l",
caption = "R-Squared Value for Model 1")
| r_squared |
|---|
| 0.48 |
In both cases, the outcome variable \(y\) is the observed log10_price. For Model 1, which used log10_size and yr_built, the \(R^2\) is 0.483 or 48.3%. So you can explain about half the total variation in house prices using Model 1.
# Model 3: price as a function of size and condition
get_regression_points(model_price_3) %>%
summarise(r_squared = 1 - var(residual)/var(log10_price)) %>%
kable(digits = 2,
align = "l",
caption = "R-Squared Value for Model 3")
| r_squared |
|---|
| 0.46 |
For Model 3, which used condition instead of yr_built, the \(R^2\) is 0.462 or 46.2%. So a lower proportion of the total variation in house prices is explained by Model 3. Since R-squared values closer to 1 mean better fit, the results suggest you choose Model 1, and thus that using size and year is preferable to using size and condition. This is the same conclusion you reached when you used the sum of squared residuals as the assessment criterion. Note, however, that sometimes there are no models that yield R-squared values close to 1. Sometimes the phenomenon you are modeling is so complex that no choice of variables will capture its behavior, and thus you only get low R-squared values.
Let’s compute the \(R^2\) summary value for the two numerical explanatory/predictor variable model you fit in Chapter 3: price as a function of size and the number of bedrooms.
Compute \(R^2\) by summarizing the residual and log10_price columns.
# Get fitted/predicted values & residuals, compute R^2 using residuals
get_regression_points(model_price_2) %>%
summarise(r_squared = 1 - var(residual)/var(log10_price)) %>%
kable(digits = 2,
align = "l",
caption = "R-Squared Value for Model 2")
| r_squared |
|---|
| 0.46 |
You observed an R-squared value of 0.465, which means that 46.5% of the total variability of the outcome variable log10_price can be explained by this model.
Let’s now compute \(R^2\) for the one numerical and one categorical explanatory/predictor variable model you fit in Chapter 3, price as a function of size and whether the house had a waterfront view, and compare its \(R^2\) with the one you just computed.
# Get fitted/predicted values & residuals, compute R^2 using residuals
get_regression_points(model_price_4) %>%
summarise(r_squared = 1 - var(residual) / var(log10_price)) %>%
kable(digits = 2,
align = "l",
caption = "R-Squared Value for Model 4")
| r_squared |
|---|
| 0.47 |
Since model_price_4 had a higher \(R^2\) of 0.470, it “fit” the data better. Since using waterfront explained a higher proportion of the total variance of the outcome variable than using the number of bedrooms, using waterfront in our model is preferred.
You just learned about R-squared, the proportion of the total variation in house prices that is explained by a model. This numerical summary can be used to assess model fit, where models with \(R^2\) values closer to 1 have better fit and values closer to 0 have poorer fit.
Let’s now consider another assessment measure, but one more associated with modeling for prediction. In particular, how can you assess the quality of a model’s predictions? You’ll use a quantity called the Root Mean Squared Error which is a slight variation of the sum of squared residuals.
Once again, recall that in your visualization of modeling with two numerical predictor variables, you marked a selection of residuals with red lines: the differences between the observed values and their corresponding fitted/predicted values on the regression plane. The sum of squared residuals takes all such residuals, squares them, and sums them. But what if you took the average instead of the sum? For example, a model might have a large sum of squared residuals merely because it involves a large number of points! By using the average, we correct for this and get a notion of “average prediction error”.
You’ve seen the computation of the sum of squared residuals for Model 1 a few times now. Instead of using sum() in the summarize() call however, let’s use the mean() function and assign this to mse, meaning mean squared error. This is the average squared error a predictive model makes. The closer your predictions \(\hat{y}\) are to the observed values \(y\), the smaller the residuals will be, and hence the closer the MSE will be to 0. The further your predictions are, the larger the MSE will be.
# Mean squared error: use mean() instead of sum()
get_regression_points(model_price_1) %>%
mutate(sq_residuals = residual^2) %>%
summarise(mse = mean(sq_residuals)) %>%
kable(digits = 2,
align = "l",
caption = "MSE for Model 1")
| mse |
|---|
| 0.03 |
You observe an MSE of 0.0271, which is 585 divided by 21,613, the total number of houses. Why is this called the MSE and not the mean of squared residuals? No reason other than convention; they mean the same thing.
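As a quick check of that arithmetic:
# MSE = sum of squared residuals divided by the number of houses
585 / 21613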
Since the MSE involves squared errors, the units of MSE are the units of the outcome variable \(y\) squared. Let’s instead obtain a measure of error whose units match the units of y.
You do this via the root mean squared error, or RMSE, which is the square-root of the MSE.
# Root mean squared error
get_regression_points(model_price_1) %>%
mutate(sq_residuals = residual^2) %>%
summarise(mse = mean(sq_residuals)) %>%
mutate(rmse = sqrt(mse)) %>%
kable(digits = 2,
align = "l",
caption = "RMSE for Model 1")
| mse | rmse |
|---|---|
| 0.03 | 0.16 |
Note the added mutate() line of code to compute the sqrt(). The RMSE can be thought of as the “typical prediction error” our model will make, and its units match the units of the outcome variable \(y\). While the interpretation of our units of log10 dollars might not be immediately apparent, you can imagine in many other cases it being very useful for these units to match.
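One rough way to interpret an RMSE expressed in log10 dollars (a back-of-the-envelope calculation, not from the course): an error of 0.16 on the log10 scale corresponds to predicted and observed prices differing by a multiplicative factor of about \(10^{0.16}\).
# A typical log10-scale error of 0.16 corresponds to roughly a factor of 1.45 in dollar terms
10 ^ 0.16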
Let’s now assess the quality of the predictions of log10_price for the two new houses you saw in the previous chapter.
new_houses <- tibble(
log10_size = c(2.9, 3.6),
condition = factor(c(3, 4))
)
new_houses %>%
kable(digits = 2,
align = "l",
caption = "New Houses Predictions")
| log10_size | condition |
|---|---|
| 2.9 | 3 |
| 3.6 | 4 |
Recall that you apply the get_regression_points() function to model_price_3, but also with the newdata argument set to new_houses.
# Get predictions
get_regression_points(model_price_3,
newdata = new_houses) %>%
kable(digits = 2,
align = "l",
caption = "Get Regression Points for New Houses using Model 3")
| ID | log10_size | condition | log10_price_hat |
|---|---|---|---|
| 1 | 2.9 | 3 | 5.34 |
| 2 | 3.6 | 4 | 5.94 |
You thus obtain predicted values log10_price_hat of 5.34 and 5.94. Now let’s take this output and compute the RMSE: square the residuals, take their mean (not their sum), and then take the square root.
# Compute RMSE
get_regression_points(model_price_3,
newdata = new_houses) %>%
mutate(sq_residuals = residual^2) %>%
summarise(mse = mean(sq_residuals)) %>%
mutate(rmse = sqrt(mse))
You get the following error message:
Error in mutate_impl(.data, dots) :
Evaluation error: object 'residual' not found.
It says the residual column is not found. Why is it not found? Because to compute residuals, you need both the predicted/fitted values \(\hat{y}\), in this case log10_price_hat and the observed values \(y\), in this case log10_price. But if you don’t have the latter, you can’t compute the residuals, and hence you cannot compute the RMSE. This illustrates a key restriction in predictive modeling assessment: you can only assess the quality of predictions when you have access to the observed value \(y\). You’ll learn about a workaround to this issue shortly.
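To illustrate this restriction, here is a hypothetical sketch: the sale prices below are made up, and it assumes that get_regression_points() computes residuals whenever the outcome column is present in newdata, as it does for the test set later in this chapter. If the two new houses eventually sold and you learned their prices, the residuals, and hence the RMSE, would become computable:
# Hypothetical: add made-up observed sale prices for the two new houses
new_houses_sold <- new_houses %>%
  mutate(log10_price = log10(c(250000, 900000)))

# With the observed outcome available, residuals and thus the RMSE can be computed
get_regression_points(model_price_3,
                      newdata = new_houses_sold) %>%
  mutate(sq_residuals = residual ^ 2) %>%
  summarise(mse = mean(sq_residuals)) %>%
  mutate(rmse = sqrt(mse))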
Just as you did earlier with \(R^2\), which is a measure of model fit, let’s now compute the root mean square error (RMSE) of our models, which is a commonly used measure of predictive error. Let’s use the model of price as a function of size and number of bedrooms, model_price_2.
Let’s start by computing the mean squared error (mse), which is the mean of the squared residuals.
# Get all residuals, square them and take mean
get_regression_points(model_price_2) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(mse = mean(sq_residuals)) %>%
kable(digits = 3,
align = "l",
caption = "MSE for Model 2")
| mse |
|---|
| 0.028 |
Now that you’ve computed the mean squared error, let’s compute the root mean squared error.
# Get all residuals, square them and take mean and square root
get_regression_points(model_price_2) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(mse = mean(sq_residuals),
rmse = sqrt(mse)) %>%
kable(digits = 3,
align = "l",
caption = "MSE and RMSE for Model 2")
| mse | rmse |
|---|---|
| 0.028 | 0.167 |
The RMSE is 0.167. You can think of this as the “typical” prediction error this model makes.
As you did using the sum of squared residuals and \(R^2\), let’s once again assess and compare the quality of your two models using the root mean squared error (RMSE). Note that RMSE is more typically used in prediction settings than explanatory settings.
Based on the code provided that computes MSE and RMSE for model_price_2, compute the MSE and RMSE for model_price_4.
# MSE and RMSE for model_price_2
get_regression_points(model_price_2) %>%
mutate(sq_residuals = residual^2) %>%
summarize(mse = mean(sq_residuals),
rmse = sqrt(mean(sq_residuals))) %>%
kable(digits = 3,
align = "l",
caption = "MSE and RMSE for Model 2")
| mse | rmse |
|---|---|
| 0.028 | 0.167 |
# MSE and RMSE for model_price_4
get_regression_points(model_price_4) %>%
mutate(sq_residuals = residual^2) %>%
summarize(mse = mean(sq_residuals),
rmse = sqrt(mean(sq_residuals))) %>%
kable(digits = 3,
align = "l",
caption = "MSE and RMSE for Model 4")
| mse | rmse |
|---|---|
| 0.028 | 0.166 |
Since model_price_4 had a lower RMSE of 0.166, this suggests that it has better predictive power. The RMSE can be thought of as the ‘typical’ error a predictive model will make.
In this section, we will introduce the validation set prediction framework. This framework allows us to get a sense of how well a predictive model will perform, in our case on new, previously unseen houses. This forms the backbone of a well-known machine learning method for model assessment called cross-validation.
Use two independent datasets to: (1) train/fit your model, and (2) evaluate your model’s predictive performance.
The underlying idea of the validation set approach is to first fit or train a model on one set of data, but evaluate, or validate, its performance on a different set of data. If you use the same data to both train your model and evaluate its performance, you could imagine your model easily being over-fitted to this data. In other words, you construct a model that is so overly specific to one dataset that it wouldn’t generalise well to other datasets.
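To see over-fitting in miniature, here is a toy simulation, an illustration with made-up data rather than the course’s house prices: a very flexible model fit to just 10 training points reproduces them almost perfectly, yet typically predicts new points drawn from the same process much worse.
# Toy illustration of over-fitting (simulated data)
set.seed(76)
x_train <- runif(10)
y_train <- x_train + rnorm(10, sd = 0.2)
x_test <- runif(10)
y_test <- x_test + rnorm(10, sd = 0.2)

# A 9th-degree polynomial has enough flexibility to (nearly) interpolate 10 points
overfit_model <- lm(y_train ~ poly(x_train, 9))

# Near-zero error on the training points...
sqrt(mean(residuals(overfit_model) ^ 2))

# ...but typically a much larger error on new test points
test_preds <- predict(overfit_model, newdata = data.frame(x_train = x_test))
sqrt(mean((y_test - test_preds) ^ 2))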
Randomly split all \(n\) observations (white) into: a training set (blue) and a test set (orange).
Say your dataset has \(n\) observations. You randomly split the data into two sets: a training set, in blue, and a test set, in orange. You’ll use the blue observations to train, or fit, your model. You’ll then apply this model to get predictions \(\hat{y}\) for the orange observations. Finally, for these same orange observations, you’ll assess these predictions \(\hat{y}\) by comparing them to the observed outcome variable \(y\).
By using independent training and test data as above, you can get a sense of a model’s predictive performance on new data.
Let’s do this with some dplyr functions. You first use slice_sample() with prop set to 1 and replace set to FALSE to randomly sample 100% of the rows of house_prices without replacement. This has the effect of randomly shuffling the order of the rows.
# Set random number generator seed value for reproducibility
set.seed(76)
# Randomly shuffle order of rows
house_prices_shuffled <- house_prices %>%
slice_sample(prop = 1, replace = FALSE)
You then set the training data to be the first 10,000 rows of house_prices_shuffled using slice(). You similarly set the test data to be the remaining 11,613 rows.
# Split into train and test
train <- house_prices_shuffled %>%
slice(1:10000)
test <- house_prices_shuffled %>%
slice(10001:21613)
Note that these two datasets have none of the original rows in common, and by randomly shuffling the rows before slicing, we have effectively randomly assigned the rows to train and test. Also, you are not limited to a rough 50/50 split between train and test as we just did here. We only did this for simplicity.
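For example, here is a minimal sketch of a roughly 80/20 split using the same shuffled data; the 80/20 proportion is a common convention rather than a requirement, and the object names below are illustrative.
# An alternative, roughly 80/20 train/test split
n_rows <- nrow(house_prices_shuffled)
n_train <- floor(0.8 * n_rows)

train_80 <- house_prices_shuffled %>%
  slice(1:n_train)
test_20 <- house_prices_shuffled %>%
  slice((n_train + 1):n_rows)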
Let’s fit the same regression model as earlier using log10_size and yr_built as predictor variables. However, we set the data to be train and not house_prices. Let’s then output the regression table.
# Train Model 1
train_model_price_1 <- lm(log10_price ~ log10_size + yr_built,
data = train)
get_regression_table(train_model_price_1) %>%
kable(digits = 2,
align = "l",
caption = "Training summary for Model 1")
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.25 | 0.11 | 47.34 | 0 | 5.03 | 5.47 |
| log10_size | 0.92 | 0.01 | 96.67 | 0 | 0.90 | 0.94 |
| yr_built | 0.00 | 0.00 | -22.02 | 0 | 0.00 | 0.00 |
You again obtain values for the intercept and the slopes for log10_size and yr_built in the estimate column, but these values are slightly different than before when using all of the house_prices data, as they are based on a randomly chosen subset of points in the training data.
Let’s then apply this model to the test data to make predictions. In other words, take all 11,613 houses in test and compute the predicted values log10_price_hat. Recall from earlier, that you can do this quickly by using the get_regression_points() function with the newdata argument set to test.
# Train model on train
train_model_price_1 <- lm(log10_price ~ log10_size + yr_built,
data = train)
# Get predictions on test
get_regression_points(train_model_price_1,
newdata = test) %>%
head(10) %>%
kable(digits = 2,
align = "l",
caption = "First 10 Predictions for Test Data")
| ID | log10_price | log10_size | yr_built | log10_price_hat | residual |
|---|---|---|---|---|---|
| 1 | 5.81 | 3.24 | 2008 | 5.58 | 0.23 |
| 2 | 5.45 | 3.25 | 1964 | 5.65 | -0.20 |
| 3 | 5.69 | 3.38 | 1995 | 5.73 | -0.04 |
| 4 | 5.50 | 2.98 | 1971 | 5.39 | 0.11 |
| 5 | 5.93 | 3.55 | 1986 | 5.90 | 0.04 |
| 6 | 5.39 | 3.25 | 1960 | 5.66 | -0.27 |
| 7 | 5.61 | 3.06 | 1937 | 5.52 | 0.09 |
| 8 | 5.38 | 3.12 | 1993 | 5.50 | -0.12 |
| 9 | 6.28 | 3.49 | 1950 | 5.89 | 0.40 |
| 10 | 5.85 | 3.53 | 2000 | 5.86 | 0.00 |
You observe a log10_price_hat column of predicted values and the corresponding residuals. Note that since you have both the predicted values \(\hat{y}\), in this case log10_price_hat, and the observed values \(y\), in this case log10_price, you can compute the residuals in the final column.
Let’s now compute the RMSE to assess our predictions as before. You first mutate() a new column for the squared residuals, then you summarise() these values with the square root of their mean, all in a single summarise() step.
# Train model on train
train_model_price_1 <- lm(log10_price ~ log10_size + yr_built,
data = train)
# Get predictions and compute RMSE on test
get_regression_points(train_model_price_1,
newdata = test) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(rmse = sqrt(mean(sq_residuals))) %>%
kable(digits = 3,
align = "l",
caption = "RMSE for Model 1 Test Data")
| rmse |
|---|
| 0.164 |
The RMSE is 0.164.
Let’s now repeat this for the model that used condition instead of year and compare the RMSEs. You again fit the model to the training data and then use the get_regression_points() function with the newdata argument set to test to make predictions and you compute the RMSE.
# Train model on train
train_model_price_3 <- lm(log10_price ~ log10_size + condition,
data = train)
# Get predictions and compute RMSE on test
get_regression_points(train_model_price_3,
newdata = test) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(rmse = sqrt(mean(sq_residuals))) %>%
kable(digits = 3,
align = "l",
caption = "RMSE for Model 3 Test Data")
| rmse |
|---|
| 0.168 |
This RMSE of 0.168 is larger than the previous one of 0.164 suggesting that using condition instead of yr_built yields worse predictions.
It’s time to split your data into a training set to fit a model and a separate test set to evaluate the predictive power of the model. Before making this split however, we first sample 100% of the rows of house_prices without replacement and assign this to house_prices_shuffled. This has the effect of “shuffling” the rows, thereby ensuring that the training and test sets are randomly sampled.
Use slice() to set train to the first 10,000 rows of house_prices_shuffled and test to the remaining 11,613 rows.
# Set random number generator seed value for reproducibility
set.seed(76)
# Randomly reorder the rows
house_prices_shuffled <- house_prices %>%
slice_sample(prop = 1, replace = FALSE)
# Train/test split
train <- house_prices_shuffled %>%
slice(1:10000)
test <- house_prices_shuffled %>%
slice(10001:21613)
Now fit a linear regression to predict log10_price using log10_size and bedrooms using just the training data.
# Fit model to training set
train_model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
data = train)
Since you’ve fit/trained the predictive model on the training set, let’s now apply it to the test set to make predictions!
Now that you’ve trained the model on the train set, let’s apply the model to the test data, make predictions, and evaluate the predictions. Recall that having a separate test set here simulates the gathering of a “new” independent dataset to test our model’s predictive performance on.
Use the get_regression_points() function to apply train_model_price_2 to your new dataset: test.
# Make predictions on test set
get_regression_points(train_model_price_2,
newdata = test) %>%
head(10) %>%
kable(digits = 2,
align = "l",
caption = "First 10 Predictions for Testing Data for Model 2")
| ID | log10_price | log10_size | bedrooms | log10_price_hat | residual |
|---|---|---|---|---|---|
| 1 | 5.81 | 3.24 | 3 | 5.64 | 0.17 |
| 2 | 5.45 | 3.25 | 3 | 5.65 | -0.20 |
| 3 | 5.69 | 3.38 | 3 | 5.78 | -0.09 |
| 4 | 5.50 | 2.98 | 3 | 5.40 | 0.11 |
| 5 | 5.93 | 3.55 | 4 | 5.90 | 0.03 |
| 6 | 5.39 | 3.25 | 3 | 5.65 | -0.27 |
| 7 | 5.61 | 3.06 | 2 | 5.51 | 0.10 |
| 8 | 5.38 | 3.12 | 2 | 5.56 | -0.18 |
| 9 | 6.28 | 3.49 | 4 | 5.84 | 0.44 |
| 10 | 5.85 | 3.53 | 3 | 5.91 | -0.06 |
Compute the root mean square error using this output.
# Make predictions on test set
get_regression_points(train_model_price_2,
newdata = test) %>%
mutate(sq_residuals = residual ^ 2) %>%
summarise(rmse = sqrt(mean(sq_residuals))) %>%
kable(digits = 3,
align = "l",
caption = "RMSE for Model 2")
| rmse |
|---|
| 0.167 |
Your RMSE using size and the number of bedrooms as predictor variables is 0.167, which is higher than the 0.164 you obtained using size and yr_built. It seems the latter is marginally better!
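As noted at the start of this section, this train/test idea forms the backbone of cross-validation. Below is a minimal sketch of 5-fold cross-validation for Model 1; this goes beyond what the course covers, and the fold assignment and loop shown are just one of several reasonable ways to implement it.
# A minimal sketch of 5-fold cross-validation for Model 1
set.seed(76)
k <- 5

# Randomly assign each house to one of k folds
house_prices_cv <- house_prices %>%
  mutate(fold = sample(rep(1:k, length.out = n())))

rmse_per_fold <- numeric(k)
for (i in 1:k) {
  cv_train <- house_prices_cv %>% filter(fold != i)
  cv_test <- house_prices_cv %>% filter(fold == i)
  cv_model <- lm(log10_price ~ log10_size + yr_built, data = cv_train)
  cv_preds <- predict(cv_model, newdata = cv_test)
  rmse_per_fold[i] <- sqrt(mean((cv_test$log10_price - cv_preds) ^ 2))
}

# Average RMSE across the 5 held-out folds
mean(rmse_per_fold)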
Congratulations on completing this course. In this course we leveraged the data science toolbox you developed in previous courses to perform exploratory data analysis, fit both explanatory and predictive models and study methods for model assessment and selection. But where can we go from here?
First, a great way to learn how to code in a new language is to find code that you know works, then copy it, paste it, and tweak it to serve your goals. To help facilitate your new R modeling skills using the Tidyverse, we have included all the R source code used for the videos in this course on GitHub.
Second, here are links to other Datacamp courses that use the Tidyverse:
Being an effective data scientist requires you to develop a wide array of tools for your data science toolbox. And a lot of practice, practice, practice! These courses will help in this journey.
\[ y=f(\vec{x})+\epsilon \] Perhaps the theory behind modeling interests you more. For example, recall our general modeling framework which, at its heart, has a function \(f()\) making explicit the relationship between \(y\) and \(x\). We kept things simple and only studied models where \(f()\) was linear:
\[ y=\beta_0+\beta_1\cdot{x_1}+\epsilon \]
But by no means is one restricted to such models. What do we mean?
Recall our parallel slopes model for house price as a function of size and condition.
But why restrict ourselves to parallel lines?
Here we have something known as a polynomial model, where we allow for curvature by incorporating log10_size squared as an explanatory or predictor variable. This gives our model more flexibility.
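A minimal sketch of how such a model could be fit; this exact model is not fit in the course, and I() is base R’s way of including a squared term in a regression formula.
# A polynomial model: allow curvature in size by adding a squared term
model_price_poly <- lm(log10_price ~ log10_size + I(log10_size ^ 2) + condition,
                       data = house_prices)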
Furthermore, we are not restricted to models based on lines either. Yet another form of model is the tree. Tree models are a form of triage: you start at the top of the tree and, based on answers to true/false questions, progress down its branches, going left if the answer is true and right if the answer is false.
For example, say a house has a log10 size equal to 3.2. Since \(3.2 \lt 3.387\) is true, you first go down the left branch. Next, since \(3.2 \lt 3.184\) is false, you then go down the right branch. This model’s fitted or predicted value of the house’s log10 price is 5.642, or about $438,000. Repeating this triage for all 21,613 houses, there are 8,875 houses that fall into this final branch.
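A minimal sketch of fitting such a tree with the rpart package; rpart is not used elsewhere in this course, and the exact split values, such as 3.387 and 3.184, depend on the fitted tree.
# A regression tree for log10 price as a function of log10 size
library(rpart)
tree_model <- rpart(log10_price ~ log10_size, data = house_prices)

# Fitted/predicted log10 price for a house with log10 size of 3.2
predict(tree_model, newdata = data.frame(log10_size = 3.2))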
We have only scratched the surface of possible other models to consider. Here are some other Datacamp courses you can take that consider more complex, but also potentially more powerful, models:
# Fit model
model_score_1 <- lm(score ~ age,
data = evals)
# Output regression table
get_regression_table(model_score_1) %>%
kable(digits = 3,
align = "l",
caption = "Example of Regression Table")
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 4.462 | 0.127 | 35.195 | 0.000 | 4.213 | 4.711 |
| age | -0.006 | 0.003 | -2.311 | 0.021 | -0.011 | -0.001 |
Recall the regression table of the model of teaching score as a function of age. We only looked at values in the estimate column like the negative slope for age of -0.006, suggesting that as professors age, they tend to have lower teaching scores.
But what do the other columns tell us? They speak to the statistical significance of our results. For example, can we conclusively say that age and score are negatively related for all instructors? Or was this relationship just a fluke occurrence for these 463 instructors? How would these results differ if we selected 463 different instructors? To be able to answer questions like these, we need to understand statistical inference.
If you are interested in statistical inference, we suggest you check out ModernDive, an electronic textbook that Chester Ismay and I co-authored. ModernDive uses the same Tidyverse tools as in this course, expands on the regression models from this course and others, uses the evals and house_prices datasets (and more) all towards the goal of learning statistical inference via data science.