Course overview
Introduction to modeling: theory and terminology
Regression: modeling where the outcome variable \(y\) is numerical
General modeling framework formula
\(y = f(\vec{x}) + \epsilon\)
Where:
- \(y\) is the outcome variable of interest
- \(\vec{x}\) is a vector of explanatory/predictor variables
- \(f()\) is the systematic relationship between \(\vec{x}\) and \(y\)
- \(\epsilon\) is the unsystematic error term
Two modeling scenarios
Modeling for either:
- Explanation: quantifying the relationship between the outcome variable \(y\) and the explanatory variables \(\vec{x}\)
- Prediction: predicting the outcome variable \(y\) as accurately as possible for new observations
Exploratory data analysis
Three basic steps to exploratory data analysis (EDA):
- Looking at the raw data values
- Computing summary statistics
- Creating visualizations
Exploratory visualization of age
Let’s perform an exploratory data analysis (EDA) of the numerical explanatory variable age. You should always perform an exploratory analysis of your variables before any formal modeling. This will give you a sense of your variable’s distribution, any outliers, and any patterns that might be useful when constructing your eventual model.
# Load packages
library(moderndive)
library(ggplot2)
# Plot the histogram
ggplot(evals, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  labs(x = "age", y = "count")
Nice! The 463 instructors’ ages are centered roughly at age 50. You’ll now compute notions of center numerically!
Numerical summaries of age
Let’s continue our exploratory data analysis of the numerical explanatory variable age by computing summary statistics. Summary statistics take many values and summarize them with a single value. Let’s compute three such values using dplyr data wrangling: mean (AKA the average), the median (the middle value), and the standard deviation (a measure of spread/variation).
# Load packages
library(moderndive)
library(dplyr)
# Compute summary stats
evals %>%
  summarize(mean_age = mean(age),
            median_age = median(age),
            sd_age = sd(age))
## # A tibble: 1 x 3
## mean_age median_age sd_age
## <dbl> <int> <dbl>
## 1 48.4 48 9.80
Great! As suggested by the previous histogram of age, the center of the distribution, as quantified by the mean and median, is around 48 years old!
Comparing before and after log10-transformation
Exploratory visualization of house size
Let’s create an exploratory visualization of the predictor variable reflecting the size of houses: sqft_living, the square footage of living space (1 sq. foot ≈ 0.1 sq. meter).
After plotting the histogram, what can you say about the distribution of the variable sqft_living?
# Load packages
library(moderndive)
library(ggplot2)
# Plot the histogram
ggplot(house_prices, aes(x = sqft_living)) +
  geom_histogram() +
  labs(x = "Size (sq.feet)", y = "count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A variable is right-skewed if it has a long tail to the right.
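As a quick numerical companion to the histogram (a sketch, not part of the original exercise): for a right-skewed variable, the long tail typically pulls the mean above the median, which you can verify directly.
# Load packages
library(moderndive)
library(dplyr)
# For right-skewed data, the mean typically exceeds the median
house_prices %>%
  summarize(mean_size = mean(sqft_living), median_size = median(sqft_living))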
Log10 transformation of house size
You just saw that the predictor variable sqft_living is right-skewed and hence a log base 10 transformation is warranted to unskew it. Just as we transformed the outcome variable price to create log10_price in the video, let’s do the same for sqft_living.
# Load packages
library(moderndive)
library(dplyr)
library(ggplot2)
# Add log10_size
house_prices_2 <- house_prices %>%
  mutate(log10_size = log10(sqft_living))
# Plot the histogram
ggplot(house_prices_2, aes(x = log10_size)) +
  geom_histogram() +
  labs(x = "log10 size", y = "count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Huzzah! Notice how the distribution is much less skewed. Going forward, you’ll use this new transformed variable to represent the size of houses.
EDA of relationship of teaching & “beauty” scores
Researchers at UT Austin created a “beauty score” by asking a panel of 6 students to rate the “beauty” of all 463 instructors. They were interested in studying any possible impact of “beauty” on teaching evaluation scores. Let’s do an EDA of this variable and its relationship with teaching score.
From now on, assume that ggplot2, dplyr, and moderndive are all available in your workspace unless you’re told otherwise.
# Plot the histogram
ggplot(evals, aes(x = bty_avg)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Beauty score", y = "count")
# Scatterplot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "teaching score")
# Jitter plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_jitter() +
  labs(x = "beauty score", y = "teaching score")
“Beauty”-ful! It seems the original scatterplot did suffer from overplotting, since the jittered scatterplot reveals many originally hidden points. Most bty_avg scores range from 2 to 8, with 5 being about the center.
Correlation between teaching and “beauty” scores
Let’s numerically summarize the relationship between teaching score and beauty score bty_avg using the correlation coefficient. Based on this, what can you say about the relationship between these two variables?
# Compute correlation
evals %>%
  summarize(correlation = cor(score, bty_avg))
## # A tibble: 1 x 1
## correlation
## <dbl>
## 1 0.187
While there seems to be a positive relationship, +0.187 is still a long way from +1, so the correlation is only weakly positive.
Difference between explanation and prediction
EDA of relationship of house price and waterfront
Let’s now perform an exploratory data analysis of the relationship between log10_price, the log base 10 house price, and the binary variable waterfront. Let’s look at the raw values of waterfront and then visualize their relationship.
The column log10_price has been added for you in the house_prices dataset.
house_prices_2 <- house_prices %>%
  mutate(log10_price = log10(price), log10_size = log10(sqft_living))
house_prices <- house_prices_2
# View the structure of log10_price and waterfront
house_prices %>%
  select(log10_price, waterfront) %>%
  glimpse()
## Rows: 21,613
## Columns: 2
## $ log10_price <dbl> 5.346157, 5.730782, 5.255273, 5.781037, 5.707570, 6.088...
## $ waterfront <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
# Plot
ggplot(house_prices, aes(x = waterfront, y = log10_price)) +
  geom_boxplot() +
  labs(x = "waterfront", y = "log10 price")
A+! Look at that boxplot! Houses that have a view of the waterfront tend to be MUCH more expensive as evidenced by the much higher log10 prices!
Predicting house price with waterfront
You just saw that houses with a view of the waterfront tend to be much more expensive. But by how much? Let’s compute group means of log10_price, convert them back to dollar units, and compare!
The variable log10_price has already been added to house_prices for you.
# Calculate stats
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_log10_price = mean(log10_price), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## waterfront mean_log10_price n
## <lgl> <dbl> <int>
## 1 FALSE 5.66 21450
## 2 TRUE 6.12 163
# Prediction of price for houses with view
10^(6.12)
## [1] 1318257
# Prediction of price for houses without view
10^(5.66)
## [1] 457088.2
100%! Most houses don’t have a view of the waterfront (n = 21,450), but those that do (n = 163) have a MUCH higher predicted price. Look at that difference! $457,088 versus $1,318,257! In the upcoming Chapter 2 on basic regression, we’ll build on such intuition and construct our first formal explanatory and predictive models using basic regression!
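As a side note, here is a sketch of how the same back-transformed group means could be computed in a single pipeline (the column name mean_price is my own):
# Compute group means and back-transform to dollars in one pipeline
house_prices %>%
  group_by(waterfront) %>%
  summarize(mean_log10_price = mean(log10_price)) %>%
  mutate(mean_price = 10^mean_log10_price)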
Plotting a “best-fitting” regression line
Previously you visualized the relationship of teaching score and “beauty score” via a scatterplot. Now let’s add the “best-fitting” regression line to provide a sense of any overall trends. Even though you know this plot suffers from overplotting, you’ll stick to the non-jitter version.
# Load packages
library(ggplot2)
library(dplyr)
library(moderndive)
# Plot
ggplot(evals, aes(x = bty_avg, y = score)) +
  geom_point() +
  labs(x = "beauty score", y = "score") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Fantastic! The overall trend seems to be positive: instructors with higher “beauty” scores also tend to have higher teaching scores.
Fitting a regression with a numerical x
Let’s now explicitly quantify the linear relationship between score and bty_avg using linear regression. You will do this by first “fitting” the model. Then you will get the regression table, a standard output in many statistical software packages. Finally, based on the output of get_regression_table(), which interpretation of the slope coefficient is correct?
# Load package
library(moderndive)
# Fit model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Output content
model_score_2
##
## Call:
## lm(formula = score ~ bty_avg, data = evals)
##
## Coefficients:
## (Intercept) bty_avg
## 3.88034 0.06664
# Output regression table
get_regression_table(model_score_2, print = TRUE)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 3.880 | 0.076 | 50.961 | 0 | 3.731 | 4.030 |
| bty_avg | 0.067 | 0.016 | 4.090 | 0 | 0.035 | 0.099 |
As suggested by your exploratory visualization, there is a positive relationship between “beauty” and teaching score.
Making predictions using “beauty score”
Say there is an instructor at UT Austin and you know nothing about them except that their beauty score is 5. What is your prediction \(\hat{y}\) of their teaching score \(y\)?
get_regression_table(model_score_2)
## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    3.88      0.076     51.0        0    3.73     4.03
## 2 bty_avg      0.067     0.016      4.09       0    0.035    0.099
# Use fitted intercept and slope to get a prediction
y_hat <- 3.88 + 5 * 0.0670
y_hat
## [1] 4.215
# Compute residual y - y_hat
4.7 - 4.215
## [1] 0.485
Awesome! Was your visual guess close to the predicted teaching score of 4.215? Also, note that this prediction is off by about 0.485 units in teaching score.
Computing fitted/predicted values & residuals
Now say you want to repeat this for all 463 instructors in evals. Doing this manually as you just did would be long and tedious, so as seen in the video, let’s automate this using the get_regression_points() function.
Furthermore, let’s unpack its output.
# Fit regression model
model_score_2 <- lm(score ~ bty_avg, data = evals)
# Get regression table
get_regression_table(model_score_2)
## # A tibble: 2 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 3.88 0.076 51.0 0 3.73 4.03
## 2 bty_avg 0.067 0.016 4.09 0 0.035 0.099
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2)
## # A tibble: 463 x 5
## ID score bty_avg score_hat residual
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 4.7 5 4.21 0.486
## 2 2 4.1 5 4.21 -0.114
## 3 3 3.9 5 4.21 -0.314
## 4 4 4.8 5 4.21 0.586
## 5 5 4.6 3 4.08 0.52
## 6 6 4.3 3 4.08 0.22
## 7 7 2.8 3 4.08 -1.28
## 8 8 4.1 3.33 4.10 -0.002
## 9 9 3.4 3.33 4.10 -0.702
## 10 10 4.5 3.17 4.09 0.409
## # ... with 453 more rows
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>%
  mutate(score_hat_2 = 3.88 + 0.067 * bty_avg)
## # A tibble: 463 x 6
## ID score bty_avg score_hat residual score_hat_2
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4.7 5 4.21 0.486 4.22
## 2 2 4.1 5 4.21 -0.114 4.22
## 3 3 3.9 5 4.21 -0.314 4.22
## 4 4 4.8 5 4.21 0.586 4.22
## 5 5 4.6 3 4.08 0.52 4.08
## 6 6 4.3 3 4.08 0.22 4.08
## 7 7 2.8 3 4.08 -1.28 4.08
## 8 8 4.1 3.33 4.10 -0.002 4.10
## 9 9 3.4 3.33 4.10 -0.702 4.10
## 10 10 4.5 3.17 4.09 0.409 4.09
## # ... with 453 more rows
# Get all fitted/predicted values and residuals
get_regression_points(model_score_2) %>%
  mutate(residual_2 = score - score_hat)
## # A tibble: 463 x 6
## ID score bty_avg score_hat residual residual_2
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4.7 5 4.21 0.486 0.486
## 2 2 4.1 5 4.21 -0.114 -0.114
## 3 3 3.9 5 4.21 -0.314 -0.314
## 4 4 4.8 5 4.21 0.586 0.586
## 5 5 4.6 3 4.08 0.52 0.520
## 6 6 4.3 3 4.08 0.22 0.220
## 7 7 2.8 3 4.08 -1.28 -1.28
## 8 8 4.1 3.33 4.10 -0.002 -0.002
## 9 9 3.4 3.33 4.10 -0.702 -0.702
## 10 10 4.5 3.17 4.09 0.409 0.409
## # ... with 453 more rows
Bingo! You’ll see later that the residuals can provide useful information about the quality of your regression models. Stay tuned!
EDA of relationship of score and rank
Let’s perform an EDA of the relationship between an instructor’s score and their rank in the evals dataset. You’ll both visualize this relationship and compute summary statistics for each level of rank: teaching, tenure track, and tenured.
ggplot(evals, aes(x = rank, y = score)) +
  geom_boxplot() +
  labs(x = "rank", y = "score")
evals %>%
  group_by(rank) %>%
  summarize(n = n(), mean_score = mean(score), sd_score = sd(score))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
## rank n mean_score sd_score
## <fct> <int> <dbl> <dbl>
## 1 teaching 102 4.28 0.498
## 2 tenure track 108 4.15 0.561
## 3 tenured 253 4.14 0.550
Cool! The boxplot and summary statistics suggest that instructors with rank “teaching” get the highest scores, while tenured professors get the lowest. However, there is clearly variation around the respective means.
Fitting a regression with a categorical x
You’ll now fit a regression model with the categorical variable rank as the explanatory variable and interpret the values in the resulting regression table. Note that here the rank “teaching” is treated as the baseline group for comparison with the “tenure track” and “tenured” groups.
# Fit regression model
model_score_4 <- lm(score ~ rank, data = evals)
# Get regression table
get_regression_table(model_score_4)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 4.28 0.054 79.9 0 4.18 4.39
## 2 ranktenure track -0.13 0.075 -1.73 0.084 -0.277 0.017
## 3 ranktenured -0.145 0.064 -2.28 0.023 -0.27 -0.02
# teaching mean
teaching_mean <- 4.28
# tenure track mean
tenure_track_mean <- 4.28 - 0.13
# tenured mean
tenured_mean <- 4.28 - 0.145
Kudos! Remember that regressions with a categorical variable return group means expressed relative to a baseline for comparison!
Regression tables for categorical explanatory variables show differences in means relative to a baseline.
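To tie this back to the earlier EDA: the baseline-plus-offset values should reproduce, up to rounding, the group means you computed with group_by() and summarize(). A quick sketch to check:
# Compare baseline + offsets against the EDA group means
# (the EDA gave mean scores of 4.28, 4.15, and 4.14, respectively)
c(teaching = teaching_mean,
  tenure_track = tenure_track_mean,
  tenured = tenured_mean)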
Visualizing the distribution of residuals
Let’s now compute both the predicted score \(\hat{y}\) and the residual \(y-\hat{y}\) for all \(n = 463\) instructors in the evals dataset. Furthermore, you’ll plot a histogram of the residuals and see if there are any patterns to the residuals, i.e. your predictive errors.
model_score_4 from the previous exercise is available in your workspace.
# Calculate predictions and residuals
model_score_4_points <- get_regression_points(model_score_4)
model_score_4_points
## # A tibble: 463 x 5
## ID score rank score_hat residual
## <int> <dbl> <fct> <dbl> <dbl>
## 1 1 4.7 tenure track 4.16 0.545
## 2 2 4.1 tenure track 4.16 -0.055
## 3 3 3.9 tenure track 4.16 -0.255
## 4 4 4.8 tenure track 4.16 0.645
## 5 5 4.6 tenured 4.14 0.461
## 6 6 4.3 tenured 4.14 0.161
## 7 7 2.8 tenured 4.14 -1.34
## 8 8 4.1 tenured 4.14 -0.039
## 9 9 3.4 tenured 4.14 -0.739
## 10 10 4.5 tenured 4.14 0.361
## # ... with 453 more rows
# Plot residuals
ggplot(model_score_4_points, aes(x = residual)) +
  geom_histogram() +
  labs(x = "residuals", title = "Residuals from score ~ rank model")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Congrats! Look at the distribution of the residuals. While there seem to be fewer negative residuals (i.e., overpredictions of score), their magnitudes tend to be larger, ranging all the way down to -2.
EDA of relationship of price and bedrooms
Unfortunately, making 3D scatterplots to perform an EDA is beyond the scope of this course. So instead let’s focus on making standard 2D scatterplots of the relationship between price and the number of bedrooms, keeping an eye out for outliers.
The log10 transformations have been made for you and are saved in house_prices.
# Create scatterplot with regression line
ggplot(house_prices, aes(x = bedrooms, y = log10_price)) +
  geom_point() +
  labs(x = "Number of bedrooms", y = "log10 price") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Remove outlier
house_prices_transform <- house_prices %>%
  filter(bedrooms < 20)
# Create scatterplot with regression line
ggplot(house_prices_transform, aes(x = bedrooms, y = log10_price)) +
  geom_point() +
  labs(x = "Number of bedrooms", y = "log10 price") +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Excellent! Another important reason to perform EDA is to discard any potential outliers that are likely data entry errors. In our case, after removing an outlier, you can see a clear positive relationship between the number of bedrooms and price, as one would expect.
Fitting a regression
house_prices, which is available in your environment, has the log base 10 transformed variables included and the outlier house with 33 bedrooms removed. Let’s fit a multiple regression model of price as a function of size and the number of bedrooms and generate the regression table. You will first fit the model; then, based on the regression table, you will answer the following question:
Which of these interpretations of the slope coefficient for bedrooms is correct?
# Fit model
model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
                    data = house_prices)
# Get regression table
get_regression_table(model_price_2)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 2.72 0.023 118. 0 2.67 2.76
## 2 log10_size 0.931 0.008 118. 0 0.916 0.947
## 3 bedrooms -0.03 0.002 -19.3 0 -0.033 -0.027
Accounting for log10_size, every extra bedroom is associated with a decrease of, on average, 0.03 units in log10_price.
Splendid! In this multiple regression setting, the associated effect of any variable must be viewed in light of the other variables in the model. In our case, accounting for the size of the house reverses the relationship of the number of bedrooms and price from positive to negative!
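To see the sign flip concretely, here is a sketch (the model name model_price_bedrooms is my own) fitting bedrooms as the sole predictor; consistent with the upward-sloping scatterplot above, its slope should be positive:
# With bedrooms alone, the slope is positive,
# matching the earlier scatterplot's upward trend
model_price_bedrooms <- lm(log10_price ~ bedrooms, data = house_prices)
coef(model_price_bedrooms)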
Making predictions using size and bedrooms
Say you want to predict the price of a house using this model, and you know it has:
- 1000 square feet of living space (so log10_size = 3)
- 3 bedrooms
What is your prediction, both in log10 dollars and then in dollars?
The regression model from the previous exercise is available in your workspace as model_price_2.
get_regression_table(model_price_2)
## # A tibble: 3 x 7
##   term       estimate std_error statistic p_value lower_ci upper_ci
##   <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept     2.69      0.023      116.       0    2.65     2.74
## 2 log10_size    0.941     0.008      118.       0    0.925    0.957
## 3 bedrooms     -0.033     0.002      -20.5      0   -0.036   -0.03
# Make prediction in log10 dollars
2.69 + 0.941 * log10(1000) - 0.033 * 3
## [1] 5.414
# Make prediction in dollars
10^5.414
## [1] 259417.9
Spot on! Using the values in the regression table you can make predictions of house prices! In this case, your prediction is about $260,000. Let’s now apply this procedure to all 21k houses!
Interpreting residuals
Let’s automate this process for all 21K rows in house_prices to obtain residuals, which you’ll use to compute the sum of squared residuals: a measure of the lack of fit of a model. After computing the sum of squared residuals, you will answer the following question:
Which of these statements about residuals is correct?
# Automate prediction and residual computation
get_regression_points(model_price_2)
## # A tibble: 21,613 x 6
## ID log10_price log10_size bedrooms log10_price_hat residual
## <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 5.35 3.07 3 5.48 -0.139
## 2 2 5.73 3.41 3 5.8 -0.069
## 3 3 5.26 2.89 2 5.34 -0.087
## 4 4 5.78 3.29 4 5.66 0.121
## 5 5 5.71 3.22 3 5.63 0.08
## 6 6 6.09 3.73 4 6.07 0.017
## 7 7 5.41 3.23 3 5.64 -0.225
## 8 8 5.46 3.02 3 5.44 0.024
## 9 9 5.36 3.25 3 5.65 -0.290
## 10 10 5.51 3.28 3 5.68 -0.166
## # ... with 21,603 more rows
# Automate prediction and residual computation
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(sum_sq_residuals = sum(sq_residuals))
## # A tibble: 1 x 1
## sum_sq_residuals
## <dbl>
## 1 605.
Residual does suggest ‘leftovers’, but not in the sense that they are leftover points.
Parallel slopes model
Let’s now fit a “parallel slopes” model with the numerical explanatory/predictor variable log10_size and the categorical, in this case binary, variable waterfront. The corresponding visualization shows two regression lines with different intercepts but the same slope; a sketch to recreate it follows.
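A minimal sketch to recreate that figure, assuming a recent version of moderndive that provides geom_parallel_slopes():
# Sketch of the parallel slopes visualization
library(ggplot2)
library(moderndive)
ggplot(house_prices, aes(x = log10_size, y = log10_price, color = waterfront)) +
  geom_point(alpha = 0.05) +
  geom_parallel_slopes(se = FALSE) +
  labs(x = "log10 size", y = "log10 price")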
# Fit model
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)
# Get regression table
get_regression_table(model_price_4)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 2.96 0.02 146. 0 2.92 3.00
## 2 log10_size 0.825 0.006 134. 0 0.813 0.837
## 3 waterfrontTRUE 0.322 0.013 24.5 0 0.296 0.348
Success! Notice how the regression table has three rows: intercept, the slope for log10_size, and an offset for houses that do have a waterfront.
0.322 is the offset in the intercept for houses with a view of the waterfront relative to those that don’t.
Making predictions using size and waterfront
Using your model for log10_price as a function of log10_size and the binary variable waterfront, let’s make some predictions! Say you have the following two “new” houses; what would you predict their prices to be in dollars?
- House A: log10_size = 2.9, with a view of the waterfront
- House B: log10_size = 3.1, without a view of the waterfront
# Get regression table
get_regression_table(model_price_4)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 2.96 0.02 146. 0 2.92 3.00
## 2 log10_size 0.825 0.006 134. 0 0.813 0.837
## 3 waterfrontTRUE 0.322 0.013 24.5 0 0.296 0.348
# Prediction for House A
10^(2.96 + 0.825 * 2.9 + 0.322)
## [1] 472606.8
# Prediction for House B
10^(2.96 + 0.825 * 3.1)
## [1] 329230.5
Yay! Your modeling toolbox is getting quite extensive! Let’s now automate this!
Automating predictions on “new” houses
Let’s now repeat what you did in the last exercise, but in an automated fashion, assuming the information on these “new” houses is saved in a data frame.
Your model for log10_price as a function of log10_size and the binary variable waterfront (model_price_4) is available in your workspace, and so is new_houses_2, a data frame with data on 2 new houses. While not so beneficial with only 2 “new” houses, this approach would save a lot of work if you had 2,000 of them.
new_houses_2 <- tibble(log10_size = c(2.9, 3.1), waterfront = c(TRUE, FALSE))
# View the "new" houses
new_houses_2
## # A tibble: 2 x 2
## log10_size waterfront
## <dbl> <lgl>
## 1 2.9 TRUE
## 2 3.1 FALSE
# Get predictions on "new" houses
get_regression_points(model_price_4, newdata = new_houses_2)
## # A tibble: 2 x 4
## ID log10_size waterfront log10_price_hat
## <int> <dbl> <lgl> <dbl>
## 1 1 2.9 TRUE 5.67
## 2 2 3.1 FALSE 5.52
# Get predictions price_hat in dollars on "new" houses
get_regression_points(model_price_4, newdata = new_houses_2) %>%
  mutate(price_hat = 10^log10_price_hat)
## # A tibble: 2 x 5
## ID log10_size waterfront log10_price_hat price_hat
## <int> <dbl> <lgl> <dbl> <dbl>
## 1 1 2.9 TRUE 5.67 472063.
## 2 2 3.1 FALSE 5.52 328095.
Predictions of $472,000 and $328,000! Exceptional! You’re done with the multiple regression chapter, and now you’re onto model assessment and selection!
Refresher: sum of squared residuals
Let’s review how to compute the sum of squared residuals. You’ll do this for two models.
# Model 2
model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
                    data = house_prices)
# Calculate squared residuals
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(sum_sq_residuals = sum(sq_residuals))
## # A tibble: 1 x 1
## sum_sq_residuals
## <dbl>
## 1 605.
# Model 4
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)
# Calculate squared residuals
get_regression_points(model_price_4) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(sum_sq_residuals = sum(sq_residuals))
## # A tibble: 1 x 1
## sum_sq_residuals
## <dbl>
## 1 599.
Wonderful! Let’s use these two measures of model assessment to choose between these two models, or in other words, perform model selection!
Which model to select?
Based on these two values of the sum of squared residuals, which of these two models do you think is “better”, and hence which would you select?
- model_price_2 that uses log10_size and bedrooms?
- model_price_4 that uses log10_size and waterfront?
Since model_price_4’s value was lower at 599, select this one.
Given the choice of just these two models, the evidence suggests that using size and waterfront yields a better fit, so you should choose this one!
Assessing model fit with R-squared
Computing the R-squared of a model
Let’s compute the \(R^2\) summary value for the model with two numerical explanatory/predictor variables that you fit in Chapter 3: price as a function of size and the number of bedrooms.
Recall that \(R^2\) can be calculated as:
\(R^2 = 1 - \frac{\text{Var}(\text{residuals})}{\text{Var}(y)}\)
# Fit model
model_price_2 <- lm(log10_price ~ log10_size + bedrooms,
                    data = house_prices)
# Get fitted values & residuals, compute R^2 using residuals
get_regression_points(model_price_2) %>%
  summarize(r_squared = 1 - var(residual) / var(log10_price))
## # A tibble: 1 x 1
## r_squared
## <dbl>
## 1 0.465
Nice job! You observed an R-squared value of 0.465, which means that 46.5% of the total variability of the outcome variable log base 10 price can be explained by this model.
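As a sanity check (a sketch, not part of the original exercise), base R reports the same quantity for models with an intercept:
# Cross-check via base R's summary()
summary(model_price_2)$r.squared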
Comparing the R-squared of two models
Let’s now compute \(R^2\) for the model with one numerical and one categorical explanatory/predictor variable that you fit in Chapter 3 (price as a function of size and whether the house has a view of the waterfront), and compare its \(R^2\) with the one you just computed.
# Fit model
model_price_4 <- lm(log10_price ~ log10_size + waterfront,
                    data = house_prices)
# Get fitted values & residuals, compute R^2 using residuals
get_regression_points(model_price_4) %>%
  summarize(r_squared = 1 - var(residual) / var(log10_price))
## # A tibble: 1 x 1
## r_squared
## <dbl>
## 1 0.470
Since model_price_4 had a higher \(R^2\) of 0.470, it “fit” the data better.
Since using waterfront explained a higher proportion of the total variance of the outcome variable than using the number of bedrooms, using waterfront in our model is preferred.
\(RMSE\): the typical prediction error the model will make, in the same units as the outcome variable \(y\).
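In formula form, for \(n\) observations:
\(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\)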
Computing the MSE & RMSE of a model
Just as you did earlier with \(R^2\), which is a measure of model fit, let’s now compute the root mean squared error (RMSE) of our models, a commonly used measure of predictive error. Let’s use the model of price as a function of size and the number of bedrooms.
The model is available in your workspace as model_price_2.
# Get all residuals, square them, and take mean
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals))
## # A tibble: 1 x 1
## mse
## <dbl>
## 1 0.0280
# Get all residuals, square them, take the mean and square root
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals)) %>%
  mutate(rmse = sqrt(mse))
## # A tibble: 1 x 2
## mse rmse
## <dbl> <dbl>
## 1 0.0280 0.167
Woo hoo! The RMSE is 0.167. You can think of this as the “typical” prediction error this model makes.
Comparing the RMSE of two models
As you did using the sum of squared residuals and \(R^2\), let’s once again assess and compare the quality of your two models using the root mean squared error (RMSE). Note that RMSE is more typically used in prediction settings than explanatory settings.
model_price_2 and model_price_4 are available in your workspace.
# MSE and RMSE for model_price_2
get_regression_points(model_price_2) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 2
## mse rmse
## <dbl> <dbl>
## 1 0.0280 0.167
# MSE and RMSE for model_price_4
get_regression_points(model_price_4) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(mse = mean(sq_residuals), rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 2
## mse rmse
## <dbl> <dbl>
## 1 0.0277 0.166
Since model_price_4 had a lower RMSE of 0.166, this suggests that it has better predictive power.
RMSE can be thought of as the ‘typical’ error a predictive model will make.
Validation set approach
Use two independent datasets to:
- Train/fit your model on a training set
- Evaluate your model’s predictive performance on a separate test set
Fitting model to training data
It’s time to split your data into a training set to fit a model and a separate test set to evaluate the predictive power of the model. Before making this split, however, we first sample 100% of the rows of house_prices without replacement and assign the result to house_prices_shuffled. This has the effect of “shuffling” the rows, thereby ensuring that the training and test sets are randomly sampled.
# Set random number generator seed value for reproducibility
set.seed(76)
# Randomly reorder the rows
house_prices_shuffled <- house_prices %>%
  sample_frac(size = 1, replace = FALSE)
# Train/test split
train <- house_prices_shuffled %>%
  slice(1:10000)
test <- house_prices_shuffled %>%
  slice(10001:21613)
# Fit model to training set
train_model_2 <- lm(log10_price ~ log10_size + bedrooms, data = train)
Fabulous! Since you’ve fit/trained the predictive model on the training set, let’s now apply it to the test set to make predictions!
Predicting on test data
Now that you’ve trained the model on the training set, let’s apply the model to the test data, make predictions, and evaluate them. Recall that having a separate test set here simulates gathering a “new” independent dataset on which to test our model’s predictive performance.
The datasets train and test, as well as the trained model train_model_2, are available in your workspace.
# Make predictions on test set
get_regression_points(train_model_2, newdata = test)
## # A tibble: 11,613 x 6
## ID log10_price log10_size bedrooms log10_price_hat residual
## <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 5.81 3.24 3 5.64 0.171
## 2 2 5.45 3.25 3 5.65 -0.204
## 3 3 5.69 3.38 3 5.78 -0.089
## 4 4 5.50 2.98 3 5.40 0.106
## 5 5 5.93 3.55 4 5.90 0.031
## 6 6 5.39 3.25 3 5.65 -0.267
## 7 7 5.61 3.06 2 5.51 0.101
## 8 8 5.38 3.12 2 5.56 -0.182
## 9 9 6.28 3.49 4 5.84 0.44
## 10 10 5.85 3.53 3 5.91 -0.058
## # ... with 11,603 more rows
# Compute RMSE
get_regression_points(train_model_2, newdata = test) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))
## # A tibble: 1 x 1
## rmse
## <dbl>
## 1 0.167
Magnificent! Your test-set RMSE using size and the number of bedrooms as predictor variables is 0.167, essentially matching the RMSE you computed on the full dataset earlier. This suggests the model’s predictive performance holds up on new data!
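As a natural follow-up (a sketch, not part of the original exercise), the same train/test procedure could be applied to the size and waterfront model so the two test-set RMSEs can be compared:
# Fit the size + waterfront model on the same training set
train_model_4 <- lm(log10_price ~ log10_size + waterfront, data = train)
# Compute its RMSE on the test set for comparison
get_regression_points(train_model_4, newdata = test) %>%
  mutate(sq_residuals = residual^2) %>%
  summarize(rmse = sqrt(mean(sq_residuals)))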