The Ames housing data

The Ames housing data set is an excellent resource for learning about models, and we will use it throughout this book. It contains data on 2930 properties in Ames, Iowa, including columns related to the characteristics of each house, its location, and its sale price.

Our goal for these data is to predict the sale price of a house based on its other characteristics.

The raw data are provided by the authors, but in our analyses in this book, we use a transformed version available in the modeldata package. This version has several changes and improvements to the data. For example, the longitude and latitude values have been determined for each property.

To load the data:

library(tidymodels)
## -- Attaching packages --------
## v broom     0.7.0      v recipes   0.1.13
## v dials     0.0.8      v rsample   0.0.7 
## v dplyr     1.0.0      v tibble    3.0.3 
## v ggplot2   3.3.2      v tidyr     1.1.0 
## v infer     0.5.3      v tune      0.1.1 
## v modeldata 0.0.2      v workflows 0.1.2 
## v parsnip   0.1.2      v yardstick 0.0.7 
## v purrr     0.3.4
## -- Conflicts -----------------
## x purrr::discard() masks scales::discard()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
## x recipes::step()  masks stats::step()
library(tidyverse)
## -- Attaching packages --------
## v readr   1.3.1     v forcats 0.5.0
## v stringr 1.4.0
## -- Conflicts -----------------
## x readr::col_factor() masks scales::col_factor()
## x purrr::discard()    masks scales::discard()
## x dplyr::filter()     masks stats::filter()
## x stringr::fixed()    masks recipes::fixed()
## x dplyr::lag()        masks stats::lag()
## x readr::spec()       masks yardstick::spec()
data(ames)

dim(ames)
## [1] 2930   74
theme_set(theme_light())

Exploring Important Features

It makes sense to start with the outcome we want to predict: the last sale price of the house (in USD):

ggplot(ames, aes(Sale_Price)) + geom_histogram(bins = 50)

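The histogram shows that the sale prices are right-skewed: there are more inexpensive houses than expensive ones. This suggests modeling the price on a log scale. While not perfect, a log transformation will probably result in better models than using the untransformed data.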

The downside to transforming the outcome is mostly related to interpretation.

The units of the model coefficients might be more difficult to interpret, as will measures of performance. For example, the root mean squared error (RMSE) is a common performance metric that is used in regression models. It uses the difference between the observed and predicted values in its calculations. If the sale price is on the log scale, these differences (i.e., the residuals) are also in log units. For this reason, it can be difficult to understand the quality of a model whose RMSE is 0.15 log units.
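One way to make a log-scale RMSE more tangible is to convert it back to a multiplicative factor on the original scale. The sketch below uses a handful of made-up observed and predicted prices (they are hypothetical and not taken from the Ames data) and assumes the base-10 log used in this chapter.

```r
# Hypothetical observed and predicted sale prices in USD (made up for illustration)
observed  <- c(120000, 185000, 250000, 410000)
predicted <- c(135000, 170000, 265000, 380000)

# Residuals and RMSE computed on the log10 scale
log_resid  <- log10(observed) - log10(predicted)
rmse_log10 <- sqrt(mean(log_resid^2))

# An error of d log10 units corresponds to a multiplicative error of 10^d
typical_factor <- 10^rmse_log10

# The RMSE of 0.15 log10 units mentioned in the text, as a factor
factor_for_0.15 <- 10^0.15  # about 1.41
```

Read this way, an RMSE of 0.15 log10 units means predictions are typically off by a factor of roughly 1.4 in either direction, which is a rough interpretation rather than an exact error bound.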

ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

Another important aspect of these data for our modeling is the geographic location of each property. This spatial information is contained in the data in two ways: a qualitative Neighborhood label as well as quantitative longitude and latitude data. To visualize the spatial information, let’s use both together to plot the data on a map and color by neighborhood:

ames %>%
  ggplot(aes(x = Longitude, y = Latitude, color = Neighborhood)) +
  geom_point() +
  theme(legend.position = "bottom")

Chapter Summary

This chapter introduced a data set used in later chapters to demonstrate tidymodels syntax and investigated some of its characteristics.

Spending our data

As data are reused for multiple tasks, certain risks increase, such as the risk of adding bias or of large effects from methodological errors.

If the initial pool of data available is not huge, there will be some overlap of how and when our data are ‘spent’ or allocated, and a solid methodology for data spending is important. This chapter demonstrates the basics of splitting our initial pool of samples for different purposes.

Common Methods for splitting data

  • The primary approach for empirical model validation is to split the existing pool of data into two distinct sets. Some observations are used to develop and optimize the model. This training set is usually the majority of the data. These data are a sandbox for model building where different models can be fit, feature engineering strategies are investigated, and so on. We as modeling practitioners spend the vast majority of the modeling process using the training set as the substrate to develop the model.

  • The other portion of the observations is placed into the test set. This is held in reserve until one or two models are chosen as the methods that are most likely to succeed. The test set is then used as the final arbiter to determine the efficacy of the model. It is critical to only look at the test set once; otherwise, it becomes part of the modeling process.

How should we conduct this split of the data? This depends on the context.

Suppose we allocate 80% of the data to the training set and the remaining 20% for testing. The most common method is to use simple random sampling.
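To make the idea concrete, here is a minimal base R sketch of an 80/20 simple random split. The data frame `toy` and its columns are hypothetical stand-ins, and this is only an illustration of the concept, not a description of how any particular package implements it.

```r
set.seed(123)

# Hypothetical data frame with 50 rows (made up for illustration)
toy <- data.frame(id = 1:50, y = rnorm(50))

# Number of rows to place in the training set
n_train <- floor(0.80 * nrow(toy))

# Draw the training rows at random, without replacement
train_rows <- sample(nrow(toy), size = n_train)

# Every row lands in exactly one of the two sets
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]
```

Because the rows are drawn without replacement and the test set is the complement, the two sets never share an observation.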

The rsample package has tools for making data splits such as this; the function initial_split() was created for this purpose. It takes the data frame as an argument as well as the proportion to be placed into training. Using the data frame produced by the code snippet from the summary in Section 4.2:

# Set the random number stream using `set.seed()` so that the
# results can be reproduced later

set.seed(123)

# Save the split information for an 80/20 split of the data;
# `prop` is the proportion of rows placed in the training set (default 3/4)

ames_split <- initial_split(ames, prop = 0.80)
ames_split
## <Analysis/Assess/Total>
## <2344/586/2930>

The printed information denotes the amount of data in the training set (n=2344), the amount in the test set (n=586), and the size of the original pool of samples (n=2930).

The object ames_split is an rsplit object and only contains the partitioning information; to get the resulting data sets, we apply two more functions:

ames_train <- training(ames_split)
ames_test <- testing(ames_split)

dim(ames_train)
## [1] 2344   74

Simple random sampling is appropriate in many cases, but there are exceptions. When there is a dramatic class imbalance in classification problems, one class occurs much less frequently than another. Using a simple random sample may haphazardly allocate these infrequent samples disproportionately into the training or test set. To avoid this, stratified sampling can be used. The training/test split is conducted separately within each class and then these subsamples are combined into the overall training and test set. For regression problems, the outcome data can be artificially binned into quartiles and then stratified sampling conducted four separate times. This is an effective method for keeping the distributions of the outcome similar between the training and test set.
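The quartile-based procedure for regression problems can be sketched by hand in base R. The data frame `toy` and its `price` column are hypothetical, and the code is a conceptual illustration of stratified sampling, not the implementation used by rsample.

```r
set.seed(123)

# Hypothetical right-skewed outcome (made up for illustration)
toy <- data.frame(id = 1:200, price = rlnorm(200, meanlog = 12, sdlog = 0.4))

# Bin the outcome into four groups using its quartiles
breaks  <- quantile(toy$price, probs = seq(0, 1, by = 0.25))
toy$bin <- cut(toy$price, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Within each bin, draw 80% of the rows for training
rows_by_bin <- split(seq_len(nrow(toy)), toy$bin)
train_rows <- unlist(lapply(rows_by_bin, function(rows) {
  rows[sample.int(length(rows), size = floor(0.80 * length(rows)))]
}))

# Pool the per-bin samples into the overall training and test sets
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Each quartile bin contributes equally to the training set
table(toy_train$bin)
```

Since every bin is split separately, the training and test sets both contain houses from each price range, including the most expensive quarter.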

[Figure: histogram of the sale price for the Ames housing data, with dotted vertical lines at the quartiles]

The distribution of the sale price outcome for the Ames housing data is shown. As previously discussed, the sale price distribution is right-skewed, with proportionally more inexpensive houses than expensive houses on either side of the center of the distribution. The worry here is that the more expensive houses would not be well represented in the training set with simple splitting; this would increase the risk that our model would be ineffective at predicting the price for such properties. The dotted vertical lines indicate the quartiles for these data. A stratified random sample would conduct the 80/20 split within each of these data subsets and then pool the results together. In rsample, this is achieved using the strata argument.

set.seed(123)

ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)

ames_train <- training(ames_split)
ames_test <- testing(ames_split)

Only a single column can be used for stratification.

There is very little downside to using stratified sampling.

Are there situations when random sampling is not the best choice? One case is when the data have a significant time component, such as time series data. Here it is more common to use the most recent data as a test set. The rsample package contains a function called initial_time_split() that is very similar to initial_split(). Instead of using random sampling, the prop argument denotes what proportion of the first part of the data should be used as the training set; the function assumes that the data have been pre-sorted in an appropriate order.
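A minimal sketch of a time-based split follows. The data frame `toy_ts` and its columns are hypothetical; the sketch assumes the rsample package is installed and sorts the rows by date first, since initial_time_split() relies on row order.

```r
library(rsample)

set.seed(123)

# Hypothetical daily series with 100 observations (made up for illustration)
toy_ts <- data.frame(
  day   = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)

# The function takes the first rows as training data, so sort by time first
toy_ts <- toy_ts[order(toy_ts$day), ]

# The first 80% of the rows become the training set, the rest the test set
ts_split <- initial_time_split(toy_ts, prop = 0.80)
ts_train <- training(ts_split)
ts_test  <- testing(ts_split)

# The training period ends before the test period begins
range(ts_train$day)
range(ts_test$day)
```

No random sampling is involved here, so the seed only affects the simulated values and not which rows end up in each set.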

What proportion should be used?

The proportion to use for splitting the data depends on the context of each problem.

What about a validation set?

Previously, when describing the goals of data splitting, we singled out the test set as the data that should be used to conduct a proper evaluation of model performance on the final model(s). This raises the question: “How can we tell what is best if we don’t measure performance until the test set?”

Chapter Summary

Data splitting is the fundamental tool for empirical validation of models. Even in the era of unrestrained data collection, a typical modeling project has a limited amount of appropriate data, and wise ‘spending’ of a project’s data is necessary. In this chapter, we discussed several strategies for partitioning the data into distinct groups for modeling and evaluation.

library(tidymodels)
data(ames)

ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

# Printing the split shows the <Analysis/Assess/Total> counts,
# which should be close to 80%/20% of the 2930 rows
ames_split