Instructions

This assignment reviews the Predicting Continuous Outcomes content. You will use the predict_continuous_outcomes.Rmd file I reviewed as part of the lecture for this class session to complete this assignment. You will copy and paste relevant code from that file and update it to answer the questions in this assignment. You will respond to questions in each section after executing relevant code to answer a question. You will submit this assignment to its Submissions folder on D2L. You will submit two files:

this completed R Markdown script, and
a HTML rendered version of it to D2L.

To start:

First, create a folder on your computer to save all relevant files for this course. If you did not do so already, you will want to create a folder named gsb_804 that contains all of the materials for this course.

Second, inside of gsb_804, you will create a folder to host assignments. You can name that folder assignments.

Third, inside of assignments, you will create folders for each assignment. You can name the folder for this first assignment: 02_predict_continuous_outcomes.

Fourth, create two additional folders in 02_predict_continuous_outcomes named scripts and data. Store this script in the scripts folder and the data for this assignment in the data folder. Create a plots folder as well.

Fifth, go to the File menu in RStudio, select New Project…, choose Existing Directory, go to your /gsb_804/assignments/02_predict_continuous_outcomes folder to select it as the top-level directory for this R Project.

Global Settings

The first code chunk sets the global settings for the remaining code chunks in the document. Do not change anything in this code chunk.

Activate Packages

In this code chunk, we load the following packages:

Make sure you installed these packages when you reviewed the analytical lecture.

We will use functions from these packages to examine the data. Do not change anything in this code chunk.

## here for project work flow
library(here)

## tidyverse for data manipulation and plotting;
## activates multiple packages simultaneously
library(tidyverse)

## skimr to summarize data
library(skimr)

## scales for variables scales
library(scales)

## snakecase for naming conventions
library(snakecase)

## corrr for correlation matrices
library(corrr)

## ggthemes for plot themes
library(ggthemes)

## tidymodels for modeling;
## activates multiple packages simultaneously
library(tidymodels)

## glmnet for elastic net models
library(glmnet)

## DBI for database interaction
library(DBI)

## RSQLite for SQLite databases
library(RSQLite)

## ranger for random forests
library(ranger)

## vip for variable importance
library(vip)

## tictoc for timing
library(tictoc)

## parallel for parallel processing
library(parallelly)

## doFuture for parallel processing
library(doFuture)

Task 1: Database Connection

In this task, you will establish a database connection to miami_houses.sqlite. After you connect to the database, you will work with the tables inside the database.

Task 1.1

Create a database connection object named houses_con by using dbConnect(). Inside of dbConnect(), set SQLite() as the database driver and use here() to navigate to the data folder of the project directory to locate the database file miami_houses.sqlite. Apply dbListTables() to houses_con.

Question 1.1: How many tables are in the database?

Response 1.1: Two tables

### import and save data as object
## use dbConnect() and SQLite() to import the data file
houses_con <- dbConnect(
  ## driver
  SQLite(),
  ## use here() to locate file in our project directory
  here(
    # folder
    "data", 
    # file
    "miami_houses.sqlite"
  )
)

### list tables
## call function
dbListTables(houses_con)

## [1] "distance" "overview"

Task 1.2

Perform these operations.

Pipe houses_con to tbl() and set “overview” as the input.

Pipe houses_con to tbl() and set “distance” as the input.

Question 1.2: Answer these questions: (1) What is the first value of SALE_PRC in the overview table? (2) What is the first value of RAIL_DIST in the distance table?

Response 1.2: (1) 440000 (2) 2816

### view table
## call connection
houses_con %>%
  ## preview table
  tbl(
    # table
    "overview"
  )

### view table
## call connection
houses_con %>%
  ## preview table
  tbl(
    # table
    "distance"
  )

Task 2: Clean Data

For this task, you will clean the data.

Task 2.1

In one chained command, create a data table named overview by performing the following operations:

pipe houses_con to tbl() and set “overview” as the input;
pipe the result to collect() to extract the table from the database;
change variable names to snake case using rename_with() and to_snake_case();
update the variable names of parcelno to parcel_id and avno_60_plus to airplane_noise using rename();
call mutate() to perform several additional operations;
update month_sold with month() and set label to TRUE;
convert airplane_noise to a nominal factor variable and recode its levels to “No” and “Yes” from “0” and “1”, respectively;
convert structure_quality to an ordered factor variable and recode its levels to “Very Bad”, “Bad”, “Okay”, “Good”, and “Very Good” from “1”, “2”, “3”, “4”, and “5”, respectively.

Examine overview with glimpse().

Question 2.1: Answer these questions: (1) How many observations are there in the overview table? (2) How many columns are there in the overview table? (3) How many ordered factor columns are there in the overview table?

Response 2.1: (1) 13776 Observations (2) 11 Columns (3) 2 (Months sold and Structure quality)

### export and clean table
## save
overview <- houses_con %>%
  ## preview table
  tbl(
    # table
    "overview"
  ) %>%
  ## export
  collect() %>%
  ## change variable names
  rename_with(
    # function
    .fn = to_snake_case
  ) %>% 
  ## update variable names
  rename(
    # new = old
    parcel_id = parcelno,
    # new = old
    airplane_noise = avno_60_plus
  ) %>%
  ## update
  mutate(
    # month
    month_sold = month(
      # column
      month_sold,
      # labels
      label = TRUE
    ),
    # convert to factor
    airplane_noise = as_factor(airplane_noise),
    # recode levels
    airplane_noise = fct_recode(
      # factor
      airplane_noise,
      # levels
      "No" = "0", "Yes" = "1"
    ),
    # convert to ordinal factor
    structure_quality = factor(
      # factor
      structure_quality,
      # ordered
      ordered = TRUE
    ),
    # recode levels
    structure_quality = fct_recode(
      # factor
      structure_quality,
      # levels
      "Very Bad" = "1", "Bad" = "2",
      "Okay" = "3", "Good" = "4",
      "Very Good" = "5"
    )
  )

## confirm changes
glimpse(overview)

## Rows: 13,776
## Columns: 11
## $ parcel_id         <dbl> 622280070620, 622280100460, 622280100470, 6222801005…
## $ latitude          <dbl> 25.89103, 25.89132, 25.89133, 25.89176, 25.89182, 25…
## $ longitude         <dbl> -80.16056, -80.15397, -80.15374, -80.15266, -80.1546…
## $ sale_prc          <dbl> 440000, 349000, 800000, 988000, 755000, 630000, 1020…
## $ lnd_sqfoot        <dbl> 9375, 9375, 9375, 12450, 12800, 9900, 10387, 10272, …
## $ tot_lvg_area      <dbl> 1753, 1715, 2276, 2058, 1684, 1531, 1753, 1663, 1493…
## $ spec_feat_val     <dbl> 0, 0, 49206, 10033, 16681, 2978, 23116, 34933, 11668…
## $ age               <dbl> 67, 63, 61, 63, 42, 41, 63, 21, 56, 63, 64, 51, 56, …
## $ airplane_noise    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
## $ month_sold        <ord> Aug, Sep, Feb, Sep, Jul, Feb, Feb, Sep, Mar, Nov, Fe…
## $ structure_quality <ord> Good, Good, Good, Good, Good, Good, Very Good, Good,…

Task 2.2

In one chained command, create a data table named distance by performing the following operations:

pipe houses_con to tbl() and set “distance” as the input;
pipe the result to collect() to extract the table from the database;
change variable names to snake case using rename_with() and to_snake_case();
update the variable names of parcelno to parcel_id, subcntr_di to subcenter_dist, and cntr_dist to center_dist using rename().

Examine distance with glimpse().

Question 2.2: Answer these questions: (1) How many observations are there in the distance table? (2) How many columns are there in the distance table?

Response 2.2: (1) 13,776 Observations (2) 7 columns

### export and clean table
## save
distance <- houses_con %>%
  ## preview table
  tbl(
    # table
    "distance"
  ) %>%
  ## export
  collect() %>%
  ## change variable names
  rename_with(
    # function
    .fn = to_snake_case
  ) %>% 
  ## update variable names
  rename(
    # new = old
    parcel_id = parcelno
  ) 

## confirm changes
glimpse(distance)

## Rows: 13,776
## Columns: 7
## $ parcel_id  <dbl> 622280070620, 622280100460, 622280100470, 622280100530, 622…
## $ rail_dist  <dbl> 2815.9, 4359.1, 4412.9, 4585.0, 4063.4, 2391.4, 3277.4, 311…
## $ ocean_dist <dbl> 12811.4, 10648.4, 10574.1, 10156.5, 10836.8, 13017.0, 11667…
## $ water_dist <dbl> 347.6, 337.8, 297.1, 0.0, 326.6, 188.9, 0.0, 10.5, 51.5, 9.…
## $ cntr_dist  <dbl> 42815.3, 43504.9, 43530.4, 43797.5, 43599.7, 43135.1, 43598…
## $ subcntr_di <dbl> 37742.2, 37340.5, 37328.7, 37423.2, 37550.8, 38176.2, 37973…
## $ hwy_dist   <dbl> 15954.9, 18125.0, 18200.5, 18514.4, 17903.4, 15687.2, 17068…

Task 3: Join Data

For this task, you will join data tables.

Task 3.1

Create a new data table named houses_joined. Pipe overview into left_join() and set distance as the additional data input and by to “parcel_id”. Preview houses_joined with glimpse().

Question 3.1: Answer these questions: (1) How many observations are there in the houses_joined table? (2) How many columns are there in the houses_joined table?

Response 3.1: (1) 13776 Observations (2) 17 columns

### join tables
## save
houses_joined <- overview %>%
  ## left join
  left_join(
    # data
    distance,
    # key
    by = "parcel_id"
  )

## preview table
glimpse(houses_joined)

## Rows: 13,776
## Columns: 17
## $ parcel_id         <dbl> 622280070620, 622280100460, 622280100470, 6222801005…
## $ latitude          <dbl> 25.89103, 25.89132, 25.89133, 25.89176, 25.89182, 25…
## $ longitude         <dbl> -80.16056, -80.15397, -80.15374, -80.15266, -80.1546…
## $ sale_prc          <dbl> 440000, 349000, 800000, 988000, 755000, 630000, 1020…
## $ lnd_sqfoot        <dbl> 9375, 9375, 9375, 12450, 12800, 9900, 10387, 10272, …
## $ tot_lvg_area      <dbl> 1753, 1715, 2276, 2058, 1684, 1531, 1753, 1663, 1493…
## $ spec_feat_val     <dbl> 0, 0, 49206, 10033, 16681, 2978, 23116, 34933, 11668…
## $ age               <dbl> 67, 63, 61, 63, 42, 41, 63, 21, 56, 63, 64, 51, 56, …
## $ airplane_noise    <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
## $ month_sold        <ord> Aug, Sep, Feb, Sep, Jul, Feb, Feb, Sep, Mar, Nov, Fe…
## $ structure_quality <ord> Good, Good, Good, Good, Good, Good, Very Good, Good,…
## $ rail_dist         <dbl> 2815.9, 4359.1, 4412.9, 4585.0, 4063.4, 2391.4, 3277…
## $ ocean_dist        <dbl> 12811.4, 10648.4, 10574.1, 10156.5, 10836.8, 13017.0…
## $ water_dist        <dbl> 347.6, 337.8, 297.1, 0.0, 326.6, 188.9, 0.0, 10.5, 5…
## $ cntr_dist         <dbl> 42815.3, 43504.9, 43530.4, 43797.5, 43599.7, 43135.1…
## $ subcntr_di        <dbl> 37742.2, 37340.5, 37328.7, 37423.2, 37550.8, 38176.2…
## $ hwy_dist          <dbl> 15954.9, 18125.0, 18200.5, 18514.4, 17903.4, 15687.2…

Task 4: Examine Data

For this task, you will examine the data.

Task 4.1

Summarize houses_joined by piping it into select() and excluding -parcel_id. Pipe the result to skim_without_charts() to view a summary of the variables. View the output to respond to questions.

Question 4.1: Answer these questions: (1) During what month were the most houses sold? (2) What was the median sale price of these Miami homes? (3) What is the average distance to the ocean (measured in feet) from these Miami homes?

Response 4.1: (1) June (1373) (2) 311100 (3) 31711

### overall summary
## call data
houses_joined %>%
  ## select columns
  select(
    # exclude
    -parcel_id
  ) %>%
  ## compute summary statistics
  skim_without_charts()

Data summary
Name	Piped data
Number of rows	13776
Number of columns	16
_______________________
Column type frequency:
factor	3
numeric	13
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
airplane_noise	1	FALSE	2	No: 13570, Yes: 206
month_sold	1	TRUE	12	Jun: 1373, Aug: 1256, May: 1233, Apr: 1228
structure_quality	1	TRUE	5	Goo: 7546, Bad: 4053, Ver: 1989, Ver: 172

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
latitude	1	25.73	0.14	25.43	25.62	25.73	25.85	25.97
longitude	1	-80.33	0.09	-80.54	-80.40	-80.34	-80.26	-80.12
sale_prc	1	400746.45	317868.54	72000.00	235000.00	311100.00	430000.00	2650000.00
lnd_sqfoot	1	8623.96	6063.47	1248.00	5400.00	7500.00	9144.25	57064.00
tot_lvg_area	1	2061.50	814.36	854.00	1472.00	1881.00	2474.25	6287.00
spec_feat_val	1	9604.75	13923.55	0.00	823.50	2784.00	12655.75	175020.00
age	1	30.59	21.12	0.00	14.00	26.00	46.00	96.00
rail_dist	1	8354.85	6175.57	10.50	3299.98	7124.95	12108.85	29621.50
ocean_dist	1	31710.96	17609.83	236.10	18080.07	28595.55	44358.30	75744.90
water_dist	1	11994.25	11941.12	0.00	2681.77	6979.95	19265.92	50399.80
cntr_dist	1	68625.84	31990.84	3825.60	42955.22	66039.80	89542.55	159976.50
subcntr_di	1	41149.98	22167.83	1462.80	24043.05	41156.40	53973.90	110553.80
hwy_dist	1	7731.19	6074.76	90.20	3000.48	6158.85	10873.57	48167.30

Task 4.2

Perform these two operations.

First, create a correlation table named corr_res by piping houses_joined to select() and selecting for all numeric variables excluding parcel_id. Pipe the result to correlate(), rearrange(), shave(). Pipe the result to stretch() and specify to remove missing values and duplicates. Pipe the result to mutate() and use across() to create factors of x and y and reverse their levels. Print corr_res to confirm it looks correct.

Second, create a plot named corr_plot that creates a heat map of the correlation table. Call ggplot() to specify the data and map the x-axis to x, the y-axis to y, and the fill to r. Add a geom_tile() layer and set color to “white”. Add a geom_text() layer to add the correlations rounded to two digits as text in the tiles. Add a scale_fill_gradient()2 layer and specify low to “red3”, middle to “white”, high to “deepskyblue3”, and other appropriate parameters. Add a labs() layer to remove any axes titles. Add a ggtitle() layer to add a title. Add a theme() layer to move the legend to the bottom, angle the x-axis labels to 45 degrees, se the panel background “gray20”, set the plot background to “gray100”, remove the panel grid, and center the title. Print the plot to view it.

Question 4.2: Answer these questions: (1) What is the correlation between sales price (sale_prc) and total living area (tot_lvg_area)? (2) What is the correlation between age of homes (age) and distance to center (center_dist)?

Response 4.2: (1) 0.67 (2) -0.55

### overall summary
## call data
houses_joined %>%
  ## select columns
  select(
    # exclude
    -parcel_id
  ) %>%
  ## compute summary statistics
  skim_without_charts()

Data summary
Name	Piped data
Number of rows	13776
Number of columns	16
_______________________
Column type frequency:
factor	3
numeric	13
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
airplane_noise	1	FALSE	2	No: 13570, Yes: 206
month_sold	1	TRUE	12	Jun: 1373, Aug: 1256, May: 1233, Apr: 1228
structure_quality	1	TRUE	5	Goo: 7546, Bad: 4053, Ver: 1989, Ver: 172

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
latitude	1	25.73	0.14	25.43	25.62	25.73	25.85	25.97
longitude	1	-80.33	0.09	-80.54	-80.40	-80.34	-80.26	-80.12
sale_prc	1	400746.45	317868.54	72000.00	235000.00	311100.00	430000.00	2650000.00
lnd_sqfoot	1	8623.96	6063.47	1248.00	5400.00	7500.00	9144.25	57064.00
tot_lvg_area	1	2061.50	814.36	854.00	1472.00	1881.00	2474.25	6287.00
spec_feat_val	1	9604.75	13923.55	0.00	823.50	2784.00	12655.75	175020.00
age	1	30.59	21.12	0.00	14.00	26.00	46.00	96.00
rail_dist	1	8354.85	6175.57	10.50	3299.98	7124.95	12108.85	29621.50
ocean_dist	1	31710.96	17609.83	236.10	18080.07	28595.55	44358.30	75744.90
water_dist	1	11994.25	11941.12	0.00	2681.77	6979.95	19265.92	50399.80
cntr_dist	1	68625.84	31990.84	3825.60	42955.22	66039.80	89542.55	159976.50
subcntr_di	1	41149.98	22167.83	1462.80	24043.05	41156.40	53973.90	110553.80
hwy_dist	1	7731.19	6074.76	90.20	3000.48	6158.85	10873.57	48167.30

### examine correlations 
## call data
corr_res <- houses_joined %>%
  ## select variables
  select(
    # numeric
    where(is.numeric),
    # exclude
    -parcel_id
  ) %>%
  ## correlation matrix
  correlate() %>%
  ## arrange values
  rearrange() %>%
  ## keep lower triangle
  shave() %>%
  ## pivot long
  stretch(
    # remove missing values
    na.rm = TRUE,
    # remove duplicates
    remove.dups = TRUE
  ) %>%
  ## update
  mutate(
    # apply function to columns
    across(
      # columns
      .cols = x:y,
      # function
      .fns = ~ 
        # convert to factor
        as_factor(.) %>%
          # reverse levels
          fct_rev()
    )
  )

## print
corr_res

### correlation heat map
## save
corr_plot <- ggplot(
  # data
  corr_res,
  # mapping
  aes(
    # x-axis
    x = x,
    # y-axis
    y = y,
    # fill
    fill = r
  )
) +
  ## tiles
  geom_tile(
    # border color
    color = "white"
  ) +
  ## text
  geom_text(
    # mapping
    aes(
      # labels
      label = number(
        # variable
        r,
        # decimals
        accuracy = 0.01
      )
    ),
    ## size
    size = 4, 
    ## bold
    fontface = "bold",
    ## color
    color = "black"
  ) +
  ## alter fill of tiles
  scale_fill_gradient2(
    # small distance
    low = "red1", 
    # medium distance
    mid = "white",
    # large distance
    high = "deepskyblue1", 
    # medium indicator                   
    midpoint = 0, 
    # limits
    limit = c(-1, 1), 
    # title
    name = "Correlation",
    # missing values
    na.value = "transparent"
  ) +
  ## labels
  labs(
    # x-axis
    x = NULL,
    # y-axis
    y = NULL
  ) +
  ## title of plot
  ggtitle(
    # title
    "Correlations"
  ) + 
  ## update theme
  theme(
    # legend at bottom
    legend.position = "bottom",
    # alter panel background
    panel.background = element_rect(
      # fill
      fill = "black"
    ),
    # remove gridlines
    panel.grid = element_blank(),
    # alter plot background
    plot.background = element_rect(
      # fill
      fill = "mintcream"      
    ),
    # adjust title position
    plot.title = element_text(
      # center
      hjust = 0.5
    ),
    # x-axis labels
    axis.text.x = element_text(
      # angle
      angle = 45,
      # justification
      vjust = 0.6
    )
  )

## show plot
corr_plot

Task 4.3

Create a plot named center_sale_plot to examine the relationship between the distance to the city center and sale price of homes. Call ggplot() and set houses_joined as the data input and map center_dist to the x-axis and sale_prc to the y-axis. Add a geom_point() layer with alpha set to 0.1. Add a geom_smooth() layer with method set to “loess”, formula set to “y ~ x”, and se set to FALSE. Use scale_x_continuous() to update the x-axis labels with label_number() and setting big.mark to “,”. Use scale_y_continuous() to update the y-axis labels with label_dollar() and setting scaleto 0.000001 and suffix to “M” and setting breaks to seq(0, 3000000, 500000). Adjust titles for the x and y axes to “Distance to Center (Feet)” and “Sale Price (Millions)” using labs(). Set the theme to theme_hc(). Display the plot.

Question 4.3: Is the relationship between distance to center and sale price negative or positive?

Response 4.3: It is a curve linear relationship: at short distances from center: as distance increases, price increases (positive relationship), at medium distances: Prices reach a peak and then at longer distances, as distance increases, price decreases (negative relationship).

### examine categorical features 
## call data and mapping
center_sale_plot <- ggplot(
  # data
  houses_joined,
  # mapping
  aes(
    # factor
    x = lnd_sqfoot, 
    # outcome
    y = sale_prc
  )
) +
  ## points
  geom_point(
    # transparency
    alpha = 0.3
  ) +
  ## smooth
  geom_smooth(
    # loess
    method = "loess",
    # formula
    formula = "y ~ x",
    # no ribbon
    se = FALSE
  ) +
  ## scale x-axis
  scale_x_continuous(
    # labels
    labels = label_number(
      # big marks
      big.mark = ","
    )
  ) +
  ## scale y-axis
  scale_y_continuous(
    # breaks
    breaks = seq(0, 3000000, 500000),
    # labels
    labels = label_dollar(
      # scale
      scale = 0.000001,
      # suffix
      suffix = "M"
    )
  ) +
  ## labels
  labs(
    # x-axis
    x = "Land Area (Feet)", 
    # y-axis
    y = "Sale Price (Millions)"
  ) +
  ## set theme
  theme_clean() 

## show plot
center_sale_plot

Task 5: Split Data

For this task, you will split the data for training and testing.

Task 5.1

Perform these operations.

First, set the random seed of your computer to 1795 using set.seed().

Second, create houses_split using initial_split(). Set the data input to houses_joined and the prop input to 0.65 Print houses_split.

Third, create houses_train by applying training() to houses_split. Print houses_train. Create houses_test by applying testing() to houses_split. Pring houses_test.

Question 5.1: Answer these questions: (1) How many observations are there in the training data? (2) How many observations are there in the testing data?

Response 5.1: (1) 8,954 (2) 4,822

### set random seed
## call function
set.seed(1795)

### create split
## split data
houses_split <- initial_split(
  # data
  houses_joined,
  # split proportion
  prop = 0.65
)

## examine initial split
houses_split

## <Training/Testing/Total>
## <8954/4822/13776>

### extract training data
## save
houses_train <- training(houses_split)

## preview
houses_train

### extract testing data
## save
houses_test <- testing(houses_split)

## preview
houses_test

Task 5.2

Set the random seed of your computer to 1805 using set.seed(). Then, create houses_train_folds with vfold_cv(). Set houses_train as the data input, set v to 3 and set repeats to 3. Print houses_train_folds.

Question 5.2: Answer these questions: (1) How many observations are there in the first analysis set? (2) How many observations are there in the first assessment set?

Response 5.2: (1) 5969 (2) 2985

### set random seed
## call function
set.seed(1805)

### repeated cross-folds on training set
## split training
houses_train_folds <- vfold_cv(
  # training data
  houses_train,
  # number of folds
  v = 3,
  # repeats
  repeats = 3
)

## examine folds
houses_train_folds

Task 6: Model Recipe and Metrics

For this task, you will create a model recipe.

Task 6.1

Perform these operations.

First, create a model recipe named houses_recipe. Call recipe() and set the formula input to sale_prc ~ . and the data input to houses_train. Pipe the result to step_rm() to remove parcel_id. Pipe the result to step_normalize() to standardize any numeric predictors. Pipe the result to step_dummy() to create dummy coded variables for any nominal predictors.

Second, pipe houses_recipe to prep(). Pipe the result to bake() with new_data set to NULL. Pipe the result to print() with width set to Inf.

Third, create a metrics function named metrics. Call metric_set() and list the following metrics: mae, rmse, rsq, rsq_trad, and ccc.

Question 6.1: Answer these questions: (1) How many columns are there in the baked data? (2) Is the first home below or above the average land area (lnd_sqfoot)?

Response 6.1: (1) 29 (2) Below (-0.117)

### create modeling recipe
## save as object
houses_recipe <- recipe(
  # identify outcomes
  sale_prc ~
    # all other variables
    .,
  # data
  data = houses_train
) %>%
  ## remove variables
  step_rm(
    # list variables
    parcel_id
  ) %>%
  ## normalize predictors
  step_normalize(
    # numeric predictors
    all_numeric_predictors()
  ) %>%
  ## dummy variables
  step_dummy(
    # factor predictors
    all_nominal_predictors()
  )

### prep and bake recipe
## call recipe
houses_recipe %>%
  ## estimate parameters
  prep() %>%
  ## apply computations to data
  bake(
    # training data
    new_data = NULL
  ) %>%
  ## print table
  print(
    # all columns
    width = Inf
  )

## # A tibble: 8,954 × 29
##    latitude longitude lnd_sqfoot tot_lvg_area spec_feat_val     age rail_dist
##       <dbl>     <dbl>      <dbl>        <dbl>         <dbl>   <dbl>     <dbl>
##  1   -0.529   -0.0616   -0.117         -0.528        -0.621  1.46      -1.19 
##  2   -0.755   -0.328     0.00385       -0.566         1.14   1.08      -0.420
##  3    1.28     1.10     -0.589         -1.15         -0.579  1.70      -1.25 
##  4    1.09     1.75      0.265         -0.878        -0.285  0.0779    -1.25 
##  5    0.131   -1.31     -0.548          0.935         1.10  -0.876      0.862
##  6    0.820    1.73      2.10           0.171        -0.558  1.27      -0.615
##  7   -0.514   -1.25     -0.630          0.500        -0.453 -1.40      -0.361
##  8    1.24     1.49      0.117         -0.464        -0.535 -0.447      0.161
##  9   -0.599   -0.151     1.16           1.77          1.26   0.0302    -1.03 
## 10   -0.480   -0.553    -0.178         -0.824         0.979 -0.495     -1.33 
##    ocean_dist water_dist cntr_dist subcntr_di hwy_dist sale_prc
##         <dbl>      <dbl>     <dbl>      <dbl>    <dbl>    <dbl>
##  1     -0.887      0.155  -0.133      -1.21      0.424   276000
##  2     -0.855     -0.153   0.293      -0.588    -0.104   260000
##  3      0.233     -0.991  -0.596       0.369    -0.193   101000
##  4     -0.835     -0.791  -0.932      -0.245     0.689   260000
##  5      1.91       2.04    0.458       0.294     1.01    465000
##  6     -0.772     -0.988  -1.36       -0.775     0.432   840000
##  7      0.829      1.99    0.734       0.0798    1.62    432000
##  8     -0.428     -1.00   -0.706       0.210    -0.516   340000
##  9     -0.935      0.220   0.00325    -1.01      0.787   630000
## 10     -0.188      1.13    0.163      -0.810    -1.21    325000
##    airplane_noise_Yes month_sold_01 month_sold_02 month_sold_03 month_sold_04
##                 <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
##  1                  0       -0.209         -0.155        0.348         -0.145
##  2                  0       -0.209         -0.155        0.348         -0.145
##  3                  0        0.209         -0.155       -0.348         -0.145
##  4                  0        0.209         -0.155       -0.348         -0.145
##  5                  0        0.125         -0.265       -0.265          0.134
##  6                  0       -0.125         -0.265        0.265          0.134
##  7                  0       -0.209         -0.155        0.348         -0.145
##  8                  0       -0.0418        -0.319        0.0976         0.313
##  9                  0       -0.376          0.228        0.0418        -0.302
## 10                  0       -0.125         -0.265        0.265          0.134
##    month_sold_05 month_sold_06 month_sold_07 month_sold_08 month_sold_09
##            <dbl>         <dbl>         <dbl>         <dbl>         <dbl>
##  1        -0.230        0.373         -0.137        -0.255         0.474
##  2        -0.230        0.373         -0.137        -0.255         0.474
##  3         0.230        0.373          0.137        -0.255        -0.474
##  4         0.230        0.373          0.137        -0.255        -0.474
##  5         0.349        0.0597        -0.336        -0.290         0.160
##  6        -0.349        0.0597         0.336        -0.290        -0.160
##  7        -0.230        0.373         -0.137        -0.255         0.474
##  8        -0.159       -0.299          0.230         0.274        -0.329
##  9         0.452       -0.463          0.370        -0.239         0.124
## 10        -0.349        0.0597         0.336        -0.290        -0.160
##    month_sold_10 month_sold_11 structure_quality_1 structure_quality_2
##            <dbl>         <dbl>               <dbl>               <dbl>
##  1       -0.409         0.196               -0.316              -0.267
##  2       -0.409         0.196               -0.316              -0.267
##  3       -0.409        -0.196               -0.316              -0.267
##  4       -0.409        -0.196               -0.316              -0.267
##  5        0.491         0.393                0.316              -0.267
##  6        0.491        -0.393                0.316              -0.267
##  7       -0.409         0.196                0.316              -0.267
##  8       -0.229         0.550                0.316              -0.267
##  9       -0.0491        0.0131              -0.316              -0.267
## 10        0.491        -0.393                0.316              -0.267
##    structure_quality_3 structure_quality_4
##                  <dbl>               <dbl>
##  1               0.632              -0.478
##  2               0.632              -0.478
##  3               0.632              -0.478
##  4               0.632              -0.478
##  5              -0.632              -0.478
##  6              -0.632              -0.478
##  7              -0.632              -0.478
##  8              -0.632              -0.478
##  9               0.632              -0.478
## 10              -0.632              -0.478
## # ℹ 8,944 more rows

### specify model metric to optimize
## save as object
metrics <- metric_set(
  # accuracy
  mae, rmse, 
  # consistency
  rsq, rsq_trad,
  # accuracy and consistency
  ccc
)

Task 7: Train Models

For this task, you will train the models of interest.

Task 7.1

Perform these operations.

First, create a workflow object named lm_wflow. Pipe workflow() to add_model(). In add_model(), call linear_reg() and pipe it to set_engine(“lm”). Pipe the result of add_model() to add_recipe(). In add_recipe(), specify houses_recipe as the input.

Second, create an object named lm_fit_folds. Pipe lm_wflow to fit_resamples(). In fit_resamples(), set resamples to houses_train_folds and the metrics to metrics.

Third, call collect_metrics() on lm_fit_folds.

Question 7.1: Answer three questions: (1) What is the average ccc across the folds for these linear regression models? (2) What is the average rsq across the folds for these linear regression models?

Response 7.1: (1) 0.844 (2) 0.730

### linear model
## save
lm_wflow <- workflow() %>%
  ## add model
  add_model(
    # regression specification
    linear_reg() %>%
      # engine
      set_engine("lm")
  ) %>%
  ## add recipe
  add_recipe(
    # recipe
    houses_recipe 
  )

### estimate model on folds
## save as object
lm_fit_folds <-  
  ## workflow
  lm_wflow %>%
  ## fit
  fit_resamples(
    # folds
    resamples = houses_train_folds,
    # metrics
    metrics = metrics
  )

### show metrics
## call function
collect_metrics(
  # results
  lm_fit_folds
)

Task 7.2

Perform these operations.

First, create an object named lm_fit. Pipe lm_wflow to fit() and set houses_train as the input.

Second, pipe lm_fit to extract_fit_parsnip(). Pipe the result to tidy(). Pipe the result to arrange() and arrange the rows in descending order by estimate. Pipe the result to print() and set n to Inf.

Question 7.2: Answer these questions: (1) What is the correct interpretation of the ocean distance (ocean_dist) regression coefficient? (2) Is the highway distance (hwy_dist) or center distance (center_dist) a stronger predictor of sales prices?

Response 7.2: (1) For every 1-unit increase in distance from the ocean, the sale price increases by approximately $49,948, holding all other variables constant. (2) Highway distance is ranked higher than center distance, has a lower standard error and has a more precise relationship with sale prices and is therefore a stronger predictor of sales prices.

### fit to complete training data
## save as object
lm_fit <- 
  ## workflow
  lm_wflow %>%
  ## fit
  fit(
    # data
    houses_train
  )

### view coefficients
## call model object
lm_fit %>%
  ## pull fit
  extract_fit_parsnip() %>%
  ## coefficients
  tidy() %>%
  ## arrange rows
  arrange(
    # descending
    desc(estimate)
  ) %>%
  ## print table
  print(
    # all rows
    n = Inf
  )

## # A tibble: 29 × 5
##    term                 estimate std.error statistic   p.value
##    <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
##  1 structure_quality_4  722764.     39420.  18.3     9.73e- 74
##  2 (Intercept)          575946.     11489.  50.1     0        
##  3 structure_quality_1  280793.     11035.  25.4     6.11e-138
##  4 longitude            178368.     12099.  14.7     1.27e- 48
##  5 tot_lvg_area         145448.      2411.  60.3     0        
##  6 structure_quality_3   80547.      6026.  13.4     2.33e- 40
##  7 ocean_dist            49948.      8089.   6.17    6.92e- 10
##  8 spec_feat_val         38774.      2109.  18.4     3.98e- 74
##  9 hwy_dist              29788.      2028.  14.7     2.81e- 48
## 10 rail_dist             27396.      2051.  13.4     2.52e- 40
## 11 lnd_sqfoot            21614.      2191.   9.86    7.92e- 23
## 12 water_dist            13706.      3317.   4.13    3.63e-  5
## 13 month_sold_04          8502.      6175.   1.38    1.69e-  1
## 14 month_sold_03          6983.      6155.   1.13    2.57e-  1
## 15 month_sold_09          4431.      5821.   0.761   4.47e-  1
## 16 month_sold_01          3827.      6365.   0.601   5.48e-  1
## 17 month_sold_10          3171.      5741.   0.552   5.81e-  1
## 18 month_sold_07           723.      6051.   0.120   9.05e-  1
## 19 month_sold_05           367.      6161.   0.0596  9.52e-  1
## 20 month_sold_11           -48.0     5692.  -0.00844 9.93e-  1
## 21 month_sold_08         -4278.      5944.  -0.720   4.72e-  1
## 22 month_sold_06         -5505.      6100.  -0.903   3.67e-  1
## 23 month_sold_02        -11414.      6235.  -1.83    6.72e-  2
## 24 age                  -30758.      2312. -13.3     5.24e- 40
## 25 cntr_dist            -41173.     11453.  -3.59    3.26e-  4
## 26 subcntr_di           -58946.      8133.  -7.25    4.60e- 13
## 27 airplane_noise_Yes   -86150.     14718.  -5.85    4.98e-  9
## 28 latitude            -146625.      8195. -17.9     2.31e- 70
## 29 structure_quality_2 -448226.     30555. -14.7     3.67e- 48

Task 7.3

Perform these operations.

First, create a model specification object named elastic_net_spec. Call linear_reg() and set penalty and mixture to tune(). Pipe the result to set_engine(“glmnet”).

Second, create a grid object named elastic_net_grid. Pipe elastic_net_spec to extract_parameter_set_dials(). Pipe the result to grid_regular() with levels set to 6.

Third, create a workflow object named elastic_net_wflow. Pipe workflow() to add_model() and specify elastic_net_spec as the input. Pipe the result of add_model() to add_recipe(). In add_recipe(), specify houses_recipe as the input.

Fourth, create an object named elastic_net_tune. Pipe elastic_net_wflow to tune_grid(). In tune_grid(), set resamples to houses_train_folds, the grid to elastic_net_grid, and the metrics to metrics.

Fifth, call autoplot() and set elastic_net_tune as the first input and c(“mae”, “rsq”, “ccc”) as the metric input. Update the theme with theme() and move the legend to the top of the plot.

Sixth, call collect_metrics() on elastic_net_tune. Pipe the result to print() with n set to 40 and width to Inf.

Seventh, call show_best(). Set elastic_net_tune as the first input and “mae” as the metric input.

Question 7.3: Answer these questions: (1) Examining the plot, what is approximately the highest R-squared (rsq) value? (2) What penalty and mixture values are the best based on average mae?

Response 7.3: (1) 0.73 (2) Penalty = 0.0000000001 ; Mixture 0.05

#### elastic net
### model specification
## save as object
elastic_net_spec <- 
  ## regression specification
  linear_reg(
    # tune penalty
    penalty = tune(),
    # tune mixture
    mixture = tune()
  ) %>%
  ## specify engine
  set_engine("glmnet") 
  
### view a tuning grid
## call model specification
elastic_net_grid <- elastic_net_spec %>%
  ## parameters
  extract_parameter_set_dials() %>%
  ## grid
  grid_regular(
    # levels
    levels = 6
  )

### create initial workflow
## save as object
elastic_net_wflow <- workflow() %>%
  ## add model
  add_model(
    # specification
    elastic_net_spec
  ) %>%
  ## add recipe
  add_recipe(
    # recipe
    houses_recipe 
  )

### estimate models
## save as object
elastic_net_tune <- 
  ## workflow
  elastic_net_wflow %>%
  ## tune
  tune_grid(
    # folds
    resamples = houses_train_folds,
    # grid
    grid = elastic_net_grid,
    # metrics
    metrics = metrics
  )

### plot metrics
## produce plot
## produce plot
autoplot(
  # results
  elastic_net_tune,
  # metrics
  metric = c("mae", "rsq", "ccc")
) +
  ## update theme
  theme(
    # move legend
    legend.position = "top"
  )

### show metrics
## call function
collect_metrics(
  # results
  elastic_net_tune
) %>%
  ## print table
  print(
    # rows
    n = 40, 
    # all columns
    width = Inf
  )

## # A tibble: 180 × 8
##         penalty mixture .metric  .estimator       mean     n    std_err
##           <dbl>   <dbl> <chr>    <chr>           <dbl> <int>      <dbl>
##  1 0.0000000001    0.05 ccc      standard        0.844     9    0.00247
##  2 0.0000000001    0.05 mae      standard   106186.        9  487.     
##  3 0.0000000001    0.05 rmse     standard   163407.        9 1773.     
##  4 0.0000000001    0.05 rsq      standard        0.730     9    0.00349
##  5 0.0000000001    0.05 rsq_trad standard        0.730     9    0.00347
##  6 0.0000000001    0.24 ccc      standard        0.844     9    0.00248
##  7 0.0000000001    0.24 mae      standard   106294.        9  488.     
##  8 0.0000000001    0.24 rmse     standard   163418.        9 1789.     
##  9 0.0000000001    0.24 rsq      standard        0.730     9    0.00349
## 10 0.0000000001    0.24 rsq_trad standard        0.730     9    0.00348
## 11 0.0000000001    0.43 ccc      standard        0.844     9    0.00248
## 12 0.0000000001    0.43 mae      standard   106283.        9  486.     
## 13 0.0000000001    0.43 rmse     standard   163417.        9 1789.     
## 14 0.0000000001    0.43 rsq      standard        0.730     9    0.00349
## 15 0.0000000001    0.43 rsq_trad standard        0.730     9    0.00347
## 16 0.0000000001    0.62 ccc      standard        0.844     9    0.00248
## 17 0.0000000001    0.62 mae      standard   106265.        9  482.     
## 18 0.0000000001    0.62 rmse     standard   163410.        9 1786.     
## 19 0.0000000001    0.62 rsq      standard        0.730     9    0.00349
## 20 0.0000000001    0.62 rsq_trad standard        0.730     9    0.00347
## 21 0.0000000001    0.81 ccc      standard        0.844     9    0.00248
## 22 0.0000000001    0.81 mae      standard   106240.        9  488.     
## 23 0.0000000001    0.81 rmse     standard   163408.        9 1785.     
## 24 0.0000000001    0.81 rsq      standard        0.730     9    0.00349
## 25 0.0000000001    0.81 rsq_trad standard        0.730     9    0.00347
## 26 0.0000000001    1    ccc      standard        0.844     9    0.00249
## 27 0.0000000001    1    mae      standard   106213.        9  487.     
## 28 0.0000000001    1    rmse     standard   163413.        9 1787.     
## 29 0.0000000001    1    rsq      standard        0.730     9    0.00349
## 30 0.0000000001    1    rsq_trad standard        0.730     9    0.00348
## 31 0.00000001      0.05 ccc      standard        0.844     9    0.00247
## 32 0.00000001      0.05 mae      standard   106186.        9  487.     
## 33 0.00000001      0.05 rmse     standard   163407.        9 1773.     
## 34 0.00000001      0.05 rsq      standard        0.730     9    0.00349
## 35 0.00000001      0.05 rsq_trad standard        0.730     9    0.00347
## 36 0.00000001      0.24 ccc      standard        0.844     9    0.00248
## 37 0.00000001      0.24 mae      standard   106294.        9  488.     
## 38 0.00000001      0.24 rmse     standard   163418.        9 1789.     
## 39 0.00000001      0.24 rsq      standard        0.730     9    0.00349
## 40 0.00000001      0.24 rsq_trad standard        0.730     9    0.00348
##    .config         
##    <chr>           
##  1 pre0_mod01_post0
##  2 pre0_mod01_post0
##  3 pre0_mod01_post0
##  4 pre0_mod01_post0
##  5 pre0_mod01_post0
##  6 pre0_mod02_post0
##  7 pre0_mod02_post0
##  8 pre0_mod02_post0
##  9 pre0_mod02_post0
## 10 pre0_mod02_post0
## 11 pre0_mod03_post0
## 12 pre0_mod03_post0
## 13 pre0_mod03_post0
## 14 pre0_mod03_post0
## 15 pre0_mod03_post0
## 16 pre0_mod04_post0
## 17 pre0_mod04_post0
## 18 pre0_mod04_post0
## 19 pre0_mod04_post0
## 20 pre0_mod04_post0
## 21 pre0_mod05_post0
## 22 pre0_mod05_post0
## 23 pre0_mod05_post0
## 24 pre0_mod05_post0
## 25 pre0_mod05_post0
## 26 pre0_mod06_post0
## 27 pre0_mod06_post0
## 28 pre0_mod06_post0
## 29 pre0_mod06_post0
## 30 pre0_mod06_post0
## 31 pre0_mod07_post0
## 32 pre0_mod07_post0
## 33 pre0_mod07_post0
## 34 pre0_mod07_post0
## 35 pre0_mod07_post0
## 36 pre0_mod08_post0
## 37 pre0_mod08_post0
## 38 pre0_mod08_post0
## 39 pre0_mod08_post0
## 40 pre0_mod08_post0
## # ℹ 140 more rows

### show best
## call function
show_best(
  # results
  elastic_net_tune,
  # metric
  metric = "ccc"
)

### create final workflow
## save as object
elastic_net_wflow_final <- 
  ## initial workflow
  elastic_net_wflow %>%
  ## finalize workflow
  finalize_workflow(
    # hyperparameters
    select_best(
      # results
      elastic_net_tune,
      # metric
      metric = "ccc"
    )
  )

Task 7.4

Perform these operations.

First, create a workflow object named elastic_net_wflow_final. Pipe elastic_net_wflow to finalize_workflow(). Inside of finalize_workflow(), call select_best() and set elastic_net_tune as the first input and “mae” as the metric input.

Second, create an object named elastic_net_fit. Pipe elastic_net_wflow_final to fit() and set houses_train as the input.

Second, pipe elastic_net_fit to extract_fit_parsnip(). Pipe the result to tidy(). Pipe the result to arrange() and arrange the rows in descending order by estimate. Pipe the result to print() and set n to Inf.

Question 7.4: What is the correct interpretation of the airplane noise (airplane_noise_Yes) regression coefficient?

Response 7.4: Houses with airplane noise have sale prices that are approximately $86,118 lower than houses without airplane noise, holding all other variables constant. Airplane noise is considered undesirable, so it reduces property values.

### create final workflow
## save as object
elastic_net_wflow_final <- 
  ## initial workflow
  elastic_net_wflow %>%
  ## finalize workflow
  finalize_workflow(
    # hyperparameters
    select_best(
      # results
      elastic_net_tune,
      # metric
      metric = "ccc"
    )
  )

### fit to complete training data
## save as object
elastic_net_fit <- 
  ## workflow
  elastic_net_wflow_final %>%
  ## fit
  fit(
    # data
    houses_train
  )

### view coefficients
## call model object
elastic_net_fit %>%
  ## pull fit
  extract_fit_parsnip() %>%
  ## coefficients
  tidy() %>%
  ## arrange rows
  arrange(
    # descending
    desc(estimate)
  ) %>%
  ## print table
  print(
    # all rows
    n = Inf
  )

## # A tibble: 29 × 3
##    term                estimate      penalty
##    <chr>                  <dbl>        <dbl>
##  1 structure_quality_4  695889. 0.0000000001
##  2 (Intercept)          569285. 0.0000000001
##  3 structure_quality_1  276884. 0.0000000001
##  4 longitude            163927. 0.0000000001
##  5 tot_lvg_area         145337. 0.0000000001
##  6 structure_quality_3   79135. 0.0000000001
##  7 ocean_dist            40514. 0.0000000001
##  8 spec_feat_val         38837. 0.0000000001
##  9 hwy_dist              29737. 0.0000000001
## 10 rail_dist             27389. 0.0000000001
## 11 lnd_sqfoot            21876. 0.0000000001
## 12 water_dist            13977. 0.0000000001
## 13 month_sold_04          8322. 0.0000000001
## 14 month_sold_03          6658. 0.0000000001
## 15 month_sold_09          4259. 0.0000000001
## 16 month_sold_01          3698. 0.0000000001
## 17 month_sold_10          3096. 0.0000000001
## 18 month_sold_07           478. 0.0000000001
## 19 month_sold_05           173. 0.0000000001
## 20 month_sold_11             0  0.0000000001
## 21 month_sold_08         -4193. 0.0000000001
## 22 month_sold_06         -5375. 0.0000000001
## 23 month_sold_02        -11356. 0.0000000001
## 24 age                  -30842. 0.0000000001
## 25 cntr_dist            -49946. 0.0000000001
## 26 subcntr_di           -52192. 0.0000000001
## 27 airplane_noise_Yes   -86118. 0.0000000001
## 28 latitude            -138348. 0.0000000001
## 29 structure_quality_2 -425947. 0.0000000001

Task 7.5

Perform these operations.

First, create a model specification object named rf_spec. Call rand_forest() and set mode to “regression”, mtry to tune(), trees to 500, and min_n to tune(). Pipe the result to set_engine(“ranger”) and set importance to “impurity”.

Second, create a grid object named rf_grid. Call grid_regular(). Inside of grid_regular(), call mtry(c(2, 20)), min_n(c(5, 50)), and set levels to 3.

Third, create a workflow object named rf_wflow. Pipe workflow() to add_model() and specify rf_spec as the input. Pipe the result of add_model() to add_recipe(). In add_recipe(), specify houses_recipe as the input.

Fourth, set-up the timer for parallel processing by calling tic(). Then, create n_cores with availableCores(). Then, call registerDoFuture(). Then, create clust_work by calling makeClusterPSOCK() and setting n_cores as the first input and setting autoStop to TRUE. Then, call plan(cluster, workers = clust_work). Then, create an object named rf_tune. Pipe rf_wflow to tune_grid(). In tune_grid(), set resamples to houses_train_folds, the grid to rf_grid, and the metrics to metrics. Then, call plan(sequential). Then, close the timer with toc().

Fifth, call autoplot() and set rf_tune as the first input and c(“mae”, “rsq”, “ccc”) as the metric input. Update the theme with theme() and move the legend to the top of the plot.

Sixth, call collect_metrics() on rf_tune. Pipe the result to print() with n set to Inf and width to Inf.

Seventh, call show_best(). Set rf_tune as the first input and “mae” as the metric input.

Question 7.5: Answer these questions: (1) Examining the plot, what is approximately the highest R-squared (rsq) value? (2) What number of features (mtry) and minimal number of observations (min_n) are the best based on average mae?

Response 7.5: (1) 0.90 (2) Mtry= 11 and min_n = 5

#### random forest
### model specification
## save as object
rf_spec <- 
  ## rf specification
  rand_forest(
    # regression
    mode = "regression",
    # selected features
    mtry = tune(),
    # number of trees
    trees = 500,
    # minimal node size
    min_n = tune()
  ) %>%
  ## specify engine
  set_engine(
    # engine
    "ranger",
    # variable importance
    importance = "impurity"
  )

### view a tuning grid
## save
rf_grid <- grid_regular(
  # number of features
  mtry(
    # range
    c(2, 20)
  ),
  # number of units
  min_n(
    # range
    c(5, 50)
  ),
  # combinations  
  levels = 3
)

### create initial workflow
## save as object
rf_wflow <- workflow() %>%
  ## add model
  add_model(
    # specification
    rf_spec
  ) %>%
  ## add recipe
  add_recipe(
    # recipe
    houses_recipe
  )

### begin timer
## call function
tic()

### number of logical cores
## save
n_cores <- availableCores()

### register parallel backend
## call function
registerDoFuture()

### create compute clusters
## save
clust_work <- makeClusterPSOCK(
  # cores
  n_cores,
  # stop cluster
  autoStop = TRUE
)

### create parallel plan
## call function
plan(
  # cluster
  cluster, 
  # workers
  workers = clust_work
)

### estimate models
## save as object
rf_tune <- 
  ## workflow
  rf_wflow %>%
  ## fit
  tune_grid(
    # folds
    resamples = houses_train_folds,
    # metrics
    metrics = metrics,
    # grid
    grid = rf_grid
  )

### convert to sequential
## call function
plan(sequential)

### end timer
## call function
toc()

## 204.93 sec elapsed

### plot metrics
## produce plot
## produce plot
autoplot(
  # results
  rf_tune,
  # metrics
  metric = c("mae", "rsq", "ccc")
) +
  ## update theme
  theme(
    # move legend
    legend.position = "top"
  )

### show metrics
## call function
collect_metrics(
  # results
  rf_tune
) %>%
  ## print table
  print(
    # all rows
    n = Inf, 
    # all columns
    width = Inf
  )

## # A tibble: 45 × 8
##     mtry min_n .metric  .estimator       mean     n    std_err .config        
##    <int> <int> <chr>    <chr>           <dbl> <int>      <dbl> <chr>          
##  1     2     5 ccc      standard        0.896     9    0.00128 pre0_mod1_post0
##  2     2     5 mae      standard    65187.        9  358.      pre0_mod1_post0
##  3     2     5 rmse     standard   126922.        9 1022.      pre0_mod1_post0
##  4     2     5 rsq      standard        0.870     9    0.00228 pre0_mod1_post0
##  5     2     5 rsq_trad standard        0.837     9    0.00189 pre0_mod1_post0
##  6     2    27 ccc      standard        0.876     9    0.00165 pre0_mod2_post0
##  7     2    27 mae      standard    70177.        9  378.      pre0_mod2_post0
##  8     2    27 rmse     standard   136016.        9 1294.      pre0_mod2_post0
##  9     2    27 rsq      standard        0.856     9    0.00259 pre0_mod2_post0
## 10     2    27 rsq_trad standard        0.813     9    0.00223 pre0_mod2_post0
## 11     2    50 ccc      standard        0.858     9    0.00185 pre0_mod3_post0
## 12     2    50 mae      standard    74513.        9  406.      pre0_mod3_post0
## 13     2    50 rmse     standard   143712.        9 1292.      pre0_mod3_post0
## 14     2    50 rsq      standard        0.844     9    0.00302 pre0_mod3_post0
## 15     2    50 rsq_trad standard        0.791     9    0.00250 pre0_mod3_post0
## 16    11     5 ccc      standard        0.942     9    0.00147 pre0_mod4_post0
## 17    11     5 mae      standard    48696.        9  322.      pre0_mod4_post0
## 18    11     5 rmse     standard   101118.        9 1219.      pre0_mod4_post0
## 19    11     5 rsq      standard        0.900     9    0.00228 pre0_mod4_post0
## 20    11     5 rsq_trad standard        0.896     9    0.00236 pre0_mod4_post0
## 21    11    27 ccc      standard        0.936     9    0.00164 pre0_mod5_post0
## 22    11    27 mae      standard    51160.        9  357.      pre0_mod5_post0
## 23    11    27 rmse     standard   105430.        9 1309.      pre0_mod5_post0
## 24    11    27 rsq      standard        0.892     9    0.00258 pre0_mod5_post0
## 25    11    27 rsq_trad standard        0.887     9    0.00264 pre0_mod5_post0
## 26    11    50 ccc      standard        0.928     9    0.00166 pre0_mod6_post0
## 27    11    50 mae      standard    53976.        9  348.      pre0_mod6_post0
## 28    11    50 rmse     standard   110795.        9 1341.      pre0_mod6_post0
## 29    11    50 rsq      standard        0.883     9    0.00264 pre0_mod6_post0
## 30    11    50 rsq_trad standard        0.876     9    0.00264 pre0_mod6_post0
## 31    20     5 ccc      standard        0.942     9    0.00146 pre0_mod7_post0
## 32    20     5 mae      standard    49187.        9  314.      pre0_mod7_post0
## 33    20     5 rmse     standard   101615.        9 1192.      pre0_mod7_post0
## 34    20     5 rsq      standard        0.897     9    0.00229 pre0_mod7_post0
## 35    20     5 rsq_trad standard        0.895     9    0.00235 pre0_mod7_post0
## 36    20    27 ccc      standard        0.937     9    0.00149 pre0_mod8_post0
## 37    20    27 mae      standard    51434.        9  316.      pre0_mod8_post0
## 38    20    27 rmse     standard   105731.        9 1202.      pre0_mod8_post0
## 39    20    27 rsq      standard        0.889     9    0.00231 pre0_mod8_post0
## 40    20    27 rsq_trad standard        0.887     9    0.00236 pre0_mod8_post0
## 41    20    50 ccc      standard        0.930     9    0.00155 pre0_mod9_post0
## 42    20    50 mae      standard    54140.        9  312.      pre0_mod9_post0
## 43    20    50 rmse     standard   110506.        9 1228.      pre0_mod9_post0
## 44    20    50 rsq      standard        0.880     9    0.00242 pre0_mod9_post0
## 45    20    50 rsq_trad standard        0.876     9    0.00245 pre0_mod9_post0

### show best
## call function
show_best(
  # results
  rf_tune,
  # metric
  metric = "mae"
)

Task 7.6

Perform these operations.

First, create a workflow object named rf_wflow_final. Pipe rf_wflow to finalize_workflow(). Inside of finalize_workflow(), call select_best() and set rf_tune as the first input and “mae” as the metric input.

Second, create an object named rf_fit. Pipe rf_wflow_final to fit() and set houses_train as the input.

Second, pipe rf_fit to extract_fit_parsnip(). Pipe the result to vip().

Question 7.6: Answer these questions: (1) Which predictor is the most important? (2) Which predictor is second most important?

Response 7.6: (1) Total Living Area (2) Ocean Distance

### create final workflow
## save as object
rf_wflow_final <- 
  ## initial workflow
  rf_wflow %>%
  ## finalize workflow
  finalize_workflow(
    # hyperparameters
    select_best(
      # results
      rf_tune,
      # metric
      metric = "mae"
    )
  )

### fit to complete training data
## save as object
rf_fit <- 
  ## workflow
  rf_wflow_final %>%
  ## fit
  fit(
    # data
    houses_train
  )

### view coefficients
## call model object
rf_fit %>%
  ## pull fit
  extract_fit_parsnip() %>%
  ## prediction importance
  vip()

Task 8: Test Models

For this task, you will evaluate estimated models on the testing data.

Task 8.1

Perform these operations.

First, create a data table object named lm_pred. Call predict() and set lm_fit as the model input and houses_test as the new_data input. Pipe the result to rename() and rename .pred to lm_pred.

Second, create a data table object named elastic_net_pred. Call predict() and set elast_net_fit as the model input and houses_test as the new_data input. Pipe the result to rename() and rename .pred to elastic_net_pred.

Third, create a data table object named rf_pred. Call predict() and set rf_fit as the model input and houses_test as the new_data input. Pipe the result to rename() and rename .pred to rf_pred.

Fourth, create a data table object named houses_test_pred. Pipe houses_test to select() and select sale_prc. Pipe the result to bind_cols() and set lm_pred, elastic_net_pred, and rf_pred as inputs.

Fifth, call map(). As the first input, call set_names(). Inside of set_names(), call houses_test_pred, then pipe it to select() to remove sale_prc followed by a pipe to names(). As the second input to map(), call metrics() with data set houses_test_pred, truth set to sale_prc, and the estimate set to all_of(.x). Pipe the result to list_rbind() with names_to set to “model”. Pipe the result to pivot_wider() and set id_cols to model, names_from to .metric, and values_from to .estimate.

Question 8.1: Answer these questions: (1) What is the mean absolute error (mae) of the elastic net model? (2) Irrespective of metric, which model performs the best?

Response 8.1: (1) 106013 (2) Random Forrest performs the best across all metrics

### linear model
## save
lm_pred <- predict(
  # fitted model
  lm_fit,
  # test data
  new_data = houses_test
) %>% 
  ## rename
  rename(
    # new = old
    lm_pred = .pred
  )

### elastic net
## save
elastic_net_pred <- predict(
  # fitted model
  elastic_net_fit,
  # test data
  new_data = houses_test
) %>% 
  ## rename
  rename(
    # new = old
    elastic_net_pred = .pred
  )

### random forest
## save
rf_pred <- predict(
  # fitted model
  rf_fit,
  # test data
  new_data = houses_test
) %>% 
  ## rename
  rename(
    # new = old
    rf_pred = .pred
  )

### observed and predicted values
## save as object
houses_test_pred <- houses_test %>%
  ## select observed values
  select(sale_prc) %>%
  ## bind columns
  bind_cols(
    # linear model
    lm_pred,
    # elastic net
    elastic_net_pred,
    # random forest
    rf_pred
  ) 

### compute metrics on test data
## map
map(
  # columns
  set_names(
    # data
    houses_test_pred %>%
      # remove observed values
      select(-sale_prc) %>%
      # extract column names
      names(),  
  ),
  # function
  ~ metrics(
      # data
      data = houses_test_pred,
      # observed
      truth = sale_prc,
      # predicted
      estimate = all_of(.x)
  )
) %>%
  ## bind as rows
  list_rbind(
    # names
    names_to = "model"
  ) %>%
  ## pivot wide
  pivot_wider(
    # model
    id_cols = model,
    # names
    names_from = .metric,
    # values
    values_from = .estimate
  )

Task 8.2

Perform these operations.

First, create a data table named houses_test_pred_long. Pipe houses_test_pred to pivot_longer(). Inside of pivot_longer() set cols to lm_pred:rf_pred, names_to to “model”, and values_to to “pred”. Pipe the result to mutate() and convert model to a factor variable with as_factor().

Second, create a plot named pred_plot. Call ggplot() and set houses_test_pred_long as the data input and map pred to the x-axis, and sale_prc to the y-axis. Add a geom_point() layer and set to alpha to 0.1. Add a first geom_abline() and set lty to 1, color to “red”, and linewidth to 1. Add facets with facet_wrap() and set vars(model) as the first input, nrow to 2, and labeller to as_labeller(). Inside of as_labeller(), specify set_names(c(“Linear Model”, “Elastic Net”, “Random Forest”)) as the first input, and, for the second input, pipe house_test_pred_long to pull() to extract the model values and pipe the result to levels(). Use scale_x_continuous() to update the y-axis labels with label_dollar() and setting scaleto 0.000001 and suffix to “M” and setting breaks to seq(0, 3000000, 500000). Use scale_y_continuous() to update the y-axis labels with label_dollar() and setting scaleto 0.000001 and suffix to “M” and setting breaks to seq(0, 3000000, 500000). Adjust titles for the x and y axes to “Predicted Sale Price (Millions)” and “Observed Sale Price (Millions)” using labs().

View pred_plot.

Question 8.2: Which model’s predicted sale prices most closely match the observed sale prices?

Response 8.2: Random forest - the closer the points are to the red diagonal line which represents perfect predictions, the better the model performs.

### make long table for plots
## save as object
houses_test_pred_long <- houses_test_pred %>%
  ## pivot long for plot
  pivot_longer(
    # columns to pivot
    cols = lm_pred:rf_pred,
    # names
    names_to = "model",
    # values
    values_to = "pred"
  ) %>%
  ## update variable
  mutate(
    # convert to factor
    model = as_factor(model)
  )

#### create plot
### observed versus predicted values
## save as object
model_pred_plot <- ggplot(
  # data
  houses_test_pred_long,
  # mapping
  aes(
    # predicted values
    x = pred,
    # observed values
    y = sale_prc
  )
) +
  ## add points
  geom_point(
    # transparency
    alpha = 0.25
  ) +
  ## add one-to-one diagonal line
  geom_abline(
    # solid
    lty = 1, 
    # color
    color = "red", 
    # thick
    linewidth = 1
  ) +
  ## add facet
  facet_wrap(
    # variable
    vars(model), 
    # number of rows
    nrow = 2,
    # labels
    labeller = as_labeller(
      # look-up table
      set_names(
        # vector elements
        c(
          "Linear Model", "Elastic Net", "Random Forest"
        ), 
        # names of elements
        houses_test_pred_long %>%
          # extract
          pull(model) %>%
          # levels
          levels()
      )
    )
  ) +
  ## scale x-axis
  scale_x_continuous(
    # breaks
    breaks = seq(0, 3000000, 500000),
    # format
    labels = label_dollar(
      # scale
      scale = 0.000001,
      # suffix
      suffix = "M"
    )
  ) +
  ## scale y-axis
  scale_y_continuous(
    # breaks
    breaks = seq(0, 3000000, 500000),
    # format
    labels = label_dollar(
      # scale
      scale = 0.000001,
      # suffix
      suffix = "M"
    )
  ) +
  ## labels
  labs(
    # x-axis
    x = "Predicted Sale Price (Millions)", 
    # y-axis
    y = "Observed Sale Price (Millions)"
  )

## display plot
model_pred_plot

Task 9: Create Database

Create a new database connection named houses_con_update. Call dbConnect() and use SQLite() as the driver input and use here() to navigate to the data folder of the project directory and specify a new database file named “miami_houses_update.sqlite”.

Copy the existing database by calling sqliteCopyDatabase(). Inside of sqliteCopyDatabase(), set houses_con and houses_con_update as the inputs.

Copy the houses_joined to the new database by calling copy_to(). Inside of copy_to(), set houses_con_update as the database connection input, set houses_joined as the data table input, name the database table “complete”, set temporary to FALSE, and set analyze to FALSE.

Check that the “complete” table exists in the database. Pipe houses_con_update to tbl() and set “complete” as the input.

Disconnect from both databases. Call dbDisconnect() twice to disconnect from houses_con and houses_con_update.

### create empty database
## save
houses_con_update <- dbConnect(
  # driver
  SQLite(),
  # path
  here(
    # folder
    "data",
    # file
    "miami_houses_update.sqlite"  
  )
)

### list tables
## call function
dbListTables(houses_con_update)

## [1] "complete" "distance" "overview"

### copy database
## call function
sqliteCopyDatabase(
  # current
  houses_con,
  # new
  houses_con_update
)

### list tables
## call function
dbListTables(houses_con_update)

## [1] "distance" "overview"

### copy tables to database
## copy
copy_to(
  # database
  houses_con_update,
  # table
  houses_joined,
  # name
  "complete",
  # persistent
  temporary = FALSE,
  # query optimization
  analyze = FALSE
)

### list tables
## call function
dbListTables(houses_con_update)

## [1] "complete" "distance" "overview"

### view table
## call connection
houses_con_update %>%
  ## preview table
  tbl(
    # table
    "complete"
  )

### disconnect from database
## call function
dbDisconnect(houses_con_update)

### disconnect from database
## call function
dbDisconnect(houses_con)

Task 10: Save Plots

Save the plot objects as png files in the plots folder of the project directory using the combination of walk(), ls(), ggsave(), here(), str_glue(), str_remove(), and get(). Use a width of 8.1 inches and a height of 5 inches.

### save plots
## call function
walk(
  ## vector of plots
  ls(
    # name pattern
    pattern = "plot"
  ),
  ## function
  function(plot_inp) {
    ### save plot
    ## call function
    ggsave(
      ## file path
      here(
        # folder
        "plots",
        # file
        str_glue(
          # remove string
          str_remove(
            # string
            plot_inp,
            # pattern
            "_plot"
          ),
          # extension
          ".png"
        )
      ),
      ## plot object
      plot = get(plot_inp),
      ## units
      units = "in",
      ## width
      width = 8.1,
      ## height
      height = 5
    )
  }
)

Task 11: Conceptual Questions

For your last task, you will respond to conceptual questions based on the conceptual lectures for this week.

Question 11.1: Why is it better to evaluate predictive performance of a model on testing as opposed to training data? Response 11.1: Because testing data tells us how well our model generalizes to new observations and therefore what the models true predictive power in practice is.

Question 11.2: What is a tuning parameter in a supervised learning model?

Response 11.2: A tuning parameter is a configuration setting that is set or “tuned” before training a model to find the optimal values that give the best model performance. It controls how a supervised learning model learns.

Question 11.3: What are two general advantages of using random forests for predictions?

Response 11.3: 1. Unlike in linear regression where manual adding of interaction terms is needed, random forests automatically capture non-linear relationships and interactions between variables without us needing to specify them making them very flexible and powerful for complex real-world data. 2. By building many different trees, each tree is trained on a random subset of data and uses random subsets of variables averaging their predictions. This makes random forests less likely to overfit compared to a single decision tree which therefore gives us better performance on test data without requiring extensive tuning.

Assignment: Predict Continuous Outcomes

Michael Chiuri

2025-10-04 23:18:28.700967

Instructions

Global Settings

Activate Packages

Task 1: Database Connection

Task 1.1

Task 1.2

Task 2: Clean Data

Task 2.1

Task 2.2

Task 3: Join Data

Task 3.1

Task 4: Examine Data

Task 4.1

Task 4.2

Task 4.3

Task 5: Split Data

Task 5.1

Task 5.2

Task 6: Model Recipe and Metrics

Task 6.1

Task 7: Train Models

Task 7.1

Task 7.2

Task 7.3

Task 7.4

Task 7.5

Task 7.6

Task 8: Test Models

Task 8.1

Task 8.2

Task 9: Create Database

Task 10: Save Plots

Task 11: Conceptual Questions