This assignment reviews the Predicting Continuous Outcomes content. You will use the predict_continuous_outcomes.Rmd file I reviewed as part of the lecture for this class session to complete this assignment. You will copy and paste relevant code from that file and update it to answer the questions in this assignment. You will respond to questions in each section after executing relevant code to answer a question. You will submit this assignment to its Submissions folder on D2L. You will submit two files:
To start:
First, create a folder on your computer to save all relevant files for this course. If you did not do so already, you will want to create a folder named gsb_804 that contains all of the materials for this course.
Second, inside of gsb_804, you will create a folder to host assignments. You can name that folder assignments.
Third, inside of assignments, you will create folders for each assignment. You can name the folder for this first assignment: 02_predict_continuous_outcomes.
Fourth, create two additional folders in 02_predict_continuous_outcomes named scripts and data. Store this script in the scripts folder and the data for this assignment in the data folder. Create a plots folder as well.
Fifth, go to the File menu in RStudio, select New Project…, choose Existing Directory, go to your /gsb_804/assignments/02_predict_continuous_outcomes folder to select it as the top-level directory for this R Project.
The first code chunk sets the global settings for the remaining code chunks in the document. Do not change anything in this code chunk.
In this code chunk, we load the following packages:
Make sure you installed these packages when you reviewed the analytical lecture.
We will use functions from these packages to examine the data. Do not change anything in this code chunk.
## here for project work flow
library(here)
## tidyverse for data manipulation and plotting;
## activates multiple packages simultaneously
library(tidyverse)
## skimr to summarize data
library(skimr)
## scales for variables scales
library(scales)
## snakecase for naming conventions
library(snakecase)
## corrr for correlation matrices
library(corrr)
## ggthemes for plot themes
library(ggthemes)
## tidymodels for modeling;
## activates multiple packages simultaneously
library(tidymodels)
## glmnet for elastic net models
library(glmnet)
## DBI for database interaction
library(DBI)
## RSQLite for SQLite databases
library(RSQLite)
## ranger for random forests
library(ranger)
## vip for variable importance
library(vip)
## tictoc for timing
library(tictoc)
## parallel for parallel processing
library(parallelly)
## doFuture for parallel processing
library(doFuture)In this task, you will establish a database connection to miami_houses.sqlite. After you connect to the database, you will work with the tables inside the database.
Create a database connection object named houses_con by using dbConnect(). Inside of dbConnect(), set SQLite() as the database driver and use here() to navigate to the data folder of the project directory to locate the database file miami_houses.sqlite. Apply dbListTables() to houses_con.
Question 1.1: How many tables are in the database?
Response 1.1: Two tables
### import and save data as object
## use dbConnect() and SQLite() to import the data file
houses_con <- dbConnect(
## driver
SQLite(),
## use here() to locate file in our project directory
here(
# folder
"data",
# file
"miami_houses.sqlite"
)
)
### list tables
## call function
dbListTables(houses_con)## [1] "distance" "overview"
Perform these operations.
Pipe houses_con to tbl() and set “overview” as the input.
Pipe houses_con to tbl() and set “distance” as the input.
Question 1.2: Answer these questions: (1) What is the first value of SALE_PRC in the overview table? (2) What is the first value of RAIL_DIST in the distance table?
Response 1.2: (1) 440000 (2) 2816
For this task, you will clean the data.
In one chained command, create a data table named overview by performing the following operations:
Examine overview with glimpse().
Question 2.1: Answer these questions: (1) How many observations are there in the overview table? (2) How many columns are there in the overview table? (3) How many ordered factor columns are there in the overview table?
Response 2.1: (1) 13776 Observations (2) 11 Columns (3) 2 (Months sold and Structure quality)
### export and clean table
## save
overview <- houses_con %>%
## preview table
tbl(
# table
"overview"
) %>%
## export
collect() %>%
## change variable names
rename_with(
# function
.fn = to_snake_case
) %>%
## update variable names
rename(
# new = old
parcel_id = parcelno,
# new = old
airplane_noise = avno_60_plus
) %>%
## update
mutate(
# month
month_sold = month(
# column
month_sold,
# labels
label = TRUE
),
# convert to factor
airplane_noise = as_factor(airplane_noise),
# recode levels
airplane_noise = fct_recode(
# factor
airplane_noise,
# levels
"No" = "0", "Yes" = "1"
),
# convert to ordinal factor
structure_quality = factor(
# factor
structure_quality,
# ordered
ordered = TRUE
),
# recode levels
structure_quality = fct_recode(
# factor
structure_quality,
# levels
"Very Bad" = "1", "Bad" = "2",
"Okay" = "3", "Good" = "4",
"Very Good" = "5"
)
)
## confirm changes
glimpse(overview)## Rows: 13,776
## Columns: 11
## $ parcel_id <dbl> 622280070620, 622280100460, 622280100470, 6222801005…
## $ latitude <dbl> 25.89103, 25.89132, 25.89133, 25.89176, 25.89182, 25…
## $ longitude <dbl> -80.16056, -80.15397, -80.15374, -80.15266, -80.1546…
## $ sale_prc <dbl> 440000, 349000, 800000, 988000, 755000, 630000, 1020…
## $ lnd_sqfoot <dbl> 9375, 9375, 9375, 12450, 12800, 9900, 10387, 10272, …
## $ tot_lvg_area <dbl> 1753, 1715, 2276, 2058, 1684, 1531, 1753, 1663, 1493…
## $ spec_feat_val <dbl> 0, 0, 49206, 10033, 16681, 2978, 23116, 34933, 11668…
## $ age <dbl> 67, 63, 61, 63, 42, 41, 63, 21, 56, 63, 64, 51, 56, …
## $ airplane_noise <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
## $ month_sold <ord> Aug, Sep, Feb, Sep, Jul, Feb, Feb, Sep, Mar, Nov, Fe…
## $ structure_quality <ord> Good, Good, Good, Good, Good, Good, Very Good, Good,…
In one chained command, create a data table named distance by performing the following operations:
Examine distance with glimpse().
Question 2.2: Answer these questions: (1) How many observations are there in the distance table? (2) How many columns are there in the distance table?
Response 2.2: (1) 13,776 Observations (2) 7 columns
### export and clean table
## save
distance <- houses_con %>%
## preview table
tbl(
# table
"distance"
) %>%
## export
collect() %>%
## change variable names
rename_with(
# function
.fn = to_snake_case
) %>%
## update variable names
rename(
# new = old
parcel_id = parcelno
)
## confirm changes
glimpse(distance)## Rows: 13,776
## Columns: 7
## $ parcel_id <dbl> 622280070620, 622280100460, 622280100470, 622280100530, 622…
## $ rail_dist <dbl> 2815.9, 4359.1, 4412.9, 4585.0, 4063.4, 2391.4, 3277.4, 311…
## $ ocean_dist <dbl> 12811.4, 10648.4, 10574.1, 10156.5, 10836.8, 13017.0, 11667…
## $ water_dist <dbl> 347.6, 337.8, 297.1, 0.0, 326.6, 188.9, 0.0, 10.5, 51.5, 9.…
## $ cntr_dist <dbl> 42815.3, 43504.9, 43530.4, 43797.5, 43599.7, 43135.1, 43598…
## $ subcntr_di <dbl> 37742.2, 37340.5, 37328.7, 37423.2, 37550.8, 38176.2, 37973…
## $ hwy_dist <dbl> 15954.9, 18125.0, 18200.5, 18514.4, 17903.4, 15687.2, 17068…
For this task, you will join data tables.
Create a new data table named houses_joined. Pipe overview into left_join() and set distance as the additional data input and by to “parcel_id”. Preview houses_joined with glimpse().
Question 3.1: Answer these questions: (1) How many observations are there in the houses_joined table? (2) How many columns are there in the houses_joined table?
Response 3.1: (1) 13776 Observations (2) 17 columns
### join tables
## save
houses_joined <- overview %>%
## left join
left_join(
# data
distance,
# key
by = "parcel_id"
)
## preview table
glimpse(houses_joined)## Rows: 13,776
## Columns: 17
## $ parcel_id <dbl> 622280070620, 622280100460, 622280100470, 6222801005…
## $ latitude <dbl> 25.89103, 25.89132, 25.89133, 25.89176, 25.89182, 25…
## $ longitude <dbl> -80.16056, -80.15397, -80.15374, -80.15266, -80.1546…
## $ sale_prc <dbl> 440000, 349000, 800000, 988000, 755000, 630000, 1020…
## $ lnd_sqfoot <dbl> 9375, 9375, 9375, 12450, 12800, 9900, 10387, 10272, …
## $ tot_lvg_area <dbl> 1753, 1715, 2276, 2058, 1684, 1531, 1753, 1663, 1493…
## $ spec_feat_val <dbl> 0, 0, 49206, 10033, 16681, 2978, 23116, 34933, 11668…
## $ age <dbl> 67, 63, 61, 63, 42, 41, 63, 21, 56, 63, 64, 51, 56, …
## $ airplane_noise <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
## $ month_sold <ord> Aug, Sep, Feb, Sep, Jul, Feb, Feb, Sep, Mar, Nov, Fe…
## $ structure_quality <ord> Good, Good, Good, Good, Good, Good, Very Good, Good,…
## $ rail_dist <dbl> 2815.9, 4359.1, 4412.9, 4585.0, 4063.4, 2391.4, 3277…
## $ ocean_dist <dbl> 12811.4, 10648.4, 10574.1, 10156.5, 10836.8, 13017.0…
## $ water_dist <dbl> 347.6, 337.8, 297.1, 0.0, 326.6, 188.9, 0.0, 10.5, 5…
## $ cntr_dist <dbl> 42815.3, 43504.9, 43530.4, 43797.5, 43599.7, 43135.1…
## $ subcntr_di <dbl> 37742.2, 37340.5, 37328.7, 37423.2, 37550.8, 38176.2…
## $ hwy_dist <dbl> 15954.9, 18125.0, 18200.5, 18514.4, 17903.4, 15687.2…
For this task, you will examine the data.
Summarize houses_joined by piping it into select() and excluding -parcel_id. Pipe the result to skim_without_charts() to view a summary of the variables. View the output to respond to questions.
Question 4.1: Answer these questions: (1) During what month were the most houses sold? (2) What was the median sale price of these Miami homes? (3) What is the average distance to the ocean (measured in feet) from these Miami homes?
Response 4.1: (1) June (1373) (2) 311100 (3) 31711
### overall summary
## call data
houses_joined %>%
## select columns
select(
# exclude
-parcel_id
) %>%
## compute summary statistics
skim_without_charts()| Name | Piped data |
| Number of rows | 13776 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| airplane_noise | 0 | 1 | FALSE | 2 | No: 13570, Yes: 206 |
| month_sold | 0 | 1 | TRUE | 12 | Jun: 1373, Aug: 1256, May: 1233, Apr: 1228 |
| structure_quality | 0 | 1 | TRUE | 5 | Goo: 7546, Bad: 4053, Ver: 1989, Ver: 172 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| latitude | 0 | 1 | 25.73 | 0.14 | 25.43 | 25.62 | 25.73 | 25.85 | 25.97 |
| longitude | 0 | 1 | -80.33 | 0.09 | -80.54 | -80.40 | -80.34 | -80.26 | -80.12 |
| sale_prc | 0 | 1 | 400746.45 | 317868.54 | 72000.00 | 235000.00 | 311100.00 | 430000.00 | 2650000.00 |
| lnd_sqfoot | 0 | 1 | 8623.96 | 6063.47 | 1248.00 | 5400.00 | 7500.00 | 9144.25 | 57064.00 |
| tot_lvg_area | 0 | 1 | 2061.50 | 814.36 | 854.00 | 1472.00 | 1881.00 | 2474.25 | 6287.00 |
| spec_feat_val | 0 | 1 | 9604.75 | 13923.55 | 0.00 | 823.50 | 2784.00 | 12655.75 | 175020.00 |
| age | 0 | 1 | 30.59 | 21.12 | 0.00 | 14.00 | 26.00 | 46.00 | 96.00 |
| rail_dist | 0 | 1 | 8354.85 | 6175.57 | 10.50 | 3299.98 | 7124.95 | 12108.85 | 29621.50 |
| ocean_dist | 0 | 1 | 31710.96 | 17609.83 | 236.10 | 18080.07 | 28595.55 | 44358.30 | 75744.90 |
| water_dist | 0 | 1 | 11994.25 | 11941.12 | 0.00 | 2681.77 | 6979.95 | 19265.92 | 50399.80 |
| cntr_dist | 0 | 1 | 68625.84 | 31990.84 | 3825.60 | 42955.22 | 66039.80 | 89542.55 | 159976.50 |
| subcntr_di | 0 | 1 | 41149.98 | 22167.83 | 1462.80 | 24043.05 | 41156.40 | 53973.90 | 110553.80 |
| hwy_dist | 0 | 1 | 7731.19 | 6074.76 | 90.20 | 3000.48 | 6158.85 | 10873.57 | 48167.30 |
Perform these two operations.
First, create a correlation table named corr_res by piping houses_joined to select() and selecting for all numeric variables excluding parcel_id. Pipe the result to correlate(), rearrange(), shave(). Pipe the result to stretch() and specify to remove missing values and duplicates. Pipe the result to mutate() and use across() to create factors of x and y and reverse their levels. Print corr_res to confirm it looks correct.
Second, create a plot named corr_plot that creates a heat map of the correlation table. Call ggplot() to specify the data and map the x-axis to x, the y-axis to y, and the fill to r. Add a geom_tile() layer and set color to “white”. Add a geom_text() layer to add the correlations rounded to two digits as text in the tiles. Add a scale_fill_gradient()2 layer and specify low to “red3”, middle to “white”, high to “deepskyblue3”, and other appropriate parameters. Add a labs() layer to remove any axes titles. Add a ggtitle() layer to add a title. Add a theme() layer to move the legend to the bottom, angle the x-axis labels to 45 degrees, se the panel background “gray20”, set the plot background to “gray100”, remove the panel grid, and center the title. Print the plot to view it.
Question 4.2: Answer these questions: (1) What is the correlation between sales price (sale_prc) and total living area (tot_lvg_area)? (2) What is the correlation between age of homes (age) and distance to center (center_dist)?
Response 4.2: (1) 0.67 (2) -0.55
### overall summary
## call data
houses_joined %>%
## select columns
select(
# exclude
-parcel_id
) %>%
## compute summary statistics
skim_without_charts()| Name | Piped data |
| Number of rows | 13776 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| airplane_noise | 0 | 1 | FALSE | 2 | No: 13570, Yes: 206 |
| month_sold | 0 | 1 | TRUE | 12 | Jun: 1373, Aug: 1256, May: 1233, Apr: 1228 |
| structure_quality | 0 | 1 | TRUE | 5 | Goo: 7546, Bad: 4053, Ver: 1989, Ver: 172 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| latitude | 0 | 1 | 25.73 | 0.14 | 25.43 | 25.62 | 25.73 | 25.85 | 25.97 |
| longitude | 0 | 1 | -80.33 | 0.09 | -80.54 | -80.40 | -80.34 | -80.26 | -80.12 |
| sale_prc | 0 | 1 | 400746.45 | 317868.54 | 72000.00 | 235000.00 | 311100.00 | 430000.00 | 2650000.00 |
| lnd_sqfoot | 0 | 1 | 8623.96 | 6063.47 | 1248.00 | 5400.00 | 7500.00 | 9144.25 | 57064.00 |
| tot_lvg_area | 0 | 1 | 2061.50 | 814.36 | 854.00 | 1472.00 | 1881.00 | 2474.25 | 6287.00 |
| spec_feat_val | 0 | 1 | 9604.75 | 13923.55 | 0.00 | 823.50 | 2784.00 | 12655.75 | 175020.00 |
| age | 0 | 1 | 30.59 | 21.12 | 0.00 | 14.00 | 26.00 | 46.00 | 96.00 |
| rail_dist | 0 | 1 | 8354.85 | 6175.57 | 10.50 | 3299.98 | 7124.95 | 12108.85 | 29621.50 |
| ocean_dist | 0 | 1 | 31710.96 | 17609.83 | 236.10 | 18080.07 | 28595.55 | 44358.30 | 75744.90 |
| water_dist | 0 | 1 | 11994.25 | 11941.12 | 0.00 | 2681.77 | 6979.95 | 19265.92 | 50399.80 |
| cntr_dist | 0 | 1 | 68625.84 | 31990.84 | 3825.60 | 42955.22 | 66039.80 | 89542.55 | 159976.50 |
| subcntr_di | 0 | 1 | 41149.98 | 22167.83 | 1462.80 | 24043.05 | 41156.40 | 53973.90 | 110553.80 |
| hwy_dist | 0 | 1 | 7731.19 | 6074.76 | 90.20 | 3000.48 | 6158.85 | 10873.57 | 48167.30 |
### examine correlations
## call data
corr_res <- houses_joined %>%
## select variables
select(
# numeric
where(is.numeric),
# exclude
-parcel_id
) %>%
## correlation matrix
correlate() %>%
## arrange values
rearrange() %>%
## keep lower triangle
shave() %>%
## pivot long
stretch(
# remove missing values
na.rm = TRUE,
# remove duplicates
remove.dups = TRUE
) %>%
## update
mutate(
# apply function to columns
across(
# columns
.cols = x:y,
# function
.fns = ~
# convert to factor
as_factor(.) %>%
# reverse levels
fct_rev()
)
)
## print
corr_res### correlation heat map
## save
corr_plot <- ggplot(
# data
corr_res,
# mapping
aes(
# x-axis
x = x,
# y-axis
y = y,
# fill
fill = r
)
) +
## tiles
geom_tile(
# border color
color = "white"
) +
## text
geom_text(
# mapping
aes(
# labels
label = number(
# variable
r,
# decimals
accuracy = 0.01
)
),
## size
size = 4,
## bold
fontface = "bold",
## color
color = "black"
) +
## alter fill of tiles
scale_fill_gradient2(
# small distance
low = "red1",
# medium distance
mid = "white",
# large distance
high = "deepskyblue1",
# medium indicator
midpoint = 0,
# limits
limit = c(-1, 1),
# title
name = "Correlation",
# missing values
na.value = "transparent"
) +
## labels
labs(
# x-axis
x = NULL,
# y-axis
y = NULL
) +
## title of plot
ggtitle(
# title
"Correlations"
) +
## update theme
theme(
# legend at bottom
legend.position = "bottom",
# alter panel background
panel.background = element_rect(
# fill
fill = "black"
),
# remove gridlines
panel.grid = element_blank(),
# alter plot background
plot.background = element_rect(
# fill
fill = "mintcream"
),
# adjust title position
plot.title = element_text(
# center
hjust = 0.5
),
# x-axis labels
axis.text.x = element_text(
# angle
angle = 45,
# justification
vjust = 0.6
)
)
## show plot
corr_plotCreate a plot named center_sale_plot to examine the relationship between the distance to the city center and sale price of homes. Call ggplot() and set houses_joined as the data input and map center_dist to the x-axis and sale_prc to the y-axis. Add a geom_point() layer with alpha set to 0.1. Add a geom_smooth() layer with method set to “loess”, formula set to “y ~ x”, and se set to FALSE. Use scale_x_continuous() to update the x-axis labels with label_number() and setting big.mark to “,”. Use scale_y_continuous() to update the y-axis labels with label_dollar() and setting scaleto 0.000001 and suffix to “M” and setting breaks to seq(0, 3000000, 500000). Adjust titles for the x and y axes to “Distance to Center (Feet)” and “Sale Price (Millions)” using labs(). Set the theme to theme_hc(). Display the plot.
Question 4.3: Is the relationship between distance to center and sale price negative or positive?
Response 4.3: It is a curve linear relationship: at short distances from center: as distance increases, price increases (positive relationship), at medium distances: Prices reach a peak and then at longer distances, as distance increases, price decreases (negative relationship).
### examine categorical features
## call data and mapping
center_sale_plot <- ggplot(
# data
houses_joined,
# mapping
aes(
# factor
x = lnd_sqfoot,
# outcome
y = sale_prc
)
) +
## points
geom_point(
# transparency
alpha = 0.3
) +
## smooth
geom_smooth(
# loess
method = "loess",
# formula
formula = "y ~ x",
# no ribbon
se = FALSE
) +
## scale x-axis
scale_x_continuous(
# labels
labels = label_number(
# big marks
big.mark = ","
)
) +
## scale y-axis
scale_y_continuous(
# breaks
breaks = seq(0, 3000000, 500000),
# labels
labels = label_dollar(
# scale
scale = 0.000001,
# suffix
suffix = "M"
)
) +
## labels
labs(
# x-axis
x = "Land Area (Feet)",
# y-axis
y = "Sale Price (Millions)"
) +
## set theme
theme_clean()
## show plot
center_sale_plotFor this task, you will split the data for training and testing.
Perform these operations.
First, set the random seed of your computer to 1795 using set.seed().
Second, create houses_split using initial_split(). Set the data input to houses_joined and the prop input to 0.65 Print houses_split.
Third, create houses_train by applying training() to houses_split. Print houses_train. Create houses_test by applying testing() to houses_split. Pring houses_test.
Question 5.1: Answer these questions: (1) How many observations are there in the training data? (2) How many observations are there in the testing data?
Response 5.1: (1) 8,954 (2) 4,822
### set random seed
## call function
set.seed(1795)
### create split
## split data
houses_split <- initial_split(
# data
houses_joined,
# split proportion
prop = 0.65
)
## examine initial split
houses_split## <Training/Testing/Total>
## <8954/4822/13776>
Set the random seed of your computer to 1805 using set.seed(). Then, create houses_train_folds with vfold_cv(). Set houses_train as the data input, set v to 3 and set repeats to 3. Print houses_train_folds.
Question 5.2: Answer these questions: (1) How many observations are there in the first analysis set? (2) How many observations are there in the first assessment set?
Response 5.2: (1) 5969 (2) 2985
### set random seed
## call function
set.seed(1805)
### repeated cross-folds on training set
## split training
houses_train_folds <- vfold_cv(
# training data
houses_train,
# number of folds
v = 3,
# repeats
repeats = 3
)
## examine folds
houses_train_foldsFor this task, you will create a model recipe.
Perform these operations.
First, create a model recipe named houses_recipe. Call recipe() and set the formula input to sale_prc ~ . and the data input to houses_train. Pipe the result to step_rm() to remove parcel_id. Pipe the result to step_normalize() to standardize any numeric predictors. Pipe the result to step_dummy() to create dummy coded variables for any nominal predictors.
Second, pipe houses_recipe to prep(). Pipe the result to bake() with new_data set to NULL. Pipe the result to print() with width set to Inf.
Third, create a metrics function named metrics. Call metric_set() and list the following metrics: mae, rmse, rsq, rsq_trad, and ccc.
Question 6.1: Answer these questions: (1) How many columns are there in the baked data? (2) Is the first home below or above the average land area (lnd_sqfoot)?
Response 6.1: (1) 29 (2) Below (-0.117)
### create modeling recipe
## save as object
houses_recipe <- recipe(
# identify outcomes
sale_prc ~
# all other variables
.,
# data
data = houses_train
) %>%
## remove variables
step_rm(
# list variables
parcel_id
) %>%
## normalize predictors
step_normalize(
# numeric predictors
all_numeric_predictors()
) %>%
## dummy variables
step_dummy(
# factor predictors
all_nominal_predictors()
)
### prep and bake recipe
## call recipe
houses_recipe %>%
## estimate parameters
prep() %>%
## apply computations to data
bake(
# training data
new_data = NULL
) %>%
## print table
print(
# all columns
width = Inf
)## # A tibble: 8,954 × 29
## latitude longitude lnd_sqfoot tot_lvg_area spec_feat_val age rail_dist
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.529 -0.0616 -0.117 -0.528 -0.621 1.46 -1.19
## 2 -0.755 -0.328 0.00385 -0.566 1.14 1.08 -0.420
## 3 1.28 1.10 -0.589 -1.15 -0.579 1.70 -1.25
## 4 1.09 1.75 0.265 -0.878 -0.285 0.0779 -1.25
## 5 0.131 -1.31 -0.548 0.935 1.10 -0.876 0.862
## 6 0.820 1.73 2.10 0.171 -0.558 1.27 -0.615
## 7 -0.514 -1.25 -0.630 0.500 -0.453 -1.40 -0.361
## 8 1.24 1.49 0.117 -0.464 -0.535 -0.447 0.161
## 9 -0.599 -0.151 1.16 1.77 1.26 0.0302 -1.03
## 10 -0.480 -0.553 -0.178 -0.824 0.979 -0.495 -1.33
## ocean_dist water_dist cntr_dist subcntr_di hwy_dist sale_prc
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.887 0.155 -0.133 -1.21 0.424 276000
## 2 -0.855 -0.153 0.293 -0.588 -0.104 260000
## 3 0.233 -0.991 -0.596 0.369 -0.193 101000
## 4 -0.835 -0.791 -0.932 -0.245 0.689 260000
## 5 1.91 2.04 0.458 0.294 1.01 465000
## 6 -0.772 -0.988 -1.36 -0.775 0.432 840000
## 7 0.829 1.99 0.734 0.0798 1.62 432000
## 8 -0.428 -1.00 -0.706 0.210 -0.516 340000
## 9 -0.935 0.220 0.00325 -1.01 0.787 630000
## 10 -0.188 1.13 0.163 -0.810 -1.21 325000
## airplane_noise_Yes month_sold_01 month_sold_02 month_sold_03 month_sold_04
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 -0.209 -0.155 0.348 -0.145
## 2 0 -0.209 -0.155 0.348 -0.145
## 3 0 0.209 -0.155 -0.348 -0.145
## 4 0 0.209 -0.155 -0.348 -0.145
## 5 0 0.125 -0.265 -0.265 0.134
## 6 0 -0.125 -0.265 0.265 0.134
## 7 0 -0.209 -0.155 0.348 -0.145
## 8 0 -0.0418 -0.319 0.0976 0.313
## 9 0 -0.376 0.228 0.0418 -0.302
## 10 0 -0.125 -0.265 0.265 0.134
## month_sold_05 month_sold_06 month_sold_07 month_sold_08 month_sold_09
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.230 0.373 -0.137 -0.255 0.474
## 2 -0.230 0.373 -0.137 -0.255 0.474
## 3 0.230 0.373 0.137 -0.255 -0.474
## 4 0.230 0.373 0.137 -0.255 -0.474
## 5 0.349 0.0597 -0.336 -0.290 0.160
## 6 -0.349 0.0597 0.336 -0.290 -0.160
## 7 -0.230 0.373 -0.137 -0.255 0.474
## 8 -0.159 -0.299 0.230 0.274 -0.329
## 9 0.452 -0.463 0.370 -0.239 0.124
## 10 -0.349 0.0597 0.336 -0.290 -0.160
## month_sold_10 month_sold_11 structure_quality_1 structure_quality_2
## <dbl> <dbl> <dbl> <dbl>
## 1 -0.409 0.196 -0.316 -0.267
## 2 -0.409 0.196 -0.316 -0.267
## 3 -0.409 -0.196 -0.316 -0.267
## 4 -0.409 -0.196 -0.316 -0.267
## 5 0.491 0.393 0.316 -0.267
## 6 0.491 -0.393 0.316 -0.267
## 7 -0.409 0.196 0.316 -0.267
## 8 -0.229 0.550 0.316 -0.267
## 9 -0.0491 0.0131 -0.316 -0.267
## 10 0.491 -0.393 0.316 -0.267
## structure_quality_3 structure_quality_4
## <dbl> <dbl>
## 1 0.632 -0.478
## 2 0.632 -0.478
## 3 0.632 -0.478
## 4 0.632 -0.478
## 5 -0.632 -0.478
## 6 -0.632 -0.478
## 7 -0.632 -0.478
## 8 -0.632 -0.478
## 9 0.632 -0.478
## 10 -0.632 -0.478
## # ℹ 8,944 more rows
For this task, you will train the models of interest.
Perform these operations.
First, create a workflow object named lm_wflow. Pipe workflow() to add_model(). In add_model(), call linear_reg() and pipe it to set_engine(“lm”). Pipe the result of add_model() to add_recipe(). In add_recipe(), specify houses_recipe as the input.
Second, create an object named lm_fit_folds. Pipe lm_wflow to fit_resamples(). In fit_resamples(), set resamples to houses_train_folds and the metrics to metrics.
Third, call collect_metrics() on lm_fit_folds.
Question 7.1: Answer three questions: (1) What is the average ccc across the folds for these linear regression models? (2) What is the average rsq across the folds for these linear regression models?
Response 7.1: (1) 0.844 (2) 0.730
### linear model
## save
lm_wflow <- workflow() %>%
## add model
add_model(
# regression specification
linear_reg() %>%
# engine
set_engine("lm")
) %>%
## add recipe
add_recipe(
# recipe
houses_recipe
)
### estimate model on folds
## save as object
lm_fit_folds <-
## workflow
lm_wflow %>%
## fit
fit_resamples(
# folds
resamples = houses_train_folds,
# metrics
metrics = metrics
)
### show metrics
## call function
collect_metrics(
# results
lm_fit_folds
)Perform these operations.
First, create an object named lm_fit. Pipe lm_wflow to fit() and set houses_train as the input.
Second, pipe lm_fit to extract_fit_parsnip(). Pipe the result to tidy(). Pipe the result to arrange() and arrange the rows in descending order by estimate. Pipe the result to print() and set n to Inf.
Question 7.2: Answer these questions: (1) What is the correct interpretation of the ocean distance (ocean_dist) regression coefficient? (2) Is the highway distance (hwy_dist) or center distance (center_dist) a stronger predictor of sales prices?
Response 7.2: (1) For every 1-unit increase in distance from the ocean, the sale price increases by approximately $49,948, holding all other variables constant. (2) Highway distance is ranked higher than center distance, has a lower standard error and has a more precise relationship with sale prices and is therefore a stronger predictor of sales prices.
### fit to complete training data
## save as object
lm_fit <-
## workflow
lm_wflow %>%
## fit
fit(
# data
houses_train
)
### view coefficients
## call model object
lm_fit %>%
## pull fit
extract_fit_parsnip() %>%
## coefficients
tidy() %>%
## arrange rows
arrange(
# descending
desc(estimate)
) %>%
## print table
print(
# all rows
n = Inf
)## # A tibble: 29 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 structure_quality_4 722764. 39420. 18.3 9.73e- 74
## 2 (Intercept) 575946. 11489. 50.1 0
## 3 structure_quality_1 280793. 11035. 25.4 6.11e-138
## 4 longitude 178368. 12099. 14.7 1.27e- 48
## 5 tot_lvg_area 145448. 2411. 60.3 0
## 6 structure_quality_3 80547. 6026. 13.4 2.33e- 40
## 7 ocean_dist 49948. 8089. 6.17 6.92e- 10
## 8 spec_feat_val 38774. 2109. 18.4 3.98e- 74
## 9 hwy_dist 29788. 2028. 14.7 2.81e- 48
## 10 rail_dist 27396. 2051. 13.4 2.52e- 40
## 11 lnd_sqfoot 21614. 2191. 9.86 7.92e- 23
## 12 water_dist 13706. 3317. 4.13 3.63e- 5
## 13 month_sold_04 8502. 6175. 1.38 1.69e- 1
## 14 month_sold_03 6983. 6155. 1.13 2.57e- 1
## 15 month_sold_09 4431. 5821. 0.761 4.47e- 1
## 16 month_sold_01 3827. 6365. 0.601 5.48e- 1
## 17 month_sold_10 3171. 5741. 0.552 5.81e- 1
## 18 month_sold_07 723. 6051. 0.120 9.05e- 1
## 19 month_sold_05 367. 6161. 0.0596 9.52e- 1
## 20 month_sold_11 -48.0 5692. -0.00844 9.93e- 1
## 21 month_sold_08 -4278. 5944. -0.720 4.72e- 1
## 22 month_sold_06 -5505. 6100. -0.903 3.67e- 1
## 23 month_sold_02 -11414. 6235. -1.83 6.72e- 2
## 24 age -30758. 2312. -13.3 5.24e- 40
## 25 cntr_dist -41173. 11453. -3.59 3.26e- 4
## 26 subcntr_di -58946. 8133. -7.25 4.60e- 13
## 27 airplane_noise_Yes -86150. 14718. -5.85 4.98e- 9
## 28 latitude -146625. 8195. -17.9 2.31e- 70
## 29 structure_quality_2 -448226. 30555. -14.7 3.67e- 48
Perform these operations.
First, create a model specification object named elastic_net_spec. Call linear_reg() and set penalty and mixture to tune(). Pipe the result to set_engine(“glmnet”).
Second, create a grid object named elastic_net_grid. Pipe elastic_net_spec to extract_parameter_set_dials(). Pipe the result to grid_regular() with levels set to 6.
Third, create a workflow object named elastic_net_wflow. Pipe workflow() to add_model() and specify elastic_net_spec as the input. Pipe the result of add_model() to add_recipe(). In add_recipe(), specify houses_recipe as the input.
Fourth, create an object named elastic_net_tune. Pipe elastic_net_wflow to tune_grid(). In tune_grid(), set resamples to houses_train_folds, the grid to elastic_net_grid, and the metrics to metrics.
Fifth, call autoplot() and set elastic_net_tune as the first input and c(“mae”, “rsq”, “ccc”) as the metric input. Update the theme with theme() and move the legend to the top of the plot.
Sixth, call collect_metrics() on elastic_net_tune. Pipe the result to print() with n set to 40 and width to Inf.
Seventh, call show_best(). Set elastic_net_tune as the first input and “mae” as the metric input.
Question 7.3: Answer these questions: (1) Examining the plot, what is approximately the highest R-squared (rsq) value? (2) What penalty and mixture values are the best based on average mae?
Response 7.3: (1) 0.73 (2) Penalty = 0.0000000001 ; Mixture 0.05
#### elastic net
### model specification
## save as object
elastic_net_spec <-
## regression specification
linear_reg(
# tune penalty
penalty = tune(),
# tune mixture
mixture = tune()
) %>%
## specify engine
set_engine("glmnet")
### view a tuning grid
## call model specification
elastic_net_grid <- elastic_net_spec %>%
## parameters
extract_parameter_set_dials() %>%
## grid
grid_regular(
# levels
levels = 6
)
### create initial workflow
## save as object
elastic_net_wflow <- workflow() %>%
## add model
add_model(
# specification
elastic_net_spec
) %>%
## add recipe
add_recipe(
# recipe
houses_recipe
)
### estimate models
## save as object
elastic_net_tune <-
## workflow
elastic_net_wflow %>%
## tune
tune_grid(
# folds
resamples = houses_train_folds,
# grid
grid = elastic_net_grid,
# metrics
metrics = metrics
)
### plot metrics
## produce plot
## produce plot
autoplot(
# results
elastic_net_tune,
# metrics
metric = c("mae", "rsq", "ccc")
) +
## update theme
theme(
# move legend
legend.position = "top"
)### show metrics
## call function
collect_metrics(
# results
elastic_net_tune
) %>%
## print table
print(
# rows
n = 40,
# all columns
width = Inf
)## # A tibble: 180 × 8
## penalty mixture .metric .estimator mean n std_err
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 0.05 ccc standard 0.844 9 0.00247
## 2 0.0000000001 0.05 mae standard 106186. 9 487.
## 3 0.0000000001 0.05 rmse standard 163407. 9 1773.
## 4 0.0000000001 0.05 rsq standard 0.730 9 0.00349
## 5 0.0000000001 0.05 rsq_trad standard 0.730 9 0.00347
## 6 0.0000000001 0.24 ccc standard 0.844 9 0.00248
## 7 0.0000000001 0.24 mae standard 106294. 9 488.
## 8 0.0000000001 0.24 rmse standard 163418. 9 1789.
## 9 0.0000000001 0.24 rsq standard 0.730 9 0.00349
## 10 0.0000000001 0.24 rsq_trad standard 0.730 9 0.00348
## 11 0.0000000001 0.43 ccc standard 0.844 9 0.00248
## 12 0.0000000001 0.43 mae standard 106283. 9 486.
## 13 0.0000000001 0.43 rmse standard 163417. 9 1789.
## 14 0.0000000001 0.43 rsq standard 0.730 9 0.00349
## 15 0.0000000001 0.43 rsq_trad standard 0.730 9 0.00347
## 16 0.0000000001 0.62 ccc standard 0.844 9 0.00248
## 17 0.0000000001 0.62 mae standard 106265. 9 482.
## 18 0.0000000001 0.62 rmse standard 163410. 9 1786.
## 19 0.0000000001 0.62 rsq standard 0.730 9 0.00349
## 20 0.0000000001 0.62 rsq_trad standard 0.730 9 0.00347
## 21 0.0000000001 0.81 ccc standard 0.844 9 0.00248
## 22 0.0000000001 0.81 mae standard 106240. 9 488.
## 23 0.0000000001 0.81 rmse standard 163408. 9 1785.
## 24 0.0000000001 0.81 rsq standard 0.730 9 0.00349
## 25 0.0000000001 0.81 rsq_trad standard 0.730 9 0.00347
## 26 0.0000000001 1 ccc standard 0.844 9 0.00249
## 27 0.0000000001 1 mae standard 106213. 9 487.
## 28 0.0000000001 1 rmse standard 163413. 9 1787.
## 29 0.0000000001 1 rsq standard 0.730 9 0.00349
## 30 0.0000000001 1 rsq_trad standard 0.730 9 0.00348
## 31 0.00000001 0.05 ccc standard 0.844 9 0.00247
## 32 0.00000001 0.05 mae standard 106186. 9 487.
## 33 0.00000001 0.05 rmse standard 163407. 9 1773.
## 34 0.00000001 0.05 rsq standard 0.730 9 0.00349
## 35 0.00000001 0.05 rsq_trad standard 0.730 9 0.00347
## 36 0.00000001 0.24 ccc standard 0.844 9 0.00248
## 37 0.00000001 0.24 mae standard 106294. 9 488.
## 38 0.00000001 0.24 rmse standard 163418. 9 1789.
## 39 0.00000001 0.24 rsq standard 0.730 9 0.00349
## 40 0.00000001 0.24 rsq_trad standard 0.730 9 0.00348
## .config
## <chr>
## 1 pre0_mod01_post0
## 2 pre0_mod01_post0
## 3 pre0_mod01_post0
## 4 pre0_mod01_post0
## 5 pre0_mod01_post0
## 6 pre0_mod02_post0
## 7 pre0_mod02_post0
## 8 pre0_mod02_post0
## 9 pre0_mod02_post0
## 10 pre0_mod02_post0
## 11 pre0_mod03_post0
## 12 pre0_mod03_post0
## 13 pre0_mod03_post0
## 14 pre0_mod03_post0
## 15 pre0_mod03_post0
## 16 pre0_mod04_post0
## 17 pre0_mod04_post0
## 18 pre0_mod04_post0
## 19 pre0_mod04_post0
## 20 pre0_mod04_post0
## 21 pre0_mod05_post0
## 22 pre0_mod05_post0
## 23 pre0_mod05_post0
## 24 pre0_mod05_post0
## 25 pre0_mod05_post0
## 26 pre0_mod06_post0
## 27 pre0_mod06_post0
## 28 pre0_mod06_post0
## 29 pre0_mod06_post0
## 30 pre0_mod06_post0
## 31 pre0_mod07_post0
## 32 pre0_mod07_post0
## 33 pre0_mod07_post0
## 34 pre0_mod07_post0
## 35 pre0_mod07_post0
## 36 pre0_mod08_post0
## 37 pre0_mod08_post0
## 38 pre0_mod08_post0
## 39 pre0_mod08_post0
## 40 pre0_mod08_post0
## # ℹ 140 more rows
Perform these operations.
First, create a workflow object named elastic_net_wflow_final. Pipe elastic_net_wflow to finalize_workflow(). Inside of finalize_workflow(), call select_best() and set elastic_net_tune as the first input and “mae” as the metric input.
Second, create an object named elastic_net_fit. Pipe elastic_net_wflow_final to fit() and set houses_train as the input.
Second, pipe elastic_net_fit to extract_fit_parsnip(). Pipe the result to tidy(). Pipe the result to arrange() and arrange the rows in descending order by estimate. Pipe the result to print() and set n to Inf.
Question 7.4: What is the correct interpretation of the airplane noise (airplane_noise_Yes) regression coefficient?
Response 7.4: Houses with airplane noise have sale prices that are approximately $86,118 lower than houses without airplane noise, holding all other variables constant. Airplane noise is considered undesirable, so it reduces property values.
### create final workflow
## save as object
elastic_net_wflow_final <-
## initial workflow
elastic_net_wflow %>%
## finalize workflow
finalize_workflow(
# hyperparameters
select_best(
# results
elastic_net_tune,
# metric
metric = "ccc"
)
)
### fit to complete training data
## save as object
elastic_net_fit <-
## workflow
elastic_net_wflow_final %>%
## fit
fit(
# data
houses_train
)
### view coefficients
## call model object
elastic_net_fit %>%
## pull fit
extract_fit_parsnip() %>%
## coefficients
tidy() %>%
## arrange rows
arrange(
# descending
desc(estimate)
) %>%
## print table
print(
# all rows
n = Inf
)## # A tibble: 29 × 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 structure_quality_4 695889. 0.0000000001
## 2 (Intercept) 569285. 0.0000000001
## 3 structure_quality_1 276884. 0.0000000001
## 4 longitude 163927. 0.0000000001
## 5 tot_lvg_area 145337. 0.0000000001
## 6 structure_quality_3 79135. 0.0000000001
## 7 ocean_dist 40514. 0.0000000001
## 8 spec_feat_val 38837. 0.0000000001
## 9 hwy_dist 29737. 0.0000000001
## 10 rail_dist 27389. 0.0000000001
## 11 lnd_sqfoot 21876. 0.0000000001
## 12 water_dist 13977. 0.0000000001
## 13 month_sold_04 8322. 0.0000000001
## 14 month_sold_03 6658. 0.0000000001
## 15 month_sold_09 4259. 0.0000000001
## 16 month_sold_01 3698. 0.0000000001
## 17 month_sold_10 3096. 0.0000000001
## 18 month_sold_07 478. 0.0000000001
## 19 month_sold_05 173. 0.0000000001
## 20 month_sold_11 0 0.0000000001
## 21 month_sold_08 -4193. 0.0000000001
## 22 month_sold_06 -5375. 0.0000000001
## 23 month_sold_02 -11356. 0.0000000001
## 24 age -30842. 0.0000000001
## 25 cntr_dist -49946. 0.0000000001
## 26 subcntr_di -52192. 0.0000000001
## 27 airplane_noise_Yes -86118. 0.0000000001
## 28 latitude -138348. 0.0000000001
## 29 structure_quality_2 -425947. 0.0000000001
Perform these operations.
First, create a model specification object named rf_spec. Call rand_forest() and set mode to “regression”, mtry to tune(), trees to 500, and min_n to tune(). Pipe the result to set_engine(“ranger”) and set importance to “impurity”.
Second, create a grid object named rf_grid. Call grid_regular(). Inside of grid_regular(), call mtry(c(2, 20)), min_n(c(5, 50)), and set levels to 3.
Third, create a workflow object named rf_wflow. Pipe workflow() to add_model() and specify rf_spec as the input. Pipe the result of add_model() to add_recipe(). In add_recipe(), specify houses_recipe as the input.
Fourth, set-up the timer for parallel processing by calling tic(). Then, create n_cores with availableCores(). Then, call registerDoFuture(). Then, create clust_work by calling makeClusterPSOCK() and setting n_cores as the first input and setting autoStop to TRUE. Then, call plan(cluster, workers = clust_work). Then, create an object named rf_tune. Pipe rf_wflow to tune_grid(). In tune_grid(), set resamples to houses_train_folds, the grid to rf_grid, and the metrics to metrics. Then, call plan(sequential). Then, close the timer with toc().
Fifth, call autoplot() and set rf_tune as the first input and c(“mae”, “rsq”, “ccc”) as the metric input. Update the theme with theme() and move the legend to the top of the plot.
Sixth, call collect_metrics() on rf_tune. Pipe the result to print() with n set to Inf and width to Inf.
Seventh, call show_best(). Set rf_tune as the first input and “mae” as the metric input.
Question 7.5: Answer these questions: (1) Examining the plot, what is approximately the highest R-squared (rsq) value? (2) What number of features (mtry) and minimal number of observations (min_n) are the best based on average mae?
Response 7.5: (1) 0.90 (2) Mtry= 11 and min_n = 5
#### random forest
### model specification
## save as object
rf_spec <-
## rf specification
rand_forest(
# regression
mode = "regression",
# selected features
mtry = tune(),
# number of trees
trees = 500,
# minimal node size
min_n = tune()
) %>%
## specify engine
set_engine(
# engine
"ranger",
# variable importance
importance = "impurity"
)
### view a tuning grid
## save
rf_grid <- grid_regular(
# number of features
mtry(
# range
c(2, 20)
),
# number of units
min_n(
# range
c(5, 50)
),
# combinations
levels = 3
)
### create initial workflow
## save as object
rf_wflow <- workflow() %>%
## add model
add_model(
# specification
rf_spec
) %>%
## add recipe
add_recipe(
# recipe
houses_recipe
)
### begin timer
## call function
tic()
### number of logical cores
## save
n_cores <- availableCores()
### register parallel backend
## call function
registerDoFuture()
### create compute clusters
## save
clust_work <- makeClusterPSOCK(
# cores
n_cores,
# stop cluster
autoStop = TRUE
)
### create parallel plan
## call function
plan(
# cluster
cluster,
# workers
workers = clust_work
)
### estimate models
## save as object
rf_tune <-
## workflow
rf_wflow %>%
## fit
tune_grid(
# folds
resamples = houses_train_folds,
# metrics
metrics = metrics,
# grid
grid = rf_grid
)
### convert to sequential
## call function
plan(sequential)
### end timer
## call function
toc()## 204.93 sec elapsed
### plot metrics
## produce plot
## produce plot
autoplot(
# results
rf_tune,
# metrics
metric = c("mae", "rsq", "ccc")
) +
## update theme
theme(
# move legend
legend.position = "top"
)### show metrics
## call function
collect_metrics(
# results
rf_tune
) %>%
## print table
print(
# all rows
n = Inf,
# all columns
width = Inf
)## # A tibble: 45 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 5 ccc standard 0.896 9 0.00128 pre0_mod1_post0
## 2 2 5 mae standard 65187. 9 358. pre0_mod1_post0
## 3 2 5 rmse standard 126922. 9 1022. pre0_mod1_post0
## 4 2 5 rsq standard 0.870 9 0.00228 pre0_mod1_post0
## 5 2 5 rsq_trad standard 0.837 9 0.00189 pre0_mod1_post0
## 6 2 27 ccc standard 0.876 9 0.00165 pre0_mod2_post0
## 7 2 27 mae standard 70177. 9 378. pre0_mod2_post0
## 8 2 27 rmse standard 136016. 9 1294. pre0_mod2_post0
## 9 2 27 rsq standard 0.856 9 0.00259 pre0_mod2_post0
## 10 2 27 rsq_trad standard 0.813 9 0.00223 pre0_mod2_post0
## 11 2 50 ccc standard 0.858 9 0.00185 pre0_mod3_post0
## 12 2 50 mae standard 74513. 9 406. pre0_mod3_post0
## 13 2 50 rmse standard 143712. 9 1292. pre0_mod3_post0
## 14 2 50 rsq standard 0.844 9 0.00302 pre0_mod3_post0
## 15 2 50 rsq_trad standard 0.791 9 0.00250 pre0_mod3_post0
## 16 11 5 ccc standard 0.942 9 0.00147 pre0_mod4_post0
## 17 11 5 mae standard 48696. 9 322. pre0_mod4_post0
## 18 11 5 rmse standard 101118. 9 1219. pre0_mod4_post0
## 19 11 5 rsq standard 0.900 9 0.00228 pre0_mod4_post0
## 20 11 5 rsq_trad standard 0.896 9 0.00236 pre0_mod4_post0
## 21 11 27 ccc standard 0.936 9 0.00164 pre0_mod5_post0
## 22 11 27 mae standard 51160. 9 357. pre0_mod5_post0
## 23 11 27 rmse standard 105430. 9 1309. pre0_mod5_post0
## 24 11 27 rsq standard 0.892 9 0.00258 pre0_mod5_post0
## 25 11 27 rsq_trad standard 0.887 9 0.00264 pre0_mod5_post0
## 26 11 50 ccc standard 0.928 9 0.00166 pre0_mod6_post0
## 27 11 50 mae standard 53976. 9 348. pre0_mod6_post0
## 28 11 50 rmse standard 110795. 9 1341. pre0_mod6_post0
## 29 11 50 rsq standard 0.883 9 0.00264 pre0_mod6_post0
## 30 11 50 rsq_trad standard 0.876 9 0.00264 pre0_mod6_post0
## 31 20 5 ccc standard 0.942 9 0.00146 pre0_mod7_post0
## 32 20 5 mae standard 49187. 9 314. pre0_mod7_post0
## 33 20 5 rmse standard 101615. 9 1192. pre0_mod7_post0
## 34 20 5 rsq standard 0.897 9 0.00229 pre0_mod7_post0
## 35 20 5 rsq_trad standard 0.895 9 0.00235 pre0_mod7_post0
## 36 20 27 ccc standard 0.937 9 0.00149 pre0_mod8_post0
## 37 20 27 mae standard 51434. 9 316. pre0_mod8_post0
## 38 20 27 rmse standard 105731. 9 1202. pre0_mod8_post0
## 39 20 27 rsq standard 0.889 9 0.00231 pre0_mod8_post0
## 40 20 27 rsq_trad standard 0.887 9 0.00236 pre0_mod8_post0
## 41 20 50 ccc standard 0.930 9 0.00155 pre0_mod9_post0
## 42 20 50 mae standard 54140. 9 312. pre0_mod9_post0
## 43 20 50 rmse standard 110506. 9 1228. pre0_mod9_post0
## 44 20 50 rsq standard 0.880 9 0.00242 pre0_mod9_post0
## 45 20 50 rsq_trad standard 0.876 9 0.00245 pre0_mod9_post0
Perform these operations.
First, create a workflow object named rf_wflow_final. Pipe rf_wflow to finalize_workflow(). Inside of finalize_workflow(), call select_best() and set rf_tune as the first input and “mae” as the metric input.
Second, create an object named rf_fit. Pipe rf_wflow_final to fit() and set houses_train as the input.
Second, pipe rf_fit to extract_fit_parsnip(). Pipe the result to vip().
Question 7.6: Answer these questions: (1) Which predictor is the most important? (2) Which predictor is second most important?
Response 7.6: (1) Total Living Area (2) Ocean Distance
### create final workflow
## save as object
rf_wflow_final <-
## initial workflow
rf_wflow %>%
## finalize workflow
finalize_workflow(
# hyperparameters
select_best(
# results
rf_tune,
# metric
metric = "mae"
)
)
### fit to complete training data
## save as object
rf_fit <-
## workflow
rf_wflow_final %>%
## fit
fit(
# data
houses_train
)
### view coefficients
## call model object
rf_fit %>%
## pull fit
extract_fit_parsnip() %>%
## prediction importance
vip()For this task, you will evaluate estimated models on the testing data.
Perform these operations.
First, create a data table object named lm_pred. Call predict() and set lm_fit as the model input and houses_test as the new_data input. Pipe the result to rename() and rename .pred to lm_pred.
Second, create a data table object named elastic_net_pred. Call predict() and set elast_net_fit as the model input and houses_test as the new_data input. Pipe the result to rename() and rename .pred to elastic_net_pred.
Third, create a data table object named rf_pred. Call predict() and set rf_fit as the model input and houses_test as the new_data input. Pipe the result to rename() and rename .pred to rf_pred.
Fourth, create a data table object named houses_test_pred. Pipe houses_test to select() and select sale_prc. Pipe the result to bind_cols() and set lm_pred, elastic_net_pred, and rf_pred as inputs.
Fifth, call map(). As the first input, call set_names(). Inside of set_names(), call houses_test_pred, then pipe it to select() to remove sale_prc followed by a pipe to names(). As the second input to map(), call metrics() with data set houses_test_pred, truth set to sale_prc, and the estimate set to all_of(.x). Pipe the result to list_rbind() with names_to set to “model”. Pipe the result to pivot_wider() and set id_cols to model, names_from to .metric, and values_from to .estimate.
Question 8.1: Answer these questions: (1) What is the mean absolute error (mae) of the elastic net model? (2) Irrespective of metric, which model performs the best?
Response 8.1: (1) 106013 (2) Random Forrest performs the best across all metrics
### linear model
## save
lm_pred <- predict(
# fitted model
lm_fit,
# test data
new_data = houses_test
) %>%
## rename
rename(
# new = old
lm_pred = .pred
)
### elastic net
## save
elastic_net_pred <- predict(
# fitted model
elastic_net_fit,
# test data
new_data = houses_test
) %>%
## rename
rename(
# new = old
elastic_net_pred = .pred
)
### random forest
## save
rf_pred <- predict(
# fitted model
rf_fit,
# test data
new_data = houses_test
) %>%
## rename
rename(
# new = old
rf_pred = .pred
)
### observed and predicted values
## save as object
houses_test_pred <- houses_test %>%
## select observed values
select(sale_prc) %>%
## bind columns
bind_cols(
# linear model
lm_pred,
# elastic net
elastic_net_pred,
# random forest
rf_pred
)
### compute metrics on test data
## map
map(
# columns
set_names(
# data
houses_test_pred %>%
# remove observed values
select(-sale_prc) %>%
# extract column names
names(),
),
# function
~ metrics(
# data
data = houses_test_pred,
# observed
truth = sale_prc,
# predicted
estimate = all_of(.x)
)
) %>%
## bind as rows
list_rbind(
# names
names_to = "model"
) %>%
## pivot wide
pivot_wider(
# model
id_cols = model,
# names
names_from = .metric,
# values
values_from = .estimate
)Perform these operations.
First, create a data table named houses_test_pred_long. Pipe houses_test_pred to pivot_longer(). Inside of pivot_longer() set cols to lm_pred:rf_pred, names_to to “model”, and values_to to “pred”. Pipe the result to mutate() and convert model to a factor variable with as_factor().
Second, create a plot named pred_plot. Call ggplot() and set houses_test_pred_long as the data input and map pred to the x-axis, and sale_prc to the y-axis. Add a geom_point() layer and set to alpha to 0.1. Add a first geom_abline() and set lty to 1, color to “red”, and linewidth to 1. Add facets with facet_wrap() and set vars(model) as the first input, nrow to 2, and labeller to as_labeller(). Inside of as_labeller(), specify set_names(c(“Linear Model”, “Elastic Net”, “Random Forest”)) as the first input, and, for the second input, pipe house_test_pred_long to pull() to extract the model values and pipe the result to levels(). Use scale_x_continuous() to update the y-axis labels with label_dollar() and setting scaleto 0.000001 and suffix to “M” and setting breaks to seq(0, 3000000, 500000). Use scale_y_continuous() to update the y-axis labels with label_dollar() and setting scaleto 0.000001 and suffix to “M” and setting breaks to seq(0, 3000000, 500000). Adjust titles for the x and y axes to “Predicted Sale Price (Millions)” and “Observed Sale Price (Millions)” using labs().
View pred_plot.
Question 8.2: Which model’s predicted sale prices most closely match the observed sale prices?
Response 8.2: Random forest - the closer the points are to the red diagonal line which represents perfect predictions, the better the model performs.
### make long table for plots
## save as object
houses_test_pred_long <- houses_test_pred %>%
## pivot long for plot
pivot_longer(
# columns to pivot
cols = lm_pred:rf_pred,
# names
names_to = "model",
# values
values_to = "pred"
) %>%
## update variable
mutate(
# convert to factor
model = as_factor(model)
)
#### create plot
### observed versus predicted values
## save as object
model_pred_plot <- ggplot(
# data
houses_test_pred_long,
# mapping
aes(
# predicted values
x = pred,
# observed values
y = sale_prc
)
) +
## add points
geom_point(
# transparency
alpha = 0.25
) +
## add one-to-one diagonal line
geom_abline(
# solid
lty = 1,
# color
color = "red",
# thick
linewidth = 1
) +
## add facet
facet_wrap(
# variable
vars(model),
# number of rows
nrow = 2,
# labels
labeller = as_labeller(
# look-up table
set_names(
# vector elements
c(
"Linear Model", "Elastic Net", "Random Forest"
),
# names of elements
houses_test_pred_long %>%
# extract
pull(model) %>%
# levels
levels()
)
)
) +
## scale x-axis
scale_x_continuous(
# breaks
breaks = seq(0, 3000000, 500000),
# format
labels = label_dollar(
# scale
scale = 0.000001,
# suffix
suffix = "M"
)
) +
## scale y-axis
scale_y_continuous(
# breaks
breaks = seq(0, 3000000, 500000),
# format
labels = label_dollar(
# scale
scale = 0.000001,
# suffix
suffix = "M"
)
) +
## labels
labs(
# x-axis
x = "Predicted Sale Price (Millions)",
# y-axis
y = "Observed Sale Price (Millions)"
)
## display plot
model_pred_plotCreate a new database connection named houses_con_update. Call dbConnect() and use SQLite() as the driver input and use here() to navigate to the data folder of the project directory and specify a new database file named “miami_houses_update.sqlite”.
Copy the existing database by calling sqliteCopyDatabase(). Inside of sqliteCopyDatabase(), set houses_con and houses_con_update as the inputs.
Copy the houses_joined to the new database by calling copy_to(). Inside of copy_to(), set houses_con_update as the database connection input, set houses_joined as the data table input, name the database table “complete”, set temporary to FALSE, and set analyze to FALSE.
Check that the “complete” table exists in the database. Pipe houses_con_update to tbl() and set “complete” as the input.
Disconnect from both databases. Call dbDisconnect() twice to disconnect from houses_con and houses_con_update.
### create empty database
## save
houses_con_update <- dbConnect(
# driver
SQLite(),
# path
here(
# folder
"data",
# file
"miami_houses_update.sqlite"
)
)
### list tables
## call function
dbListTables(houses_con_update)## [1] "complete" "distance" "overview"
### copy database
## call function
sqliteCopyDatabase(
# current
houses_con,
# new
houses_con_update
)
### list tables
## call function
dbListTables(houses_con_update)## [1] "distance" "overview"
### copy tables to database
## copy
copy_to(
# database
houses_con_update,
# table
houses_joined,
# name
"complete",
# persistent
temporary = FALSE,
# query optimization
analyze = FALSE
)
### list tables
## call function
dbListTables(houses_con_update)## [1] "complete" "distance" "overview"
Save the plot objects as png files in the plots folder of the project directory using the combination of walk(), ls(), ggsave(), here(), str_glue(), str_remove(), and get(). Use a width of 8.1 inches and a height of 5 inches.
### save plots
## call function
walk(
## vector of plots
ls(
# name pattern
pattern = "plot"
),
## function
function(plot_inp) {
### save plot
## call function
ggsave(
## file path
here(
# folder
"plots",
# file
str_glue(
# remove string
str_remove(
# string
plot_inp,
# pattern
"_plot"
),
# extension
".png"
)
),
## plot object
plot = get(plot_inp),
## units
units = "in",
## width
width = 8.1,
## height
height = 5
)
}
)For your last task, you will respond to conceptual questions based on the conceptual lectures for this week.
Question 11.1: Why is it better to evaluate predictive performance of a model on testing as opposed to training data? Response 11.1: Because testing data tells us how well our model generalizes to new observations and therefore what the models true predictive power in practice is.
Question 11.2: What is a tuning parameter in a supervised learning model?
Response 11.2: A tuning parameter is a configuration setting that is set or “tuned” before training a model to find the optimal values that give the best model performance. It controls how a supervised learning model learns.
Question 11.3: What are two general advantages of using random forests for predictions?
Response 11.3: 1. Unlike in linear regression where manual adding of interaction terms is needed, random forests automatically capture non-linear relationships and interactions between variables without us needing to specify them making them very flexible and powerful for complex real-world data. 2. By building many different trees, each tree is trained on a random subset of data and uses random subsets of variables averaging their predictions. This makes random forests less likely to overfit compared to a single decision tree which therefore gives us better performance on test data without requiring extensive tuning.