Assignment 09

Author

Connor Perfetto

Go to the shared posit.cloud workspace for this class and open the assign09 project. Open the assign09.qmd file and complete the exercises.

We will be working the the diamonds dataset and tidymodels to predict the carat of a diamond based on other variables.

The Grading Rubric is available at the end of this document.

Exercises

We will start by loading our required packages.

library(tidymodels)
library(glmnet)

Exercise 1

Create a histogram using geom_histogram(binwidth = 0.1), showing the distribution of carat in the diamonds dataset. Set the fill to “blue” and the color to “black”. In the narrative below describe what the distribution looks like.

data("diamonds")
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Carat distribution of Diamonds ",
       x = "Carat",
       y = "Frequency") +
  theme_minimal()

The X coefficient is skewed to the left with most diamonds falling below 1 carat and the vast majority falling below 2 carats. This shows that the vast majority of diamonds are low carat with a tiny minority being in the higher grades.

Exercise 2

Repeat the histogram, but this time plot sqrt(carat) instead of carat. Describe if and how the distribution changed.

data("diamonds")
ggplot(diamonds, aes(x = sqrt(carat))) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Distribution of sqrt(Carat) in the Diamonds Dataset",
       x = "sqrt(Carat)",
       y = "Frequency") +
  theme_minimal()

SQRT, While maintaining the same left-skewed character of the original seems to fall within a tighter range than its predecessor with the vast majority of diamonds falling below 1.5 with vanishingly few reaching as high as 2.

Exercise 3

Below set.seed(), split the data into two datasets: train_data will contain 80% of the data using stratified sampling on carat, test_data will contain the remaining 20% of the data.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ lubridate 1.9.3     ✔ stringr   1.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ Matrix::expand()    masks tidyr::expand()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ Matrix::pack()      masks tidyr::pack()
✖ readr::spec()       masks yardstick::spec()
✖ Matrix::unpack()    masks tidyr::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(caret)

Loading required package: lattice

Attaching package: 'caret'

The following objects are masked from 'package:yardstick':

    precision, recall, sensitivity, specificity

The following object is masked from 'package:purrr':

    lift

# set a seed for reproducibility
set.seed(1234)
split_index <- createDataPartition(diamonds$carat, p = 0.8, list = FALSE)
train_data <- diamonds[split_index, ]
test_data <- diamonds[-split_index, ]
nrow(train_data)

[1] 43154

nrow(test_data)

[1] 10786

Exercise 4

Exercise 4 is already completed for you. It creates a recipe called lm_all_recipe that uses carat as the target variable and all other variables as predictors. It creates dummy variables for all nominal predictors so we can use the recipe for reguralized regression.

library(recipes)
#| eval: FALSE
# recipe using all predictors
lm_all_recipe <- recipe(carat ~ ., data = train_data) |> 
  step_dummy(all_nominal_predictors())

Exercise 5

Below is a model specified for reguralized regression model called lasso_spec. Add a second specification called lm_spec for just plain old linear regression using the “lm” engine.

library(tidymodels)
#| eval: FALSE
# Define the lasso model specification
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")

# Define the linear regression model specification.
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")
lm_spec <- linear_reg() |> 
  set_engine("lm")

Exercise 6

Create two workflows. lm_all_workflow should use the lm_spec model specification and lm_all_recipe. lasso_all_workflow should use the lasso_spec model and lm_all_recipe.

library(tidymodels)
lm_all_recipe <- recipe(carat ~ ., data = train_data) |> 
  step_dummy(all_nominal_predictors()) |> 
  step_normalize(all_numeric_predictors())
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")
lm_spec <- linear_reg() |> 
  set_engine("lm")
lm_all_workflow <- workflow() |> 
  add_model(lm_spec) |> 
  add_recipe(lm_all_recipe)
lasso_all_workflow <- workflow() |> 
  add_model(lasso_spec) |> 
  add_recipe(lm_all_recipe)
lm_all_workflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm

lasso_all_workflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.01
  mixture = 1

Computational engine: glmnet

Exercise 7

Fit two models. lm_all_fit should use the lm_all_workflow, and lasso_all_fit should use the lasso_all_workflow

lm_all_fit <- lm_all_workflow |> 
  fit(data = train_data)
lasso_all_fit <- lasso_all_workflow |> 
  fit(data = train_data)
lm_all_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)        depth        table        price            x            y  
  7.979e-01    1.688e-02    4.683e-03    1.781e-01    2.741e-01    4.398e-03  
          z        cut_1        cut_2        cut_3        cut_4      color_1  
  2.433e-03   -8.050e-03    4.991e-03   -3.249e-03    7.505e-04    3.450e-02  
    color_2      color_3      color_4      color_5      color_6    clarity_1  
  1.387e-02    1.843e-03   -1.968e-03    2.501e-03    8.284e-04   -4.680e-02  
  clarity_2    clarity_3    clarity_4    clarity_5    clarity_6    clarity_7  
  2.703e-02   -1.851e-02    6.837e-03   -5.213e-03   -3.911e-04   -1.503e-05

lasso_all_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian",      alpha = ~1) 

   Df  %Dev  Lambda
1   0  0.00 0.46230
2   1 16.13 0.42120
3   1 29.52 0.38380
4   1 40.64 0.34970
5   1 49.87 0.31860
6   1 57.54 0.29030
7   1 63.90 0.26450
8   2 69.24 0.24100
9   2 73.89 0.21960
10  2 77.76 0.20010
11  2 80.97 0.18230
12  2 83.63 0.16610
13  2 85.84 0.15140
14  2 87.68 0.13790
15  2 89.20 0.12570
16  2 90.46 0.11450
17  2 91.51 0.10430
18  2 92.39 0.09507
19  2 93.11 0.08663
20  3 93.72 0.07893
21  3 94.22 0.07192
22  3 94.64 0.06553
23  3 94.99 0.05971
24  3 95.28 0.05440
25  3 95.52 0.04957
26  3 95.72 0.04517
27  3 95.89 0.04115
28  3 96.03 0.03750
29  3 96.14 0.03417
30  3 96.24 0.03113
31  3 96.32 0.02837
32  5 96.42 0.02585
33  5 96.56 0.02355
34  5 96.68 0.02146
35  5 96.78 0.01955
36  5 96.86 0.01781
37  7 96.96 0.01623
38  7 97.06 0.01479
39  7 97.14 0.01348
40  7 97.21 0.01228
41  8 97.26 0.01119
42  8 97.31 0.01019
43  9 97.36 0.00929
44  9 97.41 0.00846
45  9 97.44 0.00771
46  9 97.48 0.00703

...
and 28 more lines.

Exercise 8

Make predictions into two new tibbles: lm_all_predictions and lasso_all_predictions

lm_all_predictions <- lm_all_fit |> 
  predict(new_data = test_data) |> 
  bind_cols(test_data)
lasso_all_predictions <- lasso_all_fit |> 
  predict(new_data = test_data) |> 
  bind_cols(test_data)
head(lm_all_predictions)

# A tibble: 6 × 11
  .pred carat cut       color clarity depth table price     x     y     z
  <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.130  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
2 0.300  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
3 0.218  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
4 0.249  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
5 0.178  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
6 0.398  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73

head(lasso_all_predictions)

# A tibble: 6 × 11
  .pred carat cut       color clarity depth table price     x     y     z
  <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.138  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
2 0.270  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
3 0.178  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
4 0.239  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
5 0.173  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
6 0.325  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73

Exercise 9

Compute and display the rmse for each model. Discuss which one performed better and why in the narrative below.

library(yardstick)
lm_rmse <- lm_all_predictions |> 
  metrics(truth = carat, estimate = .pred) |> 
  filter(.metric == "rmse")
lasso_rmse <- lasso_all_predictions |> 
  metrics(truth = carat, estimate = .pred) |> 
  filter(.metric == "rmse")
lm_rmse

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.0682

lasso_rmse

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.0739

The Linear progression model (the first one) performed better than its lasso counterpart due to having a lower rmse. This is likely due to the linear model’s usage of all provided predictors with no penalization which the lasso method employed in order to reduce overfitting in the dataset - a goal which while achieved resulted in the worse result. In this sense it could be argued that the lasso model did not fail as its primary goal was accomplished although if our top priority was having the lowest RMSE then the linear model would win out.

Submission

To submit your assignment:

Change the author name to your name in the YAML portion at the top of this document
Render your document to html and publish it to RPubs.
Submit the link to your Rpubs document in the Brightspace comments section for this assignment.
Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

Grading Rubric

Item (percent overall)	67% - minor issues	33% - moderate issues	0% - major issues or not attempted
Document formatting: correctly implemented instructions (9%)
Exercises - 9% each (81% )
Submitted properly to Brightspace (10%)	NA	NA	You must submit according to instructions to receive any credit for this portion.