Assignment 09

Author

Zachary Howe

Open the assign09.qmd file and complete the exercises.

We will be working with the diamonds dataset and tidymodels to predict the carat of a diamond based on other variables.

The Grading Rubric is available at the end of this document.

Exercises

We will start by loading our required packages.

library(tidymodels)
library(glmnet)

Exercise 1

Create a histogram using geom_histogram(binwidth = 0.1), showing the distribution of carat in the diamonds dataset. Set the fill to “blue” and the color to “black”. In the narrative below describe what the distribution looks like.

library(ggplot2)

# Load the diamonds dataset
data("diamonds")

# Create and display the histogram
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Distribution of Carat in Diamonds",
       x = "Carat",
       y = "Count") +
  theme_minimal()

The histogram above is skewed to the right, indicating that most diamonds have a lower carat weight, with fewer diamonds having higher carat weights. The distribution is not normal, as it does not have a bell-shaped curve. Instead, it has a long tail on the right side, suggesting that larger diamonds are less common.

Exercise 2

Repeat the histogram, but this time plot sqrt(carat) instead of carat. Describe if and how the distribution changed.

# Create and display the histogram for sqrt(carat)
ggplot(diamonds, aes(x = sqrt(carat))) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Distribution of sqrt(Carat) in Diamonds",
       x = "sqrt(Carat)",
       y = "Count") +
  theme_minimal()

The histogram of sqrt(carat) is more symmetric than the original carat histogram. The square root compresses large values more than small ones, which pulls in the long right tail and reduces the right skew, although the distribution is still not perfectly normal. A less skewed target can be beneficial for modeling, because a few very large diamonds have less influence on the fit.
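The visual comparison of the two histograms can also be checked numerically. The following is a minimal sketch, assuming a simple moment-based skewness statistic; the helper name skewness and the variables skew_raw and skew_sqrt are illustrative and not part of the assignment.

```r
library(ggplot2)  # provides the diamonds dataset

# Moment-based sample skewness (hypothetical helper):
# third central moment divided by the cubed standard deviation
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw  <- skewness(diamonds$carat)
skew_sqrt <- skewness(sqrt(diamonds$carat))

# A positive value indicates a right skew; values closer to 0 are more symmetric
c(carat = skew_raw, sqrt_carat = skew_sqrt)
```

A smaller positive value for sqrt(carat) than for carat would agree with what the histograms show.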

Exercise 3

Below set.seed(), split the data into two datasets: train_data will contain 80% of the data using stratified sampling on carat, test_data will contain the remaining 20% of the data.

# Load required packages
library(ggplot2)
library(rsample)

# set a seed for reproducibility
set.seed(1234)

# stratified split based on carat
split <- initial_split(diamonds, prop = 0.8, strata = carat)

# create training and testing datasets
train_data <- training(split)
test_data  <- testing(split)
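One way to sanity-check the split is to confirm the 80/20 proportion and compare the carat distribution in the two sets. This is a sketch that repeats the split so it can run on its own; train_share is an illustrative name, not part of the assignment.

```r
library(ggplot2)  # provides the diamonds dataset
library(rsample)

set.seed(1234)
split <- initial_split(diamonds, prop = 0.8, strata = carat)
train_data <- training(split)
test_data  <- testing(split)

# Share of rows that ended up in the training set (should be close to 0.8)
train_share <- nrow(train_data) / nrow(diamonds)
train_share

# With stratification on carat, the two summaries should look similar
summary(train_data$carat)
summary(test_data$carat)
```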

Exercise 4

Exercise 4 is already completed for you. It creates a recipe called lm_all_recipe that uses carat as the target variable and all other variables as predictors. It creates dummy variables for all nominal predictors so we can use the recipe for regularized regression.

# recipe using all predictors
lm_all_recipe <- recipe(carat ~ ., data = train_data) |> 
  step_dummy(all_nominal_predictors())

Exercise 5

Below is a model specification for a regularized regression model called lasso_spec. Add a second specification called lm_spec for just plain old linear regression using the “lm” engine.

# Load the required package
library(parsnip)

# Define the lasso model specification
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")

# Define the linear regression model specification
lm_spec <- linear_reg() |> 
  set_engine("lm")

Exercise 6

Create two workflows. lm_all_workflow should use the lm_spec model specification and lm_all_recipe. lasso_all_workflow should use the lasso_spec model and lm_all_recipe.

# Define a recipe to model carat using all other predictors
lm_all_recipe <- recipe(carat ~ ., data = train_data) |> 
  step_dummy(all_nominal_predictors())

# Define linear regression model spec
lm_spec <- linear_reg() |> 
  set_engine("lm")

# Define lasso regression model spec
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")

# Create the linear regression workflow
lm_all_workflow <- workflow() |> 
  add_model(lm_spec) |> 
  add_recipe(lm_all_recipe)

# Create the lasso regression workflow
lasso_all_workflow <- workflow() |> 
  add_model(lasso_spec) |> 
  add_recipe(lm_all_recipe)

# Show the linear regression workflow
lm_all_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Exercise 7

Fit two models. lm_all_fit should use the lm_all_workflow, and lasso_all_fit should use the lasso_all_workflow

# Fit the linear regression model
lm_all_fit <- lm_all_workflow |> 
  fit(data = train_data)

# Fit the lasso regression model
lasso_all_fit <- lasso_all_workflow |> 
  fit(data = train_data)

# Show the fitted linear regression model
lm_all_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)        depth        table        price            x            y  
 -1.657e+00    1.198e-02    2.313e-03    4.435e-05    2.465e-01    3.024e-03  
          z        cut_1        cut_2        cut_3        cut_4      color_1  
  2.685e-03   -1.990e-02    8.627e-03   -6.279e-03    1.369e-03    1.069e-01  
    color_2      color_3      color_4      color_5      color_6    clarity_1  
  4.031e-02    4.689e-03   -5.070e-03    5.685e-03    2.143e-03   -1.757e-01  
  clarity_2    clarity_3    clarity_4    clarity_5    clarity_6    clarity_7  
  1.081e-01   -5.515e-02    1.695e-02   -1.220e-02   -1.620e-03   -4.479e-05  

Exercise 8

Make predictions into two new tibbles: lm_all_predictions and lasso_all_predictions

# Make predictions into two new tibbles: `lm_all_predictions` and `lasso_all_predictions`
lm_all_predictions <- lm_all_fit |> 
  predict(new_data = test_data) |> 
  bind_cols(test_data)
lasso_all_predictions <- lasso_all_fit |>
  predict(new_data = test_data) |> 
  bind_cols(test_data)
# Show the predictions
lm_all_predictions
# A tibble: 10,790 × 11
   .pred carat cut       color clarity depth table price     x     y     z
   <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.130  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 2 0.300  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 3 0.218  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 4 0.248  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 5 0.178  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
 6 0.396  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73
 7 0.392  0.3  Good      J     SI1      63.8    56   351  4.23  4.26  2.71
 8 0.399  0.3  Good      I     SI2      63.3    56   351  4.26  4.3   2.71
 9 0.363  0.3  Very Good J     VS2      62.2    57   357  4.28  4.3   2.67
10 0.136  0.23 Very Good F     VS1      60      57   402  4     4.03  2.41
# ℹ 10,780 more rows

Exercise 9

Compute and display the rmse for each model. Discuss which one performed better and why in the narrative below.

# Compute and display the rmse for each model
lm_all_rmse <- lm_all_predictions |> 
  rmse(truth = carat, estimate = .pred)
lasso_all_rmse <- lasso_all_predictions |>
  rmse(truth = carat, estimate = .pred)
# Show the rmse for the linear regression model
lm_all_rmse
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.0744

The linear regression model has a test root mean square error (RMSE) of 0.0744 carats, which is lower than the RMSE of the lasso regression model, so it predicted the carat of diamonds in the test dataset more accurately. The likely reason is that the lasso penalty (penalty = 0.01) shrinks coefficients toward zero, which introduces bias. Regularization pays off mainly when a model is at risk of overfitting. Here the training set is large (about 43,000 rows) relative to the 23 predictors produced by the recipe, so ordinary least squares has little variance to trade away, and the unpenalized fit captures the relationships in the data more effectively.
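To make the comparison explicit, both RMSE values can be collected in one table. The sketch below repeats the split, recipe, and model specifications from the exercises so that it runs on its own; the names rec, specs, and rmse_by_model are illustrative and not part of the assignment.

```r
library(tidymodels)
library(glmnet)

set.seed(1234)
split <- initial_split(diamonds, prop = 0.8, strata = carat)
train_data <- training(split)
test_data  <- testing(split)

# Same recipe as lm_all_recipe
rec <- recipe(carat ~ ., data = train_data) |>
  step_dummy(all_nominal_predictors())

# Same model specifications as lm_spec and lasso_spec
specs <- list(
  lm    = linear_reg() |> set_engine("lm"),
  lasso = linear_reg(penalty = 0.01, mixture = 1) |> set_engine("glmnet")
)

# Fit each model on the training data and score it on the test data
rmse_by_model <- purrr::map(specs, function(spec) {
  workflow() |>
    add_model(spec) |>
    add_recipe(rec) |>
    fit(data = train_data) |>
    predict(new_data = test_data) |>
    bind_cols(test_data) |>
    rmse(truth = carat, estimate = .pred)
}) |>
  bind_rows(.id = "model")

# One row per model, with the test RMSE in the .estimate column
rmse_by_model
```

The model with the smaller .estimate value has the lower test error.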

Submission

To submit your assignment:

  • Change the author name to your name in the YAML portion at the top of this document.
  • Render your document to HTML and publish it to RPubs.
  • Submit the link to your RPubs document in the Brightspace comments section for this assignment.
  • Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

Grading Rubric

| Item (percent overall) | 100% - flawless | 67% - minor issues | 33% - moderate issues | 0% - major issues or not attempted |
|---|---|---|---|---|
| Document formatting: correctly implemented instructions (9%) | | | | |
| Exercises - 9% each (81%) | | | | |
| Submitted properly to Brightspace (10%) | | NA | NA | You must submit according to instructions to receive any credit for this portion. |