Assignment 09

Author

Constance Nahimana

Open the assign09.qmd file and complete the exercises.

We will be working with the diamonds dataset and tidymodels to predict the carat of a diamond based on other variables.

The Grading Rubric is available at the end of this document.

Exercises

We will start by loading our required packages.

library(glmnet)

Exercise 1

Create a histogram using geom_histogram(binwidth = 0.1), showing the distribution of carat in the diamonds dataset. Set the fill to “blue” and the color to “black”. In the narrative below describe what the distribution looks like.

library(ggplot2)
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Carat in Diamonds Dataset",
    x = "Carat",
    y = "Count")

The histogram shows a strongly right-skewed distribution of carat:

  • Most diamonds are small: the bulk of the counts falls below 1 carat, with the tallest bar near 0.3 carat.

  • Counts spike again near round sizes such as 1.0, 1.5, and 2.0 carats.

  • A long, thin right tail extends toward 5 carats, where very few diamonds appear.

Overall, small diamonds are far more common than large ones, and diamonds tend to cluster at popular carat sizes.

Exercise 2

Repeat the histogram, but this time plot sqrt(carat) instead of carat. Describe if and how the distribution changed.

library(ggplot2)

ggplot(diamonds, aes(x = sqrt(carat))) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Square Root of Carat in Diamonds Dataset",
    x = "√Carat",
    y = "Count")

The distribution of sqrt(carat) is less skewed than the original carat distribution. Applying the square root transformation compresses larger values, which helps reduce the impact of extreme values on the right. As a result, the histogram appears more symmetrical, making mid-range values more visible. This transformation is helpful when preparing skewed data for modeling or clearer visualization.
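The change in skew described above can also be checked numerically. The following is a minimal sketch, not part of the assignment: `skewness()` is a helper defined here for illustration (a simple moment-based estimate), and values closer to 0 indicate a more symmetric distribution.

```r
library(ggplot2)  # provides the diamonds dataset

# Illustrative helper (an assumption, not part of the assignment):
# moment-based skewness, i.e. the third central moment divided by
# the second central moment raised to the power 1.5.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw  <- skewness(diamonds$carat)
skew_sqrt <- skewness(sqrt(diamonds$carat))

# Compare the two; the square root version should be closer to 0
c(carat = skew_raw, sqrt_carat = skew_sqrt)
```

A positive value indicates a right skew, so a smaller positive value for sqrt(carat) would agree with what the histogram shows.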

Exercise 3

Below set.seed(), split the data into two datasets: train_data will contain 80% of the data using stratified sampling on carat, test_data will contain the remaining 20% of the data.

library(ggplot2)     # for diamonds dataset
library(rsample)     # for data splitting
library(parsnip)     # for linear_reg(), set_engine()
library(recipes)     # for recipe() and step_dummy()
library(workflows)   # for workflow()
# Set seed for reproducibility
set.seed(1234)

# Perform stratified train/test split on carat
diamond_split <- initial_split(diamonds, prop = 0.8, strata = carat)
train_data <- training(diamond_split)
test_data <- testing(diamond_split)

# Create recipe
lm_all_recipe <- recipe(carat ~ ., data = train_data) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

# Define model specifications
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |>
  set_engine("glmnet")

lm_spec <- linear_reg() |>
  set_engine("lm")

# Create workflows
lm_all_workflow <- workflow() |>
  add_model(lm_spec) |>
  add_recipe(lm_all_recipe)

lasso_all_workflow <- workflow() |>
  add_model(lasso_spec) |>
  add_recipe(lm_all_recipe)
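One way to sanity-check the stratified split from this exercise is to confirm the 80/20 proportion and compare the carat quartiles of the two sets. This is a rough sketch: the `check_*` names are introduced here for illustration so that the assignment's own objects are left untouched.

```r
library(ggplot2)  # diamonds dataset
library(rsample)  # initial_split(), training(), testing()

set.seed(1234)
check_split <- initial_split(diamonds, prop = 0.8, strata = carat)
check_train <- training(check_split)
check_test  <- testing(check_split)

# Share of rows in the training set (should be close to 0.8)
nrow(check_train) / nrow(diamonds)

# Quartiles of carat in each set; stratifying on carat should keep these similar
rbind(
  train = quantile(check_train$carat, probs = c(0.25, 0.5, 0.75)),
  test  = quantile(check_test$carat,  probs = c(0.25, 0.5, 0.75))
)
```

If the quartiles differed noticeably between the two sets, that would suggest the stratification was not applied as intended.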

Exercise 4

Exercise 4 is already completed for you. It creates a recipe called lm_all_recipe that uses carat as the target variable and all other variables as predictors. It creates dummy variables for all nominal predictors so we can use the recipe for regularized regression.

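library(recipes)

# Recipe: carat is the outcome, all other variables are predictors
lm_all_recipe <- recipe(carat ~ ., data = train_data) |>
  step_dummy(all_nominal_predictors()) |>   # encode cut, color, clarity as numeric columns
  step_zv(all_predictors())                 # drop any zero-variance predictors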
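To see what the dummy-variable step does to the predictors, the recipe can be prepped and baked. The sketch below rebuilds the same steps on the full diamonds data for illustration only; `demo_recipe` and `baked` are names introduced here. Because cut, color, and clarity are ordered factors, step_dummy() is expected to encode them with polynomial contrasts by default rather than 0/1 indicators, so the new columns are numeric but not necessarily binary.

```r
library(ggplot2)  # diamonds dataset
library(recipes)  # recipe(), step_dummy(), step_zv(), prep(), bake()

# Same steps as lm_all_recipe, built on the full dataset for illustration only
demo_recipe <- recipe(carat ~ ., data = diamonds) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

# prep() estimates the steps; bake(new_data = NULL) returns the processed data
baked <- demo_recipe |>
  prep() |>
  bake(new_data = NULL)

# cut, color, and clarity no longer appear; numeric columns replace them
names(baked)
```

Every column in the result is numeric, which is what the glmnet engine needs.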

Exercise 5

Below is a model specification for regularized regression called lasso_spec. Add a second specification called lm_spec for plain linear regression using the “lm” engine.

library(recipes)
library(parsnip)
# Define the lasso model specification
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")

# Define the linear regression model specification.

lm_spec <- linear_reg() |>
  set_engine("lm")

Exercise 6

Create two workflows. lm_all_workflow should use the lm_spec model specification and lm_all_recipe. lasso_all_workflow should use the lasso_spec model and lm_all_recipe.

library(workflows)

# Workflow for plain linear regression
lm_all_workflow <- workflow() |>
  add_model(lm_spec) |>
  add_recipe(lm_all_recipe)

# Workflow for lasso regression
lasso_all_workflow <- workflow() |>
  add_model(lasso_spec) |>
  add_recipe(lm_all_recipe)

Exercise 7

Fit two models. lm_all_fit should use the lm_all_workflow, and lasso_all_fit should use the lasso_all_workflow.

# Fit the linear regression model
lm_all_fit <- lm_all_workflow |>
  fit(data = train_data)

# Fit the lasso regression model
lasso_all_fit <- lasso_all_workflow |>
  fit(data = train_data)
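After fitting, it can be useful to look at how the lasso penalty changes the coefficients compared with plain linear regression. The sketch below refits both models under `demo_*` names (introduced here so the assignment's objects are untouched) and counts the terms with a nonzero estimate. It assumes that extract_fit_engine() returns the underlying lm and glmnet objects and that coef() with s = 0.01 gives the glmnet coefficients at the penalty used in lasso_spec.

```r
library(ggplot2)    # diamonds dataset
library(rsample)    # initial_split(), training()
library(recipes)    # recipe(), step_dummy(), step_zv()
library(parsnip)    # linear_reg(), set_engine()
library(workflows)  # workflow(), add_model(), add_recipe(), extract_fit_engine()
library(glmnet)     # engine for the lasso model

set.seed(1234)
demo_split <- initial_split(diamonds, prop = 0.8, strata = carat)
demo_train <- training(demo_split)

demo_recipe <- recipe(carat ~ ., data = demo_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors())

demo_lm_fit <- workflow() |>
  add_model(linear_reg() |> set_engine("lm")) |>
  add_recipe(demo_recipe) |>
  fit(data = demo_train)

demo_lasso_fit <- workflow() |>
  add_model(linear_reg(penalty = 0.01, mixture = 1) |> set_engine("glmnet")) |>
  add_recipe(demo_recipe) |>
  fit(data = demo_train)

# Coefficients from the underlying engine objects
lm_coefs    <- coef(extract_fit_engine(demo_lm_fit))
lasso_coefs <- coef(extract_fit_engine(demo_lasso_fit), s = 0.01)

# Number of terms (including the intercept) with a nonzero estimate
n_nonzero_lm    <- sum(lm_coefs != 0, na.rm = TRUE)
n_nonzero_lasso <- sum(as.matrix(lasso_coefs) != 0)
c(lm = n_nonzero_lm, lasso = n_nonzero_lasso)
```

A smaller count for the lasso would mean the penalty has set some coefficients exactly to zero, which is how the lasso simplifies a model.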

Exercise 8

Make predictions into two new tibbles: lm_all_predictions and lasso_all_predictions.

# Predict using the linear regression model
lm_all_predictions <- predict(lm_all_fit, new_data = test_data) |>
  bind_cols(test_data)

# Predict using the lasso regression model
lasso_all_predictions <- predict(lasso_all_fit, new_data = test_data) |>
  bind_cols(test_data)
head(lasso_all_predictions)
# A tibble: 6 × 11
  .pred carat cut       color clarity depth table price     x     y     z
  <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.138  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
2 0.270  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
3 0.178  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
4 0.237  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
5 0.173  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
6 0.323  0.3  Good      J     SI1      64      55   339  4.25  4.28  2.73

Exercise 9

Compute and display the rmse for each model. Discuss which one performed better and why in the narrative below.

library(yardstick)
# RMSE for linear regression
rmse_lm <- lm_all_predictions |>
  rmse(truth = carat, estimate = .pred)

# RMSE for lasso regression
rmse_lasso <- lasso_all_predictions |>
  rmse(truth = carat, estimate = .pred)

tibble(
  Model = c("Linear Regression", "Lasso Regression"),
  RMSE = c(rmse_lm$.estimate, rmse_lasso$.estimate))
# A tibble: 2 × 2
  Model               RMSE
  <chr>              <dbl>
1 Linear Regression 0.0744
2 Lasso Regression  0.0812

The RMSE values show how far off, on average, the model predictions are from the actual carat values in the test set. Lower values are better.

  • If rmse_lm is lower than rmse_lasso, the plain linear regression model performed better, possibly because there was little benefit from regularization or the predictors were already well-behaved.

  • If rmse_lasso is lower, then the regularized model reduced overfitting or handled multicollinearity better.

Based on the RMSE values, the plain linear regression model (RMSE 0.0744) performed slightly better than the lasso regression model (RMSE 0.0812). This suggests that regularization offered little benefit here: with a penalty of 0.01, the lasso shrinks the coefficients and gives up some fit without a compensating reduction in overfitting, which is plausible given how large the training set is relative to the number of predictors. The difference is small, however, so model simplicity and interpretability may also be considerations in choosing between the two.
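To make the RMSE calculation concrete, here is a small sketch that computes it by hand from the six rows of lasso_all_predictions printed in Exercise 8. The names `truth`, `pred`, and `rmse_by_hand` are introduced here for illustration. Because it uses only six rows, the result will not match the test-set RMSE reported above.

```r
# Actual carat values and lasso predictions from the six rows shown in Exercise 8
truth <- c(0.23, 0.29, 0.24, 0.26, 0.23, 0.30)
pred  <- c(0.138, 0.270, 0.178, 0.237, 0.173, 0.323)

# RMSE: square the errors, average them, then take the square root
rmse_by_hand <- sqrt(mean((truth - pred)^2))
rmse_by_hand
```

The rmse() function from yardstick applies the same formula to every row of the test set.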

Submission

To submit your assignment:

  • Change the author name to your name in the YAML portion at the top of this document.
  • Render your document to HTML and publish it to RPubs.
  • Submit the link to your RPubs document in the Brightspace comments section for this assignment.
  • Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

Grading Rubric

Each item is scored at one of four levels: 100% (flawless), 67% (minor issues), 33% (moderate issues), or 0% (major issues or not attempted).

  • Document formatting: correctly implemented instructions (9% overall)
  • Exercises: 9% each (81% overall)
  • Submitted properly to Brightspace (10% overall): two of the levels are marked NA. You must submit according to instructions to receive any credit for this portion.