Assignment 09

Author

Brady Heath

Go to the shared posit.cloud workspace for this class and open the assign09 project. Open the assign09.qmd file and complete the exercises.

We will be working with the diamonds dataset and tidymodels to predict the carat of a diamond based on other variables.

The Grading Rubric is available at the end of this document.

Exercises

We will start by loading our required packages.

library(tidymodels)
library(glmnet)

Exercise 1

Create a histogram using geom_histogram(binwidth = 0.1), showing the distribution of carat in the diamonds dataset. Set the fill to “blue” and the color to “black”. In the narrative below describe what the distribution looks like.

# Load the tidyverse package
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ lubridate 1.9.3     ✔ stringr   1.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ Matrix::expand()    masks tidyr::expand()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ Matrix::pack()      masks tidyr::pack()
✖ readr::spec()       masks yardstick::spec()
✖ Matrix::unpack()    masks tidyr::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Use ggplot to create the histogram
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Carat in Diamonds Dataset",
    x = "Carat",
    y = "Count"
  ) +
  theme_minimal()

Exercise 2

Repeat the histogram, but this time plot sqrt(carat) instead of carat. Describe if and how the distribution changed.

# Load the tidyverse package
library(tidyverse)

# Use ggplot to create the histogram
ggplot(diamonds, aes(x = sqrt(carat))) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(
    title = "Distribution of sqrt(Carat) in Diamonds Dataset",
    x = "sqrt(Carat)",
    y = "Count"
  ) +
  theme_minimal()

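After changing carat to sqrt(carat), I can see that the x-axis range shrank: carat runs from 0.2 to about 5, while sqrt(carat) runs from about 0.45 to about 2.2. With the same binwidth of 0.1, the data fall into fewer bins, so the count in each bin increased. The square root also pulls in the long right tail, so the distribution looks less right-skewed.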
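One way to check the change in the x-axis numerically is to compare the span of the raw and transformed values, and the approximate number of 0.1-wide bins each span needs. This is a minimal sketch; the variable names (raw_span, sqrt_span, raw_bins, sqrt_bins) are illustrative and not part of the assignment.

```r
library(ggplot2)  # provides the diamonds dataset

binwidth <- 0.1

# Span of the raw and square-root-transformed carat values
raw_span  <- diff(range(diamonds$carat))
sqrt_span <- diff(range(sqrt(diamonds$carat)))

# Approximate number of bins of width 0.1 needed to cover each span
raw_bins  <- ceiling(raw_span / binwidth)
sqrt_bins <- ceiling(sqrt_span / binwidth)

c(raw_bins = raw_bins, sqrt_bins = sqrt_bins)
```

Fewer bins over the same number of rows means more rows per bin, which is why the counts on the y-axis go up.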

Exercise 3

Below set.seed(), split the data into two datasets: train_data will contain 80% of the data using stratified sampling on carat, test_data will contain the remaining 20% of the data.

# set a seed for reproducibility
set.seed(1234)

# Load the necessary libraries
library(tidyverse)
library(caret)

Loading required package: lattice


Attaching package: 'caret'

The following objects are masked from 'package:yardstick':

    precision, recall, sensitivity, specificity

The following object is masked from 'package:purrr':

    lift

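# Re-seed right before sampling; note that this overrides the seed of 1234 set above
set.seed(123)

# Perform stratified sampling (for a numeric outcome, createDataPartition
# samples within quantile-based groups of carat)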
train_indices <- createDataPartition(diamonds$carat, p = 0.8, list = FALSE)

# Split the data into training and test datasets
train_data <- diamonds[train_indices, ]
test_data <- diamonds[-train_indices, ]

# View the dimensions of the resulting datasets
dim(train_data)

[1] 43154    10

dim(test_data)

[1] 10786    10
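Because the rest of the assignment uses tidymodels, the same 80/20 stratified split could also be made with initial_split() from rsample, which tidymodels loads. This is a sketch of an alternative, not the code that produced the dimensions above. The object names diamonds_split, train_alt, and test_alt are hypothetical, and the row counts may differ slightly from the caret split.

```r
library(tidymodels)  # loads rsample and ggplot2 (for diamonds)

set.seed(1234)

# 80/20 split, stratified on carat (numeric strata are binned into quantiles)
diamonds_split <- initial_split(diamonds, prop = 0.8, strata = carat)

train_alt <- training(diamonds_split)
test_alt  <- testing(diamonds_split)

# Every row ends up in exactly one of the two sets
nrow(train_alt) + nrow(test_alt) == nrow(diamonds)
```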

Exercise 4

Exercise 4 is already completed for you. It creates a recipe called lm_all_recipe that uses carat as the target variable and all other variables as predictors. It creates dummy variables for all nominal predictors so we can use the recipe for regularized regression.

# recipe using all predictors
lm_all_recipe <- recipe(carat ~ ., data = train_data) |> 
  step_dummy(all_nominal_predictors())
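To see what step_dummy() does to the nominal columns, the recipe can be prepped and baked. The sketch below rebuilds the same recipe on the full diamonds data so that it runs on its own; check_recipe and baked are hypothetical names, not objects used elsewhere in the assignment.

```r
library(tidymodels)

# Same recipe as above, built on the full dataset so the example is standalone
check_recipe <- recipe(carat ~ ., data = diamonds) |>
  step_dummy(all_nominal_predictors())

# prep() estimates the recipe; bake(new_data = NULL) returns the processed data
baked <- check_recipe |>
  prep() |>
  bake(new_data = NULL)

# cut, color, and clarity are replaced by numeric columns
names(baked)
```

Having only numeric predictors matters here because the glmnet engine expects a numeric predictor matrix.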

Exercise 5

Below is a model specification for a regularized regression model called lasso_spec. Add a second specification called lm_spec for just plain old linear regression using the “lm” engine.

# Define the lasso model specification
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")

# Define the linear regression model specification.
lm_spec <- linear_reg() |> 
  set_engine("lm")

Exercise 6

Create two workflows. lm_all_workflow should use the lm_spec model specification and lm_all_recipe. lasso_all_workflow should use the lasso_spec model and lm_all_recipe.

library(workflows)

# Define the linear regression workflow
lm_all_workflow <- workflow() |> 
  add_model(lm_spec) |> 
  add_recipe(lm_all_recipe)

# Define the lasso regression workflow
lasso_all_workflow <- workflow() |> 
  add_model(lasso_spec) |> 
  add_recipe(lm_all_recipe)

Exercise 7

Fit two models. lm_all_fit should use the lm_all_workflow, and lasso_all_fit should use the lasso_all_workflow.

# Fit the linear regression model
lm_all_fit <- lm_all_workflow |> 
  fit(data = train_data)

# Fit the lasso regression model
lasso_all_fit <- lasso_all_workflow |> 
  fit(data = train_data)

Exercise 8

Make predictions into two new tibbles: lm_all_predictions and lasso_all_predictions.

library(tidymodels)

# Make predictions with the linear regression model
lm_all_predictions <- predict(lm_all_fit, new_data = test_data) |> 
  bind_cols(test_data)

# Make predictions with the lasso regression model
lasso_all_predictions <- predict(lasso_all_fit, new_data = test_data) |> 
  bind_cols(test_data)

Exercise 9

Compute and display the rmse for each model. Discuss which one performed better and why in the narrative below.

library(yardstick)

# Compute RMSE for linear regression model
lm_all_rmse <- rmse(lm_all_predictions, truth = carat, estimate = .pred)

# Compute RMSE for lasso regression model
lasso_all_rmse <- rmse(lasso_all_predictions, truth = carat, estimate = .pred)

# Display the RMSE values
lm_all_rmse
lasso_all_rmse

The model with the RMSE of 0.07240289 performed better. A lower RMSE is better because it means the predictions are, on average, closer to the actual carat values in the test data.
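For reference, RMSE is the square root of the mean squared difference between the observed and predicted values. The sketch below computes it by hand on three made-up values; truth and pred are illustrative and are not taken from the models above.

```r
# Toy observed and predicted values (made up for illustration)
truth <- c(0.5, 1.0, 1.5)
pred  <- c(0.6, 0.9, 1.5)

# RMSE: square root of the mean of the squared errors
rmse_manual <- sqrt(mean((truth - pred)^2))
rmse_manual
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.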

Submission

To submit your assignment:

Change the author name to your name in the YAML portion at the top of this document.
Render your document to HTML and publish it to RPubs.
Submit the link to your RPubs document in the Brightspace comments section for this assignment.
Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

Grading Rubric

Item (percent overall)	67% - minor issues	33% - moderate issues	0% - major issues or not attempted
Document formatting: correctly implemented instructions (9%)
Exercises - 9% each (81%)
Submitted properly to Brightspace (10%)	NA	NA	You must submit according to instructions to receive any credit for this portion.