Assignment 09

Author

Theresa Anderson

Open the assign09.qmd file and complete the exercises.

We will be working with the diamonds dataset and tidymodels to predict the carat of a diamond based on the other variables.

The Grading Rubric is available at the end of this document.

Exercises

We will start by loading our required packages.

library(tidymodels)
library(glmnet)

Exercise 1

Create a histogram using geom_histogram(binwidth = 0.1), showing the distribution of carat in the diamonds dataset. Set the fill to “blue” and the color to “black”. In the narrative below describe what the distribution looks like.

library(ggplot2)
  
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Frequency Distribution of Diamond Carat Weight", 
       x = "Carat",
       y = "Quantity") +
  theme_minimal()

The frequency distribution of carat weight is right-skewed: there are substantially more diamonds at or below one carat than above.

Exercise 2

Repeat the histogram, but this time plot sqrt(carat) instead of carat. Describe if and how the distribution changed.

ggplot(diamonds, aes(x = sqrt(carat))) +
  geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
  labs(title = "Frequency Distribution of the Square Root of Carat Weight",
       x = "Carat (square root)",
       y = "Quantity") +
  theme_minimal()
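
The square-root transform should compress the long right tail, so the distribution is expected to look less skewed than in Exercise 1. As an optional check (not required by the assignment), the sample skewness before and after the transform can be computed with a small helper function; the formula below is the standard third-moment definition and the function name is just illustrative.

# optional: quantify the skewness before and after the square-root transform
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3 / 2)
}

skewness(diamonds$carat)        # skewness on the original carat scale
skewness(sqrt(diamonds$carat))  # skewness after the square-root transform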

Exercise 3

Below set.seed(), split the data into two datasets: train_data will contain 80% of the data using stratified sampling on carat, and test_data will contain the remaining 20% of the data.

# set a seed for reproducibility
set.seed(1234)

diamonds_split <- initial_split(diamonds, prop = 0.8, strata = carat)
train_data <- training(diamonds_split)
test_data <- testing(diamonds_split)
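
A quick optional sanity check (not part of the exercise) is to confirm the split proportions and that the carat distribution looks similar in both sets, which is what the stratified sampling is meant to preserve:

# roughly an 80/20 split
nrow(train_data) / nrow(diamonds)
nrow(test_data) / nrow(diamonds)

# carat quartiles should be similar across the two sets
quantile(train_data$carat, probs = c(0.25, 0.5, 0.75))
quantile(test_data$carat,  probs = c(0.25, 0.5, 0.75))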

Exercise 4

Exercise 4 is already completed for you. It creates a recipe called lm_all_recipe that uses carat as the target variable and all other variables as predictors. It creates dummy variables for all nominal predictors so we can use the recipe for regularized regression.

# recipe using all predictors
lm_all_recipe <- recipe(carat ~ ., data = train_data) |> 
  step_dummy(all_nominal_predictors())
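
If you want to see what the recipe actually produces, one optional way (using the standard recipes verbs prep() and bake()) is to apply it to a few training rows and confirm that cut, color, and clarity have been expanded into dummy columns:

# optional: preview the preprocessed data on a few rows
lm_all_recipe |>
  prep() |>
  bake(new_data = head(train_data))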

Exercise 5

Below is a model specification for a regularized regression model called lasso_spec. Add a second specification called lm_spec for just plain old linear regression using the “lm” engine.

# Define the lasso model specification
lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")

# Define the linear regression model specification.
lm_spec <- linear_reg() |>
  set_engine("lm")

Exercise 6

Create two workflows. lm_all_workflow should use the lm_spec model specification and lm_all_recipe. lasso_all_workflow should use the lasso_spec model and lm_all_recipe.

lm_all_recipe <- recipe(carat ~ ., data = train_data) |> 
  step_dummy(all_nominal_predictors())

lm_spec <- linear_reg() |>
  set_engine("lm")

lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) |> 
  set_engine("glmnet")

lm_all_workflow <- workflow() |>
  add_model(lm_spec) |>
  add_recipe(lm_all_recipe)

lasso_all_workflow <- workflow() |>
  add_model(lasso_spec) |>
  add_recipe(lm_all_recipe)

When I first kept the recipe and model specifications in their own chunks, rendering produced an error saying the recipe did not exist. After moving the recipe into this chunk, I got the same error for the model specifications, so I consolidated all of the applicable code here and now it works properly. ¯\_(ツ)_/¯
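
Printing a workflow is a simple optional way to confirm that both the preprocessor and the model were attached before fitting:

# optional: confirm the recipe and model are bundled together
lm_all_workflow
lasso_all_workflow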

Exercise 7

Fit two models. lm_all_fit should use the lm_all_workflow, and lasso_all_fit should use the lasso_all_workflow.

lm_all_fit <- lm_all_workflow |>
  fit(data = train_data)

lasso_all_fit <- lasso_all_workflow |>
  fit(data = train_data)
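
One optional way to inspect what was learned is to pull the underlying parsnip fit out of the workflow and tidy it; this is shown only for the linear model, where the coefficients are straightforward to read:

# optional: look at the fitted coefficients of the linear model
lm_all_fit |>
  extract_fit_parsnip() |>
  tidy()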

Exercise 8

Make predictions on the test data into two new tibbles: lm_all_predictions and lasso_all_predictions.

lm_all_predictions <- lm_all_fit |>
  predict(new_data = test_data) |>
  bind_cols(test_data |> select(carat))

head(lm_all_predictions)
# A tibble: 6 × 2
  .pred carat
  <dbl> <dbl>
1 0.130  0.23
2 0.300  0.29
3 0.218  0.24
4 0.248  0.26
5 0.178  0.23
6 0.396  0.3 

lasso_all_predictions <- lasso_all_fit |>
  predict(new_data = test_data) |>
  bind_cols(test_data |> select(carat))

head(lasso_all_predictions)
# A tibble: 6 × 2
  .pred carat
  <dbl> <dbl>
1 0.138  0.23
2 0.270  0.29
3 0.178  0.24
4 0.237  0.26
5 0.173  0.23
6 0.323  0.3 
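
As a side note, augment() does the predict-then-bind step in one call on a fitted workflow; this is just an alternative to the approach above, not a requirement:

# optional: augment() returns the test_data columns plus a .pred column
augment(lm_all_fit, new_data = test_data) |>
  select(carat, .pred) |>
  head()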

Exercise 9

Compute and display the rmse for each model. Discuss which one performed better and why in the narrative below.

lm_all_rmse <- lm_all_predictions |>
  rmse(truth = carat, estimate = .pred)

lasso_all_rmse <- lasso_all_predictions |>
  rmse(truth = carat, estimate = .pred)

lm_all_rmse
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.0744

lasso_all_rmse
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.0812

The linear model performed better than the lasso model, with an RMSE of 0.0744 compared to 0.0812 for the lasso model. I find this result a bit surprising: when I look at the prediction tibbles, the lasso predictions appear closer to the actual values than the linear model's. I wanted to double-check this, and because I am more comfortable with Excel at the moment, I exported the lasso and linear predictions to Excel and ran a quick count of which model was closer to the actual value more often. I am not very familiar with lasso models, but from what I have read, linear models tend to do better when each variable is equally important, which would be the case with the diamonds data.
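
The same closer-more-often count can also be done directly in R instead of Excel; a rough sketch (assuming the two prediction tibbles from Exercise 8, whose rows line up because both were built from test_data in the same order):

# count how often each model's prediction is closer to the true carat
comparison <- tibble(
  carat     = lm_all_predictions$carat,
  lm_err    = abs(lm_all_predictions$.pred - lm_all_predictions$carat),
  lasso_err = abs(lasso_all_predictions$.pred - lasso_all_predictions$carat)
)

comparison |>
  summarize(
    lm_closer    = sum(lm_err < lasso_err),
    lasso_closer = sum(lasso_err < lm_err),
    ties         = sum(lm_err == lasso_err)
  )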

In either case, both models performed quite well.
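
If you want more than a single metric, yardstick's metric_set() evaluates several at once; an optional extension, not required here:

# optional: RMSE, R-squared, and MAE for both models
reg_metrics <- metric_set(rmse, rsq, mae)

reg_metrics(lm_all_predictions,    truth = carat, estimate = .pred)
reg_metrics(lasso_all_predictions, truth = carat, estimate = .pred)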

Submission

To submit your assignment:

  • Change the author name to your name in the YAML portion at the top of this document
  • Render your document to html and publish it to RPubs.
  • Submit the link to your RPubs document in the Brightspace comments section for this assignment.
  • Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

Grading Rubric

Item (percent overall), scored as: 100% = flawless; 67% = minor issues; 33% = moderate issues; 0% = major issues or not attempted.

  • Document formatting: correctly implemented instructions (9%)
  • Exercises - 9% each (81%)
  • Submitted properly to Brightspace (10%): the partial-credit levels (67%/33%) do not apply; you must submit according to instructions to receive any credit for this portion.