Tidy modeling: basic ML

CREATED

October 7, 2024

UPDATED

October 7, 2024

Packages

# 1. Set up the environment
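# p_load() installs any missing package and then attaches it (requires pacman)
pacman::p_load(
  tidyverse,   # data manipulation and plotting
  tidymodels,  # rsample, parsnip, yardstick, broom
  car,         # vif() for the multicollinearity check
  gtsummary    # summary tables (not used in the code shown here)
)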

Dataset

# 2. Load and prepare the data
# For this example, we'll use the built-in mtcars dataset
data <- mtcars %>%
  as_tibble() %>%
  select(mpg, hp, wt)

Data split

# 3. Split the data
set.seed(123)
data_split <- initial_split(data, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
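
As a quick sanity check on the split, the sketch below recreates it and counts the rows in each part. It assumes the same seed and proportion as above; the names n_train and n_test are introduced here only for illustration. With 32 rows in mtcars and prop = 0.75, the split is 24 training rows and 8 test rows.

```r
library(rsample)

# Recreate the split from the text (same seed, same proportion)
set.seed(123)
data_split <- initial_split(mtcars[, c("mpg", "hp", "wt")], prop = 0.75)

# Count the rows that ended up in each part
n_train <- nrow(training(data_split))
n_test <- nrow(testing(data_split))
c(train = n_train, test = n_test)
```

A test set this small means every metric computed on it rests on only a handful of cars.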

The model

# 4. Create and train the model
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ hp + wt, data = train_data)
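
Before evaluating the model, it can help to look at the fitted coefficients. The sketch below refits the same model and calls tidy() on it; the name coefs is an illustrative assumption, and no output is shown because the exact estimates depend on the split.

```r
library(tidymodels)

# Recreate the training set and the model from the text
set.seed(123)
data_split <- initial_split(mtcars[, c("mpg", "hp", "wt")], prop = 0.75)
train_data <- training(data_split)

lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ hp + wt, data = train_data)

# One row per term: estimate, standard error, test statistic, p-value
coefs <- tidy(lm_model)
coefs
```

The table should have three rows: the intercept, hp, and wt.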

Model evaluation

# 5. Evaluate the model
predictions <- lm_model %>%
  predict(test_data) %>%
  bind_cols(test_data)
# Store the results under a name that does not mask yardstick::metrics()
model_metrics <- predictions %>%
  metrics(truth = mpg, estimate = .pred)
print(model_metrics)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.98 
2 rsq     standard       0.871
3 mae     standard       1.72 

# 1. Plotting residuals (computed on the training data the model was fit to)
augment(extract_fit_engine(lm_model)) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residual Plot", x = "Fitted values", y = "Residuals")

Multicollinearity

# 2. Checking for multicollinearity
vif(lm_model$fit)
     hp      wt 
1.67739 1.67739 
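
To see where these numbers come from, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below computes it by hand in base R, assuming the same seed reproduces the training set; r2_hp and vif_hp are names introduced for illustration. With only two predictors, both share one R², which is why hp and wt have identical VIFs.

```r
library(rsample)

# Recreate the training set used in the text
set.seed(123)
data_split <- initial_split(mtcars[, c("mpg", "hp", "wt")], prop = 0.75)
train_data <- training(data_split)

# Regress one predictor on the other and take the R-squared
r2_hp <- summary(lm(hp ~ wt, data = train_data))$r.squared

# VIF for hp; regressing wt on hp gives the same R-squared, hence the same VIF
vif_hp <- 1 / (1 - r2_hp)
vif_hp
```

If the split is reproduced exactly, this value should agree with the vif() output above.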

# 3. Testing on new data (the test set created earlier)
# These are the same metrics as in step 5, stored for the comparison table below
test_metrics <- lm_model %>%
  predict(test_data) %>%
  bind_cols(test_data) %>%
  metrics(truth = mpg, estimate = .pred)

Compare to baseline model

# 4. Comparing to a baseline model (mean prediction)
baseline_model <- mean(train_data$mpg)
baseline_predictions <- tibble(
  .pred = rep(baseline_model, nrow(test_data)),
  mpg = test_data$mpg
)

baseline_metrics <- baseline_predictions %>%
  metrics(truth = mpg, estimate = .pred)

print(baseline_metrics)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        5.28
2 rsq     standard       NA   
3 mae     standard        4.03
# Compare your model to the baseline
model_comparison <- bind_rows(
  test_metrics %>% mutate(model = "Linear Regression"),
  baseline_metrics %>% mutate(model = "Baseline (Mean)")
)

print(model_comparison)
# A tibble: 6 × 4
  .metric .estimator .estimate model            
  <chr>   <chr>          <dbl> <chr>            
1 rmse    standard       1.98  Linear Regression
2 rsq     standard       0.871 Linear Regression
3 mae     standard       1.72  Linear Regression
4 rmse    standard       5.28  Baseline (Mean)  
5 rsq     standard      NA     Baseline (Mean)  
6 mae     standard       4.03  Baseline (Mean)  

This output compares the performance of two models: a Linear Regression model and a Baseline (Mean) model. Let’s interpret it:

  1. Linear Regression Model:

    • RMSE (Root Mean Square Error): 1.9768866

    • R-squared: 0.8706397

    • MAE (Mean Absolute Error): 1.7239015

  2. Baseline (Mean) Model:

    • RMSE: 5.2773510

    • R-squared: NA (not available)

    • MAE: 4.0281250

Interpretation:

  1. The Linear Regression model performs substantially better than the Baseline model on the test set:

    • Its RMSE (1.98) is about 63% lower than the Baseline’s RMSE (5.28), indicating more accurate predictions.

    • Its MAE (1.72) is about 57% lower than the Baseline’s MAE (4.03), which points the same way.

  2. The Linear Regression model’s predictions track the observed mpg values closely on the test set. The R-squared of 0.87 is the squared correlation between predicted and observed mpg, commonly read as about 87% of the variance explained.

  3. The Baseline model, which predicts the training-set mean for every car, has no R-squared value (NA). yardstick computes rsq as the squared correlation between observed and predicted values, and a correlation is undefined when every prediction is the same constant.

  4. On average, the Linear Regression model’s predictions are off by about 1.72 mpg (MAE), while the Baseline model’s predictions are off by about 4.03 mpg.
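
The three metrics discussed above can be recomputed by hand, which makes their definitions concrete. The sketch below is a base R illustration rather than the yardstick implementation: it assumes the same seed reproduces the split, fits the model with lm() directly, and uses illustrative names (rmse_hand, mae_hand, rsq_hand, rsq_baseline).

```r
library(rsample)

# Recreate the split and fit the same linear model with base R
set.seed(123)
data_split <- initial_split(mtcars[, c("mpg", "hp", "wt")], prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

fit <- lm(mpg ~ hp + wt, data = train_data)
pred <- predict(fit, newdata = test_data)
truth <- test_data$mpg

# RMSE: square root of the mean squared error
rmse_hand <- sqrt(mean((truth - pred)^2))
# MAE: mean of the absolute errors
mae_hand <- mean(abs(truth - pred))
# R-squared as the squared correlation between truth and prediction
rsq_hand <- cor(truth, pred)^2

# Constant predictions have zero variance, so the correlation is NA
baseline_pred <- rep(mean(train_data$mpg), nrow(test_data))
rsq_baseline <- suppressWarnings(cor(truth, baseline_pred)^2)

c(rmse = rmse_hand, mae = mae_hand, rsq = rsq_hand, rsq_baseline = rsq_baseline)
```

RMSE is never smaller than MAE because squaring gives large errors more weight, so a wide gap between the two hints at a few large misses.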

In summary, the Linear Regression model clearly outperforms a naive baseline that always predicts the mean, which suggests that hp and wt carry real information about mpg. Because the test set contains only 8 cars (25% of the 32 rows in mtcars), these estimates are noisy and should be read as a rough indication rather than a precise measure of performance.