tidy modeling - basic ml
Packages
# 1. Set up the environment
pacman::p_load(
  tidyverse,
  tidymodels,
  car,
  gtsummary
)
Dataset
# 2. Load and prepare the data
# For this example, we'll use the built-in mtcars dataset
data <- mtcars %>%
  as_tibble() %>%
  select(mpg, hp, wt)
Data split
# 3. Split the data
set.seed(123)
data_split <- initial_split(data, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)
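Before fitting anything, it can help to confirm how many rows ended up in each partition, since mtcars is a small dataset. The snippet below is a minimal sketch that recreates the split so it runs on its own; the names `n_train` and `n_test` are illustrative and not part of the original code.

```r
library(tidymodels)

# Recreate the split from the step above so this snippet is self-contained
set.seed(123)
data_split <- initial_split(as_tibble(mtcars), prop = 0.75)

# Row counts in each partition (n_train and n_test are illustrative names)
n_train <- nrow(training(data_split))
n_test  <- nrow(testing(data_split))

# Together the two partitions should cover every row of mtcars
c(train = n_train, test = n_test, total = nrow(mtcars))
```

With a test set this small, a single unusual car can move the metrics noticeably, which is worth keeping in mind when reading the evaluation results.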
The model
# 4. Create and train the model
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ hp + wt, data = train_data)
Model evaluation
# 5. Evaluate the model
predictions <- lm_model %>%
  predict(test_data) %>%
  bind_cols(test_data)
metrics <- predictions %>%
  metrics(truth = mpg, estimate = .pred)
print(metrics)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.98
2 rsq     standard       0.871
3 mae     standard       1.72
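For reference, the three metrics in this output can be written as follows, where y_i is the observed mpg, ŷ_i the prediction, and n the number of test rows. Note that, as far as the yardstick documentation describes it, `rsq` is the squared correlation between observed and predicted values rather than the traditional 1 - SS_res / SS_tot form (which yardstick exposes as `rsq_trad`).

```latex
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
R^2           &= \operatorname{cor}\left(y, \hat{y}\right)^2
\end{aligned}
```

RMSE and MAE are both in the units of the outcome (mpg here); RMSE penalizes large errors more heavily because the errors are squared before averaging.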
# 1. Plotting residuals
augment(lm_model$fit) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residual Plot", x = "Fitted values", y = "Residuals")
Multicollinearity
# 2. Checking for multicollinearity
vif(lm_model$fit)
     hp      wt
1.67739 1.67739
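The two VIF values are identical, which is expected with exactly two predictors: each VIF reduces to 1 / (1 - r^2), where r is the correlation between hp and wt. The sketch below checks this by hand. It assumes the same seed and split as above and fits a plain `lm()` as a stand-in for `lm_model$fit`; `manual_vif` is an illustrative name.

```r
library(tidymodels)
library(car)

# Recreate the training data so this snippet is self-contained
set.seed(123)
train_data <- training(initial_split(as_tibble(mtcars), prop = 0.75))

# Plain lm() fit, assumed equivalent to lm_model$fit from the tutorial
fit <- lm(mpg ~ hp + wt, data = train_data)

# With two predictors, both VIFs equal 1 / (1 - r^2)
r <- cor(train_data$hp, train_data$wt)
manual_vif <- 1 / (1 - r^2)

# Compare the hand calculation with car::vif()
all.equal(unname(vif(fit)), rep(manual_vif, 2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, so values this close to 1 are usually not a concern.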
# 3. Testing on new data (we'll use the test set we created earlier)
test_metrics <- lm_model %>%
  predict(test_data) %>%
  bind_cols(test_data) %>%
  mutate(residuals = mpg - .pred) %>%
  metrics(truth = mpg, estimate = .pred)
Compare to baseline model
# 4. Comparing to a baseline model (mean prediction)
baseline_model <- mean(train_data$mpg)
baseline_predictions <- tibble(
  .pred = rep(baseline_model, nrow(test_data)),
  mpg = test_data$mpg
)
baseline_metrics <- baseline_predictions %>%
  metrics(truth = mpg, estimate = .pred)
print(baseline_metrics)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        5.28
2 rsq     standard       NA
3 mae     standard        4.03
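The `NA` for `rsq` follows from how yardstick computes it: as a squared correlation, which is undefined when every prediction is the same value. The sketch below reproduces this on a tiny made-up data frame (the `toy` values and the constant prediction of 20 are hypothetical, not taken from the tutorial's split) and shows `rsq_trad()` as an alternative that remains defined for constant predictions.

```r
library(yardstick)
library(tibble)

# Hypothetical observed values with a constant "mean-style" prediction
toy <- tibble(
  mpg   = c(21, 22.8, 18.7, 14.3),
  .pred = 20
)

# Squared correlation: undefined for constant predictions, so yardstick
# returns NA (with a warning)
rsq_constant <- rsq(toy, truth = mpg, estimate = .pred)

# Traditional R-squared, 1 - SS_res / SS_tot, is still defined; it is zero
# or negative when the constant is no better than the mean of the truth
rsq_trad_constant <- rsq_trad(toy, truth = mpg, estimate = .pred)

rsq_constant
rsq_trad_constant
```

If a comparable R-squared for the baseline is wanted, `rsq_trad()` is one option, though the value should then be computed the same way for the linear model.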
# Compare your model to the baseline
model_comparison <- bind_rows(
  test_metrics %>% mutate(model = "Linear Regression"),
  baseline_metrics %>% mutate(model = "Baseline (Mean)")
)
print(model_comparison)
# A tibble: 6 × 4
  .metric .estimator .estimate model
  <chr>   <chr>          <dbl> <chr>
1 rmse    standard       1.98  Linear Regression
2 rsq     standard       0.871 Linear Regression
3 mae     standard       1.72  Linear Regression
4 rmse    standard       5.28  Baseline (Mean)
5 rsq     standard      NA     Baseline (Mean)
6 mae     standard       4.03  Baseline (Mean)
This output compares the performance of two models: a Linear Regression model and a Baseline (Mean) model. Let’s interpret it:
Linear Regression Model:
RMSE (Root Mean Square Error): 1.9768866
R-squared: 0.8706397
MAE (Mean Absolute Error): 1.7239015
Baseline (Mean) Model:
RMSE: 5.2773510
R-squared: NA (Not Applicable)
MAE: 4.0281250
Interpretation:
The Linear Regression model performs substantially better than the Baseline model on the test set:
Its RMSE (1.98) is much lower than the Baseline’s RMSE (5.28), indicating more accurate predictions.
Its MAE (1.72) is also much lower than the Baseline’s MAE (4.03), which points the same way.
The Linear Regression model has an R-squared of 0.87. Because yardstick computes rsq as the squared correlation between observed and predicted values, this means the predictions track about 87% of the variance in the test-set mpg.
The Baseline model, which predicts the training mean for every car, has no R-squared value (NA): its predictions are constant, so their correlation with the observed values is undefined.
On average, the Linear Regression model’s predictions are off by about 1.72 mpg (MAE), while the Baseline model’s predictions are off by about 4.03 mpg.
In summary, the Linear Regression model clearly outperforms a naive baseline that always predicts the mean, which suggests that hp and wt carry real predictive signal for mpg. Keep in mind that mtcars has only 32 rows, so the test set is small and these metrics could shift noticeably with a different split.
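To put a number on "outperforms", the error reduction relative to the baseline can be computed directly from the comparison table. The sketch below is one way to do it; the values are copied from the output above (rsq is left out because it is NA for the baseline), and `relative_gain` and `reduction` are illustrative names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Metric values copied from the model_comparison output above
model_comparison <- tribble(
  ~.metric, ~.estimate, ~model,
  "rmse",   1.9768866,  "Linear Regression",
  "mae",    1.7239015,  "Linear Regression",
  "rmse",   5.2773510,  "Baseline (Mean)",
  "mae",    4.0281250,  "Baseline (Mean)"
)

# One row per metric, one column per model, plus the fractional reduction
# in error achieved by the linear model relative to the baseline
relative_gain <- model_comparison %>%
  pivot_wider(names_from = model, values_from = .estimate) %>%
  mutate(reduction = 1 - `Linear Regression` / `Baseline (Mean)`)

relative_gain
```

On these numbers the linear model cuts RMSE by roughly 63% and MAE by roughly 57% relative to the mean baseline.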