tidy modeling - basic ml

Packages

# 1. Set up the environment
pacman::p_load(
  tidyverse,
  tidymodels,
  car,
  gtsummary
)

Dataset
# 2. Load and prepare the data
# For this example, we'll use the built-in mtcars dataset
data <- mtcars %>%
  as_tibble() %>%
  select(mpg, hp, wt)

Data split
# 3. Split the data
set.seed(123)
data_split <- initial_split(data, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

The model
# 4. Create and train the model
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ hp + wt, data = train_data)

Model evaluation
# 5. Evaluate the model
predictions <- lm_model %>%
  predict(test_data) %>%
  bind_cols(test_data)

# Store the results under a name that does not mask yardstick's metrics()
model_metrics <- predictions %>%
  metrics(truth = mpg, estimate = .pred)

print(model_metrics)

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       1.98
2 rsq     standard       0.871
3 mae     standard       1.72
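
To make the three metrics concrete, here is a minimal base R sketch that computes them by hand. It assumes a simple random 75/25 split made with `sample()` rather than the `initial_split()` call used above, so its numbers will not match the table. The names `train`, `test`, `fit`, `pred` and `err` are illustrative only. As far as I understand the yardstick documentation, `rsq` is the squared correlation between truth and estimate, which is what the sketch uses.

```r
# Minimal sketch: what rmse, mae and rsq measure, in base R.
# Assumption: a plain sample()-based 75/25 split, not the rsample split
# from the text, so the values differ from the table above.
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.75 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

fit  <- lm(mpg ~ hp + wt, data = train)
pred <- predict(fit, newdata = test)
err  <- test$mpg - pred

rmse <- sqrt(mean(err^2))      # squares errors, so large misses weigh more
mae  <- mean(abs(err))         # average absolute miss, in mpg
rsq  <- cor(test$mpg, pred)^2  # squared correlation of truth and prediction

print(c(rmse = rmse, mae = mae, rsq = rsq))
```

RMSE is never smaller than MAE on the same errors. A large gap between the two points to a few unusually large misses.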
# 1. Plotting residuals
augment(lm_model$fit) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residual Plot", x = "Fitted values", y = "Residuals")

Multicollinearity
# 2. Checking for multicollinearity
vif(lm_model$fit)

     hp      wt
1.67739 1.67739
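
The variance inflation factor can also be computed by hand, which shows why both predictors get the same value here. The sketch below uses the standard definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. It assumes the full `mtcars` data rather than `train_data`, so the values will differ from the output above. The variable names are illustrative.

```r
# Sketch: VIF_j = 1 / (1 - R^2_j), with R^2_j taken from regressing
# predictor j on the remaining predictors.
# Assumption: fitted on all of mtcars, not on the training split.
r2_hp <- summary(lm(hp ~ wt, data = mtcars))$r.squared
r2_wt <- summary(lm(wt ~ hp, data = mtcars))$r.squared

vif_hp <- 1 / (1 - r2_hp)
vif_wt <- 1 / (1 - r2_wt)

print(c(hp = vif_hp, wt = vif_wt))
```

With exactly two predictors, each auxiliary R² equals the squared correlation between `hp` and `wt`, so the two VIFs are always identical. Values this close to 1 are well below the commonly cited warning thresholds of 5 or 10.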
# 3. Testing on new data (we'll use the test set we created earlier)
test_metrics <- lm_model %>%
  predict(test_data) %>%
  bind_cols(test_data) %>%
  mutate(residuals = mpg - .pred) %>%
  metrics(truth = mpg, estimate = .pred)

Compare to baseline model
# 4. Comparing to a baseline model (mean prediction)
baseline_model <- mean(train_data$mpg)
baseline_predictions <- tibble(
  .pred = rep(baseline_model, nrow(test_data)),
  mpg = test_data$mpg
)
baseline_metrics <- baseline_predictions %>%
  metrics(truth = mpg, estimate = .pred)
print(baseline_metrics)

# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        5.28
2 rsq     standard       NA
3 mae     standard        4.03
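
The `NA` for `rsq` follows from how the metric is defined. The sketch below assumes `rsq` is the squared correlation between truth and estimate, which is my reading of the yardstick documentation. It uses a handful of illustrative mpg values, not the actual test set.

```r
# Sketch: why R-squared is NA for a constant prediction.
# Assumptions: rsq is the squared correlation of truth and estimate;
# the truth values are illustrative, not the real test set.
truth    <- c(21.0, 22.8, 18.7, 14.3, 30.4)
constant <- rep(mean(truth), length(truth))

# cor() is undefined when one input has zero variance; R returns NA
# (with a warning, suppressed here).
rsq_corr <- suppressWarnings(cor(truth, constant))^2

# The traditional definition, 1 - SSE/SST, is still defined.
sse <- sum((truth - constant)^2)
sst <- sum((truth - mean(truth))^2)
rsq_trad <- 1 - sse / sst

print(c(rsq_corr = rsq_corr, rsq_trad = rsq_trad))
```

If a numeric value is wanted for the baseline, yardstick also provides `rsq_trad()`, which uses the traditional definition. For a baseline that predicts the training mean, that value on the test set will be at or below zero.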
# Compare your model to the baseline
model_comparison <- bind_rows(
  test_metrics %>% mutate(model = "Linear Regression"),
  baseline_metrics %>% mutate(model = "Baseline (Mean)")
)
print(model_comparison)

# A tibble: 6 × 4
  .metric .estimator .estimate model
  <chr>   <chr>          <dbl> <chr>
1 rmse    standard       1.98  Linear Regression
2 rsq     standard       0.871 Linear Regression
3 mae     standard       1.72  Linear Regression
4 rmse    standard       5.28  Baseline (Mean)
5 rsq     standard      NA     Baseline (Mean)
6 mae     standard       4.03  Baseline (Mean)

This output compares two models on the test set: the Linear Regression model and a Baseline (Mean) model that always predicts the training-set mean of mpg. The unrounded values are:

Linear Regression Model:
  RMSE (Root Mean Square Error): 1.9768866
  R-squared: 0.8706397
  MAE (Mean Absolute Error): 1.7239015

Baseline (Mean) Model:
  RMSE: 5.2773510
  R-squared: NA (undefined for a constant prediction)
  MAE: 4.0281250

Interpretation:

The Linear Regression model performs substantially better than the Baseline model on the test set:
  Its RMSE (1.98) is much lower than the Baseline’s RMSE (5.28), indicating more accurate predictions.
  Its MAE (1.72) is also much lower than the Baseline’s MAE (4.03). On average, its predictions are off by about 1.72 mpg, against about 4.03 mpg for the Baseline.
  Its R-squared of 0.87 means that predicted and observed mpg are strongly related on the test set, with roughly 87% of the variance accounted for.
  The Baseline model has no R-squared value (NA) because it predicts the same value for every car. With a constant prediction, the correlation between predicted and observed values is undefined.

In summary, your Linear Regression model clearly outperforms a naive baseline that always predicts the mean, which suggests that hp and wt carry real predictive information about mpg. Because the test set holds only 8 of the 32 cars, treat these estimates as indicative rather than precise.
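
The gap between the two models can also be expressed as a relative error reduction. The sketch below does that arithmetic with the unrounded values reported above. The variable names are illustrative.

```r
# Sketch: relative error reduction versus the mean baseline, using the
# values reported in the comparison above.
lm_rmse <- 1.9768866
base_rmse <- 5.2773510
lm_mae <- 1.7239015
base_mae <- 4.0281250

rmse_reduction <- 1 - lm_rmse / base_rmse
mae_reduction  <- 1 - lm_mae / base_mae

print(round(c(rmse = rmse_reduction, mae = mae_reduction), 2))
```

On these figures the linear model cuts RMSE by roughly 63% and MAE by roughly 57% relative to the baseline.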