library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.7 ✔ recipes 1.1.1
## ✔ dials 1.4.0 ✔ rsample 1.2.1
## ✔ dplyr 1.1.4 ✔ tibble 3.2.1
## ✔ ggplot2 3.5.1 ✔ tidyr 1.3.1
## ✔ infer 1.0.7 ✔ tune 1.3.0
## ✔ modeldata 1.4.0 ✔ workflows 1.2.0
## ✔ parsnip 1.3.0 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.4 ✔ yardstick 1.3.2
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.4 ✔ stringr 1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
boston <- read_csv("boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(boston)
set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)
cor(train) %>%
as_tibble(rownames = 'predictor_vars') %>%
select(predictor_vars, cmedv) %>%
arrange(desc(cmedv))
## # A tibble: 16 × 2
## predictor_vars cmedv
## <chr> <dbl>
## 1 cmedv 1
## 2 rm 0.708
## 3 b 0.358
## 4 zn 0.344
## 5 dis 0.271
## 6 chas 0.165
## 7 lat -0.00202
## 8 lon -0.315
## 9 crim -0.384
## 10 rad -0.397
## 11 age -0.399
## 12 nox -0.439
## 13 tax -0.479
## 14 indus -0.490
## 15 ptratio -0.501
## 16 lstat -0.743
The Variable with the strongest positive correlation is rm, and the one with the strongest negative correlation is lstat
ggplot(train, aes(x = rm, y = cmedv)) +
geom_point(alpha = 0.2) +
geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Yes, this line is a pretty good representation of the relationship.
boston_lm <- linear_reg() %>%
fit(cmedv ~ rm, data = train)
tidy(boston_lm)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -35.4 3.11 -11.4 8.70e-26
## 2 rm 9.22 0.491 18.8 7.46e-55
Our model predicts that a house with 0 rm would sell for -$35,429. As the number of rooms in a hosue increases by 1, our model predicts the selling price would increase by $9,217. Both are statistically different than 0 based on their low p-values and their confidence intervals not including 0.
confint(boston_lm$fit)
## 2.5 % 97.5 %
## (Intercept) -41.546106 -29.31285
## rm 8.250909 10.18311
boston_lm %>%
predict(test) %>%
bind_cols(test %>% select(cmedv)) %>%
rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 6.83
boston_all <- linear_reg() %>%
fit(cmedv ~., data = train)
tidy(boston_all)
## # A tibble: 16 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -608. 342. -1.78 7.64e- 2
## 2 lon -5.65 3.86 -1.46 1.45e- 1
## 3 lat 5.54 4.24 1.31 1.93e- 1
## 4 crim -0.0830 0.0396 -2.10 3.65e- 2
## 5 zn 0.0332 0.0165 2.01 4.56e- 2
## 6 indus -0.0143 0.0738 -0.193 8.47e- 1
## 7 chas 2.28 1.05 2.17 3.06e- 2
## 8 nox -11.7 4.74 -2.46 1.44e- 2
## 9 rm 4.37 0.516 8.46 8.13e-16
## 10 age -0.00921 0.0156 -0.591 5.55e- 1
## 11 dis -1.26 0.244 -5.17 4.06e- 7
## 12 rad 0.272 0.0790 3.44 6.47e- 4
## 13 tax -0.0121 0.00436 -2.78 5.78e- 3
## 14 ptratio -0.874 0.163 -5.37 1.48e- 7
## 15 b 0.0123 0.00310 3.97 8.72e- 5
## 16 lstat -0.479 0.0637 -7.51 5.24e-13
The predictor variables that are statistically different than zero at the p value < 0.05 level are rm, dis, rad, tax, ptratio, b, and lstat. As the percentage of lower-status individuals increases by 1%, the selling price decreases by $478.60.
boston_all %>%
predict(test) %>%
bind_cols(test %>% select(cmedv)) %>%
rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.83
On average, our linear regression is mispredicitng by ~$4,829. This is better than the model used for the rm predictor variable because the value is smaller.
top_5 <- tidy(boston_all) %>%
filter(term != "(Intercept)") %>%
arrange(desc(statistic))
top_5
## # A tibble: 15 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 rm 4.37 0.516 8.46 8.13e-16
## 2 b 0.0123 0.00310 3.97 8.72e- 5
## 3 rad 0.272 0.0790 3.44 6.47e- 4
## 4 chas 2.28 1.05 2.17 3.06e- 2
## 5 zn 0.0332 0.0165 2.01 4.56e- 2
## 6 lat 5.54 4.24 1.31 1.93e- 1
## 7 indus -0.0143 0.0738 -0.193 8.47e- 1
## 8 age -0.00921 0.0156 -0.591 5.55e- 1
## 9 lon -5.65 3.86 -1.46 1.45e- 1
## 10 crim -0.0830 0.0396 -2.10 3.65e- 2
## 11 nox -11.7 4.74 -2.46 1.44e- 2
## 12 tax -0.0121 0.00436 -2.78 5.78e- 3
## 13 dis -1.26 0.244 -5.17 4.06e- 7
## 14 ptratio -0.874 0.163 -5.37 1.48e- 7
## 15 lstat -0.479 0.0637 -7.51 5.24e-13
The top 5 most influential predictor variables are rm, b, rad, chas, and zn.