Loading Packages

library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ recipes      1.1.1
## ✔ dials        1.4.0     ✔ rsample      1.2.1
## ✔ dplyr        1.1.4     ✔ tibble       3.2.1
## ✔ ggplot2      3.5.1     ✔ tidyr        1.3.1
## ✔ infer        1.0.7     ✔ tune         1.3.0
## ✔ modeldata    1.4.0     ✔ workflows    1.2.0
## ✔ parsnip      1.3.0     ✔ workflowsets 1.1.0
## ✔ purrr        1.0.4     ✔ yardstick    1.3.2
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'recipes' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ stringr::fixed()    masks recipes::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ readr::spec()       masks yardstick::spec()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
boston <- read_csv("boston.csv")
## Rows: 506 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): lon, lat, cmedv, crim, zn, indus, chas, nox, rm, age, dis, rad, ta...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(boston)

Question 2

set.seed(123)
split <- initial_split(boston, prop = 0.7, strata = cmedv)
train <- training(split)
test <- testing(split)

Quesion 3

cor(train) %>%
  as_tibble(rownames = 'predictor_vars') %>%
  select(predictor_vars, cmedv) %>%
  arrange(desc(cmedv))
## # A tibble: 16 × 2
##    predictor_vars    cmedv
##    <chr>             <dbl>
##  1 cmedv           1      
##  2 rm              0.708  
##  3 b               0.358  
##  4 zn              0.344  
##  5 dis             0.271  
##  6 chas            0.165  
##  7 lat            -0.00202
##  8 lon            -0.315  
##  9 crim           -0.384  
## 10 rad            -0.397  
## 11 age            -0.399  
## 12 nox            -0.439  
## 13 tax            -0.479  
## 14 indus          -0.490  
## 15 ptratio        -0.501  
## 16 lstat          -0.743

The Variable with the strongest positive correlation is rm, and the one with the strongest negative correlation is lstat

Question 4

ggplot(train, aes(x = rm, y = cmedv)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Yes, this line is a pretty good representation of the relationship.

Question 5

boston_lm <- linear_reg() %>%
              fit(cmedv ~ rm, data = train)
                  
tidy(boston_lm)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -35.4      3.11      -11.4 8.70e-26
## 2 rm              9.22     0.491      18.8 7.46e-55

Our model predicts that a house with 0 rm would sell for -$35,429. As the number of rooms in a hosue increases by 1, our model predicts the selling price would increase by $9,217. Both are statistically different than 0 based on their low p-values and their confidence intervals not including 0.

confint(boston_lm$fit)
##                  2.5 %    97.5 %
## (Intercept) -41.546106 -29.31285
## rm            8.250909  10.18311

Question 6

boston_lm %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        6.83

Question 7

boston_all <- linear_reg() %>%
  fit(cmedv ~., data = train)

tidy(boston_all)
## # A tibble: 16 × 5
##    term          estimate std.error statistic  p.value
##    <chr>            <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept) -608.      342.         -1.78  7.64e- 2
##  2 lon           -5.65      3.86       -1.46  1.45e- 1
##  3 lat            5.54      4.24        1.31  1.93e- 1
##  4 crim          -0.0830    0.0396     -2.10  3.65e- 2
##  5 zn             0.0332    0.0165      2.01  4.56e- 2
##  6 indus         -0.0143    0.0738     -0.193 8.47e- 1
##  7 chas           2.28      1.05        2.17  3.06e- 2
##  8 nox          -11.7       4.74       -2.46  1.44e- 2
##  9 rm             4.37      0.516       8.46  8.13e-16
## 10 age           -0.00921   0.0156     -0.591 5.55e- 1
## 11 dis           -1.26      0.244      -5.17  4.06e- 7
## 12 rad            0.272     0.0790      3.44  6.47e- 4
## 13 tax           -0.0121    0.00436    -2.78  5.78e- 3
## 14 ptratio       -0.874     0.163      -5.37  1.48e- 7
## 15 b              0.0123    0.00310     3.97  8.72e- 5
## 16 lstat         -0.479     0.0637     -7.51  5.24e-13

The predictor variables that are statistically different than zero at the p value < 0.05 level are rm, dis, rad, tax, ptratio, b, and lstat. As the percentage of lower-status individuals increases by 1%, the selling price decreases by $478.60.

Question 8

boston_all %>%
  predict(test) %>%
  bind_cols(test %>% select(cmedv)) %>%
  rmse(truth = cmedv, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        4.83

On average, our linear regression is mispredicitng by ~$4,829. This is better than the model used for the rm predictor variable because the value is smaller.

Question 9

top_5 <- tidy(boston_all) %>%
  filter(term != "(Intercept)") %>%
  arrange(desc(statistic)) 
  
top_5
## # A tibble: 15 × 5
##    term     estimate std.error statistic  p.value
##    <chr>       <dbl>     <dbl>     <dbl>    <dbl>
##  1 rm        4.37      0.516       8.46  8.13e-16
##  2 b         0.0123    0.00310     3.97  8.72e- 5
##  3 rad       0.272     0.0790      3.44  6.47e- 4
##  4 chas      2.28      1.05        2.17  3.06e- 2
##  5 zn        0.0332    0.0165      2.01  4.56e- 2
##  6 lat       5.54      4.24        1.31  1.93e- 1
##  7 indus    -0.0143    0.0738     -0.193 8.47e- 1
##  8 age      -0.00921   0.0156     -0.591 5.55e- 1
##  9 lon      -5.65      3.86       -1.46  1.45e- 1
## 10 crim     -0.0830    0.0396     -2.10  3.65e- 2
## 11 nox     -11.7       4.74       -2.46  1.44e- 2
## 12 tax      -0.0121    0.00436    -2.78  5.78e- 3
## 13 dis      -1.26      0.244      -5.17  4.06e- 7
## 14 ptratio  -0.874     0.163      -5.37  1.48e- 7
## 15 lstat    -0.479     0.0637     -7.51  5.24e-13

The top 5 most influential predictor variables are rm, b, rad, chas, and zn.