The "used cars.csv" file contains information about 1,000 randomly sampled used sedans (4-door cars) from 2021. The variables include price, year, age, cylinders, mileage, transmission, and fuel.
Create a set of scatterplots with price on the y-axis and the 4 numeric predictors (year, age, cylinders, mileage) on their respective x-axes.
cars |>
  # Pivot the 4 numeric predictors into long format: names go in
  # the `predictor` column, values in the `value` column
  pivot_longer(
    cols = c(year, age, cylinders, mileage),
    names_to = "predictor",
    values_to = "value"
  ) |>
  # Create the set of scatterplots
  ggplot(
    mapping = aes(
      x = value,
      y = price
    )
  ) +
  geom_point(alpha = 0.5) +
  geom_smooth(
    method = "loess",
    se = FALSE,
    formula = y ~ x
  ) +
  # Separate the plots into 4 panels with different x-axes
  facet_wrap(
    facets = vars(predictor),
    scales = "free_x"
  ) +
  labs(
    x = NULL,
    y = NULL
  ) +
  # Add $ to the y-axis
  scale_y_continuous(labels = scales::label_dollar())
How does each of the numeric variables appear to predict the price of the used cars (positive/negative/none, linear/curved/none, etc.)?
year: A curved, positive trend
age: A curved, negative trend
cylinders: No noticeable trend
mileage: A somewhat linear, negative trend
In the code chunk below, create a correlation plot for the same 5 variables as in question 1a.
cars |>
  dplyr::select(price, age, year, mileage, cylinders) |>
  ggcorr(
    low = "red",
    high = "blue",
    label = TRUE,
    label_round = 2
  )
Does there appear to be a potential problem with multicollinearity? Explain your answer!
Yes: since every car was sampled in 2021, age is just 2021 minus year, so year and age are perfectly (negatively) correlated. Only one of the two should be included as a predictor in any model.
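A quick numeric check of this (a sketch, assuming age was derived from the 2021 model year):
# If age = 2021 - year, the correlation should be -1 (or very nearly so)
cor(cars$year, cars$age)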
In the code chunk below, fit the following four linear models with the corresponding names and explanatory variables listed:
price_lm5: age + mileage + cylinders + transmission + fuel
price_lm3: age + mileage + cylinders
price_lm2: age + mileage
price_lm1: age
# price_lm5
price_lm5 <-
  lm(formula = price ~ age + mileage + cylinders + transmission + fuel,
     data = cars)

# price_lm3
price_lm3 <-
  lm(formula = price ~ age + mileage + cylinders,
     data = cars)

# price_lm2
price_lm2 <-
  lm(formula = price ~ age + mileage,
     data = cars)

# price_lm1
price_lm1 <-
  lm(formula = price ~ age,
     data = cars)
If done properly, the code chunk below should run and produce the following table:
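One way to build that summary table (a sketch, assuming the broom and purrr packages are available):
list(
  price_lm1 = price_lm1,
  price_lm2 = price_lm2,
  price_lm3 = price_lm3,
  price_lm5 = price_lm5
) |>
  # glance() returns one row of fit statistics (r.squared, sigma, ...) per model
  purrr::map_dfr(broom::glance, .id = "model") |>
  # Record how many predictors each model uses
  dplyr::mutate(n_predictors = c(1, 2, 3, 5)) |>
  dplyr::select(model, n_predictors, r.squared, sigma)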
| model | n_predictors | r.squared | sigma |
|---|---|---|---|
| price_lm1 | 1 | 0.374 | 3646 |
| price_lm2 | 2 | 0.496 | 3274 |
| price_lm3 | 3 | 0.582 | 2983 |
| price_lm5 | 5 | 0.583 | 2983 |
Using the output created in 2a i), which model should you use? Again, justify your answer!
We should use price_lm3 because it has a much higher \(R^2\) and a lower \(\sigma\) than price_lm1 and price_lm2. We should use it over price_lm5 because it fits almost as well as the more complex model (the \(R^2\) values are almost identical), so it's not worth adding the complexity of the 2 additional predictors.
The code chunk below reads in the “test cars.csv” data set that you’ll use with the models fit in question 2.
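A minimal sketch of that chunk (assuming the file sits in the project's working directory):
# Read in the holdout data used to evaluate the models
test_cars <- read.csv("test cars.csv")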
Using the models you created in 2a), predict the price for the cars in the test_cars data set. You can predict the results for a new data set using the predict() function, which requires 2 arguments:
object = the model used to make predictions (the different lm objects)
newdata = the data set you want to make predictions for
Combine these predictions into a data set named price_pred that has 5 columns:
price: The actual price for the cars in the test_cars data set
price5: The predicted price using the price_lm5 model
price3: The predicted price using the price_lm3 model
price2: The predicted price using the price_lm2 model
price1: The predicted price using the price_lm1 model
price_pred <-
  data.frame(
    price  = test_cars$price,
    price5 = predict(object = price_lm5, newdata = test_cars),
    price3 = predict(object = price_lm3, newdata = test_cars),
    price2 = predict(object = price_lm2, newdata = test_cars),
    price1 = predict(object = price_lm1, newdata = test_cars)
  )

# Display as a tibble for a compact preview
as_tibble(price_pred)
## # A tibble: 200 Ă— 5
## price price5 price3 price2 price1
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 10495 9174. 9218. 10573. 9902.
## 2 7299 6875. 6906. 7045. 5440.
## 3 9999 15060. 15048. 11700. 9902.
## 4 5400 5891. 5939. 6769. 10646.
## 5 17985 13574. 13585. 12187. 11390.
## 6 8499 9018. 9062. 10145. 10646.
## 7 7000 5729. 5760. 5287. 6927.
## 8 14000 12328. 12367. 13717. 10646.
## 9 16000 14987. 15017. 15543. 13621.
## 10 19800 14382. 14367. 10185. 12133.
## # ℹ 190 more rows
Calculate the \(R^2\), sigma, and mean absolute error (MAE) of the test predictions for each of the 4 models. You can either calculate them individually and put them together in a data set, or you can use pivot_longer() to “shorten” the code required!
sigma is: \[\textrm{sigma} = \sqrt{\frac{\sum(y - \hat{y})^2}{n}}\]
The MAE is: \[\textrm{MAE} = \frac{\sum|y - \hat{y}|}{n}\]
and the absolute value function in R is abs().
price_pred |>
  # Stack the four prediction columns so all models are summarized at once
  pivot_longer(
    cols = price5:price1,
    names_to = "model",
    values_to = "price_hat"
  ) |>
  summarize(
    .by = model,
    r.squared = cor(price, price_hat)^2 |> round(3),
    sigma = sqrt(mean((price - price_hat)^2)) |> round(0),
    MAE = mean(abs(price - price_hat)) |> round(0)
  )
## # A tibble: 4 Ă— 4
## model r.squared sigma MAE
## <chr> <dbl> <dbl> <dbl>
## 1 price5 0.525 3142 2469
## 2 price3 0.526 3141 2466
## 3 price2 0.416 3484 2671
## 4 price1 0.282 3867 2970
Using your results from the above code chunk, which model should you use?
Using the new cars as a way of testing the accuracy of each of the four models, the results agree with the answer from 2b): price_lm3 has the highest \(R^2\) and the lowest sigma and MAE, indicating it is the most accurate of the four candidates.
The model estimates for price_lm5 are displayed in the code chunk below and you’ll be using them to answer parts a) and b).
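One chunk that would produce the output shown (a sketch, assuming the broom package is loaded):
broom::tidy(price_lm5) |>
  dplyr::select(term, estimate) |>
  # Round to whole dollars to match the display below
  dplyr::mutate(estimate = round(estimate))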
## # A tibble: 6 Ă— 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 15222
## 2 age -721
## 3 mileage -41
## 4 cylinders 1211
## 5 transmissionmanual 788
## 6 fuelhybrid 206
Interpret the mileage estimate for the model:
For every additional 1,000 miles a car has been driven, the price is expected to decrease by $41, when all of the other variables are the same (held constant).
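A quick way to see this interpretation in action (a sketch with hypothetical rows; the factor levels are inferred from the coefficient names, and mileage is assumed to be recorded in thousands of miles):
# Two cars that are identical except for one unit (1,000 miles) of mileage
two_cars <- data.frame(
  age = c(5, 5),
  mileage = c(50, 51),
  cylinders = c(4, 4),
  transmission = c("automatic", "automatic"),
  fuel = c("gas", "gas")
)
# The predictions should differ by the mileage estimate: about -$41
diff(predict(object = price_lm5, newdata = two_cars))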
Interpret the fuel estimate for the model:
The price is $206 more, on average, for a hybrid car compared to a gas-powered car, when all other variables are the same.
You’ll be using the cars data set and price_lm3 model for all parts of question 5.
Create just the residual plot for the price_lm3 model.
# Add the fitted values and residuals to the cars data
augment_columns(
  x = price_lm3,
  data = cars
) |>
  ggplot(
    mapping = aes(
      x = .fitted,
      y = .resid
    )
  ) +
  geom_point() +
  labs(
    x = "Predicted Price",
    y = "Residuals"
  ) +
  geom_hline(
    mapping = aes(yintercept = mean(.resid)),
    color = "red",
    linewidth = 1
  ) +
  scale_x_continuous(labels = scales::label_dollar()) +
  scale_y_continuous(labels = scales::label_dollar())
Using the residual plot you created, which of the assumptions about our linear model listed below appear to be violated? If they’ve been violated, justify your answer.
Linear Assumption: Violated. There is a clear downward trend in the residual plot, indicating that a line is not the best choice.
No outliers: Violated. There is an outlier at about \(\hat{y} = 8000\) and \(e = 15000\).
Equal Spread (homoscedasticity): Violated. As the predicted price increases, the residuals appear to be getting larger overall.
The residual plots for the three predictors are shown below. Is there evidence of any non-linear trends? Justify your answer!
# Add the residuals to the data, then plot them against each predictor
augment_columns(
  x = price_lm3,
  data = cars
) |>
  dplyr::select(age, cylinders, mileage, .resid) |>
  pivot_longer(
    cols = -.resid,
    names_to = "predictor",
    values_to = "value"
  ) |>
  mutate(predictor = as_factor(predictor)) |>
  ggplot(
    mapping = aes(
      x = value,
      y = .resid
    )
  ) +
  geom_point(alpha = 0.25) +
  geom_hline(
    mapping = aes(yintercept = mean(.resid)),
    color = "red",
    linewidth = 1
  ) +
  geom_smooth(
    method = "loess",
    se = FALSE,
    formula = y ~ x,
    color = "steelblue",
    linewidth = 1
  ) +
  # Stack the three panels, each with its own x-axis
  facet_wrap(
    facets = vars(predictor),
    scales = "free_x",
    nrow = 3
  ) +
  labs(
    x = NULL,
    y = "Residuals"
  ) +
  scale_y_continuous(labels = scales::label_dollar())
No, the individual residual plots all look like what you’d expect to see when the linearity condition is met. (Age does have a slight bow, but not enough of one to be concerning.)
Not required:
The issue with using a linear model to predict price is that price has a lower bound at $0, but the predicted price can be anything, even negative. Because of the boundedness of the response variable, our linear model is not appropriate, even though none of the individual variables appear to have a non-linear relationship.
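To illustrate the point, a sketch (the inputs are hypothetical; an old, high-mileage car can push a purely linear prediction below $0):
# A hypothetical 20-year-old car with 200 (thousand) miles; nothing in the
# linear model prevents the predicted price from coming out negative
predict(
  object = price_lm3,
  newdata = data.frame(age = 20, mileage = 200, cylinders = 4)
)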