The used cars.csv file has information about 1000 randomly sampled used sedans (4 door cars) in 2021. The variables are:
Create the appropriate individual graphs for price, age, and mileage.
cars |>
dplyr::select(price, age, mileage) |>
# Pivoting the values into 1 column to place all three plots in one graph
pivot_longer(
cols = price:mileage,
names_to = 'feature',
values_to = 'value'
) |>
# Creating the density plots
ggplot(
mapping = aes(
x = value)
) +
geom_density(
fill = 'steelblue'
) +
# Separate density plot for each variable
facet_wrap(
facets = vars(feature),
scales = 'free',
ncol = 2
) +
scale_y_continuous(expand = c(0, 0, 0.05, 0))
Describe the shape of each variable from your graphs in 1a)
age: Unimodal and symmetric
mileage: Unimodal and symmetric.
price: Unimodal and right skewed
Create a scatter plot matrix for price, age, and mileage
cars |>
dplyr::select(price, age, mileage) |>
ggpairs() +
theme_bw()
Which of the two predictors has the stronger association with price?
Age has a slightly stronger correlation with price, but both correlations are very close to -0.6.
Regardless of your answer in 1c), we’ll be using mileage for this question.
Create the linear model for price by mileage. Call it car_lm2. Display the model estimates using get_regression_table().
car_lm2 <- lm(price ~ mileage, data = cars)
get_regression_table(car_lm2)
## # A tibble: 2 × 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 17193. 310. 55.4 0 16584. 17802.
## 2 mileage -66.4 2.83 -23.4 0 -72.0 -60.9
Create a scatter plot of price by mileage and add the best-fitting line. Does it appear that a linear model is appropriate?
ggplot(
data = cars,
mapping = aes(
x = mileage,
y = price
)
) +
# Scatterplot
geom_point() +
# Best fitting line
geom_smooth(
formula = y ~ x,
se = FALSE,
method = 'lm'
) +
# X-label and title
labs(x = 'Mileage (in 1000s)',
title = "Price by Mileage for Used Cars") +
# Different theme
theme_bw()
Interpret the slope in context of price and mileage. You can round the slope to the nearest whole number.
For each additional 1000 miles a car is driven, we expect/predict the price to decrease by about $66.
If a car with 50,000 miles on it sold for $13,000, find the predicted price and the residual:
predicted price:
predicted price = 17193 - 66 * 50 = $13,893
residual:
residual = price - predicted price = 13000 - 13893 = -$893
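The arithmetic above can be reproduced in R. This is a minimal sketch that plugs in the rounded estimates from the regression table (intercept 17193, slope -66) instead of using the fitted car_lm2 object; the variable names are illustrative only.

```r
# Rounded estimates taken from get_regression_table(car_lm2)
intercept <- 17193
slope <- -66            # dollars per 1000 miles (rounded from -66.4)

mileage <- 50           # 50,000 miles, expressed in 1000s of miles
actual_price <- 13000   # observed selling price

# Predicted price from the fitted line, then residual = actual - predicted
predicted_price <- intercept + slope * mileage
residual <- actual_price - predicted_price

predicted_price
residual
```

A negative residual means the car sold for less than the model predicted for its mileage.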
Calculate the fit statistics (\(R^2\), rmse) for the model.
get_regression_summaries(car_lm2)
## # A tibble: 1 × 9
## r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.355 0.354 13666649. 3697. 3701. 549. 0 1 1000
How well does the model predict the price of a used car using mileage alone? Justify your answer using both \(R^2\) and sigma.
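With an \(R^2\) of 0.355, mileage alone explains only about 36% of the variation in price, so the model fits poorly. Sigma (about $3,700, nearly identical to the rmse) tells us that a typical prediction is off from the actual price by roughly $3,700, which is a large amount of money.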
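As a rough check on why rmse and sigma are so close, the sketch below recomputes both from the mse reported for car_lm2. It assumes that rmse is the square root of mse and that sigma is the residual standard error, which divides the sum of squared residuals by n - k - 1 instead of n; the variable names are illustrative.

```r
# Values taken from get_regression_summaries(car_lm2)
mse <- 13666649   # mean squared error
n <- 1000         # number of cars
k <- 1            # number of predictors (mileage)

# rmse averages the squared residuals over all n observations
rmse <- sqrt(mse)

# sigma uses the residual degrees of freedom, n - k - 1 (assumed definition)
sigma <- sqrt(mse * n / (n - k - 1))

round(rmse)
round(sigma)
```

With 1000 observations and a single predictor, the two divisors differ by only 2, so either statistic gives essentially the same typical prediction error.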
Create a linear model using mileage AND age to predict price. Call it car_lm3. Use get_regression_table() to display the model estimates.
car_lm3 <- lm(formula = price ~ mileage + age, data = cars)
# Displaying the results in the knitted document
get_regression_table(car_lm3)
## # A tibble: 3 × 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 19716. 314. 62.9 0 19101. 20332.
## 2 mileage -44.1 2.84 -15.5 0 -49.7 -38.5
## 3 age -517. 31.0 -16.7 0 -578. -456.
Interpret the slope for age in context of price, age, and mileage:
If the mileage of a car stays the same, we predict/expect the price of a car to decrease by about $517 for each additional year old the car is.
Interpret the intercept of the model in context of price, age, and mileage:
For a new car (age = 0) that hasn’t been driven (mileage = 0), we expect the price to be about $19,716.
Read in the test cars.csv data set and save it as test_cars. Use car_lm3 and get_regression_points() to predict the prices of these new cars. Save the predictions as cars_pred.
Reminder: mileage in the linear model is measured in 1000s of miles!
# Reading in the data and converting mileage
test_cars <- read.csv('test cars.csv') |> mutate(mileage = mileage/1000)
# Predicting the price for test cars
cars_pred <- get_regression_points(model = car_lm3, newdata = test_cars)
If done correctly, the code below should run:
RNGversion('4.1.0')
set.seed(1870)
# Picking 10 random rows to display
cars_pred |>
slice_sample(n = 10)
## # A tibble: 10 × 6
## ID price mileage age price_hat residual
## <int> <int> <dbl> <int> <dbl> <dbl>
## 1 14 17800 45 3 16180. 1620.
## 2 78 7990 133. 11 8170. -180.
## 3 80 8995 102. 9 10583. -1588.
## 4 116 14995 127. 4 12057. 2938.
## 5 133 5700 163. 14 5304. 396.
## 6 76 8500 89 6 12688. -4188.
## 7 23 5900 112 11 9088. -3188.
## 8 81 7895 94.3 8 11419. -3524.
## 9 20 9990 108 6 11849. -1859.
## 10 94 7000 81 9 11489. -4489.
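One way to sanity-check the predictions is to recompute a row by hand. The sketch below does this for the first displayed row (ID 14), using the rounded estimates from the car_lm3 regression table instead of the fitted model object, so small differences from the table are due to rounding. The variable names are illustrative.

```r
# Rounded estimates taken from get_regression_table(car_lm3)
b0 <- 19716          # intercept
b_mileage <- -44.1   # dollars per 1000 miles
b_age <- -517        # dollars per year of age

# First displayed row of cars_pred (ID 14)
price <- 17800
mileage <- 45        # already in 1000s of miles
age <- 3

# Predicted price and residual (actual minus predicted)
price_hat <- b0 + b_mileage * mileage + b_age * age
residual <- price - price_hat

price_hat
residual
```

The results should land within a dollar of the price_hat (16180) and residual (1620) columns shown for that row.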
Calculate the same fit statistics as in 2e), but for car_lm3.
get_regression_summaries(car_lm3)
## # A tibble: 1 × 9
## r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.496 0.495 10687012. 3269. 3274. 490. 0 2 1000
Did adding age improve how well the model predicts price? Justify your answer
Yes, there is a noticeable improvement. \(R^2\) increased from about 0.35 to 0.5, and sigma decreased from about $3,700 to about $3,300, so the typical prediction error shrank by about $400.
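The comparison can also be made explicit with the reported values. This is a small sketch that copies \(R^2\) and sigma from the two get_regression_summaries() outputs into named vectors; the vector names are illustrative.

```r
# Fit statistics copied from the two get_regression_summaries() outputs
r2 <- c(car_lm2 = 0.355, car_lm3 = 0.496)
sigma <- c(car_lm2 = 3701, car_lm3 = 3274)

# Gain in R^2 and drop in sigma after adding age to the model
r2_gain <- unname(r2["car_lm3"] - r2["car_lm2"])
sigma_drop <- unname(sigma["car_lm2"] - sigma["car_lm3"])

r2_gain
sigma_drop
```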
Does the model predict price well? Justify your answer.
No. An \(R^2\) of 0.5 means about half of the variation in price is still unexplained, and being off by about $3,300 on a typical prediction is still a large error.