The used cars.csv file has information about 1000 randomly sampled used sedans (4-door cars) in 2021. The variables used in this analysis are price (in dollars), age (in years), and mileage (in thousands of miles).
Create the appropriate individual graphs for price, age, and mileage.
# Packages used throughout: the tidyverse for wrangling and plotting, GGally for
# ggpairs(), and moderndive for the get_regression_*() helpers. The cars data
# frame is assumed to have been read in from 'used cars.csv' in an earlier chunk.
library(tidyverse)
library(GGally)
library(moderndive)

cars |> 
  dplyr::select(price, age, mileage) |> 
  # Pivoting the values into 1 column to place all three plots in one graph
  pivot_longer(
    cols = price:mileage,
    names_to = 'feature',
    values_to = 'value'
  ) |> 
  # Creating the density plots
  ggplot(
    mapping = aes(
      x = value)
  ) + 
  geom_density(
    fill = 'steelblue'
  ) + 
  # Separate density plot for each variable
  facet_wrap(
    facets = vars(feature),
    scales = 'free',
    ncol = 2
  ) + 
  scale_y_continuous(expand = c(0, 0, 0.05, 0))
Describe the shape of each variable from your graphs in 1a).
price: Unimodal and right skewed.
age: Unimodal and symmetric.
mileage: Unimodal and symmetric.
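One way to double-check these shapes numerically (a quick sketch, assuming the cars data frame above) is to compare each variable's mean and median: they should be close for the symmetric variables, with the mean pulled above the median for the right-skewed one.
# Numeric check of the shapes: mean close to median suggests symmetry,
# mean noticeably above the median suggests a right skew
cars |> 
  summarise(
    across(c(price, age, mileage), list(mean = mean, median = median))
  )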
Create a scatter plot matrix for price, age, and mileage
cars |> 
  dplyr::select(price, age, mileage) |> 
  ggpairs() + 
  theme_bw()
Which of the two predictors have the strongest association with price?
Age has a slightly stronger correlation with price, but both correlations are very close to -0.6.
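To see the exact values behind this comparison, the two correlations can also be pulled directly (a small sketch; the same numbers appear in the ggpairs() matrix above).
# Correlation of price with each predictor
cars |> 
  summarise(
    cor_price_age     = cor(price, age),
    cor_price_mileage = cor(price, mileage)
  )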
Regardless of your answer in 1c), we'll be using mileage for this question.
Create the linear model for price by mileage. Call it car_lm2. Display the model estimates using get_regression_table().
car_lm2 <- lm(price ~ mileage, data = cars)
get_regression_table(car_lm2)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept  17193.     310.        55.4       0  16584.   17802. 
## 2 mileage      -66.4      2.83     -23.4       0    -72.0    -60.9
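Reading the estimates off the table, the fitted regression equation (with mileage in thousands of miles) is \(\widehat{\text{price}} = 17193 - 66.4 \cdot \text{mileage}\).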
Create a scatter plot of price by mileage and add the best fitting line. Does it appear that a linear model is appropriate?
ggplot(
  data = cars,
  mapping = aes(
    x = mileage,
    y = price
  )
) + 
  # Scatterplot
  geom_point() + 
  # Best fitting line
  geom_smooth(
    formula = y ~ x,
    se = FALSE,
    method = 'lm'
  ) + 
  # X-label and title
  labs(x = 'Mileage (in 1000s)',
       title = "Price by Mileage for Used Cars") + 
  # Different theme
  theme_bw()
Interpret the slope in context of price and mileage. You can round the slope to the nearest whole number.
For each additional 1000 miles a car is driven, we expect/predict the price to decrease by about $66.
If a car with 50,000 miles on it sold for $13,000, find the predicted price and the residual:
predicted price:
predicted price = 17193 - 66 * 50 = $13,893
residual:
residual = price - predicted price = 13000 - 13893 = -$893
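A quick sketch to verify this arithmetic in R with predict(); because the unrounded slope of -66.4 is used, the prediction comes out slightly below the hand-calculated $13,893.
# Predicted price and residual for a car with 50,000 miles (mileage is in 1000s)
# that sold for $13,000; uses the unrounded slope, so it differs slightly from
# the rounded hand calculation above
new_car <- data.frame(mileage = 50)
pred    <- predict(car_lm2, newdata = new_car)
pred           # predicted price
13000 - pred   # residual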
Calculate the fit statistics (\(R^2\), rmse) for the model.
get_regression_summaries(car_lm2)
## # A tibble: 1 × 9
##   r_squared adj_r_squared       mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl>     <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.355         0.354 13666649. 3697. 3701.      549.       0     1  1000
How well does the model predict the price of a used car using mileage alone? Justify your answer using both \(R^2\) and sigma.
With an \(R^2\) of 0.355, the model explains only about 35.5% of the variation in price, so it fits poorly. Sigma tells us that the typical prediction error is about $3700, which is a large amount of money for a used car.
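Both statistics summarize the residuals; as a sketch, they can be reproduced directly from car_lm2 (rmse divides the sum of squared residuals by n, sigma by n - 2).
# Reproducing rmse and sigma from the residuals of car_lm2
res <- residuals(car_lm2)
n   <- length(res)
sqrt(sum(res^2) / n)        # rmse, about 3697
sqrt(sum(res^2) / (n - 2))  # sigma, about 3701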
Create a linear model using mileage AND age to predict price. Call it car_lm3. Use get_regression_table() to display the model estimates.
car_lm3 <- lm(formula = price ~ mileage + age, data = cars)
# Displaying the results in the knitted document
get_regression_table(car_lm3)
## # A tibble: 3 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept  19716.     314.        62.9       0  19101.   20332. 
## 2 mileage      -44.1      2.84     -15.5       0    -49.7    -38.5
## 3 age         -517.      31.0      -16.7       0   -578.    -456.
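Reading the estimates off the table, the fitted equation (mileage in thousands of miles, age in years) is \(\widehat{\text{price}} = 19716 - 44.1 \cdot \text{mileage} - 517 \cdot \text{age}\).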
Interpret the slope for age in context of price, age, and mileage:
If the mileage of a car stays the same, we predict/expect the price of the car to decrease by about $517 for each additional year of age.
Interpret the intercept of the model in context of price, age, and mileage:
For a new car (age = 0) that hasn't been driven (mileage = 0), we expect the price to be about $19,716.
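As a quick check (a sketch), predicting the price of a brand-new, undriven car returns the intercept.
# A car with age = 0 and mileage = 0 is predicted at the intercept
predict(car_lm3, newdata = data.frame(mileage = 0, age = 0))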
Read in the test cars.csv data set and save it as test_cars. Use car_lm3 and get_regression_points() to predict the prices of these new cars. Save the predictions as cars_pred.
Reminder: mileage in the linear model is measured in 1000s of miles!
# Reading in the data and converting mileage
test_cars <- read.csv('test cars.csv') |> mutate(mileage = mileage/1000)
# Predicting the price for test cars
cars_pred <- get_regression_points(model = car_lm3, newdata = test_cars)
If done correctly, the code below should run:
RNGversion('4.1.0')
set.seed(1870)
# Picking 10 random rows to display
cars_pred |> 
  slice_sample(n = 10)
## # A tibble: 10 × 6
##       ID price mileage   age price_hat residual
##    <int> <int>   <dbl> <int>     <dbl>    <dbl>
##  1    14 17800    45       3    16180.    1620.
##  2    78  7990   133.     11     8170.    -180.
##  3    80  8995   102.      9    10583.   -1588.
##  4   116 14995   127.      4    12057.    2938.
##  5   133  5700   163.     14     5304.     396.
##  6    76  8500    89       6    12688.   -4188.
##  7    23  5900   112      11     9088.   -3188.
##  8    81  7895    94.3     8    11419.   -3524.
##  9    20  9990   108       6    11849.   -1859.
## 10    94  7000    81       9    11489.   -4489.
Calculate the same fit statistics as in 2e), but for car_lm3.
get_regression_summaries(car_lm3)
## # A tibble: 1 × 9
##   r_squared adj_r_squared       mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl>     <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.496         0.495 10687012. 3269. 3274.      490.       0     2  1000
Did adding age improve how well the model predicts price? Justify your answer.
Yes, there is a noticeable improvement. \(R^2\) increased from about 0.35 to 0.5 and sigma decreased from about $3700 to about $3300, so the typical prediction error improved by about $400.
Does the model predict price well? Justify your answer.
No, an \(R^2\) of 0.5 is still fairly low, and a typical prediction error of about $3300 is still quite large.
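As an optional sketch, the residuals already stored in cars_pred can be summarized into an out-of-sample rmse for the test cars, to compare against the in-sample value of about $3300 above.
# Typical prediction error on the test cars, using the residuals
# returned by get_regression_points()
cars_pred |> 
  summarise(rmse_test = sqrt(mean(residual^2)))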