Data Description:

The used cars.csv file contains information about 1000 randomly sampled used sedans (4-door cars) in 2021. The variables are:

  1. manufactor: The company that makes the car
  2. model: The model of the car
  3. price: The sale price of the used car (our response variable)
  4. year: The year the car was manufactured
  5. age: The age of the car when it was posted
  6. condition: The condition of the car (like new/excellent/good)
  7. cylinders: The number of cylinders in the engine (4/6/8)
  8. fuel: Type of fuel the car takes (gas/hybrid)
  9. mileage: The miles driven according to the odometer (in thousands of miles)
  10. transmission: The type of transmission (automatic/manual)
  11. paint_color: The color of the car

Question 1) Exploratory data analysis

Question 1a) Scatter plots of price by year, age, cylinders and mileage

Create a set of scatterplots with price on the y-axis and the 4 numeric predictors (year, age, cylinders, mileage) on the respective x-axes.

cars |> 
  # Stacking the 4 numeric predictors into two columns: predictor (the variable name) and value
  pivot_longer(
    cols = c(year, age, cylinders, mileage),
    names_to = "predictor",
    values_to = "value"
  ) |> 
  # Creating the set of scatterplots
  ggplot(
    mapping = aes(
      x = value,
      y = price
    )
  ) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(
    method = "loess",
    se = FALSE,
    formula = y ~ x
  ) +
  # Separating the plots into 4 with different x-axes
  facet_wrap(
    facets = vars(predictor),
    scales = "free_x"
  ) + 
  labs(
    x = NULL,
    y = NULL
  ) + 
  # Adding $ to the y-axis
  scale_y_continuous(labels = scales::label_dollar())
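The reshaping step is what makes the faceting work, so it may help to see it in isolation. The following is a minimal sketch on a hypothetical two-row data frame (`toy` and its values are made up for illustration and are not from the cars data); it assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical 2-row data set standing in for `cars`
toy <- data.frame(
  price   = c(9000, 15000),
  age     = c(10, 3),
  mileage = c(120, 35)
)

# Stack the predictor columns: one row per (car, predictor) pair
toy_long <- pivot_longer(
  toy,
  cols = c(age, mileage),
  names_to = "predictor",
  values_to = "value"
)

toy_long
```

Each car now appears once per predictor, with price repeated on every row, which is why a single `aes(x = value, y = price)` mapping combined with `facet_wrap()` can draw one panel per predictor.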

How does each of the numeric variables appear to predict the price of the used cars (positive/negative/none, linear/curved/none, etc.)?

year: A curved, positive trend

age: A curved, negative trend

cylinders: No noticeable trend

mileage: A somewhat linear, negative trend

Question 1b) Correlation Plot

Create a correlation plot for the same 5 variables in question 1a in the code chunk below.

cars |> 
  dplyr::select(price, age, year, mileage, cylinders) |> 
  ggcorr(
    low = "red",
    high = "blue",
    label = TRUE,
    label_round = 2
  )

Does there appear to be a potential problem with multicollinearity? Explain your answer!
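One thing worth checking when reading the correlation plot is whether any pair of predictors is (nearly) a linear function of the other. The sketch below uses hypothetical model years, not the cars data, and assumes that age is measured relative to the 2021 sampling year; whether that holds exactly in the real data set is not stated in the data description.

```r
# Hypothetical model years; NOT taken from the cars data
year <- c(2008, 2011, 2013, 2015, 2018)

# Assumption: age is measured relative to the 2021 sampling year
age <- 2021 - year

# Correlation between two variables that are exact linear functions of each other
cor(year, age)
```

Under that assumption the two variables carry the same information, so a correlation at or very close to -1 between them in the plot would be a sign that only one of the two belongs in a model.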

Question 2) Finding a good model

Part 2a) Fit four candidate models

In the code chunk below, fit the following four linear models with the corresponding names and explanatory variables listed:

  1. price_lm5: age + mileage + cylinders + transmission + fuel

  2. price_lm3: age + mileage + cylinders

  3. price_lm2: age + mileage

  4. price_lm1: age

# price_lm5
price_lm5 <- 
  lm(formula = price ~ age + mileage + cylinders + transmission + fuel, 
     data = cars)

# price_lm3
price_lm3 <- 
  lm(formula = price ~ age + mileage + cylinders, 
     data = cars)

# price_lm2
price_lm2 <- 
  lm(formula = price ~ age + mileage, 
     data = cars)


# price_lm1
price_lm1 <- 
  lm(formula = price ~ age, 
     data = cars)

If done properly, the code chunk below should run

model      n_predictors  r.squared  sigma
price_lm1             1      0.374   3646
price_lm2             2      0.496   3274
price_lm3             3      0.582   2983
price_lm5             5      0.583   2983
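The chunk that produced this table is not shown. As a rough sketch of how such a comparison could be assembled in base R (this is not the original code, and it uses stand-in models fit to the built-in mtcars data rather than the cars data):

```r
# Stand-in models fit to the built-in mtcars data (not the cars data)
fits <- list(
  mpg_lm1 = lm(mpg ~ wt, data = mtcars),
  mpg_lm2 = lm(mpg ~ wt + hp, data = mtcars)
)

# One row per model. Note: length(coef(m)) - 1 counts slope coefficients,
# which equals the number of predictors only when no factor has 3+ levels.
comparison <- data.frame(
  model        = names(fits),
  n_predictors = sapply(fits, function(m) length(coef(m)) - 1),
  r.squared    = sapply(fits, function(m) summary(m)$r.squared),
  sigma        = sapply(fits, sigma),
  row.names    = NULL
)

comparison
```

`sigma()` returns the residual standard error, which divides by the residual degrees of freedom rather than by n, so it will differ slightly from a root mean squared error computed with n in the denominator.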

Part 2b) Best model of the four options

Using the output created in 2a i), which model should you use? Again, justify your answer!

We should use price_lm3 because it has a much higher \(R^2\) and a lower sigma than price_lm1 and price_lm2.

We should use it over price_lm5 because it fits almost as well as the more complex model (the \(R^2\) values are almost identical and the sigma values are the same), so the 2 additional predictors are not worth the added complexity.

Question 3) Test Cars Data Set

The code chunk below reads in the “test cars.csv” data set that you’ll use with the models fit in question 2

Part 3a) Making predictions with the models for the test data

Using the models you created in 2a), predict the price for the cars in the test_cars data set. You can predict the results for a new data set using the predict() function, which requires 2 arguments:

  • object = the model used to make predictions (the different lm objects)

  • newdata = The data set you want to make predictions for.
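A minimal sketch of the two arguments in use, with the built-in mtcars data split into hypothetical `train` and `test` pieces as a stand-in for cars and test_cars:

```r
# Stand-in training and test sets from the built-in mtcars data
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

fit <- lm(mpg ~ wt, data = train)

# newdata must contain every predictor used in the formula (here: wt)
pred <- predict(object = fit, newdata = test)

# Actual and predicted values side by side
data.frame(actual = test$mpg, predicted = pred)
```

`predict()` returns one prediction per row of `newdata`, in the same row order, which is what allows the predictions to be placed next to the actual values in a data frame.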

Combine these predictions into a data set named price_pred that has 5 columns:

  1. price: The actual price for the cars in the test_cars data set

  2. price5: The predicted price using the price_lm5 model

  3. price3: The predicted price using the price_lm3 model

  4. price2: The predicted price using the price_lm2 model

  5. price1: The predicted price using the price_lm1 model

price_pred <- 
  data.frame(
    price  = test_cars$price,
    price5 = predict(object = price_lm5, newdata = test_cars),
    price3 = predict(object = price_lm3, newdata = test_cars),
    price2 = predict(object = price_lm2, newdata = test_cars),
    price1 = predict(object = price_lm1, newdata = test_cars)
  )




tibble(price_pred)
## # A tibble: 200 × 5
##    price price5 price3 price2 price1
##    <int>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 10495  9174.  9218. 10573.  9902.
##  2  7299  6875.  6906.  7045.  5440.
##  3  9999 15060. 15048. 11700.  9902.
##  4  5400  5891.  5939.  6769. 10646.
##  5 17985 13574. 13585. 12187. 11390.
##  6  8499  9018.  9062. 10145. 10646.
##  7  7000  5729.  5760.  5287.  6927.
##  8 14000 12328. 12367. 13717. 10646.
##  9 16000 14987. 15017. 15543. 13621.
## 10 19800 14382. 14367. 10185. 12133.
## # ℹ 190 more rows

Part 3b) Calculating the \(R^2\) and MAE for the test data

Calculate the \(R^2\), sigma, and mean absolute error (MAE) of the test predictions for each of the 4 models. You can either calculate them individually and put them together in a data set, or you can use pivot_longer() to “shorten” the code required!

sigma is: \[\textrm{sigma} = \sqrt{\frac{\sum(y - \hat{y})^2}{n}}\]

The MAE is: \[\textrm{MAE} = \frac{\sum|y - \hat{y}|}{n}\]

and the absolute function in R is abs()
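To see the two formulas at work, here is a small worked example on hypothetical values chosen so the arithmetic is easy to follow by hand (`y`, `y_hat`, `sigma_hat`, and `mae` are illustrative names, not part of the assignment):

```r
# Hypothetical actual and predicted values, chosen for easy arithmetic
y     <- c(10, 12, 15, 20)
y_hat <- c(11, 10, 15, 24)

errors <- y - y_hat                # -1, 2, 0, -4

# sigma: square the errors, average them, take the square root
sigma_hat <- sqrt(mean(errors^2))  # sqrt((1 + 4 + 0 + 16) / 4) = sqrt(5.25)

# MAE: take absolute values, then average
mae <- mean(abs(errors))           # (1 + 2 + 0 + 4) / 4 = 1.75

c(sigma = sigma_hat, MAE = mae)
```

Because sigma squares the errors before averaging, the single error of -4 pulls it further above the MAE than the smaller errors do; sigma is never smaller than the MAE.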

price_pred |> 
  pivot_longer(
    cols = price5:price1,
    names_to = "model",
    values_to = "price_hat"
  ) |> 
  summarize(
    .by = model,
    r.squared = cor(price, price_hat)^2 |> round(3),
    sigma = sqrt(mean((price - price_hat)^2)) |> round(0),
    MAE = mean(abs(price - price_hat)) |> round(0)
  )
## # A tibble: 4 × 4
##   model  r.squared sigma   MAE
##   <chr>      <dbl> <dbl> <dbl>
## 1 price5     0.525  3142  2469
## 2 price3     0.526  3141  2466
## 3 price2     0.416  3484  2671
## 4 price1     0.282  3867  2970

Using your results from the above code chunk, which model should you use?

Using the new cars to test the accuracy of each of the four models gives a result that agrees with the answer from 2b): price_lm3 has the highest \(R^2\) and the lowest sigma and MAE, indicating it is the most accurate of the four candidates.

Question 4) Interpreting the model

The model estimates for price_lm5 are displayed in the code chunk below and you’ll be using them to answer parts a) and b)

## # A tibble: 6 × 2
##   term               estimate
##   <chr>                 <dbl>
## 1 (Intercept)           15222
## 2 age                    -721
## 3 mileage                 -41
## 4 cylinders              1211
## 5 transmissionmanual      788
## 6 fuelhybrid              206

Part 4a) Model interpretations: Mileage

Interpret the mileage estimate for the model:

For every additional 1,000 miles a car has been driven, the price is expected to decrease by $41, when all the other variables are the same (held constant).

Part 4b) Model interpretations: fuel

Interpret the fuel estimate for the model:

The price is $206 more, on average, for a hybrid car compared to a gas-powered car, when all other variables are the same.
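Both interpretations can be checked by plugging numbers into the fitted equation. The sketch below copies the rounded estimates from the price_lm5 output; the helper `predict_price()` and the example car (5 years old, 60 thousand miles, 4 cylinders, automatic, gas) are hypothetical and only serve to illustrate "holding the other variables constant".

```r
# Estimates copied from the price_lm5 output in Question 4
b <- c(intercept = 15222, age = -721, mileage = -41,
       cylinders = 1211, manual = 788, hybrid = 206)

# Hypothetical helper: predicted price from the rounded estimates.
# manual and hybrid are 0/1 indicators (0 = automatic / gas).
predict_price <- function(age, mileage, cylinders, manual = 0, hybrid = 0) {
  unname(
    b["intercept"] + b["age"] * age + b["mileage"] * mileage +
      b["cylinders"] * cylinders + b["manual"] * manual + b["hybrid"] * hybrid
  )
}

base_car   <- predict_price(age = 5, mileage = 60, cylinders = 4)  # 14001
more_miles <- predict_price(age = 5, mileage = 61, cylinders = 4)
hybrid_car <- predict_price(age = 5, mileage = 60, cylinders = 4, hybrid = 1)

more_miles - base_car  # -41: one more thousand miles, everything else equal
hybrid_car - base_car  # 206: hybrid instead of gas, everything else equal
```

The differences do not depend on which base car is chosen, because the model has no interaction terms: changing one variable shifts the prediction by its coefficient regardless of the values of the others.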

Question 5) Model diagnostics

You’ll be using the cars data set and price_lm3 model for all parts of question 5.

Part 5a) Overall Residual Plot

Create just the residual plot for the price_lm3 model.

augment_columns(
  x = price_lm3,
  data = cars
) |> 
  ggplot(
    mapping = aes(
      x = .fitted,
      y = .resid
    )
  ) +
  
  geom_point() + 
  
  labs(
    x = "Predicted Price",
    y = "Residuals"
  )+
  
  geom_hline(
    mapping = aes(yintercept = mean(.resid)),
    color = "red",
    linewidth = 1
  ) + 
  scale_x_continuous(labels = scales::label_dollar()) +
  scale_y_continuous(labels = scales::label_dollar())

Using the residual plot you created, which assumptions about our linear model below appear to be violated? If they’ve been violated, justify your answer

Linear Assumption:

Violated: there is a clear downward trend in the residual plot, indicating that a line is not the best choice.

No outliers:

There is an outlier at about \(\hat{y} = 8000\) and \(e = 15000\)

Equal Spread (homoscedasticity):

Violated: as the predicted price increases, the residuals appear to get larger overall.
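A note on the red reference line in the plot: it is drawn at `mean(.resid)`. For a least-squares fit that includes an intercept, the residuals average to zero, so the line is effectively the horizontal line at 0. A minimal sketch with a stand-in model on the built-in mtcars data (not the cars data):

```r
# Stand-in model on the built-in mtcars data (not the cars data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

res      <- residuals(fit)  # observed minus fitted values
fit_vals <- fitted(fit)     # the values plotted on the x-axis

# Zero up to floating-point rounding
mean(res)
```

Points above the line are cars the model under-predicts (actual price higher than predicted) and points below are cars it over-predicts.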

Part 5b) Individual Residual Plots

The residual plots for the three predictors are shown below. Is there evidence of any non-linear trends? Justify your answer!

augment_columns(
  x = price_lm3,
  data = cars
) |> 
  dplyr::select(age, cylinders, mileage, .resid) |> 
  pivot_longer(
    cols = -.resid,
    names_to = "predictor",
    values_to = "value"
  ) |> 
  
  mutate(predictor = as_factor(predictor)) |> 
  
  ggplot(
    mapping = aes(
      x = value,
      y = .resid
    )
  ) +
  
  geom_point(alpha = 0.25) + 
  
  geom_hline(
    mapping = aes(yintercept = mean(.resid)),
    color = "red",
    linewidth = 1
  ) +
  
  geom_smooth(
    method = "loess",
    se = FALSE,
    formula = y ~ x,
    color = "steelblue",
    linewidth = 1
  ) +
  
  facet_wrap(
    facets = vars(predictor),
    scales = "free_x",
    nrow = 5
  ) + 
  
  labs(
    x = NULL,
    y = "Residuals"
  ) + 
  scale_y_continuous(labels = scales::label_dollar())

No, the individual residual plots all look like what you’d expect to see when the linearity condition is met. (Age does have a little bit of a bow, but not enough to be concerning.)

Not required:

The issue with using a linear model to predict price is that price has a lower bound at $0, but the predicted price can be anything, even negative. Because the response variable is bounded, our linear model is not appropriate, even though none of the individual variables appears to have a non-linear relationship with price.
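To make the point concrete, the rounded price_lm5 estimates from Question 4 can be applied to a hypothetical old, high-mileage car. The car's values below are made up for illustration and may lie outside the range of the data, so this is a sketch of the mechanism rather than a statement about any car in the sample.

```r
# Estimates copied from the price_lm5 output in Question 4
b <- c(intercept = 15222, age = -721, mileage = -41, cylinders = 1211)

# Hypothetical car: 20 years old, 200 thousand miles, 4 cylinders,
# automatic transmission and gas (so both indicator terms are 0)
pred <- unname(
  b["intercept"] + b["age"] * 20 + b["mileage"] * 200 + b["cylinders"] * 4
)

pred  # -2554
```

Nothing in the linear equation stops the age and mileage terms from outweighing the intercept, so the prediction falls below $0 even though no car can sell for a negative price.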