Data Description:

The used cars.csv file contains information on 1000 randomly sampled used sedans (4-door cars) listed in 2021. The variables are:

  1. manufactor: The company that makes the car
  2. model: The model of the car
  3. price: The sale price of the used car (our response variable)
  4. year: The year the car was manufactured
  5. age: The age of the car when it was posted
  6. condition: The condition of the car (like new/excellent/good)
  7. cylinders: The number of cylinders in the engine (4/6/8)
  8. fuel: Type of fuel the car takes (gas/hybrid)
  9. mileage: The miles driven according to the odometer (in thousands of miles)
  10. transmission: The type of transmission (automatic/manual)
  11. paint_color: The color of the car

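The analysis below assumes the data have already been read into an object named cars. A minimal setup sketch (the package list and file path are assumptions based on the functions used later, e.g. pivot_longer(), ggpairs(), and the get_regression_*() helpers) could be:

# tidyverse for wrangling and plotting, GGally for ggpairs(),
# moderndive for the get_regression_*() helpers used below
library(tidyverse)
library(GGally)
library(moderndive)

# Reading in the used car data (mileage is already recorded in thousands of miles)
cars <- read.csv('used cars.csv')
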
Question 1) Exploratory data analysis for price, age, and mileage

Part 1a) Univariate EDA

Create the appropriate individual graphs for price, age, and mileage.

cars |> 
  dplyr::select(price, age, mileage) |> 
  # Pivoting the values into 1 column to place all three plots in one graph
  pivot_longer(
    cols = price:mileage,
    names_to = 'feature',
    values_to = 'value'
  ) |> 
  # Creating the density plots
  ggplot(
    mapping = aes(
      x = value)
  ) + 
  geom_density(
    fill = 'steelblue'
  ) + 
  # Separate density plot for each variable
  facet_wrap(
    facets = vars(feature),
    scales = 'free',
    ncol = 2
  ) + 
  scale_y_continuous(expand = c(0, 0, 0.05, 0))

Part 1b) Important characteristics of each variable

Describe the shape of each variable from your graphs in 1a)

age: Unimodal and symmetric

mileage: Unimodal and symmetric

price: Unimodal and right skewed
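As a rough numerical check on these shape descriptions (not part of the question), comparing each variable's mean to its median can corroborate the plots: a mean well above the median suggests right skew, while near-equal values suggest symmetry.

# Mean and median of each plotted variable, assuming the cars data from above
cars |> 
  summarize(across(c(price, age, mileage), list(mean = mean, median = median)))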

Part 1c) Bivariate EDA

Create a scatter plot matrix for price, age, and mileage

cars |> 
  dplyr::select(price, age, mileage) |> 
  ggpairs() + 
  theme_bw()

Which of the two predictors has the stronger association with price?

Age has a slightly stronger correlation with price, but both correlations are very close to -0.6.
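To read the correlations off directly instead of from the ggpairs() panels, a quick sketch:

# Correlation matrix for the three variables; both predictors should be near -0.6
cars |> 
  dplyr::select(price, age, mileage) |> 
  cor()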

Question 2) Simple Linear Regression: Price by mileage

Regardless of your answer in 1c), we’ll be using mileage for this question.

Part 2a) Fitting the model

Create the linear model for price by mileage. Call it car_lm2. Display the model estimates using get_regression_table()

car_lm2 <- lm(price ~ mileage, data = cars)

get_regression_table(car_lm2)
## # A tibble: 2 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept  17193.     310.        55.4       0  16584.   17802. 
## 2 mileage      -66.4      2.83     -23.4       0    -72.0    -60.9
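From the table above, the fitted regression equation (with rounded coefficients) is

\[
\widehat{\text{price}} = 17193 - 66.4 \times \text{mileage}
\]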

Part 2b) Visualize the linear model

Create a scatter plot price by mileage and add the best fitting line. Does it appear that a linear model is appropriate?

ggplot(
  data = cars,
  mapping = aes(
    x = mileage,
    y = price
  )
) + 
  # Scatterplot
  geom_point() + 
  # Best fitting line
  geom_smooth(
    formula = y ~ x,
    se = FALSE,
    method = 'lm'
  ) + 
  # X-label and title
  labs(x = 'Mileage (in 1000s)',
       title = "Price by Mileage for Used Cars") + 
  # Different theme
  theme_bw()

Part 2c) Interpreting the slope

Interpret the slope in context of price and mileage. You can round the slope to the nearest whole number.

For each additional 1000 miles a car is driven, we expect/predict the price to decrease by about $66.

Part 2d) Predicting the price of a car

If a car with 50,000 miles on it sold for $13,000, find the predicted price and the residual:

predicted price:

predicted price = 17193 - 66 * 50 = $13,893

residual:

residual = price - predicted price = 13000 - 13893 = -$893
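As a sketch (not required), the same prediction and residual can be checked in R with the unrounded coefficients, which will differ slightly from the rounded hand calculation above:

# Predicted price for a car with 50,000 miles (mileage is measured in thousands)
pred_50k <- predict(car_lm2, newdata = data.frame(mileage = 50))
pred_50k          # predicted price using the unrounded slope and intercept
13000 - pred_50k  # residual for an actual sale price of $13,000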

Part 2e) Fit statistics

Calculate the fit statistics (\(R^2\), rmse) for the model.

get_regression_summaries(car_lm2)
## # A tibble: 1 × 9
##   r_squared adj_r_squared       mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl>     <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.355         0.354 13666649. 3697. 3701.      549.       0     1  1000

How well does the model predict the price of a used car using mileage alone? Justify your answer using both \(R^2\) and sigma.

With an \(R^2\) of 0.355, mileage alone explains only about 36% of the variation in price, so the model fits poorly. The sigma of roughly $3700 tells us that a typical prediction is off by about $3700 from the actual sale price, which is a large amount of money.
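For reference, both statistics can be reproduced from the model residuals. The sketch below assumes the moderndive conventions: rmse divides the sum of squared residuals by n, while sigma divides by the residual degrees of freedom (n - 2 here).

# rmse: square root of the mean squared residual (denominator n)
sqrt(mean(residuals(car_lm2)^2))

# sigma: residual standard error (denominator n - 2)
sqrt(sum(residuals(car_lm2)^2) / df.residual(car_lm2))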

Question 3) Multiple linear regression with mileage and age

Question 3a) Fitting the MLR model

Create a linear model using mileage AND age to predict price. Call it car_lm3. Use get_regression_table() to display the model estimates.

car_lm3 <- lm(formula = price ~ mileage + age, data = cars)


# Displaying the results in the knitted document
get_regression_table(car_lm3)
## # A tibble: 3 × 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept  19716.     314.        62.9       0  19101.   20332. 
## 2 mileage      -44.1      2.84     -15.5       0    -49.7    -38.5
## 3 age         -517.      31.0      -16.7       0   -578.    -456.
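From the table above, the fitted model (with rounded coefficients) is

\[
\widehat{\text{price}} = 19716 - 44.1 \times \text{mileage} - 517 \times \text{age}
\]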

Part 3b) Model interpretations: age

Interpret the slope for age in context of price, age, and mileage:

If the mileage of a car is held constant, we predict/expect the price of the car to decrease by about $517 for each additional year of age.

Part 3c) Model interpretations: intercept

Interpret the intercept of the model in context of price, age, and mileage:

For a brand new car (age = 0) that hasn’t been driven (mileage = 0), we expect the price to be about $19,716.

Part 3d) Predicting price of cars

Read in the test cars.csv data set and save it as test_cars. Use car_lm3 and get_regression_points() to predict the prices of these new cars. Save the predictions as cars_pred.

Reminder: mileage in the linear model is measured in 1000s of miles!

# Reading in the test car data and converting mileage to thousands of miles
test_cars <- read.csv('test cars.csv') |> mutate(mileage = mileage/1000)

# Predicting the price for test cars
cars_pred <- get_regression_points(model = car_lm3, newdata = test_cars)

If done correctly, the code below should run

RNGversion('4.1.0')
set.seed(1870)

# Picking 10 random rows to display
cars_pred |> 
  slice_sample(n = 10)
## # A tibble: 10 × 6
##       ID price mileage   age price_hat residual
##    <int> <int>   <dbl> <int>     <dbl>    <dbl>
##  1    14 17800    45       3    16180.    1620.
##  2    78  7990   133.     11     8170.    -180.
##  3    80  8995   102.      9    10583.   -1588.
##  4   116 14995   127.      4    12057.    2938.
##  5   133  5700   163.     14     5304.     396.
##  6    76  8500    89       6    12688.   -4188.
##  7    23  5900   112      11     9088.   -3188.
##  8    81  7895    94.3     8    11419.   -3524.
##  9    20  9990   108       6    11849.   -1859.
## 10    94  7000    81       9    11489.   -4489.
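Although not asked for here, the residual column returned by get_regression_points() makes it easy to summarize how far off the predictions are for the test cars:

# Out-of-sample RMSE for the test cars, computed from the residuals above
cars_pred |> 
  summarize(test_rmse = sqrt(mean(residual^2)))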

Part 3e) Fit statistics

Calculate the same fit statistics as in 2e), but for car_lm3.

get_regression_summaries(car_lm3)
## # A tibble: 1 × 9
##   r_squared adj_r_squared       mse  rmse sigma statistic p_value    df  nobs
##       <dbl>         <dbl>     <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
## 1     0.496         0.495 10687012. 3269. 3274.      490.       0     2  1000

Did adding age improve how well the model predicts price? Justify your answer.

Yes, there is a noticeable improvement. \(R^2\) increased from about 0.36 to about 0.50, and sigma decreased from about $3700 to about $3300, so the typical prediction error improved by roughly $400.
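A compact way to see this comparison (a sketch, not required by the question) is to stack the two sets of fit statistics:

# Fit statistics for the SLR and MLR models, side by side
bind_rows(
  slr = get_regression_summaries(car_lm2),
  mlr = get_regression_summaries(car_lm3),
  .id = 'model'
)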

Does the model predict price well? Justify your answer.

No, an \(R^2\) of about 0.5 is still fairly low, and a typical prediction error of about $3300 is still quite large.