Instructions

This is the R portion of your mid-term exam. You will analyze the Auto dataset, which contains information about various car models (similar to mtcars). Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10
Time allowed: 45 minutes

Good luck!

1. Data Import and Exploration (2 points)

  a. Import the Auto dataset provided on Canvas, name it Auto, and display the first few rows. (1 point)
library(ggplot2)
Auto <- read.csv("Auto.csv")
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
  b. Use appropriate R functions to display the structure of the dataset, and report how many observations and variables it contains. (1 point)
# Show the structure of the Auto dataset (variables and data types)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

The dataset includes 9 variables with 392 observations.
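The counts reported by str() can also be pulled out programmatically. A minimal sketch, using R's built-in mtcars data as a stand-in for Auto (an assumption made only so the snippet runs without the Canvas file):

```r
# Stand-in for Auto (assumption): mtcars ships with base R
cars <- mtcars

n_obs  <- nrow(cars)   # number of observations (rows)
n_vars <- ncol(cars)   # number of variables (columns)
dim(cars)              # both counts at once: rows, then columns

cat("Observations:", n_obs, "Variables:", n_vars, "\n")
```

Applied to Auto, the same calls should agree with the first line of the str() output above (392 observations, 9 variables).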

2. Data Preprocessing and Visualization (2 points)

  a. Create a correlation matrix for all numeric variables in the dataset. (1 point) Optionally, visualize the correlation matrix.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
corr_matrix <- Auto %>%
  select(where(is.numeric)) %>%
  cor()
print(round(corr_matrix, 2))
##                mpg cylinders displacement horsepower weight acceleration  year
## mpg           1.00     -0.78        -0.81      -0.78  -0.83         0.42  0.58
## cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50 -0.35
## displacement -0.81      0.95         1.00       0.90   0.93        -0.54 -0.37
## horsepower   -0.78      0.84         0.90       1.00   0.86        -0.69 -0.42
## weight       -0.83      0.90         0.93       0.86   1.00        -0.42 -0.31
## acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00  0.29
## year          0.58     -0.35        -0.37      -0.42  -0.31         0.29  1.00
## origin        0.57     -0.57        -0.61      -0.46  -0.59         0.21  0.18
##              origin
## mpg            0.57
## cylinders     -0.57
## displacement  -0.61
## horsepower    -0.46
## weight        -0.59
## acceleration   0.21
## year           0.18
## origin         1.00
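For the optional visualization, one possible approach is a heatmap drawn with base graphics. The sketch below is only an illustration: it uses the built-in mtcars data and a hand-picked set of columns as a stand-in for Auto (both are assumptions), and it avoids extra packages so it runs in a plain R session.

```r
# Stand-in for Auto (assumption): a few numeric columns of mtcars
num_vars <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]
corr <- cor(num_vars)
n <- ncol(corr)

# image() puts matrix rows on the x-axis and columns on the y-axis.
# Reversing the columns makes the diagonal run from top-left to bottom-right.
image(1:n, 1:n, corr[, n:1],
      zlim = c(-1, 1),                      # fix the scale to the full correlation range
      col  = hcl.colors(21, "Blue-Red 3"),  # diverging palette centered near 0
      axes = FALSE, xlab = "", ylab = "",
      main = "Correlation heatmap")
axis(1, at = 1:n, labels = rownames(corr), las = 2)
axis(2, at = 1:n, labels = rev(colnames(corr)), las = 1)
```

Replacing num_vars with the numeric columns of Auto would give the heatmap for the matrix printed above.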
  b. Create a scatter plot of ‘mpg’ vs. ‘weight’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here, but you should know how. (1 point)
library(ggplot2)
ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs Weight",
       x = "Weight (lbs)",
       y = "Miles per Gallon")
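The question also allows plot(). A rough base-graphics equivalent is sketched below, again with mtcars standing in for Auto (an assumption; in mtcars the weight column is wt, measured in 1000 lbs):

```r
# Stand-in for Auto (assumption): mtcars, where wt is weight in 1000 lbs
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs Weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon",
     pch  = 19)

# Optional: add a least-squares line to make the trend easier to read
abline(lm(mpg ~ wt, data = mtcars), col = "red")
```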

  c. Optional, no credit; you may skip it. Standardize the variable ‘age’ using either Z-standardization or range standardization.
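The two standardizations named in this question can be sketched as follows. The Auto columns shown earlier do not include a variable called age, so the vector below is made-up illustrative data (an assumption), not part of the dataset.

```r
# Hypothetical values for illustration only (not from Auto)
age <- c(2, 5, 7, 10, 16)

# Z-standardization: subtract the mean, divide by the standard deviation
age_z <- (age - mean(age)) / sd(age)

# Range standardization (min-max): rescale to the interval [0, 1]
age_range <- (age - min(age)) / (max(age) - min(age))

round(age_z, 3)
round(age_range, 3)
```

After Z-standardization the variable has mean 0 and standard deviation 1; after range standardization its minimum is 0 and its maximum is 1.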

3. Linear Regression Analysis (6 points)

  a. Fit a multiple linear regression model using ‘mpg’ as the response variable (Y) and ‘weight’, ‘horsepower’, and ‘year’ as predictors (X). Display the summary of the model. (1 point)
# Multiple linear regression without interaction terms
Auto_lm_simple <- lm(mpg ~ weight + horsepower + year, data = Auto)
summary(Auto_lm_simple)
## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7911 -2.3220 -0.1753  2.0595 14.3527 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.372e+01  4.182e+00  -3.281  0.00113 ** 
## weight      -6.448e-03  4.089e-04 -15.768  < 2e-16 ***
## horsepower  -5.000e-03  9.439e-03  -0.530  0.59663    
## year         7.487e-01  5.212e-02  14.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8068 
## F-statistic: 545.4 on 3 and 388 DF,  p-value: < 2.2e-16
  b. Interpret the intercept and the coefficient of weight. What do they tell us about the relationship between the predictors and ‘mpg’? (1 point)

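The intercept (-13.72) is the predicted mpg for a car whose weight, horsepower, and year are all zero. No such car exists in the data, and a negative mpg is impossible, so the intercept has no practical meaning on its own; it only anchors the fitted model. The weight coefficient of -0.0064 means that, holding horsepower and year constant, a one-unit (one-pound) increase in weight is associated with an average decrease of about 0.0064 mpg. The effect is statistically significant (p < 2e-16): heavier cars have lower gas mileage.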
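To make the size of the weight effect more concrete, the coefficient can be rescaled to a larger weight difference. The sketch below copies the estimates from the summary output above; the 1000-unit step is an arbitrary choice for illustration.

```r
# Estimates copied from the model summary above
b_intercept <- -1.372e+01
b_weight    <- -6.448e-03

# Expected change in mpg for a 1000-unit increase in weight,
# holding horsepower and year constant
delta_mpg_1000 <- b_weight * 1000
delta_mpg_1000

# Prediction when every predictor is 0: only the intercept remains
pred_at_zero <- b_intercept + b_weight * 0
pred_at_zero
```

A difference of 1000 units of weight corresponds to a predicted drop of roughly 6.4 mpg, which is easier to read than the per-unit slope.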

  c. Create diagnostic plots for the model. Briefly identify ONE potential issue you observe. (1 point) Commenting on all potential issues is optional (no extra points).
plot(Auto_lm_simple)

Residuals vs. Fitted: the residuals are generally scattered around zero, so the assumptions appear to be mostly met. However, the red smoothing line shows a slight curve, which suggests mild nonlinearity that the model does not capture.
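Calling plot() on an lm object produces several diagnostic plots in sequence. One common way to view them together is a 2-by-2 layout, sketched here with a stand-in model on mtcars (an assumption; the exam model is Auto_lm_simple):

```r
# Stand-in model (assumption): same idea as Auto_lm_simple, fitted on mtcars
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Show the four default diagnostic plots in one 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)   # restore the previous graphics settings
```

The first panel is the Residuals vs Fitted plot discussed above; the other default panels are the Q-Q plot, Scale-Location, and Residuals vs Leverage.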

  d. Obtain the R-squared and adjusted R-squared for the model. Interpret these values briefly. (1 point)
#R-squared
r2 <- summary(Auto_lm_simple)$r.squared
r2
## [1] 0.8083189
#Adjusted R-squared
adj_r2 <- summary(Auto_lm_simple)$adj.r.squared
adj_r2
## [1] 0.8068368

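The R-squared is 0.8083, so the model explains about 80.83% of the variance in mpg. The adjusted R-squared, which penalizes the number of predictors, is 0.8068, so the model still explains about 80.68% of the variation in mpg after that adjustment. The two values are close because the model uses only 3 predictors for 392 observations.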
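The adjustment can be checked by hand with the usual formula, adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below plugs in the values reported above.

```r
# Values taken from the output above
r2 <- 0.8083189   # multiple R-squared
n  <- 392         # observations
p  <- 3           # predictors: weight, horsepower, year

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2_manual, 4)
```

The result matches the adjusted R-squared reported by summary() to the printed precision.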

  e. Refit the model from 3a with ‘weight’, ‘horsepower’, and ‘year’ as predictors (X), adding one interaction term between weight and horsepower, and report the adjusted R-squared. (1 point)
Auto_lm_int <- lm(mpg ~ weight * horsepower + year, data = Auto)
summary(Auto_lm_int)
## 
## Call:
## lm(formula = mpg ~ weight * horsepower + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9146 -1.8987 -0.0386  1.5536 12.6333 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.577e+00  3.911e+00   0.915    0.361    
## weight            -1.185e-02  5.868e-04 -20.198   <2e-16 ***
## horsepower        -2.236e-01  2.063e-02 -10.837   <2e-16 ***
## year               7.749e-01  4.508e-02  17.190   <2e-16 ***
## weight:horsepower  5.790e-05  5.020e-06  11.534   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.963 on 387 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8559 
## F-statistic: 581.5 on 4 and 387 DF,  p-value: < 2.2e-16

The adjusted R-squared for the interaction model is 0.8559.
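With the interaction term in the model, the effect of weight on mpg is no longer a single number; it depends on horsepower. A small sketch using the estimates from the summary above (the horsepower values 50, 100, and 150 are arbitrary illustrative choices, and weight_slope is a helper name introduced here):

```r
# Estimates copied from the interaction model summary above
b_weight <- -1.185e-02   # weight main effect
b_int    <-  5.790e-05   # weight:horsepower interaction

# Slope of mpg with respect to weight at a given horsepower
weight_slope <- function(hp) b_weight + b_int * hp

weight_slope(c(50, 100, 150))
```

The slope stays negative at these values but shrinks toward zero as horsepower increases, which is what the positive interaction coefficient implies.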

  f. Compare the R-squared and adjusted R-squared of the interaction model and the original model. Does adding the interaction improve the model? Provide your answer and reasoning below. (1 point)
# Original model (no interaction)
# R-squared
r2 <- summary(Auto_lm_simple)$r.squared
r2
## [1] 0.8083189
#Adjusted R-squared
adj_r2 <- summary(Auto_lm_simple)$adj.r.squared
adj_r2
## [1] 0.8068368
# Interaction model
# R-squared
r2 <- summary(Auto_lm_int)$r.squared
r2
## [1] 0.8573517
#Adjusted R-squared
adj_r2 <- summary(Auto_lm_int)$adj.r.squared
adj_r2
## [1] 0.8558773

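Yes, adding the interaction term improved the model. R-squared rose from 0.8083 to 0.8574, and adjusted R-squared rose from 0.8068 to 0.8559. Because adjusted R-squared penalizes additional terms, its increase indicates that the improvement is not just a side effect of adding a predictor. The interaction term is also highly significant (p < 2e-16). The new model explains about 85.7% of the variance in mpg, compared with about 80.8% for the original model.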
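Beyond comparing adjusted R-squared, nested models like these can also be compared with a partial F-test via anova(). This goes slightly beyond what the question asks, and the sketch uses stand-in models on mtcars (an assumption) rather than Auto_lm_simple and Auto_lm_int.

```r
# Stand-in nested models (assumption): main effects vs. added interaction
fit_main <- lm(mpg ~ wt + hp, data = mtcars)
fit_int  <- lm(mpg ~ wt * hp, data = mtcars)

# Partial F-test: does the interaction term significantly reduce the residual error?
cmp <- anova(fit_main, fit_int)
print(cmp)

# Adjusted R-squared side by side
c(main        = summary(fit_main)$adj.r.squared,
  interaction = summary(fit_int)$adj.r.squared)
```

A small p-value in the second row of the anova table would support keeping the interaction term, in line with an increase in adjusted R-squared.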

End of Exam. Please submit this RMD file along with a knitted HTML report. Failure to submit the HTML report will result in a 1-point deduction.