Instructions

This is the R portion of your mid-term exam. You will analyze the Auto dataset, which contains information about various car models (similar to mtcars). Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10

Time allowed: 45 minutes

Good luck!

1. Data Import and Exploration (2 points)

Import the Auto dataset provided on Canvas, name it Auto, and display the first few rows. (1 point)

# Your code here
library(readr)
auto <- read_csv("auto.csv")

## Rows: 392 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (8): mpg, cylinders, displacement, horsepower, weight, acceleration, yea...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ stringr   1.5.1
## ✔ forcats   1.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
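The task also asks for the first few rows, which `head()` prints. A minimal base-R sketch follows. It assumes the built-in `mtcars` data, written to a temporary CSV, as a stand-in for `auto.csv` so the example runs anywhere, and it uses `read.csv()` rather than `read_csv()` to avoid extra packages. R is case-sensitive, so `Auto` and `auto` are different object names.

```r
# Stand-in for auto.csv: write the built-in mtcars data to a temporary CSV.
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)

# Import the file and name the object Auto, as the task requests.
Auto <- read.csv(tmp)

# head() returns the first six rows by default; n changes that.
first_rows <- head(Auto)
print(first_rows)
print(head(Auto, n = 3))
```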

Use appropriate R functions to display the structure of the dataset, and report how many observations and variables it contains. (1 point)

# Your code here
summary(auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577                     
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000

[Your answer here] There are 392 observations and 9 variables.
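`summary()` reports the distribution of each column, while `str()` and `dim()` report the structure and size directly. A short sketch, assuming the built-in `mtcars` as a stand-in for the exam's data frame:

```r
dat <- mtcars            # stand-in; in the exam this would be the imported data frame

str(dat)                 # one line per variable: type and first few values
dims <- dim(dat)         # c(number of rows, number of columns)
n_obs <- nrow(dat)
n_vars <- ncol(dat)
cat("Observations:", n_obs, " Variables:", n_vars, "\n")
```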

2. Data Preprocessing and Visualization (3 points)

Create a correlation matrix for all numeric variables in the dataset. (1 point) (Optional) Visualize the correlation matrix.

# Your code here
# Keep the eight numeric columns (this drops the character column `name`)
auto_num <- select(auto, mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin)
corr_matrix <- cor(auto_num)
print(corr_matrix)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

library(corrplot)

## corrplot 0.94 loaded

corrplot(corr_matrix, method="circle", type="upper", order="hclust",
         tl.col="black", tl.srt=45)
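Listing every numeric column by hand works, but the numeric columns can also be picked programmatically. A base-R sketch, assuming a stand-in data frame (`mtcars` with a character column `name` added) that mirrors the shape of the exam data:

```r
dat <- mtcars
dat$name <- rownames(mtcars)   # character column, mirrors `name` in the exam data

# Keep only columns for which is.numeric() is TRUE, then correlate them.
is_num <- vapply(dat, is.numeric, logical(1))
num_only <- dat[, is_num, drop = FALSE]
corr <- cor(num_only)

round(corr[, "mpg"], 2)        # correlations of every variable with mpg
```

With dplyr loaded, `select(auto, where(is.numeric))` should give the same column selection in one call.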

Create a scatter plot of ‘mpg’ vs ‘weight’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here but you should know how. (1 point)

# Your code here
# "mpg vs weight" puts the response (mpg) on the y-axis and weight on the x-axis
ggplot(auto_num, aes(x = weight, y = mpg)) +
  geom_point() +
  labs(title = "MPG vs Weight", x = "Weight", y = "Miles per gallon")

Create boxplots of ‘mpg’ for each ‘origin’ category (you can use boxplot() or ggplot()). You don’t need to interpret the result here but you should know how. (1 point)

# Your code here
ggplot(auto_num, aes(x = factor(origin), y = mpg)) +
  geom_boxplot(aes(fill = factor(origin)))
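Because `origin` is stored as the codes 1, 2, 3, the boxplot axis shows numbers rather than names. A small sketch of recoding it as a labelled factor follows. The region labels are an assumption: they follow the coding documented for the ISLR `Auto` data, and the Canvas file may differ. The `origin_codes` vector is made up for illustration.

```r
# Hypothetical origin codes for illustration; the exam data stores them as 1, 2, 3.
origin_codes <- c(1, 3, 2, 1, 1, 3)

# Assumption: the coding follows the ISLR Auto documentation.
origin_f <- factor(origin_codes,
                   levels = c(1, 2, 3),
                   labels = c("American", "European", "Japanese"))

table(origin_f)
```

The labelled factor could then replace `factor(origin)` in the boxplot so the axis and legend show region names instead of codes.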

3. Linear Regression Analysis (5 points)

Fit a multiple linear regression model using ‘mpg’ as the response variable and ‘weight’, ‘horsepower’, and ‘year’ as predictors. Display the summary of the model. (1 point)

# Your code here
auto_num_lm <- lm(mpg ~ weight + horsepower + year , data = auto_num)
summary(auto_num_lm)

## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = auto_num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7911 -2.3220 -0.1753  2.0595 14.3527 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.372e+01  4.182e+00  -3.281  0.00113 ** 
## weight      -6.448e-03  4.089e-04 -15.768  < 2e-16 ***
## horsepower  -5.000e-03  9.439e-03  -0.530  0.59663    
## year         7.487e-01  5.212e-02  14.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8068 
## F-statistic: 545.4 on 3 and 388 DF,  p-value: < 2.2e-16

Interpret the intercept and the coefficient of weight. What do they tell us about the relationship between the predictors and ‘mpg’? (1 point)

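[Your interpretation here] The intercept (-13.72) is the predicted mpg when weight, horsepower, and year are all zero. No car in the data is close to that, so it has no practical meaning on its own, although it is statistically significant (p = 0.0011). The weight coefficient (-0.006448) is also significant (p < 2e-16): holding horsepower and year constant, each additional unit of weight is associated with an average decrease of about 0.0064 mpg. The horsepower coefficient is negative as well, but it is not significant in this model (p = 0.597).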
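A per-unit weight effect is hard to read because the coefficient is so small, so it can help to rescale it. A short worked example using the estimates from the model summary above; the 1000-unit step is an illustrative choice, not part of the exam:

```r
# Estimates copied from the model summary above.
b_weight <- -6.448e-03
b_year   <-  7.487e-01

# Predicted change in mpg for a 1000-unit increase in weight, other predictors fixed.
delta_1000 <- b_weight * 1000
delta_1000    # -6.448

# Predicted change in mpg for a car that is one model year newer.
b_year        # 0.7487
```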

Create diagnostic plots for the model. Briefly comment on ONE potential issue you observe. (1 point) If you have time, you may comment on additional issues; this is optional.

# Your code here
par(mfrow = c(2, 2))
plot(auto_num_lm)

[Your comments here] The residuals vs fitted plot shows some pattern, but I think linearity is acceptable. In the Q-Q plot the residuals mostly follow a straight line, with some deviation towards the end that is not too concerning, so I would say normality roughly holds. The scale-location plot suggests roughly constant variance (homoscedasticity). In the residuals vs leverage plot there are some outliers, and most points are not near the center, which might indicate a problem with outlying or influential observations.
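Visual impressions from the diagnostic plots can be backed up with numbers. The sketch below uses base R only and fits a stand-in model on the built-in `mtcars` data, which is an assumption. The 4/n cutoff for Cook's distance is a common rule of thumb rather than a formal test.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model on built-in data

# Normality of residuals: Shapiro-Wilk test (small p-values suggest non-normality).
sw <- shapiro.test(residuals(fit))

# Influence: Cook's distance, flagged with the common 4/n rule of thumb.
cd <- cooks.distance(fit)
flagged <- names(cd)[cd > 4 / nobs(fit)]

# Leverage: hat values; they always sum to the number of coefficients.
lev <- hatvalues(fit)

sw$p.value
flagged
sum(lev)
```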

Obtain the R-squared and adjusted R-squared for the model. Interpret these values briefly. (1 point)

# Your code here
auto_sum_lm <- summary(auto_num_lm)
auto_sum_lm$r.squared

## [1] 0.8083189

auto_sum_lm$adj.r.squared

## [1] 0.8068368

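[Your interpretation here] R-squared is 0.808, so the model explains about 81% of the variance in mpg. Adjusted R-squared (0.807) penalizes for the number of predictors and is almost identical, so the fit is not being inflated by the three predictors. This is a reasonably good fit.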
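Adjusted R-squared follows from R-squared, the sample size n, and the number of predictors p as 1 - (1 - R^2)(n - 1)/(n - p - 1). A quick check using the values reported above (n = 392 observations, p = 3 predictors):

```r
r2 <- 0.8083189   # R-squared reported by the model
n  <- 392         # observations
p  <- 3           # predictors: weight, horsepower, year

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.8068
```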

Add one interaction term between weight and horsepower and report whether your model improved based on adjusted R-squared. (1 point)

# Your code here
auto_num_lm2 <- lm(mpg ~ weight + horsepower + year + weight*horsepower , data = auto_num)
summary(auto_num_lm2)

## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year + weight * horsepower, 
##     data = auto_num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9146 -1.8987 -0.0386  1.5536 12.6333 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.577e+00  3.911e+00   0.915    0.361    
## weight            -1.185e-02  5.868e-04 -20.198   <2e-16 ***
## horsepower        -2.236e-01  2.063e-02 -10.837   <2e-16 ***
## year               7.749e-01  4.508e-02  17.190   <2e-16 ***
## weight:horsepower  5.790e-05  5.020e-06  11.534   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.963 on 387 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8559 
## F-statistic: 581.5 on 4 and 387 DF,  p-value: < 2.2e-16

[Your observation here] Yes, the model improved: adjusted R-squared rose from 0.8068 to 0.8559, and the interaction term is significant (p < 2e-16), so the model with the interaction fits the data better.
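Two related points can be illustrated in code: `a * b` in a formula already expands to `a + b + a:b`, and nested models can be compared with a partial F-test as well as by adjusted R-squared. The sketch assumes the built-in `mtcars` data as a stand-in, with `wt` and `hp` playing the roles of weight and horsepower.

```r
# Stand-in models on the built-in mtcars data.
fit_main <- lm(mpg ~ wt + hp, data = mtcars)
fit_int  <- lm(mpg ~ wt * hp, data = mtcars)   # expands to wt + hp + wt:hp

# Writing the main effects out explicitly gives the same coefficients.
fit_long <- lm(mpg ~ wt + hp + wt * hp, data = mtcars)
same_fit <- isTRUE(all.equal(coef(fit_int), coef(fit_long)))

# Compare adjusted R-squared, then test the added term with a partial F-test.
adj_main <- summary(fit_main)$adj.r.squared
adj_int  <- summary(fit_int)$adj.r.squared
cmp <- anova(fit_main, fit_int)

c(main = adj_main, interaction = adj_int)
cmp
```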

End of Exam. Please submit this RMD file along with a knitted HTML report.

Mid-term-exam

Tianhai Zu

2024-10-03
