Instructions

This is the R portion of your mid-term exam. You will analyze the Auto dataset, which contains information about various car models (similar to mtcars). Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10
Time allowed: 45 minutes

Good luck!

1. Data Import and Exploration (2 points)

  1. Import the Auto dataset provided on Canvas, name it Auto, and display the first few rows. (1 point)
# Your code here
# import Auto dataset name it "Auto"
library(readr)
Auto <- read_csv("Auto.csv")
## Rows: 392 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## dbl (8): mpg, cylinders, displacement, horsepower, weight, acceleration, yea...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
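# Open the data viewer only in an interactive session (View() can fail when knitting)
if (interactive()) View(Auto)
# Display the first few rows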
head(Auto)
## # A tibble: 6 × 9
##     mpg cylinders displacement horsepower weight acceleration  year origin name 
##   <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>  <dbl> <chr>
## 1    18         8          307        130   3504         12      70      1 chev…
## 2    15         8          350        165   3693         11.5    70      1 buic…
## 3    18         8          318        150   3436         11      70      1 plym…
## 4    16         8          304        150   3433         12      70      1 amc …
## 5    17         8          302        140   3449         10.5    70      1 ford…
## 6    15         8          429        198   4341         10      70      1 ford…
  1. Use appropriate R functions to display the structure of the dataset, and report how many observations and variables it contains. (1 point)
# Your code here
str(Auto)
## spc_tbl_ [392 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ mpg         : num [1:392] 18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num [1:392] 8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num [1:392] 307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num [1:392] 130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num [1:392] 3504 3693 3436 3433 3449 ...
##  $ acceleration: num [1:392] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num [1:392] 70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num [1:392] 1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr [1:392] "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   mpg = col_double(),
##   ..   cylinders = col_double(),
##   ..   displacement = col_double(),
##   ..   horsepower = col_double(),
##   ..   weight = col_double(),
##   ..   acceleration = col_double(),
##   ..   year = col_double(),
##   ..   origin = col_double(),
##   ..   name = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
Num_observations <- nrow(Auto)
Num_variables <- ncol(Auto)

There are 392 observations and 9 variables in the dataset.

2. Data Preprocessing and Visualization (3 points)

  1. Create a correlation matrix for all numeric variables in the dataset. (1 point) Optionally, visualize the correlation matrix.
# Your code here
cor_matrix <- cor(Auto[, sapply(Auto, is.numeric)])
library(ggcorrplot)
## Loading required package: ggplot2
ggcorrplot(cor_matrix, lab = TRUE) 

  1. Create a scatter plot of ‘mpg’ vs ‘weight’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here, but you should know how. (1 point)
# Your code here
# Create a scatter plot
library(ggplot2)

ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs Weight",
       x = "Weight",
       y = "Miles Per Gallon (MPG)")

  1. Create boxplots of ‘mpg’ for each ‘origin’ category (you can use boxplot() or ggplot()). You don’t need to interpret the result here, but you should know how. (1 point)
# Your code here
# Create boxplots
ggplot(Auto, aes(x = factor(origin), y = mpg)) +
  geom_boxplot() +
  labs(title = "Boxplot of MPG by Origin",
       x = "Origin",
       y = "Miles Per Gallon (MPG)")

3. Linear Regression Analysis (5 points)

  1. Fit a multiple linear regression model using ‘mpg’ as the response variable and ‘weight’, ‘horsepower’, and ‘year’ as predictors. Display the summary of the model. (1 point)
# Your code here
# Fitting the model
model <- lm(mpg ~ weight + horsepower + year, data = Auto)

# Displaying the summary
summary(model)
## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7911 -2.3220 -0.1753  2.0595 14.3527 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.372e+01  4.182e+00  -3.281  0.00113 ** 
## weight      -6.448e-03  4.089e-04 -15.768  < 2e-16 ***
## horsepower  -5.000e-03  9.439e-03  -0.530  0.59663    
## year         7.487e-01  5.212e-02  14.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8068 
## F-statistic: 545.4 on 3 and 388 DF,  p-value: < 2.2e-16
  1. Interpret the intercept and the coefficient of weight. What do they tell us about the relationship between the predictors and ‘mpg’? (1 point)

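The intercept (about -13.72) is the predicted mpg when weight, horsepower, and year are all 0. Those values lie far outside the observed data, so the intercept has no practical meaning on its own. The coefficient of ‘weight’ (about -0.00645) is statistically significant (p < 2e-16) and negative: holding horsepower and year constant, each one-unit increase in weight is associated with a decrease of about 0.0064 mpg, so heavier cars tend to get lower mileage.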
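As a quick check on this interpretation, the arithmetic can be sketched in R. The coefficient is copied from the summary output above; the 1,000-unit weight difference is a hypothetical value chosen only for illustration.

```r
# Weight estimate copied from the summary(model) output above
b_weight <- -6.448e-03

# Hypothetical comparison: two cars with the same horsepower and year,
# one of them 1,000 weight units heavier than the other
delta_weight <- 1000

# Predicted difference in mpg implied by the linear model
delta_mpg <- b_weight * delta_weight

cat("Predicted change in mpg:", delta_mpg, "\n")
```

A negative result here reflects the negative sign of the weight coefficient: the heavier car is predicted to get fewer miles per gallon.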

  1. Create diagnostic plots for the model. Comment briefly on ONE potential issue you observe. (1 point) Commenting on more than one issue is optional, if you have time.
# Your code here
# Create diagnostic plots
# Arrange plots in 2x2
par(mfrow = c(2, 2))  


plot(model)

# The Residuals vs Leverage plot looks abnormal, which suggests that a few high-leverage or influential observations may be affecting the fit.
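A visual impression from the Residuals vs Leverage plot can be backed up numerically. The following is a minimal sketch, assuming a stand-in model on the built-in mtcars data (the exam model uses Auto, which is not bundled with R); the cutoffs are common rules of thumb, not hard thresholds.

```r
# Stand-in model on built-in data (assumption for illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance: how much each observation influences the fitted values
cooks <- cooks.distance(fit)

# Leverage (hat values): how unusual each row's predictor values are
lev <- hatvalues(fit)

# Rule-of-thumb cutoffs: 4/n for Cook's distance,
# 2 * (number of coefficients) / n for leverage
n <- nrow(mtcars)
k <- length(coef(fit))
influential <- names(cooks)[cooks > 4 / n]
high_leverage <- names(lev)[lev > 2 * k / n]

print(influential)
print(high_leverage)
```

The same calls work on the exam model by replacing `fit` with `model`; rows flagged by both checks are the ones most worth inspecting.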

  1. Obtain the R-squared and adjusted R-squared for the model. Interpret these values briefly. (1 point)
# Your code here
# Obtain R-squared and adjusted R-squared
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared

cat("R-squared:", r_squared, "\n")
## R-squared: 0.8083189
cat("Adjusted R-squared:", adj_r_squared, "\n")
## Adjusted R-squared: 0.8068368

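An R-squared of about 0.808 means the model explains roughly 81% of the variation in mpg. The adjusted R-squared (about 0.807) is only slightly lower because it penalizes R-squared for the number of predictors. The small gap suggests the three predictors add little unnecessary complexity; it does not mean the adjustment reduced the accuracy of the model.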
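The relationship between the two values can be checked by hand with the standard adjustment formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below uses the R-squared, sample size, and predictor count reported above; the variable names are introduced here only for illustration.

```r
# Values taken from the model output above
r2 <- 0.8083189
n <- 392   # observations
p <- 3     # predictors: weight, horsepower, year

# Adjusted R-squared shrinks R-squared according to the number of predictors
adj_r2_check <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

cat("Adjusted R-squared (by hand):", adj_r2_check, "\n")
```

The result should agree with the value extracted from summary(model) up to rounding.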

  1. Add one interaction term between weight and horsepower and report whether your model improved based on adjusted R-squared. (1 point)
# Your code here
# Fit the model with interaction term
model_interaction <- lm(mpg ~ weight * horsepower + year, data = Auto)

# Compare adjusted R-squared
adj_r_squared_interaction <- summary(model_interaction)$adj.r.squared

cat("Adjusted R-squared with interaction term:", adj_r_squared_interaction, "\n")
## Adjusted R-squared with interaction term: 0.8558773

Adding the interaction term raised the adjusted R-squared from about 0.807 to about 0.856, so the model improved: the weight-by-horsepower interaction explains additional variation in mpg even after the penalty for the extra term.
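Besides comparing adjusted R-squared, nested models like these can also be compared with an F-test via anova(). This is a minimal sketch, assuming stand-in models on the built-in mtcars data rather than Auto; the model names are hypothetical.

```r
# Stand-in nested models on built-in data (assumption for illustration only)
base_fit <- lm(mpg ~ wt + hp, data = mtcars)
inter_fit <- lm(mpg ~ wt * hp, data = mtcars)  # adds the wt:hp interaction

# Adjusted R-squared for each model, side by side
adj <- c(base = summary(base_fit)$adj.r.squared,
         interaction = summary(inter_fit)$adj.r.squared)
print(adj)

# F-test: does the interaction term significantly reduce residual error?
comparison <- anova(base_fit, inter_fit)
print(comparison)
```

For the exam models, the equivalent call would be anova(model, model_interaction); a small p-value in the second row would support keeping the interaction term.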


End of Exam. Please submit this RMD file along with a knitted HTML report.