Instructions

This is the R portion of your mid-term exam. You will analyze the Auto dataset, which contains information about various car models (similar to mtcars). Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10
Time allowed: 45 minutes

Good luck!

1. Data Import and Exploration (2 points)

  1. Import the Auto dataset provided on Canvas, name it Auto, and display the first few rows. (1 point)
# Your code here
library(ggplot2)
library(MASS)
Auto <- read.csv("Auto.csv")
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
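The import above works as-is because the Canvas copy of Auto.csv appears to be already cleaned. Some distributed copies of this dataset mark missing horsepower values with "?", which would turn the column into text. A minimal sketch of how that could be handled follows; the inline CSV text is made-up illustration data, not rows from the exam file.

```r
# Hypothetical two-row CSV standing in for a raw file that uses "?" for missing values
raw_text <- "mpg,horsepower\n18,130\n25,?\n"

# na.strings tells read.csv to treat "?" as NA, so horsepower stays numeric
raw <- read.csv(text = raw_text, na.strings = "?")

# Drop rows with any missing value before analysis
clean <- na.omit(raw)

str(raw)
nrow(clean)
```

With a real file, the same `na.strings` argument would be passed alongside the file name in `read.csv()`.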
  2. Use appropriate R functions to display the structure of the dataset, and report how many observations and variables it contains. (1 point)
# Your code here
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577                     
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000
nrow(Auto)
## [1] 392

The dataset contains 392 observations and 9 variables, from mpg through name.

2. Data Preprocessing and Visualization (3 points)

  1. Create a correlation matrix for all numeric variables in the dataset. (1 point) (Optional) Visualize the correlation matrix.
# Your code here

# Keep only the numeric columns (this drops the character column 'name')
Auto2 <- Auto[, sapply(Auto, is.numeric)]
cor(Auto2)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
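For the optional visualization, one possible approach is a tile plot of the correlation matrix. The sketch below uses the built-in mtcars data as a stand-in so that it runs without Auto.csv; the object names `cor_mat`, `cor_long`, and `p` are illustrative only.

```r
library(ggplot2)

# mtcars (built into R) stands in for the numeric columns of Auto
cor_mat <- cor(mtcars)

# Reshape to long form: one row per (Var1, Var2) pair, correlation stored in Freq
cor_long <- as.data.frame(as.table(cor_mat))

# Tile plot with a diverging color scale centered at 0
p <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation matrix", x = NULL, y = NULL, fill = "r")
print(p)
```

Replacing `mtcars` with the numeric columns of Auto should give the corresponding plot for the exam data.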
  2. Create a scatter plot of ‘mpg’ vs ‘weight’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here, but you should know how. (1 point)
# Your code here
# 'mpg' vs 'weight': mpg goes on the y-axis and weight on the x-axis
ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point() +
  labs(title = "Miles per Gallon vs. Weight", x = "Weight", y = "Miles per gallon")

  3. Create boxplots of ‘mpg’ for each ‘origin’ category (you can use boxplot() or ggplot()). You don’t need to interpret the result here, but you should know how. (1 point)
# Your code here
# origin is on the x-axis and mpg on the y-axis, so the labels must match
ggplot(Auto, aes(x = factor(origin), y = mpg)) +
  geom_boxplot(aes(fill = factor(origin))) +
  ggtitle("Miles per gallon by origin") +
  xlab("Origin") +
  ylab("Miles per gallon")
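The origin codes 1, 2, and 3 are hard to read on a plot axis. Assuming this file follows the coding documented for the ISLR Auto data (1 = American, 2 = European, 3 = Japanese), a labeled factor could be built as sketched below. The mapping is an assumption about this file, and `origin_codes` is a toy vector, not data from the exam.

```r
# Toy vector of origin codes standing in for Auto$origin
origin_codes <- c(1, 3, 2, 1, 3)

# Assumed mapping (ISLR Auto documentation): 1 = American, 2 = European, 3 = Japanese
origin_f <- factor(origin_codes,
                   levels = c(1, 2, 3),
                   labels = c("American", "European", "Japanese"))

table(origin_f)
```

A factor built this way from the origin column could replace `factor(origin)` in the boxplot so the axis and legend show region names.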

3. Linear Regression Analysis (5 points)

  1. Fit a multiple linear regression model using ‘mpg’ as the response variable and ‘weight’, ‘horsepower’, and ‘year’ as predictors. Display the summary of the model. (1 point)
# Your code here
Auto_lm <- lm(mpg ~ weight + horsepower + year, data = Auto)
summary(Auto_lm)
## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7911 -2.3220 -0.1753  2.0595 14.3527 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.372e+01  4.182e+00  -3.281  0.00113 ** 
## weight      -6.448e-03  4.089e-04 -15.768  < 2e-16 ***
## horsepower  -5.000e-03  9.439e-03  -0.530  0.59663    
## year         7.487e-01  5.212e-02  14.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8068 
## F-statistic: 545.4 on 3 and 388 DF,  p-value: < 2.2e-16
  2. Interpret the intercept and the coefficient of weight. What do they tell us about the relationship between the predictors and ‘mpg’? (1 point)

The intercept (-13.72) is the predicted mean mpg when weight, horsepower, and year are all 0. No car in the data has those values, so the intercept has no practical meaning on its own. The weight coefficient (-6.448e-03) means that each one-unit increase in weight is associated with an average decrease of about 0.0064 mpg, holding horsepower and year constant. This coefficient is statistically significant (p < 2e-16).
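To make the intercept easier to interpret, one common option is to center the predictors at their means. The sketch below illustrates the idea on the built-in mtcars data as a stand-in for Auto; the names `fit_raw`, `fit_centered`, `wt_c`, and `hp_c` are hypothetical. The last lines rescale the reported weight coefficient to a change per 1000 units of weight.

```r
# mtcars stands in for Auto; wt and hp play the roles of weight and horsepower
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)

# Center each predictor at its mean so that "all predictors = 0" means "an average car"
centered <- transform(mtcars, wt_c = wt - mean(wt), hp_c = hp - mean(hp))
fit_centered <- lm(mpg ~ wt_c + hp_c, data = centered)

coef(fit_raw)       # intercept: predicted mpg at wt = 0 and hp = 0 (an extrapolation)
coef(fit_centered)  # intercept: mean mpg; the slopes are unchanged

# Weight coefficient reported in the exam output, rescaled per 1000 units of weight
weight_coef <- -6.448e-03
per_1000 <- weight_coef * 1000
per_1000
```

Centering changes only the intercept, so the interpretation of the weight slope stays the same.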

  3. Create diagnostic plots for the model. Briefly comment on ONE potential issue you observe. (1 point) If you have time, comment on all potential issues; discussing more than one is optional.
# Your code here
par(mfrow = c(2, 2))
plot(Auto_lm)

In the Q-Q plot of the residuals, most points follow the dotted reference line, but they drift away from it at both ends. The normality assumption therefore looks reasonable for the bulk of the residuals, and the departure in the tails is the one potential issue to note.
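The visual reading of the Q-Q plot can be complemented with numeric checks. The following is a sketch only, run on mtcars as a stand-in for Auto; the 4/n cutoff for Cook's distance is a common rule of thumb, not something the exam specifies.

```r
# mtcars stands in for Auto so the sketch is self-contained
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(residuals(fit))
sw$p.value

# Cook's distance: flag observations above the 4/n rule of thumb
cd <- cooks.distance(fit)
influential <- names(cd)[cd > 4 / nrow(mtcars)]
influential
```

These checks support the plots rather than replace them, since a formal test can flag minor departures in large samples.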

  4. Obtain the R-squared and adjusted R-squared for the model. Interpret these values briefly. (1 point)
# Your code here
summary(Auto_lm)
## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7911 -2.3220 -0.1753  2.0595 14.3527 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.372e+01  4.182e+00  -3.281  0.00113 ** 
## weight      -6.448e-03  4.089e-04 -15.768  < 2e-16 ***
## horsepower  -5.000e-03  9.439e-03  -0.530  0.59663    
## year         7.487e-01  5.212e-02  14.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8068 
## F-statistic: 545.4 on 3 and 388 DF,  p-value: < 2.2e-16

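The R^2 is 0.8083 and the adjusted R^2 is 0.8068. R^2 is the proportion of the variance in mpg explained by the model, so with only three predictors the model explains about 81% of that variance. The adjusted R^2 penalizes the number of predictors; because it is nearly identical to R^2, the predictors are not artificially inflating the fit. A high R^2 describes the fit to this sample and does not by itself guarantee accurate predictions on new data.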
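The relationship between the two statistics can be checked by hand. The sketch below plugs the reported R^2, the 392 observations, and the 3 predictors into the standard adjusted R^2 formula, then shows how the same quantities could be extracted from a fitted model (mtcars is used as a stand-in so the code runs without Auto.csv).

```r
# Values taken from the model summary reported for Auto_lm
r2 <- 0.8083   # Multiple R-squared
n  <- 392      # observations
p  <- 3        # predictors: weight, horsepower, year

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # 0.8068, matching the reported value

# The same quantities can be pulled from a summary object (mtcars as a stand-in)
s <- summary(lm(mpg ~ wt + hp, data = mtcars))
c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
```

Extracting `r.squared` and `adj.r.squared` from the summary object avoids reprinting the full summary when only these two values are needed.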

  5. Add one interaction term between weight and horsepower and report whether your model improved based on adjusted R-squared. (1 point)
# Your code here
Auto_lm_int <- lm(mpg ~ weight + horsepower + year + weight*horsepower, data = Auto)
summary(Auto_lm_int)
## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year + weight * horsepower, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9146 -1.8987 -0.0386  1.5536 12.6333 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.577e+00  3.911e+00   0.915    0.361    
## weight            -1.185e-02  5.868e-04 -20.198   <2e-16 ***
## horsepower        -2.236e-01  2.063e-02 -10.837   <2e-16 ***
## year               7.749e-01  4.508e-02  17.190   <2e-16 ***
## weight:horsepower  5.790e-05  5.020e-06  11.534   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.963 on 387 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8559 
## F-statistic: 581.5 on 4 and 387 DF,  p-value: < 2.2e-16

The model improved: the adjusted R^2 rose from 0.8068 to 0.8559, so the model with the interaction explains a larger share of the variance in mpg even after the penalty for the extra term. The interaction term is significant (p < 2e-16), and horsepower, which was not significant in the first model (p = 0.597), is now significant as well. The model is somewhat more complex, but the gain in fit justifies the added term.
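A comparison like this can also be made directly in code. The sketch below uses mtcars as a stand-in for Auto and adds a partial F-test via `anova()`, which the exam does not require; the names `fit_main` and `fit_int` are illustrative.

```r
# mtcars stands in for Auto; wt and hp play the roles of weight and horsepower
fit_main <- lm(mpg ~ wt + hp, data = mtcars)

# wt * hp expands to wt + hp + wt:hp, so the main effects need not be repeated
fit_int <- lm(mpg ~ wt * hp, data = mtcars)
names(coef(fit_int))

# Compare adjusted R-squared, which penalizes the extra term
c(main = summary(fit_main)$adj.r.squared,
  interaction = summary(fit_int)$adj.r.squared)

# Partial F-test for the nested models: does the interaction add explanatory power?
anova(fit_main, fit_int)
```

By the same expansion rule, the exam formula `weight + horsepower + year + weight*horsepower` is equivalent to `weight * horsepower + year`.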


End of Exam. Please submit this RMD file along with a knitted HTML report.