Instructions

This is the R portion of your mid-term exam. You will analyze the Auto dataset, which contains information about various car models (similar to mtcars). Follow the instructions carefully and write your R code in the provided chunks. You will be graded on the correctness of your code, the quality of your analysis, and your interpretation of the results.

Total points: 10. Time allowed: 45 minutes.

Good luck!

1. Data Import and Exploration (2 points)

Import the Auto dataset provided on Canvas, name it Auto, and display the first few rows. (1 point)

Auto <- read.csv("Auto.csv")
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
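The import step can be made a little more defensive. The sketch below is only an illustration: it writes a tiny hypothetical stand-in file to a temporary location (its rows are made up for the example, and the `?` placeholder for a missing value is an assumption, not something the exam file is known to contain), then shows how `na.strings` keeps a numeric column numeric.

```r
# Hypothetical stand-in for Auto.csv, written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "mpg,horsepower,name",
  "18,130,car a",
  "25,?,car b",
  "15,165,car c"
), csv_path)

# na.strings = "?" converts the placeholder to NA, so horsepower is read
# as a number instead of as text.
demo <- read.csv(csv_path, na.strings = "?")
str(demo)

# Count missing values per column, then drop incomplete rows.
colSums(is.na(demo))
demo_clean <- na.omit(demo)
nrow(demo_clean)
```

Without `na.strings`, a single non-numeric entry would cause `read.csv()` to import the whole column as character, which would then be silently excluded from any numeric-only analysis.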

Use appropriate R functions to display the structure of the dataset and report how many observations and variables are in the dataset. (1 point)

str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

dim(Auto)

## [1] 392   9

#There are 392 observations and 9 variables in the dataset.

2. Data Preprocessing and Visualization (2 points)

Create a correlation matrix for all numeric variables in the dataset. (1 point) (Optional) Visualize the correlation matrix.

Auto_Num_Only <- Auto[sapply(Auto, is.numeric)]
Auto_matrix <- cor(Auto_Num_Only)
print(Auto_matrix)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
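For the optional visualization, one possible approach using only base R is sketched here. It uses the built-in `mtcars` data as a stand-in so that it runs without `Auto.csv`; with the exam data, `Auto_Num_Only` would take the place of `num_only`.

```r
# Stand-in data: mtcars ships with R. Keep numeric columns only.
num_only <- mtcars[sapply(mtcars, is.numeric)]
cor_mat <- cor(num_only)

# Text view: symnum() replaces each correlation with a symbol chosen
# by its magnitude, which makes strong relationships easy to spot.
print(symnum(cor_mat))

# Graphical view: a heatmap with no reordering, no dendrograms,
# and no row rescaling, so the cells show the correlations as computed.
heatmap(cor_mat, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        main = "Correlation matrix (mtcars stand-in)")
```

Packages such as corrplot offer more polished displays, but the base R version avoids any extra installation.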

Create a scatter plot of ‘mpg’ vs ‘weight’ (you can use plot() or ggplot()). Add a title and proper axis labels. You don’t need to interpret the result here, but you should know how. (1 point)

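# mpg (response) on the y-axis, weight on the x-axis, with explicit axis labels
plot(Auto$weight, Auto$mpg,
     main = "Scatterplot of mpg vs weight",
     xlab = "Weight (lbs)", ylab = "Miles per gallon (mpg)")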

Optional (no credit; you may skip it): standardize the variable ‘age’ using either Z-standardization or range standardization.

# Your code here. Again, this is optional, no credit. Maybe come back once you have finished all other questions.
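The two standardizations named in this optional task can be sketched as follows. Note that no column called `age` appears in the `str(Auto)` output above, so this example uses `mtcars$wt` purely as a stand-in variable; which Auto column the task intends is not stated.

```r
x <- mtcars$wt  # stand-in variable (assumption, see note above)

# Z-standardization: subtract the mean, divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

# Range (min-max) standardization: rescale to the interval [0, 1].
r <- (x - min(x)) / (max(x) - min(x))

# scale() produces the same z-scores, returned as a one-column matrix.
z_check <- as.numeric(scale(x))
all.equal(z, z_check)

# After z-standardization the mean is 0 and the standard deviation is 1.
round(c(mean = mean(z), sd = sd(z)), 3)
range(r)
```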

3. Linear Regression Analysis (5 points)

Fit a multiple linear regression model using ‘mpg’ as the response variable (Y) and ‘weight’, ‘horsepower’, and ‘year’ as predictors (X). Display the summary of the model. (1 point)

m1 <- lm(mpg ~ weight + horsepower + year, data = Auto)
summary(m1)

## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7911 -2.3220 -0.1753  2.0595 14.3527 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.372e+01  4.182e+00  -3.281  0.00113 ** 
## weight      -6.448e-03  4.089e-04 -15.768  < 2e-16 ***
## horsepower  -5.000e-03  9.439e-03  -0.530  0.59663    
## year         7.487e-01  5.212e-02  14.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 388 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8068 
## F-statistic: 545.4 on 3 and 388 DF,  p-value: < 2.2e-16

Interpret the intercept and the coefficient of weight. What do they tell us about the relationship between the predictors and ‘mpg’? (1 point)

#When all predictors are 0, the predicted mpg is -13.72 (the intercept). This is not meaningful in the real world, since no car has zero weight or horsepower, but it is the mathematical intercept. For weight, holding horsepower and year constant, each 1 lb increase in weight is associated with a decrease of about 0.0064 mpg (roughly 6.4 mpg per 1,000 lbs). The horsepower coefficient is -0.005, but it is not statistically significant (p = 0.597). For year, each one-year increase in model year is associated with an increase of about 0.75 mpg, which matches the expectation that newer cars are more fuel efficient.
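The "holding the other predictors constant" reading of a coefficient can be checked directly. The sketch below uses `mtcars` as a stand-in (with `wt` and `hp` as predictors), and the two cars in `two_cars` are hypothetical values chosen for illustration. It also shows why a coefficient on a per-pound scale looks tiny: the units of the predictor set the scale of the slope.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model
b <- coef(fit)

# Two hypothetical cars with the same hp that differ by one unit of wt.
two_cars <- data.frame(wt = c(3, 4), hp = c(110, 110))
pred <- predict(fit, newdata = two_cars)

# The difference in predicted mpg equals the wt coefficient.
unname(pred[2] - pred[1])
unname(b["wt"])

# mtcars records wt in units of 1000 lbs. Refitting with weight in lbs
# divides the slope by 1000 without changing the model's fit.
fit_lb <- lm(mpg ~ I(wt * 1000) + hp, data = mtcars)
unname(coef(fit_lb)[2])
unname(b["wt"]) / 1000
```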

Create diagnostic plots for the model. Briefly identify ONE potential issue you observe. (1 point) Comment on all potential issues (optional, no extra points).

par(mfrow = c(2, 2))
plot(m1)

There is slight non-linearity: the points do not scatter randomly around 0 in the Residuals vs Fitted plot, and the smoothed line is slightly curved. There are also potential issues in the tails of the residual distribution (Normal Q-Q plot), specifically the upper tail.
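Visual impressions from the diagnostic plots can be backed up with a few numerical checks. This is a minimal sketch on a stand-in `mtcars` model, not a required part of the exam; the choice of a squared `wt` term as the non-linearity probe is an assumption made for illustration.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model

# Non-linearity: add a squared term and test whether it improves the fit.
fit_sq <- update(fit, . ~ . + I(wt^2))
anova(fit, fit_sq)

# Normality of residuals: complements the Normal Q-Q plot.
shapiro.test(residuals(fit))

# Influence: the three observations with the largest Cook's distance.
head(sort(cooks.distance(fit), decreasing = TRUE), 3)
```

A small p-value in the `anova()` comparison would support the curvature seen in the Residuals vs Fitted plot, while a small Shapiro-Wilk p-value would support the departure seen in the Q-Q tails.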

Obtain the R-squared and adjusted R-squared for the model. Interpret these values briefly. (1 point)

m1 <- lm(mpg ~ weight + horsepower + year, data = Auto)
summary(m1)$r.squared

## [1] 0.8083189

summary(m1)$adj.r.squared

## [1] 0.8068368

The R-squared was 0.8083 (about 80.83%) and the adjusted R-squared was 0.8068 (about 80.68%), which means the three predictors account for roughly 81% of the variance in mpg. The adjusted R-squared penalizes additional predictors, but the two values are very close, so the penalty for including three predictors is small.
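Both quantities can be reproduced by hand from the residuals, which makes the penalty in the adjusted version explicit. The sketch below uses a stand-in `mtcars` model and the standard definitions R² = 1 - RSS/TSS and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1).

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model
y <- mtcars$mpg
n <- length(y)
p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept

rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares

r2 <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hand-computed values next to the ones reported by summary().
c(r2 = r2, adj_r2 = adj_r2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)
```

With the values reported above (R-squared 0.8083, n = 392, three predictors), the same formula gives 1 - (1 - 0.8083) × 391/388 ≈ 0.8068, matching the summary output.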

Refit the model from 3a, adding one interaction term between weight and horsepower to the predictors ‘weight’, ‘horsepower’, and ‘year’ (X), and report the adjusted R-squared. (1 point)

m2 <- lm(mpg ~ weight + horsepower + year + weight * horsepower, data = Auto)
summary(m2)

## 
## Call:
## lm(formula = mpg ~ weight + horsepower + year + weight * horsepower, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9146 -1.8987 -0.0386  1.5536 12.6333 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.577e+00  3.911e+00   0.915    0.361    
## weight            -1.185e-02  5.868e-04 -20.198   <2e-16 ***
## horsepower        -2.236e-01  2.063e-02 -10.837   <2e-16 ***
## year               7.749e-01  4.508e-02  17.190   <2e-16 ***
## weight:horsepower  5.790e-05  5.020e-06  11.534   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.963 on 387 degrees of freedom
## Multiple R-squared:  0.8574, Adjusted R-squared:  0.8559 
## F-statistic: 581.5 on 4 and 387 DF,  p-value: < 2.2e-16

The adjusted R-squared is 0.8559, or 85.59%.

Compare the R-squared and the adjusted R-squared from the interaction model and the original model. Does adding the interaction improve the model? Provide your answer and reasoning below. (1 point)

Yes. Including the interaction increased the R-squared from 0.8083 to 0.8574, but R-squared never decreases when a term is added, so that alone does not show improvement. That is why it is important to look at the adjusted R-squared, which accounts for the extra term: it also rose, from 0.8068 to 0.8559, and the interaction term is highly significant (p < 2e-16). Overall, adding the interaction did improve the model, suggesting that the effect of weight on mpg depends on horsepower, and vice versa.
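Because the original model is nested inside the interaction model, the comparison can also be made with a formal F-test. The sketch below uses stand-in `mtcars` models (the names `base_fit` and `int_fit` are illustrative); with the exam data, `anova(m1, m2)` would be the analogous call.

```r
base_fit <- lm(mpg ~ wt + hp, data = mtcars)          # main effects only
int_fit  <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)  # adds the interaction

# Adjusted R-squared for both models, side by side.
adj <- c(base = summary(base_fit)$adj.r.squared,
         interaction = summary(int_fit)$adj.r.squared)
round(adj, 4)

# Nested-model F-test: does the interaction term reduce the residual
# sum of squares by more than chance would explain?
cmp <- anova(base_fit, int_fit)
cmp
p_value <- cmp[["Pr(>F)"]][2]
p_value
```

A higher adjusted R-squared together with a small p-value from the F-test gives two consistent lines of evidence that the interaction term is worth keeping.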

End of Exam. Please submit this RMD file along with a knitted HTML report. Failure to submit the HTML report will lead to a 1-point deduction.

Mid-term-exam

Tianhai Zu

2025-10-15
