Introduction

The purpose of this assignment is to:

  1. Fit a Multiple Linear Regression model using a dataset.
  2. Interpret the results obtained from the model.
  3. Study and explain variable selection methods used in regression analysis.

Dataset Description

For this assignment, I used the built-in mtcars dataset from R.

The dataset contains 11 variables recorded for 32 automobiles:

  • mpg: Miles per (US) gallon (fuel efficiency)
  • cyl: Number of cylinders
  • disp: Engine displacement (cubic inches)
  • hp: Gross horsepower
  • drat: Rear axle ratio
  • wt: Weight of the car (1000 lbs)
  • qsec: Quarter-mile time (seconds)
  • vs: Engine shape (0 = V-shaped, 1 = straight)
  • am: Transmission type (0 = automatic, 1 = manual)
  • gear: Number of forward gears
  • carb: Number of carburetors

Our goal is to predict fuel efficiency (mpg) using multiple predictor variables.

Load Dataset

data(mtcars)

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Exploratory Data Analysis

Dataset Structure

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Summary Statistics

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Pairwise Relationships

pairs(mtcars[, c("mpg","wt","hp","disp","cyl")],
      main = "Scatterplot Matrix")
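
As a numeric companion to the scatterplot matrix, the pairwise Pearson correlations for the same five variables can be computed. This is a minimal sketch; the names vars and cor_mat are introduced here for illustration only. Strong correlations among the predictors themselves would suggest multicollinearity, which is worth keeping in mind when reading the regression coefficients later.

```r
# Pairwise Pearson correlations for the variables shown in the scatterplot matrix
data(mtcars)

vars <- c("mpg", "wt", "hp", "disp", "cyl")   # same columns as the pairs() call
cor_mat <- cor(mtcars[, vars])                # 5 x 5 correlation matrix

# Round for readability; the first row/column gives each variable's correlation with mpg
round(cor_mat, 2)
```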

Multiple Linear Regression Model

We model fuel efficiency (mpg) using:

  • Weight (wt)
  • Horsepower (hp)
  • Number of cylinders (cyl)

Fit the Model

model1 <- lm(mpg ~ wt + hp + cyl, data = mtcars)

summary(model1)
## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## hp          -0.01804    0.01188  -1.519 0.140015    
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

Interpretation of Results

The regression equation is:

$$ MPG = \beta_0 + \beta_1(WT) + \beta_2(HP) + \beta_3(CYL) $$

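The coefficients obtained from the model indicate:

  • Weight (wt):
    • The coefficient is -3.17 (p = 0.0002), significant at the 5% level.
    • Holding hp and cyl fixed, each additional 1000 lbs of weight lowers predicted fuel efficiency by about 3.17 mpg.
  • Horsepower (hp):
    • The coefficient is -0.018 (p = 0.14), negative but not significant at the 5% level.
    • Higher horsepower is associated with lower fuel economy, but the effect cannot be distinguished from zero once wt and cyl are in the model.
  • Cylinders (cyl):
    • The coefficient is -0.94 (p = 0.098), significant only at the 10% level.
    • Vehicles with more cylinders tend to consume more fuel, about 0.94 mpg less per additional cylinder.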
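
To see how precisely each coefficient is estimated, the estimates can be paired with 95% confidence intervals. The sketch below refits the same model so that it runs on its own; coef_table and ci are illustrative names, not part of the original analysis. An interval that contains zero corresponds to a coefficient that is not significant at the 5% level.

```r
# Coefficient estimates with 95% confidence intervals for model1
data(mtcars)
model1 <- lm(mpg ~ wt + hp + cyl, data = mtcars)

coef_table <- coef(summary(model1))       # estimates, std. errors, t values, p-values
ci <- confint(model1, level = 0.95)       # lower and upper 95% limits per coefficient

# Side-by-side view: estimate, lower limit, upper limit
cbind(Estimate = coef_table[, "Estimate"], ci)
```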

Model Performance

summary(model1)$r.squared
## [1] 0.84315
summary(model1)$adj.r.squared
## [1] 0.8263446

Interpretation

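  • R-squared measures the proportion of variation in MPG explained by the predictors; here, 0.843 means the model explains about 84.3% of it.
  • Adjusted R-squared penalizes the number of predictors used; here it is 0.826.
  • Values closer to 1 indicate a closer fit to this sample, though not necessarily better predictions on new data.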
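
Both quantities can be reproduced by hand from the residual and total sums of squares, which makes the definitions concrete. This is a sketch using the standard formulas; the names rss, tss, n, p, r2 and adj_r2 are introduced here for illustration.

```r
# Recompute R-squared and adjusted R-squared from their definitions
data(mtcars)
model1 <- lm(mpg ~ wt + hp + cyl, data = mtcars)

rss <- sum(residuals(model1)^2)                  # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
n   <- nrow(mtcars)                              # number of observations
p   <- length(coef(model1)) - 1                  # number of predictors (intercept excluded)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```

The two values should agree with summary(model1)$r.squared and summary(model1)$adj.r.squared.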

Diagnostic Plots

Regression assumptions can be checked using diagnostic plots.

par(mfrow=c(2,2))
plot(model1)

Assumptions Checked

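  1. Linearity (Residuals vs Fitted plot)
  2. Normality of residuals (Normal Q-Q plot)
  3. Homoscedasticity, i.e. constant variance (Scale-Location plot)
  4. Influential observations (Residuals vs Leverage plot)

Independence of the errors is also assumed, but these four plots do not test it directly; it depends mainly on how the data were collected.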
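
The visual checks can be supplemented with a few numeric ones. The sketch below is one possible choice, not the only one: it uses the Shapiro-Wilk test for normality of the residuals and Cook's distance to flag influential cars. The names res, sw and cd are illustrative.

```r
# Numeric supplements to the diagnostic plots for model1
data(mtcars)
model1 <- lm(mpg ~ wt + hp + cyl, data = mtcars)

res <- residuals(model1)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(res)
sw$p.value

# Cook's distance per observation; larger values indicate more influence on the fit
cd <- cooks.distance(model1)
head(sort(cd, decreasing = TRUE), 3)   # the three most influential cars
```

A small Shapiro-Wilk p-value would cast doubt on the normality assumption, while with only 32 observations the test has limited power, so it is best read together with the Q-Q plot.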

Variable Selection Methods

Variable selection helps identify the most important predictors while avoiding overfitting.

1. Forward Selection

Description

  • Starts with no predictors.
  • Adds variables one at a time.
  • At each step, the variable that improves the model the most is added.
  • Stops when no significant improvement occurs.

Advantages

  • Simple and computationally efficient.
  • Useful when many predictors are available.

Disadvantages

  • May miss the best combination of variables.
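
The forward procedure described above can be sketched with base R's step(). Note the assumption: step() measures improvement by AIC (its default) rather than by a significance test, so this is an AIC-based variant of forward selection. The names null_model, full_model and forward_model are introduced here for illustration.

```r
# Forward selection with step(), using AIC as the improvement measure
data(mtcars)

null_model <- lm(mpg ~ 1, data = mtcars)   # intercept only: the starting point
full_model <- lm(mpg ~ ., data = mtcars)   # all predictors: the largest model allowed

forward_model <- step(null_model,
                      scope = list(lower = formula(null_model),
                                   upper = formula(full_model)),
                      direction = "forward",
                      trace = FALSE)

formula(forward_model)   # predictors added, in the order they entered
```

Setting trace = TRUE instead would print the AIC of every candidate at each step, which shows why each variable was chosen.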

2. Backward Elimination

Description

  • Starts with all predictors.
  • Removes the least significant variable.
  • Continues until all remaining variables are significant.

Advantages

  • Considers all variables initially.

Disadvantages

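  • Requires more observations than predictors so that the full model can be fitted; with a small sample and many predictors, the initial estimates are unstable.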
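
Backward elimination can be sketched in the same way. As an assumption to keep in mind, step() removes the term whose deletion lowers AIC the most, which differs from the p-value rule in the description above; backward_model is an illustrative name.

```r
# Backward elimination with step(), using AIC as the removal criterion
data(mtcars)

full_model <- lm(mpg ~ ., data = mtcars)   # start with all 10 predictors

backward_model <- step(full_model,
                       direction = "backward",
                       trace = FALSE)

formula(backward_model)   # predictors that survive elimination
```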

3. Stepwise Selection

Description

  • Combination of Forward and Backward methods.
  • Variables can be added or removed at each step.

Advantages

  • More flexible.
  • Can drop a variable added earlier if it becomes redundant once other variables enter.

Disadvantages

  • Can still miss the globally optimal model.

Applying Stepwise Selection

We use AIC (the Akaike information criterion), the default criterion in step(), to select variables.

full_model <- lm(mpg ~ ., data = mtcars)

step_model <- step(full_model,
                   direction = "both",
                   trace = FALSE)

summary(step_model)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Selected Model

formula(step_model)
## mpg ~ wt + qsec + am

Interpretation

Starting from the full model with all 10 predictors, the stepwise procedure retains wt, qsec, and am, the subset with the lowest AIC along its search path, and drops the other seven.

With three predictors instead of ten, the final model is more parsimonious and easier to interpret than the full model.

Comparison of Models

AIC(model1)
## [1] 155.4766
AIC(step_model)
## [1] 154.1194

Interpretation

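  • Lower AIC indicates a better balance between model fit and model complexity.
  • The model with the smaller AIC is preferred; here that is step_model (154.12 versus 155.48 for model1), although a difference of about 1.4 is small.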
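
Because model1 and step_model are not nested (neither is a special case of the other), an F-test via anova() is not appropriate, and information criteria are the natural comparison. The sketch below collects AIC, BIC and adjusted R-squared in one table; step_model is refitted directly from the formula reported by step() so that the block runs on its own, and comparison is an illustrative name.

```r
# Side-by-side comparison of the two three-predictor models
data(mtcars)

model1     <- lm(mpg ~ wt + hp + cyl, data = mtcars)
step_model <- lm(mpg ~ wt + qsec + am, data = mtcars)   # formula selected by step()

comparison <- data.frame(
  model  = c("model1", "step_model"),
  AIC    = c(AIC(model1), AIC(step_model)),
  BIC    = c(BIC(model1), BIC(step_model)),
  adj_r2 = c(summary(model1)$adj.r.squared,
             summary(step_model)$adj.r.squared)
)

comparison
```

Since both models have the same number of parameters, AIC, BIC and adjusted R-squared all rank them in the same order.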

Conclusion

This assignment demonstrated the use of Multiple Linear Regression using the mtcars dataset.

Key findings include:

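  1. Vehicle weight is the strongest predictor of fuel efficiency; horsepower and cylinders have negative coefficients that are not significant at the 5% level once weight is in the model.
  2. Multiple Linear Regression models MPG well here, explaining about 84% of its variation.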
  3. Diagnostic plots help verify regression assumptions.
  4. Variable selection methods help identify important predictors.
  5. Stepwise selection provides a practical way to build a simpler and more efficient regression model.