QUESTION ONE

1. Introduction

For this assignment, I used the built-in “mtcars” (Motor Trend Car Road Tests) dataset to construct a multiple linear regression model. My objective is to predict a vehicle’s fuel efficiency, measured in miles per gallon (mpg), from three independent variables: weight in thousands of pounds (wt), gross horsepower (hp), and the number of cylinders (cyl).

2. Data Ingestion and Model Fitting

The following R code loads the dataset, previews the relevant variables, and fits the multiple linear regression model using the standard “lm()” function.

# Using the built-in dataset
data(mtcars)
# Preview the first rows of the relevant variables
head(mtcars[, c("mpg", "wt", "hp", "cyl")])
##                    mpg    wt  hp cyl
## Mazda RX4         21.0 2.620 110   6
## Mazda RX4 Wag     21.0 2.875 110   6
## Datsun 710        22.8 2.320  93   4
## Hornet 4 Drive    21.4 3.215 110   6
## Hornet Sportabout 18.7 3.440 175   8
## Valiant           18.1 3.460 105   6
# Multiple linear regression model
# Formula: mpg = b0 + b1*wt + b2*hp + b3*cyl
mlr_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Print the model summary
summary(mlr_model)
## 
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## hp          -0.01804    0.01188  -1.519 0.140015    
## cyl         -0.94162    0.55092  -1.709 0.098480 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11

3. The Statistical Interpretation

The Regression Equation:

mpg = 38.75 - 3.17(wt) - 0.018(hp) - 0.94(cyl)
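To make the equation concrete, the sketch below predicts mpg for a single car in two ways: with predict() and by hand from the estimated coefficients. The car (3,000 lbs, 110 hp, 6 cylinders) and the names new_car, pred_model, and pred_manual are illustrative assumptions, not part of the original analysis.

```r
# Refit the model from Section 2 so this block runs on its own
data(mtcars)
mlr_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Hypothetical car (illustrative values, not a row from the dataset)
new_car <- data.frame(wt = 3.0, hp = 110, cyl = 6)

# Prediction from the fitted model
pred_model <- predict(mlr_model, newdata = new_car)

# Same prediction computed by hand from the estimated coefficients
b <- coef(mlr_model)
pred_manual <- b["(Intercept)"] + b["wt"] * 3.0 + b["hp"] * 110 + b["cyl"] * 6

pred_model   # approximately 21.6 mpg
```

Both values should agree, which confirms that the written equation is simply the fitted coefficients applied to the inputs.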

1. Overall Model Significance (Goodness of Fit):

  • Adjusted R-squared: The model has an Adjusted R-squared of 0.8263, so roughly 83% of the variance in a car’s fuel efficiency (mpg) is explained by the combination of its weight, horsepower, and number of cylinders, after adjusting for the number of predictors. This indicates a strong fit.
  • F-Statistic and P-value: The F-statistic is 50.17 on 3 and 28 degrees of freedom, and the overall p-value of the model is 2.184x10^-11, far below 0.05. This is strong evidence that the independent variables, as a group, reliably predict miles per gallon and that the fit is unlikely to be due to random chance.
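The goodness-of-fit figures can also be pulled out of the summary object programmatically instead of being read off the printout. The following is a minimal sketch; the variable names (model_summary, adj_r2, f_stat, overall_p) are my own choices.

```r
# Refit the model from Section 2 so this block runs on its own
data(mtcars)
mlr_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
model_summary <- summary(mlr_model)

# Adjusted R-squared as stored in the summary object
adj_r2 <- model_summary$adj.r.squared

# F-statistic is a named vector: value, numdf, dendf
f_stat <- model_summary$fstatistic

# Overall p-value is the upper tail of the F distribution
overall_p <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                lower.tail = FALSE)

adj_r2
overall_p
```

The summary object does not store the overall p-value directly, which is why it is recomputed with pf() here.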

2. Interpreting the Coefficients:

Intercept (38.75):

If a theoretical car had zero weight, zero horsepower, and zero cylinders, the model would predict 38.75 miles per gallon. In practice no such car exists, so the intercept is only the baseline from which the other terms are subtracted.

Weight (wt):

The coefficient for weight is roughly -3.17. This means that for every additional 1,000 lbs of weight, predicted fuel efficiency decreases by about 3.17 mpg, assuming horsepower and cylinders remain constant. With a p-value of 0.000199, well below 0.05, this variable is highly statistically significant.

Number of Cylinders (cyl):

The coefficient is -0.94. For every additional cylinder, predicted fuel efficiency falls by about 0.94 mpg, holding weight and horsepower constant. Its p-value of 0.098 is above 0.05, so the effect is significant only at the 10% level.

Horsepower (hp):

The coefficient is -0.018. For every 1 unit increase in horsepower, predicted fuel efficiency drops by a tiny fraction (0.018 mpg), holding weight and cylinders constant. However, its p-value of 0.140 is above 0.05, so horsepower is not statistically significant and is the weakest predictor of the three once weight and cylinders are already accounted for.
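Confidence intervals give another view of the same coefficients, since an interval that contains zero corresponds to a p-value above the matching significance level. The sketch below uses confint() at the 95% level; the level and the name ci are my own choices.

```r
# Refit the model from Section 2 so this block runs on its own
data(mtcars)
mlr_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# 95% confidence intervals for the intercept and each slope
ci <- confint(mlr_model, level = 0.95)
ci

# TRUE where the interval contains zero (not significant at the 5% level)
contains_zero <- ci[, 1] < 0 & ci[, 2] > 0
contains_zero
```

In line with the p-values in the summary output, the interval for wt should lie entirely below zero, while the intervals for hp and cyl should include zero.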

QUESTION TWO

The Goal of Variable Selection

The primary objective is to select the best subset of independent variables to include in the model. Doing this correctly improves model accuracy, reduces the risk of overfitting, speeds up computation, and makes the final mathematical interpretation much clearer. Variable selection techniques are broadly categorized into three main families:

1. Filter Methods

Filter methods evaluate the relevance of features using statistical tests, entirely independent of the predictive model you plan to use later.
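As one possible illustration of a filter method, the sketch below ranks the three predictors from Question One by the absolute value of their Pearson correlation with mpg, without fitting any model. Using correlation as the filter statistic is my own choice; other tests could serve the same purpose.

```r
# Filter step: score each predictor on its own, with no model involved
data(mtcars)
predictors <- c("wt", "hp", "cyl")

# Pearson correlation of each predictor with the response
cors <- sapply(predictors, function(v) cor(mtcars[[v]], mtcars$mpg))

# Rank predictors by the strength of the correlation
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```

A filter would then keep the top-ranked variables, or those above a chosen cutoff, before any model is trained.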

2. Wrapper Methods

Wrapper methods treat the variable selection process as a search problem. They evaluate different combinations of features and measure their performance using a specific machine learning model.
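One common wrapper-style approach is stepwise selection, where a model is refitted repeatedly with different variable subsets. The sketch below applies backward elimination with R's step() function, which compares candidate models by AIC. The starting model is the one from Question One, and choosing AIC as the criterion is an assumption on my part.

```r
# Wrapper step: search over subsets by refitting the model each time
data(mtcars)
full_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Backward elimination; trace = 0 suppresses the step-by-step printout
selected <- step(full_model, direction = "backward", trace = 0)

# Variables retained by the search
formula(selected)
```

Because every candidate subset requires a new model fit, this approach is slower than a filter but takes interactions between predictors into account.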

3. Embedded Methods

Embedded methods combine the qualities of both filter and wrapper methods. Algorithms with built-in embedded selection perform variable selection automatically while the model is being trained.
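Lasso regression is a well-known example of an embedded method, because its penalty can shrink some coefficients exactly to zero during fitting. The sketch below assumes the glmnet package is installed (it is not part of base R) and reuses the predictors from Question One; the seed and the choice of lambda.min are illustrative.

```r
# Embedded step: selection happens inside the model fitting itself
data(mtcars)
x <- as.matrix(mtcars[, c("wt", "hp", "cyl")])  # predictor matrix
y <- mtcars$mpg                                 # response

if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)  # cross-validation folds are random
  # alpha = 1 requests the lasso penalty
  cv_fit <- glmnet::cv.glmnet(x, y, alpha = 1)
  # Coefficients at the penalty with the lowest cross-validated error;
  # a coefficient shown as "." has been dropped from the model
  print(coef(cv_fit, s = "lambda.min"))
} else {
  message("glmnet is not installed; run install.packages(\"glmnet\") first")
}
```

With only three predictors the lasso may keep all of them, so this is best read as a template for larger variable sets.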