For this assignment, I have used the built-in “mtcars” (Motor Trend Car Road Tests) dataset to construct a multiple linear regression model. My objective is to predict a vehicle’s fuel efficiency, measured in miles per gallon (mpg), from three independent variables: weight in thousands of pounds (wt), gross horsepower (hp), and the number of cylinders (cyl).
The following R code loads the dataset, subsets the relevant variables, and fits the multiple linear regression model using the standard “lm()” function.
# Using the built-in dataset
data(mtcars)
# Preview the first rows of the variables used in the model
head(mtcars[, c("mpg", "wt", "hp", "cyl")])
##                    mpg    wt  hp cyl
## Mazda RX4         21.0 2.620 110   6
## Mazda RX4 Wag     21.0 2.875 110   6
## Datsun 710        22.8 2.320  93   4
## Hornet 4 Drive    21.4 3.215 110   6
## Hornet Sportabout 18.7 3.440 175   8
## Valiant           18.1 3.460 105   6
# Multiple Linear Regression Model
# Formula: mpg = b0 + b1*wt + b2*hp + b3*cyl
mlr_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
# Finally, display the model summary
summary(mlr_model)
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9290 -1.5598 -0.5311  1.1850  5.8986
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## hp          -0.01804    0.01188  -1.519 0.140015
## cyl         -0.94162    0.55092  -1.709 0.098480 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
mpg = 38.75 - 3.17(wt) - 0.018(hp) - 0.94(cyl)
The intercept is 38.75. If a hypothetical car had zero weight, zero horsepower, and zero cylinders, the model would predict 38.75 miles per gallon. No such car exists, so in practice the intercept is only the baseline from which the other terms are subtracted.
The coefficient for weight is roughly -3.17. This means that for every additional 1,000 lbs of weight, predicted fuel efficiency decreases by about 3.17 mpg, assuming horsepower and the number of cylinders remain constant. With a p-value of 0.000199, well below 0.05, this variable is highly statistically significant.
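The coefficient for cylinders is roughly -0.94. For every additional cylinder in the engine, the car is predicted to lose 0.94 mpg, holding weight and horsepower constant. Its p-value of 0.098 is above 0.05, so this effect is significant only at the 0.10 level.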
The coefficient for horsepower is roughly -0.018. For every 1-unit increase in horsepower, predicted fuel efficiency drops by a tiny fraction (0.018 mpg), holding weight and cylinders constant. However, its p-value (0.140) is above 0.05, meaning horsepower is the weakest predictor of the three when weight and cylinders are already accounted for.
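As a rough illustration of how the fitted equation is used, the sketch below predicts mpg for one car and computes 95% confidence intervals for the coefficients. The car's values (wt = 3.0, hp = 120, cyl = 6) are hypothetical and chosen only for illustration; they are not a row of mtcars.

```r
# Refit the model from the text so this block runs on its own
data(mtcars)
mlr_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Hypothetical car (illustrative values, not taken from the dataset)
new_car <- data.frame(wt = 3.0, hp = 120, cyl = 6)

# Prediction from the fitted model (uses full-precision coefficients)
pred_mpg <- unname(predict(mlr_model, newdata = new_car))

# The same calculation written out by hand from the coefficients
b <- coef(mlr_model)
manual_mpg <- unname(b["(Intercept)"] + b["wt"] * 3.0 + b["hp"] * 120 + b["cyl"] * 6)

print(pred_mpg)
print(manual_mpg)

# 95% confidence intervals: an interval that contains zero corresponds
# to a coefficient that is not significant at the 0.05 level
ci <- confint(mlr_model, level = 0.95)
print(ci)
```

Plugging the rounded coefficients into the equation by hand gives about 38.75 - 9.50 - 2.16 - 5.65 = 21.4 mpg, which should agree with predict() up to rounding.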
The primary objective of variable selection is to choose the best subset of independent variables to include in a model. Doing this correctly improves model accuracy, reduces the risk of overfitting, speeds up computation, and makes the final mathematical interpretation much clearer. Variable selection techniques are broadly categorized into three main families:
Filter methods evaluate the relevance of features using statistical tests, entirely independent of the predictive model you plan to use later.
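One simple filter, sketched here on the same three mtcars predictors, is to rank each candidate by the absolute value of its Pearson correlation with mpg before any model is fitted. The 0.5 cut-off is an arbitrary assumption made for this example, not a recommended value.

```r
data(mtcars)
predictors <- c("wt", "hp", "cyl")

# Pearson correlation of each candidate predictor with the response
cors <- sapply(predictors, function(v) cor(mtcars[[v]], mtcars$mpg))

# Rank by strength of association, ignoring the sign
ranked <- sort(abs(cors), decreasing = TRUE)
print(ranked)

# Hypothetical cut-off, chosen only for illustration
threshold <- 0.5
selected <- names(ranked)[ranked > threshold]
print(selected)
```

Because each predictor is scored on its own, a filter like this cannot see that correlated predictors (such as hp and cyl) carry overlapping information.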
Wrapper methods treat the variable selection process as a search problem. They evaluate different combinations of features and measure their performance using a specific machine learning model.
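Backward stepwise elimination is one common wrapper-style search. The sketch below uses the base R step() function, which refits the regression with each term removed in turn and drops a term only when that lowers the AIC. Using AIC as the performance measure is an assumption of this example; other criteria could be used.

```r
data(mtcars)

# Start from the full three-predictor model used in this assignment
full_model <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Backward elimination: remove one term at a time while AIC improves
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Predictors that survived the search
kept <- attr(terms(reduced_model), "term.labels")
print(kept)

# Compare the criterion before and after the search
print(c(full = AIC(full_model), reduced = AIC(reduced_model)))
```

Setting trace = 1 instead prints each candidate model that the search evaluates.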
Embedded methods combine the qualities of both filter and wrapper methods. Algorithms with built-in selection, such as LASSO regression, perform variable selection automatically while the model is being trained.
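LASSO regression is a widely used embedded method: its penalty can shrink some coefficients exactly to zero during fitting. The sketch below assumes the add-on glmnet package is installed (it is not part of base R) and skips the fit otherwise. The seed and the choice of the "lambda.1se" penalty are illustrative assumptions.

```r
data(mtcars)

# glmnet expects a numeric predictor matrix and a response vector
x <- as.matrix(mtcars[, c("wt", "hp", "cyl")])
y <- mtcars$mpg

if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)  # cross-validation folds are assigned at random

  # alpha = 1 selects the LASSO penalty; the penalty strength lambda
  # is chosen by 10-fold cross-validation (the default)
  cv_fit <- glmnet::cv.glmnet(x, y, alpha = 1)

  # Coefficients at the chosen penalty; a "." in the printout marks a
  # coefficient that was shrunk to exactly zero
  print(coef(cv_fit, s = "lambda.1se"))
} else {
  message("Package 'glmnet' is not installed; skipping the LASSO sketch.")
}
```

With only three predictors and 32 cars, the set of retained variables can change with the seed, so the result is best read as a demonstration of the mechanism.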