Linear Regression Models
Analyze Motor Car Data Trends
You are a data scientist at a top dealership group in the USA. Your boss, Mr. Buffet, has asked you to analyze the Motor Trend car data. You are given a dataset containing fuel consumption and 10 aspects of automobile design and performance for 32 automobiles, stored in the file mtcars.xlsx.
Data Source: the mtcars dataset from the datasets package in R.
Data Dictionary
This dataset contains the following columns:
Variable | Data Type | Description | Constraints/Rules |
---|---|---|---|
mpg | Numeric | Miles per gallon | Positive values only (mpg > 0) |
cyl | Integer | Number of cylinders | Categorical: {4, 6, 8} |
disp | Numeric | Displacement (cubic inches) | Positive values only (disp > 0) |
hp | Integer | Gross horsepower | Positive values only (hp > 0) |
drat | Numeric | Rear axle ratio | Positive values only (drat > 0) |
wt | Numeric | Weight (1000 lbs) | Positive values only (wt > 0) |
qsec | Numeric | 1/4 mile time (seconds) | Positive values only (qsec > 0) |
vs | Integer | Engine type: 0 = V-shaped, 1 = straight | Binary: {0, 1} |
am | Integer | Transmission: 0 = automatic, 1 = manual | Binary: {0, 1} |
gear | Integer | Number of forward gears | Categorical: {3, 4, 5} |
carb | Integer | Number of carburetors | Positive integer values only |
Dataset Loading
Check the dimension of the data frame
## [1] 41 12
Display data frame column names
## [1] "name" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
## [11] "gear" "carb"
Display data frame structures
## tibble [41 × 12] (S3: tbl_df/tbl/data.frame)
## $ name: chr [1:41] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num [1:41] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num [1:41] 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num [1:41] 160 160 108 258 360 ...
## $ hp : num [1:41] 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num [1:41] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num [1:41] 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num [1:41] 16.5 17 18.6 19.4 17 ...
## $ vs : num [1:41] 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num [1:41] 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num [1:41] 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num [1:41] 4 4 1 1 2 1 4 2 2 4 ...
Display data frame statistical summary
## name mpg cyl disp
## Length:41 Min. :10.40 Min. :4.000 Min. : 71.1
## Class :character 1st Qu.:15.80 1st Qu.:4.000 1st Qu.:121.0
## Mode :character Median :19.70 Median :6.000 Median :167.6
## Mean :20.15 Mean :6.098 Mean :226.4
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:318.0
## Max. :33.90 Max. :8.000 Max. :472.0
## hp drat wt qsec
## Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
## 1st Qu.: 97.0 1st Qu.:3.080 1st Qu.:2.620 1st Qu.:16.90
## Median :110.0 Median :3.690 Median :3.215 Median :17.82
## Mean :141.8 Mean :3.579 Mean :3.181 Mean :17.91
## 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.570 3rd Qu.:18.90
## Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
## vs am gear carb
## Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4634 Mean :0.4146 Mean :3.659 Mean :2.707
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
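The loading and inspection steps above can be sketched as follows. This is a minimal sketch, not the report's original code: it assumes mtcars.xlsx sits in the working directory and that the readxl package is installed. If either is missing, it falls back to R's built-in 32-row mtcars data, so the dimensions and summaries may differ from the 41-row output shown above.

```r
# Assumption: "mtcars.xlsx" is in the working directory and readxl is installed.
path <- "mtcars.xlsx"

if (file.exists(path) && requireNamespace("readxl", quietly = TRUE)) {
  mtcars <- readxl::read_excel(path)
} else {
  # Fallback for illustration only: the built-in mtcars data, with row names
  # moved into a "name" column to mirror the 12-column layout.
  mtcars <- data.frame(name = rownames(datasets::mtcars),
                       datasets::mtcars,
                       row.names = NULL)
}

dim(mtcars)      # number of rows and columns
names(mtcars)    # column names
str(mtcars)      # column types and first few values
summary(mtcars)  # per-column summary statistics
```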
Question 1
Use the lm() function to perform a simple linear regression with the response mpg and the predictor hp.
Visualize the data pairwise
Build the simple linear model
Print out the model results
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6657 -1.2794 -0.7977 0.8162 8.7977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.297928 1.314115 22.295 < 2e-16 ***
## hp -0.064548 0.008429 -7.658 2.72e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.5 on 39 degrees of freedom
## Multiple R-squared: 0.6006, Adjusted R-squared: 0.5904
## F-statistic: 58.65 on 1 and 39 DF, p-value: 2.723e-09
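A fit like the one printed above can be reproduced along these lines. The model name simple_model matches the plotting code used later in this report. As an assumption for runnability, the sketch uses R's built-in 32-row mtcars data rather than the 41-row mtcars.xlsx, so the estimates will not match the output above exactly.

```r
# Assumption: built-in mtcars (32 rows) stands in for mtcars.xlsx (41 rows).
simple_model <- lm(mpg ~ hp, data = mtcars)

# Coefficients, standard errors, t-tests, R-squared, and the overall F-test
summary(simple_model)
```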
Question 2
Is there a relationship between the target mpg and predictor hp?
Yes. The t-test for hp gives a p-value of 2.72e-09, far below 0.05, so the relationship is statistically significant. Because this is a simple linear regression with a single predictor, the F-test yields the same conclusion:
- t-value = -7.658
- Squaring the t-value: (-7.658)^2 = 58.65, which equals the F-value
Thus, the p-values from both the t-test and F-test are identical in this case.
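The equivalence of the two tests can be checked directly from the fitted model. The sketch below assumes the built-in 32-row mtcars data as a stand-in for mtcars.xlsx, so the individual numbers differ from those quoted above, but the identity t^2 = F holds for any simple linear regression.

```r
# Assumption: built-in mtcars stands in for the mtcars.xlsx data.
simple_model <- lm(mpg ~ hp, data = mtcars)
model_summary <- summary(simple_model)

# t statistic and p-value for the hp coefficient
t_hp <- coef(model_summary)["hp", "t value"]
p_t  <- coef(model_summary)["hp", "Pr(>|t|)"]

# Overall F statistic and its p-value
f_stat <- unname(model_summary$fstatistic["value"])
p_f <- unname(pf(f_stat,
                 model_summary$fstatistic["numdf"],
                 model_summary$fstatistic["dendf"],
                 lower.tail = FALSE))

all.equal(t_hp^2, f_stat)  # squared t equals F
all.equal(p_t, p_f)        # both tests give the same p-value
```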
Question 3
How strong is the relationship between the response and predictor?
The p-value from the t-test is far below the 0.05 threshold, so the relationship is statistically significant. Its strength is better measured by R^2 = 0.6006: hp alone explains about 60% of the variation in mpg. The residual standard error of 3.5 mpg, set against a mean mpg of 20.15, corresponds to a typical error of roughly 17%.
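The strength measures discussed here can be pulled from the model summary. This sketch assumes the built-in 32-row mtcars data, so the values will differ from the report's. It also shows that, with a single predictor, R^2 is the squared correlation between mpg and hp.

```r
# Assumption: built-in mtcars stands in for the mtcars.xlsx data.
simple_model <- lm(mpg ~ hp, data = mtcars)
model_summary <- summary(simple_model)

r_squared <- model_summary$r.squared  # share of mpg variance explained by hp
rse <- model_summary$sigma            # residual standard error, in mpg

# With one predictor, R-squared is the squared Pearson correlation
r <- cor(mtcars$mpg, mtcars$hp)
all.equal(r^2, r_squared)

# Residual standard error relative to the mean response
rse / mean(mtcars$mpg)
```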
Question 4
Is the relationship between mpg and hp positive or negative?
Negative. The hp coefficient is -0.0645, so each additional unit of horsepower is associated with about 0.065 fewer miles per gallon.
Question 5
What is the predicted mpg associated with a horsepower (hp) of 100? What’s the 95% confidence interval for the predicted mpg?
Predict mpg for hp = 100 with 95% confidence interval
Print the predicted mpg
## fit lwr upr
## 1 22.84317 21.5279 24.15844
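A prediction with a confidence interval like the one above can be obtained with predict(). The sketch assumes the built-in 32-row mtcars data, so the fitted value and bounds will differ somewhat from the printed output.

```r
# Assumption: built-in mtcars stands in for the mtcars.xlsx data.
simple_model <- lm(mpg ~ hp, data = mtcars)

# New observation at hp = 100
new_car <- data.frame(hp = 100)

# interval = "confidence" gives the interval for the mean response;
# interval = "prediction" would give the wider interval for a single car.
pred <- predict(simple_model, newdata = new_car,
                interval = "confidence", level = 0.95)
pred
```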
Question 6
Plot the response and the predictor and add the regression line using abline().
Plot the response and predictor with the regression line
plot(mtcars$hp, mtcars$mpg, main = "MPG vs Horsepower",
xlab = "Horsepower (hp)", ylab = "Miles per Gallon (mpg)")
abline(simple_model, lwd = 2, col = "blue")
Question 7
Perform a multiple linear regression with mpg as the response and the predictors cyl, disp, hp, wt, vs, and gear. Print out the results using the summary() function.
Perform multiple linear regression
Print the summary of the model
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + wt + vs + gear, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5877 -1.8021 -0.3745 0.8538 6.3448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.29705 6.58333 5.058 1.45e-05 ***
## cyl -0.74909 0.74433 -1.006 0.32133
## disp 0.01733 0.01086 1.596 0.11981
## hp -0.03352 0.01381 -2.428 0.02061 *
## wt -3.85230 0.88681 -4.344 0.00012 ***
## vs 0.70816 1.39192 0.509 0.61420
## gear 1.14054 0.95373 1.196 0.24002
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.348 on 34 degrees of freedom
## Multiple R-squared: 0.8433, Adjusted R-squared: 0.8157
## F-statistic: 30.5 on 6 and 34 DF, p-value: 2.568e-12
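The multiple regression above can be fitted as in the following sketch. The name multi_model is a hypothetical choice, and the built-in 32-row mtcars data is assumed in place of mtcars.xlsx, so the estimates will not match the printed output exactly.

```r
# Assumption: built-in mtcars stands in for the mtcars.xlsx data.
# multi_model is an illustrative name, not taken from the report.
multi_model <- lm(mpg ~ cyl + disp + hp + wt + vs + gear, data = mtcars)

# Per-coefficient t-tests plus the overall F-test
summary(multi_model)
```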
Question 8
Is there a relationship between the predictors and the response?
Yes. The overall F-statistic of 30.5, with a p-value of 2.568e-12, rejects the null hypothesis that all six coefficients are zero, so at least one predictor is related to the response.
Question 9
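Which predictors appear to have a statistically significant relationship to the response?
The variables wt (weight) and hp (gross horsepower) are statistically significant at the 0.05 level, with t-test p-values of 0.00012 and 0.02061, respectively. The remaining predictors (cyl, disp, vs, and gear) all have p-values above 0.05.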
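Rather than reading p-values off the printed table, the significant predictors can be selected programmatically. This is a sketch under the same assumption as before (built-in 32-row mtcars data), so the set it returns may differ from the one reported above. The names multi_model and alpha are illustrative.

```r
# Assumption: built-in mtcars stands in for the mtcars.xlsx data.
multi_model <- lm(mpg ~ cyl + disp + hp + wt + vs + gear, data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(multi_model))

# p-values for the predictors only (first row is the intercept)
p_values <- coef_table[-1, "Pr(>|t|)"]

alpha <- 0.05
significant <- names(p_values)[p_values < alpha]
significant
```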
Question 10
Use * symbols to fit linear regression models with interaction effects between hp and wt. Does this interaction appear to be statistically significant?
Fit a model with interaction effects between hp and wt
Print the summary of the interaction model
##
## Call:
## lm(formula = mpg ~ hp * wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9170 -1.6420 -0.7411 1.4507 4.7557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.66145 3.29797 14.755 < 2e-16 ***
## hp -0.11778 0.02277 -5.174 8.24e-06 ***
## wt -7.76370 1.12887 -6.877 4.13e-08 ***
## hp:wt 0.02642 0.00673 3.926 0.000362 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.048 on 37 degrees of freedom
## Multiple R-squared: 0.8703, Adjusted R-squared: 0.8598
## F-statistic: 82.77 on 3 and 37 DF, p-value: < 2.2e-16
Yes. The hp:wt interaction term has a p-value of 0.000362, well below alpha = 0.05, so the interaction is statistically significant.
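The interaction model can be fitted as sketched below. In an R formula, hp * wt expands to hp + wt + hp:wt, so both main effects and the interaction are included. The name interaction_model is hypothetical, and the built-in 32-row mtcars data is assumed, so the estimates differ from the printed output.

```r
# Assumption: built-in mtcars stands in for the mtcars.xlsx data.
# hp * wt is shorthand for hp + wt + hp:wt.
interaction_model <- lm(mpg ~ hp * wt, data = mtcars)
summary(interaction_model)

# p-value of the interaction term, compared against alpha = 0.05
p_interaction <- coef(summary(interaction_model))["hp:wt", "Pr(>|t|)"]
p_interaction < 0.05
```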