Load the data

First, we load the dataset and inspect its structure.

# Load necessary library
library(readr)

# Load the dataset
data_path <- 'https://raw.githubusercontent.com/hawa1983/DATA605Discussion/main/autos_mpg.csv'
autos_data <- read_csv(data_path)
## Rows: 398 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): horsepower, car_name
## dbl (7): mpg, cylinders, displacement, weight, acceleration, model_year, origin
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows and the data types
head(autos_data)
## # A tibble: 6 × 9
##     mpg cylinders displacement horsepower weight acceleration model_year origin
##   <dbl>     <dbl>        <dbl> <chr>       <dbl>        <dbl>      <dbl>  <dbl>
## 1    18         8          307 130          3504         12           70      1
## 2    15         8          350 165          3693         11.5         70      1
## 3    18         8          318 150          3436         11           70      1
## 4    16         8          304 150          3433         12           70      1
## 5    17         8          302 140          3449         10.5         70      1
## 6    15         8          429 198          4341         10           70      1
## # ℹ 1 more variable: car_name <chr>
str(autos_data)
## spc_tbl_ [398 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ mpg         : num [1:398] 18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num [1:398] 8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num [1:398] 307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr [1:398] "130" "165" "150" "150" ...
##  $ weight      : num [1:398] 3504 3693 3436 3433 3449 ...
##  $ acceleration: num [1:398] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model_year  : num [1:398] 70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num [1:398] 1 1 1 1 1 1 1 1 1 1 ...
##  $ car_name    : chr [1:398] "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   mpg = col_double(),
##   ..   cylinders = col_double(),
##   ..   displacement = col_double(),
##   ..   horsepower = col_character(),
##   ..   weight = col_double(),
##   ..   acceleration = col_double(),
##   ..   model_year = col_double(),
##   ..   origin = col_double(),
##   ..   car_name = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Data imputation and conversion

  1. Since horsepower has only a few missing values, we will impute them with the column median.
  2. Then we will convert origin into a dichotomous variable, is_american (1 = American, 0 = non-American).
# Load the necessary libraries
library(readr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)  

# Convert 'horsepower' to numeric, handling non-numeric values by converting them to NA
autos_data$horsepower <- as.numeric(as.character(autos_data$horsepower))
## Warning: NAs introduced by coercion
# Impute missing 'horsepower' with the median value of the column
median_hp <- median(autos_data$horsepower, na.rm = TRUE)  # Calculate median ignoring NA
autos_data <- autos_data %>%
  mutate(horsepower = ifelse(is.na(horsepower), median_hp, horsepower))

# Check for remaining missing values in 'horsepower'
missing_hp <- sum(is.na(autos_data$horsepower))

# Display unique values for 'origin'
unique_origins <- unique(autos_data$origin)

# Convert 'origin' into a dichotomous variable: 1 (American) as 1, 2 or 3 (Non-American) as 0
autos_data$is_american <- as.integer(autos_data$origin == 1)

# Display unique values for the new 'is_american' variable
unique_dichotomous_origins <- unique(autos_data$is_american)

# Print the results
sprintf("Missing values after imputation: %d", missing_hp)
## [1] "Missing values after imputation: 0"
sprintf("Number of missing values after imputation: %d", missing_hp)
## [1] "Number of missing values after imputation: 0"
print("Unique values before converting to dichotomous variable: \n")
## [1] "Unique values before converting to dichotomous variable: \n"
unique_origins
## [1] 1 3 2
print("Unique values after converting to dichotomous variable: \n")
## [1] "Unique values after converting to dichotomous variable: \n"
unique_dichotomous_origins
## [1] 1 0

Create the multiple regression model

Next, we create the multiple regression model. The model will include:

  • A quadratic term for displacement (i.e., displacement^2).
  • The dichotomous term is_american.
  • An interaction term between is_american and weight.
# Load necessary libraries
library(ggplot2)

# Build the regression model
model <- lm(mpg ~ displacement + I(displacement^2) + is_american + weight + is_american:weight, data = autos_data)

model
## 
## Call:
## lm(formula = mpg ~ displacement + I(displacement^2) + is_american + 
##     weight + is_american:weight, data = autos_data)
## 
## Coefficients:
##        (Intercept)        displacement   I(displacement^2)         is_american  
##         50.5159047          -0.0816287           0.0001111          -5.5420466  
##             weight  is_american:weight  
##         -0.0060177           0.0022661

Interpretation of Coefficients

Here’s an interpretation of each coefficient:

  • Intercept (50.5159047): This is the estimated value of mpg when all other predictor variables are zero. However, because zero displacement and weight are not practical values, this interpretation is more theoretical than practical.
  • Displacement (-0.0816287): For each additional unit of engine displacement, the mpg is expected to decrease by approximately 0.0816 units, assuming all other variables remain constant. This reflects the typical expectation that larger engines are less fuel-efficient.
  • Displacement Squared (I(displacement^2) = 0.0001111): This is the quadratic term for displacement. The positive coefficient indicates a non-linear relationship between displacement and mpg: the negative effect of additional displacement weakens as engine size increases, so the fitted curve flattens out at large displacements rather than continuing to fall at a constant rate.
  • is_american (-5.5420466): Because the model also contains an is_american:weight interaction, this coefficient is the estimated mpg difference between American and non-American cars at a weight of zero, which is not directly meaningful on its own. The estimated gap at a given weight is -5.542 + 0.00227 × weight, so the main effect and the interaction have to be read together rather than as a fixed American-car penalty.
  • Weight (-0.0060177): For each additional unit of weight, mpg is expected to decrease by 0.0060 units. Heavier cars typically consume more fuel.
  • Interaction Term (is_american:weight = 0.0022661): This coefficient indicates that the relationship between car weight and mpg differs for American and non-American cars. For each additional unit of weight, mpg is estimated to fall by about 0.0023 units less for an American car than for a non-American car, giving an implied weight slope of roughly -0.0038 for American cars versus -0.0060 for non-American cars. In other words, the negative impact of weight on fuel efficiency is smaller for American cars.

To sum up, heavier cars and those with larger engine displacement are generally less fuel-efficient, but the decrease in fuel efficiency associated with weight is less pronounced in American cars. The relationship between displacement and mpg is non-linear, with the marginal impact of additional displacement diminishing as engine size grows; a short sketch below quantifies these implied slopes.
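
As a quick check on these interpretations, the following lines (reusing the model object fitted above, and not run here) compute the displacement value at which the fitted displacement slope flattens to zero and compare the implied weight slopes for American and non-American cars.

# Extract the fitted coefficients by name
b <- coef(model)

# Marginal effect of displacement on mpg: b_disp + 2 * b_disp2 * displacement.
# The negative slope shrinks as displacement grows and reaches zero here:
vertex_disp <- -b["displacement"] / (2 * b["I(displacement^2)"])

# Implied weight slopes: the interaction shifts the slope for American cars
slope_weight_non_us <- b["weight"]
slope_weight_us     <- b["weight"] + b["is_american:weight"]

sprintf("Displacement slope flattens to zero near %.0f units of displacement", vertex_disp)
sprintf("Weight slope: %.5f (non-American) vs %.5f (American) mpg per unit of weight",
        slope_weight_non_us, slope_weight_us)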

# Summary of the model
summary(model)
## 
## Call:
## lm(formula = mpg ~ displacement + I(displacement^2) + is_american + 
##     weight + is_american:weight, data = autos_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5649  -2.5379  -0.4122   1.7111  18.0943 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.052e+01  1.941e+00  26.025  < 2e-16 ***
## displacement       -8.163e-02  1.616e-02  -5.050 6.78e-07 ***
## I(displacement^2)   1.112e-04  2.723e-05   4.082 5.40e-05 ***
## is_american        -5.542e+00  2.545e+00  -2.178   0.0300 *  
## weight             -6.018e-03  9.516e-04  -6.324 6.97e-10 ***
## is_american:weight  2.266e-03  1.009e-03   2.247   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.148 on 392 degrees of freedom
## Multiple R-squared:  0.7218, Adjusted R-squared:  0.7183 
## F-statistic: 203.4 on 5 and 392 DF,  p-value: < 2.2e-16

Residuals vs. Fitted plot

The Residuals vs. Fitted plot suggests that the model does a decent job predicting mpg since the residuals (the differences between observed and predicted values) are mostly scattered randomly around the zero line. There’s a hint of increasing spread in residuals as the predicted mpg rises, indicating possible mild heteroscedasticity. A few outliers are flagged, which could influence the model’s accuracy and might need further investigation. Overall, the model seems reasonable but could potentially be improved with further analysis and adjustments.

plot(model, which = 1)
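
To follow up on this hint of heteroscedasticity more formally, one option is a Breusch-Pagan test; the sketch below is not run here and assumes the lmtest package is installed.

# Breusch-Pagan test for non-constant error variance
# (a small p-value would support the suspicion of heteroscedasticity)
library(lmtest)
bptest(model)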

Distribution of residuals

  • The distribution of residuals is right-skewed rather than symmetric: the bulk of the residuals sits slightly below zero, with a long tail of large positive residuals stretching to the right.
  • The peak of the histogram lies left of center, suggesting that for many observations the model predicts values somewhat higher than the actual mpg.
  • The few large positive residuals (potential outliers) on the right are cases where the predicted value is well below the actual value.
  • The histogram suggests that the normality assumption for the residuals may not be fully met; a formal check is sketched after the plot call below.
hist(model$residuals)
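
As noted above, a Shapiro-Wilk test on the residuals (sketched below, not run here) gives a formal check of normality; it is sensitive to sample size, so it should be read alongside the plots rather than in isolation.

# Shapiro-Wilk test on the residuals: a small p-value indicates
# departure from normality
shapiro.test(residuals(model))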

Q-Q (quantile-quantile) plot

  • In this plot, the points largely follow the line, but with some deviations. The lower end shows some points below the line (particularly point 112), and the upper end has points above the line (notably points 388 and 323), suggesting heavier tails than a normal distribution.
  • These deviations from the line at the ends suggest that the residuals have outliers and do not follow a perfect normal distribution. It indicates the presence of skewness and kurtosis in the residuals.
plot(model, which = 2)

Scale-Location plot

  • In this plot, while there is no dramatic fan shape or clear pattern, the trend line fluctuates slightly, suggesting some change in variance across the range of fitted values.
  • The plot also identifies potential outliers (388, 323, 112) with large standardized residuals, which could influence the variance.
  • Overall, this plot hints at possible mild heteroscedasticity, as the residuals’ spread seems to increase slightly with the fitted values.
plot(model, which = 3)

Cook’s Distance plot

  • Each vertical line represents an observation’s Cook’s distance in the dataset.
  • Most data points have a Cook’s distance close to zero, which means they do not have a significant influence on the model’s parameters.
  • However, there are several peaks indicating observations with higher influence. Notably, observation 388 has a Cook’s distance that stands out sharply from the rest, suggesting it has a substantial influence on the model. Observations 14 and 298 also have elevated Cook’s distances but are not as extreme; the sketch after the plot call lists these flagged rows.
plot(model, which = 4)
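
To see which cars these influential rows correspond to, a small sketch (using the common but somewhat arbitrary 4/n cutoff) could flag and list them; it reuses the autos_data tibble and model from above and is not run here.

# Flag observations whose Cook's distance exceeds the 4/n rule of thumb
cooks_d <- cooks.distance(model)
influential <- which(cooks_d > 4 / nrow(autos_data))
autos_data[influential, c("car_name", "mpg", "displacement", "weight", "origin")]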

Evaluation

Let us examine if this model is overall a good model.

  • Residual Standard Error: The typical size of the model’s prediction errors is about 4.148 mpg.
  • R-squared: 72.18% of mpg variation is explained by the model, indicating a strong relationship between predictors and mpg.
  • Adjusted R-squared: After adjusting for the number of predictors, 71.83% of mpg variation is explained, confirming a good model fit.
  • F-statistic and p-value: The model is statistically significant, meaning the predictors jointly have a strong relationship with mpg.
  • Linearity: Partially satisfied, but outliers may affect relationships.
  • Normality: Approximately met, but with concerns about the tails.
  • Homoscedasticity: Some concerns due to slight spread in residuals at higher fitted values.
  • Influence: There are influential points that need further investigation.

Overall, the model seems to be an adequate initial fit but has some concerns, notably the outliers and influential observations, which could skew results. If these issues are resolved, the model may still be useful for drawing insights, depending on the specific goals of the analysis.
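
As one possible follow-up, the sketch below refits the model without the observations highlighted in the Cook’s distance plot (388, 14, and 298) and compares the coefficients; roughly stable estimates would suggest the conclusions are not driven by those points. It is illustrative only and not run here.

# Sensitivity check: refit without the most influential observations
# (row numbers taken from the Cook's distance plot above)
to_drop <- c(388, 14, 298)
model_refit <- update(model, data = autos_data[-to_drop, ])
round(cbind(full = coef(model), refit = coef(model_refit)), 5)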