First, we load the data.
# Load necessary library
library(readr)
# Load the dataset
data_path <- 'https://raw.githubusercontent.com/hawa1983/DATA605Discussion/main/autos_mpg.csv'
autos_data <- read_csv(data_path)
## Rows: 398 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): horsepower, car_name
## dbl (7): mpg, cylinders, displacement, weight, acceleration, model_year, origin
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows and the data types
head(autos_data)
## # A tibble: 6 × 9
## mpg cylinders displacement horsepower weight acceleration model_year origin
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 18 8 307 130 3504 12 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11 70 1
## 4 16 8 304 150 3433 12 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10 70 1
## # ℹ 1 more variable: car_name <chr>
str(autos_data)
## spc_tbl_ [398 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ mpg : num [1:398] 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num [1:398] 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num [1:398] 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr [1:398] "130" "165" "150" "150" ...
## $ weight : num [1:398] 3504 3693 3436 3433 3449 ...
## $ acceleration: num [1:398] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model_year : num [1:398] 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num [1:398] 1 1 1 1 1 1 1 1 1 1 ...
## $ car_name : chr [1:398] "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "spec")=
## .. cols(
## .. mpg = col_double(),
## .. cylinders = col_double(),
## .. displacement = col_double(),
## .. horsepower = col_character(),
## .. weight = col_double(),
## .. acceleration = col_double(),
## .. model_year = col_double(),
## .. origin = col_double(),
## .. car_name = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
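The `str()` output shows `horsepower` parsed as character, which usually means at least one entry is not a number. Below is a minimal sketch of how one might find the offending entries before coercing. It uses a small made-up vector, and the `"?"` placeholder is an assumption for illustration; the actual file may mark missing values differently.

```r
# Toy stand-in for autos_data$horsepower; the "?" marker is assumed, not confirmed
hp_raw <- c("130", "165", "?", "150", "?", "140")

# Coerce quietly; entries that cannot be parsed become NA
hp_num <- suppressWarnings(as.numeric(hp_raw))

# Entries that were present as text but failed to parse
bad_entries <- hp_raw[is.na(hp_num) & !is.na(hp_raw)]

unique(bad_entries)   # distinct non-numeric markers
length(bad_entries)   # how many values will need imputing
```

Swapping `hp_raw` for `autos_data$horsepower` (before any conversion) would apply the same check to the real column.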
horsepower has a few missing values. We will impute these values by replacing them with the column median.
origin will be converted into a dichotomous variable, is_american.
# Load the necessary libraries
library(readr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
# Convert 'horsepower' to numeric, handling non-numeric values by converting them to NA
autos_data$horsepower <- as.numeric(as.character(autos_data$horsepower))
## Warning: NAs introduced by coercion
# Impute missing 'horsepower' with the median value of the column
median_hp <- median(autos_data$horsepower, na.rm = TRUE) # Calculate median ignoring NA
autos_data <- autos_data %>%
mutate(horsepower = ifelse(is.na(horsepower), median_hp, horsepower))
# Check for null values in 'horsepower'
missing_hp <- sum(is.na(autos_data$horsepower))
# Display unique values for 'origin'
unique_origins <- unique(autos_data$origin)
# Convert 'origin' into a dichotomous variable: 1 (American) as 1, 2 or 3 (Non-American) as 0
autos_data$is_american <- as.integer(autos_data$origin == 1)
# Display unique values for 'is_american'
unique_dichotomous_origins <- unique(autos_data$is_american)
# Print the results
sprintf("Number of missing values after imputation: %d", missing_hp)
## [1] "Number of missing values after imputation: 0"
print("Unique values before converting to dichotomous variable:")
## [1] "Unique values before converting to dichotomous variable:"
unique_origins
## [1] 1 3 2
print("Unique values after converting to dichotomous variable:")
## [1] "Unique values after converting to dichotomous variable:"
unique_dichotomous_origins
## [1] 1 0
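Next, create the multiple regression model. In addition to the main effects of displacement and weight, the model will include a quadratic term (displacement squared), a dichotomous term (is_american), and an interaction between the dichotomous variable and a quantitative variable (is_american:weight).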
# Load necessary libraries
library(ggplot2)
# Build the regression model
model <- lm(mpg ~ displacement + I(displacement^2) + is_american + weight + is_american:weight, data = autos_data)
model
##
## Call:
## lm(formula = mpg ~ displacement + I(displacement^2) + is_american +
## weight + is_american:weight, data = autos_data)
##
## Coefficients:
## (Intercept) displacement I(displacement^2) is_american
## 50.5159047 -0.0816287 0.0001111 -5.5420466
## weight is_american:weight
## -0.0060177 0.0022661
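Written out with the estimates above (rounded), the fitted equation looks as follows. The symbols are shorthand introduced here rather than taken from the output: $d$ is displacement, $w$ is weight, and $A$ is `is_american`.

```latex
\widehat{\text{mpg}} = 50.516 - 0.08163\,d + 0.0001111\,d^{2} - 5.542\,A - 0.006018\,w + 0.002266\,(A \times w)
```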
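Here’s an interpretation of each coefficient:
(Intercept) = 50.52: the predicted mpg for a non-American car with zero displacement and zero weight. No such car exists, so this value only anchors the fitted surface.
displacement = -0.0816 and I(displacement^2) = 0.000111: mpg falls as displacement rises, but the positive quadratic term means each additional unit of displacement costs less mpg than the one before.
is_american = -5.54: the predicted difference between American and non-American cars at a weight of zero. It has to be read together with the interaction term, because the gap changes with weight.
weight = -0.00602: for non-American cars, each additional unit of weight lowers predicted mpg by about 0.006, holding displacement fixed.
is_american:weight = 0.00227: for American cars the weight slope is -0.00602 + 0.00227 = -0.00375, so weight reduces mpg less steeply.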
To sum up, heavier cars and those with larger engine displacement are generally less fuel-efficient, but the decrease in fuel efficiency associated with weight is less pronounced in American cars. The relationship between displacement and mpg is non-linear: the negative linear term combined with the positive quadratic term means that each additional unit of displacement reduces mpg by less as the engine size grows.
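The arithmetic behind these statements can be checked directly from the printed estimates. The sketch below copies the coefficients from the model output by hand (the variable names are illustrative, not from the original analysis) and derives the weight slope for each origin group, the marginal effect of displacement, and the two points where the fitted relationships change character.

```r
# Coefficients copied from the printed model output
b <- c(displacement = -0.0816287, displacement_sq = 0.0001111,
       is_american = -5.5420466, weight = -0.0060177,
       interaction = 0.0022661)

# Change in mpg per extra unit of weight, by origin group
slope_weight_other    <- b[["weight"]]
slope_weight_american <- b[["weight"]] + b[["interaction"]]

# Marginal effect of displacement: d(mpg)/d(displacement) = b1 + 2 * b2 * displacement
disp_effect <- function(d) b[["displacement"]] + 2 * b[["displacement_sq"]] * d

# Displacement at which the fitted quadratic stops decreasing
disp_turning_point <- -b[["displacement"]] / (2 * b[["displacement_sq"]])

# Weight at which the American and non-American fitted lines cross
weight_crossover <- -b[["is_american"]] / b[["interaction"]]

round(c(slope_weight_other, slope_weight_american), 5)
disp_effect(c(100, 200, 300))
round(c(disp_turning_point, weight_crossover), 1)
```

The turning point comes out near a displacement of 367, which is inside the observed range (values such as 429 and 455 appear in the first rows). The upturn of the fitted curve beyond that point is therefore more likely a byproduct of the quadratic form than evidence that very large engines gain efficiency.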
# Summary of the model
summary(model)
##
## Call:
## lm(formula = mpg ~ displacement + I(displacement^2) + is_american +
## weight + is_american:weight, data = autos_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5649 -2.5379 -0.4122 1.7111 18.0943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.052e+01 1.941e+00 26.025 < 2e-16 ***
## displacement -8.163e-02 1.616e-02 -5.050 6.78e-07 ***
## I(displacement^2) 1.112e-04 2.723e-05 4.082 5.40e-05 ***
## is_american -5.542e+00 2.545e+00 -2.178 0.0300 *
## weight -6.018e-03 9.516e-04 -6.324 6.97e-10 ***
## is_american:weight 2.266e-03 1.009e-03 2.247 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.148 on 392 degrees of freedom
## Multiple R-squared: 0.7218, Adjusted R-squared: 0.7183
## F-statistic: 203.4 on 5 and 392 DF, p-value: < 2.2e-16
The Residuals vs. Fitted plot suggests that the model does a decent job predicting mpg since the residuals (the differences between observed and predicted values) are mostly scattered randomly around the zero line. There’s a hint of increasing spread in residuals as the predicted mpg rises, indicating possible mild heteroscedasticity. A few outliers are flagged, which could influence the model’s accuracy and might need further investigation. Overall, the model seems reasonable but could potentially be improved with further analysis and adjustments.
plot(model, which = 1)
hist(model$residuals)
plot(model, which = 2)
plot(model, which = 3)
plot(model, which = 4)
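The hint of increasing spread in the residuals can be checked more formally. One option is a studentized Breusch-Pagan test, written out by hand here in base R so the steps are visible. This is a sketch under assumptions: the built-in `mtcars` data stands in for `autos_data` so that the block runs on its own, and the formula is simplified.

```r
# mtcars is used only as a stand-in so the example is self-contained
fit <- lm(mpg ~ disp + wt, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- mtcars
aux_data$sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ disp + wt, data = aux_data)

n       <- nobs(fit)
k       <- length(coef(aux)) - 1          # number of predictors, excluding the intercept
bp_stat <- n * summary(aux)$r.squared     # LM statistic: n * R^2 of the auxiliary fit
p_value <- pchisq(bp_stat, df = k, lower.tail = FALSE)

c(statistic = bp_stat, df = k, p.value = p_value)
```

A small p-value would point toward non-constant variance. The `lmtest` package provides `bptest()` for the same purpose if an extra dependency is acceptable.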
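Let us examine whether this is a good model overall.
Overall, the model is an adequate initial fit: it explains about 72% of the variance in mpg (adjusted R-squared 0.7183), the F-statistic is highly significant (p-value < 2.2e-16), and every term is significant at the 5% level. The main concerns are the outliers and influential observations flagged in the diagnostic plots, which could skew the estimates, along with the mild heteroscedasticity. Once these issues are addressed, the model should be useful for drawing insights, depending on the specific goals of the analysis.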
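One way to follow up on the influential observations is to flag them by Cook's distance and see how much the coefficients move when they are left out. The sketch below again uses `mtcars` as a stand-in for `autos_data`. The 4/n cutoff is a common rule of thumb and is an assumption here, not something taken from the original analysis.

```r
# mtcars stands in for autos_data so the sketch is self-contained
fit <- lm(mpg ~ disp + wt, data = mtcars)

cooks_d   <- cooks.distance(fit)
threshold <- 4 / nobs(fit)                 # rule-of-thumb cutoff, an assumption here

# Row names of observations whose Cook's distance exceeds the cutoff
influential <- names(cooks_d)[cooks_d > threshold]
influential

# Refit without the flagged rows and compare coefficients side by side
fit_trimmed <- lm(mpg ~ disp + wt,
                  data = mtcars[!rownames(mtcars) %in% influential, ])
round(cbind(full = coef(fit), trimmed = coef(fit_trimmed)), 4)
```

If the two columns are close, the flagged points are not driving the conclusions. Large shifts would suggest inspecting those rows individually before deciding whether to keep them.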