Load the data and the packages:

library(ggplot2)

Exploratory Data Analysis (EDA)

Read the documentation for the data set

?mtcars

Examine data and variable types

# Preview the first few rows to get familiar with the data
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#Review the structure of the data
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
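str() reports every column as numeric, although cyl, vs, am, gear, and carb take only a few discrete values. A minimal sketch of recoding some of them as factors is shown here. The copy name mtcars_fct is hypothetical and is not used elsewhere in this analysis; the labels follow the coding given in ?mtcars.

```r
# Work on a copy so the original mtcars stays numeric for cor() and lm().
mtcars_fct <- mtcars  # hypothetical name, for illustration only

# Labels follow the coding documented in ?mtcars.
mtcars_fct$am  <- factor(mtcars_fct$am, levels = c(0, 1),
                         labels = c("automatic", "manual"))
mtcars_fct$vs  <- factor(mtcars_fct$vs, levels = c(0, 1),
                         labels = c("V-shaped", "straight"))
mtcars_fct$cyl <- factor(mtcars_fct$cyl)

# Count cars per transmission type
table(mtcars_fct$am)
```

Keeping the numeric original alongside the factor copy means the correlation matrix and regressions later in the report still run unchanged.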

Summary statistics

summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14

Visualization of the Data

ggplot(mtcars, aes(x = mpg, y = wt)) +
  geom_point() +
  labs(title = "MPG vs Vehicle Weight", x = "MPG", y = "WT (1000 lbs)")

Analysis

This visualization shows that the heavier the vehicle, the fewer miles it can travel per gallon.

ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  labs(title = "MPG vs Horsepower", x = "MPG", y = "Horsepower")

Analysis

The plot indicates that MPG increases as the horsepower of the vehicle drops.

ggplot(mtcars, aes(x = wt, y = hp)) +
  geom_point() +
  labs(title = "Horsepower required by Weight", x = "WT (1000 lbs)", y = "Horsepower")

Analysis

After comparing how horsepower and weight affect MPG, a third relationship emerges: heavier vehicles tend to have more horsepower. With the three plots together, one can start to determine the range of engine characteristics that gives the best mileage.
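The three pairwise relationships can also be viewed together. The following is only a sketch, assuming ggplot2 is installed; the object name p_combined is hypothetical. It maps weight to the x-axis, MPG to the y-axis, and horsepower to point color.

```r
library(ggplot2)

# Weight on x, MPG on y, horsepower mapped to point color,
# so all three variables appear in a single plot.
p_combined <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point(size = 3) +
  labs(title = "MPG vs Weight, Colored by Horsepower",
       x = "WT (1000 lbs)", y = "MPG", color = "Horsepower")
p_combined
```

A continuous color scale keeps horsepower readable without binning it, at the cost of making exact values harder to read off than on an axis.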

corr_matrix <- cor(mtcars)
print(corr_matrix)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
library(corrplot)
## corrplot 0.94 loaded
corrplot(corr_matrix, method="circle", type="upper", order="hclust",
         tl.col="black", tl.srt=45)

Analysis

The correlation plot above confirms that weight, cylinders, displacement, and horsepower have the strongest relationships with mpg, all of them negative (correlations between about -0.78 and -0.87).
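To read the ranking directly rather than from the plot, the mpg row of the correlation matrix can be sorted by absolute value. A small sketch; the names mpg_cor and mpg_cor_ranked are introduced here for illustration.

```r
corr_matrix <- cor(mtcars)

# Correlations of every other variable with mpg, dropping mpg itself
mpg_cor <- corr_matrix["mpg", colnames(corr_matrix) != "mpg"]

# Rank by absolute strength, strongest first
mpg_cor_ranked <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(mpg_cor_ranked, 3)
```

Sorting on the absolute value keeps strong negative and strong positive correlations on the same footing.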

Data Preprocessing

The Motor Trend Car Road Tests data set is well studied and already clean, so it needs little preprocessing, but a quick check is still worthwhile.

Check for missing values in each column

colSums(is.na(mtcars))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0

Outliers

boxplot(mtcars, las=2, cex.axis=0.6)

Outliers: points outside the whiskers mark outliers, values notably higher or lower than the typical range. On this shared axis, hp shows one clearly (its maximum of 335). disp has the widest spread, but its maximum of 472 still falls within the whiskers.
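Because all variables share one axis in the box plot, small-scale variables are hard to read. As a sketch, the same 1.5 * IQR rule that boxplot() uses by default can be applied per variable with boxplot.stats(); the names outlier_values and outlier_counts are introduced here for illustration.

```r
# boxplot.stats() applies the same default rule as boxplot():
# points more than 1.5 * IQR beyond the hinges are returned in $out.
outlier_values <- lapply(mtcars, function(x) boxplot.stats(x)$out)

# Number of flagged points per variable, showing only non-zero counts
outlier_counts <- lengths(outlier_values)
outlier_counts[outlier_counts > 0]
```

Inspecting outlier_values for a variable then shows which observations were flagged, for example outlier_values$hp.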

Skewness: disp and hp distributions are somewhat skewed, with larger spreads and asymmetry in their box plots, suggesting potential non-normality in these variables.
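Skewness can be quantified rather than judged by eye. The sketch below defines a simple moment-based skewness function (a hypothetical helper, not from any package) and applies it to every column; positive values indicate a longer right tail.

```r
# Moment-based sample skewness: the mean cubed deviation divided by the
# mean squared deviation raised to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_by_variable <- sapply(mtcars, skewness)
round(sort(skew_by_variable, decreasing = TRUE), 2)
```

Packaged implementations may use slightly different small-sample corrections, so their values can differ a little from this version.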

# Drop rows with mpg above a fixed threshold (80 here); no car in mtcars exceeds it, so all 32 rows are kept
mtcars_truncated <- mtcars[mtcars$mpg <= 80, ]
dim(mtcars_truncated)
## [1] 32 11
# Calculate the 1st and 99th percentiles for 'cyl'
lower_bound_cyl <- quantile(mtcars_truncated$cyl, 0.01, na.rm = TRUE)
upper_bound_cyl <- quantile(mtcars_truncated$cyl, 0.99, na.rm = TRUE)

# Cap (winsorize) cyl values that fall outside those percentiles
mtcars_truncated$cyl[mtcars_truncated$cyl < lower_bound_cyl] <- lower_bound_cyl
mtcars_truncated$cyl[mtcars_truncated$cyl > upper_bound_cyl] <- upper_bound_cyl
summary(mtcars_truncated[,c("mpg","cyl")])
##       mpg             cyl       
##  Min.   :10.40   Min.   :4.000  
##  1st Qu.:15.43   1st Qu.:4.000  
##  Median :19.20   Median :6.000  
##  Mean   :20.09   Mean   :6.188  
##  3rd Qu.:22.80   3rd Qu.:8.000  
##  Max.   :33.90   Max.   :8.000
summary(mtcars_truncated$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
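The percentile capping above can be wrapped in a reusable function. This is a sketch only: winsorize is a hypothetical helper, and the 5%/95% bounds applied to hp are chosen purely for illustration, not as a recommendation for this data set.

```r
# Hypothetical helper: cap values outside the [lower, upper] quantiles.
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, bounds[[1]]), bounds[[2]])
}

# Applied to hp with wider 5%/95% caps, purely as an illustration
hp_capped <- winsorize(mtcars$hp, lower = 0.05, upper = 0.95)
range(mtcars$hp)
range(hp_capped)
```

Capping keeps all 32 rows, which matters with a data set this small, whereas dropping rows would shrink it further.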

Simple regression

mtcars_lm <- lm(mpg ~ disp, data = mtcars)
summary(mtcars_lm)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

The coefficient for disp is -0.041215, which indicates that each one-unit increase in disp (displacement, in cubic inches) is associated with an expected decrease of about 0.041 mpg (miles per gallon). This negative relationship is highly significant, as indicated by the low p-value (9.38e-10), and the model explains about 72% of the variance in mpg (R-squared of 0.7183).
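To see what the fitted line implies in practice, the sketch below adds a confidence interval for the coefficients and a prediction for a hypothetical car. The value of 200 cubic inches and the names new_car and predicted_mpg are assumptions made for illustration.

```r
mtcars_lm <- lm(mpg ~ disp, data = mtcars)

# 95% confidence intervals for the intercept and slope
confint(mtcars_lm)

# Predicted mpg for a hypothetical car with 200 cu. in. displacement
new_car <- data.frame(disp = 200)
predicted_mpg <- predict(mtcars_lm, newdata = new_car)
predicted_mpg
```

The prediction is simply the intercept plus 200 times the slope, so it can be checked by hand against the coefficient table.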

Linear regression model

When using linear regression, the following assumptions are made:

1.  The relationship between the independent and dependent variables is linear.
2.  The residuals (errors) are independent of each other, with no autocorrelation.
3.  The residuals have constant variance (homoscedasticity).
4.  The residuals are normally distributed.
5.  In multiple regression, the independent variables are not highly correlated with each other (no strong multicollinearity).
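Several of these assumptions can be checked on the fitted simple model. The following is a sketch using base R only: a Shapiro-Wilk test for normality of the residuals, and two diagnostic plots. The variable name res is introduced here for illustration.

```r
mtcars_lm <- lm(mpg ~ disp, data = mtcars)
res <- residuals(mtcars_lm)

# Normality of residuals: Shapiro-Wilk test (null hypothesis: normal)
shapiro.test(res)

# Linearity and constant variance: residuals vs fitted values.
# A curved band or a funnel shape would point to a violation.
plot(fitted(mtcars_lm), res, xlab = "Fitted mpg", ylab = "Residual",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normality again, visually: points should follow the reference line
qqnorm(res)
qqline(res)
```

With only 32 observations, formal tests have limited power, so the plots are usually the more informative check.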

Evaluate the fitted model using its mean squared error (MSE), computed on the same data used for fitting:

lm_mse <- mean((mtcars_lm$fitted.values - mtcars$mpg)^2)
print(paste("Mean Squared Error for Linear Model:", round(lm_mse, 2)))
## [1] "Mean Squared Error for Linear Model: 9.91"
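An MSE computed on the training data tends to be optimistic. One way to estimate out-of-sample error without a separate test set is leave-one-out cross-validation, which for linear models has a closed form based on the hat values. The sketch below is one possible approach, not part of the original analysis; the names in_sample_mse and loocv_mse are introduced here.

```r
mtcars_lm <- lm(mpg ~ disp, data = mtcars)

# In-sample MSE: the mean of the squared residuals
in_sample_mse <- mean(residuals(mtcars_lm)^2)

# Leave-one-out CV error via the hat-value shortcut for linear models:
# the prediction error for a held-out row i equals residual_i / (1 - h_ii).
h <- hatvalues(mtcars_lm)
loocv_mse <- mean((residuals(mtcars_lm) / (1 - h))^2)

round(c(in_sample = in_sample_mse, loocv = loocv_mse), 2)
```

The shortcut avoids refitting the model 32 times and gives the same result as an explicit loop.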

Linear regression model with interaction terms

model <- lm(mpg ~ disp * hp + wt * qsec + cyl * am, data = mtcars)
summary(model)
## 
## Call:
## lm(formula = mpg ~ disp * hp + wt * qsec + cyl * am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5462 -1.4358 -0.6963  1.5130  3.2800 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.4289697 41.2342873   0.520    0.608
## disp        -0.0295298  0.0319580  -0.924    0.366
## hp          -0.0665118  0.0412461  -1.613    0.121
## wt          -2.2081431 13.4093655  -0.165    0.871
## qsec         0.9229689  2.1373401   0.432    0.670
## cyl          0.3772102  0.9917047   0.380    0.707
## am           4.9354769  4.3727586   1.129    0.271
## disp:hp      0.0001912  0.0001473   1.299    0.208
## wt:qsec     -0.0870827  0.7262635  -0.120    0.906
## cyl:am      -0.5917330  0.6892465  -0.859    0.400
## 
## Residual standard error: 2.345 on 22 degrees of freedom
## Multiple R-squared:  0.8925, Adjusted R-squared:  0.8486 
## F-statistic:  20.3 on 9 and 22 DF,  p-value: 1.108e-08

Analysis

Including interaction terms raises the multiple R-squared to 0.8925, so the model explains approximately 89.25% of the variance in mpg (adjusted R-squared of 0.8486), compared with 0.7183 for the simple model. However, no individual term, including the interactions, is statistically significant at the 0.05 level; the smallest p-value is 0.121 for hp. A high R-squared combined with non-significant coefficients suggests that the predictors overlap heavily in the information they carry, which is plausible given the strong correlations seen earlier and the small sample of 32 cars fitted with 9 terms.
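One way to probe this is to compute variance inflation factors (VIFs) for the model terms. The sketch below uses base R only, relying on the identity that the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The names X and vif_values are introduced here, and the 5 to 10 threshold is a common rule of thumb rather than a strict cutoff.

```r
model <- lm(mpg ~ disp * hp + wt * qsec + cyl * am, data = mtcars)

# Design matrix without the intercept column
X <- model.matrix(model)[, -1]

# Variance inflation factors: the diagonal of the inverse correlation
# matrix of the predictors. Values far above roughly 5 to 10 are a
# common rule-of-thumb sign of multicollinearity.
vif_values <- diag(solve(cor(X)))
round(sort(vif_values, decreasing = TRUE), 1)
```

Interaction terms are built from their main effects, so high VIFs are expected for them; centering the predictors before forming the products is a common way to reduce this.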