Data Exploration

Load the appropriate packages and the mtcars dataset, which data() loads under its own name:

library(ggplot2)
library(MASS)
data("mtcars")
attach(mtcars)
## The following object is masked from package:ggplot2:
## 
##     mpg
?mtcars
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
table(mtcars$cyl) #Number of cylinders
## 
##  4  6  8 
## 11  7 14
table(mtcars$vs) #Type of engine, V-Shaped or Straight
## 
##  0  1 
## 18 14
table(mtcars$am) #Type of Transmission, Automatic or Manual
## 
##  0  1 
## 19 13
table(mtcars$gear) #Number of Forward gears
## 
##  3  4  5 
## 15 12  5
table(mtcars$carb) #Number of carburetors 
## 
##  1  2  3  4  6  8 
##  7 10  3 10  1  1
There are 11 variables and 32 observations in the mtcars dataset.
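
As a quick sanity check of those counts, dim() and str() confirm the dimensions and show how each column is stored:

dim(mtcars)  # 32 rows (observations), 11 columns (variables)
str(mtcars)  # every column is numeric, even categorical-style ones like cyl, vs, am
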
#Treating the numeric cyl column as a factor, since it is really a categorical variable
ggplot(mtcars, aes(factor(cyl), mpg)) + 
  geom_point()

#Visually mapping wt against mpg, with disp mapped to point size
ggplot(mtcars, aes(wt, mpg, size = disp)) +
  geom_point()

One trend that arises between weight (wt) and miles per gallon (mpg) is that as a vehicle's weight increases, the miles it can travel per gallon decrease.

Build a correlation matrix to see how each variable correlates with mpg:

corr_matrix <- cor(mtcars)
print(corr_matrix)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

It would seem that wt has the strongest correlation with mpg (-0.87), even though the value is negative.

This is followed by cyl (-0.852) and then disp (-0.848).
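
To rank the predictors numerically, a minimal sketch that sorts the absolute correlations with mpg (the mpg_cor name is my own):

mpg_cor <- corr_matrix[, "mpg"]  # each variable's correlation with mpg
sort(abs(mpg_cor[names(mpg_cor) != "mpg"]), decreasing = TRUE)  # drop mpg itself, rank by magnitude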

library(corrplot)
## corrplot 0.92 loaded
corrplot(corr_matrix, method="circle", type="upper", order="hclust",
         tl.col="black", tl.srt=45)

A visual representation of the correlations, with a focus on mpg; when values are close together, though, the exact magnitudes are easier to discern from the correlation matrix itself.

Data Processing

Checking to see whether there are any missing values

colSums(is.na(mtcars))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0

A value of 0 for every column means there are no NA or missing values in the observations.

boxplot(mtcars, las=2, cex.axis=0.6)

It shows that only hp, wt, qsec, and carb have noticeable outliers, with disp and hp having a larger range between their lower quartile, median, and upper quartile.
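
To make "noticeable outliers" concrete, a quick sketch that counts, per column, the points a box plot flags under the standard 1.5 * IQR rule:

# boxplot.stats() returns the points beyond 1.5 * IQR in its $out component
sapply(mtcars, function(x) length(boxplot.stats(x)$out))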

hist(mtcars$disp)

hist(mtcars$hp)

hist(mtcars$wt)

Linear Regression using lm

A linear regression model predicting mpg based on all other variables:

mtcars_lm <- lm(mpg ~ . , data = mtcars)
summary(mtcars_lm)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Interpretations

Most coefficients are positive, meaning that as the predictor increases, the response variable increases. For example, as disp increases by 1 unit, the predicted mpg increases by 0.01334, holding the other variables constant.

But there are a few negatively signed coefficients, namely cyl, hp, wt, and carb, meaning that as the predictor increases, the response variable decreases. For example, as wt increases by 1 unit, the predicted mpg decreases by 3.71530.

We can also see that no p-value is significant at the 0.05 level; only wt is marginally significant (p = 0.0633) when all other variables are in the model, and its effect on mpg is negative.
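
To see this at a glance, the coefficient table can be pulled out of the summary and ordered by p-value, a small base-R sketch:

coefs <- coef(summary(mtcars_lm))    # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[order(coefs[, "Pr(>|t|)"]), ]  # wt should sort to the top with p = 0.0633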

Linear regression assumptions

The assumptions we make when we use linear regression:

  1. Independence of observations
  2. Normality of residuals (residuals are normally distributed with a mean of 0)
  3. Linearity (a straight-line relationship between predictors and response)
  4. Homoscedasticity (constant variance of the residuals)
  5. No multicollinearity (the independent variables are not highly correlated; see the VIF sketch below)
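
For assumption 5, variance inflation factors offer a quick numeric check; a sketch assuming the car package is installed (a VIF above roughly 5-10 flags a problematic predictor):

library(car)    # assumption: the car package is installed
vif(mtcars_lm)  # expect high values for cyl, disp, and wt given the correlations above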

Model assumption check

We can check whether the assumptions of linear regression are met using diagnostics plots:

# Diagnostic Plots
par(mfrow = c(2, 2))
plot(mtcars_lm)

The Residuals vs Fitted plot checks linearity: there is no distinctive pattern, so the linearity assumption appears to be met. The Q-Q Residuals plot checks normality: the values follow the dotted line, and though some points drift off it, normality looks reasonably satisfied. The Scale-Location plot checks homogeneity of variance: the spread is hard to judge because some observations cluster tightly while others do not, so this assumption may be violated to some degree. The Residuals vs Leverage plot highlights influential points, with the values spread apart.
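
These visual checks can be backed up with formal tests; a small sketch using base R's shapiro.test() for normality and, assuming the lmtest package is installed, the Breusch-Pagan test for homoscedasticity:

shapiro.test(residuals(mtcars_lm))  # H0: residuals are normally distributed

library(lmtest)    # assumption: the lmtest package is installed
bptest(mtcars_lm)  # Breusch-Pagan; H0: constant error variance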

Evaluation of the model using MSE (Mean Squared Error)

lm_mse <- mean((mtcars_lm$fitted.values - mtcars$mpg)^2)
print(paste("Mean Squared Error for Linear Model:", round(lm_mse, 2)))
## [1] "Mean Squared Error for Linear Model: 4.61"

With an MSE of 4.61, a relatively low value, the model's predicted values are close to the observed values, which bodes well for predicting future observations.
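
Since MSE is in squared units, taking its square root gives the RMSE on the original mpg scale, which is easier to interpret, roughly 2.1 mpg of typical prediction error here:

lm_rmse <- sqrt(lm_mse)  # root mean squared error, back in mpg units
print(paste("RMSE for Linear Model:", round(lm_rmse, 2)))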

Linear regression model with an interaction between cyl and disp

mtcars_std <- mtcars
# Note: this overwrites the mpg column with a min-max scaled copy of wt, so the
# full model below reproduces the response exactly (wt is one of the predictors),
# which is what triggers the "essentially perfect fit" warning.
mtcars_std$mpg <- (mtcars$wt - min(mtcars$wt)) / (max(mtcars$wt) - min(mtcars$wt))
mtcars_lm_std <- lm(mpg ~ ., data = mtcars_std)
summary(mtcars_lm_std)
## Warning in summary.lm(mtcars_lm_std): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars_std)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.112e-16 -2.390e-17 -8.700e-18  1.618e-17  4.119e-16 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -3.869e-01  7.807e-16 -4.955e+14   <2e-16 ***
## cyl         -5.516e-18  4.359e-17 -1.270e-01    0.900    
## disp         7.965e-19  7.448e-19  1.069e+00    0.297    
## hp          -8.821e-19  9.080e-19 -9.710e-01    0.342    
## drat        -2.947e-17  6.821e-17 -4.320e-01    0.670    
## wt           2.557e-01  7.902e-17  3.236e+15   <2e-16 ***
## qsec        -2.741e-18  3.048e-17 -9.000e-02    0.929    
## vs          -4.101e-17  8.778e-17 -4.670e-01    0.645    
## am           4.591e-17  8.578e-17  5.350e-01    0.598    
## gear        -4.851e-17  6.228e-17 -7.790e-01    0.445    
## carb         3.340e-17  3.457e-17  9.660e-01    0.345    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.105e-16 on 21 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.588e+31 on 10 and 21 DF,  p-value: < 2.2e-16
# cyl*disp expands to cyl + disp + cyl:disp, so the explicit main effects are
# redundant but harmless; note this model is also fit on mtcars_std, whose mpg
# column was overwritten above.
mtcars_lm_it1 <- lm(mpg ~ cyl + disp + cyl*disp, data = mtcars_std)
summary(mtcars_lm_it1)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + cyl * disp, data = mtcars_std)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.210244 -0.059353 -0.000182  0.079461  0.175009 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.3387703  0.2100789  -1.613  0.11805   
## cyl          0.0373506  0.0352685   1.059  0.29863   
## disp         0.0054955  0.0016792   3.273  0.00283 **
## cyl:disp    -0.0004470  0.0002077  -2.152  0.04017 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1117 on 28 degrees of freedom
## Multiple R-squared:  0.8201, Adjusted R-squared:  0.8008 
## F-statistic: 42.55 on 3 and 28 DF,  p-value: 1.471e-10

With one categorical-style variable (cyl) and one continuous variable (disp), cyl on its own still isn't a significant predictor, but disp and the cyl:disp interaction are significant predictors, with this model explaining around 82% of the variance. (Keep in mind it was fit on mtcars_std, so the response here is the rescaled column rather than the original mpg.)

While this interaction model explains only 82% of the variance versus the 87% of the original full regression model (all variables, no interactions), the full model had become complicated. The newer model has more significant coefficients and is easier to understand, leaving room for modification while still explaining a large portion of the variance with few variables.
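
For a like-for-like comparison, the interaction model would need to be refit on the original mtcars, since R^2 values computed against different response columns are not comparable. A minimal sketch refitting it there and comparing the two models by AIC (the mtcars_lm_it2 name is my own):

mtcars_lm_it2 <- lm(mpg ~ cyl * disp, data = mtcars)  # interaction model, original response
AIC(mtcars_lm, mtcars_lm_it2)        # lower AIC = better trade-off between fit and complexity
summary(mtcars_lm_it2)$r.squared     # variance explained on the same mpg response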

Outliers

boxplot(mtcars, las=2, cex.axis=0.6)

It can be seen via the box plot that hp, wt, qsec, and carb have noticeable outliers.

Winsorize hp at the 1st and 99th percentiles. In other words, we replace extreme values with less extreme values, which keeps all of the rows. With hp having the largest outlier compared to the other variables, we will work on this one.

#creating a separate data set, to preserve the original
mtcars2 <- mtcars
# Calculate the 1st and 99th percentiles for 'hp'
lower_bound_hp <- quantile(mtcars2$hp, 0.01, na.rm = TRUE)
upper_bound_hp <- quantile(mtcars2$hp, 0.99, na.rm = TRUE)

# Winsorize the data, replacing values outside the thresholds with the threshold.
mtcars2$hp[mtcars2$hp < lower_bound_hp] <- lower_bound_hp
mtcars2$hp[mtcars2$hp > upper_bound_hp] <- upper_bound_hp

Double-check the result:

summary(mtcars[, "hp"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    52.0    96.5   123.0   146.7   180.0   335.0
summary(mtcars2[, "hp"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    55.1    96.5   123.0   146.1   180.0   313.0

The hp variable has been transformed via winsorizing: in the original mtcars data set the minimum value was 52 and the maximum 335, while in mtcars2 the extremes have been replaced (minimum 55.1, maximum 313) and the mean is slightly different. Replacing the outliers should let the variable be modeled more reliably.
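
The same clipping logic can be wrapped in a small reusable helper for winsorizing other variables; a minimal sketch (the winsorize name and its defaults are my own):

# Hypothetical helper: clamp a numeric vector to given lower/upper percentiles
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, bounds[1]), bounds[2])  # values beyond a threshold become the threshold
}
# Equivalent to the manual steps above: mtcars2$hp <- winsorize(mtcars$hp)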

Now a look at the original regression model with all variables, to be compared against the same model fit on the new mtcars2 data set:
mtcars_lm <- lm(mpg ~ . , data = mtcars)
summary(mtcars_lm)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
mtcars2_lm <- lm(mpg ~ . , data = mtcars2)
summary(mtcars2_lm)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4201 -1.5602 -0.1017  1.1611  4.6188 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.72098   18.68259   0.681   0.5034  
## cyl         -0.09268    1.04285  -0.089   0.9300  
## disp         0.01389    0.01778   0.782   0.4432  
## hp          -0.02414    0.02288  -1.055   0.3034  
## drat         0.86325    1.62519   0.531   0.6009  
## wt          -3.68497    1.87561  -1.965   0.0628 .
## qsec         0.78554    0.73222   1.073   0.2955  
## vs           0.32962    2.09148   0.158   0.8763  
## am           2.46214    2.04818   1.202   0.2427  
## gear         0.65888    1.48812   0.443   0.6625  
## carb        -0.20609    0.80918  -0.255   0.8014  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.642 on 21 degrees of freedom
## Multiple R-squared:  0.8698, Adjusted R-squared:  0.8079 
## F-statistic: 14.03 on 10 and 21 DF,  p-value: 3.562e-07

With a focus on R^2, the original model's value is 86.90% with an adjusted R^2 of 80.66%, while the mtcars2 model reaches 86.98% with an adjusted R^2 of 80.79%. Though the increase is minuscule, there was an increase, and the coefficients of the new model have also shifted, some up and some down, meaning the predictors' estimated effects have changed.

One reason the model predicts only this well is that it contains variables that predict poorly. If we had a more precise question or problem to assess, we could use the new mtcars2 set, whose transformed data has slightly more predictive power. Still, the increase in R^2 provides only so much value: the model has become too complicated with so many variables, so the slight increase in R^2 did not really improve its predictive usefulness.
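
Since MASS is already loaded, one way to prune the poorly predicting variables would be stepwise selection with stepAIC(); a sketch rather than a definitive model choice:

step_lm <- stepAIC(mtcars2_lm, direction = "both", trace = FALSE)  # add/drop terms by AIC
summary(step_lm)  # for mtcars the classic result keeps wt, qsec, and am; mtcars2 is likely similar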