Problem Description

Study summary:

Main actions and model building steps include:

Final Conclusion

As with all my personal studies, the final conclusion comes first in order to save you from the chaos below! :) These are my exercises and my experiences. Sharing is caring!

Interpretation of the resulting coefficients

All interpretations are made on the latest model, lm2.

Download Dataset

Data extracted from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html).

fileUrl <- "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
auto <- read.table(fileUrl)
names(auto) <- c("mpg", "cylinders", "displacement", "horsepower", "weight"
                 , "acceleration", "model_year", "origin", "car_name")

Load Packages and Set Working Directory

Package startup messages are omitted from the output to keep it readable. For more details about the packages used, one may check the raw code.
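
For reference, a sketch of the packages this analysis appears to rely on, inferred from the functions called below (the exact list in the original code may differ):

library(dplyr)     # data manipulation and the %>% pipe
library(naniar)    # miss_var_summary()
library(visdat)    # vis_miss()
library(DMwR)      # lofactor() for LOF outlier scores
library(psych)     # pairs.panels() correlation matrix
library(caret)     # trainControl(), train(), createDataPartition(), R2()
library(lmtest)    # bptest() Breusch-Pagan test
library(broom)     # glance() model fit metrics
library(knitr)     # kable() tables
library(car)       # vif() multicollinearity check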

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set.

Citation

This step includes:

Dataset Dimensions

dim(auto)
## [1] 398   9
  • There are 398 rows and 9 variables in the data set.

Column Names and Types

str(auto)
## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr  "130.0" "165.0" "150.0" "150.0" ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car_name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
  • Four columns are numeric, three are integer, and two are character.
length(unique(auto$car_name))
## [1] 305
  • There are 305 distinct values for car_name. We will remove this column before the regression phase.

  • Also, the horsepower column should be numeric. We will convert its type.

auto$car_name <- NULL  # remove the car_name column from the data frame
auto$horsepower <- as.numeric(auto$horsepower)  # non-numeric entries become NA (handled below)
# we will remove the model_year variable as well
auto$model_year <- NULL

Missing Values

miss_var_summary(auto)
## # A tibble: 7 x 3
##   variable     n_miss pct_miss
##   <chr>         <int>    <dbl>
## 1 horsepower        6     1.51
## 2 mpg               0     0   
## 3 cylinders         0     0   
## 4 displacement      0     0   
## 5 weight            0     0   
## 6 acceleration      0     0   
## 7 origin            0     0
vis_miss(auto)

  • There are some missing values in the “horsepower” variable (1.51%).
  • We will use mean imputation to replace the missing values.
# Replace missing values in the horsepower column with the column mean.
auto$horsepower[is.na(auto$horsepower)] <- mean(auto$horsepower, na.rm=TRUE)
vis_miss(auto)

  • No more missing values in the data set. All clear!

Outlier Control

par(mfrow = c(2, 4))
for (i in names(auto)) {
  boxplot(auto[, i], main = i)  # one boxplot per variable, titled with the column name
}

outlier.scores <- lofactor(auto, k=5)  # LOF (local outlier factor) scores, k = 5 neighbours
plot(density(outlier.scores))          # density of the LOF scores

# pick top 5 as outliers
outliers <- order(outlier.scores, decreasing=T)[1:5]
# highlight the outliers in a scatterplot matrix

n <- nrow(auto)
pch <- rep(".", n)
pch[outliers] <- "+"
col <- rep("black", n)
col[outliers] <- "red"
pairs(auto, pch=pch, col=col)

  • None of the variables have severe outliers.
  • The red points (plotted as +) mark the top five LOF outliers.
  • Each pairwise comparison has some outliers, but they stay within an acceptable range.

Correlation Matrix

pairs.panels(auto, method = "pearson", hist.col = "#00AFBB", density = TRUE, ellipses = TRUE)

  • mpg will be our dependent variable.
  • There are some high correlations.
  • displacement and weight have a correlation of 0.93, a very strong positive relationship, among others.
  • We will use stepwise regression before fitting the final model.

Building Model

Stepwise Regression

  • Select the optimal model by finding the best combination of variables.
  • The best combination found is: horsepower + weight + origin (see the output below).
# Cross-validation setup: CV is a resampling procedure for evaluating the model.
# The data set will be split into 10 folds.
train.control <- trainControl(method = "cv", number = 10)

# Stepwise regression with a K-fold CV.
step.model <- train(mpg ~., data = auto,
                    method = "lmStepAIC", 
                    trControl = train.control,
                    trace = FALSE
                    )
# Model accuracy
step.model$results
##   parameter     RMSE  Rsquared      MAE   RMSESD RsquaredSD     MAESD
## 1      none 4.172355 0.7248044 3.204317 0.466853 0.04169929 0.3070223
# Final model coefficients
step.model$finalModel
## 
## Call:
## lm(formula = .outcome ~ horsepower + weight + origin, data = dat)
## 
## Coefficients:
## (Intercept)   horsepower       weight       origin  
##   41.480044    -0.048941    -0.005038     1.342764
# Summary of the model
summary(step.model$finalModel)
## 
## Call:
## lm(formula = .outcome ~ horsepower + weight + origin, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4032  -2.8540  -0.3403   2.2684  14.9027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.4800443  1.3047868  31.791  < 2e-16 ***
## horsepower  -0.0489405  0.0108507  -4.510 8.55e-06 ***
## weight      -0.0050379  0.0005359  -9.400  < 2e-16 ***
## origin       1.3427638  0.3233991   4.152 4.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.18 on 394 degrees of freedom
## Multiple R-squared:  0.7162, Adjusted R-squared:  0.714 
## F-statistic: 331.4 on 3 and 394 DF,  p-value: < 2.2e-16
  • For the sake of the study, we will add an interaction term.
  • We assume displacement and horsepower have a combined effect on mpg.
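
A quick illustration (on a hypothetical toy data frame, not the auto data) that the a*b shorthand in an R formula expands to the two main effects plus their interaction:

# a*b is shorthand for a + b + a:b; both calls build identical design matrices.
toy <- data.frame(a = 1:3, b = 4:6)
all.equal(model.matrix(~ a * b, toy), model.matrix(~ a + b + a:b, toy))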

Subset Relevant Columns

auto2 <- auto[,c("mpg","horsepower","weight","displacement","origin")]

Split the Data into Train and Test Sets

set.seed(123)
training.samples <- auto2$mpg %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- auto2[training.samples, ]
test.data <- auto2[-training.samples, ]
model_1 <- lm(mpg ~ horsepower*displacement + weight + origin, data = train.data)
summary(model_1)
## 
## Call:
## lm(formula = mpg ~ horsepower * displacement + weight + origin, 
##     data = train.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.962  -2.563  -0.238   1.828  15.888 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.150e+01  2.037e+00  25.281  < 2e-16 ***
## horsepower              -1.791e-01  2.375e-02  -7.539 5.08e-13 ***
## displacement            -5.412e-02  1.161e-02  -4.662 4.64e-06 ***
## weight                  -3.420e-03  8.105e-04  -4.219 3.21e-05 ***
## origin                   9.980e-01  3.711e-01   2.689  0.00754 ** 
## horsepower:displacement  4.189e-04  6.289e-05   6.661 1.21e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.921 on 315 degrees of freedom
## Multiple R-squared:  0.754,  Adjusted R-squared:  0.7501 
## F-statistic: 193.1 on 5 and 315 DF,  p-value: < 2.2e-16
# Make predictions
predictions <- model_1 %>% predict(test.data)
# Model performance
# (a) R-squared between the predictions and the observed mpg
R2(predictions, test.data$mpg)
## [1] 0.757189
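
For completeness, the test-set prediction error can be computed as well; a short sketch using caret's RMSE() and MAE() helpers on the same predictions object:

# Test-set prediction error for model_1.
RMSE(predictions, test.data$mpg)
MAE(predictions, test.data$mpg)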

Model_1 Performance Metrics

glance(model_1) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value) %>% kable()
adj.r.squared      sigma        AIC        BIC    p.value
    0.7500539   3.921153   1796.121   1822.521          0
  • The R-squared values on the training and test data are very similar, ~0.75.

Assumption Controls

Linearity of the data

plot(model_1, 1)

  • There is no clear pattern in the residuals vs. fitted plot, so the linearity assumption is reasonable: the relationship between the predictors and the outcome variable is approximately linear.

Homogeneity of Variance

plot(model_1, 3)

# Breusch-Pagan test
bptest(model_1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_1
## BP = 34.087, df = 5, p-value = 2.288e-06
  • The p-value of the Breusch-Pagan test is < 0.05, so we reject the null hypothesis of constant error variance.
  • We must address this heteroskedasticity.
  • We do so by taking the log of the dependent variable.
model_2 <- lm(log(mpg) ~ horsepower*displacement + weight + origin, data = train.data)
summary(model_2)
## 
## Call:
## lm(formula = log(mpg) ~ horsepower * displacement + weight + 
##     origin, data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49245 -0.10085 -0.00438  0.09320  0.55724 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.246e+00  7.778e-02  54.588  < 2e-16 ***
## horsepower              -5.943e-03  9.069e-04  -6.554 2.29e-10 ***
## displacement            -1.482e-03  4.433e-04  -3.344 0.000925 ***
## weight                  -1.827e-04  3.095e-05  -5.904 9.20e-09 ***
## origin                   3.127e-02  1.417e-02   2.207 0.028043 *  
## horsepower:displacement  1.113e-05  2.401e-06   4.635 5.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1497 on 315 degrees of freedom
## Multiple R-squared:  0.8088, Adjusted R-squared:  0.8058 
## F-statistic: 266.6 on 5 and 315 DF,  p-value: < 2.2e-16
  • Re-run the Breusch-Pagan test for model_2.
# Breusch-Pagan test
bptest(model_2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_2
## BP = 13.3, df = 5, p-value = 0.02072
plot(model_2, 3)

  • Not perfect, but both the plot and the test look better (the BP p-value rose from 2.288e-06 to 0.02072).

Model_2 Performance Metrics

glance(model_2) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value) %>% kable()
adj.r.squared      sigma        AIC        BIC    p.value
     0.805797  0.1497138  -300.2754  -273.8753          0

Normality of the Residuals

plot(model_2, 2)

  • All the points fall approximately along this reference line, so we can assume normality.
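
As an optional complement to the Q-Q plot, one can also run a Shapiro-Wilk test on the residuals for a numeric check (base R's shapiro.test(); a sketch, not part of the original output):

shapiro.test(residuals(model_2))  # H0: residuals are normally distributed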

Outliers and High leverage Points

  • A data point with extreme predictor (x) values has high leverage.
plot(model_2, 5)
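
Leverage can also be inspected numerically through the hat values; a common rule of thumb flags points above two to three times the average leverage. A small sketch:

h <- hatvalues(model_2)
head(sort(h, decreasing = TRUE))  # largest hat values (leverage)
which(h > 3 * mean(h))            # points above 3x the average leverage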

  • We will set the weights of these points to zero so that they have minimal effect on the model.
# w is TRUE (keep) when |studentized residual| < 3 and Cook's distance < 4/n.
w <- abs(rstudent(model_2)) < 3 & abs(cooks.distance(model_2)) < 4/nrow(model_2$model)
# Refit: kept points get weight 1, flagged influential points weight 0.
lm2 <- update(model_2, weights=as.numeric(w))
plot(lm2, 5)

  • We down-weighted the influential points: those with an absolute studentized residual of 3 or more, or a Cook's distance above the 4/n cutoff.
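
A quick count of how many observations were effectively excluded (weight 0) versus kept (weight 1):

table(w)  # FALSE = weight 0 (excluded), TRUE = weight 1 (kept)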
summary(lm2)
## 
## Call:
## lm(formula = log(mpg) ~ horsepower * displacement + weight + 
##     origin, data = train.data, weights = as.numeric(w))
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27339 -0.08549  0.00000  0.08133  0.31665 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.329e+00  7.018e-02  61.686  < 2e-16 ***
## horsepower              -7.096e-03  8.406e-04  -8.442 1.43e-15 ***
## displacement            -2.457e-03  4.069e-04  -6.038 4.68e-09 ***
## weight                  -1.415e-04  2.839e-05  -4.983 1.07e-06 ***
## origin                   2.204e-02  1.217e-02   1.811   0.0711 .  
## horsepower:displacement  1.599e-05  2.163e-06   7.389 1.53e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1247 on 295 degrees of freedom
## Multiple R-squared:  0.857,  Adjusted R-squared:  0.8546 
## F-statistic: 353.7 on 5 and 295 DF,  p-value: < 2.2e-16

lm2 Performance Metrics

glance(lm2) %>%
  dplyr::select(adj.r.squared, sigma, AIC, BIC, p.value) %>% kable()
adj.r.squared      sigma        AIC        BIC    p.value
    0.8546076  0.1247051  -391.1054  -365.1556          0
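
For comparison with model_1, the hold-out performance of lm2 can be checked the same way. Note the predictions are on the log scale and need exp() to return to mpg units; a sketch using the test.data split from above:

pred_log <- predict(lm2, newdata = test.data)  # predictions on the log scale
pred_mpg <- exp(pred_log)                      # back-transform to mpg units
R2(pred_mpg, test.data$mpg)
RMSE(pred_mpg, test.data$mpg)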

Multicollinearity

car::vif(lm2)
##              horsepower            displacement                  weight 
##               18.757602               33.953057               11.085264 
##                  origin horsepower:displacement 
##                1.778838               41.429320
  • We already saw that we have highly correlated variables.
  • Multicollinearity is a serious problem for this model (several VIFs are well above 10).
  • Let's use PCA to merge those columns with minimal information loss; a sketch of that step follows below.
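
A minimal sketch of that PCA step, using base R's prcomp() on the correlated predictors and refitting on the first component (the object names pc, train_pca and lm_pca are illustrative, not from the original study):

# Replace the correlated predictors with principal components.
pc <- prcomp(train.data[, c("horsepower", "displacement", "weight")],
             center = TRUE, scale. = TRUE)
summary(pc)                                # variance explained per component
train_pca <- cbind(train.data, PC1 = pc$x[, 1])
lm_pca <- lm(log(mpg) ~ PC1 + origin, data = train_pca)
car::vif(lm_pca)                           # VIFs should now be close to 1
summary(lm_pca)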