1 Overview

How many orders will my company receive from its customers next month? How many customers will churn when their contracts expire? How many catalogues do I need to mail to increase the probability that potential customers will buy? These and many related questions are the challenges faced by you and many other business analysts.

Multiple Linear Regression (MLR) is one of the three most common analytics modelling techniques used by practitioners today. In this session, I would like to share with you a collection of R packages specially designed for building better MLR models.

This is not a session for explaining the concepts and methods of MLR.

1.1 Setting the Scene

A large Toyota car dealership offers purchasers of new Toyota cars the option to sell their used car to the dealership as part of a trade-in. In particular, a new promotion promises to pay high prices for used Toyota Corolla cars traded in by purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars.

1.2 The data

The file provided for the analysis is called ToyotaCorolla.xls. The xls extension indicates that it is in Microsoft Excel (xls) format. The workbook consists of two worksheets, namely data and metadata. The data worksheet provides the actual data records, and the metadata worksheet describes the variables of the data records. The data set comprises 38 columns (i.e. variables) and 1436 rows (i.e. data records).

1.3 Getting Started

Before we get started, we need to install the necessary R packages and load them into the R environment.

The R packages needed for this exercise are as follows:

  • Attribute data handling
    • tidyverse, especially readxl and dplyr.
  • Exploratory Data Analysis
    • ggplot2, funModeling, corrplot, and ggpubr.
  • Regression Modelling
    • olsrr.

The code chunk below installs these R packages if they are missing and loads them into the R environment.

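packages <- c('olsrr', 'corrplot', 'ggpubr',
              'readxl', 'ggstatsplot',
              'funModeling', 'tidyverse')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  # Attach the package to the current session
  library(p, character.only = TRUE)
}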

2 Data Wrangling

2.1 Importing the data

The ToyotaCorolla workbook is in Microsoft Excel file format. The code chunk below uses the read_xls() function of the readxl package to import the data worksheet of the ToyotaCorolla workbook into R as a tibble data.frame called car_resale.

car_resale <- read_xls("data/ToyotaCorolla.xls",
                       sheet = "data")
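
Because the workbook holds both a data and a metadata worksheet, it can help to list the worksheets before importing. The chunk below is a minimal sketch, assuming the file sits at data/ToyotaCorolla.xls as in the chunk above. The object name car_meta is illustrative only and is not part of the original exercise.

```r
library(readxl)

path <- "data/ToyotaCorolla.xls"   # same path as in the import chunk

# Guard so the sketch does nothing when the file is not in place
if (file.exists(path)) {
  # List the worksheet names found in the workbook
  print(excel_sheets(path))

  # Read the variable descriptions (car_meta is an illustrative name)
  car_meta <- read_xls(path, sheet = "metadata")
  print(car_meta)
}
```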

2.2 Checking data structure

After importing the data file into R, it is important for us to examine if the data file has been imported correctly.

2.2.1 Checking data types using glimpse()

The code chunk below uses glimpse() of dplyr to display the data structure of car_resale.

glimpse(car_resale)
## Rows: 1,436
## Columns: 38
## $ Id               <dbl> 81, 1, 2, 3, 4, 5, 6, 7, 8, 44, 45, 46, 47, 49, 51...
## $ Model            <chr> "TOYOTA Corolla 1.6 5drs 1 4/5-Doors", "TOYOTA Cor...
## $ Price            <dbl> 18950, 13500, 13750, 13950, 14950, 13750, 12950, 1...
## $ Age_08_04        <dbl> 25, 23, 23, 24, 26, 30, 32, 27, 30, 27, 22, 23, 27...
## $ Mfg_Month        <dbl> 8, 10, 10, 9, 7, 3, 1, 6, 3, 6, 11, 10, 6, 11, 11,...
## $ Mfg_Year         <dbl> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 20...
## $ KM               <dbl> 20019, 46986, 72937, 41711, 48000, 38500, 61000, 9...
## $ Quarterly_Tax    <dbl> 100, 210, 210, 210, 210, 210, 210, 210, 210, 234, ...
## $ Weight           <dbl> 1180, 1165, 1165, 1165, 1165, 1170, 1170, 1245, 12...
## $ Guarantee_Period <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ HP_Bin           <chr> "100-120", "< 100", "< 100", "< 100", "< 100", "< ...
## $ CC_bin           <chr> "1600", ">1600", ">1600", ">1600", ">1600", ">1600...
## $ Doors            <dbl> 5, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 3,...
## $ Gears            <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ Cylinders        <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ Fuel_Type        <chr> "Petrol", "Diesel", "Diesel", "Diesel", "Diesel", ...
## $ Color            <chr> "Blue", "Blue", "Silver", "Blue", "Black", "Black"...
## $ Met_Color        <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,...
## $ Automatic        <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Mfr_Guarantee    <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,...
## $ BOVAG_Guarantee  <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ABS              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Airbag_1         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Airbag_2         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Airco            <dbl> 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Automatic_airco  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,...
## $ Boardcomputer    <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ CD_Player        <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,...
## $ Central_Lock     <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Powered_Windows  <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Power_Steering   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Radio            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Mistlamps        <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Sport_Model      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,...
## $ Backseat_Divider <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Metallic_Rim     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Radio_cassette   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Tow_Bar          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Important

  • In regression modelling using R (i.e. lm()), categorical variables should be converted to factor data type so that they are expanded into dummy variables. lm() coerces character columns to factors on its own, but categorical variables stored as numbers (e.g. Doors, Gears and the 0/1 indicator fields) are treated as continuous unless they are converted explicitly.
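
To make the dummy-variable expansion concrete, the sketch below builds the design matrix that lm() would use for a tiny data frame. The values in toy are made up for illustration and the column Doors_f is a hypothetical name; only the column names Price, Fuel_Type and Doors echo the data set.

```r
# Toy data (hypothetical values, not taken from ToyotaCorolla.xls)
toy <- data.frame(
  Price     = c(13500, 13750, 18950, 9900, 8450, 11950),
  Fuel_Type = c("Diesel", "Diesel", "Petrol", "Petrol", "CNG", "CNG"),
  Doors     = c(3, 3, 5, 5, 3, 5)
)

# Convert the categorical fields to factors
toy$Fuel_Type <- factor(toy$Fuel_Type)
toy$Doors_f   <- factor(toy$Doors)

# model.matrix() shows the dummy variables created from each factor;
# the first level of each factor (CNG, 3) becomes the baseline
X <- model.matrix(Price ~ Fuel_Type + Doors_f, data = toy)
colnames(X)
```

A factor with k levels yields k - 1 dummy columns, which is why Fuel_Type contributes two columns and Doors_f contributes one.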

2.2.2 Checking summary statistics using summary()

The code chunk below uses summary() of base R to display the summary statistics of the car_resale data.frame.

summary(car_resale)
##        Id            Model               Price         Age_08_04    
##  Min.   :   1.0   Length:1436        Min.   : 4350   Min.   : 1.00  
##  1st Qu.: 361.8   Class :character   1st Qu.: 8450   1st Qu.:44.00  
##  Median : 721.5   Mode  :character   Median : 9900   Median :61.00  
##  Mean   : 721.6                      Mean   :10731   Mean   :55.95  
##  3rd Qu.:1081.2                      3rd Qu.:11950   3rd Qu.:70.00  
##  Max.   :1442.0                      Max.   :32500   Max.   :80.00  
##    Mfg_Month         Mfg_Year          KM         Quarterly_Tax   
##  Min.   : 1.000   Min.   :1998   Min.   :     1   Min.   : 19.00  
##  1st Qu.: 3.000   1st Qu.:1998   1st Qu.: 43000   1st Qu.: 69.00  
##  Median : 5.000   Median :1999   Median : 63390   Median : 85.00  
##  Mean   : 5.549   Mean   :2000   Mean   : 68533   Mean   : 87.12  
##  3rd Qu.: 8.000   3rd Qu.:2001   3rd Qu.: 87021   3rd Qu.: 85.00  
##  Max.   :12.000   Max.   :2004   Max.   :243000   Max.   :283.00  
##      Weight     Guarantee_Period    HP_Bin             CC_bin         
##  Min.   :1000   Min.   : 3.000   Length:1436        Length:1436       
##  1st Qu.:1040   1st Qu.: 3.000   Class :character   Class :character  
##  Median :1070   Median : 3.000   Mode  :character   Mode  :character  
##  Mean   :1072   Mean   : 3.815                                        
##  3rd Qu.:1085   3rd Qu.: 3.000                                        
##  Max.   :1615   Max.   :36.000                                        
##      Doors           Gears         Cylinders  Fuel_Type        
##  Min.   :2.000   Min.   :3.000   Min.   :4   Length:1436       
##  1st Qu.:3.000   1st Qu.:5.000   1st Qu.:4   Class :character  
##  Median :4.000   Median :5.000   Median :4   Mode  :character  
##  Mean   :4.033   Mean   :5.026   Mean   :4                     
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:4                     
##  Max.   :5.000   Max.   :6.000   Max.   :4                     
##     Color             Met_Color        Automatic       Mfr_Guarantee   
##  Length:1436        Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Mode  :character   Median :1.0000   Median :0.00000   Median :0.0000  
##                     Mean   :0.6748   Mean   :0.05571   Mean   :0.4095  
##                     3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##  BOVAG_Guarantee       ABS            Airbag_1         Airbag_2     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.8955   Mean   :0.8134   Mean   :0.9708   Mean   :0.7228  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      Airco        Automatic_airco   Boardcomputer      CD_Player     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.5084   Mean   :0.05641   Mean   :0.2946   Mean   :0.2187  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##   Central_Lock    Powered_Windows Power_Steering       Radio       
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.000   Median :1.0000   Median :0.0000  
##  Mean   :0.5801   Mean   :0.562   Mean   :0.9777   Mean   :0.1462  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##    Mistlamps      Sport_Model     Backseat_Divider  Metallic_Rim   
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median :1.0000   Median :0.0000  
##  Mean   :0.257   Mean   :0.3001   Mean   :0.7702   Mean   :0.2047  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  Radio_cassette      Tow_Bar      
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.1455   Mean   :0.2779  
##  3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000

Quiz: What observations can you draw from the output above?

2.3 Revising data import procedures

The code chunk below performs the following:

  • creates a vector called cols,
  • converts all fields in cols to factor data type, and
  • converts the data type of the Id field from numeric to character.

cols <- c("HP_Bin", "CC_bin", "Doors", "Gears", "Cylinders",
          "Fuel_Type", "Color", "Met_Color", "Automatic",
          "Mfr_Guarantee", "BOVAG_Guarantee", "ABS", "Airbag_1",
          "Airbag_2", "Airco", "Automatic_airco", "Boardcomputer",
          "CD_Player", "Central_Lock", "Powered_Windows",
          "Power_Steering", "Radio", "Mistlamps", "Sport_Model",
          "Backseat_Divider", "Metallic_Rim", "Radio_cassette",
          "Tow_Bar")

car_resale <- read_xls("data/ToyotaCorolla.xls",
                       sheet = "data") %>%
  mutate(Id = as.character(Id)) %>%
  mutate(across(all_of(cols), factor))
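
For readers who prefer base R, the same kind of conversion can be sketched without dplyr. The data frame toy and the vector toy_cols below are hypothetical stand-ins with made-up values; they only mimic a few columns of car_resale.

```r
# Toy stand-in for car_resale (hypothetical values)
toy <- data.frame(
  Id        = c(1, 2, 3, 4),
  Doors     = c(3, 5, 3, 5),
  Fuel_Type = c("Petrol", "Diesel", "Petrol", "CNG"),
  Automatic = c(0, 0, 1, 0),
  Price     = c(13500, 13750, 18950, 9900)
)
toy_cols <- c("Doors", "Fuel_Type", "Automatic")

# Id is an identifier, so store it as character rather than numeric
toy$Id <- as.character(toy$Id)

# Convert every column named in toy_cols to a factor in one step
toy[toy_cols] <- lapply(toy[toy_cols], factor)

# Confirm the resulting data types
sapply(toy, class)
```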

Now, we can use summary() to examine the summary statistics of the variables again.

summary(car_resale[3:38])
##      Price         Age_08_04       Mfg_Month         Mfg_Year   
##  Min.   : 4350   Min.   : 1.00   Min.   : 1.000   Min.   :1998  
##  1st Qu.: 8450   1st Qu.:44.00   1st Qu.: 3.000   1st Qu.:1998  
##  Median : 9900   Median :61.00   Median : 5.000   Median :1999  
##  Mean   :10731   Mean   :55.95   Mean   : 5.549   Mean   :2000  
##  3rd Qu.:11950   3rd Qu.:70.00   3rd Qu.: 8.000   3rd Qu.:2001  
##  Max.   :32500   Max.   :80.00   Max.   :12.000   Max.   :2004  
##                                                                 
##        KM         Quarterly_Tax        Weight     Guarantee_Period
##  Min.   :     1   Min.   : 19.00   Min.   :1000   Min.   : 3.000  
##  1st Qu.: 43000   1st Qu.: 69.00   1st Qu.:1040   1st Qu.: 3.000  
##  Median : 63390   Median : 85.00   Median :1070   Median : 3.000  
##  Mean   : 68533   Mean   : 87.12   Mean   :1072   Mean   : 3.815  
##  3rd Qu.: 87021   3rd Qu.: 85.00   3rd Qu.:1085   3rd Qu.: 3.000  
##  Max.   :243000   Max.   :283.00   Max.   :1615   Max.   :36.000  
##                                                                   
##      HP_Bin      CC_bin    Doors   Gears    Cylinders  Fuel_Type   
##  < 100  :560   <1600:416   2:  2   3:   2   4:1436    CNG   :  17  
##  > 120  : 11   >1600:166   3:622   4:   1             Diesel: 155  
##  100-120:865   1600 :854   4:138   5:1390             Petrol:1264  
##                            5:674   6:  43                          
##                                                                    
##                                                                    
##                                                                    
##      Color     Met_Color Automatic Mfr_Guarantee BOVAG_Guarantee ABS     
##  Grey   :301   0:467     0:1356    0:848         0: 150          0: 268  
##  Blue   :283   1:969     1:  80    1:588         1:1286          1:1168  
##  Red    :278                                                             
##  Green  :220                                                             
##  Black  :191                                                             
##  Silver :122                                                             
##  (Other): 41                                                             
##  Airbag_1 Airbag_2 Airco   Automatic_airco Boardcomputer CD_Player Central_Lock
##  0:  42   0: 398   0:706   0:1355          0:1013        0:1122    0:603       
##  1:1394   1:1038   1:730   1:  81          1: 423        1: 314    1:833       
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  Powered_Windows Power_Steering Radio    Mistlamps Sport_Model Backseat_Divider
##  0:629           0:  32         0:1226   0:1067    0:1005      0: 330          
##  1:807           1:1404         1: 210   1: 369    1: 431      1:1106          
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  Metallic_Rim Radio_cassette Tow_Bar 
##  0:1142       0:1227         0:1037  
##  1: 294       1: 209         1: 399  
##                                      
##                                      
##                                      
##                                      
## 

Notice that this time the output report summarises the categorical variables as frequency counts by class instead of as numeric statistics.

3 Univariate Exploratory Data Analysis

3.1 Visualising the distribution of continuous variables

Histogram, probability density plot and boxplot are three commonly used statistical graphical methods for visualising the distribution of continuous variables.

The code chunk below creates eight histograms. Then, ggarrange() of ggpubr is used to organise these histograms into a small multiple plot of 4 columns by 2 rows.

Price <- ggplot(data=car_resale, aes(x= `Price`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
Age_08_04 <- ggplot(data=car_resale, aes(x= `Age_08_04`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
Mfg_Month <- ggplot(data=car_resale, aes(x= `Mfg_Month`)) +
  geom_histogram(bins=20, color="black", fill="light blue")  
Mfg_Year <- ggplot(data=car_resale, aes(x= `Mfg_Year`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
KM <- ggplot(data=car_resale, aes(x= `KM`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
Quarterly_Tax <- ggplot(data=car_resale, aes(x= `Quarterly_Tax`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
Weight <- ggplot(data=car_resale, aes(x= `Weight`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
Guarantee_Period <- ggplot(data=car_resale, aes(x= `Guarantee_Period`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

ggarrange(Price, Age_08_04, Mfg_Month, Mfg_Year, KM, Quarterly_Tax, Weight, Guarantee_Period, ncol = 4, nrow = 2)
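
The eight near-identical calls above can also be collapsed into a single faceted plot. The chunk below is a sketch of that alternative, not the approach used in this exercise: it reshapes the data to long form with stack() of base R and lets facet_wrap() draw one panel per variable. The data frame toy holds simulated values under three of the real column names.

```r
library(ggplot2)

# Simulated stand-in for three continuous columns (values are made up)
set.seed(1)
toy <- data.frame(
  Price     = rnorm(200, mean = 10000, sd = 2000),
  Age_08_04 = runif(200, min = 1, max = 80),
  KM        = rexp(200, rate = 1 / 60000)
)

# stack() returns one row per value, with the source column name in `ind`
toy_long <- stack(toy)

# One histogram per variable; free scales because the units differ
hist_grid <- ggplot(toy_long, aes(x = values)) +
  geom_histogram(bins = 20, color = "black", fill = "lightblue") +
  facet_wrap(~ ind, scales = "free", ncol = 3)

hist_grid
```

The trade-off is that all panels share one set of plot options, whereas separate plots combined with ggarrange() can be styled individually.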

3.1.1 Visualising frequency distributions for categorical variables

To display the frequency distribution of the categorical variables, freq() of the funModeling package will be used.

freq(car_resale[11:38])

##    HP_Bin frequency percentage cumulative_perc
## 1 100-120       865      60.24           60.24
## 2   < 100       560      39.00           99.24
## 3   > 120        11       0.77          100.00

##   CC_bin frequency percentage cumulative_perc
## 1   1600       854      59.47           59.47
## 2  <1600       416      28.97           88.44
## 3  >1600       166      11.56          100.00

##   Doors frequency percentage cumulative_perc
## 1     5       674      46.94           46.94
## 2     3       622      43.31           90.25
## 3     4       138       9.61           99.86
## 4     2         2       0.14          100.00

##   Gears frequency percentage cumulative_perc
## 1     5      1390      96.80           96.80
## 2     6        43       2.99           99.79
## 3     3         2       0.14           99.93
## 4     4         1       0.07          100.00

##   Cylinders frequency percentage cumulative_perc
## 1         4      1436        100             100

##   Fuel_Type frequency percentage cumulative_perc
## 1    Petrol      1264      88.02           88.02
## 2    Diesel       155      10.79           98.81
## 3       CNG        17       1.18          100.00

##     Color frequency percentage cumulative_perc
## 1    Grey       301      20.96           20.96
## 2    Blue       283      19.71           40.67
## 3     Red       278      19.36           60.03
## 4   Green       220      15.32           75.35
## 5   Black       191      13.30           88.65
## 6  Silver       122       8.50           97.15
## 7   White        31       2.16           99.31
## 8  Violet         4       0.28           99.59
## 9   Beige         3       0.21           99.80
## 10 Yellow         3       0.21          100.00

##   Met_Color frequency percentage cumulative_perc
## 1         1       969      67.48           67.48
## 2         0       467      32.52          100.00

##   Automatic frequency percentage cumulative_perc
## 1         0      1356      94.43           94.43
## 2         1        80       5.57          100.00

##   Mfr_Guarantee frequency percentage cumulative_perc
## 1             0       848      59.05           59.05
## 2             1       588      40.95          100.00

##   BOVAG_Guarantee frequency percentage cumulative_perc
## 1               1      1286      89.55           89.55
## 2               0       150      10.45          100.00

##   ABS frequency percentage cumulative_perc
## 1   1      1168      81.34           81.34
## 2   0       268      18.66          100.00

##   Airbag_1 frequency percentage cumulative_perc
## 1        1      1394      97.08           97.08
## 2        0        42       2.92          100.00

##   Airbag_2 frequency percentage cumulative_perc
## 1        1      1038      72.28           72.28
## 2        0       398      27.72          100.00

##   Airco frequency percentage cumulative_perc
## 1     1       730      50.84           50.84
## 2     0       706      49.16          100.00

##   Automatic_airco frequency percentage cumulative_perc
## 1               0      1355      94.36           94.36
## 2               1        81       5.64          100.00

##   Boardcomputer frequency percentage cumulative_perc
## 1             0      1013      70.54           70.54
## 2             1       423      29.46          100.00

##   CD_Player frequency percentage cumulative_perc
## 1         0      1122      78.13           78.13
## 2         1       314      21.87          100.00

##   Central_Lock frequency percentage cumulative_perc
## 1            1       833      58.01           58.01
## 2            0       603      41.99          100.00

##   Powered_Windows frequency percentage cumulative_perc
## 1               1       807       56.2            56.2
## 2               0       629       43.8           100.0

##   Power_Steering frequency percentage cumulative_perc
## 1              1      1404      97.77           97.77
## 2              0        32       2.23          100.00

##   Radio frequency percentage cumulative_perc
## 1     0      1226      85.38           85.38
## 2     1       210      14.62          100.00

##   Mistlamps frequency percentage cumulative_perc
## 1         0      1067       74.3            74.3
## 2         1       369       25.7           100.0

##   Sport_Model frequency percentage cumulative_perc
## 1           0      1005      69.99           69.99
## 2           1       431      30.01          100.00

##   Backseat_Divider frequency percentage cumulative_perc
## 1                1      1106      77.02           77.02
## 2                0       330      22.98          100.00

##   Metallic_Rim frequency percentage cumulative_perc
## 1            0      1142      79.53           79.53
## 2            1       294      20.47          100.00

##   Radio_cassette frequency percentage cumulative_perc
## 1              0      1227      85.45           85.45
## 2              1       209      14.55          100.00

##   Tow_Bar frequency percentage cumulative_perc
## 1       0      1037      72.21           72.21
## 2       1       399      27.79          100.00
## [1] "Variables processed: HP_Bin, CC_bin, Doors, Gears, Cylinders, Fuel_Type, Color, Met_Color, Automatic, Mfr_Guarantee, BOVAG_Guarantee, ABS, Airbag_1, Airbag_2, Airco, Automatic_airco, Boardcomputer, CD_Player, Central_Lock, Powered_Windows, Power_Steering, Radio, Mistlamps, Sport_Model, Backseat_Divider, Metallic_Rim, Radio_cassette, Tow_Bar"

Several useful insights can be obtained from the frequency tables above. They can be summarised as follows:

  • Cylinders cannot be used as an explanatory variable because it has only one class, namely 4-cylinder.
  • Several variables, such as HP_Bin, Doors, Gears, Fuel_Type and Color, have classes with very small frequencies.
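
Both checks can be automated. The sketch below flags factors with a single class, which lm() rejects with the error "contrasts can be applied only to factors with 2 or more levels", and factors whose smallest class falls below a threshold. The data frame toy contains made-up values, and min_n is an arbitrary illustrative threshold rather than a recommended cut-off.

```r
# Toy stand-in (hypothetical values)
toy <- data.frame(
  Cylinders = factor(c(4, 4, 4, 4, 4, 4)),
  Gears     = factor(c(5, 5, 5, 5, 6, 5)),
  Fuel_Type = factor(c("Petrol", "Petrol", "Diesel",
                       "Petrol", "Diesel", "Petrol"))
)

# Factors with fewer than two levels cannot explain any variation
single_level <- names(toy)[sapply(toy, nlevels) < 2]
single_level

# Size of the smallest class in each factor
smallest_class <- sapply(toy, function(x) min(table(x)))

# Factors that contain at least one class with fewer than min_n records
min_n <- 2
sparse <- names(smallest_class)[smallest_class < min_n]
sparse
```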

4 Getting to Know lm() Function and Methods

In this section, you will learn how to build regression models by using lm() of base R.

4.1 Basic lm() function

First, we will build a simple linear regression model by using Price as the dependent variable and Age_08_04 as the independent variable.

car.slr <- lm(formula = Price ~ Age_08_04, data = car_resale)

lm() returns an object of class "lm" or, for multiple responses, of class c("mlm", "lm").

The functions summary() and anova() can be used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.

summary(car.slr)
## 
## Call:
## lm(formula = Price ~ Age_08_04, data = car_resale)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8423.0  -997.4   -24.6   878.5 12889.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20294.059    146.097  138.91   <2e-16 ***
## Age_08_04    -170.934      2.478  -68.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1746 on 1434 degrees of freedom
## Multiple R-squared:  0.7684, Adjusted R-squared:  0.7682 
## F-statistic:  4758 on 1 and 1434 DF,  p-value: < 2.2e-16

The output report reveals that the Price (y) can be explained from Age_08_04 (x1) by using the formula:

      y = 20294.059 - 170.934 x1
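
As a worked example of the fitted formula, the sketch below plugs an age of 24 months into the equation and compares the result with the fourth record shown in the glimpse() output (Price 13950, Age_08_04 24). The helper predict_price() is a hypothetical name introduced for illustration, and the coefficients are the rounded values printed by summary().

```r
# Coefficients copied from the summary() output above
b0 <- 20294.059
b1 <- -170.934

# Hypothetical helper: predicted price for a car of a given age in months
predict_price <- function(age_months) b0 + b1 * age_months

# Predicted price of a 24-month-old car
predicted <- predict_price(24)
predicted

# Observed price minus predicted price for the fourth record
13950 - predicted
```

The prediction is about 16192, so this particular car sold for roughly 2242 less than the model expects for its age.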

The R-squared of 0.7684 reveals that the simple regression model built is able to explain about 77% of the variation in trade-in prices.

Since the p-value of the F-statistic is much smaller than 0.0001, we reject the null hypothesis that the mean alone is as good an estimator of trade-in prices as the regression model. This allows us to infer that the simple linear regression model above is a better estimator of Price than the mean.

The Coefficients: section of the report reveals that the p-values of the estimates of both the Intercept and Age_08_04 are smaller than 0.001. In view of this, the null hypotheses that B0 and B1 are equal to 0 are rejected. As a result, we can infer that B0 and B1 are significantly different from 0 and are therefore useful parameter estimates.

To visualise the best-fit line on a scatterplot, we can pass lm as the method of geom_smooth() in ggplot2.
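
A minimal sketch of such a plot is given below. It uses a simulated data frame called toy, whose values are made up and only mimic a downward price trend; replacing toy with car_resale would plot the real data. The object name fit_plot is illustrative.

```r
library(ggplot2)

# Simulated stand-in for car_resale (made-up values)
set.seed(42)
toy <- data.frame(Age_08_04 = sample(1:80, 150, replace = TRUE))
toy$Price <- 20000 - 170 * toy$Age_08_04 + rnorm(150, sd = 1500)

fit_plot <- ggplot(toy, aes(x = Age_08_04, y = Price)) +
  geom_point(alpha = 0.5) +
  # method = "lm" makes geom_smooth() fit y ~ x by least squares;
  # se = TRUE adds the confidence band around the fitted line
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

fit_plot
```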

4.1.1 Working with lm()’s function methods

The code chunk below prints the Analysis of Variance Table of the simple regression model by using anova().

anova(car.slr)
## Analysis of Variance Table
## 
## Response: Price
##             Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Age_08_04    1 1.4505e+10 1.4505e+10    4758 < 2.2e-16 ***
## Residuals 1434 4.3718e+09 3.0486e+06                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
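
The sums of squares in this table tie back to the figures reported by summary(). The sketch below recomputes the R-squared and the F statistic from the rounded values printed above, so small rounding differences are to be expected.

```r
# Values copied from the anova() table above
ss_model    <- 1.4505e10   # Sum Sq of Age_08_04
ss_residual <- 4.3718e9    # Sum Sq of Residuals
df_model    <- 1
df_residual <- 1434

# R-squared: share of the total sum of squares explained by the model
r_squared <- ss_model / (ss_model + ss_residual)
round(r_squared, 4)

# F statistic: model mean square divided by residual mean square
f_stat <- (ss_model / df_model) / (ss_residual / df_residual)
round(f_stat)
```

Both results agree with the Multiple R-squared of 0.7684 and the F-statistic of 4758 shown in the model summary.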

We can also print the confidence intervals of the parameter estimates by using the code chunk below.

confint(car.slr)
##                  2.5 %     97.5 %
## (Intercept) 20007.4714 20580.6459
## Age_08_04    -175.7946  -166.0725
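
These intervals follow from the estimates and standard errors in the summary() output: each bound is the estimate plus or minus the critical t value times the standard error. The sketch below reproduces them from the rounded printed figures, so the last decimals may differ slightly from confint().

```r
# Estimates and standard errors copied from the summary() output
est <- c("(Intercept)" = 20294.059, Age_08_04 = -170.934)
se  <- c("(Intercept)" = 146.097,   Age_08_04 = 2.478)
df_residual <- 1434

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_residual)

# Lower and upper bounds for each parameter
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 2)
```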

Last but not least, you can also print the residuals of the regression model by using the residuals() function, as shown in the code chunk below.

residuals(car.slr)
##             1             2             3             4             5 
##  2929.2809764 -2862.5861936 -2612.5861936 -2241.6526086  -899.7854386 
##             6             7             8             9            10 
## -1416.0510986 -1874.1839285  1221.1481464  3433.9489014  1271.1481464 
##            11            12            13            14            15 
##   416.4802213  2637.4138064  2271.1481464  1416.4802213  1416.4802213 
##            16            17            18            19            20 
##  5716.4802213  1074.6130513  4903.6794663  5374.6130513 12889.6756911 
##            21            22            23            24            25 
## 11389.6756911 11664.6756911  6023.4100312  6023.4100312  3852.4764462 
##            26            27            28            29            30 
##  6063.4100312  3023.4100312  2374.6130513  4861.8122963  2903.6794663 
##            31            32            33            34            35 
##  4586.2107862 -4298.5824185 -1506.7152484  1122.3511665  1293.2847516 
##            36            37            38            39            40 
## -1153.2503435 -1153.2503435  1293.2847516 -3313.6450583 -2113.6450583 
##            41            42            43            44            45 
## -3142.7114733   189.8198466 -5571.7778883 -1626.4458133 -1942.7114733 
##            46            47            48            49            50 
##   557.2885267 -2113.6450583 -4247.3793983 -3429.9107182  -770.5748681 
##            51            52            53            54            55 
## -2708.9771332 -2196.1763782 -2283.3756232 -3090.5084531 -2458.9771332 
##            56            57            58            59            60 
## -1404.3092082 -2591.5084531  -196.1763782 -3308.9771332 -1833.3756232 
##            61            62            63            64            65 
## -3138.0435482  -404.3092082 -1429.9107182  -233.3756232 -1404.3092082 
##            66            67            68            69            70 
## -2708.9771332   108.4915469   958.4915469  -746.1763782    79.4251319 
##            71            72            73            74            75 
##   632.8900368   937.5579619   595.6907918 -1720.5748681 -2233.3756232 
##            76            77            78            79            80 
## -1746.1763782  -600.8443032 -2300.8443032  2211.9564518  -341.5084531 
##            81            82            83            84            85 
##  1229.4251319  -746.1763782   861.9564518   570.0892818 -1915.9069431 
##            86            87            88            89            90 
##   159.6945670  -961.2390180 -3244.9733581  -382.1726030     0.3587169 
##            91            92            93            94            95 
##  -469.3718480  -353.1061880 -1553.1061880  1788.7609820 -1828.7076981 
##            96            97            98            99           100 
##  3367.8273970  1367.8273970   -11.2390180   450.3587169  -549.6412831 
##           101           102           103           104           105 
##   146.8938120  -578.7076981   425.9602270 -1599.6412831  -207.7741131 
##           106           107           108           109           110 
##   305.0266419  2880.6281520  1296.8938120 -2207.7741131  1159.6945670 
##           111           112           113           114           115 
##  -907.7741131  -394.9733581    92.2258869  1646.8938120  1134.0930569 
##           116           117           118           119           120 
##  1105.0266419  -865.9069431   171.2923019   900.3587169   792.2258869 
##           121           122           123           124           125 
##  2390.8787113  2561.8122963   741.0228668 -3412.5861936 -5993.9144934 
##           126           127           128           129           130 
## -1506.7152484 -1706.7152484   -35.7816634  -872.9809084 -3861.3831735 
##           131           132           133           134           135 
##     6.0855066   -35.7816634 -1153.2503435  -361.3831735 -1153.2503435 
##           136           137           138           139           140 
##  -811.3831735 -4734.5786433 -5176.4458133 -4234.5786433 -3702.0473234 
##           141           142           143           144           145 
##  -984.5786433 -4339.2465684 -1613.6450583 -3155.5122283 -1284.5786433 
##           146           147           148           149           150 
##  -563.6450583  -626.4458133 -3139.2465684 -1984.5786433  -923.3129833 
##           151           152           153           154           155 
## -3052.3793983 -3510.1801534  -589.2465684  5821.1481464  4929.2809764 
##           156           157           158           159           160 
##  3416.4802213  3579.2809764  6504.8824865  7675.8160715  6492.0817314 
##           161           162           163           164           165 
##  7583.9489014  6954.8824865  6271.1481464  6903.6794663 -8422.9809084 
##           166           167           168           169           170 
## -8022.9809084 -6271.7778883  1758.3473914   558.3473914  1783.9489014 
##           171           172           173           174           175 
##   783.9489014  1613.0153164   442.0817314  1442.0817314   913.0153164 
##           176           177           178           179           180 
##   -70.7190236  1816.1481464   413.0153164  1442.0817314  2783.9489014 
##           181           182           183           184           185 
##  4650.2145614  -520.7190236 -1574.1839285  2587.4138064   832.9489014 
##           186           187           188           189           190 
##   821.1481464  3754.8824865  2783.9489014  1100.2145614  3442.0817314 
##           191           192           193           194           195 
##  4442.0817314   587.4138064  3754.8824865  2771.1481464  1558.0153164 
##           196           197           198           199           200 
##  1913.0153164  2954.8824865  1942.0817314  2024.6130513   336.2107862 
##           201           202           203           204           205 
##  2114.6130513    86.2107862  1036.2107862   899.0115413  1074.6130513 
##           206           207           208           209           210 
## -1197.5235538   878.0779562 -1550.9884587  2074.6130513  -438.1877037 
##           211           212           213           214           215 
##  4190.8787113  -925.3869487 -2221.9220438 -2121.9220438 -1821.9220438 
##           216           217           218           219           220 
##  3024.6130513  1878.0779562   940.8787113  1219.9451263  3624.6130513 
##           221           222           223           224           225 
##  6428.0779562  2403.6794663  3049.0115413  2903.6794663  1257.1443712 
##           226           227           228           229           230 
##  3678.0779562  1940.8787113   486.2107862  2361.8122963  2190.8787113 
##           231           232           233           234           235 
##  1390.8787113   844.3436162  1599.0115413  2049.0115413  2049.0115413 
##           236           237           238           239           240 
##  3049.0115413  2257.1443712  -510.6563838  4823.4100312  3023.4100312 
##           241           242           243           244           245 
##  1023.4100312    23.4100312   852.4764462  3023.4100312  3231.5428612 
##           246           247           248           249           250 
##  -597.5235538  1547.8085211 -1878.1250639 -5193.9144934 -2019.5160035 
##           251           252           253           254           255 
## -1677.6488335 -1277.9809084 -2614.8480784 -1272.9809084  -956.7152484 
##           256           257           258           259           260 
## -1835.7816634 -2019.5160035 -2692.3167585 -2006.7152484 -3703.2503435 
##           261           262           263           264           265 
##  -848.5824185  -993.9144934  -322.9809084  2177.0190916 -1861.3831735 
##           266           267           268           269           270 
##  -848.5824185 -2361.3831735  -664.8480784 -1703.2503435  -385.7816634 
##           271           272           273           274           275 
## -1385.7816634 -2148.5824185 -1822.9809084  1177.0190916 -1506.7152484 
##           276           277           278           279           280 
## -2848.5824185 -1348.5824185  -993.9144934 -2963.2503435  -127.6488335 
##           281           282           283           284           285 
## -1164.8480784  -727.6488335   556.0855066 -1364.8480784 -2903.2503435 
##           286           287           288           289           290 
## -2093.9144934 -1193.9144934  1835.1519216 -3687.6488335  -214.8480784 
##           291           292           293           294           295 
##  -298.5824185 -1203.2503435 -1132.6488335   214.2183366  -706.7152484 
##           296           297           298           299           300 
## -1848.5824185  -277.9809084 -2032.3167585 -2677.6488335  -848.5824185 
##           301           302           303           304           305 
## -1032.3167585   556.0855066 -1805.7816634 -2132.6488335  1191.4175815 
##           306           307           308           309           310 
##  -506.7152484   493.2847516 -1335.7816634  -677.6488335 -1756.7152484 
##           311           312           313           314           315 
##     6.0855066 -1640.4495885   247.3511665 -3811.3831735 -2358.2503435 
##           316           317           318           319           320 
##   835.1519216 -2848.5824185   177.0190916  -606.7152484  -316.3831735 
##           321           322           323           324           325 
##   122.3511665 -1219.5160035  -785.7816634  1006.0855066   385.1519216 
##           326           327           328           329           330 
##  -664.8480784  -506.7152484  1222.0190916   664.2183366   177.0190916 
##           331           332           333           334           335 
##  -822.9809084 -1466.7152484 -3390.4495885   -19.5160035 -1522.9809084 
##           336           337           338           339           340 
##  -677.6488335 -1335.7816634 -1703.2503435 -1361.3831735 -1248.9144934 
##           341           342           343           344           345 
## -2285.7816634  -361.3831735 -1848.5824185 -1706.7152484 -1811.3831735 
##           346           347           348           349           350 
## -1753.2503435 -1085.7816634  -193.9144934 -1164.8480784 -1385.7816634 
##           351           352           353           354           355 
##  -822.9809084  1835.1519216 -1703.2503435  1664.2183366    -6.7152484 
##           356           357           358           359           360 
##   -48.5824185     6.0855066 -1903.2503435 -1348.5824185  2006.0855066 
##           361           362           363           364           365 
##   122.3511665  -932.6488335   336.7496565  -193.9144934  1064.2183366 
##           366           367           368           369           370 
##  -677.6488335    -6.7152484  -898.5824185 -2519.5160035   477.0190916 
##           371           372           373           374           375 
## -3377.6488335  -316.3831735  -506.7152484 -2127.6488335 -2392.7114733 
##           376           377           378           379           380 
## -1676.4458133 -2839.2465684 -1968.3129833 -1797.3793983 -3089.2465684 
##           381           382           383           384           385 
##  -392.7114733  -313.6450583 -1113.6450583  -360.1801534   886.3549417 
##           386           387           388           389           390 
## -3847.3793983  -455.5122283   373.5541867 -1018.3129833  -942.7114733 
##           391           392           393           394           395 
##  -113.6450583 -1418.3129833 -1655.5122283 -2652.0473234  -468.3129833 
##           396           397           398           399           400 
##  1436.3549417 -1247.3793983  1202.6206017   452.6206017  -113.6450583 
##           401           402           403           404           405 
##  -628.3129833  -339.5786433  -760.1801534 -2339.2465684  -642.7114733 
##           406           407           408           409           410 
##   581.6870167   886.3549417  -981.1137384   886.3549417  -339.2465684 
##           411           412           413           414           415 
##  -797.3793983  1057.2885267 -3018.3129833  -510.1801534   594.4877717 
##           416           417           418           419           420 
##   202.6206017  -113.6450583  -113.6450583  -244.5786433 -3168.3129833 
##           421           422           423           424           425 
##  1228.2221117  2544.4877717   186.3549417  1228.2221117 -1639.2465684 
##           426           427           428           429           430 
##  -813.6450583 -1304.5786433  -563.6450583   886.3549417   607.2885267 
##           431           432           433           434           435 
##   386.3549417  -821.7778883  -997.3793983  1278.2221117 -1294.5786433 
##           436           437           438           439           440 
## -1310.1801534  -313.6450583   886.3549417  -155.5122283   686.3549417 
##           441           442           443           444           445 
##    57.2885267   528.2221117 -1876.4458133  -310.1801534 -1334.5786433 
##           446           447           448           449           450 
##  -902.7114733   436.3549417   715.4213567 -1247.3793983   323.5541867 
##           451           452           453           454           455 
##  2373.5541867  1057.2885267  -297.3793983 -1584.5786433  7031.6870167 
##           456           457           458           459           460 
##  -126.4458133 -1155.5122283  -639.2465684  -221.7778883  2686.3549417 
##           461           462           463           464           465 
##  -813.6450583  -213.6450583   489.4877717  1715.4213567   202.6206017 
##           466           467           468           469           470 
##    28.2221117   344.4877717 -1310.1801534  1386.3549417  -905.5122283 
##           471           472           473           474           475 
##  1202.6206017  1778.2221117 -1468.3129833  1031.6870167    31.6870167 
##           476           477           478           479           480 
##  -310.1801534 -1260.1801534  1752.6206017  -113.6450583  1373.5541867 
##           481           482           483           484           485 
##   102.2885267  -455.5122283  1081.6870167   347.9526766    94.4877717 
##           486           487           488           489           490 
##  -379.2465684   673.5541867  -931.1137384  1228.2221117   607.2885267 
##           491           492           493           494           495 
##  -531.1137384  1202.6206017  1494.4877717 -1139.2465684 -1297.3793983 
##           496           497           498           499           500 
## -1010.1801534  -891.5084531 -1104.3092082  -904.3092082   279.4251319 
##           501           502           503           504           505 
## -1258.9771332 -1117.1099632  -891.5084531 -2288.0435482   437.5579619 
##           506           507           508           509           510 
##  1095.6907918  1279.4251319   279.4251319  -720.5748681   399.1556968 
##           511           512           513           514           515 
##   -70.5748681  -429.9107182  -588.0435482  -720.5748681   -75.2427932 
##           516           517           518           519           520 
##   595.6907918 -1958.9771332  1279.4251319  1487.5579619   566.6243768 
##           521           522           523           524           525 
##  -617.1099632   829.4251319   253.8236218 -1775.2427932   829.4251319 
##           526           527           528           529           530 
##  1108.4915469   395.6907918  -175.5748681   -25.2427932  1224.4251319 
##           531           532           533           534           535 
## -1917.1099632    79.4251319  -575.2427932   279.4251319  -917.1099632 
##           536           537           538           539           540 
##  -233.3756232   608.4915469    66.6243768    82.8900368    82.8900368 
##           541           542           543           544           545 
##   108.4915469   408.4915469   382.8900368   279.4251319 -1184.3756232 
##           546           547           548           549           550 
##   887.5579619  1316.6243768  1229.4251319   803.8236218  -367.1099632 
##           551           552           553           554           555 
##   266.6243768 -1104.3092082  -879.9107182  -220.5748681 -1206.1763782 
##           556           557           558           559           560 
## -1288.0435482   424.7572068  -170.5748681 -1075.2427932 -1796.1763782 
##           561           562           563           564           565 
## -1800.8443032 -1308.9771332   279.4251319  2316.6243768   753.8236218 
##           566           567           568           569           570 
##   579.4251319  1766.6243768   753.8236218   579.4251319  -946.1763782 
##           571           572           573           574           575 
##  1129.4251319   753.8236218  -629.9107182  -917.1099632  -762.4420381 
##           576           577           578           579           580 
##  1766.6243768  -917.1099632   582.8900368   399.1556968  -733.3756232 
##           581           582           583           584           585 
##  -258.9771332  1279.4251319   766.6243768   829.4251319  -933.3756232 
##           586           587           588           589           590 
##  1253.8236218  1053.4915469   461.9564518  -967.1099632  -367.1099632 
##           591           592           593           594           595 
##  2108.4915469  1079.4251319   566.6243768   -75.2427932  -112.4420381 
##           596           597           598           599           600 
## -1258.9771332  -193.3756232  1108.4915469  1229.4251319  -746.1763782 
##           601           602           603           604           605 
##   487.5579619  -920.5748681  -575.2427932  1911.9564518  -746.1763782 
##           606           607           608           609           610 
##   279.4251319   595.6907918  -933.3756232   127.8900368  -104.3092082 
##           611           612           613           614           615 
##  1595.6907918  -879.9107182  -429.9107182 -1275.2427932  -708.9771332 
##           616           617           618           619           620 
##  -233.3756232  1253.8236218 -1088.0435482 -2379.9107182  1058.4915469 
##           621           622           623           624           625 
##  -746.1763782  -758.9771332  -904.3092082   316.6243768  1079.4251319 
##           626           627           628           629           630 
##   808.4915469  1279.4251319  -575.2427932  1469.7572068  1316.6243768 
##           631           632           633           634           635 
##   908.4915469  2279.4251319  -258.9771332  1316.6243768   632.8900368 
##           636           637           638           639           640 
##  -446.1763782    82.8900368  1224.7572068  2108.4915469   766.6243768 
##           641           642           643           644           645 
##  -258.9771332  -788.0435482 -1554.3092082   329.4251319 -1600.8443032 
##           646           647           648           649           650 
##  1279.4251319 -1538.0435482 -1701.1763782  -433.3756232  1766.6243768 
##           651           652           653           654           655 
##   908.4915469   279.4251319   666.6243768  -246.1763782  1453.4915469 
##           656           657           658           659           660 
##  1079.4251319  -233.3756232  1229.4251319  -429.9107182   253.8236218 
##           661           662           663           664           665 
##   399.1556968  1124.4251319    79.4251319  -933.3756232  -429.9107182 
##           666           667           668           669           670 
##   253.8236218   291.0228668  1382.8900368 -1308.9771332  2829.4251319 
##           671           672           673           674           675 
## -1100.8443032  -917.1099632  -575.2427932   566.6243768  -417.1099632 
##           676           677           678           679           680 
##   766.6243768  1079.4251319  -233.3756232  -762.4420381 -1088.0435482 
##           681           682           683           684           685 
##   595.6907918   437.5579619  4108.4915469  -429.9107182   798.8236218 
##           686           687           688           689           690 
##   974.7572068  -233.3756232   203.8236218   127.8900368   424.7572068 
##           691           692           693           694           695 
##   424.7572068  -591.5084531  1253.8236218   424.7572068   311.9564518 
##           696           697           698           699           700 
##   766.6243768   399.1556968  1424.7572068   279.4251319    58.4915469 
##           701           702           703           704           705 
##   211.9564518  -417.1099632  -617.1099632   -50.8443032  -300.8443032 
##           706           707           708           709           710 
##   553.8236218  2316.6243768   145.6907918 -1600.8443032 -1638.0435482 
##           711           712           713           714           715 
##  -967.1099632 -1629.9107182   570.0892818  -972.1099632 -1458.9771332 
##           716           717           718           719           720 
##  -117.1099632   253.8236218  1766.6243768  1058.4915469  -196.1763782 
##           721           722           723           724           725 
##    82.8900368   424.7572068   487.5579619   411.9564518  -438.3756232 
##           726           727           728           729           730 
##   199.1556968  -800.8443032  -333.3756232   -75.2427932    82.8900368 
##           731           732           733           734           735 
## -1275.2427932  -258.9771332   456.9564518 -2429.9107182   279.4251319 
##           736           737           738           739           740 
##  -855.8443032  1279.4251319  1108.4915469   424.7572068   545.6907918 
##           741           742           743           744           745 
##  -762.4420381   553.8236218 -1117.1099632  1079.4251319  1082.8900368 
##           746           747           748           749           750 
##   911.9564518 -1308.9771332  -538.0435482   937.5579619  1266.6243768 
##           751           752           753           754           755 
##  1253.8236218   229.4251319   395.6907918  1570.0892818   349.1556968 
##           756           757           758           759           760 
##  1266.6243768   741.0228668   711.9564518  -650.8443032  2079.4251319 
##           761           762           763           764           765 
##    70.0892818   911.9564518  1829.4251319  2803.8236218   120.0892818 
##           766           767           768           769           770 
##   232.5579619   741.0228668  1203.8236218 -1540.3054330 -1865.9069431 
##           771           772           773           774           775 
##  -207.7741131 -1904.9733581   988.7609820 -1707.7741131 -1182.1726030 
##           776           777           778           779           780 
##   -65.9069431  -182.1726030  -394.9733581   117.8273970  2630.6281520 
##           781           782           783           784           785 
##   434.0930569 -2736.8405281   767.8273970  -403.1061880   425.9602270 
##           786           787           788           789           790 
##   121.2923019   646.8938120   305.0266419 -1536.8405281  -824.0397730 
##           791           792           793           794           795 
##   317.8273970   305.0266419   117.8273970   105.0266419   342.2258869 
##           796           797           798           799           800 
##   788.7609820 -2065.9069431  -403.1061880 -1657.7741131  -461.2390180 
##           801           802           803           804           805 
##  -194.9733581  1946.8938120   630.6281520   180.6281520  1225.9602270 
##           806           807           808           809           810 
##   538.7609820  -407.7741131   134.0930569  2475.9602270 -1004.9733581 
##           811           812           813           814           815 
##  -236.8405281  -119.3718480  1446.8938120   305.0266419  1475.9602270 
##           816           817           818           819           820 
## -1024.0397730  -744.9733581   975.9602270  2709.6945670   788.7609820 
##           821           822           823           824           825 
##  1880.6281520   320.9602270  -129.3718480   134.0930569 -1036.8405281 
##           826           827           828           829           830 
##  1959.6945670   292.2258869   817.8273970  1938.7609820  1134.0930569 
##           831           832           833           834           835 
##  1084.0930569  -549.6412831  1605.0266419  -311.2390180   199.6945670 
##           836           837           838           839           840 
##   788.7609820   630.6281520  3330.6281520  2630.6281520   263.1594719 
##           841           842           843           844           845 
##  2159.6945670  1117.8273970  3280.6281520   988.7609820   -24.0397730 
##           846           847           848           849           850 
##   121.2923019   538.7609820  1646.8938120   700.3587169  1305.0266419 
##           851           852           853           854           855 
##  -136.8405281  -274.0397730   488.7609820  -315.9069431  -536.8405281 
##           856           857           858           859           860 
##  -378.7076981   288.7609820   263.1594719   130.6281520  2330.6281520 
##           861           862           863           864           865 
##  1630.6281520  1121.2923019   134.0930569  -428.7076981   684.0930569 
##           866           867           868           869           870 
##  1459.6945670   946.8938120  1134.0930569   305.0266419   330.6281520 
##           871           872           873           874           875 
##   288.7609820   617.8273970  1225.9602270  2330.6281520  -553.1061880 
##           876           877           878           879           880 
##  -657.7741131  -207.7741131   275.9602270   792.2258869   704.6945670 
##           881           882           883           884           885 
##   446.8938120 -1378.7076981   857.8273970   288.7609820   105.0266419 
##           886           887           888           889           890 
##   288.7609820   342.2258869   -36.8405281  1038.7609820   475.9602270 
##           891           892           893           894           895 
##  1630.6281520  1092.2258869  2367.8273970   963.1594719  1204.6945670 
##           896           897           898           899           900 
##  1788.7609820   684.0930569   638.7609820   525.6281520   -49.6412831 
##           901           902           903           904           905 
##  2380.6281520   630.6281520   -74.0397730  2009.6945670  -407.7741131 
##           906           907           908           909           910 
##   -74.0397730   880.6281520  1646.8938120   621.2923019   817.8273970 
##           911           912           913           914           915 
##  1159.6945670   988.7609820   792.2258869   788.7609820  1288.7609820 
##           916           917           918           919           920 
##   988.7609820  1367.8273970   330.6281520   880.6281520   959.6945670 
##           921           922           923           924           925 
##   538.7609820   646.8938120   280.6281520   317.8273970  1880.6281520 
##           926           927           928           929           930 
##  -315.9069431  1617.8273970   196.8938120  3330.6281520  2475.9602270 
##           931           932           933           934           935 
##  2288.7609820  3159.6945670 -1694.9733581   691.8938120  1538.7609820 
##           936           937           938           939           940 
##   171.2923019   630.6281520  2159.6945670  2830.6281520   855.0266419 
##           941           942           943           944           945 
##   780.6281520   280.6281520  1362.8273970  2525.9602270 -1158.7741131 
##           946           947           948           949           950 
##  2330.6281520  1709.6945670  1275.9602270   879.6281520  2380.6281520 
##           951           952           953           954           955 
##  1817.8273970   330.6281520  1367.8273970  1305.0266419  1117.8273970 
##           956           957           958           959           960 
##  -878.7076981  2209.6945670 -1178.7076981   446.8938120   513.1594719 
##           961           962           963           964           965 
##  1630.6281520  2659.6945670 -1749.6412831  1267.8273970   596.8938120 
##           966           967           968           969           970 
##  2130.6281520  2084.0930569 -1662.7741131   -78.7076981 -1599.6412831 
##           971           972           973           974           975 
##  1630.6281520  2330.6281520  1421.2923019  1938.7609820  -553.1061880 
##           976           977           978           979           980 
##   475.9602270   -15.9069431   934.0930569   105.0266419  3159.6945670 
##           981           982           983           984           985 
## -2549.6412831  3646.8938120  2317.8273970  1630.6281520  1250.3587169 
##           986           987           988           989           990 
##   488.7609820  1275.9602270   250.3587169   684.0930569 -1249.6412831 
##           991           992           993           994           995 
##   830.6281520  2667.8273970  1475.9602270  -878.7076981   792.2258869 
##           996           997           998           999          1000 
##  1367.8273970   621.2923019  -249.6412831  2630.6281520   425.9602270 
##          1001          1002          1003          1004          1005 
##   646.8938120  1450.3587169  1105.0266419   880.6281520 -1036.8405281 
##          1006          1007          1008          1009          1010 
##  2159.6945670  1959.6945670   975.9602270   846.8938120   638.7609820 
##          1011          1012          1013          1014          1015 
##  2988.7609820  -353.1061880  3442.0817314  3361.8122963 -1506.7152484 
##          1016          1017          1018          1019          1020 
##   840.6281520 -1247.3793983  1658.4915469   279.4251319   646.8938120 
##          1021          1022          1023          1024          1025 
## -2386.9846836  -783.5197787   271.1481464  -899.7854386 -1033.5197787 
##          1026          1027          1028          1029          1030 
##   -99.7854386   -70.7190236 -1412.5861936   925.8160715  -928.8518536 
##          1031          1032          1033          1034          1035 
## -2583.5197787  1071.1481464 -2583.5197787  -733.5197787   583.9489014 
##          1036          1037          1038          1039          1040 
##  -257.9182686  -599.7854386 -1583.5197787  2271.1481464 -1583.5197787 
##          1041          1042          1043          1044          1045 
##   -70.7190236   442.0817314 -1829.1839285  -612.5861936  -266.0510986 
##          1046          1047          1048          1049          1050 
##  -570.7190236   629.2809764  -266.0510986 -1096.3205337  -425.3869487 
##          1051          1052          1053          1054          1055 
##  -925.3869487 -1096.3205337 -1096.3205337   232.7458813  -796.3205337 
##          1056          1057          1058          1059          1060 
##   428.0779562   -96.3205337    61.8122963  -375.3869487   561.8122963 
##          1061          1062          1063          1064          1065 
##  -625.3869487  -925.3869487  -796.3205337  -375.3869487  -546.3205337 
##          1066          1067          1068          1069          1070 
##  1403.6794663  -425.3869487 -1109.1212887   324.6130513  -134.7227988 
##          1071          1072          1073          1074          1075 
##  -862.0548737  -763.7892138  -960.6563838   573.4100312    23.4100312 
##          1076          1077          1078          1079          1080 
##  -397.5235538  1172.8085211 -2328.1250639 -3164.8480784 -2206.7152484 
##          1081          1082          1083          1084          1085 
## -1993.9144934 -2335.7816634 -1822.9809084 -2792.3167585 -1390.4495885 
##          1086          1087          1088          1089          1090 
##  -847.9809084 -1164.8480784 -1022.9809084  -822.9809084 -1335.7816634 
##          1091          1092          1093          1094          1095 
## -1627.6488335  -177.6488335  -193.9144934 -1361.3831735  -777.6488335 
##          1096          1097          1098          1099          1100 
## -4019.5160035  -877.9809084  1177.0190916 -1390.7816634 -2822.9809084 
##          1101          1102          1103          1104          1105 
## -2519.5160035 -1364.8480784 -1977.6488335 -2993.9144934 -2848.5824185 
##          1106          1107          1108          1109          1110 
## -3214.8480784 -2164.8480784 -2082.3167585 -1335.7816634 -1732.6488335 
##          1111          1112          1113          1114          1115 
## -3164.8480784 -1298.5824185 -2811.3831735 -1335.7816634 -3361.3831735 
##          1116          1117          1118          1119          1120 
## -1335.7816634 -1506.7152484 -1403.2503435 -2048.5824185   556.0855066 
##          1121          1122          1123          1124          1125 
## -2284.5786433  -968.3129833 -1955.5122283 -1797.3793983 -2284.5786433 
##          1126          1127          1128          1129          1130 
##  -589.2465684  -797.3793983 -1168.3129833  -997.3793983   728.2221117 
##          1131          1132          1133          1134          1135 
##  -468.3129833 -1384.5786433 -1392.7114733 -1681.1137384 -2481.1137384 
##          1136          1137          1138          1139          1140 
##  -847.3793983  -971.7778883 -1455.5122283 -2113.6450583 -1563.6450583 
##          1141          1142          1143          1144          1145 
## -1247.3793983 -1777.4458133 -1284.5786433 -2614.2465684   607.2885267 
##          1146          1147          1148          1149          1150 
## -1531.1137384   294.4877717  -968.3129833  -221.7778883  -142.7114733 
##          1151          1152          1153          1154          1155 
##  -721.7778883 -1594.2465684 -1663.6450583 -1113.6450583 -2139.2465684 
##          1156          1157          1158          1159          1160 
## -1513.6450583  -847.3793983 -1310.1801534  -942.7114733   410.7534316 
##          1161          1162          1163          1164          1165 
##  -655.5122283  -752.3793983  -497.3793983  -313.6450583 -3310.1801534 
##          1166          1167          1168          1169          1170 
##  -971.7778883  -163.6450583 -1339.2465684   228.2221117 -1938.3129833 
##          1171          1172          1173          1174          1175 
##    11.6870167  -905.5122283 -1771.7778883 -1981.1137384 -2139.2465684 
##          1176          1177          1178          1179          1180 
##  -942.7114733  -284.5786433 -2310.1801534  -797.3793983  -947.3793983 
##          1181          1182          1183          1184          1185 
##  -221.7778883  -976.4458133 -1639.2465684  1202.6206017 -2127.6488335 
##          1186          1187          1188          1189          1190 
## -2455.5122283  -891.5084531   129.4251319 -2785.7816634 -1942.7114733 
##          1191          1192          1193          1194          1195 
##    31.6870167   173.5541867   373.5541867   228.2221117    28.2221117 
##          1196          1197          1198          1199          1200 
##  2686.3549417  -392.7114733  -942.7114733 -2720.5748681 -1720.5748681 
##          1201          1202          1203          1204          1205 
## -2213.9771332 -1933.3756232 -1354.3092082  1108.4915469 -1958.9771332 
##          1206          1207          1208          1209          1210 
##  -629.9107182 -4258.9771332 -1196.1763782   408.4915469 -1770.5748681 
##          1211          1212          1213          1214          1215 
## -1429.9107182  -275.2427932 -1258.9771332  -604.3092082   716.6243768 
##          1216          1217          1218          1219          1220 
##   -62.4420381 -1275.2427932  -733.3756232  -433.3756232   395.6907918 
##          1221          1222          1223          1224          1225 
## -1917.1099632 -2796.1763782 -1033.3756232   786.0228668  -708.9771332 
##          1226          1227          1228          1229          1230 
##   461.9564518   -91.5084531 -1104.3092082 -1733.3756232  -220.5748681 
##          1231          1232          1233          1234          1235 
##  -262.4420381  -946.1763782 -2088.0435482  -933.3756232   288.4915469 
##          1236          1237          1238          1239          1240 
##   741.0228668 -1050.8443032  -720.5748681 -1404.3092082 -1233.3756232 
##          1241          1242          1243          1244          1245 
##   279.4251319   279.4251319    82.8900368   -91.5084531  -917.1099632 
##          1246          1247          1248          1249          1250 
##  -233.3756232  -720.5748681  1911.9564518   153.4915469  -454.3092082 
##          1251          1252          1253          1254          1255 
##    95.6907918 -2600.8443032   -75.2427932 -1720.5748681 -3429.9107182 
##          1256          1257          1258          1259          1260 
## -1129.9107182   374.7572068  -233.3756232   908.4915469  1729.4251319 
##          1261          1262          1263          1264          1265 
## -1383.3756232  -570.5748681  -746.1763782   224.7572068 -1838.0435482 
##          1266          1267          1268          1269          1270 
##   487.5579619   737.5579619   766.6243768   582.8900368  -262.4420381 
##          1271          1272          1273          1274          1275 
## -1538.0435482    53.8236218 -2088.0435482 -1188.3756232   786.0228668 
##          1276          1277          1278          1279          1280 
##  -591.5084531   -62.4420381  -233.3756232  -196.1763782 -1683.3756232 
##          1281          1282          1283          1284          1285 
##   424.7572068 -1433.3756232  -720.5748681   424.7572068 -1258.9771332 
##          1286          1287          1288          1289          1290 
##   408.4915469  -555.8443032 -1117.1099632   570.0892818  -233.3756232 
##          1291          1292          1293          1294          1295 
## -3029.9107182  -920.5748681  -391.5084531   237.5579619  -917.1099632 
##          1296          1297          1298          1299          1300 
##  -233.3756232   377.5579619   803.8236218   424.7572068  1824.4251319 
##          1301          1302          1303          1304          1305 
## -1879.9107182  -233.3756232  -308.9771332  -625.2427932  -433.3756232 
##          1306          1307          1308          1309          1310 
## -1062.4420381    82.8900368  1324.4251319 -1604.3092082   399.1556968 
##          1311          1312          1313          1314          1315 
## -1088.0435482   -62.4420381    53.8236218   424.7572068  -933.3756232 
##          1316          1317          1318          1319          1320 
##    66.6243768   349.1556968  -770.5748681   766.6243768   779.4251319 
##          1321          1322          1323          1324          1325 
##   711.9564518  -550.8443032  -746.1763782   -25.2427932  2108.4915469 
##          1326          1327          1328          1329          1330 
##   437.5579619  -461.2390180 -1161.2390180   130.6281520 -1265.9069431 
##          1331          1332          1333          1334          1335 
## -1065.9069431 -1874.0397730   330.6281520  2159.6945670  -461.2390180 
##          1336          1337          1338          1339          1340 
##   305.0266419   630.6281520   467.8273970  -636.8405281   -65.9069431 
##          1341          1342          1343          1344          1345 
## -1486.8405281   592.2258869 -2036.8405281  -382.1726030  -894.9733581 
##          1346          1347          1348          1349          1350 
##   134.0930569  1330.6281520   617.8273970  -894.9733581   405.0266419 
##          1351          1352          1353          1354          1355 
##  -365.3054330   450.3587169  1330.6281520 -1149.9733581   288.7609820 
##          1356          1357          1358          1359          1360 
##   159.6945670   409.6945670  -369.3718480   680.6281520   963.1594719 
##          1361          1362          1363          1364          1365 
## -1828.7076981  -207.7741131  1538.7609820  -182.1726030   988.7609820 
##          1366          1367          1368          1369          1370 
##  -382.1726030  1959.6945670   -24.0397730  -724.0397730   134.0930569 
##          1371          1372          1373          1374          1375 
##  1630.6281520  1959.6945670  -549.6412831   792.2258869   488.7609820 
##          1376          1377          1378          1379          1380 
## -1011.2390180  -182.1726030  2130.6281520 -1599.6412831  1817.8273970 
##          1381          1382          1383          1384          1385 
##  -182.1726030  1196.8938120  -325.9069431   -36.8405281  1488.7609820 
##          1386          1387          1388          1389          1390 
## -2207.7741131   450.3587169   250.3587169 -1524.0397730    25.9602270 
##          1391          1392          1393          1394          1395 
##  1488.7609820   617.8273970  -840.3054330   963.1594719   475.9602270 
##          1396          1397          1398          1399          1400 
##  1434.0930569   621.2923019   -24.0397730   946.8938120   130.6281520 
##          1401          1402          1403          1404          1405 
##  1988.7609820   959.6945670  -524.0397730  1317.8273970   475.9602270 
##          1406          1407          1408          1409          1410 
## -2483.7076981  -144.9733581   475.9602270  1630.6281520  1475.9602270 
##          1411          1412          1413          1414          1415 
##   100.3587169  1130.6281520  2709.6945670   617.8273970   621.2923019 
##          1416          1417          1418          1419          1420 
##   709.6945670  1446.8938120  -657.7741131  1305.0266419  3538.7609820 
##          1421          1422          1423          1424          1425 
##  -815.9069431   446.8938120  1538.7609820   -65.9069431  1538.7609820 
##          1426          1427          1428          1429          1430 
##  1330.6281520   -65.9069431  1330.6281520   792.2258869   463.1594719 
##          1431          1432          1433          1434          1435 
##  1988.7609820  1830.6281520  -999.6412831  2858.1594719   342.2258869 
##          1436 
## -1078.7076981

4.2 Handling categorical explanatory variable in lm()

When the explanatory variable is categorical, lm() uses contrasts to create dummy variables. Contrasts is the umbrella term used to describe the process of testing linear combinations of parameters from regression models. All statistical software uses contrasts, but each package has its own defaults and its own way of overriding them.

The default contrasts in R are “treatment” contrasts (aka “dummy coding”), where each level within a factor is identified within a matrix of binary 0 / 1 variables, with the first level chosen as the reference category. They’re called “treatment” contrasts, because of the typical use case where there is one control group (the reference group) and one or more treatment groups that are to be compared to the controls. It is easy to change the default contrasts to something other than treatment contrasts, though this is rarely needed. More often, we may want to change the reference group in treatment contrasts or get all sets of pairwise contrasts between factor levels.

4.2.1 Default treatment (dummy) contrasts

Let us take Fuel_Type as an example. There are three fuel types, namely CNG, Diesel and Petrol.

contrasts(car_resale$Fuel_Type)
##        Diesel Petrol
## CNG         0      0
## Diesel      1      0
## Petrol      0      1

Notice that, by default, two dummy variables, namely Diesel and Petrol, are created. CNG, the first level, serves as the reference category.
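
To see what these dummy variables look like inside a model, the sketch below builds a small, hypothetical factor (not the car_resale data) and expands it with model.matrix(), the base R function that lm() relies on to build its design matrix. The object names fuel and dummies are illustrative only.

```r
# Hypothetical toy factor with the same three levels as Fuel_Type.
fuel <- factor(c("CNG", "Diesel", "Petrol", "Petrol", "Diesel"))

# Treatment contrasts: one 0/1 column per non-reference level.
# The first level ("CNG") is the reference and gets no column of its own.
dummies <- model.matrix(~ fuel)
print(dummies)

# Names of the columns that a model would receive.
dummy_names <- colnames(dummies)
print(dummy_names)
```

A CNG observation has 0 in both dummy columns, so its effect is absorbed into the intercept.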

4.2.2 Changing the reference group

We can change the reference group by using relevel() as shown in the code chunk below.

car_resale$Fuel_Type <- relevel(car_resale$Fuel_Type, ref = "Petrol")

If you re-run the code chunk below,

contrasts(car_resale$Fuel_Type)

notice that the newly created dummy variables are CNG and Diesel.

4.3 Fitting the model

car.slr1 <- lm(Price ~ Doors, data = car_resale)
summary(car.slr1)
## 
## Call:
## lm(formula = Price ~ Doors, data = car_resale)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7153.2 -2253.2  -857.3   991.8 20996.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     8100       2514   3.222   0.0013 **
## Doors3          2007       2518   0.797   0.4255   
## Doors4          1707       2532   0.674   0.5004   
## Doors5          3403       2518   1.352   0.1767   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3555 on 1432 degrees of freedom
## Multiple R-squared:  0.04108,    Adjusted R-squared:  0.03908 
## F-statistic: 20.45 on 3 and 1432 DF,  p-value: 5.605e-13

4.4 Incorporating lm() in ggplot2

The code chunk below shows that lm() can be used in ggplot2 by passing it to the method argument of geom_smooth().

ggplot(data=car_resale,  
       aes(x=`Age_08_04`, y=`Price`)) +
  geom_point() +
  geom_smooth(method = lm)

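The figure above reveals that there are a few statistical outliers with relatively high selling prices. The code chunk below refits the simple linear regression model with log10(Price) as the dependent variable.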

condo.slr1 <- lm(formula = log10(Price) ~ Age_08_04, data = car_resale)
summary(condo.slr1)
## 
## Call:
## lm(formula = log10(Price) ~ Age_08_04, data = car_resale)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44410 -0.03247  0.00250  0.03906  0.22534 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.349e+00  5.180e-03  839.67   <2e-16 ***
## Age_08_04   -6.064e-03  8.786e-05  -69.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06191 on 1434 degrees of freedom
## Multiple R-squared:  0.7686, Adjusted R-squared:  0.7684 
## F-statistic:  4763 on 1 and 1434 DF,  p-value: < 2.2e-16

ggplot(data=car_resale,  
       aes(x=`Age_08_04`, y=log10(Price))) +
  geom_point() +
  geom_smooth(method = lm)
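
Because the response is on a log10 scale, the slope is easier to read after back-transforming. The sketch below hard-codes the two coefficient estimates printed in the summary above (rather than refitting the model) and converts the slope into a percentage change in price per additional month of age. The variable names and the 24-month example car are illustrative assumptions.

```r
# Coefficient estimates copied from the summary output above.
b0 <- 4.349       # intercept on the log10(Price) scale
b1 <- -6.064e-03  # change in log10(Price) per additional month of age

# A one-month increase in age multiplies the predicted price by 10^b1.
monthly_factor <- 10^b1
pct_change_per_month <- (monthly_factor - 1) * 100

# Back-transformed prediction for a hypothetical 24-month-old car.
price_24 <- 10^(b0 + b1 * 24)

cat("Multiplicative change per month:", round(monthly_factor, 4), "\n")
cat("Percentage change per month:", round(pct_change_per_month, 2), "\n")
cat("Predicted price at 24 months:", round(price_24), "\n")
```

Under this model, each additional month of age lowers the predicted price by roughly 1.4%.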

5 Building Multiple Linear Regression - olsrr methods

5.1 Why olsrr?

In the previous section, we have shared with you how to build simple and multiple linear regression models by using lm() of R stats. Although lm() meets the need for building linear regression models, it does not provide functions to test for multicollinearity or to test the three classic regression assumptions. Furthermore, lm() does not provide functions for variable selection. To meet these analysis needs, one has to look into several different R packages, for example vif() of the car package to test for multicollinearity, ad.test() of the nortest package to test the normality assumption, and bptest() of the lmtest package to test for heteroscedasticity, just to name a few.

olsrr, on the other hand, provides all these necessary tests and methods under one roof. In this and the following sections, we will share with you how olsrr can be used to meet these analysis needs.

5.2 Initial Model

The code chunk below builds a multiple linear regression model for the trade-in prices by using the continuous explanatory variables.

car.mlr <- lm(formula = Price ~ Age_08_04 + Mfg_Month + Mfg_Year + KM + Quarterly_Tax + Weight + Guarantee_Period, data=car_resale)
summary(car.mlr)
## 
## Call:
## lm(formula = Price ~ Age_08_04 + Mfg_Month + Mfg_Year + KM + 
##     Quarterly_Tax + Weight + Guarantee_Period, data = car_resale)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10606.7   -739.4      0.5    735.7   6509.5 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.080e+03  1.079e+03  -1.001   0.3168    
## Age_08_04        -1.239e+02  2.717e+00 -45.610   <2e-16 ***
## Mfg_Month        -1.092e+02  1.090e+01 -10.019   <2e-16 ***
## Mfg_Year                 NA         NA      NA       NA    
## KM               -2.286e-02  1.248e-03 -18.310   <2e-16 ***
## Quarterly_Tax    -1.024e+00  1.241e+00  -0.825   0.4094    
## Weight            1.949e+01  9.905e-01  19.681   <2e-16 ***
## Guarantee_Period  2.592e+01  1.238e+01   2.093   0.0365 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1366 on 1429 degrees of freedom
## Multiple R-squared:  0.8587, Adjusted R-squared:  0.8581 
## F-statistic:  1447 on 6 and 1429 DF,  p-value: < 2.2e-16

Quiz: What can you observe from the output?

5.3 Checking for multicollinearity

Before building a multiple regression model, it is important to ensure that the independent variables used are not highly correlated with each other. If highly correlated independent variables are used in building a regression model by mistake, the quality of the model will be compromised. This phenomenon is known as multicollinearity in statistics.

5.3.1 Computing VIF

Variance inflation factors (VIF) measure the inflation in the variances of the parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is inflated by the existence of correlation among the predictor variables in the model. A VIF of 1 means that there is no correlation among the kth predictor and the remaining predictor variables, and hence the variance of βk is not inflated at all. The general rule of thumb is that VIFs exceeding 5 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
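
The definition above can be reproduced by hand: regress the kth predictor on the remaining predictors, take the R-squared of that auxiliary regression, and compute 1 / (1 - R-squared). The tolerance is the denominator, 1 - R-squared. The sketch below does this in base R, using the built-in mtcars data as a stand-in for car_resale; the choice of predictors is arbitrary and for illustration only.

```r
# Built-in mtcars data stands in for car_resale here.
predictors <- c("wt", "hp", "disp")

vif_by_hand <- sapply(predictors, function(k) {
  others <- setdiff(predictors, k)
  # Auxiliary regression: predictor k on the remaining predictors.
  aux <- lm(reformulate(others, response = k), data = mtcars)
  r2 <- summary(aux)$r.squared
  1 / (1 - r2)  # VIF_k = 1 / (1 - R_k^2); tolerance is 1 - R_k^2
})

print(round(vif_by_hand, 3))
print(round(1 / vif_by_hand, 3))  # tolerance
```

A predictor that is uncorrelated with the others has an auxiliary R-squared of 0 and hence a VIF of exactly 1.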

In the code chunk below, ols_vif_tol() of the olsrr package is used to test if there are signs of multicollinearity.

ols_vif_tol(car.mlr)
##          Variables Tolerance      VIF
## 1        Age_08_04 0.0000000      Inf
## 2        Mfg_Month 0.0000000      Inf
## 3         Mfg_Year 0.0000000      Inf
## 4               KM 0.5933991 1.685206
## 5    Quarterly_Tax 0.4991949 2.003225
## 6           Weight 0.4785777 2.089525
## 7 Guarantee_Period 0.9360399 1.068331

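Since the VIFs of Age_08_04, Mfg_Month and Mfg_Year are greater than 10 (in fact, Inf), we can safely conclude that there are signs of multicollinearity among these independent variables. A tolerance of 0 means that each of these three variables is an exact linear combination of the other predictors, which is also why lm() reported NA for Mfg_Year in the output of section 5.2.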
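
An infinite VIF arises when one predictor is an exact linear combination of others. The sketch below reproduces the effect with simulated data. It assumes, purely for illustration, that age is measured in months as of August 2004 and is derived from the manufacturing year and month; this definition is inferred from the variable name and is not confirmed by the data file. All values are simulated.

```r
set.seed(1)
n <- 50

# Simulated stand-ins for Mfg_Year and Mfg_Month.
year  <- sample(1998:2004, n, replace = TRUE)
month <- sample(1:12, n, replace = TRUE)

# Assumed definition: age in months as of August 2004.
age <- (2004 - year) * 12 + (8 - month)

# Simulated price that depends on age only.
price <- 20000 - 120 * age + rnorm(n, sd = 500)

# year is fully determined by age and month, so lm() cannot estimate it.
fit <- lm(price ~ age + month + year)
print(coef(fit))
```

lm() keeps the predictors in formula order and sets the coefficient of the redundant one (here year) to NA, the same pattern as the Mfg_Year row in section 5.2.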

5.3.2 Detecting multicollinearity using correlation matrix

A correlation matrix is commonly used to visualise the relationships between the independent variables. Besides pairs() of base R, many packages support the display of a correlation matrix. In this section, the corrplot package will be used.

The code chunk below is used to plot a correlation matrix of the independent variables in the car_resale data.frame.

corrplot(cor(car_resale[, 4:10]), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 1.0, method = "number", type = "upper")

Matrix reordering is very important for mining the hidden structure and patterns in the matrix. Besides the default original order, corrplot provides four reordering methods (parameter order), named “AOE”, “FPC”, “hclust” and “alphabet”. In the code chunk above, AOE order is used. It orders the variables by the angular order of the eigenvectors.

From the correlation matrix, it is clear that Age_08_04 is highly correlated with Mfg_Year. In view of this, it is wiser to include only one of them in the subsequent model building. As a result, Mfg_Year is excluded from the subsequent model building.

Let us perform the multicollinearity check again by using the revised model.

car.mlr <- lm(formula = Price ~ Age_08_04 + Mfg_Month + KM + Quarterly_Tax + Weight + Guarantee_Period, data=car_resale)
ols_vif_tol(car.mlr)
##          Variables Tolerance      VIF
## 1        Age_08_04 0.5095030 1.962697
## 2        Mfg_Month 0.9735547 1.027164
## 3               KM 0.5933991 1.685206
## 4    Quarterly_Tax 0.4991949 2.003225
## 5           Weight 0.4785777 2.089525
## 6 Guarantee_Period 0.9360399 1.068331

Since the VIFs of the independent variables are all less than 10, we can safely conclude that there are no signs of multicollinearity among the independent variables.

5.3.3 The revised model

The code chunk below uses lm() to calibrate the revised multiple linear regression model. However, instead of displaying the output directly, ols_regress() of the olsrr package is used to display the regression report.

car.mlr <- lm(formula = Price ~ Age_08_04 + Mfg_Month + KM + Quarterly_Tax + Weight + Guarantee_Period, data=car_resale)
ols_regress(car.mlr)
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.927       RMSE                  1366.403 
## R-Squared               0.859       Coef. Var               12.733 
## Adj. R-Squared          0.858       MSE                1867056.840 
## Pred R-Squared          0.853       MAE                    960.600 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    16209217239.610           6    2701536206.602    1446.949    0.0000 
## Residual       2668024224.167        1429       1867056.840                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                          Parameter Estimates                                           
## ------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower       upper 
## ------------------------------------------------------------------------------------------------------
##      (Intercept)    -1080.337      1078.775                  -1.001    0.317    -3196.490    1035.816 
##        Age_08_04     -123.916         2.717       -0.635    -45.610    0.000     -129.245    -118.586 
##        Mfg_Month     -109.203        10.899       -0.101    -10.019    0.000     -130.583     -87.822 
##               KM       -0.023         0.001       -0.236    -18.310    0.000       -0.025      -0.020 
##    Quarterly_Tax       -1.024         1.241       -0.012     -0.825    0.409       -3.459       1.411 
##           Weight       19.494         0.990        0.283     19.681    0.000       17.551      21.437 
## Guarantee_Period       25.919        12.382        0.022      2.093    0.036        1.631      50.208 
## ------------------------------------------------------------------------------------------------------

One advantage of using ols_regress() to display the model report, instead of the lm() report directly, is that its format is tidier and easier to understand than the latter. Notice that the Model Summary is displayed first, followed by the ANOVA and Parameter Estimates reports. Furthermore, it also provides standardized betas and confidence intervals for the coefficients.
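
For readers who want to see where the standardized betas and confidence intervals come from, the sketch below derives them in base R. It uses the built-in mtcars data and an arbitrary two-predictor model as a stand-in, and it illustrates the standard formulas rather than the internals of ols_regress().

```r
# Built-in mtcars data stands in for car_resale here.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized beta: raw slope rescaled by sd(x) / sd(y).
b <- coef(fit)[-1]                    # drop the intercept
sd_x <- sapply(mtcars[, names(b)], sd)
std_beta <- b * sd_x / sd(mtcars$mpg)
print(round(std_beta, 3))

# Cross-check: refit the model on z-scored variables.
z <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
fit_z <- lm(mpg ~ wt + hp, data = z)
print(round(coef(fit_z)[-1], 3))

# 95% confidence intervals for the raw coefficients.
ci <- confint(fit, level = 0.95)
print(ci)
```

Standardized betas are unit-free, so they can be compared across predictors measured on different scales.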

6 Regression Model Diagnostic Test

Linear regression makes several assumptions about the data at hand. This section describes these regression assumptions and shares with you a collection of functions provided by the olsrr package for regression diagnostics.

After performing a regression analysis, you should always check if the model works well for the data at hand.

A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R-squared that tells us how well the linear regression model fits the data. This has been described in the earlier sections (i.e. 4.1 and 5.3.3).

6.1 Linearity assumption test

As obvious as this may seem, linear regression assumes that there exists a linear relationship between the dependent variable and the predictors. Violation of this assumption is very serious: it means that your linear model probably does a bad job at predicting your actual (non-linear) data. Perhaps the relationship between your predictor(s) and the dependent variable is actually curvilinear or cubic. If that is the case, a linear model does a bad job at modelling that relationship, and it is inappropriate to use such a model. There’s no point in worrying about significance tests or confidence intervals if a linear model doesn’t reflect your non-linear data. Hence, when building a linear regression model, it is important for us to test the assumption of linearity and additivity of the relationship between the dependent and independent variables.

There are many methods that can be used to test the linearity assumption. By and large, they are graphical in nature. One of the most commonly used graphical methods for testing the linearity assumption is the scatter plot. In the scatter plot, the y-axis is used to map the residuals (errors) and the x-axis is used to map the fitted values (predicted values).
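
To get a feel for what such a plot looks like when the assumption holds and when it fails, the sketch below simulates two data sets in base R, one truly linear and one with a quadratic term, and fits a straight line to both. The data, object names and the curvature() helper are hypothetical and serve only as an illustration.

```r
set.seed(42)
x <- runif(200, min = 0, max = 10)
y_linear <- 2 + 3 * x + rnorm(200)              # truly linear
y_curved <- 2 + 3 * x + 0.5 * x^2 + rnorm(200)  # quadratic term ignored by the fit

fit_linear <- lm(y_linear ~ x)
fit_curved <- lm(y_curved ~ x)

# Residuals (y-axis) against fitted values (x-axis).
par(mfrow = c(1, 2))
plot(fitted(fit_linear), resid(fit_linear),
     xlab = "Fitted values", ylab = "Residuals", main = "Linear data")
abline(h = 0, lty = 2)
plot(fitted(fit_curved), resid(fit_curved),
     xlab = "Fitted values", ylab = "Residuals", main = "Curved data")
abline(h = 0, lty = 2)

# Rough numeric summary: correlation of the residuals with the squared,
# centred fitted values. Values near 0 suggest no leftover curvature.
curvature <- function(fit) {
  centred <- fitted(fit) - mean(fitted(fit))
  cor(resid(fit), centred^2)
}
curv_linear <- curvature(fit_linear)
curv_curved <- curvature(fit_curved)
print(c(linear = curv_linear, curved = curv_curved))
```

In the linear case the points form a shapeless band around 0, whereas the curved case shows a clear U-shaped pattern.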

In the code chunk below, the ols_plot_resid_fit() of olsrr package is used to plot the scatter plot for linearity assumption test.

ols_plot_resid_fit(car.mlr)

The figure above reveals that most of the data points are scattered around the 0 line; hence, we can safely conclude that the relationships between the dependent variable and the independent variables are linear.

6.2 Normality assumption test

The assumption of normality in regression manifests in three ways:

  • For confidence intervals around a parameter to be accurate, the parameter must come from a normal distribution.
  • For significance tests of models to be accurate, the sampling distribution of the thing you’re testing must be normal.
  • To get the best estimates of parameters (i.e., betas in a regression equation), the residuals in the population must be normally distributed.

This assumption is most important when you have a small sample size (because central limit theorem isn’t working in your favor), and when you’re interested in constructing confidence intervals/doing significance testing.

olsrr package provides both graphical and statistical testing methods to meet the need for testing the normality assumption.

The graphical functions provided include ols_plot_resid_qq() for the normal Q-Q plot and ols_plot_resid_hist() for the residual histogram.

The statistical testing, on the other hand, is supported by ols_test_normality().

In this section, we are going to focus on the normal Q-Q plot, the residual histogram and the formal statistical testing methods.

6.2.1 Normal Q-Q plot

A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight. Here’s an example of a Normal Q-Q plot when both sets of quantiles truly come from Normal distributions.
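
As a base R illustration, the sketch below draws normal Q-Q plots for two simulated samples, one drawn from a normal distribution and one from a right-skewed (exponential) distribution. The samples and the qq_cor() helper are hypothetical; the helper simply measures how closely the points follow a straight line.

```r
set.seed(123)
normal_sample <- rnorm(300)
skewed_sample <- rexp(300)

par(mfrow = c(1, 2))
qqnorm(normal_sample, main = "Normal sample")
qqline(normal_sample)
qqnorm(skewed_sample, main = "Right-skewed sample")
qqline(skewed_sample)

# Correlation between theoretical and sample quantiles:
# the closer to 1, the straighter the Q-Q plot.
qq_cor <- function(x) {
  q <- qqnorm(x, plot.it = FALSE)
  cor(q$x, q$y)
}
cor_normal <- qq_cor(normal_sample)
cor_skewed <- qq_cor(skewed_sample)
print(c(normal = cor_normal, skewed = cor_skewed))
```

The points of the normal sample hug the reference line, while the skewed sample bends away from it in the upper tail.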

For ols_plot_resid_qq(), the normal Q-Q plot is plotted by using the output of the multiple linear regression as shown in the code chunk below.

ols_plot_resid_qq(car.mlr)

Since most of the data points fall along the straight line, we can conclude that the model conforms to the normality assumption.

6.2.2 Residual histogram plot

We can also validate the normality assumption by plotting the residuals as a histogram. If the model conforms to the normality assumption, the histogram should resemble a bell shape with the peak centred around 0.

The code chunk below is used to plot the residual histogram.

ols_plot_resid_hist(car.mlr)

The figure above reveals that the residuals of the multiple linear regression model (i.e. car.mlr) resemble a normal distribution.

6.2.3 Statistical testing for normality

If you prefer formal statistical test methods, the ols_test_normality() of olsrr package can be used as shown in the code chunk below.

ols_test_normality(car.mlr)
## -----------------------------------------------
##        Test             Statistic       pvalue  
## -----------------------------------------------
## Shapiro-Wilk              0.9248         0.0000 
## Kolmogorov-Smirnov        0.0587          1e-04 
## Cramer-von Mises         119.5711        0.0000 
## Anderson-Darling         12.5844         0.0000 
## -----------------------------------------------

Four commonly used normality assumption test statistics are supported by ols_test_normality(). They are:

  • Kolmogorov Smirnov statistic
  • Shapiro Wilk statistic
  • Cramer von Mises statistic
  • Anderson Darling statistic

The summary table above reveals that the p-values of the four tests are well below the alpha value of 0.05. Hence, we reject the null hypothesis that the residuals are normally distributed and infer that the residuals depart from a normal distribution. Note that with 1436 observations these tests are sensitive to even small departures from normality, so the result should be read together with the Q-Q plot and the residual histogram.
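
The decision rule for these tests can be illustrated with shapiro.test() of base R. In every one of these tests the null hypothesis is that the data come from a normal distribution, so a small p-value is evidence against normality. The two samples below are simulated and the decision() helper is hypothetical.

```r
set.seed(2024)
normal_resid <- rnorm(500)      # simulated residuals that are normal
skewed_resid <- rexp(500) - 1   # simulated residuals that are right-skewed

sw_normal <- shapiro.test(normal_resid)
sw_skewed <- shapiro.test(skewed_resid)

alpha <- 0.05
# H0: the sample comes from a normal distribution.
decision <- function(p_value) {
  if (p_value < alpha) "reject normality" else "do not reject normality"
}

cat("Normal sample p-value:", format.pval(sw_normal$p.value), "\n")
cat("Skewed sample p-value:", format.pval(sw_skewed$p.value),
    "->", decision(sw_skewed$p.value), "\n")
```

Failing to reject the null hypothesis does not prove normality; it only means that the sample shows no strong evidence against it.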

6.3 Homogeneity assumption test

Heteroscedasticity means unequal scatter. In regression analysis, we talk about heteroscedasticity in the context of the residuals or error term. Specifically, heteroscedasticity is a systematic change in the spread of the residuals over the range of measured values. Heteroscedasticity is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population that has a constant variance (homoscedasticity).

6.3.1 How to identify heteroscedasticity with residual plots

The easiest way to check for heteroscedasticity is to take a good look at your data. Ideally, the data points follow the pattern of a line with a roughly constant spread around it, but sometimes they do not. The quickest way to identify heteroscedastic data is to look at the shape that the plotted data take.

Heteroscedasticity produces a distinctive fan or cone shape in residual plots. To check for heteroscedasticity, we need to assess the residuals by fitted value plots specifically. The code chunk below will be used to plot the fitted values against the residual values.

ols_plot_resid_fit(car.mlr)

The figure above does not show any distinctive fan or cone shape in the residual plot. We therefore say that the residuals are homoscedastic.

6.3.2 Formal homogeneity assumption test

Besides the residual plot, olsrr also provides the following four tests for detecting heteroscedasticity:

  • Bartlett Test
  • Breusch Pagan Test
  • Score Test
  • F Test

6.3.3 Breusch Pagan Test

Breusch Pagan Test was introduced by Trevor Breusch and Adrian Pagan in 1979.

  • It is used to test for heteroskedasticity in a linear regression model and assumes that the error terms are normally distributed.
  • It tests whether the variance of the errors from a regression is dependent on the values of the independent variables.
  • It is a χ2 test.

You can perform the test using the fitted values of the model, the predictors in the model or a subset of the independent variables. It includes options to perform multiple tests and p-value adjustments. The options for p-value adjustments include Bonferroni, Sidak and Holm’s method.
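
The idea behind the test can be sketched in a few lines of base R. The sketch below follows the classic form of the Breusch Pagan statistic: the squared residuals are scaled by the maximum-likelihood error variance, regressed on the chosen variables, and half of the explained sum of squares is compared with a χ2 distribution. It uses the built-in mtcars data as a stand-in and is not taken from the olsrr source code, so small implementation details may differ.

```r
# Built-in mtcars data stands in for car_resale here.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)
e2 <- resid(fit)^2
sigma2_ml <- sum(e2) / n   # maximum-likelihood estimate of the error variance
g <- e2 / sigma2_ml        # scaled squared residuals (mean is 1)

# Auxiliary regression of the scaled squared residuals on the predictors.
aux <- lm(g ~ wt + hp, data = mtcars)

# Classic Breusch Pagan statistic: half of the explained sum of squares.
ess <- sum((fitted(aux) - mean(g))^2)
bp_stat <- ess / 2
bp_df <- length(coef(aux)) - 1   # number of variables in the auxiliary model
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("Chi2 =", round(bp_stat, 4), " DF =", bp_df, " Prob > Chi2 =", signif(bp_p, 4), "\n")
```

Replacing the predictors in the auxiliary regression with the fitted values of the model gives the one-degree-of-freedom variant of the test.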

6.3.3.1 Use fitted values of the model

The code chunk below uses the fitted values of the model to perform the homogeneity assumption test.

ols_test_breusch_pagan(car.mlr)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##           Test Summary           
##  --------------------------------
##  DF            =    1 
##  Chi2          =    501.7464 
##  Prob > Chi2   =    3.962598e-111

6.3.3.2 Use independent variables of the model

ols_test_breusch_pagan(car.mlr, rhs = TRUE)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                                  Data                                   
##  -----------------------------------------------------------------------
##  Response : Price 
##  Variables: Age_08_04 Mfg_Month KM Quarterly_Tax Weight Guarantee_Period 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    6 
##  Chi2          =    2198.1848 
##  Prob > Chi2   =    0.0000

6.3.4 F Test

The F test checks for heteroskedasticity under the assumption that the errors are independent and identically distributed (i.i.d.). You can perform the test using the fitted values of the model, the predictors in the model or a subset of the independent variables.
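
One common way to build such an F test is to regress the squared residuals on the chosen variables and read off the overall F statistic of that auxiliary regression. The sketch below does this in base R with the built-in mtcars data as a stand-in. It is an assumption that this mirrors what ols_test_f() computes; the sketch is not taken from the olsrr source code.

```r
# Built-in mtcars data stands in for car_resale here.
fit <- lm(mpg ~ wt + hp, data = mtcars)

e2 <- resid(fit)^2    # squared residuals
yhat <- fitted(fit)   # fitted values of the original model

# Auxiliary regression of the squared residuals on the fitted values.
aux <- lm(e2 ~ yhat)

# Overall F statistic of the auxiliary regression: value, numdf, dendf.
f <- summary(aux)$fstatistic
f_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

cat("Num DF =", f[["numdf"]], " Den DF =", f[["dendf"]],
    " F =", round(f[["value"]], 4), " Prob > F =", signif(f_p, 4), "\n")
```

With the fitted values as the only auxiliary variable, the numerator has 1 degree of freedom and the denominator has n - 2.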

6.3.4.1 Use fitted values of the model

The code chunk below uses ols_test_f() of olsrr package to perform homogeneity assumption test on fitted values of the model.

ols_test_f(car.mlr)
## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: fitted values of Price 
## 
##         Test Summary         
##  ----------------------------
##  Num DF     =    1 
##  Den DF     =    1434 
##  F          =    111.567 
##  Prob > F   =    3.633432e-25

6.3.4.2 Use independent variables of the model

The code chunk below uses ols_test_f() of olsrr package to perform homogeneity assumption test on the independent variables of the model.

ols_test_f(car.mlr, rhs = TRUE)
## 
##  F Test for Heteroskedasticity
##  -----------------------------
##  Ho: Variance is homogenous
##  Ha: Variance is not homogenous
## 
##  Variables: Age_08_04 Mfg_Month KM Quarterly_Tax Weight Guarantee_Period 
## 
##         Test Summary          
##  -----------------------------
##  Num DF     =    6 
##  Den DF     =    1429 
##  F          =    110.1566 
##  Prob > F   =    2.795615e-114

For both the Breusch Pagan test and the F test, the p-values are far below 0.05, so the null hypothesis of constant variance is rejected for this model.

7 Stepwise Regression for Variable Selection: olsrr methods

When building regression models, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model. The actual set of independent variables used in the final regression model must be determined by analysis of the data. Determining this subset is called the variable selection problem.

Finding this subset of independent (also known as regressor) variables involves two opposing objectives. First, we want the regression model to be as complete and realistic as possible. We want every regressor that is even remotely related to the dependent variable to be included. Second, we want to include as few variables as possible because each irrelevant regressor decreases the precision of the estimated coefficients and predicted values. Also, the presence of extra variables increases the complexity of data collection and model maintenance. The goal of variable selection becomes one of parsimony: achieve a balance between simplicity (as few regressors as possible) and fit (as many regressors as needed). One of the commonly used strategies for striking this balance is the stepwise variable selection method.

In fact, the stepwise regression method consists of three variable selection strategies. They are:

  • Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.
  • Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.
  • Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.

In the olsrr package, these three stepwise variable selection strategies are supported by ols_step_forward_p(), ols_step_backward_p(), and ols_step_both_p() respectively.
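
To make the forward selection logic concrete, the sketch below implements a bare-bones version in base R with add1(). The function forward_by_p() and its p_enter threshold are hypothetical and only illustrate the idea; this is not how the olsrr functions are implemented, and their entry criterion may differ. The built-in mtcars data is used as a stand-in.

```r
# Minimal forward selection by p-value (illustrative sketch only).
forward_by_p <- function(data, response, candidates, p_enter = 0.05) {
  selected <- character(0)
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break

    # Current model: intercept only at the start, then the selected terms.
    rhs <- if (length(selected) > 0) selected else "1"
    form <- reformulate(rhs, response = response)
    fit <- lm(form, data = data)

    # F test for adding each remaining variable, one at a time.
    scope <- reformulate(c(selected, remaining))
    tab <- add1(fit, scope = scope, test = "F")
    tab <- tab[rownames(tab) != "<none>", , drop = FALSE]

    # Add the most significant variable, if it passes the threshold.
    best <- which.min(tab[["Pr(>F)"]])
    if (tab[["Pr(>F)"]][best] >= p_enter) break
    selected <- c(selected, rownames(tab)[best])
  }
  selected
}

chosen <- forward_by_p(mtcars, "mpg", c("wt", "hp", "qsec", "drat"))
print(chosen)
```

Backward elimination follows the same pattern in reverse, starting from the full model and using drop1() to find the least significant variable at each step.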

7.1 Forward stepwise regression

In order to show how to perform variable selection by using the stepwise regression functions, let us first build a complex multiple linear regression model using most of the independent variables provided.

car.mlr <- lm(formula = Price ~ Age_08_04 + Mfg_Month + KM + Quarterly_Tax +
                Weight + Guarantee_Period + HP_Bin + CC_bin + Met_Color +
                Mfr_Guarantee + Airco + Automatic_airco + Boardcomputer +
                Central_Lock + Powered_Windows + Sport_Model +
                Backseat_Divider + Tow_Bar,
              data = car_resale)
ols_regress(car.mlr)
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.954       RMSE                  1099.239 
## R-Squared               0.909       Coef. Var               10.244 
## Adj. R-Squared          0.908       MSE                1208325.836 
## Pred R-Squared          0.903       MAE                    812.346 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                      ANOVA                                       
## --------------------------------------------------------------------------------
##                        Sum of                                                   
##                       Squares          DF      Mean Square       F         Sig. 
## --------------------------------------------------------------------------------
## Regression    17167460405.194          20    858373020.260    710.382    0.0000 
## Residual       1709781058.583        1415      1208325.836                      
## Total         18877241463.777        1435                                       
## --------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                            
## --------------------------------------------------------------------------------------------------------
##             model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## --------------------------------------------------------------------------------------------------------
##       (Intercept)     5006.450      1099.263                   4.554    0.000     2850.089     7162.811 
##         Age_08_04     -113.432         3.135       -0.582    -36.177    0.000     -119.582     -107.281 
##         Mfg_Month      -95.155         8.928       -0.088    -10.658    0.000     -112.669      -77.642 
##                KM       -0.017         0.001       -0.178    -15.845    0.000       -0.019       -0.015 
##     Quarterly_Tax        9.718         1.359        0.110      7.153    0.000        7.053       12.383 
##            Weight       11.312         0.998        0.164     11.333    0.000        9.354       13.271 
##  Guarantee_Period       68.048        11.744        0.056      5.794    0.000       45.011       91.085 
##       HP_Bin> 120     4916.462       380.392        0.118     12.925    0.000     4170.268     5662.655 
##     HP_Bin100-120     3320.525       382.257        0.448      8.687    0.000     2570.673     4070.376 
##       CC_bin>1600    -1366.308       201.184       -0.120     -6.791    0.000    -1760.960     -971.656 
##        CC_bin1600    -3297.887       379.977       -0.447     -8.679    0.000    -4043.266    -2552.508 
##        Met_Color1       22.976        64.553        0.003      0.356    0.722     -103.653      149.605 
##    Mfr_Guarantee1      285.597        63.787        0.039      4.477    0.000      160.470      410.725 
##            Airco1      278.200        75.872        0.038      3.667    0.000      129.365      427.034 
##  Automatic_airco1     2256.939       157.344        0.144     14.344    0.000     1948.286     2565.592 
##    Boardcomputer1     -153.918        99.718       -0.019     -1.544    0.123     -349.528       41.693 
##     Central_Lock1      -52.167       124.157       -0.007     -0.420    0.674     -295.720      191.385 
##  Powered_Windows1      462.223       124.252        0.063      3.720    0.000      218.486      705.960 
##      Sport_Model1      278.658        72.177        0.035      3.861    0.000      137.072      420.243 
## Backseat_Divider1       -9.349        95.433       -0.001     -0.098    0.922     -196.555      177.857 
##          Tow_Bar1     -137.656        68.384       -0.017     -2.013    0.044     -271.802       -3.511 
## --------------------------------------------------------------------------------------------------------

Notice that the p-values of Met_Color1, Boardcomputer1, Central_Lock1, and Backseat_Divider1 are greater than 0.05.
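Rather than reading the table by eye, the coefficients above the 0.05 threshold can also be listed programmatically. The sketch below uses only base R and assumes the built-in mtcars data as a stand-in for car_resale. The same pattern should apply to car.mlr.

```r
# Assumption: the built-in mtcars data stands in for car_resale.
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table as a matrix; the "Pr(>|t|)" column holds the t-test p-values
coef_table <- summary(fit)$coefficients
p_values <- coef_table[, "Pr(>|t|)"]

# Drop the intercept, then keep the coefficients above the 0.05 threshold
p_values <- p_values[names(p_values) != "(Intercept)"]
not_significant <- names(p_values)[p_values > 0.05]

print(round(p_values, 3))
print(not_significant)
```

Note that for a categorical variable such as HP_Bin, this lists individual dummy coefficients rather than the variable as a whole.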

Now we are going to perform forward stepwise regression using a p-value criterion of 0.05. The code chunk below uses ols_step_forward_p() of the olsrr package to build a forward stepwise regression model by setting the penter argument to 0.05. Starting from a model with no independent variables, the function adds one variable at a time and stops when none of the remaining candidates has a p-value below 0.05. Independent variables that fail this criterion therefore never enter the model.

ols_step_forward_p(car.mlr, 
                   penter = 0.05,
                   print_plot = TRUE)
## 
##                                    Selection Summary                                     
## ----------------------------------------------------------------------------------------
##         Variable                          Adj.                                              
## Step        Entered         R-Square    R-Square      C(p)          AIC          RMSE       
## ----------------------------------------------------------------------------------------
##    1    Age_08_04             0.7684      0.7682    2186.0335    25518.9706    1746.0382    
##    2    Automatic_airco       0.8247      0.8244    1308.7496    25121.1462    1519.6553    
##    3    KM                    0.8427      0.8423    1029.8950    24967.7765    1440.1316    
##    4    Weight                0.8742      0.8739     538.5506    24648.0525    1287.9643    
##    5    HP_Bin                0.8830      0.8825     403.9815    24548.5871    1243.2596    
##    6    Mfg_Month             0.8905      0.8899     289.0095    24455.6261    1203.2452    
##    7    CC_bin                0.8965      0.8958     197.6782    24379.0785    1170.7882    
##    8    Powered_Windows       0.9005      0.8998     136.2015    24323.5954    1147.9904    
##    9    Quarterly_Tax         0.9036      0.9029      89.8123    24280.1756    1130.3748    
##   10    Guarantee_Period      0.9057      0.9049      59.6168    24251.1397    1118.6181    
##   11    Mfr_Guarantee         0.9070      0.9061      41.0703    24232.9769    1111.1829    
##   12    Sport_Model           0.9081      0.9072      25.0118    24217.0187    1104.6450    
##   13    Airco                 0.9090      0.9080      13.8241    24205.7609    1099.9446    
##   14    Tow_Bar               0.9092      0.9082      11.8620    24203.7537    1098.7979    
## ----------------------------------------------------------------------------------------

Notice that the four independent variables with p-values greater than 0.05 in the full model (Met_Color, Boardcomputer, Central_Lock, and Backseat_Divider) have not been entered into the model. Although these four variables are left out, it is interesting to note that the adjusted R-square remains at 0.908.
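The adjusted R-square values reported by olsrr can be cross-checked from the ANOVA tables using the standard formula, one minus the ratio of the residual mean square to the total mean square. The short sketch below plugs in the sums of squares and degrees of freedom printed for the full model and for the Age_08_04-only model. adj_r_squared() is a hypothetical helper written for this check.

```r
# Adjusted R-square = 1 - (SS_residual / df_residual) / (SS_total / df_total)
# adj_r_squared() is a hypothetical helper, not an olsrr function.
adj_r_squared <- function(ss_residual, df_residual, ss_total, df_total) {
  1 - (ss_residual / df_residual) / (ss_total / df_total)
}

# Figures copied from the ANOVA tables printed in this section
ss_total <- 18877241463.777
df_total <- 1435

full_model <- adj_r_squared(1709781058.583, 1415, ss_total, df_total)
age_only   <- adj_r_squared(4371763367.072, 1434, ss_total, df_total)

print(round(full_model, 3))  # full model with all 18 candidate variables
print(round(age_only, 3))    # model with Age_08_04 only
```

The two results should agree with the Adj. R-Squared values of 0.908 and 0.768 shown in the corresponding Model Summary tables.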

You can also request a detailed printout of each iteration by using the details argument, as shown in the code chunk below.

ols_step_forward_p(car.mlr, 
                   penter = 0.05, 
                   details = TRUE)
## Forward Selection Method    
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. Age_08_04 
## 2. Mfg_Month 
## 3. KM 
## 4. Quarterly_Tax 
## 5. Weight 
## 6. Guarantee_Period 
## 7. HP_Bin 
## 8. CC_bin 
## 9. Met_Color 
## 10. Mfr_Guarantee 
## 11. Airco 
## 12. Automatic_airco 
## 13. Boardcomputer 
## 14. Central_Lock 
## 15. Powered_Windows 
## 16. Sport_Model 
## 17. Backseat_Divider 
## 18. Tow_Bar 
## 
## We are selecting variables based on p value...
## 
## 
## Forward Selection: Step 1 
## 
## - Age_08_04 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.877       RMSE                  1746.038 
## R-Squared               0.768       Coef. Var               16.271 
## Adj. R-Squared          0.768       MSE                3048649.489 
## Pred R-Squared          0.767       MAE                   1246.747 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                        ANOVA                                        
## -----------------------------------------------------------------------------------
##                        Sum of                                                      
##                       Squares          DF        Mean Square       F          Sig. 
## -----------------------------------------------------------------------------------
## Regression    14505478096.705           1    14505478096.705    4758.001    0.0000 
## Residual       4371763367.072        1434        3048649.489                       
## Total         18877241463.777        1435                                          
## -----------------------------------------------------------------------------------
## 
##                                        Parameter Estimates                                         
## --------------------------------------------------------------------------------------------------
##       model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## --------------------------------------------------------------------------------------------------
## (Intercept)    20294.059       146.097                 138.908    0.000    20007.471    20580.646 
##   Age_08_04     -170.934         2.478       -0.877    -68.978    0.000     -175.795     -166.073 
## --------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 2 
## 
## - Automatic_airco 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.908       RMSE                  1519.655 
## R-Squared               0.825       Coef. Var               14.162 
## Adj. R-Squared          0.824       MSE                2309352.302 
## Pred R-Squared          0.823       MAE                   1096.665 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    15567939614.839           2    7783969807.419    3370.629    0.0000 
## Residual       3309301848.938        1433       2309352.302                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)    18841.989       144.054                 130.799    0.000    18559.410    19124.567 
##        Age_08_04     -149.135         2.384       -0.765    -62.550    0.000     -153.812     -144.458 
## Automatic_airco1     4121.588       192.156        0.262     21.449    0.000     3744.651     4498.524 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 3 
## 
## - KM 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.918       RMSE                  1440.132 
## R-Squared               0.843       Coef. Var               13.421 
## Adj. R-Squared          0.842       MSE                2073979.052 
## Pred R-Squared          0.841       MAE                   1036.274 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    15907303461.666           3    5302434487.222    2556.648    0.0000 
## Residual       2969938002.112        1432       2073979.052                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)    19059.802       137.573                 138.543    0.000    18789.935    19329.668 
##        Age_08_04     -134.462         2.534       -0.690    -53.064    0.000     -139.432     -129.491 
## Automatic_airco1     3994.025       182.373        0.254     21.900    0.000     3636.278     4351.771 
##               KM       -0.015         0.001       -0.156    -12.792    0.000       -0.017       -0.013 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 4 
## 
## - Weight 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.935       RMSE                  1287.964 
## R-Squared               0.874       Coef. Var               12.002 
## Adj. R-Squared          0.874       MSE                1658851.986 
## Pred R-Squared          0.871       MAE                    932.109 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    16503424271.945           4    4125856067.986    2487.176    0.0000 
## Residual       2373817191.832        1431       1658851.986                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                         Parameter Estimates                                          
## ----------------------------------------------------------------------------------------------------
##            model        Beta    Std. Error    Std. Beta       t        Sig        lower       upper 
## ----------------------------------------------------------------------------------------------------
##      (Intercept)    2054.376       905.464                   2.269    0.023     278.197    3830.555 
##        Age_08_04    -113.178         2.529       -0.580    -44.751    0.000    -118.139    -108.217 
## Automatic_airco1    2965.067       171.898        0.189     17.249    0.000    2627.868    3302.265 
##               KM      -0.021         0.001       -0.221    -19.387    0.000      -0.024      -0.019 
##           Weight      15.207         0.802        0.221     18.957    0.000      13.633      16.780 
## ----------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 5 
## 
## - HP_Bin 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.940       RMSE                  1243.260 
## R-Squared               0.883       Coef. Var               11.586 
## Adj. R-Squared          0.883       MSE                1545694.347 
## Pred R-Squared          0.880       MAE                    917.107 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    16668444241.758           6    2778074040.293    1797.298    0.0000 
## Residual       2208797222.019        1429       1545694.347                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                         Parameter Estimates                                          
## ----------------------------------------------------------------------------------------------------
##            model        Beta    Std. Error    Std. Beta       t        Sig        lower       upper 
## ----------------------------------------------------------------------------------------------------
##      (Intercept)    3041.437       879.356                   3.459    0.001    1316.469    4766.405 
##        Age_08_04    -115.989         2.487       -0.595    -46.630    0.000    -120.868    -111.109 
## Automatic_airco1    2592.316       169.810        0.165     15.266    0.000    2259.212    2925.421 
##               KM      -0.020         0.001       -0.207    -18.450    0.000      -0.022      -0.018 
##           Weight      14.136         0.782        0.205     18.075    0.000      12.602      15.671 
##      HP_Bin> 120    3799.123       395.361        0.091      9.609    0.000    3023.573    4574.672 
##    HP_Bin100-120     359.927        69.658        0.049      5.167    0.000     223.284     496.571 
## ----------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 6 
## 
## - Mfg_Month 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.944       RMSE                  1203.245 
## R-Squared               0.890       Coef. Var               11.213 
## Adj. R-Squared          0.890       MSE                1447798.984 
## Pred R-Squared          0.887       MAE                    877.335 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    16809784514.785           7    2401397787.826    1658.654    0.0000 
## Residual       2067456948.992        1428       1447798.984                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                         Parameter Estimates                                          
## ----------------------------------------------------------------------------------------------------
##            model        Beta    Std. Error    Std. Beta       t        Sig        lower       upper 
## ----------------------------------------------------------------------------------------------------
##      (Intercept)    4194.319       859.016                   4.883    0.000    2509.251    5879.387 
##        Age_08_04    -120.065         2.442       -0.616    -49.157    0.000    -124.857    -115.274 
## Automatic_airco1    2448.728       164.986        0.156     14.842    0.000    2125.086    2772.369 
##               KM      -0.019         0.001       -0.201    -18.439    0.000      -0.021      -0.017 
##           Weight      13.726         0.758        0.199     18.108    0.000      12.239      15.213 
##      HP_Bin> 120    3793.269       382.637        0.091      9.914    0.000    3042.679    4543.859 
##    HP_Bin100-120     373.679        67.431        0.050      5.542    0.000     241.405     505.953 
##        Mfg_Month     -95.142         9.629       -0.088     -9.880    0.000    -114.031     -76.253 
## ----------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 7 
## 
## - CC_bin 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.947       RMSE                  1170.788 
## R-Squared               0.896       Coef. Var               10.911 
## Adj. R-Squared          0.896       MSE                1370744.959 
## Pred R-Squared          0.891       MAE                    860.989 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    16922559152.253           9    1880284350.250    1371.724    0.0000 
## Residual       1954682311.525        1426       1370744.959                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     3578.871      1118.743                   3.199    0.001     1384.313     5773.429 
##        Age_08_04     -121.438         2.385       -0.623    -50.925    0.000     -126.116     -116.760 
## Automatic_airco1     2256.504       162.001        0.144     13.929    0.000     1938.718     2574.290 
##               KM       -0.016         0.001       -0.169    -14.335    0.000       -0.019       -0.014 
##           Weight       14.329         1.014        0.208     14.128    0.000       12.340       16.319 
##      HP_Bin> 120     4537.018       384.234        0.109     11.808    0.000     3783.294     5290.742 
##    HP_Bin100-120     3452.319       404.063        0.466      8.544    0.000     2659.696     4244.941 
##        Mfg_Month      -90.947         9.406       -0.084     -9.669    0.000     -109.397      -72.496 
##      CC_bin>1600     -792.201       169.670       -0.070     -4.669    0.000    -1125.031     -459.371 
##       CC_bin1600    -3273.172       401.153       -0.443     -8.159    0.000    -4060.085    -2486.260 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 8 
## 
## - Powered_Windows 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.949       RMSE                  1147.990 
## R-Squared               0.901       Coef. Var               10.698 
## Adj. R-Squared          0.900       MSE                1317881.944 
## Pred R-Squared          0.895       MAE                    838.276 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    16999259692.872          10    1699925969.287    1289.892    0.0000 
## Residual       1877981770.905        1425       1317881.944                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     4197.482      1099.951                   3.816    0.000     2039.785     6355.180 
##        Age_08_04     -118.102         2.379       -0.606    -49.649    0.000     -122.768     -113.436 
## Automatic_airco1     2220.587       158.916        0.141     13.973    0.000     1908.852     2532.322 
##               KM       -0.017         0.001       -0.174    -15.049    0.000       -0.019       -0.015 
##           Weight       13.384         1.002        0.194     13.355    0.000       11.418       15.350 
##      HP_Bin> 120     4337.881       377.655        0.104     11.486    0.000     3597.062     5078.701 
##    HP_Bin100-120     3472.803       396.205        0.469      8.765    0.000     2695.596     4250.010 
##        Mfg_Month      -91.668         9.223       -0.085     -9.939    0.000     -109.761      -73.576 
##      CC_bin>1600     -641.829       167.530       -0.057     -3.831    0.000     -970.460     -313.198 
##       CC_bin1600    -3382.072       393.600       -0.458     -8.593    0.000    -4154.170    -2609.974 
## Powered_Windows1      508.696        66.680        0.070      7.629    0.000      377.894      639.499 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 9 
## 
## - Quarterly_Tax 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.951       RMSE                  1130.375 
## R-Squared               0.904       Coef. Var               10.534 
## Adj. R-Squared          0.903       MSE                1277747.117 
## Pred R-Squared          0.898       MAE                    836.056 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    17057729568.466          11    1550702688.042    1213.623    0.0000 
## Residual       1819511895.311        1424       1277747.117                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     5199.111      1093.147                   4.756    0.000     3054.759     7343.463 
##        Age_08_04     -116.773         2.350       -0.599    -49.681    0.000     -121.384     -112.162 
## Automatic_airco1     2275.253       156.686        0.145     14.521    0.000     1967.892     2582.614 
##               KM       -0.017         0.001       -0.180    -15.773    0.000       -0.020       -0.015 
##           Weight       11.795         1.014        0.171     11.627    0.000        9.805       13.785 
##      HP_Bin> 120     5085.534       387.937        0.122     13.109    0.000     4324.543     5846.524 
##    HP_Bin100-120     3293.965       391.020        0.445      8.424    0.000     2526.929     4061.002 
##        Mfg_Month      -90.751         9.083       -0.084     -9.992    0.000     -108.567      -72.934 
##      CC_bin>1600    -1368.187       196.828       -0.121     -6.951    0.000    -1754.290     -982.084 
##       CC_bin1600    -3227.801       388.231       -0.437     -8.314    0.000    -3989.367    -2466.234 
## Powered_Windows1      509.226        65.657        0.070      7.756    0.000      380.431      638.021 
##    Quarterly_Tax        8.666         1.281        0.098      6.765    0.000        6.153       11.179 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 10 
## 
## - Guarantee_Period 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.952       RMSE                  1118.618 
## R-Squared               0.906       Coef. Var               10.424 
## Adj. R-Squared          0.905       MSE                1251306.531 
## Pred R-Squared          0.900       MAE                    826.060 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    17096632270.336          12    1424719355.861    1138.585    0.0000 
## Residual       1780609193.441        1423       1251306.531                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     4731.911      1085.018                   4.361    0.000     2603.504     6860.317 
##        Age_08_04     -114.562         2.360       -0.588    -48.552    0.000     -119.190     -109.933 
## Automatic_airco1     2386.531       156.336        0.152     15.265    0.000     2079.857     2693.204 
##               KM       -0.017         0.001       -0.178    -15.734    0.000       -0.019       -0.015 
##           Weight       11.793         1.004        0.171     11.748    0.000        9.824       13.762 
##      HP_Bin> 120     5090.503       383.904        0.122     13.260    0.000     4337.425     5843.581 
##    HP_Bin100-120     3256.161       387.012        0.440      8.414    0.000     2496.986     4015.337 
##        Mfg_Month      -90.326         8.988       -0.084    -10.049    0.000     -107.957      -72.694 
##      CC_bin>1600    -1524.097       196.777       -0.134     -7.745    0.000    -1910.101    -1138.092 
##       CC_bin1600    -3224.574       384.194       -0.437     -8.393    0.000    -3978.220    -2470.927 
## Powered_Windows1      510.964        64.975        0.070      7.864    0.000      383.507      638.421 
##    Quarterly_Tax       10.277         1.300        0.117      7.904    0.000        7.726       12.828 
## Guarantee_Period       57.643        10.338        0.048      5.576    0.000       37.363       77.922 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 11 
## 
## - Mfr_Guarantee 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.952       RMSE                  1111.183 
## R-Squared               0.907       Coef. Var               10.355 
## Adj. R-Squared          0.906       MSE                1234727.422 
## Pred R-Squared          0.901       MAE                    818.345 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    17121459070.214          13    1317035313.093    1066.661    0.0000 
## Residual       1755782393.563        1422       1234727.422                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     4317.608      1081.759                   3.991    0.000     2195.592     6439.623 
##        Age_08_04     -113.352         2.359       -0.581    -48.043    0.000     -117.980     -108.724 
## Automatic_airco1     2394.538       155.307        0.152     15.418    0.000     2089.883     2699.193 
##               KM       -0.017         0.001       -0.174    -15.376    0.000       -0.019       -0.015 
##           Weight       12.004         0.998        0.174     12.025    0.000       10.046       13.962 
##      HP_Bin> 120     4967.432       382.338        0.119     12.992    0.000     4217.424     5717.439 
##    HP_Bin100-120     3248.489       384.444        0.438      8.450    0.000     2494.351     4002.626 
##        Mfg_Month      -89.550         8.930       -0.083    -10.028    0.000     -107.068      -72.032 
##      CC_bin>1600    -1427.987       196.641       -0.126     -7.262    0.000    -1813.724    -1042.249 
##       CC_bin1600    -3233.196       381.645       -0.438     -8.472    0.000    -3981.844    -2484.549 
## Powered_Windows1      519.917        64.574        0.071      8.051    0.000      393.246      646.588 
##    Quarterly_Tax        9.621         1.300        0.109      7.401    0.000        7.071       12.171 
## Guarantee_Period       63.396        10.349        0.053      6.126    0.000       43.095       83.697 
##   Mfr_Guarantee1      281.356        62.745        0.038      4.484    0.000      158.273      404.439 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 12 
## 
## - Sport_Model 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.953       RMSE                  1104.645 
## R-Squared               0.908       Coef. Var               10.294 
## Adj. R-Squared          0.907       MSE                1220240.596 
## Pred R-Squared          0.902       MAE                    819.404 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                        
## ----------------------------------------------------------------------------------
##                        Sum of                                                     
##                       Squares          DF       Mean Square       F          Sig. 
## ----------------------------------------------------------------------------------
## Regression    17143279577.089          14    1224519969.792    1003.507    0.0000 
## Residual       1733961886.688        1421       1220240.596                       
## Total         18877241463.777        1435                                         
## ----------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     4698.021      1079.151                   4.353    0.000     2581.122     6814.920 
##        Age_08_04     -113.223         2.346       -0.581    -48.268    0.000     -117.824     -108.621 
## Automatic_airco1     2287.873       156.440        0.146     14.625    0.000     1980.995     2594.751 
##               KM       -0.017         0.001       -0.176    -15.647    0.000       -0.019       -0.015 
##           Weight       11.556         0.998        0.168     11.578    0.000        9.598       13.514 
##      HP_Bin> 120     4912.508       380.311        0.118     12.917    0.000     4166.477     5658.538 
##    HP_Bin100-120     3316.047       382.515        0.448      8.669    0.000     2565.692     4066.403 
##        Mfg_Month      -92.538         8.906       -0.086    -10.391    0.000     -110.008      -75.068 
##      CC_bin>1600    -1304.084       197.668       -0.115     -6.597    0.000    -1691.836     -916.332 
##       CC_bin1600    -3261.514       379.458       -0.442     -8.595    0.000    -4005.873    -2517.155 
## Powered_Windows1      535.327        64.297        0.073      8.326    0.000      409.199      661.456 
##    Quarterly_Tax        9.326         1.294        0.106      7.206    0.000        6.787       11.864 
## Guarantee_Period       69.994        10.406        0.058      6.726    0.000       49.581       90.406 
##   Mfr_Guarantee1      278.452        62.380        0.038      4.464    0.000      156.085      400.819 
##     Sport_Model1      284.859        67.363        0.036      4.229    0.000      152.718      417.000 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 13 
## 
## - Airco 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.953       RMSE                  1099.945 
## R-Squared               0.909       Coef. Var               10.250 
## Adj. R-Squared          0.908       MSE                1209878.034 
## Pred R-Squared          0.903       MAE                    815.408 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                       
## ---------------------------------------------------------------------------------
##                        Sum of                                                    
##                       Squares          DF       Mean Square       F         Sig. 
## ---------------------------------------------------------------------------------
## Regression    17159214655.940          15    1143947643.729    945.507    0.0000 
## Residual       1718026807.837        1420       1209878.034                      
## Total         18877241463.777        1435                                        
## ---------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     4703.433      1074.560                   4.377    0.000     2595.538     6811.328 
##        Age_08_04     -110.901         2.422       -0.569    -45.794    0.000     -115.652     -106.151 
## Automatic_airco1     2279.209       155.793        0.145     14.630    0.000     1973.600     2584.817 
##               KM       -0.017         0.001       -0.179    -15.958    0.000       -0.019       -0.015 
##           Weight       11.405         0.995        0.166     11.466    0.000        9.454       13.356 
##      HP_Bin> 120     4888.805       378.749        0.118     12.908    0.000     4145.838     5631.772 
##    HP_Bin100-120     3363.410       381.111        0.454      8.825    0.000     2615.809     4111.012 
##        Mfg_Month      -93.029         8.869       -0.086    -10.489    0.000     -110.426      -75.631 
##      CC_bin>1600    -1321.166       196.883       -0.117     -6.710    0.000    -1707.379     -934.954 
##       CC_bin1600    -3356.798       378.755       -0.455     -8.863    0.000    -4099.777    -2613.819 
## Powered_Windows1      419.705        71.513        0.057      5.869    0.000      279.423      559.987 
##    Quarterly_Tax        9.296         1.289        0.105      7.214    0.000        6.768       11.824 
## Guarantee_Period       71.755        10.373        0.060      6.917    0.000       51.407       92.102 
##   Mfr_Guarantee1      281.442        62.120        0.038      4.531    0.000      159.586      403.299 
##     Sport_Model1      295.123        67.136        0.037      4.396    0.000      163.428      426.819 
##           Airco1      272.811        75.172        0.038      3.629    0.000      125.351      420.271 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## Forward Selection: Step 14 
## 
## - Tow_Bar 
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.954       RMSE                  1098.798 
## R-Squared               0.909       Coef. Var               10.240 
## Adj. R-Squared          0.908       MSE                1207356.804 
## Pred R-Squared          0.903       MAE                    812.613 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                       
## ---------------------------------------------------------------------------------
##                        Sum of                                                    
##                       Squares          DF       Mean Square       F         Sig. 
## ---------------------------------------------------------------------------------
## Regression         1.7164e+10          16    1072750134.950    888.511    0.0000 
## Residual       1713239304.585        1419       1207356.804                      
## Total         18877241463.777        1435                                        
## ---------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     4685.924      1073.475                   4.365    0.000     2580.154     6791.693 
##        Age_08_04     -110.311         2.437       -0.566    -45.260    0.000     -115.092     -105.530 
## Automatic_airco1     2266.342       155.764        0.144     14.550    0.000     1960.789     2571.895 
##               KM       -0.017         0.001       -0.180    -15.994    0.000       -0.019       -0.015 
##           Weight       11.400         0.994        0.165     11.473    0.000        9.451       13.349 
##      HP_Bin> 120     4902.739       378.419        0.118     12.956    0.000     4160.419     5645.059 
##    HP_Bin100-120     3352.815       380.751        0.453      8.806    0.000     2605.920     4099.711 
##        Mfg_Month      -93.404         8.862       -0.086    -10.540    0.000     -110.787      -76.020 
##      CC_bin>1600    -1340.226       196.910       -0.118     -6.806    0.000    -1726.492     -953.959 
##       CC_bin1600    -3328.381       378.629       -0.451     -8.791    0.000    -4071.114    -2585.648 
## Powered_Windows1      419.491        71.438        0.057      5.872    0.000      279.355      559.627 
##    Quarterly_Tax        9.549         1.294        0.108      7.382    0.000        7.011       12.086 
## Guarantee_Period       72.489        10.369        0.060      6.991    0.000       52.149       92.828 
##   Mfr_Guarantee1      280.212        62.058        0.038      4.515    0.000      158.476      401.948 
##     Sport_Model1      287.518        67.174        0.036      4.280    0.000      155.747      419.290 
##           Airco1      274.843        75.101        0.038      3.660    0.000      127.523      422.163 
##         Tow_Bar1     -134.104        67.345       -0.017     -1.991    0.047     -266.210       -1.998 
## -------------------------------------------------------------------------------------------------------
## 
## 
## 
## No more variables to be added.
## 
## Variables Entered: 
## 
## + Age_08_04 
## + Automatic_airco 
## + KM 
## + Weight 
## + HP_Bin 
## + Mfg_Month 
## + CC_bin 
## + Powered_Windows 
## + Quarterly_Tax 
## + Guarantee_Period 
## + Mfr_Guarantee 
## + Sport_Model 
## + Airco 
## + Tow_Bar 
## 
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.954       RMSE                  1098.798 
## R-Squared               0.909       Coef. Var               10.240 
## Adj. R-Squared          0.908       MSE                1207356.804 
## Pred R-Squared          0.903       MAE                    812.613 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                       ANOVA                                       
## ---------------------------------------------------------------------------------
##                        Sum of                                                    
##                       Squares          DF       Mean Square       F         Sig. 
## ---------------------------------------------------------------------------------
## Regression         1.7164e+10          16    1072750134.950    888.511    0.0000 
## Residual       1713239304.585        1419       1207356.804                      
## Total         18877241463.777        1435                                        
## ---------------------------------------------------------------------------------
## 
##                                           Parameter Estimates                                           
## -------------------------------------------------------------------------------------------------------
##            model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
## -------------------------------------------------------------------------------------------------------
##      (Intercept)     4685.924      1073.475                   4.365    0.000     2580.154     6791.693 
##        Age_08_04     -110.311         2.437       -0.566    -45.260    0.000     -115.092     -105.530 
## Automatic_airco1     2266.342       155.764        0.144     14.550    0.000     1960.789     2571.895 
##               KM       -0.017         0.001       -0.180    -15.994    0.000       -0.019       -0.015 
##           Weight       11.400         0.994        0.165     11.473    0.000        9.451       13.349 
##      HP_Bin> 120     4902.739       378.419        0.118     12.956    0.000     4160.419     5645.059 
##    HP_Bin100-120     3352.815       380.751        0.453      8.806    0.000     2605.920     4099.711 
##        Mfg_Month      -93.404         8.862       -0.086    -10.540    0.000     -110.787      -76.020 
##      CC_bin>1600    -1340.226       196.910       -0.118     -6.806    0.000    -1726.492     -953.959 
##       CC_bin1600    -3328.381       378.629       -0.451     -8.791    0.000    -4071.114    -2585.648 
## Powered_Windows1      419.491        71.438        0.057      5.872    0.000      279.355      559.627 
##    Quarterly_Tax        9.549         1.294        0.108      7.382    0.000        7.011       12.086 
## Guarantee_Period       72.489        10.369        0.060      6.991    0.000       52.149       92.828 
##   Mfr_Guarantee1      280.212        62.058        0.038      4.515    0.000      158.476      401.948 
##     Sport_Model1      287.518        67.174        0.036      4.280    0.000      155.747      419.290 
##           Airco1      274.843        75.101        0.038      3.660    0.000      127.523      422.163 
##         Tow_Bar1     -134.104        67.345       -0.017     -1.991    0.047     -266.210       -1.998 
## -------------------------------------------------------------------------------------------------------
## 
##                                    Selection Summary                                     
## ----------------------------------------------------------------------------------------
##         Variable                          Adj.                                              
## Step        Entered         R-Square    R-Square      C(p)          AIC          RMSE       
## ----------------------------------------------------------------------------------------
##    1    Age_08_04             0.7684      0.7682    2186.0335    25518.9706    1746.0382    
##    2    Automatic_airco       0.8247      0.8244    1308.7496    25121.1462    1519.6553    
##    3    KM                    0.8427      0.8423    1029.8950    24967.7765    1440.1316    
##    4    Weight                0.8742      0.8739     538.5506    24648.0525    1287.9643    
##    5    HP_Bin                0.8830      0.8825     403.9815    24548.5871    1243.2596    
##    6    Mfg_Month             0.8905      0.8899     289.0095    24455.6261    1203.2452    
##    7    CC_bin                0.8965      0.8958     197.6782    24379.0785    1170.7882    
##    8    Powered_Windows       0.9005      0.8998     136.2015    24323.5954    1147.9904    
##    9    Quarterly_Tax         0.9036      0.9029      89.8123    24280.1756    1130.3748    
##   10    Guarantee_Period      0.9057      0.9049      59.6168    24251.1397    1118.6181    
##   11    Mfr_Guarantee         0.9070      0.9061      41.0703    24232.9769    1111.1829    
##   12    Sport_Model           0.9081      0.9072      25.0118    24217.0187    1104.6450    
##   13    Airco                 0.9090      0.9080      13.8241    24205.7609    1099.9446    
##   14    Tow_Bar               0.9092      0.9082      11.8620    24203.7537    1098.7979    
## ----------------------------------------------------------------------------------------
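
The figures in the Model Summary and ANOVA tables are linked by a few standard formulas. As a quick sanity check, the sketch below recomputes them in base R from the sums of squares and degrees of freedom printed for the final model (Step 14). The variable names are illustrative only and are not part of olsrr.

```r
# Values copied from the ANOVA table of the final model above
ss_total <- 18877241463.777   # Total sum of squares
ss_resid <- 1713239304.585    # Residual sum of squares
df_total <- 1435              # n - 1, with n = 1436 records
df_resid <- 1419              # Residual degrees of freedom
df_model <- df_total - df_resid   # 16 estimated slopes

# R-squared: share of the total variation explained by the model
r2 <- 1 - ss_resid / ss_total

# Adjusted R-squared: penalises the number of estimated coefficients
adj_r2 <- 1 - (ss_resid / df_resid) / (ss_total / df_total)

# MSE and RMSE as reported in the Model Summary
mse  <- ss_resid / df_resid
rmse <- sqrt(mse)

# Overall F statistic: mean square regression over mean square residual
f_stat <- ((ss_total - ss_resid) / df_model) / mse

round(c(R2 = r2, AdjR2 = adj_r2, RMSE = rmse, F = f_stat), 4)
```

After rounding, these values should agree with the R-Squared, Adj. R-Squared, RMSE and F entries reported for the final model.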

Last but not least, we can visualise the outcome of each step by plotting the model output, as shown in the code chunk below.

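# Run forward selection on the full model and keep the result;
# a variable enters only if its p-value is below 0.05
car.fw.mlr <- ols_step_forward_p(car.mlr, 
                                 penter = 0.05)
# Plot the stored selection result
plot(car.fw.mlr)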
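
For readers curious about what the stepwise log above is doing, here is a minimal base-R sketch of forward selection by p-value. It is an illustration under stated assumptions, not olsrr's implementation: the function name forward_p() is hypothetical, the built-in mtcars data stands in for car_resale, and add1() supplies the F-test for each candidate term.

```r
# Hypothetical helper: forward selection by p-value using base R only.
forward_p <- function(data, response, penter = 0.05) {
  candidates <- setdiff(names(data), response)
  scope <- reformulate(candidates)                  # ~ all candidate terms
  fit <- lm(reformulate("1", response), data = data) # intercept-only start
  entered <- character(0)

  while (length(entered) < length(candidates)) {
    # F-test for adding each term that is not yet in the model
    tests <- add1(fit, scope = scope, test = "F")
    pvals <- tests[["Pr(>F)"]][-1]                  # drop the "<none>" row
    names(pvals) <- rownames(tests)[-1]

    best <- names(which.min(pvals))
    if (pvals[[best]] >= penter) break              # nothing passes the threshold

    entered <- c(entered, best)
    fit <- lm(reformulate(entered, response), data = data)
  }
  list(entered = entered, model = fit)
}

# Usage on a built-in data set (stand-in for car_resale)
result <- forward_p(mtcars, "mpg", penter = 0.05)
result$entered          # variables in the order they entered
summary(result$model)   # final fitted model
```

Each pass through the loop corresponds to one "Forward Selection: Step" block in the olsrr output: the candidate with the smallest p-value enters, and the procedure stops once no remaining candidate falls below the threshold.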

8 Building Predictive Models

So far, we have focused mainly on how to use the olsrr package to build statistically rigorous models. The olsrr package also provides functions for building predictive models.

One particularly useful function is ols_step_best_subset(). It selects the subset of predictors that best meets a well-defined objective criterion, such as the smallest MSE, Mallows' Cp or AIC. The default metric for selecting the model is R-squared, but the user can choose any of the other available metrics, as shown in the code chunk below.

Be warned: this function can be very time-consuming, because the number of candidate subsets grows rapidly with the number of predictors.
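
To get a feel for the cost, note that k candidate predictors give 2^k - 1 non-empty subsets, each of which is a separate model. The short base-R calculation below is illustrative only and counts each predictor, including a factor such as HP_Bin, once.

```r
# Number of non-empty predictor subsets for k candidate predictors
k <- c(5, 10, 20, 30)
n_models <- 2^k - 1
data.frame(predictors = k, candidate_models = n_models)
```

Five predictors give only 31 candidate models, whereas 30 predictors give more than a billion.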

For demonstration purposes, five independent variables, namely: Age_08_04, Automatic_airco, KM, Weight, and HP_Bin will be used.

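# Fit the five-predictor model used for the demonstration
car.mlr <- lm(formula = Price ~ Age_08_04 + Automatic_airco + KM + 
                Weight + HP_Bin, 
              data = car_resale)
# Search the predictor subsets, using AIC as the selection metric
ols_step_best_subset(car.mlr,
                     metric = c("AIC"))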
##                  Best Subsets Regression                 
## ---------------------------------------------------------
## Model Index    Predictors
## ---------------------------------------------------------
##      1         Age_08_04                                  
##      2         Age_08_04 Automatic_airco                  
##      3         Age_08_04 KM Weight                        
##      4         Age_08_04 Automatic_airco KM Weight        
##      5         Age_08_04 Automatic_airco KM Weight HP_Bin 
## ---------------------------------------------------------
## 
##                                                                 Subsets Regression Summary                                                                
## ----------------------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                                    
## Model    R-Square    R-Square    R-Square      C(p)          AIC           SBIC          SBC             MSEP              FPE            HSP        APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.7684      0.7682      0.7674    1396.3492    25518.9706    21441.3253    25534.7794    4377860671.9756    3052895.5188    2127.4595    0.2322 
##   2        0.8247      0.8244      0.8233     710.9808    25121.1462    21043.7622    25142.2247    3316231525.2590    2314176.8543    1612.6762    0.1760 
##   3        0.8481      0.8478      0.8448     427.0711    24917.3085    20840.2078    24943.6566    2875385093.0869    2007932.9444    1399.2700    0.1527 
##   4        0.8742      0.8739      0.8714     109.7611    24648.0525    20572.2030    24679.6702    2382114939.9489    1664627.9329    1160.0364    0.1266 
##   5        0.8830      0.8825      0.8797       5.0000    24548.5871    20471.4543    24590.7440    2218069235.3238    1552152.6800    1081.6615    0.1180 
## ----------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria
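
The idea behind the table above can be sketched in base R: fit every subset of the predictors, keep the best model of each size, then compare those finalists on a criterion such as AIC. This is a simplified illustration, not olsrr's implementation. The built-in mtcars data and the four predictors chosen here are assumptions standing in for car_resale, and ranking within each size by R-squared is an assumption about how the finalists are picked.

```r
# Stand-in data and predictors (assumptions for illustration)
response   <- "mpg"
predictors <- c("wt", "hp", "qsec", "drat")

# Enumerate every non-empty subset of the predictors
subsets <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset
fits <- lapply(subsets,
               function(vars) lm(reformulate(vars, response), data = mtcars))

# Collect size, R-squared and AIC for every fitted model
all_models <- data.frame(
  size       = lengths(subsets),
  predictors = vapply(subsets, paste, character(1), collapse = " "),
  r2         = vapply(fits, function(f) summary(f)$r.squared, numeric(1)),
  aic        = vapply(fits, AIC, numeric(1))
)

# Keep the model with the highest R-squared within each size
best_by_size <- do.call(
  rbind,
  lapply(split(all_models, all_models$size),
         function(d) d[which.max(d$r2), ])
)
best_by_size

# Among those finalists, the smallest AIC marks the preferred model
best <- best_by_size[which.min(best_by_size$aic), ]
best
```

Reading the olsrr summary works the same way: each row is the best model of its size, and the row with the smallest AIC (model 5 in the output above) is the one preferred under that criterion.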

9 Reference

olsrr