How many orders will my company receive from my customers next month? How many customers will churn when their contracts expire? How many catalogues do I need to mail in order to increase the probability that my potential customers will buy? These and many other related questions are the challenges faced by you and many business analysts.
Multiple Linear Regression (MLR) models are one of the three most common analytical modelling techniques used by practitioners today. In this session, I would like to share with you a collection of R packages specially designed for building better MLR models.
This is not a session for explaining the concepts and methods of MLR.
A large Toyota car dealership offers purchasers of new Toyota cars the option to buy their used car as part of a trade-in. In particular, a new promotion promises to pay high prices for used Toyota Corolla cars for purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars.
The file provided for the analysis is called ToyotaCorolla.xls. The xls extension indicates that it is in Microsoft Excel format. The data file consists of two worksheets, namely data and metadata. The data worksheet provides the actual data records and the metadata worksheet describes the variables of the data records. The data set comprises 38 columns (i.e. variables) and 1436 rows (i.e. data records).
Before we get started, it is important for us to install the necessary R packages and load them into the R environment.
The R packages needed for this exercise are olsrr, corrplot, ggpubr, readxl, ggstatsplot, funModeling and tidyverse.
The code chunk below installs these R packages if they are missing and loads them into the R environment.
packages <- c('olsrr', 'corrplot', 'ggpubr',
              'readxl', 'ggstatsplot',
              'funModeling', 'tidyverse')
# Install any package that is not yet available, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The ToyotaCorolla data is in Microsoft Excel file format. The code chunk below uses the read_xls() function of the readxl package to import the data worksheet of the ToyotaCorolla workbook into R as a tibble data.frame called car_resale.
car_resale <- read_xls("data/ToyotaCorolla.xls",
sheet = "data")
After importing the data file into R, it is important for us to examine if the data file has been imported correctly.
The code chunk below uses glimpse() to display the data structure.
glimpse(car_resale)
## Rows: 1,436
## Columns: 38
## $ Id <dbl> 81, 1, 2, 3, 4, 5, 6, 7, 8, 44, 45, 46, 47, 49, 51...
## $ Model <chr> "TOYOTA Corolla 1.6 5drs 1 4/5-Doors", "TOYOTA Cor...
## $ Price <dbl> 18950, 13500, 13750, 13950, 14950, 13750, 12950, 1...
## $ Age_08_04 <dbl> 25, 23, 23, 24, 26, 30, 32, 27, 30, 27, 22, 23, 27...
## $ Mfg_Month <dbl> 8, 10, 10, 9, 7, 3, 1, 6, 3, 6, 11, 10, 6, 11, 11,...
## $ Mfg_Year <dbl> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 20...
## $ KM <dbl> 20019, 46986, 72937, 41711, 48000, 38500, 61000, 9...
## $ Quarterly_Tax <dbl> 100, 210, 210, 210, 210, 210, 210, 210, 210, 234, ...
## $ Weight <dbl> 1180, 1165, 1165, 1165, 1165, 1170, 1170, 1245, 12...
## $ Guarantee_Period <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ HP_Bin <chr> "100-120", "< 100", "< 100", "< 100", "< 100", "< ...
## $ CC_bin <chr> "1600", ">1600", ">1600", ">1600", ">1600", ">1600...
## $ Doors <dbl> 5, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 3,...
## $ Gears <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ Cylinders <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ Fuel_Type <chr> "Petrol", "Diesel", "Diesel", "Diesel", "Diesel", ...
## $ Color <chr> "Blue", "Blue", "Silver", "Blue", "Black", "Black"...
## $ Met_Color <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,...
## $ Automatic <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Mfr_Guarantee <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,...
## $ BOVAG_Guarantee <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ ABS <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Airbag_1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Airbag_2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Airco <dbl> 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Automatic_airco <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,...
## $ Boardcomputer <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ CD_Player <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,...
## $ Central_Lock <dbl> 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Powered_Windows <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Power_Steering <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Radio <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Mistlamps <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Sport_Model <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,...
## $ Backseat_Divider <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Metallic_Rim <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Radio_cassette <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Tow_Bar <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
Important
The code chunk below uses summary() of base R to display the summary statistics of the car_resale data.frame.
summary(car_resale)
## Id Model Price Age_08_04
## Min. : 1.0 Length:1436 Min. : 4350 Min. : 1.00
## 1st Qu.: 361.8 Class :character 1st Qu.: 8450 1st Qu.:44.00
## Median : 721.5 Mode :character Median : 9900 Median :61.00
## Mean : 721.6 Mean :10731 Mean :55.95
## 3rd Qu.:1081.2 3rd Qu.:11950 3rd Qu.:70.00
## Max. :1442.0 Max. :32500 Max. :80.00
## Mfg_Month Mfg_Year KM Quarterly_Tax
## Min. : 1.000 Min. :1998 Min. : 1 Min. : 19.00
## 1st Qu.: 3.000 1st Qu.:1998 1st Qu.: 43000 1st Qu.: 69.00
## Median : 5.000 Median :1999 Median : 63390 Median : 85.00
## Mean : 5.549 Mean :2000 Mean : 68533 Mean : 87.12
## 3rd Qu.: 8.000 3rd Qu.:2001 3rd Qu.: 87021 3rd Qu.: 85.00
## Max. :12.000 Max. :2004 Max. :243000 Max. :283.00
## Weight Guarantee_Period HP_Bin CC_bin
## Min. :1000 Min. : 3.000 Length:1436 Length:1436
## 1st Qu.:1040 1st Qu.: 3.000 Class :character Class :character
## Median :1070 Median : 3.000 Mode :character Mode :character
## Mean :1072 Mean : 3.815
## 3rd Qu.:1085 3rd Qu.: 3.000
## Max. :1615 Max. :36.000
## Doors Gears Cylinders Fuel_Type
## Min. :2.000 Min. :3.000 Min. :4 Length:1436
## 1st Qu.:3.000 1st Qu.:5.000 1st Qu.:4 Class :character
## Median :4.000 Median :5.000 Median :4 Mode :character
## Mean :4.033 Mean :5.026 Mean :4
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:4
## Max. :5.000 Max. :6.000 Max. :4
## Color Met_Color Automatic Mfr_Guarantee
## Length:1436 Min. :0.0000 Min. :0.00000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Mode :character Median :1.0000 Median :0.00000 Median :0.0000
## Mean :0.6748 Mean :0.05571 Mean :0.4095
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## BOVAG_Guarantee ABS Airbag_1 Airbag_2
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.8955 Mean :0.8134 Mean :0.9708 Mean :0.7228
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Airco Automatic_airco Boardcomputer CD_Player
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.5084 Mean :0.05641 Mean :0.2946 Mean :0.2187
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## Central_Lock Powered_Windows Power_Steering Radio
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.000 Median :1.0000 Median :0.0000
## Mean :0.5801 Mean :0.562 Mean :0.9777 Mean :0.1462
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## Mistlamps Sport_Model Backseat_Divider Metallic_Rim
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.000 Median :0.0000 Median :1.0000 Median :0.0000
## Mean :0.257 Mean :0.3001 Mean :0.7702 Mean :0.2047
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Radio_cassette Tow_Bar
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1455 Mean :0.2779
## 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
Quiz: What observations can you draw from the output above?
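The code chunk below performs the following: it re-imports the data worksheet, converts Id to character, and converts the categorical and binary variables listed in cols into factors.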
# Columns that hold categorical or binary (0/1) values
cols <- c("HP_Bin", "CC_bin", "Doors", "Gears", "Cylinders", "Fuel_Type", "Color",
          "Met_Color", "Automatic", "Mfr_Guarantee",
          "BOVAG_Guarantee", "ABS", "Airbag_1", "Airbag_2", "Airco", "Automatic_airco", "Boardcomputer", "CD_Player", "Central_Lock", "Powered_Windows", "Power_Steering", "Radio", "Mistlamps",
          "Sport_Model", "Backseat_Divider",
          "Metallic_Rim", "Radio_cassette", "Tow_Bar")
# across() (dplyr >= 1.0.0) replaces the deprecated mutate_each_() and funs()
car_resale <- read_xls("data/ToyotaCorolla.xls",
                       sheet = "data") %>%
  mutate(Id = as.character(Id)) %>%
  mutate(across(all_of(cols), factor))
Now, we can use summary() to examine the summary statistics of the variables again.
summary(car_resale[3:38])
## Price Age_08_04 Mfg_Month Mfg_Year
## Min. : 4350 Min. : 1.00 Min. : 1.000 Min. :1998
## 1st Qu.: 8450 1st Qu.:44.00 1st Qu.: 3.000 1st Qu.:1998
## Median : 9900 Median :61.00 Median : 5.000 Median :1999
## Mean :10731 Mean :55.95 Mean : 5.549 Mean :2000
## 3rd Qu.:11950 3rd Qu.:70.00 3rd Qu.: 8.000 3rd Qu.:2001
## Max. :32500 Max. :80.00 Max. :12.000 Max. :2004
##
## KM Quarterly_Tax Weight Guarantee_Period
## Min. : 1 Min. : 19.00 Min. :1000 Min. : 3.000
## 1st Qu.: 43000 1st Qu.: 69.00 1st Qu.:1040 1st Qu.: 3.000
## Median : 63390 Median : 85.00 Median :1070 Median : 3.000
## Mean : 68533 Mean : 87.12 Mean :1072 Mean : 3.815
## 3rd Qu.: 87021 3rd Qu.: 85.00 3rd Qu.:1085 3rd Qu.: 3.000
## Max. :243000 Max. :283.00 Max. :1615 Max. :36.000
##
## HP_Bin CC_bin Doors Gears Cylinders Fuel_Type
## < 100 :560 <1600:416 2: 2 3: 2 4:1436 CNG : 17
## > 120 : 11 >1600:166 3:622 4: 1 Diesel: 155
## 100-120:865 1600 :854 4:138 5:1390 Petrol:1264
## 5:674 6: 43
##
##
##
## Color Met_Color Automatic Mfr_Guarantee BOVAG_Guarantee ABS
## Grey :301 0:467 0:1356 0:848 0: 150 0: 268
## Blue :283 1:969 1: 80 1:588 1:1286 1:1168
## Red :278
## Green :220
## Black :191
## Silver :122
## (Other): 41
## Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player Central_Lock
## 0: 42 0: 398 0:706 0:1355 0:1013 0:1122 0:603
## 1:1394 1:1038 1:730 1: 81 1: 423 1: 314 1:833
##
##
##
##
##
## Powered_Windows Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
## 0:629 0: 32 0:1226 0:1067 0:1005 0: 330
## 1:807 1:1404 1: 210 1: 369 1: 431 1:1106
##
##
##
##
##
## Metallic_Rim Radio_cassette Tow_Bar
## 0:1142 0:1227 0:1037
## 1: 294 1: 209 1: 399
##
##
##
##
##
Notice that the output report now displays frequency counts for the categorical variables instead of numeric summaries such as the mean and quartiles, which were not meaningful for them.
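To see why the conversion matters, the effect can be reproduced on a small table. The sketch below is only an illustration: the data frame toy and the vector toy_cols are hypothetical stand-ins, not part of the ToyotaCorolla data, and it assumes dplyr 1.0.0 or later is installed.

```r
library(dplyr)

# Hypothetical three-row table, made up for illustration only
toy <- data.frame(
  Price     = c(13500, 13750, 18950),
  Automatic = c(0, 0, 1),
  Fuel_Type = c("Diesel", "Diesel", "Petrol")
)

# Names of the columns to treat as categorical (hypothetical)
toy_cols <- c("Automatic", "Fuel_Type")

# across() applies factor() to every column named in toy_cols
toy_fct <- toy %>%
  mutate(across(all_of(toy_cols), factor))

summary(toy$Automatic)      # numeric summary: min, quartiles, mean, max
summary(toy_fct$Automatic)  # frequency count for each level
```

Price is left untouched because it is not listed in toy_cols, so it keeps its numeric summary.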
Histogram, probability density plot and boxplot are three commonly used statistical graphical methods for visualising the distribution of continuous variables.
The code chunk below creates eight histograms. Then, ggarrange() of ggpubr is used to organise these histograms into a small-multiple plot of 4 columns by 2 rows.
Price <- ggplot(data=car_resale, aes(x= `Price`)) +
geom_histogram(bins=20, color="black", fill="light blue")
Age_08_04 <- ggplot(data=car_resale, aes(x= `Age_08_04`)) +
geom_histogram(bins=20, color="black", fill="light blue")
Mfg_Month <- ggplot(data=car_resale, aes(x= `Mfg_Month`)) +
geom_histogram(bins=20, color="black", fill="light blue")
Mfg_Year <- ggplot(data=car_resale, aes(x= `Mfg_Year`)) +
geom_histogram(bins=20, color="black", fill="light blue")
KM <- ggplot(data=car_resale, aes(x= `KM`)) +
geom_histogram(bins=20, color="black", fill="light blue")
Quarterly_Tax <- ggplot(data=car_resale, aes(x= `Quarterly_Tax`)) +
geom_histogram(bins=20, color="black", fill="light blue")
Weight <- ggplot(data=car_resale, aes(x= `Weight`)) +
geom_histogram(bins=20, color="black", fill="light blue")
Guarantee_Period <- ggplot(data=car_resale, aes(x= `Guarantee_Period`)) +
geom_histogram(bins=20, color="black", fill="light blue")
ggarrange(Price, Age_08_04, Mfg_Month, Mfg_Year, KM, Quarterly_Tax, Weight, Guarantee_Period, ncol = 4, nrow = 2)
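The eight near-identical calls above could also be generated in a loop. The following is a minimal sketch of that idea rather than the method used in this exercise: toy is simulated stand-in data, make_hist() is a hypothetical helper, and the ggarrange() step only runs if ggpubr is installed.

```r
library(ggplot2)

# Simulated stand-in data; in the exercise this would be car_resale
set.seed(1)
toy <- data.frame(
  Price     = rnorm(200, mean = 10000, sd = 2000),
  Age_08_04 = runif(200, min = 1, max = 80),
  KM        = rexp(200, rate = 1 / 60000)
)

# Hypothetical helper: build one histogram for the column named in var
make_hist <- function(var) {
  ggplot(toy, aes(x = .data[[var]])) +
    geom_histogram(bins = 20, color = "black", fill = "light blue") +
    labs(x = var)
}

# One plot per column, stored in a named list
hist_list <- lapply(names(toy), make_hist)
names(hist_list) <- names(toy)

# Arrange the plots side by side when ggpubr is available
if (requireNamespace("ggpubr", quietly = TRUE)) {
  panel <- ggpubr::ggarrange(plotlist = hist_list, ncol = 3, nrow = 1)
}
```

Keeping the plots in a list means a new variable can be added by extending the vector of column names instead of copying another block of code.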
To display the frequency distribution of the categorical variables, the freq() of funModeling package will be used.
freq(car_resale[11:38])
## HP_Bin frequency percentage cumulative_perc
## 1 100-120 865 60.24 60.24
## 2 < 100 560 39.00 99.24
## 3 > 120 11 0.77 100.00
## CC_bin frequency percentage cumulative_perc
## 1 1600 854 59.47 59.47
## 2 <1600 416 28.97 88.44
## 3 >1600 166 11.56 100.00
## Doors frequency percentage cumulative_perc
## 1 5 674 46.94 46.94
## 2 3 622 43.31 90.25
## 3 4 138 9.61 99.86
## 4 2 2 0.14 100.00
## Gears frequency percentage cumulative_perc
## 1 5 1390 96.80 96.80
## 2 6 43 2.99 99.79
## 3 3 2 0.14 99.93
## 4 4 1 0.07 100.00
## Cylinders frequency percentage cumulative_perc
## 1 4 1436 100 100
## Fuel_Type frequency percentage cumulative_perc
## 1 Petrol 1264 88.02 88.02
## 2 Diesel 155 10.79 98.81
## 3 CNG 17 1.18 100.00
## Color frequency percentage cumulative_perc
## 1 Grey 301 20.96 20.96
## 2 Blue 283 19.71 40.67
## 3 Red 278 19.36 60.03
## 4 Green 220 15.32 75.35
## 5 Black 191 13.30 88.65
## 6 Silver 122 8.50 97.15
## 7 White 31 2.16 99.31
## 8 Violet 4 0.28 99.59
## 9 Beige 3 0.21 99.80
## 10 Yellow 3 0.21 100.00
## Met_Color frequency percentage cumulative_perc
## 1 1 969 67.48 67.48
## 2 0 467 32.52 100.00
## Automatic frequency percentage cumulative_perc
## 1 0 1356 94.43 94.43
## 2 1 80 5.57 100.00
## Mfr_Guarantee frequency percentage cumulative_perc
## 1 0 848 59.05 59.05
## 2 1 588 40.95 100.00
## BOVAG_Guarantee frequency percentage cumulative_perc
## 1 1 1286 89.55 89.55
## 2 0 150 10.45 100.00
## ABS frequency percentage cumulative_perc
## 1 1 1168 81.34 81.34
## 2 0 268 18.66 100.00
## Airbag_1 frequency percentage cumulative_perc
## 1 1 1394 97.08 97.08
## 2 0 42 2.92 100.00
## Airbag_2 frequency percentage cumulative_perc
## 1 1 1038 72.28 72.28
## 2 0 398 27.72 100.00
## Airco frequency percentage cumulative_perc
## 1 1 730 50.84 50.84
## 2 0 706 49.16 100.00
## Automatic_airco frequency percentage cumulative_perc
## 1 0 1355 94.36 94.36
## 2 1 81 5.64 100.00
## Boardcomputer frequency percentage cumulative_perc
## 1 0 1013 70.54 70.54
## 2 1 423 29.46 100.00
## CD_Player frequency percentage cumulative_perc
## 1 0 1122 78.13 78.13
## 2 1 314 21.87 100.00
## Central_Lock frequency percentage cumulative_perc
## 1 1 833 58.01 58.01
## 2 0 603 41.99 100.00
## Powered_Windows frequency percentage cumulative_perc
## 1 1 807 56.2 56.2
## 2 0 629 43.8 100.0
## Power_Steering frequency percentage cumulative_perc
## 1 1 1404 97.77 97.77
## 2 0 32 2.23 100.00
## Radio frequency percentage cumulative_perc
## 1 0 1226 85.38 85.38
## 2 1 210 14.62 100.00
## Mistlamps frequency percentage cumulative_perc
## 1 0 1067 74.3 74.3
## 2 1 369 25.7 100.0
## Sport_Model frequency percentage cumulative_perc
## 1 0 1005 69.99 69.99
## 2 1 431 30.01 100.00
## Backseat_Divider frequency percentage cumulative_perc
## 1 1 1106 77.02 77.02
## 2 0 330 22.98 100.00
## Metallic_Rim frequency percentage cumulative_perc
## 1 0 1142 79.53 79.53
## 2 1 294 20.47 100.00
## Radio_cassette frequency percentage cumulative_perc
## 1 0 1227 85.45 85.45
## 2 1 209 14.55 100.00
## Tow_Bar frequency percentage cumulative_perc
## 1 0 1037 72.21 72.21
## 2 1 399 27.79 100.00
## [1] "Variables processed: HP_Bin, CC_bin, Doors, Gears, Cylinders, Fuel_Type, Color, Met_Color, Automatic, Mfr_Guarantee, BOVAG_Guarantee, ABS, Airbag_1, Airbag_2, Airco, Automatic_airco, Boardcomputer, CD_Player, Central_Lock, Powered_Windows, Power_Steering, Radio, Mistlamps, Sport_Model, Backseat_Divider, Metallic_Rim, Radio_cassette, Tow_Bar"
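To make the columns of the report concrete, the Fuel_Type table can be rebuilt with base R from the counts printed above. This is a sketch of the arithmetic only, not the internal implementation of freq(); the object names fuel and fuel_freq are made up for illustration.

```r
# Rebuild the Fuel_Type column from the counts in the report above
fuel <- rep(c("Petrol", "Diesel", "CNG"), times = c(1264, 155, 17))

# Count each level and order from most to least frequent
counts <- sort(table(fuel), decreasing = TRUE)
n      <- length(fuel)

fuel_freq <- data.frame(
  Fuel_Type       = names(counts),
  frequency       = as.integer(counts),
  percentage      = round(100 * as.integer(counts) / n, 2),
  cumulative_perc = round(100 * cumsum(as.integer(counts)) / n, 2)
)
fuel_freq
```

The cumulative column here is computed from the exact counts, so it may differ from the freq() report in the last decimal place because of the order in which rounding is applied.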
Several useful insights can be obtained from the frequency plots above. They can be summarised as follows:
In this section, you will learn how to build regression models by using lm() of base R.
First, we will build a simple linear regression model by using Price as the dependent variable and Age_08_04 as the independent variable.
car.slr <- lm(formula = Price ~ Age_08_04, data = car_resale)
lm() returns an object of class "lm" or, for multiple responses, of class c("mlm", "lm").
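The two return classes can be checked directly. The sketch below uses the built-in mtcars data set purely as a convenient stand-in, since it ships with R; the object names fit_single and fit_multi are hypothetical.

```r
# One response variable: the result has class "lm"
fit_single <- lm(mpg ~ wt, data = mtcars)
class(fit_single)

# Two response variables bound with cbind(): class c("mlm", "lm")
fit_multi <- lm(cbind(mpg, hp) ~ wt, data = mtcars)
class(fit_multi)
```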
The functions summary() and anova() can be used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.
summary(car.slr)
##
## Call:
## lm(formula = Price ~ Age_08_04, data = car_resale)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8423.0 -997.4 -24.6 878.5 12889.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20294.059 146.097 138.91 <2e-16 ***
## Age_08_04 -170.934 2.478 -68.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1746 on 1434 degrees of freedom
## Multiple R-squared: 0.7684, Adjusted R-squared: 0.7682
## F-statistic: 4758 on 1 and 1434 DF, p-value: < 2.2e-16
The output report reveals that Price can be estimated by using the formula:
*y = 20294.059 - 170.934x1*
where y is Price and x1 is Age_08_04.
The R-squared of 0.7684 reveals that the simple regression model built is able to explain about 77% of the variation in the trade-in prices.
Since the p-value of the F-statistic is much smaller than 0.0001, we will reject the null hypothesis that the mean alone is as good an estimator of trade-in prices as the regression model. This allows us to infer that the simple linear regression model above is a better estimator of Price than the mean.
The Coefficients: section of the report reveals that the p-values of the estimates of both the Intercept and Age_08_04 are smaller than 0.001. In view of this, the null hypotheses that B0 and B1 are equal to 0 will be rejected. As a result, we can infer that both B0 and B1 are statistically different from zero.
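As a worked example of the fitted equation, the sketch below plugs in the first record shown in the glimpse() output (Price = 18950, Age_08_04 = 25). The coefficients are copied from the report above, so the result agrees with R's own calculation only up to rounding; predict_price() is a hypothetical helper name.

```r
# Coefficients copied (rounded) from the summary(car.slr) report
b0 <- 20294.059   # intercept
b1 <- -170.934    # change in Price per additional month of age

# Hypothetical helper: estimated price for a car of a given age in months
predict_price <- function(age_months) b0 + b1 * age_months

predicted <- predict_price(25)     # first record has Age_08_04 = 25
observed  <- 18950                 # first record has Price = 18950
resid_1   <- observed - predicted  # residual = observed - predicted

predicted
resid_1
```

The residual obtained this way is about 2929.29, which is close to the first value that residuals(car.slr) reports (2929.28); the small gap comes from using coefficients rounded to three decimal places.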
To visualise the best-fit line on a scatterplot, we can pass lm as the method of ggplot2's geom_smooth().
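A minimal sketch of this approach is given below. It uses simulated stand-in data (toy) so that it runs on its own; in the exercise, car_resale would take its place. The slope and noise level used to simulate the data are arbitrary assumptions.

```r
library(ggplot2)

# Simulated stand-in data with a roughly linear, downward trend
set.seed(42)
toy <- data.frame(Age_08_04 = runif(100, min = 1, max = 80))
toy$Price <- 20000 - 170 * toy$Age_08_04 + rnorm(100, sd = 1500)

# Scatterplot with a least-squares line fitted by lm()
fit_plot <- ggplot(toy, aes(x = Age_08_04, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

fit_plot
```

Setting se = TRUE draws the confidence band around the fitted line; se = FALSE shows the line only.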
The code chunk below prints the Analysis of Variance table of the simple regression model by using anova().
anova(car.slr)
## Analysis of Variance Table
##
## Response: Price
## Df Sum Sq Mean Sq F value Pr(>F)
## Age_08_04 1 1.4505e+10 1.4505e+10 4758 < 2.2e-16 ***
## Residuals 1434 4.3718e+09 3.0486e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
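The ANOVA table ties back to the figures in the summary() report. As a sketch, the R-squared and the F value can be recomputed from the sums of squares and degrees of freedom printed above; because the printed values are rounded, the results match only approximately.

```r
# Values copied from the Analysis of Variance table above
ss_model <- 1.4505e10   # Sum Sq for Age_08_04
ss_resid <- 4.3718e9    # Sum Sq for Residuals
df_model <- 1
df_resid <- 1434

# R-squared = explained variation / total variation
r_squared <- ss_model / (ss_model + ss_resid)

# F = mean square of the model / mean square of the residuals
f_value <- (ss_model / df_model) / (ss_resid / df_resid)

round(r_squared, 4)
round(f_value)
```

Both results agree with the Multiple R-squared (0.7684) and F-statistic (4758) in the summary() report.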
We can also print the confidence intervals of the parameter estimates by using the code chunk below.
confint(car.slr)
## 2.5 % 97.5 %
## (Intercept) 20007.4714 20580.6459
## Age_08_04 -175.7946 -166.0725
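These limits are the estimate plus or minus the critical t value times the standard error. The sketch below recomputes them from the rounded figures in the summary() report, assuming the default 95% level and the 1434 residual degrees of freedom shown there; the object names are made up for illustration.

```r
# Estimates and standard errors copied from the summary(car.slr) report
est <- c("(Intercept)" = 20294.059, "Age_08_04" = -170.934)
se  <- c("(Intercept)" = 146.097,   "Age_08_04" = 2.478)

# Two-sided 95% critical value of the t distribution with 1434 df
t_crit <- qt(0.975, df = 1434)

# Lower and upper limits for each parameter
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
ci
```

The limits agree with the confint() output up to the rounding of the copied estimates and standard errors.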
Last but not least, you can also print the residuals of the regression model by using the residuals() function as shown in the code chunk below.
residuals(car.slr)
## 1 2 3 4 5
## 2929.2809764 -2862.5861936 -2612.5861936 -2241.6526086 -899.7854386
## 6 7 8 9 10
## -1416.0510986 -1874.1839285 1221.1481464 3433.9489014 1271.1481464
## 11 12 13 14 15
## 416.4802213 2637.4138064 2271.1481464 1416.4802213 1416.4802213
## 16 17 18 19 20
## 5716.4802213 1074.6130513 4903.6794663 5374.6130513 12889.6756911
## 21 22 23 24 25
## 11389.6756911 11664.6756911 6023.4100312 6023.4100312 3852.4764462
## 26 27 28 29 30
## 6063.4100312 3023.4100312 2374.6130513 4861.8122963 2903.6794663
## 31 32 33 34 35
## 4586.2107862 -4298.5824185 -1506.7152484 1122.3511665 1293.2847516
## 36 37 38 39 40
## -1153.2503435 -1153.2503435 1293.2847516 -3313.6450583 -2113.6450583
## 41 42 43 44 45
## -3142.7114733 189.8198466 -5571.7778883 -1626.4458133 -1942.7114733
## 46 47 48 49 50
## 557.2885267 -2113.6450583 -4247.3793983 -3429.9107182 -770.5748681
## 51 52 53 54 55
## -2708.9771332 -2196.1763782 -2283.3756232 -3090.5084531 -2458.9771332
## 56 57 58 59 60
## -1404.3092082 -2591.5084531 -196.1763782 -3308.9771332 -1833.3756232
## 61 62 63 64 65
## -3138.0435482 -404.3092082 -1429.9107182 -233.3756232 -1404.3092082
## 66 67 68 69 70
## -2708.9771332 108.4915469 958.4915469 -746.1763782 79.4251319
## 71 72 73 74 75
## 632.8900368 937.5579619 595.6907918 -1720.5748681 -2233.3756232
## 76 77 78 79 80
## -1746.1763782 -600.8443032 -2300.8443032 2211.9564518 -341.5084531
## 81 82 83 84 85
## 1229.4251319 -746.1763782 861.9564518 570.0892818 -1915.9069431
## 86 87 88 89 90
## 159.6945670 -961.2390180 -3244.9733581 -382.1726030 0.3587169
## 91 92 93 94 95
## -469.3718480 -353.1061880 -1553.1061880 1788.7609820 -1828.7076981
## 96 97 98 99 100
## 3367.8273970 1367.8273970 -11.2390180 450.3587169 -549.6412831
## 101 102 103 104 105
## 146.8938120 -578.7076981 425.9602270 -1599.6412831 -207.7741131
## 106 107 108 109 110
## 305.0266419 2880.6281520 1296.8938120 -2207.7741131 1159.6945670
## 111 112 113 114 115
## -907.7741131 -394.9733581 92.2258869 1646.8938120 1134.0930569
## 116 117 118 119 120
## 1105.0266419 -865.9069431 171.2923019 900.3587169 792.2258869
## 121 122 123 124 125
## 2390.8787113 2561.8122963 741.0228668 -3412.5861936 -5993.9144934
## 126 127 128 129 130
## -1506.7152484 -1706.7152484 -35.7816634 -872.9809084 -3861.3831735
## 131 132 133 134 135
## 6.0855066 -35.7816634 -1153.2503435 -361.3831735 -1153.2503435
## 136 137 138 139 140
## -811.3831735 -4734.5786433 -5176.4458133 -4234.5786433 -3702.0473234
## 141 142 143 144 145
## -984.5786433 -4339.2465684 -1613.6450583 -3155.5122283 -1284.5786433
## 146 147 148 149 150
## -563.6450583 -626.4458133 -3139.2465684 -1984.5786433 -923.3129833
## 151 152 153 154 155
## -3052.3793983 -3510.1801534 -589.2465684 5821.1481464 4929.2809764
## 156 157 158 159 160
## 3416.4802213 3579.2809764 6504.8824865 7675.8160715 6492.0817314
## 161 162 163 164 165
## 7583.9489014 6954.8824865 6271.1481464 6903.6794663 -8422.9809084
## 166 167 168 169 170
## -8022.9809084 -6271.7778883 1758.3473914 558.3473914 1783.9489014
## 171 172 173 174 175
## 783.9489014 1613.0153164 442.0817314 1442.0817314 913.0153164
## 176 177 178 179 180
## -70.7190236 1816.1481464 413.0153164 1442.0817314 2783.9489014
## 181 182 183 184 185
## 4650.2145614 -520.7190236 -1574.1839285 2587.4138064 832.9489014
## 186 187 188 189 190
## 821.1481464 3754.8824865 2783.9489014 1100.2145614 3442.0817314
## 191 192 193 194 195
## 4442.0817314 587.4138064 3754.8824865 2771.1481464 1558.0153164
## 196 197 198 199 200
## 1913.0153164 2954.8824865 1942.0817314 2024.6130513 336.2107862
## 201 202 203 204 205
## 2114.6130513 86.2107862 1036.2107862 899.0115413 1074.6130513
## 206 207 208 209 210
## -1197.5235538 878.0779562 -1550.9884587 2074.6130513 -438.1877037
## 211 212 213 214 215
## 4190.8787113 -925.3869487 -2221.9220438 -2121.9220438 -1821.9220438
## 216 217 218 219 220
## 3024.6130513 1878.0779562 940.8787113 1219.9451263 3624.6130513
## 221 222 223 224 225
## 6428.0779562 2403.6794663 3049.0115413 2903.6794663 1257.1443712
## 226 227 228 229 230
## 3678.0779562 1940.8787113 486.2107862 2361.8122963 2190.8787113
## 231 232 233 234 235
## 1390.8787113 844.3436162 1599.0115413 2049.0115413 2049.0115413
## 236 237 238 239 240
## 3049.0115413 2257.1443712 -510.6563838 4823.4100312 3023.4100312
## 241 242 243 244 245
## 1023.4100312 23.4100312 852.4764462 3023.4100312 3231.5428612
## 246 247 248 249 250
## -597.5235538 1547.8085211 -1878.1250639 -5193.9144934 -2019.5160035
## 251 252 253 254 255
## -1677.6488335 -1277.9809084 -2614.8480784 -1272.9809084 -956.7152484
## 256 257 258 259 260
## -1835.7816634 -2019.5160035 -2692.3167585 -2006.7152484 -3703.2503435
## 261 262 263 264 265
## -848.5824185 -993.9144934 -322.9809084 2177.0190916 -1861.3831735
## 266 267 268 269 270
## -848.5824185 -2361.3831735 -664.8480784 -1703.2503435 -385.7816634
## 271 272 273 274 275
## -1385.7816634 -2148.5824185 -1822.9809084 1177.0190916 -1506.7152484
## 276 277 278 279 280
## -2848.5824185 -1348.5824185 -993.9144934 -2963.2503435 -127.6488335
## 281 282 283 284 285
## -1164.8480784 -727.6488335 556.0855066 -1364.8480784 -2903.2503435
## 286 287 288 289 290
## -2093.9144934 -1193.9144934 1835.1519216 -3687.6488335 -214.8480784
## 291 292 293 294 295
## -298.5824185 -1203.2503435 -1132.6488335 214.2183366 -706.7152484
## 296 297 298 299 300
## -1848.5824185 -277.9809084 -2032.3167585 -2677.6488335 -848.5824185
## 301 302 303 304 305
## -1032.3167585 556.0855066 -1805.7816634 -2132.6488335 1191.4175815
## 306 307 308 309 310
## -506.7152484 493.2847516 -1335.7816634 -677.6488335 -1756.7152484
## 311 312 313 314 315
## 6.0855066 -1640.4495885 247.3511665 -3811.3831735 -2358.2503435
## 316 317 318 319 320
## 835.1519216 -2848.5824185 177.0190916 -606.7152484 -316.3831735
## 321 322 323 324 325
## 122.3511665 -1219.5160035 -785.7816634 1006.0855066 385.1519216
## 326 327 328 329 330
## -664.8480784 -506.7152484 1222.0190916 664.2183366 177.0190916
## 331 332 333 334 335
## -822.9809084 -1466.7152484 -3390.4495885 -19.5160035 -1522.9809084
## 336 337 338 339 340
## -677.6488335 -1335.7816634 -1703.2503435 -1361.3831735 -1248.9144934
## 341 342 343 344 345
## -2285.7816634 -361.3831735 -1848.5824185 -1706.7152484 -1811.3831735
## 346 347 348 349 350
## -1753.2503435 -1085.7816634 -193.9144934 -1164.8480784 -1385.7816634
## 351 352 353 354 355
## -822.9809084 1835.1519216 -1703.2503435 1664.2183366 -6.7152484
## 356 357 358 359 360
## -48.5824185 6.0855066 -1903.2503435 -1348.5824185 2006.0855066
## 361 362 363 364 365
## 122.3511665 -932.6488335 336.7496565 -193.9144934 1064.2183366
## 366 367 368 369 370
## -677.6488335 -6.7152484 -898.5824185 -2519.5160035 477.0190916
## 371 372 373 374 375
## -3377.6488335 -316.3831735 -506.7152484 -2127.6488335 -2392.7114733
## 376 377 378 379 380
## -1676.4458133 -2839.2465684 -1968.3129833 -1797.3793983 -3089.2465684
## 381 382 383 384 385
## -392.7114733 -313.6450583 -1113.6450583 -360.1801534 886.3549417
## 386 387 388 389 390
## -3847.3793983 -455.5122283 373.5541867 -1018.3129833 -942.7114733
## 391 392 393 394 395
## -113.6450583 -1418.3129833 -1655.5122283 -2652.0473234 -468.3129833
## 396 397 398 399 400
## 1436.3549417 -1247.3793983 1202.6206017 452.6206017 -113.6450583
## 401 402 403 404 405
## -628.3129833 -339.5786433 -760.1801534 -2339.2465684 -642.7114733
## 406 407 408 409 410
## 581.6870167 886.3549417 -981.1137384 886.3549417 -339.2465684
## 411 412 413 414 415
## -797.3793983 1057.2885267 -3018.3129833 -510.1801534 594.4877717
## 416 417 418 419 420
## 202.6206017 -113.6450583 -113.6450583 -244.5786433 -3168.3129833
## 421 422 423 424 425
## 1228.2221117 2544.4877717 186.3549417 1228.2221117 -1639.2465684
## 426 427 428 429 430
## -813.6450583 -1304.5786433 -563.6450583 886.3549417 607.2885267
## 431 432 433 434 435
## 386.3549417 -821.7778883 -997.3793983 1278.2221117 -1294.5786433
## 436 437 438 439 440
## -1310.1801534 -313.6450583 886.3549417 -155.5122283 686.3549417
## 441 442 443 444 445
## 57.2885267 528.2221117 -1876.4458133 -310.1801534 -1334.5786433
## 446 447 448 449 450
## -902.7114733 436.3549417 715.4213567 -1247.3793983 323.5541867
## 451 452 453 454 455
## 2373.5541867 1057.2885267 -297.3793983 -1584.5786433 7031.6870167
## 456 457 458 459 460
## -126.4458133 -1155.5122283 -639.2465684 -221.7778883 2686.3549417
## 461 462 463 464 465
## -813.6450583 -213.6450583 489.4877717 1715.4213567 202.6206017
## 466 467 468 469 470
## 28.2221117 344.4877717 -1310.1801534 1386.3549417 -905.5122283
## 471 472 473 474 475
## 1202.6206017 1778.2221117 -1468.3129833 1031.6870167 31.6870167
## 476 477 478 479 480
## -310.1801534 -1260.1801534 1752.6206017 -113.6450583 1373.5541867
## 481 482 483 484 485
## 102.2885267 -455.5122283 1081.6870167 347.9526766 94.4877717
## 486 487 488 489 490
## -379.2465684 673.5541867 -931.1137384 1228.2221117 607.2885267
## 491 492 493 494 495
## -531.1137384 1202.6206017 1494.4877717 -1139.2465684 -1297.3793983
## 496 497 498 499 500
## -1010.1801534 -891.5084531 -1104.3092082 -904.3092082 279.4251319
## 501 502 503 504 505
## -1258.9771332 -1117.1099632 -891.5084531 -2288.0435482 437.5579619
## 506 507 508 509 510
## 1095.6907918 1279.4251319 279.4251319 -720.5748681 399.1556968
## 511 512 513 514 515
## -70.5748681 -429.9107182 -588.0435482 -720.5748681 -75.2427932
## 516 517 518 519 520
## 595.6907918 -1958.9771332 1279.4251319 1487.5579619 566.6243768
## 521 522 523 524 525
## -617.1099632 829.4251319 253.8236218 -1775.2427932 829.4251319
## 526 527 528 529 530
## 1108.4915469 395.6907918 -175.5748681 -25.2427932 1224.4251319
## 531 532 533 534 535
## -1917.1099632 79.4251319 -575.2427932 279.4251319 -917.1099632
## 536 537 538 539 540
## -233.3756232 608.4915469 66.6243768 82.8900368 82.8900368
## 541 542 543 544 545
## 108.4915469 408.4915469 382.8900368 279.4251319 -1184.3756232
## 546 547 548 549 550
## 887.5579619 1316.6243768 1229.4251319 803.8236218 -367.1099632
## 551 552 553 554 555
## 266.6243768 -1104.3092082 -879.9107182 -220.5748681 -1206.1763782
## 556 557 558 559 560
## -1288.0435482 424.7572068 -170.5748681 -1075.2427932 -1796.1763782
## 561 562 563 564 565
## -1800.8443032 -1308.9771332 279.4251319 2316.6243768 753.8236218
## 566 567 568 569 570
## 579.4251319 1766.6243768 753.8236218 579.4251319 -946.1763782
## 571 572 573 574 575
## 1129.4251319 753.8236218 -629.9107182 -917.1099632 -762.4420381
## 576 577 578 579 580
## 1766.6243768 -917.1099632 582.8900368 399.1556968 -733.3756232
## 581 582 583 584 585
## -258.9771332 1279.4251319 766.6243768 829.4251319 -933.3756232
## 586 587 588 589 590
## 1253.8236218 1053.4915469 461.9564518 -967.1099632 -367.1099632
## 591 592 593 594 595
## 2108.4915469 1079.4251319 566.6243768 -75.2427932 -112.4420381
## 596 597 598 599 600
## -1258.9771332 -193.3756232 1108.4915469 1229.4251319 -746.1763782
## 601 602 603 604 605
## 487.5579619 -920.5748681 -575.2427932 1911.9564518 -746.1763782
## 606 607 608 609 610
## 279.4251319 595.6907918 -933.3756232 127.8900368 -104.3092082
## 611 612 613 614 615
## 1595.6907918 -879.9107182 -429.9107182 -1275.2427932 -708.9771332
## 616 617 618 619 620
## -233.3756232 1253.8236218 -1088.0435482 -2379.9107182 1058.4915469
## 621 622 623 624 625
## -746.1763782 -758.9771332 -904.3092082 316.6243768 1079.4251319
## 626 627 628 629 630
## 808.4915469 1279.4251319 -575.2427932 1469.7572068 1316.6243768
## 631 632 633 634 635
## 908.4915469 2279.4251319 -258.9771332 1316.6243768 632.8900368
## 636 637 638 639 640
## -446.1763782 82.8900368 1224.7572068 2108.4915469 766.6243768
## 641 642 643 644 645
## -258.9771332 -788.0435482 -1554.3092082 329.4251319 -1600.8443032
## 646 647 648 649 650
## 1279.4251319 -1538.0435482 -1701.1763782 -433.3756232 1766.6243768
## 651 652 653 654 655
## 908.4915469 279.4251319 666.6243768 -246.1763782 1453.4915469
## 656 657 658 659 660
## 1079.4251319 -233.3756232 1229.4251319 -429.9107182 253.8236218
## 661 662 663 664 665
## 399.1556968 1124.4251319 79.4251319 -933.3756232 -429.9107182
## 666 667 668 669 670
## 253.8236218 291.0228668 1382.8900368 -1308.9771332 2829.4251319
## 671 672 673 674 675
## -1100.8443032 -917.1099632 -575.2427932 566.6243768 -417.1099632
## 676 677 678 679 680
## 766.6243768 1079.4251319 -233.3756232 -762.4420381 -1088.0435482
## 681 682 683 684 685
## 595.6907918 437.5579619 4108.4915469 -429.9107182 798.8236218
## 686 687 688 689 690
## 974.7572068 -233.3756232 203.8236218 127.8900368 424.7572068
## 691 692 693 694 695
## 424.7572068 -591.5084531 1253.8236218 424.7572068 311.9564518
## 696 697 698 699 700
## 766.6243768 399.1556968 1424.7572068 279.4251319 58.4915469
## 701 702 703 704 705
## 211.9564518 -417.1099632 -617.1099632 -50.8443032 -300.8443032
## 706 707 708 709 710
## 553.8236218 2316.6243768 145.6907918 -1600.8443032 -1638.0435482
## 711 712 713 714 715
## -967.1099632 -1629.9107182 570.0892818 -972.1099632 -1458.9771332
## 716 717 718 719 720
## -117.1099632 253.8236218 1766.6243768 1058.4915469 -196.1763782
## 721 722 723 724 725
## 82.8900368 424.7572068 487.5579619 411.9564518 -438.3756232
## 726 727 728 729 730
## 199.1556968 -800.8443032 -333.3756232 -75.2427932 82.8900368
## 731 732 733 734 735
## -1275.2427932 -258.9771332 456.9564518 -2429.9107182 279.4251319
## 736 737 738 739 740
## -855.8443032 1279.4251319 1108.4915469 424.7572068 545.6907918
## 741 742 743 744 745
## -762.4420381 553.8236218 -1117.1099632 1079.4251319 1082.8900368
## 746 747 748 749 750
## 911.9564518 -1308.9771332 -538.0435482 937.5579619 1266.6243768
## 751 752 753 754 755
## 1253.8236218 229.4251319 395.6907918 1570.0892818 349.1556968
## 756 757 758 759 760
## 1266.6243768 741.0228668 711.9564518 -650.8443032 2079.4251319
## 761 762 763 764 765
## 70.0892818 911.9564518 1829.4251319 2803.8236218 120.0892818
## 766 767 768 769 770
## 232.5579619 741.0228668 1203.8236218 -1540.3054330 -1865.9069431
## 771 772 773 774 775
## -207.7741131 -1904.9733581 988.7609820 -1707.7741131 -1182.1726030
## 776 777 778 779 780
## -65.9069431 -182.1726030 -394.9733581 117.8273970 2630.6281520
## 781 782 783 784 785
## 434.0930569 -2736.8405281 767.8273970 -403.1061880 425.9602270
## 786 787 788 789 790
## 121.2923019 646.8938120 305.0266419 -1536.8405281 -824.0397730
## 791 792 793 794 795
## 317.8273970 305.0266419 117.8273970 105.0266419 342.2258869
## 796 797 798 799 800
## 788.7609820 -2065.9069431 -403.1061880 -1657.7741131 -461.2390180
## 801 802 803 804 805
## -194.9733581 1946.8938120 630.6281520 180.6281520 1225.9602270
## 806 807 808 809 810
## 538.7609820 -407.7741131 134.0930569 2475.9602270 -1004.9733581
## 811 812 813 814 815
## -236.8405281 -119.3718480 1446.8938120 305.0266419 1475.9602270
## 816 817 818 819 820
## -1024.0397730 -744.9733581 975.9602270 2709.6945670 788.7609820
## 821 822 823 824 825
## 1880.6281520 320.9602270 -129.3718480 134.0930569 -1036.8405281
## 826 827 828 829 830
## 1959.6945670 292.2258869 817.8273970 1938.7609820 1134.0930569
## 831 832 833 834 835
## 1084.0930569 -549.6412831 1605.0266419 -311.2390180 199.6945670
## 836 837 838 839 840
## 788.7609820 630.6281520 3330.6281520 2630.6281520 263.1594719
## 841 842 843 844 845
## 2159.6945670 1117.8273970 3280.6281520 988.7609820 -24.0397730
## 846 847 848 849 850
## 121.2923019 538.7609820 1646.8938120 700.3587169 1305.0266419
## 851 852 853 854 855
## -136.8405281 -274.0397730 488.7609820 -315.9069431 -536.8405281
## 856 857 858 859 860
## -378.7076981 288.7609820 263.1594719 130.6281520 2330.6281520
## 861 862 863 864 865
## 1630.6281520 1121.2923019 134.0930569 -428.7076981 684.0930569
## 866 867 868 869 870
## 1459.6945670 946.8938120 1134.0930569 305.0266419 330.6281520
## 871 872 873 874 875
## 288.7609820 617.8273970 1225.9602270 2330.6281520 -553.1061880
## 876 877 878 879 880
## -657.7741131 -207.7741131 275.9602270 792.2258869 704.6945670
## 881 882 883 884 885
## 446.8938120 -1378.7076981 857.8273970 288.7609820 105.0266419
## 886 887 888 889 890
## 288.7609820 342.2258869 -36.8405281 1038.7609820 475.9602270
## 891 892 893 894 895
## 1630.6281520 1092.2258869 2367.8273970 963.1594719 1204.6945670
## 896 897 898 899 900
## 1788.7609820 684.0930569 638.7609820 525.6281520 -49.6412831
## 901 902 903 904 905
## 2380.6281520 630.6281520 -74.0397730 2009.6945670 -407.7741131
## 906 907 908 909 910
## -74.0397730 880.6281520 1646.8938120 621.2923019 817.8273970
## 911 912 913 914 915
## 1159.6945670 988.7609820 792.2258869 788.7609820 1288.7609820
## 916 917 918 919 920
## 988.7609820 1367.8273970 330.6281520 880.6281520 959.6945670
## 921 922 923 924 925
## 538.7609820 646.8938120 280.6281520 317.8273970 1880.6281520
## 926 927 928 929 930
## -315.9069431 1617.8273970 196.8938120 3330.6281520 2475.9602270
## 931 932 933 934 935
## 2288.7609820 3159.6945670 -1694.9733581 691.8938120 1538.7609820
## 936 937 938 939 940
## 171.2923019 630.6281520 2159.6945670 2830.6281520 855.0266419
## 941 942 943 944 945
## 780.6281520 280.6281520 1362.8273970 2525.9602270 -1158.7741131
## 946 947 948 949 950
## 2330.6281520 1709.6945670 1275.9602270 879.6281520 2380.6281520
## 951 952 953 954 955
## 1817.8273970 330.6281520 1367.8273970 1305.0266419 1117.8273970
## 956 957 958 959 960
## -878.7076981 2209.6945670 -1178.7076981 446.8938120 513.1594719
## 961 962 963 964 965
## 1630.6281520 2659.6945670 -1749.6412831 1267.8273970 596.8938120
## 966 967 968 969 970
## 2130.6281520 2084.0930569 -1662.7741131 -78.7076981 -1599.6412831
## 971 972 973 974 975
## 1630.6281520 2330.6281520 1421.2923019 1938.7609820 -553.1061880
## 976 977 978 979 980
## 475.9602270 -15.9069431 934.0930569 105.0266419 3159.6945670
## 981 982 983 984 985
## -2549.6412831 3646.8938120 2317.8273970 1630.6281520 1250.3587169
## 986 987 988 989 990
## 488.7609820 1275.9602270 250.3587169 684.0930569 -1249.6412831
## 991 992 993 994 995
## 830.6281520 2667.8273970 1475.9602270 -878.7076981 792.2258869
## 996 997 998 999 1000
## 1367.8273970 621.2923019 -249.6412831 2630.6281520 425.9602270
## 1001 1002 1003 1004 1005
## 646.8938120 1450.3587169 1105.0266419 880.6281520 -1036.8405281
## 1006 1007 1008 1009 1010
## 2159.6945670 1959.6945670 975.9602270 846.8938120 638.7609820
## 1011 1012 1013 1014 1015
## 2988.7609820 -353.1061880 3442.0817314 3361.8122963 -1506.7152484
## 1016 1017 1018 1019 1020
## 840.6281520 -1247.3793983 1658.4915469 279.4251319 646.8938120
## 1021 1022 1023 1024 1025
## -2386.9846836 -783.5197787 271.1481464 -899.7854386 -1033.5197787
## 1026 1027 1028 1029 1030
## -99.7854386 -70.7190236 -1412.5861936 925.8160715 -928.8518536
## 1031 1032 1033 1034 1035
## -2583.5197787 1071.1481464 -2583.5197787 -733.5197787 583.9489014
## 1036 1037 1038 1039 1040
## -257.9182686 -599.7854386 -1583.5197787 2271.1481464 -1583.5197787
## 1041 1042 1043 1044 1045
## -70.7190236 442.0817314 -1829.1839285 -612.5861936 -266.0510986
## 1046 1047 1048 1049 1050
## -570.7190236 629.2809764 -266.0510986 -1096.3205337 -425.3869487
## 1051 1052 1053 1054 1055
## -925.3869487 -1096.3205337 -1096.3205337 232.7458813 -796.3205337
## 1056 1057 1058 1059 1060
## 428.0779562 -96.3205337 61.8122963 -375.3869487 561.8122963
## 1061 1062 1063 1064 1065
## -625.3869487 -925.3869487 -796.3205337 -375.3869487 -546.3205337
## 1066 1067 1068 1069 1070
## 1403.6794663 -425.3869487 -1109.1212887 324.6130513 -134.7227988
## 1071 1072 1073 1074 1075
## -862.0548737 -763.7892138 -960.6563838 573.4100312 23.4100312
## 1076 1077 1078 1079 1080
## -397.5235538 1172.8085211 -2328.1250639 -3164.8480784 -2206.7152484
## 1081 1082 1083 1084 1085
## -1993.9144934 -2335.7816634 -1822.9809084 -2792.3167585 -1390.4495885
## 1086 1087 1088 1089 1090
## -847.9809084 -1164.8480784 -1022.9809084 -822.9809084 -1335.7816634
## 1091 1092 1093 1094 1095
## -1627.6488335 -177.6488335 -193.9144934 -1361.3831735 -777.6488335
## 1096 1097 1098 1099 1100
## -4019.5160035 -877.9809084 1177.0190916 -1390.7816634 -2822.9809084
## 1101 1102 1103 1104 1105
## -2519.5160035 -1364.8480784 -1977.6488335 -2993.9144934 -2848.5824185
## 1106 1107 1108 1109 1110
## -3214.8480784 -2164.8480784 -2082.3167585 -1335.7816634 -1732.6488335
## 1111 1112 1113 1114 1115
## -3164.8480784 -1298.5824185 -2811.3831735 -1335.7816634 -3361.3831735
## 1116 1117 1118 1119 1120
## -1335.7816634 -1506.7152484 -1403.2503435 -2048.5824185 556.0855066
## 1121 1122 1123 1124 1125
## -2284.5786433 -968.3129833 -1955.5122283 -1797.3793983 -2284.5786433
## 1126 1127 1128 1129 1130
## -589.2465684 -797.3793983 -1168.3129833 -997.3793983 728.2221117
## 1131 1132 1133 1134 1135
## -468.3129833 -1384.5786433 -1392.7114733 -1681.1137384 -2481.1137384
## 1136 1137 1138 1139 1140
## -847.3793983 -971.7778883 -1455.5122283 -2113.6450583 -1563.6450583
## 1141 1142 1143 1144 1145
## -1247.3793983 -1777.4458133 -1284.5786433 -2614.2465684 607.2885267
## 1146 1147 1148 1149 1150
## -1531.1137384 294.4877717 -968.3129833 -221.7778883 -142.7114733
## 1151 1152 1153 1154 1155
## -721.7778883 -1594.2465684 -1663.6450583 -1113.6450583 -2139.2465684
## 1156 1157 1158 1159 1160
## -1513.6450583 -847.3793983 -1310.1801534 -942.7114733 410.7534316
## 1161 1162 1163 1164 1165
## -655.5122283 -752.3793983 -497.3793983 -313.6450583 -3310.1801534
## 1166 1167 1168 1169 1170
## -971.7778883 -163.6450583 -1339.2465684 228.2221117 -1938.3129833
## 1171 1172 1173 1174 1175
## 11.6870167 -905.5122283 -1771.7778883 -1981.1137384 -2139.2465684
## 1176 1177 1178 1179 1180
## -942.7114733 -284.5786433 -2310.1801534 -797.3793983 -947.3793983
## 1181 1182 1183 1184 1185
## -221.7778883 -976.4458133 -1639.2465684 1202.6206017 -2127.6488335
## 1186 1187 1188 1189 1190
## -2455.5122283 -891.5084531 129.4251319 -2785.7816634 -1942.7114733
## 1191 1192 1193 1194 1195
## 31.6870167 173.5541867 373.5541867 228.2221117 28.2221117
## 1196 1197 1198 1199 1200
## 2686.3549417 -392.7114733 -942.7114733 -2720.5748681 -1720.5748681
## 1201 1202 1203 1204 1205
## -2213.9771332 -1933.3756232 -1354.3092082 1108.4915469 -1958.9771332
## 1206 1207 1208 1209 1210
## -629.9107182 -4258.9771332 -1196.1763782 408.4915469 -1770.5748681
## 1211 1212 1213 1214 1215
## -1429.9107182 -275.2427932 -1258.9771332 -604.3092082 716.6243768
## 1216 1217 1218 1219 1220
## -62.4420381 -1275.2427932 -733.3756232 -433.3756232 395.6907918
## 1221 1222 1223 1224 1225
## -1917.1099632 -2796.1763782 -1033.3756232 786.0228668 -708.9771332
## 1226 1227 1228 1229 1230
## 461.9564518 -91.5084531 -1104.3092082 -1733.3756232 -220.5748681
## 1231 1232 1233 1234 1235
## -262.4420381 -946.1763782 -2088.0435482 -933.3756232 288.4915469
## 1236 1237 1238 1239 1240
## 741.0228668 -1050.8443032 -720.5748681 -1404.3092082 -1233.3756232
## 1241 1242 1243 1244 1245
## 279.4251319 279.4251319 82.8900368 -91.5084531 -917.1099632
## 1246 1247 1248 1249 1250
## -233.3756232 -720.5748681 1911.9564518 153.4915469 -454.3092082
## 1251 1252 1253 1254 1255
## 95.6907918 -2600.8443032 -75.2427932 -1720.5748681 -3429.9107182
## 1256 1257 1258 1259 1260
## -1129.9107182 374.7572068 -233.3756232 908.4915469 1729.4251319
## 1261 1262 1263 1264 1265
## -1383.3756232 -570.5748681 -746.1763782 224.7572068 -1838.0435482
## 1266 1267 1268 1269 1270
## 487.5579619 737.5579619 766.6243768 582.8900368 -262.4420381
## 1271 1272 1273 1274 1275
## -1538.0435482 53.8236218 -2088.0435482 -1188.3756232 786.0228668
## 1276 1277 1278 1279 1280
## -591.5084531 -62.4420381 -233.3756232 -196.1763782 -1683.3756232
## 1281 1282 1283 1284 1285
## 424.7572068 -1433.3756232 -720.5748681 424.7572068 -1258.9771332
## 1286 1287 1288 1289 1290
## 408.4915469 -555.8443032 -1117.1099632 570.0892818 -233.3756232
## 1291 1292 1293 1294 1295
## -3029.9107182 -920.5748681 -391.5084531 237.5579619 -917.1099632
## 1296 1297 1298 1299 1300
## -233.3756232 377.5579619 803.8236218 424.7572068 1824.4251319
## 1301 1302 1303 1304 1305
## -1879.9107182 -233.3756232 -308.9771332 -625.2427932 -433.3756232
## 1306 1307 1308 1309 1310
## -1062.4420381 82.8900368 1324.4251319 -1604.3092082 399.1556968
## 1311 1312 1313 1314 1315
## -1088.0435482 -62.4420381 53.8236218 424.7572068 -933.3756232
## 1316 1317 1318 1319 1320
## 66.6243768 349.1556968 -770.5748681 766.6243768 779.4251319
## 1321 1322 1323 1324 1325
## 711.9564518 -550.8443032 -746.1763782 -25.2427932 2108.4915469
## 1326 1327 1328 1329 1330
## 437.5579619 -461.2390180 -1161.2390180 130.6281520 -1265.9069431
## 1331 1332 1333 1334 1335
## -1065.9069431 -1874.0397730 330.6281520 2159.6945670 -461.2390180
## 1336 1337 1338 1339 1340
## 305.0266419 630.6281520 467.8273970 -636.8405281 -65.9069431
## 1341 1342 1343 1344 1345
## -1486.8405281 592.2258869 -2036.8405281 -382.1726030 -894.9733581
## 1346 1347 1348 1349 1350
## 134.0930569 1330.6281520 617.8273970 -894.9733581 405.0266419
## 1351 1352 1353 1354 1355
## -365.3054330 450.3587169 1330.6281520 -1149.9733581 288.7609820
## 1356 1357 1358 1359 1360
## 159.6945670 409.6945670 -369.3718480 680.6281520 963.1594719
## 1361 1362 1363 1364 1365
## -1828.7076981 -207.7741131 1538.7609820 -182.1726030 988.7609820
## 1366 1367 1368 1369 1370
## -382.1726030 1959.6945670 -24.0397730 -724.0397730 134.0930569
## 1371 1372 1373 1374 1375
## 1630.6281520 1959.6945670 -549.6412831 792.2258869 488.7609820
## 1376 1377 1378 1379 1380
## -1011.2390180 -182.1726030 2130.6281520 -1599.6412831 1817.8273970
## 1381 1382 1383 1384 1385
## -182.1726030 1196.8938120 -325.9069431 -36.8405281 1488.7609820
## 1386 1387 1388 1389 1390
## -2207.7741131 450.3587169 250.3587169 -1524.0397730 25.9602270
## 1391 1392 1393 1394 1395
## 1488.7609820 617.8273970 -840.3054330 963.1594719 475.9602270
## 1396 1397 1398 1399 1400
## 1434.0930569 621.2923019 -24.0397730 946.8938120 130.6281520
## 1401 1402 1403 1404 1405
## 1988.7609820 959.6945670 -524.0397730 1317.8273970 475.9602270
## 1406 1407 1408 1409 1410
## -2483.7076981 -144.9733581 475.9602270 1630.6281520 1475.9602270
## 1411 1412 1413 1414 1415
## 100.3587169 1130.6281520 2709.6945670 617.8273970 621.2923019
## 1416 1417 1418 1419 1420
## 709.6945670 1446.8938120 -657.7741131 1305.0266419 3538.7609820
## 1421 1422 1423 1424 1425
## -815.9069431 446.8938120 1538.7609820 -65.9069431 1538.7609820
## 1426 1427 1428 1429 1430
## 1330.6281520 -65.9069431 1330.6281520 792.2258869 463.1594719
## 1431 1432 1433 1434 1435
## 1988.7609820 1830.6281520 -999.6412831 2858.1594719 342.2258869
## 1436
## -1078.7076981
When an explanatory variable is categorical, lm() uses contrasts to create dummy variables. Contrasts is the umbrella term used to describe the process of testing linear combinations of parameters from regression models. All statistical software uses contrasts, but each package has its own defaults and its own way of overriding them.
The default contrasts in R are “treatment” contrasts (aka “dummy coding”), where each level within a factor is identified within a matrix of binary 0 / 1 variables, with the first level chosen as the reference category. They’re called “treatment” contrasts, because of the typical use case where there is one control group (the reference group) and one or more treatment groups that are to be compared to the controls. It is easy to change the default contrasts to something other than treatment contrasts, though this is rarely needed. More often, we may want to change the reference group in treatment contrasts or get all sets of pairwise contrasts between factor levels.
Let us take Fuel_Type as an example. There are three fuel types, namely CNG, Diesel and Petrol.
contrasts(car_resale$Fuel_Type)
## Diesel Petrol
## CNG 0 0
## Diesel 1 0
## Petrol 0 1
Notice that, by default, two dummy variables, namely Diesel and Petrol, are created, and CNG (the first level) serves as the reference category.
We can change the reference group by using relevel() as shown in the code chunk below.
car_resale$Fuel_Type <- relevel(car_resale$Fuel_Type, ref = "Petrol")
If you re-run the code chunk below,
contrasts(car_resale$Fuel_Type)
notice that the newly created dummy variables are CNG and Diesel.
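The effect of relevel() can be tried on a small example without the car data. The code chunk below is a minimal sketch that uses a made-up factor called fuel (a hypothetical stand-in for Fuel_Type, not taken from the data file) and base R only.

```r
# Toy factor with the same three levels as Fuel_Type (the values are made up).
fuel <- factor(c("Petrol", "Diesel", "CNG", "Petrol", "Diesel"))

# Default treatment contrasts: the first level (CNG) is the reference.
default_dummies <- colnames(contrasts(fuel))
print(contrasts(fuel))

# After relevel(), Petrol becomes the reference category.
fuel_petrol_ref <- relevel(fuel, ref = "Petrol")
releveled_dummies <- colnames(contrasts(fuel_petrol_ref))
print(contrasts(fuel_petrol_ref))

# model.matrix() shows the 0/1 columns that lm() would build from the factor.
print(model.matrix(~ fuel_petrol_ref))
```

The reference level never gets its own column. Its effect is absorbed into the intercept, and the coefficient of every other level is read as a difference from the reference.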
car.slr1 <- lm(Price ~ Doors, data = car_resale)
summary(car.slr1)
##
## Call:
## lm(formula = Price ~ Doors, data = car_resale)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7153.2 -2253.2 -857.3 991.8 20996.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8100 2514 3.222 0.0013 **
## Doors3 2007 2518 0.797 0.4255
## Doors4 1707 2532 0.674 0.5004
## Doors5 3403 2518 1.352 0.1767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3555 on 1432 degrees of freedom
## Multiple R-squared: 0.04108, Adjusted R-squared: 0.03908
## F-statistic: 20.45 on 3 and 1432 DF, p-value: 5.605e-13
The code chunk below shows that lm can also be passed to the method argument of geom_smooth() in ggplot2 to overlay the fitted regression line on a scatter plot.
ggplot(data=car_resale,
aes(x=`Age_08_04`, y=`Price`)) +
geom_point() +
geom_smooth(method = lm)
Figure above reveals that there are a few statistical outliers with relatively high selling prices.
condo.slr1 <- lm(formula = log10(Price) ~ Age_08_04, data = car_resale)
summary(condo.slr1)
##
## Call:
## lm(formula = log10(Price) ~ Age_08_04, data = car_resale)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44410 -0.03247 0.00250 0.03906 0.22534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.349e+00 5.180e-03 839.67 <2e-16 ***
## Age_08_04 -6.064e-03 8.786e-05 -69.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06191 on 1434 degrees of freedom
## Multiple R-squared: 0.7686, Adjusted R-squared: 0.7684
## F-statistic: 4763 on 1 and 1434 DF, p-value: < 2.2e-16
ggplot(data=car_resale,
aes(x=`Age_08_04`, y=log10(Price))) +
geom_point() +
geom_smooth(method = lm)
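Because the response of this model is log10(Price), the slope of Age_08_04 is on the log scale and is not a price difference. The code chunk below is a rough sketch of how the slope could be converted into an approximate percentage change in price per one-unit increase in Age_08_04. The slope value is copied from the summary output above, and the variable names are ours.

```r
# Slope of Age_08_04 copied from the summary output of the log10 model.
slope_log10 <- -6.064e-03

# A one-unit increase in Age_08_04 multiplies the predicted price by 10^slope.
multiplier <- 10^slope_log10

# Express the multiplier as a percentage change in price.
pct_change <- (multiplier - 1) * 100
print(round(pct_change, 2))
```

In other words, under this model each additional unit of age is associated with a drop of roughly 1.4 percent in the predicted price.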
In the previous section, we have shared with you how to build simple and multiple linear regression models by using lm() of R stats. Although lm() is adequate for building linear regression models, it does not provide functions for checking multicollinearity or for testing the three classic regression assumptions. Furthermore, it does not provide functions for variable selection. To meet these analysis needs, one has to look into several different R packages, for example vif() of the car package for the multicollinearity check, ad.test() of the nortest package for the normality assumption test and bptest() of the lmtest package for the heteroscedasticity test, just to name a few.
olsrr, on the other hand, provides all these necessary tests and methods under one roof. In this and next sections, we will share with you how olsrr can be used to meet these analysis needs.
The code chunk below builds a multiple linear regression model for the trade-in prices by using the continuous explanatory variables.
car.mlr <- lm(formula = Price ~ Age_08_04 + Mfg_Month + Mfg_Year + KM + Quarterly_Tax + Weight + Guarantee_Period, data=car_resale)
summary(car.mlr)
##
## Call:
## lm(formula = Price ~ Age_08_04 + Mfg_Month + Mfg_Year + KM +
## Quarterly_Tax + Weight + Guarantee_Period, data = car_resale)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10606.7 -739.4 0.5 735.7 6509.5
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.080e+03 1.079e+03 -1.001 0.3168
## Age_08_04 -1.239e+02 2.717e+00 -45.610 <2e-16 ***
## Mfg_Month -1.092e+02 1.090e+01 -10.019 <2e-16 ***
## Mfg_Year NA NA NA NA
## KM -2.286e-02 1.248e-03 -18.310 <2e-16 ***
## Quarterly_Tax -1.024e+00 1.241e+00 -0.825 0.4094
## Weight 1.949e+01 9.905e-01 19.681 <2e-16 ***
## Guarantee_Period 2.592e+01 1.238e+01 2.093 0.0365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1366 on 1429 degrees of freedom
## Multiple R-squared: 0.8587, Adjusted R-squared: 0.8581
## F-statistic: 1447 on 6 and 1429 DF, p-value: < 2.2e-16
Quiz: What can you observe from the output?
Before building a multiple regression model, it is important to ensure that the independent variables used are not highly correlated to each other. If these highly correlated independent variables are used in building a regression model by mistake, the quality of the model will be compromised. This phenomenon is known as multicollinearity in statistics.
Variance inflation factors (VIF) measure the inflation in the variances of the parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is inflated by the existence of correlation among the predictor variables in the model. A VIF of 1 means that there is no correlation among the kth predictor and the remaining predictor variables, and hence the variance of βk is not inflated at all. The general rule of thumb is that VIFs exceeding 5 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
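The definition above corresponds to VIF_k = 1 / (1 - R²_k), where R²_k is the R-squared obtained by regressing the kth predictor on all the other predictors, and tolerance is simply 1 / VIF_k. The code chunk below is a minimal sketch of this calculation in base R. It uses the built-in mtcars data set as a stand-in for the car data, so the numbers are for illustration only.

```r
# Predictors of an illustrative model on the built-in mtcars data.
predictors <- c("wt", "hp", "disp")

# For each predictor, regress it on the remaining predictors and
# convert the resulting R-squared into a VIF.
vif_by_hand <- sapply(predictors, function(k) {
  others <- setdiff(predictors, k)
  aux <- lm(reformulate(others, response = k), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

# Tolerance is the reciprocal of VIF.
tolerance <- 1 / vif_by_hand
print(round(cbind(Tolerance = tolerance, VIF = vif_by_hand), 3))
```

A predictor that is well explained by the others has an R²_k close to 1, so its tolerance approaches 0 and its VIF grows without bound.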
In the code chunk below, ols_vif_tol() of the olsrr package is used to check for signs of multicollinearity.
ols_vif_tol(car.mlr)
## Variables Tolerance VIF
## 1 Age_08_04 0.0000000 Inf
## 2 Mfg_Month 0.0000000 Inf
## 3 Mfg_Year 0.0000000 Inf
## 4 KM 0.5933991 1.685206
## 5 Quarterly_Tax 0.4991949 2.003225
## 6 Weight 0.4785777 2.089525
## 7 Guarantee_Period 0.9360399 1.068331
The VIFs of Age_08_04, Mfg_Month and Mfg_Year are infinite (tolerance of 0), far above the threshold of 10. This indicates that these three variables are exactly linearly dependent, which is also why lm() reported NA for Mfg_Year ("1 not defined because of singularities") in the earlier summary. We can safely conclude that there is multicollinearity among these independent variables.
A correlation matrix is commonly used to visualise the relationships between the independent variables. Besides pairs() of base R, many packages support the display of a correlation matrix. In this section, the corrplot package will be used.
The code chunk below is used to plot a correlation matrix of the independent variables in the car_resale data.frame.
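An infinite VIF arises when one predictor is an exact linear combination of the others. The code chunk below is a small simulation of this situation. It assumes, purely for illustration, that age is counted in months back from August 2004 and is therefore fully determined by the manufacturing year and month. All data and variable names in the chunk are simulated and are not taken from the car data.

```r
set.seed(1)
n <- 200

# Simulated manufacturing dates (hypothetical values).
mfg_year  <- sample(1998:2004, n, replace = TRUE)
mfg_month <- sample(1:12, n, replace = TRUE)

# ASSUMPTION: age in months counted back from August 2004, so age is an
# exact linear function of mfg_year and mfg_month.
age <- (2004 - mfg_year) * 12 + (8 - mfg_month)
price <- 20000 - 120 * age + rnorm(n, sd = 1000)

# lm() cannot estimate all three coefficients and reports NA for one of them.
fit <- lm(price ~ age + mfg_month + mfg_year)
print(coef(fit))

# Regressing age on the other two predictors leaves no residual variation,
# so R-squared is 1, tolerance is 0 and VIF is infinite.
aux <- lm(age ~ mfg_month + mfg_year)
aux_r2 <- 1 - sum(resid(aux)^2) / sum((age - mean(age))^2)
print(aux_r2)
```

Dropping any one of the three dependent predictors removes the exact dependency and makes the remaining coefficients estimable.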
corrplot(cor(car_resale[, 4:10]), diag = FALSE, order = "AOE",
tl.pos = "td", tl.cex = 1.0, method = "number", type = "upper")
Matrix reordering is very important for mining the hidden structure and patterns in the matrix. The order parameter of corrplot supports four reordering methods, named “AOE”, “FPC”, “hclust” and “alphabet”. In the code chunk above, the AOE order is used. It orders the variables by the angular order of the eigenvectors.
From the correlation matrix, it is clear that Age_08_04 is highly correlated with Mfg_Year. In view of this, it is wiser to include only one of them in the subsequent model building. As a result, Mfg_Year is excluded from the subsequent model building.
Let us perform the multicollinearity check again by using the revised model.
car.mlr <- lm(formula = Price ~ Age_08_04 + Mfg_Month + KM + Quarterly_Tax + Weight + Guarantee_Period, data=car_resale)
ols_vif_tol(car.mlr)
## Variables Tolerance VIF
## 1 Age_08_04 0.5095030 1.962697
## 2 Mfg_Month 0.9735547 1.027164
## 3 KM 0.5933991 1.685206
## 4 Quarterly_Tax 0.4991949 2.003225
## 5 Weight 0.4785777 2.089525
## 6 Guarantee_Period 0.9360399 1.068331
Since the VIFs of all the independent variables are less than 10, we can safely conclude that there is no sign of multicollinearity among the independent variables.
The code chunk below uses lm() to calibrate the revised multiple linear regression model. However, instead of displaying the output directly, ols_regress() of the olsrr package is used to display the regression report.
car.mlr <- lm(formula = Price ~ Age_08_04 + Mfg_Month + KM + Quarterly_Tax + Weight + Guarantee_Period, data=car_resale)
ols_regress(car.mlr)
## Model Summary
## -------------------------------------------------------------------
## R 0.927 RMSE 1366.403
## R-Squared 0.859 Coef. Var 12.733
## Adj. R-Squared 0.858 MSE 1867056.840
## Pred R-Squared 0.853 MAE 960.600
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 16209217239.610 6 2701536206.602 1446.949 0.0000
## Residual 2668024224.167 1429 1867056.840
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------------
## (Intercept) -1080.337 1078.775 -1.001 0.317 -3196.490 1035.816
## Age_08_04 -123.916 2.717 -0.635 -45.610 0.000 -129.245 -118.586
## Mfg_Month -109.203 10.899 -0.101 -10.019 0.000 -130.583 -87.822
## KM -0.023 0.001 -0.236 -18.310 0.000 -0.025 -0.020
## Quarterly_Tax -1.024 1.241 -0.012 -0.825 0.409 -3.459 1.411
## Weight 19.494 0.990 0.283 19.681 0.000 17.551 21.437
## Guarantee_Period 25.919 12.382 0.022 2.093 0.036 1.631 50.208
## ------------------------------------------------------------------------------------------------------
One of the advantages of using ols_regress() instead of displaying the lm() report directly is that its report format is tidier and easier to understand than the latter. Notice that the Model Summary is displayed first, followed by the ANOVA and Parameter Estimates reports. Furthermore, it also provides standardized betas and confidence intervals for the coefficients.
Linear regression makes several assumptions about the data at hand. This section describes the regression assumptions and shares with you a collection of functions provided by the olsrr package for regression diagnostics.
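The standardized betas and confidence intervals in the report can also be reproduced with base R. The code chunk below is a minimal sketch that uses the built-in mtcars data set as a stand-in for the car data. It rescales each slope by sd(x) / sd(y), which is the usual definition of a standardized beta, and calls confint() for the 95% confidence intervals.

```r
# Illustrative model on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Unstandardized slopes (drop the intercept).
b <- coef(fit)[-1]

# Standardized beta = slope * sd(predictor) / sd(response).
std_beta <- b * sapply(mtcars[, names(b)], sd) / sd(mtcars$mpg)
print(round(std_beta, 3))

# 95% confidence intervals for the unstandardized coefficients.
print(confint(fit, level = 0.95))
```

Standardized betas are unit-free, so they can be used to compare the relative strength of predictors measured on very different scales, such as KM and Weight.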
After performing a regression analysis, you should always check if the model works well for the data at hand.
A first step of regression diagnostics is to inspect the significance of the regression beta coefficients, as well as the R-squared value that tells us how well the linear regression model fits the data. This has been described in the earlier sections (i.e. 4.1 and 5.2.3).
As obvious as this may seem, linear regression assumes that there exists a linear relationship between the dependent variable and the predictors. Violation of this assumption is very serious: it means that your linear model probably does a bad job at predicting your actual (non-linear) data. Perhaps the relationship between your predictor(s) and criterion is actually curvilinear or cubic. If that is the case, a linear model does a bad job at modeling that relationship, and it is inappropriate to use such a model. There’s no point in worrying about significance tests or confidence intervals if a linear model doesn’t reflect your non-linear data. Hence, when building a linear regression model, it is important for us to test the assumption of linearity and additivity of the relationship between the dependent and independent variables.
There are many methods that can be used to test the linearity assumption. By and large, they are graphical in nature. One of the most commonly used graphical methods for testing the linearity assumption is the scatter plot. In the scatter plot, the y-axis is used to map the residuals (errors) and the x-axis is used to map the fitted values (predicted values).
In the code chunk below, the ols_plot_resid_fit() of olsrr package is used to plot the scatter plot for linearity assumption test.
ols_plot_resid_fit(car.mlr)
Figure above reveals that most of the data points are scattered around the 0 line, hence we can safely conclude that the relationships between the dependent variable and independent variables are linear.
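For comparison, it helps to see what a violation of the linearity assumption looks like in the same kind of plot. The code chunk below is a sketch that uses simulated data (not the car data) with a deliberately curved relationship, fits a straight line to it and draws the residuals against the fitted values by using base R graphics.

```r
set.seed(42)

# Simulated data with a curved (quadratic) relationship.
x <- runif(300, 0, 10)
y_curved <- 5 + 0.8 * x^2 + rnorm(300, sd = 3)

# Fit a straight line even though the true relationship is curved.
fit_curved <- lm(y_curved ~ x)

# Residuals versus fitted values, with a zero line and a lowess smoother.
plot(fitted(fit_curved), resid(fit_curved),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(fitted(fit_curved), resid(fit_curved)), col = "red")
```

The residuals here still average to zero, but the smoother bends into a clear U shape. It is this kind of systematic pattern, rather than the mere closeness of the points to the 0 line, that signals a violation of linearity.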
The assumption of normality in regression manifests in three ways:
This assumption is most important when you have a small sample size (because the central limit theorem isn’t working in your favor), and when you’re interested in constructing confidence intervals or doing significance testing.
The olsrr package provides both graphical and statistical testing methods to meet the need for testing the normality assumption.
The graphical functions provided are:
The statistical testing, on the other hand, is supported by ols_test_normality().
In this session, we are going to focus on the normal Q-Q plot, the residual histogram and the formal statistical testing methods.
A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight. Here’s an example of a Normal Q-Q plot when both sets of quantiles truly come from Normal distributions.
For ols_plot_resid_qq(), the normal Q-Q plot is plotted by using the output of the multiple linear regression as shown in the code chunk below.
ols_plot_resid_qq(car.mlr)
Since most of the data points fall along the straight line, we can conclude that the model conforms to the normality assumption.
We can also validate the normality assumption by plotting the residuals using a histogram. If the model conforms to the normality assumption, the histogram should resemble a bell shape with the peak centred around 0.
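The construction of a normal Q-Q plot can be sketched by hand in base R. The code chunk below uses simulated data with normally distributed errors (not the car data), so the points are expected to lie close to the reference line. All variable names are ours.

```r
set.seed(7)

# Simulated regression with normally distributed errors.
x <- rnorm(200)
y <- 3 + 2 * x + rnorm(200)
fit <- lm(y ~ x)

# Sample quantiles: sorted standardized residuals.
std_res <- sort(as.numeric(scale(resid(fit))))

# Theoretical quantiles of the standard normal at matching probabilities.
theo_q <- qnorm(ppoints(length(std_res)))

# Plot one set of quantiles against the other and add the 45-degree line.
plot(theo_q, std_res,
     xlab = "Theoretical quantiles", ylab = "Standardized residuals")
abline(0, 1, lty = 2)

# Correlation between the two sets of quantiles; close to 1 for normal data.
qq_cor <- cor(theo_q, std_res)
print(qq_cor)
```

The base R functions qqnorm() and qqline() produce an equivalent plot in two calls. Heavy tails or skewness show up as points that peel away from the line at the ends.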
The code chunk below is used to plot the residual histogram.
ols_plot_resid_hist(car.mlr)
The figure above reveals that the residuals of the multiple linear regression model (i.e. car.mlr) resemble a normal distribution.
If you prefer formal statistical test methods, ols_test_normality() of the olsrr package can be used as shown in the code chunk below.
ols_test_normality(car.mlr)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.9248 0.0000
## Kolmogorov-Smirnov 0.0587 1e-04
## Cramer-von Mises 119.5711 0.0000
## Anderson-Darling 12.5844 0.0000
## -----------------------------------------------
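Four commonly used normality assumption tests are supported by ols_test_normality(). They are the Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling tests.
The summary table above reveals that the p-values of the four tests are far smaller than the alpha value of 0.05. Hence, we reject the null hypothesis that the residuals are normally distributed. The formal tests therefore indicate that the residuals depart from normality, even though the Q-Q plot and the histogram look roughly normal. With 1436 observations, these tests are sensitive to even small departures from normality, so the graphical and formal results should be weighed together.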
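The direction of the decision rule is easy to get wrong, so the code chunk below sketches it with shapiro.test() of base R on simulated data (not the car data). The errors are deliberately drawn from a skewed distribution, so the null hypothesis of normality is expected to be rejected.

```r
set.seed(123)

# Simulated regression whose errors are skewed (centred exponential).
x <- runif(500)
y_skewed <- 1 + 2 * x + (rexp(500) - 1)
fit_skewed <- lm(y_skewed ~ x)

# Shapiro-Wilk test. H0: the residuals are normally distributed.
sw <- shapiro.test(resid(fit_skewed))
print(sw)

# Decision rule: a p-value below alpha means we reject normality.
alpha <- 0.05
reject_normality <- sw$p.value < alpha
print(reject_normality)
```

A small p-value is therefore evidence against normality, while a large p-value only means that the test found no evidence of a departure.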
Heteroscedasticity means unequal scatter. In regression analysis, we talk about heteroscedasticity in the context of the residuals or error term. Specifically, heteroscedasticity is a systematic change in the spread of the residuals over the range of measured values. Heteroscedasticity is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population that has a constant variance (homoscedasticity).
The easiest way to test for heteroskedasticity is to get a good look at your data. Ideally, you generally want your data to all follow a pattern of a line, but sometimes it doesn’t. The quickest way to identify heteroskedastic data is to see the shape that the plotted data take.
Heteroscedasticity produces a distinctive fan or cone shape in residual plots. To check for heteroscedasticity, we need to assess the residuals by fitted value plots specifically. The code chunk below will be used to plot the fitted values against the residual values.
ols_plot_resid_fit(car.mlr)
The figure above does not show any distinctive fan or cone shape in the residual plot. Based on the plot alone, we would say that the residuals are homoscedastic.
Besides the residual plot, olsrr also provides the following four tests for detecting heteroscedasticity:
Breusch Pagan Test was introduced by Trevor Breusch and Adrian Pagan in 1979.
You can perform the test using the fitted values of the model, the predictors in the model or a subset of the independent variables. It includes options to perform multiple tests and p-value adjustments. The options for p-value adjustments include the Bonferroni, Sidak and Holm methods.
The code chunk below uses the fitted values of the model to perform the homogeneity assumption test. The second call, with rhs = TRUE, uses the predictors in the model instead.
ols_test_breusch_pagan(car.mlr)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## --------------------------------
## DF = 1
## Chi2 = 501.7464
## Prob > Chi2 = 3.962598e-111
ols_test_breusch_pagan(car.mlr, rhs = TRUE)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------------------------------------------
## Response : Price
## Variables: Age_08_04 Mfg_Month KM Quarterly_Tax Weight Guarantee_Period
##
## Test Summary
## ----------------------------
## DF = 6
## Chi2 = 2198.1848
## Prob > Chi2 = 0.0000
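As a rough illustration of what the test computes, the sketch below works through the classic form of the Breusch Pagan statistic on fitted values in base R. The simulated data and all object names are assumptions for illustration. The code is not taken from the olsrr source and is not run on car.mlr.

```r
# Simulated heteroscedastic data (an assumption for illustration)
set.seed(123)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

e2     <- resid(fit)^2
sigma2 <- sum(e2) / n        # maximum likelihood estimate of the error variance
g      <- e2 / sigma2        # scaled squared residuals
fv     <- fitted(fit)

# Auxiliary regression of the scaled squared residuals on the fitted values
aux <- lm(g ~ fv)
ess <- sum((fitted(aux) - mean(g))^2)   # explained sum of squares

bp_stat <- ess / 2                      # classic Breusch Pagan statistic
bp_df   <- 1                            # one auxiliary regressor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(Chi2 = bp_stat, DF = bp_df, p = bp_p)
```

A small Prob > Chi2 (for example, below 0.05) leads to rejecting Ho, the hypothesis that the variance is constant.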
The F test checks for heteroscedasticity under the assumption that the errors are independent and identically distributed (i.i.d.). You can perform the test using the fitted values of the model, the predictors in the model, or a subset of the independent variables.
The code chunk below uses ols_test_f() of the olsrr package to perform the homogeneity assumption test on the fitted values of the model.
ols_test_f(car.mlr)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## Num DF = 1
## Den DF = 1434
## F = 111.567
## Prob > F = 3.633432e-25
The code chunk below uses ols_test_f() of the olsrr package to perform the homogeneity assumption test on the independent variables of the model.
ols_test_f(car.mlr, rhs = TRUE)
##
## F Test for Heteroskedasticity
## -----------------------------
## Ho: Variance is homogenous
## Ha: Variance is not homogenous
##
## Variables: Age_08_04 Mfg_Month KM Quarterly_Tax Weight Guarantee_Period
##
## Test Summary
## -----------------------------
## Num DF = 6
## Den DF = 1429
## F = 110.1566
## Prob > F = 2.795615e-114
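One common form of this F test regresses the squared residuals on the fitted values and asks whether that auxiliary regression explains anything. The sketch below follows that form on simulated data in base R. The data and object names are assumptions, and whether olsrr computes the statistic in exactly this way is not confirmed here.

```r
# Simulated heteroscedastic data (an assumption for illustration)
set.seed(123)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

e2 <- resid(fit)^2
fv <- fitted(fit)

# Auxiliary regression of the squared residuals on the fitted values
aux <- lm(e2 ~ fv)
r2  <- summary(aux)$r.squared
k   <- 1                                # number of auxiliary regressors

num_df <- k
den_df <- n - k - 1
f_stat <- (r2 / num_df) / ((1 - r2) / den_df)
f_p    <- pf(f_stat, num_df, den_df, lower.tail = FALSE)
c(F = f_stat, NumDF = num_df, DenDF = den_df, p = f_p)
```

With one auxiliary regressor and n observations, the degrees of freedom are 1 and n - 2, which has the same structure as the Num DF = 1 and Den DF = 1434 reported above for 1436 records.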
When building regression models, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model. The actual set of independent variables used in the final regression model must be determined by analysis of the data. Determining this subset is called the variable selection problem.
Finding this subset of independent (also known as regressor) variables involves two opposing objectives. First, we want the regression model to be as complete and realistic as possible. We want every regressor that is even remotely related to the dependent variable to be included. Second, we want to include as few variables as possible because each irrelevant regressor decreases the precision of the estimated coefficients and predicted values. Also, the presence of extra variables increases the complexity of data collection and model maintenance. The goal of variable selection becomes one of parsimony: achieve a balance between simplicity (as few regressors as possible) and fit (as many regressors as needed). One commonly used strategy is the stepwise variable selection method.
The stepwise regression method consists of three variable selection strategies: backward elimination, forward selection, and a combination of both.
In the olsrr package, these three stepwise variable selection strategies are supported by ols_step_backward_p(), ols_step_forward_p(), and ols_step_both_p() respectively.
In order to show how to perform variable selection by using the stepwise regression functions, let us first build a complex multiple linear regression model using most of the independent variables provided.
car.mlr <- lm(formula = Price ~ Age_08_04 + Mfg_Month + KM + Quarterly_Tax + Weight + Guarantee_Period +
                HP_Bin + CC_bin + Met_Color + Mfr_Guarantee + Airco + Automatic_airco + Boardcomputer +
                Central_Lock + Powered_Windows + Sport_Model + Backseat_Divider + Tow_Bar,
              data = car_resale)
ols_regress(car.mlr)
## Model Summary
## -------------------------------------------------------------------
## R 0.954 RMSE 1099.239
## R-Squared 0.909 Coef. Var 10.244
## Adj. R-Squared 0.908 MSE 1208325.836
## Pred R-Squared 0.903 MAE 812.346
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------------------
## Regression 17167460405.194 20 858373020.260 710.382 0.0000
## Residual 1709781058.583 1415 1208325.836
## Total 18877241463.777 1435
## --------------------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------------------
## (Intercept) 5006.450 1099.263 4.554 0.000 2850.089 7162.811
## Age_08_04 -113.432 3.135 -0.582 -36.177 0.000 -119.582 -107.281
## Mfg_Month -95.155 8.928 -0.088 -10.658 0.000 -112.669 -77.642
## KM -0.017 0.001 -0.178 -15.845 0.000 -0.019 -0.015
## Quarterly_Tax 9.718 1.359 0.110 7.153 0.000 7.053 12.383
## Weight 11.312 0.998 0.164 11.333 0.000 9.354 13.271
## Guarantee_Period 68.048 11.744 0.056 5.794 0.000 45.011 91.085
## HP_Bin> 120 4916.462 380.392 0.118 12.925 0.000 4170.268 5662.655
## HP_Bin100-120 3320.525 382.257 0.448 8.687 0.000 2570.673 4070.376
## CC_bin>1600 -1366.308 201.184 -0.120 -6.791 0.000 -1760.960 -971.656
## CC_bin1600 -3297.887 379.977 -0.447 -8.679 0.000 -4043.266 -2552.508
## Met_Color1 22.976 64.553 0.003 0.356 0.722 -103.653 149.605
## Mfr_Guarantee1 285.597 63.787 0.039 4.477 0.000 160.470 410.725
## Airco1 278.200 75.872 0.038 3.667 0.000 129.365 427.034
## Automatic_airco1 2256.939 157.344 0.144 14.344 0.000 1948.286 2565.592
## Boardcomputer1 -153.918 99.718 -0.019 -1.544 0.123 -349.528 41.693
## Central_Lock1 -52.167 124.157 -0.007 -0.420 0.674 -295.720 191.385
## Powered_Windows1 462.223 124.252 0.063 3.720 0.000 218.486 705.960
## Sport_Model1 278.658 72.177 0.035 3.861 0.000 137.072 420.243
## Backseat_Divider1 -9.349 95.433 -0.001 -0.098 0.922 -196.555 177.857
## Tow_Bar1 -137.656 68.384 -0.017 -2.013 0.044 -271.802 -3.511
## --------------------------------------------------------------------------------------------------------
Notice that the p-values of Met_Color1, Boardcomputer1, Central_Lock1, and Backseat_Divider1 are greater than 0.05.
Now we are going to perform forward stepwise regression. Forward selection starts from a model with no independent variables and adds them one at a time, admitting a variable only if its p-value is less than 0.05. The code chunk below uses ols_step_forward_p() of the olsrr package to build a forward stepwise regression model by setting the penter argument to 0.05. As a result, independent variables with p-values greater than 0.05 never enter the model.
ols_step_forward_p(car.mlr,
penter = 0.05,
print_plot = TRUE)
##
## Selection Summary
## ----------------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ----------------------------------------------------------------------------------------
## 1 Age_08_04 0.7684 0.7682 2186.0335 25518.9706 1746.0382
## 2 Automatic_airco 0.8247 0.8244 1308.7496 25121.1462 1519.6553
## 3 KM 0.8427 0.8423 1029.8950 24967.7765 1440.1316
## 4 Weight 0.8742 0.8739 538.5506 24648.0525 1287.9643
## 5 HP_Bin 0.8830 0.8825 403.9815 24548.5871 1243.2596
## 6 Mfg_Month 0.8905 0.8899 289.0095 24455.6261 1203.2452
## 7 CC_bin 0.8965 0.8958 197.6782 24379.0785 1170.7882
## 8 Powered_Windows 0.9005 0.8998 136.2015 24323.5954 1147.9904
## 9 Quarterly_Tax 0.9036 0.9029 89.8123 24280.1756 1130.3748
## 10 Guarantee_Period 0.9057 0.9049 59.6168 24251.1397 1118.6181
## 11 Mfr_Guarantee 0.9070 0.9061 41.0703 24232.9769 1111.1829
## 12 Sport_Model 0.9081 0.9072 25.0118 24217.0187 1104.6450
## 13 Airco 0.9090 0.9080 13.8241 24205.7609 1099.9446
## 14 Tow_Bar 0.9092 0.9082 11.8620 24203.7537 1098.7979
## ----------------------------------------------------------------------------------------
Notice that the four independent variables with p-values greater than 0.05 (Met_Color, Boardcomputer, Central_Lock and Backseat_Divider) did not enter the model. Although these four variables are left out, it is interesting to note that the adjusted R-square remains at 0.908.
You can also request a detailed printout of each iteration by using the details argument as shown in the code chunk below.
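The mechanics of forward selection by p-value can be sketched in base R with add1(). The example below is a simplified illustration on simulated data. The data frame d, the variable names x1 to x3 and the loop itself are assumptions, not the olsrr implementation.

```r
# Simulated data (an assumption): y depends on x1 and x2, x3 is pure noise
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

penter     <- 0.05                      # p-value threshold for entering
candidates <- c("x1", "x2", "x3")
full_scope <- ~ x1 + x2 + x3            # largest model that may be reached
current    <- lm(y ~ 1, data = d)       # start from the intercept-only model
entered    <- character(0)

repeat {
  # Stop when every candidate is already in the model
  if (length(setdiff(candidates, entered)) == 0) break

  # F test for adding each remaining candidate to the current model
  cand <- add1(current, scope = full_scope, test = "F")
  cand <- cand[-1, , drop = FALSE]      # drop the "<none>" row
  p    <- cand[["Pr(>F)"]]
  best <- which.min(p)

  # Stop when the best remaining candidate fails the entry criterion
  if (p[best] >= penter) break

  term    <- rownames(cand)[best]
  current <- update(current, as.formula(paste(". ~ . +", term)))
  entered <- c(entered, term)
}

entered            # variables in the order they entered
formula(current)   # final model
```

At each step, the candidate with the smallest p-value is added, and the loop stops as soon as no remaining candidate meets the entry threshold.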
ols_step_forward_p(car.mlr,
penter = 0.05,
details = TRUE)
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. Age_08_04
## 2. Mfg_Month
## 3. KM
## 4. Quarterly_Tax
## 5. Weight
## 6. Guarantee_Period
## 7. HP_Bin
## 8. CC_bin
## 9. Met_Color
## 10. Mfr_Guarantee
## 11. Airco
## 12. Automatic_airco
## 13. Boardcomputer
## 14. Central_Lock
## 15. Powered_Windows
## 16. Sport_Model
## 17. Backseat_Divider
## 18. Tow_Bar
##
## We are selecting variables based on p value...
##
##
## Forward Selection: Step 1
##
## - Age_08_04
##
## Model Summary
## -------------------------------------------------------------------
## R 0.877 RMSE 1746.038
## R-Squared 0.768 Coef. Var 16.271
## Adj. R-Squared 0.768 MSE 3048649.489
## Pred R-Squared 0.767 MAE 1246.747
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------------------
## Regression 14505478096.705 1 14505478096.705 4758.001 0.0000
## Residual 4371763367.072 1434 3048649.489
## Total 18877241463.777 1435
## -----------------------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------------
## (Intercept) 20294.059 146.097 138.908 0.000 20007.471 20580.646
## Age_08_04 -170.934 2.478 -0.877 -68.978 0.000 -175.795 -166.073
## --------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 2
##
## - Automatic_airco
##
## Model Summary
## -------------------------------------------------------------------
## R 0.908 RMSE 1519.655
## R-Squared 0.825 Coef. Var 14.162
## Adj. R-Squared 0.824 MSE 2309352.302
## Pred R-Squared 0.823 MAE 1096.665
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 15567939614.839 2 7783969807.419 3370.629 0.0000
## Residual 3309301848.938 1433 2309352.302
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 18841.989 144.054 130.799 0.000 18559.410 19124.567
## Age_08_04 -149.135 2.384 -0.765 -62.550 0.000 -153.812 -144.458
## Automatic_airco1 4121.588 192.156 0.262 21.449 0.000 3744.651 4498.524
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 3
##
## - KM
##
## Model Summary
## -------------------------------------------------------------------
## R 0.918 RMSE 1440.132
## R-Squared 0.843 Coef. Var 13.421
## Adj. R-Squared 0.842 MSE 2073979.052
## Pred R-Squared 0.841 MAE 1036.274
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 15907303461.666 3 5302434487.222 2556.648 0.0000
## Residual 2969938002.112 1432 2073979.052
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 19059.802 137.573 138.543 0.000 18789.935 19329.668
## Age_08_04 -134.462 2.534 -0.690 -53.064 0.000 -139.432 -129.491
## Automatic_airco1 3994.025 182.373 0.254 21.900 0.000 3636.278 4351.771
## KM -0.015 0.001 -0.156 -12.792 0.000 -0.017 -0.013
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 4
##
## - Weight
##
## Model Summary
## -------------------------------------------------------------------
## R 0.935 RMSE 1287.964
## R-Squared 0.874 Coef. Var 12.002
## Adj. R-Squared 0.874 MSE 1658851.986
## Pred R-Squared 0.871 MAE 932.109
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 16503424271.945 4 4125856067.986 2487.176 0.0000
## Residual 2373817191.832 1431 1658851.986
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------
## (Intercept) 2054.376 905.464 2.269 0.023 278.197 3830.555
## Age_08_04 -113.178 2.529 -0.580 -44.751 0.000 -118.139 -108.217
## Automatic_airco1 2965.067 171.898 0.189 17.249 0.000 2627.868 3302.265
## KM -0.021 0.001 -0.221 -19.387 0.000 -0.024 -0.019
## Weight 15.207 0.802 0.221 18.957 0.000 13.633 16.780
## ----------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 5
##
## - HP_Bin
##
## Model Summary
## -------------------------------------------------------------------
## R 0.940 RMSE 1243.260
## R-Squared 0.883 Coef. Var 11.586
## Adj. R-Squared 0.883 MSE 1545694.347
## Pred R-Squared 0.880 MAE 917.107
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 16668444241.758 6 2778074040.293 1797.298 0.0000
## Residual 2208797222.019 1429 1545694.347
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------
## (Intercept) 3041.437 879.356 3.459 0.001 1316.469 4766.405
## Age_08_04 -115.989 2.487 -0.595 -46.630 0.000 -120.868 -111.109
## Automatic_airco1 2592.316 169.810 0.165 15.266 0.000 2259.212 2925.421
## KM -0.020 0.001 -0.207 -18.450 0.000 -0.022 -0.018
## Weight 14.136 0.782 0.205 18.075 0.000 12.602 15.671
## HP_Bin> 120 3799.123 395.361 0.091 9.609 0.000 3023.573 4574.672
## HP_Bin100-120 359.927 69.658 0.049 5.167 0.000 223.284 496.571
## ----------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 6
##
## - Mfg_Month
##
## Model Summary
## -------------------------------------------------------------------
## R 0.944 RMSE 1203.245
## R-Squared 0.890 Coef. Var 11.213
## Adj. R-Squared 0.890 MSE 1447798.984
## Pred R-Squared 0.887 MAE 877.335
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 16809784514.785 7 2401397787.826 1658.654 0.0000
## Residual 2067456948.992 1428 1447798.984
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------------
## (Intercept) 4194.319 859.016 4.883 0.000 2509.251 5879.387
## Age_08_04 -120.065 2.442 -0.616 -49.157 0.000 -124.857 -115.274
## Automatic_airco1 2448.728 164.986 0.156 14.842 0.000 2125.086 2772.369
## KM -0.019 0.001 -0.201 -18.439 0.000 -0.021 -0.017
## Weight 13.726 0.758 0.199 18.108 0.000 12.239 15.213
## HP_Bin> 120 3793.269 382.637 0.091 9.914 0.000 3042.679 4543.859
## HP_Bin100-120 373.679 67.431 0.050 5.542 0.000 241.405 505.953
## Mfg_Month -95.142 9.629 -0.088 -9.880 0.000 -114.031 -76.253
## ----------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 7
##
## - CC_bin
##
## Model Summary
## -------------------------------------------------------------------
## R 0.947 RMSE 1170.788
## R-Squared 0.896 Coef. Var 10.911
## Adj. R-Squared 0.896 MSE 1370744.959
## Pred R-Squared 0.891 MAE 860.989
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 16922559152.253 9 1880284350.250 1371.724 0.0000
## Residual 1954682311.525 1426 1370744.959
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 3578.871 1118.743 3.199 0.001 1384.313 5773.429
## Age_08_04 -121.438 2.385 -0.623 -50.925 0.000 -126.116 -116.760
## Automatic_airco1 2256.504 162.001 0.144 13.929 0.000 1938.718 2574.290
## KM -0.016 0.001 -0.169 -14.335 0.000 -0.019 -0.014
## Weight 14.329 1.014 0.208 14.128 0.000 12.340 16.319
## HP_Bin> 120 4537.018 384.234 0.109 11.808 0.000 3783.294 5290.742
## HP_Bin100-120 3452.319 404.063 0.466 8.544 0.000 2659.696 4244.941
## Mfg_Month -90.947 9.406 -0.084 -9.669 0.000 -109.397 -72.496
## CC_bin>1600 -792.201 169.670 -0.070 -4.669 0.000 -1125.031 -459.371
## CC_bin1600 -3273.172 401.153 -0.443 -8.159 0.000 -4060.085 -2486.260
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 8
##
## - Powered_Windows
##
## Model Summary
## -------------------------------------------------------------------
## R 0.949 RMSE 1147.990
## R-Squared 0.901 Coef. Var 10.698
## Adj. R-Squared 0.900 MSE 1317881.944
## Pred R-Squared 0.895 MAE 838.276
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 16999259692.872 10 1699925969.287 1289.892 0.0000
## Residual 1877981770.905 1425 1317881.944
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 4197.482 1099.951 3.816 0.000 2039.785 6355.180
## Age_08_04 -118.102 2.379 -0.606 -49.649 0.000 -122.768 -113.436
## Automatic_airco1 2220.587 158.916 0.141 13.973 0.000 1908.852 2532.322
## KM -0.017 0.001 -0.174 -15.049 0.000 -0.019 -0.015
## Weight 13.384 1.002 0.194 13.355 0.000 11.418 15.350
## HP_Bin> 120 4337.881 377.655 0.104 11.486 0.000 3597.062 5078.701
## HP_Bin100-120 3472.803 396.205 0.469 8.765 0.000 2695.596 4250.010
## Mfg_Month -91.668 9.223 -0.085 -9.939 0.000 -109.761 -73.576
## CC_bin>1600 -641.829 167.530 -0.057 -3.831 0.000 -970.460 -313.198
## CC_bin1600 -3382.072 393.600 -0.458 -8.593 0.000 -4154.170 -2609.974
## Powered_Windows1 508.696 66.680 0.070 7.629 0.000 377.894 639.499
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 9
##
## - Quarterly_Tax
##
## Model Summary
## -------------------------------------------------------------------
## R 0.951 RMSE 1130.375
## R-Squared 0.904 Coef. Var 10.534
## Adj. R-Squared 0.903 MSE 1277747.117
## Pred R-Squared 0.898 MAE 836.056
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 17057729568.466 11 1550702688.042 1213.623 0.0000
## Residual 1819511895.311 1424 1277747.117
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 5199.111 1093.147 4.756 0.000 3054.759 7343.463
## Age_08_04 -116.773 2.350 -0.599 -49.681 0.000 -121.384 -112.162
## Automatic_airco1 2275.253 156.686 0.145 14.521 0.000 1967.892 2582.614
## KM -0.017 0.001 -0.180 -15.773 0.000 -0.020 -0.015
## Weight 11.795 1.014 0.171 11.627 0.000 9.805 13.785
## HP_Bin> 120 5085.534 387.937 0.122 13.109 0.000 4324.543 5846.524
## HP_Bin100-120 3293.965 391.020 0.445 8.424 0.000 2526.929 4061.002
## Mfg_Month -90.751 9.083 -0.084 -9.992 0.000 -108.567 -72.934
## CC_bin>1600 -1368.187 196.828 -0.121 -6.951 0.000 -1754.290 -982.084
## CC_bin1600 -3227.801 388.231 -0.437 -8.314 0.000 -3989.367 -2466.234
## Powered_Windows1 509.226 65.657 0.070 7.756 0.000 380.431 638.021
## Quarterly_Tax 8.666 1.281 0.098 6.765 0.000 6.153 11.179
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 10
##
## - Guarantee_Period
##
## Model Summary
## -------------------------------------------------------------------
## R 0.952 RMSE 1118.618
## R-Squared 0.906 Coef. Var 10.424
## Adj. R-Squared 0.905 MSE 1251306.531
## Pred R-Squared 0.900 MAE 826.060
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 17096632270.336 12 1424719355.861 1138.585 0.0000
## Residual 1780609193.441 1423 1251306.531
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 4731.911 1085.018 4.361 0.000 2603.504 6860.317
## Age_08_04 -114.562 2.360 -0.588 -48.552 0.000 -119.190 -109.933
## Automatic_airco1 2386.531 156.336 0.152 15.265 0.000 2079.857 2693.204
## KM -0.017 0.001 -0.178 -15.734 0.000 -0.019 -0.015
## Weight 11.793 1.004 0.171 11.748 0.000 9.824 13.762
## HP_Bin> 120 5090.503 383.904 0.122 13.260 0.000 4337.425 5843.581
## HP_Bin100-120 3256.161 387.012 0.440 8.414 0.000 2496.986 4015.337
## Mfg_Month -90.326 8.988 -0.084 -10.049 0.000 -107.957 -72.694
## CC_bin>1600 -1524.097 196.777 -0.134 -7.745 0.000 -1910.101 -1138.092
## CC_bin1600 -3224.574 384.194 -0.437 -8.393 0.000 -3978.220 -2470.927
## Powered_Windows1 510.964 64.975 0.070 7.864 0.000 383.507 638.421
## Quarterly_Tax 10.277 1.300 0.117 7.904 0.000 7.726 12.828
## Guarantee_Period 57.643 10.338 0.048 5.576 0.000 37.363 77.922
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 11
##
## - Mfr_Guarantee
##
## Model Summary
## -------------------------------------------------------------------
## R 0.952 RMSE 1111.183
## R-Squared 0.907 Coef. Var 10.355
## Adj. R-Squared 0.906 MSE 1234727.422
## Pred R-Squared 0.901 MAE 818.345
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 17121459070.214 13 1317035313.093 1066.661 0.0000
## Residual 1755782393.563 1422 1234727.422
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 4317.608 1081.759 3.991 0.000 2195.592 6439.623
## Age_08_04 -113.352 2.359 -0.581 -48.043 0.000 -117.980 -108.724
## Automatic_airco1 2394.538 155.307 0.152 15.418 0.000 2089.883 2699.193
## KM -0.017 0.001 -0.174 -15.376 0.000 -0.019 -0.015
## Weight 12.004 0.998 0.174 12.025 0.000 10.046 13.962
## HP_Bin> 120 4967.432 382.338 0.119 12.992 0.000 4217.424 5717.439
## HP_Bin100-120 3248.489 384.444 0.438 8.450 0.000 2494.351 4002.626
## Mfg_Month -89.550 8.930 -0.083 -10.028 0.000 -107.068 -72.032
## CC_bin>1600 -1427.987 196.641 -0.126 -7.262 0.000 -1813.724 -1042.249
## CC_bin1600 -3233.196 381.645 -0.438 -8.472 0.000 -3981.844 -2484.549
## Powered_Windows1 519.917 64.574 0.071 8.051 0.000 393.246 646.588
## Quarterly_Tax 9.621 1.300 0.109 7.401 0.000 7.071 12.171
## Guarantee_Period 63.396 10.349 0.053 6.126 0.000 43.095 83.697
## Mfr_Guarantee1 281.356 62.745 0.038 4.484 0.000 158.273 404.439
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 12
##
## - Sport_Model
##
## Model Summary
## -------------------------------------------------------------------
## R 0.953 RMSE 1104.645
## R-Squared 0.908 Coef. Var 10.294
## Adj. R-Squared 0.907 MSE 1220240.596
## Pred R-Squared 0.902 MAE 819.404
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ----------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ----------------------------------------------------------------------------------
## Regression 17143279577.089 14 1224519969.792 1003.507 0.0000
## Residual 1733961886.688 1421 1220240.596
## Total 18877241463.777 1435
## ----------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 4698.021 1079.151 4.353 0.000 2581.122 6814.920
## Age_08_04 -113.223 2.346 -0.581 -48.268 0.000 -117.824 -108.621
## Automatic_airco1 2287.873 156.440 0.146 14.625 0.000 1980.995 2594.751
## KM -0.017 0.001 -0.176 -15.647 0.000 -0.019 -0.015
## Weight 11.556 0.998 0.168 11.578 0.000 9.598 13.514
## HP_Bin> 120 4912.508 380.311 0.118 12.917 0.000 4166.477 5658.538
## HP_Bin100-120 3316.047 382.515 0.448 8.669 0.000 2565.692 4066.403
## Mfg_Month -92.538 8.906 -0.086 -10.391 0.000 -110.008 -75.068
## CC_bin>1600 -1304.084 197.668 -0.115 -6.597 0.000 -1691.836 -916.332
## CC_bin1600 -3261.514 379.458 -0.442 -8.595 0.000 -4005.873 -2517.155
## Powered_Windows1 535.327 64.297 0.073 8.326 0.000 409.199 661.456
## Quarterly_Tax 9.326 1.294 0.106 7.206 0.000 6.787 11.864
## Guarantee_Period 69.994 10.406 0.058 6.726 0.000 49.581 90.406
## Mfr_Guarantee1 278.452 62.380 0.038 4.464 0.000 156.085 400.819
## Sport_Model1 284.859 67.363 0.036 4.229 0.000 152.718 417.000
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 13
##
## - Airco
##
## Model Summary
## -------------------------------------------------------------------
## R 0.953 RMSE 1099.945
## R-Squared 0.909 Coef. Var 10.250
## Adj. R-Squared 0.908 MSE 1209878.034
## Pred R-Squared 0.903 MAE 815.408
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ---------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------------------
## Regression 17159214655.940 15 1143947643.729 945.507 0.0000
## Residual 1718026807.837 1420 1209878.034
## Total 18877241463.777 1435
## ---------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 4703.433 1074.560 4.377 0.000 2595.538 6811.328
## Age_08_04 -110.901 2.422 -0.569 -45.794 0.000 -115.652 -106.151
## Automatic_airco1 2279.209 155.793 0.145 14.630 0.000 1973.600 2584.817
## KM -0.017 0.001 -0.179 -15.958 0.000 -0.019 -0.015
## Weight 11.405 0.995 0.166 11.466 0.000 9.454 13.356
## HP_Bin> 120 4888.805 378.749 0.118 12.908 0.000 4145.838 5631.772
## HP_Bin100-120 3363.410 381.111 0.454 8.825 0.000 2615.809 4111.012
## Mfg_Month -93.029 8.869 -0.086 -10.489 0.000 -110.426 -75.631
## CC_bin>1600 -1321.166 196.883 -0.117 -6.710 0.000 -1707.379 -934.954
## CC_bin1600 -3356.798 378.755 -0.455 -8.863 0.000 -4099.777 -2613.819
## Powered_Windows1 419.705 71.513 0.057 5.869 0.000 279.423 559.987
## Quarterly_Tax 9.296 1.289 0.105 7.214 0.000 6.768 11.824
## Guarantee_Period 71.755 10.373 0.060 6.917 0.000 51.407 92.102
## Mfr_Guarantee1 281.442 62.120 0.038 4.531 0.000 159.586 403.299
## Sport_Model1 295.123 67.136 0.037 4.396 0.000 163.428 426.819
## Airco1 272.811 75.172 0.038 3.629 0.000 125.351 420.271
## -------------------------------------------------------------------------------------------------------
##
##
##
## Forward Selection: Step 14
##
## - Tow_Bar
##
## Model Summary
## -------------------------------------------------------------------
## R 0.954 RMSE 1098.798
## R-Squared 0.909 Coef. Var 10.240
## Adj. R-Squared 0.908 MSE 1207356.804
## Pred R-Squared 0.903 MAE 812.613
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ---------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------------------
## Regression 1.7164e+10 16 1072750134.950 888.511 0.0000
## Residual 1713239304.585 1419 1207356.804
## Total 18877241463.777 1435
## ---------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 4685.924 1073.475 4.365 0.000 2580.154 6791.693
## Age_08_04 -110.311 2.437 -0.566 -45.260 0.000 -115.092 -105.530
## Automatic_airco1 2266.342 155.764 0.144 14.550 0.000 1960.789 2571.895
## KM -0.017 0.001 -0.180 -15.994 0.000 -0.019 -0.015
## Weight 11.400 0.994 0.165 11.473 0.000 9.451 13.349
## HP_Bin> 120 4902.739 378.419 0.118 12.956 0.000 4160.419 5645.059
## HP_Bin100-120 3352.815 380.751 0.453 8.806 0.000 2605.920 4099.711
## Mfg_Month -93.404 8.862 -0.086 -10.540 0.000 -110.787 -76.020
## CC_bin>1600 -1340.226 196.910 -0.118 -6.806 0.000 -1726.492 -953.959
## CC_bin1600 -3328.381 378.629 -0.451 -8.791 0.000 -4071.114 -2585.648
## Powered_Windows1 419.491 71.438 0.057 5.872 0.000 279.355 559.627
## Quarterly_Tax 9.549 1.294 0.108 7.382 0.000 7.011 12.086
## Guarantee_Period 72.489 10.369 0.060 6.991 0.000 52.149 92.828
## Mfr_Guarantee1 280.212 62.058 0.038 4.515 0.000 158.476 401.948
## Sport_Model1 287.518 67.174 0.036 4.280 0.000 155.747 419.290
## Airco1 274.843 75.101 0.038 3.660 0.000 127.523 422.163
## Tow_Bar1 -134.104 67.345 -0.017 -1.991 0.047 -266.210 -1.998
## -------------------------------------------------------------------------------------------------------
##
##
##
## No more variables to be added.
##
## Variables Entered:
##
## + Age_08_04
## + Automatic_airco
## + KM
## + Weight
## + HP_Bin
## + Mfg_Month
## + CC_bin
## + Powered_Windows
## + Quarterly_Tax
## + Guarantee_Period
## + Mfr_Guarantee
## + Sport_Model
## + Airco
## + Tow_Bar
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -------------------------------------------------------------------
## R 0.954 RMSE 1098.798
## R-Squared 0.909 Coef. Var 10.240
## Adj. R-Squared 0.908 MSE 1207356.804
## Pred R-Squared 0.903 MAE 812.613
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ---------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ---------------------------------------------------------------------------------
## Regression 1.7164e+10 16 1072750134.950 888.511 0.0000
## Residual 1713239304.585 1419 1207356.804
## Total 18877241463.777 1435
## ---------------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------
## (Intercept) 4685.924 1073.475 4.365 0.000 2580.154 6791.693
## Age_08_04 -110.311 2.437 -0.566 -45.260 0.000 -115.092 -105.530
## Automatic_airco1 2266.342 155.764 0.144 14.550 0.000 1960.789 2571.895
## KM -0.017 0.001 -0.180 -15.994 0.000 -0.019 -0.015
## Weight 11.400 0.994 0.165 11.473 0.000 9.451 13.349
## HP_Bin> 120 4902.739 378.419 0.118 12.956 0.000 4160.419 5645.059
## HP_Bin100-120 3352.815 380.751 0.453 8.806 0.000 2605.920 4099.711
## Mfg_Month -93.404 8.862 -0.086 -10.540 0.000 -110.787 -76.020
## CC_bin>1600 -1340.226 196.910 -0.118 -6.806 0.000 -1726.492 -953.959
## CC_bin1600 -3328.381 378.629 -0.451 -8.791 0.000 -4071.114 -2585.648
## Powered_Windows1 419.491 71.438 0.057 5.872 0.000 279.355 559.627
## Quarterly_Tax 9.549 1.294 0.108 7.382 0.000 7.011 12.086
## Guarantee_Period 72.489 10.369 0.060 6.991 0.000 52.149 92.828
## Mfr_Guarantee1 280.212 62.058 0.038 4.515 0.000 158.476 401.948
## Sport_Model1 287.518 67.174 0.036 4.280 0.000 155.747 419.290
## Airco1 274.843 75.101 0.038 3.660 0.000 127.523 422.163
## Tow_Bar1 -134.104 67.345 -0.017 -1.991 0.047 -266.210 -1.998
## -------------------------------------------------------------------------------------------------------
##
## Selection Summary
## ----------------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ----------------------------------------------------------------------------------------
## 1 Age_08_04 0.7684 0.7682 2186.0335 25518.9706 1746.0382
## 2 Automatic_airco 0.8247 0.8244 1308.7496 25121.1462 1519.6553
## 3 KM 0.8427 0.8423 1029.8950 24967.7765 1440.1316
## 4 Weight 0.8742 0.8739 538.5506 24648.0525 1287.9643
## 5 HP_Bin 0.8830 0.8825 403.9815 24548.5871 1243.2596
## 6 Mfg_Month 0.8905 0.8899 289.0095 24455.6261 1203.2452
## 7 CC_bin 0.8965 0.8958 197.6782 24379.0785 1170.7882
## 8 Powered_Windows 0.9005 0.8998 136.2015 24323.5954 1147.9904
## 9 Quarterly_Tax 0.9036 0.9029 89.8123 24280.1756 1130.3748
## 10 Guarantee_Period 0.9057 0.9049 59.6168 24251.1397 1118.6181
## 11 Mfr_Guarantee 0.9070 0.9061 41.0703 24232.9769 1111.1829
## 12 Sport_Model 0.9081 0.9072 25.0118 24217.0187 1104.6450
## 13 Airco 0.9090 0.9080 13.8241 24205.7609 1099.9446
## 14 Tow_Bar 0.9092 0.9082 11.8620 24203.7537 1098.7979
## ----------------------------------------------------------------------------------------
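The statistics in the final model output are tied together by the usual regression identities, and a quick way to build intuition is to recompute them by hand. The sketch below is only a sanity check: the sums of squares and degrees of freedom are copied from the ANOVA table of the final model, and the variable names are made up for illustration. Up to rounding, it should reproduce the reported R-Squared, Adj. R-Squared, RMSE and F values.

```r
# Values copied from the ANOVA table of the final model (step 14).
ss_residual <- 1713239304.585
ss_total    <- 18877241463.777
df_model    <- 16     # number of estimated slopes
df_residual <- 1419   # 1436 records - 16 slopes - 1 intercept
df_total    <- 1435

# The regression sum of squares is whatever the residuals do not account for.
ss_regression <- ss_total - ss_residual

mse    <- ss_residual / df_residual          # mean square error
rmse   <- sqrt(mse)                          # root mean square error
r2     <- 1 - ss_residual / ss_total         # R-squared
adj_r2 <- 1 - mse / (ss_total / df_total)    # adjusted R-squared
f_stat <- (ss_regression / df_model) / mse   # overall F statistic

round(c(R2 = r2, Adj_R2 = adj_r2, RMSE = rmse, F = f_stat), 3)
```

Because the adjusted R-squared penalises every extra coefficient, it moves only from 0.9080 to 0.9082 between steps 13 and 14, which is in line with Tow_Bar entering with a p-value of 0.047, just under the 0.05 threshold.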
Last but not least, we can also visualise each iteration graphically by plotting the model output as shown in the code chunk below.
car.fw.mlr <- ols_step_forward_p(car.mlr,
penter = 0.05)
plot(car.fw.mlr)
So far we have mainly focused on how to use the olsrr package to build statistically rigorous models. In fact, the olsrr package also provides functions for building predictive models.
One very interesting function is ols_step_best_subset(). It selects the subset of predictors that best meets a well-defined objective criterion, such as having the smallest MSE, Mallows' Cp or AIC. The default metric used for selecting the model is R-squared, but the user can choose any of the other available metrics by setting the metric argument, as in the ols_step_best_subset() call below.
Be warned: this function can be very time-consuming, because the number of candidate models grows rapidly with the number of predictors.
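To see where the running time comes from, here is a minimal base R sketch of an exhaustive subset search. It is not how olsrr implements ols_step_best_subset(); it uses the built-in mtcars data as a stand-in for the car resale data, and the object names (candidates, subsets, results, best_by_size) are chosen purely for illustration.

```r
# Stand-in data: the built-in mtcars data set, not ToyotaCorolla.
response   <- "mpg"
candidates <- c("wt", "hp", "qsec", "drat")

# Enumerate every non-empty subset of the candidate predictors.
subsets <- unlist(
  lapply(seq_along(candidates),
         function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one lm() per subset and record its AIC.
aic_values <- vapply(subsets, function(vars) {
  AIC(lm(reformulate(vars, response = response), data = mtcars))
}, numeric(1))

results <- data.frame(
  n_predictors = lengths(subsets),
  predictors   = vapply(subsets, paste, character(1), collapse = " + "),
  aic          = aic_values
)

# Keep the lowest-AIC model of each size, similar in spirit to the
# one-row-per-model-size layout of a best subsets table.
best_by_size <- do.call(
  rbind,
  lapply(split(results, results$n_predictors),
         function(d) d[which.min(d$aic), ])
)

nrow(results)   # number of models fitted: 2^k - 1 for k candidates
best_by_size
```

With k candidate predictors the search fits 2^k - 1 models, so every additional predictor roughly doubles the work. Four candidates need 15 fits, while twenty candidates would need more than a million.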
For demonstration purposes, only five independent variables will be used, namely Age_08_04, Automatic_airco, KM, Weight and HP_Bin.
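# Refit the model with five predictors only (this overwrites the earlier car.mlr)
car.mlr <- lm(formula = Price ~ Age_08_04 + Automatic_airco + KM + Weight + HP_Bin,
              data = car_resale)
# Recent olsrr releases appear to expect the lower-case metric name "aic"
ols_step_best_subset(car.mlr,
                     metric = "aic")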
## Best Subsets Regression
## ---------------------------------------------------------
## Model Index Predictors
## ---------------------------------------------------------
## 1 Age_08_04
## 2 Age_08_04 Automatic_airco
## 3 Age_08_04 KM Weight
## 4 Age_08_04 Automatic_airco KM Weight
## 5 Age_08_04 Automatic_airco KM Weight HP_Bin
## ---------------------------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.7684 0.7682 0.7674 1396.3492 25518.9706 21441.3253 25534.7794 4377860671.9756 3052895.5188 2127.4595 0.2322
## 2 0.8247 0.8244 0.8233 710.9808 25121.1462 21043.7622 25142.2247 3316231525.2590 2314176.8543 1612.6762 0.1760
## 3 0.8481 0.8478 0.8448 427.0711 24917.3085 20840.2078 24943.6566 2875385093.0869 2007932.9444 1399.2700 0.1527
## 4 0.8742 0.8739 0.8714 109.7611 24648.0525 20572.2030 24679.6702 2382114939.9489 1664627.9329 1160.0364 0.1266
## 5 0.8830 0.8825 0.8797 5.0000 24548.5871 20471.4543 24590.7440 2218069235.3238 1552152.6800 1081.6615 0.1180
## ----------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
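To make the AIC column less of a black box, the sketch below recomputes the AIC of model 5 from figures already printed above. It assumes the log-likelihood-based definition that R's AIC() uses for lm fits; whether olsrr uses exactly this definition is an assumption here, although the numbers agree closely. The RMSE is taken from step 5 of the forward selection summary, which is the same five-predictor model.

```r
# Values copied from the outputs above (model with Age_08_04, Automatic_airco,
# KM, Weight and HP_Bin).
n    <- 1436        # data records
p    <- 7           # intercept + 4 slopes + 2 HP_Bin dummy coefficients
rmse <- 1243.2596   # step 5 of the forward selection summary

# Recover the residual sum of squares, assuming RMSE = sqrt(RSS / (n - p)).
rss <- rmse^2 * (n - p)

# Gaussian log-likelihood of a linear model at its maximum.
log_lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

# AIC = -2 * logLik + 2 * (number of parameters); the extra 1 counts the
# error variance alongside the p regression coefficients.
aic <- -2 * log_lik + 2 * (p + 1)
round(aic, 2)

# AIC column of the best subsets table; the smallest value marks the
# preferred model under this criterion.
aic_by_model <- c(25518.9706, 25121.1462, 24917.3085, 24648.0525, 24548.5871)
which.min(aic_by_model)
```

Under every criterion shown in the table, the full five-predictor model comes out ahead, which is consistent with the first five steps of the forward selection.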