Background
Background Story
Hi! This is a project submission for a task in a bootcamp course I’m currently attending. The goal of this project is to create a linear regression model, and I chose the Used Cars Dataset that I obtained on Kaggle here, which was scraped from CarDekho.com, one of India’s leading car search ventures.
A little background story of the company: CarDekho.com, as a platform where users could buy new and used cars that are best suited for them, carries rich automotive content such as expert reviews, detailed specs and prices, comparisons as well as videos and pictures of all car brands and models available in India. The company has tie-ups with many auto manufacturers, more than 4000 car dealers and numerous financial institutions to facilitate the purchase of vehicles.
Aside from selling cars to users, the platform also lets users sell their cars to CarDekho.com. I personally think it would be easier to value a new car than a used one, since used car prices depreciate over time and many other factors affect that depreciation, such as the brand and model, total kilometers driven, the condition of the car, and so on.
Since there are many factors that could determine the resale value of a used car, I’m going to build a model to predict just that.
Objective
The objective of this project is to build a model to predict the resale value of used cars based on their specifications using linear regression.
The Data
Preparation
Importing libraries that will be used further on.
# import libraries
library(tidyverse)
library(dplyr)
library(inspectdf)
library(magicfor)
library(ggpubr)
library(GGally)
library(ggplot2)
library(viridis)
library(MLmetrics)
library(car)
library(caret)
library(vtreat)
library(lmtest)
library(pastecs)
library(fastDummies) # provides dummy_cols(), used later on
library(knitr)       # provides kable(), used later on
Importing the Used Cars Dataset.
# import data
car <- read.csv("data/cardekho_cleaned.csv",
                na.strings = c("null", "", "NA"),
                stringsAsFactors = T)

DT::datatable(car, rownames = F)
Description of each column:

- id: index
- car_name: name of the car
- brand: brand of the car, extracted from car_name
- model: model of the car, extracted from car_name
- new_price: min. to max. price range of the new car
- min_cost_price: minimum new price, extracted from new_price
- max_cost_price: maximum new price, extracted from new_price
- vehicle_age: age of the vehicle since it was first bought
- km_driven: total km driven by the car
- seller_type: type of seller
- fuel_type: type of fuel used by the car
- transmission_type: manual/automatic
- mileage: fuel efficiency of the car (kmpl)
- engine: engine displacement in cc
- max_power: max. power of the car in bhp
- seats: number of seats in the car
- selling_price: selling price of the used car
Right off the bat we can see that there are missing values in the dataset. We’re going to check how many NAs there are and whether there are any duplicate rows.
# check duplicates
table(duplicated(car))
FALSE
19544
# check NA
colSums(is.na(car))
id car_name brand model
0 0 0 0
new_price min_cost_price max_cost_price vehicle_age
10102 10102 10102 0
km_driven seller_type fuel_type transmission_type
0 0 0 0
mileage engine max_power seats
0 0 0 0
selling_price
0
From the results above we can see that 51.69% of the rows, 10,102 in total, have missing values, which means roughly half of our data has NAs in 3 columns: new_price, min_cost_price, and max_cost_price. The missing values occur in the same rows for all three columns, since min_cost_price and max_cost_price are derived from new_price.
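As a quick side note, the 51.69% figure can be reproduced directly from the NA counts; a minimal sketch (my own addition), using the car data frame loaded above:

# hedged sketch: share of rows with a missing new_price, as a percentage
round(sum(is.na(car$new_price)) / nrow(car) * 100, 2)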
Before going on to the next step of imputing the missing values, I’d first like to create a new column average_new_price, the average of min_cost_price and max_cost_price, and remove those 2 columns afterwards.
# create average_new_price column and reorder columns
car <- car %>%
  mutate(average_new_price = (min_cost_price + max_cost_price)/2) %>%
  select(-c(min_cost_price, max_cost_price)) %>%
  select(1:5, 16, everything())
With this dataset I’m going to impute the missing values with the mean average_new_price of each car name. To make this easier, I’ll split the data in two: rows with no missing values and rows with missing values.
# car with na
car_na <- car[is.na(car$average_new_price), ] %>%
  select(!c(new_price, average_new_price))

# car with no na
car_no_na <- car %>% drop_na()
Next, I’m going to impute the missing values based on car_name, but only when the same car_name appears in both datasets.
# calculating the mean of each car name from car_no_na
car_no_na_mean <- car_no_na %>%
  group_by(car_name) %>%
  summarise(average_new_price = mean(average_new_price)) %>%
  ungroup()

# imputing mean price to car_na
car_2 <- left_join(car_na, car_no_na_mean)
# check NA left
round(colSums(is.na(car_2))/nrow(car)*100, 2)
id car_name brand model
0.00 0.00 0.00 0.00
vehicle_age km_driven seller_type fuel_type
0.00 0.00 0.00 0.00
transmission_type mileage engine max_power
0.00 0.00 0.00 0.00
seats selling_price average_new_price
0.00 0.00 21.15
After the imputation above, the missing values are reduced, but 21.15% of the total data still has NAs because those rows’ car_name doesn’t exist in the no-NA data. Next, I’ll split car_2 in two using the same method as before and bind the successfully imputed rows to the original no-NA data.
# splitting the no NA data
car_no_na_2 <- car_2 %>% drop_na()
car_no_na_3 <- bind_rows(car_no_na, car_no_na_2)

# car na
car_na_3 <- car_2[is.na(car_2$average_new_price), ] %>%
  select(!c(average_new_price))
Ideally, I would say that the price of a new car depends mostly on the type and brand of the car, but since we don’t have that data available, I’m going to impute average_new_price based on a car’s specifications, in this case max_power and engine. Both are numeric variables, and it’s hard to match prices on raw numeric values (other machine learning methods could, but that’s not our objective here), so we’ll bin them into categorical ranges. I’d first like to see how these variables are distributed.
# engine summary from car_na_3
summary(car_na_3$engine)
Min. 1st Qu. Median Mean 3rd Qu. Max.
624 1197 1364 1450 1498 6752
car_na_3 %>% filter(engine >= 1500 & engine <= 3000) %>% nrow()
[1] 1002
# engine summary from car_no_na_3
summary(car_no_na_3$engine)
Min. 1st Qu. Median Mean 3rd Qu. Max.
793 1197 1248 1486 1582 6592
# max_power summary from car_na_3
summary(car_na_3$max_power)
Min. 1st Qu. Median Mean 3rd Qu. Max.
34.20 68.05 83.80 95.00 104.68 575.00
car_na_3 %>% filter(max_power >= 105 & max_power <= 200) %>% nrow()
[1] 902
# max_power summary from car_no_na_3
summary(car_no_na_3$max_power)
Min. 1st Qu. Median Mean 3rd Qu. Max.
38.4 74.0 88.5 100.6 117.3 626.0
From the information above, we can see that on both datasets roughly 75% of the engine values fall between about 600 and 1,600 cc, while there are still quite a lot of rows, 1,002, in the 1,500 - 3,000 cc range. Based on that, we’ll categorize engine into 100 cc intervals up to 3,000 cc, with everything above that going into one category.

As for max_power, roughly 75% of the values fall between about 30 and 120 bhp on both datasets, and there are still 902 rows in the 105 - 200 bhp range, so we’ll categorize it into 10 bhp intervals up to 200 bhp.
# categorizing engine and max power to data range
magic_for(print, silent = TRUE)

## create new column
### create breaks for engine
engine_breaks <- get_breaks(by = 100)(x = 1:3000)

### create labels for engine
for (i in 1:n_distinct(engine_breaks)) {
  print(paste0(engine_breaks[i], " - ", engine_breaks[i+1]))
}
engine_labels <- magic_result_as_vector()

### omitting the last element of engine_labels
engine_labels <- engine_labels[-length(engine_labels)]

### create breaks for max_power
max_power_breaks <- get_breaks(by = 10)(x = 1:200)

### create labels for max_power
for (i in 1:n_distinct(max_power_breaks)) {
  print(paste0(max_power_breaks[i], " - ", max_power_breaks[i+1]))
}
max_power_labels <- magic_result_as_vector()

### omitting the last element of max_power_labels
max_power_labels <- max_power_labels[-length(max_power_labels)]

### create new column engine_range and max_power_range for no na data
car_no_na_4 <- car_no_na_3 %>%
  mutate(engine_range = cut(engine,
                            breaks = engine_breaks,
                            labels = engine_labels),
         max_power_range = cut(max_power,
                               breaks = max_power_breaks,
                               labels = max_power_labels)) %>%
  mutate_at(vars(engine_range, max_power_range), as.character)

### impute engine_range and max_power_range for no NA data
car_no_na_4$engine_range[is.na(car_no_na_4$engine_range)] <- "> 3000"
car_no_na_4$max_power_range[is.na(car_no_na_4$max_power_range)] <- "> 200"

### create new column engine_range and max_power_range for na data
car_na_4 <- car_na_3 %>%
  mutate(engine_range = cut(engine,
                            breaks = engine_breaks,
                            labels = engine_labels),
         max_power_range = cut(max_power,
                               breaks = max_power_breaks,
                               labels = max_power_labels)) %>%
  mutate_at(vars(engine_range, max_power_range), as.character)

### impute engine_range and max_power_range for NA data
car_na_4$engine_range[is.na(car_na_4$engine_range)] <- "> 3000"
car_na_4$max_power_range[is.na(car_na_4$max_power_range)] <- "> 200"

## calculate mean for no na data
car_no_na_mean_2 <- car_no_na_4 %>%
  group_by(engine_range, max_power_range) %>%
  summarise(average_new_price = mean(average_new_price)) %>%
  ungroup()

## impute mean price to car_na_4
car_na_5 <- left_join(car_na_4, car_no_na_mean_2)
## check NA left
round(colSums(is.na(car_na_5))/nrow(car)*100, 2)
id car_name brand model
0.00 0.00 0.00 0.00
vehicle_age km_driven seller_type fuel_type
0.00 0.00 0.00 0.00
transmission_type mileage engine max_power
0.00 0.00 0.00 0.00
seats selling_price engine_range max_power_range
0.00 0.00 0.00 0.00
average_new_price
3.09
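As a side note, the interval labels above could also be built without the magicfor loop. Here is a minimal alternative sketch, assuming the same 0 - 3,000 cc breaks in steps of 100 and 0 - 200 bhp breaks in steps of 10 (the _alt names are hypothetical stand-ins, not objects used elsewhere):

# hedged alternative to the magicfor loops: build labels directly from the break vectors
engine_breaks_alt <- seq(0, 3000, by = 100)
engine_labels_alt <- paste0(head(engine_breaks_alt, -1), " - ", tail(engine_breaks_alt, -1))
max_power_breaks_alt <- seq(0, 200, by = 10)
max_power_labels_alt <- paste0(head(max_power_breaks_alt, -1), " - ", tail(max_power_breaks_alt, -1))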
After the second imputation, 3.09% of the total data still has missing values. I decided to drop those rows, since it’s less than 5% of the data and imputing the remaining NAs from just one variable (either engine or max_power alone) would be even less precise.
# dropping all NAs
car_no_na_5 <- car_na_5 %>% drop_na()

# joining all data
car_3 <- bind_rows(car_no_na_4, car_no_na_5)
# check NA for new_price
colSums(is.na(car_3))
id car_name brand model
0 0 0 0
new_price average_new_price vehicle_age km_driven
9498 0 0 0
seller_type fuel_type transmission_type mileage
0 0 0 0
engine max_power seats selling_price
0 0 0 0
engine_range max_power_range
0 0
# check missing rows
round(nrow(car_3)/nrow(car)*100 + 3.09, 2)
[1] 100
Quick data summary.
glimpse(car_3)
Rows: 18,940
Columns: 18
$ id <int> 1, 4, 5, 6, 7, 8, 12, 14, 19, 22, 23, 25, 27, 29, 30~
$ car_name <fct> Hyundai Grand, Ford Ecosport, Maruti Wagon R, Hyunda~
$ brand <fct> Hyundai, Ford, Maruti, Hyundai, Maruti, Hyundai, Mar~
$ model <fct> Grand, Ecosport, Wagon R, i10, Wagon R, Venue, Swift~
$ new_price <fct> New Car (On-Road Price) : Rs.7.11-7.48 Lakh*, New Ca~
$ average_new_price <dbl> 729500, 1196500, 605000, 658500, 613500, 1036000, 78~
$ vehicle_age <int> 5, 6, 8, 8, 3, 2, 4, 8, 7, 6, 5, 8, 2, 6, 10, 14, 8,~
$ km_driven <int> 20000, 30000, 35000, 40000, 17512, 20000, 28321, 652~
$ seller_type <fct> Individual, Dealer, Individual, Dealer, Dealer, Indi~
$ fuel_type <fct> Petrol, Diesel, Petrol, Petrol, Petrol, Petrol, Petr~
$ transmission_type <fct> Manual, Manual, Manual, Manual, Manual, Automatic, M~
$ mileage <dbl> 18.90, 22.77, 18.90, 20.36, 20.51, 18.15, 16.60, 22.~
$ engine <int> 1197, 1498, 998, 1197, 998, 998, 1197, 1582, 2143, 1~
$ max_power <dbl> 82.00, 98.59, 67.10, 78.90, 67.04, 118.35, 85.00, 12~
$ seats <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 5, 7, 5~
$ selling_price <int> 550000, 570000, 350000, 315000, 410000, 1050000, 511~
$ engine_range <chr> "1100 - 1200", "1400 - 1500", "900 - 1000", "1100 - ~
$ max_power_range <chr> "80 - 90", "90 - 100", "60 - 70", "70 - 80", "60 - 7~
After playing around with the data, I conclude that car_name and model won’t do any good in further analysis, because:

- brand and model are both derived from car_name. Of the two, brand serves the better purpose in predicting resale value, since each brand carries its own branding and some brands hold their value better than others.
- the model of the car doesn’t really matter here, since each brand uses its own specific model names. Even if two brands shared a model name, there’s no guarantee the specifications of those cars are the same; model is just a naming label and doesn’t convey much else. It would make a much bigger difference if, instead of just a model name, the data also provided the car type (hatchback, sedan, SUV, etc.), because similar cars of the same type can carry different model names under different brands while sitting in a similar price range.

Based on those 2 points, I’m going to remove the car_name and model columns, along with id, new_price, engine_range, and max_power_range.
# remove columns
car_clean <- car_3 %>%
  select(!c(id, new_price, car_name, model, engine_range, max_power_range))
Outliers
Now we’re moving on to the next step: checking whether there are any “anomalies” in our dataset. I’m first going to look closely at the descriptive statistics of each column.
car_clean %>% inspect_num()
From the brief descriptive statistics above, we can see some oddities in our data:

- there’s no such thing as a 0-seater car.
- the largest gap between mean and median is in the prices: an INR 2,881,145 difference in average_new_price and INR 226,194 in selling_price, which means there are extreme outliers on the higher end of the prices.
- the mean-median gaps of the other variables aren’t as large as the prices’, but km_driven, engine, and max_power are worth checking later on.
Let’s first tackle the first problem: the car with 0 seats.
# filter rows
no_seats <- car %>% filter(seats == 0)

unique(no_seats$car_name)
[1] Honda City Nissan Kicks
262 Levels: Ambassador Avigo Ambassador Classic Audi A3 Audi A4 ... Volvo XC90
These 2 cars are listed with 0 seats, whereas the actual Honda City and Nissan Kicks are both 5-seaters. Let’s fix the data.
# re-input correct value
car_clean <- car_clean %>%
  mutate(seats = ifelse(seats == 0, 5, seats))

# re-check data
car_clean %>% filter(seats == 0)
[1] brand average_new_price vehicle_age km_driven
[5] seller_type fuel_type transmission_type mileage
[9] engine max_power seats selling_price
<0 rows> (or 0-length row.names)
That takes care of the first problem. The second problem is the outliers in both average_new_price and selling_price. Let’s take a look at those values plotted on the same scale.
plot1 <- ggplot(car_clean, aes(x = as.numeric(row.names(car_clean)),
                               y = average_new_price)) +
  geom_point(aes(color = average_new_price)) +
  scale_y_continuous(breaks = seq(0, 500000000, by = 100000000)) +
  theme_minimal() +
  theme(legend.position = "none") +
  xlab("index") +
  scale_color_viridis(option = "plasma")

plot2 <- ggplot(car_clean, aes(x = as.numeric(row.names(car_clean)),
                               y = selling_price)) +
  geom_point(aes(color = selling_price)) +
  scale_y_continuous(limits = c(0, 500000000),
                     breaks = seq(0, 500000000, by = 100000000)) +
  theme_minimal() +
  theme(legend.position = "none") +
  xlab("index") +
  scale_color_viridis(option = "plasma")

egg::ggarrange(plot1, plot2, ncol = 2)
Looking at the side-by-side comparison on the same scale, we can conclude one thing: resale prices are much lower than new prices. A little off-topic, this is why many people say a car loan isn’t worth it: the car depreciates quickly while the installments alone can run 5 years or more, and by the time they’re paid off the car isn’t worth nearly as much.
Back to the topic: we can see that there are indeed outliers in both average_new_price and selling_price. We can also see that average_new_price falls roughly into 3 groups, outliers included: cars priced INR 0 - 150,000,000, INR 150,000,000 - 200,000,000, and INR 350,000,000 - 500,000,000. This also suggests that average_new_price isn’t normally distributed.
It could feel a little drastic to remove all the outliers just by eyeballing the plot, since they look fairly numerous, but remembering that this dataset contains more than 15,000 rows, I assume it’s still safe to remove them all.
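For comparison, a more systematic way to flag the high-end outliers would be the usual 1.5 × IQR rule; a minimal sketch (my own addition, while the analysis below sticks with the visual cutoff):

# hedged sketch: count average_new_price values above the 1.5 * IQR upper fence
q <- quantile(car_clean$average_new_price, c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
sum(car_clean$average_new_price > upper_fence)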
# remove average_new_price outliers
car_clean_2 <- car_clean %>% filter(average_new_price < 200000000)
# check % of data removed
round((nrow(car_clean)/nrow(car) - 1)*100, 2)
[1] -3.09
round((nrow(car_clean_2)/nrow(car) - 1)*100, 2)
[1] -3.59
We just removed another 0.5% of the data, which still keeps us safely below 5% in total. We’re now going to plot average_new_price against selling_price to check for other outliers.
ggplot(car_clean_2, aes(x = average_new_price, y = selling_price)) +
geom_point(aes(color = average_new_price)) +
theme_minimal() +
theme(legend.position = "none") +
scale_color_viridis(option = "plasma")
Now we can clearly see the outliers in selling_price. The extreme ones are only 3 data points in total, and we’ll remove them next. Going back to average_new_price, the INR 0 - 150,000,000 group mentioned before actually splits into several smaller groups at a finer scale. For now we’ll leave them be; even a brief look shows those outliers aren’t many compared to the total data.
# remove selling_price outlier
car_clean_2 <- car_clean_2 %>% filter(selling_price < 20000000)
We’re now going to look at the frequency table of each categorical variable.
brand_freq <- car_clean_2 %>% group_by(brand) %>% summarise(value = length(brand)) %>% ungroup()
d1 <- kable(car_clean_2 %>% group_by(seller_type) %>% summarise(value = length(seller_type)) %>% ungroup())
d2 <- kable(car_clean_2 %>% group_by(fuel_type) %>% summarise(value = length(fuel_type)) %>% ungroup())
d3 <- kable(car_clean_2 %>% group_by(transmission_type) %>% summarise(value = length(transmission_type)) %>% ungroup())

ggplot(brand_freq, aes(x = brand, y = value)) +
  geom_col(fill = "#b6308b") +
  geom_text(aes(label = value),
            position = position_dodge(width = 0.9),
            vjust = -0.4,
            hjust = 0.5) +
  theme_classic() +
  ylab("frequency") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        axis.title.x = element_blank())

knitr::kables(list(d1, d2, d3)) %>% kableExtra::kable_styling(position = "center")
From the bar plot above we can see that Isuzu actually appears under two spellings, Isuzu and ISUZU, and that some brands occur only a handful of times. Since this dataset has tens of thousands of rows, I decided to remove brands with fewer than 10 occurrences, to ensure that when we split the data into train and test both parts contain the same brands. We’ll also remove the Electric fuel_type for the same reason.
# change ISUZU to Isuzu
car_clean_2 <- car_clean_2 %>% mutate(brand = as.character(brand),
                                      brand = case_when(brand == "ISUZU" ~ "Isuzu",
                                                        T ~ brand),
                                      brand = as.factor(brand))

# keep brands with >= 10 occurrences
brand_freq <- car_clean_2 %>% group_by(brand) %>% summarise(value = length(brand)) %>% ungroup() %>% filter(value >= 10)
brand_name <- unique(brand_freq$brand)

# filter rows in the original dataset
car_clean_2 <- car_clean_2 %>% filter(brand %in% brand_name, fuel_type != "Electric")
# check % data removed
round(((nrow(car_clean_2)/nrow(car)) - 1)*100, 2)
[1] -3.73
Great! Still below 5%.
Analysis
We’re now going to take a look at the correlations between the variables.
ggcorr(car_clean_2, label = T, label_size = 2.5, hjust = 1, layout.exp = 3) +
scale_fill_viridis(option = "plasma")
We can see that most of the variables have a strong correlation with selling_price. We’re now going to take a look at the distribution of selling_price.
ggplot(car_clean_2, aes(selling_price)) +
geom_histogram(fill = "#2d1a94", color = "#441d9e", alpha = 0.5, bins = 50) +
theme_minimal()
We can see that selling_price is right-skewed, so we’re going to transform it to its natural log value.
# transform to log
car_clean_3 <- car_clean_2 %>%
  mutate(log_selling_price = log(selling_price))
Quick check of the transformed result.
ggplot(car_clean_3, aes(log_selling_price)) +
geom_histogram(fill = "#2d1a94", color = "#441d9e", alpha = 0.5) +
theme_minimal()
Now the target is much closer to a normal distribution. We can move on to the next step: building the regression model.
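As a quick numeric sanity check (my own addition), the mean/median gap, one rough sign of the skew, shrinks considerably on the log scale:

# compare raw and log-scale summaries; the mean sits much closer to the median after the transform
summary(car_clean_3$selling_price)
summary(car_clean_3$log_selling_price)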
Another issue with this dataset is that it contains a lot of categorical variables. To avoid problems later on, we’re going to create our own dummy variables for those categories using dummy_cols(). We’re also going to remove the selling_price column, since we’ll only use log_selling_price.
# create dummy variables
car_dummy <- dummy_cols(car_clean_3,
                        select_columns = c("brand", "seller_type", "fuel_type", "transmission_type"))

# create dummy variables and remove columns which created the variables
car_dummy_2 <- dummy_cols(car_clean_3,
                          select_columns = c("brand", "seller_type", "fuel_type", "transmission_type"),
                          remove_selected_columns = TRUE)

# remove selling_price column
car_dummy_2 <- car_dummy_2 %>% select(-selling_price)

dim(car_dummy_2)
[1] 18815 55
We’re now ending up with 55 columns / 54 predictors for the model.
Modeling
Split Train and Test Data
Before moving on to model training, we’ll split the car_clean_3 data into train and test sets using an 80:20 ratio.
# split data
set.seed(902)
samplesize <- round(0.8 * nrow(car_clean_3), 0)
index <- sample(seq_len(nrow(car_clean_3)), size = samplesize)

train <- car_clean_3[index, ]
test <- car_clean_3[-index, ]
Building Model
First Model
We’re now going to build the first model with all predictors.
# model with all predictors
options(max.print = 5.5E5)
model_1 <- lm(log_selling_price ~ ., train)
summary(model_1)
Call:
lm(formula = log_selling_price ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-2.58978 -0.11847 0.01752 0.13783 2.03874
Coefficients:
Estimate Std. Error t value
(Intercept) 13.4675363814059 0.0413582404486 325.631
brandBMW -0.1427563081059 0.0179799842878 -7.940
brandChevrolet -0.6426923752261 0.0200096846287 -32.119
brandDatsun -0.6189139368520 0.0251882306473 -24.572
brandFiat -0.5333103382646 0.0380743912640 -14.007
brandFord -0.3772740994923 0.0174527866270 -21.617
brandHonda -0.2424841560955 0.0165283534883 -14.671
brandHyundai -0.2689689036630 0.0161381585224 -16.667
brandIsuzu -0.4399136783779 0.0888543411160 -4.951
brandJaguar -0.0867280640120 0.0312085729006 -2.779
brandJeep -0.2322541519170 0.0389556828222 -5.962
brandKia -0.1504878698345 0.0436924166933 -3.444
brandLand Rover -0.1092541780447 0.0368798112402 -2.962
brandLexus -0.0975011470408 0.0975736027436 -0.999
brandMahindra -0.4227548153088 0.0182463260609 -23.169
brandMaruti -0.2842282435634 0.0164684158803 -17.259
brandMercedes-Benz -0.1167211038634 0.0180886192776 -6.453
brandMG -0.2108230582550 0.0554500003970 -3.802
brandMini 0.2459180709311 0.0568595663735 4.325
brandMitsubishi -0.2740014942247 0.0446139850740 -6.142
brandNissan -0.3711808928878 0.0210370672101 -17.644
brandPorsche -0.4360277234398 0.0504160364339 -8.649
brandRenault -0.4185126972672 0.0187007010967 -22.380
brandSkoda -0.3212293510023 0.0190065590894 -16.901
brandTata -0.5817521142538 0.0177589658367 -32.758
brandToyota -0.1915892702768 0.0176036457647 -10.883
brandVolkswagen -0.2953446235570 0.0175666167226 -16.813
brandVolvo -0.1285609700877 0.0432168515281 -2.975
average_new_price -0.0000000040668 0.0000000004263 -9.540
vehicle_age -0.1048570420503 0.0007503842438 -139.738
km_driven -0.0000001601917 0.0000000371379 -4.313
seller_typeIndividual -0.0717338594135 0.0038015470265 -18.870
seller_typeTrustmark Dealer -0.0157253563649 0.0175100665696 -0.898
fuel_typeDiesel 0.2128951240930 0.0142392325883 14.951
fuel_typeLPG -0.1242911479406 0.0362719267545 -3.427
fuel_typePetrol -0.0327446132881 0.0147853313943 -2.215
transmission_typeManual -0.1139214901413 0.0058482647296 -19.480
mileage -0.0124746758002 0.0008629527243 -14.456
engine 0.0001091462880 0.0000103784788 10.517
max_power 0.0044454273126 0.0001163817993 38.197
seats 0.0379574593765 0.0035349214526 10.738
selling_price 0.0000002662920 0.0000000044413 59.958
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
brandBMW 0.00000000000000217 ***
brandChevrolet < 0.0000000000000002 ***
brandDatsun < 0.0000000000000002 ***
brandFiat < 0.0000000000000002 ***
brandFord < 0.0000000000000002 ***
brandHonda < 0.0000000000000002 ***
brandHyundai < 0.0000000000000002 ***
brandIsuzu 0.00000074652780333 ***
brandJaguar 0.005460 **
brandJeep 0.00000000254742618 ***
brandKia 0.000574 ***
brandLand Rover 0.003057 **
brandLexus 0.317686
brandMahindra < 0.0000000000000002 ***
brandMaruti < 0.0000000000000002 ***
brandMercedes-Benz 0.00000000011321516 ***
brandMG 0.000144 ***
brandMini 0.00001535109712998 ***
brandMitsubishi 0.00000000083751067 ***
brandNissan < 0.0000000000000002 ***
brandPorsche < 0.0000000000000002 ***
brandRenault < 0.0000000000000002 ***
brandSkoda < 0.0000000000000002 ***
brandTata < 0.0000000000000002 ***
brandToyota < 0.0000000000000002 ***
brandVolkswagen < 0.0000000000000002 ***
brandVolvo 0.002937 **
average_new_price < 0.0000000000000002 ***
vehicle_age < 0.0000000000000002 ***
km_driven 0.00001617642324495 ***
seller_typeIndividual < 0.0000000000000002 ***
seller_typeTrustmark Dealer 0.369160
fuel_typeDiesel < 0.0000000000000002 ***
fuel_typeLPG 0.000613 ***
fuel_typePetrol 0.026798 *
transmission_typeManual < 0.0000000000000002 ***
mileage < 0.0000000000000002 ***
engine < 0.0000000000000002 ***
max_power < 0.0000000000000002 ***
seats < 0.0000000000000002 ***
selling_price < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2136 on 15010 degrees of freedom
Multiple R-squared: 0.912, Adjusted R-squared: 0.9118
F-statistic: 3796 on 41 and 15010 DF, p-value: < 0.00000000000000022
So far we can see that there were originally only 12 predictor variables, but after feeding them into the linear model the number of coefficients grows to 41. That’s because R automatically creates dummy variables for the categorical columns, since a linear regression model can only take numeric predictors.
Our first model gives a Multiple R-squared of 0.912 and an Adjusted R-squared of 0.9118, which is actually pretty good: it means the model explains about 91.2% of the variation in the data. Most of the predictors are also significant for the target log_selling_price, as shown by Pr(>|t|) < 0.05. Some predictors aren’t significant, but they are all levels of categorical variables belonging to a specific column, so we’re not going to remove them individually.

Although the R-squared values are quite satisfactory, the number of variables used in this model isn’t. We’re going to tackle that in the second model.
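To make that dummy expansion concrete, here is a small illustration (a side sketch, not part of the original code) of the design matrix lm() builds internally for one of the factor columns:

# model.matrix() shows the 0/1 dummy columns R creates for a factor,
# with one level absorbed into the intercept
head(model.matrix(~ fuel_type, data = train))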
Second Model
We’re now going to move on to our second model, in which we’ll “semi-manually” create dummy variables for the categories. We’re going to follow the steps for treating the training data described in Cory Lesmeister and Sunil Kumar Chinnamgari’s book Advanced Machine Learning with R.
We’re first going to create a varlist object to be used in the data treatment process, which is simply the column names of car_clean_3 minus log_selling_price and selling_price, treated as strings.
# create varlist object for data treatment
varlist <- car_clean_3 %>%
  select(-"log_selling_price", -"selling_price") %>%
  colnames()

# saving deleted columns to an object
train_log_selling_price <- train$log_selling_price
test_log_selling_price <- test$log_selling_price
Next we’re going to design the treatment scheme. We’ll use a minFraction of 0.1, which means the treatment will only create indicator columns for categorical levels that account for at least 10% of the total counts.
# training data treatment
train_treatment <- designTreatmentsZ(dframe = train,
                                     varlist = varlist,
                                     minFraction = 0.1)
[1] "vtreat 1.6.2 inspecting inputs Wed May 05 01:49:40 2021"
[1] "designing treatments Wed May 05 01:49:40 2021"
[1] " have initial level statistics Wed May 05 01:49:40 2021"
[1] " scoring treatments Wed May 05 01:49:40 2021"
[1] "have treatment plan Wed May 05 01:49:40 2021"
# original n of unique variables for each category
n_distinct(train$brand)
[1] 28
n_distinct(train$seller_type)
[1] 3
n_distinct(train$fuel_type)
[1] 4
n_distinct(train$transmission_type)
[1] 2
As we can see from the result above, the designTreatmentsZ() function creates a treatment plan in the form of a data frame that states whether each variable is categorical and records extraModelDegrees, which is n-1 of the total categories in each column. This is the same concept behind how dummy variables work in modeling, because one category of each column is absorbed into the intercept value.

The treatment plan also generates a code column in which:

- clean: non-categorical variables
- catP: categorical variables
- lev: categorical levels with at least 10% recurrence in the total data

That’s why I personally think this method is more efficient than letting R decide the dummy variables. However, in this data frame I spot a slight problem with transmission_type. It only has 2 categories, Automatic and Manual, and both get their own lev, which would cause perfect collinearity: every 0 in Automatic corresponds to a 1 in Manual.
We’re now going to apply the treatment plan to the train and test data and delete the Manual lev column.
# apply treatment to both train and test data
train_2 <- prepare(train_treatment, train) %>% select(-transmission_type_lev_x_Manual)
test_2 <- prepare(train_treatment, test) %>% select(-transmission_type_lev_x_Manual)
Here is what the treatment result looks like in the train data.
[1] 15052 17
We now end up with far fewer columns (17 after dropping the Manual lev) than the 41 predictors of the first model. We can see that the treatment replaced each categorical column with a _catP column holding the proportion of each category’s recurrence, plus “0”/“1” lev columns for the frequent levels. We’re now going to remove the _catP columns, since we already have the lev dummy variables, and put the log_selling_price column back in.
# remove columns with _catP
train_2 <- train_2 %>%
  select(-contains("_catP"))
test_2 <- test_2 %>%
  select(-contains("_catP"))
We’re now going to build the second model.
# joining log_selling_price back to the df
train_2$log_selling_price <- train_log_selling_price

# model 2
model_2 <- lm(log_selling_price ~ ., train_2)
summary(model_2)
Call:
lm(formula = log_selling_price ~ ., data = train_2)
Residuals:
Min 1Q Median 3Q Max
-2.39868 -0.15290 0.00844 0.16170 2.31614
Coefficients:
Estimate Std. Error t value
(Intercept) 12.9602115026522 0.0497665271067 260.420
average_new_price 0.0000000011810 0.0000000005021 2.352
vehicle_age -0.1216698142064 0.0008189994778 -148.559
km_driven -0.0000003155462 0.0000000459422 -6.868
mileage -0.0137188599443 0.0010123825949 -13.551
engine 0.0002099476214 0.0000116435879 18.031
max_power 0.0079565775879 0.0001200187877 66.294
seats 0.0061712942598 0.0039328833656 1.569
brand_lev_x_Hyundai 0.0908332787399 0.0063840118732 14.228
brand_lev_x_Maruti 0.1439009671707 0.0060514783916 23.779
seller_type_lev_x_Dealer 0.0419801949266 0.0218088564051 1.925
seller_type_lev_x_Individual -0.0654597641009 0.0219084020288 -2.988
fuel_type_lev_x_Diesel 0.2046968924109 0.0164643374490 12.433
fuel_type_lev_x_Petrol -0.0623752506728 0.0167558869679 -3.723
transmission_type_lev_x_Automatic 0.2118382204507 0.0068991910844 30.705
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
average_new_price 0.018671 *
vehicle_age < 0.0000000000000002 ***
km_driven 0.00000000000675 ***
mileage < 0.0000000000000002 ***
engine < 0.0000000000000002 ***
max_power < 0.0000000000000002 ***
seats 0.116633
brand_lev_x_Hyundai < 0.0000000000000002 ***
brand_lev_x_Maruti < 0.0000000000000002 ***
seller_type_lev_x_Dealer 0.054259 .
seller_type_lev_x_Individual 0.002814 **
fuel_type_lev_x_Diesel < 0.0000000000000002 ***
fuel_type_lev_x_Petrol 0.000198 ***
transmission_type_lev_x_Automatic < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2663 on 15037 degrees of freedom
Multiple R-squared: 0.863, Adjusted R-squared: 0.8629
F-statistic: 6766 on 14 and 15037 DF, p-value: < 0.00000000000000022
We now have a linear model with an R-squared of 0.863, an Adjusted R-squared of 0.8629, and a total of 14 predictors. Some predictors aren’t as significant as the others, so we’re going to build a third model that excludes those variables.
Third Model
We’re now going to build the third model, which uses the same predictors as the second model minus the insignificant variables.
# model 3
model_3 <- lm(log_selling_price ~ . - seats - seller_type_lev_x_Dealer, train_2)
summary(model_3)
Call:
lm(formula = log_selling_price ~ . - seats - seller_type_lev_x_Dealer,
data = train_2)
Residuals:
Min 1Q Median 3Q Max
-2.41196 -0.15219 0.00873 0.16132 2.31721
Coefficients:
Estimate Std. Error t value
(Intercept) 13.0442238011792 0.0355126154418 367.312
average_new_price 0.0000000010988 0.0000000004992 2.201
vehicle_age -0.1219117975918 0.0007898040128 -154.357
km_driven -0.0000003136041 0.0000000459373 -6.827
mileage -0.0143053221698 0.0009440689731 -15.153
engine 0.0002167981011 0.0000107214169 20.221
max_power 0.0078903635430 0.0001118544298 70.541
brand_lev_x_Hyundai 0.0897082506095 0.0063618684610 14.101
brand_lev_x_Maruti 0.1449987324360 0.0059926124934 24.196
seller_type_lev_x_Individual -0.1064827246520 0.0046138929206 -23.079
fuel_type_lev_x_Diesel 0.2058109687934 0.0164375694726 12.521
fuel_type_lev_x_Petrol -0.0642475767788 0.0166470309279 -3.859
transmission_type_lev_x_Automatic 0.2105874030700 0.0068318753204 30.824
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
average_new_price 0.027743 *
vehicle_age < 0.0000000000000002 ***
km_driven 0.00000000000902 ***
mileage < 0.0000000000000002 ***
engine < 0.0000000000000002 ***
max_power < 0.0000000000000002 ***
brand_lev_x_Hyundai < 0.0000000000000002 ***
brand_lev_x_Maruti < 0.0000000000000002 ***
seller_type_lev_x_Individual < 0.0000000000000002 ***
fuel_type_lev_x_Diesel < 0.0000000000000002 ***
fuel_type_lev_x_Petrol 0.000114 ***
transmission_type_lev_x_Automatic < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2663 on 15039 degrees of freedom
Multiple R-squared: 0.8629, Adjusted R-squared: 0.8628
F-statistic: 7891 on 12 and 15039 DF, p-value: < 0.00000000000000022
There’s only about a 0.1 percentage point difference in R-squared between the second and third models, while the third needs just 12 predictors, so between model_2 and model_3 we’ll go with the third and compare its prediction results against model_1.
Model Evaluation
Prediction Results
We’re now going to use the first and third models to predict on the test data.
# predict with first model
predict_1 <- predict(model_1, test)

# predict with third model
predict_3 <- predict(model_3, test_2)
head(predict_1)
5 21 22 26 31 39
13.16683 13.26648 13.38866 13.26083 12.72313 14.29395
head(predict_3)
1 2 3 4 5 6
13.20635 13.22513 13.44352 13.37557 12.93095 14.44644
The results aren’t very different. We can see that the third model tends to predict slightly higher values than the first model.
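Keep in mind that both sets of predictions are on the log scale; a small sketch (my own addition) for turning them back into rupees before reporting them:

# back-transform log-scale predictions to the original price scale
predicted_price_3 <- exp(predict_3)
head(predicted_price_3)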
We’re now going to compare the Adjusted R-squared, RMSE (for train and test), and MAPE of each model.
# adj. r squared
summary(model_1)$adj.r.squared
summary(model_3)$adj.r.squared

# RMSE train
RMSE(model_1$fitted.values, train$log_selling_price)
RMSE(model_3$fitted.values, train_2$log_selling_price)

# RMSE test
RMSE(predict_1, test$log_selling_price)
RMSE(predict_3, test_log_selling_price)

# MAPE
MAPE(predict_1, test$log_selling_price)
MAPE(predict_3, test_log_selling_price)
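For reference, MLmetrics::MAPE is just the mean absolute percentage error; a one-line sketch (my own addition) of the same quantity computed by hand:

# manual MAPE for model_3 on the (log-scale) test target
mean(abs((test_log_selling_price - predict_3) / test_log_selling_price))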
Model | Predictors | Adj.R_squared | RMSE_train | RMSE_test | MAPE |
---|---|---|---|---|---|
model_1 | 41 | 91.18 % | 0.2133 | 0.2134 | 0.0123 |
model_3 | 12 | 86.29 % | 0.2662 | 0.2688 | 0.0153 |
Viewing the results side by side, we can see that:

- model_3 has far fewer predictors than model_1 (a difference of 29),
- model_1 performs slightly better on Adjusted R-squared, explaining 91.18% of the data versus 86.29% for model_3,
- model_1 has lower RMSE on both the train and test data than model_3,
- both models perform similarly on the train and test data, which means they fit just right (no under- or overfitting),
- the MAPE of model_1 is better than model_3’s.

Based on those observations, we’ll still opt for the third model, since its prediction results aren’t very different from the first model’s and it only takes 12 predictors.
Assumptions
In this part, we’re going to check whether the model we chose follows the regression assumptions.
Linearity
Linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects.
Linearity and additivity of the relationship between dependent and independent variables are described as:
- The expected value of dependent variable is a straight-line function of each independent variable, holding the others fixed.
- The slope of that line does not depend on the values of the other variables.
- The effects of different independent variables on the expected value of the dependent variable are additive.
We’re going to check the linearity of our model by plotting it.
# residual vs fitted plot
ggplot(model_3, aes(x = .fitted, y = .resid)) +
geom_point(color = "#4c72b0", alpha = 0.3) +
geom_hline(aes(yintercept = 0), color = "#345790") +
geom_smooth(color = "#2d1a94") +
theme_bw() +
labs(title = "Residuals vs Fitted",
x = "Fitted Values",
y = "Residuals") +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
Here we can see that our model isn’t linear towards the upper end: it predicts well in the mid-range, but for higher fitted values the errors grow on the negative side, meaning the model tends to under-predict those values.
Residuals Normality
Second, we’re going to check for residuals normality.
# histogram
ggplot(model_3, aes(x = .resid)) +
geom_histogram(bins = 30, fill = "#4c72b0", color = "#345790", alpha = 0.7) +
theme_minimal() +
labs(title = "Residuals Distribution",
x = "Residuals",
y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))
Here we can see that the residuals are indeed roughly normally distributed. However, there are still some outliers, which I assume is related to why the linearity plot curves towards the end.
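A normal Q-Q plot would give a sharper view of those tails than the histogram; a quick sketch (my own addition) using base R’s plot method for lm objects:

# Q-Q plot of the standardized residuals of model_3
plot(model_3, which = 2)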
Homoscedasticity
We’re next going to check whether the residuals are homoscedastic (their variance is constant across the fitted values) using bptest().
# check for homoscedasticity
bptest(model_3)
studentized Breusch-Pagan test
data: model_3
BP = 1699.4, df = 12, p-value < 0.00000000000000022
Here we can see that the p-value of model_3 is far below 0.05, which means the residuals are heteroscedastic.
No-Multicollinearity
For the last assumption, we’re going to check whether our model fulfills the no-multicollinearity assumption using the vif() function.
# check for multicollinearity
vif(model_3)
average_new_price vehicle_age
1.507520 1.285787
km_driven mileage
1.217953 3.094771
engine max_power
6.096573 4.686581
brand_lev_x_Hyundai brand_lev_x_Maruti
1.233595 1.589526
seller_type_lev_x_Individual fuel_type_lev_x_Diesel
1.070334 14.331834
fuel_type_lev_x_Petrol transmission_type_lev_x_Automatic
14.690279 1.552254
Here we can see that there’s strong collinearity between the Diesel and Petrol fuel_type levels. We can also see that engine has a fairly high VIF value.
Model Improvement
Build Model Improvement
Based on the assumption checks above, we’re now going to try to fix the model. We’ll first remove outliers in the data.
# remove outliers
car_improv <- car_clean_3 %>% filter(log_selling_price < 15.8)

# check rows removed
(nrow(car_improv)/nrow(car) - 1)*100
[1] -3.868195
We’re now going to split the data into train and test.
# split data
set.seed(902)
samplesize <- round(0.8 * nrow(car_improv), 0)
index <- sample(seq_len(nrow(car_improv)), size = samplesize)

train_improv <- car_improv[index, ]
test_improv <- car_improv[-index, ]
We’re now going to repeat the same steps we used to build the second model, and remove the fuel_type Diesel level and engine.
# create varlist object for data treatment
varlist_improv <- car_improv %>%
  select(-"log_selling_price", -"selling_price") %>%
  colnames()

# saving deleted columns to an object
train_log_selling_price_improv <- train_improv$log_selling_price
test_log_selling_price_improv <- test_improv$log_selling_price

# training data treatment
train_treatment_improv <- designTreatmentsZ(dframe = train_improv,
                                            varlist = varlist_improv,
                                            minFraction = 0.1)
[1] "vtreat 1.6.2 inspecting inputs Wed May 05 01:50:30 2021"
[1] "designing treatments Wed May 05 01:50:30 2021"
[1] " have initial level statistics Wed May 05 01:50:30 2021"
[1] " scoring treatments Wed May 05 01:50:30 2021"
[1] "have treatment plan Wed May 05 01:50:30 2021"
# apply treatment to both train and test data
train_improv_2 <- prepare(train_treatment_improv, train_improv) %>%
  select(-transmission_type_lev_x_Manual)
test_improv_2 <- prepare(train_treatment_improv, test_improv) %>%
  select(-transmission_type_lev_x_Manual)

# remove columns with _catP
train_improv_2 <- train_improv_2 %>%
  select(-contains("_catP"))
test_improv_2 <- test_improv_2 %>%
  select(-contains("_catP"))

# joining log_selling_price back to the df
train_improv_2$log_selling_price <- train_log_selling_price_improv

# model improvement
model_improv <- lm(log_selling_price ~ . - seats - seller_type_lev_x_Dealer -
                     fuel_type_lev_x_Diesel - engine, train_improv_2)
summary(model_improv)
Call:
lm(formula = log_selling_price ~ . - seats - seller_type_lev_x_Dealer -
fuel_type_lev_x_Diesel - engine, data = train_improv_2)
Residuals:
Min 1Q Median 3Q Max
-2.07937 -0.15163 0.00774 0.16235 2.32850
Coefficients:
Estimate Std. Error t value
(Intercept) 13.594387719971 0.024464329954 555.682
average_new_price 0.000000001107 0.000000000510 2.171
vehicle_age -0.123324184128 0.000795727780 -154.983
km_driven -0.000000240168 0.000000046473 -5.168
mileage -0.022867366185 0.000790878569 -28.914
max_power 0.009690736937 0.000092071178 105.253
brand_lev_x_Hyundai 0.063748209795 0.006363808593 10.017
brand_lev_x_Maruti 0.125594044450 0.006036248894 20.807
seller_type_lev_x_Individual -0.103448462838 0.004689091075 -22.062
fuel_type_lev_x_Petrol -0.318182872110 0.005403144467 -58.888
transmission_type_lev_x_Automatic 0.182561834948 0.006910512579 26.418
Pr(>|t|)
(Intercept) < 0.0000000000000002 ***
average_new_price 0.0299 *
vehicle_age < 0.0000000000000002 ***
km_driven 0.00000024 ***
mileage < 0.0000000000000002 ***
max_power < 0.0000000000000002 ***
brand_lev_x_Hyundai < 0.0000000000000002 ***
brand_lev_x_Maruti < 0.0000000000000002 ***
seller_type_lev_x_Individual < 0.0000000000000002 ***
fuel_type_lev_x_Petrol < 0.0000000000000002 ***
transmission_type_lev_x_Automatic < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2709 on 15019 degrees of freedom
Multiple R-squared: 0.8575, Adjusted R-squared: 0.8574
F-statistic: 9037 on 10 and 15019 DF, p-value: < 0.00000000000000022
The R-squared and Adjusted R-squared, at about 85.75%, aren’t very different from the previous model’s. We’re now going to predict on the test data and compare the results with the previous model.
# predict model
predict_improv <- predict(model_improv, test_improv_2)
Model | Predictors | Adj.R_squared | RMSE_train | RMSE_test | MAPE |
---|---|---|---|---|---|
model_3 | 12 | 86.28 % | 0.2662 | 0.2688 | 0.0153 |
model_improvement | 11 | 85.74 % | 0.2708 | 0.2726 | 0.0153 |
Still, the results aren’t very much different from the base model.
Model Improvement Evaluation
We’re now going to re-evaluate our model based on the 4 assumptions: linearity, residuals normality, homoscedasticity, and no-multicollinearity.
# residual vs fitted plot
ggplot(model_improv, aes(x = .fitted, y = .resid)) +
geom_point(color = "#4c72b0", alpha = 0.3) +
geom_hline(aes(yintercept = 0), color = "#345790") +
geom_smooth(color = "#2d1a94") +
theme_bw() +
labs(title = "Residuals vs Fitted",
x = "Fitted Values",
y = "Residuals") +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
# histogram
ggplot(model_improv, aes(x = .resid)) +
geom_histogram(bins = 30, fill = "#4c72b0", color = "#345790", alpha = 0.7) +
theme_minimal() +
labs(title = "Residuals Distribution",
x = "Residuals",
y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5))
# homoscedasticity
bptest(model_improv)
studentized Breusch-Pagan test
data: model_improv
BP = 1518.8, df = 10, p-value < 0.00000000000000022
# no-multicollinearity
vif(model_improv)
average_new_price vehicle_age
1.435668 1.271233
km_driven mileage
1.206055 2.080817
max_power brand_lev_x_Hyundai
2.889481 1.202041
brand_lev_x_Maruti seller_type_lev_x_Individual
1.550935 1.067726
fuel_type_lev_x_Petrol transmission_type_lev_x_Automatic
1.492993 1.543603
Even after these changes, the improved model doesn’t give us a fully satisfactory result. It only does better on multicollinearity, since there’s no longer any VIF > 5; the linearity plot still shows the same tail behavior, and the model is still heteroscedastic.
Conclusion
After several attempts at improving the model, there are still assumptions that aren’t satisfied: linearity and homoscedasticity. From my own newbie perspective, this could be because:

- selling_price relies heavily on the brand of the car, but different brands have different models, and those models may influence the selling price differently. I feel our model could do a lot better with more data, such as the type of the car: it would make more sense for resale value to be driven by brand and car type, because the same type of car from different brands can have a different resale value.
- At the first pre-processing stage we saw that selling_price is heavily skewed to the right. Out of tens of thousands of rows, 75% are priced between INR 34,000 and 799,250; with a median of 530,000 against a maximum of 39,500,000 the gap is huge, and we can’t simply remove 25% of the data.
- Around 50% of the data had missing values. Perhaps with a more logical way to impute them the model could have been better; additional information such as car type, as mentioned above, could make the imputation more precise.
- Or maybe… the data is just the way it is and linear regression isn’t the most appropriate method for it.

All in all, the 2 things I really hope for in the future regarding this dataset are an additional car type variable and trying another machine learning method on this data (see the rough sketch below). I also hope to learn a lot more and get better at machine learning so I can pinpoint exactly what’s wrong and what isn’t. This has been a long journey, but thanks for reading until the end! :)
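As a closing aside, here is a rough sketch of what trying “another machine learning method” could look like with the caret package loaded above; the method choice and settings are hypothetical and untested here:

# hedged sketch: a 5-fold cross-validated random forest on the treated training data
# (would also require the randomForest package to be installed)
set.seed(902)
rf_fit <- train(log_selling_price ~ .,
                data = train_2,
                method = "rf",
                trControl = trainControl(method = "cv", number = 5))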
Resources
- https://towardsdatascience.com/assumptions-of-linear-regression-5d87c347140
- https://people.duke.edu/~rnau/testing.htm
- https://www.statisticssolutions.com/assumptions-of-linear-regression/
- Lesmeister, Cory and Chinnamgari, Sunil Kumar (2019). Advanced Machine Learning with R. Packt Publishing.