This study analyzes which factors affect the price of cars. The data were gathered from the website turbo.az and the analysis was carried out in the RStudio environment. The data contain information about both the car owners and the car properties; a brief description of the dataset is given in the introduction of the report. In the exploratory data analysis part, missing values and outliers are excluded from the data. First, the engine volume column is separated from its unit of volume (L) and converted from factor to double format. The rest of the exploratory analysis is then done on the newly generated dataset, which consists of 713 observations and 11 variables. In this study, we focus on the prices of BMW, Hyundai, and Mercedes cars from Baku priced in AZN. For this filtered dataset, univariate and bivariate analyses are carried out according to the stated research interest using visual graphics such as scatterplots, boxplots, and histograms. All quantitative variables are also tested for normality with the Shapiro-Wilk test. There are four categorical variables, and for each of them we create dummy variables. The next part of the analysis is devoted to model construction. First, a multiple linear regression model was fitted without any transformation; for this initial model, all the assumptions fail. We therefore apply a transformation, taking the square root of price, after which the assumptions are fulfilled. Later, stepwise regression and the best subset method were applied to choose the best model according to the AIC, R-squared, adjusted R-squared, and C(p) criteria. At the end of the analysis, we find that the best significant predictors of price are year, gear, mileage, engine power, engine volume, and transmission. After applying the square root transformation to the stepwise regression model, the model fulfilled all assumptions. From this research we may conclude that production year, mileage, engine power, transmission, and fuel type affect the prices of BMW, Hyundai, and Mercedes cars from Baku.
The data were collected from the website turbo.az, a popular public platform for buying and selling cars in Azerbaijan. The acquired data contain information about the car owner as well as the car itself, including mobile phone numbers, production year, engine power, mileage, etc. We analyze 19,819 observations and 25 variables in RStudio; of these, 14 variables are stored as factors (qualitative) and 11 as integers (quantitative). For a precise description, the structure of the data is displayed below.
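The analysis relies on several R packages. As a sketch, these are the libraries assumed to be loaded before the code below (all are standard CRAN packages):

library(dplyr)        # glimpse(), filter(), select(), count()
library(tidyr)        # separate()
library(ggplot2)      # static plots
library(plotly)       # interactive pie, violin, and scatter plots
library(fastDummies)  # dummy_cols()
library(car)          # vif(), durbinWatsonTest()
library(olsrr)        # ols_step_both_p(), Cook's distance plots
library(leaps)        # regsubsets()
library(broom)        # augment()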
setwd("C:/Users/FIDAN/Desktop/R projects/Turbo_Price_Analysis")
turbo <- read.csv("turbo.csv",header = TRUE)
turbo <- na.omit(turbo)
glimpse(turbo)
## Observations: 19,819
## Variables: 25
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ city <fct> Baku, Imishli, Baku, Baku, Baku, Baku, Siyazan, ...
## $ marque <fct> BMW, Mercedes, Mercedes, Mercedes, Nissan, Kia, ...
## $ model <fct> 325, C 200, E 240, C 220, X-Trail, Optima, C 240...
## $ year <int> 2001, 1998, 1999, 2001, 2014, 2013, 2001, 2001, ...
## $ category <fct> Sedan, Sedan, Sedan, Sedan, Offroader / SUV, Sed...
## $ color <fct> Göy, Gümüşü, Gümüşü, Qızılı, Ağ, Qara, Gümüşü, Q...
## $ engine_volume <fct> 2.5 L, 2.0 L, 2.4 L, 2.2 L, 2.5 L, 2.4 L, 2.6 L,...
## $ engine_power <int> 192, 136, 170, 170, 183, 180, 177, 125, 75, 233,...
## $ fuel_type <fct> Benzin, Benzin, Benzin, Dizel, Benzin, Benzin, B...
## $ mileage <int> 350000, 216000, 282000, 248253, 96500, 83000, 28...
## $ transmission <fct> Avtomat, Avtomat, Avtomat, Avtomat, Avtomat, Avt...
## $ gear <fct> Arxa, Arxa, Arxa, Arxa, Tam, Ön, Arxa, Arxa, Arx...
## $ is_new <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ price <int> 10500, 12000, 12500, 12600, 34500, 21500, 11800,...
## $ currency <fct> AZN, AZN, AZN, AZN, AZN, AZN, AZN, AZN, AZN, $, ...
## $ extras <fct> Yüngül lehimli disklər ABS Lyuk Mərkəzi qapanma ...
## $ viewed <int> 747, 24, 8, 6, 8, 26, 22, 34, 48, 584, 50, 94, 1...
## $ date <fct> 03-Sep-18, 03-Sep-18, 03-Sep-18, 03-Sep-18, 03-S...
## $ website_id <int> 2614885, 2618179, 2618178, 2618177, 2617909, 261...
## $ number <fct> 050 300-14-96, 070 737-37-67, 051 770-07-02, 055...
## $ owner <fct> Kamran, Kenan, Necef, Emil, Latif, Nurlan, Zaur,...
## $ is_salon <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ can_barter <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, ...
## $ is_credit <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
The prices of cars are unstable and affected by various factors, chief among them the economic state and currency of the country. The physical characteristics of a vehicle, such as its model, year of production, marque, engine volume, and mileage, may also have a significant impact on its value. In the turbo dataset, car prices appear in different currencies: AZN, USD, and EUR. As the website is based in Azerbaijan, most cars are priced in AZN; however, since USD is a strong international currency, many cars are also bought and sold in dollars. The table below shows the counts for the most used currencies.
$ | € | AZN |
---|---|---|
3641 | 52 | 16126 |
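As a minimal sketch, the counts in the table above can be reproduced directly from the currency column (assuming dplyr is loaded):

turbo %>% count(currency, sort = TRUE)   # frequency of each currency
# or, in base R:
table(turbo$currency)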
The data show that Mercedes, Hyundai, and BMW are among the most common marques advertised on the website. The counts for each car brand are presented in the following table.
summary(turbo$marque)
## Mercedes LADA VAZ Hyundai BMW
## 4657 3273 1380 1087
## Opel Toyota Kia Volkswagen
## 1014 988 700 639
## Nissan Ford GAZ Mitsubishi
## 611 568 479 460
## Land Rover Chevrolet Lexus Daewoo
## 438 331 306 291
## Audi Renault Tofas Porsche
## 202 173 145 144
## Jeep KamAz Mazda Honda
## 126 109 107 97
## Infiniti Iran Khodro Fiat Changan
## 95 87 81 60
## Subaru Saipa ZIL Chery
## 57 50 49 43
## Great Wall Peugeot Dodge Lifan
## 42 40 38 38
## SuShaki SEAT Bentley Chrysler
## 37 36 30 29
## UAZ Cadillac HOWO Volvo
## 29 24 24 24
## MAZ Moskvich Ravon Hummer
## 23 23 23 22
## BYD MAN Maserati Shacman
## 21 21 21 21
## Geely Skoda IJ Iveco
## 20 20 19 19
## Baic Dacia Yamaha Citroen
## 16 16 16 15
## Isuzu Mini Zontes DAF
## 15 15 15 14
## GAC Foton ZAZ Ssang Yong
## 14 13 13 12
## MG Jaguar GMC Muravey
## 11 10 9 6
## Rolls-Royce Alfa Romeo Dnepr DongFeng
## 6 5 5 5
## Mercedes-Maybach Ural Buick Dayun
## 5 5 4 4
## Haojue Harley-Davidson Haval JAC
## 4 4 4 4
## PAZ Smart BMW Alpina Ducati
## 4 4 3 3
## Ikarus Jonway MV Agusta Vespa
## 3 3 3 3
## Aston Martin Can-Am Daihatsu FAW
## 2 2 2 2
## JMC KawaShaki Temsa (Other)
## 2 2 2 23
Engine volume, also called engine size or engine capacity, indicates how large a space the engine's pistons operate in. According to Haining (C. Haining, 2018), a bigger number in liters means that each time the engine turns over, its pistons can push through more air and fuel. In general, the larger the engine volume in liters, the more expensive the car. Engine size is usually expressed in liters (L) or cubic centimeters (cc); in our data it is expressed in liters (L), as the output below shows.
head(turbo$engine_volume)
## [1] 2.5 L 2.0 L 2.4 L 2.2 L 2.5 L 2.4 L
## 76 Levels: 0.1 L 0.2 L 0.3 L 0.4 L 0.5 L 0.6 L 0.7 L 0.8 L 0.9 L ... 9.5 L
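Because engine_volume is stored as a factor with entries such as "2.5 L", it must be converted before any numeric analysis. A minimal sketch of one way to do this (the report itself uses tidyr::separate() in the data-cleaning step):

volume_num <- as.numeric(sub(" L", "", as.character(turbo$engine_volume)))
head(volume_num)   # 2.5 2.0 2.4 2.2 2.5 2.4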
Below is a pie chart of the car colors in the advertisements on the website from which the data were retrieved. In the collected data there are 5,717 white cars (28.8%) and 4,950 black cars (25%). These numbers are a clear indication of color preferences when purchasing cars in Azerbaijan.
plot_ly(turbo, labels = ~color, type = 'pie') %>%
layout(title = 'The Color of Cars',
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
)
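The percentages quoted above can be recomputed from the raw color counts; a minimal sketch ("Ağ" is white and "Qara" is black):

color_counts <- sort(table(turbo$color), decreasing = TRUE)
head(round(100 * color_counts / nrow(turbo), 1), 5)   # share of each color in percent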
Engine power is the maximum power an engine can put out; it can be expressed in kilowatts or in horsepower. The power output depends on the size and design of the engine, but also on its speed. In our data, engine power lies in the interval [5, 999] with an average of 166.22 kilowatts.
glimpse(turbo$engine_power)
## int [1:19819] 192 136 170 170 183 180 177 125 75 233 ...
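A quick way to verify the stated range and mean of engine power is to summarise the column directly; a minimal sketch:

summary(turbo$engine_power)   # min, quartiles, mean, max
range(turbo$engine_power)     # the interval [min, max]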
Mileage is one of the most important properties to consider when buying a used car. According to the literature, it has a great effect on a car's price: the higher the mileage, the lower the price. That is not always the case, however; a two-year-old car with 100,000 miles may be considered a worse buy than a ten-year-old car with 50,000 miles on it.
glimpse(turbo$mileage)
## int [1:19819] 350000 216000 282000 248253 96500 83000 281000 126000 262000 134500 ...
The geographic location of a car is another factor that may affect its price. For example, a used car with 10,000 miles that has spent its life in Ganja may be priced differently from a used car with 10,000 miles that came from Baku.
turbo %>%
select(city) %>%
unique() %>%
glimpse()
## Observations: 64
## Variables: 1
## $ city <fct> Baku, Imishli, Siyazan, Ganja, Shirvan, Mingachevir, Xird...
Gears allow a car to be driven with the minimum strain on the engine (Driving Test Success, 2017). Modern cars have five forward gears and one reverse gear, and some now have a sixth forward gear, which gives greater fuel economy when driving at higher speeds over longer distances. In this dataset, however, the gear variable records the drive type: rear-wheel (Arxa), front-wheel (Ön), or all-wheel (Tam) drive. Most of the cars are rear-wheel drive (Arxa).
summary(turbo$gear)
## Arxa Ön Tam
## 8614 6375 4830
RESEARCH QUESTIONS
The aim of the research is to determine which factors have the most impact on the price of BMW, Hyundai, and Mercedes cars from Baku priced in AZN. Throughout the study, the following questions will be answered.
Is there a linear relationship between price and the other variables?
Which factors have an impact on price for cars from Baku, taking AZN as the base currency and the brands BMW, Mercedes, and Hyundai?
Does a multicollinearity problem exist among the independent variables?
Data Cleaning
We first use the function tidyr::separate() to split the engine volume into its numeric value and its unit of capacity (L, liters).
Our inspection shows that there are 21 missing values in the data in total; we eliminate them with the na.omit() function.
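As a sketch, the missing values can be counted before they are removed (assuming the raw file turbo.csv from above):

raw <- read.csv("turbo.csv", header = TRUE)
sum(is.na(raw))       # total number of NA cells (21 according to the text)
colSums(is.na(raw))   # NA count per column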
turbo <- turbo %>%
# note: `==` against a vector compares element-wise with recycling;
# `%in%` is the usual operator for set membership
filter(city=="Baku",currency =="AZN",fuel_type==c("Benzin","Dizel"),
marque==c("BMW","Hyundai","Mercedes")) %>%
select (city,year,gear,engine_volume,engine_power,fuel_type,mileage,
transmission,price,marque,can_barter,city) %>%
separate(col="engine_volume",into="volume", sep = "L",
remove = TRUE, convert = TRUE) %>%
glimpse()
## Observations: 713
## Variables: 11
## $ city <fct> Baku, Baku, Baku, Baku, Baku, Baku, Baku, Baku, B...
## $ year <int> 2001, 1999, 2007, 1999, 1999, 2009, 2004, 1995, 2...
## $ gear <fct> Arxa, Arxa, Tam, Arxa, Arxa, Ön, Arxa, Arxa, Arxa...
## $ volume <dbl> 2.5, 2.4, 2.2, 2.2, 2.8, 2.4, 2.0, 2.0, 2.6, 3.0,...
## $ engine_power <int> 192, 170, 150, 143, 193, 178, 183, 136, 193, 258,...
## $ fuel_type <fct> Benzin, Benzin, Dizel, Dizel, Benzin, Benzin, Ben...
## $ mileage <int> 350000, 282000, 273000, 390000, 190000, 110100, 2...
## $ transmission <fct> Avtomat, Avtomat, Avtomat, Avtomat, Avtomat, Avto...
## $ price <int> 10500, 12500, 22700, 12300, 12800, 16500, 13800, ...
## $ marque <fct> BMW, Mercedes, Hyundai, Mercedes, BMW, Hyundai, B...
## $ can_barter <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0...
As a result, the new dataset has 11 variables and 713 observations. Below we present the variables that are selected for the analysis.
colnames(turbo)
## [1] "city" "year" "gear" "volume"
## [5] "engine_power" "fuel_type" "mileage" "transmission"
## [9] "price" "marque" "can_barter"
All of the analysis conducted in the remainder of this research is carried out on this new dataset.
turbo <- dummy_cols(.data = turbo,select_columns =
c("transmission","fuel_type","gear"))
Variable | Levels | Dummy coding |
---|---|---|
Transmission | Automatic, Mechanical | 1 0 / 0 1 |
Fuel type | Gasoline, Diesel | 1 0 / 0 1 |
Gear | Arxa, Ön, Tam | 1 0 0 / 0 1 0 / 0 0 1 |
Marque | Hyundai, Mercedes, BMW | 1 0 0 / 0 1 0 / 0 0 1 |
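As an illustration of the coding scheme in the table, a small toy example with fastDummies::dummy_cols() applied to the gear variable (a sketch; the generated column names follow the package's variable_level convention):

toy <- data.frame(gear = c("Arxa", "Ön", "Tam"))
dummy_cols(toy, select_columns = "gear")
#   gear gear_Arxa gear_Ön gear_Tam
# 1 Arxa         1       0       0
# 2   Ön         0       1       0
# 3  Tam         0       0       1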
Next, we group the data according to transmission, fuel type, year, and gear. One can easily see that most of the cars have automatic transmission. Moreover, many of the gasoline-fueled cars are from the 1996–1998 production years.
turbo %>%
count(transmission,fuel_type,gear,year)
## # A tibble: 142 x 5
## transmission fuel_type gear year n
## <fct> <fct> <fct> <int> <int>
## 1 Avtomat Benzin Arxa 1985 1
## 2 Avtomat Benzin Arxa 1986 1
## 3 Avtomat Benzin Arxa 1987 2
## 4 Avtomat Benzin Arxa 1989 2
## 5 Avtomat Benzin Arxa 1990 9
## 6 Avtomat Benzin Arxa 1991 10
## 7 Avtomat Benzin Arxa 1992 13
## 8 Avtomat Benzin Arxa 1993 7
## 9 Avtomat Benzin Arxa 1994 12
## 10 Avtomat Benzin Arxa 1995 16
## # ... with 132 more rows
In this part, we analyze the data with two categorical variables at a time, using the facet_wrap() function for visualization. For the first analysis, we consider fuel type and transmission: fuel type is divided into two groups and, within each group, transmission is differentiated as automatic or mechanical. Within each panel, engine volume is plotted against engine power. The Gasoline/Automatic and Gasoline/Mechanical panels contain most of the information about the two variables, and the relationship in each subplot is positive.
ggplot(data = turbo) +
geom_point(mapping = aes(x = volume, y = engine_power),color="darkgreen")+
facet_wrap( fuel_type ~ transmission)
In the second plot, we present the scatterplots grouped by transmission and gear, which are split into two and three groups respectively. In every subplot there is a positive association between engine volume and engine power. Most of the observations fall in the automatic-transmission panels, in particular those combined with the Ön and Tam drive types, and in the mechanical/Ön panel.
ggplot(data = turbo) +
geom_point(mapping = aes(x = volume, y = engine_power),color="maroon4") +
facet_grid( gear~ transmission)
Price
We see from the data that the price variable is right-skewed; that is, most observations are concentrated on the left-hand side. The shape of the variable can also be seen from the boxplot.
par(mfrow=c(1,2))
hist(turbo$price, xlab="Price",breaks = 15, col = rgb(0,1,0,0.5),
main = "The Price of Cars")
p=boxplot(turbo$price,xlab="Price",col = "mediumseagreen",notch = T,
main = "The Price of Cars")
An alternative visualization for price is the violin plot, which shows the distribution of the data together with its probability density. Its advantage is that it helps to detect whether a distribution is bimodal or multimodal. Here, the distribution of price is unimodal.
turbo%>%
plot_ly(
x = ~price,
type = 'violin',
box = list(
visible = T
),
meanline = list(
visible = T, color="red"
)
) %>%
layout(
xaxis = list(
title = "Price"
)
)
There are some outliers in the price variable, including several very high prices such as 121,600 AZN and 138,850 AZN.
p$out
## [1] 35800 138850 49700 87000 35400 50000 35800 41200 135000 39000
## [11] 55000 57000 72000 37400 42000 121600
The minimum price is 2,800 AZN and the highest price is 138,850 AZN. The 25th percentile is 9,500 AZN, the median is 13,500 AZN, and the 75th percentile is 19,800 AZN.
summary(turbo$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2800 9500 13500 15785 19800 138850
Shapiro-Wilk Test for Price
We apply the Shapiro-Wilk test to check whether price follows a normal distribution. The hypotheses are formulated as follows.
H0: Price is normally distributed vs. H1: Price is not normally distributed
shapiro.test(turbo$price)
##
## Shapiro-Wilk normality test
##
## data: turbo$price
## W = 0.65331, p-value < 2.2e-16
If p < α = 0.05, we reject H0. Since our p-value is less than 5%, we reject the null hypothesis; thus, price is not normally distributed.
Mileage
The minimum observation in this variable is 0 km, indicating a car that has not been driven at all, and the highest observation is 598,820 km, the greatest distance a car in the data was driven. The 1st quartile lies at 167,300 km, the 2nd quartile (median) is 245,245 km, and the 3rd quartile is 313,926 km.
par(mfrow=c(1,2))
hist(turbo$mileage, xlab="Mileage",breaks = 15, col = "plum3",
main = "The Mileage of Cars")
m=boxplot(turbo$mileage,xlab="Mileage",col = "plum2",notch = T,
main = "The Mileage of Cars")
summary(turbo$mileage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 167300 245245 244690 313926 598820
We find 11 outliers in the mileage variable, as presented below. The largest outlier is 598,820 km, the maximum distance any car in the data was driven.
m$out
## [1] 538000 598820 535000 567840 550000 587452 536500 547844 576000 561000
## [11] 544000
Shapiro-Wilk Test for Mileage
We apply the Shapiro-Wilk test to check whether mileage follows a normal distribution. The hypotheses are formulated as follows.
H0: Mileage is normally distributed vs. H1: Mileage is not normally distributed
shapiro.test(turbo$mileage)
##
## Shapiro-Wilk normality test
##
## data: turbo$mileage
## W = 0.98894, p-value = 3.27e-05
If p < α = 0.05, we reject H0. Since our p-value is less than 5%, we reject the null hypothesis; thus, mileage is not normally distributed.
Year
The production year indicates the year in which the car was manufactured. In the original data, the oldest car is a GAZ produced in 1953, with a 3.5 L engine, engine power of 90, and 85,000 km of mileage. The year variable is one of the most important factors influencing price. The histogram and the boxplot show that its shape is almost symmetric.
par(mfrow=c(1,2))
hist(turbo$year, xlab="Year",breaks = 15, col = "palevioletred3",
main = "The Production Year")
y = boxplot(turbo$year,xlab="Year",col = "palevioletred3",notch = T,
main = "The Production Year") #print outliers
In the new dataset, the earliest production year for BMW, Mercedes, and Hyundai is 1982 and the latest is 2018. The 25th percentile is 1997, the median is 2001, and the 75th percentile is 2007.
summary(turbo$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1982 1997 2001 2002 2007 2018
Shapiro-Wilk Test for Year
We apply the Shapiro-Wilk test to check whether the production year follows a normal distribution. The hypotheses are formulated as follows.
H0: Year is normally distributed vs. H1: Year is not normally distributed
shapiro.test(turbo$year)
##
## Shapiro-Wilk normality test
##
## data: turbo$year
## W = 0.98201, p-value = 1.113e-07
If p < α = 0.05, we reject H0. Since our p-value is less than 5%, we reject the null hypothesis; thus, the year is not normally distributed.
Volume
The volume variable is positively skewed, since most of the observations accumulate on the left side. Outliers are observed in this variable as well.
par(mfrow=c(1,2))
hist(turbo$volume, xlab="Volume",breaks = 15, col = "salmon3",
main = "The Engine Volume")
v<- boxplot(turbo$volume,xlab="Volume",col = "salmon3",notch = T,
main = "The Engine Volume") #print outliers
The most common outlier value for volume is 4.4 L. The largest outlier is 12 L and the smallest outlier is 3.6 L.
v$out
## [1] 4.8 4.4 4.0 4.8 4.4 4.4 4.8 4.2 4.6 4.3 4.4 12.0 4.4 4.3
## [15] 4.8 4.4 4.8 4.3 4.4 4.4 6.4 4.4 4.8 3.6 4.3 12.0 4.8 4.2
## [29] 5.0 3.6 4.3 5.5 6.3 4.5 4.4
summary(turbo$volume)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 2.000 2.300 2.447 2.600 12.000
The minimum engine size is 1.3 L and the maximum capacity is 12.0 L. The average capacity is 2.447 L. The data are right-skewed, since the mean is greater than the median of 2.3 L.
Shapiro-Wilk Test for Engine Volume
We apply the Shapiro-Wilk test to check whether engine volume follows a normal distribution. The hypotheses are formulated as follows.
H0: Engine volume is normally distributed vs. H1: Engine volume is not normally distributed
shapiro.test(turbo$volume)
##
## Shapiro-Wilk normality test
##
## data: turbo$volume
## W = 0.66281, p-value < 2.2e-16
Since our p-value is less than 5%, we reject the null Hypothesis. Thus, the volume is not normally distributed.
Engine Power
From the summary, the minimum engine power is 75, the maximum is 480, and the median is 166. The mean (172.9) and the median are fairly close, suggesting the bulk of the distribution is roughly symmetric.
par(mfrow=c(1,2))
hist(turbo$engine_power, xlab="Engine Power",breaks = 15, col = "steelblue3",
main = "The Engine Power")
boxplot(turbo$engine_power,xlab="Engine Power",col = "steelblue3",notch=T,
main = "The Engine Power")$out #print outliers
## [1] 360 320 306 280 367 300 286 286 360 420 292 286 320 282 306 360 286
## [18] 360 286 394 286 280 358 333 360 286 279 306 450 367 480 320 354 293
## [35] 333 286
summary(turbo$engine_power)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 75.0 136.0 166.0 172.9 192.0 480.0
The boxplot identifies a number of high engine-power values as outliers, listed above.
Shapiro-Wilk Test for Engine Power
We apply the Shapiro-Wilk test to check whether engine power follows a normal distribution. The hypotheses are formulated as follows.
H0: Engine power is normally distributed vs. H1: Engine power is not normally distributed
shapiro.test(turbo$engine_power)
##
## Shapiro-Wilk normality test
##
## data: turbo$engine_power
## W = 0.86375, p-value < 2.2e-16
If p < α = 0.05, we reject H0. Since our p-value is less than 5%, we reject the null hypothesis; thus, engine power is not normally distributed.
Normal QQ Plot
In the plots below, all quantitative variables are plotted against theoretical normal quantiles. Price and volume clearly do not follow the reference line, which indicates they are not normally distributed. The engine power, mileage, and year plots follow the line more closely and deviate mainly in the tails.
par(mfrow=c(2,3))
qqnorm(turbo$price,xlab="Price", main = "Normal QQ Plot for Price")
qqline(turbo$price, col = "red")
qqnorm(turbo$engine_power, xlab="Engine Power",main = "Normal QQ Plot for engine_power")
qqline(turbo$engine_power, col = "red")
qqnorm(turbo$volume,xlab="Volume", main = "Normal QQ Plot for engine_volume")
qqline(turbo$volume, col = "red")
qqnorm(turbo$mileage, xlab="Mileage",main = "Normal QQ Plot for mileage")
qqline(turbo$mileage, col = "red")
qqnorm(turbo$year,xlab="Year", main = "Normal QQ Plot for Year")
qqline(turbo$year, col = "red")
Outliers
In this data, outliers are detected in the price, mileage, volume, engine power, and year variables. These outliers may affect the model assumptions, so in principle the data should be cleaned of them. However, since deleting outliers can lead to a loss of information, we decide to remove them only in the price variable. The cleaned price variable is presented below:
quantiles <- quantile(turbo$price, probs = c(.25, .75))
range <- 1.3 * IQR(turbo$price)   # a 1.3 * IQR fence (Tukey's conventional fence uses 1.5 * IQR)
turbo <- subset(turbo,
turbo$price > (quantiles[1] - range) &
turbo$price < (quantiles[2] + range))
The price variable is now free of outliers, and its violin plot is shown below. The shape of the distribution is approximately symmetric. The new mean is 14,167 AZN and the median is 13,000 AZN; the minimum price is 2,800 AZN and the maximum is 31,000 AZN.
turbo%>%
plot_ly(
x = ~price,
type = 'violin',
box = list(
visible = T
),
meanline = list(
visible = T, color="red"
)
) %>%
layout(
xaxis = list(
title = "Price"
)
)
Simple Scatter Plot Among Variables
The scatterplot matrix displays the pairwise relationships between the variables. It shows positive relationships between price and engine power, price and year, engine power and year, and volume and engine power, and negative relationships between mileage and year and between price and mileage.
pairs(~price+volume+engine_power+mileage+year,data=turbo,
main="Simple Scatterplot Matrix",col="midnightblue")
Price vs Engine Volume
The figure below displays a positive linear association between volume and price: cars with larger engine volumes tend to be more expensive, and cars with smaller volumes cheaper.
plot_ly(data = turbo, x = ~volume, y = ~price,
marker = list(size = 8,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',
width = 2))) %>%
layout(title = 'The relation between engine volume and price',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
Price vs Engine Power
Next, we present the scatterplot of price against engine power. The greater the engine power, the higher the car price tends to be.
plot_ly(data = turbo, x = ~engine_power, y = ~price, color = ~gear)
Price vs Mileage
The scatterplot shows a negative relationship between mileage and price: high mileage usually indicates an older, more heavily used car, which lowers its price.
plot_ly(data = turbo, x = ~mileage, y = ~price,
marker = list(size = 8,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'steelblue',
width = 2))) %>%
layout(title = 'Price vs Mileage',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
Price vs Year
There is a positive relationship between price and year: the newer the car, the higher its price tends to be.
plot_ly(data = turbo, x = ~year, y = ~price,
marker = list(size = 8,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'purple',
width = 2))) %>%
layout(title = 'Price vs Year',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))
Price vs Transmission
The boxplot of price is grouped by transmission type, mechanical or automatic. For mechanical cars there is one outlier at 22,000 AZN, while for automatic cars there are three outliers at 34,500, 35,000, and 40,000 AZN. The minimum price for mechanical cars is 3,300 AZN and the maximum is 22,000 AZN; the first quartile is 4,500 AZN, the median 6,250 AZN, and the third quartile 9,700 AZN. For automatic cars, the minimum price is 2,700 AZN and the maximum is 40,000 AZN; the first quartile, median, and third quartile are 9,200, 13,000, and 19,250 AZN. Both distributions are right-skewed, with most of the observations concentrated on the lower (left) side.
plot_ly(turbo, x = ~price, color = ~transmission, type = "box")
Price vs Fuel Type
The price of cars is split by fuel type into diesel and gasoline. The minimum price for diesel cars is 3,650 AZN and the maximum is 35,000 AZN; the median is 11,150 AZN, and the first and third quartiles are 9,500 and 23,000 AZN. The diesel distribution is skewed, with the upper half of prices spread over a much wider range, and no outliers are detected. For gasoline cars, however, there are outliers. The lowest price for gasoline cars is 2,700 AZN and the highest is 40,000 AZN; the median is 12,500 AZN, and the first and third quartiles are 8,300 and 17,750 AZN. The gasoline price distribution appears roughly symmetric.
plot_ly(turbo, x = ~price, color = ~fuel_type, type = "box")
Price vs Gear
The boxplot of price grouped by drive type is displayed below. For the "Tam" group, the minimum price is 11,000 AZN and the maximum is 35,000 AZN; 25% of the values lie below 11,000, 50% below 19,950, and 75% below 25,000. For the "Ön" group, the minimum and maximum prices are 8,700 AZN and 15,500 AZN; the first and third quartiles are 8,925 and 11,150 AZN, and the median is 9,400 AZN. There is one outlier in the "Ön" group at 15,500 AZN. The "Tam" and "Arxa" distributions are roughly symmetric, while the "Ön" distribution is right-skewed. Finally, the lowest and highest prices for the "Arxa" group are 2,700 AZN and 40,000 AZN; the median is 11,000 AZN, the first quartile 7,800 AZN, and the third quartile 14,825 AZN. Outliers are also present in this group.
plot_ly(turbo, x = ~price, color = ~gear, type = "box")
cor(turbo$price,turbo$volume)
## [1] 0.3009634
cor(turbo$price,turbo$year)
## [1] 0.8252084
cor(turbo$price,turbo$engine_power)
## [1] 0.4379799
cor(turbo$price,turbo$mileage)
## [1] -0.4952912
The initial model is constructed without the categorical variables. The hypotheses for the overall model are:
H0: β1 = β2 = β3 = β4 = 0 versus H1: at least one βi ≠ 0, i = 1, 2, 3, 4
fit <- lm(price~year+engine_power+volume+mileage,data=turbo)
summary(fit)
##
## Call:
## lm(formula = price ~ year + engine_power + volume + mileage,
## data = turbo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13163 -1963 -111 1711 15631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.684e+06 5.338e+04 -31.546 < 2e-16 ***
## year 8.451e+02 2.658e+01 31.797 < 2e-16 ***
## engine_power 5.405e+00 4.631e+00 1.167 0.244
## volume 2.298e+03 3.627e+02 6.336 4.27e-10 ***
## mileage 2.313e-03 1.579e-03 1.465 0.143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3443 on 682 degrees of freedom
## Multiple R-squared: 0.7475, Adjusted R-squared: 0.746
## F-statistic: 504.6 on 4 and 682 DF, p-value: < 2.2e-16
The Residuals and the Residuals Standard Error (RSE)
The residuals are the differences between the observed values of the response (price) and the values predicted by the model. The maximum residual for this model is 15,631 and the minimum is -13,163. The residual standard error (RSE) is the standard deviation of the residuals; for our model, the RSE is 3,443, which can be judged against the spread of the residuals (1st quartile -1,963, 3rd quartile 1,711).
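As a quick check of these figures, the residuals and the RSE can be recomputed from the fitted object (a sketch using fit from above):

res <- residuals(fit)                  # observed price minus fitted price
sqrt(sum(res^2) / df.residual(fit))    # residual standard error, about 3443
quantile(res, c(0.25, 0.75))           # 1st and 3rd quartiles of the residuals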
The p-value of model
Here, the p-value of the F-test is less than 2.2e-16, far below 0.05. Thus, we reject H0: β1 = β2 = β3 = β4 = 0 and conclude that the model is significant overall.
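The overall p-value comes from the F distribution; a minimal sketch using the F statistic and degrees of freedom reported in the summary above:

pf(504.6, df1 = 4, df2 = 682, lower.tail = FALSE)   # essentially zero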
Testing p-value for each variable
The model summary shows that the intercept, year, and volume are significant predictors. Engine power and mileage do not significantly explain price, since their p-values are greater than 0.05.
Significance Stars
Three asterisks (***) mark the highest level of significance and fewer asterisks mark lower levels. In this model, the asterisks show that the intercept, year, and volume are highly significant.
Coefficients
The intercept of -1.684e+06 is the fitted price when all predictors are zero, which has no practical interpretation here. The coefficient of year is 845.1: increasing year by one unit while keeping the other variables constant changes the predicted price by about 845 AZN. Similarly, increasing volume by 1 L while keeping the other variables constant changes the predicted price by about 2,298 AZN.
Intercept | Year | Engine Power | Volume | Mileage |
---|---|---|---|---|
-1.684e+06 | 8.451e+02 | 5.405e+00 | 2.298e+03 | 2.313e-03 |
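To make the interpretation concrete, the effect of a one-unit change in volume can be read off from two predictions that differ only in volume (a sketch; the input values below are hypothetical):

newdata <- data.frame(year = 2005, engine_power = 170,
                      volume = c(2.0, 3.0), mileage = 200000)
diff(predict(fit, newdata))   # equals coef(fit)["volume"], about 2298 AZN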
Multiple R-squared and Adjusted R-squared
The independent variables (year, engine power, volume, and mileage) explain about 75% of the variation in price (Multiple R-squared = 0.7475, Adjusted R-squared = 0.746).
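R-squared can be verified from its definition, 1 - RSS/TSS; a minimal sketch:

rss <- sum(residuals(fit)^2)                       # residual sum of squares
tss <- sum((turbo$price - mean(turbo$price))^2)    # total sum of squares
1 - rss / tss                                      # about 0.7475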
Regression Diagnostics
Assumption 1: Linearity of the data
The residuals vs fitted plot has a nonlinear shape, and the normal Q-Q plot does not follow the reference line.
plot(fit,1)
Assumption 2: Homogeneity of variance or Homoscedasticity
Furthermore, the variance of the residuals is not constant along the zero line.
plot(fit,3)
There is no need to check the remaining assumptions, since the first two are already violated. If the residuals vs fitted plot indicates a non-linear relationship, a simple approach is to apply a log, square-root, or squared transformation in the regression model (A. Kassambara, 2018). A possible remedy for the violated homogeneity of variance assumption is a log or square-root transformation of the dependent variable.
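Before settling on the square root, candidate transformations can be compared side by side; a sketch (fit.log is used only for this comparison):

fit.log <- lm(log(price) ~ year + engine_power + volume + mileage, data = turbo)
par(mfrow = c(1, 2))
plot(fit.log, 3)   # scale-location plot for the log-transformed response
plot(lm(sqrt(price) ~ year + engine_power + volume + mileage, data = turbo), 3)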
We therefore take the square root of price to satisfy the homoscedasticity assumption.
fit.t <- lm(sqrt(price)~year+engine_power+volume+mileage,data=turbo)
summary(fit.t)
##
## Call:
## lm(formula = sqrt(price) ~ year + engine_power + volume + mileage,
## data = turbo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.858 -7.474 -0.160 7.713 60.606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.226e+03 2.051e+02 -35.230 < 2e-16 ***
## year 3.653e+00 1.021e-01 35.772 < 2e-16 ***
## engine_power 1.715e-02 1.779e-02 0.964 0.3355
## volume 1.058e+01 1.393e+00 7.596 1.01e-13 ***
## mileage 1.215e-05 6.065e-06 2.003 0.0456 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.23 on 682 degrees of freedom
## Multiple R-squared: 0.7874, Adjusted R-squared: 0.7861
## F-statistic: 631.3 on 4 and 682 DF, p-value: < 2.2e-16
The RSE of the transformed model is 13.23 (on the square-root scale). A small RSE indicates that the model fits the data well. Compared with the previous model, both the R-squared and the adjusted R-squared are higher: the predictors now explain about 79% of the variation in the outcome. Engine power is still an insignificant factor. The three asterisks show that the intercept, year, and volume are highly significant, whereas the single asterisk indicates that mileage is significant at a weaker level.
Intercept | Year | Engine Power | Volume | Mileage |
---|---|---|---|---|
-7.226e+03 | 3.653 | 0.01715 | 10.58 | 1.215e-05 |
The intercept of -7,226 is the fitted square root of price when all predictors are zero. The coefficient of year is 3.653: increasing year by one unit while keeping the other variables constant changes the square root of price by 3.653 units. Similarly, increasing volume by 1 L while keeping the other variables constant changes the square root of price by 10.58 units.
Assumption 1: Linearity of the data
The first assumption holds, since the red line is approximately horizontal around zero.
plot(fit.t,1)
Assumption 2: Homogeneity of variance
The residuals are not spread equally across the range of the fitted values; thus, the homogeneity of variance assumption still fails.
plot(fit.t,3)
Assumption 3: Normality of Residuals
The residuals of the model are normally distributed.
plot(fit.t,2)
Assumption 4: Outliers and High Leverage Points
The plot below highlights the three most extreme points (66, 405, and 656), whose standardized residuals lie outside roughly ±3 standard deviations.
par(mfrow=c(1,2))
plot(fit.t,4)
plot(fit.t,5)
The data do not contain influential points, since all observations lie inside the Cook's distance lines; thus, there are no points that unduly influence the regression results.
The table below shows the three observations with the highest Cook's distance:
model.diag.metrics <- augment(fit.t)
model.diag.metrics %>%
top_n(3, wt = .cooksd)
## # A tibble: 3 x 13
## .rownames sqrt.price. year engine_power volume mileage .fitted .se.fit
## <chr> <dbl> <int> <int> <dbl> <int> <dbl> <dbl>
## 1 405 176. 2000 170 4.3 230000 131. 2.68
## 2 622 182. 2006 480 4.2 352621 159. 3.73
## 3 656 124. 1987 320 5 100000 92.3 2.99
## # ... with 5 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>
Assumption 5: Independence of Errors
Test for autocorrelated errors:
H0: The residual errors are independent
H1: The residual errors are not independent
durbinWatsonTest(fit.t)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.05202954 2.102086 0.21
## Alternative hypothesis: rho != 0
Since the p-value is 0.21, we cannot reject H0. The fifth assumption holds: the residuals are independent.
Multicollinearity
Multicollinearity means a high correlation between independent variables. The scatterplot of engine power against volume is shown below; there is a strong positive relationship between them, with a correlation of about 0.82. Therefore, we check whether multicollinearity exists among the predictors.
ggplot(data = turbo) +
geom_point(mapping =
aes(x = volume, y = engine_power, color = marque), size = 1) +
labs(title="The relation between engine volume and power",
x="Volume", y="Power")
cor(turbo$volume,turbo$engine_power)
## [1] 0.8165848
The variance inflation factor (VIF) is a way to check for multicollinearity. According to the Minitab blog ("Enough Is Enough! Handling Multicollinearity in Regression Analysis", 2013), a VIF between 5 and 10 indicates high correlation, and if the VIF is greater than 10, the regression coefficients can be assumed to be poorly estimated because of multicollinearity.
vif(fit.t)
## year engine_power volume mileage
## 1.815890 3.536412 3.293046 1.665150
In our case, all VIF values are well below 5, so there is no multicollinearity problem among the independent variables and none of them needs to be removed.
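A VIF equals 1 / (1 - R²) from regressing one predictor on the remaining ones; a quick manual check for engine_power (a sketch):

r2 <- summary(lm(engine_power ~ year + volume + mileage, data = turbo))$r.squared
1 / (1 - r2)   # should match vif(fit.t)["engine_power"], about 3.54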
The AIC (Akaike information criterion) and BIC (Bayesian information criterion) values used for model comparison are:
AIC(fit.t)
## [1] 5504.89
BIC(fit.t)
## [1] 5532.084
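Both criteria are functions of the model's log-likelihood; a quick sanity check on fit.t (k counts the regression coefficients plus the residual variance):

k <- length(coef(fit.t)) + 1                              # 5 coefficients + sigma
-2 * as.numeric(logLik(fit.t)) + 2 * k                    # reproduces AIC(fit.t)
-2 * as.numeric(logLik(fit.t)) + log(nobs(fit.t)) * k     # reproduces BIC(fit.t)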
There are three methods of stepwise regression (NCSS Statistical Software):
Forward selection starts with no predictors in the model, repeatedly adds the most significant predictor, and stops when no remaining predictor is statistically significant.
Backward selection starts with the full model, iteratively removes the least contributive predictor, and stops when all remaining predictors are statistically significant.
Stepwise selection is a combination of the two. It starts with no predictors and, as in forward selection, adds the most contributive variables; after each addition it removes, as in backward selection, any variables that no longer contribute to the model fit.
In this analysis, stepwise selection is used to find the best-fitting model.
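Base R also offers an AIC-based stepwise search, which can serve as a cross-check of the p-value-based olsrr::ols_step_both_p() used below; a sketch:

null <- lm(price ~ 1, data = turbo)
full <- lm(price ~ year + engine_power + volume + mileage +
             can_barter + gear + fuel_type + transmission + marque, data = turbo)
step(null, scope = list(lower = null, upper = full),
     direction = "both", trace = FALSE)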
STEPWISE REGRESSION MODEL
The stepwise regression model based on p-values is built as follows:
ols_step_both_p(lm(price~year+engine_power+volume+mileage+
can_barter+gear+fuel_type+transmission+marque,data=turbo))
## Stepwise Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. year
## 2. engine_power
## 3. volume
## 4. mileage
## 5. can_barter
## 6. gear
## 7. fuel_type
## 8. transmission
## 9. marque
##
## We are selecting variables based on p value...
##
## Variables Entered/Removed:
##
## - year added
## - volume added
## - gear added
## - fuel_type added
## - marque added
## - engine_power added
## - transmission added
## - mileage added
##
## No more variables to be added/removed.
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -------------------------------------------------------------------
## R 0.901 RMSE 2988.905
## R-Squared 0.811 Coef. Var 20.787
## Adj. R-Squared 0.809 MSE 8933552.307
## Pred R-Squared 0.803 MAE 2187.463
## -------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## --------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------------------
## Regression 25982657582.542 10 2598265758.254 290.844 0.0000
## Residual 6039081359.452 676 8933552.307
## Total 32021738941.994 686
## --------------------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------------------------------
## (Intercept) -1847092.834 58600.855 -31.520 0.000 -1962154.409 -1732031.259
## year 927.633 29.315 0.905 31.644 0.000 870.074 985.192
## volume 959.022 332.761 0.092 2.882 0.004 305.653 1612.392
## gearÖn -2851.847 556.889 -0.163 -5.121 0.000 -3945.287 -1758.407
## gearTam -279.860 470.104 -0.013 -0.595 0.552 -1202.900 643.180
## fuel_typeDizel 2388.010 333.732 0.137 7.155 0.000 1732.734 3043.286
## marqueHyundai -374.539 626.858 -0.023 -0.597 0.550 -1605.361 856.283
## marqueMercedes 1540.383 363.855 0.110 4.234 0.000 825.962 2254.804
## engine_power 18.957 4.427 0.148 4.282 0.000 10.264 27.650
## transmissionMexanika 1542.256 370.757 0.077 4.160 0.000 814.283 2270.229
## mileage -0.006 0.002 -0.097 -4.072 0.000 -0.009 -0.003
## --------------------------------------------------------------------------------------------------------------------
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------------
## 1 year addition 0.681 0.681 463.2600 13301.3441 3861.8349
## 2 volume addition 0.746 0.745 230.8370 13146.1597 3446.8949
## 3 gear addition 0.778 0.776 120.3760 13059.7303 3232.0713
## 4 fuel_type addition 0.794 0.792 64.4460 13010.0477 3115.0406
## 5 marque addition 0.799 0.797 47.5090 12996.2738 3079.5296
## 6 engine_power addition 0.803 0.801 34.0600 12983.4244 3048.6729
## 7 transmission addition 0.807 0.804 23.2220 12972.8358 3023.0987
## 8 mileage addition 0.811 0.809 8.6040 12958.1906 2988.9049
## -----------------------------------------------------------------------------------------------
At the final step, Mallows' C(p) reaches its minimum of 8.60, the AIC reaches its minimum of 12,958.19 (down from 13,301.34 at the first step), and the R-squared reaches its maximum of 0.811 (81%).
H0: β1 = β2 = β3 = β4 = β5 = β6 = β7 = β8 = β9 = 0 versus H1: at least one βi ≠ 0, i = 1, 2, ..., 9
The square root of price is taken in order to fulfill the constant variance assumption.
step.p <- lm(sqrt(price)~year+engine_power+volume+mileage+
can_barter+gear+fuel_type+transmission+marque, data=turbo)
summary(step.p)
##
## Call:
## lm(formula = sqrt(price) ~ year + engine_power + volume + mileage +
## can_barter + gear + fuel_type + transmission + marque, data = turbo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.913 -5.935 -0.031 6.302 57.453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.995e+03 2.178e+02 -36.710 < 2e-16 ***
## year 4.042e+00 1.089e-01 37.098 < 2e-16 ***
## engine_power 6.803e-02 1.643e-02 4.140 3.90e-05 ***
## volume 5.240e+00 1.236e+00 4.239 2.56e-05 ***
## mileage -2.092e-05 5.604e-06 -3.733 0.000205 ***
## can_barter -1.537e+00 1.114e+00 -1.380 0.168193
## gearÖn -1.146e+01 2.069e+00 -5.539 4.37e-08 ***
## gearTam -1.738e+00 1.750e+00 -0.993 0.321196
## fuel_typeDizel 9.959e+00 1.240e+00 8.031 4.32e-15 ***
## transmissionMexanika 2.082e+00 1.376e+00 1.513 0.130658
## marqueHyundai -2.423e+00 2.337e+00 -1.037 0.300250
## marqueMercedes 7.292e+00 1.354e+00 5.386 9.97e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.09 on 675 degrees of freedom
## Multiple R-squared: 0.8521, Adjusted R-squared: 0.8496
## F-statistic: 353.4 on 11 and 675 DF, p-value: < 2.2e-16
In the stepwise regression model, can_barter, gearTam, transmissionMexanika, and marqueHyundai are statistically insignificant. We drop the can_barter and marque variables and construct a new model.
step.p <- lm(sqrt(price)~year+engine_power+volume+mileage+
gear+fuel_type+transmission, data=turbo)
summary(step.p)
##
## Call:
## lm(formula = sqrt(price) ~ year + engine_power + volume + mileage +
## gear + fuel_type + transmission, data = turbo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.158 -6.382 -0.043 6.475 56.109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.734e+03 2.132e+02 -36.279 < 2e-16 ***
## year 3.914e+00 1.065e-01 36.755 < 2e-16 ***
## engine_power 5.015e-02 1.622e-02 3.092 0.002073 **
## volume 6.033e+00 1.268e+00 4.759 2.38e-06 ***
## mileage -2.106e-05 5.763e-06 -3.655 0.000277 ***
## gearÖn -1.728e+01 1.600e+00 -10.794 < 2e-16 ***
## gearTam -6.047e+00 1.581e+00 -3.824 0.000143 ***
## fuel_typeDizel 1.115e+01 1.266e+00 8.812 < 2e-16 ***
## transmissionMexanika 1.393e+00 1.414e+00 0.985 0.324765
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.45 on 678 degrees of freedom
## Multiple R-squared: 0.8416, Adjusted R-squared: 0.8397
## F-statistic: 450.1 on 8 and 678 DF, p-value: < 2.2e-16
In this model, all predictors are significant except transmission. On average, a car with mechanical transmission costs about a thousand dollars less than an automatic of the same model (Manual vs. Automatic Car Transmission: Pros & Cons), so we keep this factor despite its insignificance. According to Jim Frost (Statistics By Jim), if theory suggests that an independent variable is important, it can be kept in the regression equation even when its p-value is not significant. Among the remaining factors, engine power contributes the least to the model. The predictors explain about 84% of the variation in the dependent variable (R-squared = 0.8416), which is high.
Assumption 1: Linearity of the data
The residuals vs fitted plot shows no clear pattern, and the red line is approximately horizontal around zero. Thus, the first assumption holds.
plot(step.p,1)
Assumption 2: Homogeneity of variance
The square root of price was taken to fulfill the constant variance assumption. The residuals are spread roughly equally across the range of fitted values, so homogeneity of variance holds.
plot(step.p,3)
Assumption 3: Normality of Residuals
All the points fall approximately along the reference line, so the residuals can be considered approximately normally distributed.
plot(step.p,2)
Assumption 4: Outliers and High Leverage Points
The plot below highlights the three most extreme points (143, 405, and 656) by standardized residual.
par(mfrow=c(1,2))
plot(step.p,4)
plot(step.p,5)
All the points lie inside the Cook's distance lines, which means the data do not contain influential points that could distort the regression results.
The table below shows the three observations with the highest Cook's distance:
model.diag.metrics <- augment(step.p)
model.diag.metrics %>%
top_n(3, wt = .cooksd)
## # A tibble: 3 x 16
## .rownames sqrt.price. year engine_power volume mileage gear fuel_type
## <chr> <dbl> <int> <int> <dbl> <int> <fct> <fct>
## 1 143 116. 1985 143 2.2 370000 Arxa Dizel
## 2 405 176. 2000 170 4.3 230000 Arxa Dizel
## 3 656 124. 1987 320 5 100000 Arxa Benzin
## # ... with 8 more variables: transmission <fct>, .fitted <dbl>,
## # .se.fit <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## # .std.resid <dbl>
The chart of Cook's distance below highlights observations that strongly influence the fitted values of the model.
ols_plot_cooksd_bar(step.p)
Graph for detecting influential observations.
ols_plot_resid_lev(step.p)
Assumption 5: Independence of Errors
Test for autocorrelated errors:
H0: The residual errors are independent
H1: The residual errors are not independent
durbinWatsonTest(step.p)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.05496443 2.1077 0.18
## Alternative hypothesis: rho != 0
Since the p-value is 0.18, we cannot reject H0. The fifth assumption holds: the residuals are independent.
Unlike stepwise regression, best subsets regression fits every possible model built from the specified independent variables: with 5 predictors it fits 2^5 = 32 models, and with 7 it fits 2^7 = 128. The method then picks the best model according to the adjusted R-squared or Mallows' C(p) criterion (J. Frost, 2018).
best.fit <-
regsubsets(price~year+engine_power+volume+mileage+
gear+fuel_type+transmission+can_barter,
data=turbo,
nbest = 1, # 1 best model for each number of predictors
nvmax =NULL , # NULL for no limit on number of variables
force.in = NULL, force.out = NULL,
method = "exhaustive")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 3 linear dependencies found
## Reordering variables and trying again:
best.fit
## Subset selection object
## Call: regsubsets.formula(price ~ year + engine_power + volume + mileage +
## gear + fuel_type + transmission + can_barter, data = turbo,
## nbest = 1, nvmax = NULL, force.in = NULL, force.out = NULL,
## method = "exhaustive")
## 12 Variables (and intercept)
## Forced in Forced out
## year FALSE FALSE
## engine_power FALSE FALSE
## volume FALSE FALSE
## mileage FALSE FALSE
## gearÖn FALSE FALSE
## gearTam FALSE FALSE
## fuel_typeDizel FALSE FALSE
## transmissionMexanika FALSE FALSE
## can_barter FALSE FALSE
## fuel_typeElektro FALSE FALSE
## fuel_typeHibrid FALSE FALSE
## fuel_typeQaz FALSE FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
reg.summary <- summary(best.fit)
as.data.frame(reg.summary$outmat)
## year engine_power volume mileage gearÖn gearTam fuel_typeDizel
## 1 ( 1 ) *
## 2 ( 1 ) * *
## 3 ( 1 ) * * *
## 4 ( 1 ) * * * *
## 5 ( 1 ) * * * * *
## 6 ( 1 ) * * * * *
## 7 ( 1 ) * * * * * *
## 8 ( 1 ) * * * * * * *
## 9 ( 1 ) * * * * * * *
## fuel_typeElektro fuel_typeHibrid fuel_typeQaz
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 )
## 7 ( 1 )
## 8 ( 1 )
## 9 ( 1 )
## transmissionMexanika can_barter
## 1 ( 1 )
## 2 ( 1 )
## 3 ( 1 )
## 4 ( 1 )
## 5 ( 1 )
## 6 ( 1 ) *
## 7 ( 1 ) *
## 8 ( 1 ) *
## 9 ( 1 ) * *
The table above indicates that year, engine power, volume, mileage, gear, and fuel type are the important factors for the model; the can_barter and transmission variables enter only the larger models.
The BIC is minimized for the model with 8 predictors, while Mallows' C(p) is minimized at 9. Using the BIC criterion, we would therefore choose the model with 8 variables.
which.min(reg.summary$bic)
## [1] 8
plot(reg.summary$bic, scale = "bic", main = "BIC",type="l")
points(8,reg.summary$bic[8],col="red",cex=2,pch=20)
which.min(reg.summary$cp)
## [1] 9
plot(reg.summary$cp, scale = "Cp", main = "Cp",type="l")
points(8,reg.summary$cp[8],col="red",cex=2,pch=20)
The adjusted R-squared is smallest for the one-variable model, which is marked with the red point in the plot below. Note that for model selection one would normally look for the maximum adjusted R-squared rather than the minimum; a sketch using which.max() follows the plot.
which.min(reg.summary$adjr2)
## [1] 1
plot(reg.summary$adjr2, scale = "adjr2", main = "Adjr2",type="l")
points(1,reg.summary$adjr2[1],col="red",cex=2,pch=20)
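A sketch of selecting by the maximum adjusted R-squared instead (which.min() above simply flags the one-variable model):

best.size <- which.max(reg.summary$adjr2)   # size of the best model by adjusted R^2
best.size
coef(best.fit, best.size)                   # its coefficients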
The best subset plot based on adjusted R-squared is shown below. According to this plot, the best predictors for the model are year, engine power, volume, mileage, gear, and the diesel (Dizel) and gas (Qaz) fuel types.
plot(best.fit, scale = "adjr2", main = "Adjusted R^2")
"All models are wrong, some models are useful"
George Box (1976)
In this study, we build several models to find the best fit for the turbo data in estimating the dependent variable, price. The first model is constructed without the categorical variables:
Model 1: price = β0 + β1·year + β2·engine_power + β3·volume + β4·mileage + ε
The AIC, BIC, and adjusted R-squared criteria are presented below:
AIC | BIC | Adj R |
---|---|---|
5411.55 | 5438.67 | 0.78 |
To keep the model simple, we explain the price variable with only a few independent variables. The model accounts for about 78% of the variation in the outcome, while its AIC and BIC are relatively high at 5411.55 and 5438.67, respectively.
The second model was constructed with the stepwise regression method and is presented below:
Model 2: price = β0 + β1·year + β2·engine_power + β3·volume + β4·mileage + β5·can_barter + β6·gear + β7·fuel_type + β8·transmission + β9·marque + ε
Mallows' C(p) is another criterion for model selection, with smaller values indicating a better model; for this model C(p) is about 8.6. The AIC is 5214.53 and the BIC is 5260.22. Furthermore, the adjusted R-squared is 0.80, higher than that of Model 1.
AIC | BIC | Adj R |
---|---|---|
5214.53 | 5260.22 | 0.80 |
Both models satisfy the linear regression assumptions. We select the second model based on the criteria above, since it has the higher R-squared, and its AIC and BIC are lower than those of the first model. Although transmission is insignificant in the second model, we do not exclude it, because the literature suggests that transmission, gear, and fuel type affect price. We therefore prefer the second model: it includes the richer, more relevant set of predictors and estimates the outcome more precisely.
[1] Engine size explained (2018, 9 May). Retrieved from: https://www.carbuyer.co.uk/tips-and-advice/146778/engine-size-explained
[2] Fitting & Interpreting Linear Models in R (2013, 18 May). Retrieved from: http://blog.yhat.com/posts/r-lm-summary.html
[3] How to Change Car Gears. Driving Test Success. Retrieved from: https://www.driving-test-success.com/gears/gearinfo.htm
[4] Linear Regression Assumptions and Diagnostics in R: Essentials (2018, 11 March). Retrieved from: http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/#building-a-regression-model
[5] Stepwise Regression. NCSS Statistical Software. Retrieved from: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Stepwise_Regression.pdf
[6] Enough Is Enough! Handling Multicollinearity in Regression Analysis (2013). Retrieved from: http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis
[7] Manual vs. Automatic Car Transmission: Pros & Cons. Retrieved from: https://www.budgetdirect.com.au/blog/manual-vs-automatic-car-transmission-pros-cons.html
[8] Measures of Influence. Retrieved from: https://cran.r-project.org/web/packages/olsrr/vignettes/influence_measures.html
[9] Model Specification: Choosing the Correct Regression Model (J. Frost). Retrieved from: http://statisticsbyjim.com/regression/model-specification-variable-selection/