Abstract

This study analyzes which factors affect the price of cars. The data were gathered from the website turbo.az, and the analysis was carried out in RStudio. The data contain information about the car owners and the properties of the cars; a brief description of the dataset is given in the introduction. In the exploratory data analysis, missing values and outliers are excluded from the data. First, the engine volume column is separated from its unit (L) and converted from factor to double format. The exploratory analysis is then carried out on the newly generated dataset, which consists of 713 observations and 11 variables. We focus on the prices, in AZN, of BMW, Hyundai, and Mercedes cars from Baku. For this filtered dataset, univariate and bivariate analyses are carried out using scatterplots, boxplots, and histograms. Each quantitative variable is tested for normality with the Shapiro-Wilk test. There are four categorical variables, and for each of them we create dummy variables. The next part of the analysis is devoted to model construction. First, a multiple linear regression model was fitted without any transformation; for this initial model, all the assumptions fail. In the next step, we apply a square root transformation to the price, after which all the assumptions are fulfilled. To choose the best model, stepwise regression and the best subset method were applied, comparing candidate models by AIC, R squared, adjusted R squared, and Mallows' C(p) (lower AIC and C(p), and higher R squared values, indicate a better model). At the end of the analysis, the significant predictors that best explain the price are the year, gear, mileage, engine power, engine volume, and transmission.
After applying the transformation to the stepwise regression model, the model fulfilled all assumptions. From this research, we may conclude that production year, mileage, engine power, transmission, and fuel type affect the price of BMW, Hyundai, and Mercedes cars from Baku.

INTRODUCTION

The data were collected from the website turbo.az, a popular public platform for buying and selling cars in Azerbaijan. The data contain information about the car owner as well as the car itself, including mobile phone numbers, production year, engine power, mileage, etc. We analyzed 19819 observations and 25 variables in RStudio. Of the 25 variables, 14 are qualitative (stored as factors) and 11 are stored as integers. The structure of the data is displayed below.

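library(dplyr)   # glimpse(), filter(), select(), count(), %>%
library(tidyr)   # separate()

setwd("C:/Users/FIDAN/Desktop/R projects/Turbo_Price_Analysis")
# stringsAsFactors = TRUE reproduces the <fct> columns shown below on R >= 4.0
turbo <- read.csv("turbo.csv", header = TRUE, stringsAsFactors = TRUE)
turbo <- na.omit(turbo)

glimpse(turbo)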
## Observations: 19,819
## Variables: 25
## $ id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ city          <fct> Baku, Imishli, Baku, Baku, Baku, Baku, Siyazan, ...
## $ marque        <fct> BMW, Mercedes, Mercedes, Mercedes, Nissan, Kia, ...
## $ model         <fct> 325, C 200, E 240, C 220, X-Trail, Optima, C 240...
## $ year          <int> 2001, 1998, 1999, 2001, 2014, 2013, 2001, 2001, ...
## $ category      <fct> Sedan, Sedan, Sedan, Sedan, Offroader / SUV, Sed...
## $ color         <fct> Göy, Gümüşü, Gümüşü, Qızılı, Ağ, Qara, Gümüşü, Q...
## $ engine_volume <fct> 2.5 L, 2.0 L, 2.4 L, 2.2 L, 2.5 L, 2.4 L, 2.6 L,...
## $ engine_power  <int> 192, 136, 170, 170, 183, 180, 177, 125, 75, 233,...
## $ fuel_type     <fct> Benzin, Benzin, Benzin, Dizel, Benzin, Benzin, B...
## $ mileage       <int> 350000, 216000, 282000, 248253, 96500, 83000, 28...
## $ transmission  <fct> Avtomat, Avtomat, Avtomat, Avtomat, Avtomat, Avt...
## $ gear          <fct> Arxa, Arxa, Arxa, Arxa, Tam, Ön, Arxa, Arxa, Arx...
## $ is_new        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ price         <int> 10500, 12000, 12500, 12600, 34500, 21500, 11800,...
## $ currency      <fct> AZN, AZN, AZN, AZN, AZN, AZN, AZN, AZN, AZN, $, ...
## $ extras        <fct> Yüngül lehimli disklər ABS Lyuk Mərkəzi qapanma ...
## $ viewed        <int> 747, 24, 8, 6, 8, 26, 22, 34, 48, 584, 50, 94, 1...
## $ date          <fct> 03-Sep-18, 03-Sep-18, 03-Sep-18, 03-Sep-18, 03-S...
## $ website_id    <int> 2614885, 2618179, 2618178, 2618177, 2617909, 261...
## $ number        <fct> 050 300-14-96, 070 737-37-67, 051 770-07-02, 055...
## $ owner         <fct> Kamran, Kenan, Necef, Emil, Latif, Nurlan, Zaur,...
## $ is_salon      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ can_barter    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, ...
## $ is_credit     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...

Price

Car prices are unstable and are affected by various factors. The main factors are the economic state and the currency of the country. Moreover, the physical characteristics of the vehicle, such as the model, year of production, marque, engine volume, and mileage, may have a significant impact on its value. In the turbo dataset, car prices are given in different currencies: AZN, USD, and euro. As the website is based in Azerbaijan, most cars are priced in AZN. However, since the USD is a strong international currency, many cars are priced in it as well. The table below shows the number of cars priced in each currency.

Currency    $       €       AZN
Cars        3641    52      16126

Marque

The data show that Mercedes, LADA (VAZ), Hyundai, and BMW are the four most frequently advertised marques on the website; this study focuses on Mercedes, Hyundai, and BMW. The number of advertisements for each marque is given in the following table.

summary(turbo$marque)
##         Mercedes         LADA VAZ          Hyundai              BMW 
##             4657             3273             1380             1087 
##             Opel           Toyota              Kia       Volkswagen 
##             1014              988              700              639 
##           Nissan             Ford              GAZ       Mitsubishi 
##              611              568              479              460 
##       Land Rover        Chevrolet            Lexus           Daewoo 
##              438              331              306              291 
##             Audi          Renault            Tofas          Porsche 
##              202              173              145              144 
##             Jeep            KamAz            Mazda            Honda 
##              126              109              107               97 
##         Infiniti      Iran Khodro             Fiat          Changan 
##               95               87               81               60 
##           Subaru            Saipa              ZIL            Chery 
##               57               50               49               43 
##       Great Wall          Peugeot            Dodge            Lifan 
##               42               40               38               38 
##           Suzuki             SEAT          Bentley         Chrysler 
##               37               36               30               29 
##              UAZ         Cadillac             HOWO            Volvo 
##               29               24               24               24 
##              MAZ         Moskvich            Ravon           Hummer 
##               23               23               23               22 
##              BYD              MAN         Maserati          Shacman 
##               21               21               21               21 
##            Geely            Skoda               IJ            Iveco 
##               20               20               19               19 
##             Baic            Dacia           Yamaha          Citroen 
##               16               16               16               15 
##            Isuzu             Mini           Zontes              DAF 
##               15               15               15               14 
##              GAC            Foton              ZAZ       Ssang Yong 
##               14               13               13               12 
##               MG           Jaguar              GMC          Muravey 
##               11               10                9                6 
##      Rolls-Royce       Alfa Romeo            Dnepr         DongFeng 
##                6                5                5                5 
## Mercedes-Maybach             Ural            Buick            Dayun 
##                5                5                4                4 
##           Haojue  Harley-Davidson            Haval              JAC 
##                4                4                4                4 
##              PAZ            Smart       BMW Alpina           Ducati 
##                4                4                3                3 
##           Ikarus           Jonway        MV Agusta            Vespa 
##                3                3                3                3 
##     Aston Martin           Can-Am         Daihatsu              FAW 
##                2                2                2                2 
##              JMC         Kawasaki            Temsa          (Other) 
##                2                2                2               23

Engine Volume

Engine volume, also called engine size or engine capacity, indicates how large a space the engine's pistons operate in. According to Haining (C. Haining, 2018), a bigger number in liters indicates that on each cycle the engine's pistons are able to push more air and fuel. In general, the bigger the engine volume, the more expensive the car. Engine size is usually expressed in liters (L) or cubic centimeters (cc). In our data, it is expressed in liters (L), as shown below.

head(turbo$engine_volume)
## [1] 2.5 L 2.0 L 2.4 L 2.2 L 2.5 L 2.4 L
## 76 Levels: 0.1 L 0.2 L 0.3 L 0.4 L 0.5 L 0.6 L 0.7 L 0.8 L 0.9 L ... 9.5 L
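
Because engine_volume is stored as a factor with the unit attached, it cannot be used numerically as it is. As a minimal base R sketch, the snippet below strips the unit from a small made-up factor; the vector is illustrative and is not taken from the dataset. It also shows a common pitfall: calling as.numeric() directly on a factor returns the internal level codes rather than the values.

```r
# Toy factor mimicking engine_volume (illustrative values, not the real data)
engine_volume <- factor(c("2.5 L", "2.0 L", "2.4 L", "2.2 L"))

# Pitfall: as.numeric() on a factor gives the level codes, not the volumes
codes <- as.numeric(engine_volume)

# Convert to character, drop the trailing " L", then convert to double
volume <- as.numeric(sub(" L$", "", as.character(engine_volume)))

print(codes)
print(volume)
```

The report itself uses tidyr::separate() for this step; the base R version is shown only to make the conversion explicit.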

Color

Below is a pie chart of car colors in the advertisements on the website from which the data were retrieved. In the collected data, there are 5717 white cars (28.8%) and 4950 black cars (25%). These numbers suggest a preference for white and black among the cars advertised in Azerbaijan.

library(plotly)

# Pie chart of color frequencies; the axes are hidden because a pie has none
plot_ly(turbo, labels = ~color, type = 'pie') %>%
  layout(title = 'The Color of Cars',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

Engine Power

Engine power is the maximum power that an engine can put out. It can be expressed in kilowatts or horsepower. The power output depends on the size and design of the engine, but also on the engine speed. In our data, engine power lies in the interval [5, 999] with an average of 166.22. The listed values (for example, 192 for a BMW 325) are consistent with horsepower rather than kilowatts.

glimpse(turbo$engine_power)
##  int [1:19819] 192 136 170 170 183 180 177 125 75 233 ...

Mileage

Mileage is one of the most important properties to consider when buying a used car. According to the literature, it has a great effect on a car's price: the higher the mileage, the lower the price. That is not always the case, however; a two-year-old car with 100,000 miles may be considered a worse buy than a 10-year-old car with 50,000 miles on it.

glimpse(turbo$mileage)
##  int [1:19819] 350000 216000 282000 248253 96500 83000 281000 126000 262000 134500 ...

City

The geographic location of a car is one of the factors that may affect its price. For example, a used car with 10,000 miles that has spent its life in Ganja may be valued differently from a used car with 10,000 miles from Baku.

turbo %>% 
  select(city) %>% 
  unique() %>% 
  glimpse()
## Observations: 64
## Variables: 1
## $ city <fct> Baku, Imishli, Siyazan, Ganja, Shirvan, Mingachevir, Xird...

Gear

Gears allow a car to be driven with the minimum strain on the engine (Driving Test Success, 2017). Modern cars typically have five forward gears and one reverse gear, and some have a sixth forward gear, which provides greater fuel economy at higher speeds over longer distances. In the dataset, however, the gear variable records which wheels are driven: Arxa (rear), Ön (front), or Tam (full, all wheels). Most of the cars have rear (Arxa) drive.

summary(turbo$gear)
## Arxa   Ön  Tam 
## 8614 6375 4830

EXPLORATORY DATA ANALYSIS (EDA)

RESEARCH QUESTION

The aim of the research is to determine which factors have the most impact on the price of BMW, Hyundai, and Mercedes cars from Baku that are priced in AZN. Throughout the study, the questions below will be answered.

  1. Is there a linear relationship between price and the other variables?

  2. Which factors have an impact on the price of BMW, Mercedes, and Hyundai cars from Baku that are priced in AZN?

  3. Does a multicollinearity problem exist among the independent variables?
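
One common way to approach the third question is the variance inflation factor (VIF), which for predictor j equals 1 / (1 - R²_j), where R²_j comes from regressing that predictor on the remaining ones. The sketch below computes it by hand in base R on simulated predictors. The variables x1, x2, x3 and the helper vif() are hypothetical and chosen for illustration; this is not necessarily the diagnostic used later in the report.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # constructed to be strongly related to x1
x3 <- rnorm(n)                        # unrelated to the other two

# VIF of `target`: regress it on the other predictors and use 1 / (1 - R^2)
vif <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif(x1, data.frame(x2, x3))
vif_x3 <- vif(x3, data.frame(x1, x2))
print(c(x1 = vif_x1, x3 = vif_x3))
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, although the threshold is a convention rather than a formal test.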

Data Cleaning

  • We first use the function tidyr::separate() to separate the numeric value of the engine volume from its unit of capacity (L, liter).

  • In total, there are 21 missing values in the data. We remove them with the na.omit() function.

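turbo <- turbo %>%
  # %in% tests set membership. Comparing with == against a vector recycles the
  # vector row by row and silently drops matching rows, so row counts obtained
  # with == (as in the output below) can differ from those obtained with %in%.
  filter(city == "Baku", currency == "AZN",
         fuel_type %in% c("Benzin", "Dizel"),
         marque %in% c("BMW", "Hyundai", "Mercedes")) %>%
  select(city, year, gear, engine_volume, engine_power, fuel_type, mileage,
         transmission, price, marque, can_barter) %>%
  # split "2.5 L" at the unit and keep only the number, converted to double
  separate(col = "engine_volume", into = "volume", sep = " L",
           remove = TRUE, convert = TRUE, extra = "drop") %>%
  glimpse()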
## Observations: 713
## Variables: 11
## $ city         <fct> Baku, Baku, Baku, Baku, Baku, Baku, Baku, Baku, B...
## $ year         <int> 2001, 1999, 2007, 1999, 1999, 2009, 2004, 1995, 2...
## $ gear         <fct> Arxa, Arxa, Tam, Arxa, Arxa, Ön, Arxa, Arxa, Arxa...
## $ volume       <dbl> 2.5, 2.4, 2.2, 2.2, 2.8, 2.4, 2.0, 2.0, 2.6, 3.0,...
## $ engine_power <int> 192, 170, 150, 143, 193, 178, 183, 136, 193, 258,...
## $ fuel_type    <fct> Benzin, Benzin, Dizel, Dizel, Benzin, Benzin, Ben...
## $ mileage      <int> 350000, 282000, 273000, 390000, 190000, 110100, 2...
## $ transmission <fct> Avtomat, Avtomat, Avtomat, Avtomat, Avtomat, Avto...
## $ price        <int> 10500, 12500, 22700, 12300, 12800, 16500, 13800, ...
## $ marque       <fct> BMW, Mercedes, Hyundai, Mercedes, BMW, Hyundai, B...
## $ can_barter   <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0...

  • As mentioned in the introduction, the dataset consists of 25 variables and 19819 observations. Most of the variables are excluded from the modeling, since information such as the car owner, date, extras, and mobile phone number is not useful for model building. In this study, we focus on the variables that have a direct effect on the price of a car, such as year, engine size, and mileage. According to the literature (J. D'Allegro), these factors have a great impact on a car's price.

As a result, the new dataset has 11 variables and 713 observations. Below we present the variables that are selected for the analysis.

colnames(turbo)
##  [1] "city"         "year"         "gear"         "volume"      
##  [5] "engine_power" "fuel_type"    "mileage"      "transmission"
##  [9] "price"        "marque"       "can_barter"
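
The filter in the cleaning step selects rows whose fuel type and marque belong to a set of values. As a small sketch of why set membership calls for %in% rather than ==, the base R example below applies both operators to a made-up vector of marques (the vector is illustrative, not taken from the data).

```r
# Toy vector of marques (illustrative)
marque <- c("BMW", "Hyundai", "Mercedes", "Mercedes", "BMW", "Kia")
wanted <- c("BMW", "Hyundai", "Mercedes")

# == recycles `wanted` along `marque`, so element i is compared with only
# one of the three names, depending on its position
by_equal <- marque == wanted

# %in% asks, for each element, whether it occurs anywhere in `wanted`
by_in <- marque %in% wanted

print(by_equal)
print(by_in)
```

Here the fourth and fifth entries are wanted marques, yet == rejects them because they happen to be lined up against a different name. With real data, the rows kept by == therefore depend on row order.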

The analysis conducted throughout this research is based on this new dataset.

  • The variables transmission, fuel type, marque, and gear are categorical. To use them in the model, we encode them as dummy variables, as below.
turbo <- dummy_cols(.data = turbo,select_columns =
                      c("transmission","fuel_type","gear"))
Transmission Fuel Type Gear Marque
Automatic Mechanics Gasoline Diesel Arxa On Tam Hyundai Mercedes BMW
1 0 1 0 1 0 0 1 0 0
0 1 0 1 0 0
0 0 1 1 0 0
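
To make the coding scheme concrete, the sketch below builds dummy variables with base R's model.matrix() on a four-row made-up data frame. It is an illustration under assumed toy values, not the fastDummies call used in the report. It contrasts full one-hot coding (one column per level) with treatment coding (one level serves as the reference), which is what lm() applies to factors internally.

```r
# Toy data frame (illustrative rows, level names as they appear in the data)
cars <- data.frame(
  fuel_type = factor(c("Benzin", "Dizel", "Benzin", "Benzin")),
  marque    = factor(c("BMW", "Mercedes", "Hyundai", "BMW"))
)

# Full one-hot coding: one 0/1 column per level ("- 1" removes the intercept)
one_hot <- model.matrix(~ marque - 1, data = cars)

# Treatment coding: the first level of each factor is the reference and is
# absorbed into the intercept, which avoids perfectly collinear columns
treatment <- model.matrix(~ fuel_type + marque, data = cars)

print(colnames(one_hot))
print(colnames(treatment))
```

When every level of a factor gets its own dummy column and the model also has an intercept, the columns sum to the intercept column. This is why one level per factor is usually dropped before fitting.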

Next, we group the data according to transmission, fuel type, gear, and year. Most of the cars have an automatic transmission. Moreover, the gasoline cars are concentrated in the 1996-1998 production years.

turbo %>% 
  count(transmission,fuel_type,gear,year)
## # A tibble: 142 x 5
##    transmission fuel_type gear   year     n
##    <fct>        <fct>     <fct> <int> <int>
##  1 Avtomat      Benzin    Arxa   1985     1
##  2 Avtomat      Benzin    Arxa   1986     1
##  3 Avtomat      Benzin    Arxa   1987     2
##  4 Avtomat      Benzin    Arxa   1989     2
##  5 Avtomat      Benzin    Arxa   1990     9
##  6 Avtomat      Benzin    Arxa   1991    10
##  7 Avtomat      Benzin    Arxa   1992    13
##  8 Avtomat      Benzin    Arxa   1993     7
##  9 Avtomat      Benzin    Arxa   1994    12
## 10 Avtomat      Benzin    Arxa   1995    16
## # ... with 132 more rows

Analysis and Visualization of Categorical Variables

In this part, we analyze the data with two categorical variables at a time. For visualization, we use the facet_wrap() function. For the first analysis, we consider fuel type and transmission. The fuel type is divided into two groups, and within each group the transmission type is either automatic or mechanical. Each combination is plotted with volume on the x-axis and engine power on the y-axis. The gasoline/automatic and gasoline/mechanical panels are the most informative about the two variables. In each panel, engine power is positively associated with volume.

ggplot(data = turbo) + 
  geom_point(mapping = aes(x = volume, y = engine_power),color="darkgreen")+
  facet_wrap( fuel_type ~ transmission)

In the second plot, we present the same scatterplot grouped by transmission and gear, using facet_grid(). Transmission and gear are split into two and three groups, respectively. In each panel, there is a positive association between volume and engine power. Most of the information is contained in the automatic/front, mechanical/front, and automatic/full panels.

ggplot(data = turbo) + 
  geom_point(mapping = aes(x = volume, y = engine_power),color="maroon4") + 
  facet_grid( gear~ transmission)

BOXPLOT, HISTOGRAM, SHAPIRO-WILK TEST AND NORMAL QQ PLOT FOR UNIVARIATE VARIABLES

Price

We see from the histogram that the price variable is right-skewed; that is, most of the observations are concentrated on the left-hand side. The shape of the distribution can also be seen in the boxplot.

par(mfrow=c(1,2))

hist(turbo$price, xlab="Price",breaks = 15, col = rgb(0,1,0,0.5), 
     main = "The Price of Cars")

p=boxplot(turbo$price,xlab="Price",col = "mediumseagreen",notch = T,
        main = "The Price of Cars") 

An alternative visualization of price is a violin plot, which shows the distribution of the data together with its estimated probability density. The advantage of the plot is that it helps to detect whether the distribution is bimodal or multimodal. Here, the distribution of price is unimodal.

turbo%>%
  plot_ly(
    x = ~price,
    type = 'violin',
    box = list(
      visible = T 
    ),
    meanline = list(
      visible = T, color="red"
    )
  ) %>% 
  layout(
    xaxis = list(
      title = "Price"
    )
  )

There are 16 outliers in the price variable; the most extreme is 138850 AZN.

p$out
##  [1]  35800 138850  49700  87000  35400  50000  35800  41200 135000  39000
## [11]  55000  57000  72000  37400  42000 121600
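
The values stored in the $out component are the points that boxplot() draws beyond the whiskers. By default these are observations lying more than 1.5 times the hinge spread away from the lower or upper hinge. As a sketch on a made-up vector (not the price data), the same rule can be reproduced with boxplot.stats() and checked by hand.

```r
# Toy sample with one clearly extreme value (illustrative)
x <- c(5, 7, 8, 9, 10, 11, 12, 13, 15, 40)

# boxplot.stats() is what boxplot() uses internally; coef = 1.5 is the default
stats <- boxplot.stats(x, coef = 1.5)

# The same rule by hand: hinges are elements 2 and 4 of Tukey's five numbers
hinges <- fivenum(x)[c(2, 4)]
fence  <- 1.5 * diff(hinges)
manual <- x[x < hinges[1] - fence | x > hinges[2] + fence]

print(stats$out)
print(manual)
```

Note that the hinges returned by fivenum() can differ slightly from the quartiles returned by quantile(), so an IQR-based filter may not flag exactly the same points.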

The minimum price is 2800 AZN and the maximum is 138850 AZN. The 25th percentile is 9500 AZN, the median is 13500 AZN, and the 75th percentile is 19800 AZN.

summary(turbo$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2800    9500   13500   15785   19800  138850

Shapiro Wilk Test for Price

We apply the Shapiro Wilk Test to test whether the price follows a normal distribution. We construct the test by formulating the Hypothesis as follows.

H0: Price is normally distributed vs H1: Price is not normally distributed

shapiro.test(turbo$price) 
## 
##  Shapiro-Wilk normality test
## 
## data:  turbo$price
## W = 0.65331, p-value < 2.2e-16

If p< alpha=0.05 we reject H0. Since our p-value is less than 5%, we reject the null Hypothesis. Thus, the price is not normally distributed.
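
As a minimal sketch of this decision rule, the snippet below runs shapiro.test() on simulated right-skewed data. The sample, its parameters, and the variable names are assumptions made for illustration and do not reproduce the turbo prices. It also compares the W statistic before and after a square root transformation, the transformation that the abstract reports applying to price.

```r
set.seed(42)

# Simulated right-skewed, price-like sample (log-normal; parameters are arbitrary)
skewed <- rlnorm(500, meanlog = 9.5, sdlog = 0.6)

alpha  <- 0.05
result <- shapiro.test(skewed)

# Decision rule: reject H0 (normality) when the p-value is below alpha
reject_h0 <- result$p.value < alpha

# W closer to 1 means closer to normal; the square root reduces the skew
result_sqrt <- shapiro.test(sqrt(skewed))

print(reject_h0)
print(c(raw = unname(result$statistic), sqrt = unname(result_sqrt$statistic)))
```

With several hundred observations the test detects even modest departures from normality, so it is usually read together with the histogram and the QQ plot rather than on its own.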

Mileage

The minimum observation is 0 km, which indicates a car that has not yet been driven, and the maximum is 598820 km. The first quartile of mileage is 167300 km, the median (second quartile) is 245245 km, and the third quartile is 313926 km.

par(mfrow=c(1,2))

hist(turbo$mileage, xlab="Mileage",breaks = 15, col = "plum3",
     main = "The Mileage of Cars")


m=boxplot(turbo$mileage,xlab="Mileage",col = "plum2",notch = T,
        main = "The Mileage of Cars") 

summary(turbo$mileage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0  167300  245245  244690  313926  598820

We find 11 outliers in the mileage variable, as presented below. The largest is 598820 km, which is the maximum distance a car in the data was driven.

m$out
##  [1] 538000 598820 535000 567840 550000 587452 536500 547844 576000 561000
## [11] 544000

Shapiro Wilk Test for Mileage

We apply the Shapiro Wilk Test to test whether the mileage follows a normal distribution. We construct the test by formulating the Hypothesis as follows

H0: Mileage is normally distributed vs H1: Mileage is not normally distributed

shapiro.test(turbo$mileage)
## 
##  Shapiro-Wilk normality test
## 
## data:  turbo$mileage
## W = 0.98894, p-value = 3.27e-05

If p< α =0.05 we reject H0. Since our p-value is less than 5%, we reject the null Hypothesis. Thus, the mileage is not normally distributed.

Year

Production year indicates the year in which the car was manufactured. In the full data, the oldest car is a GAZ produced in 1953, with an engine size of 3.5 L, an engine power of 90, and a mileage of 85000 km. The year variable is one of the most important factors that influence the price. The histogram and the boxplot show that the distribution of year is almost symmetric.

par(mfrow=c(1,2))

hist(turbo$year, xlab="Year",breaks = 15, col = "palevioletred3",
     main = "The Production Year")

y = boxplot(turbo$year,xlab="Year",col = "palevioletred3",notch = T,
        main = "The Production Year") #print outliers

In the new dataset, the earliest production year for BMW, Mercedes, and Hyundai cars is 1982 and the latest is 2018. The 25th percentile is 1997, the median is 2001, and the 75th percentile is 2007.

summary(turbo$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1982    1997    2001    2002    2007    2018

Shapiro Wilk Test for Year

We apply the Shapiro Wilk Test to test whether the production year follows a normal distribution. We construct the test by formulating the hypotheses as follows.

H0: Year is normally distributed vs H1: Year is not normally distributed

shapiro.test(turbo$year)
## 
##  Shapiro-Wilk normality test
## 
## data:  turbo$year
## W = 0.98201, p-value = 1.113e-07

If p < α = 0.05, we reject H0. Since our p-value is less than 5%, we reject the null hypothesis. Thus, the production year is not normally distributed.

Volume

The volume variable is positively skewed, since most of the observations are concentrated on the left side. Outliers are observed in this variable, too.

par(mfrow=c(1,2))

hist(turbo$volume, xlab="Volume",breaks = 15, col = "salmon3",
     main = "The Engine Volume")

v<- boxplot(turbo$volume,xlab="Volume",col = "salmon3",notch = T,
        main = "The Engine Volume") #print outliers

The most common outlier value for volume is 4.4 L. Among the outliers, the largest is 12 L and the smallest is 3.6 L.

v$out
##  [1]  4.8  4.4  4.0  4.8  4.4  4.4  4.8  4.2  4.6  4.3  4.4 12.0  4.4  4.3
## [15]  4.8  4.4  4.8  4.3  4.4  4.4  6.4  4.4  4.8  3.6  4.3 12.0  4.8  4.2
## [29]  5.0  3.6  4.3  5.5  6.3  4.5  4.4
summary(turbo$volume)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   2.000   2.300   2.447   2.600  12.000

The minimum engine size is 1.3 L and the maximum is 12.0 L. The average engine size is 2.447 L. The data are right-skewed, as suggested by the mean being greater than the median of 2.3 L.
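
Comparing the mean with the median is a quick check for skewness. A slightly more direct measure is the moment-based sample skewness, the third central moment divided by the second central moment raised to the power 3/2. The sketch below defines a hypothetical helper skewness() in base R and applies it to two made-up vectors; neither is the volume data.

```r
# Moment-based sample skewness: m3 / m2^(3/2)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_skewed <- c(1.3, 2.0, 2.0, 2.3, 2.6, 12.0)  # one large value in the right tail
symmetric    <- c(1, 2, 3, 4, 5)

print(skewness(right_skewed))  # positive for a right-skewed sample
print(skewness(symmetric))     # zero for a perfectly symmetric sample
print(mean(right_skewed) > median(right_skewed))
```

A positive value indicates a longer right tail, a negative value a longer left tail.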

Shapiro Wilk Test for Engine Volume

We apply the Shapiro Wilk Test to test whether the engine volume follows a normal distribution. We construct the test by formulating the hypotheses as follows.

H0: Engine size is normally distributed vs H1: Not normally distributed

shapiro.test(turbo$volume)  
## 
##  Shapiro-Wilk normality test
## 
## data:  turbo$volume
## W = 0.66281, p-value < 2.2e-16

Since our p-value is less than 5%, we reject the null Hypothesis. Thus, the volume is not normally distributed.

Engine Power

From the data, we notice that the minimum engine power is 75, whereas the maximum is 480. The median is 166 and the mean is 172.9; the mean being somewhat larger than the median points to a mild right skew.

par(mfrow=c(1,2))

hist(turbo$engine_power, xlab="Engine Power",breaks = 15, col = "steelblue3",
     main = "The Engine Power")

boxplot(turbo$engine_power,xlab="Engine Power",col = "steelblue3",notch=T,
        main = "The Engine Power")$out #print outliers

##  [1] 360 320 306 280 367 300 286 286 360 420 292 286 320 282 306 360 286
## [18] 360 286 394 286 280 358 333 360 286 279 306 450 367 480 320 354 293
## [35] 333 286
summary(turbo$engine_power)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    75.0   136.0   166.0   172.9   192.0   480.0

The boxplot flags 36 outliers in engine power, all at the upper end of the distribution, ranging from 279 to 480.

Shapiro Wilk test for Engine Power

We apply the Shapiro Wilk Test to test whether the engine power of a car follows a normal distribution. We construct the test by formulating the Hypothesis as follows

H0:Engine Power is normally distributed vs H1:Not normally distributed

shapiro.test(turbo$engine_power)
## 
##  Shapiro-Wilk normality test
## 
## data:  turbo$engine_power
## W = 0.86375, p-value < 2.2e-16

If p < alpha = 0.05, we reject H0. Since our p-value is less than 5%, we reject the null hypothesis. Thus, the engine power is not normally distributed.

Normal QQ Plot

The plots below show normal QQ plots for all quantitative variables. The points for price and volume deviate clearly from the reference line, which indicates that these variables are not normally distributed. The engine power, mileage, and year plots (the 2nd, 4th, and 5th panels) mostly follow a straight line and deviate only at the ends.

par(mfrow=c(2,3))

qqnorm(turbo$price,xlab="Price", main = "Normal QQ Plot for Price")
qqline(turbo$price, col = "red")

qqnorm(turbo$engine_power, xlab="Engine Power",main = "Normal QQ Plot for engine_power")
qqline(turbo$engine_power, col = "red")

qqnorm(turbo$volume,xlab="Volume", main = "Normal QQ Plot for engine_volume")
qqline(turbo$volume, col = "red")


qqnorm(turbo$mileage, xlab="Mileage",main = "Normal QQ Plot for mileage")
qqline(turbo$mileage, col = "red")

qqnorm(turbo$year,xlab="Year", main = "Normal QQ Plot for Year")
qqline(turbo$year, col = "red")
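
To make explicit what a normal QQ plot compares, the sketch below recomputes the theoretical quantiles that qqnorm() places on the horizontal axis, using qnorm(ppoints(n)), on a simulated skewed sample (an assumption for illustration, not one of the turbo variables). The correlation between the sorted sample and these quantiles gives a rough numeric summary of how straight the plot is.

```r
set.seed(7)
x <- rexp(100)  # simulated right-skewed sample (illustrative)

# qqnorm() returns the plotted coordinates; plot.it = FALSE suppresses drawing
qq <- qqnorm(x, plot.it = FALSE)

# Theoretical normal quantiles at the plotting positions ppoints(n)
theoretical <- qnorm(ppoints(length(x)))
same <- all.equal(sort(qq$x), theoretical)

# Correlation of sorted data with theoretical quantiles: close to 1 means a
# nearly straight QQ plot
r <- cor(sort(x), theoretical)

print(same)
print(r)
```

For a skewed sample such as this one, the points bend away from the line in one tail, which lowers the correlation.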

Outlier

In this data, outliers are detected in the price, mileage, volume, engine power, and year variables. These outliers may influence the model assumptions, so they are candidates for removal. However, since deleting outliers can lead to a loss of information, we remove outliers only in the price variable, dropping prices that lie more than 1.3 times the interquartile range below the first quartile or above the third quartile:

quantiles <- quantile(turbo$price, probs = c(.25, .75))
range <- 1.3 * IQR(turbo$price)
turbo <- subset(turbo,
                 turbo$price > (quantiles[1] - range) &
                   turbo$price < (quantiles[2] + range))

The price variable is now free of outliers. The violin plot is presented below. The shape of price is approximately symmetric. The new mean is 14167 AZN and the median is 13000 AZN. The minimum price is 2800 AZN and the maximum is 31000 AZN.

turbo%>%
  plot_ly(
    x = ~price,
    type = 'violin',
    box = list(
      visible = T 
    ),
    meanline = list(
      visible = T, color="red"
    )
  ) %>% 
  layout(
    xaxis = list(
      title = "Price"
    )
  )

VISUAL ANALYSIS FOR THE BIVARIATE CASE

Simple Scatter Plot Among Variables

The scatterplot matrix displays the pairwise relationships between the variables. It shows a positive relationship between price and engine power, price and year, engine power and year, and volume and engine power. Moreover, there is a negative relationship between mileage and year, and between price and mileage.

pairs(~price+volume+engine_power+mileage+year,data=turbo, 
      main="Simple Scatterplot Matrix",col="midnightblue")

Price vs Engine Volume

The figure below displays a positive association between volume and price: cars with a larger engine volume tend to be more expensive, and cars with a smaller engine volume tend to be cheaper.

plot_ly(data = turbo, x = ~volume, y = ~price,
        marker = list(size = 8,
                       color = 'rgba(255, 182, 193, .9)',
                       line = list(color = 'rgba(152, 0, 0, .8)',
                                   width = 2))) %>%
  layout(title = 'The relation between engine volume and price',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

Price vs Engine Power

In the following, we present the scatterplot of price against engine power. Cars with greater engine power tend to be more expensive.

plot_ly(data = turbo, x = ~engine_power, y = ~price, color = ~gear)

Price vs Mileage

The scatterplot displays a negative relationship between mileage and price. High mileage usually indicates that the car is old, and it affects the price negatively.

plot_ly(data = turbo, x = ~mileage, y = ~price,
        marker = list(size = 8,
                       color = 'rgba(255, 182, 193, .9)',
                       line = list(color = 'steelblue',
                                   width = 2))) %>%
  layout(title = 'Price vs Mileage',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

Price vs Year

There is a positive relationship between price and year: newer cars tend to be more expensive, and older cars cheaper.

plot_ly(data = turbo, x = ~year, y = ~price,
        marker = list(size = 8,
                       color = 'rgba(255, 182, 193, .9)',
                       line = list(color = 'purple',
                                   width = 2))) %>%
  layout(title = 'Price vs Year',
         yaxis = list(zeroline = FALSE),
         xaxis = list(zeroline = FALSE))

Price vs Transmission

The boxplot of price is grouped by transmission type, mechanical and automatic. Among cars with a mechanical transmission, the minimum price is 3300 AZN and the maximum is 22000 AZN; the car priced at 22000 AZN is an outlier. The first quartile is 4500 AZN, the median is 6250 AZN, and the third quartile is 9700 AZN. The distribution is right-skewed, since most of the observations are concentrated on the left side. For cars with an automatic transmission, the minimum price is 2700 AZN and the maximum is 40000 AZN. There are three outliers, priced at 34500, 35000, and 40000 AZN. The first quartile, the median, and the third quartile are 9200, 13000, and 19250 AZN. As with the mechanical cars, the distribution is right-skewed.

 plot_ly(turbo, x = ~price, color = ~transmission, type = "box")

Price vs Categorical Independent Variables

Price vs Fuel Type

The price of cars is grouped by fuel type, diesel and gasoline. For diesel cars, the minimum price is 3650 AZN and the maximum is 35000 AZN. The median is 11150 AZN, and the first and third quartiles are 9500 and 23000 AZN. The distribution is right-skewed, since the median lies much closer to the first quartile than to the third. No outliers are detected. The prices of gasoline cars, however, do contain outliers. The smallest price for gasoline cars is 2700 AZN and the highest is 40000 AZN. The median is 12500 AZN, and the first and third quartiles are 8300 and 17750 AZN. The distribution of price for gasoline cars appears roughly symmetric.

plot_ly(turbo, x = ~price, color = ~fuel_type, type = "box")

Price vs Gear

The boxplot of price grouped by gear type is displayed. For “Tam” gear, the minimum price is 11000 AZN and the maximum is 35000 AZN; the first quartile is 11000 AZN, the median is 19950 AZN, and the third quartile is 25000 AZN. For “Ön” gear, the minimum and maximum prices are 8700 AZN and 15500 AZN. The first and third quartiles are 8925 and 11150 AZN, and the median is 9400 AZN. The maximum, 15500 AZN, is an outlier. Finally, the lowest and highest prices for “Arxa” gear are 2700 AZN and 40000 AZN. The median is 11000 AZN, the first quartile is 7800 AZN, and the third is 14825 AZN; outliers are present. The “Tam” and “Arxa” distributions are roughly symmetric, while the “Ön” distribution is right-skewed.

plot_ly(turbo, x = ~price, color = ~gear, type = "box")

Correlation between Price and Independent Variables

  • As seen from the scatterplot matrix, there is a weak positive correlation between price and volume, with a correlation coefficient of about 0.30.
cor(turbo$price,turbo$volume)
## [1] 0.3009634
  • There is a strong positive linear relationship between price and year, with a correlation of about 0.83.
cor(turbo$price,turbo$year)
## [1] 0.8252084
  • A moderate positive correlation between price and engine power is also observed, with a correlation of about 0.44.
cor(turbo$price,turbo$engine_power)
## [1] 0.4379799
  • As highlighted before, there is a negative relationship between price and mileage, with a correlation of about -0.50.
cor(turbo$price,turbo$mileage)
## [1] -0.4952912
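
The four pairwise correlations above can also be read off a single correlation matrix. The following is a minimal sketch that uses the built-in mtcars data as a stand-in, since the turbo data is not bundled here; mpg plays the role of price, and the column names are those of mtcars, not of turbo.

```r
# Stand-in data: mtcars ships with R; mpg plays the role of price.
dat <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Correlation of the response with every numeric predictor in one call:
# take the row of the correlation matrix that belongs to the response.
r <- cor(dat)["mpg", c("wt", "hp", "disp")]
round(r, 4)
```

For the turbo data, the same pattern would presumably select the columns price, year, engine_power, volume, and mileage and read the "price" row.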

MULTIPLE LINEAR REGRESSION

The First Model

The initial model was constructed without the categorical variables. The hypothesis of the model is presented.

H0: β1=β2=β3=β4= 0 versus H1: at least one βi is not equal to zero, i=1,2,3,4

fit <- lm(price~year+engine_power+volume+mileage,data=turbo)

summary(fit)
## 
## Call:
## lm(formula = price ~ year + engine_power + volume + mileage, 
##     data = turbo)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13163  -1963   -111   1711  15631 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.684e+06  5.338e+04 -31.546  < 2e-16 ***
## year          8.451e+02  2.658e+01  31.797  < 2e-16 ***
## engine_power  5.405e+00  4.631e+00   1.167    0.244    
## volume        2.298e+03  3.627e+02   6.336 4.27e-10 ***
## mileage       2.313e-03  1.579e-03   1.465    0.143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3443 on 682 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.746 
## F-statistic: 504.6 on 4 and 682 DF,  p-value: < 2.2e-16

The Residuals and the Residuals Standard Error (RSE)

The residuals are the differences between the observed prices and the values fitted by the model. The maximum residual is 15631 and the minimum is -13163. The RSE is the estimated standard deviation of the residuals; for our model it is 3443. For approximately normal residuals, the first and third quartiles (here -1963 and 1711) would lie at about ±0.67 times the RSE.
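
To make the definition of the RSE concrete, the sketch below recomputes it from the residuals of a small stand-in model. It assumes the built-in mtcars data rather than the turbo data, and fit_demo is a hypothetical name used only for this illustration.

```r
# Stand-in model on built-in data (not the turbo model).
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# RSE = sqrt(residual sum of squares / residual degrees of freedom).
res <- residuals(fit_demo)
rse <- sqrt(sum(res^2) / df.residual(fit_demo))

# The value agrees with the one printed by summary().
c(by_hand = rse, from_summary = summary(fit_demo)$sigma)
```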

The p-value of model

Here, the p-value of the F-test is less than 2.2e-16, far below 0.05. Thus, we reject H0: β1=β2=β3=β4= 0 and conclude that the model as a whole is significant.
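
As a check on the reported p-value, the sketch below recomputes the upper-tail probability of the F distribution from the statistic and degrees of freedom printed in the summary output. The variable names are chosen here for illustration only.

```r
# Values copied from the summary output of the first model.
f_stat   <- 504.6
df_model <- 4
df_resid <- 682

# Upper-tail probability of the F distribution.
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)
p_value < 2.2e-16

# Observations used in the fit: residual df + predictors + intercept.
n_obs <- df_resid + df_model + 1
n_obs
```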

Testing p-value for each variable

The summary of the model shows that the intercept, year, and volume are significant. Engine power and mileage do not contribute significantly to explaining price once the other predictors are in the model, since their p-values (0.244 and 0.143) are greater than 0.05.
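
Each p-value in the coefficient table comes from a t-test of the estimate against its standard error. As a small worked example, the sketch below reproduces the engine_power row from the numbers printed in the summary; the variable names are illustrative.

```r
# Estimate and standard error of engine_power, copied from the summary output.
est      <- 5.405
se       <- 4.631
df_resid <- 682

# t statistic and two-sided p-value from the t distribution.
t_value <- est / se
p_value <- 2 * pt(-abs(t_value), df = df_resid)
round(c(t = t_value, p = p_value), 3)
```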

Significance Stars

Three asterisks (***) mark a p-value below 0.001, two (**) a p-value below 0.01, and one (*) a p-value below 0.05. The asterisks show that the intercept, year, and volume are highly significant in the model.

Coefficients

The intercept, -1.684e+06, is the predicted price when all independent variables equal zero; it has no practical interpretation because a year of zero lies far outside the data. The coefficient of year is 845.1. This means that when year increases by 1 unit while the other variables are held constant, the price increases by 845.1 AZN on average. Similarly, if volume increases by 1 unit while the other variables are held constant, the price increases by 2298 AZN.

Intercept    Year       Engine Power  Volume     Mileage
-1.684e+06   8.451e+02  5.405e+00     2.298e+03  2.313e-03

Multiple R-squared and Adjusted R-squared

About 75% of the variation in price is explained by the independent variables in the model (multiple R-squared 0.7475, adjusted R-squared 0.746).

Regression Diagnostics

Assumption 1: Linearity of the data

The residuals vs fitted plot shows a nonlinear pattern, so the linearity assumption is violated. The Normal Q-Q plot also does not follow the reference line.

plot(fit,1)

Assumption 2: Homogeneity of variance or Homoscedasticity

Furthermore, the residuals are not spread with constant variance around the zero line.

plot(fit,3)

There is no need to check the other assumptions since the first two are violated. If the residuals vs fitted plot indicates a nonlinear relationship in the data, a simple approach is to apply a nonlinear transformation of the predictors, such as log(x), sqrt(x), or x^2, in the regression model (A. Kassambara, 2018). A possible remedy for non-constant variance is a log or square root transformation of the dependent variable.

Transformed model

The square root of the price is taken to satisfy the homoscedasticity assumption.

fit.t <- lm(sqrt(price)~year+engine_power+volume+mileage,data=turbo)

summary(fit.t)
## 
## Call:
## lm(formula = sqrt(price) ~ year + engine_power + volume + mileage, 
##     data = turbo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.858  -7.474  -0.160   7.713  60.606 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.226e+03  2.051e+02 -35.230  < 2e-16 ***
## year          3.653e+00  1.021e-01  35.772  < 2e-16 ***
## engine_power  1.715e-02  1.779e-02   0.964   0.3355    
## volume        1.058e+01  1.393e+00   7.596 1.01e-13 ***
## mileage       1.215e-05  6.065e-06   2.003   0.0456 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.23 on 682 degrees of freedom
## Multiple R-squared:  0.7874, Adjusted R-squared:  0.7861 
## F-statistic: 631.3 on 4 and 682 DF,  p-value: < 2.2e-16

The RSE of the transformed model is 13.23. Because the response is now the square root of price, this value is on a different scale and cannot be compared directly with the RSE of the first model. Compared with the previous model, the R-squared and adjusted R-squared are higher: the independent variables explain about 79% of the variation in the square root of price. Engine power is still an insignificant factor. The three asterisks show that the intercept, year, and volume are highly significant, whereas the single asterisk indicates that mileage is significant only at the 0.05 level.

Intercept    Year    Engine Power  Volume  Mileage
-7.226e+03   3.653   0.01715       10.58   1.215e-05

The intercept is -7226 on the square-root scale and has no practical interpretation. The coefficient of year is 3.653: if year increases by 1 unit while the other variables are held constant, the square root of price increases by 3.653 units. Similarly, if volume increases by 1 unit while the other variables are held constant, the square root of price increases by 10.58 units.
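
Because the coefficients of the transformed model are on the square-root scale, a prediction has to be squared to return to the original units. The sketch below illustrates this on the built-in mtcars data as a stand-in (mpg in place of price); fit_sqrt and new_car are hypothetical names. Squaring the fitted value is a simple back-transformation and tends to slightly understate the mean on the original scale.

```r
# Stand-in model with a square-root response (not the turbo model).
fit_sqrt <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)

# A hypothetical new observation.
new_car <- data.frame(wt = 3, hp = 120)

# Prediction on the square-root scale, then squared back to original units.
pred_sqrt <- predict(fit_sqrt, newdata = new_car)
pred_orig <- pred_sqrt^2

# Effect of a one-unit increase in wt, expressed in original units.
pred_shift <- predict(fit_sqrt, newdata = transform(new_car, wt = wt + 1))^2
c(pred_orig = unname(pred_orig), change = unname(pred_shift - pred_orig))
```

The change in original units depends on the starting point, which is why a coefficient of the transformed model cannot be read directly as a change in price.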

Assumption 1: Linearity of the data

The first assumption holds, since the red line is approximately horizontal at zero.

plot(fit.t,1)

Assumption 2: Homogeneity of variance

The residuals are not spread equally along the ranges of the predictors. Thus, the homogeneity of variance assumption fails.

plot(fit.t,3)

Assumption 3: Normality of Residuals

The residuals of the model are normally distributed.

plot(fit.t,2)

Assumption 4: Outliers and High Leverage Points

The plot below displays the 3 most extreme points (66, 405, and 656), whose standardized residuals lie beyond ±3 standard deviations.

par(mfrow=c(1,2))
plot(fit.t,4)
plot(fit.t,5)

The data do not present any influential points, since all the points are inside the Cook's distance lines. Thus, no single observation unduly influences the regression result.

The table below shows the top 3 observations with the highest Cook's distance:

model.diag.metrics <- augment(fit.t)
model.diag.metrics %>%
  top_n(3, wt = .cooksd)
## # A tibble: 3 x 13
##   .rownames sqrt.price.  year engine_power volume mileage .fitted .se.fit
##   <chr>           <dbl> <int>        <int>  <dbl>   <int>   <dbl>   <dbl>
## 1 405              176.  2000          170    4.3  230000   131.     2.68
## 2 622              182.  2006          480    4.2  352621   159.     3.73
## 3 656              124.  1987          320    5    100000    92.3    2.99
## # ... with 5 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>

Assumption 5: Independence of Errors

Test for Autocorrelated Errors:

H0: The residual errors are independent
H1: The residual errors are not independent

durbinWatsonTest(fit.t)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.05202954      2.102086    0.21
##  Alternative hypothesis: rho != 0

Since the p-value is 0.21, we cannot reject H0. The 5th assumption holds: the residuals are independent.
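
The Durbin-Watson statistic reported above can be computed directly from the residuals, and it is approximately 2 * (1 - r), where r is the lag-1 autocorrelation. The sketch below shows the computation on a stand-in model fitted to the built-in mtcars data, and then checks the approximation against the values printed in the output; fit_demo is a hypothetical name.

```r
# Stand-in model on built-in data (not the turbo model).
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
e <- residuals(fit_demo)

# Durbin-Watson statistic: squared successive differences over squared residuals.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals and the approximation 2 * (1 - r).
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)
c(dw = dw, approx = 2 * (1 - r1))

# Same approximation with the autocorrelation printed in the report;
# the result is close to the reported statistic of 2.102086.
reported_check <- 2 * (1 - (-0.05202954))
reported_check
```

A statistic near 2 corresponds to little or no lag-1 autocorrelation, which matches the conclusion drawn from the test.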

Multicollinearity

Multicollinearity indicates a high correlation between independent variables. The scatterplot of engine power against volume is plotted below. There is a strong positive relationship between volume and engine power; the correlation between them is 0.8166. Therefore, we want to check whether multicollinearity exists among the predictors.

ggplot(data = turbo) + 
  geom_point(mapping =
               aes(x = volume, y = engine_power,color=marque,size=1))+
  labs(title="The relation between engine volume and power",
       x="Volume", y="Power")

cor(turbo$volume,turbo$engine_power) 
## [1] 0.8165848

The variance inflation factor (VIF) is a measure used to inspect the existence of multicollinearity. According to the Minitab blog (Enough Is Enough! Handling Multicollinearity in Regression Analysis, 2013), a VIF between 5 and 10 indicates high correlation. If the VIF is greater than 10, we can assume that the regression coefficients are poorly estimated due to multicollinearity.

vif(fit.t)
##         year engine_power       volume      mileage 
##     1.815890     3.536412     3.293046     1.665150

In our case, all VIF values are below 5, so there is no multicollinearity problem among the independent variables. Therefore, we do not remove any of the variables from the model.
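
For numeric predictors, the VIF of a variable equals 1 / (1 - R²), where R² comes from regressing that variable on the other predictors. The sketch below computes it by hand on three columns of the built-in mtcars data, used here as a stand-in for the turbo predictors; vif_by_hand is a hypothetical name.

```r
# Stand-in predictors from built-in data (not the turbo predictors).
X <- mtcars[, c("wt", "hp", "disp")]

# Regress each predictor on the others and convert R-squared to a VIF.
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
vif_by_hand
```

The same values appear on the diagonal of the inverse of the predictors' correlation matrix, which gives a quick consistency check.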

The AIC (Akaike information criterion) and BIC (Bayesian information criterion) for the model comparison:

AIC(fit.t)
## [1] 5504.89
BIC(fit.t)
## [1] 5532.084
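
Both criteria are built from the log-likelihood: AIC = -2 logLik + 2k and BIC = -2 logLik + log(n) k, where k counts the estimated parameters and n the observations. The sketch below verifies this on a stand-in model fitted to the built-in mtcars data; fit_demo is a hypothetical name.

```r
# Stand-in model with a square-root response (not the turbo model).
fit_demo <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)

ll <- logLik(fit_demo)
k  <- attr(ll, "df")   # regression coefficients plus the error variance
n  <- nobs(fit_demo)

aic_by_hand <- -2 * as.numeric(ll) + 2 * k
bic_by_hand <- -2 * as.numeric(ll) + log(n) * k
c(AIC = aic_by_hand, BIC = bic_by_hand)
```

Lower values indicate a better trade-off between fit and complexity. The criteria are only comparable between models with the same response, so a value computed for sqrt(price) should not be compared with one computed for price.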

Stepwise Regression Method

There are three methods of stepwise regression (NCSS Statistical Software):

  1. Forward Selection starts with no predictors in the model. It iteratively adds the most significant predictor and stops when no remaining predictor is statistically significant.

  2. Backward Selection starts with the full model. It iteratively removes the least contributive predictor and stops when all remaining predictors are statistically significant.

  3. Stepwise Selection is a combination of the forward and backward selection methods. It starts with no predictors and, as in forward selection, adds the most contributive independent variable. After each addition, as in backward selection, it removes any variables that no longer contribute to the model fit.

In this analysis, stepwise selection will be used to find the best-fitting model.
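
The combined forward and backward procedure can be sketched with base R's step() function. Note the assumptions: step() selects on AIC rather than on p-values, so it is a related technique and not the p-value based procedure used in this report, and the built-in mtcars data stands in for the turbo data.

```r
# Smallest and largest candidate models on stand-in data.
null <- lm(mpg ~ 1, data = mtcars)
full <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)

# Start from the empty model; at each step a term may be added or dropped.
sel <- step(null,
            scope = list(lower = formula(null), upper = formula(full)),
            direction = "both",
            trace = 0)   # suppress the step-by-step printout

formula(sel)
AIC(sel) <= AIC(null)
```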

STEPWISE REGRESSION MODEL

The stepwise regression model based on the p-value is built:

ols_step_both_p(lm(price~year+engine_power+volume+mileage+
                     can_barter+gear+fuel_type+transmission+marque,data=turbo)) 
## Stepwise Selection Method   
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. year 
## 2. engine_power 
## 3. volume 
## 4. mileage 
## 5. can_barter 
## 6. gear 
## 7. fuel_type 
## 8. transmission 
## 9. marque 
## 
## We are selecting variables based on p value...
## 
## Variables Entered/Removed: 
## 
## - year added 
## - volume added 
## - gear added 
## - fuel_type added 
## - marque added 
## - engine_power added 
## - transmission added 
## - mileage added 
## 
## No more variables to be added/removed.
## 
## 
## Final Model Output 
## ------------------
## 
##                            Model Summary                            
## -------------------------------------------------------------------
## R                       0.901       RMSE                  2988.905 
## R-Squared               0.811       Coef. Var               20.787 
## Adj. R-Squared          0.809       MSE                8933552.307 
## Pred R-Squared          0.803       MAE                   2187.463 
## -------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                      ANOVA                                       
## --------------------------------------------------------------------------------
##                        Sum of                                                   
##                       Squares         DF       Mean Square       F         Sig. 
## --------------------------------------------------------------------------------
## Regression    25982657582.542         10    2598265758.254    290.844    0.0000 
## Residual       6039081359.452        676       8933552.307                      
## Total         32021738941.994        686                                        
## --------------------------------------------------------------------------------
## 
##                                                 Parameter Estimates                                                  
## --------------------------------------------------------------------------------------------------------------------
##                model            Beta    Std. Error    Std. Beta       t        Sig            lower           upper 
## --------------------------------------------------------------------------------------------------------------------
##          (Intercept)    -1847092.834     58600.855                 -31.520    0.000    -1962154.409    -1732031.259 
##                 year         927.633        29.315        0.905     31.644    0.000         870.074         985.192 
##               volume         959.022       332.761        0.092      2.882    0.004         305.653        1612.392 
##               gearÖn       -2851.847       556.889       -0.163     -5.121    0.000       -3945.287       -1758.407 
##              gearTam        -279.860       470.104       -0.013     -0.595    0.552       -1202.900         643.180 
##       fuel_typeDizel        2388.010       333.732        0.137      7.155    0.000        1732.734        3043.286 
##        marqueHyundai        -374.539       626.858       -0.023     -0.597    0.550       -1605.361         856.283 
##       marqueMercedes        1540.383       363.855        0.110      4.234    0.000         825.962        2254.804 
##         engine_power          18.957         4.427        0.148      4.282    0.000          10.264          27.650 
## transmissionMexanika        1542.256       370.757        0.077      4.160    0.000         814.283        2270.229 
##              mileage          -0.006         0.002       -0.097     -4.072    0.000          -0.009          -0.003 
## --------------------------------------------------------------------------------------------------------------------
## 
##                                   Stepwise Selection Summary                                    
## -----------------------------------------------------------------------------------------------
##                          Added/                   Adj.                                             
## Step      Variable      Removed     R-Square    R-Square      C(p)         AIC          RMSE       
## -----------------------------------------------------------------------------------------------
##    1        year        addition       0.681       0.681    463.2600    13301.3441    3861.8349    
##    2       volume       addition       0.746       0.745    230.8370    13146.1597    3446.8949    
##    3        gear        addition       0.778       0.776    120.3760    13059.7303    3232.0713    
##    4     fuel_type      addition       0.794       0.792     64.4460    13010.0477    3115.0406    
##    5       marque       addition       0.799       0.797     47.5090    12996.2738    3079.5296    
##    6    engine_power    addition       0.803       0.801     34.0600    12983.4244    3048.6729    
##    7    transmission    addition       0.807       0.804     23.2220    12972.8358    3023.0987    
##    8      mileage       addition       0.811       0.809      8.6040    12958.1906    2988.9049    
## -----------------------------------------------------------------------------------------------

The final model has the lowest Cp (8.60), the lowest AIC (12958.19), and the highest R-squared (81%) of all the steps.

H0: β1=β2=β3=β4=β5=β6=β7=β8=β9=0 versus H1: at least one βi is not equal to zero, i=1,2,...,9

The square root of price is taken in order to fulfill the constant variance assumption.

step.p <- lm(sqrt(price)~year+engine_power+volume+mileage+
                     can_barter+gear+fuel_type+transmission+marque, data=turbo)
summary(step.p)
## 
## Call:
## lm(formula = sqrt(price) ~ year + engine_power + volume + mileage + 
##     can_barter + gear + fuel_type + transmission + marque, data = turbo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.913  -5.935  -0.031   6.302  57.453 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.995e+03  2.178e+02 -36.710  < 2e-16 ***
## year                  4.042e+00  1.089e-01  37.098  < 2e-16 ***
## engine_power          6.803e-02  1.643e-02   4.140 3.90e-05 ***
## volume                5.240e+00  1.236e+00   4.239 2.56e-05 ***
## mileage              -2.092e-05  5.604e-06  -3.733 0.000205 ***
## can_barter           -1.537e+00  1.114e+00  -1.380 0.168193    
## gearÖn               -1.146e+01  2.069e+00  -5.539 4.37e-08 ***
## gearTam              -1.738e+00  1.750e+00  -0.993 0.321196    
## fuel_typeDizel        9.959e+00  1.240e+00   8.031 4.32e-15 ***
## transmissionMexanika  2.082e+00  1.376e+00   1.513 0.130658    
## marqueHyundai        -2.423e+00  2.337e+00  -1.037 0.300250    
## marqueMercedes        7.292e+00  1.354e+00   5.386 9.97e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.09 on 675 degrees of freedom
## Multiple R-squared:  0.8521, Adjusted R-squared:  0.8496 
## F-statistic: 353.4 on 11 and 675 DF,  p-value: < 2.2e-16

In this model, can_barter, gearTam, transmissionMexanika, and marqueHyundai are insignificant. We drop can_barter and marque and construct a new model; gear and transmission are retained.

step.p <- lm(sqrt(price)~year+engine_power+volume+mileage+
                     gear+fuel_type+transmission, data=turbo)

summary(step.p)
## 
## Call:
## lm(formula = sqrt(price) ~ year + engine_power + volume + mileage + 
##     gear + fuel_type + transmission, data = turbo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.158  -6.382  -0.043   6.475  56.109 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -7.734e+03  2.132e+02 -36.279  < 2e-16 ***
## year                  3.914e+00  1.065e-01  36.755  < 2e-16 ***
## engine_power          5.015e-02  1.622e-02   3.092 0.002073 ** 
## volume                6.033e+00  1.268e+00   4.759 2.38e-06 ***
## mileage              -2.106e-05  5.763e-06  -3.655 0.000277 ***
## gearÖn               -1.728e+01  1.600e+00 -10.794  < 2e-16 ***
## gearTam              -6.047e+00  1.581e+00  -3.824 0.000143 ***
## fuel_typeDizel        1.115e+01  1.266e+00   8.812  < 2e-16 ***
## transmissionMexanika  1.393e+00  1.414e+00   0.985 0.324765    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.45 on 678 degrees of freedom
## Multiple R-squared:  0.8416, Adjusted R-squared:  0.8397 
## F-statistic: 450.1 on 8 and 678 DF,  p-value: < 2.2e-16

Here, all the predictors are significant except for the transmission of the car. On average, a mechanical transmission costs about 1000 dollars less than an automatic in the same model (Manual vs. Automatic Car Transmission: Pros & Cons). Therefore, we keep this insignificant factor in the model. Among the significant factors, engine power has the weakest significance. The model explains about 84% of the variation in the square root of price, which is a high value.

Transmission is insignificant, but we are not going to exclude it. According to Jim Frost (Statistics By Jim), if theory suggests that an independent variable is important, you might include it in the regression equation even when its p-value is not significant.

Assumption 1: Linearity of the data

The residuals vs fitted plot shows no systematic pattern, and the red line is approximately horizontal at zero. Thus, the first assumption holds.

plot(step.p,1)

Assumption 2: Homogeneity of variance

The square root of price was taken to fulfill the constant variance assumption. We see that the variance is homogeneous: the residuals are spread equally along the ranges of the predictors.

plot(step.p,3)

Assumption 3: Normality of Residuals

All the points fall approximately along the reference line, so the residuals are approximately normally distributed.

plot(step.p,2)

Assumption 4: Outliers and High Leverage Points

The plot below displays the 3 most extreme points (143, 405, and 656), whose standardized residuals lie beyond about ±3 standard deviations.

par(mfrow=c(1,2))
plot(step.p,4)
plot(step.p,5)

All the points are inside the red Cook's distance lines, which means the data do not present any influential points. Thus, no single observation unduly influences the regression result.

The table below shows the top 3 observations with the highest Cook's distance:

model.diag.metrics <- augment(step.p)
model.diag.metrics %>%
  top_n(3, wt = .cooksd)
## # A tibble: 3 x 16
##   .rownames sqrt.price.  year engine_power volume mileage gear  fuel_type
##   <chr>           <dbl> <int>        <int>  <dbl>   <int> <fct> <fct>    
## 1 143              116.  1985          143    2.2  370000 Arxa  Dizel    
## 2 405              176.  2000          170    4.3  230000 Arxa  Dizel    
## 3 656              124.  1987          320    5    100000 Arxa  Benzin   
## # ... with 8 more variables: transmission <fct>, .fitted <dbl>,
## #   .se.fit <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## #   .std.resid <dbl>

Chart of Cook’s distance to detect observations that strongly influence fitted values of the model.

ols_plot_cooksd_bar(step.p)

Graph for detecting influential observations.

ols_plot_resid_lev(step.p)

Assumption 5: Independence of Errors

Test for Autocorrelated Errors

H0: The residual errors are independent
H1: The residual errors are not independent

durbinWatsonTest(step.p)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.05496443        2.1077    0.18
##  Alternative hypothesis: rho != 0

Since the p-value is 0.18, we cannot reject H0. The 5th assumption holds: the residuals are independent.

Best Subsets Regression

Unlike the stepwise regression method, best subsets regression fits all possible models based on the independent variables that we specify. With k independent variables, the method fits 2^k models: 32 models for 5 independent variables and 128 models for 7. Usually, the method picks the best-fitting model based on the adjusted R-squared or Mallows' Cp criterion (J. Frost, 2018).
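
The model counts quoted above follow from 2^k, since each variable is either in or out of a model. The sketch below checks the counts and then enumerates every subset of three stand-in predictors from the built-in mtcars data, fitting each one with lm(); it is a brute-force illustration with hypothetical names, not the algorithm used by regsubsets().

```r
# Number of candidate models for k independent variables.
n_models <- function(k) 2^k
n_models(c(5, 7))

# Every in/out combination of three stand-in predictors.
preds <- c("wt", "hp", "disp")
inclusion <- expand.grid(rep(list(c(FALSE, TRUE)), length(preds)))
names(inclusion) <- preds

# Fit each subset and record its adjusted R-squared.
adj_r2 <- apply(inclusion, 1, function(keep) {
  rhs <- if (any(keep)) preds[keep] else "1"   # "1" is the intercept-only model
  summary(lm(reformulate(rhs, response = "mpg"), data = mtcars))$adj.r.squared
})

# Subset with the highest adjusted R-squared.
inclusion[which.max(adj_r2), ]
```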

best.fit <-
  regsubsets(price~year+engine_power+volume+mileage+
               gear+fuel_type+transmission+can_barter,
             data=turbo,
             nbest = 1,       # 1 best model for each number of predictors
             nvmax =NULL ,    # NULL for no limit on number of variables
             force.in = NULL, force.out = NULL,
             method = "exhaustive")
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 3 linear dependencies found
## Reordering variables and trying again:
best.fit
## Subset selection object
## Call: regsubsets.formula(price ~ year + engine_power + volume + mileage + 
##     gear + fuel_type + transmission + can_barter, data = turbo, 
##     nbest = 1, nvmax = NULL, force.in = NULL, force.out = NULL, 
##     method = "exhaustive")
## 12 Variables  (and intercept)
##                      Forced in Forced out
## year                     FALSE      FALSE
## engine_power             FALSE      FALSE
## volume                   FALSE      FALSE
## mileage                  FALSE      FALSE
## gearÖn                   FALSE      FALSE
## gearTam                  FALSE      FALSE
## fuel_typeDizel           FALSE      FALSE
## transmissionMexanika     FALSE      FALSE
## can_barter               FALSE      FALSE
## fuel_typeElektro         FALSE      FALSE
## fuel_typeHibrid          FALSE      FALSE
## fuel_typeQaz             FALSE      FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
reg.summary <- summary(best.fit)
as.data.frame(reg.summary$outmat)
##          year engine_power volume mileage gearÖn gearTam fuel_typeDizel
## 1  ( 1 )    *                                                          
## 2  ( 1 )    *                   *                                      
## 3  ( 1 )    *                   *              *                       
## 4  ( 1 )    *                   *              *                      *
## 5  ( 1 )    *            *      *              *                      *
## 6  ( 1 )    *            *              *      *                      *
## 7  ( 1 )    *            *      *       *      *                      *
## 8  ( 1 )    *            *      *       *      *       *              *
## 9  ( 1 )    *            *      *       *      *       *              *
##          fuel_typeElektro fuel_typeHibrid fuel_typeQaz
## 1  ( 1 )                                              
## 2  ( 1 )                                              
## 3  ( 1 )                                              
## 4  ( 1 )                                              
## 5  ( 1 )                                              
## 6  ( 1 )                                              
## 7  ( 1 )                                              
## 8  ( 1 )                                              
## 9  ( 1 )                                              
##          transmissionMexanika can_barter
## 1  ( 1 )                                
## 2  ( 1 )                                
## 3  ( 1 )                                
## 4  ( 1 )                                
## 5  ( 1 )                                
## 6  ( 1 )                    *           
## 7  ( 1 )                    *           
## 8  ( 1 )                    *           
## 9  ( 1 )                    *          *

The above table shows which variables enter the best model of each size. Year, engine power, volume, mileage, gear, fuel type (Dizel), and transmission are all included in the best eight-variable model. The can_barter variable enters only in the nine-variable model.
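
The warning about 3 linear dependencies, together with the columns fuel_typeElektro, fuel_typeHibrid, and fuel_typeQaz that never enter any model, suggests that fuel_type still carries factor levels with no observations after filtering. This is an assumption, not something the source confirms. The sketch below shows, on a small made-up factor, how unused levels create all-zero columns in the design matrix and how droplevels() removes them.

```r
# Hypothetical factor: two levels observed, three levels unused.
fuel <- factor(c("Benzin", "Dizel", "Benzin", "Dizel"),
               levels = c("Benzin", "Dizel", "Elektro", "Hibrid", "Qaz"))

# The design matrix keeps a column for every level, used or not.
mm_raw <- model.matrix(~ fuel)
zero_cols <- colnames(mm_raw)[colSums(mm_raw != 0) == 0]
zero_cols

# Dropping the unused levels removes the all-zero columns.
mm_clean <- model.matrix(~ droplevels(fuel))
ncol(mm_clean)
```

Under that assumption, applying droplevels() to the data before calling regsubsets() would be one way to avoid the warning.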

BIC is minimized by the model with 8 variables, while Cp is minimized by the model with 9 variables. Using the BIC criterion, we should choose the model with 8 variables.

which.min(reg.summary$bic)
## [1] 8
plot(reg.summary$bic, main = "BIC", type = "l")
points(8,reg.summary$bic[8],col="red",cex=2,pch=20)

which.min(reg.summary$cp)
## [1] 9
plot(reg.summary$cp, main = "Cp", type = "l")
points(9, reg.summary$cp[9], col = "red", cex = 2, pch = 20)

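The model with a single variable has the lowest adjusted R-squared, which is marked in the plot with the red point. Note that a higher adjusted R-squared indicates a better model, so the preferred model is the one that maximizes this criterion.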

which.min(reg.summary$adjr2)
## [1] 1
plot(reg.summary$adjr2, main = "Adjr2", type = "l")
points(1,reg.summary$adjr2[1],col="red",cex=2,pch=20)
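
Since adjusted R-squared is better when larger, the best model size is found with which.max() rather than which.min(). The sketch below illustrates the difference on three nested stand-in models fitted to the built-in mtcars data; the names are hypothetical.

```r
# Nested stand-in models of increasing size.
fits <- list(
  lm(mpg ~ wt, data = mtcars),
  lm(mpg ~ wt + hp, data = mtcars),
  lm(mpg ~ wt + hp + disp, data = mtcars)
)
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)

best  <- which.max(adj_r2)   # model size to prefer
worst <- which.min(adj_r2)   # what which.min() would report instead
c(best = best, worst = worst)
```

With the regsubsets summary, the analogous call would be which.max(reg.summary$adjr2).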

The best subset plot based on the adjusted R square is below. According to this plot, the best factors for the model based on adjusted R square are the year, engine power, volume, mileage, gear, Diesel, and Gaz fuel type.

plot(best.fit, scale = "adjr2", main = "Adjusted R^2")

CONCLUSION

                   "All models are wrong, some models are useful"    
                                                George Box (1976)
                                        

In this study, we built several equations to find the best-fitting model for the turbo data that can estimate the dependent variable price. The first equation below is constructed without the categorical variables.

Model 1: sqrt(price) = β0 + β1 * year + β2 * engine_power + β3 * volume + β4 * mileage

The AIC, BIC, and adjusted R-squared criteria are presented below:

AIC      BIC      Adj R-squared
5504.89  5532.08  0.79

To keep the model simple, we tried to explain the price variable with only a few independent variables. The model explains about 79% of the variation in the outcome. The AIC and BIC values are high: 5504.89 and 5532.08, respectively.

The second model was constructed by the stepwise regression method, and the equation is presented below:

Model 2: sqrt(price) = β0 + β1 * year + β2 * engine_power + β3 * volume + β4 * mileage + β5 * can_barter + β6 * gear + β9 * fuel_type + β10 * transmission + β11 * marque

Mallows' Cp is a criterion for model selection, and a small Cp value indicates a good model. For this model, Cp is 8.6, the AIC is 5214.53, and the BIC is 5260.22. Furthermore, the adjusted R-squared is 0.80, which is higher than that of model 1.

AIC      BIC      Adj R-squared
5214.53  5260.22  0.80

Both models satisfy the linear regression assumptions. However, we select the second model based on the above criteria, since it has the higher R-squared and the lower AIC and BIC. Although transmission is insignificant for the price in the second model, we will not exclude it, because according to the literature review transmission, gear, and fuel type all have an effect on price. Therefore, we prefer the second model: it includes the variables that are most informative for the outcome.

REFERENCES

[1] Engine size explained (2018, 9 May). Retrieved from: https://www.carbuyer.co.uk/tips-and-advice/146778/engine-size-explained

[2] Fitting & Interpreting Linear Models in R (2013, 18 May). Retrieved from: http://blog.yhat.com/posts/r-lm-summary.html

[3] How to Change Car Gears. Retrieved from: https://www.driving-test-success.com/gears/gearinfo.htm

[4] Linear Regression Assumptions and Diagnostics in R: Essentials (2018, 11 March). Retrieved from: http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/#building-a-regression-model

[5] Stepwise Regression. Retrieved from: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Stepwise_Regression.pdf

[6] Enough Is Enough! Handling Multicollinearity in Regression Analysis (2013). Retrieved from: http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis

[7] Linear Regression Assumptions and Diagnostics in R: Essentials (2018, 11 March). Retrieved from: http://www.sthda.com/english/articles/39-regression-model-diagnostics/161-linear-regression-assumptions-and-diagnostics-in-r-essentials/#building-a-regression-model

[8] Manual vs. Automatic Car Transmission: Pros & Cons. Retrieved from: https://www.budgetdirect.com.au/blog/manual-vs-automatic-car-transmission-pros-cons.html

[9] Measures of Influence. Retrieved from: https://cran.r-project.org/web/packages/olsrr/vignettes/influence_measures.html

[10] Model Specification: Choosing the Correct Regression Model (J. Frost). Retrieved from: http://statisticsbyjim.com/regression/model-specification-variable-selection/