1. Data preparation

We use the Flight Price Prediction dataset from Kaggle.com, uploaded by Shubham Bathwal. It contains flight booking records obtained from an online ticket-booking platform.

The goal of this exercise is to examine whether linear regression can be used to predict price in this case.

Read file

pesawat <- read.csv("Data_Pesawat.csv")
dim(pesawat)
## [1] 300153     12
head(pesawat)
names(pesawat)
##  [1] "X"                "airline"          "flight"           "source_city"     
##  [5] "departure_time"   "stops"            "arrival_time"     "destination_city"
##  [9] "class"            "duration"         "days_left"        "price"
str(pesawat)
## 'data.frame':    300153 obs. of  12 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ airline         : chr  "SpiceJet" "SpiceJet" "AirAsia" "Vistara" ...
##  $ flight          : chr  "SG-8709" "SG-8157" "I5-764" "UK-995" ...
##  $ source_city     : chr  "Delhi" "Delhi" "Delhi" "Delhi" ...
##  $ departure_time  : chr  "Evening" "Early_Morning" "Early_Morning" "Morning" ...
##  $ stops           : chr  "zero" "zero" "zero" "zero" ...
##  $ arrival_time    : chr  "Night" "Morning" "Early_Morning" "Afternoon" ...
##  $ destination_city: chr  "Mumbai" "Mumbai" "Mumbai" "Mumbai" ...
##  $ class           : chr  "Economy" "Economy" "Economy" "Economy" ...
##  $ duration        : num  2.17 2.33 2.17 2.25 2.33 2.33 2.08 2.17 2.17 2.25 ...
##  $ days_left       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price           : int  5953 5953 5956 5955 5955 5955 6060 6060 5954 5954 ...

The data contain integer, numeric, and character columns.

Target: predict price

Check the distribution of the price variable

boxplot(pesawat$price)

2. Data cleansing

Remove unused variables: we do not need column X

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
pesawat <- pesawat %>% select(-c(X))
head(pesawat)
dim(pesawat)
## [1] 300153     11
# check for missing values
anyNA(pesawat)
## [1] FALSE
# quick check of correlation and multicollinearity
library(GGally)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(pesawat, hjust=1, layout.exp=3, label=TRUE)
## Warning in ggcorr(pesawat, hjust = 1, layout.exp = 3, label = TRUE): data
## in column(s) 'airline', 'flight', 'source_city', 'departure_time', 'stops',
## 'arrival_time', 'destination_city', 'class' are not numeric and were ignored

We can only see correlations for 3 variables, which means we still have many categorical variables. Let's check how varied the categorical variables are.

kategori_var_pesawat <- data.frame(
  variable = c("airline","flight","source_city","departure_time","stops","arrival_time", "destination_city","class", "duration", "days_left"),
  jumlah_kategori = c(n_distinct(pesawat$airline),
    n_distinct(pesawat$flight), 
    n_distinct(pesawat$source_city), 
    n_distinct(pesawat$departure_time), 
    n_distinct(pesawat$stops), 
    n_distinct(pesawat$arrival_time),
    n_distinct(pesawat$destination_city),
    n_distinct(pesawat$class),
    n_distinct(pesawat$duration),
    n_distinct(pesawat$days_left))
  )
kategori_var_pesawat

We can see a lot of variation in flight number, duration, and days_left.
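The per-column count above can also be produced without listing every column by hand. The following is a minimal sketch in base R on a small made-up data frame; `toy`, its values, and the name `unique_counts` are illustrative assumptions, not part of the flight data.

```r
# Toy stand-in for the flight data (values are made up for illustration).
toy <- data.frame(
  airline   = c("Vistara", "Indigo", "Vistara", "AirAsia"),
  stops     = c("zero", "one", "zero", "zero"),
  days_left = c(1L, 1L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Count the distinct values in every column in one pass.
n_unique <- sapply(toy, function(col) length(unique(col)))

# Arrange the counts as a two-column table, largest counts first.
unique_counts <- data.frame(
  variable   = names(n_unique),
  n_distinct = unname(n_unique)
)
unique_counts <- unique_counts[order(unique_counts$n_distinct, decreasing = TRUE), ]
unique_counts
```

Applied to pesawat, the same `sapply()` call should give the counts for all columns at once, including price.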

Remove the flight number, because it does not seem to play much of a role in prediction

pesawat <- pesawat %>% select(-c(flight))
dim(pesawat)
## [1] 300153     10

3. Data wrangling

We select the columns and convert their data types, because there are 7 categorical variables

pesawat <- pesawat %>%
  select(duration, days_left, price, airline, source_city, departure_time, stops, arrival_time, destination_city, class) %>%
  mutate(airline=as.factor(airline),
         source_city = as.factor(source_city),
         departure_time = as.factor(departure_time),
         stops = as.factor(stops),
         arrival_time = as.factor(arrival_time),
         destination_city = as.factor(destination_city),
         class = as.factor(class)
         )
str(pesawat)
## 'data.frame':    300153 obs. of  10 variables:
##  $ duration        : num  2.17 2.33 2.17 2.25 2.33 2.33 2.08 2.17 2.17 2.25 ...
##  $ days_left       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price           : int  5953 5953 5956 5955 5955 5955 6060 6060 5954 5954 ...
##  $ airline         : Factor w/ 6 levels "Air_India","AirAsia",..: 5 5 2 6 6 6 6 6 3 3 ...
##  $ source_city     : Factor w/ 6 levels "Bangalore","Chennai",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ departure_time  : Factor w/ 6 levels "Afternoon","Early_Morning",..: 3 2 2 5 5 5 5 1 2 1 ...
##  $ stops           : Factor w/ 3 levels "one","two_or_more",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ arrival_time    : Factor w/ 6 levels "Afternoon","Early_Morning",..: 6 5 2 1 5 1 5 3 5 3 ...
##  $ destination_city: Factor w/ 6 levels "Bangalore","Chennai",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ class           : Factor w/ 2 levels "Business","Economy": 2 2 2 2 2 2 2 2 2 2 ...

We have now converted all categorical columns to factors, so we have 3 data types: num, int, and factor
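When many columns need the same conversion, it can be done in one step instead of one `as.factor()` call per column. This is a minimal base R sketch on a made-up data frame (`toy` and its values are assumptions for illustration only):

```r
# Toy stand-in for the flight data (values are made up for illustration).
toy <- data.frame(
  airline  = c("Vistara", "Indigo", "Vistara"),
  class    = c("Economy", "Business", "Economy"),
  duration = c(2.17, 2.33, 2.25),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors;
# numeric and integer columns are left untouched.
is_chr <- sapply(toy, is.character)
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

With dplyr already loaded, `mutate(across(where(is.character), as.factor))` should achieve the same result in the pipe style used above.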

4. Exploratory Data Analysis (EDA)

We have hundreds of thousands of records. Which airline is the most popular? It turns out that Vistara appears most frequently in the data

plot(pesawat$airline)

How is class distributed within each airline? It turns out that only Air_India and Vistara have business class

table(pesawat$airline, pesawat$class)
##            
##             Business Economy
##   Air_India    32898   47994
##   AirAsia          0   16098
##   GO_FIRST         0   23173
##   Indigo           0   43120
##   SpiceJet         0    9011
##   Vistara      60589   67270

What is the mean price for each airline in each class? It turns out that Vistara has the highest mean price in both business class and economy class

data_agg1 <- aggregate(price~airline+class, 
          data= pesawat, 
          FUN=mean)
data_agg1[order(data_agg1$price, decreasing=T),]
range(pesawat$price)
## [1]   1105 123071

I also wonder how price varies across airlines

boxplot(price ~ airline, data = pesawat,
        main = "Variation of price in each airline",
        xlab = "Airlines",ylab = "Price",
        ylim = c(1000, 140000)
        )

How is price distributed between Economy and Business class?

boxplot(price ~ class, data = pesawat,
        main = "Variation of price in each class",
        xlab = "Class",ylab = "Price",
        ylim = c(1000, 140000)
        )

How does price vary with the number of stops?

boxplot(price ~ stops, data = pesawat,
        main = "Variation of price in each stop",
        xlab = "Stops",ylab = "Price",
        ylim = c(1000, 140000)
        )

5. Simple linear regression

We do simple regression using one variable, i.e., duration

model_one <- lm(formula = price ~ duration, data = pesawat)
summary(model_one)
## 
## Call:
## lm(formula = price ~ duration, data = pesawat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36328 -15171 -11296  19615 101357 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13012.958     79.964   162.7   <2e-16 ***
## duration      644.521      5.639   114.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22220 on 300151 degrees of freedom
## Multiple R-squared:  0.04171,    Adjusted R-squared:  0.0417 
## F-statistic: 1.306e+04 on 1 and 300151 DF,  p-value: < 2.2e-16

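We find a statistically significant linear relationship between price and duration.

The slope is positive: 644.521, i.e. each additional unit of duration is associated with a price about 645 higher on average.

However, the R-squared is very low (0.04171), so duration alone explains only about 4% of the variance in price.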
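To make the meaning of R-squared concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses the built-in mtcars data as a stand-in (an assumption for illustration; it is not the flight data):

```r
# Built-in mtcars data stands in for the flight data.
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares
r2_manual <- 1 - rss / tss

# The hand-computed value should agree with the one reported by summary().
all.equal(r2_manual, summary(fit)$r.squared)
```

An R-squared near 0.04 therefore means the residual sum of squares is almost as large as the total sum of squares.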

What does the fitted regression line look like?

# predictor on the x-axis, response on the y-axis, matching price ~ duration
plot(pesawat$duration, pesawat$price)
abline(model_one)

I don’t think I see a nice regression line here :-(

6. Linear regression - all variables

Let's try a regression with all predictors

model_all <- lm(formula = price ~ ., data = pesawat)
summary(model_all)
## 
## Call:
## lm(formula = price ~ ., data = pesawat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36294  -3124   -390   3116  64223 
## 
## Coefficients:
##                               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)                  5.268e+04  8.299e+01   634.818  < 2e-16 ***
## duration                     4.257e+01  2.344e+00    18.160  < 2e-16 ***
## days_left                   -1.310e+02  9.116e-01  -143.647  < 2e-16 ***
## airlineAirAsia              -1.164e+02  6.296e+01    -1.849   0.0645 .  
## airlineGO_FIRST              1.589e+03  5.456e+01    29.127  < 2e-16 ***
## airlineIndigo                1.991e+03  4.727e+01    42.118  < 2e-16 ***
## airlineSpiceJet              2.178e+03  7.705e+01    28.263  < 2e-16 ***
## airlineVistara               3.955e+03  3.111e+01   127.143  < 2e-16 ***
## source_cityChennai          -6.748e+01  4.627e+01    -1.458   0.1448    
## source_cityDelhi            -1.406e+03  4.201e+01   -33.465  < 2e-16 ***
## source_cityHyderabad        -1.679e+03  4.591e+01   -36.569  < 2e-16 ***
## source_cityKolkata           1.584e+03  4.447e+01    35.609  < 2e-16 ***
## source_cityMumbai           -2.119e+02  4.183e+01    -5.065 4.09e-07 ***
## departure_timeEarly_Morning  8.357e+02  4.138e+01    20.195  < 2e-16 ***
## departure_timeEvening        7.338e+02  4.205e+01    17.452  < 2e-16 ***
## departure_timeLate_Night     1.694e+03  1.917e+02     8.840  < 2e-16 ***
## departure_timeMorning        8.563e+02  4.047e+01    21.160  < 2e-16 ***
## departure_timeNight          6.901e+02  4.558e+01    15.141  < 2e-16 ***
## stopstwo_or_more             2.105e+03  6.200e+01    33.955  < 2e-16 ***
## stopszero                   -7.586e+03  4.592e+01  -165.188  < 2e-16 ***
## arrival_timeEarly_Morning   -7.720e+02  6.625e+01   -11.652  < 2e-16 ***
## arrival_timeEvening          9.247e+02  4.284e+01    21.585  < 2e-16 ***
## arrival_timeLate_Night       9.533e+02  6.973e+01    13.670  < 2e-16 ***
## arrival_timeMorning          4.766e+02  4.504e+01    10.582  < 2e-16 ***
## arrival_timeNight            1.143e+03  4.197e+01    27.221  < 2e-16 ***
## destination_cityChennai     -2.198e+02  4.587e+01    -4.793 1.65e-06 ***
## destination_cityDelhi       -1.554e+03  4.306e+01   -36.089  < 2e-16 ***
## destination_cityHyderabad   -1.720e+03  4.547e+01   -37.828  < 2e-16 ***
## destination_cityKolkata      1.377e+03  4.390e+01    31.359  < 2e-16 ***
## destination_cityMumbai      -2.877e+01  4.236e+01    -0.679   0.4971    
## classEconomy                -4.492e+04  3.011e+01 -1492.108  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6754 on 300122 degrees of freedom
## Multiple R-squared:  0.9115, Adjusted R-squared:  0.9114 
## F-statistic: 1.03e+05 on 30 and 300122 DF,  p-value: < 2.2e-16

Using all predictors changes the R-squared and adjusted R-squared substantially

summary(model_one)$adj.r.squared
## [1] 0.04170358
summary(model_all)$adj.r.squared
## [1] 0.9114476

7. Stepwise regression

Now let's try stepwise regression

Backward regression

pesawat_backward <- step(object=model_all,
                       direction="backward",
                       trace=T)
## Start:  AIC=5293494
## price ~ duration + days_left + airline + source_city + departure_time + 
##     stops + arrival_time + destination_city + class
## 
##                    Df  Sum of Sq        RSS     AIC
## <none>                           1.3692e+13 5293494
## - duration          1 1.5046e+10 1.3707e+13 5293822
## - departure_time    5 2.6412e+10 1.3718e+13 5294063
## - arrival_time      5 7.3139e+10 1.3765e+13 5295083
## - destination_city  5 3.0012e+11 1.3992e+13 5299992
## - source_city       5 3.0496e+11 1.3997e+13 5300096
## - airline           5 8.2758e+11 1.4520e+13 5311099
## - days_left         1 9.4137e+11 1.4633e+13 5313450
## - stops             2 1.3022e+12 1.4994e+13 5320759
## - class             1 1.0157e+14 1.1526e+14 5932939

Interestingly, backward elimination stops at the very first iteration, with AIC=5293494: removing any single predictor would increase the AIC
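The AIC values printed by step() come from extractAIC(), which for a linear model is n * log(RSS / n) + 2 * edf and differs from AIC() by an additive constant. A minimal sketch that checks this on the built-in mtcars data (used here as an assumed stand-in for the flight data):

```r
# Built-in mtcars data stands in for the flight data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)   # residual sum of squares
edf <- length(coef(fit))       # number of estimated coefficients

aic_manual <- n * log(rss / n) + 2 * edf
aic_step   <- extractAIC(fit)[2]  # second element is the criterion value

all.equal(aic_manual, aic_step)
```

Plugging in the values from the output above (n = 300153, RSS of about 1.3692e+13, 31 coefficients) gives roughly 5.2935 million, consistent with the printed AIC.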

summary(pesawat_backward)$call
## lm(formula = price ~ duration + days_left + airline + source_city + 
##     departure_time + stops + arrival_time + destination_city + 
##     class, data = pesawat)

Now let's try forward regression

Forward regression

model_no <- lm(price~1, data=pesawat)

For forward selection, we need to define the scope parameter, which sets the upper bound on the combination of predictors that may be added

# stepwise regression: forward selection
pesawat_forward <- step(object = model_no,
                       direction = "forward",
                       scope = list(upper=model_all),
                       trace = T)
## Start:  AIC=6021083
## price ~ 1
## 
##                    Df  Sum of Sq        RSS     AIC
## + class             1 1.3601e+14 1.8621e+13 5385726
## + airline           5 3.4431e+13 1.2020e+14 5945493
## + duration          1 6.4493e+12 1.4819e+14 6008298
## + stops             2 6.3978e+12 1.4824e+14 6008405
## + arrival_time      5 2.5674e+12 1.5207e+14 6016068
## + days_left         1 1.3074e+12 1.5333e+14 6018537
## + departure_time    5 8.1801e+11 1.5382e+14 6019501
## + destination_city  5 4.9572e+11 1.5414e+14 6020130
## + source_city       5 3.7278e+11 1.5426e+14 6020369
## <none>                           1.5463e+14 6021083
## 
## Step:  AIC=5385726
## price ~ class
## 
##                    Df  Sum of Sq        RSS     AIC
## + stops             2 2.4380e+12 1.6183e+13 5343609
## + airline           5 1.0548e+12 1.7566e+13 5368232
## + days_left         1 9.8292e+11 1.7638e+13 5369450
## + duration          1 8.6647e+11 1.7754e+13 5371425
## + destination_city  5 3.2284e+11 1.8298e+13 5380486
## + source_city       5 3.1659e+11 1.8304e+13 5380589
## + arrival_time      5 2.4840e+11 1.8372e+13 5381705
## + departure_time    5 4.8304e+10 1.8573e+13 5384956
## <none>                           1.8621e+13 5385726
## 
## Step:  AIC=5343609
## price ~ class + stops
## 
##                    Df  Sum of Sq        RSS     AIC
## + days_left         1 9.7726e+11 1.5206e+13 5324915
## + airline           5 8.8737e+11 1.5295e+13 5326692
## + source_city       5 2.2466e+11 1.5958e+13 5339423
## + destination_city  5 2.0297e+11 1.5980e+13 5339830
## + arrival_time      5 1.5235e+11 1.6030e+13 5340780
## + departure_time    5 3.5602e+10 1.6147e+13 5342958
## + duration          1 2.2460e+10 1.6160e+13 5343194
## <none>                           1.6183e+13 5343609
## 
## Step:  AIC=5324915
## price ~ class + stops + days_left
## 
##                    Df  Sum of Sq        RSS     AIC
## + airline           5 8.7598e+11 1.4330e+13 5307115
## + source_city       5 2.2143e+11 1.4984e+13 5320522
## + destination_city  5 1.9905e+11 1.5007e+13 5320970
## + arrival_time      5 1.3478e+11 1.5071e+13 5322252
## + departure_time    5 3.2946e+10 1.5173e+13 5324274
## + duration          1 1.1375e+10 1.5194e+13 5324692
## <none>                           1.5206e+13 5324915
## 
## Step:  AIC=5307115
## price ~ class + stops + days_left + airline
## 
##                    Df  Sum of Sq        RSS     AIC
## + source_city       5 2.1548e+11 1.4114e+13 5302577
## + destination_city  5 2.0522e+11 1.4124e+13 5302796
## + arrival_time      5 7.4544e+10 1.4255e+13 5305560
## + duration          1 2.6865e+10 1.4303e+13 5306554
## + departure_time    5 2.3578e+10 1.4306e+13 5306631
## <none>                           1.4330e+13 5307115
## 
## Step:  AIC=5302577
## price ~ class + stops + days_left + airline + source_city
## 
##                    Df  Sum of Sq        RSS     AIC
## + destination_city  5 3.1885e+11 1.3795e+13 5295729
## + arrival_time      5 7.0824e+10 1.4043e+13 5301077
## + departure_time    5 2.0491e+10 1.4094e+13 5302151
## + duration          1 1.9007e+10 1.4095e+13 5302175
## <none>                           1.4114e+13 5302577
## 
## Step:  AIC=5295729
## price ~ class + stops + days_left + airline + source_city + destination_city
## 
##                  Df  Sum of Sq        RSS     AIC
## + arrival_time    5 6.0152e+10 1.3735e+13 5294427
## + departure_time  5 1.8982e+10 1.3776e+13 5295326
## + duration        1 1.1720e+10 1.3784e+13 5295476
## <none>                         1.3795e+13 5295729
## 
## Step:  AIC=5294427
## price ~ class + stops + days_left + airline + source_city + destination_city + 
##     arrival_time
## 
##                  Df  Sum of Sq        RSS     AIC
## + departure_time  5 2.8131e+10 1.3707e+13 5293822
## + duration        1 1.6765e+10 1.3718e+13 5294063
## <none>                         1.3735e+13 5294427
## 
## Step:  AIC=5293822
## price ~ class + stops + days_left + airline + source_city + destination_city + 
##     arrival_time + departure_time
## 
##            Df  Sum of Sq        RSS     AIC
## + duration  1 1.5046e+10 1.3692e+13 5293494
## <none>                   1.3707e+13 5293822
## 
## Step:  AIC=5293494
## price ~ class + stops + days_left + airline + source_city + destination_city + 
##     arrival_time + departure_time + duration

In contrast to backward elimination, which stopped at the first iteration with AIC=5293494, forward selection stops at the 10th iteration, after adding all nine predictors, with the same AIC=5293494
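Forward and backward searches do not always end at the same model, so it can help to compare them side by side. The following is a small sketch on the built-in mtcars data; the dataset and the choice of predictors are assumptions for illustration only.

```r
# Built-in mtcars data stands in for the flight data.
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Forward: start from the intercept-only model, add terms up to the full model.
fwd <- step(null_fit, direction = "forward",
            scope = list(lower = formula(null_fit), upper = formula(full_fit)),
            trace = 0)

# Backward: start from the full model and drop terms.
bwd <- step(full_fit, direction = "backward", trace = 0)

# Compare the selected formulas and the AIC values as step() computes them.
formula(fwd)
formula(bwd)
c(forward = extractAIC(fwd)[2], backward = extractAIC(bwd)[2])
```

When both directions return the same formula and AIC, as they do for the flight data, the stepwise search gives no reason to drop any predictor.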

summary(pesawat_forward)$call
## lm(formula = price ~ class + stops + days_left + airline + source_city + 
##     destination_city + arrival_time + departure_time + duration, 
##     data = pesawat)

8. Model evaluation

summary(model_no)$call
## lm(formula = price ~ 1, data = pesawat)
summary(model_one)$call
## lm(formula = price ~ duration, data = pesawat)
summary(model_all)$call
## lm(formula = price ~ ., data = pesawat)
summary(pesawat_backward)$call
## lm(formula = price ~ duration + days_left + airline + source_city + 
##     departure_time + stops + arrival_time + destination_city + 
##     class, data = pesawat)
summary(pesawat_forward)$call
## lm(formula = price ~ class + stops + days_left + airline + source_city + 
##     destination_city + arrival_time + departure_time + duration, 
##     data = pesawat)

Adjusted R-squared

Let's compare the adjusted R-squared values

summary(model_no)$adj.r.squared
## [1] 0
summary(model_one)$adj.r.squared
## [1] 0.04170358
summary(model_all)$adj.r.squared
## [1] 0.9114476
summary(pesawat_backward)$adj.r.squared
## [1] 0.9114476
summary(pesawat_forward)$adj.r.squared
## [1] 0.9114476

Based on the adjusted R-squared, the best models are the all-predictors, backward, and forward models, which share the highest value. This is expected: all three contain the same nine predictors, so they are the same fitted model.

RMSE

Now let's check the RMSE. First, store the predictions

pred_model_none <- predict(object = model_no, newdata = pesawat)
pred_model_one <- predict(object = model_one, newdata = pesawat)
pred_model_all <-predict(object = model_all, newdata= pesawat) 
pred_model_step_backward <- predict(pesawat_backward, newdata= pesawat)
pred_model_step_forward <- predict(pesawat_forward, newdata=pesawat)

Compute the RMSE

library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.2.2
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
RMSE(y_pred=pred_model_none, y_true=pesawat$price)
## [1] 22697.73
RMSE(y_pred=pred_model_one, y_true=pesawat$price)
## [1] 22219.36
RMSE(y_pred=pred_model_all, y_true=pesawat$price)
## [1] 6753.998
RMSE(y_pred=pred_model_step_backward, y_true=pesawat$price)
## [1] 6753.998
RMSE(y_pred=pred_model_step_forward, y_true=pesawat$price)
## [1] 6753.998

Based on RMSE, the best models are again the all-predictors, backward, and forward models, which share the lowest RMSE
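The RMSE values above are computed on the same data used to fit the models, which is why they almost coincide with the residual standard error reported by summary(). The two differ only in the divisor: n for RMSE, and n minus the number of coefficients for the residual standard error. A minimal sketch on the built-in mtcars data (an assumed stand-in for the flight data):

```r
# Built-in mtcars data stands in for the flight data.
fit  <- lm(mpg ~ wt + hp, data = mtcars)
pred <- predict(fit, newdata = mtcars)
err  <- mtcars$mpg - pred

rmse_manual <- sqrt(mean(err^2))                    # divides by n
rse_manual  <- sqrt(sum(err^2) / df.residual(fit))  # divides by n - p

# The residual standard error matches sigma(); RMSE is slightly smaller.
all.equal(rse_manual, sigma(fit))
rmse_manual < rse_manual
```

With 300153 rows and 31 coefficients the two divisors are nearly identical, which matches the reported 6753.998 versus 6754.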

9. Assumption in Linear Regression

a. Linearity

Check correlation and multicollinearity

library(GGally)
ggcorr(pesawat, hjust=1, layout.exp=3, label=TRUE)
## Warning in ggcorr(pesawat, hjust = 1, layout.exp = 3, label = TRUE): data in
## column(s) 'airline', 'source_city', 'departure_time', 'stops', 'arrival_time',
## 'destination_city', 'class' are not numeric and were ignored

Let's plot one of the best models, namely the forward model

resact <- data.frame(residual = pesawat_forward$residuals, fitted = pesawat_forward$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

I am not convinced that this model satisfies the linearity assumption, because the errors visibly cluster in a pattern

b. Normality of residuals

A linear regression model is expected to produce normally distributed errors, so that most errors cluster around zero

hist(pesawat_forward$residuals)

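Judging from the histogram above, the errors appear to be roughly normally distributed

Let's check with the Shapiro test

# Shapiro-Wilk test on the residuals
#shapiro.test(pesawat_forward$residuals)

Note: the Shapiro test did not run. A likely reason is that shapiro.test() only accepts sample sizes between 3 and 5000, and we have 300153 residuals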
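One possible workaround is to run the test on a random subsample of at most 5000 residuals. The sketch below uses synthetic residuals drawn with rnorm() as an assumed stand-in for pesawat_forward$residuals; the seed and the sample size are arbitrary choices.

```r
set.seed(42)

# Synthetic residuals, same length as the flight data (stand-in only).
res <- rnorm(300153)

# The full vector is rejected because it exceeds the 5000-observation limit.
full_attempt <- tryCatch(shapiro.test(res), error = function(e) e)
inherits(full_attempt, "error")

# A random subsample of 5000 values is accepted.
sub_test <- shapiro.test(sample(res, 5000))
sub_test$p.value
```

The result depends on which subsample is drawn, and with thousands of points the test can reject normality for small deviations, so it is probably best read alongside the histogram.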

c. Homoscedasticity of Residuals

The errors produced by the model are expected to be scattered randomly, with constant variance. When visualized, the errors should show no pattern. This condition is called homoscedasticity

Let's check

# scatter plot
plot(x = pesawat_forward$fitted.values, y = pesawat_forward$residuals)

#abline(h = 0, col = "red") # horizontal line at 0

Check with bptest() from the lmtest package

library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(pesawat_forward)
## 
##  studentized Breusch-Pagan test
## 
## data:  pesawat_forward
## BP = 60232, df = 30, p-value < 2.2e-16

Breusch-Pagan hypothesis test:

H0: the errors have constant variance, i.e. homoscedasticity (no pattern)

H1: the errors do NOT have constant variance, i.e. heteroscedasticity (a pattern is present)

Desired condition: H0

If p-value > alpha, we fail to reject H0

If p-value < alpha, we reject H0

INSIGHT from the homoscedasticity test

With alpha = 0.05, the p-value (< 2.2e-16) is below alpha, so we reject H0

This means the errors do NOT have constant variance: the model is heteroscedastic
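To see what the studentized Breusch-Pagan statistic measures, it can be rebuilt by hand: regress the squared residuals on the model's predictors and multiply the R-squared of that auxiliary regression by n. This is a minimal sketch on the built-in mtcars data (an assumed stand-in for the flight data); the names `aux_data` and `sq_res` are hypothetical.

```r
# Built-in mtcars data stands in for the flight data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)

# Auxiliary regression: squared residuals on the same predictors.
aux_data <- mtcars
aux_data$sq_res <- residuals(fit)^2
aux <- lm(sq_res ~ wt + hp, data = aux_data)

bp_stat <- n * summary(aux)$r.squared   # studentized (Koenker) form
bp_df   <- length(coef(aux)) - 1        # predictors, excluding the intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

The df = 30 in the output above corresponds to the 30 non-intercept coefficients of the forward model.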

d. No Multicollinearity

Multicollinearity is the presence of strong correlation among the predictors. It is undesirable because it signals redundant predictors in the model: of a set of very strongly related variables, only one should be kept. Ideally, there is no multicollinearity.

We test this with the VIF (Variance Inflation Factor), using the vif() function from the car package:

VIF > 10: there is multicollinearity in the model

VIF < 10: there is no multicollinearity in the model

Desired condition: VIF < 10

library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.2
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(pesawat_forward)
##                      GVIF Df GVIF^(1/(2*Df))
## class            1.278908  1        1.130888
## stops            1.557179  2        1.117081
## days_left        1.005511  1        1.002752
## airline          1.903611  5        1.066493
## source_city      1.375482  5        1.032394
## destination_city 1.447302  5        1.037662
## arrival_time     1.371456  5        1.032091
## departure_time   1.284343  5        1.025341
## duration         1.870230  1        1.367563

All values are well below 10: there is no multicollinearity in the model
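For a numeric predictor, the VIF is 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on all the others. A minimal sketch on the built-in mtcars data; the dataset and the three predictors are assumptions for illustration.

```r
# Built-in mtcars data stands in for the flight data.
predictors <- c("wt", "hp", "disp")

# For each predictor, regress it on the remaining ones and convert
# the resulting R-squared into a variance inflation factor.
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual
```

For factors with several levels, vif() reports a generalized VIF instead, and the GVIF^(1/(2*Df)) column in the output above is the version adjusted for the number of coefficients.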

10. Conclusion

  1. Not all of the linear regression assumptions are satisfied
  2. Use a more complex model that can capture non-linear relationships

Note: we still need to confirm why the Shapiro test failed; the sample-size limit of shapiro.test() (3 to 5000 observations) is the likely cause

END