We use the Flight Price Prediction dataset from Kaggle.com, uploaded by Shubham Bathwal. It contains flight booking records obtained from an online ticket-booking platform.
The goal of this exercise is to examine whether linear regression can be used to predict price.
Read the file
pesawat <- read.csv("Data_Pesawat.csv")
dim(pesawat)
## [1] 300153 12
head(pesawat)
names(pesawat)
## [1] "X" "airline" "flight" "source_city"
## [5] "departure_time" "stops" "arrival_time" "destination_city"
## [9] "class" "duration" "days_left" "price"
str(pesawat)
## 'data.frame': 300153 obs. of 12 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ airline : chr "SpiceJet" "SpiceJet" "AirAsia" "Vistara" ...
## $ flight : chr "SG-8709" "SG-8157" "I5-764" "UK-995" ...
## $ source_city : chr "Delhi" "Delhi" "Delhi" "Delhi" ...
## $ departure_time : chr "Evening" "Early_Morning" "Early_Morning" "Morning" ...
## $ stops : chr "zero" "zero" "zero" "zero" ...
## $ arrival_time : chr "Night" "Morning" "Early_Morning" "Afternoon" ...
## $ destination_city: chr "Mumbai" "Mumbai" "Mumbai" "Mumbai" ...
## $ class : chr "Economy" "Economy" "Economy" "Economy" ...
## $ duration : num 2.17 2.33 2.17 2.25 2.33 2.33 2.08 2.17 2.17 2.25 ...
## $ days_left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 5953 5953 5956 5955 5955 5955 6060 6060 5954 5954 ...
The data contain integer, numeric, and character columns.
Target: predict price
Check the distribution of the price variable
boxplot(pesawat$price)
Remove the unused variable: we do not need column X
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pesawat <- pesawat %>% select(-c(X))
head(pesawat)
dim(pesawat)
## [1] 300153 11
# check for missing values
anyNA(pesawat)
## [1] FALSE
# quick check of correlation and multicollinearity
library(GGally)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(pesawat, hjust=1, layout.exp=3, label=TRUE)
## Warning in ggcorr(pesawat, hjust = 1, layout.exp = 3, label = TRUE): data
## in column(s) 'airline', 'flight', 'source_city', 'departure_time', 'stops',
## 'arrival_time', 'destination_city', 'class' are not numeric and were ignored
We can only see correlations among the 3 numeric variables, which means we still have many categorical variables. Let’s check how varied the categorical variables are.
kategori_var_pesawat <- data.frame(
variable = c("airline","flight","source_city","departure_time","stops","arrival_time", "destination_city","class", "duration", "days_left"),
jumlah_kategori = c(n_distinct(pesawat$airline),
n_distinct(pesawat$flight),
n_distinct(pesawat$source_city),
n_distinct(pesawat$departure_time),
n_distinct(pesawat$stops),
n_distinct(pesawat$arrival_time),
n_distinct(pesawat$destination_city),
n_distinct(pesawat$class),
n_distinct(pesawat$duration),
n_distinct(pesawat$days_left))
)
kategori_var_pesawat
We can see a lot of variation in flight number, duration, and days_left.
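The per-column counts can also be computed without listing every column by hand. The following is a minimal sketch in base R, assuming a small made-up data frame `toy` that stands in for `pesawat`; the column values are hypothetical and chosen only for illustration.

```r
# Toy data frame standing in for `pesawat` (hypothetical values)
toy <- data.frame(
  airline   = c("Vistara", "Indigo", "Vistara", "AirAsia"),
  class     = c("Economy", "Economy", "Business", "Economy"),
  days_left = c(1L, 2L, 2L, 3L),
  stringsAsFactors = FALSE
)

# Count the distinct values of every column in one pass
distinct_counts <- sapply(toy, function(x) length(unique(x)))

# Arrange the counts as a two-column table, most varied column first
distinct_tbl <- data.frame(
  variable   = names(distinct_counts),
  n_distinct = as.integer(distinct_counts),
  row.names  = NULL
)
distinct_tbl <- distinct_tbl[order(distinct_tbl$n_distinct, decreasing = TRUE), ]
print(distinct_tbl)
```

Applied to the real data, the same `sapply()` call would cover all columns at once, so a newly added column cannot be forgotten.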
Remove flight number, since it does not appear to contribute much to the prediction
pesawat <- pesawat %>% select(-c(flight))
dim(pesawat)
## [1] 300153 10
We will select columns and convert data types, because there are 7 categorical variables
pesawat <- pesawat %>%
select(duration, days_left, price, airline, source_city, departure_time, stops, arrival_time, destination_city, class) %>%
mutate(airline=as.factor(airline),
source_city = as.factor(source_city),
departure_time = as.factor(departure_time),
stops = as.factor(stops),
arrival_time = as.factor(arrival_time),
destination_city = as.factor(destination_city),
class = as.factor(class)
)
str(pesawat)
## 'data.frame': 300153 obs. of 10 variables:
## $ duration : num 2.17 2.33 2.17 2.25 2.33 2.33 2.08 2.17 2.17 2.25 ...
## $ days_left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 5953 5953 5956 5955 5955 5955 6060 6060 5954 5954 ...
## $ airline : Factor w/ 6 levels "Air_India","AirAsia",..: 5 5 2 6 6 6 6 6 3 3 ...
## $ source_city : Factor w/ 6 levels "Bangalore","Chennai",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ departure_time : Factor w/ 6 levels "Afternoon","Early_Morning",..: 3 2 2 5 5 5 5 1 2 1 ...
## $ stops : Factor w/ 3 levels "one","two_or_more",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ arrival_time : Factor w/ 6 levels "Afternoon","Early_Morning",..: 6 5 2 1 5 1 5 3 5 3 ...
## $ destination_city: Factor w/ 6 levels "Bangalore","Chennai",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ class : Factor w/ 2 levels "Business","Economy": 2 2 2 2 2 2 2 2 2 2 ...
We have now converted all categorical data to factors, so the data frame has 3 data types: num, int, and factor
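The conversion can also be written so that every character column is handled automatically. This is a sketch in base R under the assumption that all character columns should become factors; the data frame `toy` and its values are hypothetical.

```r
# Toy data frame standing in for `pesawat` (hypothetical values)
toy <- data.frame(
  stops = c("zero", "one", "zero"),
  class = c("Economy", "Business", "Economy"),
  price = c(5953L, 25612L, 5956L),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

Numeric and integer columns are left untouched, which matches the result of the explicit `mutate()` call.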
We have hundreds of thousands of rows. Which airline is the most popular? It turns out that Vistara appears most often in the data
plot(pesawat$airline)
How is class distributed within each airline? It turns out that only Air_India and Vistara have business class
table(pesawat$airline, pesawat$class)
##
## Business Economy
## Air_India 32898 47994
## AirAsia 0 16098
## GO_FIRST 0 23173
## Indigo 0 43120
## SpiceJet 0 9011
## Vistara 60589 67270
What is the mean price for each airline in each class? It turns out that Vistara has the highest mean price in both business class and economy class
data_agg1 <- aggregate(price~airline+class,
                       data= pesawat,
                       FUN=mean)
data_agg1[order(data_agg1$price, decreasing=TRUE),]
range(pesawat$price)
## [1] 1105 123071
I also wonder how the price varies across airlines
boxplot(price ~ airline, data = pesawat,
        main = "Variation of price in each airline",
        xlab = "Airlines", ylab = "Price",
        ylim = c(1000, 140000)
)
How is price distributed between Economy and Business class?
boxplot(price ~ class, data = pesawat,
        main = "Variation of price in each class",
        xlab = "Class", ylab = "Price",
        ylim = c(1000, 140000)
)
How does price vary with the number of stops?
boxplot(price ~ stops, data = pesawat,
        main = "Variation of price in each stop",
        xlab = "Stops", ylab = "Price",
        ylim = c(1000, 140000)
)
We fit a simple regression using one variable, duration
model_one <- lm(formula = price ~ duration, data = pesawat)
summary(model_one)
##
## Call:
## lm(formula = price ~ duration, data = pesawat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36328 -15171 -11296 19615 101357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13012.958 79.964 162.7 <2e-16 ***
## duration 644.521 5.639 114.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22220 on 300151 degrees of freedom
## Multiple R-squared: 0.04171, Adjusted R-squared: 0.0417
## F-statistic: 1.306e+04 on 1 and 300151 DF, p-value: < 2.2e-16
The fit shows a statistically significant positive relationship between price and duration.
The slope is positive: 644.521.
However, the R-squared is very low (0.04171), so duration alone explains only about 4% of the variance in price
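To make the R-squared figure concrete, it can be recomputed from the residuals as 1 - RSS/TSS. The sketch below uses the built-in `mtcars` dataset purely as an illustration; it is not the flight data, and the variable names are assumptions.

```r
# Built-in dataset used purely for illustration
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2_manual <- 1 - rss / tss

# Should agree with the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```

An R-squared near 0 means the residual sum of squares is almost as large as the total sum of squares, that is, the model predicts barely better than the mean.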
What does the regression look like when visualized?
# predictor on the x-axis and response on the y-axis, to match price ~ duration
plot(pesawat$duration, pesawat$price)
abline(model_one)
I don’t think I see a nice regression line here :-(
Let’s try a regression with all predictors
model_all <- lm(formula = price ~ ., data = pesawat)
summary(model_all)
##
## Call:
## lm(formula = price ~ ., data = pesawat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36294 -3124 -390 3116 64223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.268e+04 8.299e+01 634.818 < 2e-16 ***
## duration 4.257e+01 2.344e+00 18.160 < 2e-16 ***
## days_left -1.310e+02 9.116e-01 -143.647 < 2e-16 ***
## airlineAirAsia -1.164e+02 6.296e+01 -1.849 0.0645 .
## airlineGO_FIRST 1.589e+03 5.456e+01 29.127 < 2e-16 ***
## airlineIndigo 1.991e+03 4.727e+01 42.118 < 2e-16 ***
## airlineSpiceJet 2.178e+03 7.705e+01 28.263 < 2e-16 ***
## airlineVistara 3.955e+03 3.111e+01 127.143 < 2e-16 ***
## source_cityChennai -6.748e+01 4.627e+01 -1.458 0.1448
## source_cityDelhi -1.406e+03 4.201e+01 -33.465 < 2e-16 ***
## source_cityHyderabad -1.679e+03 4.591e+01 -36.569 < 2e-16 ***
## source_cityKolkata 1.584e+03 4.447e+01 35.609 < 2e-16 ***
## source_cityMumbai -2.119e+02 4.183e+01 -5.065 4.09e-07 ***
## departure_timeEarly_Morning 8.357e+02 4.138e+01 20.195 < 2e-16 ***
## departure_timeEvening 7.338e+02 4.205e+01 17.452 < 2e-16 ***
## departure_timeLate_Night 1.694e+03 1.917e+02 8.840 < 2e-16 ***
## departure_timeMorning 8.563e+02 4.047e+01 21.160 < 2e-16 ***
## departure_timeNight 6.901e+02 4.558e+01 15.141 < 2e-16 ***
## stopstwo_or_more 2.105e+03 6.200e+01 33.955 < 2e-16 ***
## stopszero -7.586e+03 4.592e+01 -165.188 < 2e-16 ***
## arrival_timeEarly_Morning -7.720e+02 6.625e+01 -11.652 < 2e-16 ***
## arrival_timeEvening 9.247e+02 4.284e+01 21.585 < 2e-16 ***
## arrival_timeLate_Night 9.533e+02 6.973e+01 13.670 < 2e-16 ***
## arrival_timeMorning 4.766e+02 4.504e+01 10.582 < 2e-16 ***
## arrival_timeNight 1.143e+03 4.197e+01 27.221 < 2e-16 ***
## destination_cityChennai -2.198e+02 4.587e+01 -4.793 1.65e-06 ***
## destination_cityDelhi -1.554e+03 4.306e+01 -36.089 < 2e-16 ***
## destination_cityHyderabad -1.720e+03 4.547e+01 -37.828 < 2e-16 ***
## destination_cityKolkata 1.377e+03 4.390e+01 31.359 < 2e-16 ***
## destination_cityMumbai -2.877e+01 4.236e+01 -0.679 0.4971
## classEconomy -4.492e+04 3.011e+01 -1492.108 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6754 on 300122 degrees of freedom
## Multiple R-squared: 0.9115, Adjusted R-squared: 0.9114
## F-statistic: 1.03e+05 on 30 and 300122 DF, p-value: < 2.2e-16
Using all predictors instead of duration alone changes R-squared and adjusted R-squared substantially
summary(model_one)$adj.r.squared
## [1] 0.04170358
summary(model_all)$adj.r.squared
## [1] 0.9114476
Now let’s try stepwise regression
pesawat_backward <- step(object=model_all,
                         direction="backward",
                         trace=T)
## Start: AIC=5293494
## price ~ duration + days_left + airline + source_city + departure_time +
## stops + arrival_time + destination_city + class
##
## Df Sum of Sq RSS AIC
## <none> 1.3692e+13 5293494
## - duration 1 1.5046e+10 1.3707e+13 5293822
## - departure_time 5 2.6412e+10 1.3718e+13 5294063
## - arrival_time 5 7.3139e+10 1.3765e+13 5295083
## - destination_city 5 3.0012e+11 1.3992e+13 5299992
## - source_city 5 3.0496e+11 1.3997e+13 5300096
## - airline 5 8.2758e+11 1.4520e+13 5311099
## - days_left 1 9.4137e+11 1.4633e+13 5313450
## - stops 2 1.3022e+12 1.4994e+13 5320759
## - class 1 1.0157e+14 1.1526e+14 5932939
Interestingly, backward elimination stops at the first iteration, with AIC=5293494: the table shows that dropping any single predictor would raise the AIC
summary(pesawat_backward)$call
## lm(formula = price ~ duration + days_left + airline + source_city +
## departure_time + stops + arrival_time + destination_city +
## class, data = pesawat)
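The AIC values in the `step()` trace come from `extractAIC()`, which for a linear model is n * log(RSS / n) + 2 * (number of coefficients). This differs from `AIC()` by an additive constant, so only differences between models matter. A minimal sketch on the built-in `mtcars` data (an illustration only, not the flight data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)
rss <- deviance(fit)           # residual sum of squares for an lm
edf <- n - df.residual(fit)    # number of estimated coefficients

aic_manual <- n * log(rss / n) + 2 * edf

# extractAIC() returns c(edf, AIC); step() ranks models by the second value
aic_step <- extractAIC(fit)[2]
all.equal(aic_manual, aic_step)
```

Plugging the rounded values from the output above (n = 300153, RSS of about 1.3692e13, 31 coefficients) into the same formula gives roughly 5.2935e6, in line with the reported AIC.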
Now let’s try forward selection
model_no <- lm(price~1, data=pesawat)
For forward selection, we need to define the scope parameter, which sets the upper bound on the combination of predictors that may be added
# stepwise regression: forward selection
pesawat_forward <- step(object = model_no,
                        direction = "forward",
                        scope = list(upper=model_all),
                        trace = T)
## Start: AIC=6021083
## price ~ 1
##
## Df Sum of Sq RSS AIC
## + class 1 1.3601e+14 1.8621e+13 5385726
## + airline 5 3.4431e+13 1.2020e+14 5945493
## + duration 1 6.4493e+12 1.4819e+14 6008298
## + stops 2 6.3978e+12 1.4824e+14 6008405
## + arrival_time 5 2.5674e+12 1.5207e+14 6016068
## + days_left 1 1.3074e+12 1.5333e+14 6018537
## + departure_time 5 8.1801e+11 1.5382e+14 6019501
## + destination_city 5 4.9572e+11 1.5414e+14 6020130
## + source_city 5 3.7278e+11 1.5426e+14 6020369
## <none> 1.5463e+14 6021083
##
## Step: AIC=5385726
## price ~ class
##
## Df Sum of Sq RSS AIC
## + stops 2 2.4380e+12 1.6183e+13 5343609
## + airline 5 1.0548e+12 1.7566e+13 5368232
## + days_left 1 9.8292e+11 1.7638e+13 5369450
## + duration 1 8.6647e+11 1.7754e+13 5371425
## + destination_city 5 3.2284e+11 1.8298e+13 5380486
## + source_city 5 3.1659e+11 1.8304e+13 5380589
## + arrival_time 5 2.4840e+11 1.8372e+13 5381705
## + departure_time 5 4.8304e+10 1.8573e+13 5384956
## <none> 1.8621e+13 5385726
##
## Step: AIC=5343609
## price ~ class + stops
##
## Df Sum of Sq RSS AIC
## + days_left 1 9.7726e+11 1.5206e+13 5324915
## + airline 5 8.8737e+11 1.5295e+13 5326692
## + source_city 5 2.2466e+11 1.5958e+13 5339423
## + destination_city 5 2.0297e+11 1.5980e+13 5339830
## + arrival_time 5 1.5235e+11 1.6030e+13 5340780
## + departure_time 5 3.5602e+10 1.6147e+13 5342958
## + duration 1 2.2460e+10 1.6160e+13 5343194
## <none> 1.6183e+13 5343609
##
## Step: AIC=5324915
## price ~ class + stops + days_left
##
## Df Sum of Sq RSS AIC
## + airline 5 8.7598e+11 1.4330e+13 5307115
## + source_city 5 2.2143e+11 1.4984e+13 5320522
## + destination_city 5 1.9905e+11 1.5007e+13 5320970
## + arrival_time 5 1.3478e+11 1.5071e+13 5322252
## + departure_time 5 3.2946e+10 1.5173e+13 5324274
## + duration 1 1.1375e+10 1.5194e+13 5324692
## <none> 1.5206e+13 5324915
##
## Step: AIC=5307115
## price ~ class + stops + days_left + airline
##
## Df Sum of Sq RSS AIC
## + source_city 5 2.1548e+11 1.4114e+13 5302577
## + destination_city 5 2.0522e+11 1.4124e+13 5302796
## + arrival_time 5 7.4544e+10 1.4255e+13 5305560
## + duration 1 2.6865e+10 1.4303e+13 5306554
## + departure_time 5 2.3578e+10 1.4306e+13 5306631
## <none> 1.4330e+13 5307115
##
## Step: AIC=5302577
## price ~ class + stops + days_left + airline + source_city
##
## Df Sum of Sq RSS AIC
## + destination_city 5 3.1885e+11 1.3795e+13 5295729
## + arrival_time 5 7.0824e+10 1.4043e+13 5301077
## + departure_time 5 2.0491e+10 1.4094e+13 5302151
## + duration 1 1.9007e+10 1.4095e+13 5302175
## <none> 1.4114e+13 5302577
##
## Step: AIC=5295729
## price ~ class + stops + days_left + airline + source_city + destination_city
##
## Df Sum of Sq RSS AIC
## + arrival_time 5 6.0152e+10 1.3735e+13 5294427
## + departure_time 5 1.8982e+10 1.3776e+13 5295326
## + duration 1 1.1720e+10 1.3784e+13 5295476
## <none> 1.3795e+13 5295729
##
## Step: AIC=5294427
## price ~ class + stops + days_left + airline + source_city + destination_city +
## arrival_time
##
## Df Sum of Sq RSS AIC
## + departure_time 5 2.8131e+10 1.3707e+13 5293822
## + duration 1 1.6765e+10 1.3718e+13 5294063
## <none> 1.3735e+13 5294427
##
## Step: AIC=5293822
## price ~ class + stops + days_left + airline + source_city + destination_city +
## arrival_time + departure_time
##
## Df Sum of Sq RSS AIC
## + duration 1 1.5046e+10 1.3692e+13 5293494
## <none> 1.3707e+13 5293822
##
## Step: AIC=5293494
## price ~ class + stops + days_left + airline + source_city + destination_city +
## arrival_time + departure_time + duration
In contrast to backward elimination, which stopped at the first iteration with AIC=5293494, forward selection stops at the 10th iteration, ending at the same AIC=5293494
summary(pesawat_forward)$call
## lm(formula = price ~ class + stops + days_left + airline + source_city +
## destination_city + arrival_time + departure_time + duration,
## data = pesawat)
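The role of `scope` in forward selection can be seen on a smaller problem. The sketch below uses the built-in `mtcars` data and an arbitrarily chosen set of candidate predictors (`wt`, `hp`, `qsec`); both are assumptions for illustration, and the scope is given as formulas rather than as a fitted model.

```r
null_fit <- lm(mpg ~ 1, data = mtcars)   # intercept-only starting model

forward_fit <- step(
  null_fit,
  direction = "forward",
  # lower = smallest model allowed, upper = largest model allowed
  scope = list(lower = ~ 1, upper = ~ wt + hp + qsec),
  trace = 0                              # suppress the step-by-step log
)

formula(forward_fit)
```

Without an `upper` bound, `step()` has no candidate terms to add to the intercept-only model, so the search would end immediately.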
summary(model_no)$call
## lm(formula = price ~ 1, data = pesawat)
summary(model_one)$call
## lm(formula = price ~ duration, data = pesawat)
summary(model_all)$call
## lm(formula = price ~ ., data = pesawat)
summary(pesawat_backward)$call
## lm(formula = price ~ duration + days_left + airline + source_city +
## departure_time + stops + arrival_time + destination_city +
## class, data = pesawat)
summary(pesawat_forward)$call
## lm(formula = price ~ class + stops + days_left + airline + source_city +
## destination_city + arrival_time + departure_time + duration,
## data = pesawat)
Let’s compare the adjusted R-squared values
summary(model_no)$adj.r.squared
## [1] 0
summary(model_one)$adj.r.squared
## [1] 0.04170358
summary(model_all)$adj.r.squared
## [1] 0.9114476
summary(pesawat_backward)$adj.r.squared
## [1] 0.9114476
summary(pesawat_forward)$adj.r.squared
## [1] 0.9114476
Based on adjusted R-squared, the best models are the all-predictors, backward, and forward models, because they have the highest adjusted R-squared. All three contain the same set of predictors, so they are the same fit.
Now let’s check the RMSE. First, store the predictions
pred_model_none <- predict(object = model_no, newdata = pesawat)
pred_model_one <- predict(object = model_one, newdata = pesawat)
pred_model_all <- predict(object = model_all, newdata = pesawat)
pred_model_step_backward <- predict(pesawat_backward, newdata = pesawat)
pred_model_step_forward <- predict(pesawat_forward, newdata = pesawat)
Compute the RMSE
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.2.2
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
RMSE(y_pred=pred_model_none, y_true=pesawat$price)
## [1] 22697.73
RMSE(y_pred=pred_model_one, y_true=pesawat$price)
## [1] 22219.36
RMSE(y_pred=pred_model_all, y_true=pesawat$price)
## [1] 6753.998
RMSE(y_pred=pred_model_step_backward, y_true=pesawat$price)
## [1] 6753.998
RMSE(y_pred=pred_model_step_forward, y_true=pesawat$price)
## [1] 6753.998
Based on RMSE, the best models are the all-predictors, backward, and forward models, because they have the lowest RMSE
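RMSE is the square root of the mean squared difference between actual and predicted values, so it can be computed without an extra package. The helper `rmse()` below is a hypothetical name, and the built-in `mtcars` data is used only as a stand-in. Note that, as in the comparison above, the predictions are made on the same data the model was fitted to, so this is an in-sample error.

```r
# Hypothetical helper: root mean squared error
rmse <- function(y_true, y_pred) {
  sqrt(mean((y_true - y_pred)^2))
}

fit  <- lm(mpg ~ wt, data = mtcars)
pred <- predict(fit, newdata = mtcars)

rmse_fit  <- rmse(mtcars$mpg, pred)
# Baseline: always predict the mean, as an intercept-only model does
rmse_base <- rmse(mtcars$mpg, mean(mtcars$mpg))

c(model = rmse_fit, baseline = rmse_base)
```

Comparing against the mean-only baseline plays the same role as comparing against `model_no` above.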
Check correlation and multicollinearity
library(GGally)
ggcorr(pesawat, hjust=1, layout.exp=3, label=TRUE)
## Warning in ggcorr(pesawat, hjust = 1, layout.exp = 3, label = TRUE): data in
## column(s) 'airline', 'source_city', 'departure_time', 'stops', 'arrival_time',
## 'destination_city', 'class' are not numeric and were ignored
Let’s plot one of the best models, the forward model
resact <- data.frame(residual = pesawat_forward$residuals, fitted = pesawat_forward$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) +
  geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
I am not convinced that this model satisfies the linearity assumption, because the residuals clearly cluster in a pattern
A linear regression model is expected to produce normally distributed errors, so that most errors are concentrated around zero
hist(pesawat_forward$residuals)
Judging from the histogram above, the errors appear to follow a normal distribution
Let’s check with the Shapiro test
# Shapiro test on the residuals
#shapiro.test(pesawat_forward$residuals)
Note: the Shapiro test did not run successfully
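One likely reason for the failure is that `shapiro.test()` in base R only accepts between 3 and 5000 non-missing values, while this model has 300153 residuals. The sketch below reproduces the limit on simulated values (a hypothetical stand-in for the real residuals) and shows one possible workaround, testing a random subsample.

```r
set.seed(42)
res <- rnorm(10000)   # simulated stand-in for model residuals (hypothetical)

# With more than 5000 values the call stops with an error
too_big <- tryCatch(shapiro.test(res), error = function(e) conditionMessage(e))
print(too_big)

# Possible workaround: test a random subsample of at most 5000 residuals
sub <- sample(res, 5000)
sw  <- shapiro.test(sub)
sw$p.value
```

A subsample result depends on the random draw, so a normal Q-Q plot (`qqnorm()` with `qqline()`), which has no size limit, is a useful complement.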
The errors produced by the model are expected to be scattered randomly, that is, with constant variance. When visualized, the errors should show no pattern. This condition is known as homoscedasticity
Let’s check
# scatter plot
plot(x = pesawat_forward$fitted.values, y = pesawat_forward$residuals)
#abline(h = 0, col = "red") # horizontal line at zero
Check with bptest() from the lmtest package
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(pesawat_forward)
##
## studentized Breusch-Pagan test
##
## data: pesawat_forward
## BP = 60232, df = 30, p-value < 2.2e-16
Breusch-Pagan hypothesis test:
H0: the error variance is constant, i.e., homoscedasticity (no pattern)
H1: the error variance is NOT constant, i.e., heteroscedasticity (patterned)
Desired condition: H0
If p-value > alpha, fail to reject H0
If p-value < alpha, reject H0
INSIGHT from the homoscedasticity test
alpha = 0.05, p-value (< 2.2e-16) < alpha, so we reject H0,
meaning the error variance is NOT constant (heteroscedasticity)
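The studentized Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the same predictors and multiply the R-squared of that auxiliary regression by n. The sketch below does this in base R on the built-in `mtcars` data, which is an assumption for illustration only.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data    <- mtcars
aux_data$e2 <- residuals(fit)^2
aux <- lm(e2 ~ wt + hp, data = aux_data)

n       <- nobs(fit)
bp_stat <- n * summary(aux)$r.squared    # studentized (Koenker) statistic
bp_df   <- length(coef(aux)) - 1         # predictors, excluding the intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

The degrees of freedom equal the number of non-intercept coefficients, which is why the test on the flight model reports df = 30.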
Multicollinearity is the presence of strong correlation among predictors. It is undesirable because it indicates redundant predictors in the model; from a set of very strongly related variables, only one should be kept. The desired condition is no multicollinearity.
VIF (Variance Inflation Factor) test with the vif() function from the car package:
VIF > 10: multicollinearity is present in the model
VIF < 10: no multicollinearity in the model
Desired condition: VIF < 10
library(car)
## Warning: package 'car' was built under R version 4.2.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.2
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(pesawat_forward)
## GVIF Df GVIF^(1/(2*Df))
## class 1.278908 1 1.130888
## stops 1.557179 2 1.117081
## days_left 1.005511 1 1.002752
## airline 1.903611 5 1.066493
## source_city 1.375482 5 1.032394
## destination_city 1.447302 5 1.037662
## arrival_time 1.371456 5 1.032091
## departure_time 1.284343 5 1.025341
## duration 1.870230 1 1.367563
All VIF values are < 10, so there is no multicollinearity in the model
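For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below implements this in base R; `vif_manual()` is a hypothetical helper and `mtcars` is used only as an illustration. It handles numeric predictors only; for multi-level factors `car::vif()` reports a generalized VIF instead, which this sketch does not reproduce.

```r
# Hypothetical helper: VIF for each numeric predictor in `predictors`
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on the remaining predictors
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
vifs
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values mean more of its variance is explained by them.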
Note: we still need to check why the Shapiro test did not work
END