Read data into Data Frame

setwd("C:/Users/DELL/Desktop")
airline.df = read.csv("AirlinePricingData.csv")

1. Run linear-linear regression

attach(airline.df)
model <- lm(Price ~ AdvancedBookingDays+Airline+Departure+IsWeekend+IsDiwali+DepartureCityCode+FlyingMinutes+SeatPitch+SeatWidth,data=airline.df)
summary(model)
## 
## Call:
## lm(formula = Price ~ AdvancedBookingDays + Airline + Departure + 
##     IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes + 
##     SeatPitch + SeatWidth, data = airline.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2671.2 -1266.2  -456.4   517.4 11953.9 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4292.94    8897.87  -0.482   0.6298    
## AdvancedBookingDays    -87.70      12.47  -7.033 1.43e-11 ***
## AirlineIndiGo         -577.17     778.64  -0.741   0.4591    
## AirlineJet            -120.75     436.69  -0.277   0.7823    
## AirlineSpice Jet     -1118.38     697.85  -1.603   0.1101    
## DeparturePM           -589.79     275.23  -2.143   0.0329 *  
## IsWeekendYes          -345.92     408.06  -0.848   0.3973    
## IsDiwali              4346.80     568.14   7.651 2.90e-13 ***
## DepartureCityCodeDEL -1413.46     351.54  -4.021 7.38e-05 ***
## FlyingMinutes           38.97      29.27   1.331   0.1841    
## SeatPitch             -279.19     226.64  -1.232   0.2190    
## SeatWidth              868.58     507.54   1.711   0.0881 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2079 on 293 degrees of freedom
## Multiple R-squared:  0.2695, Adjusted R-squared:  0.2421 
## F-statistic: 9.828 on 11 and 293 DF,  p-value: 3.604e-15

2. Run log-linear regression

logmodel <- lm(log(Price) ~ AdvancedBookingDays+Airline+Departure+IsWeekend+IsDiwali+DepartureCityCode+FlyingMinutes+SeatPitch+SeatWidth,data=airline.df)
summary(logmodel)
## 
## Call:
## lm(formula = log(Price) ~ AdvancedBookingDays + Airline + Departure + 
##     IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes + 
##     SeatPitch + SeatWidth, data = airline.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57006 -0.19770 -0.05792  0.12935  1.24672 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.549474   1.243788   5.266 2.71e-07 ***
## AdvancedBookingDays  -0.014639   0.001743  -8.399 1.97e-15 ***
## AirlineIndiGo        -0.098622   0.108842  -0.906   0.3656    
## AirlineJet            0.001113   0.061043   0.018   0.9855    
## AirlineSpice Jet     -0.127169   0.097548  -1.304   0.1934    
## DeparturePM          -0.055844   0.038473  -1.452   0.1477    
## IsWeekendYes         -0.036748   0.057041  -0.644   0.5199    
## IsDiwali              0.744738   0.079418   9.377  < 2e-16 ***
## DepartureCityCodeDEL -0.264017   0.049140  -5.373 1.58e-07 ***
## FlyingMinutes         0.008717   0.004092   2.131   0.0340 *  
## SeatPitch            -0.032824   0.031681  -1.036   0.3010    
## SeatWidth             0.122364   0.070947   1.725   0.0856 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2906 on 293 degrees of freedom
## Multiple R-squared:  0.3671, Adjusted R-squared:  0.3433 
## F-statistic: 15.45 on 11 and 293 DF,  p-value: < 2.2e-16
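
In a log-linear model, a coefficient b means that a one-unit increase in the predictor multiplies the predicted Price by exp(b), holding the other predictors fixed. The percentage change is therefore 100 * (exp(b) - 1). The short sketch below applies this to three of the estimates printed above; the vector name `log_coefs` is only an illustrative choice.

```r
# Selected estimates copied from the summary(logmodel) output above
log_coefs <- c(AdvancedBookingDays  = -0.014639,
               IsDiwali             =  0.744738,
               DepartureCityCodeDEL = -0.264017)

# Percentage change in predicted Price per one-unit increase in each predictor
pct_change <- 100 * (exp(log_coefs) - 1)
print(round(pct_change, 1))
```

For small coefficients the percentage change is close to 100 * b, but for a large coefficient such as IsDiwali the exact exp() form differs noticeably.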

3. Compare the models. Which one is preferred? Why?

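The log-linear model is preferred: its adjusted R-squared (0.3433) is higher than that of the linear-linear model (0.2421). Because the two models have different dependent variables (Price versus log(Price)), their R-squared values are not strictly comparable, so this comparison is indicative rather than conclusive.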
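
One way to put the two models on the same footing is to back-transform the fitted values of the log model and compare prediction errors in price units. The sketch below does this on simulated data, since the airline file is not available here; the data frame `sim`, the helper `rmse`, and all simulated values are assumptions for illustration only. Note also that exp(fitted) ignores retransformation bias, so it is a rough comparison.

```r
# Simulated stand-in for the airline data (hypothetical values, not the real file)
set.seed(42)
n <- 300
sim <- data.frame(AdvancedBookingDays = sample(1:60, n, replace = TRUE))
sim$Price <- exp(8.8 - 0.015 * sim$AdvancedBookingDays + rnorm(n, sd = 0.3))

lin_fit <- lm(Price ~ AdvancedBookingDays, data = sim)
log_fit <- lm(log(Price) ~ AdvancedBookingDays, data = sim)

# Root mean squared error between observed and predicted prices
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Back-transform the log-model fits so both errors are in price units
rmse_lin <- rmse(sim$Price, fitted(lin_fit))
rmse_log <- rmse(sim$Price, exp(fitted(log_fit)))

print(c(linear = rmse_lin, log_linear = rmse_log))
```

With the objects defined in this document, the same idea would compare Price against fitted(model) and against exp(fitted(logmodel)).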

4a. Test the above linear-linear and log-linear models for normality, based on visual examination of their qqplots

par(mfrow=c(1,2))  # two Q-Q plots side by side
plot(model,2)
plot(logmodel,2)

4b. Test the above model for normality, based on (i) the Shapiro-Wilk test; (ii) the Anderson-Darling test (Linear-Linear Model)

library(nortest)
shapiro.test(Price)
## 
##  Shapiro-Wilk normality test
## 
## data:  Price
## W = 0.77653, p-value < 2.2e-16
ad.test(Price)
## 
##  Anderson-Darling normality test
## 
## data:  Price
## A = 19.412, p-value < 2.2e-16

Is the normality assumption violated in the linear-linear model?

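Yes. Both tests reject normality at the 1% level (p-value < 2.2e-16 in each case). Note that the tests above are applied to the raw Price variable, whereas the regression assumption concerns the model residuals, so this result is indicative rather than a direct test of the model.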

Shapiro-Wilk test and Anderson-Darling test for the Log-Linear Model

shapiro.test(log(Price))
## 
##  Shapiro-Wilk normality test
## 
## data:  log(Price)
## W = 0.93966, p-value = 8.002e-10
ad.test(log(Price))
## 
##  Anderson-Darling normality test
## 
## data:  log(Price)
## A = 5.8513, p-value = 1.978e-14

Is the normality assumption violated in the log-linear model?

The test statistics improve (Shapiro-Wilk W rises from 0.77653 to 0.93966; Anderson-Darling A falls from 19.412 to 5.8513), but normality is still rejected for log(Price): the p-value is far below 0.01 for both tests.
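
The normality assumption in linear regression refers to the model errors, so the same tests can also be applied to the residuals rather than to the response variable. The sketch below shows the pattern on simulated data; the variables `days` and `price` and their values are hypothetical, and the Anderson-Darling call runs only if the nortest package is installed.

```r
# Simulated stand-in data with right-skewed prices (hypothetical values)
set.seed(1)
n <- 300
days <- sample(1:60, n, replace = TRUE)
price <- exp(8.8 - 0.015 * days + rnorm(n, sd = 0.3))

fit_lin <- lm(price ~ days)
fit_log <- lm(log(price) ~ days)

# Test the residuals, which is where the normality assumption applies
sw_lin <- shapiro.test(residuals(fit_lin))
sw_log <- shapiro.test(residuals(fit_log))
print(c(linear = sw_lin$p.value, log_linear = sw_log$p.value))

# Anderson-Darling on residuals, only if the nortest package is available
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::ad.test(residuals(fit_lin)))
}
```

With the objects in this document, the equivalent calls would be shapiro.test(residuals(model)) and ad.test(residuals(logmodel)).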

5. Test the above models for linearity, based on visual examination of the Residual versus Fitted plot.

plot(model,1)

The linearity assumption is violated for the linear-linear model.

A model satisfies linearity if the smoothed line in the Residuals vs Fitted plot is approximately horizontal. For this model the line is not horizontal.

plot(logmodel,1)

The linearity assumption is also violated for the log-linear model.

Although the plot improves on the linear-linear model, the line is still not horizontal.
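
To make the visual check more concrete, the Residuals vs Fitted plot can be rebuilt by hand: plot residuals against fitted values and overlay a lowess smooth, which is similar to the red line drawn by plot(model, 1). The sketch below uses simulated data, so `days`, `price`, and `fit` are assumptions for illustration, not the airline data.

```r
# Simulated stand-in data (hypothetical values)
set.seed(7)
n <- 300
days <- sample(1:60, n, replace = TRUE)
price <- exp(8.8 - 0.015 * days + rnorm(n, sd = 0.3))
fit <- lm(price ~ days)

res <- residuals(fit)
fit_vals <- fitted(fit)

# Smooth trend of the residuals across the fitted values
trend <- lowess(fit_vals, res)

plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)      # reference line: residuals should scatter around zero
lines(trend, col = "red")   # a clearly curved smooth suggests non-linearity

# Numeric summary of how far the smooth drifts from zero
print(range(trend$y))
```

A smooth that stays close to zero across the whole range of fitted values supports linearity; a pronounced bend or slope does not.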

6. Run a suitable Box-Cox transformation of the dependent variable

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
priceTrans <- BoxCoxTrans(Price)
priceTrans
## Box-Cox Transformation
## 
## 305 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2607    4051    4681    5395    5725   18015 
## 
## Largest/Smallest: 6.91 
## Sample Skewness: 2.26 
## 
## Estimated Lambda: -0.8

What “lambda” value does Box-Cox transformation indicate?

The estimated lambda is -0.8.

Performing Box-Cox Transformation on Price and appending to dataset

PriceNew = predict(priceTrans, Price)
head(PriceNew)
## [1] 1.248375 1.249299 1.248351 1.248431 1.248931 1.248889
# append the transformed variable to the airline.df data frame
airline.df <- cbind(airline.df, PriceNew)
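
The PriceNew values printed above all sit just below 1.25. This follows from the Box-Cox formula (y^lambda - 1) / lambda: with lambda = -0.8 and large positive y, the term y^lambda is close to zero, so the result approaches -1 / lambda = 1.25. The sketch below writes the formula out in base R; the helper name `box_cox` is an illustrative choice, and the three prices are the minimum, median, and maximum from the input summary above.

```r
# Box-Cox transform; lambda = 0 is defined as the natural log
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Minimum, median and maximum Price from the input data summary
prices <- c(2607, 4681, 18015)
transformed <- box_cox(prices, lambda = -0.8)
print(transformed)

# With lambda < 0 the transform is bounded above by -1 / lambda
upper_bound <- -1 / (-0.8)
print(upper_bound)
```

Because the whole price range is squeezed into a very narrow interval, the coefficients and residuals of a model fitted to PriceNew are correspondingly small in magnitude.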

Redo the regression model using the transformed Price variable.

modelnew <- lm(PriceNew ~ AdvancedBookingDays+Airline+Departure+IsWeekend+IsDiwali+DepartureCityCode+FlyingMinutes+SeatPitch+SeatWidth,data=airline.df)
summary(modelnew)
## 
## Call:
## lm(formula = PriceNew ~ AdvancedBookingDays + Airline + Departure + 
##     IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes + 
##     SeatPitch + SeatWidth, data = airline.df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -7.868e-04 -1.930e-04 -2.246e-05  1.777e-04  9.443e-04 
## 
## Coefficients:
##                        Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)           1.246e+00  1.218e-03 1022.900  < 2e-16 ***
## AdvancedBookingDays  -1.551e-05  1.707e-06   -9.084  < 2e-16 ***
## AirlineIndiGo        -1.190e-04  1.066e-04   -1.117  0.26509    
## AirlineJet            6.273e-06  5.979e-05    0.105  0.91652    
## AirlineSpice Jet     -1.075e-04  9.555e-05   -1.125  0.26147    
## DeparturePM          -3.419e-05  3.769e-05   -0.907  0.36506    
## IsWeekendYes         -2.680e-05  5.587e-05   -0.480  0.63183    
## IsDiwali              7.987e-04  7.779e-05   10.267  < 2e-16 ***
## DepartureCityCodeDEL -2.844e-04  4.813e-05   -5.909 9.53e-09 ***
## FlyingMinutes         1.056e-05  4.008e-06    2.635  0.00887 ** 
## SeatPitch            -2.475e-05  3.103e-05   -0.798  0.42579    
## SeatWidth             1.143e-04  6.950e-05    1.645  0.10106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0002847 on 293 degrees of freedom
## Multiple R-squared:  0.4183, Adjusted R-squared:  0.3965 
## F-statistic: 19.16 on 11 and 293 DF,  p-value: < 2.2e-16
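
The coefficients above are on the Box-Cox scale, which makes them hard to read in price units. Fitted values can be mapped back to prices with the inverse transform (lambda * z + 1)^(1 / lambda). The sketch below checks the round trip on three prices from the input summary; `box_cox` and `inv_box_cox` are hypothetical helper names, not functions from caret.

```r
# Forward Box-Cox transform (lambda = 0 is the natural log)
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse Box-Cox transform, mapping transformed values back to the original scale
inv_box_cox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}

lambda <- -0.8
prices <- c(2607, 4681, 18015)

z <- box_cox(prices, lambda)      # transformed scale, as used for PriceNew
back <- inv_box_cox(z, lambda)    # should recover the original prices
print(back)
```

Applied to this document's objects, inv_box_cox(fitted(modelnew), -0.8) would give fitted values in the original price units.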

7a. Test the Box-Cox transformed model for normality, based on visual examination of a qqplot (and compare it with the linear-linear qqplot)

plot(model,2)

plot(modelnew,2)

The transformed model appears to have improved on the base linear-linear model. However, visual inspection alone cannot establish conclusively whether the normality assumption holds.

7b. Test the Box-Cox transformed model for normality, based on (i) Shapiro-Wilk test;

shapiro.test(PriceNew)
## 
##  Shapiro-Wilk normality test
## 
## data:  PriceNew
## W = 0.98545, p-value = 0.003551

(ii) Anderson-Darling test?

ad.test(PriceNew)
## 
##  Anderson-Darling normality test
## 
## data:  PriceNew
## A = 1.6763, p-value = 0.0002619

Is the normality assumption violated?

Yes. Both tests reject normality at the 5% level (Shapiro-Wilk p-value = 0.003551; Anderson-Darling p-value = 0.0002619), so the Box-Cox transformed Price still departs from normality, although much less than Price or log(Price).

7c. Test the Box-Cox transformed model for linearity, based on visual examination of the Residual versus Fitted plot (and compare it with the Residual versus Fitted plot based on the linear-linear model)

plot(model,1)

plot(modelnew,1)

The Residual versus Fitted plot for the transformed model appears drastically improved over the base linear-linear model. However, the smoothed line is still not flat enough to be called horizontal, so the linearity assumption is not fully satisfied.