library(s20x)
Delays.df = read.table(file.choose(), header = TRUE)
head(Delays.df)
# Store the departure delays (in hours) as the response variable
hours = Delays.df$dep_delay
# Label each observation with its airline
# (assumes the rows are ordered by airline, with 40 flights each)
airline = rep(c("Alaska", "American", "Delta", "United"), c(40, 40, 40, 40))
# Treat airline as a factor
airline = factor(airline)
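The labelling step above only works if the file is ordered by airline with exactly 40 flights each. A minimal sketch of a sanity check follows, using a placeholder vector in place of the real delays; the names `hours_demo` and `airline_demo` are hypothetical and are not part of the analysis.

```r
# Placeholder standing in for Delays.df$dep_delay (160 rows assumed); not real data
hours_demo <- seq_len(160)

# Same labels as above; `each = 40` is equivalent to times = c(40, 40, 40, 40)
airline_demo <- factor(rep(c("Alaska", "American", "Delta", "United"), each = 40))

# The labels only line up if there is exactly one label per observation
stopifnot(length(airline_demo) == length(hours_demo))

# Observations per airline; each count should be 40 under the assumed layout
counts <- table(airline_demo)
print(counts)
```

If the file already contains an airline column, using that column directly would avoid relying on the row order.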
onewayPlot(hours ~ airline)
boxplot(hours ~ airline)
Explanatory Analysis: Several data points lie in the upper part of each plot, which indicates a positive (right) skew in the delay hours for all airlines. The centres of the plots lie roughly between 5 and 20 delayed hours.
delay.fit1 = lm(hours ~ airline, data = Delays.df)
eovcheck(delay.fit1)
The residuals are heavily skewed and their spread is not even, so the equality of variance assumption is not satisfied.
normcheck(delay.fit1)
The residuals are strongly right skewed (this also shows in the upper tails of the box plot above), so the mean is not a good measure of centre and the normality assumption is not satisfied. A multiplicative model (log-transformed response) should be used in this case in order to satisfy the normality assumption.
delay.fit2 = lm(log(hours) ~ airline, data = Delays.df)
eovcheck(delay.fit2)
After the log transformation, the spread of the residuals is much more even across the airlines in the multiplicative model than in the original model. This satisfies the model assumption of equality of variance.
normcheck(delay.fit2)
The residuals of the logged model look approximately normal: the points follow the straight line in the Q-Q plot and the histogram follows the bell-shaped curve. Hence, the normality assumption for the model is satisfied.
cooks20x(delay.fit2)
All Cook's distances are below 0.4, hence there are no influential data points.
summary(delay.fit2)
Call:
lm(formula = log(hours) ~ airline, data = Delays.df)
Residuals:
    Min      1Q  Median      3Q     Max
-2.9830 -0.8226 -0.1271  1.0723  3.4970
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       2.4062     0.2188  10.999   <2e-16 ***
airlineAmerican   0.5768     0.3094   1.864   0.0641 .
airlineDelta     -0.7155     0.3094  -2.313   0.0220 *
airlineUnited     0.2859     0.3094   0.924   0.3569
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.384 on 156 degrees of freedom
Multiple R-squared: 0.1098, Adjusted R-squared: 0.0927
F-statistic: 6.415 on 3 and 156 DF, p-value: 0.0003992
anova(delay.fit2)
Analysis of Variance Table
Response: log(hours)
           Df Sum Sq Mean Sq F value    Pr(>F)
airline     3  36.84 12.2800  6.4148 0.0003992 ***
Residuals 156 298.64  1.9143
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
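The ANOVA table can be checked by hand. The sketch below recomputes the F-statistic, its p-value and the multiple R-squared from the rounded sums of squares printed above, so small rounding differences from the R output are expected. The variable names are illustrative only.

```r
# Values copied from the ANOVA table above (already rounded by R)
ss_airline <- 36.84
df_airline <- 3
ss_resid <- 298.64
df_resid <- 156

# Mean squares are sums of squares divided by their degrees of freedom
ms_airline <- ss_airline / df_airline
ms_resid <- ss_resid / df_resid

# F-statistic and its upper-tail p-value under the F(3, 156) distribution
f_stat <- ms_airline / ms_resid
p_value <- pf(f_stat, df_airline, df_resid, lower.tail = FALSE)

# Proportion of variation in log(hours) explained by airline
r_squared <- ss_airline / (ss_airline + ss_resid)

# Residual standard error is the square root of the residual mean square
resid_se <- sqrt(ms_resid)

print(c(F = f_stat, p = p_value, R2 = r_squared, RSE = resid_se))
```

These values should agree with the F-statistic, p-value, multiple R-squared and residual standard error reported by summary(delay.fit2), up to rounding.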
multipleComp(delay.fit2)
                    Estimate Tukey.L Tukey.U Tukey.p
Alaska - American -0.5768106 -1.3803  0.2266  0.2477
Alaska - Delta     0.7155220 -0.0879  1.5190  0.0995
Alaska - United   -0.2858957 -1.0893  0.5175  0.7920
American - Delta   1.2923326  0.4889  2.0958  0.0003
American - United  0.2909149 -0.5125  1.0944  0.7832
Delta - United    -1.0014177 -1.8049 -0.1980  0.0080
The p-values are not back-transformed, as an exponentiated p-value has no meaning. From the Tukey-adjusted p-values, there is no evidence of a difference between Alaska and American, Alaska and Delta, Alaska and United, or American and United airlines, as each of these p-values is above 0.05. There is evidence of a difference between American and Delta (p = 0.0003) and between Delta and United (p = 0.0080).
exp(multipleComp(delay.fit2))[, 1:3]
                   Estimate   Tukey.L   Tukey.U
Alaska - American 0.5616869 0.2515031 1.2543280
Alaska - Delta    2.0452540 0.9158525 4.5676553
Alaska - United   0.7513410 0.3364519 1.6778278
American - Delta  3.6412703 1.6305217 8.1319439
American - United 1.3376507 0.5989962 2.9873897
Delta - United    0.3673583 0.1644909 0.8203699
The back-transformed estimates and intervals are now ratios of median delay hours, so they can be interpreted directly and used in the report.
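The back-transformation can be reproduced by hand for the two significant comparisons. The sketch below copies the log-scale values from the multipleComp() output above (the interval limits are the rounded values, so the last digits may differ slightly) and exponentiates them. The object names are illustrative only.

```r
# Log-scale Tukey estimates and interval limits, copied from the output above
log_scale <- rbind(
  "American - Delta" = c(Estimate = 1.2923326, Tukey.L = 0.4889, Tukey.U = 2.0958),
  "Delta - United"   = c(Estimate = -1.0014177, Tukey.L = -1.8049, Tukey.U = -0.1980)
)

# A difference in mean log(hours) becomes a ratio of medians after exp()
ratio_scale <- exp(log_scale)
print(round(ratio_scale, 2))

# Reversing a comparison inverts the ratio and swaps the interval limits
united_vs_delta <- setNames(
  1 / ratio_scale["Delta - United", c("Estimate", "Tukey.U", "Tukey.L")],
  c("Estimate", "Lower", "Upper")
)
print(round(united_vs_delta, 2))
```

Inverting the Delta - United row is optional; it simply expresses the same comparison as a ratio above 1, which some readers find easier to interpret.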
Methods and Assumption Checks
Delay hours were recorded for flights operated by 4 different airlines. The delay hours are right skewed, as shown in the box plot and in the normality check for the initial model, so a multiplicative model (log-transformed response) was used in order to satisfy the model assumptions. The observations appear to be independent. For the multiplicative model, the normality and equality of variance assumptions were satisfied and there were no influential points in the data set. The fitted model is:

$$\log(\text{Hours delayed}_i) = \beta_0 + \beta_1 \times \text{Airline}_{\text{American},i} + \beta_2 \times \text{Airline}_{\text{Delta},i} + \beta_3 \times \text{Airline}_{\text{United},i} + \varepsilon_i$$

where $\text{Airline}_{\text{American},i}$, $\text{Airline}_{\text{Delta},i}$ and $\text{Airline}_{\text{United},i}$ equal 1 if the airline for observation $i$ is American, Delta or United respectively, and 0 otherwise, and $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. The baseline is Alaska airline.
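As an illustration of how the model equation is read, the sketch below plugs the rounded coefficients from summary(delay.fit2) into it and back-transforms to get a fitted median delay for each airline. The object names are illustrative, and the results are approximate because the coefficients are rounded.

```r
# Coefficients on the log scale, copied from summary(delay.fit2) above
b0 <- 2.4062  # intercept: mean log(hours) for the baseline, Alaska
b <- c(Alaska = 0, American = 0.5768, Delta = -0.7155, United = 0.2859)

# Fitted mean of log(hours) for each airline: beta_0 plus that airline's effect
log_means <- b0 + b

# exp() of a mean on the log scale estimates the median on the original scale
fitted_medians <- exp(log_means)
print(round(fitted_medians, 1))
```

The ratio of any two of these fitted medians matches the corresponding back-transformed Tukey estimate; for example, American divided by Delta gives about 3.64.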
Executive Summary
The delay times of departing flights for 4 different airlines in America were recorded in order to quantify the relationship, if there is any, between the average delay time of departing flights that are delayed and the airline that operated those flights. The delay hours were log-transformed in order to satisfy the assumptions of the model, so the results are multiplicative and are expressed in terms of medians. There is no evidence of a difference in median delay hours between Alaska and American airlines, Alaska and Delta airlines, Alaska and United airlines, or American and United airlines. There is evidence of a difference in median delay hours between American and Delta airlines and between Delta and United airlines. It is estimated that the median delay for American airline is 3.64 times that for Delta airline (between 1.63 and 8.13 times), and that the median delay for Delta airline is 0.37 times that for United airline (between 0.16 and 0.82 times). This suggests that Delta airline generally has shorter delays than American or United airline.