Airline Costs

Background:

We will examine data on airline costs. The following data were originally presented in the classic study by Proctor and Duncan. It is worth noting there is also a variable representing the Total asses of the airline and a variable representing the Type of airline though we will not use them in the regression modeling.

1.Suppose that (logged) speed of plane is normally distributed with mean and standard deviation exactly as calculated from the data. What is the probability of a value greater than 2.25 logged miles per hour?

result <- prob_norm(mean = 2.202, stdev = 0.071, ub = 2.25)
summary(result)

## Probability calculator
## Distribution: Normal
## Mean        : 2.202 
## St. dev     : 0.071 
## Lower bound : -Inf 
## Upper bound : 2.25 
## 
## P(X < 2.25) = 0.75
## P(X > 2.25) = 0.25

plot(result)

The proability of a value greater than 2.25 logged miles per hour is 0.25

If (logged) speed of plane is normally distributed with mean and standard deviation exactly as calculated from the data, what is the 95% central interval for (logged) speed of plane?

95% of the time, the speed of the plane should fall between 2.063 and 2.341 logged miles per hour.

result <- prob_norm(
  mean = 2.202, 
  stdev = 0.071, 
  plb = 0.025, 
  pub = 0.975
)
summary(result, type = "probs")

## Probability calculator
## Distribution: Normal
## Mean        : 2.202 
## St. dev     : 0.071 
## Lower bound : 0.025 
## Upper bound : 0.975 
## 
## P(X < 2.063) = 0.025
## P(X > 2.063) = 0.975
## P(X < 2.341) = 0.975
## P(X > 2.341) = 0.025
## P(2.063 < X < 2.341)     = 0.95
## 1 - P(2.063 < X < 2.341) = 0.05

plot(result, type = "probs")

Provide a graphical description of Total Operating Costs per revenue ton-mile and a 95% confidence interval for the average.

result <- single_mean(
  AirlineCosts, 
  var = "TotalOperatingCosts"
)
summary(result)

## Single mean test
## Data      : AirlineCosts 
## Variable  : TotalOperatingCosts 
## Confidence: 0.95 
## Null hyp. : the mean of TotalOperatingCosts = 0 
## Alt. hyp. : the mean of TotalOperatingCosts is not equal to 0 
## 
##     mean  n n_missing      sd     se     me
##  113.506 31         0 142.705 25.631 52.344
## 
##     diff     se t.value p.value df   2.5%   97.5%    
##  113.506 25.631   4.429  < .001 30 61.162 165.851 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(result, plots = "hist", custom = FALSE)

Operating costs per revenue ton-mile are symmetric. TRUE or FALSE

False, the Total Operating Costs is not symetric, and it averages 61.162 to 165.851 cents

Provide an appropriate numerical summary of the data given your answer to the previous question.

result <- explore(
  AirlineCosts, 
  vars = "TotalOperatingCosts", 
  fun = c("n_obs", "min", "p25", "median", "p75", "max"), 
  nr = Inf
)

## Warning: attributes are not identical across measure variables;
## they will be dropped

summary(result)

## Explore
## Data        : AirlineCosts 
## Functions   : n_obs, min, p25, median, p75, max 
## Top         : Function 
## 
##             variable n_obs    min    p25 median     p75     max
##  TotalOperatingCosts    31 42.300 50.800 75.400 120.750 820.900

# dtab(result) %>% render()

6.Which airline has the highest total operating costs per revenue ton-mile? >Wiggins (820.90)

Which airline has the lowest total operating costs per revenue ton-mile? >United (42.30)

Provide a 95% confidence interval for the average difference between Assets and Adjusted Assets. Are they different with 95% confidence?

result <- compare_means(
  AirlineCosts, 
  var1 = "Assets", 
  var2 = "AdjustedAssets", 
  samples = "paired"
)
summary(result, show = TRUE)

## Pairwise mean comparisons (t-test)
## Data      : AirlineCosts 
## Variables : Assets, AdjustedAssets 
## Samples   : paired 
## Confidence: 0.95 
## Adjustment: None 
## 
##                  mean  n n_missing    sd    se    me
##          Assets 1.669 31         0 0.776 0.139 0.285
##  AdjustedAssets 1.644 31         0 0.772 0.139 0.283
## 
##  Null hyp.                 Alt. hyp.                            diff  p.value
##  Assets = AdjustedAssets   Assets not equal to AdjustedAssets   0.025 < .001 
##  se    t.value df 2.5%  97.5%    
##  0.006 4.004   30 0.012 0.038 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The average difference between Assets and Adjusted Assets ranges from 0.012 to 0.038

8.There is no difference in (logged) average daily flight time per plane between the two Types of airlines. Difference or No difference? If different, which is larger? What is the 95% confidence interval for the difference in means?

The average difference between Daily flight time per plane and the type of plane ranges from 0.1 to 0.294 hours logged. Being the Type Comp [Comprehensive] the larger one.

result <- compare_means(
  AirlineCosts, 
  var1 = "Type", 
  var2 = "FlightTimeperPlane", 
  samples = "paired"
)
summary(result, show = TRUE)

## Pairwise mean comparisons (t-test)
## Data      : AirlineCosts 
## Variables : Type, FlightTimeperPlane 
## Samples   : independent (obs. per level unequal) 
## Confidence: 0.95 
## Adjustment: None 
## 
##  Type  mean  n n_missing    sd    se    me
##  Comp 0.853 22         0 0.058 0.012 0.026
##    SH 0.656  9         0 0.124 0.041 0.095
## 
##  Null hyp.   Alt. hyp.              diff  p.value se    t.value df    2.5%
##  Comp = SH   Comp not equal to SH   0.197 0.001   0.043 4.572   9.469 0.1 
##  97.5%   
##  0.294 **
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.With 95% confidence, are load factor and (logged) flight length linearly related? Related or Not related Provide graphical and regression evidence.

Yes, they are related, the slope is not zero with 95% of confidence.

visualize(
  AirlineCosts, 
  xvar = "LoadFactor", 
  yvar = "FlightLength", 
  type = "scatter", 
  nrobs = -1, 
  check = c("line", "loess"), 
  custom = FALSE
)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

result <- regress(
  AirlineCosts, 
  rvar = "FlightLength", 
  evar = "LoadFactor"
)
summary(result)

## Linear regression (OLS)
## Data     : AirlineCosts 
## Response variable    : FlightLength 
## Explanatory variables: LoadFactor 
## Null hyp.: the effect of LoadFactor on FlightLength is zero
## Alt. hyp.: the effect of LoadFactor on FlightLength is not zero
## 
##              coefficient std.error t.value p.value    
##  (Intercept)       1.361     0.089  15.211  < .001 ***
##  LoadFactor        1.438     0.180   7.969  < .001 ***
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-squared: 0.687,  Adjusted R-squared: 0.676 
## F-statistic: 63.504 df(1,29), p.value < .001
## Nr obs: 31

A random sample of 85 MBA students responded Yes with probability 0.9 to the question, “Are statistics the most awesome tools ever created?”.

result <- prob_binom(n = 85, p = 0.9, plb = 0.25, pub = 0.975)
summary(result, type = "probs")

## Probability calculator
## Distribution: Binomial
## n           : 85 
## p           : 0.9 
## Mean        : 76.5 
## St. dev     : 2.766 
## Lower bound : 0.25 (75)
## Upper bound : 0.975 (81)
## 
## P(X  = 75) = 0.116
## P(X  < 75) = 0.228
## P(X <= 75) = 0.343
## P(X  > 75) = 0.656
## P(X >= 75) = 0.772
## P(X  = 81) = 0.04
## P(X  < 81) = 0.936
## P(X <= 81) = 0.975
## P(X  > 81) = 0.024
## P(X >= 81) = 0.064
## P(75 <= X <= 81)     = 0.748
## 1 - P(75 <= X <= 81) = 0.252

plot(result, type = "probs")

With 95% central probability, what range of Yes’s should we expect?

From 75 to 81 Yes responses.

Of course, this probability is perhaps a bit optimistic. 95% of the time, we should see 79 or more Yes’s if the true probability is 0.9.

Multiple Regression

We will now model Total Operating Costs for these airlines.

Estimate and report a regression model of Total Operating Costs [measured as Total Operating Expense per Revenue Ton-Mile, in cents] as a function of the drivers listed in the introduction in the form that they are given [all are logged but the load factor; you are to omit Total Assets and Type].

result <- regress(
  AirlineCosts, 
  rvar = "TotalOperatingCosts", 
  evar = c(
    "FlightLength", "PlaneSpeed", "FlightTimeperPlane", "PopulationServed", 
    "LoadFactor", "Capacity", "AdjustedAssets"
  )
)
summary(result, sum_check = c("rmse", "sumsquares", "confint"))

## Linear regression (OLS)
## Data     : AirlineCosts 
## Response variable    : TotalOperatingCosts 
## Explanatory variables: FlightLength, PlaneSpeed, FlightTimeperPlane, PopulationServed, LoadFactor, Capacity, AdjustedAssets 
## Null hyp.: the effect of x on TotalOperatingCosts is zero
## Alt. hyp.: the effect of x on TotalOperatingCosts is not zero
## 
##                     coefficient std.error t.value p.value    
##  (Intercept)             -4.698   989.286  -0.005   0.996    
##  FlightLength           351.792   166.639   2.111   0.046 *  
##  PlaneSpeed             -24.543   544.152  -0.045   0.964    
##  FlightTimeperPlane    -302.651   136.275  -2.221   0.036 *  
##  PopulationServed        41.589    32.109   1.295   0.208    
##  LoadFactor            -597.106   157.684  -3.787  < .001 ***
##  Capacity              -645.592   109.856  -5.877  < .001 ***
##  AdjustedAssets          70.457    56.297   1.252   0.223    
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-squared: 0.881,  Adjusted R-squared: 0.845 
## F-statistic: 24.424 df(7,23), p.value < .001
## Nr obs: 31 
## 
## Prediction error (RMSE):  48.342 
## Residual st.dev   (RSD):  56.123 
## 
## Sum of squares:
##            df          SS
## Regression  7 538,494.406
## Error      23  72,443.992
## Total      30 610,938.399
## 
##                    coefficient      2.5%    97.5%      +/-
## (Intercept)             -4.698 -2051.191 2041.795 2046.493
## FlightLength           351.792     7.074  696.510  344.718
## PlaneSpeed             -24.543 -1150.208 1101.122 1125.665
## FlightTimeperPlane    -302.651  -584.557  -20.745  281.906
## PopulationServed        41.589   -24.834  108.012   66.423
## LoadFactor            -597.106  -923.301 -270.911  326.195
## Capacity              -645.592  -872.847 -418.337  227.255
## AdjustedAssets          70.457   -46.003  186.917  116.460

plot(result, plots = "scatter", nrobs = -1, custom = FALSE)

AirlineCosts <- store(AirlineCosts, result, name = "Resids")
shapiro.test(AirlineCosts$Resids)

## 
##  Shapiro-Wilk normality test
## 
## data:  AirlineCosts$Resids
## W = 0.98881, p-value = 0.9817

What factors are related to Total Operating Costs?

FlightLength, FlightTimeperPlane, LoadFactor, Capacity

What factors are unrelated to Total Operating Costs? Provide 95% confidence intervals for all of the slopes and the intercept and interpret them to answer the previous question.

PlaneSpeed (-1150.208 to 1101.122), PopulationServed (-24.834 to 108.012), AdjustedAssets ( -46.003 to 186.917)

TRUE OR FALSE: With 95% confidence, as (logged) capacity increases, total operating costs increase.

FALSE, they have a negative correlation.

TRUE OR FALSE: With 95% confidence, as the (logged) population served increases, Total Operating Costs increase.

FALSE, It is unrelated to Total Operating Costs

TRUE OR FALSE: Total Operating Costs are never more than $1.10 above their predicted/fitted value?

TRUE

TRUE OR FALSE: Total Operating Costs are never more than $1.10 below their predicted/fitted value?

FALSE

## filter and sort the dataset
AirlineCosts %>%
 arrange(Resids) %>%
 select(Airline, TotalOperatingCosts, Resids) %>%
 dtab(dec = 2, nr = 100) %>% render()

What percentage of variance in Total Operating Costs is explained by this set of predictors?

It is 0.881 or 88.1%

With 95% confidence, does the overall regression [all included predictors] explain variance? YES OR NO. What statistic answers this question and what is the reported p-value for that statistic.

Yes, F the p-value is less than 0.001

What is the residual standard error; what is the metric and what does it say about the fit of our model? Do we want it bigger or smaller?

The residual standard error is the zise of the average miss from the prediction of the regression where the averaging is by degrees of freedom. It is also positive square root of the mean square error. When the residual standard error is exactly 0 then the model fits the data perfectly (likely due to overfitting). If the residual standard error can not be shown to be significantly different from the variability in the unconditional response, then there is little evidence to suggest the linear model has any predictive ability.

For this model, we are 56.12 cents away from what is expected given the regression prediction. We want it small; that shows precise predictions.

Suppose that the quality of management can be measured by the amount of unexplained total operating costs.

What company is best managed?

Lake Central

What company is worst managed?

Wiggins

Are the residuals from this regression normally distributed? Provide graphical evidence and, preferably, some formal test.

The residuals from this regression seen normal, based on the Shapiro Test and the graphics below.

plot(result, plots = "dashboard", lines = "line", nrobs = -1, custom = FALSE)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Without regard to your previous answer, assume the residuals are normal as described in the regression table. What is the probability of within plus or minus $1.00 of the true value?

The probability is 0.683

result2 <- prob_norm(mean = 0, stdev = 1, lb = -1, ub = 1)
summary(result2)

## Probability calculator
## Distribution: Normal
## Mean        : 0 
## St. dev     : 1 
## Lower bound : -1 
## Upper bound : 1 
## 
## P(X < -1) = 0.159
## P(X > -1) = 0.841
## P(X < 1) = 0.841
## P(X > 1) = 0.159
## P(-1 < X < 1)     = 0.683
## 1 - P(-1 < X < 1) = 0.317

plot(result2)

Predict the range of average costs for Braniff Airlines with 95% confidence; if you can, modify the predictions to also show the range of possible costs for Braniff Airlines with 95% confidence.

The range of possible cots is between 13.310 to 74.921 +/- 30.805

pred <- predict(result, pred_data = Braniff)
print(pred, n = 10)

## Linear regression (OLS)
## Data                 : AirlineCosts 
## Response variable    : TotalOperatingCosts 
## Explanatory variables: FlightLength, PlaneSpeed, FlightTimeperPlane, PopulationServed, LoadFactor, Capacity, AdjustedAssets 
## Interval             : confidence 
## Prediction dataset   : Braniff 
## 
##  FlightLength PlaneSpeed FlightTimeperPlane PopulationServed LoadFactor
##         2.246      2.260              0.820            4.074      0.557
##  Capacity AdjustedAssets Prediction   2.5%  97.5%    +/-
##     0.664          2.189     44.115 13.310 74.921 30.805

Braniff <- store(Braniff, pred, name = "pred_reg")

What is the single best predictor among these of airline operating costs?

Capacity

A “streamlined” model of the drivers of airline operating costs is your next task. Streamline the regression model using a stepwise regression technique to yield one final model of total operating costs.

result <- regress(
  AirlineCosts, 
  rvar = "TotalOperatingCosts", 
  evar = c(
    "FlightLength", "PlaneSpeed", "FlightTimeperPlane", "PopulationServed", 
    "LoadFactor", "Capacity", "AdjustedAssets"
  ), 
  check = "stepwise-backward"
)

## Start:  AIC=256.45
## TotalOperatingCosts ~ FlightLength + PlaneSpeed + FlightTimeperPlane + 
##     PopulationServed + LoadFactor + Capacity + AdjustedAssets
## 
##                      Df Sum of Sq    RSS    AIC
## - PlaneSpeed          1         6  72450 254.46
## <none>                             72444 256.45
## - AdjustedAssets      1      4933  77377 256.50
## - PopulationServed    1      5284  77728 256.64
## - FlightLength        1     14038  86482 259.94
## - FlightTimeperPlane  1     15536  87980 260.48
## - LoadFactor          1     45165 117609 269.48
## - Capacity            1    108778 181222 282.88
## 
## Step:  AIC=254.46
## TotalOperatingCosts ~ FlightLength + FlightTimeperPlane + PopulationServed + 
##     LoadFactor + Capacity + AdjustedAssets
## 
##                      Df Sum of Sq    RSS    AIC
## <none>                             72450 254.46
## - AdjustedAssets      1      5093  77543 254.56
## - PopulationServed    1      5365  77815 254.67
## - FlightTimeperPlane  1     15663  88113 258.52
## - FlightLength        1     24665  97115 261.54
## - LoadFactor          1     46190 118640 267.75
## - Capacity            1    114099 186550 281.78

summary(result, sum_check = "confint")

## ----------------------------------------------------
## Backward stepwise selection of variables
## ----------------------------------------------------
## Linear regression (OLS)
## Data     : AirlineCosts 
## Response variable    : TotalOperatingCosts 
## Explanatory variables: FlightLength, PlaneSpeed, FlightTimeperPlane, PopulationServed, LoadFactor, Capacity, AdjustedAssets 
## Null hyp.: the effect of x on TotalOperatingCosts is zero
## Alt. hyp.: the effect of x on TotalOperatingCosts is not zero
## 
##                     coefficient std.error t.value p.value    
##  (Intercept)            -47.792   251.071  -0.190   0.851    
##  FlightLength           346.767   121.315   2.858   0.009 ** 
##  FlightTimeperPlane    -303.095   133.063  -2.278   0.032 *  
##  PopulationServed        41.724    31.298   1.333   0.195    
##  LoadFactor            -595.961   152.356  -3.912  < .001 ***
##  Capacity              -646.626   105.178  -6.148  < .001 ***
##  AdjustedAssets          69.911    53.825   1.299   0.206    
## 
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-squared: 0.881,  Adjusted R-squared: 0.852 
## F-statistic: 29.73 df(6,24), p.value < .001
## Nr obs: 31 
## 
##                    coefficient     2.5%    97.5%     +/-
## (Intercept)            -47.792 -565.976  470.392 518.184
## FlightLength           346.767   96.385  597.150 250.382
## FlightTimeperPlane    -303.095 -577.724  -28.465 274.630
## PopulationServed        41.724  -22.872  106.320  64.596
## LoadFactor            -595.961 -910.408 -281.513 314.447
## Capacity              -646.626 -863.704 -429.549 217.078
## AdjustedAssets          69.911  -41.178  181.000 111.089

plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

AirlineCosts <- store(AirlineCosts, result, name = "NewResids")
shapiro.test(AirlineCosts$NewResids)

## 
##  Shapiro-Wilk normality test
## 
## data:  AirlineCosts$NewResids
## W = 0.9893, p-value = 0.9855

Provide a complete regression table of your final regression model.

How much variance is explained?

R-squared: 0.881 and p.value < .001 are the same. There is no variance.

Average expected Total Operating Costs include zero with 95% confidence given that all of the predictors take the value of zero. TRUE or FALSE.

FALSE

22.TRUE OR FALSE: For your final model, all of the estimated slopes are not zero with 95% confidence.

TRUE

23.Are the residuals from your streamlined regression normal? Provide at least two pieces of evidence.

Yes, they are normal. The value of the Shapiro-Wilk normality test is greater than 0.05

Without regard to the previous, present 95% confidence intervals for each of the slopes in your final regression model. Each slope must clearly state the relevant metric! Change in what to change in what….

(Intercept) -565.976 to 470.392 cents, if the other drivers are zero. FlightLength 96.385 to 597.150 cents per miles [logged]. FlightTimeperPlane -577.724 to -28.465 cents per hours [logged], PopulationServed -22.872 to 106.320 cents per [logged]. LoadFactor -910.408 to -281.513 cents per Ton-Mile (proportion) . Capacity -863.704 to -429.549 cents per Tons per mile. AdjustedAssets -41.178 to 181.000 cent per Adj Ass logged

25.Plot at least one of the relevant effects including a confidence interval of the effect from your final model. You may wish to deploy library(visreg) though you can also do this with radiant’s predict.

plot(result, plots = "scatter", lines = c("line", "loess"), nrobs = -1, custom = FALSE)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

AirlineCosts <- store(AirlineCosts, result, name = "resids2")

visualize(
  AirlineCosts, 
  xvar = "Capacity", 
  yvar = "TotalOperatingCosts", 
  type = "scatter", 
  nrobs = -1, 
  check = c("line", "loess"), 
  custom = FALSE
)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Airline Costs

Karol O

12/4/2020

Background:

Multiple Regression