We will examine data on airline costs. The following data were originally presented in the classic study by Proctor and Duncan. It is worth noting there is also a variable representing the Total asses of the airline and a variable representing the Type of airline though we will not use them in the regression modeling.
1.Suppose that (logged) speed of plane is normally distributed with mean and standard deviation exactly as calculated from the data. What is the probability of a value greater than 2.25 logged miles per hour?
result <- prob_norm(mean = 2.202, stdev = 0.071, ub = 2.25)
summary(result)
## Probability calculator
## Distribution: Normal
## Mean : 2.202
## St. dev : 0.071
## Lower bound : -Inf
## Upper bound : 2.25
##
## P(X < 2.25) = 0.75
## P(X > 2.25) = 0.25
plot(result)
The proability of a value greater than 2.25 logged miles per hour is 0.25
95% of the time, the speed of the plane should fall between 2.063 and 2.341 logged miles per hour.
result <- prob_norm(
mean = 2.202,
stdev = 0.071,
plb = 0.025,
pub = 0.975
)
summary(result, type = "probs")
## Probability calculator
## Distribution: Normal
## Mean : 2.202
## St. dev : 0.071
## Lower bound : 0.025
## Upper bound : 0.975
##
## P(X < 2.063) = 0.025
## P(X > 2.063) = 0.975
## P(X < 2.341) = 0.975
## P(X > 2.341) = 0.025
## P(2.063 < X < 2.341) = 0.95
## 1 - P(2.063 < X < 2.341) = 0.05
plot(result, type = "probs")
result <- single_mean(
AirlineCosts,
var = "TotalOperatingCosts"
)
summary(result)
## Single mean test
## Data : AirlineCosts
## Variable : TotalOperatingCosts
## Confidence: 0.95
## Null hyp. : the mean of TotalOperatingCosts = 0
## Alt. hyp. : the mean of TotalOperatingCosts is not equal to 0
##
## mean n n_missing sd se me
## 113.506 31 0 142.705 25.631 52.344
##
## diff se t.value p.value df 2.5% 97.5%
## 113.506 25.631 4.429 < .001 30 61.162 165.851 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(result, plots = "hist", custom = FALSE)
False, the Total Operating Costs is not symetric, and it averages 61.162 to 165.851 cents
result <- explore(
AirlineCosts,
vars = "TotalOperatingCosts",
fun = c("n_obs", "min", "p25", "median", "p75", "max"),
nr = Inf
)
## Warning: attributes are not identical across measure variables;
## they will be dropped
summary(result)
## Explore
## Data : AirlineCosts
## Functions : n_obs, min, p25, median, p75, max
## Top : Function
##
## variable n_obs min p25 median p75 max
## TotalOperatingCosts 31 42.300 50.800 75.400 120.750 820.900
# dtab(result) %>% render()
6.Which airline has the highest total operating costs per revenue ton-mile? >Wiggins (820.90)
Which airline has the lowest total operating costs per revenue ton-mile? >United (42.30)
result <- compare_means(
AirlineCosts,
var1 = "Assets",
var2 = "AdjustedAssets",
samples = "paired"
)
summary(result, show = TRUE)
## Pairwise mean comparisons (t-test)
## Data : AirlineCosts
## Variables : Assets, AdjustedAssets
## Samples : paired
## Confidence: 0.95
## Adjustment: None
##
## mean n n_missing sd se me
## Assets 1.669 31 0 0.776 0.139 0.285
## AdjustedAssets 1.644 31 0 0.772 0.139 0.283
##
## Null hyp. Alt. hyp. diff p.value
## Assets = AdjustedAssets Assets not equal to AdjustedAssets 0.025 < .001
## se t.value df 2.5% 97.5%
## 0.006 4.004 30 0.012 0.038 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The average difference between Assets and Adjusted Assets ranges from 0.012 to 0.038
8.There is no difference in (logged) average daily flight time per plane between the two Types of airlines. Difference or No difference? If different, which is larger? What is the 95% confidence interval for the difference in means?
The average difference between Daily flight time per plane and the type of plane ranges from 0.1 to 0.294 hours logged. Being the Type Comp [Comprehensive] the larger one.
result <- compare_means(
AirlineCosts,
var1 = "Type",
var2 = "FlightTimeperPlane",
samples = "paired"
)
summary(result, show = TRUE)
## Pairwise mean comparisons (t-test)
## Data : AirlineCosts
## Variables : Type, FlightTimeperPlane
## Samples : independent (obs. per level unequal)
## Confidence: 0.95
## Adjustment: None
##
## Type mean n n_missing sd se me
## Comp 0.853 22 0 0.058 0.012 0.026
## SH 0.656 9 0 0.124 0.041 0.095
##
## Null hyp. Alt. hyp. diff p.value se t.value df 2.5%
## Comp = SH Comp not equal to SH 0.197 0.001 0.043 4.572 9.469 0.1
## 97.5%
## 0.294 **
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9.With 95% confidence, are load factor and (logged) flight length linearly related? Related or Not related Provide graphical and regression evidence.
Yes, they are related, the slope is not zero with 95% of confidence.
visualize(
AirlineCosts,
xvar = "LoadFactor",
yvar = "FlightLength",
type = "scatter",
nrobs = -1,
check = c("line", "loess"),
custom = FALSE
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
result <- regress(
AirlineCosts,
rvar = "FlightLength",
evar = "LoadFactor"
)
summary(result)
## Linear regression (OLS)
## Data : AirlineCosts
## Response variable : FlightLength
## Explanatory variables: LoadFactor
## Null hyp.: the effect of LoadFactor on FlightLength is zero
## Alt. hyp.: the effect of LoadFactor on FlightLength is not zero
##
## coefficient std.error t.value p.value
## (Intercept) 1.361 0.089 15.211 < .001 ***
## LoadFactor 1.438 0.180 7.969 < .001 ***
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-squared: 0.687, Adjusted R-squared: 0.676
## F-statistic: 63.504 df(1,29), p.value < .001
## Nr obs: 31
result <- prob_binom(n = 85, p = 0.9, plb = 0.25, pub = 0.975)
summary(result, type = "probs")
## Probability calculator
## Distribution: Binomial
## n : 85
## p : 0.9
## Mean : 76.5
## St. dev : 2.766
## Lower bound : 0.25 (75)
## Upper bound : 0.975 (81)
##
## P(X = 75) = 0.116
## P(X < 75) = 0.228
## P(X <= 75) = 0.343
## P(X > 75) = 0.656
## P(X >= 75) = 0.772
## P(X = 81) = 0.04
## P(X < 81) = 0.936
## P(X <= 81) = 0.975
## P(X > 81) = 0.024
## P(X >= 81) = 0.064
## P(75 <= X <= 81) = 0.748
## 1 - P(75 <= X <= 81) = 0.252
plot(result, type = "probs")
With 95% central probability, what range of Yes’s should we expect?
From 75 to 81 Yes responses.
Of course, this probability is perhaps a bit optimistic. 95% of the time, we should see 79 or more Yes’s if the true probability is 0.9.
We will now model Total Operating Costs for these airlines.
result <- regress(
AirlineCosts,
rvar = "TotalOperatingCosts",
evar = c(
"FlightLength", "PlaneSpeed", "FlightTimeperPlane", "PopulationServed",
"LoadFactor", "Capacity", "AdjustedAssets"
)
)
summary(result, sum_check = c("rmse", "sumsquares", "confint"))
## Linear regression (OLS)
## Data : AirlineCosts
## Response variable : TotalOperatingCosts
## Explanatory variables: FlightLength, PlaneSpeed, FlightTimeperPlane, PopulationServed, LoadFactor, Capacity, AdjustedAssets
## Null hyp.: the effect of x on TotalOperatingCosts is zero
## Alt. hyp.: the effect of x on TotalOperatingCosts is not zero
##
## coefficient std.error t.value p.value
## (Intercept) -4.698 989.286 -0.005 0.996
## FlightLength 351.792 166.639 2.111 0.046 *
## PlaneSpeed -24.543 544.152 -0.045 0.964
## FlightTimeperPlane -302.651 136.275 -2.221 0.036 *
## PopulationServed 41.589 32.109 1.295 0.208
## LoadFactor -597.106 157.684 -3.787 < .001 ***
## Capacity -645.592 109.856 -5.877 < .001 ***
## AdjustedAssets 70.457 56.297 1.252 0.223
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-squared: 0.881, Adjusted R-squared: 0.845
## F-statistic: 24.424 df(7,23), p.value < .001
## Nr obs: 31
##
## Prediction error (RMSE): 48.342
## Residual st.dev (RSD): 56.123
##
## Sum of squares:
## df SS
## Regression 7 538,494.406
## Error 23 72,443.992
## Total 30 610,938.399
##
## coefficient 2.5% 97.5% +/-
## (Intercept) -4.698 -2051.191 2041.795 2046.493
## FlightLength 351.792 7.074 696.510 344.718
## PlaneSpeed -24.543 -1150.208 1101.122 1125.665
## FlightTimeperPlane -302.651 -584.557 -20.745 281.906
## PopulationServed 41.589 -24.834 108.012 66.423
## LoadFactor -597.106 -923.301 -270.911 326.195
## Capacity -645.592 -872.847 -418.337 227.255
## AdjustedAssets 70.457 -46.003 186.917 116.460
plot(result, plots = "scatter", nrobs = -1, custom = FALSE)
AirlineCosts <- store(AirlineCosts, result, name = "Resids")
shapiro.test(AirlineCosts$Resids)
##
## Shapiro-Wilk normality test
##
## data: AirlineCosts$Resids
## W = 0.98881, p-value = 0.9817
FlightLength, FlightTimeperPlane, LoadFactor, Capacity
PlaneSpeed (-1150.208 to 1101.122), PopulationServed (-24.834 to 108.012), AdjustedAssets ( -46.003 to 186.917)
FALSE, they have a negative correlation.
FALSE, It is unrelated to Total Operating Costs
TRUE
FALSE
## filter and sort the dataset
AirlineCosts %>%
arrange(Resids) %>%
select(Airline, TotalOperatingCosts, Resids) %>%
dtab(dec = 2, nr = 100) %>% render()
It is 0.881 or 88.1%
Yes, F the p-value is less than 0.001
The residual standard error is the zise of the average miss from the prediction of the regression where the averaging is by degrees of freedom. It is also positive square root of the mean square error. When the residual standard error is exactly 0 then the model fits the data perfectly (likely due to overfitting). If the residual standard error can not be shown to be significantly different from the variability in the unconditional response, then there is little evidence to suggest the linear model has any predictive ability.
For this model, we are 56.12 cents away from what is expected given the regression prediction. We want it small; that shows precise predictions.
Lake Central
Wiggins
The residuals from this regression seen normal, based on the Shapiro Test and the graphics below.
plot(result, plots = "dashboard", lines = "line", nrobs = -1, custom = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
The probability is 0.683
result2 <- prob_norm(mean = 0, stdev = 1, lb = -1, ub = 1)
summary(result2)
## Probability calculator
## Distribution: Normal
## Mean : 0
## St. dev : 1
## Lower bound : -1
## Upper bound : 1
##
## P(X < -1) = 0.159
## P(X > -1) = 0.841
## P(X < 1) = 0.841
## P(X > 1) = 0.159
## P(-1 < X < 1) = 0.683
## 1 - P(-1 < X < 1) = 0.317
plot(result2)
The range of possible cots is between 13.310 to 74.921 +/- 30.805
pred <- predict(result, pred_data = Braniff)
print(pred, n = 10)
## Linear regression (OLS)
## Data : AirlineCosts
## Response variable : TotalOperatingCosts
## Explanatory variables: FlightLength, PlaneSpeed, FlightTimeperPlane, PopulationServed, LoadFactor, Capacity, AdjustedAssets
## Interval : confidence
## Prediction dataset : Braniff
##
## FlightLength PlaneSpeed FlightTimeperPlane PopulationServed LoadFactor
## 2.246 2.260 0.820 4.074 0.557
## Capacity AdjustedAssets Prediction 2.5% 97.5% +/-
## 0.664 2.189 44.115 13.310 74.921 30.805
Braniff <- store(Braniff, pred, name = "pred_reg")
Capacity
result <- regress(
AirlineCosts,
rvar = "TotalOperatingCosts",
evar = c(
"FlightLength", "PlaneSpeed", "FlightTimeperPlane", "PopulationServed",
"LoadFactor", "Capacity", "AdjustedAssets"
),
check = "stepwise-backward"
)
## Start: AIC=256.45
## TotalOperatingCosts ~ FlightLength + PlaneSpeed + FlightTimeperPlane +
## PopulationServed + LoadFactor + Capacity + AdjustedAssets
##
## Df Sum of Sq RSS AIC
## - PlaneSpeed 1 6 72450 254.46
## <none> 72444 256.45
## - AdjustedAssets 1 4933 77377 256.50
## - PopulationServed 1 5284 77728 256.64
## - FlightLength 1 14038 86482 259.94
## - FlightTimeperPlane 1 15536 87980 260.48
## - LoadFactor 1 45165 117609 269.48
## - Capacity 1 108778 181222 282.88
##
## Step: AIC=254.46
## TotalOperatingCosts ~ FlightLength + FlightTimeperPlane + PopulationServed +
## LoadFactor + Capacity + AdjustedAssets
##
## Df Sum of Sq RSS AIC
## <none> 72450 254.46
## - AdjustedAssets 1 5093 77543 254.56
## - PopulationServed 1 5365 77815 254.67
## - FlightTimeperPlane 1 15663 88113 258.52
## - FlightLength 1 24665 97115 261.54
## - LoadFactor 1 46190 118640 267.75
## - Capacity 1 114099 186550 281.78
summary(result, sum_check = "confint")
## ----------------------------------------------------
## Backward stepwise selection of variables
## ----------------------------------------------------
## Linear regression (OLS)
## Data : AirlineCosts
## Response variable : TotalOperatingCosts
## Explanatory variables: FlightLength, PlaneSpeed, FlightTimeperPlane, PopulationServed, LoadFactor, Capacity, AdjustedAssets
## Null hyp.: the effect of x on TotalOperatingCosts is zero
## Alt. hyp.: the effect of x on TotalOperatingCosts is not zero
##
## coefficient std.error t.value p.value
## (Intercept) -47.792 251.071 -0.190 0.851
## FlightLength 346.767 121.315 2.858 0.009 **
## FlightTimeperPlane -303.095 133.063 -2.278 0.032 *
## PopulationServed 41.724 31.298 1.333 0.195
## LoadFactor -595.961 152.356 -3.912 < .001 ***
## Capacity -646.626 105.178 -6.148 < .001 ***
## AdjustedAssets 69.911 53.825 1.299 0.206
##
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-squared: 0.881, Adjusted R-squared: 0.852
## F-statistic: 29.73 df(6,24), p.value < .001
## Nr obs: 31
##
## coefficient 2.5% 97.5% +/-
## (Intercept) -47.792 -565.976 470.392 518.184
## FlightLength 346.767 96.385 597.150 250.382
## FlightTimeperPlane -303.095 -577.724 -28.465 274.630
## PopulationServed 41.724 -22.872 106.320 64.596
## LoadFactor -595.961 -910.408 -281.513 314.447
## Capacity -646.626 -863.704 -429.549 217.078
## AdjustedAssets 69.911 -41.178 181.000 111.089
plot(result, plots = "dashboard", lines = c("line", "loess"), nrobs = -1, custom = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
AirlineCosts <- store(AirlineCosts, result, name = "NewResids")
shapiro.test(AirlineCosts$NewResids)
##
## Shapiro-Wilk normality test
##
## data: AirlineCosts$NewResids
## W = 0.9893, p-value = 0.9855
R-squared: 0.881 and p.value < .001 are the same. There is no variance.
FALSE
22.TRUE OR FALSE: For your final model, all of the estimated slopes are not zero with 95% confidence.
TRUE
23.Are the residuals from your streamlined regression normal? Provide at least two pieces of evidence.
Yes, they are normal. The value of the Shapiro-Wilk normality test is greater than 0.05
(Intercept) -565.976 to 470.392 cents, if the other drivers are zero. FlightLength 96.385 to 597.150 cents per miles [logged]. FlightTimeperPlane -577.724 to -28.465 cents per hours [logged], PopulationServed -22.872 to 106.320 cents per [logged]. LoadFactor -910.408 to -281.513 cents per Ton-Mile (proportion) . Capacity -863.704 to -429.549 cents per Tons per mile. AdjustedAssets -41.178 to 181.000 cent per Adj Ass logged
25.Plot at least one of the relevant effects including a confidence interval of the effect from your final model. You may wish to deploy library(visreg) though you can also do this with radiant’s predict.
plot(result, plots = "scatter", lines = c("line", "loess"), nrobs = -1, custom = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
AirlineCosts <- store(AirlineCosts, result, name = "resids2")
visualize(
AirlineCosts,
xvar = "Capacity",
yvar = "TotalOperatingCosts",
type = "scatter",
nrobs = -1,
check = c("line", "loess"),
custom = FALSE
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'