This data set is called airq402.dat. It describes airfares and passengers for U.S. domestic routes for the fourth quarter of 2002. That is, it includes information on airfare pricing, passenger volume, market share, and distances flown. The data were sourced from the United States Department of Transportation and presumably come from a random sample of size n=1,000 of the flights that occurred in the fourth quarter of 2002, from the first of October to the 31st of December. The variables and their descriptions are as follows:
City1: The city from which passengers departed
City2: The destination city to which passengers travelled
averageFare: The average price of a plane ticket per passenger
distance: The distance travelled, measured in miles
averageWeeklyPassengers: The average number of passengers on airlines every week
marketLeadingAirline: The particular airline that has the highest percentage of total sales revenue of US Airlines
marketShare: Unsure of what this figure represents in this context
averageFare2: Unsure of what this figure represents in this context
lowPriceAirline: The airline for a particular flight with the lowest price
marketShare2: Unsure of what this figure represents in this context
price: The price of a particular chosen airplane ticket for one passenger
As for practical and analytical questions, I am curious to know which cities had the highest volume of people travelling to and from them, especially considering these data were collected during the fourth fiscal quarter of the year, during the holiday season. I would also like to know why some variables are repeated with different observation values.
The data set has enough information to answer my main statistical inquiry into the relationship between distance and average fare. The dataset contains eleven variables and 1,000 observations. The guidelines stipulate at least 15 observations for each variable; with eleven variables that is at least 165 observations, so the requirement is met.
Our objective and primary question is to determine the relationship between airplane ticket price and relevant predictor variables in the dataset.
As shown in last week’s assignment, a pairwise scatterplot of the data shows that several variables are highly correlated and may in fact be repetitive, judging by their names. Examples include marketShare and marketShare2, which were both simply named “market share” in the original data.
Because of the extreme correlation between variables like marketShare and marketShare2, I will remove the duplicate variables from the dataset, and create a subset called airfaireMult with only the variables that I am interested in.
| averageFare | distance | averageWeeklyPassengers | marketShare | price |
|---|---|---|---|---|
| 114.47 | 528 | 424.56 | 70.19 | 111.03 |
| 122.47 | 860 | 276.84 | 75.10 | 118.94 |
| 214.42 | 852 | 215.76 | 78.89 | 167.12 |
| 69.40 | 288 | 606.84 | 96.97 | 68.86 |
| 158.13 | 723 | 313.04 | 39.79 | 145.42 |
| 135.17 | 1204 | 199.02 | 40.68 | 127.69 |
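The correlation concern that motivated the subset can be illustrated on a small scale. The sketch below computes a Pearson correlation between averageFare and price using only the six rows displayed above; it is not the full 1,000-row analysis, and the helper name `pearson_r` is introduced here purely for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# The six rows shown in the table above (not the full data set).
average_fare = [114.47, 122.47, 214.42, 69.40, 158.13, 135.17]
price = [111.03, 118.94, 167.12, 68.86, 145.42, 127.69]

r = pearson_r(average_fare, price)
print(round(r, 3))
```

Even on these few rows the two columns move closely together, which is worth keeping in mind when averageFare is used as a predictor of price.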
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 41.4524702 | 4.1537034 | 9.979641 | 0.0000000 |
| averageFare | 0.6873902 | 0.0163178 | 42.125196 | 0.0000000 |
| distance | 0.0042585 | 0.0016127 | 2.640632 | 0.0084048 |
| averageWeeklyPassengers | -0.0025610 | 0.0009608 | -2.665499 | 0.0078119 |
| marketShare | -0.2218813 | 0.0448108 | -4.951510 | 0.0000009 |
Next I will conduct residual diagnostic tests:
The residual plots show some violations:
The variance of the residuals is not constant.
The QQ plot indicates that the distribution of the residuals departs slightly from the normal distribution.
The residual plot shows a weak curved pattern.
We first apply a Box-Cox transformation to the response to correct the non-constant variance and the non-normality seen in the QQ plot.
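As a rough illustration of what the Box-Cox procedure does, the sketch below defines the transformation and a profile log-likelihood over a grid of power parameters. It makes two simplifying assumptions that are not part of the original analysis: it uses an intercept-only model rather than the full regression, and it uses only the six price values from the table shown earlier. The function names are hypothetical. The model names in the comparison table further below suggest that the square root and the log of price were considered, which correspond to λ = 0.5 and λ = 0.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of positive values y; lam = 0 gives the natural log."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_loglik(y, lam):
    """Profile log-likelihood of lam for an intercept-only normal model.

    In a regression setting the variance term would come from the residual
    sum of squares of the fitted model instead of the spread around the mean.
    """
    n = len(y)
    z = boxcox(y, lam)
    mean_z = sum(z) / n
    sigma2 = sum((v - mean_z) ** 2 for v in z) / n
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(sigma2) + jacobian

# Six price values from the displayed rows (illustration only).
price = [111.03, 118.94, 167.12, 68.86, 145.42, 127.69]

# Search lambda on a grid from -2 to 2 in steps of 0.1.
grid = [k / 10 for k in range(-20, 21)]
best_lam = max(grid, key=lambda lam: profile_loglik(price, lam))
print(best_lam)
```

With so few observations the selected λ is not meaningful in itself; the point is only to show how candidate powers are compared on a common likelihood scale.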
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 9.0478876 | 0.4356211 | 20.770087 | 0.0000000 |
| averageWeeklyPassengers | -0.0001440 | 0.0000381 | -3.777641 | 0.0001677 |
| averageFare | 0.0281128 | 0.0006526 | 43.078822 | 0.0000000 |
| distance | 0.0002236 | 0.0000655 | 3.413054 | 0.0006683 |
| log(marketShare) | -0.4881688 | 0.0993120 | -4.915508 | 0.0000010 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.2582865 | 0.0281403 | 151.323388 | 0.00e+00 |
| averageWeeklyPassengers | -0.0000323 | 0.0000065 | -4.968216 | 8.00e-07 |
| averageFare | 0.0047022 | 0.0001105 | 42.534880 | 0.00e+00 |
| distance | 0.0000362 | 0.0000109 | 3.316360 | 9.45e-04 |
| marketShare | -0.0022034 | 0.0003036 | -7.258122 | 0.00e+00 |
| | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 521669.29147 | 0.7658960 | 0.7649548 | 5 | 6267.034 | 6291.5726 | 529107.91138 |
| sqrt.price.log.dist | 822.47761 | 0.7805446 | 0.7796624 | 5 | -185.434 | -160.8952 | 833.20936 |
| log.price | 23.94319 | 0.7815141 | 0.7806357 | 5 | -3722.071 | -3697.5327 | 24.25261 |
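The third model has the highest R.sq and R.adj of the three. Its SSE, AIC, SBC, and PRESS are also the smallest, although these four measures depend on the scale of the response and so are not directly comparable across models with differently transformed responses. We therefore choose the third model as our final model.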
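The AIC, SBC, and R.adj columns in the table above appear to follow the usual SSE-based formulas for a linear model, with n = 1,000 observations and p = 5 estimated coefficients (four predictors plus the intercept, matching Cp = 5). That reading is an inference from the tabulated numbers, not something stated in the analysis; the sketch below shows the arithmetic under that assumption.

```python
import math

def aic_from_sse(sse, n, p):
    """AIC up to an additive constant: n * log(SSE / n) + 2p."""
    return n * math.log(sse / n) + 2 * p

def sbc_from_sse(sse, n, p):
    """SBC (BIC) up to an additive constant: n * log(SSE / n) + p * log(n)."""
    return n * math.log(sse / n) + p * math.log(n)

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p coefficients, intercept included."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

n, p = 1000, 5  # assumed sample size and number of coefficients

# (SSE, R.sq) pairs copied from the table above.
models = {
    "full.model": (521669.29147, 0.7658960),
    "sqrt.price.log.dist": (822.47761, 0.7805446),
    "log.price": (23.94319, 0.7815141),
}

for name, (sse, r2) in models.items():
    print(name,
          round(aic_from_sse(sse, n, p), 3),
          round(sbc_from_sse(sse, n, p), 3),
          round(adjusted_r2(r2, n, p), 7))
```

Because SSE enters through log(SSE / n), rescaling the response (for example, by taking logs) shifts AIC and SBC by a large amount regardless of fit quality, which is why only R.sq and R.adj are roughly comparable across the three rows.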
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.2582865 | 0.0281403 | 151.323388 | 0.00e+00 |
| averageWeeklyPassengers | -0.0000323 | 0.0000065 | -4.968216 | 8.00e-07 |
| averageFare | 0.0047022 | 0.0001105 | 42.534880 | 0.00e+00 |
| distance | 0.0000362 | 0.0000109 | 3.316360 | 9.45e-04 |
| marketShare | -0.0022034 | 0.0003036 | -7.258122 | 0.00e+00 |
\[ \log(price) =4.2583 - 0.0000323\times averageWeeklyPassengers +0.0047022\times averageFare + 0.0000362\times distance - 0.0022034\times marketShare \]
averageWeeklyPassengers and marketShare are both negatively associated with price, while distance and averageFare are positively associated with price.
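Because the response is log(price), each coefficient can be read as an approximate percentage change in price per one-unit increase in the predictor, holding the other predictors fixed. The sketch below applies the standard conversion 100 × (exp(β) − 1) to the coefficients in the fitted equation; the function name is hypothetical.

```python
import math

# Coefficients copied from the final log(price) model above.
coefs = {
    "averageWeeklyPassengers": -0.0000323,
    "averageFare": 0.0047022,
    "distance": 0.0000362,
    "marketShare": -0.0022034,
}

def pct_change_per_unit(beta):
    """Percent change in the untransformed response for a one-unit increase."""
    return 100.0 * (math.exp(beta) - 1.0)

pct = {name: pct_change_per_unit(beta) for name, beta in coefs.items()}

for name, value in pct.items():
    print(f"{name}: {value:+.4f}% per unit")
```

Under this reading, a one-dollar increase in averageFare goes with roughly a 0.47% higher price, and a one-point increase in marketShare with roughly a 0.22% lower price. For coefficients this small, the percentage is close to 100 × β.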
We used regression techniques such as a Box-Cox transformation of the response, and other transformations of the independent variables, to identify a suitable final model for this dataset. Because some variables in the initial data set appeared to be repetitive, I performed variable selection to remove them.
I used common global goodness-of-fit measures for model selection.
I had difficulty addressing and interpreting the true meaning of the final regression coefficients because of the nature of the transformations used.
The violation of the assumption of normality persists.
Here, we use bootstrapping to construct confidence intervals for the regression coefficients of the final model.
Here we create histograms of the bootstrap estimates of regression coefficients.
The red density curve uses the estimated regression coefficients and their corresponding standard error in the output of the regression procedure. The p-values reported in the output are based on the red curve.
The blue curve is a non-parametric, data-driven estimate of the density of the bootstrap sampling distribution. The bootstrap confidence intervals of the regression coefficients are based on these non-parametric bootstrap sampling distributions.
Evidently, the curves in each histogram are quite close to each other, which hints at a consistency between the significance test results and the corresponding bootstrap intervals.
Next, I’ll find 95% bootstrap confidence intervals of each regression coefficient.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | btc.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 4.2583 | 0.0281 | 151.3234 | 0.0000 | [ 4.1903 , 4.3289 ] |
| averageWeeklyPassengers | -0.0000 | 0.0000 | -4.9682 | 0.0000 | [ 0 , 0 ] |
| averageFare | 0.0047 | 0.0001 | 42.5349 | 0.0000 | [ 0.0044 , 0.005 ] |
| distance | 0.0000 | 0.0000 | 3.3164 | 0.0009 | [ 0 , 1e-04 ] |
| marketShare | -0.0022 | 0.0003 | -7.2581 | 0.0000 | [ -0.0028 , -0.0016 ] |
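The bootstrap intervals are reasonably consistent with the p-values: the intervals for the intercept, averageFare, and marketShare all exclude zero. The intervals for averageWeeklyPassengers and distance are too narrow to read at four decimal places.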
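The bootstrap intervals above can be sketched as follows, assuming the first procedure resamples whole observations (cases) and uses percentile intervals. This is a minimal illustration, not the code behind the table: it fits a one-predictor line to synthetic data, and all function names and data-generating values are made up for the example.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def percentile_ci(values, level=0.95):
    """Percentile interval taken from the sorted bootstrap estimates."""
    s = sorted(values)
    lo_idx = int((1 - level) / 2 * len(s))
    hi_idx = int((1 + level) / 2 * len(s)) - 1
    return s[lo_idx], s[hi_idx]

def case_bootstrap(x, y, n_boot=1000, seed=1):
    """Resample (x, y) pairs with replacement and refit the slope each time."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_fit([x[i] for i in idx], [y[i] for i in idx])[1])
    return slopes

# Synthetic data standing in for one predictor and a log-scale response.
data_rng = random.Random(0)
x = [data_rng.uniform(100, 2500) for _ in range(200)]
y = [4.3 + 0.0004 * xi + data_rng.gauss(0, 0.2) for xi in x]

slope_hat = ols_fit(x, y)[1]
boot_slopes = case_bootstrap(x, y)
ci_low, ci_high = percentile_ci(boot_slopes)
print(slope_hat, ci_low, ci_high)
```

Resampling whole cases keeps each response paired with its own predictor values, so the intervals do not rely on the residuals having constant variance.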
This histogram of the residuals shows one outlier and a leftward skew.
I will now generate bootstrap confidence intervals of regression coefficients, and plot them in histograms.
The fact that the red and blue curves are not close to each other here suggests that the inference of the significance of variables based on p-values and residual bootstrap will not yield the same results.
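For comparison, a residual bootstrap keeps the predictor values fixed and resamples only the residuals of the fitted model. The sketch below shows the idea on a one-predictor line with synthetic data; it is an illustration under those assumptions, with hypothetical function names, and not the procedure that produced the table that follows.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def percentile_ci(values, level=0.95):
    """Percentile interval taken from the sorted bootstrap estimates."""
    s = sorted(values)
    lo_idx = int((1 - level) / 2 * len(s))
    hi_idx = int((1 + level) / 2 * len(s)) - 1
    return s[lo_idx], s[hi_idx]

def residual_bootstrap(x, y, n_boot=1000, seed=1):
    """Refit on fitted values plus residuals resampled with replacement."""
    rng = random.Random(seed)
    intercept, slope = ols_fit(x, y)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        # x stays fixed; only the error terms are reshuffled.
        y_star = [fi + rng.choice(residuals) for fi in fitted]
        slopes.append(ols_fit(x, y_star)[1])
    return slopes

# Synthetic data standing in for one predictor and a log-scale response.
data_rng = random.Random(0)
x = [data_rng.uniform(100, 2500) for _ in range(200)]
y = [4.3 + 0.0004 * xi + data_rng.gauss(0, 0.2) for xi in x]

slope_hat = ols_fit(x, y)[1]
boot_slopes = residual_bootstrap(x, y)
ci_low, ci_high = percentile_ci(boot_slopes)
print(slope_hat, ci_low, ci_high)
```

Unlike case resampling, this version treats the residuals as exchangeable, so it is sensitive to non-constant variance and to outliers in the residuals.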
| | Estimate | Std. Error | t value | Pr(>\|t\|) | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 4.2583 | 0.0281 | 151.3234 | 0.0000 | [ 0 , 4.3126 ] |
| averageWeeklyPassengers | -0.0000 | 0.0000 | -4.9682 | 0.0000 | [ 0 , 0 ] |
| averageFare | 0.0047 | 0.0001 | 42.5349 | 0.0000 | [ 0 , 0.0049 ] |
| distance | 0.0000 | 0.0000 | 3.3164 | 0.0009 | [ 0 , 1e-04 ] |
| marketShare | -0.0022 | 0.0003 | -7.2581 | 0.0000 | [ -0.0028 , 0 ] |
| | Estimate | Std. Error | Pr(>\|t\|) | btc.ci.95 | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | 4.2583 | 0.0281 | 0.0000 | [ 4.1903 , 4.3289 ] | [ 0 , 4.3126 ] |
| averageWeeklyPassengers | -0.0000 | 0.0000 | 0.0000 | [ 0 , 0 ] | [ 0 , 0 ] |
| averageFare | 0.0047 | 0.0001 | 0.0000 | [ 0.0044 , 0.005 ] | [ 0 , 0.0049 ] |
| distance | 0.0000 | 0.0000 | 0.0009 | [ 0 , 1e-04 ] | [ 0 , 1e-04 ] |
| marketShare | -0.0022 | 0.0003 | 0.0000 | [ -0.0028 , -0.0016 ] | [ -0.0028 , 0 ] |
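As suspected, the results of the residual bootstrap and the normal-theory method are not consistent with each other. This suggests that there may still be a serious violation of the model assumptions and conditions. Several btr.ci.95 intervals also have an endpoint of exactly 0 (for example, [ 0 , 4.3126 ] for the intercept, whose estimate is 4.2583 with standard error 0.0281), so the interval computation itself may be worth rechecking.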
| | btc.wd | btr.wd |
|---|---|---|
| (Intercept) | 0.1386166 | 4.3126189 |
| averageWeeklyPassengers | 0.0000251 | 0.0000442 |
| averageFare | 0.0006622 | 0.0048972 |
| distance | 0.0000420 | 0.0000578 |
| marketShare | 0.0012408 | 0.0027600 |