Description of the Data Set

This data set, airq402.dat, describes airfares and passengers for U.S. domestic routes in the fourth quarter of 2002. It includes information on airfare pricing, passenger volume, market share, and distances flown. The data were sourced from the United States Department of Transportation and presumably come from a random sample of size n = 1,000 of all flights that occurred in the fourth quarter of 2002, from October 1 to December 31. The variables and respective descriptions are as follows:

As for practical and analytical questions, I am curious to know which cities had the highest volume of people travelling to and from them, especially since these data were collected during the fourth quarter of the year, which includes the holiday season. I would also like to know why some variables are repeated with different observation values. My main analytical interest is how the variable distance affects the variable averageFare.

The data set has enough information to answer my main statistical question about the relationship between distance and average fare. It contains eleven variables and 1,000 observations. The guidelines stipulate at least 15 observations for each variable; eleven variables require at least 165 observations, so 1,000 observations comfortably meet this requirement.

Simple Linear Regression

Pairwise Scatterplot

airfareNumVars <- c(3, 4, 5, 7, 8, 10, 11)
airfareNumeric <- airfare[airfareNumVars]

pairs(airfareNumeric, main = "Pair-wise Association: Scatter Plot")

To get a better look at the two variables of interest, I plot price against distance in a single scatterplot:

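# reference the columns directly rather than attach()-ing the data frame
plot(airfare$distance, airfare$price, main = "Scatterplot Example",
   xlab = "Distance", ylab = "Price", pch = 19)
abline(lm(price ~ distance, data = airfare), col = "red") # fitted regression line (y ~ x)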

The scatterplot of Distance vs. Price appears to show a somewhat weak, positive, linear relationship between the two variables. Many of the data points lie toward the lower end of the x-axis, which makes practical sense: these are domestic flights in the US, and most trips do not cover very long distances.

Simple Linear Regression

distance <- airfare$distance
price <- airfare$price
parametric.model <- lm(price ~ distance)
par(mfrow = c(2,2))
plot(parametric.model)

Of the four diagnostic plots, the only one of note is the Normal Q-Q plot (top right), which indicates a violation of the assumption that the residuals are normally distributed.
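
The Q-Q reading can be complemented with a numerical look at the residuals. The sketch below is illustrative only: it uses simulated data with right-skewed errors, and the names `sim`, `sim.fit`, `sim.res`, and all chosen parameter values are assumptions, not the airfare data. It shows two base-R checks, the residual quantiles and `shapiro.test()`.

```r
# Illustrative only: simulated data standing in for the airfare data
set.seed(402)
n <- 1000
sim <- data.frame(distance = runif(n, 100, 2700))
# right-skewed errors: exponential draws shifted to have mean zero
sim$price <- 98 + 0.043 * sim$distance + (rexp(n, rate = 1 / 38) - 38)

sim.fit <- lm(price ~ distance, data = sim)
sim.res <- residuals(sim.fit)

# A median below zero together with a long upper tail points to right skew
print(round(quantile(sim.res), 2))

# Shapiro-Wilk test of normality (accepts between 3 and 5000 observations)
sw <- shapiro.test(sim.res)
print(sw$p.value)
```

On the airfare fit, the same calls would be applied to `residuals(parametric.model)`. The residual quantiles reported in the model summary (median -8.07, maximum 174.21) suggest the same kind of right skew.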

parametricLM <- lm(price~distance, data=airfare)
summary(parametricLM)
## 
## Call:
## lm(formula = price ~ distance, data = airfare)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71.41 -28.04  -8.07  18.47 174.21 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 97.927025   2.335763   41.92   <2e-16 ***
## distance     0.042826   0.001888   22.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.38 on 998 degrees of freedom
## Multiple R-squared:  0.3402, Adjusted R-squared:  0.3395 
## F-statistic: 514.5 on 1 and 998 DF,  p-value: < 2.2e-16
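
For comparison with the bootstrap results later on, the 95% parametric confidence intervals can be recovered from the coefficient table above as estimate ± t × standard error. The sketch below recomputes them from the printed values, so it is limited by their rounding; with the fitted object available, `confint(parametricLM)` should give the same intervals. The names `est`, `se`, `param.ci`, and `r` are introduced here for illustration.

```r
# Values copied from the summary() output above
est <- c(intercept = 97.927025, distance = 0.042826)
se  <- c(intercept = 2.335763,  distance = 0.001888)
df  <- 998                      # residual degrees of freedom

# 95% t-based confidence intervals: estimate +/- t * SE
t.crit <- qt(0.975, df)
param.ci <- cbind(lower = est - t.crit * se,
                  upper = est + t.crit * se)
print(round(param.ci, 4))

# In simple linear regression the correlation is the square root of
# R-squared, taking the sign of the slope (positive here)
r <- sqrt(0.3402)
print(round(r, 3))
```

The slope interval comes out at roughly (0.0391, 0.0465), and the implied correlation between price and distance is about 0.58.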

Bootstrap Confidence Intervals

## Begin bootstrap simple linear regression

B <- 1000    # number of bootstrap replicates
# preallocate vectors to store the bootstrap regression coefficients
boot.beta0 <- numeric(B)
boot.beta1 <- numeric(B)
## bootstrap regression models using a for-loop
vec.id <- seq_along(price)   # vector of observation IDs
for(i in 1:B){
  boot.id <- sample(vec.id, length(price), replace = TRUE)   # bootstrap obs IDs
  ## regression on the resampled (price, distance) pairs
  boot.reg <- lm(price[boot.id] ~ distance[boot.id])
  boot.beta0[i] <- coef(boot.reg)[1]   # bootstrap intercept
  boot.beta1[i] <- coef(boot.reg)[2]   # bootstrap slope
}
summary(boot.reg)   # fit from the final bootstrap replicate only
## 
## Call:
## lm(formula = price[boot.id] ~ distance[boot.id])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.894 -28.920  -9.081  17.096 169.443 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       96.176217   2.521482   38.14   <2e-16 ***
## distance[boot.id]  0.045237   0.002038   22.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.19 on 998 degrees of freedom
## Multiple R-squared:  0.3305, Adjusted R-squared:  0.3298 
## F-statistic: 492.6 on 1 and 998 DF,  p-value: < 2.2e-16

# 95% bootstrap percentile confidence intervals
library(knitr)   # provides kable()
boot.beta0.ci <- quantile(boot.beta0, c(0.025, 0.975), type = 2)
boot.beta1.ci <- quantile(boot.beta1, c(0.025, 0.975), type = 2)
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci))
names(boot.coef) <- c("2.5%", "97.5%")
kable(boot.coef, caption = "Bootstrap confidence intervals of regression coefficients.")

Bootstrap confidence intervals of regression coefficients.

                     2.5%        97.5%
boot.beta0.ci  93.0127338  103.4616538
boot.beta1.ci   0.0383314    0.0472093

The 95% bootstrap confidence interval of the slope is (0.0383, 0.0472), as shown in the table above. Since both limits are positive, 0 is not in the interval, so flight price and flight distance are positively associated. Both the parametric and bootstrap regression models indicate that the slope coefficient is significantly different from zero; that is, price and distance are statistically correlated.

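The parametric and bootstrap methods lead to the same conclusion, and their confidence intervals for the regression coefficients are similar. For the slope, the parametric 95% interval implied by the estimate and standard error in the model summary is approximately (0.0391, 0.0465), which is close to, and slightly narrower than, the bootstrap interval in the table above. This is expected, because with a large sample the parametric and bootstrap methods tend to give similar results. In the end, since the Q-Q plot shows a violation of the normality assumption, the bootstrap, which does not rely on normally distributed errors, is the more reliable method to use here.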
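
The claim that the two methods agree more closely as the sample grows can be explored with a small simulation. The sketch below is illustrative only and does not use the airfare data: the helper names `boot.slope.ci` and `compare.ci`, the simulated parameters, and the sample sizes 30 and 1000 are all assumptions made for this example. It compares the width of the parametric interval from `confint()` with that of a percentile bootstrap interval for the slope.

```r
# Illustrative only: simulated data, not the airfare data
set.seed(2002)

# Percentile bootstrap CI for the slope of y ~ x (resampling cases)
boot.slope.ci <- function(x, y, B = 1000, level = 0.95) {
  n <- length(y)
  slopes <- replicate(B, {
    id <- sample.int(n, n, replace = TRUE)   # resample (x, y) pairs
    coef(lm(y[id] ~ x[id]))[2]
  })
  alpha <- 1 - level
  quantile(slopes, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Simulate one data set of size n and return both interval widths
compare.ci <- function(n) {
  x <- runif(n, 100, 2700)
  y <- 98 + 0.043 * x + (rexp(n, rate = 1 / 38) - 38)  # right-skewed errors
  param <- confint(lm(y ~ x))["x", ]
  boot  <- boot.slope.ci(x, y)
  c(n = n, param.width = unname(diff(param)), boot.width = diff(boot))
}

widths <- t(sapply(c(30, 1000), compare.ci))
print(signif(widths, 3))
```

With independent, constant-variance errors as simulated here, the two widths should be close at n = 1000, while at n = 30 they may differ more noticeably. Calling `set.seed()` before the resampling also makes the bootstrap limits reproducible from run to run.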