A regional airline transfers passengers from small airports to a larger regional hub airport. The airline’s data analyst was assigned to estimate the revenue (in thousands of dollars) generated by each of the 22 small airports based on two variables: the distance from each airport (in miles) to the hub and the population (in hundreds) of the cities in wheach each of the 22 airports is located. The data are given in the following table.
The scatter plots may be produced using the R code shown on below.
library(ggplot2)
library(gridExtra)
gp <- ggplot(data = dat) + theme_bw(14)
R_D <-
gp + geom_point(aes(Revenue, Distance), color = 2, size = 3.5)
R_P <-
gp + geom_point(aes(Revenue, Population), color = 4, size = 3.5)
P_D <-
gp + geom_point(aes(Distance, Population), color = 6,size = 3.5)
ml <- grid.arrange(R_D, R_P, P_D, nrow = 1, ncol = 3)
Correlation refers to the extent to which two variables (i.e. \(X\) and \(Y\)) have a linear relationship with each other. If \(X\) and \(Y\) are perfectly correlated, all of the points \((X_i,Y_i), i = 1,\ldots,N\) will fall on a straight line. The strength of the correlation between \(X\) and \(Y\) is measured using the correlation coefficient \(\rho_{_{XY}}\). The correlation coefficient is expressed as
\[ \rho_{_{XY}} = \frac{\mbox{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}= \frac{E\left[(X-\mu_{X})(Y-\mu_{Y})\right]}{\sigma_{X}\sigma_{Y}} \]
In general, we cannot compute \(\rho_{_{XY}}\) as we don’t know the true values of \(\mbox{cov}(X,Y), \mu_{X}, \mu_{Y}, \sigma_{X},\mbox{ and } \sigma_{Y}\). However, if we have two sets of observations \(x_i\) and \(y_i, i = 1,\ldots, n\) we can compute the sample correlation coefficient \(r_{_{xy}}\) using
\[ \begin{aligned} r_{_{xy}} &= \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y}\\\\ &=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}} \end{aligned} \]
Thus to determine \(r_{_{DP}}\) we must first compute the means \(\overline{\mbox{Distance}}\) and \(\overline{\mbox{Population}}\)
\[ \begin{aligned} \overline{\mbox{Distance}} &= \sum_{i=1}^{22} \mbox{Distance}_i/22\\\\ &=\frac{233+209+206+232+125+245+213+134+140+165+234+205+214+183+230+238+144+220+170+170+290+340}{22}\\\\ &= 206.3636364\\\\\\ \overline{\mbox{Population}} &= \sum_{i=1}^{22} \mbox{Population}_i/22\\\\ &=\frac{56+74+67+78+73+54+100+98+95+81+52+96+96+73+55+91+64+60+60+240+70+75}{22}\\\\ &= 82.1818182. \end{aligned} \]
Next, we compute the differences and squared differences for each of the \(i=1,\ldots,22\) distance and population observations. These values are shown in the table below
diff_D <- Distance - mean(Distance)
diff_P <- Population - mean(Population)
diff_D2 <- (Distance - mean(Distance))^2
diff_P2 <- (Population - mean(Population))^2
diff_df <- data.frame(diff_D, diff_D2, diff_P, diff_P2)
colnames(diff_df) <- c('$x_i-\\bar{x}$',
'$(x_i-\\bar{x})^2$',
'$y_i-\\bar{y}$',
'$(y_i-\\bar{y})^2$')
knitr::kable(diff_df, caption = 'Difference values for Problem 12.2')
| \(x_i-\bar{x}\) | \((x_i-\bar{x})^2\) | \(y_i-\bar{y}\) | \((y_i-\bar{y})^2\) |
|---|---|---|---|
| 26.6363636 | 7.094959e+02 | -26.181818 | 685.487603 |
| 2.6363636 | 6.950413e+00 | -8.181818 | 66.942149 |
| -0.3636364 | 1.322314e-01 | -15.181818 | 230.487603 |
| 25.6363636 | 6.572231e+02 | -4.181818 | 17.487603 |
| -81.3636364 | 6.620041e+03 | -9.181818 | 84.305785 |
| 38.6363636 | 1.492769e+03 | -28.181818 | 794.214876 |
| 6.6363636 | 4.404132e+01 | 17.818182 | 317.487603 |
| -72.3636364 | 5.236496e+03 | 15.818182 | 250.214876 |
| -66.3636364 | 4.404132e+03 | 12.818182 | 164.305785 |
| -41.3636364 | 1.710950e+03 | -1.181818 | 1.396694 |
| 27.6363636 | 7.637686e+02 | -30.181818 | 910.942149 |
| -1.3636364 | 1.859504e+00 | 13.818182 | 190.942149 |
| 7.6363636 | 5.831405e+01 | 13.818182 | 190.942149 |
| -23.3636364 | 5.458595e+02 | -9.181818 | 84.305785 |
| 23.6363636 | 5.586777e+02 | -27.181818 | 738.851240 |
| 31.6363636 | 1.000860e+03 | 8.818182 | 77.760331 |
| -62.3636364 | 3.889223e+03 | -18.181818 | 330.578512 |
| 13.6363636 | 1.859504e+02 | -22.181818 | 492.033058 |
| -36.3636364 | 1.322314e+03 | -22.181818 | 492.033058 |
| -36.3636364 | 1.322314e+03 | 157.818182 | 24906.578512 |
| 83.6363636 | 6.995041e+03 | -12.181818 | 148.396694 |
| 133.6363636 | 1.785868e+04 | -7.181818 | 51.578512 |
Finally, after inserting the values into the expression above we compute the value for \(r_{_{DP}}\). This value suggests that Distance and Population are not strongly correlated.
numer <- sum(diff_D*diff_P)
denom <- sqrt(sum(diff_D2))*sqrt(sum(diff_P2))
numer/denom
## [1] -0.2396499
We can quickly compute the correlation coefficients for all of the variables in this data set and make a correlation plot using the following code.
knitr::kable(cor(dat))
| Airport | Revenue | Distance | Population | |
|---|---|---|---|---|
| Airport | 1.0000000 | 0.1614322 | 0.2568860 | 0.2407534 |
| Revenue | 0.1614322 | 1.0000000 | -0.0270602 | 0.7622405 |
| Distance | 0.2568860 | -0.0270602 | 1.0000000 | -0.2396499 |
| Population | 0.2407534 | 0.7622405 | -0.2396499 | 1.0000000 |
corrplot::corrplot(cor(dat))
High-leverage points are observations made at extreme or outlying values of the independent variables. Thus, high leverage points do not have neighboring observations. This means that high leverage points exert a strong influence on the fitted regression model drawing it close to that particular observation. Observing the scatter plots from part (a) shows that airports 20 and 22 may be high leveral points.
We can determine, numerically, whether a point in the model matrix is a high leverage point with the leverage score. The leverage score \(h_i\) is defined as
\[ \begin{aligned} h_i &= \left[\boldsymbol{H}_{ii}\right]\\\\ &= \boldsymbol{X}\left(\boldsymbol{X}^{T}\boldsymbol{X}\right)^{-1}\boldsymbol{X}^{T} \end{aligned} \]
where \(\boldsymbol{H}\) is known as the hat matrix, \(\boldsymbol{H}_{ii}\) is the \(i^{th}\) term on the diagonal of \(\boldsymbol{H}\), and \(\boldsymbol{X}\) is the model matrix. For this exercise the model matrix is composed of the Distance and Population values and a column of \(1's\) representing the intercept of the linear model.
\[ \boldsymbol{X} = \left[\begin{array}{rrr} 1.00 & 233.00 & 56.00 \\ 1.00 & 209.00 & 74.00 \\ 1.00 & 206.00 & 67.00 \\ 1.00 & 232.00 & 78.00 \\ 1.00 & 125.00 & 73.00 \\ 1.00 & 245.00 & 54.00 \\ 1.00 & 213.00 & 100.00 \\ 1.00 & 134.00 & 98.00 \\ 1.00 & 140.00 & 95.00 \\ 1.00 & 165.00 & 81.00 \\ 1.00 & 234.00 & 52.00 \\ 1.00 & 205.00 & 96.00 \\ 1.00 & 214.00 & 96.00 \\ 1.00 & 183.00 & 73.00 \\ 1.00 & 230.00 & 55.00 \\ 1.00 & 238.00 & 91.00 \\ 1.00 & 144.00 & 64.00 \\ 1.00 & 220.00 & 60.00 \\ 1.00 & 170.00 & 60.00 \\ 1.00 & 170.00 & 240.00 \\ 1.00 & 290.00 & 70.00 \\ 1.00 & 340.00 & 75.00 \\ \end{array} \right] \]
Using the above expression we can determine \(h_i, i = 1, \ldots,22\) using the code shown in the chunk below. Note, in R matrix multiplication is specified by %*% whereas scalar multiplication is specified using *.
X <- matrix(c(rep(1,nrow(dat)), Distance, Population),
ncol = 3,
byrow = F)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h <- diag(H)
h
## [1] 0.07380725 0.04759827 0.05335529 0.05732730 0.18426389 0.08771878
## [7] 0.05853052 0.14026726 0.12499900 0.07887391 0.08083477 0.05174693
## [13] 0.05434898 0.06139804 0.07340264 0.07067954 0.14505033 0.06203460
## [19] 0.09736347 0.84680251 0.17203235 0.37756438
There is, of course, an R function that will return these leverage scores very quickly.
hat(X)
## [1] 0.07380725 0.04759827 0.05335529 0.05732730 0.18426389 0.08771878
## [7] 0.05853052 0.14026726 0.12499900 0.07887391 0.08083477 0.05174693
## [13] 0.05434898 0.06139804 0.07340264 0.07067954 0.14505033 0.06203460
## [19] 0.09736347 0.84680251 0.17203235 0.37756438
The plot below shows the leverage score values for all 22 airports. Observing the plot, it’s clear that Airport 20 is a high leverage point, while Airport 22 may be a high leverage point.
barplot(height = h,
las = 1,
cex.axis = 1.1,
xlab = 'Airport',
ylab = expression(Leverage~Score~h[i]),
names.arg = 1:22)
Regression is a family pf supervised statistical learning techniques used to determine functional relationships between input variables and output variables. Members of the regression family include: linear regression, logistic regression, failure-time regression, and general linear models. The member of the regression family that students are typically introduced to first is linear regression. Linear regression is a relatively simple method to determine functional relationships between input variables and output variables when the output variable is numeric. In accordance with standard terminology used in statistics literature, we will from now on refer to the output variable as the response and the input variables as either regressors, predictors, or factors.
In this exercise we are asked to fit a first-order linear model relating Revenue to both Distance and Population. In this case Revenue is the response and Distance and Population are the predictors. The expression is for this model is
\[ \mbox{Revenue}_{i}= \beta_0 + \beta_1(\mbox{Distance}_i) + \beta_2(\mbox{Population}_i) + \epsilon_{i}. \tag{1} \]
This model in (1) is ‘linear’ because the response is a linear combination of the \(\beta\) values. An example of a non-linear model is shown below in (2). This is a log-linear model - meaning that \(\log[\mbox{Revenue}]\) would be expressed as a linear model.
\[ \mbox{Revenue}_i = ae^{\beta_1(\text{Distance}_i)}e^{\beta_2(\text{Population}_i)}\Upsilon_i. \tag{2} \]
Further, the model in (1) is ‘first-order’ because all of the \(\beta\) terms are first order. An example of a non-first-order model is shown in (3). This is a second-order model since includes would include
\[ \mbox{Revenue}_{i}= \beta_0 + \beta_1(\mbox{Distance}_i) + \beta_2(\mbox{Population}_i) + \beta_3(\mbox{Distance}_i)^2 + \beta_4(\mbox{Population}_i)^2 + \epsilon_{i}. \tag{1} \]
Referring back to the model in (1), this equation states that the Revenue generated at Airport \(i = 1,\ldots,22\) is a linear function of the distance between the airport and the hub and the population of the city in which the airport is located. The term \(\beta_0\) is interpreted as the revenue when Distance and Population are set to their base levels. The terms \(\beta_1\) and \(\beta_2\) denote the ‘strength’ of impact that changes to Distance and Population have on revenue. For example, if \(\beta_1 = -0.015\) this would imply that for every unit change in distance (miles, km, etc.) revenue would decrease by \(0.015\) units. Likewise, if \(\beta_2 = 2.34\) this would imply that for every unit increase in population, revenue would increase by \(2.34\) units. Finally, the \(\epsilon_i\) term denotes the error that exists between the value of \(\mbox{Response}_i\) predicted by the model and the true, but unknown value of \(\mbox{Response}_i\).
The value of \(\epsilon\) is comprised of measurement errors and errors due to lack of fit. Lack of fit errors come about if we don’t include an important term in our model that has a significant impact on the Revenues that were observed. For example, our model does not consider whether the airport is located near a popular vacation destination.
In R, the standard function for performing a linear model is lm. The code shown in the chunk below is used to perform a linear model to determine what effect Distance and Population have on Revenue. In R, the \(\sim\) symbol denotes a functional relationship (i.e. \(y \sim x \equiv\) y versus x). The \(+\) symbol in the model specification means additive effects.
model <- lm(Revenue ~ Distance + Population, data = dat)
We first want to get a summary of the model which we can do in R by using the function…wait for it…summary. The chunk below provides a summary of the model showing the estimated effect that Distance and Population have on Revenue and an indication as to whether the effect is statistically significant. The summary shows that the effect of Population is statistically significant at the \(\alpha=0.000032\) level while the effect of Distance is significant at the \(\alpha=0.279\) level.
summary(model)
##
## Call:
## lm(formula = Revenue ~ Distance + Population, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.98 -14.99 -2.84 13.49 69.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 144.3885 39.5312 3.653 0.00169 **
## Distance 0.1692 0.1519 1.114 0.27916
## Population 1.0942 0.2022 5.411 3.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.7 on 19 degrees of freedom
## Multiple R-squared: 0.6067, Adjusted R-squared: 0.5653
## F-statistic: 14.65 on 2 and 19 DF, p-value: 0.0001412
We can verify these results by computing \((1-\alpha)100\%\) confidence intervals for the \(\beta\) values.
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 61.6486928 227.1282560
## Distance -0.1486697 0.4870241
## Population 0.6709427 1.5175409
Because the confidence interval for \(\beta_1\) includes 0 we cannot say that the \(\beta_1 \ne 0\) at the \(\alpha = 0.05\) significance level.
Next, we want to