Answer: The test-sample MSE of the six-term linear model is 2.157 and the R-squared is 0.2513.
Answer: The test-sample MSE of the second-order linear model with cross terms is 1.840 and the R-squared is 0.3613. Compared to the six-term model, the lower MSE indicates better prediction, while the higher R-squared indicates better fit. In-sample R-squared is non-decreasing in the number of variables, so a higher value is expected; the magnitude of the jump, however, suggests we are adding variables that substantially improve fit.
Answer: The covariate selection procedure on the second-order model, using BIC as the minimizing criterion, selected a 13-covariate model with a test-sample MSE of 1.9068. The ADJR2 criterion selected an 18-covariate model with a test-sample MSE of 1.8735. Compared to the full model in problem 4, both of these models predict slightly worse, but this increase in MSE should not be seen as a problem. By lowering the number of covariates in the model, we introduce some bias, but we also substantially decrease the variance of prediction on the test sample. Given that the increase in MSE is small, this is a trade-off worth considering.
Answer: For the 50/50 split, second-order model, the lasso test-sample MSE is 1.8409 and the ridge test-sample MSE is also 1.8409 (the two differ only at the fifth decimal place). Both the lasso and the ridge compare favorably to the full second-order polynomial model, while outperforming the covariate-selected models and the six-term OLS model. These results are encouraging because the penalties punish complexity in a way similar to BIC and ADJR2, but without sacrificing any predictive performance. We would be inclined to select either the ridge or the lasso if the models to this point were the only candidates. Since the lasso performs de facto model selection, it does better than the ridge when the true signal is sparse. Given this characteristic of the lasso and the apparently sparse set of relevant covariates in this dataset, we should be inclined to select the lasso over the ridge.
Answer: Essentially across the board, the test-sample MSE increases when noise variables are added and when the training sample is compressed. Compressing the training sample in tandem with adding noise tends to push MSE up the most. Intuitively, this increase makes sense in both cases. By compressing the training sample, we give the model less information to learn from, which makes generalization to the test data more difficult. By adding noise variables we muddy the water: noise makes it harder for the model to identify which covariates truly affect the outcome. With a smaller training sample and added noise, we give the model less information and more nonsense to sort through, so these cases naturally tend to produce the highest MSEs (lowest predictive power).
One important note: the lasso MSE outperforms the ridge MSE in the 98/2 split with noise. This is again because the lasso performs model selection, so it is more ruthless in eliminating variables that may not matter. The ridge's l2 penalty shrinks coefficients but never sets them exactly to zero, so it is more generous to noise and does not perform as well in the heavily muddled environment. In the 50/50 split with noise the ridge performs better than the lasso, presumably because the larger training sample outweighs the negative impact of the noise. It is unclear to me why the lasso performs so poorly in comparison; since the lasso is strongest in sparse environments, there may be an interaction between sparsity and the size of the noisy training sample.
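For reference, the two estimators differ only in the form of the penalty (standard definitions, stated here for clarity):

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i}(y_i - x_i'\beta)^2 + \lambda \sum_{j}|\beta_j|, \qquad \hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i}(y_i - x_i'\beta)^2 + \lambda \sum_{j}\beta_j^2.$$

The kink in the l1 penalty at zero is what allows the lasso to set coefficients exactly to zero, while the smooth l2 penalty only shrinks them toward zero.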
Answer: Using the 50/50 split, the BIC covariate selection procedure returns a 4-coefficient model consisting of dist, size, vaca, and hub. The ADJR2 procedure returns a 5-coefficient model with the same variables in addition to faa, which indicates slot control. The 98/2 BIC procedure returns dist and hub, while the 98/2 ADJR2 procedure returns hub, inc, and dist.
The MSE is lowest for the 50/50 split, so those models will be the only ones considered in determining market structure. Comparing those two models, faa seems to be the variable most on the fence: excluding it makes almost no difference in the maximum adjusted R-squared, so dist, size, vaca, and hub appear to determine airline market structure (see the check sketched below). This conclusion makes intuitive sense. In general, people travel either for work or for vacation, and hubs are often located in high-traffic areas that serve as vacation spots or sit close to industry (JFK for NYC, LAX for LA). Market size is important for an airline's revenue, and the longer the distance of a market, the more likely people are to travel by plane. Given the cost-prohibitive nature of flying, median municipal income should not have a large effect on the number of airlines in any given market: even in lower-income cities, such as Detroit or my hometown of Cleveland, there are enough high-income individuals to saturate the market for flights.
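As a quick check of the faa claim, one can compare the training-sample adjusted R-squared of the selected model with and without faa. A minimal sketch, using the train split defined in the code below:

# Adjusted R-squared of the selected model, without and then with faa
summary(lm(num ~ dist + size + vaca + hub, data = train))$adj.r.squared
summary(lm(num ~ dist + size + vaca + hub + faa, data = train))$adj.r.squared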
a) Load the market airline data along with the four new datasets and then merge by id. (Note: the code for these data-preparation steps is hidden with echo = FALSE to keep the markdown submission clean; a sketch of the steps follows below.)
b) Merge the datasets downloaded from Canvas.
c) Sort the data by origin_airport_id and then by dest_airport_id.
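Since the chunk itself is hidden, here is a minimal sketch of steps (a)-(c); the file names are placeholders, not the actual Canvas file names:

# Placeholder file names; the real Canvas files may differ
airline <- read.csv("airline_markets.csv")
extras <- lapply(c("extra1.csv", "extra2.csv", "extra3.csv", "extra4.csv"), read.csv)
# Merge all five datasets on the common id column
final <- Reduce(function(a, b) merge(a, b, by = "id"), c(list(airline), extras))
# Sort by origin airport, then destination airport
final <- final[order(final$origin_airport_id, final$dest_airport_id), ]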
a) Set the seed to zero and split the data 50/50 into a training sample and a test sample.
smp_size = floor(0.5 * nrow(final))  # half the observations go to training
set.seed(0)  # fix the RNG so the split is reproducible
train_ind <- sample(seq_len(nrow(final)), size = smp_size)  # random row indices
train = final[train_ind, ]  # training sample
test = final[-train_ind, ]  # remaining rows form the test sample
a) Estimate a linear model including six variables (market income, market size, slot controlled, vacation route, hub route, average distance). Compute the R squared and calculate the MSE on the test sample.
# OLS model using the training dataset
m.1.50 = lm(num ~ size + inc + faa + dist + vaca + hub,data=train)
m.1.mse.50 = mean((test$num - predict(m.1.50,test))^2)
m.1.rs.50 = 1 - (sum((test$num-predict(m.1.50,test))^2))/(sum((test$num - mean(test$num))^2))
# OLS 50 with noise
m.1.50.noise = lm(num ~ size + inc + faa + dist + vaca + hub + noise1 + noise2 + noise3,data=train)
m.1.mse.50.noise = mean((test$num - predict(m.1.50.noise,test))^2)
m.1.rs.50.noise = 1 - (sum((test$num-predict(m.1.50.noise,test))^2))/(sum((test$num - mean(test$num))^2))
data.frame(m.1.mse.50,m.1.rs.50,m.1.mse.50.noise,m.1.rs.50.noise)
## m.1.mse.50 m.1.rs.50 m.1.mse.50.noise m.1.rs.50.noise
## 1 2.157328 0.2513352 2.156537 0.2516097
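For reference, the out-of-sample R-squared computed in the code above is

$$R^2_{\text{test}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y}_{\text{test}})^2},$$

where the predictions come from the model fit on the training sample and the mean is taken over the test sample.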
a) Estimate a linear model with second order polynomials and all cross terms. Compute the R squared.
Note: Both the noise and no-noise MSEs for the 50/50 split are listed in this problem.
# Cross term 50
m.2.50 = lm(num ~ (size + inc + faa + dist + vaca + hub)^2 + I(size^2) + I(inc^2) + I(dist^2), data = train)
m.2.mse.50 = mean((test$num - predict(m.2.50,test))^2)
m.2.rs.50 = 1 - (sum((test$num-predict(m.2.50,test))^2))/(sum((test$num - mean(test$num))^2))
# Cross term 50 with noise
m.2.50.noise = lm(num ~ (size + inc + faa + dist + vaca + hub +noise1 + noise2 +noise3)^2 + I(noise1^2) + I(noise2^2) + I(noise3^2) + I(size^2) + I(inc^2) + I(dist^2), data = train)
m.2.mse.50.noise = mean((test$num - predict(m.2.50.noise,test))^2)
m.2.rs.50.noise = 1 - (sum((test$num-predict(m.2.50.noise,test))^2))/(sum((test$num - mean(test$num))^2))
data.frame(m.2.mse.50,m.2.rs.50,m.2.mse.50.noise,m.2.rs.50.noise)
## m.2.mse.50 m.2.rs.50 m.2.mse.50.noise m.2.rs.50.noise
## 1 1.840322 0.3613467 1.867185 0.3520246
Note: The output is shown for the 50/50 split with and without noise. When this process is repeated in question 7, the output is suppressed and only graphs are shown to allow for a more concise presentation.
a) Perform a simple covariate selection procedure on the model in (4) using Algorithm 6.3 from James et al. (the textbook) using BIC and adjusted R squared as criteria. Compute the out-of-sample MSE of the selected models. Compare with the fit and prediction performance of the OLS model.
# Use backward stepwise selection to generate a set of candidate regressions using all variables in the second-order model.
set.seed(0)
regfit.full=regsubsets(num ~ I(size^2) + I(inc^2) + I(dist^2) + (faa + hub + vaca + size + inc + dist)^2, data = train, nvmax = 22,method="backward")
## Reordering variables and trying again:
reg.summary=summary(regfit.full)
# Plot the BIC versus the number of variables for each regression. Mark the minimum BIC with a red dot.
plot(reg.summary$bic,main = "50/50 Split without Noise - BIC Selection",xlab="Number of Variables",ylab="BIC",type="l")
points(which.min(reg.summary$bic),reg.summary$bic[which.min(reg.summary$bic)], col="red",cex=2,pch=20)
plot(regfit.full,scale = "bic",main = "BIC Variable Bar Chart")
# Plot the ADJR2 versus the number of variables for each regression. Mark the max ADJR2 with a red dot.
plot(reg.summary$adjr2,main = "50/50 Split without Noise - ADJR2 Selection",xlab="Number of Variables",ylab="Adjusted RSq",type="l")
points(which.max(reg.summary$adjr2),reg.summary$adjr2[which.max(reg.summary$adjr2)], col="red",cex=2,pch=20)
plot(regfit.full,scale = "adjr2", main = "ADJ R-squared Variable Bar Chart")
# Print the model sizes that maximize ADJR2, minimize BIC, and minimize Cp.
data.frame(ADJr2 = which.max(reg.summary$adjr2), BIC = which.min(reg.summary$bic), CP = which.min(reg.summary$cp))
## ADJr2 BIC CP
## 1 18 13 16
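For reference, the two criteria compared above are, up to additive constants (assuming the usual regsubsets definitions),

$$\mathrm{BIC} = n\log(\mathrm{RSS}/n) + k\log n, \qquad \bar{R}^2 = 1 - \frac{\mathrm{RSS}/(n-k-1)}{\mathrm{TSS}/(n-1)}.$$

BIC charges log n per additional variable, while adjusted R-squared applies only a degrees-of-freedom correction, which is why BIC settles on the smaller 13-variable model and ADJR2 on the 18-variable model.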
# Create a model matrix from the test data; each candidate model's coefficients will be applied to it
test.mat=model.matrix(num ~ I(size^2) + I(inc^2) + I(dist^2) + (faa + hub + vaca + size + inc + dist)^2, data = test)
val.errors = rep(NA,22)
for (i in 1:22){
  coefi = coef(regfit.full, id=i)  # coefficients of the i-variable model
  pred = test.mat[,names(coefi)]%*%coefi  # manual prediction: regsubsets has no predict() method
  val.errors[i] = mean((test$num-pred)^2)  # test-sample MSE
}
# Print out the MSEs at the ADJR2 and BIC optima
adjr2.mse.50 = val.errors[which.max(reg.summary$adjr2)]
bic.mse.50 = val.errors[which.min(reg.summary$bic)]
data.frame(adjr2.mse.50,bic.mse.50)
## adjr2.mse.50 bic.mse.50
## 1 1.873489 1.906829
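Because regsubsets() objects have no predict() method, the manual loop above is required. It can be wrapped in a reusable helper; a minimal sketch, with the model formula passed explicitly rather than recovered from the call:

# Sketch of a predict() method for regsubsets objects
predict.regsubsets <- function(object, newdata, id, formula) {
  mat <- model.matrix(formula, newdata)  # design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the id-variable model
  as.vector(mat[, names(coefi)] %*% coefi)
}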
Repeat the same process for the 50/50 sample with noise.
# Use backward stepwise selection to generate a set of candidate regressions using all variables in the second-order model, including the noise terms.
set.seed(0)
regfit.full.noise=regsubsets(num ~ I(size^2) + I(inc^2) + I(dist^2) + I(noise1^2) + I(noise2^2) + I(noise3^2) + (faa + hub + vaca + size + inc + dist + noise1 + noise2 + noise3)^2, data = train, nvmax = 46,method="backward")
## Reordering variables and trying again:
reg.summary.noise=summary(regfit.full.noise)
# Plot the BIC versus the number of variables for each regression. Mark the minimum BIC with a red dot.
plot(reg.summary.noise$bic,main = "50/50 Split with Noise - BIC Selection",xlab="Number of Variables",ylab="BIC",type="l")
points(which.min(reg.summary.noise$bic),reg.summary.noise$bic[which.min(reg.summary.noise$bic)], col="red",cex=2,pch=20)
plot(regfit.full.noise,scale = "bic",main = "BIC Variable Bar Chart")
# Plot the ADJR2 versus the number of variables for each regression. Mark the max ADJR2 with a red dot.
plot(reg.summary.noise$adjr2,main = "50/50 Split with Noise - ADJR2 Selection",xlab="Number of Variables",ylab="Adjusted RSq",type="l")
points(which.max(reg.summary.noise$adjr2),reg.summary.noise$adjr2[which.max(reg.summary.noise$adjr2)], col="red",cex=2,pch=20)
plot(regfit.full.noise,scale = "adjr2",main = "ADJ R-squared Variable Bar Chart")
# Print the model sizes that maximize ADJR2, minimize BIC, and minimize Cp.
data.frame(ADJr2 = which.max(reg.summary.noise$adjr2), BIC = which.min(reg.summary.noise$bic), CP = which.min(reg.summary.noise$cp))
## ADJr2 BIC CP
## 1 29 13 24
# Create a model matrix from the test data; each candidate model's coefficients will be applied to it
test.mat=model.matrix(num ~ I(size^2) + I(inc^2) + I(dist^2) + I(noise1^2) + I(noise2^2) + I(noise3^2) + (faa + hub + vaca + size + inc + dist + noise1 + noise2 + noise3)^2, data = test)
val.errors = rep(NA,46)
for (i in 1:46){
coefi = coef(regfit.full.noise, id=i)
pred = test.mat[,names(coefi)]%*%coefi
val.errors[i] = mean((test$num-pred)^2)
}
# Print out the MSEs at the ADJR2 and BIC optima of the noise model
adjr2.mse.50.noise = val.errors[which.max(reg.summary.noise$adjr2)]
bic.mse.50.noise = val.errors[which.min(reg.summary.noise$bic)]
data.frame(adjr2.mse.50.noise,bic.mse.50.noise)
## adjr2.mse.50.noise bic.mse.50.noise
## 1              1.934            2.085
a) Perform a 10-fold cross-validation procedure to pick the tuning parameter for lasso and ridge, using the training dataset only.
#Ridge no noise 50
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
set.seed(0)
grid=10^seq(10,-5,length=500)
x.1=model.matrix(num ~ (size + inc + faa + dist + vaca + hub)^2 + I(size^2) + I(inc^2) + I(dist^2),train)
y.1=train$num
x.2 = model.matrix(num ~ (size + inc + faa + dist + vaca + hub)^2 + I(size^2) + I(inc^2) + I(dist^2),test)
y.2 = test$num
ridge.mod=glmnet(x.1,y.1,alpha=0,lambda=grid,thresh = 1e-12)
cv.out=cv.glmnet(x.1,y.1,alpha=0,type.measure = "mse",nfolds = 10, lambda = grid)
plot(cv.out,main = "Plot of Log Lambda Grid vs. Ridge 50/50 no-noise MSE")
bestlam=cv.out$lambda.min
ridge.pred = predict(ridge.mod,s=bestlam,x.2)
ridge.mse.50 = mean((test$num-ridge.pred)^2)
# Ridge noise 50
grid=10^seq(10,-5,length=500)
x.1=model.matrix(num ~ (size + inc + faa + dist + vaca + hub +noise1 + noise2 + noise3)^2 + I(noise1^2) + I(noise2^2) + I(noise3^2) + I(size^2) + I(inc^2) + I(dist^2),train)
y.1=train$num
x.2 = model.matrix(num ~ (size + inc + faa + dist + vaca + hub +noise1 + noise2 + noise3)^2 + I(noise1^2) + I(noise2^2) + I(noise3^2) + I(size^2) + I(inc^2) + I(dist^2),test)
y.2 = test$num
ridge.mod=glmnet(x.1,y.1,alpha=0,lambda=grid,thresh = 1e-12)
cv.out=cv.glmnet(x.1,y.1,alpha=0,type.measure = "mse",nfolds = 10, lambda = grid)
plot(cv.out,main = "Plot of Log Lambda Grid vs. Ridge 50/50 noise MSE")
bestlam=cv.out$lambda.min
ridge.pred = predict(ridge.mod,s=bestlam,x.2)
ridge.mse.50.noise = mean((test$num-ridge.pred)^2)
# Lasso no noise 50
grid=10^seq(10,-5,length=500)
x.1=model.matrix(num ~ (size + inc + faa + dist + vaca + hub)^2 + I(size^2) + I(inc^2) + I(dist^2),train)
y.1=train$num
x.2 = model.matrix(num ~ (size + inc + faa + dist + vaca + hub)^2 + I(size^2) + I(inc^2) + I(dist^2),test)
y.2 = test$num
lasso.mod=glmnet(x.1,y.1,alpha=1,lambda=grid,thresh = 1e-12)
cv.out=cv.glmnet(x.1,y.1,alpha=1,type.measure = "mse",nfolds = 10, lambda = grid)
plot(cv.out,main = "Plot of Log Lambda Grid vs. Lasso 50/50 no-noise MSE")
bestlam=cv.out$lambda.min
lasso.pred = predict(lasso.mod,s=bestlam,x.2)
lasso.mse.50 = mean((test$num-lasso.pred)^2)
# Lasso Noise 50
grid=10^seq(10,-5,length=500)
x.1=model.matrix(num ~ (size + inc + faa + dist + vaca + hub +noise1 + noise2 + noise3)^2 + I(noise1^2) + I(noise2^2) + I(noise3^2) + I(size^2) + I(inc^2) + I(dist^2),train)
y.1=train$num
x.2 = model.matrix(num ~ (size + inc + faa + dist + vaca + hub +noise1 + noise2 + noise3)^2 + I(noise1^2) + I(noise2^2) + I(noise3^2) + I(size^2) + I(inc^2) + I(dist^2),test)
y.2 = test$num
lasso.mod=glmnet(x.1,y.1,alpha=1,lambda=grid,thresh = 1e-12)
cv.out=cv.glmnet(x.1,y.1,alpha=1,type.measure = "mse",nfolds = 10, lambda = grid)
plot(cv.out,main = "Plot of Log Lambda Grid vs. Lasso 50/50 noise MSE")
bestlam=cv.out$lambda.min
lasso.pred = predict(lasso.mod,s=bestlam,x.2)
lasso.mse.50.noise = mean((test$num-lasso.pred)^2)
data.frame(lasso.mse.50,lasso.mse.50.noise,ridge.mse.50,ridge.mse.50.noise)
## lasso.mse.50 lasso.mse.50.noise ridge.mse.50 ridge.mse.50.noise
## 1 1.840944 2.2549 1.840918 1.843179
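Consistent with the model-selection interpretation of the lasso discussed above, one can count how many coefficients the cross-validated lasso sets exactly to zero. A sketch; lasso.mod and bestlam here refer to the most recent fit (the noise model):

# Count coefficients set exactly to zero at the CV-selected lambda
lasso.coef <- predict(lasso.mod, type = "coefficients", s = bestlam)
sum(as.vector(lasso.coef) == 0)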
a) Reallocate the training and test samples to the 98/2 split (2% of the data for training, 98% for testing).
smp_size2 = floor(0.02 * nrow(final))
set.seed(0)
train2_ind <- sample(seq_len(nrow(final)), size = smp_size2)
train2 = final[train2_ind, ]
test2 = final[-train2_ind, ]
b) Re-estimate the six-variable linear model on the new samples, with and without the noise variables.
## m.1.mse.98 m.1.rs.98 m.1.mse.98.noise m.1.rs.98.noise
## 1 2.38395 0.188192 2.476861 0.1565528
c) Estimate the linear model with second order polynomials and all cross terms on the 98/2 split. Compute the R squared.
## m.2.mse.98.noise m.2.rs.98.noise m.2.mse.98 m.2.rs.98
## 1 4.615976 -0.5718814 2.7604 0.05999934
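Note the negative out-of-sample R-squared for the noise model: with only 2% of the observations available for training, the full second-order specification predicts the test sample worse than simply using the test-sample mean, so the residual sum of squares exceeds the total sum of squares in the R-squared ratio.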
d) Repeat the covariate selection procedure on the 98/2 split, with noise.
## Reordering variables and trying again:
## ADJr2.98.noise BIC.98.noise CP.98.noise
## 1 25 7 13
## adjr2.mse.98.noise bic.mse.98.noise
## 1 3.555884 2.505595
e) Repeat the covariate selection procedure on the 98/2 split, without noise.
## Reordering variables and trying again:
## ADJr2.98 BIC.98 CP.98
## 1 11 7 8
## adjr2.mse.98 bic.mse.98
## 1 2.256477 2.552352
f) Re-estimate the ridge and lasso models on the 98/2 split, with and without noise.
## lasso.mse.98 lasso.mse.98.noise ridge.mse.98 ridge.mse.98.noise
## 1 2.386226 2.187736 2.505385 2.32966
a) Run the simple OLS model through the same covariate selection procedure, using BIC and ADJR2 as selection criteria, on both the 50/50 and 98/2 splits.
## ADJr2.50 BIC.50
## 1 5 4
## adjr2.mse.50.simple bic.mse.50.simple
## 1 2.15819 2.159633
## ADJr2.98 BIC.98
## 1 3 2
## adjr2.mse.98.simple bic.mse.98.simple
## 1 3.507795 3.530022
Create a data frame collecting all of the results.
##              OLS1 MSE  OLS1 AR2  OLS2 MSE  OLS2 AR2  OLS3CV AR2 MSE  OLS3CV BIC MSE  Lasso MSE  Ridge MSE  OLS1CV AR2 MSE  OLS1CV BIC MSE
## 50              2.157    0.2513     1.840    0.3613           1.873           1.907      1.841      1.841           2.158            2.16
## 50 w/ noise     2.157    0.2516     1.867    0.3520           1.934           2.085      2.255      1.843              NA              NA
## 98              2.384    0.1882     2.760    0.0600           2.256           2.552      2.386      2.505           3.508            3.53
## 98 w/ noise     2.477    0.1566     4.616   -0.5719           3.556           2.506      2.188      2.330              NA              NA