airfares <- read.csv("Airfares.csv", header = TRUE)
head(airfares, 10)
## S_CODE S_CITY E_CODE E_CITY COUPON NEW
## 1 * Dallas/Fort Worth TX * Amarillo TX 1.00 3
## 2 * Atlanta GA * Baltimore/Wash Intl MD 1.06 3
## 3 * Boston MA * Baltimore/Wash Intl MD 1.06 3
## 4 ORD Chicago IL * Baltimore/Wash Intl MD 1.06 3
## 5 MDW Chicago IL * Baltimore/Wash Intl MD 1.06 3
## 6 * Cleveland OH * Baltimore/Wash Intl MD 1.01 3
## 7 * Dallas/Fort Worth TX * Baltimore/Wash Intl MD 1.28 3
## 8 * Fort Lauderdale FL * Baltimore/Wash Intl MD 1.15 3
## 9 * Houston TX * Baltimore/Wash Intl MD 1.33 3
## 10 * Kansas City MO * Baltimore/Wash Intl MD 1.60 2
## VACATION SW HI S_INCOME E_INCOME S_POP E_POP SLOT GATE
## 1 No Yes 5291.99 28637 21112 3036732 205711 Free Free
## 2 No No 5419.16 26993 29838 3532657 7145897 Free Free
## 3 No No 9185.28 30124 29838 5787293 7145897 Free Free
## 4 No Yes 2657.35 29260 29838 7830332 7145897 Controlled Free
## 5 No Yes 2657.35 29260 29838 7830332 7145897 Free Free
## 6 No Yes 3408.11 26046 29838 2230955 7145897 Free Free
## 7 No No 6754.48 28637 29838 3036732 7145897 Free Free
## 8 Yes Yes 5584.00 26752 29838 1440377 7145897 Free Free
## 9 No Yes 4662.44 27211 29838 3770125 7145897 Free Free
## 10 No Yes 2617.00 25450 29838 1694803 7145897 Free Free
## DISTANCE PAX FARE
## 1 312 7864 64.11
## 2 576 8820 174.47
## 3 364 6452 207.76
## 4 612 25144 85.47
## 5 612 25144 85.47
## 6 309 13386 56.76
## 7 1220 4625 228.00
## 8 921 5512 116.54
## 9 1249 7811 172.63
## 10 964 4657 114.76
# Remove unnecessary variables
# Since S_Code, S_CITY, E_CODE, and E_CITY are likely to determine
# DISTANCE, and these 4 categorical variables have too many values,
# resulting in too many dummy variables,
# they are dropped from the regression model.
# Incidentally, they are also not included in the new record.
airfares <- airfares[, -c(1:4)]
names(airfares)
## [1] "COUPON" "NEW" "VACATION" "SW" "HI" "S_INCOME"
## [7] "E_INCOME" "S_POP" "E_POP" "SLOT" "GATE" "DISTANCE"
## [13] "PAX" "FARE"
# 2 Correlations and training-validation split ----------------------------------------------------------------------
set.seed(666)
# 60% Training / 40% Validation split
train_index <- sample(1:nrow(airfares), 0.6 * nrow(airfares))
valid_index <- setdiff(1:nrow(airfares), train_index)
train_df <- airfares[train_index, ]
valid_df <- airfares[valid_index, ]
# correlation
library(corrgram)
corrgram(train_df)

# It's possible that the newer version of corrgram does not automatically
# exclude the categorical variables. If corrgram() does not run, please try
# using the numerical columns only. For eg, if columns 3 and 5-10 are numerical,
# use train_df[,c(3, 5:10)]
# alternatively, use the following to compute the correlations for the
# correlation matrix for the numerical variables only
cor(train_df[sapply(train_df,is.numeric)])
## COUPON NEW HI S_INCOME E_INCOME
## COUPON 1.00000000 0.018547315 -0.38663306 -0.18604105 0.06294362
## NEW 0.01854732 1.000000000 0.02882838 -0.04964667 0.13018727
## HI -0.38663306 0.028828379 1.00000000 0.03289966 0.08336164
## S_INCOME -0.18604105 -0.049646670 0.03289966 1.00000000 -0.18508018
## E_INCOME 0.06294362 0.130187275 0.08336164 -0.18508018 1.00000000
## S_POP -0.18487385 -0.038284370 -0.07834743 0.51637130 -0.15212886
## E_POP 0.15729603 0.088964587 -0.10961935 -0.31456037 0.42065795
## DISTANCE 0.73686173 0.096615554 -0.35625281 -0.04997598 0.19493466
## PAX -0.34477588 0.005575263 -0.16039055 0.15201566 0.27909442
## FARE 0.49599378 0.118340391 -0.01094012 0.16549632 0.30689238
## S_POP E_POP DISTANCE PAX FARE
## COUPON -0.18487385 0.15729603 0.73686173 -0.344775883 0.49599378
## NEW -0.03828437 0.08896459 0.09661555 0.005575263 0.11834039
## HI -0.07834743 -0.10961935 -0.35625281 -0.160390551 -0.01094012
## S_INCOME 0.51637130 -0.31456037 -0.04997598 0.152015664 0.16549632
## E_INCOME -0.15212886 0.42065795 0.19493466 0.279094423 0.30689238
## S_POP 1.00000000 -0.27414791 -0.04810911 0.321823933 0.14974024
## E_POP -0.27414791 1.00000000 0.16017486 0.318280760 0.26249249
## DISTANCE -0.04810911 0.16017486 1.00000000 -0.092993058 0.67632661
## PAX 0.32182393 0.31828076 -0.09299306 1.000000000 -0.06843540
## FARE 0.14974024 0.26249249 0.67632661 -0.068435401 1.00000000
#INTERPRETATION:
#COUPON Correlates 0.496 with FARE. This means that higher COUPON values in the dataset are associated with higher fares with a moderate positive correlation.
#DISTANCE Correlates 0.676 with FARE. This means that longer flights result in higher average fares.
#E_INCOME, and E_POP have 0.307 and 0.262 correlation with FARE respectively, meaning that the higher income or population the destination is for a flight, the higher the average fare is, with a weak to moderate correlation.
# 3 Regression model ----------------------------------------------------------------------
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
#k=11
knn_model_11 <- caret::knnreg(FARE ~ ., data = train_df, k = 11)
predict_train <- predict(knn_model_11, train_df)
predict_valid <- predict(knn_model_11, valid_df)
train_metrics <- postResample(pred = predict_train, obs = train_df$FARE)
valid_metrics <- postResample(pred = predict_valid, obs = valid_df$FARE)
train_metrics
## RMSE Rsquared MAE
## 63.6915878 0.2827738 52.4816072
valid_metrics
## RMSE Rsquared MAE
## 64.5607745 0.3080904 52.5923820
#INTERPRETATION
#On average, flight cost predictions are off by about $66, while the MEAN average error (absolute difference between predicted and actual fares) is about $52, consistent across both the training and validation sets.
#While there isn't much indication of overfitting in this model, the R-squared value of 0.27 means the model's performance is modest. This statistic indicates there are other statistical indicators of fare that aren't being accessed by either the dataframe given or the model used.
#k=3 results in the lowest Mean Error Percentage and lowest Mean Error on average in the Validation Set.
# predict validation set
library(forecast)
# model evaluation
plot(train_df$FARE, predict(knn_model_11, train_df),
xlab = "Actual FARE",
ylab = "Predicted FARE",
main = "Predicted vs Actual FARE (Training)",
pch = 19,
col = "green",
cex = 1.2,
las = 1,
bty = "l")
grid()

plot(valid_df$FARE, predict(knn_model_11, valid_df),
xlab = "Actual FARE",
ylab = "Predicted FARE",
main = "Predicted vs Actual FARE (Training)",
pch = 19,
col = "blue",
cex = 1.2,
las = 1,
bty = "l")
grid()

# 4 create new record ----------------------------------------------------------------------
library(forecast)
new_record <- data.frame(
COUPON = 1.202, NEW = 3, VACATION = "Yes", SW = "Yes",
HI = 4442.141, S_INCOME = 28760, E_INCOME = 27664,
S_POP = 4557004, E_POP = 3195503, SLOT = "Free",
GATE = "Free", PAX = 12782, DISTANCE = 1976
)
# Convert categorical columns to factors with the same levels as training
cat_cols <- c("VACATION","SW","SLOT","GATE")
for (col in cat_cols) {
# Ensure the column in train_df is factor
train_df[[col]] <- factor(train_df[[col]])
# Get the training levels
train_levels <- levels(train_df[[col]])
# If new_record value not in training levels, replace with first level
val <- new_record[[col]]
if (!(val %in% train_levels)) {
val <- train_levels[1]
}
# Convert to factor with the same levels as training
new_record[[col]] <- factor(val, levels = train_levels)
}
predicted_fare <- predict(knn_model_11, new_record)
predicted_fare
## [1] 156.2664
#The predicted fare for this new record is $156.27
# 5 Discussion ----------------------------------------------------------------------
#5. Which variables would actually be available in a real-world implementation of this model for predicting average fare between airports?
#In this data set, the variables for population S_POP and E_POP would be known.
#Additionally, airport-specific variables like SLOT, GATE, PAX, and DISTANCE Would also be used in a real-world implementation.
#6. Assuming the variable fare is recoded into high/low using the mean.
#ifelse(FARE >= mean(FARE), “high”, “low”)
#a. Propose one data analysis technique that can be used to predict whether a route’s fare is high or low. There’s no need to actually train the model
#LOGISTIC REGRESSION should be used to scale and classify continuous data into a categorical variable, using a probability cutoff to determine the final output.
#b. Describe appropriate data transformations for your proposed
#technique (if any)
#Normalize all variables to a relative scale, such that categorical and continuous variables both determine the predictions based on relative values to other observations in the dataset, rather than comparing large numbers to 0/1 values.
#c. Justify your proposed technique.
#This technique is justified because it combines both types of variables into this set into a set of scaled coefficients which determine the impact on the regression that variable has.
#Additionally, the output of this regression can be both probability-based and classification-based, meaning that "high" and "low" probability cutoff values can be adjusted for accuracy.
#d. Describe how you would evaluate the quality of this new model
# To determine the quality of this model, a confusion matrix would be generated listing the amount of predicted high/low fares to their actual classification. Depending on the amount of false positives or negatives, further tuning of parameters would be used before further analysis.