Lab 6 Regression Calvin Little

airfares <- read.csv("Airfares.csv", header = TRUE)
head(airfares, 10)

##    S_CODE                 S_CITY E_CODE                 E_CITY COUPON NEW
## 1       * Dallas/Fort Worth   TX      * Amarillo            TX   1.00   3
## 2       * Atlanta             GA      * Baltimore/Wash Intl MD   1.06   3
## 3       * Boston              MA      * Baltimore/Wash Intl MD   1.06   3
## 4     ORD Chicago             IL      * Baltimore/Wash Intl MD   1.06   3
## 5     MDW Chicago             IL      * Baltimore/Wash Intl MD   1.06   3
## 6       * Cleveland           OH      * Baltimore/Wash Intl MD   1.01   3
## 7       * Dallas/Fort Worth   TX      * Baltimore/Wash Intl MD   1.28   3
## 8       * Fort Lauderdale     FL      * Baltimore/Wash Intl MD   1.15   3
## 9       * Houston             TX      * Baltimore/Wash Intl MD   1.33   3
## 10      * Kansas City         MO      * Baltimore/Wash Intl MD   1.60   2
##    VACATION  SW      HI S_INCOME E_INCOME   S_POP   E_POP       SLOT GATE
## 1        No Yes 5291.99    28637    21112 3036732  205711       Free Free
## 2        No  No 5419.16    26993    29838 3532657 7145897       Free Free
## 3        No  No 9185.28    30124    29838 5787293 7145897       Free Free
## 4        No Yes 2657.35    29260    29838 7830332 7145897 Controlled Free
## 5        No Yes 2657.35    29260    29838 7830332 7145897       Free Free
## 6        No Yes 3408.11    26046    29838 2230955 7145897       Free Free
## 7        No  No 6754.48    28637    29838 3036732 7145897       Free Free
## 8       Yes Yes 5584.00    26752    29838 1440377 7145897       Free Free
## 9        No Yes 4662.44    27211    29838 3770125 7145897       Free Free
## 10       No Yes 2617.00    25450    29838 1694803 7145897       Free Free
##    DISTANCE   PAX   FARE
## 1       312  7864  64.11
## 2       576  8820 174.47
## 3       364  6452 207.76
## 4       612 25144  85.47
## 5       612 25144  85.47
## 6       309 13386  56.76
## 7      1220  4625 228.00
## 8       921  5512 116.54
## 9      1249  7811 172.63
## 10      964  4657 114.76

# Remove unnecessary variables
# Since S_Code, S_CITY, E_CODE, and E_CITY are likely to determine
# DISTANCE, and these 4 categorical variables have too many values,
# resulting in too many dummy variables,
# they are dropped from the regression model.
# Incidentally, they are also not included in the new record.

airfares <- airfares[, -c(1:4)]
names(airfares)

##  [1] "COUPON"   "NEW"      "VACATION" "SW"       "HI"       "S_INCOME"
##  [7] "E_INCOME" "S_POP"    "E_POP"    "SLOT"     "GATE"     "DISTANCE"
## [13] "PAX"      "FARE"

# 2 Correlations and training-validation split ----------------------------------------------------------------------


set.seed(666)

# 60% Training / 40% Validation split
train_index <- sample(1:nrow(airfares), 0.6 * nrow(airfares))
valid_index <- setdiff(1:nrow(airfares), train_index)

train_df <- airfares[train_index, ]
valid_df <- airfares[valid_index, ]

# correlation

library(corrgram)

corrgram(train_df)

# It's possible that the newer version of corrgram does not automatically
# exclude the categorical variables. If corrgram() does not run, please try
# using the numerical columns only. For eg, if columns 3 and 5-10 are numerical,
# use train_df[,c(3, 5:10)]


# alternatively, use the following to compute the correlations for the 
# correlation matrix for the numerical variables only

cor(train_df[sapply(train_df,is.numeric)])

##               COUPON          NEW          HI    S_INCOME    E_INCOME
## COUPON    1.00000000  0.018547315 -0.38663306 -0.18604105  0.06294362
## NEW       0.01854732  1.000000000  0.02882838 -0.04964667  0.13018727
## HI       -0.38663306  0.028828379  1.00000000  0.03289966  0.08336164
## S_INCOME -0.18604105 -0.049646670  0.03289966  1.00000000 -0.18508018
## E_INCOME  0.06294362  0.130187275  0.08336164 -0.18508018  1.00000000
## S_POP    -0.18487385 -0.038284370 -0.07834743  0.51637130 -0.15212886
## E_POP     0.15729603  0.088964587 -0.10961935 -0.31456037  0.42065795
## DISTANCE  0.73686173  0.096615554 -0.35625281 -0.04997598  0.19493466
## PAX      -0.34477588  0.005575263 -0.16039055  0.15201566  0.27909442
## FARE      0.49599378  0.118340391 -0.01094012  0.16549632  0.30689238
##                S_POP       E_POP    DISTANCE          PAX        FARE
## COUPON   -0.18487385  0.15729603  0.73686173 -0.344775883  0.49599378
## NEW      -0.03828437  0.08896459  0.09661555  0.005575263  0.11834039
## HI       -0.07834743 -0.10961935 -0.35625281 -0.160390551 -0.01094012
## S_INCOME  0.51637130 -0.31456037 -0.04997598  0.152015664  0.16549632
## E_INCOME -0.15212886  0.42065795  0.19493466  0.279094423  0.30689238
## S_POP     1.00000000 -0.27414791 -0.04810911  0.321823933  0.14974024
## E_POP    -0.27414791  1.00000000  0.16017486  0.318280760  0.26249249
## DISTANCE -0.04810911  0.16017486  1.00000000 -0.092993058  0.67632661
## PAX       0.32182393  0.31828076 -0.09299306  1.000000000 -0.06843540
## FARE      0.14974024  0.26249249  0.67632661 -0.068435401  1.00000000

#INTERPRETATION:

#COUPON Correlates 0.496 with FARE. This means that higher COUPON values in the dataset are associated with higher fares with a moderate positive correlation.

#DISTANCE Correlates 0.676 with FARE. This means that longer flights result in higher average fares.

#E_INCOME, and E_POP have 0.307 and 0.262 correlation with FARE respectively, meaning that the higher income or population the destination is for a flight, the higher the average fare is, with a weak to moderate correlation.

# 3 Regression model ----------------------------------------------------------------------
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:corrgram':
## 
##     panel.fill

#k=11
knn_model_11 <- caret::knnreg(FARE ~ ., data = train_df, k = 11)

predict_train <- predict(knn_model_11, train_df)
predict_valid <- predict(knn_model_11, valid_df)

train_metrics <- postResample(pred = predict_train, obs = train_df$FARE)
valid_metrics <- postResample(pred = predict_valid, obs = valid_df$FARE)

train_metrics

##       RMSE   Rsquared        MAE 
## 63.6915878  0.2827738 52.4816072

valid_metrics

##       RMSE   Rsquared        MAE 
## 64.5607745  0.3080904 52.5923820

#INTERPRETATION

#On average, flight cost predictions are off by about $66, while the MEAN average error (absolute difference between predicted and actual fares) is about $52, consistent across both the training and validation sets.  

#While there isn't much indication of overfitting in this model, the R-squared value of 0.27 means the model's performance is modest. This statistic indicates there are other statistical indicators of fare that aren't being accessed by either the dataframe given or the model used.

#k=3 results in the lowest Mean Error Percentage and lowest Mean Error on average in the Validation Set.


# predict validation set
library(forecast)


# model evaluation

plot(train_df$FARE, predict(knn_model_11, train_df),
     xlab = "Actual FARE",
     ylab = "Predicted FARE",
     main = "Predicted vs Actual FARE (Training)",  
     pch = 19,                         
     col = "green",                     
     cex = 1.2,                        
     las = 1,                          
     bty = "l")
grid()

plot(valid_df$FARE, predict(knn_model_11, valid_df),
     xlab = "Actual FARE",
     ylab = "Predicted FARE",
     main = "Predicted vs Actual FARE (Training)",  
     pch = 19,                         
     col = "blue",                     
     cex = 1.2,                        
     las = 1,                          
     bty = "l")
grid()

# 4 create new record ----------------------------------------------------------------------
library(forecast)

new_record <- data.frame(
  COUPON = 1.202, NEW = 3, VACATION = "Yes", SW = "Yes", 
  HI = 4442.141, S_INCOME = 28760, E_INCOME = 27664, 
  S_POP = 4557004, E_POP = 3195503, SLOT = "Free", 
  GATE = "Free", PAX = 12782, DISTANCE = 1976
)

# Convert categorical columns to factors with the same levels as training
cat_cols <- c("VACATION","SW","SLOT","GATE")
for (col in cat_cols) {
  # Ensure the column in train_df is factor
  train_df[[col]] <- factor(train_df[[col]])
  
  # Get the training levels
  train_levels <- levels(train_df[[col]])
  
  # If new_record value not in training levels, replace with first level
  val <- new_record[[col]]
  if (!(val %in% train_levels)) {
    val <- train_levels[1]
  }
  
  # Convert to factor with the same levels as training
  new_record[[col]] <- factor(val, levels = train_levels)
}

predicted_fare <- predict(knn_model_11, new_record)
predicted_fare

## [1] 156.2664

#The predicted fare for this new record is $156.27

# 5 Discussion ----------------------------------------------------------------------
#5. Which variables would actually be available in a real-world implementation of this model for predicting average fare between airports?

  #In this data set, the variables for population S_POP and E_POP would be known.

  #Additionally, airport-specific variables like SLOT, GATE, PAX, and DISTANCE Would   also be used in a real-world implementation.


#6. Assuming the variable fare is recoded into high/low using the mean.
#ifelse(FARE >= mean(FARE), “high”, “low”)

#a. Propose one data analysis technique that can be used to predict whether a route’s fare is high or low. There’s no need to actually train the model

  #LOGISTIC REGRESSION should be used to scale and classify continuous data into a categorical variable, using a probability cutoff to determine the final output.

#b. Describe appropriate data transformations for your proposed
#technique (if any)

  #Normalize all variables to a relative scale, such that categorical and continuous variables both determine the predictions based on relative values to other observations in the dataset, rather than comparing large numbers to 0/1 values.

#c. Justify your proposed technique.

  #This technique is justified because it combines both types of variables into this set into a set of scaled coefficients which determine the impact on the regression that variable has.

  #Additionally, the output of this regression can be both probability-based and classification-based, meaning that "high" and "low" probability cutoff values can be adjusted for accuracy.


#d. Describe how you would evaluate the quality of this new model

  # To determine the quality of this model, a confusion matrix would be generated listing the amount of predicted high/low fares to their actual classification. Depending on the amount of false positives or negatives, further tuning of parameters would be used before further analysis.

Lab 6 Regression Calvin Little

2025-11-23