Logistic Regression

CHAPTER 10

Logistic Regression

Logistic regression extends the ideas of linear regression to the situation where the outcome variable, Y, is categorical. We can think of a categorical variable as dividing the records into classes. For example, if Y denotes a recommendation on holding/selling/buying a stock, we have a categorical variable with three categories. We can think of each of the stocks in the dataset (the records) as belonging to one of three classes: the hold class, the sell class, and the buy class. Logistic regression can be used for classifying a new record, where its class is unknown, into one of the classes, based on the values of its predictor variables (called classification). It can also be used in data where the class is known, to find factors distinguishing between records in different classes in terms of their predictor variables, or “predictor profile” (called profiling). Logistic regression is used in applications such as

Classifying customers as returning or nonreturning (classification)
Finding factors that differentiate between male and female top executives (profiling)
Predicting the approval or disapproval of a loan based on information such as credit scores (classification)

10.3 Example: Acceptance of Personal Loan

[Recall] the example described in Chapter 9 of acceptance of a personal loan by Universal Bank. The bank’s dataset includes data on 5000 customers. The data include the customer’s response to the last personal loan campaign (Personal Loan), as well as customer demographic information (Age, Income, etc.) and the customer’s relationship with the bank (mortgage, securities account, etc.). See Table 10.1. Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan offered to them in a previous campaign. The goal is to build a model that identifies customers who are most likely to accept the loan offer in future mailings.

#### Load the data
#### Table 10.2

bank.df <- read.csv("~/Box/Teaching (jmmejia@iu.edu)/2020 - K-513/Public Files/Public Data/UniversalBank.csv")
bank.df <- select(bank.df, -c(ID, ZIP.Code))
# Compare to: bank.df <- bank.df[ , -c(1, 5)]  # Drop ID and zip code columns.
# treat Education as categorical (R will create dummy variables)
bank.df$Education <- factor(bank.df$Education, levels = c(1, 2, 3), 
                            labels = c("Undergrad", "Graduate", "Advanced/Professional"))
glimpse(bank.df)

## Observations: 5,000
## Variables: 12
## $ Age                <int> 25, 45, 39, 35, 35, 37, 53, 50, 35, 34, 65, 29, 48…
## $ Experience         <int> 1, 19, 15, 9, 8, 13, 27, 24, 10, 9, 39, 5, 23, 32,…
## $ Income             <int> 49, 34, 11, 100, 45, 29, 72, 22, 81, 180, 105, 45,…
## $ Family             <int> 4, 3, 1, 1, 4, 4, 2, 1, 3, 1, 4, 3, 2, 4, 1, 1, 4,…
## $ CCAvg              <dbl> 1.6, 1.5, 1.0, 2.7, 1.0, 0.4, 1.5, 0.3, 0.6, 8.9, …
## $ Education          <fct> Undergrad, Undergrad, Undergrad, Graduate, Graduat…
## $ Mortgage           <int> 0, 0, 0, 0, 0, 155, 0, 0, 104, 0, 0, 0, 0, 0, 0, 0…
## $ Personal.Loan      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
## $ Securities.Account <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,…
## $ CD.Account         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Online             <int> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0,…
## $ CreditCard         <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,…

Partition the data

# partition data
set.seed(2)
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)  
train.df <- bank.df[train.index, ]  ## data frames not vectors!
valid.df <- bank.df[-train.index, ]
glimpse(valid.df)

## Observations: 2,000
## Variables: 12
## $ Age                <int> 45, 37, 35, 34, 65, 38, 46, 56, 57, 44, 36, 30, 31…
## $ Experience         <int> 19, 13, 10, 9, 39, 14, 21, 31, 27, 18, 11, 6, 5, 2…
## $ Income             <int> 34, 29, 81, 180, 105, 130, 193, 25, 63, 43, 152, 1…
## $ Family             <int> 3, 4, 3, 1, 4, 4, 2, 4, 3, 2, 2, 3, 4, 1, 4, 3, 1,…
## $ CCAvg              <dbl> 1.5, 0.4, 0.6, 8.9, 2.4, 4.7, 8.1, 0.9, 2.0, 0.7, …
## $ Education          <fct> Undergrad, Graduate, Graduate, Advanced/Profession…
## $ Mortgage           <int> 0, 155, 104, 0, 0, 134, 0, 111, 0, 163, 159, 0, 0,…
## $ Personal.Loan      <int> 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Securities.Account <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ CD.Account         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Online             <int> 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1,…
## $ CreditCard         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…

Run regression

# run logistic regression
# use glm() (general linear model) with family = "binomial" to fit a logistic 
# regression.
logit.reg <- glm(Personal.Loan ~ ., data = bank.df, family = "binomial")
#options(scipen=999)
summary(logit.reg)

## 
## Call:
## glm(formula = Personal.Loan ~ ., family = "binomial", data = bank.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1308  -0.1900  -0.0696  -0.0218   4.1835  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -1.231e+01  1.818e+00  -6.773 1.26e-11 ***
## Age                            -3.592e-02  6.727e-02  -0.534 0.593368    
## Experience                      4.504e-02  6.682e-02   0.674 0.500309    
## Income                          6.018e-02  2.966e-03  20.289  < 2e-16 ***
## Family                          6.182e-01  7.704e-02   8.024 1.02e-15 ***
## CCAvg                           1.634e-01  4.406e-02   3.708 0.000209 ***
## EducationGraduate               3.965e+00  2.696e-01  14.708  < 2e-16 ***
## EducationAdvanced/Professional  4.064e+00  2.669e-01  15.225  < 2e-16 ***
## Mortgage                        7.105e-04  5.940e-04   1.196 0.231641    
## Securities.Account             -8.701e-01  3.007e-01  -2.894 0.003806 ** 
## CD.Account                      3.839e+00  3.416e-01  11.239  < 2e-16 ***
## Online                         -7.605e-01  1.657e-01  -4.589 4.46e-06 ***
## CreditCard                     -1.038e+00  2.131e-01  -4.872 1.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3162.0  on 4999  degrees of freedom
## Residual deviance: 1172.3  on 4987  degrees of freedom
## AIC: 1198.3
## 
## Number of Fisher Scoring iterations: 8

Predict loans

#### Table 10.3

# use predict() with type = "response" to compute predicted probabilities. 
logit.reg.pred <- predict(logit.reg, valid.df[, -8], type = "response")
plot(logit.reg.pred)

length(logit.reg.pred)

## [1] 2000

length(logit.reg$fitted.values)

## [1] 5000

some.residuals <- valid.df$Personal.Loan[1:20] - logit.reg.pred[1:20]

plot.df <- data.frame(actual = valid.df$Personal.Loan, predicted = logit.reg.pred)

# first 5 actual and predicted records
data.frame(actual = valid.df$Personal.Loan[1:5], predicted = logit.reg.pred[1:5])

##    actual    predicted
## 2       0 5.569887e-05
## 6       0 4.253719e-03
## 9       0 4.692249e-02
## 10      1 9.790117e-01
## 11      0 5.888661e-01

# in points
ggplot(plot.df) + 
  geom_point(aes(x=actual, y=predicted))

# in boxplot form
ggplot(plot.df) + 
  geom_boxplot(aes(x=factor(actual), y=predicted))

lift chart

Another useful tool for assessing model classification performance are the lift (gains) chart and decile-wise lift chart (see Chapter 5). Figure 10.3 illustrates the lift chart obtained for the personal loan offer logistic model using the validation set. The “lift” over the base curve indicates for a given number of cases (read on the x-axis), the additional responders that you can identify by using the model. The same information is portrayed in in Figure 10.3: Taking the 10% of the records that are ranked by the model as “most probable 1’s” yields 7.9 times as many 1’s as would simply selecting 10% of the records at random.

#### Figure 10.3

library(gains)
gain <- gains(valid.df$Personal.Loan, logit.reg.pred, groups=10)

# plot lift chart
{  # got to put the curly brackers or it breaks!
  plot(c(0,gain$cume.pct.of.total*sum(valid.df$Personal.Loan))~c(0,gain$cume.obs), 
     xlab="# cases", ylab="Cumulative", main="", type="l")
  lines(c(0,sum(valid.df$Personal.Loan))~c(0, dim(valid.df)[1]), lty=2)
}

10.5 Example of Complete Analysis: Predicting Delayed Flights

Predicting flight delays can be useful to a variety of organizations: airport authorities, airlines, aviation authorities. At times, joint task forces have been formed to address the problem. Such an organization, if it were to provide ongoing real-time assistance with flight delays,would benefit from some advance notice about flights likely to be delayed.

In this simplified illustration, we look at six predictors (see Table 10.4). The outcome of interest is whether the flight is delayed or not (delayed means more than 15 minutes late). Our data consist of all flights from the Washington, DC area into the New York City area during January 2004. The percent of delayed flights among these 2201 flights is 19.5%. The data were obtained from the Bureau of Transportation Statistics website (www.transtats.bts.gov).

The goal is to predict accurately whether a new flight, not in this dataset, will be delayed or not. The outcome variable is a variable called Flight Status, coded as delayed or ontime.

Other information available on the website, such as distance and arrival time, is irrelevant because we are looking at a certain route (distance, flight time, etc. should be approximately equal for all flights in the data). A sample of the data for 20 flights is shown in Table 10.5. Figures 10.4 and 10.5 show visualizations of the relationships between flight delays and different predictors or combinations of predictors. From Figure 10.4, we see that Sundays and Mondays saw the largest proportion of delays. Delay rates also seem to differ by carrier, by time of day, as well as by origin and destination airports. For Weather, we see a strong distinction between delays when Weather = 1 (in that case there is always a delay) and Weather = 0. The heatmap in Figure 10.5 reveals some specific combinations with high rates of delays, such as Sunday flights by carrier RU, departing from BWI, or Sunday flights by MQ departing from DCA. We can also see combinations with very low delay rates.

delays.df <- read.csv("~/Box/Teaching (jmmejia@iu.edu)/2020 - K-513/Public Files/Public Data/FlightDelays.csv")
glimpse(delays.df)

## Observations: 2,201
## Variables: 13
## $ CRS_DEP_TIME  <int> 1455, 1640, 1245, 1715, 1039, 840, 1240, 1645, 1715, 21…
## $ CARRIER       <fct> OH, DH, DH, DH, DH, DH, DH, DH, DH, DH, DH, DL, DL, DL,…
## $ DEP_TIME      <int> 1455, 1640, 1245, 1709, 1035, 839, 1243, 1644, 1710, 21…
## $ DEST          <fct> JFK, JFK, LGA, LGA, LGA, JFK, JFK, JFK, JFK, JFK, LGA, …
## $ DISTANCE      <int> 184, 213, 229, 229, 229, 228, 228, 228, 228, 228, 229, …
## $ FL_DATE       <fct> 01/01/2004, 01/01/2004, 01/01/2004, 01/01/2004, 01/01/2…
## $ FL_NUM        <int> 5935, 6155, 7208, 7215, 7792, 7800, 7806, 7810, 7812, 7…
## $ ORIGIN        <fct> BWI, DCA, IAD, IAD, IAD, IAD, IAD, IAD, IAD, IAD, IAD, …
## $ Weather       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DAY_WEEK      <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ DAY_OF_MONTH  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ TAIL_NUM      <fct> N940CA, N405FJ, N695BR, N662BR, N698BR, N687BR, N321UE,…
## $ Flight.Status <fct> ontime, ontime, ontime, ontime, ontime, ontime, ontime,…

#### Figure 10.4

# code for generating top-right bar chart
# for other plots, replace aggregating variable by setting argument by =  in 
# aggregate(). 
# in function barplot(), set the x-label (argument xlab =) and y-label 
# (argument names.arg =) 
# according to the variable of choice.  

barplot(aggregate(delays.df$Flight.Status == "delayed", by = list(delays.df$DAY_WEEK), 
                  mean, rm.na = T)[,2], xlab = "Day of Week", ylab = "Average Delay", 
        names.arg = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# VS. ggplot2
barplot.df <- delays.df %>%
  mutate(nd = Flight.Status == "delayed") %>%
  group_by(DAY_WEEK) %>%
  summarise(mean=mean(nd, na.rm=T))
ggplot(barplot.df, aes(x=DAY_WEEK, y=mean))+
  geom_bar(stat="identity")

library(reshape)

## 
## Attaching package: 'reshape'

## The following object is masked from 'package:dplyr':
## 
##     rename

library(ggplot2)
# create matrix for plot
delays.df$isDelay <- 1 * (delays.df$Flight.Status == "delayed")

agg <- aggregate(delays.df$isDelay, 
                 by = list(delays.df$DAY_WEEK, delays.df$CARRIER, delays.df$ORIGIN), 
                 FUN = mean, na.rm = TRUE)
m <- melt(agg)

## Using Group.2, Group.3 as id variables

names(m)[1:3] <- c("DAY_WEEK", "CARRIER", "ORIGIN")

# plot with ggplot
# use facet_grid() with arguments scales = "free" and space = "free" to skip 
# missing values.
ggplot(m, aes(y = CARRIER, x = DAY_WEEK, fill = value)) +  geom_tile() + 
  facet_grid(ORIGIN ~ ., scales = "free", space = "free") + 
  scale_fill_gradient(low="white", high="black")

Data Preprocessing

Create a binary outcome variable called isDelay that takes the value 1 if Flight Status = delayed and 0 otherwise. Transform day of week into a categorical variable, and bin and categorize the departure time into hourly intervals between 6:00 AM and 10:00 PM. Set reference categories for categorical variables: IAD for departure airport, LGA for arrival, USAirways for carrier, and Wednesday for day (see Figure 10.4 for R code). This yields a total of 34 dummies. In addition, we have a single dummy for Weather. While this is a large number of predictors, we start by using all of them, look at performance, and then explore reducing the dimension. We then partition the data into training set (60%) and validation set (40%). We use the training set to fit a model and the validation set to assess the model’s performance.

#### Table 10.6

#delays.df <- read.csv("FlightDelays.csv")

# transform variables and create bins
delays.df$DAY_WEEK <- factor(delays.df$DAY_WEEK, levels = c(1:7), 
                             labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
delays.df$CRS_DEP_TIME <- factor(round(delays.df$CRS_DEP_TIME/100))

# create reference categories
delays.df$ORIGIN <- relevel(delays.df$ORIGIN, ref = "IAD")
delays.df$DEST <- relevel(delays.df$DEST, ref = "LGA")
delays.df$CARRIER <- relevel(delays.df$CARRIER, ref = "US")
delays.df$DAY_WEEK <- relevel(delays.df$DAY_WEEK, ref = "Wed")
delays.df$isDelay <- 1 * (delays.df$Flight.Status == "delayed")

# create training and validation sets
names(delays.df)

##  [1] "CRS_DEP_TIME"  "CARRIER"       "DEP_TIME"      "DEST"         
##  [5] "DISTANCE"      "FL_DATE"       "FL_NUM"        "ORIGIN"       
##  [9] "Weather"       "DAY_WEEK"      "DAY_OF_MONTH"  "TAIL_NUM"     
## [13] "Flight.Status" "isDelay"

selected.var <- c(10, 1, 8, 4, 2, 9, 14)
names(delays.df)[selected.var]

## [1] "DAY_WEEK"     "CRS_DEP_TIME" "ORIGIN"       "DEST"         "CARRIER"     
## [6] "Weather"      "isDelay"

train.index <- sample(c(1:dim(delays.df)[1]), dim(delays.df)[1]*0.6)  
train.df <- delays.df[train.index, selected.var]
valid.df <- delays.df[-train.index, selected.var]
glimpse(valid.df)

## Observations: 881
## Variables: 7
## $ DAY_WEEK     <fct> Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu, T…
## $ CRS_DEP_TIME <fct> 17, 12, 16, 21, 21, 13, 14, 15, 8, 9, 15, 17, 21, 15, 17…
## $ ORIGIN       <fct> IAD, IAD, IAD, IAD, IAD, DCA, DCA, DCA, IAD, DCA, DCA, D…
## $ DEST         <fct> LGA, JFK, JFK, JFK, LGA, LGA, LGA, LGA, LGA, LGA, LGA, L…
## $ CARRIER      <fct> DH, DH, DH, DH, DH, MQ, MQ, MQ, UA, US, US, US, US, RU, …
## $ Weather      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ isDelay      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Model estimation

The estimated model with 34 predictors is shown in Table 10.6. Note how negative coefficients in the logit model (the “Coefficient” column) translate into odds coefficients lower than 1, and positive logit coefficients translate into odds coefficients larger than 1.

Table 10.6 Estimated Logistic regression model for delayed flights (based on the training set)

# run logistic model, and show coefficients and odds
lm.fit <- glm(isDelay ~ ., data = train.df, family = "binomial")
data.frame(summary(lm.fit)$coefficients, odds = exp(coef(lm.fit)))

##                    Estimate  Std..Error     z.value     Pr...z..         odds
## (Intercept)    -3.512453960   0.6259687 -5.61122981 2.008938e-08 2.982364e-02
## DAY_WEEKMon     0.430487338   0.2763151  1.55795832 1.192431e-01 1.538007e+00
## DAY_WEEKTue     0.006196255   0.3008525  0.02059566 9.835682e-01 1.006215e+00
## DAY_WEEKThu    -0.118090767   0.2820844 -0.41863628 6.754820e-01 8.886154e-01
## DAY_WEEKFri     0.207853869   0.2659728  0.78148539 4.345171e-01 1.231033e+00
## DAY_WEEKSat    -0.724119789   0.3635445 -1.99183273 4.638941e-02 4.847511e-01
## DAY_WEEKSun     0.852523547   0.2830278  3.01215528 2.593999e-03 2.345559e+00
## CRS_DEP_TIME7   0.151998568   0.5337999  0.28474819 7.758371e-01 1.164159e+00
## CRS_DEP_TIME8   0.211996960   0.5100765  0.41561794 6.776896e-01 1.236144e+00
## CRS_DEP_TIME9  -0.794927505   0.6592789 -1.20575296 2.279128e-01 4.516140e-01
## CRS_DEP_TIME10 -0.044900525   0.5754950 -0.07802070 9.378116e-01 9.560926e-01
## CRS_DEP_TIME11  0.162510640   0.6800059  0.23898418 8.111179e-01 1.176461e+00
## CRS_DEP_TIME12  0.402859767   0.5015982  0.80315229 4.218867e-01 1.496097e+00
## CRS_DEP_TIME13 -0.364403105   0.5701715 -0.63911144 5.227504e-01 6.946111e-01
## CRS_DEP_TIME14  0.011711884   0.5508469  0.02126160 9.830370e-01 1.011781e+00
## CRS_DEP_TIME15  1.124810815   0.4388133  2.56330166 1.036819e-02 3.079634e+00
## CRS_DEP_TIME16  0.494825830   0.4741336  1.04364224 2.966509e-01 1.640213e+00
## CRS_DEP_TIME17  0.736936568   0.4420660  1.66702850 9.550874e-02 2.089525e+00
## CRS_DEP_TIME18  0.625730562   0.5706079  1.09660333 2.728148e-01 1.869611e+00
## CRS_DEP_TIME19  1.117640961   0.4855576  2.30176816 2.134825e-02 3.057633e+00
## CRS_DEP_TIME20  1.041911051   0.6539145  1.59334457 1.110829e-01 2.834629e+00
## CRS_DEP_TIME21  0.925534984   0.4788166  1.93296362 5.324068e-02 2.523218e+00
## ORIGINBWI       0.964032867   0.3961410  2.43355977 1.495117e-02 2.622250e+00
## ORIGINDCA       0.294210379   0.3610597  0.81485240 4.151568e-01 1.342066e+00
## DESTEWR        -0.149273791   0.3528250 -0.42308166 6.722357e-01 8.613333e-01
## DESTJFK        -0.567790125   0.2652041 -2.14095554 3.227762e-02 5.667766e-01
## CARRIERCO       1.953153817   0.5413359  3.60802565 3.085360e-04 7.050890e+00
## CARRIERDH       1.858084486   0.4924686  3.77300065 1.612959e-04 6.411444e+00
## CARRIERDL       0.543449302   0.3279001  1.65736261 9.744619e-02 1.721936e+00
## CARRIERMQ       1.632020578   0.3248186  5.02440630 5.049923e-07 5.114198e+00
## CARRIEROH      -0.191771860   0.9425363 -0.20346364 8.387727e-01 8.254952e-01
## CARRIERRU       1.499533415   0.5047014  2.97113007 2.967061e-03 4.479598e+00
## CARRIERUA       0.872898966   0.9261840  0.94246823 3.459530e-01 2.393840e+00
## Weather        18.194242378 529.8026745  0.03434154 9.726048e-01 7.973685e+07

round(data.frame(summary(lm.fit)$coefficients, odds = exp(coef(lm.fit))), 5)

##                Estimate Std..Error  z.value Pr...z..         odds
## (Intercept)    -3.51245    0.62597 -5.61123  0.00000 2.982000e-02
## DAY_WEEKMon     0.43049    0.27632  1.55796  0.11924 1.538010e+00
## DAY_WEEKTue     0.00620    0.30085  0.02060  0.98357 1.006220e+00
## DAY_WEEKThu    -0.11809    0.28208 -0.41864  0.67548 8.886200e-01
## DAY_WEEKFri     0.20785    0.26597  0.78149  0.43452 1.231030e+00
## DAY_WEEKSat    -0.72412    0.36354 -1.99183  0.04639 4.847500e-01
## DAY_WEEKSun     0.85252    0.28303  3.01216  0.00259 2.345560e+00
## CRS_DEP_TIME7   0.15200    0.53380  0.28475  0.77584 1.164160e+00
## CRS_DEP_TIME8   0.21200    0.51008  0.41562  0.67769 1.236140e+00
## CRS_DEP_TIME9  -0.79493    0.65928 -1.20575  0.22791 4.516100e-01
## CRS_DEP_TIME10 -0.04490    0.57549 -0.07802  0.93781 9.560900e-01
## CRS_DEP_TIME11  0.16251    0.68001  0.23898  0.81112 1.176460e+00
## CRS_DEP_TIME12  0.40286    0.50160  0.80315  0.42189 1.496100e+00
## CRS_DEP_TIME13 -0.36440    0.57017 -0.63911  0.52275 6.946100e-01
## CRS_DEP_TIME14  0.01171    0.55085  0.02126  0.98304 1.011780e+00
## CRS_DEP_TIME15  1.12481    0.43881  2.56330  0.01037 3.079630e+00
## CRS_DEP_TIME16  0.49483    0.47413  1.04364  0.29665 1.640210e+00
## CRS_DEP_TIME17  0.73694    0.44207  1.66703  0.09551 2.089520e+00
## CRS_DEP_TIME18  0.62573    0.57061  1.09660  0.27281 1.869610e+00
## CRS_DEP_TIME19  1.11764    0.48556  2.30177  0.02135 3.057630e+00
## CRS_DEP_TIME20  1.04191    0.65391  1.59334  0.11108 2.834630e+00
## CRS_DEP_TIME21  0.92553    0.47882  1.93296  0.05324 2.523220e+00
## ORIGINBWI       0.96403    0.39614  2.43356  0.01495 2.622250e+00
## ORIGINDCA       0.29421    0.36106  0.81485  0.41516 1.342070e+00
## DESTEWR        -0.14927    0.35283 -0.42308  0.67224 8.613300e-01
## DESTJFK        -0.56779    0.26520 -2.14096  0.03228 5.667800e-01
## CARRIERCO       1.95315    0.54134  3.60803  0.00031 7.050890e+00
## CARRIERDH       1.85808    0.49247  3.77300  0.00016 6.411440e+00
## CARRIERDL       0.54345    0.32790  1.65736  0.09745 1.721940e+00
## CARRIERMQ       1.63202    0.32482  5.02441  0.00000 5.114200e+00
## CARRIEROH      -0.19177    0.94254 -0.20346  0.83877 8.255000e-01
## CARRIERRU       1.49953    0.50470  2.97113  0.00297 4.479600e+00
## CARRIERUA       0.87290    0.92618  0.94247  0.34595 2.393840e+00
## Weather        18.19424  529.80267  0.03434  0.97260 7.973685e+07

summary(lm.fit)

## 
## Call:
## glm(formula = isDelay ~ ., family = "binomial", data = train.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4207  -0.6565  -0.4773  -0.2825   2.5966  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.512454   0.625969  -5.611 2.01e-08 ***
## DAY_WEEKMon      0.430487   0.276315   1.558 0.119243    
## DAY_WEEKTue      0.006196   0.300853   0.021 0.983568    
## DAY_WEEKThu     -0.118091   0.282084  -0.419 0.675482    
## DAY_WEEKFri      0.207854   0.265973   0.781 0.434517    
## DAY_WEEKSat     -0.724120   0.363544  -1.992 0.046389 *  
## DAY_WEEKSun      0.852524   0.283028   3.012 0.002594 ** 
## CRS_DEP_TIME7    0.151999   0.533800   0.285 0.775837    
## CRS_DEP_TIME8    0.211997   0.510077   0.416 0.677690    
## CRS_DEP_TIME9   -0.794928   0.659279  -1.206 0.227913    
## CRS_DEP_TIME10  -0.044901   0.575495  -0.078 0.937812    
## CRS_DEP_TIME11   0.162511   0.680006   0.239 0.811118    
## CRS_DEP_TIME12   0.402860   0.501598   0.803 0.421887    
## CRS_DEP_TIME13  -0.364403   0.570171  -0.639 0.522750    
## CRS_DEP_TIME14   0.011712   0.550847   0.021 0.983037    
## CRS_DEP_TIME15   1.124811   0.438813   2.563 0.010368 *  
## CRS_DEP_TIME16   0.494826   0.474134   1.044 0.296651    
## CRS_DEP_TIME17   0.736937   0.442066   1.667 0.095509 .  
## CRS_DEP_TIME18   0.625731   0.570608   1.097 0.272815    
## CRS_DEP_TIME19   1.117641   0.485558   2.302 0.021348 *  
## CRS_DEP_TIME20   1.041911   0.653914   1.593 0.111083    
## CRS_DEP_TIME21   0.925535   0.478817   1.933 0.053241 .  
## ORIGINBWI        0.964033   0.396141   2.434 0.014951 *  
## ORIGINDCA        0.294210   0.361060   0.815 0.415157    
## DESTEWR         -0.149274   0.352825  -0.423 0.672236    
## DESTJFK         -0.567790   0.265204  -2.141 0.032278 *  
## CARRIERCO        1.953154   0.541336   3.608 0.000309 ***
## CARRIERDH        1.858084   0.492469   3.773 0.000161 ***
## CARRIERDL        0.543449   0.327900   1.657 0.097446 .  
## CARRIERMQ        1.632021   0.324819   5.024 5.05e-07 ***
## CARRIEROH       -0.191772   0.942536  -0.203 0.838773    
## CARRIERRU        1.499533   0.504701   2.971 0.002967 ** 
## CARRIERUA        0.872899   0.926184   0.942 0.345953    
## Weather         18.194242 529.802675   0.034 0.972605    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1295.7  on 1319  degrees of freedom
## Residual deviance: 1102.5  on 1286  degrees of freedom
## AIC: 1170.5
## 
## Number of Fisher Scoring iterations: 15

Model Interpretation

The coefficient for Arrival Airport JFK (DESTJFK) is estimated as − 0.19. Recall that the reference group is LGA. We interpret this coefficient as follows: e− 0.19 = 0.83 are the odds of a flight arriving at JFK being delayed relative to a flight to LGA being delayed (= the base-case odds), holding all other variables constant. This means that flights to LGA are more likely to be delayed than those to JFK (holding everything else constant). If we consider statistical significance of the coefficients, we see that in general, the origin and destination airports are not associated with the chance of delays. For carriers, two carriers (CO, MQ) are significantly different from the base carrier (USAirways), with odds of delay ranging between 4 and 5.5 relative to the other airlines. Weather has an enormous coefficient, which is not statistically significant. Flights leaving on Sunday or Monday have, on average, odds of 1.8 of delays relative to other days of the week (the other days seem statistically similar to the reference group Wednesday). Also, odds of delays appear to change over the course of the day, with the most noticeable difference from the reference category (6–7 AM) being 3–4 PM.

Model Performance

How should we measure the performance of models? One possible measure is “percent of flights correctly classified.” Accurate classification can be obtained from the confusion matrix for the validation data. The confusion matrix gives a sense of the classification accuracy and what type of misclassification is more frequent. From the confusion matrix and error rates in Figure 10.6, it can be seen that the model more accurately classifies nondelayed flights and is less accurate in classifying flights that were delayed. (Note: The same pattern appears in the confusion matrix for the training data, so it is not surprising to see it emerge for new data.) If there is an asymmetric cost structure so that one type of misclassification is more costly than the other, the cutoff value can be selected to minimize the cost. Of course, this tweaking should be carried out on the training data and assessed only using the validation data.

#### Figure 10.6

library(gains)
pred <- predict(lm.fit, valid.df)
gain <- gains(valid.df$isDelay, pred, groups=100)

{
  plot(c(0,gain$cume.pct.of.total*sum(valid.df$isDelay))~
       c(0,gain$cume.obs), 
     xlab="# cases", ylab="Cumulative", main="", type="l")
lines(c(0,sum(valid.df$isDelay))~c(0, dim(valid.df)[1]), lty=2)
}

Exercise

What if you were to include only the significant predictors? That is, what if you tried to do variable selection as we did for linear regression?
Why do you think the weather is not playing a role?

Logistic Regression

Jorge

CHAPTER 10