Write an R function for the logistic function. The function should accept a numeric vector with values [-Inf,Inf] and produce a numeric vector in the the range [0,1].
Plot the logistic function from [-10,10]
Write a R function for the inverse logistic function. The function should accept a numeric vector with values [0,1] and prodcuce a numeric vector in the range [-Inf,Inf]
Plot the Inverse Logistic function from [0,1]
Hint: For plotting curves see ?graphics::curve or ?ggplot2::stat_function
library(magrittr)
logistic <- function(x) {
1/(1+exp(-x))
}
logistic <- . %>% {1/(1+exp(-.))}
inv_logistic <- function(x) {
-log((1-x)/x)
}
x <- -10:10
curve(logistic, -10, 10)
curve(inv_logistic, 0, 1)
Using the rectangular data that you created from the earlier assignment and following theexample from the text and class, create a model for arr_delay >= 22 minutes. Describe/Explain each of the steps and show all work.
KNIT YOUR DOCUMENT AS HTML AND SUBMIT IT AND THE Rmd file to your repository.
YX <- flightsDT
YX %<>% merge( planesDT, all.x = TRUE, by='tailnum', suffixes=c('','.pl') )
YX %<>% merge( weatherDT, all.x = TRUE, by=c('origin','year','month','day','hour'), suffixes=c('','.we') )
YX %<>% merge( airportsDT, all.x = TRUE, by.x='origin', by.y='faa', suffixes=c('','.orig') )
YX %<>% merge( airportsDT, all.x = TRUE, by.x='dest', by.y='faa', suffixes=c('','.dest') )
data.joined <- YX
data.joined$arrival_delayed <- ifelse(data.joined$arr_delay >= 22, 1,0)
#colsofinterest <- c("dep_delay", "arr_delay", "dest", "origin", "year", "month", "day", "hour", "tailnum", "sched_dep_time", "sched_arr_time", "flight", "distance", "year.pl", "minute", "type", "manufacturer", "model", "engines", "seats", "engine", "temp", "dewp" ,"humid", "wind_dir", "wind_speed", "wind_gust", "precip", "pressure", "visib", "name", "lat", "lon", "tz", "name.dest", "lat.dest", "lon.dest", "alt.dest", "tz.dest")
data.joined <- data.joined %>%
filter(!is.na(dep_delay)) %>%
filter(!is.na(arr_delay)) %>%
filter(!is.na(dest)) %>%
filter(!is.na(origin)) %>%
filter(!is.na(year)) %>%
filter(!is.na(month)) %>%
filter(!is.na(day)) %>%
filter(!is.na(hour)) %>%
filter(!is.na(tailnum)) %>%
filter(!is.na(sched_dep_time)) %>%
filter(!is.na(sched_arr_time)) %>%
filter(!is.na(flight)) %>%
filter(!is.na(distance)) %>%
filter(!is.na(year.pl)) %>%
filter(!is.na(minute)) %>%
filter(!is.na(year.pl)) %>%
filter(!is.na(type)) %>%
filter(!is.na(manufacturer)) %>%
filter(!is.na(model)) %>%
filter(!is.na(engines)) %>%
filter(!is.na(seats)) %>%
filter(!is.na(engine)) %>%
filter(!is.na(temp)) %>%
filter(!is.na(dewp)) %>%
filter(!is.na(humid)) %>%
filter(!is.na(wind_dir)) %>%
filter(!is.na(wind_speed)) %>%
filter(!is.na(wind_gust)) %>%
filter(!is.na(precip)) %>%
filter(!is.na(pressure)) %>%
filter(!is.na(visib)) %>%
filter(!is.na(name)) %>%
filter(!is.na(lat)) %>%
filter(!is.na(lon)) %>%
filter(!is.na(tz)) %>%
filter(!is.na(name.dest)) %>%
filter(!is.na(lat.dest)) %>%
filter(!is.na(lon.dest)) %>%
filter(!is.na(alt.dest)) %>%
filter(!is.na(tz.dest))
saveRDS(data.joined, file="flightdata.Rda")
data.joined.training <- sample_frac(data.joined, .75)
data.joined.testing <- sample_frac(data.joined, .5)
logit.fit <- glm(arrival_delayed ~ dep_delay + dest + origin + year + month + day + hour + sched_dep_time + sched_arr_time + carrier + distance + year.pl + type + engines + seats + engine + temp + dewp + humid + wind_dir + wind_speed + wind_gust +precip + pressure + visib + lat + lon + alt + tz + lat.dest + lon.dest + alt.dest + tz.dest, data = data.joined.training, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#summary(logit.fit)
prob = plogis(predict(logit.fit, data.joined.testing))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Is this a good model? (Write your answer here.)
To answer this let’s first calculate the misclassifications, Specificity and Sensitivity of the model.
Using the InformationValue::optimalCutoff we determine the optimal cut-off
optimalCutoff(data.joined.testing$arrival_delayed, prob)[1]
## [1] 0.51
optCutOff <- optimalCutoff(data.joined.testing$arrival_delayed, prob)[1]
misClassError(data.joined.testing$arrival_delayed, prob, threshold = optCutOff)
## [1] 0.0737
sensitivity(data.joined.testing$arrival_delayed, prob, threshold = optCutOff)
## [1] 0.6871163
specificity(data.joined.testing$arrival_delayed, prob, threshold = optCutOff)
## [1] 0.9802354
confusionMatrix(data.joined.testing$arrival_delayed, prob, threshold = optCutOff)
## 0 1
## 0 93438 6727
## 1 1884 14773
Based on the Confusion Matrix results, high Specificity and high Sensitivity and very low mis-classification errors this is a fairly good model, with high accuracy, sensitivity and specivity.
Your model should be good at explaining tardiness. Now, assume that your job is to predict arrival delays a month in advance. You can no longer use all the features in your model. Retrain your model using only features that will be known only a month in advance of the departure time. Show all steps as above.
logit.fit <- glm(arrival_delayed ~ dest + origin + sched_dep_time + sched_arr_time + carrier + distance + lat + lon + alt + tz + lat.dest + lon.dest + alt.dest + tz.dest, data = data.joined.training, family = binomial(link="logit"))
Let’s see how good this model is using a Confusion Matrix.
Using the InformationValue::optimalCutoff we determine the optimal cut-off
prob = plogis(predict(logit.fit, data.joined.testing))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
optimalCutoff(data.joined.testing$arrival_delayed, prob)[1]
## [1] 0.4867589
optCutOff <- optimalCutoff(data.joined.testing$arrival_delayed, prob)[1]
misClassError(data.joined.testing$arrival_delayed, prob, threshold = optCutOff)
## [1] 0.1839
Note that the Sensitivity is very low
sensitivity(data.joined.testing$arrival_delayed, prob, threshold = optCutOff)
## [1] 0.001813953
specificity(data.joined.testing$arrival_delayed, prob, threshold = optCutOff)
## [1] 0.9997587
confusionMatrix(data.joined.testing$arrival_delayed, prob, threshold = optCutOff)
## 0 1
## 0 95299 21461
## 1 23 39
Based on the Confusion Matrix results, and a very low Sensitivity, we can conclude that this is NOT a Model Fit.
#prob = predict(logit.fit, data.joined.sample, type="response")
#prob = plogis(predict(logit.fit, data.joined.testing))
#prob<-ifelse(prob> 0.5,1,0)
#data.joined.testing$prob = prob