class: center, middle, inverse, title-slide

.title[
# Linear Regression Presentation
]
.subtitle[
## Demonstration of Xaringan Package
]
.author[
### Alex Dragonetti
]
.date[
### 2024-02-08
]

---
class:inverse4, top

<h1 align="center"> Table of Contents</h1>
<BR>

.pull-left[
####- Data Overview
####- Goals and Analysis Methods
####- Findings
####- Discussion
]

---

<h1 align = "center">Overview of Data</h1>
<BR>

.pull-left[
#### Basic Details
<li> 3593 observations of 11 variables: one numeric dependent, one categorical independent, and nine numeric independent
<li> Data measuring flight delay times and potential causes
<li> Source: <i>Applied Analytics through Case Studies Using SAS and R</i>, Deepti Gupta, Apress, ISBN 978-1-4842-3525-6
<li> A copy of the data can be found at https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv
]

.pull-right[
##### Variables
<li> Carrier - sole categorical variable
<li> Airport_Distance - distance between airports (effectively the distance of the flight) in miles
<li> Number of flights - total number of flights at the airport (unclear whether this is the departure or arrival airport)
<li> Weather - 0-10 scale of weather conditions, where a higher number is more 'extreme'
<li> Support_Crew_Available - self-explanatory
<li> Baggage_Loading_time - self-explanatory
<li> Late_Arrival_o - time in minutes for late-arriving aircraft
<li> Cleaning_o - time in minutes for aircraft cleaning
<li> Fueling_o - time in minutes for aircraft fueling
<li> Security_o - time in minutes for security checks
<li> Arr_Delay - time in minutes for delay of aircraft.
]

---

<h1 align="center"> Goals and Methodology</h1>
<br>

.pull-left[
#### Goals
<li> Primary: Create a linear regression model to predict the delay time (in minutes) of a flight
<li> Secondary: Identify which variables have the greatest impact on delay, to give airports an actionable target.
#### Preparing the Data
<li> A check for missing values with is.na found none
<li> Rename variables using CamelCase for ease of typing (e.g., "Arr_Delay" becomes ArrDelay)
<li> Create preliminary models to identify variables of interest and eliminate unnecessary ones
]

.pull-right[
#### Method
We will use multiple linear regression to analyze the data. The fitted model takes the form:

$$ \hat{Y} = \beta_0+\beta_1x_1+\cdots+\beta_kx_k $$

where `\(\hat{Y}\)` is the predicted value of the dependent variable, `\(\beta_0\)` is the intercept (the predicted value of `\(Y\)` when all independent variables equal 0), `\(x_1\)` through `\(x_k\)` are the values of the independent variables, and `\(\beta_1\)` through `\(\beta_k\)` are the coefficients of the corresponding `\(x\)` variables.

We will create a model with a 'training' data set, which will use 80% of the observations, and test it against the remaining 20% of observations.
]

---
class: inverse center middle

# Analysis and Results

---

<h1 align="center"> Splitting and Training the Data </h1>
<br>

```r
set.seed(251)
# Randomly assign 80% of the rows to the training set
row.number <- sample(1:nrow(df), 0.8*nrow(df))
train <- df[row.number,]
test <- df[-row.number,]
# Full model: every other column as a predictor, fitted on the training set
model1 <- lm(ArrDelay ~ ., data = train)
```

From our initial model, we can see that only a few variables appear to be significant. We will continue with a reduced model, using only the variables that appear to have a significant effect on ArrDelay.

```r
# Reduced model with the six retained predictors (fitted on df, the full data set)
model2 <- lm(ArrDelay ~ AirportDistance + NumberOfFlights + Weather +
               SupportCrewAvailable + BaggageLoadingTime + LateArrival,
             data = df)
```

---

<h1 align="center"> The "Champion" Model </h1>
<br>

```
# A tibble: 7 × 5
  term                  estimate std.error statistic   p.value
  <chr>                    <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)          -577.      7.27        -79.3  0
2 AirportDistance         0.174   0.0135       12.8  6.45e- 37
3 NumberOfFlights         0.00443 0.000108     40.9  4.21e-300
4 Weather                 4.46    0.454         9.83 1.67e- 22
5 SupportCrewAvailable   -0.0489  0.00531      -9.21 5.47e- 20
6 BaggageLoadingTime     13.5     0.438        30.8  1.63e-184
7 LateArrival             6.90    0.333        20.7  2.91e- 90
```

The variables above had the most statistically significant effects on ArrDelay. This model will now be 'tested' on the previously sequestered data to see how accurately it can predict ArrDelay.

---

<h1 align="center"> Testing The Model </h1>
<br>

```r
pred <- predict(model2, newdata = test)
results <- test
results$pred <- pred
# Residuals and squared residuals on the test set
results$resid <- results$ArrDelay - results$pred
results$resid2 <- (results$resid)^2
# Squared deviations of the observed delays from their mean
results$st <- (results$ArrDelay - mean(results$ArrDelay))^2
SSE <- sum(results$resid2)
SST <- sum(results$st)
R2 <- 1 - (SSE/SST)
R2
```

```
[1] 0.8188571
```

The `\(R^2\)` value is the proportion of total variation in the `\(Y\)` variable explained by our regression model. Our model achieved an `\(R^2\)` value of 0.819, meaning it accounts for nearly 82% of the variance in ArrDelay within the test data. The graphs on the following slides show (1) a comparison of predicted vs. actual delay time, which shows a very strong correlation, and (2) the error at each prediction, which shows that this model does not seem to over- or under-predict consistently.

---

<img src="hw2_files/figure-html/unnamed-chunk-6-1.png" width="600px" style="display: block; margin: auto;" />

---

<img src="hw2_files/figure-html/unnamed-chunk-7-1.png" width="600px" style="display: block; margin: auto;" />

---

<h1 align="center"> Discussion</h1>
<br><br>

From our results, we are confident that Number of Flights, Airport Distance, Weather, Support Crew Available, Late Arrival, and Baggage Loading Time have a strong relationship with flight delay. Carrier did not appear to have a significant impact, so recommendations can be generalized to all carriers at this time.
Our recommendations are split into two categories: Projection Methods and Systemic Improvements.

### Projection Methods
Understanding the impact of Number of Flights, Airport Distance, and, to an extent, available staff and weather can help create a more realistic ETA, available to passengers before they even reach the airport. Although weather and available staff can be forecast from weather reports and staffing information, both may change, so they should not be relied on as heavily as the number of flights and airport distance.

---

<h1 align="center"> Discussion, Part 2</h1>
<br><br>

### Systemic Improvements
Having established above which variables have the greatest impact on delay, an airport or carrier may want to prioritize improving those. For example, baggage loading time was a significant variable in most models, and average baggage loading time was about the same across all carriers. If infrastructure is improved with the aim of reducing delays, baggage loading could be prioritized over a variable with a weaker impact, such as fueling.

### Considerations for Future Study
For future analysis, we recommend recording all airports involved in a flight, if feasible, to gauge their impact. There is plenty of potential analysis to be done with the airport a plane is arriving from, departing from, and heading towards. While this information will increase computational demand, we believe it is worth investigating.

Crucially, no one involved in this study is familiar with research related to air travel, so any actions should be reviewed by an expert before implementation.
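
---

<h1 align="center"> Appendix: Holdout R-squared Sketch </h1>
<br>

The split, fit, and test steps from the analysis slides can be collected into a single helper. This is a minimal sketch under stated assumptions, not the code that produced the results in this deck: `holdout_r2` is a hypothetical function name, R's built-in `mtcars` data stand in for the flight data, and the model here is fitted on the training rows only so that the test rows stay unseen until scoring.

```r
# Hypothetical helper (not part of the original analysis): fit a linear model
# on a random training subset, then compute R^2 on the held-out rows.
holdout_r2 <- function(formula, data, train_frac = 0.8, seed = 251) {
  set.seed(seed)
  n <- nrow(data)
  train_idx <- sample(seq_len(n), size = floor(train_frac * n))
  train <- data[train_idx, , drop = FALSE]
  test  <- data[-train_idx, , drop = FALSE]

  fit  <- lm(formula, data = train)        # training rows only
  pred <- predict(fit, newdata = test)     # predictions for held-out rows

  y   <- test[[all.vars(formula)[1]]]      # response named on the formula's left side
  sse <- sum((y - pred)^2)                 # squared prediction error
  sst <- sum((y - mean(y))^2)              # total variation in the test response
  list(model = fit, n_train = nrow(train), n_test = nrow(test),
       r2 = 1 - sse / sst)
}

# Illustration on built-in data; mtcars is an assumption standing in for df.
out <- holdout_r2(mpg ~ wt + hp, data = mtcars)
out$r2
```

With the flight data loaded as `df`, a call such as `holdout_r2(ArrDelay ~ ., data = df)` would follow the same pattern. The resulting value need not match the 0.819 reported earlier, since the split and fit may differ.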