Aircraft Strikes

Author

Sam Rajabian

Introduction

The FAA Wildlife Strike Database, maintained by the Federal Aviation Administration, contains records of reported wildlife (mostly bird) strikes involving civil aircraft in the USA.
Source: https://wildlife.faa.gov/home

I plan to explore when strikes occur most frequently to see whether bird migration affects strike frequency.

Key variables:
-Month
-Time of day

For linear regression analysis:
-Aircraft speed
-Cost of repairs

Load libraries and dataset

library(tidyverse)
library(RColorBrewer)
strikes <- read_csv("aircraft_wildlife_strikes_faa_20-25.csv")

Examine variables

#str(strikes)   #commented for rendering
head(strikes)
# A tibble: 6 × 101
  INDEX_NR INCIDENT_DATE INCIDENT_MONTH INCIDENT_YEAR TIME   TIME_OF_DAY
     <dbl> <chr>                  <dbl>         <dbl> <time> <chr>      
1   638334 8/6/2000                   8          2000    NA  Day        
2   638335 3/15/2000                  3          2000    NA  <NA>       
3   638336 5/8/2000                   5          2000 11:25  Day        
4   638337 3/24/2000                  3          2000 18:40  Dusk       
5   638338 8/28/2000                  8          2000    NA  Day        
6   638339 10/9/2000                 10          2000    NA  <NA>       
# ℹ 95 more variables: AIRPORT_ID <chr>, AIRPORT <chr>, AIRPORT_LATITUDE <dbl>,
#   AIRPORT_LONGITUDE <dbl>, RUNWAY <chr>, STATE <chr>, FAAREGION <chr>,
#   LOCATION <chr>, OPID <chr>, OPERATOR <chr>, REG <chr>, FLT <chr>,
#   AIRCRAFT <chr>, AMA <chr>, AMO <dbl>, EMA <dbl>, EMO <dbl>, AC_CLASS <chr>,
#   AC_MASS <dbl>, TYPE_ENG <chr>, NUM_ENGS <dbl>, ENG_1_POS <dbl>,
#   ENG_2_POS <dbl>, ENG_3_POS <dbl>, ENG_4_POS <dbl>, PHASE_OF_FLIGHT <chr>,
#   HEIGHT <dbl>, SPEED <dbl>, DISTANCE <dbl>, SKY <chr>, …

Clean data

names(strikes) <- tolower(names(strikes))  #variable names lowercase

Prepare data for linear regression plotting

strikes2 <- strikes |>
  filter(!is.na(cost_repairs_infl_adj), !is.na(speed)) |> #keep complete cases
  filter(cost_repairs_infl_adj <= 50000000) #drop outlier above $50 million

Plot linear regression

linear2 <- ggplot(strikes2, aes(x = speed, y = cost_repairs_infl_adj)) +
  labs(title = "Speed vs. Repair Cost",
       y = "Repair Cost (Infl. Adjusted $)",
       x = "Speed (knots)") +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dotdash",
              linewidth = 0.3) + #linewidth replaces size for lines in ggplot2 3.4.0+
  theme_minimal()
linear2    

Analyze linear regression

cor(strikes2$speed, strikes2$cost_repairs_infl_adj)
[1] 0.0822346
fit <- lm(cost_repairs_infl_adj ~ speed, data = strikes2)
summary(fit)

Call:
lm(formula = cost_repairs_infl_adj ~ speed, data = strikes2)

Residuals:
     Min       1Q   Median       3Q      Max 
 -512966  -170907  -126161   -73975 18877812 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -7619.4    42053.4  -0.181    0.856    
speed         1362.8      294.2   4.632 3.77e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 878400 on 3151 degrees of freedom
Multiple R-squared:  0.006763,  Adjusted R-squared:  0.006447 
F-statistic: 21.45 on 1 and 3151 DF,  p-value: 3.771e-06

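Equation: cost_repairs_infl_adj = 1362.8(speed) - 7619.4
R^2: About 0.68% of the variation in repair cost is explained by the model.
P-value: 3.77e-06. The slope is statistically significant, but with more than 3,000 observations even a very weak relationship can produce a small p-value.
Slope: For each one-unit increase in speed, predicted repair cost increases by about $1,362.80.
The correlation is very weak (r = 0.08), so this relationship should not be used in the main plot.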
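
The pieces of the interpretation above can be pulled from a fitted model object instead of being copied by hand. The sketch below is only an illustration: `toy`, `toy_fit`, and the `cost` column are made-up stand-ins for `strikes2`, `fit`, and `cost_repairs_infl_adj`, and the numbers are invented. It shows that the slope comes from `coef()` and that, with a single predictor, R-squared equals the squared correlation.

```r
# Toy data standing in for strikes2 (values are made up for illustration)
toy <- data.frame(
  speed = c(80, 100, 120, 140, 160, 180),
  cost  = c(1200, 900, 3000, 2500, 5200, 4100)
)

toy_fit <- lm(cost ~ speed, data = toy)

# Slope and intercept come from the coefficient vector, not from cor()
slope     <- unname(coef(toy_fit)["speed"])
intercept <- unname(coef(toy_fit)["(Intercept)"])

# With one predictor, R-squared is the square of the correlation
r         <- cor(toy$speed, toy$cost)
r_squared <- summary(toy_fit)$r.squared
all.equal(r^2, r_squared)

# Predicted change in cost for a one-unit increase in speed equals the slope
step_change <- unname(
  predict(toy_fit, data.frame(speed = 101)) -
  predict(toy_fit, data.frame(speed = 100))
)
all.equal(step_change, slope)
```

Applied to the output above, 0.0822346^2 is about 0.006763, which matches the reported Multiple R-squared.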

Group by month and time of day to answer the question

strikes3 <- strikes |>
  filter(!is.na(time_of_day)) |>
  group_by(incident_month, time_of_day) |>
  summarise(number_strikes = n(), .groups = "drop_last") #keep grouping by month, silence the regrouping message
strikes3
# A tibble: 48 × 3
# Groups:   incident_month [12]
   incident_month time_of_day number_strikes
            <dbl> <chr>                <int>
 1              1 Dawn                   250
 2              1 Day                   3479
 3              1 Dusk                   343
 4              1 Night                 1365
 5              2 Dawn                   275
 6              2 Day                   3129
 7              2 Dusk                   286
 8              2 Night                 1458
 9              3 Dawn                   327
10              3 Day                   4536
# ℹ 38 more rows
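
Because months differ in their total number of strikes, raw counts can make it hard to compare time-of-day patterns across months. One possible follow-up, sketched here with made-up numbers, is to convert the counts to within-month shares. `toy_counts`, `toy_shares`, and the `share` column are hypothetical names used only for this example; the same steps would apply to `strikes3`.

```r
library(dplyr)

# Toy counts standing in for strikes3 (numbers are made up for illustration)
toy_counts <- tibble(
  incident_month = c(1, 1, 2, 2),
  time_of_day    = c("Day", "Night", "Day", "Night"),
  number_strikes = c(300, 100, 250, 250)
)

# Divide each count by its month's total so shares sum to 1 within a month
toy_shares <- toy_counts |>
  group_by(incident_month) |>
  mutate(share = number_strikes / sum(number_strikes)) |>
  ungroup()

toy_shares
```

Passing `position = "fill"` to `geom_col()` would draw the same proportions directly in the bar plot, at the cost of hiding the monthly totals.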

Plot the data

ggplot(strikes3, aes(x = incident_month, y = number_strikes, fill = time_of_day)) +
  geom_col() +
  scale_x_continuous(
    breaks = 1:12,          
    labels = month.abb) +
  #Source: https://stackoverflow.com/questions/69411847/changing-month-from-number-to-full-month-name-in-r
  scale_fill_brewer(name = "Time", palette = "Spectral") +
  labs(
    title = "Number of Aircraft Wildlife Strikes by Month",
    x = "Month",
    y = "Number of Strikes",
    caption = "Source: FAA"
  ) +
  theme_minimal()

Reflection

a. How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate)

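My dataset’s variable names were already mostly clean, but I used the tolower() function to convert the uppercase names to lowercase. I also used filter() with is.na() to remove rows with missing values in the key variables (time_of_day, speed, and cost_repairs_infl_adj), and I excluded repair costs above $50 million as outliers before fitting the regression.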

b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.

The visualization suggests that bird migration patterns affect wildlife strike frequency. Strikes increase sharply during spring (peaking in April/May) and fall (peaking in September/October), which coincides with avian migration seasons. Many bird species also migrate at night, and the bar plot reflects this through higher nighttime strike counts during the peak seasons.

c. Anything that you might have shown that you could not get to work or that you wished you could have included

I could not get my linear regression graph to accommodate the many values to create a proper fit line without removing several points. I also wish I could have included state information somewhere, possibly through plotly interactivity.
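
One option I did not try is a log scale for repair cost, which can keep very large values on the plot without removing them. The sketch below is an assumption about what might help, not something run on the FAA data: `toy` and `log_plot` are hypothetical names, and the values are simulated. It also assumes ggplot2 3.4.0 or later for `linewidth`.

```r
library(ggplot2)

# Simulated right-skewed costs standing in for strikes2 (made up for illustration)
set.seed(1)
toy <- data.frame(speed = runif(200, 50, 300))
toy$cost <- 10^(3 + 0.004 * toy$speed + rnorm(200, sd = 0.8))

log_plot <- ggplot(toy, aes(x = speed, y = cost)) +
  geom_point(alpha = 0.3) +
  # Scale transformations are applied before the smoother runs,
  # so this line is fit to log10(cost) rather than raw cost
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 0.3) +
  scale_y_log10() +
  labs(x = "Speed", y = "Repair cost (log scale)") +
  theme_minimal()

log_plot
```

A log scale cannot display costs of zero, so any such rows would need to be filtered out first, and the fitted slope would describe a percentage change in cost rather than a dollar change.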