library(tidyverse)
library(RColorBrewer)
strikes <- read_csv("aircraft_wildlife_strikes_faa_20-25.csv")
Aircraft Strikes
Introduction
The FAA Wildlife Strike Database, provided by the Federal Aviation Administration, contains records of wildlife (bird) strikes reported by civil aircraft in the USA.
Source: https://wildlife.faa.gov/home
I plan to explore when strikes most frequently occur to see if bird migration impacts strikes.
Key variables:
-Month
-Time of day
For linear regression analysis:
-Aircraft speed
-Cost of repairs
Load libraries and dataset
Examine variables
#str(strikes) #commented for rendering
head(strikes)
# A tibble: 6 × 101
INDEX_NR INCIDENT_DATE INCIDENT_MONTH INCIDENT_YEAR TIME TIME_OF_DAY
<dbl> <chr> <dbl> <dbl> <time> <chr>
1 638334 8/6/2000 8 2000 NA Day
2 638335 3/15/2000 3 2000 NA <NA>
3 638336 5/8/2000 5 2000 11:25 Day
4 638337 3/24/2000 3 2000 18:40 Dusk
5 638338 8/28/2000 8 2000 NA Day
6 638339 10/9/2000 10 2000 NA <NA>
# ℹ 95 more variables: AIRPORT_ID <chr>, AIRPORT <chr>, AIRPORT_LATITUDE <dbl>,
# AIRPORT_LONGITUDE <dbl>, RUNWAY <chr>, STATE <chr>, FAAREGION <chr>,
# LOCATION <chr>, OPID <chr>, OPERATOR <chr>, REG <chr>, FLT <chr>,
# AIRCRAFT <chr>, AMA <chr>, AMO <dbl>, EMA <dbl>, EMO <dbl>, AC_CLASS <chr>,
# AC_MASS <dbl>, TYPE_ENG <chr>, NUM_ENGS <dbl>, ENG_1_POS <dbl>,
# ENG_2_POS <dbl>, ENG_3_POS <dbl>, ENG_4_POS <dbl>, PHASE_OF_FLIGHT <chr>,
# HEIGHT <dbl>, SPEED <dbl>, DISTANCE <dbl>, SKY <chr>, …
Clean data
names(strikes) <- tolower(names(strikes)) #variable names lowercase
Prepare data for linear regression plotting
strikes2 <- strikes |>
  filter(cost_repairs_infl_adj <= 50000000) |> #drop the outlier above $50M (NA costs are dropped here too)
  filter(!is.na(cost_repairs_infl_adj) & !is.na(speed)) #keep rows with both cost and speed
Plot linear regression
linear2 <- ggplot(strikes2, aes(x = speed, y = cost_repairs_infl_adj)) +
  labs(title = "Speed vs. Repair Cost",
       y = "Repair Cost (Infl. Adjusted $)",
       x = "Speed (MPH)") +
  geom_point() +
  #linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dotdash",
              linewidth = 0.3) +
  theme_minimal()
linear2
Analyze linear regression
cor(strikes2$speed, strikes2$cost_repairs_infl_adj)
[1] 0.0822346
fit <- lm(cost_repairs_infl_adj ~ speed, data = strikes2)
summary(fit)
Call:
lm(formula = cost_repairs_infl_adj ~ speed, data = strikes2)
Residuals:
Min 1Q Median 3Q Max
-512966 -170907 -126161 -73975 18877812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7619.4 42053.4 -0.181 0.856
speed 1362.8 294.2 4.632 3.77e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 878400 on 3151 degrees of freedom
Multiple R-squared: 0.006763, Adjusted R-squared: 0.006447
F-statistic: 21.45 on 1 and 3151 DF, p-value: 3.771e-06
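Equation: cost_repairs_infl_adj = 1362.8(speed) - 7619.4
R^2: 0.0068, so about 0.68% of the variation in repair cost is explained by the model.
P-value: 3.77e-06. The slope is statistically significant, but with 3,153 observations even a very weak relationship can produce a small p-value.
For each 1 MPH increase in speed, repair cost is predicted to increase by about $1,362.80.
The correlation is very weak (r = 0.082), so this relationship should not be used in the main plot.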
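The numbers in the regression output above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that copies the reported values by hand rather than reading them from `fit`. The speed of 150 is a hypothetical value chosen only for illustration.

```r
# Values copied from the cor() and summary(fit) output above
r         <- 0.0822346   # correlation between speed and repair cost
intercept <- -7619.4     # (Intercept) estimate
slope     <- 1362.8      # speed estimate

# With a single predictor, R-squared equals the squared correlation
r_squared <- r^2
round(r_squared, 6)      # 0.006763, the reported Multiple R-squared

# Predicted repair cost at a hypothetical speed of 150 (illustrative only)
example_speed  <- 150
predicted_cost <- intercept + slope * example_speed
predicted_cost           # 196800.6
```

Squaring the correlation reproduces the reported R-squared, which confirms that 0.0822 is the correlation coefficient and not the slope of the fitted line.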
Group by month and time of day to answer the question
strikes3 <- strikes |>
  filter(!is.na(time_of_day)) |>
  group_by(incident_month, time_of_day) |>
  summarise(number_strikes = n())
strikes3
# A tibble: 48 × 3
# Groups: incident_month [12]
incident_month time_of_day number_strikes
<dbl> <chr> <int>
1 1 Dawn 250
2 1 Day 3479
3 1 Dusk 343
4 1 Night 1365
5 2 Dawn 275
6 2 Day 3129
7 2 Dusk 286
8 2 Night 1458
9 3 Dawn 327
10 3 Day 4536
# ℹ 38 more rows
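The grouping step above could also be written with `count()`, which groups, tallies, and ungroups in one call. The sketch below compares the two approaches on a small made-up tibble (`toy` is hypothetical data, not taken from the FAA file) and assumes dplyr is installed.

```r
library(dplyr)

# Toy data, made up for illustration; one row has a missing time of day
toy <- tibble(
  incident_month = c(1, 1, 1, 2, 2, 2),
  time_of_day    = c("Day", "Day", "Night", "Dawn", NA, "Dawn")
)

# Approach used in this report: group_by() + summarise()
by_summarise <- toy |>
  filter(!is.na(time_of_day)) |>
  group_by(incident_month, time_of_day) |>
  summarise(number_strikes = n(), .groups = "drop")

# Equivalent shortcut: count() with a custom name for the tally column
by_count <- toy |>
  filter(!is.na(time_of_day)) |>
  count(incident_month, time_of_day, name = "number_strikes")

by_count
```

Both versions return one row per month and time-of-day combination. The only difference from the code in this report is that `.groups = "drop"` and `count()` return an ungrouped tibble.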
Plot the data
ggplot(strikes3, aes(x = incident_month, y = number_strikes, fill = time_of_day)) +
  geom_col() +
  scale_x_continuous(
    breaks = 1:12,
    labels = month.abb) +
  #Source: https://stackoverflow.com/questions/69411847/changing-month-from-number-to-full-month-name-in-r
  scale_fill_brewer(name = "Time", palette = "Spectral") +
  labs(
    title = "Number of Aircraft Wildlife Strikes by Month",
    x = "Month",
    y = "Number of Strikes",
    caption = "Source: FAA"
  ) +
  theme_minimal()
Reflection
a. How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate)
My dataset’s variable names were already mostly clean, but I used the tolower() function to make the capitalized names lowercase. I also filtered out NA values for key variables.
b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.
The visualization suggests that bird migration patterns affect wildlife collision frequency. The plot shows that collisions greatly increase during spring (peak April/May) and fall (peak Sep/Oct) months, which coincide with avian migration patterns. Also, several bird species migrate at nighttime, and the bar plot reflects this through higher nighttime collisions during the peak seasons.
c. Anything that you might have shown that you could not get to work or that you wished you could have included
I could not get my linear regression graph to accommodate the many values to create a proper fit line without removing several points. I also wish I could have included state information somewhere, possibly through plotly interactivity.