library(tidyverse) # load the library
library(viridis)
library(leaflet)
library(GGally)
library(webshot2)
setwd("~/DATA110")
aircraft <- read_csv("aircraft_wildlife_strikes_faa (2).csv") # load the data
data(aircraft)aircraft
AIRCRAFT WILDFIRE.
Introduction to My Project on FAA and Aircraft Incidents Worldwide.
For my project, I decided to focus on the FAA and aircraft incidents worldwide. I chose this topic because I find it quite interesting and personally relevant.
About two weeks ago, I was on a trip and experienced a lot of turbulence during the flight. That made me wonder: What actually causes turbulence? Is it the weather? Is it birds? I had a lot of questions and doubts about what was really going on up there. That curiosity is what led me to choose this dataset.
Through this analysis, I hope to gain a clearer understanding of:
-Which airlines are most affected by bird-related incidents?
-Which bird species are most frequently involved in those incidents? - which regions of the United States are most affected by bird strikes?
Variables Used in My Analysis
To carry out this analysis, I used several important variables from the dataset:
Species: A categorical variable that tells us the type of animal involved in the incident.
Incident Year: The year in which the bird strike or incident occurred.
Cost Repair: This numeric variable shows how much it cost the operator to repair the aircraft after the incident.
Operator: This refers to the name of the airline or company involved.
Longitude and Latitude: These allow us to locate the exact point where each incident happened, between the years 2000 and 2023.
Sky: This describes the sky or weather conditions at the time of the incident (e.g., clear, cloudy, etc.).
Load the library and the data
In this section, I downloaded and loaded all the libraries I would need for my project. I also imported the dataset from my computer and loaded it into R so I could start working with it in my logical (R) environment.
Cleaning my dataset
cln1 <- aircraft |>
filter(SPECIES %in% c("Mourning dove", "Barn swallow", "Killdeer")) |> # Select species
filter(!is.na(INCIDENT_YEAR)) |> # remove na
filter(INCIDENT_YEAR >= 2000) |> #select year
group_by(INCIDENT_YEAR, SPECIES) |> # make a classification
summarise(nb_incidents = n(), .groups = "drop") In this section, I focused on cleaning my data. I filtered it to include only the most common bird species involved in aircraft strikes—specifically, Mourning Dove, Barn Swallow, and Killdeer. Then, I removed any rows with missing values (NA) in the YEAR column. I decided to start my analysis from the year 2000, since it marks the beginning of a new century and provides over two decades of data to work with. Finally, I grouped the data by YEAR and SPECIES to prepare it for further analysis.
Create my first graph.
ggplot(cln1, aes(x = factor(INCIDENT_YEAR), y = nb_incidents, fill = SPECIES)) +
geom_bar(stat = "identity", position = "stack") +
scale_fill_manual(values = c("Mourning dove" = "pink", # define the graph colors
"Barn swallow" = "violet",
"Killdeer" = "blue")) +
labs(
title = "Number of Incidents per Year for the 3 Species", # add labs
x = "Year",
y = "Number of Incidents",
fill = "Species",
caption = "Source: FAA"
) +
theme_minimal() + # change the theme
theme(axis.text.x = element_text(angle = 45, hjust = 1)) #modify the x axis text format 
This graph shows the number of incidents per year involving the three bird species I filtered earlier—Mourning Dove, Barn Swallow, and Killdeer. One thing that really stood out to me was the steady increase in the number of incidents over time. However, after 2019, there is a noticeable drop around 2020, followed by a rebound.
This trend raised a few important questions for me:
1-Was the drop in 2020 related to the COVID-19 pandemic and a decrease in air travel?
2-Is the overall increase in bird strikes over the years due to more aircraft in the sky, a growing bird population, or changes in how well airports manage bird activity around runways?
Cleaning my data for my map.
aircraft1 <- aircraft |>
filter(!is.na(LONGITUDE)& !is.na(LATITUDE)) |> # remove na
filter(SPECIES %in% c("Mourning dove", "Barn swallow", "Killdeer")) # select a particular type of species.The reason why I’m restarting my data analysis is that if I had removed all the rows with missing information from the beginning, my dataset would have been significantly reduced. I was worried that this wouldn’t accurately reflect reality. That’s why I chose to retrieve the full dataset and clean it gradually, depending on which variables were needed at each stage of the analysis.
Set my data coordinates
aircraft1 |>
leaflet() |>
setView(lng = -106.5348, lat = 38.7946, zoom =6) |>
addProviderTiles("Esri.WorldStreetMap") |> # choose the type of map
addCircles(radius = 1
) Assuming "LONGITUDE" and "LATITUDE" are longitude and latitude, respectively
This chunk allows me to geographically define the boundaries of my map and also choose the type of map I want by selecting from the different options available in R.
Create my popup.
popup1 <- paste0( # create the legend of each point .
"<b>Species: </b>", aircraft1$SPECIES, "<br>",
"<b>SKY: </b>", aircraft1$SKY, "<br>",
"<b>COST_REPAIRS: </b>", aircraft1$COST_REPAIRS, "<br>",
"<b>Operator: </b>", aircraft1$OPERATOR, "<br>",
"<b>Year: </b>", aircraft1$INCIDENT_YEAR
)This field allows me to create my popup. So, what is a pop-up? A popup is like a small label or tooltip that appears when you interact with a specific point on an interactive graph or map. When I click on a point, it displays detailed information such as the species, sky condition, repair cost, operator, and the year of the incident.
Add the popup to my graph.
colors <- c("#916", "#1B949A", "#237BE8") #selecting colors
pal <- colorFactor(
palette = colors,
domain = aircraft1$SPECIES)
leaflet() |> # create the map
setView(lng = -106.5348, lat = 38.7946, zoom =6) |> #Set the geographical coordiates
addProviderTiles("Esri.WorldStreetMap") |>
addCircles(
data = aircraft1, # select the database
radius = 500,
color = pal(aircraft1$SPECIES),
popup = popup1) |>
addLegend("topleft", pal = pal, values = aircraft1$SPECIES,
title = "SPECIES", opacity = 1)Assuming "LONGITUDE" and "LATITUDE" are longitude and latitude, respectively
gOne thing that really surprised me when I looked at the second graph was how clearly the data showed a geographic pattern. There’s a noticeable difference between the eastern and western parts of the country. In the East, especially along the coastline, there is a significantly higher number of incidents — in some areas, even double or triple the amount compared to the West.
This raised questions for me about what might be causing this difference:
- Could the higher number of bird-aircraft incidents near coastal areas be due to the presence of beaches, which naturally attract large numbers of birds?
-Or, could the increased number of incidents be explained by bird migration patterns, with more birds passing through or
settling in these regions?
Cleaning the data for my modeling.
aircraft2 <- aircraft %>%
filter(!is.na(INCIDENT_MONTH)) %>%
filter(!is.na(INCIDENT_YEAR)) %>%
filter(!is.na(AMO)) %>%
filter(!is.na(SPEED)) %>%
filter(!is.na(EMO)) %>%
filter(!is.na(SPEED)) %>%
filter(!is.na(COST_OTHER)) %>%
filter(!is.na(COST_REPAIRS))The reason I restarted the cleaning process for my modeling is that I believed, to have a sufficiently large dataset, I needed to redo the basic cleaning. If I had done a thorough cleaning all at once, I would have ended up with too many missing values
Modeling Data.
fit2 <- lm(data= aircraft2,COST_REPAIRS~SPEED) #modeling my dataset.
summary(fit2)
Call:
lm(formula = COST_REPAIRS ~ SPEED, data = aircraft2)
Residuals:
Min 1Q Median 3Q Max
-258801 -99479 -76386 -38325 9394060
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17878.6 38451.9 0.465 0.6421
SPEED 652.3 264.2 2.469 0.0137 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 481000 on 1125 degrees of freedom
Multiple R-squared: 0.00539, Adjusted R-squared: 0.004505
F-statistic: 6.096 on 1 and 1125 DF, p-value: 0.0137
fit3 <- lm(data=aircraft2, COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH + AMO + SPEED + COST_REPAIRS + EMO)Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
the right-hand side and was dropped
Warning in model.matrix.default(mt, mf, contrasts): problem with term 5 in
model.matrix: no columns are assigned
summary(fit3)
Call:
lm(formula = COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH +
AMO + SPEED + COST_REPAIRS + EMO, data = aircraft2)
Residuals:
Min 1Q Median 3Q Max
-265807 -110031 -70324 -29918 9355934
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4903712.4 3674158.3 1.335 0.1823
SPEED 684.4 271.1 2.524 0.0117 *
INCIDENT_YEAR -2434.2 1830.9 -1.330 0.1839
INCIDENT_MONTH -4294.8 4273.0 -1.005 0.3151
AMO 956.0 909.6 1.051 0.2935
EMO 973.0 1370.6 0.710 0.4779
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 481000 on 1121 degrees of freedom
Multiple R-squared: 0.008826, Adjusted R-squared: 0.004405
F-statistic: 1.996 on 5 and 1121 DF, p-value: 0.07662
fit4 <- lm(data=aircraft2, COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH + AMO + SPEED + COST_REPAIRS )Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
the right-hand side and was dropped
Warning in model.matrix.default(mt, mf, contrasts): problem with term 5 in
model.matrix: no columns are assigned
summary(fit4)
Call:
lm(formula = COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH +
AMO + SPEED + COST_REPAIRS, data = aircraft2)
Residuals:
Min 1Q Median 3Q Max
-224897 -110532 -69960 -30227 9359701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4948507.6 3672804.4 1.347 0.1781
SPEED 660.9 269.0 2.457 0.0142 *
INCIDENT_YEAR -2449.9 1830.4 -1.338 0.1810
INCIDENT_MONTH -4338.8 4271.6 -1.016 0.3100
AMO 912.9 907.3 1.006 0.3146
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 480900 on 1122 degrees of freedom
Multiple R-squared: 0.008381, Adjusted R-squared: 0.004845
F-statistic: 2.371 on 4 and 1122 DF, p-value: 0.05076
fit5 <- lm(data=aircraft2, COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH+ SPEED + COST_REPAIRS )Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
the right-hand side and was dropped
Warning in model.matrix.default(mt, mf, contrasts): problem with term 4 in
model.matrix: no columns are assigned
summary(fit5)
Call:
lm(formula = COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH +
SPEED + COST_REPAIRS, data = aircraft2)
Residuals:
Min 1Q Median 3Q Max
-245668 -108149 -71990 -32017 9352749
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4639495.2 3659959.8 1.268 0.20519
SPEED 700.2 266.2 2.630 0.00864 **
INCIDENT_YEAR -2290.1 1823.5 -1.256 0.20941
INCIDENT_MONTH -4256.4 4270.8 -0.997 0.31916
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 480900 on 1123 degrees of freedom
Multiple R-squared: 0.007486, Adjusted R-squared: 0.004835
F-statistic: 2.823 on 3 and 1123 DF, p-value: 0.03771
After modeling my data, I aimed to determine whether there was a correlation between repair cost and speed. To explore this, I added additional numerical variables such as year, month, incident count, and speed into the model.
Unfortunately, the initial results were not very conclusive, as the adjusted R-squared value was quite weak, indicating a poor fit.
In my second modeling attempt, I began to notice a slight improvement in the performance of the model. However, after a certain point, there was a regression in the results again. This suggests that repair cost and speed are likely not strongly correlated in this dataset.
Clean my data in order to see the most affected operator.
summary_operator <- aircraft |>
filter(!is.na(OPERATOR)) |>
group_by(OPERATOR) |>
summarise(
nb_incidents = n(),
avg_cost = mean(COST_REPAIRS, na.rm = TRUE)
) |>
arrange(desc(nb_incidents)) |> #
filter(!OPERATOR %in% c("UNKNOWN", "BUSINESS"))|> # remove some data
slice_head(n = 10) # select the 10 firstThis line of code allows us to identify the airlines most affected by animal-related aircraft incidents. The reason I decided to remove the “Unknown” and “Business” categories is that I consider “Unknown” to be similar to missing values—it can’t be treated the same way as NA values. As for “Business,” I believe these represent private jets since there is no airline named Business. Therefore, I chose to exclude both categories, even though they were among the two largest groups.
ggplot(summary_operator, aes(x = reorder(OPERATOR, nb_incidents), y = nb_incidents, fill = OPERATOR)) +
geom_col() + # create a plot
coord_flip() + # flip the graph
scale_fill_viridis_d(option ="rocket") +
labs(
title = "Top 10 Operators by Number of Incidents",
x = "Operator",
y = "Number of Incidents",
fill = "Operator",
caption=" Source: FAA"
) +
theme_minimal()
This line allows me to create the graph that will help determine which airlines are most affected by animal-related incidents.
Essay
Analyzing this dataset was a truly fascinating experience for me because it allowed me to deepen my knowledge and learn new things. I discovered new formulas that helped me better understand the data. It also gave me insights into which types of animals are most responsible for causing bird strikes and related aircraft damages.
For the analysis, I first decided to clean the data by selecting only certain types of animals. Then, I created a graph to show how these animal types caused aircraft incidents over the years starting from 2000. After identifying the birds that caused the most damage, I looked into which airlines were most affected by these incidents, aiming to get a clearer picture of which companies experience these events most frequently.
Next, I made a map to pinpoint where exactly these incidents occurred globally, which helped me appreciate the geographical distribution and diversity of the events.
Doing this analysis helped me answer many questions, especially the first one I asked myself: Which airlines are most affected by bird strikes? I found my answer, and surprisingly, Southwest Airlines was among the most affected. I hadn’t heard much about this airline before, so it was interesting to discover it through this analysis. Other airlines like American, United, and Delta were also among the top, which was less surprising since they are well-known.
The second question I had was: Which animals cause the most damage? I was surprised to find that I only knew one of the species involved, the mourning dove. The rest were new to me, so this was a good learning experience.
For the third question, I was genuinely shocked and puzzled: Which region of the United States is most affected by bird strikes? The Eastern region of the US has two to three times more incidents than the Western region. This made me wonder if it’s because there are more animals there, or simply because there are more flights in the East. This finding helped answer both the third question and reinforced the insight on where the most affected areas are — particularly around ports on the West Coast.
