aircraft

Author

Djeneba kounta

AIRCRAFT WILDFIRE.

https://www.bing.com/ck/a?!&&p=856cad3d1e2050ed7f25a119634fbfc8aac88775c39d028b96bbf658636007d4JmltdHM9MTc1MTY3MzYwMA&ptn=3&ver=2&hsh=4&fclid=1bcb18d0-0bf0-62e5-0da8-08160a4963ff&u=a1L2ltYWdlcy9zZWFyY2g_cT1pbWFnZSthdmlvbithdHRhcXVlK3BhcitvaXNlYXUmaWQ9QkIwMTcxRDYyRTA1RENGNTE5RkVFODY0QUIyNDI3OEM1NjhFNzk4NCZGT1JNPUlBQ0ZJUg&ntb=1

Introduction to My Project on FAA and Aircraft Incidents Worldwide.

For my project, I decided to focus on the FAA and aircraft incidents worldwide. I chose this topic because I find it quite interesting and personally relevant.

About two weeks ago, I was on a trip and experienced a lot of turbulence during the flight. That made me wonder: What actually causes turbulence? Is it the weather? Is it birds? I had a lot of questions and doubts about what was really going on up there. That curiosity is what led me to choose this dataset.

Through this analysis, I hope to gain a clearer understanding of:

-Which airlines are most affected by bird-related incidents?

-Which bird species are most frequently involved in those incidents? - which regions of the United States are most affected by bird strikes?

Variables Used in My Analysis

To carry out this analysis, I used several important variables from the dataset:

Species: A categorical variable that tells us the type of animal involved in the incident.

Incident Year: The year in which the bird strike or incident occurred.

Cost Repair: This numeric variable shows how much it cost the operator to repair the aircraft after the incident.

Operator: This refers to the name of the airline or company involved.

Longitude and Latitude: These allow us to locate the exact point where each incident happened, between the years 2000 and 2023.

Sky: This describes the sky or weather conditions at the time of the incident (e.g., clear, cloudy, etc.).

Load the library and the data

library(tidyverse) # load the library 

library(viridis)
library(leaflet)
library(GGally)
library(webshot2)


setwd("~/DATA110")
aircraft <- read_csv("aircraft_wildlife_strikes_faa (2).csv") # load the data 
data(aircraft)

In this section, I downloaded and loaded all the libraries I would need for my project. I also imported the dataset from my computer and loaded it into R so I could start working with it in my logical (R) environment.

Cleaning my dataset

cln1 <- aircraft |>
  filter(SPECIES %in% c("Mourning dove", "Barn swallow", "Killdeer")) |> # Select species
  filter(!is.na(INCIDENT_YEAR)) |>      # remove na 
  filter(INCIDENT_YEAR >= 2000) |>       #select year
  group_by(INCIDENT_YEAR, SPECIES) |> # make a classification
  
  summarise(nb_incidents = n(), .groups = "drop")

In this section, I focused on cleaning my data. I filtered it to include only the most common bird species involved in aircraft strikes—specifically, Mourning Dove, Barn Swallow, and Killdeer. Then, I removed any rows with missing values (NA) in the YEAR column. I decided to start my analysis from the year 2000, since it marks the beginning of a new century and provides over two decades of data to work with. Finally, I grouped the data by YEAR and SPECIES to prepare it for further analysis.

Create my first graph.

ggplot(cln1, aes(x = factor(INCIDENT_YEAR), y = nb_incidents, fill = SPECIES)) +  
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("Mourning dove" = "pink",  # define the graph colors 
                               "Barn swallow" = "violet", 
                               "Killdeer" = "blue")) +
  labs(
    title = "Number of Incidents per Year for the 3 Species", # add labs
    x = "Year",
    y = "Number of Incidents",
    fill = "Species",
    caption = "Source: FAA"
  ) +
  theme_minimal() +    # change the theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  #modify the x axis text format

This graph shows the number of incidents per year involving the three bird species I filtered earlier—Mourning Dove, Barn Swallow, and Killdeer. One thing that really stood out to me was the steady increase in the number of incidents over time. However, after 2019, there is a noticeable drop around 2020, followed by a rebound.

This trend raised a few important questions for me:

1-Was the drop in 2020 related to the COVID-19 pandemic and a decrease in air travel?

2-Is the overall increase in bird strikes over the years due to more aircraft in the sky, a growing bird population, or changes in how well airports manage bird activity around runways?

Cleaning my data for my map.

aircraft1 <- aircraft |>
  
  filter(!is.na(LONGITUDE)& !is.na(LATITUDE)) |>  # remove na
   filter(SPECIES %in% c("Mourning dove", "Barn swallow", "Killdeer")) # select a particular type of species.

The reason why I’m restarting my data analysis is that if I had removed all the rows with missing information from the beginning, my dataset would have been significantly reduced. I was worried that this wouldn’t accurately reflect reality. That’s why I chose to retrieve the full dataset and clean it gradually, depending on which variables were needed at each stage of the analysis.

Set my data coordinates

 aircraft1 |>

leaflet() |>
  setView(lng = -106.5348, lat = 38.7946, zoom =6) |>
  addProviderTiles("Esri.WorldStreetMap") |>  # choose the type of map
  addCircles(radius = 1
  )

Assuming "LONGITUDE" and "LATITUDE" are longitude and latitude, respectively

This chunk allows me to geographically define the boundaries of my map and also choose the type of map I want by selecting from the different options available in R.

Create my popup.

popup1 <- paste0(     # create the legend of each point .
  "<b>Species: </b>", aircraft1$SPECIES, "<br>",
  "<b>SKY: </b>", aircraft1$SKY, "<br>",
  "<b>COST_REPAIRS: </b>", aircraft1$COST_REPAIRS, "<br>",
  "<b>Operator: </b>", aircraft1$OPERATOR, "<br>",
  "<b>Year: </b>", aircraft1$INCIDENT_YEAR
)

This field allows me to create my popup. So, what is a pop-up? A popup is like a small label or tooltip that appears when you interact with a specific point on an interactive graph or map. When I click on a point, it displays detailed information such as the species, sky condition, repair cost, operator, and the year of the incident.

Add the popup to my graph.

colors <- c("#916", "#1B949A", "#237BE8") #selecting colors
pal <- colorFactor(
  palette = colors,
  domain = aircraft1$SPECIES) 

leaflet() |>        # create the map 
  setView(lng = -106.5348, lat = 38.7946, zoom =6) |> #Set the geographical coordiates 
  addProviderTiles("Esri.WorldStreetMap") |>
  addCircles(
    data = aircraft1,   # select the database
 
  radius = 500,
 
  color = pal(aircraft1$SPECIES),
   popup = popup1) |>
addLegend("topleft", pal = pal, values = aircraft1$SPECIES,
            title = "SPECIES", opacity = 1)

Assuming "LONGITUDE" and "LATITUDE" are longitude and latitude, respectively

gOne thing that really surprised me when I looked at the second graph was how clearly the data showed a geographic pattern. There’s a noticeable difference between the eastern and western parts of the country. In the East, especially along the coastline, there is a significantly higher number of incidents — in some areas, even double or triple the amount compared to the West.

This raised questions for me about what might be causing this difference:

Could the higher number of bird-aircraft incidents near coastal areas be due to the presence of beaches, which naturally attract large numbers of birds?

-Or, could the increased number of incidents be explained by bird migration patterns, with more birds passing through or

settling in these regions?

Cleaning the data for my modeling.

aircraft2 <- aircraft %>%
  
  
  filter(!is.na(INCIDENT_MONTH)) %>%
  filter(!is.na(INCIDENT_YEAR)) %>%
 filter(!is.na(AMO)) %>%
  filter(!is.na(SPEED)) %>%
  filter(!is.na(EMO)) %>%
  filter(!is.na(SPEED)) %>%
  filter(!is.na(COST_OTHER)) %>%
  filter(!is.na(COST_REPAIRS))

The reason I restarted the cleaning process for my modeling is that I believed, to have a sufficiently large dataset, I needed to redo the basic cleaning. If I had done a thorough cleaning all at once, I would have ended up with too many missing values

Modeling Data.

fit2 <- lm(data= aircraft2,COST_REPAIRS~SPEED) #modeling my dataset.
summary(fit2)


Call:
lm(formula = COST_REPAIRS ~ SPEED, data = aircraft2)

Residuals:
    Min      1Q  Median      3Q     Max 
-258801  -99479  -76386  -38325 9394060 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  17878.6    38451.9   0.465   0.6421  
SPEED          652.3      264.2   2.469   0.0137 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 481000 on 1125 degrees of freedom
Multiple R-squared:  0.00539,   Adjusted R-squared:  0.004505 
F-statistic: 6.096 on 1 and 1125 DF,  p-value: 0.0137

fit3 <- lm(data=aircraft2, COST_REPAIRS ~ SPEED +  INCIDENT_YEAR + INCIDENT_MONTH + AMO + SPEED + COST_REPAIRS  + EMO)

Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
the right-hand side and was dropped

Warning in model.matrix.default(mt, mf, contrasts): problem with term 5 in
model.matrix: no columns are assigned

summary(fit3)


Call:
lm(formula = COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH + 
    AMO + SPEED + COST_REPAIRS + EMO, data = aircraft2)

Residuals:
    Min      1Q  Median      3Q     Max 
-265807 -110031  -70324  -29918 9355934 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    4903712.4  3674158.3   1.335   0.1823  
SPEED              684.4      271.1   2.524   0.0117 *
INCIDENT_YEAR    -2434.2     1830.9  -1.330   0.1839  
INCIDENT_MONTH   -4294.8     4273.0  -1.005   0.3151  
AMO                956.0      909.6   1.051   0.2935  
EMO                973.0     1370.6   0.710   0.4779  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 481000 on 1121 degrees of freedom
Multiple R-squared:  0.008826,  Adjusted R-squared:  0.004405 
F-statistic: 1.996 on 5 and 1121 DF,  p-value: 0.07662

fit4 <- lm(data=aircraft2, COST_REPAIRS ~ SPEED +  INCIDENT_YEAR + INCIDENT_MONTH + AMO + SPEED + COST_REPAIRS  )

Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
the right-hand side and was dropped

Warning in model.matrix.default(mt, mf, contrasts): problem with term 5 in
model.matrix: no columns are assigned

summary(fit4)


Call:
lm(formula = COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH + 
    AMO + SPEED + COST_REPAIRS, data = aircraft2)

Residuals:
    Min      1Q  Median      3Q     Max 
-224897 -110532  -69960  -30227 9359701 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)    4948507.6  3672804.4   1.347   0.1781  
SPEED              660.9      269.0   2.457   0.0142 *
INCIDENT_YEAR    -2449.9     1830.4  -1.338   0.1810  
INCIDENT_MONTH   -4338.8     4271.6  -1.016   0.3100  
AMO                912.9      907.3   1.006   0.3146  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 480900 on 1122 degrees of freedom
Multiple R-squared:  0.008381,  Adjusted R-squared:  0.004845 
F-statistic: 2.371 on 4 and 1122 DF,  p-value: 0.05076

fit5 <- lm(data=aircraft2, COST_REPAIRS ~ SPEED +  INCIDENT_YEAR + INCIDENT_MONTH+ SPEED + COST_REPAIRS  )

Warning in model.matrix.default(mt, mf, contrasts): the response appeared on
the right-hand side and was dropped

Warning in model.matrix.default(mt, mf, contrasts): problem with term 4 in
model.matrix: no columns are assigned

summary(fit5)


Call:
lm(formula = COST_REPAIRS ~ SPEED + INCIDENT_YEAR + INCIDENT_MONTH + 
    SPEED + COST_REPAIRS, data = aircraft2)

Residuals:
    Min      1Q  Median      3Q     Max 
-245668 -108149  -71990  -32017 9352749 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    4639495.2  3659959.8   1.268  0.20519   
SPEED              700.2      266.2   2.630  0.00864 **
INCIDENT_YEAR    -2290.1     1823.5  -1.256  0.20941   
INCIDENT_MONTH   -4256.4     4270.8  -0.997  0.31916   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 480900 on 1123 degrees of freedom
Multiple R-squared:  0.007486,  Adjusted R-squared:  0.004835 
F-statistic: 2.823 on 3 and 1123 DF,  p-value: 0.03771

After modeling my data, I aimed to determine whether there was a correlation between repair cost and speed. To explore this, I added additional numerical variables such as year, month, incident count, and speed into the model.

Unfortunately, the initial results were not very conclusive, as the adjusted R-squared value was quite weak, indicating a poor fit.

In my second modeling attempt, I began to notice a slight improvement in the performance of the model. However, after a certain point, there was a regression in the results again. This suggests that repair cost and speed are likely not strongly correlated in this dataset.

Clean my data in order to see the most affected operator.

summary_operator <- aircraft |> 
  filter(!is.na(OPERATOR)) |>
  group_by(OPERATOR) |>
  summarise(
    nb_incidents = n(),
    avg_cost = mean(COST_REPAIRS, na.rm = TRUE)
  ) |>
  arrange(desc(nb_incidents)) |>  # 
  filter(!OPERATOR %in% c("UNKNOWN", "BUSINESS"))|> # remove some  data 
 
  slice_head(n = 10) # select the 10 first

This line of code allows us to identify the airlines most affected by animal-related aircraft incidents. The reason I decided to remove the “Unknown” and “Business” categories is that I consider “Unknown” to be similar to missing values—it can’t be treated the same way as NA values. As for “Business,” I believe these represent private jets since there is no airline named Business. Therefore, I chose to exclude both categories, even though they were among the two largest groups.

ggplot(summary_operator, aes(x = reorder(OPERATOR, nb_incidents), y = nb_incidents, fill = OPERATOR)) +
  geom_col() + # create a plot 
  coord_flip() + # flip the graph 
  scale_fill_viridis_d(option ="rocket") +
  labs(
    title = "Top 10 Operators by Number of Incidents", 
    x = "Operator",                                      
    y = "Number of Incidents",
    fill = "Operator",
    caption=" Source:  FAA"
  ) +
  theme_minimal()

This line allows me to create the graph that will help determine which airlines are most affected by animal-related incidents.

Essay

Analyzing this dataset was a truly fascinating experience for me because it allowed me to deepen my knowledge and learn new things. I discovered new formulas that helped me better understand the data. It also gave me insights into which types of animals are most responsible for causing bird strikes and related aircraft damages.

For the analysis, I first decided to clean the data by selecting only certain types of animals. Then, I created a graph to show how these animal types caused aircraft incidents over the years starting from 2000. After identifying the birds that caused the most damage, I looked into which airlines were most affected by these incidents, aiming to get a clearer picture of which companies experience these events most frequently.

Next, I made a map to pinpoint where exactly these incidents occurred globally, which helped me appreciate the geographical distribution and diversity of the events.

Doing this analysis helped me answer many questions, especially the first one I asked myself: Which airlines are most affected by bird strikes? I found my answer, and surprisingly, Southwest Airlines was among the most affected. I hadn’t heard much about this airline before, so it was interesting to discover it through this analysis. Other airlines like American, United, and Delta were also among the top, which was less surprising since they are well-known.

The second question I had was: Which animals cause the most damage? I was surprised to find that I only knew one of the species involved, the mourning dove. The rest were new to me, so this was a good learning experience.

For the third question, I was genuinely shocked and puzzled: Which region of the United States is most affected by bird strikes? The Eastern region of the US has two to three times more incidents than the Western region. This made me wonder if it’s because there are more animals there, or simply because there are more flights in the East. This finding helped answer both the third question and reinforced the insight on where the most affected areas are — particularly around ports on the West Coast.