knitr::include_graphics("C:/Users/lydia/Downloads/Comms/data 110/birdieplanebgrm.png")

Travel Art, Kansai International Airport, Airplane, Aviation Accidents And Incidents, Bird Strike, Aircraft, Korean Air, Airline transparent background PNG clipart. Source: HiClipart.com. (https://www.hiclipart.com/free-transparent-background-png-clipart-pgkxm).

Introduction

The topic of my data is Aircraft-Wildlife Collisions. I found the data set from the OpenIntro website. The source of the data set is Aircraft Wildlife Strike Data: Search Tool - FAA Wildlife Strike Database.

I chose this topic because I’ve always found myself wondering, especially during takeoff and landing, how often birds and planes intersect with each other. As someone interested in both environmental issues and data science, I wanted to explore how our human made technology affect wildlife, and how often these two very different worlds collide.

I will be using 7 different variables including:

To clean the data I first filtered out the NA’s in each column that I would be using in my analysis. Then I selected the variables I wanted to include in my project. Finally, I arranged the speed in descending order because I was interested in examining the max speeds of the aircrafts.

Research Question: How do flight phase, time of day, aircraft mass, speed, and height during bird strikes relate to the likelihood of bird strike incidents?

Setting Up

#loading libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
library(broom) 
## Warning: package 'broom' was built under R version 4.4.2
#setting working directory
setwd("C:/Users/lydia/Downloads/Comms/data 110")
birds <- read_csv("birds.csv")
## Rows: 19302 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): opid, operator, atype, remarks, phase_of_flt, date, time_of_day, s...
## dbl  (4): ac_mass, num_engs, height, speed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#heading data
head(birds)
## # A tibble: 6 × 17
##   opid  operator   atype remarks phase_of_flt ac_mass num_engs date  time_of_day
##   <chr> <chr>      <chr> <chr>   <chr>          <dbl>    <dbl> <chr> <chr>      
## 1 AAL   AMERICAN … MD-80 NO DAM… Descent            4        2 9/30… Night      
## 2 USA   US AIRWAYS FK-2… 2 BIRD… Climb              4        2 11/2… Day        
## 3 AAL   AMERICAN … B-72… <NA>    Approach           4        3 8/13… Day        
## 4 AAL   AMERICAN … MD-82 <NA>    Climb              4        2 10/7… Day        
## 5 AAL   AMERICAN … MD-82 NO DAM… Climb              4        2 9/25… Day        
## 6 GFT   GULFSTREA… BE-99 FLT 71… Landing Roll       2        2 9/20… Day        
## # ℹ 8 more variables: state <chr>, height <dbl>, speed <dbl>, effect <chr>,
## #   sky <chr>, species <chr>, birds_seen <chr>, birds_struck <chr>

Exploring the Data

str(birds)
## spc_tbl_ [19,302 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ opid        : chr [1:19302] "AAL" "USA" "AAL" "AAL" ...
##  $ operator    : chr [1:19302] "AMERICAN AIRLINES" "US AIRWAYS" "AMERICAN AIRLINES" "AMERICAN AIRLINES" ...
##  $ atype       : chr [1:19302] "MD-80" "FK-28-4000" "B-727-200" "MD-82" ...
##  $ remarks     : chr [1:19302] "NO DAMAGE" "2 BIRDS, NO DAMAGE." NA NA ...
##  $ phase_of_flt: chr [1:19302] "Descent" "Climb" "Approach" "Climb" ...
##  $ ac_mass     : num [1:19302] 4 4 4 4 4 2 4 2 4 2 ...
##  $ num_engs    : num [1:19302] 2 2 3 2 2 2 2 1 3 2 ...
##  $ date        : chr [1:19302] "9/30/1990 0:00:00" "11/29/1993 0:00:00" "8/13/1993 0:00:00" "10/7/1993 0:00:00" ...
##  $ time_of_day : chr [1:19302] "Night" "Day" "Day" "Day" ...
##  $ state       : chr [1:19302] "IL" "MD" "TN" "VA" ...
##  $ height      : num [1:19302] 7000 10 400 100 50 0 4000 100 200 3000 ...
##  $ speed       : num [1:19302] 250 140 140 200 170 40 230 50 130 NA ...
##  $ effect      : chr [1:19302] NA "None" "None" "None" ...
##  $ sky         : chr [1:19302] "No Cloud" "No Cloud" "Some Cloud" "Overcast" ...
##  $ species     : chr [1:19302] "UNKNOWN BIRD - MEDIUM" "UNKNOWN BIRD - MEDIUM" "UNKNOWN BIRD - SMALL" "UNKNOWN BIRD - SMALL" ...
##  $ birds_seen  : chr [1:19302] NA "2-10" "2-10" NA ...
##  $ birds_struck: chr [1:19302] "1" "2-10" "1" "1" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   opid = col_character(),
##   ..   operator = col_character(),
##   ..   atype = col_character(),
##   ..   remarks = col_character(),
##   ..   phase_of_flt = col_character(),
##   ..   ac_mass = col_double(),
##   ..   num_engs = col_double(),
##   ..   date = col_character(),
##   ..   time_of_day = col_character(),
##   ..   state = col_character(),
##   ..   height = col_double(),
##   ..   speed = col_double(),
##   ..   effect = col_character(),
##   ..   sky = col_character(),
##   ..   species = col_character(),
##   ..   birds_seen = col_character(),
##   ..   birds_struck = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Wrangling the Data

birds_clean <- birds |>
  #filtering NAs
  filter(!is.na(phase_of_flt) & !is.na(time_of_day) & !is.na(ac_mass) & !is.na(speed) & !is.na(height) & !is.na(species) & !is.na(birds_struck) & !is.na(sky)) |>
  #selecting my variables
   select(phase_of_flt, time_of_day, ac_mass, speed, height, species, birds_struck, sky)|>
  #arranging speed in descending order
    arrange(desc(speed))

birds_clean
## # A tibble: 11,611 × 8
##    phase_of_flt time_of_day ac_mass speed height species      birds_struck sky  
##    <chr>        <chr>         <dbl> <dbl>  <dbl> <chr>        <chr>        <chr>
##  1 En Route     Day               4   400  25000 UNKNOWN BIRD 1            Some…
##  2 Descent      Day               4   350  13400 UNKNOWN BIR… 1            No C…
##  3 Climb        Day               4   330   2500 HAWKS        1            No C…
##  4 En Route     Day               4   330  14000 BLACKBIRDS   1            Some…
##  5 Descent      Night             4   320  15000 UNKNOWN BIR… 1            No C…
##  6 Climb        Day               4   320   7000 UNKNOWN BIRD 1            Some…
##  7 Descent      Day               2   320  13500 UNKNOWN BIR… 1            Some…
##  8 Climb        Day               4   320  12500 UNKNOWN BIR… 1            No C…
##  9 Descent      Night             4   320  15000 UNKNOWN BIR… 1            Over…
## 10 Climb        Day               4   320  12000 UNKNOWN BIRD 2-10         No C…
## # ℹ 11,601 more rows
#converting birds_struck to an factor with specific levels.
birds_clean$birds_struck <- factor(birds_clean$birds_struck, levels = c("0", "1", "2-10", "11-100", "Over 100"), ordered = TRUE)

Cleaning for PCA

birds_pca <- birds_clean |>
 select(phase_of_flt, ac_mass, speed, height
        )|> 
drop_na()

birds_pca
## # A tibble: 11,611 × 4
##    phase_of_flt ac_mass speed height
##    <chr>          <dbl> <dbl>  <dbl>
##  1 En Route           4   400  25000
##  2 Descent            4   350  13400
##  3 Climb              4   330   2500
##  4 En Route           4   330  14000
##  5 Descent            4   320  15000
##  6 Climb              4   320   7000
##  7 Descent            2   320  13500
##  8 Climb              4   320  12500
##  9 Descent            4   320  15000
## 10 Climb              4   320  12000
## # ℹ 11,601 more rows

Scaling the Data

bpca_fit <- birds_pca |> 
  select(where(is.numeric)) |> 
  scale() |>                   
  prcomp()                     

bpca_fit
## Standard deviations (1, .., p=3):
## [1] 1.3554777 0.9748320 0.4608502
## 
## Rotation (n x k) = (3 x 3):
##                PC1         PC2        PC3
## ac_mass -0.4056454 -0.83964839  0.3611681
## speed   -0.6959604  0.02757924 -0.7175503
## height  -0.5925293  0.54242968  0.5955494

Performing PCA

bpca_fit |>
  augment(birds_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = phase_of_flt, alpha = .3))+
  xlim(-5,5)+
  ylim(-4,4)+
  xlab("PC1")+
  ylab("PC2")+
  guides(color = guide_legend(title = NULL))+
  theme_update()

Adding PC Coordinates and Arrows

bpca_fit |>
  augment(birds_clean) |>
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(color = phase_of_flt), alpha = .3) +
  
  geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC1 arrow
  geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4), 
               arrow = arrow(type = "closed", length = unit(0.1, "inches")),
               color = "black") +  # PC2 arrow
  
  geom_text(aes(x = 5, y = 0, label = "PC1"), 
            vjust = -0.5, color = "black") +  
   
  geom_text(aes(x = 0, y = 4, label = "PC2"), 
            hjust = -0.5, color = "black") +  
  xlim(-5, 5) +
  ylim(-4, 4) +
  xlab("PC1") +
  ylab("PC2") +
  guides(color = guide_legend(title = NULL)) +  # No title for the legend
  scale_color_manual(values = c("Approach" = "red", "Climb" = "orange", "Descent" = "yellow", "En Route" = "green", "Landing Roll" = "blue", "Parked" = "purple", "Take-off run" = "hotpink", "Taxi" = "black"))+
  theme_update()

We can see that different flight phases separate along PC1 and PC2 in the PCA plot. This separation occurs around the origin. Phases like En Route and Descent tend to have more positive PC2 values, while Landing Roll, Take-off Run, and Taxi cluster around lower PC1 and PC2 values. This shows that PC1 and PC2 show variation in flight characteristics. However, the PCA doesn’t tell us exactly which variables are driving the separation.

Creating Rotation Matrix

arrow_style <- arrow(
  angle = 20, length = grid::unit(8, "pt"),
  ends = "first", type = "closed"
)
bpca_fit |>
  tidy(matrix = "rotation") |>
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style
  ) +
  geom_text(aes(label = column), hjust = 1) +
  xlim(-1.5, 0.5) + ylim(-1, 1) + 
  coord_fixed()+
  theme_update()

PC1 shows variation in speed so we can say that aircrafts with lower PC1 scores tend to have higher speeds, while the aircrafts with higher PC1 scores tend to go slower.

Height points upward while ac_mass points downward this means that PC2 shows the difference between altitude and the aircrafts mass. There is not much separation on PC2, which tells us that this variation isn’t majorly tied to specific flight phases so aircrafts in various phases can fly at different altitudes and have different masses.

Plot the Variance

bpca_fit |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) + 
  geom_col(fill= "lightblue") + 
  scale_x_continuous(
    breaks = 1:3
  ) +
  scale_y_continuous(
    name = "variance explained",
    breaks = seq(0, 1, by = 0.1), 
    label = scales::label_percent(accuracy = 1)
  )+
  xlab("Principal Component (PC)") +
  theme_update()

Calculating the Exact Values

 bpca_fit |>
 tidy(matrix = "eigenvalues")
## # A tibble: 3 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.36   0.612       0.612
## 2     2   0.975  0.317       0.929
## 3     3   0.461  0.0708      1

PC1 explains around 61% of the variance while PC2 explains around 32%. Together PC1 and PC2 cover about 93% of all variation.

Creating a Stacked Bar Chart

ggplot(birds_clean, aes(x = time_of_day, fill = birds_struck)) +
  geom_bar(position = "stack") +
  labs(title = "Number of Bird Strikes by Time of Day",
       x = "Time of Day", y = "count",
       fill = "# of Birds Struck") +
  labs(caption = "Source: FAA Wildlife Strike Database")+
  scale_fill_manual(values = c("#1d3557", "#a8dadc", "#e63946","#457b9d", "#e9c46a"))+
  theme_light()

Findings

Most bird strikes occur during the day, with the majority involving only one bird. Nighttime strikes follow, while dawn and dusk don’t see many collisions. Strikes over 10 birds are rare across all times of day

Taking a Sample

#choosing random seed
set.seed(4000)
#taking 100 random observations to reduce over plotting
samp <- sample_n(birds_clean, 100, replace = TRUE)

head(samp)
## # A tibble: 6 × 8
##   phase_of_flt time_of_day ac_mass speed height species       birds_struck sky  
##   <chr>        <chr>         <dbl> <dbl>  <dbl> <chr>         <ord>        <chr>
## 1 Approach     Night             4   190   2000 UNKNOWN BIRD… 1            No C…
## 2 Climb        Day               4   140     40 AMERICAN ROB… 1            No C…
## 3 Approach     Night             4   140    200 UNKNOWN BIRD… 1            Some…
## 4 Take-off run Day               3    90      0 GULLS         1            Over…
## 5 Descent      Night             4   210   9000 CANADA GOOSE  11-100       No C…
## 6 Take-off run Day               3   120      0 UNKNOWN BIRD… 1            Some…

Interactive Scatterplot (Plotly)

#plotting interactive scatterplot
c <- ggplot(samp,
            aes(x = speed,
                y = height,
                color = birds_struck,
                text = paste("time of Day:", time_of_day,
                             "<br>sky:", sky,
                             "<br>aircraft mass category:", ac_mass))) +
  geom_point(alpha = 6, size = .8)+
  xlim(0,300) +
  ylim(0,10000) + 
  ggtitle("Bird Strikes by Flight Phase, Speed, and Altitude  in the U.S.", subtitle = "Source: FAA Wildlife Strike Database") +
  xlab("Speed in Knots") +
  ylab ("Height in Feet") +
  labs(caption = "Source: FAA Wildlife Strike Database") +
 scale_color_manual(
  name = "# of Birds Struck",
  values = c("1" = "#f18f01", "2-10" = "#ff0000", "11-100" = "#0078ff"),
  breaks = c("1", "2-10", "11-100")
  )+
  facet_wrap(~phase_of_flt)+
  theme_dark(base_size = 12)

c <- ggplotly(c)

c

Findings

Bird strikes usually happen at lower altitudes like the approach, climb, landing roll, and take off. Probably due to more bird activity around those heights. they also occur across a range of speeds within each flight phase so we can say that speed alone isn’t the sole determinant of risk. There aren’t many incidents involving a large number of birds but they seem to be spread out proving that there aren’t particular conditions for the possibility of crashing into more birds higher. Bird strikes vary by flight phase, with the most incidents during approach and climb occurring below 2,500 feet and involving a smaller numbers of birds. Descent strikes span a wider range of altitudes and speeds, occasionally involving larger bird groups. During the en route phase, strikes typically occur at higher altitudes with fewer birds involved. Landing roll and take-off run incidents happen near ground level at lower speeds, with take-off sometimes involving larger flocks.

Conclusion

Overall these visualization show us that, bird strike patterns vary by flight phase and time of day, with most incidents occurring during approach, climb, and daytime, typically involving a single bird at low altitudes. Larger bird strikes are rare, and incidents during landing roll, take-off, and night are not as frequent. PCA illustrates that flight phases separate along principal components, with PC1 representing speed variation and PC2 reflecting a trade-off between altitude and aircraft mass.

One thing that I wasn’t surprised by but made working with this dataset kind of challenging was all of the 1’s in birds struck. Of course I didn’t expect more observations sating they hit 100 birds but I the concentration of 1’s in this data was overwhelming and made it difficult when plotting which is why i had to take a sample. I wish I had included the type of aircraft. That would have added an extra detail and interesting observations in the end to see if there is a certain type of aircraft that collides with wild life the most.