knitr::include_graphics("C:/Users/lydia/Downloads/Comms/data 110/birdieplanebgrm.png")
Travel Art, Kansai International Airport, Airplane, Aviation Accidents And Incidents, Bird Strike, Aircraft, Korean Air, Airline transparent background PNG clipart. Source: HiClipart.com. (https://www.hiclipart.com/free-transparent-background-png-clipart-pgkxm).
The topic of my data is Aircraft-Wildlife Collisions. I found the data set from the OpenIntro website. The source of the data set is Aircraft Wildlife Strike Data: Search Tool - FAA Wildlife Strike Database.
I chose this topic because I’ve always found myself wondering, especially during takeoff and landing, how often birds and planes intersect with each other. As someone interested in both environmental issues and data science, I wanted to explore how our human made technology affect wildlife, and how often these two very different worlds collide.
I will be using 7 different variables including:
A categorical variable, phase_of_flt, which represents the phase of the flight for when collision occurred it is a has 8 categories: Approach, Climb, Descent, En Route, Landing Roll, Parked, Take-off run, Taxi.*
A categorical variable, ac_mass, which states the of the aircraft through: 2250 kg or less (1), 2251-5700 kg (2), 5701-27000 kg (3), 27001-272000 kg (4), above 272000 kg (5).
A categorical variable, time_of_day, which represents the light conditions including the categories: Dawn, Day, Dusk, Night.
The quantitative variables, height and speed, which represent the feet above the ground and the speed of the aricraft in knots.
A categorical variable, sky, which tells the type of cloud conditions, if any, including: No Cloud, Overcast, Some Cloud.
And finally, another categorical variable, birds_struck, which states the number of birds struck: 0, 1, 2-10, 11-100, Over 100.
To clean the data I first filtered out the NA’s in each column that I would be using in my analysis. Then I selected the variables I wanted to include in my project. Finally, I arranged the speed in descending order because I was interested in examining the max speeds of the aircrafts.
Research Question: How do flight phase, time of day, aircraft mass, speed, and height during bird strikes relate to the likelihood of bird strike incidents?
#loading libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.4.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
library(broom)
## Warning: package 'broom' was built under R version 4.4.2
#setting working directory
setwd("C:/Users/lydia/Downloads/Comms/data 110")
birds <- read_csv("birds.csv")
## Rows: 19302 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): opid, operator, atype, remarks, phase_of_flt, date, time_of_day, s...
## dbl (4): ac_mass, num_engs, height, speed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#heading data
head(birds)
## # A tibble: 6 × 17
## opid operator atype remarks phase_of_flt ac_mass num_engs date time_of_day
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 AAL AMERICAN … MD-80 NO DAM… Descent 4 2 9/30… Night
## 2 USA US AIRWAYS FK-2… 2 BIRD… Climb 4 2 11/2… Day
## 3 AAL AMERICAN … B-72… <NA> Approach 4 3 8/13… Day
## 4 AAL AMERICAN … MD-82 <NA> Climb 4 2 10/7… Day
## 5 AAL AMERICAN … MD-82 NO DAM… Climb 4 2 9/25… Day
## 6 GFT GULFSTREA… BE-99 FLT 71… Landing Roll 2 2 9/20… Day
## # ℹ 8 more variables: state <chr>, height <dbl>, speed <dbl>, effect <chr>,
## # sky <chr>, species <chr>, birds_seen <chr>, birds_struck <chr>
str(birds)
## spc_tbl_ [19,302 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ opid : chr [1:19302] "AAL" "USA" "AAL" "AAL" ...
## $ operator : chr [1:19302] "AMERICAN AIRLINES" "US AIRWAYS" "AMERICAN AIRLINES" "AMERICAN AIRLINES" ...
## $ atype : chr [1:19302] "MD-80" "FK-28-4000" "B-727-200" "MD-82" ...
## $ remarks : chr [1:19302] "NO DAMAGE" "2 BIRDS, NO DAMAGE." NA NA ...
## $ phase_of_flt: chr [1:19302] "Descent" "Climb" "Approach" "Climb" ...
## $ ac_mass : num [1:19302] 4 4 4 4 4 2 4 2 4 2 ...
## $ num_engs : num [1:19302] 2 2 3 2 2 2 2 1 3 2 ...
## $ date : chr [1:19302] "9/30/1990 0:00:00" "11/29/1993 0:00:00" "8/13/1993 0:00:00" "10/7/1993 0:00:00" ...
## $ time_of_day : chr [1:19302] "Night" "Day" "Day" "Day" ...
## $ state : chr [1:19302] "IL" "MD" "TN" "VA" ...
## $ height : num [1:19302] 7000 10 400 100 50 0 4000 100 200 3000 ...
## $ speed : num [1:19302] 250 140 140 200 170 40 230 50 130 NA ...
## $ effect : chr [1:19302] NA "None" "None" "None" ...
## $ sky : chr [1:19302] "No Cloud" "No Cloud" "Some Cloud" "Overcast" ...
## $ species : chr [1:19302] "UNKNOWN BIRD - MEDIUM" "UNKNOWN BIRD - MEDIUM" "UNKNOWN BIRD - SMALL" "UNKNOWN BIRD - SMALL" ...
## $ birds_seen : chr [1:19302] NA "2-10" "2-10" NA ...
## $ birds_struck: chr [1:19302] "1" "2-10" "1" "1" ...
## - attr(*, "spec")=
## .. cols(
## .. opid = col_character(),
## .. operator = col_character(),
## .. atype = col_character(),
## .. remarks = col_character(),
## .. phase_of_flt = col_character(),
## .. ac_mass = col_double(),
## .. num_engs = col_double(),
## .. date = col_character(),
## .. time_of_day = col_character(),
## .. state = col_character(),
## .. height = col_double(),
## .. speed = col_double(),
## .. effect = col_character(),
## .. sky = col_character(),
## .. species = col_character(),
## .. birds_seen = col_character(),
## .. birds_struck = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
birds_clean <- birds |>
#filtering NAs
filter(!is.na(phase_of_flt) & !is.na(time_of_day) & !is.na(ac_mass) & !is.na(speed) & !is.na(height) & !is.na(species) & !is.na(birds_struck) & !is.na(sky)) |>
#selecting my variables
select(phase_of_flt, time_of_day, ac_mass, speed, height, species, birds_struck, sky)|>
#arranging speed in descending order
arrange(desc(speed))
birds_clean
## # A tibble: 11,611 × 8
## phase_of_flt time_of_day ac_mass speed height species birds_struck sky
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 En Route Day 4 400 25000 UNKNOWN BIRD 1 Some…
## 2 Descent Day 4 350 13400 UNKNOWN BIR… 1 No C…
## 3 Climb Day 4 330 2500 HAWKS 1 No C…
## 4 En Route Day 4 330 14000 BLACKBIRDS 1 Some…
## 5 Descent Night 4 320 15000 UNKNOWN BIR… 1 No C…
## 6 Climb Day 4 320 7000 UNKNOWN BIRD 1 Some…
## 7 Descent Day 2 320 13500 UNKNOWN BIR… 1 Some…
## 8 Climb Day 4 320 12500 UNKNOWN BIR… 1 No C…
## 9 Descent Night 4 320 15000 UNKNOWN BIR… 1 Over…
## 10 Climb Day 4 320 12000 UNKNOWN BIRD 2-10 No C…
## # ℹ 11,601 more rows
#converting birds_struck to an factor with specific levels.
birds_clean$birds_struck <- factor(birds_clean$birds_struck, levels = c("0", "1", "2-10", "11-100", "Over 100"), ordered = TRUE)
birds_pca <- birds_clean |>
select(phase_of_flt, ac_mass, speed, height
)|>
drop_na()
birds_pca
## # A tibble: 11,611 × 4
## phase_of_flt ac_mass speed height
## <chr> <dbl> <dbl> <dbl>
## 1 En Route 4 400 25000
## 2 Descent 4 350 13400
## 3 Climb 4 330 2500
## 4 En Route 4 330 14000
## 5 Descent 4 320 15000
## 6 Climb 4 320 7000
## 7 Descent 2 320 13500
## 8 Climb 4 320 12500
## 9 Descent 4 320 15000
## 10 Climb 4 320 12000
## # ℹ 11,601 more rows
bpca_fit <- birds_pca |>
select(where(is.numeric)) |>
scale() |>
prcomp()
bpca_fit
## Standard deviations (1, .., p=3):
## [1] 1.3554777 0.9748320 0.4608502
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## ac_mass -0.4056454 -0.83964839 0.3611681
## speed -0.6959604 0.02757924 -0.7175503
## height -0.5925293 0.54242968 0.5955494
bpca_fit |>
augment(birds_clean) |>
ggplot(aes(.fittedPC1, .fittedPC2)) +
geom_point(aes(color = phase_of_flt, alpha = .3))+
xlim(-5,5)+
ylim(-4,4)+
xlab("PC1")+
ylab("PC2")+
guides(color = guide_legend(title = NULL))+
theme_update()
bpca_fit |>
augment(birds_clean) |>
ggplot(aes(.fittedPC1, .fittedPC2)) +
geom_point(aes(color = phase_of_flt), alpha = .3) +
geom_segment(aes(x = -4.9, y = 0, xend = 5, yend = 0),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC1 arrow
geom_segment(aes(x = 0, y = -2.5, xend = 0, yend = 4),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "black") + # PC2 arrow
geom_text(aes(x = 5, y = 0, label = "PC1"),
vjust = -0.5, color = "black") +
geom_text(aes(x = 0, y = 4, label = "PC2"),
hjust = -0.5, color = "black") +
xlim(-5, 5) +
ylim(-4, 4) +
xlab("PC1") +
ylab("PC2") +
guides(color = guide_legend(title = NULL)) + # No title for the legend
scale_color_manual(values = c("Approach" = "red", "Climb" = "orange", "Descent" = "yellow", "En Route" = "green", "Landing Roll" = "blue", "Parked" = "purple", "Take-off run" = "hotpink", "Taxi" = "black"))+
theme_update()
We can see that different flight phases separate along PC1 and PC2 in the PCA plot. This separation occurs around the origin. Phases like En Route and Descent tend to have more positive PC2 values, while Landing Roll, Take-off Run, and Taxi cluster around lower PC1 and PC2 values. This shows that PC1 and PC2 show variation in flight characteristics. However, the PCA doesn’t tell us exactly which variables are driving the separation.
arrow_style <- arrow(
angle = 20, length = grid::unit(8, "pt"),
ends = "first", type = "closed"
)
bpca_fit |>
tidy(matrix = "rotation") |>
pivot_wider(
names_from = "PC", values_from = "value",
names_prefix = "PC"
) |>
ggplot(aes(PC1, PC2)) +
geom_segment(
xend = 0, yend = 0,
arrow = arrow_style
) +
geom_text(aes(label = column), hjust = 1) +
xlim(-1.5, 0.5) + ylim(-1, 1) +
coord_fixed()+
theme_update()
PC1 shows variation in speed so we can say that aircrafts with lower PC1 scores tend to have higher speeds, while the aircrafts with higher PC1 scores tend to go slower.
Height points upward while ac_mass points downward this means that PC2 shows the difference between altitude and the aircrafts mass. There is not much separation on PC2, which tells us that this variation isn’t majorly tied to specific flight phases so aircrafts in various phases can fly at different altitudes and have different masses.
bpca_fit |>
tidy(matrix = "eigenvalues") |>
ggplot(aes(PC, percent)) +
geom_col(fill= "lightblue") +
scale_x_continuous(
breaks = 1:3
) +
scale_y_continuous(
name = "variance explained",
breaks = seq(0, 1, by = 0.1),
label = scales::label_percent(accuracy = 1)
)+
xlab("Principal Component (PC)") +
theme_update()
bpca_fit |>
tidy(matrix = "eigenvalues")
## # A tibble: 3 × 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1.36 0.612 0.612
## 2 2 0.975 0.317 0.929
## 3 3 0.461 0.0708 1
PC1 explains around 61% of the variance while PC2 explains around 32%. Together PC1 and PC2 cover about 93% of all variation.
ggplot(birds_clean, aes(x = time_of_day, fill = birds_struck)) +
geom_bar(position = "stack") +
labs(title = "Number of Bird Strikes by Time of Day",
x = "Time of Day", y = "count",
fill = "# of Birds Struck") +
labs(caption = "Source: FAA Wildlife Strike Database")+
scale_fill_manual(values = c("#1d3557", "#a8dadc", "#e63946","#457b9d", "#e9c46a"))+
theme_light()
Most bird strikes occur during the day, with the majority involving only one bird. Nighttime strikes follow, while dawn and dusk don’t see many collisions. Strikes over 10 birds are rare across all times of day
#choosing random seed
set.seed(4000)
#taking 100 random observations to reduce over plotting
samp <- sample_n(birds_clean, 100, replace = TRUE)
head(samp)
## # A tibble: 6 × 8
## phase_of_flt time_of_day ac_mass speed height species birds_struck sky
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <ord> <chr>
## 1 Approach Night 4 190 2000 UNKNOWN BIRD… 1 No C…
## 2 Climb Day 4 140 40 AMERICAN ROB… 1 No C…
## 3 Approach Night 4 140 200 UNKNOWN BIRD… 1 Some…
## 4 Take-off run Day 3 90 0 GULLS 1 Over…
## 5 Descent Night 4 210 9000 CANADA GOOSE 11-100 No C…
## 6 Take-off run Day 3 120 0 UNKNOWN BIRD… 1 Some…
#plotting interactive scatterplot
c <- ggplot(samp,
aes(x = speed,
y = height,
color = birds_struck,
text = paste("time of Day:", time_of_day,
"<br>sky:", sky,
"<br>aircraft mass category:", ac_mass))) +
geom_point(alpha = 6, size = .8)+
xlim(0,300) +
ylim(0,10000) +
ggtitle("Bird Strikes by Flight Phase, Speed, and Altitude in the U.S.", subtitle = "Source: FAA Wildlife Strike Database") +
xlab("Speed in Knots") +
ylab ("Height in Feet") +
labs(caption = "Source: FAA Wildlife Strike Database") +
scale_color_manual(
name = "# of Birds Struck",
values = c("1" = "#f18f01", "2-10" = "#ff0000", "11-100" = "#0078ff"),
breaks = c("1", "2-10", "11-100")
)+
facet_wrap(~phase_of_flt)+
theme_dark(base_size = 12)
c <- ggplotly(c)
c
Bird strikes usually happen at lower altitudes like the approach, climb, landing roll, and take off. Probably due to more bird activity around those heights. they also occur across a range of speeds within each flight phase so we can say that speed alone isn’t the sole determinant of risk. There aren’t many incidents involving a large number of birds but they seem to be spread out proving that there aren’t particular conditions for the possibility of crashing into more birds higher. Bird strikes vary by flight phase, with the most incidents during approach and climb occurring below 2,500 feet and involving a smaller numbers of birds. Descent strikes span a wider range of altitudes and speeds, occasionally involving larger bird groups. During the en route phase, strikes typically occur at higher altitudes with fewer birds involved. Landing roll and take-off run incidents happen near ground level at lower speeds, with take-off sometimes involving larger flocks.
Overall these visualization show us that, bird strike patterns vary by flight phase and time of day, with most incidents occurring during approach, climb, and daytime, typically involving a single bird at low altitudes. Larger bird strikes are rare, and incidents during landing roll, take-off, and night are not as frequent. PCA illustrates that flight phases separate along principal components, with PC1 representing speed variation and PC2 reflecting a trade-off between altitude and aircraft mass.
One thing that I wasn’t surprised by but made working with this dataset kind of challenging was all of the 1’s in birds struck. Of course I didn’t expect more observations sating they hit 100 birds but I the concentration of 1’s in this data was overwhelming and made it difficult when plotting which is why i had to take a sample. I wish I had included the type of aircraft. That would have added an extra detail and interesting observations in the end to see if there is a certain type of aircraft that collides with wild life the most.