Introduction

Although not usual, bird strikes represent a significant safety and operational challenge in aviation. It poses risks to both airlines and wildlife. The occurrence of bird strikes may be influenced by several factors including the environment and flying technical patterns. In consequence, this document examines that situation through the following question:” How do environmental conditions and flying technical aspects influence the likeliness of bird strikes?”

A dataset provided by open intro is used to conduct the analysis. It contains 19,302 observations distributed across 17 variables. It is accessible through the following link: https://www.openintro.org/data/index.php?data=birds

Key variables include: - opid: operator ID

- phase_of_flight: telling if the flight is taking off, landing, or en route

- remarks: Verbal remarks regarding the collision

- ac_mass: Mass of the aircraft 

- time_of_day: light conditions: Day, night, dawn, dusk

- height: distance above grouf level

- sky: type of cloud cover
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
df<-read_csv("birds.csv")
## Rows: 19302 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): opid, operator, atype, remarks, phase_of_flt, date, time_of_day, s...
## dbl  (4): ac_mass, num_engs, height, speed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 17
##   opid  operator   atype remarks phase_of_flt ac_mass num_engs date  time_of_day
##   <chr> <chr>      <chr> <chr>   <chr>          <dbl>    <dbl> <chr> <chr>      
## 1 AAL   AMERICAN … MD-80 NO DAM… Descent            4        2 9/30… Night      
## 2 USA   US AIRWAYS FK-2… 2 BIRD… Climb              4        2 11/2… Day        
## 3 AAL   AMERICAN … B-72… <NA>    Approach           4        3 8/13… Day        
## 4 AAL   AMERICAN … MD-82 <NA>    Climb              4        2 10/7… Day        
## 5 AAL   AMERICAN … MD-82 NO DAM… Climb              4        2 9/25… Day        
## 6 GFT   GULFSTREA… BE-99 FLT 71… Landing Roll       2        2 9/20… Day        
## # ℹ 8 more variables: state <chr>, height <dbl>, speed <dbl>, effect <chr>,
## #   sky <chr>, species <chr>, birds_seen <chr>, birds_struck <chr>
str(df)
## spc_tbl_ [19,302 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ opid        : chr [1:19302] "AAL" "USA" "AAL" "AAL" ...
##  $ operator    : chr [1:19302] "AMERICAN AIRLINES" "US AIRWAYS" "AMERICAN AIRLINES" "AMERICAN AIRLINES" ...
##  $ atype       : chr [1:19302] "MD-80" "FK-28-4000" "B-727-200" "MD-82" ...
##  $ remarks     : chr [1:19302] "NO DAMAGE" "2 BIRDS, NO DAMAGE." NA NA ...
##  $ phase_of_flt: chr [1:19302] "Descent" "Climb" "Approach" "Climb" ...
##  $ ac_mass     : num [1:19302] 4 4 4 4 4 2 4 2 4 2 ...
##  $ num_engs    : num [1:19302] 2 2 3 2 2 2 2 1 3 2 ...
##  $ date        : chr [1:19302] "9/30/1990 0:00:00" "11/29/1993 0:00:00" "8/13/1993 0:00:00" "10/7/1993 0:00:00" ...
##  $ time_of_day : chr [1:19302] "Night" "Day" "Day" "Day" ...
##  $ state       : chr [1:19302] "IL" "MD" "TN" "VA" ...
##  $ height      : num [1:19302] 7000 10 400 100 50 0 4000 100 200 3000 ...
##  $ speed       : num [1:19302] 250 140 140 200 170 40 230 50 130 NA ...
##  $ effect      : chr [1:19302] NA "None" "None" "None" ...
##  $ sky         : chr [1:19302] "No Cloud" "No Cloud" "Some Cloud" "Overcast" ...
##  $ species     : chr [1:19302] "UNKNOWN BIRD - MEDIUM" "UNKNOWN BIRD - MEDIUM" "UNKNOWN BIRD - SMALL" "UNKNOWN BIRD - SMALL" ...
##  $ birds_seen  : chr [1:19302] NA "2-10" "2-10" NA ...
##  $ birds_struck: chr [1:19302] "1" "2-10" "1" "1" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   opid = col_character(),
##   ..   operator = col_character(),
##   ..   atype = col_character(),
##   ..   remarks = col_character(),
##   ..   phase_of_flt = col_character(),
##   ..   ac_mass = col_double(),
##   ..   num_engs = col_double(),
##   ..   date = col_character(),
##   ..   time_of_day = col_character(),
##   ..   state = col_character(),
##   ..   height = col_double(),
##   ..   speed = col_double(),
##   ..   effect = col_character(),
##   ..   sky = col_character(),
##   ..   species = col_character(),
##   ..   birds_seen = col_character(),
##   ..   birds_struck = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Since the dataset contains a lot of variables, and not all of them will be used, a subset is created below to allow the process to use only necessary information. In addition, to avoid dropping to many rows, which would have a considerable impact on the data, the missing values are replaced by imputing the median and the mode, respectively for the numerical and categorical variables.

#subseting the data, taking only needed info 

info<- df|> 
  select(opid, phase_of_flt, remarks, height, time_of_day, sky, ac_mass, speed)

#checking NAs
colSums(is.na(info))
##         opid phase_of_flt      remarks       height  time_of_day          sky 
##            0         1783         2786         3193         2077         3579 
##      ac_mass        speed 
##         1284         7008
#calculating the summary statistics 
mod_phase<-mode(info$phase_of_flt)
mod_remarks<- mode(info$remarks)
md_height<- median(info$height, na.rm=TRUE)
mod_time<- mode(info$time_of_day)
mod_sky<- mode(info$sky)
md_mass<- median(info$ac_mass, na.rm=TRUE)

#substituting the missing values by imputing the modes and the median 

info<- info|>
  mutate(height=ifelse(is.na(height), md_height, height),
         remarks=ifelse(is.na(remarks), mod_remarks, remarks),
         phase_of_flt=ifelse(is.na(phase_of_flt), mod_phase, phase_of_flt),
         time_of_day=ifelse(is.na(time_of_day), mod_time, time_of_day),
         ac_mass=ifelse(is.na(ac_mass), md_mass, ac_mass),
         sky=ifelse(is.na(sky),mod_sky, sky)
           
  )

info$sky<-gsub("character", "other", info$sky)

To understand the environmental influence on bird strikes, the chunk below will generate a bar graph showing the distribution of collision based on cloud cover. It will highlight whether strikes are more frequent under cloudy, overcast,.. conditions. It will help assess the role of this environmental condition in collision risk.

table(info$sky)
## 
##   No Cloud      other   Overcast Some Cloud 
##       7113       3579       3356       5254
ggplot(info, aes(x=sky))+
  geom_bar(color="black", fill="red")+
  labs(x="sky conditions", y="Counts", title="Distribution of collision in regard to sky conditions")+
  theme_classic()

Based on the graph returned, more collision were registered under the “No cloud” category with 7,113 counts. It is followed by “some cloud” with 5,254 counts. “Overcast” and “Other” count approximately 3,000 registration each.

table(info$phase_of_flt)
## 
##     Approach    character        Climb      Descent     En Route Landing Roll 
##         6470         1783         3222          599          585         3047 
##       Parked Take-off run         Taxi 
##           11         3506           79
ggplot(info, aes(x=phase_of_flt))+
  geom_bar(color="black", fill="blue")+
  labs(x="Flying phase", y="Counts", title="Distribution of collision in regard to flying phase")+
  theme_minimal()

Above is another bar graph that shows the number collisions counted over each category. As observed, more collisions ccurred when planes where in approaching phase. Also, planes strike birds quite often when making their take-off run.

ggplot(info, aes(x=height, y=speed, color=phase_of_flt))+
  geom_point(alpha=0.5)+
  labs(
    title = "Scatterplot of height vs Speed",
    x = "Height",
    y = "Speed",
    color="Flying phase"
  ) +
  theme_classic()

Above is scatterplot that correlates plane height and speed while taking their flying phase into account. A strong positive correlation is observed.This indicates that as planes gain in altitude, their speed tend to increase.

Hypothesis testing

\(H_0\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\) = \(\mu_4\)

\(H_a\): not all \(\mu_i\) are equal

# Perform ANOVA
anova_t <- aov(height~ sky, data = info)

anova_t
## Call:
##    aov(formula = height ~ sky, data = info)
## 
## Terms:
##                         sky   Residuals
## Sum of Squares   1234478228 52073576204
## Deg. of Freedom           3       19298
## 
## Residual standard error: 1642.678
## Estimated effects may be unbalanced
summary(anova_t)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## sky             3 1.234e+09 411492743   152.5 <2e-16 ***
## Residuals   19298 5.207e+10   2698392                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An anova test was performed to test whether the mean height of collisions differs across sky conditions. The test returned a <2e-16 p-value, which is statistically significant. Therefore, since there is enough evidence to support that the mean height of collisions differ for the sky conditions, we reject the null.

TukeyHSD(anova_t)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = height ~ sky, data = info)
## 
## $sky
##                           diff        lwr       upr     p adj
## other-No Cloud      -631.00567 -717.49889 -544.5124 0.0000000
## Overcast-No Cloud   -546.06731 -634.45160 -457.6830 0.0000000
## Some Cloud-No Cloud -295.09508 -371.87022 -218.3199 0.0000000
## Overcast-other        84.93836  -16.47409  186.3508 0.1369842
## Some Cloud-other     335.91059  244.43860  427.3826 0.0000000
## Some Cloud-Overcast  250.97223  157.71007  344.2344 0.0000000

To further understand the difference in means, and which exact groups differ, a TukeyHSD test was also conducted. All categories show a significant difference, except overcast and other.

conclusion

The analysis was centered on understanding how environmental n flying technical patterns influence bird striking. The analysis revealed that sky conditions do not significantly influence the occurrence of birds strikes. Instead, flying phases closer to the ground show higher frequency compared to other phases. Therefore, when planes are at lower altitude, they are more likely to encounter bird activity.

An anova test was conducted to examine whether the mean altitude of bird strikes differ across sky conditions. The test returned a significant p-value, which demonstrates that at least one group mean differs.

While environmental conditions may be the primary influencer, the operational context of flight phase plays a critical role in bird strike risk.

Future analysis can go deeper by examining the implication of airlines, including also the operational practices.