Final Project

Data 110 Final Project

Introduction

This project examines a dataset of aircraft incidents collected by the Federal Aviation Administration (FAA). As the file name (aircraft_wildlife_strikes_faa.csv) indicates, the incidents are wildlife strikes, which is why each record includes the species involved. The dataset also records when each incident occurred (month, year, and time of day), the airport and its latitude and longitude, the phase of flight, and the weather conditions, along with numerical variables such as height, speed, and distance. The data seem to come from incident reports submitted by airports, airlines, and flight operations. However, there is no ReadMe file, so the exact method of data collection isn’t clear. Most likely, the data come from reports filed at various U.S. airports, with each row representing a specific event.

The topic of aircraft incidents is both interesting and important because it helps us understand aviation safety, recognize patterns and trends in incidents, and improve safety measures. I chose this dataset because I have always been interested in aviation safety. Learning how different factors affect the occurrence of incidents is not only fascinating but also has real-world value, as it can lead to better safety practices and more efficient operations in the airline industry. As someone who loves to travel and is naturally curious, I find myself interested in understanding the factors that impact aviation safety.

Questions for my analysis:

Which species are most often involved in aircraft incidents?

What is the relationship between speed and height?

Which months see the highest number of incidents?

Do aircraft incidents occur more frequently in specific weather conditions?

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("/Users/leikarayjoseph/Desktop/Data 110")
# Set the working directory so the data file can be read from it.
Aircraft <- read_csv("aircraft_wildlife_strikes_faa.csv")

## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

## Rows: 288810 Columns: 100
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (37): INCIDENT_DATE, TIME_OF_DAY, AIRPORT_ID, AIRPORT, RUNWAY, STATE, F...
## dbl  (19): INDEX_NR, INCIDENT_MONTH, INCIDENT_YEAR, LATITUDE, LONGITUDE, AMO...
## num   (4): COST_REPAIRS, COST_OTHER, COST_REPAIRS_INFL_ADJ, COST_OTHER_INFL_ADJ
## lgl  (39): INGESTED_OTHER, INDICATED_DAMAGE, STR_RAD, DAM_RAD, STR_WINDSHLD,...
## time  (1): TIME
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Convert the column names to lowercase and remove any spaces
names(Aircraft) <- tolower(names(Aircraft))
names(Aircraft) <- gsub(" ","",names(Aircraft))
head(Aircraft)

## # A tibble: 6 × 100
##   index_nr incident_date incident_month incident_year time   time_of_day
##      <dbl> <chr>                  <dbl>         <dbl> <time> <chr>      
## 1   608242 6/22/1996                  6          1996    NA  <NA>       
## 2   608243 6/26/1996                  6          1996    NA  <NA>       
## 3   608244 7/1/1996                   7          1996    NA  <NA>       
## 4   608245 7/1/1996                   7          1996    NA  <NA>       
## 5   608246 7/1/1996                   7          1996    NA  <NA>       
## 6   608247 5/6/1991                   5          1991    NA  Day        
## # ℹ 94 more variables: airport_id <chr>, airport <chr>, latitude <dbl>,
## #   longitude <dbl>, runway <chr>, state <chr>, faaregion <chr>,
## #   location <chr>, enroute_state <chr>, opid <chr>, operator <chr>, reg <chr>,
## #   flt <chr>, aircraft <chr>, ama <chr>, amo <dbl>, ema <dbl>, emo <dbl>,
## #   ac_class <chr>, ac_mass <dbl>, type_eng <chr>, num_engs <dbl>,
## #   eng_1_pos <dbl>, eng_2_pos <dbl>, eng_3_pos <dbl>, eng_4_pos <dbl>,
## #   phase_of_flight <chr>, height <dbl>, speed <dbl>, distance <dbl>, …

# Count incidents by phase of flight to see which phase has the most.
Count1 <- Aircraft |>
  group_by(phase_of_flight) |>
  filter(!is.na(phase_of_flight)) |> # Remove rows with a missing phase.
  count(name = "total") |> # Number of incidents in each phase of flight.
  arrange(total)

Count1

## # A tibble: 11 × 2
## # Groups:   phase_of_flight [11]
##    phase_of_flight total
##    <chr>           <int>
##  1 Parked            115
##  2 Taxi              667
##  3 Arrival           753
##  4 Local            1150
##  5 Descent          2331
##  6 Departure        2933
##  7 En Route         5308
##  8 Climb           26957
##  9 Take-off Run    30248
## 10 Landing Roll    32157
## 11 Approach        75224

ggplot(Count1, aes(x = phase_of_flight, y = total, fill = phase_of_flight)) +
  geom_col() + # Bar heights come directly from the total column.
  labs(x = "Phase of Flight",
       y = "Total Incidents",
       title = "Phase of Flight When the Incident Happened",
       caption = "Source: FAA (Federal Aviation Administration)") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

From this plot we observe that “Approach” is the phase of flight in which most incidents happened and “Landing Roll” is the second highest, while “Parked” is the phase in which the fewest incidents happened.
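
Because ggplot2 draws character categories in alphabetical order by default, the ranking described above is not easy to read off the bars. The following is a minimal sketch, assuming a hand-typed three-row subset of the Count1 table (the names toy_counts and p_sorted are hypothetical), of how reorder() could sort the bars by total:

```r
library(ggplot2)

# Hypothetical three-row subset of the Count1 table shown earlier.
toy_counts <- data.frame(
  phase_of_flight = c("Parked", "Approach", "Landing Roll"),
  total = c(115, 75224, 32157)
)

# reorder() returns a factor whose levels are sorted by total,
# so the bars are drawn from fewest to most incidents.
toy_counts$phase_of_flight <- reorder(toy_counts$phase_of_flight, toy_counts$total)

p_sorted <- ggplot(toy_counts, aes(x = phase_of_flight, y = total)) +
  geom_col() +
  labs(x = "Phase of Flight", y = "Total Incidents")

levels(toy_counts$phase_of_flight)
```

Printing p_sorted would draw the sorted bars; the same reorder() call could be applied to the full Count1 table.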

Relation between Height and Speed

# Scatter plot with a smoothed trend line
p1 <- ggplot(Aircraft, aes(x = height, y = speed)) +
  labs(title = "Relationship between Height and Speed",
       caption = "Source: FAA (Federal Aviation Administration)",
       x = "Height",
       y = "Speed") +
  theme_minimal() +
  geom_point(color = "lightblue") + # Add the points.
  geom_smooth() # Add a smoothed trend line using the default method.
#geom_smooth(method = 'lm', formula= y~x, se = FALSE, linetype= "dotdash", color= "red", size = 0.3)
p1

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

## Warning: Removed 195957 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 195957 rows containing missing values or values outside the scale range
## (`geom_point()`).

This plot shows a weak relationship between height and speed, but I also see some outliers with very high speeds that may influence my result.

Same plot without the outliers

# Same scatter plot, restricted to a smaller range, with a linear trend line
p2 <- ggplot(Aircraft, aes(x = height, y = speed)) +
  labs(title = "Relationship between Height and Speed",
       caption = "Source: FAA (Federal Aviation Administration)",
       x = "Height",
       y = "Speed") +
  theme_minimal() +
  xlim(0, 3000) + # Limit the x axis; rows outside the range are dropped.
  ylim(0, 400) + # Limit the y axis; rows outside the range are dropped.
  geom_point(color = "lightblue") +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) # Add the regression line.

p2

## Warning: Removed 205707 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 205707 rows containing missing values or values outside the scale range
## (`geom_point()`).

Although the points are still dense in one part of the plot, this one is a much better representation of the relationship between the two variables. The trend line in this plot suggests a slightly positive relationship between height and speed.
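
One detail worth keeping in mind is that xlim() and ylim() drop out-of-range rows before geom_smooth() fits its line, so the trend line in this plot is fitted only to the points inside the window (coord_cartesian() would zoom without dropping data). The sketch below uses made-up toy values, not the FAA data, to show how much a single extreme point can change a fitted slope:

```r
# Toy data (hypothetical values, not taken from the FAA file).
toy <- data.frame(
  height = c(0, 500, 1000, 2000, 3000, 20000),
  speed  = c(100, 120, 140, 180, 220, 2500)
)

# Fit on all rows: the single extreme point pulls the slope upward.
fit_all <- lm(speed ~ height, data = toy)

# Fit only on rows inside the plotting window, which mirrors what
# geom_smooth() sees after xlim(0, 3000) and ylim(0, 400).
in_window <- subset(toy, height >= 0 & height <= 3000 & speed >= 0 & speed <= 400)
fit_window <- lm(speed ~ height, data = in_window)

slope_all <- unname(coef(fit_all)["height"])
slope_window <- unname(coef(fit_window)["height"])
```

Comparing slope_all with slope_window shows that the two fits describe different subsets of the data, which is why the line in this plot need not agree with a model fitted to every row.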

cor(Aircraft$height, Aircraft$speed, use = "complete.obs")

## [1] 0.6960162

# I used use = "complete.obs" to handle missing values, because by default cor() returns NA when either variable has missing observations. With this option the correlation is calculated using only the rows that have both values.

While correlation does not imply causation, the correlation coefficient between height and speed (0.696) indicates a fairly strong positive linear relationship.
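
To illustrate how missing values affect cor(), here is a small sketch on a made-up data frame (the toy values are assumptions chosen for the example, not FAA records):

```r
# Hypothetical toy data with one missing value in each column.
toy <- data.frame(
  height = c(0, 100, NA, 300, 400),
  speed  = c(50, 60, 70, NA, 90)
)

# Default behaviour: any NA in either column makes the result NA.
r_default <- cor(toy$height, toy$speed)

# use = "complete.obs" keeps only the rows where both values are
# present (rows 1, 2 and 5 here) before computing the correlation.
r_complete <- cor(toy$height, toy$speed, use = "complete.obs")

# Number of rows that actually enter the calculation.
n_used <- sum(complete.cases(toy))
```

The same idea explains the warnings in the plots above: rows with a missing height or speed cannot contribute to either the plot or the correlation.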

Linear Regression Equation

# Fit the linear model and print its statistical summary
Eq <- lm(height ~ speed, data= Aircraft)
summary(Eq)

## 
## Call:
## lm(formula = height ~ speed, data = Aircraft)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33794   -832   -322    490  24233 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3115.4174    15.0250  -207.3   <2e-16 ***
## speed          29.6077     0.1002   295.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1425 on 92851 degrees of freedom
##   (195957 observations deleted due to missingness)
## Multiple R-squared:  0.4844, Adjusted R-squared:  0.4844 
## F-statistic: 8.725e+04 on 1 and 92851 DF,  p-value: < 2.2e-16

Linear Equation

Equation: height = 29.6077(speed) - 3115.42

p-value: < 2.2e-16. The p-value is close to zero, which is strong evidence against the null hypothesis of no linear relationship, so the model is statistically significant.

Adjusted R^2: 0.4844

This value indicates that about 48.4% of the variability in height is explained by speed.
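
In the call lm(height ~ speed), the variable to the left of ~ is the response, so the fitted coefficients describe height as a function of speed. With a single predictor, R-squared is also the square of the correlation coefficient, which is consistent with the output above: 0.6960162^2 ≈ 0.4844. A minimal sketch on made-up toy data (hypothetical values, not the Aircraft tibble):

```r
# Hypothetical toy data; the real analysis uses the Aircraft tibble.
toy <- data.frame(
  speed  = c(100, 120, 140, 160, 180),
  height = c(200, 900, 1300, 2100, 2500)
)

# The variable on the left of ~ is the response, so this models height.
fit <- lm(height ~ speed, data = toy)
b <- coef(fit)  # named vector with "(Intercept)" and "speed"

# Predicted height at a speed of 150: intercept + slope * speed.
pred_manual <- unname(b["(Intercept)"] + b["speed"] * 150)
pred_lm <- unname(predict(fit, newdata = data.frame(speed = 150)))

# With one predictor, R-squared equals the squared correlation.
r2 <- summary(fit)$r.squared
r  <- cor(toy$height, toy$speed)
```

Here pred_manual and pred_lm agree, and r2 matches r^2, which is a quick way to check that an equation has been written in the right direction.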

Aircraft_select <- Aircraft |>
  # Select the variables that I'm going to work with.
  select(incident_month, incident_year, time_of_day, airport, latitude,
         longitude, phase_of_flight, height, speed, distance, sky,
         precipitation, species) |>
  # Replace the month number with the month name.
  mutate(incident_month = month.name[incident_month])

Aircraft_select

## # A tibble: 288,810 × 13
##    incident_month incident_year time_of_day airport           latitude longitude
##    <chr>                  <dbl> <chr>       <chr>                <dbl>     <dbl>
##  1 June                    1996 <NA>        SACRAMENTO INTL       38.7    -122. 
##  2 June                    1996 <NA>        DENVER INTL AIRP…     39.9    -105. 
##  3 July                    1996 <NA>        EPPLEY AIRFIELD       41.3     -95.9
##  4 July                    1996 <NA>        WASHINGTON DULLE…     38.9     -77.5
##  5 July                    1996 <NA>        LA GUARDIA ARPT       40.8     -73.9
##  6 May                     1991 Day         SAN ANTONIO INTL      29.5     -98.5
##  7 November                1993 Dawn        KANSAS CITY INTL      39.3     -94.7
##  8 July                    1995 <NA>        KANSAS CITY INTL      39.3     -94.7
##  9 September               1990 Day         DALLAS/FORT WORT…     32.9     -97.0
## 10 May                     1992 Day         NORMAN Y. MINETA…     37.4    -122. 
## # ℹ 288,800 more rows
## # ℹ 7 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
## #   distance <dbl>, sky <chr>, precipitation <chr>, species <chr>

# Use of group_by, count, ungroup, arrange
Species_count <- Aircraft_select |>
  group_by(species) |>
  count(name = "total") |>  # Count the number of rows for each species
  ungroup() |>  # Remove grouping to avoid affecting further operations
  arrange(desc(total)) |>  # Arrange species by descending total
  slice(1:10) # Select the top 10 species

 Species_count

## # A tibble: 10 × 2
##    species               total
##    <chr>                 <int>
##  1 Unknown bird - small  48901
##  2 Unknown bird - medium 38259
##  3 Unknown bird          24839
##  4 Mourning dove         14578
##  5 Barn swallow           9679
##  6 Killdeer               9592
##  7 American kestrel       8879
##  8 Horned lark            8032
##  9 Gulls                  7414
## 10 European starling      6148

Based on this table of the top ten species involved in strike incidents, birds are the primary contributors to these occurrences.
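
Since the three largest categories are unidentified birds, it can also be useful to rank only the identified species. The sketch below assumes a tiny made-up tibble (toy_strikes is a hypothetical name) and uses dplyr's count() with sort = TRUE together with slice_max(), which is one possible alternative to the group_by(), count(), arrange(), and slice() chain:

```r
library(dplyr)

# Hypothetical toy records; the real analysis uses Aircraft_select.
toy_strikes <- tibble(
  species = c("Unknown bird", "Unknown bird", "Unknown bird",
              "Mourning dove", "Mourning dove", "Killdeer")
)

top_identified <- toy_strikes |>
  filter(!grepl("^Unknown", species)) |>          # Drop unidentified records.
  count(species, name = "total", sort = TRUE) |>  # Count and sort in one step.
  slice_max(total, n = 1)                         # Keep the most frequent species.

top_identified
```

Because count() returns an ungrouped result when the input is ungrouped, no separate ungroup() step is needed in this version.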

The months in which most of the incidents happened

Most_common_month <- Aircraft_select |>
  group_by(incident_month) |>
  count(name = "total") |>  # Count the number of incidents for each month
  ungroup() |>  # Remove grouping
  arrange(desc(total))  # Arrange months by descending total
Most_common_month

## # A tibble: 12 × 2
##    incident_month total
##    <chr>          <int>
##  1 August         41194
##  2 July           37688
##  3 September      37524
##  4 October        35182
##  5 May            28826
##  6 June           24483
##  7 April          20767
##  8 November       19304
##  9 March          14104
## 10 December       11086
## 11 January         9499
## 12 February        9153

The month in which most of the incidents happened is August, followed by July and September.
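
Because incident_month is stored as text, sorting or plotting it directly would put the months in alphabetical order. A minimal sketch, assuming three rows copied from the table above (toy_months is a hypothetical name), of how a factor with month.name as its levels could restore calendar order:

```r
# Three rows copied from the month table above, for illustration only.
toy_months <- data.frame(
  incident_month = c("August", "February", "May"),
  total = c(41194, 9153, 28826)
)

# month.name is a built-in vector of month names in calendar order;
# using it as the factor levels makes sorting and plotting chronological.
toy_months$incident_month <- factor(toy_months$incident_month, levels = month.name)

by_calendar <- toy_months[order(toy_months$incident_month), ]
as.character(by_calendar$incident_month)
```

With the months in calendar order, the seasonal pattern (low counts in winter, a peak in late summer) would be easier to see in a bar or line chart.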

Weather101 <- Aircraft_select |>
  group_by(precipitation) |>
  filter(!is.na(precipitation)) |>
  count(name = "total") |>  # Count the number of incidents
  arrange(desc(total))  # Arrange by descending total
Weather101

## # A tibble: 12 × 2
## # Groups:   precipitation [12]
##    precipitation     total
##    <chr>             <int>
##  1 None             122687
##  2 Rain               7772
##  3 Fog                2501
##  4 Snow                490
##  5 Fog, Rain           314
##  6 None, Snow           28
##  7 Rain, Snow           27
##  8 Fog, Snow            16
##  9 None, Rain           13
## 10 Fog, None             7
## 11 Fog, Rain, Snow       6
## 12 None, Rain, Snow      1

Weather110 <- Aircraft_select |>
  group_by(sky) |> 
  filter(!is.na(sky)) |>
  count(name = "total") |>  # Count the number of incidents
  #ungroup() |> 
  arrange(desc(total))  # Arrange by descending total
Weather110

## # A tibble: 3 × 2
## # Groups:   sky [3]
##   sky        total
##   <chr>      <int>
## 1 No Cloud   67403
## 2 Some Cloud 48160
## 3 Overcast   23819

Most of the incidents happened when there was no precipitation and no cloud cover.
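
The counts can also be expressed as shares of all incidents with a recorded sky condition, which makes the comparison easier to state. This sketch re-enters the three totals from the sky table above by hand (sky_counts is a hypothetical name):

```r
# Totals copied from the sky table above.
sky_counts <- data.frame(
  sky = c("No Cloud", "Some Cloud", "Overcast"),
  total = c(67403, 48160, 23819)
)

# Share of incidents under each sky condition (shares sum to 1).
sky_counts$share <- sky_counts$total / sum(sky_counts$total)

# Condition with the largest share of incidents.
most_common_sky <- sky_counts$sky[which.max(sky_counts$share)]

sky_counts
```

These shares describe how the reported incidents are distributed; they do not account for how often each sky condition occurs, so they should not be read as a measure of risk.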

# Keep only the rows for the top 5 species.
Data1 <- Aircraft_select |>
  filter(species %in% c("Unknown bird - small", "Unknown bird - medium", "Unknown bird", "Mourning dove", "Barn swallow"))
Data1

## # A tibble: 136,256 × 13
##    incident_month incident_year time_of_day airport           latitude longitude
##    <chr>                  <dbl> <chr>       <chr>                <dbl>     <dbl>
##  1 June                    1996 <NA>        SACRAMENTO INTL       38.7    -122. 
##  2 June                    1996 <NA>        DENVER INTL AIRP…     39.9    -105. 
##  3 July                    1996 <NA>        EPPLEY AIRFIELD       41.3     -95.9
##  4 July                    1996 <NA>        WASHINGTON DULLE…     38.9     -77.5
##  5 July                    1996 <NA>        LA GUARDIA ARPT       40.8     -73.9
##  6 May                     1991 Day         SAN ANTONIO INTL      29.5     -98.5
##  7 November                1993 Dawn        KANSAS CITY INTL      39.3     -94.7
##  8 July                    1995 <NA>        KANSAS CITY INTL      39.3     -94.7
##  9 September               1990 Day         DALLAS/FORT WORT…     32.9     -97.0
## 10 September               1990 Day         AUGUSTA REGIONAL…     33.4     -82.0
## # ℹ 136,246 more rows
## # ℹ 7 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
## #   distance <dbl>, sky <chr>, precipitation <chr>, species <chr>

#names(Data1)

library(highcharter) # Load highcharter

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Highcharts (www.highcharts.com) is a Highsoft software product which is

## not free for commercial and Governmental use

# Add a "total" column holding the number of incidents for each species.
Data1 <- Data1 |>
  group_by(species) |>
  mutate(total = n()) |>  # Count the number of rows for each species
  ungroup() # remove the group created by group_by.
Data1

## # A tibble: 136,256 × 14
##    incident_month incident_year time_of_day airport           latitude longitude
##    <chr>                  <dbl> <chr>       <chr>                <dbl>     <dbl>
##  1 June                    1996 <NA>        SACRAMENTO INTL       38.7    -122. 
##  2 June                    1996 <NA>        DENVER INTL AIRP…     39.9    -105. 
##  3 July                    1996 <NA>        EPPLEY AIRFIELD       41.3     -95.9
##  4 July                    1996 <NA>        WASHINGTON DULLE…     38.9     -77.5
##  5 July                    1996 <NA>        LA GUARDIA ARPT       40.8     -73.9
##  6 May                     1991 Day         SAN ANTONIO INTL      29.5     -98.5
##  7 November                1993 Dawn        KANSAS CITY INTL      39.3     -94.7
##  8 July                    1995 <NA>        KANSAS CITY INTL      39.3     -94.7
##  9 September               1990 Day         DALLAS/FORT WORT…     32.9     -97.0
## 10 September               1990 Day         AUGUSTA REGIONAL…     33.4     -82.0
## # ℹ 136,246 more rows
## # ℹ 8 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
## #   distance <dbl>, sky <chr>, precipitation <chr>, species <chr>, total <int>

# Make a table that counts the number of incidents per year and time of day.
Year_count <- Data1 |>
  group_by(incident_year, time_of_day) |>
  filter(!is.na(incident_year) & !is.na(time_of_day)) |> # Remove rows with an NA in either column.
  count(name = "total") |>  # Count the number of incidents
  arrange(desc(total)) 
Year_count

## # A tibble: 136 × 3
## # Groups:   incident_year, time_of_day [136]
##    incident_year time_of_day total
##            <dbl> <chr>       <int>
##  1          2022 Day          2908
##  2          2018 Day          2830
##  3          2017 Day          2767
##  4          2021 Day          2754
##  5          2014 Day          2729
##  6          2016 Day          2471
##  7          2009 Day          2416
##  8          2019 Day          2211
##  9          2015 Day          2080
## 10          2010 Day          2001
## # ℹ 126 more rows

library(ggalluvial) # Load ggalluvial for the alluvial plot

Visualizing Incident Patterns with an Alluvial Plot

p3 <- ggplot(data = Year_count, aes(x = incident_year,
           y = total,
           alluvium= time_of_day,
           fill = time_of_day, label = time_of_day)) +
  geom_alluvium() +
  geom_flow() +
  scale_fill_brewer(palette= "Spectral") +
  #geom_stratum(alpha = 0.5) +
  labs(x = "Incident Year",
       y = "Total",
       title = "Yearly Trends in Aviation Incidents Across Different Times of Day",
       caption = "Source: FAA (Federal Aviation Administration)") +
  theme_minimal() 

p3

This plot displays the total number of incidents per year, split by time of day. The number of incidents grew between 1990 and the late 2010s, with Day and Night having the highest counts. There were fewer incidents at dawn and dusk. Around 2020, there was a significant decline in the overall number of incidents. The number of nighttime incidents increased gradually starting in the late 1990s and then stayed roughly constant until the 2010s. In general, the majority of incidents occurred during the day, and until the most recent decline, there was a noticeable increase in incidents.

The COVID-19 pandemic, when travel was restricted, may be the cause of the decline in incidents. Lockdowns and travel restrictions reduced the number of flights, which probably reduced the number of incidents, especially during the day when activity is usually higher.

Scatter plot using Highcharter

#visualization

highchart() |>
  hc_add_series(
    data = Data1,
    type = "scatter", 
    hcaes(size = total, 
          x = height, 
          y = speed, 
          group = species)
  ) |>
  hc_tooltip(
    useHTML = TRUE,  # Enable HTML in tooltip
    headerFormat = "",  
    pointFormat = "
      <strong>Species: {point.species}</strong><br>
      <b>Incident Month:</b> {point.incident_month}<br>
      <b>Incident Year:</b> {point.incident_year}<br>
      <b>Time of Day:</b> {point.time_of_day}<br>
      <b>Airport:</b> {point.airport}<br>
      <b>Latitude:</b> {point.latitude}°<br>
      <b>Longitude:</b> {point.longitude}°<br>
      <b>Phase of Flight:</b> {point.phase_of_flight}<br>
      <b>Height:</b> {point.height} ft<br>
      <b>Speed:</b> {point.speed} knots<br>
      <b>Sky Condition:</b> {point.sky}<br>
      <b>Precipitation:</b> {point.precipitation}<br>
      <b>Total Incidents:</b> {point.total}"
  ) |>
  hc_title(text = "Aircraft Incident Data") |>
  hc_caption(text = "Source: FAA (Federal Aviation Administration)") |>
  hc_xAxis(title = list(text = "Height (feet)")) |>
  hc_yAxis(title = list(text = "Speed (knots)"),
           min= -500,
           max= 1000) |>
  hc_legend(title = list(text = "Species")) |>
  hc_add_theme(hc_theme_bloom())

This plot visualizes the relationship between height and speed for the five most frequently reported species, with the points grouped and colored by species. The size aesthetic is mapped to the total number of incidents recorded for each species, so it reflects frequency rather than severity. Although there is some crowding of data points, hovering over a point shows the details of that incident, including the phase of flight. As the earlier count showed, “Approach” is the phase with the highest number of incidents.

Conclusion:

Working on this project has deepened my understanding of aircraft incidents and the various factors that contribute to them. Through analyzing the dataset, I uncovered several key insights about aviation safety. One important finding is that the “approach” phase of flight had the highest number of incidents, highlighting the critical nature of this phase in flight operations. The “landing roll” phase was the second most common, suggesting that incidents are also frequent during the final stages of landing. In contrast, the “parked” phase had the fewest incidents, which makes sense since the aircraft is stationary and not facing the same risks as when it’s in motion.

These findings about the “approach” phase align with broader research, such as the study “Effects of the Federal Aviation Administration’s Compliance Program on Aircraft Incidents and Accidents” (2022). The study shows that incidents are more common during critical phases like approach and landing, as these stages involve complex maneuvers that require high coordination, which can increase the risk of accidents, especially under certain conditions.

Additionally, the analysis revealed that July and August saw the highest number of aircraft incidents. This may be due to factors like increased air traffic during the summer months or weather-related issues that make incidents more likely during these times.

One surprising thing I found was that most incidents happened when there was no rain and the sky was clear. This was unexpected, because we might think that bad weather, like rain or clouds, would cause more incidents. One caveat is that these are raw counts: clear weather is presumably also the most common flying condition, so the counts alone do not show that clear skies are riskier. Still, the result is in line with the article “A causal factors analysis of aircraft incidents due to radar limitations: The Norway case study” (2015), which highlighted that operational issues and equipment limitations, rather than weather, were often the main contributors to incidents in certain cases. It suggests that factors like radar limitations, miscommunication, or other technological challenges might play a more significant role than we often think. I also wanted to map the latitude and longitude to see if certain places had more incidents or if certain types of incidents happened more in specific locations. Unfortunately, I wasn’t able to do that in this project.

All things considered, I have learned a lot about aircraft incidents and the potential contributing elements from working on this project. Even though I couldn’t investigate all I wanted to, the information I did discover has improved my understanding of aircraft safety. I’m excited to learn more about how we might enhance flight safety in the future and to keep researching this subject.

Articles used

Calabrese, Curtis G., et al. “Effects of the Federal Aviation Administration’s Compliance Program on Aircraft Incidents and Accidents.” Transportation Research. Part A, Policy and Practice, vol. 163, 2022, pp. 304–19, https://doi.org/10.1016/j.tra.2022.07.016.

Syd Ali, Busyairah, et al. “A Causal Factors Analysis of Aircraft Incidents Due to Radar Limitations: The Norway Case Study.” Journal of Air Transport Management, vol. 44–45, 2015, pp. 103–09, https://doi.org/10.1016/j.jairtraman.2015.03.004.