This project examines a dataset of aircraft incidents, specifically wildlife strikes, collected by the Federal Aviation Administration (FAA). Each record includes details such as the month and year of the incident, the time of day, the airport, latitude and longitude, the phase of flight, and the weather conditions, along with numerical variables such as height, speed, and distance. The dataset appears to come from incident reports submitted by airports, airlines, and flight operations. However, there is no ReadMe file, so the exact method of data collection isn’t clear. Most likely, the data comes from reports of incidents at various U.S. airports, with each row representing a specific event or group of incidents.
Aircraft incidents are an interesting and important topic because studying them helps us understand aviation safety, recognize patterns and trends, and improve safety measures. I chose this dataset because, as someone who loves to travel and is naturally curious, I have always been interested in aviation safety. Learning how different factors affect the occurrence of incidents is not only fascinating but also has real-world value, as it can lead to better safety practices and more efficient operations in the airline industry.
Questions for my analysis:
Which species are most often involved in aircraft incidents?
What is the relationship between speed and height?
Which months see the highest number of incidents?
Do aircraft incidents occur more frequently in specific weather conditions?
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/leikarayjoseph/Desktop/Data 110")
# Set the working directory so the data file can be read from it.
Aircraft <- read_csv("aircraft_wildlife_strikes_faa.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 288810 Columns: 100
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (37): INCIDENT_DATE, TIME_OF_DAY, AIRPORT_ID, AIRPORT, RUNWAY, STATE, F...
## dbl (19): INDEX_NR, INCIDENT_MONTH, INCIDENT_YEAR, LATITUDE, LONGITUDE, AMO...
## num (4): COST_REPAIRS, COST_OTHER, COST_REPAIRS_INFL_ADJ, COST_OTHER_INFL_ADJ
## lgl (39): INGESTED_OTHER, INDICATED_DAMAGE, STR_RAD, DAM_RAD, STR_WINDSHLD,...
## time (1): TIME
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
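The parsing warning in the output above points to `problems()`. As a hedged illustration of how that check works, the sketch below reads a small inline CSV with declared column types and then lists the values readr could not parse. The CSV text and the column names `id` and `cost` are made up for the example; they are not taken from the FAA file.

```r
library(readr)

# Toy CSV text standing in for the real file; "n/a" cannot be parsed as a number.
toy_csv <- I("id,cost\n1,100\n2,n/a\n3,250\n")

# Declaring the column types up front makes parsing failures explicit
# and also silences the column-specification message.
toy <- suppressWarnings(
  read_csv(toy_csv, col_types = cols(id = col_integer(), cost = col_double()))
)

# problems() returns one row per value that could not be parsed.
parse_issues <- problems(toy)
print(parse_issues)

# Unparseable values are read in as NA.
print(toy$cost)
```

Running `problems(Aircraft)` on the real data frame would show, in the same way, which rows and columns triggered the warning.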
# Change the format of my headers
names(Aircraft) <- tolower(names(Aircraft))
names(Aircraft) <- gsub(" ","",names(Aircraft))
head(Aircraft)
## # A tibble: 6 × 100
## index_nr incident_date incident_month incident_year time time_of_day
## <dbl> <chr> <dbl> <dbl> <time> <chr>
## 1 608242 6/22/1996 6 1996 NA <NA>
## 2 608243 6/26/1996 6 1996 NA <NA>
## 3 608244 7/1/1996 7 1996 NA <NA>
## 4 608245 7/1/1996 7 1996 NA <NA>
## 5 608246 7/1/1996 7 1996 NA <NA>
## 6 608247 5/6/1991 5 1991 NA Day
## # ℹ 94 more variables: airport_id <chr>, airport <chr>, latitude <dbl>,
## # longitude <dbl>, runway <chr>, state <chr>, faaregion <chr>,
## # location <chr>, enroute_state <chr>, opid <chr>, operator <chr>, reg <chr>,
## # flt <chr>, aircraft <chr>, ama <chr>, amo <dbl>, ema <dbl>, emo <dbl>,
## # ac_class <chr>, ac_mass <dbl>, type_eng <chr>, num_engs <dbl>,
## # eng_1_pos <dbl>, eng_2_pos <dbl>, eng_3_pos <dbl>, eng_4_pos <dbl>,
## # phase_of_flight <chr>, height <dbl>, speed <dbl>, distance <dbl>, …
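The header clean-up above (lower-casing, then removing spaces) can be tried on a few names in isolation. This is a minimal sketch; the raw names below are hypothetical examples in the style of the FAA columns, not an exact copy of them.

```r
# Hypothetical raw headers, similar in style to the FAA column names.
raw_names <- c("INCIDENT_DATE", "Time Of Day", "PHASE_OF_FLIGHT")

# Lower-case first, then strip spaces (fixed = TRUE treats " " literally).
clean_names <- gsub(" ", "", tolower(raw_names), fixed = TRUE)
print(clean_names)
```

Underscores are left in place, which is why the cleaned columns in the output above read `incident_date`, `phase_of_flight`, and so on.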
# Count incidents by "phase_of_flight" to see which phase has the most incidents
Count1 <- Aircraft |>
  group_by(phase_of_flight) |>
  filter(!is.na(phase_of_flight)) |> # Remove rows where the phase is missing.
  count(name = "total") |> # Number of incidents in each phase of flight.
  arrange(total)
Count1
## # A tibble: 11 × 2
## # Groups: phase_of_flight [11]
## phase_of_flight total
## <chr> <int>
## 1 Parked 115
## 2 Taxi 667
## 3 Arrival 753
## 4 Local 1150
## 5 Descent 2331
## 6 Departure 2933
## 7 En Route 5308
## 8 Climb 26957
## 9 Take-off Run 30248
## 10 Landing Roll 32157
## 11 Approach 75224
ggplot(Count1, aes(x = phase_of_flight, y = total, fill = phase_of_flight)) +
  geom_col() + # bar heights come directly from the "total" column
  labs(x = "Phase of Flight",
       y = "Total Incidents",
       title = "Phase of Flight When the Incident Happened",
       caption = "Source: FAA (Federal Aviation Administration)") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
This plot shows that “Approach” is the phase of flight in which most incidents happened (75,224), followed by “Landing Roll” (32,157). “Parked” is the phase with the fewest incidents (115).
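The counting step behind this plot can also be written more compactly, and the bars can be drawn in order of size rather than alphabetically. The sketch below is one possible way to do that, using made-up rows rather than the FAA data; turning the phase into an ordered factor is my suggestion, not something the original code does.

```r
library(dplyr)

# Made-up phases, only to illustrate the counting step.
toy <- tibble(phase_of_flight = c("Approach", "Climb", "Approach", NA,
                                  "Landing Roll", "Approach", "Climb"))

phase_totals <- toy |>
  filter(!is.na(phase_of_flight)) |>          # drop missing phases
  count(phase_of_flight, name = "total") |>   # group and count in one step
  arrange(total)                              # smallest to largest

# Turning the phase into a factor ordered by total makes ggplot draw the
# bars in that order instead of alphabetically.
phase_totals <- phase_totals |>
  mutate(phase_of_flight = factor(phase_of_flight, levels = phase_of_flight))

print(phase_totals)
```

Because `count()` is given the column directly, the result is not grouped, so no `ungroup()` is needed afterwards.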
# Scatter plot of height vs. speed with a smoothed trend line
p1 <- ggplot(Aircraft, aes(x = height, y = speed)) +
  labs(title = "Relationship between Height and Speed",
       caption = "Source: FAA (Federal Aviation Administration)",
       x = "Height",
       y = "Speed") +
  theme_minimal() +
  geom_point(color = "lightblue") + # add the points
  geom_smooth() # add a smoothed trend line (default method)
#geom_smooth(method = 'lm', formula= y~x, se = FALSE, linetype= "dotdash", color= "red", size = 0.3)
p1
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 195957 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 195957 rows containing missing values or values outside the scale range
## (`geom_point()`).
This plot shows a weak relationship between height and speed, but I also see some outliers with very high speeds that may influence the result.
Same plot without the outliers
# Same scatter plot with both axes limited, plus a linear trend line
p2 <- ggplot(Aircraft, aes(x = height, y = speed)) +
  labs(title = "Relationship between Height and Speed",
       caption = "Source: FAA (Federal Aviation Administration)",
       x = "Height",
       y = "Speed") +
  theme_minimal() +
  xlim(0, 3000) + # set the limits of the x axis
  ylim(0, 400) + # set the limits of the y axis
  geom_point(color = "lightblue") +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) # add the regression line
p2
## Warning: Removed 205707 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 205707 rows containing missing values or values outside the scale range
## (`geom_point()`).
Although the points are still dense in one part of the plot, this version is a much better representation of the relationship between the two variables. The trend line in this plot suggests a slightly positive relationship between height and speed.
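One detail worth keeping in mind: in ggplot2, `xlim()` and `ylim()` set scale limits, so rows outside the limits are dropped before the trend line is fitted (this is why the number of removed rows in the warning rises from 195957 to 205707). The sketch below mirrors that effect with base R on made-up numbers, comparing a fit on all rows with a fit on only the rows inside the limits.

```r
# Made-up data: five ordinary points plus one extreme point.
toy <- data.frame(
  height = c(100, 500, 1000, 2000, 2500, 20000),
  speed  = c(120, 130, 140, 160, 170, 1200)
)

# Fit on all rows, as geom_smooth(method = "lm") would without axis limits.
fit_all <- lm(speed ~ height, data = toy)

# Fit only on rows inside the limits used for p2 (0-3000 and 0-400),
# which mirrors what xlim()/ylim() do before the line is fitted.
inside <- subset(toy, height >= 0 & height <= 3000 & speed >= 0 & speed <= 400)
fit_inside <- lm(speed ~ height, data = inside)

print(coef(fit_all))
print(coef(fit_inside))
```

If the goal were only to zoom in without changing the fitted line, `coord_cartesian(xlim = ..., ylim = ...)` would be the alternative, since it keeps all rows in the calculation.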
cor(Aircraft$height, Aircraft$speed, use = "complete.obs")
## [1] 0.6960162
# use = "complete.obs" handles the missing values: by default cor() returns NA when any observation is missing, and this option computes the correlation using only the rows where both values are present.
Correlation does not imply causation, but the correlation coefficient between height and speed (0.696) indicates a moderately strong positive relationship.
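To make the effect of `use = "complete.obs"` concrete, here is a small sketch on made-up vectors (not FAA values). It shows the default result, the complete-observations result, and the same calculation done by hand.

```r
# Made-up values with missing entries in both columns.
height <- c(0, 100, 500, NA, 1500, 3000)
speed  <- c(90, 110, NA, 140, 150, 200)

# Default behaviour: any NA makes the result NA.
r_default <- cor(height, speed)

# "complete.obs" keeps only the rows where both values are present.
r_complete <- cor(height, speed, use = "complete.obs")

# The same thing done by hand with complete.cases().
keep <- complete.cases(height, speed)
r_manual <- cor(height[keep], speed[keep])

print(c(default = r_default, complete = r_complete, manual = r_manual))
```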
# Find the statistical information for my model
Eq <- lm(height ~ speed, data= Aircraft)
summary(Eq)
##
## Call:
## lm(formula = height ~ speed, data = Aircraft)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33794 -832 -322 490 24233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3115.4174 15.0250 -207.3 <2e-16 ***
## speed 29.6077 0.1002 295.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1425 on 92851 degrees of freedom
## (195957 observations deleted due to missingness)
## Multiple R-squared: 0.4844, Adjusted R-squared: 0.4844
## F-statistic: 8.725e+04 on 1 and 92851 DF, p-value: < 2.2e-16
Equation: height = 29.6077(speed) - 3115.42 (the model was fit as height ~ speed, so height is the response variable)
p-value: < 2.2e-16. The p-value is close to zero, which is strong evidence against the null hypothesis of no linear relationship, so the model is statistically significant.
Adjusted R^2: 0.4844
This value indicates that about 48.4% of the variability in height is explained by speed.
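Reading the coefficients from the `summary()` output, the fitted line is height = -3115.4174 + 29.6077 × speed. The sketch below uses those two numbers to compute a prediction and checks that, for a one-predictor model, R-squared is the square of the correlation reported earlier. The speed of 150 and the small `toy` data frame are arbitrary illustrations, not values from the analysis.

```r
# Coefficients copied from the summary() output of lm(height ~ speed).
intercept <- -3115.4174
slope     <- 29.6077

# Predicted height for a given speed, using height = intercept + slope * speed.
predict_height <- function(speed) intercept + slope * speed
print(predict_height(150))   # 150 is an arbitrary illustrative speed

# In a one-predictor regression, R-squared equals the squared correlation.
r <- 0.6960162               # value reported by cor() earlier
print(round(r^2, 4))

# The same identity checked on made-up data.
toy <- data.frame(speed  = c(100, 120, 140, 160, 180),
                  height = c(200, 900, 1100, 2100, 2400))
fit <- lm(height ~ speed, data = toy)
print(all.equal(summary(fit)$r.squared, cor(toy$speed, toy$height)^2))
```

Note that with a negative intercept the line predicts negative heights for speeds below roughly 105, so the equation is best read as a summary of the overall trend rather than a physical rule.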
Aircraft_select <- Aircraft |>
select("incident_month", "incident_year", "time_of_day", "airport", "latitude", "longitude", "phase_of_flight", "height", "speed", "distance", "sky", "precipitation", "species") |> # Select variables that I'm going to work with.
mutate(incident_month= month.name[incident_month]) # Change the month from the number of the month to the name of the month.
Aircraft_select
## # A tibble: 288,810 × 13
## incident_month incident_year time_of_day airport latitude longitude
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 June 1996 <NA> SACRAMENTO INTL 38.7 -122.
## 2 June 1996 <NA> DENVER INTL AIRP… 39.9 -105.
## 3 July 1996 <NA> EPPLEY AIRFIELD 41.3 -95.9
## 4 July 1996 <NA> WASHINGTON DULLE… 38.9 -77.5
## 5 July 1996 <NA> LA GUARDIA ARPT 40.8 -73.9
## 6 May 1991 Day SAN ANTONIO INTL 29.5 -98.5
## 7 November 1993 Dawn KANSAS CITY INTL 39.3 -94.7
## 8 July 1995 <NA> KANSAS CITY INTL 39.3 -94.7
## 9 September 1990 Day DALLAS/FORT WORT… 32.9 -97.0
## 10 May 1992 Day NORMAN Y. MINETA… 37.4 -122.
## # ℹ 288,800 more rows
## # ℹ 7 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
## # distance <dbl>, sky <chr>, precipitation <chr>, species <chr>
# Use of group_by, count, ungroup, arrange
Species_count <- Aircraft_select |>
group_by(species) |>
count(name = "total") |> # Count the number of rows for each species
ungroup() |> # Remove grouping to avoid affecting further operations
arrange(desc(total)) |> # Arrange species by descending total
slice(1:10) # Select the top 10 species
Species_count
## # A tibble: 10 × 2
## species total
## <chr> <int>
## 1 Unknown bird - small 48901
## 2 Unknown bird - medium 38259
## 3 Unknown bird 24839
## 4 Mourning dove 14578
## 5 Barn swallow 9679
## 6 Killdeer 9592
## 7 American kestrel 8879
## 8 Horned lark 8032
## 9 Gulls 7414
## 10 European starling 6148
All of the top ten species categories in this table are birds, so birds are the primary contributors to strike incidents. The three largest categories are birds that could not be identified to species.
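Since the largest categories are unidentified birds, one optional follow-up is to rank only the identified species. The sketch below shows how that could look on made-up rows; dropping the "Unknown" categories is my assumption about a useful variation, not a step the original analysis takes.

```r
library(dplyr)

# Made-up strike records, only to illustrate the steps.
toy <- tibble(species = c("Unknown bird - small", "Mourning dove", "Killdeer",
                          "Unknown bird", "Mourning dove", "Barn swallow",
                          "Mourning dove", "Killdeer"))

top_identified <- toy |>
  filter(!startsWith(species, "Unknown")) |>      # drop unidentified categories
  count(species, name = "total", sort = TRUE) |>  # count and sort descending
  slice_head(n = 2)                               # keep the two largest groups

print(top_identified)
```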
Most_commun_month <- Aircraft_select |>
group_by(incident_month) |>
count(name = "total") |> # Count the number of incidents for each month
ungroup() |> # Remove grouping
arrange(desc(total)) # Arrange months by descending total
Most_commun_month
## # A tibble: 12 × 2
## incident_month total
## <chr> <int>
## 1 August 41194
## 2 July 37688
## 3 September 37524
## 4 October 35182
## 5 May 28826
## 6 June 24483
## 7 April 20767
## 8 November 19304
## 9 March 14104
## 10 December 11086
## 11 January 9499
## 12 February 9153
The month with the most incidents is August (41,194), followed by July and September. January and February have the fewest.
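Because `incident_month` is stored as text, it sorts alphabetically unless told otherwise. A minimal sketch of putting the months in calendar order, using the totals copied from the table above, and of measuring how concentrated the incidents are in the peak months:

```r
# Totals copied from the table above, in the order shown there.
month_totals <- data.frame(
  incident_month = c("August", "July", "September", "October", "May", "June",
                     "April", "November", "March", "December", "January",
                     "February"),
  total = c(41194, 37688, 37524, 35182, 28826, 24483,
            20767, 19304, 14104, 11086, 9499, 9153)
)

# month.name is built into R (January ... December); using it as the factor
# levels gives calendar order instead of alphabetical order.
month_totals$incident_month <- factor(month_totals$incident_month,
                                      levels = month.name)
by_calendar <- month_totals[order(month_totals$incident_month), ]
print(by_calendar)

# Share of all incidents that fall in July through October.
peak <- by_calendar$incident_month %in% c("July", "August", "September", "October")
peak_share <- sum(by_calendar$total[peak]) / sum(by_calendar$total)
print(round(peak_share, 3))
```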
Weather101 <- Aircraft_select |>
group_by(precipitation) |>
filter(!is.na(precipitation)) |>
count(name = "total") |> # Count the number of incidents
arrange(desc(total)) # Arrange by descending total
Weather101
## # A tibble: 12 × 2
## # Groups: precipitation [12]
## precipitation total
## <chr> <int>
## 1 None 122687
## 2 Rain 7772
## 3 Fog 2501
## 4 Snow 490
## 5 Fog, Rain 314
## 6 None, Snow 28
## 7 Rain, Snow 27
## 8 Fog, Snow 16
## 9 None, Rain 13
## 10 Fog, None 7
## 11 Fog, Rain, Snow 6
## 12 None, Rain, Snow 1
Weather110 <- Aircraft_select |>
group_by(sky) |>
filter(!is.na(sky)) |>
count(name = "total") |> # Count the number of incidents
#ungroup() |>
arrange(desc(total)) # Arrange by descending total
Weather110
## # A tibble: 3 × 2
## # Groups: sky [3]
## sky total
## <chr> <int>
## 1 No Cloud 67403
## 2 Some Cloud 48160
## 3 Overcast 23819
Most of the incidents happened when there was no precipitation, and “No Cloud” was the most common sky condition.
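Turning the counts in the two tables above into percentages makes the comparison easier to read. The sketch below does this with the counts copied from those tables. These shares describe the conditions recorded at reported incidents; without knowing how often each condition occurs overall, they do not by themselves show that incidents are more likely in clear weather.

```r
# Counts copied from the precipitation table above (non-missing rows only).
precip <- c("None" = 122687, "Rain" = 7772, "Fog" = 2501, "Snow" = 490,
            "Fog, Rain" = 314, "None, Snow" = 28, "Rain, Snow" = 27,
            "Fog, Snow" = 16, "None, Rain" = 13, "Fog, None" = 7,
            "Fog, Rain, Snow" = 6, "None, Rain, Snow" = 1)

# Counts copied from the sky table above.
sky <- c("No Cloud" = 67403, "Some Cloud" = 48160, "Overcast" = 23819)

# Convert counts to percentages of the reported (non-missing) incidents.
precip_pct <- round(100 * precip / sum(precip), 1)
sky_pct    <- round(100 * sky / sum(sky), 1)

print(head(precip_pct, 3))
print(sky_pct)
```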
# Keep only the rows whose species is one of the top 5 species.
Data1 <- Aircraft_select |>
  filter(species %in% c("Unknown bird - small", "Unknown bird - medium", "Unknown bird", "Mourning dove", "Barn swallow"))
Data1
## # A tibble: 136,256 × 13
## incident_month incident_year time_of_day airport latitude longitude
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 June 1996 <NA> SACRAMENTO INTL 38.7 -122.
## 2 June 1996 <NA> DENVER INTL AIRP… 39.9 -105.
## 3 July 1996 <NA> EPPLEY AIRFIELD 41.3 -95.9
## 4 July 1996 <NA> WASHINGTON DULLE… 38.9 -77.5
## 5 July 1996 <NA> LA GUARDIA ARPT 40.8 -73.9
## 6 May 1991 Day SAN ANTONIO INTL 29.5 -98.5
## 7 November 1993 Dawn KANSAS CITY INTL 39.3 -94.7
## 8 July 1995 <NA> KANSAS CITY INTL 39.3 -94.7
## 9 September 1990 Day DALLAS/FORT WORT… 32.9 -97.0
## 10 September 1990 Day AUGUSTA REGIONAL… 33.4 -82.0
## # ℹ 136,246 more rows
## # ℹ 7 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
## # distance <dbl>, sky <chr>, precipitation <chr>, species <chr>
#names(Data1)
library(highcharter) # Load highcharter
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
# Add a "total" column holding the number of incidents for each species.
Data1 <- Data1 |>
group_by(species) |>
mutate(total = n()) |> # Count the number of rows for each species
ungroup() # remove the group created by group_by.
Data1
## # A tibble: 136,256 × 14
## incident_month incident_year time_of_day airport latitude longitude
## <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 June 1996 <NA> SACRAMENTO INTL 38.7 -122.
## 2 June 1996 <NA> DENVER INTL AIRP… 39.9 -105.
## 3 July 1996 <NA> EPPLEY AIRFIELD 41.3 -95.9
## 4 July 1996 <NA> WASHINGTON DULLE… 38.9 -77.5
## 5 July 1996 <NA> LA GUARDIA ARPT 40.8 -73.9
## 6 May 1991 Day SAN ANTONIO INTL 29.5 -98.5
## 7 November 1993 Dawn KANSAS CITY INTL 39.3 -94.7
## 8 July 1995 <NA> KANSAS CITY INTL 39.3 -94.7
## 9 September 1990 Day DALLAS/FORT WORT… 32.9 -97.0
## 10 September 1990 Day AUGUSTA REGIONAL… 33.4 -82.0
## # ℹ 136,246 more rows
## # ℹ 8 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
## # distance <dbl>, sky <chr>, precipitation <chr>, species <chr>, total <int>
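The group, count, and ungroup pattern used to add the `total` column has a one-call equivalent in dplyr, `add_count()`. The sketch below compares the two on a made-up three-row table; it is offered as an alternative spelling, not as a correction.

```r
library(dplyr)

# Made-up rows, only to compare the two approaches.
toy <- tibble(species = c("Mourning dove", "Barn swallow", "Mourning dove"),
              height  = c(50, 200, 0))

# Approach used above: group, count within each group, then ungroup.
with_mutate <- toy |>
  group_by(species) |>
  mutate(total = n()) |>
  ungroup()

# add_count() does the same in one call and never leaves the data grouped.
with_add_count <- toy |>
  add_count(species, name = "total")

print(with_add_count)
```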
# Count the number of incidents per year and time of day.
Year_count <- Data1 |>
  group_by(incident_year, time_of_day) |>
  filter(!is.na(incident_year) & !is.na(time_of_day)) |> # Remove rows where either column is NA.
  count(name = "total") |> # Count the number of incidents
  arrange(desc(total))
Year_count
## # A tibble: 136 × 3
## # Groups: incident_year, time_of_day [136]
## incident_year time_of_day total
## <dbl> <chr> <int>
## 1 2022 Day 2908
## 2 2018 Day 2830
## 3 2017 Day 2767
## 4 2021 Day 2754
## 5 2014 Day 2729
## 6 2016 Day 2471
## 7 2009 Day 2416
## 8 2019 Day 2211
## 9 2015 Day 2080
## 10 2010 Day 2001
## # ℹ 126 more rows
library(ggalluvial) # Load the ggalluvial package
p3 <- ggplot(data = Year_count, aes(x = incident_year,
                                    y = total,
                                    alluvium = time_of_day,
                                    fill = time_of_day, label = time_of_day)) +
  geom_alluvium() +
  geom_flow() +
  scale_fill_brewer(palette = "Spectral") +
  #geom_stratum(alpha = 0.5) +
  labs(x = "Incident Year",
       y = "Total",
       title = "Yearly Trends in Aviation Incidents Across Different Times of Day",
       caption = "Source: FAA (Federal Aviation Administration)") +
  theme_minimal()
p3
This plot displays the total number of incidents per year, split by time of day. The number of incidents grew between 1990 and the late 2010s, with Day and Night having the highest counts and fewer incidents at Dawn and Dusk. Night-time incidents increased gradually from the late 1990s and then stayed roughly constant through the 2010s. Overall, the majority of incidents occurred during the day, and counts rose noticeably until a sharp decline around 2020.
The COVID-19 pandemic, when travel was restricted, may be the cause of the decline in incidents. Lockdowns and travel restrictions reduced the number of flights, which probably led to fewer incidents, especially during the day when air traffic is usually higher.
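A decline like this is easier to pin down with a year-over-year percentage change than by eye. The sketch below shows the calculation on made-up yearly totals; the numbers are invented for illustration and are not the FAA counts.

```r
# Made-up yearly totals, only to illustrate the calculation.
yearly <- data.frame(
  incident_year = 2017:2022,
  total         = c(100, 110, 120, 70, 115, 125)
)

# Percentage change from the previous year; the first year has no
# predecessor, so it is NA.
yearly$pct_change <- c(NA, round(100 * diff(yearly$total) /
                                  head(yearly$total, -1), 1))
print(yearly)

# Year with the largest drop relative to the year before.
largest_drop <- yearly$incident_year[which.min(yearly$pct_change)]
print(largest_drop)
```

Applied to the real data, the same steps would start from the incident counts summed over `time_of_day` for each `incident_year`.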
# Interactive scatter plot of height vs. speed for the top 5 species
highchart() |>
hc_add_series(
data = Data1,
type = "scatter",
hcaes(size = total,
x = height,
y = speed,
group = species)
) |>
hc_tooltip(
useHTML = TRUE, # Enable HTML in tooltip
headerFormat = "",
pointFormat = "
<strong>Species: {point.species}</strong><br>
<b>Incident Month:</b> {point.incident_month}<br>
<b>Incident Year:</b> {point.incident_year}<br>
<b>Time of Day:</b> {point.time_of_day}<br>
<b>Airport:</b> {point.airport}<br>
<b>Latitude:</b> {point.latitude}°<br>
<b>Longitude:</b> {point.longitude}°<br>
<b>Phase of Flight:</b> {point.phase_of_flight}<br>
<b>Height:</b> {point.height} ft<br>
<b>Speed:</b> {point.speed} knots<br>
<b>Sky Condition:</b> {point.sky}<br>
<b>Precipitation:</b> {point.precipitation}<br>
<b>Total Incidents:</b> {point.total}"
) |>
hc_title(text = "Aircraft Incident Data") |>
hc_caption(text = "Source: FAA (Federal Aviation Administration)") |>
hc_xAxis(title = list(text = "Height (feet above ground level)")) |>
hc_yAxis(title = list(text = "Speed (knots)"),
min= -500,
max= 1000) |>
hc_legend(title = list(text = "Species")) |>
hc_add_theme(hc_theme_bloom())
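Data1 has 136,256 rows, many of them with missing height or speed, and an interactive scatter plot with that many points may be slow to render in a browser. One possible workaround, sketched below on made-up data, is to drop rows that cannot be drawn and plot a random sample per species. The sample size of 200 and the seed are arbitrary choices, and the `toy` table is only a stand-in for Data1.

```r
library(dplyr)

set.seed(110)  # arbitrary seed so the sample is reproducible

# Made-up stand-in for Data1: 3 species with 1,000 rows each.
toy <- tibble(
  species = rep(c("A", "B", "C"), each = 1000),
  height  = runif(3000, 0, 3000),
  speed   = runif(3000, 0, 400)
)

# Keep at most 200 rows per species, and drop rows that cannot be drawn.
plot_sample <- toy |>
  filter(!is.na(height), !is.na(speed)) |>
  group_by(species) |>
  slice_sample(n = 200) |>
  ungroup()

print(count(plot_sample, species))
```

The sampled table could then be passed as `data` to `hc_add_series()` in place of the full Data1.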