American Institute of Aeronautics and Astronautics
** INTRODUCTION **
This project examines a dataset containing information about aircraft incidents, collected by the Federal Aviation Administration (FAA). The dataset includes various details about these incidents, such as the month and year they occurred, the time of day, the airport location, latitude, longitude, phase of flight, and weather conditions. It also includes numerical data like the altitude, speed, distance and more. This dataset seems to come from incident reports submitted by airports, airlines, and flight operations. However, there is no ReadMe file, so the exact method of data collection isn’t clear. Most likely, the data comes from reports of incidents or accidents at various U.S. airports, with each row representing a specific event or group of incidents.
The topic of aircraft incidents is both interesting and important because it helps us understand aviation safety, recognize patterns and trends in incidents, and improve safety measures. I chose this dataset because I have always been interested in aviation safety. Learning how different factors affect the occurrence of incidents is not only fascinating but also has real-world value, as it can lead to better safety practices and more efficient operations in the airline industry.
Questions for my analysis:
Which species are most often involved in aircraft incidents?
What is the relationship between speed and height?
Which months see the highest number of Incidents?
Do aircraft incidents occur more frequently in specific weather conditions?
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/leikarayjoseph/Desktop/Data 110") # set my working directory so I can load my file
Aircraft <- read_csv("aircraft_wildlife_strikes_faa.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 288810 Columns: 100
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (37): INCIDENT_DATE, TIME_OF_DAY, AIRPORT_ID, AIRPORT, RUNWAY, STATE, F...
dbl (19): INDEX_NR, INCIDENT_MONTH, INCIDENT_YEAR, LATITUDE, LONGITUDE, AMO...
num (4): COST_REPAIRS, COST_OTHER, COST_REPAIRS_INFL_ADJ, COST_OTHER_INFL_ADJ
lgl (39): INGESTED_OTHER, INDICATED_DAMAGE, STR_RAD, DAM_RAD, STR_WINDSHLD,...
time (1): TIME
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
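The parsing warning above suggests calling `problems()` on the data frame. As a minimal sketch of what that looks like, the example below reads a tiny made-up CSV (the inline text and the column names `a` and `b` are assumptions for illustration, not part of the FAA file) and lists the values that could not be parsed.

```r
library(readr)

# Hypothetical inline CSV: the second data row has a non-numeric value in `b`.
toy_csv <- I("a,b\n1,2\n3,oops\n")

# Declare both columns as numeric so the bad value is recorded as a parsing problem.
toy <- suppressWarnings(
  read_csv(toy_csv, col_types = cols(a = col_double(), b = col_double()))
)

# problems() returns one row per value that could not be parsed.
issues <- problems(toy)
issues
```

Running `problems(Aircraft)` in the same way would show which rows and columns of the FAA file triggered the warning.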
Change the format of my headers
# Change the format of my headers
names(Aircraft) <- tolower(names(Aircraft))
names(Aircraft) <- gsub(" ", "", names(Aircraft))
head(Aircraft)
# A tibble: 6 × 100
index_nr incident_date incident_month incident_year time time_of_day
<dbl> <chr> <dbl> <dbl> <time> <chr>
1 608242 6/22/1996 6 1996 NA <NA>
2 608243 6/26/1996 6 1996 NA <NA>
3 608244 7/1/1996 7 1996 NA <NA>
4 608245 7/1/1996 7 1996 NA <NA>
5 608246 7/1/1996 7 1996 NA <NA>
6 608247 5/6/1991 5 1991 NA Day
# ℹ 94 more variables: airport_id <chr>, airport <chr>, latitude <dbl>,
# longitude <dbl>, runway <chr>, state <chr>, faaregion <chr>,
# location <chr>, enroute_state <chr>, opid <chr>, operator <chr>, reg <chr>,
# flt <chr>, aircraft <chr>, ama <chr>, amo <dbl>, ema <dbl>, emo <dbl>,
# ac_class <chr>, ac_mass <dbl>, type_eng <chr>, num_engs <dbl>,
# eng_1_pos <dbl>, eng_2_pos <dbl>, eng_3_pos <dbl>, eng_4_pos <dbl>,
# phase_of_flight <chr>, height <dbl>, speed <dbl>, distance <dbl>, …
# Count the variable "phase_of_flight" to see which phase has the highest number of incidents
Count1 <- Aircraft |>
  group_by(phase_of_flight) |>
  filter(!is.na(phase_of_flight)) |>
  count(name = "total") |> # The count for each phase of flight when the incident happened
  arrange(total)
Count1
# A tibble: 11 × 2
# Groups: phase_of_flight [11]
phase_of_flight total
<chr> <int>
1 Parked 115
2 Taxi 667
3 Arrival 753
4 Local 1150
5 Descent 2331
6 Departure 2933
7 En Route 5308
8 Climb 26957
9 Take-off Run 30248
10 Landing Roll 32157
11 Approach 75224
ggplot(Count1, aes(x = phase_of_flight, y = total, fill = phase_of_flight)) +
  geom_bar(stat = "identity", position = "dodge", na.rm = TRUE) +
  labs(x = "Phase of Flight",
       y = "Total of Incidents",
       title = "Phase of the Flight when the Incident Happened",
       caption = "Source: FAA (Federal Aviation Administration)") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
From this plot we observe that “Approach” is the phase of flight in which most incidents happened and “Landing Roll” is the second highest, while “Parked” is the phase in which the fewest incidents happened.
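The raw counts can also be turned into shares, which makes it easier to say how dominant one phase is. This is a minimal sketch on made-up rows (the `toy` table is an assumption for illustration, not the FAA data); `count()` with `sort = TRUE` groups, counts, and sorts in one step.

```r
library(dplyr)

# Made-up phases for illustration only (not the FAA data).
toy <- tibble(
  phase_of_flight = c("Approach", "Approach", "Climb", NA, "Parked", "Approach")
)

phase_share <- toy |>
  filter(!is.na(phase_of_flight)) |>
  count(phase_of_flight, name = "total", sort = TRUE) |> # group, count and sort in one step
  mutate(share = total / sum(total))                     # fraction of all non-missing reports
phase_share
```

Applied to `Aircraft`, the same pipeline would give the share of reports for each of the eleven phases.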
Relation between Height and Speed
# Scatter plot with a smoothed trend line (the linear fit is commented out below)
p1 <- ggplot(Aircraft, aes(x = height, y = speed)) +
  labs(title = "Relationship between Height and Speed",
       caption = "Source: FAA (Federal Aviation Administration)",
       x = "Height",
       y = "Speed") +
  theme_minimal() +
  geom_point(color = "lightblue") + # add the points
  geom_smooth()
# geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, linetype = "dotdash", color = "red", size = 0.3)
p1
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 195957 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 195957 rows containing missing values or values outside the scale range
(`geom_point()`).
This plot shows a weak relationship between height and speed, but there are also some outliers with very high speeds that may influence the result.
Same plot without the outliers
# Correlation between height and speed, with the axes limited to exclude the outliers
p2 <- ggplot(Aircraft, aes(x = height, y = speed)) +
  labs(title = "Relationship between Height and Speed",
       caption = "Source: FAA (Federal Aviation Administration)",
       x = "Height",
       y = "Speed") +
  theme_minimal() +
  xlim(0, 3000) +
  ylim(0, 400) +
  geom_point(color = "lightblue") +
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE)
p2
Warning: Removed 205707 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 205707 rows containing missing values or values outside the scale range
(`geom_point()`).
Although the points are still dense in one part of the plot, this version is a much better representation of the relationship between the two variables. The trend line suggests a slightly positive relationship between height and speed.
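In ggplot2, `xlim()` and `ylim()` drop rows outside the limits before the trend line is fitted, which matches the removed-row count in the warnings rising from 195957 to 205707. If the goal is only to zoom in, `coord_cartesian()` keeps every row. The sketch below uses made-up numbers (the `toy` data frame is an assumption, not the FAA data) to show how much one extreme point can change a fitted slope when it is filtered out explicitly.

```r
# Made-up values for illustration; the last row is an extreme point.
toy <- data.frame(
  height = c(0, 500, 1000, 1500, 2000, 30000),
  speed  = c(120, 130, 140, 150, 160, 2500)
)

# Keep only rows inside the same window used in the plot (0-3000 height, 0-400 speed).
trimmed <- subset(toy, height >= 0 & height <= 3000 & speed >= 0 & speed <= 400)

# Compare the fitted slope with and without the extreme point.
slope_all     <- coef(lm(speed ~ height, data = toy))[["height"]]
slope_trimmed <- coef(lm(speed ~ height, data = trimmed))[["height"]]
c(all = slope_all, trimmed = slope_trimmed)
```

Filtering explicitly like this makes it clear which rows the fitted line is based on.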
cor(Aircraft$height, Aircraft$speed, use ="complete.obs")
[1] 0.6960162
# I used use = "complete.obs" to handle missing values, because by default cor() returns NA when the data contain missing observations. With this option the correlation is calculated using only rows where both values are present.
Although correlation does not imply causation, the correlation coefficient between height and speed (0.696) indicates a moderately strong positive linear relationship.
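The effect of `use = "complete.obs"` can be checked on a small made-up example (the two vectors below are assumptions for illustration, not the FAA data). The default call returns `NA`, and the `complete.obs` result matches what you get after dropping the incomplete rows by hand.

```r
# Made-up vectors with missing values (not the FAA data).
height <- c(100, 200, NA, 400, 500)
speed  <- c(110, 125, 140, NA, 170)

r_default <- cor(height, speed)                       # NA, because some values are missing
r <- cor(height, speed, use = "complete.obs")         # uses only rows 1, 2 and 5

# Same result after dropping incomplete rows by hand.
ok <- complete.cases(height, speed)
r_manual <- cor(height[ok], speed[ok])
c(default = r_default, complete_obs = r, manual = r_manual)
```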
Linear Regression Equation
# Find the statistical information for my model
Eq <- lm(height ~ speed, data = Aircraft)
summary(Eq)
Call:
lm(formula = height ~ speed, data = Aircraft)
Residuals:
Min 1Q Median 3Q Max
-33794 -832 -322 490 24233
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3115.4174 15.0250 -207.3 <2e-16 ***
speed 29.6077 0.1002 295.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1425 on 92851 degrees of freedom
(195957 observations deleted due to missingness)
Multiple R-squared: 0.4844, Adjusted R-squared: 0.4844
F-statistic: 8.725e+04 on 1 and 92851 DF, p-value: < 2.2e-16
Linear Equation
Equation: height = 29.6077(speed) - 3115.42
p-value: < 2.2e-16. The p-value is close to zero, which is strong evidence against the null hypothesis of no linear relationship, so the model is statistically significant.
Adjusted R^2: 0.4844
This value indicates that about 48.4% of the variability in height is explained by speed.
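Because the model was fitted as `lm(height ~ speed)`, the coefficients predict height from speed. The sketch below plugs the printed coefficients into a small helper (`predict_height` is a hypothetical name introduced here) and checks that, for a one-predictor model, R-squared is the square of the correlation reported earlier.

```r
# Coefficients copied from the summary() output above.
intercept <- -3115.4174
slope     <- 29.6077

# Hypothetical helper: predicted height for a given speed.
predict_height <- function(speed) intercept + slope * speed

predict_height(150)              # predicted height at a speed of 150

# Speed at which the fitted line crosses height 0; below it the prediction is negative.
zero_speed <- -intercept / slope
zero_speed

# For a one-predictor model, R-squared equals the squared correlation.
r <- 0.6960162
r^2
```

The negative predictions at low speeds are one sign that a straight line is only a rough description of these data.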
# A tibble: 288,810 × 13
incident_month incident_year time_of_day airport latitude longitude
<chr> <dbl> <chr> <chr> <dbl> <dbl>
1 June 1996 <NA> SACRAMENTO INTL 38.7 -122.
2 June 1996 <NA> DENVER INTL AIRP… 39.9 -105.
3 July 1996 <NA> EPPLEY AIRFIELD 41.3 -95.9
4 July 1996 <NA> WASHINGTON DULLE… 38.9 -77.5
5 July 1996 <NA> LA GUARDIA ARPT 40.8 -73.9
6 May 1991 Day SAN ANTONIO INTL 29.5 -98.5
7 November 1993 Dawn KANSAS CITY INTL 39.3 -94.7
8 July 1995 <NA> KANSAS CITY INTL 39.3 -94.7
9 September 1990 Day DALLAS/FORT WORT… 32.9 -97.0
10 May 1992 Day NORMAN Y. MINETA… 37.4 -122.
# ℹ 288,800 more rows
# ℹ 7 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
# distance <dbl>, sky <chr>, precipitation <chr>, species <chr>
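The code that created `Aircraft_select` does not appear in this report. The printed table shows 13 columns with `incident_month` stored as month names, so one plausible way to build a table like it is sketched below. This is an assumption, not the original code, and it is shown on made-up rows with only a few of the columns.

```r
library(dplyr)

# Made-up rows using a few of the cleaned FAA column names.
toy <- tibble(
  index_nr       = c(1, 2, 3),
  incident_month = c(6, 7, 11),
  incident_year  = c(1996, 1996, 1993),
  species        = c("Gulls", "Killdeer", "Barn swallow")
)

toy_select <- toy |>
  select(incident_month, incident_year, species) |>    # keep only the columns of interest
  mutate(incident_month = month.name[incident_month])  # 6 -> "June", 7 -> "July", ...
toy_select
```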
# Use of group_by, count, ungroup, arrange
Species_count <- Aircraft_select |>
  group_by(species) |>
  count(name = "total") |> # Count the number of rows for each species
  ungroup() |>
  arrange(desc(total)) |> # Arrange species by descending total
  slice(1:10) # Select the top 10 species
Species_count
# A tibble: 10 × 2
species total
<chr> <int>
1 Unknown bird - small 48901
2 Unknown bird - medium 38259
3 Unknown bird 24839
4 Mourning dove 14578
5 Barn swallow 9679
6 Killdeer 9592
7 American kestrel 8879
8 Horned lark 8032
9 Gulls 7414
10 European starling 6148
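Based on this table of the top ten species involved in strike incidents, birds are the primary contributors to these occurrences: all ten entries are birds, and the three largest categories are birds that could not be identified to species.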
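A top-N table like this one can also be written more compactly with `count()` and `slice_max()`. This is a minimal sketch on made-up rows (the `toy` table is an assumption for illustration), not a replacement for the code above.

```r
library(dplyr)

# Made-up species values for illustration only.
toy <- tibble(
  species = c("Gulls", "Killdeer", "Gulls", "Horned lark", "Gulls", "Killdeer")
)

top2 <- toy |>
  count(species, name = "total") |>           # one row per species with its count
  slice_max(total, n = 2, with_ties = FALSE)  # keep the 2 largest counts, sorted descending
top2
```

With `n = 10` on `Aircraft_select`, this should give the same ten species as the table above.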
The months in which most of the incidents happened
Most_commun_month <- Aircraft_select |>
  group_by(incident_month) |>
  count(name = "total") |> # Count the number of incidents for each month
  ungroup() |>
  arrange(desc(total)) # Arrange months by descending total
Most_commun_month
# A tibble: 12 × 2
incident_month total
<chr> <int>
1 August 41194
2 July 37688
3 September 37524
4 October 35182
5 May 28826
6 June 24483
7 April 20767
8 November 19304
9 March 14104
10 December 11086
11 January 9499
12 February 9153
The month in which the most incidents happened is “August”.
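To see the seasonal pattern, it can help to put the months in calendar order rather than sorting by count. A minimal sketch, using four of the counts from the table above and base R's built-in `month.name` vector for the ordering:

```r
library(dplyr)

# Counts copied from the table above for four of the months.
monthly <- tibble(
  incident_month = c("August", "July", "February", "January"),
  total          = c(41194, 37688, 9153, 9499)
)

by_calendar <- monthly |>
  mutate(incident_month = factor(incident_month, levels = month.name)) |> # calendar order
  arrange(incident_month)
by_calendar
```

The same `factor()` step would also make a bar chart of the months plot from January to December.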
Weather101 <- Aircraft_select |>
  group_by(precipitation) |>
  filter(!is.na(precipitation)) |>
  count(name = "total") |> # Count the number of incidents
  ungroup() |>
  arrange(desc(total)) # Arrange by descending total
Weather101
Weather110 <- Aircraft_select |>
  group_by(sky) |>
  filter(!is.na(sky)) |>
  count(name = "total") |> # Count the number of incidents
  ungroup() |>
  arrange(desc(total)) # Arrange by descending total
Weather110
# A tibble: 3 × 2
sky total
<chr> <int>
1 No Cloud 67403
2 Some Cloud 48160
3 Overcast 23819
Incidents were most often reported when there was no precipitation and no cloud.
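The sky counts above can be converted to percentages to show how large the "No Cloud" share is. This sketch uses only the three counts printed in the table. These are shares of reports, not incident rates: without knowing how often each sky condition occurs, they do not show that strikes are more likely in clear weather.

```r
# Counts copied from the sky table above.
sky_total <- c("No Cloud" = 67403, "Some Cloud" = 48160, "Overcast" = 23819)

# Share of reports with a recorded sky condition, in percent.
sky_share <- round(100 * sky_total / sum(sky_total), 1)
sky_share
```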
# Filter to keep only the top 5 species in the species column
Data1 <- Aircraft_select |>
  filter(species %in% c("Unknown bird - small", "Unknown bird - medium", "Unknown bird", "Mourning dove", "Barn swallow"))
Data1
# A tibble: 136,256 × 13
incident_month incident_year time_of_day airport latitude longitude
<chr> <dbl> <chr> <chr> <dbl> <dbl>
1 June 1996 <NA> SACRAMENTO INTL 38.7 -122.
2 June 1996 <NA> DENVER INTL AIRP… 39.9 -105.
3 July 1996 <NA> EPPLEY AIRFIELD 41.3 -95.9
4 July 1996 <NA> WASHINGTON DULLE… 38.9 -77.5
5 July 1996 <NA> LA GUARDIA ARPT 40.8 -73.9
6 May 1991 Day SAN ANTONIO INTL 29.5 -98.5
7 November 1993 Dawn KANSAS CITY INTL 39.3 -94.7
8 July 1995 <NA> KANSAS CITY INTL 39.3 -94.7
9 September 1990 Day DALLAS/FORT WORT… 32.9 -97.0
10 September 1990 Day AUGUSTA REGIONAL… 33.4 -82.0
# ℹ 136,246 more rows
# ℹ 7 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
# distance <dbl>, sky <chr>, precipitation <chr>, species <chr>
#names(Data1)
library(highcharter) # Load highcharter
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
Data1 <- Data1 |>
  group_by(species) |>
  mutate(total = n()) |> # Count the number of rows for each species
  ungroup()
Data1
# A tibble: 136,256 × 14
incident_month incident_year time_of_day airport latitude longitude
<chr> <dbl> <chr> <chr> <dbl> <dbl>
1 June 1996 <NA> SACRAMENTO INTL 38.7 -122.
2 June 1996 <NA> DENVER INTL AIRP… 39.9 -105.
3 July 1996 <NA> EPPLEY AIRFIELD 41.3 -95.9
4 July 1996 <NA> WASHINGTON DULLE… 38.9 -77.5
5 July 1996 <NA> LA GUARDIA ARPT 40.8 -73.9
6 May 1991 Day SAN ANTONIO INTL 29.5 -98.5
7 November 1993 Dawn KANSAS CITY INTL 39.3 -94.7
8 July 1995 <NA> KANSAS CITY INTL 39.3 -94.7
9 September 1990 Day DALLAS/FORT WORT… 32.9 -97.0
10 September 1990 Day AUGUSTA REGIONAL… 33.4 -82.0
# ℹ 136,246 more rows
# ℹ 8 more variables: phase_of_flight <chr>, height <dbl>, speed <dbl>,
# distance <dbl>, sky <chr>, precipitation <chr>, species <chr>, total <int>
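The `group_by()`, `mutate(total = n())`, `ungroup()` pattern used to attach a per-species total to every row has a one-step equivalent in dplyr, `add_count()`. A minimal sketch on made-up rows (the `toy` table is an assumption for illustration) comparing the two:

```r
library(dplyr)

# Made-up species values for illustration only.
toy <- tibble(species = c("Gulls", "Killdeer", "Gulls"))

# Pattern used in the report: group, count within group, then ungroup.
with_total <- toy |>
  group_by(species) |>
  mutate(total = n()) |>
  ungroup()

# One-step alternative that keeps every row and adds the group size.
with_total2 <- toy |>
  add_count(species, name = "total")

with_total2
```

Both versions keep one row per incident, which is what a later plotting step needs if it uses the other columns as well.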