The dataset I am exploring analyzes reports submitted by pilots of aircrafts hitting mainly birds but a few other animals. This dataset shows 288810 different reports with 100 variables. Many variables are included such as the year of the accident (INCIDENT_YEAR), the species of the animal, mainly birds (SPECIES), whether the incident happened during dawn, day, dusk, or night (TIME_OF_DAY), the speed the aircraft was going in knots (SPEED), the weight of the aircraft on a 1-5 scale – 1 being 2,250 kg or less, 2 being between 2251 and 5700 kg, 3 being between 5,701 and 27,000 kg, 4 being between 27,001 and 272,000 kg, and 5 is above 272,000 kg – (AC_MASS), The distance the aircraft was from the airport (DISTANCE), the cloud coverage at the time of the incident (SKY), whether the bird was small, medium, or large (SIZE), and the height the aircraft was in the time of the incident in feet (HEIGHT). I chose this dataset because I’m very interested in birds and how the rise of air travel has affected them. My source for this dataset is from the Federal Aviation Administration (FAA).
Load Packages and Data
I loaded the packages ‘tidyverse’, ‘magick’, ‘raster’, and ‘highcharter’. I then set my working directory with setwd() and loaded my dataset using read_csv and named it “birds”. I then used the head() function to show the first 6 observations in my dataset.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
I used the function anyNA() to check if there are any NAs in the variables I’m going to use. If there are NAs, when I run the chunk TRUE will show up.
anyNA(birds$INCIDENT_YEAR)
[1] FALSE
anyNA(birds$TIME_OF_DAY)
[1] TRUE
anyNA(birds$SPECIES)
[1] TRUE
anyNA(birds$HEIGHT)
[1] TRUE
anyNA(birds$SPEED)
[1] TRUE
anyNA(birds$AC_MASS)
[1] TRUE
anyNA(birds$DISTANCE)
[1] TRUE
anyNA(birds$SKY)
[1] TRUE
anyNA(birds$SIZE)
[1] TRUE
Removing NAs and Filtering
I used filter(!is.na()) to remove the NAs from every variable I checked above that showed they had NAs. I made a new dataset called “birds2” with the adjusted variables. I filtered for reports only from 2023 and reports with heights of 100 ft or more.
I went through the list of species and filtered out species that were not birds and unknown bird species. I also mutated speed from knots to MPH and distance from nautical miles to miles.
birds3 <- birds2 |>filter(SPECIES !="Seminole bat", SPECIES !="Evening bat", SPECIES !="Silver-haired bat", SPECIES !="Eastern red bat", SPECIES !="Hoary bat", SPECIES !="Brazilian free-tailed bat", SPECIES !="Unknown Bird", SPECIES !="Unknown bird - small", SPECIES !="Unknown bird - medium", SPECIES !="Unknown bird - large") |>mutate(SPEED = (SPEED *1.151),DISTANCE = (DISTANCE *1.15) )
Mutating Variable
I used the function mutate() the change the variable ‘AC_MASS’ from quantitative to categorical. 1 = “2,250 kg or less”, 2 = “2,251 kg to 5,700 kg”, 3 = “5,701 kg to 27,000 kg”, 4 = “27,001 kg to 272,000 kg”, and 5 = “above 272,000 kg”.
birds3 <- birds3 |>mutate(AC_MASS =factor(AC_MASS, levels =1:5, labels =c("2,250 kg or less", "2,251 kg to 5,700 kg", "5,701 kg to 27,000 kg", "27,001 kg to 272,000 kg", "above 272,000 kg")))head(birds3)
# A tibble: 6 × 100
INDEX_NR INCIDENT_DATE INCIDENT_MONTH INCIDENT_YEAR TIME TIME_OF_DAY
<dbl> <chr> <dbl> <dbl> <chr> <chr>
1 1405944 1/5/2023 1 2023 17:30 Night
2 1406361 1/5/2023 1 2023 22:31 Night
3 1406684 1/6/2023 1 2023 09:30 Day
4 1407081 1/17/2023 1 2023 15:35 Day
5 1407394 1/27/2023 1 2023 14:45 Day
6 1407830 2/7/2023 2 2023 19:00 Night
# ℹ 94 more variables: AIRPORT_ID <chr>, AIRPORT <chr>, LATITUDE <dbl>,
# LONGITUDE <dbl>, RUNWAY <chr>, STATE <chr>, FAAREGION <chr>,
# LOCATION <chr>, ENROUTE_STATE <chr>, OPID <chr>, OPERATOR <chr>, REG <chr>,
# FLT <chr>, AIRCRAFT <chr>, AMA <chr>, AMO <chr>, EMA <chr>, EMO <chr>,
# AC_CLASS <chr>, AC_MASS <fct>, TYPE_ENG <chr>, NUM_ENGS <dbl>,
# ENG_1_POS <dbl>, ENG_2_POS <dbl>, ENG_3_POS <dbl>, ENG_4_POS <dbl>,
# PHASE_OF_FLIGHT <chr>, HEIGHT <dbl>, SPEED <dbl>, DISTANCE <dbl>, …
Exploring
I made a bar plot to see how many reports are in each weight category.
plot1 <- birds3 |>group_by(AC_MASS)ggplot(plot1, aes(AC_MASS)) +geom_bar(color ="black", fill =c("#9a41e5", "#fdbafa", "#b93eab", "#45046b")) +labs(title ="Number of Reports in Each Weight Group\n(2023)",x ="Weight in kg",y ="Number of Reports")
The main number of reports are from aircrafts that weigh between 27,001 kg and 272,000 kg. I decided to do this plot to see if one weight category has more reports and it does.
Exploring 2
I made a boxplot of with the different types of cloud coverage and the distance of the aircrafts from the airport when the incident happened.
plot2 <- birds3 |>ggplot(aes(DISTANCE, SKY)) +geom_boxplot(color ="black", fill =c("lightskyblue2", "cornflowerblue", "royalblue4")) +theme_light() +labs(title ="The Cloud Coverage and Distance from the Airport\nEach Aircraft was During the Strike",y ="Cloud Coverage",x ="Distance in Miles From the Airport")plot2
I made this plot because I noticed when going through the dataset that a lot of the distances were 0 so I wanted to see the mean and I though this was an easy way to do it.
Linear Regression
I used the function cor() to find the correlation coefficient of the linear regression. Then I used lm(y ~ x) to find the slope, intercept, and p-value.
cor(birds2$DISTANCE, birds2$SPEED)
[1] 0.587492
birds_lm <-lm(DISTANCE ~ SPEED, data = birds2)summary(birds_lm)
Call:
lm(formula = DISTANCE ~ SPEED, data = birds2)
Residuals:
Min 1Q Median 3Q Max
-11.339 -3.265 -1.173 2.096 53.031
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.345097 0.837751 -11.15 <2e-16 ***
SPEED 0.090737 0.004884 18.58 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.846 on 655 degrees of freedom
Multiple R-squared: 0.3451, Adjusted R-squared: 0.3441
F-statistic: 345.2 on 1 and 655 DF, p-value: < 2.2e-16
The correlation coefficient is 0.59 which shows a moderate to weak correlation. The equation for this linear regression is: DISTANCE = 2e-16(SPEED) - 9.35. The p-value has three asterisks which suggests that it is meaningful but we have to take into account the adjusted R-squared which shows that only 35% of observations may be explained by the model.
Model
I made a linear regression model.
plot_lm <- birds2 |>ggplot(aes(DISTANCE, SPEED)) +geom_point(color ="seagreen", fill ="seagreen2", shape =21, size =3, alpha = .6) +geom_smooth(method ='lm', formula = y~x, linewidth =1, color ="navyblue") +theme_classic() +labs(x ="Distance in Nautical Miles",y ="Speed in Knots")plot_lm
The regression line is pretty concentrated and although the adjusted R-square is 35% there are many variables in this data set so I would say that there is a relationship between distance and speed that would make sense to explore.
Exploring 3
I made 4 separate bar graphs that show the 3 sizes of the birds struck listed in the report and I separated them into the 4 times of the day.
plot3 <- birds3 |>group_by(TIME_OF_DAY)ggplot(plot3, aes(SIZE, fill = AC_MASS)) +geom_bar() +theme_classic() +scale_fill_manual(values =c("#e76f51", "#f4a261", "#2a9d8f", "#264653" )) +facet_wrap(~TIME_OF_DAY) +labs(title ="Number and Size of Birds that Hit or Got Hit by an Aircaft in 2023\nfor the 4 Times of the Day",x ="Size of Bird",y ="Number of Birds",fill ="Weight of Aircraft")
I decided not to use the size of the birds as one of my main variables because there are barely any for dawn and dusk. Like I noticed in my previous graph, the most reports are from the weight group “27,001 kg to 272,000 kg” this is shown in the graph above.
Importing Image
I used ‘image_read()’ and ‘image_fill()’ from the ‘magick’ package. ‘image_read()’ reads the link of the image I chose and ‘image_fill()’ makes the image able to use as the background. The ‘as.raster()’ function coerces the image so it can be implemented in the graph.
bird_img <-image_read("https://media.istockphoto.com/id/1218631737/photo/seagull-flying-in-the-blue-sky.webp?b=1&s=170667a&w=0&k=20&c=0D8xdiWIavUFEkaIsj4Hayii5NWOJekXgkjsJHL8Qcg=")img3 <-image_fill(bird_img, 'none')img4 <-as.raster(img3)#Got the code from here: https://www.rdocumentation.org/packages/magick/versions/2.8.3/topics/image_ggplot
ggplot plot
I made a ggplot version of my final plot that shows speed in MPH on the x axis and height in feet on the y axis. The fill is the time of day and the size is the weight of each aircraft.
main_plot <- birds3 |>ggplot(aes(SPEED, HEIGHT, fill = TIME_OF_DAY)) +annotation_raster(img4, -Inf, Inf, -Inf, Inf, interpolate = F) +geom_point(shape =21, color ="white", alpha = .8, aes(size = AC_MASS)) +guides(size =guide_legend(title.theme =element_blank(), label = F)) +scale_fill_manual(values =c("#e88504", "#ffcc25", "#a7408b", "#352389")) +theme_classic() +labs(title ="The Speed vs Height from Ground Level\nEach Aircraft Was when the Strikes Occurred in 2023",x ="Speed in MPH",y ="Height Above Ground Level in Feet",caption ="Size of Each Dot is Correlated to the Weight of Each Aircraft",fill ="Time of Day")main_plot
Warning: Using size for a discrete variable is not advised.
highcharter plot
This is my final plot in highcharter that shows speed in MPH on the x axis and height in feet on the y axis. The fill is the time of day and it has a tooltip which shows species, weight, height, and speed.
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
main_plot2 <-highchart() |>hc_add_series(data = birds3,type ="scatter",hcaes(x = SPEED,y = HEIGHT,group = TIME_OF_DAY)) |>hc_colors(colors =c("#e88504", "#ffcc25", "#a7408b", "#352389")) |>hc_plotOptions(scatter =list(marker =list(radius =6, lineColor ="white", symbol ="circle"))) |>hc_chart(backgroundColor ="transparent", divBackgroundImage ="https://media.istockphoto.com/id/1218631737/photo/seagull-flying-in-the-blue-sky.webp?b=1&s=170667a&w=0&k=20&c=0D8xdiWIavUFEkaIsj4Hayii5NWOJekXgkjsJHL8Qcg=") |>hc_title(text ="The Speed vs Height from Ground Level\nEach Aircraft Was When the Strikes Occurred (Reports from 2023)") |>hc_xAxis(title =list(text ="Speed in Knots")) |>hc_yAxis(title =list(text ="Height Above Ground Level in Feet")) |>hc_caption(text ="FAA") |>hc_legend(align ="center", verticalAlign ="bottom", title =list(text ="Time of Day")) |>hc_tooltip(shared = F, pointFormat ="Species: {point.SPECIES}<br> Weight: {point.AC_MASS}<br> Height: {point.y} ft<br> Speed: {point.x} MPH")main_plot2
Conclusion
Based on the information I researched, bird strikes are very common, and they happen more during the summer and fall months because of migration. Bird strikes also are a big threat because they happen when the plane is taking off or landing which is when the plane is going the slowest. This information is from the Pilot Institute : https://pilotinstitute.com/bird-strike/
From my final graph you can see that the most amount of reports happened in the day and the least amount of reports occurred during dawn. You an also see from my graph that reports in dawn and dusk were not very high and were not very fast which could be because the planes were taking off or landing. Reports of strikes at night and day happen at all heights and speeds. I wish I could have included size in the highcharter plot like I did in the ggplot one but I couldn’t figure it out.