The data set I have chosen records all the reported collisions between aircraft and birds. It has a total of 17 variables and a little over 19,000 observations. Some of the more notable variables include operator/airline name, time of day, height, state, speed, and number of birds struck. There is also a variable called OPID, which stands for Operator Identification Number. It’s what I’ll be using to identify the airline, since it abbreviates each name to a three-letter code, such as AAL for American Airlines and USA for US Airways. Although I’m not sure I’ll be using this variable in my visualizations, it’s still a good thing to keep in mind. I plan to use mass or speed as one of my variables, as they have numeric values rather than just names.
This data set is sourced from the Federal Aviation Administration (FAA) Wildlife Strike Database.
Loading tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 19302 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): opid, operator, atype, remarks, phase_of_flt, date, time_of_day, s...
dbl (4): ac_mass, num_engs, height, speed
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning
I have noticed that there are over 44,000 missing values. Even though that number isn’t too high compared with the data that is present, missing data can always bring challenges to data analysis.
Although some columns in the data set do not require any significant change, some of the continuous variables may need to be manipulated using commands that fill in the missing values.
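A quick way to check a missing-value count like the one above is to tally the NA cells overall and per column. The sketch below uses a small made-up data frame (the name `strikes` and all of its values are illustrative assumptions, not the real data) together with base R's `is.na()`.

```r
# Toy stand-in for the strike data; these values are invented for illustration.
strikes <- data.frame(
  speed  = c(120, NA, 140, NA),
  height = c(500, 0, NA, 50),
  sky    = c("No Cloud", NA, "Overcast", "Some Cloud")
)

# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(strikes))

# Missing cells per column, to see which columns need the most attention
missing_by_column <- colSums(is.na(strikes))

total_missing
missing_by_column
```

Running the same two calls on the full data set would show which columns account for most of the gaps before any filling is done.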
The missing values in the data set have now been filled in and cleaned up column by column. This method was used for columns such as effect, speed, and height, where it wouldn’t be appropriate to fill the gaps with zeros, in order to prevent inconsistent results. Although a few missing values still remain, this was one of the few ways the data set could have been cleaned.
I will first be analyzing the effect of collision.
table(data1$effect)
     Aborted Take-off      Engine Shut Down                  None
                  803                   173                 16403
                Other Precautionary Landing
                  521                  1402
Most of the collisions between aircraft and birds didn’t result in any major effects. About 15% of cases did require attention, as they may have had more serious consequences. Out of all the cases, 4% resulted in an aborted take-off, 7% required a precautionary landing, and about 1% led to an engine shutdown, which would be the worst-case scenario here.
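The percentages quoted above follow directly from the counts in the table. As a minimal sketch, the counts are typed in by hand here (the vector name `effect_counts` is only for illustration); on the real data, `prop.table(table(data1$effect))` should give the same proportions.

```r
# Counts copied from the table(data1$effect) output above
effect_counts <- c(
  "Aborted Take-off"      = 803,
  "Engine Shut Down"      = 173,
  "None"                  = 16403,
  "Other"                 = 521,
  "Precautionary Landing" = 1402
)

# Share of each effect as a percentage of all reports, rounded to one decimal
effect_pct <- round(100 * effect_counts / sum(effect_counts), 1)
effect_pct

# Share of reports with any effect other than "None"
share_with_effect <- round(100 * (1 - effect_counts[["None"]] / sum(effect_counts)), 1)
share_with_effect
```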
I will now analyze the conditions of the sky when the collisions occurred.
Based on these results, it appears that there are more collisions when there are no clouds in the sky, as the “no clouds” category has the highest number of recorded incidents.
Data Visualization
Below is the first graph, a bar chart that shows how many reports were made for each aircraft mass category. It also takes into account the time of day at which the incident took place.
p1 <- ggplot(data = data1) +
  geom_bar(mapping = aes(x = ac_mass, fill = time_of_day)) +
  ggtitle("Number of Bird Collision Reports Considering Mass and Time of Day") +
  ylab("Reports of Bird Collision") +
  xlab("Mass") +
  coord_flip() +
  scale_fill_brewer()
p1 + theme_dark()
The graph shows that most incidents took place during the day and involved aircraft in mass category 4, which, according to the OpenIntro site, corresponds to 27,001 to 272,000 kg.
Below is another bar chart, which shows that most incidents had no effect on the aircraft.
p2 <- ggplot(data = data1) +
  geom_bar(mapping = aes(x = sky, fill = effect), position = "dodge") +
  ggtitle("Bird Collision Reports Considering Weather and the Effect of the Collision") +
  ylab("Reports of Bird Collision") +
  xlab("Weather") +
  scale_fill_brewer(palette = "Dark2")
p2 + theme_update()
As shown, there are more bird collisions when the skies are clear and fewer when it is overcast. The graph also shows that very few incidents led to an aborted take-off. I am also curious to know what the “other” category could include.
There is an N/A category on the weather x-axis, so I have to remove it.
data1 <- data %>%
  filter(
    !is.na(speed) & !is.na(ac_mass) & !is.na(state) &
      !is.na(time_of_day) & !is.na(sky) & !is.na(effect)
  )
Summary
I had some difficulties with this data set, including how to fix and replace the missing and N/A data, since so many values were missing. I decided that it would be best to fill them in with either 0s or data that makes sense. I used the forward-filling method, which carries the last observed value forward to fill in the gaps.
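Forward filling can be illustrated with `fill()` from tidyr. This is only a sketch on a made-up data frame (the name `toy` and its values are assumptions for illustration), and it is not necessarily the exact call used for this report.

```r
library(tidyr)

# Toy column with gaps; the values are invented for illustration.
toy <- data.frame(
  id    = 1:5,
  speed = c(120, NA, NA, 140, NA)
)

# Carry the last observed value downward into each run of NAs
filled <- fill(toy, speed, .direction = "down")
filled$speed
```

Note that a downward fill has nothing to carry into an NA that sits in the very first row, so such a gap would stay missing. That may be one reason a few missing values can remain after this kind of cleaning.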
I created two bar charts: one that represents the number of collisions considering the mass and the time of day, and another that represents the number of collisions considering the weather and the effect of the collision. In the first visualization, I noticed that most of the incidents occurred with the heavier aircraft, and in the second visualization, I noticed that most incidents didn’t have any major effects.
One thing that really bothered me during the creation of these visualizations was the birds_seen and birds_struck columns in the data set. Most of the data has a “10-Feb” in these columns, which doesn’t make sense since that isn’t a number. Analyzing the data makes it seem like it could be something like 2-10. That would make sense, since 2-10 is easily converted into the date February 10th, but I was unable to make the change. Luckily, I didn’t need those two columns in my visualizations.
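If “10-Feb” really is a mangled “2-10”, one possible way to make the change is to swap the label back with `if_else()` from dplyr. The sketch below works on a made-up data frame (`toy` and its values are assumptions for illustration), and the mapping from “10-Feb” to “2-10” is my guess rather than something confirmed by the data source.

```r
library(dplyr)

# Toy stand-in for the birds_struck column; the values are invented for illustration.
toy <- data.frame(
  birds_struck = c("1", "10-Feb", "0", "10-Feb"),
  stringsAsFactors = FALSE
)

# Replace the date-like label with the assumed original range, leave everything else alone
fixed <- toy %>%
  mutate(birds_struck = if_else(birds_struck == "10-Feb", "2-10", birds_struck))

fixed$birds_struck
```

The same `mutate()` call could be repeated for birds_seen if that column turns out to have the same problem.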