The information is sourced from the course data set directory and pertains to incidents of aircraft and bird collisions. The interest in this data set stems from the question that arises during air travel: will there be an interaction between the aircraft and birds, and will this affect the journey? Through the use of skills learned in class, the data will be explored and conclusions drawn.
The data set contains 17 columns, with 12 categorical and 5 continuous variables. It records details about the aircraft, such as the operator name and id, the aircraft model, status and phase, as well as the size and number of engines. Additionally, information about the bird is recorded, including species and the sky area where it was spotted. Crucially, the interaction/collision between the two parties is documented, including the date and time of the event, the state, height and speed at that moment, as well as the outcome of the collision.
About 13% of all cells in the data set are missing. For most columns the share of missing values is below 20%, although speed (36%), effect (30%) and birds_seen (75%) are notable exceptions. No rows were dropped, because a row with some missing fields still carries useful information in its other columns. Instead, each missing value was filled with the next recorded value below it in the same column (tidyr's fill() with .direction = 'up'). This keeps every record in the analysis and avoids arbitrary placeholder values, but the filled values are borrowed from neighbouring rows rather than observed, so results for the heavily filled columns should be read with caution.
According to a study published by the U.S. Fish & Wildlife Service, collisions between aircraft and birds are becoming an issue. They threaten human life and also kill a significant number of birds each year. The study identified waterfowl, gulls, and raptors as the three groups of birds most commonly involved in aircraft collisions. This partly matches the results of our data analysis: gulls are the most frequently identified group, while the only waterfowl among the ten most common species labels is the Canada goose, and no raptors appear among them. The study also proposed potential solutions to the problem, including modifying aircraft flight schedules, habitat management, bird removal, and lethal control. The study provides useful context for the findings from this data set.
#Load the dataset
library(readr)
data <- read_csv("C:/Users/94478/OneDrive/桌面/data/aircraft_bird_collision_openintro.csv")
head(data,5)
Upon examining the missing values in the dataset, it was found that there are a total of 44,183 missing values, which represents around 13% of the entire dataset. While missing data can present challenges in data analysis, this proportion of missing values is not considered too high, especially considering the nature of recording data related to aircraft and bird collisions.
table(is.na(data))
##
## FALSE TRUE
## 283951 44183
miss_data<-function(x){sum(is.na(x))}
apply(data,2,miss_data)
## opid operator atype remarks phase_of_flt ac_mass
## 0 0 0 2786 1783 1284
## num_engs date time_of_day state height speed
## 1307 0 2077 871 3193 7008
## effect sky species birds_seen birds_struck
## 5718 3579 0 14538 39
pc_miss_data<-function(x){sum(is.na(x))/length(x)*100}
apply(data,2,pc_miss_data)
## opid operator atype remarks phase_of_flt ac_mass
## 0.0000000 0.0000000 0.0000000 14.4337374 9.2373847 6.6521604
## num_engs date time_of_day state height speed
## 6.7713190 0.0000000 10.7605429 4.5124858 16.5423272 36.3071184
## effect sky species birds_seen birds_struck
## 29.6238732 18.5421200 0.0000000 75.3186198 0.2020516
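The per-column counts and percentages can also be computed without helper functions. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not part of the collision data); `colSums()` and `colMeans()` operate on the logical matrix returned by `is.na()`.

```r
# Hypothetical toy data, used only to illustrate the idea
toy <- data.frame(
  speed  = c(120, NA, 140, NA),
  height = c(500, 800, NA, 1000),
  sky    = c("No Cloud", "Overcast", "Some Cloud", "No Cloud")
)

# Number of missing values per column
na_count <- colSums(is.na(toy))

# Percentage of missing values per column
na_pct <- colMeans(is.na(toy)) * 100

print(na_count)
print(na_pct)
```

On the real data, the same two calls would take `data` in place of `toy`.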
Most of the columns in the data set do not require significant changes. The continuous variables are candidates for imputation, such as filling in the missing values. For the categorical variables, a missing value can in principle be kept as a valid category of its own. In the “effect” column, for instance, which records the outcome of the collision, an empty value means that the outcome was not recorded, and it would not be appropriate to assume that there was no effect. Note that the fill step applied next treats both kinds of column the same way, including effect.
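One way to keep “not recorded” as its own category is sketched below, using a small hypothetical vector rather than the real `effect` column. The label `"Not recorded"` is an assumption chosen for illustration.

```r
# Hypothetical values standing in for the effect column
effect_toy <- c("None", NA, "Other", NA, "None")

# Replace missing entries with an explicit label instead of imputing them
effect_labelled <- ifelse(is.na(effect_toy), "Not recorded", effect_toy)

# The missing entries now appear as a separate category in the counts
effect_counts <- table(effect_labelled)
print(effect_counts)
```

This keeps the distinction between "no effect" and "effect unknown" visible in later tables and plots.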
library(tidyr)
# Fill each missing value with the next non-missing value below it (.direction = "up")
data1 <- data %>%
  fill(sky, speed, height, effect, birds_seen, birds_struck, remarks,
       phase_of_flt, ac_mass, num_engs, time_of_day, state,
       .direction = "up")
The missing values were filled column by column. Each gap was filled with the next recorded value in the same column (a backward fill, .direction = 'up'), since assigning zero or another arbitrary value to columns such as effect, speed, and height would produce anomalous results. A backward fill cannot fill gaps at the very end of a column, which is why a very small share of missing values remains in num_engs, state, sky, and birds_seen in the check below. With that caveat, the data set is ready for analysis.
missing_data<-function(x){sum(is.na(x))/length(x)*100}
apply(data1,2,missing_data)
## opid operator atype remarks phase_of_flt ac_mass
## 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
## num_engs date time_of_day state height speed
## 0.00518081 0.00000000 0.00000000 0.00518081 0.00000000 0.00000000
## effect sky species birds_seen birds_struck
## 0.00000000 0.00518081 0.00000000 0.06735053 0.00000000
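The small residue of missing values is a consequence of the fill direction. The sketch below, on a hypothetical one-column data frame, compares `.direction = "up"` with `.direction = "down"`: an upward fill leaves a trailing `NA` untouched because there is no later value to copy.

```r
library(tidyr)

# Hypothetical column with gaps at the start, middle, and end
toy <- data.frame(x = c(NA, 1, NA, 3, NA))

# "up": each NA takes the next non-missing value below it
filled_up <- fill(toy, x, .direction = "up")

# "down": each NA takes the last non-missing value above it
filled_down <- fill(toy, x, .direction = "down")

print(filled_up$x)    # 1 1 3 3 NA
print(filled_down$x)  # NA 1 1 3 3
```

If no gaps should remain, `.direction = "updown"` would fill upward first and then downward, at the cost of borrowing values from both neighbours.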
The effect of the collision is a key variable in the data set. A quick way to see its distribution is to tabulate it with the table() function.
table(data1$effect)
##
## Aborted Take-off Engine Shut Down None
## 803 173 16403
## Other Precautionary Landing
## 521 1402
Most collisions between aircraft and birds have no noticeable effect: about 85% of cases (16,403 of 19,302). The remaining 15% deserve attention because they can have more severe consequences. About 4% of all cases resulted in an aborted take-off, about 7% required a precautionary landing, and, in the worst case, about 0.9% led to an engine shut-down, which could endanger the lives of passengers. This is a general summary of the main variable; we now examine it in more detail.
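The shares can be checked directly from the counts in the table. A minimal sketch, re-entering the reported counts by hand (the vector `effect_counts` is typed in from the output above, not read from the data):

```r
# Counts copied from the table(data1$effect) output
effect_counts <- c(
  "Aborted Take-off"      = 803,
  "Engine Shut Down"      = 173,
  "None"                  = 16403,
  "Other"                 = 521,
  "Precautionary Landing" = 1402
)

# Convert counts to percentages of all recorded collisions
effect_pct <- round(100 * prop.table(effect_counts), 1)
print(effect_pct)
```

On the data frame itself, `prop.table(table(data1$effect))` would give the same proportions in one step.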
library(dplyr)
# Flights where the collision led to an engine shut-down
dangerous_flights <- data1 %>% filter(effect == "Engine Shut Down")
# Average conditions of those flights, by sky condition
dangerous_flights %>%
  group_by(sky) %>%
  summarise(
    n = n(),
    speed = mean(speed, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    num_engs = mean(num_engs, na.rm = TRUE),
    ac_mass = mean(ac_mass, na.rm = TRUE)
  )
Among the flights where an engine shut down, the largest number of recorded incidents occurred when the sky condition was “no cloud”, with fewer under “overcast” or “some cloud”. These are raw counts, so they do not by themselves show that collisions are more likely under clear skies; that would require knowing how often aircraft fly under each sky condition. Interestingly, when the sky condition was “overcast”, the aircraft tended to be at a lower height than under the other two conditions.
library(dplyr)
# Ten most frequently recorded species labels
bird_count <- data1 %>%
  group_by(species) %>%
  summarize(Count = n()) %>%
  arrange(desc(Count)) %>%
  head(10)
knitr::kable(bird_count)
| species | Count |
|---|---|
| UNKNOWN BIRD - SMALL | 4156 |
| UNKNOWN BIRD | 2987 |
| GULLS | 2563 |
| UNKNOWN BIRD - MEDIUM | 2074 |
| SPARROWS | 694 |
| BLACKBIRDS | 543 |
| UNKNOWN BIRD - LARGE | 538 |
| EUROPEAN STARLING | 452 |
| CANADA GOOSE | 409 |
| ROCK PIGEON | 371 |
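Four of the ten most common labels are “UNKNOWN BIRD” categories. A short sketch of how the identified species could be separated from them, using the counts from the table above re-entered by hand (the pattern `^UNKNOWN BIRD` is an assumption about how unidentified birds are labelled):

```r
# Top-ten counts copied from the table above
bird_top10 <- data.frame(
  species = c("UNKNOWN BIRD - SMALL", "UNKNOWN BIRD", "GULLS",
              "UNKNOWN BIRD - MEDIUM", "SPARROWS", "BLACKBIRDS",
              "UNKNOWN BIRD - LARGE", "EUROPEAN STARLING",
              "CANADA GOOSE", "ROCK PIGEON"),
  Count = c(4156, 2987, 2563, 2074, 694, 543, 538, 452, 409, 371)
)

# Flag labels that do not name a species
is_unknown <- grepl("^UNKNOWN BIRD", bird_top10$species)

# Identified species only, still ordered by count
identified <- bird_top10[!is_unknown, ]
print(identified)

# Share of the top-ten records that are unidentified
unknown_share <- sum(bird_top10$Count[is_unknown]) / sum(bird_top10$Count)
print(round(100 * unknown_share, 1))
```

Among the identified labels, gulls lead by a wide margin, which is the basis for the comparison with the published study.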
## Data Visualizations
We fit a simple linear regression of aircraft speed on height.
model <- lm(speed ~ height, data = data1) # Fit the model
summary(model)
##
## Call:
## lm(formula = speed ~ height, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -382.93 -21.58 2.13 17.79 203.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.266e+02 2.960e-01 427.60 <2e-16 ***
## height 1.281e-02 1.517e-04 84.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.95 on 19300 degrees of freedom
## Multiple R-squared: 0.2697, Adjusted R-squared: 0.2697
## F-statistic: 7128 on 1 and 19300 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(model)
The regression on the full data set shows a positive linear relationship between speed and height. The slope is positive (about 0.0128 units of speed per unit of height) with a p-value well below 0.05, so the relationship is statistically significant at the 5% level. The relationship is moderate rather than strong, however: the R-squared of 0.2697 means that height explains about 27% of the variation in speed. The residuals vs fitted plot shows some outliers on the right side, while the majority of the data points cluster on the left. The Normal Q-Q plot was read as showing approximately normal residuals, although the most extreme residual (-382.93, against a residual standard error of 37.95) points to a heavy lower tail. The Cook’s distance plot indicates the presence of some influential outliers.
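To make the size of the fitted relationship concrete, the sketch below plugs the reported coefficients into the regression equation. The coefficient values are copied from the summary output above; the helper `predict_speed()` and the example heights are illustrative assumptions.

```r
# Coefficients copied from summary(model)
intercept <- 126.6     # estimated speed at height 0
slope     <- 0.01281   # change in speed per unit of height
r_squared <- 0.2697

# Hypothetical helper: fitted speed for a given height
predict_speed <- function(height) intercept + slope * height

# Fitted speeds at a few example heights
print(predict_speed(c(0, 1000, 5000)))

# Correlation implied by R-squared (positive because the slope is positive)
r <- sqrt(r_squared)
print(round(r, 2))
```

With the fitted model object available, `predict(model, newdata = data.frame(height = c(0, 1000, 5000)))` would return the same fitted values without retyping the coefficients.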
library(ggplot2)
theme_set(theme_bw()) # use the black-and-white theme for all plots
ggplot(data1, aes(x = effect, y = speed, fill = effect)) +
  geom_boxplot() +
  ggtitle("Relationship between effect of collision and speed of the aircraft") +
  xlab("Effect of aircraft-bird collision") +
  ylab("Speed of the aircraft")
The boxplot above shows the distribution of speed for each level of the “effect” variable. For instance, flights where the take-off was aborted have the lowest median speed, which makes sense because the aircraft is still on the runway.
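The centre line of a boxplot is the median, so the comparison can also be checked numerically. A minimal sketch with made-up numbers (the vectors below are hypothetical, not taken from the data set); on the real data the same call would use `data1$speed` and `data1$effect`.

```r
# Hypothetical speeds and effects, for illustration only
speed_toy  <- c(60, 80, 70, 130, 150, 140, 120, 160)
effect_toy <- c(rep("Aborted Take-off", 3), rep("None", 5))

# Median speed within each effect level
median_by_effect <- tapply(speed_toy, effect_toy, median)
print(median_by_effect)
```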
library(ggplot2)
ggplot(data1, aes(x = effect, y = height, fill = effect)) +
  geom_boxplot() +
  ggtitle("Relationship between effect of collision and height of the aircraft") +
  xlab("Effect of aircraft-bird collision") +
  ylab("Height of the aircraft")
The boxplot above shows that for flights where the take-off was aborted, the height was very low: the “box” is barely visible, with only a couple of outliers.
# Bar chart of the number of flights per effect level, most frequent first
ggplot(data1, aes(x = reorder(effect, effect, function(x) -length(x)))) +
  geom_bar(fill = "blue") +
  ggtitle("Number of flights by effect level") +
  xlab("Effect level") +
  ylab("Number of flights")
Based on our analysis of the key variables, about 15% of the recorded bird strikes had a reported effect on the flight, and fewer than 1% led to an engine shut-down, so the risk of serious danger in a collision is relatively low. Because the data set contains only flights on which a collision occurred, it cannot tell us how likely a flight is to encounter a bird in the first place. Collisions that led to an aborted take-off occurred at very low heights, as expected. The speed of the aircraft, however, does not seem to have a significant impact on the effect of a collision.