Introduction

The information is sourced from the course data set directory and pertains to incidents of aircraft and bird collisions. The interest in this data set stems from the question that arises during air travel: will there be an interaction between the aircraft and birds, and will this affect the journey? Through the use of skills learned in class, the data will be explored and conclusions drawn.

The data set contains 17 columns, with 12 categorical and 5 continuous variables. It records details about the aircraft, such as the operator name and id, the aircraft model, status and phase, as well as the size and number of engines. Additionally, information about the bird is recorded, including species and the sky area where it was spotted. Crucially, the interaction/collision between the two parties is documented, including the date and time of the event, the state, height and speed at that moment, as well as the outcome of the collision.

About 13% of the cells in the dataset are missing. For most of the columns the percentage of missing data is below 20%; the exceptions are speed (36%), effect (30%) and birds_seen (75%). No rows were dropped, since a row with some missing fields still carries useful information. Instead, we filled each missing value with the next recorded value in the same column, so that no arbitrary values such as zero are introduced and the rest of each row is preserved.

Research on the topic

According to a study published by the U.S. Fish & Wildlife Service, collisions between aircraft and birds are becoming an issue. They threaten human life and also kill a significant number of birds each year. The study identified some of the bird species commonly involved in aircraft collisions, with waterfowl, gulls, and raptors being the top three. This partly matches the results of our data analysis: gulls are the most frequently recorded identified species group, and waterfowl appear among the ten most frequent species through the Canada goose, while no raptor appears in that top ten. The study also proposed potential solutions to address this problem, including modifying aircraft flight schedules, habitat management, bird removal, and lethal control. The information from the study is valuable and is broadly consistent with the findings from this dataset.

Load the dataset

library(readr)
data <- read_csv("C:/Users/94478/OneDrive/桌面/data/aircraft_bird_collision_openintro.csv")
head(data,5)

Data Cleaning

Upon examining the missing values in the dataset, it was found that there are a total of 44,183 missing values, which represents around 13% of the entire dataset. While missing data can present challenges in data analysis, this proportion of missing values is not considered too high, especially considering the nature of recording data related to aircraft and bird collisions.

table(is.na(data))
## 
##  FALSE   TRUE 
## 283951  44183
miss_data<-function(x){sum(is.na(x))}
apply(data,2,miss_data)
##         opid     operator        atype      remarks phase_of_flt      ac_mass 
##            0            0            0         2786         1783         1284 
##     num_engs         date  time_of_day        state       height        speed 
##         1307            0         2077          871         3193         7008 
##       effect          sky      species   birds_seen birds_struck 
##         5718         3579            0        14538           39
pc_miss_data<-function(x){sum(is.na(x))/length(x)*100} 
apply(data,2,pc_miss_data)
##         opid     operator        atype      remarks phase_of_flt      ac_mass 
##    0.0000000    0.0000000    0.0000000   14.4337374    9.2373847    6.6521604 
##     num_engs         date  time_of_day        state       height        speed 
##    6.7713190    0.0000000   10.7605429    4.5124858   16.5423272   36.3071184 
##       effect          sky      species   birds_seen birds_struck 
##   29.6238732   18.5421200    0.0000000   75.3186198    0.2020516
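The same per-column counts and percentages can also be obtained without helper functions. The following is a minimal sketch on a toy data frame with made-up values (the name `toy` and its contents are assumptions for illustration, not part of the collision data); it relies only on base R.

```r
# Toy data frame with hypothetical values, used only to illustrate the idea
toy <- data.frame(
  height = c(100, NA, 300, NA),
  sky    = c("No Cloud", NA, "Overcast", "Some Cloud")
)

na_count <- colSums(is.na(toy))         # number of missing values per column
na_pct   <- 100 * colMeans(is.na(toy))  # percentage missing per column
overall  <- 100 * mean(is.na(toy))      # percentage of all cells missing

print(na_count)
print(na_pct)
print(overall)
```

Because `is.na()` on a data frame returns a logical matrix, `colMeans()` gives the share of missing cells in each column directly.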

Most of the columns in the dataset do not require significant changes, but the missing values in the continuous variables need to be handled, for example by filling them in. In principle, the categorical variables could be left as they are, because a missing value can be treated as a valid category of its own. For instance, in the “effect” column, which indicates the outcome of the collision, an empty value means that the outcome was not recorded, and it would not be appropriate to assume that there was no effect. In the code that follows, however, every column with missing values is filled, the categorical ones included.
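As an illustration of treating a missing value as its own category, here is a minimal base R sketch. The vector below is a made-up stand-in for the “effect” column, and the label "Not recorded" is an assumption chosen for this example; neither comes from the dataset.

```r
# Toy vector standing in for the effect column (values are illustrative only)
effect <- c("None", NA, "Engine Shut Down", "None", NA)

# Option 1: count missing values alongside the observed categories
counts_with_na <- table(effect, useNA = "ifany")

# Option 2: recode NA to an explicit label so it appears as a category
effect_labelled <- ifelse(is.na(effect), "Not recorded", effect)
counts_labelled <- table(effect_labelled)

print(counts_with_na)
print(counts_labelled)
```

Either option keeps the unrecorded outcomes visible in summaries instead of replacing them with a neighbouring row's value.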

library(tidyr)
# Fill each missing value with the next non-missing value in the same column
data1 <- data %>%
  fill(sky, speed, height, effect, birds_seen, birds_struck,
       remarks, phase_of_flt, ac_mass, num_engs, time_of_day, state,
       .direction = "up")

The missing values were addressed column by column. Filling with .direction = 'up' replaces each missing value with the next non-missing value below it in the same column (a backward fill), which was used for columns such as effect, speed, and height, where assigning zero or another arbitrary value would not be appropriate. This was done to avoid generating anomalous values in the dataset. A value that is missing in the final rows of a column has no later value to copy, so a very small share of missing values remains in num_engs, state, sky and birds_seen, as the check below shows. Note that for heavily affected columns such as birds_seen, speed and effect, a large share of the values used in the analysis is therefore copied from neighbouring rows.
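To make the behaviour of the fill direction concrete, here is a small sketch using `tidyr::fill()` on a toy data frame. The data frame `toy` and its values are hypothetical and serve only to show how "up" and "down" differ.

```r
library(tidyr)

# Toy data: hypothetical values, not taken from the collision data
toy <- data.frame(row = 1:5, speed = c(NA, 120, NA, 140, NA))

# "up": each NA takes the next non-missing value below it (backward fill)
filled_up <- fill(toy, speed, .direction = "up")

# "down": each NA takes the previous non-missing value above it (forward fill)
filled_down <- fill(toy, speed, .direction = "down")

print(filled_up$speed)    # 120 120 140 140  NA
print(filled_down$speed)  #  NA 120 120 140 140
```

With "up", the last row stays missing because no later value exists; with "down", the first row stays missing for the same reason.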

missing_data<-function(x){sum(is.na(x))/length(x)*100} 
apply(data1,2,missing_data)
##         opid     operator        atype      remarks phase_of_flt      ac_mass 
##   0.00000000   0.00000000   0.00000000   0.00000000   0.00000000   0.00000000 
##     num_engs         date  time_of_day        state       height        speed 
##   0.00518081   0.00000000   0.00000000   0.00518081   0.00000000   0.00000000 
##       effect          sky      species   birds_seen birds_struck 
##   0.00000000   0.00518081   0.00000000   0.06735053   0.00000000

Exploratory Data Analysis

The effect of the collision is a key variable in the dataset. A quick way to obtain some insight is to use the table() function to see how this variable is distributed across the dataset.

table(data1$effect)
## 
##      Aborted Take-off      Engine Shut Down                  None 
##                   803                   173                 16403 
##                 Other Precautionary Landing 
##                   521                  1402

The majority of recorded collisions between aircraft and birds do not result in any noticeable effect: about 85% of cases (16,403 of 19,302) fall in the “None” category. The remaining 15% require attention, as they can have more severe consequences. About 4% of all cases resulted in an aborted take-off, about 7% required a precautionary landing, about 3% had some other effect, and in the worst case, about 0.9% of cases, an engine was shut down, which could endanger the lives of passengers. These counts include the filled-in values of the effect column. This is a general summary of the main variable, and we will now examine it in more detail.
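The shares of each outcome can be recomputed directly from the printed counts. The sketch below copies the five counts from the table output above into a named vector (the name `effect_counts` is introduced here for illustration) and converts them to percentages with `prop.table()`.

```r
# Counts copied from the table(data1$effect) output above
effect_counts <- c(
  "Aborted Take-off"      = 803,
  "Engine Shut Down"      = 173,
  "None"                  = 16403,
  "Other"                 = 521,
  "Precautionary Landing" = 1402
)

# Share of each outcome among all recorded collisions, in percent
effect_pct <- round(100 * prop.table(effect_counts), 1)
print(effect_pct)
```

On the full data the same result would come from `round(100 * prop.table(table(data1$effect)), 1)`.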

library(dplyr)
dangerous_flights <- data1 %>% filter(effect == "Engine Shut Down")
dangerous_flights %>%
  group_by(sky) %>%
  summarise(
    n = n(),
    speed = mean(speed, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    num_engs = mean(num_engs, na.rm = TRUE),
    ac_mass = mean(ac_mass, na.rm = TRUE)
  ) 

Among the incidents that ended in an engine shutdown, the sky condition “no cloud” had the highest number of recorded cases, while “overcast” and “some cloud” had fewer. These are raw counts, so they do not show by themselves that collisions are more likely under a clear sky; that would also require knowing how many flights took place under each condition. Interestingly, when the sky condition was “overcast”, the aircraft tended to be at a lower height than under the other two conditions.

library(dplyr)
# Ten most frequently recorded species
bird_count <- data1 %>%
  group_by(species) %>%
  summarize(Count = n()) %>%
  arrange(desc(Count)) %>%
  head(10)
knitr::kable(bird_count)

species                  Count
UNKNOWN BIRD - SMALL      4156
UNKNOWN BIRD              2987
GULLS                     2563
UNKNOWN BIRD - MEDIUM     2074
SPARROWS                   694
BLACKBIRDS                 543
UNKNOWN BIRD - LARGE       538
EUROPEAN STARLING          452
CANADA GOOSE               409
ROCK PIGEON                371

Data Visualizations

We can fit a linear regression of aircraft speed on height.

model <- lm(speed ~ height, data = data1)  # Fit the model
summary(model) 
## 
## Call:
## lm(formula = speed ~ height, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -382.93  -21.58    2.13   17.79  203.42 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.266e+02  2.960e-01  427.60   <2e-16 ***
## height      1.281e-02  1.517e-04   84.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.95 on 19300 degrees of freedom
## Multiple R-squared:  0.2697, Adjusted R-squared:  0.2697 
## F-statistic:  7128 on 1 and 19300 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(model)

The regression on the entire dataset shows a positive linear relationship between speed and height, though only a moderate one: the R-squared of 0.2697 means that height accounts for about 27% of the variation in speed. The slope is positive (about 0.0128 units of speed per unit of height) and its p-value is far below the 0.05 threshold, so the relationship is statistically significant at the 5% level. The residuals vs fitted plot shows some outliers on the right side, while the majority of the data points cluster on the left. The normal Q-Q plot was read as showing roughly normally distributed residuals, although the residual range in the summary (-382.93 to 203.42 against a residual standard error of 37.95) points to heavier tails than a normal distribution would produce. The Cook’s distance plot indicates the presence of some influential outliers.
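One way to gauge the strength of the fitted relationship is to work through the printed coefficients by hand. The sketch below copies the intercept, slope and R-squared from the summary output above; the height of 1000 is a hypothetical value chosen for illustration, in whatever units the dataset uses.

```r
# Values copied from the summary(model) output above
intercept <- 126.6     # 1.266e+02
slope     <- 0.01281   # 1.281e-02, change in speed per unit of height
r_squared <- 0.2697

# Correlation implied by R-squared in a one-predictor model
# (positive because the slope is positive)
r <- sqrt(r_squared)

# Fitted speed at a hypothetical height of 1000
fitted_at_1000 <- intercept + slope * 1000

cat("implied correlation:", round(r, 3), "\n")
cat("fitted speed at height 1000:", fitted_at_1000, "\n")
```

With the fitted model object available, `predict(model, newdata = data.frame(height = 1000))` would give the fitted value without copying coefficients by hand.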

library(ggplot2)
theme_set(theme_bw()) # c

ggplot(data1, aes(x=effect, y=speed, fill=effect)) +  geom_boxplot() +  ggtitle("relationship between different effect results and speed of the aircraft") +  xlab("effect of aircraft-bird collisions") + ylab("speed of the aircraft")

The boxplot above shows the distribution of speed for each level of the “effect” variable. For instance, for flights where the take-off was aborted, the median speed is the lowest, which makes sense because these aircraft had not yet left the ground.

library(ggplot2)
ggplot(data1, aes(x=effect, y=height, fill=effect)) +  geom_boxplot() +  ggtitle("relationship between different effect results and height of the aircraft") +  xlab("effect of aircraft-bird collisions") + ylab("height of the aircraft")

The boxplot above shows that for flights where the take-off was aborted, the height was very low: the “box” is barely visible, and only a couple of outliers stand out.

# Bar chart of the number of flights per effect level, most frequent first
ggplot(data1, aes(x=reorder(effect, effect, function(x)-length(x)))) +
geom_bar( fill = 'blue') +  ggtitle("bar chart of the count of flights in different effect levels") +  xlab("effect levels") + ylab("number of flights")

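Based on our analysis of key variables, we can conclude that about 15% of the recorded collisions had a noticeable effect on the flight. Because the dataset contains only flights that struck a bird, it cannot tell us how likely a flight is to encounter a bird in the first place. Among the recorded collisions, the risk of serious danger is relatively low: fewer than 1% ended in an engine shutdown. The height plots suggest that outcomes such as aborted take-offs occur at very low altitude. Apart from the lower speeds seen for aborted take-offs, the speed of the aircraft does not seem to have a strong association with the effect of a collision, although this was assessed only visually and not with a formal test.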