The data comes from the course dataset directory and records collisions between aircraft and birds. The dataset interests me because of a question that comes up whenever I take a flight: should we expect some kind of interaction between the aircraft and birds, and does it affect the trip? By applying some of the skills learned in class, I hope to explore the data myself and reach some conclusions.
The dataset has 17 columns in total: 12 categorical variables and 5 continuous variables. It documents the aircraft, including the operator name and ID, the type/model of the aircraft, the phase of flight, and the size and number of engines. It also records information about the birds, including the species and the sky conditions in which they were seen. Most importantly, it records the collision itself: the date and time of day, the state, the height and speed at the time, and the outcome/effect of the collision.
As for cleaning the data, there are some obvious mistakes caused by Excel/CSV handling. For example, the value ‘2-10’ was converted into the date-like string ‘10-Feb’, so these values need to be restored before the analysis. Missing data is moderate overall: about 13% of all cells are missing, and most columns have less than 20% missing, although birds_seen (about 75%), speed (about 36%), and effect (about 30%) are higher. No rows were dropped, because a row with a few missing fields still carries useful information. Instead, each missing value was filled with the next non-missing value below it in the same column (tidyr’s fill() with .direction = 'up'), so that every filled value is one that was actually observed and the information in the rest of the row is preserved.
The U.S. Fish & Wildlife Service published a report stating that collisions between aircraft and birds are becoming a problem. They are not only a threat to human life but also a potential cause of a large number of bird deaths each year. The report names the bird groups most commonly involved in collisions with aircraft: waterfowl, gulls, and raptors are the top three, which partly matches the results of our data analysis. We found that gulls are the most frequently identified species in the collision records. Waterfowl and raptors do not appear under those group names among the most frequent species in the dataset, although Canada goose, a waterfowl, is in the top ten. The report also proposes some ways to reduce collisions, including modifying aircraft flight schedules, habitat management, bird removal, and lethal control. The information from the report is very useful, and it broadly aligns with what this dataset shows.
Below is a quick demo of what the data looks like:
df <- read.csv("aircraft_bird_collision_openintro.csv")
head(df, 5)
First, let’s fix an error introduced by Excel: it read ‘2-10’ as a date and stored it as ‘10-Feb’, which is incorrect. The cell below restores the original value.
df$birds_seen[df$birds_seen == '10-Feb'] <- '2-10'
df$birds_struck[df$birds_struck == '10-Feb'] <- '2-10'
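The same repair can be written once and applied to every affected column. The sketch below is a minimal illustration on a made-up data frame: the helper name `fix_excel_ranges` and the toy values are assumptions, and only the ‘10-Feb’ to ‘2-10’ mapping comes from the fix above.

```r
# Hypothetical helper: undo Excel's date coercion of range labels such as "2-10".
fix_excel_ranges <- function(data, cols, lookup = c("10-Feb" = "2-10")) {
  for (col in cols) {
    values <- as.character(data[[col]])
    # Only touch non-missing cells whose value is a known damaged label.
    damaged <- !is.na(values) & values %in% names(lookup)
    values[damaged] <- unname(lookup[values[damaged]])
    data[[col]] <- values
  }
  data
}

# Toy data, made up for illustration (not taken from the real file).
toy <- data.frame(
  birds_seen   = c("1", "10-Feb", NA, "2-10"),
  birds_struck = c("10-Feb", "1", "1", NA),
  stringsAsFactors = FALSE
)

fixed <- fix_excel_ranges(toy, c("birds_seen", "birds_struck"))
print(fixed)
```

Keeping the mapping in a named vector means any further damaged labels found later can be added in one place.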
Now, let’s take a look at the missing values. The dataset has 44,183 missing cells out of 328,134, which means about 13% of the entire dataset is missing. This is not too bad considering the difficulty of recording data like this.
table(is.na(df))
##
## FALSE TRUE
## 283951 44183
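As a quick check of the arithmetic, the overall missing share and the number of rows can be recovered from the two counts above. This is only a sketch using the printed numbers; on the real data frame, `mean(is.na(df))` would give the same share directly.

```r
# Cell counts copied from the table(is.na(df)) output above.
present <- 283951
missing <- 44183

total_cells   <- present + missing
missing_share <- 100 * missing / total_cells
n_rows        <- total_cells / 17   # the dataset has 17 columns

round(missing_share, 1)  # 13.5
n_rows                   # 19302
```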
Next, let’s narrow it down and count the missing observations for each variable/column:
number_of_missing_data<-function(x){sum(is.na(x))}
apply(df,2,number_of_missing_data)
##         opid     operator        atype      remarks phase_of_flt      ac_mass
##            0            0            0         2786         1783         1284
##     num_engs         date  time_of_day        state       height        speed
##         1307            0         2077          871         3193         7008
##       effect          sky      species   birds_seen birds_struck
##         5718         3579            0        14538           39
The same information in percentage form makes it easier to see what share of each column is missing:
percentage_of_missing_data<-function(x){sum(is.na(x))/length(x)*100}
apply(df,2,percentage_of_missing_data)
##         opid     operator        atype      remarks phase_of_flt      ac_mass
##    0.0000000    0.0000000    0.0000000   14.4337374    9.2373847    6.6521604
##     num_engs         date  time_of_day        state       height        speed
##    6.7713190    0.0000000   10.7605429    4.5124858   16.5423272   36.3071184
##       effect          sky      species   birds_seen birds_struck
##   29.6238732   18.5421200    0.0000000   75.3186198    0.2020516
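Both summaries can also be computed without writing helper functions. The sketch below shows one possible alternative on a small made-up data frame (the values are assumptions for illustration only); `colSums()` and `colMeans()` work directly on the logical matrix returned by `is.na()`.

```r
# Toy data, made up for illustration.
toy <- data.frame(
  speed  = c(120, NA, 140, NA),
  height = c(500, 800, NA, 1200),
  effect = c("None", NA, "None", "Other"),
  stringsAsFactors = FALSE
)

missing_count   <- colSums(is.na(toy))          # missing cells per column
missing_percent <- 100 * colMeans(is.na(toy))   # share of each column that is missing

print(missing_count)
print(missing_percent)
```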
Most of the columns can be left as they are, but missing values in the continuous variables need to be filled. The categorical variables could be left alone, because NA can be read as a category of its own. For example, the “effect” column describes the effect of the collision, and there is really nothing natural to fill in: an empty value just means the effect was not recorded, and we cannot arbitrarily say there was no effect. For consistency, though, the code below applies the same fill to every column with missing values, the categorical ones included.
library(tidyr)
# .direction = 'up' copies the next non-missing value upward into each gap
df1 <- df %>%
  fill(sky, speed, height, effect, birds_seen, birds_struck, remarks,
       phase_of_flt, ac_mass, num_engs, time_of_day, state,
       .direction = 'up')
Filling the columns directly is an old-school way to clean the dataset, but a practical one. A fill from neighboring rows is used because, for columns like effect, speed, and height, we cannot really set the missing value to 0 or Other. Note that .direction = 'up' is a backward fill: each gap takes the next non-missing value below it, not the value before it. This way the filled dataset only contains values that were actually observed. Now the dataset is ready to be analyzed.
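The fill direction matters when interpreting the cleaned data. As a minimal base-R sketch, the hypothetical helper `fill_up` below mimics what an upward fill does to a single vector; it is not tidyr’s implementation, and the speed values are made up.

```r
# Hypothetical helper: carry the next non-missing value backwards into each gap.
fill_up <- function(x) {
  carry <- NA
  # Walk from the last element to the first, remembering the latest observed value.
  for (i in rev(seq_along(x))) {
    if (is.na(x[i])) {
      x[i] <- carry
    } else {
      carry <- x[i]
    }
  }
  x
}

speed  <- c(NA, 120, NA, NA, 140, NA)   # made-up values
filled <- fill_up(speed)
print(filled)   # 120 120 140 140 140  NA
```

The last element stays missing because there is no later value to copy from, which is why a fill of this kind can leave a few missing values at the bottom of a column.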
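Here is a quick check that the fill worked. The few small nonzero percentages that remain correspond to missing values at the very end of a column, which an upward fill cannot reach.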
percentage_of_missing_data<-function(x){sum(is.na(x))/length(x)*100}
apply(df1,2,percentage_of_missing_data)
##         opid     operator        atype      remarks phase_of_flt      ac_mass
##   0.00000000   0.00000000   0.00000000   0.00000000   0.00000000   0.00000000
##     num_engs         date  time_of_day        state       height        speed
##   0.00518081   0.00000000   0.00000000   0.00518081   0.00000000   0.00000000
##       effect          sky      species   birds_seen birds_struck
##   0.00000000   0.00518081   0.00000000   0.06735053   0.00000000
The effect of the collision sounds like the answer to my opening question, so here is a quick frequency table for this variable.
table(df1$effect)
##
##      Aborted Take-off      Engine Shut Down                  None
##                   803                   173                 16403
##                 Other Precautionary Landing
##                   521                  1402
About 85% of the time there is no effect, which means that most collisions do not disrupt the flight. The remaining 15% is where people should pay more attention: about 4% of collisions led to an aborted take-off, about 7% to a precautionary landing, and, in the worst case, about 0.9% to an engine shut down, which could severely endanger the passengers’ lives. Note that these counts come from the filled data, so they include filled-in values of effect. This is a high-level summary of our key variable. Now let’s dive deeper into it.
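The percentages quoted here can be reproduced from the printed counts. This sketch re-enters the counts by hand; on the real data, `prop.table(table(df1$effect))` would give the same shares.

```r
# Counts copied from the table(df1$effect) output above.
effect_counts <- c(
  "Aborted Take-off"      = 803,
  "Engine Shut Down"      = 173,
  "None"                  = 16403,
  "Other"                 = 521,
  "Precautionary Landing" = 1402
)

# Share of each effect level, in percent, rounded to one decimal place.
effect_percent <- round(100 * prop.table(effect_counts), 1)
print(effect_percent)
```

This gives 4.2% for aborted take-offs, 0.9% for engine shut downs, 85.0% for no effect, 2.7% for other, and 7.3% for precautionary landings.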
The dangerous flights, those with an engine shut down, are now extracted and stored separately as dangerous_flights, which has 173 observations. I would like to see what makes a flight dangerous, and whether any pattern emerges from additional variables such as the sky condition, the size of the plane, and the speed and height of the aircraft.
library(dplyr)
# Keep only the collisions that ended in an engine shut down
dangerous_flights <- df1 %>% filter(effect == "Engine Shut Down")
# Average flight characteristics for each sky condition
dangerous_flights %>%
  group_by(sky) %>%
  summarise(
    n = n(),
    speed = mean(speed, na.rm = TRUE),
    height = mean(height, na.rm = TRUE),
    num_engs = mean(num_engs, na.rm = TRUE),
    ac_mass = mean(ac_mass, na.rm = TRUE)
  )
When the sky is “No Cloud”, there are more engine shut down records than under “Overcast” or “Some Cloud”, which suggests that these collisions are more common in clear skies, although the counts do not account for how often flights operate under each condition. Under “Overcast”, the aircraft seem to fly at a significantly lower height than under the other two conditions, which is quite interesting. The mass and number of engines of the aircraft do not show any notable differences.
This exploratory data analysis with the dplyr package gives some interesting and surprising results, and more importantly it gives me the direction of my final goal: to identify the relationship between a dangerous flight and its features (what kind of flight has a higher chance of colliding with birds and having an engine shut down).
In addition, it is interesting to see which bird species collide with aircraft most frequently. Among the birds that were identified, gulls are the most common, followed by sparrows, blackbirds, and European starlings.
library(dplyr)
# Count collisions per species and keep the ten most frequent
bird_count <- df1 %>%
  group_by(species) %>%
  summarize(Count = n()) %>%
  arrange(desc(Count)) %>%
  head(10)
knitr::kable(bird_count)
| species | Count |
|---|---|
| UNKNOWN BIRD - SMALL | 4156 |
| UNKNOWN BIRD | 2987 |
| GULLS | 2563 |
| UNKNOWN BIRD - MEDIUM | 2074 |
| SPARROWS | 694 |
| BLACKBIRDS | 543 |
| UNKNOWN BIRD - LARGE | 538 |
| EUROPEAN STARLING | 452 |
| CANADA GOOSE | 409 |
| ROCK PIGEON | 371 |
It is interesting to see whether there is a relationship between the height and the speed of the aircraft. I am assuming that the higher it flies, the faster it goes. Let’s fit a simple linear regression between the two numerical variables to examine their relationship and its significance.
model <- lm(speed ~ height, data = df1) # Fit the model
summary(model)
##
## Call:
## lm(formula = speed ~ height, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -382.93 -21.58 2.13 17.79 203.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.266e+02 2.960e-01 427.60 <2e-16 ***
## height 1.281e-02 1.517e-04 84.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.95 on 19300 degrees of freedom
## Multiple R-squared: 0.2697, Adjusted R-squared: 0.2697
## F-statistic: 7128 on 1 and 19300 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(model)
Using all the data, the output above supports the assumption made
beforehand: there is a statistically significant positive linear
relationship between speed and height. The slope is positive and the
p-value is far below 0.05, so the relationship is significant at the 5%
level. The fit is only moderate, however: the R-squared of 0.27 means
that height explains about 27% of the variation in speed. In the
residuals vs. fitted plot, we can spot some outliers towards the
right-hand side, while the concentrated area on the left-hand side is
consistent with the positive slope. The normal Q-Q plot suggests that
the residuals are approximately normally distributed. The residuals vs.
leverage plot, with its Cook’s distance contours, shows some evidence
of outliers as well.
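To make the fitted line concrete, the coefficients from the summary can be plugged in by hand. This is a sketch using the rounded estimates printed above; the function name `predict_speed` and the example heights are assumptions, and height and speed are in whatever units the dataset uses.

```r
# Rounded estimates copied from the lm() summary above.
intercept <- 126.6     # 1.266e+02
slope     <- 0.01281   # 1.281e-02 units of speed per unit of height
r_squared <- 0.2697

# Hypothetical helper: fitted speed at a given height.
predict_speed <- function(height) intercept + slope * height

fitted_speed <- predict_speed(c(0, 1000, 5000))
print(fitted_speed)   # 126.60 139.41 190.65

# In a simple regression with a positive slope, the correlation is sqrt(R-squared).
correlation <- sqrt(r_squared)
round(correlation, 2)   # 0.52
```

A correlation of about 0.52 matches the reading of the fit as moderate rather than very strong.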
Now that we have found a significant relationship between speed and height, let’s plot each of the two variables against the “effect” variable, which describes the effect of the collision on the flight.
library(ggplot2)
theme_set(theme_bw()) # use the black-and-white theme instead of the default theme
ggplot(df1, aes(x = effect, y = speed, fill = effect)) +
  geom_boxplot() +
  ggtitle("relationship between different effect results and speed of the aircraft") +
  xlab("effect of aircraft-bird collisions") + ylab("speed of the aircraft")
The boxplot above shows the distribution of speed for each level of the
“effect” variable. For flights where the take-off was aborted, the
median speed is the lowest, which makes sense. For the other levels of
“effect”, the distributions of speed are very similar, with some
pronounced outliers. Now let’s do the same thing with respect to
height.
library(ggplot2)
ggplot(df1, aes(x = effect, y = height, fill = effect)) +
  geom_boxplot() + ggtitle("relationship between different effect results and height of the aircraft") +
  xlab("effect of aircraft-bird collisions") + ylab("height of the aircraft")
The boxplot above shows that for flights where the take-off was
aborted, the height was very low: we can barely see the “box”, just a
couple of outliers.
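A boxplot draws the median of each group, so the numbers behind the two plots can be read off with a grouped median. The sketch below uses a small made-up data frame (all values are assumptions for illustration, not results from the dataset); with the real data one would pass df1 instead of `toy`.

```r
# Toy data, made up for illustration only.
toy <- data.frame(
  effect = c("Aborted Take-off", "Aborted Take-off", "None", "None", "None",
             "Engine Shut Down"),
  speed  = c(80, 100, 140, 150, 130, 160),
  height = c(0, 0, 1500, 800, 2000, 3000),
  stringsAsFactors = FALSE
)

# Median speed and height for each effect level.
medians <- aggregate(cbind(speed, height) ~ effect, data = toy, FUN = median)
print(medians)
```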
# Bar chart of the effect levels, ordered from most to least frequent
ggplot(df1, aes(x = reorder(effect, effect, function(x) -length(x)))) +
  geom_bar(fill = 'purple') +
  ggtitle("bar chart of the count of flights in different effect levels") + xlab("effect levels") + ylab("occurrence of flights")
Luckily, the dangerous levels have low occurrences, and the majority of
the data does not show any negative effect. Only 173 records, under 1%
of the data, were recorded as engine shut down, which can be very
dangerous, especially in the middle of a trip.
From our key variables, we can conclude that only about 15% of the recorded collisions had any effect on the flight. Within that 15%, there is only a small chance of real danger: fewer than 1% of collisions ended in an engine shut down. One tip we took away is that flying low seems to carry a higher chance of colliding with birds. Speed, on the other hand, did not differ much across the effect levels.
Source for the literature review:
Source: https://www.fws.gov/story/threats-birds-collisions-aircraft