## R Final Project
## by Catherine Cho
library(readr)
urlfile<-"https://vincentarelbundock.github.io/Rdatasets/csv/openintro/birds.csv"
birddata<-read_csv(url(urlfile))
## New names:
## * `` -> ...1
## Rows: 19302 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): opid, operator, atype, remarks, phase_of_flt, date, time_of_day, s...
## dbl (5): ...1, ac_mass, num_engs, height, speed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(birddata)
## ...1 opid operator atype
## Min. : 1 Length:19302 Length:19302 Length:19302
## 1st Qu.: 4826 Class :character Class :character Class :character
## Median : 9652 Mode :character Mode :character Mode :character
## Mean : 9652
## 3rd Qu.:14477
## Max. :19302
##
## remarks phase_of_flt ac_mass num_engs
## Length:19302 Length:19302 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:3.000 1st Qu.:2.000
## Mode :character Mode :character Median :4.000 Median :2.000
## Mean :3.362 Mean :2.096
## 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :5.000 Max. :4.000
## NA's :1284 NA's :1307
## date time_of_day state height
## Length:19302 Length:19302 Length:19302 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 0.0
## Mode :character Mode :character Mode :character Median : 40.0
## Mean : 754.7
## 3rd Qu.: 500.0
## Max. :32500.0
## NA's :3193
## speed effect sky species
## Min. : 0.0 Length:19302 Length:19302 Length:19302
## 1st Qu.:110.0 Class :character Class :character Class :character
## Median :130.0 Mode :character Mode :character Mode :character
## Mean :136.1
## 3rd Qu.:150.0
## Max. :400.0
## NA's :7008
## birds_seen birds_struck
## Length:19302 Length:19302
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The following analysis extracts statistical information from the dataset “Aircraft-Wildlife Collisions” to assess the risk of bird strikes in the US. It addresses two questions:
1. Is there a greater chance of a bird strike during the day or the evening?
   + Is a bird strike more likely under certain weather or visibility conditions?
2. At what speed and height does an aircraft experience the least risk of a bird strike?
   + Are bird strikes concentrated at a certain height?
#subset of data is created to contain the variables of interest; height, speed, sky, species, and time of day.
birds_1<-subset(birddata,select=c(height,speed,sky,time_of_day,species))
summary(birds_1)
## height speed sky time_of_day
## Min. : 0.0 Min. : 0.0 Length:19302 Length:19302
## 1st Qu.: 0.0 1st Qu.:110.0 Class :character Class :character
## Median : 40.0 Median :130.0 Mode :character Mode :character
## Mean : 754.7 Mean :136.1
## 3rd Qu.: 500.0 3rd Qu.:150.0
## Max. :32500.0 Max. :400.0
## NA's :3193 NA's :7008
## species
## Length:19302
## Class :character
## Mode :character
##
##
##
##
It is evident from the bar plot below that most incidents occur during the day. This is a raw count of incidents that does not consider other variables. Day incidents make up 57.9% of all incidents recorded in this database, which means they outnumber dawn, dusk, and night incidents combined. This count alone contradicts the notion that better daytime visibility should lead to fewer incidents. However, the next section considers the weather (i.e. cloudy, clear, etc.) to further assess whether visibility has a direct effect on the number of incidents.
Hypothesis: There is a direct connection between visibility, as determined by daylight and clear skies, and the number of bird strikes.
#bar plot of the number of incidents per time of day
barplot(table(birds_1$time_of_day))
title(main="Incidents Per Time of Day")
#Proportions of incidents per time of day
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#count() takes a bare column name; a quoted string is treated as a constant
time_of_day_count<-count(birds_1,time_of_day)
time_of_day_count
#share of all recorded incidents that occurred during the day
day_count<-filter(time_of_day_count,time_of_day=="Day")$n
day_percent<-(day_count/nrow(birds_1))*100
day_percent
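The share of incidents per time of day can also be computed in base R. The sketch below uses a small made-up vector (`tod`, not the real data) to show how the result depends on whether missing values are counted in the denominator; all names in it are illustrative assumptions.

```r
# Toy vector standing in for birds_1$time_of_day (made-up values, not the real data)
tod <- c("Day", "Day", "Night", NA, "Dawn", "Day", "Dusk", NA)

# table() drops NA by default, so these shares are relative to known values only
share_known <- prop.table(table(tod)) * 100

# Dividing by the full length keeps missing values in the denominator,
# which corresponds to dividing by nrow() of the full data set
share_all <- table(tod) / length(tod) * 100

share_known["Day"]  # share of Day among incidents with a known time of day
share_all["Day"]    # share of Day among all incidents, including missing ones
```

Which denominator is appropriate depends on whether incidents with an unrecorded time of day should count toward the total.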
The stacked bar plot below also shows that increased visibility does not coincide with fewer bird strikes. More bird strikes are recorded during the day than at any other time, and a large proportion of those occurred on a clear day without clouds.
#rename columns: add units to height and speed, and change "sky" to "weather"
colnames(birds_1)<-c("height_ft","speed_knots","weather","time_of_day","species")
#creating a cross table of time of day and weather
myTable<-table(birds_1$weather,birds_1$time_of_day)
myTable
##
## Dawn Day Dusk Night
## No Cloud 224 4097 333 2190
## Overcast 187 2314 153 543
## Some Cloud 148 3567 293 997
#convert table to dataframe
myData<-as.data.frame.matrix(myTable)
#add rownames as a separate variable:
myData$weather<-rownames(myData)
#convert to long format:
library(reshape2)
myDataLong<-melt(myData,id.vars=c("weather"),value.name="count")
#Rename the melted "variable" column:
names(myDataLong)[2]<-"time_of_day"
#Then the plot:
library(ggplot2)
ggplot(myDataLong,aes(x=time_of_day,y=count,fill=weather)) +
  geom_bar(stat="identity")
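The stacked bars show raw counts, which makes it hard to compare the weather mix between periods of very different size. As a minimal sketch, the cross table printed above can be re-entered by hand and converted to proportions with `prop.table()`; the object names (`counts`, `weather_share`, `time_share`) are illustrative assumptions.

```r
# Counts copied from the weather x time-of-day cross table in this report
counts <- matrix(
  c(224, 4097, 333, 2190,
    187, 2314, 153,  543,
    148, 3567, 293,  997),
  nrow = 3, byrow = TRUE,
  dimnames = list(weather = c("No Cloud", "Overcast", "Some Cloud"),
                  time_of_day = c("Dawn", "Day", "Dusk", "Night"))
)

# Share of each weather condition within each time of day (columns sum to 1)
weather_share <- prop.table(counts, margin = 2)
round(100 * weather_share, 1)

# Share of each time of day among incidents where both variables are recorded
time_share <- colSums(counts) / sum(counts)
round(100 * time_share, 1)
```

These shares only cover incidents with both weather and time of day recorded, so they need not match a percentage computed over all rows. The same within-period proportions could be drawn by passing `position="fill"` to `geom_bar()`.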
The mean speed of 136 knots and the median of 130 knots are similar, which suggests a fairly symmetric distribution. The box plot shows this even distribution as well. FAA regulations cap airspeed at 250 knots below 10,000 ft, so most jets climb at or below that speed until they pass that altitude. This is supported by the scatter plot shown further below, which shows a positive trend between height and speed.
#mean and median of speed
mean_speed<-mean(birds_1$speed_knots, na.rm=TRUE)
median_speed<-median(birds_1$speed_knots, na.rm=TRUE)
cat("the mean speed is",mean_speed,"and median speed is",median_speed)
## the mean speed is 136.0993 and median speed is 130
#box plot of bird strikes relating to speed
boxplot(birds_1$speed_knots,
main="Incidents vs Speed",
xlab="Speed (knots)",
ylab="Incidents",
col="orange",
border="brown",
horizontal=TRUE,
notch=TRUE)
The box plot below shows that the distribution of bird strikes over height is skewed to the right. The mean height at which bird strikes were recorded is about 750 ft, while the median is only 40 ft. The mean is roughly the height at which most jets come off takeoff power, so strikes typically happen just after the jet has left the runway.
#box plot of bird strikes relating to height
boxplot(birds_1$height_ft,
main="Incidents vs Height",
xlab="Height (ft)",
ylab="Incidents",
col="blue",
border="brown",
horizontal=TRUE,
notch=TRUE)
#mean and median of height
mean_height<-mean(birds_1$height_ft, na.rm=TRUE)
median_height<-median(birds_1$height_ft, na.rm=TRUE)
cat("the mean height is",mean_height,"and median height is",median_height)
## the mean height is 754.6778 and median height is 40
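A mean far above the median is the usual signature of a right-skewed variable. The sketch below illustrates this on a small made-up vector of heights (`heights_ft`, not the real data) and shows one common way to make such a variable easier to plot, a `log1p()` transform; this is an illustration under those assumptions, not part of the original analysis.

```r
# Toy heights in feet (made-up values): many near the ground, a few very high
heights_ft <- c(0, 0, 0, 10, 40, 50, 200, 500, 1500, 8000)

mean_h <- mean(heights_ft)      # pulled upward by the few large values
median_h <- median(heights_ft)  # stays close to the bulk of the data

# A mean well above the median indicates right skew
right_skewed <- mean_h > median_h

# log1p(x) = log(1 + x) handles zeros and compresses the long right tail
log_heights <- log1p(heights_ft)
boxplot(log_heights, horizontal = TRUE, xlab = "log(1 + height in ft)")
```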
The scatter plot below shows a positive correlation between height and speed in the data collected. This scatter plot alone does not provide much insight into bird strike probability, but it does support the assumption made earlier that height and speed are generally related.
#plotting scatter plot of height and speed to see if the dataset agrees with the assumption that the aircraft generally flies faster at higher heights.
plot(birds_1$speed_knots,birds_1$height_ft,main='Regression for height on speed',
xlab='speed (knots)',ylab='height (ft)')
abline(lm(height_ft~speed_knots,data=birds_1),col='red')
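The strength of the trend in the scatter plot can be quantified rather than judged by eye. The following is a minimal sketch on a made-up data frame (`toy`, not the real data) that computes Pearson and Spearman correlations while skipping rows with missing values, and reads the slope off the fitted line; the column names mirror `birds_1` but the values are invented.

```r
# Toy data standing in for birds_1 (made-up values, not the real data)
toy <- data.frame(
  speed_knots = c(100, 120, NA, 140, 160, 200, 250),
  height_ft   = c(0, 50, 100, NA, 1500, 4000, 9000)
)

# Pearson correlation on rows where both values are present
r_pearson <- cor(toy$speed_knots, toy$height_ft, use = "complete.obs")

# Spearman rank correlation is less sensitive to skew and outliers
r_spearman <- cor(toy$speed_knots, toy$height_ft,
                  use = "complete.obs", method = "spearman")

# Slope of the fitted line: change in height (ft) per knot of speed;
# lm() drops rows with missing values by default
fit <- lm(height_ft ~ speed_knots, data = toy)
slope <- coef(fit)[["speed_knots"]]
```

Because height is strongly right-skewed, the rank-based Spearman coefficient may be the more robust summary of the two.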
Conclusion: The data set used in this analysis does not support the hypothesis presented at the beginning of this report. There does not seem to be a direct relationship between visibility and the likelihood of a bird strike for commercial jets. This is apparent since the majority of bird strikes happened during the day, when light provides the most visibility, and a large portion of those incidents occurred in clear skies.
The incidents also seem more likely at lower heights around takeoff, which may be because the aircraft is changing speed and height drastically at altitudes where birds may be more prevalent. Based on this information, incidents appear to be less a matter of pilot error, since they show little correlation with visibility and the pilot’s ability to circumvent strikes, and more a matter of the jet being in bird fly zones.
A potential area of skepticism in this analysis is that it does not compare the data set to the total number of attempted flights in the periods of the incidents recorded. There may have been far fewer flights in the evening than in the day, which could skew the data to show more day incidents than evening incidents. However, this would only affect the bar plots presented at the beginning of this report.
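The exposure concern can be made concrete with a small calculation. In the sketch below, the strike counts are the Day and Night column totals of the cross table in this report, while the flight counts are purely hypothetical numbers invented for illustration; no flight totals are part of this data set.

```r
# Strike counts: Day and Night column totals of the weather x time-of-day table
strikes <- c(Day = 4097 + 2314 + 3567, Night = 2190 + 543 + 997)

# HYPOTHETICAL number of flights per period (invented for illustration only)
flights <- c(Day = 800000, Night = 200000)

# Strikes per 10,000 flights
rate_per_10k <- strikes / flights * 10000
rate_per_10k
```

With these invented flight counts the night rate would exceed the day rate even though the raw night count is lower, which is why raw counts alone cannot rank the periods by risk.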