##R Final Project ##by Catherine Cho

library(readr)
urlfile<-"https://vincentarelbundock.github.io/Rdatasets/csv/openintro/birds.csv"
birddata<-read_csv(url(urlfile))
## New names:
## * `` -> ...1
## Rows: 19302 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): opid, operator, atype, remarks, phase_of_flt, date, time_of_day, s...
## dbl  (5): ...1, ac_mass, num_engs, height, speed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(birddata)
##       ...1           opid             operator            atype          
##  Min.   :    1   Length:19302       Length:19302       Length:19302      
##  1st Qu.: 4826   Class :character   Class :character   Class :character  
##  Median : 9652   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 9652                                                           
##  3rd Qu.:14477                                                           
##  Max.   :19302                                                           
##                                                                          
##    remarks          phase_of_flt          ac_mass         num_engs    
##  Length:19302       Length:19302       Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:3.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :4.000   Median :2.000  
##                                        Mean   :3.362   Mean   :2.096  
##                                        3rd Qu.:4.000   3rd Qu.:2.000  
##                                        Max.   :5.000   Max.   :4.000  
##                                        NA's   :1284    NA's   :1307   
##      date           time_of_day           state               height       
##  Length:19302       Length:19302       Length:19302       Min.   :    0.0  
##  Class :character   Class :character   Class :character   1st Qu.:    0.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :   40.0  
##                                                           Mean   :  754.7  
##                                                           3rd Qu.:  500.0  
##                                                           Max.   :32500.0  
##                                                           NA's   :3193     
##      speed          effect              sky              species         
##  Min.   :  0.0   Length:19302       Length:19302       Length:19302      
##  1st Qu.:110.0   Class :character   Class :character   Class :character  
##  Median :130.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :136.1                                                           
##  3rd Qu.:150.0                                                           
##  Max.   :400.0                                                           
##  NA's   :7008                                                            
##   birds_seen        birds_struck      
##  Length:19302       Length:19302      
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

The following analysis extracts statistical information from the dataset “Aircraft-Wildlife Collisions” to assess the risk of bird strikes in the US. It addresses two questions:
1. Is there a greater chance of a bird strike during the day or the evening?
+ Is a bird strike more likely under certain weather or visibility conditions?
2. At what speed and height does an aircraft experience the least risk of a bird strike?
+ Are bird strikes concentrated at a certain height?

#subset of data is created to contain the variables of interest; height, speed, sky, species, and time of day.
birds_1<-subset(birddata,select=c(height,speed,sky,time_of_day,species))
summary(birds_1)
##      height            speed           sky            time_of_day       
##  Min.   :    0.0   Min.   :  0.0   Length:19302       Length:19302      
##  1st Qu.:    0.0   1st Qu.:110.0   Class :character   Class :character  
##  Median :   40.0   Median :130.0   Mode  :character   Mode  :character  
##  Mean   :  754.7   Mean   :136.1                                        
##  3rd Qu.:  500.0   3rd Qu.:150.0                                        
##  Max.   :32500.0   Max.   :400.0                                        
##  NA's   :3193      NA's   :7008                                         
##    species         
##  Length:19302      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Most incidents occur during the day, as the bar plot below shows. This is a raw count of incidents that does not take other variables into account. Day incidents make up 57.9% of all incidents recorded in this database, which means they outnumber dawn, dusk, and night incidents combined. This count alone appears to contradict the notion that better visibility during the day should lead to fewer incidents. However, the next section considers the weather (i.e. cloudy, clear, etc.) to further assess whether visibility has a direct effect on the number of incidents.

Hypothesis: There is a direct connection between visibility, whether limited by lack of light or helped by clear skies, and the number of bird strikes.

#counting the frequency of incidents per time of day
barplot(table(birds_1$time_of_day))
title(main="Incidents Per Time of Day")

#Proportions of incidents per time of day
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
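#count incidents per time of day; the column name must be unquoted,
#otherwise count() groups by the constant string and returns a single row
time_of_day_count<-count(birds_1,time_of_day)
time_of_day_count
#percentage of all recorded incidents that happened during the day
day_percent<-sum(birds_1$time_of_day=="Day",na.rm=TRUE)/nrow(birds_1)*100
day_percent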
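
The day share quoted above can also be obtained with base R's table() and prop.table(). The sketch below is only an illustration on a small made-up vector: toy_time_of_day is a hypothetical stand-in, not the real column. With the real data the same two calls would be applied to birds_1$time_of_day.

```r
# Hypothetical stand-in for birds_1$time_of_day (NOT the real data)
toy_time_of_day <- c("Day", "Day", "Night", "Dusk", "Day", NA, "Dawn", "Night")

# Count incidents per category; table() drops NA values by default
toy_counts <- table(toy_time_of_day)

# Convert the counts to percentages of the non-missing records
toy_percent <- prop.table(toy_counts) * 100
round(toy_percent, 1)

# Percentage for a single category
day_share <- as.numeric(toy_percent["Day"])
day_share
```

Note that prop.table() divides by the number of non-missing records, whereas dividing by nrow() also counts records with no time of day, so the two percentages can differ.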

The stacked bar plot below also shows that increased visibility does not decrease the chance of a bird strike. There are more recorded bird strikes during the day, and a large proportion of those occur on clear days without clouds.

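#rename the columns: "sky" becomes "weather", and height and speed gain units
#(all five columns of birds_1 need a name, including species)
colnames(birds_1)<-c("height_ft","speed_knots","weather","time_of_day","species")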
#creating a cross table of time of day and weather
myTable<-table(birds_1$weather,birds_1$time_of_day)
myTable
##             
##              Dawn  Day Dusk Night
##   No Cloud    224 4097  333  2190
##   Overcast    187 2314  153   543
##   Some Cloud  148 3567  293   997
#convert table to dataframe
myData<-as.data.frame.matrix(myTable)
#add rownames as a separate variable:
myData$weather<-rownames(myData)

#convert to long format:
library(reshape2)
myDataLong<-melt(myData,id.vars=c("weather"),value.name="count")
#Rename variable:
names(myDataLong)[2]<-"time_of_day"
#Then the plot:
library(ggplot2)
ggplot() + geom_bar(aes(y=count,
                        x=time_of_day,
                        fill=weather),
                    data=myDataLong,
                    stat="identity")
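
To read the stacked bars as proportions rather than raw counts, the cross table printed above can be converted to shares. This is a minimal sketch that re-enters the printed counts by hand; the object names sky_by_time, weather_share and time_share are introduced here for illustration only.

```r
# Counts copied from the cross table printed above
# (rows: weather, columns: time of day)
sky_by_time <- matrix(
  c(224, 4097, 333, 2190,
    187, 2314, 153,  543,
    148, 3567, 293,  997),
  nrow = 3, byrow = TRUE,
  dimnames = list(weather = c("No Cloud", "Overcast", "Some Cloud"),
                  time_of_day = c("Dawn", "Day", "Dusk", "Night"))
)

# Share of each weather condition within each time of day
# (margin = 2 makes every column sum to 1)
weather_share <- prop.table(sky_by_time, margin = 2)
round(weather_share * 100, 1)

# Share of all tabulated incidents that fall in each time of day
time_share <- colSums(sky_by_time) / sum(sky_by_time)
round(time_share * 100, 1)
```

The table only contains records where both weather and time of day were reported, so these shares describe that subset rather than all 19302 rows.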

The mean of 136 knots and the median of 130 knots are similar, which suggests that the speed distribution is fairly symmetric. The box plot below shows this as well. FAA regulations limit aircraft to 250 knots below 10,000 ft, so most jets climb at up to 250 knots until they reach that altitude. This is consistent with the scatter plot shown further below, which shows a positive trend between height and speed.

#mean and median of speed
mean_speed<-mean(birds_1$speed_knots, na.rm=TRUE)
median_speed<-median(birds_1$speed_knots, na.rm=TRUE)
cat("the mean speed is",mean_speed,"and median speed is",median_speed)
## the mean speed is 136.0993 and median speed is 130
#box plot of bird strikes relating to speed
boxplot(birds_1$speed_knots,
        main="Incidents vs Speed",
        xlab="Speed (knots)",
        ylab="Incidents",
        col="orange",
        border="brown",
        horizontal=TRUE,
        notch=TRUE)

The box plot below shows that the distribution of bird strikes by height is skewed to the right. The mean height at which bird strikes were recorded is about 755 ft, which is roughly when most jets come off takeoff power, while the median is only 40 ft. Strikes therefore typically happen just as the jet has left the runway.

#box plot of bird strikes relating to height
boxplot(birds_1$height_ft,
        main="Incidents vs Height",
        xlab="Height (ft)",
        ylab="Incidents",
        col="blue",
        border="brown",
        horizontal=TRUE,
        notch=TRUE)

#mean and median of height
mean_height<-mean(birds_1$height_ft, na.rm=TRUE)
median_height<-median(birds_1$height_ft, na.rm=TRUE)
cat("the mean height is",mean_height,"and median height is",median_height)
## the mean height is 754.6778 and median height is 40
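
The gap between the mean and the median is what signals the right skew. The sketch below illustrates the effect on a small made-up vector: toy_height_ft is hypothetical and not taken from the data set.

```r
# Hypothetical strike heights in feet (NOT the real data): mostly near the
# ground, with a few high-altitude records
toy_height_ft <- c(0, 0, 0, 10, 40, 50, 200, 500, 3000, 12000)

toy_mean <- mean(toy_height_ft)
toy_median <- median(toy_height_ft)

# A mean far above the median indicates a long right tail
c(mean = toy_mean, median = toy_median)

# Quartiles describe the bulk of the data without being pulled by the tail
quantile(toy_height_ft, probs = c(0.25, 0.5, 0.75))
```

For a distribution like this, the median and quartiles are usually a better description of a typical incident than the mean.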

The scatter plot below shows a positive correlation between height and speed in the data collected. This scatter plot alone does not provide much insight into bird strike probability, but it does support the assumption made earlier that height and speed are generally related.

#plotting scatter plot of height and speed to see if the dataset agrees with the assumption that the aircraft generally flies faster at higher heights. 
plot(birds_1$speed_knots,birds_1$height_ft,main='Regression for height on speed',
     xlab='speed (knots)',ylab='height (ft)')
abline(lm(height_ft~speed_knots,data=birds_1),col='red')
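
The strength of the trend in the scatter plot can be summarised with a correlation coefficient and the slope of the fitted line. The following is a sketch on made-up values: the toy data frame is hypothetical, and with the real data the same calls would use birds_1$speed_knots and birds_1$height_ft.

```r
# Hypothetical speed/height pairs (NOT the real data), including missing values
toy <- data.frame(
  speed_knots = c(100, 120, 130, 140, 160, 200, NA, 250),
  height_ft   = c(0, 50, 100, 500, 1500, 4000, 300, NA)
)

# Pearson correlation using only rows where both values are present
r <- cor(toy$speed_knots, toy$height_ft, use = "complete.obs")

# Slope of the fitted line: change in height (ft) per additional knot;
# lm() drops rows with missing values by default
fit <- lm(height_ft ~ speed_knots, data = toy)
slope <- unname(coef(fit)["speed_knots"])

c(correlation = r, slope = slope)
```

A correlation near 1 would indicate a tight linear relationship; a positive but modest value would mean that speed and height rise together only loosely.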

Conclusion: The data set used in this analysis does not support the hypothesis presented at the beginning of this report. There does not seem to be a direct relationship between visibility and the likelihood of a bird strike for commercial jets. This is apparent because the majority of bird strikes happened during the day, when light gives the most visibility, and a large portion of those incidents occurred in clear skies. Incidents also seem more likely at lower heights around takeoff, which may be because the aircraft is changing speed and height drastically at altitudes where birds may be more prevalent. Based on this information, incidents appear to be less a matter of pilot error, since they show little correlation with visibility and the pilot’s ability to circumvent strikes, and more a matter of the jet being in bird fly zones. A limitation of this analysis is that it does not compare the data set to the total number of attempted flights in the periods of the incidents recorded. There may have been far fewer flights in the evening than in the day, which could skew the data to show more day incidents than evening incidents. However, this would only affect the bar plots presented at the beginning of this report.
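
The flight-volume caveat could be addressed by dividing incident counts by the number of flights in each period. The sketch below shows only the arithmetic. The incident counts are the column totals of the weather cross table earlier in this report, but the flights vector is entirely hypothetical, since the data set contains no flight counts, so the resulting rates are placeholders and not findings.

```r
# Incident counts per time of day, summed from the weather cross table
# earlier in this report
incidents <- c(Dawn = 559, Day = 9978, Dusk = 779, Night = 3730)

# HYPOTHETICAL flight counts for the same periods: placeholders only,
# because the data set contains no exposure information
flights <- c(Dawn = 1e5, Day = 1e6, Dusk = 1e5, Night = 3e5)

# Strikes per 10,000 flights: comparable across periods with different traffic
rate_per_10k <- incidents / flights * 1e4
round(rate_per_10k, 1)
```

With real traffic figures in place of the placeholders, a higher rate at night than during the day would point back toward visibility, while similar rates would support the conclusion above.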