The occurrence of wildlife strikes in the USA poses a multifaceted concern that extends beyond the immediate safety of air travel. Each year, thousands of reported incidents involving aircraft and wildlife underscore the potential risks and consequences associated with these encounters. From the economic impacts of flight delays and aircraft damage to the potential loss of human life and environmental conservation concerns, wildlife strikes represent a complex challenge that requires careful consideration and proactive management.
Wildlife strikes, also known as “bird strikes,” occur when aircraft collide with birds or other wildlife during flight operations. In recent years, the frequency of wildlife strikes in the USA has garnered increased attention from aviation authorities, wildlife management agencies, and the general public alike. With the growth of air travel and expanding urbanization encroaching upon natural habitats, the likelihood of such encounters has heightened, prompting concerted efforts to mitigate the associated risks.
In this research, our focus is on exploring the wildlife strike dataset to reveal patterns regarding reported strikes, damages, species involved, and the timing of occurrences. Through exploratory data analysis (EDA), we aim to answer the following research questions:
Is there a pattern in the timing of wildlife strike occurrences throughout the month or year?
How does the severity of damage vary across different incidents?
Which species are most frequently involved in wildlife strikes?
Are there particular phases of flight (e.g., takeoff, landing) where wildlife strikes are more prevalent?
Is there a significant relationship between the type of wildlife species involved in strike incidents and the time of day when the incidents occur, and how does this association impact aviation safety and operational planning?
Are there particular regions with higher incident rates? Are there any geographical patterns?
Do issued warnings have any effect on the damage level?
Are there any weather patterns associated with strikes or damages?
The data for the research analysis was sourced from the Federal Aviation Administration (FAA) Wildlife Strike Database, available at https://wildlife.faa.gov/.
In total, the dataset comprises 101 variables and 298,246 rows, covering wildlife strike incidents from 1990 to 2024. Focusing on characteristics that can plausibly affect wildlife strike incidence, and mindful of the computational cost of working with the full set of variables, we chose 21 variables from the FAA Wildlife Strike Database for this research. The following variables were selected as variables of interest to understand the data and perform the necessary statistical tests (a selection sketch follows the list):
incident_month: The month in which the wildlife strike incident occurred.
incident_year: The year in which the wildlife strike incident occurred.
time_of_day: The time of day when the wildlife strike incident occurred.
airport: The airport where the wildlife strike incident occurred.
latitude: The latitude coordinate of the incident location.
longitude: The longitude coordinate of the incident location.
state: The state where the wildlife strike incident occurred.
faaregion: The FAA region where the wildlife strike incident occurred.
operator: The operator of the aircraft involved in the incident.
aircraft: The type of aircraft involved in the incident.
phase_of_flight: The phase of flight during which the incident occurred.
height: The altitude of the aircraft at the time of the incident.
speed: The speed of the aircraft at the time of the incident.
distance: The distance traveled by the aircraft at the time of the incident.
sky: The sky condition at the time of the incident.
precipitation: The precipitation condition at the time of the incident.
num_struck: The number of animals struck in the incident (recorded as a range).
damage_level: The level of damage caused by the incident.
species: The species of wildlife involved in the incident.
warned: Whether there was any warning issued before the incident.
size: The size of the wildlife species involved in the incident.
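To make this selection concrete, the sketch below subsets the 21 columns of interest once the data have been loaded and cleaned in the next chunk; the column names are assumed to match the list above after janitor::clean_names() has been applied, so adjust them if the raw names differ.
# Keep only the 21 variables of interest (run after the data-loading chunk below)
vars_of_interest <- c(
  "incident_month", "incident_year", "time_of_day", "airport",
  "latitude", "longitude", "state", "faaregion", "operator", "aircraft",
  "phase_of_flight", "height", "speed", "distance", "sky", "precipitation",
  "num_struck", "damage_level", "species", "warned", "size"
)
wildlife_strike <- wildlife_strike %>% select(all_of(vars_of_interest))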
We first standardized the variable names using the clean_names() function from the janitor package. We also tried several methods to impute missing values, but all of them failed due to computational time, so we opted to drop the missing values. All character variables were converted to factors except the target variable (damage), which was derived and converted separately as described below.
To investigate correlations between variables and gain understanding of wildlife strike incidence, we used a range of statistical analysis approaches. These methods include ANOVA to compare strikes across various airport categories and geographic regions, chi-square tests to investigate relationships between wildlife species and time of day, and exploratory data analysis (EDA) for descriptive statistics and data visualization. Regression analysis (logistic, lasso, and ridge) is also used to identify significant predictors of strike likelihood and severity.
# Load required packages
library(tidyverse)   # dplyr, ggplot2, tibble, etc.
library(janitor)     # clean_names()
# Load the dataset
wildlife_strike <- read.csv("wildlife.csv")
wildlife_strike <- clean_names(wildlife_strike)  # standardize variable names
as_tibble(wildlife_strike)
A closer examination of the dataset reveals a significant number of missing values across several variables: approximately 43% of values are missing in time_of_day, 38.6% in phase_of_flight, and 35% in damage_level. Since these variables are crucial for our analysis, we opted to perform exploratory data analysis (EDA) while omitting the missing values during the aggregation process.
By omitting the missing values during aggregation, we can still use the available data for analysis while disregarding records with missing values in the specific variables of interest. This approach keeps the analysis focused on the available data, providing insights into patterns and trends related to wildlife strike incidents without being unduly influenced by missing values.
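As a minimal illustration of this approach (assuming dplyr is loaded), an aggregation can drop only the records that are missing the variable being summarised, rather than discarding them from the whole dataset:
# Example: count strikes by time of day, ignoring records with a missing time_of_day
wildlife_strike %>%
  filter(!is.na(time_of_day)) %>%
  count(time_of_day)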
# percentage of missing values per column
get_percent_missing <- function(df){
  # compute the percentage of missing values for each variable
  missing_percentage <- colMeans(is.na(df)) * 100
  missing_percentage <- data.frame(missing_percentage)
  missing_percentage <- missing_percentage %>%
    arrange(desc(missing_percentage)) %>% round(., 2)
  return(missing_percentage) # return the missing percentages
}
missing_percentage = get_percent_missing(wildlife_strike)
head(missing_percentage,10)
Using EDA, we investigated the distribution of important variables and identified patterns and trends in the data to start our research. Relationships in the data were shown, and outliers or anomalies were identified, using visualizations including scatter plots, box plots, and histograms.
We explore the distribution of incidents over the months and years: are there any seasonal trends or changes over time? We also investigate incidents by time of day: are there certain periods when incidents are more likely to occur?
In 2019, there were 17,340 reported strikes, which decreased to 11,623 in 2020. Since then, the number has continued to rise, reaching 19,613 records by 2023. Upon closer observation, the trend across the years appears to fluctuate, with increases in some years followed by decreases, and then subsequent increases again.
When comparing the occurrence of strikes across the months, most incidents happen during the summer and, to a lesser extent, the spring. This can be attributed to several factors. Breeding and nesting season: many bird species, including migratory birds, breed and nest during the spring and summer months; this activity increases the number of birds in and around airports, leading to a higher likelihood of bird strikes during takeoff and landing. Migration: some bird species undertake seasonal migrations during the spring and fall months. Increased aircraft traffic: the summer months coincide with peak travel seasons for both leisure and business travel, and more flights taking off and landing at airports increases the likelihood of bird strikes.
#--------- Exploratory Data Analysis ---------
# What is the trend in wildlife strike incidents over the past
# few decades based on the dataset?
year_freq <- wildlife_strike %>% group_by(incident_year) %>%
summarise(n = n())
#plot
year_plot <- ggplot(year_freq, aes(x = incident_year, y=n)) + geom_col(fill = 'gray') +
labs(title = 'Incident Per Year',
x = 'Year',
y = 'Number Strike') +
theme_gray() + theme(legend.position = 'None',
plot.title = element_text(hjust = 0.5))
# Arrange plots side by side
#ggarrange(year_plot, month_plot, nrow = 1, ncol = 2)
year_plot
Trend of Strike: Year
# ------ month frequency------
month_freq <- wildlife_strike %>% group_by(incident_month) %>%
summarise(n = n()) %>%
mutate(month = factor(month.abb[incident_month], levels = month.abb))
#plot
month_plot <- ggplot(month_freq, aes(x = month, y=n, fill=n)) + geom_col() +
scale_fill_gradient(low = "lightgray", high = "lightblue") +
labs(title = 'Incident in Each Month',
x = 'Months',
y = 'Number Strike') +
theme_gray() + theme(legend.position = 'None',
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
month_plot
Trend of Strike: Month
According to the data, the majority of reported strikes, totaling 173,295, resulted in no damage. However, there were 84 strikes that caused serious damage, leading to the destruction of the aircraft. Additionally, 8,616 strikes resulted in minor damage, 4,290 in substantial damage, and 7,175 were of undetermined level (Table 1).
#--- indicated damage
table(wildlife_strike$damage_level)
##
## D M M? N S
## 84 8616 7175 173295 4290
# Plotting
barplot(table(wildlife_strike$damage_level),
main = "Frequency of Damage Level",
xlab = "Damage Level",
ylab = "Frequency",
ylim = c(0, max(table(wildlife_strike$damage_level))*1.1))
Distribution of Damage Level
Most of these strikes occurred during the day, followed by occurrences at night. One possible explanation for this trend could be the higher volume of air traffic during daylight hours, increasing the likelihood of interactions between aircraft and wildlife.
There were also more reports of serious damage to aircraft when no warning was issued compared to instances where warnings were given. This suggests that the absence of warnings may lead to a higher risk of severe damage, possibly due to a lack of preparedness or awareness among flight crews.
time_warning <- table(wildlife_strike$warned, wildlife_strike$time_of_day)
# Create the data frame
time_warning <- as.data.frame(time_warning)
# Rename the columns
colnames(time_warning) <- c("Warning", "Time_of_day", "Frequency")
# Create the bar plot
ggplot(time_warning, aes(x = Time_of_day, y = Frequency, fill = Warning)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("No" = "lightblue", "Yes" = "gray")) +
  labs(title = "Timing of Strike",
       x = "Time of Day",
       y = "Frequency",
       fill = "Warning") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))
# Create a new variable 'damage'
wildlife_strike$damage <- ifelse(wildlife_strike$damage_level %in% c('D', 'M', 'M?', 'S'), 'damage',
ifelse(wildlife_strike$damage_level == 'N', 'no damage', NA))
warning_damage <- table(wildlife_strike$warned, wildlife_strike$damage)
# Create the data frame
warning_damage_df <- as.data.frame(warning_damage)
# Rename the columns
colnames(warning_damage_df) <- c("Warning", "Damage", "Frequency")
# Create the bar plot
ggplot(warning_damage_df, aes(x = Warning, y = Frequency, fill=Damage)) +
scale_fill_manual(values = c("damage" = "lightblue", 'no damage'='gray'))+
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Warning Vs Damage",
x = "Warning",
y = "Frequency",
fill = "Damage") +
theme_gray() +
theme(plot.title = element_text(hjust = 0.5))
We analyze the distribution of incidents across different airports or states. Are there particular regions with higher incident rates? Are there any geographical patterns?
Among the airports with the highest frequencies of damaging incidents, Sacramento International Airport, Salt Lake City International Airport, and John F. Kennedy International Airport stand out. This may be due to several factors such as high passenger traffic, weather conditions, and hub status. JFK International Airport accounted for 301 damaging strikes from 1990 to 2024; it is one of the busiest airports in the United States and serves as a major international gateway to New York City, and high passenger volumes and diverse flight operations could increase incident frequency. Salt Lake City International Airport also serves as a hub for several airlines, leading to a high volume of flights and passenger traffic. All of these airports are among the top 35 largest hubs and busiest airports in the US (List of the Busiest Airports in the United States, 2024).
# Location Analysis
# Distribution of incidents across different airports
airport_counts <- table(wildlife_strike$airport, wildlife_strike$damage)
# Create the data frame
airport_damage <- as.data.frame(airport_counts)
# Rename the columns
colnames(airport_damage) <- c("Airport", "Damage_status", "Frequency")
# Subset airport_damage to keep only rows where damage occurred (Damage_status != 'no damage')
airport_damage <- airport_damage[airport_damage$Damage_status != 'no damage', ]
airport_damage <- airport_damage %>% arrange(desc(Frequency))
# Select top 10 airports
# Plot distribution of incidents across top airports
ggplot(data = head(airport_damage,10), aes(y = Frequency, x=Airport)) +
geom_bar(stat = 'identity') +
coord_flip() +
labs(title = "Distribution of Damage Strike Across Top 10 Airports",
x = "Airport",
y = "Frequency") +
theme_classic() + theme(legend.position = 'None', plot.title = element_text(hjust = 0.5))
Based on the strike frequency by state, we can observe that certain states have higher frequencies of damaging wildlife strike incidents than others. Florida (FL) stands out with 541 reported incidents, suggesting a significant occurrence of wildlife strikes; this could be attributed to the presence of major airports like Miami International Airport and Orlando International Airport, which are bustling transportation hubs with substantial air traffic. Similarly, California (CA) records a substantial strike frequency of 480 incidents. Texas (TX) follows closely with 359 reported incidents, possibly reflecting the activity at airports like Dallas/Fort Worth International Airport and George Bush Intercontinental Airport in Houston. New York (NY) and Illinois (IL) also exhibit high strike frequencies of 232 and 179 incidents, respectively, potentially linked to the significant air traffic at airports such as John F. Kennedy International Airport and O'Hare International Airport. Additionally, states like Pennsylvania (PA), New Jersey (NJ), Missouri (MO), Ohio (OH), and Michigan (MI) show notable strike frequencies ranging from 112 to 161 incidents.
Other factors may also be associated with high strike frequencies. Ecological factors: states with diverse ecosystems, such as Florida and California, may experience higher wildlife strike frequencies due to abundant wildlife habitats, including wetlands, forests, and coastal areas; these habitats attract various bird species and wildlife, increasing the potential for aircraft collisions. Urbanization and development: high-traffic states like New York and Illinois often have extensive urban development near airports, leading to habitat fragmentation and altered wildlife behavior.
# Distribution of incidents across different states
state_freq <- wildlife_strike %>% filter(damage == 'damage') %>% na.omit()%>%
group_by(state) %>% summarise(strike_frequency = n())
# plot (plot_usmap() is from the usmap package)
map_plot <- plot_usmap(data = state_freq, values = "strike_frequency", labels = TRUE) +
labs(title = 'Number of Strike across US State')
map_plot
We investigate the phase of flight during which incidents occur. Are certain phases associated with higher incident rates?
The phase-of-flight breakdown reveals that most strikes occur during the arrival phases (Approach, Landing Roll, and Descent) compared with the departure phases (Take-off Run and Climb).
# ----- PHASE OF FLIGHT -----
flight_phase <- wildlife_strike %>% na.omit() %>% group_by(phase_of_flight) %>%
summarise(n = n()) %>% arrange(desc(n))
#flight_phase
#plot
ggplot(flight_phase, aes(x = phase_of_flight, y=n)) +
geom_bar(stat = 'identity', fill='gray') +
labs(title = 'Number of Strike Vs Phase of Flight',
x = 'Phase of Flight',
y = 'Number Strike') +
theme_gray() + theme(legend.position = 'None',
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1))
Of the roughly 298,246 strikes reported, birds account for more than 50% of occurrences. The top 10 species that cause damage include nine kinds of birds (hawks, gulls, Canada goose). The white-tailed deer, which is not a bird, also appears among the top 10 damage-causing species.
# ---- SPECIES ------
filter_damage <- wildlife_strike %>% filter(damage_level!="N")
species_freq <- filter_damage %>% group_by(species) %>%
summarise(n = n()) %>% arrange(desc(n))
#species_freq
#plot
ggplot(head(species_freq,10), aes(y = species, x=n)) +
geom_bar(stat = 'identity', fill='gray') +
labs(title = 'Top 10 Species that Cause Damage',
     x = 'Frequency',
     y = 'Species') +
theme_classic() + theme(legend.position = 'None',
plot.title = element_text(hjust = 0.5))
Top 10 Species that Cause Damage (including the white-tailed deer and Canada goose)
A closer look at the number of animals struck reveals that, for most of the strikes that destroyed the aircraft beyond repair, only one animal was recorded, which is quite interesting. The white-tailed deer stands out as a species associated with damage severe enough that the aircraft could not be repaired.
#--- Number STRUCK ----
table(wildlife_strike$num_struck)
##
## 1 11-100 2-10 More than 100
## 264055 1538 31943 60
table(wildlife_strike$damage_level, wildlife_strike$num_struck)
##
## 1 11-100 2-10 More than 100
## D 51 2 21 0
## M 7107 106 1361 8
## M? 5777 94 1287 4
## N 151854 961 20174 30
## S 3071 119 1029 8
# species that destroy the aircraft
species_d <- wildlife_strike %>% filter(damage_level=='D' & num_struck==1) %>%
select(species)
species_d %>% group_by(species) %>% summarise(n=n()) %>% arrange(desc(n))
We examine the sky conditions and precipitation during incidents. Are there any weather patterns associated with incidents?
Upon closer examination of the data, we observe that the distribution of damage levels varies across different weather conditions. In general, incidents are more frequent under 'No Cloud' and 'Some Cloud' conditions than under 'Overcast'. Despite the higher frequency of incidents under 'No Cloud' conditions, a large proportion of them result in 'N' (no damage) outcomes, indicating that many encounters with wildlife do not lead to significant damage. Overall, there seems to be no clear-cut pattern linking weather conditions directly to the level of damage in wildlife strike incidents.
# Environmental conditions
table(wildlife_strike$sky, wildlife_strike$damage_level)
##
## D M M? N S
## No Cloud 30 3039 2622 57700 1436
## Overcast 2 1042 816 19082 588
## Some Cloud 13 1880 1709 39883 926
# precipitation
table(wildlife_strike$precipitation, wildlife_strike$damage_level)
##
## D M M? N S
## Fog 2 147 116 1951 82
## Fog, None 0 0 0 5 0
## Fog, Rain 0 21 16 256 5
## Fog, Rain, Snow 0 1 0 5 0
## Fog, Snow 0 0 0 14 1
## None 42 5247 4455 102924 2556
## None, Rain 0 0 2 14 1
## None, Rain, Snow 0 0 0 1 0
## None, Snow 0 0 6 26 0
## Rain 0 345 268 6211 206
## Rain, Snow 0 5 2 19 0
## Snow 0 28 27 378 16
The mean height is 644.2, with values ranging from a minimum of 0 to a maximum of 29,000. The record corresponding to a height of 29,000 involved an aircraft travelling at a speed of 270 during the month of June; however, no damage occurred. The mean speed is 131 and the median is 130. There appear to be outliers present in our dataset.
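The figures quoted above can be reproduced with simple summaries; the sketch below (assuming dplyr is loaded) also pulls out the record at the maximum reported height.
# Summary statistics for height and speed
summary(wildlife_strike$height)
summary(wildlife_strike$speed)
# Details of the record with the maximum reported height
wildlife_strike %>%
  filter(height == max(height, na.rm = TRUE)) %>%
  select(incident_month, height, speed, damage_level)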
# ------ HEIGHT & SPEED --------
wildlife_na_omit <- wildlife_strike %>% na.omit()
# Set up the plotting layout
par(mfrow = c(2, 1))
# Create Histogram
hist(wildlife_na_omit$height, main = "Distribution of Height", xlab = "Height")
hist(wildlife_na_omit$speed, main = "Distribution of Speed", xlab = "Speed")
par(mfrow = c(1, 1))
From our exploratory analysis, there appears to be some form of association between the time of day and the damage level. Hence we use a chi-square test to examine the relationship between the time of day (Dawn, Day, Dusk, Night) and whether damage is caused by wildlife strikes. We group all damage levels (D, M, M?, S) as 'damage' and N as 'no damage'.
# Create a new variable 'damage'
wildlife_strike$damage <- ifelse(wildlife_strike$damage_level %in% c('D', 'M', 'M?', 'S'), 'damage',
ifelse(wildlife_strike$damage_level == 'N', 'no damage', NA))
time_damage <- table(wildlife_strike$time_of_day, wildlife_strike$damage)
time_damage
##
## damage no damage
## Dawn 542 4609
## Day 9415 80731
## Dusk 876 5893
## Night 5491 43627
Hypothesis
(\(H_0\)): The time of day is independent of whether there will be damage or not.
(\(H_1\)): The time of day is dependent on whether there will be damage or not (our claim).
Test Value
# Perform the chi-square test
chisq_1 <- chisq.test(time_damage)
# Print the chi-square test result
chisq_1
##
## Pearson's Chi-squared test
##
## data: time_damage
## X-squared = 51.821, df = 3, p-value = 3.27e-11
Since the p-value is less than alpha at 0.05, the decision is to reject the null hypothesis and conclude that there is enough evidence to support our claim that there is a significant relationship between the time of day and whether wildlife strikes cause damage.
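To see which cells drive this association, the standardized residuals of the test above can be inspected; a quick follow-up sketch (absolute values above roughly 2 indicate notable departures from independence):
# Standardized residuals of the time-of-day vs damage chi-square test
round(chisq_1$stdres, 2)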
Another observation was that strikes with prior warning appeared to cause less damage than those without warning. Hence we use a chi-square test of independence to examine the association between the warning issued (prior warning or no warning) and the damage caused by wildlife strikes, to check whether our claim is significant.
warning_damage <- table(wildlife_strike$damage, wildlife_strike$warned)
warning_damage
##
## No Unknown Yes
## damage 7416 8756 3993
## no damage 51876 79690 41729
Hypothesis
(\(H_0\)): The warning issued is independent of whether damage occurs.
(\(H_1\)): The warning issued is dependent on whether damage occurs (our claim).
Test Value
# Perform the chi-square test
chisq_2 <- chisq.test(warning_damage)
# Print the chi-square test result
chisq_2
##
## Pearson's Chi-squared test
##
## data: warning_damage
## X-squared = 441.71, df = 2, p-value < 2.2e-16
The decision is to reject the null hypothesis, since the p-value is well below alpha at 0.05; hence we conclude that there is significant evidence to support our claim that whether damage occurs depends on whether a prior warning was issued.
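Because the sample size is very large, even a weak association can be highly significant, so it is worth gauging the strength of the association as well; a sketch computing Cramér's V from the test above:
# Cramér's V for the warning vs damage table
n <- sum(warning_damage)
cramers_v <- sqrt(chisq_2$statistic / (n * (min(dim(warning_damage)) - 1)))
unname(cramers_v)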
To compare the mean number of strikes throughout the approach, landing roll, and takeoff run phases of flight, we plan to do ANOVA testing. We will use one-way ANOVA tests to determine if there are statistically significant differences in strike frequency between the phases of flight.
Hypothesis
(\(H_0\)): There is no significant difference in the mean number of strikes between the approach, landing roll, and takeoff run phases of flight.
(\(H_1\)): There is a significant difference in the mean number of strikes between at least two of the flight phases (approach, landing roll, and takeoff run).
# Perform one-way ANOVA test
# (note: the response here is incident_month, so the model compares the mean
#  incident month across the phases of flight)
anova_result <- aov(incident_month ~ phase_of_flight, data = wildlife_strike)
# Print the summary of ANOVA test
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## phase_of_flight 11 819 74.49 9.292 <2e-16 ***
## Residuals 183271 1469195 8.02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 114963 observations deleted due to missingness
The decision is to reject the null hypothesis, since the p-value is less than alpha at 0.05; hence we conclude that there are statistically significant differences in strike frequency between the phases of flight.
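As a possible follow-up to the significant ANOVA, pairwise Tukey HSD comparisons indicate which pairs of flight phases differ on the response fitted above; a brief sketch using the anova_result object:
# Pairwise (Tukey HSD) comparisons between phases of flight for the fitted model
tukey_result <- TukeyHSD(anova_result)
head(tukey_result$phase_of_flight)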
We proceed to use logistic regression, lasso, and ridge to model our data. We split the data into 70% training and 30% testing sets. Before that, we perform data cleaning.
Dealing with missing values: There were many missing values in the dataset. We tried different imputation methods such as mice, missForest, missRanger, and kNN, but none of them was feasible because of the computation time required, so we dropped the missing values entirely from the dataset.
Dropping unwanted variables: We dropped variables with over 50 distinct values and other variables that were not of interest in the regression analysis.
Dealing with outliers: As seen in the histograms of height and speed, outliers are present in our dataset. We tried a log transformation, but it did not work well with our data. We also ran a trial with these values dropped, but the AIC increased, so we ultimately decided to keep the outliers as they may be relevant.
# convert target variable as factor
# omit missing values
reg_data <- wildlife_strike %>% na.omit()
# convert values target column to 1,0
#reg_data$damage <- ifelse(reg_data$damage == "damage", 1, 0)
reg_data$damage <- as.factor(reg_data$damage)
# remove variables with over 50 distinct values
reg_data <- reg_data %>% select(-c(state,airport,operator,species,aircraft))
# drop the weather variables and damage_level
reg_data <- reg_data %>% select(-c(sky, precipitation,damage_level, faaregion))
# Set seed and split the data (sample.split() is from the caTools package)
set.seed(123)
split <- sample.split(reg_data$damage, SplitRatio = 0.70) # 70/30 split on the target
train <- subset(reg_data, split == TRUE) # get the train data
test <- subset(reg_data, split == FALSE) # get the test data
x_train = model.matrix(damage~., train)[,-1] # get x_train
x_test = model.matrix(damage~., test)[,-1] #get x_test
# Get the target variables
y_train = train$damage
y_test = test$damage
We perform a backward stepwise selection procedure to identify the most parsimonious model by iteratively removing variables that do not significantly contribute to the model's performance. The backward stepwise selection concludes with a final model that includes only the variables incident_year, time_of_day, latitude, phase_of_flight, height, speed, distance, warned, num_struck, and size. This final model has an AIC of 16104.35, suggesting that these predictors are the most significant in predicting the damage outcome variable.
# Perform logistic regression analysis
logit_model <- glm(damage ~.,
data = train, family = binomial)
# Perform stepwise selection (backward)
stepwise_model <- step(logit_model, direction = "backward")
## Start: AIC=16107.21
## damage ~ incident_month + incident_year + time_of_day + latitude +
## longitude + phase_of_flight + height + speed + distance +
## warned + num_struck + size
##
## Df Deviance AIC
## - incident_month 1 16057 16105
## - longitude 1 16058 16106
## <none> 16057 16107
## - time_of_day 3 16072 16116
## - incident_year 1 16068 16116
## - distance 1 16071 16119
## - latitude 1 16072 16120
## - warned 2 16081 16127
## - height 1 16082 16130
## - speed 1 16137 16185
## - phase_of_flight 7 16319 16355
## - num_struck 3 16478 16522
## - size 2 19762 19808
##
## Step: AIC=16105.28
## damage ~ incident_year + time_of_day + latitude + longitude +
## phase_of_flight + height + speed + distance + warned + num_struck +
## size
##
## Df Deviance AIC
## - longitude 1 16058 16104
## <none> 16057 16105
## - incident_year 1 16068 16114
## - time_of_day 3 16072 16114
## - distance 1 16071 16117
## - latitude 1 16072 16118
## - warned 2 16081 16125
## - height 1 16082 16128
## - speed 1 16137 16183
## - phase_of_flight 7 16319 16353
## - num_struck 3 16478 16520
## - size 2 19776 19820
##
## Step: AIC=16104.35
## damage ~ incident_year + time_of_day + latitude + phase_of_flight +
## height + speed + distance + warned + num_struck + size
##
## Df Deviance AIC
## <none> 16058 16104
## - incident_year 1 16069 16113
## - time_of_day 3 16073 16113
## - distance 1 16072 16116
## - latitude 1 16075 16119
## - warned 2 16082 16124
## - height 1 16084 16128
## - speed 1 16139 16183
## - phase_of_flight 7 16320 16352
## - num_struck 3 16478 16518
## - size 2 19779 19821
# View the selected model
summary(stepwise_model)
##
## Call:
## glm(formula = damage ~ incident_year + time_of_day + latitude +
## phase_of_flight + height + speed + distance + warned + num_struck +
## size, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.620e+01 4.982e+00 -3.251 0.001152 **
## incident_year 8.026e-03 2.476e-03 3.242 0.001188 **
## time_of_dayDay -5.080e-02 1.091e-01 -0.465 0.641609
## time_of_dayDusk -1.396e-01 1.395e-01 -1.001 0.317005
## time_of_dayNight 1.368e-01 1.165e-01 1.174 0.240437
## latitude -1.165e-02 2.902e-03 -4.013 5.99e-05 ***
## phase_of_flightClimb -5.551e-01 6.192e-02 -8.966 < 2e-16 ***
## phase_of_flightDescent 6.243e-01 2.381e-01 2.622 0.008748 **
## phase_of_flightLanding Roll 6.117e-01 6.622e-02 9.238 < 2e-16 ***
## phase_of_flightLocal -6.488e-01 3.664e-01 -1.771 0.076632 .
## phase_of_flightParked 1.159e+01 1.404e+02 0.083 0.934217
## phase_of_flightTake-off Run -7.616e-02 5.879e-02 -1.295 0.195187
## phase_of_flightTaxi 1.440e+00 5.000e-01 2.880 0.003971 **
## height -1.235e-04 2.411e-05 -5.124 3.00e-07 ***
## speed 6.025e-03 6.711e-04 8.978 < 2e-16 ***
## distance -2.330e-02 6.249e-03 -3.729 0.000192 ***
## warnedUnknown -1.043e-01 5.641e-02 -1.849 0.064409 .
## warnedYes 1.788e-01 4.884e-02 3.660 0.000252 ***
## num_struck11-100 -2.154e+00 1.319e-01 -16.326 < 2e-16 ***
## num_struck2-10 -7.909e-01 5.097e-02 -15.516 < 2e-16 ***
## num_struckMore than 100 -3.192e+00 6.088e-01 -5.243 1.58e-07 ***
## sizeMedium 1.626e+00 5.345e-02 30.409 < 2e-16 ***
## sizeSmall 3.277e+00 5.708e-02 57.414 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20848 on 31110 degrees of freedom
## Residual deviance: 16058 on 31088 degrees of freedom
## AIC: 16104
##
## Number of Fisher Scoring iterations: 12
Based on the confusion matrix, the logistic regression model performs well in predicting the no-damage class. The specificity is 98.30%, which means the model correctly identifies 98.30% of the non-damage cases. However, the model does not perform well in predicting the damage class. The sensitivity, or recall, is 20.79%, indicating that the model correctly identifies only 20.79% of the true damage cases; this is quite low, suggesting the model struggles to correctly predict actual damage cases. The precision (positive predictive value) is 58.82%, meaning that when the model predicts damage, it is correct 58.82% of the time. The F1 score (0.307) is low, suggesting that the model's overall performance in predicting damage cases is not strong.
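For reference, the metrics quoted in this section are computed from the confusion-matrix counts (true positives TP, false positives FP, true negatives TN, false negatives FN) in the standard way:
\[
\text{Sensitivity (Recall)} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},
\]
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]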
# predict model
probabilities <- stepwise_model %>% predict(test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'no damage', 'damage')
predicted = factor(predicted.classes)
expected = factor(test$damage)
# confusion matrix
library(caret)
cm = confusionMatrix(data=predicted, reference = expected,mode="everything",positive = "damage")
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction damage no damage
## damage 290 203
## no damage 1105 11736
##
## Accuracy : 0.9019
## 95% CI : (0.8967, 0.9069)
## No Information Rate : 0.8954
## P-Value [Acc > NIR] : 0.006818
##
## Kappa : 0.2672
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.20789
## Specificity : 0.98300
## Pos Pred Value : 0.58824
## Neg Pred Value : 0.91395
## Precision : 0.58824
## Recall : 0.20789
## F1 : 0.30720
## Prevalence : 0.10462
## Detection Rate : 0.02175
## Detection Prevalence : 0.03697
## Balanced Accuracy : 0.59544
##
## 'Positive' Class : damage
##
# roc curve
library(ROCR)
ROCRpred = prediction(as.numeric(probabilities), as.numeric(test$damage))
ROCRperf = performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))
Area Under the Curve: Logistic Regression
auc_val = performance(ROCRpred, "auc")@y.values[[1]] # get the auc value
auc_val
## [1] 0.8275785
The value of lambda.min = 0.000233374 is the value that minimizes the mean cross-validated error; it indicates the level of regularization needed to prevent over-fitting while retaining predictive accuracy. This suggests that prediction error is minimized when lambda = lambda.min. The value of lambda.1se = 0.005027889 is the largest lambda within one standard error of the lambda that minimizes the mean cross-validated error. The selection of the regularization parameter lambda (\(\lambda\)) is crucial, as it determines the strength of the penalty applied to the coefficients.
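For context, cv.glmnet fits the penalized binomial log-likelihood below, where the elastic-net mixing parameter \(\alpha = 1\) gives the lasso used here and \(\alpha = 0\) gives the ridge model used later:
\[
\min_{\beta_0, \beta} \; -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i (\beta_0 + x_i^{\top}\beta) - \log\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] + \lambda \Big[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \Big]
\]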
#library(glmnet)
# Find best lambda
lasso_model <- cv.glmnet(x = x_train, y = y_train, family = "binomial", alpha = 1, nfolds=10)
# print lambda values
lambda_min <- lasso_model$lambda.min
lambda_1se <- lasso_model$lambda.1se
print(lambda_min)
## [1] 0.000233374
print(lambda_1se)
## [1] 0.005027889
The figure below is the CV model plot. The y-axis represents the mean cross-validated error, the x-axis is the log of lambda (log(\(\lambda\))), and the numbers across the top of the plot give the number of non-zero coefficients in the model for each value of lambda. The second dotted line marks the largest lambda within one standard error of the minimum error, i.e. log(\(\lambda_{\text{1se}}\)) = -5.292755. This lambda value retains 13 predictor variables in the model, indicating that 11 predictors were dropped during regularization. Similarly, the first dotted line marks the minimum value of lambda (\(\lambda_{\text{min}}\)), at which all 24 non-zero coefficients of the lasso model are retained. The red dots denote the error estimates, and the bars around them are confidence intervals for those estimates.
# cv plot
plot(lasso_model)
Cross validation Lasso Model Plot
coef(lasso_model)
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -2.596224e-01
## incident_month .
## incident_year 3.428495e-04
## time_of_dayDay .
## time_of_dayDusk .
## time_of_dayNight .
## latitude -4.851787e-03
## longitude .
## phase_of_flightClimb -3.428448e-01
## phase_of_flightDescent .
## phase_of_flightLanding Roll 3.517995e-01
## phase_of_flightLocal .
## phase_of_flightParked .
## phase_of_flightTake-off Run .
## phase_of_flightTaxi .
## height -1.068757e-05
## speed 9.464297e-04
## distance -8.424730e-03
## warnedUnknown .
## warnedYes 7.873626e-02
## num_struck11-100 -1.743200e+00
## num_struck2-10 -6.012786e-01
## num_struckMore than 100 -1.792708e+00
## sizeMedium 1.293603e+00
## sizeSmall 2.930979e+00
Based on the confusion matrix and statistics, the Lasso regression model’s performance in predicting damage and no damage cases is evaluated as follows:
The sensitivity for detecting damage cases is 9.46%, indicating that the model correctly identifies only 9.46% of the true damage cases. This is very low, suggesting poor performance in predicting actual damage cases.
The specificity for detecting no-damage cases is 99.38%, meaning the model correctly identifies 99.38% of the non-damage cases. The precision for damage cases is 64.08%, meaning that when the model predicts damage, it is correct 64.08% of the time.
Negative Predictive Value: The NPV is 90.38%, indicating that when the model predicts no damage, it is correct 90.38% of the time.
The F1 score for damage cases is 0.1649 which is very low indicating the model’s poor performance in predicting damage cases.
# Predict probabilities at lambda.1se (the `s` argument selects the lambda value)
lasso_test_pred <- predict(lasso_model, newx = x_test, type = "response", s = lambda_1se)
# Convert probabilities to class predictions
lasso_predictions <- ifelse(lasso_test_pred > 0.5, "no damage", "damage")
lasso_predicted = factor(lasso_predictions)
lasso_expected = factor(y_test)
# confusion matrix
#library(caret)
lasso_cm = confusionMatrix(reference = lasso_expected,data=lasso_predicted,
mode="everything",positive = "damage")
lasso_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction damage no damage
## damage 132 74
## no damage 1263 11865
##
## Accuracy : 0.8997
## 95% CI : (0.8945, 0.9048)
## No Information Rate : 0.8954
## P-Value [Acc > NIR] : 0.05121
##
## Kappa : 0.1418
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.09462
## Specificity : 0.99380
## Pos Pred Value : 0.64078
## Neg Pred Value : 0.90379
## Precision : 0.64078
## Recall : 0.09462
## F1 : 0.16490
## Prevalence : 0.10462
## Detection Rate : 0.00990
## Detection Prevalence : 0.01545
## Balanced Accuracy : 0.54421
##
## 'Positive' Class : damage
##
# roc curve
#library(ROCR)
ROCRpred = prediction(as.numeric(lasso_test_pred), as.numeric(y_test))
ROCRperf = performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))
Area Under the Curve: Lasso Regression
auc_val = performance(ROCRpred, "auc")@y.values[[1]] # get the auc value
auc_val
## [1] 0.8214061
For ridge regression, the value of lambda.min = 0.009869951 and lambda.1se = 0.01431961. The first dotted line marks the minimum value of lambda, i.e. log(\(\lambda_{\text{min}}\)), at which all 24 predictor variables are retained in the model. The second dotted line marks the largest lambda within one standard error of the minimum error, which also retains all 24 non-zero coefficients, since ridge regression shrinks coefficients toward zero but does not set them exactly to zero.
#------ Ridge Regression -----------
#library(glmnet)
# find the best lambda
ridge_model <- cv.glmnet(x = x_train, y = y_train, family = "binomial", alpha = 0, nfolds=10)
r_lambda_min <- ridge_model$lambda.min
r_lambda_1se <- ridge_model$lambda.1se
print(r_lambda_min)
## [1] 0.009869951
print(r_lambda_1se)
## [1] 0.01431961
plot(ridge_model)
Cross Validation Ridge Model Plot
coef(ridge_model)
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.276207e+01
## incident_month 4.362665e-03
## incident_year 6.678083e-03
## time_of_dayDay -3.175077e-02
## time_of_dayDusk -1.329438e-01
## time_of_dayNight 5.470069e-02
## latitude -1.118236e-02
## longitude 3.808538e-04
## phase_of_flightClimb -4.777069e-01
## phase_of_flightDescent 4.032262e-01
## phase_of_flightLanding Roll 4.262292e-01
## phase_of_flightLocal -4.557622e-01
## phase_of_flightParked 1.734372e+00
## phase_of_flightTake-off Run -6.828498e-02
## phase_of_flightTaxi 9.038400e-01
## height -8.416847e-05
## speed 4.248442e-03
## distance -2.102656e-02
## warnedUnknown -1.016258e-01
## warnedYes 1.498971e-01
## num_struck11-100 -1.738592e+00
## num_struck2-10 -6.120794e-01
## num_struckMore than 100 -2.719892e+00
## sizeMedium 9.889304e-01
## sizeSmall 2.418884e+00
Based on the confusion matrix and statistics, the performance of the Ridge regression model in predicting damage and no damage cases is as follows:
Sensitivity for detecting damage cases is quite low at 6.31%, indicating that the Ridge regression model correctly identifies only 6.31% of the true damage cases.
The specificity for detecting no-damage cases is high at 99.59%, indicating that the model correctly identifies 99.59% of the non-damage cases.
Precision for damage cases is 64.23%, meaning that when the model predicts damage, it is correct 64.23% of the time.
The F1 score, which balances precision and recall, is 11.49%, indicating that the model’s overall performance in predicting damage cases is not strong.
# Predict probabilities at lambda.1se (the `s` argument selects the lambda value)
ridge_test_pred <- predict(ridge_model, newx = x_test, type = "response", s = r_lambda_1se)
# Convert probabilities to class predictions
ridge_predictions <- ifelse(ridge_test_pred > 0.5, "no damage", "damage")
ridge_predicted = factor(ridge_predictions)
ridge_expected = factor(y_test)
# confusion matrix
ridge_cm = confusionMatrix(data=ridge_predicted, reference = ridge_expected,
mode="everything",positive = "damage")
ridge_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction damage no damage
## damage 88 49
## no damage 1307 11890
##
## Accuracy : 0.8983
## 95% CI : (0.8931, 0.9034)
## No Information Rate : 0.8954
## P-Value [Acc > NIR] : 0.1378
##
## Kappa : 0.098
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.06308
## Specificity : 0.99590
## Pos Pred Value : 0.64234
## Neg Pred Value : 0.90096
## Precision : 0.64234
## Recall : 0.06308
## F1 : 0.11488
## Prevalence : 0.10462
## Detection Rate : 0.00660
## Detection Prevalence : 0.01027
## Balanced Accuracy : 0.52949
##
## 'Positive' Class : damage
##
# roc curve
#library(ROCR)
ROCRpred = prediction(as.numeric(ridge_test_pred), as.numeric(y_test))
ROCRperf = performance(ROCRpred, 'tpr','fpr')
#score
auc_val = performance(ROCRpred, "auc")@y.values[[1]] # get the auc value
auc_val
## [1] 0.8266011
#plot
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))
Area Under the Curve: Ridge Regression
When we compare the area under the curve of all three models, logistic regression has the highest score, with a value of 0.8275785. Lasso regression recorded an AUC of 0.8214061 and ridge regression 0.8266011.
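A small tabulation of the AUC values reported above makes the comparison explicit:
# AUC values copied from the model outputs above
auc_comparison <- data.frame(
  model = c("Logistic", "Lasso", "Ridge"),
  auc   = c(0.8275785, 0.8214061, 0.8266011)
)
auc_comparison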
We continue our analysis of predicting the occurrence of damage using the three models: logistic, lasso, and ridge. Overall, all the models demonstrate strong performance in predicting no-damage cases but struggle to identify damage cases, as indicated by the low sensitivity and F1 scores. This may be due to the imbalance in the target variable. We therefore apply some additional techniques to improve model performance: standardizing the predictors, balancing the classes by random oversampling, and tuning the regularization parameters via cross-validation. Afterwards we compare the three models and select the best among them.
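The class imbalance referred to above can be checked directly before resampling; a quick sketch (the damage class makes up roughly 10% of records, consistent with the prevalence reported in the confusion matrices):
# Class distribution of the target variable before balancing
prop.table(table(reg_data$damage))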
reg_data2 <- reg_data %>%
mutate_if(is.character, as.factor)
# Ensure factor levels are valid R variable names
levels(reg_data2$damage) <- make.names(levels(reg_data2$damage))
# Standardize features
preProc <- preProcess(reg_data2[, -13], method = c('center','scale')) # remove the target variables
data_scaled <- predict(preProc, reg_data2[, -13])
sc_data <- cbind(data_scaled, reg_data2[, 13]) # combine the columns together
sc_data$damage <- sc_data$`reg_data2[, 13]`
# drop duplicated column
scale_data2 <- sc_data %>% select(-`reg_data2[, 13]`)
After standardizing our predictors, we perform the data splitting.
#------------------- split data ----------------
set.seed(123)
split <- sample.split(scale_data2$damage, SplitRatio = 0.70) # split the data
train <- subset(scale_data2, split == TRUE) # get the train data
test <- subset(scale_data2, split == FALSE) # get the test data
x_train = model.matrix(damage~., train)[,-1]
x_test = model.matrix(damage~., test)[,-1]
# y values
# Ensure y_train and y_test are factors with valid levels
y_train <- train$damage
y_test <- test$damage
#--------- Balancing the data ---------------
# Random oversampling using ROSE package
#install.packages("ROSE")
#library(ROSE)
# Perform random oversampling
oversampled_data <- ROSE(damage ~ ., data = train, seed = 123)$data
x_train = model.matrix(damage~., oversampled_data)[,-1]
y_train <- oversampled_data$damage
#-------- Parameter Tuning ----------
train_control <- trainControl(method = "cv", number = 10)
logistic_model <- train(damage ~ ., data = oversampled_data, method = "glm",
family = "binomial", trControl = train_control)
#summary(logistic_model)
log_pred_prob <- predict(logistic_model, newdata = test, type = "prob")[,2]
# Convert probabilities to binary predictions for confusion matrix
log_pred_class <- ifelse(log_pred_prob > 0.5, 'damage', 'no.damage')
log_pred_class <- as.factor(log_pred_class)
# Ensure levels of test$damage match those in log_pred_class
test$damage <- factor(test$damage, levels = levels(log_pred_class))
# Calculate confusion matrix
conf_matrix_log <- confusionMatrix(log_pred_class, test$damage, mode = 'everything')
print(conf_matrix_log)
## Confusion Matrix and Statistics
##
## Reference
## Prediction damage no.damage
## damage 1050 2797
## no.damage 345 9142
##
## Accuracy : 0.7644
## 95% CI : (0.7571, 0.7715)
## No Information Rate : 0.8954
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2919
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75269
## Specificity : 0.76573
## Pos Pred Value : 0.27294
## Neg Pred Value : 0.96363
## Precision : 0.27294
## Recall : 0.75269
## F1 : 0.40061
## Prevalence : 0.10462
## Detection Rate : 0.07875
## Detection Prevalence : 0.28851
## Balanced Accuracy : 0.75921
##
## 'Positive' Class : damage
##
#-------------- Lasso Reg ----------
# Define a grid of lambda values
lambda_grid <- 10^seq(-4, 1, length = 100)
grid <- expand.grid(alpha = 1, lambda = lambda_grid)
# Set up train control for cross-validation
train_control <- trainControl(method = "cv", number = 10, classProbs = TRUE,
summaryFunction = twoClassSummary)
# Train the LASSO model using grid search
set.seed(123)
lasso_model <- train(x = x_train, y = y_train,
method = "glmnet",
trControl = train_control,
tuneGrid = grid,
metric = "ROC",
family = "binomial")
# Print the best tuning parameters
print(lasso_model$bestTune)
## alpha lambda
## 13 1 0.0004037017
# Predict on the test set
lasso_pred2 <- predict(lasso_model, newdata = x_test, type = "prob")[,2]
# Convert probabilities to binary predictions
lasso_pred_class <- ifelse(lasso_pred2 > 0.5, 'damage', 'no.damage')
lasso_pred_class <- as.factor(lasso_pred_class)
# Confusion Matrix
conf_matrix_las <- confusionMatrix(lasso_pred_class, y_test, mode = 'everything')
print(conf_matrix_las)
## Confusion Matrix and Statistics
##
## Reference
## Prediction damage no.damage
## damage 1050 2801
## no.damage 345 9138
##
## Accuracy : 0.7641
## 95% CI : (0.7568, 0.7712)
## No Information Rate : 0.8954
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2915
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75269
## Specificity : 0.76539
## Pos Pred Value : 0.27266
## Neg Pred Value : 0.96362
## Precision : 0.27266
## Recall : 0.75269
## F1 : 0.40030
## Prevalence : 0.10462
## Detection Rate : 0.07875
## Detection Prevalence : 0.28881
## Balanced Accuracy : 0.75904
##
## 'Positive' Class : damage
##
#--------------- Ridge Regression ------------------
grid2 <- expand.grid(alpha = 0, lambda = lambda_grid)
# Set up train control for cross-validation
train_control2 <- trainControl(method = "cv", number = 10, classProbs = TRUE,
summaryFunction = twoClassSummary)
# Train the RIDGE model using grid search
set.seed(123)
ridge_model <- train(x = x_train, y = y_train,
method = "glmnet",
trControl = train_control2,
tuneGrid = grid2,
metric = "ROC",
family = "binomial")
# Print the best tuning parameters
print(ridge_model$bestTune)
## alpha lambda
## 48 0 0.02364489
# Predict on the test set
ridge_pred2 <- predict(ridge_model, newdata = x_test, type = "prob")[,2]
# Convert probabilities to binary predictions
ridge_pred_class <- ifelse(ridge_pred2 > 0.5, 'damage', 'no.damage')
ridge_pred_class <- as.factor(ridge_pred_class)
# Confusion Matrix
conf_matrix_r <- confusionMatrix(ridge_pred_class, y_test, mode = 'everything')
print(conf_matrix_r)
## Confusion Matrix and Statistics
##
## Reference
## Prediction damage no.damage
## damage 1069 2962
## no.damage 326 8977
##
## Accuracy : 0.7534
## 95% CI : (0.746, 0.7607)
## No Information Rate : 0.8954
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2825
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.76631
## Specificity : 0.75191
## Pos Pred Value : 0.26519
## Neg Pred Value : 0.96496
## Precision : 0.26519
## Recall : 0.76631
## F1 : 0.39403
## Prevalence : 0.10462
## Detection Rate : 0.08017
## Detection Prevalence : 0.30231
## Balanced Accuracy : 0.75911
##
## 'Positive' Class : damage
##
After standardizing the data, balancing the classes, and performing parameter tuning, the performance of the models improved. Although specificity decreased compared with the initial models, sensitivity increased for all three models.
Logistic Regression: The model correctly predicted 1,050 damage cases and 9,142 no-damage cases, compared with the first model, which correctly predicted 290 damage cases and 11,736 no-damage cases. Sensitivity improved from 20.79% to 75.27%, and the F1 score increased from 30.7% to 40.1%.
Lasso Regression: The model correctly predicted 1,050 damage cases and 9,138 no-damage cases, compared with the initial lasso model, which correctly predicted 132 damage cases and 11,865 no-damage cases. Sensitivity improved from 9.46% to 75.27%, and the F1 score increased from 16.5% to 40.0%.
Ridge Regression: The model correctly predicted 1,069 damage cases and 8,977 no-damage cases, compared with the initial ridge model, which correctly predicted 88 damage cases and 11,890 no-damage cases. Sensitivity improved from 6.31% to 76.63%, and the F1 score increased from 11.5% to 39.4%.
metric_table <- data.frame(
"Logistic Regression" = conf_matrix_log$byClass,
"Lasso Regression" = conf_matrix_las$byClass,
"Ridge Regression" = conf_matrix_r$byClass
)
metric_table
Among the three models, logistic regression has scores similar to lasso regression in terms of sensitivity, specificity, recall, and F1. Comparing these values, logistic regression has the highest F1 score. This suggests that logistic regression might be the best choice among the three models: since it has the highest F1 score, it strikes a good balance between avoiding false positives and capturing true positives.
The aviation industry faces numerous challenges to ensure passenger safety and operational efficiency. One such challenge is the occurrence of wildlife strike incidents, where aircraft collide with birds or other animals during flight operations. These incidents can result in significant damage to aircraft, pose risks to passengers and crew, and lead to financial losses for airlines. Through this report, we have come to understand the factors that contribute to wildlife strikes and their severity.
By analyzing wildlife strike data, through the exploratory data analysis and other statistical analysis, we have gained valuable insights into the patterns of wildlife strikes across different time frames, species, and conditions. We observed that the spring and summer seasons accounted for a higher frequency of strikes compared to winter and fall. This trend aligns with factors such as breeding and nesting seasons for birds, increased migration activity, and heightened air traffic during peak travel seasons.
Furthermore, our analysis revealed that the majority of wildlife strikes occurred during the day, indicating that daylight hours are associated with a higher risk of these incidents. However, prior warning to the aircraft operator can help reduce the severity of the strike. Another observation is that most strikes occur during the arrival phases rather than the departure phases. The white-tailed deer is among the animals that cause severe damage to aircraft, and birds such as the Canada goose and hawks also cause significant damage.
Using the chi-square test of independence to test for an association between the time of day and the presence of damage, and between prior warning and damage, the results were significant in both cases. Our one-way ANOVA also showed statistically significant differences among the phases of flight.
We predicted the occurrence of damage using three different models: Logistic Regression, Lasso Regression, and Ridge Regression. Initially, the models demonstrated low sensitivity and high specificity. After applying standardization, balancing, and parameter tuning, the performance of all three models improved significantly. Among them, Logistic Regression emerged as the best model for predicting the occurrence of damage or no damage with the highest F1 score.
Variables such as incident_year, time_of_day, distance, location (latitude), whether a warning was issued, height of aircraft, speed of aircraft, phase of flight, number of animals struck, and size of species were used in the prediction model.
FAA Wildlife Strike Database. (n.d.). https://wildlife.faa.gov/search
Wickham, H., & Grolemund, G. (n.d.). 7 Exploratory Data Analysis | R for Data Science. https://r4ds.had.co.nz/exploratory-data-analysis.html
Kabacoff, R. I. (2015). R in Action (2nd ed.). O'Reilly Online Learning. Retrieved April 17, 2024, from https://learning.oreilly.com/library/view/r-in-action/9781617291388/kindle_split_002.html
List of the busiest airports in the United States. (2024, May 10). Wikipedia. https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States#Busiest_U.S._airports_by_total_passenger_traffic
Making maps with R. (2021, October 13). http://jenrichmond.rbind.io/post/2021-10-13-making-maps-with-r/
Lunardon, N., Menardi, G., & Torelli, N. (2021, June 14). ROSE: Random Over-Sampling Examples. https://rdrr.io/cran/ROSE/man/ROSE-package.html
Wildlife Hazard Mitigation. (n.d.). Federal Aviation Administration. https://www.faa.gov/airports/airport_safety/wildlife