Every day, the Transportation Security Administration(TSA) performs millions of safety checks across U.S airports with efforts in maintaining the safety of travelers. Through these checks, TSA screens for explosives and other prohibited &/or dangerous items. However, airport security has not always been as long and rigorous of a process as it is today. TSA was established shortly after the 9/11 attacks and has greatly improved technologically over the years to ease the process for everyone. Despite improvements, TSA faces numerous of claims filled against them for issues such as passenger property loss, damage, and personal injuries, all reflecting on challenges still faced by the overall process.
The original dataset provides information about TSA claims filled from 2002-2015 but visualizations are from based of years 2004-2015. While dataset was originally obtained from Kaggle.com, original data can be found in the Homeland Security webpage. The full dataset contains a total of 204,269 observations with 13 columns: Claim Number, Date Received, Incident Date, Airport Code, Airport Name, Airline Name, Claim Type, Claim Site, Item, Claim Amount, Status, Close amount, and Disposition. The objective of the data collected is to analyze key trends and insights drawn from claims filed throughout the years.
From 2004-2015, the data reveals a continuous declining trend in the number of claims filled against TSA. This could be due to a variety of reasons, such as shifts in regulatory practices. As more airports and airlines increase security measures, the overall process continues to improve for both travelers and workers. Notably, the two sharpest declines are right after 2004 and 2008, suggesting that significant changes were implemented during this period. Upon further analysis of the data and considering the economic state of the country during these periods, the sharp decline after 2008 could be explained by the economic recession during the time. Overall, these trends illustrate how improvements in security measures, economic factors, and various of other changes can all influence the volume of claims filled against TSA.
library(data.table)
library(DescTools)
library(dplyr)
library(lubridate)
library(ggplot2)
library(ggthemes)
library(scales)
library(RColorBrewer)
library(viridis)
#Reads file
setwd("/Users/melitenezaca/R_datafiles")
filename = "clean_tsa_claims2.csv"
df <- fread(filename, na.strings =c(NA,""))
carcount <- data.frame(count(df,`Claim Type` )) #Creates df of claim type and n
carcount <- carcount[order(carcount$n, decreasing = TRUE), ] # sorts the claim type by high to lowest
#Gets the "-" and NAs and the row each is in
NoClaimType <- which(carcount$`Claim.Type` %like any% c("%-%"))
NARows <- which(is.na(carcount$`Claim.Type`))
NA_and_slash <- c(NoClaimType, NARows)
#Groups the lower categories of Claim Types
OtherClaim2 <- which(carcount$`Claim.Type` %like any% c("Passenger Theft", "Motor Vehicle", "Complaint", "Wrongful Death", "Compliment", "Bus Terminal"))
OtherClaim <- c(OtherClaim2) #Groups "-" and NA in one
BadTotal <- sum(carcount[NA_and_slash, "n"]) #Sums row # of Na and "-"
OtherClaimTotal <- sum(carcount[OtherClaim, "n"]) #Sums rows of lower categories
carcount <- carcount[- NA_and_slash, ]
carcount <- carcount[- OtherClaim]
carcount <- rbind(c("No Claim Type", BadTotal), carcount) #Adds no claim type to column
carcount <- rbind(c("Other Type", OtherClaimTotal), carcount) #Adds Other Claim Type to column
#Renames the row names numerically
rownames(carcount) <- c(1:nrow(carcount))
#--------------------Beginning of data visualization----------------------------
carcount$n <- as.numeric(carcount$n)
#Removes the part (0:00 or 12:50 ) from the date
df$`Incident Date` <- sub(" \\d{1,2}:\\d{2}$", "", df$`Incident Date`)
df$`Incident Date` <- parse_date_time(df$`Incident Date`,
orders = c("mdy", "dmy"),
quiet = TRUE)
df$`Incident Date` <- as.Date(df$`Incident Date`)
#Extracts the year
df$year <- year(ymd(df$`Incident Date`))
# Filters out the years to only use 2004-2015
df <- df %>%
filter(!is.na(year), year >= 2004, year <= 2015)
x_axis_labels <- min(df$year):max(df$year)
#-------------------------------Histogram---------------------------------------
#Graphs out the claims per year
p1 <- ggplot (df, aes(x = year)) +
geom_histogram(bins = 12, color= "darkblue", fill= "lightblue", size = 0.7) + # Bin border color & size + fill color
labs(title = "Histogram of Claims by Year", x = "Year", y = "Amount of Citations")+
scale_y_continuous(labels = comma) +
stat_bin(binwidth = 1, geom = 'text', color ='blue', aes(label = after_stat(count)), vjust=-0.5, size = 2.7) +
scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels ) +
theme(plot.title = element_text(hjust= 0.5, face = "bold"))
print(p1)
The following visualization presents the distribution of TSA claims across different categories of lost or damaged items where the black line represents the total amount of money claimed for each item category. The vertical axis represents the top 10 items damaged and ranges from “Luggage and Baggage” to smaller items such as “Locks” and “Jewelry”. The stacked bar plot segments represent the amount of times a claim was filed based off the item for each year. As seen from the previous section, 2004 represented by the aqua color, has the most amount of claims in each of the stacked bar charts. The graph effectively shows that while certain categories are more prone to damage, higher-value items such as computers have a higher payout despite having a lower total. This discrepancy underscores the importance in the value of item and how much one can claim. Such insights can be of use to both travelers and airlines to improve customer satisfaction and better manage expectations.
#---------------------data set up for stacked bar charts------------------------
#Cleans Item column
df <- df %>%
mutate(Item = sub(" -.*", "", Item)) %>%
mutate(Item = trimws(Item)) %>%
# Update Close Amount where Status is "Approve in Full" and Close Amount is NA
mutate(`Close Amount` = if_else(
is.na(`Close Amount`) & Status == "Approve in Full",
`Claim Amount`,
`Close Amount`
))
df <- df %>%
mutate(Item = case_when(
grepl("Alcoholic beverages", Item, ignore.case = TRUE) ~ "Alcoholic beverages",
grepl("Luggage|Baggage/Cases/Purses", Item, ignore.case = TRUE) ~ "Luggage and Baggage",
grepl("Cameras", Item, ignore.case = TRUE) ~ "Cameras",
grepl("Jewelry", Item, ignore.case = TRUE) ~ "Jewelry",
grepl("Computer", Item, ignore.case = TRUE) ~ "Computer",
TRUE ~ Item # Keep other items as they are
))
df_reasons <- count(df, Item) #Item name and n
df_reasons <- df_reasons [order(df_reasons$n, decreasing = TRUE), ] #Prints out the amount in from highest to lowest
top_reasons <- df_reasons$Item[!df_reasons$Item %in% c("-", NA)][1:10] #Excludes "-" and NA from the list
new_df <- df %>%
filter(Item %in% top_reasons) %>%
select(`Incident Date`, Item) %>%
mutate(year = year(ymd(`Incident Date`))) %>%
group_by(Item, year) %>%
summarise(n = length(Item), .groups = 'keep') %>%
data.frame()
other_df <- df %>%
filter(!Item %in% top_reasons) %>%
select(`Incident Date`) %>%
mutate(year = year(ymd(`Incident Date`)), Item = "Other") %>%
group_by(Item, year) %>%
summarise(n =length(Item), .groups = 'keep') %>%
data.frame()
new_df <- rbind(new_df, other_df)
agg_tot <- new_df %>% # Item Description and Total
select(Item, n) %>%
group_by(Item) %>%
summarise(tot = sum(n), .groups = 'keep') %>%
data.frame()
claims_df <- df %>%
filter(Item %in% top_reasons) %>%
select(Item, `Close Amount`) %>%
group_by(Item) %>%
summarise(totclaims = sum(`Close Amount`, na.rm =TRUE)) %>%
data.frame()
# ---------------------------------Stacked Bar Plot with Labels-----------------
new_df$year <-as.factor(new_df$year)
#Combining palettes
set3_colors <- brewer.pal(n = 12, name = "Set3") # Maximum of 12 colors
accent_colors <- brewer.pal(n = 8, name = "Accent") # Maximum of 8 colors
combined_colors <- c(set3_colors, accent_colors)
library(plyr)
max_y <- round_any(max(agg_tot$tot), 95000, ceiling)
new_df$year <- as.factor(new_df$year)
# ---------------------------dual axis on a stacked bar chart ------------------
other_sum <- sum(df[!df$Item %in% top_reasons, "Close Amount"], na.rm = TRUE)
claims_df <-rbind(claims_df, c("Other", other_sum))
claims_df$totclaims <- as.numeric(claims_df$totclaims)
ylab <- seq(0, 1.1*max(claims_df$totclaims)/1000, 350)
my_label <- paste0("$", ylab, "K")
g2 <- ggplot(new_df, aes(x = reorder(Item, n, sum), y = n, fill =year)) +
geom_bar(stat= "identity", position =position_stack(reverse = TRUE)) +
coord_flip() +
theme_light() +
labs(title = " Claim Count and Total Claim Amount", x= "", y = "Claim Count", fill = "Year") +
theme(plot.title = element_text(hjust =0.5)) +
scale_fill_manual(values =combined_colors, guide = guide_legend(reverse = TRUE)) + #works
#Adds the line
geom_line(inherit.aes = FALSE, data=claims_df,
aes(x= Item, y = totclaims/20, colour= "Total Claim Amount", group =1), size =1) +
scale_color_manual(NULL, values = "black") +
scale_y_continuous(labels = comma,
sec.axis = sec_axis(~. *20, name= "Total Claims Amount",labels =my_label,
breaks = ylab*10^3))+
geom_point(inherit.aes= FALSE,data=claims_df,
aes(x= Item, y =totclaims/20, group =1),
size =3, shape = 21, fill = "white", color = "black") +
theme(legend.background = element_rect(fill = "transparent"),
legend.box.background = element_rect(fill = "transparent", colour = NA),
legend.spacing = unit(-1,"lines"),
axis.text.x.top = element_text(size = 6))
g2
The line graph below shows occurrence of TSA claims filed by day of the week, over a span of 12 years. The graph shows a general pattern in volume of claims being lower during the middle of week( Tues- Thurs), rising towards the weekend and hitting peak Sunday. The rise in claims during the weekend can be linked to an increase in travel which subsequently increases the chances in items being damaged and/or lost. Although there are fluctuations in the volume of claims year to year, the graph shows a consistent trend during the week. This highlights the need for improvement in how TSA is equipped to better handle peak days.
#-------------------------Line Plots: Citations by Day and by Year--------------
days_df <- df %>%
#Selects the Incident Date column from the data frame
dplyr::select(`Incident Date`) %>%
dplyr::mutate(year = year(ymd(`Incident Date`)), #Extracts year
dayoftheweek = weekdays(ymd(`Incident Date`), abbreviate = TRUE)) %>% #Abbreviates the day of the week
dplyr::group_by(year, dayoftheweek) %>% #Groups by year and day
dplyr::summarise(n = length(`Incident Date`), .groups = 'keep') %>%
data.frame()
days_df$year <- as.factor(days_df$year)
day_order <- factor(days_df$dayoftheweek, level = c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))
ggplot(days_df, aes(x= day_order, y = n, group = year)) +
geom_line(aes(color = year), size = 1.5) +
labs(title = "TSA Claims by Day and by Year", x = "Days of the Week", y = "Claim Count") +
theme_light() +
theme(plot.title = element_text(hjust = 0, face = "bold")) +
geom_point(shape = 21, size = 2, color = "black", fill = "white") +
scale_y_continuous(labels = comma) +
scale_color_brewer(palette = "Paired", name = "Year", guide = guide_legend(reverse = TRUE))
The pie chart below illustrates the status of claims over the years and classifies them into three categories: approved, denied, and other. The “Approved” category groups claims classified as “Approved” and “Settled”, while the “Other” category groups claims with insufficient information to formally process them. 2004 and 2005 have the highest approval rates at 58% and 47% which could be due to TSA being recently implemented and wanting to better accommodate travelers after a significant change being put to practice. 2008 and 2009, however, show a noticeable increase in claims being denied. As earlier stated, this was during the 2008 financial crisis which greatly impacted airlines as many went bankrupt and others faced substantial decreases in revenue. It’s important to note these factors as airlines possibly couldn’t afford these liabilities. Ultimately, this offers a valuable insight into claim outcomes over the years, helping travelers understand the likelihood of their claims being approved.
#--------------------------------Pie Chart--------------------------------------
top_status <- dplyr::count(df, Status)
top_status <- top_status[order(-n), ]
status_df <- df %>%
dplyr::select(Status, `Incident Date`) %>%
dplyr::mutate(year = year(ymd(`Incident Date`)),
myStatus = case_when(
Status %in% c ("Approved", "Approve in Full", "Settled") ~ "Approved",
Status %in% c("Denied", "Deny") ~ "Denied",
TRUE ~"Other" # Everything else goes in other
)) %>%
dplyr::group_by(year, myStatus) %>%
dplyr::summarise(n=length(myStatus), .groups = 'keep') %>%
dplyr::group_by(year) %>%
dplyr::mutate(percent_of_total = round(100*n/sum(n,1))) %>%
dplyr::ungroup() %>%
data.frame()
#Set myStatus as a factor
status_df$myStatus = factor(status_df$myStatus, levels=c("Approved", "Denied", "Other"))
status_df$year <- factor(status_df$year, levels = rev(unique(status_df$year)))
ggplot(data = status_df, aes(x= "", y = n, fill = myStatus))+
geom_bar(stat = "identity", position = "fill") +
coord_polar(theta = "y", start = 0)+
labs(fill = "Status", x= NULL, y= NULL, title = "Claim Status by Year") +
theme_light() +
theme(plot.title = element_text(hjust = 0.6),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
facet_wrap(~year, ncol = 4, nrow= 4) +
scale_fill_brewer(palette = "Reds") +
geom_text(aes(x=1.8, label = ifelse(percent_of_total>5, paste0(percent_of_total, "%"), "")),
size =3,
position = position_fill(vjust =0.5))
As expected, JFK and LAX, two of the busiest airports in the US, consistently show a higher volume of claims being filed. In 2004, LAX had the highest volume of claims filed at 1,403, while JFK closely followed with 983 claims. This data showcases the correlation between airport size and amount of claims filed as larger airports are more vulnerable to claims being filed due to more activity. Similarly, ORD and EWR rank just below the two, further reinforcing the correlation. Interestingly, all the airports show a steady decrease in claims, yet, LAX and JFK remain to be the two with the highest claims as indicated by the darker hues in the heatmap. While data doesn’t represent more recent years, this information is still useful for travelers as they can choose to fly from other smaller airports to decrease the risk in having their belongings being damaged and/or lost.
#-----------------------------Prep for HeatMap----------------------------------
names(df)[4] <- "Code"
names(df)[3] <- "IncidentDate"
top10 <- dplyr::count(df, Code)
top10 <- top10 [order(top10$n, decreasing = TRUE), ] #Prints out the amount in from highest to lowest
top10 <- top10[1:10, ]
yearlyClaims <- df %>%
dplyr::mutate(year = year(ymd(IncidentDate))) %>%
dplyr::filter(!is.na(Code), !is.na(IncidentDate), Code !="-") %>%
dplyr::mutate(year = year(ymd(IncidentDate))) %>% #Extracts year
dplyr::filter(Code %in% top10$Code) %>%
dplyr::group_by(year, Code) %>% #year and airport name
dplyr::summarise(n = length(IncidentDate), .groups = 'keep') %>%
data.frame()
yearlyClaims$year <- as.factor(yearlyClaims$year)
#---------------------------HeatMap Visual--------------------------------------
breaks <- c(seq(0, max(yearlyClaims$n), by = 250))
library(plotly)
g <- ggplot(yearlyClaims, aes(x = year, y = Code, fill = n)) +
geom_tile(color = "black") +
geom_text(aes(label = comma(n))) +
coord_equal(ratio = 1) +
labs(title = "Heatmap: TSA Claims by Year & Airport Code ",
x = "Year",
y = "Airport Code",
fill = "Claim Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_discrete(labels = x_axis_labels, breaks = x_axis_labels) +
scale_y_discrete(limits = rev(levels(yearlyClaims$Code))) +
scale_fill_continuous(low = "white", high = "red", breaks = breaks) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
g
These data visualizations reveal clear decline not only in the number of claims filed, but also in the amount of claims being approved. Significant improvements continue to be implemented and how each airport chooses to operate heavily influences the volume of claims. While data set was limited, information on the type of equipment used during the TSA process would greatly help understand the role technological advancements partake. Seeing the overall analysis of the interrelationship between claim items, claim amounts, time, airport, and outcome, reveal how various factors contribute to the overall TSA experience. Some useful key takeaways from the visualizations suggest that Sundays are not good days to travel as you are more prone to filing a claim and that 2008 and 2009 were not good years to file a claim as they held low approval rates. Additionally, travelers might consider avoiding airports such as LAX and JFK as they consistently have higher claims. Conclusively, this data set provides both travelers and airport authorities useful knowledge in improving the overall TSA process.