The TSA Claims Dataset is available to the public on Kaggle, a website that hosts a variety of datasets for anyone to access and download. This dataset contains information about claims made to the Transportation Security Agency in the United States from 2002 to 2015.
# Import libraries
library(data.table)
library(DescTools)
library(ggplot2)
library(lubridate)
library(dplyr)
library(scales)
library(ggthemes)
library(RColorBrewer)
library(ggrepel)
library(plotly)
# Link for CSV
# https://www.kaggle.com/datasets/terminal-security-agency/tsa-claims-database
# NOTE: I did some manual cleaning within Excel before reading the file in
# Load the CSV file
csv = "tsa_claims.csv"
df = fread(csv, na.strings = c(NA, ""))
The TSA Claims Dataset includes a total of 13 columns and 204,269 rows.
The following table contains information about each column.
| Column Name | Type | NA Count |
|---|---|---|
| Claim Number | chr | 0 |
| Date Received | Date | 258 |
| Incident Date | POSIXct | 2179 |
| Airport Code | Factor w/ 464 levels | 8960 |
| Airport Name | Factor w/ 467 levels | 8960 |
| Airline Name | Factor w/ 232 levels | 38616 |
| Claim Type | Factor w/ 10 levels | 8190 |
| Claim Site | Factor w/ 5 levels | 1011 |
| Item | chr | 8384 |
| Claim Amount | num | 16790 |
| Status | Factor w/ 13 levels | 12752 |
| Close Amount | num | 68947 |
| Disposition | Factor w/ 3 levels | 72903 |
We would expect NA to be an acceptable response for many of these columns. One thing, however, that disqualifies a row from being valid is a Date Received outside of the 2002 to 2015 range, since this information was given to us in the description. After filtering invalid dates out, we are left with 203,989 rows, which does result in some columns having a lower NA count.
For the columns of type num, we can include some summary statistics:
Claim Amount
Close Amount
From these numerical summaries, we can tell that both columns are extremely skewed to the right since the maximum values are extremely higher than the median and 3rd quartile values. Comparing both columns, we can also see that the TSA tends to close the claims with a value lower than the claim amount, since all of the statistics for the Close Amount column are lower than the corresponding value in the Claim Amount column. We will keep the outlying points in the dataset since they are likely not accidental and they should not cause any issues for our visualizations.
# Convert datetime columns
df$`Date Received` = dmy(df$`Date Received`)
# Make sure we are only using dates between 2002 and 2015 (given by dataset)
df = df %>%
filter(year(ymd(`Date Received`)) >= 2002 & year(ymd(`Date Received`)) <= 2015)
df$`Incident Date` = mdy_hm(df$`Incident Date`)
# Convert numerical columns
df$`Claim Amount` = as.numeric(df$`Claim Amount`)
df$`Close Amount` = as.numeric(df$`Close Amount`)
# Convert categorical columns into factors
df$`Airport Code` = as.factor(df$`Airport Code`)
# Find unique airport names and how many unique names are included
df$`Airport Name` = as.factor(df$`Airport Name`)
# Find unique airline names and how many unique names are included
df$`Airline Name` = as.factor(df$`Airline Name`)
# Find unique claim type and how many unique types are included
df$`Claim Type` = as.factor(df$`Claim Type`)
# Find unique claim sites and how many unique sites are included
df$`Claim Site` = as.factor(df$`Claim Site`)
# Find unique statuses and how many unique statuses are included
df$Status[df$Status == "Insufficient, one of the following items required: sum certain, statement of fact, signature, location of incident, and date."] = "Insufficient information provided"
df$Status = as.factor(df$Status)
# Find unique dispositions and how many unique dispositions are included
df$Disposition = as.factor(df$Disposition)
The following line graph depicts the number of claims by the day of the week that the claim was received by the TSA. This visualization is useful because it helps to reveal any trends in the Date Received column, which can help the agency predict which days will be busiest so they can adequately staff the department that processes the claims. We can see that claims are submitted in a higher abundance earlier in the week (week order starting on Monday and ending on Sunday), with Tuesday having the highest sum of claims received (49,605). As the week continues, the number of claims received decreases, with a significant drop on Saturday and Sunday, with Saturday having the lowest sum (863).
# Make new df
days_df = df %>%
select(`Date Received`) %>%
mutate(dayoftheweek = weekdays(ymd(`Date Received`), abbreviate = TRUE)) %>%
group_by(dayoftheweek) %>%
dplyr::summarise(n = length(dayoftheweek), .groups = 'keep') %>%
data.frame()
days_df = na.omit(days_df)
# Make x labels in week order
day_order = factor(days_df$dayoftheweek,
level = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
# what are the highest and lowest points
hi_lo = days_df %>%
filter(n == min(n, na.rm = TRUE) | n == max(n, na.rm = TRUE)) %>%
data.frame()
ggplot(days_df, aes(x = day_order, y = n, group = 1)) +
geom_line(color = 'black', linewidth = 1) +
geom_point(shape = 21, size = 4, color = 'red', fill = 'white') +
labs(x = "Day of the Week", y = "Claim Count",
title = "Claims Received by Day of the Week",
caption = "Source: Kaggle
(https://www.kaggle.com/datasets/terminal-security-agency/tsa-claims-database)") +
scale_y_continuous(labels = comma) +
theme_light() +
theme(plot.title = element_text(hjust = 0.5, size = 16),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12)) +
geom_point(data = hi_lo, aes(x = dayoftheweek, y = n),
shape = 21, size = 4, fill = 'red', color = 'red') +
geom_label_repel(aes(label = ifelse(n == max(n) |
n == min(n), scales::comma(n), "")),
box.padding = 1, point.padding = 1, size = 4,
color = 'Grey50', segment.color = 'gray50')
The following stacked bar chart shows the breakdown of claim status for each of the top ten airports (sorted by number of claims in the database). This visualization is interesting because we can make a rough comparison between each of the top ten airports to see whether there are any major trends in claim status—or if any airport has a distribution that stands out from the others. It appears as if all of the ten airports included in the chart have a roughly similar distribution of claim status, indicating that they are all likely being held to similar guidelines and no evident biases are present.
# Make a df by airport code
df_airports = dplyr::count(df, `Airport Name`)
df_airports = df_airports[order(df_airports$n, decreasing = TRUE), ]
df_airports = na.omit(df_airports)
# Pull out top 10 airports
top_airports = df_airports$`Airport Name`[1:10]
# Make new df
bar_df = df %>%
filter(`Airport Name` %in% top_airports) %>%
select(Status, `Airport Name`) %>%
group_by(`Airport Name`, Status) %>%
dplyr::summarise(n = length(`Airport Name`), .groups = 'keep') %>%
data.frame()
# Find the aggregate total
agg_tot = bar_df %>%
select(Airport.Name, n) %>%
group_by(Airport.Name) %>%
dplyr::summarise(tot = sum(n), .groups = 'keep') %>%
data.frame()
# create a ceiling
max_y = plyr::round_any(max(agg_tot$tot), 2500, ceiling)
ggplot(bar_df, aes(x = reorder(Airport.Name, n, sum), y = n, fill = Status)) +
geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
coord_flip() +
labs(title = "Claim Count for the Top Ten Airports by TSA Claims", x = "",
y = "Claim Count", fill = "Status") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5, size = 16),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 12)) +
scale_fill_brewer(palette = "Paired", na.value = "gray85") +
geom_text(data = agg_tot, aes(x = Airport.Name, y = tot,
label = scales::comma(tot), fill = NULL),
hjust = -0.1, size = 4) +
scale_y_continuous(labels = comma,
breaks = seq(0, max_y, 2000),
limits = c(0, max_y))
The pie charts below show the distribution of claims by claim amount for the top ten airlines (sorted by number of claims in the database). Similar to the stacked bar chart in the previous tab, this visualization is interesting because we can compare the top ten airlines to see whether there are any major trends in claim amount—or if any airport has a distribution that stands out from the others. Since Claim Amount is of type num, groups were created in order to make slices of the pie. From this visualization, we can see that most claims are requesting monetary amounts on the lower side, with the $0.00 to $49.99 group dominating the pie for nine of the ten top airlines. Northwest Airlines has a slightly different distribution with the majority group for Claim Amount as $200.00 to $499.99, however, only by a small margin of 0.4%. This visualization could be especially helpful if we wanted to conduct statistical hypotheses tests to see whether the distributions for these airlines significantly differ from the population distribution.
# Make a df by airline name
df_airlines = dplyr::count(df, `Airline Name`)
df_airlines = df_airlines[order(df_airlines$n, decreasing = TRUE), ]
df_airlines = na.omit(df_airlines)
# Pull out top 10 airlines
top_airlines = df_airlines$`Airline Name`[1:10]
# Make a df to be used with the pie charts
pie_df = df %>%
filter(`Airline Name` %in% top_airlines) %>%
mutate(AmountGroup = ifelse(`Claim Amount` < 50, "$0.00-$49.99",
ifelse(`Claim Amount` < 100, "$50.00-$99.99",
ifelse(`Claim Amount` < 200, "$100.00-$199.99",
ifelse(`Claim Amount` < 500, "$200.00-$499.99",
ifelse(`Claim Amount` < 1000, "$500.00-$999.99",
ifelse(`Claim Amount` < 10000, "$1,000.00-$9,999.99",
"$10,000.00+"))))))) %>%
group_by(`Airline Name`, AmountGroup) %>%
summarise(n = length(`Airline Name`), .groups = "keep") %>%
group_by(`Airline Name`) %>%
mutate(percent_of_total = round(100 * n / sum(n), 1)) %>%
ungroup() %>%
data.frame()
# Create a trellis chart containing pie graphs
pie_df$AmountGroup = factor(pie_df$AmountGroup, levels = c("$0.00-$49.99",
"$50.00-$99.99",
"$100.00-$199.99",
"$200.00-$499.99",
"$500.00-$999.99",
"$1,000.00-$9,999.99",
"$10,000.00+"))
ggplot(pie_df, aes(x = "", y = n, fill = AmountGroup)) +
geom_bar(stat = "identity", position = "fill") +
coord_polar(theta = "y", start = 0) +
labs(fill = "Claim Amount", x = NULL, y = NULL,
title = "Claim Distribution by Airline (Top 10) and by Claim Amount",
caption = "Slices under 5% are not labeled") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5, size = 16),
strip.text = element_text(size = 13),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
facet_wrap(~Airline.Name, ncol = 5, nrow = 2) +
scale_fill_brewer(palette = "Spectral", na.value = "gray85") +
geom_text(aes(x = 1.7, label = ifelse(percent_of_total > 5,
paste0(percent_of_total, "%"), "")),
size = 4,
position = position_fill(vjust = 0.5))
This nested donut chart depicts the distribution of claim status (grouped into “Closed or Denied”, “Approve(d) or Settle(d)”, “null”, and “Pending”) broken up by claim type (from outer ring to inner ring, grouped into “Damages or Loss”, “Injury or Death”, and “Other”). It can provide information about the trends of claim status between the different claim types and can be used to predict the distribution of claim status by type in the future. We can see that for the “Damages or Loss” group, the “Closed or Denied” claim status holds the majority, representing 54.2% of the claims. This is followed by the “Approve(d) or Settle(d) group having 39.5% of the claims and the remaining 6.23% being null. For the”Other” claim type group, the distribution is somewhat similar with the “Closed or Denied” (47.5%) and the “Approve(d) or Settle(d)” (48.6%) groups taking the large majority of the proportion with the remaining being null (3.81%). The “Injury or Death” claim type has a somewhat different distribution. While the “Closed or Denied” status represents the majority at 56.2% with the remaining claims being split between “null” (23.4%), “Approve(d) or Settle(d)” (16.7%), and “Pending” (3.68%). This is interesting because the “Approve(d) or Settle(d)” status group represents a significantly smaller proportion of the “Injury or Death” claim group as compared to the others. It might be worthwhile to further investigate why this is, so that the TSA can look into the reason so many of the “Injury or Death” claims are null or why there is a much lower “Approve(d) or Settle(d)” percentage.
nested_df = df %>%
select(`Claim Type`, Status) %>%
mutate(ClaimGroup = ifelse(`Claim Type` %in% c("Property Damage",
"Passenger Property Loss",
"Employee Loss (MPCECA)",
"Passenger Theft"),
"Damages or Loss",
ifelse(`Claim Type` %in% c("Personal Injury",
"Wrongful Death"),
"Injury or Death",
"Other"))) %>%
mutate(StatusGroup = ifelse(Status %in% c("Approved",
"Settled",
"Approve in Full",
"Settle"),
"Approve(d) or Settle(d)",
ifelse(Status %in% c("Denied",
"Deny",
"Canceled",
"Closed as a contractor claim",
"Insufficient information provided"),
"Closed or Denied",
ifelse(Status %in% c("In litigation",
"In review",
"Claim has been assigned for further investigation",
"Pending response from claimant"),
"Pending",
Status)))) %>%
group_by(ClaimGroup, StatusGroup) %>%
summarise(n = length(StatusGroup), .groups = "keep") %>%
group_by(ClaimGroup) %>%
mutate(percent_of_total = round(100 * n / sum(n), 1)) %>%
ungroup() %>%
data.frame()
plot_ly(hole = 0.7) %>%
layout(title = "Claim Type by Status",
font = list(size = 16),
margin = list(t = 100)) %>%
add_trace(data = nested_df[nested_df$ClaimGroup == "Damages or Loss", ],
labels = ~StatusGroup,
values = ~nested_df[nested_df$ClaimGroup == "Damages or Loss", "n"],
type = "pie",
textposition = "inside",
hovertemplate = "Claim Type: Damages or Loss<br />Status: %{label}<br />Percent: %{percent}<br />Citation Count: %{value}<extra></extra>") %>%
add_trace(data = nested_df[nested_df$ClaimGroup == "Injury or Death", ],
labels = ~StatusGroup,
values = ~nested_df[nested_df$ClaimGroup == "Injury or Death", "n"],
type = "pie",
textposition = "inside",
hovertemplate = "Claim Type: Injury or Death<br />Status: %{label}<br />Percent: %{percent}<br />Citation Count: %{value}<extra></extra>",
domain = list(
x = c(0.16, 0.84),
y = c(0.16, 0.84))) %>%
add_trace(data = nested_df[nested_df$ClaimGroup == "Other", ],
labels = ~StatusGroup,
values = ~nested_df[nested_df$ClaimGroup == "Other", "n"],
type = "pie",
textposition = "inside",
hovertemplate = "Claim Type: Other<br />Status: %{label}<br />Percent: %{percent}<br />Citation Count: %{value}<extra></extra>",
domain = list(
x = c(0.27, 0.73),
y = c(0.27, 0.73)))
The heatmap below shows the different Claim Types (unedited) by year. This visualization can be useful in showing any trends over time, as well as highlighting the years that the minority Claim Types occurred. Notably, we can see that the “Passenger Property Loss” type has the greatest number of claims since this row has the highest concentration of color. Additionally, we can tell that there were the most “Passenger Property Loss” claims submitted in 2004 and 2005, but this decreased in the years leading to 2015. Another notable row is “Property Damage” that shows a similar trend to “Passenger Property Loss”. This information indicates that TSA has been receiving less claims in the two highest areas, indicating that either TSA operations have been improving in quality or that people are submitting claims less often. This would be interesting to investigate further. A few other interesting points are the “Wrongful Death”, “Passenger Theft”, “Compliment”, “Complaint”, and “Bus Terminal” claim types since they have years where a claim has not been received under that type. It may be worthwhile to see whether these claims are worth keeping in the database (Have these options always been available to submit claims under? Are complaints and compliments appropriate to include in this database? etc.).
heatmap_df = df %>%
select(`Claim Type`, `Date Received`) %>%
mutate(year = year(ymd(`Date Received`))) %>%
group_by(year, `Claim Type`) %>%
summarise(n = length(year), .groups = "keep") %>%
data.frame()
breaks = c(seq(0, max(heatmap_df$n), by = 1000))
ggplot(heatmap_df, aes(x = year, y = Claim.Type, fill = n)) +
geom_tile(color = "black") +
geom_text(aes(label = comma(n))) +
coord_equal(ratio = 1) +
labs(title = "Heatmap: Claims by Type and Year",
x = "Year",
y = "Claim Type",
fill = "Citation Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_continuous(low = "white", high = "red", breaks = breaks, labels = comma) +
scale_x_continuous(breaks = unique(heatmap_df$year)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
The visualizations included in this report give us valuable information about the TSA Claims Dataset that we would not have been able to uncover if the data were not cleaned and processed. The line graph told us the trend how claims were received by day of the week so that we can prepare to staff more people on Tuesday and Monday, and less people on Saturday and Sunday. The stacked bar chart helped verify that the status for claims at the top ten airports (sorted by number of claims received) are somewhat similar, and thus indicates that the claims at each airport are being held to the same guidelines. The trellis chart of pie graphs is similar, showing that the claim amounts requested from customers of the top ten airlines (sorted by number of claims received) are mostly similar to each other, but it would be interesting to conduct hypotheses tests to be sure. The interactive nested donut chart shows us that the claims that a lot of claims fall under the “Closed or Denied” status group. Broken down, the “Damages or Loss” claim types and the “Other” claim types have a somewhat even distribution between being Closed/Denied and Approved/Settled, while the “Injury or Death” claim type has a higher proportion of “null” status, a lower proportion of “Approved/Settled”, but a similar proportion of “Closed/Denied” statuses as compared to the other claim types. Finally, the heatmap highlights the trend of instances for the different claim types over the years, specifically showing that the Property Damage and Passenger Property Loss types peaked around 2004-2005 but decreased since. Additionally, it allows us to further ask questions about the validity of the minority claim types being present in the dataset.
While the visualizations created in this report have given us additional insight into the data included in the set, there can always be more data cleaning and preparation that would have resulted in visualizations and summaries that are more accurate and helpful. Further, the visualizations that were generated have allowed us to ask more valid questions that should be addressed in the future.