This dataset was acquired from Kaggle.com and downloaded as a CSV file. The file was then imported into RStudio. I set my working directory to the location of the file and read in the CSV. This lays the groundwork for digging into the Insurance Claims Dataset.
The Insurance Claims Dataset contains nearly 7,000 observations and 26 variables. These variables are Customer, Country, State Code, State, Claim Amount, Response, Coverage, Education, Effective To Date, EmploymentStatus, Gender, Income, Location Code, Marital Status, Monthly Premium Auto, Months Since Last Claim, Months Since Policy Inception, Number of Open Complaints, Number of Policies, Policy Type, Policy, Claim Reason, Sales Channel, Total Claim Amount, Vehicle Class, and Vehicle Size.
The summary function reports the minimum, first quartile (Q1), median, mean, third quartile (Q3), and maximum of each numeric variable. For non-numeric variables in the data frame it instead reports their length, class, and mode. The summary function is useful for scoping out what type of data is in the dataset.
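As a minimal sketch of what the summary function returns, the example below uses a small made-up data frame. The column names mirror two of the dataset's variables, but the values are hypothetical and are not taken from the Insurance Claims Dataset.

```r
# Hypothetical toy data; values are illustrative, not from the real dataset
toy <- data.frame(
  Income = c(10000, 25000, 40000, 55000, 90000),
  Gender = c("F", "M", "F", "F", "M"),
  stringsAsFactors = FALSE
)

# Numeric column: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
income_summary <- summary(toy$Income)
print(income_summary)

# Character column: Length, Class, Mode
gender_summary <- summary(toy$Gender)
print(gender_summary)
```

Calling summary on the whole data frame, as in summary(toy), prints the same information for every column at once.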
Below are five unique visualizations that look at different variables within the Insurance Claims Dataset.
The first visualization is a Histogram that shows how often each range of the variable Income occurs in the dataset. The x-axis shows Income, from $10,000 to $100,000 and above. The y-axis shows the frequency of each income range. The Histogram of Incomes contains ten bins, and the frequency of each bin is printed at the top of that bin. The histogram appears to be unimodal and skewed right. This is because the dataset contains a tail of observations whose incomes are on the higher side.
The second visualization is a stacked bar chart that illustrates Claim Amount by State. The x-axis represents the Claim Amount, which ranges from 0 to 1,500,000 and above. The y-axis lists the five states in the dataset in descending order of total Claim Amount. To the right of the stacked bar chart is a legend that breaks each bar down by Claim Reason: collision, hail, scratch/dent, or other. The legend uses a color gradient running from light to dark.
The third visualization is a Donut Chart that separates Claim Amount by Gender. By hovering over each color of the donut chart, you can see the claim amount and percentage for that gender. Males (Orange) account for a cumulative $2,627,510.23, which is 47.4% of the total claim amount. Females (Blue) account for a cumulative $2,914,867.10, which is 52.6% of the total claim amount.
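The two percentages follow directly from the two cumulative amounts quoted above. A quick check in R, using only those two figures:

```r
# Cumulative claim amounts by gender, as quoted in the text
claim_by_gender <- c(Male = 2627510.23, Female = 2914867.10)

# Each gender's share of the combined claim amount, rounded to one decimal
share <- round(100 * claim_by_gender / sum(claim_by_gender), 1)
print(share)
```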
The fourth visualization is a Trellis Chart which depicts Gender by State. There are five pie charts in total, one for each state: Kansas, Missouri, Nebraska, Iowa, and Oklahoma. Each pie chart shows the share of Males and Females in that state as a percentage.
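The percentage shown in each pie is the share of each gender within a single state. A minimal base R sketch of that calculation, using a hypothetical five-row data frame rather than the real dataset:

```r
# Hypothetical toy data; the rows are illustrative only
toy <- data.frame(
  State  = c("Kansas", "Kansas", "Kansas", "Iowa", "Iowa"),
  Gender = c("F", "F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Count rows by state and gender, then convert each row (state) to percentages
counts <- table(toy$State, toy$Gender)
pct <- round(100 * prop.table(counts, margin = 1), 1)
print(pct)
```

Here margin = 1 makes each state's row sum to 100, which matches how a single pie splits one state between the two genders.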
The final visualization is a Heat Map which shows Policy Type by Marital Status. The x-axis represents Policy Type, which is corporate auto, personal auto, or special auto. The y-axis indicates Marital Status, which is single, married, or divorced. The legend contains six gradient boxes which indicate the frequency of each combination of Policy Type and Marital Status. We can see that the most prominent box has a count of 3,361 and represents married individuals who have personal auto insurance.
library(data.table)
library(ggplot2)
library(scales)
library(RColorBrewer)
library(ggthemes)
library(plyr)
library(dplyr)
library(plotly)
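# Set the working directory to the project folder and read in the CSV
setwd("/Users/kristinakuzmina/DataVisualization/project")
filename <- "/Users/kristinakuzmina/DataVisualization/project/Insurance_Claims.csv"
df <- fread(filename)
# Drop observations with a reported income of zero
df <- df[df$Income != 0,]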
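# Axis labels from $0k to $110k; the two end labels are blanked out
x_values <- paste0('$', comma(seq(0, 110, 10)), 'k')
x_values[1] <- ""
x_values[12] <- ""
# Bars and count labels use the same 10,000-wide bins so each label sits on its bar
p1 <- ggplot(df, aes(x = Income)) +
  geom_histogram(binwidth = 10000, color = "hotpink", fill = "lightpink") +
  labs(title = "Histogram of Frequency of Incomes in Dataset", x = "Income (in thousands)", y = "Frequency of Income") +
  stat_bin(binwidth = 10000, geom = 'text', color = 'black', aes(label = comma(after_stat(count))), vjust = -0.5) +
  scale_y_continuous(labels = comma) +
  # Explicit breaks keep the twelve labels matched to twelve tick positions
  scale_x_continuous(breaks = seq(0, 110000, 10000), labels = x_values) +
  theme(plot.title = element_text(hjust = 0.5))
p1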
# Stacked horizontal bars: states ordered by total claim amount, filled by claim reason
p2 <- ggplot(df, aes(x = reorder(State, `Claim Amount`, sum), y = `Claim Amount`, fill = `Claim Reason`)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Claim Amount By State and Claim Reason", x = "State", y = "Claim Amount") +
  theme_clean() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "RdPu")
p2
# Donut chart: a pie trace with a hole, plus an annotation showing the overall total
p3 <- plot_ly(df, labels = ~Gender, values = ~`Claim Amount`) %>%
  add_pie(hole = 0.6) %>%
  layout(title = "Claim Amount by Gender",
         annotations = list(text = paste0("Total Claim Amount: \n",
                                          scales::comma(sum(df$`Claim Amount`))),
                            showarrow = FALSE))
p3
# Count observations by gender and state; any state outside the five is labeled "Other"
state_df <- df %>%
  select(State, Gender) %>%
  mutate(myState = ifelse(State == "Kansas", "Kansas",
                   ifelse(State == "Oklahoma", "Oklahoma",
                   ifelse(State == "Iowa", "Iowa",
                   ifelse(State == "Missouri", "Missouri",
                   ifelse(State == "Nebraska", "Nebraska", "Other")))))) %>%
  group_by(Gender, myState) %>%
  summarise(n = n(), .groups = 'keep') %>%
  group_by(Gender) %>%
  # Each state's share of the observations within a gender
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()
# One pie per state; the pies occupy rows 0 to 2 of a two-column grid, so three rows suffice
p4 <- plot_ly() %>%
  add_pie(data = state_df[state_df$myState == "Kansas",], labels = ~Gender, values = ~n,
          name = "Kansas", title = "Kansas", textposition = "inside", domain = list(row = 0, column = 0)) %>%
  add_pie(data = state_df[state_df$myState == "Iowa",], labels = ~Gender, values = ~n,
          name = "Iowa", title = "Iowa", textposition = "inside", domain = list(row = 0, column = 1)) %>%
  add_pie(data = state_df[state_df$myState == "Missouri",], labels = ~Gender, values = ~n,
          name = "Missouri", title = "Missouri", textposition = "inside", domain = list(row = 1, column = 0)) %>%
  add_pie(data = state_df[state_df$myState == "Oklahoma",], labels = ~Gender, values = ~n,
          name = "Oklahoma", title = "Oklahoma", textposition = "inside", domain = list(row = 1, column = 1)) %>%
  add_pie(data = state_df[state_df$myState == "Nebraska",], labels = ~Gender, values = ~n,
          name = "Nebraska", title = "Nebraska", textposition = "inside", domain = list(row = 2, column = 0)) %>%
  layout(title = "Trellis Chart: Gender By State", showlegend = TRUE,
         grid = list(rows = 3, columns = 2))
p4
# Count observations for each combination of policy type and marital status
policy_summary <- df %>%
  select(`Marital Status`, `Policy Type`) %>%
  mutate(maritalStatus = ifelse(`Marital Status` == "Married", "Married",
                         ifelse(`Marital Status` == "Single", "Single",
                         ifelse(`Marital Status` == "Divorced", "Divorced", "Other")))) %>%
  group_by(`Policy Type`, maritalStatus) %>%
  summarise(n = n(), .groups = 'keep') %>%
  ungroup()
# Legend breaks every 500 observations
breaks <- seq(0, max(policy_summary$n), by = 500)
p5 <- ggplot(policy_summary, aes(x = `Policy Type`, y = maritalStatus, fill = n)) +
  geom_tile(color = "black") +
  geom_text(aes(label = comma(n))) +
  coord_equal(ratio = 1) +
  labs(title = "Policy Type by Marital Status",
       x = "Policy Type",
       y = "Marital Status",
       fill = "Policy Type Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_continuous(low = "white", high = "hotpink", breaks = breaks) +
  guides(fill = guide_legend(reverse = TRUE, override.aes = list(color = "black")))
p5
Thank you for viewing my visualizations. Here are some key takeaways from the Insurance Claims Dataset. Histogram: Individuals making between $25k and $35k form the largest group, with a frequency of 1,132. Stacked Bar Chart: Collision is the primary reason for filing an insurance claim in every state. Donut Chart: Females filed a cumulative Claim Amount of $2,914,867.10, which is 52.6% of the total. Trellis Chart: Females make up a higher percentage than Males in each state. Heat Map: The combination with the lowest frequency, at 34 individuals, is divorced individuals who have special auto insurance.