Introduction

This report provides a descriptive analysis of an insurance dataset. The dataset contains information about insurance policies, claims, and customer details. Our goal is to explore the dataset using summary statistics and visualizations.

Basic Summary Statistics

summary(data[, c("Age","Premium_Amount", "Credit_Score","Total_Discounts")])
##       Age        Premium_Amount  Credit_Score   Total_Discounts 
##  Min.   :18.00   Min.   :1800   Min.   :530.0   Min.   :  0.00  
##  1st Qu.:29.00   1st Qu.:2100   1st Qu.:681.0   1st Qu.:  0.00  
##  Median :39.00   Median :2236   Median :715.0   Median : 50.00  
##  Mean   :39.99   Mean   :2220   Mean   :714.3   Mean   : 30.11  
##  3rd Qu.:50.00   3rd Qu.:2336   3rd Qu.:748.0   3rd Qu.: 50.00  
##  Max.   :90.00   Max.   :2936   Max.   :850.0   Max.   :150.00

Data Visualizations

1. Distribution of Insurance Premiums

library(ggplot2)
library(dplyr)
ggplot(data, aes(x = Premium_Amount)) +
  geom_histogram(binwidth = 50, fill = "blue", alpha = 0.7) +
  geom_density(aes(y = after_stat(count) * 50), color = "red", linewidth = 1) +
  labs(title = "Distribution of Insurance Premiums", x = "Premium Amount", y = "Count") +
  theme_minimal()

Explanation: The chart shows the distribution of premiums, revealing how much policyholders pay for their insurance. The histogram groups premium amounts into bins of width 50, and the red density curve is scaled to the same count axis, making it easy to see which price ranges are most common and where the higher and lower premiums fall. This gives a clear picture of typical insurance costs and sets the stage for exploring the factors behind premium variation.
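To read the most common price range off as a number rather than by eye, the binning can be reproduced with base R. This is a minimal sketch on a made-up `premiums` vector standing in for `data$Premium_Amount`; the values are illustrative only, and the bin edges here start at a multiple of 50, so they may not line up exactly with the bars ggplot2 draws.

```r
# Toy stand-in for data$Premium_Amount (made-up values, not from the report's dataset).
premiums <- c(1800, 2100, 2150, 2236, 2240, 2260, 2336, 2400, 2936)

# Build 50-unit bin edges that cover the full range of the values.
breaks <- seq(floor(min(premiums) / 50) * 50,
              ceiling(max(premiums) / 50) * 50 + 50,
              by = 50)

# Assign each premium to a left-closed bin such as [2200,2250).
bins <- cut(premiums, breaks = breaks, right = FALSE, dig.lab = 4)

# Count the values per bin and pick the bin with the highest count.
bin_counts <- table(bins)
modal_bin <- names(bin_counts)[which.max(bin_counts)]
print(modal_bin)
```

On the real dataset, replacing `premiums` with `data$Premium_Amount` would give the modal 50-unit range directly.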

2. Claims vs. Age of Policyholders

library(ggplot2)
library(dplyr)
ggplot(data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "darkblue", alpha = 0.5, color = "black") +
  labs(title = "Claims Frequency by Age Group", 
       x = "Age (5-year intervals)", 
       y = "Number of Claims") +
  theme_minimal()

Explanation: The chart displays the frequency of insurance claims by age in 5-year intervals, highlighting how claims vary across age groups. A claim is a request by a policyholder to the insurance company for compensation for a covered loss, such as an accident or damage, while a premium is the payment made to maintain coverage. Because the histogram counts the rows of the dataset in each age bin, the bars equal claim counts only if each row represents one claim. Read that way, the chart shows which age groups file the most claims, offering insight into the risk behaviors and trends associated with different ages. Understanding this relationship helps us identify patterns that may influence insurance pricing and policy development.
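A histogram of `Age` counts rows per age bin. If the dataset instead holds one row per policyholder with a separate claim count, the two quantities differ. The sketch below illustrates that difference on made-up data; the column name `Num_Claims` is hypothetical and is not confirmed anywhere in this report.

```r
# Toy data; `Num_Claims` is a hypothetical column name used only for illustration.
toy <- data.frame(
  Age        = c(18, 22, 24, 31, 33, 47, 52, 90),
  Num_Claims = c(1, 0, 2, 0, 1, 3, 0, 1)
)

# 5-year age groups, left-closed: [15,20), [20,25), ...
toy$Age_Group <- cut(toy$Age, breaks = seq(15, 95, by = 5), right = FALSE)

# Rows per age group: this is what a histogram of Age displays.
policyholders <- table(toy$Age_Group)

# Claims summed per age group (NA for groups with no rows).
claims <- tapply(toy$Num_Claims, toy$Age_Group, sum)

print(policyholders[["[45,50)"]])
print(claims[["[45,50)"]])
```

In this toy data the 45 to 49 group has a single policyholder but three claims, so the two views would rank age groups differently.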

3. Policy Type Breakdown

library(ggplot2)
library(dplyr)
data_counts <- data %>%
  group_by(Policy_Type, Region) %>%
  summarize(count = n(), .groups = "drop")

ggplot(data_counts, aes(x = Policy_Type, y = count, fill = Region)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  coord_flip() +
  labs(title = "Distribution of Policy Types by Region (Counts)",
       x = "Policy Type",
       y = "Count",
       fill = "Region") +
  theme_minimal() +
  geom_text(aes(label = count), 
            position = position_stack(vjust = 0.5), 
            color = "black", size = 3) # Adjust size as needed

Explanation: The stacked bar chart shows the distribution of the two insurance policy types, liability-only and full coverage, broken down by residential area: rural, urban, and suburban. Urban areas have the highest counts for both policy types, indicating that more policies of each type are held in these regions. Visualizing the data this way makes it easy to compare how policy preferences differ across living environments and helps us understand how insurance needs vary with where policyholders reside.
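Raw counts show where the most policies are, but comparing policy preferences across regions is easier with shares within each region. A minimal sketch using base R's `table` and `prop.table` on made-up rows (the category labels below are assumed spellings, not taken from the dataset):

```r
# Toy stand-in for the report's data; labels and values are made up for illustration.
toy <- data.frame(
  Policy_Type = c("Full Coverage", "Full Coverage", "Liability-Only",
                  "Full Coverage", "Liability-Only", "Liability-Only"),
  Region      = c("Urban", "Urban", "Urban", "Rural", "Suburban", "Urban")
)

# Cross-tabulate counts: rows are policy types, columns are regions.
counts <- table(toy$Policy_Type, toy$Region)

# Column-wise proportions: the share of each policy type within a region.
shares <- prop.table(counts, margin = 2)

print(counts)
print(round(shares, 2))
```

Each column of `shares` sums to 1, so regions of very different size can be compared on the same scale.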

4. Proportion of Discount Amounts

library(ggplot2)
library(dplyr)
data %>%
  mutate(Score_Range = cut(Credit_Score, breaks = seq(530, 850, by = 40), include.lowest = TRUE)) %>%
  ggplot(aes(x = Score_Range, fill = factor(Total_Discounts))) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Blues", direction = 1, name = "Discount") +
  labs(
    title = "Proportion of Discount Amounts by Credit Score Range",
    x = "Credit Score Range",
    y = "Proportion"
  ) +
  theme_minimal()

Explanation: The chart shows, for each credit score range (intervals of 40 from 530 to 850), the proportion of policyholders receiving each total discount amount: 0, 50, 100, or 150. This lets us see which credit score groups receive higher or lower discounts and how credit scores relate to the discounts offered. Notably, the highest credit score ranges have no policyholders receiving the maximum discount of 150. This could be due to how discounts are structured: policyholders with higher credit scores usually already have low premiums, so they may not need incentive discounts.
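The proportions that `position = "fill"` draws can also be computed as a table, which makes it straightforward to check which score ranges contain anyone at the maximum discount. The sketch below uses the same `cut` breaks as the chart but runs on made-up rows, so its numbers say nothing about the real dataset.

```r
# Toy stand-in for the report's data (made-up values for illustration only).
toy <- data.frame(
  Credit_Score    = c(530, 560, 600, 640, 700, 720, 800, 850),
  Total_Discounts = c(0, 50, 50, 150, 100, 50, 0, 50)
)

# Same binning as the chart: intervals of 40 from 530 to 850.
toy$Score_Range <- cut(toy$Credit_Score,
                       breaks = seq(530, 850, by = 40),
                       include.lowest = TRUE)

# Row-wise proportions: within each score range, the share at each discount level.
# Ranges with no rows in the toy data come out as NaN.
props <- prop.table(table(toy$Score_Range, toy$Total_Discounts), margin = 1)

# Share of each score range receiving the maximum discount of 150.
max_discount_share <- props[, "150"]
print(round(max_discount_share, 2))
```

A value of 0 for a range means nobody in that range received the 150 discount, which is the pattern the explanation describes for the top ranges.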

5. Premium Amount vs. Credit Score

library(ggplot2)
library(dplyr)
ggplot(data, aes(x = Credit_Score, y = Premium_Amount)) +
    geom_bin2d(bins = 50, alpha = 1) +
    scale_fill_gradient(low = "blue", high = "yellow") +
    labs(
      title = "Premium Amount vs. Credit Score",
      x = "Credit Score",
      y = "Premium Amount",
      fill = "Count"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 16, face = "bold", hjust = 0.5)  
    )

Explanation: The heatmap displays the relationship between premium amounts and credit scores, with credit scores on the x-axis and premium amounts on the y-axis. The color of each cell indicates how many records fall at that combination of credit score and premium. This helps us quickly identify trends, such as whether higher credit scores are associated with lower premiums, and where the most common premium amounts fall within the range of credit scores. Understanding this relationship can inform decisions about pricing and risk assessment in insurance.
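A heatmap shows the shape of the relationship; a correlation coefficient puts a number on its direction and strength. The sketch below computes Pearson and Spearman correlations with base R's `cor` on made-up values that were deliberately constructed to decrease, so the result illustrates the calculation only and is not evidence about the real data.

```r
# Toy stand-in for the report's data; values are made up and strictly decreasing by design.
toy <- data.frame(
  Credit_Score   = c(530, 600, 650, 700, 750, 800, 850),
  Premium_Amount = c(2900, 2600, 2500, 2300, 2200, 2000, 1800)
)

# Pearson correlation measures the linear association.
r <- cor(toy$Credit_Score, toy$Premium_Amount)

# Spearman correlation measures the monotonic (rank-based) association.
rho <- cor(toy$Credit_Score, toy$Premium_Amount, method = "spearman")

print(c(pearson = r, spearman = rho))
```

Applied to `data$Credit_Score` and `data$Premium_Amount`, a clearly negative coefficient would support the reading that higher scores go with lower premiums.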

Conclusion

Looking at all five charts together, we can see clear trends in how different factors influence auto insurance pricing, claims, and discounts. The premium distribution chart gives an overview of how much policyholders typically pay, setting the foundation for understanding variations in cost. The age-based claims frequency chart shows how claim patterns shift with age, which can directly affect premium pricing and risk assessment. The policy type distribution by location shows that urban areas have the highest concentration of both liability-only and full-coverage policies, likely due to higher vehicle density and risk factors in city environments.

The credit score and discount proportion chart reveals that while better credit scores generally receive more discounts, the highest credit tiers do not get the largest discounts, likely because their base premiums are already lower. Finally, the heatmap of credit scores against premium amounts indicates that higher credit scores are associated with lower auto insurance premiums, reinforcing the idea that creditworthiness plays a major role in pricing.

Overall, these charts suggest that auto insurance pricing is influenced by multiple interconnected factors, including risk behaviors (claims by age), location (policy type distribution), and financial responsibility (credit score impact on premiums and discounts). Understanding these patterns can help refine pricing models and risk assessment, ensuring that premiums and discounts are structured fairly while still accounting for risk.