DATA 607 TidyVerse Create

In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/acatlin/SPRING2024TIDYVERSE

FiveThirtyEight.com datasets.

Kaggle datasets.

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Later, you’ll be asked to extend an existing vignette. Using one of your classmate’s examples (as created above), you’ll then extend his or her example with additional annotated code. (15 points)

You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.

After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.

You should complete your submission on the schedule stated in the course syllabus.

The dataset I chose to demonstrate the capabilities of the tidyverse is the “Insurance claims and policy data” dataset from Kaggle. It is a comprehensive collection designed to facilitate predictive analytics and risk assessment within the insurance industry. My theory is that people who are experiencing stressful life situations are more likely to have a higher claims history, and therefore a higher risk of policy payout. Specifically, I believe individuals who are separated, divorced, or widowed are more likely to have a higher claims history than people who are married or single. We will begin by loading the libraries needed for the demonstration and then loading the insurance data into a data frame.

Link to dataset: https://www.kaggle.com/datasets/ravalsmit/insurance-claims-and-policy-data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
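# readr is already attached by library(tidyverse); this explicit call is redundant but harmless
library(readr)
# Copy of the dataset CSV hosted on GitHub
kaggleurl <- "https://raw.githubusercontent.com/Zcash95/DATA607-files/main/data_synthetic.csv"

insurance_data <- read_csv(kaggleurl)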
## Rows: 53503 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): Gender, Marital Status, Occupation, Education Level, Geographic In...
## dbl (11): Customer ID, Age, Income Level, Location, Claim History, Coverage ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
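
The message above suggests specifying column types explicitly. As a minimal sketch of what that could look like, the example below parses a small inline CSV instead of the real file. The `toy_csv` text and its values are made up for illustration, and treating `Claim History` as an integer is an assumption rather than something the dataset documents.

```r
library(readr)

# Toy CSV text standing in for the real file (illustrative values only)
toy_csv <- I("Customer ID,Marital Status,Claim History\n1,Married,2\n2,Widowed,5\n")

# Supplying col_types fixes each column's type up front and
# silences the column specification message
toy <- read_csv(
  toy_csv,
  col_types = cols(
    `Customer ID` = col_double(),
    `Marital Status` = col_character(),
    `Claim History` = col_integer()
  )
)

toy
```

The same `col_types` argument could be passed to the `read_csv(kaggleurl)` call above, listing only the columns whose types matter for the analysis.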

We will now take a look at the dataset to see what attributes we can use to assess and gauge our theory.

glimpse(insurance_data)
## Rows: 53,503
## Columns: 30
## $ `Customer ID`                        <dbl> 84966, 95568, 10544, 77033, 88160…
## $ Age                                  <dbl> 23, 26, 29, 20, 25, 41, 55, 35, 4…
## $ Gender                               <chr> "Female", "Male", "Female", "Male…
## $ `Marital Status`                     <chr> "Married", "Widowed", "Single", "…
## $ Occupation                           <chr> "Entrepreneur", "Manager", "Entre…
## $ `Income Level`                       <dbl> 70541, 54168, 73899, 63381, 38794…
## $ `Education Level`                    <chr> "Associate Degree", "Doctorate", …
## $ `Geographic Information`             <chr> "Mizoram", "Goa", "Rajasthan", "S…
## $ Location                             <dbl> 37534, 63304, 53174, 22803, 92858…
## $ `Behavioral Data`                    <chr> "policy5", "policy5", "policy5", …
## $ `Purchase History`                   <chr> "04-10-2018", "11-06-2018", "06-0…
## $ `Policy Start Date`                  <chr> "08-01-2023", "09-06-2020", "09-0…
## $ `Policy Renewal Date`                <chr> "12-03-2023", "06-09-2023", "11-0…
## $ `Claim History`                      <dbl> 5, 0, 4, 5, 3, 4, 3, 3, 0, 4, 2, …
## $ `Interactions with Customer Service` <chr> "Phone", "Chat", "Email", "Chat",…
## $ `Insurance Products Owned`           <chr> "policy2", "policy1", "policy3", …
## $ `Coverage Amount`                    <dbl> 366603, 780236, 773926, 787815, 3…
## $ `Premium Amount`                     <dbl> 2749, 1966, 4413, 4342, 1276, 110…
## $ Deductible                           <dbl> 1604, 1445, 1612, 1817, 133, 1173…
## $ `Policy Type`                        <chr> "Group", "Group", "Group", "Famil…
## $ `Customer Preferences`               <chr> "Email", "Mail", "Email", "Text",…
## $ `Preferred Communication Channel`    <chr> "In-Person Meeting", "In-Person M…
## $ `Preferred Contact Time`             <chr> "Afternoon", "Morning", "Evening"…
## $ `Preferred Language`                 <chr> "English", "French", "German", "F…
## $ `Risk Profile`                       <dbl> 1, 1, 2, 3, 0, 2, 3, 0, 1, 1, 2, …
## $ `Previous Claims History`            <dbl> 3, 2, 1, 0, 3, 2, 1, 3, 1, 1, 1, …
## $ `Credit Score`                       <dbl> 728, 792, 719, 639, 720, 811, 836…
## $ `Driving Record`                     <chr> "DUI", "Clean", "Accident", "DUI"…
## $ `Life Events`                        <chr> "Job Change", "Retirement", "Chil…
## $ `Segmentation Group`                 <chr> "Segment5", "Segment5", "Segment3…
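
Most column names contain spaces, so they must be wrapped in backticks wherever they are used. One optional way around this is to convert the names to snake_case with `rename_with()`. The sketch below uses a toy tibble with made-up values, and `toy_clean` is a hypothetical name. The rest of this vignette keeps the original column names.

```r
library(tidyverse)

# Toy tibble mimicking three of the dataset's column names (illustrative values only)
toy <- tibble(
  `Customer ID` = c(1, 2),
  `Marital Status` = c("Married", "Widowed"),
  `Claim History` = c(2, 5)
)

# Lower-case every column name and replace spaces with underscores
toy_clean <- toy %>%
  rename_with(~ str_replace_all(str_to_lower(.x), " ", "_"))

names(toy_clean)
# "customer_id" "marital_status" "claim_history"
```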

I will use ggplot2 to plot the distribution of marital status among the insured.

marital_plot <- insurance_data %>%
  count(`Marital Status`) %>%
  ggplot(aes(x = `Marital Status`, y = n)) +
  geom_col(fill = "skyblue") +
  labs(title = "Distribution of Marital Status", x = "Marital Status", y = "Count") +
  theme_minimal()
print(marital_plot)
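
Bar heights can be hard to compare by eye, so it may help to look at the counts as shares of the total. The following is a minimal sketch on a toy tibble. The values are made up and `status_share` is a hypothetical name. The same pipeline should work on `insurance_data`.

```r
library(tidyverse)

# Toy data standing in for the marital status column (illustrative values only)
toy <- tibble(`Marital Status` = c("Married", "Married", "Divorced", "Single"))

# Count each status, largest first, then convert counts to shares of the total
status_share <- toy %>%
  count(`Marital Status`, sort = TRUE) %>%
  mutate(share = n / sum(n))

status_share
```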

We can see that there are roughly equal numbers of people who are married or divorced, but fewer people who are separated, single, or widowed. We will now analyze whether there is a relationship between a person's marital status and the claims history of people in that group.

# Mean claim history per marital status, highest first
correlation <- insurance_data %>%
  mutate(`Marital Status` = factor(`Marital Status`)) %>%
  group_by(`Marital Status`) %>%
  summarise(mean_claim = mean(`Claim History`)) %>%
  arrange(desc(mean_claim))


print(correlation)
## # A tibble: 5 × 2
##   `Marital Status` mean_claim
##   <fct>                 <dbl>
## 1 Separated              2.59
## 2 Widowed                2.55
## 3 Divorced               2.52
## 4 Married                2.51
## 5 Single                 2.45
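
The group means above are close together, so it can be useful to report the group size and spread next to each mean. The sketch below extends the same `group_by()` and `summarise()` pattern on a toy tibble. The values are made up, and `claim_summary`, `sd_claim`, and `se_claim` are hypothetical names introduced for illustration.

```r
library(tidyverse)

# Toy data standing in for two of the dataset's columns (illustrative values only)
toy <- tibble(
  `Marital Status` = c("Married", "Married", "Married", "Widowed", "Widowed", "Widowed"),
  `Claim History` = c(1, 2, 3, 2, 4, 6)
)

# Group size, mean, standard deviation, and standard error of the mean per status
claim_summary <- toy %>%
  group_by(`Marital Status`) %>%
  summarise(
    n = n(),
    mean_claim = mean(`Claim History`),
    sd_claim = sd(`Claim History`),
    se_claim = sd_claim / sqrt(n),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_claim))

claim_summary
```

A small standard error relative to the gap between two group means would suggest the gap is not just noise. This sketch only describes the data and does not perform a formal test.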

The summary shows that single and married people have a slightly lower mean claims history (2.45 and 2.51) than people who are separated, widowed, or divorced (2.59, 2.55, and 2.52). We will now use ggplot2 to visualize this result, ordering the bars from highest to lowest mean to show which marital status is associated with a higher claims history and therefore a higher risk for an insurance company.

# `Claim History` holds a claims history score per customer, not a monetary amount, so label it as such
correlation_plot <- correlation %>%
  ggplot(aes(x = reorder(`Marital Status`, -mean_claim), y = mean_claim, fill = `Marital Status`)) +
  geom_col() +
  labs(title = "Mean Claim History by Marital Status", x = "Marital Status", y = "Mean Claim History") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(correlation_plot)

Conclusion: ggplot2 gave us the ability to display this analysis and the relationship between a person's marital status and their insurance claims history. The results are consistent with my theory that people in stable situations, such as being married or single, have a lower claims history than people who may be in stressful situations, such as being separated, widowed, or divorced, although the differences between the group means are small (2.45 to 2.59). For that reason, marital status should not be the only factor an insurance company analyzes to gauge the risk of a consumer.