1.Choose a data set (the number of data attributes should be more than 5), explain why it is important or interesting for you. 2 2. Formulate research questions (for which you expect to find the answers)
3. Make some visualizations for the formulated questions.
4. Prepare a presentation (where you explain the data, questions, problems, results) and upload it.
This is a project in which I use Car Insurance Claim Dataset from Kaggle to generate some insights about car insurance claims and see what factors will make customers more likely to be ‘repeat offenders’.
Data visualisation is done with R program languge using R Markdown tool.
| Column name | Description |
|---|---|
| ID | Customer ID Number |
| KIDSDRIV | # of children (teenagers) driving the car |
| BIRTH | Date of birth |
| AGE | Age of driver |
| HOMEKIDS | # of children at home |
| YOJ | Years on job |
| INCOME | Income |
| PARENT1 | Single parent y/n |
| HOME_VAL | Value of home |
| MSTATUS | Marital status |
| GENDER | Gender |
| EDUCATION | Maximum education level |
| OCCUPATION | Occupation |
| TRAVTIME | Distance to work |
| CAR_USE | Vehicle use |
| BLUEBOOK | Value of vehicle |
| CAR_TYPE | Type of car |
| OLDCLAIM | Payouts, last 5 years |
| CLM_FREQ | # of claims, last 5 years |
| REVOKED | License revoked past 7 years y/n |
| MVR_PTS | Motor vehicle record points (demerits) |
| CLM_AMT | Claim amount |
| REPEAT5 | IF THERE WAS CLAIMS BEFORE=1, OTHER=0 |
R librraries used:
library(tidyverse)
library(tibble)
library(ggplot2)
library(devtools)
library(ISLR)
library(GGally)
# for missing values visualisation
library(naniar)
read.csv("insurance.csv")
insurance_data <- read.csv("insurance.csv")
# Data sample:
head(insurance_data, 6)
## ID KIDSDRIV BIRTH AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS
## 1 63581743 0 16MAR39 60 0 11 $67,349 No $0 z_No
## 2 132761049 0 21JAN56 43 0 11 $91,449 No $257,252 z_No
## 3 921317019 0 18NOV51 48 0 11 $52,881 No $0 z_No
## 4 727598473 0 05MAR64 35 1 10 $16,039 No $124,191 Yes
## 5 450221861 0 05JUN48 51 0 14 No $306,251 Yes
## 6 743146596 0 17MAY49 50 0 NA $114,986 No $243,925 Yes
## GENDER EDUCATION OCCUPATION TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE
## 1 M PhD Professional 14 Private $14,230 11 Minivan
## 2 M z_High School z_Blue Collar 22 Commercial $14,940 1 Minivan
## 3 M Bachelors Manager 26 Private $21,970 1 Van
## 4 z_F z_High School Clerical 5 Private $4,010 4 z_SUV
## 5 M <High School z_Blue Collar 32 Private $15,440 7 Minivan
## 6 z_F PhD Doctor 36 Private $18,000 1 z_SUV
## RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CLM_AMT CAR_AGE CLAIM_FLAG
## 1 yes $4,461 2 No 3 $0 18 0
## 2 yes $0 0 No 0 $0 1 0
## 3 yes $0 0 No 2 $0 10 0
## 4 no $38,690 2 No 3 $0 10 0
## 5 yes $0 0 No 0 $0 6 0
## 6 no $19,217 2 Yes 3 $0 17 0
## URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
# Amount of data
nrow(insurance_data)
## [1] 10302
# Removing ID column
insurance_data <- subset(insurance_data, select = -ID)
names(insurance_data)
## [1] "KIDSDRIV" "BIRTH" "AGE" "HOMEKIDS" "YOJ"
## [6] "INCOME" "PARENT1" "HOME_VAL" "MSTATUS" "GENDER"
## [11] "EDUCATION" "OCCUPATION" "TRAVTIME" "CAR_USE" "BLUEBOOK"
## [16] "TIF" "CAR_TYPE" "RED_CAR" "OLDCLAIM" "CLM_FREQ"
## [21] "REVOKED" "MVR_PTS" "CLM_AMT" "CAR_AGE" "CLAIM_FLAG"
## [26] "URBANICITY"
# Fixining data values
insurance_data$INCOME = as.numeric(gsub("[\\$,]", "", insurance_data$INCOME))
insurance_data$OLDCLAIM = as.numeric(gsub("[\\$,]", "", insurance_data$OLDCLAIM))
insurance_data$HOME_VAL = as.numeric(gsub("[\\$,]", "", insurance_data$HOME_VAL))
insurance_data$BLUEBOOK = as.numeric(gsub("[\\$,]", "", insurance_data$BLUEBOOK))
insurance_data$CLM_AMT = as.numeric(gsub("[\\$,]", "", insurance_data$CLM_AMT))
insurance_data$GENDER = gsub("[\\z_]", "", insurance_data$GENDER)
insurance_data$OCCUPATION = gsub("[\\z_]", "", insurance_data$OCCUPATION)
insurance_data$EDUCATION = gsub("[\\z_]", "", insurance_data$EDUCATION)
insurance_data$CAR_TYPE = gsub("[\\z_]", "", insurance_data$CAR_TYPE)
insurance_data$URBANICITY = gsub("[\\z_]", "", insurance_data$URBANICITY)
insurance_data$CAR_TYPE = gsub("[\\z_]", "", insurance_data$CAR_TYPE)
insurance_data$EDUCATION = gsub("[\\z_]", "", insurance_data$EDUCATION)
insurance_data$OCCUPATION = gsub("[\\z_]", "", insurance_data$OCCUPATION)
insurance_data$MSTATUS = gsub("[\\z_]", "", insurance_data$MSTATUS)
insurance_data$EDUCATION = gsub("[\\<]", "", insurance_data$EDUCATION)
insurance_data[insurance_data == "Highly Urban/ Urban" ] <- "Urban"
insurance_data[insurance_data == "Highly Rural/ Rural" ] <- "Rural"
# missing values visualization
vis_miss(insurance_data)
### Data cleaning
# Checking if there asre missing values
insurance_data[!complete.cases(insurance_data),]
# rows with missing values is being removed
insurance_data<- na.omit(insurance_data)
# Checking one more time if there asre missing values
insurance_data[!complete.cases(insurance_data),]
# Amount of date after cleaning
nrow(insurance_data)
## [1] 8163
# Cleaning of unnecessary symbols
unique(insurance_data$KIDSDRIV)
## [1] 0 1 2 3 4
unique(insurance_data$HOMEKIDS)
## [1] 0 1 2 3 4 5
unique(insurance_data$PARENT1)
## [1] "No" "Yes"
unique(insurance_data$MSTATUS)
## [1] "No" "Yes"
unique(insurance_data$GENDER)
## [1] "M" "F"
unique(insurance_data$EDUCATION)
## [1] "PhD" "High School" "Bachelors" "Masters"
unique(insurance_data$OCCUPATION)
## [1] "Professional" "Blue Collar" "Manager" "Clerical" "Lawyer"
## [6] "" "Home Maker" "Doctor" "Student"
unique(insurance_data$URBANICITY)
## [1] "Urban" "Rural"
# min/max values
sapply(insurance_data, min)
## KIDSDRIV BIRTH AGE HOMEKIDS YOJ INCOME
## "0" "01APR39" "16" "0" "0" "0"
## PARENT1 HOME_VAL MSTATUS GENDER EDUCATION OCCUPATION
## "Yes" "0" "Yes" "F" "Bachelors" ""
## TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR
## "5" "Commercial" "1500" "1" "Minivan" "yes"
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS CLM_AMT CAR_AGE
## "0" "0" "Yes" "0" "0" "-3"
## CLAIM_FLAG URBANICITY
## "0" "Rural"
sapply(insurance_data, max)
## KIDSDRIV BIRTH AGE HOMEKIDS YOJ INCOME PARENT1
## "4" "31OCT73" "81" "5" "23" "367030" "No"
## HOME_VAL MSTATUS GENDER EDUCATION OCCUPATION TRAVTIME CAR_USE
## "885282" "No" "M" "PhD" "Student" "142" "Private"
## BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED
## "69740" "25" "Van" "no" "57037" "5" "No"
## MVR_PTS CLM_AMT CAR_AGE CLAIM_FLAG URBANICITY
## "13" "85524" "28" "1" "Urban"
# Determine the data types of a data frame's columns
str(insurance_data)
## 'data.frame': 8163 obs. of 26 variables:
## $ KIDSDRIV : int 0 0 0 0 0 1 0 0 0 0 ...
## $ BIRTH : chr "16MAR39" "21JAN56" "18NOV51" "05MAR64" ...
## $ AGE : int 60 43 48 35 34 40 44 34 50 53 ...
## $ HOMEKIDS : int 0 0 0 1 1 1 2 0 0 0 ...
## $ YOJ : int 11 11 11 10 12 11 12 10 7 14 ...
## $ INCOME : num 67349 91449 52881 16039 125301 ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : num 0 257252 0 124191 0 ...
## $ MSTATUS : chr "No" "No" "No" "Yes" ...
## $ GENDER : chr "M" "M" "M" "F" ...
## $ EDUCATION : chr "PhD" "High School" "Bachelors" "High School" ...
## $ OCCUPATION: chr "Professional" "Blue Collar" "Manager" "Clerical" ...
## $ TRAVTIME : int 14 22 26 5 46 21 30 34 48 15 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : num 14230 14940 21970 4010 17430 ...
## $ TIF : int 11 1 1 4 1 6 10 1 7 1 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "Van" "SUV" ...
## $ RED_CAR : chr "yes" "yes" "yes" "no" ...
## $ OLDCLAIM : num 4461 0 0 38690 0 ...
## $ CLM_FREQ : int 2 0 0 2 0 1 0 0 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 2 3 0 2 0 0 1 0 ...
## $ CLM_AMT : num 0 0 0 0 2946 ...
## $ CAR_AGE : int 18 1 10 10 7 1 10 1 17 11 ...
## $ CLAIM_FLAG: int 0 0 0 0 1 1 0 1 0 1 ...
## $ URBANICITY: chr "Urban" "Urban" "Urban" "Urban" ...
## - attr(*, "na.action")= 'omit' Named int [1:2139] 5 6 8 11 18 26 30 36 37 55 ...
## ..- attr(*, "names")= chr [1:2139] "5" "6" "8" "11" ...
Since every customer in this dataset has made a claim, it’ll be hard to draw conclusions from the data as is. Instead, I’m going to use the information in the CLM_FREQ column to create a target variable: if a customer has had 1 or more claims in the past 5 years, we’ll call them a “repeat offender”. Then we can use the rest of the data to see which customer attributes lead to them becoming repeat offenders.
insurance_data <- insurance_data %>% add_column(REPEAT5 = if_else(.$CLM_FREQ == 0, 0, 1))
ggplot(insurance_data, aes(CAR_TYPE, group = GENDER)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") + scale_y_continuous(labels=scales::percent) + ylab("relative frequencies") + facet_grid(~GENDER) + ggtitle("Car type by gender") +labs(x = "Car Type", fill = "Car Type") +
theme_light()
ggplot(insurance_data, aes(CAR_TYPE, group = URBANICITY)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") + scale_y_continuous(labels=scales::percent) + ylab("relative frequencies") + facet_grid(~URBANICITY) + ggtitle("Car type by area type (RURAL/URBAN)")+ labs(x = "Car Type", fill = "Car Type")
ggplot(insurance_data, aes(x = factor(REPEAT5), fill=REPEAT5)) + geom_bar() + geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white") + ggtitle("First or reapeted claim") +labs(x = "Firsrt Claim More than one claim", fill = "0 - First claim \n 1 - Repeated claim")
Does gender have any influence on our target variable?
ggplot(insurance_data, aes(REPEAT5, group = GENDER)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") + scale_y_continuous(labels=scales::percent) + ylab("relative frequencies") + facet_grid(~GENDER) + ggtitle("Gender influence on repeated claims ") +labs(x = "Firsrt Claim / More than one claim", fill = "0 - First claim \n 1 - Repeated claim")
There is no signifcant influence by gender.
### Urbanicity influence on reapeted claims
Does Urbanicity have any influence on our target variable?
ggplot(insurance_data, aes(REPEAT5, group = URBANICITY)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") + scale_y_continuous(labels=scales::percent) + ylab("relative frequencies") + facet_grid(~URBANICITY) + ggtitle("URBANICITY influence on repeated claims ") +labs(x = "Car Type", fill = "0 - First claim \n 1 - Repeated claim")
Urban drivers have a much higher probability of becoming repeat offenders than those who live in Rural areas.
### Urbanicity influence on reapeted claims
ggplot(data = insurance_data) + geom_smooth(mapping = aes(x = MVR_PTS, y = REPEAT5)) + ggtitle("Dermit points and repeted claims releation ") +labs(x = "Number of dermit points ", y = "Proportion of Repeated claims")
As the number of demerits a driver has increases, so do the odds that they will make repeat insurance claims. or those with 9 demerit points or more, 100% of them have made more than one claim in the past 5 years!