Homework assignment No.1

1.Choose a data set (the number of data attributes should be more than 5), explain why it is important or interesting for you. 2 2. Formulate research questions (for which you expect to find the answers)
3. Make some visualizations for the formulated questions.
4. Prepare a presentation (where you explain the data, questions, problems, results) and upload it.

Introduction:

This is a project in which I use Car Insurance Claim Dataset from Kaggle to generate some insights about car insurance claims and see what factors will make customers more likely to be ‘repeat offenders’.
Data visualisation is done with R program languge using R Markdown tool.

Column name	Description
ID	Customer ID Number
KIDSDRIV	# of children (teenagers) driving the car
BIRTH	Date of birth
AGE	Age of driver
HOMEKIDS	# of children at home
YOJ	Years on job
INCOME	Income
PARENT1	Single parent y/n
HOME_VAL	Value of home
MSTATUS	Marital status
GENDER	Gender
EDUCATION	Maximum education level
OCCUPATION	Occupation
TRAVTIME	Distance to work
CAR_USE	Vehicle use
BLUEBOOK	Value of vehicle
CAR_TYPE	Type of car
OLDCLAIM	Payouts, last 5 years
CLM_FREQ	# of claims, last 5 years
REVOKED	License revoked past 7 years y/n
MVR_PTS	Motor vehicle record points (demerits)
CLM_AMT	Claim amount
REPEAT5	IF THERE WAS CLAIMS BEFORE=1, OTHER=0

R librraries used:

library(tidyverse)
library(tibble)
library(ggplot2)
library(devtools)
library(ISLR)
library(GGally)
# for missing values visualisation
library(naniar)

Data preparation

read.csv("insurance.csv")
insurance_data <- read.csv("insurance.csv")

Data review

# Data sample:
head(insurance_data, 6)

##          ID KIDSDRIV   BIRTH AGE HOMEKIDS YOJ   INCOME PARENT1 HOME_VAL MSTATUS
## 1  63581743        0 16MAR39  60        0  11  $67,349      No       $0    z_No
## 2 132761049        0 21JAN56  43        0  11  $91,449      No $257,252    z_No
## 3 921317019        0 18NOV51  48        0  11  $52,881      No       $0    z_No
## 4 727598473        0 05MAR64  35        1  10  $16,039      No $124,191     Yes
## 5 450221861        0 05JUN48  51        0  14               No $306,251     Yes
## 6 743146596        0 17MAY49  50        0  NA $114,986      No $243,925     Yes
##   GENDER     EDUCATION    OCCUPATION TRAVTIME    CAR_USE BLUEBOOK TIF CAR_TYPE
## 1      M           PhD  Professional       14    Private  $14,230  11  Minivan
## 2      M z_High School z_Blue Collar       22 Commercial  $14,940   1  Minivan
## 3      M     Bachelors       Manager       26    Private  $21,970   1      Van
## 4    z_F z_High School      Clerical        5    Private   $4,010   4    z_SUV
## 5      M  <High School z_Blue Collar       32    Private  $15,440   7  Minivan
## 6    z_F           PhD        Doctor       36    Private  $18,000   1    z_SUV
##   RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CLM_AMT CAR_AGE CLAIM_FLAG
## 1     yes   $4,461        2      No       3      $0      18          0
## 2     yes       $0        0      No       0      $0       1          0
## 3     yes       $0        0      No       2      $0      10          0
## 4      no  $38,690        2      No       3      $0      10          0
## 5     yes       $0        0      No       0      $0       6          0
## 6      no  $19,217        2     Yes       3      $0      17          0
##            URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban

# Amount of data
nrow(insurance_data)

## [1] 10302

# Removing ID column
insurance_data <- subset(insurance_data, select = -ID)
names(insurance_data)

##  [1] "KIDSDRIV"   "BIRTH"      "AGE"        "HOMEKIDS"   "YOJ"       
##  [6] "INCOME"     "PARENT1"    "HOME_VAL"   "MSTATUS"    "GENDER"    
## [11] "EDUCATION"  "OCCUPATION" "TRAVTIME"   "CAR_USE"    "BLUEBOOK"  
## [16] "TIF"        "CAR_TYPE"   "RED_CAR"    "OLDCLAIM"   "CLM_FREQ"  
## [21] "REVOKED"    "MVR_PTS"    "CLM_AMT"    "CAR_AGE"    "CLAIM_FLAG"
## [26] "URBANICITY"

# Fixining data values 
insurance_data$INCOME = as.numeric(gsub("[\\$,]", "", insurance_data$INCOME))
insurance_data$OLDCLAIM = as.numeric(gsub("[\\$,]", "", insurance_data$OLDCLAIM))
insurance_data$HOME_VAL = as.numeric(gsub("[\\$,]", "", insurance_data$HOME_VAL))
insurance_data$BLUEBOOK = as.numeric(gsub("[\\$,]", "", insurance_data$BLUEBOOK))
insurance_data$CLM_AMT = as.numeric(gsub("[\\$,]", "", insurance_data$CLM_AMT))
insurance_data$GENDER = gsub("[\\z_]", "", insurance_data$GENDER)
insurance_data$OCCUPATION = gsub("[\\z_]", "", insurance_data$OCCUPATION)
insurance_data$EDUCATION = gsub("[\\z_]", "", insurance_data$EDUCATION)
insurance_data$CAR_TYPE = gsub("[\\z_]", "", insurance_data$CAR_TYPE)
insurance_data$URBANICITY = gsub("[\\z_]", "", insurance_data$URBANICITY)
insurance_data$CAR_TYPE = gsub("[\\z_]", "", insurance_data$CAR_TYPE)
insurance_data$EDUCATION = gsub("[\\z_]", "", insurance_data$EDUCATION)
insurance_data$OCCUPATION = gsub("[\\z_]", "", insurance_data$OCCUPATION)
insurance_data$MSTATUS = gsub("[\\z_]", "", insurance_data$MSTATUS)
insurance_data$EDUCATION = gsub("[\\<]", "", insurance_data$EDUCATION)
insurance_data[insurance_data == "Highly Urban/ Urban" ] <- "Urban"
insurance_data[insurance_data == "Highly Rural/ Rural" ] <- "Rural"

Mising Values visualsation

# missing values visualization
vis_miss(insurance_data)

### Data cleaning

# Checking if there asre missing values
insurance_data[!complete.cases(insurance_data),]


# rows with missing values is being removed
insurance_data<- na.omit(insurance_data)
# Checking one more time if there asre missing values
insurance_data[!complete.cases(insurance_data),]

Data review

# Amount of date after cleaning
nrow(insurance_data)

## [1] 8163

# Cleaning of unnecessary symbols
unique(insurance_data$KIDSDRIV)

## [1] 0 1 2 3 4

unique(insurance_data$HOMEKIDS)

## [1] 0 1 2 3 4 5

unique(insurance_data$PARENT1)

## [1] "No"  "Yes"

unique(insurance_data$MSTATUS)

## [1] "No"  "Yes"

unique(insurance_data$GENDER)

## [1] "M" "F"

unique(insurance_data$EDUCATION)

## [1] "PhD"         "High School" "Bachelors"   "Masters"

unique(insurance_data$OCCUPATION)

## [1] "Professional" "Blue Collar"  "Manager"      "Clerical"     "Lawyer"      
## [6] ""             "Home Maker"   "Doctor"       "Student"

unique(insurance_data$URBANICITY)

## [1] "Urban" "Rural"

# min/max values
sapply(insurance_data, min)

##     KIDSDRIV        BIRTH          AGE     HOMEKIDS          YOJ       INCOME 
##          "0"    "01APR39"         "16"          "0"          "0"          "0" 
##      PARENT1     HOME_VAL      MSTATUS       GENDER    EDUCATION   OCCUPATION 
##        "Yes"          "0"        "Yes"          "F"  "Bachelors"           "" 
##     TRAVTIME      CAR_USE     BLUEBOOK          TIF     CAR_TYPE      RED_CAR 
##          "5" "Commercial"       "1500"          "1"    "Minivan"        "yes" 
##     OLDCLAIM     CLM_FREQ      REVOKED      MVR_PTS      CLM_AMT      CAR_AGE 
##          "0"          "0"        "Yes"          "0"          "0"         "-3" 
##   CLAIM_FLAG   URBANICITY 
##          "0"      "Rural"

sapply(insurance_data, max)

##   KIDSDRIV      BIRTH        AGE   HOMEKIDS        YOJ     INCOME    PARENT1 
##        "4"  "31OCT73"       "81"        "5"       "23"   "367030"       "No" 
##   HOME_VAL    MSTATUS     GENDER  EDUCATION OCCUPATION   TRAVTIME    CAR_USE 
##   "885282"       "No"        "M"      "PhD"  "Student"      "142"  "Private" 
##   BLUEBOOK        TIF   CAR_TYPE    RED_CAR   OLDCLAIM   CLM_FREQ    REVOKED 
##    "69740"       "25"      "Van"       "no"    "57037"        "5"       "No" 
##    MVR_PTS    CLM_AMT    CAR_AGE CLAIM_FLAG URBANICITY 
##       "13"    "85524"       "28"        "1"    "Urban"

# Determine the data types of a data frame's columns
str(insurance_data)

## 'data.frame':    8163 obs. of  26 variables:
##  $ KIDSDRIV  : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ BIRTH     : chr  "16MAR39" "21JAN56" "18NOV51" "05MAR64" ...
##  $ AGE       : int  60 43 48 35 34 40 44 34 50 53 ...
##  $ HOMEKIDS  : int  0 0 0 1 1 1 2 0 0 0 ...
##  $ YOJ       : int  11 11 11 10 12 11 12 10 7 14 ...
##  $ INCOME    : num  67349 91449 52881 16039 125301 ...
##  $ PARENT1   : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL  : num  0 257252 0 124191 0 ...
##  $ MSTATUS   : chr  "No" "No" "No" "Yes" ...
##  $ GENDER    : chr  "M" "M" "M" "F" ...
##  $ EDUCATION : chr  "PhD" "High School" "Bachelors" "High School" ...
##  $ OCCUPATION: chr  "Professional" "Blue Collar" "Manager" "Clerical" ...
##  $ TRAVTIME  : int  14 22 26 5 46 21 30 34 48 15 ...
##  $ CAR_USE   : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK  : num  14230 14940 21970 4010 17430 ...
##  $ TIF       : int  11 1 1 4 1 6 10 1 7 1 ...
##  $ CAR_TYPE  : chr  "Minivan" "Minivan" "Van" "SUV" ...
##  $ RED_CAR   : chr  "yes" "yes" "yes" "no" ...
##  $ OLDCLAIM  : num  4461 0 0 38690 0 ...
##  $ CLM_FREQ  : int  2 0 0 2 0 1 0 0 0 0 ...
##  $ REVOKED   : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS   : int  3 0 2 3 0 2 0 0 1 0 ...
##  $ CLM_AMT   : num  0 0 0 0 2946 ...
##  $ CAR_AGE   : int  18 1 10 10 7 1 10 1 17 11 ...
##  $ CLAIM_FLAG: int  0 0 0 0 1 1 0 1 0 1 ...
##  $ URBANICITY: chr  "Urban" "Urban" "Urban" "Urban" ...
##  - attr(*, "na.action")= 'omit' Named int [1:2139] 5 6 8 11 18 26 30 36 37 55 ...
##   ..- attr(*, "names")= chr [1:2139] "5" "6" "8" "11" ...

Since every customer in this dataset has made a claim, it’ll be hard to draw conclusions from the data as is. Instead, I’m going to use the information in the CLM_FREQ column to create a target variable: if a customer has had 1 or more claims in the past 5 years, we’ll call them a “repeat offender”. Then we can use the rest of the data to see which customer attributes lead to them becoming repeat offenders.

insurance_data <- insurance_data %>% add_column(REPEAT5 = if_else(.$CLM_FREQ == 0, 0, 1))

VISUALIZATIONS

Most driven car types by gender

ggplot(insurance_data, aes(CAR_TYPE, group = GENDER)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +  scale_y_continuous(labels=scales::percent) + ylab("relative frequencies") + facet_grid(~GENDER) + ggtitle("Car type by gender") +labs(x = "Car Type", fill = "Car Type") + 
  theme_light()

Most driven car in URBAN ir RURAL areas

ggplot(insurance_data, aes(CAR_TYPE, group = URBANICITY)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +  scale_y_continuous(labels=scales::percent) + ylab("relative frequencies") + facet_grid(~URBANICITY) + ggtitle("Car type by area type (RURAL/URBAN)")+ labs(x = "Car Type", fill = "Car Type")

First caim or reapeted claim?

 ggplot(insurance_data, aes(x = factor(REPEAT5), fill=REPEAT5)) + geom_bar() + geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white") + ggtitle("First or reapeted claim") +labs(x = "Firsrt Claim                   More than one claim", fill = "0 - First claim \n 1 - Repeated claim")

Gender infulence

Does gender have any influence on our target variable?

ggplot(insurance_data, aes(REPEAT5, group = GENDER)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +  scale_y_continuous(labels=scales::percent) + ylab("relative frequencies") + facet_grid(~GENDER) + ggtitle("Gender influence on repeated claims ")  +labs(x = "Firsrt Claim / More than one claim", fill = "0 - First claim \n 1 - Repeated claim")

There is no signifcant influence by gender.
### Urbanicity influence on reapeted claims

Does Urbanicity have any influence on our target variable?

ggplot(insurance_data, aes(REPEAT5, group = URBANICITY)) + geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +  scale_y_continuous(labels=scales::percent) + ylab("relative frequencies") + facet_grid(~URBANICITY) + ggtitle("URBANICITY influence on repeated claims ")  +labs(x = "Car Type", fill = "0 - First claim \n 1 - Repeated claim")

Urban drivers have a much higher probability of becoming repeat offenders than those who live in Rural areas.
### Urbanicity influence on reapeted claims

 ggplot(data = insurance_data) +  geom_smooth(mapping = aes(x = MVR_PTS, y = REPEAT5)) + ggtitle("Dermit points and repeted claims releation ")  +labs(x = "Number of dermit points ", y = "Proportion  of Repeated claims")

As the number of demerits a driver has increases, so do the odds that they will make repeat insurance claims. or those with 9 demerit points or more, 100% of them have made more than one claim in the past 5 years!

CAR INSURANCE CLAIM DATA

Gediminas Kazėnas

2022-03-17

Homework assignment No.1

Introduction:

Data preparation

Data review

Mising Values visualsation

Data review

VISUALIZATIONS

Most driven car types by gender

Most driven car in URBAN ir RURAL areas

First caim or reapeted claim?

Gender infulence