knitr::opts_chunk$set(echo = TRUE)

1. PROJECT OBJECTIVE

Insurance provides financing, leverage, cushion against possible eventuality. It is an arrangement by which the Company undertakes to provide a guarantee of compensation for specified loss, damage, illness or death in return by paymnent of a specified premium.

Premium paid by the customer is the major revenue source for insurance companies. Default in premium payments results in significant revenue losses and hence insurance companies would like to know upfront which type of customers would default premium payments. The objective of this project is to predict the probability that a customer will default the premium payment, so that the insurance agent can proactively reach out to the policy holder to follow up for the payment of premium.

Problem Statement: *The project ultimate focus is to predict if the current customer(s) will Default in Future Payments

The Project will include the following: * Visual inspection of data (rows, columns, descriptive details) * Understanding of attributes (variable info, renaming if required) * Univariate analysis (distribution and spread for every continuous attribute, distribution of the data in categories for categorical ones) * Bivariate analysis (relationship between different variables, correlations) * Building a model that can predict the likelihood of a customer defaulting on premium payments (Who is likely to default) * Identifing the factors that drive higher default rate (Are there any characteristics of the customers who are likely to default?) * Propose a strategy for reducing default rates by using the model and other insights from the analysis (What should be done to reduce the default rates?)

2.1 The Data

The dataset contains the following details collected well over ten years on 79, 854 Policy Holders. (The exact number of years will be decided as we gain better understanding of the data) * id: Unique customer ID perc_premium_paid_by_cash_credit: What % of the premium was paid by cash payments? * age_in_days: age of the customer in days * Income: Income of the customer * Marital Status: Married/Unmarried, Married (1), unmarried (0) * Veh_owned: Number of vehicles owned (1-3) * Count_3-6_months_late: Number of times premium was paid 3-6 months late * Count_6-12_months_late: Number of times premium was paid 6-12 months late * Count_more_than_12_months_late: Number of times premium was paid more than 12 months late * Risk_score: Risk score of customer (similar to credit score) * No_of_dep: Number of dependents in the family of the customer (1-4) * Accommodation: Owned (1), Rented (0) no_of_premiums_paid: Number of premiums paid till date * sourcing_channel: Channel through which customer was sourced * residence_area_type: Residence type of the customer * premium : Total premium amount paid till now * default: Y variable - 0 indicates that customer has defaulted the premium and 1 indicates that customer has not defaulted

3. Install Packages

library(car)

## Loading required package: carData

library(MASS)
library(ggplot2)
library(gridExtra)
library(caTools)
library(e1071)
library(ROCR)
library(InformationValue)
library(class)
library(knitr)
library(corrplot)

## corrplot 0.84 loaded

library(dataframeexplorer)

3. Exploratory Data Analysis

3.1 Import Data

3.1.1 Set Working Directory

setwd("~/Great Learning/Capstone Project")

3.1.2 Import and Read the Dataset

library(readxl)
InsuranceDataset <- read_excel("Insurance Premium Default-Dataset.xlsx")
View(InsuranceDataset)

head(InsuranceDataset)

## # A tibble: 6 x 17
##      id perc_premium_pa~ age_in_days Income `Count_3-6_mont~ `Count_6-12_mon~
##   <dbl>            <dbl>       <dbl>  <dbl>            <dbl>            <dbl>
## 1     1            0.317       11330  90050                0                0
## 2     2            0           30309 156080                0                0
## 3     3            0.015       16069 145020                1                0
## 4     4            0           23733 187560                0                0
## 5     5            0.888       19360 103050                7                3
## 6     6            0.512       16795 113500                0                0
## # ... with 11 more variables: Count_more_than_12_months_late <dbl>, `Marital
## #   Status` <dbl>, Veh_Owned <dbl>, No_of_dep <dbl>, Accomodation <dbl>,
## #   risk_score <dbl>, no_of_premiums_paid <dbl>, sourcing_channel <chr>,
## #   residence_area_type <chr>, premium <dbl>, default <dbl>

tail(InsuranceDataset)

## # A tibble: 6 x 17
##      id perc_premium_pa~ age_in_days Income `Count_3-6_mont~ `Count_6-12_mon~
##   <dbl>            <dbl>       <dbl>  <dbl>            <dbl>            <dbl>
## 1 79848            0.009       21545 133140                0                0
## 2 79849            0.249       25555  64420                0                0
## 3 79850            0.003       16797 660040                1                0
## 4 79851            0.012       24835 227760                0                0
## 5 79852            0.19        10959 153060                1                0
## 6 79853            0           19720 324030                0                0
## # ... with 11 more variables: Count_more_than_12_months_late <dbl>, `Marital
## #   Status` <dbl>, Veh_Owned <dbl>, No_of_dep <dbl>, Accomodation <dbl>,
## #   risk_score <dbl>, no_of_premiums_paid <dbl>, sourcing_channel <chr>,
## #   residence_area_type <chr>, premium <dbl>, default <dbl>

str(InsuranceDataset)

## tibble [79,853 x 17] (S3: tbl_df/tbl/data.frame)
##  $ id                              : num [1:79853] 1 2 3 4 5 6 7 8 9 10 ...
##  $ perc_premium_paid_by_cash_credit: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
##  $ age_in_days                     : num [1:79853] 11330 30309 16069 23733 19360 ...
##  $ Income                          : num [1:79853] 90050 156080 145020 187560 103050 ...
##  $ Count_3-6_months_late           : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
##  $ Count_6-12_months_late          : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
##  $ Count_more_than_12_months_late  : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
##  $ Marital Status                  : num [1:79853] 0 1 0 1 0 0 0 0 1 1 ...
##  $ Veh_Owned                       : num [1:79853] 3 3 1 1 2 1 3 3 2 3 ...
##  $ No_of_dep                       : num [1:79853] 3 1 1 1 1 4 4 2 4 3 ...
##  $ Accomodation                    : num [1:79853] 1 1 1 0 0 0 1 0 1 1 ...
##  $ risk_score                      : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
##  $ no_of_premiums_paid             : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
##  $ sourcing_channel                : chr [1:79853] "A" "A" "C" "A" ...
##  $ residence_area_type             : chr [1:79853] "Rural" "Urban" "Urban" "Urban" ...
##  $ premium                         : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
##  $ default                         : num [1:79853] 1 1 1 1 0 1 1 1 1 1 ...

3.2 Lets Rename Columns with dplyr::rename ()

We will rename the columns to shorter names we can easiy work with
Lets first get the Column Names

colnames(InsuranceDataset)

##  [1] "id"                               "perc_premium_paid_by_cash_credit"
##  [3] "age_in_days"                      "Income"                          
##  [5] "Count_3-6_months_late"            "Count_6-12_months_late"          
##  [7] "Count_more_than_12_months_late"   "Marital Status"                  
##  [9] "Veh_Owned"                        "No_of_dep"                       
## [11] "Accomodation"                     "risk_score"                      
## [13] "no_of_premiums_paid"              "sourcing_channel"                
## [15] "residence_area_type"              "premium"                         
## [17] "default"

Let’s rename the column names

names(InsuranceDataset) [names(InsuranceDataset) == "perc_premium_paid_by_cash_credit"] <-"Perc_Premiumcash"

names(InsuranceDataset) [names(InsuranceDataset) == "age_in_days"] <-"Age"

names(InsuranceDataset) [names(InsuranceDataset) == "Count_3-6_months_late"] <-"Count3-6"

names(InsuranceDataset) [names(InsuranceDataset) == "Count_6-12_months_late"] <-"Count6_12"

names(InsuranceDataset) [names(InsuranceDataset) == "Count_more_than_12_months_late"] <-"Count12andmore"

names(InsuranceDataset) [names(InsuranceDataset) == "Marital Status"] <-"Marital_Status"

names(InsuranceDataset) [names(InsuranceDataset) == "No_of_dep"] <-"Dependents"

names(InsuranceDataset) [names(InsuranceDataset) == "risk_score"] <-"Risk_Score"

names(InsuranceDataset) [names(InsuranceDataset) == "no_of_premiums_paid"] <-"Premiums_Paid"

names(InsuranceDataset) [names(InsuranceDataset) == "sourcing_channel"] <-"Sourcing_Channel"

names(InsuranceDataset) [names(InsuranceDataset) == "residence_area_type"] <-"Residence"

names(InsuranceDataset) [names(InsuranceDataset) == "premium"] <-"Premium"

names(InsuranceDataset) [names(InsuranceDataset) == "default"] <-"Default"

Having changed the column names, lets check them

colnames(InsuranceDataset)

##  [1] "id"               "Perc_Premiumcash" "Age"              "Income"          
##  [5] "Count3-6"         "Count6_12"        "Count12andmore"   "Marital_Status"  
##  [9] "Veh_Owned"        "Dependents"       "Accomodation"     "Risk_Score"      
## [13] "Premiums_Paid"    "Sourcing_Channel" "Residence"        "Premium"         
## [17] "Default"

3.3 Missing Value Treatment

Lets check if there is any missing value in the dataset

colSums(is.na(InsuranceDataset))

##               id Perc_Premiumcash              Age           Income 
##                0                0                0                0 
##         Count3-6        Count6_12   Count12andmore   Marital_Status 
##                0                0                0                0 
##        Veh_Owned       Dependents     Accomodation       Risk_Score 
##                0                0                0                0 
##    Premiums_Paid Sourcing_Channel        Residence          Premium 
##                0                0                0                0 
##          Default 
##                0

There are no missing values

3.4 Lets check the summary of the dataset so, we can understand the type of transformation we need to make in our dataset

summary (InsuranceDataset)

##        id        Perc_Premiumcash      Age            Income        
##  Min.   :    1   Min.   :0.0000   Min.   : 7670   Min.   :   24030  
##  1st Qu.:19964   1st Qu.:0.0340   1st Qu.:14974   1st Qu.:  108010  
##  Median :39927   Median :0.1670   Median :18625   Median :  166560  
##  Mean   :39927   Mean   :0.3143   Mean   :18847   Mean   :  208847  
##  3rd Qu.:59890   3rd Qu.:0.5380   3rd Qu.:22636   3rd Qu.:  252090  
##  Max.   :79853   Max.   :1.0000   Max.   :37602   Max.   :90262600  
##     Count3-6         Count6_12        Count12andmore     Marital_Status  
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.:0.0000  
##  Median : 0.0000   Median : 0.00000   Median : 0.00000   Median :0.0000  
##  Mean   : 0.2484   Mean   : 0.07809   Mean   : 0.05994   Mean   :0.4987  
##  3rd Qu.: 0.0000   3rd Qu.: 0.00000   3rd Qu.: 0.00000   3rd Qu.:1.0000  
##  Max.   :13.0000   Max.   :17.00000   Max.   :11.00000   Max.   :1.0000  
##    Veh_Owned       Dependents     Accomodation      Risk_Score   
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :91.90  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:98.83  
##  Median :2.000   Median :3.000   Median :1.0000   Median :99.18  
##  Mean   :1.998   Mean   :2.503   Mean   :0.5013   Mean   :99.07  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:99.52  
##  Max.   :3.000   Max.   :4.000   Max.   :1.0000   Max.   :99.89  
##  Premiums_Paid   Sourcing_Channel    Residence            Premium     
##  Min.   : 2.00   Length:79853       Length:79853       Min.   : 1200  
##  1st Qu.: 7.00   Class :character   Class :character   1st Qu.: 5400  
##  Median :10.00   Mode  :character   Mode  :character   Median : 7500  
##  Mean   :10.86                                         Mean   :10925  
##  3rd Qu.:14.00                                         3rd Qu.:13800  
##  Max.   :60.00                                         Max.   :60000  
##     Default      
##  Min.   :0.0000  
##  1st Qu.:1.0000  
##  Median :1.0000  
##  Mean   :0.9374  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

The id is just the serial number of each policy holder, from no 1 (min) to number 79,853 (Max)
Perc_Premiumcash, this is the % of the premium made by cash payment; the 1st Quartile is 3.4%, the Median is 16.7%, the mean is 31.4% and the 3rd Quartile is 58.3%, hence 75% of the policy holders are less than 58.3%, meaning lesser policy holders make their premium by cash.
Age in days, seems we can’t make meaning of it yet, until we convert it to age in years
The min income is 24,030 USD, median is 166,560 USD, and the max is 90,262,600 USD.
Count3-6; the number of times premium was paid 3-6 months late; 3rd Quartile is 0.000, meaning most policy holders pay up in time, but a few of them do delay, with max up to 13 times
Count6_12 ; the number of times premium was paid 6-12 months late; although 3rd Quartile is 0.000, yet max is 17, meaning most policy holders pays their premium on time, but few outliers still delay up to 6-12 months
Count12andmore; number of times premium was paid, more than 12 months is maximally 11 times.
Marital_Status; this will be changed to factor for us to make more sense of the data
Dependents; the media number of dependents per policy holder is 3 and the max is 4
Risk_Score is high for most clients, the min is 91.9, the median is 99.18, meaning most clients are nearer to the max than the minimum, as the mas is 99.89
Sourcing_Channel; channel through which the customer was sourced will be converted to factor so that we can make more sense of the data
Veh_Owned; max is 3 and minimum is 1, hence this is likely to be Vehicle insurance, since all policy vehicle own at least 1 vehicle#
Accommodation: this will be converted to factor, so we can make more sense of the data
Premiums_Paid; no of premium paid till date; min is 2, median is 10 and max is 60. This may help understand the duration of this data collection, It means it will definitely be over 5 years, if it is monthly premium
Residence: we will change this to factor to help us understand the data better
Premium; total premium paid till date, min is 1,200, mean is 7,500 and max is 60,000
Default: this is the variable of interest and it will be changed to factor to help gain better insight into the data.

3.5 Lets now Transform the Age Column Values from days to years

We will create a new column of Age in Years from the original Age in Days column

InsuranceDataset$AgeinYears <- InsuranceDataset$Age /365

Having created another column called AgeinYears, we can see in the Global Environment that the variables have increased from 17 to 18.
We will go ahead to remove the “Age” column (Column 3) later on.

3.6. Lets also transform the Income Variable, by creating a new variable called Income Class

We will create the income class column using the cut function on the current Income Variable.

InsuranceDataset$IncomeClass <- cut(InsuranceDataset$Income, breaks = c(0, 150000, 250000, 90262600), labels = c("low","middle","high"))

We have created 3 income class (low, middle and high) within a new variable called IncomeClass
We now have 19 variables up from 18.
Lets check the dimension of the new dataset

dim(InsuranceDataset)

## [1] 79853    19

summary(InsuranceDataset)

##        id        Perc_Premiumcash      Age            Income        
##  Min.   :    1   Min.   :0.0000   Min.   : 7670   Min.   :   24030  
##  1st Qu.:19964   1st Qu.:0.0340   1st Qu.:14974   1st Qu.:  108010  
##  Median :39927   Median :0.1670   Median :18625   Median :  166560  
##  Mean   :39927   Mean   :0.3143   Mean   :18847   Mean   :  208847  
##  3rd Qu.:59890   3rd Qu.:0.5380   3rd Qu.:22636   3rd Qu.:  252090  
##  Max.   :79853   Max.   :1.0000   Max.   :37602   Max.   :90262600  
##     Count3-6         Count6_12        Count12andmore     Marital_Status  
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.:0.0000  
##  Median : 0.0000   Median : 0.00000   Median : 0.00000   Median :0.0000  
##  Mean   : 0.2484   Mean   : 0.07809   Mean   : 0.05994   Mean   :0.4987  
##  3rd Qu.: 0.0000   3rd Qu.: 0.00000   3rd Qu.: 0.00000   3rd Qu.:1.0000  
##  Max.   :13.0000   Max.   :17.00000   Max.   :11.00000   Max.   :1.0000  
##    Veh_Owned       Dependents     Accomodation      Risk_Score   
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :91.90  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:98.83  
##  Median :2.000   Median :3.000   Median :1.0000   Median :99.18  
##  Mean   :1.998   Mean   :2.503   Mean   :0.5013   Mean   :99.07  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:99.52  
##  Max.   :3.000   Max.   :4.000   Max.   :1.0000   Max.   :99.89  
##  Premiums_Paid   Sourcing_Channel    Residence            Premium     
##  Min.   : 2.00   Length:79853       Length:79853       Min.   : 1200  
##  1st Qu.: 7.00   Class :character   Class :character   1st Qu.: 5400  
##  Median :10.00   Mode  :character   Mode  :character   Median : 7500  
##  Mean   :10.86                                         Mean   :10925  
##  3rd Qu.:14.00                                         3rd Qu.:13800  
##  Max.   :60.00                                         Max.   :60000  
##     Default         AgeinYears     IncomeClass   
##  Min.   :0.0000   Min.   : 21.01   low   :33495  
##  1st Qu.:1.0000   1st Qu.: 41.02   middle:25779  
##  Median :1.0000   Median : 51.03   high  :20579  
##  Mean   :0.9374   Mean   : 51.63                 
##  3rd Qu.:1.0000   3rd Qu.: 62.02                 
##  Max.   :1.0000   Max.   :103.02

We now have 19 variables and the Income Class variable is clearly seen
About 42% of the policy holders are within the low income bracket, 32% within the middle income class and 26% within the high income class

3.7 Lets convert IncomeClass, Marital Status, Accomodation, SourcingChannel, Residence and Default to Factor

InsuranceDataset$IncomeClass <- as.factor(InsuranceDataset$IncomeClass)
InsuranceDataset$Marital_Status <- as.factor(InsuranceDataset$Marital_Status)
InsuranceDataset$Accomodation <- as.factor(InsuranceDataset$Accomodation)
InsuranceDataset$Sourcing_Channel <- as.factor(InsuranceDataset$Sourcing_Channel)
InsuranceDataset$Residence <- as.factor(InsuranceDataset$Residence)
InsuranceDataset$Default <- as.factor(InsuranceDataset$Default)

Having converted these variables to factor, lets now check the structure and the summary of our dataset again, so we that we can make more sense of the data

str(InsuranceDataset)

## tibble [79,853 x 19] (S3: tbl_df/tbl/data.frame)
##  $ id              : num [1:79853] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Perc_Premiumcash: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
##  $ Age             : num [1:79853] 11330 30309 16069 23733 19360 ...
##  $ Income          : num [1:79853] 90050 156080 145020 187560 103050 ...
##  $ Count3-6        : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
##  $ Count6_12       : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
##  $ Count12andmore  : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
##  $ Marital_Status  : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 2 ...
##  $ Veh_Owned       : num [1:79853] 3 3 1 1 2 1 3 3 2 3 ...
##  $ Dependents      : num [1:79853] 3 1 1 1 1 4 4 2 4 3 ...
##  $ Accomodation    : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
##  $ Risk_Score      : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
##  $ Premiums_Paid   : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
##  $ Sourcing_Channel: Factor w/ 5 levels "A","B","C","D",..: 1 1 3 1 1 2 3 1 1 1 ...
##  $ Residence       : Factor w/ 2 levels "Rural","Urban": 1 2 2 2 2 1 1 2 2 1 ...
##  $ Premium         : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
##  $ Default         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
##  $ AgeinYears      : num [1:79853] 31 83 44 65 53 ...
##  $ IncomeClass     : Factor w/ 3 levels "low","middle",..: 1 2 1 2 1 1 3 1 1 2 ...

summary(InsuranceDataset)

##        id        Perc_Premiumcash      Age            Income        
##  Min.   :    1   Min.   :0.0000   Min.   : 7670   Min.   :   24030  
##  1st Qu.:19964   1st Qu.:0.0340   1st Qu.:14974   1st Qu.:  108010  
##  Median :39927   Median :0.1670   Median :18625   Median :  166560  
##  Mean   :39927   Mean   :0.3143   Mean   :18847   Mean   :  208847  
##  3rd Qu.:59890   3rd Qu.:0.5380   3rd Qu.:22636   3rd Qu.:  252090  
##  Max.   :79853   Max.   :1.0000   Max.   :37602   Max.   :90262600  
##     Count3-6         Count6_12        Count12andmore     Marital_Status
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   0:40032       
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1:39821       
##  Median : 0.0000   Median : 0.00000   Median : 0.00000                 
##  Mean   : 0.2484   Mean   : 0.07809   Mean   : 0.05994                 
##  3rd Qu.: 0.0000   3rd Qu.: 0.00000   3rd Qu.: 0.00000                 
##  Max.   :13.0000   Max.   :17.00000   Max.   :11.00000                 
##    Veh_Owned       Dependents    Accomodation   Risk_Score    Premiums_Paid  
##  Min.   :1.000   Min.   :1.000   0:39823      Min.   :91.90   Min.   : 2.00  
##  1st Qu.:1.000   1st Qu.:2.000   1:40030      1st Qu.:98.83   1st Qu.: 7.00  
##  Median :2.000   Median :3.000                Median :99.18   Median :10.00  
##  Mean   :1.998   Mean   :2.503                Mean   :99.07   Mean   :10.86  
##  3rd Qu.:3.000   3rd Qu.:3.000                3rd Qu.:99.52   3rd Qu.:14.00  
##  Max.   :3.000   Max.   :4.000                Max.   :99.89   Max.   :60.00  
##  Sourcing_Channel Residence        Premium      Default     AgeinYears    
##  A:43134          Rural:31670   Min.   : 1200   0: 4998   Min.   : 21.01  
##  B:16512          Urban:48183   1st Qu.: 5400   1:74855   1st Qu.: 41.02  
##  C:12039                        Median : 7500             Median : 51.03  
##  D: 7559                        Mean   :10925             Mean   : 51.63  
##  E:  609                        3rd Qu.:13800             3rd Qu.: 62.02  
##                                 Max.   :60000             Max.   :103.02  
##  IncomeClass   
##  low   :33495  
##  middle:25779  
##  high  :20579  
##                
##                
##

IncomeClass, Marital Status, Accomodation, SourcingChannel, Residence and Default have now been converted to factors, all with levels 2 or 3, except SourcingChannel that has 5 levels
50.1% of policy holders are unmarried
50.1% of the policy holders own their accomodation ^ The predominant sourcing channel is A, accounting for 54% of the policy holders, while the least is E, accounting for <1% (0.8%) of all policy holders
60.3% of policy holders have their residences in Urban areas
6.3% of policy holders have defaulted their premium

3.8 Lets remove insignificant columns (Age in Days)

InsuranceDataset <- subset(InsuranceDataset, select = -c(3))

Having removed the Age in Days, lets see the new dimensions

dim(InsuranceDataset)

## [1] 79853    18

4. Exploratory Data Analysis

4.1 Univariate Analysis of all Numetrical Variables

4.1.1 Univariate Analysis of Numerical variables with histogram

library(ggplot2)

4.2 Lets draw histogram and boxplot of numerical variables using ggplot

plot_histogram_n_boxplot = function(variable, variableNameString, binw){
  h = ggplot(data = InsuranceDataset, aes(x= variable))+
    labs(x = variableNameString,y ='count')+
    geom_histogram(fill = 'green',col = 'white',binwidth = binw)+
    geom_vline(aes(xintercept=mean(variable)),
            color="black", linetype="dashed", size=0.5)
  b = ggplot(data = InsuranceDataset, aes('',variable))+ 
    geom_boxplot(outlier.colour = 'red',col = 'red',outlier.shape = 19)+
    labs(x = '',y = variableNameString)+ coord_flip()
  grid.arrange(h,b,ncol = 2)
}

4.2.1 Observation on Percentage ofo Premium Paid with Cash

hist(InsuranceDataset$Perc_Premiumcash, col = "blue", main = "Percentage of Premium Paid by Cash")

plot_histogram_n_boxplot(InsuranceDataset$Perc_Premiumcash, 'Percentage of Premium Paid by Cash', 1)

Most of the policy holders don’t pay with cash
The Median proportion that pays with cash is 17%

4.2.1 Observation on Age

plot_histogram_n_boxplot(InsuranceDataset$AgeinYears, 'Age', 1)

Age is normally distributed
The Median Age is 51 years
Age is skewed to the left but there are right Outliers

4.2.2 Observation on Income

hist(InsuranceDataset$Income, col = "blue", main = "Income")

4.2.3 Observation on ‘Count3-6’; the number of times premium was paid 3-6 months late

plot_histogram_n_boxplot(InsuranceDataset$`Count3-6`, 'Count3-6', 1)

* Majority of policy holders pay their premium on time, just a few pay 3-6 months late

4.2.4 Observation on Count6_12; Number of times premium was paid 6-12 months late

plot_histogram_n_boxplot(InsuranceDataset$Count6_12, 'Count6_12', 1)

Most times, premiums were paid on time by policy holders, only on few occasions were there delays of upto 6-12 months.
Most delays were 1 day, though could extend to 3 on fewer occassions, while there are vert few extreme cases were delays extended

4.2.5 Observation on Count12andmore; Number of times premium were paid more than 12 months late.

plot_histogram_n_boxplot(InsuranceDataset$Count12andmore, 'Count12andmore', 1)

Premiums were rarely paid more than 12 months late.
The highest form of delay is 3-6 months late, while the least is a delay of 12 months and more.

4.2.6 Observation on Number of Vehicle Owned by Customers

plot_histogram_n_boxplot(InsuranceDataset$Veh_Owned, 'Vehicle Owned', 1)

The distribution of number of vehicle of owned by customers is evenly distributed between 1 to 3 vehicles
The proportion of customers that own 1,2 and 3 cars each are the same as seen in the chart above

4.2.7 Observation on Dependents; Number of Dependents in the family of customers

plot_histogram_n_boxplot(InsuranceDataset$Dependents, 'Dependents', 1)

The distribution of Dependents in family is also evenly distributed
Most of the customers have 2-3 dependents
The proportion of customers with 1,2,3 & 4 dependents is the same.

4.2.8 Observation on No of Premium Paid till date

plot_histogram_n_boxplot(InsuranceDataset$Premiums_Paid, 'No of Premium Paid', 1)

The distribution on the Number of Premium Paid till date is a Normal Distribution
The Median no is 10, while the distribution is skewed to the left, but there are right outliers, that is extreme values to the right

4.2.9 Observation on the Total Premium Amount Paid till now

plot_histogram_n_boxplot(InsuranceDataset$Premium, 'Premium', 1)

The Median Total Premium Paid is 7500, the minimum is 1200 and the maximum is 60,000
The first Quartile and third Quartile of the total Premium Paid is between 5,400 and 13,800
The distribution is skewed to the left, with extreme values on the right
The modal Total Premium Paid is 10,000
The least total Premium Paid per individual policy holder is between 35,000 and 60,000

4.2.10 Obervation on Risk Score

plot_histogram_n_boxplot(InsuranceDataset$Risk_Score, 'Risk Score', 1)

The Risk Score is skwed to the right, with left outliers
The Median Risk Score is 99
Majority of the Clients have very high risk score, with the higher proportion having 99% risk score

5. Bivariate Analysis

5.1 Let’s plot percent stacked barchart to see the effect of independents variables on Default, which is the variable of interest

plot_stacked_barchart = function(variable, variableNamesString){ggplot(InsuranceDataset, aes(fill = Default, x = variable)) + geom_bar(position = "fill")+
    labs(title = variableNamesString, y = '', x = '')+
    scale_fill_manual(values = c("RED", "BLUE"))}

5.1.1 Default versus Marital Status

plot_stacked_barchart(InsuranceDataset$Marital_Status, 'Marital Status')

* Marital Status doesn’t seem to have any impact on the Default on Premium as the similar proportion of Default exists between the Married and the Non Married. * Most of the customers don’t Default on their Premiums

5.1.2 Default versus Accomodation

plot_stacked_barchart(InsuranceDataset$Accomodation, 'Accomodation')

Accomodation doesn’t seem to have much impact on Default
Thus who own their accomodation may seem to have a light higher chance of Defaulting on their premiums

5.1.3 Default versus Sourcing Channel

plot_stacked_barchart(InsuranceDataset$Sourcing_Channel, 'Sourcing Channel')

Sourcing Channels tends to have varying impact on Default
Customers sourced from Channel A have the lowest chance of Defaulting
Customers sourced from Channel D and E have the higher Chance of Defaulting as compared to Channel A
Customers sourced from D Channel have the highest chance of Defaulting, followed by those from channels C & E, whose risk of Defaulting seems to be the same, followed by those from B, while Customers from A have the lowest risk of Defaulting

5.1.4 Default versus Residence

plot_stacked_barchart(InsuranceDataset$Residence, 'Residence')

Residence doesn’t seem to have any impact on Default
Those from Rural areas tends to have a slightly higher chance of Defaulting

5.1.5 Defualt versus Income Class

plot_stacked_barchart(InsuranceDataset$IncomeClass, 'Income Class')

The Income Class of the customer tend to have a significant impact on Default
Those Customers from lower income groups have a higher chance of Defaulting than those from Middle Income Class, while the High Income Class Customers have the least likelihood of Defaulting as expected

5.1.6 Age versus Late Premium 3-6 months using boxplot

boxplot(InsuranceDataset$AgeinYears~InsuranceDataset$`Count3-6`, col = "Blue", main = "Age versus Late Premium in 3- 6 months")

The concentration of late payment, 3-6 month late is higher around the middle age segment of 40-60 years, median peak ranging around 50 years

5.1.7 Age versus Late Premium after 6-12 months

boxplot(InsuranceDataset$AgeinYears~InsuranceDataset$Count6_12, col = "Blue", main = "Age versus Late Premium in 6-12 months")

This is similar to above 3-6 months, but it has more varying peaks per age

5.1.8 Age Versus Late Premium After 12 MOnths

boxplot(InsuranceDataset$AgeinYears~InsuranceDataset$Count12andmore, col = "Blue", main = "Age versus Late Premium After 12 Months")

This is similar to 3-6 months
There are Outliers

5.1.9 No of Dependents versus Late Premium in 3- 6 months

boxplot(InsuranceDataset$Dependents~InsuranceDataset$`Count3-6`, col = "Blue", main = "No of Dependents versus Late Premium in 3-6 months")

The higher the number of Dependents the higher the chances of late payment in 3- 6 months

5.1.10 No of Dependents versus Late Payment in 6-12 Months

boxplot(InsuranceDataset$Dependents~InsuranceDataset$Count6_12, col = "Blue", main = "No of Dependents versus Late Premium in 6-12 months")

Payment of 6-12 months is more prevalent among the customers with higher number of Dependents

5.1.11 Income versus Late Premium After 12 months

boxplot(InsuranceDataset$Income~InsuranceDataset$Count12andmore, col = "Blue", main = "Income versus Late Payment After 12 MOnths")

The higher the income the lower the chances of delay in payment of Premiums

5.2 Bivariate Analysis: Plotting the Relationship between the Dependent Factor variable and Numerical Variables

Lets see the relationship between Default and the Numeric Variables (Age in Years, Percentage Premium Paid in Cash, Vehicle Owned, Count 3-6 Months Late, Count 6-12 Months Late, Count More than 12 Months, Risk Score, No of Dependents, No of Premium Paid till date and Total Premium Paid)

5.2.1 Relationship between Age in Years and Defualt

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$AgeinYears,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

The lower the age the more the chances of Default on Premiums

5.2.2 Relationship between Percentage Premium Paid in Cash and Default

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$Perc_Premiumcash,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

The more the percentage premium paid in cash the more the likelihood of Default

5.2.3 Relationship between Vehicle Owned and Default

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$Veh_Owned,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

No of Vehicle Owned doesn’t have any effect on Defualt on Payment of Premium

5.2.3 Relationship between Count 3-6 months Late and Default

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$`Count3-6`,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

According to the data available, the lower the number of times premium was paid below 6 months, the more the likelihood of Default
THis doesn’t sound logical and we will still check the dat

5.2.3 Relationship between Count more than 6 months Late and Default

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$Count12andmore,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

The proportion of Customers that pay their premium more than 12 months late and those who Default on Premium is propotional to increasing delays in payment

5.2.4 Relationship between Risk Score and Default

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$Risk_Score,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

The more the Risk Score the higher the chance of Default

5.2.5 Relationship between No of Dependents and Default

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$Dependents,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

No of Dependents have only a slight impact on Default; although the higher the number of Dependents the higher the chance of Default

5.2.6 Relationship between No of Premium Paid till date and Default

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$Premiums_Paid,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

The lower the No of Premium Paid till date the higher the chance of Default as expected

5.2.7 Relationship between Total Premium and Default

ggplot(InsuranceDataset,aes_string(x=InsuranceDataset$Premium,fill="Default")) + geom_histogram(bins=50,alpha=0.5,colour='black')

The lower the Total Premium Paid till date the higher the chance of Default

5.3 Correlation Plot Between the Entire Variables in the Data Set

5.3.1 Install Corrplot Package

install.packages("corrplot")

## Warning: package 'corrplot' is in use and will not be installed

library(corrplot)

5.3.2 In order to do the correlation plot, we convert all Factor variables to Numeric

InsuranceDataset$IncomeClass <- as.numeric(InsuranceDataset$IncomeClass)
InsuranceDataset$Marital_Status <- as.numeric(InsuranceDataset$Marital_Status)
InsuranceDataset$Accomodation <- as.numeric(InsuranceDataset$Accomodation)
InsuranceDataset$Sourcing_Channel <- as.numeric(InsuranceDataset$Sourcing_Channel)
InsuranceDataset$Residence <- as.numeric(InsuranceDataset$Residence)
InsuranceDataset$Default <- as.numeric(InsuranceDataset$Default)

5.3.3 Use the Corrplot on the dataset already changed to Numeric

M <- cor(InsuranceDataset)
corrplot(M, order = "AOE")

There is a negative correlation between Percentage of Premium Paid in Cash Credit and Defaulting on Premium *There is a negative correlation between late payments of Premium in 3-6 months, 6-12 months and after 12 months. The delay of payment of 6-12 months has a higher negative correlation
There is a positive correlation between age and Default
No of Premium Paid to date is positively correlated with Risk Score

6 Data Pre Processing

6.1 Removal of Unwanted Variable

The two variables to be removed are the ID variable and the Age in Days variable

6.1.1 Removal of Age in Days Column

*Done in sub-Section 3.8, where we removed “Age in Days” Column

6.1.2 Removal of ID Variable

InsuranceDataset <- subset(InsuranceDataset, select = -c(1))

Having removed the ID variable, lets check the Data structure to see the no of variables we have left

str(InsuranceDataset)

## tibble [79,853 x 17] (S3: tbl_df/tbl/data.frame)
##  $ Perc_Premiumcash: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
##  $ Income          : num [1:79853] 90050 156080 145020 187560 103050 ...
##  $ Count3-6        : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
##  $ Count6_12       : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
##  $ Count12andmore  : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
##  $ Marital_Status  : num [1:79853] 1 2 1 2 1 1 1 1 2 2 ...
##  $ Veh_Owned       : num [1:79853] 3 3 1 1 2 1 3 3 2 3 ...
##  $ Dependents      : num [1:79853] 3 1 1 1 1 4 4 2 4 3 ...
##  $ Accomodation    : num [1:79853] 2 2 2 1 1 1 2 1 2 2 ...
##  $ Risk_Score      : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
##  $ Premiums_Paid   : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
##  $ Sourcing_Channel: num [1:79853] 1 1 3 1 1 2 3 1 1 1 ...
##  $ Residence       : num [1:79853] 1 2 2 2 2 1 1 2 2 1 ...
##  $ Premium         : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
##  $ Default         : num [1:79853] 2 2 2 2 1 2 2 2 2 2 ...
##  $ AgeinYears      : num [1:79853] 31 83 44 65 53 ...
##  $ IncomeClass     : num [1:79853] 1 2 1 2 1 1 3 1 1 2 ...

We are now left with 17 variables down from 18 after removing the Age in Days variable
All our variables are Numeric
We still have 79,853 Observations
We will have to convert the variable of interest to Factor later on.

6.2 Missing Value Treatment

This was already handled in subsection 3.3
We found that there were no mising values

6.3 Convert Variable of Interest (Default) to Factor

The variable of interest as per problem statement is Default

InsuranceDataset$Default <- as.factor(InsuranceDataset$Default)

InsuranceDataset$Default<-ifelse(InsuranceDataset$Default=='2', 1,0)

table(InsuranceDataset$Default)

## 
##     0     1 
##  4998 74855

We have transformed the variable from 2 and 1 to 1 and 0, 1 standing for those who have not defaulted and 0 stranding for those who have defaulted
We can see from the Default Table that the Variable is not balanced; only 6.7% Defaulted (Those with “0” )

6.4 Lets convert the variable of interest to Factor again since we did earlier but reconverted to binary

InsuranceDataset$Default <- as.factor(InsuranceDataset$Default)

6.5 Lets again check the summary of the Dataframe

summary(InsuranceDataset)

##  Perc_Premiumcash     Income            Count3-6         Count6_12       
##  Min.   :0.0000   Min.   :   24030   Min.   : 0.0000   Min.   : 0.00000  
##  1st Qu.:0.0340   1st Qu.:  108010   1st Qu.: 0.0000   1st Qu.: 0.00000  
##  Median :0.1670   Median :  166560   Median : 0.0000   Median : 0.00000  
##  Mean   :0.3143   Mean   :  208847   Mean   : 0.2484   Mean   : 0.07809  
##  3rd Qu.:0.5380   3rd Qu.:  252090   3rd Qu.: 0.0000   3rd Qu.: 0.00000  
##  Max.   :1.0000   Max.   :90262600   Max.   :13.0000   Max.   :17.00000  
##  Count12andmore     Marital_Status    Veh_Owned       Dependents   
##  Min.   : 0.00000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 0.00000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median : 0.00000   Median :1.000   Median :2.000   Median :3.000  
##  Mean   : 0.05994   Mean   :1.499   Mean   :1.998   Mean   :2.503  
##  3rd Qu.: 0.00000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :11.00000   Max.   :2.000   Max.   :3.000   Max.   :4.000  
##   Accomodation     Risk_Score    Premiums_Paid   Sourcing_Channel
##  Min.   :1.000   Min.   :91.90   Min.   : 2.00   Min.   :1.000   
##  1st Qu.:1.000   1st Qu.:98.83   1st Qu.: 7.00   1st Qu.:1.000   
##  Median :2.000   Median :99.18   Median :10.00   Median :1.000   
##  Mean   :1.501   Mean   :99.07   Mean   :10.86   Mean   :1.823   
##  3rd Qu.:2.000   3rd Qu.:99.52   3rd Qu.:14.00   3rd Qu.:3.000   
##  Max.   :2.000   Max.   :99.89   Max.   :60.00   Max.   :5.000   
##    Residence        Premium      Default     AgeinYears      IncomeClass   
##  Min.   :1.000   Min.   : 1200   0: 4998   Min.   : 21.01   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.: 5400   1:74855   1st Qu.: 41.02   1st Qu.:1.000  
##  Median :2.000   Median : 7500             Median : 51.03   Median :2.000  
##  Mean   :1.603   Mean   :10925             Mean   : 51.63   Mean   :1.838  
##  3rd Qu.:2.000   3rd Qu.:13800             3rd Qu.: 62.02   3rd Qu.:3.000  
##  Max.   :2.000   Max.   :60000             Max.   :103.02   Max.   :3.000

str(InsuranceDataset)

## tibble [79,853 x 17] (S3: tbl_df/tbl/data.frame)
##  $ Perc_Premiumcash: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
##  $ Income          : num [1:79853] 90050 156080 145020 187560 103050 ...
##  $ Count3-6        : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
##  $ Count6_12       : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
##  $ Count12andmore  : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
##  $ Marital_Status  : num [1:79853] 1 2 1 2 1 1 1 1 2 2 ...
##  $ Veh_Owned       : num [1:79853] 3 3 1 1 2 1 3 3 2 3 ...
##  $ Dependents      : num [1:79853] 3 1 1 1 1 4 4 2 4 3 ...
##  $ Accomodation    : num [1:79853] 2 2 2 1 1 1 2 1 2 2 ...
##  $ Risk_Score      : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
##  $ Premiums_Paid   : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
##  $ Sourcing_Channel: num [1:79853] 1 1 3 1 1 2 3 1 1 1 ...
##  $ Residence       : num [1:79853] 1 2 2 2 2 1 1 2 2 1 ...
##  $ Premium         : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
##  $ Default         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
##  $ AgeinYears      : num [1:79853] 31 83 44 65 53 ...
##  $ IncomeClass     : num [1:79853] 1 2 1 2 1 1 3 1 1 2 ...

All variables are numeric except our variable of interest which is what is desired for building our model

6.6 Lets Check or Multicollinearity between the variables

library(usdm)

## Loading required package: sp

## Loading required package: raster

## 
## Attaching package: 'raster'

## The following object is masked from 'package:e1071':
## 
##     interpolate

## The following objects are masked from 'package:MASS':
## 
##     area, select

## 
## Attaching package: 'usdm'

## The following object is masked from 'package:car':
## 
##     vif

library(VIF)

## 
## Attaching package: 'VIF'

## The following object is masked from 'package:usdm':
## 
##     vif

## The following object is masked from 'package:car':
## 
##     vif

library(raster)
library(MASS)
library(e1071)

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(rms)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:raster':
## 
##     mask, zoom

## The following object is masked from 'package:e1071':
## 
##     impute

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

## 
## Attaching package: 'rms'

## The following object is masked from 'package:VIF':
## 
##     vif

## The following object is masked from 'package:usdm':
## 
##     vif

## The following objects are masked from 'package:car':
## 
##     Predict, vif

library(car)

r<-subset(InsuranceDataset, select = c(Perc_Premiumcash, Income, `Count3-6`, Count6_12, Count12andmore, Marital_Status, Veh_Owned, Dependents, Accomodation, Risk_Score, Premiums_Paid, Sourcing_Channel, Residence, Premium, AgeinYears, IncomeClass))

lm.Default<-lm(IncomeClass~., data = r)
lm.Default

## 
## Call:
## lm(formula = IncomeClass ~ ., data = r)
## 
## Coefficients:
##      (Intercept)  Perc_Premiumcash            Income        `Count3-6`  
##       -1.440e+01        -5.866e-02         9.951e-08         1.336e-04  
##        Count6_12    Count12andmore    Marital_Status         Veh_Owned  
##       -3.623e-02        -3.291e-02         4.273e-03         3.189e-03  
##       Dependents      Accomodation        Risk_Score     Premiums_Paid  
##        1.236e-03         2.301e-03         1.535e-01         2.721e-02  
## Sourcing_Channel         Residence           Premium        AgeinYears  
##        9.240e-02        -2.392e-02         4.442e-05         1.979e-03

vif(lm.Default)

## Perc_Premiumcash           Income       `Count3-6`        Count6_12 
##         1.206060         1.103221         1.163545         1.133204 
##   Count12andmore   Marital_Status        Veh_Owned       Dependents 
##         1.159067         1.000112         1.000243         1.000402 
##     Accomodation       Risk_Score    Premiums_Paid Sourcing_Channel 
##         1.000115         1.161790         1.227097         1.086352 
##        Residence          Premium       AgeinYears 
##         1.001129         1.199955         1.154599

According to the correlation coefficients the minimum correlation Perc_Premiumcash which has a linear correlation coefficient value of -5.866e-02. That is Income Class~Perc_PremiumCash have the least correlation together.
The maximum correlation is 1.535e-01, which represent correlation between Income Class and Risk_Score
The most correlated variable to to Income Class is Premiums_Paid followed by Perc_Premiumcash, the third most correlated is Premium
Since non of the VIF is above 5, we may not neceassarily have to remove any variable based on Multicollinearity

plot(lm.Default)

cor1<-cor(r)
corrplot::corrplot.mixed(cor1, lower.col = "black", number.cex=.7)

THere is a positive correlation between Premium and Income, no of Premium Paid, Risk Score and Sourcing Channel
There is a negative correlation between Premium Paid via Cash Credit and Sourcing Channel, Accomodation, Risk Score
There is a positive correlation betweem Percentage of Premium by Cash Credit and Count 3-6 months, Count 6-12 months

6.7 Outlier Treatment

6.7.1 Lets first check for Outliers

boxplot(InsuranceDataset$Perc_Premiumcash)

boxplot(InsuranceDataset$Income)

boxplot(InsuranceDataset$`Count3-6`)

boxplot(InsuranceDataset$Count6_12)

boxplot(InsuranceDataset$Count12andmore)

boxplot(InsuranceDataset$Risk_Score)

boxplot(InsuranceDataset$Premiums_Paid)

boxplot(InsuranceDataset$Premium)

boxplot(InsuranceDataset$AgeinYears)

* There are Outliers that is extreme values in Income, Count 3-6 months, Count 6-12 months, Count 12andmore, Risk Score, Premium Paid, Premium and Age in Years

6.7.2 Removing the Outliers

quantile(InsuranceDataset$Income, c(0.95))

##    95% 
## 450050

quantile(InsuranceDataset$Premiums_Paid, c(0.95))

## 95% 
##  20

quantile(InsuranceDataset$Premium, c(0.95))

##   95% 
## 28500

quantile(InsuranceDataset$AgeinYears, c(0.95))

##      95% 
## 76.03836

All income above 450,050 are being removed
All No of Premium-Paid above 20 will be removed
All Premium (Total Amount of Premium Paid) above 28,500 will be removed
All ages above 76 years will be removed

7 Unbalanced Nature of the Variable of Interest

7.1 Lets check the data in the Default Variable

table(InsuranceDataset$Default)

## 
##     0     1 
##  4998 74855

7.2 Lets check for the probability of the Default Variable

prop.table(table(InsuranceDataset$Default))

## 
##          0          1 
## 0.06259001 0.93740999

As we can see only 6.2% of the Customers default on their premium payment
The data is not balanced
We therefore need to handle the problem of data imbalance

7.3 Lets fix class imbalance by SMOTEing

7.3.1 Generate Synthetic Data using SMOTE

Lets load the libraries

library(DMwR)

## Loading required package: grid

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

## The following objects are masked from 'package:InformationValue':
## 
##     confusionMatrix, precision, sensitivity, specificity

library(smotefamily)

## 
## Attaching package: 'smotefamily'

## The following object is masked from 'package:DMwR':
## 
##     SMOTE

Lets now use the SMOTE function

str(InsuranceDataset)

## tibble [79,853 x 17] (S3: tbl_df/tbl/data.frame)
##  $ Perc_Premiumcash: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
##  $ Income          : num [1:79853] 90050 156080 145020 187560 103050 ...
##  $ Count3-6        : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
##  $ Count6_12       : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
##  $ Count12andmore  : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
##  $ Marital_Status  : num [1:79853] 1 2 1 2 1 1 1 1 2 2 ...
##  $ Veh_Owned       : num [1:79853] 3 3 1 1 2 1 3 3 2 3 ...
##  $ Dependents      : num [1:79853] 3 1 1 1 1 4 4 2 4 3 ...
##  $ Accomodation    : num [1:79853] 2 2 2 1 1 1 2 1 2 2 ...
##  $ Risk_Score      : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
##  $ Premiums_Paid   : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
##  $ Sourcing_Channel: num [1:79853] 1 1 3 1 1 2 3 1 1 1 ...
##  $ Residence       : num [1:79853] 1 2 2 2 2 1 1 2 2 1 ...
##  $ Premium         : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
##  $ Default         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
##  $ AgeinYears      : num [1:79853] 31 83 44 65 53 ...
##  $ IncomeClass     : num [1:79853] 1 2 1 2 1 1 3 1 1 2 ...

table(InsuranceDataset$Default)

## 
##     0     1 
##  4998 74855

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:raster':
## 
##     intersect, select, union

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:MASS':
## 
##     select

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We will use the SMOTE Function or other Oversampling or undersMpling method to balance the data before comencing Model building later on

8 Preliminary Conlcusion on Initial Data analysis

The data was collected over 5 years, or up to 10 years
50.1% of policy holders are unmarried, and there seems to be no impact between marital status of customers and the risk of Defaulting on Premium
50.1% of the policy holders own their accommodation, and there is almost no impact between ownership of accommodation and Default on Premiums
The predominant sourcing channel is A, accounting for 54% of the policy holders, while the least is E, accounting for <1% (0.8%) of all policy holders: Customers sourced from D Channel have the highest chance of Defaulting, followed by those from channels C & E, whose risk of Defaulting seems to be the same, followed by those from B, while Customers from A have the lowest risk of Defaulting
60.3% of policy holders have their residences in Urban areas; Residences doesn’t seem to have a significant impact on Defaulting on Premiums
6.3% of policy holders have defaulted their premium
Customers from high income class have the lowest risk of Defaulting, while those from low income class have the higher risk of Defaulting
The higher the risk score the higher the risk of Default
The higher the numbers of Premium Paid till date the lower the risk of Default
The higher the Total Premiums Paid till date, the lower the risk of Default
The higher the Percentage of Premium Paid in Cash Credit the higher the risk of Defaulting on Premium *The higher occurrence of late payments of Premium in 3-6 months, 6-12 months and after 12 months, the higher the risk of Defaulting on Premium. The delay of payment of 6-12 months has a higher risk of Defaulting
The higher the age, the lower the risk of Default
The more the No of Premium Paid to date the lower the Risk Score.

9 More Concluson on Further Data Analysis

The two variables to be removed are the ID variable and the Age in Days variable
We are now left with 17 variables down from 18 after removing the Age in Days variable
All our variables are Numeric
We still have 79,853 Observations
We will have to convert the variable of interest to Factor later on.
- We have transformed the variable from 2 and 1 to 1 and 0, 1 standing for those who have not defaulted and 0 stranding for those who have defaulted
We can see from the Default Table that the Variable is not balanced; only 6.7% Defaulted (Those with “0” )
According to the correlation coefficients the minimum correlation Perc_Premiumcash which has a linear correlation coefficient value of -5.866e-02. That is Income Class~Perc_PremiumCash have the least correlation together.
The maximum correlation is 1.535e-01, which represent correlation between Income Class and Risk_Score
The most correlated variable to to Income Class is Premiums_Paid followed by Perc_Premiumcash, the third most correlated is Premium
Since non of the VIF is above 5, we may not necessarily have to remove any variable based on Multicollinearity
THere is a positive correlation between Premium and Income, no of Premium Paid, Risk Score and Sourcing Channel
There is a negative correlation between Premium Paid via Cash Credit and Sourcing Channel, Accommodation, Risk Score
There is a positive correlation between Percentage of Premium by Cash Credit and Count 3-6 months, Count 6-12 months
There are Outliers that is extreme values in Income, Count 3-6 months, Count 6-12 months, Count 12andmore, Risk Score, Premium Paid, Premium and Age in Years
All income above 450,050 are being removed
All No of Premium-Paid above 20 will be removed
All Premium (Total Amount of Premium Paid) above 28,500 will be removed
All ages above 76 years will be removed
As we can see only 6.2% of the Customers default on their premium payment
The data is not balanced
We therefore need to handle the problem of data imbalance
We will use the SMOTE Function or other Oversampling or undersMpling method to balance the data before commencing Model building later on

#======================================================================= # # T H E - E N D # #=======================================================================

Generate the .R file from this .Rmd to hold the source code

purl(“Insurance Premium Default.Rmd”, documentation = 1)

Insurance Premium Default Propensity Prediction

Tobi Ilesanmi

11/5/2020