Prudential life insurance uses data to assess the risk of providing insurance to individual customers. Traditional methods include collecting an applicant's history, medical records, and many other data points, and the process usually takes around 30 days. My goal here is to automate the process so that risk assessment is quick and accurate. The data was provided by Prudential Financial for a Kaggle competition built around this assessment challenge.
My approach includes tidying and cleaning the data, understanding it through exploratory data analysis, identifying the key variables that contribute to predicting the response variable (the level of risk, a discrete value from 1 to 8), building a model, and presenting the results.
The results of this analysis can help insurance companies make well-informed decisions.
A list of packages used in this project is documented below. It may be modified if needed.
library(tidyverse)
library(randomForest)
library(VIM)
library(DT)
library(caret)
library(DMwR)
library(gbm)
library(RCurl)
library(nnet)
The dataset was provided by Prudential Financial as a competition on Kaggle. The original dataset can be downloaded from here. The data is further cleaned in R and prepared for analysis.
A data dictionary for the 128 variables in the original dataset is provided below.
Data Dictionary
Explanation of each variable as provided by Kaggle:
Variable Description
Id: A unique identifier associated with an application.
Product_Info_1-7: A set of normalized variables relating to the product applied for.
Ins_Age: Normalized age of applicant.
Ht: Normalized height of applicant.
Wt: Normalized weight of applicant.
BMI: Normalized BMI of applicant.
Employment_Info_1-6: A set of normalized variables relating to the employment history of the applicant.
InsuredInfo_1-6: A set of normalized variables providing information about the applicant.
Insurance_History_1-9: A set of normalized variables relating to the insurance history of the applicant.
Family_Hist_1-5: A set of normalized variables relating to the family history of the applicant.
Medical_History_1-41: A set of normalized variables relating to the medical history of the applicant.
Medical_Keyword_1-48: A set of dummy variables relating to the presence or absence of a medical keyword being associated with the application.
Response: The target variable, an ordinal variable relating to the final decision associated with an application.
The following variables are all categorical (nominal):
Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41
The following variables are continuous:
Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5
The following variables are discrete:
Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
Medical_Keyword_1-48 are dummy variables.
The original training dataset contains 128 variables and 59,381 observations. The dataset was provided by Prudential Financial to assess the risk of issuing policies to life insurance applicants, and was uploaded to Kaggle as a competition sponsored by Prudential Financial around 2 years ago (11/23/2015). Many missing values are present across multiple columns of the original dataset.
Missing values in the original dataset were recorded as blanks; while importing into R, blanks were converted to NAs.
Profile <- read.csv(text=getURL("https://raw.githubusercontent.com/nitin3078/Prudential/master/train.csv"), header=T)
datatable(head(Profile, 50), options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
"}")
))
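For numeric columns, read.csv converts blank fields to NA by default during import. To make the blank-to-NA conversion explicit for every column type, the na.strings argument can be passed; a minimal sketch of that variant (equivalent to the import above):
# explicit variant of the import above: treat empty fields as NA for every column type
Profile <- read.csv(text = getURL("https://raw.githubusercontent.com/nitin3078/Prudential/master/train.csv"),
                    header = TRUE, na.strings = c("", "NA"))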
# number of observations and number of variables in dataset
dim(Profile)
## [1] 59381 128
# displaying structure and a subset of values of dataset
glimpse(Profile)
## Observations: 59,381
## Variables: 128
## $ Id <int> 2, 5, 6, 7, 8, 10, 11, 14, 15, 16, 17, 18,...
## $ Product_Info_1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ Product_Info_2 <fctr> D3, A1, E1, D4, D2, D2, A8, D2, D3, E1, D...
## $ Product_Info_3 <int> 10, 26, 26, 10, 26, 26, 10, 26, 26, 21, 26...
## $ Product_Info_4 <dbl> 0.07692308, 0.07692308, 0.07692308, 0.4871...
## $ Product_Info_5 <int> 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Product_Info_6 <int> 1, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Product_Info_7 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Ins_Age <dbl> 0.64179104, 0.05970149, 0.02985075, 0.1641...
## $ Ht <dbl> 0.5818182, 0.6000000, 0.7454545, 0.6727273...
## $ Wt <dbl> 0.1485356, 0.1317992, 0.2887029, 0.2050209...
## $ BMI <dbl> 0.3230080, 0.2722877, 0.4287804, 0.3524377...
## $ Employment_Info_1 <dbl> 0.0280, 0.0000, 0.0300, 0.0420, 0.0270, 0....
## $ Employment_Info_2 <int> 12, 1, 9, 9, 9, 15, 1, 12, 9, 1, 9, 3, 9, ...
## $ Employment_Info_3 <int> 1, 3, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, ...
## $ Employment_Info_4 <dbl> 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, ...
## $ Employment_Info_5 <int> 3, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, ...
## $ Employment_Info_6 <dbl> NA, 0.0018, 0.0300, 0.2000, 0.0500, 1.0000...
## $ InsuredInfo_1 <int> 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, ...
## $ InsuredInfo_2 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ InsuredInfo_3 <int> 6, 6, 8, 8, 6, 8, 3, 6, 3, 3, 4, 3, 8, 3, ...
## $ InsuredInfo_4 <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, ...
## $ InsuredInfo_5 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, ...
## $ InsuredInfo_6 <int> 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, ...
## $ InsuredInfo_7 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Insurance_History_1 <int> 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, ...
## $ Insurance_History_2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Insurance_History_3 <int> 3, 3, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 1, ...
## $ Insurance_History_4 <int> 1, 1, 3, 3, 3, 2, 2, 1, 2, 1, 1, 3, 1, 3, ...
## $ Insurance_History_5 <dbl> 0.000666667, 0.000133333, NA, NA, NA, 0.00...
## $ Insurance_History_7 <int> 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3, 2, 3, ...
## $ Insurance_History_8 <int> 1, 3, 2, 2, 2, 3, 1, 1, 1, 3, 1, 2, 1, 2, ...
## $ Insurance_History_9 <int> 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 3, ...
## $ Family_Hist_1 <int> 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, ...
## $ Family_Hist_2 <dbl> NA, 0.1884058, 0.3043478, 0.4202899, 0.463...
## $ Family_Hist_3 <dbl> 0.5980392, NA, NA, NA, NA, 0.2941176, NA, ...
## $ Family_Hist_4 <dbl> NA, 0.08450704, 0.22535211, 0.35211268, 0....
## $ Family_Hist_5 <dbl> 0.5267857, NA, NA, NA, NA, NA, NA, 0.63392...
## $ Medical_History_1 <int> 4, 5, 10, 0, NA, 6, 5, 6, 4, NA, 1, 4, 5, ...
## $ Medical_History_2 <int> 112, 412, 3, 350, 162, 491, 600, 145, 16, ...
## $ Medical_History_3 <int> 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, ...
## $ Medical_History_4 <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, ...
## $ Medical_History_5 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_6 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, ...
## $ Medical_History_7 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_8 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_9 <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, ...
## $ Medical_History_10 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_11 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_12 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_13 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_14 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_15 <int> 240, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_16 <int> 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 1, 1, ...
## $ Medical_History_17 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_18 <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_19 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_20 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_21 <int> 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_22 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_23 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 1, 3, ...
## $ Medical_History_24 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_25 <int> 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_26 <int> 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_27 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_28 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, ...
## $ Medical_History_29 <int> 3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 1, 3, 3, 3, ...
## $ Medical_History_30 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_31 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_32 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_33 <int> 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 1, ...
## $ Medical_History_34 <int> 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 1, 3, 3, 3, ...
## $ Medical_History_35 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_36 <int> 2, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_37 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_38 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_39 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, ...
## $ Medical_History_40 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_41 <int> 3, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 3, 3, 1, ...
## $ Medical_Keyword_1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_8 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_9 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_10 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_11 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_12 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_13 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_15 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, ...
## $ Medical_Keyword_16 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_17 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_18 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_19 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_21 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_22 <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_23 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, ...
## $ Medical_Keyword_24 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_25 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ Medical_Keyword_26 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_27 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_28 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_29 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_30 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_31 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_32 <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_33 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_34 <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_35 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_36 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_37 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_38 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_39 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_40 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_41 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_42 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_43 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_44 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Medical_Keyword_45 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_46 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_47 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Medical_Keyword_48 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Response <int> 8, 4, 8, 8, 8, 8, 8, 1, 8, 1, 6, 2, 7, 3, ...
# summary of the target variable (Response) in the training dataset
summary(Profile$Response)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 5.637 8.000 8.000
#Checking missing values present in dataset
colSums(is.na(Profile))
## Id Product_Info_1 Product_Info_2
## 0 0 0
## Product_Info_3 Product_Info_4 Product_Info_5
## 0 0 0
## Product_Info_6 Product_Info_7 Ins_Age
## 0 0 0
## Ht Wt BMI
## 0 0 0
## Employment_Info_1 Employment_Info_2 Employment_Info_3
## 19 0 0
## Employment_Info_4 Employment_Info_5 Employment_Info_6
## 6779 0 10854
## InsuredInfo_1 InsuredInfo_2 InsuredInfo_3
## 0 0 0
## InsuredInfo_4 InsuredInfo_5 InsuredInfo_6
## 0 0 0
## InsuredInfo_7 Insurance_History_1 Insurance_History_2
## 0 0 0
## Insurance_History_3 Insurance_History_4 Insurance_History_5
## 0 0 25396
## Insurance_History_7 Insurance_History_8 Insurance_History_9
## 0 0 0
## Family_Hist_1 Family_Hist_2 Family_Hist_3
## 0 28656 34241
## Family_Hist_4 Family_Hist_5 Medical_History_1
## 19184 41811 8889
## Medical_History_2 Medical_History_3 Medical_History_4
## 0 0 0
## Medical_History_5 Medical_History_6 Medical_History_7
## 0 0 0
## Medical_History_8 Medical_History_9 Medical_History_10
## 0 0 58824
## Medical_History_11 Medical_History_12 Medical_History_13
## 0 0 0
## Medical_History_14 Medical_History_15 Medical_History_16
## 0 44596 0
## Medical_History_17 Medical_History_18 Medical_History_19
## 0 0 0
## Medical_History_20 Medical_History_21 Medical_History_22
## 0 0 0
## Medical_History_23 Medical_History_24 Medical_History_25
## 0 55580 0
## Medical_History_26 Medical_History_27 Medical_History_28
## 0 0 0
## Medical_History_29 Medical_History_30 Medical_History_31
## 0 0 0
## Medical_History_32 Medical_History_33 Medical_History_34
## 58274 0 0
## Medical_History_35 Medical_History_36 Medical_History_37
## 0 0 0
## Medical_History_38 Medical_History_39 Medical_History_40
## 0 0 0
## Medical_History_41 Medical_Keyword_1 Medical_Keyword_2
## 0 0 0
## Medical_Keyword_3 Medical_Keyword_4 Medical_Keyword_5
## 0 0 0
## Medical_Keyword_6 Medical_Keyword_7 Medical_Keyword_8
## 0 0 0
## Medical_Keyword_9 Medical_Keyword_10 Medical_Keyword_11
## 0 0 0
## Medical_Keyword_12 Medical_Keyword_13 Medical_Keyword_14
## 0 0 0
## Medical_Keyword_15 Medical_Keyword_16 Medical_Keyword_17
## 0 0 0
## Medical_Keyword_18 Medical_Keyword_19 Medical_Keyword_20
## 0 0 0
## Medical_Keyword_21 Medical_Keyword_22 Medical_Keyword_23
## 0 0 0
## Medical_Keyword_24 Medical_Keyword_25 Medical_Keyword_26
## 0 0 0
## Medical_Keyword_27 Medical_Keyword_28 Medical_Keyword_29
## 0 0 0
## Medical_Keyword_30 Medical_Keyword_31 Medical_Keyword_32
## 0 0 0
## Medical_Keyword_33 Medical_Keyword_34 Medical_Keyword_35
## 0 0 0
## Medical_Keyword_36 Medical_Keyword_37 Medical_Keyword_38
## 0 0 0
## Medical_Keyword_39 Medical_Keyword_40 Medical_Keyword_41
## 0 0 0
## Medical_Keyword_42 Medical_Keyword_43 Medical_Keyword_44
## 0 0 0
## Medical_Keyword_45 Medical_Keyword_46 Medical_Keyword_47
## 0 0 0
## Medical_Keyword_48 Response
## 0 0
# Checking whether missing values are present in the "Id" column
sum(is.na(Profile$Id))
## [1] 0
# using purrr to find columns with missing values and arranging them in ascending order of missingness
missingcol <- map_df(Profile, function(x) sum(is.na(x)))
missingcol<- gather(missingcol)
colnames(missingcol) <-c("variable","missing")
missingcol <- missingcol %>%
filter( missing>0) %>%
arrange(missing) %>% mutate( percentmissing = missing/59381*100 )
print(missingcol)
## # A tibble: 13 x 3
## variable missing percentmissing
## <chr> <int> <dbl>
## 1 Employment_Info_1 19 0.03199677
## 2 Employment_Info_4 6779 11.41610953
## 3 Medical_History_1 8889 14.96943467
## 4 Employment_Info_6 10854 18.27857395
## 5 Family_Hist_4 19184 32.30663007
## 6 Insurance_History_5 25396 42.76788872
## 7 Family_Hist_2 28656 48.25786026
## 8 Family_Hist_3 34241 57.66322561
## 9 Family_Hist_5 41811 70.41141106
## 10 Medical_History_15 44596 75.10146343
## 11 Medical_History_24 55580 93.59896263
## 12 Medical_History_32 58274 98.13576733
## 13 Medical_History_10 58824 99.06198953
Creating a plot for the columns with missing values. The plot below visualizes the missing entries (in red) in each column whose missing-value count is greater than zero.
MissingValueColumns<-colnames(Profile)[colSums(is.na(Profile)) > 0]
## Matrix plot. Red for missing values, high values are represented by dark color.
matrixplot(Profile[MissingValueColumns], interactive = F, sortby = "Employment_Info_1",main="Missing value plot" )
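Since the matrix plot shows missing cells rather than percentages, a bar chart of the missingcol table computed above makes the percentages directly readable; a minimal sketch:
# percentage of missing values per affected column, using the missingcol tibble from above
ggplot(missingcol, aes(x = reorder(variable, percentmissing), y = percentmissing)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = NULL, y = "% missing", title = "Percentage of missing values by column") +
  theme_bw()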
As we can see from the missing-value table and plot under section 3.2 Original Dataset, 11 of the 13 columns containing NAs have very high missing-value percentages (between roughly 14% and 99%). These columns will be removed from the dataset.
Removing 11 out of 13 columns having missing values.
# Creating a new data set to perform cleaning
ProfileClean <- Profile
# dropping the 11 high-missingness columns by their column positions
ProfileClean <- ProfileClean[,-c(39,18,37,30,35,36,38,53,62,70,48)]
dim(ProfileClean)
## [1] 59381 117
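Equivalently, the same columns can be selected by name from the missingcol table, which avoids fragile hard-coded positions; a sketch, assuming a 14% threshold (which reproduces the hand-picked set of 11):
# name-based alternative to the positional drop above
high_na <- missingcol$variable[missingcol$percentmissing > 14]
ProfileClean_alt <- Profile[, !(names(Profile) %in% high_na)]
identical(dim(ProfileClean_alt), dim(ProfileClean))  # should be TRUE: both 59381 x 117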
For Employment_Info_1, we can check the summary and replace the missing values (19 in total) with the mean (0.0775821).
summary(Profile$Employment_Info_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.03500 0.06000 0.07758 0.10000 1.00000 19
# replacing NA with mean for Employment_Info_1
ProfileClean$Employment_Info_1[is.na(ProfileClean$Employment_Info_1)] <- 0.07758
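A variant that computes the mean directly instead of hard-coding the rounded value; a minimal sketch:
# equivalent mean imputation without hard-coding the rounded mean
emp1_mean <- mean(Profile$Employment_Info_1, na.rm = TRUE)
ProfileClean$Employment_Info_1[is.na(ProfileClean$Employment_Info_1)] <- emp1_mean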
For Employment_Info_4, since the distribution is not uniform, we will use kNN imputation to predict its missing values.
summary(Profile$Employment_Info_4)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.006 0.000 1.000 6779
# kNN imputation for Employment_Info_4
# Not executing the code below again, as it takes a long time. I have
# already run it previously and saved the result.
# I am loading the data after kNN imputation back into the ProfileClean data set.
### ProfileClean <- knnImputation(ProfileClean, k = 3, scale = T, meth = "weighAvg", distData = NULL)
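### Optional sanity check before committing to the full run: mirror the call
### above on a random 2,000-row sample first (fast) and confirm no NAs remain.
### A sketch, left commented like the full-data call:
### set.seed(42)
### sample_imputed <- knnImputation(ProfileClean[sample(nrow(ProfileClean), 2000), ],
###                                 k = 3, scale = T, meth = "weighAvg")
### sum(is.na(sample_imputed))  # expected: 0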
# Loading data from GitHub. This is the cleaned dataset, after mean imputation and kNN imputation.
ProfileClean <- read.csv(text=getURL("https://raw.githubusercontent.com/nitin3078/Prudential/master/ProfileClean.csv"), header=T)
No missing values remain in the cleaned dataset.
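A quick check confirms the count (a minimal sketch; presumably the original verification was similar, which is where the 0 above came from):
# confirm that the reloaded cleaned dataset has no remaining NAs
sum(is.na(ProfileClean))
## [1] 0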
#Updating first column name to "Id" in cleaned data set ProfileClean
colnames(ProfileClean)[1] <- "Id"
# Checking Correlation between BMI, Wt and Ht
cor(ProfileClean$BMI,ProfileClean$Wt)
## [1] 0.8540833
cor(ProfileClean$BMI,ProfileClean$Ht)
## [1] 0.1231248
cor(ProfileClean$Ht,ProfileClean$Wt)
## [1] 0.6104248
# BMI and Wt are highly correlated (r = 0.85), so Wt is dropped to avoid redundancy
ProfileClean <- subset(ProfileClean, select = -c(Wt) )
# Medical_Keyword_1 to Medical_Keyword_48 are 0/1 dummies; creating a new
# column as the sum of all these columns: MedKeywordSum
ProfileClean$MedKeywordSum <- rowSums(ProfileClean[,c(68:115)])
# Dropping Medical_Keyword_1 to Medical_Keyword_48 from dataset
ProfileClean <- subset(ProfileClean, select = -c(68:115) )
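For reference, an index-free variant of the two steps above that matches the column-name pattern instead of positions; a sketch applied to an illustrative copy (not meant to be run in addition):
# index-free variant, on a hypothetical fresh copy that still has the keyword columns
pc <- Profile
kw_cols <- grep("^Medical_Keyword_", names(pc))
pc$MedKeywordSum <- rowSums(pc[, kw_cols])
pc <- pc[, -kw_cols]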
Similar to the step above, the columns Medical_History_3 to Medical_History_41 individually carry little overall information, so they are combined into a single sum column, MedHistSum.
# Creating a new column as a sum of all these column : MedHistSum
ProfileClean$MedHistSum <- rowSums(ProfileClean[,c(33:67)])
# Dropping Medical_History_3 to Medical_History_41 from dataset
ProfileClean <- subset(ProfileClean, select = -c(33:67) )
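A quick sanity check on the two engineered columns (a sketch; both should be non-negative sums):
# ranges of the engineered sums
summary(ProfileClean$MedKeywordSum)
summary(ProfileClean$MedHistSum)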
After these cleaning steps, the cleaned data set ProfileClean contains 35 variables and 59,381 observations.
**A glimpse of the cleaned data**
datatable(head(ProfileClean, 50), options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
"}")
))
With the cleaned data set in hand, the next step is to understand the effect of the various variables on the Response column. One way is to use visualizations to understand the distributions and assess the predictive power of each variable.
Before plotting the different variables against the Response variable, let's first look at the distributions of the two newly created columns, MedKeywordSum and MedHistSum.
# Distribution of custom column MedKeywordSum is not normal
ggplot(ProfileClean, aes(x=MedKeywordSum)) +
geom_histogram(aes(y = ..density..),position="identity", colour="black", alpha=0.2, bins = 10)+
ggtitle("Histogram of Custom Column MedHistSum") +
theme_bw() + stat_function(fun = dnorm, colour = "red",
args = list(mean = mean(ProfileClean$MedKeywordSum, na.rm = TRUE),
sd = sd(ProfileClean$MedKeywordSum, na.rm = TRUE)))
As we can see, the distribution of the newly created column MedKeywordSum is not normal, but the distribution of the custom column MedHistSum is approximately normal (see the histogram below).
# Distribution of custom column MedHistSum is normal
ggplot(ProfileClean, aes(x=MedHistSum)) +
geom_histogram(aes(y = ..density..),position="identity", colour="black", alpha=0.2, bins = 10)+
ggtitle("Histogram of Custom Column MedHistSum") +
theme_bw() + stat_function(fun = dnorm, colour = "red",
args = list(mean = mean(ProfileClean$MedHistSum, na.rm = TRUE),
sd = sd(ProfileClean$MedHistSum, na.rm = TRUE)))
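The visual impression can be backed with a quick numeric skewness check; a minimal sketch using a hand-rolled helper, since no skewness function is loaded here:
# third standardized moment as a rough skewness measure (hand-rolled helper)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(ProfileClean$MedKeywordSum)  # expected: clearly positive (right-skewed)
skewness(ProfileClean$MedHistSum)     # expected: much closer to zero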
To understand the relationship between the predictor variables and Response, box plots and histograms are quite helpful. Below are the distributions and relationships of a few of the most important variables.
# Boxplot Response ~ BMI
ggplot(ProfileClean, aes(x=factor(Response), y=BMI)) + ggtitle("Boxplot Response ~ BMI") +
geom_boxplot(colour="blue")+ theme(axis.text.x=element_text(angle=90,hjust=1))
We can see that BMI has predictive power: response value 8 is associated with lower BMI than response value 1.
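To quantify the pattern, group medians of BMI by Response can be tabulated; a sketch (dplyr comes with tidyverse):
# median BMI per Response level; lower medians expected at Response 8 than at 1
ProfileClean %>%
  group_by(Response) %>%
  summarise(median_BMI = median(BMI))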
# Boxplot Response ~ Ht
ggplot(ProfileClean, aes(x=factor(Response), y=Ht)) + ggtitle("Boxplot Response ~ Ht") +
geom_boxplot(colour="red")+ theme(axis.text.x=element_text(angle=90,hjust=1))
We can see that Ht has very little predictive power.
ggplot(ProfileClean, aes(x=factor(Response), y=Ins_Age)) + ggtitle("Boxplot Response ~ Ins_Age") +
geom_boxplot(colour="blue")+ theme(axis.text.x=element_text(angle=90,hjust=1))
From this plot, we can see that Ins_Age has some predictive power.
# Boxplot Response ~ Product_Info_4
ggplot(ProfileClean, aes(x=factor(Response), y=Product_Info_4)) + ggtitle("Boxplot Response ~ Product_Info_4") +
geom_boxplot(colour="blue")+ theme(axis.text.x=element_text(angle=90,hjust=1))
From this plot, we can see that Product_Info_4 has considerable predictive power.
# Boxplot Response ~ MedKeywordSum
ggplot(ProfileClean, aes(x=factor(Response), y=MedKeywordSum)) + ggtitle("Boxplot Response with custom column MedKeywordSum") +
geom_boxplot(colour="red")+ theme(axis.text.x=element_text(angle=90,hjust=1))
MedKeywordSum shows little predictive power.
# Boxplot Response ~ MedHistSum
ggplot(ProfileClean, aes(x=factor(Response), y=MedHistSum)) + ggtitle("Boxplot Response with custom variable MedHistSum") +
geom_boxplot(colour="blue")+ theme(axis.text.x=element_text(angle=90,hjust=1))
MedHistSum also shows little predictive power.
# histogram of Response faceted by Product_Info_2
ggplot(ProfileClean, aes(x=Response)) + ggtitle("Histogram Response with Product_Info_2 categories") +
geom_histogram(position="identity", colour="blue", alpha=0.2, bins = 10)+
facet_grid(. ~ Product_Info_2) +
theme(axis.text.x=element_text(angle=90,hjust=1))
From the above histogram, we can see that the distribution of the response variable depends on Product_Info_2.
# histogram of Response faceted by Family_Hist_1
ggplot(ProfileClean, aes(x=Response)) + ggtitle("Histogram Response with Family_Hist_1 values") +
geom_histogram(position="identity", colour="black", alpha=0.2, bins = 10)+
facet_grid(. ~ Family_Hist_1) +
theme(axis.text.x=element_text(angle=90,hjust=1))
From the above histogram, we can see that the distribution of the response variable depends on Family_Hist_1.
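The dependence can also be checked numerically with a contingency table of row proportions; a minimal sketch:
# share of each Response level within each Family_Hist_1 category
round(prop.table(table(ProfileClean$Family_Hist_1, ProfileClean$Response), margin = 1), 3)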
A predictive model is built to predict the response value using **Multinomial Logistic Regression**. Below are the steps executed.
# loading "test" data from website and then removing columns just like we did in Train data set to make
# algorithms work
test <- read.csv(text=getURL("https://raw.githubusercontent.com/nitin3078/Prudential/master/test.csv"), header=T)
# execute the steps below to delete the required columns, just as in the train data set, to make it congruent
test <- test[,-c(39,18,37,30,35,36,38,53,62,70,48)]
test <- subset(test, select = -c(Wt) )
test$MedKeywordSum <- rowSums(test[,c(68:115)])
test <- subset(test, select = -c(68:115) )
test$MedHistSum <- rowSums(test[,c(33:67)])
test <- subset(test, select = -c(33:67) )
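A quick congruence check (a sketch) confirms that the test set now carries the same predictors as the training set:
# should return only "Response", which is absent from the test set until predictions are added
setdiff(names(ProfileClean), names(test))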
Creating a multinomial logistic regression model to predict Response.
# Multinomial model creation and submission
MultinomModel <- multinom(Response ~ Product_Info_4+ Product_Info_2 + BMI + Ins_Age +Family_Hist_1+ MedKeywordSum + MedHistSum, data = ProfileClean)
## # weights: 208 (175 variable)
## initial value 123479.318186
## iter 10 value 104682.355198
## iter 20 value 96796.326013
## iter 30 value 95364.740220
## iter 40 value 93849.923830
## iter 50 value 92213.706589
## iter 60 value 91600.786670
## iter 70 value 91054.948388
## iter 80 value 90847.815520
## iter 90 value 90770.643849
## iter 100 value 90734.071368
## final value 90734.071368
## stopped after 100 iterations
MultinomModel
## Call:
## multinom(formula = Response ~ Product_Info_4 + Product_Info_2 +
## BMI + Ins_Age + Family_Hist_1 + MedKeywordSum + MedHistSum,
## data = ProfileClean)
##
## Coefficients:
## (Intercept) Product_Info_4 Product_Info_2A2 Product_Info_2A3
## 2 -0.5406229 -0.25933637 -0.09111900 0.22686666
## 3 6.8440225 -1.34585204 -0.21924483 0.11643902
## 4 7.7173533 -1.15832830 0.35657408 0.35167248
## 5 -7.4385075 0.07046966 -0.20611241 -0.15998302
## 6 -1.7798521 0.69993129 0.03326992 0.07631785
## 7 -3.4632764 0.90649512 -0.41296403 -0.17689201
## 8 -9.0412727 1.12909832 -0.35772963 0.06258798
## Product_Info_2A4 Product_Info_2A5 Product_Info_2A6 Product_Info_2A7
## 2 -4.1086349 -3.8586372 -4.472830402 -0.09346403
## 3 -0.1730812 0.1096897 -0.443404802 0.90114601
## 4 0.3736278 0.7895756 -0.003917642 -2.60457785
## 5 -0.2437824 -0.2185548 -0.495672414 1.43285528
## 6 -0.2095346 0.1055205 -0.023496737 -3.45434592
## 7 -0.3579272 0.2150354 0.025574138 -3.58014567
## 8 -0.1183681 0.2997791 0.083578446 -3.78495182
## Product_Info_2A8 Product_Info_2B1 Product_Info_2B2 Product_Info_2C1
## 2 -0.4283289 -4.2064302 -1.20188732 -0.3275233
## 3 0.1257580 2.4454896 -0.62078287 0.5249037
## 4 -0.3508280 2.5162500 -0.25169130 0.5401341
## 5 0.3914091 0.5316323 -0.40647947 -0.9467392
## 6 -0.3336799 -0.2377402 -0.08745355 -0.5480086
## 7 -0.5241112 -1.1286374 -0.05128955 -0.7910001
## 8 -0.5518351 -0.7705589 -0.02650824 -1.0956439
## Product_Info_2C2 Product_Info_2C3 Product_Info_2C4 Product_Info_2D1
## 2 -0.2065126 -0.2299767 -0.5678185 0.03948999
## 3 0.6538315 0.9451619 -1.2444693 0.55457046
## 4 0.4254888 0.5178797 0.3140331 0.88701136
## 5 -0.9342104 -0.5267494 -0.8651972 -0.49414769
## 6 -0.5515811 -0.0315970 -0.2032081 -0.34990907
## 7 -0.9208739 -0.1474561 -0.2925301 -0.97230668
## 8 -0.9100231 -0.4648721 -0.2062558 -0.92463752
## Product_Info_2D2 Product_Info_2D3 Product_Info_2D4 Product_Info_2E1
## 2 0.01387663 -0.19820576 -0.49050178 -0.28444390
## 3 0.42076043 0.06731384 -0.71086564 0.09274823
## 4 0.74917154 0.57949111 -0.57019577 0.52147395
## 5 -0.23296526 -0.13975010 -0.21226550 -0.13350650
## 6 0.19304294 0.07571381 -0.06365077 0.15843633
## 7 -0.35563959 -0.35392210 -0.16843188 0.10480347
## 8 -0.34858316 -0.47456955 -0.04740363 0.15980804
## BMI Ins_Age Family_Hist_1 MedKeywordSum MedHistSum
## 2 2.6799627 -1.2511595 -0.03694289 -0.06294195 0.003984845
## 3 0.6406122 -3.9246202 -0.17758579 -0.28522218 -0.086011175
## 4 -8.6229952 -3.5480947 -0.05373809 -0.43334210 -0.041834536
## 5 4.6388707 -2.1640322 -0.11684306 -0.18858357 0.088603678
## 6 -1.0133409 -0.6241689 0.05566893 -0.20220451 0.045021363
## 7 -3.4796207 0.1101358 -0.27302268 -0.20472085 0.090888052
## 8 -12.2886924 -1.3087523 0.14918725 -0.42579899 0.223296126
##
## Residual Deviance: 181468.1
## AIC: 181818.1
predict_Response <- predict(MultinomModel, test, "probs")
test$Response <- predict(MultinomModel, test)
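Since the true labels of the Kaggle test set are not public, in-sample fit can be inspected on the training data; a minimal sketch using caret, which is already loaded:
# confusion matrix of the model's predictions on the training data
train_pred <- predict(MultinomModel, ProfileClean)
confusionMatrix(train_pred, factor(ProfileClean$Response, levels = levels(train_pred)))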
Creating a submission file to submit on Kaggle. The path in the command below needs to be replaced with a local path. I am commenting the command out for now, but it can be uncommented and executed successfully. Once submission.csv is generated, it can be submitted on the Kaggle website to obtain a score and rank. Here is the URL to upload submissions.
# Create a file to submit on Kaggle, which contains only 2 columns, Id and Response
submission <- test[, c(1,35)]
write.csv(submission, "C:\\Users\\nitin\\Desktop\\study\\Special topics in BANA R\\Project\\submission.csv", row.names = F)
The kappa score obtained from Kaggle is 0.36562 for this model. There is a lot of scope for improvement: cross-validating over train/test splits and iterating on the model would improve it most significantly.
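A cross-validated version of the same model is straightforward with caret; a sketch (method = "multinom" wraps nnet::multinom, and the run is slow on the full data):
# 5-fold cross-validation of the multinomial model
set.seed(123)
cv_fit <- train(factor(Response) ~ Product_Info_4 + Product_Info_2 + BMI + Ins_Age +
                  Family_Hist_1 + MedKeywordSum + MedHistSum,
                data = ProfileClean,
                method = "multinom",
                trControl = trainControl(method = "cv", number = 5),
                trace = FALSE)
cv_fit$results  # accuracy and kappa for each tuned value of the decay parameter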
Overall, we can see that age, BMI, family history, and product type are the main factors in assessing insurance risk.