Prudential Life Insurance Assessment

Prudential Financial is a fortune 500 company which offers products and services in life insurance,mutual fund, real estate and many more products.

1. Introduction

Prudential life insurance uses data to assess risk in providing insurance to individual customers. While traditional methods include collecting history, medical records amongh many other data points, it usually takes aound 30 days. My goal here is to automate the process so that risk assessment would be quick and accurate.The data is provided by Prudential Financial for kaggle competition for this assessment challenege.

My approach includes tidying and cleaning data, understanding data by performing exploratory data analysis,identifying key variables which contribe in predicting response variable (determining level of risk , descrete 1 to 8) , creating a model and presenting results.

The results of this analysis will help Insurance companies make well informed decisions.

2. Packages Required

A list of packages used in this project is documented below. It may be modified, if needed.

tidyverse = For data import and data manipulation. Package contains many other important packages. Most of the packages needed separately like magrittr, ggplot2,dplyr, purrr are included in this.
VIM = For Visualization and Imputation of Missing Values
randomForest= for machine learning algorithm
DT = For HTML capabilities
caret = For training model and finding most important variables
DMwR = For kNN imputation of missing values
gbm = For boosting algorithms
RCurl = To load data from GitHub repository
nnet = For multinomial logistic regression
mlbench = For machine learning algorithm

library(tidyverse)
library(randomForest)
library(VIM)
library(DT)
library(caret)
library(DMwR)
library(gbm)
library(RCurl)
library(nnet)

3. Data Preparation

3.1 Data Source

Dataset was provided by Prudential Financial as a competition on Kaggle. Original dataset can be downloaded from here. Data is further cleaned in R and prepared for analysis.

3.1.1 Data Dictionary

Data dictionary of 128 variables in original dataset is provided below.

Data Dictionary (Click to expand)

Explaination of each variable as provided by kaggle

Variable Description

Id A unique identifier associated with an application.

Product_Info_1-7 A set of normalized variables relating to the product applied for Ins_Age Normalized age of applicant Ht Normalized height of applicant Wt Normalized weight of applicant BMI Normalized BMI of applicant Employment_Info_1-6 A set of normalized variables relating to the employment history of the applicant. InsuredInfo_1-6 A set of normalized variables providing information about the applicant. Insurance_History_1-9 A set of normalized variables relating to the insurance history of the applicant. Family_Hist_1-5 A set of normalized variables relating to the family history of the applicant. Medical_History_1-41 A set of normalized variables relating to the medical history of the applicant. Medical_Keyword_1-48 A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application. Response This is the target variable, an ordinal variable relating to the final decision associated with an application

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.

3.2 Original Dataset

Original data set contains 128 variables and 59381 observations in training dataset. Dataset was provided by Prudential Financial to assess the risk of providing policy to life insurance applicants. Data set was uploaded on kaggle as a compitition sponsored by Prudential Financial, around 2 years ago (11/23/2015). Many missing values are present in the original dataset in multiple columns.

Missing values in original dataset were recorded as blanks, while importing in R, blanks were converted in NAs.

Statistics and exploration of data

Profile <- read.csv(text=getURL("https://raw.githubusercontent.com/nitin3078/Prudential/master/train.csv"), header = T)

datatable(head(Profile, 50), options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
    "}")
))

# number of observations and number of variables in dataset
dim(Profile)

## [1] 59381   128

# displaying structure and a subset of values of dataset
glimpse(Profile)

## Observations: 59,381
## Variables: 128
## $ Id                  <int> 2, 5, 6, 7, 8, 10, 11, 14, 15, 16, 17, 18,...
## $ Product_Info_1      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ Product_Info_2      <fctr> D3, A1, E1, D4, D2, D2, A8, D2, D3, E1, D...
## $ Product_Info_3      <int> 10, 26, 26, 10, 26, 26, 10, 26, 26, 21, 26...
## $ Product_Info_4      <dbl> 0.07692308, 0.07692308, 0.07692308, 0.4871...
## $ Product_Info_5      <int> 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Product_Info_6      <int> 1, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Product_Info_7      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Ins_Age             <dbl> 0.64179104, 0.05970149, 0.02985075, 0.1641...
## $ Ht                  <dbl> 0.5818182, 0.6000000, 0.7454545, 0.6727273...
## $ Wt                  <dbl> 0.1485356, 0.1317992, 0.2887029, 0.2050209...
## $ BMI                 <dbl> 0.3230080, 0.2722877, 0.4287804, 0.3524377...
## $ Employment_Info_1   <dbl> 0.0280, 0.0000, 0.0300, 0.0420, 0.0270, 0....
## $ Employment_Info_2   <int> 12, 1, 9, 9, 9, 15, 1, 12, 9, 1, 9, 3, 9, ...
## $ Employment_Info_3   <int> 1, 3, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, ...
## $ Employment_Info_4   <dbl> 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, ...
## $ Employment_Info_5   <int> 3, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, ...
## $ Employment_Info_6   <dbl> NA, 0.0018, 0.0300, 0.2000, 0.0500, 1.0000...
## $ InsuredInfo_1       <int> 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, ...
## $ InsuredInfo_2       <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ InsuredInfo_3       <int> 6, 6, 8, 8, 6, 8, 3, 6, 3, 3, 4, 3, 8, 3, ...
## $ InsuredInfo_4       <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, ...
## $ InsuredInfo_5       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, ...
## $ InsuredInfo_6       <int> 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, ...
## $ InsuredInfo_7       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Insurance_History_1 <int> 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, ...
## $ Insurance_History_2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Insurance_History_3 <int> 3, 3, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 1, ...
## $ Insurance_History_4 <int> 1, 1, 3, 3, 3, 2, 2, 1, 2, 1, 1, 3, 1, 3, ...
## $ Insurance_History_5 <dbl> 0.000666667, 0.000133333, NA, NA, NA, 0.00...
## $ Insurance_History_7 <int> 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3, 2, 3, ...
## $ Insurance_History_8 <int> 1, 3, 2, 2, 2, 3, 1, 1, 1, 3, 1, 2, 1, 2, ...
## $ Insurance_History_9 <int> 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 3, ...
## $ Family_Hist_1       <int> 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, ...
## $ Family_Hist_2       <dbl> NA, 0.1884058, 0.3043478, 0.4202899, 0.463...
## $ Family_Hist_3       <dbl> 0.5980392, NA, NA, NA, NA, 0.2941176, NA, ...
## $ Family_Hist_4       <dbl> NA, 0.08450704, 0.22535211, 0.35211268, 0....
## $ Family_Hist_5       <dbl> 0.5267857, NA, NA, NA, NA, NA, NA, 0.63392...
## $ Medical_History_1   <int> 4, 5, 10, 0, NA, 6, 5, 6, 4, NA, 1, 4, 5, ...
## $ Medical_History_2   <int> 112, 412, 3, 350, 162, 491, 600, 145, 16, ...
## $ Medical_History_3   <int> 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, ...
## $ Medical_History_4   <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, ...
## $ Medical_History_5   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_6   <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, ...
## $ Medical_History_7   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_8   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_9   <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, ...
## $ Medical_History_10  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_11  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_12  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_13  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_14  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_15  <int> 240, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_16  <int> 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 1, 1, ...
## $ Medical_History_17  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_18  <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_19  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_20  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_21  <int> 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_22  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_23  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 1, 3, ...
## $ Medical_History_24  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_25  <int> 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_26  <int> 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_27  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_28  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, ...
## $ Medical_History_29  <int> 3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 1, 3, 3, 3, ...
## $ Medical_History_30  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_31  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_32  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_33  <int> 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 1, ...
## $ Medical_History_34  <int> 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 1, 3, 3, 3, ...
## $ Medical_History_35  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_36  <int> 2, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_37  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_38  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_39  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, ...
## $ Medical_History_40  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_41  <int> 3, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 3, 3, 1, ...
## $ Medical_Keyword_1   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_2   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_3   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_4   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_5   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_6   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_7   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_8   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_9   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_10  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_11  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_12  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_13  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_14  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_15  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, ...
## $ Medical_Keyword_16  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_17  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_18  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_19  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_20  <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_21  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_22  <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_23  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, ...
## $ Medical_Keyword_24  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_25  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ Medical_Keyword_26  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_27  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_28  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_29  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_30  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_31  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_32  <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_33  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_34  <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_35  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_36  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_37  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_38  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_39  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_40  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_41  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_42  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_43  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_44  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Medical_Keyword_45  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_46  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_47  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Medical_Keyword_48  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Response            <int> 8, 4, 8, 8, 8, 8, 8, 1, 8, 1, 6, 2, 7, 3, ...

summary statistics of response variable for training data set

#summary of predictive variable(Response) on training dataset
summary(Profile$Response)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   5.637   8.000   8.000

Checking missing values in columns

#Checking missing values present in dataset
colSums(is.na(Profile))

##                  Id      Product_Info_1      Product_Info_2 
##                   0                   0                   0 
##      Product_Info_3      Product_Info_4      Product_Info_5 
##                   0                   0                   0 
##      Product_Info_6      Product_Info_7             Ins_Age 
##                   0                   0                   0 
##                  Ht                  Wt                 BMI 
##                   0                   0                   0 
##   Employment_Info_1   Employment_Info_2   Employment_Info_3 
##                  19                   0                   0 
##   Employment_Info_4   Employment_Info_5   Employment_Info_6 
##                6779                   0               10854 
##       InsuredInfo_1       InsuredInfo_2       InsuredInfo_3 
##                   0                   0                   0 
##       InsuredInfo_4       InsuredInfo_5       InsuredInfo_6 
##                   0                   0                   0 
##       InsuredInfo_7 Insurance_History_1 Insurance_History_2 
##                   0                   0                   0 
## Insurance_History_3 Insurance_History_4 Insurance_History_5 
##                   0                   0               25396 
## Insurance_History_7 Insurance_History_8 Insurance_History_9 
##                   0                   0                   0 
##       Family_Hist_1       Family_Hist_2       Family_Hist_3 
##                   0               28656               34241 
##       Family_Hist_4       Family_Hist_5   Medical_History_1 
##               19184               41811                8889 
##   Medical_History_2   Medical_History_3   Medical_History_4 
##                   0                   0                   0 
##   Medical_History_5   Medical_History_6   Medical_History_7 
##                   0                   0                   0 
##   Medical_History_8   Medical_History_9  Medical_History_10 
##                   0                   0               58824 
##  Medical_History_11  Medical_History_12  Medical_History_13 
##                   0                   0                   0 
##  Medical_History_14  Medical_History_15  Medical_History_16 
##                   0               44596                   0 
##  Medical_History_17  Medical_History_18  Medical_History_19 
##                   0                   0                   0 
##  Medical_History_20  Medical_History_21  Medical_History_22 
##                   0                   0                   0 
##  Medical_History_23  Medical_History_24  Medical_History_25 
##                   0               55580                   0 
##  Medical_History_26  Medical_History_27  Medical_History_28 
##                   0                   0                   0 
##  Medical_History_29  Medical_History_30  Medical_History_31 
##                   0                   0                   0 
##  Medical_History_32  Medical_History_33  Medical_History_34 
##               58274                   0                   0 
##  Medical_History_35  Medical_History_36  Medical_History_37 
##                   0                   0                   0 
##  Medical_History_38  Medical_History_39  Medical_History_40 
##                   0                   0                   0 
##  Medical_History_41   Medical_Keyword_1   Medical_Keyword_2 
##                   0                   0                   0 
##   Medical_Keyword_3   Medical_Keyword_4   Medical_Keyword_5 
##                   0                   0                   0 
##   Medical_Keyword_6   Medical_Keyword_7   Medical_Keyword_8 
##                   0                   0                   0 
##   Medical_Keyword_9  Medical_Keyword_10  Medical_Keyword_11 
##                   0                   0                   0 
##  Medical_Keyword_12  Medical_Keyword_13  Medical_Keyword_14 
##                   0                   0                   0 
##  Medical_Keyword_15  Medical_Keyword_16  Medical_Keyword_17 
##                   0                   0                   0 
##  Medical_Keyword_18  Medical_Keyword_19  Medical_Keyword_20 
##                   0                   0                   0 
##  Medical_Keyword_21  Medical_Keyword_22  Medical_Keyword_23 
##                   0                   0                   0 
##  Medical_Keyword_24  Medical_Keyword_25  Medical_Keyword_26 
##                   0                   0                   0 
##  Medical_Keyword_27  Medical_Keyword_28  Medical_Keyword_29 
##                   0                   0                   0 
##  Medical_Keyword_30  Medical_Keyword_31  Medical_Keyword_32 
##                   0                   0                   0 
##  Medical_Keyword_33  Medical_Keyword_34  Medical_Keyword_35 
##                   0                   0                   0 
##  Medical_Keyword_36  Medical_Keyword_37  Medical_Keyword_38 
##                   0                   0                   0 
##  Medical_Keyword_39  Medical_Keyword_40  Medical_Keyword_41 
##                   0                   0                   0 
##  Medical_Keyword_42  Medical_Keyword_43  Medical_Keyword_44 
##                   0                   0                   0 
##  Medical_Keyword_45  Medical_Keyword_46  Medical_Keyword_47 
##                   0                   0                   0 
##  Medical_Keyword_48            Response 
##                   0                   0

#Checking if missing values present in "Id" column
sum(is.na(Profile$Id))

## [1] 0

# using purrr to find columns with missing value and arranging columns in ascending order
missingcol <- map_df(Profile, function(x) sum(is.na(x)))
missingcol <- gather(missingcol) 
colnames(missingcol) <- c("variable","missing")

missingcol <- missingcol %>%
  filter( missing > 0) %>%
  arrange(missing) %>% mutate( percentmissing = missing/59381*100 ) 
  
print(missingcol)

## # A tibble: 13 x 3
##               variable missing percentmissing
##                  <chr>   <int>          <dbl>
##  1   Employment_Info_1      19     0.03199677
##  2   Employment_Info_4    6779    11.41610953
##  3   Medical_History_1    8889    14.96943467
##  4   Employment_Info_6   10854    18.27857395
##  5       Family_Hist_4   19184    32.30663007
##  6 Insurance_History_5   25396    42.76788872
##  7       Family_Hist_2   28656    48.25786026
##  8       Family_Hist_3   34241    57.66322561
##  9       Family_Hist_5   41811    70.41141106
## 10  Medical_History_15   44596    75.10146343
## 11  Medical_History_24   55580    93.59896263
## 12  Medical_History_32   58274    98.13576733
## 13  Medical_History_10   58824    99.06198953

Creating a plot for missing value columns. Plot shows percentage of missing values in each column having missing value count greater than zero.

MissingValueColumns<-colnames(Profile)[colSums(is.na(Profile)) > 0]

## Matrix plot. Red for missing values, high values are represented by dark color.

matrixplot(Profile[MissingValueColumns], interactive = F, sortby = "Employment_Info_1",main="Missing value plot" )

Data containes missing values in 13 variables from as low as count=19 to as high as count > 50000 ( which is more than 99% observations for that column). Above plot of percentage missing values of these columns depicts the same.

3.3 Data Cleaning

As we can see from the plot (“missing value plot”) under section 3.2 Original Dataset, 11 out of 13 columns having NA have very high missing values percentage (between 14%-99%). These columns will be removed from dataset.

Removing 11 out of 13 columns having missing values.

# Creating a new data set to perform cleaning
ProfileClean <- Profile
ProfileClean <- ProfileClean[,-c(39,18,37,30,35,36,38,53,62,70,48)]

dim(ProfileClean)

## [1] 59381   117

For Employment_Info_1 , we can check the summary and replace missing values (19 in total), with the mean. mean= 0.0775821

summary(Profile$Employment_Info_1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.03500 0.06000 0.07758 0.10000 1.00000      19

# replacing NA with mean for Employment_Info_1
ProfileClean$Employment_Info_1[is.na(ProfileClean$Employment_Info_1)] <- 0.07758

For Employment_Info_4, since the distribution is not uniform, we will use kNN imputation to predict missing value for Employment_Info_4

summary(Profile$Employment_Info_4)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.006   0.000   1.000    6779

# kNN Impputation for Employment_Info_4
# Not executing below code again as it takes  a lot of time. I have
# already run the code previously and saved the data
# I am uploading data after kNN Imputation once again in ProfileClean data set

### ProfileClean <- knnImputation(ProfileClean, k = 3, scale = T, meth = "weighAvg", distData = NULL)

# Loading Data from github. This is Cleaned dataset, after median Imputaion and kNN Imputation
ProfileClean <- read.csv(text=getURL("https://raw.githubusercontent.com/nitin3078/Prudential/master/ProfileClean.csv"), header = T)

No Missing values present in Cleaned dataset anymore.

3.3 Data Customization

#Updating first column name to "Id" in cleaned data set ProfileClean
colnames(ProfileClean)[1] <- "Id"

# Checking Correlation between BMI, Wt and Ht
cor(ProfileClean$BMI,ProfileClean$Wt)

## [1] 0.8540833

cor(ProfileClean$BMI,ProfileClean$Ht)

## [1] 0.1231248

cor(ProfileClean$Ht,ProfileClean$Wt)

## [1] 0.6104248

As we can see ,BMI is highly correlated with Wt, removing Wt from dataset

ProfileClean <- subset(ProfileClean, select = -c(Wt) )

After looking at variables values, we see that columns Medical_Keyword_1 to Medical_Keyword_48 all have only zeros and ones, which may not have much predictive power, but adding them together might be significant . Below steps create new column and removes the columns from which it was created

# Creating a new column as a sum of all these column : MedKeywordSum

ProfileClean$MedKeywordSum <- rowSums(ProfileClean[,c(68:115)])
# Dropping Medical_Keyword_1 to Medical_Keyword_48 from dataset
ProfileClean <- subset(ProfileClean, select = -c(68:115) )

Similar as above, columns Medical_History_3 to Medical_History_41 are 0 and 1 with little overall information.

# Creating a new column as a sum of all these column : MedHistSum

ProfileClean$MedHistSum <- rowSums(ProfileClean[,c(33:67)])
# Dropping Medical_Keyword_1 to Medical_Keyword_48 from dataset
ProfileClean <- subset(ProfileClean, select = -c(33:67) )

After cleaning steps, cleaned data set ProfileClean contains 35 variables and 59381 observations.

A glimpse of the cleaned data

datatable(head(ProfileClean, 50), options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
    "}")
))

4. Exploratory Data Analysis

With the cleaned data set in our hand, next step is to understand effect of various variables on the Response column. One of the way is to use visualizations to understand the distributions and assess the predictive power of each variable

Before checking plots of different variables with Response variable, lets first see the distribution of two newly created columns, MedKeywordSum and MedHistSum

# Distribution of custom column MedKeywordSum is not normal
ggplot(ProfileClean, aes(x = MedKeywordSum)) + 
  geom_histogram(aes(y = ..density..),position = "identity", colour = "black", alpha = 0.2, bins = 10) + 
  ggtitle("Histogram of Custom Column MedHistSum") +
  theme_bw() + stat_function(fun = dnorm, colour = "red",
                             args = list(mean = mean(ProfileClean$MedKeywordSum, na.rm = TRUE),
                                         sd = sd(ProfileClean$MedKeywordSum, na.rm = TRUE)))

As we can see, distribution of newly createed column MedKeywordSum is not normal, but distribution of custom column MedHistSum is normal ( see below histogram).

# Distribution of custom column MedHistSum is normal
ggplot(ProfileClean, aes(x = MedHistSum)) + 
  geom_histogram(aes(y = ..density..),position = "identity", colour = "black", alpha = 0.2, bins = 10) +
  ggtitle("Histogram of Custom Column MedHistSum") +
  theme_bw() + stat_function(fun = dnorm, colour = "red",
                             args = list(mean = mean(ProfileClean$MedHistSum, na.rm = TRUE),
                                        sd = sd(ProfileClean$MedHistSum, na.rm = TRUE)))

To understand the relationship between predictor variables and Response, box plots and histogram distributions are quite helpful. Below are the distributions and relationships of few of the most important variables.

# Boxplot Response ~ BMI
ggplot(ProfileClean, aes(x = factor(Response), y=BMI)) + ggtitle("Boxplot Response ~ BMI") + 
  geom_boxplot(colour="blue")+  theme(axis.text.x=element_text(angle=90,hjust=1))

* We can see that BMI has prediction power. Response value 8 has lower value of BMI compared to response value 1.

# Boxplot Response ~ Ht
ggplot(ProfileClean, aes(x=factor(Response), y=Ht)) + ggtitle("Boxplot Response ~ Ht") + 
  geom_boxplot(colour="red")+  theme(axis.text.x=element_text(angle=90,hjust=1))

* We can see that Ht has very less prediction power.

ggplot(ProfileClean, aes(x=factor(Response), y=Ins_Age)) + ggtitle("Boxplot Response ~ Ins_Age") + 
  geom_boxplot(colour="blue")+  theme(axis.text.x=element_text(angle=90,hjust=1))

* From this plot, we can see Ins_Age has some predictive power

# Boxplot Response ~ Product_Info_4
ggplot(ProfileClean, aes(x=factor(Response), y=Product_Info_4)) + ggtitle("Boxplot Response with custom variable MedHistSum") + 
  geom_boxplot(colour="blue")+  theme(axis.text.x=element_text(angle=90,hjust=1))

* From this plot, we can see Product_Info_4 has much predictive power

# Boxplot Response ~ MedKeywordSum
ggplot(ProfileClean, aes(x=factor(Response), y=MedKeywordSum)) + ggtitle("Boxplot Response with custom column MedKeywordSum") + 
  geom_boxplot(colour="red")+  theme(axis.text.x=element_text(angle=90,hjust=1))

* Little predictive power

# Boxplot Response ~ MedHistSum
ggplot(ProfileClean, aes(x=factor(Response), y=MedHistSum)) + ggtitle("Boxplot Response with custom variable MedHistSum") + 
  geom_boxplot(colour="blue")+  theme(axis.text.x=element_text(angle=90,hjust=1))

* little predictive power MedHistSum

# histogram of Response on Product_Info_2

ggplot(ProfileClean, aes(x=Response)) +  ggtitle("Histogram  Response with Product_Info_2 categories") + 
  geom_histogram(position="identity", colour="blue", alpha=0.2, bins = 10)+
  facet_grid(. ~ Product_Info_2)

  theme(axis.text.x=element_text(angle=90,hjust=1))

## List of 1
##  $ axis.text.x:List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 1
##   ..$ vjust        : NULL
##   ..$ angle        : num 90
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

From above histogram, we can see that distribution of response variable is dependent on Product_Info_2.

# histogram of Response on Family_Hist_1
  
  ggplot(ProfileClean, aes(x=Response)) + ggtitle("Histogram  Response with Family_Hist_1 values") +
    geom_histogram(position="identity", colour="black", alpha=0.2, bins = 10)+
    facet_grid(. ~ Family_Hist_1)+
  theme(axis.text.x=element_text(angle=90,hjust=1))

* From above histogram, we can see that distribution of response variable is dependent on Family_Hist_1.

5. Building Predictive Model

A predictive model is built to predict response value using ** Multinomial Logistic Regression**. Below are the steps executed.

Preparing “test” dataset to contain same column as “ProfileClean” data set to use in predictive models. Below steps needs to be executed to download test file from kaggle(now from github as I have put it there) and tidying data.

# loading "test" data from website and then removing columns just like we did in Train data set to make
# algorithms work
test <- read.csv(text = getURL("https://raw.githubusercontent.com/nitin3078/Prudential/master/test.csv"), header=T)

# execute below steps to delete required columns just like we did in train data set to make it congruent

test <- test[,-c(39,18,37,30,35,36,38,53,62,70,48)]
test <- subset(test, select = -c(Wt) )
test$MedKeywordSum <- rowSums(test[,c(68:115)])
test <- subset(test, select = -c(68:115) )
test$MedHistSum <- rowSums(test[,c(33:67)])
test <- subset(test, select = -c(33:67) )

Creating a Multinomila logistic regression model to predict Response.

# Multinomial model creation and submission

MultinomModel <- multinom(Response ~  Product_Info_4+ Product_Info_2 +  BMI + Ins_Age +Family_Hist_1+ MedKeywordSum + MedHistSum, data = ProfileClean)

predict_Response <- predict (MultinomModel, test, "probs")

test$Response <- predict (MultinomModel, test)

Creating a submission file to submit on kaggle. URL needs to be replaced to a local URL in below command. Also I am commenting the command for now but it can be uncommented and executed successfully. Once Submission.csv is genereated. It can be submitted to Kaggle website to obtain score and rank. Here is the URL to upload submissions.

# Create a file to submit on Kaggle, which contain only 2 columns, Id and Response
submission <- test[, c(1,35)]
  
write.csv(submission, "C:\\Users\\nitin\\Desktop\\study\\Special topics in BANA R\\Project\\submission.csv", row.names = F)

6. Summary

Accuracy obtained from Kaggle Kappa is 0.36562 for this model. There is a lot of scope for improvement. Cross validation can be done to split train and test and iterations which is going to most significantly imporve the model.

Overall we can see that age, BMI, Family history and product type as well are main factors to assess risk of the insurance.

Data Wrangling in R - Final Project

Nitin Sharma

03 December, 2017