Prudential Life Insurance Assessment

Prudential Financial is a fortune 500 company which offers products and services in life insurance,mutual fund, real estate and many more products.

1. Introduction

Prudential life insurance uses data to assess risk in providing insurance to individual customers. While traditional methods include collecting history, medical records amongh many other data points, it usually takes aound 30 days. My goal here is to automate the process so that risk assessment would be quick and accurate.The data is provided by Prudential Financial for kaggle competition for this assessment challenege.

My approach will include identifying key variables contributing to final decision (determining level of risk , from 1 to 8) and how much they influence it. My approach will be to understand relationships between “Response” and features, finding a pattern among data, estimating a model and validating model.

The results of this analysis will help Insurance companies make well informed decisions.

2. Packages Required

A list of packages used in this project is documented below. It may be modified, if needed.

tidyverse = For data import and data manipulation. Package contains many other important packages. Most of the packages needed separately like magrittr, ggplot2,dplyr, purrr are included in this.
VIM = For Visualization and Imputation of Missing Values
randomForest= for machine learning algorithm
DT = For HTML capabilities
caret = For training model and finding most important variables
DMwR = For kNN imputation of missing values

Few more packages related to model will be included.

library(tidyverse)
library(randomForest)
library(VIM)
library(DT)
library(caret)
library(DMwR)

3. Data Preparation

3.1 Data Source

Dataset was provided by Prudential Financial as a competition on Kaggle. Original dataset can be downloaded from here. Data is further cleaned in R and prepared for analysis.

3.1.1 Data Dictionary

Data dictionary of 128 variables in original dataset is provided below.

Data Dictionary (Click to expand)

Explaination of each variable as provided by kaggle

Variable Description

Id A unique identifier associated with an application.

Product_Info_1-7 A set of normalized variables relating to the product applied for Ins_Age Normalized age of applicant Ht Normalized height of applicant Wt Normalized weight of applicant BMI Normalized BMI of applicant Employment_Info_1-6 A set of normalized variables relating to the employment history of the applicant. InsuredInfo_1-6 A set of normalized variables providing information about the applicant. Insurance_History_1-9 A set of normalized variables relating to the insurance history of the applicant. Family_Hist_1-5 A set of normalized variables relating to the family history of the applicant. Medical_History_1-41 A set of normalized variables relating to the medical history of the applicant. Medical_Keyword_1-48 A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application. Response This is the target variable, an ordinal variable relating to the final decision associated with an application

The following variables are all categorical (nominal):

Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41

The following variables are continuous:

Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5

The following variables are discrete:

Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32

Medical_Keyword_1-48 are dummy variables.

3.2 Original Dataset

Original data set contained 128 variables and 59381 observations in training dataset. Dataset was provided by Prudential Financial to assess the risk of providing policy to life insurance applicants. Data set was uploaded on kaggle as a compitition sponsored by Prudential Financial, around 2 years ago (11/23/2015). Many missing values are present in the original dataset in multiple columns.

Missing values in original dataset were recorded as blanks, while importing in R, blanks were converted in NAs.

Statistics and exploration of data

profile<-read.csv("C:\\Users\\nitin\\Desktop\\study\\Special topics in BANA R\\Project\\train\\train.csv")
datatable(head(profile, 50), options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
    "}")
))

# number of observations and number of variables in dataset
dim(profile)

## [1] 59381   128

# displaying structure and a subset of values of dataset
glimpse(profile)

## Observations: 59,381
## Variables: 128
## $ Id                  <int> 2, 5, 6, 7, 8, 10, 11, 14, 15, 16, 17, 18,...
## $ Product_Info_1      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ Product_Info_2      <fctr> D3, A1, E1, D4, D2, D2, A8, D2, D3, E1, D...
## $ Product_Info_3      <int> 10, 26, 26, 10, 26, 26, 10, 26, 26, 21, 26...
## $ Product_Info_4      <dbl> 0.07692308, 0.07692308, 0.07692308, 0.4871...
## $ Product_Info_5      <int> 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Product_Info_6      <int> 1, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Product_Info_7      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Ins_Age             <dbl> 0.64179104, 0.05970149, 0.02985075, 0.1641...
## $ Ht                  <dbl> 0.5818182, 0.6000000, 0.7454545, 0.6727273...
## $ Wt                  <dbl> 0.1485356, 0.1317992, 0.2887029, 0.2050209...
## $ BMI                 <dbl> 0.3230080, 0.2722877, 0.4287804, 0.3524377...
## $ Employment_Info_1   <dbl> 0.0280, 0.0000, 0.0300, 0.0420, 0.0270, 0....
## $ Employment_Info_2   <int> 12, 1, 9, 9, 9, 15, 1, 12, 9, 1, 9, 3, 9, ...
## $ Employment_Info_3   <int> 1, 3, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, ...
## $ Employment_Info_4   <dbl> 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, ...
## $ Employment_Info_5   <int> 3, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, ...
## $ Employment_Info_6   <dbl> NA, 0.0018, 0.0300, 0.2000, 0.0500, 1.0000...
## $ InsuredInfo_1       <int> 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, ...
## $ InsuredInfo_2       <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ InsuredInfo_3       <int> 6, 6, 8, 8, 6, 8, 3, 6, 3, 3, 4, 3, 8, 3, ...
## $ InsuredInfo_4       <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, ...
## $ InsuredInfo_5       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, ...
## $ InsuredInfo_6       <int> 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, ...
## $ InsuredInfo_7       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Insurance_History_1 <int> 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, ...
## $ Insurance_History_2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Insurance_History_3 <int> 3, 3, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 1, ...
## $ Insurance_History_4 <int> 1, 1, 3, 3, 3, 2, 2, 1, 2, 1, 1, 3, 1, 3, ...
## $ Insurance_History_5 <dbl> 0.000666667, 0.000133333, NA, NA, NA, 0.00...
## $ Insurance_History_7 <int> 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3, 2, 3, ...
## $ Insurance_History_8 <int> 1, 3, 2, 2, 2, 3, 1, 1, 1, 3, 1, 2, 1, 2, ...
## $ Insurance_History_9 <int> 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 3, ...
## $ Family_Hist_1       <int> 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, ...
## $ Family_Hist_2       <dbl> NA, 0.1884058, 0.3043478, 0.4202899, 0.463...
## $ Family_Hist_3       <dbl> 0.5980392, NA, NA, NA, NA, 0.2941176, NA, ...
## $ Family_Hist_4       <dbl> NA, 0.08450704, 0.22535211, 0.35211268, 0....
## $ Family_Hist_5       <dbl> 0.5267857, NA, NA, NA, NA, NA, NA, 0.63392...
## $ Medical_History_1   <int> 4, 5, 10, 0, NA, 6, 5, 6, 4, NA, 1, 4, 5, ...
## $ Medical_History_2   <int> 112, 412, 3, 350, 162, 491, 600, 145, 16, ...
## $ Medical_History_3   <int> 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, ...
## $ Medical_History_4   <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, ...
## $ Medical_History_5   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_6   <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, ...
## $ Medical_History_7   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_8   <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_9   <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, ...
## $ Medical_History_10  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_11  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_12  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_13  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_14  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_15  <int> 240, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_16  <int> 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 1, 1, ...
## $ Medical_History_17  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_18  <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_19  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_20  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_21  <int> 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_22  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_23  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 1, 3, ...
## $ Medical_History_24  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_25  <int> 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_26  <int> 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_27  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_28  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, ...
## $ Medical_History_29  <int> 3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 1, 3, 3, 3, ...
## $ Medical_History_30  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_31  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_32  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_33  <int> 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 1, ...
## $ Medical_History_34  <int> 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 1, 3, 3, 3, ...
## $ Medical_History_35  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_36  <int> 2, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_37  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_38  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_39  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, ...
## $ Medical_History_40  <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_41  <int> 3, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 3, 3, 1, ...
## $ Medical_Keyword_1   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_2   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_3   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_4   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_5   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_6   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_7   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_8   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_9   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_10  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_11  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_12  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_13  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_14  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_15  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, ...
## $ Medical_Keyword_16  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_17  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_18  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_19  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_20  <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_21  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_22  <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_23  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, ...
## $ Medical_Keyword_24  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_25  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ Medical_Keyword_26  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_27  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_28  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_29  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_30  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_31  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_32  <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_33  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_34  <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_35  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_36  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_37  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_38  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_39  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_40  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_41  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_42  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_43  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_44  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Medical_Keyword_45  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_46  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_47  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Medical_Keyword_48  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Response            <int> 8, 4, 8, 8, 8, 8, 8, 1, 8, 1, 6, 2, 7, 3, ...

attach(profile)

summary statistics of response variable for training data set

#summary of predictive variable on training dataset

summary(Response)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   5.637   8.000   8.000

Checking missing values in columns

#Checking if missing values present in "Id" column

sum(is.na(Id))

## [1] 0

# using purrr to find columns with missing value and arranging columns in ascending order

missingcol <- map_df(profile, function(x) sum(is.na(x)))
missingcol <- gather(missingcol) 
colnames(missingcol) <- c("variable","missing")

missingcol <- missingcol %>%
filter( missing > 0) %>%
arrange(missing) %>% mutate( percentmissing = missing/59381*100 ) 
  
print(missingcol)

## # A tibble: 13 x 3
##               variable missing percentmissing
##                  <chr>   <int>          <dbl>
##  1   Employment_Info_1      19     0.03199677
##  2   Employment_Info_4    6779    11.41610953
##  3   Medical_History_1    8889    14.96943467
##  4   Employment_Info_6   10854    18.27857395
##  5       Family_Hist_4   19184    32.30663007
##  6 Insurance_History_5   25396    42.76788872
##  7       Family_Hist_2   28656    48.25786026
##  8       Family_Hist_3   34241    57.66322561
##  9       Family_Hist_5   41811    70.41141106
## 10  Medical_History_15   44596    75.10146343
## 11  Medical_History_24   55580    93.59896263
## 12  Medical_History_32   58274    98.13576733
## 13  Medical_History_10   58824    99.06198953

Creating a plot for missing value columns. Plot shows percentage of missing values in each column having missing value count greater than zero.

columnnames <- colnames(profile)[colSums(is.na(profile)) > 0]

## Matrix plot. Red for missing values, high values are represented by dark color.

matrixplot(profile[columnnames], interactive = F, sortby = "Employment_Info_1",main="Missing value plot" )

Data containes missing values in 13 variables from as low as count=19 to as high as count > 50000 ( which is more than 99% observations for that column). Above plot of percentage missing values of these columns depicts the same.

3.3 Data Cleaning

As we can see from the plot (“missing value plot”) under section 3.2 Original Dataset, 11 out of 13 columns having NA have very high missing values percentage (between 14%-99%). These columns will be removed from dataset.

Removing 11 out of 13 columns having missing values.

# Setting eval to False to avoide executing this chunk of code once again.

ProfileClean <- profile
ProfileClean <- ProfileClean[,-c(39,18,37,30,35,36,38,53,62,70,48)]

dim(ProfileClean)

Remaining missing value columns:

# Remaining missing value columns
missingcol[1:2,]

## # A tibble: 2 x 3
##            variable missing percentmissing
##               <chr>   <int>          <dbl>
## 1 Employment_Info_1      19     0.03199677
## 2 Employment_Info_4    6779    11.41610953

for Employment_Info_1 , we can check the summary and replace missing values (19 in total), with the mean. mean= 0.0775821

summary(profile$Employment_Info_1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.03500 0.06000 0.07758 0.10000 1.00000      19

# Setting eval to False to avoide executing this chunk of code once again.
# replacing NA with mean for Employment_Info_1
ProfileClean$Employment_Info_1[is.na(ProfileClean$Employment_Info_1)] <- 0.07758

for Employment_Info_4, since the distribution is not uniform, we will use kNN imputation to predict missing value for Employment_Info_4

summary(profile$Employment_Info_4)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.006   0.000   1.000    6779

# Setting eval to False to avoide executing this chunk of code once again.
# kNN computation for Employment_Info_4
ProfileClean <- knnImputation(ProfileClean, k = 3, scale = T, meth = "weighAvg",
              distData = NULL)

Since above algorithm takes a lot of time to run, I am not executing this code in markdown. I have downloaded my final cleaned data in an excel file and reading same file in below code directly to R Markdown for future analysis.

After cleaning steps, cleaned data set ProfileClean contains 117 variables and 59381 observations.

ProfileClean <- read.csv("C:\\Users\\nitin\\Desktop\\study\\Special topics in BANA R\\Project\\ProfileClean.csv")

dim(ProfileClean)

## [1] 59381   117

datatable(head(ProfileClean, 50), options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
    "}")
))

4. Exploratory Data Analysis

I would like to understand correlation and multicolinearity between variables.
I will use caret package to train a model and find most important variables out of remaining 117 predictor variables.
A linear regression model would be created to understand the relationship between variables. I would then perform model validation using linear regression data model validation techniques and plot residual summary.
Scatter plot and linear regression line will be included in plots. I do not know predictive modelling right now, Though its not the scope of this project ( as discussed with Brad), I might try to include it.
I will need to learn basics of machine learning and using caret package to train models and find importance of variables. I will also need to learn ggplot2 to plot outcomes, which I will do coming week.

5. Summary

Work in progress…

Data Wrangling in R-Mid-term Project

Nitin Sharma

03 December, 2017