Prudential life insurance uses data to assess risk in providing insurance to individual customers. While traditional methods include collecting history, medical records amongh many other data points, it usually takes aound 30 days. My goal here is to automate the process so that risk assessment would be quick and accurate.The data is provided by Prudential Financial for kaggle competition for this assessment challenege.
My approach will include identifying key variables contributing to final decision (determining level of risk , from 1 to 8) and how much they influence it. My approach will be to understand relationships between “Response” and features, finding a pattern among data, estimating a model and validating model.
The results of this analysis will help Insurance companies make well informed decisions.
A list of packages used in this project is documented below. It may be modified, if needed.
tidyverse = For data import and data manipulation. Package contains many other important packages. Most of the packages needed separately like magrittr, ggplot2,dplyr, purrr are included in this.
VIM = For Visualization and Imputation of Missing Values
DMwR = For kNN imputation of missing values
Few more packages related to model will be included.
library(tidyverse)
library(randomForest)
library(VIM)
library(DT)
library(caret)
library(DMwR)
Dataset was provided by Prudential Financial as a competition on Kaggle. Original dataset can be downloaded from here. Data is further cleaned in R and prepared for analysis.
Data dictionary of 128 variables in original dataset is provided below.
Data Dictionary (Click to expand)
Explaination of each variable as provided by kaggle
Variable Description
Id A unique identifier associated with an application.
Product_Info_1-7 A set of normalized variables relating to the product applied for Ins_Age Normalized age of applicant Ht Normalized height of applicant Wt Normalized weight of applicant BMI Normalized BMI of applicant Employment_Info_1-6 A set of normalized variables relating to the employment history of the applicant. InsuredInfo_1-6 A set of normalized variables providing information about the applicant. Insurance_History_1-9 A set of normalized variables relating to the insurance history of the applicant. Family_Hist_1-5 A set of normalized variables relating to the family history of the applicant. Medical_History_1-41 A set of normalized variables relating to the medical history of the applicant. Medical_Keyword_1-48 A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application. Response This is the target variable, an ordinal variable relating to the final decision associated with an application
The following variables are all categorical (nominal):
Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41
The following variables are continuous:
Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5
The following variables are discrete:
Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
Medical_Keyword_1-48 are dummy variables.
Original data set contained 128 variables and 59381 observations in training dataset. Dataset was provided by Prudential Financial to assess the risk of providing policy to life insurance applicants. Data set was uploaded on kaggle as a compitition sponsored by Prudential Financial, around 2 years ago (11/23/2015). Many missing values are present in the original dataset in multiple columns.
Missing values in original dataset were recorded as blanks, while importing in R, blanks were converted in NAs.
profile<-read.csv("C:\\Users\\nitin\\Desktop\\study\\Special topics in BANA R\\Project\\train\\train.csv")
datatable(head(profile, 50), options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
"}")
))
# number of observations and number of variables in dataset
dim(profile)
## [1] 59381 128
# displaying structure and a subset of values of dataset
glimpse(profile)
## Observations: 59,381
## Variables: 128
## $ Id <int> 2, 5, 6, 7, 8, 10, 11, 14, 15, 16, 17, 18,...
## $ Product_Info_1 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
## $ Product_Info_2 <fctr> D3, A1, E1, D4, D2, D2, A8, D2, D3, E1, D...
## $ Product_Info_3 <int> 10, 26, 26, 10, 26, 26, 10, 26, 26, 21, 26...
## $ Product_Info_4 <dbl> 0.07692308, 0.07692308, 0.07692308, 0.4871...
## $ Product_Info_5 <int> 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Product_Info_6 <int> 1, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Product_Info_7 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Ins_Age <dbl> 0.64179104, 0.05970149, 0.02985075, 0.1641...
## $ Ht <dbl> 0.5818182, 0.6000000, 0.7454545, 0.6727273...
## $ Wt <dbl> 0.1485356, 0.1317992, 0.2887029, 0.2050209...
## $ BMI <dbl> 0.3230080, 0.2722877, 0.4287804, 0.3524377...
## $ Employment_Info_1 <dbl> 0.0280, 0.0000, 0.0300, 0.0420, 0.0270, 0....
## $ Employment_Info_2 <int> 12, 1, 9, 9, 9, 15, 1, 12, 9, 1, 9, 3, 9, ...
## $ Employment_Info_3 <int> 1, 3, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, ...
## $ Employment_Info_4 <dbl> 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, ...
## $ Employment_Info_5 <int> 3, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, ...
## $ Employment_Info_6 <dbl> NA, 0.0018, 0.0300, 0.2000, 0.0500, 1.0000...
## $ InsuredInfo_1 <int> 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, ...
## $ InsuredInfo_2 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ InsuredInfo_3 <int> 6, 6, 8, 8, 6, 8, 3, 6, 3, 3, 4, 3, 8, 3, ...
## $ InsuredInfo_4 <int> 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, ...
## $ InsuredInfo_5 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, ...
## $ InsuredInfo_6 <int> 2, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, ...
## $ InsuredInfo_7 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Insurance_History_1 <int> 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, ...
## $ Insurance_History_2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Insurance_History_3 <int> 3, 3, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 1, ...
## $ Insurance_History_4 <int> 1, 1, 3, 3, 3, 2, 2, 1, 2, 1, 1, 3, 1, 3, ...
## $ Insurance_History_5 <dbl> 0.000666667, 0.000133333, NA, NA, NA, 0.00...
## $ Insurance_History_7 <int> 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3, 2, 3, ...
## $ Insurance_History_8 <int> 1, 3, 2, 2, 2, 3, 1, 1, 1, 3, 1, 2, 1, 2, ...
## $ Insurance_History_9 <int> 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 3, ...
## $ Family_Hist_1 <int> 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, ...
## $ Family_Hist_2 <dbl> NA, 0.1884058, 0.3043478, 0.4202899, 0.463...
## $ Family_Hist_3 <dbl> 0.5980392, NA, NA, NA, NA, 0.2941176, NA, ...
## $ Family_Hist_4 <dbl> NA, 0.08450704, 0.22535211, 0.35211268, 0....
## $ Family_Hist_5 <dbl> 0.5267857, NA, NA, NA, NA, NA, NA, 0.63392...
## $ Medical_History_1 <int> 4, 5, 10, 0, NA, 6, 5, 6, 4, NA, 1, 4, 5, ...
## $ Medical_History_2 <int> 112, 412, 3, 350, 162, 491, 600, 145, 16, ...
## $ Medical_History_3 <int> 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 3, 2, ...
## $ Medical_History_4 <int> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, ...
## $ Medical_History_5 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_6 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, ...
## $ Medical_History_7 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_8 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_9 <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, ...
## $ Medical_History_10 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_11 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_12 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_13 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_14 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_15 <int> 240, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_16 <int> 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 1, 1, ...
## $ Medical_History_17 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_18 <int> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_19 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_20 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_21 <int> 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_22 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_23 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 1, 3, ...
## $ Medical_History_24 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_25 <int> 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_26 <int> 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_27 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_28 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, ...
## $ Medical_History_29 <int> 3, 3, 3, 3, 3, 3, 1, 3, 1, 3, 1, 3, 3, 3, ...
## $ Medical_History_30 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_31 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_32 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ Medical_History_33 <int> 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, 1, ...
## $ Medical_History_34 <int> 3, 1, 3, 3, 3, 1, 3, 3, 3, 3, 1, 3, 3, 3, ...
## $ Medical_History_35 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_36 <int> 2, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_37 <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ Medical_History_38 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Medical_History_39 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3, ...
## $ Medical_History_40 <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ Medical_History_41 <int> 3, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 3, 3, 1, ...
## $ Medical_Keyword_1 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_8 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_9 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_10 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_11 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_12 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_13 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_14 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_15 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, ...
## $ Medical_Keyword_16 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_17 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_18 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_19 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_21 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_22 <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_23 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, ...
## $ Medical_Keyword_24 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_25 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...
## $ Medical_Keyword_26 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_27 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_28 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_29 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_30 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_31 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_32 <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_33 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_34 <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_35 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_36 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_37 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_38 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_39 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_40 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_41 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_42 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_43 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_44 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Medical_Keyword_45 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_46 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Medical_Keyword_47 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Medical_Keyword_48 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ Response <int> 8, 4, 8, 8, 8, 8, 8, 1, 8, 1, 6, 2, 7, 3, ...
attach(profile)
#summary of predictive variable on training dataset
summary(Response)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 5.637 8.000 8.000
#Checking if missing values present in "Id" column
sum(is.na(Id))
## [1] 0
# using purrr to find columns with missing value and arranging columns in ascending order
missingcol <- map_df(profile, function(x) sum(is.na(x)))
missingcol <- gather(missingcol)
colnames(missingcol) <- c("variable","missing")
missingcol <- missingcol %>%
filter( missing > 0) %>%
arrange(missing) %>% mutate( percentmissing = missing/59381*100 )
print(missingcol)
## # A tibble: 13 x 3
## variable missing percentmissing
## <chr> <int> <dbl>
## 1 Employment_Info_1 19 0.03199677
## 2 Employment_Info_4 6779 11.41610953
## 3 Medical_History_1 8889 14.96943467
## 4 Employment_Info_6 10854 18.27857395
## 5 Family_Hist_4 19184 32.30663007
## 6 Insurance_History_5 25396 42.76788872
## 7 Family_Hist_2 28656 48.25786026
## 8 Family_Hist_3 34241 57.66322561
## 9 Family_Hist_5 41811 70.41141106
## 10 Medical_History_15 44596 75.10146343
## 11 Medical_History_24 55580 93.59896263
## 12 Medical_History_32 58274 98.13576733
## 13 Medical_History_10 58824 99.06198953
Creating a plot for missing value columns. Plot shows percentage of missing values in each column having missing value count greater than zero.
columnnames <- colnames(profile)[colSums(is.na(profile)) > 0]
## Matrix plot. Red for missing values, high values are represented by dark color.
matrixplot(profile[columnnames], interactive = F, sortby = "Employment_Info_1",main="Missing value plot" )
As we can see from the plot (“missing value plot”) under section 3.2 Original Dataset, 11 out of 13 columns having NA have very high missing values percentage (between 14%-99%). These columns will be removed from dataset.
Removing 11 out of 13 columns having missing values.
# Setting eval to False to avoide executing this chunk of code once again.
ProfileClean <- profile
ProfileClean <- ProfileClean[,-c(39,18,37,30,35,36,38,53,62,70,48)]
dim(ProfileClean)
Remaining missing value columns:
# Remaining missing value columns
missingcol[1:2,]
## # A tibble: 2 x 3
## variable missing percentmissing
## <chr> <int> <dbl>
## 1 Employment_Info_1 19 0.03199677
## 2 Employment_Info_4 6779 11.41610953
for Employment_Info_1 , we can check the summary and replace missing values (19 in total), with the mean. mean= 0.0775821
summary(profile$Employment_Info_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.03500 0.06000 0.07758 0.10000 1.00000 19
# Setting eval to False to avoide executing this chunk of code once again.
# replacing NA with mean for Employment_Info_1
ProfileClean$Employment_Info_1[is.na(ProfileClean$Employment_Info_1)] <- 0.07758
for Employment_Info_4, since the distribution is not uniform, we will use kNN imputation to predict missing value for Employment_Info_4
summary(profile$Employment_Info_4)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.006 0.000 1.000 6779
# Setting eval to False to avoide executing this chunk of code once again.
# kNN computation for Employment_Info_4
ProfileClean <- knnImputation(ProfileClean, k = 3, scale = T, meth = "weighAvg",
distData = NULL)
Since above algorithm takes a lot of time to run, I am not executing this code in markdown. I have downloaded my final cleaned data in an excel file and reading same file in below code directly to R Markdown for future analysis.
After cleaning steps, cleaned data set ProfileClean contains 117 variables and 59381 observations.
ProfileClean <- read.csv("C:\\Users\\nitin\\Desktop\\study\\Special topics in BANA R\\Project\\ProfileClean.csv")
dim(ProfileClean)
## [1] 59381 117
datatable(head(ProfileClean, 50), options = list(
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
"}")
))
caret package to train a model and find most important variables out of remaining 117 predictor variables.Scatter plot and linear regression line will be included in plots. I do not know predictive modelling right now, Though its not the scope of this project ( as discussed with Brad), I might try to include it.
I will need to learn basics of machine learning and using caret package to train models and find importance of variables. I will also need to learn ggplot2 to plot outcomes, which I will do coming week.
Work in progress…