Prediction and Stratification of Health Insurance Cost Based on Health and Demographic Features
1. Introduction
Risk assessment is a crucial factor in the insurance business when setting a pricing strategy. Accurately predicting an individual's likely future healthcare cost, and identifying the features that contribute most to increased healthcare expenditure, are important for introducing insurance packages with competitive yearly premiums that reflect the expected treatment costs of the beneficiaries.
2. Initial Questions
The goal of this project is to answer the following two questions by applying regression and classification algorithms, respectively, to a sample of health insurance beneficiary data:
1. What is the medical cost of a person based on health and demographic features?
2. What type of insurance package should be offered to a given person?
3. Process Flow
The processes involved in completing this project are:
- Problem Identification: The domain of study is selected, research questions are prepared and datasets are obtained.
- Data Understanding(EDA): The data is explored to gain understanding.
- Data Pre-processing: The data is prepared for the analysis.
- Modeling (Supervised Learning): The prediction models are trained based on the research question.
- Performance Evaluation: The performance of each prediction model is measured.
- Conclusion: The findings are summarised.
Data Science Flow
4. Data Understanding
4.1. Data Source
A Kaggle data set named Medical Cost Personal Datasets (Insurance Forecast by using Linear Regression), published in 2018, is used for this analysis. The data set provides the annual medical costs and demographic features of 1,338 beneficiaries enrolled in an insurance plan in the USA.
https://www.kaggle.com/mirichoi0218/insurance
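The analysis below works with a data frame named df. A minimal loading sketch, assuming the downloaded Kaggle file is named insurance.csv and sits in the working directory:
# Load the data set (file name assumed; stringsAsFactors keeps sex/smoker/region as factors)
df <- read.csv("insurance.csv", stringsAsFactors = TRUE)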
4.2. Exploratory Data Analysis
# Import Libraries
library(ggplot2) # visualization
library(cowplot) # visualization
library(psych)   # visualization
library(caret)   # model training and evaluation
library(Metrics) # rmse()
library(pls)     # partial least squares regression
library(e1071)   # svm()
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Loading required package: lattice
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
4.2.1. Variables
| age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|
| 19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
| 18 | male | 33.770 | 1 | no | southeast | 1725.552 |
| 28 | male | 33.000 | 3 | no | southeast | 4449.462 |
| 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
| 32 | male | 28.880 | 0 | no | northwest | 3866.855 |
| 31 | female | 25.740 | 0 | no | southeast | 3756.622 |
The response/dependent variable of the analysis is charges, which is the individual medical costs billed by health insurance in a particular year.
The predictors/independent variables of the analysis are as follows:
* age: age of the primary beneficiary
* sex: gender of the insurance contractor (female or male)
* bmi: body mass index, an objective measure of body weight relative to height (kg/m²); the ideal range is 18.5 to 24.9
* children: number of children/dependents covered by the health insurance
* smoker: smoking status (yes or no)
* region: the beneficiary's residential area in the US (northeast, southeast, southwest or northwest)
4.2.2. Summary Statistics
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
This data set contains four numerical variables (age, bmi, children and charges) and three categorical variables (sex, smoker and region).
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
- Ages of the respondents range from 18 to 64.
- The respondents' sex and region are evenly distributed.
- The number of children/dependents of the principal ranges from 0 to 5.
- Non-smokers outnumber smokers by approximately 4 to 1.
- The average medical cost is $13,270, the median is $9,382, and the maximum claim value is $63,770.
4.2.3 Visualization
# Distribution of charges
ggplot(df,aes(x=charges))+
geom_histogram(binwidth=2000)+
geom_vline(aes(xintercept=mean(charges)),color="blue", linetype="dashed",size=1)+
geom_vline(aes(xintercept=median(charges)),color="red", linetype="dashed",size=1)+
labs(title = "Distribution of Charges")+labs(x = "Charges / USD", y = "Count")+
theme(plot.title=element_text(hjust=0.5))
This graph shows a right-skewed distribution of medical expenses; 75% of the charges lie between zero and $16,640.
# Correlation between Age,Smoking and Charges
g1<-ggplot(df,aes(x=age, y = charges,color = smoker))+
geom_point()+
geom_smooth(method="lm")+
labs(title = "Effect of Age on Charges")+labs(x = "Age", y = "Charges / USD")+
theme(plot.title=element_text(hjust=0.5),legend.position = "none")
# Correlation between BMI,Smoking and Charges
g2 <- ggplot(df,aes(x=bmi, y = charges,color = smoker))+
geom_point()+
geom_smooth(method="lm")+
labs(title = "Effect of BMI on Charges")+labs(x = "BMI")+
theme(plot.title=element_text(hjust=0.5),axis.title.y = element_blank())
plot_grid(g1,g2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
These graphs clearly indicate that medical expenses are strongly associated with age and BMI. Older and overweight smokers incur higher expenses than others, possibly because smoking and obesity increase the risk of heart disease, cancer and diabetes.
# Correlation between No. of Children and Charges
p1 <- ggplot(df,aes(x=children, y = charges))+
geom_jitter(color = "blue", alpha=0.5)+
labs(title = "Effect of children on Charges")+labs(x = "Number of Children/Dependents", y = "Charges / USD")+
theme(plot.title=element_text(hjust=0.5))
# Correlation between Region and Charges
p2 <- ggplot(df,aes(x=region, y = charges))+
geom_jitter(color = "blue", alpha=0.5)+
labs(title = "Effect of region on Charges")+labs(x = "Region", y = "Charges / USD")+
theme(plot.title=element_text(hjust=0.5))
plot_grid(p1,p2)
No obvious relationship is evident between charges and either the number of children/dependents covered or the beneficiary's residential region.
This graph shows the pairwise comparison of numerical features.
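A minimal sketch of how this pairwise comparison can be generated with the psych package loaded earlier (assuming pairs.panels was the function used):
# Pairwise scatter plots, histograms and correlations for the numerical features
pairs.panels(df[c("age", "bmi", "children", "charges")], method = "pearson")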
5. Data Preprocessing
5.1. Missing Values
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
This data set does not have missing values.
5.2. Encoding Categorical Data
#Encode sex,smoker and region
df$sex = factor(df$sex,
levels = c('male','female'),
labels = c(1, 2))
df$smoker = factor(df$smoker,
levels = c('no', 'yes'),
labels = c(0, 1))
df$region = factor(df$region,
levels = c('northeast','northwest', 'southeast','southwest'),
labels = c(1, 2,3,4))
head(df)
##   age sex    bmi children smoker region   charges
## 1 19 2 27.900 0 1 4 16884.924
## 2 18 1 33.770 1 0 3 1725.552
## 3 28 1 33.000 3 0 3 4449.462
## 4 33 1 22.705 0 0 2 21984.471
## 5 32 1 28.880 0 0 2 3866.855
## 6 31 2 25.740 0 0 3 3756.622
6. Regression
This section addresses the first question: what is the medical cost of a person based on health and demographic features? Various regression algorithms, listed below, are applied to the sample data with charges as the dependent variable.
Method
6.1. Split the Data into Training Set and Test Set
# Split the data set into training set and test set
split = 0.8 # 80% for training
set.seed(100)
trainIndex <-createDataPartition(df$charges,p=split,list=FALSE)
data_train <- df[trainIndex,]
data_test <- df[-trainIndex,]
cat(" Size of data_train:",dim(data_train),"\n","Size of data_test :",dim(data_test))## Size of data_train: 1072 7
## Size of data_test : 266 7
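The train() calls below use a control object for resampling; a minimal sketch of its definition, assuming 10-fold cross-validation (the actual settings used are an assumption):
# Resampling scheme for caret::train (assumed 10-fold cross-validation)
control <- trainControl(method = "cv", number = 10)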
6.3. Multiple Linear Regression
# Fit multiple linear regression to the training set
lmModel = train(charges~.,data = data_train,trControl = control,method="lm" )
# Summary of training model
summary(lmModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11542 -2819 -1076 1329 29812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11885.56 1141.29 -10.414 < 2e-16 ***
## age 258.52 13.47 19.188 < 2e-16 ***
## sex2 384.42 377.57 1.018 0.308847
## bmi 324.23 32.64 9.933 < 2e-16 ***
## children 562.29 156.60 3.591 0.000345 ***
## smoker1 23972.70 475.28 50.440 < 2e-16 ***
## region2 -309.72 537.00 -0.577 0.564219
## region3 -783.31 548.53 -1.428 0.153581
## region4 -947.82 543.57 -1.744 0.081501 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6152 on 1063 degrees of freedom
## Multiple R-squared: 0.743, Adjusted R-squared: 0.7411
## F-statistic: 384.2 on 8 and 1063 DF, p-value: < 2.2e-16
# Predict the test set results
y_pred = predict(lmModel,newdata = data_test )
# Calculate RMSE
rmse = rmse(data_test$charges,y_pred)
cat("Root Mean Squared Error_MLR :", rmse)## Root Mean Squared Error_MLR : 5712.968
The results show highly statistically significant relationships between charges and age, bmi, children and smoking status, while sex and region do not have a significant impact on charges.
- The overall model p-value is less than 0.05.
- The R-squared value for this model is 0.743.
- The Root Mean Squared Error of the model is 5713.
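For reference, the rmse() call from the Metrics package computes the root mean squared error, which can be reproduced directly:
# Manual RMSE, equivalent to Metrics::rmse(actual, predicted)
sqrt(mean((data_test$charges - y_pred)^2))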
Optimize the Model with Backward Elimination
# Eliminate variable with highest p-value (sex)
lmModel = train(charges ~ age + bmi + children + smoker + region,
data = data_train, trControl = control,method="lm" )
# Summary of training model
summary(lmModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11612 -2785 -1062 1256 29650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11674.52 1122.33 -10.402 < 2e-16 ***
## age 258.64 13.47 19.197 < 2e-16 ***
## bmi 323.53 32.64 9.913 < 2e-16 ***
## children 558.33 156.56 3.566 0.000378 ***
## smoker1 23933.44 473.72 50.523 < 2e-16 ***
## region2 -311.27 537.00 -0.580 0.562280
## region3 -777.50 548.51 -1.417 0.156633
## region4 -937.54 543.49 -1.725 0.084809 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6152 on 1064 degrees of freedom
## Multiple R-squared: 0.7428, Adjusted R-squared: 0.7411
## F-statistic: 438.9 on 7 and 1064 DF, p-value: < 2.2e-16
# Predict the test set results
y_pred = predict(lmModel,newdata = data_test )
# Calculate RMSE
rmse = rmse(data_test$charges,y_pred)
cat("Root Mean Squared Error_MLR :", rmse)## Root Mean Squared Error_MLR : 5693.958
The model has improved slightly: the Root Mean Squared Error drops to 5694.
6.4. Support Vector Regression
# Fit SVR to the training set
regressor_SVM = svm(formula = charges~.,data = data_train,type = 'eps-regression')
summary(regressor_SVM)
##
## Call:
## svm(formula = charges ~ ., data = data_train, type = "eps-regression")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1111111
## epsilon: 0.1
##
##
## Number of Support Vectors: 419
# Predict the test set results
y_pred_SVM = predict(regressor_SVM,newdata = data_test )
# Calculate RMSE
rmse_SVM = rmse(data_test$charges,y_pred_SVM)
cat("Root Mean Squared Error_SVR :", rmse_SVM)## Root Mean Squared Error_SVR : 4341.062
6.5. Decision Tree
# Fit the decision tree model to the training set
DTModel =train(charges~.,data = data_train,trControl = control,method="rpart" )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
#summary(DTModel)
# Predict the test set results
y_pred_DT = predict(DTModel,newdata = data_test )
# Calculate RMSE
rmse_DT= rmse(data_test$charges,y_pred_DT)
cat("Root Mean Squared Error_DT:", rmse_DT)## Root Mean Squared Error_DT: 5803.355
6.6. Random Forest
# Fit random forest model to the training set
RFModel =train(charges~.,data = data_train,trControl = control,method="rf" )
summary(RFModel)
##                 Length Class      Mode
## call 4 -none- call
## type 1 -none- character
## predicted 1072 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 1072 -none- numeric
## importance 8 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 1072 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 8 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
# Predict the test set results
y_pred_RF = predict(RFModel,newdata = data_test )
# Calculate RMSE
rmse_RF= rmse(data_test$charges,y_pred_RF)
cat("Root Mean Squared Error_RF:", rmse_RF)## Root Mean Squared Error_RF: 4280.677
6.7. k-Nearest Neighbors
# Fit kNN model to the training set
kNNModel =train(charges~.,data = data_train,trControl = control,method="knn" )
summary(kNNModel)
##             Length Class      Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 8 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
# Predict the test set results
y_pred_kNN = predict(kNNModel,newdata = data_test )
# Calculate RMSE
rmse_kNN= rmse(data_test$charges,y_pred_kNN)
cat("Root Mean Squared Error_kNN:", rmse_kNN)## Root Mean Squared Error_kNN: 11518.92
6.8. Partial Least Squares Method
# library (pls)
# Fit multiple linear regression to the training set
plsModel =train(charges~.,
data = data_train,"kernelpls" )
summary(plsModel)
## Data: 	X dimension: 1072 8
## Y dimension: 1072 1
## Fit method: kernelpls
## Number of components considered: 3
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 82.849 98.97 99.16
## .outcome 9.992 12.66 62.87
# Predict the test set results
y_pred = predict(plsModel,newdata = data_test )
# Calculate RMSE
rmse_pls= rmse(data_test$charges,y_pred)
cat("Root Mean Squared Error_pls :", rmse_pls)## Root Mean Squared Error_pls : 7175.29
6.9. Evaluation of Regression Algorithm Models
The following table shows the results of the different regression models:
| Model | RMSE |
|---|---|
| Multiple Linear Regression | 5712 |
| Multiple Linear Regression (Backward Elimination) | 5694 |
| Support Vector Regression | 4341 |
| Decision Tree | 5803 |
| Random Forest | 4280 |
| k-Nearest Neighbors | 11519 |
| Partial Least Squares | 7175 |
- The Random Forest algorithm shows the best performance.
- Support Vector Regression provides the second-best results.
- k-Nearest Neighbors is the worst-performing algorithm.
Ultimately, it can be concluded that Random Forest is the most suitable regression method for predicting medical charges.
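As an illustration of how the chosen model could be used, the charges for a hypothetical new beneficiary can be predicted (the profile below is invented purely for illustration; factor codes follow the encoding in Section 5.2):
# Hypothetical beneficiary: 45-year-old female non-smoker, BMI 31.5, 2 children, southeast
new_person <- data.frame(age = 45,
                         sex = factor("2", levels = levels(df$sex)),
                         bmi = 31.5,
                         children = 2,
                         smoker = factor("0", levels = levels(df$smoker)),
                         region = factor("3", levels = levels(df$region)))
predict(RFModel, newdata = new_person)  # predicted annual charges in USD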
7. Classification
This section addresses the second question: what type of insurance package should be offered to a given person? First, the customers are clustered into three groups (high, medium and low risk). Then, classification algorithms are applied to the sample data with risk as the dependent variable.
7.1. Clustering
To prepare the data for the classification task, DBSCAN is used to cluster the beneficiaries into high-, medium- and low-risk groups. Only two features, age and charges, are used, because of the clear separation of the data into three groups shown in the EDA section.
DBSCAN is used instead of k-means because of the shape of the data: the almost constant spacing between data points within each band of the age-charges plot favours a density-based method.
An epsilon value of 0.25 is used, as it groups most of the data points without mixing the three groups.
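A common way to sanity-check the epsilon value is the k-nearest-neighbour distance plot from the dbscan package; a sketch of that check on the scaled features (the choice of k = 4 is an assumption):
# Sorted distance to the 4th nearest neighbour; the "knee" of the curve suggests eps
kNNdistplot(scale(df[c("age", "charges")]), k = 4)
abline(h = 0.25, lty = 2)  # the eps value used below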
# Perform DBScan to cluster the high risk, medium risk, and low risk.
library(dbscan)
# Selecting age and charges
df_clustering = df[c(1,7)]
df_clustering[c(1,2)] = scale(df_clustering[c(1,2)])
set.seed(1234)
# Perform clustering
r <- dbscan(df_clustering, eps=.25)
df_clustering['cluster'] <- r['cluster']
# Plot the clustering result
plot(ggplot(df_clustering,aes(x=age, y = charges))+geom_point(aes(colour = as.factor(cluster))))
As shown above, four groups are detected, where group 0 contains the data points that do not belong to any of the other three groups. These points are assigned manually to the other three groups based on observation, so that no potentially important information is lost given the small size of the data set.
# Adding the values from cluster 0 to 1,2,3
df_clustering$cluster[df_clustering$cluster==0 & df_clustering$charges>2] <- 3
df_clustering$cluster[df_clustering$cluster==0] <- 1
# Swapping values and setting 1 as low risk and 2 as medium risk
df_clustering$cluster[df_clustering$cluster==2] <- 0
df_clustering$cluster[df_clustering$cluster==1] <- 2
df_clustering$cluster[df_clustering$cluster==0] <- 1
df_clustering$cluster <- as.factor(df_clustering$cluster)
# Plot the clustering result after adjustment
plot(ggplot(df_clustering,aes(x=age, y = charges))+geom_point(aes(colour = cluster)))
7.2. Classification Model Training
For this task, the caret and rpart libraries are used.
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
7.3. k-Nearest Neighbors - Split Train-Test method
The charges column is excluded because it was used to derive the risk labels; risk is the dependent variable for this task.
df_classification = df[c(1:6)]
df_classification['risk'] = df_clustering$cluster
split=0.80 # define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(df_classification$risk, p=split, list=FALSE)
data_train <- df_classification[ trainIndex,]
data_test <- df_classification[-trainIndex,]
nrow(data_train)
## [1] 1071
nrow(data_test)
## [1] 267
knn_fit <- caret::train(risk ~., data = data_train, method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10)
#knn classifier
knn_fit
## k-Nearest Neighbors
##
## 1071 samples
## 6 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1071, 1071, 1071, 1071, 1071, 1071, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8539056 0.6585699
## 7 0.8685593 0.6869394
## 9 0.8734104 0.6957009
## 11 0.8776158 0.7037218
## 13 0.8822920 0.7132926
## 15 0.8816673 0.7115153
## 17 0.8829436 0.7142138
## 19 0.8838793 0.7159953
## 21 0.8830563 0.7133764
## 23 0.8834736 0.7140524
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 19.
#predict classes for test set using knn classifier
test_pred_knn <- predict(knn_fit, newdata = data_test)
confusionMatrix(test_pred_knn, as.factor(data_test$risk) )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 193 22 0
## 2 0 18 3
## 3 0 5 26
##
## Overall Statistics
##
## Accuracy : 0.8876
## 95% CI : (0.8435, 0.9229)
## No Information Rate : 0.7228
## P-Value [Acc > NIR] : 4.803e-11
##
## Kappa : 0.7134
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 0.40000 0.89655
## Specificity 0.7027 0.98649 0.97899
## Pos Pred Value 0.8977 0.85714 0.83871
## Neg Pred Value 1.0000 0.89024 0.98729
## Prevalence 0.7228 0.16854 0.10861
## Detection Rate 0.7228 0.06742 0.09738
## Detection Prevalence 0.8052 0.07865 0.11610
## Balanced Accuracy 0.8514 0.69324 0.93777
7.4. k-Nearest Neighbors - Cross Validation Method
# using cross validation
knn_fit_cv <- caret::train(risk ~., data = df_classification, method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv", number=5))
print(knn_fit_cv)
## k-Nearest Neighbors
##
## 1338 samples
## 6 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1070, 1070, 1071, 1070, 1071
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8849041 0.7124215
## 7 0.8856448 0.7125657
## 9 0.8856448 0.7124750
## 11 0.8863967 0.7145688
## 13 0.8863883 0.7145298
## 15 0.8886271 0.7198966
## 17 0.8878836 0.7181360
## 19 0.8871373 0.7164253
## 21 0.8878836 0.7181360
## 23 0.8908715 0.7255346
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
#Confusion matrix for k-Nearest Neighbors
test_pred_knn_cv <- predict(knn_fit_cv, newdata = df_classification)
cm_knn <- confusionMatrix(test_pred_knn_cv, as.factor(df_classification$risk))
cm_knn$table
##           Reference
## Prediction 1 2 3
## 1 966 98 0
## 2 0 105 14
## 3 0 22 133
7.5. Naive Bayes - Cross Validation Method
# using cross validation and naive bayes
nb_fit_cv <- caret::train(risk ~., data = df_classification, method = "nb",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv", number=5))
print(nb_fit_cv)
## Naive Bayes
##
## 1338 samples
## 6 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1070, 1069, 1071, 1071, 1071
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE NaN NaN
## TRUE 0.8385591 0.5390927
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
#Confusion matrix for naive bayes
test_pred_nb_cv <- predict(nb_fit_cv, newdata = df_classification)
cm_nb <- confusionMatrix(test_pred_nb_cv, as.factor(df_classification$risk))
cm_nb$table
##           Reference
## Prediction 1 2 3
## 1 966 116 78
## 2 0 109 2
## 3 0 0 67
7.6. Decision Tree - Cross Validation Method
# using cross validation and decision tree
rpart_fit_cv <- caret::train(risk ~., data = df_classification, method = "rpart",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv", number=5))
print(rpart_fit_cv)
## CART
##
## 1338 samples
## 6 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1071, 1070, 1071, 1069, 1071
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.9065484 0.7690119
## 0.04390681 0.9222508 0.8029989
## 0.08781362 0.9222508 0.8029989
## 0.13172043 0.9222508 0.8029989
## 0.17562724 0.9222508 0.8029989
## 0.21953405 0.9222508 0.8029989
## 0.26344086 0.9222508 0.8029989
## 0.30734767 0.9222508 0.8029989
## 0.35125448 0.8595021 0.6524118
## 0.39516129 0.7871442 0.3488221
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.3073477.
#Confusion matrix for decision tree
test_pred_rpart_cv <- predict(rpart_fit_cv, newdata = df_classification)
cm_rpart <- confusionMatrix(test_pred_rpart_cv, as.factor(df_classification$risk))
cm_rpart$table
##           Reference
## Prediction 1 2 3
## 1 966 98 0
## 2 0 127 3
## 3 0 0 144
7.7. Evaluation of Classification Algorithm Models
The following table shows the results of the different classification models:
| Model | Accuracy |
|---|---|
| k-Nearest Neighbors (Train-Test Split) | 0.888 |
| k-Nearest Neighbors (Cross Validation) | 0.891 |
| Naive Bayes (Cross Validation) | 0.839 |
| Decision Tree (Cross Validation) | 0.922 |
Based on accuracy, the Decision Tree algorithm performs best in classifying the customers.
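To illustrate how the selected classifier would be used in practice, the package for a hypothetical applicant can be predicted (profile values are invented; factor codes follow Section 5.2):
# Hypothetical applicant: 52-year-old male non-smoker, BMI 29.1, 1 child, northwest
new_customer <- data.frame(age = 52,
                           sex = factor("1", levels = levels(df$sex)),
                           bmi = 29.1,
                           children = 1,
                           smoker = factor("0", levels = levels(df$smoker)),
                           region = factor("2", levels = levels(df$region)))
predict(rpart_fit_cv, newdata = new_customer)  # predicted risk class: 1 = low, 2 = medium, 3 = high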
8. Conclusion
For the first question, what is the medical cost of a person based on health and demographic features?, regression is performed. Six methods are compared: Multiple Linear Regression, Support Vector Regression, Decision Tree, Random Forest, k-Nearest Neighbors and Partial Least Squares. The Root Mean Squared Error (RMSE) is used to evaluate the models, and the Random Forest algorithm is found to be the most suitable regression method.
For the second question, what type of insurance package should be offered to a given person?, classification is performed and four methods are tested. A comparison is first made between train-test split validation and cross-validation using the k-Nearest Neighbors method; the two approaches produce very similar accuracy (0.888 vs. 0.891), with cross-validation marginally ahead.
Methods 2 to 4 compare the performance of k-Nearest Neighbors, Naive Bayes and Decision Tree, all evaluated with cross-validation. The Decision Tree model performs best. However, across all four methods most errors occur when a medium-risk person is classified as low risk instead, which is unfavourable for the insurance provider because those customers would be undercharged.
A better understanding of the algorithms is required to produce a good model, and the small size of the data set also limits the accuracy of the predictions.
Group Members:
- Ushani Kumari (s2003535)
- Mohd Anas Ahmad (s2001089)