Prediction and Stratification of Health Insurance Cost Based on Health and Demographic Features
1. Introduction
Risk assessment is a crucial factor in the insurance business when setting a pricing strategy. Accurately predicting an individual's likely future healthcare cost, and identifying the features that contribute most to increased healthcare expenditure, are important for introducing insurance packages with competitive yearly premiums that reflect the expected treatment costs of the beneficiaries.
2. Initial Questions
The goal of this project is to answer the following two questions by applying regression and classification algorithms, respectively, to a sample of health insurance beneficiary data:
1. What is the medical cost of a person based on health and demographic features?
2. What type of insurance package should be offered to a given person?
3. Process Flow
The processes involved in completing this project are:
- Problem Identification: The domain of study is selected, research questions are prepared and datasets are obtained.
- Data Understanding(EDA): The data is explored to gain understanding.
- Data Pre-processing: The data is prepared for the analysis.
- Modeling (Supervised Learning): The prediction models are trained based on the research question.
- Performance Evaluation: The performance of each prediction model is measured.
- Conclusion: The findings are summarised.
Data Science Flow
4. Data Understanding
4.1. Data Source
A Kaggle data set named Medical Cost Personal Datasets (Insurance Forecast by using Linear Regression), published in 2018, is used for this analysis. The data set provides the annual medical costs and demographic features of 1,338 beneficiaries enrolled in an insurance plan in the USA.
https://www.kaggle.com/mirichoi0218/insurance
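The analysis below works with a data frame named df. A minimal loading sketch, assuming the downloaded Kaggle file is named insurance.csv and sits in the working directory:
# Load the data set (file name assumed; stringsAsFactors keeps sex/smoker/region as factors)
df <- read.csv("insurance.csv", stringsAsFactors = TRUE)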
4.2. Exploratory Data Analysis
# Import Libraries
library(ggplot2) # visualization
library(cowplot) # visualization
library(psych)   # visualization
library(caret)   # model training and evaluation
library(Metrics) # rmse()
library(pls)     # partial least squares regression
library(e1071)   # svm()
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Loading required package: lattice
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
4.2.1. Variables
| age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|
| 19 | female | 27.900 | 0 | yes | southwest | 16884.924 |
| 18 | male | 33.770 | 1 | no | southeast | 1725.552 |
| 28 | male | 33.000 | 3 | no | southeast | 4449.462 |
| 33 | male | 22.705 | 0 | no | northwest | 21984.471 |
| 32 | male | 28.880 | 0 | no | northwest | 3866.855 |
| 31 | female | 25.740 | 0 | no | southeast | 3756.622 |
The response/dependent variable of the analysis is charges, which is the individual medical costs billed by health insurance in a particular year.
The predictors/independent variables of the analysis are as follows:
* age: age of the primary beneficiary
* sex: gender of the insurance contractor (female or male)
* bmi: body mass index, an objective measure of body weight relative to height (kg/m²); the ideal range is 18.5 to 24.9
* children: number of children/dependents covered by the health insurance
* smoker: smoking status (yes or no)
* region: the beneficiary's residential area in the US (northeast, southeast, southwest or northwest)
4.2.2. Summary Statistics
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
This data set contains four numerical variables (age, bmi, children and charges) and three categorical variables (sex, smoker and region).
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
- Ages of the respondents range from 18 to 64.
- The respondents' sex and region are evenly distributed.
- The number of children/dependents of the principal ranges from 0 to 5.
- Non-smokers outnumber smokers by approximately 4 to 1.
- The average medical cost is $13,270, the median is $9,382, and the maximum claim value is $63,770.
4.2.3 Visualization
# Distribution of charges
ggplot(df,aes(x=charges))+
geom_histogram(binwidth=2000)+
geom_vline(aes(xintercept=mean(charges)),color="blue", linetype="dashed",size=1)+
geom_vline(aes(xintercept=median(charges)),color="red", linetype="dashed",size=1)+
labs(title = "Distribution of Charges")+labs(x = "Charges / USD", y = "Count")+
theme(plot.title=element_text(hjust=0.5))
This graph shows a right-skewed distribution of medical expenses; 75% of the charges lie between zero and $16,640.
# Correlation between Age,Smoking and Charges
g1<-ggplot(df,aes(x=age, y = charges,color = smoker))+
geom_point()+
geom_smooth(method="lm")+
labs(title = "Effect of Age on Charges")+labs(x = "Age", y = "Charges / USD")+
theme(plot.title=element_text(hjust=0.5),legend.position = "none")
# Correlation between BMI,Smoking and Charges
g2 <- ggplot(df,aes(x=bmi, y = charges,color = smoker))+
geom_point()+
geom_smooth(method="lm")+
labs(title = "Effect of BMI on Charges")+labs(x = "BMI")+
theme(plot.title=element_text(hjust=0.5),axis.title.y = element_blank())
plot_grid(g1,g2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
These graphs clearly indicate that medical expenses are strongly associated with age and BMI. Older and overweight smokers incur higher expenses than others, possibly because smoking and obesity increase the risk of heart disease, cancer and diabetes.
# Correlation between No. of Children and Charges
p1 <- ggplot(df,aes(x=children, y = charges))+
geom_jitter(color = "blue", alpha=0.5)+
labs(title = "Effect of children on Charges")+labs(x = "Number of Children/Dependents", y = "Charges / USD")+
theme(plot.title=element_text(hjust=0.5))
# Correlation between Region and Charges
p2 <- ggplot(df,aes(x=region, y = charges))+
geom_jitter(color = "blue", alpha=0.5)+
labs(title = "Effect of region on Charges")+labs(x = "Region", y = "Charges / USD")+
theme(plot.title=element_text(hjust=0.5))
plot_grid(p1,p2)
No obvious relationship is evident between charges and either the number of children/dependents covered or the beneficiary's residential region.
This graph shows the pairwise comparison of numerical features.
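A minimal sketch of how this pairwise comparison can be generated with the psych package loaded earlier (assuming pairs.panels was the function used):
# Pairwise scatter plots, histograms and correlations for the numerical features
pairs.panels(df[c("age", "bmi", "children", "charges")], method = "pearson")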
5. Data Preprocessing
5.1. Missing Values
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
This data set does not have missing values.
5.2. Encoding Categorical Data
#Encode sex,smoker and region
df$sex = factor(df$sex,
levels = c('male','female'),
labels = c(1, 2))
df$smoker = factor(df$smoker,
levels = c('no', 'yes'),
labels = c(0, 1))
df$region = factor(df$region,
levels = c('northeast','northwest', 'southeast','southwest'),
labels = c(1, 2,3,4))
head(df)
##   age sex    bmi children smoker region   charges
## 1 19 2 27.900 0 1 4 16884.924
## 2 18 1 33.770 1 0 3 1725.552
## 3 28 1 33.000 3 0 3 4449.462
## 4 33 1 22.705 0 0 2 21984.471
## 5 32 1 28.880 0 0 2 3866.855
## 6 31 2 25.740 0 0 3 3756.622
6. Regression
This section addresses the first question: what is the medical cost of a person based on health and demographic features? Various regression algorithms, listed below, are applied to the sample data with charges as the dependent variable.
Method
6.1. Split the Data into Training Set and Test Set
# Split the data set into training set and test set
split = 0.8 # 80% for training
set.seed(100)
trainIndex <-createDataPartition(df$charges,p=split,list=FALSE)
data_train <- df[trainIndex,]
data_test <- df[-trainIndex,]
cat(" Size of data_train:",dim(data_train),"\n","Size of data_test :",dim(data_test))## Size of data_train: 1072 7
## Size of data_test : 266 7
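The train() calls below use a control object for resampling; a minimal sketch of its definition, assuming 10-fold cross-validation (the actual settings used are an assumption):
# Resampling scheme for caret::train (assumed 10-fold cross-validation)
control <- trainControl(method = "cv", number = 10)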
6.3. Multiple Linear Regression
# Fit multiple linear regression to the training set
lmModel = train(charges~.,data = data_train,trControl = control,method="lm" )
# Summary of training model
summary(lmModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11542 -2819 -1076 1329 29812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11885.56 1141.29 -10.414 < 2e-16 ***
## age 258.52 13.47 19.188 < 2e-16 ***
## sex2 384.42 377.57 1.018 0.308847
## bmi 324.23 32.64 9.933 < 2e-16 ***
## children 562.29 156.60 3.591 0.000345 ***
## smoker1 23972.70 475.28 50.440 < 2e-16 ***
## region2 -309.72 537.00 -0.577 0.564219
## region3 -783.31 548.53 -1.428 0.153581
## region4 -947.82 543.57 -1.744 0.081501 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6152 on 1063 degrees of freedom
## Multiple R-squared: 0.743, Adjusted R-squared: 0.7411
## F-statistic: 384.2 on 8 and 1063 DF, p-value: < 2.2e-16
# Predict the test set results
y_pred = predict(lmModel,newdata = data_test )
# Calculate RMSE
rmse = rmse(data_test$charges,y_pred)
cat("Root Mean Squared Error_MLR :", rmse)## Root Mean Squared Error_MLR : 5712.968
The results show highly statistically significant relationships between charges and age, bmi, children and smoking status, while sex and region do not have a significant impact on charges.
- The overall model p-value is less than 0.05.
- The R-squared value for this model is 0.743.
- The Root Mean Squared Error of the model is 5713.
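For reference, the rmse() call from the Metrics package computes the root mean squared error, which can be reproduced directly:
# Manual RMSE, equivalent to Metrics::rmse(actual, predicted)
sqrt(mean((data_test$charges - y_pred)^2))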
Optimize the Model with Backward Elimination
# Eliminate variable with highest p-value (sex)
lmModel = train(charges ~ age + bmi + children + smoker + region,
data = data_train, trControl = control,method="lm" )
# Summary of training model
summary(lmModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11612 -2785 -1062 1256 29650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11674.52 1122.33 -10.402 < 2e-16 ***
## age 258.64 13.47 19.197 < 2e-16 ***
## bmi 323.53 32.64 9.913 < 2e-16 ***
## children 558.33 156.56 3.566 0.000378 ***
## smoker1 23933.44 473.72 50.523 < 2e-16 ***
## region2 -311.27 537.00 -0.580 0.562280
## region3 -777.50 548.51 -1.417 0.156633
## region4 -937.54 543.49 -1.725 0.084809 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6152 on 1064 degrees of freedom
## Multiple R-squared: 0.7428, Adjusted R-squared: 0.7411
## F-statistic: 438.9 on 7 and 1064 DF, p-value: < 2.2e-16
# Predict the test set results
y_pred = predict(lmModel,newdata = data_test )
# Calculate RMSE
rmse = rmse(data_test$charges,y_pred)
cat("Root Mean Squared Error_MLR :", rmse)## Root Mean Squared Error_MLR : 5693.958
The model has improved slightly: the Root Mean Squared Error drops to 5694.
6.4. Support Vector Regression
# Fit SVR to the training set
regressor_SVM = svm(formula = charges~.,data = data_train,type = 'eps-regression')
summary(regressor_SVM)
##
## Call:
## svm(formula = charges ~ ., data = data_train, type = "eps-regression")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1111111
## epsilon: 0.1
##
##
## Number of Support Vectors: 419
# Predict the test set results
y_pred_SVM = predict(regressor_SVM,newdata = data_test )
# Calculate RMSE
rmse_SVM = rmse(data_test$charges,y_pred_SVM)
cat("Root Mean Squared Error_SVR :", rmse_SVM)## Root Mean Squared Error_SVR : 4341.062
6.5. Decision Tree
# Fit the decision tree model to the training set
DTModel =train(charges~.,data = data_train,trControl = control,method="rpart" )
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
#summary(DTModel)
# Predict the test set results
y_pred_DT = predict(DTModel,newdata = data_test )
# Calculate RMSE
rmse_DT= rmse(data_test$charges,y_pred_DT)
cat("Root Mean Squared Error_DT:", rmse_DT)## Root Mean Squared Error_DT: 5803.355
6.6. Random Forest
# Fit random forest model to the training set
RFModel =train(charges~.,data = data_train,trControl = control,method="rf" )
summary(RFModel)
##                 Length Class      Mode
## call 4 -none- call
## type 1 -none- character
## predicted 1072 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 1072 -none- numeric
## importance 8 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 1072 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 8 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
# Predict the test set results
y_pred_RF = predict(RFModel,newdata = data_test )
# Calculate RMSE
rmse_RF= rmse(data_test$charges,y_pred_RF)
cat("Root Mean Squared Error_RF:", rmse_RF)## Root Mean Squared Error_RF: 4280.677
6.7. k-Nearest Neighbors
# Fit kNN model to the training set
kNNModel =train(charges~.,data = data_train,trControl = control,method="knn" )
summary(kNNModel)
##             Length Class      Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 8 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical
## param 0 -none- list
# Predict the test set results
y_pred_kNN = predict(kNNModel,newdata = data_test )
# Calculate RMSE
rmse_kNN= rmse(data_test$charges,y_pred_kNN)
cat("Root Mean Squared Error_kNN:", rmse_kNN)## Root Mean Squared Error_kNN: 11518.92
6.8. Partial Least Squares Method
# library (pls)
# Fit multiple linear regression to the training set
plsModel =train(charges~.,
data = data_train,"kernelpls" )
summary(plsModel)
## Data: 	X dimension: 1072 8
## Y dimension: 1072 1
## Fit method: kernelpls
## Number of components considered: 3
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps
## X 82.849 98.97 99.16
## .outcome 9.992 12.66 62.87
# Predict the test set results
y_pred = predict(plsModel,newdata = data_test )
# Calculate RMSE
rmse_pls= rmse(data_test$charges,y_pred)
cat("Root Mean Squared Error_pls :", rmse_pls)## Root Mean Squared Error_pls : 7175.29
6.9. Evaluation of Regression Algorithm Models
The following table shows the results of the different regression models:
| Model | RMSE |
|---|---|
| Multiple Linear Regression | 5712 |
| Multiple Linear Regression (Backward Elimination) | 5694 |
| Support Vector Regression | 4341 |
| Decision Tree | 5803 |
| Random Forest | 4280 |
| k-Nearest Neighbors | 11519 |
| Partial Least Squares | 7175 |
- The Random Forest algorithm shows the best performance.
- Support Vector Regression provides the second-best results.
- k-Nearest Neighbors is the worst-performing algorithm.
Ultimately, it can be concluded that Random Forest is the most suitable regression method for predicting medical charges.
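As an illustration of how the chosen model could be used, the charges for a hypothetical new beneficiary can be predicted (the profile below is invented purely for illustration; factor codes follow the encoding in Section 5.2):
# Hypothetical beneficiary: 45-year-old female non-smoker, BMI 31.5, 2 children, southeast
new_person <- data.frame(age = 45,
                         sex = factor("2", levels = levels(df$sex)),
                         bmi = 31.5,
                         children = 2,
                         smoker = factor("0", levels = levels(df$smoker)),
                         region = factor("3", levels = levels(df$region)))
predict(RFModel, newdata = new_person)  # predicted annual charges in USD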
7. Classification
This section addresses the second question: what type of insurance package should be offered to a given person? First, the customers are clustered into three groups (high, medium and low risk). Then, classification algorithms are applied to the sample data with risk as the dependent variable.
7.1. Clustering
To prepare the data for the classification task, DBSCAN is used to cluster the beneficiaries into high-, medium- and low-risk groups. Only two features, age and charges, are used, because of the clear separation of the data into three groups shown in the EDA section.
DBSCAN is used instead of k-means because of the shape of the data: the almost constant spacing between data points within each band of the age-charges plot favours a density-based method.
An epsilon value of 0.25 is used, as it groups most of the data points without mixing the three groups.
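A common way to sanity-check the epsilon value is the k-nearest-neighbour distance plot from the dbscan package; a sketch of that check on the scaled features (the choice of k = 4 is an assumption):
# Sorted distance to the 4th nearest neighbour; the "knee" of the curve suggests eps
kNNdistplot(scale(df[c("age", "charges")]), k = 4)
abline(h = 0.25, lty = 2)  # the eps value used below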
# Perform DBScan to cluster the high risk, medium risk, and low risk.
library(dbscan)
# Selecting age and charges
df_clustering = df[c(1,7)]
df_clustering[c(1,2)] = scale(df_clustering[c(1,2)])
set.seed(1234)
# Perform clustering
r <- dbscan(df_clustering, eps=.25)
df_clustering['cluster'] <- r['cluster']
# Plot the clustering result
plot(ggplot(df_clustering,aes(x=age, y = charges))+geom_point(aes(colour = as.factor(cluster))))
As shown above, four groups are detected, where group 0 contains the data points that do not belong to any of the other three groups. These points are assigned manually to the other three groups based on observation, so that no potentially important information is lost given the small size of the data set.
# Adding the values from cluster 0 to 1,2,3
df_clustering$cluster[df_clustering$cluster==0 & df_clustering$charges>2] <- 3
df_clustering$cluster[df_clustering$cluster==0] <- 1
# Swapping values and setting 1 as low risk and 2 as medium risk
df_clustering$cluster[df_clustering$cluster==2] <- 0
df_clustering$cluster[df_clustering$cluster==1] <- 2
df_clustering$cluster[df_clustering$cluster==0] <- 1
df_clustering$cluster <- as.factor(df_clustering$cluster)
# Plot the clustering result after adjustment
plot(ggplot(df_clustering,aes(x=age, y = charges))+geom_point(aes(colour = cluster)))
7.2. Classification Model Training
For this task, the caret and rpart libraries are used.
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
7.3. k-Nearest Neighbors - Split Train-Test method
The charges column is excluded because it was used to derive the risk labels; risk is the dependent variable for this task.
df_classification = df[c(1:6)]
df_classification['risk'] = df_clustering$cluster
split=0.80 # define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(df_classification$risk, p=split, list=FALSE)
data_train <- df_classification[ trainIndex,]
data_test <- df_classification[-trainIndex,]
nrow(data_train)
## [1] 1071
nrow(data_test)
## [1] 267
knn_fit <- caret::train(risk ~., data = data_train, method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10)
#knn classifier
knn_fit
## k-Nearest Neighbors
##
## 1071 samples
## 6 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1071, 1071, 1071, 1071, 1071, 1071, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8539056 0.6585699
## 7 0.8685593 0.6869394
## 9 0.8734104 0.6957009
## 11 0.8776158 0.7037218
## 13 0.8822920 0.7132926
## 15 0.8816673 0.7115153
## 17 0.8829436 0.7142138
## 19 0.8838793 0.7159953
## 21 0.8830563 0.7133764
## 23 0.8834736 0.7140524
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 19.
#predict classes for test set using knn classifier
test_pred_knn <- predict(knn_fit, newdata = data_test)
confusionMatrix(test_pred_knn, as.factor(data_test$risk) )
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 193 22 0
## 2 0 18 3
## 3 0 5 26
##
## Overall Statistics
##
## Accuracy : 0.8876
## 95% CI : (0.8435, 0.9229)
## No Information Rate : 0.7228
## P-Value [Acc > NIR] : 4.803e-11
##
## Kappa : 0.7134
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 0.40000 0.89655
## Specificity 0.7027 0.98649 0.97899
## Pos Pred Value 0.8977 0.85714 0.83871
## Neg Pred Value 1.0000 0.89024 0.98729
## Prevalence 0.7228 0.16854 0.10861
## Detection Rate 0.7228 0.06742 0.09738
## Detection Prevalence 0.8052 0.07865 0.11610
## Balanced Accuracy 0.8514 0.69324 0.93777
7.4. k-Nearest Neighbors - Cross Validation Method
# using cross validation
knn_fit_cv <- caret::train(risk ~., data = df_classification, method = "knn",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv", number=5))
print(knn_fit_cv)
## k-Nearest Neighbors
##
## 1338 samples
## 6 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1070, 1070, 1071, 1070, 1071
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8849041 0.7124215
## 7 0.8856448 0.7125657
## 9 0.8856448 0.7124750
## 11 0.8863967 0.7145688
## 13 0.8863883 0.7145298
## 15 0.8886271 0.7198966
## 17 0.8878836 0.7181360
## 19 0.8871373 0.7164253
## 21 0.8878836 0.7181360
## 23 0.8908715 0.7255346
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 23.
#Confusion matrix for k-Nearest Neighbors
test_pred_knn_cv <- predict(knn_fit_cv, newdata = df_classification)
cm_knn <- confusionMatrix(test_pred_knn_cv, as.factor(df_classification$risk))
cm_knn$table
##           Reference
## Prediction 1 2 3
## 1 966 98 0
## 2 0 105 14
## 3 0 22 133
7.5. Naive Bayes - Cross Validation Method
# using cross validation and naive bayes
nb_fit_cv <- caret::train(risk ~., data = df_classification, method = "nb",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv", number=5))
print(nb_fit_cv)
## Naive Bayes
##
## 1338 samples
## 6 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1070, 1069, 1071, 1071, 1071
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE NaN NaN
## TRUE 0.8385591 0.5390927
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
## = 1.
#Confusion matrix for naive bayes
test_pred_nb_cv <- predict(nb_fit_cv, newdata = df_classification)
cm_nb <- confusionMatrix(test_pred_nb_cv, as.factor(df_classification$risk))
cm_nb$table
##           Reference
## Prediction 1 2 3
## 1 966 116 78
## 2 0 109 2
## 3 0 0 67
7.6. Decision Tree - Cross Validation Method
# using cross validation and decision tree
rpart_fit_cv <- caret::train(risk ~., data = df_classification, method = "rpart",
preProcess = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "cv", number=5))
print(rpart_fit_cv)
## CART
##
## 1338 samples
## 6 predictor
## 3 classes: '1', '2', '3'
##
## Pre-processing: centered (8), scaled (8)
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1071, 1070, 1071, 1069, 1071
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.9065484 0.7690119
## 0.04390681 0.9222508 0.8029989
## 0.08781362 0.9222508 0.8029989
## 0.13172043 0.9222508 0.8029989
## 0.17562724 0.9222508 0.8029989
## 0.21953405 0.9222508 0.8029989
## 0.26344086 0.9222508 0.8029989
## 0.30734767 0.9222508 0.8029989
## 0.35125448 0.8595021 0.6524118
## 0.39516129 0.7871442 0.3488221
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.3073477.
#Confusion matrix for decision tree
test_pred_rpart_cv <- predict(rpart_fit_cv, newdata = df_classification)
cm_rpart <- confusionMatrix(test_pred_rpart_cv, as.factor(df_classification$risk))
cm_rpart$table
##           Reference
## Prediction 1 2 3
## 1 966 98 0
## 2 0 127 3
## 3 0 0 144
7.7. Evaluation of Classification Algorithm Models
The following table shows the results of the different classification models:
| Model | Accuracy |
|---|---|
| k-Nearest Neighbors (Train-Test Split) | 0.888 |
| k-Nearest Neighbors (Cross Validation) | 0.891 |
| Naive Bayes (Cross Validation) | 0.839 |
| Decision Tree (Cross Validation) | 0.922 |
Based on accuracy, the Decision Tree algorithm performs best in classifying the customers.
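To illustrate how the selected classifier would be used in practice, the package for a hypothetical applicant can be predicted (profile values are invented; factor codes follow Section 5.2):
# Hypothetical applicant: 52-year-old male non-smoker, BMI 29.1, 1 child, northwest
new_customer <- data.frame(age = 52,
                           sex = factor("1", levels = levels(df$sex)),
                           bmi = 29.1,
                           children = 1,
                           smoker = factor("0", levels = levels(df$smoker)),
                           region = factor("2", levels = levels(df$region)))
predict(rpart_fit_cv, newdata = new_customer)  # predicted risk class: 1 = low, 2 = medium, 3 = high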
8. Conclusion
For the first question, what is the medical cost of a person based on health and demographic features?, regression is performed. Six methods are compared: Multiple Linear Regression, Support Vector Regression, Decision Tree, Random Forest, k-Nearest Neighbors and Partial Least Squares. The Root Mean Squared Error (RMSE) is used to evaluate the models, and the Random Forest algorithm is found to be the most suitable regression method.
For the second question, what type of insurance package should be offered to a given person?, classification is performed and four methods are tested. A comparison is first made between train-test split validation and cross-validation using the k-Nearest Neighbors method; the two approaches produce very similar accuracy (0.888 vs. 0.891), with cross-validation marginally ahead.
Methods 2 to 4 compare the performance of k-Nearest Neighbors, Naive Bayes and Decision Tree, all evaluated with cross-validation. The Decision Tree model performs best. However, across all four methods most errors occur when a medium-risk person is classified as low risk instead, which is unfavourable for the insurance provider because those customers would be undercharged.
A better understanding of the algorithms is required to produce a good model, and the small size of the data set also limits the accuracy of the predictions.
Group Members:
- Ushani Kumari (s2003535)
- Mohd Anas Ahmad (s2001089)