In order to make money, insurance companies need to collect more in premiums than it spends on medical care to its beneficiaries. Every insurer makes a significant portion of its revenue by underwriting, which is basically charging a fee (called a premium) for taking on financial risk.
A great deal of time and money is invested to employ statistical and mathematical models to evaluate and mitigate the financial risk involved in different scenarios.
The data comes from the book Machine Learning with R by Brett Lantz. We should note that the data was simulated using hypothetical medical expenses, and demographic statistics from the US Census Bureau. It approximately reflects real-world conditions.
The dataset contains 1,338 observations and 6 variables. Here is a run down of each variable individually:
age : An integer indicating the age of the beneficiary
sex : The policy holders gender
bmi : The Body Mass index of the beneficiary. It’s a measure of how overweight or underweight a person is relative to their height. BMI is equal to weight(kilograms) divided by height(meters)squared. An ideal BMI is within the range of 18.5 to 24.9
children: An integer indicating the number of children/dependents covered by the insurance plan
smoker: A yes or no categorical variable that indicates whether the insured regularly smokes tobacco.
region: The beneficiary’s place of residence in the US divided into four geographic regions: northeast, southeast, southwest, or northwest.
#reading in the data
insurance <- read.csv("C:/Users/Wadud/Downloads/insurance.csv")
insurance$children <- as.character(insurance$children)#loading any necessary packages
library(plotly)
library(ggplot2)
library(GGally)
library(dplyr)
library(devtools)
library(reshape2)To understand our data it would be wise to get a feel of the distributions of any numeric variables.
str(insurance)## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: chr "0" "1" "3" "0" ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
#Distribution of Age
ggplot(insurance, aes(age))+
geom_bar(bins=24)+
labs(title= "Distribution of Age") +
theme(
panel.background = element_rect(fill = "lightblue", colour = "lightblue",size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid', colour = "white"))#Distribution of BMI
ggplot(insurance,aes(bmi))+
geom_histogram(bins = 25)+
labs(title= "Distribution of BMI") +
theme(
panel.background = element_rect(fill = "lightblue", colour = "lightblue",size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',colour = "white"))#Distribution of Children
ggplot(insurance, aes(children))+
geom_bar()+
labs(title= "Distribution of Dependents Covered") +
theme(
panel.background = element_rect(fill = "lightblue",colour = "lightblue", size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid', colour = "white"))#Distribution of Charges
ggplot(insurance,aes(charges))+
geom_histogram(bins = 32)+
labs(title= "Distribution of Charges") +
theme(
panel.background = element_rect(fill = "lightblue", colour = "lightblue",size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',colour = "white"))###Box Plots of Categorical Variables
plot_ly(insurance, y = ~charges, color = ~sex, type = "box")%>%
layout(title="Comparison of Charges Between Genders")plot_ly(insurance, y = ~charges, color = ~children, type = "box")%>%
layout(title="Comparison of Charges Between Number of Dependents")We can use a scatterplot matrix to quickly visualize the relationship between any two variables:
insurance$children <- as.integer(insurance$children)
ggpairs(insurance, columns = c("age","bmi","children","charges"), title = "Scatterplot Matrix")+
theme(
panel.background = element_rect(fill = "lightblue", colour = "lightblue",size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid', colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',colour = "white"))Linear Regression is the most widely used predictive analytic technique.
ins_model <- lm(charges ~ age + children + bmi + sex +
smoker + region, data = insurance)
ins_model##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + smoker +
## region, data = insurance)
##
## Coefficients:
## (Intercept) age children bmi
## -11938.5 256.9 475.5 339.2
## sexmale smokeryes regionnorthwest regionsoutheast
## -131.3 23848.5 -353.0 -1035.0
## regionsouthwest
## -960.1
A 3D scatter plot allows us to plot on three axes to better understand how the relationship between them.
plot_ly(
insurance, x=~charges, y=~children, z=~bmi,
color=~sex, colors = c('lightcoral', 'cornflowerblue')
) %>%
add_markers() %>%
layout(
scene = list(title = list(text="3D Scatter Plot"),
xaxis = list(title="Charges"),
yaxis = list(title="# of Dependents"),
zaxis = list(title="BMI"))
)work in progress
ins_model2 <- lm(charges ~ age + children, data = insurance)
#ins_model2
graph_reso <- .05
#setting up axis
axis_x <- seq(min(insurance$age), max(insurance$age), by = graph_reso)
axis_y <- seq(min(min(insurance$children), max(insurance$children)), by = graph_reso)
#Sample Points
ins_model2_surface <- expand.grid(age=axis_x, children=axis_y, KEEP.OUT.ATTRS =F)
ins_model2_surface$charges <- predict.lm(ins_model2, newdata = ins_model2_surface)
ins_model2_surface <- acast(ins_model2_surface, children ~ age, value.var = "charges")
hcolors=c("red","blue")[insurance$sex]
charges_plot <- plot_ly(insurance,
x = ~age,
y = ~children,
z = ~charges,
text = ~sex, # EDIT: ~ added
type = "scatter3d",
mode = "markers",
marker = list(color = hcolors))
charges_plot <- add_trace(p = charges_plot,
z = ins_model2_surface,
x = axis_x,
y = axis_y,
type = "surface")
charges_plotA quantile-quantile plot is a visual check if the data set is from a theoretical distribution, like normal or cauchy.
ggplot(insurance, aes(sample = charges))+stat_qq()+theme_bw()+ggtitle("Q-Q Plot: Charges")qplot(sample = charges, data = insurance, color=sex)+theme_bw()+ggtitle("Q-Q Plot:Charges by Sex")qplot(sample = bmi, data = insurance, color=sex)+theme_bw()+ggtitle("Q=Q Plot: BMI by Sex")qplot(sample = age, data = insurance, color=sex)+theme_bw()+ggtitle("Q=Q Plot: Age by Sex")Decision Trees/ Tree Based Methods are a class of supervised machine learning algorithms, which are used in both classification (predicts discrete outcome) and regression (predicts continuous numeric outcomes) predictive modeling.
Decision Trees are composed of nodes, branches and leafs. Each node represents an attribute (or feature), each branch represents a rule (or decision), and each leaf represents an outcome of our target feature. The depth of a tree is defined by the number of levels, not including the root node.
Since we’re predicting a continuous numeric variable - insurance premiums - we will use a version of a decision tree called a Regression Tree
It is standard practice in machine learning to first split your data into two subsets- a set for training the model, and a set for testing the model. This is called the Training Split and the Test Split respectively. This is used to validate any insights and reduce the risk of over-fitting your model to your data.
There is no rule as to the exact size split to make but it is sensible to reserve a larger split for training - a typical split is 80% training and 20% testing data.
R has a useful package for creating regression trees called rpart (short for ‘recursive partitioning’)
Another package, rpart.plot is used to visualize the tree.
#insurance <- read.csv("C:/Users/Wadud/Downloads/insurance.csv")
library(dplyr)
library(rpart)
library(rpart.plot)Let’s remind ourselves of our variables and data types:
str(insurance)## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
You might be wondering, if our outcome variable is a numeric variable, can we still use categorical variables as features(predictors)?
YES
#creating our random split
set.seed(1234)
sample_size = floor(0.8*nrow(insurance))
select_rows = sample(seq_len(nrow(insurance)),size = sample_size)
#creating the training split
insurance_training = insurance[select_rows,]
#creating the testing split
insurance_testing = insurance[-select_rows,]
nrow(insurance_training)## [1] 1070
nrow(insurance_testing)## [1] 268
Our training data and testing data has 1070 and 268 rows respectively.
Here we will use the ‘rpart’ and ‘rpart.plot’ packages mentioned previously.
insurance_model <- rpart(charges ~ age + sex + bmi + region + smoker + children, data = insurance_training, method = "anova", control=rpart.control(minsplit=100, cp=0.001))
insurance_predictions <- predict(insurance_model,insurance_testing)
insurance_model## n= 1070
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1070 161342000000 13467.080
## 2) smoker=no 848 31612330000 8514.415
## 4) age< 42.5 471 11004990000 5371.606
## 8) children< 0.5 209 3439828000 3709.599
## 16) age< 32.5 176 2360514000 3109.901 *
## 17) age>=32.5 33 678437100 6907.993 *
## 9) children>=0.5 262 6527319000 6697.405
## 18) region=northwest,southeast,southwest 197 3645995000 6090.324
## 36) age< 38.5 159 2653487000 5566.524 *
## 37) age>=38.5 38 766349800 8282.014 *
## 19) region=northeast 65 2588675000 8537.327 *
## 5) age>=42.5 377 10143010000 12440.840
## 10) age< 51.5 156 3581709000 10457.550 *
## 11) age>=51.5 221 5514542000 13840.810
## 22) age< 58.5 128 2603403000 12770.220 *
## 23) age>=58.5 93 2562507000 15314.310 *
## 3) smoker=yes 222 29475050000 32385.360
## 6) bmi< 30.01 103 2755371000 21601.270
## 12) age< 41 57 809638500 18676.370 *
## 13) age>=41 46 853853000 25225.600 *
## 7) bmi>=30.01 119 4373125000 41719.500
## 14) age< 41.5 65 1448883000 38404.200 *
## 15) age>=41.5 54 1349856000 45710.130 *
This is our regression tree before visualization. We started out with 1070 observations, and then the data was split first on the “smoker” variable, so our model found this to be the most “predictive” feature.
rpart.plot(insurance_model, type = 3, digits = 5, fallen.leaves = TRUE)
The Percent in the leaf nodes are the proportion of the training data of
the training data, that have the conditions of the decision nodes before
it. Hence it add up to 100% (the entire training split). The number
above is the predicted average premiums paid across all those rows that
meet the conditions.
An interesting thing to note in our tree is that as far as the ‘region’ feature is concerned, northeast is grouped separately from the rest. The predicted cost of insurance premiums in this model is higher here, so this tells us that people generally pay higher premiums in the northeast region than any other region.