In this notebook, I will fit a linear regression model to this
data set to estimate the average charge for each population group.
According to Investopedia, insurance is a contract, represented by a
policy, in which a policyholder receives financial protection or
reimbursement against losses from an insurance company. The company
pools clients’ risks to make payments more affordable for the insured.
Most people have some insurance: for their car, their house, their
healthcare, or their life.
We’ll analyze this data by going through the following steps.
1- Look at basic descriptive statistics for the data.
2- Search for duplicates and NAs.
3- Look at the distributions of the numeric variables and the bar
plots for the character variables.
4- Look at the scatter plot for each variable against the dependent
variable (charges).
5- Prepare the data for the analysis.
6- Fit a multiple linear regression model using all the data.
7- Evaluate the model and try to improve its performance.
Basic descriptive statistics for the data
describe(insurance)
##          vars    n     mean       sd  median  trimmed     mad     min      max
## age         1 1338    39.21    14.05   39.00    39.01   17.79   18.00    64.00
## sex*        2 1338     1.51     0.50    2.00     1.51    0.00    1.00     2.00
## bmi         3 1338    30.66     6.10   30.40    30.50    6.20   15.96    53.13
## children    4 1338     1.09     1.21    1.00     0.94    1.48    0.00     5.00
## smoker*     5 1338     1.20     0.40    1.00     1.13    0.00    1.00     2.00
## region*     6 1338     2.52     1.10    3.00     2.52    1.48    1.00     4.00
## charges     7 1338 13270.42 12110.01 9382.03 11076.02 7440.81 1121.87 63770.43
##             range  skew kurtosis     se
## age         46.00  0.06    -1.25   0.38
## sex*         1.00 -0.02    -2.00   0.01
## bmi         37.17  0.28    -0.06   0.17
## children     5.00  0.94     0.19   0.03
## smoker*      1.00  1.46     0.14   0.01
## region*      3.00 -0.04    -1.33   0.03
## charges  62648.55  1.51     1.59 331.07
The trimmed mean is close to the mean for most variables. For the
charges variable it is noticeably lower (11076.02 versus 13270.42),
which suggests a positive skew, and the skew statistic (1.51) supports
that. Note that describe() reports excess kurtosis (kurtosis minus 3),
so the values are compared with 0 rather than 3: charges (1.59) has
heavier tails than a normal distribution, while age (-1.25) and region
(-1.33) have lighter tails.
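To illustrate why a trimmed mean below the plain mean hints at a positive skew, here is a minimal sketch on simulated right-skewed data. The vector `x` is hypothetical and is not the insurance data. The 10% trim is my assumption, chosen to match the default I believe describe() uses. The skewness and excess kurtosis below use a simple moment-based definition, so they can differ slightly from what describe() prints.

```r
set.seed(1)
# Hypothetical right-skewed sample (log-normal), standing in for a variable like charges
x <- rlnorm(1000, meanlog = 9, sdlog = 0.9)

plain_mean   <- mean(x)
trimmed_mean <- mean(x, trim = 0.1)  # drop 10% of the observations from each tail

# Moment-based skewness and excess kurtosis (one common definition)
z <- x - mean(x)
skew        <- mean(z^3) / mean(z^2)^1.5
excess_kurt <- mean(z^4) / mean(z^2)^2 - 3

c(mean = plain_mean, trimmed = trimmed_mean,
  skew = skew, excess_kurtosis = excess_kurt)
```

The long right tail pulls the plain mean up, while the trimmed mean discards those extreme values, so it ends up lower.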
Checking for NAs
colSums(is.na(insurance))
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
Checking for duplicates
dim(insurance)
## [1] 1338 7
dim(unique(insurance))
## [1] 1337 7
There is one duplicate in the dataset; let’s look at the duplicated
observation.
insurance[duplicated(insurance),]
## age sex bmi children smoker region charges
## 582 19 male 30.59 0 no northwest 1639.563
Given the values in this row, it could be a genuine repeated
observation rather than a data entry error, but let’s remove it.
insurance<-distinct (insurance)
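The steps of counting, inspecting, and dropping duplicates can be sketched on a toy data frame using base R only. The data frame `toy` and its values are hypothetical. `duplicated()` flags the second and later copies of a row, so keeping the rows where it is FALSE retains the first occurrence of each row.

```r
# Hypothetical toy data: rows 1 and 2 are identical
toy <- data.frame(age = c(19, 19, 45),
                  bmi = c(30.59, 30.59, 22.1))

dup_flags <- duplicated(toy)   # TRUE for repeated rows (not the first copy)
toy[dup_flags, ]               # inspect the repeated rows

deduped <- toy[!dup_flags, ]   # keep the first occurrence of each row
nrow(toy) - nrow(deduped)      # number of rows removed
```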
Distributions of the variables
Let’s start with the age.
ggplot(insurance, aes(x =age)) +
geom_histogram(fill="#3B9C9C",binwidth=4,col="white")+
theme_solarized()+
scale_fill_brewer(palette="Set2")+
ggtitle("Distribution of age")+
theme(plot.title = element_text(hjust = 0.5))+
xlab("Age")

Apart from a slight excess of people aged 18 to 22, the
distribution is nearly uniform, so all age groups are represented
about equally in the dataset.
BMI
ggplot(insurance, aes(x =bmi)) +
geom_histogram(fill="#5E5A80",binwidth=4,col="white")+
theme_solarized()+
scale_fill_brewer(palette="Set2")+
ggtitle("Distribution of bmi")+
theme(plot.title = element_text(hjust = 0.5))+
xlab("BMI")

BMI is approximately normally distributed, with a slight positive
skew.
charges
ggplot(insurance, aes(x =charges)) +
geom_histogram(fill="#728FCE",col="white")+
theme_solarized()+
scale_fill_brewer(palette="Set2")+
ggtitle("Distribution of charges")+
theme(plot.title = element_text(hjust = 0.5))+
xlab("Charges")

The charges variable looks approximately log-normal: it is strictly
positive and right-skewed, with a long tail towards the right. A normal
distribution allows negative values, whereas a log-normal distribution
takes only positive values, which suits a quantity such as charges.
A common application of the log-normal distribution in finance is the
analysis of stock prices.
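One informal way to check a log-normal shape is to take the logarithm and see whether the skew largely disappears. The sketch below does this on simulated data; `charges_sim` and the helper `skewness()` are hypothetical names, and the simulated values are not the insurance charges.

```r
set.seed(42)
# Hypothetical log-normal sample standing in for the charges variable
charges_sim <- rlnorm(2000, meanlog = 9, sdlog = 0.8)

# Simple moment-based skewness (hypothetical helper)
skewness <- function(v) {
  z <- v - mean(v)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(charges_sim)       # strongly positive for log-normal data
skew_log <- skewness(log(charges_sim))  # close to 0 if the data are log-normal

c(raw = skew_raw, log = skew_log)
```

With the real data, the same comparison could be made on `insurance$charges` and `log(insurance$charges)`, for example with two histograms side by side.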
Sex
ggplot(insurance, aes(x =sex,fill=sex)) +
geom_bar()+
theme_solarized()+
scale_fill_brewer(palette="Set1")+
ggtitle("Distribution of sex")+
theme(plot.title = element_text(hjust = 0.5))+
xlab("Sex")

Number of children
ggplot(insurance, aes(x =children,fill=as.factor(children))) +
geom_bar()+
theme_solarized()+
scale_fill_brewer(palette="Set2")+
ggtitle("Distribution of children")+
theme(plot.title = element_text(hjust = 0.5))+
theme(legend.position = "none")+
xlab("Children")

Smoker
ggplot(insurance, aes(x =smoker,fill=smoker)) +
geom_bar()+
theme_solarized()+
scale_fill_brewer(palette="Accent")+
ggtitle("Distribution of smoker")+
theme(plot.title = element_text(hjust = 0.5))+
xlab("Smoker")

Region
ggplot(insurance, aes(x =region,fill=region)) +
geom_bar()+
theme_solarized()+
scale_fill_brewer(palette="Dark2")+
ggtitle("Distribution of region")+
theme(plot.title = element_text(hjust = 0.5))+
xlab("Region")

It seems that the two sexes and the four regions are represented
about equally, while non-smokers clearly outnumber smokers. Also, the
number of people decreases as the number of children increases, which
gives a right-skewed distribution.
Effect of the character variables on the charges
Sex
ggplot(insurance,aes(y=charges,x=sex,fill=sex))+
geom_boxplot()+
ggtitle("Boxplot of the charges by sex")+
xlab("Sex")+
ylab("Charges")+
theme_solarized()+
scale_fill_brewer(palette="Set2")+
theme(plot.title = element_text(hjust = 0.5))

Number of children
ggplot(insurance,aes(y=charges,x=as.factor(children),fill=as.factor(children)))+
geom_boxplot()+
ggtitle("Boxplot of the charges by number of children")+
ylab("charges")+
xlab("Number of children")+
theme_solarized()+
scale_fill_brewer(palette="Set1")+
theme(legend.position = "none")+
theme(plot.title = element_text(hjust = 0.5))

Smoker
ggplot(insurance,aes(y=charges,x=smoker,fill=smoker))+
geom_boxplot()+
ggtitle("Boxplot of the charges by smoker")+
ylab("charges")+
xlab("smoker")+
theme_solarized()+
scale_fill_startrek()+
theme(plot.title = element_text(hjust = 0.5))
Region
ggplot(insurance,aes(y=charges,x=region,fill=region))+
geom_boxplot()+
ggtitle("Boxplot of the charges by region")+
ylab("charges")+
xlab("region")+
theme_solarized()+
scale_fill_jco()+
theme(plot.title = element_text(hjust = 0.5))

Based on the plots, there appear to be no large differences in
charges across sex, region, or number of children. Strictly speaking,
an ANOVA test is needed to confirm that, but we will skip it for now
because any differences seem small enough to neglect at this stage.
Smoking status, in contrast, separates the charges clearly, so it will
likely carry a lot of weight in our upcoming model.
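For reference, the ANOVA check mentioned above could look roughly like the sketch below. It runs on a simulated data frame `toy` (hypothetical values, with no real group differences built in), not on the insurance data. Because charges are skewed, I also include a Kruskal-Wallis test as a rank-based alternative; that addition is my own suggestion rather than part of the original analysis.

```r
set.seed(7)
# Hypothetical data: four regions with charges drawn from the same distribution
toy <- data.frame(
  region  = factor(rep(c("northeast", "northwest", "southeast", "southwest"),
                       each = 50)),
  charges = rlnorm(200, meanlog = 9, sdlog = 0.8)
)

# One-way ANOVA: do mean charges differ between regions?
fit     <- aov(charges ~ region, data = toy)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Rank-based alternative that does not assume normally distributed residuals
kw <- kruskal.test(charges ~ region, data = toy)

c(anova_p = p_value, kruskal_p = kw$p.value)
```

On the real data, the same calls would use `data = insurance` with `sex`, `region`, or `as.factor(children)` on the right-hand side.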
Visualizing the relationship between the dependent and independent variables
We create scatter plots of the charges against the numeric independent
variables and look at how the character variables affect them.
ggplot(insurance,aes(x=age,y=charges))+
geom_point(col="#488AC7")+
theme_solarized()+
ggtitle("Relationship between charges and age")+
xlab("Age")+
ylab("charges")+
theme(plot.title = element_text(hjust = 0.5))

It seems clear that charges increase with age, and the separate bands
in the plot suggest that the level of the charges is determined by one
or two character variables. Within each band, the relationship looks
linear, or perhaps slightly curved like a second- or third-degree
polynomial.
Let’s separate the plot by the character variables.
ggplot(insurance,aes(x=age,y=charges,col=sex))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and age by sex")+
xlab("Age")+
ylab("charges")+
theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=age,y=charges,col=smoker))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and age by smoker")+
xlab("Age")+
ylab("charges")+
theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=age,y=charges,col=region))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and age by region")+
xlab("Age")+
ylab("charges")+
theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=age,y=charges,col=as.factor(children)))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and age by number of children ")+
xlab("Age")+
ylab("charges")+
guides(color = guide_legend(title = "Number of children"))+
theme(plot.title = element_text(hjust = 0.5))

As the boxplots suggested, smoking status clearly separates the
charges and has a large impact on the dependent variable. The bands for
smokers and non-smokers look roughly parallel, so an interaction term
between the age and smoker variables does not seem necessary. The
charges also appear to be separated by something other than the
character variables we have, so we are going to create another
character variable, call it “obs”, based on the body mass index “bmi”
with the following rules. BMI classifications are: underweight (under
18.5 kg/m2), normal weight (18.5 to 24.9), overweight (25 to 29.9), and
obese (30 or more).
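# Bin BMI into categories. case_when() uses the first matching condition,
# so a strict "<" on each upper bound leaves no gap between categories.
insurance$obs <- case_when(
  insurance$bmi < 18.5 ~ "underweight",
  insurance$bmi < 25   ~ "normal",
  insurance$bmi < 30   ~ "overweight",
  insurance$bmi >= 30  ~ "obese")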
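An alternative way to bin BMI is base R’s `cut()`, which takes the break points once and therefore cannot leave gaps between adjacent categories. This is a minimal sketch; the helper name `bmi_category()` is hypothetical, and the break points are the classification thresholds listed above.

```r
# Hypothetical helper: map numeric BMI values to the four categories
bmi_category <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("underweight", "normal", "overweight", "obese"),
      right  = FALSE)  # intervals are [lower, upper): 25 counts as overweight
}

# Boundary values, including ones just below 25 and 30
as.character(bmi_category(c(18.4, 18.5, 24.9995, 25, 29.9995, 30)))
# "underweight" "normal" "normal" "overweight" "overweight" "obese"
```

Note that `cut()` returns a factor whose levels follow the order of `labels`, which also keeps the categories in a sensible order on plot axes.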
Visualize the counts of the new variable and its impact on the
charges.
ggplot(insurance, aes(x =obs,fill=obs)) +
geom_bar()+
theme_solarized()+
scale_fill_brewer(palette="Set1")+
ggtitle("Distribution of category of BMI")+
guides(fill = guide_legend(title = "category of BMI"))+
theme(plot.title = element_text(hjust = 0.5))+
xlab("category of BMI")

ggplot(insurance,aes(y=charges,x=obs,fill=obs))+
geom_boxplot()+
ggtitle("Boxplot of the charges by the category of BMI")+
ylab("charges")+
xlab("category of BMI")+
theme_solarized()+
scale_fill_brewer(palette="Set1")+
theme(legend.position = "none")+
theme(plot.title = element_text(hjust = 0.5))

The data set has more instances of the obese and overweight
categories than of the normal weight and underweight categories.
Obesity also appears to have a considerable impact on charges: in the
boxplot, charges beyond forty thousand appear to be associated with
obesity.
Let’s look at charges against age, colored by category of BMI.
ggplot(insurance,aes(x=age,y=charges,col=obs))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and age by category of BMI ")+
xlab("Age")+
ylab("charges")+
guides(color = guide_legend(title = "category of BMI"))+
theme(plot.title = element_text(hjust = 0.5))

The category of BMI also separates the charges, and the trend with
age looks similar in every category, so an interaction term between the
age and category of BMI variables does not seem necessary.
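The visual impression of parallel trends can be checked more formally by comparing a model without the interaction to one with it. The sketch below uses simulated data where the slopes are parallel by construction; the data frame `toy`, its coefficients, and the two-level `obs` factor are hypothetical and only illustrate the mechanics.

```r
set.seed(3)
n <- 300
# Hypothetical data: age and a two-level BMI category
toy <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  obs = factor(sample(c("normal", "obese"), n, replace = TRUE))
)
# Simulated charges: same age slope in both categories (no interaction)
toy$charges <- 2000 + 250 * toy$age + 8000 * (toy$obs == "obese") +
  rnorm(n, sd = 3000)

additive         <- lm(charges ~ age + obs, data = toy)  # parallel lines
with_interaction <- lm(charges ~ age * obs, data = toy)  # separate slopes

# F-test: does the interaction term improve the fit?
cmp <- anova(additive, with_interaction)
cmp
```

A large p-value in the second row of `cmp` would be consistent with leaving the interaction out; a small one would suggest that the slopes differ between categories.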
Relationship between charges and age by category of BMI and smoking state
ggplot(insurance,aes(x=age,y=charges,col=obs,shape=smoker))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and age by category of BMI and smoking state")+
xlab("Age")+
ylab("charges")+
guides(color = guide_legend(title = "category of BMI"))+
theme(plot.title = element_text(hjust = 0.5))

It appears that everyone in the highest charge group is an obese
smoker, and that being older is associated with higher charges.
Let’s see the impact of bmi on charges as a numeric variable and
combine it with the other character variables.
ggplot(insurance,aes(x=bmi,y=charges))+
geom_point(col="#488AC7")+
theme_solarized()+
ggtitle("Relationship between charges and BMI")+
xlab("BMI")+
ylab("charges")+
theme(plot.title = element_text(hjust = 0.5))

Let’s separate the plot by the character variables.
ggplot(insurance,aes(x=bmi,y=charges,col=sex))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and BMI by sex")+
xlab("BMI")+
ylab("charges")+
theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=bmi,y=charges,col=smoker))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and BMI by smoker")+
xlab("BMI")+
ylab("charges")+
theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=bmi,y=charges,col=region))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and BMI by region")+
xlab("BMI")+
ylab("charges")+
theme(plot.title = element_text(hjust = 0.5))

ggplot(insurance,aes(x=bmi,y=charges,col=as.factor(children)))+
geom_point()+
theme_solarized()+
ggtitle("Relationship between charges and BMI by number of children ")+
xlab("BMI")+
ylab("charges")+
guides(color = guide_legend(title = "Number of children"))+
theme(plot.title = element_text(hjust = 0.5))

The plots show that a higher BMI is not associated with higher
charges unless the person is a smoker.
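A pattern like this, where the slope of one variable depends on the level of another, is what an interaction between bmi and smoker would capture. As a minimal sketch, the per-group slopes can be compared on simulated data; the data frame `toy` and all coefficients below are hypothetical, chosen only to mimic the shape seen in the plots.

```r
set.seed(11)
n <- 400
# Hypothetical data: BMI and smoking status
toy <- data.frame(
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE,
                         prob = c(0.8, 0.2)))
)
# Simulated charges: BMI matters only for smokers
toy$charges <- 8000 +
  ifelse(toy$smoker == "yes", 24000 + 1400 * (toy$bmi - 30), 0) +
  rnorm(n, sd = 4000)

# Fit charges ~ bmi separately within each smoking group and extract the slope
slopes <- sapply(split(toy, toy$smoker),
                 function(d) coef(lm(charges ~ bmi, data = d))[["bmi"]])
slopes
```

On the real data, the equivalent single-model formulation would be `lm(charges ~ bmi * smoker, data = insurance)`, where the `bmi:smokeryes` coefficient estimates the difference between the two slopes.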