MATH1324 Assignment 4

Factors Affecting Insurance Charges

Fangwen Dang, ID:3664793

Introduction

Insurance is an important component of a national economy and has a significant influence on the healthy development of the financial industry. Understanding what drives insurance charges is therefore useful for identifying room for growth in the insurance industry and, through it, in the wider economy.

Insurance charges generally differ between individuals. Based on surveys and experience, I select smoking status, gender, body mass index (BMI), region, age and number of children as potential factors affecting insurance charges, and analyse the relationship between charges and these individual characteristics.

The data come from an open data set available at “https://pan.baidu.com/s/1eSL3gUu”. The data set insurance.csv records the insurance charges of individuals together with their attributes: smoker (smoking status), sex (gender), bmi (body mass index), region, age and children (number of children).

Next, I use RStudio to analyse the data statistically and graphically in order to find the factors that significantly affect an individual's insurance charges.

Problem Statement

Descriptive statistics of age grouped by smoker
Descriptive statistics of bmi grouped by smoker
Descriptive statistics of children grouped by smoker
Obtain a boxplot of charges by smoker and sex, and briefly describe the observation.
Obtain a boxplot of charges by region and smoker, and briefly describe the observation.
Obtain a scatter plot between charges and bmi by smoker, and briefly describe the observation.
Split the data set by smoker. Name them insurance_smoker and insurance_nosmoker respectively and obtain their summary statistics.
For insurance_smoker, create a variable bmi_factor which equals one if bmi > 30 and zero otherwise.
Obtain a scatter plot between charges and bmi by sex and bmi_factor, and briefly describe the observation.
Obtain a histogram with density plot of Charges for nonsmokers and use it to explain why the scatter plot of charges against age for nonsmokers has so many outliers.
Make a regression analysis on dependent variable Charges, and independent variables age, bmi, children, smoker, region and sex.
Make a t-test to test whether mean insurance charge for nonsmokers is higher than that for smokers.
Make a t-test to test whether mean charges for smokers with BMI<=30 is higher than that for smokers with BMI>30.
Make a Chi-square test of independence to test whether sex is associated with smoking status.

Data

library(dplyr)    # group_by(), summarise(), n()
library(ggplot2)  # ggplot() and the geoms used below
setwd("/Users/meiruoqi/Desktop")
ins <- read.table("insurance.csv",sep=',',header=TRUE)

Descriptive Statistics and Visualisation

Descriptive statistics of age grouped by smoker

grp <- group_by(ins, smoker)

summarise(grp,
          Min = min(age, na.rm = TRUE),
          Q1 = quantile(age, probs = .25, na.rm = TRUE),
          Median = median(age, na.rm = TRUE),
          Q3 = quantile(age, probs = .75, na.rm = TRUE),
          Max = max(age, na.rm = TRUE),
          Mean = mean(age, na.rm = TRUE),
          SD = sd(age, na.rm = TRUE),
          n = n(),
          Missing = sum(is.na(age)))

Descriptive statistics of bmi grouped by smoker

grp <- group_by(ins, smoker)

summarise(grp,
          Min = min(bmi, na.rm = TRUE),
          Q1 = quantile(bmi, probs = .25, na.rm = TRUE),
          Median = median(bmi, na.rm = TRUE),
          Q3 = quantile(bmi, probs = .75, na.rm = TRUE),
          Max = max(bmi, na.rm = TRUE),
          Mean = mean(bmi, na.rm = TRUE),
          SD = sd(bmi, na.rm = TRUE),
          n = n(),
          Missing = sum(is.na(bmi)))

Descriptive statistics of children grouped by smoker

summarise(grp,
          Min = min(children, na.rm = TRUE),
          Q1 = quantile(children, probs = .25, na.rm = TRUE),
          Median = median(children, na.rm = TRUE),
          Q3 = quantile(children, probs = .75, na.rm = TRUE),
          Max = max(children, na.rm = TRUE),
          Mean = mean(children, na.rm = TRUE),
          SD = sd(children, na.rm = TRUE),
          n = n(),
          Missing = sum(is.na(children)))

The same descriptive statistics of age, bmi and children can also be obtained grouped by sex or region; they are not described further here.
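The three summaries above compute the same nine statistics for different variables, so they could be wrapped in one helper. The following is a minimal base R sketch, not the code used for this report: the function name describe_by and the toy data frame are hypothetical, and base R is used here only so that the example runs without extra packages.

```r
# Hypothetical helper: the same nine statistics for any numeric variable,
# grouped by any categorical variable.
describe_by <- function(data, var, group) {
  stats <- function(x) {
    c(Min     = min(x, na.rm = TRUE),
      Q1      = unname(quantile(x, probs = 0.25, na.rm = TRUE)),
      Median  = median(x, na.rm = TRUE),
      Q3      = unname(quantile(x, probs = 0.75, na.rm = TRUE)),
      Max     = max(x, na.rm = TRUE),
      Mean    = mean(x, na.rm = TRUE),
      SD      = sd(x, na.rm = TRUE),
      n       = length(x),
      Missing = sum(is.na(x)))
  }
  # One row per group, one column per statistic
  t(sapply(split(data[[var]], data[[group]]), stats))
}

# Toy data (made-up values, only to show the call)
toy <- data.frame(
  smoker = c("no", "no", "no", "yes", "yes", "yes"),
  age    = c(20, 30, 40, 25, 35, NA)
)
result <- describe_by(toy, "age", "smoker")
print(result)
```

With the real data, a call such as describe_by(ins, "bmi", "region") would give the per-region summary mentioned above, assuming ins has been loaded as in the Data section.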

Obtain a boxplot of charges by smoker and sex, and briefly describe the observation.

p1 <- ggplot(data=ins,aes(x=sex, y=charges, fill=smoker))+
geom_boxplot()+
ggtitle("Charges of Sex and Smoker")+
xlab("Sex")+
ylab("Charges")

p1

Findings:

Smokers’ median charge is far higher than nonsmokers’, for both females and males.
The median charges of female and male nonsmokers differ only slightly, while the median for male smokers is higher than that for female smokers.
There are many outliers among nonsmokers.

Obtain a boxplot of charges by region and smoker, and briefly describe the observation.

p2 <- ggplot(data=ins,aes(x=region, y=charges, fill=smoker))+
geom_boxplot()+
ggtitle("Charges of Region and Smoker")+
xlab("Region")+
ylab("Charges")

p2

Findings:

Smokers’ median charge is far higher than nonsmokers’ in every region.
The median charges of nonsmokers differ only slightly between regions, while the medians for smokers in the southeast and southwest are higher than those in the northeast and northwest.
There are many outliers among nonsmokers.

Obtain a scatter plot between charges and bmi by smoker, and briefly describe the observation.

p3 <- ggplot(data=ins,aes(x=bmi, y=charges, color=smoker))+
geom_point()+
ggtitle("Charges of BMI and Smoker")+
xlab("BMI")+
ylab("Charges")

p3

Findings:

Nonsmokers form a single cluster, while smokers split into two groups, one with BMI below 30 and one with BMI above 30.
Smokers with BMI above 30 are charged more than everyone else.
Among smokers, charges increase with BMI, while among nonsmokers charges show no clear relationship with BMI.

Split the data set by smoker. Name them insurance_smoker and insurance_nosmoker respectively and obtain their summary statistics.

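listins <- split(ins, ins$smoker)
# split() returns the groups in level order: nonsmokers first, smokers second
insurance_nosmoker <- listins[[1]]
insurance_smoker <- listins[[2]]
summary(insurance_nosmoker)
summary(insurance_smoker)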

For insurance_smoker, create a variable bmi_factor which equals one if bmi > 30 and zero otherwise.

insurance_smoker$bmi_factor <- as.factor(ifelse(insurance_smoker$bmi>30,1,0))

smoker_split <- split(insurance_smoker, insurance_smoker$bmi_factor)

bmi_factor_0 <- smoker_split[[1]]

bmi_factor_1 <- smoker_split[[2]]

Obtain a scatter plot between charges and bmi by sex and bmi_factor, and briefly describe the observation.

p4 <- ggplot(data=insurance_smoker,aes(x=bmi, y=charges, color=sex,shape=bmi_factor))+
geom_point()+
ggtitle("Charges of BMI by Sex and BMI_factor")+
xlab("BMI")+
ylab("Charges")

p4

Findings:

Generally speaking, smokers with bmi_factor=1, i.e. BMI>30, are charged more than those with bmi_factor=0, i.e. BMI<=30.
Females and males are spread approximately evenly across the two groups.
The higher the BMI, the higher the charge.

Obtain density plot of charges by bmi_factor, and briefly describe the observation.

p5 <- ggplot(data=insurance_smoker,aes(x=bmi, y=charges,color=bmi_factor))+
geom_point()+
stat_density2d()+
ggtitle("Charges of BMI_factor")+
xlab("BMI")+
ylab("Charges")

p5

Findings:

Charges for smokers with bmi_factor=1, i.e. BMI>30, are clearly higher than charges for smokers with bmi_factor=0, i.e. BMI<=30.
For smokers with bmi_factor=1, the density is centred around BMI=35 and charge=40000, indicating that smokers with BMI above 30 have an average BMI of about 35 and an average charge of about 40000.
Similarly, for smokers with bmi_factor=0, the density is centred around BMI=28 and charge=20000, indicating that smokers with BMI of 30 or below have an average BMI of about 28 and an average charge of about 20000.

For insurance_nosmoker, obtain a scatter plot between charges and age by sex, and briefly describe the observation.

p6 <- ggplot(data=insurance_nosmoker,aes(x=age, y=charges, color=sex))+
geom_point()+
ggtitle("Charges of Age by Sex")+
xlab("Age")+
ylab("Charges")

p6

Obtain a histogram with density plot of Charges for nonsmokers and use it to explain why the scatter plot of charges against age above has so many outliers.

p7 <- ggplot(data=insurance_nosmoker,aes(x=charges))+
geom_histogram(aes(y=after_stat(density)),color="black",fill="grey",binwidth=500)+
geom_density(alpha=.3,fill="blue")+
ggtitle("Histogram of Nonsmoker Charges")+
xlab("Charges")+
ylab("Density")

p7

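Findings:

1. For nonsmokers, charges increase with age for both females and males.

2. Although most people of the same age are charged similarly, there are many outliers with high charges.

3. The histogram of nonsmoker charges is positively skewed with a thick right tail: most charges are relatively low, and the tail accounts for the high-charge outliers in the scatter plot.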
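The link between a right-skewed distribution and high-value outliers can be checked numerically. The sketch below does not use the insurance data: charges_sim is simulated from a log-normal distribution with arbitrarily chosen parameters, purely as a stand-in for a right-skewed charge variable. It computes the moment-based sample skewness and counts the points above the usual boxplot fence (Q3 + 1.5 × IQR).

```r
set.seed(3)
# Simulated right-skewed "charges" (hypothetical values, not the real data)
charges_sim <- rlnorm(1000, meanlog = 9, sdlog = 0.8)

# Moment-based sample skewness: third central moment / (variance)^(3/2)
m <- mean(charges_sim)
skew <- mean((charges_sim - m)^3) / mean((charges_sim - m)^2)^1.5

# Points above the upper boxplot fence are the ones drawn as outliers
upper_fence <- unname(quantile(charges_sim, 0.75)) + 1.5 * IQR(charges_sim)
n_outliers <- sum(charges_sim > upper_fence)

cat("skewness:", skew, "\n")
cat("mean:", m, " median:", median(charges_sim), "\n")
cat("points above upper fence:", n_outliers, "\n")
```

A positive skewness together with a mean above the median is the pattern described in the findings; the same three quantities could be computed on insurance_nosmoker$charges.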

Make a regression analysis on dependent variable Charges, and independent variables age, bmi, children, smoker, region and sex.

test<-lm(charges~age+bmi+children+smoker+region+sex,data=ins)
summary(test)

## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region + 
##     sex, data = ins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## sexmale           -131.3      332.9  -0.394 0.693348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

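Judging from the p-values of the coefficients, all are less than 0.05 except those for regionnorthwest and sexmale.

Based on these findings, we refit the model without sex and without the northwest observations. The code below sets region to missing for the northwest rows, so lm() drops those 325 observations and northeast remains the reference region. Merging northwest into the northeast reference level would instead keep all 1338 observations.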

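# Split by region; the groups come out in alphabetical order:
# northeast, northwest, southeast, southwest
region_split <- split(ins, ins$region)

region1 <- region_split[[1]]
region2 <- region_split[[2]]
region3 <- region_split[[3]]
region4 <- region_split[[4]]

# Mark the northwest rows as missing so that lm() drops them
region2$region <- NA

new <- rbind(region1, region2, region3, region4)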

test2<-lm(charges~age+bmi+children+smoker+region,data=new)
summary(test2)

## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11958.1  -2900.3   -850.6   1712.4  29337.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -12290.24    1079.37 -11.386   <2e-16 ***
## age                260.54      13.62  19.126   <2e-16 ***
## bmi                343.02      31.58  10.861   <2e-16 ***
## children           377.40     156.30   2.415   0.0159 *  
## smokeryes        24486.74     464.80  52.682   <2e-16 ***
## regionsoutheast  -1079.18     480.48  -2.246   0.0249 *  
## regionsouthwest   -938.55     476.64  -1.969   0.0492 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6039 on 1006 degrees of freedom
##   (325 observations deleted due to missingness)
## Multiple R-squared:  0.7648, Adjusted R-squared:  0.7634 
## F-statistic: 545.3 on 6 and 1006 DF,  p-value: < 2.2e-16

The refitted model is: charges = -12290.24 + 260.54 × age + 343.02 × bmi + 377.40 × children + 24486.74 × I(smoker=yes) - 1079.18 × I(region=southeast) - 938.55 × I(region=southwest), where I(·) is the indicator function and northeast is the reference region. All the p-values are less than 0.05 here.

Hence, the fitted model gives a rough approximation of the insurance charges for a specific individual.
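As a worked example of using the fitted equation, the sketch below plugs a made-up individual into the refitted coefficients. The coefficient values are copied from the summary(test2) output; the function name predict_charge and the example individual (age 40, BMI 30, two children, northeast) are hypothetical.

```r
# Coefficients copied from the summary(test2) output
coefs <- c(intercept = -12290.24, age = 260.54, bmi = 343.02,
           children = 377.40, smokeryes = 24486.74,
           regionsoutheast = -1079.18, regionsouthwest = -938.55)

# Hypothetical helper: evaluates the fitted equation for one individual.
# The logical comparisons act as the indicator functions I(.).
predict_charge <- function(age, bmi, children, smoker, region) {
  unname(coefs["intercept"] +
         coefs["age"] * age +
         coefs["bmi"] * bmi +
         coefs["children"] * children +
         coefs["smokeryes"] * (smoker == "yes") +
         coefs["regionsoutheast"] * (region == "southeast") +
         coefs["regionsouthwest"] * (region == "southwest"))
}

nonsmoker <- predict_charge(40, 30, 2, "no", "northeast")
smoker    <- predict_charge(40, 30, 2, "yes", "northeast")
print(nonsmoker)  # 9176.76
print(smoker)     # 33663.5
```

The two predictions differ by exactly the smokeryes coefficient, 24486.74, which is the model's estimate of the extra charge for a smoker with all other attributes held fixed. With the fitted object available, predict(test2, newdata = ...) would give the same values up to rounding of the printed coefficients.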

Hypothesis Testing

Test1:

The first step is to check if the variances of charges for smokers and nonsmokers are equal.

H0: Variances of charge for smokers and nonsmokers are equal. vs H1: Variances of charge for smokers and nonsmokers are not equal.

var.test(insurance_smoker$charges,insurance_nosmoker$charges)

## 
##  F test to compare two variances
## 
## data:  insurance_smoker$charges and insurance_nosmoker$charges
## F = 3.7079, num df = 273, denom df = 1063, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  3.087603 4.500341
## sample estimates:
## ratio of variances 
##           3.707885

Based on the output, the p-value is below 2.2e-16, far less than 0.05, hence we reject the null hypothesis and conclude that the variances of charges for smokers and nonsmokers are not equal.
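The F statistic reported by var.test() is simply the ratio of the two sample variances, with n - 1 degrees of freedom for each sample. The sketch below reproduces the statistic and the two-sided p-value by hand on simulated data; the vectors x and y and their parameters are made up and only stand in for the two charge vectors.

```r
set.seed(1)
# Simulated stand-ins for the two groups (hypothetical values)
x <- rnorm(50, mean = 30000, sd = 11000)
y <- rnorm(200, mean = 8000, sd = 6000)

# F statistic: ratio of sample variances
f_manual <- var(x) / var(y)
df1 <- length(x) - 1
df2 <- length(y) - 1

# Two-sided p-value: twice the smaller tail probability
lower_tail <- pf(f_manual, df1, df2)
p_manual <- 2 * min(lower_tail, 1 - lower_tail)

res <- var.test(x, y)
cat("manual F:", f_manual, " var.test F:", res$statistic, "\n")
cat("manual p:", p_manual, " var.test p:", res$p.value, "\n")
```

Note that the F test assumes both samples are normally distributed, so its result should be read with some caution for strongly skewed charges.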

Because the variances are unequal and the sample mean charge for nonsmokers is lower than that for smokers, we conduct a one-sided Welch two-sample t-test of the null hypothesis that the mean insurance charge for nonsmokers is at least as high as that for smokers.

H0: Mean insurance charge for nonsmokers is greater than or equal to that for smokers. vs H1: Mean insurance charge for nonsmokers is less than that for smokers.

t.test(insurance_nosmoker$charges,insurance_smoker$charges, var.equal = FALSE, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  insurance_nosmoker$charges and insurance_smoker$charges
## t = -32.752, df = 311.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -22426.4
## sample estimates:
## mean of x mean of y 
##  8434.268 32050.232

Based on the output, the p-value is below 2.2e-16, far less than 0.05, hence we reject the null hypothesis and conclude that the mean insurance charge for nonsmokers is less than that for smokers.
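The fractional degrees of freedom in the Welch output (df = 311.85) come from the Welch-Satterthwaite approximation. The sketch below computes the Welch statistic, degrees of freedom and one-sided p-value by hand and compares them with t.test(); the data are simulated with made-up parameters and are not the insurance charges.

```r
set.seed(2)
# Simulated stand-ins for nonsmoker and smoker charges (hypothetical values)
x <- rnorm(200, mean = 8000, sd = 6000)
y <- rnorm(60, mean = 32000, sd = 11500)

# Squared standard errors of the two sample means
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# One-sided p-value for alternative = "less"
p_manual <- pt(t_manual, df_manual)

res <- t.test(x, y, var.equal = FALSE, alternative = "less")
cat("manual t:", t_manual, " t.test t:", res$statistic, "\n")
cat("manual df:", df_manual, " t.test df:", res$parameter, "\n")
```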

Test2:

The first step is to check if the variances of charges for smokers with BMI>30 and smokers with BMI<=30 are equal.

H0: Variances of charge for smokers with BMI>30 and smokers with BMI<=30 are equal. vs H1: Variances of charge for smokers with BMI>30 and smokers with BMI<=30 are not equal.

var.test(bmi_factor_0$charges,bmi_factor_1$charges)

## 
##  F test to compare two variances
## 
## data:  bmi_factor_0$charges and bmi_factor_1$charges
## F = 0.74981, num df = 129, denom df = 143, p-value = 0.09617
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.535560 1.052907
## sample estimates:
## ratio of variances 
##          0.7498124

Based on the output, p-value=0.09617>0.05, hence we cannot reject the null hypothesis and we treat the variances of charges for smokers with BMI>30 and smokers with BMI<=30 as equal.

Because the variances can be treated as equal and the sample mean charge for smokers with BMI<=30 is lower than that for smokers with BMI>30, we conduct a one-sided pooled two-sample t-test of the null hypothesis that the mean charge for smokers with BMI<=30 is at least as high as that for smokers with BMI>30.

H0: Mean charge for smokers with BMI<=30 is greater than or equal to that for smokers with BMI>30. vs H1: Mean charge for smokers with BMI<=30 is lower than that for smokers with BMI>30.

t.test(bmi_factor_0$charges,bmi_factor_1$charges, var.equal = TRUE, alternative = "less")

## 
##  Two Sample t-test
## 
## data:  bmi_factor_0$charges and bmi_factor_1$charges
## t = -30.697, df = 272, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -19230.86
## sample estimates:
## mean of x mean of y 
##  21369.22  41692.81

Based on the output, the p-value is below 2.2e-16, far less than 0.05, hence we reject the null hypothesis and conclude that the mean charge for smokers with BMI<=30 is lower than that for smokers with BMI>30.

Test3:

Make a Chi-square test of independence to test whether sex is associated with smoking status.

H0: Sex and smoking status are independent. vs H1: Sex and smoking status are not independent.

table<-xtabs(~sex+smoker,data=ins)
chisq.test(table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 7.3929, df = 1, p-value = 0.006548

From the output, p-value=0.006548<0.05, hence we reject the null hypothesis and conclude that sex and smoking status are associated, i.e. the proportion of smokers differs between females and males.
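For a 2×2 table, chisq.test() compares the observed counts with the counts expected under independence (row total × column total / grand total) and applies Yates' continuity correction by default. The sketch below reproduces that calculation by hand. The counts are invented for illustration and are not the sex-by-smoker counts of the insurance data.

```r
# Hypothetical 2x2 table of counts (not the real data)
counts <- matrix(c(50, 30,
                   20, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("female", "male"),
                                 smoker = c("no", "yes")))

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Yates' correction: shrink each |observed - expected| by 0.5
# (valid as written here because every difference exceeds 0.5)
chi_manual <- sum((abs(counts - expected) - 0.5)^2 / expected)
p_manual <- pchisq(chi_manual, df = 1, lower.tail = FALSE)

res <- chisq.test(counts)
cat("manual X-squared:", chi_manual, " chisq.test:", res$statistic, "\n")
cat("manual p:", p_manual, " chisq.test p:", res$p.value, "\n")
```

A significant result only says that the two variables are associated; it does not say that sex causes a difference in smoking.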

Discussion

From steps above, we can conclude that:

Smokers are charged far more than nonsmokers on average, for both females and males.
For nonsmokers, typical charges for females and males differ only slightly. For smokers, the difference is noticeably larger, with males charged more.
Smokers are charged far more than nonsmokers in every region.
For nonsmokers, charges differ only slightly between regions. However, typical charges for smokers in the southeast and southwest are higher than those in the northeast and northwest.
Nonsmokers form a single cluster in the plot of charges against BMI, while smokers split into two groups, one with BMI below 30 and one with BMI above 30.
Among smokers, those with BMI above 30 are charged more than the others.
Among smokers, charges increase with BMI, while among nonsmokers charges show no clear relationship with BMI.
Generally speaking, smokers with bmi_factor=1, i.e. BMI>30, are charged more than those with bmi_factor=0, i.e. BMI<=30.
Females and males are spread approximately evenly across the two BMI groups.
When the other variables are held fixed, the higher the BMI, the higher the charge.
Charges for smokers with bmi_factor=1, i.e. BMI>30, are clearly higher than charges for smokers with bmi_factor=0, i.e. BMI<=30.
For smokers with bmi_factor=1, the density is centred around BMI=35 and charge=40000, indicating that smokers with BMI above 30 have an average BMI of about 35 and an average charge of about 40000.
Similarly, for smokers with bmi_factor=0, the density is centred around BMI=28 and charge=20000, indicating that smokers with BMI of 30 or below have an average BMI of about 28 and an average charge of about 20000.
For nonsmokers, charges increase with age for both females and males.
Although most people of the same age are charged similarly, there are many outliers with high charges.
The histogram with density plot of Charges shows positive skewness, which indicates that for nonsmokers, insurance charges concentrate mainly on relatively low values.
The density function also shows a thick tail on the right, which explains why the scatter plot of charges against age by sex for nonsmokers has many outliers.
The insurance charges for a specific individual can be roughly approximated with the fitted model: charges = -12290.24 + 260.54 × age + 343.02 × bmi + 377.40 × children + 24486.74 × I(smoker=yes) - 1079.18 × I(region=southeast) - 938.55 × I(region=southwest), where I(·) is the indicator function and northeast is the reference region.
Sex and smoking status are associated: the proportion of smokers differs between females and males.