Fangwen Dang, ID:3664793
Insurance is one of the most important components of the national economy and has a significant influence on the healthy development of the financial industry. Studying it is therefore useful for exploring room for development, supporting the growth of the insurance industry, and stimulating the wider economy.
As we know, insurance charges generally differ between individuals. Based on surveys and experience, I select smoking status, gender, body mass index, region, age and number of children as potential factors affecting insurance charges, and carry out the following analysis to explore the relationship between insurance charges and these individual characteristics.
First, I obtained an open data set from the website “https://pan.baidu.com/s/1eSL3gUu”. The data set insurance.csv describes the insurance charges of individuals together with their attributes: smoker (smoking status), sex (gender), bmi (body mass index), region, age and number of children.
Next, I use RStudio to analyze the data statistically and graphically in order to find the factors that significantly affect an individual's insurance charges.
library(dplyr)    # group_by(), summarise(), n()
library(ggplot2)  # ggplot() and the geoms used below
setwd("/Users/meiruoqi/Desktop")
ins <- read.table("insurance.csv",sep=',',header=TRUE)
Descriptive statistics of age grouped by smoker
grp<-group_by(ins,smoker)
summarise(grp,Min = min(age,na.rm = TRUE),
Q1 = quantile(age,probs = .25,na.rm = TRUE),
Median = median(age, na.rm = TRUE),
Q3 = quantile(age,probs = .75,na.rm = TRUE),
Max = max(age,na.rm = TRUE),
Mean = mean(age, na.rm = TRUE),
SD = sd(age, na.rm = TRUE),
n = n(),
Missing = sum(is.na(age)))
Descriptive statistics of bmi grouped by smoker
grp<-group_by(ins,smoker)
summarise(grp,Min = min(bmi,na.rm = TRUE),
Q1 = quantile(bmi,probs = .25,na.rm = TRUE),
Median = median(bmi, na.rm = TRUE),
Q3 = quantile(bmi,probs = .75,na.rm = TRUE),
Max = max(bmi,na.rm = TRUE),
Mean = mean(bmi, na.rm = TRUE),
SD = sd(bmi, na.rm = TRUE),
n = n(),
Missing = sum(is.na(bmi)))
Descriptive statistics of children grouped by smoker
summarise(grp,Min = min(children,na.rm = TRUE),
Q1 = quantile(children,probs = .25,na.rm = TRUE),
Median = median(children, na.rm = TRUE),
Q3 = quantile(children,probs = .75,na.rm = TRUE),
Max = max(children,na.rm = TRUE),
Mean = mean(children, na.rm = TRUE),
SD = sd(children, na.rm = TRUE),
n = n(),
Missing = sum(is.na(children)))
Descriptive statistics of age, bmi, and children grouped by sex or region can be obtained in the same way; they are omitted here.
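Since the same nine summaries are computed three times above, the repetition could be wrapped in a small helper. The following is only a sketch in base R: the function name describe_by and the toy data frame are hypothetical and stand in for the columns of insurance.csv.

```r
# Hypothetical helper: one row of summary statistics per group (base R only).
describe_by <- function(x, g) {
  stats <- function(v) c(
    Min     = min(v, na.rm = TRUE),
    Q1      = unname(quantile(v, 0.25, na.rm = TRUE)),
    Median  = median(v, na.rm = TRUE),
    Q3      = unname(quantile(v, 0.75, na.rm = TRUE)),
    Max     = max(v, na.rm = TRUE),
    Mean    = mean(v, na.rm = TRUE),
    SD      = sd(v, na.rm = TRUE),
    n       = length(v),
    Missing = sum(is.na(v))
  )
  # split() groups the values, lapply() summarises each group,
  # rbind() stacks the per-group vectors into a matrix.
  do.call(rbind, lapply(split(x, g), stats))
}

# Toy data (made up for illustration, not taken from insurance.csv)
toy <- data.frame(
  smoker = c("no", "no", "no", "yes", "yes", "yes"),
  age    = c(20, 30, 40, 25, 35, NA)
)
res <- describe_by(toy$age, toy$smoker)
print(res)
```

With the real data, a call such as describe_by(ins$bmi, ins$region) would give the per-region summary of bmi, assuming ins has been read in as above.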
p1 <- ggplot(data=ins,aes(x=sex, y=charges, fill=smoker))+
geom_boxplot()+
ggtitle("Charges of Sex and Smoker")+
xlab("Sex")+
ylab("Charges")
p1
Findings:
Smokers are charged far more than nonsmokers, for both females and males (note that the boxplots show medians rather than means).
Median charges of female and male nonsmokers differ only slightly, while male smokers' median charge is higher than female smokers'.
There are many outliers among nonsmokers.
p2 <- ggplot(data=ins,aes(x=region, y=charges, fill=smoker))+
geom_boxplot()+
ggtitle("Charges of Region and Smoker")+
xlab("Region")+
ylab("Charges")
p2
Findings:
Smokers are charged far more than nonsmokers in every region (the boxplots show medians rather than means).
Median charges of nonsmokers differ only slightly between regions, while smokers in the southeast and southwest are charged more than smokers in the northeast and northwest.
There are many outliers among nonsmokers.
p3 <- ggplot(data=ins,aes(x=bmi, y=charges, color=smoker))+
geom_point()+
ggtitle("Charges of BMI and Smoker")+
xlab("BMI")+
ylab("Charges")
p3
Findings:
Among nonsmokers the points form a single cluster, while among smokers they split into two groups, one with BMI below 30 and one with BMI above 30.
Smokers with BMI above 30 are charged more than everyone else.
Among smokers, charges increase with BMI, while among nonsmokers no comparable trend is apparent.
listins <- split(ins, ins$smoker)
insurance_nosmoker <- listins[[1]]
insurance_smoker <- listins[[2]]
insurance_smoker$bmi_factor <- as.factor(ifelse(insurance_smoker$bmi>30,1,0))
smoker_split <- split(insurance_smoker, insurance_smoker$bmi_factor)
bmi_factor_0 <- smoker_split[[1]]
bmi_factor_1 <- smoker_split[[2]]
p4 <- ggplot(data=insurance_smoker,aes(x=bmi, y=charges, color=sex,shape=bmi_factor))+
geom_point()+
ggtitle("Charges of BMI by Smoker and BMI_factor")+
xlab("BMI")+
ylab("Charges")
p4
Findings:
Generally speaking, smokers with bmi_factor=1, i.e. BMI>30, are charged more than those with bmi_factor=0, i.e. BMI<=30.
Both sexes are spread approximately evenly across the two groups.
The higher the BMI, the higher the charge.
p5 <- ggplot(data=insurance_smoker,aes(x=bmi, y=charges,color=bmi_factor))+
geom_point()+
stat_density2d()+
ggtitle("Charges of BMI_factor")+
xlab("BMI")+
ylab("Charges")
p5
Findings:
Charges for smokers with bmi_factor=1, i.e. BMI>30, are clearly larger than charges for smokers with bmi_factor=0, i.e. BMI<=30.
For smokers with bmi_factor=1, the density contours are centered around BMI=35 and charge=40000, suggesting that smokers with BMI above 30 have an average BMI of about 35 and an average charge of about 40000.
Similarly, for smokers with bmi_factor=0, the density contours are centered around BMI=28 and charge=20000, suggesting that smokers with BMI of 30 or less have an average BMI of about 28 and an average charge of about 20000.
p6 <- ggplot(data=insurance_nosmoker,aes(x=age, y=charges, color=sex))+
geom_point()+
ggtitle("Charges of Age by Sex")+
xlab("Age")+
ylab("Charges")
p6
p7 <- ggplot(data=insurance_nosmoker,aes(x=charges))+
geom_histogram(aes(y=..density..),color="black",fill="grey",binwidth=500)+
geom_density(alpha=.3,fill="blue")+
ggtitle("Histogram of Nonsmoker Charges")+
xlab("Charges")+
ylab("Density")
p7
Findings:
1. Among nonsmokers, charges clearly increase with age for both females and males.
2. Although most people of the same age are charged similarly, there are many outliers with high charges.
test<-lm(charges~age+bmi+children+smoker+region+sex,data=ins)
summary(test)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region +
## sex, data = ins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## sexmale -131.3 332.9 -0.394 0.693348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
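Judging from the p-values of the coefficients, all are less than 0.05 except those for regionnorthwest and sexmale.
Based on this, we refit the model without sex and without a separate northwest term. Note that the code below does this by setting region to 0 for the northwest rows; since 0 is not a valid level of region, those rows end up with a missing region and lm() drops them (the 325 observations reported as deleted due to missingness in the output) rather than merging them into the baseline region.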
region_split<- split(ins, ins$region)
region1<-region_split[[1]]
region2<-region_split[[2]]
region3<-region_split[[3]]
region4<-region_split[[4]]
region2$region<-0
new<-rbind(region1,region2,region3,region4)
test2<-lm(charges~age+bmi+children+smoker+region,data=new)
summary(test2)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11958.1 -2900.3 -850.6 1712.4 29337.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12290.24 1079.37 -11.386 <2e-16 ***
## age 260.54 13.62 19.126 <2e-16 ***
## bmi 343.02 31.58 10.861 <2e-16 ***
## children 377.40 156.30 2.415 0.0159 *
## smokeryes 24486.74 464.80 52.682 <2e-16 ***
## regionsoutheast -1079.18 480.48 -2.246 0.0249 *
## regionsouthwest -938.55 476.64 -1.969 0.0492 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6039 on 1006 degrees of freedom
## (325 observations deleted due to missingness)
## Multiple R-squared: 0.7648, Adjusted R-squared: 0.7634
## F-statistic: 545.3 on 6 and 1006 DF, p-value: < 2.2e-16
The newly fitted model is: charges = -12290.24 + 260.54*age + 343.02*bmi + 377.40*children + 24486.74*I(smoker=yes) - 1079.18*I(region=southeast) - 938.55*I(region=southwest), where I(.) is the indicator function. All the p-values are less than 0.05 here.
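To illustrate how the fitted equation is used, the sketch below plugs in one hypothetical individual (a 40-year-old smoker from the southeast with BMI 30 and two children). The coefficients are copied from the refitted model output; the individual's values are made up for illustration.

```r
# Coefficients copied from the summary(test2) output in the report
coefs <- c(intercept = -12290.24, age = 260.54, bmi = 343.02,
           children = 377.40, smokeryes = 24486.74,
           southeast = -1079.18, southwest = -938.55)

# Hypothetical individual (values are assumptions, not from the data set)
x <- c(intercept = 1, age = 40, bmi = 30, children = 2,
       smokeryes = 1, southeast = 1, southwest = 0)

# Predicted charge = sum of coefficient * covariate value
pred <- sum(coefs * x[names(coefs)])
pred  # 32584.32
```

The smoking term alone contributes 24486.74 of this total, which is in line with the large smoker effect seen in the plots.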
Hence, we can roughly approximate the insurance charges for a specific individual with the fitted model above.
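The refit output reports 325 observations deleted due to missingness, so the northwest rows were excluded from that fit instead of being pooled with the baseline region. If pooling was the intent, one possible alternative is to merge the two factor levels before fitting. The sketch below shows this on made-up toy data; the column name region_merged is hypothetical, and this is not the method used in the report.

```r
# Toy data standing in for insurance.csv (values are made up)
set.seed(1)
toy <- data.frame(
  region = factor(rep(c("northeast", "northwest", "southeast", "southwest"),
                      each = 5)),
  age    = rep(c(20, 30, 40, 50, 60), times = 4)
)
toy$charges <- 250 * toy$age + rnorm(nrow(toy), sd = 1000)

# Merge "northwest" into the baseline level "northeast":
# assigning an existing level name to another level combines the two.
toy$region_merged <- toy$region
levels(toy$region_merged)[levels(toy$region_merged) == "northwest"] <- "northeast"

fit <- lm(charges ~ age + region_merged, data = toy)
nobs(fit)                   # all 20 toy rows are used
levels(toy$region_merged)   # "northeast" "southeast" "southwest"
```

Fitted this way, the model keeps every observation and only loses the separate northwest coefficient, so its estimates would generally differ from those reported above.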
Test1:
The first step is to check if the variances of charges for smokers and nonsmokers are equal.
H0: Variances of charge for smokers and nonsmokers are equal. vs H1: Variances of charge for smokers and nonsmokers are not equal.
var.test(insurance_smoker$charges,insurance_nosmoker$charges)
##
## F test to compare two variances
##
## data: insurance_smoker$charges and insurance_nosmoker$charges
## F = 3.7079, num df = 273, denom df = 1063, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 3.087603 4.500341
## sample estimates:
## ratio of variances
## 3.707885
Based on the output, p-value<2.2e-16<<0.05, hence we can reject the null hypothesis and conclude that the variances of charges for smokers and nonsmokers are not equal.
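For readers who want to see what var.test() computes, the following sketch reproduces the F statistic and the two-sided p-value by hand. It uses simulated data (the vectors x and y are made up), not the insurance charges.

```r
# Simulated samples with different spreads (illustrative only)
set.seed(42)
x <- rnorm(30, mean = 10, sd = 3)
y <- rnorm(40, mean = 10, sd = 1.5)

# F statistic: ratio of the two sample variances
f_stat <- var(x) / var(y)
df1 <- length(x) - 1
df2 <- length(y) - 1

# Two-sided p-value: twice the smaller tail probability
p_lower <- pf(f_stat, df1, df2)
p_two_sided <- 2 * min(p_lower, 1 - p_lower)

# Compare with the built-in test
res <- var.test(x, y)
c(manual = f_stat, builtin = unname(res$statistic))
c(manual = p_two_sided, builtin = res$p.value)
```

Note that the F test assumes normally distributed data, which the histogram of nonsmoker charges suggests may not hold well here, so its result is best read as indicative.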
Given the unequal variances, and since the average charge for nonsmokers is lower than that for smokers, we conduct a Welch two-sample t-test with the null hypothesis that the mean insurance charge for nonsmokers is at least as high as that for smokers.
H0: Mean insurance charge for nonsmokers is greater than or equal to that for smokers. vs H1: Mean insurance charge for nonsmokers is less than that for smokers.
t.test(insurance_nosmoker$charges,insurance_smoker$charges, var.equal = FALSE, alternative = "less")
##
## Welch Two Sample t-test
##
## data: insurance_nosmoker$charges and insurance_smoker$charges
## t = -32.752, df = 311.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -22426.4
## sample estimates:
## mean of x mean of y
## 8434.268 32050.232
Based on the output, p-value<2.2e-16<<0.05, hence we can reject the null hypothesis and conclude that the mean insurance charge for nonsmokers is less than that for smokers.
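The Welch statistic and its fractional degrees of freedom (such as df = 311.85 above) come from the Welch-Satterthwaite approximation. The sketch below computes them by hand on simulated data and compares the result with t.test(); the vectors x and y are made up and are not the insurance charges.

```r
# Simulated samples with unequal sizes and variances (illustrative only)
set.seed(7)
x <- rnorm(50, mean = 8000, sd = 2000)
y <- rnorm(20, mean = 12000, sd = 6000)

# Squared standard errors of the two sample means
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
df_welch <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# One-sided p-value for the alternative "mean of x is less than mean of y"
p_less <- pt(t_stat, df_welch)

res <- t.test(x, y, var.equal = FALSE, alternative = "less")
c(manual = t_stat, builtin = unname(res$statistic))
c(manual = df_welch, builtin = unname(res$parameter))
```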
Test2:
The first step is to check whether the variances of charges for smokers with BMI>30 and smokers with BMI<=30 are equal.
H0: Variances of charges for smokers with BMI>30 and smokers with BMI<=30 are equal. vs H1: Variances of charges for smokers with BMI>30 and smokers with BMI<=30 are not equal.
var.test(bmi_factor_0$charges,bmi_factor_1$charges)
##
## F test to compare two variances
##
## data: bmi_factor_0$charges and bmi_factor_1$charges
## F = 0.74981, num df = 129, denom df = 143, p-value = 0.09617
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.535560 1.052907
## sample estimates:
## ratio of variances
## 0.7498124
Based on the output, p-value=0.09617>0.05, hence we cannot reject the null hypothesis: there is no significant evidence that the variances of charges for smokers with BMI>30 and smokers with BMI<=30 differ, so we proceed under the assumption of equal variances.
Given equal variances, and since the average charge for smokers with BMI<=30 is lower than that for smokers with BMI>30, we conduct a pooled (equal-variance) two-sample t-test with the null hypothesis that the mean charge for smokers with BMI<=30 is at least as high as that for smokers with BMI>30.
H0: Mean charge for smokers with BMI<=30 is greater than or equal to that for smokers with BMI>30. vs H1: Mean charge for smokers with BMI<=30 is lower than that for smokers with BMI>30.
t.test(bmi_factor_0$charges,bmi_factor_1$charges, var.equal = TRUE, alternative = "less")
##
## Two Sample t-test
##
## data: bmi_factor_0$charges and bmi_factor_1$charges
## t = -30.697, df = 272, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -19230.86
## sample estimates:
## mean of x mean of y
## 21369.22 41692.81
Based on the output, p-value<2.2e-16<<0.05, hence we can reject the null hypothesis and conclude that the mean charge for smokers with BMI<=30 is lower than that for smokers with BMI>30.
Test3:
We use Pearson's chi-squared test of independence on the sex by smoker contingency table to test whether smoking status is associated with sex.
H0: Sex and smoking status are independent. vs H1: Sex and smoking status are not independent.
table<-xtabs(~sex+smoker,data=ins)
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 7.3929, df = 1, p-value = 0.006548
From the output, p-value=0.006548<0.05, hence we can reject the null hypothesis and conclude that smoking status is associated with sex in this data set.
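To make the chi-squared statistic concrete, the sketch below computes it by hand from the expected counts under independence. The 2x2 table is made up for illustration and does not contain the counts from insurance.csv. The manual calculation omits Yates' continuity correction, so it is compared against chisq.test() with correct = FALSE, whereas the report's output uses the default correction for 2x2 tables.

```r
# Made-up 2x2 contingency table (not the real counts)
tab <- matrix(c(30, 20,
                25, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("female", "male"),
                              smoker = c("no", "yes")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson chi-squared statistic without continuity correction
chi2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_val <- pchisq(chi2, df = df, lower.tail = FALSE)

res <- chisq.test(tab, correct = FALSE)
c(manual = chi2, builtin = unname(res$statistic))
c(manual = p_val, builtin = res$p.value)
```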
From steps above, we can conclude that: