When you have an insurance policy, the company charges you money in exchange for that coverage. That cost is known as the insurance premium. In health insurance, the cost of the premium depends largely on the expected cost of health care. So before we calculate the premium, we need to estimate the expected medical care expenses.
We can estimate expected medical care expenses using linear regression. For this purpose I have used data from here.
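library(readr)  # read_csv()
library(DT)     # datatable()
# Load the data and display it as an interactive table with column filters.
insurance <- read_csv("insurance.csv")
datatable(insurance, filter = "top")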
Here we see the measures of central tendency, along with the range and quartiles, for each variable:
summary(insurance)
##       age            sex                 bmi           children
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000
##     smoker             region             charges
##  Length:1338        Length:1338        Min.   : 1122
##  Class :character   Class :character   1st Qu.: 4740
##  Mode  :character   Mode  :character   Median : 9382
##                                        Mean   :13270
##                                        3rd Qu.:16640
##                                        Max.   :63770
cor(insurance[c("age","bmi","children","charges")])
##                age       bmi   children    charges
## age      1.0000000 0.1092719 0.04246900 0.29900819
## bmi      0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges  0.2990082 0.1983410 0.06799823 1.00000000
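None of the numeric predictors is strongly correlated with charges on its own: age is the strongest at about 0.30, followed by bmi at about 0.20 and children at about 0.07. As a reminder of what cor() computes by default, here is a minimal sketch of the Pearson correlation written out by hand. The vectors are made-up toy values, not rows from insurance.csv, and the helper name pearson is hypothetical.

```r
# Toy vectors for illustration only; these are NOT taken from insurance.csv.
age     <- c(19, 25, 33, 41, 52, 60)
charges <- c(1800, 3200, 4500, 6900, 9800, 12500)

# Pearson correlation: sum of cross-deviations divided by the
# square root of the product of the summed squared deviations.
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson(age, charges)
r_builtin <- cor(age, charges)  # default method is "pearson"

# The two values should agree up to floating-point error.
print(all.equal(r_manual, r_builtin))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.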
First, we fit the full model with all predictors, which is the starting point for backward selection.
model1 <- lm(charges ~ ., data = insurance)
summary(model1)
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11304.9  -2848.1   -982.1   1393.9  29992.8
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
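To see how these coefficients turn into an expected expense, here is a small worked sketch that plugs one profile into the fitted equation. The coefficients are copied from the summary(model1) output above as printed (so they are rounded), and the policyholder is a hypothetical example, not a row from the data. The region is assumed to be the reference level (northeast in this dataset), so no region coefficient is added.

```r
# Coefficients copied from the summary(model1) output (rounded as printed).
coefs <- c(intercept = -11938.5, age = 256.9, sexmale = -131.3,
           bmi = 339.2, children = 475.5, smokeryes = 23848.5)

# Hypothetical policyholder (an assumption for illustration):
# 30-year-old male, BMI 25, no children, reference region.
base <- coefs[["intercept"]] +
  coefs[["age"]] * 30 +
  coefs[["sexmale"]] * 1 +
  coefs[["bmi"]] * 25 +
  coefs[["children"]] * 0

nonsmoker <- base                         # smokeryes = 0
smoker    <- base + coefs[["smokeryes"]]  # smokeryes = 1

print(round(c(nonsmoker = nonsmoker, smoker = smoker), 1))
```

For this profile the model predicts about 4,117 for a non-smoker and about 27,966 for a smoker. The gap is exactly the smokeryes coefficient, which is by far the largest effect in the model.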
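Next, we add an indicator for obesity (BMI of 30 or more) and include its interaction with smoking status. Note that the bmi30:smoker term adds only the interaction, without separate main effects for bmi30 or smoker.
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
model2 <- lm(charges ~ age + children + bmi + sex + bmi30:smoker + region, data = insurance)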
summary(model2)
##
## Call:
## lm(formula = charges ~ age + children + bmi + sex + bmi30:smoker +
##     region, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18013.0  -3546.5  -1807.5   -410.5  28061.4
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -1333.70    1237.31  -1.078 0.281276
## age               263.43      11.43  23.044  < 2e-16 ***
## children          570.57     132.36   4.311 1.75e-05 ***
## bmi                85.16      44.85   1.899 0.057817 .
## sexmale          -329.53     320.04  -1.030 0.303369
## regionnorthwest  -364.13     457.52  -0.796 0.426249
## regionsoutheast  -489.78     460.42  -1.064 0.287632
## regionsouthwest -1581.13     458.94  -3.445 0.000588 ***
## bmi30:smokerno  -3370.09     542.20  -6.216 6.83e-10 ***
## bmi30:smokeryes 29782.47     692.36  43.016  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5823 on 1328 degrees of freedom
## Multiple R-squared: 0.7704, Adjusted R-squared: 0.7688
## F-statistic: 495 on 9 and 1328 DF, p-value: < 2.2e-16
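Compared with the first model, the residual standard error drops from 6062 to 5823 and the adjusted R-squared rises from 0.7494 to 0.7688. Because the second model uses one more coefficient, adjusted R-squared is the fairer comparison. The sketch below recomputes it from the printed R-squared values using the standard formula; the helper name adj_r2 is hypothetical, and the inputs are the rounded figures from the two summaries.

```r
# Adjusted R-squared penalizes R-squared for the number of predictors p:
#   1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 1338  # number of rows, per the summary(insurance) output

# R-squared values and predictor counts read off the two model summaries
# (8 and 9 model degrees of freedom in the F-statistic lines).
m1 <- adj_r2(0.7509, n, p = 8)
m2 <- adj_r2(0.7704, n, p = 9)

print(round(c(model1 = m1, model2 = m2), 4))
```

The recomputed values round to 0.7494 and 0.7688, matching the printed summaries, so the improvement is not just an artifact of the extra term.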
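We can also compare the distribution of charges across regions with a box plot.
library(ggplot2)
library(ggthemes)  # theme_economist()
# One box per region, showing the median, quartiles, and outliers of charges.
ggplot(insurance) +
  aes(x = region, y = charges) +
  geom_boxplot(fill = "#75b8d1") +
  labs(x = "Region", y = "Charges", title = "Box Plot", caption = "Visualization") +
  theme_economist()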