Introduction

  • When you buy an insurance policy, the company charges you money in exchange for that coverage. That cost is known as the insurance premium. In health insurance, the premium depends largely on the expected cost of health care, so before we calculate the premium we need to estimate the expected medical expenses.

  • We can estimate expected medical expenses using linear regression. For this purpose I have used the data from here.

Abbreviations used in the data

  • BMI = Body mass index (in kg/m^2)
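
As a reminder of what the unit kg/m^2 means, BMI is body weight in kilograms divided by the square of height in metres. A minimal sketch follows; the helper name `bmi_from` and the input values are hypothetical and do not come from the insurance data.

```r
# Hypothetical helper: BMI from weight in kilograms and height in metres.
bmi_from <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Illustrative values only (not taken from the insurance data).
example_bmi <- bmi_from(weight_kg = 70, height_m = 1.75)
example_bmi
```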

Reading data

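library(readr)  # read_csv()
library(DT)     # datatable()

# Load the data and display it as an interactive table with column filters
insurance <- read_csv("insurance.csv")
datatable(insurance, filter = "top")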

Descriptive Analysis

Here we see the measures of central tendency (mean and median) for each numeric variable, along with its quartiles, minimum and maximum.

summary(insurance)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

Here we see the measures of dispersion.
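
No code or output accompanies this step, so the following is only a sketch of how such measures could be computed, not the original analysis. It uses a small made-up data frame `toy` with the same column names as the numeric columns of `insurance`; the values are invented for illustration.

```r
# Toy stand-in for the numeric columns of `insurance` (made-up values).
toy <- data.frame(
  age     = c(19, 28, 33, 45, 62),
  bmi     = c(27.9, 33.0, 22.7, 30.4, 26.3),
  charges = c(1700, 4400, 21900, 8200, 27800)
)

# Standard deviation, interquartile range, and width of the range per column.
dispersion <- sapply(toy, function(x) {
  c(sd = sd(x), IQR = IQR(x), range = max(x) - min(x))
})
dispersion
```

With the real data, the same call could be applied to `insurance[c("age", "bmi", "children", "charges")]`.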

Correlation

cor(insurance[c("age","bmi","children","charges")])
##                age       bmi   children    charges
## age      1.0000000 0.1092719 0.04246900 0.29900819
## bmi      0.1092719 1.0000000 0.01275890 0.19834097
## children 0.0424690 0.0127589 1.00000000 0.06799823
## charges  0.2990082 0.1983410 0.06799823 1.00000000
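
In the matrix above, age has the strongest linear association with charges (about 0.30), followed by bmi (about 0.20) and children (about 0.07); the predictors are only weakly correlated with each other. To check whether a single correlation differs from zero, `cor.test()` from base R can be used. The sketch below runs on simulated data, since the variables `age` and `charges` here are generated for illustration and are not the insurance data.

```r
# Simulated data for illustration only; the slope and noise level are assumptions.
set.seed(1)
age <- sample(18:64, 200, replace = TRUE)
charges <- 250 * age + rnorm(200, sd = 6000)

# Pearson correlation with a significance test and confidence interval.
ct <- cor.test(age, charges)
ct$estimate   # sample correlation coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation
ct$conf.int   # 95% confidence interval for the correlation
```

With the real data the equivalent call would be `cor.test(insurance$age, insurance$charges)`.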

Predictive analysis

First we fit the full model with all predictors, which is the starting point for backward selection.

model1 <- lm(charges ~ ., data = insurance)
summary(model1)
## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16
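
In this output, `sexmale` and `regionnorthwest` have large p-values, which makes them natural candidates for removal. One way to carry out the backward step automatically is base R's `step()`, which drops terms by AIC rather than by p-value. The sketch below is an assumption about how that could look, not the procedure used above, and it runs on simulated data: the data frame `toy`, its `noise` column, and all coefficients are made up.

```r
# Simulated data for illustration only (not the insurance data).
set.seed(42)
n <- 300
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)),
  noise  = rnorm(n)  # predictor unrelated to the response by construction
)
toy$charges <- 250 * toy$age + 300 * toy$bmi +
  24000 * (toy$smoker == "yes") + rnorm(n, sd = 6000)

# Start from the full model and remove terms while AIC keeps improving.
full <- lm(charges ~ ., data = toy)
reduced <- step(full, direction = "backward", trace = 0)

# Terms that survived the backward elimination.
formula(reduced)
```

With the real data, the analogous call would be `step(model1, direction = "backward")`.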

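Improve the model

To improve the model, we add an indicator for obesity (BMI of 30 or more) and let its effect on charges differ between smokers and non-smokers.

# 1 if BMI is 30 or more, 0 otherwise
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

# bmi30:smoker adds one bmi30 coefficient per smoker level
model2 <- lm(charges ~ age + children + bmi + sex + bmi30:smoker + region, data = insurance)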

summary(model2)
## 
## Call:
## lm(formula = charges ~ age + children + bmi + sex + bmi30:smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18013.0  -3546.5  -1807.5   -410.5  28061.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1333.70    1237.31  -1.078 0.281276    
## age               263.43      11.43  23.044  < 2e-16 ***
## children          570.57     132.36   4.311 1.75e-05 ***
## bmi                85.16      44.85   1.899 0.057817 .  
## sexmale          -329.53     320.04  -1.030 0.303369    
## regionnorthwest  -364.13     457.52  -0.796 0.426249    
## regionsoutheast  -489.78     460.42  -1.064 0.287632    
## regionsouthwest -1581.13     458.94  -3.445 0.000588 ***
## bmi30:smokerno  -3370.09     542.20  -6.216 6.83e-10 ***
## bmi30:smokeryes 29782.47     692.36  43.016  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5823 on 1328 degrees of freedom
## Multiple R-squared:  0.7704, Adjusted R-squared:  0.7688 
## F-statistic:   495 on 9 and 1328 DF,  p-value: < 2.2e-16
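
Two points about this model may be worth illustrating. First, in an R formula `bmi30:smoker` adds only the interaction, so the model above has no separate main effect for `smoker`, whereas `bmi30 * smoker` expands to `bmi30 + smoker + bmi30:smoker`. Second, once a model is fitted, `predict()` gives the expected charges for a new person, which is the quantity the premium calculation needs. The sketch below shows both on simulated data; the data frame `toy`, its coefficients, and `new_person` are all hypothetical, and the `*` specification is shown as an alternative, not as the model used above.

```r
# Simulated data for illustration only (not the insurance data).
set.seed(7)
n <- 300
toy <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))
)
toy$bmi30 <- ifelse(toy$bmi >= 30, 1, 0)
toy$charges <- 250 * toy$age + 100 * toy$bmi +
  15000 * (toy$smoker == "yes") +
  18000 * toy$bmi30 * (toy$smoker == "yes") + rnorm(n, sd = 5000)

# ":" keeps only the interaction; "*" also keeps both main effects.
fit_colon <- lm(charges ~ age + bmi + bmi30:smoker, data = toy)
fit_star  <- lm(charges ~ age + bmi + bmi30 * smoker, data = toy)
attr(terms(fit_colon), "term.labels")
attr(terms(fit_star), "term.labels")

# Expected charges for one hypothetical person, with a 95% prediction interval.
new_person <- data.frame(age = 40, bmi = 32, bmi30 = 1, smoker = "yes")
pred <- predict(fit_star, newdata = new_person, interval = "prediction")
pred
```

With the real data, `predict(model2, newdata = ...)` would need a data frame containing every predictor in `model2`, including `children`, `sex`, and `region`.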

Data Visualization

library(ggplot2)
library(ggthemes)  # theme_economist()

# Distribution of charges within each region
ggplot(insurance, aes(x = region, y = charges)) +
  geom_boxplot(fill = "#75b8d1") +
  labs(x = "Region", y = "Charges", title = "Charges by Region", caption = "Visualization") +
  theme_economist()