Insurance Premium Data Set

#Calling Data
data1 <- read_csv("insurance.csv")
data1$children <- ifelse(data1$children>0, 1, 0)
data1$smoker <- ifelse(data1$smoker=="yes", 1, 0)
data1 <- subset(data1, select=-c(age, region))

#Data Table
DT::datatable(data1)

Data Distribution

#Histogram of Response
hist(data1$expenses, xlab="Expenses", ylab="Frequency", main = "Histogram of Response")

OLS Model

\({\hat{Expenses} = \beta_0 + \beta_{1}Sex + \beta_{2}Children + \beta_{3}Smoker + \beta_{4}BMI}\)

#ols reg
reg <- lm(expenses ~ bmi + smoker + sex + children, data=data1)
summary(reg)
## 
## Call:
## lm(formula = expenses ~ bmi + smoker + sex + children, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15058  -4666  -1055   3649  31563 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4054.2     1025.4  -3.954 8.09e-05 ***
## bmi            387.6       31.7  12.225  < 2e-16 ***
## smoker       23605.8      479.8  49.204  < 2e-16 ***
## sexmale       -303.4      387.7  -0.783 0.433933    
## children      1329.2      390.1   3.407 0.000676 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7061 on 1333 degrees of freedom
## Multiple R-squared:  0.6611, Adjusted R-squared:   0.66 
## F-statistic:   650 on 4 and 1333 DF,  p-value: < 2.2e-16
#BP test for heteroskedasticity
bptest(reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  reg
## BP = 83.023, df = 4, p-value < 2.2e-16

To validate our use of quantile regression, we can test the assumption of equal variance of residuals.


Quantile Regression Lines

Quantile Model

\({\hat{Expenses} = \beta_0(\tau) + \beta_{1}(\tau)Sex + \beta_{2}(\tau)Children + \beta_{3}(\tau)Smoker + \beta_{4}(\tau)BMI}\)

#Quantile reg tau=0.10, 0.50, 0.90
#rq comes for the "quantreg" package from Koenker
quantreg10 <- rq(expenses ~ bmi + smoker + sex + children , data=data1, tau=0.10)
summary(quantreg10)
## 
## Call: rq(formula = expenses ~ bmi + smoker + sex + children, tau = 0.1, 
##     data = data1)
## 
## tau: [1] 0.1
## 
## Coefficients:
##             Value       Std. Error  t value     Pr(>|t|)   
## (Intercept)   644.18000   328.65125     1.96007     0.05020
## bmi            42.29306    11.37587     3.71779     0.00021
## smoker      14884.73222   366.64923    40.59665     0.00000
## sexmale      -433.18403   141.32737    -3.06511     0.00222
## children     1897.32750   165.73634    11.44787     0.00000
quantreg50 <- rq(expenses ~ bmi + smoker + sex + children, data=data1, tau=0.50)
summary(quantreg50)
## 
## Call: rq(formula = expenses ~ bmi + smoker + sex + children, tau = 0.5, 
##     data = data1)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             Value       Std. Error  t value     Pr(>|t|)   
## (Intercept) -4017.07302  1217.90944    -3.29833     0.00100
## bmi           341.34528    38.15926     8.94528     0.00000
## smoker      27348.72226  1186.35661    23.05270     0.00000
## sexmale      -335.12943   456.82033    -0.73361     0.46331
## children     1449.93226   490.15922     2.95808     0.00315
quantreg90 <- rq(expenses ~ bmi + smoker + sex + children, data=data1, tau=0.90)
summary(quantreg90)
## 
## Call: rq(formula = expenses ~ bmi + smoker + sex + children, tau = 0.9, 
##     data = data1)
## 
## tau: [1] 0.9
## 
## Coefficients:
##             Value       Std. Error  t value     Pr(>|t|)   
## (Intercept)  3840.95325  1930.46721     1.98965     0.04683
## bmi           358.73252    66.37308     5.40479     0.00000
## smoker      29509.19984   637.46160    46.29173     0.00000
## sexmale      -853.90098   599.75667    -1.42375     0.15475
## children     1076.62878   611.84014     1.75966     0.07870

For instance our interpretation for the Smoker variable would be as follows: individuals who classify as smokers spend an additional 14,844.73 for those who with lower medical insurance premiums (10th quantile), smokers with higher insurance premiums spend 29,509.72 (90th quantile). Simply put, for individuals who smoke the cost of insurance premiums increase as quantiles increase.


95% Confidence Interval for Smoker Variable


\(CI_{95} = (\beta_i - 1.96 * SE , \beta_i + 1.96 * SE)\)

#Confidence Intervals
#OLS Smoker = 23605.8
lower10 <- 14884.73222 - 1.96*366.64923
upper10 <- 14884.73222 + 1.96*366.64923

lower50 <- 27348.72226 - 1.96*1186.35661
upper50 <- 27348.72226 + 1.96*1186.35661

lower90 <- 29509.19984 - 1.96*637.46160
upper90 <- 29509.19984 + 1.96*637.46160

lowerlimit <- c(lower10, lower50, lower90)
upperlimit <- c(upper10, upper50, upper90)

CI <- data.frame(lowerlimit, upperlimit)
CI
##   lowerlimit upperlimit
## 1   14166.10   15603.36
## 2   25023.46   29673.98
## 3   28259.78   30758.62

Here we can see the \(\beta\) estimate from the normal linear regression output falls well outside the confidence interval for the 10th, 50th, and 90th quantile. Therefore for the quantiles listed, the \(\beta\) estimate for said quantiles for Smoker is significantly different from the OLS model.


Quantile Regression Plots

#QR plots
quantregall<-rq(expenses ~ bmi + smoker + sex + children, tau=seq(0.10, 0.90, by=0.05), data=data1)
quantregplot<-summary(quantregall)
plot(quantregplot)

We can visually see that the conclusions we made from our confidence interval for the Smoker Variable reflect the quantile plot.

40th Quantile for Smoker

#OLS Smoker = 23605.8
lower40 <- 22582.89936 - 1.96*697.89665 
upper40 <- 22582.89936 + 1.96*697.89665
lowerlimit <- c(lower40)
upperlimit <- c(upper40)
CI <- data.frame(lowerlimit, upperlimit)
CI
##   lowerlimit upperlimit
## 1   21215.02   23950.78

Here we can see that the OLS \(\beta\) coefficient for smoker 23,605.8, falls just within the Confidence Interval for the 40th quantile, meaning there is not a significant difference in \(\beta\)’s between \(\tau\)=0.40 and OLS.


Testing for difference in coefficients across different quantiles

anova(quantreg10, quantreg50, quantreg90)
## Quantile Regression Analysis of Deviance Table
## 
## Model: expenses ~ bmi + smoker + sex + children
## Joint Test of Equality of Slopes: tau in {  0.1 0.5 0.9  }
## 
##   Df Resid Df F value    Pr(>F)    
## 1  8     4006   67.74 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Further justifying our use of quantile regression, would be to test for a significant difference between the coefficients of different quantiles. Shown by the Anova table above, we can test for a significant difference between the slopes for the 10th, 50th, and 90th quantile. From the table we see that we have \(p<0.0001\), meaning we would reject the null hypothesis that our slopes are the same and conclude that at least one is different.