#Histogram of Response
hist(data1$expenses, xlab="Expenses", ylab="Frequency", main = "Histogram of Response")
\({\hat{Expenses} = \beta_0 + \beta_{1}Sex + \beta_{2}Children + \beta_{3}Smoker + \beta_{4}BMI}\)
#ols reg
reg <- lm(expenses ~ bmi + smoker + sex + children, data=data1)
summary(reg)
##
## Call:
## lm(formula = expenses ~ bmi + smoker + sex + children, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15058 -4666 -1055 3649 31563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4054.2 1025.4 -3.954 8.09e-05 ***
## bmi 387.6 31.7 12.225 < 2e-16 ***
## smoker 23605.8 479.8 49.204 < 2e-16 ***
## sexmale -303.4 387.7 -0.783 0.433933
## children 1329.2 390.1 3.407 0.000676 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7061 on 1333 degrees of freedom
## Multiple R-squared: 0.6611, Adjusted R-squared: 0.66
## F-statistic: 650 on 4 and 1333 DF, p-value: < 2.2e-16
#BP test for heteroskedasticity
bptest(reg)
##
## studentized Breusch-Pagan test
##
## data: reg
## BP = 83.023, df = 4, p-value < 2.2e-16
To validate our use of quantile regression, we can test the assumption of equal variance of residuals.
\({\hat{Expenses} = \beta_0(\tau) + \beta_{1}(\tau)Sex + \beta_{2}(\tau)Children + \beta_{3}(\tau)Smoker + \beta_{4}(\tau)BMI}\)
#Quantile reg tau=0.10, 0.50, 0.90
#rq comes for the "quantreg" package from Koenker
quantreg10 <- rq(expenses ~ bmi + smoker + sex + children , data=data1, tau=0.10)
summary(quantreg10)
##
## Call: rq(formula = expenses ~ bmi + smoker + sex + children, tau = 0.1,
## data = data1)
##
## tau: [1] 0.1
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 644.18000 328.65125 1.96007 0.05020
## bmi 42.29306 11.37587 3.71779 0.00021
## smoker 14884.73222 366.64923 40.59665 0.00000
## sexmale -433.18403 141.32737 -3.06511 0.00222
## children 1897.32750 165.73634 11.44787 0.00000
quantreg50 <- rq(expenses ~ bmi + smoker + sex + children, data=data1, tau=0.50)
summary(quantreg50)
##
## Call: rq(formula = expenses ~ bmi + smoker + sex + children, tau = 0.5,
## data = data1)
##
## tau: [1] 0.5
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) -4017.07302 1217.90944 -3.29833 0.00100
## bmi 341.34528 38.15926 8.94528 0.00000
## smoker 27348.72226 1186.35661 23.05270 0.00000
## sexmale -335.12943 456.82033 -0.73361 0.46331
## children 1449.93226 490.15922 2.95808 0.00315
quantreg90 <- rq(expenses ~ bmi + smoker + sex + children, data=data1, tau=0.90)
summary(quantreg90)
##
## Call: rq(formula = expenses ~ bmi + smoker + sex + children, tau = 0.9,
## data = data1)
##
## tau: [1] 0.9
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 3840.95325 1930.46721 1.98965 0.04683
## bmi 358.73252 66.37308 5.40479 0.00000
## smoker 29509.19984 637.46160 46.29173 0.00000
## sexmale -853.90098 599.75667 -1.42375 0.15475
## children 1076.62878 611.84014 1.75966 0.07870
For instance our interpretation for the Smoker variable would be as follows: individuals who classify as smokers spend an additional 14,844.73 for those who with lower medical insurance premiums (10th quantile), smokers with higher insurance premiums spend 29,509.72 (90th quantile). Simply put, for individuals who smoke the cost of insurance premiums increase as quantiles increase.
\(CI_{95} = (\beta_i - 1.96 * SE , \beta_i +
1.96 * SE)\)
#Confidence Intervals
#OLS Smoker = 23605.8
lower10 <- 14884.73222 - 1.96*366.64923
upper10 <- 14884.73222 + 1.96*366.64923
lower50 <- 27348.72226 - 1.96*1186.35661
upper50 <- 27348.72226 + 1.96*1186.35661
lower90 <- 29509.19984 - 1.96*637.46160
upper90 <- 29509.19984 + 1.96*637.46160
lowerlimit <- c(lower10, lower50, lower90)
upperlimit <- c(upper10, upper50, upper90)
CI <- data.frame(lowerlimit, upperlimit)
CI
## lowerlimit upperlimit
## 1 14166.10 15603.36
## 2 25023.46 29673.98
## 3 28259.78 30758.62
Here we can see the \(\beta\) estimate from the normal linear regression output falls well outside the confidence interval for the 10th, 50th, and 90th quantile. Therefore for the quantiles listed, the \(\beta\) estimate for said quantiles for Smoker is significantly different from the OLS model.
#QR plots
quantregall<-rq(expenses ~ bmi + smoker + sex + children, tau=seq(0.10, 0.90, by=0.05), data=data1)
quantregplot<-summary(quantregall)
plot(quantregplot)
We can visually see that the conclusions we made from our confidence interval for the Smoker Variable reflect the quantile plot.
#OLS Smoker = 23605.8
lower40 <- 22582.89936 - 1.96*697.89665
upper40 <- 22582.89936 + 1.96*697.89665
lowerlimit <- c(lower40)
upperlimit <- c(upper40)
CI <- data.frame(lowerlimit, upperlimit)
CI
## lowerlimit upperlimit
## 1 21215.02 23950.78
Here we can see that the OLS \(\beta\) coefficient for smoker 23,605.8, falls just within the Confidence Interval for the 40th quantile, meaning there is not a significant difference in \(\beta\)’s between \(\tau\)=0.40 and OLS.
anova(quantreg10, quantreg50, quantreg90)
## Quantile Regression Analysis of Deviance Table
##
## Model: expenses ~ bmi + smoker + sex + children
## Joint Test of Equality of Slopes: tau in { 0.1 0.5 0.9 }
##
## Df Resid Df F value Pr(>F)
## 1 8 4006 67.74 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Further justifying our use of quantile regression, would be to test for a significant difference between the coefficients of different quantiles. Shown by the Anova table above, we can test for a significant difference between the slopes for the 10th, 50th, and 90th quantile. From the table we see that we have \(p<0.0001\), meaning we would reject the null hypothesis that our slopes are the same and conclude that at least one is different.