Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I work in health insurance, so I thought it would be interesting to take a look at healthcare costs.
library(readr)
insuranceData <- read_csv('https://raw.githubusercontent.com/amberferger/DATA605_Discussions/master/AFerger_Wk12_Disc_data.csv')
We can see from the following plot that as the age increases, so do the charges. Intuitively this makes sense – as individuals age, they are more likely to have health-related issues and therefore, more cost is incurred.
plot(insuranceData$age,insuranceData$charges, main="Age vs Charges", xlab="Age", ylab="Charges")
We will now fit a linear model to the data.
ageChgLM <- lm(insuranceData$charges ~ insuranceData$age)
ageChgLM
##
## Call:
## lm(formula = insuranceData$charges ~ insuranceData$age)
##
## Coefficients:
## (Intercept) insuranceData$age
## 3165.9 257.7
Next, we’ll plot the original data with the best fit line:
plot(insuranceData$age,insuranceData$charges, main="Age vs Charges", xlab="Age", ylab="Charges")
abline(ageChgLM)
Just from looking at the plot, it appears that the model follows the general trend of the data, but doesn’t predict actual values very well. (Only a few actual observations fall on the best fit line)
Let’s take a look at the summary information of our model:
summary(ageChgLM)
##
## Call:
## lm(formula = insuranceData$charges ~ insuranceData$age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8059 -6671 -5939 5440 47829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3165.9 937.1 3.378 0.000751 ***
## insuranceData$age 257.7 22.5 11.453 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11560 on 1336 degrees of freedom
## Multiple R-squared: 0.08941, Adjusted R-squared: 0.08872
## F-statistic: 131.2 on 1 and 1336 DF, p-value: < 2.2e-16
RESIDUALS: The residuals should be balanced and close to the mean of \(0\). The median should have value near \(0\), min/max values should be about the same magnitude, and the first and third quartile should also be about the same magnitude. Ultimately, this would mean that the actual observations aren’t too far from the predicted values and there isn’t much variance in the predictive error.
Considering the scale of the data, our median of \(-5,939\) isn’t too horrible. The quartiles have about the same magnitude. Our min and max values are drastically different in magnitude and this is because most of the positive residuals are a lot higher from the best fit line than the negative residuals.
RESIDUAL STANDARD ERROR: At the bottom of the summary, we can see the Residual Standard Error. This measures the variation in the residuals. As a general rule, this value should be 1.5 times the quartile values. For the first quartile: \(6,671 * 1.5 = 10,006.5\) and for the third quartile: \(5,440 *1.5 = 8,160\).
ERROR: As a general rule, a good model has standard errors that are at least five to ten times smaller than the corresponding coefficients. In this case, the standard error for the intercept is \(3.378\) times smaller than the coefficient and the standard error for the age variable is \(11.453\) times smaller. What this means is that there is significant variability in the intercept but little variability in the age.
SIGNIFICANCE: The next thing we will look at is the p-value for the age factor. This measures the probability that the factor is not important in the model. The smaller the value, the more significant the factor is. In this case, we have a p-value of \(2.2e-16\), which means that we can reject the null hypothesis (the factor is not important), and accept the alternative (that it is important.)
MULTIPLE R-SQUARED: This measures how well our model describes the data. Our R-squared value of \(0.08941\) means that the model explains \(8.9\%\) of the data’s variation.
To dig a bit deeper, we can examine the residuals.
plot(fitted(ageChgLM),resid(ageChgLM))
Although there doesn’t appear to be a pattern in the residuals as we move in either direction on the chart, we can see that they are not evenly distributed around \(0\). Our min value is around \(-10,000\) whereas our max value is around \(50,000\). Overall, this tells us that we either need to add in other predictors or create a different model using just this factor.
In a good model we expect the residuals to be normally distributed. By graphing the data in a Q-Q plot, we can see how well the observations follow the line. If they fit the line well, we know that the data is normally distributed.
qqnorm(resid(ageChgLM))
qqline(resid(ageChgLM))
From this analysis, we can see that modeling the healthcare costs using one-factor linear regression with age as the explanatory variable does not produce a very good output for a few reasons: