set.seed(3)
mydata <- read.table("./insurance.csv", header=TRUE, sep=",", dec=".") %>%
sample_n(300)
head(mydata)
## age sex bmi children smoker region charges
## 1 44 female 36.48 0 no northeast 12797.210
## 2 53 female 39.60 1 no southeast 10579.711
## 3 33 female 36.29 3 no northeast 6551.750
## 4 54 female 46.70 2 no southwest 11538.421
## 5 41 male 35.75 1 yes southeast 40273.645
## 6 44 male 21.85 3 no northeast 8891.139
A unit of observation in this dataset is the primary beneficiary (person) of health insurance
The sample size in this data set is equal to 300 units of observation
Definition of variables:
The data was taken from the website Kaggle.com, more specifically from the link https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download
The research question for this analysis is to find out how different variables (age, sex, bmi, n of children, tobacco use and location) influence the amount of medical costs billed by health insurance.
mydata$sexF <- factor(mydata$sex,
labels = c("male", "female"),
levels = c("male", "female")) # Adding a factor variable for "sex"
mydata$smokerF <- factor(mydata$smoker,
labels = c("smoker", "non-smoker"),
levels = c("yes", "no")) # Adding a factor variable for "smoker"
mydata$regionF <- factor(mydata$region,
labels = c("northeast", "southeast", "southwest", "northwest"),
levels = c("northeast", "southeast", "southwest", "northwest")) # Adding a factor for the variable "region"
mydata <- mydata[c( 1, 2, 8, 3, 4, 5, 9, 6, 10, 7)] # Rearranging the column order
The dependent variable in my regression will be “charges” (medical costs billed by health insurance) . The independent variables will be: Age, BMI, Children, smoker, region, sex
What affects the amount of medical costs billed by health insurance in our dataset is most likely the age, bmi, number of children, smoking habits and the geographical region of the person. The age should increase the costs as with older age health deteriorates, a higher BMI also indicates a less healthy individual increasing the costs, the number of children increases the costs as the person insured has more people covered by the insurance and lastly smokers are a lot less healthy so they will have higher costs. Additionally the regions might affect the amount of charges but the gender should not affect it. The article that was used for this theoretical justification is posted on the website of a US insurance company, cited below.
scatterplotMatrix(mydata[ , c(1, 4, 5, 10)],
smooth = FALSE)
From the scatterplots we can see that all of our explanatory numerical variables (age, bmi and children) are positively correlated with the dependent variable (charges).
rcorr(as.matrix(mydata[, c(1, 4, 5, 10)]),
type = "pearson")
## age bmi children charges
## age 1.00 0.07 0.01 0.29
## bmi 0.07 1.00 -0.03 0.23
## children 0.01 -0.03 1.00 0.09
## charges 0.29 0.23 0.09 1.00
##
## n= 300
##
##
## P
## age bmi children charges
## age 0.2226 0.8548 0.0000
## bmi 0.2226 0.5692 0.0000
## children 0.8548 0.5692 0.1041
## charges 0.0000 0.0000 0.1041
reg1 <- lm(data = mydata, charges ~ age + bmi + children + sexF + regionF + smokerF)
vif(reg1)
## GVIF Df GVIF^(1/(2*Df))
## age 1.034395 1 1.017052
## bmi 1.151570 1 1.073112
## children 1.015770 1 1.007854
## sexF 1.029721 1 1.014752
## regionF 1.175090 3 1.027256
## smokerF 1.023141 1 1.011504
mean(vif(reg1))
## [1] 1.143401
sqrt(5)
## [1] 2.236068
Using the variance inflation factor (VIF) test to identify if the regression model has a problem of multicolineartiy. Because one of the variables has more than two categories (region) we have to look at the right GVIF column and compare it with the square root of 5. Comparing it with the square root of 5 which is 2.24 we can see that none of the values are remotely close (all are around 1). Additionally, the mean of VIF is close to one. This tells us that we do not have any multicolinearity with the variables in the regression.
mydata$StdResid <- round(rstandard(reg1), 3)
mydata$CooksD <- round(cooks.distance(reg1), 3)
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
There are some standardized residuals above 3 so we will have to remove them.
shapiro.test(mydata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.88568, p-value = 3.327e-14
H0: The standardized residuals are normally distributed
Ha: The standardized residuals are not normally distributed
Because the p-value is below 0.05 we reject the null hypothesis, this means that we do not have standardized residuals which are normally distributed. However, because our sample consists of 300 observations this is not a big problem
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
The histogram of Cook’s distances has gaps in between the values so we will have to remove some observations.
head(mydata[order(-mydata$StdResid),], 10)
## age sex sexF bmi children smoker smokerF region regionF charges StdResid CooksD
## 218 31 female female 38.095 1 yes smoker northeast northeast 58571.07 4.310 0.097
## 177 24 female female 23.210 0 no non-smoker southeast southeast 25081.77 4.059 0.058
## 78 19 male male 27.265 2 no non-smoker northwest northwest 22493.66 3.587 0.040
## 209 61 female female 33.330 4 no non-smoker southeast southeast 36580.28 3.556 0.060
## 76 33 male male 22.705 0 no non-smoker northwest northwest 21984.47 3.387 0.037
## 31 23 female female 24.225 2 no non-smoker northeast northeast 22395.74 3.240 0.034
## 166 26 female female 29.640 4 no non-smoker northeast northeast 24671.66 3.023 0.046
## 90 53 male male 31.350 0 no non-smoker southeast southeast 27346.04 2.939 0.023
## 244 19 female female 27.930 3 no non-smoker northwest northwest 18838.70 2.683 0.026
## 289 21 male male 31.020 0 no non-smoker southeast southeast 16586.50 2.393 0.015
head(mydata[order(-mydata$CooksD),], 15)
## age sex sexF bmi children smoker smokerF region regionF charges StdResid CooksD
## 218 31 female female 38.095 1 yes smoker northeast northeast 58571.07 4.310 0.097
## 209 61 female female 33.330 4 no non-smoker southeast southeast 36580.28 3.556 0.060
## 177 24 female female 23.210 0 no non-smoker southeast southeast 25081.77 4.059 0.058
## 36 39 female female 18.300 5 yes smoker southwest southwest 19023.26 -2.431 0.054
## 166 26 female female 29.640 4 no non-smoker northeast northeast 24671.66 3.023 0.046
## 78 19 male male 27.265 2 no non-smoker northwest northwest 22493.66 3.587 0.040
## 76 33 male male 22.705 0 no non-smoker northwest northwest 21984.47 3.387 0.037
## 31 23 female female 24.225 2 no non-smoker northeast northeast 22395.74 3.240 0.034
## 206 21 male male 25.700 4 yes smoker southwest southwest 17942.11 -1.962 0.028
## 34 49 female female 23.845 3 yes smoker northeast northeast 24106.91 -2.201 0.026
## 244 19 female female 27.930 3 no non-smoker northwest northwest 18838.70 2.683 0.026
## 90 53 male male 31.350 0 no non-smoker southeast southeast 27346.04 2.939 0.023
## 111 61 male male 36.100 3 no non-smoker southwest southwest 27941.29 2.326 0.023
## 70 27 female female 24.750 0 yes smoker southeast southeast 16577.78 -2.187 0.022
## 242 61 female female 25.080 0 no non-smoker southeast southeast 24513.09 2.316 0.022
head(mydata[order(mydata$StdResid),], 6)
## age sex sexF bmi children smoker smokerF region regionF charges StdResid CooksD
## 36 39 female female 18.300 5 yes smoker southwest southwest 19023.26 -2.431 0.054
## 34 49 female female 23.845 3 yes smoker northeast northeast 24106.91 -2.201 0.026
## 70 27 female female 24.750 0 yes smoker southeast southeast 16577.78 -2.187 0.022
## 18 43 female female 25.270 1 yes smoker northeast northeast 21771.34 -2.158 0.019
## 204 47 male male 25.410 1 yes smoker southeast southeast 21978.68 -2.129 0.019
## 37 39 male male 26.410 0 yes smoker northeast northeast 20149.32 -2.103 0.019
All observations with standardized residuals above 3 and below -3 should be removed. For Cook’s distances we will remove every observation with CD above 0.025, below this value the distances don’t decrease largely when ordered from biggest to smallest.
mydata1 <- mydata[abs(mydata$StdResid) <= 3, ]
mydata1 <- mydata1[mydata1$CooksD <= 0.025, ]
reg2 <- lm(data = mydata1, charges ~ age + bmi + children + sexF + regionF + smokerF)
Reestimating the multiple regression after removing the outliers.
mydata1$StdResid <- round(rstandard(reg2), 3)
mydata1$CooksD <- round(cooks.distance(reg2), 3)
hist(mydata1$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
We can see some standardized residuals are still above 3, however we do not remove them again as we do the removal and reestimation only once.
shapiro.test(mydata1$StdResid)
##
## Shapiro-Wilk normality test
##
## data: mydata1$StdResid
## W = 0.92383, p-value = 5.515e-11
H0: The standardized residuals are normally distributed
Ha: The standardized residuals are not normally distributed
Because the p-value is below 0.05 we reject the null hypothesis, this means that we do not have standardized residuals which are normally distributed. However, because our sample consists of 300 observations this is not a big problem
hist(mydata1$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
The histogram of cooks distances is now connected and does not have a big outlier which is good.
mydata1$StdFittedValues <- scale(reg2$fitted.values)
scatterplot(y = mydata1$StdResid, x = mydata1$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = TRUE,
smooth = FALSE)
From the scatterplot we can see that the variance is probably not constant, however we will test that formally with the breusch pagan test. Additionally we can see that linearity is met.
ols_test_breusch_pagan(reg2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------
## Response : charges
## Variables: fitted values of charges
##
## Test Summary
## ------------------------------
## DF = 1
## Chi2 = 166.8972
## Prob > Chi2 = 3.52277e-38
H0: the variance is constant
Ha: the variance is not constant
Formally testing the homoskedasticity, we can reject the null hypothesis at p < 0.001 and say that there is no homoskedasticity. The variance is not constant, this will affect the standardized errors and consequently the p-values in the linear model therefore we will use the lm robust function to correct this.
summary(mydata1[c(1, 3, 4, 5, 7, 9, 10)])
## age sexF bmi children smokerF regionF charges
## Min. :18.0 male :142 Min. :19.19 Min. :0.000 smoker : 54 northeast:67 Min. : 1137
## 1st Qu.:27.0 female:147 1st Qu.:26.60 1st Qu.:0.000 non-smoker:235 southeast:96 1st Qu.: 4661
## Median :41.0 Median :30.78 Median :1.000 southwest:60 Median : 8827
## Mean :39.9 Mean :31.19 Mean :1.024 northwest:66 Mean :12631
## 3rd Qu.:52.0 3rd Qu.:35.70 3rd Qu.:2.000 3rd Qu.:13144
## Max. :63.0 Max. :53.13 Max. :5.000 Max. :48885
reg3 <- lm_robust(data = mydata1, charges ~ age + bmi + children + sexF + regionF + smokerF)
summary(reg3)
##
## Call:
## lm_robust(formula = charges ~ age + bmi + children + sexF + regionF +
## smokerF, data = mydata1)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 15029.80 2354.52 6.3834 7.165e-10 10395.0 19664.6 280
## age 262.37 21.35 12.2885 4.756e-28 220.3 304.4 280
## bmi 260.22 52.04 5.0003 1.011e-06 157.8 362.7 280
## children 652.80 203.44 3.2089 1.487e-03 252.3 1053.3 280
## sexFfemale 116.72 543.62 0.2147 8.302e-01 -953.4 1186.8 280
## regionFsoutheast -93.55 825.46 -0.1133 9.098e-01 -1718.5 1531.3 280
## regionFsouthwest -503.37 701.38 -0.7177 4.735e-01 -1884.0 877.3 280
## regionFnorthwest -1141.74 702.78 -1.6246 1.054e-01 -2525.1 241.7 280
## smokerFnon-smoker -26213.57 1194.41 -21.9470 8.539e-63 -28564.7 -23862.4 280
##
## Multiple R-squared: 0.8649 , Adjusted R-squared: 0.8611
## F-statistic: 120.5 on 8 and 280 DF, p-value: < 2.2e-16
Charges = 15029.8 + 262.37age + 260.22bmi + 652.8children + 116.72female - 93.55southeast - 503.37southwest - 1141.74northwest - 26213.57*non-smoker
The coefficients of sexFfemale, regionFsoutheast, regionFsouthwest and regionFnorthwest are not statisticaly significant at p-value of 0.05 therefore we can not claim that they are statistically different from 0. Because of this they do not require a further explanation.
Explanation of the partial coefficients: - Age: If age increases by 1 year and everything else stays the same, charges will on average increase by 262.37 USD (at p < 0.001) - BMI: If BMI increases by 1 and everything else stays the same, charges will on average increase by 260.22 USD (at p < 0.001) - Children: If the number of children increases by 1 and everything else stays the same, charges will on average increase by 652.8 USD (at p < 0.01) - SmokerF non-smoker: On average nonsmokers have charges that are lower by 26213.57 USD compared to smokers if everything else stays the same (at p < 0.001)
The multiple R squared is 0.865. This means that 86 % of the variability in the dependent variable is explained by the independent variables
The F test hypothesis:
H0: ρ2 = 0
H1: ρ2 > 0
In our case the null hypothesis can be rejected at 0.001 level of significance (p < 0.001). This means that ro is more than 0 and our regression model is a good fit meaning that there is a relationship between the dependent variable and at least one independent variable. This model is a better fit than the model with no independent variables.
sqrt(reg3$r.squared)
## [1] 0.9300163
The maximum degree of linear relationship between the dependent and the independent variables is 0.93 whichi indicates a very strong relationship.