Ian Gošnak

Importing the data and creating a 300 unit sample

set.seed(3)
mydata <- read.table("./insurance.csv", header=TRUE, sep=",", dec=".") %>% 
  sample_n(300)

head(mydata)
##   age    sex   bmi children smoker    region   charges
## 1  44 female 36.48        0     no northeast 12797.210
## 2  53 female 39.60        1     no southeast 10579.711
## 3  33 female 36.29        3     no northeast  6551.750
## 4  54 female 46.70        2     no southwest 11538.421
## 5  41   male 35.75        1    yes southeast 40273.645
## 6  44   male 21.85        3     no northeast  8891.139

A unit of observation in this dataset is the primary beneficiary (person) of health insurance

The sample size in this data set is equal to 300 units of observation

Definition of variables:

  • age: Age of primary beneficiary (in years)
  • sex: Insurance contractor gender, can be female or male
  • bmi: Body mass index, (kg / m ^ 2) using the ratio of height to weight
  • children: Number of children covered by health insurance / Number of dependents
  • smoker: Is the beneficiary a smoker or not
  • region: The beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
  • charges: Individual medical costs billed by health insurance (in USD)

The data was taken from the website Kaggle.com, more specifically from the link https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download

The research question for this analysis is to find out how different variables (age, sex, bmi, n of children, tobacco use and location) influence the amount of medical costs billed by health insurance.

Cleaning of data

mydata$sexF <- factor(mydata$sex,
                          labels = c("male", "female"),
                          levels = c("male", "female")) # Adding a factor variable for "sex"

mydata$smokerF <- factor(mydata$smoker,
                             labels = c("smoker", "non-smoker"),
                             levels = c("yes", "no")) # Adding a factor variable for "smoker"

mydata$regionF <- factor(mydata$region,
                             labels = c("northeast", "southeast", "southwest", "northwest"),
                             levels = c("northeast", "southeast", "southwest", "northwest")) # Adding a factor for the variable "region"

mydata <- mydata[c( 1, 2, 8, 3, 4, 5, 9, 6, 10, 7)] # Rearranging the column order

Defining the multiple regression model

The dependent variable in my regression will be “charges” (medical costs billed by health insurance) . The independent variables will be: Age, BMI, Children, smoker, region, sex

What affects the amount of medical costs billed by health insurance in our dataset is most likely the age, bmi, number of children, smoking habits and the geographical region of the person. The age should increase the costs as with older age health deteriorates, a higher BMI also indicates a less healthy individual increasing the costs, the number of children increases the costs as the person insured has more people covered by the insurance and lastly smokers are a lot less healthy so they will have higher costs. Additionally the regions might affect the amount of charges but the gender should not affect it. The article that was used for this theoretical justification is posted on the website of a US insurance company, cited below.

Showing the relationship between numerical variables with the scatterplot

scatterplotMatrix(mydata[ , c(1, 4, 5, 10)],
                  smooth = FALSE)

From the scatterplots we can see that all of our explanatory numerical variables (age, bmi and children) are positively correlated with the dependent variable (charges).

rcorr(as.matrix(mydata[, c(1, 4, 5, 10)]), 
      type = "pearson")
##           age   bmi children charges
## age      1.00  0.07     0.01    0.29
## bmi      0.07  1.00    -0.03    0.23
## children 0.01 -0.03     1.00    0.09
## charges  0.29  0.23     0.09    1.00
## 
## n= 300 
## 
## 
## P
##          age    bmi    children charges
## age             0.2226 0.8548   0.0000 
## bmi      0.2226        0.5692   0.0000 
## children 0.8548 0.5692          0.1041 
## charges  0.0000 0.0000 0.1041

Estimating the multiple regression model

reg1 <- lm(data = mydata, charges ~ age + bmi + children + sexF + regionF + smokerF)

Checking the assumptions

Multicolinearity

vif(reg1)
##              GVIF Df GVIF^(1/(2*Df))
## age      1.034395  1        1.017052
## bmi      1.151570  1        1.073112
## children 1.015770  1        1.007854
## sexF     1.029721  1        1.014752
## regionF  1.175090  3        1.027256
## smokerF  1.023141  1        1.011504
mean(vif(reg1))
## [1] 1.143401
sqrt(5)
## [1] 2.236068

Using the variance inflation factor (VIF) test to identify if the regression model has a problem of multicolineartiy. Because one of the variables has more than two categories (region) we have to look at the right GVIF column and compare it with the square root of 5. Comparing it with the square root of 5 which is 2.24 we can see that none of the values are remotely close (all are around 1). Additionally, the mean of VIF is close to one. This tells us that we do not have any multicolinearity with the variables in the regression.

Standardised residuals

mydata$StdResid <- round(rstandard(reg1), 3) 
mydata$CooksD <- round(cooks.distance(reg1), 3) 

hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

There are some standardized residuals above 3 so we will have to remove them.

Normality of standardized residuals

shapiro.test(mydata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.88568, p-value = 3.327e-14

H0: The standardized residuals are normally distributed

Ha: The standardized residuals are not normally distributed

Because the p-value is below 0.05 we reject the null hypothesis, this means that we do not have standardized residuals which are normally distributed. However, because our sample consists of 300 observations this is not a big problem

Cook’s distance

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

The histogram of Cook’s distances has gaps in between the values so we will have to remove some observations.

head(mydata[order(-mydata$StdResid),], 10) 
##     age    sex   sexF    bmi children smoker    smokerF    region   regionF  charges StdResid CooksD
## 218  31 female female 38.095        1    yes     smoker northeast northeast 58571.07    4.310  0.097
## 177  24 female female 23.210        0     no non-smoker southeast southeast 25081.77    4.059  0.058
## 78   19   male   male 27.265        2     no non-smoker northwest northwest 22493.66    3.587  0.040
## 209  61 female female 33.330        4     no non-smoker southeast southeast 36580.28    3.556  0.060
## 76   33   male   male 22.705        0     no non-smoker northwest northwest 21984.47    3.387  0.037
## 31   23 female female 24.225        2     no non-smoker northeast northeast 22395.74    3.240  0.034
## 166  26 female female 29.640        4     no non-smoker northeast northeast 24671.66    3.023  0.046
## 90   53   male   male 31.350        0     no non-smoker southeast southeast 27346.04    2.939  0.023
## 244  19 female female 27.930        3     no non-smoker northwest northwest 18838.70    2.683  0.026
## 289  21   male   male 31.020        0     no non-smoker southeast southeast 16586.50    2.393  0.015
head(mydata[order(-mydata$CooksD),], 15) 
##     age    sex   sexF    bmi children smoker    smokerF    region   regionF  charges StdResid CooksD
## 218  31 female female 38.095        1    yes     smoker northeast northeast 58571.07    4.310  0.097
## 209  61 female female 33.330        4     no non-smoker southeast southeast 36580.28    3.556  0.060
## 177  24 female female 23.210        0     no non-smoker southeast southeast 25081.77    4.059  0.058
## 36   39 female female 18.300        5    yes     smoker southwest southwest 19023.26   -2.431  0.054
## 166  26 female female 29.640        4     no non-smoker northeast northeast 24671.66    3.023  0.046
## 78   19   male   male 27.265        2     no non-smoker northwest northwest 22493.66    3.587  0.040
## 76   33   male   male 22.705        0     no non-smoker northwest northwest 21984.47    3.387  0.037
## 31   23 female female 24.225        2     no non-smoker northeast northeast 22395.74    3.240  0.034
## 206  21   male   male 25.700        4    yes     smoker southwest southwest 17942.11   -1.962  0.028
## 34   49 female female 23.845        3    yes     smoker northeast northeast 24106.91   -2.201  0.026
## 244  19 female female 27.930        3     no non-smoker northwest northwest 18838.70    2.683  0.026
## 90   53   male   male 31.350        0     no non-smoker southeast southeast 27346.04    2.939  0.023
## 111  61   male   male 36.100        3     no non-smoker southwest southwest 27941.29    2.326  0.023
## 70   27 female female 24.750        0    yes     smoker southeast southeast 16577.78   -2.187  0.022
## 242  61 female female 25.080        0     no non-smoker southeast southeast 24513.09    2.316  0.022
head(mydata[order(mydata$StdResid),], 6) 
##     age    sex   sexF    bmi children smoker smokerF    region   regionF  charges StdResid CooksD
## 36   39 female female 18.300        5    yes  smoker southwest southwest 19023.26   -2.431  0.054
## 34   49 female female 23.845        3    yes  smoker northeast northeast 24106.91   -2.201  0.026
## 70   27 female female 24.750        0    yes  smoker southeast southeast 16577.78   -2.187  0.022
## 18   43 female female 25.270        1    yes  smoker northeast northeast 21771.34   -2.158  0.019
## 204  47   male   male 25.410        1    yes  smoker southeast southeast 21978.68   -2.129  0.019
## 37   39   male   male 26.410        0    yes  smoker northeast northeast 20149.32   -2.103  0.019

All observations with standardized residuals above 3 and below -3 should be removed. For Cook’s distances we will remove every observation with CD above 0.025, below this value the distances don’t decrease largely when ordered from biggest to smallest.

Removing the outliers

mydata1 <- mydata[abs(mydata$StdResid) <= 3, ]
mydata1 <- mydata1[mydata1$CooksD <= 0.025, ]

Re-estimating the multiple regression model

reg2 <- lm(data = mydata1, charges ~ age + bmi + children + sexF + regionF + smokerF)

Reestimating the multiple regression after removing the outliers.

mydata1$StdResid <- round(rstandard(reg2), 3) 
mydata1$CooksD <- round(cooks.distance(reg2), 3) 

hist(mydata1$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

We can see some standardized residuals are still above 3, however we do not remove them again as we do the removal and reestimation only once.

shapiro.test(mydata1$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata1$StdResid
## W = 0.92383, p-value = 5.515e-11

H0: The standardized residuals are normally distributed

Ha: The standardized residuals are not normally distributed

Because the p-value is below 0.05 we reject the null hypothesis, this means that we do not have standardized residuals which are normally distributed. However, because our sample consists of 300 observations this is not a big problem

hist(mydata1$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

The histogram of cooks distances is now connected and does not have a big outlier which is good.

Checking homoskedasticity and linearity

mydata1$StdFittedValues <- scale(reg2$fitted.values)

scatterplot(y = mydata1$StdResid, x = mydata1$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = TRUE,
            smooth = FALSE)

From the scatterplot we can see that the variance is probably not constant, however we will test that formally with the breusch pagan test. Additionally we can see that linearity is met.

ols_test_breusch_pagan(reg2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                 
##  -----------------------------------
##  Response : charges 
##  Variables: fitted values of charges 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    166.8972 
##  Prob > Chi2   =    3.52277e-38

H0: the variance is constant

Ha: the variance is not constant

Formally testing the homoskedasticity, we can reject the null hypothesis at p < 0.001 and say that there is no homoskedasticity. The variance is not constant, this will affect the standardized errors and consequently the p-values in the linear model therefore we will use the lm robust function to correct this.

Descriptive statistics

summary(mydata1[c(1, 3, 4, 5, 7, 9, 10)])
##       age           sexF          bmi           children           smokerF         regionF      charges     
##  Min.   :18.0   male  :142   Min.   :19.19   Min.   :0.000   smoker    : 54   northeast:67   Min.   : 1137  
##  1st Qu.:27.0   female:147   1st Qu.:26.60   1st Qu.:0.000   non-smoker:235   southeast:96   1st Qu.: 4661  
##  Median :41.0                Median :30.78   Median :1.000                    southwest:60   Median : 8827  
##  Mean   :39.9                Mean   :31.19   Mean   :1.024                    northwest:66   Mean   :12631  
##  3rd Qu.:52.0                3rd Qu.:35.70   3rd Qu.:2.000                                   3rd Qu.:13144  
##  Max.   :63.0                Max.   :53.13   Max.   :5.000                                   Max.   :48885
  • Age: The minimum age in the sample is 18 years and the max is 63 years, the average 39.9 years
  • Sex: There were 142 males in the sample and 147 females
  • BMI: The median bmi in the sample is 30.78 which means that 50% of observaions have a higher bmi and 50% have a lower bmi
  • Children: The max number of children for a person in the sample was 5, the average was 1.02
  • Smoker: There were 54 smokers in the sample and 235 non smokers
  • Region: There were 67 people from the northeast of the US in the sample, 96 from the southeast, 60 from the southwest and 66 from the northwest of the US
  • Charges: 75% of people in the sample had charges higher than 4661 USD, the average charges were 12631 USD

Estimation of multiple regression with lm robust and explanation

reg3 <- lm_robust(data = mydata1, charges ~ age + bmi + children + sexF + regionF + smokerF)

summary(reg3)
## 
## Call:
## lm_robust(formula = charges ~ age + bmi + children + sexF + regionF + 
##     smokerF, data = mydata1)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                    Estimate Std. Error  t value  Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)        15029.80    2354.52   6.3834 7.165e-10  10395.0  19664.6 280
## age                  262.37      21.35  12.2885 4.756e-28    220.3    304.4 280
## bmi                  260.22      52.04   5.0003 1.011e-06    157.8    362.7 280
## children             652.80     203.44   3.2089 1.487e-03    252.3   1053.3 280
## sexFfemale           116.72     543.62   0.2147 8.302e-01   -953.4   1186.8 280
## regionFsoutheast     -93.55     825.46  -0.1133 9.098e-01  -1718.5   1531.3 280
## regionFsouthwest    -503.37     701.38  -0.7177 4.735e-01  -1884.0    877.3 280
## regionFnorthwest   -1141.74     702.78  -1.6246 1.054e-01  -2525.1    241.7 280
## smokerFnon-smoker -26213.57    1194.41 -21.9470 8.539e-63 -28564.7 -23862.4 280
## 
## Multiple R-squared:  0.8649 ,    Adjusted R-squared:  0.8611 
## F-statistic: 120.5 on 8 and 280 DF,  p-value: < 2.2e-16

Charges = 15029.8 + 262.37age + 260.22bmi + 652.8children + 116.72female - 93.55southeast - 503.37southwest - 1141.74northwest - 26213.57*non-smoker

The coefficients of sexFfemale, regionFsoutheast, regionFsouthwest and regionFnorthwest are not statisticaly significant at p-value of 0.05 therefore we can not claim that they are statistically different from 0. Because of this they do not require a further explanation.

Explanation of the partial coefficients: - Age: If age increases by 1 year and everything else stays the same, charges will on average increase by 262.37 USD (at p < 0.001) - BMI: If BMI increases by 1 and everything else stays the same, charges will on average increase by 260.22 USD (at p < 0.001) - Children: If the number of children increases by 1 and everything else stays the same, charges will on average increase by 652.8 USD (at p < 0.01) - SmokerF non-smoker: On average nonsmokers have charges that are lower by 26213.57 USD compared to smokers if everything else stays the same (at p < 0.001)

The multiple R squared is 0.865. This means that 86 % of the variability in the dependent variable is explained by the independent variables

The F test hypothesis:

H0: ρ2 = 0

H1: ρ2 > 0

In our case the null hypothesis can be rejected at 0.001 level of significance (p < 0.001). This means that ro is more than 0 and our regression model is a good fit meaning that there is a relationship between the dependent variable and at least one independent variable. This model is a better fit than the model with no independent variables.

sqrt(reg3$r.squared)
## [1] 0.9300163

The maximum degree of linear relationship between the dependent and the independent variables is 0.93 whichi indicates a very strong relationship.