Ilina Maksimovska

Research question: How do age and the smoking status of patients affect their medical charges?

mydata <- read.csv("./medical_cost.csv")

head(mydata)
##   Id age    sex    bmi children smoker    region   charges
## 1  1  19 female 27.900        0    yes southwest 16884.924
## 2  2  18   male 33.770        1     no southeast  1725.552
## 3  3  28   male 33.000        3     no southeast  4449.462
## 4  4  33   male 22.705        0     no northwest 21984.471
## 5  5  32   male 28.880        0     no northwest  3866.855
## 6  6  31 female 25.740        0     no southeast  3756.622

Explanation of a dataset:

Explanation of the variables in the data set:

set.seed(1) 
mydata <- mydata[sample(nrow(mydata), 100), ] #Choosing random sample of 100 units
head(mydata)
##        Id age    sex    bmi children smoker    region   charges
## 1017 1017  19 female 24.605        1     no northwest  2709.244
## 679   679  56   male 36.100        3     no southwest 12363.547
## 129   129  32 female 17.765        2    yes northwest 32734.186
## 930   930  41   male 34.210        1     no southeast  6289.755
## 471   471  27   male 32.670        0     no southeast  2497.038
## 299   299  31   male 34.390        3    yes northwest 38746.355
mydata$sex <- as.factor(mydata$sex)
mydata$smoker <- as.factor(mydata$smoker)
mydata$region <- as.factor(mydata$region)
summary(mydata) #Descriptive statistics
##        Id              age            sex          bmi           children   
##  Min.   :  22.0   Min.   :18.00   female:45   Min.   :17.39   Min.   :0.00  
##  1st Qu.: 381.0   1st Qu.:28.75   male  :55   1st Qu.:27.45   1st Qu.:0.00  
##  Median : 646.0   Median :41.00               Median :31.73   Median :1.00  
##  Mean   : 687.2   Mean   :40.38               Mean   :31.18   Mean   :1.19  
##  3rd Qu.: 991.5   3rd Qu.:52.00               3rd Qu.:34.39   3rd Qu.:2.00  
##  Max.   :1336.0   Max.   :64.00               Max.   :48.07   Max.   :5.00  
##  smoker         region      charges     
##  no :78   northeast:24   Min.   : 1392  
##  yes:22   northwest:26   1st Qu.: 5721  
##           southeast:26   Median : 9498  
##           southwest:24   Mean   :14191  
##                          3rd Qu.:14352  
##                          Max.   :62593

Descriptive statistics of the dataset

library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
scatterplotMatrix(~ charges + age, data = mydata,
smooth = FALSE)

Based on the scatter plot we can see that there is linear relationship between the variables charges and age. Since smoker is a categorical variable we do not include it in the scatter plot matrix.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -3, -4, -5, -6, -7) ])) #Correlation matrix
##          age charges
## age     1.00    0.27
## charges 0.27    1.00
## 
## n= 100 
## 
## 
## P
##         age    charges
## age            0.0063 
## charges 0.0063

H0: rho (age, charges) = 0

H1: rho (age, charges) != 0

We reject the null hypothesis at p < 0.0064. There is correlation between age and charges based on the data set.

The linear relationship between age and charges is positive and weak.

fit1 <- lm(charges ~ age + smoker, #Dependent and explanatory variables
           data = mydata) #Name of the data frame

charges= β0 + β1×age + β2×smoker + ϵ

Where:

charges is the dependent variable, representing the medical charges.

age is an explanatory variable, representing the age of the individual.

smoker is an explanatory variable, representing whether the individual is a smoker (categorical variable: “yes” or “no”).

β0 is the intercept term.

β1 is the coefficient associated with the variable age. age.

β2 is the coefficient associated with the variable smoker. smoker.

ϵ is the error term, representing unobserved factors influencing medical charges that are not captured by the model.

Justification for the choice of explanatory variables:

Age:

Expected effect: The coefficient β1 measures the expected change in medical charges for a one-unit increase in age, assuming all other variables are held constant. It is expected that older individuals may have higher medical charges due to increased healthcare needs associated with aging.

Smoker:

Expected effect: The coefficient β2 measures the difference in medical charges between smokers and non-smokers, assuming all other variables are held constant. Smoking is a known risk factor for various health conditions, so it is expected that smokers may have higher medical charges compared to non-smokers.

vif(fit1) #Multicolinearity
##    age smoker 
## 1.0025 1.0025
mean(vif(fit1))
## [1] 1.0025

All variance inflation factors are less than 5 and also, their mean is low. That means that there is no strong multicolinearity in my data.

mydata$StdResid <- round(rstandard(fit1), 3) #Standardized residuals
mydata$CooksD <- round(cooks.distance(fit1), 3) #Cooks distances
hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals") #Histogram of std. residuals

There are units that have critical values (less than -3 or more than 3) of standardized residuals that should be removed.

shapiro.test(mydata$StdResid) #Checking normality 
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.71616, p-value = 1.305e-12

H0: Errors are normally distributed

H1: Errors are not normally distributed

We reject H0 at p value < 0.001 and assume that the errors are not normally distributed.

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances") #Histogram of Cooks distances

There are no units with Cooks Distance above 1 however, the histogram does not have continuous distribution, meaning that there are units that should be removed because of Cooks Distances.

head(mydata[order(mydata$StdResid),], 3) #Three units with lowest value of stand. residuals
##      Id age    sex    bmi children smoker    region  charges StdResid CooksD
## 990 990  24 female 20.520        0    yes northeast 14571.89   -3.294  0.221
## 858 858  25   male 24.130        0    yes northwest 15817.99   -3.111  0.192
## 855 855  49 female 23.845        3    yes northeast 24106.91   -2.950  0.155
head(mydata[order(-mydata$StdResid), ], 3) #Three units with highest value of stand. residuals
##        Id age    sex    bmi children smoker    region  charges StdResid CooksD
## 1301 1301  45   male 30.360        0    yes southeast 62592.87    4.693  0.365
## 988   988  45 female 27.645        1     no northwest 28340.19    3.602  0.060
## 526   526  18 female 33.880        0     no southeast 11482.63    2.039  0.058
head(mydata[order(-mydata$CooksD),], 6) #Six units with highest value of Cooks distance 
##        Id age    sex    bmi children smoker    region  charges StdResid CooksD
## 1301 1301  45   male 30.360        0    yes southeast 62592.87    4.693  0.365
## 990   990  24 female 20.520        0    yes northeast 14571.89   -3.294  0.221
## 858   858  25   male 24.130        0    yes northwest 15817.99   -3.111  0.192
## 855   855  49 female 23.845        3    yes northeast 24106.91   -2.950  0.155
## 983   983  31   male 25.900        3    yes southwest 19199.94   -2.810  0.135
## 955   955  34   male 27.835        1    yes northwest 20009.63   -2.831  0.131
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ids_to_remove <- c("1301", "990", "858", "988", "983", "955", "855", "526", "1129", "642", "129", "282", "382", "851", "501", "40", "378", "37", "619") #Units with values of standardized residuals that fall outside +3 or -3 or cause the histogram of Cooks Distances to not have continuous distribution

mydata <- mydata %>%
  filter(!(Id %in% ids_to_remove)) #Removing variables

I removed the units one by one and checked the histograms of standardized residuals and cooks distances to see whether there are new units with stand. res. outside the critical values that should be removed and whether the histogram of cooks distances has continuous distribution. I stopped removing units when there were no units left with std. res. outside the critical values and the histogram of cooks distances had continuous distribution.

fit1 <- lm(charges ~ age + smoker, #Dependent and explanatory variables
           data = mydata) #Name of the data frame
mydata$StdResid <- round(rstandard(fit1), 3) #Rounding std. residuals to 3 decimals
mydata$CooksD <- round(cooks.distance(fit1), 3) #Rounding Cooks Distances to 3 decimals
hist(mydata$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals") #Histogram of std. residuals

I removed all the units that have critical values (less than -3 or more than 3) of standardized residuals and the histogram seems to have normal distribution.

shapiro.test(mydata$StdResid) #Checking normality
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.97827, p-value = 0.1842

H0: Errors are normally distributed

H1: Errors are not normally distributed

We do not reject H0 at p value < 0.1843 and assume that the errors are normally distributed.

The Shapiro Wilk test confirmed what I saw on the histogram, that the errors are normally distributed.

hist(mydata$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances") #Histogram of Cooks distances

The histogram of Cooks Distances has continuous distribution without breaks and has no units with values above 1. That is a good sign meaning we can proceed.

mydata$StdFitted <- scale(fit1$fitted.values) #Standardization of the variable
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE) #Scatter plot of std. residuals and std. fitted values

It would be better for the model if there wasn’t the gap we can see on the scatter plot but for educational purposes we will assume that it is okay.

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit1) #Checking heteroskedasticity
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                 
##  -----------------------------------
##  Response : charges 
##  Variables: fitted values of charges 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    1.939859 
##  Prob > Chi2   =    0.1636839

H0: Homoskedasticity is met

H1: Homoskedasticity is violated

p-value = 0.1636839

Since p-value is above the significant level (alpha) of 5%, we do not reject the null hypothesis. The variance between standardized residuals and standardized fitted values is constant and there is no potential heteroskedasticity.

summary(fit1) #Results of regression model
## 
## Call:
## lm(formula = charges ~ age + smoker, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1347.60  -556.62    31.35   527.01  1634.05 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3711.033    253.580  -14.63   <2e-16 ***
## age           273.138      5.863   46.59   <2e-16 ***
## smokeryes   33777.259    286.612  117.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 724.2 on 78 degrees of freedom
## Multiple R-squared:  0.9953, Adjusted R-squared:  0.9952 
## F-statistic:  8275 on 2 and 78 DF,  p-value: < 2.2e-16

F-statistic: H0: rho2 = 0 H1: rho2 > 0

We reject the null hypothesis at p-value < 0.001 meaning that we explain something with our model and this model is okay.

H0: Beta1 = 0 H1: Beta1 != 0 p < 0.001 b1 = 273.138

We reject H0 at p-value < 0.001 and found that age has effect on the medical charges.

Explanation of b1 = 273.138, If Age increases by one year, the Charges on average increase by 273.138 euros (p < 0.001), assuming that the other explanatory variable, included in the model, stays the same.

H0: Beta2 = 0 H1: Beta2 != 0 p < 0.001 b2 = 33777.259

We reject H0 at p-value < 0.001 and and found that whether a person is a smoker has effect on the medical charges.

Explanation of b2 = 33777.259, Given the value of the other explanatory variable, the people who are smoking have on average medical charges higher for 33777.259 euros compared to people who are not smoking (p < 0.001).

Multiple R-squared: 0.9953

99.53% of the variability of medical charges can be explained by the effect of age and whether a person is a smoker or not.

sqrt(summary(fit1)$r.squared) #Coefficient of multiple correlation
## [1] 0.9976519

The linear relationship between the 3 variables is very strong.

library(lm.beta)
lm.beta(fit1) #Standardized regression coefficients
## 
## Call:
## lm(formula = charges ~ age + smoker, data = mydata)
## 
## Standardized Coefficients::
## (Intercept)         age   smokeryes 
##          NA   0.3615836   0.9147094

The explanatory variable smoker has larger absolute value of standardized regression coefficient than the explanatory variable age meaning that variable smoker had a higher effect when explaining the medical charges than variable age.

The explanatory variable smoker contributed more to the Multiple R - squared meaning that the variable smoker explained higher percentage of the dependent variable medical charges or the differences in medical charges than the variable age.

fit2 <- lm(charges ~ age + smoker + children + sex,
           data = mydata) #Regression model 2

summary(fit2) #Results from regression model 2
## 
## Call:
## lm(formula = charges ~ age + smoker + children + sex, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -972.63 -392.29  -84.87  467.08 1214.50 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3698.521    227.383 -16.266  < 2e-16 ***
## age           271.896      4.539  59.904  < 2e-16 ***
## smokeryes   33639.867    221.271 152.030  < 2e-16 ***
## children      298.228     53.323   5.593 3.36e-07 ***
## sexmale      -578.486    125.467  -4.611 1.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 557.1 on 76 degrees of freedom
## Multiple R-squared:  0.9973, Adjusted R-squared:  0.9972 
## F-statistic:  7006 on 4 and 76 DF,  p-value: < 2.2e-16

H0: Beta1 = 0 H1: Beta1 != 0 p < 0.001 b1 = 271.896

We reject H0 at p-value < 0.001 and found that age has effect on the medical charges.

Explanation of b1 = 271.896, If Age increases by one year, the Charges on average increase by 271.896 euros (p < 0.001), assuming that all other explanatory variables, included in the model, stay the same.

H0: Beta2 = 0 H1: Beta2 != 0 p < 0.001 b2 = 33639.867

We reject H0 at p-value < 0.001 and and found that whether a person is a smoker has effect on the medical charges.

Explanation of b2 = 33639.867 , Given the value of all other explanatory variables, the people who are smoking have on average medical charges higher for 33639.867 euros compared to people who are not smoking (p < 0.001).

H0: Beta3 = 0 H1: Beta3 != 0 p < 0.001 b3 = 298.228

We reject H0 at p-value < 0.001 and found that the number of children or dependents covered under the medical insurance has effect on the medical charges.

Explanation of b3 = 298.228, If the number of children or dependents covered under the medical insurance increases by one, the Charges on average increase by 298.228 euros (p < 0.001), assuming that all other explanatory variables, included in the model, stay the same.

H0: Beta4 = 0 H1: Beta4 != 0 p < 0.001 b2 = -578.486

We reject H0 at p-value < 0.001 and and found that whether a person is a smoker has effect on the medical charges.

Explanation of b4 = -578.486, Given the value of all other explanatory variables, men have on average medical charges lower for 578.486 euros compared to women (p < 0.001).

Multiple R-squared: 0.9973

99.73% of the variability of medical charges can be explained by the effect of age, the number of children or dependents covered under the medical insurance, the gender of the person and whether a person is a smoker or not.

Multiple R-Squared increased compared to the previous model meaning that this model is better if we are comparing Multiple R-Squared since it explains higher proportion of variability of medical charges.

anova(fit1, fit2) #Comparison of two regression models that differ in the number of expl. variables
## Analysis of Variance Table
## 
## Model 1: charges ~ age + smoker
## Model 2: charges ~ age + smoker + children + sex
##   Res.Df      RSS Df Sum of Sq      F    Pr(>F)    
## 1     78 40903181                                  
## 2     76 23584160  2  17319020 27.905 8.181e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: Δρ2 (delta rho squared) = 0

H1: Δρ2 (delta rho squared) > 0

We reject the null hypothesis at p - value < 0.001 and conclude that the second model is better since it explains more.