mydata <- read.csv("./medical_cost.csv")
head(mydata)
## Id age sex bmi children smoker region charges
## 1 1 19 female 27.900 0 yes southwest 16884.924
## 2 2 18 male 33.770 1 no southeast 1725.552
## 3 3 28 male 33.000 3 no southeast 4449.462
## 4 4 33 male 22.705 0 no northwest 21984.471
## 5 5 32 male 28.880 0 no northwest 3866.855
## 6 6 31 female 25.740 0 no southeast 3756.622
Unit of observation: one patient
Sample size: 1338 observations
Number of variables: 8
Source: Pore, N. (2023, August 19). Medical_cost_dataset. Kaggle. https://www.kaggle.com/datasets/nanditapore/medical-cost-dataset
ID: A unique identifier assigned to each individual record
Age: The age of the patient
Sex: The gender of the patient (male/female)
BMI: The Body Mass Index (BMI) of the patient
Children: The number of children or dependents covered under the medical insurance
Smoker: A binary indicator of whether the patient is a smoker or not
Region: The geographic region of the patient (northeast, northwest, southeast, southwest)
Charges: The medical charges incurred by the patient.
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 100), ] #Choosing random sample of 100 units
head(mydata)
## Id age sex bmi children smoker region charges
## 1017 1017 19 female 24.605 1 no northwest 2709.244
## 679 679 56 male 36.100 3 no southwest 12363.547
## 129 129 32 female 17.765 2 yes northwest 32734.186
## 930 930 41 male 34.210 1 no southeast 6289.755
## 471 471 27 male 32.670 0 no southeast 2497.038
## 299 299 31 male 34.390 3 yes northwest 38746.355
mydata$sex <- as.factor(mydata$sex)
mydata$smoker <- as.factor(mydata$smoker)
mydata$region <- as.factor(mydata$region)
summary(mydata) #Descriptive statistics
## Id age sex bmi children
## Min. : 22.0 Min. :18.00 female:45 Min. :17.39 Min. :0.00
## 1st Qu.: 381.0 1st Qu.:28.75 male :55 1st Qu.:27.45 1st Qu.:0.00
## Median : 646.0 Median :41.00 Median :31.73 Median :1.00
## Mean : 687.2 Mean :40.38 Mean :31.18 Mean :1.19
## 3rd Qu.: 991.5 3rd Qu.:52.00 3rd Qu.:34.39 3rd Qu.:2.00
## Max. :1336.0 Max. :64.00 Max. :48.07 Max. :5.00
## smoker region charges
## no :78 northeast:24 Min. : 1392
## yes:22 northwest:26 1st Qu.: 5721
## southeast:26 Median : 9498
## southwest:24 Mean :14191
## 3rd Qu.:14352
## Max. :62593
Min: The minimum value for body mass index (bmi) of a patient is 17.39.
1st Qu (First Quartile): 25% of the data falls below this value and 75% falls above this value. The 1st quartile for charges is 5721 euros.
Median: 50% of the data falls below this value and 50% of the data falls above this value. The median for age is 41 years.
Mean: The average body mass index (bmi) 31.18.
3rd Qu (Third Quartile): 75% of the data falls below this value and 25% of the data falls above this value. The 3rd quartile for charges is 14352 euros.
Max: The maximum value for children is 5.
We have 45 Females in the sample data and 55 Males.
There are 78 non-smokers and 22 smokers in the sample data.
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
scatterplotMatrix(~ charges + age, data = mydata,
smooth = FALSE)
Based on the scatter plot we can see that there is linear relationship between the variables charges and age. Since smoker is a categorical variable we do not include it in the scatter plot matrix.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[ , c(-1, -3, -4, -5, -6, -7) ])) #Correlation matrix
## age charges
## age 1.00 0.27
## charges 0.27 1.00
##
## n= 100
##
##
## P
## age charges
## age 0.0063
## charges 0.0063
H0: rho (age, charges) = 0
H1: rho (age, charges) != 0
We reject the null hypothesis at p < 0.0064. There is correlation between age and charges based on the data set.
The linear relationship between age and charges is positive and weak.
fit1 <- lm(charges ~ age + smoker, #Dependent and explanatory variables
data = mydata) #Name of the data frame
charges= β0 + β1×age + β2×smoker + ϵ
Where:
charges is the dependent variable, representing the medical charges.
age is an explanatory variable, representing the age of the individual.
smoker is an explanatory variable, representing whether the individual is a smoker (categorical variable: “yes” or “no”).
β0 is the intercept term.
β1 is the coefficient associated with the variable age. age.
β2 is the coefficient associated with the variable smoker. smoker.
ϵ is the error term, representing unobserved factors influencing medical charges that are not captured by the model.
Justification for the choice of explanatory variables:
Age:
Expected effect: The coefficient β1 measures the expected change in medical charges for a one-unit increase in age, assuming all other variables are held constant. It is expected that older individuals may have higher medical charges due to increased healthcare needs associated with aging.
Smoker:
Expected effect: The coefficient β2 measures the difference in medical charges between smokers and non-smokers, assuming all other variables are held constant. Smoking is a known risk factor for various health conditions, so it is expected that smokers may have higher medical charges compared to non-smokers.
vif(fit1) #Multicolinearity
## age smoker
## 1.0025 1.0025
mean(vif(fit1))
## [1] 1.0025
All variance inflation factors are less than 5 and also, their mean is low. That means that there is no strong multicolinearity in my data.
mydata$StdResid <- round(rstandard(fit1), 3) #Standardized residuals
mydata$CooksD <- round(cooks.distance(fit1), 3) #Cooks distances
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals") #Histogram of std. residuals
There are units that have critical values (less than -3 or more than 3) of standardized residuals that should be removed.
shapiro.test(mydata$StdResid) #Checking normality
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.71616, p-value = 1.305e-12
H0: Errors are normally distributed
H1: Errors are not normally distributed
We reject H0 at p value < 0.001 and assume that the errors are not normally distributed.
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances") #Histogram of Cooks distances
There are no units with Cooks Distance above 1 however, the histogram does not have continuous distribution, meaning that there are units that should be removed because of Cooks Distances.
head(mydata[order(mydata$StdResid),], 3) #Three units with lowest value of stand. residuals
## Id age sex bmi children smoker region charges StdResid CooksD
## 990 990 24 female 20.520 0 yes northeast 14571.89 -3.294 0.221
## 858 858 25 male 24.130 0 yes northwest 15817.99 -3.111 0.192
## 855 855 49 female 23.845 3 yes northeast 24106.91 -2.950 0.155
head(mydata[order(-mydata$StdResid), ], 3) #Three units with highest value of stand. residuals
## Id age sex bmi children smoker region charges StdResid CooksD
## 1301 1301 45 male 30.360 0 yes southeast 62592.87 4.693 0.365
## 988 988 45 female 27.645 1 no northwest 28340.19 3.602 0.060
## 526 526 18 female 33.880 0 no southeast 11482.63 2.039 0.058
head(mydata[order(-mydata$CooksD),], 6) #Six units with highest value of Cooks distance
## Id age sex bmi children smoker region charges StdResid CooksD
## 1301 1301 45 male 30.360 0 yes southeast 62592.87 4.693 0.365
## 990 990 24 female 20.520 0 yes northeast 14571.89 -3.294 0.221
## 858 858 25 male 24.130 0 yes northwest 15817.99 -3.111 0.192
## 855 855 49 female 23.845 3 yes northeast 24106.91 -2.950 0.155
## 983 983 31 male 25.900 3 yes southwest 19199.94 -2.810 0.135
## 955 955 34 male 27.835 1 yes northwest 20009.63 -2.831 0.131
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ids_to_remove <- c("1301", "990", "858", "988", "983", "955", "855", "526", "1129", "642", "129", "282", "382", "851", "501", "40", "378", "37", "619") #Units with values of standardized residuals that fall outside +3 or -3 or cause the histogram of Cooks Distances to not have continuous distribution
mydata <- mydata %>%
filter(!(Id %in% ids_to_remove)) #Removing variables
I removed the units one by one and checked the histograms of standardized residuals and cooks distances to see whether there are new units with stand. res. outside the critical values that should be removed and whether the histogram of cooks distances has continuous distribution. I stopped removing units when there were no units left with std. res. outside the critical values and the histogram of cooks distances had continuous distribution.
fit1 <- lm(charges ~ age + smoker, #Dependent and explanatory variables
data = mydata) #Name of the data frame
mydata$StdResid <- round(rstandard(fit1), 3) #Rounding std. residuals to 3 decimals
mydata$CooksD <- round(cooks.distance(fit1), 3) #Rounding Cooks Distances to 3 decimals
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals") #Histogram of std. residuals
I removed all the units that have critical values (less than -3 or more than 3) of standardized residuals and the histogram seems to have normal distribution.
shapiro.test(mydata$StdResid) #Checking normality
##
## Shapiro-Wilk normality test
##
## data: mydata$StdResid
## W = 0.97827, p-value = 0.1842
H0: Errors are normally distributed
H1: Errors are not normally distributed
We do not reject H0 at p value < 0.1843 and assume that the errors are normally distributed.
The Shapiro Wilk test confirmed what I saw on the histogram, that the errors are normally distributed.
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances") #Histogram of Cooks distances
The histogram of Cooks Distances has continuous distribution without breaks and has no units with values above 1. That is a good sign meaning we can proceed.
mydata$StdFitted <- scale(fit1$fitted.values) #Standardization of the variable
library(car)
scatterplot(y = mydata$StdResid, x = mydata$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE) #Scatter plot of std. residuals and std. fitted values
It would be better for the model if there wasn’t the gap we can see on the scatter plot but for educational purposes we will assume that it is okay.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1) #Checking heteroskedasticity
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------
## Response : charges
## Variables: fitted values of charges
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.939859
## Prob > Chi2 = 0.1636839
H0: Homoskedasticity is met
H1: Homoskedasticity is violated
p-value = 0.1636839
Since p-value is above the significant level (alpha) of 5%, we do not reject the null hypothesis. The variance between standardized residuals and standardized fitted values is constant and there is no potential heteroskedasticity.
summary(fit1) #Results of regression model
##
## Call:
## lm(formula = charges ~ age + smoker, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1347.60 -556.62 31.35 527.01 1634.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3711.033 253.580 -14.63 <2e-16 ***
## age 273.138 5.863 46.59 <2e-16 ***
## smokeryes 33777.259 286.612 117.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 724.2 on 78 degrees of freedom
## Multiple R-squared: 0.9953, Adjusted R-squared: 0.9952
## F-statistic: 8275 on 2 and 78 DF, p-value: < 2.2e-16
F-statistic: H0: rho2 = 0 H1: rho2 > 0
We reject the null hypothesis at p-value < 0.001 meaning that we explain something with our model and this model is okay.
H0: Beta1 = 0 H1: Beta1 != 0 p < 0.001 b1 = 273.138
We reject H0 at p-value < 0.001 and found that age has effect on the medical charges.
Explanation of b1 = 273.138, If Age increases by one year, the Charges on average increase by 273.138 euros (p < 0.001), assuming that the other explanatory variable, included in the model, stays the same.
H0: Beta2 = 0 H1: Beta2 != 0 p < 0.001 b2 = 33777.259
We reject H0 at p-value < 0.001 and and found that whether a person is a smoker has effect on the medical charges.
Explanation of b2 = 33777.259, Given the value of the other explanatory variable, the people who are smoking have on average medical charges higher for 33777.259 euros compared to people who are not smoking (p < 0.001).
Multiple R-squared: 0.9953
99.53% of the variability of medical charges can be explained by the effect of age and whether a person is a smoker or not.
sqrt(summary(fit1)$r.squared) #Coefficient of multiple correlation
## [1] 0.9976519
The linear relationship between the 3 variables is very strong.
library(lm.beta)
lm.beta(fit1) #Standardized regression coefficients
##
## Call:
## lm(formula = charges ~ age + smoker, data = mydata)
##
## Standardized Coefficients::
## (Intercept) age smokeryes
## NA 0.3615836 0.9147094
The explanatory variable smoker has larger absolute value of standardized regression coefficient than the explanatory variable age meaning that variable smoker had a higher effect when explaining the medical charges than variable age.
The explanatory variable smoker contributed more to the Multiple R - squared meaning that the variable smoker explained higher percentage of the dependent variable medical charges or the differences in medical charges than the variable age.
fit2 <- lm(charges ~ age + smoker + children + sex,
data = mydata) #Regression model 2
summary(fit2) #Results from regression model 2
##
## Call:
## lm(formula = charges ~ age + smoker + children + sex, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -972.63 -392.29 -84.87 467.08 1214.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3698.521 227.383 -16.266 < 2e-16 ***
## age 271.896 4.539 59.904 < 2e-16 ***
## smokeryes 33639.867 221.271 152.030 < 2e-16 ***
## children 298.228 53.323 5.593 3.36e-07 ***
## sexmale -578.486 125.467 -4.611 1.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 557.1 on 76 degrees of freedom
## Multiple R-squared: 0.9973, Adjusted R-squared: 0.9972
## F-statistic: 7006 on 4 and 76 DF, p-value: < 2.2e-16
H0: Beta1 = 0 H1: Beta1 != 0 p < 0.001 b1 = 271.896
We reject H0 at p-value < 0.001 and found that age has effect on the medical charges.
Explanation of b1 = 271.896, If Age increases by one year, the Charges on average increase by 271.896 euros (p < 0.001), assuming that all other explanatory variables, included in the model, stay the same.
H0: Beta2 = 0 H1: Beta2 != 0 p < 0.001 b2 = 33639.867
We reject H0 at p-value < 0.001 and and found that whether a person is a smoker has effect on the medical charges.
Explanation of b2 = 33639.867 , Given the value of all other explanatory variables, the people who are smoking have on average medical charges higher for 33639.867 euros compared to people who are not smoking (p < 0.001).
H0: Beta3 = 0 H1: Beta3 != 0 p < 0.001 b3 = 298.228
We reject H0 at p-value < 0.001 and found that the number of children or dependents covered under the medical insurance has effect on the medical charges.
Explanation of b3 = 298.228, If the number of children or dependents covered under the medical insurance increases by one, the Charges on average increase by 298.228 euros (p < 0.001), assuming that all other explanatory variables, included in the model, stay the same.
H0: Beta4 = 0 H1: Beta4 != 0 p < 0.001 b2 = -578.486
We reject H0 at p-value < 0.001 and and found that whether a person is a smoker has effect on the medical charges.
Explanation of b4 = -578.486, Given the value of all other explanatory variables, men have on average medical charges lower for 578.486 euros compared to women (p < 0.001).
Multiple R-squared: 0.9973
99.73% of the variability of medical charges can be explained by the effect of age, the number of children or dependents covered under the medical insurance, the gender of the person and whether a person is a smoker or not.
Multiple R-Squared increased compared to the previous model meaning that this model is better if we are comparing Multiple R-Squared since it explains higher proportion of variability of medical charges.
anova(fit1, fit2) #Comparison of two regression models that differ in the number of expl. variables
## Analysis of Variance Table
##
## Model 1: charges ~ age + smoker
## Model 2: charges ~ age + smoker + children + sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 78 40903181
## 2 76 23584160 2 17319020 27.905 8.181e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Δρ2 (delta rho squared) = 0
H1: Δρ2 (delta rho squared) > 0
We reject the null hypothesis at p - value < 0.001 and conclude that the second model is better since it explains more.