The cost of medical care significantly impacts both healthcare providers and patients. This project aims to explore the predictive utility of patient features captured by an insurance firm to estimate the annual cost of medical care. The dataset used is the publicly available Medical Cost Personal dataset from Kaggle, containing information on 1338 beneficiaries and 7 variables, including the target variable: medical costs billed by health insurance in a year.
age: Age of the primary beneficiary.
sex: Gender of the insurance contractor.
bmi: Body mass index of the beneficiary.
children: Number of children covered by health insurance.
smoker: Smoking status of the beneficiary.
region: Residential area of the beneficiary in the US.
charges: Individual medical costs per beneficiary billed by health insurance in a year.
df = read.csv('insurance.csv', header=TRUE) #loading data
str(df) #examining structure of the dataset
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Developing a robust predictive model for medical costs is crucial for assisting healthcare providers, insurers, and policymakers. This study aims to demonstrate the practical application of such a model.
The primary objective is to develop a predictive model using linear regression that relates the predictor variables (e.g., age, BMI, region) to the target variable (annual medical cost).
While the model provides valuable insights based on historical data, it assumes observed relationships will continue in the future. External factors not in the dataset may influence medical costs in the real world.
This study demonstrates the development of a linear regression model for learning and research purposes only. It does not necessarily reflect how the cost of insurance or medical care would be predicted for any individual patient in practice, and its findings are not intended for commercial or diagnostic use.
This study employs a multiple linear regression approach to investigate the relationship between the response variable (medical costs) and predictor variables.
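For reference, the full model with all predictors can be written out as below. The symbols (the coefficients, the error term, and the indicator notation) are introduced here for illustration; the baseline levels (female, non-smoker, northeast) are inferred from the dummy-variable names that R reports in the model summaries, such as sexmale, smokeryes and regionnorthwest.

```latex
\begin{aligned}
\text{charges}_i ={}& \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{bmi}_i + \beta_3\,\text{children}_i \\
&+ \beta_4\,\mathbb{1}[\text{sex}_i = \text{male}] + \beta_5\,\mathbb{1}[\text{smoker}_i = \text{yes}] \\
&+ \beta_6\,\mathbb{1}[\text{region}_i = \text{northwest}] + \beta_7\,\mathbb{1}[\text{region}_i = \text{southeast}] \\
&+ \beta_8\,\mathbb{1}[\text{region}_i = \text{southwest}] + \varepsilon_i
\end{aligned}
```

Each coefficient on an indicator is the expected difference in annual charges relative to the baseline level, holding the other predictors fixed.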
The initial model includes all potential predictor variables. It is then refined by backward elimination: the significance of each predictor is assessed with the hypothesis tests reported in the model summary, the least significant predictor (largest p-value) is removed, and the model is refitted. This is repeated until the final model is obtained.
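The elimination loop described above can be sketched as follows. This is a minimal illustration on simulated data, not the exact procedure used in this study: the data frame `sim`, the helper `backward_eliminate()`, and the 0.05 threshold are all assumptions introduced here. It uses `drop1()` with an F test so that a multi-level factor is tested as a whole rather than one dummy variable at a time.

```r
# Simulated data (assumption): y depends on x1 and the factor g, but not on x2.
set.seed(42)
n <- 200
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
sim$y <- 3 + 2 * sim$x1 + ifelse(sim$g == "c", 1.5, 0) + rnorm(n)

# Hypothetical helper: repeatedly drop the least significant term
# until every remaining term has a p-value below alpha.
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    # Stop if only the intercept is left.
    if (length(attr(terms(model), "term.labels")) == 0) break
    tests <- drop1(model, test = "F")      # one F test per droppable term
    pvals <- tests[["Pr(>F)"]][-1]         # first row is "<none>" (NA)
    term_names <- rownames(tests)[-1]
    if (max(pvals) < alpha) break
    worst <- term_names[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

full_model  <- lm(y ~ x1 + x2 + g, data = sim)
final_model <- backward_eliminate(full_model)
formula(final_model)
```

For the insurance data, the same helper could be called on a model fitted with `lm(charges ~ ., data = df)`; whether it reproduces the manual selection depends on the threshold chosen.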
Final model adequacy is assessed using diagnostic checks, including residual analysis and goodness-of-fit measures, mainly the R-squared and adjusted R-squared values.
All analyses are conducted in RStudio using libraries such as corrplot, ggplot2, mctest, and knitr.
# Display basic information about the dataset
summary(df)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
# Creating density plots for key variables
par(mfrow=c(2,2))
plot(density(df$age), main="Distribution of Age", xlab="Age")
plot(density(df$bmi), main="Distribution of BMI", xlab="BMI")
plot(density(df$charges), main="Distribution of Cost", xlab="Cost")
plot(density(df$children), main="Distribution of Children", xlab="Number of Children")
### Correlation Analysis
# Scatter plot matrix for numeric variables
num_cols <- unlist(lapply(df, is.numeric))
plot(df[,num_cols])
# Correlation heatmap of numeric variables (corrplot() comes from the corrplot package)
library(corrplot)
cor_matrix <- round(cor(df[,num_cols]),2)
corrplot(cor_matrix, method="number", type="upper", order="hclust", tl.col="black", tl.srt=45, addCoef.col="black", number.cex=0.7)
# Boxplots for categorical variables
boxplot(df$charges ~ df$region, main ='Region')
boxplot(df$charges ~ df$smoker, main ='Smoker')
boxplot(df$charges ~ df$sex, main ='Sex')
# Convert the categorical predictors to factors inside df, since the models below use data = df
df$smoker <- as.factor(df$smoker)
df$sex <- as.factor(df$sex)
df$region <- as.factor(df$region)
# Fit Model 1
model_1 <- lm(charges ~ ., data = df)
# Evaluate Model 1
summary(model_1)
##
## Call:
## lm(formula = charges ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11304.9 -2848.1 -982.1 1393.9 29992.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11938.5 987.8 -12.086 < 2e-16 ***
## age 256.9 11.9 21.587 < 2e-16 ***
## sexmale -131.3 332.9 -0.394 0.693348
## bmi 339.2 28.6 11.860 < 2e-16 ***
## children 475.5 137.8 3.451 0.000577 ***
## smokeryes 23848.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -353.0 476.3 -0.741 0.458769
## regionsoutheast -1035.0 478.7 -2.162 0.030782 *
## regionsouthwest -960.0 477.9 -2.009 0.044765 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
# Residual plots for Model 1
par(mfrow=c(2,2))
plot(model_1)
# Fit Model 2
model_2 <- lm(charges ~ age + bmi + children + smoker + region, data= df)
# Evaluate Model 2
summary(model_2)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
# Residual plots for Model 2
par(mfrow=c(2,2))
plot(model_2)
# Fit Model 3 (same formula as Model 2, so the results below are identical)
model_3 <- lm(charges ~ age + bmi + children + smoker + region, data= df)
# Evaluate Model 3
summary(model_3)
##
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11367.2 -2835.4 -979.7 1361.9 29935.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11990.27 978.76 -12.250 < 2e-16 ***
## age 256.97 11.89 21.610 < 2e-16 ***
## bmi 338.66 28.56 11.858 < 2e-16 ***
## children 474.57 137.74 3.445 0.000588 ***
## smokeryes 23836.30 411.86 57.875 < 2e-16 ***
## regionnorthwest -352.18 476.12 -0.740 0.459618
## regionsoutheast -1034.36 478.54 -2.162 0.030834 *
## regionsouthwest -959.37 477.78 -2.008 0.044846 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7496
## F-statistic: 572.7 on 7 and 1330 DF, p-value: < 2.2e-16
# Residual plots for Model 3
par(mfrow=c(2,2))
plot(model_3)
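All three models fit equally well by R-squared: each reports a multiple R-squared of 0.7509, so the models explain about 75% of the variance in charges. Removing the ‘sex’ variable leaves R-squared unchanged to four decimal places and raises the adjusted R-squared only slightly, from 0.7494 to 0.7496, so ‘sex’ adds no meaningful explanatory power. Model 3 uses the same formula as Model 2, so its results are identical.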
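The small difference in adjusted R-squared between Model 1 and Model 2 follows directly from the degrees-of-freedom penalty. As a quick check, the sketch below recomputes both values from the reported R-squared (0.7509), the sample size (1338), and the number of estimated slopes (8 for Model 1, 7 for Model 2). The helper name `adj_r2` is introduced here for illustration.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where p is the number of slope coefficients (intercept excluded).
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_model_1 <- adj_r2(0.7509, n = 1338, p = 8)  # Model 1: all predictors
adj_model_2 <- adj_r2(0.7509, n = 1338, p = 7)  # Model 2: sex removed

round(c(model_1 = adj_model_1, model_2 = adj_model_2), 4)
```

Rounded to four decimals, the results agree with the 0.7494 and 0.7496 shown in the model summaries: with R-squared unchanged, dropping one parameter alone accounts for the increase.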
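The residual plots are read here as showing approximate homoscedasticity, normality and linearity of the residuals in all three models. The residual summaries are right-skewed, however (median near -980, minimum near -11,300, maximum near 30,000), so these assumptions warrant a formal check.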
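Visual checks of residual plots can be complemented by formal tests. The sketch below uses simulated data with deliberately non-constant variance (an assumption made for illustration, not the insurance data). It applies a Shapiro-Wilk test for normality and a Breusch-Pagan-style test in Koenker's studentized form, computed by hand in base R. To apply it to this study, a fitted model such as `model_2` would take the place of `fit`, with its own predictors in the auxiliary regression.

```r
# Simulated data (assumption): the error standard deviation grows with x,
# so the constant-variance assumption is violated by construction.
set.seed(1)
n <- 300
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Normality of residuals: a small p-value suggests non-normal residuals.
sw <- shapiro.test(residuals(fit))

# Studentized Breusch-Pagan test: regress the squared residuals on the
# predictors; under constant variance, n * R^2 is approximately chi-squared
# with degrees of freedom equal to the number of predictors (here 1).
aux <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_pval)
```

A small Breusch-Pagan p-value indicates heteroscedasticity, in which case the standard errors in the model summary should be interpreted with caution.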
Feature Engineering: Explore additional features or transformations of existing features to enhance model performance.
Outlier Handling: Investigate and handle potential outliers in the dataset.
Interaction Terms: Consider incorporating interaction terms between variables to capture nuanced relationships.
Cross-Validation: Implement cross-validation techniques for robust model validation.
Further Exploration: Explore non-linear models or ensemble methods to capture complex patterns.
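Two of these recommendations, interaction terms and cross-validation, can be combined in a short sketch. It runs 5-fold cross-validation in base R on simulated data and compares the out-of-fold RMSE of a main-effects model with one that adds a bmi-by-smoker interaction. The data are synthetic (the variable names mimic the insurance data, but the values and effect sizes are made up), and the helper `cv_rmse()` and the choice of five folds are assumptions for illustration.

```r
# Synthetic data (assumption): the effect of bmi is built to be larger for smokers.
set.seed(123)
n <- 500
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
sim$charges <- 2000 + 250 * sim$age + 100 * sim$bmi +
  ifelse(sim$smoker == "yes", 1400 * sim$bmi - 20000, 0) +
  rnorm(n, sd = 3000)

# Hypothetical helper: k-fold cross-validated RMSE for a given lm formula.
cv_rmse <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  response <- all.vars(formula)[1]
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    test_idx <- folds == i
    fit  <- lm(formula, data = data[!test_idx, ])            # train on k - 1 folds
    pred <- predict(fit, newdata = data[test_idx, ])         # predict the held-out fold
    sq_err[test_idx] <- (data[[response]][test_idx] - pred)^2
  }
  sqrt(mean(sq_err))
}

rmse_main        <- cv_rmse(charges ~ age + bmi + smoker, sim)
rmse_interaction <- cv_rmse(charges ~ age + bmi * smoker, sim)
c(main_effects = rmse_main, with_interaction = rmse_interaction)
```

On the real data, the same helper could be called with `df` and the formulas from Model 2; whether the interaction improves out-of-fold error there is an empirical question this sketch does not answer.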
This project establishes a foundation for predicting medical costs using linear regression. The models developed provide valuable insights, but continued refinement and more advanced techniques can further improve predictive accuracy. In Part II of this study, I will implement the recommendations in Section 5 to further refine the models developed here.