1.1 Background of the Study

The cost of medical care significantly impacts both healthcare providers and patients. This project explores how well patient features recorded by an insurance firm can predict the annual cost of medical care. The dataset used is the publicly available Medical Cost Personal dataset from Kaggle, which contains 1,338 beneficiaries and 7 variables, including the target variable: the medical costs billed by health insurance in a year.

1.2 Brief Overview of Features in the Data

age: Age of the primary beneficiary.

sex: Gender of the insurance contractor.

bmi: Body mass index of the beneficiary.

children: Number of children covered by health insurance.

smoker: Smoking status of the beneficiary.

region: Residential area of the beneficiary in the US.

charges: Individual medical costs per beneficiary billed by health insurance in a year.

df <- read.csv('insurance.csv', header = TRUE) # loading data
str(df) #examining structure of the dataset
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

1.3 Significance

Developing a robust predictive model for medical costs is crucial for assisting healthcare providers, insurers, and policymakers. This study aims to demonstrate the practical application of such a model.

1.4 Objective

The primary objective is to develop a predictive model using linear regression, establishing relationships between predictor variables (e.g., age, BMI, location) and the target variable (medical cost).

1.5 Scope and Limitations

While the model provides valuable insights based on historical data, it assumes observed relationships will continue in the future. External factors not in the dataset may influence medical costs in the real world.

Disclaimer

This study demonstrates the development of a linear regression model for learning and research purposes only. It does not necessarily reflect the real-world cost of insurance or medical care for any individual patient, and its findings are not intended for commercial or diagnostic use.

2.0 Methodology

2.1 Study Design

This study employs a multiple linear regression approach to investigate the relationship between the response variable (medical costs) and predictor variables.

2.2 Model Specification

The initial model includes all potential predictor variables. The model is then refined using backward elimination: at each step, the least significant predictor (the one with the largest p-value) is removed and the model is refitted.

2.3 Statistical Analysis

The process involves fitting the initial model, assessing predictor significance using hypothesis tests, and iteratively removing variables until the final model is obtained.
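The backward elimination procedure described above can be sketched in base R as follows. This is a minimal illustration, not the code used in this study: the data are simulated, the helper name `backward_eliminate` and the 0.05 threshold are assumptions, and `drop1()` is used so that a multi-level factor is tested as a whole rather than level by level.

```r
# Simulated stand-in for the insurance data (NOT the real dataset)
set.seed(42)
n <- 500
sim <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  noise  = rnorm(n)  # predictor unrelated to the response
)
sim$charges <- 250 * sim$age + 300 * sim$bmi +
  20000 * (sim$smoker == "yes") + rnorm(n, sd = 5000)

# Hypothetical helper: repeatedly drop the term with the largest p-value above alpha
backward_eliminate <- function(data, response, alpha = 0.05) {
  terms_left <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(terms_left, response = response), data = data)
    tests <- drop1(fit, test = "F")      # one F-test per term
    pvals <- tests[["Pr(>F)"]][-1]       # first row is "<none>"
    names(pvals) <- rownames(tests)[-1]
    if (max(pvals) <= alpha) return(fit) # every remaining term is significant
    terms_left <- setdiff(terms_left, names(pvals)[which.max(pvals)])
    if (length(terms_left) == 0) {
      # nothing survived: fall back to the intercept-only model
      return(lm(reformulate("1", response = response), data = data))
    }
  }
}

final_fit <- backward_eliminate(sim, "charges")
print(formula(final_fit))
```

Base R's `step()` automates a similar search but selects on AIC rather than p-values, so the two approaches can retain different predictors.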

2.4 Model Evaluation

Final model adequacy is assessed using diagnostic checks, including residual analysis and goodness-of-fit measures, mainly the R-squared and adjusted R-squared values.
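As a worked illustration of how the two measures relate, adjusted R-squared can be computed from R-squared, the sample size n, and the number of slope coefficients p. The helper name `adj_r2` is an assumption for this sketch. The inputs in the first two calls are the values reported in the model summaries in Section 3, and the last check uses the built-in `mtcars` data rather than the insurance data.

```r
# Adjusted R-squared penalises R-squared for the number of slope coefficients p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values taken from the summaries in Section 3 (n = 1338 observations)
adj_model_1 <- adj_r2(0.7509, n = 1338, p = 8)  # Model 1: 8 slope coefficients
adj_model_2 <- adj_r2(0.7509, n = 1338, p = 7)  # Model 2: 7 slope coefficients
round(c(model_1 = adj_model_1, model_2 = adj_model_2), 4)

# Sanity check of the formula against lm() on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)
all.equal(adj_r2(s$r.squared, n = nrow(mtcars), p = 2), s$adj.r.squared)
```

The two rounded results agree with the adjusted R-squared values printed by `summary()` for Models 1 and 2 (0.7494 and 0.7496).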

2.5 Software and Libraries

All analyses are conducted in RStudio using packages such as corrplot, ggplot2, mctest and knitr.

2.6 Data Exploration and Preprocessing

# Display basic information about the dataset
summary(df)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

Descriptive Analysis and Visualization

# Creating density plots for key variables
par(mfrow=c(2,2))
plot(density(df$age), main="Distribution of Age", xlab="Age")
plot(density(df$bmi), main="Distribution of BMI", xlab="BMI")
plot(density(df$charges), main="Distribution of Cost", xlab="Cost")
plot(density(df$children), main="Distribution of Children", xlab="Number of Children")

Correlation Analysis

# Scatter plot matrix for numeric variables
num_cols <- unlist(lapply(df, is.numeric))
plot(df[,num_cols])

# Correlation heatmap of numeric variables (requires the corrplot package)
library(corrplot)
cor_matrix <- round(cor(df[, num_cols]), 2)
corrplot(cor_matrix, method="number", type="upper", order="hclust", tl.col="black", tl.srt=45, addCoef.col="black", number.cex=0.7)

Descriptive Analysis of Categorical Variables

# Boxplots for categorical variables
boxplot(df$charges ~ df$region, main ='Region')

boxplot(df$charges ~ df$smoker, main ='Smoker')

boxplot(df$charges ~ df$sex, main ='Sex')

3.0 Model Building and Evaluation

Defining Variables

# Convert the categorical predictors to factors inside the data frame,
# so that the models fitted with data = df use them directly
df$smoker <- as.factor(df$smoker)
df$sex <- as.factor(df$sex)
df$region <- as.factor(df$region)
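The coefficient names that appear in the model summaries in Sections 3.1 to 3.3, such as smokeryes and regionnorthwest, come from the way R turns categorical predictors into 0/1 indicator columns. The sketch below shows this coding on a toy data frame whose values are made up for illustration. Only the level names match the dataset.

```r
# Toy data frame with the same categorical levels as the dataset (values are made up)
toy <- data.frame(
  smoker = factor(c("yes", "no", "no", "yes")),
  region = factor(c("southwest", "southeast", "northwest", "northeast"))
)

# Levels are sorted alphabetically; the first level is the reference category
levels(toy$region)

# model.matrix() shows the indicator columns that lm() builds internally
X <- model.matrix(~ smoker + region, data = toy)
print(X)

# To change the baseline, relevel the factor before fitting
toy$region <- relevel(toy$region, ref = "southeast")
releveled_cols <- colnames(model.matrix(~ smoker + region, data = toy))
print(releveled_cols)
```

Because northeast and non-smoker are the reference levels, they get no column of their own. Each region coefficient is therefore a difference from northeast, and smokeryes is a difference from non-smokers.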

3.1 Model 1

# Fit Model 1
model_1 <- lm(charges ~ ., data = df)
# Evaluate Model 1
summary(model_1)
## 
## Call:
## lm(formula = charges ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16
# Residual plots for Model 1
par(mfrow=c(2,2))
plot(model_1)

3.2 Model 2

# Fit Model 2
model_2 <- lm(charges ~ age + bmi + children + smoker + region, data= df)
# Evaluate Model 2
summary(model_2)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
# Residual plots for Model 2
par(mfrow=c(2,2))
plot(model_2)
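To make the Model 2 coefficients concrete, the sketch below computes a predicted charge by hand. The coefficients are copied from the summary above. The beneficiary (age 40, BMI 30, two children, northeast region) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the Model 2 summary
coefs <- c(
  "(Intercept)"   = -11990.27,
  age             = 256.97,
  bmi             = 338.66,
  children        = 474.57,
  smokeryes       = 23836.30,
  regionnorthwest = -352.18,
  regionsoutheast = -1034.36,
  regionsouthwest = -959.37
)

# Hypothetical beneficiary: 40 years old, BMI 30, 2 children, non-smoker,
# living in the northeast (the reference region, so all region indicators are 0)
x <- c("(Intercept)" = 1, age = 40, bmi = 30, children = 2, smokeryes = 0,
       regionnorthwest = 0, regionsoutheast = 0, regionsouthwest = 0)

pred_nonsmoker <- sum(coefs * x[names(coefs)])

# Same person as a smoker: only the smokeryes indicator changes
x["smokeryes"] <- 1
pred_smoker <- sum(coefs * x[names(coefs)])

c(nonsmoker = pred_nonsmoker, smoker = pred_smoker,
  difference = pred_smoker - pred_nonsmoker)
```

The difference between the two predictions equals the smokeryes coefficient, which is what "holding the other predictors fixed" means in an additive model. With the fitted object available, `predict(model_2, newdata = ...)` gives the same numbers up to rounding of the printed coefficients.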

3.3 Model 3

# Fit Model 3 (as written, the formula is identical to Model 2)
model_3 <- lm(charges ~ age + bmi + children + smoker + region, data= df)
# Evaluate Model 3
summary(model_3)
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
# Residual plots for Model 3
par(mfrow=c(2,2))
plot(model_3)

4.0 Model Evaluation and Interpretation

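All three models fit well, with a multiple R-squared of 0.7509 in each case. Dropping the sex variable, which is not significant in Model 1 (p = 0.69), leaves R-squared unchanged and raises the adjusted R-squared only marginally, from 0.7494 to 0.7496. Model 3 is fitted with the same formula as Model 2 and therefore reproduces its results exactly.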
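Whether a dropped predictor matters can also be tested formally by comparing the nested models with a partial F-test. The sketch below uses simulated data in which sex has no true effect. It illustrates the technique and does not re-analyse the insurance data.

```r
# Simulated data (NOT the insurance dataset): sex has no effect on charges
set.seed(1)
n <- 300
sim <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
sim$charges <- 3000 + 250 * sim$age + rnorm(n, sd = 4000)

full    <- lm(charges ~ age + sex, data = sim)
reduced <- lm(charges ~ age, data = sim)

# Partial F-test: does adding sex reduce the residual sum of squares
# by more than would be expected by chance?
cmp <- anova(reduced, full)
print(cmp)

# When a single coefficient is dropped, the F statistic is the squared t statistic
t_sex <- summary(full)$coefficients["sexmale", "t value"]
all.equal(cmp$F[2], t_sex^2)
```

Applied to the study's models, the same call would be `anova(model_2, model_1)`. Since sex has a t value of -0.394 in Model 1, the corresponding F statistic would be about 0.155.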

Residual Analysis

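The residual plots were judged to show homoscedasticity, normality and linearity of the residuals in all three models. This reading should be treated with caution: the residual summaries are right-skewed (median about -980, minimum about -11,300, maximum about 29,900), which is not what normally distributed residuals would typically look like.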
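Visual inspection of residual plots can be supplemented with formal tests. The sketch below applies a Shapiro-Wilk test for normality and a hand-computed studentized Breusch-Pagan test for constant variance. The data are simulated with an error spread that deliberately grows with the predictor, so the numbers say nothing about the models in this study.

```r
# Simulated data (NOT the insurance dataset) with non-constant error variance
set.seed(7)
n <- 400
x <- runif(n, 18, 64)
y <- 1000 + 250 * x + rnorm(n, sd = 50 * x)  # spread grows with x by design
fit <- lm(y ~ x)
res <- residuals(fit)

# Normality: Shapiro-Wilk test (a small p-value suggests non-normal residuals)
sw <- shapiro.test(res)

# Constant variance: studentized Breusch-Pagan statistic computed by hand.
# Regress the squared residuals on the predictors; under the null hypothesis
# n * R^2 is approximately chi-squared with df = number of predictors.
aux <- lm(I(res^2) ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = sw$p.value, breusch_pagan_stat = bp_stat, breusch_pagan_p = bp_p)
```

For the study's models, `res` would be `residuals(model_2)` and the auxiliary regression would include all of the model's predictors, with `df` set to the number of slope coefficients.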

5.0 Recommendations for Improvement

Feature Engineering: Explore additional features or transformations of existing features to enhance model performance.

Outlier Handling: Investigate and handle potential outliers in the dataset.

Interaction Terms: Consider incorporating interaction terms between variables to capture nuanced relationships.

Cross-Validation: Implement cross-validation techniques for robust model validation.

Further Exploration: Explore non-linear models or ensemble methods to capture complex patterns.
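Two of the recommendations above, interaction terms and cross-validation, can be combined in one short sketch. It compares an additive model with a model containing a bmi by smoker interaction using 5-fold cross-validated RMSE. The data are simulated under the assumption that BMI raises cost more steeply for smokers, and the helper name `cv_rmse` is hypothetical.

```r
# Simulated data (NOT the insurance dataset)
set.seed(123)
n <- 600
sim <- data.frame(
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
# Assumption built into the simulation: the BMI slope is much steeper for smokers
is_smoker <- sim$smoker == "yes"
sim$charges <- 5000 + 100 * sim$bmi +
  is_smoker * (-20000 + 1400 * sim$bmi) + rnorm(n, sd = 4000)

# Hypothetical helper: k-fold cross-validated RMSE for a linear model
cv_rmse <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  response <- all.vars(formula)[1]
  fold_rmse <- sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])           # train on k - 1 folds
    pred <- predict(fit, newdata = data[folds == i, ])       # predict held-out fold
    sqrt(mean((data[[response]][folds == i] - pred)^2))
  })
  mean(fold_rmse)
}

rmse_additive    <- cv_rmse(charges ~ bmi + smoker, sim)
rmse_interaction <- cv_rmse(charges ~ bmi * smoker, sim)
c(additive = rmse_additive, interaction = rmse_interaction)
```

Because the error is measured on held-out folds, a lower RMSE for the interaction model reflects better out-of-sample prediction rather than the automatic gain in R-squared that comes from adding terms.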

6.0 Conclusion

This project establishes a foundation for predicting medical costs using linear regression. The models developed provide valuable insights, but continuous refinement and exploration of advanced techniques can further enhance predictive accuracy. In Part II of this Study, I will explore the implementation of the recommendations in Section 5 to further refine the models developed in this study.

7.0 Author Contact

Michael Adu, PharmD

Email:

Feel free to connect with me on LinkedIn: Michael Adu