2025-09-25

Introduction

  • Focus on Predicting Medical Charges
  • Evaluation of Lifestyle and Demographic Factors
  • Application of Statistical and Machine Learning Models
  • Identification of Key Drivers of Healthcare Cost

Dataset Overview

  • Age: Age of Primary Beneficiary
  • Sex: Gender of Insurance Contractor
  • BMI: Body Mass Index, Defined as Weight (kg) Divided by Height Squared (m²)
  • Children: Number of Dependents Covered by Insurance
  • Smoker: Smoking Status of Beneficiary
  • Region: Residential Area within the United States
  • Charges: Individual Medical Costs Billed by Health Insurance
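
For reference, BMI is conventionally computed as weight in kilograms divided by the square of height in metres. A minimal sketch follows; the helper name `compute_bmi` and the example measurements are illustrative assumptions, since the dataset supplies BMI directly.

```r
# Hypothetical helper: BMI = weight (kg) / height (m)^2
compute_bmi <- function(weight_kg, height_m) {
  stopifnot(all(weight_kg > 0), all(height_m > 0))
  weight_kg / height_m^2
}

# Illustrative values only (not taken from the insurance data)
example_bmi <- compute_bmi(weight_kg = 80, height_m = 1.75)
round(example_bmi, 2)
```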

Problem Definition

  • Prediction of Individual Medical Insurance Charges
  • Evaluation of Demographic and Lifestyle Predictors
  • Determination of Most Influential Risk Factors
  • Comparison of Statistical and Machine Learning Models
  • Assessment of Accuracy and Practical Interpretability

Data Wrangling and Preparation

Preparation of the dataset began with a structured evaluation of its overall organization to confirm correct variable types and formatting. Transformation of qualitative attributes into factor variables established proper categorical representation for analysis. Verification of complete records confirmed the dataset contained no missing observations, thereby supporting consistent application of statistical modeling and machine learning methods.

# Inspect Structure
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

# Encode Categorical Variables
insurance$sex    <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)

# Check Missing Values
sum(is.na(insurance))
## [1] 0
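
The same preparation steps can be written more generically. The sketch below is one possible approach, not the original workflow: it rebuilds the first four rows shown in the `str()` output as a small data frame (named `first_rows` here, an assumed name) so the snippet runs on its own. In practice `insurance` would take its place.

```r
# First four rows as displayed by str(insurance); used only for illustration
first_rows <- data.frame(
  age    = c(19L, 18L, 28L, 33L),
  sex    = c("female", "male", "male", "male"),
  smoker = c("yes", "no", "no", "no"),
  region = c("southwest", "southeast", "southeast", "northwest"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one pass
is_chr <- vapply(first_rows, is.character, logical(1))
first_rows[is_chr] <- lapply(first_rows[is_chr], factor)

# Count missing values per column rather than only in total
na_per_column <- colSums(is.na(first_rows))
na_per_column
str(first_rows)
```

Reporting missing values per column makes it easier to locate a problem if the total is ever non-zero.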

Summary Statistics

Descriptive statistics provided an initial quantitative profile of the dataset. Numerical variables were summarized by central tendency and dispersion, revealing an age range of 18 to 64 years, a median body mass index of 30.4, and a mean medical charge ($13,270) well above the median ($9,382). Categorical factors were represented through frequency counts, highlighting a pronounced imbalance in smoking status (1,064 non-smokers versus 274 smokers) and a nearly even regional distribution, with the Southeast slightly overrepresented. These outputs established a baseline for subsequent exploration of relationships among predictors and healthcare expenditures.

# Summary Statistics
summary(insurance)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770
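
To put the categorical counts on a common scale, they can be converted to percentages. The sketch below hard-codes the counts from the `summary()` output so it runs without the data; the variable names are illustrative. With the data loaded, `prop.table(table(insurance$smoker))` would give the same proportions.

```r
# Frequency counts copied from the summary() output above
smoker_counts <- c(no = 1064, yes = 274)
region_counts <- c(northeast = 324, northwest = 325,
                   southeast = 364, southwest = 325)

# Convert counts to percentages of the 1338 records
smoker_pct <- round(100 * prop.table(smoker_counts), 1)
region_pct <- round(100 * prop.table(region_counts), 1)
smoker_pct
region_pct
```

Smokers make up about one fifth of the sample, while each region holds roughly a quarter of the records.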

Exploring the Response Variable

Histogram indicates a strongly right-skewed cost distribution: the mean charge ($13,270) sits well above the median ($9,382), most expenditures fall below $20,000 (third quartile of $16,640), and a small subset of individuals incurs charges as high as $63,770. Such an imbalanced pattern highlights the influence of extreme values on statistical estimates and underscores the importance of considering transformations or robust regression strategies in predictive modeling.
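
The effect of a log transformation on a right-skewed variable can be sketched as follows. The values are simulated from a log-normal distribution purely for illustration and are not the insurance charges; the `skewness` helper is a hand-written moment estimator, not a function from a package.

```r
# Simulated right-skewed costs (made-up values, NOT the insurance data)
set.seed(1)
sim_charges <- rlnorm(500, meanlog = 9, sdlog = 0.9)

# Moment-based sample skewness; positive values indicate a long right tail
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(sim_charges)
skew_log <- skewness(log(sim_charges))

# Side-by-side histograms before and after the log transformation
op <- par(mfrow = c(1, 2))
hist(sim_charges, breaks = 30, main = "Raw scale", xlab = "Charges")
hist(log(sim_charges), breaks = 30, main = "Log scale", xlab = "log(Charges)")
par(op)

c(raw = skew_raw, log = skew_log)
```

With the real data, the same comparison would use `insurance$charges` in place of `sim_charges`.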

Investigating Predictor Relationships

Boxplot comparison reveals profound cost disparity, with smokers exhibiting substantially higher medical charges relative to non-smokers. Magnitude of this separation establishes smoking status as a dominant predictor of healthcare expenditures and reinforces its importance in modeling efforts.
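
A grouped boxplot with matching group medians can be sketched in base R as below. The eight rows are made-up values chosen only so the snippet runs on its own; with the real data, `insurance` would replace `toy_smoke`.

```r
# Made-up charges for illustration (NOT rows from the insurance data)
toy_smoke <- data.frame(
  smoker  = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes")),
  charges = c(2000, 4500, 8000, 12000, 17000, 24000, 35000, 42000)
)

# Boxplot of charges split by smoking status
boxplot(charges ~ smoker, data = toy_smoke,
        xlab = "Smoker", ylab = "Charges",
        main = "Charges by Smoking Status")

# Numeric companion to the plot: median charge per group and their gap
group_medians <- tapply(toy_smoke$charges, toy_smoke$smoker, median)
median_gap <- group_medians[["yes"]] - group_medians[["no"]]
group_medians
median_gap
```

Reporting the median gap alongside the plot gives a single number for the separation the boxplot shows.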

Exploring Continuous Predictors: BMI

Scatterplot highlights positive association between BMI and medical charges, with costs increasing at higher BMI values. Outliers emerge among individuals with BMI above 40, particularly smokers incurring charges exceeding $60,000. Presence of these extreme cases suggests an interaction between smoking and elevated BMI in driving healthcare expenditures.
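
One way to check for such an interaction is to colour the scatterplot by smoking status and fit a model with a smoker-by-BMI term. The sketch below uses made-up data with a built-in interaction so it runs on its own; the data frame `toy_bmi` and its values are assumptions, and the fitted numbers say nothing about the real dataset.

```r
# Made-up data with a built-in smoker-by-BMI interaction (illustration only)
bmi     <- rep(seq(20, 40, by = 2), times = 2)
smoker  <- factor(rep(c("no", "yes"), each = 11))
wiggle  <- rep(c(-50, 50), length.out = 22)  # small deterministic perturbation
charges <- 5000 + 100 * bmi +
  ifelse(smoker == "yes", 1000 * (bmi - 20), 0) + wiggle
toy_bmi <- data.frame(bmi, smoker, charges)

# Scatterplot coloured by smoking status
plot(charges ~ bmi, data = toy_bmi, pch = 19,
     col = ifelse(toy_bmi$smoker == "yes", "red", "blue"),
     xlab = "BMI", ylab = "Charges")
legend("topleft", legend = levels(toy_bmi$smoker),
       col = c("blue", "red"), pch = 19, title = "Smoker")

# smoker * bmi expands to both main effects plus their interaction
fit_int <- lm(charges ~ smoker * bmi, data = toy_bmi)
coef(fit_int)
```

A clearly positive `smokeryes:bmi` coefficient means the BMI slope is steeper for smokers. Applied to the real data, the call would be `lm(charges ~ smoker * bmi, data = insurance)`.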

Investigating Continuous Predictors: Age

Scatterplot demonstrates upward trajectory of medical charges with increasing age, consistent with higher healthcare utilization among older individuals. Pronounced separation between smokers and non-smokers intensifies with age, with older smokers frequently incurring costs above $40,000. Evidence reinforces age as a primary demographic predictor of healthcare expenditures.

Exploring Regional Differences

Boxplot comparison reveals modest regional variation, with charges highest in the Southeast and lowest in the Northwest and Southwest. Overlap in distributions indicates that regional residence exerts weaker influence relative to lifestyle and demographic factors, reinforcing the dominant role of smoking status, age, and body mass index in determining medical costs.

Exploring Family Size

Boxplot comparison illustrates minimal variation in charges across family sizes, with medians and interquartile ranges remaining relatively stable. Outliers occur in all categories, driven primarily by smoking status and body mass index rather than number of dependents. Evidence indicates family size exerts negligible influence on healthcare expenditures relative to lifestyle and demographic predictors.

Exploring Gender Differences

Boxplot comparison indicates modest gender-based variation in charges, with males exhibiting slightly higher median and mean costs relative to females. Distributional overlap demonstrates that gender alone contributes minimally to predicting expenditures, particularly when contrasted against smoking status, body mass index, and age.

Multivariate Visualization

Three-dimensional scatterplot integrates age, body mass index, and charges, with color encoding smoking status. Visualization reveals pronounced clustering of extreme expenditures among smokers with elevated BMI and older age, reinforcing interactive effects between demographic and lifestyle risk factors. Evidence consolidates smoking, age, and body mass index as the dominant predictors of healthcare costs.

R Code Demonstration

# Correlation Matrix Among Continuous Predictors
cont_vars <- insurance[, c("age", "bmi", "charges")]
cor_matrix <- cor(cont_vars)
round(cor_matrix, 3)
##           age   bmi charges
## age     1.000 0.109   0.299
## bmi     0.109 1.000   0.198
## charges 0.299 0.198   1.000

# Simple Linear Model Example
lm_charges <- lm(charges ~ age + bmi, data = insurance)
summary(lm_charges)
## 
## Call:
## lm(formula = charges ~ age + bmi, data = insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14457  -7045  -5136   7211  48022 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6424.80    1744.09  -3.684 0.000239 ***
## age           241.93      22.30  10.850  < 2e-16 ***
## bmi           332.97      51.37   6.481 1.28e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11390 on 1335 degrees of freedom
## Multiple R-squared:  0.1172, Adjusted R-squared:  0.1159 
## F-statistic:  88.6 on 2 and 1335 DF,  p-value: < 2.2e-16

Predictive Modeling: Linear Regression

Model Equation:

\[ \text{Charges}_i = \beta_0 + \beta_1 \cdot \text{Smoker}_i + \beta_2 \cdot \text{Age}_i + \beta_3 \cdot \text{BMI}_i + \epsilon_i \]

where \(\text{Smoker}_i\) equals 1 for smokers and 0 for non-smokers.

# Linear Regression Model
lm_model <- lm(charges ~ smoker + age + bmi, data = insurance)
summary(lm_model)
## 
## Call:
## lm(formula = charges ~ smoker + age + bmi, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12415.4  -2970.9   -980.5   1480.0  28971.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11676.83     937.57  -12.45   <2e-16 ***
## smokeryes    23823.68     412.87   57.70   <2e-16 ***
## age            259.55      11.93   21.75   <2e-16 ***
## bmi            322.62      27.49   11.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7469 
## F-statistic:  1316 on 3 and 1334 DF,  p-value: < 2.2e-16

Linear regression confirms that smoking, age, and BMI are statistically significant predictors of medical charges (all p < 2e-16). Holding the other predictors constant, smokers incur approximately $23,824 higher costs than non-smokers, each unit increase in BMI adds about $323, and each additional year of age contributes nearly $260 to expected charges. With an R² of 0.748, the model explains roughly three-quarters of the variance in charges, a substantial gain over the 0.117 achieved by age and BMI alone, while leaving room for improvement through nonlinear methods or additional predictors.
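
The coefficient interpretation can be made concrete with a worked prediction. The sketch below hard-codes the estimates from the `summary(lm_model)` output so it runs without the data; the helper `predict_charges` and the example profile (age 40, BMI 30) are illustrative assumptions.

```r
# Coefficients copied from the summary(lm_model) output above
beta <- c(intercept = -11676.83, smoker = 23823.68,
          age = 259.55, bmi = 322.62)

# Hypothetical helper: expected charges for a given age, BMI, and smoking status
predict_charges <- function(age, bmi, smoker) {
  beta[["intercept"]] + beta[["smoker"]] * as.numeric(smoker) +
    beta[["age"]] * age + beta[["bmi"]] * bmi
}

# Illustrative profile: 40 years old with a BMI of 30
smoker_40    <- predict_charges(age = 40, bmi = 30, smoker = TRUE)
nonsmoker_40 <- predict_charges(age = 40, bmi = 30, smoker = FALSE)
c(smoker = smoker_40, non_smoker = nonsmoker_40)
```

For this profile the model predicts about $32,207 for a smoker and about $8,384 for a non-smoker; the difference is exactly the smoker coefficient. With the fitted object available, `predict()` on `lm_model` with a suitable `newdata` frame would do the same job without hard-coding estimates.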

Conclusion

  • Smoking Is the Dominant Predictor of Medical Charges
  • Age and BMI Show Significant Positive Associations
  • Gender, Region, and Family Size Have Minimal Influence
  • Linear Regression Explained ~75% of Charge Variance
  • Lifestyle and Demographics Drive Expenditures
  • Future Work: Explore Nonlinear Models for Accuracy

Thank You for Viewing This Presentation!