- Focus on Predicting Medical Charges
- Evaluation of Lifestyle and Demographic Factors
- Application of Statistical and Machine Learning Models
- Identification of Key Drivers of Healthcare Cost
2025-09-25
Preparation of the dataset began with a structured evaluation of its overall organization to confirm correct variable types and formatting. Transformation of qualitative attributes into factor variables established proper categorical representation for analysis. Verification of complete records confirmed the dataset contained no missing observations, thereby supporting consistent application of statistical modeling and machine learning methods.
# Inspect Structure
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...
# Encode Categorical Variables
insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)

# Check Missing Values
sum(is.na(insurance))
## [1] 0
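The same preparation steps can be written so they do not depend on naming each column. The sketch below is illustrative only: `toy` is a hypothetical stand-in built from the first four rows shown in the `str()` output, and `is_chr` and `na_per_column` are names introduced here, not part of the original analysis.

```r
# Toy stand-in built from the first four rows of the str() output above.
toy <- data.frame(
  age      = c(19L, 18L, 28L, 33L),
  sex      = c("female", "male", "male", "male"),
  bmi      = c(27.9, 33.8, 33.0, 22.7),
  children = c(0L, 1L, 3L, 0L),
  smoker   = c("yes", "no", "no", "no"),
  region   = c("southwest", "southeast", "southeast", "northwest"),
  charges  = c(16885, 1726, 4449, 21984),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one pass.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# Count missing values per column instead of only in total,
# which shows where gaps sit if any exist.
na_per_column <- colSums(is.na(toy))
na_per_column
```

Applied to the full data, replacing `toy` with `insurance` should give the same factor encoding as the three explicit `as.factor()` calls.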
Descriptive statistics provided an initial quantitative profile of the dataset. Numerical variables were summarized by central tendency and dispersion, revealing the typical age range, distribution of body mass index, and magnitude of medical charges. Categorical factors were represented through frequency counts, highlighting imbalances in smoking prevalence and regional distribution. These outputs established a rigorous baseline for subsequent exploration of relationships among predictors and healthcare expenditures.
# Summary Statistics
summary(insurance)
##       age            sex           bmi           children     smoker
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274
##  Median :39.00                Median :30.40   Median :1.000
##  Mean   :39.21                Mean   :30.66   Mean   :1.095
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000
##  Max.   :64.00                Max.   :53.13   Max.   :5.000
##        region       charges
##  northeast:324   Min.   : 1122
##  northwest:325   1st Qu.: 4740
##  southeast:364   Median : 9382
##  southwest:325   Mean   :13270
##                  3rd Qu.:16640
##                  Max.   :63770
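The imbalances mentioned above can be quantified directly from the printed counts. This is a small worked calculation using only figures from the `summary()` output; the variable names are introduced here for illustration.

```r
# Counts and statistics copied from the summary() output above.
n_total     <- 1338
n_smoker    <- 274
n_southeast <- 364
mean_charges   <- 13270
median_charges <- 9382

# Share of smokers and of the largest region.
smoker_share    <- n_smoker / n_total      # roughly 0.205
southeast_share <- n_southeast / n_total   # roughly 0.272

# A mean well above the median is a quick sign of right skew.
mean_to_median <- mean_charges / median_charges  # roughly 1.41

round(c(smoker_share, southeast_share, mean_to_median), 3)
```

Roughly one in five records is a smoker, the regions are close to evenly split, and mean charges sit about 41% above the median.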
Histogram indicates a strongly right-skewed cost distribution, with typical expenditures clustered below $20,000 and a small subset of individuals incurring exceptionally high charges. Such an imbalanced pattern highlights the influence of extreme values on statistical estimates and underscores the importance of considering transformations or robust regression strategies in predictive modeling.
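One common transformation for right-skewed costs is the natural logarithm. The sketch below uses simulated log-normal values as a stand-in for `insurance$charges`; the simulation parameters and the `skewness` helper are assumptions made for illustration and are not taken from the analysis.

```r
set.seed(1)

# Synthetic right-skewed "charges" (assumed log-normal, parameters arbitrary).
sim_charges <- rlnorm(1338, meanlog = 9.1, sdlog = 0.9)

# Moment-based sample skewness: 0 for a symmetric distribution,
# positive when the right tail is longer.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skew_raw <- skewness(sim_charges)       # strongly positive
skew_log <- skewness(log(sim_charges))  # close to zero

c(raw = skew_raw, log = skew_log)
```

With the real data, the equivalent check would be `skewness(insurance$charges)` against `skewness(log(insurance$charges))`. Real charges need not be log-normal, so the log scale may reduce the skew without removing it.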
Boxplot comparison reveals profound cost disparity, with smokers exhibiting substantially higher medical charges relative to non-smokers. Magnitude of this separation establishes smoking status as a dominant predictor of healthcare expenditures and reinforces its importance in modeling efforts.
Scatterplot highlights a positive association between BMI and medical charges, with costs increasing at higher BMI values. Outliers emerge among individuals with BMI above 40, particularly smokers incurring charges exceeding $60,000. These extreme cases suggest an interaction between BMI and smoking status in driving healthcare expenditures.
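An interaction of this kind can be tested formally by comparing an additive model with one that includes a BMI-by-smoker term. The sketch below runs on simulated data, so every number in the data-generating step (slopes, intercepts, noise level) is an arbitrary assumption chosen to mimic the pattern described, not an estimate from the insurance data.

```r
set.seed(42)
n <- 500

# Synthetic predictors: BMI centred near 30, about 20% smokers.
sim <- data.frame(
  bmi    = rnorm(n, mean = 30.7, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)

# Assumed data-generating process: BMI raises charges far more steeply for smokers.
slope <- ifelse(sim$smoker == "yes", 1400, 80)
base  <- ifelse(sim$smoker == "yes", -10000, 5000)
sim$charges <- base + slope * sim$bmi + rnorm(n, sd = 4000)

# Additive model versus a model with a BMI x smoker interaction.
fit_additive    <- lm(charges ~ bmi + smoker, data = sim)
fit_interaction <- lm(charges ~ bmi * smoker, data = sim)

# F-test: does the interaction term improve on the additive model?
comparison <- anova(fit_additive, fit_interaction)

# Extra charge per BMI unit for smokers relative to non-smokers.
interaction_slope <- coef(fit_interaction)[["bmi:smokeryes"]]
```

On the real data the same comparison would use `data = insurance`; the interaction coefficient name depends on the factor levels of `smoker`.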
Scatterplot demonstrates upward trajectory of medical charges with increasing age, consistent with higher healthcare utilization among older individuals. Pronounced separation between smokers and non-smokers intensifies with age, with older smokers frequently incurring costs above $40,000. Evidence reinforces age as a primary demographic predictor of healthcare expenditures.
Boxplot comparison reveals modest regional variation, with charges highest in the Southeast and lowest in the Northwest and Southwest. Overlap in distributions indicates that regional residence exerts weaker influence relative to lifestyle and demographic factors, reinforcing the dominant role of smoking status, age, and body mass index in determining medical costs.
Boxplot comparison illustrates minimal variation in charges across family sizes, with medians and interquartile ranges remaining relatively stable. Outliers occur in all categories, driven primarily by smoking status and body mass index rather than number of dependents. Evidence indicates family size exerts negligible influence on healthcare expenditures relative to lifestyle and demographic predictors.
Boxplot comparison indicates modest gender-based variation in charges, with males exhibiting slightly higher median and mean costs relative to females. Distributional overlap demonstrates that gender alone contributes minimally to predicting expenditures, particularly when contrasted against smoking status, body mass index, and age.
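The group comparisons described for smoking status, region, number of children, and sex can be backed by numeric summaries. The helper below, `summarise_charges`, is a hypothetical function written for illustration; it is demonstrated on a four-row toy frame taken from the first rows of the `str()` output, which is far too small to show the patterns in the plots and only demonstrates the mechanics.

```r
# Count, median, and mean of charges for each level of a grouping column.
summarise_charges <- function(data, group) {
  g <- data[[group]]
  n_by_group <- tapply(data$charges, g, length)
  data.frame(
    level  = names(n_by_group),
    n      = as.vector(n_by_group),
    median = as.vector(tapply(data$charges, g, median)),
    mean   = as.vector(tapply(data$charges, g, mean))
  )
}

# Toy rows copied from the str() output (first four observations).
toy <- data.frame(
  smoker  = c("yes", "no", "no", "no"),
  charges = c(16885, 1726, 4449, 21984)
)

by_smoker <- summarise_charges(toy, "smoker")
by_smoker
```

With the full data, calls such as `summarise_charges(insurance, "region")` or `summarise_charges(insurance, "sex")` would give the medians and means that the boxplots display visually.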
Three-dimensional scatterplot integrates age, body mass index, and charges, with color encoding smoking status. Visualization reveals pronounced clustering of extreme expenditures among smokers with elevated BMI and older age, reinforcing interactive effects between demographic and lifestyle risk factors. Evidence consolidates smoking, age, and body mass index as the dominant predictors of healthcare costs.
# Correlation Matrix Among Continuous Variables (Predictors and Response)
cont_vars <- insurance[, c("age", "bmi", "charges")]
cor_matrix <- cor(cont_vars)
round(cor_matrix, 3)
##           age   bmi charges
## age     1.000 0.109   0.299
## bmi     0.109 1.000   0.198
## charges 0.299 0.198   1.000
# Simple Linear Model Example
lm_charges <- lm(charges ~ age + bmi, data = insurance)
summary(lm_charges)
##
## Call:
## lm(formula = charges ~ age + bmi, data = insurance)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -14457  -7045  -5136   7211  48022
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6424.80    1744.09  -3.684 0.000239 ***
## age           241.93      22.30  10.850  < 2e-16 ***
## bmi           332.97      51.37   6.481 1.28e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11390 on 1335 degrees of freedom
## Multiple R-squared:  0.1172, Adjusted R-squared:  0.1159
## F-statistic:  88.6 on 2 and 1335 DF,  p-value: < 2.2e-16
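The low R-squared of this two-predictor model follows directly from the weak correlations reported earlier. As a worked check, the standard identity for a linear model with two predictors expresses R-squared through the three pairwise correlations. The values below are the rounded figures from the correlation matrix, so the result matches the model output only approximately.

```r
# Pearson correlations copied from the correlation matrix output.
r_age_charges <- 0.299
r_bmi_charges <- 0.198
r_age_bmi     <- 0.109

# R-squared of a two-predictor linear model written in terms of correlations.
r_squared <- (r_age_charges^2 + r_bmi_charges^2 -
                2 * r_age_charges * r_bmi_charges * r_age_bmi) /
  (1 - r_age_bmi^2)

round(r_squared, 3)  # about 0.117, in line with the reported 0.1172

# Variance in charges explained by age alone.
round(r_age_charges^2, 3)  # about 0.089
```

Age and BMI together account for under 12% of the variance in charges, which leaves most of the variation to other factors.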
\[ \text{Charges}_i = \beta_0 + \beta_1 \cdot \text{Smoker}_i + \beta_2 \cdot \text{Age}_i + \beta_3 \cdot \text{BMI}_i + \epsilon_i \]
# Linear Regression Model
lm_model <- lm(charges ~ smoker + age + bmi, data = insurance)
summary(lm_model)
##
## Call:
## lm(formula = charges ~ smoker + age + bmi, data = insurance)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12415.4  -2970.9   -980.5   1480.0  28971.8
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11676.83     937.57  -12.45   <2e-16 ***
## smokerYes    23823.68     412.87   57.70   <2e-16 ***
## age            259.55      11.93   21.75   <2e-16 ***
## bmi            322.62      27.49   11.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6092 on 1334 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7469
## F-statistic:  1316 on 3 and 1334 DF,  p-value: < 2.2e-16
Linear regression confirms that smoking, age, and BMI are statistically significant predictors of medical charges. Holding the other two predictors constant, smokers incur approximately $23,824 higher costs relative to non-smokers, each unit increase in BMI adds about $323, and each additional year of age contributes nearly $260 to expected charges. With an R² of 0.748, the model explains nearly three-quarters of the variance in charges, underscoring its effectiveness while suggesting potential improvement through nonlinear methods or additional predictors.
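To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the model output; the profile used (age 40, BMI 30) is a hypothetical example chosen for illustration, and `predict_charges` is a helper introduced here rather than part of the original analysis.

```r
# Coefficient estimates copied from the regression output.
coefs <- c(intercept = -11676.83, smoker = 23823.68, age = 259.55, bmi = 322.62)

# Expected charges under the fitted linear model; smoker is coded 1 (yes) or 0 (no).
predict_charges <- function(smoker, age, bmi) {
  unname(coefs["intercept"] + coefs["smoker"] * smoker +
           coefs["age"] * age + coefs["bmi"] * bmi)
}

# Hypothetical profile: 40 years old with a BMI of 30.
pred_nonsmoker <- predict_charges(smoker = 0, age = 40, bmi = 30)  # 8383.77
pred_smoker    <- predict_charges(smoker = 1, age = 40, bmi = 30)  # 32207.45

pred_smoker - pred_nonsmoker  # equals the smoker coefficient, 23823.68
```

Because the model is additive, the gap between the two predictions is the same at every age and BMI. That constraint is one reason an interaction term or a nonlinear method might improve the fit.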
Thank You for Viewing This Presentation!