Modeling Medical Costs Through Regression

CUNY SPS DATA 621

Author

Darwhin Gomez

Medical costs represent a significant financial burden for individuals and are a primary reason health insurance coverage is widely sought. Accurately understanding the factors that drive medical insurance charges is essential for insurers seeking to price plans fairly and sustainably, as well as for policymakers concerned with healthcare affordability. This paper aims to model the relationship between individual medical insurance costs and selected demographic, health-related, and behavioral indicators using linear regression techniques. Using a dataset of insured individuals in the United States, we examine how factors such as age, body mass index (BMI), smoking status, number of dependents, and geographic region influence medical charges. Multiple regression models are developed and evaluated to identify statistically significant predictors and assess model assumptions. The results provide interpretable insights into the primary drivers of medical insurance costs and demonstrate the usefulness of regression modeling in health economics and insurance pricing applications.

Abstract

Medical insurance costs are highly variable, right-skewed, and influenced by both demographic and behavioral factors. Accurately modeling these costs is essential for insurers seeking to price risk fairly and sustainably. This study examines the relationship between medical insurance charges and individual characteristics using regression-based approaches. Using a dataset of 1,338 insured individuals in the United States, we evaluate ordinary least squares regression, interaction models, robust regression, and generalized linear models. Due to the skewed and heteroskedastic nature of cost data, a Gamma generalized linear model with a log link and selected interaction terms was chosen as the final specification. Results indicate that smoking status is the dominant driver of medical costs and significantly amplifies the effect of body mass index. The final model demonstrates strong predictive performance on a held-out test set and provides interpretable insights relevant for insurance pricing and risk assessment.

1. Introduction

Rising healthcare costs present a significant financial challenge for individuals, insurers, and policymakers. Medical insurance premiums are typically based on expected healthcare utilization, which in turn depends on demographic characteristics, health indicators, and behavioral risk factors. Understanding how these factors jointly influence medical costs is critical for effective risk pricing and sustainable insurance design.

Traditional linear regression models are often used to analyze cost data due to their interpretability; however, medical costs violate many classical regression assumptions. Costs are strictly positive, right-skewed, and characterized by a small number of high-cost individuals who account for a disproportionate share of total expenditures. As a result, naïve linear regression may produce biased inference and poor predictive performance.

This study addresses these challenges by applying regression techniques tailored to cost data. Specifically, we investigate whether smoking status modifies the effects of other predictors—particularly body mass index—on medical insurance charges. The objective is twofold: (1) to identify key drivers of medical costs and (2) to develop a predictive model that balances interpretability with statistical appropriateness.

2. Literature Review

Modeling healthcare expenditures presents significant econometric challenges due to the inherently skewed and heavy-tailed nature of cost data. Medical expenditures are characterized by a large mass of low-cost observations and a long right tail of extremely high costs. These features are commonly associated with heteroscedastic error structures, leading to frequent violations of the homoscedasticity assumption in ordinary least squares regression.

A common response is to apply a logarithmic transformation to costs and estimate a linear model on the transformed scale. However, this approach introduces important limitations. Zero-cost observations cannot be accommodated without ad hoc adjustments, and retransformation bias arises when predicted log-costs are exponentiated to recover expected costs on the original scale. As shown by Manning (1998), this bias persists under heteroscedasticity, which is common even after log transformation. While smearing estimators may partially address this issue (Duan, 1983; Duan et al., 1983), they rely on additional assumptions and do not fully resolve model misspecification (Manning, 1998).
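The retransformation problem described above can be illustrated with a minimal sketch. The data here are simulated log-normal costs rather than the study's data, and the function names are hypothetical. For an intercept-only log model, exponentiating the mean log-cost gives the geometric mean, which understates the arithmetic mean, while a Duan-style smearing factor recovers it.

```python
import math
import random

def naive_retransform(log_costs):
    """Exponentiate the mean of log-costs (ignores retransformation bias)."""
    mean_log = sum(log_costs) / len(log_costs)
    return math.exp(mean_log)

def smearing_retransform(log_costs):
    """Duan-style smearing for an intercept-only log model:
    exp(mean log-cost) times the average exponentiated residual."""
    mean_log = sum(log_costs) / len(log_costs)
    smear = sum(math.exp(v - mean_log) for v in log_costs) / len(log_costs)
    return math.exp(mean_log) * smear

# Hypothetical, simulated costs (log-normal); not the study's data.
rng = random.Random(621)
costs = [math.exp(rng.gauss(9.0, 1.0)) for _ in range(5000)]
log_costs = [math.log(c) for c in costs]

sample_mean = sum(costs) / len(costs)
naive = naive_retransform(log_costs)
smeared = smearing_retransform(log_costs)

print(f"sample mean cost:    {sample_mean:,.0f}")
print(f"naive exp(mean log): {naive:,.0f}")
print(f"smearing estimate:   {smeared:,.0f}")
```

In this simplified setting the smearing estimate reproduces the sample mean by construction. With covariates and heteroscedastic errors a single smearing factor is no longer adequate, which is the limitation the paragraph above attributes to Manning (1998).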

Generalized linear models provide a more appropriate framework for modeling healthcare costs (Blough et al., 1999; Deb & Trivedi, 2002). By allowing the response variable to follow a distribution from the exponential dispersion family, GLMs explicitly accommodate the skewed and heteroscedastic nature of cost data (Glick et al., 2014). In particular, the Gamma distribution with a log link is well suited for continuous, positive expenditures, as it naturally models variance proportional to the square of the mean and ensures positive predictions (Blough et al., 1999; Deb & Trivedi, 2002). Coefficients from such models admit a straightforward multiplicative interpretation in terms of percentage changes in expected cost (Glick et al., 2014).
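Stated compactly, the Gamma model with a log link described above can be written as follows. The notation is generic and not taken from the cited sources: \mu_i denotes the conditional mean cost, \phi the dispersion parameter, and x_{ij} the j-th predictor for individual i.

```latex
\log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}, \qquad
\operatorname{Var}(y_i \mid \mathbf{x}_i) = \phi\,\mu_i^{2}, \qquad
\frac{\mu_i(x_{ij} + 1)}{\mu_i(x_{ij})} = e^{\beta_j}
```

A one-unit increase in predictor j therefore changes expected cost by 100(e^{\beta_j} - 1) percent, holding the other predictors fixed.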

Consistent with prior literature, our analysis highlights smoking status as a dominant cost driver and demonstrates that its effect interacts strongly with body mass index. This finding aligns with evidence that health risk factors operate synergistically rather than independently (Manning et al., 1987; Deb & Trivedi, 2002). Age and family structure also exhibit significant relationships with medical expenditures, reflecting systematic differences in healthcare utilization documented in prior studies (Blough et al., 1999; Glick et al., 2014).

Overall, the use of a Gamma generalized linear model with targeted interaction terms aligns with best practices in health econometrics, yielding consistent estimates and improved efficiency relative to misspecified linear models (Manning, 1998; Blough et al., 1999; Deb & Trivedi, 2002).

3. Methodology

3.1 Data

The dataset was obtained from Kaggle, an online data science community. A link to the exact download source is provided in the Appendix.

The dataset consists of 1,338 individuals covered by medical insurance in the United States. The response variable is annual medical insurance charges. Predictor variables include age, sex, body mass index (BMI), smoking status, number of dependents (recoded as a binary indicator for having children), and residential region.

All variables were complete, with no missing observations. Categorical predictors were encoded as factor variables prior to model estimation. The number of children, originally recorded as a numeric count, was recoded as a binary indicator (has_children) for the presence of any dependents.
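The recoding of the children count can be sketched as below. The function name and example values are hypothetical; the "No"/"Yes" labels are an assumption inferred from the has_childrenYes term that appears in the model output in the Appendix.

```python
def recode_has_children(children_counts):
    """Map a numeric count of children to a two-level 'No'/'Yes' label."""
    return ["Yes" if n > 0 else "No" for n in children_counts]

# Hypothetical example values, not rows from the dataset.
children = [0, 1, 3, 0, 2]
has_children = recode_has_children(children)
print(has_children)
```

Collapsing the count to a binary indicator gives up information about family size in exchange for a single, easily interpreted coefficient.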

3.2 Exploratory Analysis

Exploratory data analysis revealed that medical charges are heavily right-skewed, with a long upper tail corresponding to high-cost individuals. Boxplots and residual diagnostics confirmed the presence of heteroskedasticity and non-normality, consistent with the known characteristics of healthcare cost data. These patterns motivated the use of modeling approaches beyond ordinary least squares regression.
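One simple way to quantify the right skew described above is the moment-based skewness coefficient. The sketch below uses hypothetical charge values, not the study's data, and compares the raw and log scales.

```python
import math

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical charges: many modest values and a few very large ones.
charges = [1200, 1800, 2500, 3100, 4000, 5200, 7400, 9800, 38000, 52000]
log_charges = [math.log(c) for c in charges]

print(round(sample_skewness(charges), 2))      # raw scale
print(round(sample_skewness(log_charges), 2))  # log scale
```

A positive coefficient indicates a long upper tail. In this illustration the log scale reduces the skew without removing it.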

3.3 Model Specification

Several models were estimated and compared:

Ordinary least squares regression on raw charges

Linear regression with interaction terms

Robust regression to assess sensitivity to outliers

Gamma generalized linear models with a log link

To assess effect heterogeneity, interaction terms between smoking status and other predictors were explored. While smoking significantly modified the effects of BMI, age, and dependent status, interactions with sex and region did not materially improve model fit.

The final model specification was a Gamma generalized linear model with a log link and the following predictors: age, BMI, smoking status, having children, sex, region, smoking × BMI, smoking × age, and smoking × having children.
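Written out as a linear predictor on the log scale, and following the glm call reported in the Appendix, the final specification takes roughly this form. The coefficient labels \beta_0 through \beta_8 and \boldsymbol{\gamma} are notation introduced here for illustration.

```latex
\log \mathrm{E}[\text{charges}_i] =
  \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{bmi}_i
  + \beta_3\,\text{has\_children}_i + \beta_4\,\text{sex}_i
  + \boldsymbol{\gamma}^{\top}\text{region}_i + \beta_5\,\text{smoker}_i
  + \beta_6\,(\text{smoker}_i \times \text{bmi}_i)
  + \beta_7\,(\text{smoker}_i \times \text{age}_i)
  + \beta_8\,(\text{smoker}_i \times \text{has\_children}_i)
```

The interaction coefficients \beta_6, \beta_7, and \beta_8 measure how much the BMI, age, and dependent-status effects differ for smokers relative to non-smokers.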

3.4 Model Evaluation

The dataset was split into training (70%) and testing (30%) subsets. Model performance was evaluated on the test set using root mean squared error (RMSE) and mean absolute error (MAE), with predictions generated on the original cost scale using the inverse log link.
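The evaluation procedure can be sketched as follows. The shuffling scheme, seed, and function names are assumptions made for illustration and are not taken from the original analysis code. With 1,338 rows, a floor-based 70% split leaves 402 test rows, which matches the length of the Predictions listing in the Appendix.

```python
import math
import random

def train_test_split(rows, train_fraction=0.7, seed=621):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = int(train_fraction * len(rows))  # floor of the training share
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

def rmse(actual, predicted):
    """Root mean squared error on the original cost scale."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error on the original cost scale."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Stand-in row ids for a dataset of 1,338 observations.
rows = list(range(1338))
train, test = train_test_split(rows)
print(len(train), len(test))

# Hypothetical actual and predicted charges, for illustration only.
actual = [1000.0, 2000.0, 3000.0]
predicted = [1000.0, 2000.0, 5000.0]
print(rmse(actual, predicted), mae(actual, predicted))
```

RMSE penalizes large misses more heavily than MAE, so a wide gap between the two usually points to a small number of poorly predicted high-cost cases.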

4. Results

The final Gamma model demonstrated strong explanatory power, reducing deviance substantially relative to a null model and achieving a deviance-based pseudo-R² of approximately 0.73.
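The pseudo-R² follows directly from the null and residual deviances reported for the selected model in the Appendix. The symbol names are notation introduced here.

```latex
R^2_{\text{dev}} = 1 - \frac{D_{\text{residual}}}{D_{\text{null}}}
  = 1 - \frac{288.82}{1056.04} \approx 0.73
```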

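Smoking status emerged as the dominant predictor of medical costs. The smoking main effect (coefficient 1.081) corresponds to a multiplier of nearly 3 on expected costs, but because smoking interacts with BMI, age, and dependent status, this multiplier applies only at the reference values of those predictors; the smoker-to-non-smoker cost ratio grows with BMI and shrinks with age. Body mass index alone had little effect among non-smokers (about 0.2% per BMI unit, not statistically significant); for smokers, however, each additional BMI unit increased expected medical costs by approximately 5%. This interaction indicates that smoking substantially amplifies the cost impact of obesity.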
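Because smoking enters the model through several interaction terms, the smoker-to-non-smoker cost ratio is a function of the other predictors rather than a single number. Using the rounded coefficients from the selected model in the Appendix, with has_children coded 1 for individuals with dependents, the ratio can be written as:

```latex
\frac{\mathrm{E}[\text{charges} \mid \text{smoker}]}
     {\mathrm{E}[\text{charges} \mid \text{non-smoker}]}
= \exp\bigl(1.0808 + 0.0494\,\text{bmi} - 0.0259\,\text{age}
  - 0.2746\,\text{has\_children}\bigr)
```

As an illustrative case, at a BMI of 30 and age 40 with no children, the exponent is about 1.53 and the ratio is roughly 4.6.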

Age was also a significant predictor, with expected medical costs increasing by roughly 3.5% per additional year for non-smokers. The negative smoking × age interaction indicates that the marginal effect of age is attenuated among smokers (under 1% per year), likely reflecting their already elevated baseline risk. Having dependents increased expected costs by about 31% for non-smokers, whereas the effect was close to zero among smokers.
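The percentage effects quoted above come from exponentiating the log-scale coefficients. The sketch below reproduces that arithmetic using coefficients copied from the selected model output in the Appendix; the dictionary layout and function name are illustrative choices.

```python
import math

# Coefficients copied from the Selected Model output in the Appendix.
COEF = {
    "age": 0.034462,
    "bmi": 0.002258,
    "has_children": 0.270674,
    "bmi:smoker": 0.049428,
    "age:smoker": -0.025914,
    "has_children:smoker": -0.274594,
}

def percent_change(term, smoker):
    """Percent change in expected charges per one-unit increase in `term`."""
    beta = COEF[term]
    if smoker:
        # For smokers, add the matching smoking interaction coefficient.
        beta += COEF[term + ":smoker"]
    return 100.0 * (math.exp(beta) - 1.0)

for term in ("age", "bmi", "has_children"):
    print(f"{term:>12}: non-smoker {percent_change(term, False):6.2f}%  "
          f"smoker {percent_change(term, True):6.2f}%")
```

For a smoker, the relevant log-scale effect is the sum of the main-effect and interaction coefficients, which is why the age and dependent effects shrink while the BMI effect grows.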

On the held-out test set, the model achieved an RMSE of approximately $5,200 and an MAE of approximately $3,170. A predicted-versus-actual plot showed good calibration across most of the cost distribution, with increased dispersion among very high-cost individuals.

5. Discussion

The results highlight smoking as the most influential driver of medical insurance costs, both directly and through its interactions with BMI and age. From a business perspective, these findings support risk-adjusted pricing strategies that account not only for smoking status but also for how smoking interacts with other health indicators.

The strong interaction between smoking and BMI suggests that wellness interventions targeting weight management among smokers may yield disproportionately large cost savings. Insurers may also benefit from differentiated premium structures or targeted preventive care programs for high-risk subgroups identified by the model.

6. Limitations

Several limitations should be noted. The dataset does not include detailed clinical variables such as diagnosed conditions or healthcare utilization patterns, which likely explain much of the remaining variation in high-cost cases. Additionally, the analysis focuses on expected costs rather than extreme tail risk, which may be of interest for reinsurance or catastrophic coverage modeling.

7. Conclusion

This study demonstrates that medical insurance costs can be effectively modeled using regression-based approaches when the distributional characteristics of cost data are properly addressed. A Gamma generalized linear model with targeted interaction terms provides a robust, interpretable framework for understanding cost drivers and predicting expected medical expenditures. The findings underscore the central role of smoking in healthcare costs and illustrate the value of interaction modeling in uncovering meaningful risk heterogeneity.

GLM BMI Smoker

Medical insurance charges as a function of body mass index (BMI), stratified by smoking status. Points represent observed charges, while solid lines represent fitted values from the final Gamma generalized linear model with a log link and interaction between BMI and smoking status. Predicted values are shown on the original cost scale.

The model's RMSE: 5198.975. The model's MAE: 3170.379.

Appendices

Tables and Figures

Models


Call:
lm(formula = charges ~ ., data = data_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-11483.5  -2894.7   -956.5   1478.1  30059.6 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -12001.01     993.99 -12.074  < 2e-16 ***
age                256.91      11.92  21.562  < 2e-16 ***
sexmale           -126.41     333.31  -0.379  0.70456    
bmi                339.51      28.63  11.858  < 2e-16 ***
smokeryes        23849.65     413.63  57.660  < 2e-16 ***
regionnorthwest   -352.22     476.88  -0.739  0.46029    
regionsoutheast  -1057.33     479.27  -2.206  0.02755 *  
regionsouthwest   -944.26     478.40  -1.974  0.04861 *  
has_childrenYes    999.58     335.88   2.976  0.00297 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6069 on 1329 degrees of freedom
Multiple R-squared:  0.7503,    Adjusted R-squared:  0.7488 
F-statistic: 499.3 on 8 and 1329 DF,  p-value: < 2.2e-16

Call:
lm(formula = charges ~ bmi * smoker + age + has_children, data = data_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-14992.4  -2033.6  -1242.8   -372.9  29880.8 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -2753.520    838.589  -3.284 0.001052 ** 
bmi                  6.423     24.950   0.257 0.796894    
smokeryes       -20082.183   1659.601 -12.101  < 2e-16 ***
age                265.192      9.585  27.667  < 2e-16 ***
has_childrenYes    960.640    270.224   3.555 0.000391 ***
bmi:smokeryes     1430.143     52.987  26.991  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4886 on 1332 degrees of freedom
Multiple R-squared:  0.8378,    Adjusted R-squared:  0.8372 
F-statistic:  1376 on 5 and 1332 DF,  p-value: < 2.2e-16

Call:
lm(formula = sqrt(charges) ~ (bmi * smoker + age), data = data_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.971  -9.796  -4.530   0.831 107.572 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    28.14841    3.49594   8.052 1.79e-15 ***
bmi             0.05813    0.10529   0.552    0.581    
smokeryes     -31.44669    7.00376  -4.490 7.74e-06 ***
age             1.43666    0.04041  35.554  < 2e-16 ***
bmi:smokeryes   3.97785    0.22361  17.789  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.62 on 1333 degrees of freedom
Multiple R-squared:  0.8142,    Adjusted R-squared:  0.8137 
F-statistic:  1461 on 4 and 1333 DF,  p-value: < 2.2e-16

Selected Model

Call:
glm(formula = charges ~ age + bmi + has_children + sex + region + 
    smoker + smoker:bmi + smoker:age + smoker:has_children, family = Gamma(link = "log"), 
    data = data_clean)

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                7.467355   0.127288  58.665  < 2e-16 ***
age                        0.034462   0.001534  22.466  < 2e-16 ***
bmi                        0.002258   0.003690   0.612  0.54075    
has_childrenYes            0.270674   0.043227   6.262 5.13e-10 ***
sexmale                   -0.063918   0.038358  -1.666  0.09588 .  
regionnorthwest           -0.066977   0.054828  -1.222  0.22208    
regionsoutheast           -0.151504   0.055130  -2.748  0.00608 ** 
regionsouthwest           -0.169059   0.055015  -3.073  0.00216 ** 
smokeryes                  1.080839   0.266440   4.057 5.27e-05 ***
bmi:smokeryes              0.049428   0.007595   6.508 1.08e-10 ***
age:smokeryes             -0.025914   0.003422  -7.573 6.78e-14 ***
has_childrenYes:smokeryes -0.274594   0.096193  -2.855  0.00438 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.4861963)

    Null deviance: 1056.04  on 1337  degrees of freedom
Residual deviance:  288.82  on 1326  degrees of freedom
AIC: 26168

Number of Fisher Scoring iterations: 6

Exploratory Plots

[Figures: Cost and BMI Smoker Labels; Box Plot - Age; Box Plot - Body Mass Index; Box Plot - Charges; Numerical Correlation To Target; Distribution - BMI; Distribution - Age; Distribution - Charges]

Model Diagnostic Plots

[Figures: Linear Model Diagnostics; Log Linear Model Diagnostics; Interaction Linear Model Smoker * BMI]

Model Selected

[Figures: GLM Final Model; Actual V Predicted; BMI Smoker GLM]

References

Manning, W. G. (1998). The logged dependent variable, heteroscedasticity, and the retransformation problem. Journal of Health Economics, 17(3), 283–295. https://doi.org/10.1016/S0167-6296(98)00025-3

Blough, D. K., Madden, C. W., & Hornbrook, M. C. (1999). Modeling risk using generalized linear models. Journal of Health Economics, 18(2), 153–171. https://doi.org/10.1016/S0167-6296(98)00041-1

Deb, P., & Trivedi, P. K. (2002). The structure of demand for health care by the elderly: Econometric evidence from the Health and Retirement Study. Journal of Health Economics, 21(5), 803–824. https://doi.org/10.1016/S0167-6296(02)00021-7

Glick, H. A., Doshi, J. A., Sonnad, S. S., & Polsky, D. (2014). Economic evaluation in clinical trials (2nd ed.). Oxford University Press.

Data

Abdelghany, M. (2025). Medical insurance cost dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/12853160

R Code

R code Link

Predictions

    Predicted charges
1           21667.670
2            5038.960
3            4639.096
4            4122.332
5           29303.890
6            3309.855
7           15008.039
8           35764.482
9           15374.077
10           3453.264
11          43254.124
12           7563.904
13           4722.729
14          18232.641
15          18171.631
16          20064.448
17          14671.339
18           5725.687
19           3152.160
20           7242.559
21           8996.259
22          39299.724
23           8503.573
24          11575.900
25          10236.835
26          15330.048
27           3975.315
28           4082.238
29           9001.299
30           3220.740
31           3480.834
32          35483.043
33           3130.432
34           3720.357
35           3419.793
36          20478.467
37           5846.825
38          25659.763
39           5846.385
40          32297.263
41           3729.171
42          24713.371
43          33711.791
44          11759.760
45          34723.498
46          12776.781
47          11585.490
48           6965.176
49           4436.735
50          13583.932
51           8033.744
52          10091.537
53          12085.888
54           3049.048
55           4804.498
56          57151.096
57           5421.165
58           3538.956
59           3166.220
60          17356.019
61          13921.274
62           9793.813
63           5167.157
64           3623.992
65          11718.099
66          29035.963
67           4035.101
68          16036.034
69           2960.915
70           3498.093
71           6241.120
72          16166.321
73          41529.979
74           6099.973
75          27709.145
76          26419.087
77          54772.624
78           9172.891
79           3866.868
80           6440.580
81           9584.883
82          23015.811
83          34746.569
84           7918.280
85           5574.483
86           5894.672
87          26843.127
88          56742.537
89          13599.835
90          18141.460
91           3596.394
92           3146.587
93           4423.483
94           7200.641
95           9965.981
96           6503.324
97           4239.782
98          13308.270
99          14046.840
100         13119.523
101         25362.868
102         25954.463
103         10115.530
104          7262.874
105          2883.653
106         12122.143
107         12042.383
108          9214.050
109          3120.227
110          8976.342
111         16000.059
112         15013.187
113          7185.584
114         21295.282
115         15603.567
116          3410.133
117         18334.933
118          4913.227
119          6100.487
120         12517.132
121          3763.969
122          6479.874
123         13148.691
124         32620.192
125          3341.873
126         11522.443
127         27578.707
128         11990.163
129          6360.497
130          3870.823
131          3851.262
132          9531.281
133         10893.413
134         46164.010
135          3483.270
136          7662.204
137          7796.128
138          4236.471
139          3859.879
140         11089.481
141         10427.555
142         10954.087
143         11479.932
144         14028.825
145          4219.745
146         10568.079
147         26050.287
148         37767.908
149          3201.730
150         11266.741
151          6110.268
152         17672.584
153         47964.989
154         17410.496
155          6264.738
156          2795.826
157          7827.168
158         10625.523
159          6347.110
160          8959.510
161         14013.510
162         19540.276
163         27123.798
164          7762.654
165          3507.005
166         23232.261
167          5485.110
168         11000.312
169         31933.766
170         12515.015
171          6346.780
172         52600.604
173         12246.816
174         11524.444
175         16239.518
176          4364.064
177         26440.784
178         13745.311
179         14503.806
180          5128.654
181          8814.707
182         13391.784
183          8436.238
184         12438.585
185         10493.044
186         36798.494
187         43000.292
188          5584.397
189          5100.489
190          6626.366
191         62153.607
192         13580.717
193          6297.897
194          6151.934
195          3600.765
196          9961.443
197          4759.045
198         10844.973
199          9624.108
200         11584.839
201          6506.011
202         11646.411
203         10119.098
204          3556.025
205         11699.757
206          2890.171
207         11847.040
208         12755.827
209         12558.742
210         44230.821
211         10712.304
212         23628.389
213          3103.690
214          5308.774
215          4346.560
216         21896.021
217          9874.117
218          8794.058
219         14668.694
220         16134.489
221          8930.916
222         10179.478
223          2878.786
224          6129.669
225          3440.502
226          2839.720
227          3935.700
228          8840.701
229         24862.980
230         23107.288
231          3940.617
232          8366.667
233          9349.060
234          3421.122
235          5399.116
236         12468.616
237          5942.402
238          6911.615
239          4365.245
240         29623.213
241         40122.446
242          9245.661
243          9605.961
244         12823.894
245          8106.882
246         13097.682
247          5998.925
248          7510.882
249          8528.811
250         12850.292
251          8080.127
252          3066.482
253         29249.870
254         33680.362
255          4577.770
256         11002.150
257         17715.896
258         26784.153
259          3147.475
260         69684.126
261          8512.293
262          2853.872
263         14300.164
264          4843.401
265          8560.028
266          7410.493
267          4195.532
268          5059.101
269         24285.941
270          7278.700
271         11143.693
272          4103.520
273          3062.709
274          5606.430
275         12647.804
276          9660.987
277         20417.253
278          4578.860
279         17405.785
280          3713.274
281         12888.910
282          7114.699
283          6429.207
284         12960.834
285         14453.080
286          8207.225
287          5032.439
288          6059.125
289         12011.489
290          2764.579
291         12254.692
292         58812.648
293          4888.774
294          4256.307
295         13123.284
296          9708.068
297          8168.276
298         11768.525
299          5956.519
300          3743.975
301          5560.207
302         20230.398
303          4588.617
304         18286.786
305         11653.883
306          9270.388
307          8022.143
308          3360.030
309         28953.375
310         13283.566
311         17394.932
312         11318.753
313          4762.498
314          8738.167
315         39951.595
316         13701.631
317         34136.002
318          4119.393
319         27691.623
320         28035.041
321          4111.198
322         15507.412
323         30673.409
324          4356.023
325          5575.594
326         18529.496
327         15452.301
328          3589.969
329         17495.247
330          3612.261
331         11754.378
332         12362.150
333          4578.455
334         48394.852
335          5038.204
336         18275.063
337         10376.614
338         12750.563
339          9130.038
340         13250.046
341         28716.735
342          6258.094
343         56979.174
344          7141.252
345          2966.267
346         12819.080
347         10237.145
348         30421.248
349         11264.167
350          8565.498
351         12695.240
352         13115.022
353         11947.389
354          8338.360
355          6564.420
356          8292.702
357         23117.378
358          7964.396
359          5878.260
360          3353.081
361         12758.564
362         23432.951
363         33713.098
364         14924.821
365         27389.933
366          6723.463
367          7609.264
368          6999.639
369         16006.883
370         12533.446
371         38435.757
372          8508.315
373          6777.903
374          9330.905
375          6725.493
376         10504.062
377         12973.073
378         15944.054
379         14581.824
380         16112.323
381         12872.568
382          3059.668
383         27756.090
384          2790.151
385          6844.843
386          9120.913
387         10535.917
388         29585.583
389         10427.464
390          5272.620
391         29968.411
392          3238.056
393         37765.224
394          4805.280
395         24825.793
396          8027.887
397          6461.406
398         30672.450
399          6210.509
400          3520.365
401          3039.221
402          3230.003