This report details the implementation of a statistical and machine learning model for clinical data analytics, following a structured process from data acquisition and cleaning to model deployment readiness.
The chosen dataset is the Medical Cost Personal Insurance Dataset, a publicly available, structured dataset suitable for regression analysis. It contains various demographic and health-related features to predict individual medical insurance charges.
| Feature | Data Type | Description |
|---|---|---|
| age | Numerical | Age of the primary beneficiary |
| sex | Categorical | Insurance contractor gender (female, male) |
| bmi | Numerical | Body Mass Index, providing an understanding of body weight |
| children | Numerical | Number of children covered by health insurance / Number of dependents |
| smoker | Categorical | Smoker status (yes, no) |
| region | Categorical | The beneficiary’s residential area in the US (northeast, southeast, southwest, northwest) |
| charges | Numerical | Individual medical costs billed by health insurance (Target Variable) |
The dataset was downloaded as insurance.csv.
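A quick sanity check after download can confirm that the file matches the schema in the table above and has no missing values. The following is a minimal sketch, not the project's actual check: the helper name `check_schema` is hypothetical, and two made-up rows stand in for insurance.csv so the snippet runs on its own.

```python
import io

import pandas as pd

# Columns expected from the feature table above.
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def check_schema(df: pd.DataFrame) -> dict:
    """Report missing/unexpected columns and the number of nulls per column."""
    return {
        "missing_columns": [c for c in EXPECTED_COLUMNS if c not in df.columns],
        "unexpected_columns": [c for c in df.columns if c not in EXPECTED_COLUMNS],
        "null_counts": df.isna().sum().to_dict(),
    }


# Illustrative rows only; in practice: df = pd.read_csv("insurance.csv")
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "25,female,22.5,0,no,northeast,3000.0\n"
    "52,male,31.2,2,yes,southeast,42000.0\n"
)
df = pd.read_csv(sample_csv)
report = check_schema(df)
print(report)
```

An empty `missing_columns` list and all-zero `null_counts` would correspond to the "no missing values" result reported for the real file.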
The data cleaning and preprocessing routine was implemented in
data_preprocessing.py. An initial check revealed no
missing values in the dataset, so no imputation was strictly
necessary. However, the routine was built to be robust and included the
following steps:

* Imputing missing values as a safeguard: the median for numerical features and the most frequent value for categorical features.
* One-hot encoding the categorical features (sex, smoker, region).
* Standardizing the numerical features (age, bmi, children) to put them on a common scale (zero mean, unit variance) for the models.
* Splitting the data into training (80%) and test (20%) sets with a fixed random_state for reproducibility.

data_preprocessing.py

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import numpy as np


def load_data(file_path):
    """Loads the dataset from a CSV file."""
    try:
        df = pd.read_csv(file_path)
        return df
    except FileNotFoundError:
        return None


def get_preprocessor(df):
    """Creates a ColumnTransformer for preprocessing."""
    numerical_features = df.select_dtypes(include=np.number).columns.tolist()
    if 'charges' in numerical_features:
        numerical_features.remove('charges')

    categorical_features = df.select_dtypes(include='object').columns.tolist()

    # Numerical columns: median imputation, then standardization
    numerical_transformer = [
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]

    # Categorical columns: most-frequent imputation, then one-hot encoding
    categorical_transformer = [
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', Pipeline(numerical_transformer), numerical_features),
            ('cat', Pipeline(categorical_transformer), categorical_features)
        ],
        remainder='passthrough'
    )

    return preprocessor, numerical_features, categorical_features


def preprocess_data(df):
    """Applies the full preprocessing pipeline to the data."""
    X = df.drop('charges', axis=1)
    y = df['charges']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Fit on the training split only, then apply the same transform to the test split
    preprocessor, num_features, cat_features = get_preprocessor(X_train)
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)

    # Recover column names: numerical features first, then the one-hot columns
    cat_ohe_features = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(cat_features)
    all_features = num_features + list(cat_ohe_features)

    X_train_processed_df = pd.DataFrame(X_train_processed, columns=all_features)
    X_test_processed_df = pd.DataFrame(X_test_processed, columns=all_features)

    return X_train_processed_df, X_test_processed_df, y_train, y_test, preprocessor


if __name__ == '__main__':
    # … (Execution logic to load and save processed data)
    pass
```
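One detail of the routine above that is easy to overlook is `handle_unknown='ignore'`: a category that was not present when the encoder was fitted is encoded as an all-zero row instead of raising an error. The sketch below illustrates this in isolation, using the four region values from the dataset description; "midwest" is a made-up unseen value chosen only for the demonstration.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on the four regions listed in the dataset description.
train_regions = [["northeast"], ["southeast"], ["southwest"], ["northwest"]]
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_regions)

# A known category yields exactly one active column.
known = encoder.transform([["southeast"]]).toarray()

# "midwest" is a hypothetical value that was not seen during fitting.
unseen = encoder.transform([["midwest"]]).toarray()

print(encoder.categories_[0])  # learned categories, stored in sorted order
print(known)
print(unseen)
```

For a prediction service this means a misspelled region would pass through silently as "no region", so validating categorical inputs before transforming them may be worth considering.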
Exploratory analysis was performed to understand the data
distribution and relationships between features and the target variable
(charges).
Descriptive statistics for the numerical features of the original dataset:
| Feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 1338 | 39.21 | 14.05 | 18 | 27 | 39 | 51 | 64 |
| bmi | 1338 | 30.66 | 6.10 | 15.96 | 26.30 | 30.40 | 34.69 | 53.13 |
| children | 1338 | 1.09 | 1.21 | 0 | 0 | 1 | 2 | 5 |
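A summary of this shape is what pandas' `DataFrame.describe()` produces. The snippet below is a minimal sketch on a few made-up rows, not the real data; loading insurance.csv instead would be needed to reproduce the table above.

```python
import pandas as pd

# Illustrative rows only; replace with pd.read_csv("insurance.csv") for the real table.
df = pd.DataFrame({
    "age": [19, 33, 46, 60],
    "bmi": [27.9, 22.7, 33.4, 25.8],
    "children": [0, 1, 3, 0],
})

# describe() returns the statistics as rows; transpose so each feature is a row.
summary = df[["age", "bmi", "children"]].describe().T.round(2)
print(summary)
```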
Visualizations were generated and saved to the eda_plots directory. The key findings are:

* Charges distribution (charges_distribution.png): The target variable is right-skewed: most individuals have low charges, while a small number incur very high costs. A log transformation might therefore benefit linear models, but the initial models proceed without it.
* Correlation heatmap (correlation_heatmap.png): The heatmap shows that age and bmi have a positive, moderate correlation with charges. The correlation with children is weak.
* Boxplots of charges by category (sex_vs_charges_boxplot.png, smoker_vs_charges_boxplot.png, region_vs_charges_boxplot.png): Charges are compared across sex, smoker status, and region; smoker status shows a strong relationship with charges.
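The effect of a log transformation on a right-skewed target can be checked numerically through the sample skewness. The sketch below uses synthetic log-normal values as a stand-in for charges (these are not the real data, and the distribution parameters are arbitrary), so it only illustrates the technique.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-in for `charges` (NOT the real data).
charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=2000), name="charges")

raw_skew = charges.skew()            # clearly positive for a long right tail
log_skew = np.log1p(charges).skew()  # much closer to zero after the transform

print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

If a model were trained on `log1p(charges)`, its predictions would need to be mapped back with `np.expm1` before being reported in the original units.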
Two regression models were implemented and evaluated on the test set: a Linear Regression model (a statistical approach) and a Random Forest Regressor (a machine learning approach).
model_training.py

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib


def evaluate_model(model, X_test, y_test):
    """Computes MSE, RMSE, and R-squared on the test set.

    Note: this helper is called below but not shown in the original listing;
    this minimal version is an assumption.
    """
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {'mse': mse, 'rmse': np.sqrt(mse), 'r2': r2_score(y_test, y_pred)}


def train_and_save_models(X_train, X_test, y_train, y_test):
    """Trains and saves the Linear Regression and Random Forest models."""
    # 1. Linear Regression Model
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)
    lr_metrics = evaluate_model(lr_model, X_test, y_test)
    joblib.dump(lr_model, 'lr_model.pkl')

    # 2. Random Forest Regressor Model
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
    rf_model.fit(X_train, y_train)
    rf_metrics = evaluate_model(rf_model, X_test, y_test)
    joblib.dump(rf_model, 'rf_model.pkl')

    return lr_metrics, rf_metrics


if __name__ == '__main__':
    # … (Execution logic to load data and train models)
    pass
```
| Model | MSE | RMSE | R-squared (\(R^2\)) |
|---|---|---|---|
| Linear Regression | 33,596,915.85 | 5,796.28 | 0.7836 |
| Random Forest Regressor | 21,003,637.61 | 4,582.97 | 0.8647 |
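Because RMSE is the square root of MSE, the two columns of the table can be cross-checked against each other. The short sketch below does that with the MSE values copied from the table and also derives the relative RMSE reduction; nothing here is re-computed from the data itself.

```python
import math

# MSE values copied from the results table above.
mse = {
    "Linear Regression": 33_596_915.85,
    "Random Forest Regressor": 21_003_637.61,
}

# RMSE is the square root of MSE, expressed in the same units as `charges`.
rmse = {name: round(math.sqrt(value), 2) for name, value in mse.items()}
print(rmse)

# Relative RMSE reduction of the Random Forest over Linear Regression.
reduction = 1 - rmse["Random Forest Regressor"] / rmse["Linear Regression"]
print(f"RMSE reduction: {reduction:.1%}")
```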
The Random Forest Regressor outperformed the Linear Regression model on the test set, raising \(R^2\) from 0.7836 to 0.8647 and lowering RMSE from 5,796.28 to 4,582.97 (about 21%). This suggests that the non-linear ensemble captures relationships between the features and the medical charges that a purely linear model misses. The Random Forest model (rf_model.pkl) was therefore selected for API deployment.
The entire process is modularized into three scripts:

1. data_preprocessing.py: Handles data loading, cleaning, and transformation.
2. model_training.py: Handles model training, evaluation, and saving.
3. api_service.py: A FastAPI application to serve predictions from the best-performing model.
api_service.py (FastAPI Deployment Routine)

The API service uses the saved Random Forest model (rf_model.pkl) and a re-initialized preprocessor, which is re-fitted on insurance.csv at startup, to make predictions on new, raw data inputs.
```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

from data_preprocessing import load_data, preprocess_data  # Re-use preprocessing logic

app = FastAPI(
    title="Clinical Data Analytics Model API",
    description="API for predicting medical charges using a Random Forest Regressor model."
)

# Load the trained model and re-fit the preprocessor on the original data at startup
try:
    rf_model = joblib.load('rf_model.pkl')
    original_data = load_data('insurance.csv')
    _, _, _, _, preprocessor = preprocess_data(original_data.copy())
except Exception as e:
    rf_model = None
    preprocessor = None
    # In a real environment, this would raise an exception and prevent startup


class InsuranceData(BaseModel):
    age: int
    sex: str
    bmi: float
    children: int
    smoker: str
    region: str


@app.get("/")
def read_root():
    return {"message": "Clinical Data Analytics Model API is running."}


@app.post("/predict")
def predict_insurance_charges(data: InsuranceData):
    """Predicts the medical insurance charges based on the provided personal data."""
    if rf_model is None or preprocessor is None:
        return {"error": "Model or Preprocessor not loaded."}

    try:
        # Prepare input data (Pydantic v2 prefers data.model_dump() over data.dict())
        input_df = pd.DataFrame([data.dict()])
        expected_cols = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
        input_df = input_df[expected_cols]

        # Transform and predict
        transformed_data = preprocessor.transform(input_df)
        prediction = rf_model.predict(transformed_data)

        return {
            # Cast the NumPy scalar to a plain float for JSON serialization
            "predicted_charges": round(float(prediction[0]), 2),
            "model_used": "Random Forest Regressor"
        }
    except Exception as e:
        return {"error": f"Prediction failed: {e}"}
```
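The service above re-fits the preprocessor from insurance.csv every time it starts. One possible alternative, sketched here under stated assumptions rather than taken from the project, is to bundle preprocessing and model into a single scikit-learn `Pipeline` and persist that one artifact. The training rows are made up, the file name insurance_pipeline.pkl is hypothetical, and the forest is kept small so the example runs quickly.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative rows only (NOT taken from insurance.csv).
train = pd.DataFrame({
    "age": [19, 33, 46, 60, 28, 51],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 33.4, 25.8, 36.1, 29.3],
    "children": [0, 1, 3, 0, 2, 1],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
    "region": ["southwest", "northwest", "southeast",
               "northeast", "southeast", "northwest"],
    "charges": [16800.0, 4400.0, 8200.0, 12500.0, 38000.0, 9800.0],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "bmi", "children"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "smoker", "region"]),
])
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])
pipeline.fit(train.drop(columns="charges"), train["charges"])

# Persist and reload the fitted pipeline as one file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "insurance_pipeline.pkl")
joblib.dump(pipeline, path)
restored = joblib.load(path)

# Raw, unprocessed input goes straight into predict().
new_row = pd.DataFrame([{
    "age": 40, "sex": "male", "bmi": 30.0,
    "children": 2, "smoker": "no", "region": "southeast",
}])
prediction = float(restored.predict(new_row)[0])
print(round(prediction, 2))
```

With this layout the API would load a single file and would not need access to the training data at startup, at the cost of retraining whenever the preprocessing changes.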
This project successfully implemented a Statistical & Machine Learning solution for clinical data analytics. The process included:

* Acquiring the Medical Cost Personal Insurance Dataset.
* Developing a robust data cleaning and preprocessing pipeline (data_preprocessing.py).
* Conducting Exploratory Data Analysis to understand key features like smoker status and its strong correlation with charges.
* Implementing and evaluating a Linear Regression and a Random Forest Regressor model, with the Random Forest model achieving a superior \(R^2\) of 0.8647.
* Creating a modular FastAPI service (api_service.py) for production-ready deployment of the best-performing model.
The complete codebase and generated artifacts are provided in the final delivery.
The following files are available:

* insurance.csv: The raw dataset.
* data_preprocessing.py: Script for data cleaning and preparation.
* eda.py: Script for exploratory data analysis.
* model_training.py: Script for training and evaluation.
* api_service.py: Modular script for API deployment.
* lr_model.pkl: Trained Linear Regression model.
* rf_model.pkl: Trained Random Forest Regressor model.
* eda_plots/: Directory containing the generated plots.
  * charges_distribution.png
  * correlation_heatmap.png
  * sex_vs_charges_boxplot.png
  * smoker_vs_charges_boxplot.png
  * region_vs_charges_boxplot.png