Statistical & Machine Learning Model for Clinical Data Analytics

1. Introduction and Dataset Acquisition

This report details the implementation of a statistical and machine learning model for clinical data analytics, following a structured process from data acquisition and cleaning to model deployment readiness.

The chosen dataset is the Medical Cost Personal Insurance Dataset, a publicly available, structured dataset suitable for regression analysis. It contains six demographic and health-related features and one target variable, the individual medical insurance charges.

| Feature | Data Type | Description |
| --- | --- | --- |
| age | Numerical | Age of the primary beneficiary |
| sex | Categorical | Insurance contractor gender (female, male) |
| bmi | Numerical | Body Mass Index, providing an understanding of body weight |
| children | Numerical | Number of children (dependents) covered by health insurance |
| smoker | Categorical | Smoker status (yes, no) |
| region | Categorical | The beneficiary's residential area in the US (northeast, southeast, southwest, northwest) |
| charges | Numerical | Individual medical costs billed by health insurance (Target Variable) |

The dataset was downloaded as insurance.csv.

2. Data Cleaning and Variable Imputation Routine

The data cleaning and preprocessing routine was implemented in data_preprocessing.py. An initial check revealed no missing values in the dataset, so no imputation was strictly necessary. The routine nevertheless keeps median (numerical) and most-frequent (categorical) imputers as a safeguard, and includes the following steps:

  1. Categorical Encoding: One-Hot Encoding was applied to the nominal categorical features (sex, smoker, region).
  2. Numerical Scaling: Standard Scaling was applied to the numerical features (age, bmi, children) so that each has zero mean and unit variance. Scaling changes the location and spread of a feature, not the shape of its distribution.
  3. Data Splitting: The data was split into a training set (80%) and a testing set (20%) with a fixed random_state for reproducibility. In the code, the split happens before the transformers are fitted, so the imputation, scaling, and encoding statistics come from the training set only.
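The missing-value check and the safeguard imputation mentioned above can be sketched as follows. This is a minimal illustration, not the project's own check: the helper name `missing_value_report` and the toy rows are assumptions made up for the example and are not taken from insurance.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column (hypothetical helper)."""
    counts = df.isna().sum()
    return pd.DataFrame({"missing": counts, "share": counts / len(df)})


# Hypothetical toy rows with deliberate gaps; the real dataset has none.
toy = pd.DataFrame({
    "age": [19, 45, np.nan, 33],
    "smoker": ["yes", "no", "no", None],
    "bmi": [27.9, 30.1, 22.7, 35.3],
})

report = missing_value_report(toy)
print(report)

# Median imputation, as in the numerical branch of the pipeline:
# the missing age is replaced by the median of the observed ages.
imputed_age = SimpleImputer(strategy="median").fit_transform(toy[["age"]])
print(imputed_age.ravel())
```

On a dataset without gaps, as reported for insurance.csv, the `missing` column would be all zeros and the imputers would leave the data unchanged.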

data_preprocessing.py

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def load_data(file_path):
    """Loads the dataset from a CSV file."""
    try:
        df = pd.read_csv(file_path)
        return df
    except FileNotFoundError:
        # Callers must handle a None return when the file is missing.
        return None


def get_preprocessor(df):
    """Creates a ColumnTransformer for preprocessing."""
    numerical_features = df.select_dtypes(include=np.number).columns.tolist()
    if 'charges' in numerical_features:
        numerical_features.remove('charges')

    categorical_features = df.select_dtypes(include='object').columns.tolist()

    # Impute first, then scale: median is robust to outliers.
    numerical_transformer = [
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]

    # Impute first, then one-hot encode; unseen categories become all-zero rows.
    categorical_transformer = [
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', Pipeline(numerical_transformer), numerical_features),
            ('cat', Pipeline(categorical_transformer), categorical_features)
        ],
        remainder='passthrough'
    )

    return preprocessor, numerical_features, categorical_features


def preprocess_data(df):
    """Applies the full preprocessing pipeline to the data."""
    X = df.drop('charges', axis=1)
    y = df['charges']

    # Split before fitting so that test rows never influence the transformers.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    preprocessor, num_features, cat_features = get_preprocessor(X_train)

    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)

    # Recover readable column names for the one-hot encoded features.
    cat_ohe_features = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(cat_features)
    all_features = num_features + list(cat_ohe_features)

    # Keep the original row index so the features stay aligned with y_train / y_test.
    X_train_processed_df = pd.DataFrame(X_train_processed, columns=all_features, index=X_train.index)
    X_test_processed_df = pd.DataFrame(X_test_processed, columns=all_features, index=X_test.index)

    return X_train_processed_df, X_test_processed_df, y_train, y_test, preprocessor


if __name__ == '__main__':
    # … (Execution logic to load and save processed data)
    pass
```

3. Exploratory Data Analysis (EDA)

Exploratory analysis was performed to understand the data distribution and relationships between features and the target variable (charges).

Summary Tables

Descriptive statistics for the numerical features of the original dataset:

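| Feature | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| age | 1338 | 39.21 | 14.05 | 18 | 27 | 39 | 51 | 64 |
| bmi | 1338 | 30.66 | 6.10 | 15.96 | 26.30 | 30.40 | 34.69 | 53.13 |
| children | 1338 | 1.09 | 1.21 | 0 | 0 | 1 | 2 | 5 |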
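A summary table of this shape can be produced with pandas' `describe`. The sketch below is an assumption about how such a table might be generated, not the contents of eda.py; the helper name `summarize_numeric` is hypothetical, and the toy rows simply reuse the minimum, quartile, and maximum values from the table so the output is easy to eyeball (their mean and standard deviation will not match the real dataset).

```python
import pandas as pd


def summarize_numeric(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Descriptive statistics with one row per feature (hypothetical helper)."""
    return df[columns].describe().T.round(2)


# Toy rows for illustration only; with the real file this would be
# pd.read_csv("insurance.csv").
toy = pd.DataFrame({
    "age": [18, 27, 39, 51, 64],
    "bmi": [15.96, 26.30, 30.40, 34.69, 53.13],
    "children": [0, 0, 1, 2, 5],
})

summary = summarize_numeric(toy, ["age", "bmi", "children"])
print(summary)
```

Transposing the result of `describe` puts features on rows and statistics on columns, which matches the layout used in the table.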

Key Findings from Visualizations (Plots)

Visualizations were generated and saved to the eda_plots directory. The key findings are:

  1. Distribution of Medical Charges (charges_distribution.png): The target variable is right-skewed, indicating that most individuals have lower charges, but a smaller number of individuals incur very high costs. This suggests that a log transformation might be beneficial for linear models, but we will proceed without it for the initial models.
  2. Correlation Heatmap (correlation_heatmap.png): The heatmap shows that age and bmi have a positive, moderate correlation with charges. The correlation with children is weak.
  3. Categorical Feature Box Plots (sex_vs_charges_boxplot.png, smoker_vs_charges_boxplot.png, region_vs_charges_boxplot.png):
    • Smoker status is the most significant factor, with smokers having substantially higher charges than non-smokers.
    • Sex shows a negligible difference in charges.
    • Region shows slightly higher average charges in the southeast than in the other three regions.
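The remark about right skew and a possible log transformation can be illustrated numerically. The sketch below uses synthetic log-normal values as a stand-in for charges; the generator parameters and the helper `sample_skewness` are assumptions chosen for the example, and none of the numbers come from insurance.csv.

```python
import numpy as np


def sample_skewness(values) -> float:
    """Moment-based skewness: third central moment over variance**1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return float(m3 / m2 ** 1.5)


# Synthetic, right-skewed stand-in for the charges column (assumption).
rng = np.random.default_rng(0)
synthetic_charges = rng.lognormal(mean=9.0, sigma=0.9, size=2000)

raw_skew = sample_skewness(synthetic_charges)
log_skew = sample_skewness(np.log(synthetic_charges))

print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

A clearly positive skewness that drops toward zero after taking logs is the pattern that would motivate fitting a linear model to log(charges) rather than to charges directly.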

4. Model Implementation and Evaluation

Two regression models were implemented and evaluated on the test set: a Linear Regression model (a statistical approach) and a Random Forest Regressor (a machine learning approach).

model_training.py

```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib

# … (load_processed_data and evaluate_model functions)


def train_and_save_models(X_train, X_test, y_train, y_test):
    """Trains and saves the Linear Regression and Random Forest models."""

    # 1. Linear Regression Model
    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)
    lr_metrics = evaluate_model(lr_model, X_test, y_test)
    joblib.dump(lr_model, 'lr_model.pkl')

    # 2. Random Forest Regressor Model
    # n_jobs=-1 uses all available cores; random_state fixes the bootstrap samples.
    rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
    rf_model.fit(X_train, y_train)
    rf_metrics = evaluate_model(rf_model, X_test, y_test)
    joblib.dump(rf_model, 'rf_model.pkl')

    return lr_metrics, rf_metrics


if __name__ == '__main__':
    # … (Execution logic to load data and train models)
    pass
```

Model Performance Summary

| Model | MSE | RMSE | R-squared (\(R^2\)) |
| --- | --- | --- | --- |
| Linear Regression | 33,596,915.85 | 5,796.28 | 0.7836 |
| Random Forest Regressor | 21,003,637.61 | 4,582.97 | 0.8647 |

The Random Forest Regressor outperformed the Linear Regression model on the test set, with an \(R^2\) of 0.8647 versus 0.7836 and an RMSE of 4,582.97 versus 5,796.28. This suggests that the non-linear, ensemble approach is better at capturing the relationship between the features and the medical charges. The Random Forest model (rf_model.pkl) was selected for API deployment.
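The three reported metrics are related in a simple way: RMSE is the square root of MSE, and \(R^2\) compares the model's squared error with that of always predicting the mean. Since the report does not show its evaluation helper, the function below is only a hedged sketch of what such a helper might look like; the name `evaluate_model_sketch`, its return format, and the toy data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_model_sketch(model, X_test, y_test) -> dict:
    """Compute MSE, RMSE and R^2 for a fitted regressor (hypothetical helper)."""
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    return {
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),  # RMSE is in the same units as the target
        "R2": float(r2_score(y_test, predictions)),
    }


# Toy data on an exact line y = 2x + 1, so the fit should be near perfect.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0
metrics = evaluate_model_sketch(LinearRegression().fit(X_toy, y_toy), X_toy, y_toy)
print(metrics)

# Consistency check on the reported table: RMSE should equal sqrt(MSE).
lr_rmse_from_mse = float(np.sqrt(33_596_915.85))
rf_rmse_from_mse = float(np.sqrt(21_003_637.61))
print(round(lr_rmse_from_mse, 2), round(rf_rmse_from_mse, 2))
```

Because RMSE is expressed in the same units as charges, it is usually the easier of the two error metrics to interpret when comparing the models.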

5. Modular Python Scripts Ready for API Deployment

The entire process is modularized into three scripts:

  1. data_preprocessing.py: Handles data loading, cleaning, and transformation.
  2. model_training.py: Handles model training, evaluation, and saving.
  3. api_service.py: A FastAPI application to serve predictions from the best-performing model.

api_service.py (FastAPI Deployment Routine)

The API service uses the saved Random Forest model (rf_model.pkl) and a preprocessor that is refit on insurance.csv at startup, using the same fixed split as training, to make predictions on new, raw data inputs.

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

from data_preprocessing import load_data, preprocess_data  # Re-use preprocessing logic

# --- Initialization ---

app = FastAPI(
    title="Clinical Data Analytics Model API",
    description="API for predicting medical charges using a Random Forest Regressor model."
)

# Load the trained model and re-initialize the preprocessor
try:
    rf_model = joblib.load('rf_model.pkl')
    original_data = load_data('insurance.csv')
    # preprocess_data returns five values; only the fitted preprocessor is needed here.
    _, _, _, _, preprocessor = preprocess_data(original_data.copy())
except Exception as e:
    rf_model = None
    preprocessor = None
    # In a real environment, this would raise an exception and prevent startup


# --- Pydantic Schema for Request Body ---

class InsuranceData(BaseModel):
    age: int
    sex: str
    bmi: float
    children: int
    smoker: str
    region: str


# --- API Endpoints ---

@app.get("/")
def read_root():
    return {"message": "Clinical Data Analytics Model API is running."}


@app.post("/predict")
def predict_insurance_charges(data: InsuranceData):
    """Predicts the medical insurance charges based on the provided personal data."""
    if rf_model is None or preprocessor is None:
        return {"error": "Model or Preprocessor not loaded."}

    try:
        # Prepare input data (in Pydantic v2, .dict() is deprecated in favor of .model_dump())
        input_df = pd.DataFrame([data.dict()])
        expected_cols = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
        input_df = input_df[expected_cols]

        # Transform and predict
        transformed_data = preprocessor.transform(input_df)
        prediction = rf_model.predict(transformed_data)

        return {
            # Cast to a built-in float so the value serializes cleanly to JSON.
            "predicted_charges": round(float(prediction[0]), 2),
            "model_used": "Random Forest Regressor"
        }
    except Exception as e:
        return {"error": f"Prediction failed: {e}"}
```

To run the API locally:

```bash
uvicorn api_service:app --reload
```
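As a rough illustration of calling the /predict endpoint, the sketch below builds a JSON POST request with the standard library only. The helper name `build_predict_request` and the example field values are assumptions made up for the example; the base URL assumes uvicorn's default host and port (127.0.0.1:8000). The request is only sent if the commented lines at the end are enabled while the server is running.

```python
import json
import urllib.request


def build_predict_request(payload: dict, base_url: str = "http://127.0.0.1:8000"):
    """Build a POST request for the /predict endpoint (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical input; the keys follow the InsuranceData schema.
example_payload = {
    "age": 35,
    "sex": "female",
    "bmi": 27.5,
    "children": 2,
    "smoker": "no",
    "region": "northwest",
}

request = build_predict_request(example_payload)
print(request.get_method(), request.full_url)

# With the API running locally, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

The response is expected to carry the `predicted_charges` and `model_used` keys returned by the endpoint, or an `error` key if the model or preprocessor failed to load.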

6. Conclusion

This project implemented a statistical and machine learning solution for clinical data analytics. The process included:

  • Acquiring the Medical Cost Personal Insurance Dataset.
  • Developing a robust data cleaning and preprocessing pipeline (data_preprocessing.py).
  • Conducting Exploratory Data Analysis to understand key features, notably smoker status and its strong association with charges.
  • Implementing and evaluating a Linear Regression model and a Random Forest Regressor, with the Random Forest achieving the higher \(R^2\) of 0.8647.
  • Creating a modular FastAPI service (api_service.py) for deployment of the best-performing model.

The complete codebase and generated artifacts are provided in the final delivery.

Artifacts

The following files are available:

  • insurance.csv: The raw dataset.
  • data_preprocessing.py: Script for data cleaning and preparation.
  • eda.py: Script for exploratory data analysis.
  • model_training.py: Script for training and evaluation.
  • api_service.py: Modular script for API deployment.
  • lr_model.pkl: Trained Linear Regression model.
  • rf_model.pkl: Trained Random Forest Regressor model.
  • eda_plots/: Directory containing the generated plots.
    • charges_distribution.png
    • correlation_heatmap.png
    • sex_vs_charges_boxplot.png
    • smoker_vs_charges_boxplot.png
    • region_vs_charges_boxplot.png
