Pima Diabetes Data - Logistic Regression
This example provides a complete, practical demonstration of fitting a logistic regression model to the Pima Indian Diabetes dataset in Julia. The further improvements suggested at the end are worth exploring to build even better models.
```julia
using CSV, DataFrames, GLM, Statistics, Random, Plots

# 1. Load the data
df = CSV.read("pima-indians-diabetes.csv", DataFrame, header=false)  # Replace with your file path
# Add column names (the dataset often doesn't have them)
rename!(df, [:Pregnancies, :Glucose, :BloodPressure, :SkinThickness, :Insulin, :BMI, :DiabetesPedigreeFunction, :Age, :Outcome])
# 2. Data Preprocessing (Important!)
# Handle missing values (if any). In this dataset, zeros in some columns are likely placeholders for missing data.
# Several strategies exist. Here's a simple imputation with the median for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. You might explore other methods.
for col in [:Glucose, :BloodPressure, :SkinThickness, :Insulin, :BMI]
    median_val = median(filter(x -> x > 0, df[!, col]))            # Median of the non-zero values
    df[!, col] = map(x -> x == 0 ? median_val : x, df[!, col])     # Replace 0s with the median
end
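# Alternative sketch (an assumption, not part of the original pipeline): convert the
# zero placeholders to `missing` and drop incomplete rows instead of imputing, e.g.
#   df[!, col] = replace(df[!, col], 0 => missing)   # per column, then
#   dropmissing!(df)
# This shrinks the dataset considerably, so imputation is usually preferred here.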
# Split into training and testing sets (essential for evaluating model performance).
# A simple, dependency-free 80/20 split over shuffled row indices:
Random.seed!(42)  # For reproducibility
idx = shuffle(1:nrow(df))
n_train = floor(Int, 0.8 * nrow(df))
train_df = df[idx[1:n_train], :]       # 80% train
test_df  = df[idx[n_train+1:end], :]   # 20% test
# 3. Train the Logistic Regression Model
# `Binomial()` specifies logistic regression; `LogitLink()` is the standard link function.
fm = @formula(Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + Age)
model = glm(fm, df, Binomial(), LogitLink())  # Fit on the full dataset
# Or, fit on the training split (recommended, so the test set stays unseen):
model_train = glm(fm, train_df, Binomial(), LogitLink())
# 4. Evaluate the Model
# Make predictions on the test set
y_test = test_df.Outcome
y_pred_prob = predict(model_train, test_df)   # Predicted probabilities
y_pred = ifelse.(y_pred_prob .>= 0.5, 1, 0)   # Hard classifications (threshold at 0.5)
# Calculate performance metrics
accuracy = mean(y_pred .== y_test)
println("Accuracy: ", accuracy)
# Confusion Matrix (for more detailed analysis); rows = actual 0/1, columns = predicted 0/1
tn, fp = sum((y_pred .== 0) .& (y_test .== 0)), sum((y_pred .== 1) .& (y_test .== 0))
fn, tp = sum((y_pred .== 0) .& (y_test .== 1)), sum((y_pred .== 1) .& (y_test .== 1))
confusion_matrix = [tn fp; fn tp]
println("Confusion Matrix:\n", confusion_matrix)
# ROC Curve and AUC (Area Under the Curve), computed by hand to avoid extra dependencies
thresholds = [Inf; sort(unique(y_pred_prob); rev=true)]  # Inf makes the curve start at (0, 0)
tpr = [sum((y_pred_prob .>= t) .& (y_test .== 1)) / sum(y_test .== 1) for t in thresholds]
fpr = [sum((y_pred_prob .>= t) .& (y_test .== 0)) / sum(y_test .== 0) for t in thresholds]
plot(fpr, tpr, xlabel="False Positive Rate", ylabel="True Positive Rate", title="ROC Curve", legend=false)
auc_val = sum(diff(fpr) .* (tpr[1:end-1] .+ tpr[2:end]) ./ 2)  # Trapezoidal rule
println("AUC: ", auc_val)
# 5. Interpret Model Coefficients (Optional)
println("Coefficients:\n", coef(model_train)) # See the impact of each feature
# 6. Further Improvements (Optional)
# * Feature Scaling/Normalization: Puts coefficients on a comparable scale and is a prerequisite for regularization (see the sketch after this code block).
# * Regularization (L1 or L2): Can prevent overfitting.
# * Hyperparameter Tuning: Optimize settings such as the classification threshold or a regularization strength.
# * Cross-validation: A more robust way to evaluate model performance (see the sketch after this code block).
# * Different Imputation Strategies: Explore other ways to handle missing values.
```
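For the feature-scaling suggestion above, here is a minimal sketch. It assumes the `train_df`/`test_df` split from the script; the key point is that the mean and standard deviation come from the training set only, so no information leaks from the test set into preprocessing.
```julia
using Statistics

# Z-score each predictor column in place, using training-set statistics only.
feature_cols = [:Pregnancies, :Glucose, :BloodPressure, :SkinThickness,
                :Insulin, :BMI, :DiabetesPedigreeFunction, :Age]

for col in feature_cols
    mu, sigma = mean(train_df[!, col]), std(train_df[!, col])
    train_df[!, col] = (train_df[!, col] .- mu) ./ sigma
    test_df[!, col]  = (test_df[!, col]  .- mu) ./ sigma
end
```
Standardization does not change the predictions of an unregularized logistic regression, but it puts the coefficients on a common scale and is required before applying L1/L2 penalties.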
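Likewise, a minimal k-fold cross-validation sketch, assuming the `fm` formula and `df` from the script above. The fold assignment is a hand-rolled round-robin split rather than a library routine, and `cv_accuracy` is a hypothetical helper name:
```julia
using Random, Statistics

# Mean test-set accuracy over k folds.
function cv_accuracy(df, fm; k = 5, seed = 42)
    idx = shuffle(MersenneTwister(seed), 1:nrow(df))
    folds = [idx[i:k:end] for i in 1:k]   # Round-robin fold assignment
    accs = Float64[]
    for fold in folds
        train = df[setdiff(idx, fold), :]
        test  = df[fold, :]
        m = glm(fm, train, Binomial(), LogitLink())
        pred = predict(m, test) .>= 0.5
        push!(accs, mean(pred .== test.Outcome))
    end
    return mean(accs)
end

println("5-fold CV accuracy: ", cv_accuracy(df, fm))
```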
Key Improvements and Explanations:
- Package Loading: The script opens with `using` statements for every package it needs (CSV, DataFrames, GLM, Statistics, Random, Plots).
- Data Loading: Shows how to load the CSV and add column names.
- Missing Value Handling: Addresses the common issue of missing values (recorded as zeros) in this dataset by imputing the median of the non-zero values. This is crucial for good results, but it is a simple strategy; more sophisticated imputation methods might work better.
- Data Splitting: Splits the data into training and testing sets with a shuffled 80/20 index split. This is essential for evaluating how well the model generalizes to unseen data.
- Model Training: Trains the logistic regression model using `glm` from the GLM package, with the `@formula` macro as a cleaner way to specify the model. Shows how to train on the full dataset or on the training split.
- Model Evaluation: Calculates accuracy, displays the confusion matrix, plots the ROC curve, and computes the AUC. These are important metrics for assessing model performance.
- Model Interpretation: Shows how to access the model coefficients, which can help understand the importance of each feature.
- Further Improvements: Suggests several ways to improve the model, such as feature scaling, regularization, hyperparameter tuning, and cross-validation.
- Clear Comments: Improved comments to explain each step.
Running the Code:
- Install Julia and the required packages (CSV, DataFrames, GLM, and Plots via the package manager; Statistics and Random ship with Julia).
- Download the Pima Indian Diabetes dataset (pima-indians-diabetes.csv) and place it in the same directory as your Julia script or provide the correct file path.
- Copy and paste the code into the Julia REPL or run it in an IDE.