Class 1 transcript, summarized and broken down into sections, leading into Case Study 1.

  1. Course Introduction and Structure
    • Instructor: Professor Slater.
    • Course: “Quantifying the World.”
    • Syllabus and Textbook:
      • Free online textbook available.
    • Key Components of Grading:
      • Case studies (50%): Seven biweekly case studies (odd weeks). Final case study counts double.
      • Participation (30%): Camera on during class, and pre-session submissions.
      • Quizzes (20%): Weekly, due Tuesday midnight, auto-graded.
    • Late Submission Policy:
      • Case studies: 10% deduction per day.
      • Pre-session and quizzes: No late submissions accepted.
  2. Case Study Requirements
    • Format:
      • Technical write-up aimed at explaining data science analysis.
      • Include all steps, such as data preparation, model selection, and results.
    • Key Elements:
      • Data cleaning (e.g., imputation, feature handling).
      • Model choice and configuration.
      • Metrics like loss function, accuracy, precision, recall, etc.
      • Visualizations (residual plots, confusion matrix, etc.).
    • Mathematical Representations:
      • Cross-validation: Split the data into k folds; each fold serves once as the held-out test set while the remaining folds train the model, and the fold scores are averaged.
      • Regularization:
        • L1 (Lasso): \(\text{Loss} = ||y - X\beta||^2_2 + \lambda ||\beta||_1\)
        • L2 (Ridge): \(\text{Loss} = ||y - X\beta||^2_2 + \lambda ||\beta||^2_2\)
  3. Python-Specific Concepts
    • Python’s Strengths:
      • Versatility across data science and general programming tasks.
      • Key libraries: NumPy, Pandas, PyTorch.
    • Python Indexing:
      • Zero-based indexing: list[0] refers to the first element.
      • Slicing: list[start:end] includes start but excludes end.
    • Environments:
      • Importance of using virtual environments to manage Python dependencies.
  4. Mathematical Principles for Analysis
    • Metrics:
      • Mean Absolute Error (MAE): \(MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|\)
      • Precision, Recall, Sensitivity, Specificity.
    • Data Handling:
      • Feature selection using L1 regularization.
      • Handling highly correlated features: Avoid dropping unless 100% correlated or uninformative.
  5. Case Study Example: Superconductors
    • Dataset:
      • Contains chemical compositions and their critical temperatures.
      • Tasks:
        • Merge training and unique datasets.
        • Handle duplicate target columns.
        • Train models using L1 and L2 regularization.
    • Visualizations:
      • Include clear, legible graphs like feature importance and residual plots.
  6. Key Takeaways and Recommendations
    • Avoid subjective terms like “performed well”; use precise metrics.
    • Always cross-validate models unless computationally prohibitive.
    • Submit all relevant code for transparency, even if unorganized.
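
The Lasso, Ridge, and MAE formulas in the outline above can be checked by hand on a tiny example. The following is a minimal sketch, assuming NumPy; the function names and the toy values of X, y, beta, and lam are made up for illustration and do not come from the course material.

```python
import numpy as np

def lasso_loss(X, y, beta, lam):
    """Squared error plus lam times the L1 norm of beta."""
    residual = y - X @ beta
    return float(residual @ residual + lam * np.sum(np.abs(beta)))

def ridge_loss(X, y, beta, lam):
    """Squared error plus lam times the squared L2 norm of beta."""
    residual = y - X @ beta
    return float(residual @ residual + lam * (beta @ beta))

def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy data (hypothetical): 3 samples, 2 features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 2.0])
beta = np.array([1.0, -2.0])
lam = 0.5

# Predictions are [1, -2, -1], so residuals are [0, 4, 3] and the squared error is 25
print("Lasso loss:", lasso_loss(X, y, beta, lam))  # 25 + 0.5 * 3
print("Ridge loss:", ridge_loss(X, y, beta, lam))  # 25 + 0.5 * 5
print("MAE:", mae(y, X @ beta))                    # (0 + 4 + 3) / 3
```

Note that scikit-learn's Lasso scales the squared-error term by 1/(2n), so its alpha is not numerically interchangeable with the \(\lambda\) in the formula above.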

To expand on the case study and demonstrate the concepts using the provided data (train.csv and unique_m.csv), I will showcase:

  1. Loading and merging datasets.
  2. Data preprocessing:
    • Identifying duplicate columns (target).
    • Handling missing values (if any).
    • Dropping unnecessary features.
  3. Training two models:
    • Linear regression with L1 regularization (Lasso).
    • Linear regression with L2 regularization (Ridge).
  4. Evaluating and visualizing results.

Here’s the Python code for these steps:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Step 1: Load datasets
train_df = pd.read_csv('/mnt/data/train.csv')
unique_df = pd.read_csv('/mnt/data/unique_m.csv')

# Step 2: Merge datasets and preprocess
# unique_m.csv is assumed to repeat the target ('critical_temp') and to carry a
# text 'material' column; drop both so only the numeric element counts are added.
unique_features = unique_df.drop(columns=['critical_temp', 'material'], errors='ignore')
merged_df = pd.concat([train_df, unique_features], axis=1)  # Rows are aligned by position
merged_df = merged_df.drop_duplicates()  # Drop duplicate rows, if any

# Guard against any remaining duplicate column names (keeps the first occurrence)
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]

# Check for missing values and impute numeric columns with their mean
if merged_df.isnull().sum().sum() > 0:
    print("Handling missing values...")
    merged_df = merged_df.fillna(merged_df.mean(numeric_only=True))

# Define target and features
target = 'critical_temp'  # Replace with the correct column name
X = merged_df.drop(columns=[target])
y = merged_df[target]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train models
# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.01, max_iter=10000, random_state=42)  # Extra iterations help convergence on unscaled features
lasso.fit(X_train, y_train)
lasso_predictions = lasso.predict(X_test)

# L2 Regularization (Ridge)
ridge = Ridge(alpha=0.01, random_state=42)
ridge.fit(X_train, y_train)
ridge_predictions = ridge.predict(X_test)

# Step 4: Evaluate models
def evaluate_model(name, y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name} Model Evaluation:")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"R^2 Score: {r2:.4f}")
    print("-" * 40)

evaluate_model("Lasso", y_test, lasso_predictions)
evaluate_model("Ridge", y_test, ridge_predictions)

# Step 5: Visualize results
plt.figure(figsize=(12, 6))

# Actual vs Predicted (Lasso)
plt.subplot(1, 2, 1)
plt.scatter(y_test, lasso_predictions, alpha=0.6, label='Lasso Predictions')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.title("Actual vs Predicted (Lasso)")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.legend()

# Actual vs Predicted (Ridge)
plt.subplot(1, 2, 2)
plt.scatter(y_test, ridge_predictions, alpha=0.6, label='Ridge Predictions')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.title("Actual vs Predicted (Ridge)")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.legend()

plt.tight_layout()
plt.show()

Key Outputs

  1. Metrics: MAE, MSE, and \(R^2\) scores for both Lasso and Ridge models.
  2. Visualizations:
    • Scatter plots showing actual vs predicted values for both models.
  3. Feature Importance:
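    • For Lasso, some coefficients are driven exactly to 0, which removes those features from the model; among the rest, smaller absolute values indicate less influence (comparable across features only when they share a scale).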
    • Ridge coefficients are shrunk but not eliminated.
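
The difference between the two penalties can be seen by counting exact zeros in the fitted coefficients. The sketch below uses synthetic data from make_regression as a stand-in for the superconductor features; the variable names and the alpha values are illustrative assumptions, not settings from the case study.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in data: 20 features, only 5 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print(f"Lasso zeroed {n_zero_lasso} of {X_demo.shape[1]} coefficients")
print(f"Ridge zeroed {n_zero_ridge} of {X_demo.shape[1]} coefficients")
```

On data like this, Lasso typically discards the uninformative features outright, while Ridge keeps every feature with a small nonzero weight.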

Here’s an expanded approach to include hyperparameter tuning and cross-validation for both Lasso and Ridge regression models.

Updated Python Code:

from sklearn.model_selection import GridSearchCV, KFold

# Step 1: Cross-Validation Setup
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # 5-fold cross-validation

# Step 2: Hyperparameter Tuning using GridSearchCV
# Define hyperparameter grid for Lasso and Ridge
lasso_param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
ridge_param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Lasso Regression with GridSearch
lasso_grid = GridSearchCV(Lasso(random_state=42), lasso_param_grid, scoring='neg_mean_absolute_error', cv=kf)
lasso_grid.fit(X_train, y_train)

# Ridge Regression with GridSearch
ridge_grid = GridSearchCV(Ridge(random_state=42), ridge_param_grid, scoring='neg_mean_absolute_error', cv=kf)
ridge_grid.fit(X_train, y_train)

# Step 3: Best Parameters and Re-training
best_lasso = lasso_grid.best_estimator_
best_ridge = ridge_grid.best_estimator_

# GridSearchCV (refit=True by default) has already refit each best estimator
# on the entire training set, so no separate retraining step is needed.

# Predictions
lasso_predictions_cv = best_lasso.predict(X_test)
ridge_predictions_cv = best_ridge.predict(X_test)

# Step 4: Evaluate Models After Tuning
print("Best Lasso Alpha:", lasso_grid.best_params_['alpha'])
print("Best Ridge Alpha:", ridge_grid.best_params_['alpha'])

evaluate_model("Tuned Lasso", y_test, lasso_predictions_cv)
evaluate_model("Tuned Ridge", y_test, ridge_predictions_cv)

# Step 5: Visualize Feature Importance
plt.figure(figsize=(12, 6))

# Lasso Feature Importance
plt.subplot(1, 2, 1)
lasso_importance = pd.Series(np.abs(best_lasso.coef_), index=X.columns).sort_values(ascending=False)
lasso_importance.head(10).plot(kind='bar')
plt.title("Top 10 Features (Lasso)")
plt.ylabel("Coefficient Magnitude")

# Ridge Feature Importance
plt.subplot(1, 2, 2)
ridge_importance = pd.Series(np.abs(best_ridge.coef_), index=X.columns).sort_values(ascending=False)
ridge_importance.head(10).plot(kind='bar')
plt.title("Top 10 Features (Ridge)")
plt.ylabel("Coefficient Magnitude")

plt.tight_layout()
plt.show()

# Step 6: Cross-Validation Results Visualization
lasso_cv_results = pd.DataFrame(lasso_grid.cv_results_)
ridge_cv_results = pd.DataFrame(ridge_grid.cv_results_)

plt.figure(figsize=(10, 5))
plt.plot(lasso_cv_results['param_alpha'].astype(float), -lasso_cv_results['mean_test_score'], label='Lasso')
plt.plot(ridge_cv_results['param_alpha'].astype(float), -ridge_cv_results['mean_test_score'], label='Ridge')
plt.xscale('log')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Mean Absolute Error (5-fold CV)')  # Scores are negated above, so this is plain MAE
plt.title('Cross-Validation Performance for Lasso and Ridge')
plt.legend()
plt.show()

Code Explanation:

  1. Cross-Validation:
    • The KFold object splits the training data into 5 shuffled folds, so that each fold is used for validation exactly once.
    • This provides a more robust evaluation compared to a single train-test split.
  2. Hyperparameter Tuning:
    • GridSearchCV is used to search for the optimal alpha value for both Lasso and Ridge.
    • The scoring metric is neg_mean_absolute_error, where more negative values indicate worse performance.
  3. Feature Importance:
    • The magnitudes of the coefficients (absolute values) are plotted to visualize the top 10 most influential features for both models.
  4. Cross-Validation Results Visualization:
    • A line plot is used to show how the performance changes with different alpha values for both Lasso and Ridge.
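
The sign convention of neg_mean_absolute_error is easy to trip over, so here is a small sketch of it using cross_val_score. The data is synthetic and the names (X_demo, kf_demo, and so on) are hypothetical; only the scoring string and the 5-fold setup mirror the code above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the superconductor dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so error metrics are reported with a minus sign
neg_mae = cross_val_score(
    Ridge(alpha=1.0), X_demo, y_demo, scoring='neg_mean_absolute_error', cv=kf_demo
)

# Flip the sign to recover the usual MAE, one value per fold
mae_per_fold = -neg_mae

print("Raw scores (negative):", np.round(neg_mae, 3))
print("MAE per fold:", np.round(mae_per_fold, 3))
print("Mean MAE:", round(float(mae_per_fold.mean()), 3))
```

When reporting results in a write-up, the sign-flipped MAE is usually the clearer number to quote.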

Expected Outputs:

  1. Optimal Alpha Values:
    • Printed best alpha for both Lasso and Ridge from GridSearchCV.
  2. Model Metrics:
    • MAE, MSE, and \(R^2\) scores after tuning.
  3. Feature Importance:
    • Bar plots showing the top 10 features for each model.
  4. Cross-Validation Curve:
    • Mean cross-validated MAE (the negated neg_mean_absolute_error score) plotted against alpha to show the impact of regularization strength.

Step-by-Step Case Study Implementation

Following the detailed requirements and directives from the assignment and transcript, here’s a structured walkthrough for the case study:


1. Data Preparation

a) Loading and Merging Data:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load datasets
train_df = pd.read_csv('/mnt/data/train.csv')
unique_df = pd.read_csv('/mnt/data/unique_m.csv')

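# Merge datasets; unique_m.csv is assumed to repeat 'critical_temp' and to carry
# a text 'material' column, so drop both before concatenating column-wise
unique_features = unique_df.drop(columns=['critical_temp', 'material'], errors='ignore')
merged_df = pd.concat([train_df, unique_features], axis=1)

# Guard against any remaining duplicate column names (keeps the first occurrence)
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]

# Check for missing values
print("Missing values:", merged_df.isnull().sum().sum())

# Impute missing values (if any) in numeric columns with the column mean
merged_df = merged_df.fillna(merged_df.mean(numeric_only=True))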

# Describe data size
print(f"Dataset contains {merged_df.shape[0]} rows and {merged_df.shape[1]} columns.")

b) Feature Selection:

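import numpy as np

# Correlation matrix of the features only, so the target is never dropped.
# Note: 0.95 is looser than the lecture's advice to drop only ~100% correlated features.
features = merged_df.drop(columns=['critical_temp'])
corr = features.corr(numeric_only=True).abs()

# Keep the upper triangle so each pair is examined once, then drop one feature per pair
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
merged_df = merged_df.drop(columns=to_drop)

# Scale features
scaler = StandardScaler()
X = scaler.fit_transform(merged_df.drop(columns=['critical_temp']))
y = merged_df['critical_temp']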
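
One caveat with scaling the full dataset up front is that the scaler sees the test rows (and every validation fold) before the split. A common way around this is to put the scaler inside a Pipeline so it is refit on the training portion of each fold. The sketch below shows the idea on synthetic data; the step names ('scaler', 'model'), the grid values, and the demo variables are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the superconductor dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is a pipeline step, so it is fit only on the training part of each fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', ElasticNet(max_iter=10000, random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
grid = {
    'model__alpha': [0.01, 0.1, 1.0],
    'model__l1_ratio': [0.1, 0.5, 0.9],
}

search = GridSearchCV(
    pipe,
    grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='neg_mean_absolute_error',
)
search.fit(X_tr, y_tr)

# predict() applies the scaler fitted on the training data, then the tuned model
test_pred = search.predict(X_te)
print("Best parameters:", search.best_params_)
```

With this layout, the unscaled feature matrix is passed to train_test_split and the search object handles scaling internally.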

2. Splitting Data and Cross-Validation

from sklearn.model_selection import train_test_split, KFold

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation setup
kf = KFold(n_splits=5, shuffle=True, random_state=42)

3. Model Training and Hyperparameter Tuning

a) Define Elastic Net Model and Hyperparameter Grid:

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1, 10],
    'l1_ratio': [0.1, 0.5, 0.7, 0.9, 1.0]
}

# Elastic Net with GridSearchCV
elastic_net = GridSearchCV(ElasticNet(random_state=42), param_grid, cv=kf, scoring='neg_mean_absolute_error')
elastic_net.fit(X_train, y_train)

# Best hyperparameters
print("Best Hyperparameters:", elastic_net.best_params_)

4. Model Evaluation

a) Metrics and Visualizations:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Predictions
y_pred = elastic_net.predict(X_test)

# Metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R^2 Score: {r2:.4f}")

# Visualizations
plt.figure(figsize=(12, 6))

# Actual vs Predicted
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.title("Actual vs Predicted")
plt.xlabel("Actual")
plt.ylabel("Predicted")

# Residuals
residuals = y_test - y_pred
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='r', linestyle='--')
plt.title("Residuals vs Predicted")
plt.xlabel("Predicted")
plt.ylabel("Residuals")

plt.tight_layout()
plt.show()

5. Feature Importance

# Feature importance from Elastic Net
import numpy as np

importance = np.abs(elastic_net.best_estimator_.coef_)
sorted_idx = np.argsort(importance)[::-1]
feature_names = merged_df.drop(columns=['critical_temp']).columns  # Same column order as X
top_features = np.array(feature_names)[sorted_idx[:20]]  # Top 20 features

# Bar plot
plt.figure(figsize=(10, 6))
plt.barh(top_features, importance[sorted_idx[:20]])
plt.gca().invert_yaxis()
plt.title("Top 20 Feature Importances (Elastic Net)")
plt.xlabel("Coefficient Magnitude")
plt.show()

6. Regularization Strength vs MAE

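# Analyze effect of alpha on cross-validated MAE, one line per l1_ratio
results = pd.DataFrame(elastic_net.cv_results_)
results['alpha'] = results['param_alpha'].astype(float)
results['l1_ratio'] = results['param_l1_ratio'].astype(float)
results['mae'] = -results['mean_test_score']  # Flip the sign of neg_mean_absolute_error

plt.figure(figsize=(8, 5))
for l1_ratio, group in results.groupby('l1_ratio'):
    group = group.sort_values('alpha')
    plt.plot(group['alpha'], group['mae'], marker='o', label=f"l1_ratio={l1_ratio}")
plt.xscale('log')
plt.xlabel("Alpha (Regularization Strength)")
plt.ylabel("Mean Absolute Error (5-fold CV)")
plt.title("Effect of Alpha on MAE")
plt.legend()
plt.show()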

GPU Utilization

For additional acceleration on a GPU (if available), the regularized regression can be implemented in PyTorch.
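
A minimal sketch of what that could look like follows, assuming PyTorch is installed. It fits a linear model with an elastic-net-style penalty by gradient descent (Adam) rather than the coordinate descent scikit-learn uses, so coefficients shrink toward zero but are not set exactly to zero. The data is synthetic, and the names, alpha, l1_ratio, learning rate, and step count are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in data (hypothetical): 3 informative features out of 10
n_samples, n_features = 500, 10
X_t = torch.randn(n_samples, n_features, device=device)
true_w = torch.zeros(n_features, device=device)
true_w[:3] = torch.tensor([2.0, -1.0, 0.5], device=device)
y_t = X_t @ true_w + 0.1 * torch.randn(n_samples, device=device)

model = torch.nn.Linear(n_features, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
alpha, l1_ratio = 0.01, 0.5  # Illustrative values, not tuned

def elastic_net_loss(model, X, y):
    """Half the mean squared error plus a mixed L1/L2 penalty on the weights."""
    pred = model(X).squeeze(1)
    mse = torch.mean((pred - y) ** 2)
    w = model.weight
    l1 = w.abs().sum()
    l2 = (w ** 2).sum()
    return 0.5 * mse + alpha * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2)

initial_loss = elastic_net_loss(model, X_t, y_t).item()

# Full-batch gradient descent; every tensor already lives on `device`
for _ in range(300):
    optimizer.zero_grad()
    loss = elastic_net_loss(model, X_t, y_t)
    loss.backward()
    optimizer.step()

final_loss = elastic_net_loss(model, X_t, y_t).item()
print(f"Device: {device}, loss {initial_loss:.4f} -> {final_loss:.4f}")
```

For the real dataset, the scaled training features and targets would be converted with torch.tensor(..., dtype=torch.float32) and moved to the same device before training.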

