Class 1 transcript, summarized and broken down into sections, leading
into Case Study 1.
- Course Introduction and Structure
  - Instructor: Professor Slater.
  - Course: “Quantifying the World.”
  - Syllabus and Textbook:
    - Free online textbook available.
  - Key Components of Grading:
    - Case studies (50%): Seven biweekly case studies (odd weeks). Final
      case study counts double.
    - Participation (30%): Camera on during class, and pre-session
      submissions.
    - Quizzes (20%): Weekly, due Tuesday midnight, auto-graded.
  - Late Submission Policy:
    - Case studies: 10% deduction per day.
    - Pre-session and quizzes: No late submissions accepted.
- Case Study Requirements
  - Format:
    - Technical write-up aimed at explaining data science analysis.
    - Include all steps, such as data preparation, model selection, and
      results.
  - Key Elements:
    - Data cleaning (e.g., imputation, feature handling).
    - Model choice and configuration.
    - Metrics like loss function, accuracy, precision, recall, etc.
    - Visualizations (residual plots, confusion matrix, etc.).
  - Mathematical Representations:
    - Cross-validation: Split the data into k folds; each fold serves once
      as the held-out test set while the model trains on the remaining
      folds, giving a less biased estimate than a single split.
    - Regularization:
      - L1 (Lasso): \(\text{Loss} = ||y - X\beta||^2_2 + \lambda ||\beta||_1\)
      - L2 (Ridge): \(\text{Loss} = ||y - X\beta||^2_2 + \lambda ||\beta||^2_2\)
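As a quick check on the two penalized losses above, here is a minimal NumPy sketch that evaluates them directly. The function names and the toy numbers are made up for illustration; they are not from the class materials.

```python
import numpy as np

def lasso_loss(X, y, beta, lam):
    """Squared-error loss plus the L1 penalty: ||y - X b||_2^2 + lam * ||b||_1."""
    residual = y - X @ beta
    return residual @ residual + lam * np.sum(np.abs(beta))

def ridge_loss(X, y, beta, lam):
    """Squared-error loss plus the squared L2 penalty: ||y - X b||_2^2 + lam * ||b||_2^2."""
    residual = y - X @ beta
    return residual @ residual + lam * np.sum(beta ** 2)

# Toy data (illustrative values only)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, -2.0])
lam = 0.5

print(lasso_loss(X, y, beta, lam))  # 32 + 0.5 * 3 = 33.5
print(ridge_loss(X, y, beta, lam))  # 32 + 0.5 * 5 = 34.5
```

Note that scikit-learn's Lasso scales the squared-error term by 1/(2n), so its alpha is not numerically the same as the \(\lambda\) written here.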
- Python-Specific Concepts
  - Python’s Strengths:
    - Versatility across data science and general programming tasks.
    - Key libraries: NumPy, Pandas, PyTorch.
  - Python Indexing:
    - Zero-based indexing: list[0] refers to the first element.
    - Slicing: list[start:end] includes start but excludes end.
  - Environments:
    - Importance of using virtual environments to manage Python
      dependencies.
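A short illustration of the indexing and slicing rules above. The list name and its contents are arbitrary.

```python
values = [10, 20, 30, 40, 50]

first = values[0]      # zero-based: index 0 is the first element
last = values[-1]      # negative indices count from the end
middle = values[1:4]   # start=1 is included, end=4 is excluded
head = values[:2]      # an omitted start defaults to 0

print(first, last, middle, head)
```

Because the end index is excluded, the length of values[start:end] is end - start whenever both lie within the list.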
- Mathematical Principles for Analysis
  - Metrics:
    - Mean Absolute Error (MAE): \(MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|\)
    - Precision, Recall, Sensitivity, Specificity.
  - Data Handling:
    - Feature selection using L1 regularization.
    - Handling highly correlated features: Avoid dropping unless 100%
      correlated or uninformative.
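The metrics listed above can be computed by hand in a few lines. This is a sketch with hypothetical helper names and made-up labels; it assumes binary 0/1 labels and that no denominator is zero.

```python
import numpy as np

def mean_abs_error(y_true, y_pred):
    """MAE = (1/n) * sum |y_i - y_hat_i|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def binary_rates(y_true, y_pred):
    """Precision, recall (also called sensitivity), and specificity for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

mae = mean_abs_error([3, 5, 2], [2.5, 5, 4])
rates = binary_rates([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                     [1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
print(mae, rates)
```

Sensitivity is another name for recall, so only three distinct rates are returned.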
- Case Study Example: Superconductors
  - Dataset:
    - Contains chemical compositions and their critical temperatures.
  - Tasks:
    - Merge training and unique datasets.
    - Handle duplicate target columns.
    - Train models using L1 and L2 regularization.
  - Visualizations:
    - Include clear, legible graphs like feature importance and residual
      plots.
- Key Takeaways and Recommendations
  - Avoid subjective terms like “performed well”; report precise
    metrics instead.
  - Always cross-validate models unless computationally
    prohibitive.
  - Submit all relevant code for transparency, even if unorganized.
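To make the first two takeaways concrete, here is one possible way to report a cross-validated metric with its spread instead of a phrase like “performed well.” The data is synthetic and the Ridge alpha is an arbitrary choice, so treat this as a sketch only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (a stand-in for a real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.3, size=120)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")

# scikit-learn reports negated MAE so that higher is better; flip the sign back
mae_per_fold = -scores
print(f"MAE: {mae_per_fold.mean():.3f} +/- {mae_per_fold.std():.3f} "
      f"over {len(mae_per_fold)} folds")
```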
To expand on the case study and demonstrate the concepts using the
provided data (train.csv and unique_m.csv), I will showcase:
- Loading and merging datasets.
- Data preprocessing:
  - Identifying duplicate columns (target).
  - Handling missing values (if any).
  - Dropping unnecessary features.
- Training two models:
  - Linear regression with L1 regularization (Lasso).
  - Linear regression with L2 regularization (Ridge).
- Evaluating and visualizing results.
Here’s the Python code for these steps:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Step 1: Load datasets
train_df = pd.read_csv('/mnt/data/train.csv')
unique_df = pd.read_csv('/mnt/data/unique_m.csv')
# Step 2: Merge datasets and preprocess
# Note: iloc[:, 1:] skips the first column of unique_m.csv; confirm it is not a needed feature
merged_df = pd.concat([train_df, unique_df.iloc[:, 1:]], axis=1) # Column-wise concatenation (rows aligned by index)
merged_df = merged_df.drop_duplicates() # Drop duplicate rows, if any
# Drop the duplicate target column ('critical_temp' appears in both datasets), keeping the first occurrence
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]
# Check for missing values and impute numeric columns with the column mean
if merged_df.isnull().sum().sum() > 0:
    print("Handling missing values...")
    merged_df = merged_df.fillna(merged_df.mean(numeric_only=True))
# Define target and features
target = 'critical_temp' # Replace with the correct column name if it differs
# Keep numeric columns only; a linear model cannot use text columns such as identifiers
X = merged_df.drop(columns=[target]).select_dtypes(include='number')
y = merged_df[target]
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 3: Train models
# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.01, random_state=42)
lasso.fit(X_train, y_train)
lasso_predictions = lasso.predict(X_test)
# L2 Regularization (Ridge)
ridge = Ridge(alpha=0.01, random_state=42)
ridge.fit(X_train, y_train)
ridge_predictions = ridge.predict(X_test)
# Step 4: Evaluate models
def evaluate_model(name, y_true, y_pred):
    # Print MAE, MSE, and R^2 for one model's predictions
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name} Model Evaluation:")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"R^2 Score: {r2:.4f}")
    print("-" * 40)
evaluate_model("Lasso", y_test, lasso_predictions)
evaluate_model("Ridge", y_test, ridge_predictions)
# Step 5: Visualize results
plt.figure(figsize=(12, 6))
# Actual vs Predicted (Lasso)
plt.subplot(1, 2, 1)
plt.scatter(y_test, lasso_predictions, alpha=0.6, label='Lasso Predictions')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.title("Actual vs Predicted (Lasso)")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.legend()
# Actual vs Predicted (Ridge)
plt.subplot(1, 2, 2)
plt.scatter(y_test, ridge_predictions, alpha=0.6, label='Ridge Predictions')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.title("Actual vs Predicted (Ridge)")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.legend()
plt.tight_layout()
plt.show()
Key Outputs
- Metrics: MAE, MSE, and \(R^2\) scores for both Lasso and Ridge
  models.
- Visualizations:
  - Scatter plots showing actual vs predicted values for both
    models.
- Feature Importance:
  - For Lasso, coefficients driven exactly to 0 mark features the model
    has dropped; among the rest, smaller absolute values indicate less
    influence (comparable only when features share a scale).
  - Ridge coefficients are shrunk toward 0 but not eliminated.
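The difference between the two penalties is easy to see on synthetic data where only a few features matter. The data, the alpha value, and the variable names below are assumptions chosen for illustration, not results from the superconductor dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))

# Only the first three features influence the target; the rest are noise
true_coef = np.zeros(n_features)
true_coef[:3] = [5.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
```

This is why Lasso doubles as a feature selector while Ridge is better viewed as a way to stabilize coefficients.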
Here’s an expanded approach to include hyperparameter
tuning and cross-validation for both Lasso and
Ridge regression models.
Updated Python Code:
from sklearn.model_selection import GridSearchCV, KFold
# Step 1: Cross-Validation Setup
kf = KFold(n_splits=5, shuffle=True, random_state=42) # 5-fold cross-validation
# Step 2: Hyperparameter Tuning using GridSearchCV
# Define hyperparameter grid for Lasso and Ridge
lasso_param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
ridge_param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
# Lasso Regression with GridSearch
lasso_grid = GridSearchCV(Lasso(random_state=42), lasso_param_grid, scoring='neg_mean_absolute_error', cv=kf)
lasso_grid.fit(X_train, y_train)
# Ridge Regression with GridSearch
ridge_grid = GridSearchCV(Ridge(random_state=42), ridge_param_grid, scoring='neg_mean_absolute_error', cv=kf)
ridge_grid.fit(X_train, y_train)
# Step 3: Best Parameters and Re-training
best_lasso = lasso_grid.best_estimator_
best_ridge = ridge_grid.best_estimator_
# GridSearchCV (refit=True by default) has already refit best_estimator_ on
# the entire training set with the best parameters, so no extra fit is needed
# Predictions
lasso_predictions_cv = best_lasso.predict(X_test)
ridge_predictions_cv = best_ridge.predict(X_test)
# Step 4: Evaluate Models After Tuning
print("Best Lasso Alpha:", lasso_grid.best_params_['alpha'])
print("Best Ridge Alpha:", ridge_grid.best_params_['alpha'])
evaluate_model("Tuned Lasso", y_test, lasso_predictions_cv)
evaluate_model("Tuned Ridge", y_test, ridge_predictions_cv)
# Step 5: Visualize Feature Importance
plt.figure(figsize=(12, 6))
# Lasso Feature Importance
plt.subplot(1, 2, 1)
lasso_importance = pd.Series(np.abs(best_lasso.coef_), index=X.columns).sort_values(ascending=False)
lasso_importance.head(10).plot(kind='bar')
plt.title("Top 10 Features (Lasso)")
plt.ylabel("Coefficient Magnitude")
# Ridge Feature Importance
plt.subplot(1, 2, 2)
ridge_importance = pd.Series(np.abs(best_ridge.coef_), index=X.columns).sort_values(ascending=False)
ridge_importance.head(10).plot(kind='bar')
plt.title("Top 10 Features (Ridge)")
plt.ylabel("Coefficient Magnitude")
plt.tight_layout()
plt.show()
# Step 6: Cross-Validation Results Visualization
lasso_cv_results = pd.DataFrame(lasso_grid.cv_results_)
ridge_cv_results = pd.DataFrame(ridge_grid.cv_results_)
plt.figure(figsize=(10, 5))
plt.plot(lasso_cv_results['param_alpha'].astype(float), -lasso_cv_results['mean_test_score'], label='Lasso')
plt.plot(ridge_cv_results['param_alpha'].astype(float), -ridge_cv_results['mean_test_score'], label='Ridge')
plt.xscale('log')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Mean Absolute Error (cross-validated)') # scores are negated MAE, so -score is MAE
plt.title('Cross-Validation Performance for Lasso and Ridge')
plt.legend()
plt.show()
Code Explanation:
- Cross-Validation:
  - The KFold object splits the training data into 5 folds, ensuring
    that each fold is used for validation exactly once.
  - This provides a more robust evaluation compared to a single
    train-test split.
- Hyperparameter Tuning:
  - GridSearchCV is used to search for the optimal alpha value for both
    Lasso and Ridge.
  - The scoring metric is neg_mean_absolute_error, where more negative
    values indicate worse performance (scikit-learn maximizes scores, so
    MAE is negated).
- Feature Importance:
  - The magnitudes of the coefficients (absolute values) are plotted to
    visualize the top 10 most influential features for both models.
- Cross-Validation Results Visualization:
  - A line plot is used to show how the performance changes with
    different alpha values for both Lasso and Ridge.
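The claim that each sample lands in a validation fold exactly once can be checked directly. A small sketch on ten toy samples (the array contents are arbitrary):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many times each sample index appears in a validation fold
times_tested = np.zeros(len(X), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    times_tested[test_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

print(times_tested)  # every entry is 1
```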
Expected Outputs:
- Optimal Alpha Values:
  - Printed best alpha for both Lasso and Ridge from GridSearchCV.
- Model Metrics:
  - MAE, MSE, and \(R^2\) scores after tuning.
- Feature Importance:
  - Bar plots showing the top 10 features for each model.
- Cross-Validation Curve:
  - Cross-validated MAE plotted against different alpha values to
    observe the impact of regularization strength.
Step-by-Step Case Study Implementation
Following the detailed requirements and directives from the
assignment and transcript, here’s a structured walkthrough for the case
study:
1. Data Preparation
a) Loading and Merging Data:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load datasets
train_df = pd.read_csv('/mnt/data/train.csv')
unique_df = pd.read_csv('/mnt/data/unique_m.csv')
# Merge datasets column-wise (rows aligned by index)
# Note: iloc[:, 1:] skips the first column of unique_m.csv; confirm it is not a needed feature
merged_df = pd.concat([train_df, unique_df.iloc[:, 1:]], axis=1)
# Drop duplicate target columns (if applicable), keeping the first occurrence
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]
# Check for missing values
print("Missing values:", merged_df.isnull().sum().sum())
# Impute missing values (if any) in numeric columns with the column mean
merged_df = merged_df.fillna(merged_df.mean(numeric_only=True))
# Describe data size
print(f"Dataset contains {merged_df.shape[0]} rows and {merged_df.shape[1]} columns.")
b) Feature Selection:
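import numpy as np
# Keep numeric columns only; text columns cannot be correlated or scaled
merged_df = merged_df.select_dtypes(include='number')
# For each pair of features with |correlation| > 0.95, drop one and keep the other
# (the target is excluded so it can never be dropped)
correlation_matrix = merged_df.drop(columns=['critical_temp']).corr().abs()
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))
high_corr_features = [col for col in upper.columns if (upper[col] > 0.95).any()]
merged_df = merged_df.drop(columns=high_corr_features)
# Scale features. Note: fitting the scaler on all rows before the train/test split
# leaks test-set statistics; fitting it on the training split only avoids this
scaler = StandardScaler()
X = scaler.fit_transform(merged_df.drop(columns=['critical_temp']))
y = merged_df['critical_temp']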
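A correlation filter that keeps one feature from each highly correlated pair can be sketched on toy data as follows. The column names and the 0.95 threshold are illustrative; the idea is to look only at the strict upper triangle of the absolute correlation matrix so that each pair is considered once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 1.0,      # perfectly correlated with "a"
    "b": rng.normal(size=100),    # independent feature
})

corr = df.corr().abs()
# Mask everything except the strict upper triangle (k=1 skips the diagonal)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)

print(to_drop, list(reduced.columns))
```

Dropping only the later column of each pair preserves the information the pair carried, which fits the guidance to avoid removing correlated features wholesale.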
2. Splitting Data and Cross-Validation
from sklearn.model_selection import train_test_split, KFold
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 5-fold cross-validation setup
kf = KFold(n_splits=5, shuffle=True, random_state=42)
3. Model Training and Hyperparameter Tuning
a) Define Elastic Net Model and Hyperparameter Grid:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'alpha': [0.001, 0.01, 0.1, 1, 10],
'l1_ratio': [0.1, 0.5, 0.7, 0.9, 1.0]
}
# Elastic Net with GridSearchCV
elastic_net = GridSearchCV(ElasticNet(random_state=42), param_grid, cv=kf, scoring='neg_mean_absolute_error')
elastic_net.fit(X_train, y_train)
# Best hyperparameters
print("Best Hyperparameters:", elastic_net.best_params_)
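One way to keep feature scaling inside cross-validation, so that each fold's scaler is fit only on that fold's training portion, is a scikit-learn Pipeline. The sketch below uses synthetic data and a smaller, assumed parameter grid; the step names "scaler" and "model" are arbitrary labels.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5)) * np.array([1, 10, 100, 1, 1])
y = X[:, 0] * 2.0 + X[:, 1] * 0.3 + rng.normal(scale=0.1, size=150)

pipeline = Pipeline([
    ("scaler", StandardScaler()),            # refit on each training fold
    ("model", ElasticNet(max_iter=10_000)),
])
# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "model__alpha": [0.001, 0.01, 0.1, 1],
    "model__l1_ratio": [0.1, 0.5, 0.9],
}
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best setting
```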
4. Model Evaluation
a) Metrics and Visualizations:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Predictions
y_pred = elastic_net.predict(X_test)
# Metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R^2 Score: {r2:.4f}")
# Visualizations
plt.figure(figsize=(12, 6))
# Actual vs Predicted
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.title("Actual vs Predicted")
plt.xlabel("Actual")
plt.ylabel("Predicted")
# Residuals
residuals = y_test - y_pred
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='r', linestyle='--')
plt.title("Residuals vs Predicted")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.tight_layout()
plt.show()
5. Feature Importance
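# Feature importance from Elastic Net
import numpy as np
importance = np.abs(elastic_net.best_estimator_.coef_)
sorted_idx = np.argsort(importance)[::-1]
# Feature names in the same column order as X (the target is not necessarily the last column)
feature_names = merged_df.drop(columns=['critical_temp']).columns.to_numpy()
top_features = feature_names[sorted_idx[:20]] # Top 20 features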
# Bar plot
plt.figure(figsize=(10, 6))
plt.barh(top_features, importance[sorted_idx[:20]])
plt.gca().invert_yaxis()
plt.title("Top 20 Feature Importances (Elastic Net)")
plt.xlabel("Coefficient Magnitude")
plt.show()
6. Regularization Strength vs MAE
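# Analyze effect of alpha on cross-validated MAE, one line per l1_ratio
# (the grid has several l1_ratio values per alpha, so a single line would zigzag)
results = pd.DataFrame(elastic_net.cv_results_)
results['mae'] = -results['mean_test_score'] # scores are negated MAE
results['alpha'] = results['param_alpha'].astype(float)
plt.figure(figsize=(8, 5))
for l1_ratio, group in results.groupby(results['param_l1_ratio'].astype(float)):
    group = group.sort_values('alpha')
    plt.plot(group['alpha'], group['mae'], marker='o', label=f"l1_ratio={l1_ratio}")
plt.xscale('log')
plt.xlabel("Alpha (Regularization Strength)")
plt.ylabel("Mean Absolute Error (cross-validated)")
plt.title("Effect of Alpha on MAE")
plt.legend()
plt.show()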
GPU Utilization
For additional acceleration using a GPU (if available), we can
implement this using PyTorch.
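What follows is only a rough sketch of how an elastic-net-style linear model might be trained in PyTorch, falling back to the CPU when no GPU is present. The data is synthetic, and alpha, l1_ratio, the learning rate, and the epoch count are assumed values; the objective mirrors the form scikit-learn documents for ElasticNet, but gradient descent will not drive coefficients exactly to zero the way a coordinate-descent solver does.

```python
import torch

# Use the GPU when one is available, otherwise run on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Synthetic stand-in data: only the first three features matter
n_samples, n_features = 500, 20
X = torch.randn(n_samples, n_features, device=device)
true_w = torch.zeros(n_features, device=device)
true_w[:3] = torch.tensor([2.0, -1.0, 0.5], device=device)
y = X @ true_w + 0.1 * torch.randn(n_samples, device=device)

model = torch.nn.Linear(n_features, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
alpha, l1_ratio = 0.01, 0.5  # assumed values, not tuned

for epoch in range(300):
    optimizer.zero_grad()
    pred = model(X).squeeze(1)
    mse = torch.mean((pred - y) ** 2)
    w = model.weight
    # Elastic net penalty: mix of L1 and squared L2 terms on the weights
    penalty = alpha * (l1_ratio * w.abs().sum()
                       + 0.5 * (1 - l1_ratio) * (w ** 2).sum())
    loss = 0.5 * mse + penalty
    loss.backward()
    optimizer.step()

with torch.no_grad():
    mae = torch.mean(torch.abs(model(X).squeeze(1) - y)).item()
print(f"device={device}, training MAE={mae:.4f}")
```

For a dataset of this case study's size, a linear model usually trains quickly on the CPU, so the GPU path mainly pays off for much larger data or models.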
