---
title: "QTW Class 1 into Case Study 1"
author: Jessica McPhaul
output: html_notebook
---

Class 1 transcript, summarized and broken down into sections, leading into Case Study 1.

1. **Course Introduction and Structure**
   - Instructor: Professor Slater.
   - Course: "Quantifying the World."
   - Syllabus and Textbook:
     - Free online textbook available.
   - Key Components of Grading:
     - Case studies (50%): Seven biweekly case studies (odd weeks). Final case study counts double.
     - Participation (30%): Camera on during class, and pre-session submissions.
     - Quizzes (20%): Weekly, due Tuesday midnight, auto-graded.
   - Late Submission Policy:
     - Case studies: 10% deduction per day.
     - Pre-session and quizzes: No late submissions accepted.

2. **Case Study Requirements**
   - Format:
     - Technical write-up aimed at explaining data science analysis.
     - Include all steps, such as data preparation, model selection, and results.
   - Key Elements:
     - Data cleaning (e.g., imputation, feature handling).
     - Model choice and configuration.
     - Metrics like loss function, accuracy, precision, recall, etc.
     - Visualizations (residual plots, confusion matrix, etc.).
   - Mathematical Representations:
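     - Cross-validation: Split the data into k folds; each fold serves once as the held-out test set while the remaining folds train the model, which gives a less biased performance estimate than a single split.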
     - Regularization:
       - L1 (Lasso): \( \text{Loss} = ||y - X\beta||^2_2 + \lambda ||\beta||_1 \)
       - L2 (Ridge): \( \text{Loss} = ||y - X\beta||^2_2 + \lambda ||\beta||^2_2 \)
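
As a small numerical check of the two penalized losses above, the sketch below evaluates both on a toy problem. The arrays, `beta`, and `lam` are made-up illustration values, not course data, and the function names are hypothetical.

```python
import numpy as np

def lasso_loss(X, y, beta, lam):
    """Squared error plus L1 penalty: ||y - X beta||_2^2 + lam * ||beta||_1."""
    residual = y - X @ beta
    return float(residual @ residual + lam * np.sum(np.abs(beta)))

def ridge_loss(X, y, beta, lam):
    """Squared error plus L2 penalty: ||y - X beta||_2^2 + lam * ||beta||_2^2."""
    residual = y - X @ beta
    return float(residual @ residual + lam * np.sum(beta ** 2))

# Toy values chosen only for illustration
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])  # fits y exactly, so the residual term is 0
lam = 0.5

print(lasso_loss(X, y, beta, lam))  # 0 + 0.5 * (|1| + |2|)
print(ridge_loss(X, y, beta, lam))  # 0 + 0.5 * (1^2 + 2^2)
```

Because the residual is zero here, the printed values are the penalty terms alone, which makes the difference between the L1 and L2 norms easy to see.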

3. **Python-Specific Concepts**
   - Python's Strengths:
     - Versatility across data science and general programming tasks.
     - Key libraries: NumPy, Pandas, PyTorch.
   - Python Indexing:
     - Zero-based indexing: `list[0]` refers to the first element.
     - Slicing: `list[start:end]` includes `start` but excludes `end`.
   - Environments:
     - Importance of using virtual environments to manage Python dependencies.
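
The indexing and slicing rules above can be checked with a short snippet. The list `temps` is a hypothetical example, not data from the course.

```python
# Hypothetical list used only to illustrate indexing and slicing
temps = [10, 20, 30, 40, 50]

first = temps[0]      # zero-based: index 0 is the first element
last = temps[-1]      # negative indices count from the end
middle = temps[1:4]   # start=1 is included, end=4 is excluded
head = temps[:2]      # an omitted start defaults to 0

print(first, last, middle, head)
```

A handy consequence of the half-open convention is that `len(temps[start:end])` equals `end - start` whenever both bounds lie inside the list.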

4. **Mathematical Principles for Analysis**
   - Metrics:
     - Mean Absolute Error (MAE): \( MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \)
     - Precision, Recall, Sensitivity, Specificity.
   - Data Handling:
     - Feature selection using L1 regularization.
     - Handling highly correlated features: Avoid dropping unless 100% correlated or uninformative.
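
To make the metric definitions concrete, here is a minimal pure-Python sketch. The helper names and the sample labels are assumptions made for illustration, and the functions do not guard against division by zero (for example, when no positives are predicted).

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "precision": tp / (tp + fp),    # of predicted positives, how many are real
        "recall": tp / (tp + fn),       # also called sensitivity
        "specificity": tn / (tn + fp),  # of real negatives, how many were caught
    }

# Made-up regression and classification examples
print(mae([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(classification_metrics(labels, preds))
```

Recall and sensitivity are the same quantity under two names, which is why the dictionary reports only three values.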

5. **Case Study Example: Superconductors**
   - Dataset:
     - Contains chemical compositions and their critical temperatures.
     - Tasks:
       - Merge training and unique datasets.
       - Handle duplicate target columns.
       - Train models using L1 and L2 regularization.
   - Visualizations:
     - Include clear, legible graphs like feature importance and residual plots.

6. **Key Takeaways and Recommendations**
   - Avoid subjective terms like "performed well"; report precise metrics instead.
   - Always cross-validate models unless computationally prohibitive.
   - Submit all relevant code for transparency, even if unorganized.
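
Since cross-validation is recommended by default, it helps to see the fold rotation spelled out. The generator below is a hand-rolled sketch (the name `k_fold_indices` is hypothetical); it does not shuffle, whereas scikit-learn's `KFold(shuffle=True)` used later in this notebook does.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5)):
    print(f"fold {fold}: test={test_idx}, train size={len(train_idx)}")
```

Each model is fitted on `train_idx` and scored on `test_idx`; averaging the k scores gives the cross-validated estimate.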

To expand on the case study and demonstrate the concepts using the provided data (`train.csv` and `unique_m.csv`), I will showcase:

1. **Loading and merging datasets**.
2. **Data preprocessing**:
   - Identifying duplicate columns (target).
   - Handling missing values (if any).
   - Dropping unnecessary features.
3. **Training two models**:
   - Linear regression with L1 regularization (Lasso).
   - Linear regression with L2 regularization (Ridge).
4. **Evaluating and visualizing results**.

Here’s the Python code for these steps:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Step 1: Load datasets
train_df = pd.read_csv('/mnt/data/train.csv')
unique_df = pd.read_csv('/mnt/data/unique_m.csv')

# Step 2: Merge datasets and preprocess
# iloc[:, 1:] skips the first column of unique_m.csv; this assumes that column
# is a row index rather than a feature. Check unique_df.columns before relying on it.
merged_df = pd.concat([train_df, unique_df.iloc[:, 1:]], axis=1)
merged_df = merged_df.drop_duplicates()  # Drop duplicate rows, if any

# The target 'critical_temp' may appear in both files; keep only the first copy
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]

# Impute missing numeric values with the column mean
if merged_df.isnull().sum().sum() > 0:
    print("Handling missing values...")
    merged_df = merged_df.fillna(merged_df.mean(numeric_only=True))

# Define target and features (numeric columns only, since the models need numbers)
target = 'critical_temp'
X = merged_df.drop(columns=[target]).select_dtypes(include='number')
y = merged_df[target]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train models
# L1 Regularization (Lasso)
lasso = Lasso(alpha=0.01, random_state=42)
lasso.fit(X_train, y_train)
lasso_predictions = lasso.predict(X_test)

# L2 Regularization (Ridge)
ridge = Ridge(alpha=0.01, random_state=42)
ridge.fit(X_train, y_train)
ridge_predictions = ridge.predict(X_test)

# Step 4: Evaluate models
def evaluate_model(name, y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name} Model Evaluation:")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"Mean Squared Error (MSE): {mse:.4f}")
    print(f"R^2 Score: {r2:.4f}")
    print("-" * 40)

evaluate_model("Lasso", y_test, lasso_predictions)
evaluate_model("Ridge", y_test, ridge_predictions)

# Step 5: Visualize results
plt.figure(figsize=(12, 6))

# Actual vs Predicted (Lasso)
plt.subplot(1, 2, 1)
plt.scatter(y_test, lasso_predictions, alpha=0.6, label='Lasso Predictions')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.title("Actual vs Predicted (Lasso)")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.legend()

# Actual vs Predicted (Ridge)
plt.subplot(1, 2, 2)
plt.scatter(y_test, ridge_predictions, alpha=0.6, label='Ridge Predictions')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2)
plt.title("Actual vs Predicted (Ridge)")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.legend()

plt.tight_layout()
plt.show()
```

### Key Outputs
1. **Metrics**: MAE, MSE, and \(R^2\) scores for both Lasso and Ridge models.
2. **Visualizations**:
   - Scatter plots showing actual vs predicted values for both models.
3. **Feature Importance**:
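   - For Lasso, the L1 penalty drives some coefficients exactly to zero, which removes those features from the model; among the rest, smaller absolute values indicate less influence (comparable only when features are on the same scale).
   - Ridge coefficients are shrunk toward zero but are generally not set exactly to zero.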
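
The contrast between the two penalties is easiest to see in a textbook special case: when the columns of \(X\) are orthonormal (\(X^\top X = I\)), minimizing the losses as written in section 2 has a closed form per coefficient. With \(b\) the ordinary least-squares coefficient, Lasso gives \(\operatorname{sign}(b)\max(|b| - \lambda/2, 0)\) and Ridge gives \(b / (1 + \lambda)\). The sketch below applies both to made-up coefficients; it illustrates the special case only and is not how scikit-learn fits these models in general (its `Lasso` also scales the squared-error term by \(1/(2n)\), so its `alpha` is not the same number as \(\lambda\) here).

```python
def lasso_coef(b, lam):
    """Soft-threshold an OLS coefficient b at lam / 2 (orthonormal design only)."""
    magnitude = max(abs(b) - lam / 2, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if b > 0 else -magnitude

def ridge_coef(b, lam):
    """Shrink an OLS coefficient b by the factor 1 / (1 + lam) (orthonormal design only)."""
    return b / (1 + lam)

# Hypothetical OLS coefficients and penalty strength
ols = [3.0, 0.4, -1.5]
lam = 1.0

print("lasso:", [lasso_coef(b, lam) for b in ols])
print("ridge:", [ridge_coef(b, lam) for b in ols])
```

The small coefficient (0.4) is removed entirely by Lasso but only halved by Ridge, which is the behaviour the bullets above describe.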

Here's an expanded approach to include **hyperparameter tuning** and **cross-validation** for both Lasso and Ridge regression models.

### Updated Python Code:

```python
from sklearn.model_selection import GridSearchCV, KFold

# Step 1: Cross-Validation Setup
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # 5-fold cross-validation

# Step 2: Hyperparameter Tuning using GridSearchCV
# Define hyperparameter grid for Lasso and Ridge
lasso_param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
ridge_param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Lasso Regression with GridSearch
lasso_grid = GridSearchCV(Lasso(random_state=42), lasso_param_grid, scoring='neg_mean_absolute_error', cv=kf)
lasso_grid.fit(X_train, y_train)

# Ridge Regression with GridSearch
ridge_grid = GridSearchCV(Ridge(random_state=42), ridge_param_grid, scoring='neg_mean_absolute_error', cv=kf)
ridge_grid.fit(X_train, y_train)

# Step 3: Best Parameters
# GridSearchCV (refit=True by default) has already refit each best estimator
# on the full training set, so no extra fit call is needed.
best_lasso = lasso_grid.best_estimator_
best_ridge = ridge_grid.best_estimator_

# Predictions
lasso_predictions_cv = best_lasso.predict(X_test)
ridge_predictions_cv = best_ridge.predict(X_test)

# Step 4: Evaluate Models After Tuning
print("Best Lasso Alpha:", lasso_grid.best_params_['alpha'])
print("Best Ridge Alpha:", ridge_grid.best_params_['alpha'])

evaluate_model("Tuned Lasso", y_test, lasso_predictions_cv)
evaluate_model("Tuned Ridge", y_test, ridge_predictions_cv)

# Step 5: Visualize Feature Importance
plt.figure(figsize=(12, 6))

# Lasso Feature Importance
plt.subplot(1, 2, 1)
lasso_importance = pd.Series(np.abs(best_lasso.coef_), index=X.columns).sort_values(ascending=False)
lasso_importance.head(10).plot(kind='bar')
plt.title("Top 10 Features (Lasso)")
plt.ylabel("Coefficient Magnitude")

# Ridge Feature Importance
plt.subplot(1, 2, 2)
ridge_importance = pd.Series(np.abs(best_ridge.coef_), index=X.columns).sort_values(ascending=False)
ridge_importance.head(10).plot(kind='bar')
plt.title("Top 10 Features (Ridge)")
plt.ylabel("Coefficient Magnitude")

plt.tight_layout()
plt.show()

# Step 6: Cross-Validation Results Visualization
lasso_cv_results = pd.DataFrame(lasso_grid.cv_results_)
ridge_cv_results = pd.DataFrame(ridge_grid.cv_results_)

plt.figure(figsize=(10, 5))
# Scores are negated MAE, so flip the sign to plot MAE itself
plt.plot(lasso_cv_results['param_alpha'].astype(float), -lasso_cv_results['mean_test_score'], label='Lasso')
plt.plot(ridge_cv_results['param_alpha'].astype(float), -ridge_cv_results['mean_test_score'], label='Ridge')
plt.xscale('log')
plt.xlabel('Alpha (Regularization Strength)')
plt.ylabel('Mean Absolute Error (cross-validated)')
plt.title('Cross-Validation Performance for Lasso and Ridge')
plt.legend()
plt.show()
```

### Code Explanation:

1. **Cross-Validation**:
   - The `KFold` object splits the data into 5 folds, ensuring that each fold is used for testing exactly once.
   - This provides a more robust evaluation compared to a single train-test split.

2. **Hyperparameter Tuning**:
   - `GridSearchCV` is used to search for the optimal `alpha` value for both Lasso and Ridge.
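   - The scoring metric is `neg_mean_absolute_error`: scikit-learn negates MAE so that a higher score is always better, which means values closer to zero indicate better performance and more negative values indicate worse performance.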

3. **Feature Importance**:
   - The magnitudes of the coefficients (absolute values) are plotted to visualize the top 10 most influential features for both models.

4. **Cross-Validation Results Visualization**:
   - A line plot is used to show how the performance changes with different `alpha` values for both Lasso and Ridge.
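
The sign convention and the automatic refit are easy to verify on synthetic data. The sketch below uses `make_regression` as a stand-in for the course dataset (an assumption made so the example runs without the CSV files); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data standing in for the course dataset (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 1.0, 100.0]},
                    scoring="neg_mean_absolute_error", cv=cv)
grid.fit(X_demo, y_demo)

cv_mae = -grid.cv_results_["mean_test_score"]  # flip the sign to get MAE per alpha
best_mae = -grid.best_score_                   # MAE of the selected alpha

for alpha, mae_value in zip(grid.cv_results_["param_alpha"], cv_mae):
    print(f"alpha={alpha}: CV MAE={mae_value:.3f}")
print("selected alpha:", grid.best_params_["alpha"])
```

The selected `alpha` is the one with the highest (least negative) score, which is the lowest cross-validated MAE, and `grid.best_estimator_` is already fitted on all of the data passed to `fit`.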

### Expected Outputs:

1. **Optimal Alpha Values**:
   - Printed best `alpha` for both Lasso and Ridge from `GridSearchCV`.

2. **Model Metrics**:
   - MAE, MSE, and \(R^2\) scores after tuning.

3. **Feature Importance**:
   - Bar plots showing the top 10 features for each model.

4. **Cross-Validation Curve**:
   - Cross-validated MAE (the sign-flipped `neg_mean_absolute_error` score) plotted against different `alpha` values to observe the impact of regularization strength.





### Step-by-Step Case Study Implementation

Following the detailed requirements and directives from the assignment and transcript, here’s a structured walkthrough for the case study:

---

### **1. Data Preparation**
#### a) Loading and Merging Data:
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load datasets
train_df = pd.read_csv('/mnt/data/train.csv')
unique_df = pd.read_csv('/mnt/data/unique_m.csv')

# Merge datasets
merged_df = pd.concat([train_df, unique_df.iloc[:, 1:]], axis=1)

# Drop the duplicate target column, keeping the first 'critical_temp'
merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]

# Check for missing values
print("Missing values:", merged_df.isnull().sum().sum())

# Impute missing numeric values with the column mean (no-op if none are missing)
merged_df = merged_df.fillna(merged_df.mean(numeric_only=True))

# Describe data size
print(f"Dataset contains {merged_df.shape[0]} rows and {merged_df.shape[1]} columns.")
```

#### b) Feature Selection:
```python
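import numpy as np

# Correlations among numeric features only; the target is never a drop candidate
feature_df = merged_df.drop(columns=['critical_temp']).select_dtypes(include='number')
corr = feature_df.corr().abs()

# Keep the upper triangle so each pair is counted once, then drop only one
# feature from every pair with |r| > 0.95.
# 0.95 is a judgment call; the class guidance is to drop only features that
# are fully correlated or uninformative.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
merged_df = merged_df.drop(columns=to_drop)

# Scale the remaining numeric features
scaler = StandardScaler()
X = scaler.fit_transform(merged_df.drop(columns=['critical_temp']).select_dtypes(include='number'))
y = merged_df['critical_temp']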
```

---

### **2. Splitting Data and Cross-Validation**
```python
from sklearn.model_selection import train_test_split, KFold

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation setup
kf = KFold(n_splits=5, shuffle=True, random_state=42)
```
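
One caveat about the order of operations: the scaler in step 1 is fitted on all rows before this split, so statistics from the test rows leak into the training features. A common way to avoid that is to put the scaler inside a `Pipeline`, so it is refitted on the training portion of every fold. The sketch below shows the pattern on synthetic data from `make_regression` (a stand-in chosen so the example runs without the CSV files); the step names `scale` and `model` are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged dataset (assumption for illustration)
X_raw, y_raw = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([("scale", StandardScaler()), ("model", ElasticNet(random_state=42))])
param_grid = {"model__alpha": [0.01, 0.1, 1.0], "model__l1_ratio": [0.1, 0.5, 0.9]}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error",
                      cv=KFold(n_splits=5, shuffle=True, random_state=42))
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("held-out MAE:", -search.score(X_te, y_te))
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid keys carry the `model__` prefix.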

---

### **3. Model Training and Hyperparameter Tuning**
#### a) Define Elastic Net Model and Hyperparameter Grid:
```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1, 10],
    'l1_ratio': [0.1, 0.5, 0.7, 0.9, 1.0]
}

# Elastic Net with GridSearchCV
elastic_net = GridSearchCV(ElasticNet(random_state=42), param_grid, cv=kf, scoring='neg_mean_absolute_error')
elastic_net.fit(X_train, y_train)

# Best hyperparameters
print("Best Hyperparameters:", elastic_net.best_params_)
```

---

### **4. Model Evaluation**
#### a) Metrics and Visualizations:
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Predictions
y_pred = elastic_net.predict(X_test)

# Metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae:.4f}")
print(f"Mean Squared Error: {mse:.4f}")
print(f"R^2 Score: {r2:.4f}")

# Visualizations
plt.figure(figsize=(12, 6))

# Actual vs Predicted
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')
plt.title("Actual vs Predicted")
plt.xlabel("Actual")
plt.ylabel("Predicted")

# Residuals
residuals = y_test - y_pred
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color='r', linestyle='--')
plt.title("Residuals vs Predicted")
plt.xlabel("Predicted")
plt.ylabel("Residuals")

plt.tight_layout()
plt.show()
```

---

### **5. Feature Importance**
```python
# Feature importance from Elastic Net
import numpy as np

importance = np.abs(elastic_net.best_estimator_.coef_)
sorted_idx = np.argsort(importance)[::-1]
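# Feature names in the same column order that was passed to the scaler
feature_names = merged_df.drop(columns=['critical_temp']).select_dtypes(include='number').columns
top_features = np.array(feature_names)[sorted_idx[:20]]  # Top 20 features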

# Bar plot
plt.figure(figsize=(10, 6))
plt.barh(top_features, importance[sorted_idx[:20]])
plt.gca().invert_yaxis()
plt.title("Top 20 Feature Importances (Elastic Net)")
plt.xlabel("Coefficient Magnitude")
plt.show()
```

---

### **6. Regularization Strength vs MAE**
```python
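# Analyze effect of alpha on MAE, one curve per l1_ratio in the grid
results = pd.DataFrame(elastic_net.cv_results_)
results['alpha'] = results['param_alpha'].astype(float)
results['l1_ratio'] = results['param_l1_ratio'].astype(float)
results['mae'] = -results['mean_test_score']  # scores are negated MAE

plt.figure(figsize=(8, 5))
for l1_ratio, group in results.groupby('l1_ratio'):
    group = group.sort_values('alpha')
    plt.plot(group['alpha'], group['mae'], marker='o', label=f"l1_ratio={l1_ratio}")
plt.xscale('log')
plt.xlabel("Alpha (Regularization Strength)")
plt.ylabel("Mean Absolute Error (cross-validated)")
plt.title("Effect of Alpha on MAE")
plt.legend()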
plt.show()
```

---

### GPU Utilization
For additional acceleration on a GPU (if available), the fitting step can be implemented in PyTorch.
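
As a rough sketch of what that could look like, the example below solves ridge regression in closed form on whichever device is available. It is a ridge-only illustration on random data, not a port of the elastic net grid search above: the function name `ridge_closed_form`, the tensor sizes, and the coefficients are all assumptions, and no intercept is fitted.

```python
import torch

# Use the GPU when one is visible to PyTorch, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y on whatever device X lives on."""
    n_features = X.shape[1]
    identity = torch.eye(n_features, dtype=X.dtype, device=X.device)
    return torch.linalg.solve(X.T @ X + lam * identity, X.T @ y)

# Random data with known coefficients, used only for illustration
torch.manual_seed(0)
X_t = torch.randn(1000, 5, dtype=torch.float64, device=device)
true_beta = torch.tensor([1.0, -2.0, 0.0, 0.5, 3.0], dtype=torch.float64, device=device)
y_t = X_t @ true_beta  # noiseless targets

beta_hat = ridge_closed_form(X_t, y_t, lam=1.0)
print("device:", device)
print("ridge estimate:", beta_hat.cpu())
```

This minimizes the same \(||y - X\beta||^2_2 + \lambda ||\beta||^2_2\) loss given in section 2. Lasso and elastic net have no closed form, so a GPU version of those would need an iterative solver instead.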
