Project 1: Pay and our industry

Author

Team: I Love Lucy

Published

October 20, 2024

github   |   web presentation

Data Sets

This project analyzes two datasets: one from ZipRecruiter and the other from Stack Overflow.

• ZipRecruiter Dataset: A heavily curated, clean, and well-structured dataset. It offers a tight focus, making it easy to work with and delivering highly accurate predictions.

• Stack Overflow Dataset: A collection of self-reported survey data, providing valuable insights but requiring more effort to clean and analyze. As we'll see, while the ZipRecruiter dataset enables near-perfect predictions, the Stack Overflow data presents more challenges due to variability and inconsistencies.

import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import pandas as pd
import numpy as np
import io
import base64
from mpl_toolkits.axes_grid1 import make_axes_locatable
import matplotlib.ticker as mticker
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
from scipy.stats import randint
from lussi.ziprecruiter import *
from lussi.stackoverflow import * 
data_dir = "622data_nogit"
ziprecruiter = load_zip(data_dir = data_dir)
wide_stack = load_stack(data_dir=data_dir, stack_type=StackType.WIDE)

Data Set 1: Zip Recruiter

First, let’s view a sample of the Zip Recruiter dataset:

ziprecruiter.sample(n=10, random_state=42).head(10)
State Annual Salary Monthly Pay Weekly Pay Hourly Wage Job Title Abbreviation Salary Tier
157 Alaska 129821 10818 2496 62.41 Machine Learning Engineer AK 120K-140K
341 Arkansas 73611 6134 1415 35.39 Statistician AR 60K-80K
315 Maryland 84798 7066 1630 40.77 Statistician MD 80K-100K
234 Wyoming 117710 9809 2263 56.59 Quantitative Analyst WY 110K-120K
155 New Jersey 130704 10892 2513 62.84 Machine Learning Engineer NJ 120K-140K
274 South Dakota 119548 9962 2299 57.48 Big Data Engineer SD 110K-120K
304 Hawaii 97697 8141 1878 46.97 Statistician HI 80K-100K
227 South Carolina 121214 10101 2331 58.28 Quantitative Analyst SC 120K-140K
278 Rhode Island 117210 9767 2254 56.35 Big Data Engineer RI 110K-120K
185 Maine 112395 9366 2161 54.04 Machine Learning Engineer ME 110K-120K
# Note that because we are publishing to the web, we will be base64-encoding our
# images directly into this web page.
df = ziprecruiter  # alias so we don't overwrite the original dataframe!
plt.figure(figsize=(8, 5))
sns.boxplot(y='Job Title', x='Annual Salary', data=df, orient='h')
plt.title('Salary by title within this data set')
plt.ylabel('')  # Remove the label on the y-axis
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Prevents Quarto from auto-rendering the plot. 
img_buf.seek(0) ## reset the buffer
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Salary by title" />'
# And render. There is no cached image!
from IPython.display import display, HTML
display(HTML(img_html))
Salary by title

Overview of the Dataset

This dataset was scraped from ZipRecruiter using Selenium and Chromedriver. We collected several pages containing salary information by job title and state, such as a page for data engineers. Additionally, we added state abbreviations and a salary tier column to provide more granular analysis.

Though relatively small, the dataset is clean, standardized, and well-suited for both analysis and prediction. Its structure makes it an excellent candidate for regression-based algorithms such as Linear Regression or Random Forest, allowing us to predict salaries based on key features like job title and state.

Model Evaluation

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# One-hot encode categorical state and job title
df = pd.get_dummies(ziprecruiter, columns=['State', 'Job Title'], drop_first=True)

# Split the data into features (X) and target (y)
X = df.drop(['Annual Salary', 'Abbreviation', 'Salary Tier'], axis=1)
y = df['Annual Salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred = lr_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f'Mean Squared Error: {mse:.2f}')
print(f'R² Score: {r2:.5f}')
Mean Squared Error: 10.70
R² Score: 1.00000

The linear regression model performed exceptionally well:

• Mean squared error: 10.70. This low MSE indicates that the model's predictions are highly accurate and closely align with the actual values.

• R² score: 0.99999997. An R² score this close to 1 suggests that the model explains nearly all of the variability in annual salaries; the printed value of 1.00000 is simply this number rounded to five decimal places (see the snippet below).
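
To make that distinction visible, printing the r2 variable computed in the evaluation block with more decimal places (a minimal snippet, not part of the original output) shows the score is very close to, but not exactly, 1:

# Show the R² score with more decimal places
print(f'R² Score: {r2:.8f}')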

Conclusion

This strong linear relationship between the input features (e.g., job title and state) and the target variable (annual salary) demonstrates the suitability of linear regression for this dataset. Its ability to make accurate predictions with minimal error makes it highly useful for exploring salary trends across different states and roles. This can help job seekers and recruiters alike in setting appropriate salary expectations based on job title and location. Additionally, the clean dataset offers opportunities for further analysis with more advanced models, such as Random Forest or Gradient Boosting.
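
As a hedged illustration of that follow-up (not part of the original analysis), here is a minimal sketch of fitting a Random Forest regressor on the same one-hot-encoded features and train/test split used for the linear model above; the n_estimators value is an arbitrary assumption, not a tuned choice:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Reuse X_train/X_test/y_train/y_test from the linear-regression block above
rf_model = RandomForestRegressor(n_estimators=200, random_state=42)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
print(f'Random Forest MSE: {mean_squared_error(y_test, rf_pred):.2f}')
print(f'Random Forest R²: {r2_score(y_test, rf_pred):.5f}')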

But, nobody cares about Zip Recruiter. That’s where newbies get their jobs.

Data Set 2: Stack Overflow

Overview of the Dataset

Now let's look at the wide Stack Overflow dataset (be sure to scroll to the right):

wide_stack.sample(n=10, random_state=42).head(10)
Year OrgSize Country Employment Gender EdLevel US_State Age DevType Sexuality ... mariadb microsoftazure googlecloud ibmcloudorwatson kubernetes linux windows sexuality_grouped ethnicity_grouped aws
371801 2021 2 to 9 employees Belgium Full-Time Male Bachelors NaN 18-24 years old Developer, front-end;Developer, desktop or ent... Straight / Heterosexual ... yes yes yes no no no no straight non-minority yes
312400 2021 1,000 to 4,999 employees Australia Full-Time Male Masters NaN 35-44 years old Data or business analyst Straight / Heterosexual ... no yes no no no no no straight minority no
351053 2021 2 to 9 employees United States Self-Employed Male Bachelors California 45-54 years old Developer, full-stack Straight / Heterosexual ... yes no no no no no no straight non-minority yes
2519 2017 100 to 499 employees United States Full-Time Male Masters NaN NaN Developer with a statistics or mathematics bac... NaN ... no no no no no no no NaN non-minority no
321665 2021 1,000 to 4,999 employees Estonia Part-Time Male Bachelors NaN 18-24 years old Developer, full-stack Straight / Heterosexual ... no no no no no no no straight non-minority no
315461 2021 1,000 to 4,999 employees Germany Full-Time Male Bachelors NaN 35-44 years old Developer, full-stack Straight / Heterosexual ... no no no no no no no straight minority yes
99267 2018 Fewer than 10 employees Poland Part-Time Male Some College NaN 18 - 24 years old Full-stack developer Straight or heterosexual ... no no no no no no no straight non-minority no
197349 2019 20 to 99 employees Netherlands Full-Time Male Bachelors NaN 22.0 Developer, back-end;Developer, front-end;Devel... Straight / Heterosexual ... no no no no no no no straight non-minority no
293118 2020 1,000 to 4,999 employees China Full-Time Male Masters NaN 28.0 Developer, back-end;Developer, desktop or ente... Bisexual ... no no no no no no yes lgbtq NaN no
528679 2023 100 to 499 employees Morocco Full-Time NaN Masters NaN 25-34 years old Developer, full-stack NaN ... yes yes yes no no no no NaN minority yes

10 rows × 49 columns

This dataset originates from the Stack Overflow user-entered survey data, covering responses from 2017 through 2023. We uploaded the raw survey data to an S3 bucket and processed it extensively to extract core columns that we believe could predict key values, particularly annual salary.

This wide dataset contains over 500,000 records and is much more complex, with numerous categorical variables and some missing data. The complexity makes it well-suited for advanced machine learning algorithms like XGBoost or Random Forest, which can efficiently handle high-dimensional data and missing entries. It offers opportunities to explore various questions, such as predicting salary based on skills and education or classifying job titles based on technical skillsets.

We decided to focus on records from the United States only, so that the results are comparable to the ZipRecruiter analysis.

# filter for US only
us_data = wide_stack[wide_stack['Country'] == 'United States'].copy()  # copy so we can add columns later without a SettingWithCopyWarning

Exploratory Data Analysis (EDA)

Stack Overflow columns used percentage, by year

Before we dig any further, we have to point out that some data doesn't exist in every survey year. For example, gender and sexual identity are not included in the most recent (2023) survey. The table below shows which core columns we have access to, as the percentage of non-null values by year.

filtered_columns = wide_stack.loc[:, 'Year':'AgeAvg']

# Group and calculate the percentage of non-null values for the filtered columns
grouped_summary_filtered = filtered_columns.groupby("Year").agg(lambda x: (x.notnull().mean() * 100)).reset_index()

# Round the result to one decimal place
grouped_summary_filtered_rounded = grouped_summary_filtered.round(1)

# Display the grouped summary
from IPython.display import display
display(grouped_summary_filtered_rounded)
Year OrgSize Country Employment Gender EdLevel US_State Age DevType Sexuality Ethnicity DatabaseWorkedWith LanguageWorkedWith PlatformWorkedWith YearsCodePro AnnualSalary YearsCodeProAvg OrgSizeAvg AgeAvg
0 2017 75.7 100.0 100.0 68.2 100.0 0.0 0.0 70.3 0.0 64.3 57.3 71.3 56.7 99.5 25.1 96.6 72.7 0.0
1 2018 72.4 99.6 96.4 65.2 95.8 0.0 65.3 93.2 60.5 58.1 67.0 79.2 66.8 94.9 48.3 94.9 72.4 65.3
2 2019 80.8 99.9 98.1 96.1 97.2 0.0 89.1 91.5 85.7 86.3 85.5 98.5 90.8 83.6 0.0 83.6 75.9 89.1
3 2020 68.8 99.4 99.1 78.4 89.1 0.0 70.5 76.6 68.2 71.3 76.8 89.0 83.5 71.9 0.0 71.9 65.4 70.5
4 2021 72.8 100.0 99.9 98.6 99.6 17.9 98.8 79.7 87.9 95.2 83.3 98.7 62.5 73.4 56.1 73.4 66.3 98.1
5 2022 69.7 98.0 97.9 96.7 97.7 0.0 96.8 83.7 90.9 94.8 82.1 96.9 68.1 70.7 52.0 70.7 64.5 96.1
6 2023 72.9 98.6 98.6 0.0 98.6 0.0 100.0 86.2 0.0 0.0 82.3 97.7 71.3 74.2 53.8 74.2 66.8 99.5

Features

With a clear understanding of the dataset’s structure established, we then turned our attention to examining the features within the data. This analysis will help us identify the characteristics that could influence our model’s performance and guide our feature selection process.

def group_gender(gender):
    gender = str(gender).lower()  # converts to lowercase 
    if 'female' in gender or 'woman' in gender:
        return 'Female'
    elif 'male' in gender or 'man' in gender:
        return 'Male'
    else:
        return 'Other'

us_data['gender_grouped'] = us_data['Gender'].apply(group_gender)
# Columns to keep: all features except DevType (too free-form and self-reported to be useful) and US_State (only the 2021 survey collected it)
cols_to_keep =['Year', 'Employment', 'EdLevel', 'AnnualSalary',
       'YearsCodeProAvg', 'OrgSizeAvg', 'AgeAvg', 'python', 'sql', 'java',
       'javascript', 'ruby', 'php', 'c', 'swift', 'scala', 'r', 'rust',
       'julia', 'mysql', 'microsoftsqlserver', 'mongodb', 'postgresql',
       'oracle', 'ibmdb2', 'redis', 'sqlite', 'mariadb', 'microsoftazure',
       'googlecloud', 'ibmcloudorwatson', 'kubernetes', 'linux', 'windows',
       'sexuality_grouped', 'ethnicity_grouped', 'aws', 'gender_grouped']

# new dataframe with specific columns
df_new = us_data[cols_to_keep].copy()

# print datatypes of columns
print(df_new.dtypes)
Year                    int64
Employment             object
EdLevel                object
AnnualSalary          float64
YearsCodeProAvg       float64
OrgSizeAvg            float64
AgeAvg                float64
python                 object
sql                    object
java                   object
javascript             object
ruby                   object
php                    object
c                      object
swift                  object
scala                  object
r                      object
rust                   object
julia                  object
mysql                  object
microsoftsqlserver     object
mongodb                object
postgresql             object
oracle                 object
ibmdb2                 object
redis                  object
sqlite                 object
mariadb                object
microsoftazure         object
googlecloud            object
ibmcloudorwatson       object
kubernetes             object
linux                  object
windows                object
sexuality_grouped      object
ethnicity_grouped      object
aws                    object
gender_grouped         object
dtype: object

We determined that the majority of the features are categorical, with a few float features (YearsCodeProAvg, OrgSizeAvg, and AgeAvg). Based on this, we needed a model that handles categorical variables well.

Target Variable

Our data is labeled (it has a target variable), which affected our choice of algorithm. Since AnnualSalary is a continuous variable, we are working on a regression problem. This narrows our algorithm choices to regression models such as:

  • Linear Regression
  • XGBoost
  • Random Forest Regressor
  • Support Vector Regressor

Given our exploratory analysis of the features, we narrowed down our model choice to XGBoost and Random Forest.

Below is the distribution of the target variable (AnnualSalary).

annualsalary = df_new['AnnualSalary'].dropna()

plt.figure(figsize=(10, 5))
plt.boxplot(annualsalary)
plt.title('Boxplot of Target Variable (AnnualSalary)')
plt.ylabel('Target Variable Values')


# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Prevent Quarto from auto-rendering the plot

# Reset buffer position and convert the image to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate the HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Boxplot of Target Variable (AnnualSalary)" />'

# Display the image
from IPython.display import display, HTML
display(HTML(img_html))
Boxplot of Target Variable (AnnualSalary)

As observed, there are outliers in the distribution. We chose to address these using the Interquartile Range (IQR) method because it is a robust technique for detecting and removing extreme values without making assumptions about the data distribution, unlike methods such as z-scores which assume normality. The IQR method is particularly effective when dealing with skewed data, which is common in salary distributions. By focusing on the middle 50% of the data (between the 25th and 75th percentiles), this approach minimizes the impact of extreme outliers that could distort model performance.

Further details on the impact of removing outliers will be discussed in the model section, where we describe the performance of the model with and without these outliers.

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df_new['AnnualSalary'].quantile(0.25)
Q3 = df_new['AnnualSalary'].quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers from the dataframe
df_filtered = df_new[(df_new['AnnualSalary'] >= lower_bound) & (df_new['AnnualSalary'] <= upper_bound)]

Here is the distribution post-IQR.

plt.figure(figsize=(10, 5))
plt.boxplot(df_filtered['AnnualSalary'])
plt.title('Boxplot of Target Variable (AnnualSalary)')
plt.ylabel('Target Variable Values')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Prevent Quarto from auto-rendering the plot

# Reset buffer position and convert the image to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate the HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Boxplot of Target Variable (AnnualSalary) Post-IQR" />'

# Display the image
from IPython.display import display, HTML
display(HTML(img_html))
Boxplot of Target Variable (AnnualSalary) Post-IQR

Model

For the model, we decided to use XGBoost for the following reasons:

  • Large dataset: XGBoost is highly efficient when working with large datasets.

  • Mixed feature types: The dataset contains both categorical and numeric features, and XGBoost can natively handle this mixture without requiring extensive pre-processing.

  • Missing values: XGBoost can effectively handle missing values by internally learning the best direction to take when encountering them, making it a natural fit for our dataset with some incomplete entries.

We chose not to use Random Forest primarily because it tends to perform better on smaller and less complex datasets. Given the size and intricacy of the Stack Overflow dataset, XGBoost was a better fit due to its scalability and ability to handle high-dimensional, complex data.

Pre-Processing

Although XGBoost handles mixed feature types well, the standard scikit-learn workflow we used still expects numeric inputs, so the categorical variables need to be represented numerically. To prepare them for the model, we opted to use one-hot encoding (a sketch of the native-categorical alternative appears after the encoding code below).

In the code below, we used the pd.get_dummies function to convert the categorical variables into binary (boolean) columns. These columns represent each category as True or False. To make this data compatible with the model, we then converted the boolean values into integers, assigning True as 1 and False as 0.

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df_filtered, drop_first=True)

# Identify all boolean columns
bool_cols = df_encoded.select_dtypes(include=['bool']).columns

# Convert boolean columns to integers (True -> 1, False -> 0)
df_encoded[bool_cols] = df_encoded[bool_cols].astype(int)
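
For reference, recent XGBoost releases (roughly 1.6 and later; the installed version is an assumption here) can also consume pandas categorical columns directly when enable_categorical=True is combined with tree_method='hist', which would avoid the one-hot step entirely. A minimal sketch of that alternative:

# Alternative to one-hot encoding: let XGBoost split on categoricals natively.
# Assumes an XGBoost version with categorical support (>= 1.6).
df_cat = df_filtered.copy()
cat_cols = df_cat.select_dtypes(include=['object']).columns
df_cat[cat_cols] = df_cat[cat_cols].astype('category')

X_cat = df_cat.drop(columns=['AnnualSalary'])
y_cat = df_cat['AnnualSalary']

xgb_native = xgb.XGBRegressor(objective='reg:squarederror',
                              tree_method='hist',        # required for native categorical splits
                              enable_categorical=True,
                              random_state=123)
# xgb_native.fit(X_cat, y_cat) could then replace the one-hot pipeline used below.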

Splitting Dataset into Train and Test Set

After preprocessing the dataset, we split it into training and testing sets to evaluate how well the model generalizes to new, unseen data. This split ensures that the model does not simply memorize the training data, but instead learns patterns that will perform well on future, unknown data.

We divided the dataset into 80% training data and 20% testing data using the train_test_split function with test_size=0.2; the training set is used to build the model and the test set to evaluate its performance. We also set random_state=42 to make the split reproducible, ensuring that every time the code is run, the dataset is split in the same way.

# Update X and Y with the encoded data
X = df_encoded.drop(columns=['AnnualSalary'])
Y = df_encoded['AnnualSalary']

# Split the dataset into 80% training and 20% testing data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Creating the Model

To create the model, we started by defining the parameter grid for tuning our XGBoost model. The initial parameters were adjusted during experimentation to better suit the data’s characteristics. Here’s the final grid:

# Define the parameter grid
param_grid = {
    'n_estimators': randint(200, 1000), # initial was randint(100, 500) ; updated was randint(200, 1000)
     'max_depth': randint(2,6), # initial was (3,10); updated to  (2,6)
     'learning_rate': [0.01, 0.05, 0.1], # initial was [0.001, 0.01, 0.1, 0.3]; updated to [0.01, 0.05, 0.1]
     'subsample': [0.6, 0.8, 1.0],
     'colsample_bytree': [0.6, 0.8, 1.0],
     'gamma': [0, 0.1, 0.3, 0.5],
     'min_child_weight': [10, 15, 20] # initial was [1, 5, 10]; updated to [10, 15, 20]
}

This grid allows us to explore combinations of parameters that control both overfitting and generalization. The adjustments made to the ranges reflect our goal of simplifying the model without sacrificing accuracy, particularly with the high dimensionality of the dataset.

Next, we used RandomizedSearchCV to optimize the hyperparameters and select the best-performing model based on cross-validation results.

# Instantiate the model and RandomizedSearchCV
xgb_r = xgb.XGBRegressor(objective='reg:squarederror', seed=123)
random_search = RandomizedSearchCV(estimator=xgb_r, 
                                   param_distributions=param_grid, 
                                   n_iter=50, 
                                   scoring='neg_mean_squared_error', 
                                   cv=5, 
                                   verbose=1, 
                                   random_state=42)

# Fit the random search model
random_search.fit(X_train, Y_train)
print(f"Best parameters: {random_search.best_params_}")
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters: {'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.05, 'max_depth': 5, 'min_child_weight': 15, 'n_estimators': 426, 'subsample': 0.8}

We used RandomizedSearchCV instead of GridSearchCV to save time, as it explores a random selection of hyperparameter combinations rather than exhaustively testing all possible options. The n_iter=50 argument limits the number of iterations, ensuring a good balance between performance and computational efficiency.

By tuning the model in this way, we can ensure that it performs optimally given the complexity and size of the Stack Overflow dataset.

The next step is to use the best model identified through RandomizedSearchCV to make predictions on the test data:

# Predict using the best model
best_model = random_search.best_estimator_
pred = best_model.predict(X_test)

This code retrieves the optimal XGBoost model from the RandomizedSearchCV process and uses it to generate predictions for the test set (X_test). The predictions (pred) can then be used to evaluate the model’s performance on unseen data.

Assessing the Model Performance

Metrics

To assess the performance of the model, we calculated two key metrics: Root Mean Square Error (RMSE) and R-squared (R²). These metrics provide insights into the model’s accuracy and goodness of fit.

# calculate RMSE
rmse = np.sqrt(mean_squared_error(Y_test, pred))
print(f"RMSE: {rmse}")

# Compute the R² score
r2 = r2_score(Y_test, pred)
print("R²: %f" % r2)
RMSE: 40752.117865209984
R²: 0.431343

After evaluating the model, we obtained the following performance metrics:

  • RMSE: 40,752.117865
  • R²: 0.431343

The Root Mean Square Error (RMSE) quantifies the average error between the predicted and actual values. In this case, an RMSE of 40,752.117865 indicates that, on average, the model’s predictions deviate from the actual values by this amount. Lower RMSE values suggest better model performance, making this a critical metric for assessing accuracy.

The R-squared (R²) score measures the proportion of variance in the target variable (Annual Salary) that can be explained by the features included in the model. With an R² value of 0.431343, approximately 43.13% of the variance in Annual Salary is accounted for by the model. While this indicates some explanatory power, a significant portion of the variance (roughly 56.87%) remains unexplained, suggesting potential areas for improvement in the model or the inclusion of additional features.

These metrics together provide a solid foundation for evaluating the model’s predictive capabilities, emphasizing its strengths while also identifying areas that may require further enhancement.
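
A complementary, easier-to-read error measure (not part of the evaluation above, just an additional check) is the mean absolute error, which reports the average deviation in dollars without squaring; a minimal sketch reusing Y_test and pred:

from sklearn.metrics import mean_absolute_error

# Average absolute deviation between predicted and actual salaries, in dollars
mae = mean_absolute_error(Y_test, pred)
print(f"MAE: {mae:,.2f}")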

To ensure that our model is not overfitting, we also compared the training and test metrics for RMSE and R²:

# Predictions on the training set
train_pred = best_model.predict(X_train)

# Predictions on the testing set
test_pred = best_model.predict(X_test)

# RMSE on the training set
train_rmse = np.sqrt(mean_squared_error(Y_train, train_pred))
print(f"Training RMSE: {train_rmse}")

# RMSE on the testing set
test_rmse = np.sqrt(mean_squared_error(Y_test, test_pred))
print(f"Testing RMSE: {test_rmse}")

# R² on the training set
train_r2 = r2_score(Y_train, train_pred)
print(f"Training R²: {train_r2}")

# R² on the testing set
test_r2 = r2_score(Y_test, test_pred)
print(f"Testing R²: {test_r2}")
Training RMSE: 37495.24039831062
Testing RMSE: 40752.117865209984
Training R²: 0.5184798109204363
Testing R²: 0.4313433781313698

By analyzing the RMSE and R² values for both training and test sets, we observed that they are relatively close to one another. This similarity indicates that the model generalizes well and is not overfitting, performing similarly across both datasets.
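
As an optional further check on generalization beyond a single train/test split, the tuned estimator could also be scored with k-fold cross-validation; a minimal sketch, assuming the encoded X and Y defined in the splitting step (note this refits the model five times, so it is comparatively slow):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated RMSE for the tuned model on the full encoded data
cv_rmse = -cross_val_score(best_model, X, Y, cv=5,
                           scoring='neg_root_mean_squared_error')
print(f"CV RMSE per fold: {np.round(cv_rmse, 2)}")
print(f"Mean CV RMSE: {cv_rmse.mean():.2f}")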

Feature Importance Plot

The plot below illustrates the importance of each feature in relation to the model’s predictions. We experimented with various combinations of features based on the insights gained from this plot. However, our analysis revealed that the optimal model performance was achieved by including all available features from the dataset. This suggests that each feature adds valuable information, and together they are important for enhancing the model’s accuracy and strength.

# Plot feature importance
xgb.plot_importance(best_model, importance_type='weight', max_num_features=20)
plt.title('Top Feature Importance')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Close the figure to prevent auto-rendering

# Reset buffer position and convert the image to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate the HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Top Feature Importance" />'

# Display the image
display(HTML(img_html))
Top Feature Importance
Correlation

We examined the correlation between the features in the training set to identify any relationships that could potentially influence the model’s performance. If two features are correlated, it may indicate redundancy, which can lead to issues such as multicollinearity. This situation can negatively affect the model’s interpretability and stability, as the model may struggle to distinguish the individual contributions of each correlated feature. Therefore, understanding feature correlations is crucial for feature selection and ensuring the model is both efficient and reliable.

corr_matrix = df_encoded.drop(columns=['AnnualSalary']).corr()
sns.heatmap(corr_matrix, annot=True)

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Close the figure to prevent auto-rendering

# Reset buffer position and convert the image to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate the HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Correlation Matrix Heatmap" />'

# Display the image
display(HTML(img_html))
Correlation Matrix Heatmap

Upon analyzing the correlation between features, we discovered that the YearsCodeProAvg and AgeAvg features exhibited a high degree of correlation. Given that YearsCodeProAvg was indicated as a more important feature in the feature importance plot, we initially considered removing AgeAvg from the model. However, after running experiments, we observed that the model’s performance actually improved when both features were included. This finding suggests that while these features are correlated, they each contribute unique information that enhances the model’s predictive capabilities. Consequently, we decided to retain both YearsCodeProAvg and AgeAvg in the final model to leverage their combined influence on the predictions.
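
The experiment described above could be reproduced with a sketch along these lines (hedged: it refits with the already-tuned parameter settings rather than re-running the full hyperparameter search, which is an assumption about how the comparison was carried out):

# Compare test RMSE with and without the AgeAvg column
X_train_noage = X_train.drop(columns=['AgeAvg'])
X_test_noage = X_test.drop(columns=['AgeAvg'])

xgb_noage = xgb.XGBRegressor(objective='reg:squarederror', seed=123,
                             **random_search.best_params_)
xgb_noage.fit(X_train_noage, Y_train)

rmse_noage = np.sqrt(mean_squared_error(Y_test, xgb_noage.predict(X_test_noage)))
print(f"Test RMSE without AgeAvg: {rmse_noage:.2f} (with AgeAvg: {test_rmse:.2f})")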

Outliers

To thoroughly evaluate the model’s performance, we conducted experiments both with and without outliers. Our findings revealed that excluding the outliers significantly enhanced the model’s predictive accuracy and stability, leading us to decide to omit them from the final dataset used for modeling.

When we ran the model with the outliers included, keeping all else constant, the Root Mean Square Error (RMSE) was 768,195.66 and the R-squared value was 0.007583. These metrics indicated that the model was poorly fitted to the data. As a result, we applied the IQR method to handle the outliers, as described in the Exploratory Data Analysis (EDA) section, which led to a substantial improvement in model performance (a sketch of this comparison follows below).
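
A rough sketch of that with-outliers run (again an assumption: it reuses the tuned parameters and the same encoding and split logic rather than a separate search, and drops rows with a missing target before fitting):

# Refit on the unfiltered US data (outliers retained) for comparison
df_enc_all = pd.get_dummies(df_new.dropna(subset=['AnnualSalary']), drop_first=True)
bool_cols_all = df_enc_all.select_dtypes(include=['bool']).columns
df_enc_all[bool_cols_all] = df_enc_all[bool_cols_all].astype(int)

X_all = df_enc_all.drop(columns=['AnnualSalary'])
y_all = df_enc_all['AnnualSalary']
Xtr, Xte, ytr, yte = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

xgb_all = xgb.XGBRegressor(objective='reg:squarederror', seed=123,
                           **random_search.best_params_)
xgb_all.fit(Xtr, ytr)
pred_all = xgb_all.predict(Xte)
print(f"RMSE with outliers: {np.sqrt(mean_squared_error(yte, pred_all)):.2f}")
print(f"R² with outliers: {r2_score(yte, pred_all):.6f}")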

Assessing the Model Visually
Residuals vs Predicted Values Plot

To understand how well our model performs, we plotted the residuals (the differences between actual and predicted values) against the predicted values. This helps us identify any patterns or biases in the model’s predictions.

Ideally, residuals should be randomly scattered around 0 without any noticeable pattern, indicating that the model is accurately capturing the underlying relationships in the data. In the plot below, the residuals appear to be randomly distributed around 0, which suggests that the model’s predictions are not biased in any direction. This randomness supports the validity of the model’s performance and suggests that it is effectively fitting the data.

residuals = Y_test - pred  # Calculate residuals

# create data frame for plotting
residuals_df = pd.DataFrame({
    'Actual': Y_test,
    'Predicted': pred,
    'Residuals': residuals
})

# Generate the plot
fig, ax = plt.subplots(figsize=(12, 6))
sns.scatterplot(x='Predicted', y='Residuals', data=residuals_df, ax=ax)
ax.axhline(0, color='red', linestyle='--')
ax.set_title('Residuals vs Predicted Values')
ax.set_xlabel('Predicted Values')
ax.set_ylabel('Residuals')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close(fig)  # Close the figure to prevent auto-rendering

# Reset buffer position and convert to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Residuals vs Predicted Values" />'

# Display the image
display(HTML(img_html))
Residuals vs Predicted Values
Distribution of Residuals

Next, we examined the distribution of the residuals using a histogram. This plot provides insight into how the residuals are spread, helping us assess the model’s accuracy.

The plot below shows a mostly bell-shaped distribution, with a slight tail to the right. While this deviation is noticeable, it is not severe enough to raise major concerns about the model’s validity. Overall, the residuals suggest that the model performs reasonably well.

# Histogram of Residuals
fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(residuals, bins=30, kde=True, ax=ax)
ax.set_title('Histogram of Residuals')
ax.set_xlabel('Residuals')
ax.set_ylabel('Frequency')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close(fig)  # Close the figure to prevent auto-rendering

# Reset buffer position and convert to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Histogram of Residuals" />'

# Display the image
display(HTML(img_html))
Histogram of Residuals
Q-Q Plot of Residuals

Finally, we created a Q-Q (Quantile-Quantile) plot to check the normality of the residuals. The red line represents a normal distribution. The closer the data points are to this line, the more they align with a normal distribution.

In the plot below, we observe that most of the residuals closely follow the normal line, indicating that they are approximately normally distributed. This provides a strong indication that the residuals meet the assumption of normality, lending further credibility to the model’s validity.

import statsmodels.api as sm
# Q-Q Plot
fig = sm.qqplot(residuals, line='s')
plt.title('Q-Q Plot of Residuals')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close(fig)  # Close the figure to prevent auto-rendering

# Reset buffer position and convert to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Q-Q Plot of Residuals" />'

# Display the image
display(HTML(img_html))
Q-Q Plot of Residuals
Summary

Overall, the visualizations of the model through the residuals vs. predicted values plot, histogram, and Q-Q plot provide a comprehensive evaluation of its performance. The randomness of residuals, the bell-shaped distribution, and the normality indicated in the Q-Q plot collectively suggest that the model is well-fitted to the data, meeting key assumptions of regression analysis. These findings reinforce the model’s reliability and predictive capability, further suggesting that we utilized the appropriate model type, XGBoost.

Conclusion

In conclusion, our final model accounts for approximately 43.13% of the variability in annual salary, and we've demonstrated that XGBoost was a justified model choice for this problem. However, there is still room for improvement. The results suggest that the features included in the model contribute valuable insights into salary, but it is essential to recognize that there are likely other significant factors influencing salary that were not captured in this dataset. Variables such as industry, state, and market demand may all play crucial roles in salary determination. The absence of these features could have limited the model's overall predictive capability.

Future work could involve exploring additional features, experimenting with more sophisticated modeling techniques, or implementing more advanced hyperparameter tuning to further refine the model’s predictive capabilities. Incorporating external data sources to address omitted variables may also enhance the model’s robustness and accuracy.

Comparison of Data Sets

When comparing the two datasets (Zip Recruiter and Stack Overflow), we observe that a smaller dataset with fewer features often leads to a predictive model that can more accurately predict the target variable. This is primarily due to the reduced complexity and lower likelihood of overfitting. In contrast, larger and more complex datasets introduce challenges that can hinder predictive accuracy.

In smaller datasets, the relationships between features and the target variable are typically clearer, allowing simpler models to generalize better to unseen data. However, as dataset size and complexity increase, the potential for noise and irrelevant information also rises, complicating the modeling process. Consequently, achieving high predictive accuracy becomes more difficult as the number of features grows and the dataset becomes more intricate.

Overall, while larger datasets can provide more information, they require careful handling to ensure that the additional complexity does not undermine model performance. Finding the right balance between dataset size and model complexity is essential for achieving accurate predictions.

Sources