import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import pandas as pd
import io
import base64
from mpl_toolkits.axes_grid1 import make_axes_locatable
import matplotlib.ticker as mticker
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import xgboost as xgb
from scipy.stats import randint
from lussi.ziprecruiter import *
from lussi.stackoverflow import *
= "622data_nogit"
data_dir = load_zip(data_dir = data_dir)
ziprecruiter = load_stack(data_dir=data_dir, stack_type=StackType.WIDE) wide_stack
Project 1: Pay and our industry
Data Sets
This project analyzes two datasets: one from ZipRecruiter and the other from Stack Overflow.
• ZipRecruiter Dataset: A heavily curated, clean, and well-structured dataset. It offers a tight focus, making it easy to work with and delivering highly accurate predictions.
• Stack Overflow Dataset: Several years of self-reported survey data, providing valuable insights but requiring more effort to clean and analyze. As we’ll see, while the ZipRecruiter dataset enables near-perfect predictions, the Stack Overflow data presents more challenges due to variability and inconsistencies.
Data Set 1: Zip Recruiter
First, let’s view a sample of the Zip Recruiter dataset:
ziprecruiter.sample(n=10, random_state=42).head(10)
 | State | Annual Salary | Monthly Pay | Weekly Pay | Hourly Wage | Job Title | Abbreviation | Salary Tier
---|---|---|---|---|---|---|---|---
157 | Alaska | 129821 | 10818 | 2496 | 62.41 | Machine Learning Engineer | AK | 120K-140K |
341 | Arkansas | 73611 | 6134 | 1415 | 35.39 | Statistician | AR | 60K-80K |
315 | Maryland | 84798 | 7066 | 1630 | 40.77 | Statistician | MD | 80K-100K |
234 | Wyoming | 117710 | 9809 | 2263 | 56.59 | Quantitative Analyst | WY | 110K-120K |
155 | New Jersey | 130704 | 10892 | 2513 | 62.84 | Machine Learning Engineer | NJ | 120K-140K |
274 | South Dakota | 119548 | 9962 | 2299 | 57.48 | Big Data Engineer | SD | 110K-120K |
304 | Hawaii | 97697 | 8141 | 1878 | 46.97 | Statistician | HI | 80K-100K |
227 | South Carolina | 121214 | 10101 | 2331 | 58.28 | Quantitative Analyst | SC | 120K-140K |
278 | Rhode Island | 117210 | 9767 | 2254 | 56.35 | Big Data Engineer | RI | 110K-120K |
185 | Maine | 112395 | 9366 | 2161 | 54.04 | Machine Learning Engineer | ME | 110K-120K |
# Note that because we are publishing to the web, we will be base64 encoding our
# images directly into this web page.
df = ziprecruiter  # so we don't overwrite anything!
plt.figure(figsize=(8, 5))
sns.boxplot(y='Job Title', x='Annual Salary', data=df, orient='h')
plt.title('Salary by title within this data set')
plt.ylabel('')  # Remove the label on the y-axis
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Prevents Quarto from auto-rendering the plot.
img_buf.seek(0)  # Reset the buffer
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Salary by title" />'
# And render. There is no cached image!
from IPython.display import display, HTML
display(HTML(img_html))
Overview of the Dataset
This dataset was scraped from ZipRecruiter using Selenium and Chromedriver. We collected several pages containing salary information by job title and state, such as a page for data engineers. Additionally, we added state abbreviations and a salary tier column to provide more granular analysis.
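For reference, below is a minimal sketch of the kind of scrape this involved, assuming Selenium with Chromedriver; the URL and selector are illustrative placeholders rather than the exact pages and markup we targeted.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Illustrative URL and selector only -- the real pages and CSS classes differed.
url = "https://www.ziprecruiter.com/Salaries/What-Is-the-Average-Data-Engineer-Salary-by-State"

driver = webdriver.Chrome()  # requires Chromedriver to be installed
driver.get(url)

# Pull every table row's text; parsing into columns happens later in pandas.
rows = driver.find_elements(By.CSS_SELECTOR, "table tr")
records = [row.text for row in rows]

driver.quit()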
Though relatively small, the dataset is clean, standardized, and well-suited for both analysis and prediction. Its structure makes it an excellent candidate for regression-based algorithms such as Linear Regression or Random Forest, allowing us to predict salaries based on key features like job title and state.
Model Evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# One-hot encode categorical state and job title
df = pd.get_dummies(ziprecruiter, columns=['State', 'Job Title'], drop_first=True)

# Split the data into features (X) and target (y)
X = df.drop(['Annual Salary', 'Abbreviation', 'Salary Tier'], axis=1)
y = df['Annual Salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f'Mean Squared Error: {mse:.2f}')
print(f'R² Score: {r2:.5f}')
Mean Squared Error: 10.70
R² Score: 1.00000
The linear regression model performed exceptionally well:
• Mean squared error: 10.70. This low MSE indicates that the model’s predictions are highly accurate and closely align with the actual values.
• R² score: 0.99999997. An R² score near 1 suggests that the model explains nearly all the variability in annual salaries.
Conclusion
This strong linear relationship between the input features (e.g., job title and state) and the target variable (annual salary) demonstrates the suitability of linear regression for this dataset. Its ability to make accurate predictions with minimal error makes it highly useful for exploring salary trends across different states and roles. This can help job seekers and recruiters alike in setting appropriate salary expectations based on job title and location. Additionally, the clean dataset offers opportunities for further analysis with more advanced models, such as Random Forest or Gradient Boosting.
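As a quick illustration of that follow-up idea, here is a minimal sketch of fitting a Random Forest on the same one-hot-encoded features and train/test split used above; it is untuned and shown only to indicate how such a comparison would start.

from sklearn.ensemble import RandomForestRegressor

# Reuse X_train, X_test, y_train, y_test from the linear regression above.
rf_model = RandomForestRegressor(n_estimators=200, random_state=42)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
print(f'Random Forest R² Score: {r2_score(y_test, rf_pred):.5f}')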
But, nobody cares about Zip Recruiter. That’s where newbies get their jobs.
Data Set 2: Stack Overflow
Overview of the Dataset
Now let’s look at the wide Stack Overflow dataset (be sure to scroll to the right):
wide_stack.sample(n=10, random_state=42).head(10)
 | Year | OrgSize | Country | Employment | Gender | EdLevel | US_State | Age | DevType | Sexuality | ... | mariadb | microsoftazure | googlecloud | ibmcloudorwatson | kubernetes | linux | windows | sexuality_grouped | ethnicity_grouped | aws
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
371801 | 2021 | 2 to 9 employees | Belgium | Full-Time | Male | Bachelors | NaN | 18-24 years old | Developer, front-end;Developer, desktop or ent... | Straight / Heterosexual | ... | yes | yes | yes | no | no | no | no | straight | non-minority | yes |
312400 | 2021 | 1,000 to 4,999 employees | Australia | Full-Time | Male | Masters | NaN | 35-44 years old | Data or business analyst | Straight / Heterosexual | ... | no | yes | no | no | no | no | no | straight | minority | no |
351053 | 2021 | 2 to 9 employees | United States | Self-Employed | Male | Bachelors | California | 45-54 years old | Developer, full-stack | Straight / Heterosexual | ... | yes | no | no | no | no | no | no | straight | non-minority | yes |
2519 | 2017 | 100 to 499 employees | United States | Full-Time | Male | Masters | NaN | NaN | Developer with a statistics or mathematics bac... | NaN | ... | no | no | no | no | no | no | no | NaN | non-minority | no |
321665 | 2021 | 1,000 to 4,999 employees | Estonia | Part-Time | Male | Bachelors | NaN | 18-24 years old | Developer, full-stack | Straight / Heterosexual | ... | no | no | no | no | no | no | no | straight | non-minority | no |
315461 | 2021 | 1,000 to 4,999 employees | Germany | Full-Time | Male | Bachelors | NaN | 35-44 years old | Developer, full-stack | Straight / Heterosexual | ... | no | no | no | no | no | no | no | straight | minority | yes |
99267 | 2018 | Fewer than 10 employees | Poland | Part-Time | Male | Some College | NaN | 18 - 24 years old | Full-stack developer | Straight or heterosexual | ... | no | no | no | no | no | no | no | straight | non-minority | no |
197349 | 2019 | 20 to 99 employees | Netherlands | Full-Time | Male | Bachelors | NaN | 22.0 | Developer, back-end;Developer, front-end;Devel... | Straight / Heterosexual | ... | no | no | no | no | no | no | no | straight | non-minority | no |
293118 | 2020 | 1,000 to 4,999 employees | China | Full-Time | Male | Masters | NaN | 28.0 | Developer, back-end;Developer, desktop or ente... | Bisexual | ... | no | no | no | no | no | no | yes | lgbtq | NaN | no |
528679 | 2023 | 100 to 499 employees | Morocco | Full-Time | NaN | Masters | NaN | 25-34 years old | Developer, full-stack | NaN | ... | yes | yes | yes | no | no | no | no | NaN | minority | yes |
10 rows × 49 columns
This dataset originates from the Stack Overflow user-entered survey data, covering responses from 2017 through 2023. We uploaded the raw survey data to an S3 bucket and processed it extensively to extract core columns that we believe could predict key values, particularly annual salary.
This wide dataset contains over 500,000 records and is much more complex, with numerous categorical variables and some missing data. The complexity makes it well-suited for advanced machine learning algorithms like XGBoost or Random Forest, which can efficiently handle high-dimensional data and missing entries. It offers opportunities to explore various questions, such as predicting salary based on skills and education or classifying job titles based on technical skillsets.
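As a rough sketch of that ingestion step (the bucket, key, and column names below are hypothetical placeholders, not our actual pipeline), each year’s raw survey file can be read straight from S3 with pandas:

import pandas as pd

# Hypothetical bucket/key for illustration; reading s3:// URLs requires the s3fs package.
raw_2021 = pd.read_csv("s3://example-bucket/stackoverflow/survey_results_public_2021.csv")

# Keep only a handful of core columns (illustrative names).
core_cols = ["Country", "Employment", "EdLevel", "ConvertedCompYearly"]
subset = raw_2021[core_cols]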
We decided to focus on records from the United States only so it is comparable to the Zip Recruiter analysis.
# filter for US only
us_data = wide_stack[wide_stack['Country'] == 'United States']
Exploratory Data Analysis (EDA)
Stack Overflow columns used percentage, by year
Before we dig any further, we have to point out that some data doesn’t exist in some years. For example, gender and sexual identity are no longer included in the survey as of 2023. The table below shows what percentage of each core column is populated in each survey year.
filtered_columns = wide_stack.loc[:, 'Year':'AgeAvg']

# Group and calculate the percentage of non-null values for the filtered columns
grouped_summary_filtered = filtered_columns.groupby("Year").agg(lambda x: (x.notnull().mean() * 100)).reset_index()

# Round the result to one decimal place
grouped_summary_filtered_rounded = grouped_summary_filtered.round(1)

# Display the grouped summary
from IPython.display import display
display(grouped_summary_filtered_rounded)
 | Year | OrgSize | Country | Employment | Gender | EdLevel | US_State | Age | DevType | Sexuality | Ethnicity | DatabaseWorkedWith | LanguageWorkedWith | PlatformWorkedWith | YearsCodePro | AnnualSalary | YearsCodeProAvg | OrgSizeAvg | AgeAvg
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2017 | 75.7 | 100.0 | 100.0 | 68.2 | 100.0 | 0.0 | 0.0 | 70.3 | 0.0 | 64.3 | 57.3 | 71.3 | 56.7 | 99.5 | 25.1 | 96.6 | 72.7 | 0.0 |
1 | 2018 | 72.4 | 99.6 | 96.4 | 65.2 | 95.8 | 0.0 | 65.3 | 93.2 | 60.5 | 58.1 | 67.0 | 79.2 | 66.8 | 94.9 | 48.3 | 94.9 | 72.4 | 65.3 |
2 | 2019 | 80.8 | 99.9 | 98.1 | 96.1 | 97.2 | 0.0 | 89.1 | 91.5 | 85.7 | 86.3 | 85.5 | 98.5 | 90.8 | 83.6 | 0.0 | 83.6 | 75.9 | 89.1 |
3 | 2020 | 68.8 | 99.4 | 99.1 | 78.4 | 89.1 | 0.0 | 70.5 | 76.6 | 68.2 | 71.3 | 76.8 | 89.0 | 83.5 | 71.9 | 0.0 | 71.9 | 65.4 | 70.5 |
4 | 2021 | 72.8 | 100.0 | 99.9 | 98.6 | 99.6 | 17.9 | 98.8 | 79.7 | 87.9 | 95.2 | 83.3 | 98.7 | 62.5 | 73.4 | 56.1 | 73.4 | 66.3 | 98.1 |
5 | 2022 | 69.7 | 98.0 | 97.9 | 96.7 | 97.7 | 0.0 | 96.8 | 83.7 | 90.9 | 94.8 | 82.1 | 96.9 | 68.1 | 70.7 | 52.0 | 70.7 | 64.5 | 96.1 |
6 | 2023 | 72.9 | 98.6 | 98.6 | 0.0 | 98.6 | 0.0 | 100.0 | 86.2 | 0.0 | 0.0 | 82.3 | 97.7 | 71.3 | 74.2 | 53.8 | 74.2 | 66.8 | 99.5 |
Features
With a clear understanding of the dataset’s structure established, we then turned our attention to examining the features within the data. This analysis will help us identify the characteristics that could influence our model’s performance and guide our feature selection process.
def group_gender(gender):
    gender = str(gender).lower()  # convert to lowercase
    if 'female' in gender or 'woman' in gender:
        return 'Female'
    elif 'male' in gender or 'man' in gender:
        return 'Male'
    else:
        return 'Other'

us_data['gender_grouped'] = us_data['Gender'].apply(group_gender)
# Columns to keep: all features except DevType (too loosely self-reported to be much use)
# and US_State (only the 2021 survey had this info)
cols_to_keep = ['Year', 'Employment', 'EdLevel', 'AnnualSalary',
                'YearsCodeProAvg', 'OrgSizeAvg', 'AgeAvg', 'python', 'sql', 'java',
                'javascript', 'ruby', 'php', 'c', 'swift', 'scala', 'r', 'rust',
                'julia', 'mysql', 'microsoftsqlserver', 'mongodb', 'postgresql',
                'oracle', 'ibmdb2', 'redis', 'sqlite', 'mariadb', 'microsoftazure',
                'googlecloud', 'ibmcloudorwatson', 'kubernetes', 'linux', 'windows',
                'sexuality_grouped', 'ethnicity_grouped', 'aws', 'gender_grouped']

# New dataframe with the selected columns
df_new = us_data[cols_to_keep].copy()

# Print datatypes of columns
print(df_new.dtypes)
Year int64
Employment object
EdLevel object
AnnualSalary float64
YearsCodeProAvg float64
OrgSizeAvg float64
AgeAvg float64
python object
sql object
java object
javascript object
ruby object
php object
c object
swift object
scala object
r object
rust object
julia object
mysql object
microsoftsqlserver object
mongodb object
postgresql object
oracle object
ibmdb2 object
redis object
sqlite object
mariadb object
microsoftazure object
googlecloud object
ibmcloudorwatson object
kubernetes object
linux object
windows object
sexuality_grouped object
ethnicity_grouped object
aws object
gender_grouped object
dtype: object
We determined that the majority of the features are categorical, with a few float features (YearsCodeProAvg, OrgSizeAvg, and AgeAvg). Based on this information, we determined that we needed a model that can handle categorical variables natively.
Target Variable
Our data includes labels (the target variable), which affected our choice of algorithm. Since AnnualSalary is a continuous variable, we are working on a regression problem. This narrows down our algorithm choices to regression models such as:
- Linear Regression
- XGBoost
- Random Forest Regressor
- Support Vector Regressor
Given our exploratory analysis of the features, we narrowed down our model choice to XGBoost and Random Forest.
Below is the distribution of the target variable (AnnualSalary).
annualsalary = df_new['AnnualSalary'].dropna()

plt.figure(figsize=(10, 5))
plt.boxplot(annualsalary)
plt.title('Boxplot of Target Variable (AnnualSalary)')
plt.ylabel('Target Variable Values')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Prevent Quarto from auto-rendering the plot

# Reset buffer position and convert the image to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate the HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Boxplot of Target Variable (AnnualSalary)" />'

# Display the image
from IPython.display import display, HTML
display(HTML(img_html))
As observed, there are outliers in the distribution. We chose to address these using the Interquartile Range (IQR) method because it is a robust technique for detecting and removing extreme values without making assumptions about the data distribution, unlike methods such as z-scores which assume normality. The IQR method is particularly effective when dealing with skewed data, which is common in salary distributions. By focusing on the middle 50% of the data (between the 25th and 75th percentiles), this approach minimizes the impact of extreme outliers that could distort model performance.
Further details on the impact of removing outliers will be discussed in the model section, where we describe the performance of the model with and without these outliers.
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df_new['AnnualSalary'].quantile(0.25)
Q3 = df_new['AnnualSalary'].quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers from the dataframe
df_filtered = df_new[(df_new['AnnualSalary'] >= lower_bound) & (df_new['AnnualSalary'] <= upper_bound)]
Here is the distribution post-IQR.
plt.figure(figsize=(10, 5))
plt.boxplot(df_filtered['AnnualSalary'])
plt.title('Boxplot of Target Variable (AnnualSalary)')
plt.ylabel('Target Variable Values')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Prevent Quarto from auto-rendering the plot

# Reset buffer position and convert the image to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate the HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Boxplot of Target Variable (AnnualSalary) Post-IQR" />'

# Display the image
from IPython.display import display, HTML
display(HTML(img_html))
Model
For the model, we decided to use XGBoost for the following reasons:
• Large dataset: XGBoost is highly efficient when working with large datasets.
• Mixed feature types: The dataset contains both categorical and numeric features, and XGBoost can natively handle this mixture without requiring extensive pre-processing.
• Missing values: XGBoost can effectively handle missing values by internally learning the best direction to take when encountering them, making it a natural fit for our dataset with some incomplete entries.
We chose not to use Random Forest primarily because it tends to perform better on smaller and less complex datasets. Given the size and intricacy of the Stack Overflow dataset, XGBoost was a better fit due to its scalability and ability to handle high-dimensional, complex data.
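As a small illustration of the missing-value point (a toy example, not our actual pipeline), XGBoost accepts NaN entries directly and learns a default branch direction for them:

import numpy as np
import xgboost as xgb

# Toy data with a missing entry; XGBoost treats NaN as "missing" by default.
X_toy = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

toy_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10)
toy_model.fit(X_toy, y_toy)  # no imputation step needed
print(toy_model.predict(X_toy))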
Pre-Processing
Even though XGBoost handles mixed feature types gracefully, its inputs still need to be numeric, so the categorical variables must be encoded. To prepare them for the model, we opted to use one-hot encoding.
In the code below, we used the pd.get_dummies function to convert the categorical variables into binary (boolean) columns. These columns represent each category as True or False. To make this data compatible with the model, we then converted the boolean values into integers, assigning True as 1 and False as 0.
# One-hot encode categorical variables
df_encoded = pd.get_dummies(df_filtered, drop_first=True)

# Identify all boolean columns
bool_cols = df_encoded.select_dtypes(include=['bool']).columns

# Convert boolean columns to integers (True -> 1, False -> 0)
df_encoded[bool_cols] = df_encoded[bool_cols].astype(int)
Splitting Dataset into Train and Test Set
After preprocessing the dataset, we split it into training and testing sets to evaluate how well the model generalizes to new, unseen data. This split ensures that the model does not simply memorize the training data, but instead learns patterns that will perform well on future, unknown data.
We divided the dataset into 80% training data and 20% testing data using the train_test_split function with test_size=0.2: the training set is used to build the model, and the test set is used to evaluate its performance. We also set random_state=42 to make the split reproducible, ensuring that the dataset is split the same way every time the code is run.
# Update X and Y with the encoded data
X = df_encoded.drop(columns=['AnnualSalary'])
Y = df_encoded['AnnualSalary']

# Split the dataset into 80% training and 20% testing data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
Creating the Model
To create the model, we started by defining the parameter grid for tuning our XGBoost model. The initial parameters were adjusted during experimentation to better suit the data’s characteristics. Here’s the final grid:
# Define the parameter grid
param_grid = {
    'n_estimators': randint(200, 1000),   # initial was randint(100, 500); updated to randint(200, 1000)
    'max_depth': randint(2, 6),           # initial was (3, 10); updated to (2, 6)
    'learning_rate': [0.01, 0.05, 0.1],   # initial was [0.001, 0.01, 0.1, 0.3]; updated to [0.01, 0.05, 0.1]
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.3, 0.5],
    'min_child_weight': [10, 15, 20]      # initial was [1, 5, 10]; updated to [10, 15, 20]
}
This grid allows us to explore combinations of parameters that control both overfitting and generalization. The adjustments made to the ranges reflect our goal of simplifying the model without sacrificing accuracy, particularly with the high dimensionality of the dataset.
Next, we used RandomizedSearchCV to optimize the hyperparameters and select the best-performing model based on cross-validation results.
# Instantiate the model and RandomizedSearchCV
xgb_r = xgb.XGBRegressor(objective='reg:squarederror', seed=123)
random_search = RandomizedSearchCV(estimator=xgb_r,
                                   param_distributions=param_grid,
                                   n_iter=50,
                                   scoring='neg_mean_squared_error',
                                   cv=5,
                                   verbose=1,
                                   random_state=42)

# Fit the random search model
random_search.fit(X_train, Y_train)
print(f"Best parameters: {random_search.best_params_}")
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters: {'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.05, 'max_depth': 5, 'min_child_weight': 15, 'n_estimators': 426, 'subsample': 0.8}
We used RandomizedSearchCV instead of GridSearchCV to save time, as it explores a random selection of hyperparameter combinations rather than exhaustively testing all possible options. The n_iter=50 argument limits the number of iterations, ensuring a good balance between performance and computational efficiency.
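To make the time savings concrete, here is a sketch of what the exhaustive alternative would look like on a discretized version of the grid (the values below are illustrative, since GridSearchCV cannot sample from randint distributions):

from sklearn.model_selection import GridSearchCV, ParameterGrid

# An illustrative discretized grid for comparison purposes only.
small_grid = {
    'n_estimators': [200, 400, 600, 800],
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.3, 0.5],
    'min_child_weight': [10, 15, 20]
}

# 4 * 4 * 3 * 3 * 3 * 4 * 3 = 5,184 combinations, versus the 50 sampled by RandomizedSearchCV.
print(len(ParameterGrid(small_grid)))

# With cv=5, GridSearchCV would run 5,184 * 5 = 25,920 fits:
# grid_search = GridSearchCV(estimator=xgb_r, param_grid=small_grid,
#                            scoring='neg_mean_squared_error', cv=5)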
By tuning the model in this way, we can ensure that it performs optimally given the complexity and size of the Stack Overflow dataset.
The next step is to use the best model identified through RandomizedSearchCV to make predictions on the test data:
# Predict using the best model
best_model = random_search.best_estimator_
pred = best_model.predict(X_test)
This code retrieves the optimal XGBoost model from the RandomizedSearchCV process and uses it to generate predictions for the test set (X_test). The predictions (pred) can then be used to evaluate the model’s performance on unseen data.
Assessing the Model Performance
Metrics
To assess the performance of the model, we calculated two key metrics: Root Mean Square Error (RMSE) and R-squared (R²). These metrics provide insights into the model’s accuracy and goodness of fit.
import numpy as np

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(Y_test, pred))
print(f"RMSE: {rmse}")

# Compute the R² score
r2 = r2_score(Y_test, pred)
print("R²: %f" % r2)
RMSE: 40752.117865209984
R²: 0.431343
After evaluating the model, we obtained the following performance metrics:
- RMSE: 40,752.117865
- R²: 0.431343
The Root Mean Square Error (RMSE) quantifies the average error between the predicted and actual values. In this case, an RMSE of 40,752.117865 indicates that, on average, the model’s predictions deviate from the actual values by this amount. Lower RMSE values suggest better model performance, making this a critical metric for assessing accuracy.
The R-squared (R²) score measures the proportion of variance in the target variable (Annual Salary) that can be explained by the features included in the model. With an R² value of 0.431343, approximately 43.13% of the variance in Annual Salary is accounted for by the model. While this indicates some explanatory power, a substantial portion of the variance (roughly 56.87%) remains unexplained, suggesting potential areas for improvement in the model or the inclusion of additional features.
These metrics together provide a solid foundation for evaluating the model’s predictive capabilities, emphasizing its strengths while also identifying areas that may require further enhancement.
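For completeness, both metrics reduce to a few lines of NumPy; this sketch, using the Y_test and pred arrays from above, should reproduce the scikit-learn values:

import numpy as np

errors = np.asarray(Y_test) - pred

# RMSE: square root of the mean squared error
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus the ratio of the residual sum of squares to the total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((np.asarray(Y_test) - np.mean(Y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE (manual): {rmse_manual}")
print(f"R² (manual): {r2_manual}")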
To ensure that our model is not overfitting, we also compared the training and test metrics for RMSE and R²:
# Predictions on the training set
train_pred = best_model.predict(X_train)

# Predictions on the testing set
test_pred = best_model.predict(X_test)

# RMSE on the training set
train_rmse = np.sqrt(mean_squared_error(Y_train, train_pred))
print(f"Training RMSE: {train_rmse}")

# RMSE on the testing set
test_rmse = np.sqrt(mean_squared_error(Y_test, test_pred))
print(f"Testing RMSE: {test_rmse}")

# R² on the training set
train_r2 = r2_score(Y_train, train_pred)
print(f"Training R²: {train_r2}")

# R² on the testing set
test_r2 = r2_score(Y_test, test_pred)
print(f"Testing R²: {test_r2}")
Training RMSE: 37495.24039831062
Testing RMSE: 40752.117865209984
Training R²: 0.5184798109204363
Testing R²: 0.4313433781313698
By analyzing the RMSE and R² values for both training and test sets, we observed that they are relatively close to one another. This similarity indicates that the model generalizes well and is not overfitting, performing similarly across both datasets.
Feature Importance Plot
The plot below illustrates the importance of each feature in relation to the model’s predictions. We experimented with various combinations of features based on the insights gained from this plot. However, our analysis revealed that the optimal model performance was achieved by including all available features from the dataset. This suggests that each feature adds valuable information, and together they are important for enhancing the model’s accuracy and strength.
# Plot feature importance
xgb.plot_importance(best_model, importance_type='weight', max_num_features=20)
plt.title('Top Feature Importance')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Close the figure to prevent auto-rendering

# Reset buffer position and convert the image to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate the HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Top Feature Importance" />'

# Display the image
display(HTML(img_html))
Correlation
We examined the correlation between the features in the training set to identify any relationships that could potentially influence the model’s performance. If two features are correlated, it may indicate redundancy, which can lead to issues such as multicollinearity. This situation can negatively affect the model’s interpretability and stability, as the model may struggle to distinguish the individual contributions of each correlated feature. Therefore, understanding feature correlations is crucial for feature selection and ensuring the model is both efficient and reliable.
# Correlation matrix of the encoded features
corr_matrix = df_encoded.drop(columns=['AnnualSalary']).corr()
sns.heatmap(corr_matrix, annot=True)

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close()  # Close the figure to prevent auto-rendering

# Reset buffer position and convert the image to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate the HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Correlation Matrix Heatmap" />'

# Display the image
display(HTML(img_html))
Upon analyzing the correlation between features, we discovered that the YearsCodeProAvg and AgeAvg features exhibited a high degree of correlation. Given that YearsCodeProAvg was indicated as a more important feature in the feature importance plot, we initially considered removing AgeAvg from the model. However, after running experiments, we observed that the model’s performance actually improved when both features were included. This finding suggests that while these features are correlated, they each contribute unique information that enhances the model’s predictive capabilities. Consequently, we decided to retain both YearsCodeProAvg and AgeAvg in the final model to leverage their combined influence on the predictions.
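A minimal sketch of that experiment is below: check the pairwise correlation, then re-fit the tuned model without AgeAvg on the same split and compare the test R². (This simply reuses the best parameters found above; our full experiments also varied other feature subsets.)

# Pairwise correlation between the two features in question
print(df_encoded['YearsCodeProAvg'].corr(df_encoded['AgeAvg']))

# Re-fit the tuned model on the same split, minus AgeAvg
X_train_noage = X_train.drop(columns=['AgeAvg'])
X_test_noage = X_test.drop(columns=['AgeAvg'])

model_noage = xgb.XGBRegressor(**random_search.best_params_,
                               objective='reg:squarederror', seed=123)
model_noage.fit(X_train_noage, Y_train)
print(r2_score(Y_test, model_noage.predict(X_test_noage)))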
Outliers
To thoroughly evaluate the model’s performance, we conducted experiments both with and without outliers. Our findings revealed that excluding the outliers significantly enhanced the model’s predictive accuracy and stability, leading us to decide to omit them from the final dataset used for modeling.
When we ran the model with the outliers included, the Root Mean Square Error (RMSE) was 768,195.66 and the R-squared value was 0.007583, keeping all else constant. These metrics indicated that the model was poorly fitted to the data. As a result, we applied the IQR method to handle the outliers, as described in the Exploratory Data Analysis (EDA) section, which led to a substantial improvement in model performance.
Assessing the Model Visually
Residuals vs Predicted Values Plot
To understand how well our model performs, we plotted the residuals (the differences between actual and predicted values) against the predicted values. This helps us identify any patterns or biases in the model’s predictions.
Ideally, residuals should be randomly scattered around 0 without any noticeable pattern, indicating that the model is accurately capturing the underlying relationships in the data. In the plot below, the residuals appear to be randomly distributed around 0, which suggests that the model’s predictions are not biased in any direction. This randomness supports the validity of the model’s performance and suggests that it is effectively fitting the data.
residuals = Y_test - pred  # Calculate residuals

# Create data frame for plotting
residuals_df = pd.DataFrame({
    'Actual': Y_test,
    'Predicted': pred,
    'Residuals': residuals
})

# Generate the plot
fig, ax = plt.subplots(figsize=(12, 6))
sns.scatterplot(x='Predicted', y='Residuals', data=residuals_df, ax=ax)
ax.axhline(0, color='red', linestyle='--')
ax.set_title('Residuals vs Predicted Values')
ax.set_xlabel('Predicted Values')
ax.set_ylabel('Residuals')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close(fig)  # Close the figure to prevent auto-rendering

# Reset buffer position and convert to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Residuals vs Predicted Values" />'

# Display the image
display(HTML(img_html))
Distribution of Residuals
Next, we examined the distribution of the residuals using a histogram. This plot provides insight into how the residuals are spread, helping us assess the model’s accuracy.
The plot below shows a mostly bell-shaped distribution, with a slight tail to the right. While this deviation is noticeable, it is not severe enough to raise major concerns about the model’s validity. Overall, the residuals suggest that the model performs reasonably well.
# Histogram of Residuals
fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(residuals, bins=30, kde=True, ax=ax)
ax.set_title('Histogram of Residuals')
ax.set_xlabel('Residuals')
ax.set_ylabel('Frequency')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close(fig)  # Close the figure to prevent auto-rendering

# Reset buffer position and convert to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Histogram of Residuals" />'

# Display the image
display(HTML(img_html))
Q-Q Plot of Residuals
Finally, we created a Q-Q (Quantile-Quantile) plot to check the normality of the residuals. The red line represents a normal distribution. The closer the data points are to this line, the more they align with a normal distribution.
In the plot below, we observe that most of the residuals closely follow the normal line, indicating that they are approximately normally distributed. This provides a strong indication that the residuals meet the assumption of normality, lending further credibility to the model’s validity.
import statsmodels.api as sm

# Q-Q Plot
fig = sm.qqplot(residuals, line='s')
plt.title('Q-Q Plot of Residuals')

# Save the plot to a BytesIO buffer
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png', bbox_inches='tight')  # Save figure to buffer
plt.close(fig)  # Close the figure to prevent auto-rendering

# Reset buffer position and convert to Base64
img_buf.seek(0)
img_base64 = base64.b64encode(img_buf.read()).decode('utf-8')

# Generate HTML for embedding the image
img_html = f'<img src="data:image/png;base64,{img_base64}" alt="Q-Q Plot of Residuals" />'

# Display the image
display(HTML(img_html))
Summary
Overall, the visualizations of the model through the residuals vs. predicted values plot, histogram, and Q-Q plot provide a comprehensive evaluation of its performance. The randomness of residuals, the bell-shaped distribution, and the normality indicated in the Q-Q plot collectively suggest that the model is well-fitted to the data, meeting key assumptions of regression analysis. These findings reinforce the model’s reliability and predictive capability, further suggesting that we utilized the appropriate model type, XGBoost.
Conclusion
In conclusion, our final model accounts for approximately 43.13% of the variability in annual salary, and we’ve demonstrated that XGBoost was a justified model choice for this problem. However, there is still room for improvement. The results suggest that the features included in the model contribute valuable insights into salary, but it is essential to recognize that there are likely other significant factors influencing salary that were not captured in this dataset. Variables such as industry, state, and market demand may all play crucial roles in salary determination. The absence of these features could have limited the model’s overall predictive capability.
Future work could involve exploring additional features, experimenting with more sophisticated modeling techniques, or implementing more advanced hyperparameter tuning to further refine the model’s predictive capabilities. Incorporating external data sources to address omitted variables may also enhance the model’s robustness and accuracy.
Comparison of Data Sets
When comparing the two datasets (Zip Recruiter and Stack Overflow), we observe that a smaller dataset with fewer features often leads to a predictive model that can more accurately predict the target variable. This is primarily due to the reduced complexity and lower likelihood of overfitting. In contrast, larger and more complex datasets introduce challenges that can hinder predictive accuracy.
In smaller datasets, the relationships between features and the target variable are typically clearer, allowing simpler models to generalize better to unseen data. However, as dataset size and complexity increase, the potential for noise and irrelevant information also rises, complicating the modeling process. Consequently, achieving high predictive accuracy becomes more difficult as the number of features grows and the dataset becomes more intricate.
Overall, while larger datasets can provide more information, they require careful handling to ensure that the additional complexity does not undermine model performance. Finding the right balance between dataset size and model complexity is essential for achieving accurate predictions.