import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/ProgramData/Anaconda3/Library/plugins/platforms'

Introduction

In any work environment, employee performance, productivity, and satisfaction are important metrics that reflect on both a workplace success and atmosphere. These factors shape a workplace environment, influence employee retention, and contribute to the success of team. By understanding how these factors play out, organizations can identify areas of weakness and make plans of improvement. Additionally, these factors can help employers understand the elements that drive its employees which benefits both a company and workplace.

Dataset

The original dataset was obtained from Kaggle.com and contains 100,000 rows of data. Full dataset contains 20 columns capturing key info of employee details, work habits, employee satisfaction, retention, and performance, serving various purposes for analysis.

Findings

Employee Satisfaction

The following scatter plot visualization provides an in-depth look at the employee satisfaction across different age groups. Notably, it appears the the age distribution is relatively balanced across all age categories, with a slight difference for employees in the 20-24 and 60-64 age ranges. In terms of satisfaction, it seems as the most of the employees satisfaction lies around 3-4, indicated by its slightly larger and yellow markers. Interestingly, the satisfaction level at 5 has very little employees across all age ranges, indicating that complete satisfaction is rare within this work industry.

#Imports libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from matplotlib.ticker import FuncFormatter

warnings.filterwarnings("ignore")

path = "U:/Python/"
filename= path + "performance.csv"

#Reads only specific columns
df =pd.read_csv(filename,usecols= ['Age', 'Employee_Satisfaction_Score', 'Monthly_Salary', 'Department', 'Gender', 'Hire_Date', 'Team_Size',
'Projects_Handled', 'Sick_Days', 'Work_Hours_Per_Week', 'Years_At_Company', 'Training_Hours'] ) #Reads file of only selected columns


#Renames columns
df.rename(columns={'Employee_Satisfaction_Score': 'SatisfactionScore'}, inplace=True) # Renames column        

#Removes rows with NAs
df= df[ df['Age'].notna() & df['SatisfactionScore'].notna()]

df['SatisfactionScore']=df['SatisfactionScore'].astype('int') # Converts float to int

#Defines bins for ages
bins = [20,25,30,35, 40,45, 50,55, 60,65]  
labels= ['20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59','60-64']

#Categorizes 'Age' based off the bins
df['AgeGroup']=pd.cut(df['Age'], bins=bins,
                      labels= labels, right = False)

#Groups by 'AgeGroup' & 'SatisfactionScore and counts 
x= df.groupby(['AgeGroup', 'SatisfactionScore'])['SatisfactionScore'].count().reset_index(name='Count')
x=pd.DataFrame(x)    

x['count_hundreds'] =round(x['Count']/100,0)

#Creates scatter plot
plt.figure(figsize= (15,10))

plt.scatter(x['AgeGroup'], x['SatisfactionScore'], marker='8', cmap='viridis',
            c= x['Count'], s= x['Count']/2, edgecolors= 'black')

#Sets the labels and title
plt.title('Employee Satisfaction by Age', fontsize =18)
plt.xlabel('Age', fontsize= 14)
plt.ylabel('Employee Satisfaction', fontsize = 14)

#Adds colorbar to the plot
cbar= plt.colorbar()
cbar.set_label('Number of Employees', rotation = 270, fontsize = 14, color = 'black', labelpad= 30)

#Sets color bar ticks and labels
my_colorbar_ticks = [*range(0, int(x['Count'].max()),500)]
cbar.set_ticks(my_colorbar_ticks)
cbar.ax.set_yticklabels([f'{int(tick):,}' for tick in my_colorbar_ticks])

my_y_ticks=[*range(x['SatisfactionScore'].min(), x['SatisfactionScore'].max()+1,1)]
plt.yticks(my_y_ticks, fontsize=16, color='black');
plt.ylim(0.5, 5.5);

plt.show()

Average Monthly Salary

The graph below illustrates the average monthly salary between men and women across various departments. In observing the values there is relatively balanced salary with only three striking differences. In the Customer Support and Engineering departments, women have a notable higher average salary than men. However, he largest gap appears in the Operations department, where women on average make $6,381 compare to $6,445 for males. Overall, this indicates a fair pay structure across departments with minor flucuations that might need to be addressed.

#Removes rows where 'Gender' is 'Other'
df = df[df['Gender'] != 'Other']

#Groups by Department & Gender and gets avg salary
x2=df.groupby(['Department','Gender']).agg({'Monthly_Salary':['mean']}).reset_index()
x2.columns = ['Department', 'Gender', 'AvgSalary']

x2 = x2.pivot_table(index='Department', columns='Gender', values='AvgSalary').reset_index()

# Renames the columns
x2.columns = ['Department', 'WomenAvgMonSalary', 'MaleAvgMonSalary']

#Function to add labels on top of the bars
def autolabel(these_bars, this_ax, placedec,symbol):
    for each_bar in these_bars:
        height= each_bar.get_height()
        comma = symbol + format(int(height), placedec) 
        this_ax.text(each_bar.get_x()+each_bar.get_width()/2, height*1.0, comma, fontsize= 11,
        color='black',ha='center', va ='bottom')

#Creates bar plot for avg monthly salary by department and gender
fig = plt.figure(figsize=(18,19.75))
ax1= fig.add_subplot(1,1,1)

bar_width=0.4
x_pos = np.arange(9)

# Add commas to the y-axis values
comma = FuncFormatter(lambda x, pos: format(int(x), ',d'))  

women_monthly_Avg = ax1.bar(x_pos-(0.5*bar_width), x2.WomenAvgMonSalary, bar_width, color='darkorange', edgecolor = 'black', label='Women Average Salary')
male_monthly_Avg  = ax1.bar(x_pos+(0.5*bar_width), x2.MaleAvgMonSalary, bar_width, edgecolor = 'black',color= 'steelblue', label='Male Average Salary')

#Sets labels
ax1.set_ylabel('Average Montly Salary', fontsize= 20, color= 'black')
ax1.set_xlabel('Department', fontsize=20, color='black') 
ax1.set_yticks(np.arange(6275,6475, 25))
ax1.set_ylim(6250, 6475);
ax1.tick_params(axis='y',labelsize =14)


plt.title('Average Monthly Salary by Gender', fontsize =20)
ax1.set_xticks(x_pos)  # Set Department as x-axis labels
ax1.set_xlabel('Department', fontsize=20, color='black')
ax1.set_xticklabels(x2.Department, rotation=45, ha="right", fontsize =18) 
ax1.yaxis.set_major_formatter(comma)

#Adds legend 
handles, labels=ax1.get_legend_handles_labels()
legend= ax1.legend(handles, labels, loc = 'upper left', frameon=True,ncol=1, shadow = True,
                   borderpad= 1, fontsize =14)

autolabel(women_monthly_Avg,ax1,',d', '$')
autolabel(male_monthly_Avg,ax1,',d', '$')


plt.show()                    

Hired Employees each Quarter

The following donut chart below presents the total number of employees being hired across each quarter. As each quarter remains a relative similar percentage, this suggests the the total amount of hires is evenly distributed during the year.

# Convert 'Hire_Date' to datetime format
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])

# Creates new column for the month name and quarter
df['Month_Name'] = df['Hire_Date'].dt.month_name()
df['Quarter']='Quarter ' + df.Hire_Date.dt.quarter.astype('string')

pie_df = df.groupby(['Quarter', 'Month_Name']).size().reset_index(name='People_Hired')

numbers= len(pie_df.Quarter.unique())
outside_color= np.arange(numbers)*4

fig= plt.figure(figsize=(18,10))
ax=fig.add_subplot(1,1,1)

colormap=plt.get_cmap("tab20c")
outer_colors= colormap(outside_color)

#Calculates total # of people hired for inner circle
all_hired= pie_df.People_Hired.sum()

pie_df.groupby(['Quarter'])['People_Hired'].sum().plot(
    kind='pie', radius=1, colors=outer_colors, pctdistance=0.85, labeldistance= 1.1, 
    wedgeprops= dict(edgecolor='w'), textprops={'fontsize':18},
    autopct = lambda p:'{:.2f}%\n({:,.0f})'.format(p, (p / 100) * all_hired),
    startangle= 90)

#Adds inner hole in pie chart
hole= plt.Circle((0,0), 0.3, fc='white')
fig1= plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Total Hired by Quarter', fontsize = 18)

ax.text(0,0, 'Total Hired\n' + str(round(all_hired)),size= 18, ha= 'center', va= 'center')
ax.axis('equal');

plt.tight_layout()
plt.show()

Sick Days Taken by Department and Projects Handled

The following visualization represents the relationship between the total amount of sick days taken by a department given their project load. While one may assume that the department handling greater amounts of projects would account for majority of sick days from burnout, this data actually shows otherwise. Based on this visualization, data reveals how the greater amount of sick days come from departments between customer support and legal where project load is on the lower end of the range (0-19). A notable outlier in this dataset in the Finance department where those handling 33-39 projects report an exceptional higher total of 1,647 sick days within their department. Overall, this visualization provides insights to the workload and strain employees might be facing in their given department.

hm_df = df.groupby(['Department', 'Projects_Handled'])['Sick_Days'].sum().reset_index(name='TotSickDays')

bins = [0,5,10,15,20,25,30,35,40,45,50]  # Upper limit of the last bin is set to 70 for flexibility

# Digitize ages into bins
labels= ['0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49']

hm_df['Projects_Binned'] = pd.cut(hm_df['Projects_Handled'], bins=bins, labels=labels, right=False)

heat_df = pd.pivot_table(hm_df, index ='Projects_Binned', columns = 'Department', values = 'TotSickDays')

fig = plt.figure(figsize= (18,18))
ax= fig.add_subplot(1,1,1)

comma_fmt= FuncFormatter(lambda x, p:format(int(x),','))
ax = sns.heatmap(heat_df, linewidth =0.2, annot= True, cmap = 'coolwarm', fmt = ',.0f',
                  square = True, annot_kws = {'size':11},
                  cbar_kws= {'format': comma_fmt, 'orientation':'vertical'})

plt.title('Heatmap of the Number of Sick Days By Department and Projects Handled', fontsize = 18, pad = 15 )
plt.xlabel('Department', fontsize=18, labelpad =10)
plt.ylabel('Amount of Projects Handled', fontsize = 14, labelpad=10, color='black')
plt.yticks(rotation =0, size =14);
plt.xticks(size=14, rotation=70);

ax.invert_yaxis()

cbar = ax.collections[0].colorbar 
cbar.set_label('Number of Sick Days', rotation = 270, fontsize=16, color='black', labelpad=20)

plt.show()

Training Hours

The following visualization illustrates the total amount of training by each department given their team size. Although there are fluctuations in the total amount of training hours, the graph shows a consistent trend among each department. Large and medium team sizes in the Finance and Marketing department consistently receive the highest amount of total training hours. Interestingly, smaller and very large team sizes among all departments fall below 115,000 total hours of training. Overall, this analysis could help organizations understand which teams and departments receive the most intense trainings.

# Define bins and labels for Team_Size based on unique values
bins = [0, 5, 10, 15, df['Team_Size'].max()]
labels = ['Small', 'Medium', 'Large', 'Very Large']
df['Team_Size_Range'] = pd.cut(df['Team_Size'], bins=bins, labels=labels, right=False)

# Group by Team_Size_Range and Department to sum of Training_Hours
lineplot_df = df.groupby(['Team_Size_Range', 'Department'])['Training_Hours'].sum().reset_index(name='Training_Hours')

my_colors = {'Small': 'blue', 'Medium': 'red', 'Large': 'green', 'Very Large': 'purple'}

fig = plt.figure(figsize=(18,16))
ax = fig.add_subplot(1,1,1)

# Plot each group with specific colors
for key, grp in lineplot_df.groupby('Team_Size_Range'):
    team_size = key
    # Check if team_size exists in the dictionary
    color = my_colors.get(team_size, 'black')  
    grp.plot(ax=ax, kind='line', x='Department', y='Training_Hours', color=color, label=team_size, marker='8')

# Set the plot title and labels
plt.title('Total Training Hours by Team Size and Department', fontsize=20)
ax.set_xlabel('Department', fontsize=18)
ax.set_ylabel('Total Training Hours', fontsize=18, labelpad=20)

# Tick labels
ax.tick_params(axis='x', labelsize=14, rotation=40)
ax.tick_params(axis='y', labelsize=14, rotation=0)

# Replace legend labels with your custom range labels
handles, labels = ax.get_legend_handles_labels()
labels = ['Small: 0-5', 'Medium: 5-10', 'Large: 10-15', 'Very Large: 15+']
plt.legend(handles, labels, loc='best', fontsize=14, ncol=1, title='Team Size')

 # Adds commas to the y-axis values
comma = FuncFormatter(lambda x, pos: format(int(x), ','))
ax.yaxis.set_major_formatter(comma)

# Display the plot
plt.show()

Conclusion

These data visualizations provide valuable insights in workforce demographics and behavior. Each visualization highlights patterns that when analyzed together can help organizations improve. The second bar chart and heap map, for example, can be analyzed together to see if there is a connection between gender, department, amount of projects handled, and sick days. Additionally, since these visualizations focus on different departments organizations can identify potential areas where workload distribution or department may need improvements. A useful key takeaway from the visualizations could help what your area of study should be given your gender if one is money motivated. Furthermore, there are many more conclusions that can be made out of this dataset as only half columns were used.