Payroll Dataset Analysis

The following report will look at all Cook County employees excluding Forest Preserves, indicating amount of base salary paid to an employee during the County fiscal quarter.The data ranges fiscal years 2016 through 2018 – the fiscal year is from December to November. The data set is sourced from data.gov (https://catalog.data.gov/dataset/employee-payroll)

This data set did not need any cleaning and was ready to use directly from the download from data.gov.

One item to note from the data is that the pay details are reported on a quarterly basis, so there are 4 entries for each employee for each financial year, given they stay employed with the Cook County for the full financial year. Since salary is generally examined on an annual basis, I had to do some aggregating of the data to have the salary data for each financial year for each employee (I summed the quarterly salary for each employee identifier for each year), resulting in the full financial year salary for each employee for each financial year. This eliminates the confusion of comparison when it was presented as a quarterly base, since salary is commonly viewed on a annual base.

library(tidyverse)
library(ggplot2)
library(plotly)
library(sampling)

employee_payroll_df = read.csv("Employee_Payroll.csv")
employee_payroll = as_tibble(employee_payroll_df)

Employee_population_Bureau = table(select(employee_payroll, Bureau))

sum_salary = employee_payroll %>%
  group_by(Bureau, Fiscal.Year) %>%
  summarize(total_salary = sum(Base.Pay, na.rm = TRUE), 
  .groups = "keep")
  
sum_salary = sum_salary %>%
  mutate(total_salary = as.numeric(total_salary))
  
employee_counts = employee_payroll %>%
  group_by(Bureau, Fiscal.Year) %>%
  summarize(Unique_Employees  = n_distinct(Employee.Identifier), .groups = "keep") %>% 
  arrange(Bureau, Fiscal.Year)

employee_counts = employee_counts %>%
  mutate(Unique_Employees = as.numeric(Unique_Employees))
  
average_salary = sum_salary %>%
  inner_join(employee_counts, by = c("Bureau", "Fiscal.Year")) %>%
  mutate(Average_Salary = total_salary / Unique_Employees)

top_5_salary_bureau = average_salary %>%
  group_by(Fiscal.Year) %>%
  arrange(desc(Average_Salary)) %>%
  slice_head(n = 5) %>%
  ungroup()
  
top_5_population_bureau = average_salary %>%
  group_by(Fiscal.Year) %>%
  arrange(desc(Unique_Employees)) %>%
  slice_head(n = 5) %>%
  ungroup()

#Bar Chart of Top 5
p = ggplot(top_5_salary_bureau,
    aes(
        x = reorder(Bureau, Average_Salary),
        y = Average_Salary,
        text = paste("Bureau:", Bureau, "<br>Average Salary: $", format(Average_Salary, big.mark = ","))
    )
  ) +
  geom_bar(stat = "identity", fill = "#0072B2") +
  facet_wrap(~ Fiscal.Year, scales = "free_x") +
  labs(
    title = "Top 5 Bureaus by Average Salary, Per Fiscal Year",
    x = "Bureau",
    y = "Average Salary (USD)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
plotly_graph = ggplotly(p, tooltip = "text")
print(plotly_graph)

p = ggplot(top_5_population_bureau,
    aes(
        x = reorder(Bureau, Average_Salary),
        y = Average_Salary,
        text = paste("Bureau:", Bureau, "<br>Average Salary: $", format(Average_Salary, big.mark = ","))
    )
  ) +
  geom_bar(stat = "identity", fill = "#0072B2") +
  facet_wrap(~ Fiscal.Year, scales = "free_x") +
  labs(
    title = "Top 5 Bureaus by Population Average Salary, Per Fiscal Year",
    x = "Bureau",
    y = "Average Salary (USD)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
plotly_graph = ggplotly(p, tooltip = "text")
print(plotly_graph)

#Bar graph Analysis In the above bar graphs, the ranking of the bureaus was done in two ways, the first was based on the average salary and the second was based on bureau population.

For the first bar graph, I calculated the average annual salary for each bureau and then ranked them in descending order and extracted the top five bureaus (based on average annual salary) for each fiscal year. I see that the annual salary for the financial year 2018 was significantly lower than the 2 other years that were included. This could be due to incomplete data and unreported earnings. Furthermore, the second bar graph looks at bureaus based on the population count, so the count of unique employees for each bureau for each year was calculated and ordered in descending order and I extracted the top five bureaus based on the population of employees. I see that when comparing the two graphs, I see that the population graph has lower average earnings than the graph that is based solely on the average salary.

# Box plot processing
individual_salary = employee_payroll %>%
  group_by(Employee.Identifier, Fiscal.Year, Bureau) %>%
  summarize(total_salary = sum(Base.Pay, na.rm = TRUE), 
  .groups = "keep")

subset_2016_salary_data = filter(individual_salary, Fiscal.Year == 2016)
five_num_2016 = fivenum(subset_2016_salary_data$total_salary)
IQR_2016 = five_num_2016[4]-five_num_2016[2]

lower_bound_2016 = five_num_2016[2]-(1.5*IQR_2016)
lower_2016_outliers = subset_2016_salary_data$total_salary[which(subset_2016_salary_data$total_salary < lower_bound_2016)]
#length(lower_2016_outliers)

upper_bound_2016 = five_num_2016[4]+(1.5*IQR_2016)
upper_2016_outliers = subset_2016_salary_data$total_salary[which(subset_2016_salary_data$total_salary > upper_bound_2016)]
#length(upper_2016_outliers)

subset_2017_salary_data = filter(individual_salary, Fiscal.Year == 2017) 
five_num_2017 = fivenum(subset_2017_salary_data$total_salary)
IQR_2017 = five_num_2017[4]-five_num_2017[2]

lower_bound_2017 = five_num_2017[2]-(1.5*IQR_2017)
lower_2017_outliers = subset_2017_salary_data$total_salary[which(subset_2017_salary_data$total_salary < lower_bound_2017)]
#length(lower_2017_outliers)

upper_bound_2017 = five_num_2017[4]+(1.5*IQR_2017)
upper_2017_outliers = subset_2017_salary_data$total_salary[which(subset_2017_salary_data$total_salary > upper_bound_2017)]
#length(upper_2017_outliers)

subset_2018_salary_data = filter(individual_salary, Fiscal.Year == 2018) 
five_num_2018 = fivenum(subset_2018_salary_data$total_salary)
IQR_2018 = five_num_2018[4]-five_num_2018[2]

lower_bound_2018 = five_num_2018[2]-(1.5*IQR_2018)
lower_2018_outliers = subset_2018_salary_data$total_salary[which(subset_2018_salary_data$total_salary < lower_bound_2018)]
#length(lower_2018_outliers)

upper_bound_2018 = five_num_2018[4]+(1.5*IQR_2018)
upper_2018_outliers = subset_2018_salary_data$total_salary[which(subset_2018_salary_data$total_salary > upper_bound_2018)]
#length(upper_2018_outliers)

#Boxplots
grouped_box_plot = plot_ly(
    data = individual_salary,
    y = ~as.factor(Fiscal.Year), 
    x = ~total_salary,       
    type = 'box',
    boxpoints = "outliers",
    color = ~as.factor(Fiscal.Year),
    hovertemplate = "Year: %{x}<br>Salary: $%{y:.2f}<extra></extra>"
  ) %>%

  layout(
    title = "Distribution of Employee Salaries by Fiscal Year (With Outliers)",
    xaxis = list(title = "Fiscal Year"),
    yaxis = list(title = "Total Salary (USD)"),
    boxmode = 'group', 
    showlegend = FALSE
  )

print(grouped_box_plot)

salary_2016 = filter(subset_2016_salary_data, total_salary< upper_bound_2016)
salary_2017 = filter(subset_2017_salary_data, total_salary< upper_bound_2017)
salary_2018 = filter(subset_2018_salary_data, total_salary< upper_bound_2018)

combined_salary <- bind_rows(salary_2016, salary_2017, salary_2018)

grouped_box_plot = plot_ly(
    data = combined_salary,
    y = ~as.factor(Fiscal.Year), 
    x = ~total_salary,       
    type = 'box',
    boxpoints = "outliers",
    color = ~as.factor(Fiscal.Year),
    hovertemplate = "Year: %{x}<br>Salary: $%{y:.2f}<extra></extra>"
  ) %>%

  layout(
    title = "Distribution of Employee Salaries by Fiscal Year (Without Outliers)",
    xaxis = list(title = "Fiscal Year"),
    yaxis = list(title = "Total Salary (USD)"),
    boxmode = 'group', 
    showlegend = FALSE
  )
print(grouped_box_plot)

years = sort(unique(individual_salary$Fiscal.Year))

max_count = individual_salary %>%
  group_by(Fiscal.Year) %>%
  summarise(counts = hist(total_salary, plot = FALSE)$counts) %>%
  summarise(max_count = max(unlist(counts))) %>%
  pull(max_count)

plots_list = lapply(years, function(y) {
  df = individual_salary %>% filter(Fiscal.Year == y)
  
  plot_ly(
    data = df,
    x = ~total_salary,
    type = 'histogram',
    name = paste("Fiscal Year", y),
    hovertemplate = "Salary: %{x}<br>Count: %{y}<extra></extra>"
  ) %>%
    layout(
      title = "Annual Salary for Population (With Outliers)",
      xaxis = list(title = "Total Salary (USD)"),
      yaxis = list(title = "Count of Employees", range = c(0, 1200))
    )
})

plotly_histograms = subplot(
  plots_list,
  nrows = 3,
  shareY = TRUE,
  titleX = TRUE,
  titleY = TRUE,
  margin = 0.05
)

plotly_histograms
max_count = individual_salary %>%
  group_by(Fiscal.Year) %>%
  summarise(counts = hist(combined_salary$total_salary, plot = FALSE)$counts) %>%
  summarise(max_count = max(unlist(counts))) %>%
  pull(max_count)

plots_list = lapply(years, function(y) {
  df = individual_salary %>% filter(Fiscal.Year == y)
  
  plot_ly(
    data = df,
    x = ~combined_salary$total_salary,
    type = 'histogram',
    name = paste("Fiscal Year", y),
    hovertemplate = "Salary: %{x}<br>Count: %{y}<extra></extra>"
  ) %>%
    layout(
      title = "Annual Salary for Population (Without Outliers)",
      xaxis = list(title = "Total Salary (USD)"),
      yaxis = list(title = "Count of Employees", range = c(0, 1200))
    )
})

plotly_histograms = subplot(
  plots_list,
  nrows = 3,
  shareY = TRUE,
  titleX = TRUE,
  titleY = TRUE,
  margin = 0.05
)

plotly_histograms

#Boxplot and Histogram Analysis I started by creating a box plot and histogram for the annual salary data for each fiscal year as it was presented in the data set. Once I saw the box plots and histograms I could tell that there was a heavy positive skew of the data; indicating that there are some high salaries that are skewing the data to the right.

Due to the skewing, I wanted to see the outliers in the salary data and calculated the inner quartile range for each year and identified the outliers. Once they were identified, I again created the box plots and histograms for each year; but this time, without the outliers to see the data without the dramatic skewing from the outliers.

As a result, we can see that the data is more centralized to be between $0 and about $140,000, and there is still some positive skewing across all the fiscal years. This indicates that there is a larger portion of individual who are earning higher annual incomes and are pulling the data up towards the higher end of annual salaries.

I also found that there are some employees who are paid $0, there is no background information as to why there is zero pay, and I kept them in the data set since they remain within the lower bound of the data.

I found that in the data, there were 784/25,257, 752/23,645, and 709/26,198 outliers on the upper bound in the 2016, 2017, and 2018 years respectively.

# Central Limit Theorem

pop_mean = mean(subset_2016_salary_data$total_salary)
pop_sd = sd(subset_2016_salary_data$total_salary)

num_samples = 1000

set.seed(3803)
samples = subset_2016_salary_data$total_salary
sample.size = 10
xbar_10 = numeric(num_samples)

for (i in 1: num_samples) {
  xbar_10[i] = mean(sample(subset_2016_salary_data$total_salary, sample.size, replace = FALSE))
}

set.seed(3803)
samples = subset_2016_salary_data$total_salary
sample.size = 20
xbar_20 = numeric(num_samples)

for (i in 1: num_samples) {
  xbar_20[i] = mean(sample(subset_2016_salary_data$total_salary, sample.size, replace = FALSE))
}

set.seed(3803)
samples = subset_2016_salary_data$total_salary
sample.size = 30
xbar_30 = numeric(num_samples)

for (i in 1: num_samples) {
  xbar_30[i] = mean(sample(subset_2016_salary_data$total_salary, sample.size, replace = FALSE))
}

set.seed(3803)
samples = subset_2016_salary_data$total_salary
sample.size = 40
xbar_40 = numeric(num_samples)

for (i in 1: num_samples) {
  xbar_40[i] = mean(sample(subset_2016_salary_data$total_salary, sample.size, replace = FALSE))
}

population_values = subset_2016_salary_data$total_salary

population_hist = plot_ly(
  x = population_values,
  type = "histogram",
  name = "Population Salary Distribution"
) %>%
  layout(
    title = "Population Salary Distribution (All Employees - 2016)",
    xaxis = list(title = "Total Salary"),
    yaxis = list(title = "Density")
  )

population_hist
all_values = c(xbar_10, xbar_20, xbar_30, xbar_40)
x_range = range(all_values)

p1 = plot_ly(x = xbar_10, type = "histogram", name = "n = 10") %>%
  layout(title = "Sample Mean Distribution (n = 10)",
         xaxis = list(range = x_range))

p2 = plot_ly(x = xbar_20, type = "histogram", name = "n = 20") %>%
  layout(title = "Sample Mean Distribution (n = 20)",
         xaxis = list(range = x_range))

p3 = plot_ly(x = xbar_30, type = "histogram", name = "n = 30") %>%
  layout(title = "Sample Mean Distribution (n = 30)",
         xaxis = list(range = x_range))

p4 = plot_ly(x = xbar_40, type = "histogram", name = "n = 40") %>%
  layout(title = "Sample Mean Distribution (n = 40)",
         xaxis = list(range = x_range))

subplot(p1, p2, p3, p4, nrows = 2, shareX = TRUE, shareY = FALSE, titleX = TRUE, titleY = TRUE) %>%
  layout(title = "Central Limit Theorem - Sampling Distributions xbar (FY 2016)")

#Central Limit Theorem in the 2016 Salary Data The Central Limit Theorem states that the distribution of the sample means for a given sample size of the population has the shape of the normal distribution. I tested this with the salary data for the financial year of 2016. Above is the histogram of the annual base salary of all the employees in financial year 2016 (includes outliers). I then created additional histograms for each of the samples where is is 1000 random samples for the sample sizes 10, 20, 30, 40.The annual salary was analyzed and we can see from the histograms that the Central Limit Theorem holds true and the sampled data shows a normal distribution and as the sample size increased, the histogram became narrower, meaning the standard deviation decreased as the sample size increased. It can be observed that the shape remains relatively similar across the different sample sizes, but as it increases, we can see a slight positive skew, which is present in the population histogram.

 # Central Limit Theorem
paste("population mean:", round(pop_mean,2))
## [1] "population mean: 61225.11"
paste("population standard deviation: ",round(pop_sd,2))
## [1] "population standard deviation:  42509.95"
paste("sample size 10 mean:",round(mean(xbar_10),2))
## [1] "sample size 10 mean: 60746.95"
paste("sample size 10 standard deviation: ", round(sd(xbar_10),2))
## [1] "sample size 10 standard deviation:  13427.49"
paste("sample size 20 mean:",round(mean(xbar_20),2))
## [1] "sample size 20 mean: 61318.99"
paste("sample size 20 standard deviation: ", round(sd(xbar_20),2))
## [1] "sample size 20 standard deviation:  9729.34"
paste("sample size 30 mean:",round(mean(xbar_30),2))
## [1] "sample size 30 mean: 61587.02"
paste("sample size 30 standard deviation: ", round(sd(xbar_30),2))
## [1] "sample size 30 standard deviation:  8067.25"
paste("sample size 40 mean:",round(mean(xbar_40),2))
## [1] "sample size 40 mean: 61656.09"
paste("sample size 40 standard deviation: ", round(sd(xbar_40),2))
## [1] "sample size 40 standard deviation:  6918.06"

#Means, and Standard Deviation of Annual Salary in Financial Year 2016 As mentioned before, the samples of the 2016 financial year data were taken and the means and standard deviation were calculated. When comparing the sample means to the population mean, we can see that the figure remains fairly consistent and is about ~61,000. As depicted in the histograms, as the sample size increased, the histogram narrowed, and now we can see the numbers behind the graph. The standard deviation of the sample sizes decreased as the sample size increased.

#SRSWOR
library(sampling)
top_5_salary_bureau_names_2016 = top_5_salary_bureau$Bureau[1:5]

population_bureau_2016 = subset_2016_salary_data

set.seed(3803)
SRSWOR_sal = sample(population_bureau_2016$total_salary, 5000, replace = FALSE)
salary_SRSWOR = table(SRSWOR_sal)
SRSWOR_b = sample(population_bureau_2016$Bureau, 5000, replace = FALSE)
bureau_SRSWOR = table(SRSWOR_b)

#Systematic
N = 50000
n = 5000
k = ceiling(N / n)
r = sample(k, 1)

s = seq(r, by = k, length = n)

sample.3 = population_bureau_2016[s, ]
sample.3_table = table(sample.3$total_salary)

#Stratified Sampling based on Top 5 Bureaus by Average Salary
set.seed(3803)

bureau_index = order(population_bureau_2016$Bureau)
data = population_bureau_2016[bureau_index, ]

freq = table(population_bureau_2016$Bureau)
sample_size = 5000
sizes = round(sample_size * freq / sum(freq))
sizes[sizes == 0] = 1

st = sampling::strata(data, stratanames = c("Bureau"),
                       size = sizes, method = "srswor")
                       
sample = sampling::getdata(data, st)
sample_sal_strat = table(sample$total_salary)

bureau_counts = table(population_bureau_2016$Bureau)
plot_ly(
  labels = names(bureau_counts),
  values = as.numeric(bureau_counts),
  type = "pie",
  textinfo = "label+percent",
  insidetextorientation = "radial"
) %>%
  layout(title = "Distribution of Bureaus for 2016 Population")
bureau_counts = table(SRSWOR_b)
plot_ly(
  labels = names(bureau_counts),
  values = as.numeric(bureau_counts),
  type = "pie",
  textinfo = "label+percent",
  insidetextorientation = "radial"
) %>%
  layout(title = "Distribution of Bureaus with SRSWOR")
bureau_counts = table(sample.3$Bureau)
plot_ly(
  labels = names(bureau_counts),
  values = as.numeric(bureau_counts),
  type = "pie",
  textinfo = "label+percent",
  insidetextorientation = "radial"
) %>%
  layout(title = "Distribution of Bureaus with Systemic Sampling")
bureau_counts = table(sample$Bureau)
plot_ly(
  labels = names(bureau_counts),
  values = as.numeric(bureau_counts),
  type = "pie",
  textinfo = "label+percent",
  insidetextorientation = "radial"
) %>%
  layout(title = "Distribution of Bureaus with Stratified Sampling")
library(plotly)

# Create the individual histograms
p_srs = plot_ly(
  x = SRSWOR_sal,
  type = "histogram",
  name = "Simple Random Sampling"
) %>%
  layout(title = "Simple Random Sampling", xaxis = list(title = "Salary"), yaxis = list(title = "Count"))

p_sys = plot_ly(
  x = sample.3$total_salary,
  type = "histogram",
  name = "Systematic Sampling"
) %>%
  layout(title = "Systematic Sampling", xaxis = list(title = "Salary"), yaxis = list(title = "Count"))

p_strat = plot_ly(
  x = sample$total_salary,
  type = "histogram",
  name = "Stratified Sampling"
) %>%
  layout(title = "Annual Salary (2016) using Various Sampling Methods", xaxis = list(title = "Salary"), yaxis = list(title = "Count"))

# Combine them horizontally
subplot(p_srs, p_sys, p_strat, nrows = 3, shareY = TRUE, titleX = TRUE, titleY = TRUE, margin = 0.05)

#Sampling Analysis Sampling is a technique that looks at a representative portion of a population to get insights on what is being examined or researched. There are different types of sampling that are being utilized are simple random sampling, where each element within the population have an equal chance of being chosen. The next technique is systematic sampling is a method where samples are selected via a fixed periodic interval, the interval is calculated based on the population divided by a desired sample size. Lastly is stratified sampling which takes into account the demographic of the population, meaning it looks to make the sample group have similar proportions of demographics as the population as a whole. In the salary data set, the bureau population was used as the stratifier to make sure that the bureaus are represented in similar proportions as they are in the whole population of the salary data.

For the graphs, they are pie graphs that show the demographic make up of the samples based on the bureau the individual works in, there is on pie graph for each sampling method. As we can see, the demographic remain consistent among the different sampling techniques. Lastly, a histogram was also included to show the distribution of the annual salary of the sampled employees. The data seems to be fairly consistent across the sampling techniques, but there is more positive skewing in the systematic sampling. # Conclusions

Overall, we can see that there are wide ranges of salaries within the Cook County. Based on the bureau you are working in, there may be higher salaries, so if you are someone who is looking to work in Cook County, you may want to apply for the Board of Elections, Inspector General, Public Defender, Bureau of Economics, or Bureau of Technology for the highest paying jobs.