Loading packages

Tidyverse is a versatile package used to read in and wrangle the data, as well as visualize it. Skimr is a package used to make summaries of the data.

Cincinnati Salaries

This is a dataset of Cincinnati’s public employees’ cincinnati_salaries. The following data analysis is a Spencer Avenue Production. This fine group of gentleman are Casey Donegan, James Kelin, and Joe Carignan. Our data is made available from this link: https://myxavier-my.sharepoint.com/:x:/g/personal/doneganm_xavier_edu/EYCpxQLlGGZJoSFPsDoNEPcBsleg7TxuYHWSOIeQRfWODA?download=1.

Loading data

We are loading the data directly from OneDrive using read_csv()

cincinnati_salaries<- read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/doneganm_xavier_edu/EdTGZt6w2XpDmF3eYzv_gOYBOT2l7_B1g1PJ8ulP_7zRjg?download=1")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   PK = col_double(),
##   DEPTID = col_double(),
##   JOBCODE = col_double(),
##   POSITION_NBR = col_double(),
##   JOB_ENTRY_DT = col_double(),
##   STD_HOURS = col_double(),
##   FTE = col_double(),
##   GRADE = col_double(),
##   STEP = col_double(),
##   ANNUAL_RT = col_double(),
##   EEO_JOB_GROUP = col_double()
## )
## See spec(...) for full column specifications.

cleaning data

we have to clean the data

cincinnati_salaries<- na.omit(cincinnati_salaries)

new columns

We created new columns from what we found interesting. SalaryBracket divides annual cincinnati_salaries into 6 categories of ranges of pay. We also made a new column to calcualte a person’s bi-weekly paycheck. Then, we created new columns that separate first and last names.

Race / Ethnicity

We want a comparison of salaries by race/ethnicity.

ethnicities<- cincinnati_salaries %>% 
  group_by(RACE) %>% 
  summarize(countya = n(),
            avg_salary = mean(ANNUAL_RT))

ggplot(data = ethnicities,
       aes(x = RACE,
           y = countya))+
  geom_col(fill = "cyan")+
  ggtitle("Distribution of Employees by Ethnicity")+
  xlab("Ethnicity")+
  ylab("Number of Employees")+
  scale_y_continuous(name = "Number of Employees",
                     labels = scales::comma)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(data = ethnicities,
       aes(x = RACE,
           y = avg_salary))+
  geom_col(fill = "green")+
  ggtitle("Pay by Ethnicity")+
  xlab("Ethnicity")+
  ylab("Average Salary")+
  scale_y_continuous(name = "Average Salary",
                     labels = scales::dollar)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

These results could be vastly skewed because there may be a lot more instances of other races in the data set. I could filter each so it removes races that have less than a certain number of entries in the data set to qualify in the analysis. Additionally, I can randomly pull a sample of instances from each race to ensure each variable has a controlled number of occurances and presents a more easily comparable average salary. I would anticipate a higher salary for White employees.

Job Groups

Casey’s analysis

Let’s take a look at pay differences among job groups. Job group is a variable in the dataset that is not specific. The people who constructed this dataset named these groups “0-8”, based on their type of job. I dug through job titles of each group to make a broad name for each group.

The best statistical test to run for this subset of data would be an ANOVA test. This would test if there are significant differences in the averages of salary by the job groups.

group<- cincinnati_salaries %>% 
  group_by(EEO_JOB_GROUP) %>% 
  summarize(countya = n(),
            avg_salary = mean(ANNUAL_RT))

job_group<- c("recreation", "engineering", "health care", "higher education",
                   "police and fire", "municipal", "dispatchers and receptionists",
                   "blue collar", "parks")

plzpls<- as.data.frame(job_group)

group<- bind_cols(plzpls, group)

ggplot(data = group,
       aes(x = job_group,
           y = countya))+
  geom_col(fill = "cyan")+
  ggtitle("Distribution of Employees by Job Group")+
  xlab("Job Group")+
  ylab("Number of Employees")+
  scale_y_continuous(name = "Number of Employees",
                     labels = scales::comma)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(data = group,
       aes(x = job_group,
           y = avg_salary))+
  geom_col(fill = "magenta")+
  ggtitle("Pay by Job Group")+
  xlab("Job Group")+
  ylab("Average Salary")+
  scale_y_continuous(name = "Average Salary",
                     labels = scales::dollar)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Age Ranges

Joe self-directed analysis 1. How does paycheck amount differ for each age group? This is interesting because the biweekly (paycheck) amount column is one that we calculated and created ourselves, so running further analysis on it was intriguing. 2. I intend on answering this question by averaging the biweekly paycheck amount and grouping the age range while using a bar chart see the relationship between average biweekly paycheck per age range. As can be seen in the graph, the highest average paycheck per age group is a near tie between the 41-50 and 51-60 groups, with the lowest in the under 18 age range A statistical test that could be used to further evaluate the significance between these two variables is a one-way ANOVA test because we have one categorical variable that is the age ranges and one continuous variable in the paycheck amount, and I think it would be interesting to see the variance in the means for each age range grouping.

ggplot(data = cincinnati_salaries,
       aes(x = cincinnati_salaries$AGE_RANGE,
           y = cincinnati_salaries$ANNUAL_RT)) +
  geom_col(fill = "red") +
  ggtitle("Annual Salary By Age") +
  xlab("Age Range")+
  ylab("Annual Salary")+
  scale_y_continuous(name = "Annual Salary",
                     labels = scales::dollar)

Salaries_By_Age_Range<- cincinnati_salaries %>%
  group_by(AGE_RANGE) %>% 
  summarize(Avg_Paycheck = mean(Bi_Week_Check),
            countof = n())

ggplot(data = Salaries_By_Age_Range,
       aes(x = Salaries_By_Age_Range$AGE_RANGE,
           y = Salaries_By_Age_Range$Avg_Paycheck))+
  geom_col(fill = "red")+
  ggtitle("Paycheck by Age")+
  xlab("Age")+
  ylab("Paycheck Amount")+
  scale_y_continuous(name = "Paycheck Amount",
                     labels = scales::dollar)

Salary Buckets

How is wealth distributed among Cincinnati residents?

ggplot(data = cincinnati_salaries,
       aes(x = cincinnati_salaries$SalaryBracket)) +
  geom_bar(fill = "blue") +
  ggtitle("Annual Bucketed salaries") +
  xlab("Salary") +
  ylab("Number of Employees")

Gender Pay

Is there a wage gap between males and females?

ggplot(data = cincinnati_salaries,
       aes(x = SEX,
           y = Bi_Week_Check))+
  geom_boxplot(fill = "purple")+
  ggtitle("Paycheck per Sex")+
  xlab("Sex")+
  ylab("Paycheck Amount")+
  scale_y_continuous(labels = scales::dollar, limits = c(0,8000))
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).