The dataset selected for this Exploratory data analysis is Atlanta City Employee Salaries from data.world (https://data.world/brentbrewington/atlanta-city-employee-salaries). This dataset contains salary details of employees from Atlanta region during 2015. There are 7 columns in this dataset. They are employee name, age, gender, annual salary, ethnic origin, job title and organization. This dataset helps job seekers to better understand the job market in the Atlanta region. It provides information about the openings for various jobs in the job market. It gives an overview of number of openings per organization. This is an interesting dataset for exploratory data analysis because it provides the data to analyse the variation of salary over the age of the employees. Also, it helps to compare the salaries between male and female. It can also be used to explore the difference in the employability of the various ethnic groups.
library(tidyverse)
library(readr)
For the exploratory data analysis, the tidyverse package is loaded.
salary <- read_csv("data/atlanta_salary.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## age = col_double(),
## sex = col_character(),
## ethnic.origin = col_character(),
## job.title = col_character(),
## organization = col_character(),
## annual.salary = col_double()
## )
Now, the Atlanta City Employee Salaries dataset is loaded. The function read_csv() is used to read the file “atlanta_salary.csv” and store it in a variable called salary. Once the file is read, the output of the fuction will show the specifications of the columns in the dataset.
TheView()function can be used to view the entire data in the dataframe.
View(salary)
The head() function is used to check first few rows in the salary dataset.
head(salary)
## # A tibble: 6 x 7
## name age sex ethnic.origin job.title organization annual.salary
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Aaron,~ 38 Male Black or Afr~ ATL311 Te~ EXE Executiv~ 46000.
## 2 Aaron,~ 52 Male Black or Afr~ Environme~ DPW Departme~ 26700.
## 3 Aaron,~ 44 Male Black or Afr~ Watershed~ DWM Departme~ 46575.
## 4 Abazen~ 42 Female Black or Afr~ Benefits ~ DHR Departme~ 42867.
## 5 Abbott~ 32 Male White Recreatio~ PRC Parks, R~ 28035.
## 6 Abbott~ 44 Male White Managemen~ DWM Departme~ 67800.
The summary() function provides the overview of the salary dataset.
summary(salary)
## name age sex ethnic.origin
## Length:8246 Min. :19.00 Length:8246 Length:8246
## Class :character 1st Qu.:34.00 Class :character Class :character
## Mode :character Median :44.00 Mode :character Mode :character
## Mean :43.19
## 3rd Qu.:52.00
## Max. :89.00
## job.title organization annual.salary
## Length:8246 Length:8246 Min. : 20800
## Class :character Class :character 1st Qu.: 36917
## Mode :character Mode :character Median : 43983
## Mean : 49040
## 3rd Qu.: 55279
## Max. :240698
dplyrverbs such as filter(), arrange() select() and mutate() are used to transform, rearrange, and filter the data in the salary dataframe.They are explained below in detail.
It is interesting to see the salary details for a specific job title. For example, for job seekers interested in “Data Technician” roles, the salary details for this specific job can be extracted using the filter function.
filter(salary, job.title == "Data Technician", annual.salary)
## # A tibble: 11 x 7
## name age sex ethnic.origin job.title organization annual.salary
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ameri~ 44 Female Black or Afri~ Data Tec~ DWM Departme~ 42335.
## 2 Berry~ 40 Female Black or Afri~ Data Tec~ DWM Departme~ 34326.
## 3 Caulw~ 41 Female Black or Afri~ Data Tec~ DWM Departme~ 38763.
## 4 Davis~ 58 Female Black or Afri~ Data Tec~ DPW Departme~ 32454.
## 5 Dixon~ 46 Female Black or Afri~ Data Tec~ DPW Departme~ 31127.
## 6 Dobbs~ 36 Female Black or Afri~ Data Tec~ DWM Departme~ 34191.
## 7 Hall,~ 30 Male Black or Afri~ Data Tec~ PRC Parks, R~ 28189.
## 8 Harve~ 41 Female Black or Afri~ Data Tec~ DWM Departme~ 37073.
## 9 Jones~ 54 Female Black or Afri~ Data Tec~ PRC Parks, R~ 32499.
## 10 Walke~ 38 Female Black or Afri~ Data Tec~ DWM Departme~ 35262
## 11 Willi~ 63 Male Black or Afri~ Data Tec~ DWM Departme~ 42736.
It is appealing to know the highest paid jobs in the Atlanta region. If the data is sorted based on the annual salary, then the jobs can be visualized based on the annual salary. The arrange()function is used to sort based on a variable.desc keyword is used to sort the observations in descending order.
arrange(salary, desc(annual.salary))
## # A tibble: 8,246 x 7
## name age sex ethnic.origin job.title organization annual.salary
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Gordo~ 39 Male White Chief Ope~ EXE Executiv~ 240698.
## 2 Turne~ 56 Male Black or Afr~ Police Ch~ APD Atlanta ~ 240698.
## 3 Hampt~ 48 Female Black or Afr~ City Atto~ LAW Law Depa~ 240697.
## 4 Beard~ 52 Male Black or Afr~ Chief Fin~ DOF Departme~ 221108.
## 5 Geisl~ 66 Male White Airport D~ EXE Executiv~ 221108.
## 6 Saini~ 40 Male Asian Chief Inf~ DIT Departme~ 221000
## 7 South~ 60 Male Black or Afr~ Airport G~ DOA Departme~ 221000
## 8 Smith~ 61 Male Black or Afr~ Airport D~ DOA Departme~ 215250.
## 9 Smith~ 52 Male Black or Afr~ Procureme~ DOP Departme~ 202344.
## 10 Rucke~ 60 Male Black or Afr~ Airport G~ DOA Departme~ 200000.
## # ... with 8,236 more rows
The select()function selects and displays only the specified columns from the dataset. The below code extracts only sex, age,job title and annual salary columns for analysis from the salary dataset.
select(salary,sex, age,job.title ,annual.salary)
## # A tibble: 8,246 x 4
## sex age job.title annual.salary
## <chr> <dbl> <chr> <dbl>
## 1 Male 38 ATL311 Team Lead 46000.
## 2 Male 52 Environmental Service Worker I (D) 26700.
## 3 Male 44 Watershed Crew Supervisor (D) 46575.
## 4 Female 42 Benefits Representative, Sr 42867.
## 5 Male 32 Recreation Operations Assistant 28035.
## 6 Male 44 Management Analyst, Sr 67800.
## 7 Male 33 Police Investigator (E) (F) (O) 46378.
## 8 Female 28 Police Officer (E) (F) (O) (B) (S) 39328.
## 9 Male 58 Management Consulting Director 125000.
## 10 Male 45 Police Investigator (E) (F) (O) 60466.
## # ... with 8,236 more rows
It is useful to create new columns based on the data from the existing columns in the dataset. The mutate()function is used to add new column without changing the existing column in the salary dataset. By using mutate function, the monthly salary for the employees in the Atlanta region is calculated based on their annual salary.
salary <- mutate(salary, monthly.salary = annual.salary / 12)
salary
## # A tibble: 8,246 x 8
## name age sex ethnic.origin job.title organization annual.salary
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Aaro~ 38 Male Black or Afr~ ATL311 T~ EXE Executi~ 46000.
## 2 Aaro~ 52 Male Black or Afr~ Environm~ DPW Departm~ 26700.
## 3 Aaro~ 44 Male Black or Afr~ Watershe~ DWM Departm~ 46575.
## 4 Abaz~ 42 Fema~ Black or Afr~ Benefits~ DHR Departm~ 42867.
## 5 Abbo~ 32 Male White Recreati~ PRC Parks, ~ 28035.
## 6 Abbo~ 44 Male White Manageme~ DWM Departm~ 67800.
## 7 Abdu~ 33 Male Black or Afr~ Police I~ APD Atlanta~ 46378.
## 8 Abdu~ 28 Fema~ Black or Afr~ Police O~ APD Atlanta~ 39328.
## 9 Abed~ 58 Male White Manageme~ DWM Departm~ 125000.
## 10 Aber~ 45 Male White Police I~ APD Atlanta~ 60466.
## # ... with 8,236 more rows, and 1 more variable: monthly.salary <dbl>
For analysis purposs it is handy to have all the dplyr actions in a single command. This is achieved through the pipe operator(%>%). The pipe inputs the result from the left-hand side as the first argument of the right-hand side function. The below code does the filtering based on a job title “Data/Reporting Analyst”, creates a new column to show their monthly salary and then sorts the data based on the monthly salary.
salary %>%
filter(job.title == "Data/Reporting Analyst") %>%
mutate(monthly.salary = annual.salary / 12) %>%
arrange(desc(monthly.salary))
## # A tibble: 56 x 8
## name age sex ethnic.origin job.title organization annual.salary
## <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ivey~ 38 Fema~ Black or Afr~ Data/Rep~ DOA Departm~ 58706.
## 2 Rile~ 55 Fema~ Black or Afr~ Data/Rep~ DPW Departm~ 56665.
## 3 Cypr~ 50 Fema~ Black or Afr~ Data/Rep~ DWM Departm~ 56393.
## 4 Beth~ 65 Fema~ Black or Afr~ Data/Rep~ DWM Departm~ 56176.
## 5 Tuck~ 48 Male White Data/Rep~ DWM Departm~ 54081.
## 6 Ligg~ 51 Fema~ Black or Afr~ Data/Rep~ DWM Departm~ 53820
## 7 Dobs~ 48 Male Black or Afr~ Data/Rep~ DPW Departm~ 53033.
## 8 Rive~ 62 Male White Data/Rep~ DWM Departm~ 52578.
## 9 Tram~ 46 Fema~ Black or Afr~ Data/Rep~ APD Atlanta~ 52028.
## 10 Dunc~ 56 Fema~ Black or Afr~ Data/Rep~ DOF Departm~ 51855.
## # ... with 46 more rows, and 1 more variable: monthly.salary <dbl>
Dplyr has an extra summary functionality in the form of the summarise().This helps to explore the dataset by computing relevant summary statistics, such as means, ranges, variances, differences, etc. The power of summarise() is in its combined use with the group_by() function, which makes it easy to make grouped summaries. The below example creates a grouped summary of the monthly salary per gender in the salary dataset.
salary %>%
group_by(sex) %>%
summarise(
mean = mean(monthly.salary),
variance = var(monthly.salary),
min = min(monthly.salary),
max = max(monthly.salary)
)
## # A tibble: 2 x 5
## sex mean variance min max
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female 4088. 3028628. 1733. 20058.
## 2 Male 4086. 2668029. 1733. 20058.
Plots can be made in R by using plot(), hist() or barplot() functions. The salary dataset is analysed with the help of these functions.
In the salary dataset,it is interesting to see the distribution of the employees based on their age.The below histogram shows the number of employees per age group. It is clear from the histogram that the highest number of employees are in the age between 40 and 55.
hist(salary$age, xlab ="Employee age")
barplot(table(salary$sex))
From this barplot it is clear that the number of male employees in Atlanta outweigh the number of female employees.
ggplot2 is a data visualization package which implements “Grammar of Graphics”. The below scatter plot shows the distribution of annual salary of employees in Atlanta based on their age.
salary_plot <-
ggplot(salary, aes(x = age, y = annual.salary)) +
geom_point() +
labs(title = "Comparison between employee's age and annual salary",
x = "employee age", y = "annual salary")
salary_plot
Facet is used to display a different subset of the data. The below plots show the salary distribution for the ethnic groups as different subsets. It includes color aesthetic to distinguish the distribution between the male and female.
ggplot(salary, aes(x = age, y = annual.salary, color = sex)) +
geom_point() +
facet_wrap(~ethnic.origin)
This below barplot shows the number of employees working in each of the organization in Atlanta region. One clear indication is that the number of employees in the “Atlanta Police Department” completely outweigh the number of employees in other departments. “Atlanta Police Department” provides more employment opportunities than any other organization.
salary %>%
ggplot(aes(x = organization)) +
geom_bar(aes(fill = sex), position = position_stack(reverse = TRUE)) +
coord_flip()
In the above barplot it is clear that the number of employees working in “Atlanta Police Department” is the highest in the Atlanta region as per the salary dataset. So, Its interesting to analyse the density of the employees age in that department. geom_rug() function is used to add a rug size to the density plot.
salary %>%
filter(organization == "APD Atlanta Police Department") %>%
ggplot(aes(x = age)) +
geom_density(fill = "light seagreen") +
geom_rug(size = 1, colour = "red")
The boxplot provides visual comparison of the distribution of the employee groups based on their job title(for 2 sample job titles) through their summary statistics. The below boxplot compares the statistics of “Data/Reporting Analyst” and “Data Technician”.
salary %>%
filter(job.title == "Data/Reporting Analyst" |
job.title == "Data Technician") %>%
ggplot(aes(x = job.title, y = annual.salary, fill = job.title)) +
geom_boxplot() +
theme_minimal()
The below diagram shows the line plot. It shows the salary distribution for different ages for “Black or African American” and “White” ethnic groups.
job <-
salary %>%
filter(job.title == "Data/Reporting Analyst, Sr" |
job.title == "Data/Reporting Analyst") %>%
filter(ethnic.origin == "Black or African American" |
ethnic.origin == "White") %>%
ggplot(aes(x = age, y = annual.salary, color= ethnic.origin)) +
geom_line() +
geom_point() +
theme_minimal()
job
For this salary dataset, the analysis is started with the summary() function. The important statistics like maximum salary, minimum salary and mean salary are known through summary(). Then the salaries for ‘Data/Reporting Analyst’ jobs are filtered to visualise thier annual salaries. It is interesting to note that the salaries of the ‘Data/Reporting Analyst’ does not deviate much from the overall mean.It is also evident that the average salaries for male and female are almost the same. This shows that there is no wage gap between the two genders.
It is also interesting to know the distribution of the employees based on their age. The histogram shows the number of employees per age group. It is clear that the highest number of employees are in the age between 40 and 55. As expected, this distribution is normally distributed with the peak for the age group 45-50 and tail on both sides. Another appealing fact is the number of male and female employees. From the barplot, it is obvious that the number of male employees in Atlanta outweigh the number of female employees. The ggplot helps us to understand that the salaries for the various roles increase with age. If the outliers (i.e. exceptionally high paying roles like CEO) are ignored, then the age vs salary distribution also resemble a normal distribution. This shows that salary increases with age and after a certain age (around 55 years), there is a decline in salary with age. The questions like ‘Are some ethnic groups better employed than the others?’,‘Does some ethnic groups receive higher salary than the other?’ are answered by the ggplot with Facetwrap. The plots show that ‘Black or African American’ and ‘White’ ethnic groups are more employed than the other groups. But, there can also be other factors like higher population of these ethnic groups than others which might influence this visualisation. Those detailed analysis are out of scope of this work because those analysis require extensive population data that are not part of this selected data set.
From the barplot which shows the number of employees working in each of the organization in Atlanta region, it is clear that the number of employees in the “Atlanta Police Department” completely outweigh the number of employees in other organizations. “Atlanta Police Department” provides more employment opportunities than any other organization in the dataset. The density of the employees age in the “Atlanta Police Department” is visualised through the density plot. The boxplot compares the statistics of “Data/Reporting Analyst” and “Data Technician”. Interestingly, the “Data/Reporting Analyst” jobs are higher paying than the “Data Technician” jobs. The median salary for “Data/Reporting Analyst” jobs are reasonably higher than the median salary for “Data Technician” jobs. The line plot shows the salary distribution for different ages for “Black or African American” and “White” ethnic groups. If the outliers are excluded, then there is no apparant difference in the salaries of these two ethnic groups for “Data/Reporting Analyst” jobs.