Exploratory Data Analysis

Introduction

The dataset selected for this Exploratory data analysis is Atlanta City Employee Salaries from data.world (https://data.world/brentbrewington/atlanta-city-employee-salaries). This dataset contains salary details of employees from Atlanta region during 2015. There are 7 columns in this dataset. They are employee name, age, gender, annual salary, ethnic origin, job title and organization. This dataset helps job seekers to better understand the job market in the Atlanta region. It provides information about the openings for various jobs in the job market. It gives an overview of number of openings per organization. This is an interesting dataset for exploratory data analysis because it provides the data to analyse the variation of salary over the age of the employees. Also, it helps to compare the salaries between male and female. It can also be used to explore the difference in the employability of the various ethnic groups.

Loading, viewing, and summarising data

load the packages:

library(tidyverse)
library(readr)

For the exploratory data analysis, the tidyverse package is loaded.

load the dataset

salary <- read_csv("data/atlanta_salary.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   age = col_double(),
##   sex = col_character(),
##   ethnic.origin = col_character(),
##   job.title = col_character(),
##   organization = col_character(),
##   annual.salary = col_double()
## )

Now, the Atlanta City Employee Salaries dataset is loaded. The function read_csv() is used to read the file “atlanta_salary.csv” and store it in a variable called salary. Once the file is read, the output of the fuction will show the specifications of the columns in the dataset.

view() function

TheView()function can be used to view the entire data in the dataframe.

View(salary)

head() function

The head() function is used to check first few rows in the salary dataset.

head(salary)

## # A tibble: 6 x 7
##   name      age sex    ethnic.origin job.title  organization  annual.salary
##   <chr>   <dbl> <chr>  <chr>         <chr>      <chr>                 <dbl>
## 1 Aaron,~    38 Male   Black or Afr~ ATL311 Te~ EXE Executiv~        46000.
## 2 Aaron,~    52 Male   Black or Afr~ Environme~ DPW Departme~        26700.
## 3 Aaron,~    44 Male   Black or Afr~ Watershed~ DWM Departme~        46575.
## 4 Abazen~    42 Female Black or Afr~ Benefits ~ DHR Departme~        42867.
## 5 Abbott~    32 Male   White         Recreatio~ PRC Parks, R~        28035.
## 6 Abbott~    44 Male   White         Managemen~ DWM Departme~        67800.

summary()function

The summary() function provides the overview of the salary dataset.

summary(salary)

##      name                age            sex            ethnic.origin     
##  Length:8246        Min.   :19.00   Length:8246        Length:8246       
##  Class :character   1st Qu.:34.00   Class :character   Class :character  
##  Mode  :character   Median :44.00   Mode  :character   Mode  :character  
##                     Mean   :43.19                                        
##                     3rd Qu.:52.00                                        
##                     Max.   :89.00                                        
##   job.title         organization       annual.salary   
##  Length:8246        Length:8246        Min.   : 20800  
##  Class :character   Class :character   1st Qu.: 36917  
##  Mode  :character   Mode  :character   Median : 43983  
##                                        Mean   : 49040  
##                                        3rd Qu.: 55279  
##                                        Max.   :240698

Data wrangling with dplyr

dplyrverbs such as filter(), arrange() select() and mutate() are used to transform, rearrange, and filter the data in the salary dataframe.They are explained below in detail.

filter()

It is interesting to see the salary details for a specific job title. For example, for job seekers interested in “Data Technician” roles, the salary details for this specific job can be extracted using the filter function.

filter(salary, job.title == "Data Technician", annual.salary)

## # A tibble: 11 x 7
##    name     age sex    ethnic.origin  job.title organization  annual.salary
##    <chr>  <dbl> <chr>  <chr>          <chr>     <chr>                 <dbl>
##  1 Ameri~    44 Female Black or Afri~ Data Tec~ DWM Departme~        42335.
##  2 Berry~    40 Female Black or Afri~ Data Tec~ DWM Departme~        34326.
##  3 Caulw~    41 Female Black or Afri~ Data Tec~ DWM Departme~        38763.
##  4 Davis~    58 Female Black or Afri~ Data Tec~ DPW Departme~        32454.
##  5 Dixon~    46 Female Black or Afri~ Data Tec~ DPW Departme~        31127.
##  6 Dobbs~    36 Female Black or Afri~ Data Tec~ DWM Departme~        34191.
##  7 Hall,~    30 Male   Black or Afri~ Data Tec~ PRC Parks, R~        28189.
##  8 Harve~    41 Female Black or Afri~ Data Tec~ DWM Departme~        37073.
##  9 Jones~    54 Female Black or Afri~ Data Tec~ PRC Parks, R~        32499.
## 10 Walke~    38 Female Black or Afri~ Data Tec~ DWM Departme~        35262 
## 11 Willi~    63 Male   Black or Afri~ Data Tec~ DWM Departme~        42736.

arrange()

It is appealing to know the highest paid jobs in the Atlanta region. If the data is sorted based on the annual salary, then the jobs can be visualized based on the annual salary. The arrange()function is used to sort based on a variable.desc keyword is used to sort the observations in descending order.

arrange(salary, desc(annual.salary))

## # A tibble: 8,246 x 7
##    name     age sex    ethnic.origin job.title  organization  annual.salary
##    <chr>  <dbl> <chr>  <chr>         <chr>      <chr>                 <dbl>
##  1 Gordo~    39 Male   White         Chief Ope~ EXE Executiv~       240698.
##  2 Turne~    56 Male   Black or Afr~ Police Ch~ APD Atlanta ~       240698.
##  3 Hampt~    48 Female Black or Afr~ City Atto~ LAW Law Depa~       240697.
##  4 Beard~    52 Male   Black or Afr~ Chief Fin~ DOF Departme~       221108.
##  5 Geisl~    66 Male   White         Airport D~ EXE Executiv~       221108.
##  6 Saini~    40 Male   Asian         Chief Inf~ DIT Departme~       221000 
##  7 South~    60 Male   Black or Afr~ Airport G~ DOA Departme~       221000 
##  8 Smith~    61 Male   Black or Afr~ Airport D~ DOA Departme~       215250.
##  9 Smith~    52 Male   Black or Afr~ Procureme~ DOP Departme~       202344.
## 10 Rucke~    60 Male   Black or Afr~ Airport G~ DOA Departme~       200000.
## # ... with 8,236 more rows

select()

The select()function selects and displays only the specified columns from the dataset. The below code extracts only sex, age,job title and annual salary columns for analysis from the salary dataset.

select(salary,sex, age,job.title ,annual.salary)

## # A tibble: 8,246 x 4
##    sex      age job.title                          annual.salary
##    <chr>  <dbl> <chr>                                      <dbl>
##  1 Male      38 ATL311 Team Lead                          46000.
##  2 Male      52 Environmental Service Worker I (D)        26700.
##  3 Male      44 Watershed Crew Supervisor (D)             46575.
##  4 Female    42 Benefits Representative, Sr               42867.
##  5 Male      32 Recreation Operations Assistant           28035.
##  6 Male      44 Management Analyst, Sr                    67800.
##  7 Male      33 Police Investigator (E) (F) (O)           46378.
##  8 Female    28 Police Officer (E) (F) (O) (B) (S)        39328.
##  9 Male      58 Management Consulting Director           125000.
## 10 Male      45 Police Investigator (E) (F) (O)           60466.
## # ... with 8,236 more rows

mutate()

It is useful to create new columns based on the data from the existing columns in the dataset. The mutate()function is used to add new column without changing the existing column in the salary dataset. By using mutate function, the monthly salary for the employees in the Atlanta region is calculated based on their annual salary.

salary <- mutate(salary, monthly.salary = annual.salary / 12)
salary

## # A tibble: 8,246 x 8
##    name    age sex   ethnic.origin job.title organization annual.salary
##    <chr> <dbl> <chr> <chr>         <chr>     <chr>                <dbl>
##  1 Aaro~    38 Male  Black or Afr~ ATL311 T~ EXE Executi~        46000.
##  2 Aaro~    52 Male  Black or Afr~ Environm~ DPW Departm~        26700.
##  3 Aaro~    44 Male  Black or Afr~ Watershe~ DWM Departm~        46575.
##  4 Abaz~    42 Fema~ Black or Afr~ Benefits~ DHR Departm~        42867.
##  5 Abbo~    32 Male  White         Recreati~ PRC Parks, ~        28035.
##  6 Abbo~    44 Male  White         Manageme~ DWM Departm~        67800.
##  7 Abdu~    33 Male  Black or Afr~ Police I~ APD Atlanta~        46378.
##  8 Abdu~    28 Fema~ Black or Afr~ Police O~ APD Atlanta~        39328.
##  9 Abed~    58 Male  White         Manageme~ DWM Departm~       125000.
## 10 Aber~    45 Male  White         Police I~ APD Atlanta~        60466.
## # ... with 8,236 more rows, and 1 more variable: monthly.salary <dbl>

Data processing pipelines

For analysis purposs it is handy to have all the dplyr actions in a single command. This is achieved through the pipe operator(%>%). The pipe inputs the result from the left-hand side as the first argument of the right-hand side function. The below code does the filtering based on a job title “Data/Reporting Analyst”, creates a new column to show their monthly salary and then sorts the data based on the monthly salary.

salary %>%
  filter(job.title == "Data/Reporting Analyst") %>%
  mutate(monthly.salary = annual.salary / 12) %>%
  arrange(desc(monthly.salary))

## # A tibble: 56 x 8
##    name    age sex   ethnic.origin job.title organization annual.salary
##    <chr> <dbl> <chr> <chr>         <chr>     <chr>                <dbl>
##  1 Ivey~    38 Fema~ Black or Afr~ Data/Rep~ DOA Departm~        58706.
##  2 Rile~    55 Fema~ Black or Afr~ Data/Rep~ DPW Departm~        56665.
##  3 Cypr~    50 Fema~ Black or Afr~ Data/Rep~ DWM Departm~        56393.
##  4 Beth~    65 Fema~ Black or Afr~ Data/Rep~ DWM Departm~        56176.
##  5 Tuck~    48 Male  White         Data/Rep~ DWM Departm~        54081.
##  6 Ligg~    51 Fema~ Black or Afr~ Data/Rep~ DWM Departm~        53820 
##  7 Dobs~    48 Male  Black or Afr~ Data/Rep~ DPW Departm~        53033.
##  8 Rive~    62 Male  White         Data/Rep~ DWM Departm~        52578.
##  9 Tram~    46 Fema~ Black or Afr~ Data/Rep~ APD Atlanta~        52028.
## 10 Dunc~    56 Fema~ Black or Afr~ Data/Rep~ DOF Departm~        51855.
## # ... with 46 more rows, and 1 more variable: monthly.salary <dbl>

Grouping and summarisation

Dplyr has an extra summary functionality in the form of the summarise().This helps to explore the dataset by computing relevant summary statistics, such as means, ranges, variances, differences, etc. The power of summarise() is in its combined use with the group_by() function, which makes it easy to make grouped summaries. The below example creates a grouped summary of the monthly salary per gender in the salary dataset.

salary %>% 
  group_by(sex) %>% 
  summarise(
    mean = mean(monthly.salary), 
    variance = var(monthly.salary), 
    min = min(monthly.salary), 
    max = max(monthly.salary)
  )

## # A tibble: 2 x 5
##   sex     mean variance   min    max
##   <chr>  <dbl>    <dbl> <dbl>  <dbl>
## 1 Female 4088. 3028628. 1733. 20058.
## 2 Male   4086. 2668029. 1733. 20058.

Data Visualisation using R

Plots can be made in R by using plot(), hist() or barplot() functions. The salary dataset is analysed with the help of these functions.

Histogram

In the salary dataset,it is interesting to see the distribution of the employees based on their age.The below histogram shows the number of employees per age group. It is clear from the histogram that the highest number of employees are in the age between 40 and 55.

hist(salary$age, xlab ="Employee age")

Barplot

barplot(table(salary$sex))

From this barplot it is clear that the number of male employees in Atlanta outweigh the number of female employees.

Data Visualisation using ggplot2

ggplot2 is a data visualization package which implements “Grammar of Graphics”. The below scatter plot shows the distribution of annual salary of employees in Atlanta based on their age.

salary_plot <-
  ggplot(salary, aes(x = age, y = annual.salary)) +
  geom_point() +
  labs(title = "Comparison between employee's age and annual salary",
       x = "employee age", y = "annual salary")

salary_plot

ggplot with Aesthetics and Facet

Facet is used to display a different subset of the data. The below plots show the salary distribution for the ethnic groups as different subsets. It includes color aesthetic to distinguish the distribution between the male and female.

ggplot(salary, aes(x = age, y = annual.salary, color = sex)) +
  geom_point() +
  facet_wrap(~ethnic.origin)

Visual exploratory data analysis

Barplot

This below barplot shows the number of employees working in each of the organization in Atlanta region. One clear indication is that the number of employees in the “Atlanta Police Department” completely outweigh the number of employees in other departments. “Atlanta Police Department” provides more employment opportunities than any other organization.

salary %>% 
  ggplot(aes(x = organization)) + 
  geom_bar(aes(fill = sex), position = position_stack(reverse = TRUE)) +
  coord_flip()

Density graph

In the above barplot it is clear that the number of employees working in “Atlanta Police Department” is the highest in the Atlanta region as per the salary dataset. So, Its interesting to analyse the density of the employees age in that department. geom_rug() function is used to add a rug size to the density plot.

salary %>%
  filter(organization == "APD Atlanta Police Department") %>%
  ggplot(aes(x = age)) +
  geom_density(fill = "light seagreen") +
  geom_rug(size = 1, colour = "red")

Box plot

The boxplot provides visual comparison of the distribution of the employee groups based on their job title(for 2 sample job titles) through their summary statistics. The below boxplot compares the statistics of “Data/Reporting Analyst” and “Data Technician”.

salary %>% 
  filter(job.title == "Data/Reporting Analyst" |
           job.title == "Data Technician") %>%
  ggplot(aes(x = job.title, y = annual.salary, fill = job.title)) +
  geom_boxplot() +
  theme_minimal()

Line plot

The below diagram shows the line plot. It shows the salary distribution for different ages for “Black or African American” and “White” ethnic groups.

job <- 
  salary %>%
  filter(job.title == "Data/Reporting Analyst, Sr" |
           job.title == "Data/Reporting Analyst") %>%
  filter(ethnic.origin == "Black or African American" |
           ethnic.origin == "White") %>%
  ggplot(aes(x = age, y = annual.salary, color= ethnic.origin)) +
  geom_line() +
  geom_point() +
  theme_minimal()
job

Conclusion

For this salary dataset, the analysis is started with the summary() function. The important statistics like maximum salary, minimum salary and mean salary are known through summary(). Then the salaries for ‘Data/Reporting Analyst’ jobs are filtered to visualise thier annual salaries. It is interesting to note that the salaries of the ‘Data/Reporting Analyst’ does not deviate much from the overall mean.It is also evident that the average salaries for male and female are almost the same. This shows that there is no wage gap between the two genders.

It is also interesting to know the distribution of the employees based on their age. The histogram shows the number of employees per age group. It is clear that the highest number of employees are in the age between 40 and 55. As expected, this distribution is normally distributed with the peak for the age group 45-50 and tail on both sides. Another appealing fact is the number of male and female employees. From the barplot, it is obvious that the number of male employees in Atlanta outweigh the number of female employees. The ggplot helps us to understand that the salaries for the various roles increase with age. If the outliers (i.e. exceptionally high paying roles like CEO) are ignored, then the age vs salary distribution also resemble a normal distribution. This shows that salary increases with age and after a certain age (around 55 years), there is a decline in salary with age. The questions like ‘Are some ethnic groups better employed than the others?’,‘Does some ethnic groups receive higher salary than the other?’ are answered by the ggplot with Facetwrap. The plots show that ‘Black or African American’ and ‘White’ ethnic groups are more employed than the other groups. But, there can also be other factors like higher population of these ethnic groups than others which might influence this visualisation. Those detailed analysis are out of scope of this work because those analysis require extensive population data that are not part of this selected data set.

From the barplot which shows the number of employees working in each of the organization in Atlanta region, it is clear that the number of employees in the “Atlanta Police Department” completely outweigh the number of employees in other organizations. “Atlanta Police Department” provides more employment opportunities than any other organization in the dataset. The density of the employees age in the “Atlanta Police Department” is visualised through the density plot. The boxplot compares the statistics of “Data/Reporting Analyst” and “Data Technician”. Interestingly, the “Data/Reporting Analyst” jobs are higher paying than the “Data Technician” jobs. The median salary for “Data/Reporting Analyst” jobs are reasonably higher than the median salary for “Data Technician” jobs. The line plot shows the salary distribution for different ages for “Black or African American” and “White” ethnic groups. If the outliers are excluded, then there is no apparant difference in the salaries of these two ethnic groups for “Data/Reporting Analyst” jobs.