This project focusses on analysis of Payroll Data of Los Angeles City and derivation of insights such as:
These insights can then be used to form development policies of the city/companies for strategic growth.
Several analysis visualizing the trends of salaries, costs and payments have been performed
In this section, we list down the packages required to reproduce the code and results. For example - the data is imported from the Controller Data website using the RSocrata package. To accomplish this, we load the RSocrata library.
library(RSocrata) # Reading data from the web
library(tidyverse) # Tidying data
library(dplyr) # Transforming data
library(ggplot2) # Visualizing data
library(data.table) # Reading and presenting data tables
library(stringr) # Text mining
library(tm) # Text mining
library(SnowballC) # Text mining
library(wordcloud) # Word cloud package
library(RColorBrewer) # Colour Palette selection
library(corrplot) # Finding correlation
payroll <- as_tibble(read.csv("C:/Manisha_Arora/UC-BANA/Sem1/Data Wrangling R - BB/Data Wrangling with R (BANA 8090)/Final Project/Raw data/City_Payroll_Data_csv.csv"))
head(payroll)
A brief description of the dataset has been provided:
Codebook of the dataset has not been provided on the website. However, some of the variables which will be used for analysis have been described below:
Before we clean our dataset to get it in a tidy format, ready for analysis, we look at the dimensions of the dataset, number of missing values and complete cases.
colnames(payroll) #lists down the different column names of the dataset
[1] "Row.ID" "Year"
[3] "Department.Title" "Payroll.Department"
[5] "Record.Number" "Job.Class.Title"
[7] "Employment.Type" "Hourly.or.Event.Rate"
[9] "Projected.Annual.Salary" "Q1.Payments"
[11] "Q2.Payments" "Q3.Payments"
[13] "Q4.Payments" "Payments.Over.Base.Pay"
[15] "X..Over.Base.Pay" "Total.Payments"
[17] "Base.Pay" "Permanent.Bonus.Pay"
[19] "Longevity.Bonus.Pay" "Temporary.Bonus.Pay"
[21] "Lump.Sum.Pay" "Overtime.Pay"
[23] "Other.Pay...Adjustments" "Other.Pay..Payroll.Explorer."
[25] "MOU" "MOU.Title"
[27] "FMS.Department" "Job.Class"
[29] "Pay.Grade" "Average.Health.Cost"
[31] "Average.Dental.Cost" "Average.Basic.Life"
[33] "Average.Benefit.Cost" "Benefits.Plan"
[35] "Job.Class.Link"
dim(payroll) # Provides the dimensions of the data ie. number of observations and variables
[1] 285008 35
sum(is.na(payroll)) # Counts the number of missing values
[1] 53545
sum(complete.cases(payroll)) # Provides the number of complete cases ie observations with no NAs
[1] 231463
We now remove the NA values and check the percentage of data removed.
payroll_clean <- payroll %>%
na.omit()
Deleted_pcnt <- (1 - (nrow(payroll_clean)/nrow(payroll)))*100
Deleted_pcnt
[1] 18.78719
We observe that the percentage of data deleted is not very significant. Further, prior to performing an analysis, we check if we removed any values in the required variables due to the above data cleaning.
Additionally, we observe that some of the columns have information that can be derived from other datasets. For example - Payments from each quarter are listed in 4 different columns. There is also an additional column listing the Total payments which can be derived from the quarterly payments. Hence, we can get rid of some of these columns which present repeated or derived information.
That being said, we will only use the clened dataset where required. This is because removing missing values across the data has also led to loss of relevant information from columns that did not contain missing values. Hence, if we need only a limited set of variables for a particular analysis, preference is given to the original dataset if it contains no missing values for those particular variables. This is done to preserve information and obtain more accurate results.
payroll_clean <- payroll_clean %>%
mutate(avgCost = as.numeric(Average.Health.Cost) + as.numeric(Average.Dental.Cost) + as.numeric(Average.Benefit.Cost) + as.numeric(Average.Basic.Life)) %>%
subset(,c(1:4,6:9,16,21,28:29,34,36))
dim(payroll_clean)
[1] 231463 14
head(payroll_clean)
The primary purpose of this analysis is to analyze trends over years and derive interdependency relationships.
We analyze the trend of Total Payments across years.
detach(package:ggplot2)
library(ggplot2)
#The above lines of code is run to ensure the annotate function is picked up from the ggplot2 package and not the default NLP package, which also contains a similar function but would not fulfil the purpose we want here.
payroll %>%
select(Year, Department.Title, Total.Payments) %>%
group_by(Year) %>%
summarise(avgTotalPayments = mean(as.numeric(Total.Payments))) %>%
ungroup() %>%
ggplot(aes(x= Year, y=avgTotalPayments)) +
geom_line(linetype="solid", alpha=1, size=1.2, colour="darkseagreen") + geom_point(colour="black", size=2) +
ylab("Average Payment (USD)") +
geom_hline(aes(yintercept = mean(as.numeric(payroll$Total.Payments))), linetype="dashed", alpha=.6) +
annotate("text", x = 2015, y = 131000, label = "Average: 128,627", size=3) +
ylim(limits=c(100000,150000))
The above graph shows a dip in the average payment over an year in 2014 and 2015. However, we see an increased payment for 2016. This can be compared to the overall overage of ~128k USD.
Next, we try to visualize variation of Payments and Projected Salaried across years.
payroll %>%
select(Year, Department.Title, Total.Payments, Projected.Annual.Salary) %>%
group_by(Department.Title, Year) %>%
summarise(avgTotalPayments = mean(as.numeric(Total.Payments)), avgAnnualSalary = mean(as.numeric(Projected.Annual.Salary))) %>%
ungroup() %>%
ggplot(aes(x=avgTotalPayments, y=avgAnnualSalary)) +
geom_point(colour="tomato", size=2) +
facet_wrap(~Year) +
labs(x="Total Payments (USD)", y="Annual Salary (USD)") +
geom_smooth(method="lm")
The above graph shows that Payments and projected salaries usually had a positive correlation (though not very strong) over 2013-2015. However, we observe a slight negative correlation between the two for year 2016.
To deep dive into the correlation of various parameters, a correlation graph between the variables is plotted.
Payroll_subset <- payroll[,c(8,9,16:17,30:33)] %>%
mutate(Cost = as.numeric(sub('\\$','',as.character(Average.Health.Cost))) +
as.numeric(sub('\\$','',as.character(Average.Dental.Cost))) +
as.numeric(sub('\\$','',as.character(Average.Benefit.Cost))) +
as.numeric(sub('\\$','',as.character(Average.Basic.Life)))) %>%
mutate(Salary = as.numeric(sub('\\$','',as.character(Projected.Annual.Salary)))) %>%
mutate(Payment = as.numeric(sub('\\$','',as.character(Total.Payments)))) %>%
mutate(BasePay = as.numeric(sub('\\$','',as.character(Base.Pay))))
Payroll_subset[,c(9:12)] %>%
cor() %>%
corrplot(method="square", reorder=TRUE)
The graph shows a significant correlation between the total payments and salary. SImilarly, the correlation between Cost and Salary is also pretty high.
Next, we analyze the various types of Benefit plans for the city. We observe that there are 4 different kinds of benefit plans - City, DWP, Fire and Police.
payroll %>%
filter(Benefits.Plan != '') %>%
mutate(BenefitsPlan = as.character(Benefits.Plan)) %>%
mutate(AvgBenCost = as.numeric(sub('\\$','',as.character(Average.Benefit.Cost)))) %>%
group_by(BenefitsPlan) %>%
summarise(avgBenefits = sum(AvgBenCost)) %>%
ungroup() %>%
ggplot(aes(x=reorder(BenefitsPlan, -avgBenefits), y=avgBenefits)) +
geom_bar(stat="identity", colour="black", fill="tomato", width = 0.5) +
labs(x="Benefit Plan", y="Average Benefit (USD)")
The graph shows the average benefit Cost under the ‘City’ plan is highest, followed by ‘DWP’.
Lastly, we observe that there are various different kinds of job titles. We plot a word cloud to show the most commonly occurring job titles.
text <- payroll$Job.Class.Title
docs <- Corpus(VectorSource(text))
docs <- tm_map(docs, removeWords, c("I", "II", "III", "and", "senior"))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 250,
max.words=75, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
We plot the top 5 job titles based on Total Payments, for a particular year filtered by a phrase(s) in the Job Title.
a <- payroll %>%
subset(select = Job.Class.Title) %>%
unique()
b <- function(a, Y,J){
a %>%
filter(Year==Y) %>%
filter(str_detect(Job.Class.Title, regex(J, ignore_case=TRUE))) %>%
select(c(Row.ID,Year,Job.Class.Title, Total.Payments)) %>%
top_n(5) %>%
ggplot(mapping=aes(x=Job.Class.Title, y=Total.Payments)) +
geom_bar(stat="identity", width=0.6, fill="dodgerblue") +
labs(x=" Job Title", y="Total Payments")
}
b(payroll, 2014, "Account")
b(payroll, 2016, "Police")
Through the analysis performed, we observed how the payroll data for the Los Angeles City has changed over the past few years. Further, analysis was performed to check the variation of payments, costs and salaries with change in parameters.
The payroll data was analyzed by plotting line graphs to show trends in Payments. Scatter plots and correlation matrix showed dependency across Costs, Payments and Salaries. Different kinds of Benefit plans were analyzed to indicate the one with large benefits. Additionally, the plot of word cloud and job title graphs show commonly ocurring titles and the ones with highest payment based on a keyword.
Some of the key insights from the analysis are presented: