Final_Project

Details

Packages Required

In this section, we list down the packages required to reproduce the code and results. For example - the data is imported from the Controller Data website using the RSocrata package. To accomplish this, we load the RSocrata library.

library(RSocrata) # Reading data from the web
library(tidyverse) # Tidying data
library(dplyr) # Transforming data
library(ggplot2) # Visualizing data
library(data.table) # Reading and presenting data tables
library(stringr) # Text mining
library(tm) # Text mining
library(SnowballC) # Text mining
library(wordcloud) # Word cloud package
library(RColorBrewer) # Colour Palette selection
library(corrplot) # Finding correlation

payroll <- as_tibble(read.csv("C:/Manisha_Arora/UC-BANA/Sem1/Data Wrangling R - BB/Data Wrangling with R (BANA 8090)/Final Project/Raw data/City_Payroll_Data_csv.csv"))
head(payroll)

Data Preparation

A brief description of the dataset has been provided:

The City Payroll Data will be used for the final project.
The data has been obtained from the Opendatanetwork.com website and contains payroll information of all Los Angeles City Departments from 2013-2016.
The data is updated on a quarterly basis by the Los Angeles City Controller’s Office.
It provides Payment, Costs and Rates for various departments from 2013 to 2016
Further, details about Bonus/Longevity/Overtime Pays and other adjustment Pays has also been provided
We also have the Employment Types, job titles and pay grades within each department for 4 years

Codebook of the dataset has not been provided on the website. However, some of the variables which will be used for analysis have been described below:

Year - This indicates the year of payment and varies from 2013 to 2016
Department.Title - This indicates the name of the department
Employment.Type - This indicates the type pf employment (part time, full time or per event)
Payments - This indicates the quarterly payment made for a particular job type in a particular year by a certain department
Average.Cost - This indicates the different types of average cost incurred by the department

Data Cleaning

Before we clean our dataset to get it in a tidy format, ready for analysis, we look at the dimensions of the dataset, number of missing values and complete cases.

colnames(payroll) #lists down the different column names of the dataset

 [1] "Row.ID"                       "Year"                        
 [3] "Department.Title"             "Payroll.Department"          
 [5] "Record.Number"                "Job.Class.Title"             
 [7] "Employment.Type"              "Hourly.or.Event.Rate"        
 [9] "Projected.Annual.Salary"      "Q1.Payments"                 
[11] "Q2.Payments"                  "Q3.Payments"                 
[13] "Q4.Payments"                  "Payments.Over.Base.Pay"      
[15] "X..Over.Base.Pay"             "Total.Payments"              
[17] "Base.Pay"                     "Permanent.Bonus.Pay"         
[19] "Longevity.Bonus.Pay"          "Temporary.Bonus.Pay"         
[21] "Lump.Sum.Pay"                 "Overtime.Pay"                
[23] "Other.Pay...Adjustments"      "Other.Pay..Payroll.Explorer."
[25] "MOU"                          "MOU.Title"                   
[27] "FMS.Department"               "Job.Class"                   
[29] "Pay.Grade"                    "Average.Health.Cost"         
[31] "Average.Dental.Cost"          "Average.Basic.Life"          
[33] "Average.Benefit.Cost"         "Benefits.Plan"               
[35] "Job.Class.Link"

dim(payroll) # Provides the dimensions of the data ie. number of observations and variables

[1] 285008     35

sum(is.na(payroll)) # Counts the number of missing values

[1] 53545

sum(complete.cases(payroll)) # Provides the number of complete cases ie observations with no NAs

[1] 231463

We now remove the NA values and check the percentage of data removed.

payroll_clean <- payroll %>%
                    na.omit()

Deleted_pcnt <- (1 - (nrow(payroll_clean)/nrow(payroll)))*100
Deleted_pcnt

[1] 18.78719

We observe that the percentage of data deleted is not very significant. Further, prior to performing an analysis, we check if we removed any values in the required variables due to the above data cleaning.

Additionally, we observe that some of the columns have information that can be derived from other datasets. For example - Payments from each quarter are listed in 4 different columns. There is also an additional column listing the Total payments which can be derived from the quarterly payments. Hence, we can get rid of some of these columns which present repeated or derived information.

That being said, we will only use the clened dataset where required. This is because removing missing values across the data has also led to loss of relevant information from columns that did not contain missing values. Hence, if we need only a limited set of variables for a particular analysis, preference is given to the original dataset if it contains no missing values for those particular variables. This is done to preserve information and obtain more accurate results.

payroll_clean <- payroll_clean %>%
                  mutate(avgCost = as.numeric(Average.Health.Cost) + as.numeric(Average.Dental.Cost) + as.numeric(Average.Benefit.Cost) + as.numeric(Average.Basic.Life)) %>%
                  subset(,c(1:4,6:9,16,21,28:29,34,36))

dim(payroll_clean)

[1] 231463     14

head(payroll_clean)

Exploratory Analysis

The primary purpose of this analysis is to analyze trends over years and derive interdependency relationships.

We analyze the trend of Total Payments across years.

detach(package:ggplot2)
library(ggplot2)
#The above lines of code is run to ensure the annotate function is picked up from the ggplot2 package and not the default NLP package, which also contains a similar function but would not fulfil the purpose we want here.

payroll %>%
  select(Year, Department.Title, Total.Payments) %>%
  group_by(Year) %>%
  summarise(avgTotalPayments = mean(as.numeric(Total.Payments))) %>%
  ungroup() %>%
  ggplot(aes(x= Year, y=avgTotalPayments)) +
  geom_line(linetype="solid", alpha=1, size=1.2, colour="darkseagreen") + geom_point(colour="black", size=2) + 
  ylab("Average Payment (USD)") +
  geom_hline(aes(yintercept = mean(as.numeric(payroll$Total.Payments))), linetype="dashed", alpha=.6) +
  annotate("text", x = 2015, y = 131000, label = "Average: 128,627", size=3) +
  ylim(limits=c(100000,150000))

The above graph shows a dip in the average payment over an year in 2014 and 2015. However, we see an increased payment for 2016. This can be compared to the overall overage of ~128k USD.

Next, we try to visualize variation of Payments and Projected Salaried across years.

payroll %>%
  select(Year, Department.Title, Total.Payments, Projected.Annual.Salary) %>%
  group_by(Department.Title, Year) %>%
  summarise(avgTotalPayments = mean(as.numeric(Total.Payments)), avgAnnualSalary = mean(as.numeric(Projected.Annual.Salary))) %>%
  ungroup() %>%
  ggplot(aes(x=avgTotalPayments, y=avgAnnualSalary)) +
  geom_point(colour="tomato", size=2) +
  facet_wrap(~Year) +
  labs(x="Total Payments (USD)", y="Annual Salary (USD)") +
  geom_smooth(method="lm")

The above graph shows that Payments and projected salaries usually had a positive correlation (though not very strong) over 2013-2015. However, we observe a slight negative correlation between the two for year 2016.

To deep dive into the correlation of various parameters, a correlation graph between the variables is plotted.

Payroll_subset <- payroll[,c(8,9,16:17,30:33)] %>%
  mutate(Cost = as.numeric(sub('\\$','',as.character(Average.Health.Cost))) + 
                   as.numeric(sub('\\$','',as.character(Average.Dental.Cost))) +
                   as.numeric(sub('\\$','',as.character(Average.Benefit.Cost))) +
                   as.numeric(sub('\\$','',as.character(Average.Basic.Life)))) %>%
  mutate(Salary = as.numeric(sub('\\$','',as.character(Projected.Annual.Salary)))) %>%
  mutate(Payment = as.numeric(sub('\\$','',as.character(Total.Payments)))) %>%
  mutate(BasePay = as.numeric(sub('\\$','',as.character(Base.Pay))))

Payroll_subset[,c(9:12)] %>%
  cor() %>%
  corrplot(method="square", reorder=TRUE)

The graph shows a significant correlation between the total payments and salary. SImilarly, the correlation between Cost and Salary is also pretty high.

Next, we analyze the various types of Benefit plans for the city. We observe that there are 4 different kinds of benefit plans - City, DWP, Fire and Police.

payroll %>%
  filter(Benefits.Plan != '') %>%
  mutate(BenefitsPlan = as.character(Benefits.Plan)) %>%
  mutate(AvgBenCost = as.numeric(sub('\\$','',as.character(Average.Benefit.Cost)))) %>%
  group_by(BenefitsPlan) %>%
  summarise(avgBenefits = sum(AvgBenCost)) %>%
  ungroup() %>%
  ggplot(aes(x=reorder(BenefitsPlan, -avgBenefits), y=avgBenefits)) +
  geom_bar(stat="identity", colour="black", fill="tomato", width = 0.5) +
  labs(x="Benefit Plan", y="Average Benefit (USD)")

The graph shows the average benefit Cost under the ‘City’ plan is highest, followed by ‘DWP’.

Lastly, we observe that there are various different kinds of job titles. We plot a word cloud to show the most commonly occurring job titles.

text <- payroll$Job.Class.Title
docs <- Corpus(VectorSource(text))
docs <- tm_map(docs, removeWords, c("I", "II", "III", "and", "senior"))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 250,
          max.words=75, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

We plot the top 5 job titles based on Total Payments, for a particular year filtered by a phrase(s) in the Job Title.

a <- payroll %>%
  subset(select = Job.Class.Title) %>%
  unique()

b <- function(a, Y,J){
  a %>%
    filter(Year==Y) %>%
    filter(str_detect(Job.Class.Title, regex(J, ignore_case=TRUE))) %>%
    select(c(Row.ID,Year,Job.Class.Title, Total.Payments)) %>%
    top_n(5) %>%
    ggplot(mapping=aes(x=Job.Class.Title, y=Total.Payments)) +
    geom_bar(stat="identity", width=0.6, fill="dodgerblue") +
    labs(x=" Job Title", y="Total Payments")
}

b(payroll, 2014, "Account")

b(payroll, 2016, "Police")

Summary

Through the analysis performed, we observed how the payroll data for the Los Angeles City has changed over the past few years. Further, analysis was performed to check the variation of payments, costs and salaries with change in parameters.

The payroll data was analyzed by plotting line graphs to show trends in Payments. Scatter plots and correlation matrix showed dependency across Costs, Payments and Salaries. Different kinds of Benefit plans were analyzed to indicate the one with large benefits. Additionally, the plot of word cloud and job title graphs show commonly ocurring titles and the ones with highest payment based on a keyword.

Some of the key insights from the analysis are presented:

The payments made in 2016 were slightly higher than those made in the previous 3 years
There is a positive correlation between Payments and Salaries for 2013-2015. However, this year we observe a slightly negative correlation
Deeper analysis showed that overall Payments, Salaries and Costs are highly correlation, when the year-on-year variations are not taken into account
Most commonly used Job Titles in the Los Angeles City are Police Officer, Clerk and Assistant
The ‘City’ Benefits plan emerges as the Best Plan in the City based on Average yearly benefits, followed by DWP