Synopsis

This is the week-6 homework for the BANA 8090 Data Wrangling With R course. This week focuses on providing a project proposal for the final project.The aim of the project is to analyze the Payroll Data of Los Angeles City and derive insights such as:

Details

Data Description

In this section, a brief description of the dataset has been provided:

  • The City Payroll Data will be used for the final project.
  • The data has been obtained from the Opendatanetwork.com website and contains payroll information of all Los Angeles City Departments from 2013-2016.
  • The data is updated on a quarterly basis by the Los Angeles City Controller’s Office.
  • It provides Payment, Costs and Rates for various departments from 2013 to 2016
  • Further, details about Bonus/Longevity/Overtime Pays and other adjustment Pays has also been provided
  • We also have the Employment Types, job titles and pay grades within each department for 4 years

Codebook of the dataset has not been provided on the website. However, some of the variables which will be used for analysis have been described below:

  • Year - This indicates the year of payment and varies from 2013 to 2016
  • Department.Title - This indicates the name of the department
  • Employment.Type - This indicates the type pf employment (part time, full time or per event)
  • Payments - This indicates the quarterly payment made for a particular job type in a particular year by a certain department
  • Average.Cost - This indicates the different types of average cost incurred by the department

Data Import

In this section, we import the data from the Controller Data website using the RSocrata package. To accomplish this, we load the RSocrata library. The data set is read as a data frame and then converted to a tibble.

library(RSocrata)
library(tidyverse)

url <- "https://controllerdata.lacity.org/Payroll/City-Payroll-Data/pazn-qyym"
data <- read.socrata(url)
payroll <- as_tibble(data)
head(payroll)

Data Cleaning

In this section, we anayze and clean our dataset to get it in a tidy format, so we can perform analysis on on it. To accomplish this, we look at the dimensions of the dataset, number of missing values and complete cases.

colnames(payroll)
##  [1] "Row.ID"                       "Year"                        
##  [3] "Department.Title"             "Payroll.Department"          
##  [5] "Record.Number"                "Job.Class.Title"             
##  [7] "Employment.Type"              "Hourly.or.Event.Rate"        
##  [9] "Projected.Annual.Salary"      "Q1.Payments"                 
## [11] "Q2.Payments"                  "Q3.Payments"                 
## [13] "Q4.Payments"                  "Payments.Over.Base.Pay"      
## [15] "X..Over.Base.Pay"             "Total.Payments"              
## [17] "Base.Pay"                     "Permanent.Bonus.Pay"         
## [19] "Longevity.Bonus.Pay"          "Temporary.Bonus.Pay"         
## [21] "Lump.Sum.Pay"                 "Overtime.Pay"                
## [23] "Other.Pay...Adjustments"      "Other.Pay..Payroll.Explorer."
## [25] "MOU"                          "MOU.Title"                   
## [27] "FMS.Department"               "Job.Class"                   
## [29] "Pay.Grade"                    "Average.Health.Cost"         
## [31] "Average.Dental.Cost"          "Average.Basic.Life"          
## [33] "Average.Benefit.Cost"         "Benefits.Plan"               
## [35] "Job.Class.Link"
nrow(payroll)
## [1] 285008
dim(payroll)
## [1] 285008     35
summary(payroll)
##      Row.ID            Year                       Department.Title
##  Min.   :     1   Min.   :2013   Police (LAPD)            :65846  
##  1st Qu.: 71888   1st Qu.:2014   Water And Power (DWP)    :53341  
##  Median :143505   Median :2015   Recreation And Parks     :35219  
##  Mean   :143140   Mean   :2015   Airports (LAWA)          :18905  
##  3rd Qu.:214688   3rd Qu.:2016   Fire (LAFD)              :18078  
##  Max.   :285208   Max.   :2016   Public Works - Sanitation:14679  
##                                  (Other)                  :78940  
##  Payroll.Department Record.Number                   Job.Class.Title  
##  Min.   : 101       Min.   :1.953e+05   Police Officer II   : 21160  
##  1st Qu.:3001       1st Qu.:1.074e+09   Police Officer III  : 10927  
##  Median :4301       Median :2.147e+09   Recreation Assistant: 10214  
##  Mean   :4538       Mean   :2.150e+09   Firefighter III     :  8739  
##  3rd Qu.:7031       3rd Qu.:3.228e+09   Administrative Clerk:  5161  
##  Max.   :8702       Max.   :4.295e+09   Security Officer    :  4285  
##  NA's   :53341                          (Other)             :224522  
##   Employment.Type   Hourly.or.Event.Rate Projected.Annual.Salary
##  Full Time:236720   Min.   :  0.00       Min.   :     0         
##  Part Time: 47373   1st Qu.: 26.50       1st Qu.: 56919         
##  Per Event:   915   Median : 37.44       Median : 79073         
##                     Mean   : 38.96       Mean   : 80112         
##                     3rd Qu.: 47.66       3rd Qu.:100037         
##                     Max.   :900.00       Max.   :360744         
##                     NA's   :20817                               
##   Q1.Payments      Q2.Payments        Q3.Payments      Q4.Payments    
##  Min.   : -2255   Min.   :-12607.9   Min.   :-34123   Min.   :-30728  
##  1st Qu.: 12434   1st Qu.:   192.5   1st Qu.:     0   1st Qu.:     0  
##  Median : 21424   Median : 17723.4   Median :  4312   Median :  3840  
##  Mean   : 21236   Mean   : 16995.3   Mean   : 12245   Mean   : 12497  
##  3rd Qu.: 29014   3rd Qu.: 27092.8   3rd Qu.: 23269   3rd Qu.: 24100  
##  Max.   :262966   Max.   :191435.4   Max.   :291317   Max.   :500931  
##                                                                       
##  Payments.Over.Base.Pay X..Over.Base.Pay Total.Payments      Base.Pay     
##  Min.   :-30878         0.00%  : 37043   Min.   :-30728   Min.   : -3838  
##  1st Qu.:   650         0.03%  :  7559   1st Qu.: 23007   1st Qu.: 19732  
##  Median :  4396         0.01%  :  4643   Median : 55045   Median : 46342  
##  Mean   : 11379         0.02%  :  3839   Mean   : 62974   Mean   : 51594  
##  3rd Qu.: 13251         0.08%  :  2872   3rd Qu.: 95819   3rd Qu.: 82912  
##  Max.   :435622         0.09%  :  2702   Max.   :500931   Max.   :359362  
##                         (Other):226350                                    
##  Permanent.Bonus.Pay Longevity.Bonus.Pay Temporary.Bonus.Pay
##  Min.   :-6476       Min.   : -108.57    Min.   : -6283.5   
##  1st Qu.:    0       1st Qu.:    0.00    1st Qu.:     0.0   
##  Median :    0       Median :    0.00    Median :     0.0   
##  Mean   : 1497       Mean   :  660.48    Mean   :   462.9   
##  3rd Qu.: 1802       3rd Qu.:   72.98    3rd Qu.:   670.0   
##  Max.   :60728       Max.   :22354.64    Max.   :211066.6   
##                                                             
##   Lump.Sum.Pay       Overtime.Pay      Other.Pay...Adjustments
##  Min.   :  -102.7   Min.   : -1019.5   $0.00  :131327         
##  1st Qu.:     0.0   1st Qu.:     0.0   $132.00:  4267         
##  Median :     0.0   Median :   334.7   $600.00:  3545         
##  Mean   :   735.4   Mean   :  6396.9   $264.00:  2465         
##  3rd Qu.:     0.0   3rd Qu.:  5903.0   $150.00:  2402         
##  Max.   :174325.3   Max.   :311315.6   $300.00:  2333         
##  NA's   :42061      NA's   :433        (Other):138669         
##  Other.Pay..Payroll.Explorer.     MOU           
##  $0.00  : 36357               Length:285008     
##  $100.00:  2328               Class :character  
##  $200.00:  2104               Mode  :character  
##  $150.00:  1782                                 
##  $600.00:  1145                                 
##  $300.00:   798                                 
##  (Other):240494                                 
##                            MOU.Title      FMS.Department    Job.Class   
##                                 : 20817   Min.   : 2.00   Min.   :   1  
##  POLICE OFFICERS UNIT           : 20214   1st Qu.:42.00   1st Qu.:1961  
##  CLERICAL UNIT                  : 14087   Median :70.00   Median :2236  
##  Clerical                       : 11983   Mean   :66.05   Mean   :3056  
##  POLICE OFFICERS, LTS. AND BELOW: 10284   3rd Qu.:88.00   3rd Qu.:3580  
##  Police Officers, Lts. and Below:  9972   Max.   :98.00   Max.   :9998  
##  (Other)                        :197651                                 
##    Pay.Grade     Average.Health.Cost Average.Dental.Cost
##  0      :67365   Min.   :    0       Min.   :   0.0     
##  2      :49798   1st Qu.:10710       1st Qu.: 405.2     
##         :33710   Median :11470       Median : 412.3     
##  1      :24580   Mean   :10632       Mean   : 612.6     
##  3      :22508   3rd Qu.:13025       3rd Qu.: 898.1     
##  (Other):23790   Max.   :18144       Max.   :1282.6     
##  NA's   :63257                                          
##  Average.Basic.Life Average.Benefit.Cost Benefits.Plan  
##  Min.   :  0.00     Min.   :    0              : 35905  
##  1st Qu.:  0.00     1st Qu.:11127        City  :125853  
##  Median : 11.40     Median :11894        DWP   : 52827  
##  Mean   : 48.97     Mean   :11294        Fire  : 16513  
##  3rd Qu.: 11.45     3rd Qu.:14074        Police: 50910  
##  Max.   :193.32     Max.   :19393        NA's  :  3000  
##                                                         
##                                  Job.Class.Link  
##  http://per.lacity.org/perspecs/2214.pdf: 34837  
##  http://per.lacity.org/perspecs/2498.pdf: 10214  
##  http://per.lacity.org/perspecs/2112.pdf:  9743  
##  http://per.lacity.org/perspecs/1358.pdf:  9000  
##  http://per.lacity.org/perspecs/2223.pdf:  7997  
##  http://per.lacity.org/perspecs/1368.pdf:  6811  
##  (Other)                                :206406
sum(is.na(payroll))
## [1] 203672
sum(complete.cases(payroll))
## [1] 189015

We now remove the NA values and check the percentage of data removed.

payroll_clean <- payroll %>%
                    na.omit()

Deleted_pcnt <- (1 - (nrow(payroll_clean)/nrow(payroll)))*100
Deleted_pcnt
## [1] 33.68081

We observe that the percentage of data deleted is not very significant. Further, prior to performing an analysis, we check if we removed any values in the required variables due to the above data cleaning.

Additionally, we observe that some of the columns have information that can be derived from other datasets. For example - Payments from each quarter are listed in 4 different columns. There is also an additional column listing the Total payments which can be derived from the quarterly payments. Hence, we can get rid of some of these columns which present repeated or derived information.

payroll_clean %>%
  mutate(avgCost = as.numeric(Average.Health.Cost) + as.numeric(Average.Dental.Cost) + as.numeric(Average.Benefit.Cost) + as.numeric(Average.Basic.Life)) %>%
  subset(,c(1:4,6:9,16,21,28:29,34,36)) %>%
  dim()
## [1] 189015     14

Planned Analysis

I plan to perform the following analysis:

  • Analyze the trend of Total payments over years (both departmental and mean values)
  • Analyze the variation of average cost of living across Years
  • Vizualize top 10 job class titles with respect to Payments, faceted by Employment Type
  • Analyze correlation between Total Costs and Total Payments, faceted by Year