This is the week-6 homework for the BANA 8090 Data Wrangling With R course. This week focuses on providing a project proposal for the final project.The aim of the project is to analyze the Payroll Data of Los Angeles City and derive insights such as:
In this section, a brief description of the dataset has been provided:
Codebook of the dataset has not been provided on the website. However, some of the variables which will be used for analysis have been described below:
In this section, we import the data from the Controller Data website using the RSocrata package. To accomplish this, we load the RSocrata library. The data set is read as a data frame and then converted to a tibble.
library(RSocrata)
library(tidyverse)
url <- "https://controllerdata.lacity.org/Payroll/City-Payroll-Data/pazn-qyym"
data <- read.socrata(url)
payroll <- as_tibble(data)
head(payroll)
In this section, we anayze and clean our dataset to get it in a tidy format, so we can perform analysis on on it. To accomplish this, we look at the dimensions of the dataset, number of missing values and complete cases.
colnames(payroll)
## [1] "Row.ID" "Year"
## [3] "Department.Title" "Payroll.Department"
## [5] "Record.Number" "Job.Class.Title"
## [7] "Employment.Type" "Hourly.or.Event.Rate"
## [9] "Projected.Annual.Salary" "Q1.Payments"
## [11] "Q2.Payments" "Q3.Payments"
## [13] "Q4.Payments" "Payments.Over.Base.Pay"
## [15] "X..Over.Base.Pay" "Total.Payments"
## [17] "Base.Pay" "Permanent.Bonus.Pay"
## [19] "Longevity.Bonus.Pay" "Temporary.Bonus.Pay"
## [21] "Lump.Sum.Pay" "Overtime.Pay"
## [23] "Other.Pay...Adjustments" "Other.Pay..Payroll.Explorer."
## [25] "MOU" "MOU.Title"
## [27] "FMS.Department" "Job.Class"
## [29] "Pay.Grade" "Average.Health.Cost"
## [31] "Average.Dental.Cost" "Average.Basic.Life"
## [33] "Average.Benefit.Cost" "Benefits.Plan"
## [35] "Job.Class.Link"
nrow(payroll)
## [1] 285008
dim(payroll)
## [1] 285008 35
summary(payroll)
## Row.ID Year Department.Title
## Min. : 1 Min. :2013 Police (LAPD) :65846
## 1st Qu.: 71888 1st Qu.:2014 Water And Power (DWP) :53341
## Median :143505 Median :2015 Recreation And Parks :35219
## Mean :143140 Mean :2015 Airports (LAWA) :18905
## 3rd Qu.:214688 3rd Qu.:2016 Fire (LAFD) :18078
## Max. :285208 Max. :2016 Public Works - Sanitation:14679
## (Other) :78940
## Payroll.Department Record.Number Job.Class.Title
## Min. : 101 Min. :1.953e+05 Police Officer II : 21160
## 1st Qu.:3001 1st Qu.:1.074e+09 Police Officer III : 10927
## Median :4301 Median :2.147e+09 Recreation Assistant: 10214
## Mean :4538 Mean :2.150e+09 Firefighter III : 8739
## 3rd Qu.:7031 3rd Qu.:3.228e+09 Administrative Clerk: 5161
## Max. :8702 Max. :4.295e+09 Security Officer : 4285
## NA's :53341 (Other) :224522
## Employment.Type Hourly.or.Event.Rate Projected.Annual.Salary
## Full Time:236720 Min. : 0.00 Min. : 0
## Part Time: 47373 1st Qu.: 26.50 1st Qu.: 56919
## Per Event: 915 Median : 37.44 Median : 79073
## Mean : 38.96 Mean : 80112
## 3rd Qu.: 47.66 3rd Qu.:100037
## Max. :900.00 Max. :360744
## NA's :20817
## Q1.Payments Q2.Payments Q3.Payments Q4.Payments
## Min. : -2255 Min. :-12607.9 Min. :-34123 Min. :-30728
## 1st Qu.: 12434 1st Qu.: 192.5 1st Qu.: 0 1st Qu.: 0
## Median : 21424 Median : 17723.4 Median : 4312 Median : 3840
## Mean : 21236 Mean : 16995.3 Mean : 12245 Mean : 12497
## 3rd Qu.: 29014 3rd Qu.: 27092.8 3rd Qu.: 23269 3rd Qu.: 24100
## Max. :262966 Max. :191435.4 Max. :291317 Max. :500931
##
## Payments.Over.Base.Pay X..Over.Base.Pay Total.Payments Base.Pay
## Min. :-30878 0.00% : 37043 Min. :-30728 Min. : -3838
## 1st Qu.: 650 0.03% : 7559 1st Qu.: 23007 1st Qu.: 19732
## Median : 4396 0.01% : 4643 Median : 55045 Median : 46342
## Mean : 11379 0.02% : 3839 Mean : 62974 Mean : 51594
## 3rd Qu.: 13251 0.08% : 2872 3rd Qu.: 95819 3rd Qu.: 82912
## Max. :435622 0.09% : 2702 Max. :500931 Max. :359362
## (Other):226350
## Permanent.Bonus.Pay Longevity.Bonus.Pay Temporary.Bonus.Pay
## Min. :-6476 Min. : -108.57 Min. : -6283.5
## 1st Qu.: 0 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 0 Median : 0.00 Median : 0.0
## Mean : 1497 Mean : 660.48 Mean : 462.9
## 3rd Qu.: 1802 3rd Qu.: 72.98 3rd Qu.: 670.0
## Max. :60728 Max. :22354.64 Max. :211066.6
##
## Lump.Sum.Pay Overtime.Pay Other.Pay...Adjustments
## Min. : -102.7 Min. : -1019.5 $0.00 :131327
## 1st Qu.: 0.0 1st Qu.: 0.0 $132.00: 4267
## Median : 0.0 Median : 334.7 $600.00: 3545
## Mean : 735.4 Mean : 6396.9 $264.00: 2465
## 3rd Qu.: 0.0 3rd Qu.: 5903.0 $150.00: 2402
## Max. :174325.3 Max. :311315.6 $300.00: 2333
## NA's :42061 NA's :433 (Other):138669
## Other.Pay..Payroll.Explorer. MOU
## $0.00 : 36357 Length:285008
## $100.00: 2328 Class :character
## $200.00: 2104 Mode :character
## $150.00: 1782
## $600.00: 1145
## $300.00: 798
## (Other):240494
## MOU.Title FMS.Department Job.Class
## : 20817 Min. : 2.00 Min. : 1
## POLICE OFFICERS UNIT : 20214 1st Qu.:42.00 1st Qu.:1961
## CLERICAL UNIT : 14087 Median :70.00 Median :2236
## Clerical : 11983 Mean :66.05 Mean :3056
## POLICE OFFICERS, LTS. AND BELOW: 10284 3rd Qu.:88.00 3rd Qu.:3580
## Police Officers, Lts. and Below: 9972 Max. :98.00 Max. :9998
## (Other) :197651
## Pay.Grade Average.Health.Cost Average.Dental.Cost
## 0 :67365 Min. : 0 Min. : 0.0
## 2 :49798 1st Qu.:10710 1st Qu.: 405.2
## :33710 Median :11470 Median : 412.3
## 1 :24580 Mean :10632 Mean : 612.6
## 3 :22508 3rd Qu.:13025 3rd Qu.: 898.1
## (Other):23790 Max. :18144 Max. :1282.6
## NA's :63257
## Average.Basic.Life Average.Benefit.Cost Benefits.Plan
## Min. : 0.00 Min. : 0 : 35905
## 1st Qu.: 0.00 1st Qu.:11127 City :125853
## Median : 11.40 Median :11894 DWP : 52827
## Mean : 48.97 Mean :11294 Fire : 16513
## 3rd Qu.: 11.45 3rd Qu.:14074 Police: 50910
## Max. :193.32 Max. :19393 NA's : 3000
##
## Job.Class.Link
## http://per.lacity.org/perspecs/2214.pdf: 34837
## http://per.lacity.org/perspecs/2498.pdf: 10214
## http://per.lacity.org/perspecs/2112.pdf: 9743
## http://per.lacity.org/perspecs/1358.pdf: 9000
## http://per.lacity.org/perspecs/2223.pdf: 7997
## http://per.lacity.org/perspecs/1368.pdf: 6811
## (Other) :206406
sum(is.na(payroll))
## [1] 203672
sum(complete.cases(payroll))
## [1] 189015
We now remove the NA values and check the percentage of data removed.
payroll_clean <- payroll %>%
na.omit()
Deleted_pcnt <- (1 - (nrow(payroll_clean)/nrow(payroll)))*100
Deleted_pcnt
## [1] 33.68081
We observe that the percentage of data deleted is not very significant. Further, prior to performing an analysis, we check if we removed any values in the required variables due to the above data cleaning.
Additionally, we observe that some of the columns have information that can be derived from other datasets. For example - Payments from each quarter are listed in 4 different columns. There is also an additional column listing the Total payments which can be derived from the quarterly payments. Hence, we can get rid of some of these columns which present repeated or derived information.
payroll_clean %>%
mutate(avgCost = as.numeric(Average.Health.Cost) + as.numeric(Average.Dental.Cost) + as.numeric(Average.Benefit.Cost) + as.numeric(Average.Basic.Life)) %>%
subset(,c(1:4,6:9,16,21,28:29,34,36)) %>%
dim()
## [1] 189015 14
I plan to perform the following analysis: