Here is an involved exercise using a larger data base in the form of an Excel .csv file to work with.
And here’s the code!
If you want to download the .csv file from github you can find it here.
Excel can be kind of squirrelly, and I found it best to download the ZIP on the github page and drag the Employee_Earnings_Report_2014.csv file to my desktop without opening the file.
earnings <- read.csv('C:\\Users\\Steven\\Desktop\\Employee_Earnings_Report_2014.csv', stringsAsFactors = FALSE)
Explore the data a bit using the dim, head, or tail commands. We will see there are 22,233 rows and 12 columns.
dim(earnings)
## [1] 22233 12
head(earnings)
## NAME TITLE
## 1 Adario,Anthony J Supvising Claims Agent (Asd)
## 2 Baker,Katherine E. Nurse Case Manager
## 3 Braga,Ann Hess HumanResourcesComplianceOffcr
## 4 Calobrisi,Jane Ellen Personnel Analyst
## 5 Castillo-Cook,Wendolyn M. Alcholism Coord I
## 6 Curran,Liam Patrick Asst Corp Counsel III
## DEPARTMENT.NAME REGULAR RETRO OTHER OVERTIME INJURED DETAIL
## 1 ASD Human Resources $98538.32 $0.00 $1842.87 $0.00 $0.00 $0.00
## 2 ASD Human Resources $83942.21 $0.00 $225.00 $0.00 $0.00 $0.00
## 3 ASD Human Resources $98538.34 $0.00 $0.00 $0.00 $0.00 $0.00
## 4 ASD Human Resources $70073.81 $0.00 $225.00 $0.00 $0.00 $0.00
## 5 ASD Human Resources $79253.11 $0.00 $0.00 $0.00 $0.00 $0.00
## 6 ASD Human Resources $69266.69 $0.00 $0.00 $0.00 $0.00 $0.00
## QUINN TOTAL.EARNINGS ZIP
## 1 $0.00 $100381.19 02132
## 2 $0.00 $84167.21 02136
## 3 $0.00 $98538.34 02126
## 4 $0.00 $70298.81 02090
## 5 $0.00 $79253.11 02121
## 6 $0.00 $69266.69 02122
tail(earnings)
## NAME TITLE DEPARTMENT.NAME
## 22228 Yanosick,Justin Timothy Fire Fighter Boston Fire Department
## 22229 Yee,Jacob Fire Fighter Boston Fire Department
## 22230 Yetman,Patrick Fire Fighter Boston Fire Department
## 22231 Young,Allan M Fire Fighter Boston Fire Department
## 22232 Young,Michael L Fire Fighter Boston Fire Department
## 22233 Zagami,Anthony J Fire Alarm Operator Boston Fire Department
## REGULAR RETRO OTHER OVERTIME INJURED DETAIL QUINN
## 22228 $92646.90 $14250.84 $550.00 $28269.86 $0.00 $0.00 $0.00
## 22229 $80073.63 $13529.71 $550.00 $15560.09 $13353.47 $7245.75 $0.00
## 22230 $69886.18 $5191.64 $550.00 $8227.25 $0.00 $0.00 $0.00
## 22231 $97234.17 $14091.91 $3050.00 $6193.76 $0.00 $4464.00 $0.00
## 22232 $95694.84 $13973.38 $550.00 $10678.12 $0.00 $132.00 $0.00
## 22233 $0.00 $5494.93 $0.00 $0.00 $0.00 $0.00 $0.00
## TOTAL.EARNINGS ZIP
## 22228 $135717.60 02132
## 22229 $130312.65 02136
## 22230 $83855.07 02127
## 22231 $125033.84 02136
## 22232 $121028.34 02121
## 22233 $5494.93 02124
You will notice that the data file consists entirely of characters. All cells are non-numeric. We will have to change this in order to analyze the data. We’ll get to that in a bit.
str(earnings)
## 'data.frame': 22233 obs. of 12 variables:
## $ NAME : chr "Adario,Anthony J" "Baker,Katherine E." "Braga,Ann Hess" "Calobrisi,Jane Ellen" ...
## $ TITLE : chr "Supvising Claims Agent (Asd)" "Nurse Case Manager" "HumanResourcesComplianceOffcr" "Personnel Analyst" ...
## $ DEPARTMENT.NAME: chr "ASD Human Resources" "ASD Human Resources" "ASD Human Resources" "ASD Human Resources" ...
## $ REGULAR : chr "$98538.32" "$83942.21" "$98538.34" "$70073.81" ...
## $ RETRO : chr "$0.00" "$0.00" "$0.00" "$0.00" ...
## $ OTHER : chr "$1842.87" "$225.00" "$0.00" "$225.00" ...
## $ OVERTIME : chr "$0.00" "$0.00" "$0.00" "$0.00" ...
## $ INJURED : chr "$0.00" "$0.00" "$0.00" "$0.00" ...
## $ DETAIL : chr "$0.00" "$0.00" "$0.00" "$0.00" ...
## $ QUINN : chr "$0.00" "$0.00" "$0.00" "$0.00" ...
## $ TOTAL.EARNINGS : chr "$100381.19" "$84167.21" "$98538.34" "$70298.81" ...
## $ ZIP : chr "02132" "02136" "02126" "02090" ...
If you want to find values in a specific row, column, or cell, you can use the following codes. As usual in R, there are different ways of doing the same thing.
earnings[3, 1] #row 3, column 1 or
## [1] "Braga,Ann Hess"
earnings$NAME[3]
## [1] "Braga,Ann Hess"
earnings[1:10, 1] #column 1 (rows 1 through 10 only) or
## [1] "Adario,Anthony J" "Baker,Katherine E."
## [3] "Braga,Ann Hess" "Calobrisi,Jane Ellen"
## [5] "Castillo-Cook,Wendolyn M." "Curran,Liam Patrick"
## [7] "De Jesus,Winnie Yam" "Donnelly,Jennifer P."
## [9] "Donovan,Carol Anne" "Donovan,John J"
earnings[1:10, 'NAME']
## [1] "Adario,Anthony J" "Baker,Katherine E."
## [3] "Braga,Ann Hess" "Calobrisi,Jane Ellen"
## [5] "Castillo-Cook,Wendolyn M." "Curran,Liam Patrick"
## [7] "De Jesus,Winnie Yam" "Donnelly,Jennifer P."
## [9] "Donovan,Carol Anne" "Donovan,John J"
earnings[3, ] #row 3
## NAME TITLE DEPARTMENT.NAME
## 3 Braga,Ann Hess HumanResourcesComplianceOffcr ASD Human Resources
## REGULAR RETRO OTHER OVERTIME INJURED DETAIL QUINN TOTAL.EARNINGS ZIP
## 3 $98538.34 $0.00 $0.00 $0.00 $0.00 $0.00 $0.00 $98538.34 02126
We noted that the information in the file is all non-numeric. We need to deal with that. Let’s start by removing the dollar sign from a selected column.
If you are working in R Studio, it is useful to keep the file view open in the upper left (the script editing pane) you will instantly see changes you make.
earnings$TOTAL.EARNINGS <- gsub('\\$', '', earnings$TOTAL.EARNINGS)
The column has to be a numeric vector if we are to manipulate it, so time to convert the character vector to numeric.
earnings$TOTAL.EARNINGS <- as.numeric(earnings$TOTAL.EARNINGS)
str(earnings$TOTAL.EARNINGS)
## num [1:22233] 100381 84167 98538 70299 79253 ...
Sorting is not a problem in R. The negative sign (-) in the code means sort in descending order. The default is ascending order.
earnings <- earnings[order(-earnings$TOTAL.EARNINGS), ]
earnings[1:10, 11] #column 1 (rows 1 through 10 only)
## [1] 415709.5 361025.0 350510.5 342249.4 325943.6 325579.0 317181.5
## [8] 300065.6 292205.7 289436.2
In this case, we are going to create a new column called Total.less.OT to find each employee’s base salary. The Total.Earnings column includes base plus overtime.
But first we have to remove the dollar signs and then convert the Overtime column to a numeric vector, as we did to Total.Earnings, or we will induce an error.
earnings$OVERTIME <- sub('\\$', '', earnings$OVERTIME)
earnings$OVERTIME <- as.numeric(earnings$OVERTIME)
earnings$Total.less.OT <- earnings$TOTAL.EARNINGS - earnings$OVERTIME
head(earnings)
## NAME TITLE
## 17835 Crossen,Patrick J Police Captain/DDC
## 19414 Murray,Timothy J Capt.D.D.C-pd Details Section
## 18825 Kervin,Timothy M. Police Lieutenant/Hdq Dispatch
## 8342 Hite,Brenda Teacher
## 17822 Creaven,Jacqueline D Police Lieutenant
## 18633 Hosein,Haseeb Police Captain/DDC
## DEPARTMENT.NAME REGULAR RETRO OTHER OVERTIME
## 17835 Boston Police Department $132427.70 $44690.23 $140711.86 77287.73
## 19414 Boston Police Department $133014.78 $36744.79 $70567.98 46059.04
## 18825 Boston Police Department $112948.06 $41654.06 $16975.92 115079.93
## 8342 BPS Kenny Elementary $38453.84 $0.00 $303795.59 0.00
## 17822 Boston Police Department $111090.95 $30161.50 $131044.53 26375.89
## 18633 Boston Police Department $117206.34 $32585.30 $14656.40 50739.18
## INJURED DETAIL QUINN TOTAL.EARNINGS ZIP Total.less.OT
## 17835 $0.00 $0.00 $20592.01 415709.5 02043 338421.8
## 19414 $0.00 $53956.00 $20682.44 361025.0 02021 314966.0
## 18825 $0.00 $46314.00 $17538.56 350510.5 02135 235430.6
## 8342 $0.00 $0.00 $0.00 342249.4 02136 342249.4
## 17822 $132.97 $9918.00 $17219.79 325943.6 02132 299567.7
## 18633 $1294.34 $90840.00 $18257.43 325579.0 02186 274839.8
Basically, in Excel terms, we are going to cut-and-paste columns into new R objects. In other words, we are extracting all data in rows from the earnings data frame that include “Boston Fire Department” in the Department.Name column and putting them into a separate object. Likewise for the Boston PD.
fire_dept <- subset(earnings, DEPARTMENT.NAME == 'Boston Fire Department')
police_dept <- subset(earnings, DEPARTMENT.NAME == 'Boston Police Department')
head(fire_dept)
## NAME TITLE
## 21554 McDevitt,Daniel J Dep Fire Chief Adm-Tech-ADR
## 22111 Tobin,William A District Fire Chief-Technician
## 21986 Shea,Bartholomew J Dep Fire Chief
## 20853 Doherty,Michael J. Dep Fire Chief
## 21232 Holt,Thomas E Fire Captain Admin-ADR
## 21933 Ruggere,Michael J Dep Fire Chief
## DEPARTMENT.NAME REGULAR RETRO OTHER OVERTIME
## 21554 Boston Fire Department $8656.63 $8457.93 $132814.00 0.00
## 22111 Boston Fire Department $6049.45 $1522.99 $114358.64 0.00
## 21986 Boston Fire Department $172250.77 $22037.04 $10061.23 55245.78
## 20853 Boston Fire Department $167289.41 $23861.77 $3741.54 63222.42
## 21232 Boston Fire Department $27005.66 $13978.43 $101051.37 3153.96
## 21933 Boston Fire Department $183195.52 $20912.07 $4565.94 33749.38
## INJURED DETAIL QUINN TOTAL.EARNINGS ZIP Total.less.OT
## 21554 $118050.73 $0.00 $0.00 267979.3 02127 267979.3
## 22111 $140484.65 $0.00 $0.00 262415.7 02191 262415.7
## 21986 $0.00 $0.00 $0.00 259594.8 02132 204349.0
## 20853 $603.66 $0.00 $0.00 258718.8 01867 195496.4
## 21232 $107359.66 $0.00 $0.00 252549.1 02135 249395.1
## 21933 $0.00 $0.00 $0.00 242422.9 02188 208673.5
head(police_dept)
## NAME TITLE
## 17835 Crossen,Patrick J Police Captain/DDC
## 19414 Murray,Timothy J Capt.D.D.C-pd Details Section
## 18825 Kervin,Timothy M. Police Lieutenant/Hdq Dispatch
## 17822 Creaven,Jacqueline D Police Lieutenant
## 18633 Hosein,Haseeb Police Captain/DDC
## 17906 Danilecki,John H Police Captain
## DEPARTMENT.NAME REGULAR RETRO OTHER OVERTIME
## 17835 Boston Police Department $132427.70 $44690.23 $140711.86 77287.73
## 19414 Boston Police Department $133014.78 $36744.79 $70567.98 46059.04
## 18825 Boston Police Department $112948.06 $41654.06 $16975.92 115079.93
## 17822 Boston Police Department $111090.95 $30161.50 $131044.53 26375.89
## 18633 Boston Police Department $117206.34 $32585.30 $14656.40 50739.18
## 17906 Boston Police Department $131219.38 $37210.94 $19757.04 78010.40
## INJURED DETAIL QUINN TOTAL.EARNINGS ZIP Total.less.OT
## 17835 $0.00 $0.00 $20592.01 415709.5 02043 338421.8
## 19414 $0.00 $53956.00 $20682.44 361025.0 02021 314966.0
## 18825 $0.00 $46314.00 $17538.56 350510.5 02135 235430.6
## 17822 $132.97 $9918.00 $17219.79 325943.6 02132 299567.7
## 18633 $1294.34 $90840.00 $18257.43 325579.0 02186 274839.8
## 17906 $0.00 $30573.00 $20410.79 317181.5 02081 239171.1
We could work on these objects separately on their own, away from the original earnings data frame.
As in Excel, we can calculate sums, means, medians, or any other value we like on columns we choose.
sum(earnings$TOTAL.EARNINGS)
## [1] 1487852769
mean(earnings$TOTAL.EARNINGS)
## [1] 66920.92
median(earnings$TOTAL.EARNINGS)
## [1] 61714.87
max(earnings$TOTAL.EARNINGS)
## [1] 415709.5
range(earnings$TOTAL.EARNINGS)
## [1] 2.53 415709.53
And we can create separate object for each of these if we need them. For example, for the sum of TOTAL.EARNINGS.
earnings_total <- sum(earnings$TOTAL.EARNINGS)
earnings_total
## [1] 1487852769
You may have noticed the Name column contains first and last names, and middle initials. It may be necessary to break this aggregated column into its constituent parts. This can be done in R but the process is a little complicated.
Let’s begin by breaking out the last names. The sub command below will delete everything in the Name column after the comma following the last name then place the remainder in a new column called Last.Name.
earnings$Last.Name <- sub(',.*', '', earnings$NAME) #deletes everything in NAME
#column after the comma
earnings[1:10, 'Last.Name']
## [1] "Crossen" "Murray" "Kervin" "Hite" "Creaven"
## [6] "Hosein" "Danilecki" "Spillane" "Assad" "Greeley"
Next, we will create a new column called First.Name and extract contents from the Name column.
earnings$First.Name <- sub('.*,', '', earnings$NAME)
earnings[1:10, 'First.Name']
## [1] "Patrick J" "Timothy J" "Timothy M." "Brenda"
## [5] "Jacqueline D" "Haseeb" "John H" "Matthew J"
## [9] "Mark L" "Darrin Patrick"
Still, there is the problem of the middle initials. We want to remove it from the First.Name column and create a new column for them. This gets a little complicated because we take a couple steps, one involving a function.
Part 1: Create an Array of All Lines in earnings
earnings_list <- 1:nrow(earnings)
head(earnings_list)
## [1] 1 2 3 4 5 6
Part 2: Use Function (and Loop)
In a couple steps, we are going to create a column for middle names, then clean up the column First.Name.
for(i in earnings_list){
if (grepl(' ', earnings$First.Name[i])) {
earnings$Mid.Name[i] <- sub('.* ', '', earnings$First.Name[i])
} else {
earnings$Mid.Name[i] <- ''
}
}
head(earnings)
## NAME TITLE
## 17835 Crossen,Patrick J Police Captain/DDC
## 19414 Murray,Timothy J Capt.D.D.C-pd Details Section
## 18825 Kervin,Timothy M. Police Lieutenant/Hdq Dispatch
## 8342 Hite,Brenda Teacher
## 17822 Creaven,Jacqueline D Police Lieutenant
## 18633 Hosein,Haseeb Police Captain/DDC
## DEPARTMENT.NAME REGULAR RETRO OTHER OVERTIME
## 17835 Boston Police Department $132427.70 $44690.23 $140711.86 77287.73
## 19414 Boston Police Department $133014.78 $36744.79 $70567.98 46059.04
## 18825 Boston Police Department $112948.06 $41654.06 $16975.92 115079.93
## 8342 BPS Kenny Elementary $38453.84 $0.00 $303795.59 0.00
## 17822 Boston Police Department $111090.95 $30161.50 $131044.53 26375.89
## 18633 Boston Police Department $117206.34 $32585.30 $14656.40 50739.18
## INJURED DETAIL QUINN TOTAL.EARNINGS ZIP Total.less.OT
## 17835 $0.00 $0.00 $20592.01 415709.5 02043 338421.8
## 19414 $0.00 $53956.00 $20682.44 361025.0 02021 314966.0
## 18825 $0.00 $46314.00 $17538.56 350510.5 02135 235430.6
## 8342 $0.00 $0.00 $0.00 342249.4 02136 342249.4
## 17822 $132.97 $9918.00 $17219.79 325943.6 02132 299567.7
## 18633 $1294.34 $90840.00 $18257.43 325579.0 02186 274839.8
## Last.Name First.Name Mid.Name
## 17835 Crossen Patrick J J
## 19414 Murray Timothy J J
## 18825 Kervin Timothy M. M.
## 8342 Hite Brenda
## 17822 Creaven Jacqueline D D
## 18633 Hosein Haseeb
Now that we have three columns – Last.Name, First.Name, and Mid.Name we have to go back to the First.Name column and remove the middle initials.
earnings$First.Name <- sub(' .*', '', earnings$First.Name)
head(earnings)
## NAME TITLE
## 17835 Crossen,Patrick J Police Captain/DDC
## 19414 Murray,Timothy J Capt.D.D.C-pd Details Section
## 18825 Kervin,Timothy M. Police Lieutenant/Hdq Dispatch
## 8342 Hite,Brenda Teacher
## 17822 Creaven,Jacqueline D Police Lieutenant
## 18633 Hosein,Haseeb Police Captain/DDC
## DEPARTMENT.NAME REGULAR RETRO OTHER OVERTIME
## 17835 Boston Police Department $132427.70 $44690.23 $140711.86 77287.73
## 19414 Boston Police Department $133014.78 $36744.79 $70567.98 46059.04
## 18825 Boston Police Department $112948.06 $41654.06 $16975.92 115079.93
## 8342 BPS Kenny Elementary $38453.84 $0.00 $303795.59 0.00
## 17822 Boston Police Department $111090.95 $30161.50 $131044.53 26375.89
## 18633 Boston Police Department $117206.34 $32585.30 $14656.40 50739.18
## INJURED DETAIL QUINN TOTAL.EARNINGS ZIP Total.less.OT
## 17835 $0.00 $0.00 $20592.01 415709.5 02043 338421.8
## 19414 $0.00 $53956.00 $20682.44 361025.0 02021 314966.0
## 18825 $0.00 $46314.00 $17538.56 350510.5 02135 235430.6
## 8342 $0.00 $0.00 $0.00 342249.4 02136 342249.4
## 17822 $132.97 $9918.00 $17219.79 325943.6 02132 299567.7
## 18633 $1294.34 $90840.00 $18257.43 325579.0 02186 274839.8
## Last.Name First.Name Mid.Name
## 17835 Crossen Patrick J
## 19414 Murray Timothy J
## 18825 Kervin Timothy M.
## 8342 Hite Brenda
## 17822 Creaven Jacqueline D
## 18633 Hosein Haseeb
It’s not perfect (or easy) but it can be done.
Suppose we want a count of employees in each department in our data frame. We are going to follow these steps.
Start with a new data frame.
Dept_Workers <- data.frame(table(earnings$DEPARTMENT.NAME))
head(Dept_Workers)
## Var1 Freq
## 1 Achievement Gap 15
## 2 Alighieri Montessori School 12
## 3 ASD Human Resources 53
## 4 ASD Intergvernmtl Relations 11
## 5 ASD Office of Budget Mangmnt 25
## 6 ASD Office Of Labor Relation 10
Now we know there are 12 employees at the Alighieri Montessori School.
Next, let’s sort the data frame by number of employees in descending order.
Dept_Workers <- Dept_Workers[order(-Dept_Workers$Freq), ]
head(Dept_Workers)
## Var1 Freq
## 22 Boston Police Department 3178
## 20 Boston Fire Department 1834
## 120 BPS Substitute Teachers/Nurs 1007
## 119 BPS Special Education 721
## 64 BPS Facility Management 581
## 19 Boston Cntr - Youth & Families 562
With 3178 employees, the Boston Police Department is largest.
If we don’t like the current column names – and who does? – we can change Var1 and Freq to something more descriptive.
colnames(Dept_Workers) <- c('Dept', 'No. of Employees')
head(Dept_Workers)
## Dept No. of Employees
## 22 Boston Police Department 3178
## 20 Boston Fire Department 1834
## 120 BPS Substitute Teachers/Nurs 1007
## 119 BPS Special Education 721
## 64 BPS Facility Management 581
## 19 Boston Cntr - Youth & Families 562
We are in a position now to create a pivot table of total income for employees in each department.
Start by creating a new object – income – with a new column – Dept. You will understand why we are doing this in a moment.
income <- tapply(earnings$TOTAL.EARNINGS, earnings$DEPARTMENT.NAME, sum)
income <- data.frame(income)
income$Dept <- rownames(income)
head(income)
## income Dept
## Achievement Gap 1399700.2 Achievement Gap
## Alighieri Montessori School 649428.2 Alighieri Montessori School
## ASD Human Resources 3300290.3 ASD Human Resources
## ASD Intergvernmtl Relations 792921.8 ASD Intergvernmtl Relations
## ASD Office of Budget Mangmnt 1993052.6 ASD Office of Budget Mangmnt
## ASD Office Of Labor Relation 734024.6 ASD Office Of Labor Relation
Now we know the 12 employees at the Montessori School earn a sum of nearly $650,000. That’s about $54,000 per employee.
Next, we will the two data frames we’ve created – income and Dept_Workers – into a data frame called merged. In order to make this merge work, we had to create the Dept column in the income data frame
merged <- merge(Dept_Workers, income, by = 'Dept')
head(merged)
## Dept No. of Employees income
## 1 Achievement Gap 15 1399700.2
## 2 Alighieri Montessori School 12 649428.2
## 3 ASD Human Resources 53 3300290.3
## 4 ASD Intergvernmtl Relations 11 792921.8
## 5 ASD Office of Budget Mangmnt 25 1993052.6
## 6 ASD Office Of Labor Relation 10 734024.6
Now that we have created this data frame, we can manipulate it as we can any other object in R. For example, let’s figure out average income per employee in each department. We will have to create a new column for the results. Then we will sort the data frame by the data in the new column in descending order.
merged$income.per.employee <- merged$income / merged$'No. of Employees'
mergedPer <- merged[order(-merged$income.per.employee), ]
head(mergedPer)
## Dept No. of Employees income
## 13 Asst Superintendent-Network E 3 392022.2
## 14 Asst Superintendent-Network F 2 242305.9
## 10 Asst Superintendent-Network B 3 359097.6
## 15 Asst Superintendent-Network G 5 584471.3
## 20 Boston Fire Department 1834 209723252.9
## 22 Boston Police Department 3178 357809909.7
## income.per.employee
## 13 130674.1
## 14 121152.9
## 10 119699.2
## 15 116894.3
## 20 114352.9
## 22 112589.7
We can sort merged by income, if that is more useful to us.
merged <- merged[order(-merged$income), ]
head(merged)
## Dept No. of Employees income
## 22 Boston Police Department 3178 357809910
## 20 Boston Fire Department 1834 209723253
## 119 BPS Special Education 721 39917713
## 64 BPS Facility Management 581 25468073
## 210 Public Works Department 485 24888294
## 23 Boston Public Library 558 24858698
## income.per.employee
## 22 112589.65
## 20 114352.92
## 119 55364.37
## 64 43834.89
## 210 51316.07
## 23 44549.64
Now, we’re finished and, just like Excel, we can save our work in the fammiliar .csv format.
write.csv(merged, 'C:\\Users\\Steven\\Desktop\\merged.csv')
write.csv(mergedPer, 'C:\\Users\\Steven\\Desktop\\mergedPer.csv')