Let's look at the New York State (NYS) Attorney Registrations for some interesting information.
Source: https://data.ny.gov/Transparency/NYS-Attorney-Registrations/eqw2-r5nb
First I need to load some libraries.
library(readr)
library(stringi)
library(stringr)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(rvest)
I used a CSV file downloaded from their website, but one could presumably also use https://github.com/Chicago/RSocrata to fetch the data.
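For completeness, here is a minimal sketch of that route (untested here; it assumes the RSocrata package is installed and that the dataset URL above is a valid Socrata endpoint):
library(RSocrata)
# Hypothetical fetch straight from the NY Open Data portal
data.NYS <- read.socrata("https://data.ny.gov/Transparency/NYS-Attorney-Registrations/eqw2-r5nb")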
Nevertheless, using the readr package I skip two columns that are of little interest to me, and finally present summary statistics. As you can see, I am going to deal with character columns almost exclusively.
data.NYS <- readr::read_csv("/srv/shiny-server/SemesterProject/Data/NYS_Attorney_Registrations.csv",
                            col_types = list("Zip" = col_character(), "Zip Plus Four" = col_character(),
                                             "Suffix" = col_skip(), "Middle Name" = col_skip()))
summary(data.NYS)
## Registration Number First Name Last Name
## Min. :1000017 Length:336887 Length:336887
## 1st Qu.:1870410 Class :character Class :character
## Median :2720977 Mode :character Mode :character
## Mean :3066439
## 3rd Qu.:4435628
## Max. :5347455
##
## Company Name Street 1 Street 2
## Length:336887 Length:336887 Length:336887
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## City State Zip
## Length:336887 Length:336887 Length:336887
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Zip Plus Four Country County
## Length:336887 Length:336887 Length:336887
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Phone Number Email Address Year Admitted
## Length:336887 Length:336887 Min. :1898
## Class :character Class :character 1st Qu.:1982
## Mode :character Mode :character Median :1995
## Mean :1991
## 3rd Qu.:2006
## Max. :2015
## NA's :9
## Judicial Department of Admission Law School Status
## Min. :1.000 Length:336887 Length:336887
## 1st Qu.:1.000 Class :character Class :character
## Median :2.000 Mode :character Mode :character
## Mean :2.017
## 3rd Qu.:3.000
## Max. :4.000
## NA's :8
## Next Registration
## Length:336887
## Class :character
## Mode :character
##
##
##
##
Because the data contain blank cells, I need to replace "" (empty strings) with NAs. Then I also rename some columns (to make them shorter).
data.NYS[data.NYS == ""] <- NA
colnames(data.NYS) <- c("ID", "F.Name", "L.Name", "Comp.Name", "Street_1", "Street_2",
                        "City", "State", "Zip", "Zip_2", "Country", "County", "Phone",
                        "Email", "Year_Adm", "JDoA", "Law_School", "Status", "Next_Reg")
Using the stringi package, I split the Email column on "@" into a two-column matrix. Then I drop the NA rows (i.e. the cells I replaced above) and, applying the tolower function (which converts everything to lower case), turn the matrix into a data frame. The characters before the @ get a "NickName" column name, those after it an "Organisation". Beware that people cannot always fill in their email properly (maybe because of http://rickrobinson.files.wordpress.com/2012/10/it-systems.jpg, or because they are too lazy)!
Finally, using the table function, I get the frequencies of email providers. Google leads.
splitted.email <- stri_split_fixed(data.NYS$Email, "@", 2, omit_empty = NA, simplify = TRUE)
splitted.email <- data.frame(na.omit(tolower(splitted.email)))
colnames(splitted.email) <- c("NickName", "Organisation")
splitted.email <- as.data.frame(table(splitted.email$Organisation))
splitted.email <- plyr::arrange(splitted.email, desc(splitted.email$Freq))
head(splitted.email, 15)
## Var1 Freq
## 1 gmail.com 13796
## 2 aol.com 4111
## 3 yahoo.com 3315
## 4 hotmail.com 1802
## 5 verizon.net 816
## 6 optonline.net 815
## 7 msn.com 479
## 8 comcast.net 437
## 9 usdoj.gov 365
## 10 courts.state.ny.us 302
## 11 legal-aid.org 265
## 12 me.com 254
## 13 skadden.com 244
## 14 nycourts.gov 242
## 15 law.nyc.gov 202
Not so many law firms, right? Actually just two (not counting the government itself, of course): http://www.legal-aid.org/en/home.aspx and http://www.skadden.com/. The latter is also the second-largest law firm by revenue. The former is not even a law office, merely a non-profit helping poor people!
Here I can use the same code, but what is far more interesting is that people fill in their law school completely differently. Well, I can understand that from the foreigners, but what about "Harvard", "Harvard Law School" and "Harvard University" (very US-centric)? Just look at that. I mean, come on…
law_school <- as.data.frame(table(tolower(data.NYS$Law_School)))
law_school <- plyr::arrange(law_school, desc(law_school$Freq))
head(law_school, 15)
## Var1 Freq
## 1 brooklyn law school 12329
## 2 new york university 11755
## 3 new york law school 9752
## 4 harvard 8023
## 5 brooklyn 7651
## 6 fordham 7466
## 7 columbia 6497
## 8 st johns university 6215
## 9 albany law school 5127
## 10 fordham university 4707
## 11 columbia law school 4628
## 12 harvard law school 4600
## 13 new york 3948
## 14 boston university 3918
## 15 nyu 3786
Four lawyers even wrote "harvard 1950". Furthermore, there are e.g. 551 unique spellings containing "harvard" in the Law_School column (see below). The most interesting fact to me is that some people cannot even spell the name of their law school (or of the company they work for) properly. I wouldn't want to go to such lawyers :(
Source: http://www.r-bloggers.com/select-operations-on-r-data-frames/
harvard <- law_school[grep("harvard", law_school$Var1, ignore.case=T),]
head(harvard)
## Var1 Freq
## 4 harvard 8023
## 12 harvard law school 4600
## 233 harvard university 160
## 287 harvard law 113
## 1014 harvard univ 15
## 1623 harvard law schl 7
newyork <- law_school[grep("new york", law_school$Var1, ignore.case=T),]
head(newyork)
## Var1 Freq
## 2 new york university 11755
## 3 new york law school 9752
## 13 new york 3948
## 24 new york university school of law 2955
## 83 new york law 640
## 146 city university of new york school of law 321
brooklyn <- law_school[grep("brooklyn", law_school$Var1, ignore.case=T),]
head(brooklyn)
## Var1 Freq
## 1 brooklyn law school 12329
## 5 brooklyn 7651
## 49 brooklyn law 1192
## 703 brooklyn law schl 27
## 974 st johns brooklyn 16
## 1156 brooklyn law sch 12
johns <- law_school[grep("johns|john's", law_school$Var1, ignore.case=T),]
head(johns)
## Var1 Freq
## 8 st johns university 6215
## 16 st johns 3668
## 47 st. john's university school of law 1315
## 52 st. john's university 1119
## 73 st johns univ 745
## 98 st. johns university school of law 497
# first - observations -> rows; second - variables -> columns
rbind(johns = dim(johns),brooklyn = dim(brooklyn), ny = dim(newyork), harvard = dim(harvard))
## [,1] [,2]
## johns 415 2
## brooklyn 173 2
## ny 973 2
## harvard 551 2
law_school[grep("new york law shool", law_school$Var1, ignore.case=T),] # FAIL
## Var1 Freq
## 3885 new york law shool 2
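One way to tame this mess would be to map every variant onto a canonical label before tabulating. A rough sketch only (I do not use it in the analysis below, and these assumed patterns over-match, e.g. "st johns brooklyn" would be folded into Brooklyn):
# Hypothetical helper: collapse spelling variants onto one canonical name
canonicalise <- function(x) {
  x <- tolower(x)
  x[grepl("harvard", x)]             <- "harvard law school"
  x[grepl("brooklyn", x)]            <- "brooklyn law school"
  x[grepl("new york univ|^nyu$", x)] <- "new york university"
  x
}
head(sort(table(canonicalise(data.NYS$Law_School)), decreasing = TRUE))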
Sadly enough, I first need to get rid of "unique" business-entity suffixes, abbreviations, symbols such as '&', ',' or '()', and totally wrong names. The reason is that the data are, as one could have seen, very raw indeed.
All of that is done by running a function that uses other functions from the stringi (and stringr) library. Even though stringi uses C behind it, the cleanup is still very slow (even with some optimisations, e.g. sapply) and has to be run twice, because the pattern is complicated and not bulletproof.
data.NYS$Comp.Name <- tolower(data.NYS$Comp.Name)
stringsToCheck <- c("corporation.", "corporation", "incorporated", "corp", "corp.", "group", "gmbh", "company", "pllc", "llp", "llc", "l.l.c.", "l.l.p.", "ltd.","inc.", "inc", "plc", "p.c.", "pc", "p.a.", "l.p.", "lp", "co.", "l.p")
patTTT <- paste(stringsToCheck, collapse = '|')
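# NB: the unescaped dots in entries like "co." act as regex wildcards; that is
# why "co." later eats the "com" of "verizon communications". A stricter
# pattern would escape them first. A sketch only (patStrict is not used below):
escaped   <- stri_replace_all_fixed(stringsToCheck, ".", "\\.")
patStrict <- paste(escaped, collapse = "|")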
# Just an idea of how to proceed (old version)
# checkAndCleanFormatting <- function(x) {
#   x <- stri_trim_both(x)
#   for(i in 1:length(x)) {
#     if(stri_detect_regex(x[i], patTTT) == TRUE) {
#       x[i] <- stri_replace_first_regex(x[i], patTTT, "")
#     }
#   }
#   x <- stri_trim_both(x)
# }
# Much faster version (with C behind it)
# Run the function twice for better handling of all (edge) cases
for(i in 1:2) {
  data.NYS$Comp.Name <- sapply(as.character(data.NYS$Comp.Name), function(x) {
    if (is.na(x)) {
      x <- NA # keep missing names as NA
    } else {
      x <- stri_replace_first_fixed(x, "&", "and")
      x <- stri_replace_last_fixed(x, "and", "")
      x <- stri_replace_last_regex(x, ",", "")
      if(stri_detect_regex(x, patTTT)) {
        x <- str_replace(x, patTTT, "") # stringr here, stringi behaved weirdly
      }
      x <- stri_trim_both(x)
    }
  })
}
Once the company names are cleaned, I create a bar plot with the TOP 20 company names and their number of lawyers registered in New York City. Beware that not all cases are handled by the function above; e.g. there is "davis polk and wardwell london", which is NOT counted as (/assigned to) "davis polk and wardwell"! Thus, the numbers are not exact. See below for a prime example.
com_name <- as.data.frame(table(stri_trim_both(data.NYS$Comp.Name)))
com_name <- plyr::arrange(com_name, desc(com_name$Freq))
com_name$Var1 <- as.character(com_name$Var1)
com_name.plot <- ggplot(data=com_name[1:20,], aes(x=reorder(Var1, Freq), y=Freq))
com_name.plot <- com_name.plot + geom_bar(stat = "identity") + coord_flip(ylim= c(400,900)) + theme(
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 14))
com_name.plot <- com_name.plot + ylab("How many registered lawyers do companies have in NYC ?") + xlab("Companies")
com_name.plot <- com_name.plot + scale_y_continuous(breaks = seq(400, 900, 50))
com_name.plot
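As a quick illustration of the caveat above, one can list which "davis polk" variants survived the cleanup (an illustration only; I do not show the output here):
com_name[grep("davis polk", com_name$Var1), ]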
Yeah, "citi" captures just about everything, including citizenship services.
companysNames <- com_name[grep("citi", com_name$Var1, ignore.case=T),]
head(companysNames, 10)
## Var1 Freq
## 65 citi 229
## 209 citi global markets 79
## 316 citibank n.a. 51
## 528 citibank 32
## 1012 citi private bank 17
## 2273 citibank na 8
## 2281 cuny citizenship now 8
## 2989 united states citizenship immigration services 7
## 2999 u.s. citizenship immigration services 7
## 3833 citizens financial 5
But "goldman" is not much better either. The code mirrors the citi example above:
head(com_name[grep("goldman", com_name$Var1, ignore.case=T),], 10)
## Var1 Freq
## 28 goldman sachs 333
## 362 belkin burden wenig goldman 45
## 391 wilentz goldman spitzer 43
## 414 goldman sachs co 40
## 1506 goldman sachs (asia) 12
## 1507 goldman sachs international 12
## 2213 wolff goodrich goldman 9
## 2698 goldman sachs asset management 7
## 3312 kaman, berlove, marafioti jastein goldman 6
## 3978 goldman sachs japan 5
A good chunk of the firms in the list are law offices. But what about Fortune 500 companies? How many lawyers do they have? Actually, I want to select only the first 200 and plot them again.
First, I will get the current 2014 list of the F500, for example from the page scraped below.
# NB: rvest's html() was later deprecated in favour of read_html()
f500 <- html("http://www.zyxware.com/articles/4344/list-of-fortune-500-companies-and-their-websites") %>%
  html_node(".data-table") %>%
  html_table(header = TRUE)
f500 <- f500[1:200, ]
f500.com <- data.frame(Names = tolower(f500$Company))
The same idea as above (changed a bit).
f500.com$Names <- sapply(as.character(f500.com$Names), function(x) {
  if (is.na(x)) {
    x <- NA
  } else {
    x <- stri_replace_first_fixed(x, "&", "and")
    x <- stri_replace_last_fixed(x, "and", "")
    if(stri_detect_regex(x, patTTT)) {
      x <- stri_replace_last_regex(x, patTTT, "")
    }
    x <- stri_replace_last_fixed(x, ",", "")
    x <- stri_replace_last_fixed(x, ".", "")
    x <- stri_trim_both(x)
  }
})
Now I will join the two previous data sets: the company names from the original data set with the list of Fortune 500 companies (actually just 200).
right.join <- right_join(com_name, f500.com, by = c("Var1" = "Names"))
right.join <- plyr::arrange(right.join, desc(right.join$Freq))
head(right.join, 15)
## Var1 Freq
## 1 morgan stanley 369
## 2 citi 229
## 3 pfizer 178
## 4 bank of america 149
## 5 metlife 143
## 6 american express 143
## 7 google 100
## 8 american international 98
## 9 johnson johnson 91
## 10 new york life insurance 85
## 11 general electric 75
## 12 the bank of new york mellon 60
## 13 microsoft 58
## 14 prudential financial 51
## 15 tiaa-cref 43
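Note that right_join keeps all 200 F500 names even when no registration matched, leaving Freq as NA; those rows simply sort to the bottom. A quick sanity check (a sketch):
sum(is.na(right.join$Freq)) # F500 names with no match among the registrations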
Let's find some widely known companies inside the F500 and show their number of lawyers.
optesa <- right.join[grep("google|apple|microsoft|att|verizon|comcast|cbs|oracle|intel|chevron|nike|facebook|amazon|directv|dish|boeing", right.join$Var1, ignore.case=T),]
optesa
## Var1 Freq
## 7 google 100
## 13 microsoft 58
## 18 cbs 37
## 19 apple 35
## 22 att 29
## 25 verizon munications 26
## 28 oracle 24
## 32 intel 21
## 39 chevron 15
## 42 amazon 12
## 43 nike 12
## 66 directv 5
## 85 boeing 3
Not bad for Google, nor for AT&T considering its merger with DirecTV. (Note "verizon munications" above: the loose "co." pattern ate the "com" in "communications".) However, let's plot them all.
com_name2.plot <- ggplot(data=right.join[1:30,], aes(x=reorder(Var1, Freq), y=Freq)) + coord_flip(ylim= c(10,400))
com_name2.plot <- com_name2.plot + geom_bar(stat = "identity") + theme(
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 14))
com_name2.plot <- com_name2.plot + ylab("How many registered lawyers do companies have in NYC ?") + xlab("Companies")
com_name2.plot <- com_name2.plot + scale_y_continuous(breaks = seq(10, 400, 40))
com_name2.plot
That was all I wanted to show you. As a matter of fact, this was supposed to be part of my semester project at university, but it has remained unused (very sadly).