1 Goal

Explore the New York State (NYS) Attorney Registrations data set for interesting information.

Source: https://data.ny.gov/Transparency/NYS-Attorney-Registrations/eqw2-r5nb

2 Get the data

First I need to load some libraries.

library(readr)
library(stringi)
library(stringr)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(rvest)

I used a CSV file downloaded from the website, but I assume the RSocrata package (https://github.com/Chicago/RSocrata) could fetch the data as well.

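Using the readr package, I skip two columns that are of little interest to me, force both zip columns to be read as character, and finally print summary statistics. As you can see, most columns are character; only the registration number, the year of admission and the judicial department are numeric.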

data.NYS <- readr::read_csv("/srv/shiny-server/SemesterProject/Data/NYS_Attorney_Registrations.csv",
                            col_types = list("Zip" = col_character(), "Zip Plus Four" = col_character(), 
                                             "Suffix" = col_skip(), "Middle Name" = col_skip()))

summary(data.NYS)
##  Registration Number  First Name         Last Name        
##  Min.   :1000017    Length:336887      Length:336887     
##  1st Qu.:1870410    Class :character   Class :character  
##  Median :2720977    Mode  :character   Mode  :character  
##  Mean   :3066439                                         
##  3rd Qu.:4435628                                         
##  Max.   :5347455                                         
##                                                          
##  Company Name         Street 1           Street 2        
##  Length:336887      Length:336887      Length:336887     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      City              State               Zip           
##  Length:336887      Length:336887      Length:336887     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Zip Plus Four        Country             County         
##  Length:336887      Length:336887      Length:336887     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  Phone Number       Email Address      Year Admitted 
##  Length:336887      Length:336887      Min.   :1898  
##  Class :character   Class :character   1st Qu.:1982  
##  Mode  :character   Mode  :character   Median :1995  
##                                        Mean   :1991  
##                                        3rd Qu.:2006  
##                                        Max.   :2015  
##                                        NA's   :9     
##  Judicial Department of Admission  Law School           Status         
##  Min.   :1.000                    Length:336887      Length:336887     
##  1st Qu.:1.000                    Class :character   Class :character  
##  Median :2.000                    Mode  :character   Mode  :character  
##  Mean   :2.017                                                         
##  3rd Qu.:3.000                                                         
##  Max.   :4.000                                                         
##  NA's   :8                                                             
##  Next Registration 
##  Length:336887     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
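
Why the zip columns are forced to character can be shown on a toy file. The sketch below uses base R's read.csv instead of readr, only to stay dependency-free; the column names and rows are invented for illustration. A column class of "NULL" skips a column, much like col_skip() above.

```r
# Toy CSV (invented rows); the first zip code has a leading zero.
csv_text <- "Name,Suffix,Zip\nAnn,Jr,02134\nBob,,10001\n"

# Default type guessing parses Zip as a number and drops the leading zero.
guessed <- read.csv(text = csv_text)

# Explicit classes: keep Zip as text and skip the Suffix column ("NULL").
typed <- read.csv(text = csv_text,
                  colClasses = c("character", "NULL", "character"))

print(guessed$Zip)  # zip codes parsed as numbers
print(typed)        # Zip kept as character, Suffix dropped
```

The same reasoning applies to any identifier-like column (phone numbers, registration codes) where leading zeros carry meaning.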

3 Data cleansing

Because the data contain blank cells, I replace empty strings (“”) with NA. I also rename the columns to make them shorter.

data.NYS[data.NYS == ""] <- NA
colnames(data.NYS) <- c("ID", "F.Name", "L.Name", "Comp.Name", "Street_1", "Street_2", "City", "State", "Zip", "Zip_2", "Country", "County",  "Phone", "Email", "Year_Adm", "JDoA", "Law_School", "Status", "Next_Reg")
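
On a toy data frame (invented values), the blank-to-NA replacement and the positional renaming behave as follows. This is only a sketch of the idiom; whether the real file still contains empty strings after import depends on how the reader treats blanks.

```r
toy <- data.frame(Email = c("a@x.com", "", "b@y.org"),
                  City  = c("", "Albany", "Buffalo"),
                  stringsAsFactors = FALSE)

# toy == "" is a logical matrix; every TRUE cell is overwritten with NA.
toy[toy == ""] <- NA

# Renaming works by position, so the new names must follow the column order.
colnames(toy) <- c("Mail", "Town")
print(toy)
```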

3.1 1Q: Which email providers did lawyers use to register?

Using the stringi package, I split the Email column at the first @ into a two-column matrix. I then lower-case it with tolower, drop the rows containing NA with na.omit (i.e. the cells I replaced above) and convert the matrix to a data frame. The part before the @ becomes the “NickName” column and the part after it the “Organisation” column. Beware: some people do not fill in their email address properly (maybe because of http://rickrobinson.files.wordpress.com/2012/10/it-systems.jpg, or out of laziness)!

Finally, using the table function, I get the frequencies of the email providers. Google (gmail.com) leads.

splitted.email <- stri_split_fixed(data.NYS$Email, "@", 2, omit_empty = NA, simplify = TRUE) 
splitted.email <- data.frame(na.omit(tolower(splitted.email)))
colnames(splitted.email) <- c("NickName", "Organisation")

splitted.email <- as.data.frame(table(splitted.email$Organisation))
splitted.email <- plyr::arrange(splitted.email, desc(splitted.email$Freq))
head(splitted.email, 15)
##                  Var1  Freq
## 1           gmail.com 13796
## 2             aol.com  4111
## 3           yahoo.com  3315
## 4         hotmail.com  1802
## 5         verizon.net   816
## 6       optonline.net   815
## 7             msn.com   479
## 8         comcast.net   437
## 9           usdoj.gov   365
## 10 courts.state.ny.us   302
## 11      legal-aid.org   265
## 12             me.com   254
## 13        skadden.com   244
## 14       nycourts.gov   242
## 15        law.nyc.gov   202
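
The same split-and-count idea can be sketched in base R on a handful of invented addresses. This is not the stringi code above, just an illustration of what happens to missing and malformed entries; note that a doubled @ leaves a stray "@" in the domain, which is one way badly typed addresses end up as their own "provider".

```r
emails <- c("Ann@Gmail.com", "bob@aol.com", "carl@gmail.com",
            NA, "no-at-sign", "dora@@gmail.com")

# Keep only entries that contain an @ at all.
has_at <- !is.na(emails) & grepl("@", emails, fixed = TRUE)

# Drop everything up to and including the first @, then lower-case.
domain <- tolower(sub("^[^@]*@", "", emails[has_at]))

# Frequency table, most common domain first.
counts <- sort(table(domain), decreasing = TRUE)
print(counts)
```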

Not many law firms, right? Actually just two (not counting the government itself, of course): http://www.legal-aid.org/en/home.aspx and http://www.skadden.com/. The latter is also the second largest law firm by revenue. The former is not even a commercial law office, merely a non-profit helping poor people!

3.2 2Q: Which law schools did people attend?

Here I can reuse the same code, but what is far more interesting is that people fill in their law school in totally different ways. I can understand that from foreigners, but what about “Harvard”, “Harvard Law School” and “Harvard University” (very US-centric)? Just look at that. I mean, come on…

law_school <- as.data.frame(table(tolower(data.NYS$Law_School)))
law_school <- plyr::arrange(law_school, desc(law_school$Freq))
head(law_school, 15)
##                   Var1  Freq
## 1  brooklyn law school 12329
## 2  new york university 11755
## 3  new york law school  9752
## 4              harvard  8023
## 5             brooklyn  7651
## 6              fordham  7466
## 7             columbia  6497
## 8  st johns university  6215
## 9    albany law school  5127
## 10  fordham university  4707
## 11 columbia law school  4628
## 12  harvard law school  4600
## 13            new york  3948
## 14   boston university  3918
## 15                 nyu  3786

Four lawyers even wrote “harvard 1950”. Furthermore, there are, for example, 551 distinct spellings containing “harvard” in the Law_School column. The most interesting fact to me is that some people cannot even spell the name of their law school (or of the company they work for) properly. I wouldn’t want to go to such lawyers :( OMG

Source: http://www.r-bloggers.com/select-operations-on-r-data-frames/

harvard <- law_school[grep("harvard", law_school$Var1, ignore.case=T),]
head(harvard)
##                    Var1 Freq
## 4               harvard 8023
## 12   harvard law school 4600
## 233  harvard university  160
## 287         harvard law  113
## 1014       harvard univ   15
## 1623   harvard law schl    7
newyork <- law_school[grep("new york", law_school$Var1, ignore.case=T),]
head(newyork)
##                                          Var1  Freq
## 2                         new york university 11755
## 3                         new york law school  9752
## 13                                   new york  3948
## 24          new york university school of law  2955
## 83                               new york law   640
## 146 city university of new york school of law   321
brooklyn <- law_school[grep("brooklyn", law_school$Var1, ignore.case=T),]
head(brooklyn)
##                     Var1  Freq
## 1    brooklyn law school 12329
## 5               brooklyn  7651
## 49          brooklyn law  1192
## 703    brooklyn law schl    27
## 974    st johns brooklyn    16
## 1156    brooklyn law sch    12
johns <- law_school[grep("johns|john's", law_school$Var1, ignore.case=T),]
head(johns)
##                                   Var1 Freq
## 8                  st johns university 6215
## 16                            st johns 3668
## 47 st. john's university school of law 1315
## 52               st. john's university 1119
## 73                       st johns univ  745
## 98  st. johns university school of law  497
# first column: number of rows (distinct spellings); second column: number of columns
rbind(johns = dim(johns),brooklyn = dim(brooklyn), ny = dim(newyork), harvard = dim(harvard))
##          [,1] [,2]
## johns     415    2
## brooklyn  173    2
## ny        973    2
## harvard   551    2
law_school[grep("new york law shool", law_school$Var1, ignore.case=T),] # FAIL
##                    Var1 Freq
## 3885 new york law shool    2
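
To get a rough total per school despite the many spellings, one could sum the frequencies of all variants that contain a keyword. The sketch below does this for a few rows copied from the tables above; the keyword list is my own choice, and the approach is deliberately crude: “st johns brooklyn” is counted under brooklyn, for instance.

```r
# A few rows copied from the frequency tables above.
schools <- data.frame(
  Var1 = c("harvard", "harvard law school", "harvard univ",
           "brooklyn law school", "brooklyn", "st johns brooklyn"),
  Freq = c(8023, 4600, 15, 12329, 7651, 16),
  stringsAsFactors = FALSE
)

keywords <- c("harvard", "brooklyn")

# Sum the frequencies of every spelling that contains the keyword.
totals <- sapply(keywords, function(k) {
  sum(schools$Freq[grepl(k, schools$Var1, fixed = TRUE)])
})
print(totals)
```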

3.3 3Q: Where do lawyers work?

Sadly enough, I first need to get rid of business-entity designations and their abbreviations, symbols such as ‘&’, ‘,’ and ‘()’, and plainly wrong names. The reason is that the data are, as we have seen, very raw indeed.

All of that is done by a function built on functions from the stringi (and stringr) libraries. Even though stringi is implemented in C, the function is still very slow (despite some optimisations such as sapply) and even has to be run twice, because the pattern is complicated and not bulletproof.

data.NYS$Comp.Name <- tolower(data.NYS$Comp.Name)

stringsToCheck <- c("corporation.", "corporation", "incorporated", "corp", "corp.", "group", "gmbh", "company", "pllc", "llp", "llc", "l.l.c.", "l.l.p.", "ltd.","inc.", "inc", "plc", "p.c.", "pc", "p.a.", "l.p.", "lp", "co.", "l.p")
patTTT <- paste(stringsToCheck, collapse = '|')


# Just an idea how to proceed (Old version)

# checkAndCleanFormatting <- function(x) {
#   x <- stri_trim_both(x)  
#   for(i in 1:length(x)) {
#     if(stri_detect_regex(x[i],patTTT) == TRUE) {
#       x[i] <- stri_replace_first_regex(x[i], patTTT, "")
#     }
#   }
#   x <- stri_trim_both(x)
# }

# Much faster version (stringi is implemented in C)
# Run the function twice for better handling of all (edge) cases
for(i in 1:2) {
  data.NYS$Comp.Name <- sapply(as.character(data.NYS$Comp.Name), function(x) {
    if (is.na(x) == TRUE) {
      x <- NA # keep missing names as NA (not the string "NA")
    } else {
      x <- stri_replace_first_fixed(x, "&", "and")
      x <- stri_replace_last_fixed(x, "and", "")
      x <- stri_replace_last_regex(x, ",", "") 
      if(stri_detect_regex(x, patTTT) == TRUE) {
        x <- str_replace(x, patTTT, "") # not from stringi because of weird behaviour
      }
      x <- stri_trim_both(x)
    }
  })
}
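
One reason the pattern is not bulletproof: patTTT is used as an unanchored regular expression, so every "." matches any character and a suffix can match in the middle of a name. The base-R sketch below first shows the effect and then one possible, stricter alternative. The shortened suffix list, the helper name strip_suffix and the anchoring rule are my assumptions, not the code used for the results in this post.

```r
# Demonstration of the pitfall: as a regex, "co." also matches "com".
naive <- sub("co.", "", "verizon communications")
print(naive)

# Hypothetical, shortened suffix list; extend as needed.
suffixes <- c("corporation", "incorporated", "corp.", "corp",
              "l.l.c.", "llc", "llp", "inc.", "inc", "co.", "p.c.")

# Turn each "." into "[.]" so that it only matches a literal dot.
escaped <- gsub(".", "[.]", suffixes, fixed = TRUE)

# Match a suffix only at the end of the name, after a space, comma or start.
suffix_pattern <- paste0("(^|[ ,]+)(", paste(escaped, collapse = "|"), ")$")

strip_suffix <- function(x) {
  x <- trimws(tolower(x))
  trimws(sub(suffix_pattern, "", x))
}

cleaned <- strip_suffix(c("Verizon Communications", "Skadden Arps LLP", "Acme, Inc."))
print(cleaned)
```

Because sub and trimws are vectorised, this variant needs neither sapply nor a second pass, although it still only removes one trailing suffix per name.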

Once the company names are cleaned, I create a bar plot of the top 20 company names and the number of registered lawyers each one has. Beware that the function above does not handle all cases: for example, davis polk and wardwell london is NOT counted as (i.e. assigned to) davis polk and wardwell! Thus, the numbers are not exact. See below for a prime example.

com_name <- as.data.frame(table(stri_trim_both(data.NYS$Comp.Name)))
com_name <- plyr::arrange(com_name, desc(com_name$Freq))
com_name$Var1 <- as.character(com_name$Var1)

com_name.plot <- ggplot(data=com_name[1:20,], aes(x=reorder(Var1, Freq), y=Freq)) 
com_name.plot <- com_name.plot + geom_bar(stat = "identity") + coord_flip(ylim= c(400,900)) + theme(
                      axis.text.y = element_text(size = 14), 
                      axis.text.x = element_text(size = 14)) 
com_name.plot <- com_name.plot + ylab("Number of lawyers registered in New York State") + xlab("Companies")
com_name.plot <- com_name.plot + scale_y_continuous(breaks = seq(400, 900, 50))
com_name.plot

Yeah, a search for “citi” captures just about everything, including names that have nothing to do with the bank.

companysNames <- com_name[grep("citi", com_name$Var1, ignore.case=T),]
head(companysNames, 10)
##                                                 Var1 Freq
## 65                                              citi  229
## 209                              citi global markets   79
## 316                                    citibank n.a.   51
## 528                                         citibank   32
## 1012                               citi private bank   17
## 2273                                     citibank na    8
## 2281                            cuny citizenship now    8
## 2989 united states citizenship  immigration services    7
## 2999          u.s. citizenship  immigration services    7
## 3833                              citizens financial    5

The same search for “goldman” does not fare much better:

##                                            Var1 Freq
## 28                                goldman sachs  333
## 362                belkin burden wenig  goldman   45
## 391                    wilentz goldman  spitzer   43
## 414                           goldman sachs  co   40
## 1506                       goldman sachs (asia)   12
## 1507                goldman sachs international   12
## 2213                    wolff goodrich  goldman    9
## 2698             goldman sachs asset management    7
## 3312 kaman, berlove, marafioti jastein  goldman    6
## 3978                        goldman sachs japan    5
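
A substring search like the ones above can be tightened by anchoring the pattern. The sketch below compares both on names copied from the “citi” output; the anchored pattern is my own assumption about what should count as the bank, and it would still need manual review on the full data.

```r
# Names copied from the "citi" search output above.
firms <- c("citi", "citi global markets", "citibank n.a.",
           "cuny citizenship now", "citizens financial")

# Plain substring search: every name matches.
loose <- grepl("citi", firms, fixed = TRUE)

# Anchored search: the name must start with the whole word "citi" or "citibank".
strict <- grepl("^citi(bank)?\\b", firms, perl = TRUE)

print(data.frame(firms, loose, strict))
```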

3.4 4Q: Join with the Fortune 500 and plot the companies

A good chunk of the firms in the list are law offices. But what about Fortune 500 companies? How many lawyers do they have? Actually, I want to select only the first 200 and plot them again.

First, I get the 2014 list of the F500, for example from the page scraped below.

# read_html() replaces html(), which is deprecated in newer rvest versions
f500 <- read_html("http://www.zyxware.com/articles/4344/list-of-fortune-500-companies-and-their-websites") %>%
  html_node(".data-table") %>%
  html_table(header = TRUE)
f500 <- f500[1:200, ]
f500.com <- data.frame(Names = tolower(f500$Company))

Similar idea as above (changed a bit).

f500.com$Names <- sapply(as.character(f500.com$Names), function(x) {
  if ( is.na(x) == T) {
    x <- NA
  } else {
    x <- stri_replace_first_fixed(x, "&", "and")
    x <- stri_replace_last_fixed(x, "and", "")
    if(stri_detect_regex(x, patTTT) == TRUE) {
      x <- stri_replace_last_regex(x, patTTT, "") 
    }
    x <- stri_replace_last_fixed(x, ",", "")
    x <- stri_replace_last_fixed(x, ".", "")
    x <- stri_trim_both(x)
  }
})

Now I join the two data sets: the company names from the original data set and the list of Fortune 500 companies (actually just the first 200).

right.join <- right_join(com_name, f500.com, by = c("Var1" = "Names"))
right.join <- plyr::arrange(right.join, desc(right.join$Freq))
head(right.join, 15)
##                           Var1 Freq
## 1               morgan stanley  369
## 2                         citi  229
## 3                       pfizer  178
## 4              bank of america  149
## 5                      metlife  143
## 6             american express  143
## 7                       google  100
## 8       american international   98
## 9             johnson  johnson   91
## 10     new york life insurance   85
## 11            general electric   75
## 12 the bank of new york mellon   60
## 13                   microsoft   58
## 14        prudential financial   51
## 15                   tiaa-cref   43

Now let me pick some widely known companies from the F500 and show their number of lawyers.

optesa <- right.join[grep("google|apple|microsoft|att|verizon|comcast|cbs|oracle|intel|chevron|nike|facebook|amazon|directv|dish|boeing",  right.join$Var1, ignore.case=T),]
optesa
##                   Var1 Freq
## 7               google  100
## 13           microsoft   58
## 18                 cbs   37
## 19               apple   35
## 22                 att   29
## 25 verizon munications   26
## 28              oracle   24
## 32               intel   21
## 39             chevron   15
## 42              amazon   12
## 43                nike   12
## 66             directv    5
## 85              boeing    3

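Not bad for Google, or for AT&T considering its merger with DirecTV. (The odd “verizon munications” is an artefact of the cleaning step: as a regular expression, the suffix co. also matches the “com” in “communications”.) However, let’s plot them all.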

com_name2.plot <- ggplot(data=right.join[1:30,], aes(x=reorder(Var1, Freq), y=Freq)) + coord_flip(ylim= c(10,400))
com_name2.plot <- com_name2.plot + geom_bar(stat = "identity") + theme(
  axis.text.y = element_text(size = 14), 
  axis.text.x = element_text(size = 14)) 
com_name2.plot <- com_name2.plot + ylab("Number of lawyers registered in New York State") + xlab("Companies")
com_name2.plot <- com_name2.plot + scale_y_continuous(breaks = seq(10, 400, 40))
com_name2.plot

4 The end

That was all I wanted to show you. As a matter of fact, this should have been part of my semester project at the university, but it has remained unused (very sadly).

5 About the author

I am a student from Germany who is interested in Asia (and its small nations).