Around the globe, cities, states, and countries are attempting to be more open and accountable by sharing open data with citizens.

New York City has over 1200 datasets available on NYC OpenData. There are fewer things more iconic for NYC than a yellow cab, and NYC OpenData provides data on the 52 thousand or so taxi and livery medallion drivers in the city.

The goal of this demo is to find out the most common first and last names of taxi drivers in New York City.

Along the way this demo will use the R language show you how to:

download data directly from the web,
do some basic clean-up and processing of text strings, and
make some nice plots with the R package ggplot2.

Downloading data

We will use the function getURL() from the R package RCurl to download the data.

library(RCurl)
URL <- "https://data.cityofnewyork.us/api/views/jb3k-j3gp/rows.csv?accessType=DOWNLOAD"
X <- getURL(URL, ssl.verifypeer = FALSE)
data <- read.csv(textConnection(X))

You can find the URL by going to NYC OpenData and searching for the “Medallion Drivers” data set.
Then, click on “Export” > “Download” > “CSV” and Right click “Copy link address.”

Drawing

Examine the data

names(data)

## [1] "License.Number"    "Name"              "Type"             
## [4] "Expiration.Date"   "Last.Updated.Date" "Last.Updated.Time"

You can see we only have six variables in the dataset, including the license number, names, and expiration dates of their taxi medallion. The “Last.Updated” variables tell you the last time the entry in the database was updated.

dim(data)

## [1] 52063     6

You can also see that there are over 52 thousand taxi driver in NYC!

str(data)

## 'data.frame':    52063 obs. of  6 variables:
##  $ License.Number   : int  496830 497137 497295 497376 497479 497582 497642 497843 497920 497910 ...
##  $ Name             : Factor w/ 44837 levels "A HANNAN,M","AABIZZA,NOURDDINE",..: 42718 34540 32124 3458 32600 14605 22064 19312 11349 41329 ...
##  $ Type             : Factor w/ 1 level "MEDALLION TAXI DRIVER": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Expiration.Date  : Factor w/ 756 levels "01/01/2016","01/01/2017",..: 4 753 46 740 73 36 702 79 54 48 ...
##  $ Last.Updated.Date: Factor w/ 1 level "01/08/2015": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Last.Updated.Time: Factor w/ 1 level "13:20": 1 1 1 1 1 1 1 1 1 1 ...

str() is a useful function to see how R is treating each of the different variables, e.g., is it a number, factor, etc.? The variable we are interested in - Name - is a factor with 44837 unique levels. That means that some of the 52 thousand drivers have the same name!

Data clean-up

If you look at the variable Name you can see that it includes first and last names and sometimes middle initials. We are going to need to separate out this variable into separate columns for each.

data$Name[1:20]

##  [1] ULLAH,RAHAMAT         RASHID,MIRZA,A        PARHAR,GURNAM        
##  [4] ALI,SHAHID            PERSONNA,JEAN,M       FAKHRUZZAMAN,MOHAMMAD
##  [7] KAMAL,AKHTER          HUSSAIN,MAQSOOD       DANIEL,RAFAEL        
## [10] TAYLOR,MARILYN,C      JANVIER,JOSEPH,W      LOZADA,CESAR,R       
## [13] SINGH,TIRATH          YASAYEV,MIKHAIL       EL-KALLINI,KAMAL     
## [16] SANE,ALMAMY           WAGUENOUNI,MASSY      AMIN,MOHAMMAD        
## [19] SINGH,SURINDER        ELARABY,HOSSAM       
## 44837 Levels: A HANNAN,M AABIZZA,NOURDDINE ... ZZAMAN,MOLLA,MONIRU

First let’s convert Name to a character

data$Name <- as.character(data$Name)

Now, we’re going to use the cSplit() function from the package splitstackshape to split the Name variable at the commas. This will result in a new column (variable) for each. Make sure you specify drop=FALSE in order to keep the original Name variable.

library(splitstackshape)

## Loading required package: data.table

data <- cSplit(data, splitCols="Name", sep=",", drop=FALSE)

names(data)

##  [1] "License.Number"    "Name"              "Type"             
##  [4] "Expiration.Date"   "Last.Updated.Date" "Last.Updated.Time"
##  [7] "Name_1"            "Name_2"            "Name_3"           
## [10] "Name_4"            "Name_5"            "Name_6"

You can see that cSplit made more columns than First name, Last name, Middle name or initial. In fact there are now 6 name columns. That means there are some drivers with up to 6 elements to their name!

Rearrange data for plotting

We are going to use the function aggregate() to count up the different first and last names.

last_names <- aggregate(data$Name_1, by=list(data$Name_1), length)
colnames(last_names)[1] <- "name"
colnames(last_names)[2] <- "count"
last_names <- last_names[order(last_names$count,decreasing=TRUE),]

first_names <- aggregate(data$Name_1, by=list(data$Name_2), length)
colnames(first_names)[1] <- "name"
colnames(first_names)[2] <- "count"
first_names <- first_names[order(first_names$count,decreasing=TRUE),]

We don’t want to plot all of the names so let’s just take the top 30:

last_names_top30 <- last_names[1:30,]
last_names_top30

##            name count
## 15608     SINGH  2889
## 13807    RAHMAN   961
## 8195      ISLAM   954
## 687       AHMED   941
## 9168       KHAN   923
## 7874    HOSSAIN   774
## 1062        ALI   637
## 16894     UDDIN   455
## 4085  CHOWDHURY   437
## 967        ALAM   423
## 4819     DIALLO   366
## 7957    HUSSAIN   331
## 15374    SHERPA   239
## 681       AHMAD   219
## 7566      HAQUE   213
## 8149      IQBAL   199
## 7601      HASAN   190
## 9663      KUMAR   189
## 7626     HASSAN   185
## 2124        BAH   177
## 11231      MIAH   165
## 10489   MAHMOOD   161
## 7852      HOQUE   144
## 2923    BHUIYAN   137
## 2373      BARRY   136
## 10582     MALIK   135
## 15160      SHAH   129
## 8655      KABIR   126
## 14306       ROY   125
## 13941    RASHID   124

first_names_top30 <- first_names[1:30,]
first_names_top30

##           name count
## 6998        MD  2552
## 7379  MOHAMMAD  2063
## 7385  MOHAMMED  1650
## 7701  MUHAMMAD   798
## 7368   MOHAMED   706
## 5287      JEAN   484
## 204      ABDUL   398
## 11143     SYED   360
## 486      AHMED   338
## 6700   MAMADOU   335
## 5504    JOSEPH   299
## 698        ALI   207
## 7182   MICHAEL   203
## 5462      JOHN   180
## 5499      JOSE   165
## 3946    GEORGE   153
## 5879    KHALID   138
## 2605     DAVID   137
## 6467      LUIS   137
## 4777   IBRAHIM   128
## 2551    DANIEL   118
## 4779  IBRAHIMA   118
## 9485    ROBERT   117
## 9907    SAMUEL   115
## 5790      KAZI   110
## 4503    HASSAN   109
## 8839    PIERRE   105
## 807     AMADOU   101
## 10276   SHAHID    99
## 298        ABU    97

Plotting

library(ggplot2)

plot_first_names <- ggplot(first_names_top30, aes(reorder(name,-count),count)) + geom_bar(stat="identity")
plot_first_names <- plot_first_names + xlab("First Name") + ylab("Number")
plot_first_names <- plot_first_names + theme_classic(base_size=12)
plot_first_names <- plot_first_names + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_first_names

plot_last_names <- ggplot(last_names_top30, aes(reorder(name,-count),count)) + geom_bar(stat="identity")
plot_last_names <- plot_last_names + xlab("Last Name") + ylab("Number")
plot_last_names <- plot_last_names + theme_classic(base_size=12)
plot_last_names <- plot_last_names + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_last_names

For first names you’ll notice that there are at least five different versions or spellings of the name “Muhammad” in the top 30 names. Let’s aggregate those into one. (Note: “MD” is an abbreviation for Muhammad).

Combining the name “Muhammad”

We’ll merge some of the most common spellings of “Muhammad” into one.

# combine different version of "MOHAMMAD" into one 
names_fixed <- data 
names_fixed$Name_2[names_fixed$Name_2=="MD"] <- "MOHAMMAD"
names_fixed$Name_2[names_fixed$Name_2=="MOHAMMED"] <- "MOHAMMAD"
names_fixed$Name_2[names_fixed$Name_2=="MUHAMMAD"] <- "MOHAMMAD"
names_fixed$Name_2[names_fixed$Name_2=="MOHAMED"] <- "MOHAMMAD"

# re-aggregate
first_names_fixed <- aggregate(names_fixed$Name_2, by=list(names_fixed$Name_2), length)
colnames(first_names_fixed)[1] <- "name"
colnames(first_names_fixed)[2] <- "count"
first_names_fixed <- first_names_fixed[order(first_names_fixed$count,decreasing=TRUE),]

#  take the top 30 
first_names_fixed_top30 <- first_names_fixed[1:30,]

Now, let’s replot the first names

plot_first_names_fixed <- ggplot(first_names_fixed_top30, aes(reorder(name,-count),count)) + geom_bar(stat="identity")
plot_first_names_fixed <- plot_first_names_fixed + xlab("First Name") + ylab("Number")
plot_first_names_fixed <- plot_first_names_fixed + theme_classic(base_size=12)
plot_first_names_fixed <- plot_first_names_fixed + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_first_names_fixed

Summary

There you go. “Singh” is the most common last name and “Muhammad” and all of its spelling variations is the most common first name for NYC Taxi Drivers!

NYC Taxi Drivers

Michael J. McCann