Around the globe, cities, states, and countries are attempting to be more open and accountable by sharing open data with citizens.
New York City has over 1200 datasets available on NYC OpenData. There are fewer things more iconic for NYC than a yellow cab, and NYC OpenData provides data on the 52 thousand or so taxi and livery medallion drivers in the city.
The goal of this demo is to find out the most common first and last names of taxi drivers in New York City.
Along the way this demo will use the R language show you how to:
R package ggplot2.
We will use the function getURL() from the R package RCurl to download the data.
library(RCurl)
URL <- "https://data.cityofnewyork.us/api/views/jb3k-j3gp/rows.csv?accessType=DOWNLOAD"
X <- getURL(URL, ssl.verifypeer = FALSE)
data <- read.csv(textConnection(X))
You can find the URL by going to NYC OpenData and searching for the “Medallion Drivers” data set.
Then, click on “Export” > “Download” > “CSV” and Right click “Copy link address.”
names(data)
## [1] "License.Number" "Name" "Type"
## [4] "Expiration.Date" "Last.Updated.Date" "Last.Updated.Time"
You can see we only have six variables in the dataset, including the license number, names, and expiration dates of their taxi medallion. The “Last.Updated” variables tell you the last time the entry in the database was updated.
dim(data)
## [1] 52063 6
You can also see that there are over 52 thousand taxi driver in NYC!
str(data)
## 'data.frame': 52063 obs. of 6 variables:
## $ License.Number : int 496830 497137 497295 497376 497479 497582 497642 497843 497920 497910 ...
## $ Name : Factor w/ 44837 levels "A HANNAN,M","AABIZZA,NOURDDINE",..: 42718 34540 32124 3458 32600 14605 22064 19312 11349 41329 ...
## $ Type : Factor w/ 1 level "MEDALLION TAXI DRIVER": 1 1 1 1 1 1 1 1 1 1 ...
## $ Expiration.Date : Factor w/ 756 levels "01/01/2016","01/01/2017",..: 4 753 46 740 73 36 702 79 54 48 ...
## $ Last.Updated.Date: Factor w/ 1 level "01/08/2015": 1 1 1 1 1 1 1 1 1 1 ...
## $ Last.Updated.Time: Factor w/ 1 level "13:20": 1 1 1 1 1 1 1 1 1 1 ...
str() is a useful function to see how R is treating each of the different variables, e.g., is it a number, factor, etc.? The variable we are interested in - Name - is a factor with 44837 unique levels. That means that some of the 52 thousand drivers have the same name!
If you look at the variable Name you can see that it includes first and last names and sometimes middle initials. We are going to need to separate out this variable into separate columns for each.
data$Name[1:20]
## [1] ULLAH,RAHAMAT RASHID,MIRZA,A PARHAR,GURNAM
## [4] ALI,SHAHID PERSONNA,JEAN,M FAKHRUZZAMAN,MOHAMMAD
## [7] KAMAL,AKHTER HUSSAIN,MAQSOOD DANIEL,RAFAEL
## [10] TAYLOR,MARILYN,C JANVIER,JOSEPH,W LOZADA,CESAR,R
## [13] SINGH,TIRATH YASAYEV,MIKHAIL EL-KALLINI,KAMAL
## [16] SANE,ALMAMY WAGUENOUNI,MASSY AMIN,MOHAMMAD
## [19] SINGH,SURINDER ELARABY,HOSSAM
## 44837 Levels: A HANNAN,M AABIZZA,NOURDDINE ... ZZAMAN,MOLLA,MONIRU
First let’s convert Name to a character
data$Name <- as.character(data$Name)
Now, we’re going to use the cSplit() function from the package splitstackshape to split the Name variable at the commas. This will result in a new column (variable) for each. Make sure you specify drop=FALSE in order to keep the original Name variable.
library(splitstackshape)
## Loading required package: data.table
data <- cSplit(data, splitCols="Name", sep=",", drop=FALSE)
names(data)
## [1] "License.Number" "Name" "Type"
## [4] "Expiration.Date" "Last.Updated.Date" "Last.Updated.Time"
## [7] "Name_1" "Name_2" "Name_3"
## [10] "Name_4" "Name_5" "Name_6"
You can see that cSplit made more columns than First name, Last name, Middle name or initial. In fact there are now 6 name columns. That means there are some drivers with up to 6 elements to their name!
We are going to use the function aggregate() to count up the different first and last names.
last_names <- aggregate(data$Name_1, by=list(data$Name_1), length)
colnames(last_names)[1] <- "name"
colnames(last_names)[2] <- "count"
last_names <- last_names[order(last_names$count,decreasing=TRUE),]
first_names <- aggregate(data$Name_1, by=list(data$Name_2), length)
colnames(first_names)[1] <- "name"
colnames(first_names)[2] <- "count"
first_names <- first_names[order(first_names$count,decreasing=TRUE),]
We don’t want to plot all of the names so let’s just take the top 30:
last_names_top30 <- last_names[1:30,]
last_names_top30
## name count
## 15608 SINGH 2889
## 13807 RAHMAN 961
## 8195 ISLAM 954
## 687 AHMED 941
## 9168 KHAN 923
## 7874 HOSSAIN 774
## 1062 ALI 637
## 16894 UDDIN 455
## 4085 CHOWDHURY 437
## 967 ALAM 423
## 4819 DIALLO 366
## 7957 HUSSAIN 331
## 15374 SHERPA 239
## 681 AHMAD 219
## 7566 HAQUE 213
## 8149 IQBAL 199
## 7601 HASAN 190
## 9663 KUMAR 189
## 7626 HASSAN 185
## 2124 BAH 177
## 11231 MIAH 165
## 10489 MAHMOOD 161
## 7852 HOQUE 144
## 2923 BHUIYAN 137
## 2373 BARRY 136
## 10582 MALIK 135
## 15160 SHAH 129
## 8655 KABIR 126
## 14306 ROY 125
## 13941 RASHID 124
first_names_top30 <- first_names[1:30,]
first_names_top30
## name count
## 6998 MD 2552
## 7379 MOHAMMAD 2063
## 7385 MOHAMMED 1650
## 7701 MUHAMMAD 798
## 7368 MOHAMED 706
## 5287 JEAN 484
## 204 ABDUL 398
## 11143 SYED 360
## 486 AHMED 338
## 6700 MAMADOU 335
## 5504 JOSEPH 299
## 698 ALI 207
## 7182 MICHAEL 203
## 5462 JOHN 180
## 5499 JOSE 165
## 3946 GEORGE 153
## 5879 KHALID 138
## 2605 DAVID 137
## 6467 LUIS 137
## 4777 IBRAHIM 128
## 2551 DANIEL 118
## 4779 IBRAHIMA 118
## 9485 ROBERT 117
## 9907 SAMUEL 115
## 5790 KAZI 110
## 4503 HASSAN 109
## 8839 PIERRE 105
## 807 AMADOU 101
## 10276 SHAHID 99
## 298 ABU 97
library(ggplot2)
plot_first_names <- ggplot(first_names_top30, aes(reorder(name,-count),count)) + geom_bar(stat="identity")
plot_first_names <- plot_first_names + xlab("First Name") + ylab("Number")
plot_first_names <- plot_first_names + theme_classic(base_size=12)
plot_first_names <- plot_first_names + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_first_names
plot_last_names <- ggplot(last_names_top30, aes(reorder(name,-count),count)) + geom_bar(stat="identity")
plot_last_names <- plot_last_names + xlab("Last Name") + ylab("Number")
plot_last_names <- plot_last_names + theme_classic(base_size=12)
plot_last_names <- plot_last_names + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_last_names
For first names you’ll notice that there are at least five different versions or spellings of the name “Muhammad” in the top 30 names. Let’s aggregate those into one. (Note: “MD” is an abbreviation for Muhammad).
We’ll merge some of the most common spellings of “Muhammad” into one.
# combine different version of "MOHAMMAD" into one
names_fixed <- data
names_fixed$Name_2[names_fixed$Name_2=="MD"] <- "MOHAMMAD"
names_fixed$Name_2[names_fixed$Name_2=="MOHAMMED"] <- "MOHAMMAD"
names_fixed$Name_2[names_fixed$Name_2=="MUHAMMAD"] <- "MOHAMMAD"
names_fixed$Name_2[names_fixed$Name_2=="MOHAMED"] <- "MOHAMMAD"
# re-aggregate
first_names_fixed <- aggregate(names_fixed$Name_2, by=list(names_fixed$Name_2), length)
colnames(first_names_fixed)[1] <- "name"
colnames(first_names_fixed)[2] <- "count"
first_names_fixed <- first_names_fixed[order(first_names_fixed$count,decreasing=TRUE),]
# take the top 30
first_names_fixed_top30 <- first_names_fixed[1:30,]
Now, let’s replot the first names
plot_first_names_fixed <- ggplot(first_names_fixed_top30, aes(reorder(name,-count),count)) + geom_bar(stat="identity")
plot_first_names_fixed <- plot_first_names_fixed + xlab("First Name") + ylab("Number")
plot_first_names_fixed <- plot_first_names_fixed + theme_classic(base_size=12)
plot_first_names_fixed <- plot_first_names_fixed + theme(axis.text.x=element_text(angle=45, hjust=1))
plot_first_names_fixed
There you go. “Singh” is the most common last name and “Muhammad” and all of its spelling variations is the most common first name for NYC Taxi Drivers!