This lesson will primarily use R’s tidyverse packages to sort, filter, group, summarize, join, and visualize data.
Data:
Cheat sheets:
Note:
%>% is called a “pipe.” It’s how
tidyverse tells R: “And then do this.” Tidyverse’s new pipe is
|>; it means the same thing as
%>% and can be used
interchangeably.rm(list=ls()) clears all objects from the global
environmentgc() frees up memory by cleaning up unused objects;
important if you’re working on more memory-intensive analysisoptions(scipen = 999) disables scientific notation,
important you’re loading an Excel spreadsheet with long numerical
strings.library() loads the packages you’ll be using in R. You
need to load the libraries you’ll be use every time you open R.rm(list=ls())
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 531799 28.5 1184668 63.3 NA 669514 35.8
## Vcells 982509 7.5 8388608 64.0 18432 1851810 14.2
options(scipen = 999)
library(tidyverse)
library(readxl)
library(janitor)
library(scales)
library(zoo)
library(DataExplorer)
A custom function I’m calling “make_report” that creates a dataset report. (Requires DataExplorer)
function() tells R you’re creating a functiontable tells R that the function will be performed on
the entire table{} encloses the entire functionmake_report <- function(table) {
table <- table %>%
mutate(`# Records` = row_number())
missing <- profile_missing(table)
summaries <- table %>%
#remove_empty("cols") %>%
summarize_all(funs(distinct = n_distinct,
count = n())) %>%
gather() %>%
separate(key, c("feature", "stat"), sep = "_(?=[^_]+$)") %>%
spread(stat, value) %>%
right_join(missing, by = "feature")
summaries %>%
mutate_at(vars(count, distinct), as.integer) %>%
mutate(pct_missing = round(pct_missing,2),
pct_distinct = round((distinct/count), 2),
num_values = count-num_missing) %>%
arrange(pct_missing, desc(distinct)) %>%
select(Column = feature,
`# Distinct` = distinct,
`# Not Missing` = num_values,
`# Missing` = num_missing,
`% Missing` = pct_missing,
`% Distinct` = pct_distinct)
}
read_csv loads the data into R studio. The path is
specified inside quotes “” as Folder(s)->File Name.
guess_max = If you don’t specify the column types,
readr will attempt to guess by inspecting 1000 values, evenly spaced
from the first to the last row. If those values happen to be empty,
readr default converts the entire column to TRUE/FALSE (logical), and
anything that isn’t TRUE/FALSE becomes Null. This is usually NOT what
you want. guess_max simply expands that number to whatever
you specify, ensuring that R is looking at all the values before picking
a column type.<- puts the data in the Global Environment, in a
dataframe named “dob_org” (for “original data”)clean_names() is a function from janitor that converts
all column names to lower case and replace spaces with “_” for easier
coding. (Column names with spaces need to be enclosed in `` marks.)dob_org <- read_csv("Input/DOB_NOW_approved_permits_20250219.csv", guess_max = 1000000) %>%
clean_names()
cd <- read_csv("Input/cd_info.csv")
glimpse() let’s you see all the column names and
types at once.
summary() is useful to see highest/lowest values,
means, etc.
profile_missing() from Data Explorer shows how many
missing values are in each column
glimpse(dob_org)
## Rows: 730,475
## Columns: 35
## $ job_filing_number <chr> "Permit is no", "Permit is no", "…
## $ filing_reason <chr> "Initial Permit", "Initial Permit…
## $ house_no <chr> "201", "33", "1537GARAGE", "19", …
## $ street_name <chr> "FRONT STREET", "1 AVENUE", "EAST…
## $ borough <chr> "STATEN ISLAND", "MANHATTAN", "BR…
## $ lot <dbl> 100, 42, 39, 41, 7, 25, 15, 41, 2…
## $ bin <dbl> 5176234, 1006136, 3377155, 100812…
## $ block <dbl> 487, 444, 8301, 520, 1271, 7309, …
## $ c_b_no <dbl> 501, 103, 318, 102, 105, 411, 104…
## $ apt_condo_no_s <chr> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ work_on_floor <chr> "Sidewalk", "Open Space", "Ground…
## $ work_type <chr> "Protection and Mechanical Method…
## $ permittees_license_type <chr> "GC", "GC", "GC", "GC", "GC", "GC…
## $ applicant_license_number <chr> "612039", "624067", "612039", "60…
## $ applicant_first_name <chr> "JENNIFER", "YAKOV", "JENNIFER", …
## $ applicant_middle_name <chr> NA, NA, "A", NA, NA, NA, NA, NA, …
## $ applicant_last_name <chr> "RUSSO", "EISENBACH", "RUSSO", "K…
## $ applicant_business_name <chr> "RUSSO DEVELOPMENT ENTERP", "HEXA…
## $ applicant_business_address <chr> "67 EAST AVENUE", "5314 16TH AVEN…
## $ filing_representative_first_name <chr> "HOWARD", "MOSES", "HOWARD", "JAC…
## $ filing_representative_middle_initial <chr> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ filing_representative_last_name <chr> "TOMPKIN", "SEIDENFELD", "TOMPKIN…
## $ filing_representative_business_name <chr> "RUSSO DEVELOPMENT ENTERPRISES IN…
## $ work_permit <chr> "Permit is not yet issued", "Perm…
## $ approved_date <chr> NA, NA, NA, "06/21/2023 12:00:00 …
## $ issued_date <chr> NA, NA, NA, "08/29/2023 12:00:00 …
## $ expired_date <chr> NA, NA, NA, "04/09/2024 04:00:00 …
## $ job_description <chr> "Demolition of structure using Ca…
## $ estimated_job_costs <dbl> 1000, 2500, 1000, 100000, 1, 6000…
## $ owner_business_name <chr> "PR", "ROMAH MANAGEMENT CORP", "H…
## $ owner_name <chr> "ELIZABETH ARNAIZ", NA, "HPD REPR…
## $ owner_street_address <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ owner_city <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ owner_state <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ owner_zip_code <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
summary(dob_org)
## job_filing_number filing_reason house_no street_name
## Length:730475 Length:730475 Length:730475 Length:730475
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## borough lot bin block
## Length:730475 Min. : 0 Min. : 429584 Min. : 1
## Class :character 1st Qu.: 12 1st Qu.:1051352 1st Qu.: 897
## Mode :character Median : 32 Median :3002131 Median : 1736
## Mean : 956 Mean :2545154 Mean : 2996
## 3rd Qu.: 62 3rd Qu.:3430065 3rd Qu.: 4224
## Max. :164624628 Max. :5870384 Max. :99999
## NA's :143
## c_b_no apt_condo_no_s work_on_floor work_type
## Min. :101.0 Length:730475 Length:730475 Length:730475
## 1st Qu.:106.0 Class :character Class :character Class :character
## Median :301.0 Mode :character Mode :character Mode :character
## Mean :247.8
## 3rd Qu.:318.0
## Max. :595.0
## NA's :6110
## permittees_license_type applicant_license_number applicant_first_name
## Length:730475 Length:730475 Length:730475
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## applicant_middle_name applicant_last_name applicant_business_name
## Length:730475 Length:730475 Length:730475
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## applicant_business_address filing_representative_first_name
## Length:730475 Length:730475
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## filing_representative_middle_initial filing_representative_last_name
## Length:730475 Length:730475
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## filing_representative_business_name work_permit approved_date
## Length:730475 Length:730475 Length:730475
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## issued_date expired_date job_description estimated_job_costs
## Length:730475 Length:730475 Length:730475 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 3750
## Mode :character Mode :character Mode :character Median : 16300
## Mean : 252218
## 3rd Qu.: 79200
## Max. :800010000
## NA's :2
## owner_business_name owner_name owner_street_address owner_city
## Length:730475 Length:730475 Mode:logical Mode:logical
## Class :character Class :character NA's:730475 NA's:730475
## Mode :character Mode :character
##
##
##
##
## owner_state owner_zip_code
## Mode:logical Mode:logical
## NA's:730475 NA's:730475
##
##
##
##
##
profile_missing(dob_org)
## # A tibble: 35 × 3
## feature num_missing pct_missing
## <fct> <int> <dbl>
## 1 job_filing_number 212 0.000290
## 2 filing_reason 0 0
## 3 house_no 0 0
## 4 street_name 0 0
## 5 borough 0 0
## 6 lot 143 0.000196
## 7 bin 0 0
## 8 block 0 0
## 9 c_b_no 6110 0.00836
## 10 apt_condo_no_s 728242 0.997
## # ℹ 25 more rows
make_report() is the custom function we just loaded,
which summarises distinct and missing records in the datatable.arrange() sorts the data by a column and -
sorts it in descending order versus ascending order.
make_report(dob_org) %>%
arrange(-`# Distinct`)
## # A tibble: 36 × 6
## Column `# Distinct` `# Not Missing` `# Missing` `% Missing` `% Distinct`
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 # Records 730475 730475 0 0 1
## 2 work_permit 449020 730263 212 0 0.61
## 3 job_filing… 433174 730263 212 0 0.59
## 4 job_descri… 305841 730473 2 0 0.42
## 5 bin 117125 730475 0 0 0.16
## 6 owner_name 76474 724983 5492 0.01 0.1
## 7 estimated_… 55159 730473 2 0 0.08
## 8 owner_busi… 31728 698646 31829 0.04 0.04
## 9 work_on_fl… 26397 720044 10431 0.01 0.04
## 10 house_no 21217 730475 0 0 0.03
## # ℹ 26 more rows
Now we’ll do some very light data prep so we can analyze the data.
It’s important to do this correctly for this lesson!
rename renames the “community district” column “cd,”
which is easier to type over and overmutate() changes the information in a column, or
creates a new columnmutate_at() mutates multiple columns at the same time
based on a condition, in this case the column name contains the word
“date.”Mutate functions:
as.character()- converts the community district into
text, since numeric identifiers like zip codes, phone numbers, and IDs
aren’t numbers in dataland as you can’t do math on them.mdy_hms() converts the date the permit was issued into
a date, since it loaded as a string. Note, you must use the correct date
format or the dates will turn into Nulls. For example, if the dates had
been stored in European format, you’d use dmy instead of
mdy.year() is a lubridate function that extracts the year
from the date.as.yearmon() is a zoo function that extracts the
year_month from the date, while keeping it sortable in chronological
orderdob <- dob_org %>%
rename(cd = c_b_no) %>%
mutate(cd = as.character(cd))
dob <- dob %>%
mutate_at(vars(contains("_date")), ~mdy_hms(.))
dob <- dob %>%
mutate(issued_year = year(issued_date),
issued_yearmon = as.yearmon(issued_date))
dob <- dob %>%
mutate_if(is.character, ~str_to_upper(.))
dob %>%
count(issued_year)
## # A tibble: 11 × 2
## issued_year n
## <dbl> <int>
## 1 2016 8
## 2 2017 2304
## 3 2018 27332
## 4 2019 41928
## 5 2020 63357
## 6 2021 112525
## 7 2022 143453
## 8 2023 152716
## 9 2024 166838
## 10 2025 19988
## 11 NA 26
dob %>%
count(filing_reason)
## # A tibble: 4 × 2
## filing_reason n
## <chr> <int>
## 1 INITIAL PERMIT 503311
## 2 NO WORK PERMIT 5437
## 3 RENEWAL PERMIT WITH CHANGES 36393
## 4 RENEWAL PERMIT WITHOUT CHANGES 185334
dob %>%
count(work_type)
## # A tibble: 21 × 2
## work_type n
## <chr> <int>
## 1 ANTENNA 7195
## 2 BOILER EQUIPMENT 8951
## 3 CONSTRUCTION FENCE 68805
## 4 CURB CUT 6833
## 5 EARTH WORK 9713
## 6 FOUNDATION 17141
## 7 FULL DEMOLITION 1686
## 8 GENERAL CONSTRUCTION 142654
## 9 GREEN ROOF 5
## 10 MECHANICAL SYSTEMS 68778
## # ℹ 11 more rows
n_distinct() counts the number of distinct values in a
column, rather than just the number of values. This is an
extremely important concept in data. As you can see,
the same job can have multiple filings in the data. Count the number of
rows and the number of unique job filings by boroughdob %>%
count(n_distinct(job_filing_number))
## # A tibble: 1 × 2
## `n_distinct(job_filing_number)` n
## <int> <int>
## 1 433174 730475
dob %>%
group_by(borough) %>%
summarise(n = n(),
job_filings = n_distinct(job_filing_number))
## # A tibble: 5 × 3
## borough n job_filings
## <chr> <int> <int>
## 1 BRONX 69490 40155
## 2 BROOKLYN 191275 112962
## 3 MANHATTAN 289672 177451
## 4 QUEENS 140163 80752
## 5 STATEN ISLAND 39875 21865
dob %>%
group_by(work_type, borough) %>%
summarise(job_filings = n_distinct(job_filing_number)) %>%
spread(borough, job_filings) %>%
arrange(-BRONX)
## # A tibble: 21 × 6
## # Groups: work_type [21]
## work_type BRONX BROOKLYN MANHATTAN QUEENS `STATEN ISLAND`
## <chr> <int> <int> <int> <int> <int>
## 1 GENERAL CONSTRUCTION 7611 24416 45474 18793 4248
## 2 SIDEWALK SHED 6628 13358 19368 5868 344
## 3 PLUMBING 5972 20979 34096 17221 5500
## 4 CONSTRUCTION FENCE 3923 10669 5237 9517 3122
## 5 MECHANICAL SYSTEMS 3506 11450 25101 7893 2102
## 6 SUPPORTED SCAFFOLD 2311 7646 11965 2332 163
## 7 STRUCTURAL 2108 7210 8909 4687 1967
## 8 SIGN 1942 5269 7524 4743 1416
## 9 SPRINKLERS 1654 4832 15065 2589 656
## 10 SOLAR 1244 2160 85 4951 1902
## # ℹ 11 more rows
job_description
to all upper case using str_to_upper, otherwise we’d have
to anticipate all upper/lower case combinations.! means “not”shed <- dob %>%
filter(work_type == "SIDEWALK SHED" &
year(issued_date) >= 2018)
shed %>% count(issued_year)
## # A tibble: 8 × 2
## issued_year n
## <dbl> <int>
## 1 2018 10972
## 2 2019 11599
## 3 2020 11521
## 4 2021 12477
## 5 2022 12628
## 6 2023 12753
## 7 2024 13051
## 8 2025 1427
shed %>%
group_by(year = year(issued_date)) %>%
summarise(n = n_distinct(job_description))
## # A tibble: 8 × 2
## year n
## <dbl> <int>
## 1 2018 1112
## 2 2019 1550
## 3 2020 1799
## 4 2021 2113
## 5 2022 2410
## 6 2023 2424
## 7 2024 2275
## 8 2025 452
shed %>%
count(job_description)
## # A tibble: 6,917 × 2
## job_description n
## <chr> <int>
## 1 (PHASE 4: MADISON AVENUE): INSTALLATION OF HEAVY DUTY SIDEWALK SHED ON… 3
## 2 - SIDEWALK-SHED PROPOSED INSTALLATION OF HEAVY DUTY SIDEWALK SHED FOR … 3
## 3 1NSTALLATION OF TEMPORARY SIDEWALK SHED AS PER PLANS. NO CHANGE IN US… 1
## 4 2060 SEDGWICK AVE SUBMITTING ON BEHALF OF CONTRACTOR FOR THE INSTALLAT… 1
## 5 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHWON ON DRAWIN… 1
## 6 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING… 1
## 7 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING… 1
## 8 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING… 1
## 9 28-105.4.1 EMERGENCY WORK -DUE TO FALLING DEBRIS - EMERGENCY INSTALLAT… 1
## 10 28-105.4.1 EMERGENCY WORK- INSTALLATION OF AN EMERGENCY SIDEWLAK SHED … 7
## # ℹ 6,907 more rows
shed %>%
mutate(job_description = str_to_upper(job_description)) %>%
filter(str_detect(job_description, "EMERGENCY")) %>%
count(job_description)
## # A tibble: 209 × 2
## job_description n
## <chr> <int>
## 1 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHWON ON DRAWIN… 1
## 2 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING… 1
## 3 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING… 1
## 4 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING… 1
## 5 28-105.4.1 EMERGENCY WORK -DUE TO FALLING DEBRIS - EMERGENCY INSTALLAT… 1
## 6 28-105.4.1 EMERGENCY WORK- INSTALLATION OF AN EMERGENCY SIDEWLAK SHED … 7
## 7 28-105.4.1 EMERGENCY WORK- INSTALLATION OF SIDEWALK SHED AS SHOWN ON D… 2
## 8 28-105.4.1 EMERGENCY WORK. INSTALLATION OF EMERGENCY SIDEWLAK SHED TO … 2
## 9 28-105.4.1 EMERGENCY WORK. INSTALLATION OF SIDEWALK SHED TO PROTECT PE… 2
## 10 28-105.4.1 EMERGENCY WORK: INSTALLATION OF SIDEWALK SHED AS DIRECTED B… 3
## # ℹ 199 more rows
shed %>%
mutate(job_description = str_to_upper(job_description)) %>%
filter(!str_detect(job_description, "EMERGENCY")) %>%
count(job_description)
## # A tibble: 6,707 × 2
## job_description n
## <chr> <int>
## 1 (PHASE 4: MADISON AVENUE): INSTALLATION OF HEAVY DUTY SIDEWALK SHED ON… 3
## 2 - SIDEWALK-SHED PROPOSED INSTALLATION OF HEAVY DUTY SIDEWALK SHED FOR … 3
## 3 1NSTALLATION OF TEMPORARY SIDEWALK SHED AS PER PLANS. NO CHANGE IN US… 1
## 4 2060 SEDGWICK AVE SUBMITTING ON BEHALF OF CONTRACTOR FOR THE INSTALLAT… 1
## 5 35 CRESCENT/SH: FILING ON BEHALF OF CONTRACTOR FOR INSTALLATION OF 32 … 2
## 6 8'-0 HIGH SIDEWALK SHED TO BE ERECTED ALONG THE DESIGNATED PERIMETER O… 1
## 7 8'-0 HIGH SIDEWALK SHED TO BE ERECTED ALONG THE DESIGNATED PERIMETERS … 2
## 8 ADD ADDITIONAL SIDEWALK SHED AS PER DRAWINGS 5
## 9 ADDED ADDITIONAL SHED AND LAYOUT. NO OTHER CHANGES TO THE PREVIOUSLY A… 1
## 10 ADDITIONAL THE BUILDING IS RENT CONTROL. SECTION 26 MADE CHANGE TO RE… 3
## # ℹ 6,697 more rows
Sys.Date() returns today’s date. This is a great
function because then every time you rerun your code it will be
current.as.POSIXct as a wrap around the date functions.table <- shed %>%
mutate(id = row_number()) %>%
group_by(borough, cd, bin, job_filing_number) %>%
summarise(min_date = min(issued_date),
max_date = max(expired_date)) %>%
mutate(today = as.POSIXct(Sys.Date()),
max_date = as.POSIXct(ifelse(max_date > today, today, max_date)),
years = difftime(max_date, min_date, units = "days"),
years = as.numeric(years)/365) %>%
filter(years > 0)
cd <- cd %>%
rename(cd = cd_number) %>%
mutate(cd = as.character(cd)) %>%
select(cd, nabes)
table <- table %>%
left_join(cd, by = "cd")
This is a very brief demo of ggplot. “Be awesome in ggplot2” provides a good beginner overview of chart types in R with sample code.
+
instead of pipes to say “and then do this.”table %>%
ggplot(aes(x=borough, y=years)) +
geom_boxplot()
table %>%
ggplot(aes(years)) +
geom_histogram() +
facet_wrap(~borough, ncol=2, scales="free_y")
table %>%
mutate(year_plus = ifelse(years>1, "yes", "no")) %>%
ggplot() +
geom_bar(aes(x=borough, fill=year_plus))
table %>%
mutate(`2year_plus` = ifelse(years>2, "yes", "no")) %>%
ggplot() +
geom_bar(aes(x=borough, fill=`2year_plus`), position = "fill") +
scale_y_continuous(labels = scales::percent)