R2 Workshop

This lesson will primarily use R’s tidyverse packages to sort, filter, group, summarize, join, and visualize data.

Data:

NYC Dept of Buildings – Approved Permits from NYC open data. Note that this data only includes permits filed through the DOB NOW system, and may not include all permits, such as those filed the old-school way.

Cheat sheets:

R markdown: The type of file we’re using. Easily converts into other file types such as html and PDF.
tidyverse: A group of core packages for data wrangling, analysis, and graphics.
- ggplot2: for making charts and graphs. “Be awesome in ggplot2” is a great beginner tutorial with sample code.
- stringr: for manipulating text (aka “strings”); essential for cleaning data.
- tidyr: for creating and shaping “tidy” data
- dplyr: core data analysis functions
- readr: for reading and writing data in R Studio
- lubridate: for working with dates
janitor: Handy data cleaning, summarizing, and wrangling functions
scales: for pretty scales on charts and maps
DataExplorer: for quickly exploring data

Note:

%>% is called a “pipe.” It’s how tidyverse tells R: “And then do this.” Tidyverse’s new pipe is |>; it means the same thing as %>% and can be used interchangeably.

Setup

rm(list=ls()) clears all objects from the global environment
gc() frees up memory by cleaning up unused objects; important if you’re working on more memory-intensive analysis
options(scipen = 999) disables scientific notation, important you’re loading an Excel spreadsheet with long numerical strings.
library() loads the packages you’ll be using in R. You need to load the libraries you’ll be use every time you open R.

rm(list=ls())
gc()

##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 531799 28.5    1184668 63.3         NA   669514 35.8
## Vcells 982509  7.5    8388608 64.0      18432  1851810 14.2

options(scipen = 999)

library(tidyverse)
library(readxl)
library(janitor)
library(scales)
library(zoo)
library(DataExplorer)

Load make_report function

A custom function I’m calling “make_report” that creates a dataset report. (Requires DataExplorer)

function() tells R you’re creating a function
table tells R that the function will be performed on the entire table
{} encloses the entire function

make_report <- function(table) {
  
  table <- table %>% 
    mutate(`# Records` = row_number())
  
  missing <- profile_missing(table)
  
  summaries <- table %>%   
    #remove_empty("cols") %>%
    summarize_all(funs(distinct = n_distinct,
                       count = n())) %>% 
    gather() %>%
    separate(key, c("feature", "stat"), sep = "_(?=[^_]+$)") %>%
    spread(stat, value) %>%
    right_join(missing, by = "feature")
  
  summaries %>%
    mutate_at(vars(count, distinct), as.integer) %>%
    mutate(pct_missing = round(pct_missing,2),
           pct_distinct = round((distinct/count), 2),
           num_values = count-num_missing) %>%
    arrange(pct_missing, desc(distinct)) %>%
    select(Column = feature,
           `# Distinct` = distinct,
           `# Not Missing` = num_values,
           `# Missing` = num_missing,
           `% Missing` = pct_missing,
           `% Distinct` = pct_distinct) 
}

Load data

read_csv loads the data into R studio. The path is specified inside quotes “” as Folder(s)->File Name.
- guess_max = If you don’t specify the column types, readr will attempt to guess by inspecting 1000 values, evenly spaced from the first to the last row. If those values happen to be empty, readr default converts the entire column to TRUE/FALSE (logical), and anything that isn’t TRUE/FALSE becomes Null. This is usually NOT what you want. guess_max simply expands that number to whatever you specify, ensuring that R is looking at all the values before picking a column type.
- <- puts the data in the Global Environment, in a dataframe named “dob_org” (for “original data”)
clean_names() is a function from janitor that converts all column names to lower case and replace spaces with “_” for easier coding. (Column names with spaces need to be enclosed in `` marks.)

dob_org <- read_csv("Input/DOB_NOW_approved_permits_20250219.csv", guess_max = 1000000) %>%
  clean_names()

cd <- read_csv("Input/cd_info.csv")

Glimpse, Summary & Profile Missing

glimpse() let’s you see all the column names and types at once.
summary() is useful to see highest/lowest values, means, etc.
profile_missing() from Data Explorer shows how many missing values are in each column

glimpse(dob_org)

## Rows: 730,475
## Columns: 35
## $ job_filing_number                    <chr> "Permit is no", "Permit is no", "…
## $ filing_reason                        <chr> "Initial Permit", "Initial Permit…
## $ house_no                             <chr> "201", "33", "1537GARAGE", "19", …
## $ street_name                          <chr> "FRONT STREET", "1 AVENUE", "EAST…
## $ borough                              <chr> "STATEN ISLAND", "MANHATTAN", "BR…
## $ lot                                  <dbl> 100, 42, 39, 41, 7, 25, 15, 41, 2…
## $ bin                                  <dbl> 5176234, 1006136, 3377155, 100812…
## $ block                                <dbl> 487, 444, 8301, 520, 1271, 7309, …
## $ c_b_no                               <dbl> 501, 103, 318, 102, 105, 411, 104…
## $ apt_condo_no_s                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ work_on_floor                        <chr> "Sidewalk", "Open Space", "Ground…
## $ work_type                            <chr> "Protection and Mechanical Method…
## $ permittees_license_type              <chr> "GC", "GC", "GC", "GC", "GC", "GC…
## $ applicant_license_number             <chr> "612039", "624067", "612039", "60…
## $ applicant_first_name                 <chr> "JENNIFER", "YAKOV", "JENNIFER", …
## $ applicant_middle_name                <chr> NA, NA, "A", NA, NA, NA, NA, NA, …
## $ applicant_last_name                  <chr> "RUSSO", "EISENBACH", "RUSSO", "K…
## $ applicant_business_name              <chr> "RUSSO DEVELOPMENT ENTERP", "HEXA…
## $ applicant_business_address           <chr> "67 EAST AVENUE", "5314 16TH AVEN…
## $ filing_representative_first_name     <chr> "HOWARD", "MOSES", "HOWARD", "JAC…
## $ filing_representative_middle_initial <chr> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ filing_representative_last_name      <chr> "TOMPKIN", "SEIDENFELD", "TOMPKIN…
## $ filing_representative_business_name  <chr> "RUSSO DEVELOPMENT ENTERPRISES IN…
## $ work_permit                          <chr> "Permit is not yet issued", "Perm…
## $ approved_date                        <chr> NA, NA, NA, "06/21/2023 12:00:00 …
## $ issued_date                          <chr> NA, NA, NA, "08/29/2023 12:00:00 …
## $ expired_date                         <chr> NA, NA, NA, "04/09/2024 04:00:00 …
## $ job_description                      <chr> "Demolition of structure using Ca…
## $ estimated_job_costs                  <dbl> 1000, 2500, 1000, 100000, 1, 6000…
## $ owner_business_name                  <chr> "PR", "ROMAH MANAGEMENT CORP", "H…
## $ owner_name                           <chr> "ELIZABETH ARNAIZ", NA, "HPD REPR…
## $ owner_street_address                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ owner_city                           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ owner_state                          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ owner_zip_code                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, N…

summary(dob_org)

##  job_filing_number  filing_reason        house_no         street_name       
##  Length:730475      Length:730475      Length:730475      Length:730475     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    borough               lot                 bin              block      
##  Length:730475      Min.   :        0   Min.   : 429584   Min.   :    1  
##  Class :character   1st Qu.:       12   1st Qu.:1051352   1st Qu.:  897  
##  Mode  :character   Median :       32   Median :3002131   Median : 1736  
##                     Mean   :      956   Mean   :2545154   Mean   : 2996  
##                     3rd Qu.:       62   3rd Qu.:3430065   3rd Qu.: 4224  
##                     Max.   :164624628   Max.   :5870384   Max.   :99999  
##                     NA's   :143                                          
##      c_b_no      apt_condo_no_s     work_on_floor       work_type        
##  Min.   :101.0   Length:730475      Length:730475      Length:730475     
##  1st Qu.:106.0   Class :character   Class :character   Class :character  
##  Median :301.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :247.8                                                           
##  3rd Qu.:318.0                                                           
##  Max.   :595.0                                                           
##  NA's   :6110                                                            
##  permittees_license_type applicant_license_number applicant_first_name
##  Length:730475           Length:730475            Length:730475       
##  Class :character        Class :character         Class :character    
##  Mode  :character        Mode  :character         Mode  :character    
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  applicant_middle_name applicant_last_name applicant_business_name
##  Length:730475         Length:730475       Length:730475          
##  Class :character      Class :character    Class :character       
##  Mode  :character      Mode  :character    Mode  :character       
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  applicant_business_address filing_representative_first_name
##  Length:730475              Length:730475                   
##  Class :character           Class :character                
##  Mode  :character           Mode  :character                
##                                                             
##                                                             
##                                                             
##                                                             
##  filing_representative_middle_initial filing_representative_last_name
##  Length:730475                        Length:730475                  
##  Class :character                     Class :character               
##  Mode  :character                     Mode  :character               
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  filing_representative_business_name work_permit        approved_date     
##  Length:730475                       Length:730475      Length:730475     
##  Class :character                    Class :character   Class :character  
##  Mode  :character                    Mode  :character   Mode  :character  
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  issued_date        expired_date       job_description    estimated_job_costs
##  Length:730475      Length:730475      Length:730475      Min.   :        0  
##  Class :character   Class :character   Class :character   1st Qu.:     3750  
##  Mode  :character   Mode  :character   Mode  :character   Median :    16300  
##                                                           Mean   :   252218  
##                                                           3rd Qu.:    79200  
##                                                           Max.   :800010000  
##                                                           NA's   :2          
##  owner_business_name  owner_name        owner_street_address owner_city    
##  Length:730475       Length:730475      Mode:logical         Mode:logical  
##  Class :character    Class :character   NA's:730475          NA's:730475   
##  Mode  :character    Mode  :character                                      
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##  owner_state    owner_zip_code
##  Mode:logical   Mode:logical  
##  NA's:730475    NA's:730475   
##                               
##                               
##                               
##                               
##

profile_missing(dob_org)

## # A tibble: 35 × 3
##    feature           num_missing pct_missing
##    <fct>                   <int>       <dbl>
##  1 job_filing_number         212    0.000290
##  2 filing_reason               0    0       
##  3 house_no                    0    0       
##  4 street_name                 0    0       
##  5 borough                     0    0       
##  6 lot                       143    0.000196
##  7 bin                         0    0       
##  8 block                       0    0       
##  9 c_b_no                   6110    0.00836 
## 10 apt_condo_no_s         728242    0.997   
## # ℹ 25 more rows

Make report

make_report() is the custom function we just loaded, which summarises distinct and missing records in the datatable.
arrange() sorts the data by a column and - sorts it in descending order versus ascending order.
- The column name is enclosed with ```` marks because it has a space and starts with a non-alphabetical character

make_report(dob_org)  %>%
  arrange(-`# Distinct`)

## # A tibble: 36 × 6
##    Column      `# Distinct` `# Not Missing` `# Missing` `% Missing` `% Distinct`
##    <chr>              <int>           <int>       <int>       <dbl>        <dbl>
##  1 # Records         730475          730475           0        0            1   
##  2 work_permit       449020          730263         212        0            0.61
##  3 job_filing…       433174          730263         212        0            0.59
##  4 job_descri…       305841          730473           2        0            0.42
##  5 bin               117125          730475           0        0            0.16
##  6 owner_name         76474          724983        5492        0.01         0.1 
##  7 estimated_…        55159          730473           2        0            0.08
##  8 owner_busi…        31728          698646       31829        0.04         0.04
##  9 work_on_fl…        26397          720044       10431        0.01         0.04
## 10 house_no           21217          730475           0        0            0.03
## # ℹ 26 more rows

Prep Data

Now we’ll do some very light data prep so we can analyze the data.

It’s important to do this correctly for this lesson!

rename renames the “community district” column “cd,” which is easier to type over and over
mutate() changes the information in a column, or creates a new column
mutate_at() mutates multiple columns at the same time based on a condition, in this case the column name contains the word “date.”

Mutate functions:

as.character()- converts the community district into text, since numeric identifiers like zip codes, phone numbers, and IDs aren’t numbers in dataland as you can’t do math on them.
mdy_hms() converts the date the permit was issued into a date, since it loaded as a string. Note, you must use the correct date format or the dates will turn into Nulls. For example, if the dates had been stored in European format, you’d use dmy instead of mdy.
year() is a lubridate function that extracts the year from the date.
as.yearmon() is a zoo function that extracts the year_month from the date, while keeping it sortable in chronological order

dob <- dob_org %>%
  rename(cd = c_b_no) %>%
  mutate(cd = as.character(cd))

dob <- dob %>%
  mutate_at(vars(contains("_date")), ~mdy_hms(.))

dob <- dob %>%
  mutate(issued_year = year(issued_date),
         issued_yearmon = as.yearmon(issued_date))

dob <- dob %>%
  mutate_if(is.character, ~str_to_upper(.))

Counting

dob %>%
  count(issued_year)

## # A tibble: 11 × 2
##    issued_year      n
##          <dbl>  <int>
##  1        2016      8
##  2        2017   2304
##  3        2018  27332
##  4        2019  41928
##  5        2020  63357
##  6        2021 112525
##  7        2022 143453
##  8        2023 152716
##  9        2024 166838
## 10        2025  19988
## 11          NA     26

dob %>%
  count(filing_reason)

## # A tibble: 4 × 2
##   filing_reason                       n
##   <chr>                           <int>
## 1 INITIAL PERMIT                 503311
## 2 NO WORK PERMIT                   5437
## 3 RENEWAL PERMIT WITH CHANGES     36393
## 4 RENEWAL PERMIT WITHOUT CHANGES 185334

dob %>%
  count(work_type)

## # A tibble: 21 × 2
##    work_type                 n
##    <chr>                 <int>
##  1 ANTENNA                7195
##  2 BOILER EQUIPMENT       8951
##  3 CONSTRUCTION FENCE    68805
##  4 CURB CUT               6833
##  5 EARTH WORK             9713
##  6 FOUNDATION            17141
##  7 FULL DEMOLITION        1686
##  8 GENERAL CONSTRUCTION 142654
##  9 GREEN ROOF                5
## 10 MECHANICAL SYSTEMS    68778
## # ℹ 11 more rows

Count Distinct, Group, Summarise

n_distinct() counts the number of distinct values in a column, rather than just the number of values. This is an extremely important concept in data. As you can see, the same job can have multiple filings in the data. Count the number of rows and the number of unique job filings by borough

dob %>%
  count(n_distinct(job_filing_number))

## # A tibble: 1 × 2
##   `n_distinct(job_filing_number)`      n
##                             <int>  <int>
## 1                          433174 730475

dob %>%
  group_by(borough) %>%
  summarise(n = n(),
            job_filings = n_distinct(job_filing_number))

## # A tibble: 5 × 3
##   borough            n job_filings
##   <chr>          <int>       <int>
## 1 BRONX          69490       40155
## 2 BROOKLYN      191275      112962
## 3 MANHATTAN     289672      177451
## 4 QUEENS        140163       80752
## 5 STATEN ISLAND  39875       21865

Group, Summarise, and Spread

dob %>%
  group_by(work_type, borough) %>%
  summarise(job_filings = n_distinct(job_filing_number)) %>%
  spread(borough, job_filings) %>%
  arrange(-BRONX)

## # A tibble: 21 × 6
## # Groups:   work_type [21]
##    work_type            BRONX BROOKLYN MANHATTAN QUEENS `STATEN ISLAND`
##    <chr>                <int>    <int>     <int>  <int>           <int>
##  1 GENERAL CONSTRUCTION  7611    24416     45474  18793            4248
##  2 SIDEWALK SHED         6628    13358     19368   5868             344
##  3 PLUMBING              5972    20979     34096  17221            5500
##  4 CONSTRUCTION FENCE    3923    10669      5237   9517            3122
##  5 MECHANICAL SYSTEMS    3506    11450     25101   7893            2102
##  6 SUPPORTED SCAFFOLD    2311     7646     11965   2332             163
##  7 STRUCTURAL            2108     7210      8909   4687            1967
##  8 SIGN                  1942     5269      7524   4743            1416
##  9 SPRINKLERS            1654     4832     15065   2589             656
## 10 SOLAR                 1244     2160        85   4951            1902
## # ℹ 11 more rows

Filtering

Filtering operators can be found in the dplyr documentation.
String operators can be found in the stringr documentation.
R is case sensitive, so we’ll convert job_description to all upper case using str_to_upper, otherwise we’d have to anticipate all upper/lower case combinations.
In R, ! means “not”

shed <- dob %>%
  filter(work_type == "SIDEWALK SHED" &
           year(issued_date) >= 2018)

shed %>% count(issued_year)

## # A tibble: 8 × 2
##   issued_year     n
##         <dbl> <int>
## 1        2018 10972
## 2        2019 11599
## 3        2020 11521
## 4        2021 12477
## 5        2022 12628
## 6        2023 12753
## 7        2024 13051
## 8        2025  1427

shed %>%
  group_by(year = year(issued_date)) %>%
  summarise(n = n_distinct(job_description))

## # A tibble: 8 × 2
##    year     n
##   <dbl> <int>
## 1  2018  1112
## 2  2019  1550
## 3  2020  1799
## 4  2021  2113
## 5  2022  2410
## 6  2023  2424
## 7  2024  2275
## 8  2025   452

shed %>%
  count(job_description)

## # A tibble: 6,917 × 2
##    job_description                                                             n
##    <chr>                                                                   <int>
##  1 (PHASE 4: MADISON AVENUE): INSTALLATION OF HEAVY DUTY SIDEWALK SHED ON…     3
##  2 - SIDEWALK-SHED PROPOSED INSTALLATION OF HEAVY DUTY SIDEWALK SHED FOR …     3
##  3 1NSTALLATION OF TEMPORARY  SIDEWALK SHED AS PER PLANS. NO CHANGE IN US…     1
##  4 2060 SEDGWICK AVE SUBMITTING ON BEHALF OF CONTRACTOR FOR THE INSTALLAT…     1
##  5 28-105.4.1  EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHWON ON DRAWIN…     1
##  6 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING…     1
##  7 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING…     1
##  8 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING…     1
##  9 28-105.4.1 EMERGENCY WORK -DUE TO FALLING DEBRIS - EMERGENCY INSTALLAT…     1
## 10 28-105.4.1 EMERGENCY WORK- INSTALLATION OF AN EMERGENCY SIDEWLAK SHED …     7
## # ℹ 6,907 more rows

shed %>%
  mutate(job_description = str_to_upper(job_description)) %>%
  filter(str_detect(job_description, "EMERGENCY")) %>%
  count(job_description)

## # A tibble: 209 × 2
##    job_description                                                             n
##    <chr>                                                                   <int>
##  1 28-105.4.1  EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHWON ON DRAWIN…     1
##  2 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING…     1
##  3 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING…     1
##  4 28-105.4.1 EMERGENCY INSTALLATION OF SIDEWALK SHED AS SHOWN ON DRAWING…     1
##  5 28-105.4.1 EMERGENCY WORK -DUE TO FALLING DEBRIS - EMERGENCY INSTALLAT…     1
##  6 28-105.4.1 EMERGENCY WORK- INSTALLATION OF AN EMERGENCY SIDEWLAK SHED …     7
##  7 28-105.4.1 EMERGENCY WORK- INSTALLATION OF SIDEWALK SHED AS SHOWN ON D…     2
##  8 28-105.4.1 EMERGENCY WORK. INSTALLATION OF EMERGENCY SIDEWLAK SHED TO …     2
##  9 28-105.4.1 EMERGENCY WORK. INSTALLATION OF SIDEWALK SHED TO PROTECT PE…     2
## 10 28-105.4.1 EMERGENCY WORK: INSTALLATION OF SIDEWALK SHED AS DIRECTED B…     3
## # ℹ 199 more rows

shed %>%
  mutate(job_description = str_to_upper(job_description)) %>%
  filter(!str_detect(job_description, "EMERGENCY"))  %>%
  count(job_description)

## # A tibble: 6,707 × 2
##    job_description                                                             n
##    <chr>                                                                   <int>
##  1 (PHASE 4: MADISON AVENUE): INSTALLATION OF HEAVY DUTY SIDEWALK SHED ON…     3
##  2 - SIDEWALK-SHED PROPOSED INSTALLATION OF HEAVY DUTY SIDEWALK SHED FOR …     3
##  3 1NSTALLATION OF TEMPORARY  SIDEWALK SHED AS PER PLANS. NO CHANGE IN US…     1
##  4 2060 SEDGWICK AVE SUBMITTING ON BEHALF OF CONTRACTOR FOR THE INSTALLAT…     1
##  5 35 CRESCENT/SH: FILING ON BEHALF OF CONTRACTOR FOR INSTALLATION OF 32 …     2
##  6 8'-0 HIGH SIDEWALK SHED TO BE ERECTED ALONG THE DESIGNATED PERIMETER O…     1
##  7 8'-0 HIGH SIDEWALK SHED TO BE ERECTED ALONG THE DESIGNATED PERIMETERS …     2
##  8 ADD ADDITIONAL SIDEWALK SHED AS PER DRAWINGS                                5
##  9 ADDED ADDITIONAL SHED AND LAYOUT. NO OTHER CHANGES TO THE PREVIOUSLY A…     1
## 10 ADDITIONAL THE BUILDING IS RENT CONTROL.  SECTION 26 MADE CHANGE TO RE…     3
## # ℹ 6,697 more rows

Summarising with dates

Sys.Date() returns today’s date. This is a great function because then every time you rerun your code it will be current.
The dates still need to be in a consistent format, so I’m adding as.POSIXct as a wrap around the date functions.

table <- shed %>%
  mutate(id = row_number()) %>%
  group_by(borough, cd, bin, job_filing_number) %>%
  summarise(min_date = min(issued_date),
            max_date = max(expired_date)) %>%
  mutate(today = as.POSIXct(Sys.Date()),
         max_date = as.POSIXct(ifelse(max_date > today, today, max_date)),
         years = difftime(max_date, min_date, units = "days"),
         years = as.numeric(years)/365) %>%
  filter(years > 0)

Join data

The is a very simple left_join. Other common types of joins include right_join, inner_join, outer_join, semi_join, and anti_join.
You can find more guidance on joining data in the dplyr documentation.

source

cd <- cd %>%
  rename(cd = cd_number) %>%
  mutate(cd = as.character(cd)) %>%
  select(cd, nabes)

table <- table %>%
  left_join(cd, by = "cd")

Ggplot

This is a very brief demo of ggplot. “Be awesome in ggplot2” provides a good beginner overview of chart types in R with sample code.

Once you signal to R that you’re in ggplot, use + instead of pipes to say “and then do this.”

table %>%
  ggplot(aes(x=borough, y=years)) +
  geom_boxplot()

table %>%
  ggplot(aes(years)) +
  geom_histogram() +
  facet_wrap(~borough, ncol=2, scales="free_y")

More on scale formatting can be found in the ggplot documentation

table %>%
  mutate(year_plus = ifelse(years>1, "yes", "no")) %>%
  ggplot() +
  geom_bar(aes(x=borough, fill=year_plus))

table %>%
  mutate(`2year_plus` = ifelse(years>2, "yes", "no")) %>%
  ggplot() +
  geom_bar(aes(x=borough, fill=`2year_plus`), position = "fill") +
  scale_y_continuous(labels = scales::percent)