Introduction
Events over time
- Evolution of the events hosted over the years
- Distribution of events around the months
Events Categories
- Most popular themes
- Distribution of free and paying events within the top categories
Major Events Organizers
Duration of the Events
Main areas of hosted events
Conclusion

# --------------------------Importing all the necessary packages
library(ggplot2) # for graphs
library(tidyverse) # to prepare the data
library(dplyr) # to prepare the data
library(leaflet) # for interactive maps
library(leaflet.extras) # for interactive maps
library(kableExtra) # for kable tables
library(formattable) # handling percentages
library(gridExtra) #to organize the graphs
library(lubridate) # for time format
library(extrafont) # for importing fonts
library(DescTools) # for descriptive stats
library(wordcloud) # for wordcloud
library(RColorBrewer)# used in the wordcloud
library(tm) # used for preparing words for the wordcloud

#----------------Aesthetic variables
purple <- "#36328C"
red <- "#D81E5B"
blue <- "#8DE4FF"
yellow <- "#FFC914"
green <- "#8AD973"
white <- "#FFFFFF"
black <- "#000000"
blue2 <- "#5AA9E6"
grey <- "#808080"
sage_green <- "#8A9A5B"
glaucous <- "#6082B6"
midnight_blue <- "#191970"
matte_black <- "#28282B"

# -----------------Importing the data
file_path <- file.path("Data", "NYC_EVENTS.csv") # this way the file can be opened through multiple operating systems
nyc_events <- read_csv(file = file_path, lazy = FALSE)
file_path_location <- file.path("Data", "NYC_EVENTS_LOCATIONS.csv")
nyc_events_loc <- read_csv(file = file_path_location, lazy = FALSE)
file_path_cat <- file.path("Data", "NYC_EVENTS_CATEGORIES.csv")
nyc_cat <- read_csv(file = file_path_cat, lazy = FALSE)
file_path_org <- file.path("Data", "NYC_EVENTS_ORGANIZER.csv")
nyc_org <- read_csv(file = file_path_org, lazy = FALSE)

Introduction

Context

New York City, or NYC for short, is the most populous city in the United States. Home to almost 9 million people (8,804,190) on an area of over 778 km², the city hosts many events throughout the year that contribute to the magic it is known for.

«I get out of the taxi and it’s probably the only city which in reality looks better than on the postcards: New York.»
Milos Forman - Film director

Description of our database

Disclaimer: All rights to the data presented in this document belong to NYC OpenData.

The database used in this document is a relational database of events hosted in New York City that belongs to the NYC Department of Parks and Recreation.

The diagram of the database can be presented as follows:

# Compute a cumulative sum of records in the database grouped by year
records <-
  nyc_events %>%
  distinct(event_id, date) %>%
  separate(date, c("month", "day", "year"), "/", remove = FALSE) %>%
  group_by(year) %>%
  summarise(number_events = n()) %>%
  mutate(cumulative_records = cumsum(number_events))

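# Register the font (windowsFonts() exists only on Windows, so guard the call)
if (.Platform$OS.type == "windows") {
  windowsFonts(Times_new_roman = windowsFont("Times New Roman"))
}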

# Plot the graph of cumulative sum of records by year
records_graph <-
  ggplot(records, aes(x = year, y = cumulative_records)) +
  geom_col(alpha = 1,
           width = 0.5,
           fill = purple,
           position = "dodge") +
  labs(
    title = "Cumulative sum of number of records in the database",
    subtitle = "2013 - 2019",
    y = "Number of records",
    x = NULL
  ) +
  geom_text(
    aes(label = cumulative_records),
    position = position_nudge(y = 5000),
    family = "Times_new_roman",
    check_overlap = T
  ) +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 16
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
  )
records_graph

Our database contains a total of 74880 records. With an average of 34 new records per day, it was updated regularly.
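One way a records-per-day figure like this could be derived is to divide the number of records by the number of days the data spans. The sketch below uses a small toy vector in place of the `date` column, and assumes month/day/four-digit-year strings; the exact method behind the figure of 34 is not stated in this report.

```r
# Toy stand-in for the `date` column of nyc_events.
# Assumption: dates are stored as "month/day/year" with a four-digit year.
dates <- c("01/01/2013", "01/01/2013", "01/02/2013", "01/04/2013")

parsed <- as.Date(dates, format = "%m/%d/%Y")

# Number of calendar days covered, counting both endpoints.
days_covered <- as.numeric(max(parsed) - min(parsed)) + 1

# Average number of records per day over the covered period.
records_per_day <- length(parsed) / days_covered
records_per_day
```

Days without any record still count in the denominator here, which is usually what is wanted for an update rate.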

Research questions

How did the number of events evolve over the years?

What are the main categories of events hosted by the city?

Who are the major events’ organizers?

How are the events spread around the city?

# Data preparation
nyc <- inner_join(nyc_events, nyc_events_loc, by = "event_id") # merge the events dataset with the locations dataset
nyc$start_time <- hms(nyc$start_time) # change to time format 
nyc$end_time <- hms(nyc$end_time) # change to time format 
nyc$duration <- nyc$end_time - nyc$start_time # Compute duration of the events
nyc <- nyc %>% separate(date, c("month","day","year"), "/", remove = F) # Separate the date into days, months, and years

Events over time

Evolution of the events hosted over the years

# Compute the number of events of each year
nyc_year <-
  nyc_events %>% separate(date, c("month", "day", "year"), "/", remove = F) %>%
  distinct(event_id, year) %>%
  group_by(year) %>%
  summarize(number_events = n())
# Plot the graph of events for each year
events_over_time_graph <-
  ggplot(nyc_year, aes(x = year, y = number_events)) +
  geom_bar(
    stat = "summary",
    fun = "mean",
    width = 0.5,
    position = "dodge",
    aes(fill = number_events > mean(number_events))
  ) +
  scale_fill_manual(values = c(red, purple)) +
  geom_text(
    aes(label = number_events),
    position = position_nudge(y = 1000),
    family = "Times_new_roman",
    check_overlap = T
  ) +
  labs(
    title = "Number of events hosted by New York City",
    subtitle = "2013 - 2019",
    y = "Number of events",
    x = NULL
  ) +
  scale_y_continuous(n.breaks = 6) +
  theme_classic() +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
  )

events_over_time_graph

The number of events hosted by New York has been generally stable between 2013 and 2018:

With 13602 events, 2018 was the year with the most events hosted by the city.
In 2019, only 752 events were recorded in our database, which is surely only a fraction of the real number.

Let’s exclude the year 2019 from the rest of our analysis and narrow our study to the events hosted between 2013 and 2018.

Let’s now dive even deeper into the details of these events…
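The chunks shown in this report do not spell out this filter. A minimal sketch of how it could be applied, using a toy table in place of `nyc` and assuming `year` is the character column produced by `separate()`:

```r
library(dplyr)

# Toy stand-in for the prepared `nyc` table (hypothetical values).
nyc_toy <- data.frame(
  event_id = c(1, 2, 3, 4),
  year = c("2013", "2018", "2019", "2019")
)

# Keep only the events hosted between 2013 and 2018.
nyc_2013_2018 <- nyc_toy %>%
  filter(as.integer(year) <= 2018)

nyc_2013_2018
```

Applying the filter once, right after the data preparation step, would keep every later chunk consistent with the "2013 - 2018" subtitles.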

Distribution of events around the months

# Compute the nbr of events for each month
nyc_group_month <- nyc %>%
  distinct(event_id, month) %>%
  group_by(month) %>%
  summarize(number_events = n())
# renaming the months
nyc_group_month$month = c(
  "January",
  "February",
  "March",
  "April",
  "May",
  "June",
  "July",
  "August",
  "September",
  "October",
  "November",
  "December"
)
# factor reordering the months for a more visual graph
nyc_group_month$month <-
  fct_reorder (nyc_group_month$month, nyc_group_month$number_events)

#Plotting the graph
event_month_graph <-
  ggplot(nyc_group_month, aes(x = month, y = number_events)) +
  geom_col(
    width = 0.7,
    position = "dodge",
    aes(fill = number_events > mean(number_events))
  ) +
  scale_fill_manual(values = c(red, purple)) +
  geom_hline(yintercept = mean(nyc_group_month$number_events)) +
  labs(
    title = "Distribution of events around the months",
    subtitle = "2013 - 2018",
    y = NULL,
    x = NULL
  ) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  geom_text(
    aes(label = number_events),
    position = position_nudge(y = 1000),
    family = "Times_new_roman",
    check_overlap = T
  ) +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
    # axis.line = element_blank(),
    # axis.title.y=element_blank(),
    # axis.text.y=element_blank(),
    # axis.ticks.y=element_blank()
  )
event_month_graph

In the bar graph to the left, we can notice a higher number of events in May, June, July, August, September, and October.

Since the majority of these events take place outdoors, sunny weather allows for more events around the city.

The month of August averages 2202.8 events per year, while January only has 491 events per year on average.

The overall monthly average of events hosted by NYC is 1228.5 events per month.
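The month labels and the monthly average can also be computed without relying on the order of the rows. The sketch below uses toy counts and maps each month number to its name with base R's `month.name`; it assumes only that the `month` column holds month numbers as text, zero-padded or not.

```r
# Toy monthly counts (hypothetical values); `month` is a character column,
# as produced by separate().
monthly <- data.frame(
  month = c("10", "2", "1"),
  number_events = c(300, 120, 90)
)

# Look up each month's name by its number instead of by row position.
monthly$month_name <- month.name[as.integer(monthly$month)]

# Average number of events per month.
monthly_average <- mean(monthly$number_events)

monthly
monthly_average
```

A lookup by number stays correct even if the months sort as text ("1", "10", "11", "12", "2", ...) or if a month is missing from the data.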

Events Categories

Most popular themes

# Creating a wordcloud of all the most reoccurring categories

text <- nyc_cat$name # This is my data
docs <- Corpus(VectorSource(text)) # We add it as a corpus
# We trim and prepare the data
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
set.seed(1234) # for reproducibility
# we plot the wordcloud
wordcloud(
  words = df$word,
  freq = df$freq,
  min.freq = 1,
  max.words = 200,
  random.order = FALSE,
  rot.per = 0.35,
  colors = brewer.pal(8, "Dark2")
)

The events hosted in New York are often organized into categories.

These categories give us insight into the most common themes of New York events.

In the wordcloud to the right, we can spot popular themes such as:

Fitness-related events.
Nature-related events.
Educational events.
Kids-related events.

And so on…

# Compute the number of events for each category
nyc_group_cat <- nyc %>%
  distinct(event_id, Categories) %>%
  group_by(Categories) %>%
  summarize(number_events = n())

# Keep the top n
nyc_group_cat <-
  nyc_group_cat %>% slice_max(n = 10, order_by = number_events)

# Factor reordering for a more visual graph
nyc_group_cat$Categories <-
  fct_reorder (nyc_group_cat$Categories, nyc_group_cat$number_events)

# Building the graph
events_by_cat_graph <-
  ggplot(nyc_group_cat, aes(y = Categories, x = number_events)) +
  geom_col(fill = purple,
           width = 0.7) +
  geom_text(
    aes(label = number_events),
    position = position_nudge(x = 1000),
    family = "Times_new_roman"
  ) +
  labs(
    title = "Top 10 categories of events in NYC",
    subtitle = "2013 - 2018",
    x = "Number of events",
    y = NULL
  ) +
  xlim(0,22000)+
  theme_classic() +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
  )

colnames(nyc_group_cat)[2] <- "Number of events" # Renaming the column for the table
events_by_cat_graph

The following categories of events were among the most popular in NYC between 2013 and 2018:

Categories	Number of events
Best for Kids	19681
Nature	16168
Fitness	15006
Education	12947
Art	11908
Seniors	8768
History	8615
Tours	8288
Outdoor Fitness	7149
Sports	7073

Note that each event can belong to multiple categories, so the counts above add up to more than the total number of events.
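To illustrate the effect of multiple categories per event, here is a small sketch on toy data (the values are hypothetical; the column names `event_id` and `Categories` follow the ones used above). Summing the per-category counts gives more than the number of distinct events:

```r
library(dplyr)

# Toy data: event 1 has two categories, event 3 has three.
events_toy <- data.frame(
  event_id = c(1, 1, 2, 3, 3, 3),
  Categories = c("Nature", "Best for Kids", "Nature",
                 "Art", "Nature", "Best for Kids")
)

# Number of events per category, counting each (event, category) pair once.
per_category <- events_toy %>%
  distinct(event_id, Categories) %>%
  count(Categories, name = "number_events")

total_by_category <- sum(per_category$number_events)
distinct_events <- n_distinct(events_toy$event_id)

per_category
c(total_by_category = total_by_category, distinct_events = distinct_events)
```

For this reason, the category counts should be read as "events tagged with this category" rather than as shares of a whole.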

Question: What is the percentage of free events within the most popular categories? Does it differ from that of the rest of the events?

Distribution of free and paying events within the top categories

# Number of events per cat and per cost
nyc_cat_cost <- nyc %>%
  distinct(event_id, Categories, cost_free) %>%
  group_by(Categories, cost_free) %>%
  summarize(number_events = n())

# Selecting all the top cats
nyc_cat_cost <-
  nyc_cat_cost %>% filter(Categories %in% nyc_group_cat$Categories)

#Computing the percentage
nyc_cat_cost <-
  nyc_cat_cost %>% group_by(Categories) %>% mutate(percent = number_events /
                                                     sum(number_events))

# Doing the same but for all events
all_events <-
  nyc %>%
  distinct(event_id, cost_free) %>%
  group_by(cost_free) %>%
  summarize(number_events = n()) %>%
  mutate(percent = number_events / sum(number_events))

# Renaming for the graph
all_events$Categories <- "ALL EVENTS"

# merging the two
nyc_cat_cost <- rbind(nyc_cat_cost, all_events)

# Factor reordering for a more visual graph
nyc_cat_cost$Categories <-
  fct_reorder (nyc_cat_cost$Categories, nyc_cat_cost$number_events)

# Building the graph
event_cat_cost_graph <-
  ggplot(nyc_cat_cost,
         aes(
           y = Categories,
           x = percent,
           fill = as.character(cost_free)
         )) +
  geom_col(position = 'dodge') +
  scale_fill_manual(values = c(red, purple), name = "1 (Purple) = Free | 0 (Red) = Not Free") +
  geom_text(
    aes(label = percent(percent)),
    position = position_dodge(0.9),
    hjust = -0.1,
    size = 3,
    family = "Times_new_roman"
    #check_overlap = T
  ) +
  labs(
    title = "Percentage of costly vs. free events within the top categories.",
    subtitle = "2013 - 2018",
    x = "Percentage of events",
    y = NULL
  ) +
  scale_x_continuous(labels = scales::percent) +
  theme_classic() +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "bottom",
    panel.background = element_blank(),
    axis.line = element_line()
  )
event_cat_cost_graph

What we can conclude from the graph above:

Whether an event is free depends mostly on the nature of the event.
History, Art, and Education events are more likely to cost money than entertainment events (Outdoor Fitness, for example).
Even within the fitness theme, Outdoor Fitness events are more likely to be free than Sports or regular Fitness events, which might take place in gyms.
Overall, 81.18% of the events in NYC can be attended for free, which is around 4 times the share of paying events.
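The "around 4 times" figure can be checked with a short calculation. This sketch assumes every event is either free or paying, so the paying share is simply the complement of the free share:

```r
# Share of free events reported in the text.
free_share <- 0.8118

# Assumption: events are either free or paying, nothing else.
paying_share <- 1 - free_share

# How many free events there are for each paying event.
ratio <- free_share / paying_share
round(ratio, 2)
```

The ratio comes out a little above 4, which is consistent with the rounded statement above.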

Major Events Organizers

# Selecting the organizers and their emails
organizers <-
  nyc %>%
  inner_join(nyc_org, by = "event_id") %>%
  distinct(event_id, email, event_organizer) %>%
  group_by(event_organizer, email)

# Dropping NAs
organizers <- na.omit(organizers)

# emails to lower case to match similar ones that have been written differently
organizers$email <- tolower(organizers$email)

# Compute the number of events for each organizer
organizers <-
  organizers %>% group_by(event_organizer, email) %>% summarize(number_events = n())

# Renaming the cols for the table
colnames(organizers) <-
  c("Organizer", "Organizer's Email", "Nbr. events organized")

# Order by nbr of events
organizers <-
  organizers %>% ungroup() %>% arrange(desc(`Nbr. events organized`))

# Kable table of the top 20
head(organizers, n = 20) %>%
  arrange(desc(`Nbr. events organized`)) %>%
  kable(escape = F, align = c("l", "l", "c")) %>%
  kable_styling(c("striped", "hover", "condensed"), full_width = F) %>%
  row_spec(0,
           bold = T,
           color = "black",
           font_size = 20) %>%
  column_spec(1,
              bold = T,
              width = "40%",
              color = matte_black) %>%
  column_spec(2, bold = T, width = "30%") %>%
  column_spec(3, bold = T, color = purple)

Organizer	Organizer’s Email	Nbr. events organized
Bryant Park Corp.	bpc@urbanmgt.com	4293
Poe Visitor Center	lucy.aponte@parks.nyc.gov	4115
Central Park Conservancy	tours@centralparknyc.org	3096
Prospect Park Alliance	info@prospectpark.org	1236
Northern Manhattan Parks	info@forttryonparktrust.org	1219
Art & Antiquities	artandantiquities@parks.nyc.gov	1192
City Parks Foundation	sports@cityparksfoundation.org	1167
Fort Tryon Park Trust	info@forttryonparktrust.org	979
Conference House Park	francis.gessner@parks.nyc.gov	907
Summer on the Hudson	summeronthehudson@gmail.com	802
Staten Island Greenbelt Conservancy	naturecenter@sigreenbelt.org	791
City Parks Foundation	info@cityparksfoundation.org	772
Queens Botanical Garden	rforlenza@queensbotanical.org	653
Summer on the Hudson	zhen.heinemann@parks.nyc.gov	568
Queens Botanical Garden	dhector@queensbotanical.org	537
Gracie Mansion	gracieinfo@cityhall.nyc.gov	502
NYC Parks	letitia.guillory@parks.nyc.gov	486
New York Restoration Project	info@nyrp.org	432
Union Square Partnership	info@unionsquarenyc.org	431
New York Botanical Garden	pubrel@nybg.org	399

Duration of the Events

# Selecting the durations of events
duration_table <-
  nyc %>% distinct(event_id, duration) %>%  group_by(duration)

# Changing duration from seconds to hours
duration_table$duration <- as.integer(abs(as.numeric(duration_table$duration)/3600)) 

# Number of events for each duration
duration_table <- duration_table %>% summarise(number_events = n())

# Handling NAs (reassign to duration_table, which is the object used in the graph)
duration_table <- na.omit(duration_table)

# Building the graph
duration_graph <-
  ggplot(duration_table, aes(x = duration, y = number_events)) +
  geom_col( fill=purple, alpha = 0.7)+
  geom_area(fill = red, alpha = 0.5)+
    labs(
    title = "Distribution of the duration of the events",
    subtitle = "2013 - 2018",
    y = "Frequency of events",
    x = "Duration of events in hours"
  ) +
  xlim(0,15) +
    geom_text(
    aes(label = number_events ),
    position = position_nudge(y = 1000),
    family = "Times_new_roman",
    check_overlap = T
  ) +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
  )
duration_graph

Comments:

Events in New York can last anywhere from 1 to 15 hours.
Nevertheless, the distribution shows that most events last only 1 to 2 hours.
A significant number of events last around 7 hours; these are likely events that run all day.
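For reference, the hour buckets on the x-axis come from converting the difference between end and start times to whole hours. A minimal sketch with lubridate on a single toy event (the times are hypothetical):

```r
library(lubridate)

# Toy start and end times in "hours:minutes:seconds" form.
start_time <- hms("10:00:00")
end_time <- hms("12:30:00")

# Difference as a period, converted to seconds and then to hours.
duration_hours <- period_to_seconds(end_time - start_time) / 3600

# as.integer() truncates, so a 2.5-hour event lands in the 2-hour bucket.
duration_bucket <- as.integer(duration_hours)

duration_hours
duration_bucket
```

Because of the truncation, the bar labelled 1 covers events lasting at least 1 hour and less than 2 hours, and so on for the other bars.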

Main areas of hosted events

# Selecting the coords and the organizers 
location <-
  nyc %>% distinct(event_id, long, lat) %>% inner_join(nyc_org, by = "event_id")

# Handling NAs
location <- na.exclude(location)

# Number of events for each organizer and average coords
location <-
  location %>% group_by(event_organizer) %>% summarize(number_events = n(),
                                                       lat = mean(lat),
                                                       long = mean(long))
# Major organizers
location <-
  location %>% filter(number_events > 10)

# Color palette
pal <- colorNumeric(palette =  "YlGnBu",
                    domain = location$number_events)

# Building the map
nyc_map <- leaflet() %>%
  addTiles() %>%
  setView(
    lng = mean(location$long),
    lat = mean(location$lat),
    zoom = 10
  ) %>%
  addCircles(
    lng = location$long,
    lat = location$lat,
    popup = paste(
      "<b> Main area of events organized by:</b>",
      location$event_organizer,
      "<br/> <b>The number of events organized:</b>",
      location$number_events
    ),
    weight = 3,
    radius = location$number_events,
    stroke = TRUE,
    color = pal(location$number_events),
    fillOpacity = 0.8
  ) %>% 
    addLegend("bottomright", 
              pal = pal, values = location$number_events,
    title = "Number of events",
    opacity = 1
  )

nyc_map

Conclusion

# Exporting the file for further analysis
# Make sure read_csv() is called with lazy = FALSE (as above) to be able to write back the prepared data
file_path_export <- file.path("Data","NYC_EVENTS_PREPARED.csv")

# Uncomment this next line if you want to export the prepared data as a csv file
# write_csv(file = file_path_export, x = nyc, progress = T)

To sum up, New York City is indeed the right place to go for a good time. With its many events throughout the year, New York has attracted, and will continue to attract, visitors from around the world. On a personal level, working on this project has been very interesting in terms of how to handle a large database with multiple datasets, how to prepare data to answer a specific research question, and how to organize and pick the right resources to do so. Working on a large dataset certainly has its ups and downs. On one hand, you have enough data from which to extract valuable information.
On the other hand, the larger volume of data requires a higher level of analysis, since you cannot spot anomalies simply by looking at the raw data, and it can limit the use of some visualization tools such as interactive maps.

New York City Events

Ibrahim M.

19/12/2021