# --------------------------Importing all the necessary packages
library(ggplot2) # for graphs
library(tidyverse) # to prepare the data
library(dplyr) # to prepare the data
library(leaflet) # for interactive maps
library(leaflet.extras) # for interactive maps
library(kableExtra) # for kable tables
library(formattable) # handling percentages
library(gridExtra) #to organize the graphs
library(lubridate) # for time format
library(extrafont) # for importing fonts
library(DescTools) # for descriptive stats
library(wordcloud) # for wordcloud
library(RColorBrewer)# used in the wordcloud
library(tm) # used for preparing words for the wordcloud

#----------------Aesthetic variables
purple <- "#36328C"
red <- "#D81E5B"
blue <- "#8DE4FF"
yellow <- "#FFC914"
green <- "#8AD973"
white <- "#FFFFFF"
black <- "#000000"
blue2 <- "#5AA9E6"
grey <- "#808080"
sage_green <- "#8A9A5B"
glaucous <- "#6082B6"
midnight_blue <- "#191970"
matte_black <- "#28282B"

# -----------------Importing the data
file_path <- file.path("Data", "NYC_EVENTS.csv") # this way the file can be opened through multiple operating systems
nyc_events <- read_csv(file = file_path, lazy = FALSE)
file_path_location <- file.path("Data", "NYC_EVENTS_LOCATIONS.csv")
nyc_events_loc <- read_csv(file = file_path_location, lazy = FALSE)
file_path_cat <- file.path("Data", "NYC_EVENTS_CATEGORIES.csv")
nyc_cat <- read_csv(file = file_path_cat, lazy = FALSE)
file_path_org <- file.path("Data", "NYC_EVENTS_ORGANIZER.csv")
nyc_org <- read_csv(file = file_path_org, lazy = FALSE)

Introduction

Context

New York City, or NYC for short, is the most populous city in the United States. Home to almost 9 million people (8 804 190) in an area of over 778 km², the city hosts many events throughout the year that contribute to the magic it is known for.

«I get out of the taxi and it’s probably the only city which in reality looks better than on the postcards: New York. »
Milos Forman - Film director

Description of our database

Disclaimer: All rights over the data presented in this document go to NYC OpenData.

The database used in this document is a relational database of events hosted in New York City. It belongs to the NYC Department of Parks and Recreation.

The diagram of the database can be presented as follows:

# Compute a cumulative sum of records in the database grouped by year
records <-
  nyc_events %>%
  distinct(event_id, date) %>%
  separate(date, c("month", "day", "year"), "/", remove = FALSE) %>%
  group_by(year) %>%
  summarise(number_events = n()) %>%
  mutate(cumulative_records = cumsum(number_events))

# Register the font (windowsFonts() is only available on Windows)
if (.Platform$OS.type == "windows") {
  windowsFonts(Times_new_roman = windowsFont("Times New Roman"))
}

# Plot the graph of cumulative sum of records by year
records_graph <-
  ggplot(records, aes(x = year, y = cumulative_records)) +
  geom_col(alpha = 1,
           width = 0.5,
           fill = purple,
           position = "dodge") +
  labs(
    title = "Cumulative sum of number of records in the database",
    subtitle = "2013 - 2019",
    y = "Number of records",
    x = NULL
  ) +
  geom_text(
    aes(label = cumulative_records),
    position = position_nudge(y = 5000),
    family = "Times_new_roman",
    check_overlap = T
  ) +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 16
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
  )
records_graph

Our database contains a total of 74880 records. With an average of about 34 new records per day, it was updated regularly.
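The daily average quoted above can be reproduced with a short calculation. The sketch below assumes the records span roughly six full years (2013 to 2018); this is an approximation, since the exact first and last dates are not given here.

```r
# Total taken from the text; the six-year span is an assumption
total_records <- 74880
n_days <- as.numeric(as.Date("2018-12-31") - as.Date("2013-01-01")) + 1

# Average number of new records per day
avg_per_day <- total_records / n_days
round(avg_per_day)
```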

Research questions

  • How did the number of events evolve over the years?
  • What are the main categories of events hosted by the city?
  • Who are the major event organizers?
  • How are the events spread around the city?

# Data preparation
nyc <- inner_join(nyc_events, nyc_events_loc, by = "event_id") # merge the events dataset with the locations dataset
nyc$start_time <- hms(nyc$start_time) # change to time format 
nyc$end_time <- hms(nyc$end_time) # change to time format 
nyc$duration <- nyc$end_time - nyc$start_time # Compute duration of the events
nyc <- nyc %>% separate(date, c("month","day","year"), "/", remove = FALSE) # Separate the date into days, months, and years
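
The inner join used above keeps only the events that have a matching row in the locations table, and an event with several locations appears once per location, which is presumably why the later steps call distinct() on event_id. A minimal base R sketch on made-up tables (the toy values and the title and borough columns are illustrative assumptions, not taken from the dataset):

```r
# Toy tables sharing the event_id key (values are illustrative only)
events <- data.frame(
  event_id = c(1, 2, 3),
  title = c("Yoga", "Concert", "Tour")
)
locations <- data.frame(
  event_id = c(2, 3, 3, 4),
  borough = c("Bronx", "Queens", "Brooklyn", "Manhattan")
)

# merge() performs an inner join by default: unmatched ids 1 and 4 are dropped
joined <- merge(events, locations, by = "event_id")

# Event 3 has two locations, so it appears twice in the joined table
n_rows <- nrow(joined)
n_events <- length(unique(joined$event_id))
```

Counting rows of the joined table would therefore count locations rather than events, whereas counting distinct event_id values counts events.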

Events over time

Evolution of the events hosted over the years

# Compute the number of events of each year
nyc_year <-
  nyc_events %>% separate(date, c("month", "day", "year"), "/", remove = F) %>%
  distinct(event_id, year) %>%
  group_by(year) %>%
  summarize(number_events = n())
# Plot the graph of events for each year
events_over_time_graph <-
  ggplot(nyc_year, aes(x = year, y = number_events)) +
  geom_bar(
    stat = "summary",
    fun = "mean",
    width = 0.5,
    position = "dodge",
    aes(fill = number_events > mean(number_events))
  ) +
  scale_fill_manual(values = c(red, purple)) +
  geom_text(
    aes(label = number_events),
    position = position_nudge(y = 1000),
    family = "Times_new_roman",
    check_overlap = T
  ) +
  labs(
    title = "Number of events hosted by New York City",
    subtitle = "2013 - 2019",
    y = "Number of events",
    x = NULL
  ) +
  scale_y_continuous(n.breaks = 6) +
  theme_classic() +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
  )

events_over_time_graph



The number of events hosted by New York remained generally stable between 2013 and 2018:

  • With 13602 events, 2018 was the year with the most events hosted by the city.

  • For 2019, only 752 events are recorded in our database, which is almost certainly only a fraction of the real number.

Let’s exclude the year 2019 from the remainder of our analysis and narrow our study to the events hosted between 2013 and 2018.
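
One way to apply this exclusion is to filter on the year column before any further aggregation. The sketch below uses a small made-up data frame (the values are not from the dataset) and assumes year is stored as text, as produced by separate():

```r
# Toy stand-in for the prepared events table (values are illustrative only)
events <- data.frame(
  event_id = 1:5,
  year = c("2013", "2017", "2018", "2019", "2019"),
  stringsAsFactors = FALSE
)

# Keep only the events from 2013 to 2018
events_2013_2018 <- events[events$year != "2019", ]
nrow(events_2013_2018)
```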

Let’s now dive even deeper into the details of these events…


Distribution of events across the months

# Compute the nbr of events for each month
nyc_group_month <- nyc %>%
  distinct(event_id, month) %>%
  group_by(month) %>%
  summarize(number_events = n())
# renaming the months (month.name is base R's vector of English month names;
# indexing it by the month number avoids relying on the order of the rows)
nyc_group_month$month <- month.name[as.integer(nyc_group_month$month)]
# factor reordering the months for a more visual graph
nyc_group_month$month <-
  fct_reorder (nyc_group_month$month, nyc_group_month$number_events)

#Plotting the graph
event_month_graph <-
  ggplot(nyc_group_month, aes(x = month, y = number_events)) +
  geom_col(
    width = 0.7,
    position = "dodge",
    aes(fill = number_events > mean(number_events))
  ) +
  scale_fill_manual(values = c(red, purple)) +
  geom_hline(yintercept = mean(nyc_group_month$number_events)) +
  labs(
    title = "Distribution of events across the months",
    subtitle = "2013 - 2018",
    y = NULL,
    x = NULL
  ) +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  geom_text(
    aes(label = number_events),
    position = position_nudge(y = 1000),
    family = "Times_new_roman",
    check_overlap = T
  ) +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
    # axis.line = element_blank(),
    # axis.title.y=element_blank(),
    # axis.text.y=element_blank(),
    # axis.ticks.y=element_blank()
  )
event_month_graph



In the bar graph, we can see that the number of events is highest in May, June, July, August, September, and October.

Since the majority of these events take place outdoors, the warmer and sunnier months likely make room for more events around the city.

The month of August averages 2202.8 events per year, while January averages only 491 events per year.

The overall monthly average of events hosted by NYC is 1228.5 events per month.
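
Per-year averages such as the ones quoted above are obtained by counting events for each year and month, then averaging the yearly counts of each month. A minimal base R sketch on made-up data (the numbers below are illustrative, not taken from the dataset):

```r
# Toy events with a year and a month number (values are illustrative only)
events <- data.frame(
  event_id = 1:8,
  year  = c(2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014),
  month = c(1, 8, 8, 1, 8, 8, 8, 8)
)

# Count the events of each (year, month) pair
counts <- aggregate(event_id ~ year + month, data = events, FUN = length)

# Average the yearly counts for each month
avg_per_month <- tapply(counts$event_id, counts$month, mean)
avg_per_month
```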

Events Categories

Distribution of free and paid events within the top categories

# Number of events per cat and per cost
nyc_cat_cost <- nyc %>%
  distinct(event_id, Categories, cost_free) %>%
  group_by(Categories, cost_free) %>%
  summarize(number_events = n())

# Selecting all the top cats
# (nyc_group_cat, the table of top categories, must already exist at this point)
nyc_cat_cost <-
  nyc_cat_cost %>% filter(Categories %in% nyc_group_cat$Categories)

#Computing the percentage
nyc_cat_cost <-
  nyc_cat_cost %>% group_by(Categories) %>% mutate(percent = number_events /
                                                     sum(number_events))

# Doing the same but for all events
all_events <-
  nyc %>%
  distinct(event_id, cost_free) %>%
  group_by(cost_free) %>%
  summarize(number_events = n()) %>%
  mutate(percent = number_events / sum(number_events))

# Renaming for the graph
all_events$Categories <- "ALL EVENTS"

# merging the two
nyc_cat_cost <- rbind(nyc_cat_cost, all_events)

# Factor reordering for a more visual graph
nyc_cat_cost$Categories <-
  fct_reorder (nyc_cat_cost$Categories, nyc_cat_cost$number_events)

# Building the graph
event_cat_cost_graph <-
  ggplot(nyc_cat_cost,
         aes(
           y = Categories,
           x = percent,
           fill = as.character(cost_free)
         )) +
  geom_col(position = 'dodge') +
  scale_fill_manual(values = c(red, purple), name = "1 (Purple) = Free | 0 (Red) = Not Free") +
  geom_text(
    aes(label = percent(percent)),
    position = position_dodge(0.9),
    hjust = -0.1,
    size = 3,
    family = "Times_new_roman"
    #check_overlap = T
  ) +
  labs(
    title = "Percentage of paid vs. free events within the top categories.",
    subtitle = "2013 - 2018",
    x = "Percentage of events",
    y = NULL
  ) +
  scale_x_continuous(labels = scales::percent) +
  theme_classic() +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "bottom",
    panel.background = element_blank(),
    axis.line = element_line()
  )
event_cat_cost_graph

What we can conclude from the graph above:

  • Whether an event is free depends mostly on the nature of the event.
  • History, Art, and Education related events are more likely to cost money than entertainment events (Outdoor Fitness, for example).
  • Even within the Fitness theme, Outdoor Fitness events are more likely to be free than Sports or regular Fitness events, which might take place in gyms.
  • Overall, 81.18% of the events in NYC can be attended for free, a share around 4 times higher than that of paying events.
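
The "around 4 times" figure follows directly from the share of free events, assuming every event is classified as either free or paying:

```r
# Share of free events, taken from the text
free_share <- 81.18

# Assumption: events are either free or paying, so the shares add up to 100
paid_share <- 100 - free_share

# Ratio of free to paying events
free_share / paid_share
```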

Major Events Organizers

# Selecting the organizers and their emails
organizers <-
  nyc %>% inner_join(nyc_org, by = "event_id") %>%  distinct(event_id, email, event_organizer) %>% group_by(event_organizer, email)

# Dropping NAs
organizers <- na.omit(organizers)

# emails to lower case to match similar ones that have been written differently
organizers$email <- tolower(organizers$email)

# Compute the number of events for each organizer (dropping the grouping afterwards)
organizers <-
  organizers %>% group_by(event_organizer, email) %>% summarize(number_events = n(), .groups = "drop")

# Renaming the cols for the table
colnames(organizers) <-
  c("Organizer", "Organizer's Email", "Nbr. events organized")

# Order by nbr of events
organizers <-
  organizers %>% arrange(desc(`Nbr. events organized`))

# Kable table of the top 20
head(organizers, n = 20) %>%
  arrange(desc(`Nbr. events organized`)) %>%
  kable(escape = F, align = c("l", "l", "c")) %>%
  kable_styling(c("striped", "hover", "condensed"), full_width = F) %>%
  row_spec(0,
           bold = T,
           color = "black",
           font_size = 20) %>%
  column_spec(1,
              bold = T,
              width = "40%",
              color = matte_black) %>%
  column_spec(2, bold = T, width = "30%") %>%
  column_spec(3, bold = T, color = purple)
Organizer                              Organizer’s Email    Nbr. events organized
Bryant Park Corp.                                           4293
Poe Visitor Center                                          4115
Central Park Conservancy                                    3096
Prospect Park Alliance                                      1236
Northern Manhattan Parks                                    1219
Art & Antiquities                                           1192
City Parks Foundation                                       1167
Fort Tryon Park Trust                                       979
Conference House Park                                       907
Summer on the Hudson                                        802
Staten Island Greenbelt Conservancy                         791
City Parks Foundation                                       772
Queens Botanical Garden                                     653
Summer on the Hudson                                        568
Queens Botanical Garden                                     537
Gracie Mansion                                              502
NYC Parks                                                   486
New York Restoration Project                                432
Union Square Partnership                                    431
New York Botanical Garden                                   399

Duration of the Events

# Selecting the durations of events
duration_table <-
  nyc %>% distinct(event_id, duration) %>%  group_by(duration)

# Changing duration from seconds to hours
duration_table$duration <- as.integer(abs(as.numeric(duration_table$duration)/3600)) 

# Number of events for each duration
duration_table <- duration_table %>% summarise(number_events = n())

# Handling NAs (reassigned to duration_table so that the graph below uses the cleaned data)
duration_table <- na.omit(duration_table)

# Building the graph
duration_graph <-
  ggplot(duration_table, aes(x = duration, y = number_events)) +
  geom_col( fill=purple, alpha = 0.7)+
  geom_area(fill = red, alpha = 0.5)+
    labs(
    title = "Distribution of the duration of the events",
    subtitle = "2013 - 2018",
    y = "Frequency of events",
    x = "Duration of events in hours"
  ) +
  xlim(0,15) +
    geom_text(
    aes(label = number_events ),
    position = position_nudge(y = 1000),
    family = "Times_new_roman",
    check_overlap = T
  ) +
  theme(
    text = element_text(
      family = "Times_new_roman",
      face = "bold",
      size = 12
    ),
    legend.position = "none",
    panel.background = element_blank(),
    axis.line = element_line()
  )
duration_graph



Comments:

  • Events in New York can last anywhere from 1 to 15 hours.

  • Nevertheless, the distribution of the events’ durations shows that most events last only 1 to 2 hours.

  • A significant number of events last around 7 hours; these are probably events that run all day.
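
The durations above are computed as the absolute difference between the end and start times. For an event that runs past midnight, an absolute difference can overstate the duration. A possible alternative, shown here as a sketch on made-up times (the helper to_seconds is hypothetical and not part of the original code), is to wrap negative differences around a 24-hour day:

```r
# Hypothetical helper: convert "HH:MM:SS" strings to seconds since midnight
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Illustrative times: the second event ends after midnight
start_time <- c("10:00:00", "22:00:00")
end_time   <- c("12:30:00", "01:00:00")

diff_sec <- to_seconds(end_time) - to_seconds(start_time)

abs_hours  <- abs(diff_sec) / 3600        # absolute difference, as used above
wrap_hours <- (diff_sec %% 86400) / 3600  # wraps events that cross midnight
```

With these toy values the second event lasts 21 hours under the absolute difference but 3 hours once wrapped; which reading is correct for the real data depends on how overnight events were recorded.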

Main areas of hosted events

# Selecting the coords and the organizers 
location <-
  nyc %>% distinct(event_id, long, lat) %>% inner_join(nyc_org, by = "event_id")

# Handling NAs
location <- na.exclude(location)

# Number of events for each organizer and average coords
location <-
  location %>% group_by(event_organizer) %>% summarize(number_events = n(),
                                                       lat = mean(lat),
                                                       long = mean(long))
# Major organizers
location <-
  location %>% filter(number_events > 10)

# Color palette
pal <- colorNumeric(palette =  "YlGnBu",
                    domain = location$number_events)

# Building the map
nyc_map <- leaflet() %>%
  addTiles() %>%
  setView(
    lng = mean(location$long),
    lat = mean(location$lat),
    zoom = 10
  ) %>%
  addCircles(
    lng = location$long,
    lat = location$lat,
    popup = paste(
      "<b> Main area of events organized by:</b>",
      location$event_organizer,
      "<br/> <b>The number of events organized:</b>",
      location$number_events
    ),
    weight = 3,
    radius = location$number_events,
    stroke = TRUE,
    color = pal(location$number_events),
    fillOpacity = 0.8
  ) %>% 
    addLegend("bottomright", 
              pal = pal, values = location$number_events,
    title = "Number of events",
    opacity = 1
  )

nyc_map

Conclusion

# Exporting the file for further analysis
# The source files are read with lazy = FALSE in read_csv so that the prepared data can be written back to the Data folder
file_path_export <- file.path("Data","NYC_EVENTS_PREPARED.csv")

# Uncomment this next line if you want to export the prepared data as a csv file
# write_csv(file = file_path_export, x = nyc, progress = T)

To sum up, New York City is indeed the right place to go for a good time. With events all year round, New York has attracted, and will likely keep attracting, many visitors from around the world. On a personal level, working on this project has been very instructive: it showed how to handle a large database made up of multiple datasets, how to prepare data to answer a specific question, and how to organize and pick the right resources to do so. Working on a large dataset certainly has its ups and downs. On the one hand, there is enough data from which to extract valuable information. On the other hand, the larger volume of data requires a more careful analysis, since abnormalities cannot be spotted simply by looking at the raw data, and it can limit the use of some visualization tools such as interactive maps.