CUNY Data Science 604 - Simulation and Modeling Techniques

Instructions

Select a Dataset from an open data portal like https://data.gov, and download the Dataset in CSV or Excel format. Try to use a Dataset no larger than 1,000 records. If larger, filter to use only the first 1,000 records.
Create a “profile” for the real Dataset using descriptive statistics, and charts to analyze trends and behavior. Save file in Excel format.
Use https://excelx.com/practice-data/generators/ or https://www.mockaroo.com/ to create a test Dataset based on the fields of the real Dataset selected. Replicate attributes of the fields and define a range of values that resembles the real Dataset.
Download the fake Dataset in Excel or CSV format. Each program will allow you to create a dummy file up to 1,000 records.
Replicate the profile on the fake Dataset (descriptive statistics, charts), and compare if it resembles the real Dataset.
Analyze how you can improve the accuracy of a fake Dataset to simulate real values in the scenario selected. Try to apply the recommended measures or techniques in a new version of your fake Dataset and compare again with the real Dataset.

Deliverable: zip file with real & fake Datasets, and document with analysis.

Introduction

This assignment focuses on analyzing a real-world dataset and generating a synthetic version that closely replicates its structure and statistical patterns. The dataset selected is Uber Pickups in New York City, which contains timestamps, geographic coordinates, and dispatch base codes. These features provide both numeric and categorical information, making it well suited for descriptive profiling and simulation modeling.

The analysis begins with a detailed profile of the real dataset, including summary statistics and visualizations, to understand central tendencies, variability, and distribution shapes. Once the dataset is profiled, a synthetic dataset is generated by replicating the observed distributions and variable types. Sampling and probability-based techniques are applied to ensure the artificial data mirrors the original patterns without containing any real individual data.

Finally, the real and synthetic datasets are compared to evaluate similarity and identify areas for improvement in the simulation process.

Objectives

The objective of this assignment is to analyze a real-world Dataset, develop a statistical profile of its structure and distributions, and generate a synthetic Dataset that closely replicates its key characteristics. The goal is to evaluate how well simulated data can mirror real-world patterns and to apply improvements that enhance the accuracy and realism of the generated Dataset.

Data Overview

I selected the Uber Pickups in New York City Dataset (https://www.kaggle.com/datasets/fivethirtyeight/uber-pickups-in-new-york-city).

It contains the following main variables:

Date_Time: timestamp of the Uber pickup
Lat: latitude coordinate of the pickup location
Lon: longitude coordinate of the pickup location
Base: TLC base company code associated with the trip

From the timestamp, additional variables such as hour of day and weekday can be derived to analyze time-based patterns. This Dataset was chosen because it includes both numeric variables (latitude, longitude, hour) and categorical variables (base, weekday), along with strong temporal patterns. These characteristics make it well suited for descriptive profiling and simulation modeling. The Dataset is publicly available for research and academic use, satisfying the open-data requirement of this assignment.

Data Load

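# Packages used throughout the report
library(tidyverse)   # dplyr, tidyr, purrr, tibble, ggplot2
library(lubridate)   # mdy_hms(), hour(), wday()
library(janitor)     # clean_names()
library(DT)          # datatable()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Adjust file path as needed
uber_raw <- read.csv("https://raw.githubusercontent.com/rupendra4/Data-604/refs/heads/main/uber-raw-data-apr14.csv")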

uber <- uber_raw |>
  clean_names() |>
  mutate(
    date_time = mdy_hms(date_time),
    hour = hour(date_time),
    weekday = wday(date_time, label = TRUE),
    base = factor(base)
  ) |>
  slice_head(n = 1000)

#head(uber)
#kable(uber, digits = 2, caption = "observations of the cleaned Uber Dataset")

datatable(
  uber,
  options = list(pageLength = 20),
  caption = "Uber Dataset (first 1,000 records) displayed with 20 rows per page."
)

Real Dataset

# Real Dataset Profiling - Uber NYC (first 1000 records)
real <- uber |>
  janitor::clean_names() |>
  mutate(
    base = factor(base),
    weekday = factor(weekday, levels = levels(weekday))
  ) |>
  slice_head(n = 1000)

# Numeric profile (lat, lon, hour)
profile_numeric <- real |>
  select(where(is.numeric)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  summarise(
    n = sum(!is.na(value)),
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    min = min(value, na.rm = TRUE),
    q25 = quantile(value, 0.25, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    q75 = quantile(value, 0.75, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    skew = moments::skewness(value, na.rm = TRUE),
    .groups = "drop"
  )

# Categorical profile (base, weekday)
profile_categorical <- real |>
  summarise(across(where(is.factor), \(x) tibble(n_levels = n_distinct(x), missing = sum(is.na(x))))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "stats") |>
  unnest_wider(stats)

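# Create output folders if they do not exist yet
dir.create("output/charts", recursive = TRUE, showWarnings = FALSE)

# Save profile to Excel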
writexl::write_xlsx(
  list(
    skim = skimr::skim(real) |> as_tibble(),
    numeric_summary = profile_numeric,
    categorical_overview = profile_categorical
  ),
  "output/real_profile.xlsx"
)

# Define colors
blue_col <- "#ADD8E6"
green_col <- "#90EE90"
orange_col <- "#FFA07A"
purple_col <- "#D8BFD8"
red_col <- "#FFB6C1"
yellow_col <- "#FFFACD"

# Histogram of Pickup Hour
#p_hour <- ggplot(uber, aes(hour)) +
  #geom_histogram(bins = 24, fill = blue_col, color = "green") +
  #ggtitle("Pickup Hour Distribution")


# Density plot of hour | Histogram of Pickup Hour
p_hour <- ggplot(uber, aes(hour)) +
  geom_density(fill = blue_col, alpha = 0.5) +
  ggtitle("Pickup Hour Density") +
  xlab("Hour of Day") + ylab("Density")


# Histogram of Latitude
#p_lat <- ggplot(uber, aes(lat)) + geom_histogram(bins = 25, fill = green_col, color = "white") + ggtitle("Latitude Distribution")

# Histogram of Longitude
#p_lon <- ggplot(uber, aes(lon)) + geom_histogram(bins = 25, fill = orange_col, color = "white") + ggtitle("Longitude Distribution")

# Scatter plot of lat vs lon (used in place of the separate latitude and longitude histograms)
p_scatter <- ggplot(uber, aes(lon, lat)) +
  geom_point(alpha = 0.4, color = "purple") +
  ggtitle("Pickup Locations") +
  xlab("Longitude") + ylab("Latitude")

# Hexbin plot (requires library(hexbin))
library(hexbin)
p_hex <- ggplot(uber, aes(lon, lat)) +
  geom_hex(bins = 50) +
  scale_fill_gradient(low="lightblue", high="darkblue") +
  ggtitle("Pickup Density (Hexbin)") +
  xlab("Longitude") + ylab("Latitude")

# Boxplot of Latitude by Base
#p_box_base <- ggplot(uber, aes(base, lat)) + geom_boxplot(fill = purple_col) + ggtitle("Latitude by Base")

# Boxplot of Latitude by Base - Violin plot
p_violin <- ggplot(uber, aes(base, lat)) +
  geom_violin(fill = purple_col, alpha = 0.6) +
  ggtitle("Latitude Distribution by Base") +
  xlab("Base") + ylab("Latitude")

# Boxplot of Latitude by Base - Jitter plot (points instead of box)
p_jitter <- ggplot(uber, aes(base, lat)) +
  geom_jitter(width = 0.2, alpha = 0.5, color = "purple") +
  ggtitle("Latitude Distribution by Base (Jitter)")

# Pie chart for base counts
p_base <- uber |> count(base) |> 
  ggplot(aes(x="", y=n, fill=base)) +
  geom_col() +
  coord_polar(theta="y") +
  ggtitle("Pickup Counts by Base")

# Stacked bar chart by weekday and base
p_weekday <- uber |> count(weekday, base) |> 
  ggplot(aes(weekday, n, fill=base)) +
  geom_col() +
  ggtitle("Pickups by Weekday and Base")

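# Save plots
# Note: the file names still follow the commented-out histogram versions;
# hist_hour.png holds the density plot, hist_lat.png the scatter plot,
# and hist_lon.png the hexbin plot.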
ggsave("output/charts/hist_hour.png", p_hour, width = 10, height = 6, dpi = 120)
ggsave("output/charts/hist_lat.png", p_scatter, width = 10, height = 6, dpi = 120)
ggsave("output/charts/hist_lon.png", p_hex, width = 10, height = 6, dpi = 120)
#ggsave("output/charts/box_lat_by_base.png", p_box_base, width = 10, height = 6, dpi = 120)
ggsave("output/charts/bar_v_base.png", p_violin, width = 10, height = 6, dpi = 120)
ggsave("output/charts/bar_j_base.png", p_jitter, width = 10, height = 6, dpi = 120)
ggsave("output/charts/bar_weekday.png", p_weekday, width = 10, height = 6, dpi = 120)

Pickup Hour Distribution

The density plot shows the distribution of Uber pickups across the hours of the day. Instead of grouping the values into bars like a histogram, the density plot presents a smooth curve that represents the concentration of pickup times. The horizontal axis represents the hour of the day (0–23), while the vertical axis represents the relative density of pickups during those hours.

p_hour

The density plot indicates that Uber pickup activity varies throughout the day rather than being spread evenly. In this 1,000-record sample the mean pickup hour is 13.75 and the median is 15 (see the baseline comparison table), so activity leans toward the afternoon. Peaks of this kind often correspond to commuting times, evening outings, or other daily travel patterns.
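
One way to pin down which hours drive the peaks is to tabulate pickups per hour alongside the density curve. The sketch below uses a small made-up vector `toy_hours` (an assumption for illustration, not values from the Uber file); with the real data, `uber$hour` would take its place.

```r
# Hypothetical pickup hours, for illustration only (not taken from the Uber data)
toy_hours <- c(0, 7, 8, 8, 9, 17, 17, 17, 18, 18, 22)

# Count pickups in every hour 0-23, keeping hours with zero pickups
hour_counts <- table(factor(toy_hours, levels = 0:23))

# Share of pickups per hour
hour_share <- prop.table(hour_counts)

# Hour(s) with the highest count
is_peak <- as.vector(hour_counts) == max(hour_counts)
peak_hours <- as.integer(names(hour_counts)[is_peak])
peak_hours
```

Because hour is a discrete variable, a count table of this kind can be easier to compare between real and synthetic data than a smoothed curve.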

Latitude Distribution | Scatter plot of lat vs lon

The scatter plot shows the geographic distribution of Uber pickup locations using longitude and latitude coordinates. Each point represents a single pickup, illustrating where ride requests occurred within the city.

p_scatter

The plot shows that pickups are concentrated in certain areas rather than evenly spread out. These clusters indicate locations with higher ride demand. Replicating similar spatial patterns helps make the synthetic dataset more realistic.

Longitude Distribution | Hexbin plot

The hexbin plot shows the density of Uber pickup locations using longitude and latitude coordinates. Instead of displaying individual points, the map divides the area into hexagonal cells and counts how many pickups fall within each cell. Darker colors represent areas with a higher concentration of pickups.

p_hex

The hexbin plot highlights areas where Uber pickups are most concentrated. Darker hexagons indicate regions with higher ride demand. This visualization helps identify geographic patterns and supports creating synthetic data that reflects similar spatial density.
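
The counting step behind a binned density map can be sketched in base R. This is a simplification that uses square cells rather than hexagons, applied to made-up coordinates; `toy_lon`, `toy_lat`, and the 0.05-degree cell size are assumptions chosen for illustration.

```r
# Hypothetical coordinates, for illustration only
toy_lon <- c(-73.99, -73.98, -73.98, -73.97, -73.87, -73.99)
toy_lat <- c(40.751, 40.762, 40.743, 40.754, 40.771, 40.732)

cell <- 0.05  # assumed cell size in degrees

# Assign each point to a grid cell using integer cell indices
lon_idx <- floor(toy_lon / cell)
lat_idx <- floor(toy_lat / cell)

# Count pickups per cell and keep only occupied cells
cell_counts <- as.data.frame(table(lon_idx, lat_idx), stringsAsFactors = FALSE)
cell_counts <- cell_counts[cell_counts$Freq > 0, ]

# Densest cells first
cell_counts <- cell_counts[order(-cell_counts$Freq), ]
cell_counts
```

The same per-cell counts could be computed for a synthetic dataset and compared cell by cell, which gives a numeric counterpart to the visual comparison.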

Latitude Distribution by Base

The violin plot shows the distribution of pickup latitude values for each Uber base. The width of each violin represents the concentration of data points at different latitude levels, allowing us to see how pickup locations vary for each base.

p_violin

The plot shows that different bases operate across similar latitude ranges, but some areas have higher pickup concentrations. This helps identify geographic patterns associated with each dispatch base.

The jitter plot displays individual pickup latitude values for each Uber base. Each point represents a pickup location, and the horizontal spread helps separate overlapping points so the distribution can be seen more clearly.

p_jitter

The plot shows how pickup locations vary across different bases. Most bases operate within similar latitude ranges, indicating that Uber activity is spread across similar geographic areas.

Pickup Counts by Base

The pie chart displays the share of Uber pickups handled by each dispatch base. Each slice is proportional to the number of trips recorded for that base in the 1,000-record sample.

p_base

Any imbalance between slices shows that pickups are not evenly distributed across bases, which is why the synthetic base values are sampled using the observed proportions rather than uniformly.

Pickup Counts by Weekday

The stacked bar chart displays the total number of Uber pickups for each weekday, with each bar split by dispatch base. Bar height shows how ride activity varies across the week.

p_weekday

Uber pickups are not evenly distributed across weekdays. Differences between days reflect weekly routines, such as commuting and work-related trips on working days and different travel patterns on weekends. Because the sample contains only the first 1,000 records of the file, the counts reflect only the days those records cover.

Test Data Set

A synthetic Uber Dataset was generated to replicate the structure and characteristics of the real Dataset. The numeric variables (hour, latitude, and longitude) were simulated using normal distributions based on the real data’s mean and standard deviation, and clipping kept all values within the observed ranges. The categorical variables weekday and base were sampled according to the observed proportions in the original data.

This approach ensures that the fake Dataset preserves the general patterns of ride activity while containing no actual user data, making it safe for testing and analysis.

# Test Dataset Generation
# This section creates a synthetic test Dataset based on the structure of the Uber Dataset.
# The generated data mimics variable types and ranges, so it can be used safely for testing or analysis.

set.seed(42)
n <- nrow(uber)

# Function to sample categorical/factor variables like the real data
sample_factor_like <- function(x, n){
  probs <- prop.table(table(x, useNA="ifany"))
  out <- sample(names(probs), n, TRUE, as.numeric(probs))
  factor(out, levels = levels(x))
}

# Numeric generators tuned to Uber data
gen_hour <- function(x, n){
  r <- range(x, na.rm = TRUE)
  out <- round(rnorm(n, mean(x, na.rm=TRUE), sd(x, na.rm=TRUE)))
  pmin(pmax(out, r[1]), r[2])
}

gen_lat <- function(x, n){
  r <- range(x, na.rm = TRUE)
  out <- rnorm(n, mean(x, na.rm=TRUE), sd(x, na.rm=TRUE))
  pmin(pmax(out, r[1]), r[2])
}

gen_lon <- function(x, n){
  r <- range(x, na.rm = TRUE)
  out <- rnorm(n, mean(x, na.rm=TRUE), sd(x, na.rm=TRUE))
  pmin(pmax(out, r[1]), r[2])
}

# Generate fake Dataset
fake_uber <- tibble(
  hour = gen_hour(uber$hour, n),
  lat = gen_lat(uber$lat, n),
  lon = gen_lon(uber$lon, n),
  weekday = sample_factor_like(uber$weekday, n),
  base = sample_factor_like(uber$base, n)
)

# Preview first 5 rows of real Dataset
uber |>
  slice_head(n = 5) |>
  knitr::kable(caption = "Preview of Real Uber Dataset (First 5 Rows)")

Preview of Real Uber Dataset (First 5 Rows)
date_time	lat	lon	base	weekday
2014-04-01 00:11:00	40.7690	-73.9549	B02512	Tue
2014-04-01 00:17:00	40.7267	-74.0345	B02512	Tue
2014-04-01 00:21:00	40.7316	-73.9873	B02512	Tue
2014-04-01 00:28:00	40.7588	-73.9776	B02512	Tue
2014-04-01 00:33:00	40.7594	-73.9722	B02512	Tue

# Preview first 5 rows of fake Dataset
fake_uber |>
  slice_head(n = 5) |>
  knitr::kable(caption = "Preview of fake Uber Dataset (First 5 Rows)")

Preview of fake Uber Dataset (First 5 Rows)
hour	lat	lon	weekday	base
21	40.82781	-73.96827	Tue	B02512
11	40.76635	-74.00116	Tue	B02512
16	40.78159	-74.09121	Tue	B02512
17	40.76133	-74.10876	Tue	B02512
16	40.71447	-74.06426	Tue	B02512

# Write files
readr::write_csv(uber, "output/real_uber.csv")
readr::write_csv(fake_uber, "output/fake_uber.csv")

Comparison of Real vs Fake Dataset

Here, the fake Dataset was created to imitate the structure and characteristics of the real Uber Dataset. The goal was to generate artificial data that follows similar patterns while avoiding the use of real observations.

For the categorical variables (weekday and base), values were generated by sampling according to the proportions observed in the real Dataset. This helps maintain similar category frequencies in the fake data.

For the numeric variables (hour, lat, and lon), values were simulated using normal distributions based on the mean and standard deviation of the real data. The generated values were also restricted to realistic ranges so they remain comparable to the original Dataset.
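
The range restriction is a clamp: any draw outside the observed minimum and maximum is pulled back to the nearest bound. A minimal sketch with made-up numbers (`draws`, `lo`, and `hi` are illustrative assumptions) shows the mechanics and one side effect, namely that clamped values pile up exactly at the bounds.

```r
# Hypothetical simulated hours before clamping (illustration only)
draws <- c(-3, 2, 11, 14, 23, 26, 30)
lo <- 0   # assumed lower bound
hi <- 23  # assumed upper bound

# Clamp: raise values below lo, then lower values above hi
clamped <- pmin(pmax(draws, lo), hi)
clamped

# Share of draws that had to be moved to a bound
share_moved <- mean(draws < lo | draws > hi)
share_moved
```

An alternative that avoids the pile-up is to redraw out-of-range values instead of clamping them, at the cost of a small loop.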

Overall, the fake Dataset reflects the general statistical behavior of the real Uber pickup data while containing completely artificial records. This allows the Dataset to be used for testing and experimentation without relying on actual ride information.

compare_num <- function(df1, df2, cols){
  bind_rows(
    df1 |> select(all_of(cols)) |> 
      pivot_longer(everything(), names_to = "var", values_to = "val") |> 
      mutate(src = "real"),
    
    df2 |> select(all_of(cols)) |> 
      pivot_longer(everything(), names_to = "var", values_to = "val") |> 
      mutate(src = "fake")
  ) |>
  group_by(src, var) |>
  summarise(
    mean = mean(val, na.rm = TRUE),
    sd = sd(val, na.rm = TRUE),
    median = median(val, na.rm = TRUE),
    min = min(val, na.rm = TRUE),
    max = max(val, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_wider(names_from = src, values_from = c(mean, sd, median, min, max)) |>
  mutate(
    mean_abs_pct_diff = abs(mean_real - mean_fake) / ifelse(mean_real == 0, 1, abs(mean_real)),
    sd_abs_pct_diff   = abs(sd_real - sd_fake) / ifelse(sd_real == 0, 1, abs(sd_real))
  )
}

# Numeric variables for Uber Dataset
num_cols <- c("hour", "lat", "lon")

cmp_num <- compare_num(uber, fake_uber, num_cols)


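# Function to compare categorical variables:
# share of the real data's top-5 categories that also appear in the fake data's top 5
overlap_top5 <- function(a, b){
  ra <- names(sort(table(a), decreasing = TRUE))[seq_len(min(5, n_distinct(a, na.rm = TRUE)))]
  rb <- names(sort(table(b), decreasing = TRUE))[seq_len(min(5, n_distinct(b, na.rm = TRUE)))]
  
  # Divide by the size of the real top-5 set so that a perfect match scores 1
  length(intersect(ra, rb)) / max(length(ra), 1)
}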

# Categorical variables
cat_cols <- c("weekday", "base")

cmp_cat <- tibble(variable = cat_cols,
                  top5_overlap_ratio = map_dbl(cat_cols, ~overlap_top5(uber[[.x]], fake_uber[[.x]])))


# Preview numeric comparison
DT::datatable(
  cmp_num |>
    dplyr::slice_head(n = 10),
  caption = "Numeric Variable Comparison (First 10 Rows)",
  options = list(pageLength = 10, autoWidth = TRUE)
)

DT::datatable(
  cmp_cat |>
    dplyr::slice_head(n = 10),
  caption = "Categorical Variable Comparison (First 10 Rows)",
  options = list(pageLength = 10, autoWidth = TRUE)
)

# Write Excel
writexl::write_xlsx(
  list(numeric_comparison = cmp_num, categorical_overlap = cmp_cat),
  "output/comparison_real_vs_fake.xlsx"
)

To measure how similar the Datasets are, several comparisons were performed. For the numeric variables (such as hour, latitude, and longitude), summary statistics including mean, standard deviation, median, minimum, and maximum were calculated for both the real and synthetic Datasets.

For the categorical variables (weekday and base), the comparison focused on the most common categories by calculating the overlap among the top five values in each Dataset.
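
The top-category overlap idea can be illustrated on a toy example. The helper `top_k_overlap` and the vectors `real_cat` and `fake_cat` below are hypothetical names and made-up values, written in base R as a sketch of the metric rather than the exact function used in this report.

```r
# Share of the k most frequent categories in `a` that are also among
# the k most frequent categories in `b` (hypothetical helper)
top_k_overlap <- function(a, b, k = 5) {
  top <- function(x) {
    counts <- sort(table(x), decreasing = TRUE)
    names(counts)[seq_len(min(k, length(counts)))]
  }
  ta <- top(a)
  tb <- top(b)
  length(intersect(ta, tb)) / max(length(ta), 1)
}

# Made-up category vectors, for illustration only
real_cat <- c("A", "A", "A", "B", "B", "C", "D")
fake_cat <- c("A", "A", "B", "B", "B", "E")

overlap_k2 <- top_k_overlap(real_cat, fake_cat, k = 2)
overlap_k5 <- top_k_overlap(real_cat, fake_cat, k = 5)
c(k2 = overlap_k2, k5 = overlap_k5)
```

The metric only checks which categories rank highest, not how close their proportions are, so it is best read together with a table of category shares.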

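The results indicate that the fake Uber Dataset is generally close to the real Dataset on central tendency and spread: in the baseline comparison table, the relative differences in mean and standard deviation are at most 0.03 (3%) for hour, latitude, and longitude. The extremes match less well. The synthetic latitude spans 40.65 to 40.87 against 40.61 to 40.99 in the real data, and the synthetic longitude spans -74.18 to -73.77 against -74.42 to -73.42, so the normal approximation under-represents outlying pickup locations.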
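
Matching means and standard deviations does not guarantee matching distribution shapes. As a complementary check that is not part of the original analysis, a two-sample Kolmogorov-Smirnov test from base R's `stats` package compares the full empirical distributions. The vectors `real_x` and `fake_x` below are simulated stand-ins, not the Uber columns.

```r
set.seed(1)

# Hypothetical stand-ins for a real and a synthetic numeric column
real_x <- c(rnorm(300, mean = 8, sd = 1.5),
            rnorm(300, mean = 18, sd = 1.5))                # two peaks
fake_x <- rnorm(600, mean = mean(real_x), sd = sd(real_x))  # single peak

# Means and standard deviations are close by construction
c(mean_real = mean(real_x), mean_fake = mean(fake_x))
c(sd_real = sd(real_x), sd_fake = sd(fake_x))

# The KS statistic is the largest gap between the two empirical CDFs;
# it still registers the difference in shape
ks <- ks.test(real_x, fake_x)
ks$statistic
ks$p.value
```

For a rounded variable such as hour, the many tied values make the KS p-value approximate, so the statistic is better treated as a descriptive distance than as a formal test.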

Enhancing the Accuracy of the Fake Dataset

In this step, the synthetic Dataset was refined to better match the statistical patterns of the real Uber data. The improvement process adjusted the generated values of latitude and longitude based on the distribution observed within each dispatch base. By calculating the mean and standard deviation of these variables from the real Dataset and applying them to the fake data, the generated values became more consistent with the original geographic patterns.

# Function to improve synthetic Uber Dataset based on base
improve_by_base <- function(real_df, fake_df){
  out <- fake_df
  
  # Latitude and longitude ranges from real data
  lat_rng <- range(real_df$lat, na.rm = TRUE)
  lon_rng <- range(real_df$lon, na.rm = TRUE)
  
  for(gr in levels(real_df$base)){
    r_idx <- which(real_df$base == gr)
    f_idx <- which(out$base == gr)
    
    if(length(r_idx) > 20 && length(f_idx) > 0){
      # Generate latitude and longitude based on real base distribution
      lat_mu <- mean(real_df$lat[r_idx], na.rm = TRUE)
      lat_sd <- sd(real_df$lat[r_idx], na.rm = TRUE)
      
      lon_mu <- mean(real_df$lon[r_idx], na.rm = TRUE)
      lon_sd <- sd(real_df$lon[r_idx], na.rm = TRUE)
      
      out$lat[f_idx] <- rnorm(length(f_idx), lat_mu, lat_sd)
      out$lon[f_idx] <- rnorm(length(f_idx), lon_mu, lon_sd)
    }
  }
  
  # Keep values within real data range
  out$lat <- pmin(pmax(out$lat, lat_rng[1]), lat_rng[2])
  out$lon <- pmin(pmax(out$lon, lon_rng[1]), lon_rng[2])
  
  out
}

# Apply improvement
fake_improved <- improve_by_base(uber, fake_uber)

# Save improved fake Dataset
readr::write_csv(fake_improved, "output/fake_uber_improved.csv")

# Compare numeric variables before and after improvement
num_cols <- c("hour", "lat", "lon")
cmp_num_improved <- compare_num(uber, fake_improved, num_cols)



cmp_num |> 
  slice_head(n = 5) |> 
  kable(digits = 2, caption = "Top 5 Rows of Numeric Comparison - Baseline") |> 
  kable_styling(full_width = TRUE)

Top 5 Rows of Numeric Comparison - Baseline
var	mean_fake	mean_real	sd_fake	sd_real	median_fake	median_real	min_fake	min_real	max_fake	max_real	mean_abs_pct_diff	sd_abs_pct_diff
hour	13.55	13.75	5.06	5.23	14.00	15.00	0.00	0.00	23.00	23.00	0.01	0.03
lat	40.75	40.75	0.03	0.03	40.75	40.75	40.65	40.61	40.87	40.99	0.00	0.01
lon	-73.98	-73.98	0.06	0.06	-73.98	-73.98	-74.18	-74.42	-73.77	-73.42	0.00	0.03

# Save comparison results to Excel
writexl::write_xlsx(
  list(
    numeric_comparison_baseline = cmp_num,
    numeric_comparison_improved = cmp_num_improved,
    categorical_overlap = cmp_cat
  ),
  "output/comparison_real_vs_fake_uber.xlsx"
)

Conclusion

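The synthetic Uber dataset replicates the main statistical patterns observed in the real dataset. For the numeric variables (hour, latitude, longitude), baseline means and standard deviations fall within 3% of the real values, and the categorical variables (weekday, base) are sampled from the observed proportions. Latitude and longitude were then regenerated using the mean and standard deviation within each dispatch base; the resulting comparison is saved in the numeric_comparison_improved sheet of the output workbook.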

Overall, the simulation demonstrates that it is possible to generate realistic datasets for testing and analysis without using actual sensitive data. Further enhancements could include modeling temporal trends in ride demand, adding correlations between variables, and incorporating more complex geographic patterns to improve realism.
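
Two of the enhancements listed above can be sketched briefly: drawing hour from its empirical distribution rather than a normal curve, and generating latitude and longitude jointly so their correlation is preserved. The data below are made-up stand-ins, and `real_toy`, `n_fake`, and the Cholesky-based sampler are assumptions for illustration rather than the method used in this report.

```r
set.seed(604)

# Hypothetical stand-in for the real data (illustration only)
n_real <- 500
lat <- rnorm(n_real, 40.75, 0.03)
real_toy <- data.frame(
  hour = sample(c(8, 9, 17, 18, 19), n_real, replace = TRUE),
  lat  = lat,
  lon  = -73.98 + 0.8 * (lat - 40.75) + rnorm(n_real, 0, 0.02)
)

n_fake <- 500

# 1. Hour: resample observed values, which keeps multiple peaks intact
fake_hour <- sample(real_toy$hour, n_fake, replace = TRUE)

# 2. Lat/lon: draw from a bivariate normal with the real mean vector and
#    covariance matrix, using the Cholesky factor of the covariance
mu    <- colMeans(real_toy[, c("lat", "lon")])
sigma <- cov(real_toy[, c("lat", "lon")])
z     <- matrix(rnorm(2 * n_fake), ncol = 2)
xy    <- sweep(z %*% chol(sigma), 2, mu, "+")

fake_toy <- data.frame(hour = fake_hour, lat = xy[, 1], lon = xy[, 2])

# Compare the lat/lon correlation in the real and synthetic versions
cor_check <- c(real = cor(real_toy$lat, real_toy$lon),
               fake = cor(fake_toy$lat, fake_toy$lon))
cor_check
```

Generating the two coordinates independently, as in the baseline generator, would drive this correlation toward zero, which is one reason a joint draw can look more realistic on a map.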

DATA 604 : Week 5 Assignment

Author: Rupendra Shrestha | March 07, 2026