Project 2_Disney Parks Dataset

Author

Theresa Benny

Approach

The second dataset contains monthly attendance estimates for several Disney theme parks during 2025. Each row represents a park, and each column represents attendance for a specific month of the year. Additional information includes the geographic region of each park.

The dataset is currently stored in a wide format, where each month is represented as a separate column. While this layout may be convenient for viewing, it is not ideal for analysis or visualization. For example, analyzing attendance trends across months or comparing parks over time becomes difficult when each month is stored in a different column.

The goal of this transformation is to convert the dataset into a tidy structure where each observation represents a park’s attendance in a specific month. This will make it easier to analyze seasonal trends, compare park performance, and visualize attendance changes throughout the year.

To preserve the original dataset, I will first recreate the data exactly as shown in the source table and store it as a wide-format CSV file. This file will be committed to my GitHub repository before any transformations are performed.

The raw dataset includes the following variables:

Park name
Monthly attendance columns (January through December)
Region where the park is located

The monthly columns represent repeated measurements across time, which is the primary reason the dataset is considered wide.

The dataset will be imported into R and inspected to understand its structure, including the number of rows, column names, and data types. This step helps identify any inconsistencies in column naming or formatting that must be addressed before transformation.

Column names will be standardized to follow a consistent naming convention. For example, month columns may be converted into a consistent lowercase format without spaces. This ensures clarity and consistency when referencing variables in the analysis stage.
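
One way to standardize the names programmatically, rather than renaming each column by hand, is sketched below. The helper `standardize_names()` is hypothetical (it is not part of the original workflow) and assumes that any column whose name starts like an English month name, optionally followed by a four-digit year, is a month column.

```r
# Hypothetical helper: lowercase snake_case names, with month-like
# columns collapsed to three-letter abbreviations ("jan", ..., "dec").
standardize_names <- function(nms) {
  cleaned <- tolower(nms)
  cleaned <- gsub("[^a-z0-9]+", "_", cleaned)   # spaces and hyphens -> "_"
  cleaned <- gsub("^_|_$", "", cleaned)         # trim stray underscores

  # Drop a trailing year such as "_2025" before testing for a month name
  stem <- sub("_?[0-9]{4}$", "", cleaned)

  month_abbr <- tolower(month.abb)
  month_full <- tolower(month.name)
  idx <- match(substr(stem, 1, 3), month_abbr)

  # A name counts as a month only if it is a prefix of the full month name
  is_month <- !is.na(idx) & startsWith(month_full[idx], stem)

  ifelse(is_month, month_abbr[idx], cleaned)
}

raw_names <- c("Park Name", "Jan 2025", "Feb-2025", "March", "Apr_2025",
               "May", "Jun", "July", "Aug", "Sept", "October 2025",
               "Nov", "Dec", "Region")
clean_names <- standardize_names(raw_names)
```

With dplyr loaded, the helper could be applied to a data frame with `rename_with(standardize_names)`.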

The primary structural issue in this dataset is that monthly attendance values are spread across multiple columns. These columns will be reshaped into a tidy structure where:

One column represents the month
One column represents the attendance value

This transformation will allow each park to have multiple rows, one for each month, instead of storing all monthly observations in separate columns.

After this step, the dataset will contain variables such as:

park_name
region
month
attendance

This structure follows tidy data principles and allows easier aggregation, filtering, and visualization.
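
The reshaping step can be sketched on a small toy table. The two parks and three months below are an illustrative subset, and selecting the columns by exclusion (`-c(park_name, region)`) is an assumption about how one might guard against leaving a month column behind; it is not necessarily how the final code selects them.

```r
library(tidyr)
library(tibble)

# Toy wide table: one row per park, one column per month
wide <- tribble(
  ~park_name,      ~region,   ~jan,   ~feb,      ~mar,
  "Magic Kingdom", "Florida", "1.5M", "1600000", "2.1 million",
  "EPCOT",         "Florida", "1.2M", "1300000", "1.7 million"
)

# Pivot every column except the identifiers into month/attendance pairs
long <- pivot_longer(
  wide,
  cols = -c(park_name, region),
  names_to = "month",
  values_to = "attendance"
)
```

Each park now occupies one row per month, so two parks and three months give six rows.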

The attendance numbers appear in multiple formats, including:

values with “M” (millions)
values with “K” (thousands)
comma-separated numbers
values written as text such as “2.1 million”

These values will need to be standardized into numeric attendance counts. This process will involve removing text labels and converting all attendance values into consistent numeric units. This step ensures that the attendance column can be used for calculations and statistical analysis.
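
A minimal sketch of this standardization is shown below. The function name `parse_attendance()` is hypothetical, and the approach (extract the number, then multiply by a unit factor) is one of several that would work. It assumes that any value containing "m" is in millions and any value containing "k" is in thousands.

```r
library(stringr)

# Hypothetical helper: turn strings such as "1.5M", "950K",
# "2.1 million", or "1,600,000" into numeric visitor counts.
parse_attendance <- function(x) {
  x <- str_to_lower(str_trim(as.character(x)))
  x <- str_remove_all(x, ",")                   # drop thousands separators

  # Unit factor: millions, thousands, or plain counts
  multiplier <- ifelse(str_detect(x, "m"), 1e6,
                ifelse(str_detect(x, "k"), 1e3, 1))

  # First run of digits with an optional decimal part
  number <- as.numeric(str_extract(x, "[0-9]+\\.?[0-9]*"))

  number * multiplier
}

parsed <- parse_attendance(c("1.5M", "950K", "2.1 million", "1,600,000", NA, ""))
```

Blank strings and `NA` inputs both come back as `NA`, so missing cells need no separate handling in this sketch.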

Some cells appear to be blank or missing. These values will be converted into standard missing data indicators. This ensures that incomplete observations do not interfere with calculations or visualizations.

Once the dataset has been transformed into tidy format, I will perform exploratory analysis to examine attendance patterns across parks and months.

Potential analyses include:

Comparing average attendance between parks
Identifying seasonal trends in park attendance
Examining which months experience the highest and lowest attendance
Comparing attendance patterns between Florida parks and the California park
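
Two of these analyses, finding each park's peak month and comparing regions, might look like the following sketch. The toy table reuses a handful of values from the dataset purely for illustration, and the object names `peak_months` and `region_avg` are assumptions.

```r
library(dplyr)
library(tibble)

# Toy tidy table: one row per park per month
tidy <- tribble(
  ~park_name,          ~region,      ~month, ~attendance,
  "Magic Kingdom",     "Florida",    "jan",  1500000,
  "Magic Kingdom",     "Florida",    "jul",  3000000,
  "Disneyland Park",   "California", "jan",  1100000,
  "Disneyland Park",   "California", "jul",  2300000,
  "Hollywood Studios", "Florida",    "may",  NA,
  "Hollywood Studios", "Florida",    "jul",  1800000
)

# Month with the highest attendance for each park (missing values dropped)
peak_months <- tidy %>%
  filter(!is.na(attendance)) %>%
  group_by(park_name) %>%
  slice_max(attendance, n = 1, with_ties = FALSE) %>%
  ungroup()

# Average monthly attendance by region
region_avg <- tidy %>%
  group_by(region) %>%
  summarise(avg_attendance = mean(attendance, na.rm = TRUE), .groups = "drop")
```

Replacing `slice_max()` with `slice_min()` would give the lowest month for each park in the same way.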

Results will be summarized using tables and visualizations. For example, line charts may be used to show monthly attendance trends for each park, while bar charts can compare overall attendance between parks.

Several data preparation challenges are expected when working with this dataset:

Attendance values appear in multiple textual formats, which must be standardized before analysis.
Monthly data is currently spread across many columns, requiring restructuring into a tidy format.
Some missing or blank cells must be handled consistently.
Month column names may need to be standardized due to inconsistent naming patterns (for example, “Jan 2025” versus “Apr_2025”).

Addressing these issues will ensure the final dataset follows tidy data principles and can be analyzed efficiently.

Codebase

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

disney_raw <- read_csv("disney_parks_monthly_attendance.csv")

Rows: 4 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Park Name, Jan 2025, March, May, Jun, July, Aug, Sept, October 202...
dbl  (1): Apr_2025
num  (1): Feb-2025

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Inspect the raw data

glimpse(disney_raw)

Rows: 4
Columns: 14
$ `Park Name`    <chr> "Magic Kingdom", "EPCOT", "Disneyland Park", "Hollywood…
$ `Jan 2025`     <chr> "1.5M", "1.2M", "1.1M", "950K"
$ `Feb-2025`     <dbl> 1600000, 1300000, 1200000, 980000
$ March          <chr> "2.1 million", "1.7 million", "1.6 million", "1.3 milli…
$ Apr_2025       <dbl> 2300000, 1800000, 1700000, 1400000
$ May            <chr> "2.5M", "2.0M", "1.9M", NA
$ Jun            <chr> "2.8M", "2.2M", "2.1M", "1.6M"
$ July           <chr> "3.0M", "2.4M", "2.3M", "1.8M"
$ Aug            <chr> "2.9M", "2.3M", "2.2M", "1.7M"
$ Sept           <chr> "2.4M", "1.9M", "1.8M", "1.4M"
$ `October 2025` <chr> "2.2M", "1.8M", "1.7M", "1.3M"
$ Nov            <chr> "1.9M", "1.6M", "1.5M", "1.1M"
$ Dec            <chr> "2.6M", "2.1M", "2.0M", "1.5M"
$ Region         <chr> "Florida", "Florida", "California", "Florida"

head(disney_raw)

# A tibble: 4 × 14
  `Park Name` `Jan 2025` `Feb-2025` March Apr_2025 May   Jun   July  Aug   Sept 
  <chr>       <chr>           <dbl> <chr>    <dbl> <chr> <chr> <chr> <chr> <chr>
1 Magic King… 1.5M          1600000 2.1 …  2300000 2.5M  2.8M  3.0M  2.9M  2.4M 
2 EPCOT       1.2M          1300000 1.7 …  1800000 2.0M  2.2M  2.4M  2.3M  1.9M 
3 Disneyland… 1.1M          1200000 1.6 …  1700000 1.9M  2.1M  2.3M  2.2M  1.8M 
4 Hollywood … 950K           980000 1.3 …  1400000 <NA>  1.6M  1.8M  1.7M  1.4M 
# ℹ 4 more variables: `October 2025` <chr>, Nov <chr>, Dec <chr>, Region <chr>

The dataset is currently untidy because monthly attendance measurements are stored across multiple columns rather than in a single month variable. This wide structure makes it harder to analyze attendance trends over time or compare parks across months. The values are also inconsistent: some months use suffixes such as “M” and “K”, one uses the word “million”, and two are stored as plain numbers. To make the data suitable for analysis, the column names will be standardized, the monthly columns will be reshaped into a tidy long format, and the attendance values will be converted to numeric counts.

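# Lowercase every column name, then shorten the month columns that need it
# to three-letter abbreviations (may, jun, and aug are already in that form)
disney_clean <- disney_raw %>%
  rename_with(tolower) %>%
  rename(
    park_name = `park name`,
    jan = `jan 2025`,
    feb = `feb-2025`,
    mar = march,
    apr = apr_2025,
    jul = july,
    sep = sept
  )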

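# Reshaping Monthly Attendance

# feb and apr were read as numeric, so convert the month columns to
# character first; pivot_longer() cannot combine columns of different types
disney_clean <- disney_clean %>%
  mutate(across(jan:sep, as.character))

# jan:sep selects January through September only; `october 2025`, nov,
# and dec are not pivoted and remain as ordinary columns
disney_tidy <- disney_clean %>%
  pivot_longer(
    cols = jan:sep,
    names_to = "month",
    values_to = "attendance"
  )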

glimpse(disney_tidy)

Rows: 36
Columns: 7
$ park_name      <chr> "Magic Kingdom", "Magic Kingdom", "Magic Kingdom", "Mag…
$ `october 2025` <chr> "2.2M", "2.2M", "2.2M", "2.2M", "2.2M", "2.2M", "2.2M",…
$ nov            <chr> "1.9M", "1.9M", "1.9M", "1.9M", "1.9M", "1.9M", "1.9M",…
$ dec            <chr> "2.6M", "2.6M", "2.6M", "2.6M", "2.6M", "2.6M", "2.6M",…
$ region         <chr> "Florida", "Florida", "Florida", "Florida", "Florida", …
$ month          <chr> "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug",…
$ attendance     <chr> "1.5M", "1600000", "2.1 million", "2300000", "2.5M", "2…
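
The `glimpse()` output above still lists `october 2025`, `nov`, and `dec` as separate columns, which suggests that only nine months were reshaped. A quick completeness check along the following lines could flag this; the vector `months_found` is typed in by hand here as a stand-in for `unique(disney_tidy$month)`.

```r
# Months present in the long table (stand-in for unique(disney_tidy$month))
months_found <- c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep")

# Compare against the twelve expected abbreviations
expected_months <- tolower(month.abb)
missing_months <- setdiff(expected_months, months_found)

missing_months
```

An empty result would indicate that every month made it into the `month` column.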

# Convert every attendance string to a numeric visitor count
disney_tidy <- disney_tidy %>%
  mutate(
    attendance = str_to_lower(attendance),                      # "1.5M" -> "1.5m"
    attendance = str_replace_all(attendance, " million", "m"),  # "2.1 million" -> "2.1m"
    attendance = case_when(
      str_detect(attendance, "m") ~ as.numeric(str_remove(attendance, "m")) * 1000000,
      str_detect(attendance, "k") ~ as.numeric(str_remove(attendance, "k")) * 1000,
      TRUE ~ as.numeric(attendance)                             # already a plain number
    )
  )

Warning: There were 3 warnings in `mutate()`.
The first warning was:
ℹ In argument: `attendance = case_when(...)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

glimpse(disney_tidy)

Rows: 36
Columns: 7
$ park_name      <chr> "Magic Kingdom", "Magic Kingdom", "Magic Kingdom", "Mag…
$ `october 2025` <chr> "2.2M", "2.2M", "2.2M", "2.2M", "2.2M", "2.2M", "2.2M",…
$ nov            <chr> "1.9M", "1.9M", "1.9M", "1.9M", "1.9M", "1.9M", "1.9M",…
$ dec            <chr> "2.6M", "2.6M", "2.6M", "2.6M", "2.6M", "2.6M", "2.6M",…
$ region         <chr> "Florida", "Florida", "Florida", "Florida", "Florida", …
$ month          <chr> "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug",…
$ attendance     <dbl> 1500000, 1600000, 2100000, 2300000, 2500000, 2800000, 3…

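The attendance column is now numeric and uses a single unit (visitor counts). The coercion warnings are expected: `case_when()` evaluates every right-hand side for all rows, so `as.numeric()` is also applied to strings that belong to a different branch, and those intermediate `NA`s are discarded. The only `NA` left in the result is the missing May value for Hollywood Studios. With the values standardized, it’s time to do the analysis.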

# Average attendance by park
disney_summary <- disney_tidy %>%
  group_by(park_name) %>%
  summarise(
    avg_attendance = mean(attendance, na.rm = TRUE)
  )

# Store month as an ordered factor so plots follow calendar order
# instead of alphabetical order
disney_tidy <- disney_tidy %>%
  mutate(month = factor(month, levels = c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep")))

disney_summary

# A tibble: 4 × 2
  park_name         avg_attendance
  <chr>                      <dbl>
1 Disneyland Park         1766667.
2 EPCOT                   1866667.
3 Hollywood Studios       1391250 
4 Magic Kingdom           2344444.
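
A bar chart is one possible way to present these averages. The sketch below retypes the rounded values from the table above into a small data frame so that it runs on its own; in the actual analysis `disney_summary` would be used directly. The axis labels and the choice to sort the bars are assumptions.

```r
library(ggplot2)

# Rounded averages copied from the summary table above
summary_tbl <- data.frame(
  park_name = c("Disneyland Park", "EPCOT", "Hollywood Studios", "Magic Kingdom"),
  avg_attendance = c(1766667, 1866667, 1391250, 2344444)
)

# Horizontal bars sorted by average attendance, shown in millions
avg_plot <- ggplot(
  summary_tbl,
  aes(x = reorder(park_name, avg_attendance), y = avg_attendance / 1e6)
) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Monthly Attendance by Park",
    x = "Park",
    y = "Average attendance (millions)"
  )
```

These averages reflect only the months included in the reshaped data.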

Let’s visualize this data.

ggplot(disney_tidy, aes(x = month, y = attendance, group = park_name, color = park_name)) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(
    title = "Monthly Disney Park Attendance",
    x = "Month",
    y = "Attendance",
    color = "Park"
  )

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

The visualization shows attendance trends across the four Disney parks from January through September; October through December were not included in the reshaped data. Magic Kingdom consistently attracts the highest attendance, while Hollywood Studios has the lowest attendance in every month shown. Attendance peaks during the summer months, particularly in July and August, which is consistent with seasonal increases in tourism and school vacation periods. Hollywood Studios has one missing value, for May, which appears as a gap in its line between April and June.