Project 2

Author

Sam Barbaro

Project 3 Approach

I will take three untidy, wide data sets shared by my classmates, including the average precipitation by city:

https://en.wikipedia.org/wiki/List_of_cities_by_average_precipitation

Disney parks monthly attendance, and field goal career stats from NFL statistics https://www.kaggle.com/datasets/kendallgillies/nflstatistics

For each data set, I will:

  • Reshape data from wide to tidy (long) format, using pivot longer

  • Normalize the variable structure (e.g., all numeric, same number of decimals)

  • Rename variables to follow a consistent naming convention

  • Address missing or inconsistent values

  • Document how I handle missing values

    Planned analysis

For cities by average precipitation, I will compare seasonal precipitation patterns across cities in the same country, and to find the rainiest and driest cities overall.

For the Disney Parks data I willl look at monthly attendance to identify peak/low travel seasons, compare FL and CA attendance, and calculate YTD totals for each park. I will also visualize monthly attendance across parks. I have added additional parks (Animal Kingdom and Calinfornia Adventure) to compare all CA parks to all FL parks.

With the football data set, I will look at cumulative player stats, ratios, and the relationship of certain stats (such as player height and weight) relate to career length.

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)

Load the data sets

#disney parks
disney_parks <- 
  read.csv("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/disney_parks_monthly_attendance.csv", 
                          na.strings = c("", "NA", "null", "NULL"), 
                          header = TRUE)


#NFL field goals
field_goal <-  read.csv("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/Career_Stats_Field_Goal_Kickers.csv", 
                          na.strings = c("", "--", "NA", "null", "NULL"), 
                          header = TRUE)
#rainfall
rainfall <- read.csv("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/rainfallbycity.csv", na.strings = "", header = TRUE)

Disney Parks

Let’s standardize the values before we pivot longer. I think it will be easier to remove the M/Millions from these columns and say number of guests (in millions), then divide the other columns (which have values in millions) by a million.

#format feb and april 2025 values as numeric, remove commas, and place data in the same format (millions) as other columns before pivoting

disney_parks2 <- disney_parks |> 
  mutate(Apr_2025 = (Apr_2025/1000000)) |>
  mutate(Feb.25 = str_remove_all(Feb.25, ",")) |>
  mutate(Feb.25 = (as.numeric(Feb.25))) |>
  mutate(Feb.25 = (Feb.25/1000000))


#normalize the column names

disney_parks3 <- disney_parks2 |>
  rename_with(~ format(my(paste(., "2025")), "%b-%Y"),
    .cols = c(March, May, Jun, July, Aug, Sept, Nov, Dec)
  ) |> 
  rename("Jan-2025" = Jan.25, "Feb-2025" = Feb.25, "Oct-2025" = Oct.25, "Apr-2025" = Apr_2025) 
#getting rid of M and million
disney_parks4 <- disney_parks3 |> 
mutate(
    across(`May-2025`:`Dec-2025`, ~ str_remove(.x, "M.*"))) |> 
    mutate(`Mar-2025` = str_remove(`Mar-2025`, " m.*")) |>
  mutate(`Jan-2025` = str_remove(`Jan-2025`, "M.*")) |>
  mutate(`Jan-2025` = str_remove(`Jan-2025`, "K.*"))

#Resolving problems I created for myself (I am rounding the first value to one million, since all other values are in multiples of 100k)
disney_parks4[4, 2] <- 1
disney_parks4[5, 2] <- .8

#there's an NA value for one day at one of the parks. 
#I've decided to impute using the closest month (Dec 2025 - every park has the same attendance + 100k)
#I recognize this is the most scientific, but it's only one value and will cause 
#the least disruption when looking at time series.
#We also have no reason to believe the park was closed that month. 

disney_parks4[4, 6] <- 1.4

#format  everything as numeric
disney_parks5 <- disney_parks4 |>
  mutate(across(`Jan-2025`:`Dec-2025`, as.numeric))


#pivot longer

disney_parks6 <- disney_parks5 |> pivot_longer(cols = `Jan-2025`:`Dec-2025`,
                               names_to = "month_year",
        values_to = "attendance_in_millions"
    )

#format as dates
disney_parks7 <- disney_parks6 |> 
  mutate(month_year = my(month_year)) |> 
  rename("park" = Park.Name, "state" = Region)

Analysis - Disney Parks

library(fpp3)
Warning: package 'fpp3' was built under R version 4.5.2
Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr
── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
✔ tsibble     1.1.6     ✔ feasts      0.4.2
✔ tsibbledata 0.4.1     ✔ fable       0.5.0
Warning: package 'tsibble' was built under R version 4.5.2
Warning: package 'tsibbledata' was built under R version 4.5.2
Warning: package 'feasts' was built under R version 4.5.2
Warning: package 'fabletools' was built under R version 4.5.2
Warning: package 'fable' was built under R version 4.5.2
── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
✖ lubridate::date()    masks base::date()
✖ dplyr::filter()      masks stats::filter()
✖ tsibble::intersect() masks base::intersect()
✖ tsibble::interval()  masks lubridate::interval()
✖ dplyr::lag()         masks stats::lag()
✖ tsibble::setdiff()   masks base::setdiff()
✖ tsibble::union()     masks base::union()
disney_parks8 <- disney_parks7 |> 
  as_tsibble(key = park, index = month_year)

autoplot(disney_parks8) + 
  labs(title = "Disney Park Attendance by Month", x = "Month", y = "Attendance in Millions")
Plot variable not specified, automatically selected `.vars =
attendance_in_millions`

#We can see the summer and holiday peaks, which are fairly consistent across parks

#comparing monthly average attendance in California vs. Florida

disney_parks_regional <- disney_parks7 |> 
  group_by(state) |>
  summarize(avg_monthly_park_attendance = round(mean(attendance_in_millions),2))



library(gt)
Warning: package 'gt' was built under R version 4.5.2
disney_parks_regional |>
  select(state, avg_monthly_park_attendance) |>
  gt() |>
  cols_label(
    state = "Location",
    avg_monthly_park_attendance = "Average Visitors by Park (Millions)"
  ) |>
  tab_header(title = "California vs. Florida Monthly Visitors")
California vs. Florida Monthly Visitors
Location Average Visitors by Park (Millions)
California 1.67
Florida 1.81
#Florida and California by month

disney_parks_regional_monthly <- disney_parks7 |> 
  group_by(state, month_year) |> 
  summarize(avg_monthly_park_attendance = round(mean(attendance_in_millions),2))
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
#A somewhat Disney-looking graph

ggplot(disney_parks_regional_monthly, aes(x = month_year, y = avg_monthly_park_attendance, color = state)) +
  geom_line() +  
  scale_color_manual(values = c(
    "#FF8C00", 
    "#008080",
    "#FF4466", 
    "#FFD700"  
  )) +
  geom_point(shape = 8, size = 3) + 
  theme_minimal() +
  labs(title = "Average Monthly Disney Park Attendance", 
       x = "Date", 
       y = "Average Attendance per Park (Millions)") + theme(
    plot.title = element_text(size = 20)
  )

#Florida consistently has a higher avg park attendance 

#this plot also makes seasonality easier to see - attendance climbs throughout the year
#until August (which I always thought was kind of hot for a Florida visit)
#It then drops off and peaks again in December for the holidays. 

#Let's look at overall volume

disney_parks_overall_attendance <- disney_parks7 |> 
  group_by(state, month_year) |>
  summarize(total_monthly_park_attendance = (sum(attendance_in_millions)))
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
ggplot(disney_parks_overall_attendance, aes(x = month_year, y = total_monthly_park_attendance, color = state)) +
  geom_line(size = 1) +
  geom_point(shape = 8, size = 7) +
  scale_color_manual(values = c(
    "#FF8C00", 
    "#008080",
    "#FF4466", 
    "#FFD700"  
  )) +
  theme_minimal() +
  labs(title = "Total Monthly Disney Park Attendance by State", 
       x = "Date", 
       y = "Attendance (Millions)")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

#Of course, FL has much higher attendance overall because it has 3 more parks

#YTD totals for each park
disney_parks_yearly <- disney_parks7 |> 
  group_by(park) |>
  summarize(yearly_attendance = (sum(attendance_in_millions))) |>
  arrange(desc(yearly_attendance))

disney_parks_yearly |>
  select(park, yearly_attendance) |>
  gt() |>
  cols_label(
    park = "Park",
    yearly_attendance = "Total Guests (Millions)"
  ) |>
  tab_header(title = "2025 Attendance by Park")
2025 Attendance by Park
Park Total Guests (Millions)
Magic Kingdom 27.80
EPCOT 22.30
Disneyland Park 21.10
Animal Kingdom 20.40
Disney's California Adventure 18.88
Hollywood Studios 16.48
#The most popular park was Magic Kingdom and the least popular park was Hollywood Studios

Rainfall

First, add a continent and delete the extra lines

#note - this first column imports into R as X....City, 
#however, the html would not render.
#so I rendered a glimpse of this table in a separate file, and it's just "City"
#I've changed the code below so it renders properly
#however, for this to run locally, it may need to be changed
#code is provided below

rainfall2 <- rainfall |> 
  select(City:Dec) |>
  mutate(continent = "") |>
  rename(city = City, country = Country, yearly = Year)


#If the above code does not run, please use this 
#(with the corrected first column name)
#rainfall2 <- rainfall |> 
#  select(X...City:Dec) |>
#  mutate(continent = "") |>
#  rename(city = X...City, country = Country, yearly = Year)

#1-37 is Africa, 92 - 129 is Asia, 131 - 270 is Europe, 272 - 305 is North America, 311 - 362 is Oceania, 365 - 535 is South America
#adding a continent column

rainfall2$continent[1:37] <- "Africa"
rainfall2$continent[40:90] <- "Asia"
rainfall2$continent[92:129] <- "Asia"
rainfall2$continent[131:270] <- "Europe"
rainfall2$continent[272:305] <- "North America"
rainfall2$continent[311:362] <- "Oceania"
rainfall2$continent[365:535] <- "South America"

#delete blank spaces in the continent column


rainfall_no_blanks <- subset(rainfall2, continent != "")

#getting rid of duplicated header columns


rainfall_no_blanks_2 <- subset (rainfall_no_blanks, city != "City")

Pivoting longer

rainfall_pivot <- rainfall_no_blanks_2 |>
  pivot_longer(cols = Jan:Dec,
                               names_to = "month",
                               values_to = "rainfall_mm"
  
)


#reorder so continent comes earlier 
#get rid of commas (they are causing big problems)

rainfall_cleaned <- rainfall_pivot |> relocate(continent, .after = country) |>
  mutate(yearly = str_remove(yearly, ",")) |>
  mutate(rainfall_mm = str_remove(rainfall_mm, ","))


#changing the numbers to numeric (they're characters now)
rainfall_cleaned_numeric <- rainfall_cleaned |>
  mutate(yearly = as.numeric(yearly), rainfall_mm = as.numeric(rainfall_mm)) |>
  mutate(across(c(yearly, rainfall_mm), ~replace_na(., 0)))

#one city was broken up (it has a hyphen), so let's fix that

rainfall_cleaned_numeric <- rainfall_cleaned_numeric |>
  mutate(city = str_replace(city, "Petropavlovsk-", "Petropavlovsk-Kamchatsky")) |>
  filter(city != "Kamchatsky")

#make R recognize month as a series of months
rainfall_cleaned_numeric <- as_tibble(rainfall_cleaned_numeric) |>
  mutate(month = month(parse_date_time(month, "b"), label = TRUE))

Rainfall analysis

For cities by average precipitation, I will compare seasonal precipitation patterns across cities in the same country, and find the rainiest and driest cities overall.

#find the max yearly rainfall and the max monthly rainfall 

rainiest_month <- rainfall_cleaned_numeric |> 
  arrange(desc(rainfall_mm)) |>
  slice(1)


#The rainiest documented month is July in Mawsynram, India. 

rainiest_month
# A tibble: 1 × 6
  city      country continent yearly month rainfall_mm
  <chr>     <chr>   <chr>      <dbl> <ord>       <dbl>
1 Mawsynram India   Asia      11305. Jul          3300
rainiest_city <- rainfall_cleaned_numeric |> 
  arrange(desc(yearly)) |>
  slice(1)

#The rainiest city overall is Lopez de Micay, Colombia with avg 15992 mm rain a year

rainiest_city
# A tibble: 1 × 6
  city           country  continent     yearly month rainfall_mm
  <chr>          <chr>    <chr>          <dbl> <ord>       <dbl>
1 López de Micay Colombia South America 15992. Jan          940.
#let's find the 5 rainiest cities in the US and look at a time plot of their precipitation  
#Having Hawaiian cities as part of Oceania is not an error -- that's how the 
#data showed up in Wikipedia, and I left it because continents are kind of subjective

us_rain <- rainfall_cleaned_numeric |> 
  filter(country == "United States") |> 
  arrange(desc(yearly)) 

us_rain_top_5 <- us_rain |> 
  filter(country == "United States") |>
  filter(city %in% c("Hilo", "Miami", "Nashville", "New York City", "Atlanta")) |>
  arrange(desc(yearly)) 

#there are 21 US cities total in this data set
print(us_rain |> distinct(city), n= 21)
# A tibble: 21 × 1
   city           
   <chr>          
 1 Hilo           
 2 Miami          
 3 Nashville      
 4 New York City  
 5 Atlanta        
 6 Cincinnati     
 7 Columbus (Ohio)
 8 Chicago        
 9 Dallas         
10 Seattle        
11 Portland       
12 Austin         
13 Des Moines     
14 Detroit        
15 San Francisco  
16 Anchorage      
17 Honolulu       
18 Denver         
19 Los Angeles    
20 San Diego      
21 Phoenix        
#time plot

ggplot(us_rain_top_5, aes(x = month, y = rainfall_mm, group = city, color = city)) +
  geom_line() +
  geom_point(shape = 6, size = 2) +
  theme_minimal() +
  labs(title = "Average Monthly Rainfall by US City",
       x = "Month",
       y = "Rainfall (mm)")

#NY, Nashville, and Atlanta all have relatively steady precipitation throughout the year
#According to this data set, NYC is rainier than Portland and Seattle 

#bottom five

us_rain <- rainfall_cleaned_numeric |> 
  filter(country == "United States") |> 
  arrange(yearly) 

us_rain_bottom_5 <- us_rain |> 
  filter(country == "United States") |>
  filter(city %in% c("Phoenix", "San Diego", "Los Angeles", "Denver", "Honolulu")) |>
  arrange(desc(yearly)) 

ggplot(us_rain_bottom_5, aes(x = month, y = rainfall_mm, group = city, color = city)) +
  geom_line() +
  geom_point(shape = 6, size = 2) +
  theme_minimal() +
  labs(title = "Average Monthly Rainfall by US City (Driest Cities)",
       x = "Month",
       y = "Rainfall (mm)")

#With the exception of Denver, many of these lines have a U shape, indicating higher rainfall in winter months

#let's look at average rainfall by city

us_rain_avg <- rainfall_cleaned_numeric |> 
  filter(country == "United States") |> 
  group_by(city) |>
  summarize(monthly_avg = round(mean(rainfall_mm),2))


#this table showcases the wide range of avg. monthly rainfalls in the U.S. 15.19 - 255 mm

us_rain_avg |>
  gt() |>
  cols_label(
    city = "City",
    monthly_avg = "Average Monthly Rainfall"
  ) |>
  tab_header(title = "Average Monthly Rainfall in 21 U.S. Cities")
Average Monthly Rainfall in 21 U.S. Cities
City Average Monthly Rainfall
Anchorage 34.78
Atlanta 104.38
Austin 76.72
Chicago 86.50
Cincinnati 89.77
Columbus (Ohio) 88.08
Dallas 81.10
Denver 32.49
Des Moines 75.81
Detroit 72.75
Hilo 255.00
Honolulu 34.67
Los Angeles 30.02
Miami 142.67
Nashville 107.00
New York City 104.75
Phoenix 15.19
Portland 78.17
San Diego 20.72
San Francisco 41.72
Seattle 78.72

Football stats

#Let's start by breaking up first name and last name

fg_names <- field_goal |>
    separate(col = Name, into = c("last_name", "first_name"), sep = ",")

#Pivot longer -- get rid of the percentage (we can calculate that if we need it)
#and  turn the columns into field goal type, made, and attempted
#this is a hard data set to pivot longer because many of the stats are unrelated 
#(we don't want extra points mixed in with field goals or total field goals mixed in 
#with field goal length + number made)

#there's also the question of whether we need the data not being pivoted.
#Besides identifiers, like name and team, every statistic except extra points
#can easily be derived from the pivoted data

fg_keep <- fg_names |> 
  select(Player.Id:FG.Percentage, FGs.Made.20.29.Yards, FGs.Attempted.20.29.Yards, FGs.Made.30.39.Yards, FGs.Attempted.30.39.Yards, FGs.Made.40.49.Yards, FGs.Attempted.40.49.Yards, FGs.Made.50..Yards, FGs.Attempted.50..Yards, Extra.Points.Attempted, Extra.Points.Made, Extra.Points.Blocked) 

fg_keep2 <- fg_keep |>
  mutate(across(contains(c("Yards", "Extra.Points")), as.numeric))

fg_long <- fg_keep2 |> 
    pivot_longer(
        cols = contains("Yards"),
        names_to = c(".value", "field_goal_type"),
        names_pattern = "FGs\\.(Made|Attempted)\\.(.*)"
    )

#make this more readable
fg_cleaned <- fg_long |> 
  mutate(field_goal_type = case_when(
  field_goal_type == "20.29.Yards" ~ "20-29 yards",
  field_goal_type == "30.39.Yards" ~ "30-39 yards",
  field_goal_type == "40.49.Yards" ~"40-49 yards",
  field_goal_type == "50..Yards" ~ "50 yards")
)

#clean up the names

fg_renamed <- fg_cleaned |> rename_with(tolower)

#replace period with underscore 
colnames(fg_renamed) <- gsub(pattern = "\\.", replacement = "_", x = colnames(fg_renamed))

#move first name and last name so they're in the right order

fg_renamed <- fg_renamed |> relocate(last_name, .after = first_name)

Analysis

I said I would look at cumulative player stats ratios (these are already included in the data set), and how the relationship of certain stats (such as player height and weight) relate to career length (these are excluded from the data set).

Instead, let’s look at how likely players are in general to make a field goal based on distance (total # of field goals / total attempted, grouped by type) and visualize it in a table.

Then, we’ll look at the top 5 teams in terms of field goals and field goal percentage.

fg_likelihood <- fg_renamed |> 
  mutate(across(c(made, attempted), as.numeric)) |> 
  group_by(field_goal_type) |> 
  summarize(
    made = sum(made, na.rm = TRUE), 
    attempted = sum(attempted, na.rm = TRUE)
  )

fg_likelihood <- fg_likelihood |> 
  mutate(percentage_made = round(made / attempted, 2))

#make a table
#Unsurprisingly, the farther a kicker is from the goal posts, the lower the percentage

fg_likelihood |>
  select(field_goal_type, percentage_made) |>
  gt() |>
  cols_label(
    field_goal_type = "Distance",
    percentage_made = "Percentage Made (%)"
  ) |>
  tab_header(title = "Percentage of Field Goals Made by Distance")
Percentage of Field Goals Made by Distance
Distance Percentage Made (%)
20-29 yards 0.95
30-39 yards 0.87
40-49 yards 0.72
50 yards 0.56
ggplot(data = fg_renamed, aes(x = attempted, y = made, color = field_goal_type)) + geom_point()
Warning: Removed 5308 rows containing missing values or values outside the scale range
(`geom_point()`).

#dot plot visualizing total games played and number of field goals


#there are some baseball teams in here and defunct teams with 0 FGs, 
#so let's filter those out too, and then go with the top 5 teams.

by_team <- fg_renamed |> group_by(team) |> 
  mutate(across(c(made, attempted), as.numeric)) |>
  summarize(
    made = sum(made, na.rm = TRUE), 
    attempted = sum(attempted, na.rm = TRUE)
  ) |>
  filter(attempted != 0) |>
  arrange(desc(made))
 

by_team |>
   slice(1:5) |>
  select(team, made) |>
  gt() |>
  cols_label(
    team = "NFL Team",
    made = "Total Field Goals Made"
  ) |>
  tab_header(title = "Top 5 Teams for Field Goals")
Top 5 Teams for Field Goals
NFL Team Total Field Goals Made
San Diego Chargers 643
New England Patriots 626
Atlanta Falcons 625
Denver Broncos 625
Buffalo Bills 596
#The San Diego Chargers have made the most field goals over the years, but these numbers
#are pretty close

#Let's look at percentage of field goals made


fg_percent <- by_team |> 
  mutate(percentage = round(made / attempted, 2)) |>
  arrange(desc(percentage)) 


#in a table
fg_percent |>
   slice(1:5) |>
  gt() |>
  cols_label(
    team = "NFL Team",
    made = "Made",
    attempted = "Attempted",
    percentage = "% Made"
  ) |>
  tab_header(title = "NFL Teams With Highest Field Goal Success Rate")
NFL Teams With Highest Field Goal Success Rate
NFL Team Made Attempted % Made
Baltimore Ravens 179 206 0.87
Indianapolis Colts 474 548 0.86
Tennessee Oilers 60 71 0.85
Houston Texans 170 203 0.84
New England Patriots 626 755 0.83
#Theere's a wide range of attempts/FGs made represented here (possibly due to the age 
#of the teams)
#The Baltimore Ravens have the highest percentage, but only 179 field goals made out of 206
#The Pats, on the other hand, rank fifth and have ade 626 out of 755

#a slightly more visually overwhelming presentation of the above info
fg_percent |>
  slice(1:5) |>
  ggplot(aes(x = reorder(team, -percentage), y = percentage, fill = team)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = scales::percent(percentage)), vjust = -0.5) +
  scale_y_continuous(labels = scales::percent, expand = expansion(mult = c(0, 0.1))) +
  labs(
    title = "Top 5 NFL Teams by Field Goal Success Rate",
    x = "NFL Team",
    y = "Success Rate"
  ) +
  theme_minimal()

Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed March 5, 2026.