Reshape data from wide to tidy (long) format, using pivot longer
Normalize the variable structure (e.g., all numeric, same number of decimals)
Rename variables to follow a consistent naming convention
Address missing or inconsistent values
Document how I handle missing values
Planned analysis
For cities by average precipitation, I will compare seasonal precipitation patterns across cities in the same country, and to find the rainiest and driest cities overall.
For the Disney Parks data I willl look at monthly attendance to identify peak/low travel seasons, compare FL and CA attendance, and calculate YTD totals for each park. I will also visualize monthly attendance across parks. I have added additional parks (Animal Kingdom and Calinfornia Adventure) to compare all CA parks to all FL parks.
With the football data set, I will look at cumulative player stats, ratios, and the relationship of certain stats (such as player height and weight) relate to career length.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Let’s standardize the values before we pivot longer. I think it will be easier to remove the M/Millions from these columns and say number of guests (in millions), then divide the other columns (which have values in millions) by a million.
#format feb and april 2025 values as numeric, remove commas, and place data in the same format (millions) as other columns before pivotingdisney_parks2 <- disney_parks |>mutate(Apr_2025 = (Apr_2025/1000000)) |>mutate(Feb.25 =str_remove_all(Feb.25, ",")) |>mutate(Feb.25 = (as.numeric(Feb.25))) |>mutate(Feb.25 = (Feb.25/1000000))#normalize the column namesdisney_parks3 <- disney_parks2 |>rename_with(~format(my(paste(., "2025")), "%b-%Y"),.cols =c(March, May, Jun, July, Aug, Sept, Nov, Dec) ) |>rename("Jan-2025"= Jan.25, "Feb-2025"= Feb.25, "Oct-2025"= Oct.25, "Apr-2025"= Apr_2025)
#getting rid of M and milliondisney_parks4 <- disney_parks3 |>mutate(across(`May-2025`:`Dec-2025`, ~str_remove(.x, "M.*"))) |>mutate(`Mar-2025`=str_remove(`Mar-2025`, " m.*")) |>mutate(`Jan-2025`=str_remove(`Jan-2025`, "M.*")) |>mutate(`Jan-2025`=str_remove(`Jan-2025`, "K.*"))#Resolving problems I created for myself (I am rounding the first value to one million, since all other values are in multiples of 100k)disney_parks4[4, 2] <-1disney_parks4[5, 2] <- .8#there's an NA value for one day at one of the parks. #I've decided to impute using the closest month (Dec 2025 - every park has the same attendance + 100k)#I recognize this is the most scientific, but it's only one value and will cause #the least disruption when looking at time series.#We also have no reason to believe the park was closed that month. disney_parks4[4, 6] <-1.4#format everything as numericdisney_parks5 <- disney_parks4 |>mutate(across(`Jan-2025`:`Dec-2025`, as.numeric))#pivot longerdisney_parks6 <- disney_parks5 |>pivot_longer(cols =`Jan-2025`:`Dec-2025`,names_to ="month_year",values_to ="attendance_in_millions" )#format as datesdisney_parks7 <- disney_parks6 |>mutate(month_year =my(month_year)) |>rename("park"= Park.Name, "state"= Region)
Analysis - Disney Parks
library(fpp3)
Warning: package 'fpp3' was built under R version 4.5.2
Registered S3 method overwritten by 'tsibble':
method from
as_tibble.grouped_df dplyr
disney_parks8 <- disney_parks7 |>as_tsibble(key = park, index = month_year)autoplot(disney_parks8) +labs(title ="Disney Park Attendance by Month", x ="Month", y ="Attendance in Millions")
Plot variable not specified, automatically selected `.vars =
attendance_in_millions`
#We can see the summer and holiday peaks, which are fairly consistent across parks#comparing monthly average attendance in California vs. Floridadisney_parks_regional <- disney_parks7 |>group_by(state) |>summarize(avg_monthly_park_attendance =round(mean(attendance_in_millions),2))library(gt)
Warning: package 'gt' was built under R version 4.5.2
disney_parks_regional |>select(state, avg_monthly_park_attendance) |>gt() |>cols_label(state ="Location",avg_monthly_park_attendance ="Average Visitors by Park (Millions)" ) |>tab_header(title ="California vs. Florida Monthly Visitors")
California vs. Florida Monthly Visitors
Location
Average Visitors by Park (Millions)
California
1.67
Florida
1.81
#Florida and California by monthdisney_parks_regional_monthly <- disney_parks7 |>group_by(state, month_year) |>summarize(avg_monthly_park_attendance =round(mean(attendance_in_millions),2))
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
#A somewhat Disney-looking graphggplot(disney_parks_regional_monthly, aes(x = month_year, y = avg_monthly_park_attendance, color = state)) +geom_line() +scale_color_manual(values =c("#FF8C00", "#008080","#FF4466", "#FFD700" )) +geom_point(shape =8, size =3) +theme_minimal() +labs(title ="Average Monthly Disney Park Attendance", x ="Date", y ="Average Attendance per Park (Millions)") +theme(plot.title =element_text(size =20) )
#Florida consistently has a higher avg park attendance #this plot also makes seasonality easier to see - attendance climbs throughout the year#until August (which I always thought was kind of hot for a Florida visit)#It then drops off and peaks again in December for the holidays. #Let's look at overall volumedisney_parks_overall_attendance <- disney_parks7 |>group_by(state, month_year) |>summarize(total_monthly_park_attendance = (sum(attendance_in_millions)))
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
ggplot(disney_parks_overall_attendance, aes(x = month_year, y = total_monthly_park_attendance, color = state)) +geom_line(size =1) +geom_point(shape =8, size =7) +scale_color_manual(values =c("#FF8C00", "#008080","#FF4466", "#FFD700" )) +theme_minimal() +labs(title ="Total Monthly Disney Park Attendance by State", x ="Date", y ="Attendance (Millions)")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
#Of course, FL has much higher attendance overall because it has 3 more parks#YTD totals for each parkdisney_parks_yearly <- disney_parks7 |>group_by(park) |>summarize(yearly_attendance = (sum(attendance_in_millions))) |>arrange(desc(yearly_attendance))disney_parks_yearly |>select(park, yearly_attendance) |>gt() |>cols_label(park ="Park",yearly_attendance ="Total Guests (Millions)" ) |>tab_header(title ="2025 Attendance by Park")
2025 Attendance by Park
Park
Total Guests (Millions)
Magic Kingdom
27.80
EPCOT
22.30
Disneyland Park
21.10
Animal Kingdom
20.40
Disney's California Adventure
18.88
Hollywood Studios
16.48
#The most popular park was Magic Kingdom and the least popular park was Hollywood Studios
Rainfall
First, add a continent and delete the extra lines
#note - this first column imports into R as X....City, #however, the html would not render.#so I rendered a glimpse of this table in a separate file, and it's just "City"#I've changed the code below so it renders properly#however, for this to run locally, it may need to be changed#code is provided belowrainfall2 <- rainfall |>select(City:Dec) |>mutate(continent ="") |>rename(city = City, country = Country, yearly = Year)#If the above code does not run, please use this #(with the corrected first column name)#rainfall2 <- rainfall |> # select(X...City:Dec) |># mutate(continent = "") |># rename(city = X...City, country = Country, yearly = Year)#1-37 is Africa, 92 - 129 is Asia, 131 - 270 is Europe, 272 - 305 is North America, 311 - 362 is Oceania, 365 - 535 is South America#adding a continent columnrainfall2$continent[1:37] <-"Africa"rainfall2$continent[40:90] <-"Asia"rainfall2$continent[92:129] <-"Asia"rainfall2$continent[131:270] <-"Europe"rainfall2$continent[272:305] <-"North America"rainfall2$continent[311:362] <-"Oceania"rainfall2$continent[365:535] <-"South America"#delete blank spaces in the continent columnrainfall_no_blanks <-subset(rainfall2, continent !="")#getting rid of duplicated header columnsrainfall_no_blanks_2 <-subset (rainfall_no_blanks, city !="City")
Pivoting longer
rainfall_pivot <- rainfall_no_blanks_2 |>pivot_longer(cols = Jan:Dec,names_to ="month",values_to ="rainfall_mm")#reorder so continent comes earlier #get rid of commas (they are causing big problems)rainfall_cleaned <- rainfall_pivot |>relocate(continent, .after = country) |>mutate(yearly =str_remove(yearly, ",")) |>mutate(rainfall_mm =str_remove(rainfall_mm, ","))#changing the numbers to numeric (they're characters now)rainfall_cleaned_numeric <- rainfall_cleaned |>mutate(yearly =as.numeric(yearly), rainfall_mm =as.numeric(rainfall_mm)) |>mutate(across(c(yearly, rainfall_mm), ~replace_na(., 0)))#one city was broken up (it has a hyphen), so let's fix thatrainfall_cleaned_numeric <- rainfall_cleaned_numeric |>mutate(city =str_replace(city, "Petropavlovsk-", "Petropavlovsk-Kamchatsky")) |>filter(city !="Kamchatsky")#make R recognize month as a series of monthsrainfall_cleaned_numeric <-as_tibble(rainfall_cleaned_numeric) |>mutate(month =month(parse_date_time(month, "b"), label =TRUE))
Rainfall analysis
For cities by average precipitation, I will compare seasonal precipitation patterns across cities in the same country, and find the rainiest and driest cities overall.
#find the max yearly rainfall and the max monthly rainfall rainiest_month <- rainfall_cleaned_numeric |>arrange(desc(rainfall_mm)) |>slice(1)#The rainiest documented month is July in Mawsynram, India. rainiest_month
# A tibble: 1 × 6
city country continent yearly month rainfall_mm
<chr> <chr> <chr> <dbl> <ord> <dbl>
1 Mawsynram India Asia 11305. Jul 3300
rainiest_city <- rainfall_cleaned_numeric |>arrange(desc(yearly)) |>slice(1)#The rainiest city overall is Lopez de Micay, Colombia with avg 15992 mm rain a yearrainiest_city
# A tibble: 1 × 6
city country continent yearly month rainfall_mm
<chr> <chr> <chr> <dbl> <ord> <dbl>
1 López de Micay Colombia South America 15992. Jan 940.
#let's find the 5 rainiest cities in the US and look at a time plot of their precipitation #Having Hawaiian cities as part of Oceania is not an error -- that's how the #data showed up in Wikipedia, and I left it because continents are kind of subjectiveus_rain <- rainfall_cleaned_numeric |>filter(country =="United States") |>arrange(desc(yearly)) us_rain_top_5 <- us_rain |>filter(country =="United States") |>filter(city %in%c("Hilo", "Miami", "Nashville", "New York City", "Atlanta")) |>arrange(desc(yearly)) #there are 21 US cities total in this data setprint(us_rain |>distinct(city), n=21)
# A tibble: 21 × 1
city
<chr>
1 Hilo
2 Miami
3 Nashville
4 New York City
5 Atlanta
6 Cincinnati
7 Columbus (Ohio)
8 Chicago
9 Dallas
10 Seattle
11 Portland
12 Austin
13 Des Moines
14 Detroit
15 San Francisco
16 Anchorage
17 Honolulu
18 Denver
19 Los Angeles
20 San Diego
21 Phoenix
#time plotggplot(us_rain_top_5, aes(x = month, y = rainfall_mm, group = city, color = city)) +geom_line() +geom_point(shape =6, size =2) +theme_minimal() +labs(title ="Average Monthly Rainfall by US City",x ="Month",y ="Rainfall (mm)")
#NY, Nashville, and Atlanta all have relatively steady precipitation throughout the year#According to this data set, NYC is rainier than Portland and Seattle #bottom fiveus_rain <- rainfall_cleaned_numeric |>filter(country =="United States") |>arrange(yearly) us_rain_bottom_5 <- us_rain |>filter(country =="United States") |>filter(city %in%c("Phoenix", "San Diego", "Los Angeles", "Denver", "Honolulu")) |>arrange(desc(yearly)) ggplot(us_rain_bottom_5, aes(x = month, y = rainfall_mm, group = city, color = city)) +geom_line() +geom_point(shape =6, size =2) +theme_minimal() +labs(title ="Average Monthly Rainfall by US City (Driest Cities)",x ="Month",y ="Rainfall (mm)")
#With the exception of Denver, many of these lines have a U shape, indicating higher rainfall in winter months#let's look at average rainfall by cityus_rain_avg <- rainfall_cleaned_numeric |>filter(country =="United States") |>group_by(city) |>summarize(monthly_avg =round(mean(rainfall_mm),2))#this table showcases the wide range of avg. monthly rainfalls in the U.S. 15.19 - 255 mmus_rain_avg |>gt() |>cols_label(city ="City",monthly_avg ="Average Monthly Rainfall" ) |>tab_header(title ="Average Monthly Rainfall in 21 U.S. Cities")
Average Monthly Rainfall in 21 U.S. Cities
City
Average Monthly Rainfall
Anchorage
34.78
Atlanta
104.38
Austin
76.72
Chicago
86.50
Cincinnati
89.77
Columbus (Ohio)
88.08
Dallas
81.10
Denver
32.49
Des Moines
75.81
Detroit
72.75
Hilo
255.00
Honolulu
34.67
Los Angeles
30.02
Miami
142.67
Nashville
107.00
New York City
104.75
Phoenix
15.19
Portland
78.17
San Diego
20.72
San Francisco
41.72
Seattle
78.72
Football stats
#Let's start by breaking up first name and last namefg_names <- field_goal |>separate(col = Name, into =c("last_name", "first_name"), sep =",")#Pivot longer -- get rid of the percentage (we can calculate that if we need it)#and turn the columns into field goal type, made, and attempted#this is a hard data set to pivot longer because many of the stats are unrelated #(we don't want extra points mixed in with field goals or total field goals mixed in #with field goal length + number made)#there's also the question of whether we need the data not being pivoted.#Besides identifiers, like name and team, every statistic except extra points#can easily be derived from the pivoted datafg_keep <- fg_names |>select(Player.Id:FG.Percentage, FGs.Made.20.29.Yards, FGs.Attempted.20.29.Yards, FGs.Made.30.39.Yards, FGs.Attempted.30.39.Yards, FGs.Made.40.49.Yards, FGs.Attempted.40.49.Yards, FGs.Made.50..Yards, FGs.Attempted.50..Yards, Extra.Points.Attempted, Extra.Points.Made, Extra.Points.Blocked) fg_keep2 <- fg_keep |>mutate(across(contains(c("Yards", "Extra.Points")), as.numeric))fg_long <- fg_keep2 |>pivot_longer(cols =contains("Yards"),names_to =c(".value", "field_goal_type"),names_pattern ="FGs\\.(Made|Attempted)\\.(.*)" )#make this more readablefg_cleaned <- fg_long |>mutate(field_goal_type =case_when( field_goal_type =="20.29.Yards"~"20-29 yards", field_goal_type =="30.39.Yards"~"30-39 yards", field_goal_type =="40.49.Yards"~"40-49 yards", field_goal_type =="50..Yards"~"50 yards"))#clean up the namesfg_renamed <- fg_cleaned |>rename_with(tolower)#replace period with underscore colnames(fg_renamed) <-gsub(pattern ="\\.", replacement ="_", x =colnames(fg_renamed))#move first name and last name so they're in the right orderfg_renamed <- fg_renamed |>relocate(last_name, .after = first_name)
Analysis
I said I would look at cumulative player stats ratios (these are already included in the data set), and how the relationship of certain stats (such as player height and weight) relate to career length (these are excluded from the data set).
Instead, let’s look at how likely players are in general to make a field goal based on distance (total # of field goals / total attempted, grouped by type) and visualize it in a table.
Then, we’ll look at the top 5 teams in terms of field goals and field goal percentage.
fg_likelihood <- fg_renamed |>mutate(across(c(made, attempted), as.numeric)) |>group_by(field_goal_type) |>summarize(made =sum(made, na.rm =TRUE), attempted =sum(attempted, na.rm =TRUE) )fg_likelihood <- fg_likelihood |>mutate(percentage_made =round(made / attempted, 2))#make a table#Unsurprisingly, the farther a kicker is from the goal posts, the lower the percentagefg_likelihood |>select(field_goal_type, percentage_made) |>gt() |>cols_label(field_goal_type ="Distance",percentage_made ="Percentage Made (%)" ) |>tab_header(title ="Percentage of Field Goals Made by Distance")
Percentage of Field Goals Made by Distance
Distance
Percentage Made (%)
20-29 yards
0.95
30-39 yards
0.87
40-49 yards
0.72
50 yards
0.56
ggplot(data = fg_renamed, aes(x = attempted, y = made, color = field_goal_type)) +geom_point()
Warning: Removed 5308 rows containing missing values or values outside the scale range
(`geom_point()`).
#dot plot visualizing total games played and number of field goals#there are some baseball teams in here and defunct teams with 0 FGs, #so let's filter those out too, and then go with the top 5 teams.by_team <- fg_renamed |>group_by(team) |>mutate(across(c(made, attempted), as.numeric)) |>summarize(made =sum(made, na.rm =TRUE), attempted =sum(attempted, na.rm =TRUE) ) |>filter(attempted !=0) |>arrange(desc(made))by_team |>slice(1:5) |>select(team, made) |>gt() |>cols_label(team ="NFL Team",made ="Total Field Goals Made" ) |>tab_header(title ="Top 5 Teams for Field Goals")
Top 5 Teams for Field Goals
NFL Team
Total Field Goals Made
San Diego Chargers
643
New England Patriots
626
Atlanta Falcons
625
Denver Broncos
625
Buffalo Bills
596
#The San Diego Chargers have made the most field goals over the years, but these numbers#are pretty close#Let's look at percentage of field goals madefg_percent <- by_team |>mutate(percentage =round(made / attempted, 2)) |>arrange(desc(percentage)) #in a tablefg_percent |>slice(1:5) |>gt() |>cols_label(team ="NFL Team",made ="Made",attempted ="Attempted",percentage ="% Made" ) |>tab_header(title ="NFL Teams With Highest Field Goal Success Rate")
NFL Teams With Highest Field Goal Success Rate
NFL Team
Made
Attempted
% Made
Baltimore Ravens
179
206
0.87
Indianapolis Colts
474
548
0.86
Tennessee Oilers
60
71
0.85
Houston Texans
170
203
0.84
New England Patriots
626
755
0.83
#Theere's a wide range of attempts/FGs made represented here (possibly due to the age #of the teams)#The Baltimore Ravens have the highest percentage, but only 179 field goals made out of 206#The Pats, on the other hand, rank fifth and have ade 626 out of 755#a slightly more visually overwhelming presentation of the above infofg_percent |>slice(1:5) |>ggplot(aes(x =reorder(team, -percentage), y = percentage, fill = team)) +geom_col(show.legend =FALSE) +geom_text(aes(label = scales::percent(percentage)), vjust =-0.5) +scale_y_continuous(labels = scales::percent, expand =expansion(mult =c(0, 0.1))) +labs(title ="Top 5 NFL Teams by Field Goal Success Rate",x ="NFL Team",y ="Success Rate" ) +theme_minimal()
Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed March 5, 2026.