Disc Golf Events - STAT 210 Final Project

About the Data

This data comes from a csv file from github which contains stats on the names, scores, courses, rounds, earnings, etc. from the top disc golfers around the world in a particular season. In the github page, many other documents were included with top earners listed in descending order, as well as various other stats concerning the players. This github page can be viewed via this link: https://github.com/nicholastbolick/Disc-Golf-Stats/blob/main/events.csv.

Goal of This Project

For this particular project, I wanted to develop my own code to sort the data from one of the files (event.csv) into a more logical progression of players by earnings and round ratings to compare the variables and get a correlation. The goal of this project is to present our data in a readable, visually-appealing format which can help us draw a conclusion between the player round ratings and their subsequent earnings.

Loading Tidyverse & Reading in our Data

First we will load our tidyverse library:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Next we will read in our event.csv file, naming it “event” for our purposes:

event <- read_csv("data/events.csv")
New names:
• `` -> `...1`
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 12455 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Name, Total, Prize, Event, Prize (USD)
dbl (35): ...1, Place, Points, PDGA#, Rating, Par, Rd1, Unnamed: 7, Rd2, Unn...
lgl  (2): Unnamed: 23, Unnamed: 24

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Glimpsing at our Data

Let us look at our dataset now using the “glimpse” and “summary” commands:

glimpse(event)
Rows: 12,455
Columns: 42
$ ...1          <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ Place         <dbl> 1, 2, 2, 4, 5, 6, 7, 7, 9, 10, 11, 11, 13, 14, 15, 16, 1…
$ Points        <dbl> 1360, 1350, 1350, 1330, 1320, 1310, 1300, 1300, 1280, 12…
$ Name          <chr> "Eagle Wynne McMahon", "Calvin Heimburg", "Ezra Aderhold…
$ `PDGA#`       <dbl> 37817, 45971, 121715, 13864, 38008, 48346, 45879, 41760,…
$ Rating        <dbl> 1050, 1051, 1016, 1035, 1054, 1035, 1013, 1034, 1053, 10…
$ Par           <dbl> -48, -40, -40, -38, -37, -36, -35, -35, -34, -32, -31, -…
$ Rd1           <dbl> 48, 47, 49, 49, 49, 54, 56, 53, 48, 50, 51, 49, 53, 55, …
$ `Unnamed: 7`  <dbl> 1071, 1080, 1063, 1063, 1063, 1019, 1002, 1028, 1071, 10…
$ Rd2           <dbl> 50, 50, 46, 50, 52, 51, 49, 53, 49, 51, 55, 53, 53, 51, …
$ `Unnamed: 9`  <dbl> 1066, 1066, 1099, 1066, 1050, 1058, 1075, 1041, 1075, 10…
$ Rd3           <dbl> 49, 51, 58, 57, 52, 50, 51, 53, 54, 54, 52, 52, 54, 52, …
$ `Unnamed: 11` <dbl> 1079, 1062, 1005, 1013, 1054, 1071, 1062, 1046, 1038, 10…
$ Finals        <dbl> 49, 56, 51, 50, 54, 53, 53, 50, 59, 57, 55, 59, 54, 57, …
$ `Unnamed: 13` <dbl> 1090, 1036, 1075, 1082, 1052, 1059, 1059, 1082, 1013, 10…
$ Total         <chr> "196", "204", "204", "206", "207", "208", "209", "209", …
$ Prize         <chr> "$5,500", "$3,125", "$3,125", "$2,250", "$1,750", "$1,50…
$ Event         <chr> "Las Vegas Challenge", "Las Vegas Challenge", "Las Vegas…
$ Rd4           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 4`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 6`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 8`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Semis         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 10` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 12` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Prize (USD)` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Rd5           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Rd6           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 14` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Rd7           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 16` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Rd8           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 18` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 20` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 22` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 5`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 15` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 17` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 19` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 21` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 23` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ `Unnamed: 24` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
summary(event)
      ...1            Place            Points           Name          
 Min.   :  0.00   Min.   :  1.00   Min.   :   0.0   Length:12455      
 1st Qu.: 20.00   1st Qu.: 19.00   1st Qu.:   0.0   Class :character  
 Median : 43.00   Median : 42.00   Median : 210.0   Mode  :character  
 Mean   : 54.09   Mean   : 53.84   Mean   : 334.7                     
 3rd Qu.: 79.00   3rd Qu.: 79.00   3rd Qu.: 520.0                     
 Max.   :237.00   Max.   :236.00   Max.   :2820.0                     
                                   NA's   :2369                       
     PDGA#            Rating            Par               Rd1        
 Min.   :    55   Min.   : 627.0   Min.   :-48.000   Min.   : 40.00  
 1st Qu.:  7495   1st Qu.: 959.0   1st Qu.:-12.000   1st Qu.: 55.00  
 Median : 14600   Median : 982.0   Median : -3.000   Median : 62.00  
 Mean   : 19212   Mean   : 978.4   Mean   : -1.775   Mean   : 66.63  
 3rd Qu.: 22710   3rd Qu.:1004.0   3rd Qu.:  8.000   3rd Qu.: 70.00  
 Max.   :168806   Max.   :1056.0   Max.   : 55.000   Max.   :999.00  
 NA's   :15       NA's   :1722     NA's   :11089                     
   Unnamed: 7          Rd2           Unnamed: 9          Rd3        
 Min.   : 836.0   Min.   : 41.00   Min.   : 826.0   Min.   : 37.00  
 1st Qu.: 963.0   1st Qu.: 55.00   1st Qu.: 967.0   1st Qu.: 55.00  
 Median : 996.0   Median : 62.00   Median : 996.0   Median : 62.00  
 Mean   : 990.3   Mean   : 70.05   Mean   : 990.7   Mean   : 77.15  
 3rd Qu.:1019.0   3rd Qu.: 71.00   3rd Qu.:1018.0   3rd Qu.: 71.00  
 Max.   :1089.0   Max.   :999.00   Max.   :1099.0   Max.   :999.00  
 NA's   :10497    NA's   :3        NA's   :10517    NA's   :413     
  Unnamed: 11         Finals        Unnamed: 13        Total          
 Min.   : 810.0   Min.   : 20.00   Min.   : 826.0   Length:12455      
 1st Qu.: 971.0   1st Qu.: 53.00   1st Qu.: 950.0   Class :character  
 Median : 999.0   Median : 64.00   Median : 986.0   Mode  :character  
 Mean   : 994.7   Mean   : 59.78   Mean   : 982.4                     
 3rd Qu.:1023.0   3rd Qu.: 70.00   3rd Qu.:1013.0                     
 Max.   :1088.0   Max.   :999.00   Max.   :1090.0                     
 NA's   :10710    NA's   :10980    NA's   :12107                      
    Prize              Event                Rd4           Unnamed: 4    
 Length:12455       Length:12455       Min.   : 39.00   Min.   : 780.0  
 Class :character   Class :character   1st Qu.: 52.00   1st Qu.: 931.0  
 Mode  :character   Mode  :character   Median : 57.00   Median : 960.0  
                                       Mean   : 77.73   Mean   : 957.7  
                                       3rd Qu.: 63.00   3rd Qu.: 991.0  
                                       Max.   :999.00   Max.   :1041.0  
                                       NA's   :6416     NA's   :12305   
   Unnamed: 6       Unnamed: 8       Semis         Unnamed: 10  
 Min.   : 663.0   Min.   : 643   Min.   : 21.00   Min.   : 552  
 1st Qu.: 951.0   1st Qu.: 950   1st Qu.: 51.00   1st Qu.: 952  
 Median : 980.0   Median : 980   Median : 54.00   Median : 981  
 Mean   : 974.8   Mean   : 975   Mean   : 54.47   Mean   : 976  
 3rd Qu.:1005.0   3rd Qu.:1006   3rd Qu.: 58.00   3rd Qu.:1006  
 Max.   :1088.0   Max.   :1092   Max.   :126.00   Max.   :1098  
 NA's   :4132     NA's   :4154   NA's   :11053    NA's   :4320  
  Unnamed: 12     Prize (USD)             Rd5              Rd6        
 Min.   : 762.0   Length:12455       Min.   : 41.00   Min.   : 42.00  
 1st Qu.: 953.0   Class :character   1st Qu.: 51.00   1st Qu.: 52.00  
 Median : 982.0   Mode  :character   Median : 55.00   Median : 55.00  
 Mean   : 976.2                      Mean   : 72.41   Mean   : 74.19  
 3rd Qu.:1006.0                      3rd Qu.: 59.00   3rd Qu.: 60.00  
 Max.   :1083.0                      Max.   :999.00   Max.   :999.00  
 NA's   :8018                        NA's   :9076     NA's   :9565    
  Unnamed: 14          Rd7          Unnamed: 16          Rd8        
 Min.   : 714.0   Min.   : 40.00   Min.   : 751.0   Min.   : 39.00  
 1st Qu.: 933.0   1st Qu.: 50.00   1st Qu.: 945.0   1st Qu.: 49.00  
 Median : 971.0   Median : 54.00   Median : 973.0   Median : 53.00  
 Mean   : 963.2   Mean   : 75.74   Mean   : 971.8   Mean   : 71.32  
 3rd Qu.:1000.0   3rd Qu.: 59.00   3rd Qu.:1002.0   3rd Qu.: 57.00  
 Max.   :1102.0   Max.   :999.00   Max.   :1073.0   Max.   :999.00  
 NA's   :11136    NA's   :10284    NA's   :11420    NA's   :10939   
  Unnamed: 18      Unnamed: 20      Unnamed: 22       Unnamed: 5    
 Min.   : 691.0   Min.   : 869.0   Min.   : 923.0   Min.   : 792.0  
 1st Qu.: 954.0   1st Qu.: 970.0   1st Qu.: 977.0   1st Qu.: 939.0  
 Median : 985.0   Median : 990.0   Median : 998.5   Median : 962.0  
 Mean   : 983.1   Mean   : 989.6   Mean   : 995.1   Mean   : 959.5  
 3rd Qu.:1011.0   3rd Qu.:1014.0   3rd Qu.:1014.2   3rd Qu.: 986.0  
 Max.   :1078.0   Max.   :1074.0   Max.   :1057.0   Max.   :1054.0  
 NA's   :11923    NA's   :12116    NA's   :12383    NA's   :11933   
  Unnamed: 15     Unnamed: 17      Unnamed: 19      Unnamed: 21    
 Min.   : 840    Min.   : 835.0   Min.   : 814.0   Min.   : 947.0  
 1st Qu.: 934    1st Qu.: 932.0   1st Qu.: 924.0   1st Qu.: 978.0  
 Median : 965    Median : 965.0   Median : 958.0   Median : 988.0  
 Mean   : 961    Mean   : 956.9   Mean   : 956.6   Mean   : 992.7  
 3rd Qu.: 986    3rd Qu.: 986.0   3rd Qu.: 991.0   3rd Qu.:1009.0  
 Max.   :1048    Max.   :1040.0   Max.   :1069.0   Max.   :1070.0  
 NA's   :12048   NA's   :12283    NA's   :12283    NA's   :12401   
 Unnamed: 23    Unnamed: 24   
 Mode:logical   Mode:logical  
 NA's:12455     NA's:12455    
                              
                              
                              
                              
                              

Cleaning up our Data

Before we create our plots and graphs and visualize our data in a cleaner format, we need to rename any places in our data set that are blank or have “DNF” for “Did Not Finish”. We will rename these spots in our data set to NA so that R does not skip values and will keep our data consistent:

event_clean <- event %>%
  mutate(
    across(
      where(is.character),
      ~ na_if(.x, "")              # convert blank → NA
    )
  ) %>%
  mutate(
    across(
      where(is.character),
      ~ na_if(.x, "DNF")           # convert DNF → NA
    )
  )

I also put some code for making sure the data for the “Prize” variable are all numbers and are 0 if the value was NA for the players that did not recieve any cash prize:

event_clean <- event_clean %>%
  mutate(
    Prize = parse_number(Prize),   # safely extracts numeric values
    Prize = replace_na(Prize, 0)   # missing prize = 0
  )

Additionally, we need to let R know we are reading in the columns that have the round ratings as numeric values, so we have the following code:

event_clean <- event_clean %>%
  mutate(
    across(
      c(`Unnamed: 7`, `Unnamed: 9`, `Unnamed: 11`),
      as.numeric
    )
  )

Averaging Player Round Ratings

Each player plays 3 rounds of 18 holes each tournament, and then the players with the top scores go on to play in the finals. For our purposes, we are going to ignore the final round and just average out the first three round scores to get more evenly distributed values among the whole player field. We can use the “mutate” function to do this by creating a new variable which is the average of our three round score rows which are titled “Unamed 7”, “Unnamed 9”, and “Unnamed 11” for rounds 1, 2, and 3, respectfully:

event_clean <- event_clean %>%
  mutate(
    avg_rating = rowMeans(
      select(., `Unnamed: 7`, `Unnamed: 9`, `Unnamed: 11`),
      na.rm = TRUE
    )
  )

Next we can create a basic scatter plot comparing the round ratings to the money the players earned. I have also added code to add a progression line to gauge our results:

ggplot(event_clean, aes(x = avg_rating, y = Prize)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkgreen") +
  labs(
    title = "Average Round Rating vs. Prize Money",
    x = "Average Round Rating",
    y = "Prize ($)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 10496 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 10496 rows containing missing values or values outside the scale range
(`geom_point()`).

As your can see above, many of our data points were zero for prize, altering our progression line quite a bit. Let us add some code that ignores the values that have a prize of 0 to see how our plot changes:

event_clean %>%
  filter(
    !is.na(avg_rating),   # must have a valid average rating
    Prize > 0             # only players who earned money
  ) %>%
  ggplot(aes(x = avg_rating, y = Prize)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "orange") +
  labs(
    title = "Average Round Rating vs. Prize (Only Players Who Earned Money)",
    x = "Average Round Rating",
    y = "Prize ($)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Our individual data points are hard to visualize with so much overlap, but we can definitely see there is a general positive correlation between average round rating and prize earnings so far. Let’s make the graph look even better by taking the log of our Prize values to spread the values out more evenly among the x axis average ratings, as well as add an alpha level to make the points semi-transparent to make any overlap more visible:

event_clean %>%
  filter(!is.na(avg_rating), Prize > 0) %>%
  ggplot(aes(x = avg_rating, y = Prize)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  scale_y_log10() +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(
    title = "Prize vs. Average Rating (Log Scale)",
    x = "Average Round Rating",
    y = "Prize (log scale)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Another way to visualize our data would by using a binned average for our variables for Prize and Average Rating which categorizes our different individual points into groups and plots those groups. This can help us see the general trend of our regression line more easily and thus also our relationship between average round rating and prize money earned by the players:

event_clean %>%
  filter(!is.na(avg_rating), Prize > 0) %>%
  mutate(rating_bin = cut(avg_rating, breaks = 10)) %>%
  group_by(rating_bin) %>%
  summarize(
    mean_rating = mean(avg_rating),
    mean_prize = mean(Prize)
  ) %>%
  ggplot(aes(x = mean_rating, y = mean_prize)) +
  geom_point(size = 3, color = "darkgreen") +
  geom_line(color = "darkgreen") +
  labs(
    title = "Binned Average Prize vs. Average Rating",
    x = "Average Rating (binned)",
    y = "Average Prize ($)"
  ) +
  theme_minimal()

Summary of Plots

Supporting my hypothesis that round rating is positively related to the amount of prize money earned by disc golf players, we can see from the various graphs that there is certainly a strong positive correlation from our regression line. The higher the average ratings of the players, the higher the prize money earned. An interesting thing to note is that the graph exponentially increases as the average rating goes up. On average, players with around a 950-1025 average rating received less than 1000 dollars, but past 1025 and into the 1050 and 1075 range, players received upwards of 2,500 dollars and higher. This makes sense according to how tournaments are structured, with the winning players recieving a considerably higher sum compared to even second and third place winners, with quite drastically decreasing earnings following the further down the list you go. Just by looking at the data set in Excel, we can see this right away by observing the first place winner of the Las Vagas Challenge getting 5,500 dollars, and the last player to cash out, which was the 53rd player, only got 400 dollars. So, in conclusion, round ratings definitely correlate to player earnings in disc golf tournaments, with a general positive, even exponential, trend as players’ average rating increases.