Identify a topic of interest and give your project a name/title.

The topic that my partner, PJ Casey-Leonard, and I chose to focus on is golf statistics, specifically “Strokes Gained”. The project name that we have come up with is “Strokes Gained Analysis”.

Phrase 3-5 research questions you would like to explore.

My partner and I each came up with our own questions and will discuss later which ones we want to focus on.

1. In which area of the game (off the tee, approach, around the green, or putting) do strokes gained affect a player's final standing the most? Prediction: Putting
2. Does a larger tournament purse result in less variation in strokes gained? Hypothesis: Yes, as players might take fewer risks when they are competing for more money
3. As driving distance has become a larger factor on the PGA Tour in recent years, has that had a positive impact on strokes gained off the tee and a negative impact on strokes gained in other areas? Hypothesis: There will be an improvement in strokes gained over time due to new technology
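One possible starting point for question 1 is to compare how strongly each strokes gained category is associated with finishing position. The sketch below is only an illustration under assumptions: the `toy` tibble and its values are made up, the column names mirror the dataset's `pos` and `sg_*` columns, and Spearman rank correlation is just one reasonable choice of measure, not a settled method for the project.

```r
library(dplyr)

# Toy stand-in for the real data; all values are made up for illustration.
toy <- tibble(
  pos     = c(1, 5, 18, 32, 60),
  sg_putt = c(2.0, 1.1, 0.4, 0.2, -0.9),
  sg_app  = c(1.5, 1.2, 0.3, -0.1, -0.6),
  sg_ott  = c(0.3, 0.6, 0.2, 0.9, -0.2)
)

# Rank correlation of each strokes gained column with finishing position.
# A value closer to -1 means that more strokes gained goes with a better
# (numerically lower) finish.
sg_cor <- toy %>%
  summarise(across(
    starts_with("sg_"),
    ~ cor(.x, pos, method = "spearman", use = "complete.obs")
  ))

sg_cor
```

On the real data, rows without a numeric position (missed cuts) would need to be handled before this comparison is meaningful.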

List the data sources that you find relevant to your research questions.

Extract one or more relevant datasets associated with your research questions: import the downloaded dataset(s), extract from APIs, ethically scrape the web, etc.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
golf_data <- read_csv("Desktop/BI4310/Code/ASA All PGA Raw Data - Tourn Level.csv")
## Rows: 36864 Columns: 37
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): Player_initial_last, player, tournament name, course, Finish
## dbl  (28): tournament id, player id, hole_par, strokes, hole_DKP, hole_FDP, ...
## lgl   (3): Unnamed: 2, Unnamed: 3, Unnamed: 4
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data come from a Kaggle dataset that provides tournament-level PGA Tour information for the 2015 to 2022 seasons, with a focus on sportsbook values as well as strokes gained values.

Describe your data extracted, statistically and/or visually.

str(golf_data)
spc_tbl_ [36,864 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Player_initial_last: chr [1:36864] "A. Ancer" "A. Hadwin" "A. Lahiri" "A. Long" ...
 $ tournament id      : num [1:36864] 4.01e+08 4.01e+08 4.01e+08 4.01e+08 4.01e+08 ...
 $ player id          : num [1:36864] 9261 5548 4989 6015 3832 ...
 $ hole_par           : num [1:36864] 288 288 144 144 144 144 288 288 288 144 ...
 $ strokes            : num [1:36864] 289 286 147 151 148 151 287 287 299 151 ...
 $ hole_DKP           : num [1:36864] 60 72.5 21.5 20.5 23.5 19.5 63 59.5 48.5 18 ...
 $ hole_FDP           : num [1:36864] 51.1 61.5 17.4 13.6 18.1 12 55.7 54 34.7 10.9 ...
 $ hole_SDP           : num [1:36864] 56 61 27 17 23 19 58 59 48 20 ...
 $ streak_DKP         : num [1:36864] 3 8 0 0 0 0 3 0 0 0 ...
 $ streak_FDP         : num [1:36864] 7.6 13 0 0.4 1.2 6 7.2 7.8 5.4 0.6 ...
 $ streak_SDP         : num [1:36864] 3 3 0 0 0 0 3 0 0 0 ...
 $ n_rounds           : num [1:36864] 4 4 2 2 2 2 4 4 4 2 ...
 $ made_cut           : num [1:36864] 1 1 0 0 0 0 1 1 1 0 ...
 $ pos                : num [1:36864] 32 18 NA NA NA NA 26 26 67 NA ...
 $ finish_DKP         : num [1:36864] 2 5 0 0 0 0 3 3 0 0 ...
 $ finish_FDP         : num [1:36864] 1 4 0 0 0 0 2 2 0 0 ...
 $ finish_SDP         : num [1:36864] 0 2 0 0 0 0 0 0 0 0 ...
 $ total_DKP          : num [1:36864] 65 85.5 21.5 20.5 23.5 19.5 69 62.5 48.5 18 ...
 $ total_FDP          : num [1:36864] 59.7 78.5 17.4 14 19.3 18 64.9 63.8 40.1 11.5 ...
 $ total_SDP          : num [1:36864] 59 66 27 17 23 19 61 59 48 20 ...
 $ player             : chr [1:36864] "Abraham Ancer" "Adam Hadwin" "Anirban Lahiri" "Adam Long" ...
 $ Unnamed: 2         : logi [1:36864] NA NA NA NA NA NA ...
 $ Unnamed: 3         : logi [1:36864] NA NA NA NA NA NA ...
 $ Unnamed: 4         : logi [1:36864] NA NA NA NA NA NA ...
 $ tournament name    : chr [1:36864] "The Memorial Tournament pres. by Nationwide" "The Memorial Tournament pres. by Nationwide" "The Memorial Tournament pres. by Nationwide" "The Memorial Tournament pres. by Nationwide" ...
 $ course             : chr [1:36864] "Muirfield Village Golf Club - Dublin, OH" "Muirfield Village Golf Club - Dublin, OH" "Muirfield Village Golf Club - Dublin, OH" "Muirfield Village Golf Club - Dublin, OH" ...
 $ date               : Date[1:36864], format: "2022-06-05" "2022-06-05" ...
 $ purse              : num [1:36864] 12 12 12 12 12 12 12 12 12 12 ...
 $ season             : num [1:36864] 2022 2022 2022 2022 2022 ...
 $ no_cut             : num [1:36864] 0 0 0 0 0 0 0 0 0 0 ...
 $ Finish             : chr [1:36864] "T32" "T18" "CUT" "CUT" ...
 $ sg_putt            : num [1:36864] 0.2 0.36 -0.56 -1.46 0.53 -0.97 2.05 -0.96 -0.82 -1.89 ...
 $ sg_arg             : num [1:36864] -0.13 0.75 0.74 -1.86 -0.36 0.14 0.74 -0.01 -1.79 -0.71 ...
 $ sg_app             : num [1:36864] -0.08 0.31 -1.09 -0.02 -1.39 -2.02 -1.32 1.84 2 0.71 ...
 $ sg_ott             : num [1:36864] 0.86 0.18 0.37 0.8 0.19 0.31 -0.12 0.48 -1.04 -0.65 ...
 $ sg_t2g             : num [1:36864] 0.65 1.24 0.02 -1.08 -1.56 -1.56 -0.7 2.31 -0.83 -0.65 ...
 $ sg_total           : num [1:36864] 0.85 1.6 -0.54 -2.54 -1.04 -2.54 1.35 1.35 -1.65 -2.54 ...
 - attr(*, "spec")=
  .. cols(
  ..   Player_initial_last = col_character(),
  ..   `tournament id` = col_double(),
  ..   `player id` = col_double(),
  ..   hole_par = col_double(),
  ..   strokes = col_double(),
  ..   hole_DKP = col_double(),
  ..   hole_FDP = col_double(),
  ..   hole_SDP = col_double(),
  ..   streak_DKP = col_double(),
  ..   streak_FDP = col_double(),
  ..   streak_SDP = col_double(),
  ..   n_rounds = col_double(),
  ..   made_cut = col_double(),
  ..   pos = col_double(),
  ..   finish_DKP = col_double(),
  ..   finish_FDP = col_double(),
  ..   finish_SDP = col_double(),
  ..   total_DKP = col_double(),
  ..   total_FDP = col_double(),
  ..   total_SDP = col_double(),
  ..   player = col_character(),
  ..   `Unnamed: 2` = col_logical(),
  ..   `Unnamed: 3` = col_logical(),
  ..   `Unnamed: 4` = col_logical(),
  ..   `tournament name` = col_character(),
  ..   course = col_character(),
  ..   date = col_date(format = ""),
  ..   purse = col_double(),
  ..   season = col_double(),
  ..   no_cut = col_double(),
  ..   Finish = col_character(),
  ..   sg_putt = col_double(),
  ..   sg_arg = col_double(),
  ..   sg_app = col_double(),
  ..   sg_ott = col_double(),
  ..   sg_t2g = col_double(),
  ..   sg_total = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
summary(golf_data)
 Player_initial_last tournament id         player id          hole_par    
 Length:36864        Min.   :     2230   Min.   :      5   Min.   : 70.0  
 Class :character    1st Qu.:     2696   1st Qu.:   1170   1st Qu.:143.0  
 Mode  :character    Median :401056503   Median :   3793   Median :280.0  
                     Mean   :233180667   Mean   :  79790   Mean   :225.5  
                     3rd Qu.:401219498   3rd Qu.:   6151   3rd Qu.:286.0  
                     Max.   :401366873   Max.   :4845309   Max.   :292.0  
                                                                          
    strokes         hole_DKP         hole_FDP         hole_SDP     
 Min.   : 66.0   Min.   : -2.50   Min.   :-21.40   Min.   :-11.00  
 1st Qu.:146.0   1st Qu.: 27.00   1st Qu.: 22.60   1st Qu.: 28.00  
 Median :272.0   Median : 53.50   Median : 46.10   Median : 55.00  
 Mean   :224.1   Mean   : 50.13   Mean   : 44.38   Mean   : 49.32  
 3rd Qu.:281.0   3rd Qu.: 69.00   3rd Qu.: 64.00   3rd Qu.: 69.00  
 Max.   :325.0   Max.   :174.00   Max.   :134.70   Max.   :107.00  
                                                                   
   streak_DKP       streak_FDP       streak_SDP        n_rounds    
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :1.000  
 1st Qu.: 0.000   1st Qu.: 0.800   1st Qu.: 0.000   1st Qu.:2.000  
 Median : 0.000   Median : 6.400   Median : 0.000   Median :4.000  
 Mean   : 1.764   Mean   : 7.687   Mean   : 1.683   Mean   :3.175  
 3rd Qu.: 3.000   3rd Qu.:12.400   3rd Qu.: 3.000   3rd Qu.:4.000  
 Max.   :23.000   Max.   :43.600   Max.   :22.000   Max.   :4.000  
                                                                   
    made_cut           pos           finish_DKP       finish_FDP    
 Min.   :0.0000   Min.   :  1.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:0.0000   1st Qu.: 15.00   1st Qu.: 0.000   1st Qu.: 0.000  
 Median :1.0000   Median : 32.00   Median : 0.000   Median : 0.000  
 Mean   :0.6059   Mean   : 34.17   Mean   : 2.489   Mean   : 2.134  
 3rd Qu.:1.0000   3rd Qu.: 51.00   3rd Qu.: 3.000   3rd Qu.: 2.000  
 Max.   :1.0000   Max.   :999.00   Max.   :30.000   Max.   :30.000  
                  NA's   :15547                                     
   finish_SDP       total_DKP        total_FDP        total_SDP     
 Min.   : 0.000   Min.   : -2.50   Min.   :-21.40   Min.   :-11.00  
 1st Qu.: 0.000   1st Qu.: 27.50   1st Qu.: 24.70   1st Qu.: 28.00  
 Median : 0.000   Median : 55.50   Median : 52.15   Median : 56.00  
 Mean   : 1.171   Mean   : 54.38   Mean   : 54.20   Mean   : 52.18  
 3rd Qu.: 0.000   3rd Qu.: 75.00   3rd Qu.: 78.50   3rd Qu.: 72.00  
 Max.   :15.000   Max.   :205.50   Max.   :202.60   Max.   :141.00  
                                                                    
    player          Unnamed: 2     Unnamed: 3     Unnamed: 4    
 Length:36864       Mode:logical   Mode:logical   Mode:logical  
 Class :character   NA's:36864     NA's:36864     NA's:36864    
 Mode  :character                                               
                                                                
                                                                
                                                                
                                                                
 tournament name       course               date                purse      
 Length:36864       Length:36864       Min.   :2014-10-12   Min.   : 3.00  
 Class :character   Class :character   1st Qu.:2017-01-15   1st Qu.: 6.40  
 Mode  :character   Mode  :character   Median :2018-11-04   Median : 7.10  
                                       Mean   :2018-10-10   Mean   : 7.53  
                                       3rd Qu.:2020-09-13   3rd Qu.: 8.70  
                                       Max.   :2022-06-05   Max.   :20.00  
                                                                           
     season         no_cut           Finish             sg_putt      
 Min.   :2015   Min.   :0.00000   Length:36864       Min.   :-5.990  
 1st Qu.:2017   1st Qu.:0.00000   Class :character   1st Qu.:-0.770  
 Median :2019   Median :0.00000   Mode  :character   Median :-0.040  
 Mean   :2019   Mean   :0.06529                      Mean   :-0.121  
 3rd Qu.:2021   3rd Qu.:0.00000                      3rd Qu.: 0.630  
 Max.   :2022   Max.   :1.00000                      Max.   : 4.430  
                                                     NA's   :7684    
     sg_arg           sg_app           sg_ott           sg_t2g       
 Min.   :-6.430   Min.   :-9.250   Min.   :-7.740   Min.   :-13.950  
 1st Qu.:-0.450   1st Qu.:-0.740   1st Qu.:-0.450   1st Qu.: -1.080  
 Median : 0.000   Median : 0.000   Median : 0.050   Median : -0.010  
 Mean   :-0.041   Mean   :-0.102   Mean   :-0.046   Mean   : -0.188  
 3rd Qu.: 0.420   3rd Qu.: 0.640   3rd Qu.: 0.480   3rd Qu.:  0.920  
 Max.   : 3.170   Max.   : 4.670   Max.   : 2.770   Max.   :  6.300  
 NA's   :7684     NA's   :7684     NA's   :7684     NA's   :7684     
    sg_total      
 Min.   :-13.670  
 1st Qu.: -1.370  
 Median : -0.160  
 Mean   : -0.305  
 3rd Qu.:  1.060  
 Max.   :  8.520  
 NA's   :7683     
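The summary above reports several thousand missing values in each strokes gained column. A minimal sketch of how those gaps could be counted and broken down by season follows. The `toy` tibble and its values are made up so that the block runs on its own; only the column names (`season`, `sg_putt`, `sg_ott`) come from the dataset.

```r
library(dplyr)

# Toy stand-in for golf_data; values are made up for illustration.
toy <- tibble(
  season  = c(2015, 2015, 2022, 2022),
  sg_putt = c(NA, 0.20, 0.36, -0.56),
  sg_ott  = c(NA, 0.86, 0.18, 0.37)
)

# Number of missing values in every strokes gained column
na_counts <- toy %>%
  summarise(across(starts_with("sg_"), ~ sum(is.na(.x))))

# Share of rows in each season that have putting data available
sg_by_season <- toy %>%
  group_by(season) %>%
  summarise(share_with_sg = mean(!is.na(sg_putt)), .groups = "drop")

na_counts
sg_by_season
```

Running the same two summaries on `golf_data` would show whether the missing strokes gained values are concentrated in particular seasons or spread evenly.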

Perform necessary data cleaning and manipulation, especially if the raw data contain special values or are not directly in a format that can answer your research questions.

golf_data <- golf_data %>% select(-c(6:11,15:20,22:24,30))

Removes the columns that provide sportsbook data (the `_DKP`, `_FDP`, and `_SDP` columns), as well as unnecessary columns: the three empty `Unnamed` columns and `no_cut`.
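The positional indices in the `select()` call above depend on the exact column order of the CSV. As an alternative sketch, the same columns could be dropped by name with tidyselect helpers, which is easier to read and keeps working if the column order changes. The `toy` tibble below is a made-up stand-in with a handful of the real column names so the block runs on its own.

```r
library(dplyr)

# Toy tibble using a few of the real column names; values are made up.
toy <- tibble(
  player       = c("A", "B"),
  hole_DKP     = c(60, 72.5),
  streak_FDP   = c(7.6, 13),
  total_SDP    = c(59, 66),
  `Unnamed: 2` = c(NA, NA),
  no_cut       = c(0, 0),
  sg_total     = c(0.85, 1.6)
)

# Drop the _DKP/_FDP/_SDP columns, the empty "Unnamed" columns, and no_cut
toy_trimmed <- toy %>%
  select(-c(
    ends_with(c("_DKP", "_FDP", "_SDP")),
    starts_with("Unnamed"),
    no_cut
  ))

names(toy_trimmed)
```

With the column layout shown by `str(golf_data)`, these helpers should pick out the same 16 columns as the index vector `c(6:11,15:20,22:24,30)`.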

List future data preparation work needed if any.

We will have to decide whether to apply a cutoff to the strokes gained data to remove outliers. We will also have to convert the finishing position (`Finish`), which is currently stored as a character column because players who were cut from a tournament are listed as “CUT” and players tied for a position are labeled with values such as “T21”.
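A minimal sketch of the `Finish` conversion described above, under stated assumptions: the `toy` tibble is made up, the new column names `tied` and `finish_num` are hypothetical, and labels other than plain numbers, “T”-prefixed ties, and “CUT” are not handled specially (any label without a number simply becomes `NA`).

```r
library(dplyr)

# Toy stand-in for the Finish column; values are made up for illustration.
toy <- tibble(
  Finish = c("1", "T18", "T32", "CUT")
)

toy_clean <- toy %>%
  mutate(
    # TRUE when the label is a "T" followed only by digits, e.g. "T18"
    tied = grepl("^T[0-9]+$", Finish),
    # Strip a leading "T" and convert to integer; non-numeric labels
    # such as "CUT" become NA (the conversion warning is suppressed)
    finish_num = suppressWarnings(as.integer(sub("^T", "", Finish)))
  )

toy_clean
```

Keeping `tied` as a separate flag preserves the information that would otherwise be lost when “T18” is reduced to 18.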