Identify a topic of interest and give your project a
name/title.
The topic that my partner, PJ Casey-Leonard, and I chose to focus on
is golf statistics, more specifically “Strokes Gained”.
The project name that we have come up with is “Strokes Gained
Analysis”.
Phrase 3-5 research questions you would like to explore.
My partner and I each came up with our own questions and will decide
later which ones we want to focus on.
1. In which area of the course do strokes gained affect a player's
final standing the most? Prediction: Putting.
2. Does a higher tournament purse result in less variation in
strokes gained? Hypothesis: Yes, because players might take fewer
risks when they are competing for more money.
3. As driver distance has become a larger factor recently in the
PGA, has that had a positive impact on strokes gained off the tee
and a negative impact on strokes gained in other areas? Hypothesis:
There will be an improvement in strokes gained over time due to new
technology.
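As a rough sketch of how the first two questions could be approached, the R code below uses a small made-up data frame (`toy`, invented purely for illustration and not taken from `golf_data`) with the same column names as the real data. It assumes a Spearman rank correlation against finishing position for question 1 and a per-purse standard deviation for question 2; both are only one possible choice of method.

```r
# Toy data, invented for illustration only; the real analysis would use golf_data.
toy <- data.frame(
  pos     = c(1, 5, 12, 20, 33, 47, 60),
  purse   = c(12, 12, 12, 7, 7, 7, 7),
  sg_putt = c(1.9, 1.1, 0.6, 0.2, -0.3, -0.8, -1.4),
  sg_arg  = c(0.4, 0.5, -0.1, 0.3, 0.0, -0.2, -0.3),
  sg_app  = c(1.2, 0.3, 0.9, -0.2, 0.1, -0.6, -0.9),
  sg_ott  = c(0.6, 0.7, 0.1, 0.4, -0.5, 0.2, -0.4)
)

sg_cols <- c("sg_putt", "sg_arg", "sg_app", "sg_ott")

# Question 1: rank correlation of each strokes gained category with position.
# A more negative value means gaining strokes there goes with a better
# (numerically lower) finish.
sg_vs_pos <- sapply(toy[sg_cols], function(x) {
  cor(x, toy$pos, method = "spearman", use = "complete.obs")
})
print(sort(sg_vs_pos))

# Question 2: spread of total strokes gained within each purse level.
# Here the total is simply the sum of the four toy categories.
toy$sg_total <- rowSums(toy[sg_cols])
sd_by_purse <- aggregate(sg_total ~ purse, data = toy, FUN = sd)
print(sd_by_purse)
```

On the real data, the same calls would need the rows with missing strokes gained values and missing positions handled first.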
List the data sources that you find that are relevant to your
research questions.
Describe your data extracted, statistically and/or visually.
str(golf_data)
spc_tbl_ [36,864 × 37] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Player_initial_last: chr [1:36864] "A. Ancer" "A. Hadwin" "A. Lahiri" "A. Long" ...
$ tournament id : num [1:36864] 4.01e+08 4.01e+08 4.01e+08 4.01e+08 4.01e+08 ...
$ player id : num [1:36864] 9261 5548 4989 6015 3832 ...
$ hole_par : num [1:36864] 288 288 144 144 144 144 288 288 288 144 ...
$ strokes : num [1:36864] 289 286 147 151 148 151 287 287 299 151 ...
$ hole_DKP : num [1:36864] 60 72.5 21.5 20.5 23.5 19.5 63 59.5 48.5 18 ...
$ hole_FDP : num [1:36864] 51.1 61.5 17.4 13.6 18.1 12 55.7 54 34.7 10.9 ...
$ hole_SDP : num [1:36864] 56 61 27 17 23 19 58 59 48 20 ...
$ streak_DKP : num [1:36864] 3 8 0 0 0 0 3 0 0 0 ...
$ streak_FDP : num [1:36864] 7.6 13 0 0.4 1.2 6 7.2 7.8 5.4 0.6 ...
$ streak_SDP : num [1:36864] 3 3 0 0 0 0 3 0 0 0 ...
$ n_rounds : num [1:36864] 4 4 2 2 2 2 4 4 4 2 ...
$ made_cut : num [1:36864] 1 1 0 0 0 0 1 1 1 0 ...
$ pos : num [1:36864] 32 18 NA NA NA NA 26 26 67 NA ...
$ finish_DKP : num [1:36864] 2 5 0 0 0 0 3 3 0 0 ...
$ finish_FDP : num [1:36864] 1 4 0 0 0 0 2 2 0 0 ...
$ finish_SDP : num [1:36864] 0 2 0 0 0 0 0 0 0 0 ...
$ total_DKP : num [1:36864] 65 85.5 21.5 20.5 23.5 19.5 69 62.5 48.5 18 ...
$ total_FDP : num [1:36864] 59.7 78.5 17.4 14 19.3 18 64.9 63.8 40.1 11.5 ...
$ total_SDP : num [1:36864] 59 66 27 17 23 19 61 59 48 20 ...
$ player : chr [1:36864] "Abraham Ancer" "Adam Hadwin" "Anirban Lahiri" "Adam Long" ...
$ Unnamed: 2 : logi [1:36864] NA NA NA NA NA NA ...
$ Unnamed: 3 : logi [1:36864] NA NA NA NA NA NA ...
$ Unnamed: 4 : logi [1:36864] NA NA NA NA NA NA ...
$ tournament name : chr [1:36864] "The Memorial Tournament pres. by Nationwide" "The Memorial Tournament pres. by Nationwide" "The Memorial Tournament pres. by Nationwide" "The Memorial Tournament pres. by Nationwide" ...
$ course : chr [1:36864] "Muirfield Village Golf Club - Dublin, OH" "Muirfield Village Golf Club - Dublin, OH" "Muirfield Village Golf Club - Dublin, OH" "Muirfield Village Golf Club - Dublin, OH" ...
$ date : Date[1:36864], format: "2022-06-05" "2022-06-05" ...
$ purse : num [1:36864] 12 12 12 12 12 12 12 12 12 12 ...
$ season : num [1:36864] 2022 2022 2022 2022 2022 ...
$ no_cut : num [1:36864] 0 0 0 0 0 0 0 0 0 0 ...
$ Finish : chr [1:36864] "T32" "T18" "CUT" "CUT" ...
$ sg_putt : num [1:36864] 0.2 0.36 -0.56 -1.46 0.53 -0.97 2.05 -0.96 -0.82 -1.89 ...
$ sg_arg : num [1:36864] -0.13 0.75 0.74 -1.86 -0.36 0.14 0.74 -0.01 -1.79 -0.71 ...
$ sg_app : num [1:36864] -0.08 0.31 -1.09 -0.02 -1.39 -2.02 -1.32 1.84 2 0.71 ...
$ sg_ott : num [1:36864] 0.86 0.18 0.37 0.8 0.19 0.31 -0.12 0.48 -1.04 -0.65 ...
$ sg_t2g : num [1:36864] 0.65 1.24 0.02 -1.08 -1.56 -1.56 -0.7 2.31 -0.83 -0.65 ...
$ sg_total : num [1:36864] 0.85 1.6 -0.54 -2.54 -1.04 -2.54 1.35 1.35 -1.65 -2.54 ...
- attr(*, "spec")=
.. cols(
.. Player_initial_last = col_character(),
.. `tournament id` = col_double(),
.. `player id` = col_double(),
.. hole_par = col_double(),
.. strokes = col_double(),
.. hole_DKP = col_double(),
.. hole_FDP = col_double(),
.. hole_SDP = col_double(),
.. streak_DKP = col_double(),
.. streak_FDP = col_double(),
.. streak_SDP = col_double(),
.. n_rounds = col_double(),
.. made_cut = col_double(),
.. pos = col_double(),
.. finish_DKP = col_double(),
.. finish_FDP = col_double(),
.. finish_SDP = col_double(),
.. total_DKP = col_double(),
.. total_FDP = col_double(),
.. total_SDP = col_double(),
.. player = col_character(),
.. `Unnamed: 2` = col_logical(),
.. `Unnamed: 3` = col_logical(),
.. `Unnamed: 4` = col_logical(),
.. `tournament name` = col_character(),
.. course = col_character(),
.. date = col_date(format = ""),
.. purse = col_double(),
.. season = col_double(),
.. no_cut = col_double(),
.. Finish = col_character(),
.. sg_putt = col_double(),
.. sg_arg = col_double(),
.. sg_app = col_double(),
.. sg_ott = col_double(),
.. sg_t2g = col_double(),
.. sg_total = col_double()
.. )
- attr(*, "problems")=<externalptr>
summary(golf_data)
Player_initial_last tournament id player id hole_par
Length:36864 Min. : 2230 Min. : 5 Min. : 70.0
Class :character 1st Qu.: 2696 1st Qu.: 1170 1st Qu.:143.0
Mode :character Median :401056503 Median : 3793 Median :280.0
Mean :233180667 Mean : 79790 Mean :225.5
3rd Qu.:401219498 3rd Qu.: 6151 3rd Qu.:286.0
Max. :401366873 Max. :4845309 Max. :292.0
strokes hole_DKP hole_FDP hole_SDP
Min. : 66.0 Min. : -2.50 Min. :-21.40 Min. :-11.00
1st Qu.:146.0 1st Qu.: 27.00 1st Qu.: 22.60 1st Qu.: 28.00
Median :272.0 Median : 53.50 Median : 46.10 Median : 55.00
Mean :224.1 Mean : 50.13 Mean : 44.38 Mean : 49.32
3rd Qu.:281.0 3rd Qu.: 69.00 3rd Qu.: 64.00 3rd Qu.: 69.00
Max. :325.0 Max. :174.00 Max. :134.70 Max. :107.00
streak_DKP streak_FDP streak_SDP n_rounds
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :1.000
1st Qu.: 0.000 1st Qu.: 0.800 1st Qu.: 0.000 1st Qu.:2.000
Median : 0.000 Median : 6.400 Median : 0.000 Median :4.000
Mean : 1.764 Mean : 7.687 Mean : 1.683 Mean :3.175
3rd Qu.: 3.000 3rd Qu.:12.400 3rd Qu.: 3.000 3rd Qu.:4.000
Max. :23.000 Max. :43.600 Max. :22.000 Max. :4.000
made_cut pos finish_DKP finish_FDP
Min. :0.0000 Min. : 1.00 Min. : 0.000 Min. : 0.000
1st Qu.:0.0000 1st Qu.: 15.00 1st Qu.: 0.000 1st Qu.: 0.000
Median :1.0000 Median : 32.00 Median : 0.000 Median : 0.000
Mean :0.6059 Mean : 34.17 Mean : 2.489 Mean : 2.134
3rd Qu.:1.0000 3rd Qu.: 51.00 3rd Qu.: 3.000 3rd Qu.: 2.000
Max. :1.0000 Max. :999.00 Max. :30.000 Max. :30.000
NA's :15547
finish_SDP total_DKP total_FDP total_SDP
Min. : 0.000 Min. : -2.50 Min. :-21.40 Min. :-11.00
1st Qu.: 0.000 1st Qu.: 27.50 1st Qu.: 24.70 1st Qu.: 28.00
Median : 0.000 Median : 55.50 Median : 52.15 Median : 56.00
Mean : 1.171 Mean : 54.38 Mean : 54.20 Mean : 52.18
3rd Qu.: 0.000 3rd Qu.: 75.00 3rd Qu.: 78.50 3rd Qu.: 72.00
Max. :15.000 Max. :205.50 Max. :202.60 Max. :141.00
player Unnamed: 2 Unnamed: 3 Unnamed: 4
Length:36864 Mode:logical Mode:logical Mode:logical
Class :character NA's:36864 NA's:36864 NA's:36864
Mode :character
tournament name course date purse
Length:36864 Length:36864 Min. :2014-10-12 Min. : 3.00
Class :character Class :character 1st Qu.:2017-01-15 1st Qu.: 6.40
Mode :character Mode :character Median :2018-11-04 Median : 7.10
Mean :2018-10-10 Mean : 7.53
3rd Qu.:2020-09-13 3rd Qu.: 8.70
Max. :2022-06-05 Max. :20.00
season no_cut Finish sg_putt
Min. :2015 Min. :0.00000 Length:36864 Min. :-5.990
1st Qu.:2017 1st Qu.:0.00000 Class :character 1st Qu.:-0.770
Median :2019 Median :0.00000 Mode :character Median :-0.040
Mean :2019 Mean :0.06529 Mean :-0.121
3rd Qu.:2021 3rd Qu.:0.00000 3rd Qu.: 0.630
Max. :2022 Max. :1.00000 Max. : 4.430
NA's :7684
sg_arg sg_app sg_ott sg_t2g
Min. :-6.430 Min. :-9.250 Min. :-7.740 Min. :-13.950
1st Qu.:-0.450 1st Qu.:-0.740 1st Qu.:-0.450 1st Qu.: -1.080
Median : 0.000 Median : 0.000 Median : 0.050 Median : -0.010
Mean :-0.041 Mean :-0.102 Mean :-0.046 Mean : -0.188
3rd Qu.: 0.420 3rd Qu.: 0.640 3rd Qu.: 0.480 3rd Qu.: 0.920
Max. : 3.170 Max. : 4.670 Max. : 2.770 Max. : 6.300
NA's :7684 NA's :7684 NA's :7684 NA's :7684
sg_total
Min. :-13.670
1st Qu.: -1.370
Median : -0.160
Mean : -0.305
3rd Qu.: 1.060
Max. : 8.520
NA's :7683
List future data preparation work needed if any.
We will have to determine whether we want to apply a cutoff to the
strokes gained data to get rid of any outliers. One other aspect
that we will have to change is the finishing position (the Finish
column), which is currently stored as a character column because
players who were cut from a tournament are listed as “CUT” and players
tied for a position are labeled with a “T” prefix, such as “T21”.
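A minimal sketch of both preparation steps is shown below. The helper names `parse_finish` and `is_outlier` are hypothetical, and the 1.5 × IQR fence is only one possible cutoff rule that we have not settled on. The example vectors are small illustrations rather than the full columns, and "WD" is an assumed label used only to show that any non-numeric value becomes `NA`.

```r
# Hypothetical helper: convert finish labels such as "T21" or "CUT" to integers.
parse_finish <- function(finish) {
  # Drop a leading "T" (ties); labels that are still non-numeric become NA.
  cleaned <- sub("^T", "", finish)
  suppressWarnings(as.integer(cleaned))
}

# Hypothetical helper: flag values outside the k * IQR fences (assumed rule).
is_outlier <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  # Missing values are never flagged as outliers.
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Small illustrative inputs, not the full columns.
finish_example <- c("T32", "T18", "CUT", "1", "WD")
print(parse_finish(finish_example))

sg_example <- c(-13.67, -1.37, -0.16, 0.2, 1.06, NA)
print(is_outlier(sg_example))
```

Applied to the real data, this would look something like `golf_data$finish_num <- parse_finish(golf_data$Finish)`, where `finish_num` is a new, hypothetical column name.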