plotting means is really easy!

first, let’s look at some data:

golf<- fread("/Users/claire/Desktop/ASA All PGA Raw Data - Tourn Level.csv", header=T, data.table=F)
head(golf)
##   Player_initial_last tournament id player id hole_par strokes hole_DKP
## 1           A. Hadwin     401353275      5548      284     283     57.0
## 2            A. Noren     401353275      3832      284     285     47.0
## 3           A. Putnam     401353275      5502      284     279     65.0
## 4            A. Scott     401353275       388      284     274     77.5
## 5          A. Smalley     401353275      9484      284     283     59.5
## 6             A. Wise     401353275     10577      284     277     70.5
##   hole_FDP hole_SDP streak_DKP streak_FDP streak_SDP n_rounds made_cut pos
## 1     51.3       62          0        1.8          0        4        1  44
## 2     43.3       62          3        1.8          3        4        1  52
## 3     61.1       66          3        7.6          3        4        1  23
## 4     72.7       75          0        5.6          0        4        1   5
## 5     54.0       59          3       11.8          3        4        1  44
## 6     64.4       73          3       11.2          4        4        1  15
##   finish_DKP finish_FDP finish_SDP total_DKP total_FDP total_SDP
## 1          1          0          0      58.0      53.1        62
## 2          0          0          0      50.0      45.1        65
## 3          4          3          1      72.0      71.7        70
## 4         14         14          9      91.5      92.3        84
## 5          1          0          0      63.5      65.8        62
## 6          6          5          3      79.5      80.6        80
##            player Unnamed: 2 Unnamed: 3 Unnamed: 4  tournament name
## 1     Adam Hadwin         NA         NA         NA BMW Championship
## 2 Alexander Noren         NA         NA         NA BMW Championship
## 3   Andrew Putnam         NA         NA         NA BMW Championship
## 4      Adam Scott         NA         NA         NA BMW Championship
## 5    Alex Smalley         NA         NA         NA BMW Championship
## 6      Aaron Wise         NA         NA         NA BMW Championship
##                                     course       date purse season no_cut
## 1 Wilmington Country Club - Wilmington, DE 2022-08-21    15   2022      1
## 2 Wilmington Country Club - Wilmington, DE 2022-08-21    15   2022      1
## 3 Wilmington Country Club - Wilmington, DE 2022-08-21    15   2022      1
## 4 Wilmington Country Club - Wilmington, DE 2022-08-21    15   2022      1
## 5 Wilmington Country Club - Wilmington, DE 2022-08-21    15   2022      1
## 6 Wilmington Country Club - Wilmington, DE 2022-08-21    15   2022      1
##   Finish sg_putt sg_arg sg_app sg_ott sg_t2g sg_total
## 1    T44   -0.56   0.23  -0.03  -0.13   0.06    -0.50
## 2    T52    0.99  -0.09  -1.63  -0.28  -1.99    -1.00
## 3    T23    1.30   0.49  -0.27  -1.01  -0.80     0.50
## 4     T5    0.70   0.28   0.27   0.50   1.05     1.75
## 5    T44   -0.23  -1.08   1.31  -0.49  -0.27    -0.50
## 6    T15    0.34  -0.74   0.84   0.56   0.66     1.00
# 495 unique golfers 
length(unique(golf$Player_initial_last))
## [1] 495

this file contains data on every golf tournament on the PGA tour for the last 4 seasons (2019-2022). since we want to look at means across 2 categories of groups for this example, let’s compare strokes gained (sg_total; a measure of how well someone is playing, factoring in hole lenght, shot length, putting and more) across years and finishing categories.

the variable “finish” contains each player’s place for each tournament. a tournament in the PGA consists of 72 holes (4 days of 18 holes), but after the first 36 holes (2 days), anyone who is not in the top 70 gets “cut”, or elimanted and therefore do not play the remaining holes. for this, I’m interested in looking at strokes gained for players who finished first vs those who made the cut, but did not finish first, vs those who did not make the cut at all.

table(golf$Finish)
## 
##    1   10   11   12   13   14   15   17   18   19    2   20   21   22   23   24 
##  136    7    5    3    6    4    3    1    1    2   97    1    1    2    2    1 
##   25   28   29    3   30   31   34   37   38   39    4   43   46   47   48    5 
##    2    1    3   51    3    4    1    3    1    1   30    2    1    1    2   25 
##   50   51   52   53   54   55   56   57   58   59    6   60   61   62   63   64 
##    3    4    1    1    4    2    1    6    3    5   16    5    7    5   17   17 
##   65   66   67   68   69    7   70   71   72   73   74   75   76   77   78   79 
##   15   29   30   28   24   15   23   16   22   21   15   10   14   15    8    6 
##    8   80   81   82   83   84   86    9  CUT   DQ  MDF  T10  T11  T12  T13  T14 
##    9    1    2    2    1    1    1    9 7570   12  117  138  122  139  181  162 
##  T15  T16  T17  T18  T19   T2  T20  T21  T22  T23  T24  T25  T26  T27  T28  T29 
##  143  154  159  172   96  103  128  170  105  163  105  146  157  117  141  138 
##   T3  T30  T31  T32  T33  T34  T35  T36  T37  T38  T39   T4  T40  T41  T42  T43 
##  131  157  111  136  134   87  161  145  155  116  124  137   99  125  149  149 
##  T44  T45  T46  T47  T48  T49   T5  T50  T51  T52  T53  T54  T55  T56  T57  T58 
##   83   91  127  132  148   95  122   86  140  116  136  108  106  105  130  131 
##  T59   T6  T60  T61  T62  T63  T64  T65  T66  T67  T68  T69   T7  T70  T71  T72 
##  113  140  104  103   68   74  113   82   60   78   56   57  140   49   46   35 
##  T73  T74  T75  T76  T77  T78  T79   T8  T80  T81  T83   T9  W/D   WD 
##   13   19   14    9    8   10    2  152    2    5    6  140    1  143
# these data are kind of messy so first I will convert anyone who was CUT to be a 999. here, I am not going to look at those who were DQ'd, or withdrawn

golf$finish2<- ifelse(golf$Finish=="CUT", 999, ifelse(golf$Finish=="DQ" | golf$Finish=="MDF" | golf$Finish=="WD" | golf$Finish=="W/D", NA, golf$Finish))

# that code creates a new variable (finish2) that counts CUT as 999, and anyone we don't care about as NA. now we can get rid of the "T" (stands for a tie) and convert the variable to numeric

golf$finish2<- gsub(pattern = 'T', x = golf$finish2, replacement = '', fixed = T)
golf$finish2<- as.numeric(golf$finish2)

golf$finish_cat<- ifelse(golf$finish2==1, "first", ifelse(golf$finish2==999, "cut", ifelse(golf$finish2>1, "not cut", NA)))

table(golf$finish_cat) # now we have the categorical variable to look at!
## 
##     cut   first not cut 
##    7570     136    9290

we can now run the model to see if finish and year predict strokes gained.

create codes first:

golf_lin = linear code for season, testing whether strokes gained have improved since 2019 golf_quad = quadradic code for season, testing whether maybe (?) during covid seasons performance was worse or impacted somehow

# season
golf$year_lin<- ifelse(golf$season=="2019", -3, ifelse(golf$season=="2020", -1, ifelse(golf$season=="2021", 1, ifelse(golf$season=="2022", 3, NA))))
golf$year_quad<- ifelse(golf$season=="2019", 1, ifelse(golf$season=="2020", -1, ifelse(golf$season=="2021", -1, ifelse(golf$season=="2022", 1, NA))))

# finish
golf$notcut_first<- ifelse(golf$finish_cat=="first", 1, ifelse(golf$finish_cat=="not cut", -1, ifelse(golf$finish_cat=="cut", 0, NA)))
golf$cut_madecut<- ifelse(golf$finish_cat=="first", 1, ifelse(golf$finish_cat=="not cut", 1, ifelse(golf$finish_cat=="cut", -2, NA)))

run model:

strokes_g<- lmer(sg_total~year_lin+year_quad+notcut_first+cut_madecut+(1|Player_initial_last), data=golf)
summary(strokes_g)
## Linear mixed model fit by REML ['lmerMod']
## Formula: sg_total ~ year_lin + year_quad + notcut_first + cut_madecut +  
##     (1 | Player_initial_last)
##    Data: golf
## 
## REML criterion at convergence: 61310.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -6.8287 -0.5429  0.1011  0.6740  5.4574 
## 
## Random effects:
##  Groups              Name        Variance Std.Dev.
##  Player_initial_last (Intercept) 0.2521   0.5021  
##  Residual                        2.0736   1.4400  
## Number of obs: 16996, groups:  Player_initial_last, 476
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  0.644363   0.050091  12.864
## year_lin     0.005007   0.005217   0.960
## year_quad    0.020056   0.011612   1.727
## notcut_first 1.360547   0.062925  21.622
## cut_madecut  1.199756   0.021948  54.664
## 
## Correlation of Fixed Effects:
##             (Intr) yer_ln yer_qd ntct_f
## year_lin     0.002                     
## year_quad   -0.012  0.004              
## notcut_frst  0.818 -0.001 -0.002       
## cut_madecut  0.793  0.010 -0.009  0.935

season does not seem to have an impact on strokes gained, which probably makes sense.

now, finally, let’s plot the means

# groups by season and finish and computes means (for some reason we get NAs for the NAs -- just get rid of)
means <- golf %>% group_by(season,finish_cat) %>% 
  summarise(mean_sg=mean(sg_total),
            .groups = 'drop') %>%
  na.omit() %>%
  mutate(finish = fct_relevel(finish_cat), "cut", "not cut", "first") %>%
  as.data.frame() 

# plot!

ggplot(data=means, aes(x=season, y=mean_sg, fill=finish)) +
  geom_bar(stat="identity", position=position_dodge()) +
  ggtitle("Strokes gained by finish") +
  labs(y = "Strokes gained", x = "Season")

# could also plot this the other way

means$season<- as.factor(means$season)

ggplot(data=means, aes(x=finish, y=mean_sg, fill=season)) +
  geom_bar(stat="identity", position=position_dodge()) +
  ggtitle("Strokes gained by season") +
  labs(y = "Strokes gained", x = "Finish")

pretty easy and only uses that table of means! this would be more informative than just plotting the total strokes gained by year and finish such because there might be outliers or the metric/ scale doesn’t make much sense to us

if we don’t have raw data to work with, we can always create a matrix or data frame that contains our means and then apply the ggplot plot to it.

to create a matrix, use matrix() and specify your values then the number of rows and whether it is byrow or not. look at ?matrix for help, or play around with it!

matrix(c(7,6,9, 1,0,5), 
               nrow = 3, byrow = F, # numbers go by column
               dimnames=list(c("freshman", "junior", "senior"),
                             c("M", "F")))
##          M F
## freshman 7 1
## junior   6 0
## senior   9 5
matrix(c(7,6,9, 1,0,5), 
               nrow = 3, byrow = T, # numbers go by row, more confusing here
               dimnames=list(c("freshman", "junior", "senior"),
                             c("M", "F")))
##          M F
## freshman 7 6
## junior   9 1
## senior   0 5
matrix(c(0,6,9, 
         1,0,5,  # sometimes it helps to write it out as you would see it
         2,3,0), 
               nrow = 3, byrow = T, 
               dimnames=list(c("freshman", "junior", "senior"),
                             c("freshman", "junior", "senior")))
##          freshman junior senior
## freshman        0      6      9
## junior          1      0      5
## senior          2      3      0

you can also just create a dataframe of means then plot as we did above:

data.frame(
  value=c(1,3,1, 3,5,5),
  gender=factor(
    c("M", "M", "M",
              "F", "F", "F"),
    levels=c("M", "F")
  ),
  year=factor(
    c("freshman", "junior", "senior", "freshman", "junior", "senior"),
    levels = c("freshman", "junior", "senior")
) )
##   value gender     year
## 1     1      M freshman
## 2     3      M   junior
## 3     1      M   senior
## 4     3      F freshman
## 5     5      F   junior
## 6     5      F   senior

or, plot means the from a matrix using base R:

mat<- matrix(c(1,6,9, 
         1,2,5,  # sometimes it helps to write it out as you would see it
         2,3,4), 
               nrow = 3, byrow = T, 
               dimnames=list(c("psychology", "biology", "engineer"),
                             c("freshman", "junior", "senior")))

barplot(mat, beside = T, legend.text = T,
        ylab = "Value", ylim = c(0, 8),
        main = "Value by year")