plotting means is really easy!
first, let’s look at some data:
golf<- fread("/Users/claire/Desktop/ASA All PGA Raw Data - Tourn Level.csv", header=T, data.table=F)
head(golf)
## Player_initial_last tournament id player id hole_par strokes hole_DKP
## 1 A. Hadwin 401353275 5548 284 283 57.0
## 2 A. Noren 401353275 3832 284 285 47.0
## 3 A. Putnam 401353275 5502 284 279 65.0
## 4 A. Scott 401353275 388 284 274 77.5
## 5 A. Smalley 401353275 9484 284 283 59.5
## 6 A. Wise 401353275 10577 284 277 70.5
## hole_FDP hole_SDP streak_DKP streak_FDP streak_SDP n_rounds made_cut pos
## 1 51.3 62 0 1.8 0 4 1 44
## 2 43.3 62 3 1.8 3 4 1 52
## 3 61.1 66 3 7.6 3 4 1 23
## 4 72.7 75 0 5.6 0 4 1 5
## 5 54.0 59 3 11.8 3 4 1 44
## 6 64.4 73 3 11.2 4 4 1 15
## finish_DKP finish_FDP finish_SDP total_DKP total_FDP total_SDP
## 1 1 0 0 58.0 53.1 62
## 2 0 0 0 50.0 45.1 65
## 3 4 3 1 72.0 71.7 70
## 4 14 14 9 91.5 92.3 84
## 5 1 0 0 63.5 65.8 62
## 6 6 5 3 79.5 80.6 80
## player Unnamed: 2 Unnamed: 3 Unnamed: 4 tournament name
## 1 Adam Hadwin NA NA NA BMW Championship
## 2 Alexander Noren NA NA NA BMW Championship
## 3 Andrew Putnam NA NA NA BMW Championship
## 4 Adam Scott NA NA NA BMW Championship
## 5 Alex Smalley NA NA NA BMW Championship
## 6 Aaron Wise NA NA NA BMW Championship
## course date purse season no_cut
## 1 Wilmington Country Club - Wilmington, DE 2022-08-21 15 2022 1
## 2 Wilmington Country Club - Wilmington, DE 2022-08-21 15 2022 1
## 3 Wilmington Country Club - Wilmington, DE 2022-08-21 15 2022 1
## 4 Wilmington Country Club - Wilmington, DE 2022-08-21 15 2022 1
## 5 Wilmington Country Club - Wilmington, DE 2022-08-21 15 2022 1
## 6 Wilmington Country Club - Wilmington, DE 2022-08-21 15 2022 1
## Finish sg_putt sg_arg sg_app sg_ott sg_t2g sg_total
## 1 T44 -0.56 0.23 -0.03 -0.13 0.06 -0.50
## 2 T52 0.99 -0.09 -1.63 -0.28 -1.99 -1.00
## 3 T23 1.30 0.49 -0.27 -1.01 -0.80 0.50
## 4 T5 0.70 0.28 0.27 0.50 1.05 1.75
## 5 T44 -0.23 -1.08 1.31 -0.49 -0.27 -0.50
## 6 T15 0.34 -0.74 0.84 0.56 0.66 1.00
# 495 unique golfers
length(unique(golf$Player_initial_last))
## [1] 495
this file contains data on every golf tournament on the PGA tour for the last 4 seasons (2019-2022). since we want to look at means across 2 categories of groups for this example, let’s compare strokes gained (sg_total; a measure of how well someone is playing, factoring in hole lenght, shot length, putting and more) across years and finishing categories.
the variable “finish” contains each player’s place for each tournament. a tournament in the PGA consists of 72 holes (4 days of 18 holes), but after the first 36 holes (2 days), anyone who is not in the top 70 gets “cut”, or elimanted and therefore do not play the remaining holes. for this, I’m interested in looking at strokes gained for players who finished first vs those who made the cut, but did not finish first, vs those who did not make the cut at all.
table(golf$Finish)
##
## 1 10 11 12 13 14 15 17 18 19 2 20 21 22 23 24
## 136 7 5 3 6 4 3 1 1 2 97 1 1 2 2 1
## 25 28 29 3 30 31 34 37 38 39 4 43 46 47 48 5
## 2 1 3 51 3 4 1 3 1 1 30 2 1 1 2 25
## 50 51 52 53 54 55 56 57 58 59 6 60 61 62 63 64
## 3 4 1 1 4 2 1 6 3 5 16 5 7 5 17 17
## 65 66 67 68 69 7 70 71 72 73 74 75 76 77 78 79
## 15 29 30 28 24 15 23 16 22 21 15 10 14 15 8 6
## 8 80 81 82 83 84 86 9 CUT DQ MDF T10 T11 T12 T13 T14
## 9 1 2 2 1 1 1 9 7570 12 117 138 122 139 181 162
## T15 T16 T17 T18 T19 T2 T20 T21 T22 T23 T24 T25 T26 T27 T28 T29
## 143 154 159 172 96 103 128 170 105 163 105 146 157 117 141 138
## T3 T30 T31 T32 T33 T34 T35 T36 T37 T38 T39 T4 T40 T41 T42 T43
## 131 157 111 136 134 87 161 145 155 116 124 137 99 125 149 149
## T44 T45 T46 T47 T48 T49 T5 T50 T51 T52 T53 T54 T55 T56 T57 T58
## 83 91 127 132 148 95 122 86 140 116 136 108 106 105 130 131
## T59 T6 T60 T61 T62 T63 T64 T65 T66 T67 T68 T69 T7 T70 T71 T72
## 113 140 104 103 68 74 113 82 60 78 56 57 140 49 46 35
## T73 T74 T75 T76 T77 T78 T79 T8 T80 T81 T83 T9 W/D WD
## 13 19 14 9 8 10 2 152 2 5 6 140 1 143
# these data are kind of messy so first I will convert anyone who was CUT to be a 999. here, I am not going to look at those who were DQ'd, or withdrawn
golf$finish2<- ifelse(golf$Finish=="CUT", 999, ifelse(golf$Finish=="DQ" | golf$Finish=="MDF" | golf$Finish=="WD" | golf$Finish=="W/D", NA, golf$Finish))
# that code creates a new variable (finish2) that counts CUT as 999, and anyone we don't care about as NA. now we can get rid of the "T" (stands for a tie) and convert the variable to numeric
golf$finish2<- gsub(pattern = 'T', x = golf$finish2, replacement = '', fixed = T)
golf$finish2<- as.numeric(golf$finish2)
golf$finish_cat<- ifelse(golf$finish2==1, "first", ifelse(golf$finish2==999, "cut", ifelse(golf$finish2>1, "not cut", NA)))
table(golf$finish_cat) # now we have the categorical variable to look at!
##
## cut first not cut
## 7570 136 9290
we can now run the model to see if finish and year predict strokes gained.
create codes first:
golf_lin = linear code for season, testing whether strokes gained have improved since 2019 golf_quad = quadradic code for season, testing whether maybe (?) during covid seasons performance was worse or impacted somehow
# season
golf$year_lin<- ifelse(golf$season=="2019", -3, ifelse(golf$season=="2020", -1, ifelse(golf$season=="2021", 1, ifelse(golf$season=="2022", 3, NA))))
golf$year_quad<- ifelse(golf$season=="2019", 1, ifelse(golf$season=="2020", -1, ifelse(golf$season=="2021", -1, ifelse(golf$season=="2022", 1, NA))))
# finish
golf$notcut_first<- ifelse(golf$finish_cat=="first", 1, ifelse(golf$finish_cat=="not cut", -1, ifelse(golf$finish_cat=="cut", 0, NA)))
golf$cut_madecut<- ifelse(golf$finish_cat=="first", 1, ifelse(golf$finish_cat=="not cut", 1, ifelse(golf$finish_cat=="cut", -2, NA)))
run model:
strokes_g<- lmer(sg_total~year_lin+year_quad+notcut_first+cut_madecut+(1|Player_initial_last), data=golf)
summary(strokes_g)
## Linear mixed model fit by REML ['lmerMod']
## Formula: sg_total ~ year_lin + year_quad + notcut_first + cut_madecut +
## (1 | Player_initial_last)
## Data: golf
##
## REML criterion at convergence: 61310.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -6.8287 -0.5429 0.1011 0.6740 5.4574
##
## Random effects:
## Groups Name Variance Std.Dev.
## Player_initial_last (Intercept) 0.2521 0.5021
## Residual 2.0736 1.4400
## Number of obs: 16996, groups: Player_initial_last, 476
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.644363 0.050091 12.864
## year_lin 0.005007 0.005217 0.960
## year_quad 0.020056 0.011612 1.727
## notcut_first 1.360547 0.062925 21.622
## cut_madecut 1.199756 0.021948 54.664
##
## Correlation of Fixed Effects:
## (Intr) yer_ln yer_qd ntct_f
## year_lin 0.002
## year_quad -0.012 0.004
## notcut_frst 0.818 -0.001 -0.002
## cut_madecut 0.793 0.010 -0.009 0.935
season does not seem to have an impact on strokes gained, which probably makes sense.
now, finally, let’s plot the means
# groups by season and finish and computes means (for some reason we get NAs for the NAs -- just get rid of)
means <- golf %>% group_by(season,finish_cat) %>%
summarise(mean_sg=mean(sg_total),
.groups = 'drop') %>%
na.omit() %>%
mutate(finish = fct_relevel(finish_cat), "cut", "not cut", "first") %>%
as.data.frame()
# plot!
ggplot(data=means, aes(x=season, y=mean_sg, fill=finish)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Strokes gained by finish") +
labs(y = "Strokes gained", x = "Season")
# could also plot this the other way
means$season<- as.factor(means$season)
ggplot(data=means, aes(x=finish, y=mean_sg, fill=season)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Strokes gained by season") +
labs(y = "Strokes gained", x = "Finish")
pretty easy and only uses that table of means! this would be more informative than just plotting the total strokes gained by year and finish such because there might be outliers or the metric/ scale doesn’t make much sense to us
if we don’t have raw data to work with, we can always create a matrix or data frame that contains our means and then apply the ggplot plot to it.
to create a matrix, use matrix() and specify your values then the number of rows and whether it is byrow or not. look at ?matrix for help, or play around with it!
matrix(c(7,6,9, 1,0,5),
nrow = 3, byrow = F, # numbers go by column
dimnames=list(c("freshman", "junior", "senior"),
c("M", "F")))
## M F
## freshman 7 1
## junior 6 0
## senior 9 5
matrix(c(7,6,9, 1,0,5),
nrow = 3, byrow = T, # numbers go by row, more confusing here
dimnames=list(c("freshman", "junior", "senior"),
c("M", "F")))
## M F
## freshman 7 6
## junior 9 1
## senior 0 5
matrix(c(0,6,9,
1,0,5, # sometimes it helps to write it out as you would see it
2,3,0),
nrow = 3, byrow = T,
dimnames=list(c("freshman", "junior", "senior"),
c("freshman", "junior", "senior")))
## freshman junior senior
## freshman 0 6 9
## junior 1 0 5
## senior 2 3 0
you can also just create a dataframe of means then plot as we did above:
data.frame(
value=c(1,3,1, 3,5,5),
gender=factor(
c("M", "M", "M",
"F", "F", "F"),
levels=c("M", "F")
),
year=factor(
c("freshman", "junior", "senior", "freshman", "junior", "senior"),
levels = c("freshman", "junior", "senior")
) )
## value gender year
## 1 1 M freshman
## 2 3 M junior
## 3 1 M senior
## 4 3 F freshman
## 5 5 F junior
## 6 5 F senior
or, plot means the from a matrix using base R:
mat<- matrix(c(1,6,9,
1,2,5, # sometimes it helps to write it out as you would see it
2,3,4),
nrow = 3, byrow = T,
dimnames=list(c("psychology", "biology", "engineer"),
c("freshman", "junior", "senior")))
barplot(mat, beside = T, legend.text = T,
ylab = "Value", ylim = c(0, 8),
main = "Value by year")