Introduction
Description
The PGA 2004 dataset contains performance statistics and winnings
data for 196 participants in the PGA (Proffesional Golfers Association)
during the 2004 season. The dataset provides information on several key
aspects of player performance and earnings such as:
Name : The name of each golfer. Age : The age of the player. Average
Drive : The average driving distance of the player in yards Driving
Accuracy: The percentage of drives that land on the fairway, Greens in
Regulation : The percentage of greens reached in regulation, meaning the
number of strokes used by the player to land the ball on the green is
two or more less than par. Average Number of Putts : The average number
of putts per round Save Percentage : The percentage of times a player
saves par or better from around the green Money Rank : The rank of the
player based on total earnings for the season. Number of Events : The
total number of events the player participated in during the season.
Total Winnings : The total amount of money earned by the player over the
course of the season. Average Winnings : The average amount of money
earned per event by the player.
Research Question
We want to investigate what variables affect the players winnings of
this given season. We will look at mainly the average drive and how it
correlates to winnings.
Data Preperation
url <- "https://users.stat.ufl.edu/~winner/data/pga2004.dat"
pga_data <- read.table(url, header = FALSE, fill = TRUE)
pga_data$V1 <- NULL
colnames(pga_data) <- c("Player_name", "Player_Age", "Average_Drive", "Driving_Accuracy", "Greens_on_reg",
"Average_number_putts", "Save_Percent", "Money_Rank",
"Number_events", "Total_Winnings", "Average_Winnings")
head(pga_data)
Player_name Player_Age Average_Drive Driving_Accuracy Greens_on_reg
1 Baddeley 23 288.0 53.1 58.2
2 Scott 24 295.4 57.7 65.6
3 Cejka 34 285.8 64.2 63.8
4 Stolz 34 297.9 59.0 63.0
5 Atwal 31 289.4 60.5 62.5
6 Oberholser 29 284.6 68.8 67.0
Average_number_putts Save_Percent Money_Rank Number_events Total_Winnings
1 1.767 50.9 123 27 632878
2 1.757 59.3 7 16 3724984
3 1.795 50.7 54 24 1313484
4 1.787 47.7 101 20 808373
5 1.766 43.5 146 30 486053
6 1.780 50.9 52 23 1355433
Average_Winnings
1 23440
2 232812
3 54729
4 40419
5 16202
6 58932
We take a broad look at the data and see there was two categorical
variables one of which having the first name of a player and then a
second one with the last name. We void the first name and just look at
the last names.
pander(head(pga_data))
Table continues below
Baddeley |
23 |
288 |
53.1 |
58.2 |
Scott |
24 |
295.4 |
57.7 |
65.6 |
Cejka |
34 |
285.8 |
64.2 |
63.8 |
Stolz |
34 |
297.9 |
59 |
63 |
Atwal |
31 |
289.4 |
60.5 |
62.5 |
Oberholser |
29 |
284.6 |
68.8 |
67 |
Table continues below
1.767 |
50.9 |
123 |
27 |
1.757 |
59.3 |
7 |
16 |
1.795 |
50.7 |
54 |
24 |
1.787 |
47.7 |
101 |
20 |
1.766 |
43.5 |
146 |
30 |
1.78 |
50.9 |
52 |
23 |
632878 |
23440 |
3724984 |
232812 |
1313484 |
54729 |
808373 |
40419 |
486053 |
16202 |
1355433 |
58932 |
cor(pga_data[, c("Average_Drive" , "Driving_Accuracy" , "Greens_on_reg" , "Average_number_putts" , "Save_Percent" , "Money_Rank" , "Number_events" , "Total_Winnings" , "Average_Winnings")], use = "pairwise.complete.obs")
Average_Drive Driving_Accuracy Greens_on_reg
Average_Drive 1.000000000 -0.99252269 0.002134259
Driving_Accuracy -0.992522688 1.00000000 0.046962093
Greens_on_reg 0.002134259 0.04696209 1.000000000
Average_number_putts -0.988803391 0.99389638 0.022156599
Save_Percent 0.867471569 -0.87937510 -0.056544353
Money_Rank 0.186561899 -0.20345865 -0.498016597
Number_events -0.699109599 0.68370009 0.062783323
Total_Winnings 0.226265605 -0.21129425 0.405497292
Average_Winnings -0.773878692 0.79872244 0.013823260
Average_number_putts Save_Percent Money_Rank
Average_Drive -0.9888034 0.86747157 0.18656190
Driving_Accuracy 0.9938964 -0.87937510 -0.20345865
Greens_on_reg 0.0221566 -0.05654435 -0.49801660
Average_number_putts 1.0000000 -0.88762172 -0.19710155
Save_Percent -0.8876217 1.00000000 0.06385765
Money_Rank -0.1971016 0.06385765 1.00000000
Number_events 0.6924957 -0.63927742 -0.11418017
Total_Winnings -0.2044137 0.27127779 -0.66596035
Average_Winnings 0.7948355 -0.68320935 -0.27537388
Number_events Total_Winnings Average_Winnings
Average_Drive -0.69910960 0.226265605 -0.773878692
Driving_Accuracy 0.68370009 -0.211294247 0.798722439
Greens_on_reg 0.06278332 0.405497292 0.013823260
Average_number_putts 0.69249567 -0.204413695 0.794835458
Save_Percent -0.63927742 0.271277789 -0.683209349
Money_Rank -0.11418017 -0.665960350 -0.275373883
Number_events 1.00000000 -0.180433459 0.190800029
Total_Winnings -0.18043346 1.000000000 0.002463485
Average_Winnings 0.19080003 0.002463485 1.000000000
We see that a lot of our variables have high correlation which can
cause multicollinearity. We must deal with this issue before doing our
analysis.
colSums(is.na(pga_data))
Player_name Player_Age Average_Drive
0 0 10
Driving_Accuracy Greens_on_reg Average_number_putts
10 10 10
Save_Percent Money_Rank Number_events
10 10 10
Total_Winnings Average_Winnings
10 10
pga_data_clean <- na.omit(pga_data)
For the sake of missing observations we will be removing it for our
data analysis because we have a plentiful amount of observations.
ggplot(pga_data_clean, aes(x = Average_Drive, y = Average_Winnings)) +
geom_point() +
geom_smooth(method = "lm", col = "green") +
labs(title = "Scatter Plot of Average Drive vs. Average Winnings", x = "Average Drive", y = "Average Winnings")
`geom_smooth()` using formula = 'y ~ x'

str(pga_data_clean)
'data.frame': 196 obs. of 11 variables:
$ Player_name : chr "Baddeley" "Scott" "Cejka" "Stolz" ...
$ Player_Age : chr "23" "24" "34" "34" ...
$ Average_Drive : num 288 295 286 298 289 ...
$ Driving_Accuracy : num 53.1 57.7 64.2 59 60.5 68.8 74.2 64.4 64.3 62.6 ...
$ Greens_on_reg : num 58.2 65.6 63.8 63 62.5 67 68.9 64.2 63.4 65.3 ...
$ Average_number_putts: num 1.77 1.76 1.79 1.79 1.77 ...
$ Save_Percent : num 50.9 59.3 50.7 47.7 43.5 50.9 40.4 53.8 42.2 47.7 ...
$ Money_Rank : num 123 7 54 101 146 52 80 75 141 83 ...
$ Number_events : int 27 16 24 20 30 23 23 27 20 15 ...
$ Total_Winnings : int 632878 3724984 1313484 808373 486053 1355433 962167 1036958 500818 943589 ...
$ Average_Winnings : int 23440 232812 54729 40419 16202 58932 41833 38406 25041 62906 ...
- attr(*, "na.action")= 'omit' Named int [1:10] 15 32 54 58 68 110 142 155 198 200
..- attr(*, "names")= chr [1:10] "15" "32" "54" "58" ...
remove_low_drives <- function(pga_data_clean) {
filtered_data <- pga_data_clean %>%
filter(Average_Drive >= 150)
return(filtered_data)
}
# Example usage
pga_data_filtered <- remove_low_drives(pga_data_clean)
ggplot(pga_data_filtered, aes(x = Average_Drive, y = Average_Winnings)) +
geom_point() +
geom_smooth(method = "lm", col = "green") +
labs(
title = "Scatter Plot of Average Drive vs. Average Winnings (Filtered)",
x = "Average Drive",
y = "Average Winnings"
) +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

We look at an initial graph of average winnigs vs average drive. It
looks like we have a good cluster of data but we have a couple of
outliers in our data that is scewing our line.
pga_data_filtered$Age_Above_30 <- pga_data_filtered$Player_Age > 30
head(pga_data_filtered$Age_Above_30)
[1] FALSE FALSE TRUE TRUE TRUE FALSE
pga_data_filtered <- pga_data_filtered %>% select(-Player_name, -Driving_Accuracy, -Total_Winnings, -Average_number_putts, -Money_Rank, -Player_Age)
kable(head(pga_data_filtered))
288.0 |
58.2 |
50.9 |
27 |
23440 |
FALSE |
295.4 |
65.6 |
59.3 |
16 |
232812 |
FALSE |
285.8 |
63.8 |
50.7 |
24 |
54729 |
TRUE |
297.9 |
63.0 |
47.7 |
20 |
40419 |
TRUE |
289.4 |
62.5 |
43.5 |
30 |
16202 |
TRUE |
284.6 |
67.0 |
50.9 |
23 |
58932 |
FALSE |
We remove player name and money rank because we are “ranking” them or
sorting them by +/- 30 age. We remove driving accuracy because we have a
variable with average drive because that is what we are making
assumptions on and also it could cause multicollinearity because driving
accuracy and average drive are highly correlated. The same can be said
for total winnings versus average winnings and also save percent and
number of putts. They are both correlated and can cause
multicollinearity.
Model Bulding
Linear Model
colnames(pga_data_filtered)
[1] "Average_Drive" "Greens_on_reg" "Save_Percent" "Number_events"
[5] "Average_Winnings" "Age_Above_30"
full.model = lm(Average_Winnings ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
Statistics of Regression Coefficients
(Intercept) |
-802913.292 |
147739.1904 |
-5.434667 |
0.0000002 |
Average_Drive |
1198.229 |
435.7615 |
2.749735 |
0.0065731 |
Greens_on_reg |
7278.954 |
1205.2877 |
6.039183 |
0.0000000 |
Save_Percent |
2486.926 |
627.6347 |
3.962378 |
0.0001069 |
Number_events |
-3508.708 |
723.2608 |
-4.851234 |
0.0000026 |
Age_Above_30TRUE |
3537.995 |
8630.5405 |
0.409939 |
0.6823382 |
We transform the age category into a true or false of payers with
above or below age 30. All of our values are significant besides
Age.
par(mfrow=c(2,2))
plot(full.model)

It looks like we have faults in our assumptions. Based on the QQ plot
there seems to be a non-normal distribution with several outliers. There
also seems to be non-constant variance.
vif(full.model)
Average_Drive Greens_on_reg Save_Percent Number_events Age_Above_30
1.136250 1.044126 1.047080 1.018950 1.072829
barplot(vif(full.model), main = "VIF Values", horiz = FALSE, col = "steelblue")

We see in the vif values above that all of them are below 5 which
shows little to no multicollineairty so we can continue.
Box Cox
Transformation
We perform this transformation because our data has non constant
variance
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
##
boxcox(Average_Winnings ~ log(Average_Drive) + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered, lambda = seq(-1, 1, length = 10),
xlab=expression(paste(lambda, ": log Average Drive")))
##
boxcox(Average_Winnings ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered, lambda = seq(-1, 1, length = 10),
xlab=expression(paste(lambda, ": Average Drive")))
##
boxcox(Average_Winnings ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + log(1+Age_Above_30), data = pga_data_filtered, lambda = seq(-1, 1, length = 10),
xlab=expression(paste(lambda, ": log-age")))
##
boxcox(Average_Winnings ~ log(Average_Drive) + Greens_on_reg + Save_Percent + Number_events + log(1+Age_Above_30), data = pga_data_filtered, lambda = seq(-1, 1, length = 10),
xlab=expression(paste(lambda, ": log-age, log Average Drive")))

Values of lamda are around 0 so we shoudl use a log transformation ##
Square Root Model
sqrt.winnings.log.drive = lm((Average_Winnings)^0.5 ~ Greens_on_reg + Save_Percent + Number_events + Age_Above_30 + log(Average_Drive), data = pga_data_filtered)
kable(summary(sqrt.winnings.log.drive)$coef, caption = "log-transformed model")
log-transformed model
(Intercept) |
-3187.392815 |
1176.262371 |
-2.7097635 |
0.0073840 |
Greens_on_reg |
16.184221 |
1.999182 |
8.0954209 |
0.0000000 |
Save_Percent |
5.154888 |
1.041392 |
4.9499977 |
0.0000017 |
Number_events |
-6.056634 |
1.199193 |
-5.0505915 |
0.0000011 |
Age_Above_30TRUE |
2.354375 |
14.312534 |
0.1644974 |
0.8695242 |
log(Average_Drive) |
394.504246 |
208.497492 |
1.8921295 |
0.0600792 |
par(mfrow = c(2,2))
plot(sqrt.winnings.log.drive)

This model worsened our QQ-plot normality and our non constant
variance remained the same, so we can try another model to see if it
helps.
Comparison of
models
par(pty = "s", mfrow = c(1, 3))
qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)
qqnorm(log.winnings$residuals, main = "Log Winnings")
qqline(log.winnings$residuals)
qqnorm(sqrt.winnings.log.drive$residuals, main = "sqrt winnings log drive")
qqline(sqrt.winnings.log.drive$residuals)

We can see that Log Winnings is the best model for our data.
Goodness of Fit
Comparison
select=function(m){
e = m$resid
n0 = length(e)
SSE=(m$df)*(summary(m)$sigma)^2
R.sq=summary(m)$r.squared
R.adj=summary(m)$adj.r
MSE=(summary(m)$sigma)^2
Cp=(SSE/MSE)-(n0-2*(n0-m$df))
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)
X=model.matrix(m)
H=X%*%solve(t(X)%*%X)%*%t(X)
d=e/(1-diag(H))
PRESS=t(d)%*%d
tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
tbl
}
output.sum = rbind(select(full.model), select(sqrt.winnings.log.drive), select(log.winnings))
row.names(output.sum) = c("full.model", "sqrt.winnings.log.drive", "log.winnings")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
Goodness-of-fit Measures of Candidate Models
full.model |
3.852006e+11 |
0.3349285 |
0.3164543 |
6 |
4001.9387 |
4021.2932 |
4.229954e+11 |
sqrt.winnings.log.drive |
1.058809e+06 |
0.4162839 |
0.4000695 |
6 |
1620.3250 |
1639.6795 |
1.143272e+06 |
log.winnings |
1.002463e+02 |
0.4547915 |
0.4396468 |
6 |
-102.9696 |
-83.6151 |
1.075339e+02 |
Even though the R, R squared, and Cp look better in the other two
models our value are still relatively good for log.winnings.
Log.winnings has better residuals, QQ plot, but less goodness of fit
measures but they are not significantly bad so we will stick with the
log model.
Final Model
kable(summary(log.winnings)$coef, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
(Intercept) |
-5.4512847 |
2.3833429 |
-2.2872432 |
0.0233453 |
Average_Drive |
0.0057567 |
0.0070297 |
0.8189126 |
0.4139191 |
Greens_on_reg |
0.1920075 |
0.0194438 |
9.8749896 |
0.0000000 |
Save_Percent |
0.0563106 |
0.0101251 |
5.5615033 |
0.0000001 |
Number_events |
-0.0450036 |
0.0116677 |
-3.8571026 |
0.0001596 |
Age_Above_30TRUE |
0.0333772 |
0.1392287 |
0.2397290 |
0.8108131 |
While we do have a large sample of 196 and all of our pvalues are
close to 0 we do have one variable “Age” which is not significant.
Summary of Model
The final model can be represented as log(price)=3.9245 −
0.0246×Average_Drive + 0.1866×Greens_on_reg + 0.0387×Save_percent −
0.0181×Number_events + 0.1826×Age_Above_30TRUE
From the above data we can also conclude that as average winnings go
up our average drive goes down by -2.5% while other variables such as
greens on regulation and save percent go up.
log.winnings = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
B = 1000 #
num.p = length(coef(log.winnings)) # number of parameters in the model (includes intercept)
smpl.n = nrow(pga_data_filtered) # sample size
# Matrix to store bootstrap coefficients
coef.mtrx = matrix(0, nrow = B, ncol = num.p)
# Bootstrap loop
for (i in 1:B) {
bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # Resample with replacement
boot_data = pga_data_filtered[bootc.id, ] # Create the bootstrap sample
log.winnings.btc = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = boot_data)
coef.mtrx[i, ] = coef(log.winnings.btc) # Store the coefficients
}
Bootstrapping Final
Model
boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3")
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")
}
par(mfrow=c(2,3))
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Average Drive" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Greens On Regulation" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Save Percentage" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Number of Events" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="Age Above 30" )

We see from our histograms that they look to have a equal
distribution. The blue curve represents the p-values reported. While the
blue curve represents the bootstrap intervals. They are very
similar.
95% CI
num.p = dim(coef.mtrx)[2]
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p) {
lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 8)
uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2), 8)
btc.wd[i] = uci.975 - lci.025
btc.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}
mean.coefs = apply(coef.mtrx, 2, mean)
results = as.data.frame(cbind(Mean_Coef = formatC(mean.coefs, 4, format = "f"), btc.ci.95 = btc.ci))
kable(results, caption = "Regression Coefficient Matrix with Bootstrap Confidence Intervals")
Regression Coefficient Matrix with Bootstrap Confidence
Intervals
-5.4436 |
[ -10.4839 , -0.2628 ] |
0.0059 |
[ -0.0065 , 0.0181 ] |
0.1917 |
[ 0.1442 , 0.236 ] |
0.0559 |
[ 0.038 , 0.0752 ] |
-0.0452 |
[ -0.0696 , -0.0216 ] |
0.0298 |
[ -0.2617 , 0.3228 ] |
We can see that our values are consistent with the p-values that we
got. Our 95% CI spans over 0 in the Age variable which matches up with
our p-value not being significant
hist(sort(log.winnings$residuals),n=40,
xlab="Residuals",
col = "lightblue",
border="navy",
main = "Histogram of Residuals")

We appear to have a normal distribution but we do appear to have some
outliers on either side of the curve.
Residual Bootstrap
Regression
log.winnings <- lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
model.resid = log.winnings$residuals
B=1000
num.p = dim(model.matrix(log.winnings))[2]
samp.n = dim(model.matrix(log.winnings))[1]
btr.mtrx = matrix(rep(0,6*B), ncol=num.p)
for (i in 1:B){
bt.lg.winnings = log.winnings$fitted.values +
sample(log.winnings$residuals, samp.n, replace = TRUE)
pga_data_filtered$bt.lg.winnings = bt.lg.winnings # send the boot response to the data
btr.model = lm(bt.lg.winnings ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
btr.mtrx[i,]=btr.model$coefficients
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3")
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")
}
par(mfrow=c(2,3))
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Average Drive" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Greens on Regulation" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Save Percentage" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Number of Events" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="Age" )

We again see that our histograms have an equal distribution. Again we
see that the p-values and residual boostrap have very similar
values.
um.p = dim(btr.mtrx)[2]
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p) {
lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2), 8)
uci.975 = round(quantile(btr.mtrx[, i], 0.975, type = 2), 8)
btr.wd[i] = uci.975 - lci.025
btr.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}
kable(as.data.frame(cbind(formatC(btr.mtrx[1, ], 4, format = "f"), btr.ci.95 = btr.ci)),
caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
Regression Coefficient Matrix with 95% Residual Bootstrap
CI
-3.9016 |
[ -10.2057 , -1.1977 ] |
-0.0006 |
[ -0.0066 , 0.0192 ] |
0.2044 |
[ 0.155 , 0.2292 ] |
0.0500 |
[ 0.0369 , 0.0758 ] |
-0.0583 |
[ -0.0669 , -0.0231 ] |
0.1272 |
[ -0.2253 , 0.291 ] |
Once again we get the same output as expected. The values above show
the same results as our p-values, all being significant except for age
which includes 0.
Combining
results
p_values <- summary(log.winnings)$coefficients[-3, 4]
combined_matrix <- cbind(
Coefficients = formatC(coef(log.winnings)[-3], 4, format = "f"),
`95% CI (Bootstrap t)` = btc.ci,
`95% CI (Bootstrap r)` = btr.ci,
`p-values` = formatC(p_values, 4, format = "f")
)
Warning in cbind(Coefficients = formatC(coef(log.winnings)[-3], 4, format =
"f"), : number of rows of result is not a multiple of vector length (arg 1)
library(knitr)
kable(as.data.frame(combined_matrix),
caption = "Final Combined Inferential Statistics: Coefficients, p-values, and Bootstrap CIs")
Final Combined Inferential Statistics: Coefficients, p-values,
and Bootstrap CIs
-5.4513 |
[ -10.4839 , -0.2628 ] |
[ -10.2057 , -1.1977 ] |
0.0233 |
0.0058 |
[ -0.0065 , 0.0181 ] |
[ -0.0066 , 0.0192 ] |
0.4139 |
0.0563 |
[ 0.1442 , 0.236 ] |
[ 0.155 , 0.2292 ] |
0.0000 |
-0.0450 |
[ 0.038 , 0.0752 ] |
[ 0.0369 , 0.0758 ] |
0.0002 |
0.0334 |
[ -0.0696 , -0.0216 ] |
[ -0.0669 , -0.0231 ] |
0.8108 |
-5.4513 |
[ -0.2617 , 0.3228 ] |
[ -0.2253 , 0.291 ] |
0.0233 |
All of the models yield the same result, showing there is a violation
in the age variable but all other variables are significant.
kable(round(cbind(btc.wd, btr.wd),4), caption="width of the two bootstrap confidence intervals")
width of the two bootstrap confidence intervals
10.2211 |
9.0080 |
0.0245 |
0.0258 |
0.0918 |
0.0741 |
0.0372 |
0.0389 |
0.0480 |
0.0438 |
0.5845 |
0.5163 |
We see from the above analysis that the two values are very similar
to each other.
Summary
We see from our findings above that all three of our models come to
the same conclusion. We will be using our initial final model where we
used a log transformation because they seem to be a little more
significant.
Again we see that the final model can be represented as
log(price)=3.9245 − 0.0246×Average_Drive + 0.1866×Greens_on_reg +
0.0387×Save_percent − 0.0181×Number_events + 0.1826×Age_Above_30TRUE
From the above data we can also conclude that as average winnings go
up our average drive goes down by -2.5% while other variables such as
greens on regulation and save percent go up.
Conclusion
The main conclusion we can draw is that while driving distance in
golf is perceived to be a big indication of whether a person is winning
or not we do not see this in our analysis. While the Average Winnings
goes up the average drive goes down by -2.5% while Save Percentage goes
up by 3.9% and Greens on Regulation goes up by 1.9% which shows that the
“short” game in golf (putting and chipping) is more important than
driving in regards to average money won.
We see one flaw in our analysis and it was that age was not
significant which could mess up our analysis. in the future we could
push our age up or down for the true or false statement to see if that
changes the significance. We could also look at some values of age to
see if there are any outliers, for example golfers are not younger than
20 but they can be 45+ which could mess with our significance.
We can use this model for seasons to come to see if things may have
changed. For example since golf has evolved recently to players having
longer drives we could run this model again in 2024 to see if the
“short” game is less likely to win more money than the “long” game
would.
---
title: "Factors Influence Golf Earnings"
author: 'Tyler Battaglini'
date: "2024-09-19"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
always_allow_html: true
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-weight: bold;
    font-family: system-ui;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# Detect, install and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("nleqslv")) {
   install.packages("nleqslv")
   library(nleqslv)
}
#
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}

if (!require("psych")) {   
  install.packages("psych")
   library(psych)
}
if (!require("MASS")) {   
  install.packages("MASS")
   library(MASS)
}
if (!require("ggplot2")) {   
  install.packages("ggplot2")
   library(ggplot2)
}
if (!require("GGally")) {   
  install.packages("GGally")
   library(GGally)
}
if (!require("car")) {   
  install.packages("car")
   library(car)
}
if (!require("dplyr")) {   
  install.packages("dplyr")
   library(dplyr)
}
if (!require("caret")) {   
  install.packages("caret")
   library(caret)
}

# specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,      # include code chunk in the output file
                      warnings = FALSE,  # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      messages = FALSE,  #
                      results = TRUE,
                      
                      comment = NA       # you can also decide whether to include the output
                                         # in the output file.
                      )   
```


### Introduction

## Description
The PGA 2004 dataset contains performance statistics and winnings data for 196 participants in the PGA (Proffesional Golfers Association) during the 2004 season. The dataset provides information on several key aspects of player performance and earnings such as:

Name : The name of each golfer.
Age : The age of the player.
Average Drive : The average driving distance of the player in yards
Driving Accuracy: The percentage of drives that land on the fairway, 
Greens in Regulation : The percentage of greens reached in regulation, meaning the number of strokes used by the player to land the ball on the green is two or more less than par.
Average Number of Putts : The average number of putts per round
Save Percentage : The percentage of times a player saves par or better from around the green 
Money Rank : The rank of the player based on total earnings for the season.
Number of Events : The total number of events the player participated in during the season.
Total Winnings : The total amount of money earned by the player over the course of the season.
Average Winnings : The average amount of money earned per event by the player.

## Research Question
We want to investigate what variables affect the players winnings of this given season. We will look at mainly the average drive and how it correlates to winnings.

## Data Preperation

```{r}

url <- "https://users.stat.ufl.edu/~winner/data/pga2004.dat"
pga_data <- read.table(url, header = FALSE, fill = TRUE)
pga_data$V1 <- NULL
colnames(pga_data) <- c("Player_name", "Player_Age", "Average_Drive", "Driving_Accuracy", "Greens_on_reg", 
                        "Average_number_putts", "Save_Percent", "Money_Rank",
                        "Number_events", "Total_Winnings", "Average_Winnings")
head(pga_data)
```

We take a broad look at the data and see there was two categorical variables one of which having the first name of a player and then a second one with the last name. We void the first name and just look at the last names.

```{r}
pander(head(pga_data))

```
```{r}
cor(pga_data[, c("Average_Drive" , "Driving_Accuracy" , "Greens_on_reg" , "Average_number_putts" , "Save_Percent" , "Money_Rank" , "Number_events" , "Total_Winnings" , "Average_Winnings")], use = "pairwise.complete.obs")


```
We see that a lot of our variables have high correlation which can cause multicollinearity. We must deal with this issue before doing our analysis. 

```{r}
colSums(is.na(pga_data))
pga_data_clean <- na.omit(pga_data)
```

For the sake of missing observations we will be removing it for our data analysis because we have a plentiful amount of observations.

```{r}
ggplot(pga_data_clean, aes(x = Average_Drive, y = Average_Winnings)) +
  geom_point() +
  geom_smooth(method = "lm", col = "green") +
  labs(title = "Scatter Plot of Average Drive vs. Average Winnings", x = "Average Drive", y = "Average Winnings")
str(pga_data_clean)
remove_low_drives <- function(pga_data_clean) {
  filtered_data <- pga_data_clean %>%
    filter(Average_Drive >= 150)
  
  return(filtered_data)
}

# Example usage
pga_data_filtered <- remove_low_drives(pga_data_clean)
ggplot(pga_data_filtered, aes(x = Average_Drive, y = Average_Winnings)) +
  geom_point() +
  geom_smooth(method = "lm", col = "green") +
  labs(
    title = "Scatter Plot of Average Drive vs. Average Winnings (Filtered)",
    x = "Average Drive",
    y = "Average Winnings"
  ) +
  theme_minimal() 
```


We look at an initial graph of average winnigs vs average drive. It looks like we have a good cluster of data but we have a couple of outliers in our data that is scewing our line.

```{r}
pga_data_filtered$Age_Above_30 <- pga_data_filtered$Player_Age > 30

head(pga_data_filtered$Age_Above_30)
pga_data_filtered <- pga_data_filtered %>% select(-Player_name, -Driving_Accuracy, -Total_Winnings, -Average_number_putts, -Money_Rank, -Player_Age)
kable(head(pga_data_filtered))


```

We remove player name and money rank because we are "ranking" them or sorting them by +/- 30 age. We remove driving accuracy because we have a variable with average drive because that is what we are making assumptions on and also it could cause multicollinearity because driving accuracy and average drive are highly correlated. The same can be said for total winnings versus average winnings and also save percent and number of putts. They are both correlated and can cause multicollinearity.

### Model Bulding

## Linear Model

```{r}
colnames(pga_data_filtered)
full.model = lm(Average_Winnings ~  Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")


```
We transform the age category into a true or false of payers with above or below age 30. All of our values are significant besides Age.
```{r}
par(mfrow=c(2,2))
plot(full.model)


```

It looks like we have faults in our assumptions. Based on the QQ plot there seems to be a non-normal distribution with several outliers. There also seems to be non-constant variance.

```{r}
vif(full.model)
barplot(vif(full.model), main = "VIF Values", horiz = FALSE, col = "steelblue")
```

We see in the vif values above that all of them are below 5 which shows little to no multicollineairty so we can continue.

## Box Cox Transformation

We perform this transformation because our data has non constant variance

```{r}
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
##
boxcox(Average_Winnings ~ log(Average_Drive) + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": log Average Drive")))
##
boxcox(Average_Winnings ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": Average Drive")))
##
boxcox(Average_Winnings ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + log(1+Age_Above_30), data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": log-age")))
##
boxcox(Average_Winnings ~ log(Average_Drive) + Greens_on_reg +  Save_Percent + Number_events + log(1+Age_Above_30), data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
      xlab=expression(paste(lambda, ": log-age, log Average Drive")))

```

Values of lamda are around 0 so we shoudl use a log transformation
## Square Root Model

```{r}
sqrt.winnings.log.drive = lm((Average_Winnings)^0.5 ~ Greens_on_reg +  Save_Percent + Number_events + Age_Above_30 + log(Average_Drive), data = pga_data_filtered)
kable(summary(sqrt.winnings.log.drive)$coef, caption = "log-transformed model")


```

```{r}
par(mfrow = c(2,2))
plot(sqrt.winnings.log.drive)

```

This model worsened our QQ-plot normality and our non constant variance remained the same, so we can try another model to see if it helps.

## Log Transformation

```{r}
log.winnings = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
kable(summary(log.winnings)$coef, caption = "log-transformed model")


```

```{r}
par(mfrow = c(2,2))
plot(log.winnings)

```

This model looks drastically better then the other two models. While normality still cannot be assumed the QQ Plot looks better. The same thing has happened with our Residual plot while constant variance can still not be assumed it is drastically better.

## Comparison of models

```{r}

par(pty = "s", mfrow = c(1, 3))

qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)

qqnorm(log.winnings$residuals, main = "Log Winnings")
qqline(log.winnings$residuals)

qqnorm(sqrt.winnings.log.drive$residuals, main = "sqrt winnings log drive")
qqline(sqrt.winnings.log.drive$residuals)


```

We can see that Log Winnings is the best model for our data.

## Goodness of Fit Comparison

```{r}
select=function(m){ 
 e = m$resid                           
 n0 = length(e)                        
 SSE=(m$df)*(summary(m)$sigma)^2       
 R.sq=summary(m)$r.squared             
 R.adj=summary(m)$adj.r               
 MSE=(summary(m)$sigma)^2              
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  
 X=model.matrix(m)                     
 H=X%*%solve(t(X)%*%X)%*%t(X)          
 d=e/(1-diag(H))                       
 PRESS=t(d)%*%d   
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
 }

```

```{r}

output.sum = rbind(select(full.model), select(sqrt.winnings.log.drive), select(log.winnings))
row.names(output.sum) = c("full.model", "sqrt.winnings.log.drive", "log.winnings")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")

```

Even though the R, R squared, and Cp look better in the other two models our value are still relatively good for log.winnings. Log.winnings has better residuals, QQ plot, but less goodness of fit measures but they are not significantly bad so we will stick with the log model.

## Final Model

```{r}
kable(summary(log.winnings)$coef, caption = "Inferential Statistics of Final Model")
```

While we do have a large sample of 196 and all of our pvalues are close to 0 we do have one variable "Age" which is not significant.

### Summary of Model
The final model can be represented as log(price)=3.9245 − 0.0246×Average_Drive + 0.1866×Greens_on_reg + 
0.0387×Save_percent − 0.0181×Number_events + 0.1826×Age_Above_30TRUE

From the above data we can also conclude that as average winnings go up our average drive goes down by -2.5% while other variables such as greens on regulation and save percent go up.

```{r}
log.winnings = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)

B = 1000  #

num.p = length(coef(log.winnings))  # number of parameters in the model (includes intercept)
smpl.n = nrow(pga_data_filtered)       # sample size

# Matrix to store bootstrap coefficients
coef.mtrx = matrix(0, nrow = B, ncol = num.p)

# Bootstrap loop
for (i in 1:B) {
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE)  # Resample with replacement
  boot_data = pga_data_filtered[bootc.id, ]  # Create the bootstrap sample
  log.winnings.btc = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = boot_data)
  coef.mtrx[i, ] = coef(log.winnings.btc)  # Store the coefficients
}

```

### Bootstrapping Final Model

```{r}

boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){

  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))

  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
  
}

```

```{r}
par(mfrow=c(2,3))  
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Average Drive" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Greens On Regulation" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Save Percentage" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Number of Events" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="Age Above 30" )
```

We see from our histograms that they look to have a equal distribution. The blue curve represents the p-values reported. While the blue curve represents the bootstrap intervals. They are very similar.

## 95% CI

```{r}

num.p = dim(coef.mtrx)[2]  

btc.ci = NULL
btc.wd = NULL

for (i in 1:num.p) {
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 8)
  uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2), 8)
  btc.wd[i] = uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}

mean.coefs = apply(coef.mtrx, 2, mean)

results = as.data.frame(cbind(Mean_Coef = formatC(mean.coefs, 4, format = "f"), btc.ci.95 = btc.ci))

kable(results, caption = "Regression Coefficient Matrix with Bootstrap Confidence Intervals")

```

We can see that our values are consistent with the p-values that we got. Our 95% CI spans over 0 in the Age variable which matches up with our p-value not being significant

```{r}
hist(sort(log.winnings$residuals),n=40,
     xlab="Residuals",
     col = "lightblue",
     border="navy",
     main = "Histogram of Residuals")


```

We appear to have a normal distribution but we do appear to have some outliers on either side of the curve.

## Residual Bootstrap Regression

```{r}
log.winnings <- lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
model.resid = log.winnings$residuals
B=1000
num.p = dim(model.matrix(log.winnings))[2]   
samp.n = dim(model.matrix(log.winnings))[1]  
btr.mtrx = matrix(rep(0,6*B), ncol=num.p)
for (i in 1:B){

  bt.lg.winnings = log.winnings$fitted.values + 
        sample(log.winnings$residuals, samp.n, replace = TRUE)  
  
  pga_data_filtered$bt.lg.winnings =  bt.lg.winnings   #  send the boot response to the data
  btr.model = lm(bt.lg.winnings ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)  
  btr.mtrx[i,]=btr.model$coefficients
}
```

```{r}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){

  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))

  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")             
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    
} 



```

```{r}
par(mfrow=c(2,3))  
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Average Drive" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Greens on Regulation" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Save Percentage" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Number of Events" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="Age" )


```

We again see that our histograms have an equal distribution. Again we see that the p-values and residual boostrap have very similar values.
```{r}
um.p = dim(btr.mtrx)[2]  
btr.ci = NULL
btr.wd = NULL

for (i in 1:num.p) {
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2), 8)
  uci.975 = round(quantile(btr.mtrx[, i], 0.975, type = 2), 8)
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}

kable(as.data.frame(cbind(formatC(btr.mtrx[1, ], 4, format = "f"), btr.ci.95 = btr.ci)), 
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")

```

Once again we get the same output as expected. The values above show the same results as our p-values, all being significant except for age which includes 0.

### Combining results

```{r}
p_values <- summary(log.winnings)$coefficients[-3, 4]  

combined_matrix <- cbind(
  Coefficients = formatC(coef(log.winnings)[-3], 4, format = "f"),  
  `95% CI (Bootstrap t)` = btc.ci,                                
  `95% CI (Bootstrap r)` = btr.ci,                                 
  `p-values` = formatC(p_values, 4, format = "f")                  
)


library(knitr)
kable(as.data.frame(combined_matrix), 
      caption = "Final Combined Inferential Statistics: Coefficients, p-values, and Bootstrap CIs")



```

All of the models yield the same result, showing there is a violation in the age variable but all other variables are significant.

```{r}

kable(round(cbind(btc.wd, btr.wd),4), caption="width of the two bootstrap confidence intervals")


```

We see from the above analysis that the two values are very similar to each other.



### Summary

We see from our findings above that all three of our models come to the same conclusion. We will be using our initial final model where we used a log transformation because they seem to be a little more significant. 

Again we see that the final model can be represented as log(price)=3.9245 − 0.0246×Average_Drive + 0.1866×Greens_on_reg + 
0.0387×Save_percent − 0.0181×Number_events + 0.1826×Age_Above_30TRUE

From the above data we can also conclude that as average winnings go up our average drive goes down by -2.5% while other variables such as greens on regulation and save percent go up.

## Conclusion 
The main conclusion we can draw is that while driving distance in golf is perceived to be a big indication of whether a person is winning or not we do not see this in our analysis. While the Average Winnings goes up the average drive goes down by -2.5% while Save Percentage goes up by 3.9% and Greens on Regulation goes up by 1.9% which shows that the "short" game in golf (putting and chipping) is more important than driving in regards to average money won.

We see one flaw in our analysis and it was that age was not significant which could mess up our analysis. in the future we could push our age up or down for the true or false statement to see if that changes the significance. We could also look at some values of age to see if there are any outliers, for example golfers are not younger than 20 but they can be 45+ which could mess with our significance.

We can use this model for seasons to come to see if things may have changed. For example since golf has evolved recently to players having longer drives we could run this model again in 2024 to see if the "short" game is less likely to win more money than the "long" game would.

