0.1 Introduction

1 Description

The PGA 2004 dataset contains performance statistics and winnings data for 196 participants in the PGA (Proffesional Golfers Association) during the 2004 season. The dataset provides information on several key aspects of player performance and earnings such as:

Name : The name of each golfer. Age : The age of the player. Average Drive : The average driving distance of the player in yards Driving Accuracy: The percentage of drives that land on the fairway, Greens in Regulation : The percentage of greens reached in regulation, meaning the number of strokes used by the player to land the ball on the green is two or more less than par. Average Number of Putts : The average number of putts per round Save Percentage : The percentage of times a player saves par or better from around the green Money Rank : The rank of the player based on total earnings for the season. Number of Events : The total number of events the player participated in during the season. Total Winnings : The total amount of money earned by the player over the course of the season. Average Winnings : The average amount of money earned per event by the player.

2 Research Question

We want to investigate what variables affect the players winnings of this given season. We will look at mainly the average drive and how it correlates to winnings.

3 Data Preperation

url <- "https://users.stat.ufl.edu/~winner/data/pga2004.dat"
pga_data <- read.table(url, header = FALSE, fill = TRUE)
pga_data$V1 <- NULL
colnames(pga_data) <- c("Player_name", "Player_Age", "Average_Drive", "Driving_Accuracy", "Greens_on_reg", 
                        "Average_number_putts", "Save_Percent", "Money_Rank",
                        "Number_events", "Total_Winnings", "Average_Winnings")
head(pga_data)
  Player_name Player_Age Average_Drive Driving_Accuracy Greens_on_reg
1    Baddeley         23         288.0             53.1          58.2
2       Scott         24         295.4             57.7          65.6
3       Cejka         34         285.8             64.2          63.8
4       Stolz         34         297.9             59.0          63.0
5       Atwal         31         289.4             60.5          62.5
6  Oberholser         29         284.6             68.8          67.0
  Average_number_putts Save_Percent Money_Rank Number_events Total_Winnings
1                1.767         50.9        123            27         632878
2                1.757         59.3          7            16        3724984
3                1.795         50.7         54            24        1313484
4                1.787         47.7        101            20         808373
5                1.766         43.5        146            30         486053
6                1.780         50.9         52            23        1355433
  Average_Winnings
1            23440
2           232812
3            54729
4            40419
5            16202
6            58932

We take a broad look at the data and see there was two categorical variables one of which having the first name of a player and then a second one with the last name. We void the first name and just look at the last names.

pander(head(pga_data))
Table continues below
Player_name Player_Age Average_Drive Driving_Accuracy Greens_on_reg
Baddeley 23 288 53.1 58.2
Scott 24 295.4 57.7 65.6
Cejka 34 285.8 64.2 63.8
Stolz 34 297.9 59 63
Atwal 31 289.4 60.5 62.5
Oberholser 29 284.6 68.8 67
Table continues below
Average_number_putts Save_Percent Money_Rank Number_events
1.767 50.9 123 27
1.757 59.3 7 16
1.795 50.7 54 24
1.787 47.7 101 20
1.766 43.5 146 30
1.78 50.9 52 23
Total_Winnings Average_Winnings
632878 23440
3724984 232812
1313484 54729
808373 40419
486053 16202
1355433 58932
cor(pga_data[, c("Average_Drive" , "Driving_Accuracy" , "Greens_on_reg" , "Average_number_putts" , "Save_Percent" , "Money_Rank" , "Number_events" , "Total_Winnings" , "Average_Winnings")], use = "pairwise.complete.obs")
                     Average_Drive Driving_Accuracy Greens_on_reg
Average_Drive          1.000000000      -0.99252269   0.002134259
Driving_Accuracy      -0.992522688       1.00000000   0.046962093
Greens_on_reg          0.002134259       0.04696209   1.000000000
Average_number_putts  -0.988803391       0.99389638   0.022156599
Save_Percent           0.867471569      -0.87937510  -0.056544353
Money_Rank             0.186561899      -0.20345865  -0.498016597
Number_events         -0.699109599       0.68370009   0.062783323
Total_Winnings         0.226265605      -0.21129425   0.405497292
Average_Winnings      -0.773878692       0.79872244   0.013823260
                     Average_number_putts Save_Percent  Money_Rank
Average_Drive                  -0.9888034   0.86747157  0.18656190
Driving_Accuracy                0.9938964  -0.87937510 -0.20345865
Greens_on_reg                   0.0221566  -0.05654435 -0.49801660
Average_number_putts            1.0000000  -0.88762172 -0.19710155
Save_Percent                   -0.8876217   1.00000000  0.06385765
Money_Rank                     -0.1971016   0.06385765  1.00000000
Number_events                   0.6924957  -0.63927742 -0.11418017
Total_Winnings                 -0.2044137   0.27127779 -0.66596035
Average_Winnings                0.7948355  -0.68320935 -0.27537388
                     Number_events Total_Winnings Average_Winnings
Average_Drive          -0.69910960    0.226265605     -0.773878692
Driving_Accuracy        0.68370009   -0.211294247      0.798722439
Greens_on_reg           0.06278332    0.405497292      0.013823260
Average_number_putts    0.69249567   -0.204413695      0.794835458
Save_Percent           -0.63927742    0.271277789     -0.683209349
Money_Rank             -0.11418017   -0.665960350     -0.275373883
Number_events           1.00000000   -0.180433459      0.190800029
Total_Winnings         -0.18043346    1.000000000      0.002463485
Average_Winnings        0.19080003    0.002463485      1.000000000

We see that a lot of our variables have high correlation which can cause multicollinearity. We must deal with this issue before doing our analysis.

colSums(is.na(pga_data))
         Player_name           Player_Age        Average_Drive 
                   0                    0                   10 
    Driving_Accuracy        Greens_on_reg Average_number_putts 
                  10                   10                   10 
        Save_Percent           Money_Rank        Number_events 
                  10                   10                   10 
      Total_Winnings     Average_Winnings 
                  10                   10 
pga_data_clean <- na.omit(pga_data)

For the sake of missing observations we will be removing it for our data analysis because we have a plentiful amount of observations.

ggplot(pga_data_clean, aes(x = Average_Drive, y = Average_Winnings)) +
  geom_point() +
  geom_smooth(method = "lm", col = "green") +
  labs(title = "Scatter Plot of Average Drive vs. Average Winnings", x = "Average Drive", y = "Average Winnings")
`geom_smooth()` using formula = 'y ~ x'

str(pga_data_clean)
'data.frame':   196 obs. of  11 variables:
 $ Player_name         : chr  "Baddeley" "Scott" "Cejka" "Stolz" ...
 $ Player_Age          : chr  "23" "24" "34" "34" ...
 $ Average_Drive       : num  288 295 286 298 289 ...
 $ Driving_Accuracy    : num  53.1 57.7 64.2 59 60.5 68.8 74.2 64.4 64.3 62.6 ...
 $ Greens_on_reg       : num  58.2 65.6 63.8 63 62.5 67 68.9 64.2 63.4 65.3 ...
 $ Average_number_putts: num  1.77 1.76 1.79 1.79 1.77 ...
 $ Save_Percent        : num  50.9 59.3 50.7 47.7 43.5 50.9 40.4 53.8 42.2 47.7 ...
 $ Money_Rank          : num  123 7 54 101 146 52 80 75 141 83 ...
 $ Number_events       : int  27 16 24 20 30 23 23 27 20 15 ...
 $ Total_Winnings      : int  632878 3724984 1313484 808373 486053 1355433 962167 1036958 500818 943589 ...
 $ Average_Winnings    : int  23440 232812 54729 40419 16202 58932 41833 38406 25041 62906 ...
 - attr(*, "na.action")= 'omit' Named int [1:10] 15 32 54 58 68 110 142 155 198 200
  ..- attr(*, "names")= chr [1:10] "15" "32" "54" "58" ...
remove_low_drives <- function(pga_data_clean) {
  filtered_data <- pga_data_clean %>%
    filter(Average_Drive >= 150)
  
  return(filtered_data)
}

# Example usage
pga_data_filtered <- remove_low_drives(pga_data_clean)
ggplot(pga_data_filtered, aes(x = Average_Drive, y = Average_Winnings)) +
  geom_point() +
  geom_smooth(method = "lm", col = "green") +
  labs(
    title = "Scatter Plot of Average Drive vs. Average Winnings (Filtered)",
    x = "Average Drive",
    y = "Average Winnings"
  ) +
  theme_minimal() 
`geom_smooth()` using formula = 'y ~ x'

We look at an initial graph of average winnigs vs average drive. It looks like we have a good cluster of data but we have a couple of outliers in our data that is scewing our line.

pga_data_filtered$Age_Above_30 <- pga_data_filtered$Player_Age > 30

head(pga_data_filtered$Age_Above_30)
[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE
pga_data_filtered <- pga_data_filtered %>% select(-Player_name, -Driving_Accuracy, -Total_Winnings, -Average_number_putts, -Money_Rank, -Player_Age)
kable(head(pga_data_filtered))
Average_Drive Greens_on_reg Save_Percent Number_events Average_Winnings Age_Above_30
288.0 58.2 50.9 27 23440 FALSE
295.4 65.6 59.3 16 232812 FALSE
285.8 63.8 50.7 24 54729 TRUE
297.9 63.0 47.7 20 40419 TRUE
289.4 62.5 43.5 30 16202 TRUE
284.6 67.0 50.9 23 58932 FALSE

We remove player name and money rank because we are “ranking” them or sorting them by +/- 30 age. We remove driving accuracy because we have a variable with average drive because that is what we are making assumptions on and also it could cause multicollinearity because driving accuracy and average drive are highly correlated. The same can be said for total winnings versus average winnings and also save percent and number of putts. They are both correlated and can cause multicollinearity.

3.1 Model Bulding

4 Linear Model

colnames(pga_data_filtered)
[1] "Average_Drive"    "Greens_on_reg"    "Save_Percent"     "Number_events"   
[5] "Average_Winnings" "Age_Above_30"    
full.model = lm(Average_Winnings ~  Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -802913.292 147739.1904 -5.434667 0.0000002
Average_Drive 1198.229 435.7615 2.749735 0.0065731
Greens_on_reg 7278.954 1205.2877 6.039183 0.0000000
Save_Percent 2486.926 627.6347 3.962378 0.0001069
Number_events -3508.708 723.2608 -4.851234 0.0000026
Age_Above_30TRUE 3537.995 8630.5405 0.409939 0.6823382

We transform the age category into a true or false of payers with above or below age 30. All of our values are significant besides Age.

par(mfrow=c(2,2))
plot(full.model)

It looks like we have faults in our assumptions. Based on the QQ plot there seems to be a non-normal distribution with several outliers. There also seems to be non-constant variance.

vif(full.model)
Average_Drive Greens_on_reg  Save_Percent Number_events  Age_Above_30 
     1.136250      1.044126      1.047080      1.018950      1.072829 
barplot(vif(full.model), main = "VIF Values", horiz = FALSE, col = "steelblue")

We see in the vif values above that all of them are below 5 which shows little to no multicollineairty so we can continue.

5 Box Cox Transformation

We perform this transformation because our data has non constant variance

par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
##
boxcox(Average_Winnings ~ log(Average_Drive) + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": log Average Drive")))
##
boxcox(Average_Winnings ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": Average Drive")))
##
boxcox(Average_Winnings ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + log(1+Age_Above_30), data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": log-age")))
##
boxcox(Average_Winnings ~ log(Average_Drive) + Greens_on_reg +  Save_Percent + Number_events + log(1+Age_Above_30), data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
      xlab=expression(paste(lambda, ": log-age, log Average Drive")))

Values of lamda are around 0 so we shoudl use a log transformation ## Square Root Model

sqrt.winnings.log.drive = lm((Average_Winnings)^0.5 ~ Greens_on_reg +  Save_Percent + Number_events + Age_Above_30 + log(Average_Drive), data = pga_data_filtered)
kable(summary(sqrt.winnings.log.drive)$coef, caption = "log-transformed model")
log-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3187.392815 1176.262371 -2.7097635 0.0073840
Greens_on_reg 16.184221 1.999182 8.0954209 0.0000000
Save_Percent 5.154888 1.041392 4.9499977 0.0000017
Number_events -6.056634 1.199193 -5.0505915 0.0000011
Age_Above_30TRUE 2.354375 14.312534 0.1644974 0.8695242
log(Average_Drive) 394.504246 208.497492 1.8921295 0.0600792
par(mfrow = c(2,2))
plot(sqrt.winnings.log.drive)

This model worsened our QQ-plot normality and our non constant variance remained the same, so we can try another model to see if it helps.

6 Log Transformation

log.winnings = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
kable(summary(log.winnings)$coef, caption = "log-transformed model")
log-transformed model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.4512847 2.3833429 -2.2872432 0.0233453
Average_Drive 0.0057567 0.0070297 0.8189126 0.4139191
Greens_on_reg 0.1920075 0.0194438 9.8749896 0.0000000
Save_Percent 0.0563106 0.0101251 5.5615033 0.0000001
Number_events -0.0450036 0.0116677 -3.8571026 0.0001596
Age_Above_30TRUE 0.0333772 0.1392287 0.2397290 0.8108131
par(mfrow = c(2,2))
plot(log.winnings)

This model looks drastically better then the other two models. While normality still cannot be assumed the QQ Plot looks better. The same thing has happened with our Residual plot while constant variance can still not be assumed it is drastically better.

7 Comparison of models

par(pty = "s", mfrow = c(1, 3))

qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)

qqnorm(log.winnings$residuals, main = "Log Winnings")
qqline(log.winnings$residuals)

qqnorm(sqrt.winnings.log.drive$residuals, main = "sqrt winnings log drive")
qqline(sqrt.winnings.log.drive$residuals)

We can see that Log Winnings is the best model for our data.

8 Goodness of Fit Comparison

select=function(m){ 
 e = m$resid                           
 n0 = length(e)                        
 SSE=(m$df)*(summary(m)$sigma)^2       
 R.sq=summary(m)$r.squared             
 R.adj=summary(m)$adj.r               
 MSE=(summary(m)$sigma)^2              
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  
 X=model.matrix(m)                     
 H=X%*%solve(t(X)%*%X)%*%t(X)          
 d=e/(1-diag(H))                       
 PRESS=t(d)%*%d   
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
 }
output.sum = rbind(select(full.model), select(sqrt.winnings.log.drive), select(log.winnings))
row.names(output.sum) = c("full.model", "sqrt.winnings.log.drive", "log.winnings")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
full.model 3.852006e+11 0.3349285 0.3164543 6 4001.9387 4021.2932 4.229954e+11
sqrt.winnings.log.drive 1.058809e+06 0.4162839 0.4000695 6 1620.3250 1639.6795 1.143272e+06
log.winnings 1.002463e+02 0.4547915 0.4396468 6 -102.9696 -83.6151 1.075339e+02

Even though the R, R squared, and Cp look better in the other two models our value are still relatively good for log.winnings. Log.winnings has better residuals, QQ plot, but less goodness of fit measures but they are not significantly bad so we will stick with the log model.

9 Final Model

kable(summary(log.winnings)$coef, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.4512847 2.3833429 -2.2872432 0.0233453
Average_Drive 0.0057567 0.0070297 0.8189126 0.4139191
Greens_on_reg 0.1920075 0.0194438 9.8749896 0.0000000
Save_Percent 0.0563106 0.0101251 5.5615033 0.0000001
Number_events -0.0450036 0.0116677 -3.8571026 0.0001596
Age_Above_30TRUE 0.0333772 0.1392287 0.2397290 0.8108131

While we do have a large sample of 196 and all of our pvalues are close to 0 we do have one variable “Age” which is not significant.

9.1 Summary of Model

The final model can be represented as log(price)=3.9245 − 0.0246×Average_Drive + 0.1866×Greens_on_reg + 0.0387×Save_percent − 0.0181×Number_events + 0.1826×Age_Above_30TRUE

From the above data we can also conclude that as average winnings go up our average drive goes down by -2.5% while other variables such as greens on regulation and save percent go up.

log.winnings = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)

B = 1000  #

num.p = length(coef(log.winnings))  # number of parameters in the model (includes intercept)
smpl.n = nrow(pga_data_filtered)       # sample size

# Matrix to store bootstrap coefficients
coef.mtrx = matrix(0, nrow = B, ncol = num.p)

# Bootstrap loop
for (i in 1:B) {
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE)  # Resample with replacement
  boot_data = pga_data_filtered[bootc.id, ]  # Create the bootstrap sample
  log.winnings.btc = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = boot_data)
  coef.mtrx[i, ] = coef(log.winnings.btc)  # Store the coefficients
}

9.2 Bootstrapping Final Model

boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){

  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))

  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
  
}
par(mfrow=c(2,3))  
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Average Drive" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Greens On Regulation" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Save Percentage" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Number of Events" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="Age Above 30" )

We see from our histograms that they look to have a equal distribution. The blue curve represents the p-values reported. While the blue curve represents the bootstrap intervals. They are very similar.

10 95% CI

num.p = dim(coef.mtrx)[2]  

btc.ci = NULL
btc.wd = NULL

for (i in 1:num.p) {
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 8)
  uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2), 8)
  btc.wd[i] = uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}

mean.coefs = apply(coef.mtrx, 2, mean)

results = as.data.frame(cbind(Mean_Coef = formatC(mean.coefs, 4, format = "f"), btc.ci.95 = btc.ci))

kable(results, caption = "Regression Coefficient Matrix with Bootstrap Confidence Intervals")
Regression Coefficient Matrix with Bootstrap Confidence Intervals
Mean_Coef btc.ci.95
-5.4436 [ -10.4839 , -0.2628 ]
0.0059 [ -0.0065 , 0.0181 ]
0.1917 [ 0.1442 , 0.236 ]
0.0559 [ 0.038 , 0.0752 ]
-0.0452 [ -0.0696 , -0.0216 ]
0.0298 [ -0.2617 , 0.3228 ]

We can see that our values are consistent with the p-values that we got. Our 95% CI spans over 0 in the Age variable which matches up with our p-value not being significant

hist(sort(log.winnings$residuals),n=40,
     xlab="Residuals",
     col = "lightblue",
     border="navy",
     main = "Histogram of Residuals")

We appear to have a normal distribution but we do appear to have some outliers on either side of the curve.

11 Residual Bootstrap Regression

log.winnings <- lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
model.resid = log.winnings$residuals
B=1000
num.p = dim(model.matrix(log.winnings))[2]   
samp.n = dim(model.matrix(log.winnings))[1]  
btr.mtrx = matrix(rep(0,6*B), ncol=num.p)
for (i in 1:B){

  bt.lg.winnings = log.winnings$fitted.values + 
        sample(log.winnings$residuals, samp.n, replace = TRUE)  
  
  pga_data_filtered$bt.lg.winnings =  bt.lg.winnings   #  send the boot response to the data
  btr.model = lm(bt.lg.winnings ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)  
  btr.mtrx[i,]=btr.model$coefficients
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){

  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))

  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")             
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    
} 
par(mfrow=c(2,3))  
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Average Drive" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Greens on Regulation" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Save Percentage" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Number of Events" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="Age" )

We again see that our histograms have an equal distribution. Again we see that the p-values and residual boostrap have very similar values.

um.p = dim(btr.mtrx)[2]  
btr.ci = NULL
btr.wd = NULL

for (i in 1:num.p) {
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2), 8)
  uci.975 = round(quantile(btr.mtrx[, i], 0.975, type = 2), 8)
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}

kable(as.data.frame(cbind(formatC(btr.mtrx[1, ], 4, format = "f"), btr.ci.95 = btr.ci)), 
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
Regression Coefficient Matrix with 95% Residual Bootstrap CI
V1 btr.ci.95
-3.9016 [ -10.2057 , -1.1977 ]
-0.0006 [ -0.0066 , 0.0192 ]
0.2044 [ 0.155 , 0.2292 ]
0.0500 [ 0.0369 , 0.0758 ]
-0.0583 [ -0.0669 , -0.0231 ]
0.1272 [ -0.2253 , 0.291 ]

Once again we get the same output as expected. The values above show the same results as our p-values, all being significant except for age which includes 0.

11.1 Combining results

p_values <- summary(log.winnings)$coefficients[-3, 4]  

combined_matrix <- cbind(
  Coefficients = formatC(coef(log.winnings)[-3], 4, format = "f"),  
  `95% CI (Bootstrap t)` = btc.ci,                                
  `95% CI (Bootstrap r)` = btr.ci,                                 
  `p-values` = formatC(p_values, 4, format = "f")                  
)
Warning in cbind(Coefficients = formatC(coef(log.winnings)[-3], 4, format =
"f"), : number of rows of result is not a multiple of vector length (arg 1)
library(knitr)
kable(as.data.frame(combined_matrix), 
      caption = "Final Combined Inferential Statistics: Coefficients, p-values, and Bootstrap CIs")
Final Combined Inferential Statistics: Coefficients, p-values, and Bootstrap CIs
Coefficients 95% CI (Bootstrap t) 95% CI (Bootstrap r) p-values
-5.4513 [ -10.4839 , -0.2628 ] [ -10.2057 , -1.1977 ] 0.0233
0.0058 [ -0.0065 , 0.0181 ] [ -0.0066 , 0.0192 ] 0.4139
0.0563 [ 0.1442 , 0.236 ] [ 0.155 , 0.2292 ] 0.0000
-0.0450 [ 0.038 , 0.0752 ] [ 0.0369 , 0.0758 ] 0.0002
0.0334 [ -0.0696 , -0.0216 ] [ -0.0669 , -0.0231 ] 0.8108
-5.4513 [ -0.2617 , 0.3228 ] [ -0.2253 , 0.291 ] 0.0233

All of the models yield the same result, showing there is a violation in the age variable but all other variables are significant.

kable(round(cbind(btc.wd, btr.wd),4), caption="width of the two bootstrap confidence intervals")
width of the two bootstrap confidence intervals
btc.wd btr.wd
10.2211 9.0080
0.0245 0.0258
0.0918 0.0741
0.0372 0.0389
0.0480 0.0438
0.5845 0.5163

We see from the above analysis that the two values are very similar to each other.

11.2 Summary

We see from our findings above that all three of our models come to the same conclusion. We will be using our initial final model where we used a log transformation because they seem to be a little more significant.

Again we see that the final model can be represented as log(price)=3.9245 − 0.0246×Average_Drive + 0.1866×Greens_on_reg + 0.0387×Save_percent − 0.0181×Number_events + 0.1826×Age_Above_30TRUE

From the above data we can also conclude that as average winnings go up our average drive goes down by -2.5% while other variables such as greens on regulation and save percent go up.

12 Conclusion

The main conclusion we can draw is that while driving distance in golf is perceived to be a big indication of whether a person is winning or not we do not see this in our analysis. While the Average Winnings goes up the average drive goes down by -2.5% while Save Percentage goes up by 3.9% and Greens on Regulation goes up by 1.9% which shows that the “short” game in golf (putting and chipping) is more important than driving in regards to average money won.

We see one flaw in our analysis and it was that age was not significant which could mess up our analysis. in the future we could push our age up or down for the true or false statement to see if that changes the significance. We could also look at some values of age to see if there are any outliers, for example golfers are not younger than 20 but they can be 45+ which could mess with our significance.

We can use this model for seasons to come to see if things may have changed. For example since golf has evolved recently to players having longer drives we could run this model again in 2024 to see if the “short” game is less likely to win more money than the “long” game would.

---
title: "Factors Influence Golf Earnings"
author: 'Tyler Battaglini'
date: "2024-09-19"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    fig_width: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
always_allow_html: true
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-weight: bold;
    font-family: system-ui;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# Detect, install and load packages if needed.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("nleqslv")) {
   install.packages("nleqslv")
   library(nleqslv)
}
#
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}

if (!require("psych")) {   
  install.packages("psych")
   library(psych)
}
if (!require("MASS")) {   
  install.packages("MASS")
   library(MASS)
}
if (!require("ggplot2")) {   
  install.packages("ggplot2")
   library(ggplot2)
}
if (!require("GGally")) {   
  install.packages("GGally")
   library(GGally)
}
if (!require("car")) {   
  install.packages("car")
   library(car)
}
if (!require("dplyr")) {   
  install.packages("dplyr")
   library(dplyr)
}
if (!require("caret")) {   
  install.packages("caret")
   library(caret)
}

# specifications of outputs of code in code chunks
knitr::opts_chunk$set(echo = TRUE,      # include code chunk in the output file
                      warnings = FALSE,  # sometimes, you code may produce warning messages,
                                         # you can choose to include the warning messages in
                                         # the output file. 
                      messages = FALSE,  #
                      results = TRUE,
                      
                      comment = NA       # you can also decide whether to include the output
                                         # in the output file.
                      )   
```


### Introduction

## Description
The PGA 2004 dataset contains performance statistics and winnings data for 196 participants in the PGA (Proffesional Golfers Association) during the 2004 season. The dataset provides information on several key aspects of player performance and earnings such as:

Name : The name of each golfer.
Age : The age of the player.
Average Drive : The average driving distance of the player in yards
Driving Accuracy: The percentage of drives that land on the fairway, 
Greens in Regulation : The percentage of greens reached in regulation, meaning the number of strokes used by the player to land the ball on the green is two or more less than par.
Average Number of Putts : The average number of putts per round
Save Percentage : The percentage of times a player saves par or better from around the green 
Money Rank : The rank of the player based on total earnings for the season.
Number of Events : The total number of events the player participated in during the season.
Total Winnings : The total amount of money earned by the player over the course of the season.
Average Winnings : The average amount of money earned per event by the player.

## Research Question
We want to investigate what variables affect the players winnings of this given season. We will look at mainly the average drive and how it correlates to winnings.

## Data Preperation

```{r}

url <- "https://users.stat.ufl.edu/~winner/data/pga2004.dat"
pga_data <- read.table(url, header = FALSE, fill = TRUE)
pga_data$V1 <- NULL
colnames(pga_data) <- c("Player_name", "Player_Age", "Average_Drive", "Driving_Accuracy", "Greens_on_reg", 
                        "Average_number_putts", "Save_Percent", "Money_Rank",
                        "Number_events", "Total_Winnings", "Average_Winnings")
head(pga_data)
```

We take a broad look at the data and see there was two categorical variables one of which having the first name of a player and then a second one with the last name. We void the first name and just look at the last names.

```{r}
pander(head(pga_data))

```
```{r}
cor(pga_data[, c("Average_Drive" , "Driving_Accuracy" , "Greens_on_reg" , "Average_number_putts" , "Save_Percent" , "Money_Rank" , "Number_events" , "Total_Winnings" , "Average_Winnings")], use = "pairwise.complete.obs")


```
We see that a lot of our variables have high correlation which can cause multicollinearity. We must deal with this issue before doing our analysis. 

```{r}
colSums(is.na(pga_data))
pga_data_clean <- na.omit(pga_data)
```

For the sake of missing observations we will be removing it for our data analysis because we have a plentiful amount of observations.

```{r}
ggplot(pga_data_clean, aes(x = Average_Drive, y = Average_Winnings)) +
  geom_point() +
  geom_smooth(method = "lm", col = "green") +
  labs(title = "Scatter Plot of Average Drive vs. Average Winnings", x = "Average Drive", y = "Average Winnings")
str(pga_data_clean)
remove_low_drives <- function(pga_data_clean) {
  filtered_data <- pga_data_clean %>%
    filter(Average_Drive >= 150)
  
  return(filtered_data)
}

# Example usage
pga_data_filtered <- remove_low_drives(pga_data_clean)
ggplot(pga_data_filtered, aes(x = Average_Drive, y = Average_Winnings)) +
  geom_point() +
  geom_smooth(method = "lm", col = "green") +
  labs(
    title = "Scatter Plot of Average Drive vs. Average Winnings (Filtered)",
    x = "Average Drive",
    y = "Average Winnings"
  ) +
  theme_minimal() 
```


We look at an initial graph of average winnigs vs average drive. It looks like we have a good cluster of data but we have a couple of outliers in our data that is scewing our line.

```{r}
pga_data_filtered$Age_Above_30 <- pga_data_filtered$Player_Age > 30

head(pga_data_filtered$Age_Above_30)
pga_data_filtered <- pga_data_filtered %>% select(-Player_name, -Driving_Accuracy, -Total_Winnings, -Average_number_putts, -Money_Rank, -Player_Age)
kable(head(pga_data_filtered))


```

We remove player name and money rank because we are "ranking" them or sorting them by +/- 30 age. We remove driving accuracy because we have a variable with average drive because that is what we are making assumptions on and also it could cause multicollinearity because driving accuracy and average drive are highly correlated. The same can be said for total winnings versus average winnings and also save percent and number of putts. They are both correlated and can cause multicollinearity.

### Model Bulding

## Linear Model

```{r}
colnames(pga_data_filtered)
full.model = lm(Average_Winnings ~  Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")


```
We transform the age category into a true or false of payers with above or below age 30. All of our values are significant besides Age.
```{r}
par(mfrow=c(2,2))
plot(full.model)


```

It looks like we have faults in our assumptions. Based on the QQ plot there seems to be a non-normal distribution with several outliers. There also seems to be non-constant variance.

```{r}
vif(full.model)
barplot(vif(full.model), main = "VIF Values", horiz = FALSE, col = "steelblue")
```

We see in the vif values above that all of them are below 5 which shows little to no multicollineairty so we can continue.

## Box Cox Transformation

We perform this transformation because our data has non constant variance

```{r}
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
##
boxcox(Average_Winnings ~ log(Average_Drive) + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": log Average Drive")))
##
boxcox(Average_Winnings ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": Average Drive")))
##
boxcox(Average_Winnings ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + log(1+Age_Above_30), data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
       xlab=expression(paste(lambda, ": log-age")))
##
boxcox(Average_Winnings ~ log(Average_Drive) + Greens_on_reg +  Save_Percent + Number_events + log(1+Age_Above_30), data = pga_data_filtered, lambda = seq(-1, 1, length = 10), 
      xlab=expression(paste(lambda, ": log-age, log Average Drive")))

```

Values of lamda are around 0 so we shoudl use a log transformation
## Square Root Model

```{r}
sqrt.winnings.log.drive = lm((Average_Winnings)^0.5 ~ Greens_on_reg +  Save_Percent + Number_events + Age_Above_30 + log(Average_Drive), data = pga_data_filtered)
kable(summary(sqrt.winnings.log.drive)$coef, caption = "log-transformed model")


```

```{r}
par(mfrow = c(2,2))
plot(sqrt.winnings.log.drive)

```

This model worsened our QQ-plot normality and our non constant variance remained the same, so we can try another model to see if it helps.

## Log Transformation

```{r}
log.winnings = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg +  Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
kable(summary(log.winnings)$coef, caption = "log-transformed model")


```

```{r}
par(mfrow = c(2,2))
plot(log.winnings)

```

This model looks drastically better then the other two models. While normality still cannot be assumed the QQ Plot looks better. The same thing has happened with our Residual plot while constant variance can still not be assumed it is drastically better.

## Comparison of models

```{r}

par(pty = "s", mfrow = c(1, 3))

qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)

qqnorm(log.winnings$residuals, main = "Log Winnings")
qqline(log.winnings$residuals)

qqnorm(sqrt.winnings.log.drive$residuals, main = "sqrt winnings log drive")
qqline(sqrt.winnings.log.drive$residuals)


```

We can see that Log Winnings is the best model for our data.

## Goodness of Fit Comparison

```{r}
select=function(m){ 
 e = m$resid                           
 n0 = length(e)                        
 SSE=(m$df)*(summary(m)$sigma)^2       
 R.sq=summary(m)$r.squared             
 R.adj=summary(m)$adj.r               
 MSE=(summary(m)$sigma)^2              
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  
 X=model.matrix(m)                     
 H=X%*%solve(t(X)%*%X)%*%t(X)          
 d=e/(1-diag(H))                       
 PRESS=t(d)%*%d   
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
 }

```

```{r}

output.sum = rbind(select(full.model), select(sqrt.winnings.log.drive), select(log.winnings))
row.names(output.sum) = c("full.model", "sqrt.winnings.log.drive", "log.winnings")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")

```

Even though the R, R squared, and Cp look better in the other two models our value are still relatively good for log.winnings. Log.winnings has better residuals, QQ plot, but less goodness of fit measures but they are not significantly bad so we will stick with the log model.

## Final Model

```{r}
kable(summary(log.winnings)$coef, caption = "Inferential Statistics of Final Model")
```

While we do have a large sample of 196 and all of our pvalues are close to 0 we do have one variable "Age" which is not significant.

### Summary of Model
The final model can be represented as log(price)=3.9245 − 0.0246×Average_Drive + 0.1866×Greens_on_reg + 
0.0387×Save_percent − 0.0181×Number_events + 0.1826×Age_Above_30TRUE

From the above data we can also conclude that as average winnings go up our average drive goes down by -2.5% while other variables such as greens on regulation and save percent go up.

```{r}
log.winnings = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)

B = 1000  #

num.p = length(coef(log.winnings))  # number of parameters in the model (includes intercept)
smpl.n = nrow(pga_data_filtered)       # sample size

# Matrix to store bootstrap coefficients
coef.mtrx = matrix(0, nrow = B, ncol = num.p)

# Bootstrap loop
for (i in 1:B) {
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE)  # Resample with replacement
  boot_data = pga_data_filtered[bootc.id, ]  # Create the bootstrap sample
  log.winnings.btc = lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = boot_data)
  coef.mtrx[i, ] = coef(log.winnings.btc)  # Store the coefficients
}

```

### Bootstrapping Final Model

```{r}

boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){

  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))

  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
  
}

```

```{r}
par(mfrow=c(2,3))  
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Average Drive" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Greens On Regulation" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Save Percentage" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Number of Events" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="Age Above 30" )
```

We see from our histograms that they look to have a equal distribution. The blue curve represents the p-values reported. While the blue curve represents the bootstrap intervals. They are very similar.

## 95% CI

```{r}

num.p = dim(coef.mtrx)[2]  

btc.ci = NULL
btc.wd = NULL

for (i in 1:num.p) {
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2), 8)
  uci.975 = round(quantile(coef.mtrx[, i], 0.975, type = 2), 8)
  btc.wd[i] = uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}

mean.coefs = apply(coef.mtrx, 2, mean)

results = as.data.frame(cbind(Mean_Coef = formatC(mean.coefs, 4, format = "f"), btc.ci.95 = btc.ci))

kable(results, caption = "Regression Coefficient Matrix with Bootstrap Confidence Intervals")

```

We can see that our values are consistent with the p-values that we got. Our 95% CI spans over 0 in the Age variable which matches up with our p-value not being significant

```{r}
hist(sort(log.winnings$residuals),n=40,
     xlab="Residuals",
     col = "lightblue",
     border="navy",
     main = "Histogram of Residuals")


```

We appear to have a normal distribution but we do appear to have some outliers on either side of the curve.

## Residual Bootstrap Regression

```{r}
log.winnings <- lm(log(Average_Winnings) ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)
model.resid = log.winnings$residuals
B=1000
num.p = dim(model.matrix(log.winnings))[2]   
samp.n = dim(model.matrix(log.winnings))[1]  
btr.mtrx = matrix(rep(0,6*B), ncol=num.p)
for (i in 1:B){

  bt.lg.winnings = log.winnings$fitted.values + 
        sample(log.winnings$residuals, samp.n, replace = TRUE)  
  
  pga_data_filtered$bt.lg.winnings =  bt.lg.winnings   #  send the boot response to the data
  btr.model = lm(bt.lg.winnings ~ Average_Drive + Greens_on_reg + Save_Percent + Number_events + Age_Above_30, data = pga_data_filtered)  
  btr.mtrx[i,]=btr.model$coefficients
}
```

```{r}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){

  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))

  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")             
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    
} 



```

```{r}
par(mfrow=c(2,3))  
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Average Drive" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Greens on Regulation" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Save Percentage" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Number of Events" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="Age" )


```

We again see that our histograms have an equal distribution. Again we see that the p-values and residual boostrap have very similar values.
```{r}
um.p = dim(btr.mtrx)[2]  
btr.ci = NULL
btr.wd = NULL

for (i in 1:num.p) {
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2), 8)
  uci.975 = round(quantile(btr.mtrx[, i], 0.975, type = 2), 8)
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025, 4), ", ", round(uci.975, 4), "]")
}

kable(as.data.frame(cbind(formatC(btr.mtrx[1, ], 4, format = "f"), btr.ci.95 = btr.ci)), 
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")

```

Once again we get the same output as expected. The values above show the same results as our p-values, all being significant except for age which includes 0.

### Combining results

```{r}
p_values <- summary(log.winnings)$coefficients[-3, 4]  

combined_matrix <- cbind(
  Coefficients = formatC(coef(log.winnings)[-3], 4, format = "f"),  
  `95% CI (Bootstrap t)` = btc.ci,                                
  `95% CI (Bootstrap r)` = btr.ci,                                 
  `p-values` = formatC(p_values, 4, format = "f")                  
)


library(knitr)
kable(as.data.frame(combined_matrix), 
      caption = "Final Combined Inferential Statistics: Coefficients, p-values, and Bootstrap CIs")



```

All of the models yield the same result, showing there is a violation in the age variable but all other variables are significant.

```{r}

kable(round(cbind(btc.wd, btr.wd),4), caption="width of the two bootstrap confidence intervals")


```

We see from the above analysis that the two values are very similar to each other.



### Summary

We see from our findings above that all three of our models come to the same conclusion. We will be using our initial final model where we used a log transformation because they seem to be a little more significant. 

Again we see that the final model can be represented as log(price)=3.9245 − 0.0246×Average_Drive + 0.1866×Greens_on_reg + 
0.0387×Save_percent − 0.0181×Number_events + 0.1826×Age_Above_30TRUE

From the above data we can also conclude that as average winnings go up our average drive goes down by -2.5% while other variables such as greens on regulation and save percent go up.

## Conclusion 
The main conclusion we can draw is that while driving distance in golf is perceived to be a big indication of whether a person is winning or not we do not see this in our analysis. While the Average Winnings goes up the average drive goes down by -2.5% while Save Percentage goes up by 3.9% and Greens on Regulation goes up by 1.9% which shows that the "short" game in golf (putting and chipping) is more important than driving in regards to average money won.

We see one flaw in our analysis and it was that age was not significant which could mess up our analysis. in the future we could push our age up or down for the true or false statement to see if that changes the significance. We could also look at some values of age to see if there are any outliers, for example golfers are not younger than 20 but they can be 45+ which could mess with our significance.

We can use this model for seasons to come to see if things may have changed. For example since golf has evolved recently to players having longer drives we could run this model again in 2024 to see if the "short" game is less likely to win more money than the "long" game would.

