Link To RPubs

Link To Posit.Cloud

HW Instructions

Students should create a separate R Markdown file that includes the necessary code and graphics to answer each question. Submit both the R Markdown file and a knitted version in either Word or HTML. Although the homeworks are designed to offer students a jumping-off point for the coding, students are expected to draw on the discussions and R examples from the live sessions.

For problems that include making graphics or performing analysis, students should be sure to articulate their answers in writing unless explicitly told not to. That is, if you provide a figure or table of results then it is expected that commentary will also be provided.

Exercise 1: Conceptual Questions

A. Bootstrap Random Sample

Briefly explain what a bootstrap random sample is.

B. Bootstrap Confidence Intervals

What major problems does the method of bootstrap confidence intervals solve? This can be a general discussion or a specific discussion about regression.

C. Ensemble vs. Bagged Prediction

Briefly explain what the key difference is between an ensemble prediction and a bagged prediction.

Answer for Exercise 1:

A.

A bootstrap random sample is created by randomly selecting data points from the observed dataset, with replacement, to form a sample of the same size as the original. Because the draw is with replacement, some observations appear more than once and others not at all. Recomputing a statistic on many such resamples provides an estimate of its sampling variability.
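
As a minimal illustrative sketch (using a small made-up vector x, not part of the assignment data), a bootstrap resample has the same size as the original sample and is drawn with replacement, so some values repeat while others are left out:

set.seed(1)
x <- c(4, 8, 15, 16, 23, 42)                    # hypothetical original sample
boot_x <- sample(x, size = length(x), replace = TRUE)
boot_x                                          # one bootstrap random sample
mean(boot_x)                                    # the statistic recomputed on the resample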

B.

Bootstrap confidence intervals address challenges such as the assumption of normality and the difficulty of deriving exact sampling distributions, especially for complex statistics or small sample sizes. In regression, they allow for robust confidence interval estimation without relying on traditional assumptions about the error distribution.

C.

The key difference between an ensemble prediction and a bagged prediction lies in how variance is reduced. An ensemble prediction combines predictions from different types of models to reduce error, while a bagged (bootstrap-aggregated) prediction fits the same type of model, such as a decision tree, to many bootstrap resamples of the data and aggregates (typically averages) the resulting predictions, as in the trees underlying a random forest.
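
As a minimal bagging sketch (an illustration only; it assumes the rpart package is available and uses the built-in iris data, neither of which is part of the assignment), several trees are fit to bootstrap resamples of the same data and their predictions are averaged:

library(rpart)
# Fit B regression trees, each to a different bootstrap resample of the data
bagged_trees <- function(formula, data, B = 50) {
  lapply(seq_len(B), function(b) {
    boot_rows <- sample(nrow(data), replace = TRUE)
    rpart(formula, data = data[boot_rows, ])
  })
}
# Average the B tree predictions to obtain the bagged prediction
predict_bagged <- function(trees, newdata) {
  preds <- sapply(trees, predict, newdata = newdata)  # one column per tree
  rowMeans(preds)
}
set.seed(1)
trees <- bagged_trees(Sepal.Length ~ ., data = iris, B = 50)
head(predict_bagged(trees, iris))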

Exercise 2: Estimating Percentiles

Consider the following diamond data set, which has as one of its variables the price of the diamond.

Histogram Summary

The histogram displays a right-skewed distribution of 500 sampled diamond prices, where a majority fall below $5,000, suggesting a market concentration in more affordable diamonds. The distribution extends up to around $20,000, with sparse higher-priced diamonds possibly indicating outliers. The analysis reveals a 90th percentile price of $9,706.60, under which 90% of the sampled prices fall, indicating a threshold for the top 10% of diamonds by price. Bootstrap resampling calculates a 95% confidence interval for this percentile, ranging from $8,640.70 to $10,727.10, offering a statistically robust estimate of where the true 90th percentile lies within the broader population.

Next Step

In many areas of study, the parameter of interest is not the mean or the median but a percentile. For example, suppose a diamond expert is trying to get a handle on the market for high-end diamonds and is interested in what the top 10 percent of diamonds are going for in price. To do this you could calculate the 90th percentile:

quantile(price, probs = .9)
##    90% 
## 9706.6

This states that 90% of the diamonds cost less than $9,706.60. Thus, the top 10% of the most highly priced diamonds are above this number. Unfortunately, like any other statistic, this is just an estimate of the true 90th percentile for the entire population of diamonds. Perform a 95% bootstrap confidence interval for the 90th percentile of diamond prices to reflect the uncertainty in this estimate. Provide the classic statistical interpretation of the interval.

Answer - Exercise 2:

# Load the ggplot2 library for the diamonds dataset
library(ggplot2)
# Set the seed for reproducibility
set.seed(1234)
# Sample 500 diamond prices
price <- sample(diamonds$price, 500)
# Create a histogram of the prices
hist(price, xlab = "Price", main = "")

# Calculate the 90th percentile of the sampled prices
quantile_90 <- quantile(price, probs = .9)
# Perform bootstrap to estimate the 95% confidence interval for the 90th percentile
bootstrap_percentile <- function(data, n_bootstrap, percentile) {
  bootstrap_samples <- replicate(n_bootstrap, sample(data, replace = TRUE))
  percentile_values <- apply(bootstrap_samples, 2, quantile, probs = percentile)
  return(percentile_values)
}
# Number of bootstrap samples
n_bootstrap <- 10000
# Calculate the bootstrap percentiles
bootstrap_results <- bootstrap_percentile(price, n_bootstrap, .9)
# Calculate the confidence interval
CI <- quantile(bootstrap_results, probs = c(0.025, 0.975))
# Display the results
quantile_90
##    90% 
## 9706.6
CI
##    2.5%   97.5% 
##  8640.7 10727.1

Classic Statistical Interpretation

The 95% confidence interval for the 90th percentile of diamond prices, based on our bootstrap analysis, ranges from $8,640.70 to $10,727.10. This means that we are 95% confident that the true 90th percentile of the entire population of diamond prices falls within this interval. In other words, if we were to repeatedly draw samples from the population of all diamond prices and construct a bootstrap interval in this way for each sample, we would expect about 95% of those intervals to capture the true population 90th percentile.
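
As an optional check on that interpretation (a small simulation sketch, not required by the exercise), we can treat the full diamonds$price column as the “population”: each repetition draws a fresh sample of 500 prices, builds a bootstrap interval for the 90th percentile with the bootstrap_percentile function defined above, and records whether that interval covers the population 90th percentile. The observed coverage should come out near 0.95.

set.seed(2024)
# "Population" 90th percentile, taken from the full diamonds data
true_p90 <- quantile(diamonds$price, probs = .9)
covers <- replicate(200, {
  s  <- sample(diamonds$price, 500)            # a fresh sample of 500 prices
  bp <- bootstrap_percentile(s, 1000, .9)      # bootstrap the sample's 90th percentile
  ci <- quantile(bp, probs = c(0.025, 0.975))  # 95% percentile interval
  ci[1] <= true_p90 && true_p90 <= ci[2]       # does the interval cover the truth?
})
mean(covers)                                    # observed coverage rate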

Exercise 3: Regression

During the unit 2 pre-live session, we analyzed the golf data set using feature selection tools. For that discussion, log-transforming the response was helpful for dealing with the constant variance and normality assumptions.

The following scatter plot matrix, in particular its last row, investigates the trend between each predictor and Avg Winnings.

library(GGally)
golf<-read.csv("GolfData2.csv")

# Getting fancy: custom lower-panel function (points plus a loess smoother)
lowerFn <- function(data, mapping, method = "loess", ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(colour = "blue", size = .2) +
    geom_smooth(method = method, color = "red", ...)
}


ggpairs(golf[,c(2:7,11)],lower=list(continuous=lowerFn),progress = F)

ANSWER FOR 1ST PART OF Exercise 3:

Scatter Plot Matrix Interpretation

  1. Age and Avg Winnings: The negative correlation suggested by the loess line indicates that, typically, as players age, their average winnings decrease. This could reflect factors like diminishing physical capabilities or changing priorities with age.

  2. AvgDrive and Avg Winnings: A positive correlation is observed here, suggesting that players with a longer average drive distance tend to earn more. This is intuitive in golf, since longer drives can lead to easier approach shots and potentially better scores.

  3. DriveAcc and Avg Winnings: The positive correlation, albeit weaker than for AvgDrive, implies that while driving accuracy has a positive relationship with winnings, it may not be as critical as how far the drives go.

  4. Greens and Avg Winnings: A clear positive correlation is noted, indicating that players who hit more greens in regulation tend to have higher average winnings. This relationship highlights the importance of good approach shots leading to potential birdie opportunities.

  5. AvgPutts and Avg Winnings: A slight negative correlation is indicated, suggesting that players who take more putts on average tend to win less. This makes sense, as successful putting is crucial in golf to finish holes with fewer strokes.

  6. Save and Avg Winnings: The trend is not distinctly clear, but a slight negative correlation could indicate that players with a higher save percentage might not necessarily earn more. This could be an area for further investigation to understand whether the trend is consistent or due to other variables.

  7. AvgWinnings and Avg Winnings: This plot is redundant, since it compares Avg Winnings against itself and will always show a perfect linear relationship.

A. Assuming the trends are roughly linear and constant variance is the only issue, fit a regression model with all six predictors using a bootstrap procedure. Use the intervals to determine whether all predictors are relevant. Provide an interpretation of the coefficient on the Greens variable. The nice advantage here is that no transformation interpretations are needed.

B. The coefficient on AvgPutts is huge!! It almost seems unrealistic. Try to explain why this interval estimate may still make sense. Hint: Try to offer an interpretation that would relate to a more realistic setting rather than a one unit increase in AvgPutts.

CODE for Exercise 3

Code for A version 1

# Load the necessary library
library(GGally)
# Read the data
golf <- read.csv("GolfData2.csv")
# Define the bootstrap function for linear regression
bootstrap_lm <- function(data, n_bootstrap) {
  coefficients <- matrix(NA, nrow = n_bootstrap, ncol = 6)
  colnames(coefficients) <- names(data)[1:6] # the six predictor columns; AvgWinnings is the 7th column of the subset passed in
  for (i in 1:n_bootstrap) {
    sample_indices <- sample(nrow(data), replace = TRUE)
    sample_data <- data[sample_indices, ]
    fit <- lm(AvgWinnings ~ ., data = sample_data)
    coefficients[i, ] <- coef(fit)[-1] # Exclude the intercept
  }
  return(coefficients)
}
# Number of bootstrap samples
n_bootstrap <- 10000
# Run the bootstrap
set.seed(123) # For reproducibility
bootstrap_results <- bootstrap_lm(golf[, c(2:7, 11)], n_bootstrap)
# Calculate 95% confidence intervals for the 'Greens' variable
CI_greens <- quantile(bootstrap_results[, "Greens"], probs = c(0.025, 0.975))
# Output the results
print(CI_greens)
##      2.5%     97.5% 
## -4585.280 -1211.402
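
Part A asks whether all of the predictors are relevant, but the code above prints only the interval for Greens. A short follow-on sketch (reusing the bootstrap_results matrix computed above) prints a percentile interval for every coefficient; predictors whose intervals exclude zero would be flagged as relevant.

# 95% percentile intervals for all six coefficients, one row per predictor
CI_all <- t(apply(bootstrap_results, 2, quantile, probs = c(0.025, 0.975)))
print(CI_all)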

Code for A Version 2:

# Load the dataset
golf <- read.csv("GolfData2.csv")
# Define a function to perform bootstrap regression and extract coefficients
bootstrap_regression <- function(data, response, predictors, n_bootstrap) {
    coefs <- matrix(NA, nrow = n_bootstrap, ncol = length(predictors))
    colnames(coefs) <- predictors
    
    for (i in 1:n_bootstrap) {
        # Create a bootstrap sample
        sample_data <- data[sample(nrow(data), nrow(data), replace = TRUE), ]
        # Fit the linear model
        model <- lm(as.formula(paste(response, "~", paste(predictors, collapse = "+"))), data = sample_data)
        # Store the coefficients
        coefs[i, ] <- coef(model)[-1] # Exclude the intercept
    }
    
    return(coefs)
}

# Set the number of bootstrap samples
n_bootstrap <- 1000
# Set the seed for reproducibility
set.seed(123)
# Define predictors and response
predictors <- c("Age", "AvgDrive", "DriveAcc", "Greens", "AvgPutts", "Save")
response <- "TotalWinning"
# Perform the bootstrap
bootstrap_coefs <- bootstrap_regression(golf, response, predictors, n_bootstrap)
# Calculate 95% confidence intervals for each coefficient
CI <- apply(bootstrap_coefs, 2, function(x) quantile(x, probs = c(0.025, 0.975)))
# Convert CI to a data frame and assign proper names
CI_matrix <- t(CI)  # Transpose the CI matrix to get predictors as rows
CI_df <- as.data.frame(CI_matrix)
colnames(CI_df) <- c("2.5%", "97.5%")
rownames(CI_df) <- predictors
# Now print the corrected confidence intervals data frame
print(CI_df)
##                   2.5%        97.5%
## Age         -26723.445     17785.32
## AvgDrive    -24541.924     17202.94
## DriveAcc    -91910.504    -18371.29
## Greens      132415.949    304680.75
## AvgPutts -22925133.870 -10357377.32
## Save          9805.452     58416.96
# Extract the confidence intervals for 'Greens' correctly
greens_ci_lower <- CI_df["Greens", "2.5%"]
greens_ci_upper <- CI_df["Greens", "97.5%"]
# Calculate the mean coefficient for 'Greens' from the bootstrap results
greens_coef <- mean(bootstrap_coefs[, "Greens"])
# Print the results using cat()
cat("The coefficient for 'Greens' is", greens_coef, 
    "with a 95% confidence interval of", greens_ci_lower, "to", greens_ci_upper, 
    ". This indicates that hitting more greens in regulation is consistently associated with higher average winnings in this model.")
## The coefficient for 'Greens' is 209929.8 with a 95% confidence interval of 132415.9 to 304680.7 . This indicates that hitting more greens in regulation is consistently associated with higher average winnings in this model.

ANSWER for A

I will now analyze the output of the bootstrap regression analysis that I performed on the golf dataset. This analysis involves using two different bootstrap procedures to estimate the relationship between several predictors (including ‘Greens’) and two different response variables (‘TotalWinning’ and ‘AvgWinnings’).

Bootstrap Regression Using bootstrap_regression (Code Version 2):

  • Objective: Estimate the relationship between predictors (including ‘Greens’) and ‘TotalWinning’.
  • Method: 1000 bootstrap samples used to estimate coefficients.
  • Confidence Intervals for Predictors:
    • The confidence intervals for each predictor, including ‘Greens’, have been calculated.
    • For ‘Greens’, the 95% confidence interval is 132415.9 to 304680.7.

Interpretation of ‘Greens’ Coefficient:

  • The coefficient for ‘Greens’ is 209929.8, with a confidence interval from 132415.9 to 304680.7.
  • This result suggests that hitting more greens in regulation is significantly associated with higher total winnings (‘TotalWinning’).
  • The positive coefficient and its confidence interval do not include zero, indicating a strong and significant relationship.

Bootstrap Regression Using bootstrap_lm (Code Version 1):

  • Objective: Estimate relationship between predictors and ‘AvgWinnings’.
  • Method: 10000 bootstrap samples.
  • Confidence Intervals for ‘Greens’:
    • The 95% confidence interval for the ‘Greens’ variable is -4585.280 to -1211.402.

Interpretation:

  • The negative confidence interval reported for ‘Greens’ in relation to ‘AvgWinnings’ is counterintuitive. It suggests an inverse relationship, meaning more greens hit might be associated with lower average winnings, which contradicts typical expectations in golf.
  • This might indicate an issue in the model, the data, or the bookkeeping in the code itself (for example, coefficient columns labeled out of order would attribute another predictor’s interval to ‘Greens’), so the result is worth re-checking rather than taken at face value.

Overall Analysis:

  • Relevance of Predictors: Looking at the confidence intervals from the bootstrap_regression run (Version 2), the intervals for DriveAcc, Greens, AvgPutts, and Save exclude zero, suggesting those predictors are relevant for predicting ‘TotalWinning’, while the intervals for Age and AvgDrive include zero, so their relevance is not supported by this model. The interval for ‘AvgPutts’ is extremely large in magnitude and entirely negative, which is unusual and is taken up in Part B.
  • Discrepancy in ‘Greens’ Interpretation: The positive relationship of ‘Greens’ with ‘TotalWinning’ and the negative relationship with ‘AvgWinnings’ necessitate a deeper look into the data or model specifications. It could be due to differences in how ‘TotalWinning’ and ‘AvgWinnings’ are calculated or influenced by other factors in the dataset.
  • No Transformation Required: Since we’re assuming linear trends and the main issue is constant variance, transformations for interpretations are not needed. However, the difference in the direction of the ‘Greens’ effect in both models suggests that it might be worth exploring whether the assumptions of linearity and constant variance are indeed valid for this dataset.

This analysis demonstrates the importance of carefully examining model results, especially when they contradict domain knowledge or when different models on the same data yield significantly different interpretations.

NOTE on the 2 Different Codes

In my comparison and consolidation of the two code snippets for my graduate assignment, I observed the following:

In the Version 2 code snippet, I implemented a flexible bootstrap function, bootstrap_regression, which allows for the selection of different predictors and response variables. I conducted the bootstrap analysis using 1,000 samples to estimate the coefficients of my chosen predictors, focusing on “TotalWinning” as the response variable. For each predictor coefficient, I computed confidence intervals and organized the results systematically. Notably, the interval for the ‘Greens’ variable was positive, suggesting a direct and expected relationship with ‘TotalWinning’.

In the Version 1 code snippet, I used a function, bootstrap_lm, tailored to handle ‘AvgWinnings’ as the response variable, operating on a fixed dataset structure. This analysis involved a higher number of bootstrap samples, specifically 10,000, aiming for more stable estimates. My attention was primarily on the ‘Greens’ variable, for which I provided specific confidence intervals. Interestingly, the confidence interval reported for ‘Greens’ turned out negative, implying an inverse relationship with ‘AvgWinnings’, a finding that was counterintuitive and necessitated further examination.

My consolidated interpretation from both snippets led to several insights.

In summary, the Version 2 snippet provided a comprehensive analysis of all predictors and a straightforward interpretation of the ‘Greens’ coefficient, though it modeled ‘TotalWinning’ rather than ‘AvgWinnings’. The Version 1 snippet, which did focus on ‘AvgWinnings’, raised questions due to the unexpected direction of the ‘Greens’ relationship, underscoring the need for careful model checking and interpretation.

Both analyses underscored the importance of precise model specification and considering the practical implications of statistical findings. The contrasting results for ‘Greens’ in both analyses particularly highlighted the necessity of validating model outputs with domain knowledge and the possible need for data transformation or additional exploratory data analysis.

ANSWER FOR B:

Interpreting the ‘AvgPutts’ Coefficient in Golf Performance Analysis

In our regression model, the coefficient for ‘AvgPutts’ (average putts per round) is exceptionally large, with a 95% confidence interval ranging from approximately -22,925,134 to -10,357,377. This figure initially seems unrealistic when contextualized within the sport of golf. However, a deeper exploration reveals a nuanced understanding:

  1. Scale and Significance of ‘AvgPutts’: The scale on which ‘AvgPutts’ is measured is crucial. Among professionals, average putts per round vary over a very narrow range, so a one-unit increase in ‘AvgPutts’ represents a far larger change than is ever observed between players; professional performances are closely clustered. The coefficient describes the effect of that unrealistically large change, which is why its magnitude looks inflated.

  2. Inverse Relationship with Performance: The negative coefficient for ‘AvgPutts’ aligns with the sport’s fundamentals — fewer putts indicate better performance. Therefore, an increase in ‘AvgPutts’ logically correlates with decreased winnings, as it suggests a decline in putting efficiency, a critical aspect of golf.

  3. Contextual Interpretation: Instead of interpreting the coefficient as the impact of a one-unit increase in ‘AvgPutts’, it is more pragmatic to consider smaller, incremental changes. In the precision-driven world of professional golf, even minor improvements or declines in putting can be the difference between winning and losing. Hence, a more granular view, such as the effect of a 0.1 or 0.01 increase in ‘AvgPutts’, offers a more realistic perspective on its impact on total winnings (see the short rescaling sketch after this list).

  4. Statistical Considerations: The exceptionally large interval estimate could result from statistical artifacts such as multicollinearity or data anomalies, including outliers. These factors might exaggerate the coefficient’s magnitude in the linear model. Therefore, it’s crucial to consider these potential influences when interpreting the results.

  5. Practical Implications: This analysis highlights the disproportionate impact of putting skills on a golfer’s success. The coefficient’s size might be indicating that, relative to other aspects of the game, putting performance is a critical determinant of a player’s total winnings.
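
To make point 3 concrete, the reported interval can simply be rescaled to a more realistic change in ‘AvgPutts’. A short sketch, reusing the bootstrap interval reported above (a one-unit increase in average putts is far larger than any difference actually seen between tour players):

# Rescale the 95% CI for a 1.00-putt increase to a 0.01-putt increase in AvgPutts
avgputts_ci_one_unit  <- c(-22925134, -10357377)   # endpoints from the bootstrap output above
avgputts_ci_hundredth <- 0.01 * avgputts_ci_one_unit
avgputts_ci_hundredth   # roughly -229,000 to -104,000

Read this way, an increase of 0.01 in ‘AvgPutts’ is associated with roughly $104,000 to $229,000 lower total winnings in this model, which is large but no longer absurd for professional golf.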

In summary, while the coefficient for ‘AvgPutts’ is extraordinarily large, it becomes more interpretable and meaningful when considered in the context of the sport’s scoring nuances, the importance of putting, and the statistical nature of the model. It underscores the significant influence of putting performance on a golfer’s success and invites a more detailed investigation into both the data and the model’s assumptions.