Introduction

The purpose of this project is to analyse the performance statistics and prize earnings of female players in the Ladies Professional Golf Association (LPGA). Golf is a quite popular sport in the United States, attracting many enthusiasts, with women being around quarter of them1. The LPGA, as the American league for professional female golfers, serves as the main point of interest for this analysis.

Thanks to golf’s detailed nature, characterized by numerous components it results in a large number of player statistics. This project aims to investigate the potential influence of these metrics on player earnings in the LPGA in 2022, as sourced from available data. Given the higher complexity associated with the presence of over 10 metrics, some potentially correlated, an initial step involved employing the Boruta algorithm to check the importance of these variables. Next, the Principal Component Analysis (PCA) technique was applied to reduce the dimensionality of the data.

Furthermore, the analysis seeks to compare two approaches: traditional regression utilizing all accessible variables and regression on a select set of principal components derived from dimensionality reduction in a dataset, prioritizing the most impactful components. This method aims to compare the effectiveness of the reduced data in explaining player earnings in comparison to the complete dataset.

Methods

Boruta algorithm

Boruta is a simple, yet effective feature selection algorithm that allows for classification of important and unimportant variables in higher-dimensional datasets. As it was stated in the paper introducing the Boruta pakage in R: Boruta algorithm is a wrapper built around the random forest classification algorithm implemented in the R package randomForest.

The main idea is to initially generate additional variables, known as shadow features, by shuffling the original independent variables. Then, a random forest model is fitted using the dependent variable from the original dataset along with the newly generated shadow features. After that, the importance of the original variables is compared with the threshold defined as the highest feature importance recorded among the shadow features. Feature is called a hit if it’s importance is higher than the threshold2.

PCA

Principal Component Analysis (PCA) is an Unsupervised Learning method designed to effectively reduce the dimensions of a high-dimensional dataset. By projecting the original data into a PCA reduced space, it retains a significant portion of the essential information. This approach proves particularly advantageous while dealing with datasets featuring numerous variables and/or indicating multicollinearity within the variables.

The method involves going through a few important steps:

  1. Standardizing the data and creating a covariance matrix
  2. Calculating eigenvalues and eigenvectors
  3. Analysing the degree of explained variance, aiding in the selection of the optimal number of eigenvectors
  4. Creating the projection matrix by multiplying the data by the chosen eigenvectors

The outcome of these steps yields k synthetic variables, where k represents the number of selected eigenvectors. These synthetic variables serve as a condensed representation of the original data. One of the most notable usecases involves rotating the loadings of PCA and examining which variables from the initial dataset have the most influence on the created PCA components. This approach enables the attribution of umbrella names to the synthetic variables, that capture their fundamental characteristics.

Golf in general and the LPGA dataset

Let’s start by talking a little about golf. I tried to make a short summary using what I’ve learned in the last few years of getting to know this game.

The essence of golf lies in navigating an 18-hole course with as few strokes as possible, utilizing various clubs like drivers for long-distance shots, irons for medium-distance shots, and putters for precise strokes to roll the ball into the hole. Starting from the tee, players aim to guide the ball accurately to the fairway – the closely mown area between the tee and the green. The ultimate objective is to sink the ball into the hole in the least number of strokes.

Each hole on the course is labeled as par-3, par-4, or par-5, indicating the expected number of shots for players to complete it (3, 4, or 5 shots, respectively). Achieving a score below par is the goal, while going above par is less desirable. Sand traps, strategically placed hazards filled with sand, add an extra challenge, requiring players to navigate their shots skillfully. “Greens in regulation” refers to successfully reaching the putting surface in the expected number of strokes, influencing a player’s overall score. In golf, unlike many other sports, the winner is the individual with the lowest score, representing the fewest shots taken to complete the entire 18-hole course.

With that knowledge, let’s take a closer look at the data that will be analysed.

LPGA <- read.csv("lpga2022.csv", sep=",", dec=".", header=TRUE)

dim(LPGA)
## [1] 158  16
head(LPGA[,c(1:3)])
##              Golfer            Nation        Region
## 1         A Lim Kim Republic of Korea          Asia
## 2       Aditi Ashok             India          Asia
## 3     Agathe Laisne            France        Europe
## 4      Alana Uriell     United States North America
## 5 Albane Valenzuela       Switzerland        Europe
## 6       Alena Sharp            Canada North America
head(LPGA[,c(4:16)])
##   fairways fairAtt fairPct totPutts totRounds avePutts greenReg totPrize events
## 1      962    1395   68.96     3052       108    30.22     74.4   747851     29
## 2      795     996   79.82     2104        79    29.22     63.0   178900     26
## 3      412     561   73.44     1248        41    30.44     64.9    41937     18
## 4      510     729   69.96     1575        55    29.72     65.3    66506     21
## 5      756    1074   70.39     2376        81    30.46     70.7   340107     24
## 6      302     422   71.56      923        35    29.77     70.1    95534     11
##   driveDist sandSaves sandAtt sandPct
## 1    274.74        43      95   45.26
## 2    235.10        53      94   56.38
## 3    254.66        29      62   46.77
## 4    269.51        43      95   45.26
## 5    265.21        27      74   36.49
## 6    265.94        15      30   50.00

Our dataset includes statistics for 158 LPGA players. The first three columns consist of personal details, including the player’s name, nationality, and region of origin. The subsequent columns contain information regarding performance statistics and prize winnings.

Here we have a description of the performance statistics variables:

  • fairways - number of drives that landed on fairways on par-4 or par-5 holes
  • fairAtt - number of drives on par-4 or par-5 holes
  • fairPct - percantage of drives that landed on fairways
  • totPutts - total number of putts made
  • totRounds - total number of rounds played
  • avePutts - average number of putts per round
  • greenReg - percentage of greens hit in regulation
  • totPrize - total official prize winnings, US dollars
  • events - number of tournaments played
  • driveDist - average drive distance on par 4 and 5 holes, yards
  • sandSaves - number of holes with par or better when in sandtrap
  • sandAtt - number of times in sandtrap
  • sandPct - percentage of times a player scored par or better in when in sandtrap

In the analysis, we won’t need the character variables. Therefore, I’m setting the players’ names as the row names and removing the first three columns. From now on, the variable totPrize will now be our dependent variable, and the other 12 will serve as independent variables as we aim to explain the amount of winnings with the players’ performance statistics.

rownames(LPGA) <- LPGA$Golfer

LPGA <- LPGA[, -c(1:3)]
head(LPGA)
##                   fairways fairAtt fairPct totPutts totRounds avePutts greenReg
## A Lim Kim              962    1395   68.96     3052       108    30.22     74.4
## Aditi Ashok            795     996   79.82     2104        79    29.22     63.0
## Agathe Laisne          412     561   73.44     1248        41    30.44     64.9
## Alana Uriell           510     729   69.96     1575        55    29.72     65.3
## Albane Valenzuela      756    1074   70.39     2376        81    30.46     70.7
## Alena Sharp            302     422   71.56      923        35    29.77     70.1
##                   totPrize events driveDist sandSaves sandAtt sandPct
## A Lim Kim           747851     29    274.74        43      95   45.26
## Aditi Ashok         178900     26    235.10        53      94   56.38
## Agathe Laisne        41937     18    254.66        29      62   46.77
## Alana Uriell         66506     21    269.51        43      95   45.26
## Albane Valenzuela   340107     24    265.21        27      74   36.49
## Alena Sharp          95534     11    265.94        15      30   50.00
summary(LPGA)
##     fairways         fairAtt          fairPct         totPutts   
##  Min.   : 224.0   Min.   : 317.0   Min.   :56.21   Min.   : 722  
##  1st Qu.: 464.2   1st Qu.: 655.2   1st Qu.:69.84   1st Qu.:1430  
##  Median : 626.0   Median : 875.5   Median :73.91   Median :1920  
##  Mean   : 624.5   Mean   : 844.7   Mean   :73.67   Mean   :1837  
##  3rd Qu.: 784.8   3rd Qu.:1049.5   3rd Qu.:77.39   3rd Qu.:2274  
##  Max.   :1090.0   Max.   :1395.0   Max.   :87.25   Max.   :3052  
##    totRounds         avePutts        greenReg        totPrize      
##  Min.   : 25.00   Min.   :28.46   Min.   :57.00   Min.   :   3185  
##  1st Qu.: 49.25   1st Qu.:29.72   1st Qu.:66.22   1st Qu.:  96539  
##  Median : 65.50   Median :30.01   Median :68.90   Median : 262617  
##  Mean   : 64.72   Mean   :30.08   Mean   :68.86   Mean   : 522581  
##  3rd Qu.: 79.75   3rd Qu.:30.45   3rd Qu.:71.50   3rd Qu.: 741172  
##  Max.   :108.00   Max.   :31.67   Max.   :77.70   Max.   :4364403  
##      events        driveDist       sandSaves        sandAtt      
##  Min.   :10.00   Min.   :235.1   Min.   : 5.00   Min.   : 25.00  
##  1st Qu.:18.00   1st Qu.:250.8   1st Qu.:24.00   1st Qu.: 58.25  
##  Median :21.00   Median :257.5   Median :32.00   Median : 73.00  
##  Mean   :20.47   Mean   :257.1   Mean   :32.56   Mean   : 70.80  
##  3rd Qu.:24.00   3rd Qu.:263.6   3rd Qu.:41.00   3rd Qu.: 85.75  
##  Max.   :29.00   Max.   :279.2   Max.   :63.00   Max.   :127.00  
##     sandPct     
##  Min.   :19.23  
##  1st Qu.:40.72  
##  Median :45.72  
##  Mean   :45.28  
##  3rd Qu.:50.00  
##  Max.   :66.25

When observing the histogram of the distribution of prize winnings, it becomes evident that it’s not following the normal distribution. Because of that, the regression analysis later on will be conducted using the logarithm of the prize winnings (totPrize). Histograms of both the original and the transformed variable can be observed below.

Analysis

Boruta

The first thing I noticed after initially reviewing the data is that the some of the variables are connected. For instance, there are two variables related to putting, four related to drives, and three related to sand traps, among others. I aimed to investigate whether all of these variables are crucial for modeling prize winnings. To test this hypothesis, I utilized the Boruta algorithm. First I loaded the Boruta package and then applied it to my data.

library(Boruta)
set.seed(123)
boruta.train <- Boruta(totPrize~., data = LPGA, doTrace = 2)
print(boruta.train)
## Boruta performed 91 iterations in 3.461391 secs.
##  12 attributes confirmed important: avePutts, driveDist, events,
## fairAtt, fairPct and 7 more;
##  No attributes deemed unimportant.

The Boruta algorithm classified all of the variables as important, so for now I’m leaving the data as it is.

Correlation of the variables

I’m continuing my analysis with checking the correlations of the variables as some of them are connected. Let’s look at the correlation plot.

library(corrplot)
corrplot.mixed(cor(LPGA[, c("totPrize", "fairPct", "totPutts", "totRounds", "avePutts", "greenReg",
                            "events", "driveDist", "sandSaves", "sandAtt", "sandPct")]),
               tl.pos = 'lt', tl.col = "midnightblue",
               diag = 'u', order="hclust",
               lower="number", upper="circle")

We can observe that indeed, some of the varaibles are highly correlated. With metan package we can also look at the significance of these correlations.

library(metan)
corr <- corr_coef(LPGA[,c("totPrize", "fairPct", "totPutts", "totRounds", "avePutts",
                          "greenReg", "events", "driveDist", "sandSaves", "sandAtt", "sandPct")],
                  use = "pairwise.complete.obs")

plot(corr)

It can be seen that some of independent variables are significantly correlated, which can indicate multicollinearity. When applied to a regression model, it could be harder to distinguish variables’ individual effects. This is a place where PCA could help. Let’s see if it does.

PCA on LPGA data

To begin with, I’m making a subset of the data that contains only the independent variables - the PCA will be performed only on those. Secondly I’m standarizing those variables using the clusterSim package.

LPGA.stats = LPGA[, -8]
head(LPGA.stats)
##                   fairways fairAtt fairPct totPutts totRounds avePutts greenReg
## A Lim Kim              962    1395   68.96     3052       108    30.22     74.4
## Aditi Ashok            795     996   79.82     2104        79    29.22     63.0
## Agathe Laisne          412     561   73.44     1248        41    30.44     64.9
## Alana Uriell           510     729   69.96     1575        55    29.72     65.3
## Albane Valenzuela      756    1074   70.39     2376        81    30.46     70.7
## Alena Sharp            302     422   71.56      923        35    29.77     70.1
##                   events driveDist sandSaves sandAtt sandPct
## A Lim Kim             29    274.74        43      95   45.26
## Aditi Ashok           26    235.10        53      94   56.38
## Agathe Laisne         18    254.66        29      62   46.77
## Alana Uriell          21    269.51        43      95   45.26
## Albane Valenzuela     24    265.21        27      74   36.49
## Alena Sharp           11    265.94        15      30   50.00
library(clusterSim)
LPGA.n <- data.Normalization(LPGA.stats, type="n1")
head(LPGA.n)
##                     fairways    fairAtt     fairPct   totPutts  totRounds
## A Lim Kim          1.6678043  2.1447520 -0.78909676  2.2166790  2.2347252
## Aditi Ashok        0.8426274  0.5896730  1.02959167  0.4871130  0.7372828
## Agathe Laisne     -1.0498442 -1.1057139 -0.03884592 -1.0746048 -1.2248830
## Alana Uriell      -0.5656086 -0.4509438 -0.62163006 -0.4780140 -0.5019798
## Albane Valenzuela  0.6499214  0.8936734 -0.54961937  0.9833598  0.8405547
## Alena Sharp       -1.5933738 -1.6474582 -0.35368333 -1.6675467 -1.5346986
##                     avePutts   greenReg     events  driveDist  sandSaves
## A Lim Kim          0.2301899  1.4026222  1.8809127  1.8884078  0.8946290
## Aditi Ashok       -1.4419942 -1.4853557  1.2190325 -2.3656889  1.7518235
## Agathe Laisne      0.5980704 -1.0040261 -0.5459813 -0.2665433 -0.3054434
## Alana Uriell      -0.6059022 -0.9026935  0.1158988  1.3271332  0.8946290
## Albane Valenzuela  0.6315141  0.4652960  0.7777790  0.8656646 -0.4768823
## Alena Sharp       -0.5222930  0.3132972 -2.0903684  0.9440069 -1.5055157
##                      sandAtt      sandPct
## A Lim Kim          1.2074196 -0.003039633
## Aditi Ashok        1.1575184  1.428733479
## Agathe Laisne     -0.4393201  0.191382795
## Alana Uriell       1.2074196 -0.003039633
## Albane Valenzuela  0.1594943 -1.132234794
## Alena Sharp       -2.0361585  0.607266532

Now, let’s run the PCA using prcomp package, and visualize which variables make up most of the first two principal components.

PCA1 <- prcomp(LPGA.n, center=FALSE, scale.=FALSE)
PCA1$rotation
##                   PC1         PC2          PC3         PC4         PC5
## fairways  -0.36320205  0.13677684 -0.159725381  0.02481068  0.10615564
## fairAtt   -0.37306077 -0.02473003 -0.093941348  0.06172400  0.08818171
## fairPct   -0.06166507  0.65734047 -0.273821674 -0.17704997  0.08355119
## totPutts  -0.37090263 -0.03497738 -0.134055118  0.05766048  0.04799531
## totRounds -0.37199624 -0.02391899 -0.091877766  0.07341895  0.09283949
## avePutts   0.14638876 -0.16325051 -0.596149687 -0.03567235 -0.74616418
## greenReg  -0.22957805 -0.05691666 -0.379252535 -0.62952730  0.18164445
## events    -0.34463078 -0.05674175 -0.094391290  0.31482168 -0.15435967
## driveDist -0.04704323 -0.68644128 -0.007044628 -0.31222437  0.20533343
## sandSaves -0.33238181 -0.02099708  0.305396864 -0.02802398 -0.34255181
## sandAtt   -0.33801597 -0.11786354  0.117005025  0.23536577 -0.15628839
## sandPct   -0.17118921  0.16776265  0.499127760 -0.55240976 -0.40686512
##                    PC6         PC7         PC8          PC9        PC10
## fairways  -0.033876125 -0.10480735  0.29404462 -0.162743721 -0.23564787
## fairAtt   -0.155530416 -0.05372090  0.30645505 -0.161129444 -0.16766439
## fairPct    0.540703610 -0.34208544 -0.07647637  0.063029937  0.02200963
## totPutts  -0.157402106 -0.07261675  0.28206569 -0.237421670 -0.11672696
## totRounds -0.173852117 -0.10807641  0.11320917  0.517791026  0.71599662
## avePutts   0.008417142 -0.06041796  0.16744271 -0.009820879  0.06635061
## greenReg  -0.114794310  0.50042200 -0.32300532  0.025386136 -0.04811238
## events    -0.195844806 -0.33113389 -0.74761991  0.014850673 -0.19638297
## driveDist  0.398957745 -0.47430369  0.01749718  0.023779462 -0.03902518
## sandSaves  0.286170827  0.23594474  0.11613115  0.586884570 -0.42131907
## sandAtt    0.532403437  0.37196149 -0.12580736 -0.454427435  0.36289247
## sandPct   -0.226239051 -0.26351832 -0.02172419 -0.258142813  0.18327801
##                   PC11          PC12
## fairways  -0.794465959 -0.0139130308
## fairAtt    0.420997937 -0.6982117150
## fairPct    0.176370047  0.0002310589
## totPutts   0.390199044  0.7111607223
## totRounds -0.055786449  0.0018512672
## avePutts  -0.023354672 -0.0410988513
## greenReg   0.004472618 -0.0033143149
## events    -0.013214585 -0.0160896471
## driveDist -0.018771274 -0.0034039175
## sandSaves  0.043769781  0.0527064694
## sandAtt   -0.035077843 -0.0357512543
## sandPct   -0.029531026 -0.0229583818
library(factoextra)
fviz_pca_var(PCA1, col.var="contrib")

On this plot we can clearly see that several variables, such as fairways, events and sandAtt are significant in the first principal component. On the other hand, the variables of importance to the second principal component are fairPct and driveDist. This can be better seen on the plot below.

library(gridExtra)
var <- get_pca_var(PCA1)
a <- fviz_contrib(PCA1, "var", axes=1, xtickslab.rt=90)
b <- fviz_contrib(PCA1, "var", axes=2, xtickslab.rt=90)

grid.arrange(a, b, top = 'Contribution to the first two Principal Components')

Let’s examine the explained variance and cumulative explained variance for the generated components.

fviz_eig(PCA1, barcolor = "tomato3", barfill = "tomato3", addlabels = TRUE, main = "Degree of explained variance for created components")

summary(PCA1)
## Importance of components:
##                           PC1    PC2    PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.6335 1.3031 1.2138 0.94750 0.7082 0.52572 0.37589
## Proportion of Variance 0.5779 0.1415 0.1228 0.07481 0.0418 0.02303 0.01177
## Cumulative Proportion  0.5779 0.7194 0.8422 0.91702 0.9588 0.98185 0.99363
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.23305 0.09813 0.09310 0.06017 0.01548
## Proportion of Variance 0.00453 0.00080 0.00072 0.00030 0.00002
## Cumulative Proportion  0.99815 0.99896 0.99968 0.99998 1.00000

I would consider the result satisfactory – the first 4 components (out of 12) account for more than 90%, and the first 5 components account for more than 95% of the explained variance. This suggests that a substantial portion of the other components might be noise in the data. Therefore, for my following analysis, I will retain the first 5 components.

Let’s look at the significant loadings of the first 5 principal components using the psych package. The cutoff level is set at 0.5.

library(psych)

PCA5 <- principal(LPGA.n, nfactors = 5, rotate = "varimax")
print(loadings(PCA5), digits = 3, cutoff = 0.5, sort = TRUE)
## 
## Loadings:
##           RC1    RC2    RC3    RC4    RC5   
## fairways   0.894                            
## fairAtt    0.937                            
## totPutts   0.940                            
## totRounds  0.938                            
## events     0.970                            
## sandSaves  0.773         0.575              
## sandAtt    0.901                            
## fairPct           0.892                     
## driveDist        -0.895                     
## sandPct                  0.937              
## greenReg                        0.875       
## avePutts                               0.935
## 
##                  RC1   RC2   RC3   RC4   RC5
## SS loadings    6.032 1.682 1.383 1.368 1.041
## Proportion Var 0.503 0.140 0.115 0.114 0.087
## Cumulative Var 0.503 0.643 0.758 0.872 0.959

Now it’s evident which variables contribute the most to each created component. I will assign an umbrella name to each of them:

  • PC1: “OverallPerformace” - incorporates many of the golf performance variables
  • PC2: “DrivesAndFairways” - primarily consists of the percentage of drives that landed on fairways and the average drive distance
  • PC3: “Sandtraps” - consist mainly of variables concerning the sandtraps
  • PC4: “GreensInRegulation” - mainly consists of percentage of greens hit in regulation
  • PC5: “Putting” - mainly consists of average number of putts per round

With this in mind, I’m creating a dataset that will store the prize winnings and the first five principal components with assigned names.

LPGA.PCA.5 <- PCA1$x[,1:5]
LPGA.PCA.5.winnings <- as.data.frame(cbind(LPGA[,8], LPGA.PCA.5))

colnames(LPGA.PCA.5.winnings) <- c("Winnings", "OverallPerformance",
                                   "DrivesAndFairways", "Sandtraps",
                                   "GreensInRegulation", "Putting")
head(LPGA.PCA.5.winnings)
##                   Winnings OverallPerformance DrivesAndFairways  Sandtraps
## A Lim Kim           747851          -4.741031        -2.1566379 -1.2013252
## Aditi Ashok         178900          -2.441487         2.6839485  2.1031780
## Agathe Laisne        41937           2.386438         0.1888567  0.5672855
## Alana Uriell         66506           0.087225        -1.3749948  1.5093632
## Albane Valenzuela   340107          -1.237693        -1.3156647 -1.5708788
## Alena Sharp          95534           4.016843        -0.4031349  0.8589521
##                   GreensInRegulation    Putting
## A Lim Kim                -0.02250439  0.3006152
## Aditi Ashok               1.50024736 -0.9107657
## Agathe Laisne             0.08218829 -0.8811531
## Alana Uriell              0.47657666 -0.1724692
## Albane Valenzuela         0.62250700  0.4972045
## Alena Sharp              -2.19137268  0.9835123

Regression

In this section, I aim to compare two regression models – one utilizing the original variables and the second employing the PCA-reduced data. In both of those, the dependent variable will be the logarithm of the prize winnings.

# original variables

model <- lm(log(totPrize) ~ fairways+ fairPct + totPutts + totRounds + avePutts +
                     greenReg + events + driveDist + sandSaves +
                     sandAtt + sandPct, data = LPGA)

summary(model) 
## 
## Call:
## lm(formula = log(totPrize) ~ fairways + fairPct + totPutts + 
##     totRounds + avePutts + greenReg + events + driveDist + sandSaves + 
##     sandAtt + sandPct, data = LPGA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68499 -0.36361 -0.00129  0.32954  1.37023 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.196675   3.643491   4.720 5.48e-06 ***
## fairways    -0.001475   0.002824  -0.522    0.602    
## fairPct      0.018345   0.023599   0.777    0.438    
## totPutts     0.001662   0.001118   1.487    0.139    
## totRounds    0.017320   0.020319   0.852    0.395    
## avePutts    -0.564829   0.123848  -4.561 1.07e-05 ***
## greenReg     0.151592   0.023075   6.569 8.32e-10 ***
## events      -0.156574   0.037091  -4.221 4.25e-05 ***
## driveDist   -0.002337   0.007797  -0.300    0.765    
## sandSaves    0.009488   0.027626   0.343    0.732    
## sandAtt      0.008690   0.013257   0.656    0.513    
## sandPct      0.001505   0.019353   0.078    0.938    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5262 on 146 degrees of freedom
## Multiple R-squared:  0.8531, Adjusted R-squared:  0.842 
## F-statistic: 77.08 on 11 and 146 DF,  p-value: < 2.2e-16
# 5 principal components

model.PCA.5 <- lm(log(Winnings) ~ OverallPerformance + DrivesAndFairways +
                    Sandtraps +  GreensInRegulation + Putting,
                  data = LPGA.PCA.5.winnings)

summary(model.PCA.5)
## 
## Call:
## lm(formula = log(Winnings) ~ OverallPerformance + DrivesAndFairways + 
##     Sandtraps + GreensInRegulation + Putting, data = LPGA.PCA.5.winnings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8266 -0.3988 -0.0385  0.3467  1.8757 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        12.43404    0.04533 274.276  < 2e-16 ***
## OverallPerformance -0.39980    0.01727 -23.151  < 2e-16 ***
## DrivesAndFairways   0.04671    0.03490   1.338    0.183    
## Sandtraps          -0.03365    0.03747  -0.898    0.371    
## GreensInRegulation -0.49951    0.04800 -10.407  < 2e-16 ***
## Putting             0.44822    0.06422   6.980 8.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5698 on 152 degrees of freedom
## Multiple R-squared:  0.8207, Adjusted R-squared:  0.8148 
## F-statistic: 139.1 on 5 and 152 DF,  p-value: < 2.2e-16

Reviewing the outcomes, the regression model fitted on the PCA-reduced data shows an adjusted R-squared value of approximately 82%, while the model fitted on the original variables has a value around 84%. Considering that the model with the reduced data uses only 5 of the 12 generated components, this is a satisfactory result. Moreover, the model with the original variables identifies only 4 out of 12 variables as statistically significant (p-value below 0.05). Many of these may need to be discarded, potentially affecting the overall result negatively. This could be attributed to multicollinearity in the data and the presence of noise.

Examining the model fitted on the reduced data, 4 out of 6 values are statistically significant, and the p-values of the others are not exceedingly high. This could indicate that fewer adjustments may need to be made.

Summary

This analysis illustrates the application of PCA on high-dimensional data. It demonstrates that in datasets where multicollinearity among variables is possible, dimensionality reduction can assist in minimizing noise and retaining the most significant information within a smaller set of variables. The regression model conducted on the 5 out of 12 PCA components, generated from the reduction, yielded a nearly identical adjusted R-squared value compared to the model fitted on all independent variables.

Sources: