The purpose of this project is to analyse the performance statistics and prize earnings of female players in the Ladies Professional Golf Association (LPGA). Golf is a popular sport in the United States, attracting many enthusiasts, around a quarter of whom are women [1]. The LPGA, as the American league for professional female golfers, serves as the main point of interest for this analysis.
Golf’s detailed nature, with its many measurable components, produces a large number of player statistics. This project investigates the potential influence of these metrics on player earnings in the LPGA in the 2022 season, based on the available data. Given the complexity introduced by more than 10 metrics, some of them potentially correlated, an initial step employed the Boruta algorithm to assess the importance of these variables. Next, the Principal Component Analysis (PCA) technique was applied to reduce the dimensionality of the data.
Furthermore, the analysis compares two approaches: traditional regression using all available variables, and regression on a small set of principal components obtained through dimensionality reduction, keeping only the most impactful components. The goal is to assess how well the reduced data explains player earnings compared to the complete dataset.
Boruta is a simple yet effective feature selection algorithm that classifies variables in higher-dimensional datasets as important or unimportant. As stated in the paper introducing the Boruta package in R: “Boruta algorithm is a wrapper built around the random forest classification algorithm implemented in the R package randomForest.”
The main idea is to first generate additional variables, known as shadow features, by shuffling the original independent variables. A random forest model is then fitted on the dependent variable from the original dataset together with the newly generated shadow features. Finally, the importance of each original variable is compared with a threshold, defined as the highest feature importance recorded among the shadow features. A feature is called a hit if its importance exceeds the threshold [2].
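To make the idea concrete, here is a minimal, single-iteration sketch of the shadow-feature comparison. The real Boruta algorithm repeats this many times and applies statistical tests to the accumulated hits; the function name shadow_hit_check and the use of randomForest’s permutation importance are my own illustration, not Boruta’s actual code.

library(randomForest)

shadow_hit_check <- function(X, y) {
  # shadow features: each original column, independently permuted
  shadows <- as.data.frame(lapply(X, sample))
  names(shadows) <- paste0("shadow_", names(X))
  # random forest on originals plus shadows, with permutation importance
  rf <- randomForest(x = cbind(X, shadows), y = y, importance = TRUE)
  imp <- importance(rf, type = 1)[, 1]
  # threshold: best importance achieved by any shadow feature
  threshold <- max(imp[grepl("^shadow_", names(imp))])
  # a "hit" is an original feature that beats the best shadow
  imp[!grepl("^shadow_", names(imp))] > threshold
}
# e.g. shadow_hit_check(LPGA[, -8], log(LPGA$totPrize))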
Principal Component Analysis (PCA) is an unsupervised learning method designed to effectively reduce the dimensions of a high-dimensional dataset. By projecting the original data into a PCA-reduced space, it retains a significant portion of the essential information. This approach proves particularly advantageous when dealing with datasets featuring numerous variables and/or exhibiting multicollinearity among the variables.
The method involves going through a few important steps:
1. Standardize the variables so they contribute on a comparable scale.
2. Compute the covariance (or correlation) matrix of the standardized data.
3. Find the eigenvalues and eigenvectors of that matrix.
4. Sort the eigenvectors by their eigenvalues and select the top k.
5. Project the original data onto the selected eigenvectors.
The outcome of these steps yields k synthetic variables, where k is the number of selected eigenvectors. These synthetic variables serve as a condensed representation of the original data. One of the most notable use cases involves rotating the PCA loadings and examining which variables from the initial dataset most influence the created components. This makes it possible to assign umbrella names to the synthetic variables that capture their fundamental characteristics.
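As a quick illustration of the steps above, a minimal sketch on synthetic placeholder data (prcomp, used later in this analysis, wraps all of this into a single call):

set.seed(1)
X_raw <- matrix(rnorm(200 * 5), ncol = 5)   # placeholder data
X   <- scale(X_raw)                         # step 1: standardize the variables
eig <- eigen(cov(X))                        # steps 2-3: eigen-decompose the covariance matrix
k   <- 2                                    # step 4: number of components to keep
scores <- X %*% eig$vectors[, 1:k]          # step 5: project onto the top-k eigenvectors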
Let’s start by talking a little about golf. I tried to make a short summary using what I’ve learned in the last few years of getting to know this game.
The essence of golf lies in navigating an 18-hole course with as few strokes as possible, utilizing various clubs like drivers for long-distance shots, irons for medium-distance shots, and putters for precise strokes to roll the ball into the hole. Starting from the tee, players aim to guide the ball accurately to the fairway – the closely mown area between the tee and the green. The ultimate objective is to sink the ball into the hole in the least number of strokes.
Each hole on the course is labeled as par-3, par-4, or par-5, indicating the expected number of shots for players to complete it (3, 4, or 5 shots, respectively). Achieving a score below par is the goal, while going above par is less desirable. Sand traps, strategically placed hazards filled with sand, add an extra challenge, requiring players to navigate their shots skillfully. A “green in regulation” means reaching the putting surface at least two strokes under par for the hole (for example, in two shots on a par-4), leaving two putts to make par; it strongly influences a player’s overall score. In golf, unlike many other sports, the winner is the individual with the lowest score, representing the fewest shots taken to complete the entire 18-hole course.
With that knowledge, let’s take a closer look at the data that will be analysed.
LPGA <- read.csv("lpga2022.csv", sep=",", dec=".", header=TRUE)
dim(LPGA)
## [1] 158 16
head(LPGA[,c(1:3)])
## Golfer Nation Region
## 1 A Lim Kim Republic of Korea Asia
## 2 Aditi Ashok India Asia
## 3 Agathe Laisne France Europe
## 4 Alana Uriell United States North America
## 5 Albane Valenzuela Switzerland Europe
## 6 Alena Sharp Canada North America
head(LPGA[,c(4:16)])
## fairways fairAtt fairPct totPutts totRounds avePutts greenReg totPrize events
## 1 962 1395 68.96 3052 108 30.22 74.4 747851 29
## 2 795 996 79.82 2104 79 29.22 63.0 178900 26
## 3 412 561 73.44 1248 41 30.44 64.9 41937 18
## 4 510 729 69.96 1575 55 29.72 65.3 66506 21
## 5 756 1074 70.39 2376 81 30.46 70.7 340107 24
## 6 302 422 71.56 923 35 29.77 70.1 95534 11
## driveDist sandSaves sandAtt sandPct
## 1 274.74 43 95 45.26
## 2 235.10 53 94 56.38
## 3 254.66 29 62 46.77
## 4 269.51 43 95 45.26
## 5 265.21 27 74 36.49
## 6 265.94 15 30 50.00
Our dataset includes statistics for 158 LPGA players. The first three columns consist of personal details, including the player’s name, nationality, and region of origin. The subsequent columns contain information regarding performance statistics and prize winnings.
Here we have a description of the performance statistics variables:
- fairways – number of fairways hit
- fairAtt – number of fairway attempts
- fairPct – percentage of fairways hit
- totPutts – total number of putts
- totRounds – total number of rounds played
- avePutts – putting average
- greenReg – percentage of greens hit in regulation
- totPrize – total prize winnings (USD)
- events – number of events entered
- driveDist – average driving distance (yards)
- sandSaves – number of successful sand saves
- sandAtt – number of sand save attempts
- sandPct – percentage of successful sand saves
In the analysis, we won’t need the character variables. Therefore, I’m setting the players’ names as the row names and removing the first three columns. From now on, the variable totPrize will be our dependent variable, and the other 12 will serve as independent variables, as we aim to explain the amount of winnings with the players’ performance statistics.
rownames(LPGA) <- LPGA$Golfer
LPGA <- LPGA[, -c(1:3)]
head(LPGA)
## fairways fairAtt fairPct totPutts totRounds avePutts greenReg
## A Lim Kim 962 1395 68.96 3052 108 30.22 74.4
## Aditi Ashok 795 996 79.82 2104 79 29.22 63.0
## Agathe Laisne 412 561 73.44 1248 41 30.44 64.9
## Alana Uriell 510 729 69.96 1575 55 29.72 65.3
## Albane Valenzuela 756 1074 70.39 2376 81 30.46 70.7
## Alena Sharp 302 422 71.56 923 35 29.77 70.1
## totPrize events driveDist sandSaves sandAtt sandPct
## A Lim Kim 747851 29 274.74 43 95 45.26
## Aditi Ashok 178900 26 235.10 53 94 56.38
## Agathe Laisne 41937 18 254.66 29 62 46.77
## Alana Uriell 66506 21 269.51 43 95 45.26
## Albane Valenzuela 340107 24 265.21 27 74 36.49
## Alena Sharp 95534 11 265.94 15 30 50.00
summary(LPGA)
## fairways fairAtt fairPct totPutts
## Min. : 224.0 Min. : 317.0 Min. :56.21 Min. : 722
## 1st Qu.: 464.2 1st Qu.: 655.2 1st Qu.:69.84 1st Qu.:1430
## Median : 626.0 Median : 875.5 Median :73.91 Median :1920
## Mean : 624.5 Mean : 844.7 Mean :73.67 Mean :1837
## 3rd Qu.: 784.8 3rd Qu.:1049.5 3rd Qu.:77.39 3rd Qu.:2274
## Max. :1090.0 Max. :1395.0 Max. :87.25 Max. :3052
## totRounds avePutts greenReg totPrize
## Min. : 25.00 Min. :28.46 Min. :57.00 Min. : 3185
## 1st Qu.: 49.25 1st Qu.:29.72 1st Qu.:66.22 1st Qu.: 96539
## Median : 65.50 Median :30.01 Median :68.90 Median : 262617
## Mean : 64.72 Mean :30.08 Mean :68.86 Mean : 522581
## 3rd Qu.: 79.75 3rd Qu.:30.45 3rd Qu.:71.50 3rd Qu.: 741172
## Max. :108.00 Max. :31.67 Max. :77.70 Max. :4364403
## events driveDist sandSaves sandAtt
## Min. :10.00 Min. :235.1 Min. : 5.00 Min. : 25.00
## 1st Qu.:18.00 1st Qu.:250.8 1st Qu.:24.00 1st Qu.: 58.25
## Median :21.00 Median :257.5 Median :32.00 Median : 73.00
## Mean :20.47 Mean :257.1 Mean :32.56 Mean : 70.80
## 3rd Qu.:24.00 3rd Qu.:263.6 3rd Qu.:41.00 3rd Qu.: 85.75
## Max. :29.00 Max. :279.2 Max. :63.00 Max. :127.00
## sandPct
## Min. :19.23
## 1st Qu.:40.72
## Median :45.72
## Mean :45.28
## 3rd Qu.:50.00
## Max. :66.25
The histogram of the distribution of prize winnings makes it evident that the variable does not follow a normal distribution. Because of that, the regression analysis later on will be conducted on the logarithm of the prize winnings (totPrize). Histograms of both the original and the transformed variable can be observed below.
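The original plot code is not shown here; a minimal base-graphics sketch that would reproduce the two histograms:

par(mfrow = c(1, 2))                        # two panels side by side
hist(LPGA$totPrize, main = "totPrize", xlab = "Prize winnings (USD)")
hist(log(LPGA$totPrize), main = "log(totPrize)", xlab = "Log of prize winnings")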
The first thing I noticed after initially reviewing the data is that some of the variables are connected. For instance, there are two variables related to putting, four related to drives, and three related to sand traps, among others. I aimed to investigate whether all of these variables are crucial for modeling prize winnings. To test this hypothesis, I utilized the Boruta algorithm. First I loaded the Boruta package and then applied it to my data.
library(Boruta)
set.seed(123)
boruta.train <- Boruta(totPrize~., data = LPGA, doTrace = 2)
print(boruta.train)
## Boruta performed 91 iterations in 3.461391 secs.
## 12 attributes confirmed important: avePutts, driveDist, events,
## fairAtt, fairPct and 7 more;
## No attributes deemed unimportant.
The Boruta algorithm classified all of the variables as important, so for now I’m leaving the data as it is.
I’m continuing my analysis by checking the correlations between the variables, since some of them are connected. Let’s look at the correlation plot.
library(corrplot)
corrplot.mixed(cor(LPGA[, c("totPrize", "fairPct", "totPutts", "totRounds", "avePutts", "greenReg",
"events", "driveDist", "sandSaves", "sandAtt", "sandPct")]),
tl.pos = 'lt', tl.col = "midnightblue",
diag = 'u', order="hclust",
lower="number", upper="circle")
We can observe that, indeed, some of the variables are highly correlated. With the metan package we can also look at the significance of these correlations.
library(metan)
corr <- corr_coef(LPGA[,c("totPrize", "fairPct", "totPutts", "totRounds", "avePutts",
"greenReg", "events", "driveDist", "sandSaves", "sandAtt", "sandPct")],
use = "pairwise.complete.obs")
plot(corr)
It can be seen that some of the independent variables are significantly correlated, which can indicate multicollinearity. In a regression model, this makes it harder to distinguish the variables’ individual effects. This is a place where PCA could help. Let’s see if it does.
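One standard way to quantify this multicollinearity (not part of the original analysis, and assuming the car package is available) would be variance inflation factors:

library(car)
# VIFs well above 10 would confirm the multicollinearity suggested by the plot
vif(lm(log(totPrize) ~ ., data = LPGA))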
To begin with, I’m making a subset of the data that contains only the independent variables - the PCA will be performed only on those. Secondly, I’m standardizing those variables using the clusterSim package.
LPGA.stats <- LPGA[, -8]   # drop totPrize (column 8), keeping the 12 performance statistics
head(LPGA.stats)
## fairways fairAtt fairPct totPutts totRounds avePutts greenReg
## A Lim Kim 962 1395 68.96 3052 108 30.22 74.4
## Aditi Ashok 795 996 79.82 2104 79 29.22 63.0
## Agathe Laisne 412 561 73.44 1248 41 30.44 64.9
## Alana Uriell 510 729 69.96 1575 55 29.72 65.3
## Albane Valenzuela 756 1074 70.39 2376 81 30.46 70.7
## Alena Sharp 302 422 71.56 923 35 29.77 70.1
## events driveDist sandSaves sandAtt sandPct
## A Lim Kim 29 274.74 43 95 45.26
## Aditi Ashok 26 235.10 53 94 56.38
## Agathe Laisne 18 254.66 29 62 46.77
## Alana Uriell 21 269.51 43 95 45.26
## Albane Valenzuela 24 265.21 27 74 36.49
## Alena Sharp 11 265.94 15 30 50.00
library(clusterSim)
LPGA.n <- data.Normalization(LPGA.stats, type="n1")   # n1 = standardization (mean 0, sd 1)
head(LPGA.n)
## fairways fairAtt fairPct totPutts totRounds
## A Lim Kim 1.6678043 2.1447520 -0.78909676 2.2166790 2.2347252
## Aditi Ashok 0.8426274 0.5896730 1.02959167 0.4871130 0.7372828
## Agathe Laisne -1.0498442 -1.1057139 -0.03884592 -1.0746048 -1.2248830
## Alana Uriell -0.5656086 -0.4509438 -0.62163006 -0.4780140 -0.5019798
## Albane Valenzuela 0.6499214 0.8936734 -0.54961937 0.9833598 0.8405547
## Alena Sharp -1.5933738 -1.6474582 -0.35368333 -1.6675467 -1.5346986
## avePutts greenReg events driveDist sandSaves
## A Lim Kim 0.2301899 1.4026222 1.8809127 1.8884078 0.8946290
## Aditi Ashok -1.4419942 -1.4853557 1.2190325 -2.3656889 1.7518235
## Agathe Laisne 0.5980704 -1.0040261 -0.5459813 -0.2665433 -0.3054434
## Alana Uriell -0.6059022 -0.9026935 0.1158988 1.3271332 0.8946290
## Albane Valenzuela 0.6315141 0.4652960 0.7777790 0.8656646 -0.4768823
## Alena Sharp -0.5222930 0.3132972 -2.0903684 0.9440069 -1.5055157
## sandAtt sandPct
## A Lim Kim 1.2074196 -0.003039633
## Aditi Ashok 1.1575184 1.428733479
## Agathe Laisne -0.4393201 0.191382795
## Alana Uriell 1.2074196 -0.003039633
## Albane Valenzuela 0.1594943 -1.132234794
## Alena Sharp -2.0361585 0.607266532
Now, let’s run the PCA using the prcomp function, and visualize which variables make up most of the first two principal components.
PCA1 <- prcomp(LPGA.n, center=FALSE, scale.=FALSE)   # data is already standardized
PCA1$rotation
## PC1 PC2 PC3 PC4 PC5
## fairways -0.36320205 0.13677684 -0.159725381 0.02481068 0.10615564
## fairAtt -0.37306077 -0.02473003 -0.093941348 0.06172400 0.08818171
## fairPct -0.06166507 0.65734047 -0.273821674 -0.17704997 0.08355119
## totPutts -0.37090263 -0.03497738 -0.134055118 0.05766048 0.04799531
## totRounds -0.37199624 -0.02391899 -0.091877766 0.07341895 0.09283949
## avePutts 0.14638876 -0.16325051 -0.596149687 -0.03567235 -0.74616418
## greenReg -0.22957805 -0.05691666 -0.379252535 -0.62952730 0.18164445
## events -0.34463078 -0.05674175 -0.094391290 0.31482168 -0.15435967
## driveDist -0.04704323 -0.68644128 -0.007044628 -0.31222437 0.20533343
## sandSaves -0.33238181 -0.02099708 0.305396864 -0.02802398 -0.34255181
## sandAtt -0.33801597 -0.11786354 0.117005025 0.23536577 -0.15628839
## sandPct -0.17118921 0.16776265 0.499127760 -0.55240976 -0.40686512
## PC6 PC7 PC8 PC9 PC10
## fairways -0.033876125 -0.10480735 0.29404462 -0.162743721 -0.23564787
## fairAtt -0.155530416 -0.05372090 0.30645505 -0.161129444 -0.16766439
## fairPct 0.540703610 -0.34208544 -0.07647637 0.063029937 0.02200963
## totPutts -0.157402106 -0.07261675 0.28206569 -0.237421670 -0.11672696
## totRounds -0.173852117 -0.10807641 0.11320917 0.517791026 0.71599662
## avePutts 0.008417142 -0.06041796 0.16744271 -0.009820879 0.06635061
## greenReg -0.114794310 0.50042200 -0.32300532 0.025386136 -0.04811238
## events -0.195844806 -0.33113389 -0.74761991 0.014850673 -0.19638297
## driveDist 0.398957745 -0.47430369 0.01749718 0.023779462 -0.03902518
## sandSaves 0.286170827 0.23594474 0.11613115 0.586884570 -0.42131907
## sandAtt 0.532403437 0.37196149 -0.12580736 -0.454427435 0.36289247
## sandPct -0.226239051 -0.26351832 -0.02172419 -0.258142813 0.18327801
## PC11 PC12
## fairways -0.794465959 -0.0139130308
## fairAtt 0.420997937 -0.6982117150
## fairPct 0.176370047 0.0002310589
## totPutts 0.390199044 0.7111607223
## totRounds -0.055786449 0.0018512672
## avePutts -0.023354672 -0.0410988513
## greenReg 0.004472618 -0.0033143149
## events -0.013214585 -0.0160896471
## driveDist -0.018771274 -0.0034039175
## sandSaves 0.043769781 0.0527064694
## sandAtt -0.035077843 -0.0357512543
## sandPct -0.029531026 -0.0229583818
library(factoextra)
fviz_pca_var(PCA1, col.var="contrib")
On this plot we can clearly see that several variables, such as fairways, events, and sandAtt, contribute strongly to the first principal component. On the other hand, the variables of importance to the second principal component are fairPct and driveDist. This can be seen more clearly on the plot below.
library(gridExtra)
var <- get_pca_var(PCA1)
a <- fviz_contrib(PCA1, "var", axes=1, xtickslab.rt=90)
b <- fviz_contrib(PCA1, "var", axes=2, xtickslab.rt=90)
grid.arrange(a, b, top = 'Contribution to the first two Principal Components')
Let’s examine the explained variance and cumulative explained variance for the generated components.
fviz_eig(PCA1, barcolor = "tomato3", barfill = "tomato3", addlabels = TRUE, main = "Degree of explained variance for created components")
summary(PCA1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6335 1.3031 1.2138 0.94750 0.7082 0.52572 0.37589
## Proportion of Variance 0.5779 0.1415 0.1228 0.07481 0.0418 0.02303 0.01177
## Cumulative Proportion 0.5779 0.7194 0.8422 0.91702 0.9588 0.98185 0.99363
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.23305 0.09813 0.09310 0.06017 0.01548
## Proportion of Variance 0.00453 0.00080 0.00072 0.00030 0.00002
## Cumulative Proportion 0.99815 0.99896 0.99968 0.99998 1.00000
I would consider the result satisfactory – the first 4 components (out of 12) account for more than 90% of the explained variance, and the first 5 for more than 95%. This suggests that much of what remains in the other components might be noise in the data. Therefore, for my following analysis, I will retain the first 5 components.
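The same cumulative proportions can be recovered directly from the component standard deviations, which is a quick way to double-check a retention choice like this:

cumsum(PCA1$sdev^2) / sum(PCA1$sdev^2)   # cumulative share of variance per component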
Let’s look at the significant loadings of the first 5 principal components using the psych package. The cutoff level is set at 0.5.
library(psych)
PCA5 <- principal(LPGA.n, nfactors = 5, rotate = "varimax")
print(loadings(PCA5), digits = 3, cutoff = 0.5, sort = TRUE)
##
## Loadings:
## RC1 RC2 RC3 RC4 RC5
## fairways 0.894
## fairAtt 0.937
## totPutts 0.940
## totRounds 0.938
## events 0.970
## sandSaves 0.773 0.575
## sandAtt 0.901
## fairPct 0.892
## driveDist -0.895
## sandPct 0.937
## greenReg 0.875
## avePutts 0.935
##
## RC1 RC2 RC3 RC4 RC5
## SS loadings 6.032 1.682 1.383 1.368 1.041
## Proportion Var 0.503 0.140 0.115 0.114 0.087
## Cumulative Var 0.503 0.643 0.758 0.872 0.959
Now it’s evident which variables contribute the most to each created component. I will assign an umbrella name to each of them:
- OverallPerformance – the season-volume statistics: fairways, fairAtt, totPutts, totRounds, events, sandSaves, and sandAtt
- DrivesAndFairways – fairPct and driveDist
- Sandtraps – sandPct
- GreensInRegulation – greenReg
- Putting – avePutts
With this in mind, I’m creating a dataset that will store the prize winnings and the first five principal components with assigned names.
LPGA.PCA.5 <- PCA1$x[,1:5]
LPGA.PCA.5.winnings <- as.data.frame(cbind(LPGA[,8], LPGA.PCA.5))
colnames(LPGA.PCA.5.winnings) <- c("Winnings", "OverallPerformance",
"DrivesAndFairways", "Sandtraps",
"GreensInRegulation", "Putting")
head(LPGA.PCA.5.winnings)
## Winnings OverallPerformance DrivesAndFairways Sandtraps
## A Lim Kim 747851 -4.741031 -2.1566379 -1.2013252
## Aditi Ashok 178900 -2.441487 2.6839485 2.1031780
## Agathe Laisne 41937 2.386438 0.1888567 0.5672855
## Alana Uriell 66506 0.087225 -1.3749948 1.5093632
## Albane Valenzuela 340107 -1.237693 -1.3156647 -1.5708788
## Alena Sharp 95534 4.016843 -0.4031349 0.8589521
## GreensInRegulation Putting
## A Lim Kim -0.02250439 0.3006152
## Aditi Ashok 1.50024736 -0.9107657
## Agathe Laisne 0.08218829 -0.8811531
## Alana Uriell 0.47657666 -0.1724692
## Albane Valenzuela 0.62250700 0.4972045
## Alena Sharp -2.19137268 0.9835123
In this section, I compare two regression models – one using the original variables and the other the PCA-reduced data. In both models, the dependent variable is the logarithm of the prize winnings.
# original variables (note: fairAtt is not included in the formula)
model <- lm(log(totPrize) ~ fairways+ fairPct + totPutts + totRounds + avePutts +
greenReg + events + driveDist + sandSaves +
sandAtt + sandPct, data = LPGA)
summary(model)
##
## Call:
## lm(formula = log(totPrize) ~ fairways + fairPct + totPutts +
## totRounds + avePutts + greenReg + events + driveDist + sandSaves +
## sandAtt + sandPct, data = LPGA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.68499 -0.36361 -0.00129 0.32954 1.37023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.196675 3.643491 4.720 5.48e-06 ***
## fairways -0.001475 0.002824 -0.522 0.602
## fairPct 0.018345 0.023599 0.777 0.438
## totPutts 0.001662 0.001118 1.487 0.139
## totRounds 0.017320 0.020319 0.852 0.395
## avePutts -0.564829 0.123848 -4.561 1.07e-05 ***
## greenReg 0.151592 0.023075 6.569 8.32e-10 ***
## events -0.156574 0.037091 -4.221 4.25e-05 ***
## driveDist -0.002337 0.007797 -0.300 0.765
## sandSaves 0.009488 0.027626 0.343 0.732
## sandAtt 0.008690 0.013257 0.656 0.513
## sandPct 0.001505 0.019353 0.078 0.938
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5262 on 146 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.842
## F-statistic: 77.08 on 11 and 146 DF, p-value: < 2.2e-16
# 5 principal components
model.PCA.5 <- lm(log(Winnings) ~ OverallPerformance + DrivesAndFairways +
Sandtraps + GreensInRegulation + Putting,
data = LPGA.PCA.5.winnings)
summary(model.PCA.5)
##
## Call:
## lm(formula = log(Winnings) ~ OverallPerformance + DrivesAndFairways +
## Sandtraps + GreensInRegulation + Putting, data = LPGA.PCA.5.winnings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8266 -0.3988 -0.0385 0.3467 1.8757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.43404 0.04533 274.276 < 2e-16 ***
## OverallPerformance -0.39980 0.01727 -23.151 < 2e-16 ***
## DrivesAndFairways 0.04671 0.03490 1.338 0.183
## Sandtraps -0.03365 0.03747 -0.898 0.371
## GreensInRegulation -0.49951 0.04800 -10.407 < 2e-16 ***
## Putting 0.44822 0.06422 6.980 8.56e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5698 on 152 degrees of freedom
## Multiple R-squared: 0.8207, Adjusted R-squared: 0.8148
## F-statistic: 139.1 on 5 and 152 DF, p-value: < 2.2e-16
Reviewing the outcomes, the regression model fitted on the PCA-reduced data shows an adjusted R-squared of approximately 81%, while the model fitted on the original variables reaches around 84%. Considering that the model with the reduced data uses only 5 of the 12 generated components, this is a satisfactory result. Moreover, the model with the original variables identifies only 3 of the 11 included predictors (avePutts, greenReg, and events) as statistically significant (p-value below 0.05). Many of the remaining variables might have to be discarded, which could affect the overall result negatively. This can be attributed to multicollinearity in the data and the presence of noise.
Examining the model fitted on the reduced data, 3 of the 5 components are statistically significant, and the p-values of the remaining two are not exceedingly high. This suggests that this model would require fewer adjustments.
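As an additional, optional comparison not part of the original write-up, information criteria can weigh fit against model size; both models share the same response and observations, so the values are directly comparable:

# lower AIC/BIC indicates the better trade-off between fit and complexity
AIC(model, model.PCA.5)
BIC(model, model.PCA.5)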
This analysis illustrates the application of PCA to high-dimensional data. It demonstrates that in datasets where multicollinearity among variables is possible, dimensionality reduction can help minimize noise while retaining the most significant information in a smaller set of variables. The regression model fitted on 5 of the 12 principal components yielded a nearly identical adjusted R-squared compared to the model fitted on all independent variables.
Sources: