Q-1)
Read csv file into the system.
pga <- read.csv("C:\\Users\\sharm_000\\OneDrive\\University\\BANA 7038\\Homework 3\\PGA.csv",header = T,sep = ",")
We rename the columns in the following manner:-
| name |
“Name” |
age |
“Age” |
| drvacc |
“Driving Accuracy” |
green |
“Greens on Regulation” |
| avgputt |
“Average Number of Putts” |
save |
“Save Percent” |
| monrank |
“Money Rank” |
numevnt |
“Number of Events” |
| totwin |
“Total Winnings” |
avgwin |
“Average Winnings” |
str(pga)
## 'data.frame': 196 obs. of 11 variables:
## $ Name : Factor w/ 196 levels "Aaron Baddeley",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 23 24 34 34 31 29 42 28 27 47 ...
## $ AverageDrive : num 288 295 286 298 289 ...
## $ DrivingAccuracy : num 53.1 57.7 64.2 59 60.5 68.8 74.2 64.4 64.3 62.6 ...
## $ GreensonRegulation: num 58.2 65.6 63.8 63 62.5 67 68.9 64.2 63.4 65.3 ...
## $ AverageNumofPutts : num 1.77 1.76 1.79 1.79 1.77 ...
## $ SavePercent : num 50.9 59.3 50.7 47.7 43.5 50.9 40.4 53.8 42.2 47.7 ...
## $ MoneyRank : int 123 7 54 101 146 52 80 75 141 83 ...
## $ NumEvents : int 27 16 24 20 30 23 23 27 20 15 ...
## $ TotalWinnings : int 632878 3724984 1313484 808373 486053 1355433 962167 1036958 500818 943589 ...
## $ AverageWinnings : int 23440 232812 54729 40419 16202 58932 41833 38406 25041 62906 ...
summary(pga)
## Name Age AverageDrive DrivingAccuracy
## Aaron Baddeley : 1 Min. :21.00 Min. :268.2 Min. :49.90
## Adam Scott : 1 1st Qu.:31.00 1st Qu.:281.3 1st Qu.:60.95
## Alex Cejka : 1 Median :35.50 Median :287.2 Median :64.30
## Andre Stolz : 1 Mean :35.96 Mean :287.2 Mean :64.08
## Arjun Atwal : 1 3rd Qu.:40.25 3rd Qu.:292.1 3rd Qu.:67.80
## Arron Oberholser: 1 Max. :51.00 Max. :314.4 Max. :77.20
## (Other) :190
## GreensonRegulation AverageNumofPutts SavePercent MoneyRank
## Min. :54.70 Min. :1.723 Min. :31.80 Min. : 1.00
## 1st Qu.:63.00 1st Qu.:1.762 1st Qu.:45.67 1st Qu.: 49.75
## Median :64.90 Median :1.776 Median :49.00 Median : 99.50
## Mean :64.90 Mean :1.778 Mean :48.97 Mean :101.80
## 3rd Qu.:66.83 3rd Qu.:1.796 3rd Qu.:52.42 3rd Qu.:151.25
## Max. :73.30 Max. :1.847 Max. :62.30 Max. :245.00
##
## NumEvents TotalWinnings AverageWinnings
## Min. :15.00 Min. : 21250 Min. : 850
## 1st Qu.:23.00 1st Qu.: 436617 1st Qu.: 15749
## Median :27.00 Median : 814989 Median : 30849
## Mean :26.19 Mean : 1134632 Mean : 46549
## 3rd Qu.:30.00 3rd Qu.: 1407922 3rd Qu.: 56209
## Max. :36.00 Max. :10905167 Max. :376040
##
Q2.
Visualising data
We build a matrix of the variables to compare
library(psych)
pairs.panels(pga[c("avgwin","age","totwin")],hist.col = "green",gap=0)

Here we can see that driving accuracy decreases with an increase in driving distance. Age mildly decreses the driving distance but improves the driving accuracy
pairs.panels(pga[c("green","avgputt","save")],hist.col = "blue")

A chart showing all the scatter plots, histograms and correlations.
Q3.
Building a linear regression using Average winnings as response variable and using Age, Average Drive (Yards), Driving accuracy (percent), Greens on regulation (%), Average # of putts, Save Percent, and # Events as covariates
lm3=lm(pga$avgwin~pga$age+pga$avgdrv+pga$drvacc+pga$green+pga$avgputt+pga$save+pga$numevnt)
lm3
##
## Call:
## lm(formula = pga$avgwin ~ pga$age + pga$avgdrv + pga$drvacc +
## pga$green + pga$avgputt + pga$save + pga$numevnt)
##
## Coefficients:
## (Intercept) pga$age pga$avgdrv pga$drvacc pga$green
## 945579.88 -587.13 -94.76 -2360.57 8466.04
## pga$avgputt pga$save pga$numevnt
## -694226.49 1395.67 -3159.22
This output indicates that the fitted value is given by the equation :-
\(\hat{y}\) = 945579.88 - 587.13x1 - 94.76x2 - 2360.57x3 + 8466.04x4 - 694226.49x5 + 1395.67x6 - 3159.22x7
Q4.
H0: \(\beta\)1 = \(\beta\)2 = …\(\beta\)p = 0
summary(lm3)
##
## Call:
## lm(formula = pga$avgwin ~ pga$age + pga$avgdrv + pga$drvacc +
## pga$green + pga$avgputt + pga$save + pga$numevnt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71690 -22176 -6735 17147 247928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 945579.88 305886.59 3.091 0.00230 **
## pga$age -587.13 519.32 -1.131 0.25968
## pga$avgdrv -94.76 567.42 -0.167 0.86755
## pga$drvacc -2360.57 854.02 -2.764 0.00628 **
## pga$green 8466.04 1303.87 6.493 7.30e-10 ***
## pga$avgputt -694226.49 138155.99 -5.025 1.17e-06 ***
## pga$save 1395.67 587.54 2.375 0.01853 *
## pga$numevnt -3159.22 644.24 -4.904 2.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41430 on 188 degrees of freedom
## Multiple R-squared: 0.4527, Adjusted R-squared: 0.4323
## F-statistic: 22.21 on 7 and 188 DF, p-value: < 2.2e-16
Coefficients:
| “(Intercept)” |
3.091 |
0.00230 |
No |
Yes |
Reject |
| “pga$age” |
-1.131 |
0.25968 |
No |
No |
Accept |
| “pga$avgdrv” |
-0.167 |
0.86755 |
No |
No |
Accept |
| “pga$drvacc” |
-2.764 |
0.00628 |
No |
yes |
Reject |
| “pga$green” |
6.493 |
7.30e-10 |
No |
Yes |
Reject |
| “pga$avgputt” |
-5.025 |
1.17e-06 |
No |
Yes |
Reject |
| “pga$save” |
2.375 |
0.01853 |
No |
Yes |
Reject |
| “pga$numevnt” |
-4.904 |
2.03e-06 |
No |
Yes |
Reject |
Here F = 22.21 (p < 2.2e-16)
Therefore, we reject the null hypothesis that the variables mentioned above collectively have no effect on Average Winnings.
Q5.
F test to test significance of the regression
anova(lm3)
## Analysis of Variance Table
##
## Response: pga$avgwin
## Df Sum Sq Mean Sq F value Pr(>F)
## pga$age 1 1.7059e+09 1.7059e+09 0.9937 0.3201135
## pga$avgdrv 1 2.1867e+10 2.1867e+10 12.7378 0.0004548 ***
## pga$drvacc 1 5.1862e+07 5.1862e+07 0.0302 0.8621994
## pga$green 1 1.1345e+11 1.1345e+11 66.0847 5.666e-14 ***
## pga$avgputt 1 7.5694e+10 7.5694e+10 44.0933 3.268e-10 ***
## pga$save 1 1.2892e+10 1.2892e+10 7.5101 0.0067270 **
## pga$numevnt 1 4.1281e+10 4.1281e+10 24.0471 2.027e-06 ***
## Residuals 188 3.2273e+11 1.7167e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| “pga$age” |
0.9937 |
0.3201135 |
No |
No |
Accept |
| “pga$avgdrv” |
12.7378 |
0.0004548 |
Yes |
Yes |
Reject |
| “pga$drvacc” |
0.0302 |
0.8621994 |
No |
No |
Accept |
| “pga$green” |
66.0847 |
5.666e-14 |
Yes |
Yes |
Reject |
| “pga$avgputt” |
44.0933 |
3.268e-10 |
Yes |
Yes |
Reject |
| “pga$save” |
7.5101 |
0.0067270 |
Yes |
Yes |
Reject |
| “pga$numevnt” |
24.0471 |
2.027e-06 |
Yes |
Yes |
Reject |
So only the hypothesis for “age” and “driving accuracy” are accepted.
Q6.
Patrial F test to test for two variables Age and Average Drive (Yards) together
#partial F test for age and average drive
lm2 <- lm(pga$avgwin~pga$age+pga$avgdrv)
anova(lm2,lm3)
## Analysis of Variance Table
##
## Model 1: pga$avgwin ~ pga$age + pga$avgdrv
## Model 2: pga$avgwin ~ pga$age + pga$avgdrv + pga$drvacc + pga$green +
## pga$avgputt + pga$save + pga$numevnt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 193 5.6610e+11
## 2 188 3.2273e+11 5 2.4336e+11 28.353 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output shows the results of the partial F-test. Since F= 28.353 (p-value= 2.2e-16) we can reject the null hypothesis ( \(\beta\)2 = \(\beta\)3 = 0) at the 5% level of significance.
It appears that the variables Driving accuracy, Greens on regulation, Average number of putts, Save percent and Number of Events contribute significant
information to the Average Winnings once the variables Age and Average Drive have been taken into consideration.
Patrial F test to test for three variables Age, Average Drive (Yards) and Save Percent together
#partial F test for age, average drive and save percent
lm1 <- lm(pga$avgwin~pga$age+pga$avgdrv+pga$save)
anova(lm1,lm3)
## Analysis of Variance Table
##
## Model 1: pga$avgwin ~ pga$age + pga$avgdrv + pga$save
## Model 2: pga$avgwin ~ pga$age + pga$avgdrv + pga$drvacc + pga$green +
## pga$avgputt + pga$save + pga$numevnt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 192 5.2941e+11
## 2 188 3.2273e+11 4 2.0668e+11 30.099 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output shows the results of the partial F-test. Since F= 30.099 (p-value= 2.2e-16) we can reject the null hypothesis ( \(\beta\)2 = \(\beta\)3 \(\beta\)6 = 0) at the 5% level of significance.
It appears that the variables Driving accuracy, Greens on regulation, Average number of putts, and Number of Events contribute significant
When comparing the two models between each other, we can see that adding save percent still produces a P statistic less than 0.05 and therefore adding save percent significantly improves the model in this case.
Q7.
Obtaining standardized regression coefficients
#standardized regression coefficients
library(QuantPsyc,warn.conflicts = FALSE)
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
## Loading required package: MASS
lm.beta(lm3)
## pga$age pga$avgdrv pga$drvacc pga$green pga$avgputt pga$save
## -0.06831285 -0.01425906 -0.22790216 0.43883140 -0.29706575 0.13960500
## pga$numevnt
## -0.27161322
Comparison of the influence of all variables
Here we can see the the variable “Greens on Regulation” influence the model the most at 0.43883140 while “Age” variable influences the model the least at -0.06831285
Q8.
Testing data for multicollinearity
#multicollinearity
library(car,warn.conflicts = FALSE)
vif(lm3)
## pga$age pga$avgdrv pga$drvacc pga$green pga$avgputt pga$save
## 1.254131 2.504096 2.335193 1.569009 1.200508 1.186409
## pga$numevnt
## 1.053810
vif(lm3)>4
## pga$age pga$avgdrv pga$drvacc pga$green pga$avgputt pga$save
## FALSE FALSE FALSE FALSE FALSE FALSE
## pga$numevnt
## FALSE
As we can see, the variance inflation factors of the data are all less than 4, we can say that the data does not suffer from multicollinearity.