Q-1)

Read csv file into the system.

pga <- read.csv("C:\\Users\\sharm_000\\OneDrive\\University\\BANA 7038\\Homework 3\\PGA.csv",header = T,sep = ",")
We rename the columns in the following manner:-
Variable name Meaning Variable name Meaning
name “Name” age “Age”
drvacc “Driving Accuracy” green “Greens on Regulation”
avgputt “Average Number of Putts” save “Save Percent”
monrank “Money Rank” numevnt “Number of Events”
totwin “Total Winnings” avgwin “Average Winnings”
str(pga)
## 'data.frame':    196 obs. of  11 variables:
##  $ Name              : Factor w/ 196 levels "Aaron Baddeley",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  23 24 34 34 31 29 42 28 27 47 ...
##  $ AverageDrive      : num  288 295 286 298 289 ...
##  $ DrivingAccuracy   : num  53.1 57.7 64.2 59 60.5 68.8 74.2 64.4 64.3 62.6 ...
##  $ GreensonRegulation: num  58.2 65.6 63.8 63 62.5 67 68.9 64.2 63.4 65.3 ...
##  $ AverageNumofPutts : num  1.77 1.76 1.79 1.79 1.77 ...
##  $ SavePercent       : num  50.9 59.3 50.7 47.7 43.5 50.9 40.4 53.8 42.2 47.7 ...
##  $ MoneyRank         : int  123 7 54 101 146 52 80 75 141 83 ...
##  $ NumEvents         : int  27 16 24 20 30 23 23 27 20 15 ...
##  $ TotalWinnings     : int  632878 3724984 1313484 808373 486053 1355433 962167 1036958 500818 943589 ...
##  $ AverageWinnings   : int  23440 232812 54729 40419 16202 58932 41833 38406 25041 62906 ...
summary(pga)
##                Name          Age         AverageDrive   DrivingAccuracy
##  Aaron Baddeley  :  1   Min.   :21.00   Min.   :268.2   Min.   :49.90  
##  Adam Scott      :  1   1st Qu.:31.00   1st Qu.:281.3   1st Qu.:60.95  
##  Alex Cejka      :  1   Median :35.50   Median :287.2   Median :64.30  
##  Andre Stolz     :  1   Mean   :35.96   Mean   :287.2   Mean   :64.08  
##  Arjun Atwal     :  1   3rd Qu.:40.25   3rd Qu.:292.1   3rd Qu.:67.80  
##  Arron Oberholser:  1   Max.   :51.00   Max.   :314.4   Max.   :77.20  
##  (Other)         :190                                                  
##  GreensonRegulation AverageNumofPutts  SavePercent      MoneyRank     
##  Min.   :54.70      Min.   :1.723     Min.   :31.80   Min.   :  1.00  
##  1st Qu.:63.00      1st Qu.:1.762     1st Qu.:45.67   1st Qu.: 49.75  
##  Median :64.90      Median :1.776     Median :49.00   Median : 99.50  
##  Mean   :64.90      Mean   :1.778     Mean   :48.97   Mean   :101.80  
##  3rd Qu.:66.83      3rd Qu.:1.796     3rd Qu.:52.42   3rd Qu.:151.25  
##  Max.   :73.30      Max.   :1.847     Max.   :62.30   Max.   :245.00  
##                                                                       
##    NumEvents     TotalWinnings      AverageWinnings 
##  Min.   :15.00   Min.   :   21250   Min.   :   850  
##  1st Qu.:23.00   1st Qu.:  436617   1st Qu.: 15749  
##  Median :27.00   Median :  814989   Median : 30849  
##  Mean   :26.19   Mean   : 1134632   Mean   : 46549  
##  3rd Qu.:30.00   3rd Qu.: 1407922   3rd Qu.: 56209  
##  Max.   :36.00   Max.   :10905167   Max.   :376040  
## 

Q2.

Visualising data

We build a matrix of the variables to compare

library(psych)
pairs.panels(pga[c("avgwin","age","totwin")],hist.col = "green",gap=0)

Here we can see that driving accuracy decreases with an increase in driving distance. Age mildly decreses the driving distance but improves the driving accuracy

pairs.panels(pga[c("green","avgputt","save")],hist.col = "blue")

Here we see that all the variables of the game are unrelated to each other

pairs.panels(pga[c("age","avgdrv","drvacc","green","avgputt","save","monrank","numevnt","totwin","avgwin")],gap=0)

A chart showing all the scatter plots, histograms and correlations.

Q3.

Building a linear regression using Average winnings as response variable and using Age, Average Drive (Yards), Driving accuracy (percent), Greens on regulation (%), Average # of putts, Save Percent, and # Events as covariates

lm3=lm(pga$avgwin~pga$age+pga$avgdrv+pga$drvacc+pga$green+pga$avgputt+pga$save+pga$numevnt)
lm3
## 
## Call:
## lm(formula = pga$avgwin ~ pga$age + pga$avgdrv + pga$drvacc + 
##     pga$green + pga$avgputt + pga$save + pga$numevnt)
## 
## Coefficients:
## (Intercept)      pga$age   pga$avgdrv   pga$drvacc    pga$green  
##   945579.88      -587.13       -94.76     -2360.57      8466.04  
## pga$avgputt     pga$save  pga$numevnt  
##  -694226.49      1395.67     -3159.22

This output indicates that the fitted value is given by the equation :-

\(\hat{y}\) = 945579.88 - 587.13x1 - 94.76x2 - 2360.57x3 + 8466.04x4 - 694226.49x5 + 1395.67x6 - 3159.22x7

Q4.

Performing T test for the above coefficient estimates

Let us test if the explanatory variables(Age, Average Drive (Yards), Age and Average Drive) collectively have an effect on the response variable (Average winnings), i.e.

H0: \(\beta\)1 = \(\beta\)2 = …\(\beta\)p = 0

summary(lm3)
## 
## Call:
## lm(formula = pga$avgwin ~ pga$age + pga$avgdrv + pga$drvacc + 
##     pga$green + pga$avgputt + pga$save + pga$numevnt)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71690 -22176  -6735  17147 247928 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  945579.88  305886.59   3.091  0.00230 ** 
## pga$age        -587.13     519.32  -1.131  0.25968    
## pga$avgdrv      -94.76     567.42  -0.167  0.86755    
## pga$drvacc    -2360.57     854.02  -2.764  0.00628 ** 
## pga$green      8466.04    1303.87   6.493 7.30e-10 ***
## pga$avgputt -694226.49  138155.99  -5.025 1.17e-06 ***
## pga$save       1395.67     587.54   2.375  0.01853 *  
## pga$numevnt   -3159.22     644.24  -4.904 2.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41430 on 188 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.4323 
## F-statistic: 22.21 on 7 and 188 DF,  p-value: < 2.2e-16

Coefficients:

Variable t value P value T significant P < 0.05 Hypothesis result
“(Intercept)” 3.091 0.00230 No Yes Reject
“pga$age” -1.131 0.25968 No No Accept
“pga$avgdrv” -0.167 0.86755 No No Accept
“pga$drvacc” -2.764 0.00628 No yes Reject
“pga$green” 6.493 7.30e-10 No Yes Reject
“pga$avgputt” -5.025 1.17e-06 No Yes Reject
“pga$save” 2.375 0.01853 No Yes Reject
“pga$numevnt” -4.904 2.03e-06 No Yes Reject

Here F = 22.21 (p < 2.2e-16)

Therefore, we reject the null hypothesis that the variables mentioned above collectively have no effect on Average Winnings.

Q5.

F test to test significance of the regression

anova(lm3)
## Analysis of Variance Table
## 
## Response: pga$avgwin
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## pga$age       1 1.7059e+09 1.7059e+09  0.9937 0.3201135    
## pga$avgdrv    1 2.1867e+10 2.1867e+10 12.7378 0.0004548 ***
## pga$drvacc    1 5.1862e+07 5.1862e+07  0.0302 0.8621994    
## pga$green     1 1.1345e+11 1.1345e+11 66.0847 5.666e-14 ***
## pga$avgputt   1 7.5694e+10 7.5694e+10 44.0933 3.268e-10 ***
## pga$save      1 1.2892e+10 1.2892e+10  7.5101 0.0067270 ** 
## pga$numevnt   1 4.1281e+10 4.1281e+10 24.0471 2.027e-06 ***
## Residuals   188 3.2273e+11 1.7167e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable F value P value F significant P < 0.05 Hypothesis result
“pga$age” 0.9937 0.3201135 No No Accept
“pga$avgdrv” 12.7378 0.0004548 Yes Yes Reject
“pga$drvacc” 0.0302 0.8621994 No No Accept
“pga$green” 66.0847 5.666e-14 Yes Yes Reject
“pga$avgputt” 44.0933 3.268e-10 Yes Yes Reject
“pga$save” 7.5101 0.0067270 Yes Yes Reject
“pga$numevnt” 24.0471 2.027e-06 Yes Yes Reject

So only the hypothesis for “age” and “driving accuracy” are accepted.

Q6.

Patrial F test to test for two variables Age and Average Drive (Yards) together

#partial F test for age and average drive
lm2 <- lm(pga$avgwin~pga$age+pga$avgdrv)
anova(lm2,lm3)
## Analysis of Variance Table
## 
## Model 1: pga$avgwin ~ pga$age + pga$avgdrv
## Model 2: pga$avgwin ~ pga$age + pga$avgdrv + pga$drvacc + pga$green + 
##     pga$avgputt + pga$save + pga$numevnt
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    193 5.6610e+11                                   
## 2    188 3.2273e+11  5 2.4336e+11 28.353 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output shows the results of the partial F-test. Since F= 28.353 (p-value= 2.2e-16) we can reject the null hypothesis ( \(\beta\)2 = \(\beta\)3 = 0) at the 5% level of significance.

It appears that the variables Driving accuracy, Greens on regulation, Average number of putts, Save percent and Number of Events contribute significant

information to the Average Winnings once the variables Age and Average Drive have been taken into consideration.

Patrial F test to test for three variables Age, Average Drive (Yards) and Save Percent together

#partial F test for age, average drive and save percent
lm1 <- lm(pga$avgwin~pga$age+pga$avgdrv+pga$save)
anova(lm1,lm3)
## Analysis of Variance Table
## 
## Model 1: pga$avgwin ~ pga$age + pga$avgdrv + pga$save
## Model 2: pga$avgwin ~ pga$age + pga$avgdrv + pga$drvacc + pga$green + 
##     pga$avgputt + pga$save + pga$numevnt
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    192 5.2941e+11                                   
## 2    188 3.2273e+11  4 2.0668e+11 30.099 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output shows the results of the partial F-test. Since F= 30.099 (p-value= 2.2e-16) we can reject the null hypothesis ( \(\beta\)2 = \(\beta\)3 \(\beta\)6 = 0) at the 5% level of significance.

It appears that the variables Driving accuracy, Greens on regulation, Average number of putts, and Number of Events contribute significant

information to the Average Winnings once the variables Age, Average Drive and Save percent have been taken into consideration.

anova(lm2,lm1)
## Analysis of Variance Table
## 
## Model 1: pga$avgwin ~ pga$age + pga$avgdrv
## Model 2: pga$avgwin ~ pga$age + pga$avgdrv + pga$save
##   Res.Df        RSS Df  Sum of Sq      F   Pr(>F)    
## 1    193 5.6610e+11                                  
## 2    192 5.2941e+11  1 3.6685e+10 13.305 0.000341 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When comparing the two models between each other, we can see that adding save percent still produces a P statistic less than 0.05 and therefore adding save percent significantly improves the model in this case.

Q7.

Obtaining standardized regression coefficients

#standardized regression coefficients
library(QuantPsyc,warn.conflicts = FALSE)  
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
## Loading required package: MASS
lm.beta(lm3)                        
##     pga$age  pga$avgdrv  pga$drvacc   pga$green pga$avgputt    pga$save 
## -0.06831285 -0.01425906 -0.22790216  0.43883140 -0.29706575  0.13960500 
## pga$numevnt 
## -0.27161322

Comparison of the influence of all variables

Here we can see the the variable “Greens on Regulation” influence the model the most at 0.43883140 while “Age” variable influences the model the least at -0.06831285

Q8.

Testing data for multicollinearity

#multicollinearity
library(car,warn.conflicts = FALSE)
vif(lm3)
##     pga$age  pga$avgdrv  pga$drvacc   pga$green pga$avgputt    pga$save 
##    1.254131    2.504096    2.335193    1.569009    1.200508    1.186409 
## pga$numevnt 
##    1.053810
vif(lm3)>4
##     pga$age  pga$avgdrv  pga$drvacc   pga$green pga$avgputt    pga$save 
##       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE 
## pga$numevnt 
##       FALSE

As we can see, the variance inflation factors of the data are all less than 4, we can say that the data does not suffer from multicollinearity.