Last Name: Gupta
First Name: Divyay
M-number: M10285479
E-mail: guptadt@mail.uc.edu

Q1.

Answer:

To read the data into R, we use the following commands:

setwd("D:/MSIS/2nd Flex/DAM/Assignments/3")
pga <- read.csv("PGA.csv", header = T, sep = ",")
attach(pga)
ls()
## [1] "pga"

We can see that after executing pga <- read.csv("PGA.csv", header = T, sep = ","), the variable pga is present in the workspace, as ls() confirms.
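
As a quick sanity check that the file was imported as expected (a sketch; the column names are assumed to match those used in the later questions):

# Inspect the dimensions and column types of the imported data frame
dim(pga)
str(pga)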

Q2.

Answer:

To visualize the data, we can use scatter plots and histograms. The commands for the scatter plots are:

par(mfrow =c(3,3))
plot(Age, AverageWinnings, pch=20, col="red")
plot(AverageDrive, AverageWinnings, pch=20, col="red")
plot(DrivingAccuracy, AverageWinnings, pch=20, col="red")
plot(GreensonRegulation,AverageWinnings, pch=20, col="red")
plot(AverageNumofPutts,AverageWinnings, pch=20, col="red")
plot(SavePercent,AverageWinnings, pch=20, col="red")
plot(MoneyRank,AverageWinnings, pch=20, col="red")
plot(NumEvents,AverageWinnings, pch=20, col="red")
plot(TotalWinnings,AverageWinnings, pch=20, col="red")

The scatter plots use ‘AverageWinnings’ as the response variable, plotted against each of the other variables as covariates. Another way to obtain all pairwise scatter plots at once is:

pairs(pga)

To obtain the histograms, the following commands are used:

par(mfrow = c(4,3))
hist(AverageWinnings, col="red")
hist(Age, col="red")
hist(AverageDrive, col="red")
hist(DrivingAccuracy, col="red")
hist(GreensonRegulation, col="red")
hist(AverageNumofPutts, col="red")
hist(SavePercent, col="red")
hist(MoneyRank, col="red")
hist(NumEvents, col="red")
hist(TotalWinnings, col="red")

As we can see, this produces the histogram of each variable in the data set, showing the shape of its distribution.

Q3.

Answer:

To fit the linear regression, the following command can be used:

model = lm(AverageWinnings~Age+AverageDrive+DrivingAccuracy+DrivingAccuracy+GreensonRegulation+AverageNumofPutts+SavePercent+NumEvents)

After executing this command, a model is created with ‘AverageWinnings’ as the response variable and Age, AverageDrive, DrivingAccuracy, GreensonRegulation, AverageNumofPutts, SavePercent, and NumEvents as covariates.
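
An equivalent way to fit the same model without relying on attach() is to pass the data frame through the data argument (a sketch; model_alt is just an illustrative name):

# Same fit, but referencing columns through the data argument instead of attach()
model_alt <- lm(AverageWinnings ~ Age + AverageDrive + DrivingAccuracy +
                  GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents,
                data = pga)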

Q4.

Answer:

To perform the t-test for the coefficient estimates, we will use the model created in the previous section. The t statistics and p-values can be obtained with the following command.

summary(model)
## 
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + 
##     DrivingAccuracy + GreensonRegulation + AverageNumofPutts + 
##     SavePercent + NumEvents)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71690 -22176  -6735  17147 247928 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         945579.88  305886.59   3.091  0.00230 ** 
## Age                   -587.13     519.32  -1.131  0.25968    
## AverageDrive           -94.76     567.42  -0.167  0.86755    
## DrivingAccuracy      -2360.57     854.02  -2.764  0.00628 ** 
## GreensonRegulation    8466.04    1303.87   6.493 7.30e-10 ***
## AverageNumofPutts  -694226.49  138155.99  -5.025 1.17e-06 ***
## SavePercent           1395.67     587.54   2.375  0.01853 *  
## NumEvents            -3159.22     644.24  -4.904 2.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41430 on 188 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.4323 
## F-statistic: 22.21 on 7 and 188 DF,  p-value: < 2.2e-16

The t-test is used to test the individual slope coefficients of the linear regression model. Assuming a 5% significance level and the model’s 188 residual degrees of freedom, the two-sided critical value is approximately t = 1.97. If the t statistic for a coefficient falls between -1.97 and +1.97, we cannot conclude that the response variable and that regressor have a linear relationship.

To test the linear relationship between the response variable and each individual regressor, the following hypotheses are used for every coefficient j:

H0: \(\beta_j\) = 0 H1: \(\beta_j\) not equal to 0

The first statement is the null hypothesis, which assumes that the observed association is due to chance alone; it is retained if the observed data do not differ from what would be expected by chance. We reject H0 for a coefficient if |t0| exceeds the two-sided critical value at the chosen significance level.
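
As a quick check of this cutoff (a sketch, assuming a two-sided 5% level and the model’s residual degrees of freedom):

# Two-sided 5% critical t value using the model's residual degrees of freedom
alpha <- 0.05
qt(1 - alpha/2, df = df.residual(model))   # about 1.97 for 188 degrees of freedom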

From the observed summary, it can be concluded that the null hypothesis cannot be rejected for Age and AverageDrive, since their t values (-1.131 and -0.167) lie between -1.97 and +1.97. So we cannot conclude that AverageWinnings has a linear relationship with either Age or AverageDrive.

There is another way to check the linear relationship between the response variable and the regressor variables, namely the p-value.

The p-value indicates how significant the linear relationship is: if the p-value for a regressor is less than 0.05, we can reject the null hypothesis for that coefficient. From the summary, the p-values of Age and AverageDrive are greater than 0.05, so we fail to reject the null hypothesis for them. We can therefore conclude that AverageWinnings has a linear relationship with the other covariates, but not with Age or AverageDrive.
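
The same conclusion can be read off programmatically (a sketch, using the coefficient table returned by summary()):

# Extract the coefficient p-values and list the regressors that are not significant at 5%
pvals <- summary(model)$coefficients[, "Pr(>|t|)"]
names(pvals)[pvals > 0.05]   # expected to return Age and AverageDrive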

Q5.

Answer:

To perform the F-test for the significance of regression, we will use the model created in the previous section. The F statistic and its p-value can be obtained with the following command.

summary(model)
## 
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + 
##     DrivingAccuracy + GreensonRegulation + AverageNumofPutts + 
##     SavePercent + NumEvents)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71690 -22176  -6735  17147 247928 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         945579.88  305886.59   3.091  0.00230 ** 
## Age                   -587.13     519.32  -1.131  0.25968    
## AverageDrive           -94.76     567.42  -0.167  0.86755    
## DrivingAccuracy      -2360.57     854.02  -2.764  0.00628 ** 
## GreensonRegulation    8466.04    1303.87   6.493 7.30e-10 ***
## AverageNumofPutts  -694226.49  138155.99  -5.025 1.17e-06 ***
## SavePercent           1395.67     587.54   2.375  0.01853 *  
## NumEvents            -3159.22     644.24  -4.904 2.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41430 on 188 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.4323 
## F-statistic: 22.21 on 7 and 188 DF,  p-value: < 2.2e-16

As observed from the above summary, the F statistic of the regression is 22.21 with a p-value below 2.2e-16. Assuming a 5% significance level, the p-value is far smaller than 0.05, so we reject the null hypothesis that all regression coefficients are zero. It can be concluded that the regression is significant, i.e., at least one regressor has a linear relationship with AverageWinnings.
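
The same conclusion follows from comparing the observed F statistic with its critical value (a sketch, assuming a 5% level with the 7 and 188 degrees of freedom reported in the summary):

# 5% critical F value for 7 numerator and 188 denominator degrees of freedom
qf(0.95, df1 = 7, df2 = 188)   # about 2.06, far below the observed F of 22.21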

Q6.

Answer:

  1. To apply the partial F-test to the two variables ‘Age’ and ‘AverageDrive’, we remove these variables from the model and fit a reduced regression to compare with the original. The following commands carry out the partial F-test.
model <- lm(AverageWinnings~Age+AverageDrive+DrivingAccuracy+DrivingAccuracy+GreensonRegulation+AverageNumofPutts+SavePercent+NumEvents)
reducedModel1 <- lm(AverageWinnings~DrivingAccuracy+DrivingAccuracy+GreensonRegulation+AverageNumofPutts+SavePercent+NumEvents)
anova(model,reducedModel1)
## Analysis of Variance Table
## 
## Model 1: AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + DrivingAccuracy + 
##     GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents
## Model 2: AverageWinnings ~ DrivingAccuracy + DrivingAccuracy + GreensonRegulation + 
##     AverageNumofPutts + SavePercent + NumEvents
##   Res.Df        RSS Df   Sum of Sq      F Pr(>F)
## 1    188 3.2273e+11                             
## 2    190 3.2503e+11 -2 -2299556206 0.6698  0.513

ANOVA is used to compare the two models and test the contribution of the dropped terms: if the p-value is less than 0.05, then ‘Age’ and ‘AverageDrive’ significantly affect the model. As observed from the above ANOVA result, the p-value is 0.513 > 0.05, so we can conclude that ‘Age’ and ‘AverageDrive’ do not significantly affect the model and can be removed from it.
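
The partial F statistic can also be reproduced by hand from the two residual sums of squares (a sketch using the fitted objects above; the result should match the anova() table):

# Partial F statistic computed directly from the residual sums of squares
rss_full <- sum(residuals(model)^2)
rss_red  <- sum(residuals(reducedModel1)^2)
f_stat   <- ((rss_red - rss_full) / 2) / (rss_full / df.residual(model))
f_stat                                                   # about 0.67, as in the anova() output
pf(f_stat, 2, df.residual(model), lower.tail = FALSE)    # p-value of about 0.513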

  2. To apply the partial F-test to the three variables ‘Age’, ‘AverageDrive’, and ‘SavePercent’, we remove these variables from the model and fit a reduced regression to compare with the original. The following commands carry out the partial F-test.
model <- lm(AverageWinnings~Age+AverageDrive+DrivingAccuracy+DrivingAccuracy+GreensonRegulation+AverageNumofPutts+SavePercent+NumEvents)
reducedModel2 <- lm(AverageWinnings~DrivingAccuracy+DrivingAccuracy+GreensonRegulation+AverageNumofPutts+NumEvents)
anova(model,reducedModel2)
## Analysis of Variance Table
## 
## Model 1: AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + DrivingAccuracy + 
##     GreensonRegulation + AverageNumofPutts + SavePercent + NumEvents
## Model 2: AverageWinnings ~ DrivingAccuracy + DrivingAccuracy + GreensonRegulation + 
##     AverageNumofPutts + NumEvents
##   Res.Df        RSS Df   Sum of Sq      F Pr(>F)  
## 1    188 3.2273e+11                               
## 2    191 3.3456e+11 -3 -1.1822e+10 2.2956 0.0792 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As observed from the above ANOVA result, the p-value is 0.0792 > 0.05, so we can conclude that ‘Age’, ‘AverageDrive’, and ‘SavePercent’ jointly do not significantly affect the model and can be removed from it.

Q7.

Answer:

To obtain the standardized regression coefficients, we first apply unit normal scaling. The scaling must be applied after removing the non-numeric ‘Name’ variable from the original dataset.

pga1 <- subset(pga, select = -c(Name))

So, our new data set looks like this:

head(pga1)
##   Age AverageDrive DrivingAccuracy GreensonRegulation AverageNumofPutts
## 1  23        288.0            53.1               58.2             1.767
## 2  24        295.4            57.7               65.6             1.757
## 3  34        285.8            64.2               63.8             1.795
## 4  34        297.9            59.0               63.0             1.787
## 5  31        289.4            60.5               62.5             1.766
## 6  29        284.6            68.8               67.0             1.780
##   SavePercent MoneyRank NumEvents TotalWinnings AverageWinnings
## 1        50.9       123        27        632878           23440
## 2        59.3         7        16       3724984          232812
## 3        50.7        54        24       1313484           54729
## 4        47.7       101        20        808373           40419
## 5        43.5       146        30        486053           16202
## 6        50.9        52        23       1355433           58932

Now, we apply unit normal scaling to transform the data: each column is centered by its mean and divided by its standard deviation.

pga_unit_normal <- as.data.frame(apply(pga1,2,function(x){(x-mean(x))/sd(x)}))
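
An equivalent transformation can be obtained with base R’s scale(), which also centers each column and divides by its standard deviation (a sketch; pga_scaled is just an illustrative name):

# Alternative: scale() centers and rescales every column in one call
pga_scaled <- as.data.frame(scale(pga1))
max(abs(pga_scaled - pga_unit_normal))   # expected to be essentially zero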

To fit the regression on the scaled data:

model1_unit_normal <- lm(AverageWinnings~Age+AverageDrive+DrivingAccuracy+DrivingAccuracy+GreensonRegulation+AverageNumofPutts+SavePercent+NumEvents, data = pga_unit_normal)

To check the coefficients and the full summary:

model1_unit_normal
## 
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + 
##     DrivingAccuracy + GreensonRegulation + AverageNumofPutts + 
##     SavePercent + NumEvents, data = pga_unit_normal)
## 
## Coefficients:
##        (Intercept)                 Age        AverageDrive  
##          7.714e-16          -6.831e-02          -1.426e-02  
##    DrivingAccuracy  GreensonRegulation   AverageNumofPutts  
##         -2.279e-01           4.388e-01          -2.971e-01  
##        SavePercent           NumEvents  
##          1.396e-01          -2.716e-01
summary(model1_unit_normal)
## 
## Call:
## lm(formula = AverageWinnings ~ Age + AverageDrive + DrivingAccuracy + 
##     DrivingAccuracy + GreensonRegulation + AverageNumofPutts + 
##     SavePercent + NumEvents, data = pga_unit_normal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3037 -0.4033 -0.1225  0.3118  4.5086 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.714e-16  5.382e-02   0.000  1.00000    
## Age                -6.831e-02  6.042e-02  -1.131  0.25968    
## AverageDrive       -1.426e-02  8.538e-02  -0.167  0.86755    
## DrivingAccuracy    -2.279e-01  8.245e-02  -2.764  0.00628 ** 
## GreensonRegulation  4.388e-01  6.759e-02   6.493 7.30e-10 ***
## AverageNumofPutts  -2.971e-01  5.912e-02  -5.025 1.17e-06 ***
## SavePercent         1.396e-01  5.877e-02   2.375  0.01853 *  
## NumEvents          -2.716e-01  5.539e-02  -4.904 2.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7535 on 188 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.4323 
## F-statistic: 22.21 on 7 and 188 DF,  p-value: < 2.2e-16

So, model1_unit_normal gives the standardized regression coefficients, as seen in the summary above.

To compare the influence of the variables, we can compare the absolute values of their standardized regression coefficients. As observed from the summary, GreensonRegulation has the largest standardized coefficient in absolute value (0.439), so it is the most influential regressor.
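
The ranking can also be produced directly from the fitted object (a short sketch using the standardized model above):

# Order the standardized coefficients by absolute magnitude, dropping the intercept
b_std <- coef(model1_unit_normal)[-1]
sort(abs(b_std), decreasing = TRUE)   # GreensonRegulation should come out on top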

Q8.

Answer:

To test for multicollinearity, the variance inflation factor (VIF) is computed for each regressor; the vif() function is available in the ‘car’ package.

library(car)
## Warning: package 'car' was built under R version 3.3.2
vif(model)
##                Age       AverageDrive    DrivingAccuracy 
##           1.254131           2.504096           2.335193 
## GreensonRegulation  AverageNumofPutts        SavePercent 
##           1.569009           1.200508           1.186409 
##          NumEvents 
##           1.053810

The largest VIF is an indicator of multicollinearity: if the VIF exceeds 10 for any variable, the model suffers from multicollinearity. After observing the VIF value of each variable, we can conclude that this model does not suffer from multicollinearity, as every VIF is well below 10.
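
As a quick programmatic check (a sketch using the vif() output above and the usual cutoff of 10):

# Flag any regressor whose VIF exceeds the common cutoff of 10
v <- vif(model)
names(v)[v > 10]   # expected to be empty for this model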