#load the packages used below
library(reshape2)      #melt()
library(ggplot2)       #faceted histograms and scatterplots
library(corrplot)      #correlation plot
library(RColorBrewer)  #brewer.pal() palette

#get the data
data <- read.csv("C:/Users/Paul/OneDrive - CUNY School of Professional Studies/CUNY/DATA 621/pgatour2006.csv")

#pull out only the columns mentioned in the question:
#PrizeMoney, DrivingAccuracy, GIR, PuttingAverage,
#BirdieConversion, SandSaves, Scrambling, PuttsPerRound
df <- data[, c(3, 5, 6, 7, 8, 9, 10, 12)]

# a few quick plots
pairs(df)

hist(df$PrizeMoney)

#log-transform the right-skewed target
df$PrizeMoney <- log(df$PrizeMoney)
pairs(df)

Part A

Based on the above, it looks as though our target \(Y\) (PrizeMoney) is quite skewed to the right. I agree with the idea that we should perform a log transform. Comparing the first column and row of the two pairs plots above, the transform makes the data look much more manageable. Below we’ll look at a few more detailed plots for confirmation.
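As a quick numeric check before moving on (a minimal sketch using an inline moment-based skewness function, so no extra packages are assumed), we can compare the skewness of PrizeMoney before and after the transform:

#approximate moment-based skewness (sd() uses n-1, close enough for a quick check)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(data$PrizeMoney)       #raw target: expect a large positive value
skew(log(data$PrizeMoney))  #logged target: expect a value much nearer 0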

#melt data
df_melt <- melt(df,"PrizeMoney")

#build hist per group
ggplot(df_melt,aes(value)) +
  geom_histogram() +
  facet_grid(.~variable)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#scatterplot per group
ggplot(df_melt,aes(PrizeMoney,value)) +
  geom_point() +
  facet_grid(.~variable)

ggplot(df_melt,aes(value,PrizeMoney)) +
  geom_point() +
  facet_grid(.~variable)

Part B

Based on the pairs plot above (after performing the log transform on \(Y\)!), the relationships look mostly linear, so a linear model is probably appropriate.

m <- lm(PrizeMoney ~ ., data = df)
summary(m)
## 
## Call:
## lm(formula = PrizeMoney ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.71949 -0.48608 -0.09172  0.44561  2.14013 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.194300   7.777129   0.025 0.980095    
## DrivingAccuracy  -0.003530   0.011773  -0.300 0.764636    
## GIR               0.199311   0.043817   4.549 9.66e-06 ***
## PuttingAverage   -0.466304   6.905698  -0.068 0.946236    
## BirdieConversion  0.157341   0.040378   3.897 0.000136 ***
## SandSaves         0.015174   0.009862   1.539 0.125551    
## Scrambling        0.051514   0.031788   1.621 0.106788    
## PuttsPerRound    -0.343131   0.473549  -0.725 0.469601    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6639 on 188 degrees of freedom
## Multiple R-squared:  0.5577, Adjusted R-squared:  0.5412 
## F-statistic: 33.87 on 7 and 188 DF,  p-value: < 2.2e-16
plot(m$residuals,main="Residuals")

plot(rstandard(m),main="Standardized Residuals")

plot(rstandard(m) ~ df$PrizeMoney,main="Standardized Residuals vs PrizeMoney")

hist(m$residuals,main="Residuals Histogram")

qqnorm(m$residuals)
qqline(m$residuals) 

Based on the plots above, the Q-Q plot fit is quite good and the residuals appear approximately normally distributed, which supports the use of a linear model in this case. Inspecting the summary, we can also see that most of the variables are not pulling their weight here: only “GIR” and “BirdieConversion” show any significance.
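To back up the visual read of the Q-Q plot, we can also run a formal normality test on the residuals (a minimal sketch using base R’s shapiro.test(); a large p-value is consistent with normality):

#Shapiro-Wilk normality test on the model residuals
shapiro.test(m$residuals)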

Part C

We see one point in the standardized residual plot with a value greater than 3, which should probably be investigated. The code below shows that the value of interest is about 3.31, at row 185.

rsm <- as.data.frame(rstandard(m))

loc <- which(abs(rsm) > 3)

value <- rsm[loc,]

loc
## [1] 185
value
## [1] 3.309034
df[184:186,]
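To gauge how much this point actually matters (a sketch, not part of the original fit), we could look at Cook’s distance and refit the model without the flagged row for comparison:

#Cook's distance flags influential observations (rule of thumb: > 4/n)
plot(cooks.distance(m), main = "Cook's Distance")
abline(h = 4 / nrow(df), lty = 2)

#refit without the flagged row and compare against the original R-squared
m_drop <- lm(PrizeMoney ~ ., data = df[-loc, ])
summary(m_drop)$r.squared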

Part D

The correlation plot below shows some strong relationships among the variables that have gone into our model. This multicollinearity can inflate the standard errors of the coefficient estimates, but it should be straightforward to resolve by removing redundant variables.

M <- cor(df)
corrplot(M, type = "upper", order = "hclust",
         col = brewer.pal(n = 8, name = "RdYlBu"))
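Variance inflation factors put a number on the same issue; a sketch assuming the car package is installed (VIFs above roughly 5 to 10 are commonly read as problematic):

#variance inflation factors for each predictor (assumes the car package)
library(car)
vif(m)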

Part E

Because the predictors interact with one another, and given the apparent multicollinearity, removing all of the insignificant variables at once could have an undesirable impact on the model: a variable that looks insignificant now may become significant once a correlated predictor is dropped. It would be more reasonable to winnow the model down iteratively, removing one variable at a time and refitting.
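A sketch of that process in base R: drop1() shows the effect of removing each single term, update() refits without the weakest one (PuttingAverage, per the summary above), and step() automates the whole backward pass using AIC.

#F-test for dropping each term individually
drop1(m, test = "F")

#remove the least significant predictor and refit
m2 <- update(m, . ~ . - PuttingAverage)
summary(m2)

#or let step() run the full backward elimination on AIC
m_step <- step(m, direction = "backward", trace = 0)
summary(m_step)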