Quick and Dirty Linear Regression

The following are examples of investigating a set of data using Linear Regression in R. In future posts I will delve deeper into the data to form better models and draw more substantial conclusions from the evidence. For now, you should treat these examples as cursory in nature.

The data are from Larry Winner’s miscellaneous datasets, available on his webpage at http://www.stat.ufl.edu/~winner/datasets.html. The dataset in question is the LPGA Performance Statistics for the year 2009.

After downloading the data, we access the table with the read.table() function and get some sense of its dimensions.

set.seed(1234)
lpga <- read.table('lpga2009.dat')
dim(lpga)

## [1] 146  14

To more easily work with the data, we will give each variable in the data a name. The column names for the data are as follows:

Golfer id
average drive (yards)
percent of fairways hit
percent of greens reached in regulation
average putts per round
percent of sand saves (2 shots to hole)
prize winnings ($1000s)
ln(prize)
tournaments played in
green in regulation putts per hole
completed tournaments
average percentile in tournaments (high is good)
rounds completed
average strokes per round

An abbreviated name will be given to each column. Afterward, the dataset will be broken into a training and test set.

names(lpga) <- c('ID','AvgDrv','PerFair','PerGreen','AvgPutts','PerSandSav','Prize','lnPrize','Tours','PuttsPH','CompTours','AvgPercentile','RoundComp','AvgStrokes')
train <- sample(146,100)
#Now we split the data into the training set...
lpga.train <- lpga[train,]
#and test set.
lpga.test <- lpga[-train,]
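
As a quick sanity check on a split like the one above, the sketch below confirms that the two pieces partition the rows. It uses a made-up stand-in data frame (toy, with a hypothetical column x) rather than the real lpga data, and it sizes the sample with nrow() instead of hard-coding 146; both choices are assumptions for illustration only.

```r
# Hypothetical stand-in for the lpga data frame (146 rows, as in the post).
set.seed(1234)
toy <- data.frame(ID = 1:146, x = rnorm(146))

# Draw 100 row indices without replacement, sized from the data itself.
idx <- sample(nrow(toy), 100)
toy.train <- toy[idx, ]
toy.test  <- toy[-idx, ]

# The two pieces should partition the rows: no overlap, nothing lost.
nrow(toy.train)                                # 100
nrow(toy.test)                                 # 46
length(intersect(toy.train$ID, toy.test$ID))   # 0
```

With the real data, the same three checks can be run on lpga.train and lpga.test.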

For now we will build strictly on what the training set tells us about the data. As an example of Simple Linear Regression, we will look at the relationship between the natural log of prize winnings and the number of completed tournament rounds for the year. Because of the scale of the earnings, a log transformation of the prize winnings may be appropriate. We suspect a relationship because of the nature of golf tournaments: a golfer who doesn’t make the cut cannot progress in the tournament and loses the chance to earn more money from a higher finish in the standings. An x-y plot will then compare completed rounds for each participant with the log of prize money earned. In addition, a plot of fitted values against residuals from the linear model will give an idea of the variance involved in this relationship.

#First, a linear model is constructed, and a statistical summary of the relationship is printed.
lm.lpga <- lm(lnPrize~RoundComp,data=lpga.train)
summary(lm.lpga)

## 
## Call:
## lm(formula = lnPrize ~ RoundComp, data = lpga.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2845 -0.4106 -0.1098  0.3434  1.4839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.151511   0.192267   42.40   <2e-16 ***
## RoundComp   0.062324   0.003141   19.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.574 on 98 degrees of freedom
## Multiple R-squared:  0.8007, Adjusted R-squared:  0.7987 
## F-statistic: 393.8 on 1 and 98 DF,  p-value: < 2.2e-16

#A plot comparing completed rounds and the natural log of prize winnings.
plot(lpga.train$RoundComp,lpga.train$lnPrize,pch=19,xlab='Completed Rounds',ylab='ln of prize winnings')

#Here, we plot fitted vs. residuals for the model in question.
plot(fitted(lm.lpga),residuals(lm.lpga),xlab='Fitted',ylab='Residuals')

We can see from the first plot that there appears to be something of a relationship between the two variables (completed rounds, ln of prize winnings). However, one could infer that the relationship may not be best represented as linear. Again, though, a deeper investigation of the data is for another day. Note also in the second plot that the residuals are spread unevenly across the fitted values.
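
Because the response is on the log scale, the slope in the summary output is easiest to read as a multiplicative effect. The short sketch below copies the RoundComp estimate from that output (the names b1 and mult are mine, not part of the original analysis) and converts it to an approximate percent change in predicted prize money per additional completed round.

```r
# Slope estimate for RoundComp, copied from the summary output above.
b1 <- 0.062324

# The response is ln(prize), so one extra completed round multiplies
# the predicted prize by exp(b1).
mult <- exp(b1)

# Approximate percent change in predicted prize per additional round.
round(100 * (mult - 1), 1)   # 6.4
```

This is only a reading of the fitted line, not a causal claim: golfers who complete more rounds differ from the others in many ways.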

For the Multiple Linear Regression example, we will need the lattice package to create scatter plot matrices. First, I will prepare the data in the following fashion:

Create a separate data object from the training data, removing variables that will not be in consideration (ID,Prize).
Split the resulting object into two data objects, both containing the dependent variable lnPrize.
Create scatter plot matrices of both objects, looking for how the remaining independent variables relate to the dependent variable.

#Load the lattice library
library(lattice)
#Create a new data object, excluding the variables ID and Prize.
splom.lpga <- lpga.train[,-c(1,7)]
#Now two separate data objects, both including lnPrize (column 6).
splom.lpga1 <- splom.lpga[,1:6]
splom.lpga2 <- splom.lpga[,6:12]
splom(splom.lpga1)

splom(splom.lpga2)
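
A numeric companion to the scatter plot matrices is the correlation of each variable with the response. The sketch below runs on a made-up stand-in data frame (toy, with simulated values that only borrow the column names RoundComp, AvgStrokes and lnPrize), so the numbers it prints say nothing about the real LPGA data.

```r
set.seed(42)
# Hypothetical stand-in for splom.lpga: simulated predictors plus lnPrize.
toy <- data.frame(RoundComp  = runif(100, 20, 90),
                  AvgStrokes = rnorm(100, 72, 1.5))
toy$lnPrize <- 40 + 0.04 * toy$RoundComp - 0.45 * toy$AvgStrokes +
  rnorm(100, sd = 0.4)

# Correlation of every column with the response, sorted by absolute size.
r <- cor(toy)[, "lnPrize"]
r[order(abs(r), decreasing = TRUE)]
```

With the real data, cor(splom.lpga)[, "lnPrize"] would give the analogous vector for the training set.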

Again, this is an initial attempt, so I won’t delve too much into figuring out the best model for now. However, here is an example of a model relating the natural log of prize money to putts per hole, completed rounds, and average strokes taken.

#Be sure to continue to utilize the training data.
summary(lm(lnPrize~PuttsPH+RoundComp+AvgStrokes,data=lpga.train))

## 
## Call:
## lm(formula = lnPrize ~ PuttsPH + RoundComp + AvgStrokes, data = lpga.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97772 -0.25761 -0.02274  0.20946  1.67676 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 50.517962   4.376090  11.544   <2e-16 ***
## PuttsPH     -3.633066   1.536025  -2.365     0.02 *  
## RoundComp    0.036481   0.003461  10.541   <2e-16 ***
## AvgStrokes  -0.470914   0.072018  -6.539    3e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4024 on 96 degrees of freedom
## Multiple R-squared:  0.9041, Adjusted R-squared:  0.9011 
## F-statistic: 301.5 on 3 and 96 DF,  p-value: < 2.2e-16
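
The test set created earlier has not been used yet. One way it could be put to work is sketched below: fit on the training rows, predict the held-out rows with predict(), and summarize the error as a root mean squared error. The data here are simulated (toy is a hypothetical stand-in, and the coefficients used to generate it are made up), so the resulting number is not a result for the LPGA data.

```r
set.seed(1)
# Hypothetical stand-in data; with the real data use lpga.train / lpga.test.
n <- 146
toy <- data.frame(RoundComp = round(runif(n, 20, 90)))
toy$lnPrize <- 8 + 0.06 * toy$RoundComp + rnorm(n, sd = 0.5)

# Same 100 / 46 split sizes as in the post.
idx <- sample(n, 100)
fit <- lm(lnPrize ~ RoundComp, data = toy[idx, ])

# Predict the held-out rows and compute the out-of-sample RMSE.
pred <- predict(fit, newdata = toy[-idx, ])
rmse <- sqrt(mean((toy$lnPrize[-idx] - pred)^2))
rmse
```

Computing the same quantity for both the simple and the three-variable model on lpga.test would show whether the higher R-squared on the training data carries over to unseen golfers.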