The following are examples of investigating a dataset using Linear Regression in R. In future posts I will delve deeper into the data to form better models and draw more substantial conclusions from the evidence. For now, you should regard these examples as cursory in nature.
The data come from Larry Winner’s miscellaneous datasets, available on his webpage at http://www.stat.ufl.edu/~winner/datasets.html. The dataset in question is the LPGA Performance Statistics for the year 2009.
After downloading the data, we access the table with the read.table() function and get some sense of its dimensions.
set.seed(1234)
lpga <- read.table('lpga2009.dat')
dim(lpga)
## [1] 146 14
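Since read.table() is called here without header = TRUE, the columns arrive with the default names V1, V2, and so on, which is why they get renamed in the next step. A minimal sketch of that behavior, using a small made-up file written to a temporary location rather than the actual lpga2009.dat (the file contents and the name toy are hypothetical):

```r
# Hypothetical three-column file standing in for a headerless data file.
tmp <- tempfile(fileext = ".dat")
writeLines(c("1 250.3 70.1",
             "2 245.8 68.4"), tmp)

# With no header row, read.table() assigns default names V1, V2, ...
toy <- read.table(tmp)
names(toy)
## [1] "V1" "V2" "V3"
dim(toy)
## [1] 2 3

unlink(tmp)  # remove the temporary file
```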
To work with the data more easily, we will give each column an abbreviated name. Afterward, the dataset will be broken into a training and a test set.
names(lpga) <- c('ID','AvgDrv','PerFair','PerGreen','AvgPutts','PerSandSav','Prize','lnPrize','Tours','PuttsPH','CompTours','AvgPercentile','RoundComp','AvgStrokes')
train <- sample(146,100)
#Now we split the data into the training set...
lpga.train <- lpga[train,]
#and test set.
lpga.test <- lpga[-train,]
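The split above hard-codes the row count (146) and the training size (100). A minimal sketch of the same idea written against nrow(), with a couple of sanity checks; the data frame toy and its columns are stand-ins, not the LPGA data:

```r
set.seed(1234)

# Stand-in data frame with the same number of rows as the LPGA table.
toy <- data.frame(id = 1:146, x = rnorm(146))

# Draw 100 row indices without replacement for the training set.
train.idx <- sample(nrow(toy), 100)
toy.train <- toy[train.idx, ]
toy.test  <- toy[-train.idx, ]

# Sanity checks: the sizes add up and no row appears in both sets.
nrow(toy.train)
## [1] 100
nrow(toy.test)
## [1] 46
length(intersect(toy.train$id, toy.test$id))
## [1] 0
```

Negative indexing with -train.idx is what guarantees that the test set is exactly the complement of the training set.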
For now we will strictly build on what we know about the dataset through the training set. As an example of Simple Linear Regression, we will look at the relationship between the natural log of prize winnings and the number of completed tournament rounds for the year. Given the scale of the earnings, a log transformation of the prize winnings may be appropriate. We suspect a relationship because of the nature of golf tournaments: players who do not make the cut cannot progress in the tournament, and so lose the opportunity to earn more money based on where they finish in the standings. A plot will then be made in an x-y fashion comparing completed rounds for each participant with the log of prize money earned. In addition, a plot of fitted values against residuals will be made to gain an idea of the variance involved in this relationship.
#First, a linear model is constructed, and a statistical summary of the relationship is printed.
lm.lpga <- lm(lnPrize~RoundComp,data=lpga.train)
summary(lm.lpga)
##
## Call:
## lm(formula = lnPrize ~ RoundComp, data = lpga.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2845 -0.4106 -0.1098 0.3434 1.4839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.151511 0.192267 42.40 <2e-16 ***
## RoundComp 0.062324 0.003141 19.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.574 on 98 degrees of freedom
## Multiple R-squared: 0.8007, Adjusted R-squared: 0.7987
## F-statistic: 393.8 on 1 and 98 DF, p-value: < 2.2e-16
#A plot comparing completed rounds and the natural log of prize winnings.
plot(lpga.train$RoundComp,lpga.train$lnPrize,pch=19,xlab='Completed Rounds',ylab='ln of prize winnings')
#Here, we plot fitted vs. residuals for the model in question.
plot(fitted(lm.lpga),residuals(lm.lpga),xlab='Fitted',ylab='Residuals')
We can see from the first plot that there appears to be some relationship between the two variables (completed rounds, ln of prize winnings). However, one could infer that the relationship may not be best represented as linear. Again, though, a deeper investigation of the data is for another day. Note also in the second plot the uneven spread of the residuals across the fitted values, which hints that the variance is not constant.
For the Multiple Linear Regression example (several predictors, one response), we will need the lattice library to create scatter plot matrices. First, I will prepare the data in the following fashion:
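Because the response is the natural log of prize money, the RoundComp slope reported in the summary reads as a multiplicative effect on prize money rather than an additive one. A small worked computation using the printed estimate (0.062324), offered as an interpretation aid under the fitted model rather than a causal claim:

```r
# Slope for RoundComp, copied from the printed summary of lm.lpga.
b1 <- 0.062324

# One extra completed round multiplies the predicted prize by exp(b1).
mult <- exp(b1)

# Expressed as an approximate percentage change per additional round.
pct <- 100 * (mult - 1)
round(pct, 1)
## [1] 6.4
```

In other words, under this model each additional completed round goes with roughly 6.4% higher predicted prize winnings.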
#Load the lattice library
library(lattice)
#Create a new data object, excluding the variables ID and Prize.
splom.lpga <- lpga.train[,-c(1,7)]
#Now two separate plots both including lnPrize.
splom.lpga1 <- splom.lpga[,1:6]
splom.lpga2 <- splom.lpga[,6:12]
splom(splom.lpga1)
splom(splom.lpga2)
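A scatter plot matrix pairs naturally with a correlation matrix of the same columns, which puts a number on what each panel shows. A minimal sketch using R's built-in mtcars data as a stand-in for splom.lpga1 (the dataset and the column choice are purely illustrative, not part of the LPGA analysis):

```r
# mtcars ships with R; it stands in for the LPGA training data here.
toy <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pairwise Pearson correlations, rounded for readability.
cor.toy <- round(cor(toy), 2)
cor.toy
```

The same call on splom.lpga1 or splom.lpga2 would give the correlations behind the panels plotted above, assuming every column in them is numeric.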
Again, this is an initial attempt, so I won’t delve too much into figuring out the best model for now. However, here is an example of a model relating the ln value of prize money to putts per hole, completed rounds, and average strokes taken.
#Be sure to continue to utilize the training data.
summary(lm(lnPrize~PuttsPH+RoundComp+AvgStrokes,data=lpga.train))
##
## Call:
## lm(formula = lnPrize ~ PuttsPH + RoundComp + AvgStrokes, data = lpga.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97772 -0.25761 -0.02274 0.20946 1.67676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.517962 4.376090 11.544 <2e-16 ***
## PuttsPH -3.633066 1.536025 -2.365 0.02 *
## RoundComp 0.036481 0.003461 10.541 <2e-16 ***
## AvgStrokes -0.470914 0.072018 -6.539 3e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4024 on 96 degrees of freedom
## Multiple R-squared: 0.9041, Adjusted R-squared: 0.9011
## F-statistic: 301.5 on 3 and 96 DF, p-value: < 2.2e-16
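The adjusted R-squared in this output can be reproduced by hand from the multiple R-squared, the number of training rows (100), and the number of predictors (3), using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). A short check with the printed values:

```r
r2 <- 0.9041  # Multiple R-squared from the printed summary
n  <- 100     # rows in the training set
p  <- 3       # predictors: PuttsPH, RoundComp, AvgStrokes

# Residual degrees of freedom: n minus the predictors minus the intercept.
n - p - 1
## [1] 96

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj.r2, 4)
## [1] 0.9011
```

The same arithmetic with R-squared 0.8007 and one predictor gives 0.7987, matching the simple model fitted earlier.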