Try to develop a model that predicts total winnings using the same techniques we just reviewed:

- Check assumptions: a linear relationship between the dependent and independent variables (plots), normally distributed predictor variables (skewness, Shapiro-Wilk test), no correlation among the predictors (VIF), and constant error variance (ncvTest).
- Create a model using lm().
- Use the model to create predicted values and, if you have time, compute the RMSE.

We are going to use data located on a website here: http://www.stat.ufl.edu/~winner/data/pga2004.dat
Information on the dataset is located here: http://www.stat.ufl.edu/~winner/data/pga2004.txt

Use the data.table package to get your data into R, then develop an initial model.
library(data.table)
pgadata <- fread("http://www.stat.ufl.edu/~winner/data/pga2004.dat", fill = TRUE)
head(pgadata)
##       V1         V2 V3    V4   V5   V6    V7   V8  V9 V10     V11    V12
## 1: Aaron   Baddeley 23 288.0 53.1 58.2 1.767 50.9 123  27  632878  23440
## 2:  Adam      Scott 24 295.4 57.7 65.6 1.757 59.3   7  16 3724984 232812
## 3:  Alex      Cejka 34 285.8 64.2 63.8 1.795 50.7  54  24 1313484  54729
## 4: Andre      Stolz 34 297.9 59.0 63.0 1.787 47.7 101  20  808373  40419
## 5: Arjun      Atwal 31 289.4 60.5 62.5 1.766 43.5 146  30  486053  16202
## 6: Arron Oberholser 29 284.6 68.8 67.0 1.780 50.9  52  23 1355433  58932
##    V13
## 1:  NA
## 2:  NA
## 3:  NA
## 4:  NA
## 5:  NA
## 6:  NA
str(pgadata)
## Classes 'data.table' and 'data.frame': 196 obs. of 13 variables:
## $ V1 : chr "Aaron" "Adam" "Alex" "Andre" ...
## $ V2 : chr "Baddeley" "Scott" "Cejka" "Stolz" ...
## $ V3 : chr "23" "24" "34" "34" ...
## $ V4 : num 288 295 286 298 289 ...
## $ V5 : num 53.1 57.7 64.2 59 60.5 68.8 74.2 64.4 64.3 62.6 ...
## $ V6 : num 58.2 65.6 63.8 63 62.5 67 68.9 64.2 63.4 65.3 ...
## $ V7 : num 1.77 1.76 1.79 1.79 1.77 ...
## $ V8 : num 50.9 59.3 50.7 47.7 43.5 50.9 40.4 53.8 42.2 47.7 ...
## $ V9 : num 123 7 54 101 146 52 80 75 141 83 ...
## $ V10: int 27 16 24 20 30 23 23 27 20 15 ...
## $ V11: int 632878 3724984 1313484 808373 486053 1355433 962167 1036958 500818 943589 ...
## $ V12: int 23440 232812 54729 40419 16202 58932 41833 38406 25041 62906 ...
## $ V13: int NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
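The str() output above shows V3 (the age column) read as character and V13 as NA in the first rows, which suggests that some rows did not split into the expected fields. One possible explanation, not confirmed by the source, is that a few player names contain more than two words, so the remaining values shift one column to the right. A minimal sketch for locating such rows follows; the helper name flag_non_numeric and the toy rows are hypothetical and exist only for illustration.

```r
# Hypothetical helper: return the row indices where a column that should
# hold numbers contains text that cannot be parsed as a number.
flag_non_numeric <- function(dt, col) {
  values <- dt[[col]]
  parsed <- suppressWarnings(as.numeric(values))
  which(is.na(parsed) & !is.na(values))
}

# Toy rows (made up, not taken from the PGA file) that mimic the pattern:
# the second row has a three-word name, so V3 holds text instead of an age.
toy <- data.frame(
  V1 = c("Aaron", "Some"),
  V2 = c("Baddeley", "Three"),
  V3 = c("23", "Worder"),
  stringsAsFactors = FALSE
)

bad_rows <- flag_non_numeric(toy, "V3")
bad_rows
```

Because a data.table is also a data.frame, the same helper could be applied to the real table, for example pgadata[flag_non_numeric(pgadata, "V3")], to list the rows that need inspection before modelling.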
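# Assign descriptive column names. "Save%" and "NA" are non-syntactic names,
# so they need backticks when referenced later, e.g. pgadata$`Save%`.
names(pgadata) <- c("First", "Last", "Age", "AveDrive", "DriveAccur",
                    "GreensReg", "AvePuts", "Save%", "MoneyRank",
                    "NoEvents", "TotalWin", "AverWin", "NA")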
head(pgadata)
##    First       Last Age AveDrive DriveAccur GreensReg AvePuts Save%
## 1: Aaron   Baddeley  23    288.0       53.1      58.2   1.767  50.9
## 2:  Adam      Scott  24    295.4       57.7      65.6   1.757  59.3
## 3:  Alex      Cejka  34    285.8       64.2      63.8   1.795  50.7
## 4: Andre      Stolz  34    297.9       59.0      63.0   1.787  47.7
## 5: Arjun      Atwal  31    289.4       60.5      62.5   1.766  43.5
## 6: Arron Oberholser  29    284.6       68.8      67.0   1.780  50.9
##    MoneyRank NoEvents TotalWin AverWin NA
## 1:       123       27   632878   23440 NA
## 2:         7       16  3724984  232812 NA
## 3:        54       24  1313484   54729 NA
## 4:       101       20   808373   40419 NA
## 5:       146       30   486053   16202 NA
## 6:        52       23  1355433   58932 NA
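With the columns named, the remaining steps of the exercise (fit with lm(), generate predictions, compute the RMSE) could look roughly like the sketch below. It is only an illustration under stated assumptions: the data frame sim is simulated so the code runs without a download, its values are invented, and the choice of AveDrive, GreensReg and AvePuts as predictors is an arbitrary starting point rather than the model the exercise expects.

```r
# Simulated stand-in for the cleaned PGA table. The column names match the
# ones assigned above, but every value here is made up for illustration.
set.seed(42)
n <- 196
sim <- data.frame(
  AveDrive  = rnorm(n, mean = 287, sd = 8),
  GreensReg = rnorm(n, mean = 65, sd = 3),
  AvePuts   = rnorm(n, mean = 1.78, sd = 0.02)
)
sim$TotalWin <- 1e6 + 8e4 * (sim$GreensReg - 65) -
  1.5e7 * (sim$AvePuts - 1.78) + rnorm(n, sd = 3e5)

# Fit an initial linear model with lm()
fit <- lm(TotalWin ~ AveDrive + GreensReg + AvePuts, data = sim)

# One assumption check available in base R: Shapiro-Wilk test on residuals.
# If the car package is installed, car::vif(fit) and car::ncvTest(fit)
# would cover collinearity and constant error variance.
resid_normality <- shapiro.test(residuals(fit))

# Predicted values and root mean squared error (RMSE)
sim$Predicted <- predict(fit, newdata = sim)
rmse <- sqrt(mean((sim$TotalWin - sim$Predicted)^2))
rmse
```

On the real data, sim would be replaced by pgadata once any misread rows have been dealt with. The RMSE is in the same units as TotalWin, so it reads directly as a typical prediction error in dollars.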