Investigators studied physical characteristics and ability in 13
football punters. Each volunteer punted a
football ten times. The investigators recorded the average distance for
the ten punts, in feet. They also
recorded the average hang time (time the ball is in the air before the
receiver catches it) for the ten punts, in seconds. This means that
there are two possible response variables, distance and hang time. In
addition, the investigators recorded five measures of strength and
flexibility for each punter: right leg strength (pounds), left leg
strength (pounds), right hamstring muscle flexibility (degrees), left
hamstring muscle flexibility (degrees), and overall leg strength
(foot-pounds). The data comes from the study “The relationship between
selected physical performance variables and football punting ability” by
the Department of Health, Physical Education and Recreation at the
Virginia Polytechnic Institute and State University, 1983. The dataset
file is punting.csv.
| Variable | Description |
|---|---|
| Distance | Distance travelled in feet |
| Hang | Time in air in seconds |
| R_Strength | Right leg strength in pounds |
| L_Strength | Left leg strength in pounds |
| R_Flexibility | Right leg flexibility in degrees |
| L_Flexibility | Left leg flexibility in degrees |
| O_Strength | Overall leg strength in foot-pounds |
This is a very small dataset, with only 13 observations, so in relative
terms the number of variables, 7, is quite high (p ≈ n). In the era of
Big Data, people may easily be misled into believing that only problems
involving very large volumes of data are difficult to tackle. While
there are various ways to handle large volumes of data (e.g. parallel
and distributed computation, among many others), it is important to
understand that a lack of data can also pose very difficult challenges
in statistics.
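To make the small-sample point concrete, here is a minimal sketch on simulated data (not the punting data). The helper `simulate_fit` and the variables `x1` to `x5` are hypothetical names introduced only for this illustration. It fits the same five-predictor model with 13 and with 1300 observations and compares the residual degrees of freedom and the coefficient standard errors.

```r
# Simulated data only: five predictors, of which only x1 truly affects y.
set.seed(1)

simulate_fit <- function(n, p = 5) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  colnames(X) <- paste0("x", seq_len(p))
  y <- X[, 1] + rnorm(n)
  lm(y ~ ., data = data.frame(y = y, X))
}

small <- simulate_fit(13)    # same n as the punting data
large <- simulate_fit(1300)

# Residual degrees of freedom are n - p - 1.
df.residual(small)
df.residual(large)

# Standard errors of the five slopes under each sample size.
se_small <- coef(summary(small))[-1, "Std. Error"]
se_large <- coef(summary(large))[-1, "Std. Error"]
round(rbind(n13 = se_small, n1300 = se_large), 3)
```

With only a handful of residual degrees of freedom, each coefficient is estimated very imprecisely, which is the situation ridge and lasso are meant to help with.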
Your Task: Perform a linear regression analysis on this dataset using
(a) standard least squares; (b) Ridge Regression; (c) Lasso. First
perform the analysis with (i) Distance as a function of all variables
except Hang, then repeat it with (ii) Hang as a function of all
variables except Distance.
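The three methods differ only in the penalty added to the least squares criterion. The sketch below writes out the objective that glmnet documents for a Gaussian response, RSS/(2n) + lambda * [(1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|)], so that alpha = 0 gives ridge, alpha = 1 gives lasso, and lambda = 0 reduces to least squares. The function `penalized_loss` and the toy numbers are hypothetical and serve only as an illustration; glmnet also standardizes the predictors internally by default, which this sketch does not do.

```r
# Elastic-net objective for a Gaussian response; the intercept beta0 is not penalized.
penalized_loss <- function(beta0, beta, X, y, lambda, alpha) {
  n <- nrow(X)
  rss <- sum((y - beta0 - X %*% beta)^2)
  penalty <- (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))
  rss / (2 * n) + lambda * penalty
}

# Toy inputs, chosen only so the arithmetic is easy to follow.
X <- cbind(c(1, 2, 3), c(0, 1, 0))
y <- c(1, 2, 3)
beta <- c(1, 0.5)

ols_loss   <- penalized_loss(0, beta, X, y, lambda = 0, alpha = 0)  # no penalty
ridge_loss <- penalized_loss(0, beta, X, y, lambda = 1, alpha = 0)  # squared-coefficient penalty
lasso_loss <- penalized_loss(0, beta, X, y, lambda = 1, alpha = 1)  # absolute-coefficient penalty
c(ols = ols_loss, ridge = ridge_loss, lasso = lasso_loss)
```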
rm(list = ls())
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
library(ggplot2)
#install.packages('glmnet')
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
punt <- read.csv('punting.csv')
summary(punt)
## Distance Hang R_Strength L_Strength
## Min. :104.9 Min. :3.020 Min. :110.0 Min. :110.0
## 1st Qu.:140.2 1st Qu.:3.640 1st Qu.:130.0 1st Qu.:130.0
## Median :150.2 Median :4.040 Median :150.0 Median :150.0
## Mean :148.2 Mean :3.921 Mean :147.7 Mean :143.8
## 3rd Qu.:163.5 3rd Qu.:4.180 3rd Qu.:170.0 3rd Qu.:160.0
## Max. :192.0 Max. :4.750 Max. :180.0 Max. :180.0
## R_Flexibility L_Flexibility O_Strength
## Min. : 85.00 Min. : 78.00 Min. :130.2
## 1st Qu.: 90.00 1st Qu.: 86.00 1st Qu.:153.9
## Median : 93.00 Median : 93.00 Median :197.1
## Mean : 95.69 Mean : 91.23 Mean :196.2
## 3rd Qu.:103.00 3rd Qu.: 94.00 3rd Qu.:240.6
## Max. :108.00 Max. :106.00 Max. :266.6
names(punt)
## [1] "Distance" "Hang" "R_Strength" "L_Strength"
## [5] "R_Flexibility" "L_Flexibility" "O_Strength"
Set up functions
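# The seed is set once, here, so the cross-validation folds are reproducible when
# the script is run from the top. It is not reset inside week3(), so every later
# call draws different folds.
set.seed(8)

# Standard least squares for the chosen response on the five strength/flexibility predictors.
week3_lm <- function(output = 'Distance') {
  punt_lm <- lm(data = punt, get(output, punt) ~ R_Strength + L_Strength + R_Flexibility + L_Flexibility + O_Strength)
  cat('\n\n*** Standard lm for ', output)
  print(summary(punt_lm))
  # Residuals vs fitted values
  plot(punt_lm, which = 1, pch = '🏈', sub = paste('standard least square lm of ', output))
}

# Penalised regression with glmnet: a = 0 is ridge, a = 1 is lasso, 0 < a < 1 is elastic net.
week3 <- function(a = 1, output = 'Distance') {
  # Predictor matrix without the intercept column
  X <- model.matrix(data = punt, get(output, punt) ~ R_Strength + L_Strength + R_Flexibility + L_Flexibility + O_Strength)[, -1]
  Y <- get(output, punt)
  # Coefficient paths over the lambda sequence
  plot(glmnet(X, Y, alpha = a), sub = paste('Function of ', output, ', alpha= ', a))
  # k-fold cross-validation; grouped = FALSE because each fold holds fewer than 3 observations
  punt_cv <- cv.glmnet(X, Y, alpha = a, grouped = FALSE)
  plot(punt_cv, sub = paste('k-fold function of ', output, ', alpha= ', a))
  best_ridge_lambda <- punt_cv$lambda.min  # lambda with the lowest CV error (not used below)
  cat('\n\n*** Coefficients for model with ', output, 'as repsonse variable and alpha = ', a, '\n')
  # NOTE: coef() on a cv.glmnet object chooses lambda through `s`, not `lambda`, so this
  # call appears to print the default s = "lambda.1se" fit; s = "lambda.min" would
  # select the minimum-CV-error fit instead.
  print(coef(punt_cv, lambda = punt_cv$lambda.min))
  # Mean of the CV error over the whole lambda sequence, not the error at the chosen lambda
  cat('\nMean of MSE with alpha = ', a, ' is: ', mean(punt_cv$cvm))
}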
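In `week3()`, the coefficients are requested with `lambda = punt_cv$lambda.min` and the reported error is `mean(punt_cv$cvm)`. In glmnet, `coef()` on a `cv.glmnet` object selects the penalty through its `s` argument, whose default is `"lambda.1se"`, and `cvm` holds one cross-validated error per lambda value. The sketch below, on simulated data (not the punting data, with hypothetical predictor names `x1` to `x5`), shows one way to extract the coefficients and the CV error at the minimising lambda explicitly.

```r
library(glmnet)

# Simulated stand-in for a 13 x 5 design; only x1 truly affects y.
set.seed(8)
n <- 13
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- 2 * X[, 1] + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 1, grouped = FALSE)

# Coefficients at the lambda minimising the CV error, and at the more
# conservative one-standard-error lambda.
coef_min <- coef(cv_fit, s = "lambda.min")
coef_1se <- coef(cv_fit, s = "lambda.1se")

# CV error at the chosen lambda, rather than the mean over the whole path.
cv_at_min <- min(cv_fit$cvm)
cv_path_mean <- mean(cv_fit$cvm)

print(coef_min)
cat("CV MSE at lambda.min:", cv_at_min, "\n")
cat("Mean CV MSE over the lambda path:", cv_path_mean, "\n")
```

The mean over the path mixes in heavily penalised fits, so it is generally larger than the error at `lambda.min` and is not a property of any single fitted model.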
# Distance as response variable
week3_lm()
##
##
## *** Standard lm for Distance
## Call:
## lm(formula = get(output, punt) ~ R_Strength + L_Strength + R_Flexibility +
## L_Flexibility + O_Strength, data = punt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3829 -9.5711 -0.2166 5.4988 20.0188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.58047 65.70042 -0.450 0.666
## R_Strength 0.27877 0.45638 0.611 0.561
## L_Strength 0.06971 0.48388 0.144 0.890
## R_Flexibility 1.24146 1.44927 0.857 0.420
## L_Flexibility -0.39535 0.74472 -0.531 0.612
## O_Strength 0.22369 0.13053 1.714 0.130
##
## Residual standard error: 14.65 on 7 degrees of freedom
## Multiple R-squared: 0.8144, Adjusted R-squared: 0.6818
## F-statistic: 6.142 on 5 and 7 DF, p-value: 0.01694
week3(0)
##
##
## *** Coefficients for model with Distance as repsonse variable and alpha = 0
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 23.63545797
## R_Strength 0.17131996
## L_Strength 0.14701238
## R_Flexibility 0.49525762
## L_Flexibility 0.12927353
## O_Strength 0.09665104
##
## Mean of MSE with alpha = 0 is: 469.6738
week3(1)
##
##
## *** Coefficients for model with Distance as repsonse variable and alpha = 1
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 23.3327076
## R_Strength 0.2498696
## L_Strength .
## R_Flexibility 0.6174439
## L_Flexibility .
## O_Strength 0.1473687
##
## Mean of MSE with alpha = 1 is: 429.1399
# now Hang
week3_lm('Hang')
##
##
## *** Standard lm for Hang
## Call:
## lm(formula = get(output, punt) ~ R_Strength + L_Strength + R_Flexibility +
## L_Flexibility + O_Strength, data = punt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35802 -0.05761 0.02460 0.05586 0.31529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6449613 0.9856374 0.654 0.5338
## R_Strength 0.0011040 0.0068466 0.161 0.8764
## L_Strength 0.0121468 0.0072592 1.673 0.1382
## R_Flexibility -0.0002985 0.0217420 -0.014 0.9894
## L_Flexibility 0.0069159 0.0111723 0.619 0.5555
## O_Strength 0.0038897 0.0019581 1.986 0.0873 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2198 on 7 degrees of freedom
## Multiple R-squared: 0.8821, Adjusted R-squared: 0.7979
## F-statistic: 10.47 on 5 and 7 DF, p-value: 0.003782
week3(0,'Hang')
##
##
## *** Coefficients for model with Hang as repsonse variable and alpha = 0
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.196343186
## R_Strength 0.003418564
## L_Strength 0.003773457
## R_Flexibility 0.009525175
## L_Flexibility 0.004903473
## O_Strength 0.001620386
##
## Mean of MSE with alpha = 0 is: 0.1581544
week3(1,'Hang')
##
##
## *** Coefficients for model with Hang as repsonse variable and alpha = 1
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.278409e+00
## R_Strength 2.977796e-05
## L_Strength 9.537705e-03
## R_Flexibility 7.831749e-03
## L_Flexibility .
## O_Strength 2.632997e-03
##
## Mean of MSE with alpha = 1 is: 0.07776399
# something in the middle: elastic net
week3(0.5,'Hang')
##
##
## *** Coefficients for model with Hang as repsonse variable and alpha = 0.5
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.563684131
## R_Strength 0.002638934
## L_Strength 0.004788556
## R_Flexibility 0.010104490
## L_Flexibility .
## O_Strength 0.001588263
##
## Mean of MSE with alpha = 0.5 is: 0.09111192
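Since `cv.glmnet` draws its folds at random, CV errors for different alpha values are easiest to compare when every fit uses the same folds. A minimal sketch, assuming simulated data (not the punting data) and leave-one-out folds supplied through the `foldid` argument, which makes the comparison deterministic:

```r
library(glmnet)

# Simulated stand-in for a 13 x 5 design; only x1 truly affects y.
set.seed(8)
n <- 13
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- 2 * X[, 1] + rnorm(n)

# Leave-one-out: every observation is its own fold, so no randomness is involved.
foldid <- seq_len(n)

alphas <- c(0, 0.5, 1)  # ridge, elastic net, lasso
cv_min <- sapply(alphas, function(a) {
  fit <- cv.glmnet(X, y, alpha = a, foldid = foldid, grouped = FALSE)
  min(fit$cvm)  # CV error at the best lambda for this alpha
})
names(cv_min) <- paste0("alpha=", alphas)
cv_min
```

With identical folds, any difference between the three numbers comes from the penalty rather than from how the data happened to be split.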
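# Distance as response variable (second run of the same calls; the seed is not
# reset, so the CV folds and hence the glmnet results differ from the first run)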
week3_lm()
##
##
## *** Standard lm for Distance
## Call:
## lm(formula = get(output, punt) ~ R_Strength + L_Strength + R_Flexibility +
## L_Flexibility + O_Strength, data = punt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3829 -9.5711 -0.2166 5.4988 20.0188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.58047 65.70042 -0.450 0.666
## R_Strength 0.27877 0.45638 0.611 0.561
## L_Strength 0.06971 0.48388 0.144 0.890
## R_Flexibility 1.24146 1.44927 0.857 0.420
## L_Flexibility -0.39535 0.74472 -0.531 0.612
## O_Strength 0.22369 0.13053 1.714 0.130
##
## Residual standard error: 14.65 on 7 degrees of freedom
## Multiple R-squared: 0.8144, Adjusted R-squared: 0.6818
## F-statistic: 6.142 on 5 and 7 DF, p-value: 0.01694
week3(0)
##
##
## *** Coefficients for model with Distance as repsonse variable and alpha = 0
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 31.25399197
## R_Strength 0.15932951
## L_Strength 0.13839581
## R_Flexibility 0.46148911
## L_Flexibility 0.13278278
## O_Strength 0.08800146
##
## Mean of MSE with alpha = 0 is: 464.363
week3(1)
##
##
## *** Coefficients for model with Distance as repsonse variable and alpha = 1
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 18.1282781
## R_Strength 0.2624381
## L_Strength .
## R_Flexibility 0.6386499
## L_Flexibility .
## O_Strength 0.1540913
##
## Mean of MSE with alpha = 1 is: 440.2299
# now Hang
week3_lm('Hang')
##
##
## *** Standard lm for Hang
## Call:
## lm(formula = get(output, punt) ~ R_Strength + L_Strength + R_Flexibility +
## L_Flexibility + O_Strength, data = punt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35802 -0.05761 0.02460 0.05586 0.31529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6449613 0.9856374 0.654 0.5338
## R_Strength 0.0011040 0.0068466 0.161 0.8764
## L_Strength 0.0121468 0.0072592 1.673 0.1382
## R_Flexibility -0.0002985 0.0217420 -0.014 0.9894
## L_Flexibility 0.0069159 0.0111723 0.619 0.5555
## O_Strength 0.0038897 0.0019581 1.986 0.0873 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2198 on 7 degrees of freedom
## Multiple R-squared: 0.8821, Adjusted R-squared: 0.7979
## F-statistic: 10.47 on 5 and 7 DF, p-value: 0.003782
week3(0,'Hang')
##
##
## *** Coefficients for model with Hang as repsonse variable and alpha = 0
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.274348721
## R_Strength 0.003317280
## L_Strength 0.003638719
## R_Flexibility 0.009281820
## L_Flexibility 0.004807993
## O_Strength 0.001560915
##
## Mean of MSE with alpha = 0 is: 0.1526909
week3(1,'Hang')
##
##
## *** Coefficients for model with Hang as repsonse variable and alpha = 1
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.278409e+00
## R_Strength 2.977796e-05
## L_Strength 9.537705e-03
## R_Flexibility 7.831749e-03
## L_Flexibility .
## O_Strength 2.632997e-03
##
## Mean of MSE with alpha = 1 is: 0.09410517
week3(0.5,'Hang')
##
##
## *** Coefficients for model with Hang as repsonse variable and alpha = 0.5
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.372475483
## R_Strength 0.002750878
## L_Strength 0.005302828
## R_Flexibility 0.010664087
## L_Flexibility .
## O_Strength 0.001828595
##
## Mean of MSE with alpha = 0.5 is: 0.08429597
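The model for Hang fits better than the model for Distance: in the
least squares fits, the adjusted R-squared is 0.7979 for Hang and
0.6818 for Distance.
Ridge and Lasso improve the model compared with standard least
squares. For a given response variable, there is not much difference
between Ridge and Lasso; however, the Lasso removes some independent
variables by setting their coefficients to exactly zero (L_Strength and
L_Flexibility for Distance, L_Flexibility for Hang).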
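To compare least squares with Ridge and Lasso on the same footing, the least squares model needs a cross-validated error too, since its in-sample fit is always optimistic. One way to obtain it, sketched below on simulated data (not the punting data, with hypothetical column names `X1` to `X5`), is the leave-one-out shortcut for linear models: the held-out residual for observation i equals the ordinary residual divided by 1 minus its leverage.

```r
# Simulated stand-in for a 13-observation, 5-predictor dataset.
set.seed(1)
n <- 13
dat <- data.frame(matrix(rnorm(n * 5), n, 5))  # columns X1..X5
dat$y <- dat$X1 + rnorm(n)

fit <- lm(y ~ ., data = dat)

# Leave-one-out residuals without refitting: e_i / (1 - h_ii).
press_resid <- residuals(fit) / (1 - hatvalues(fit))
loocv_mse <- mean(press_resid^2)

# Brute-force check: refit the model n times, each time leaving one row out.
loocv_loop <- mean(sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ ., data = dat[-i, ])
  (dat$y[i] - predict(fit_i, newdata = dat[i, ]))^2
}))

c(shortcut = loocv_mse, refit = loocv_loop)
```

The resulting figure can be set against the leave-one-out CV error of a glmnet fit, which gives the comparison in the conclusions a common yardstick.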