Using R, build a regression model for data that interests you. Conduct residual analysis.
Was the linear model appropriate? Why or why not?
I will use the OpenPowerlifting Database to build a regression model and do residual analysis.
I will fit a simple linear regression of best bench press on body weight for competitors in the
Senior division to see whether a linear relationship exists between them.
Dataset can be found here: https://www.kaggle.com/open-powerlifting/powerlifting-database
Overview: https://www.kaggle.com/open-powerlifting/powerlifting-database/home
Get the data and examine a preview
library(data.table)
powerlift <- fread("openpowerlifting.csv")
head(powerlift, n=5)
## MeetID Name Sex Equipment Age Division BodyweightKg
## 1: 0 Angie Belk Terry F Wraps 47 Mst 45-49 59.60
## 2: 0 Dawn Bogart F Single-ply 42 Mst 40-44 58.51
## 3: 0 Dawn Bogart F Single-ply 42 Open Senior 58.51
## 4: 0 Dawn Bogart F Raw 42 Open Senior 58.51
## 5: 0 Destiny Dula F Raw 18 Teen 18-19 63.68
## WeightClassKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg
## 1: 60 NA 47.63 NA 20.41 NA
## 2: 60 NA 142.88 NA 95.25 NA
## 3: 60 NA 142.88 NA 95.25 NA
## 4: 60 NA NA NA 95.25 NA
## 5: 67.5 NA NA NA 31.75 NA
## BestDeadliftKg TotalKg Place Wilks
## 1: 70.31 138.35 1 155.05
## 2: 163.29 401.42 1 456.38
## 3: 163.29 401.42 1 456.38
## 4: NA 95.25 1 108.29
## 5: 90.72 122.47 1 130.47
summary(powerlift)
## MeetID Name Sex Equipment
## Min. : 0 Length:386414 Length:386414 Length:386414
## 1st Qu.:2979 Class :character Class :character Class :character
## Median :5960 Mode :character Mode :character Mode :character
## Mean :5143
## 3rd Qu.:7175
## Max. :8481
##
## Age Division BodyweightKg WeightClassKg
## Min. : 5.00 Length:386414 Min. : 15.88 Length:386414
## 1st Qu.:22.00 Class :character 1st Qu.: 70.30 Class :character
## Median :28.00 Mode :character Median : 83.20 Mode :character
## Mean :31.67 Mean : 86.93
## 3rd Qu.:39.00 3rd Qu.:100.00
## Max. :95.00 Max. :242.40
## NA's :239267 NA's :2402
## Squat4Kg BestSquatKg Bench4Kg BestBenchKg
## Min. :-440.5 Min. :-477.5 Min. :-360.0 Min. :-522.50
## 1st Qu.: 87.5 1st Qu.: 127.5 1st Qu.: -90.0 1st Qu.: 79.38
## Median : 145.0 Median : 174.6 Median : 90.2 Median : 115.00
## Mean : 107.0 Mean : 176.6 Mean : 45.7 Mean : 118.35
## 3rd Qu.: 212.5 3rd Qu.: 217.7 3rd Qu.: 167.5 3rd Qu.: 150.00
## Max. : 450.0 Max. : 573.8 Max. : 378.8 Max. : 488.50
## NA's :385171 NA's :88343 NA's :384452 NA's :30050
## Deadlift4Kg BestDeadliftKg TotalKg Place
## Min. :-461.0 Min. :-410.0 Min. : 11.0 Length:386414
## 1st Qu.: 110.0 1st Qu.: 147.5 1st Qu.: 272.2 Class :character
## Median : 157.5 Median : 195.0 Median : 424.1 Mode :character
## Mean : 113.6 Mean : 195.0 Mean : 424.0
## 3rd Qu.: 220.0 3rd Qu.: 238.1 3rd Qu.: 565.0
## Max. : 418.0 Max. : 460.4 Max. :1365.3
## NA's :383614 NA's :68567 NA's :23177
## Wilks
## Min. : 13.73
## 1st Qu.:237.38
## Median :319.66
## Mean :301.08
## 3rd Qu.:379.29
## Max. :779.38
## NA's :24220
dim(powerlift)
## [1] 386414 17
str(powerlift)
## Classes 'data.table' and 'data.frame': 386414 obs. of 17 variables:
## $ MeetID : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Name : chr "Angie Belk Terry" "Dawn Bogart" "Dawn Bogart" "Dawn Bogart" ...
## $ Sex : chr "F" "F" "F" "F" ...
## $ Equipment : chr "Wraps" "Single-ply" "Single-ply" "Raw" ...
## $ Age : num 47 42 42 42 18 28 60 60 52 52 ...
## $ Division : chr "Mst 45-49" "Mst 40-44" "Open Senior" "Open Senior" ...
## $ BodyweightKg : num 59.6 58.5 58.5 58.5 63.7 ...
## $ WeightClassKg : chr "60" "60" "60" "60" ...
## $ Squat4Kg : num NA NA NA NA NA ...
## $ BestSquatKg : num 47.6 142.9 142.9 NA NA ...
## $ Bench4Kg : num NA NA NA NA NA NA NA NA NA NA ...
## $ BestBenchKg : num 20.4 95.2 95.2 95.2 31.8 ...
## $ Deadlift4Kg : num NA NA NA NA NA NA NA NA NA NA ...
## $ BestDeadliftKg: num 70.3 163.3 163.3 NA 90.7 ...
## $ TotalKg : num 138.3 401.4 401.4 95.2 122.5 ...
## $ Place : chr "1" "1" "1" "1" ...
## $ Wilks : num 155 456 456 108 130 ...
## - attr(*, ".internal.selfref")=<externalptr>
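library(dplyr)
# Keep successful bench attempts (positive weight) in divisions whose name contains "Senior";
# %like% comes from data.table, loaded above
powerlift_senior <- powerlift %>%
  filter(BestBenchKg > 0 & Division %like% 'Senior') %>%
  select(BodyweightKg, BestBenchKg)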
with(powerlift_senior, plot(BodyweightKg, BestBenchKg, xlab = "Body Weight (Kg)",
ylab = "Best Bench Press (Kg)"))
lm_powerlift_senior <- lm(BestBenchKg ~ BodyweightKg, data = powerlift_senior)
with(powerlift_senior, plot(BodyweightKg, BestBenchKg,xlab = "Body Weight (Kg)",
ylab = "Best Bench Press (Kg)"))
abline(lm_powerlift_senior)
summary(lm_powerlift_senior)
##
## Call:
## lm(formula = BestBenchKg ~ BodyweightKg, data = powerlift_senior)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.985 -24.352 -3.433 20.126 158.286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13762 6.42581 0.021 0.983
## BodyweightKg 1.28035 0.07453 17.178 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.41 on 553 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.348, Adjusted R-squared: 0.3468
## F-statistic: 295.1 on 1 and 553 DF, p-value: < 2.2e-16
\[ \begin{aligned} \widehat{\text{bestbenchpress}} = 0.1376 + 1.2804 \times \text{bodyweight} \end{aligned} \]
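As a rough illustration of how the fitted equation is used, the sketch below plugs a body weight into the coefficients reported in the summary above. The function name `predict_bench` and the 80 kg input are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(lm_powerlift_senior) output above
b0 <- 0.13762   # intercept
b1 <- 1.28035   # slope for BodyweightKg

# Hypothetical helper: predicted best bench press (kg) for a given body weight (kg)
predict_bench <- function(bodyweight_kg) {
  b0 + b1 * bodyweight_kg
}

pred_80 <- predict_bench(80)   # illustrative 80 kg lifter
print(pred_80)                 # about 102.6 kg

# With the fitted model in memory, the equivalent call would be:
# predict(lm_powerlift_senior, newdata = data.frame(BodyweightKg = 80))
```

Given the residual standard error of 36.41 kg, any single prediction of this kind carries a wide margin of error.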
plot(fitted(lm_powerlift_senior), resid(lm_powerlift_senior),
main = "Fitted vs residuals", xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)
hist(resid(lm_powerlift_senior), xlab = "Residuals", main = "Histogram of Residuals")
qqnorm(resid(lm_powerlift_senior))
qqline(resid(lm_powerlift_senior))
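Alongside the histogram and Q-Q plot, a rough numeric check of symmetry can be made from the residual quartiles already printed in the model summary. The sketch below is only an informal check, not a formal normality test; the variable names are illustrative assumptions, and the quartile (Bowley) skewness and the IQR-based spread estimate are techniques added here, not used in the original analysis.

```r
# Residual quartiles and residual standard error copied from the model summary above
q1  <- -24.352
med <- -3.433
q3  <- 20.126
rse <- 36.41

# Quartile (Bowley) skewness: 0 for a symmetric distribution, > 0 for right skew
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

# For normal data the IQR is about 1.349 standard deviations,
# so IQR / 1.349 should be close to the residual standard error
sigma_iqr <- (q3 - q1) / 1.349

print(c(bowley = bowley, sigma_iqr = sigma_iqr, rse = rse))

# With the fitted model in memory, a formal test could be run instead:
# shapiro.test(resid(lm_powerlift_senior))
```

A small positive skewness together with an IQR-based spread below the residual standard error would point to a mild right skew and somewhat heavier tails than a normal distribution.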
The linear model does not fit the data well. The \(R^2\) value of 0.348 is low: body weight
explains only about 35% of the variance in best bench press, so the fitted model does not
accurately predict a Senior-division competitor's best bench press from body weight alone.
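The reported \(R^2\) can be cross-checked against the F-statistic in the same summary. For a model with one predictor, \(R^2 = F / (F + df_{residual})\). The sketch below uses only the numbers printed above; the variable names are illustrative assumptions.

```r
# Values copied from the summary(lm_powerlift_senior) output above
f_stat   <- 295.1   # F-statistic on 1 and 553 DF
df_resid <- 553     # residual degrees of freedom
n        <- 555     # observations used: df_resid + 2 estimated coefficients

# One-predictor identity: R^2 = F / (F + residual df)
r2 <- f_stat / (f_stat + df_resid)

# Adjusted R^2 penalises for the number of predictors (here 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Correlation between body weight and best bench; positive because the slope is positive
r <- sqrt(r2)

print(round(c(r2 = r2, adj_r2 = adj_r2, r = r), 4))
```

The implied correlation of roughly 0.59 indicates a moderate positive association, which is consistent with a significant slope but a large share of unexplained variance.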
The intercept has a high p-value (0.983), meaning that it is not statistically significant
and is not a reliable estimate. On the positive side, body weight is a statistically
significant predictor: its slope has a very low p-value (< 2e-16) and a small standard
error (0.075).
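One way to make the contrast between the two coefficients concrete is to compute approximate 95% confidence intervals from the estimates and standard errors printed in the summary. The sketch below is a minimal reconstruction under that assumption; the variable names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table above
est <- c(intercept = 0.13762, bodyweight = 1.28035)
se  <- c(intercept = 6.42581, bodyweight = 0.07453)

# Two-sided 95% critical value from the t distribution with 553 residual df
t_crit <- qt(0.975, df = 553)

# Interval = estimate +/- critical value * standard error
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))

# With the fitted model in memory, the equivalent call would be:
# confint(lm_powerlift_senior, level = 0.95)
```

The intercept interval spans zero, while the slope interval stays well above zero, which matches the p-values in the summary.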
Although the residuals appear approximately normal, the plot of fitted values vs residuals
does not show constant variance: for larger fitted values, the variance begins to change.
A well-fitting linear model would have constant, or nearly constant, variance.
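The visual impression of non-constant variance can be backed up with a formal test. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test in base R, a technique added here and not used in the original analysis. The helper name `bp_check` is hypothetical, and the demonstration uses synthetic data (not the powerlifting data) whose noise grows with the predictor.

```r
# Hypothetical helper: studentized Breusch-Pagan test for a one-predictor lm fit.
# Regress squared residuals on fitted values; n * R^2 is compared to chi-square(1).
bp_check <- function(fit) {
  e2   <- resid(fit)^2
  f    <- fitted(fit)
  aux  <- lm(e2 ~ f)                        # auxiliary regression
  stat <- nobs(fit) * summary(aux)$r.squared
  p    <- pchisq(stat, df = 1, lower.tail = FALSE)
  c(statistic = stat, p.value = p)
}

# Synthetic demonstration data: noise standard deviation increases with x
set.seed(1)
n <- 500
x <- runif(n, 50, 150)
y <- 1.3 * x + rnorm(n, sd = 0.3 * x)
fit_demo <- lm(y ~ x)

result <- bp_check(fit_demo)
print(result)

# On the real model this would be called as: bp_check(lm_powerlift_senior)
```

A small p-value rejects the hypothesis of constant variance. If the powerlifting model shows the same pattern, options worth considering include transforming the response or adding predictors such as sex and equipment.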