Project_2 Powerlifting

Author

Mei Qi Ng

Approach

One of the dataset used for project comes from the Open Powerlifting Database on Kaggle, which has competition results across federations, weight classes, and events. Although the dataset is readable, it is not in tidy format. There are several structural issues make it untidy: it is a wide format storage: contains massive amounts of variables spread across many columns. Also the multiple observation unit in 1 row need to be extended.

Tidying Strategy:

  1. First I will load the csv file that is uploaded from the Github repository.

To transform this dataset into tidy format, I will:

  1. Identify wide-format columns specifically columns and place the multiple variables in a single name.Reshape from wide to long format. I plan to collapse the many columns into 2 new columns that I will rename.

Challenges would be working with such a large and wide dataset and also the missing and inconsistent values (0 or negatives) as some competitors may not have attempted all lifts or failed attempt lifts but that will be documented and how these cases are handled.

Planned Analyses

After tidying is complete, I plan to perform several analyses based on the tidy dataset:

  1. Performance differences by sex: Use boxplots to compare lift performance between male and female competitors.

  2. Lift strength relative to body weight: Group by weight class to examine strength differences and compare.

  3. Trends over time: Analyzing maximum recorded lifts across years to observe how world-record-level lifts have progressed.

Finally, data visualizations will be created using ggplot2, and results will be summarized in tables with clear labeling.

Resources:

Loading packages and data set

library (tidyverse) #loads required packages needed for data manipulation such as dplyr, tidyr, readr, lubridate and ggplot2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#Gemini LLM assisted me in loading such a large csv file onto Github repository using Git

github_url <- "https://raw.githubusercontent.com/meiqing39/607-Project-2/cbb74bb1c84cef741f9883e92b32ce21ef7078db/openpowerlifting.zip"

# Create a temporary file on the user's computer
temp_file <- tempfile(fileext = ".zip")

# Download the file safely
download.file(github_url, destfile = temp_file, mode = "wb")

# Read the data into R
raw_powerlifting <- read_csv(temp_file)
Rows: 1423354 Columns: 37
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (14): Name, Sex, Event, Equipment, AgeClass, Division, WeightClassKg, P...
dbl  (22): Age, BodyweightKg, Squat1Kg, Squat2Kg, Squat3Kg, Squat4Kg, Best3S...
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the structure

glimpse(raw_powerlifting)
Rows: 1,423,354
Columns: 37
$ Name            <chr> "Abbie Murphy", "Abbie Tuong", "Ainslee Hooper", "Amy …
$ Sex             <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
$ Event           <chr> "SBD", "SBD", "B", "SBD", "SBD", "SBD", "SBD", "SBD", …
$ Equipment       <chr> "Wraps", "Wraps", "Raw", "Wraps", "Wraps", "Wraps", "W…
$ Age             <dbl> 29, 29, 40, 23, 45, 37, 23, 35, 36, 37, 34, 27, 27, 38…
$ AgeClass        <chr> "24-34", "24-34", "40-44", "20-23", "45-49", "35-39", …
$ Division        <chr> "F-OR", "F-OR", "F-OR", "F-OR", "F-OR", "F-OR", "F-OR"…
$ BodyweightKg    <dbl> 59.8, 58.5, 55.4, 60.0, 104.0, 74.0, 59.8, 80.4, 108.0…
$ WeightClassKg   <chr> "60", "60", "56", "60", "110", "75", "60", "82.5", "11…
$ Squat1Kg        <dbl> 80.0, 100.0, NA, -105.0, 120.0, 127.5, 115.0, 120.0, 1…
$ Squat2Kg        <dbl> 92.5, 110.0, NA, -105.0, 130.0, 135.0, 125.0, 130.0, 2…
$ Squat3Kg        <dbl> 105.0, 120.0, NA, 105.0, 140.0, 142.5, -135.0, 140.0, …
$ Squat4Kg        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Best3SquatKg    <dbl> 105.0, 120.0, NA, 105.0, 140.0, 142.5, 125.0, 140.0, 2…
$ Bench1Kg        <dbl> 45.0, 55.0, 27.5, 67.5, 70.0, 72.5, 65.0, 67.5, 90.0, …
$ Bench2Kg        <dbl> 50.0, 62.5, 32.5, 72.5, 75.0, 77.5, 70.0, 72.5, 100.0,…
$ Bench3Kg        <dbl> 55.0, 67.5, -35.0, -75.0, 80.0, 82.5, -75.0, 77.5, NA,…
$ Bench4Kg        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Best3BenchKg    <dbl> 55.0, 67.5, 32.5, 72.5, 80.0, 82.5, 70.0, 77.5, 100.0,…
$ Deadlift1Kg     <dbl> 110.0, 130.0, NA, 132.5, 150.0, 125.0, 130.0, 125.0, 2…
$ Deadlift2Kg     <dbl> 120.0, 140.0, NA, -140.0, 160.0, 135.0, 140.0, 140.0, …
$ Deadlift3Kg     <dbl> 130.0, 145.0, NA, -140.0, 170.0, 145.0, 150.0, 152.5, …
$ Deadlift4Kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Best3DeadliftKg <dbl> 130.0, 145.0, NA, 132.5, 170.0, 145.0, 150.0, 152.5, 2…
$ TotalKg         <dbl> 290.0, 332.5, 32.5, 310.0, 390.0, 370.0, 345.0, 370.0,…
$ Place           <chr> "4", "2", "1", "3", "3", "5", "1", "5", "1", "1", "2",…
$ Wilks           <dbl> 324.16, 378.07, 38.56, 345.61, 321.25, 354.74, 385.63,…
$ McCulloch       <dbl> 324.16, 378.07, 38.56, 345.61, 338.91, 354.74, 385.63,…
$ Glossbrenner    <dbl> 286.42, 334.16, 34.12, 305.37, 274.56, 312.20, 340.74,…
$ IPFPoints       <dbl> 511.15, 595.65, 313.97, 547.04, 550.08, 596.18, 612.23…
$ Tested          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Country         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Austr…
$ Federation      <chr> "GPC-AUS", "GPC-AUS", "GPC-AUS", "GPC-AUS", "GPC-AUS",…
$ Date            <date> 2018-10-27, 2018-10-27, 2018-10-27, 2018-10-27, 2018-…
$ MeetCountry     <chr> "Australia", "Australia", "Australia", "Australia", "A…
$ MeetState       <chr> "VIC", "VIC", "VIC", "VIC", "VIC", "VIC", "VIC", "VIC"…
$ MeetName        <chr> "Melbourne Cup", "Melbourne Cup", "Melbourne Cup", "Me…

Reshaping and tidy large data

#wide to long format and use columns that shows competitors attempts
tidy_powerlift <- raw_powerlifting |> 
  pivot_longer(cols = matches("Squat|Bench|Deadlift)[1-4]Kg"),
               names_to = "lift_type",
               values_to = "lift_kg") |> 
  #everything to lowercase, easier to work with and consistent, remove missing values and negative values (failed lifts)
  rename_with(tolower) |> 
  drop_na(lift_kg) |> 
  filter(lift_kg >0)  #keep successful lifts
 
glimpse(tidy_powerlift)
Rows: 4,051,475
Columns: 29
$ name            <chr> "Abbie Murphy", "Abbie Murphy", "Abbie Murphy", "Abbie…
$ sex             <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
$ event           <chr> "SBD", "SBD", "SBD", "SBD", "SBD", "SBD", "SBD", "SBD"…
$ equipment       <chr> "Wraps", "Wraps", "Wraps", "Wraps", "Wraps", "Wraps", …
$ age             <dbl> 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29…
$ ageclass        <chr> "24-34", "24-34", "24-34", "24-34", "24-34", "24-34", …
$ division        <chr> "F-OR", "F-OR", "F-OR", "F-OR", "F-OR", "F-OR", "F-OR"…
$ bodyweightkg    <dbl> 59.8, 59.8, 59.8, 59.8, 59.8, 59.8, 59.8, 59.8, 58.5, …
$ weightclasskg   <chr> "60", "60", "60", "60", "60", "60", "60", "60", "60", …
$ deadlift1kg     <dbl> 110.0, 110.0, 110.0, 110.0, 110.0, 110.0, 110.0, 110.0…
$ deadlift2kg     <dbl> 120, 120, 120, 120, 120, 120, 120, 120, 140, 140, 140,…
$ deadlift3kg     <dbl> 130, 130, 130, 130, 130, 130, 130, 130, 145, 145, 145,…
$ deadlift4kg     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ best3deadliftkg <dbl> 130.0, 130.0, 130.0, 130.0, 130.0, 130.0, 130.0, 130.0…
$ totalkg         <dbl> 290.0, 290.0, 290.0, 290.0, 290.0, 290.0, 290.0, 290.0…
$ place           <chr> "4", "4", "4", "4", "4", "4", "4", "4", "2", "2", "2",…
$ wilks           <dbl> 324.16, 324.16, 324.16, 324.16, 324.16, 324.16, 324.16…
$ mcculloch       <dbl> 324.16, 324.16, 324.16, 324.16, 324.16, 324.16, 324.16…
$ glossbrenner    <dbl> 286.42, 286.42, 286.42, 286.42, 286.42, 286.42, 286.42…
$ ipfpoints       <dbl> 511.15, 511.15, 511.15, 511.15, 511.15, 511.15, 511.15…
$ tested          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ country         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ federation      <chr> "GPC-AUS", "GPC-AUS", "GPC-AUS", "GPC-AUS", "GPC-AUS",…
$ date            <date> 2018-10-27, 2018-10-27, 2018-10-27, 2018-10-27, 2018-…
$ meetcountry     <chr> "Australia", "Australia", "Australia", "Australia", "A…
$ meetstate       <chr> "VIC", "VIC", "VIC", "VIC", "VIC", "VIC", "VIC", "VIC"…
$ meetname        <chr> "Melbourne Cup", "Melbourne Cup", "Melbourne Cup", "Me…
$ lift_type       <chr> "Squat1Kg", "Squat2Kg", "Squat3Kg", "Best3SquatKg", "B…
$ lift_kg         <dbl> 80.0, 92.5, 105.0, 105.0, 45.0, 50.0, 55.0, 55.0, 100.…

Narrowing down to a smaller dataset: USA

#Filtering to work with USA powerlifting competitors and specifc columns

tidy_USA_powerlift <- tidy_powerlift |> 
  filter(country == "USA") |> 
   select(name, sex, age, bodyweightkg,country,date,lift_type, lift_kg)

glimpse(tidy_USA_powerlift) #look at total columns
Rows: 245,100
Columns: 8
$ name         <chr> "Brett Worland", "Brett Worland", "Brett Worland", "Micha…
$ sex          <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M…
$ age          <dbl> 23, 23, 23, 19, 19, 19, 19, 19, 19, 19, 30, 30, 30, 30, 3…
$ bodyweightkg <dbl> 98.1, 98.1, 98.1, 69.9, 69.9, 69.9, 69.9, 69.9, 69.9, 69.…
$ country      <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "…
$ date         <date> 2015-05-19, 2015-05-19, 2015-05-19, 2013-06-14, 2013-06-…
$ lift_type    <chr> "Bench1Kg", "Bench2Kg", "Best3BenchKg", "Squat1Kg", "Squa…
$ lift_kg      <dbl> 185.0, 195.0, 195.0, 140.0, 145.0, 152.5, 152.5, 87.5, 95…
head(tidy_USA_powerlift, 6) #look at first 6 rows 
# A tibble: 6 × 8
  name             sex     age bodyweightkg country date       lift_type lift_kg
  <chr>            <chr> <dbl>        <dbl> <chr>   <date>     <chr>       <dbl>
1 Brett Worland    M        23         98.1 USA     2015-05-19 Bench1Kg     185 
2 Brett Worland    M        23         98.1 USA     2015-05-19 Bench2Kg     195 
3 Brett Worland    M        23         98.1 USA     2015-05-19 Best3Ben…    195 
4 Michael Hamilton M        19         69.9 USA     2013-06-14 Squat1Kg     140 
5 Michael Hamilton M        19         69.9 USA     2013-06-14 Squat2Kg     145 
6 Michael Hamilton M        19         69.9 USA     2013-06-14 Squat3Kg     152.

Analysis 1: Performance difference (Male vs Female)

#Performance difference by Sex Table

PxS_table <-tidy_USA_powerlift |> 
  filter(sex %in% c("M", "F")) |> 
  group_by(sex, lift_type) |> 
  summarize(avg_lift = round(mean(lift_kg), 0),
            max_lift = max(lift_kg))
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
print(PxS_table)
# A tibble: 20 × 4
# Groups:   sex [2]
   sex   lift_type    avg_lift max_lift
   <chr> <chr>           <dbl>    <dbl>
 1 F     Bench1Kg           76     230 
 2 F     Bench2Kg           82     242.
 3 F     Bench3Kg           88     248.
 4 F     Bench4Kg           87     250 
 5 F     Best3BenchKg       80     272.
 6 F     Best3SquatKg      137     388.
 7 F     Squat1Kg          128     350 
 8 F     Squat2Kg          136     370.
 9 F     Squat3Kg          141     365 
10 F     Squat4Kg          137     378.
11 M     Bench1Kg          162     468.
12 M     Bench2Kg          170     488.
13 M     Bench3Kg          175     479.
14 M     Bench4Kg          182     488.
15 M     Best3BenchKg      168     488.
16 M     Best3SquatKg      241     574.
17 M     Squat1Kg          235     522.
18 M     Squat2Kg          248     530 
19 M     Squat3Kg          254     560 
20 M     Squat4Kg          214     467.
# Compare lifts between sex: Boxplot 

tidy_USA_powerlift |> 
  filter(sex %in% c("M", "F")) |> 
  ggplot(aes( x = sex, y = lift_kg, fill = sex))+
  geom_boxplot()+
  facet_wrap(~lift_type)+ #compare all kinds of lifts
  labs( title = "Lift Perfomance Differences between Sex",
        x = "Sex",
        y = "Weight lifted (kg)")

Interpretation: The table and the box plot shows the performance differences between Male vs Female weight lifting competitors across 3 different weight lifting events. I expected male lifters to lift heavier on average and in max lifting than females in the all lifting events. This was revealed clearly in the boxplot. In additon, there is also a larger variance in strength level for male competitors than females; it is more apparent in the Squat competitions.

Analysis 2: lift strength relative to body weight

#How many times their own body weight can a powerlifter competitor lift?

strength_table <- tidy_USA_powerlift |> 
  filter(sex %in% c("M", "F")) |>
  mutate(bw_multiplier = lift_kg/bodyweightkg) |> 
  group_by(sex, lift_type) |> 
  summarize(avg_times = round(mean(bw_multiplier, na.rm = TRUE), 0))
            
print(strength_table)
# A tibble: 20 × 3
# Groups:   sex [2]
   sex   lift_type    avg_times
   <chr> <chr>            <dbl>
 1 F     Bench1Kg             1
 2 F     Bench2Kg             1
 3 F     Bench3Kg             1
 4 F     Bench4Kg             1
 5 F     Best3BenchKg         1
 6 F     Best3SquatKg         2
 7 F     Squat1Kg             2
 8 F     Squat2Kg             2
 9 F     Squat3Kg             2
10 F     Squat4Kg             2
11 M     Bench1Kg             2
12 M     Bench2Kg             2
13 M     Bench3Kg             2
14 M     Bench4Kg             2
15 M     Best3BenchKg         2
16 M     Best3SquatKg         3
17 M     Squat1Kg             3
18 M     Squat2Kg             3
19 M     Squat3Kg             3
20 M     Squat4Kg             3
# Scatterplot to compare body weight of competitors to how much weight they can lift

tidy_USA_powerlift|> 
  filter(sex %in% c("M", "F")) |> 
  ggplot(aes( x = bodyweightkg, y = lift_kg, color = sex))+
  geom_point(alpha = 0.2)+ # create transparent dots to see clusters of M vs F more clearly
  facet_wrap(~lift_type)+   
  labs(title = "Lift Strength to Body Weight",
       x = "Competitor Weight (kg)",
       y = "Weight lifted (kg)")

Intepretation: When analyzing the lift strength to the competitor’s body weight, the scatter plot demonstrates a positive correlation of increase lifter’s body weight mean increase weight that they can lift. However, the table provides more information by calculating the average bodyweight multiplier. This is incredibly interesting to see and compare: while heavier power lifters lift more total weight, lighter power lifters often move a higher multiple of their own body weight which reveals very strong relative strength to their weight.

Analysis 3: Trend over time

# Trend table

TT_table <- tidy_USA_powerlift |> 
  mutate(year = lubridate::year(date)) |>
  group_by(year, lift_type) |>
  summarize(max_lift = max(lift_kg)) |> 
  arrange(desc(year))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
head(TT_table, 9)
# A tibble: 9 × 3
# Groups:   year [1]
   year lift_type    max_lift
  <dbl> <chr>           <dbl>
1  2019 Bench1Kg         425 
2  2019 Bench2Kg         455 
3  2019 Bench3Kg         372.
4  2019 Bench4Kg         203 
5  2019 Best3BenchKg     455 
6  2019 Best3SquatKg     490 
7  2019 Squat1Kg         465 
8  2019 Squat2Kg         460 
9  2019 Squat3Kg         490 
# Extract Year and find max lift per year

tidy_USA_powerlift|> 
  mutate(year = lubridate::year(date)) |> #extracts date/time *R for datascience Chapter 17*
  group_by(year, lift_type)|> 
  summarize(max_weight_lift = max(lift_kg)) |> 
  ggplot(aes( x = year, y = max_weight_lift, color = lift_type))+
  geom_line()+
  labs(title = "Competition Max Lifts over Time",
       x = "Year",
       y = "Lift Type")
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Interpretation: By extracting the year from the competition dates, we can track the progression of the highest weight lifted by the power lifters. The line graph and table shows the maximum lifts over the years. I can see how power lifters in the US attempts in all lift type are trending upwards which makes sense since for these althetes over the years with experience, they probably made improvements in their training methods, nutrition,and using better lifting equipment.

Conclusion

The goal of this project was to transform a wide, untidy dataset into a tidy format that can be used for analysis. By using the pivot_longer() function in tidyr, the OpenPowerlifting dataset was successfully reshaped, collapsing over a dozen horizontal lift attempt columns into standardized lift_type and lift_kg variables. The challenge for me was trying to find a way to upload this large csv data set probably as Github would not accept files larger than 100MB. By using Gemini Pro as a assistant, I was able to bypass this by zip filing and using Git Bash to push and commit the csv file on to the main repository. In additon, while working with such a large datase, I notice there were many missing and failed (negative values) attempts. By using the drop.na functions and filtering those values out, as well as selecting only the variable columns I am interested in, I was able to create a clean dataset that was ready to be utilized for exploratory planned analyses.

Working on the tidy data, the analysis I performed regarding the powerlifting listed:

Performance Differences (Sex): There is apparent variance in strength between sexes, with males recording higher average lifts across all events (ex: an average bench(1kg) of 162 kg compared to 76 kg for females).

Strength to bodyweight: While weight lifted increases with body weight as I predicted, I analyzed the bodyweight multiplier and it revealed that lighter lifters often have exceptional strength, moving higher multiples of their own body weight.

Trends over time: There is a clear upward trajectory of the maximum lifts over the years in the power lifting sport in the USA competition, which reflects the advancements in strength training and athlete preparation.

Ultimately, this process help me understand how important data tidying was when using a large and messy data set. The wide format made creating tables and data plot visualizations very difficult as there were too many variables to work with. However, the tidy and long format allowed for better statistical grouping, easier spotting of patterns in the tables, and creating straight forward graphs.