Preliminary diving results of women’s 15-18 1m springboard from the 2022 FINA World Junior Championships
The FINA World Junior Diving Championships is an elite competition for junior divers; the event analyzed here covers divers aged 15 to 18. Each diver completes 9 dives, and each dive is scored by 7 judges. The two lowest and two highest scores are discarded, and the remaining three are added together. This sum is then multiplied by the dive’s difficulty to give the points earned for that dive. The overall score is the sum of the points earned on each dive.
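The scoring rule described above can be sketched in a few lines of base R. The function name dive_points and the judge scores below are made up for illustration; they are not taken from the dataset.

```r
# Hypothetical helper (not from the source): points for one dive scored by 7 judges.
dive_points <- function(judge_scores, difficulty) {
  stopifnot(length(judge_scores) == 7)
  sorted <- sort(judge_scores)
  middle_three <- sorted[3:5]      # drop the two lowest and the two highest scores
  sum(middle_three) * difficulty   # judges' sum multiplied by the dive's difficulty
}

# Made-up scores for illustration only
scores <- c(5.5, 6.0, 5.0, 6.5, 6.0, 5.5, 7.0)
dive_points(scores, difficulty = 2.0)  # (5.5 + 6.0 + 6.0) * 2.0 = 35
```

Summing dive_points() over a diver's 9 dives would give the overall score, assuming the published points follow this rule exactly.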
In the article “FINA World Junior Championships 2022 Diving Data”, published on February 29, 2024 on the CORE Network, Emma Deering explains that the data set contains 360 rows and 15 columns. Each row represents one completed dive by a diver in the preliminary round of the women’s (aged 15-18) 1m springboard at the 2022 FINA World Junior Championships. Each diver completed 9 dives, so there are 9 rows per diver, which means 40 divers in total. The 15 variables are, in order:
LastName is the last name of the diver,
Country is the country the diver was competing for,
Age is the age of the diver,
TotalPoints is the overall points the diver had earned at the end of the meet,
DiveNum is the order in which the dives were performed,
DiveName is the coded name of the dive,
Difficulty is the difficulty of the dive,
Judge1 is the score that judge 1 awarded the dive,
Judge2 is the score that judge 2 awarded the dive,
Judge3 is the score that judge 3 awarded the dive,
Judge4 is the score that judge 4 awarded the dive,
Judge5 is the score that judge 5 awarded the dive,
Judge6 is the score that judge 6 awarded the dive,
Judge7 is the score that judge 7 awarded the dive,
Points is the points the individual dive earned.
The data come from World Aquatics: https://www.worldaquatics.com/competitions/2951/fina-world-junior-diving-championships-2022/results?event=6d65f6db-1e71-4bca-b1c3-7facf12f500f&unit=preliminary
With this dataset, I can explore which ages produced the best overall scores and how the points for a dive relate to its difficulty.
Load the library
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load and view data
Load the dataset
I will load the dataset using the read_csv function.
# Load the preliminary diving results of the women's 15-18 1m springboard from the 2022 FINA World Junior Championships.
setwd("C:/Users/User/Downloads/Data 110 Projects and Assignments")
divingdata <- read_csv("divingdata.csv")
Rows: 360 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): LastName, Country, DiveName
dbl (12): Age, TotalPoints, DiveNum, Difficulty, Judge1, Judge2, Judge3, Jud...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(divingdata) # shows the first 6 rows of the dataset.
chr means character, a string of text that can be treated as a categorical variable, and dbl (double) is for numbers that may contain decimals. The dataset therefore has 3 categorical variables and 12 numerical variables.
Clean the dataset
Check missing values
The code below checks whether every row in divingdata is complete, that is, whether there are no missing values. It returns one of two messages: “No NA Founded” if no row has a missing value, or “Found NA” if at least one row does.
# check missing values
ifelse(mean(complete.cases(divingdata)) == 1, "No NA Founded", "Found NA")
[1] "No NA Founded"
There are no missing values in the dataset.
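If a check like this ever did report missing values, the next question would be where they are. A minimal base R sketch on a made-up toy data frame (not the diving data) shows two functions that answer this: anyNA() for a yes/no answer and colSums(is.na()) for a count per column.

```r
# Toy data frame (made up) with one missing value in judge2
toy <- data.frame(
  lastname = c("A", "B", "C"),
  judge1   = c(5.5, 6.0, 4.5),
  judge2   = c(6.0, NA, 5.0)
)

anyNA(toy)                            # TRUE if any cell is missing
na_per_column <- colSums(is.na(toy))  # number of missing values in each column
na_per_column
```

On the real dataset, colSums(is.na(divingdata)) would be expected to return zero for every column, given the result above.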
Convert column headers to lowercase
Setting all variable names to lowercase saves time, because there is no need to keep track of capitalization when typing them. The function tolower does this.
# Make all headings (column names) lowercase
names(divingdata) <- tolower(names(divingdata))
head(divingdata)
The summary function will run a quick statistical summary of a data frame, calculating mean, median, quartile values for continuous variables, and giving the minimum and maximum values. It helps to select the focus points of the analysis.
summary(divingdata)
lastname country age totalpoints
Length:360 Length:360 Min. :15.00 Min. :157.5
Class :character Class :character 1st Qu.:17.00 1st Qu.:271.9
Mode :character Mode :character Median :17.00 Median :300.0
Mean :17.12 Mean :298.8
3rd Qu.:18.00 3rd Qu.:334.9
Max. :18.00 Max. :374.7
divenum divename difficulty judge1 judge2
Min. :1 Length:360 Min. :1.200 Min. :2.000 Min. :2.000
1st Qu.:3 Class :character 1st Qu.:1.700 1st Qu.:4.500 1st Qu.:5.000
Median :5 Mode :character Median :2.200 Median :5.500 Median :5.500
Mean :5 Mean :2.042 Mean :5.456 Mean :5.447
3rd Qu.:7 3rd Qu.:2.400 3rd Qu.:6.000 3rd Qu.:6.500
Max. :9 Max. :3.100 Max. :7.500 Max. :8.500
judge3 judge4 judge5 judge6
Min. :1.000 Min. :2.000 Min. :1.000 Min. :1.000
1st Qu.:4.875 1st Qu.:4.500 1st Qu.:5.000 1st Qu.:5.000
Median :5.500 Median :5.500 Median :5.500 Median :5.500
Mean :5.457 Mean :5.464 Mean :5.436 Mean :5.513
3rd Qu.:6.500 3rd Qu.:6.500 3rd Qu.:6.500 3rd Qu.:6.500
Max. :8.000 Max. :8.000 Max. :8.000 Max. :8.000
judge7 points
Min. :1.500 Min. : 9.60
1st Qu.:5.000 1st Qu.:28.00
Median :6.000 Median :32.45
Mean :5.483 Mean :33.20
3rd Qu.:6.500 3rd Qu.:39.10
Max. :8.000 Max. :54.60
Best scores by ages
The function as.factor(age) ensures that ages are treated as discrete groups for proper grouping and coloring.
theme(plot.title = element_text(hjust = 0.5)) centers the title. The “hjust” parameter controls the horizontal justification of the title text, where 0 is left-aligned, 0.5 is centered, and 1 is right-aligned. It works the same way for plot.caption, which aligns the caption text.
coord_flip() function changes the position of the x and y coordinates in a plot. This means that what is normally displayed on the x-axis will be displayed on the y-axis, and vice versa.
Age_best_scores <- divingdata |>
  ggplot() +
  geom_boxplot(aes(x = as.factor(age), y = totalpoints, group = age, fill = as.factor(age))) +
  scale_fill_manual(values = c("15" = "yellow", "16" = "orange", "17" = "maroon", "18" = "pink")) +
  labs(title = "Comparison of Scores from Different Ages",
       x = "Diver's Age",
       y = "Overall Points",
       caption = "Source: WorldAquatics \n www.worldaquatics.com/competitions/2951/fina-world-junior-diving-championships-2022") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0)) +
  coord_flip() +
  guides(fill = guide_legend(title = "Age")) # changes the legend title to "Age" instead of as.factor(age)
Age_best_scores
It appears that 18-year-old divers had the best performances at this championship, although the lowest scores were also observed in that age group.
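Because totalpoints is repeated on each of a diver's 9 rows, a numeric comparison by age is easier to read with one row per diver. The sketch below uses made-up toy data and base R; it assumes that lastname together with country identifies a diver, which has not been checked against the real data.

```r
# Toy data (made up): two dives shown per diver, totalpoints repeated on each row
toy <- data.frame(
  lastname    = c("A", "A", "B", "B", "C", "C"),
  country     = c("X", "X", "Y", "Y", "Z", "Z"),
  age         = c(17, 17, 18, 18, 18, 18),
  totalpoints = c(300, 300, 350, 350, 250, 250)
)

# Keep one row per diver (assumption: lastname + country identifies a diver)
divers <- unique(toy[, c("lastname", "country", "age", "totalpoints")])

# Median overall points for each age
median_by_age <- aggregate(totalpoints ~ age, data = divers, FUN = median)
median_by_age
```

The same two steps applied to divingdata would give the medians that the boxplot displays, as a small table.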
Scores of dive by difficulty
In the second plot, difficulty is kept numeric so that color follows a continuous gradient. Treating each difficulty value as a factor level would clutter the x-axis and the color legend with redundant labels.
divingdata |>
  ggplot(aes(x = difficulty, y = points, color = difficulty)) + # treat difficulty as numeric
  geom_point() +
  scale_color_gradient(low = "cyan", high = "red", name = "Difficulty") + # use gradient scale of colors
  labs(x = "Level of Difficulty",
       y = "Dive's Points",
       title = "Relationship Between Dive's Difficulty and Scores",
       caption = "WorldAquatics \n www.worldaquatics.com/competitions/2951/fina-world-junior-diving-championships-2022") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0))
Notice that points tend to increase as the difficulty level increases.
Analysis
When analyzing the preliminary results of the women’s 15-18 1m springboard event from the 2022 FINA World Junior Championships, I modified the dataset by converting all column headings to lowercase with tolower(names(divingdata)). I also used ifelse() together with mean(complete.cases(divingdata)) to check for missing values. There were none, so no further cleaning was required.
The first visualization is a side-by-side boxplot that displays the distribution of overall points scored by divers of different ages (15, 16, 17, and 18 years old). The plot allows for a clear comparison of the performance of divers in various age groups based on their overall points. The components of the visualization include axes representing overall points and diver’s age, boxplots representing different age groups, colors to distinguish between age groups, a legend, title and labels, and a source and caption at the bottom. Analysis of the boxplots reveals that older divers (ages 17 and 18) generally achieve higher median scores compared to younger divers (ages 15 and 16). This visualization effectively highlights the median, spread, and potential outliers in the data, allowing for a quick comparison of divers’ performance based on age.
In the second visualization, the scatterplot shows the relationship between dive difficulty and points scored, with each point representing an individual dive. The x-axis shows the difficulty level (1.2 to 3.1), and the y-axis shows the points scored. Points are color-coded by difficulty level, as indicated in the legend. The plot reveals a general trend where higher difficulty levels correspond to higher points, though there is considerable variability within each difficulty level. This suggests that while difficulty influences points, the judges’ scores also play a large role, since the points for a dive are the judges’ sum multiplied by its difficulty.
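One way to separate the two influences is to undo the multiplication: dividing a dive's points by its difficulty gives back the sum of the three counted judge scores. The sketch below uses made-up numbers and assumes the points column follows the scoring rule exactly; the column name judge_sum is hypothetical.

```r
# Toy data (made up): three dives with increasing difficulty
toy <- data.frame(
  difficulty = c(1.5, 2.0, 3.0),
  points     = c(27.0, 30.0, 36.0)
)

# Sum of the three counted judge scores, recovered by undoing the multiplication
toy$judge_sum <- toy$points / toy$difficulty
toy
```

In this toy example the points rise with difficulty even though the judges' sum falls, which is the kind of variability the scatterplot cannot show on its own.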
I had hoped to create more advanced plots, including one illustrating the average score given by each judge per dive, but I encountered challenges with the code and couldn’t execute it as intended.
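As a starting point for that idea, the averages themselves can be computed in base R before any plotting. This is only a sketch on made-up toy data with three judge columns instead of seven; the names mean_by_judge and mean_judge_score are hypothetical, and it is not the code I originally attempted.

```r
# Toy data (made up) using the same lowercase judge column naming as the dataset
toy <- data.frame(
  divenum = c(1, 2, 3),
  judge1  = c(5.0, 6.0, 7.0),
  judge2  = c(5.5, 6.5, 6.0),
  judge3  = c(6.0, 6.0, 6.5)
)

# Select the judge columns by name pattern (judge1, judge2, ...)
judge_cols <- grep("^judge[0-9]+$", names(toy), value = TRUE)

# Average score awarded by each judge across all dives
mean_by_judge <- colMeans(toy[, judge_cols])

# Average score across judges for each individual dive
toy$mean_judge_score <- rowMeans(toy[, judge_cols])

mean_by_judge
toy
```

Either result could then be passed to ggplot, for example as a bar chart of mean_by_judge or a plot of mean_judge_score against divenum.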