Data110 Project 1

Author

Duchelle Kemoue

Preliminary diving results of women’s 15-18 1m springboard from the 2022 FINA World Junior Championships

The FINA World Junior Diving Championships is an elite competition for junior divers; the event analyzed here covers divers aged 15 to 18. Each diver completes 9 dives, and each dive is scored by 7 judges. The two lowest and two highest scores are discarded, and the remaining three are added together. This sum is then multiplied by the dive’s difficulty to give the points earned for that dive. The overall score is the sum of the points earned on each dive.
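To make the scoring rule concrete, here is a minimal sketch in base R. The function name dive_points is hypothetical (it is not part of the dataset or of any package), and the seven scores and the difficulty are those of the first dive in the dataset, as they appear in the data preview and str() output further down.

```r
# Hypothetical helper illustrating the scoring rule described above:
# drop the two lowest and two highest of 7 scores, sum the middle three,
# then multiply by the dive's difficulty.
dive_points <- function(scores, difficulty) {
  stopifnot(length(scores) == 7)
  kept <- sort(scores)[3:5]   # the three middle scores
  sum(kept) * difficulty
}

# First dive in the dataset: difficulty 1.7, listed Points value 35.7
first_dive <- dive_points(c(7, 6.5, 7.5, 7.5, 7, 7, 7), difficulty = 1.7)
first_dive
```

For this dive the three retained scores are 7, 7 and 7, so the result is 21 × 1.7, which agrees with the Points value recorded for that row.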

In the article “FINA World Junior Championships 2022 Diving Data”, published on February 29, 2024 in CORE Network, Emma Deering explains that the data set contains 360 rows and 15 columns. Each row represents one completed dive by a diver in the preliminary round of the women’s (aged 15-18) 1m springboard event at the 2022 FINA World Junior Championships. Each diver completed 9 dives, so there are 9 rows per diver, which corresponds to 40 divers. The 15 variables are, in order:

LastName is the last name of the diver,
Country is the country the diver was competing for,
Age is the age of the diver,
TotalPoints is the overall number of points the diver had earned at the end of the preliminary round,
DiveNum is the position of the dive in the order in which the diver performed them (1 to 9),
DiveName is the coded name of the dive,
Difficulty is the difficulty of the dive,
Judge1 is the score that judge 1 awarded the dive,
Judge2 is the score that judge 2 awarded the dive,
Judge3 is the score that judge 3 awarded the dive,
Judge4 is the score that judge 4 awarded the dive,
Judge5 is the score that judge 5 awarded the dive,
Judge6 is the score that judge 6 awarded the dive,
Judge7 is the score that judge 7 awarded the dive,
Points is the number of points the individual dive earned.

These data come from World Aquatics, at the link https://www.worldaquatics.com/competitions/2951/fina-world-junior-diving-championships-2022/results?event=6d65f6db-1e71-4bca-b1c3-7facf12f500f&unit=preliminary

With this dataset, I can explore which ages achieved the best overall scores and how the points earned on a dive relate to its level of difficulty.

Load the library

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load and view data

Load the dataset

I will load the dataset using the read_csv function.

# Load data of preliminary diving results of women's 15-18 1m springboard from the 2022 FINA World Junior Championships.
setwd("C:/Users/User/Downloads/Data 110 Projects and Assignments")
divingdata <- read_csv("divingdata.csv")

Rows: 360 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): LastName, Country, DiveName
dbl (12): Age, TotalPoints, DiveNum, Difficulty, Judge1, Judge2, Judge3, Jud...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(divingdata) # shows the first 6 rows of the dataset.

# A tibble: 6 × 15
  LastName Country   Age TotalPoints DiveNum DiveName Difficulty Judge1 Judge2
  <chr>    <chr>   <dbl>       <dbl>   <dbl> <chr>         <dbl>  <dbl>  <dbl>
1 Fung     CAN        18        375.       1 103B            1.7    7      6.5
2 Fung     CAN        18        375.       2 201B            1.6    6.5    6  
3 Fung     CAN        18        375.       3 5233D           2.5    6.5    6.5
4 Fung     CAN        18        375.       4 301B            1.7    6      6.5
5 Fung     CAN        18        375.       5 401B            1.5    7      7  
6 Fung     CAN        18        375.       6 403B            2.4    6.5    6.5
# ℹ 6 more variables: Judge3 <dbl>, Judge4 <dbl>, Judge5 <dbl>, Judge6 <dbl>,
#   Judge7 <dbl>, Points <dbl>
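The loading step above depends on setwd() with a machine-specific path and lets read_csv guess the column types. As a hedged alternative, the sketch below passes a file path directly and declares the types, which also silences the column-specification message. The temporary file, its two rows, and the reduced set of columns are stand-ins created only so that the example runs on its own; they are not the real file.

```r
library(readr)

# Stand-in CSV written to a temporary folder (assumption: the real file
# has the same kind of header; only a few columns are reproduced here).
csv_path <- file.path(tempdir(), "divingdata_example.csv")
writeLines(
  c("LastName,Country,Age,Difficulty,Points",
    "Fung,CAN,18,1.7,35.7",
    "Fung,CAN,18,1.6,31.2"),
  csv_path
)

# Declare the text columns explicitly and read everything else as numeric.
example <- read_csv(
  csv_path,
  col_types = cols(
    LastName = col_character(),
    Country  = col_character(),
    .default = col_double()
  )
)
example
```

Passing the path to read_csv directly keeps the script independent of the working directory, which makes it easier to run on another computer.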

Examine the dataset

The ‘str’ function shows more about the columns in the data, including the data type of each column.

# view structure of data
str(divingdata)

spc_tbl_ [360 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ LastName   : chr [1:360] "Fung" "Fung" "Fung" "Fung" ...
 $ Country    : chr [1:360] "CAN" "CAN" "CAN" "CAN" ...
 $ Age        : num [1:360] 18 18 18 18 18 18 18 18 18 17 ...
 $ TotalPoints: num [1:360] 375 375 375 375 375 ...
 $ DiveNum    : num [1:360] 1 2 3 4 5 6 7 8 9 1 ...
 $ DiveName   : chr [1:360] "103B" "201B" "5233D" "301B" ...
 $ Difficulty : num [1:360] 1.7 1.6 2.5 1.7 1.5 2.4 2.6 2.3 2.4 1.5 ...
 $ Judge1     : num [1:360] 7 6.5 6.5 6 7 6.5 7 5.5 6 6.5 ...
 $ Judge2     : num [1:360] 6.5 6 6.5 6.5 7 6.5 7 5.5 6.5 6.5 ...
 $ Judge3     : num [1:360] 7.5 6.5 6.5 6 7 6.5 7.5 6.5 7 7 ...
 $ Judge4     : num [1:360] 7.5 7 6.5 6.5 7.5 7 7 6.5 7 7 ...
 $ Judge5     : num [1:360] 7 6.5 6.5 6.5 7 7 6.5 6.5 6.5 7.5 ...
 $ Judge6     : num [1:360] 7 6.5 6.5 7.5 6.5 7 6.5 5 7 7 ...
 $ Judge7     : num [1:360] 7 6 6 6.5 7 7 7 6 7 6 ...
 $ Points     : num [1:360] 35.7 31.2 48.8 33.1 31.5 ...
 - attr(*, "spec")=
  .. cols(
  ..   LastName = col_character(),
  ..   Country = col_character(),
  ..   Age = col_double(),
  ..   TotalPoints = col_double(),
  ..   DiveNum = col_double(),
  ..   DiveName = col_character(),
  ..   Difficulty = col_double(),
  ..   Judge1 = col_double(),
  ..   Judge2 = col_double(),
  ..   Judge3 = col_double(),
  ..   Judge4 = col_double(),
  ..   Judge5 = col_double(),
  ..   Judge6 = col_double(),
  ..   Judge7 = col_double(),
  ..   Points = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

chr means character, that is, a string of text, which can be treated as a categorical variable; num means numeric values, which may contain decimals. The dataset therefore has 3 categorical variables and 12 numerical variables.
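Besides the column types, the structure described in the introduction (9 rows per diver) can also be checked. The sketch below is one possible way to do it with dplyr; it runs on a small made-up table with two hypothetical divers, and it assumes that the combination of LastName and Country identifies a diver, which the source does not confirm.

```r
library(dplyr)

# Made-up stand-in for the real data: two hypothetical divers, 9 dives each
toy <- tibble(
  LastName = rep(c("Diver A", "Diver B"), each = 9),
  Country  = rep(c("AAA", "BBB"), each = 9),
  DiveNum  = rep(1:9, times = 2)
)

# Count rows per diver, then confirm every diver has exactly 9 dives
rows_per_diver <- toy |>
  count(LastName, Country, name = "n_dives")

all_have_nine <- all(rows_per_diver$n_dives == 9)
all_have_nine
```

Applied to the real table, the same count() call would return one row per diver, so the number of rows in the result should equal the number of divers.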

Clean the dataset

Check missing values

The code below checks whether every row in divingdata is complete, that is, whether there are no missing values. It returns one of two messages: “No NA Founded” if there are no missing values in any row, or “Found NA” if at least one row has a missing value.

# check missing values
ifelse(mean(complete.cases(divingdata)) == 1, "No NA Founded", "Found NA")

[1] "No NA Founded"

There are no missing values in the dataset.
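The check above gives a single yes/no answer. If missing values were present, it would help to know which columns contain them. The following base R sketch shows one way to count them per column; the small data frame is made up for illustration and deliberately contains two missing values.

```r
# Made-up data frame with two missing values (not the real dataset)
toy <- data.frame(
  lastname = c("Diver A", "Diver B", "Diver C"),
  age      = c(17, NA, 18),
  points   = c(30.5, 28.0, NA)
)

# TRUE if there is at least one NA anywhere in the data frame
has_na <- anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```

On a complete dataset such as divingdata, every count would be zero and anyNA() would return FALSE.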

Remove capital letters from column headers

Setting all variable names to lowercase saves time, because there is no need to keep track of capitalization when typing them. This is the role of the function “tolower”.

# Make all headings (column names) lowercase
names(divingdata) <- tolower(names(divingdata))

head(divingdata)

# A tibble: 6 × 15
  lastname country   age totalpoints divenum divename difficulty judge1 judge2
  <chr>    <chr>   <dbl>       <dbl>   <dbl> <chr>         <dbl>  <dbl>  <dbl>
1 Fung     CAN        18        375.       1 103B            1.7    7      6.5
2 Fung     CAN        18        375.       2 201B            1.6    6.5    6  
3 Fung     CAN        18        375.       3 5233D           2.5    6.5    6.5
4 Fung     CAN        18        375.       4 301B            1.7    6      6.5
5 Fung     CAN        18        375.       5 401B            1.5    7      7  
6 Fung     CAN        18        375.       6 403B            2.4    6.5    6.5
# ℹ 6 more variables: judge3 <dbl>, judge4 <dbl>, judge5 <dbl>, judge6 <dbl>,
#   judge7 <dbl>, points <dbl>

Summary of the data

The summary function runs a quick statistical summary of a data frame: for numeric variables it gives the minimum, first quartile, median, mean, third quartile, and maximum, and for character variables it reports the length and class. It helps to select the focus points of the analysis.

summary(divingdata)

   lastname           country               age         totalpoints   
 Length:360         Length:360         Min.   :15.00   Min.   :157.5  
 Class :character   Class :character   1st Qu.:17.00   1st Qu.:271.9  
 Mode  :character   Mode  :character   Median :17.00   Median :300.0  
                                       Mean   :17.12   Mean   :298.8  
                                       3rd Qu.:18.00   3rd Qu.:334.9  
                                       Max.   :18.00   Max.   :374.7  
    divenum    divename           difficulty        judge1          judge2     
 Min.   :1   Length:360         Min.   :1.200   Min.   :2.000   Min.   :2.000  
 1st Qu.:3   Class :character   1st Qu.:1.700   1st Qu.:4.500   1st Qu.:5.000  
 Median :5   Mode  :character   Median :2.200   Median :5.500   Median :5.500  
 Mean   :5                      Mean   :2.042   Mean   :5.456   Mean   :5.447  
 3rd Qu.:7                      3rd Qu.:2.400   3rd Qu.:6.000   3rd Qu.:6.500  
 Max.   :9                      Max.   :3.100   Max.   :7.500   Max.   :8.500  
     judge3          judge4          judge5          judge6     
 Min.   :1.000   Min.   :2.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:4.875   1st Qu.:4.500   1st Qu.:5.000   1st Qu.:5.000  
 Median :5.500   Median :5.500   Median :5.500   Median :5.500  
 Mean   :5.457   Mean   :5.464   Mean   :5.436   Mean   :5.513  
 3rd Qu.:6.500   3rd Qu.:6.500   3rd Qu.:6.500   3rd Qu.:6.500  
 Max.   :8.000   Max.   :8.000   Max.   :8.000   Max.   :8.000  
     judge7          points     
 Min.   :1.500   Min.   : 9.60  
 1st Qu.:5.000   1st Qu.:28.00  
 Median :6.000   Median :32.45  
 Mean   :5.483   Mean   :33.20  
 3rd Qu.:6.500   3rd Qu.:39.10  
 Max.   :8.000   Max.   :54.60

Best scores by ages

The function as.factor(age) ensures that ages are treated as discrete groups for proper grouping and coloring.

theme(plot.title = element_text(hjust = 0.5)) centers the title. The “hjust” parameter controls the horizontal justification of the title text, where 0 is left-aligned, 0.5 is centered, and 1 is right-aligned. plot.caption works the same way for the caption text, which is left-aligned here with hjust = 0.

coord_flip() function changes the position of the x and y coordinates in a plot. This means that what is normally displayed on the x-axis will be displayed on the y-axis, and vice versa.

Age_best_scores <- divingdata |>
  ggplot() +
  geom_boxplot(aes(x = as.factor(age), y = totalpoints, group = age, fill = as.factor(age))) +
  scale_fill_manual(values = c("15" = "yellow", "16" = "orange", "17" = "maroon", "18" = "pink")) +
  labs(title = "Comparison of Scores from Different Ages",
       x = "Diver's Age",
       y = "Overall Points",
       caption = "Source: WorldAquatics \n www.worldaquatics.com/competitions/2951/fina-world-junior-diving-championships-2022") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0)) +
  coord_flip() +
  guides(fill = guide_legend(title = "Age")) # changes the legend title to "Age" instead of as.factor(age)

Age_best_scores

It appears that 18-year-old divers had the best performances at this championship, although the lowest scores were also observed in that age group.
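Because totalpoints is repeated on each of a diver's 9 rows, a numerical summary by age is easier to read with one row per diver. The sketch below shows one possible way to build it with dplyr. The three divers and their totals are made up for illustration, and treating lastname, age and totalpoints together as identifying a diver is an assumption.

```r
library(dplyr)

# Made-up stand-in: three hypothetical divers, each total repeated on 3 rows
toy <- tibble(
  lastname    = rep(c("Diver A", "Diver B", "Diver C"), each = 3),
  age         = rep(c(16, 18, 18), each = 3),
  totalpoints = rep(c(280, 350, 200), each = 3)
)

# Keep one row per diver, then summarise overall points by age
divers <- toy |>
  distinct(lastname, age, totalpoints)

age_summary <- divers |>
  group_by(age) |>
  summarise(
    n_divers     = n(),
    median_total = median(totalpoints),
    min_total    = min(totalpoints),
    max_total    = max(totalpoints)
  )
age_summary
```

Run on the real data, a table like this would put numbers next to the boxplot: the minimum and maximum per age show directly whether the highest and the lowest totals both fall in the same age group.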

Scores of dive by difficulty

In the second plot, difficulty is kept as a numeric variable so that the colors follow a gradient scale. Treating each difficulty value as a factor level would create redundant labels on the x-axis and in the color legend.

divingdata |>
  ggplot(aes(x = difficulty, y = points, color = difficulty)) +  # treat difficulty as numeric
  geom_point() +
  scale_color_gradient(low = "cyan", high = "red", name = "Difficulty") +  # use gradient scale of colors
  labs(x = "Level of Difficulty",
       y = "Dive's Points", 
       title = "Relationship Between Dive's Difficulty and Scores",
       caption = "WorldAquatics \n www.worldaquatics.com/competitions/2951/fina-world-junior-diving-championships-2022") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0))

Notice that the points earned on a dive tend to increase as the difficulty level increases.
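Since the points for a dive are the sum of the three retained judge scores multiplied by the difficulty, dividing points by difficulty recovers that sum, which reflects execution alone. The base R sketch below illustrates the idea on the first five dives of the dataset. The column name judge_sum is hypothetical, and the points were recomputed from the judge scores in the str() output, so two of them (48.75 and 33.15) carry one more decimal than the rounded values printed there.

```r
# First five dives of the dataset (diver Fung)
dives <- data.frame(
  difficulty = c(1.7, 1.6, 2.5, 1.7, 1.5),
  points     = c(35.7, 31.2, 48.75, 33.15, 31.5)
)

# Sum of the three retained judge scores, with difficulty factored out
dives$judge_sum <- dives$points / dives$difficulty
dives
```

In these five rows the third dive earns the most points even though its judge sum (19.5) is not the highest, which shows how much of the upward trend comes from the difficulty multiplier itself.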

Analysis

When analyzing the preliminary diving results of the women’s 15-18 1m springboard event from the 2022 FINA World Junior Championships dataset, I modified the dataset by converting all the column headings to lowercase using the tolower(names(divingdata)) function. Additionally, I used ifelse() together with complete.cases(divingdata) to check for any missing values within the dataset. Fortunately, there were no missing values, so no further action was required to clean the dataset.

The first visualization is a side-by-side boxplot that displays the distribution of overall points scored by divers of different ages (15, 16, 17, and 18 years old). The plot allows for a clear comparison of the performance of divers in various age groups based on their overall points. The components of the visualization include axes representing overall points and diver’s age, boxplots representing different age groups, colors to distinguish between age groups, a legend, title and labels, and a source and caption at the bottom. Analysis of the boxplots reveals that older divers (ages 17 and 18) generally achieve higher median scores compared to younger divers (ages 15 and 16). This visualization effectively highlights the median, spread, and potential outliers in the data, allowing for a quick comparison of divers’ performance based on age.

Furthermore, in the second visualization, the scatterplot shows the relationship between dive difficulty and points scored, with each point representing an individual dive. The x-axis shows the difficulty level (1.2 to 3.1), and the y-axis shows the points scored. Points are color-coded by difficulty level, as indicated in the legend. The plot reveals a general trend where higher difficulty levels correspond to higher points, though there is considerable variability within each difficulty level. Part of this trend is built into the scoring rule, because the points for a dive are the retained judge scores multiplied by its difficulty. The variability within each difficulty level therefore reflects the other factor, namely how well the judges rated the execution of the dive.

I had hoped to create more advanced plots, including one illustrating the average score given by each judge per dive, but I encountered challenges with the code and couldn’t execute it as intended.
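One possible way to approach the judge averages is to reshape the seven judge columns into a long format with pivot_longer and then summarise. The sketch below is only an illustration under assumptions: it uses the first five dives of the dataset (judge scores taken from the str() output, with lowercase column names), and the object names judge_long, judge_means and dive_means are hypothetical.

```r
library(dplyr)
library(tidyr)

# First five dives of the dataset (diver Fung), judge scores only
dives <- tibble(
  divenum = 1:5,
  judge1 = c(7, 6.5, 6.5, 6, 7),
  judge2 = c(6.5, 6, 6.5, 6.5, 7),
  judge3 = c(7.5, 6.5, 6.5, 6, 7),
  judge4 = c(7.5, 7, 6.5, 6.5, 7.5),
  judge5 = c(7, 6.5, 6.5, 6.5, 7),
  judge6 = c(7, 6.5, 6.5, 7.5, 6.5),
  judge7 = c(7, 6, 6, 6.5, 7)
)

# One row per dive and judge instead of one column per judge
judge_long <- dives |>
  pivot_longer(cols = starts_with("judge"),
               names_to = "judge", values_to = "score")

# Average score awarded by each judge across the dives
judge_means <- judge_long |>
  group_by(judge) |>
  summarise(mean_score = mean(score))

# Average score across the seven judges for each dive
dive_means <- judge_long |>
  group_by(divenum) |>
  summarise(mean_score = mean(score))

judge_means
```

A table like judge_means could then be passed to ggplot() with geom_col() to compare the judges in a bar chart.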