Project_1

Author

Arthur De Almeida

Project 1

Image found in Genler

This data set was found on Kaggle. It is a data set about the 2024 Paris Olympic Games. It has a multitude of variables, both quantitative and categorical, divided across multiple data sets. The ones I will be using are the athletes, medallists and medals_total csv files. The source of the data was the official Olympic Games website (https://olympics.com/en/paris-2024), and for some fields the author of the data set said he used Wikipedia and the sources cited there. The main variables I would like to analyse are meda

# Loading the libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

#Loading the csv files
athletes <- read_csv("athletes.csv")

Rows: 11113 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (31): name, name_short, name_tv, gender, function, country_code, countr...
dbl   (3): code, height, weight
lgl   (1): current
date  (1): birth_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

medallists <- read_csv("medallists.csv")

Rows: 2315 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): medal_type, name, gender, country_code, country, country_long, na...
dbl   (2): medal_code, code_athlete
lgl   (1): is_medallist
date  (2): medal_date, birth_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

country_medals <- read_csv("medals_total.csv")

Rows: 92 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country_code, country, country_long
dbl (4): Gold Medal, Silver Medal, Bronze Medal, Total

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

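#Join the athletes and medallists data sets on country. Country is not a unique key, so each athlete is matched to every medallist from the same country (hence the many-to-many warning).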
joined <- athletes %>%
  left_join(medallists, by = "country")

Warning in left_join(., medallists, by = "country"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 310 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
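The warning above appears because country is shared by many rows in both tables. A minimal sketch of one possible alternative is to join on the athlete identifier instead. It assumes that `code` in the athletes file and `code_athlete` in the medallists file identify the same athlete, which the column listings suggest but which I have not verified. The toy data frames below are hypothetical and only stand in for the real csv files.

```r
library(dplyr)

# Hypothetical toy rows standing in for athletes.csv and medallists.csv
athletes_toy <- data.frame(
  code = c(1, 2, 3),
  name = c("A", "B", "C"),
  country = c("X", "X", "Y")
)
medallists_toy <- data.frame(
  code_athlete = c(1, 1, 3),
  medal_type = c("Gold Medal", "Bronze Medal", "Silver Medal")
)

# Join on the athlete identifier (assumed to match across the two files).
# Only the medal column is taken from medallists, so no duplicated
# .x/.y columns are created.
joined_toy <- athletes_toy %>%
  left_join(
    medallists_toy %>% select(code_athlete, medal_type),
    by = c("code" = "code_athlete")
  )

print(joined_toy)
```

With this key, an athlete with two medals appears twice, and an athlete with no medal appears once with `NA` in `medal_type`, so the join is one-to-many instead of many-to-many.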

# Keep only the columns needed for the analysis
filtered_dataset <- joined %>%
  select(name.x, gender.x, country, birth_date.x, nickname, medal_type)

# Get only the year of birth and use it to calculate the approximate age of each athlete in 2024.
dataset_age <- filtered_dataset %>%
  mutate(birth_date = as.Date(birth_date.x),  # Make sure the birth date is in Date format
         age = 2024 - year(birth_date))       # Age as 2024 minus the year of birth
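Subtracting the year of birth from 2024 can overstate the age by one year for athletes whose birthday falls later in the year. As a rough sketch, the age in completed years could instead be computed against a reference date. The reference date used here (26 July 2024) and the two birth dates are assumptions chosen for illustration only.

```r
library(lubridate)

# Hypothetical birth dates, not taken from the data set
birth_date <- as.Date(c("2000-01-15", "2000-12-31"))

# Assumed reference date for the age calculation
reference_date <- as.Date("2024-07-26")

# Approach used in this project: year difference only
age_by_year <- 2024 - year(birth_date)

# Alternative: completed years between birth date and reference date
age_exact <- interval(birth_date, reference_date) %/% years(1)

print(data.frame(birth_date, age_by_year, age_exact))
```

The two approaches agree for the first athlete and differ by one year for the second, whose birthday comes after the reference date.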

dataset_age1 <- dataset_age %>%
  rename(name = name.x) # Rename name.x (created by the join) back to name

average_by_country <- dataset_age1 %>%
  group_by(country) %>% #grouping by country to find the average age
  summarize(average_value = mean(age, na.rm = TRUE))
dataset2 <- merge(average_by_country,country_medals,by="country") #then merging the data set with the country medals data set using country as the common column.

ggplot(dataset2, aes(x = average_value, y = Total)) +
  geom_point(alpha = 0.5) +  # scatter plot with one point per country
  labs(title = "Relation between average age and total medals",
       x = "Average age per country",
       y = "Sum of medals won per country", caption = "Source: Paris Olympics Official website") +
  geom_smooth(method = "lm", se = FALSE, color = "red") # add the fitted linear regression line

`geom_smooth()` using formula = 'y ~ x'

average_by_sport <- joined %>%
  mutate(birth_date = as.Date(birth_date.x),  # Make sure the birth date is in Date format
         age = 2024 - year(birth_date)) %>%   # Age as 2024 minus the year of birth
  group_by(discipline) %>% # grouping by discipline to find the average age
  summarize(average_value = mean(age, na.rm = TRUE)) %>%
  slice_max(order_by = average_value, n = 10) # keep the 10 disciplines with the highest average age

plot2 <- ggplot(average_by_sport, aes(x = average_value, y = discipline, color = average_value)) +
  geom_point() +
  scale_color_viridis_c() + # the points are mapped to color, not fill, so the color scale is needed
  labs(title = "Ten sports with the oldest average age", 
       x = "Age",
       y = "Sport", caption = "Source: Paris Olympics Official website") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1))

plot2

To clean my data, I created multiple data sets, since each csv file provides different values I wanted to use. The first thing I did was combine the 3 data sets, then I selected the columns I wanted and renamed the ones that were duplicated across the csv files. The first plot produced an unexpected finding: the older the average age of a country's athletes, the greater its number of medals. That would be very surprising, but the correlation is very weak. Even though the fitted line slopes upward, most countries are still concentrated around an average age of 27. The main thing I struggled with, and would still like to add, is an interactive graph. I would also like to make the second graph more colorful. Since I'm not very good with color theory, I tend to be more minimalistic.
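To back up the statement that the correlation in the first plot is weak, the strength of the relationship could be quantified with a correlation coefficient and the slope of the fitted line. The sketch below uses a small hypothetical data frame with the same column names as `dataset2` (`average_value` and `Total`); the numbers are made up for illustration and are not results from the Olympic data.

```r
# Hypothetical values standing in for dataset2 (not real results)
toy <- data.frame(
  average_value = c(24, 26, 27, 28, 30),
  Total = c(3, 10, 5, 40, 12)
)

# Pearson correlation between average age and total medals
r <- cor(toy$average_value, toy$Total, use = "complete.obs")

# Same linear model that geom_smooth(method = "lm") fits
fit <- lm(Total ~ average_value, data = toy)

print(r)
print(coef(fit))
```

A value of `r` close to 0 would support calling the relationship weak, even when the fitted line has a positive slope.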