HW02
Packages:
library(dplyr)
Question 1
Keep only the gold medalists:
gold_medalists <- olympics |>
  filter(medal == "Gold")
How many rows does the resulting dataset have:
nrow(gold_medalists)
[1] 13372
Question 2
- The most appropriate plot is a scatter plot, because each dot can represent one athlete. A scatter plot is also well suited to showing the relationship between two continuous variables.
library(ggplot2)
ggplot(gold_medalists, aes(x = year, y = age)) +
  geom_point(alpha = 0.3) +
  labs(title = "Age of Gold Medal Winners Over the Years",
       x = "Year",
       y = "Age")
The ages of gold medal winners decrease slightly over time.
To reduce overplotting, I make the dots smaller and more transparent.
ggplot(gold_medalists, aes(x = year, y = age)) +
  geom_point(size = 0.8, alpha = 0.2) +
  labs(title = "Age of Gold Medal Winners Over the Years",
       x = "Year", y = "Age")
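Another option for overplotting, offered only as a sketch: year and age are both whole numbers, so many medalists land on exactly the same point, and geom_count() sizes each dot by how many rows share it. The toy_medalists data frame below is made up for illustration; with the real data the same layer would be applied to gold_medalists.

```r
library(ggplot2)

# Made-up stand-in for gold_medalists (illustration only, not real results)
toy_medalists <- data.frame(
  year = c(1996, 1996, 1996, 2000, 2000, 2004),
  age  = c(24, 24, 27, 24, 24, 31)
)

# geom_count() draws one dot per (year, age) pair, sized by the row count
p <- ggplot(toy_medalists, aes(x = year, y = age)) +
  geom_count(alpha = 0.5) +
  labs(title = "Age of Gold Medal Winners Over the Years",
       x = "Year", y = "Age", size = "Medalists")
p
```

Unlike transparency alone, this keeps a legend that says how many athletes each dot stands for.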
Question 3
us_medals <- gold_medalists |>
  filter(noc == "USA") |>
  group_by(year) |>
  summarise(num_medals = n())
Line Graph:
ggplot(us_medals, aes(x = year, y = num_medals)) +
  geom_line(color = "black", linewidth = 1) +
  geom_point(color = "blue", size = 2) +
  labs(title = "U.S. Gold Medals Across the Years",
       x = "Year",
       y = "Number of Gold Medals")
The country’s most successful year:
us_medals[which.max(us_medals$num_medals), ]
# A tibble: 1 × 2
   year num_medals
  <dbl>      <int>
1  1984        190
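Why the wiggles: the IOC separated the Winter and Summer Olympics after 1992, so since 1994 they have been held in alternating even-numbered years. The U.S. performs well at the Summer Games but not as well at the Winter Games, so the yearly count swings up and down.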
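One way to check this explanation is to count golds per year and season, so the two kinds of Games are no longer mixed in one line. This is a sketch that assumes the data has a season column with the values "Summer" and "Winter"; that column is not used anywhere above, so treat it as hypothetical. The toy_us rows are invented for illustration.

```r
library(dplyr)

# Invented rows standing in for U.S. gold medalists (not real counts).
# The `season` column is an assumption about the dataset.
toy_us <- data.frame(
  year   = c(1992, 1992, 1992, 1994, 1996, 1996),
  season = c("Summer", "Summer", "Winter", "Winter", "Summer", "Summer")
)

# One row per year and season, with the number of golds in each
us_by_season <- toy_us |>
  count(year, season, name = "num_medals")

us_by_season
```

If the column exists, mapping color = season in the line graph would draw one line per kind of Games, and the wiggles should largely disappear within each line.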
Question 4
two_events <- gold_medalists |>
  filter(
    event == "Gymnastics Men's Individual All-Around" |
      event == "Gymnastics Women's Individual All-Around" |
      event == "Athletics Women's 100 metres" |
      event == "Athletics Men's 100 metres"
  )
Gymnastics events:
gymnastics <- two_events |>
  filter(event == "Gymnastics Men's Individual All-Around" |
           event == "Gymnastics Women's Individual All-Around")
Histogram:
ggplot(gymnastics, aes(x = age)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Age of Gymnastics Gold Winners", x = "Age", y = "Count")
Distribution: roughly normal over ages 14 to 35, with most winners (the mode) between ages 20 and 30.
Distribution by gender:
ggplot(gymnastics, aes(x = age)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  facet_wrap(~ sex) +
  labs(title = "Age of Gymnastics Gold Winners by Gender",
       x = "Age", y = "Count")
Male athletes tend to be older.
Question 5
Filter two events:
two_events <- gold_medalists |>
  filter(
    event == "Gymnastics Men's Individual All-Around" |
      event == "Gymnastics Women's Individual All-Around" |
      event == "Athletics Women's 100 metres" |
      event == "Athletics Men's 100 metres")
ggplot(two_events, aes(x = event, y = height)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Height of Gold Winners Across Events", x = "Event", y = "Height (cm)") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
Range: about 150 cm to just over 190 cm.
Median:
- Men's 100 m: around 185 cm
- Women's 100 m: around 170 cm
- Men's gymnastics: around 167 cm
- Women's gymnastics: around 159 cm
By event, 100 m athletes have a higher median height; by gender, male athletes have a higher median height.
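The medians above are read off the boxplot. A sketch of how they could be computed directly is shown here, using a made-up toy_events data frame in place of two_events. The event labels "A" and "B" and all heights are invented, and na.rm = TRUE is included in case some heights are missing.

```r
library(dplyr)

# Made-up heights (cm) standing in for two_events; NA mimics a missing height
toy_events <- data.frame(
  event  = c("A", "A", "A", "B", "B", "B"),
  height = c(180, 185, 190, 160, NA, 170)
)

# Median height per event; na.rm = TRUE skips missing heights
median_heights <- toy_events |>
  group_by(event) |>
  summarise(median_height = median(height, na.rm = TRUE))

median_heights
```

Running the same group_by() and summarise() on two_events would give one exact median per event instead of an estimate from the plot.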
Question 6
us_medalists <- gold_medalists |>
  filter(noc == "USA")
ggplot(us_medalists, aes(x = year, fill = sex)) +
  geom_bar(position = "dodge") +
  labs(title = "US Gold Winner by Gender Each Year", x = "year", y = "No. Gold Won", fill = "Gender")
Pattern: there are barely any female gold winners before the 1950s. The number of female gold winners increases gradually from the 1920s and rises sharply after 1975. After 1992, the number of gold winners of both genders fluctuates because the Summer and Winter Games alternate.
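To put a number on this pattern, the share of golds won by women in each year could be computed, which removes the Summer/Winter swings in the raw counts. This is a sketch on invented rows; the coding of sex as "M" and "F" is an assumption about the data.

```r
library(dplyr)

# Invented rows standing in for us_medalists (not real results).
# The "M"/"F" coding of `sex` is an assumption.
toy_us_medalists <- data.frame(
  year = c(1972, 1972, 1972, 1972, 1996, 1996, 1996, 1996),
  sex  = c("M", "M", "M", "F", "M", "M", "F", "F")
)

# mean() of a logical vector gives the proportion of TRUE values
female_share <- toy_us_medalists |>
  group_by(year) |>
  summarise(share_female = mean(sex == "F"))

female_share
```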