This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.
Download StudentSurvey.csv from the Datasets folder on
Blackboard. Save it in the same folder as this Rmd and set your working directory to that folder.
# Load dplyr (needed for select(), filter(), group_by(), and summarize() below)
library(dplyr)
# Load the file
getwd()
## [1] "/Users/darrenabou/Downloads"
survey <- read.csv("StudentSurvey.csv")
# Q1. Check the head of the dataset
head(survey)
##        Year Sex Smoke   Award HigherSAT Exercise TV Height Weight Siblings
## 1    Senior   M    No Olympic      Math       10  1     71    180        4
## 2 Sophomore   F   Yes Academy      Math        4  7     66    120        2
## 3 FirstYear   M    No   Nobel      Math       14  5     72    208        2
## 4    Junior   M    No   Nobel      Math        3  1     63    110        1
## 5 Sophomore   F    No   Nobel    Verbal        3  3     65    150        1
## 6 Sophomore   F    No   Nobel    Verbal        5  4     65    114        2
##   BirthOrder VerbalSAT MathSAT  SAT  GPA Pulse Piercings
## 1          4       540     670 1210 3.13    54         0
## 2          2       520     630 1150 2.50    66         3
## 3          1       550     560 1110 2.55   130         0
## 4          1       490     630 1120 3.10    78         0
## 5          1       720     450 1170 2.70    40         6
## 6          2       600     550 1150 3.20    80         4
# Q2. Check the dimensions
dim(survey)
## [1] 362 17
# Q3. Create a table of students' sex and HigherSAT
# Store the result under its own name so it does not mask base R's table()
sex_sat_table <- survey |>
  select(Sex, HigherSAT) |>
  table()
sex_sat_table
##    HigherSAT
## Sex     Math Verbal
##   F   4   81     84
##   M   3  124     66
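The first count in each row of this table has no column label, which most likely means HigherSAT contains blank strings for students who left the field empty. A minimal base R sketch shows how such a column arises and one way to recode blanks as NA before tabulating. The data frame `toy` and its values are invented for illustration and are not the real survey.

```r
# Toy data standing in for the survey (values are invented for illustration)
toy <- data.frame(
  Sex       = c("F", "M", "F", "M", "F"),
  HigherSAT = c("Math", "Verbal", "", "Math", "Verbal"),
  stringsAsFactors = FALSE
)

# Blank strings count as their own category, producing an unlabeled column
with_blank <- table(toy$Sex, toy$HigherSAT)

# Recode blanks as NA; table() leaves NA out by default
toy$HigherSAT[toy$HigherSAT == ""] <- NA
without_blank <- table(Sex = toy$Sex, HigherSAT = toy$HigherSAT)
without_blank
```

Whether blanks should be dropped or reported as their own category depends on the question being asked, so this is one option rather than the required approach.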
# Q4. Display summary statistics for VerbalSAT
summary(survey$VerbalSAT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 390.0 550.0 600.0 594.2 640.0 800.0
# Q5. Find the average GPA of students
mean(survey$GPA, na.rm = TRUE)
## [1] 3.157942
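The `na.rm = TRUE` argument matters whenever a column has missing values. A small sketch with an invented vector `gpa` (not the survey data) shows the difference it makes:

```r
# Invented GPA values with one missing entry
gpa <- c(3.2, NA, 2.8, 3.6)

mean_with_na    <- mean(gpa)                # NA: a single missing value propagates
mean_without_na <- mean(gpa, na.rm = TRUE)  # mean of the three known values
n_missing       <- sum(is.na(gpa))          # how many values were dropped

mean_with_na
mean_without_na
n_missing
```

Checking `sum(is.na(...))` alongside the mean makes it clear how many observations the average is actually based on.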
# Q6. Create a new dataframe called column_df that contains students' weight
# and number of hours they exercise.
column_df <- survey |>
  select(Weight, Exercise)
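Since Part 1 is about base R, the same column subset can also be written without dplyr. The sketch below uses an invented data frame `toy`; the name `column_df_base` is hypothetical.

```r
# Toy data frame with the two columns of interest plus one extra
toy <- data.frame(
  Weight   = c(180, 120, 208),
  Exercise = c(10, 4, 14),
  TV       = c(1, 7, 5)
)

# Select columns by name with bracket indexing
column_df_base <- toy[, c("Weight", "Exercise")]

# Selecting a single column returns a plain vector unless drop = FALSE is set
weight_vec <- toy[, "Weight"]
weight_df  <- toy[, "Weight", drop = FALSE]

str(column_df_base)
```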
# Q7. Access the fourth element in the first column of the StudentSurvey dataset.
survey[4, 1]
## [1] "Junior"
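Bracket indexing reads as `[row, column]`, and there are several equivalent ways to reach the same cell. A short sketch on an invented four-row data frame `toy`:

```r
# Toy data frame whose first column mirrors the survey's Year column
toy <- data.frame(
  Year = c("Senior", "Sophomore", "FirstYear", "Junior"),
  Sex  = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

by_position <- toy[4, 1]       # row 4, column 1
by_name     <- toy[4, "Year"]  # row 4, column called Year
by_dollar   <- toy$Year[4]     # take the Year column, then its 4th element
by_double   <- toy[[1]][4]     # take the first column, then its 4th element

by_position
```

Indexing by name tends to be more robust than indexing by position, because it keeps working if the column order changes.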
Don’t change this chunk — it loads and filters the dataset.
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics |>
filter(!is.na(age)) |>
filter(sport == "Gymnastics") |>
mutate(
medalist = case_when(
is.na(medal) ~ FALSE,
!is.na(medal) ~ TRUE
)
)
More info on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
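For reading the provided chunk: the `case_when()` call flags a row as a medalist whenever `medal` is not missing. The sketch below, on an invented `medal` vector, suggests that this is equivalent to the shorter expression `!is.na(medal)`. It is only meant to explain what the chunk does, not to replace it.

```r
library(dplyr)

# Invented medal values; NA means the athlete did not win a medal
medal <- c("Gold", NA, "Bronze", NA)

# The two-branch case_when() used in the provided chunk
via_case_when <- case_when(
  is.na(medal)  ~ FALSE,
  !is.na(medal) ~ TRUE
)

# The same logical vector from a direct negation
via_negation <- !is.na(medal)

identical(via_case_when, via_negation)
```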
# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist.
# Call it df.
df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
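The filter uses `%in%` rather than `==` for a reason: `==` compares element by element and silently recycles the shorter vector, so it can miss matches. A base R sketch on an invented `year` vector illustrates the difference:

```r
# Invented years; five of the six values are in the target set
year <- c(2008, 2012, 2016, 2004, 2008, 2012)
target <- c(2008, 2012, 2016)

# %in% asks, for each year, whether it appears anywhere in target
in_match <- year %in% target

# == lines the vectors up position by position (target is recycled),
# so the last two values are compared against the wrong target years
eq_match <- year == target

sum(in_match)
sum(eq_match)
```

Here `%in%` finds all five matching rows, while `==` finds only the three that happen to line up with the recycled target vector.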
# Q10. Group by those three years and summarize the mean age in each group.
df2 |>
  group_by(year) |>
  summarize(mean_age = mean(age, na.rm = TRUE))
## # A tibble: 3 × 2
##    year mean_age
##   <dbl>    <dbl>
## 1  2008     21.6
## 2  2012     21.9
## 3  2016     22.2
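A point worth checking in grouped summaries like this one: `na.rm = TRUE` only affects the mean when it is passed inside `mean()`. Given as a separate argument to `summarize()`, it is treated as a new column named `na.rm`. The sketch below uses an invented tibble `toy` with one missing age to make the difference visible.

```r
library(dplyr)

# Invented data with one missing age in 2008
toy <- tibble(
  year = c(2008, 2008, 2012, 2012),
  age  = c(20, NA, 22, 24)
)

# na.rm outside mean(): creates a logical column and does not drop the NA
misplaced <- toy |>
  group_by(year) |>
  summarize(mean_age = mean(age), na.rm = TRUE)

# na.rm inside mean(): missing ages are ignored when averaging
inside <- toy |>
  group_by(year) |>
  summarize(mean_age = mean(age, na.rm = TRUE))

misplaced
inside
```

In the gymnastics data the two forms give the same means only because rows with missing ages were already filtered out when `olympic_gymnasts` was created.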
# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age
# for each year. Call this oly_year.
# (Bonus: find the minimum average age across years.)
oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(mean_age = mean(age, na.rm = TRUE))
oly_year
## # A tibble: 29 × 2
##     year mean_age
##    <dbl>    <dbl>
##  1  1896     24.3
##  2  1900     22.2
##  3  1904     25.1
##  4  1906     24.7
##  5  1908     23.2
##  6  1912     24.2
##  7  1920     26.7
##  8  1924     27.6
##  9  1928     25.6
## 10  1932     23.9
## # ℹ 19 more rows
# Bonus: minimum of the yearly average ages
oly_year |>
  summarize(min_avg_age = min(mean_age))
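The bonus needs a table with one row per year and a column of mean ages. As a sketch, assuming such a table with a `mean_age` column, the minimum can be found with `summarize()`, and `slice_min()` additionally reports which year it belongs to. The tibble `oly_year_toy` below is a hypothetical three-row stand-in copied from the first rows printed above, so its minimum is not the answer for the full dataset.

```r
library(dplyr)

# Three-row stand-in for the per-year table (first rows of the printed output)
oly_year_toy <- tibble(
  year     = c(1896, 1900, 1904),
  mean_age = c(24.3, 22.2, 25.1)
)

# Smallest yearly average as a one-row summary
min_avg <- oly_year_toy |>
  summarize(min_avg_age = min(mean_age))

# The full row with the smallest average, so the year is visible too
youngest_year <- oly_year_toy |>
  slice_min(mean_age, n = 1)

min_avg
youngest_year
```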
# Q12. Open-ended: come up with a question that requires at least TWO dplyr verbs.
# Write the question, then the code that answers it. Below the chunk, briefly
# explain why you chose this question.
basketball_medals <- olympics |>
  filter(sport == "Basketball",
         year %in% c(2012, 2016)) |>
  filter(!is.na(medal)) |>
  group_by(year, team) |>
  summarize(
    num_medals = n()
  ) |>
  arrange(year, desc(num_medals))
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by year and team.
## ℹ Output is grouped by year.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(year, team))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
basketball_medals
## # A tibble: 8 × 3
## # Groups: year [2]
## year team num_medals
## <dbl> <chr> <int>
## 1 2012 United States 24
## 2 2012 Australia 12
## 3 2012 France 12
## 4 2012 Russia 12
## 5 2012 Spain 12
## 6 2016 Serbia 24
## 7 2016 Spain 24
## 8 2016 United States 24
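Two details of this summary can be sketched on toy data. First, `count()` is a shorthand for `group_by()` plus `summarize(n())` that returns an ungrouped result, so the regrouping message does not appear. Second, because each row is one athlete, counting rows counts medal-winning athletes, and `distinct()` can collapse them to one row per team medal. The tibble `toy`, its team names, and the use of an `event` column are assumptions made for illustration.

```r
library(dplyr)

# Invented medal rows: team A has two gold medalists in one event and one
# silver medalist in another; team B has two bronze medalists in one event
toy <- tibble(
  year  = c(2016, 2016, 2016, 2016, 2016),
  team  = c("A", "A", "A", "B", "B"),
  event = c("Men's Basketball", "Men's Basketball", "Women's Basketball",
            "Men's Basketball", "Men's Basketball"),
  medal = c("Gold", "Gold", "Silver", "Bronze", "Bronze")
)

# One count per athlete row (what n() measures in the chunk above)
athlete_rows <- toy |>
  count(year, team, name = "num_medals")

# One count per distinct team medal
team_medals <- toy |>
  distinct(year, team, event, medal) |>
  count(year, team, name = "num_medals")

athlete_rows
team_medals
```

Which count is appropriate depends on whether the question is about medal-winning athletes or about medals awarded to a team.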
Your question and reflection:
Question: Which basketball teams won the most medals in 2012 and in 2016 separately?
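I wanted to see which countries dominated basketball in the two most recent Summer Olympics in the dataset. The code uses four dplyr verbs (filter, group_by, summarize, and arrange) to count medals per team within each year and rank them from highest to lowest. Because each row in the data is one athlete, num_medals most likely counts medal-winning athletes rather than team medals: a 12-player roster contributes 12 rows, so a count of 24 suggests that both the men's and women's teams medaled.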