Homework 2 — Dataframes + dplyr

This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.

Part 1 — Student Survey (dataframe basics)

Download StudentSurvey.csv from the Datasets folder on Blackboard. Save it next to this Rmd and set your working directory.

# Load dplyr (provides select(), filter(), group_by(), summarize()) and the file

library(dplyr)

getwd()

## [1] "/Users/darrenabou/Downloads"

survey <- read.csv("StudentSurvey.csv")

# Q1. Check the head of the dataset

head(survey)

##        Year Sex Smoke   Award HigherSAT Exercise TV Height Weight Siblings
## 1    Senior   M    No Olympic      Math       10  1     71    180        4
## 2 Sophomore   F   Yes Academy      Math        4  7     66    120        2
## 3 FirstYear   M    No   Nobel      Math       14  5     72    208        2
## 4    Junior   M    No   Nobel      Math        3  1     63    110        1
## 5 Sophomore   F    No   Nobel    Verbal        3  3     65    150        1
## 6 Sophomore   F    No   Nobel    Verbal        5  4     65    114        2
##   BirthOrder VerbalSAT MathSAT  SAT  GPA Pulse Piercings
## 1          4       540     670 1210 3.13    54         0
## 2          2       520     630 1150 2.50    66         3
## 3          1       550     560 1110 2.55   130         0
## 4          1       490     630 1120 3.10    78         0
## 5          1       720     450 1170 2.70    40         6
## 6          2       600     550 1150 3.20    80         4

# Q2. Check the dimensions

dim(survey)

## [1] 362  17

# Q3. Create a table of students' sex and HigherSAT

# Use a descriptive name so the result does not mask base R's table() function
sex_sat_table <- survey |>
  select(Sex, HigherSAT) |>
  table()
sex_sat_table

##    HigherSAT
## Sex     Math Verbal
##   F   4   81     84
##   M   3  124     66

# Q4. Display summary statistics for VerbalSAT

summary(survey$VerbalSAT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   390.0   550.0   600.0   594.2   640.0   800.0

# Q5. Find the average GPA of students

mean(survey$GPA, na.rm = TRUE)

## [1] 3.157942

# Q6. Create a new dataframe called column_df that contains students' weight
#     and number of hours they exercise.

column_df <- survey |>
  select(Weight, Exercise)

# Q7. Access the fourth element in the first column of the StudentSurvey dataset.

survey[4, 1]

## [1] "Junior"

Part 2 — Olympic Gymnasts (dplyr)

Don’t change this chunk — it loads and filters the dataset.

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics |>
  filter(!is.na(age)) |>
  filter(sport == "Gymnastics") |>
  mutate(
    medalist = case_when(
      is.na(medal) ~ FALSE,
      !is.na(medal) ~ TRUE
    )
  )

More info on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist.
#     Call it df.

df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)

# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.

df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))

# Q10. Group by those three years and summarize the mean age in each group.

# na.rm belongs inside mean(); outside it, summarize() creates a column named na.rm
df2 |>
  group_by(year) |>
  summarize(mean_age = mean(age, na.rm = TRUE))

## # A tibble: 3 × 2
##    year mean_age
##   <dbl>    <dbl>
## 1  2008     21.6
## 2  2012     21.9
## 3  2016     22.2
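
Where na.rm sits in a grouped summary changes the result. A minimal sketch on made-up data (the toy data frame and its values are hypothetical, and this assumes dplyr is installed):

```r
library(dplyr)

# Toy data with one missing age; values are invented for illustration.
toy <- data.frame(
  year = c(2008, 2008, 2012, 2012),
  age  = c(20, NA, 22, 24)
)

# na.rm outside mean(): summarize() treats it as a new column named na.rm,
# and mean() still returns NA for any group that contains a missing age.
outside <- toy |>
  group_by(year) |>
  summarize(mean_age = mean(age), na.rm = TRUE)

# na.rm inside mean(): missing ages are dropped before averaging.
inside <- toy |>
  group_by(year) |>
  summarize(mean_age = mean(age, na.rm = TRUE))

outside
inside
```

With the gymnastics data the two forms give the same means only because rows with a missing age were filtered out when olympic_gymnasts was built.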

# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age
#      for each year. Call this oly_year.
#      (Bonus: find the minimum average age across years.)

# Assign the result to oly_year and keep na.rm inside mean()
oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(mean_age = mean(age, na.rm = TRUE))

oly_year

## # A tibble: 29 × 2
##     year mean_age
##    <dbl>    <dbl>
##  1  1896     24.3
##  2  1900     22.2
##  3  1904     25.1
##  4  1906     24.7
##  5  1908     23.2
##  6  1912     24.2
##  7  1920     26.7
##  8  1924     27.6
##  9  1928     25.6
## 10  1932     23.9
## # ℹ 19 more rows

# Bonus: minimum average age across years

oly_year |>
  summarize(min_avg_age = min(mean_age))


# Q12. Open-ended: come up with a question that requires at least TWO dplyr verbs.
#      Write the question, then the code that answers it. Below the chunk, briefly
#      explain why you chose this question.

basketball_medals <- olympics |>
  filter(sport == "Basketball",
         year %in% c(2012, 2016)) |>
  filter(!is.na(medal)) |>               
  group_by(year, team) |>
  summarize(
    num_medals = n()
  ) |>
  arrange(year, desc(num_medals))

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by year and team.
## ℹ Output is grouped by year.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(year, team))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.

basketball_medals

## # A tibble: 8 × 3
## # Groups:   year [2]
##    year team          num_medals
##   <dbl> <chr>              <int>
## 1  2012 United States         24
## 2  2012 Australia             12
## 3  2012 France                12
## 4  2012 Russia                12
## 5  2012 Spain                 12
## 6  2016 Serbia                24
## 7  2016 Spain                 24
## 8  2016 United States         24

Your question and reflection:

Question: Which basketball teams won the most medals in 2012 and in 2016 separately?

I wanted to see which countries dominated basketball in the two most recent Summer Olympics in the dataset (2012 and 2016). The code uses four dplyr verbs (filter, group_by, summarize, and arrange) to count medal-winning athlete entries per team within each year and rank the teams from highest to lowest.
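
In this dataset a team medal appears once for every athlete on the roster, so n() counts athlete rows rather than medals awarded to the team. A minimal sketch of the difference on made-up data (the player names and rows are hypothetical; the column names mirror those in the olympics data):

```r
library(dplyr)

# Hypothetical rows: two players per medaling team in a single event.
toy <- data.frame(
  year  = c(2016, 2016, 2016, 2016),
  team  = c("United States", "United States", "Serbia", "Serbia"),
  event = "Basketball Men's Basketball",
  name  = c("Player A", "Player B", "Player C", "Player D"),
  medal = c("Gold", "Gold", "Silver", "Silver")
)

# One row per athlete: each team is counted once per player.
athlete_rows <- toy |>
  count(year, team, name = "num_athlete_medals")

# Collapse to one row per team, event, and medal before counting.
team_medals <- toy |>
  distinct(year, team, event, medal) |>
  count(year, team, name = "num_team_medals")

athlete_rows
team_medals
```

Which count is appropriate depends on the question: athlete rows measure how many medalists a team produced, while the distinct version measures how many podium finishes it earned.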

Gamaliel