Homework 2 — Dataframes + dplyr

This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.

Part 1 — Student Survey (dataframe basics)

Download StudentSurvey.csv from the Datasets folder on Blackboard. Save it next to this Rmd and set your working directory. x download ‘StudentSurvey.csv’ x save in same folder as this r markdown x set working directory ()

# Load the file
survey <- read.csv("StudentSurvey.csv")
glimpse(survey)

## Rows: 362
## Columns: 17
## $ Year       <chr> "Senior", "Sophomore", "FirstYear", "Junior", "Sophomore", …
## $ Sex        <chr> "M", "F", "M", "M", "F", "F", "F", "M", "F", "F", "F", "M",…
## $ Smoke      <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No"…
## $ Award      <chr> "Olympic", "Academy", "Nobel", "Nobel", "Nobel", "Nobel", "…
## $ HigherSAT  <chr> "Math", "Math", "Math", "Math", "Verbal", "Verbal", "Math",…
## $ Exercise   <dbl> 10, 4, 14, 3, 3, 5, 10, 13, 3, 12, 12, 10, 12, 6, 9, 10, 3,…
## $ TV         <int> 1, 7, 5, 1, 3, 4, 10, 8, 6, 1, 6, 5, 8, 1, 5, 2, 15, 3, 1, …
## $ Height     <int> 71, 66, 72, 63, 65, 65, 66, 74, 61, 60, 65, 63, 68, 68, 68,…
## $ Weight     <int> 180, 120, 208, 110, 150, 114, 128, 235, NA, 115, 140, 200, …
## $ Siblings   <int> 4, 2, 2, 1, 1, 2, 1, 1, 2, 7, 1, 2, 3, 2, 1, 1, 2, 2, 1, 1,…
## $ BirthOrder <int> 4, 2, 1, 1, 1, 2, 1, 1, 2, 8, 2, 2, NA, 3, 1, 2, 1, 1, 1, 2…
## $ VerbalSAT  <int> 540, 520, 550, 490, 720, 600, 640, 660, 550, 670, 500, 580,…
## $ MathSAT    <int> 670, 630, 560, 630, 450, 550, 680, 710, 550, 700, 670, 600,…
## $ SAT        <int> 1210, 1150, 1110, 1120, 1170, 1150, 1320, 1370, 1100, 1370,…
## $ GPA        <dbl> 3.13, 2.50, 2.55, 3.10, 2.70, 3.20, 2.77, 3.30, 2.80, 3.70,…
## $ Pulse      <int> 54, 66, 130, 78, 40, 80, 94, 77, 60, 94, 63, 72, 54, 66, 72…
## $ Piercings  <int> 0, 3, 0, 0, 6, 4, 8, 0, 7, 2, 2, 0, 0, 4, 0, 4, 4, 0, 2, 5,…

# 362 rows observations ; 17 columns

# Q1. Check the head of the dataset

head(survey)

##        Year Sex Smoke   Award HigherSAT Exercise TV Height Weight Siblings
## 1    Senior   M    No Olympic      Math       10  1     71    180        4
## 2 Sophomore   F   Yes Academy      Math        4  7     66    120        2
## 3 FirstYear   M    No   Nobel      Math       14  5     72    208        2
## 4    Junior   M    No   Nobel      Math        3  1     63    110        1
## 5 Sophomore   F    No   Nobel    Verbal        3  3     65    150        1
## 6 Sophomore   F    No   Nobel    Verbal        5  4     65    114        2
##   BirthOrder VerbalSAT MathSAT  SAT  GPA Pulse Piercings
## 1          4       540     670 1210 3.13    54         0
## 2          2       520     630 1150 2.50    66         3
## 3          1       550     560 1110 2.55   130         0
## 4          1       490     630 1120 3.10    78         0
## 5          1       720     450 1170 2.70    40         6
## 6          2       600     550 1150 3.20    80         4

# first 6 rows ; 17 columns

# Q2. Check the dimensions

#from glimpse(survey) there were 362 observations and 16 columns

dim(survey)

## [1] 362  17

nrow(survey)

## [1] 362

ncol(survey)

## [1] 17

# 362 rows by 17 columns

# Q3. Create a table of students' Sex and HigherSAT

table_Sex_HigherSAT <- table(survey$Sex, survey$HigherSAT)
table_Sex_HigherSAT

##    
##         Math Verbal
##   F   4   81     84
##   M   3  124     66

# getting an extra column , checking data

unique(survey$Sex)        # returned "M" and "F" so no extras

## [1] "M" "F"

unique(survey$HigherSAT)  #returned "Math" "Verbal" and ""

## [1] "Math"   "Verbal" ""

sum(survey$HigherSAT == "") # count number of blanks; returned 7 blanks

## [1] 7

                            # 4 females "" and 3 males "" no higherSAT

# Q4. Display summary statistics for VerbalSAT

summary(survey$VerbalSAT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   390.0   550.0   600.0   594.2   640.0   800.0

# Q5. Find the average GPA of students

mean(survey$GPA) # retrned NA, so will try to count number of NAs, then remove NAs to compute mean.

## [1] NA

sum(is.na(survey$GPA))    #count number of NA s

## [1] 17

which(is.na(survey$GPA))  #which rows have NA

##  [1]  12  15  51  65  76  77  82  89 106 154 161 200 210 216 238 263 303

round ((mean(survey$GPA, na.rm = TRUE)), 3)

## [1] 3.158

# 17 NA s in the GPA column
# the observations on the following rows have NA for GPA:
# 12  15  51  65  76  77  82  89 106 154 161 200 210 216 238 263 303

# Q6. Create a new dataframe called column_df that contains students' weight
#     and number of hours they exercise.

column_df <- data.frame(
  students_weight = survey$Weight,
  hours_exercise = survey$Exercise
  
)

glimpse(column_df)

## Rows: 362
## Columns: 2
## $ students_weight <int> 180, 120, 208, 110, 150, 114, 128, 235, NA, 115, 140, …
## $ hours_exercise  <dbl> 10, 4, 14, 3, 3, 5, 10, 13, 3, 12, 12, 10, 12, 6, 9, 1…

head(column_df)

##   students_weight hours_exercise
## 1             180             10
## 2             120              4
## 3             208             14
## 4             110              3
## 5             150              3
## 6             114              5

# Q7. Access the fourth element in the first column of the StudentSurvey dataset.

survey[4,1] # fourth row and first column

## [1] "Junior"

names(survey)[1] # returns "Year" as first column

## [1] "Year"

survey[4, "Year"]  #wanted to find it by column name and row/observation

## [1] "Junior"

Part 2 — Olympic Gymnasts (dplyr)

Don’t change this chunk — it loads and filters the dataset.

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics |>
  filter(!is.na(age)) |>
  filter(sport == "Gymnastics") |>
  mutate(
    medalist = case_when(
      is.na(medal) ~ FALSE,
      !is.na(medal) ~ TRUE
    )
  )

More info on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist.
#     Call it df.

glimpse(olympics)

## Rows: 271,116
## Columns: 15
## $ id     <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
## $ name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
## $ age    <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
## $ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
## $ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
## $ team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
## $ noc    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
## $ games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year   <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
## $ sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
## $ event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
## $ medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

head(olympics)

## # A tibble: 6 × 15
##      id name      sex     age height weight team  noc   games  year season city 
##   <dbl> <chr>     <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
## 1     1 A Dijiang M        24    180     80 China CHN   1992…  1992 Summer Barc…
## 2     2 A Lamusi  M        23    170     60 China CHN   2012…  2012 Summer Lond…
## 3     3 Gunnar N… M        24     NA     NA Denm… DEN   1920…  1920 Summer Antw…
## 4     4 Edgar Li… M        34     NA     NA Denm… DEN   1900…  1900 Summer Paris
## 5     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
## 6     5 Christin… F        21    185     82 Neth… NED   1988…  1988 Winter Calg…
## # ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>

df <- data.frame(
  name = olympics$name,
  sex = olympics$sex,
  age = olympics$age,
  team = olympics$team,
  year = olympics$year,
  medalist = olympics$medal
)

glimpse(df)

## Rows: 271,116
## Columns: 6
## $ name     <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Linden…
## $ sex      <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "…
## $ age      <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 3…
## $ team     <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands",…
## $ year     <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 1…
## $ medalist <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

head(df)

##                       name sex age           team year medalist
## 1                A Dijiang   M  24          China 1992     <NA>
## 2                 A Lamusi   M  23          China 2012     <NA>
## 3      Gunnar Nielsen Aaby   M  24        Denmark 1920     <NA>
## 4     Edgar Lindenau Aabye   M  34 Denmark/Sweden 1900     Gold
## 5 Christine Jacoba Aaftink   F  21    Netherlands 1988     <NA>
## 6 Christine Jacoba Aaftink   F  21    Netherlands 1988     <NA>

# 271,116 rows observations and 6 columns

# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.

#piping - and then
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
  
glimpse(df2)

## Rows: 40,210
## Columns: 6
## $ name     <chr> "A Lamusi", "Ragnhild Margrethe Aamodt", "Andreea Aanei", "Ja…
## $ sex      <chr> "M", "F", "F", "M", "M", "F", "M", "M", "M", "M", "M", "M", "…
## $ age      <dbl> 23, 27, 22, 30, 24, 28, 23, 23, 23, 23, 23, 23, 26, 21, 30, 2…
## $ team     <chr> "China", "Norway", "Romania", "France", "Morocco", "Estonia",…
## $ year     <dbl> 2012, 2008, 2016, 2012, 2012, 2008, 2016, 2016, 2016, 2016, 2…
## $ medalist <chr> NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Bron…

head(df2)

##                                 name sex age    team year medalist
## 1                           A Lamusi   M  23   China 2012     <NA>
## 2          Ragnhild Margrethe Aamodt   F  27  Norway 2008     Gold
## 3                      Andreea Aanei   F  22 Romania 2016     <NA>
## 4 Jamale (Djamel-) Aarrass (Ahrass-)   M  30  France 2012     <NA>
## 5                   Abdelhak Aatakni   M  24 Morocco 2012     <NA>
## 6                       Moonika Aava   F  28 Estonia 2008     <NA>

# Q10. Group by those three years and summarize the mean age in each group.

# piping operators

df2 <- df |>
  filter(year %in% c(2008, 2012, 2016)) |>  # filter years
  group_by(year) |>                         # group by year
  summarise(                                # summarizes the groups that we grouped
    count = n(),                            # counts rows in group of years
    mean_age_byyear = mean(age, na.rm = TRUE)
   
  )
           
df2

## # A tibble: 3 × 3
##    year count mean_age_byyear
##   <dbl> <int>           <dbl>
## 1  2008 13602            25.7
## 2  2012 12920            26.0
## 3  2016 13688            26.2

# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age
#      for each year. Call this oly_year.
#      (Bonus: find the minimum average age across years.)

oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarise(
    count = n(),
    mean_age_eachyear = mean(age, na.rm = TRUE)
  )

oly_year

## # A tibble: 29 × 3
##     year count mean_age_eachyear
##    <dbl> <int>             <dbl>
##  1  1896    73              24.3
##  2  1900    33              22.2
##  3  1904   317              25.1
##  4  1906    70              24.7
##  5  1908   240              23.2
##  6  1912   310              24.2
##  7  1920   206              26.7
##  8  1924   499              27.6
##  9  1928   561              25.6
## 10  1932   140              23.9
## # ℹ 19 more rows

(Bonus: find the minimum average age across years.)

min(oly_year$mean_age_eachyear)

## [1] 19.86606

oly_year[which.min(oly_year$mean_age_eachyear), ]

## # A tibble: 1 × 3
##    year count mean_age_eachyear
##   <dbl> <int>             <dbl>
## 1  1988  1217              19.9

# The year with the youngest / lowest average age was 1988. 
# The average age was 19.86606 years old
.86606*12

## [1] 10.39272

# about 10.4 months
# so 19 years and about 10.4 months

Write the question, then the code that answers it. Below the chunk, briefly explain why you chose this question.

Your question and reflection:

A. what was the oldest age for an Olympic competitor in the year in which the average age was the lowest?

year_youngest_avg <- oly_year[which.min(oly_year$mean_age_eachyear), ]

olympic_gymnasts |>
  filter(year == year_youngest_avg$year) |>
  summarise(oldest_age = max(age, na.rm = TRUE))

## # A tibble: 1 × 1
##   oldest_age
##        <dbl>
## 1         28

B. Can we find the oldest aged olympian for each year, make a column for that age and attach it to the summarized table of year and average age?

oly_year_oldest_age <- olympic_gymnasts |>
  group_by(year) |>
  summarise(
    count = n(),
    mean_age_eachyear = mean(age, na.rm = TRUE),
    oldest_age = max(age, na.rm = TRUE)
  )

oly_year_oldest_age

## # A tibble: 29 × 4
##     year count mean_age_eachyear oldest_age
##    <dbl> <int>             <dbl>      <dbl>
##  1  1896    73              24.3         31
##  2  1900    33              22.2         31
##  3  1904   317              25.1         37
##  4  1906    70              24.7         35
##  5  1908   240              23.2         49
##  6  1912   310              24.2         38
##  7  1920   206              26.7         45
##  8  1924   499              27.6         38
##  9  1928   561              25.6         39
## 10  1932   140              23.9         34
## # ℹ 19 more rows

C. what are the years for the five youngest average ages? What is the oldest olympian for each year as well?

oly_year_oldest_age |>
  slice_min(mean_age_eachyear, n=5)

## # A tibble: 5 × 4
##    year count mean_age_eachyear oldest_age
##   <dbl> <int>             <dbl>      <dbl>
## 1  1988  1217              19.9         28
## 2  1992  1255              20.0         31
## 3  1980   866              20.1         30
## 4  1996  1397              20.3         29
## 5  1984   930              20.4         30

D. what are the years for the five oldest average ages? What is the oldest and youngest olympian for each year as well?

oly_year_oldest_mean_age <- olympic_gymnasts |>
  group_by(year) |>
  summarise(
    count = n(),
    mean_age_eachyear = mean(age, na.rm = TRUE),
    oldest_age = max(age, na.rm = TRUE),
    youngest_age = min(age, na.rm = TRUE)
  )

oly_year_oldest_mean_age

## # A tibble: 29 × 5
##     year count mean_age_eachyear oldest_age youngest_age
##    <dbl> <int>             <dbl>      <dbl>        <dbl>
##  1  1896    73              24.3         31           10
##  2  1900    33              22.2         31           17
##  3  1904   317              25.1         37           18
##  4  1906    70              24.7         35           14
##  5  1908   240              23.2         49           16
##  6  1912   310              24.2         38           18
##  7  1920   206              26.7         45           17
##  8  1924   499              27.6         38           19
##  9  1928   561              25.6         39           11
## 10  1932   140              23.9         34           15
## # ℹ 19 more rows

oly_year_oldest_mean_age |>
  slice_max(mean_age_eachyear, n=5)

## # A tibble: 5 × 5
##    year count mean_age_eachyear oldest_age youngest_age
##   <dbl> <int>             <dbl>      <dbl>        <dbl>
## 1  1948   874              27.8         42           12
## 2  1924   499              27.6         38           19
## 3  1920   206              26.7         45           17
## 4  1936   947              25.8         37           14
## 5  1928   561              25.6         39           11

Reflection

Reflection: In two of the five oldest mean age years, 1948 and 1928, the youngest aged olympian was 12 and 11. I would not expect competitors so young to be at the same Olympic Games in the older average aged years.