Homework 2 — Dataframes + dplyr

This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.

Part 1 — Student Survey (dataframe basics)

Download StudentSurvey.csv from the Datasets folder on Blackboard. Save it next to this Rmd and set your working directory.

# Load the file
survey <- read.csv("StudentSurvey.csv")

# Q1. Check the head of the dataset
head(survey)

##        Year Sex Smoke   Award HigherSAT Exercise TV Height Weight Siblings
## 1    Senior   M    No Olympic      Math       10  1     71    180        4
## 2 Sophomore   F   Yes Academy      Math        4  7     66    120        2
## 3 FirstYear   M    No   Nobel      Math       14  5     72    208        2
## 4    Junior   M    No   Nobel      Math        3  1     63    110        1
## 5 Sophomore   F    No   Nobel    Verbal        3  3     65    150        1
## 6 Sophomore   F    No   Nobel    Verbal        5  4     65    114        2
##   BirthOrder VerbalSAT MathSAT  SAT  GPA Pulse Piercings
## 1          4       540     670 1210 3.13    54         0
## 2          2       520     630 1150 2.50    66         3
## 3          1       550     560 1110 2.55   130         0
## 4          1       490     630 1120 3.10    78         0
## 5          1       720     450 1170 2.70    40         6
## 6          2       600     550 1150 3.20    80         4

# Q2. Check the dimensions
nrow(survey)

## [1] 362

dim(survey)

## [1] 362  17

# Q3. Create a table of students' sex and HigherSAT
str(survey)

## 'data.frame':    362 obs. of  17 variables:
##  $ Year      : chr  "Senior" "Sophomore" "FirstYear" "Junior" ...
##  $ Sex       : chr  "M" "F" "M" "M" ...
##  $ Smoke     : chr  "No" "Yes" "No" "No" ...
##  $ Award     : chr  "Olympic" "Academy" "Nobel" "Nobel" ...
##  $ HigherSAT : chr  "Math" "Math" "Math" "Math" ...
##  $ Exercise  : num  10 4 14 3 3 5 10 13 3 12 ...
##  $ TV        : int  1 7 5 1 3 4 10 8 6 1 ...
##  $ Height    : int  71 66 72 63 65 65 66 74 61 60 ...
##  $ Weight    : int  180 120 208 110 150 114 128 235 NA 115 ...
##  $ Siblings  : int  4 2 2 1 1 2 1 1 2 7 ...
##  $ BirthOrder: int  4 2 1 1 1 2 1 1 2 8 ...
##  $ VerbalSAT : int  540 520 550 490 720 600 640 660 550 670 ...
##  $ MathSAT   : int  670 630 560 630 450 550 680 710 550 700 ...
##  $ SAT       : int  1210 1150 1110 1120 1170 1150 1320 1370 1100 1370 ...
##  $ GPA       : num  3.13 2.5 2.55 3.1 2.7 3.2 2.77 3.3 2.8 3.7 ...
##  $ Pulse     : int  54 66 130 78 40 80 94 77 60 94 ...
##  $ Piercings : int  0 3 0 0 6 4 8 0 7 2 ...

summary(survey)

##         Year            Sex            Smoke           Award    
##  Length   :362   Length   :362   Length   :362   Length   :362  
##  N.unique :  5   N.unique :  2   N.unique :  2   N.unique :  3  
##  N.blank  :  2   N.blank  :  0   N.blank  :  0   N.blank  :  0  
##  Min.nchar:  0   Min.nchar:  1   Min.nchar:  2   Min.nchar:  5  
##  Max.nchar:  9   Max.nchar:  1   Max.nchar:  3   Max.nchar:  7  
##                                                                 
##                                                                 
##      HigherSAT      Exercise            TV             Height     
##  Length   :362   Min.   : 0.000   Min.   : 0.000   Min.   :59.00  
##  N.unique :  3   1st Qu.: 5.000   1st Qu.: 3.000   1st Qu.:65.00  
##  N.blank  :  7   Median : 8.000   Median : 5.000   Median :68.00  
##  Min.nchar:  0   Mean   : 9.054   Mean   : 6.504   Mean   :68.42  
##  Max.nchar:  6   3rd Qu.:12.000   3rd Qu.: 9.000   3rd Qu.:71.00  
##                  Max.   :40.000   Max.   :40.000   Max.   :83.00  
##                  NAs    :1        NAs    :1        NAs    :7      
##      Weight         Siblings       BirthOrder     VerbalSAT        MathSAT     
##  Min.   : 95.0   Min.   :0.000   Min.   :1.00   Min.   :390.0   Min.   :400.0  
##  1st Qu.:138.0   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:550.0   1st Qu.:560.0  
##  Median :155.0   Median :1.000   Median :2.00   Median :600.0   Median :610.0  
##  Mean   :159.8   Mean   :1.727   Mean   :1.83   Mean   :594.2   Mean   :609.4  
##  3rd Qu.:180.0   3rd Qu.:2.000   3rd Qu.:2.00   3rd Qu.:640.0   3rd Qu.:650.0  
##  Max.   :275.0   Max.   :8.000   Max.   :8.00   Max.   :800.0   Max.   :800.0  
##  NAs    :5                       NAs    :3                                     
##       SAT            GPA            Pulse          Piercings     
##  Min.   : 800   Min.   :2.000   Min.   : 35.00   Min.   : 0.000  
##  1st Qu.:1130   1st Qu.:2.900   1st Qu.: 62.00   1st Qu.: 0.000  
##  Median :1200   Median :3.200   Median : 70.00   Median : 0.000  
##  Mean   :1204   Mean   :3.158   Mean   : 69.57   Mean   : 1.673  
##  3rd Qu.:1270   3rd Qu.:3.400   3rd Qu.: 77.75   3rd Qu.: 3.000  
##  Max.   :1550   Max.   :4.000   Max.   :130.00   Max.   :10.000  
##                 NAs    :17                       NAs    :1

# Q4. Display summary statistics for VerbalSAT
summary(survey$VerbalSAT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   390.0   550.0   600.0   594.2   640.0   800.0

# Q5. Find the average GPA of students
mean(survey$GPA, na.rm = TRUE)

## [1] 3.157942

# Q6. Create a new dataframe called column_df that contains students' weight
#     and number of hours they exercise.
column_df <- survey |>
  select(Weight, Exercise)
print(column_df)

##     Weight Exercise
## 1      180     10.0
## 2      120      4.0
## 3      208     14.0
## 4      110      3.0
## 5      150      3.0
## 6      114      5.0
## 7      128     10.0
## 8      235     13.0
## 9       NA      3.0
## 10     115     12.0
## 11     140     12.0
## 12     200     10.0
## 13     162     12.0
## 14     135      6.0
## 15     193      9.0
## 16     110     10.0
## 17      99      3.0
## 18     165      7.0
## 19     120      2.0
## 20     154     14.0
## 21     110     10.0
## 22     145     14.0
## 23     195     20.0
## 24     200      7.0
## 25     167     12.0
## 26     175     10.0
## 27     155      6.0
## 28     185     14.0
## 29     190     12.0
## 30     165     10.0
## 31     175      8.0
## 32     126      0.0
## 33     187     10.0
## 34     170      6.0
## 35     158      5.0
## 36     119     24.0
## 37     205      2.0
## 38     129     10.0
## 39     145      6.0
## 40     130      5.0
## 41     215      5.0
## 42     135     12.0
## 43     145      2.0
## 44      98      7.0
## 45     150     15.0
## 46     159      5.0
## 47     174      7.0
## 48     160     15.0
## 49     165      8.0
## 50     161     14.0
## 51     160     14.0
## 52     130      4.0
## 53     175     15.0
## 54     255      4.0
## 55     160     15.0
## 56     160      3.0
## 57      95      3.0
## 58     115     15.0
## 59     120     20.0
## 60     135      3.0
## 61     180      6.0
## 62     155     12.0
## 63     110      4.0
## 64     215     20.0
## 65     175     15.0
## 66     140     10.0
## 67     195     10.0
## 68     185      4.0
## 69     185      9.0
## 70     209     12.0
## 71     145      2.0
## 72     140     15.0
## 73     146     10.0
## 74     130      7.0
## 75     140      3.0
## 76     130      4.0
## 77     140     15.0
## 78     160      8.0
## 79     120      5.0
## 80     150     10.0
## 81     155     15.0
## 82     128      4.0
## 83     143      5.0
## 84     155      6.0
## 85     119     18.0
## 86     138     16.0
## 87     240      4.0
## 88     160      3.0
## 89     191     20.0
## 90     165      5.0
## 91     200     10.0
## 92     125      2.0
## 93     140      4.0
## 94     206     14.0
## 95     275      7.0
## 96     142     12.0
## 97     140     14.0
## 98     145      3.0
## 99     128      5.0
## 100    165     15.0
## 101    140      5.0
## 102    130      4.0
## 103    170      8.0
## 104    160     14.0
## 105    165      5.0
## 106    145     12.0
## 107    155     15.0
## 108    155      3.0
## 109    113     12.0
## 110    155     12.0
## 111    173      6.0
## 112    195      2.0
## 113    120      3.0
## 114    225     24.0
## 115    160      3.0
## 116    120      6.0
## 117    138     12.0
## 118    260     18.0
## 119    150      2.0
## 120    135     12.0
## 121    165      5.0
## 122    142      3.0
## 123    210      3.0
## 124    171     21.0
## 125    150     12.0
## 126    188      3.0
## 127    195     20.0
## 128    230      8.0
## 129    140      2.0
## 130    200     15.0
## 131    140      6.0
## 132    180      2.0
## 133    160      1.0
## 134    135     13.0
## 135    140     12.0
## 136    155     10.0
## 137    235     10.0
## 138    140     10.0
## 139    130      3.0
## 140    125      2.0
## 141    222     10.0
## 142    128      4.0
## 143    183      8.0
## 144    175     18.0
## 145    125      5.0
## 146    156      4.0
## 147    145     20.0
## 148    195     14.0
## 149    185      8.0
## 150    150      4.0
## 151    140      5.0
## 152    150     14.0
## 153    150      4.0
## 154    220      4.0
## 155    195     15.0
## 156    140     12.0
## 157    135     10.0
## 158    138      6.0
## 159    170     12.0
## 160    145     10.0
## 161    135     14.0
## 162    140      5.0
## 163    155     18.0
## 164    155     10.0
## 165    155      8.0
## 166    135      8.0
## 167    165      3.0
## 168    160      5.0
## 169    175      4.0
## 170    183      8.0
## 171    140      3.0
## 172    140     12.0
## 173    110      5.0
## 174    125      5.0
## 175    125     15.0
## 176    180     10.0
## 177    195     10.0
## 178    128     10.0
## 179    172     14.0
## 180    150      6.0
## 181    188      4.0
## 182    110     14.0
## 183    125      5.0
## 184    135     16.0
## 185    130     14.0
## 186    230     10.0
## 187    198     11.0
## 188    127      4.0
## 189    150     12.0
## 190    175      5.0
## 191    160      8.0
## 192    140      8.0
## 193    200      3.0
## 194    130      8.0
## 195    210     20.0
## 196    210     10.0
## 197    145      3.0
## 198    156      4.0
## 199    207      8.0
## 200    150     15.0
## 201    137     14.0
## 202    153     14.0
## 203    175     12.0
## 204    136      6.0
## 205    104      3.0
## 206    122     12.0
## 207    192      5.0
## 208    218      8.0
## 209    170      1.0
## 210    220     10.0
## 211    150      7.0
## 212    135      5.0
## 213    155     10.0
## 214    135     12.0
## 215    140     15.0
## 216    120     10.0
## 217    135      7.0
## 218    135       NA
## 219    165     15.0
## 220    180     10.0
## 221    165     15.0
## 222    175     12.0
## 223    200     10.0
## 224    138     12.0
## 225    265      5.0
## 226    105     14.0
## 227    130      3.0
## 228    165     12.0
## 229    145     15.0
## 230    138      3.0
## 231    193     25.0
## 232    170      5.0
## 233    155      6.0
## 234    155     27.0
## 235    140      8.0
## 236    130      3.0
## 237    180     17.0
## 238    140     40.0
## 239    210      5.0
## 240    150      0.0
## 241    115     27.0
## 242    225      3.0
## 243    170      3.0
## 244    120     11.0
## 245    170      2.0
## 246    130      9.0
## 247    182     12.0
## 248    138     15.0
## 249    180      2.0
## 250    230     20.0
## 251    222     12.0
## 252    145      5.0
## 253    150     10.0
## 254    190      2.0
## 255    195     12.0
## 256    115      5.0
## 257    140      3.0
## 258    155     12.0
## 259    150      8.0
## 260    155      8.0
## 261    130      5.0
## 262    210      6.0
## 263    105      8.0
## 264    192     10.0
## 265    220     18.0
## 266    134     12.0
## 267    160     10.0
## 268    135     10.0
## 269    140      5.0
## 270    135      7.0
## 271    145      3.0
## 272    140      7.0
## 273     NA      3.0
## 274    138      4.0
## 275     NA      8.0
## 276    130      7.0
## 277    210      7.0
## 278    233      3.0
## 279    180      7.0
## 280    185     14.0
## 281    189      5.0
## 282    127     14.0
## 283    135      5.0
## 284    140      5.0
## 285    162      2.0
## 286    215      1.0
## 287    200     15.0
## 288    120      5.0
## 289    165      5.0
## 290    175     15.0
## 291    160      5.0
## 292    145     20.0
## 293    180      3.0
## 294    170      7.0
## 295    235     20.0
## 296    155     10.0
## 297    140      6.0
## 298    145      6.0
## 299    175     10.0
## 300    210      9.0
## 301    130      1.5
## 302    195     20.0
## 303    165      8.0
## 304    180     21.0
## 305    145     10.0
## 306    163      8.0
## 307    160     10.0
## 308    150     18.0
## 309    170     15.0
## 310    110      8.0
## 311    170      6.0
## 312    145     12.0
## 313    160     10.0
## 314    130      7.0
## 315    155      5.0
## 316     NA      6.0
## 317    160      5.0
## 318    151      5.0
## 319    180     30.0
## 320    130      5.0
## 321    185     12.0
## 322    198      0.0
## 323    175     12.0
## 324    123      0.0
## 325    145     15.0
## 326    190     18.0
## 327    130     12.0
## 328    185      2.0
## 329    165     12.0
## 330    165      3.0
## 331    150      3.0
## 332    160      4.0
## 333    142     12.0
## 334    165     25.0
## 335    175      3.0
## 336    175      6.0
## 337    190      7.0
## 338    180      7.0
## 339    185      6.0
## 340     NA      6.0
## 341    135     13.0
## 342    195     25.0
## 343    175      8.0
## 344    165      5.0
## 345    135     11.0
## 346    140     18.0
## 347    182     10.0
## 348    155      6.0
## 349    180      2.0
## 350    170      5.0
## 351    135      5.0
## 352    165      6.0
## 353    137     10.0
## 354    147      4.0
## 355    150      5.0
## 356    155     17.0
## 357    160      7.0
## 358    130      2.0
## 359    180      8.0
## 360    150      1.0
## 361    205     14.0
## 362    115     12.0

# Q7. Access the fourth element in the first column of the StudentSurvey dataset.
head(survey$Year, n= 4)

## [1] "Senior"    "Sophomore" "FirstYear" "Junior"

Part 2 — Olympic Gymnasts (dplyr)

Don’t change this chunk — it loads and filters the dataset.

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics |>
  filter(!is.na(age)) |>
  filter(sport == "Gymnastics") |>
  mutate(
    medalist = case_when(
      is.na(medal) ~ FALSE,
      !is.na(medal) ~ TRUE
    )
  )

More info on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist.
#     Call it df.
df<- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
print(df)

## # A tibble: 25,528 × 6
##    name                    sex     age team     year medalist
##    <chr>                   <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  2 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  3 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  4 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  5 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  6 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  7 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  8 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  9 Paavo Johannes Aaltonen M        32 Finland  1952 FALSE   
## 10 Paavo Johannes Aaltonen M        32 Finland  1952 TRUE    
## # ℹ 25,518 more rows

# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
print(df2)

## # A tibble: 2,703 × 6
##    name              sex     age team     year medalist
##    <chr>             <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  2 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  3 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  4 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  5 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  6 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  7 Katja Abel        F        25 Germany  2008 FALSE   
##  8 Katja Abel        F        25 Germany  2008 FALSE   
##  9 Katja Abel        F        25 Germany  2008 FALSE   
## 10 Katja Abel        F        25 Germany  2008 FALSE   
## # ℹ 2,693 more rows

# Q10. Group by those three years and summarize the mean age in each group.
age_summary <- df |>
  filter(year %in% c(2008, 2012, 2016)) |>
  group_by(year) |>
  summarize(mean_age = mean(age, na.rm = TRUE))

# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age
#      for each year. Call this oly_year.
#      (Bonus: find the minimum average age across years.)


# Q12. Open-ended: come up with a question that requires at least TWO dplyr verbs.
#      Write the question, then the code that answers it. Below the chunk, briefly
#      explain why you chose this question.

Your question and reflection:

Homework 2 — Dataframes + dplyr

Santiago Aburto

Part 1 — Student Survey (dataframe basics)

Part 2 — Olympic Gymnasts (dplyr)