Homework 2 — Dataframes + dplyr

This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.

Part 1 — Student Survey (dataframe basics)

Download StudentSurvey.csv from the Datasets folder on Blackboard. Save it next to this Rmd and set your working directory.

# Load the file
survey <- read.csv("StudentSurvey.csv")

# Q1. Check the head of the dataset
head(survey)

##        Year Sex Smoke   Award HigherSAT Exercise TV Height Weight Siblings
## 1    Senior   M    No Olympic      Math       10  1     71    180        4
## 2 Sophomore   F   Yes Academy      Math        4  7     66    120        2
## 3 FirstYear   M    No   Nobel      Math       14  5     72    208        2
## 4    Junior   M    No   Nobel      Math        3  1     63    110        1
## 5 Sophomore   F    No   Nobel    Verbal        3  3     65    150        1
## 6 Sophomore   F    No   Nobel    Verbal        5  4     65    114        2
##   BirthOrder VerbalSAT MathSAT  SAT  GPA Pulse Piercings
## 1          4       540     670 1210 3.13    54         0
## 2          2       520     630 1150 2.50    66         3
## 3          1       550     560 1110 2.55   130         0
## 4          1       490     630 1120 3.10    78         0
## 5          1       720     450 1170 2.70    40         6
## 6          2       600     550 1150 3.20    80         4

# Q2. Check the dimensions
dim(survey)

## [1] 362  17

# Q3. Create a table of students' sex and HigherSAT

table("students'sex", "HigherSAT")

##               
##                HigherSAT
##   students'sex         1

# Q4. Display summary statistics for VerbalSAT
summary("VerbalSAT")

##    Length     Class      Mode 
##         1 character character

mean("VerbalSAT" )

## Warning in mean.default("VerbalSAT"): argument is not numeric or logical:
## returning NA

## [1] NA

# Q5. Find the average GPA of students
 students<- read.csv("StudentSurvey.csv")
 mean(students$GPA, na.rm = TRUE)

## [1] 3.157942

# Q6. Create a new dataframe called column_df that contains students' weight
#     and number of hours they exercise.
column_df<- survey|>
  select(Weight,Exercise)
column_df

##     Weight Exercise
## 1      180     10.0
## 2      120      4.0
## 3      208     14.0
## 4      110      3.0
## 5      150      3.0
## 6      114      5.0
## 7      128     10.0
## 8      235     13.0
## 9       NA      3.0
## 10     115     12.0
## 11     140     12.0
## 12     200     10.0
## 13     162     12.0
## 14     135      6.0
## 15     193      9.0
## 16     110     10.0
## 17      99      3.0
## 18     165      7.0
## 19     120      2.0
## 20     154     14.0
## 21     110     10.0
## 22     145     14.0
## 23     195     20.0
## 24     200      7.0
## 25     167     12.0
## 26     175     10.0
## 27     155      6.0
## 28     185     14.0
## 29     190     12.0
## 30     165     10.0
## 31     175      8.0
## 32     126      0.0
## 33     187     10.0
## 34     170      6.0
## 35     158      5.0
## 36     119     24.0
## 37     205      2.0
## 38     129     10.0
## 39     145      6.0
## 40     130      5.0
## 41     215      5.0
## 42     135     12.0
## 43     145      2.0
## 44      98      7.0
## 45     150     15.0
## 46     159      5.0
## 47     174      7.0
## 48     160     15.0
## 49     165      8.0
## 50     161     14.0
## 51     160     14.0
## 52     130      4.0
## 53     175     15.0
## 54     255      4.0
## 55     160     15.0
## 56     160      3.0
## 57      95      3.0
## 58     115     15.0
## 59     120     20.0
## 60     135      3.0
## 61     180      6.0
## 62     155     12.0
## 63     110      4.0
## 64     215     20.0
## 65     175     15.0
## 66     140     10.0
## 67     195     10.0
## 68     185      4.0
## 69     185      9.0
## 70     209     12.0
## 71     145      2.0
## 72     140     15.0
## 73     146     10.0
## 74     130      7.0
## 75     140      3.0
## 76     130      4.0
## 77     140     15.0
## 78     160      8.0
## 79     120      5.0
## 80     150     10.0
## 81     155     15.0
## 82     128      4.0
## 83     143      5.0
## 84     155      6.0
## 85     119     18.0
## 86     138     16.0
## 87     240      4.0
## 88     160      3.0
## 89     191     20.0
## 90     165      5.0
## 91     200     10.0
## 92     125      2.0
## 93     140      4.0
## 94     206     14.0
## 95     275      7.0
## 96     142     12.0
## 97     140     14.0
## 98     145      3.0
## 99     128      5.0
## 100    165     15.0
## 101    140      5.0
## 102    130      4.0
## 103    170      8.0
## 104    160     14.0
## 105    165      5.0
## 106    145     12.0
## 107    155     15.0
## 108    155      3.0
## 109    113     12.0
## 110    155     12.0
## 111    173      6.0
## 112    195      2.0
## 113    120      3.0
## 114    225     24.0
## 115    160      3.0
## 116    120      6.0
## 117    138     12.0
## 118    260     18.0
## 119    150      2.0
## 120    135     12.0
## 121    165      5.0
## 122    142      3.0
## 123    210      3.0
## 124    171     21.0
## 125    150     12.0
## 126    188      3.0
## 127    195     20.0
## 128    230      8.0
## 129    140      2.0
## 130    200     15.0
## 131    140      6.0
## 132    180      2.0
## 133    160      1.0
## 134    135     13.0
## 135    140     12.0
## 136    155     10.0
## 137    235     10.0
## 138    140     10.0
## 139    130      3.0
## 140    125      2.0
## 141    222     10.0
## 142    128      4.0
## 143    183      8.0
## 144    175     18.0
## 145    125      5.0
## 146    156      4.0
## 147    145     20.0
## 148    195     14.0
## 149    185      8.0
## 150    150      4.0
## 151    140      5.0
## 152    150     14.0
## 153    150      4.0
## 154    220      4.0
## 155    195     15.0
## 156    140     12.0
## 157    135     10.0
## 158    138      6.0
## 159    170     12.0
## 160    145     10.0
## 161    135     14.0
## 162    140      5.0
## 163    155     18.0
## 164    155     10.0
## 165    155      8.0
## 166    135      8.0
## 167    165      3.0
## 168    160      5.0
## 169    175      4.0
## 170    183      8.0
## 171    140      3.0
## 172    140     12.0
## 173    110      5.0
## 174    125      5.0
## 175    125     15.0
## 176    180     10.0
## 177    195     10.0
## 178    128     10.0
## 179    172     14.0
## 180    150      6.0
## 181    188      4.0
## 182    110     14.0
## 183    125      5.0
## 184    135     16.0
## 185    130     14.0
## 186    230     10.0
## 187    198     11.0
## 188    127      4.0
## 189    150     12.0
## 190    175      5.0
## 191    160      8.0
## 192    140      8.0
## 193    200      3.0
## 194    130      8.0
## 195    210     20.0
## 196    210     10.0
## 197    145      3.0
## 198    156      4.0
## 199    207      8.0
## 200    150     15.0
## 201    137     14.0
## 202    153     14.0
## 203    175     12.0
## 204    136      6.0
## 205    104      3.0
## 206    122     12.0
## 207    192      5.0
## 208    218      8.0
## 209    170      1.0
## 210    220     10.0
## 211    150      7.0
## 212    135      5.0
## 213    155     10.0
## 214    135     12.0
## 215    140     15.0
## 216    120     10.0
## 217    135      7.0
## 218    135       NA
## 219    165     15.0
## 220    180     10.0
## 221    165     15.0
## 222    175     12.0
## 223    200     10.0
## 224    138     12.0
## 225    265      5.0
## 226    105     14.0
## 227    130      3.0
## 228    165     12.0
## 229    145     15.0
## 230    138      3.0
## 231    193     25.0
## 232    170      5.0
## 233    155      6.0
## 234    155     27.0
## 235    140      8.0
## 236    130      3.0
## 237    180     17.0
## 238    140     40.0
## 239    210      5.0
## 240    150      0.0
## 241    115     27.0
## 242    225      3.0
## 243    170      3.0
## 244    120     11.0
## 245    170      2.0
## 246    130      9.0
## 247    182     12.0
## 248    138     15.0
## 249    180      2.0
## 250    230     20.0
## 251    222     12.0
## 252    145      5.0
## 253    150     10.0
## 254    190      2.0
## 255    195     12.0
## 256    115      5.0
## 257    140      3.0
## 258    155     12.0
## 259    150      8.0
## 260    155      8.0
## 261    130      5.0
## 262    210      6.0
## 263    105      8.0
## 264    192     10.0
## 265    220     18.0
## 266    134     12.0
## 267    160     10.0
## 268    135     10.0
## 269    140      5.0
## 270    135      7.0
## 271    145      3.0
## 272    140      7.0
## 273     NA      3.0
## 274    138      4.0
## 275     NA      8.0
## 276    130      7.0
## 277    210      7.0
## 278    233      3.0
## 279    180      7.0
## 280    185     14.0
## 281    189      5.0
## 282    127     14.0
## 283    135      5.0
## 284    140      5.0
## 285    162      2.0
## 286    215      1.0
## 287    200     15.0
## 288    120      5.0
## 289    165      5.0
## 290    175     15.0
## 291    160      5.0
## 292    145     20.0
## 293    180      3.0
## 294    170      7.0
## 295    235     20.0
## 296    155     10.0
## 297    140      6.0
## 298    145      6.0
## 299    175     10.0
## 300    210      9.0
## 301    130      1.5
## 302    195     20.0
## 303    165      8.0
## 304    180     21.0
## 305    145     10.0
## 306    163      8.0
## 307    160     10.0
## 308    150     18.0
## 309    170     15.0
## 310    110      8.0
## 311    170      6.0
## 312    145     12.0
## 313    160     10.0
## 314    130      7.0
## 315    155      5.0
## 316     NA      6.0
## 317    160      5.0
## 318    151      5.0
## 319    180     30.0
## 320    130      5.0
## 321    185     12.0
## 322    198      0.0
## 323    175     12.0
## 324    123      0.0
## 325    145     15.0
## 326    190     18.0
## 327    130     12.0
## 328    185      2.0
## 329    165     12.0
## 330    165      3.0
## 331    150      3.0
## 332    160      4.0
## 333    142     12.0
## 334    165     25.0
## 335    175      3.0
## 336    175      6.0
## 337    190      7.0
## 338    180      7.0
## 339    185      6.0
## 340     NA      6.0
## 341    135     13.0
## 342    195     25.0
## 343    175      8.0
## 344    165      5.0
## 345    135     11.0
## 346    140     18.0
## 347    182     10.0
## 348    155      6.0
## 349    180      2.0
## 350    170      5.0
## 351    135      5.0
## 352    165      6.0
## 353    137     10.0
## 354    147      4.0
## 355    150      5.0
## 356    155     17.0
## 357    160      7.0
## 358    130      2.0
## 359    180      8.0
## 360    150      1.0
## 361    205     14.0
## 362    115     12.0

# Q7. Access the fourth element in the first column of the StudentSurvey dataset.
survey[4,1]

## [1] "Junior"

Part 2 — Olympic Gymnasts (dplyr)

Don’t change this chunk — it loads and filters the dataset.

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics |>
  filter(!is.na(age)) |>
  filter(sport == "Gymnastics") |>
  mutate(
    medalist = case_when(
      is.na(medal) ~ FALSE,
      !is.na(medal) ~ TRUE
    )
  )

More info on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist.
#     Call it df.
df<- olympic_gymnasts|>
  select(name, sex, age,team,year)
df

## # A tibble: 25,528 × 5
##    name                    sex     age team     year
##    <chr>                   <chr> <dbl> <chr>   <dbl>
##  1 Paavo Johannes Aaltonen M        28 Finland  1948
##  2 Paavo Johannes Aaltonen M        28 Finland  1948
##  3 Paavo Johannes Aaltonen M        28 Finland  1948
##  4 Paavo Johannes Aaltonen M        28 Finland  1948
##  5 Paavo Johannes Aaltonen M        28 Finland  1948
##  6 Paavo Johannes Aaltonen M        28 Finland  1948
##  7 Paavo Johannes Aaltonen M        28 Finland  1948
##  8 Paavo Johannes Aaltonen M        28 Finland  1948
##  9 Paavo Johannes Aaltonen M        32 Finland  1952
## 10 Paavo Johannes Aaltonen M        32 Finland  1952
## # ℹ 25,518 more rows

# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
df2

## # A tibble: 2,703 × 5
##    name              sex     age team     year
##    <chr>             <chr> <dbl> <chr>   <dbl>
##  1 Nstor Abad Sanjun M        23 Spain    2016
##  2 Nstor Abad Sanjun M        23 Spain    2016
##  3 Nstor Abad Sanjun M        23 Spain    2016
##  4 Nstor Abad Sanjun M        23 Spain    2016
##  5 Nstor Abad Sanjun M        23 Spain    2016
##  6 Nstor Abad Sanjun M        23 Spain    2016
##  7 Katja Abel        F        25 Germany  2008
##  8 Katja Abel        F        25 Germany  2008
##  9 Katja Abel        F        25 Germany  2008
## 10 Katja Abel        F        25 Germany  2008
## # ℹ 2,693 more rows

df2# Q10. Group by those three years and summarize the mean age in each group.

## # A tibble: 2,703 × 5
##    name              sex     age team     year
##    <chr>             <chr> <dbl> <chr>   <dbl>
##  1 Nstor Abad Sanjun M        23 Spain    2016
##  2 Nstor Abad Sanjun M        23 Spain    2016
##  3 Nstor Abad Sanjun M        23 Spain    2016
##  4 Nstor Abad Sanjun M        23 Spain    2016
##  5 Nstor Abad Sanjun M        23 Spain    2016
##  6 Nstor Abad Sanjun M        23 Spain    2016
##  7 Katja Abel        F        25 Germany  2008
##  8 Katja Abel        F        25 Germany  2008
##  9 Katja Abel        F        25 Germany  2008
## 10 Katja Abel        F        25 Germany  2008
## # ℹ 2,693 more rows

df2 |>
  group_by(year) |>
  summarize(mean_age = mean(age,))

## # A tibble: 3 × 2
##    year mean_age
##   <dbl>    <dbl>
## 1  2008     21.6
## 2  2012     21.9
## 3  2016     22.2

# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age
#      for each year. Call this oly_year.
#      (Bonus: find the minimum average age across years.)
oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(mean_age = mean(age,))
oly_year

## # A tibble: 29 × 2
##     year mean_age
##    <dbl>    <dbl>
##  1  1896     24.3
##  2  1900     22.2
##  3  1904     25.1
##  4  1906     24.7
##  5  1908     23.2
##  6  1912     24.2
##  7  1920     26.7
##  8  1924     27.6
##  9  1928     25.6
## 10  1932     23.9
## # ℹ 19 more rows

min(oly_year$mean_age)

## [1] 19.86606

# Q12. Open-ended: come up with a question that requires at least TWO dplyr verbs.
#      Write the question, then the code that answers it. Below the chunk, briefly
#      explain why you chose this question.

olympic_gymnasts |>
  group_by(sex) |>
  summarize(num_gymnasts = n ()) |>
  arrange(desc(num_gymnasts))

## # A tibble: 2 × 2
##   sex   num_gymnasts
##   <chr>        <int>
## 1 M            16484
## 2 F             9044

Your question and reflection: # My question is pertaining to how many gymnasts of each sex there are, and which sex of gymnasts there is more of in this data set. I chose this question because I found it to incorporate fundamental dplyr verbs and as well gives a straightforward but digsetable outcome for beginning stage students like myself to learn how to dissect from data sets but as well comprehend what they are seeking an answer in regard to.

Homework 2 — Dataframes + dplyr

Thyda Khan

Part 1 — Student Survey (dataframe basics)

Part 2 — Olympic Gymnasts (dplyr)