This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.
Download StudentSurvey.csv from the Datasets folder on
Blackboard. Save it next to this Rmd and set your working directory.
# Load the file
survey <- read.csv("StudentSurvey.csv")
# Q1. Check the head of the dataset
head(survey)
## Year Sex Smoke Award HigherSAT Exercise TV Height Weight Siblings
## 1 Senior M No Olympic Math 10 1 71 180 4
## 2 Sophomore F Yes Academy Math 4 7 66 120 2
## 3 FirstYear M No Nobel Math 14 5 72 208 2
## 4 Junior M No Nobel Math 3 1 63 110 1
## 5 Sophomore F No Nobel Verbal 3 3 65 150 1
## 6 Sophomore F No Nobel Verbal 5 4 65 114 2
## BirthOrder VerbalSAT MathSAT SAT GPA Pulse Piercings
## 1 4 540 670 1210 3.13 54 0
## 2 2 520 630 1150 2.50 66 3
## 3 1 550 560 1110 2.55 130 0
## 4 1 490 630 1120 3.10 78 0
## 5 1 720 450 1170 2.70 40 6
## 6 2 600 550 1150 3.20 80 4
# Q2. Check the dimensions
dim(survey)
## [1] 362 17
# Q3. Create a table of students' sex and HigherSAT
table("students'sex", "HigherSAT")
##
## HigherSAT
## students'sex 1
# Q4. Display summary statistics for VerbalSAT
summary("VerbalSAT")
## Length Class Mode
## 1 character character
mean("VerbalSAT" )
## Warning in mean.default("VerbalSAT"): argument is not numeric or logical:
## returning NA
## [1] NA
# Q5. Find the average GPA of students
students<- read.csv("StudentSurvey.csv")
mean(students$GPA, na.rm = TRUE)
## [1] 3.157942
# Q6. Create a new dataframe called column_df that contains students' weight
# and number of hours they exercise.
column_df<- survey|>
select(Weight,Exercise)
column_df
## Weight Exercise
## 1 180 10.0
## 2 120 4.0
## 3 208 14.0
## 4 110 3.0
## 5 150 3.0
## 6 114 5.0
## 7 128 10.0
## 8 235 13.0
## 9 NA 3.0
## 10 115 12.0
## 11 140 12.0
## 12 200 10.0
## 13 162 12.0
## 14 135 6.0
## 15 193 9.0
## 16 110 10.0
## 17 99 3.0
## 18 165 7.0
## 19 120 2.0
## 20 154 14.0
## 21 110 10.0
## 22 145 14.0
## 23 195 20.0
## 24 200 7.0
## 25 167 12.0
## 26 175 10.0
## 27 155 6.0
## 28 185 14.0
## 29 190 12.0
## 30 165 10.0
## 31 175 8.0
## 32 126 0.0
## 33 187 10.0
## 34 170 6.0
## 35 158 5.0
## 36 119 24.0
## 37 205 2.0
## 38 129 10.0
## 39 145 6.0
## 40 130 5.0
## 41 215 5.0
## 42 135 12.0
## 43 145 2.0
## 44 98 7.0
## 45 150 15.0
## 46 159 5.0
## 47 174 7.0
## 48 160 15.0
## 49 165 8.0
## 50 161 14.0
## 51 160 14.0
## 52 130 4.0
## 53 175 15.0
## 54 255 4.0
## 55 160 15.0
## 56 160 3.0
## 57 95 3.0
## 58 115 15.0
## 59 120 20.0
## 60 135 3.0
## 61 180 6.0
## 62 155 12.0
## 63 110 4.0
## 64 215 20.0
## 65 175 15.0
## 66 140 10.0
## 67 195 10.0
## 68 185 4.0
## 69 185 9.0
## 70 209 12.0
## 71 145 2.0
## 72 140 15.0
## 73 146 10.0
## 74 130 7.0
## 75 140 3.0
## 76 130 4.0
## 77 140 15.0
## 78 160 8.0
## 79 120 5.0
## 80 150 10.0
## 81 155 15.0
## 82 128 4.0
## 83 143 5.0
## 84 155 6.0
## 85 119 18.0
## 86 138 16.0
## 87 240 4.0
## 88 160 3.0
## 89 191 20.0
## 90 165 5.0
## 91 200 10.0
## 92 125 2.0
## 93 140 4.0
## 94 206 14.0
## 95 275 7.0
## 96 142 12.0
## 97 140 14.0
## 98 145 3.0
## 99 128 5.0
## 100 165 15.0
## 101 140 5.0
## 102 130 4.0
## 103 170 8.0
## 104 160 14.0
## 105 165 5.0
## 106 145 12.0
## 107 155 15.0
## 108 155 3.0
## 109 113 12.0
## 110 155 12.0
## 111 173 6.0
## 112 195 2.0
## 113 120 3.0
## 114 225 24.0
## 115 160 3.0
## 116 120 6.0
## 117 138 12.0
## 118 260 18.0
## 119 150 2.0
## 120 135 12.0
## 121 165 5.0
## 122 142 3.0
## 123 210 3.0
## 124 171 21.0
## 125 150 12.0
## 126 188 3.0
## 127 195 20.0
## 128 230 8.0
## 129 140 2.0
## 130 200 15.0
## 131 140 6.0
## 132 180 2.0
## 133 160 1.0
## 134 135 13.0
## 135 140 12.0
## 136 155 10.0
## 137 235 10.0
## 138 140 10.0
## 139 130 3.0
## 140 125 2.0
## 141 222 10.0
## 142 128 4.0
## 143 183 8.0
## 144 175 18.0
## 145 125 5.0
## 146 156 4.0
## 147 145 20.0
## 148 195 14.0
## 149 185 8.0
## 150 150 4.0
## 151 140 5.0
## 152 150 14.0
## 153 150 4.0
## 154 220 4.0
## 155 195 15.0
## 156 140 12.0
## 157 135 10.0
## 158 138 6.0
## 159 170 12.0
## 160 145 10.0
## 161 135 14.0
## 162 140 5.0
## 163 155 18.0
## 164 155 10.0
## 165 155 8.0
## 166 135 8.0
## 167 165 3.0
## 168 160 5.0
## 169 175 4.0
## 170 183 8.0
## 171 140 3.0
## 172 140 12.0
## 173 110 5.0
## 174 125 5.0
## 175 125 15.0
## 176 180 10.0
## 177 195 10.0
## 178 128 10.0
## 179 172 14.0
## 180 150 6.0
## 181 188 4.0
## 182 110 14.0
## 183 125 5.0
## 184 135 16.0
## 185 130 14.0
## 186 230 10.0
## 187 198 11.0
## 188 127 4.0
## 189 150 12.0
## 190 175 5.0
## 191 160 8.0
## 192 140 8.0
## 193 200 3.0
## 194 130 8.0
## 195 210 20.0
## 196 210 10.0
## 197 145 3.0
## 198 156 4.0
## 199 207 8.0
## 200 150 15.0
## 201 137 14.0
## 202 153 14.0
## 203 175 12.0
## 204 136 6.0
## 205 104 3.0
## 206 122 12.0
## 207 192 5.0
## 208 218 8.0
## 209 170 1.0
## 210 220 10.0
## 211 150 7.0
## 212 135 5.0
## 213 155 10.0
## 214 135 12.0
## 215 140 15.0
## 216 120 10.0
## 217 135 7.0
## 218 135 NA
## 219 165 15.0
## 220 180 10.0
## 221 165 15.0
## 222 175 12.0
## 223 200 10.0
## 224 138 12.0
## 225 265 5.0
## 226 105 14.0
## 227 130 3.0
## 228 165 12.0
## 229 145 15.0
## 230 138 3.0
## 231 193 25.0
## 232 170 5.0
## 233 155 6.0
## 234 155 27.0
## 235 140 8.0
## 236 130 3.0
## 237 180 17.0
## 238 140 40.0
## 239 210 5.0
## 240 150 0.0
## 241 115 27.0
## 242 225 3.0
## 243 170 3.0
## 244 120 11.0
## 245 170 2.0
## 246 130 9.0
## 247 182 12.0
## 248 138 15.0
## 249 180 2.0
## 250 230 20.0
## 251 222 12.0
## 252 145 5.0
## 253 150 10.0
## 254 190 2.0
## 255 195 12.0
## 256 115 5.0
## 257 140 3.0
## 258 155 12.0
## 259 150 8.0
## 260 155 8.0
## 261 130 5.0
## 262 210 6.0
## 263 105 8.0
## 264 192 10.0
## 265 220 18.0
## 266 134 12.0
## 267 160 10.0
## 268 135 10.0
## 269 140 5.0
## 270 135 7.0
## 271 145 3.0
## 272 140 7.0
## 273 NA 3.0
## 274 138 4.0
## 275 NA 8.0
## 276 130 7.0
## 277 210 7.0
## 278 233 3.0
## 279 180 7.0
## 280 185 14.0
## 281 189 5.0
## 282 127 14.0
## 283 135 5.0
## 284 140 5.0
## 285 162 2.0
## 286 215 1.0
## 287 200 15.0
## 288 120 5.0
## 289 165 5.0
## 290 175 15.0
## 291 160 5.0
## 292 145 20.0
## 293 180 3.0
## 294 170 7.0
## 295 235 20.0
## 296 155 10.0
## 297 140 6.0
## 298 145 6.0
## 299 175 10.0
## 300 210 9.0
## 301 130 1.5
## 302 195 20.0
## 303 165 8.0
## 304 180 21.0
## 305 145 10.0
## 306 163 8.0
## 307 160 10.0
## 308 150 18.0
## 309 170 15.0
## 310 110 8.0
## 311 170 6.0
## 312 145 12.0
## 313 160 10.0
## 314 130 7.0
## 315 155 5.0
## 316 NA 6.0
## 317 160 5.0
## 318 151 5.0
## 319 180 30.0
## 320 130 5.0
## 321 185 12.0
## 322 198 0.0
## 323 175 12.0
## 324 123 0.0
## 325 145 15.0
## 326 190 18.0
## 327 130 12.0
## 328 185 2.0
## 329 165 12.0
## 330 165 3.0
## 331 150 3.0
## 332 160 4.0
## 333 142 12.0
## 334 165 25.0
## 335 175 3.0
## 336 175 6.0
## 337 190 7.0
## 338 180 7.0
## 339 185 6.0
## 340 NA 6.0
## 341 135 13.0
## 342 195 25.0
## 343 175 8.0
## 344 165 5.0
## 345 135 11.0
## 346 140 18.0
## 347 182 10.0
## 348 155 6.0
## 349 180 2.0
## 350 170 5.0
## 351 135 5.0
## 352 165 6.0
## 353 137 10.0
## 354 147 4.0
## 355 150 5.0
## 356 155 17.0
## 357 160 7.0
## 358 130 2.0
## 359 180 8.0
## 360 150 1.0
## 361 205 14.0
## 362 115 12.0
# Q7. Access the fourth element in the first column of the StudentSurvey dataset.
survey[4,1]
## [1] "Junior"
Don’t change this chunk — it loads and filters the dataset.
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics |>
filter(!is.na(age)) |>
filter(sport == "Gymnastics") |>
mutate(
medalist = case_when(
is.na(medal) ~ FALSE,
!is.na(medal) ~ TRUE
)
)
More info on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist.
# Call it df.
df<- olympic_gymnasts|>
select(name, sex, age,team,year)
df
## # A tibble: 25,528 × 5
## name sex age team year
## <chr> <chr> <dbl> <chr> <dbl>
## 1 Paavo Johannes Aaltonen M 28 Finland 1948
## 2 Paavo Johannes Aaltonen M 28 Finland 1948
## 3 Paavo Johannes Aaltonen M 28 Finland 1948
## 4 Paavo Johannes Aaltonen M 28 Finland 1948
## 5 Paavo Johannes Aaltonen M 28 Finland 1948
## 6 Paavo Johannes Aaltonen M 28 Finland 1948
## 7 Paavo Johannes Aaltonen M 28 Finland 1948
## 8 Paavo Johannes Aaltonen M 28 Finland 1948
## 9 Paavo Johannes Aaltonen M 32 Finland 1952
## 10 Paavo Johannes Aaltonen M 32 Finland 1952
## # ℹ 25,518 more rows
# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
filter(year %in% c(2008, 2012, 2016))
df2
## # A tibble: 2,703 × 5
## name sex age team year
## <chr> <chr> <dbl> <chr> <dbl>
## 1 Nstor Abad Sanjun M 23 Spain 2016
## 2 Nstor Abad Sanjun M 23 Spain 2016
## 3 Nstor Abad Sanjun M 23 Spain 2016
## 4 Nstor Abad Sanjun M 23 Spain 2016
## 5 Nstor Abad Sanjun M 23 Spain 2016
## 6 Nstor Abad Sanjun M 23 Spain 2016
## 7 Katja Abel F 25 Germany 2008
## 8 Katja Abel F 25 Germany 2008
## 9 Katja Abel F 25 Germany 2008
## 10 Katja Abel F 25 Germany 2008
## # ℹ 2,693 more rows
df2# Q10. Group by those three years and summarize the mean age in each group.
## # A tibble: 2,703 × 5
## name sex age team year
## <chr> <chr> <dbl> <chr> <dbl>
## 1 Nstor Abad Sanjun M 23 Spain 2016
## 2 Nstor Abad Sanjun M 23 Spain 2016
## 3 Nstor Abad Sanjun M 23 Spain 2016
## 4 Nstor Abad Sanjun M 23 Spain 2016
## 5 Nstor Abad Sanjun M 23 Spain 2016
## 6 Nstor Abad Sanjun M 23 Spain 2016
## 7 Katja Abel F 25 Germany 2008
## 8 Katja Abel F 25 Germany 2008
## 9 Katja Abel F 25 Germany 2008
## 10 Katja Abel F 25 Germany 2008
## # ℹ 2,693 more rows
df2 |>
group_by(year) |>
summarize(mean_age = mean(age,))
## # A tibble: 3 × 2
## year mean_age
## <dbl> <dbl>
## 1 2008 21.6
## 2 2012 21.9
## 3 2016 22.2
# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age
# for each year. Call this oly_year.
# (Bonus: find the minimum average age across years.)
oly_year <- olympic_gymnasts |>
group_by(year) |>
summarize(mean_age = mean(age,))
oly_year
## # A tibble: 29 × 2
## year mean_age
## <dbl> <dbl>
## 1 1896 24.3
## 2 1900 22.2
## 3 1904 25.1
## 4 1906 24.7
## 5 1908 23.2
## 6 1912 24.2
## 7 1920 26.7
## 8 1924 27.6
## 9 1928 25.6
## 10 1932 23.9
## # ℹ 19 more rows
min(oly_year$mean_age)
## [1] 19.86606
# Q12. Open-ended: come up with a question that requires at least TWO dplyr verbs.
# Write the question, then the code that answers it. Below the chunk, briefly
# explain why you chose this question.
olympic_gymnasts |>
group_by(sex) |>
summarize(num_gymnasts = n ()) |>
arrange(desc(num_gymnasts))
## # A tibble: 2 × 2
## sex num_gymnasts
## <chr> <int>
## 1 M 16484
## 2 F 9044
Your question and reflection: # My question is pertaining to how many gymnasts of each sex there are, and which sex of gymnasts there is more of in this data set. I chose this question because I found it to incorporate fundamental dplyr verbs and as well gives a straightforward but digsetable outcome for beginning stage students like myself to learn how to dissect from data sets but as well comprehend what they are seeking an answer in regard to.