This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.
Download StudentSurvey.csv from the Datasets folder on
Blackboard. Save it next to this Rmd and set your working directory.
# Load the file
survey <- read.csv("StudentSurvey.csv")
# Q1. Check the head of the dataset
head(survey)
## Year Sex Smoke Award HigherSAT Exercise TV Height Weight Siblings
## 1 Senior M No Olympic Math 10 1 71 180 4
## 2 Sophomore F Yes Academy Math 4 7 66 120 2
## 3 FirstYear M No Nobel Math 14 5 72 208 2
## 4 Junior M No Nobel Math 3 1 63 110 1
## 5 Sophomore F No Nobel Verbal 3 3 65 150 1
## 6 Sophomore F No Nobel Verbal 5 4 65 114 2
## BirthOrder VerbalSAT MathSAT SAT GPA Pulse Piercings
## 1 4 540 670 1210 3.13 54 0
## 2 2 520 630 1150 2.50 66 3
## 3 1 550 560 1110 2.55 130 0
## 4 1 490 630 1120 3.10 78 0
## 5 1 720 450 1170 2.70 40 6
## 6 2 600 550 1150 3.20 80 4
# Q2. Check the dimensions
nrow(survey)
## [1] 362
dim(survey)
## [1] 362 17
# Q3. Create a table of students' sex and HigherSAT
str(survey)
## 'data.frame': 362 obs. of 17 variables:
## $ Year : chr "Senior" "Sophomore" "FirstYear" "Junior" ...
## $ Sex : chr "M" "F" "M" "M" ...
## $ Smoke : chr "No" "Yes" "No" "No" ...
## $ Award : chr "Olympic" "Academy" "Nobel" "Nobel" ...
## $ HigherSAT : chr "Math" "Math" "Math" "Math" ...
## $ Exercise : num 10 4 14 3 3 5 10 13 3 12 ...
## $ TV : int 1 7 5 1 3 4 10 8 6 1 ...
## $ Height : int 71 66 72 63 65 65 66 74 61 60 ...
## $ Weight : int 180 120 208 110 150 114 128 235 NA 115 ...
## $ Siblings : int 4 2 2 1 1 2 1 1 2 7 ...
## $ BirthOrder: int 4 2 1 1 1 2 1 1 2 8 ...
## $ VerbalSAT : int 540 520 550 490 720 600 640 660 550 670 ...
## $ MathSAT : int 670 630 560 630 450 550 680 710 550 700 ...
## $ SAT : int 1210 1150 1110 1120 1170 1150 1320 1370 1100 1370 ...
## $ GPA : num 3.13 2.5 2.55 3.1 2.7 3.2 2.77 3.3 2.8 3.7 ...
## $ Pulse : int 54 66 130 78 40 80 94 77 60 94 ...
## $ Piercings : int 0 3 0 0 6 4 8 0 7 2 ...
summary(survey)
## Year Sex Smoke Award
## Length :362 Length :362 Length :362 Length :362
## N.unique : 5 N.unique : 2 N.unique : 2 N.unique : 3
## N.blank : 2 N.blank : 0 N.blank : 0 N.blank : 0
## Min.nchar: 0 Min.nchar: 1 Min.nchar: 2 Min.nchar: 5
## Max.nchar: 9 Max.nchar: 1 Max.nchar: 3 Max.nchar: 7
##
##
## HigherSAT Exercise TV Height
## Length :362 Min. : 0.000 Min. : 0.000 Min. :59.00
## N.unique : 3 1st Qu.: 5.000 1st Qu.: 3.000 1st Qu.:65.00
## N.blank : 7 Median : 8.000 Median : 5.000 Median :68.00
## Min.nchar: 0 Mean : 9.054 Mean : 6.504 Mean :68.42
## Max.nchar: 6 3rd Qu.:12.000 3rd Qu.: 9.000 3rd Qu.:71.00
## Max. :40.000 Max. :40.000 Max. :83.00
## NAs :1 NAs :1 NAs :7
## Weight Siblings BirthOrder VerbalSAT MathSAT
## Min. : 95.0 Min. :0.000 Min. :1.00 Min. :390.0 Min. :400.0
## 1st Qu.:138.0 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:550.0 1st Qu.:560.0
## Median :155.0 Median :1.000 Median :2.00 Median :600.0 Median :610.0
## Mean :159.8 Mean :1.727 Mean :1.83 Mean :594.2 Mean :609.4
## 3rd Qu.:180.0 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:640.0 3rd Qu.:650.0
## Max. :275.0 Max. :8.000 Max. :8.00 Max. :800.0 Max. :800.0
## NAs :5 NAs :3
## SAT GPA Pulse Piercings
## Min. : 800 Min. :2.000 Min. : 35.00 Min. : 0.000
## 1st Qu.:1130 1st Qu.:2.900 1st Qu.: 62.00 1st Qu.: 0.000
## Median :1200 Median :3.200 Median : 70.00 Median : 0.000
## Mean :1204 Mean :3.158 Mean : 69.57 Mean : 1.673
## 3rd Qu.:1270 3rd Qu.:3.400 3rd Qu.: 77.75 3rd Qu.: 3.000
## Max. :1550 Max. :4.000 Max. :130.00 Max. :10.000
## NAs :17 NAs :1
# Q4. Display summary statistics for VerbalSAT
summary(survey$VerbalSAT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 390.0 550.0 600.0 594.2 640.0 800.0
# Q5. Find the average GPA of students
mean(survey$GPA, na.rm = TRUE)
## [1] 3.157942
# Q6. Create a new dataframe called column_df that contains students' weight
# and number of hours they exercise.
column_df <- survey |>
select(Weight, Exercise)
print(column_df)
## Weight Exercise
## 1 180 10.0
## 2 120 4.0
## 3 208 14.0
## 4 110 3.0
## 5 150 3.0
## 6 114 5.0
## 7 128 10.0
## 8 235 13.0
## 9 NA 3.0
## 10 115 12.0
## 11 140 12.0
## 12 200 10.0
## 13 162 12.0
## 14 135 6.0
## 15 193 9.0
## 16 110 10.0
## 17 99 3.0
## 18 165 7.0
## 19 120 2.0
## 20 154 14.0
## 21 110 10.0
## 22 145 14.0
## 23 195 20.0
## 24 200 7.0
## 25 167 12.0
## 26 175 10.0
## 27 155 6.0
## 28 185 14.0
## 29 190 12.0
## 30 165 10.0
## 31 175 8.0
## 32 126 0.0
## 33 187 10.0
## 34 170 6.0
## 35 158 5.0
## 36 119 24.0
## 37 205 2.0
## 38 129 10.0
## 39 145 6.0
## 40 130 5.0
## 41 215 5.0
## 42 135 12.0
## 43 145 2.0
## 44 98 7.0
## 45 150 15.0
## 46 159 5.0
## 47 174 7.0
## 48 160 15.0
## 49 165 8.0
## 50 161 14.0
## 51 160 14.0
## 52 130 4.0
## 53 175 15.0
## 54 255 4.0
## 55 160 15.0
## 56 160 3.0
## 57 95 3.0
## 58 115 15.0
## 59 120 20.0
## 60 135 3.0
## 61 180 6.0
## 62 155 12.0
## 63 110 4.0
## 64 215 20.0
## 65 175 15.0
## 66 140 10.0
## 67 195 10.0
## 68 185 4.0
## 69 185 9.0
## 70 209 12.0
## 71 145 2.0
## 72 140 15.0
## 73 146 10.0
## 74 130 7.0
## 75 140 3.0
## 76 130 4.0
## 77 140 15.0
## 78 160 8.0
## 79 120 5.0
## 80 150 10.0
## 81 155 15.0
## 82 128 4.0
## 83 143 5.0
## 84 155 6.0
## 85 119 18.0
## 86 138 16.0
## 87 240 4.0
## 88 160 3.0
## 89 191 20.0
## 90 165 5.0
## 91 200 10.0
## 92 125 2.0
## 93 140 4.0
## 94 206 14.0
## 95 275 7.0
## 96 142 12.0
## 97 140 14.0
## 98 145 3.0
## 99 128 5.0
## 100 165 15.0
## 101 140 5.0
## 102 130 4.0
## 103 170 8.0
## 104 160 14.0
## 105 165 5.0
## 106 145 12.0
## 107 155 15.0
## 108 155 3.0
## 109 113 12.0
## 110 155 12.0
## 111 173 6.0
## 112 195 2.0
## 113 120 3.0
## 114 225 24.0
## 115 160 3.0
## 116 120 6.0
## 117 138 12.0
## 118 260 18.0
## 119 150 2.0
## 120 135 12.0
## 121 165 5.0
## 122 142 3.0
## 123 210 3.0
## 124 171 21.0
## 125 150 12.0
## 126 188 3.0
## 127 195 20.0
## 128 230 8.0
## 129 140 2.0
## 130 200 15.0
## 131 140 6.0
## 132 180 2.0
## 133 160 1.0
## 134 135 13.0
## 135 140 12.0
## 136 155 10.0
## 137 235 10.0
## 138 140 10.0
## 139 130 3.0
## 140 125 2.0
## 141 222 10.0
## 142 128 4.0
## 143 183 8.0
## 144 175 18.0
## 145 125 5.0
## 146 156 4.0
## 147 145 20.0
## 148 195 14.0
## 149 185 8.0
## 150 150 4.0
## 151 140 5.0
## 152 150 14.0
## 153 150 4.0
## 154 220 4.0
## 155 195 15.0
## 156 140 12.0
## 157 135 10.0
## 158 138 6.0
## 159 170 12.0
## 160 145 10.0
## 161 135 14.0
## 162 140 5.0
## 163 155 18.0
## 164 155 10.0
## 165 155 8.0
## 166 135 8.0
## 167 165 3.0
## 168 160 5.0
## 169 175 4.0
## 170 183 8.0
## 171 140 3.0
## 172 140 12.0
## 173 110 5.0
## 174 125 5.0
## 175 125 15.0
## 176 180 10.0
## 177 195 10.0
## 178 128 10.0
## 179 172 14.0
## 180 150 6.0
## 181 188 4.0
## 182 110 14.0
## 183 125 5.0
## 184 135 16.0
## 185 130 14.0
## 186 230 10.0
## 187 198 11.0
## 188 127 4.0
## 189 150 12.0
## 190 175 5.0
## 191 160 8.0
## 192 140 8.0
## 193 200 3.0
## 194 130 8.0
## 195 210 20.0
## 196 210 10.0
## 197 145 3.0
## 198 156 4.0
## 199 207 8.0
## 200 150 15.0
## 201 137 14.0
## 202 153 14.0
## 203 175 12.0
## 204 136 6.0
## 205 104 3.0
## 206 122 12.0
## 207 192 5.0
## 208 218 8.0
## 209 170 1.0
## 210 220 10.0
## 211 150 7.0
## 212 135 5.0
## 213 155 10.0
## 214 135 12.0
## 215 140 15.0
## 216 120 10.0
## 217 135 7.0
## 218 135 NA
## 219 165 15.0
## 220 180 10.0
## 221 165 15.0
## 222 175 12.0
## 223 200 10.0
## 224 138 12.0
## 225 265 5.0
## 226 105 14.0
## 227 130 3.0
## 228 165 12.0
## 229 145 15.0
## 230 138 3.0
## 231 193 25.0
## 232 170 5.0
## 233 155 6.0
## 234 155 27.0
## 235 140 8.0
## 236 130 3.0
## 237 180 17.0
## 238 140 40.0
## 239 210 5.0
## 240 150 0.0
## 241 115 27.0
## 242 225 3.0
## 243 170 3.0
## 244 120 11.0
## 245 170 2.0
## 246 130 9.0
## 247 182 12.0
## 248 138 15.0
## 249 180 2.0
## 250 230 20.0
## 251 222 12.0
## 252 145 5.0
## 253 150 10.0
## 254 190 2.0
## 255 195 12.0
## 256 115 5.0
## 257 140 3.0
## 258 155 12.0
## 259 150 8.0
## 260 155 8.0
## 261 130 5.0
## 262 210 6.0
## 263 105 8.0
## 264 192 10.0
## 265 220 18.0
## 266 134 12.0
## 267 160 10.0
## 268 135 10.0
## 269 140 5.0
## 270 135 7.0
## 271 145 3.0
## 272 140 7.0
## 273 NA 3.0
## 274 138 4.0
## 275 NA 8.0
## 276 130 7.0
## 277 210 7.0
## 278 233 3.0
## 279 180 7.0
## 280 185 14.0
## 281 189 5.0
## 282 127 14.0
## 283 135 5.0
## 284 140 5.0
## 285 162 2.0
## 286 215 1.0
## 287 200 15.0
## 288 120 5.0
## 289 165 5.0
## 290 175 15.0
## 291 160 5.0
## 292 145 20.0
## 293 180 3.0
## 294 170 7.0
## 295 235 20.0
## 296 155 10.0
## 297 140 6.0
## 298 145 6.0
## 299 175 10.0
## 300 210 9.0
## 301 130 1.5
## 302 195 20.0
## 303 165 8.0
## 304 180 21.0
## 305 145 10.0
## 306 163 8.0
## 307 160 10.0
## 308 150 18.0
## 309 170 15.0
## 310 110 8.0
## 311 170 6.0
## 312 145 12.0
## 313 160 10.0
## 314 130 7.0
## 315 155 5.0
## 316 NA 6.0
## 317 160 5.0
## 318 151 5.0
## 319 180 30.0
## 320 130 5.0
## 321 185 12.0
## 322 198 0.0
## 323 175 12.0
## 324 123 0.0
## 325 145 15.0
## 326 190 18.0
## 327 130 12.0
## 328 185 2.0
## 329 165 12.0
## 330 165 3.0
## 331 150 3.0
## 332 160 4.0
## 333 142 12.0
## 334 165 25.0
## 335 175 3.0
## 336 175 6.0
## 337 190 7.0
## 338 180 7.0
## 339 185 6.0
## 340 NA 6.0
## 341 135 13.0
## 342 195 25.0
## 343 175 8.0
## 344 165 5.0
## 345 135 11.0
## 346 140 18.0
## 347 182 10.0
## 348 155 6.0
## 349 180 2.0
## 350 170 5.0
## 351 135 5.0
## 352 165 6.0
## 353 137 10.0
## 354 147 4.0
## 355 150 5.0
## 356 155 17.0
## 357 160 7.0
## 358 130 2.0
## 359 180 8.0
## 360 150 1.0
## 361 205 14.0
## 362 115 12.0
# Q7. Access the fourth element in the first column of the StudentSurvey dataset.
head(survey$Year, n= 4)
## [1] "Senior" "Sophomore" "FirstYear" "Junior"
Don’t change this chunk — it loads and filters the dataset.
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics |>
filter(!is.na(age)) |>
filter(sport == "Gymnastics") |>
mutate(
medalist = case_when(
is.na(medal) ~ FALSE,
!is.na(medal) ~ TRUE
)
)
More info on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist.
# Call it df.
df<- olympic_gymnasts |>
select(name, sex, age, team, year, medalist)
print(df)
## # A tibble: 25,528 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 2 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 3 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 4 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 5 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 6 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 7 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 8 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 9 Paavo Johannes Aaltonen M 32 Finland 1952 FALSE
## 10 Paavo Johannes Aaltonen M 32 Finland 1952 TRUE
## # ℹ 25,518 more rows
# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
filter(year %in% c(2008, 2012, 2016))
print(df2)
## # A tibble: 2,703 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 2 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 3 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 4 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 5 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 6 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 7 Katja Abel F 25 Germany 2008 FALSE
## 8 Katja Abel F 25 Germany 2008 FALSE
## 9 Katja Abel F 25 Germany 2008 FALSE
## 10 Katja Abel F 25 Germany 2008 FALSE
## # ℹ 2,693 more rows
# Q10. Group by those three years and summarize the mean age in each group.
age_summary <- df |>
filter(year %in% c(2008, 2012, 2016)) |>
group_by(year) |>
summarize(mean_age = mean(age, na.rm = TRUE))
# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age
# for each year. Call this oly_year.
# (Bonus: find the minimum average age across years.)
# Q12. Open-ended: come up with a question that requires at least TWO dplyr verbs.
# Write the question, then the code that answers it. Below the chunk, briefly
# explain why you chose this question.
Your question and reflection: