HW2

This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.

Part 1 — Student Survey (dataframe basics)

Download StudentSurvey.csv from the Datasets folder on Blackboard. Save it next to this Rmd and set your working directory.

# load dataset
setwd("C:/Users/chesl/Desktop/DATA101")
survey <- read.csv("StudentSurvey.csv")

# Q1. Check the head of the dataset.
head(survey)

##        Year Sex Smoke   Award HigherSAT Exercise TV Height Weight Siblings
## 1    Senior   M    No Olympic      Math       10  1     71    180        4
## 2 Sophomore   F   Yes Academy      Math        4  7     66    120        2
## 3 FirstYear   M    No   Nobel      Math       14  5     72    208        2
## 4    Junior   M    No   Nobel      Math        3  1     63    110        1
## 5 Sophomore   F    No   Nobel    Verbal        3  3     65    150        1
## 6 Sophomore   F    No   Nobel    Verbal        5  4     65    114        2
##   BirthOrder VerbalSAT MathSAT  SAT  GPA Pulse Piercings
## 1          4       540     670 1210 3.13    54         0
## 2          2       520     630 1150 2.50    66         3
## 3          1       550     560 1110 2.55   130         0
## 4          1       490     630 1120 3.10    78         0
## 5          1       720     450 1170 2.70    40         6
## 6          2       600     550 1150 3.20    80         4

# Q2. Check the dimensions.
dim(survey)

## [1] 362  17

# Q3. Create a table of students' sex and HigherSAT.
table(survey$HigherSAT, survey$Sex)

##         
##            F   M
##            4   3
##   Math    81 124
##   Verbal  84  66
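
The unlabeled row in the table (4 F, 3 M) suggests that a few HigherSAT entries are empty strings rather than "Math" or "Verbal". The sketch below uses a made-up toy data frame (hypothetical values, not the real survey) and assumes those blanks should be treated as missing.

```r
# Toy data with hypothetical values; "" stands in for a blank survey answer
toy <- data.frame(
  Sex = c("F", "M", "F", "M", "F"),
  HigherSAT = c("Math", "", "Verbal", "Math", ""),
  stringsAsFactors = FALSE
)

# Recode empty strings as NA so table() leaves them out by default
toy$HigherSAT[toy$HigherSAT == ""] <- NA

tab <- table(toy$HigherSAT, toy$Sex)
print(tab)

# useNA = "ifany" keeps the missing answers visible as an <NA> row
tab_na <- table(toy$HigherSAT, toy$Sex, useNA = "ifany")
print(tab_na)
```

Another option is to convert the blanks while reading the file, for example read.csv("StudentSurvey.csv", na.strings = c("", "NA")).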

# Q4. Display summary statistics for VerbalSAT.
summary(survey$VerbalSAT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   390.0   550.0   600.0   594.2   640.0   800.0

# Q5. Find the average GPA of students.
mean(survey$GPA, na.rm = TRUE)

# GPA has missing values, so na.rm = TRUE is needed; without it mean() returns NA.
# average GPA: ~3.158
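
mean() returns NA whenever its input contains a missing value, which is the situation with the GPA column. A minimal sketch on a made-up vector (hypothetical values) shows how to count the missing entries and compute the mean without them:

```r
gpa <- c(3.0, 2.5, NA, 3.5)  # hypothetical values, one missing

# A single NA makes the plain mean NA
plain_mean <- mean(gpa)

# Count how many values are missing before deciding to drop them
n_missing <- sum(is.na(gpa))

# na.rm = TRUE averages only the non-missing values
clean_mean <- mean(gpa, na.rm = TRUE)

print(plain_mean)
print(n_missing)
print(clean_mean)
```

Checking the count first makes it clear how many students are left out of the average.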

# Q6. Create a new dataframe called column_df that contains students' weight and number of hours they exercise.
column_df <- survey[, c("Exercise", "Weight")]

# Q7. Access the fourth element in the first column of the StudentSurvey dataset.
survey[4, 1]

## [1] "Junior"
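
The same cell can be reached in several equivalent ways. The sketch below uses a small made-up data frame (hypothetical values) in place of the survey to compare positional and name-based indexing:

```r
# Toy data frame with hypothetical values
toy <- data.frame(
  Year = c("Senior", "Sophomore", "FirstYear", "Junior"),
  Pulse = c(54, 66, 130, 78),
  stringsAsFactors = FALSE
)

by_position <- toy[4, 1]       # row 4, column 1
by_name     <- toy[4, "Year"]  # row 4, column selected by name
by_dollar   <- toy$Year[4]     # extract the column, then its 4th element
by_brackets <- toy[[1]][4]     # extract column 1 as a vector, then its 4th element

print(by_position)
```

Name-based forms keep working if the column order changes, which positional indexing does not guarantee.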

Part 2 — Olympic Gymnasts (dplyr)

Don’t change this chunk. It loads and filters the dataset, and it assumes dplyr is already attached (for example with library(dplyr)).

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics |>
  filter(!is.na(age)) |>
  filter(sport == "Gymnastics") |>
  mutate(
    medalist = case_when(
      is.na(medal) ~ FALSE,
      !is.na(medal) ~ TRUE
    )
  )

More information on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
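
The medalist column is TRUE exactly when medal is not missing, so the two-branch logic in the chunk gives the same values as a single logical test. A small sketch on a made-up medal vector (hypothetical values; base R ifelse() stands in for case_when() here so the example needs no packages):

```r
medal <- c("Gold", NA, "Bronze", NA)  # hypothetical values

# Two-branch version, mirroring the case_when() logic
medalist_branches <- ifelse(is.na(medal), FALSE, TRUE)

# Single logical test
medalist_direct <- !is.na(medal)

# Both approaches should agree element by element
same_result <- identical(medalist_branches, medalist_direct)
print(same_result)
```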

# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist. Call it df.
df <- olympic_gymnasts[, c("name", "sex", "age", "team", "year", "medalist")]
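
Since Part 2 is about dplyr, the same column subset can also be written with select(). This is a sketch on a made-up data frame (hypothetical rows and a hypothetical extra column, height), assuming dplyr is installed:

```r
library(dplyr)

# Toy rows standing in for olympic_gymnasts (hypothetical values)
toy <- data.frame(
  name = c("A", "B"),
  sex = c("F", "M"),
  age = c(19, 24),
  height = c(150, 170)
)

# select() keeps only the named columns, in the order listed
df_toy <- toy |>
  select(name, sex, age)

print(names(df_toy))
```

Both forms return the same columns; select() fits more naturally into a longer pipe.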

# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.
# %in% tests membership in the set of years. `==` would recycle the vector
# across rows, keep only some of the matching rows, and skew summaries of df2.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
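
Comparing a column to a multi-element vector with == recycles the vector position by position, while %in% checks each value for membership in the set. A minimal sketch on a made-up year vector (hypothetical values) shows the difference:

```r
year <- c(2008, 2012, 2016, 2016, 2012, 2008)  # hypothetical values
targets <- c(2008, 2012, 2016)

# == pairs year[i] with the recycled targets: 2008, 2012, 2016, 2008, 2012, 2016
keep_eq <- year == targets

# %in% asks, for each year, whether it appears anywhere in targets
keep_in <- year %in% targets

print(sum(keep_eq))  # rows kept by ==
print(sum(keep_in))  # rows kept by %in%
```

Every value here is one of the three target years, yet == keeps only the rows that happen to line up with the recycled vector.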

# Q10. Group by those three years and summarize the mean age in each group.
df2 |>
  group_by(year) |>
  summarize(mean_age = mean(age))

## # A tibble: 3 × 2
##    year mean_age
##   <dbl>    <dbl>
## 1  2008     21.7
## 2  2012     22.0
## 3  2016     22.2

# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age for each year. Call this oly_year. (Bonus: find the minimum average age across years.)
oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(mean_age = mean(age))

min(oly_year$mean_age)

## [1] 19.86606

# minimum average age: 19.866
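
min() gives the smallest mean age but not the year it belongs to. One way to recover the whole row is which.min(), sketched here on a made-up stand-in for oly_year (hypothetical values):

```r
# Toy stand-in for oly_year (hypothetical values)
oly_toy <- data.frame(
  year = c(1996, 2000, 2004),
  mean_age = c(20.5, 19.9, 21.2)
)

# which.min() returns the position of the smallest value;
# using it as a row index returns the matching year as well
youngest <- oly_toy[which.min(oly_toy$mean_age), ]
print(youngest)
```

With dplyr attached, slice_min(oly_year, mean_age, n = 1) is an alternative that stays inside a pipe.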

# Q12. Open-ended: come up with a question that requires at least two dplyr verbs. Write the question, then the code that answers it. Below the chunk, briefly explain why you chose this question.

Question: since 2000, are male or female Olympic gymnasts younger on average?

olympic_gymnasts |>
  filter(year >= 2000) |>
  group_by(sex) |>
  summarize(mean_age = mean(age))

## # A tibble: 2 × 2
##   sex   mean_age
##   <chr>    <dbl>
## 1 F         18.6
## 2 M         23.8

# answer: female gymnasts, ~18.6 years old on average.

I chose this question because it requires filter(), group_by(), and summarize(), and it involves three important variables from the dataset: year, sex, and age.

Cheslav Lukashanets
