This homework has two parts. Part 1 uses base R to inspect a dataframe. Part 2 uses dplyr to wrangle a different dataset.
Download StudentSurvey.csv from the Datasets folder on
Blackboard. Save it next to this Rmd and set your working directory. x
download ‘StudentSurvey.csv’ x save in same folder as this r markdown x
set working directory ()
# Load the file
survey <- read.csv("StudentSurvey.csv")
glimpse(survey)
## Rows: 362
## Columns: 17
## $ Year <chr> "Senior", "Sophomore", "FirstYear", "Junior", "Sophomore", …
## $ Sex <chr> "M", "F", "M", "M", "F", "F", "F", "M", "F", "F", "F", "M",…
## $ Smoke <chr> "No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No"…
## $ Award <chr> "Olympic", "Academy", "Nobel", "Nobel", "Nobel", "Nobel", "…
## $ HigherSAT <chr> "Math", "Math", "Math", "Math", "Verbal", "Verbal", "Math",…
## $ Exercise <dbl> 10, 4, 14, 3, 3, 5, 10, 13, 3, 12, 12, 10, 12, 6, 9, 10, 3,…
## $ TV <int> 1, 7, 5, 1, 3, 4, 10, 8, 6, 1, 6, 5, 8, 1, 5, 2, 15, 3, 1, …
## $ Height <int> 71, 66, 72, 63, 65, 65, 66, 74, 61, 60, 65, 63, 68, 68, 68,…
## $ Weight <int> 180, 120, 208, 110, 150, 114, 128, 235, NA, 115, 140, 200, …
## $ Siblings <int> 4, 2, 2, 1, 1, 2, 1, 1, 2, 7, 1, 2, 3, 2, 1, 1, 2, 2, 1, 1,…
## $ BirthOrder <int> 4, 2, 1, 1, 1, 2, 1, 1, 2, 8, 2, 2, NA, 3, 1, 2, 1, 1, 1, 2…
## $ VerbalSAT <int> 540, 520, 550, 490, 720, 600, 640, 660, 550, 670, 500, 580,…
## $ MathSAT <int> 670, 630, 560, 630, 450, 550, 680, 710, 550, 700, 670, 600,…
## $ SAT <int> 1210, 1150, 1110, 1120, 1170, 1150, 1320, 1370, 1100, 1370,…
## $ GPA <dbl> 3.13, 2.50, 2.55, 3.10, 2.70, 3.20, 2.77, 3.30, 2.80, 3.70,…
## $ Pulse <int> 54, 66, 130, 78, 40, 80, 94, 77, 60, 94, 63, 72, 54, 66, 72…
## $ Piercings <int> 0, 3, 0, 0, 6, 4, 8, 0, 7, 2, 2, 0, 0, 4, 0, 4, 4, 0, 2, 5,…
# 362 rows observations ; 17 columns
# Q1. Check the head of the dataset
head(survey)
## Year Sex Smoke Award HigherSAT Exercise TV Height Weight Siblings
## 1 Senior M No Olympic Math 10 1 71 180 4
## 2 Sophomore F Yes Academy Math 4 7 66 120 2
## 3 FirstYear M No Nobel Math 14 5 72 208 2
## 4 Junior M No Nobel Math 3 1 63 110 1
## 5 Sophomore F No Nobel Verbal 3 3 65 150 1
## 6 Sophomore F No Nobel Verbal 5 4 65 114 2
## BirthOrder VerbalSAT MathSAT SAT GPA Pulse Piercings
## 1 4 540 670 1210 3.13 54 0
## 2 2 520 630 1150 2.50 66 3
## 3 1 550 560 1110 2.55 130 0
## 4 1 490 630 1120 3.10 78 0
## 5 1 720 450 1170 2.70 40 6
## 6 2 600 550 1150 3.20 80 4
# first 6 rows ; 17 columns
# Q2. Check the dimensions
#from glimpse(survey) there were 362 observations and 16 columns
dim(survey)
## [1] 362 17
nrow(survey)
## [1] 362
ncol(survey)
## [1] 17
# 362 rows by 17 columns
# Q3. Create a table of students' Sex and HigherSAT
table_Sex_HigherSAT <- table(survey$Sex, survey$HigherSAT)
table_Sex_HigherSAT
##
## Math Verbal
## F 4 81 84
## M 3 124 66
# getting an extra column , checking data
unique(survey$Sex) # returned "M" and "F" so no extras
## [1] "M" "F"
unique(survey$HigherSAT) #returned "Math" "Verbal" and ""
## [1] "Math" "Verbal" ""
sum(survey$HigherSAT == "") # count number of blanks; returned 7 blanks
## [1] 7
# 4 females "" and 3 males "" no higherSAT
# Q4. Display summary statistics for VerbalSAT
summary(survey$VerbalSAT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 390.0 550.0 600.0 594.2 640.0 800.0
# Q5. Find the average GPA of students
mean(survey$GPA) # retrned NA, so will try to count number of NAs, then remove NAs to compute mean.
## [1] NA
sum(is.na(survey$GPA)) #count number of NA s
## [1] 17
which(is.na(survey$GPA)) #which rows have NA
## [1] 12 15 51 65 76 77 82 89 106 154 161 200 210 216 238 263 303
round ((mean(survey$GPA, na.rm = TRUE)), 3)
## [1] 3.158
# 17 NA s in the GPA column
# the observations on the following rows have NA for GPA:
# 12 15 51 65 76 77 82 89 106 154 161 200 210 216 238 263 303
# Q6. Create a new dataframe called column_df that contains students' weight
# and number of hours they exercise.
column_df <- data.frame(
students_weight = survey$Weight,
hours_exercise = survey$Exercise
)
glimpse(column_df)
## Rows: 362
## Columns: 2
## $ students_weight <int> 180, 120, 208, 110, 150, 114, 128, 235, NA, 115, 140, …
## $ hours_exercise <dbl> 10, 4, 14, 3, 3, 5, 10, 13, 3, 12, 12, 10, 12, 6, 9, 1…
head(column_df)
## students_weight hours_exercise
## 1 180 10
## 2 120 4
## 3 208 14
## 4 110 3
## 5 150 3
## 6 114 5
# Q7. Access the fourth element in the first column of the StudentSurvey dataset.
survey[4,1] # fourth row and first column
## [1] "Junior"
names(survey)[1] # returns "Year" as first column
## [1] "Year"
survey[4, "Year"] #wanted to find it by column name and row/observation
## [1] "Junior"
Don’t change this chunk — it loads and filters the dataset.
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics |>
filter(!is.na(age)) |>
filter(sport == "Gymnastics") |>
mutate(
medalist = case_when(
is.na(medal) ~ FALSE,
!is.na(medal) ~ TRUE
)
)
More info on the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
# Q8. Create a subset dataframe with these columns only: name, sex, age, team, year, medalist.
# Call it df.
glimpse(olympics)
## Rows: 271,116
## Columns: 15
## $ id <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
## $ name <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
## $ age <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
## $ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
## $ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
## $ team <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
## $ noc <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
## $ games <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
## $ sport <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
## $ event <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
## $ medal <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
head(olympics)
## # A tibble: 6 × 15
## id name sex age height weight team noc games year season city
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1 A Dijiang M 24 180 80 China CHN 1992… 1992 Summer Barc…
## 2 2 A Lamusi M 23 170 60 China CHN 2012… 2012 Summer Lond…
## 3 3 Gunnar N… M 24 NA NA Denm… DEN 1920… 1920 Summer Antw…
## 4 4 Edgar Li… M 34 NA NA Denm… DEN 1900… 1900 Summer Paris
## 5 5 Christin… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
## 6 5 Christin… F 21 185 82 Neth… NED 1988… 1988 Winter Calg…
## # ℹ 3 more variables: sport <chr>, event <chr>, medal <chr>
df <- data.frame(
name = olympics$name,
sex = olympics$sex,
age = olympics$age,
team = olympics$team,
year = olympics$year,
medalist = olympics$medal
)
glimpse(df)
## Rows: 271,116
## Columns: 6
## $ name <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Linden…
## $ sex <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "…
## $ age <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 3…
## $ team <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands",…
## $ year <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 1…
## $ medalist <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
head(df)
## name sex age team year medalist
## 1 A Dijiang M 24 China 1992 <NA>
## 2 A Lamusi M 23 China 2012 <NA>
## 3 Gunnar Nielsen Aaby M 24 Denmark 1920 <NA>
## 4 Edgar Lindenau Aabye M 34 Denmark/Sweden 1900 Gold
## 5 Christine Jacoba Aaftink F 21 Netherlands 1988 <NA>
## 6 Christine Jacoba Aaftink F 21 Netherlands 1988 <NA>
# 271,116 rows observations and 6 columns
# Q9. From df, create df2 that only has the years 2008, 2012, and 2016.
#piping - and then
df2 <- df |>
filter(year %in% c(2008, 2012, 2016))
glimpse(df2)
## Rows: 40,210
## Columns: 6
## $ name <chr> "A Lamusi", "Ragnhild Margrethe Aamodt", "Andreea Aanei", "Ja…
## $ sex <chr> "M", "F", "F", "M", "M", "F", "M", "M", "M", "M", "M", "M", "…
## $ age <dbl> 23, 27, 22, 30, 24, 28, 23, 23, 23, 23, 23, 23, 26, 21, 30, 2…
## $ team <chr> "China", "Norway", "Romania", "France", "Morocco", "Estonia",…
## $ year <dbl> 2012, 2008, 2016, 2012, 2012, 2008, 2016, 2016, 2016, 2016, 2…
## $ medalist <chr> NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Bron…
head(df2)
## name sex age team year medalist
## 1 A Lamusi M 23 China 2012 <NA>
## 2 Ragnhild Margrethe Aamodt F 27 Norway 2008 Gold
## 3 Andreea Aanei F 22 Romania 2016 <NA>
## 4 Jamale (Djamel-) Aarrass (Ahrass-) M 30 France 2012 <NA>
## 5 Abdelhak Aatakni M 24 Morocco 2012 <NA>
## 6 Moonika Aava F 28 Estonia 2008 <NA>
# Q10. Group by those three years and summarize the mean age in each group.
# piping operators
df2 <- df |>
filter(year %in% c(2008, 2012, 2016)) |> # filter years
group_by(year) |> # group by year
summarise( # summarizes the groups that we grouped
count = n(), # counts rows in group of years
mean_age_byyear = mean(age, na.rm = TRUE)
)
df2
## # A tibble: 3 × 3
## year count mean_age_byyear
## <dbl> <int> <dbl>
## 1 2008 13602 25.7
## 2 2012 12920 26.0
## 3 2016 13688 26.2
# Q11. Using the full olympic_gymnasts dataset, group by year and find the mean age
# for each year. Call this oly_year.
# (Bonus: find the minimum average age across years.)
oly_year <- olympic_gymnasts |>
group_by(year) |>
summarise(
count = n(),
mean_age_eachyear = mean(age, na.rm = TRUE)
)
oly_year
## # A tibble: 29 × 3
## year count mean_age_eachyear
## <dbl> <int> <dbl>
## 1 1896 73 24.3
## 2 1900 33 22.2
## 3 1904 317 25.1
## 4 1906 70 24.7
## 5 1908 240 23.2
## 6 1912 310 24.2
## 7 1920 206 26.7
## 8 1924 499 27.6
## 9 1928 561 25.6
## 10 1932 140 23.9
## # ℹ 19 more rows
min(oly_year$mean_age_eachyear)
## [1] 19.86606
oly_year[which.min(oly_year$mean_age_eachyear), ]
## # A tibble: 1 × 3
## year count mean_age_eachyear
## <dbl> <int> <dbl>
## 1 1988 1217 19.9
# The year with the youngest / lowest average age was 1988.
# The average age was 19.86606 years old
.86606*12
## [1] 10.39272
# about 10.4 months
# so 19 years and about 10.4 months
Your question and reflection:
year_youngest_avg <- oly_year[which.min(oly_year$mean_age_eachyear), ]
olympic_gymnasts |>
filter(year == year_youngest_avg$year) |>
summarise(oldest_age = max(age, na.rm = TRUE))
## # A tibble: 1 × 1
## oldest_age
## <dbl>
## 1 28
oly_year_oldest_age <- olympic_gymnasts |>
group_by(year) |>
summarise(
count = n(),
mean_age_eachyear = mean(age, na.rm = TRUE),
oldest_age = max(age, na.rm = TRUE)
)
oly_year_oldest_age
## # A tibble: 29 × 4
## year count mean_age_eachyear oldest_age
## <dbl> <int> <dbl> <dbl>
## 1 1896 73 24.3 31
## 2 1900 33 22.2 31
## 3 1904 317 25.1 37
## 4 1906 70 24.7 35
## 5 1908 240 23.2 49
## 6 1912 310 24.2 38
## 7 1920 206 26.7 45
## 8 1924 499 27.6 38
## 9 1928 561 25.6 39
## 10 1932 140 23.9 34
## # ℹ 19 more rows
oly_year_oldest_age |>
slice_min(mean_age_eachyear, n=5)
## # A tibble: 5 × 4
## year count mean_age_eachyear oldest_age
## <dbl> <int> <dbl> <dbl>
## 1 1988 1217 19.9 28
## 2 1992 1255 20.0 31
## 3 1980 866 20.1 30
## 4 1996 1397 20.3 29
## 5 1984 930 20.4 30
oly_year_oldest_mean_age <- olympic_gymnasts |>
group_by(year) |>
summarise(
count = n(),
mean_age_eachyear = mean(age, na.rm = TRUE),
oldest_age = max(age, na.rm = TRUE),
youngest_age = min(age, na.rm = TRUE)
)
oly_year_oldest_mean_age
## # A tibble: 29 × 5
## year count mean_age_eachyear oldest_age youngest_age
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1896 73 24.3 31 10
## 2 1900 33 22.2 31 17
## 3 1904 317 25.1 37 18
## 4 1906 70 24.7 35 14
## 5 1908 240 23.2 49 16
## 6 1912 310 24.2 38 18
## 7 1920 206 26.7 45 17
## 8 1924 499 27.6 38 19
## 9 1928 561 25.6 39 11
## 10 1932 140 23.9 34 15
## # ℹ 19 more rows
oly_year_oldest_mean_age |>
slice_max(mean_age_eachyear, n=5)
## # A tibble: 5 × 5
## year count mean_age_eachyear oldest_age youngest_age
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 1948 874 27.8 42 12
## 2 1924 499 27.6 38 19
## 3 1920 206 26.7 45 17
## 4 1936 947 25.8 37 14
## 5 1928 561 25.6 39 11
Reflection
Reflection: In two of the five oldest mean age years, 1948 and 1928, the youngest aged olympian was 12 and 11. I would not expect competitors so young to be at the same Olympic Games in the older average aged years.