Author: 510119533
Affiliation: The University of Sydney

Code
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggplot2)
library(patchwork)
library(ggpubr)
# creates a file with the bibtex for packages used:
knitr::write_bib(c(.packages(),
                   "knitr", "rmarkdown"), "refs/packages.bib")

1 Introduction

This assignment is designed and answered based on the survey data collected from the class, DATA2x02 students (2024).

Details of the data wrangling are given in Section 2, and the cleaned data set can be downloaded from Section 3.

The packages and tools used throughout the report are listed in the References.

2 Data Wrangling

This section covers the data wrangling: the survey columns are renamed, and the resulting variable names are listed alongside the original survey questions below.

Note: the data import and cleaning approach gratefully follows Tarr (2024).

Code
library(tidyverse)
library(gendercoder)
library(janitor)
library(hms)
theme_set(theme_bw())
x = readxl::read_excel("DATA2x02_survey_2024_Responses.xlsx")

old_names = colnames(x)

new_names = c(
  "timestamp",
  "target_grade",
  "assignment_preference",
  "trimester_or_semester",
  "age",
  "tendency_yes_or_no",
  "pay_rent",
  "urinal_choice",
  "stall_choice",
  "weetbix_count",
  "weekly_food_spend",
  "living_arrangements",
  "weekly_alcohol",
  "believe_in_aliens",
  "height",
  "commute",
  "daily_anxiety_frequency",
  "weekly_study_hours",
  "work_status",
  "social_media",
  "gender",
  "average_daily_sleep",
  "usual_bedtime",
  "sleep_schedule",
  "sibling_count",
  "allergy_count",
  "diet_style",
  "random_number",
  "favourite_number",
  "favourite_letter",
  "drivers_license",
  "relationship_status",
  "daily_short_video_time",
  "computer_os",
  "steak_preference",
  "dominant_hand",
  "enrolled_unit",
  "weekly_exercise_hours",
  "weekly_paid_work_hours",
  "assignments_on_time",
  "used_r_before",
  "team_role_type",
  "university_year",
  "favourite_anime",
  "fluent_languages",
  "readable_languages",
  "country_of_birth",
  "wam",
  "shoe_size")

# Overwrite the old names:
colnames(x) = new_names
# Combine names
name_combo = bind_cols(Variable_Names = new_names, Explanations = old_names)
name_combo |> gt::gt()
Variable names and explanations
Variable_Names Explanations
timestamp Timestamp
target_grade What final grade are you aiming to achieve in DATA2x02?
assignment_preference When it comes to assignments / due tasks do you:
trimester_or_semester Would you prefer to have a trimester system (3 main teaching sessions per year) or stick with the existing semester system (2 main teaching sessions per year)?
age How old are you?
tendency_yes_or_no Do you tend to lean towards saying "yes" or towards saying "no" to things throughout life?
pay_rent Do you pay rent?
urinal_choice You enter a public bathroom and find you're the only one there. There are three urinals on the wall for you to choose from. Which do you choose?
stall_choice You enter a public bathroom and there are three stalls to choose from. All three are unoccupied. Which do you choose?
weetbix_count How many Weet-Bix would you typically eat in one sitting?
weekly_food_spend What is the average amount of money you spend each week on food/beverages?
living_arrangements What are your current living arrangements?
weekly_alcohol How much alcohol do you consume each week?
believe_in_aliens Do you believe in the existence of aliens (Extraterrestrial Life)?
height How tall are you?
commute How do you get to university?
daily_anxiety_frequency How often would you say you feel anxious on a daily basis?
weekly_study_hours How many hours a week do you spend studying?
work_status Do you work?
social_media What is your favourite social media platform?
gender What is your gender?
average_daily_sleep How much sleep do you get (on avg, per day)?
usual_bedtime What time do you usually go to bed?
sleep_schedule How consistent would you rate your sleep schedule?
sibling_count How many siblings do you have?
allergy_count How many allergies do you know you have?
diet_style What is your diet style?
random_number Pick a number at random between 1 and 10 (inclusive)
favourite_number What is your favourite number (between 1 and 10 inclusive)
favourite_letter What is your favourite letter from the English alphabet?
drivers_license Do you have a driver's license?
relationship_status Do you currently have a partner?
daily_short_video_time How much time do you spend on short video apps (like tiktok or reels) every day?
computer_os What computer OS (operating system) are you currently using?
steak_preference How do you like your steak cooked?
dominant_hand What is your dominant hand?
enrolled_unit Which unit are you enrolled in?
weekly_exercise_hours On average, how many hours each week do you spend exercising?
weekly_paid_work_hours How many hours a week (on average) do you work in paid employment?
assignments_on_time Do you submit assignments on time?
used_r_before Have you ever used R before starting DATA2x02?
team_role_type What kind of role (active or passive) do you think you are when working as part of a team?
university_year Which year of university are you currently in?
favourite_anime What is your favourite anime?
fluent_languages How many languages can you speak fluently?
readable_languages How many languages can you read?
country_of_birth What country were you born in?
wam What is your WAM (weighted average mark)?
shoe_size What is your shoe size?
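
As a small sanity check on the renaming above (a minimal sketch, assuming the objects x and new_names created in the previous code chunk are still in the workspace), the number of new names can be compared against the number of survey columns and the renamed data inspected:

Code
# Confirm every survey column received exactly one new name
stopifnot(length(new_names) == ncol(x))
# Quick look at the renamed data
dplyr::glimpse(x)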

3 Data Cleaning

The cleaning steps are shown below, and the cleaned data set can be downloaded for further analysis: Download CSV Here. A short sketch for reading the cleaned file back in appears after the cleaning code.

Data Cleaning Steps
Code
# Step 1 (reading the survey data and renaming the columns) was carried out in Section 2.
# Step 2: Select relevant columns and remove rows with missing values in any selected column (including enrolled_unit)
df_cleaned <- x %>%
  select(target_grade, assignment_preference, weekly_alcohol, daily_anxiety_frequency, 
         weekly_study_hours, work_status, average_daily_sleep, sleep_schedule, 
         diet_style, relationship_status, enrolled_unit, weekly_exercise_hours, 
         weekly_paid_work_hours, assignments_on_time, team_role_type, wam) %>%
  drop_na()  # Remove rows with missing values in any selected column

# Step 3: Clean `diet_style`
df_cleaned <- df_cleaned %>%
  # Delete rows where diet_style is 'no beef'
  filter(diet_style != "no beef") %>%
  # Reclassify 'diet_style' values
  mutate(diet_style = case_when(
    diet_style %in% c("normal", "Normal", "NA", "None") ~ "Omnivorous",
    TRUE ~ diet_style  # Retain all other values as they are
  ))

# Step 4: Clean `relationship_status`
df_cleaned <- df_cleaned %>%
  # Delete rows where relationship_status is 'Tears'
  filter(relationship_status != "Tears") %>%
  # Reclassify 'relationship_status' values
  mutate(relationship_status = case_when(
    relationship_status == "Multiple" ~ "Yes",  # Convert 'Multiple' to 'Yes'
    relationship_status == "No but I would like to have one, please" ~ "No",  # Convert this to 'No'
    TRUE ~ relationship_status  # Retain all other values as they are
  ))

# Step 5: Clean 'weekly_study_hours' (weekly_exercise_hours, weekly_paid_work_hours and wam are cleaned in Steps 7-8)
df_cleaned <- df_cleaned %>%
  # Convert 'weekly_study_hours' to numeric and remove outliers
  mutate(weekly_study_hours = as.numeric(weekly_study_hours),
         weekly_study_hours = ifelse(weekly_study_hours < 0 | weekly_study_hours > 100, NA, weekly_study_hours)) %>%
  drop_na(weekly_study_hours)  # Remove rows with invalid study hours

# Step 6: Clean 'average_daily_sleep' with custom rules
df_cleaned <- df_cleaned %>%
  # Work on the text form first: unify range separators and treat ":" as a decimal point
  mutate(average_daily_sleep = str_replace_all(as.character(average_daily_sleep), "~", "-"),
         average_daily_sleep = sub(":", ".", average_daily_sleep),
         # Replace a range such as "7-8" with the midpoint of its endpoints
         average_daily_sleep = ifelse(grepl("-", average_daily_sleep),
                                      sapply(str_split(average_daily_sleep, "-"),
                                             function(v) suppressWarnings(mean(as.numeric(v)))),
                                      average_daily_sleep),
         # Coerce to numeric (unparseable answers become NA) and remove implausible values
         average_daily_sleep = suppressWarnings(as.numeric(average_daily_sleep)),
         average_daily_sleep = ifelse(average_daily_sleep < 0 | average_daily_sleep > 15, NA, average_daily_sleep)) %>%
  drop_na(average_daily_sleep)
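
# Illustration of the rules above (hypothetical example responses, not values from the survey):
#   "7-8"  -> mean(c(7, 8)) = 7.5 hours
#   "7:30" -> "7.30"        -> 7.3 hours (the colon is treated as a decimal point)
#   "lots" -> NA            -> the row is dropped by drop_na()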

# Step 7: Clean 'weekly_exercise_hours' and 'weekly_paid_work_hours'
df_cleaned <- df_cleaned %>%
  mutate(weekly_exercise_hours = as.numeric(weekly_exercise_hours),
         weekly_exercise_hours = ifelse(weekly_exercise_hours < 0 | weekly_exercise_hours > 30,
                                        NA, weekly_exercise_hours),
         weekly_paid_work_hours = as.numeric(weekly_paid_work_hours),
         weekly_paid_work_hours = ifelse(weekly_paid_work_hours < 0 | weekly_paid_work_hours > 30, NA, weekly_paid_work_hours)) %>%
  drop_na(weekly_exercise_hours, weekly_paid_work_hours)

# Step 8: Clean 'wam'
df_cleaned <- df_cleaned %>%
  mutate(wam = as.numeric(wam),
         wam = ifelse(wam < 0 | wam > 100, NA, wam)) %>%
  drop_na(wam)

# Step 9: Create 'weekly_exercise_level' categorical variable
df_cleaned <- df_cleaned %>%
  mutate(weekly_exercise_level = case_when(
    weekly_exercise_hours < 3 ~ "Low Intensity",
    weekly_exercise_hours >= 3 & weekly_exercise_hours <= 8 ~ "Moderate Intensity",
    weekly_exercise_hours > 8 ~ "High Intensity"
  ))

# Step 10: Create 'wam_level' categorical variable
df_cleaned <- df_cleaned %>%
  mutate(wam_level = case_when(
    wam < 50 ~ "Fail",
    wam >= 50 & wam <= 64 ~ "Pass",
    wam > 64 & wam <= 74 ~ "Credit",
    wam > 74 & wam <= 84 ~ "Distinction",
    wam > 84 ~ "High Distinction"
  ))

# Step 11: Write the cleaned data to a new CSV file (ensuring row names are not saved)
write.csv(df_cleaned, "cleaned_data.csv", row.names = FALSE)

# Optionally print a message confirming the CSV file was written
# print("Cleaned CSV file created successfully: cleaned_data.csv")
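
To check the written file, the cleaned data can be read back in and the two derived categorical variables tabulated. This is a minimal sketch, assuming cleaned_data.csv sits in the same directory as this report:

Code
library(readr)
library(dplyr)

# Read the cleaned file back in; show_col_types = FALSE keeps the output quiet
check <- read_csv("cleaned_data.csv", show_col_types = FALSE)
dim(check)

# Tabulate the derived categorical variables created in Steps 9 and 10
check %>% count(weekly_exercise_level)
check %>% count(wam_level)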

References

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.2) [Computer software]. https://doi.org/10.5281/zenodo.5960048
Beaudry, J., Kothe, E., Thorn, F. S., McGuire, R., Tierney, N., & Ling, M. (2020). gendercoder: Recodes sex/gender descriptions into a standard set. https://github.com/ropensci/gendercoder
Chang, W., Cheng, J., Allaire, J., Sievert, C., Schloerke, B., Xie, Y., Allen, J., McPherson, J., Dipert, A., & Borges, B. (2024). shiny: Web application framework for R. https://shiny.posit.co/
DATA2x02 students. (2024). DATA2x02 class survey. https://docs.google.com/spreadsheets/d/1CR33C_oUu2QqbKWshnk5pP_-wwRIx8z9RCtWN7cVVWw/pub?output=xlsx
Firke, S. (2023). janitor: Simple tools for examining and cleaning dirty data. https://CRAN.R-project.org/package=janitor
Iannone, R., Cheng, J., Schloerke, B., Hughes, E., Lauer, A., & Seo, J. (2023). gt: Easily create presentation-ready display tables. https://CRAN.R-project.org/package=gt
Kassambara, A. (2023). ggpubr: ggplot2-based publication ready plots. https://rpkgs.datanovia.com/ggpubr/
Müller, K., & Wickham, H. (2021). hms: Pretty time of day. https://CRAN.R-project.org/package=hms
R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Tarr, G. (2024). DATA2002 Assignment: Data Importing and Cleaning Guide. https://pages.github.sydney.edu.au/DATA2002/2024/assignment/assignment_data.html
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown