DATA1001 Project 1

Exploring data and its representations

Author

The Travellers

Recommendation/Insight

Executive Summary

  • Male students consume the highest average number of standard drinks per week.
  • We explored correlations between weekly alcohol intake and expected related variables to identify potential causes.
  • Stress level, number of university friends, and weekly rent do not account for this consumption.
  • Further research is needed to identify contributing factors.

Evidence

Code
packages = c("tidyverse", "rmdformats", "prettydoc", "yaml", "bslib", "readr", "dplyr", "ggplot2", "RColorBrewer", "sass", "colorspace", "plotly")

invisible(lapply(packages, function(pkg) suppressPackageStartupMessages(library(pkg, character.only = TRUE))))

data1 = read_csv('~/Downloads/data1001_survey_data_2025_S1.csv')

data_clean = data1 |>
  filter(
    friends_count >= 0, friends_count <= 50, age >= 16, age <= 60,
    hours_work >= 0, hours_work <= 40,       social_media_use >= 0, social_media_use <= 30,
    rent >= 0, rent <= 1500,                 highest_speed >= 0, highest_speed <= 300,
    dates >= 0, dates <= 300,                standard_drinks >= 0, standard_drinks <= 35, 
    countries >= 0, countries <= 30,         semesters >= 0, semesters <= 20,
    commute >= 0, commute <= 240,            mark_goal >= 50, mark_goal <= 100,
    hours_studying >= 0, hours_studying <= 20
  )

IDA

The data was sourced from a survey of DATA1001 students at the University of Sydney, covering cohorts from Semester 2, 2024 and Semester 1, 2025.

Code
gender_summary1 = data_clean |>
  count(gender) |>
  mutate(percent = n / sum(n) * 100)

custom_colors = c("#2f5182", lighten('#2f5182', amount=0.2), lighten('#2f5182', amount=0.5), lighten('#2f5182', amount=0.8))  

plot_ly(
  gender_summary1,
  labels = ~gender,
  values = ~n,
  type = 'pie',
  marker = list(colors = custom_colors),
  hoverinfo = 'label+percent+value') |>
    layout(title = list(
        text = "<b>Gender 
        Distribution</b>",
        font = list(
        size = 18,
        family = "Georgia"),
        x = 0.1,
        xanchor = "center",
        y = 0.95,
        yanchor = "top"),
        plot_bgcolor = "#d1d1c9",
        paper_bgcolor = "#d1d1c9")

Figure 1.1

Code
# Count respondents by student type and convert the counts to percentages
student_type_summary = data_clean |>
  count(student_type) |>
  mutate(percent = n / sum(n) * 100)

custom_colors = c("#2f5182", lighten('#2f5182', amount=0.2), lighten('#2f5182', amount=0.5), lighten('#2f5182', amount=0.8))  

plot_ly(
  student_type_summary,
  labels = ~student_type,
  values = ~n,
  type = 'pie',
  marker = list(colors = custom_colors),
  hoverinfo = 'label+percent+value') |>
    layout(title = list(
        text = "<b>Student Type
            Distribution</b>",
        font = list(
        size = 18,
        family = "Georgia"),
        x = 0.1,
        xanchor = "center",
        y = 0.95,
        yanchor = "top"),
        plot_bgcolor = "#d1d1c9",
        paper_bgcolor = "#d1d1c9")

Figure 1.2

2,104 individuals provided responses and consented to participate. The dataset included 28 variables, of which 6 were selected for analysis in this report (see Table 1).

Variable Name | Description | Classification
gender | Male, Female, Non-binary / Third gender, Prefer not to say | Qualitative Nominal
standard_drinks | Any number. Avg. per week. | Quantitative Continuous
student_type | “Domestic”, “International” | Qualitative Nominal
stress_level | Whole number from 0 (No Stress) to 10 (Worst Stress Imaginable). | Quantitative Discrete
friends_count | Any positive whole number. | Quantitative Discrete
rent | Any positive whole number. | Quantitative Continuous

Table 1: Variable names and classifications.

This report, as a cross-sectional study, cannot account for fluctuations in answers over time. This may affect the analysis of variables such as alcohol consumption, which may vary with factors like price changes. The data also represents a sample rather than a population, so findings cannot be reliably generalised to individuals outside the survey group. This may limit any recommendations made.

It was assumed that all self-reported data was honest and unaffected by bias. Weekly alcohol consumption was treated as a reflection of consistent behaviour rather than a one-off occurrence. Any unrealistic values were assumed to result from respondents entering their answers incorrectly or misinterpreting the survey question.

Code
data_clean = data1 |>
  filter(
    friends_count >= 0, friends_count <= 50, age >= 16, age <= 60,
    hours_work >= 0, hours_work <= 40,       social_media_use >= 0, social_media_use <= 30,
    rent >= 0, rent <= 1500,                 highest_speed >= 0, highest_speed <= 300,
    dates >= 0, dates <= 300,                standard_drinks >= 0, standard_drinks <= 35, 
    countries >= 0, countries <= 30,         semesters >= 0, semesters <= 20,
    commute >= 0, commute <= 240,            mark_goal >= 50, mark_goal <= 100,
    hours_studying >= 0, hours_studying <= 20
  )

Responses containing values deemed unrealistic were removed from the dataset using the thresholds in the code block above. This removed 11.84% (n = 254) of responses, which was expected to improve the validity of the results.
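The share of responses removed by a set of range filters can be checked by comparing row counts before and after filtering. The sketch below is illustrative only: `toy_survey` is a made-up five-row data frame rather than the survey data, and only two of the thresholds (for `standard_drinks` and `rent`) are applied, written in base R.

```r
# Hypothetical toy data for illustration; NOT the DATA1001 survey.
toy_survey = data.frame(
  standard_drinks = c(2, 0, 40, 5, NA),
  rent            = c(300, 2000, 450, 0, 500)
)

# Keep rows inside both ranges. Rows with a missing value are dropped too,
# which mirrors how dplyr::filter() treats conditions that evaluate to NA.
keep = with(toy_survey,
            !is.na(standard_drinks) & standard_drinks >= 0 & standard_drinks <= 35 &
            !is.na(rent) & rent >= 0 & rent <= 1500)
toy_clean = toy_survey[keep, ]

# Number and percentage of rows removed by the filter
n_removed   = nrow(toy_survey) - nrow(toy_clean)
pct_removed = n_removed / nrow(toy_survey) * 100
```

In this toy example three of the five rows are removed, one of them only because `standard_drinks` is missing. The same may apply to the real data: some removed responses could be incomplete rather than unrealistic.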

Research Question 1

Are there significant differences in self-reported weekly alcohol consumption between genders, and how do these differences interact with student type?

Code
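# Shorten the non-binary label and drop respondents who preferred not to state a gender
data_cleanQ1 = data_clean |>
  mutate(gender = recode(gender, "Non-binary / third gender" = "third gender")) |>
  filter(gender != "Prefer not to say")

# Mean weekly standard drinks for each student type and gender combination
df_summary = data_cleanQ1 |>
  group_by(student_type, gender) |>
  summarise(AvgDrinks = mean(standard_drinks, na.rm = TRUE), .groups = "drop") |>
  mutate(Category = paste(student_type, gender))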

desired_order = c("Domestic Male", "Domestic Female", "International Male", "International Female", "Domestic third gender", "International third gender")

df_summary$Category = factor(df_summary$Category, levels = desired_order)

ggplot(df_summary, aes(x = Category, y = AvgDrinks)) +
  geom_bar(stat = "identity", fill = "#2f5182",width=0.5) + 
  geom_text(aes(label = round(AvgDrinks, 1), fontface = "bold"),
            vjust = -0.5, 
            hjust = 1.2) +
  labs(
    title = "Standard Drinks per Week by Gender and Student Type",
    x = "Gender and Student Type",
    y = "Avg Standard Drinks per Week" ) +
  scale_y_continuous(limits = c(0, 3.7)) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, margin = margin(b = 20), family = 'Georgia', 
    colour = lighten('#825d0c', amount=0.1), face = "bold"),
    axis.text.x = element_text(angle = 30, hjust = 1, size = 9, face = "bold"),
    plot.margin = margin(t = 20, r = 0, b = 0, l = 20),
    plot.background = element_rect(fill = "#d1d1c9", color = NA),
    panel.grid = element_line(color = lighten('#825d0c', amount=0.5)))

Figure 2.1

Code
# Row-level data with a combined "student type + gender" label for the x-axis
df_summary2 = data_cleanQ1 |>
  mutate(Category = paste(student_type, gender))

desired_order2 = c("Domestic Male", "Domestic Female", "International Male", 
                   "International Female", "Domestic third gender", "International third gender")

df_summary2$Category = factor(df_summary2$Category, levels = desired_order2)

d = ggplot(df_summary2, aes(x = Category, y = standard_drinks)) +
  geom_boxplot(fill = lighten('#2f5182', amount = 0.2), color = darken('#2f5182', amount = 0.5),
               outlier.fill = lighten('#2f5182', amount = 0.2), outlier.size = 1.5,
               outlier.color = lighten('#2f5182', amount = 0.2)) +
  labs(title = "Distribution of Standard Drinks by Category",
        x = "Gender and Student Type",
        y = "Standard Drinks") +
  # theme_minimal() goes first: a complete theme added after theme() would reset the custom settings
  theme_minimal() +
  theme(plot.background = element_rect(fill = "#d1d1c9", color = NA),
        plot.title = element_text(hjust = 0.5, size = 14, margin = margin(b = 20), family = 'Georgia',
        colour = lighten('#825d0c', amount=0.1), face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1, size = 9, face = "bold"))

ggplotly(d) |>
  layout(plot_bgcolor = "#d1d1c9",
         paper_bgcolor = "#d1d1c9",
         title = list(
    x = 0.5,      
    font = list(
    size = 18,
    family = "Georgia",
    color = lighten('#825d0c', amount = 0.1) )),
    xaxis = list(
    tickangle = -30,
    tickfont = list(
    size = 13,
    family = "Georgia Bold"  
    )
  ))

Figure 2.2

Domestic and international male students reported the highest average alcohol consumption (3.5 and 3.4 standard drinks per week, respectively; see Figure 2.1). Female students consumed less, and third gender/non-binary students reported the lowest levels, particularly international students (0.7 standard drinks). Domestic students generally drank more than their international peers across all gender groups. However, results for third gender students should be interpreted with caution due to the small sample size.

These findings agree with literature reporting that male students consume more alcohol on average than female students (Papier et al., 2015). In this sample, male students consumed an average of one additional standard drink per week. Domestic students are also reported to be at higher risk of excess drinking (Sanci et al., 2022). Domestic females consumed 0.6 more drinks than international females, a much larger gap than the 0.1 difference observed between domestic and international males.

Outliers, visible in the boxplots (see Figure 2.2), can skew the mean and lead to misinterpretation. However, the bar chart of means (see Figure 2.1) remains a useful indicator of relative alcohol consumption when paired with boxplots showing the distribution.
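The effect of an outlier on the mean can be illustrated with a few made-up numbers. This is a minimal sketch: the `drinks` vector below is hypothetical and is not taken from the survey.

```r
# Hypothetical weekly standard-drink counts; NOT survey data.
drinks = c(0, 0, 1, 2, 2, 3, 4)

# The same values plus one extreme response at the upper cleaning limit (35)
drinks_with_outlier = c(drinks, 35)

mean_before   = mean(drinks)                  # about 1.71
mean_after    = mean(drinks_with_outlier)     # 5.875
median_before = median(drinks)                # 2
median_after  = median(drinks_with_outlier)   # 2
```

A single extreme value more than triples the mean here while the median does not move, which is why a bar chart of means is best read alongside a boxplot.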

Research Question 2 (Linear Model)

Is there a significant correlation between alcohol consumption among male students and stress level, number of university friends and weekly rent?

Code
data_male = data_clean |>
  filter(gender == "Male") |>
   mutate(friends_count = as.numeric(friends_count))

cor_coef = cor(data_male$stress, data_male$standard_drinks)

p1 = ggplot(data_male, aes(x = stress, y = standard_drinks)) +
  geom_point(color = "#2f5182", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = darken('#2f5182', amount = 0.5)) +
  labs(
    title = "Stress vs Standard Drinks for Males",
    x = "Stress Level",
    y = "Standard Drinks per Week"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, family = 'Georgia',
    colour = lighten('#825d0c', amount=0.1), face = "bold"),
    axis.text.x = element_text(face = "bold"),
    plot.background = element_rect(fill = "#d1d1c9", color = NA),
    panel.grid = element_line(color = lighten('#825d0c', amount=0.5)))
  
ggplotly(p1) |> layout(
  annotations = list(
    xref = "paper",
    yref = "paper",
    x = 0.90,
    y = 0.90,
    showarrow = FALSE,
    text = paste("r =", round(cor_coef, 4)),
    font = list(size = 20, color = "black")),
    plot_bgcolor = "#d1d1c9")

Figure 3.1

Code
cor_coef2 = cor(data_male$friends_count, data_male$standard_drinks)

p2 = ggplot(data_male, aes(x = friends_count, y = standard_drinks)) +
  geom_point(color = "#2f5182", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = darken('#2f5182', amount = 0.5) ) +
  labs(
    title = "Friend count vs Standard Drinks for Males",
    x = "Friend count",
    y = "Standard Drinks per Week"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, family = 'Georgia', 
    colour = lighten('#825d0c', amount=0.1), face = "bold"),
    axis.text.x = element_text(face = "bold"),
    plot.background = element_rect(fill = "#d1d1c9", color = NA),
    panel.grid = element_line(color = lighten('#825d0c', amount=0.5)))

ggplotly(p2) |> layout(
  annotations = list(
    xref = "paper",
    yref = "paper",
    x = 0.90,
    y = 0.90,
    showarrow = FALSE,
    text = paste("r =", round(cor_coef2, 4)),
    font = list(size = 20, color = "black")),
  plot_bgcolor = "#d1d1c9")

Figure 3.2

Code
cor_coef3 = cor(data_male$rent, data_male$standard_drinks)

p3 = ggplot(data_male, aes(x = rent, y = standard_drinks)) +
  geom_point(color = "#2f5182", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = darken('#2f5182', amount = 0.5) ) + 
  labs(
    title = "Rent vs Standard Drinks for Males",
    x = "Rent (AUD)",
    y = "Standard Drinks per Week"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, family = 'Georgia',
    colour = lighten('#825d0c', amount=0.1), face = "bold"),
    axis.text.x = element_text(face = "bold"),
    plot.background = element_rect(fill = "#d1d1c9", color = NA),
    panel.grid = element_line(color = lighten('#825d0c', amount=0.5)))

ggplotly(p3) |> layout(
  annotations = list(
    xref = "paper",
    yref = "paper",
    x = 0.90,
    y = 0.90,
    showarrow = FALSE,
    text = paste("r =", round(cor_coef3, 4)),
    font = list(size = 20, color = "black")),
    plot_bgcolor = "#d1d1c9")

Figure 3.3

People experiencing higher stress are reported to consume more alcohol (de Wit et al., 2003), so we expected a similar trend among survey respondents. Instead, the linear model showed a negligible correlation (r = -0.0177; see Figure 3.1), indicating no linear relationship between stress level and alcohol consumption. Thus, stress level could not adequately explain high alcohol consumption in male students.
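Since the research question asks about a significant correlation, it may help to see how r relates to the fitted line and how a p-value could be obtained. The following is a minimal sketch on simulated data: the vectors `stress` and `drinks` are generated here and are not the survey columns, and Pearson correlation (the default in `cor()` and `cor.test()`) is assumed.

```r
set.seed(1)  # reproducible simulated data; NOT the survey

# Simulated stress scores (0 to 10) and weekly drinks, generated independently
stress = sample(0:10, 200, replace = TRUE)
drinks = rpois(200, lambda = 3)

# Pearson correlation coefficient and a simple linear regression
r   = cor(stress, drinks)
fit = lm(drinks ~ stress)

# In simple linear regression, slope = r * sd(y) / sd(x)
slope_from_r = r * sd(drinks) / sd(stress)
slope_fitted = unname(coef(fit)["stress"])

# cor.test() tests the null hypothesis that the true correlation is zero
test    = cor.test(stress, drinks)
p_value = test$p.value
```

Here `slope_from_r` and `slope_fitted` agree, so r and the regression line describe the same linear relationship. The p-value from `cor.test()` is one way to judge whether an observed r differs from zero by more than chance.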

Individuals with friends who often drink are likely to drink similarly (Jones & Magee, 2014). A larger social network may increase exposure to heavy-drinking peers, raising the likelihood of adopting similar habits. The linear model gave r = 0.1480, a very weak positive correlation (see Figure 3.2). Because the correlation is so weak, friend count could not explain high alcohol consumption in male students.

Students living away from home are known to drink more than those living at home (Harford et al., 2002). Weekly rent was treated as a possible indicator of living away from home, and thus of alcohol consumption. The linear model gave r = 0.0051, representing no correlation (see Figure 3.3).

Stress level, friend count, and weekly rent could not account for high alcohol consumption in male students. We therefore cannot make any direct recommendations; further research is required to determine the factors influencing alcohol consumption. Because friend count showed the strongest, though still very weak, correlation (see Figure 3.2), we recommend a follow-up survey targeting social and interpersonal factors surrounding alcohol consumption.
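One way to put the three correlations in perspective is to square them: for a straight-line fit, r squared is the proportion of variance in the response that the predictor accounts for. The sketch below uses the r values reported with Figures 3.1 to 3.3; the vector names are chosen here for illustration.

```r
# r values as reported with Figures 3.1 to 3.3
r_values = c(stress = -0.0177, friends_count = 0.1480, rent = 0.0051)

# Squaring r gives the proportion of variance in weekly standard drinks
# that a straight-line fit on each variable would account for
r_squared = r_values^2

# Expressed as percentages, rounded to two decimal places
pct_explained = round(r_squared * 100, 2)
pct_explained
# stress 0.03, friends_count 2.19, rent 0.00
```

Even friend count, the strongest of the three, corresponds to roughly 2% of the variation in weekly drinks among male students.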

Declaration of Professional Ethics

Shared Values

In maintaining transparency about statistical methodologies used, and by making these methodologies public in this report, the shared professional value of truthfulness and integrity has been adhered to.

Ethical Principles

Through accurate description of results and explanatory power of the data utilised, as well as the recognition of the data’s limits, the ethical principle of maintaining confidence in statistics has been adhered to.

Articles

Papier, K., Ahmed, F., Lee, P., & Wiseman, J. (2015). Stress and dietary behaviour among first-year university students in Australia: Sex differences. Nutrition, 31(2), 324–330. https://doi.org/10.1016/j.nut.2014.08.004

Sanci, L., Williams, I., Russell, M., Chondros, P., Duncan, A.-M., Tarzia, L., Peter, D., Lim, M. S., Tomyn, A., & Minas, H. (2022). Towards a health promoting university: Descriptive findings on health, wellbeing and academic performance amongst university students in Australia. BMC Public Health, 22(1). https://doi.org/10.1186/s12889-022-14690-9

de Wit, H., Söderpalm, A. H. V., Nikolayev, L., & Young, E. (2003). Effects of acute social stress on alcohol consumption in healthy subjects. Alcoholism: Clinical & Experimental Research, 27(8), 1270–1277. https://doi.org/10.1097/01.alc.0000081617.37539.d6

Jones, S. C., & Magee, C. A. (2014). The role of family, friends and peers in Australian adolescent’s alcohol consumption. Drug and Alcohol Review, 33(3), 304–313. https://doi.org/10.1111/dar.12111

Harford, T. C., Wechsler, H., & Muthén, B. O. (2002). The impact of current residence and high school drinking on alcohol problems among college students. Journal of Studies on Alcohol, 63(3), 271–279. https://doi.org/10.15288/jsa.2002.63.271

Acknowledgements

Group meetings

26th March, 12:35pm - Raz, Dan, Steven

Contribution of Group Members

Raz: Creating theme, formatting and graphs

Dan: Writing, editing, finding trend that led to RQ1, template of report, and researching

Mox: Writing, editing report, made presentation

Steven: Research, finding research papers 

Geoffrey: Writing, editing report, made presentation

Resources Used 

https://youtu.be/CblUFMoC9yg?si=PMzGDK7zZxjrGC97 - creating themes in Quarto

https://quarto.org/docs/output-formats/html-themes.html - theme and formatting 

AI usage statement 

AI tools, including ChatGPT-4o, ChatGPT-4o Research, and ChatGPT-o3-mini-high, were used between April 4th and 9th to explore and implement functions beyond the scope of the DATA1001 curriculum. This included implementing CSS for custom themes, adding tabsets, styling and customizing graphs, and finding the root of error messages, among other enhancements. While AI contributed to the document’s enhanced functionality and aesthetics, its outputs often required significant human intervention and customization to align with the intended design and purpose.

“Sharing conversations with user uploaded images is not yet supported” - ChatGPT

At this moment we are unable to share the link to the entire prompt session.

Instead, view this example of the prompts used:

https://chatgpt.com/share/67f756ce-abe8-800d-84d4-46782675e85c