Data1001 Group Project

Author

Mia, Behin, Gabi, Latoya

library(dplyr)

# Read the survey export once, keeping text columns as character vectors
data1 <- read.csv("~/Desktop/data1001_survey_data_2025_S1 (1).csv", stringsAsFactors = FALSE)


data1_clean <- data1 %>%
  filter(!apply(., 1, function(row) any(grepl("I do not consent", row, ignore.case = TRUE))))


write.csv(data1_clean, "~/Desktop/data1001_survey_data_2025_S1 (1).csv", row.names = FALSE)
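
The consent filter above scans every column of every row. As a minimal sketch of an alternative, the example below filters on the `consent` column alone and writes the result to a separate file so the raw export is not overwritten. The toy data frame, the exact response strings and the output path are assumptions for illustration, not values taken from the survey.

```r
# Toy data standing in for the survey export (all values are made up).
survey <- data.frame(
  consent = c("I do consent", "I do not consent", "I do consent"),
  rent = c(300, 450, 0),
  hours_work = c(10, 0, 20),
  stringsAsFactors = FALSE
)

# Flag rows whose consent answer contains the refusal text.
refused <- grepl("I do not consent", survey$consent, ignore.case = TRUE)

# Keep only the rows that were not flagged.
survey_clean <- survey[!refused, ]

# Write to a new file so the raw export stays untouched.
out_path <- file.path(tempdir(), "survey_clean.csv")  # hypothetical output path
write.csv(survey_clean, out_path, row.names = FALSE)

print(nrow(survey_clean))
```

Writing to a separate file means the cleaning step can be rerun from the raw data at any time.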

Recommendation/Insight

Our analysis found that domestic students work more hours per week than international students, whereas international students pay more rent per week than domestic students. These results highlight the need to develop inexpensive housing for domestic students, particularly as housing costs increase and international enrollment decreases.

Evidence

IDA

Overview

The data was sourced from the DATA1001/1901 student survey, which collected 2103 submissions from students who consented to take part (approximately 97% of the DATA1001/1901 cohort). The survey contained 28 variables; our research concentrates on three of them: average rent paid per week, hours of work per week and student enrollment type (international or domestic). These variables are, respectively, quantitative continuous, quantitative continuous and qualitative nominal.

attach(data1_clean)
# Data
counts= c(1143, 918, 26, 16) # Number of people
labels= c("Female", "Male", "Non-binary / Third Gender","Prefer not to say") # Categories
colors= c("pink", "blue", "green", "gray")
percentages=round(counts / sum(counts) * 100, 0)
labels= paste(labels, percentages, "%") 
# Create pie chart
pie(counts, labels = labels, col = colors, main = "Gender Distribution")

attach(data1_clean)
The following objects are masked from data1_clean (pos = 3):

    age, cohort, commute, consent, countries, country_of_birth,
    country_of_birth_5_TEXT, data_interest, dates, drug_use_ans,
    drug_use_q, friends_count, gender, highest_speed, hours_studying,
    hours_work, learner_style, lecture_mode, mainstream_advanced,
    mark_goal, relationship_status, rent, semesters, social_media_use,
    standard_drinks, stress, student_type, study_type
counts= c(824, 1279) # Number of students
labels= c("International", "Domestic") 
colors= c("red", "blue") # Colors for each category
# Calculate percentages
percentages=round(counts / sum(counts) * 100, 1)
labels= paste(labels, percentages, "%") # Add percentages to labels
pie(counts, labels = labels, col = colors, main = "Student Type Distribution")
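
The pie charts above use counts typed in by hand. As a hedged sketch, the counts and percentage labels could instead be derived from the data with `table()`, which avoids transcription errors. The `student_type` vector below is toy data, not the survey column.

```r
# Toy vector standing in for data1_clean$student_type (values are made up).
student_type <- c("Domestic", "International", "Domestic", "Domestic", "International")

# Tally each category, then convert the tallies to percentages.
counts <- table(student_type)
percentages <- round(100 * prop.table(counts), 1)

# Build one label per category, e.g. category name followed by its share.
pie_labels <- paste0(names(counts), " ", percentages, "%")
print(pie_labels)

# The chart itself would then be drawn with:
# pie(counts, labels = pie_labels, col = c("blue", "red"), main = "Student Type Distribution")
```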

Limitations

Limitations of the data include response bias, which may occur when students under- or over-report their financial and work circumstances. Furthermore, students may have accidentally or intentionally misinterpreted a question, or made errors when entering data, for example by leaving values missing.

Assumptions

After removing all “I do not consent” responses from the original data set in RStudio using the dplyr package, we assumed that all participants who answered “I do consent” were truthful in their responses, as each participant followed a strict set of instructions when reporting rent paid per week and hours of paid work per week. We assumed that bias may still be present and cleaned the data by removing outliers: rent paid per week above $2000 and hours of work per week above 35 hours.
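
The cut-offs described above can be sketched as a simple filter. This is an illustration under stated assumptions: the data frame is toy data, the names `max_rent` and `max_hours` are hypothetical, and dropping rows with missing values is an assumption rather than something the report states.

```r
# Toy data standing in for data1_clean (all values are made up).
survey <- data.frame(
  rent = c(0, 350, 2500, 600, NA),
  hours_work = c(5, 40, 10, 35, 12)
)

max_rent <- 2000   # weekly rent cut-off stated in the text
max_hours <- 35    # weekly work-hours cut-off stated in the text

# Keep rows at or below both cut-offs.
# Assumption: rows with a missing rent or hours value are dropped.
keep <- !is.na(survey$rent) & !is.na(survey$hours_work) &
  survey$rent <= max_rent & survey$hours_work <= max_hours
survey_trimmed <- survey[keep, ]

print(survey_trimmed)
```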

Research Question 1

Does the number of hours worked per week vary between international and domestic students?

library(ggplot2)

ggplot(data1_clean, aes(x = hours_work, fill = student_type)) +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.5, color = "black") +  
  labs(title = "Comparative Histogram: Hours Worked (Domestic vs. International)",
       x = "Hours Worked per Week", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Domestic" = "blue", "International" = "red")) +  
  scale_x_continuous(limits = c(0, 30), breaks = seq(0, 30, by = 5)) +
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, by = 20))
Warning: Removed 118 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_bar()`).

summary_stats <- data1_clean %>%
  group_by(student_type) %>%
  summarise(Mean = mean(hours_work, na.rm = TRUE),
    Median = median(hours_work, na.rm = TRUE),
    Q1 = quantile(hours_work, 0.25, na.rm = TRUE),
    Q3 = quantile(hours_work, 0.75, na.rm = TRUE))


print(summary_stats)
# A tibble: 2 × 5
  student_type   Mean Median    Q1    Q3
  <chr>         <dbl>  <dbl> <dbl> <dbl>
1 Domestic       10.5      5     0    14
2 International  11.1      0     0    10

The number of hours worked per week varies moderately between international and domestic DATA1001/1901 students. In the comparative histogram, the domestic student data is more positively skewed, whereas the international student data shows a less pronounced skew. The difference is also reflected in the medians (domestic 5, international 0) and the IQRs (domestic 14, international 10). The difference in medians indicates that domestic students typically work longer hours. The IQR for international students is narrower, indicating greater consistency in the hours they work, while domestic students show higher variability in hours worked.
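
The IQR figures quoted above follow directly from the quartiles in the summary table, since the IQR is Q3 minus Q1. A short worked example using the reported quartiles (the data frame name `quartiles` is only for illustration):

```r
# Quartiles as reported in the summary table for hours worked.
quartiles <- data.frame(
  student_type = c("Domestic", "International"),
  Q1 = c(0, 0),
  Q3 = c(14, 10)
)

# The IQR is the distance between the third and first quartiles.
quartiles$IQR <- quartiles$Q3 - quartiles$Q1

print(quartiles)
```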

Article reasoning

Previous empirical research indicates that dependence on family overseas is the primary source of income for international students (Krause et al., 2005), reducing the need for such students to earn income through work, with only 10% of international students reporting an absence of external funding. Additionally, the financial reliance that Australian tertiary education institutions have come to place on international students as a ‘major service export industry’ (Forbes-Mewett et al., 2006, p. 10) has led to inflated and unsubsidised international tuition costs. This, coupled with high currency exchange rates and the income standards for obtaining a student visa, suggests that those prepared to study abroad are making choices actively informed by their financial capacity (Forbes-Mewett et al., 2006). The hours-worked result can therefore be seen as a reflection of differing income sources: international students likely work less because of stable passive incomes rather than because of lower living costs or wealth discrepancies.

Research Question 2 - Linear model

Is there a linear correlation between the number of hours worked and rent prices amongst domestic and international students?

library(ggplot2)
library(dplyr)

data1_clean$student_type <- as.factor(data1_clean$student_type)

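# Returns TRUE for values inside the 1.5 * IQR fences, FALSE for outliers and NAs
remove_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr_value <- Q3 - Q1  # named iqr_value to avoid masking stats::IQR()
  lower_bound <- Q1 - 1.5 * iqr_value
  upper_bound <- Q3 + 1.5 * iqr_value
  !is.na(x) & x >= lower_bound & x <= upper_bound
}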


valid_rows <- remove_outliers(data1_clean$hours_work) & remove_outliers(data1_clean$rent)


data1_clean <- data1_clean[valid_rows, ]


ggplot(data1_clean, aes(x = hours_work, y = rent, color = student_type)) +
  geom_point(alpha = 0.5) +  
  geom_smooth(method = "lm", se = FALSE, color = "black") + 
  labs(title = "Hours Worked vs Rent Between International and Domestic students",
       x = "Hours Worked by Data1001 (Per Week)",
       y = "Rent($) for Data1001 (Per Week)") +
  theme_minimal() + 
  scale_color_manual(values = c("Domestic" = "blue", "International" = "red")) +
  scale_y_continuous(limits = c(0, 2000))
`geom_smooth()` using formula = 'y ~ x'

correlation <- cor(data1_clean$hours_work, data1_clean$rent, use = "complete.obs")
print(paste("Correlation between Hours Worked and Rent: ", round(correlation, 2))) 
[1] "Correlation between Hours Worked and Rent:  -0.1"
library(dplyr)

rent_summary <- data1_clean %>%
  group_by(student_type) %>%
  summarise( Mean_Rent = mean(rent, na.rm = TRUE), Median_Rent = median(rent, na.rm = TRUE))
print(rent_summary)
# A tibble: 2 × 3
  student_type  Mean_Rent Median_Rent
  <fct>             <dbl>       <dbl>
1 Domestic           76.1           0
2 International     478.          500
lm_model <- lm(rent ~ hours_work, data = data1_clean)

data1_clean$residuals <- resid(lm_model)


ggplot(data1_clean, aes(x = hours_work, y = residuals, color = student_type)) +
  geom_point(alpha = 0.5) +  
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  
  labs(title = "Residual Plot: Hours Worked vs Rent",
       x = "Hours Worked by Data1001 (Per Week)",
       y = "Residuals (Rent - Predicted Rent)") +
  theme_minimal() + 
  scale_color_manual(values = c("Domestic" = "blue", "International" = "red"))

The scatter plot of hours worked against rent for domestic and international DATA1001/1901 students shows a very weak correlation between the two variables. Using RStudio, we calculated a correlation of -0.1, indicating that the linear association between rent and hours of work across domestic and international students is very weak. This lack of relationship can be explained by comparing the two groups: median hours worked were 5 (domestic) and 0 (international), while mean rent was approximately $76 (domestic) and $478 (international). In the residual plot the data is highly clustered, specifically along the y-axis, indicating that predominantly international students pay higher rent yet work few hours. The lack of homoscedasticity shown in the scattered data therefore indicates that a linear model does not describe the relationship between the variables well.
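
Because the two student types differ so much in both rent and hours worked, it can help to look at the correlation within each group as well as across the pooled data. The sketch below shows one way to do that in base R. The data frame is toy data, so the numbers it prints say nothing about the survey.

```r
# Toy data standing in for data1_clean (all values are made up).
survey <- data.frame(
  student_type = rep(c("Domestic", "International"), each = 4),
  hours_work = c(0, 5, 10, 20, 0, 0, 5, 10),
  rent = c(0, 0, 150, 300, 500, 450, 520, 480)
)

# Pearson correlation of hours worked and rent within each student type.
by_type <- sapply(split(survey, survey$student_type), function(d) {
  cor(d$hours_work, d$rent, use = "complete.obs")
})

# Correlation over the pooled data, as calculated in the report.
pooled <- cor(survey$hours_work, survey$rent, use = "complete.obs")

print(round(by_type, 2))
print(round(pooled, 2))
```

If the within-group correlations differ noticeably from the pooled value, student type is likely driving much of the overall pattern.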

Article reasoning

These insights suggest that other variables factor into the relationship between rent and work frequency: despite reporting greater rent expenditure, international students work less frequently than domestic students. Supporting research suggests this largely reflects stable offshore funding from family. Yet past empirical research also advises that international students are at high risk of income insecurity, often experiencing financial challenges despite this third-party funding (Forbes-Mewett et al., 2006). While our data affirms that international students can viably remain without work whilst meeting rent costs, this dependence gives rise to other financial stresses, as the academic expectations of those providing the funds force students to prioritise full-time study over seeking income beyond funded minimum living costs (Krause et al., 2005). Considering additional variables such as stress and pressure could therefore provide more comprehensive insight into the relationship between our variables. Irrespective of this, international students can viably remain without work while maintaining high rent costs, but this does not necessarily spare them from sacrificing financial flexibility.

Professional Standard Report

As a group, it was our priority that both the process and outcome of our report upheld values of integrity and professionalism, whereby all data gathered was reported and expressed truthfully, accurately and objectively. We remained conscious of how ethical principles such as research participant consent as well as anonymity must be consistently respected in order for our report to adequately reflect such intentions. We furthermore prioritised the transparent representation of our data, taking responsibility for any claims or inferences made.

Acknowledgements

Group Composition

First Name   Last Name    UniKey
Latoya       Kennedy      lken0992
Mia          Mitchell     mmit0965
Gabi         Johnson      gjoh0437
Behin        Moradifard   bmor0996

Group Meeting Log

             Date                                Time
Meeting 1:   27/03/25 - In-person - ALL          9am - 12pm
Meeting 2:   1/04/25 - In-person - ALL           8am - 11am
Meeting 3:   3/04/25 - In-person - ALL           10:30am - 12pm
Meeting 4:   4/04/25 - In-person - Mia & Gabi    10:30am - 12pm

Group Roles: All sections of the project will be divided evenly amongst each group member to ensure fair and equitable participation from all group members. 

Group Roles                        Assigned Group Member
Executive Summary                  Latoya
Initial Data Analysis (IDA)        Latoya & Mia & Behin
RQ1                                Mia & Gabi
RQ2                                Mia & Gabi
Articles                           Gabi
Acknowledgements                   Latoya & Behin
Professional Standard of Report    Gabi
Slideshow Presentation             Mia

Use of AI:

Acknowledgement: We used Copilot and ChatGPT to assist us with this assignment.

How We Used AI:

• AI was used to help us understand and debug coding errors.

• We used AI to generate simple code snippets, which we then adapted to fit our dataset and analysis.

• AI assisted in answering specific technical questions related to R programming, such as data visualisation and cleaning techniques.

How We Did Not Use AI:

• AI was not used to replace critical thinking or analysis.

• We did not use AI to rewrite sections of our assignment for clarity or grammar.

• AI was not used to generate content beyond our understanding; all interpretations and explanations are our own.

• All conceptualization, critical analysis, and final edits were conducted by the authors to ensure accuracy and adherence to academic integrity. Any potential errors or misinterpretations remain the responsibility of the authors.

Prompts We Used:

1. “How to add percentages and labels to pie charts in RStudio?”

values <- c(30, 20, 50)  # Example values
labels <- c("A", "B", "C")  # Labels
percentages <- round(values / sum(values) * 100, 1)  # Calculate percentages
labels_with_percent <- paste(labels, percentages, "%")  # Combine labels with percentages

# Create pie chart
pie(values, labels = labels_with_percent, main = "Pie Chart with Percentages")

2. “How to use data cleaning to get more accurate data in RStudio?”

# Remove outliers
clean_data <- subset(data, values >= lower_bound & values <= upper_bound)

# Print cleaned data
print(clean_data)

3. “How to add colors to graphs?”

values <- c(30, 20, 50)
labels <- c("A", "B", "C")

# Define colors
colors <- c("skyblue", "light green", "salmon")

# Create pie chart with colors
pie(values, labels = labels, col = colors, main = "Colored Pie Chart")

4. “How to Data clean to removing unnecessary data”

library(dplyr)

data1_clean <- data1 %>%
  filter(!if_any(everything(), ~ grepl("value", ., ignore.case = TRUE)))

5. “How to make a Comparative histogram”

Limits + IQR for the comparative histogram

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Example dataset
set.seed(123)
data1_clean <- data.frame(
  student_type = rep(c("Full-time", "Part-time"), each = 50),
  hours_work = c(rnorm(50, mean = 10, sd = 5), rnorm(50, mean = 20, sd = 7)))

# Compute Summary Statistics (IQR)
summary_stats <- data1_clean %>%
  group_by(student_type) %>%
  summarise(Mean = mean(hours_work, na.rm = TRUE),
    Median = median(hours_work, na.rm = TRUE),
    Q1 = quantile(hours_work, 0.25, na.rm = TRUE),
    Q3 = quantile(hours_work, 0.75, na.rm = TRUE))

print(summary_stats)  # Display the summary statistics

# Create Comparative Histogram
ggplot(data1_clean, aes(x = hours_work, fill = student_type)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 15) +
  scale_x_continuous(limits = c(0, 30), breaks = seq(0, 30, by = 5)) +
  scale_y_continuous(limits = c(0, 20), breaks = seq(0, 20, by = 5)) +
  labs(title = "Comparative Histogram of Hours Worked", x = "Hours Worked", y = "Count", fill = "Student Type") +
  theme_minimal()

6. “How to remove outliers from the scatterplot”

# Function to remove outliers
remove_outliers <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_value <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  return(x >= lower_bound & x <= upper_bound)
}


7. “How to create a Plot with color customization and scale limits”

ggplot(data, aes(x = hours_work)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black", alpha = 0.7) +
  scale_x_continuous(limits = c(10, 30), breaks = seq(10, 30, by = 5)) +  # Set X-axis limits
  scale_y_continuous(limits = c(0, 20), breaks = seq(0, 20, by = 5)) +  # Set Y-axis limits
  labs(title = "Histogram of Hours Worked", x = "Hours Worked", y = "Frequency") +
  theme_minimal()

References

Bexley, E., Daroesman, S., Arkoudis, S., & James, R. (2013). University Student Finances in 2012: A Study of the Financial Circumstances of Domestic and International Students in Australia’s Universities. In ERIC. Centre for the Study of Higher Education. https://eric.ed.gov/?id=ED558588

Forbes-Mewett, H., Chung, M., Marginson, S., Nyland, C., Sawir, E., & Ramia, G. (2006). Income security of international students in Australia (Version 1). Deakin University. https://hdl.handle.net/10536/DRO/DU:30009797

Krause, K. L., Hartley, R., James, R., & McInnis, C. (2005, January). The first year experience in Australian universities: Findings from a decade of national studies.