Introduction

This homework involves three distinct research questions covering hypothesis testing, correlation between numerical variables, and association between categorical variables. The analysis is performed step-by-step using the dataset titled Student Depression Dataset.

1 Data import

library(carData)
mydata <- read.table("/cloud/project/Student depression/Student Depression Dataset.csv",
                     header = TRUE,
                     sep = ",")
head(mydata)

##   id Gender Age          City Profession Academic.Pressure Work.Pressure CGPA
## 1  2   Male  33 Visakhapatnam    Student                 5             0 8.97
## 2  8 Female  24     Bangalore    Student                 2             0 5.90
## 3 26   Male  31      Srinagar    Student                 3             0 7.03
## 4 30 Female  28      Varanasi    Student                 3             0 5.59
## 5 32 Female  25        Jaipur    Student                 4             0 8.13
## 6 33   Male  29          Pune    Student                 2             0 5.70
##   Study.Satisfaction Job.Satisfaction    Sleep.Duration Dietary.Habits  Degree
## 1                  2                0         5-6 hours        Healthy B.Pharm
## 2                  5                0         5-6 hours       Moderate     BSc
## 3                  5                0 Less than 5 hours        Healthy      BA
## 4                  2                0         7-8 hours       Moderate     BCA
## 5                  3                0         5-6 hours       Moderate  M.Tech
## 6                  3                0 Less than 5 hours        Healthy     PhD
##   Have.you.ever.had.suicidal.thoughts.. Work.Study.Hours Financial.Stress
## 1                                   Yes                3                1
## 2                                    No                3                2
## 3                                    No                9                1
## 4                                   Yes                4                5
## 5                                   Yes                1                1
## 6                                    No                4                1
##   Family.History.of.Mental.Illness Depression
## 1                               No          1
## 2                              Yes          0
## 3                              Yes          0
## 4                              Yes          1
## 5                               No          0
## 6                               No          0
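The column names printed above contain dots because read.table() converts headers into syntactically valid R names by default (check.names = TRUE). A minimal sketch of that conversion follows. The raw header strings are assumptions, since the CSV header itself is not shown in this report.

```r
# Hypothetical raw headers; the exact header text of the CSV is an assumption.
raw_headers <- c("Work/Study Hours",
                 "Have you ever had suicidal thoughts ?",
                 "Family History of Mental Illness")

# make.names() replaces every character that is not valid in an R name with "."
clean_headers <- make.names(raw_headers)
print(clean_headers)
```

Under this assumption, the two trailing dots in Have.you.ever.had.suicidal.thoughts.. come from the space and the question mark, so the full name, dots included, is needed when assigning to that column.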

2 Data description

Source: Kaggle

The dataset contains information about students and their mental health indicators.

Unit of Observation: An individual student

Sample Size: 27,901 observations.

Description of the variables:

Gender: Male/Female.
Age: Age of the student (years).
City: City of residence.
Profession: Profession of the individual.
Academic Pressure: Self-reported academic pressure on a numerical scale.
Work Pressure: Self-reported work pressure on a numerical scale.
CGPA: Cumulative Grade Point Average.
Study Satisfaction: Satisfaction with studies (numerical scale).
Job Satisfaction: Satisfaction with job (numerical scale).
Sleep Duration: Sleep duration category (categorical).
Dietary Habits: Healthy/Moderate/Unhealthy.
Degree: Degree pursued or achieved.
Have you ever had suicidal thoughts?: Yes/No.
Work/Study Hours: Hours spent on work/study.
Financial Stress: Self-reported financial stress (numerical scale).
Family History of Mental Illness: Yes/No response.
Depression: Indicator for depression (0 = No, 1 = Yes)

Data Cleaning

mydata$Gender <- factor(mydata$Gender, 
                               levels = c("Male", "Female"), 
                               labels = c("Male", "Female"))

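# Note: the raw values are strings such as "Less than 5 hours", "5-6 hours" and
# "7-8 hours", so the levels below match nothing and every entry becomes NA
# (visible in the sample printed further down). Levels must list the raw values.
mydata$Sleep.Duration <- factor(mydata$Sleep.Duration, 
                               levels = c("Short", "Normal", "Long"), 
                               labels = c("Short", "Normal", "Long"))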

mydata$Dietary.Habits <- factor(mydata$Dietary.Habits, 
                               levels = c("Healthy", "Moderate", "Unhealthy"), 
                               labels = c("Healthy", "Moderate", "Unhealthy"))

mydata$Degree <- factor(mydata$Degree)
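# Note: the imported column is named Have.you.ever.had.suicidal.thoughts.. (two
# trailing dots). Assigning to the name without the dots adds a new factor
# column and leaves the original character column unchanged.
mydata$Have.you.ever.had.suicidal.thoughts <- factor(mydata$Have.you.ever.had.suicidal.thoughts, 
                               levels = c("Yes", "No"), 
                               labels = c("Yes", "No"))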

mydata$Family.History.of.Mental.Illness <- factor(mydata$Family.History.of.Mental.Illness, 
                               levels = c("Yes", "No"), 
                               labels = c("Yes", "No"))

mydata$Depression <- factor(mydata$Depression, 
                                   levels = c(0, 1), 
                                   labels = c("No", "Yes"))
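Two details of this kind of recoding are worth checking, sketched here on a small hypothetical data frame (toy and its values are illustrative, not taken from the dataset). First, factor() turns every value that is not listed in levels into NA. Second, assigning to a column with `$` requires the exact column name.

```r
# Toy data; the column names mimic the style of the dataset, the values are made up.
toy <- data.frame(Sleep.Duration = c("5-6 hours", "7-8 hours", "Less than 5 hours"),
                  Thoughts.. = c("Yes", "No", "Yes"))

# Inspect the raw values before choosing levels.
print(table(toy$Sleep.Duration, useNA = "ifany"))

# Levels that do not occur in the data turn every entry into NA.
mismatched <- factor(toy$Sleep.Duration, levels = c("Short", "Normal", "Long"))

# Levels that match the raw values keep the information.
matched <- factor(toy$Sleep.Duration,
                  levels = c("Less than 5 hours", "5-6 hours", "7-8 hours"))

# Reading with `$` partially matches "Thoughts" to "Thoughts..", but assigning
# with the inexact name adds a new column instead of replacing the old one.
toy$Thoughts <- factor(toy$Thoughts, levels = c("Yes", "No"))
print(names(toy))
```

Running table() on each categorical column before calling factor() is a cheap way to catch both problems.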

library(psych)
describe(mydata$CGPA)

##    vars     n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 27901 7.66 1.47   7.77    7.67 1.88   0  10    10 -0.11    -1.02 0.01

The average CGPA is 7.66 (out of 10), with a standard deviation of 1.47. The median is 7.77, so half of the students have a CGPA of 7.77 or lower and half have a higher one. The CGPA ranges from 0 to 10, showing that there are students at both extremes of the performance scale. The distribution is slightly left-skewed (skewness = -0.11): the mean lies below the median, so slightly more than half of the students score above the mean.
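As a rough illustration of how a negative skewness goes together with a mean below the median, here is a sketch on simulated data. The vector x and the helper skewness_moment() are hypothetical, and the moment-based formula is one common convention that may differ slightly from the one describe() applies.

```r
set.seed(42)
# Simulated left-skewed scores on a 0-10 scale (not the real CGPA values).
x <- 10 - rexp(1000, rate = 0.5)
x <- x[x >= 0]

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness_moment <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^(3 / 2)
}

print(c(mean = mean(x), median = median(x), skew = skewness_moment(x)))
```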

Because the full dataset is quite large, we draw a random sample of 150 students for the analysis.

set.seed(1)
mydataB <- mydata[sample(nrow(mydata), 150), ]
head(mydataB)

##           id Gender Age      City Profession Academic.Pressure Work.Pressure
## 17401  88001   Male  29    Meerut    Student                 5             0
## 24388 122837   Male  24 Ahmedabad    Student                 3             0
## 4775   23941 Female  21   Chennai    Student                 1             0
## 26753 134987 Female  33     Thane    Student                 4             0
## 13218  67005   Male  25     Delhi    Student                 3             0
## 26109 131660   Male  24 Bangalore    Student                 4             0
##       CGPA Study.Satisfaction Job.Satisfaction Sleep.Duration Dietary.Habits
## 17401 6.78                  3                0           <NA>      Unhealthy
## 24388 8.91                  1                0           <NA>      Unhealthy
## 4775  7.50                  4                0           <NA>      Unhealthy
## 26753 6.75                  2                0           <NA>        Healthy
## 13218 9.44                  4                0           <NA>       Moderate
## 26109 8.52                  3                0           <NA>      Unhealthy
##        Degree Have.you.ever.had.suicidal.thoughts.. Work.Study.Hours
## 17401    B.Ed                                   Yes               12
## 24388  B.Arch                                   Yes               11
## 4775      MBA                                    No               11
## 26753     MCA                                   Yes               11
## 13218  B.Tech                                    No                8
## 26109 B.Pharm                                   Yes                6
##       Financial.Stress Family.History.of.Mental.Illness Depression
## 17401                5                               No        Yes
## 24388                5                               No        Yes
## 4775                 4                              Yes        Yes
## 26753                4                               No        Yes
## 13218                5                               No         No
## 26109                4                               No        Yes
##       Have.you.ever.had.suicidal.thoughts
## 17401                                 Yes
## 24388                                 Yes
## 4775                                   No
## 26753                                 Yes
## 13218                                  No
## 26109                                 Yes
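The sampling step relies on set.seed() to make the draw reproducible. A small sketch on a hypothetical data frame (pop, with made-up values, and the helper draw_sample()) shows that the same seed returns the same rows and that sample() draws without replacement by default.

```r
# Hypothetical population of 1,000 rows; values are illustrative only.
pop <- data.frame(id = 1:1000, score = seq(0, 10, length.out = 1000))

draw_sample <- function(data, size, seed) {
  set.seed(seed)                     # fix the random number generator state
  data[sample(nrow(data), size), ]   # row indices drawn without replacement
}

s1 <- draw_sample(pop, 150, seed = 1)
s2 <- draw_sample(pop, 150, seed = 1)

print(identical(s1, s2))       # checks that both draws contain the same rows
print(anyDuplicated(s1$id))    # 0 means no row was drawn twice
```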

Research Question 1: Hypothesis Testing

Is there a significant difference in CGPA between students with and without depression?

Parametric Test (Independent Samples t-test)

H0: μNo = μYes or μNo - μYes = 0 (The mean CGPA of students with and without depression is equal).

H1: μNo ≠ μYes or μNo - μYes ≠ 0 (The mean CGPA of students with and without depression is not equal).

Non-Parametric Test (Wilcoxon Rank Sum Test)

H0: The location of the CGPA distribution is the same for students with and without depression.

H1: The location of the CGPA distribution is different for students with and without depression.

Assumptions:

Numeric Variable: CGPA is numeric. Assumption met.

Normal Distribution: CGPA should be normally distributed in each group. This is checked below with histograms, Q-Q plots, and the Shapiro-Wilk test.

3.1.1 Analysis

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydataB, aes(x = CGPA)) +
  geom_histogram(binwidth = 0.5, colour = "blue", fill = "lightblue") +
  facet_wrap(~Depression, ncol = 1) +
  ylab("Frequency") +
  xlab("Distribution of CGPA")

# install.packages("ggpubr")  # run once if the package is not installed

library(ggpubr)
ggqqplot(mydataB, "CGPA", facet.by = "Depression")

In both groups the points deviate from the diagonal reference line. This suggests that CGPA is not normally distributed for students with or without depression.

shapiro.test(mydataB$CGPA)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydataB$CGPA
## W = 0.94559, p-value = 1.434e-05

We reject the null hypothesis of normality (p < 0.001) and conclude that CGPA is not normally distributed in the sample.
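The normality assumption of the t-test refers to each group separately, whereas the call above tests the pooled sample. A per-group check could be sketched as follows. The data frame sim and its columns are simulated stand-ins, so the printed values say nothing about the real data.

```r
set.seed(1)
# Simulated stand-in for mydataB: a numeric score and a two-level factor.
sim <- data.frame(
  CGPA = c(runif(65, 5, 10), runif(85, 5, 10)),
  Depression = factor(rep(c("No", "Yes"), times = c(65, 85)))
)

# Run the Shapiro-Wilk test within each level of the grouping factor.
by_group <- lapply(split(sim$CGPA, sim$Depression), shapiro.test)

# Collect W and the p-value for each group in a small table.
result <- t(sapply(by_group, function(res) {
  c(W = unname(res$statistic), p.value = res$p.value)
}))
print(round(result, 4))
```

Applied to the real sample, the same pattern would use mydataB$CGPA and mydataB$Depression in place of the simulated columns.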

wilcox.test(mydataB$CGPA ~ mydataB$Depression,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  mydataB$CGPA by mydataB$Depression
## W = 2407.5, p-value = 0.1782
## alternative hypothesis: true location shift is not equal to 0

We cannot reject the null hypothesis (p = 0.1782). There is no significant difference in the location of the CGPA distribution between students with and without depression.
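For intuition, the W statistic reported by wilcox.test() is the rank sum of the first group minus its smallest possible value, n1(n1 + 1)/2. The sketch below reproduces it on simulated data. The vectors score and group are hypothetical stand-ins for CGPA and Depression.

```r
set.seed(2)
# Simulated stand-ins; values are not from the dataset.
score <- round(runif(40, 5, 10), 2)
group <- factor(rep(c("No", "Yes"), times = c(18, 22)))

# Rank all observations together (ties receive average ranks).
r <- rank(score)
n1 <- sum(group == "No")

# W for the first factor level: rank sum minus n1 * (n1 + 1) / 2.
w_manual <- sum(r[group == "No"]) - n1 * (n1 + 1) / 2

w_builtin <- unname(wilcox.test(score ~ group, correct = FALSE,
                                exact = FALSE)$statistic)
print(c(manual = w_manual, builtin = w_builtin))
```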

t.test(mydataB$CGPA ~ mydataB$Depression, 
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  mydataB$CGPA by mydataB$Depression
## t = -1.3649, df = 130.11, p-value = 0.1746
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -0.7989783  0.1465983
## sample estimates:
##  mean in group No mean in group Yes 
##          7.509692          7.835882
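The Welch statistic and its degrees of freedom can be reproduced by hand with the standard Welch-Satterthwaite formulas. The sketch below does so on simulated data (a and b are hypothetical samples, not the real groups) and compares the result with t.test().

```r
set.seed(3)
a <- rnorm(65, mean = 7.5, sd = 1.5)   # simulated "No" group
b <- rnorm(85, mean = 7.8, sd = 1.4)   # simulated "Yes" group

# Squared standard errors of the two group means.
se2_a <- var(a) / length(a)
se2_b <- var(b) / length(b)

# Welch t statistic: difference in means over the combined standard error.
t_manual <- (mean(a) - mean(b)) / sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation of the degrees of freedom.
df_manual <- (se2_a + se2_b)^2 /
  (se2_a^2 / (length(a) - 1) + se2_b^2 / (length(b) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

builtin <- t.test(a, b, var.equal = FALSE)
print(c(t = t_manual, df = df_manual, p = p_manual))
```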

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following object is masked from 'package:psych':
## 
##     phi

effectsize(wilcox.test(mydataB$CGPA ~ mydataB$Depression,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## -0.13             | [-0.31, 0.06]

interpret_rank_biserial(0.13, rules = "funder2019")

## [1] "small"
## (Rules: funder2019)

The Welch t-test gives p = 0.1746, so we cannot reject the null hypothesis: there is no statistically significant difference in mean CGPA between students with and without depression. The effect size is also small (rank-biserial r = -0.13, 95% CI [-0.31, 0.06]), and its confidence interval includes zero.
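The reported effect size can be checked against the Wilcoxon output. One common formulation of the rank-biserial correlation is r = 2W/(n1 n2) - 1. The sketch below plugs in W = 2407.5 from the test above and the group sizes of the sample (65 students without and 85 with depression, as in the contingency table of Research Question 3). It illustrates the arithmetic and is not necessarily the exact routine effectsize uses internally.

```r
W  <- 2407.5   # Wilcoxon statistic reported above
n1 <- 65       # students without depression in the sample
n2 <- 85       # students with depression in the sample

# Rank-biserial correlation from the Mann-Whitney statistic.
r_rb <- 2 * W / (n1 * n2) - 1
print(round(r_rb, 2))
```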

3.1.2 Conclusion

In this case, the non-parametric Wilcoxon Rank Sum test is more suitable because the Shapiro-Wilk test demonstrated that the variable CGPA is not normally distributed for the sample.

Based on the Wilcoxon Rank Sum test, we cannot reject the null hypothesis, so we cannot conclude that students with and without depression differ in the location of their CGPA distribution.

Research Question 2: Correlation Analysis

Is there a correlation between Work/Study Hours and CGPA?

H0: ρWork/Study Hours, CGPA = 0 (No correlation)

H1: ρWork/Study Hours, CGPA ≠ 0 (Correlation)

3.2.1 Analysis

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(mydataB[ , c("Work.Study.Hours", "CGPA")], smooth = FALSE)

The linear relationship between Work/Study Hours and CGPA appears positive but weak.

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydataB[ , c("Work.Study.Hours", "CGPA")]), 
      type = "pearson")

##                  Work.Study.Hours CGPA
## Work.Study.Hours             1.00 0.09
## CGPA                         0.09 1.00
## 
## n= 150 
## 
## 
## P
##                  Work.Study.Hours CGPA  
## Work.Study.Hours                  0.2677
## CGPA             0.2677

cor(mydataB$Work.Study.Hours, mydataB$CGPA,
    method = "pearson")

## [1] 0.09106778

3.2.2 Conclusion

Based on the Pearson correlation test, we cannot reject the null hypothesis. The correlation coefficient (r = 0.09) indicates a weak positive linear relationship between Work/Study Hours and CGPA, but this relationship is not statistically significant (p = 0.2677).
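The p-value follows from the correlation coefficient and the sample size through the statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), which has n - 2 degrees of freedom under H0. The sketch below applies this to the values reported above and cross-checks the helper against cor.test() on simulated data. The function pearson_p() and the vectors x and y are hypothetical.

```r
# Two-sided p-value for H0: rho = 0, given a Pearson r and the sample size.
pearson_p <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

# Values reported above: r = 0.09106778 with n = 150.
p_reported <- pearson_p(0.09106778, 150)
print(round(p_reported, 4))

# Cross-check against cor.test() on simulated data (not the real sample).
set.seed(4)
x <- rnorm(150)
y <- 0.1 * x + rnorm(150)
p_check <- c(manual = pearson_p(cor(x, y), 150),
             builtin = cor.test(x, y)$p.value)
print(p_check)
```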

Research Question 3: Association Analysis

Is there an association between Gender and Depression?

H0: There is no association between Gender and Depression.

H1: There is an association between Gender and Depression.

Observations are independent: Each observation (student) is independent, as gender and depression status for one individual do not influence others. This assumption is met.

Expected frequencies: All expected frequencies should be greater than 5, or at least 80% of expected frequencies should exceed 5.

3.3.1 Analysis

chi_square <- chisq.test(mydataB$Gender, mydataB$Depression, 
                        correct = TRUE)

chi_square

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydataB$Gender and mydataB$Depression
## X-squared = 2.1829, df = 1, p-value = 0.1396

addmargins(chi_square$observed)

##               mydataB$Depression
## mydataB$Gender  No Yes Sum
##         Male    35  57  92
##         Female  30  28  58
##         Sum     65  85 150

addmargins(round(chi_square$expected, 2))

##               mydataB$Depression
## mydataB$Gender    No   Yes Sum
##         Male   39.87 52.13  92
##         Female 25.13 32.87  58
##         Sum    65.00 85.00 150

All expected frequencies are greater than 5, satisfying the assumption for the Chi-Square test. We cannot reject H0. The Chi-Square test indicates no significant association between Gender and Depression (p = 0.1396).
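The expected frequencies and the test statistic can be reproduced from the observed table alone. The sketch below rebuilds the table from the counts printed above, computes each expected count as row total times column total divided by n, and applies Yates' continuity correction by hand. The object names are illustrative.

```r
# Observed counts from the table above.
observed <- matrix(c(35, 57,
                     30, 28),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Depression = c("No", "Yes")))

n <- sum(observed)

# Expected count for each cell: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n
print(round(expected, 2))

# Chi-square statistic with Yates' continuity correction (df = 1 for a 2x2 table).
chi2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi2_yates, df = 1, lower.tail = FALSE)
print(round(c(X2 = chi2_yates, p = p_value), 4))
```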

library(effectsize)
effectsize::cramers_v(mydataB$Gender, mydataB$Depression)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.11              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.11)

## [1] "small"
## (Rules: funder2019)

The effect size (Cramer’s V = 0.11) is small, reinforcing the conclusion that Gender and Depression are weakly associated in this sample.
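To show where a value of this size comes from, the sketch below computes Cramer's V from the observed counts, both unadjusted and with a small-sample bias correction. Treating the "(adj.)" label in the output as this particular correction is an assumption, and the variable names are illustrative.

```r
# Observed counts (rows: Male, Female; columns: No, Yes).
observed <- matrix(c(35, 57,
                     30, 28),
                   nrow = 2, byrow = TRUE)

n <- sum(observed)
k <- nrow(observed)
l <- ncol(observed)

# Chi-square statistic without continuity correction.
chi2 <- unname(chisq.test(observed, correct = FALSE)$statistic)
phi2 <- chi2 / n

# Unadjusted Cramer's V.
v_raw <- sqrt(phi2 / min(k - 1, l - 1))

# Bias-corrected version (assumed to correspond to the "(adj.)" label).
phi2_adj <- max(0, phi2 - (k - 1) * (l - 1) / (n - 1))
k_adj <- k - (k - 1)^2 / (n - 1)
l_adj <- l - (l - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(k_adj - 1, l_adj - 1))

print(round(c(raw = v_raw, adjusted = v_adj), 2))
```

The correction shrinks the estimate toward zero, which matters most in small samples where the unadjusted V tends to overstate the association.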

3.3.2 Conclusion

In this case, the Pearson Chi-Square test shows that there is no significant association between Gender and Depression, as the p-value is greater than 0.05. The small effect size further supports this finding. This suggests that in the sample, depression prevalence does not differ meaningfully based on gender.

Conclusion

CGPA does not differ significantly between students with and without depression, as shown by both the parametric and the non-parametric test. The effect size, measured by the rank-biserial correlation (r = -0.13), is small, indicating no meaningful difference in CGPA between the two groups.

Work/Study Hours and CGPA show a weak positive correlation (r = 0.09) that is not statistically significant (p = 0.2677). In this sample there is no evidence that the time students dedicate to work or study is linearly related to their academic performance as measured by CGPA.

No significant association was found between Gender and Depression, as indicated by the Pearson Chi-Square test. The effect size, measured by Cramer’s V, is small, reinforcing the conclusion that gender does not meaningfully affect the prevalence of depression in this sample.

Homework assignment 1 at course MVA

Simon Sever

2025-01-15