This homework involves three distinct research questions covering hypothesis testing, correlation between numerical variables, and association between categorical variables. The analysis is performed step-by-step using the dataset titled Student Depression Dataset.
library(carData)
mydata <- read.table("/cloud/project/Student depression/Student Depression Dataset.csv",
header = TRUE,
sep = ",")
head(mydata)
## id Gender Age City Profession Academic.Pressure Work.Pressure CGPA
## 1 2 Male 33 Visakhapatnam Student 5 0 8.97
## 2 8 Female 24 Bangalore Student 2 0 5.90
## 3 26 Male 31 Srinagar Student 3 0 7.03
## 4 30 Female 28 Varanasi Student 3 0 5.59
## 5 32 Female 25 Jaipur Student 4 0 8.13
## 6 33 Male 29 Pune Student 2 0 5.70
## Study.Satisfaction Job.Satisfaction Sleep.Duration Dietary.Habits Degree
## 1 2 0 5-6 hours Healthy B.Pharm
## 2 5 0 5-6 hours Moderate BSc
## 3 5 0 Less than 5 hours Healthy BA
## 4 2 0 7-8 hours Moderate BCA
## 5 3 0 5-6 hours Moderate M.Tech
## 6 3 0 Less than 5 hours Healthy PhD
## Have.you.ever.had.suicidal.thoughts.. Work.Study.Hours Financial.Stress
## 1 Yes 3 1
## 2 No 3 2
## 3 No 9 1
## 4 Yes 4 5
## 5 Yes 1 1
## 6 No 4 1
## Family.History.of.Mental.Illness Depression
## 1 No 1
## 2 Yes 0
## 3 Yes 0
## 4 Yes 1
## 5 No 0
## 6 No 0
Source: Kaggle
The dataset contains information about students and their mental health indicators.
Unit of Observation: An individual student
Sample Size: 27,901 observations.
Description of the variables:
Gender: Male/Female.
Age: Age of the student (years).
City: City of residence.
Profession: Profession of the individual.
Academic Pressure: Self-reported academic pressure on a numerical scale.
Work Pressure: Self-reported work pressure on a numerical scale.
CGPA: Cumulative Grade Point Average.
Study Satisfaction: Satisfaction with studies (numerical scale).
Job Satisfaction: Satisfaction with job (numerical scale).
Sleep Duration: Sleep duration category (categorical).
Dietary Habits: Healthy/Moderate/Unhealthy.
Degree: Degree pursued or achieved.
Have you ever had suicidal thoughts?: Yes/No.
Work/Study Hours: Hours spent on work/study.
Financial Stress: Self-reported financial stress (numerical scale).
Family History of Mental Illness: Yes/No response.
Depression: Indicator for depression (0 = No, 1 = Yes)
mydata$Gender <- factor(mydata$Gender,
levels = c("Male", "Female"),
labels = c("Male", "Female"))
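# Let factor() take the levels from the values stored in the data (e.g. "5-6 hours");
# levels such as "Short"/"Normal"/"Long" match no value and would turn every entry into NA.
mydata$Sleep.Duration <- factor(mydata$Sleep.Duration)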
mydata$Dietary.Habits <- factor(mydata$Dietary.Habits,
levels = c("Healthy", "Moderate", "Unhealthy"),
labels = c("Healthy", "Moderate", "Unhealthy"))
mydata$Degree <- factor(mydata$Degree)
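# read.table() names this column with two trailing dots; using the exact name
# converts it in place instead of creating a second column.
mydata$Have.you.ever.had.suicidal.thoughts.. <- factor(mydata$Have.you.ever.had.suicidal.thoughts..,
levels = c("Yes", "No"),
labels = c("Yes", "No"))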
mydata$Family.History.of.Mental.Illness <- factor(mydata$Family.History.of.Mental.Illness,
levels = c("Yes", "No"),
labels = c("Yes", "No"))
mydata$Depression <- factor(mydata$Depression,
levels = c(0, 1),
labels = c("No", "Yes"))
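One detail of `factor()` matters for the conversions above: any value that is not listed in `levels` is silently turned into `NA`. The sketch below uses a small made-up vector (`sleep_raw` is hypothetical, not taken from the dataset) to show the effect and one way to check for it.

```r
# Hypothetical raw values, in the style of the Sleep.Duration column
sleep_raw <- c("5-6 hours", "Less than 5 hours", "7-8 hours", "5-6 hours")

# Levels that do not occur in the data: every value becomes NA
mismatched <- factor(sleep_raw, levels = c("Short", "Normal", "Long"))

# Levels that match the stored strings: nothing is lost
matched <- factor(sleep_raw,
                  levels = c("Less than 5 hours", "5-6 hours", "7-8 hours"))

# Quick check: count the NAs introduced by each conversion
na_counts <- c(mismatched = sum(is.na(mismatched)),
               matched    = sum(is.na(matched)))
print(na_counts)
```

Running `unique()` or `table()` on the raw column before calling `factor()` shows which strings the levels have to match.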
library (psych)
describe(mydata$CGPA)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 27901 7.66 1.47 7.77 7.67 1.88 0 10 10 -0.11 -1.02 0.01
The average CGPA is 7.66 (out of 10), with a standard deviation of 1.47. The median is 7.77, so half of the students have a CGPA above 7.77 and half below. CGPA ranges from 0 to 10, so both extremes of the scale occur. The distribution is slightly left-skewed (skew = -0.11): the median lies a little above the mean, so slightly more than half of the students score above average.
Since the sample size is quite large, we randomly draw 150 units for the analysis.
set.seed(1)
mydataB <- mydata[sample(nrow(mydata), 150), ]
head(mydataB)
## id Gender Age City Profession Academic.Pressure Work.Pressure
## 17401 88001 Male 29 Meerut Student 5 0
## 24388 122837 Male 24 Ahmedabad Student 3 0
## 4775 23941 Female 21 Chennai Student 1 0
## 26753 134987 Female 33 Thane Student 4 0
## 13218 67005 Male 25 Delhi Student 3 0
## 26109 131660 Male 24 Bangalore Student 4 0
## CGPA Study.Satisfaction Job.Satisfaction Sleep.Duration Dietary.Habits
## 17401 6.78 3 0 <NA> Unhealthy
## 24388 8.91 1 0 <NA> Unhealthy
## 4775 7.50 4 0 <NA> Unhealthy
## 26753 6.75 2 0 <NA> Healthy
## 13218 9.44 4 0 <NA> Moderate
## 26109 8.52 3 0 <NA> Unhealthy
## Degree Have.you.ever.had.suicidal.thoughts.. Work.Study.Hours
## 17401 B.Ed Yes 12
## 24388 B.Arch Yes 11
## 4775 MBA No 11
## 26753 MCA Yes 11
## 13218 B.Tech No 8
## 26109 B.Pharm Yes 6
## Financial.Stress Family.History.of.Mental.Illness Depression
## 17401 5 No Yes
## 24388 5 No Yes
## 4775 4 Yes Yes
## 26753 4 No Yes
## 13218 5 No No
## 26109 4 No Yes
## Have.you.ever.had.suicidal.thoughts
## 17401 Yes
## 24388 Yes
## 4775 No
## 26753 Yes
## 13218 No
## 26109 Yes
Is there a significant difference in CGPA between students with and without depression?
Parametric Test (Independent Samples t-test)
H0: μNo = μYes or μNo - μYes = 0 (The mean CGPA of students without and with depression is equal).
H1: μNo ≠ μYes or μNo - μYes ≠ 0 (The mean CGPA of students without and with depression is not equal).
Non-Parametric Test (Wilcoxon Rank Sum Test)
H0: The distribution location of CGPA is the same for students with and without depression.
H1: The distribution location of CGPA is different for students with and without depression.
Assumptions:
Numeric Variable: CGPA is numeric — Assumption Met.
Normal Distribution: CGPA should be normally distributed within each group; this is checked below with histograms, Q-Q plots, and the Shapiro-Wilk test.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydataB, aes(x = CGPA)) +
geom_histogram(binwidth = 0.5, colour = "blue", fill = "lightblue") +
facet_wrap(~Depression, ncol = 1) +
ylab("Frequency") +
xlab("Distribution of CGPA")
# install.packages("ggpubr")  # run once if ggpubr is not yet installed
library(ggpubr)
ggqqplot(mydataB, "CGPA", facet.by = "Depression")
We can see deviations from the diagonal reference line in both groups (depressed and not depressed), which suggests that CGPA is not normally distributed in either group.
shapiro.test(mydataB$CGPA)
##
## Shapiro-Wilk normality test
##
## data: mydataB$CGPA
## W = 0.94559, p-value = 1.434e-05
We reject the null hypothesis of normality (W = 0.946, p < 0.001) and conclude that CGPA is not normally distributed in the sample. Note that this test pools both groups.
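The normality assumption is stated per group, whereas `shapiro.test(mydataB$CGPA)` looks at all 150 students at once. A per-group check could look like the sketch below. It runs on simulated data so that it is self-contained: `demo`, its values, and the group sizes are assumptions for illustration, not results from the dataset.

```r
set.seed(42)

# Simulated stand-in for mydataB: a numeric CGPA and a two-level Depression factor
demo <- data.frame(
  CGPA       = round(runif(150, min = 5, max = 10), 2),
  Depression = factor(rep(c("No", "Yes"), times = c(65, 85)))
)

# Run the Shapiro-Wilk test separately within each Depression group
shapiro_by_group <- lapply(split(demo$CGPA, demo$Depression), shapiro.test)

# Collect W and the p-value per group in one small table
shapiro_summary <- data.frame(
  group   = names(shapiro_by_group),
  W       = sapply(shapiro_by_group, function(res) unname(res$statistic)),
  p_value = sapply(shapiro_by_group, function(res) res$p.value)
)
print(shapiro_summary)
```

With the report's data, `demo` would simply be replaced by `mydataB`.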
wilcox.test(mydataB$CGPA ~ mydataB$Depression,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydataB$CGPA by mydataB$Depression
## W = 2407.5, p-value = 0.1782
## alternative hypothesis: true location shift is not equal to 0
We cannot reject the null hypothesis (W = 2407.5, p = 0.1782). There is no significant difference in the distribution location of CGPA between students with and without depression.
t.test(mydataB$CGPA ~ mydataB$Depression,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: mydataB$CGPA by mydataB$Depression
## t = -1.3649, df = 130.11, p-value = 0.1746
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -0.7989783 0.1465983
## sample estimates:
## mean in group No mean in group Yes
## 7.509692 7.835882
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize(wilcox.test(mydataB$CGPA ~ mydataB$Depression,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.13 | [-0.31, 0.06]
interpret_rank_biserial(0.13, rules = "funder2019")
## [1] "small"
## (Rules: funder2019)
With p = 0.1746, we cannot reject the null hypothesis of the Welch t-test: there is no statistically significant difference in mean CGPA between students without and with depression (7.51 vs. 7.84; 95% CI for the difference: -0.80 to 0.15). The effect size is also small (rank-biserial r = -0.13, 95% CI [-0.31, 0.06]).
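The rank-biserial value can be reproduced by hand from the Wilcoxon output, which makes its meaning easier to see. The sketch below assumes the usual two-sample identity r = 2W / (n1 * n2) - 1 and takes the group sizes (65 students without depression, 85 with) from the Gender by Depression table later in the report; the variable names are illustrative.

```r
# Values reported in the analysis
W     <- 2407.5  # Wilcoxon rank sum statistic
n_no  <- 65      # students without depression in the sample
n_yes <- 85      # students with depression in the sample

# W counts the (No, Yes) pairs in which the "No" student has the higher CGPA
# (ties count one half), so W / (n_no * n_yes) is the share of such pairs
share_no_higher <- W / (n_no * n_yes)

# Rank-biserial correlation: share of pairs favouring "No" minus share favouring "Yes"
r_rank_biserial <- share_no_higher - (1 - share_no_higher)

print(round(share_no_higher, 3))
print(round(r_rank_biserial, 2))
```

The result rounds to -0.13, matching the `effectsize()` output: in roughly 44% of all pairs the student without depression has the higher CGPA, which is close to the 50% expected if the two groups did not differ.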
In this case, the non-parametric Wilcoxon Rank Sum test is more suitable because the Shapiro-Wilk test demonstrated that the variable CGPA is not normally distributed for the sample.
Based on the Wilcoxon Rank Sum test, we cannot reject the null hypothesis (p = 0.1782): the sample provides no evidence that students with and without depression differ in the distribution location of their CGPA.
Is there a correlation between Work/Study Hours and CGPA?
H0: ρ(Work/Study Hours, CGPA) = 0 (No correlation)
H1: ρ(Work/Study Hours, CGPA) ≠ 0 (Correlation)
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(mydataB[ , c("Work.Study.Hours", "CGPA")], smooth = FALSE)
The scatterplot suggests a weak positive linear relationship between Work/Study Hours and CGPA.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydataB[ , c("Work.Study.Hours", "CGPA")]),
type = "pearson")
## Work.Study.Hours CGPA
## Work.Study.Hours 1.00 0.09
## CGPA 0.09 1.00
##
## n= 150
##
##
## P
## Work.Study.Hours CGPA
## Work.Study.Hours 0.2677
## CGPA 0.2677
cor(mydataB$Work.Study.Hours,mydataB$CGPA,
method="pearson")
## [1] 0.09106778
Based on the Pearson correlation test, we cannot reject the null hypothesis. The correlation coefficient (r = 0.09) indicates a weak positive linear relationship between Work/Study Hours and CGPA, but this relationship is not statistically significant (p = 0.2677).
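The p-value follows from r and n alone: under H0, t = r * sqrt(n - 2) / sqrt(1 - r^2) follows a t distribution with n - 2 degrees of freedom. The sketch below recomputes it from the reported numbers; the variable names are illustrative.

```r
r <- 0.09106778  # Pearson correlation reported by cor()
n <- 150         # sample size

# Test statistic for H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Share of the variance in CGPA that is linearly associated with Work/Study Hours
r_squared <- r^2

print(c(t = t_stat, p = p_value, r_squared = r_squared))
```

The p-value agrees with the 0.2677 reported by `rcorr()`, and r^2 is below 0.01, so less than 1% of the variance in CGPA is linearly associated with Work/Study Hours. `cor.test(mydataB$Work.Study.Hours, mydataB$CGPA)` would return the same test together with a confidence interval in a single call.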
Is there an association between Gender and Depression?
H0: There is no association between Gender and Depression.
H1: There is an association between Gender and Depression.
Observations are independent: Each observation (student) is independent, as gender and depression status for one individual do not influence others. This assumption is met.
Expected frequencies: All expected frequencies should be greater than 5, or at least 80% of expected frequencies should exceed 5.
chi_square <- chisq.test(mydataB$Gender, mydataB$Depression,
correct = TRUE)
chi_square
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydataB$Gender and mydataB$Depression
## X-squared = 2.1829, df = 1, p-value = 0.1396
addmargins(chi_square$observed)
## mydataB$Depression
## mydataB$Gender No Yes Sum
## Male 35 57 92
## Female 30 28 58
## Sum 65 85 150
addmargins(round(chi_square$expected, 2))
## mydataB$Depression
## mydataB$Gender No Yes Sum
## Male 39.87 52.13 92
## Female 25.13 32.87 58
## Sum 65.00 85.00 150
All expected frequencies are greater than 5, so the assumption for the Chi-Square test is satisfied. We cannot reject H0: the test indicates no significant association between Gender and Depression (χ² = 2.18, df = 1, p = 0.1396).
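The expected frequencies and the test statistic can be reproduced from the observed table. The sketch below rebuilds the 2 x 2 table from the counts shown above and applies Yates' continuity correction by hand; the object names are illustrative.

```r
# Observed counts from the Gender x Depression table
observed <- matrix(c(35, 57,
                     30, 28),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Gender = c("Male", "Female"),
                                   Depression = c("No", "Yes")))
n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Chi-square statistic with Yates' continuity correction (2 x 2 tables)
chi_sq  <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

print(round(expected, 2))
print(c(X_squared = chi_sq, p = p_value))

# Cross-check against the built-in test
builtin <- chisq.test(observed, correct = TRUE)
```

For example, the expected count for depressed male students is 92 * 85 / 150 = 52.13, compared with 57 observed.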
library(effectsize)
effectsize::cramers_v(mydataB$Gender, mydataB$Depression)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.11 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.11)
## [1] "small"
## (Rules: funder2019)
The effect size (Cramér's V = 0.11) is small: any association between Gender and Depression in this sample is weak.
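Cramér's V can also be derived from the table. The sketch below assumes that "(adj.)" in the output refers to the bias-corrected estimator (Bergsma, 2013) and computes both the plain and the corrected version from the observed counts; the object names are illustrative.

```r
# Observed counts from the Gender x Depression table
observed <- matrix(c(35, 57,
                     30, 28), nrow = 2, byrow = TRUE)
n <- sum(observed)
k <- nrow(observed)
l <- ncol(observed)

# Chi-square statistic without continuity correction
chi_sq <- unname(chisq.test(observed, correct = FALSE)$statistic)
phi_sq <- chi_sq / n

# Plain Cramer's V
v_plain <- sqrt(phi_sq / min(k - 1, l - 1))

# Bias-corrected version: shrink phi^2 and the table dimensions
phi_sq_adj <- max(0, phi_sq - (k - 1) * (l - 1) / (n - 1))
k_adj <- k - (k - 1)^2 / (n - 1)
l_adj <- l - (l - 1)^2 / (n - 1)
v_adj <- sqrt(phi_sq_adj / min(k_adj - 1, l_adj - 1))

print(round(c(plain = v_plain, adjusted = v_adj), 2))
```

The corrected value rounds to 0.11, in line with the reported output, while the uncorrected value is about 0.13. Both fall in the "small" range.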
In this case, the Pearson Chi-Square test shows that there is no significant association between Gender and Depression, as the p-value is greater than 0.05. The small effect size further supports this finding. This suggests that in the sample, depression prevalence does not differ meaningfully based on gender.
CGPA does not differ significantly between students with and without depression, as shown by both the parametric and the non-parametric test. The effect size, measured by the rank-biserial correlation (r = -0.13), is small, indicating no meaningful difference in CGPA between the two groups.
Work/Study Hours and CGPA show a weak positive correlation (r = 0.09) that is not statistically significant (p = 0.2677). The sample therefore provides no evidence of a linear relationship between the time students dedicate to work or study and their academic performance as measured by CGPA.
No significant association was found between Gender and Depression, as indicated by the Pearson Chi-Square test (p = 0.1396). The effect size, measured by Cramér's V (0.11), is small, so the prevalence of depression does not differ meaningfully by gender in this sample.