Mucheli Sudhakar Sairam Sukesh

library(readr)
mydata <- read.csv("./Assignment 2.csv", header=TRUE, 
                   sep=",", dec=".")

#Later parts showed one respondent with a CGPA of 0 and some NA values; neither looks like a valid record, so I remove them here.

library(tidyr)
mydata <-na.omit(mydata)
mydata <-mydata[mydata$CGPA!=0,]
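
Before dropping rows, it can help to count how many will be lost. The following is only a minimal sketch on a made-up data frame; `toy` and its values are assumptions and not the assignment data.

```r
# Hypothetical toy data standing in for the real file
toy <- data.frame(
  CGPA   = c(7.1, 0, 8.3, NA, 6.5),
  Gender = c("Male", "Female", "Female", "Male", NA)
)

# Count rows with at least one NA, and rows with a CGPA of exactly 0
n_incomplete <- sum(!complete.cases(toy))
n_zero_cgpa  <- sum(toy$CGPA == 0, na.rm = TRUE)

# Same two-step cleaning as above: drop incomplete rows, then zero CGPAs
toy_clean <- na.omit(toy)
toy_clean <- toy_clean[toy_clean$CGPA != 0, ]

cat("incomplete rows:", n_incomplete, "\n")
cat("zero-CGPA rows:", n_zero_cgpa, "\n")
cat("rows kept:", nrow(toy_clean), "\n")
```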

#Explain the data

Variables in the data:

#Carry out data manipulation

mydata2 <-mydata[,c(-1,-4,-5,-7,-10,-12,-13,-14,-16,-17)]

#Removing variables that are not needed, such as Work Pressure and Job Satisfaction.
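
Dropping columns by position works, but it silently breaks if the column order changes. A possible alternative is to keep columns by name. This is a sketch on a hypothetical frame; the columns `id` and `Work.Pressure` are assumed names used only for illustration.

```r
# Hypothetical toy frame; "id" and "Work.Pressure" stand in for dropped columns
toy <- data.frame(
  id            = 1:3,
  Gender        = c("Male", "Female", "Male"),
  Work.Pressure = c(0, 0, 0),
  CGPA          = c(7.1, 9.9, 5.3),
  Depression    = c(0, 1, 1)
)

# Keep columns by name, so the selection survives a change in column order
keep <- c("Gender", "CGPA", "Depression")
toy2 <- toy[, keep]
print(names(toy2))
```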

mydata2$Gender <- factor(mydata2$Gender,
                                 levels= c("Male","Female"),
                                 labels= c("Male","Female"))

#Converting Gender from a character variable into a factor.

set.seed(123)  
mydata3 <- mydata2[sample(nrow(mydata2), 100), ]
head(mydata3)
##       Gender Age Academic.Pressure CGPA Study.Satisfaction    Sleep.Duration
## 18854   Male  28                 2 7.10                  3         7-8 hours
## 18902 Female  21                 2 9.94                  1 Less than 5 hours
## 26815 Female  32                 3 7.25                  5         5-6 hours
## 25112   Male  31                 5 5.32                  5 Less than 5 hours
## 2986  Female  28                 1 5.32                  3 More than 8 hours
## 1842    Male  28                 4 5.25                  2         5-6 hours
##       Work.Study.Hours Depression
## 18854                7          0
## 18902                8          1
## 26815                8          0
## 25112               10          1
## 2986                12          0
## 1842                 1          1

#Randomly sampling 100 of my 27,889 observations for the hypothesis testing.
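
The seed is what makes the random sample repeatable. A small sketch of that idea, using a hypothetical population of 1000 row indices rather than the real data:

```r
# Draw the same kind of sample twice with the same seed
set.seed(123)
first_draw <- sample(1000, 100)   # 100 row indices out of a hypothetical 1000

set.seed(123)
second_draw <- sample(1000, 100)

# Same indices both times, and no row picked twice (sampling without replacement)
print(identical(first_draw, second_draw))
print(anyDuplicated(first_draw) == 0)
```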

#Source of the data: Kaggle (link: https://www.kaggle.com/datasets/hopesb/student-depression-dataset)

library(psych)
describe(mydata3$CGPA)
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 100 7.76 1.54    7.9    7.79 2.24 5.12  10  4.88 -0.16    -1.34 0.15

#Describing a few parameters: I have selected the cumulative GPA (CGPA) as the variable to explain.
The average CGPA of the people included in the sample is 7.76.
The standard deviation, which is a measure of variability, is 1.54.
The skew is -0.16, which is negative, so the histogram will be slightly left-skewed when drawn.
The median is 7.9, which means that 50% of the people have a CGPA of 7.9 or lower while the other 50% have a CGPA higher than 7.9.
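
To see why a negative skew goes together with a mean below the median, here is a sketch with made-up scores. It uses one common moment-based definition of skewness; I am assuming this is close to what `describe()` reports, although the package may apply a slightly different small-sample correction.

```r
# Hypothetical scores with a long left tail
x <- c(5.2, 7.8, 8.1, 8.4, 8.6, 8.9, 9.0, 9.3)

# Moment-based skewness: third central moment over the cubed SD
m3   <- mean((x - mean(x))^3)
skew <- m3 / sd(x)^3

cat("mean:", mean(x), " median:", median(x), " skew:", skew, "\n")
```

The single low value pulls the mean below the median and makes the skew negative.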

Research Question: Does CGPA differ by gender?

  • This is an independent-samples test, as there are 2 different populations (males and females) and each individual is measured once, with their own CGPA.
  • The first criterion is that the variable has to be numerical. This holds, as CGPA is a numerical variable.
describeBy(mydata3$CGPA,mydata3$Gender)
## 
##  Descriptive statistics by group 
## group: Male
##    vars  n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 57 7.71 1.62   7.88    7.73 2.39 5.12  10  4.88 -0.07    -1.54 0.22
## ------------------------------------------------------------ 
## group: Female
##    vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis   se
## X1    1 43 7.83 1.43   7.99    7.88 1.48 5.16 9.94  4.78 -0.28    -1.05 0.22

We will first carry out both the parametric and the non-parametric tests, and then use the histogram and the Shapiro-Wilk test to find out whether CGPA is normally distributed in both populations.
Before the tests, we have to check whether the variance of CGPA is the same in both groups, so we will use Levene's test.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
leveneTest(mydata3$CGPA, group = mydata3$Gender)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  3.4699 0.06549 .
##       98                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: the variance of the CGPA for males is equal to the variance of the CGPA for females
H1: the variance of the CGPA for males is not equal to the variance of the CGPA for females
Based on the test, since the p-value is 0.06549, which is above the 5% significance level, we cannot reject the null hypothesis. This means that we can carry out both the parametric and the non-parametric hypothesis tests without any adjustment, as there is no evidence that the variances differ.
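
For intuition, Levene's test with `center = median` can be reproduced in base R as a one-way ANOVA on the absolute deviations from each group's median. The sketch below uses simulated data, not my sample, and the variable names are assumptions.

```r
# Simulated stand-in data (not the assignment sample)
set.seed(1)
cgpa   <- c(rnorm(30, mean = 7.7, sd = 1.6), rnorm(25, mean = 7.8, sd = 1.4))
gender <- factor(rep(c("Male", "Female"), times = c(30, 25)))

# Absolute deviation of each value from its own group median
abs_dev <- abs(cgpa - ave(cgpa, gender, FUN = median))

# One-way ANOVA on those deviations gives the Levene F and p-value
levene <- anova(lm(abs_dev ~ gender))
print(levene)
```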

Parametric testing

t.test(mydata3$CGPA ~ mydata3$Gender,
       var.equal = TRUE, 
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  mydata3$CGPA by mydata3$Gender
## t = -0.38875, df = 98, p-value = 0.6983
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -0.7404334  0.4978549
## sample estimates:
##   mean in group Male mean in group Female 
##             7.708246             7.829535

H0: The mean of the males CGPA is not different from the mean of the females CGPA
H1: The mean of the males CGPA is different from the mean of the females CGPA.
Since the p-value is 0.6983, we cannot reject the null hypothesis. Based on the sample data, we find no evidence that males and females differ in mean CGPA.
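
The equal-variance t statistic can also be computed by hand from the pooled variance. This is a sketch on simulated groups (the data and names are assumptions), checked against the built-in `t.test`.

```r
# Simulated stand-in groups with the same sizes as my sample
set.seed(2)
male   <- rnorm(57, mean = 7.7, sd = 1.6)
female <- rnorm(43, mean = 7.8, sd = 1.4)

n1 <- length(male)
n2 <- length(female)

# Pooled variance weights each group's variance by its degrees of freedom
sp2      <- ((n1 - 1) * var(male) + (n2 - 1) * var(female)) / (n1 + n2 - 2)
t_manual <- (mean(male) - mean(female)) / sqrt(sp2 * (1 / n1 + 1 / n2))
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# Should agree with the built-in equal-variance test
t_builtin <- t.test(male, female, var.equal = TRUE)
print(c(manual = t_manual, builtin = unname(t_builtin$statistic)))
```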

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cohens_d(mydata3$CGPA ~ mydata3$Gender,
                     pooled_sd = FALSE)
## Cohen's d |        95% CI
## -------------------------
## -0.08     | [-0.47, 0.31]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.08,rules="sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)

Moreover, since the effect size is tiny (d = 0.08 in absolute value), there is practically no difference in mean CGPA between the genders, so the sample gives no evidence that gender affects CGPA.
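
As a rough check, Cohen's d can be recomputed from the group summaries reported above. I am assuming here that the un-pooled SD is the square root of the average of the two group variances; the SDs are rounded to two decimals, so the result is approximate.

```r
# Group summaries as reported above (SDs are rounded to two decimals)
mean_male   <- 7.708246
sd_male     <- 1.62
mean_female <- 7.829535
sd_female   <- 1.43

# Assumed un-pooled version: average the two variances instead of pooling by df
sd_unpooled <- sqrt((sd_male^2 + sd_female^2) / 2)
d <- (mean_male - mean_female) / sd_unpooled
print(round(d, 2))
```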

Non parametric method

wilcox.test(mydata3$CGPA~mydata3$Gender,
            correct=FALSE, 
            exact=FALSE,
            alternative="two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata3$CGPA by mydata3$Gender
## W = 1191, p-value = 0.8102
## alternative hypothesis: true location shift is not equal to 0
library(effectsize)
effectsize(wilcox.test(mydata3$CGPA~mydata3$Gender,
                       correct = FALSE, 
                       exact = FALSE, 
                       alternative ="two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## -0.03             | [-0.25, 0.20]
interpret_rank_biserial(0.03)
## [1] "tiny"
## (Rules: funder2019)

H0: The distribution location of CGPA is the same for males and females
H1: The distribution location of CGPA is different for males and females

Based on the sample data, we cannot reject the null hypothesis, since the p-value is large at 0.8102. We therefore find no evidence that the distribution location of CGPA differs between males and females. The difference in distribution locations is tiny (r = 0.03 in absolute value).
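
The rank-biserial correlation can be recovered from the W statistic and the two group sizes. A short sketch using the numbers reported above, assuming W is the Mann-Whitney U of the first group (Male):

```r
W  <- 1191   # statistic reported by wilcox.test above
n1 <- 57     # number of males in the sample
n2 <- 43     # number of females in the sample

# Rank-biserial r: share of favourable pairs rescaled to the range [-1, 1]
r_rb <- 2 * W / (n1 * n2) - 1
print(round(r_rb, 2))
```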

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(mydata3, aes(CGPA))+
  geom_histogram(binwidth = 1, fill="darkred")+
  facet_wrap(~Gender, ncol=1)+
  ylab("Frequency")

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following objects are masked from 'package:effectsize':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:stats':
## 
##     filter
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mydata3 %>%
  group_by(Gender) %>%
  shapiro_test(CGPA)
## # A tibble: 2 × 4
##   Gender variable statistic        p
##   <fct>  <chr>        <dbl>    <dbl>
## 1 Male   CGPA         0.906 0.000319
## 2 Female CGPA         0.948 0.0522

H0: The distribution of the Cumulative GPA (for the gender whose p-value we are looking at) is normally distributed
H1: The distribution of the Cumulative GPA (for the gender whose p-value we are looking at) is not normally distributed
Looking at both genders, the p-value is small for males (p < 0.001) and 0.052 for females. Hence we reject H0 for males but cannot reject it for females. We conclude that the Cumulative GPA is not normally distributed for males, while for females there is no evidence against normality, although the p-value is only just above 0.05. As such, we have to use the non-parametric hypothesis test to answer our research question.
Moreover, the distribution of CGPA in the histogram does not look normal for males either. Hence, we rely on the non-parametric test for our final result.
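
The same per-group normality check can be done with base R only, which avoids the package masking messages. This sketch uses simulated data (one roughly uniform group and one normal group); the data frame `toy` is an assumption.

```r
# Simulated stand-in data: one flat-looking group, one bell-shaped group
set.seed(3)
toy <- data.frame(
  CGPA   = c(runif(40, min = 5, max = 10), rnorm(40, mean = 7.8, sd = 1)),
  Gender = factor(rep(c("Male", "Female"), each = 40))
)

# One Shapiro-Wilk p-value per group
p_by_group <- tapply(toy$CGPA, toy$Gender, function(x) shapiro.test(x)$p.value)
print(round(p_by_group, 4))
```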

Lastly, our results show no evidence of a difference in the distribution location of CGPA between males and females. Hence, based on this sample, gender cannot be described as a factor that affects CGPA.