library(readr)
mydata <- read.csv("./Assignment 2.csv", header=TRUE,
sep=",", dec=".")
#found in later parts that one respondent has a CGPA of 0 and some rows contain NA values; neither seems plausible, so both are removed here
library(tidyr)
mydata <-na.omit(mydata)
mydata <-mydata[mydata$CGPA!=0,]
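Before dropping rows it can help to count how many are affected. The following is only a minimal sketch on a made-up data frame (`toy`, with hypothetical values), not on the real assignment file; it applies the same two filters used above and reports how many rows they remove.

```r
# Hypothetical toy data standing in for the real file
toy <- data.frame(
  id   = 1:6,
  CGPA = c(7.1, 0, 9.9, NA, 5.3, 8.2),
  Age  = c(28, 21, NA, 31, 28, 25)
)

# Count missing values per column before dropping anything
na_per_column <- colSums(is.na(toy))

# Count rows with an implausible CGPA of exactly 0 (NA values are ignored)
n_zero_cgpa <- sum(toy$CGPA == 0, na.rm = TRUE)

# Apply the same two filters as in the analysis
clean <- na.omit(toy)
clean <- clean[clean$CGPA != 0, ]

# Number of rows removed by the cleaning step
rows_dropped <- nrow(toy) - nrow(clean)
print(na_per_column)
print(rows_dropped)
```

Reporting these counts alongside the cleaning step makes it easier to judge how much data the filters discard.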
#Explain the data
Variables in the data:
#Carry out data manipulation
mydata2 <-mydata[,c(-1,-4,-5,-7,-10,-12,-13,-14,-16,-17)]
#removing not needed variables like Work pressure, Job Satisfaction etc.
mydata2$Gender <- factor(mydata2$Gender,
levels= c("Male","Female"),
labels= c("Male","Female"))
#converting the characters into factors for Gender
set.seed(123)
mydata3 <- mydata2[sample(nrow(mydata2), 100), ]
head(mydata3)
##       Gender Age Academic.Pressure CGPA Study.Satisfaction    Sleep.Duration
## 18854   Male  28                 2 7.10                  3         7-8 hours
## 18902 Female  21                 2 9.94                  1 Less than 5 hours
## 26815 Female  32                 3 7.25                  5         5-6 hours
## 25112   Male  31                 5 5.32                  5 Less than 5 hours
## 2986  Female  28                 1 5.32                  3 More than 8 hours
## 1842    Male  28                 4 5.25                  2         5-6 hours
##       Work.Study.Hours Depression
## 18854                7          0
## 18902                8          1
## 26815                8          0
## 25112               10          1
## 2986                12          0
## 1842                 1          1
#randomly sampling 100 of my 27,889 observations for my hypothesis testing
#Source of the data: Kaggle (link: https://www.kaggle.com/datasets/hopesb/student-depression-dataset)
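The random subsample drawn above is reproducible only because the seed is fixed before `sample()` is called. As a small sketch (using a hypothetical row count of 1000 rather than the real data), resetting the same seed gives the same row indices, and `sample()` draws without replacement by default, so no row is picked twice.

```r
# Hypothetical number of rows, standing in for nrow(mydata2)
n_rows <- 1000

# Same seed, same draw: the two index vectors are identical
set.seed(123)
first_draw <- sample(n_rows, 100)
set.seed(123)
second_draw <- sample(n_rows, 100)
same_rows <- identical(first_draw, second_draw)

# sample() without replace = TRUE never returns the same index twice
no_duplicates <- anyDuplicated(first_draw) == 0

print(same_rows)
print(no_duplicates)
```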
library(psych)
describe(mydata3$CGPA)
##    vars   n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 100 7.76 1.54    7.9    7.79 2.24 5.12  10  4.88 -0.16    -1.34 0.15
#Describing a few parameters
I have selected the cumulative GPA (CGPA) as the variable to explain.
The average CGPA of the people included in the sample is 7.76.
The standard deviation, which measures the variability around the mean, is 1.54.
The skew is -0.16, which means that the distribution is slightly
negatively (left) skewed.
The median is 7.9, which means that about half of the people have a CGPA
of 7.9 or lower, while the other half have a CGPA of 7.9 or higher.
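Two of the remaining columns in the `describe()` output follow directly from the numbers already quoted. As a quick check, using only the rounded values printed above (so the results are approximate), the standard error is the standard deviation divided by the square root of the sample size, and the range is the maximum minus the minimum.

```r
# Rounded values taken from the describe() output above
n       <- 100
sd_cgpa <- 1.54
min_cgpa <- 5.12
max_cgpa <- 10

# Standard error of the mean: sd / sqrt(n)
se_cgpa <- sd_cgpa / sqrt(n)

# Range: max - min
range_cgpa <- max_cgpa - min_cgpa

print(round(se_cgpa, 2))
print(round(range_cgpa, 2))
```

Both agree with the `se` and `range` columns of the output (0.15 and 4.88).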
describeBy(mydata3$CGPA,mydata3$Gender)
##
## Descriptive statistics by group
## group: Male
##    vars  n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 57 7.71 1.62   7.88    7.73 2.39 5.12  10  4.88 -0.07    -1.54 0.22
## ------------------------------------------------------------
## group: Female
##    vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis   se
## X1    1 43 7.83 1.43   7.99    7.88 1.48 5.16 9.94  4.78 -0.28    -1.05 0.22
We will first carry out both the parametric and the non-parametric test,
and then use a histogram and the Shapiro-Wilk test to find out whether CGPA
is normally distributed in both populations.
Before the
tests, we have to check whether the variance of CGPA is the same for
males and females. So, we will use Levene's test for it.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
leveneTest(mydata3$CGPA, group = mydata3$Gender)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 3.4699 0.06549 .
## 98
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: the variance of the CGPA for males is equal to the variance of
the CGPA for females
H1: the variance of the CGPA for males is not
equal to the variance of the CGPA for females
Based on the test,
since the p-value is 0.06549, which is above 0.05, we do not reject the
null hypothesis. We have no evidence that the variances differ, so we
treat them as equal and carry out both the parametric and the
non-parametric hypothesis tests without any adjustment.
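To make clearer what Levene's test with `center = median` measures, here is a sketch of the same idea in base R: a one-way ANOVA on the absolute deviations of each observation from its group median. The data frame `toy` and its values are simulated for illustration only and are not the assignment data.

```r
# Simulated toy data (hypothetical), two groups with similar spread
set.seed(1)
toy <- data.frame(
  score = c(rnorm(30, mean = 7, sd = 1.6), rnorm(25, mean = 7, sd = 1.4)),
  group = factor(rep(c("A", "B"), times = c(30, 25)))
)

# Absolute deviation of each score from the median of its own group
toy$abs_dev <- abs(toy$score - ave(toy$score, toy$group, FUN = median))

# One-way ANOVA on the absolute deviations
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))

f_value <- levene_by_hand[["F value"]][1]
p_value <- levene_by_hand[["Pr(>F)"]][1]
print(levene_by_hand)
```

If the groups differ in spread, their average absolute deviations differ, which shows up as a large F value and a small p-value.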
t.test(mydata3$CGPA ~ mydata3$Gender,
var.equal = TRUE,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: mydata3$CGPA by mydata3$Gender
## t = -0.38875, df = 98, p-value = 0.6983
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -0.7404334 0.4978549
## sample estimates:
## mean in group Male mean in group Female
## 7.708246 7.829535
H0: The mean CGPA of males is not different from the mean CGPA of
females
H1: The mean CGPA of males is different from the
mean CGPA of females.
Since the p-value is 0.6983, we do not
reject the null hypothesis. This means that, based
on the sample data, we find no evidence that males and females differ
in mean CGPA.
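As a rough check of the t-test output, the statistic can be recomputed from the group summaries reported earlier. This sketch uses the rounded standard deviations from `describeBy()` (1.62 and 1.43), so it only approximately reproduces the printed t and p-value.

```r
# Group summaries taken from the output above (sds are rounded)
n1 <- 57; m1 <- 7.708246; s1 <- 1.62   # Male
n2 <- 43; m2 <- 7.829535; s2 <- 1.43   # Female

# Degrees of freedom for the equal-variance t-test
df <- n1 + n2 - 2

# Pooled standard deviation
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)

# Standard error of the difference in means
se_diff <- sp * sqrt(1 / n1 + 1 / n2)

# t statistic and two-sided p-value
t_stat <- (m1 - m2) / se_diff
p_val  <- 2 * pt(-abs(t_stat), df)

print(round(t_stat, 2))
print(round(p_val, 2))
```

The result is close to the reported t = -0.38875 and p = 0.6983; the small gap comes from rounding in the standard deviations.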
library(effectsize)
##
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(mydata3$CGPA ~ mydata3$Gender,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## -------------------------
## -0.08 | [-0.47, 0.31]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.08,rules="sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)
Moreover, the effect size is tiny (d = -0.08) and its 95% confidence interval includes 0, so we find no evidence of a difference in mean CGPA between the genders in this sample, and no evidence that gender is associated with CGPA.
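The reported effect size can be approximated from the same group summaries. With an un-pooled standard deviation, Cohen's d divides the difference in means by the square root of the average of the two group variances. The sketch below again relies on the rounded standard deviations, so it is an approximation.

```r
# Group summaries taken from the output above (sds are rounded)
m1 <- 7.708246; s1 <- 1.62   # Male
m2 <- 7.829535; s2 <- 1.43   # Female

# Un-pooled standardiser: square root of the mean of the two variances
sd_unpooled <- sqrt((s1^2 + s2^2) / 2)

# Cohen's d (Male minus Female)
d <- (m1 - m2) / sd_unpooled

print(round(d, 2))
```

The negative sign only reflects that the male mean is slightly lower than the female mean.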
wilcox.test(mydata3$CGPA~mydata3$Gender,
correct=FALSE,
exact=FALSE,
alternative="two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata3$CGPA by mydata3$Gender
## W = 1191, p-value = 0.8102
## alternative hypothesis: true location shift is not equal to 0
library(effectsize)
effectsize(wilcox.test(mydata3$CGPA~mydata3$Gender,
correct = FALSE,
exact = FALSE,
alternative ="two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -0.03 | [-0.25, 0.20]
interpret_rank_biserial(0.03)
## [1] "tiny"
## (Rules: funder2019)
H0: The distribution location of CGPA is the same for both the male
and the female
H1: The distribution location of CGPA is different
for both the male and female
Based on the sample data, we cannot reject the null hypothesis, since the p-value is large (p = 0.8102). We therefore find no evidence that the distribution location of CGPA differs between males and females. The difference in distribution locations is tiny (r = -0.03).
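The rank-biserial correlation can be recovered from the Wilcoxon output itself. In R, the W statistic of the rank sum test counts the (male, female) pairs in which the male value is larger, so dividing by the total number of pairs and rescaling to the interval from -1 to 1 gives r. A short sketch using the reported W and group sizes:

```r
# Values taken from the output above
W  <- 1191   # Wilcoxon rank sum statistic
n1 <- 57     # Male
n2 <- 43     # Female

# Total number of (male, female) pairs
n_pairs <- n1 * n2

# Rank-biserial correlation: share of pairs favouring group 1,
# rescaled from [0, 1] to [-1, 1]
r_rb <- 2 * W / n_pairs - 1

print(round(r_rb, 2))
```

A value near 0 means that a randomly chosen male is about as likely to have a higher CGPA than a randomly chosen female as the other way round.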
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(mydata3, aes(CGPA))+
geom_histogram(binwidth = 1, fill="darkred")+
facet_wrap(~Gender, ncol=1)+
ylab("Frequency")
library(rstatix)
##
## Attaching package: 'rstatix'
## The following objects are masked from 'package:effectsize':
##
## cohens_d, eta_squared
## The following object is masked from 'package:stats':
##
## filter
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata3 %>%
group_by(Gender) %>%
shapiro_test(CGPA)
## # A tibble: 2 × 4
## Gender variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male CGPA 0.906 0.000319
## 2 Female CGPA 0.948 0.0522
H0: The Cumulative GPA (for the gender whose
p-value we are looking at) is normally distributed
H1: The
Cumulative GPA (for the gender whose p-value we are
looking at) is not normally distributed
Looking at both genders,
the p-value is small (p<0.001) for males and 0.052 for females. Hence
we reject H0 for males, but we cannot reject the
null hypothesis for females. We conclude that the Cumulative GPA is
not normally distributed for males, while for females we have no
evidence against normality. Since the normality assumption fails in one
group, we will use the non-parametric hypothesis
test to answer our research question.
Moreover, the
distribution of the CGPA in the histogram also does not look
normally distributed for males. Hence, we will base our conclusion on the
non-parametric test.
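The per-group normality check can also be done in base R without `rstatix`. The sketch below runs `shapiro.test()` on each group of a simulated data frame (`toy`, with hypothetical values) and applies the 0.05 decision rule used above; it does not use the assignment data.

```r
# Simulated toy data (hypothetical): one roughly normal, one skewed group
set.seed(42)
toy <- data.frame(
  value = c(rnorm(40), rexp(40)),
  group = factor(rep(c("normal_like", "skewed"), each = 40))
)

# Shapiro-Wilk p-value for each group
p_by_group <- sapply(split(toy$value, toy$group),
                     function(x) shapiro.test(x)$p.value)

# Decision rule: reject normality when p < 0.05
reject_normality <- p_by_group < 0.05

print(p_by_group)
print(reject_normality)
```

If normality is rejected in any group, the non-parametric test is the safer basis for the conclusion.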
Lastly, our results show no evidence of a difference in the distribution location of CGPA between males and females. Hence, based on this sample, we cannot say that gender is a factor that affects CGPA.