The Mental Disorders diagnosis dataset consists of observations on 120
patients covering 17 behavioral and psychological symptoms: sadness,
euphoria, exhaustion, sleep disorder, mood swings, suicidal thoughts,
anorexia, respect for authority, try-explanation, aggressive response,
ignore & move-on, nervous breakdown, admitting mistakes, overthinking,
sexual activity, concentration, and optimism. First, we install the
tidyverse package with the install.packages() function and then load it
with the library() function. I downloaded the dataset from the Kaggle
platform as a CSV file. To load it into the R global environment, we use
the read.csv() function and assign the result a name; I named it
disorder. To check that we loaded the correct dataset, we can use the
glimpse() function.
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
disorder = read.csv("mental_disorders_dataset.csv") %>%
  as_tibble()
glimpse(disorder)
## Rows: 120
## Columns: 19
## $ Patient.Number <chr> "Patiant-01", "Patiant-02", "Patiant-03", "Patiant…
## $ Sadness <chr> "Usually", "Usually", "Sometimes", "Usually", "Usu…
## $ Euphoric <chr> "Seldom", "Seldom", "Most-Often", "Seldom", "Usual…
## $ Exhausted <chr> "Sometimes", "Usually", "Sometimes", "Usually", "S…
## $ Sleep.dissorder <chr> "Sometimes", "Sometimes", "Sometimes", "Most-Often…
## $ Mood.Swing <chr> "YES", "NO", "YES", "YES", "NO", "NO", "YES", "NO"…
## $ Suicidal.thoughts <chr> "YES ", "YES", "NO", "YES", "NO", "YES", "YES", "N…
## $ Anorxia <chr> "NO", "NO", "NO", "YES", "NO", "YES", "YES", "NO",…
## $ Authority.Respect <chr> "NO", "NO", "NO", "NO", "NO", "YES", "NO", "NO", "…
## $ Try.Explanation <chr> "YES", "NO", "YES", "YES", "NO", "NO", "YES", "YES…
## $ Aggressive.Response <chr> "NO", "NO", "YES", "NO", "NO", "NO", "YES", "NO", …
## $ Ignore...Move.On <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "N…
## $ Nervous.Break.down <chr> "YES", "NO", "YES", "NO", "YES", "NO", "YES", "NO"…
## $ Admit.Mistakes <chr> "YES", "NO", "YES", "NO", "YES", "YES", "YES", "NO…
## $ Overthinking <chr> "YES", "NO", "NO", "NO", "YES", "NO", "YES", "YES"…
## $ Sexual.Activity <chr> "3 From 10", "4 From 10", "6 From 10", "3 From 10"…
## $ Concentration <chr> "3 From 10", "2 From 10", "5 From 10", "2 From 10"…
## $ Optimisim <chr> "4 From 10", "5 From 10", "7 From 10", "2 From 10"…
## $ Expert.Diagnose <chr> "Bipolar Type-2", "Depression", "Bipolar Type-1", …
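The glimpse() output shows a few things worth cleaning before analysis: at least one Suicidal.thoughts value has a trailing space ("YES "), and the score columns are stored as text such as "3 From 10". Below is a minimal base R sketch of one way to tidy these, using only the first four values shown above as a toy example. The object name toy and the cleaning steps are illustrative assumptions, not part of the original analysis.

```r
# Toy data copied from the first four rows shown by glimpse() above.
toy <- data.frame(
  Suicidal.thoughts = c("YES ", "YES", "NO", "YES"),
  Sexual.Activity   = c("3 From 10", "4 From 10", "6 From 10", "3 From 10"),
  stringsAsFactors  = FALSE
)

# Remove leading/trailing whitespace so "YES " and "YES" become one category.
toy$Suicidal.thoughts <- trimws(toy$Suicidal.thoughts)

# Drop the " From 10" suffix and convert the remaining number to an integer.
toy$Sexual.Activity <- as.integer(sub(" From 10$", "", toy$Sexual.Activity))

table(toy$Suicidal.thoughts)
toy$Sexual.Activity
```

The same two steps could be applied to the other YES/NO and "From 10" columns of the full dataset.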
We can visualize the distribution of diagnoses across patients with the
ggplot() function. The experts identified 4 diagnoses: Bipolar Type-1,
Bipolar Type-2, Depression, and Normal. The table() function gives the
exact number of patients per diagnosis: Bipolar Type-2 and Depression
are the most frequent, with 31 patients each, and Bipolar Type-1 is the
least frequent, with 28 patients. The prop.table() function converts
these counts to relative frequencies: 0.233, 0.258, 0.258, and 0.250 for
Bipolar Type-1, Bipolar Type-2, Depression, and Normal, respectively.
disorder.plot = disorder %>%
  ggplot(
    aes(
      x = Expert.Diagnose
    )
  )+
  geom_bar(fill = "pink")+
  labs(
    title = "Number of patients by expert diagnosis",
    x = "Expert diagnosis",
    y = "Number of patients"
  )+
  theme_classic()
disorder.plot
table(disorder$Expert.Diagnose)
##
## Bipolar Type-1 Bipolar Type-2 Depression Normal
## 28 31 31 30
prop.table(table(disorder$Expert.Diagnose))
##
## Bipolar Type-1 Bipolar Type-2 Depression Normal
## 0.2333333 0.2583333 0.2583333 0.2500000
Bipolar Type-2 and Depression are the most frequent diagnoses, although the difference is small (31 versus 28 patients). This might mean that they share a common predictive behavioural symptom. Maybe suicidal thoughts? Let's run a statistical test to find out whether there is a significant association between suicidal thoughts and the Bipolar Type-2 and Depression diagnoses.
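Before reading much into the higher counts, it may help to check whether the four diagnoses differ from an even split at all. Here is a minimal sketch, assuming a chi-square goodness-of-fit test against equal proportions is an acceptable check. This test is an addition, not part of the original analysis; the counts are taken from the table() output above.

```r
# Counts per diagnosis, copied from the table() output above.
diagnosis_counts <- c(
  "Bipolar Type-1" = 28,
  "Bipolar Type-2" = 31,
  "Depression"     = 31,
  "Normal"         = 30
)

# Given a single vector and no p argument, chisq.test() tests the counts
# against equal probabilities for all categories.
gof <- chisq.test(diagnosis_counts)
gof$expected
gof$p.value
```

A large p-value here would suggest that the diagnoses are roughly evenly represented, so the gap between 31 and 28 patients should not be over-interpreted.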
Now, to test the association between suicidal thoughts and diagnoses,
we first build a two-way table with the table() function. Suicidal
thoughts was chosen arbitrarily; it is a binary categorical variable, so
the table counts how often YES and NO occur for each expert diagnosis.
Note that the raw column contains a stray "YES " value with a trailing
space, which table() treats as a third category; this is why the test
below reports df = 6 rather than 3.
tab_diagnose = table(disorder$Suicidal.thoughts, disorder$Expert.Diagnose)
chi.test = chisq.test(tab_diagnose)
## Warning in chisq.test(tab_diagnose): Chi-squared approximation may be incorrect
chi.test
##
## Pearson's Chi-squared test
##
## data: tab_diagnose
## X-squared = 36.871, df = 6, p-value = 1.865e-06
For the statistical test we used the chi-square test, since we are looking for an association between two categorical variables. The test gives p-value < 0.05, so we reject the null hypothesis of independence and conclude that there is a statistically significant association between suicidal thoughts and diagnosis. The warning indicates that some expected cell counts are small, so the approximation should be treated with caution. However, this is only an overall association across all four diagnoses, not one specific to Bipolar Type-2 and Depression.
sub = subset(disorder, disorder$Expert.Diagnose %in% c("Depression","Bipolar Type-2") )
tab.sub = table(sub$Expert.Diagnose, sub$Suicidal.thoughts)
tab.sub
##
## NO YES YES
## Bipolar Type-2 8 22 1
## Depression 10 21 0
chi.sub = chisq.test(tab.sub)
## Warning in chisq.test(tab.sub): Chi-squared approximation may be incorrect
chi.sub
##
## Pearson's Chi-squared test
##
## data: tab.sub
## X-squared = 1.2455, df = 2, p-value = 0.5365
First, we subset the data to keep only the Depression and Bipolar Type-2 diagnoses. The chi-square test on this subset gives p-value > 0.05, which means there is no statistically significant association between suicidal thoughts and the Bipolar Type-2 versus Depression diagnosis. Well, unlucky for us. Maybe we should try again with another symptom. Sure, why not?! Let's try mood swing.
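The subset table above has three columns because "YES" and "YES " (with a trailing space) are counted as different answers. Below is a minimal sketch of how the table collapses to 2 x 2 once the whitespace is trimmed. It rebuilds the subset from the counts printed in tab.sub; the object names sub_toy and tab_clean are hypothetical.

```r
# Rebuild the Bipolar Type-2 / Depression subset from the printed counts.
sub_toy <- data.frame(
  Expert.Diagnose = rep(c("Bipolar Type-2", "Depression"), times = c(31, 31)),
  Suicidal.thoughts = c(
    rep(c("NO", "YES", "YES "), times = c(8, 22, 1)),  # Bipolar Type-2
    rep(c("NO", "YES"), times = c(10, 21))             # Depression
  ),
  stringsAsFactors = FALSE
)

# Trim whitespace so "YES " is merged into "YES".
sub_toy$Suicidal.thoughts <- trimws(sub_toy$Suicidal.thoughts)

tab_clean <- table(sub_toy$Expert.Diagnose, sub_toy$Suicidal.thoughts)
tab_clean

# Chi-square test on the cleaned 2 x 2 table, plus an exact alternative.
chisq.test(tab_clean)
fisher.test(tab_clean)
```

For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default; fisher.test() is an exact alternative that is often suggested when counts are small.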
tab_mood = table(disorder$Expert.Diagnose, disorder$Mood.Swing)
tab_mood
##
## NO YES
## Bipolar Type-1 3 25
## Bipolar Type-2 0 31
## Depression 31 0
## Normal 29 1
tab.sub.mood = table(sub$Expert.Diagnose, sub$Mood.Swing)
tab.sub.mood
##
## NO YES
## Bipolar Type-2 0 31
## Depression 31 0
chi.sub.mood = chisq.test(tab.sub.mood)
chi.sub.mood
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab.sub.mood
## X-squared = 58.065, df = 1, p-value = 2.537e-14
summary(chi.sub.mood)
## Length Class Mode
## statistic 1 -none- numeric
## parameter 1 -none- numeric
## p.value 1 -none- numeric
## method 1 -none- character
## data.name 1 -none- character
## observed 4 table numeric
## expected 4 -none- numeric
## residuals 4 table numeric
## stdres 4 table numeric
We followed exactly the same steps as for the previous symptom. The chi-square test now gives p-value < 0.05, which means, YAY, we have a statistically significant association between mood swing and the Bipolar Type-2 versus Depression diagnosis. In this subset all 31 Bipolar Type-2 patients reported mood swings and none of the 31 Depression patients did, so the two variables are very strongly associated.
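A p-value alone does not say how strong the association is. One common effect-size measure for contingency tables is Cramér's V, which ranges from 0 (no association) to 1 (perfect association). Here is a minimal sketch, computed by hand from the mood swing table above. The helper name cramers_v is hypothetical and does not come from any package.

```r
# Mood swing by diagnosis, copied from the tab.sub.mood output above.
tab_mood_sub <- matrix(
  c(0, 31,
    31, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Diagnosis  = c("Bipolar Type-2", "Depression"),
    Mood.Swing = c("NO", "YES")
  )
)

cramers_v <- function(tab) {
  # Use the uncorrected statistic; Yates' correction would shrink it.
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(chi2 / (n * k))
}

cramers_v(tab_mood_sub)
```

Reporting an effect size next to the p-value makes it clearer how strong the association is in practice.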
Now we move on to modeling. For this step I used binary logistic regression, because the outcome is binary (Bipolar Type-2 or Depression) and I want to model it from whether or not the patient experiences mood swings.
sub = disorder %>%
  filter(Expert.Diagnose %in% c("Bipolar Type-2", "Depression"))
sub$Expert.Diagnose = factor(sub$Expert.Diagnose)
sub$Mood.Swing = factor(sub$Mood.Swing)
sub$Expert.Diagnose = relevel(sub$Expert.Diagnose, ref = "Depression")
model.disorder = glm(Expert.Diagnose ~ Mood.Swing,
                     data = sub,
                     family = binomial)
summary(model.disorder)
##
## Call:
## glm(formula = Expert.Diagnose ~ Mood.Swing, family = binomial,
## data = sub)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.57 63961.76 0.000 1
## Mood.SwingYES 53.13 90455.60 0.001 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.595e+01 on 61 degrees of freedom
## Residual deviance: 3.597e-10 on 60 degrees of freedom
## AIC: 4
##
## Number of Fisher Scoring iterations: 25
plot(model.disorder)
The model summary looks strange and does not give a usable
explanation: the coefficient estimates are extreme and the standard
errors are huge. This happens because mood swing perfectly distinguishes
between the Bipolar Type-2 and Depression diagnoses in this subset
(complete separation), so the maximum likelihood estimates do not
converge to finite values. Let's therefore check visually how mood swing
relates to the two diagnoses.
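Complete separation can be spotted before fitting the model: if every level of the predictor occurs with only one outcome, logistic regression cannot produce finite estimates. Below is a minimal sketch of such a check, rebuilding the subset from the counts in tab.sub.mood. The function is_completely_separated and the object sub_toy are hypothetical names used for illustration.

```r
# Rebuild the subset from the counts in tab.sub.mood:
# 31 Bipolar Type-2 patients with mood swings, 31 Depression patients without.
sub_toy <- data.frame(
  Expert.Diagnose = rep(c("Bipolar Type-2", "Depression"), each = 31),
  Mood.Swing      = rep(c("YES", "NO"), each = 31),
  stringsAsFactors = FALSE
)

is_completely_separated <- function(predictor, outcome) {
  tab <- table(predictor, outcome)
  # Separated if each predictor level has patients in exactly one outcome.
  all(rowSums(tab > 0) == 1)
}

is_completely_separated(sub_toy$Mood.Swing, sub_toy$Expert.Diagnose)
```

When a check like this flags separation, the cross-table itself already tells the story. Penalized approaches such as Firth's logistic regression (for example in the logistf package) are sometimes used instead, though that goes beyond this analysis.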
chi.model = sub %>%
  ggplot(
    aes(
      x = Expert.Diagnose,
      fill = Mood.Swing
    )
  )+
  geom_bar(position = "dodge")+
  labs(
    title = "Mood swing counts by diagnosis",
    x = "Diagnosis",
    y = "Number of patients",
    fill = "Mood Swings"
  )+
  scale_y_continuous()+
  theme_minimal()
chi.model
Here we can see how the answers on mood swings relate to being
diagnosed with Bipolar Type-2 or Depression. Based on the counts,
patients who answered NO to mood swings were diagnosed with Depression
and, vice versa, patients who had this symptom were diagnosed with
Bipolar Type-2.
The dataset contains 4 expert diagnoses. By subsetting to Bipolar Type-2 and Depression, we identified a statistically significant association between these diagnoses and mood swings.
Since the symptom was chosen arbitrarily, there might be other predictive symptoms that I did not take into account. In addition, I focused mainly on the p-value, which is not enough for interpretation on its own, so there is a risk of misinterpretation.