Final assignment: Analysis of the mental disorders dataset

Introduction

The mental disorders diagnosis dataset contains observations on 120 patients, each described by 17 behavioral and psychological symptoms: sadness, euphoria, exhaustion, sleep disorder, mood swings, suicidal thoughts, anorexia, respect for authority, try-explanation, aggressive response, ignore and move on, nervous breakdown, admitting mistakes, overthinking, sexual activity, concentration, and optimism. The remaining two columns hold the patient number and the expert diagnosis.

First, we install the tidyverse package with install.packages() and load it with library(). I downloaded the dataset from Kaggle as a CSV file. To load it into the R global environment, we call read.csv() and assign the result to an object, which I named disorder. To check that the correct dataset was loaded, we can use glimpse().

install.packages("tidyverse")

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

disorder = read.csv("mental_disorders_dataset.csv") %>% 
  as_tibble()

glimpse(disorder)

## Rows: 120
## Columns: 19
## $ Patient.Number      <chr> "Patiant-01", "Patiant-02", "Patiant-03", "Patiant…
## $ Sadness             <chr> "Usually", "Usually", "Sometimes", "Usually", "Usu…
## $ Euphoric            <chr> "Seldom", "Seldom", "Most-Often", "Seldom", "Usual…
## $ Exhausted           <chr> "Sometimes", "Usually", "Sometimes", "Usually", "S…
## $ Sleep.dissorder     <chr> "Sometimes", "Sometimes", "Sometimes", "Most-Often…
## $ Mood.Swing          <chr> "YES", "NO", "YES", "YES", "NO", "NO", "YES", "NO"…
## $ Suicidal.thoughts   <chr> "YES ", "YES", "NO", "YES", "NO", "YES", "YES", "N…
## $ Anorxia             <chr> "NO", "NO", "NO", "YES", "NO", "YES", "YES", "NO",…
## $ Authority.Respect   <chr> "NO", "NO", "NO", "NO", "NO", "YES", "NO", "NO", "…
## $ Try.Explanation     <chr> "YES", "NO", "YES", "YES", "NO", "NO", "YES", "YES…
## $ Aggressive.Response <chr> "NO", "NO", "YES", "NO", "NO", "NO", "YES", "NO", …
## $ Ignore...Move.On    <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "NO", "N…
## $ Nervous.Break.down  <chr> "YES", "NO", "YES", "NO", "YES", "NO", "YES", "NO"…
## $ Admit.Mistakes      <chr> "YES", "NO", "YES", "NO", "YES", "YES", "YES", "NO…
## $ Overthinking        <chr> "YES", "NO", "NO", "NO", "YES", "NO", "YES", "YES"…
## $ Sexual.Activity     <chr> "3 From 10", "4 From 10", "6 From 10", "3 From 10"…
## $ Concentration       <chr> "3 From 10", "2 From 10", "5 From 10", "2 From 10"…
## $ Optimisim           <chr> "4 From 10", "5 From 10", "7 From 10", "2 From 10"…
## $ Expert.Diagnose     <chr> "Bipolar Type-2", "Depression", "Bipolar Type-1", …
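
The glimpse() output shows two things worth cleaning before analysis: at least one Suicidal.thoughts entry is stored as "YES " with a trailing space, and the three score columns are stored as text such as "3 From 10". The following is a minimal base R sketch of one way to handle both. The data frame `toy` and its values are made up for illustration and are not the real data.

```r
# Hypothetical miniature of the dataset; values are made up for illustration.
toy <- data.frame(
  Suicidal.thoughts = c("YES ", "YES", "NO"),
  Concentration     = c("3 From 10", "2 From 10", "5 From 10"),
  stringsAsFactors  = FALSE
)

# Trim leading and trailing whitespace in every character column.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], trimws)

# Turn "3 From 10" into the number 3 by dropping the " From 10" suffix.
toy$Concentration <- as.numeric(sub(" From 10$", "", toy$Concentration))

unique(toy$Suicidal.thoughts)
toy$Concentration
```

The same two steps could be applied to disorder itself, although the outputs shown in this report were produced without them.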

Visualization

We can visualize how the diagnoses are distributed across patients. To do so, I used ggplot() with geom_bar(). The experts assigned four diagnoses: Bipolar Type-1, Bipolar Type-2, Depression, and Normal. The table() function gives the exact number of patients per diagnosis: Bipolar Type-2 and Depression are the most frequent, with 31 patients each, followed by Normal with 30. Bipolar Type-1 was diagnosed least often, with 28 patients. Applying prop.table() to these counts gives the relative frequencies: 0.233, 0.258, 0.258, and 0.250 for Bipolar Type-1, Bipolar Type-2, Depression, and Normal, respectively.

disorder.plot = disorder %>% 
  ggplot(
    aes(
      x = Expert.Diagnose
    )
  )+
  geom_bar(fill = "pink")+
  labs(
    title = "Number of patients by expert diagnosis",
    x = "Expert diagnosis",
    y = "Number of patients"
  )+
  theme_classic()

disorder.plot

table(disorder$Expert.Diagnose)

## 
## Bipolar Type-1 Bipolar Type-2     Depression         Normal 
##             28             31             31             30

prop.table(table(disorder$Expert.Diagnose))

## 
## Bipolar Type-1 Bipolar Type-2     Depression         Normal 
##      0.2333333      0.2583333      0.2583333      0.2500000

The slightly higher frequency of Bipolar Type-2 and Depression diagnoses might mean that the two share a common predictive behavioral symptom. Maybe suicidal thoughts? Let's run a statistical test to find out whether there is a significant association between suicidal thoughts and the Bipolar Type-2 and Depression diagnoses.
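
One way to check whether the four diagnosis counts differ more than chance alone would produce is a chi-square goodness-of-fit test against equal proportions. This is only a sketch using the counts from the table() output above. The object names `counts` and `gof` are my own.

```r
# Diagnosis counts copied from the table() output above.
counts <- c("Bipolar Type-1" = 28, "Bipolar Type-2" = 31,
            "Depression" = 31, "Normal" = 30)

# Relative frequencies, rounded to three decimals.
round(prop.table(counts), 3)

# Goodness-of-fit test against equal proportions
# (the default when chisq.test() receives a single vector).
gof <- chisq.test(counts)
gof$statistic
gof$p.value
```

A large p-value here would suggest that the differences between 28, 30, and 31 patients are compatible with an evenly balanced sample, so the frequencies alone are weak evidence for a shared symptom.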

Statistical testing

To test the association between suicidal thoughts and diagnosis, we first build a two-way contingency table with table(). Suicidal thoughts was picked arbitrarily as a starting symptom. It is a categorical YES/NO variable, so the table counts how often each answer occurs within each expert diagnosis.

tab_diagnose = table(disorder$Suicidal.thoughts, disorder$Expert.Diagnose)

chi.test = chisq.test(tab_diagnose)

## Warning in chisq.test(tab_diagnose): Chi-squared approximation may be incorrect

chi.test

## 
##  Pearson's Chi-squared test
## 
## data:  tab_diagnose
## X-squared = 36.871, df = 6, p-value = 1.865e-06

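We used the chi-square test because we are looking for an association between two categorical variables. The test gives p-value < 0.05, so we reject the null hypothesis of independence and conclude that there is a statistically significant association between suicidal thoughts and diagnosis. Two caveats apply. The test reports df = 6 rather than 3 because one answer is stored as "YES " with a trailing space (visible in the glimpse() output), which table() counts as a third category. The warning also indicates that some expected cell counts are too small for the chi-square approximation to be reliable. Finally, this is only an overall association across all four diagnoses. It says nothing specific about Bipolar Type-2 and Depression.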

sub = subset(disorder, Expert.Diagnose %in% c("Depression", "Bipolar Type-2"))

tab.sub = table(sub$Expert.Diagnose, sub$Suicidal.thoughts)
tab.sub

##                 
##                  NO YES YES 
##   Bipolar Type-2  8  22    1
##   Depression     10  21    0

chi.sub = chisq.test(tab.sub)

## Warning in chisq.test(tab.sub): Chi-squared approximation may be incorrect

chi.sub

## 
##  Pearson's Chi-squared test
## 
## data:  tab.sub
## X-squared = 1.2455, df = 2, p-value = 0.5365
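
The third column in tab.sub comes from a single answer stored as "YES " with a trailing space, and the warning reflects the very small expected counts in that column. The sketch below rebuilds the table from the counts printed above, merges the stray category into "YES", and uses Fisher's exact test as an alternative that does not depend on the large-sample approximation. The names `tab_raw` and `tab_clean` are illustrative.

```r
# Counts copied from tab.sub above; the third column is the stray "YES " level.
tab_raw <- matrix(c(8, 22, 1,
                    10, 21, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Bipolar Type-2", "Depression"),
                                  c("NO", "YES", "YES ")))

# Expected counts under independence; cells below 5 trigger the warning.
expected_raw <- suppressWarnings(chisq.test(tab_raw)$expected)
expected_raw

# Merge "YES " into "YES" to get a clean 2 x 2 table.
tab_clean <- cbind(NO  = tab_raw[, "NO"],
                   YES = tab_raw[, "YES"] + tab_raw[, "YES "])
tab_clean

# Fisher's exact test on the cleaned table.
fisher.test(tab_clean)$p.value
```

Trimming the whitespace in the data itself, for example with trimws(), would avoid the problem at the source.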

First, we subset the data to keep only the Depression and Bipolar Type-2 patients. The chi-square test on this subset gives p-value > 0.05, which means there is no statistically significant association between suicidal thoughts and the Bipolar Type-2 versus Depression diagnosis. Unlucky for us. So let's try again with another symptom: mood swing.

tab_mood = table(disorder$Expert.Diagnose, disorder$Mood.Swing)
tab_mood

##                 
##                  NO YES
##   Bipolar Type-1  3  25
##   Bipolar Type-2  0  31
##   Depression     31   0
##   Normal         29   1

tab.sub.mood = table(sub$Expert.Diagnose, sub$Mood.Swing)
tab.sub.mood

##                 
##                  NO YES
##   Bipolar Type-2  0  31
##   Depression     31   0

chi.sub.mood = chisq.test(tab.sub.mood)
chi.sub.mood

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab.sub.mood
## X-squared = 58.065, df = 1, p-value = 2.537e-14

summary(chi.sub.mood)

##           Length Class  Mode     
## statistic 1      -none- numeric  
## parameter 1      -none- numeric  
## p.value   1      -none- numeric  
## method    1      -none- character
## data.name 1      -none- character
## observed  4      table  numeric  
## expected  4      -none- numeric  
## residuals 4      table  numeric  
## stdres    4      table  numeric

We followed exactly the same steps as for the previous symptom. This time the chi-square test gives p-value < 0.05, so there is a statistically significant association between mood swing and the Bipolar Type-2 versus Depression diagnosis. In fact, the table shows complete separation in this sample: all 31 Bipolar Type-2 patients report mood swings and none of the 31 Depression patients do.
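
A p-value alone does not say how strong an association is. For a 2 x 2 table, one common effect size is the phi coefficient, computed as the square root of the uncorrected chi-square statistic divided by the sample size. The sketch below applies it to the counts printed in tab.sub.mood. The object names are my own.

```r
# Counts copied from tab.sub.mood above.
tab_mood_sub <- matrix(c(0, 31,
                         31, 0),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("Bipolar Type-2", "Depression"),
                                       c("NO", "YES")))

# Uncorrected chi-square statistic, as used in the usual phi formula.
x2 <- unname(chisq.test(tab_mood_sub, correct = FALSE)$statistic)

# Phi coefficient for a 2 x 2 table: sqrt(X^2 / n), ranging from 0 to 1.
phi <- sqrt(x2 / sum(tab_mood_sub))
phi
```

For these counts phi equals 1, the maximum possible value, which matches the fact that every patient falls into one of the two off-diagonal cells.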

Data modeling

Next, we model the data. I used binary logistic regression because the outcome is categorical with two levels (Bipolar Type-2 or Depression) and I want to know whether mood swing (YES or NO) predicts it.

sub = disorder %>% 
  filter(Expert.Diagnose %in% c("Bipolar Type-2", "Depression"))

sub$Expert.Diagnose = factor(sub$Expert.Diagnose)
sub$Mood.Swing = factor(sub$Mood.Swing)

sub$Expert.Diagnose = relevel(sub$Expert.Diagnose, ref = "Depression")

model.disorder = glm(Expert.Diagnose ~ Mood.Swing,
                     data = sub,
                     family = binomial)

summary(model.disorder)

## 
## Call:
## glm(formula = Expert.Diagnose ~ Mood.Swing, family = binomial, 
##     data = sub)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -26.57   63961.76   0.000        1
## Mood.SwingYES    53.13   90455.60   0.001        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.595e+01  on 61  degrees of freedom
## Residual deviance: 3.597e-10  on 60  degrees of freedom
## AIC: 4
## 
## Number of Fisher Scoring iterations: 25

plot(model.disorder)

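The model output looks odd and is hard to interpret at face value: the estimates are very large, the standard errors are huge, and both p-values equal 1. This happens because mood swing perfectly separates the two diagnoses in this sample, so the maximum-likelihood estimate does not exist and the fitting algorithm drifts toward infinite coefficients (note the residual deviance of practically zero and the 25 Fisher scoring iterations, which is the default maximum). So let's check visually how well mood swing predicts the two diagnoses.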
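
Huge standard errors like these are typical when a predictor perfectly separates the outcome. A quick way to spot this before fitting is to look for zero cells in the cross-tabulation. The sketch below rebuilds the two columns from the counts in tab.sub.mood rather than reading the real file. The names `sep`, `tab_sep`, and `fit` are illustrative.

```r
# Rebuild the two columns from the counts in tab.sub.mood:
# 31 Bipolar Type-2 patients with mood swings, 31 Depression patients without.
sep <- data.frame(
  diagnosis = factor(rep(c("Bipolar Type-2", "Depression"), each = 31),
                     levels = c("Depression", "Bipolar Type-2")),
  mood      = factor(rep(c("YES", "NO"), each = 31))
)

# Zero cells in the cross-tabulation indicate (quasi-)complete separation.
tab_sep <- table(sep$diagnosis, sep$mood)
tab_sep
any(tab_sep == 0)

# The fit itself: the standard error dwarfs the estimate.
fit <- suppressWarnings(glm(diagnosis ~ mood, data = sep, family = binomial))
coef(summary(fit))["moodYES", c("Estimate", "Std. Error")]
```

When separation is present, the cross-tabulation is usually a more honest summary than the regression coefficients. Penalized approaches such as Firth's bias-reduced logistic regression are one possible alternative, which I did not try here.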

Final visualization

chi.model = sub %>% 
  ggplot(
    aes(
      x = Expert.Diagnose,
      fill = Mood.Swing
    )
  )+
  geom_bar(position = "dodge")+
  labs(
    title = "Mood swings by diagnosis",
    x = "Diagnosis",
    y = "Number of patients",
    fill = "Mood Swings"
  )+
  theme_minimal()

chi.model

The plot shows that the answer to the mood swing question lines up exactly with the diagnosis in this subset: patients who reported no mood swings were diagnosed with Depression and, vice versa, patients who reported mood swings were diagnosed with Bipolar Type-2.
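
To express the same pattern as proportions within each diagnosis, prop.table() with margin = 1 can be applied to the counts. The sketch below uses the numbers printed in tab_mood for all four diagnoses. The object names are my own.

```r
# Counts copied from tab_mood above (rows: diagnosis, columns: mood swing).
tab_mood_all <- matrix(c(3, 25,
                         0, 31,
                         31, 0,
                         29, 1),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(c("Bipolar Type-1", "Bipolar Type-2",
                                         "Depression", "Normal"),
                                       c("NO", "YES")))

# Row proportions: share of NO and YES answers within each diagnosis.
row_props <- prop.table(tab_mood_all, margin = 1)
round(row_props, 3)
```

Each row sums to 1, so the YES column can be read directly as the share of patients with mood swings in that diagnosis group.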

Conclusion

The dataset contains four expert diagnoses. By subsetting to Bipolar Type-2 and Depression, we identified a statistically significant association between these diagnoses and mood swings.

Limitations:

Because the symptoms I tested were chosen arbitrarily, there may be other predictive symptoms that I did not take into account. In addition, I focused mainly on the p-value, which is not enough for interpretation on its own, so there is a risk of misinterpretation.
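
One way to address the first limitation would be to screen all YES/NO symptoms at once and adjust the p-values for multiple testing. The sketch below is an assumption about how this could be done, not something run on the real data: `toy` is a made-up data frame of 20 hypothetical patients that only borrows the column names.

```r
# Hypothetical toy data, NOT the real dataset: 20 made-up patients.
toy <- data.frame(
  Expert.Diagnose = rep(c("Bipolar Type-2", "Depression"), each = 10),
  Mood.Swing      = rep(c("YES", "NO"), each = 10),
  Overthinking    = rep(c("YES", "NO"), times = 10),
  Anorxia         = rep(c("YES", "YES", "NO", "NO", "NO"), times = 4)
)

symptoms <- c("Mood.Swing", "Overthinking", "Anorxia")

# One Fisher's exact test per symptom against the diagnosis.
p_raw <- vapply(symptoms, function(s) {
  fisher.test(table(toy[[s]], toy$Expert.Diagnose))$p.value
}, numeric(1))

# Holm correction controls the family-wise error rate across the tests.
p_adj <- p.adjust(p_raw, method = "holm")
sort(p_adj)
```

Sorting the adjusted p-values ranks the symptoms by strength of evidence, and pairing each with an effect size would help with the second limitation.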