Part 0: Importing Data and Preprocessing

First, we import the datasets into three separate data frames.

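# begin: load files
# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0.0),
# which the later levels() and summary() calls rely on
demog <- read.csv("dem.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
diag  <- read.csv("dia.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
EDvis <- read.table("ed_visits.txt", header = TRUE, sep = "$", stringsAsFactors = TRUE)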
# include R packages
library(ggplot2)
library(dplyr)
library(scales)
library(knitr)
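
The later steps call levels() on dia_code, which only works if that column was read in as a factor. Since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so a quick import check along the following lines may be useful. This is only a sketch: the inline CSV text and its rows are invented for illustration, and only the column names empi and dia_code come from the analysis.

```r
# Toy stand-in for dia.csv; the rows are invented for illustration only
csv_text <- "empi,dia_code\n1,300.00\n1,272.4\n2,V12.3\n3,300.00"

# Ask for factors explicitly so the result does not depend on the R version
diag_toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Inspect the column types; dia_code should be reported as a factor
str(diag_toy)

# Number of distinct diagnosis codes in the toy data
n_codes <- length(levels(diag_toy$dia_code))
n_codes
```

If str() reports dia_code as chr rather than Factor, levels() returns NULL and the count silently becomes 0.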

Part 1: Patient Description

The patients are a mix of races and age groups, with both sexes represented. The following plots describe the proportions of these groups.

# Bar plot to describe sex grouped by race
sexRacePlot <- ggplot(data = demog, aes(gender, fill = race)) + geom_bar(alpha = 0.7) + coord_flip()
sexRacePlot

This plot yields two important observations.

  1. We see that there is an equal number of men and women.
  2. On the other hand, the racial distribution is more skewed:
    • predominantly white
    • the next most common race is black, followed by Hispanic, then Asian and, lastly, Native American.

Aside: This racial and sex profile is largely in line with the national population mix, so the cohort of patients seems to represent a more or less uniformly random sample of the nationwide population.
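
The observations above are read off the bar plot by eye. A minimal sketch of how the same proportions could be tabulated is given here; the miniature data frame is hypothetical and its values are invented, with only the column names gender and race taken from the plot code.

```r
# Hypothetical miniature of the demographics table (values are invented)
demog_toy <- data.frame(
  gender = c("F", "M", "F", "M", "F", "M"),
  race   = c("white", "white", "black", "white", "hispanic", "asian")
)

# Share of each sex in the cohort
sex_share <- prop.table(table(demog_toy$gender))

# Share of each race within each sex (each row sums to 1)
race_by_sex <- prop.table(table(demog_toy$gender, demog_toy$race), margin = 1)

round(sex_share, 2)
round(race_by_sex, 2)
```

Running the same two calls on demog would give exact shares to set against the national figures mentioned in the aside.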

Next, we visualize the age-mix combined with race using a histogram, as well as combined with gender using a box-plot:

# histogram
ageplotHist <- ggplot(data = demog, aes(x = age, fill = race)) + geom_histogram(bins = 9, col = "white", alpha = 0.8)
ageplotHist

# box-plot
ageplotBox <- ggplot(data = demog, aes(x = gender, y = age)) + geom_boxplot() + coord_flip()
ageplotBox

We note that the population skews older: the largest sub-group is over 90 years of age, and the number of patients generally increases with age. It is interesting to note, however, that both race and sex seem to be evenly represented across the age groups.
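
The numbers behind the two plots can also be computed directly. The following is a sketch under stated assumptions: the ages are invented, and the 10-year bands are an arbitrary choice rather than the nine bins used in the histogram.

```r
# Hypothetical ages and sexes (values are invented)
demog_toy <- data.frame(
  age    = c(34, 52, 67, 71, 78, 85, 88, 91, 93, 95),
  gender = c("F", "M", "F", "M", "F", "M", "F", "M", "F", "M")
)

# Five-number summary of age for each sex, i.e. the values a box plot draws
age_by_sex <- tapply(demog_toy$age, demog_toy$gender, fivenum)

# Patients per 10-year age band, i.e. the counts a histogram draws
age_band    <- cut(demog_toy$age, breaks = seq(30, 100, by = 10), right = FALSE)
band_counts <- table(age_band)

age_by_sex
band_counts
```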

Part 2: Inference and Data Exploration

After playing around with the data (aka, fooling around and failing) in different ways, I decided to focus on diseases. Before doing anything else, I wanted to see the spread of diseases. From the data, the number of distinct diagnosis codes is:

length(levels(diag$dia_code))
## [1] 688

Since there were so many of them to work with, I just wanted to see a landscape of what they looked like for the population at large, and see which ones affected patients disproportionately.

# create combined data frame of the patients and their diagnoses
patientDiagnoses <- merge(demog, diag, by = "empi", all = FALSE)
# plot
diseasePlot <- ggplot(data = patientDiagnoses, aes(dia_code, fill = race)) + geom_bar() 
diseasePlot <- diseasePlot + theme(legend.position = "bottom", axis.text.x = element_blank())
diseasePlot

While there is too much going on in this plot, it certainly seems that there are some diseases that affect a lot of patients. Let us try to find which these might be.

kable(summary(patientDiagnoses$dia_name), col.names = 'Patient Count')
Patient Count
Other and unspecified hyperlipidemia 590
Abdominal pain, unspecified site 226
Other malignant lymphomas, unspecified site, extranodal and solid organ sites 202
Malignant neoplasm of breast (female), unspecified 144
Acute reaction to stress 119
Other acute reactions to stress 118
Panic disorder without agoraphobia 118
Unspecified acute reaction to stress 115
Pain in joint involving lower leg 113
Hysteria, unspecified 110
Generalized anxiety disorder 107
Anxiety, dissociative and somatoform disorders 103
Other anxiety states 103
Diabetes mellitus type II [non-insulin dependent type] [NIDDM type] [adult-onset type] or unspecified type, not stated as uncontrolled, with unspecified complication 98
Hypertensive chronic kidney disease, benign, with chronic kidney disease stage V or end stage renal disease 84
Hyposmolality and/or hyponatremia 82
Cirrhosis of liver without mention of alcohol 77
Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease 75
Other seborrheic keratosis 53
Other malignant neoplasm of skin of other and unspecified parts of face 50
Esophageal reflux 45
Other paralytic syndromes 45
Chronic pulmonary heart disease 43
Mylagia and myositis, unspecified 43
Secondary malignant neoplasm of other specified sites 43
Malignant neoplasm without specification of site 41
Secondary and unspecified malignant neoplasm of lymph nodes 40
Malignant neoplasm of brain, unspecified 38
Unspecified disease of pulmonary circulation 38
Embolism and thrombosis of other specified veins 37
Secondary malignant neoplasm of respiratory and digestive systems 36
Swelling, mass, or lump in chest 36
Chronic pulmonary heart disease, unspecified 35
Intracerebral hemorrhage 35
Acute or unspecified hepatitis C without mention of hepatic coma 33
Atherosclerosis of native arteries of the extremities, unspecified 32
Pain in joint, site unspecified 32
Adjustment reaction with prolonged depressive reaction 31
Hemiplegia and hemiparesis 31
Cognitive deficits as late effect of cerebrovascular disease 30
Unspecified essential hypertension 30
Benign hypertensive heart disease without heart failure 29
Other second degree atrioventricular block 28
Acute cor pulmonale 27
Alcohol-induced psychotic disorder with delusions 27
Chronic hepatitis C without mention of hepatic coma 27
Atrioventricular block, unspecified 26
First degree atrioventricular block 26
Secondary malignant neoplasm of ovary 26
Unspecified gastritis and gastroduodenitis, without mention of hemorrhage 26
Alcohol-induced persisting dementia 25
Benign essential hypertension 25
Neurogenic bladder NOS 25
Epilepsy, unspecified, without mention of intractable epilepsy 24
Other conditions of brain 24
Other severe protein-calorie malnutrition 24
Alcohol-induced persisting amnestic disorder 23
Chest pain, unspecified 22
Complications of transplanted liver 22
Nontoxic multinodular goiter 22
Nutritional marasmus 22
Human immunodeficiency virus [HIV] disease 21
Pain in limb 21
Cellulitis and abscess of upper arm and forearm 20
Chronic viral hepatitis B without mention of hepatic coma without mention of hepatitis delta 20
Subarachnoid hemorrhage 20
Atherosclerosis of aorta 19
Other chronic pulmonary heart diseases 19
Abdominal pain, other specified site 18
Acidosis 18
Diabetes mellitus type II [non-insulin dependent type] [NIDDM type] [adult-onset type] or unspecified type, not stated as uncontrolled, with ophthalmic manifestations 18
Headache 18
Secondary and unspecified malignant neoplasm of lymph nodes of axilla and upper limb 18
Acute myeloid leukemia, in relapse 17
Esophageal varices with bleeding 17
Kwashiorkor 17
Epistaxis 16
Hematemesis 16
Other and unspecified protein-calorie malnutrition 16
Thoracic or lumbosacral neuritis or radiculitis, unspecified 16
Pain in thoracic spine 15
Sensorineural hearing loss of combined types 15
Shortness of breath 15
Atherosclerosis of renal artery 14
Hyperpotassemia 14
Leukocytosis, unspecified 14
Other malaise and fatigue 14
Subdural hemorrhage following injury, without mention of open intracranial wound, with state of consciousness unspecified 14
Aftercare following joint replacement 13
Lumbago 13
Chronic viral hepatitis B without mention of hepatic coma with hepatitis delta 12
Congenital deficiency of other clotting factors 12
Dizziness and giddiness 12
Mixed acid-base balance disorder 12
Nocturia 12
Other and unspecified coagulation defects 12
Other specified personal history presenting hazards to health 12
Syncope and collapse 12
Anxiety state, unspecified 11
(Other) 1339

The conditions that find their way onto this table are fairly diverse, from heart conditions to abdominal pain. Given the age demographic at hand, many of the conditions affect the body directly and seem fairly reasonable.

What piqued my interest is the prevalence of stress- and anxiety-related conditions, i.e. mental health afflictions, near the very top of the table.

So next, let’s do a deep dive into these conditions to figure out who they’re affecting, how they’re being treated, and in general, what may be going on.

In doing this, I began by gathering some basic domain knowledge on how the ICD-9 codes work, and how diseases are systematically grouped together accordingly. In order to group the diseases from the given data, I did some basic data wrangling to classify the diagnoses that qualify as mental health conditions (ICD-9 codes 290-319).

# select all records with patients who've been diagnosed with mental health conditions
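# convert via as.character(): as.numeric() on a factor returns level indices, not the codes
diaCodeNum <- suppressWarnings(as.numeric(as.character(patientDiagnoses$dia_code)))
mentalHealthPatients <- patientDiagnoses[which(diaCodeNum >= 290 & diaCodeNum < 320), ]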
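
Selecting a numeric range from ICD-9 codes has a subtle pitfall when the codes are stored as a factor: as.numeric() then returns the internal level indices, not the code values. A minimal sketch of a range check that avoids this is shown here. The helper name is_mental_health is hypothetical, the example codes are made up, and treating V and E codes as non-matches is an assumption.

```r
# Hypothetical helper: TRUE when an ICD-9 code falls in the 290-319 block
is_mental_health <- function(code) {
  # Convert to character first so that factor inputs yield the code values
  num <- suppressWarnings(as.numeric(as.character(code)))
  # Non-numeric codes such as "V12.3" become NA and count as non-matches
  !is.na(num) & num >= 290 & num < 320
}

# Example codes stored as a factor, as read.csv() would produce with
# stringsAsFactors = TRUE
codes <- factor(c("300.00", "272.4", "V12.3", "319", "320", "290"))
flags <- is_mental_health(codes)
flags
```

Applied to the merged data, the flag could be used as a row filter, for example patientDiagnoses[is_mental_health(patientDiagnoses$dia_code), ].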

It is extremely interesting that while the number of diagnoses of mental health conditions as observed from the previous table is well over 400, there are only

length(unique(mentalHealthPatients$empi))
## [1] 132

unique patients with mental health disorders, which suggests that a number of these disorders occur together in patients. I now seek to answer who these patients might be.
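
Note that length() on a data frame returns its number of columns, so counting patients needs the distinct values of the ID column. A small sketch of the distinction between records and patients, using invented rows and dplyr's n_distinct() and count():

```r
library(dplyr)

# Invented records: one row per (patient, diagnosis) pair
mh_toy <- data.frame(
  empi     = c(1, 1, 1, 2, 2, 3),
  dia_name = c("Panic disorder", "Generalized anxiety disorder",
               "Other anxiety states", "Panic disorder",
               "Acute reaction to stress", "Hysteria, unspecified")
)

n_records  <- nrow(mh_toy)             # number of diagnosis records
n_patients <- n_distinct(mh_toy$empi)  # number of unique patients

# Number of mental health diagnoses recorded for each patient
per_patient <- mh_toy %>% count(empi, name = "n_diagnoses")
per_patient
```

A per-patient count like this would show directly how many of these disorders occur together in each patient.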

# age and race
ageplotHist2 <- ggplot(data = mentalHealthPatients, aes(x = age, fill = race)) + geom_histogram(bins = 9, col = "white", alpha = 0.8)
ageplotHist2

# population
ageplotHist

How does this distribution compare with the cohort’s overall patient population?
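
Two histograms of different sizes are hard to compare by eye. One possible way to answer the question is to compare the share of each group in every age band. The following is only a sketch: both age vectors are invented, and the 10-year bands are an assumption.

```r
# Invented ages for the whole cohort and for the mental health subgroup
groups <- list(
  cohort   = c(35, 48, 55, 62, 68, 74, 77, 83, 86, 91, 94, 96),
  subgroup = c(48, 62, 74, 86, 94, 96)
)

breaks <- seq(30, 100, by = 10)

# Proportion of a group falling in each age band
share <- function(age) {
  prop.table(table(cut(age, breaks = breaks, right = FALSE)))
}

# One row per age band, one column per group; each column sums to 1
comparison <- sapply(groups, share)
round(comparison, 2)
```

Because each column is a proportion, the two groups can be compared band by band even though the subgroup is much smaller than the cohort.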

# age and sex
ageplotHist3 <- ggplot(data = mentalHealthPatients, aes(x = age, fill = gender)) + geom_histogram(bins = 9, col = "white", alpha = 0.8)
ageplotHist3

The gender distribution is rather even and shows no particularly interesting outliers.

mhID <- mentalHealthPatients$empi
mentalHealth <- filter(patientDiagnoses, empi %in% mhID)
nrow(mentalHealth)
## [1] 1843

This is truly strange: there are 1843 diagnosis records for patients who suffer from mental health conditions, but only 132 such patients, which works out to roughly 14 records per patient on average.

This makes me curious as to what non-mental health conditions they may be suffering from; the stress from some of these might even have led to the mental health troubles. Let us find out.

# top 5 accompanying conditions
kable(mentalHealth %>% count(dia_name, sort = TRUE) %>% top_n(5))
## Selecting by n
dia_name n
Other and unspecified hyperlipidemia 181
Other malignant lymphomas, unspecified site, extranodal and solid organ sites 70
Abdominal pain, unspecified site 65
Malignant neoplasm of breast (female), unspecified 55
Esophageal reflux 45

The condition most commonly accompanying mental health disorders is hyperlipidemia, which, as Google tells me, is the presence of a high level of fats in the blood. Also on the list are malignant lymphomas and neoplasm of the breast, both of which are cancers. There are also two stomach / digestive conditions: abdominal pain and esophageal reflux.
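
When counting accompanying conditions, it can help to drop the mental health diagnoses themselves first so that they cannot crowd the list. A minimal sketch of that idea, using invented records and assuming the 290-319 ICD-9 block defines the mental health conditions:

```r
library(dplyr)

# Invented diagnosis records for patients in the mental health subgroup
mh_records <- data.frame(
  empi     = c(1, 1, 1, 2, 2, 3, 3),
  dia_code = c("300.01", "272.4", "530.81", "300.02", "272.4", "308.9", "789.00"),
  dia_name = c("Panic disorder without agoraphobia",
               "Other and unspecified hyperlipidemia",
               "Esophageal reflux",
               "Generalized anxiety disorder",
               "Other and unspecified hyperlipidemia",
               "Unspecified acute reaction to stress",
               "Abdominal pain, unspecified site"),
  stringsAsFactors = FALSE
)

accompanying <- mh_records %>%
  # Non-numeric codes (V and E codes) become NA and are kept as non-mental
  mutate(code_num = suppressWarnings(as.numeric(dia_code))) %>%
  # Keep only diagnoses outside the 290-319 mental health block
  filter(is.na(code_num) | code_num < 290 | code_num >= 320) %>%
  # Count how often each remaining condition appears, most frequent first
  count(dia_name, sort = TRUE)

accompanying
```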

Inference and intuition

This list is interesting and insightful. However, it isn’t incredibly surprising. If asked to guess, I would say that people with hyperlipidemia and esophageal reflux tend to suffer from mental disorders due to an unhealthy and stressful lifestyle (think long hours at work, with no exercise and fast food for meals). For the other group, with cancerous conditions, it is possible that the onset of cancer could’ve provoked severe stress, anxiety, and, consequently, ill mental health.

Part 3: Next Steps

  1. Extract and wrangle data to find which condition they were diagnosed with first
  2. Try to see if the latency period between the first diagnosis and the second is consistent for a pair of conditions.
  3. Run a linear regression to see if we can predict the onset of mental disease from the data at hand.
  4. See how mortality is related to mental health.
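
The first two steps could be sketched roughly as follows. This is an illustration only: it assumes each diagnosis record carries a date, and the column name dia_date, the condition labels, and all rows are hypothetical rather than taken from the actual files.

```r
library(dplyr)

# Invented records; `dia_date` is an assumed column, not one confirmed in dia.csv
dx <- data.frame(
  empi      = c(1, 1, 2, 2),
  condition = c("hyperlipidemia", "anxiety", "anxiety", "hyperlipidemia"),
  dia_date  = as.Date(c("2010-01-15", "2011-03-01", "2012-06-10", "2012-09-08"))
)

latency <- dx %>%
  group_by(empi) %>%
  # Order each patient's records chronologically
  arrange(dia_date, .by_group = TRUE) %>%
  summarise(
    first_condition = first(condition),                  # diagnosed first
    latency_days    = as.numeric(diff(range(dia_date)))  # days between the two
  )

latency
```

Grouping the resulting latencies by pair of conditions would then show whether the gap is consistent for a given pair.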