Final Project 110

Author

Christina Takla

Final Project for DATA110

Hospital, Doctor, Patient Dataset

Source: UCLA

Variables in the dataset (8,525 rows and 27 variables)

Variable Name	Description	Data Type
tumorsize	size of each tumor in mm	Quantitative
co2	CO2 levels in percents	Quantitative
pain	scale of 1-10, 10 being the most pain	Quantitative
wound	scale of 1-10 indicating severity of the wound	Quantitative
mobility	scale of 1-10 indicating ability to move	Quantitative
ntumors	number of tumors each patient has (right censored)	Quantitative
nmorphine	number of self administered morphine doses	Quantitative
remission	1= yes, 0= no	Categorical
lungcapacity	proportion of optimal lung capacity	Quantitative
Age	years each patient has been alive	Quantitative
Married	Marital Status	Categorical
FamilyHx	If the patient has family history of cancer	Categorical
SmokingHx	Smoking status, 3 levels	Categorical
Sex	male or female	Categorical
CancerStage	stages 1-4	Categorical
LengthofStay	number of days the patient stayed in the hospital	Quantitative
WBC	white blood cell count	Quantitative
RBC	red blood cell count	Quantitative
BMI	body mass index	Quantitative
IL6	interleukin 6 levels (indicator of inflammation)	Quantitative
CRP	C-reactive protein levels (indicator of inflammation)	Quantitative
DID	doctor identifier	Categorical
Experience	Number of years practicing medicine	Quantitative
School	the quality of the school the doctor was trained	Categorical
Lawsuits	number of malpractice lawsuits each doctor has	Quantitative
HID	hospital identifier	Categorical
Medicaid	proportion of patients enrolled in medicaid at each hospital	Quantitative

There was not a readme file to explain the method used to collect this information, because it is a simulation made by the UCLA data science program to be used for EDA. I am not sure if there is any truth to the information provided in the dataset, or if it is purely fictional. I chose this dataset, because I have always been interested in the healthcare sector. It gives a broad overview of three hierarchical levels in the world of healthcare which are hospital, doctor, and patient, and I have never seen information organized in such a way.

Apparently, the format of this dataset is a newer type of structure to gathering data about hospitals, patients, and doctors, which is now sought after. This type of data collection allows for a hierarchical medical system that seems to be the most efficient at the moment (Wang et al).

With this dataset, I’m interested in exploring remission of patients in relation to the tumor size in order to investigate if there is any correlation between the two. I would also like to investigate if the age someone contracts cancer changes based off BMI.

citation: Wang, Yaogang et al. “Hierarchical Medical System Based on Big Data and Mobile Internet: A New Strategic Choice in Health Care.” JMIR medical informatics vol. 5,3 e22. 8 Aug. 2017, doi:10.2196/medinform.6799

Load and clean the data

library(dplyr)
library(tidyr)
library(ggplot2)
library(highcharter)
library(plotly)
library(knitr)
library(readr)
hdp <- read_csv("~/Data101/5.1.23/hdp.csv")
View(hdp)

sum(is.na(hdp))

[1] 0

str(hdp)

spc_tbl_ [8,525 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ tumorsize   : num [1:8525] 68 64.7 51.6 86.4 53.4 ...
 $ co2         : num [1:8525] 1.53 1.68 1.53 1.45 1.57 ...
 $ pain        : num [1:8525] 4 2 6 3 3 4 3 3 4 5 ...
 $ wound       : num [1:8525] 4 3 3 3 4 5 4 3 4 4 ...
 $ mobility    : num [1:8525] 2 2 2 2 2 2 2 3 3 3 ...
 $ ntumors     : num [1:8525] 0 0 0 0 0 0 0 0 2 0 ...
 $ nmorphine   : num [1:8525] 0 0 0 0 0 0 0 0 0 0 ...
 $ remission   : num [1:8525] 0 0 0 0 0 0 0 0 0 0 ...
 $ lungcapacity: num [1:8525] 0.801 0.326 0.565 0.848 0.886 ...
 $ Age         : num [1:8525] 65 53.9 53.3 41.4 46.8 ...
 $ Married     : num [1:8525] 0 0 1 0 0 1 1 0 1 0 ...
 $ FamilyHx    : chr [1:8525] "no" "no" "no" "no" ...
 $ SmokingHx   : chr [1:8525] "former" "former" "never" "former" ...
 $ Sex         : chr [1:8525] "male" "female" "female" "male" ...
 $ CancerStage : chr [1:8525] "II" "II" "II" "I" ...
 $ LengthofStay: num [1:8525] 6 6 5 5 6 5 4 5 6 7 ...
 $ WBC         : num [1:8525] 6088 6700 6043 7163 6443 ...
 $ RBC         : num [1:8525] 4.87 4.68 5.01 5.27 4.98 ...
 $ BMI         : num [1:8525] 24.1 29.4 29.5 21.6 29.8 ...
 $ IL6         : num [1:8525] 3.7 2.63 13.9 3.01 3.89 ...
 $ CRP         : num [1:8525] 8.086 0.803 4.034 2.126 1.349 ...
 $ DID         : num [1:8525] 1 1 1 1 1 1 1 1 1 1 ...
 $ Experience  : num [1:8525] 25 25 25 25 25 25 25 25 25 25 ...
 $ School      : chr [1:8525] "average" "average" "average" "average" ...
 $ Lawsuits    : num [1:8525] 3 3 3 3 3 3 3 3 3 3 ...
 $ HID         : num [1:8525] 1 1 1 1 1 1 1 1 1 1 ...
 $ Medicaid    : num [1:8525] 0.606 0.606 0.606 0.606 0.606 ...
 - attr(*, "spec")=
  .. cols(
  ..   tumorsize = col_double(),
  ..   co2 = col_double(),
  ..   pain = col_double(),
  ..   wound = col_double(),
  ..   mobility = col_double(),
  ..   ntumors = col_double(),
  ..   nmorphine = col_double(),
  ..   remission = col_double(),
  ..   lungcapacity = col_double(),
  ..   Age = col_double(),
  ..   Married = col_double(),
  ..   FamilyHx = col_character(),
  ..   SmokingHx = col_character(),
  ..   Sex = col_character(),
  ..   CancerStage = col_character(),
  ..   LengthofStay = col_double(),
  ..   WBC = col_double(),
  ..   RBC = col_double(),
  ..   BMI = col_double(),
  ..   IL6 = col_double(),
  ..   CRP = col_double(),
  ..   DID = col_double(),
  ..   Experience = col_double(),
  ..   School = col_character(),
  ..   Lawsuits = col_double(),
  ..   HID = col_double(),
  ..   Medicaid = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Making the “remission” column better for use

hdp$remission <- as.factor(hdp$remission)
hdp$remission <- recode_factor(hdp$remission, "0" = "No", "1" = "Yes")

Here, I am categorizing the size of the tumors

hdp1 <- hdp %>%
  group_by(tumorsize) %>%
  mutate(L_tumor = tumorsize>=30 & tumorsize<60,
         XL_tumor = tumorsize>=60 & tumorsize<90,
         XXL_tumor = tumorsize>=90 & tumorsize<120) %>%
  gather(tumorcat, logical, L_tumor, XL_tumor, XXL_tumor) %>%
  filter(logical == T) %>%
  select(pain, tumorsize, tumorcat, remission)

hdp1.1 <- hdp1 %>%
  group_by(tumorcat, remission) %>%
  summarise(Count = n())

mycolors <- colors()[c(20, 38, 428, 622, 91, 411, 76)]

Visual 1

plot1 <- ggplot(hdp1.1, aes(tumorcat, Count, fill= remission)) +
  geom_bar(stat = "identity",
           position = 'dodge', 
           color="grey") +
  scale_fill_manual(values = c("bisque1", "lightsalmon4" ))+
  guides(fill=guide_legend(title=NULL))+
  ggtitle("Number of patients in remission in regards to the size of their tumor")+
  labs(x= "Tumor Size Category",
       y= "Count")+
  theme_minimal()
plot1 <- ggplotly()
plot1

For my first visual, I wanted to see if there was an obvious relationship between tumor size and if the patient goes into remission. This visual shows that no matter the size of the tumor, it is less likely that the patient will go into remission. In all size categories, the number of patients who do not go into remission are more than double of those who do.

hdp2 <- hdp %>%
  select(Age, BMI, FamilyHx)

Visual 2

my_own_theme <- hc_theme(
  chart = list(
    colors = c("bisque1", "lightsalmon4", "tan2"),
    backgroundColor = 'transparent',
    divBackgroundImage = "https://media.tenor.com/2ce3aeUUwbIAAAAC/colors-pattern.gif"
  ),
  title = list(
    style = list(
      color = "black",
      fontFamily = "Bold"
    )
  ),
  subtitle = list(
    style = list(
      color = "black",
      fontFamily = "Shadows Into Light"
    )
  ),
  legend = list(
    itemStyle = list(
      fontFamily = "Tangerine",
      color = "black"
    ),
    itemHoverStyle = list(
      color = "gray"
    )
  )
)
plot2 <- highchart() %>%
  hc_add_theme(my_own_theme) %>%
  hc_add_series(hdp2, "scatter", hcaes(x= "Age", y= "BMI", group = hdp2$FamilyHx, color = hdp2$BMI)) %>%
 hc_colors(mycolors) %>%
  hc_title(text = "BMI and Age of each patient with Cancer") %>%
  hc_yAxis(title = list(text = "BMI"))%>%
  hc_xAxis(title = list(text = "Age")) %>%
  hc_legend(title = list(text ="Family History of Cancer"))
plot2

This visual was made in an effort to utilize unique qualities that highcharter offers. I was able to include a gif background image to add color and movement to the chart, but it does not provide any substance to the chart. I also made my own color palette that should have shown for the BMI, but it did not work. There are also axis labels that may be hard to see due to the colors that were used. This chart shows a scatter plot of age to bmi of the cancer patients in the data set. It shows a normal distribution of bmi to age, and it also shows the same distribution for individuals who have a family history of cancer and for those who do not have that family history. The shape of this graph indicates that there probably is no association between bmi and getting cancer.

Statistical Analysis

hdp3 <- hdp1.1 %>%
  spread(remission, Count) %>%
  mutate(rate_remission = (Yes/sum(Yes + No))*100,
         total = (Yes+No)) %>%
  select(tumorcat,rate_remission, total)

library(DataExplorer)
plot_correlation(hdp3, title = "The correlation between each variable in the HDP3 dataset")

cor.test(hdp3$rate_remission, hdp3$total,
                    method = "pearson")


    Pearson's product-moment correlation

data:  hdp3$rate_remission and hdp3$total
t = -0.35844, df = 1, p-value = 0.7809
alternative hypothesis: true correlation is not equal to 0
sample estimates:
       cor 
-0.3374228

According to these correlation tests, there is not a significant result because the p-value is much greater than 0.05. There is also a weak negative correlation between the remission rate and the total number of tumors in each category. The category size that had the strongest correlation with remission rates was the XXL tumors which I had defined as between 90 and 120 mm.