library(dplyr)
library(tidyr)
library(ggplot2)
library(highcharter)
library(plotly)
library(knitr)
library(readr)
hdp <- read_csv("~/Data101/5.1.23/hdp.csv")
View(hdp)Final Project 110
Final Project for DATA110

Hospital, Doctor, Patient Dataset
Hospital, Doctor, Patient Dataset
Source: UCLA
Variables in the dataset (8,525 rows and 27 variables)
| Variable Name | Description | Data Type |
| tumorsize | size of each tumor in mm | Quantitative |
| co2 | CO2 levels in percents | Quantitative |
| pain | scale of 1-10, 10 being the most pain | Quantitative |
| wound | scale of 1-10 indicating severity of the wound | Quantitative |
| mobility | scale of 1-10 indicating ability to move | Quantitative |
| ntumors | number of tumors each patient has (right censored) | Quantitative |
| nmorphine | number of self administered morphine doses | Quantitative |
| remission | 1= yes, 0= no | Categorical |
| lungcapacity | proportion of optimal lung capacity | Quantitative |
| Age | years each patient has been alive | Quantitative |
| Married | Marital Status | Categorical |
| FamilyHx | If the patient has family history of cancer | Categorical |
| SmokingHx | Smoking status, 3 levels | Categorical |
| Sex | male or female | Categorical |
| CancerStage | stages 1-4 | Categorical |
| LengthofStay | number of days the patient stayed in the hospital | Quantitative |
| WBC | white blood cell count | Quantitative |
| RBC | red blood cell count | Quantitative |
| BMI | body mass index | Quantitative |
| IL6 | interleukin 6 levels (indicator of inflammation) | Quantitative |
| CRP | C-reactive protein levels (indicator of inflammation) | Quantitative |
| DID | doctor identifier | Categorical |
| Experience | Number of years practicing medicine | Quantitative |
| School | the quality of the school the doctor was trained | Categorical |
| Lawsuits | number of malpractice lawsuits each doctor has | Quantitative |
| HID | hospital identifier | Categorical |
| Medicaid | proportion of patients enrolled in medicaid at each hospital | Quantitative |
There was not a readme file to explain the method used to collect this information, because it is a simulation made by the UCLA data science program to be used for EDA. I am not sure if there is any truth to the information provided in the dataset, or if it is purely fictional. I chose this dataset, because I have always been interested in the healthcare sector. It gives a broad overview of three hierarchical levels in the world of healthcare which are hospital, doctor, and patient, and I have never seen information organized in such a way.
Apparently, the format of this dataset is a newer type of structure to gathering data about hospitals, patients, and doctors, which is now sought after. This type of data collection allows for a hierarchical medical system that seems to be the most efficient at the moment (Wang et al).
With this dataset, I’m interested in exploring remission of patients in relation to the tumor size in order to investigate if there is any correlation between the two. I would also like to investigate if the age someone contracts cancer changes based off BMI.
citation: Wang, Yaogang et al. “Hierarchical Medical System Based on Big Data and Mobile Internet: A New Strategic Choice in Health Care.” JMIR medical informatics vol. 5,3 e22. 8 Aug. 2017, doi:10.2196/medinform.6799
Load and clean the data
sum(is.na(hdp))[1] 0
str(hdp)spc_tbl_ [8,525 × 27] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ tumorsize : num [1:8525] 68 64.7 51.6 86.4 53.4 ...
$ co2 : num [1:8525] 1.53 1.68 1.53 1.45 1.57 ...
$ pain : num [1:8525] 4 2 6 3 3 4 3 3 4 5 ...
$ wound : num [1:8525] 4 3 3 3 4 5 4 3 4 4 ...
$ mobility : num [1:8525] 2 2 2 2 2 2 2 3 3 3 ...
$ ntumors : num [1:8525] 0 0 0 0 0 0 0 0 2 0 ...
$ nmorphine : num [1:8525] 0 0 0 0 0 0 0 0 0 0 ...
$ remission : num [1:8525] 0 0 0 0 0 0 0 0 0 0 ...
$ lungcapacity: num [1:8525] 0.801 0.326 0.565 0.848 0.886 ...
$ Age : num [1:8525] 65 53.9 53.3 41.4 46.8 ...
$ Married : num [1:8525] 0 0 1 0 0 1 1 0 1 0 ...
$ FamilyHx : chr [1:8525] "no" "no" "no" "no" ...
$ SmokingHx : chr [1:8525] "former" "former" "never" "former" ...
$ Sex : chr [1:8525] "male" "female" "female" "male" ...
$ CancerStage : chr [1:8525] "II" "II" "II" "I" ...
$ LengthofStay: num [1:8525] 6 6 5 5 6 5 4 5 6 7 ...
$ WBC : num [1:8525] 6088 6700 6043 7163 6443 ...
$ RBC : num [1:8525] 4.87 4.68 5.01 5.27 4.98 ...
$ BMI : num [1:8525] 24.1 29.4 29.5 21.6 29.8 ...
$ IL6 : num [1:8525] 3.7 2.63 13.9 3.01 3.89 ...
$ CRP : num [1:8525] 8.086 0.803 4.034 2.126 1.349 ...
$ DID : num [1:8525] 1 1 1 1 1 1 1 1 1 1 ...
$ Experience : num [1:8525] 25 25 25 25 25 25 25 25 25 25 ...
$ School : chr [1:8525] "average" "average" "average" "average" ...
$ Lawsuits : num [1:8525] 3 3 3 3 3 3 3 3 3 3 ...
$ HID : num [1:8525] 1 1 1 1 1 1 1 1 1 1 ...
$ Medicaid : num [1:8525] 0.606 0.606 0.606 0.606 0.606 ...
- attr(*, "spec")=
.. cols(
.. tumorsize = col_double(),
.. co2 = col_double(),
.. pain = col_double(),
.. wound = col_double(),
.. mobility = col_double(),
.. ntumors = col_double(),
.. nmorphine = col_double(),
.. remission = col_double(),
.. lungcapacity = col_double(),
.. Age = col_double(),
.. Married = col_double(),
.. FamilyHx = col_character(),
.. SmokingHx = col_character(),
.. Sex = col_character(),
.. CancerStage = col_character(),
.. LengthofStay = col_double(),
.. WBC = col_double(),
.. RBC = col_double(),
.. BMI = col_double(),
.. IL6 = col_double(),
.. CRP = col_double(),
.. DID = col_double(),
.. Experience = col_double(),
.. School = col_character(),
.. Lawsuits = col_double(),
.. HID = col_double(),
.. Medicaid = col_double()
.. )
- attr(*, "problems")=<externalptr>
Making the “remission” column better for use
hdp$remission <- as.factor(hdp$remission)
hdp$remission <- recode_factor(hdp$remission, "0" = "No", "1" = "Yes")Here, I am categorizing the size of the tumors
hdp1 <- hdp %>%
group_by(tumorsize) %>%
mutate(L_tumor = tumorsize>=30 & tumorsize<60,
XL_tumor = tumorsize>=60 & tumorsize<90,
XXL_tumor = tumorsize>=90 & tumorsize<120) %>%
gather(tumorcat, logical, L_tumor, XL_tumor, XXL_tumor) %>%
filter(logical == T) %>%
select(pain, tumorsize, tumorcat, remission)hdp1.1 <- hdp1 %>%
group_by(tumorcat, remission) %>%
summarise(Count = n())mycolors <- colors()[c(20, 38, 428, 622, 91, 411, 76)] Visual 1
plot1 <- ggplot(hdp1.1, aes(tumorcat, Count, fill= remission)) +
geom_bar(stat = "identity",
position = 'dodge',
color="grey") +
scale_fill_manual(values = c("bisque1", "lightsalmon4" ))+
guides(fill=guide_legend(title=NULL))+
ggtitle("Number of patients in remission in regards to the size of their tumor")+
labs(x= "Tumor Size Category",
y= "Count")+
theme_minimal()
plot1 <- ggplotly()
plot1For my first visual, I wanted to see if there was an obvious relationship between tumor size and if the patient goes into remission. This visual shows that no matter the size of the tumor, it is less likely that the patient will go into remission. In all size categories, the number of patients who do not go into remission are more than double of those who do.
hdp2 <- hdp %>%
select(Age, BMI, FamilyHx)Visual 2
my_own_theme <- hc_theme(
chart = list(
colors = c("bisque1", "lightsalmon4", "tan2"),
backgroundColor = 'transparent',
divBackgroundImage = "https://media.tenor.com/2ce3aeUUwbIAAAAC/colors-pattern.gif"
),
title = list(
style = list(
color = "black",
fontFamily = "Bold"
)
),
subtitle = list(
style = list(
color = "black",
fontFamily = "Shadows Into Light"
)
),
legend = list(
itemStyle = list(
fontFamily = "Tangerine",
color = "black"
),
itemHoverStyle = list(
color = "gray"
)
)
)
plot2 <- highchart() %>%
hc_add_theme(my_own_theme) %>%
hc_add_series(hdp2, "scatter", hcaes(x= "Age", y= "BMI", group = hdp2$FamilyHx, color = hdp2$BMI)) %>%
hc_colors(mycolors) %>%
hc_title(text = "BMI and Age of each patient with Cancer") %>%
hc_yAxis(title = list(text = "BMI"))%>%
hc_xAxis(title = list(text = "Age")) %>%
hc_legend(title = list(text ="Family History of Cancer"))
plot2This visual was made in an effort to utilize unique qualities that highcharter offers. I was able to include a gif background image to add color and movement to the chart, but it does not provide any substance to the chart. I also made my own color palette that should have shown for the BMI, but it did not work. There are also axis labels that may be hard to see due to the colors that were used. This chart shows a scatter plot of age to bmi of the cancer patients in the data set. It shows a normal distribution of bmi to age, and it also shows the same distribution for individuals who have a family history of cancer and for those who do not have that family history. The shape of this graph indicates that there probably is no association between bmi and getting cancer.
Statistical Analysis
hdp3 <- hdp1.1 %>%
spread(remission, Count) %>%
mutate(rate_remission = (Yes/sum(Yes + No))*100,
total = (Yes+No)) %>%
select(tumorcat,rate_remission, total)library(DataExplorer)
plot_correlation(hdp3, title = "The correlation between each variable in the HDP3 dataset")
cor.test(hdp3$rate_remission, hdp3$total,
method = "pearson")
Pearson's product-moment correlation
data: hdp3$rate_remission and hdp3$total
t = -0.35844, df = 1, p-value = 0.7809
alternative hypothesis: true correlation is not equal to 0
sample estimates:
cor
-0.3374228
According to these correlation tests, there is not a significant result because the p-value is much greater than 0.05. There is also a weak negative correlation between the remission rate and the total number of tumors in each category. The category size that had the strongest correlation with remission rates was the XXL tumors which I had defined as between 90 and 120 mm.