Group 7: Esther Agyemang, Omar Alamodi, Aliha Bashir, Hetal Bulsara, Andy Huang, James Huynh, Nasteha Mohamed, and Tina Shen
George Brown Polytechnic
COMP 4033 – Health Informatics Data Analytics
This section provides an overview of the dataset structure, including the number of variables and their data types.
str(oral_data)
## 'data.frame': 84922 obs. of 25 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Country : chr "Italy" "Japan" "UK" "Sri Lanka" ...
## $ Age : int 36 64 37 55 68 70 41 53 62 50 ...
## $ Gender : chr "Female" "Male" "Female" "Male" ...
## $ Tobacco.Use : chr "Yes" "Yes" "No" "Yes" ...
## $ Alcohol.Consumption : chr "Yes" "Yes" "Yes" "Yes" ...
## $ HPV.Infection : chr "Yes" "Yes" "No" "No" ...
## $ Betel.Quid.Use : chr "No" "No" "No" "Yes" ...
## $ Chronic.Sun.Exposure : chr "No" "Yes" "Yes" "No" ...
## $ Poor.Oral.Hygiene : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Diet..Fruits...Vegetables.Intake. : chr "Low" "High" "Moderate" "Moderate" ...
## $ Family.History.of.Cancer : chr "No" "No" "No" "No" ...
## $ Compromised.Immune.System : chr "No" "No" "No" "No" ...
## $ Oral.Lesions : chr "No" "No" "No" "Yes" ...
## $ Unexplained.Bleeding : chr "No" "Yes" "No" "No" ...
## $ Difficulty.Swallowing : chr "No" "No" "No" "No" ...
## $ White.or.Red.Patches.in.Mouth : chr "No" "No" "Yes" "No" ...
## $ Tumor.Size..cm. : num 0 1.78 3.52 0 2.83 ...
## $ Cancer.Stage : int 0 1 2 0 3 2 1 0 3 3 ...
## $ Treatment.Type : chr "No Treatment" "No Treatment" "Surgery" "No Treatment" ...
## $ Survival.Rate..5.Year.... : num 100 83.3 63.2 100 44.3 ...
## $ Cost.of.Treatment..USD. : num 0 77773 101165 0 45355 ...
## $ Economic.Burden..Lost.Workdays.per.Year.: int 0 177 130 0 52 91 105 0 136 82 ...
## $ Early.Diagnosis : chr "No" "No" "Yes" "Yes" ...
## $ Oral.Cancer..Diagnosis. : chr "No" "Yes" "Yes" "No" ...
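Going by the str() output above, 18 of the 25 columns are character, 4 are integer, and 3 are numeric. As a minimal sketch, the same tally can be produced in code. The data frame `toy` below is a hypothetical three-column stand-in for oral_data, and its values are illustrative only.

```r
# Hypothetical stand-in for oral_data (illustrative values only)
toy <- data.frame(
  Age = c(36L, 64L, 37L),
  Gender = c("Female", "Male", "Female"),
  Tumor.Size..cm. = c(0, 1.78, 3.52),
  stringsAsFactors = FALSE
)

# Class of each column, then a count of columns per class
col_classes <- sapply(toy, class)
type_counts <- table(col_classes)
print(type_counts)

# Character columns with a few fixed values can be stored as factors
toy$Gender <- factor(toy$Gender)
levels(toy$Gender)
```

Converting the Yes/No columns to factors is optional here; it mainly helps later if the columns are used for grouping or modelling.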
The dataset contains the following variables:
names(oral_data)
## [1] "ID"
## [2] "Country"
## [3] "Age"
## [4] "Gender"
## [5] "Tobacco.Use"
## [6] "Alcohol.Consumption"
## [7] "HPV.Infection"
## [8] "Betel.Quid.Use"
## [9] "Chronic.Sun.Exposure"
## [10] "Poor.Oral.Hygiene"
## [11] "Diet..Fruits...Vegetables.Intake."
## [12] "Family.History.of.Cancer"
## [13] "Compromised.Immune.System"
## [14] "Oral.Lesions"
## [15] "Unexplained.Bleeding"
## [16] "Difficulty.Swallowing"
## [17] "White.or.Red.Patches.in.Mouth"
## [18] "Tumor.Size..cm."
## [19] "Cancer.Stage"
## [20] "Treatment.Type"
## [21] "Survival.Rate..5.Year...."
## [22] "Cost.of.Treatment..USD."
## [23] "Economic.Burden..Lost.Workdays.per.Year."
## [24] "Early.Diagnosis"
## [25] "Oral.Cancer..Diagnosis."
Below are the first 15 rows of the dataset:
head(oral_data, 15)
## ID Country Age Gender Tobacco.Use Alcohol.Consumption HPV.Infection
## 1 1 Italy 36 Female Yes Yes Yes
## 2 2 Japan 64 Male Yes Yes Yes
## 3 3 UK 37 Female No Yes No
## 4 4 Sri Lanka 55 Male Yes Yes No
## 5 5 South Africa 68 Male No No No
## 6 6 Taiwan 70 Male Yes No Yes
## 7 7 USA 41 Female Yes Yes No
## 8 8 Italy 53 Male Yes Yes Yes
## 9 9 Sri Lanka 62 Female Yes No Yes
## 10 10 Germany 50 Male Yes No No
## 11 11 Sri Lanka 65 Male No Yes No
## 12 12 Sri Lanka 34 Male Yes No No
## 13 13 France 56 Male Yes Yes Yes
## 14 14 Australia 59 Female Yes Yes No
## 15 15 UK 43 Male Yes No No
## Betel.Quid.Use Chronic.Sun.Exposure Poor.Oral.Hygiene
## 1 No No Yes
## 2 No Yes Yes
## 3 No Yes Yes
## 4 Yes No Yes
## 5 No No Yes
## 6 Yes No Yes
## 7 No No No
## 8 No No Yes
## 9 Yes No Yes
## 10 No No Yes
## 11 Yes No No
## 12 No No Yes
## 13 Yes No Yes
## 14 No Yes Yes
## 15 No No No
## Diet..Fruits...Vegetables.Intake. Family.History.of.Cancer
## 1 Low No
## 2 High No
## 3 Moderate No
## 4 Moderate No
## 5 High No
## 6 Moderate Yes
## 7 Moderate No
## 8 Moderate No
## 9 Low No
## 10 High No
## 11 Moderate Yes
## 12 Low No
## 13 Moderate No
## 14 Low No
## 15 Low No
## Compromised.Immune.System Oral.Lesions Unexplained.Bleeding
## 1 No No No
## 2 No No Yes
## 3 No No No
## 4 No Yes No
## 5 No No No
## 6 No Yes Yes
## 7 No No No
## 8 No Yes No
## 9 No No No
## 10 No No No
## 11 No Yes No
## 12 No Yes Yes
## 13 No No No
## 14 No Yes No
## 15 Yes No No
## Difficulty.Swallowing White.or.Red.Patches.in.Mouth Tumor.Size..cm.
## 1 No No 0.000000
## 2 No No 1.782186
## 3 No Yes 3.523895
## 4 No No 0.000000
## 5 No No 2.834789
## 6 No No 1.692675
## 7 Yes Yes 5.794843
## 8 No No 0.000000
## 9 Yes No 5.999476
## 10 Yes No 5.282246
## 11 No Yes 4.724627
## 12 No No 0.000000
## 13 Yes Yes 0.000000
## 14 No Yes 0.000000
## 15 No Yes 0.000000
## Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1 0 No Treatment 100.00000
## 2 1 No Treatment 83.34010
## 3 2 Surgery 63.22287
## 4 0 No Treatment 100.00000
## 5 3 No Treatment 44.29320
## 6 2 Surgery 67.40727
## 7 1 No Treatment 80.79813
## 8 0 No Treatment 100.00000
## 9 3 Surgery 41.45993
## 10 3 Radiation 33.75381
## 11 3 Targeted Therapy 42.81490
## 12 0 No Treatment 100.00000
## 13 0 No Treatment 100.00000
## 14 0 No Treatment 100.00000
## 15 0 No Treatment 100.00000
## Cost.of.Treatment..USD. Economic.Burden..Lost.Workdays.per.Year.
## 1 0.00 0
## 2 77772.50 177
## 3 101164.50 130
## 4 0.00 0
## 5 45354.75 52
## 6 96504.00 91
## 7 86131.25 105
## 8 0.00 0
## 9 42630.00 136
## 10 75150.25 82
## 11 63721.00 84
## 12 0.00 0
## 13 0.00 0
## 14 0.00 0
## 15 0.00 0
## Early.Diagnosis Oral.Cancer..Diagnosis.
## 1 No No
## 2 No Yes
## 3 Yes Yes
## 4 Yes No
## 5 No Yes
## 6 Yes Yes
## 7 Yes Yes
## 8 Yes No
## 9 No Yes
## 10 No Yes
## 11 Yes Yes
## 12 Yes No
## 13 Yes No
## 14 Yes No
## 15 Yes No
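Calculate a risk score for one person from five risk factors: tobacco use and HPV infection each add 2 points, while alcohol consumption, a family history of cancer, and a compromised immune system each add 1 point. The score ranges from 0 to 7.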
risk_score <- function(tobacco, alcohol, hpv, family_history, immune_system) {
  (tobacco == "Yes") * 2 +
    (alcohol == "Yes") +
    (hpv == "Yes") * 2 +
    (family_history == "Yes") +
    (immune_system == "Yes")
}
risk_score(oral_data$Tobacco.Use[1],
           oral_data$Alcohol.Consumption[1],
           oral_data$HPV.Infection[1],
           oral_data$Family.History.of.Cancer[1],
           oral_data$Compromised.Immune.System[1])
## [1] 5
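Because the function only uses comparisons and arithmetic, it also works on whole columns at once. The sketch below applies it to the first three rows shown in the head() output; the `toy` data frame and the `Risk.Score` column name are assumptions made for illustration.

```r
risk_score <- function(tobacco, alcohol, hpv, family_history, immune_system) {
  (tobacco == "Yes") * 2 +
    (alcohol == "Yes") +
    (hpv == "Yes") * 2 +
    (family_history == "Yes") +
    (immune_system == "Yes")
}

# First three rows of the five risk-factor columns, copied from head()
toy <- data.frame(
  Tobacco.Use = c("Yes", "Yes", "No"),
  Alcohol.Consumption = c("Yes", "Yes", "Yes"),
  HPV.Infection = c("Yes", "Yes", "No"),
  Family.History.of.Cancer = c("No", "No", "No"),
  Compromised.Immune.System = c("No", "No", "No")
)

# Passing full columns returns one score per row
toy$Risk.Score <- risk_score(toy$Tobacco.Use,
                             toy$Alcohol.Consumption,
                             toy$HPV.Infection,
                             toy$Family.History.of.Cancer,
                             toy$Compromised.Immune.System)
toy$Risk.Score  # 5 5 1
```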
Filter for women over 60 in Sri Lanka with stage 1 cancer and a tumor smaller than 5 cm who use neither tobacco nor betel quid:
library(dplyr)  # provides %>% and filter()
oral_data %>% filter(
  Cancer.Stage == 1,        # numeric columns are compared with numbers, not strings
  Tumor.Size..cm. < 5,
  Country == "Sri Lanka",
  Age > 60,
  Gender == "Female",
  Tobacco.Use == "No",
  Betel.Quid.Use == "No")
## ID Country Age Gender Tobacco.Use Alcohol.Consumption HPV.Infection
## 1 5547 Sri Lanka 66 Female No No No
## 2 8641 Sri Lanka 63 Female No Yes No
## 3 11363 Sri Lanka 68 Female No Yes No
## 4 21549 Sri Lanka 70 Female No No No
## 5 41070 Sri Lanka 61 Female No Yes No
## 6 42048 Sri Lanka 69 Female No Yes No
## 7 51622 Sri Lanka 73 Female No No No
## 8 52435 Sri Lanka 70 Female No Yes Yes
## 9 60156 Sri Lanka 65 Female No No No
## 10 68135 Sri Lanka 70 Female No No No
## 11 79435 Sri Lanka 80 Female No Yes Yes
## Betel.Quid.Use Chronic.Sun.Exposure Poor.Oral.Hygiene
## 1 No No No
## 2 No No No
## 3 No No Yes
## 4 No No Yes
## 5 No No No
## 6 No No Yes
## 7 No No No
## 8 No Yes No
## 9 No No Yes
## 10 No No No
## 11 No No Yes
## Diet..Fruits...Vegetables.Intake. Family.History.of.Cancer
## 1 Moderate No
## 2 Moderate No
## 3 High Yes
## 4 Moderate No
## 5 Moderate No
## 6 Moderate No
## 7 Moderate No
## 8 Low No
## 9 Moderate No
## 10 Low No
## 11 Moderate No
## Compromised.Immune.System Oral.Lesions Unexplained.Bleeding
## 1 No No No
## 2 No No Yes
## 3 No No Yes
## 4 No Yes Yes
## 5 No Yes No
## 6 No Yes No
## 7 No Yes No
## 8 No Yes Yes
## 9 No Yes No
## 10 No No No
## 11 No Yes No
## Difficulty.Swallowing White.or.Red.Patches.in.Mouth Tumor.Size..cm.
## 1 No Yes 3.845308
## 2 Yes No 4.825246
## 3 Yes No 2.383717
## 4 No No 4.198427
## 5 No Yes 3.756328
## 6 No No 4.519694
## 7 Yes No 4.578338
## 8 Yes Yes 4.084550
## 9 No Yes 3.249317
## 10 Yes No 3.942891
## 11 Yes No 2.621289
## Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1 1 Radiation 89.98152
## 2 1 Radiation 89.68025
## 3 1 Chemotherapy 83.32314
## 4 1 Chemotherapy 83.88464
## 5 1 Surgery 81.95862
## 6 1 No Treatment 83.57633
## 7 1 Radiation 84.75359
## 8 1 Radiation 85.83111
## 9 1 Chemotherapy 83.84482
## 10 1 Chemotherapy 88.86750
## 11 1 Radiation 80.87980
## Cost.of.Treatment..USD. Economic.Burden..Lost.Workdays.per.Year.
## 1 98652.50 155
## 2 53808.75 140
## 3 25171.25 135
## 4 80856.25 48
## 5 97940.00 65
## 6 94122.50 109
## 7 81845.00 79
## 8 41413.75 41
## 9 76490.00 41
## 10 60420.00 173
## 11 89812.50 36
## Early.Diagnosis Oral.Cancer..Diagnosis.
## 1 Yes Yes
## 2 No Yes
## 3 No Yes
## 4 Yes Yes
## 5 Yes Yes
## 6 Yes Yes
## 7 Yes Yes
## 8 Yes Yes
## 9 Yes Yes
## 10 No Yes
## 11 Yes Yes
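One detail worth checking in filters like this is how numeric columns are compared. When a number is compared with a quoted value, R converts the number to text and compares alphabetically, which can give surprising results. The sketch below shows the effect and a base R version of a similar filter; `toy` is a hypothetical data frame and the 10.5 cm tumor is an illustrative value, not one from the dataset.

```r
# A number compared with a string is coerced to character first
string_cmp <- 10 < "5"   # TRUE, because "10" sorts before "5"
number_cmp <- 10 < 5     # FALSE

# Hypothetical rows for illustration
toy <- data.frame(
  Country = c("Sri Lanka", "Sri Lanka", "UK"),
  Age = c(66L, 55L, 70L),
  Cancer.Stage = c(1L, 1L, 1L),
  Tumor.Size..cm. = c(3.85, 4.20, 10.5)
)

# Numeric columns are compared against numbers
matches <- subset(toy,
                  Cancer.Stage == 1 &
                    Tumor.Size..cm. < 5 &
                    Country == "Sri Lanka" &
                    Age > 60)
matches
```

Only the first row satisfies every condition. In the real dataset tumor sizes stay between 0 and 6 cm, so a string comparison happens to give the same rows, but numeric comparisons are the safer habit.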
# Independent var: Tumor Size -- Dependent var: Cost of Treatment
head(data.frame(oral_data$Tumor.Size..cm., oral_data$Cost.of.Treatment..USD.), 15)
## oral_data.Tumor.Size..cm. oral_data.Cost.of.Treatment..USD.
## 1 0.000000 0.00
## 2 1.782186 77772.50
## 3 3.523895 101164.50
## 4 0.000000 0.00
## 5 2.834789 45354.75
## 6 1.692675 96504.00
## 7 5.794843 86131.25
## 8 0.000000 0.00
## 9 5.999476 42630.00
## 10 5.282246 75150.25
## 11 4.724627 63721.00
## 12 0.000000 0.00
## 13 0.000000 0.00
## 14 0.000000 0.00
## 15 0.000000 0.00
# Independent var: Diet (fruits, veg intake) -- Dependent var: Cancer Stage
head(data.frame(oral_data$Diet..Fruits...Vegetables.Intake., oral_data$Cancer.Stage), 15)
## oral_data.Diet..Fruits...Vegetables.Intake. oral_data.Cancer.Stage
## 1 Low 0
## 2 High 1
## 3 Moderate 2
## 4 Moderate 0
## 5 High 3
## 6 Moderate 2
## 7 Moderate 1
## 8 Moderate 0
## 9 Low 3
## 10 High 3
## 11 Moderate 3
## 12 Low 0
## 13 Moderate 0
## 14 Low 0
## 15 Low 0
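To look at a dependent variable against a categorical independent variable, one option is a cross-tabulation plus a per-group mean. The sketch below uses the first six diet and stage values from the output above; the short column name `Diet` is an assumption for readability, and six rows are far too few to draw any conclusion.

```r
# First six rows of diet and cancer stage from the output above
toy <- data.frame(
  Diet = c("Low", "High", "Moderate", "Moderate", "High", "Moderate"),
  Cancer.Stage = c(0L, 1L, 2L, 0L, 3L, 2L)
)

# Order the diet levels so tables print Low, Moderate, High
toy$Diet <- factor(toy$Diet, levels = c("Low", "Moderate", "High"))

# Counts of each stage within each diet group
stage_by_diet <- table(toy$Diet, toy$Cancer.Stage)
print(stage_by_diet)

# Mean cancer stage per diet group
mean_stage <- tapply(toy$Cancer.Stage, toy$Diet, mean)
print(mean_stage)
```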
Remove rows with a missing treatment cost, drop duplicated rows, sort the rows by treatment cost in descending order, and rename two columns. The summary further down still reports 84922 rows, so neither the missing-value filter nor the duplicate check removed any rows.
clean_data <- oral_data %>% filter(!is.na(Cost.of.Treatment..USD.))
clean_data <- clean_data %>% distinct()
clean_data <- clean_data %>% arrange(desc(Cost.of.Treatment..USD.))
clean_data <- clean_data %>% rename(
  Drinker = Alcohol.Consumption,
  Treatment.Cost.USD = Cost.of.Treatment..USD.
)
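Before removing anything, it can help to count how many missing values and duplicated rows there actually are. The following is a minimal base R sketch on a hypothetical data frame `toy` with a shortened `Cost` column; it is not the pipeline used above, only an illustration of the same steps.

```r
# Hypothetical data with one duplicated row and one missing cost
toy <- data.frame(
  ID = c(1L, 2L, 2L, 3L),
  Cost = c(0, 77772.5, 77772.5, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that repeat an earlier row exactly
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)

# Drop incomplete rows, drop duplicates, then sort by cost (descending)
cleaned <- toy[complete.cases(toy), ]
cleaned <- cleaned[!duplicated(cleaned), ]
cleaned <- cleaned[order(-cleaned$Cost), ]
cleaned
```

A row only counts as a duplicate when every column matches, including the ID column, so a dataset with unique IDs will not lose rows in the duplicate step.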
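Add a cost.per.cm column: treatment cost divided by tumor size. For rows where both the tumor size and the cost are 0, the division 0 / 0 evaluates to NaN, and summary() reports those rows under NA's.
clean_data <- clean_data %>% mutate(cost.per.cm = Treatment.Cost.USD / Tumor.Size..cm.)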
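The division can be made explicit about rows with no tumor. The sketch below, on the first three rows of cost and tumor size from the head() output, compares the plain division with a guarded version that returns NA when the tumor size is 0. The `toy` data frame is a hypothetical stand-in for clean_data.

```r
# First three rows of cost and tumor size from the head() output
toy <- data.frame(
  Treatment.Cost.USD = c(0, 77772.5, 101164.5),
  Tumor.Size..cm. = c(0, 1.782186, 3.523895)
)

# Plain division: 0 / 0 gives NaN for the first row
raw <- toy$Treatment.Cost.USD / toy$Tumor.Size..cm.
is.nan(raw[1])

# Guarded division: NA where there is no tumor to divide by
toy$cost.per.cm <- ifelse(toy$Tumor.Size..cm. > 0,
                          toy$Treatment.Cost.USD / toy$Tumor.Size..cm.,
                          NA_real_)
toy
```

Both versions leave the first row without a usable value; the guarded one just states the intent in the code.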
Create a training set by drawing a 5% random sample of the cleaned dataset without replacement, with the random seed set to 1234 so the sample is reproducible.
set.seed(1234)
training_data <- clean_data %>% sample_frac(0.05, replace = FALSE)
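If a hold-out set is needed later, the same idea can be written with row indices so that the rows not sampled are kept as well. This is a base R sketch on a hypothetical 200-row data frame; the names `toy`, `training`, and `holdout` are assumptions for illustration.

```r
set.seed(1234)  # fixed seed so the same rows are drawn on every run

# Hypothetical 200-row data frame standing in for clean_data
toy <- data.frame(ID = 1:200)

# 5% of the rows, sampled without replacement
n_train <- round(0.05 * nrow(toy))
train_idx <- sample(nrow(toy), size = n_train, replace = FALSE)

training <- toy[train_idx, , drop = FALSE]
holdout <- toy[-train_idx, , drop = FALSE]

nrow(training)  # 10
nrow(holdout)   # 190
```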
summary(clean_data)
## ID Country Age Gender
## Min. : 1 Length:84922 Min. : 15.00 Length:84922
## 1st Qu.:21231 Class :character 1st Qu.: 48.00 Class :character
## Median :42462 Mode :character Median : 55.00 Mode :character
## Mean :42462 Mean : 54.51
## 3rd Qu.:63692 3rd Qu.: 61.00
## Max. :84922 Max. :101.00
##
## Tobacco.Use Drinker HPV.Infection Betel.Quid.Use
## Length:84922 Length:84922 Length:84922 Length:84922
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Chronic.Sun.Exposure Poor.Oral.Hygiene Diet..Fruits...Vegetables.Intake.
## Length:84922 Length:84922 Length:84922
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Family.History.of.Cancer Compromised.Immune.System Oral.Lesions
## Length:84922 Length:84922 Length:84922
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
## Length:84922 Length:84922 Length:84922
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Tumor.Size..cm. Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## Min. :0.000 Min. :0.000 Length:84922 Min. : 10.00
## 1st Qu.:0.000 1st Qu.:0.000 Class :character 1st Qu.: 65.23
## Median :0.000 Median :0.000 Mode :character Median :100.00
## Mean :1.747 Mean :1.119 Mean : 79.50
## 3rd Qu.:3.480 3rd Qu.:2.000 3rd Qu.:100.00
## Max. :6.000 Max. :4.000 Max. :100.00
##
## Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis
## Min. : 0 Min. : 0.00 Length:84922
## 1st Qu.: 0 1st Qu.: 0.00 Class :character
## Median : 0 Median : 0.00 Mode :character
## Mean : 39110 Mean : 52.03
## 3rd Qu.: 76468 3rd Qu.:104.00
## Max. :159988 Max. :179.00
##
## Oral.Cancer..Diagnosis. cost.per.cm
## Length:84922 Min. : 4176
## Class :character 1st Qu.: 14770
## Mode :character Median : 22254
## Mean : 28090
## 3rd Qu.: 34968
## Max. :156336
## NA's :42573
mean(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1] 39109.88
range(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1] 0 159988
Scatter plot of Survival Rate (5 years) and Treatment Cost.
library(ggplot2)  # provides ggplot(), aes(), and the geom_*() layers
ggplot(data = training_data, aes(x = Survival.Rate..5.Year...., y = Treatment.Cost.USD)) +
  geom_point(color = "steelblue", size = 1.2)
Bar plot displaying the number of individuals at each cancer stage, split by gender.
ggplot(data = clean_data, aes(x = Cancer.Stage, fill = Gender)) + geom_bar() +
  ylab("Number of Individuals")
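A Pearson correlation near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. Here the coefficient is about -0.81, so Treatment Cost and 5-year Survival Rate have a strong negative linear relationship: higher treatment costs go with lower survival rates. Correlation alone does not show that one causes the other.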
cor(clean_data$Treatment.Cost.USD, clean_data$Survival.Rate..5.Year...., method="pearson")
## [1] -0.8066187
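Squaring the coefficient gives the share of variance that a straight-line fit would account for, which for -0.8066187 is about 0.65. The sketch below shows the same calculation and a significance test on the first six cost and survival values from the head() output; six rows are only enough to illustrate the calls, not to reproduce the result above.

```r
# First six rows of cost and survival rate from the head() output
cost <- c(0, 77772.5, 101164.5, 0, 45354.75, 96504)
survival <- c(100, 83.3401, 63.22287, 100, 44.2932, 67.40727)

# Pearson correlation and its square
r <- cor(cost, survival, method = "pearson")
r_squared <- r^2

# Test of whether the correlation differs from zero
result <- cor.test(cost, survival, method = "pearson")
result$p.value

# Squared coefficient reported for the full dataset
full_r_squared <- (-0.8066187)^2  # about 0.65
```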