Group 7: Esther Agyemang, Omar Alamodi, Aliha Bashir, Hetal Bulsara, Andy Huang, James Huynh, Nasteha Mohamed, and Tina Shen
George Brown Polytechnic
COMP 4033 – Health Informatics Data Analytics
This dataset contains individual-level records related to oral cancer.
According to the dataset description on Kaggle, the data is based on
real-world oral cancer statistics and is intended to reflect patterns
reported in global health studies. Based on the Country
variable, the dataset includes observations from 17 unique countries.
India, Pakistan, Sri Lanka, and Taiwan contribute the most records
(roughly 8,000 each); the remainder come from Western and other
nations.
The str() function shows that the dataset contains
84,922 observations and 25 variables. It includes a mix of numeric and
character variables representing demographics, risk factors, clinical
symptoms, and treatment outcomes.
str(oral_data)
## 'data.frame': 84922 obs. of 25 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Country : chr "Italy" "Japan" "UK" "Sri Lanka" ...
## $ Age : int 36 64 37 55 68 70 41 53 62 50 ...
## $ Gender : chr "Female" "Male" "Female" "Male" ...
## $ Tobacco.Use : chr "Yes" "Yes" "No" "Yes" ...
## $ Alcohol.Consumption : chr "Yes" "Yes" "Yes" "Yes" ...
## $ HPV.Infection : chr "Yes" "Yes" "No" "No" ...
## $ Betel.Quid.Use : chr "No" "No" "No" "Yes" ...
## $ Chronic.Sun.Exposure : chr "No" "Yes" "Yes" "No" ...
## $ Poor.Oral.Hygiene : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Diet..Fruits...Vegetables.Intake. : chr "Low" "High" "Moderate" "Moderate" ...
## $ Family.History.of.Cancer : chr "No" "No" "No" "No" ...
## $ Compromised.Immune.System : chr "No" "No" "No" "No" ...
## $ Oral.Lesions : chr "No" "No" "No" "Yes" ...
## $ Unexplained.Bleeding : chr "No" "Yes" "No" "No" ...
## $ Difficulty.Swallowing : chr "No" "No" "No" "No" ...
## $ White.or.Red.Patches.in.Mouth : chr "No" "No" "Yes" "No" ...
## $ Tumor.Size..cm. : num 0 1.78 3.52 0 2.83 ...
## $ Cancer.Stage : int 0 1 2 0 3 2 1 0 3 3 ...
## $ Treatment.Type : chr "No Treatment" "No Treatment" "Surgery" "No Treatment" ...
## $ Survival.Rate..5.Year.... : num 100 83.3 63.2 100 44.3 ...
## $ Cost.of.Treatment..USD. : num 0 77773 101165 0 45355 ...
## $ Economic.Burden..Lost.Workdays.per.Year.: int 0 177 130 0 52 91 105 0 136 82 ...
## $ Early.Diagnosis : chr "No" "No" "Yes" "Yes" ...
## $ Oral.Cancer..Diagnosis. : chr "No" "Yes" "Yes" "No" ...
The table() function is used to count the number of
records per country, while length() is used to determine
the total number of unique countries in the dataset.
countries = table(oral_data$Country)
sort(countries)
##
## South Africa Japan Kenya Australia Nigeria Egypt
## 3086 3152 3171 3189 3256 3263
## Russia Brazil France Italy USA Germany
## 4711 4762 4783 4834 4891 4909
## UK Taiwan Sri Lanka Pakistan India
## 4930 7905 8000 8001 8079
length(countries)
## [1] 17
The names() function lists all 25 variables in the
dataset. These include demographic information (e.g., Age, Gender,
Country), risk factors (e.g., Tobacco Use, Alcohol Consumption, HPV
Infection), clinical indicators, and outcome variables such as survival
rate and treatment cost.
names(oral_data)
## [1] "ID"
## [2] "Country"
## [3] "Age"
## [4] "Gender"
## [5] "Tobacco.Use"
## [6] "Alcohol.Consumption"
## [7] "HPV.Infection"
## [8] "Betel.Quid.Use"
## [9] "Chronic.Sun.Exposure"
## [10] "Poor.Oral.Hygiene"
## [11] "Diet..Fruits...Vegetables.Intake."
## [12] "Family.History.of.Cancer"
## [13] "Compromised.Immune.System"
## [14] "Oral.Lesions"
## [15] "Unexplained.Bleeding"
## [16] "Difficulty.Swallowing"
## [17] "White.or.Red.Patches.in.Mouth"
## [18] "Tumor.Size..cm."
## [19] "Cancer.Stage"
## [20] "Treatment.Type"
## [21] "Survival.Rate..5.Year...."
## [22] "Cost.of.Treatment..USD."
## [23] "Economic.Burden..Lost.Workdays.per.Year."
## [24] "Early.Diagnosis"
## [25] "Oral.Cancer..Diagnosis."
The head() function displays the first 15 rows of the
dataset, providing a quick overview of how the data is organized.
head(oral_data, 15)
## ID Country Age Gender Tobacco.Use Alcohol.Consumption HPV.Infection
## 1 1 Italy 36 Female Yes Yes Yes
## 2 2 Japan 64 Male Yes Yes Yes
## 3 3 UK 37 Female No Yes No
## 4 4 Sri Lanka 55 Male Yes Yes No
## 5 5 South Africa 68 Male No No No
## 6 6 Taiwan 70 Male Yes No Yes
## 7 7 USA 41 Female Yes Yes No
## 8 8 Italy 53 Male Yes Yes Yes
## 9 9 Sri Lanka 62 Female Yes No Yes
## 10 10 Germany 50 Male Yes No No
## 11 11 Sri Lanka 65 Male No Yes No
## 12 12 Sri Lanka 34 Male Yes No No
## 13 13 France 56 Male Yes Yes Yes
## 14 14 Australia 59 Female Yes Yes No
## 15 15 UK 43 Male Yes No No
## Betel.Quid.Use Chronic.Sun.Exposure Poor.Oral.Hygiene
## 1 No No Yes
## 2 No Yes Yes
## 3 No Yes Yes
## 4 Yes No Yes
## 5 No No Yes
## 6 Yes No Yes
## 7 No No No
## 8 No No Yes
## 9 Yes No Yes
## 10 No No Yes
## 11 Yes No No
## 12 No No Yes
## 13 Yes No Yes
## 14 No Yes Yes
## 15 No No No
## Diet..Fruits...Vegetables.Intake. Family.History.of.Cancer
## 1 Low No
## 2 High No
## 3 Moderate No
## 4 Moderate No
## 5 High No
## 6 Moderate Yes
## 7 Moderate No
## 8 Moderate No
## 9 Low No
## 10 High No
## 11 Moderate Yes
## 12 Low No
## 13 Moderate No
## 14 Low No
## 15 Low No
## Compromised.Immune.System Oral.Lesions Unexplained.Bleeding
## 1 No No No
## 2 No No Yes
## 3 No No No
## 4 No Yes No
## 5 No No No
## 6 No Yes Yes
## 7 No No No
## 8 No Yes No
## 9 No No No
## 10 No No No
## 11 No Yes No
## 12 No Yes Yes
## 13 No No No
## 14 No Yes No
## 15 Yes No No
## Difficulty.Swallowing White.or.Red.Patches.in.Mouth Tumor.Size..cm.
## 1 No No 0.000000
## 2 No No 1.782186
## 3 No Yes 3.523895
## 4 No No 0.000000
## 5 No No 2.834789
## 6 No No 1.692675
## 7 Yes Yes 5.794843
## 8 No No 0.000000
## 9 Yes No 5.999476
## 10 Yes No 5.282246
## 11 No Yes 4.724627
## 12 No No 0.000000
## 13 Yes Yes 0.000000
## 14 No Yes 0.000000
## 15 No Yes 0.000000
## Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1 0 No Treatment 100.00000
## 2 1 No Treatment 83.34010
## 3 2 Surgery 63.22287
## 4 0 No Treatment 100.00000
## 5 3 No Treatment 44.29320
## 6 2 Surgery 67.40727
## 7 1 No Treatment 80.79813
## 8 0 No Treatment 100.00000
## 9 3 Surgery 41.45993
## 10 3 Radiation 33.75381
## 11 3 Targeted Therapy 42.81490
## 12 0 No Treatment 100.00000
## 13 0 No Treatment 100.00000
## 14 0 No Treatment 100.00000
## 15 0 No Treatment 100.00000
## Cost.of.Treatment..USD. Economic.Burden..Lost.Workdays.per.Year.
## 1 0.00 0
## 2 77772.50 177
## 3 101164.50 130
## 4 0.00 0
## 5 45354.75 52
## 6 96504.00 91
## 7 86131.25 105
## 8 0.00 0
## 9 42630.00 136
## 10 75150.25 82
## 11 63721.00 84
## 12 0.00 0
## 13 0.00 0
## 14 0.00 0
## 15 0.00 0
## Early.Diagnosis Oral.Cancer..Diagnosis.
## 1 No No
## 2 No Yes
## 3 Yes Yes
## 4 Yes No
## 5 No Yes
## 6 Yes Yes
## 7 Yes Yes
## 8 Yes No
## 9 No Yes
## 10 No Yes
## 11 Yes Yes
## 12 Yes No
## 13 Yes No
## 14 Yes No
## 15 Yes No
A user-defined function is created to calculate a simple risk score based on selected risk factors, including tobacco use, alcohol consumption, HPV infection, family history, and immune system status. Each factor contributes a weighted value to produce an overall score for demonstration purposes.
# Weighted sum of risk factors: tobacco use and HPV infection count double.
risk_score <- function(tobacco, alcohol, hpv, family_history, immune_system) {
  (tobacco == "Yes") * 2 +
    (alcohol == "Yes") +
    (hpv == "Yes") * 2 +
    (family_history == "Yes") +
    (immune_system == "Yes")
}
# Score the first record in the dataset.
risk_score(oral_data$Tobacco.Use[1],
           oral_data$Alcohol.Consumption[1],
           oral_data$HPV.Infection[1],
           oral_data$Family.History.of.Cancer[1],
           oral_data$Compromised.Immune.System[1])
## [1] 5
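Because the comparisons inside risk_score() are vectorized, the same function can score whole columns at once rather than one record at a time. The following is a minimal sketch on a small made-up data frame; the `toy` rows are hypothetical and only borrow the dataset's column names.

```r
# Same scoring rule as above, repeated so this sketch runs on its own.
risk_score <- function(tobacco, alcohol, hpv, family_history, immune_system) {
  (tobacco == "Yes") * 2 +
    (alcohol == "Yes") +
    (hpv == "Yes") * 2 +
    (family_history == "Yes") +
    (immune_system == "Yes")
}

# Hypothetical toy rows (not taken from the dataset).
toy <- data.frame(
  Tobacco.Use               = c("Yes", "No", "Yes"),
  Alcohol.Consumption       = c("Yes", "No", "No"),
  HPV.Infection             = c("Yes", "No", "No"),
  Family.History.of.Cancer  = c("No",  "No", "Yes"),
  Compromised.Immune.System = c("No",  "No", "Yes")
)

# Passing full columns returns one score per row.
toy$Risk.Score <- risk_score(
  toy$Tobacco.Use,
  toy$Alcohol.Consumption,
  toy$HPV.Infection,
  toy$Family.History.of.Cancer,
  toy$Compromised.Immune.System
)
toy$Risk.Score
```

Applied to oral_data in the same way, this would add a score for all 84,922 records in a single call.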
An example filter is applied based on the following conditions: cancer stage 1, tumor size under 5 cm, country Sri Lanka, age over 60, female, no tobacco use, and no betel quid use.
These narrow criteria return a small subset of 11 patients: older women from Sri Lanka with early-stage cancer who report neither tobacco nor betel quid use.
library(dplyr)  # provides %>% and filter()
# Inside filter(), columns are referenced by bare name.
oral_data %>% filter(
  Cancer.Stage == 1,
  Tumor.Size..cm. < 5,
  Country == "Sri Lanka",
  Age > 60,
  Gender == "Female",
  Tobacco.Use == "No",
  Betel.Quid.Use == "No")
## ID Country Age Gender Tobacco.Use Alcohol.Consumption HPV.Infection
## 1 5547 Sri Lanka 66 Female No No No
## 2 8641 Sri Lanka 63 Female No Yes No
## 3 11363 Sri Lanka 68 Female No Yes No
## 4 21549 Sri Lanka 70 Female No No No
## 5 41070 Sri Lanka 61 Female No Yes No
## 6 42048 Sri Lanka 69 Female No Yes No
## 7 51622 Sri Lanka 73 Female No No No
## 8 52435 Sri Lanka 70 Female No Yes Yes
## 9 60156 Sri Lanka 65 Female No No No
## 10 68135 Sri Lanka 70 Female No No No
## 11 79435 Sri Lanka 80 Female No Yes Yes
## Betel.Quid.Use Chronic.Sun.Exposure Poor.Oral.Hygiene
## 1 No No No
## 2 No No No
## 3 No No Yes
## 4 No No Yes
## 5 No No No
## 6 No No Yes
## 7 No No No
## 8 No Yes No
## 9 No No Yes
## 10 No No No
## 11 No No Yes
## Diet..Fruits...Vegetables.Intake. Family.History.of.Cancer
## 1 Moderate No
## 2 Moderate No
## 3 High Yes
## 4 Moderate No
## 5 Moderate No
## 6 Moderate No
## 7 Moderate No
## 8 Low No
## 9 Moderate No
## 10 Low No
## 11 Moderate No
## Compromised.Immune.System Oral.Lesions Unexplained.Bleeding
## 1 No No No
## 2 No No Yes
## 3 No No Yes
## 4 No Yes Yes
## 5 No Yes No
## 6 No Yes No
## 7 No Yes No
## 8 No Yes Yes
## 9 No Yes No
## 10 No No No
## 11 No Yes No
## Difficulty.Swallowing White.or.Red.Patches.in.Mouth Tumor.Size..cm.
## 1 No Yes 3.845308
## 2 Yes No 4.825246
## 3 Yes No 2.383717
## 4 No No 4.198427
## 5 No Yes 3.756328
## 6 No No 4.519694
## 7 Yes No 4.578338
## 8 Yes Yes 4.084550
## 9 No Yes 3.249317
## 10 Yes No 3.942891
## 11 Yes No 2.621289
## Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1 1 Radiation 89.98152
## 2 1 Radiation 89.68025
## 3 1 Chemotherapy 83.32314
## 4 1 Chemotherapy 83.88464
## 5 1 Surgery 81.95862
## 6 1 No Treatment 83.57633
## 7 1 Radiation 84.75359
## 8 1 Radiation 85.83111
## 9 1 Chemotherapy 83.84482
## 10 1 Chemotherapy 88.86750
## 11 1 Radiation 80.87980
## Cost.of.Treatment..USD. Economic.Burden..Lost.Workdays.per.Year.
## 1 98652.50 155
## 2 53808.75 140
## 3 25171.25 135
## 4 80856.25 48
## 5 97940.00 65
## 6 94122.50 109
## 7 81845.00 79
## 8 41413.75 41
## 9 76490.00 41
## 10 60420.00 173
## 11 89812.50 36
## Early.Diagnosis Oral.Cancer..Diagnosis.
## 1 Yes Yes
## 2 No Yes
## 3 No Yes
## 4 Yes Yes
## 5 Yes Yes
## 6 Yes Yes
## 7 Yes Yes
## 8 Yes Yes
## 9 Yes Yes
## 10 No Yes
## 11 Yes Yes
The dataset contains multiple potential independent and dependent variables; a subset has been selected for demonstration purposes.
Independent variable: Tumor Size
Dependent variable: Cost of Treatment
head(data.frame(oral_data$Tumor.Size..cm., oral_data$Cost.of.Treatment..USD.), 15)
## oral_data.Tumor.Size..cm. oral_data.Cost.of.Treatment..USD.
## 1 0.000000 0.00
## 2 1.782186 77772.50
## 3 3.523895 101164.50
## 4 0.000000 0.00
## 5 2.834789 45354.75
## 6 1.692675 96504.00
## 7 5.794843 86131.25
## 8 0.000000 0.00
## 9 5.999476 42630.00
## 10 5.282246 75150.25
## 11 4.724627 63721.00
## 12 0.000000 0.00
## 13 0.000000 0.00
## 14 0.000000 0.00
## 15 0.000000 0.00
Independent variable: Diet (Fruits and Vegetables Intake)
Dependent variable: Cancer Stage
This relationship is explored for demonstration purposes, recognizing that it may be indirect.
head(data.frame(oral_data$Diet..Fruits...Vegetables.Intake., oral_data$Cancer.Stage), 15)
## oral_data.Diet..Fruits...Vegetables.Intake. oral_data.Cancer.Stage
## 1 Low 0
## 2 High 1
## 3 Moderate 2
## 4 Moderate 0
## 5 High 3
## 6 Moderate 2
## 7 Moderate 1
## 8 Moderate 0
## 9 Low 3
## 10 High 3
## 11 Moderate 3
## 12 Low 0
## 13 Moderate 0
## 14 Low 0
## 15 Low 0
Basic data cleaning steps are demonstrated: removing rows with a missing treatment cost, eliminating duplicate records, sorting the data in descending order by treatment cost, and renaming selected columns for clarity. No missing treatment costs or duplicate records were found in this dataset, so the row count stays at 84,922. For the demonstration, Alcohol.Consumption is renamed to Drinker and Cost.of.Treatment..USD. to Treatment.Cost.USD.
clean_data = oral_data %>% filter(!is.na(Cost.of.Treatment..USD.))
clean_data = clean_data %>% distinct()
clean_data = clean_data %>% arrange(desc(Cost.of.Treatment..USD.))
clean_data = clean_data %>% rename(
Drinker = Alcohol.Consumption,
Treatment.Cost.USD = Cost.of.Treatment..USD.
)
head(clean_data, 5)
## ID Country Age Gender Tobacco.Use Drinker HPV.Infection Betel.Quid.Use
## 1 32793 Pakistan 56 Male No Yes Yes No
## 2 40541 Brazil 54 Male Yes No Yes No
## 3 18663 Germany 59 Male No Yes No No
## 4 61019 Germany 43 Female No Yes Yes No
## 5 18640 India 57 Female Yes Yes Yes Yes
## Chronic.Sun.Exposure Poor.Oral.Hygiene Diet..Fruits...Vegetables.Intake.
## 1 No Yes Low
## 2 Yes Yes Low
## 3 No Yes Moderate
## 4 No No High
## 5 No Yes Moderate
## Family.History.of.Cancer Compromised.Immune.System Oral.Lesions
## 1 No No No
## 2 No No No
## 3 No No Yes
## 4 No No No
## 5 No No No
## Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
## 1 No No Yes
## 2 No Yes No
## 3 No No No
## 4 No No Yes
## 5 No No No
## Tumor.Size..cm. Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1 2.259677 4 Surgery 19.59434
## 2 5.393321 4 Radiation 21.05594
## 3 2.552248 4 Radiation 26.10189
## 4 2.978733 4 Targeted Therapy 12.63353
## 5 1.793576 4 No Treatment 22.09566
## Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis
## 1 159988 125 Yes
## 2 159986 40 No
## 3 159984 72 Yes
## 4 159934 75 Yes
## 5 159932 121 No
## Oral.Cancer..Diagnosis.
## 1 Yes
## 2 Yes
## 3 Yes
## 4 Yes
## 5 Yes
A new variable, cost per cm, is created by dividing treatment cost by tumor size. This demonstrates how derived variables can be added to the dataset for further analysis.
clean_data = clean_data %>% mutate(cost.per.cm = Treatment.Cost.USD / Tumor.Size..cm.)
head(clean_data, 5)
## ID Country Age Gender Tobacco.Use Drinker HPV.Infection Betel.Quid.Use
## 1 32793 Pakistan 56 Male No Yes Yes No
## 2 40541 Brazil 54 Male Yes No Yes No
## 3 18663 Germany 59 Male No Yes No No
## 4 61019 Germany 43 Female No Yes Yes No
## 5 18640 India 57 Female Yes Yes Yes Yes
## Chronic.Sun.Exposure Poor.Oral.Hygiene Diet..Fruits...Vegetables.Intake.
## 1 No Yes Low
## 2 Yes Yes Low
## 3 No Yes Moderate
## 4 No No High
## 5 No Yes Moderate
## Family.History.of.Cancer Compromised.Immune.System Oral.Lesions
## 1 No No No
## 2 No No No
## 3 No No Yes
## 4 No No No
## 5 No No No
## Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
## 1 No No Yes
## 2 No Yes No
## 3 No No No
## 4 No No Yes
## 5 No No No
## Tumor.Size..cm. Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1 2.259677 4 Surgery 19.59434
## 2 5.393321 4 Radiation 21.05594
## 3 2.552248 4 Radiation 26.10189
## 4 2.978733 4 Targeted Therapy 12.63353
## 5 1.793576 4 No Treatment 22.09566
## Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis
## 1 159988 125 Yes
## 2 159986 40 No
## 3 159984 72 Yes
## 4 159934 75 Yes
## 5 159932 121 No
## Oral.Cancer..Diagnosis. cost.per.cm
## 1 Yes 70801.27
## 2 Yes 29663.73
## 3 Yes 62683.56
## 4 Yes 53691.95
## 5 Yes 89169.33
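The five rows shown here all have a non-zero tumor size, but records with a tumor size of zero divide zero by zero and produce NaN. The sketch below is one possible way to guard the division, written in base R; the column name `cost.per.cm.safe` is hypothetical, and the three toy rows reuse the cost and tumor-size values of the first three records displayed earlier.

```r
# Toy rows: record 1 has no tumor and no cost; records 2 and 3 are treated cases.
toy <- data.frame(
  Treatment.Cost.USD = c(0, 77772.50, 101164.50),
  Tumor.Size..cm.    = c(0, 1.782186, 3.523895)
)

# Plain division: 0 / 0 evaluates to NaN.
toy$cost.per.cm <- toy$Treatment.Cost.USD / toy$Tumor.Size..cm.

# Guarded division: store an explicit NA when there is no tumor to measure.
toy$cost.per.cm.safe <- ifelse(
  toy$Tumor.Size..cm. > 0,
  toy$Treatment.Cost.USD / toy$Tumor.Size..cm.,
  NA_real_
)
toy
```

Both versions leave the first row without a usable value; the guarded form simply makes the "not applicable" case explicit instead of relying on a floating-point NaN.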
A training set is created by randomly sampling 5% of the dataset using a fixed seed (1234) to ensure reproducibility. This demonstrates how subsets of data can be generated for modeling or testing purposes.
set.seed(1234)
training_data = clean_data %>% sample_frac(0.05, replace = FALSE)
head(training_data, 5)
## ID Country Age Gender Tobacco.Use Drinker HPV.Infection Betel.Quid.Use
## 1 1920 France 53 Male Yes Yes No No
## 2 9233 UK 60 Female Yes Yes No No
## 3 23357 UK 53 Male Yes Yes Yes No
## 4 81151 India 40 Male No Yes Yes Yes
## 5 76611 Sri Lanka 51 Male No No No Yes
## Chronic.Sun.Exposure Poor.Oral.Hygiene Diet..Fruits...Vegetables.Intake.
## 1 No No Moderate
## 2 No Yes Low
## 3 Yes Yes High
## 4 Yes No Low
## 5 Yes No High
## Family.History.of.Cancer Compromised.Immune.System Oral.Lesions
## 1 Yes Yes No
## 2 No No Yes
## 3 No No Yes
## 4 No No No
## 5 No No Yes
## Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
## 1 No No No
## 2 No No No
## 3 No No Yes
## 4 No No No
## 5 No No Yes
## Tumor.Size..cm. Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1 3.866738 1 Surgery 86.48450
## 2 3.128102 3 Targeted Therapy 31.24434
## 3 4.237244 2 Targeted Therapy 62.49541
## 4 0.000000 0 No Treatment 100.00000
## 5 0.000000 0 No Treatment 100.00000
## Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis
## 1 27366.25 148 No
## 2 89209.75 168 Yes
## 3 49224.00 69 Yes
## 4 0.00 0 No
## 5 0.00 0 Yes
## Oral.Cancer..Diagnosis. cost.per.cm
## 1 Yes 7077.348
## 2 Yes 28518.813
## 3 Yes 11616.985
## 4 No NaN
## 5 No NaN
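A training sample is usually paired with a held-out set made of the rows that were not sampled. The sketch below shows one way to build both with base R indexing; the `toy` data frame and the names `train_rows`, `training_toy`, and `holdout_toy` are hypothetical, and the rows drawn here are not meant to match those selected by sample_frac() above.

```r
set.seed(1234)  # fixed seed so the same rows are drawn on every run

# Hypothetical stand-in for clean_data with 100 records.
toy <- data.frame(ID = 1:100, value = rnorm(100))

# Draw 5% of the row indices without replacement.
train_rows <- sample(nrow(toy), size = round(0.05 * nrow(toy)))

training_toy <- toy[train_rows, ]   # sampled rows
holdout_toy  <- toy[-train_rows, ]  # every remaining row

nrow(training_toy)
nrow(holdout_toy)
```

Keeping the sampled indices makes it straightforward to confirm that no record appears in both sets.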
The summary() function provides an overview of the
dataset.
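From the output, tumor size and treatment cost are zero for a large share of records. The first quartile and median of tumor size, cancer stage, and treatment cost are all zero, so at least half of the records sit at stage 0 with no tumor and no treatment cost. In the rows displayed earlier, these are individuals without an oral cancer diagnosis, so the zeros appear to be structural rather than placeholder or missing entries. In addition, cost per cm has 42,573 NA's: dividing a zero cost by a zero tumor size yields NaN, which summary() counts as NA.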
summary(clean_data)
## ID Country Age Gender
## Min. : 1 Length:84922 Min. : 15.00 Length:84922
## 1st Qu.:21231 Class :character 1st Qu.: 48.00 Class :character
## Median :42462 Mode :character Median : 55.00 Mode :character
## Mean :42462 Mean : 54.51
## 3rd Qu.:63692 3rd Qu.: 61.00
## Max. :84922 Max. :101.00
##
## Tobacco.Use Drinker HPV.Infection Betel.Quid.Use
## Length:84922 Length:84922 Length:84922 Length:84922
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Chronic.Sun.Exposure Poor.Oral.Hygiene Diet..Fruits...Vegetables.Intake.
## Length:84922 Length:84922 Length:84922
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Family.History.of.Cancer Compromised.Immune.System Oral.Lesions
## Length:84922 Length:84922 Length:84922
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
## Length:84922 Length:84922 Length:84922
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Tumor.Size..cm. Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## Min. :0.000 Min. :0.000 Length:84922 Min. : 10.00
## 1st Qu.:0.000 1st Qu.:0.000 Class :character 1st Qu.: 65.23
## Median :0.000 Median :0.000 Mode :character Median :100.00
## Mean :1.747 Mean :1.119 Mean : 79.50
## 3rd Qu.:3.480 3rd Qu.:2.000 3rd Qu.:100.00
## Max. :6.000 Max. :4.000 Max. :100.00
##
## Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis
## Min. : 0 Min. : 0.00 Length:84922
## 1st Qu.: 0 1st Qu.: 0.00 Class :character
## Median : 0 Median : 0.00 Mode :character
## Mean : 39110 Mean : 52.03
## 3rd Qu.: 76468 3rd Qu.:104.00
## Max. :159988 Max. :179.00
##
## Oral.Cancer..Diagnosis. cost.per.cm
## Length:84922 Min. : 4176
## Class :character 1st Qu.: 14770
## Mode :character Median : 22254
## Mean : 28090
## 3rd Qu.: 34968
## Max. :156336
## NA's :42573
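For the character columns, summary() reports only Length, Class, and Mode, which says little about their contents. One option, sketched below under the assumption that these columns are categorical, is to convert them to factors first so that summary() prints a count per level. The `toy` data frame reuses the Gender, Tobacco.Use, and Age values of the first five records shown earlier.

```r
# Toy stand-in built from the first five records displayed earlier.
toy <- data.frame(
  Gender      = c("Female", "Male", "Female", "Male", "Male"),
  Tobacco.Use = c("Yes", "Yes", "No", "Yes", "No"),
  Age         = c(36, 64, 37, 55, 68),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# summary() now shows level counts for Gender and Tobacco.Use.
summary(toy)
```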
Mean, median, and range of treatment costs.
Basic statistical measures such as mean, median, and range are calculated for treatment cost. The mean cost is approximately 39,110 USD, while the median is 0, indicating that a large portion of the dataset contains zero-cost entries. The range spans from 0 to 159,988 USD, showing a wide variation in treatment costs.
mean(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1] 39109.88
median(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1] 0
range(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1] 0 159988
Base R has no built-in function for the statistical mode (the most
frequent value); its mode() function instead returns the storage mode
(data type) of an object. The steps below therefore use table() and
sort() to find the statistical mode of cancer stage in the dataset.
They could be wrapped in a user-defined function for reuse, but are
shown individually here to illustrate the process.
The most common cancer stage in the dataset is 0 (42,573 records), which is consistent with the zero median for cancer stage in the summary() output.
table() is used to count how many times each cancer
stage appears.
cancer_stage_count = table(clean_data$Cancer.Stage)
cancer_stage_count
##
## 0 1 2 3 4
## 42573 12713 12865 10520 6251
cancer_stage_count = sort(cancer_stage_count, decreasing = TRUE)
cancer_stage_count
##
## 0 2 1 3 4
## 42573 12865 12713 10520 6251
names(cancer_stage_count)[1]
## [1] "0"
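The steps above can be wrapped in a small reusable function, as suggested earlier. This is a minimal sketch; the name `stat_mode` is hypothetical, and returning every tied value (rather than only the first) is a design choice not taken from the steps above.

```r
# Returns the most frequent value(s) of x as character strings,
# matching the names() output used in the steps above.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts == max(counts))]
}

# Hypothetical toy vectors.
stat_mode(c(0, 1, 2, 0, 3, 0))       # a single most frequent value
stat_mode(c("a", "b", "a", "b"))     # a tie returns both values
```

Calling `stat_mode(clean_data$Cancer.Stage)` would then replace the table(), sort(), and names() sequence with a single call.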
A scatter plot is created to visualize the relationship between survival rate (5 years) and treatment cost.
The plot shows a slight negative relationship between survival rate and treatment cost. The distribution is not fully continuous, as points form clusters at certain values, which may reflect how the data is structured around specific treatment or disease categories. Despite these clusters, there is still variability within groups, suggesting that treatment cost is not solely determined by survival rate.
ggplot(data = training_data, aes(x = Survival.Rate..5.Year...., y = Treatment.Cost.USD)) +
geom_point(color = "steelblue", size = 1.2)
A bar plot is created to display the number of individuals at each cancer stage, with each bar split (stacked) by gender.
From the plot, stage 0 has the highest number of individuals, with counts decreasing as cancer stage increases. Males appear to have higher counts than females across all stages, although the overall distribution trend is similar for both groups.
ggplot(data = clean_data, aes(x = Cancer.Stage, fill = Gender)) + geom_bar() +
ylab("Number of Individuals")
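The counts behind a bar plot like this can also be read directly from a cross-tabulation, which makes comparisons between genders easier to check than estimating bar heights. The sketch below uses base R on a toy data frame built from the Cancer.Stage and Gender values of the first ten records shown earlier; the name `stage_by_gender` is hypothetical.

```r
# Toy stand-in built from the first ten records displayed earlier.
toy <- data.frame(
  Cancer.Stage = c(0, 1, 2, 0, 3, 2, 1, 0, 3, 3),
  Gender = c("Female", "Male", "Female", "Male", "Male",
             "Male", "Female", "Male", "Female", "Male")
)

# Counts of individuals for each stage (rows) and gender (columns).
stage_by_gender <- table(toy$Cancer.Stage, toy$Gender,
                         dnn = c("Cancer.Stage", "Gender"))
stage_by_gender

# Share of each gender within a stage (each row sums to 1).
prop.table(stage_by_gender, margin = 1)
```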
A correlation value close to 1 or -1 indicates a strong linear relationship. The Pearson correlation between treatment cost and survival rate is approximately -0.81, indicating a strong negative correlation: higher survival rates are generally associated with lower treatment costs. Although the earlier scatter plot of the same variables, drawn from the 5% training sample, shows considerable spread within clusters, the overall downward trend across the full dataset still yields a strong negative correlation.
cor(clean_data$Treatment.Cost.USD, clean_data$Survival.Rate..5.Year...., method="pearson")
## [1] -0.8066187
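To make the calculation behind cor() concrete, the sketch below computes the Pearson coefficient by hand (the sum of cross-products of deviations divided by the square root of the product of the summed squared deviations) and compares it with the built-in result. The vectors reuse the survival-rate and cost values of the first five records shown earlier, so the value obtained here is illustrative only and will differ from the full-dataset figure. The final lines are an assumption about a useful follow-up check: recomputing the correlation for treated records only, to see how much the cluster at 100% survival and zero cost contributes.

```r
# Values of the first five records displayed earlier.
survival <- c(100, 83.3401, 63.22287, 100, 44.2932)
cost     <- c(0, 77772.50, 101164.50, 0, 45354.75)

# Pearson correlation from its definition.
dev_s <- survival - mean(survival)
dev_c <- cost - mean(cost)
r_manual <- sum(dev_s * dev_c) / sqrt(sum(dev_s^2) * sum(dev_c^2))

# Built-in equivalent.
r_builtin <- cor(survival, cost, method = "pearson")

c(manual = r_manual, builtin = r_builtin)

# Follow-up check: restrict to records with a non-zero cost.
treated <- cost > 0
cor(survival[treated], cost[treated], method = "pearson")
```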