Group 7: Esther Agyemang, Omar Alamodi, Aliha Bashir, Hetal Bulsara, Andy Huang, James Huynh, Nasteha Mohamed, and Tina Shen

George Brown Polytechnic
COMP 4033 – Health Informatics Data Analytics

1. Dataset Structure

This dataset contains individual-level records related to oral cancer.

According to the dataset description on Kaggle, the data is based on real-world oral cancer statistics and is intended to reflect patterns reported in global health studies. Based on the Country variable, the dataset covers 17 countries, among them India, Pakistan, Sri Lanka, Taiwan, and several Western nations.

The str() function shows that the dataset contains 84,922 observations and 25 variables. It includes a mix of numeric and character variables representing demographics, risk factors, clinical symptoms, and treatment outcomes.

str(oral_data)
## 'data.frame':    84922 obs. of  25 variables:
##  $ ID                                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country                                 : chr  "Italy" "Japan" "UK" "Sri Lanka" ...
##  $ Age                                     : int  36 64 37 55 68 70 41 53 62 50 ...
##  $ Gender                                  : chr  "Female" "Male" "Female" "Male" ...
##  $ Tobacco.Use                             : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Alcohol.Consumption                     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ HPV.Infection                           : chr  "Yes" "Yes" "No" "No" ...
##  $ Betel.Quid.Use                          : chr  "No" "No" "No" "Yes" ...
##  $ Chronic.Sun.Exposure                    : chr  "No" "Yes" "Yes" "No" ...
##  $ Poor.Oral.Hygiene                       : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Diet..Fruits...Vegetables.Intake.       : chr  "Low" "High" "Moderate" "Moderate" ...
##  $ Family.History.of.Cancer                : chr  "No" "No" "No" "No" ...
##  $ Compromised.Immune.System               : chr  "No" "No" "No" "No" ...
##  $ Oral.Lesions                            : chr  "No" "No" "No" "Yes" ...
##  $ Unexplained.Bleeding                    : chr  "No" "Yes" "No" "No" ...
##  $ Difficulty.Swallowing                   : chr  "No" "No" "No" "No" ...
##  $ White.or.Red.Patches.in.Mouth           : chr  "No" "No" "Yes" "No" ...
##  $ Tumor.Size..cm.                         : num  0 1.78 3.52 0 2.83 ...
##  $ Cancer.Stage                            : int  0 1 2 0 3 2 1 0 3 3 ...
##  $ Treatment.Type                          : chr  "No Treatment" "No Treatment" "Surgery" "No Treatment" ...
##  $ Survival.Rate..5.Year....               : num  100 83.3 63.2 100 44.3 ...
##  $ Cost.of.Treatment..USD.                 : num  0 77773 101165 0 45355 ...
##  $ Economic.Burden..Lost.Workdays.per.Year.: int  0 177 130 0 52 91 105 0 136 82 ...
##  $ Early.Diagnosis                         : chr  "No" "No" "Yes" "Yes" ...
##  $ Oral.Cancer..Diagnosis.                 : chr  "No" "Yes" "Yes" "No" ...

1.1 Dataset Countries

The table() function is used to count the number of records per country, while length() is used to determine the total number of unique countries in the dataset.

countries = table(oral_data$Country)
sort(countries)
## 
## South Africa        Japan        Kenya    Australia      Nigeria        Egypt 
##         3086         3152         3171         3189         3256         3263 
##       Russia       Brazil       France        Italy          USA      Germany 
##         4711         4762         4783         4834         4891         4909 
##           UK       Taiwan    Sri Lanka     Pakistan        India 
##         4930         7905         8000         8001         8079
length(countries)
## [1] 17
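
The same counts can also be expressed as shares of the whole. The following is a minimal sketch on a toy vector (the values and the names `country`, `counts`, and `shares` are illustrative assumptions, not the report's data); it uses only base R.

```r
# Toy vector standing in for oral_data$Country (illustrative values only)
country <- c("India", "India", "Japan", "Kenya", "India", "Japan")

counts <- table(country)          # records per country
shares <- prop.table(counts)      # the same counts as proportions summing to 1

sort(counts, decreasing = TRUE)   # largest group first
round(100 * shares, 1)            # percentage of records per country
length(unique(country))           # number of distinct countries
```

Applied to the real Country column, the same calls would show how much of the dataset each country contributes.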

2. Variables in Dataset

The names() function lists all 25 variables in the dataset. These include demographic information (e.g., Age, Gender, Country), risk factors (e.g., Tobacco Use, Alcohol Consumption, HPV Infection), clinical indicators, and outcome variables such as survival rate and treatment cost.

names(oral_data)
##  [1] "ID"                                      
##  [2] "Country"                                 
##  [3] "Age"                                     
##  [4] "Gender"                                  
##  [5] "Tobacco.Use"                             
##  [6] "Alcohol.Consumption"                     
##  [7] "HPV.Infection"                           
##  [8] "Betel.Quid.Use"                          
##  [9] "Chronic.Sun.Exposure"                    
## [10] "Poor.Oral.Hygiene"                       
## [11] "Diet..Fruits...Vegetables.Intake."       
## [12] "Family.History.of.Cancer"                
## [13] "Compromised.Immune.System"               
## [14] "Oral.Lesions"                            
## [15] "Unexplained.Bleeding"                    
## [16] "Difficulty.Swallowing"                   
## [17] "White.or.Red.Patches.in.Mouth"           
## [18] "Tumor.Size..cm."                         
## [19] "Cancer.Stage"                            
## [20] "Treatment.Type"                          
## [21] "Survival.Rate..5.Year...."               
## [22] "Cost.of.Treatment..USD."                 
## [23] "Economic.Burden..Lost.Workdays.per.Year."
## [24] "Early.Diagnosis"                         
## [25] "Oral.Cancer..Diagnosis."
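
The runs of dots in several names are typical of what happens when a CSV is read with read.csv() and its default check.names = TRUE, which passes each header through make.names(). The sketch below assumes two original headers (they are guesses for illustration, not taken from the source file) to show the mechanism.

```r
# Hypothetical original CSV headers (assumed, not taken from the source file)
raw_headers <- c("Tumor Size (cm)", "Survival Rate (5-Year, %)")

# make.names() replaces every character that is not valid in an R name with "."
make.names(raw_headers)
```

Under that assumption, spaces, parentheses, hyphens, commas, and percent signs each become one dot, which would explain names such as Tumor.Size..cm. and Survival.Rate..5.Year.... in the listing.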

3. Preview of Data

The head() function displays the first 15 rows of the dataset, providing a quick overview of how the data is organized.

head(oral_data, 15)
##    ID      Country Age Gender Tobacco.Use Alcohol.Consumption HPV.Infection
## 1   1        Italy  36 Female         Yes                 Yes           Yes
## 2   2        Japan  64   Male         Yes                 Yes           Yes
## 3   3           UK  37 Female          No                 Yes            No
## 4   4    Sri Lanka  55   Male         Yes                 Yes            No
## 5   5 South Africa  68   Male          No                  No            No
## 6   6       Taiwan  70   Male         Yes                  No           Yes
## 7   7          USA  41 Female         Yes                 Yes            No
## 8   8        Italy  53   Male         Yes                 Yes           Yes
## 9   9    Sri Lanka  62 Female         Yes                  No           Yes
## 10 10      Germany  50   Male         Yes                  No            No
## 11 11    Sri Lanka  65   Male          No                 Yes            No
## 12 12    Sri Lanka  34   Male         Yes                  No            No
## 13 13       France  56   Male         Yes                 Yes           Yes
## 14 14    Australia  59 Female         Yes                 Yes            No
## 15 15           UK  43   Male         Yes                  No            No
##    Betel.Quid.Use Chronic.Sun.Exposure Poor.Oral.Hygiene
## 1              No                   No               Yes
## 2              No                  Yes               Yes
## 3              No                  Yes               Yes
## 4             Yes                   No               Yes
## 5              No                   No               Yes
## 6             Yes                   No               Yes
## 7              No                   No                No
## 8              No                   No               Yes
## 9             Yes                   No               Yes
## 10             No                   No               Yes
## 11            Yes                   No                No
## 12             No                   No               Yes
## 13            Yes                   No               Yes
## 14             No                  Yes               Yes
## 15             No                   No                No
##    Diet..Fruits...Vegetables.Intake. Family.History.of.Cancer
## 1                                Low                       No
## 2                               High                       No
## 3                           Moderate                       No
## 4                           Moderate                       No
## 5                               High                       No
## 6                           Moderate                      Yes
## 7                           Moderate                       No
## 8                           Moderate                       No
## 9                                Low                       No
## 10                              High                       No
## 11                          Moderate                      Yes
## 12                               Low                       No
## 13                          Moderate                       No
## 14                               Low                       No
## 15                               Low                       No
##    Compromised.Immune.System Oral.Lesions Unexplained.Bleeding
## 1                         No           No                   No
## 2                         No           No                  Yes
## 3                         No           No                   No
## 4                         No          Yes                   No
## 5                         No           No                   No
## 6                         No          Yes                  Yes
## 7                         No           No                   No
## 8                         No          Yes                   No
## 9                         No           No                   No
## 10                        No           No                   No
## 11                        No          Yes                   No
## 12                        No          Yes                  Yes
## 13                        No           No                   No
## 14                        No          Yes                   No
## 15                       Yes           No                   No
##    Difficulty.Swallowing White.or.Red.Patches.in.Mouth Tumor.Size..cm.
## 1                     No                            No        0.000000
## 2                     No                            No        1.782186
## 3                     No                           Yes        3.523895
## 4                     No                            No        0.000000
## 5                     No                            No        2.834789
## 6                     No                            No        1.692675
## 7                    Yes                           Yes        5.794843
## 8                     No                            No        0.000000
## 9                    Yes                            No        5.999476
## 10                   Yes                            No        5.282246
## 11                    No                           Yes        4.724627
## 12                    No                            No        0.000000
## 13                   Yes                           Yes        0.000000
## 14                    No                           Yes        0.000000
## 15                    No                           Yes        0.000000
##    Cancer.Stage   Treatment.Type Survival.Rate..5.Year....
## 1             0     No Treatment                 100.00000
## 2             1     No Treatment                  83.34010
## 3             2          Surgery                  63.22287
## 4             0     No Treatment                 100.00000
## 5             3     No Treatment                  44.29320
## 6             2          Surgery                  67.40727
## 7             1     No Treatment                  80.79813
## 8             0     No Treatment                 100.00000
## 9             3          Surgery                  41.45993
## 10            3        Radiation                  33.75381
## 11            3 Targeted Therapy                  42.81490
## 12            0     No Treatment                 100.00000
## 13            0     No Treatment                 100.00000
## 14            0     No Treatment                 100.00000
## 15            0     No Treatment                 100.00000
##    Cost.of.Treatment..USD. Economic.Burden..Lost.Workdays.per.Year.
## 1                     0.00                                        0
## 2                 77772.50                                      177
## 3                101164.50                                      130
## 4                     0.00                                        0
## 5                 45354.75                                       52
## 6                 96504.00                                       91
## 7                 86131.25                                      105
## 8                     0.00                                        0
## 9                 42630.00                                      136
## 10                75150.25                                       82
## 11                63721.00                                       84
## 12                    0.00                                        0
## 13                    0.00                                        0
## 14                    0.00                                        0
## 15                    0.00                                        0
##    Early.Diagnosis Oral.Cancer..Diagnosis.
## 1               No                      No
## 2               No                     Yes
## 3              Yes                     Yes
## 4              Yes                      No
## 5               No                     Yes
## 6              Yes                     Yes
## 7              Yes                     Yes
## 8              Yes                      No
## 9               No                     Yes
## 10              No                     Yes
## 11             Yes                     Yes
## 12             Yes                      No
## 13             Yes                      No
## 14             Yes                      No
## 15             Yes                      No

4. User-Defined Function

A user-defined function is created to calculate a simple risk score based on selected risk factors, including tobacco use, alcohol consumption, HPV infection, family history, and immune system status. Each factor contributes a weighted value to produce an overall score for demonstration purposes.

risk_score <- function(tobacco, alcohol, hpv, family_history, immune_system) {
  (tobacco == "Yes") * 2 +
  (alcohol == "Yes") +
  (hpv == "Yes") * 2 + 
  (family_history == "Yes") + 
  (immune_system == "Yes")
}

risk_score(oral_data$Tobacco.Use[1],
           oral_data$Alcohol.Consumption[1],
           oral_data$HPV.Infection[1],
           oral_data$Family.History.of.Cancer[1],
           oral_data$Compromised.Immune.System[1])
## [1] 5
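
Because the comparisons and arithmetic inside the function are vectorized, whole columns can be passed in one call. The sketch below repeats the function so it is self-contained and scores a toy data frame built from the first three rows of the preview; the column name `Risk.Score` is a hypothetical addition.

```r
# Same scoring rule as risk_score() above, repeated so the sketch runs on its own
risk_score <- function(tobacco, alcohol, hpv, family_history, immune_system) {
  (tobacco == "Yes") * 2 +
    (alcohol == "Yes") +
    (hpv == "Yes") * 2 +
    (family_history == "Yes") +
    (immune_system == "Yes")
}

# Toy data frame with the first three rows shown by head(); not the full dataset
toy <- data.frame(
  Tobacco.Use               = c("Yes", "Yes", "No"),
  Alcohol.Consumption       = c("Yes", "Yes", "Yes"),
  HPV.Infection             = c("Yes", "Yes", "No"),
  Family.History.of.Cancer  = c("No", "No", "No"),
  Compromised.Immune.System = c("No", "No", "No")
)

# One call scores every row; Risk.Score is a hypothetical column name
toy$Risk.Score <- risk_score(toy$Tobacco.Use, toy$Alcohol.Consumption,
                             toy$HPV.Infection, toy$Family.History.of.Cancer,
                             toy$Compromised.Immune.System)
toy$Risk.Score
```

The first two rows score 5 and the third scores 1, since only alcohol consumption applies to it.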

5. Filter Rows Based on Criteria

An example filter is applied based on the following conditions: cancer stage 1, tumor size under 5 cm, country Sri Lanka, age over 60, female, no tobacco use, and no betel quid use.

These narrow criteria return a small subset of 11 patients: older females from Sri Lanka with early-stage cancer and generally fewer high-risk behaviors.

library(dplyr)  # provides %>% and filter()

# Inside filter(), columns are referenced by bare name; commas combine conditions with AND
oral_data %>% filter(
  Cancer.Stage == 1,
  Tumor.Size..cm. < 5,
  Country == "Sri Lanka",
  Age > 60,
  Gender == "Female",
  Tobacco.Use == "No",
  Betel.Quid.Use == "No")
##       ID   Country Age Gender Tobacco.Use Alcohol.Consumption HPV.Infection
## 1   5547 Sri Lanka  66 Female          No                  No            No
## 2   8641 Sri Lanka  63 Female          No                 Yes            No
## 3  11363 Sri Lanka  68 Female          No                 Yes            No
## 4  21549 Sri Lanka  70 Female          No                  No            No
## 5  41070 Sri Lanka  61 Female          No                 Yes            No
## 6  42048 Sri Lanka  69 Female          No                 Yes            No
## 7  51622 Sri Lanka  73 Female          No                  No            No
## 8  52435 Sri Lanka  70 Female          No                 Yes           Yes
## 9  60156 Sri Lanka  65 Female          No                  No            No
## 10 68135 Sri Lanka  70 Female          No                  No            No
## 11 79435 Sri Lanka  80 Female          No                 Yes           Yes
##    Betel.Quid.Use Chronic.Sun.Exposure Poor.Oral.Hygiene
## 1              No                   No                No
## 2              No                   No                No
## 3              No                   No               Yes
## 4              No                   No               Yes
## 5              No                   No                No
## 6              No                   No               Yes
## 7              No                   No                No
## 8              No                  Yes                No
## 9              No                   No               Yes
## 10             No                   No                No
## 11             No                   No               Yes
##    Diet..Fruits...Vegetables.Intake. Family.History.of.Cancer
## 1                           Moderate                       No
## 2                           Moderate                       No
## 3                               High                      Yes
## 4                           Moderate                       No
## 5                           Moderate                       No
## 6                           Moderate                       No
## 7                           Moderate                       No
## 8                                Low                       No
## 9                           Moderate                       No
## 10                               Low                       No
## 11                          Moderate                       No
##    Compromised.Immune.System Oral.Lesions Unexplained.Bleeding
## 1                         No           No                   No
## 2                         No           No                  Yes
## 3                         No           No                  Yes
## 4                         No          Yes                  Yes
## 5                         No          Yes                   No
## 6                         No          Yes                   No
## 7                         No          Yes                   No
## 8                         No          Yes                  Yes
## 9                         No          Yes                   No
## 10                        No           No                   No
## 11                        No          Yes                   No
##    Difficulty.Swallowing White.or.Red.Patches.in.Mouth Tumor.Size..cm.
## 1                     No                           Yes        3.845308
## 2                    Yes                            No        4.825246
## 3                    Yes                            No        2.383717
## 4                     No                            No        4.198427
## 5                     No                           Yes        3.756328
## 6                     No                            No        4.519694
## 7                    Yes                            No        4.578338
## 8                    Yes                           Yes        4.084550
## 9                     No                           Yes        3.249317
## 10                   Yes                            No        3.942891
## 11                   Yes                            No        2.621289
##    Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1             1      Radiation                  89.98152
## 2             1      Radiation                  89.68025
## 3             1   Chemotherapy                  83.32314
## 4             1   Chemotherapy                  83.88464
## 5             1        Surgery                  81.95862
## 6             1   No Treatment                  83.57633
## 7             1      Radiation                  84.75359
## 8             1      Radiation                  85.83111
## 9             1   Chemotherapy                  83.84482
## 10            1   Chemotherapy                  88.86750
## 11            1      Radiation                  80.87980
##    Cost.of.Treatment..USD. Economic.Burden..Lost.Workdays.per.Year.
## 1                 98652.50                                      155
## 2                 53808.75                                      140
## 3                 25171.25                                      135
## 4                 80856.25                                       48
## 5                 97940.00                                       65
## 6                 94122.50                                      109
## 7                 81845.00                                       79
## 8                 41413.75                                       41
## 9                 76490.00                                       41
## 10                60420.00                                      173
## 11                89812.50                                       36
##    Early.Diagnosis Oral.Cancer..Diagnosis.
## 1              Yes                     Yes
## 2               No                     Yes
## 3               No                     Yes
## 4              Yes                     Yes
## 5              Yes                     Yes
## 6              Yes                     Yes
## 7              Yes                     Yes
## 8              Yes                     Yes
## 9              Yes                     Yes
## 10              No                     Yes
## 11             Yes                     Yes
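
For comparison, the same kind of row filter can be written in base R with subset(). This is a sketch on a toy data frame (values invented for illustration, and only four of the columns are included), not a replacement for the dplyr call above.

```r
# Toy data frame; values are illustrative, not taken from the dataset
toy <- data.frame(
  Country      = c("Sri Lanka", "Sri Lanka", "India", "Sri Lanka"),
  Age          = c(66, 45, 70, 63),
  Gender       = c("Female", "Female", "Female", "Male"),
  Cancer.Stage = c(1, 1, 1, 1)
)

# subset() evaluates the condition inside the data frame, so columns are
# referenced by bare name; & combines conditions the way commas do in filter()
matches <- subset(toy, Country == "Sri Lanka" & Age > 60 &
                    Gender == "Female" & Cancer.Stage == 1)
nrow(matches)   # number of rows meeting every condition
```

Wrapping the result in nrow() gives the size of the subset without printing every column.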

6. Identifying Dependent & Independent Variables

The dataset contains multiple potential independent and dependent variables; two pairs are shown here for demonstration purposes.

Independent variable: Tumor Size
Dependent variable: Cost of Treatment

head(data.frame(oral_data$Tumor.Size..cm., oral_data$Cost.of.Treatment..USD.), 15)
##    oral_data.Tumor.Size..cm. oral_data.Cost.of.Treatment..USD.
## 1                   0.000000                              0.00
## 2                   1.782186                          77772.50
## 3                   3.523895                         101164.50
## 4                   0.000000                              0.00
## 5                   2.834789                          45354.75
## 6                   1.692675                          96504.00
## 7                   5.794843                          86131.25
## 8                   0.000000                              0.00
## 9                   5.999476                          42630.00
## 10                  5.282246                          75150.25
## 11                  4.724627                          63721.00
## 12                  0.000000                              0.00
## 13                  0.000000                              0.00
## 14                  0.000000                              0.00
## 15                  0.000000                              0.00
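
In R, a dependent/independent pairing is usually written as a formula, with the dependent variable to the left of ~. The sketch below fits a straight line to toy values that are exactly linear by construction; the column names and numbers are assumptions for illustration, and nothing here implies the real data follow a line.

```r
# Toy values for illustration only; not taken from the dataset
toy <- data.frame(
  Tumor.Size.cm = c(1, 2, 3, 4, 5),
  Cost.USD      = c(3000, 5000, 7000, 9000, 11000)
)

# In a formula, the dependent variable goes left of ~ and the independent right
fit <- lm(Cost.USD ~ Tumor.Size.cm, data = toy)
coef(fit)   # intercept and slope of the fitted line
```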

Independent variable: Diet (Fruits, Vegetables Intake)
Dependent variable: Cancer Stage

This relationship is explored for demonstration purposes, recognizing that it may be indirect.

head(data.frame(oral_data$Diet..Fruits...Vegetables.Intake., oral_data$Cancer.Stage), 15)
##    oral_data.Diet..Fruits...Vegetables.Intake. oral_data.Cancer.Stage
## 1                                          Low                      0
## 2                                         High                      1
## 3                                     Moderate                      2
## 4                                     Moderate                      0
## 5                                         High                      3
## 6                                     Moderate                      2
## 7                                     Moderate                      1
## 8                                     Moderate                      0
## 9                                          Low                      3
## 10                                        High                      3
## 11                                    Moderate                      3
## 12                                         Low                      0
## 13                                    Moderate                      0
## 14                                         Low                      0
## 15                                         Low                      0
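
When the independent variable is categorical, a cross-tabulation is one simple way to look at the pairing. The sketch below uses a toy data frame (values invented for illustration) and orders the diet levels explicitly, since factor levels default to alphabetical order.

```r
# Toy data frame; values are illustrative only
toy <- data.frame(
  Diet  = c("Low", "High", "Moderate", "Moderate", "High", "Low"),
  Stage = c(0, 1, 2, 0, 3, 3)
)

# Order the diet levels explicitly; the default would be alphabetical
toy$Diet <- factor(toy$Diet, levels = c("Low", "Moderate", "High"))

# Rows are the independent variable, columns the dependent variable
stage_by_diet <- table(toy$Diet, toy$Stage)
stage_by_diet
```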

7. Data Cleaning and Adjustments

Basic data cleaning steps are demonstrated: removing rows with a missing treatment cost, eliminating duplicate records, sorting the data in descending order by treatment cost, and renaming selected columns. In this dataset, no missing values or duplicate records were observed. Two columns were renamed for readability as part of the demonstration: Alcohol.Consumption became Drinker, and Cost.of.Treatment..USD. became Treatment.Cost.USD.

clean_data = oral_data %>% filter(!is.na(Cost.of.Treatment..USD.))
clean_data = clean_data %>% distinct()

clean_data = clean_data %>% arrange(desc(Cost.of.Treatment..USD.))

clean_data = clean_data %>% rename(
  Drinker = Alcohol.Consumption,
  Treatment.Cost.USD = Cost.of.Treatment..USD.
)
head(clean_data, 5)
##      ID  Country Age Gender Tobacco.Use Drinker HPV.Infection Betel.Quid.Use
## 1 32793 Pakistan  56   Male          No     Yes           Yes             No
## 2 40541   Brazil  54   Male         Yes      No           Yes             No
## 3 18663  Germany  59   Male          No     Yes            No             No
## 4 61019  Germany  43 Female          No     Yes           Yes             No
## 5 18640    India  57 Female         Yes     Yes           Yes            Yes
##   Chronic.Sun.Exposure Poor.Oral.Hygiene Diet..Fruits...Vegetables.Intake.
## 1                   No               Yes                               Low
## 2                  Yes               Yes                               Low
## 3                   No               Yes                          Moderate
## 4                   No                No                              High
## 5                   No               Yes                          Moderate
##   Family.History.of.Cancer Compromised.Immune.System Oral.Lesions
## 1                       No                        No           No
## 2                       No                        No           No
## 3                       No                        No          Yes
## 4                       No                        No           No
## 5                       No                        No           No
##   Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
## 1                   No                    No                           Yes
## 2                   No                   Yes                            No
## 3                   No                    No                            No
## 4                   No                    No                           Yes
## 5                   No                    No                            No
##   Tumor.Size..cm. Cancer.Stage   Treatment.Type Survival.Rate..5.Year....
## 1        2.259677            4          Surgery                  19.59434
## 2        5.393321            4        Radiation                  21.05594
## 3        2.552248            4        Radiation                  26.10189
## 4        2.978733            4 Targeted Therapy                  12.63353
## 5        1.793576            4     No Treatment                  22.09566
##   Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis
## 1             159988                                      125             Yes
## 2             159986                                       40              No
## 3             159984                                       72             Yes
## 4             159934                                       75             Yes
## 5             159932                                      121              No
##   Oral.Cancer..Diagnosis.
## 1                     Yes
## 2                     Yes
## 3                     Yes
## 4                     Yes
## 5                     Yes
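
A statement that no missing values or duplicates are present can be checked directly before cleaning. The sketch below shows one way to do that in base R on a toy data frame that deliberately contains both problems; the data and the name `cleaned` are illustrative assumptions.

```r
# Toy data frame with one duplicated row and missing costs (illustrative only)
toy <- data.frame(
  ID   = c(1, 2, 2, 3),
  Cost = c(100, NA, NA, 250)
)

colSums(is.na(toy))    # missing values per column
sum(duplicated(toy))   # rows that exactly repeat an earlier row

# Base R counterpart of filter(!is.na(...)) followed by distinct()
cleaned <- unique(toy[!is.na(toy$Cost), ])
nrow(cleaned)
```

On data with no missing values or duplicates, both checks return zero and the cleaning steps leave the row count unchanged.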

8. Adding New Variables to the Data Frame

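A new variable, cost per cm, is created by dividing treatment cost by tumor size. This demonstrates how derived variables can be added to the dataset for further analysis. Note that R does not raise an error on division by zero, so rows with a tumor size of 0 receive NaN (when the cost is also 0) or Inf rather than a finite value.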

clean_data = clean_data %>% mutate(cost.per.cm = Treatment.Cost.USD / Tumor.Size..cm.)
head(clean_data, 5)
##      ID  Country Age Gender Tobacco.Use Drinker HPV.Infection Betel.Quid.Use
## 1 32793 Pakistan  56   Male          No     Yes           Yes             No
## 2 40541   Brazil  54   Male         Yes      No           Yes             No
## 3 18663  Germany  59   Male          No     Yes            No             No
## 4 61019  Germany  43 Female          No     Yes           Yes             No
## 5 18640    India  57 Female         Yes     Yes           Yes            Yes
##   Chronic.Sun.Exposure Poor.Oral.Hygiene Diet..Fruits...Vegetables.Intake.
## 1                   No               Yes                               Low
## 2                  Yes               Yes                               Low
## 3                   No               Yes                          Moderate
## 4                   No                No                              High
## 5                   No               Yes                          Moderate
##   Family.History.of.Cancer Compromised.Immune.System Oral.Lesions
## 1                       No                        No           No
## 2                       No                        No           No
## 3                       No                        No          Yes
## 4                       No                        No           No
## 5                       No                        No           No
##   Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
## 1                   No                    No                           Yes
## 2                   No                   Yes                            No
## 3                   No                    No                            No
## 4                   No                    No                           Yes
## 5                   No                    No                            No
##   Tumor.Size..cm. Cancer.Stage   Treatment.Type Survival.Rate..5.Year....
## 1        2.259677            4          Surgery                  19.59434
## 2        5.393321            4        Radiation                  21.05594
## 3        2.552248            4        Radiation                  26.10189
## 4        2.978733            4 Targeted Therapy                  12.63353
## 5        1.793576            4     No Treatment                  22.09566
##   Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis
## 1             159988                                      125             Yes
## 2             159986                                       40              No
## 3             159984                                       72             Yes
## 4             159934                                       75             Yes
## 5             159932                                      121              No
##   Oral.Cancer..Diagnosis. cost.per.cm
## 1                     Yes    70801.27
## 2                     Yes    29663.73
## 3                     Yes    62683.56
## 4                     Yes    53691.95
## 5                     Yes    89169.33
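
Some rows have a tumor size of 0 (visible in the head() preview in section 3), so the ratio is undefined for them. The sketch below shows one possible guard using ifelse(), on a toy data frame whose values are copied from the preview; the column names are shortened, and marking such rows as NA is a design choice assumed here, not the report's method.

```r
# Toy data frame; the three rows reuse values shown in the head() preview
toy <- data.frame(
  Treatment.Cost.USD = c(0, 77772.50, 45354.75),
  Tumor.Size.cm      = c(0, 1.782186, 2.834789)
)

# Unguarded division: the first element is 0 / 0, which R evaluates to NaN
toy$Treatment.Cost.USD / toy$Tumor.Size.cm

# Guarded version: leave the ratio missing when there is no tumor to measure
toy$cost.per.cm <- ifelse(toy$Tumor.Size.cm > 0,
                          toy$Treatment.Cost.USD / toy$Tumor.Size.cm,
                          NA_real_)
toy
```

Using NA keeps these rows in the data while letting functions such as mean(x, na.rm = TRUE) skip them explicitly.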

9. Creating a Training Set

A training set is created by randomly sampling 5% of the dataset (roughly 4,246 of the 84,922 rows) without replacement, using a fixed seed (1234) to ensure reproducibility. This demonstrates how subsets of data can be generated for modeling or testing purposes.

set.seed(1234) 
training_data = clean_data %>% sample_frac(0.05, replace = FALSE)
head(training_data, 5)
##      ID   Country Age Gender Tobacco.Use Drinker HPV.Infection Betel.Quid.Use
## 1  1920    France  53   Male         Yes     Yes            No             No
## 2  9233        UK  60 Female         Yes     Yes            No             No
## 3 23357        UK  53   Male         Yes     Yes           Yes             No
## 4 81151     India  40   Male          No     Yes           Yes            Yes
## 5 76611 Sri Lanka  51   Male          No      No            No            Yes
##   Chronic.Sun.Exposure Poor.Oral.Hygiene Diet..Fruits...Vegetables.Intake.
## 1                   No                No                          Moderate
## 2                   No               Yes                               Low
## 3                  Yes               Yes                              High
## 4                  Yes                No                               Low
## 5                  Yes                No                              High
##   Family.History.of.Cancer Compromised.Immune.System Oral.Lesions
## 1                      Yes                       Yes           No
## 2                       No                        No          Yes
## 3                       No                        No          Yes
## 4                       No                        No           No
## 5                       No                        No          Yes
##   Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
## 1                   No                    No                            No
## 2                   No                    No                            No
## 3                   No                    No                           Yes
## 4                   No                    No                            No
## 5                   No                    No                           Yes
##   Tumor.Size..cm. Cancer.Stage   Treatment.Type Survival.Rate..5.Year....
## 1        3.866738            1          Surgery                  86.48450
## 2        3.128102            3 Targeted Therapy                  31.24434
## 3        4.237244            2 Targeted Therapy                  62.49541
## 4        0.000000            0     No Treatment                 100.00000
## 5        0.000000            0     No Treatment                 100.00000
##   Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis
## 1           27366.25                                      148              No
## 2           89209.75                                      168             Yes
## 3           49224.00                                       69             Yes
## 4               0.00                                        0              No
## 5               0.00                                        0             Yes
##   Oral.Cancer..Diagnosis. cost.per.cm
## 1                     Yes    7077.348
## 2                     Yes   28518.813
## 3                     Yes   11616.985
## 4                      No         NaN
## 5                      No         NaN
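
The same kind of split can be sketched in base R, which also makes it easy to keep the unsampled rows as a hold-out set. This is a minimal illustration on a toy data frame; toy_data, train_idx, toy_train, and toy_holdout are hypothetical names and values, not objects from this report.

```r
# Toy stand-in for clean_data (hypothetical values, for illustration only)
toy_data <- data.frame(ID = 1:200, Age = rep(c(40, 55, 60, 70), times = 50))

set.seed(1234)  # fixed seed so the same rows are drawn on every run

# Draw 5% of the row indices without replacement
n_train   <- round(0.05 * nrow(toy_data))
train_idx <- sample(nrow(toy_data), size = n_train, replace = FALSE)

toy_train   <- toy_data[train_idx, ]   # sampled rows
toy_holdout <- toy_data[-train_idx, ]  # remaining rows, kept apart for later checks

nrow(toy_train)    # 10 rows (5% of 200)
nrow(toy_holdout)  # 190 rows
```

Keeping the hold-out rows separate means any model fitted on the sample can later be checked against data it has not seen.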

10. Summary of the Main Data Set

The summary() function provides an overview of the dataset.

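The output shows that many values for tumor size and treatment cost are zero. The medians of tumor size, cancer stage, and treatment cost are all zero, so at least half of the records carry zero values. In the sample rows shown in Section 9, the zero entries belong to individuals without an oral cancer diagnosis, so these zeros likely represent undiagnosed individuals rather than missing data or early-stage cases. In addition, cost per cm has 42,573 missing values, because dividing a zero cost by a zero tumor size yields NaN, which summary() counts under NA's.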

summary(clean_data)
##        ID          Country               Age            Gender         
##  Min.   :    1   Length:84922       Min.   : 15.00   Length:84922      
##  1st Qu.:21231   Class :character   1st Qu.: 48.00   Class :character  
##  Median :42462   Mode  :character   Median : 55.00   Mode  :character  
##  Mean   :42462                      Mean   : 54.51                     
##  3rd Qu.:63692                      3rd Qu.: 61.00                     
##  Max.   :84922                      Max.   :101.00                     
##                                                                        
##  Tobacco.Use          Drinker          HPV.Infection      Betel.Quid.Use    
##  Length:84922       Length:84922       Length:84922       Length:84922      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Chronic.Sun.Exposure Poor.Oral.Hygiene  Diet..Fruits...Vegetables.Intake.
##  Length:84922         Length:84922       Length:84922                     
##  Class :character     Class :character   Class :character                 
##  Mode  :character     Mode  :character   Mode  :character                 
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  Family.History.of.Cancer Compromised.Immune.System Oral.Lesions      
##  Length:84922             Length:84922              Length:84922      
##  Class :character         Class :character          Class :character  
##  Mode  :character         Mode  :character          Mode  :character  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
##  Length:84922         Length:84922          Length:84922                 
##  Class :character     Class :character      Class :character             
##  Mode  :character     Mode  :character      Mode  :character             
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  Tumor.Size..cm.  Cancer.Stage   Treatment.Type     Survival.Rate..5.Year....
##  Min.   :0.000   Min.   :0.000   Length:84922       Min.   : 10.00           
##  1st Qu.:0.000   1st Qu.:0.000   Class :character   1st Qu.: 65.23           
##  Median :0.000   Median :0.000   Mode  :character   Median :100.00           
##  Mean   :1.747   Mean   :1.119                      Mean   : 79.50           
##  3rd Qu.:3.480   3rd Qu.:2.000                      3rd Qu.:100.00           
##  Max.   :6.000   Max.   :4.000                      Max.   :100.00           
##                                                                              
##  Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis   
##  Min.   :     0     Min.   :  0.00                           Length:84922      
##  1st Qu.:     0     1st Qu.:  0.00                           Class :character  
##  Median :     0     Median :  0.00                           Mode  :character  
##  Mean   : 39110     Mean   : 52.03                                             
##  3rd Qu.: 76468     3rd Qu.:104.00                                             
##  Max.   :159988     Max.   :179.00                                             
##                                                                                
##  Oral.Cancer..Diagnosis.  cost.per.cm    
##  Length:84922            Min.   :  4176  
##  Class :character        1st Qu.: 14770  
##  Mode  :character        Median : 22254  
##                          Mean   : 28090  
##                          3rd Qu.: 34968  
##                          Max.   :156336  
##                          NA's   :42573
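
The missing values in cost.per.cm follow from R's arithmetic rules: 0/0 is NaN, while a non-zero value divided by zero is Inf. The sketch below, on hypothetical toy vectors, shows this behaviour and one possible way to make the missing values explicit. It assumes cost.per.cm was computed as treatment cost divided by tumor size, which the output suggests but this section does not show.

```r
# Hypothetical toy values mirroring the pattern in the output above
cost <- c(27366.25, 0, 0)
size <- c(3.866738, 0, 0)

cost / size          # first element is finite, the others are NaN (0 / 0)
is.nan(0 / 0)        # TRUE
is.infinite(1 / 0)   # TRUE: a non-zero numerator would give Inf instead

# One option: return NA explicitly whenever tumor size is zero
cost_per_cm <- ifelse(size > 0, cost / size, NA_real_)

# is.na() is TRUE for both NA and NaN, which is why summary() lists them together
sum(is.na(cost / size))  # 2
```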

11. Statistical Functions on the Dataset

Mean, median, and range of treatment costs.

Basic statistical measures such as mean, median, and range are calculated for treatment cost. The mean cost is approximately 39,110 USD, while the median is 0, indicating that at least half of the records have a treatment cost of zero. The range spans from 0 to 159,988 USD, showing wide variation in treatment costs.

mean(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1] 39109.88
median(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1] 0
range(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1]      0 159988
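
Because the median is 0, the overall mean mixes zero-cost records with treated cases. As a sketch on hypothetical toy values (not the report's data), the share of zero entries and the statistics for non-zero costs can be computed separately:

```r
# Hypothetical treatment costs: five zeros and four positive values
toy_cost <- c(0, 0, 0, 0, 0, 20000, 40000, 60000, 80000)

mean(toy_cost == 0)   # proportion of zero-cost entries (5 of 9)
median(toy_cost)      # 0, because more than half of the values are zero

# Restrict to records that actually incurred a cost
nonzero_cost <- toy_cost[toy_cost > 0]
mean(nonzero_cost)    # 50000
median(nonzero_cost)  # 50000
range(nonzero_cost)   # 20000 80000
```

On the real data, the same subsetting would use clean_data$Treatment.Cost.USD in place of toy_cost.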

R has no built-in function for the statistical mode (the most frequent value); its mode() function instead returns the storage mode (type) of an object. The statistical mode of cancer stage was therefore determined with the steps below. The code could be wrapped in a user-defined function for reuse, but the individual steps are shown here for illustration.

The most common cancer stage in the dataset is 0 (42,573 of 84,922 records), which is consistent with the zero median for cancer stage reported in Section 10.

  1. table() is used to count how many times each cancer stage appears.
cancer_stage_count = table(clean_data$Cancer.Stage)
cancer_stage_count
## 
##     0     1     2     3     4 
## 42573 12713 12865 10520  6251
  2. Sort the counts from highest to lowest.
cancer_stage_count = sort(cancer_stage_count, decreasing = TRUE)
cancer_stage_count
## 
##     0     2     1     3     4 
## 42573 12865 12713 10520  6251
  3. Obtain the cancer stage with the highest count.
names(cancer_stage_count)[1]
## [1] "0"
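
The three steps above can be collected into a small reusable function. The following is a minimal sketch; stat_mode is a hypothetical name, and unlike the step-by-step version it returns every value tied for the highest count rather than only the first.

```r
# Hypothetical helper: returns the most frequent value(s) of x as character
stat_mode <- function(x) {
  counts <- table(x)                                # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]   # all values tied for the top count
}

stat_mode(c(0, 0, 1, 2, 2, 0, 4))  # "0"
stat_mode(c("a", "b", "b", "a"))   # "a" "b" (tie)
```

Like names() in the steps above, the function returns a character vector, so the result would need as.numeric() if a numeric stage is required.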

12. Scatter Plot

A scatter plot is created from the 5% training sample to visualize the relationship between survival rate (5 years) and treatment cost.

The plot shows a slight negative relationship between survival rate and treatment cost. The distribution is not fully continuous, as points form clusters at certain values, which may reflect how the data is structured around specific treatment or disease categories. Despite these clusters, there is still variability within groups, suggesting that treatment cost is not solely determined by survival rate.

ggplot(data = training_data, aes(x = Survival.Rate..5.Year...., y = Treatment.Cost.USD)) + 
  geom_point(color = "steelblue", size = 1.2)

13. Bar Plot

A bar plot is created to display the number of individuals at each cancer stage, grouped by gender.

From the plot, stage 0 has the highest number of individuals, with counts decreasing as cancer stage increases. Males appear to have higher counts than females across all stages, although the overall distribution trend is similar for both groups.

ggplot(data = clean_data, aes(x = Cancer.Stage, fill = Gender)) + geom_bar() + 
  ylab("Number of Individuals")
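
Because geom_bar() stacks the gender segments by default, comparing male and female counts by eye can be difficult. A cross-tabulation gives the exact counts behind the bars. The sketch below uses a hypothetical toy data frame; on the real data the equivalent call would be table(clean_data$Cancer.Stage, clean_data$Gender).

```r
# Hypothetical toy records (not values from the dataset)
toy <- data.frame(
  Cancer.Stage = c(0, 0, 0, 1, 1, 2, 2, 2, 3, 4),
  Gender       = c("Male", "Female", "Male", "Male", "Female",
                   "Male", "Male", "Female", "Male", "Female")
)

# Rows are cancer stages, columns are genders
stage_by_gender <- table(toy$Cancer.Stage, toy$Gender)
stage_by_gender

# Stages where the male count exceeds the female count
male_higher <- stage_by_gender[, "Male"] > stage_by_gender[, "Female"]
rownames(stage_by_gender)[male_higher]  # "0" "2" "3"
```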

14. Pearson Correlation

A correlation value close to 1 or -1 indicates a strong linear relationship. The Pearson correlation between treatment cost and survival rate, computed on the full dataset, is approximately -0.81, indicating a strong negative correlation: higher survival rates are generally associated with lower treatment costs. Although the scatter plot of the same variables (Section 12) shows high variability, the overall downward trend still results in a strong negative correlation. The many records with zero cost and a 100% survival rate (the medians of both variables in Section 10) probably contribute to this value.

cor(clean_data$Treatment.Cost.USD, clean_data$Survival.Rate..5.Year...., method="pearson")
## [1] -0.8066187
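
Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this against cor() on hypothetical toy values, and also recomputes r without the zero-cost records to see how much those records influence the result. The vectors are illustrative assumptions, not values from the dataset.

```r
# Hypothetical toy values: three zero-cost records with 100% survival
survival <- c(100, 100, 100, 85, 60, 30, 20)
cost     <- c(0, 0, 0, 30000, 50000, 90000, 120000)

# Built-in Pearson correlation versus the definition cov / (sd_x * sd_y)
r_builtin <- cor(cost, survival, method = "pearson")
r_manual  <- cov(cost, survival) / (sd(cost) * sd(survival))
all.equal(r_builtin, r_manual)  # TRUE

# Correlation restricted to records with a non-zero treatment cost
has_cost  <- cost > 0
r_nonzero <- cor(cost[has_cost], survival[has_cost])
r_nonzero
```

Comparing the two coefficients on the real data would show whether the strong correlation holds among treated cases alone.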