Group 7: Esther Agyemang, Omar Alamodi, Aliha Bashir, Hetal Bulsara, Andy Huang, James Huynh, Nasteha Mohamed, and Tina Shen

George Brown Polytechnic
COMP 4033 – Health Informatics Data Analytics

1. Dataset Structure

This section provides an overview of the dataset structure, including the number of variables and their data types.

str(oral_data)
## 'data.frame':    84922 obs. of  25 variables:
##  $ ID                                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country                                 : chr  "Italy" "Japan" "UK" "Sri Lanka" ...
##  $ Age                                     : int  36 64 37 55 68 70 41 53 62 50 ...
##  $ Gender                                  : chr  "Female" "Male" "Female" "Male" ...
##  $ Tobacco.Use                             : chr  "Yes" "Yes" "No" "Yes" ...
##  $ Alcohol.Consumption                     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ HPV.Infection                           : chr  "Yes" "Yes" "No" "No" ...
##  $ Betel.Quid.Use                          : chr  "No" "No" "No" "Yes" ...
##  $ Chronic.Sun.Exposure                    : chr  "No" "Yes" "Yes" "No" ...
##  $ Poor.Oral.Hygiene                       : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Diet..Fruits...Vegetables.Intake.       : chr  "Low" "High" "Moderate" "Moderate" ...
##  $ Family.History.of.Cancer                : chr  "No" "No" "No" "No" ...
##  $ Compromised.Immune.System               : chr  "No" "No" "No" "No" ...
##  $ Oral.Lesions                            : chr  "No" "No" "No" "Yes" ...
##  $ Unexplained.Bleeding                    : chr  "No" "Yes" "No" "No" ...
##  $ Difficulty.Swallowing                   : chr  "No" "No" "No" "No" ...
##  $ White.or.Red.Patches.in.Mouth           : chr  "No" "No" "Yes" "No" ...
##  $ Tumor.Size..cm.                         : num  0 1.78 3.52 0 2.83 ...
##  $ Cancer.Stage                            : int  0 1 2 0 3 2 1 0 3 3 ...
##  $ Treatment.Type                          : chr  "No Treatment" "No Treatment" "Surgery" "No Treatment" ...
##  $ Survival.Rate..5.Year....               : num  100 83.3 63.2 100 44.3 ...
##  $ Cost.of.Treatment..USD.                 : num  0 77773 101165 0 45355 ...
##  $ Economic.Burden..Lost.Workdays.per.Year.: int  0 177 130 0 52 91 105 0 136 82 ...
##  $ Early.Diagnosis                         : chr  "No" "No" "Yes" "Yes" ...
##  $ Oral.Cancer..Diagnosis.                 : chr  "No" "Yes" "Yes" "No" ...

2. Variables in Dataset

The dataset contains the following variables:

names(oral_data)
##  [1] "ID"                                      
##  [2] "Country"                                 
##  [3] "Age"                                     
##  [4] "Gender"                                  
##  [5] "Tobacco.Use"                             
##  [6] "Alcohol.Consumption"                     
##  [7] "HPV.Infection"                           
##  [8] "Betel.Quid.Use"                          
##  [9] "Chronic.Sun.Exposure"                    
## [10] "Poor.Oral.Hygiene"                       
## [11] "Diet..Fruits...Vegetables.Intake."       
## [12] "Family.History.of.Cancer"                
## [13] "Compromised.Immune.System"               
## [14] "Oral.Lesions"                            
## [15] "Unexplained.Bleeding"                    
## [16] "Difficulty.Swallowing"                   
## [17] "White.or.Red.Patches.in.Mouth"           
## [18] "Tumor.Size..cm."                         
## [19] "Cancer.Stage"                            
## [20] "Treatment.Type"                          
## [21] "Survival.Rate..5.Year...."               
## [22] "Cost.of.Treatment..USD."                 
## [23] "Economic.Burden..Lost.Workdays.per.Year."
## [24] "Early.Diagnosis"                         
## [25] "Oral.Cancer..Diagnosis."

3. Preview of Data

Below are the first 15 rows of the dataset:

head(oral_data, 15)
##    ID      Country Age Gender Tobacco.Use Alcohol.Consumption HPV.Infection
## 1   1        Italy  36 Female         Yes                 Yes           Yes
## 2   2        Japan  64   Male         Yes                 Yes           Yes
## 3   3           UK  37 Female          No                 Yes            No
## 4   4    Sri Lanka  55   Male         Yes                 Yes            No
## 5   5 South Africa  68   Male          No                  No            No
## 6   6       Taiwan  70   Male         Yes                  No           Yes
## 7   7          USA  41 Female         Yes                 Yes            No
## 8   8        Italy  53   Male         Yes                 Yes           Yes
## 9   9    Sri Lanka  62 Female         Yes                  No           Yes
## 10 10      Germany  50   Male         Yes                  No            No
## 11 11    Sri Lanka  65   Male          No                 Yes            No
## 12 12    Sri Lanka  34   Male         Yes                  No            No
## 13 13       France  56   Male         Yes                 Yes           Yes
## 14 14    Australia  59 Female         Yes                 Yes            No
## 15 15           UK  43   Male         Yes                  No            No
##    Betel.Quid.Use Chronic.Sun.Exposure Poor.Oral.Hygiene
## 1              No                   No               Yes
## 2              No                  Yes               Yes
## 3              No                  Yes               Yes
## 4             Yes                   No               Yes
## 5              No                   No               Yes
## 6             Yes                   No               Yes
## 7              No                   No                No
## 8              No                   No               Yes
## 9             Yes                   No               Yes
## 10             No                   No               Yes
## 11            Yes                   No                No
## 12             No                   No               Yes
## 13            Yes                   No               Yes
## 14             No                  Yes               Yes
## 15             No                   No                No
##    Diet..Fruits...Vegetables.Intake. Family.History.of.Cancer
## 1                                Low                       No
## 2                               High                       No
## 3                           Moderate                       No
## 4                           Moderate                       No
## 5                               High                       No
## 6                           Moderate                      Yes
## 7                           Moderate                       No
## 8                           Moderate                       No
## 9                                Low                       No
## 10                              High                       No
## 11                          Moderate                      Yes
## 12                               Low                       No
## 13                          Moderate                       No
## 14                               Low                       No
## 15                               Low                       No
##    Compromised.Immune.System Oral.Lesions Unexplained.Bleeding
## 1                         No           No                   No
## 2                         No           No                  Yes
## 3                         No           No                   No
## 4                         No          Yes                   No
## 5                         No           No                   No
## 6                         No          Yes                  Yes
## 7                         No           No                   No
## 8                         No          Yes                   No
## 9                         No           No                   No
## 10                        No           No                   No
## 11                        No          Yes                   No
## 12                        No          Yes                  Yes
## 13                        No           No                   No
## 14                        No          Yes                   No
## 15                       Yes           No                   No
##    Difficulty.Swallowing White.or.Red.Patches.in.Mouth Tumor.Size..cm.
## 1                     No                            No        0.000000
## 2                     No                            No        1.782186
## 3                     No                           Yes        3.523895
## 4                     No                            No        0.000000
## 5                     No                            No        2.834789
## 6                     No                            No        1.692675
## 7                    Yes                           Yes        5.794843
## 8                     No                            No        0.000000
## 9                    Yes                            No        5.999476
## 10                   Yes                            No        5.282246
## 11                    No                           Yes        4.724627
## 12                    No                            No        0.000000
## 13                   Yes                           Yes        0.000000
## 14                    No                           Yes        0.000000
## 15                    No                           Yes        0.000000
##    Cancer.Stage   Treatment.Type Survival.Rate..5.Year....
## 1             0     No Treatment                 100.00000
## 2             1     No Treatment                  83.34010
## 3             2          Surgery                  63.22287
## 4             0     No Treatment                 100.00000
## 5             3     No Treatment                  44.29320
## 6             2          Surgery                  67.40727
## 7             1     No Treatment                  80.79813
## 8             0     No Treatment                 100.00000
## 9             3          Surgery                  41.45993
## 10            3        Radiation                  33.75381
## 11            3 Targeted Therapy                  42.81490
## 12            0     No Treatment                 100.00000
## 13            0     No Treatment                 100.00000
## 14            0     No Treatment                 100.00000
## 15            0     No Treatment                 100.00000
##    Cost.of.Treatment..USD. Economic.Burden..Lost.Workdays.per.Year.
## 1                     0.00                                        0
## 2                 77772.50                                      177
## 3                101164.50                                      130
## 4                     0.00                                        0
## 5                 45354.75                                       52
## 6                 96504.00                                       91
## 7                 86131.25                                      105
## 8                     0.00                                        0
## 9                 42630.00                                      136
## 10                75150.25                                       82
## 11                63721.00                                       84
## 12                    0.00                                        0
## 13                    0.00                                        0
## 14                    0.00                                        0
## 15                    0.00                                        0
##    Early.Diagnosis Oral.Cancer..Diagnosis.
## 1               No                      No
## 2               No                     Yes
## 3              Yes                     Yes
## 4              Yes                      No
## 5               No                     Yes
## 6              Yes                     Yes
## 7              Yes                     Yes
## 8              Yes                      No
## 9               No                     Yes
## 10              No                     Yes
## 11             Yes                     Yes
## 12             Yes                      No
## 13             Yes                      No
## 14             Yes                      No
## 15             Yes                      No

4. User-Defined Function

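Calculate a risk score for an individual from five risk factors: tobacco use and HPV infection each add 2 points, while alcohol consumption, family history of cancer, and a compromised immune system each add 1 point. The score therefore ranges from 0 to 7.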

risk_score <- function(tobacco, alcohol, hpv, family_history, immune_system) {
  (tobacco == "Yes") * 2 +
  (alcohol == "Yes") +
  (hpv == "Yes") * 2 + 
  (family_history == "Yes") + 
  (immune_system == "Yes")
}

risk_score(oral_data$Tobacco.Use[1],
           oral_data$Alcohol.Consumption[1],
           oral_data$HPV.Infection[1],
           oral_data$Family.History.of.Cancer[1],
           oral_data$Compromised.Immune.System[1])
## [1] 5
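
Because the comparisons inside risk_score are vectorized, the same function can score every row at once when it is given whole columns rather than single values. The following is a minimal sketch on a made-up three-row data frame; the toy data and the Risk.Score column name are illustrative assumptions, not part of the dataset.

```r
# Same scoring rule as above: tobacco and HPV count 2 points, the rest 1 point.
risk_score <- function(tobacco, alcohol, hpv, family_history, immune_system) {
  (tobacco == "Yes") * 2 +
    (alcohol == "Yes") +
    (hpv == "Yes") * 2 +
    (family_history == "Yes") +
    (immune_system == "Yes")
}

# Hypothetical toy rows (not taken from oral_data)
toy <- data.frame(
  Tobacco.Use               = c("Yes", "No", "Yes"),
  Alcohol.Consumption       = c("Yes", "No", "No"),
  HPV.Infection             = c("Yes", "No", "No"),
  Family.History.of.Cancer  = c("No",  "No", "Yes"),
  Compromised.Immune.System = c("No",  "No", "Yes"),
  stringsAsFactors = FALSE
)

# Passing whole columns returns one score per row
toy$Risk.Score <- with(toy, risk_score(Tobacco.Use,
                                       Alcohol.Consumption,
                                       HPV.Infection,
                                       Family.History.of.Cancer,
                                       Compromised.Immune.System))
print(toy$Risk.Score)  # 5 0 4
```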

5. Filter Rows Based on Criteria

Filter for Stage 1 cases with a tumor smaller than 5 cm among women over 60 in Sri Lanka who use neither tobacco nor betel quid:

oral_data %>% filter(
  Cancer.Stage == 1,
  Tumor.Size..cm. < 5,
  Country == "Sri Lanka",
  Age > 60,
  Gender == "Female",
  Tobacco.Use == "No",
  Betel.Quid.Use == "No")
##       ID   Country Age Gender Tobacco.Use Alcohol.Consumption HPV.Infection
## 1   5547 Sri Lanka  66 Female          No                  No            No
## 2   8641 Sri Lanka  63 Female          No                 Yes            No
## 3  11363 Sri Lanka  68 Female          No                 Yes            No
## 4  21549 Sri Lanka  70 Female          No                  No            No
## 5  41070 Sri Lanka  61 Female          No                 Yes            No
## 6  42048 Sri Lanka  69 Female          No                 Yes            No
## 7  51622 Sri Lanka  73 Female          No                  No            No
## 8  52435 Sri Lanka  70 Female          No                 Yes           Yes
## 9  60156 Sri Lanka  65 Female          No                  No            No
## 10 68135 Sri Lanka  70 Female          No                  No            No
## 11 79435 Sri Lanka  80 Female          No                 Yes           Yes
##    Betel.Quid.Use Chronic.Sun.Exposure Poor.Oral.Hygiene
## 1              No                   No                No
## 2              No                   No                No
## 3              No                   No               Yes
## 4              No                   No               Yes
## 5              No                   No                No
## 6              No                   No               Yes
## 7              No                   No                No
## 8              No                  Yes                No
## 9              No                   No               Yes
## 10             No                   No                No
## 11             No                   No               Yes
##    Diet..Fruits...Vegetables.Intake. Family.History.of.Cancer
## 1                           Moderate                       No
## 2                           Moderate                       No
## 3                               High                      Yes
## 4                           Moderate                       No
## 5                           Moderate                       No
## 6                           Moderate                       No
## 7                           Moderate                       No
## 8                                Low                       No
## 9                           Moderate                       No
## 10                               Low                       No
## 11                          Moderate                       No
##    Compromised.Immune.System Oral.Lesions Unexplained.Bleeding
## 1                         No           No                   No
## 2                         No           No                  Yes
## 3                         No           No                  Yes
## 4                         No          Yes                  Yes
## 5                         No          Yes                   No
## 6                         No          Yes                   No
## 7                         No          Yes                   No
## 8                         No          Yes                  Yes
## 9                         No          Yes                   No
## 10                        No           No                   No
## 11                        No          Yes                   No
##    Difficulty.Swallowing White.or.Red.Patches.in.Mouth Tumor.Size..cm.
## 1                     No                           Yes        3.845308
## 2                    Yes                            No        4.825246
## 3                    Yes                            No        2.383717
## 4                     No                            No        4.198427
## 5                     No                           Yes        3.756328
## 6                     No                            No        4.519694
## 7                    Yes                            No        4.578338
## 8                    Yes                           Yes        4.084550
## 9                     No                           Yes        3.249317
## 10                   Yes                            No        3.942891
## 11                   Yes                            No        2.621289
##    Cancer.Stage Treatment.Type Survival.Rate..5.Year....
## 1             1      Radiation                  89.98152
## 2             1      Radiation                  89.68025
## 3             1   Chemotherapy                  83.32314
## 4             1   Chemotherapy                  83.88464
## 5             1        Surgery                  81.95862
## 6             1   No Treatment                  83.57633
## 7             1      Radiation                  84.75359
## 8             1      Radiation                  85.83111
## 9             1   Chemotherapy                  83.84482
## 10            1   Chemotherapy                  88.86750
## 11            1      Radiation                  80.87980
##    Cost.of.Treatment..USD. Economic.Burden..Lost.Workdays.per.Year.
## 1                 98652.50                                      155
## 2                 53808.75                                      140
## 3                 25171.25                                      135
## 4                 80856.25                                       48
## 5                 97940.00                                       65
## 6                 94122.50                                      109
## 7                 81845.00                                       79
## 8                 41413.75                                       41
## 9                 76490.00                                       41
## 10                60420.00                                      173
## 11                89812.50                                       36
##    Early.Diagnosis Oral.Cancer..Diagnosis.
## 1              Yes                     Yes
## 2               No                     Yes
## 3               No                     Yes
## 4              Yes                     Yes
## 5              Yes                     Yes
## 6              Yes                     Yes
## 7              Yes                     Yes
## 8              Yes                     Yes
## 9              Yes                     Yes
## 10              No                     Yes
## 11             Yes                     Yes
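
One point worth keeping in mind for numeric criteria in a filter like this: when a number is compared with a quoted value, R coerces the number to text and compares the two as strings. The sketch below shows the difference and a base R equivalent of a multi-condition filter. The toy data frame and its column names are made up for illustration.

```r
# A number compared with a string is coerced to text first.
10 < 5     # FALSE: numeric comparison
10 < "5"   # TRUE: "10" sorts before "5" when compared as text

# Hypothetical toy data (not taken from oral_data)
toy <- data.frame(
  Country       = c("Sri Lanka", "Sri Lanka", "Japan"),
  Tumor.Size.cm = c(3.8, 10.2, 4.1),
  stringsAsFactors = FALSE
)

# Base R equivalent of a dplyr filter(): conditions are joined with &
small_tumors <- subset(toy, Country == "Sri Lanka" & Tumor.Size.cm < 5)
print(small_tumors)
```

In this dataset tumor sizes do not exceed 6 cm, so the quoted and unquoted forms would select the same rows here, but the numeric form stays correct if larger values ever appear.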

6. Identifying Dependent & Independent Variables

# Independent var: Tumor Size -- Dependent var: Cost of Treatment
head(data.frame(oral_data$Tumor.Size..cm., oral_data$Cost.of.Treatment..USD.), 15)
##    oral_data.Tumor.Size..cm. oral_data.Cost.of.Treatment..USD.
## 1                   0.000000                              0.00
## 2                   1.782186                          77772.50
## 3                   3.523895                         101164.50
## 4                   0.000000                              0.00
## 5                   2.834789                          45354.75
## 6                   1.692675                          96504.00
## 7                   5.794843                          86131.25
## 8                   0.000000                              0.00
## 9                   5.999476                          42630.00
## 10                  5.282246                          75150.25
## 11                  4.724627                          63721.00
## 12                  0.000000                              0.00
## 13                  0.000000                              0.00
## 14                  0.000000                              0.00
## 15                  0.000000                              0.00
# Independent var: Diet (fruits, veg intake) -- Dependent var: Cancer Stage
head(data.frame(oral_data$Diet..Fruits...Vegetables.Intake., oral_data$Cancer.Stage), 15)
##    oral_data.Diet..Fruits...Vegetables.Intake. oral_data.Cancer.Stage
## 1                                          Low                      0
## 2                                         High                      1
## 3                                     Moderate                      2
## 4                                     Moderate                      0
## 5                                         High                      3
## 6                                     Moderate                      2
## 7                                     Moderate                      1
## 8                                     Moderate                      0
## 9                                          Low                      3
## 10                                        High                      3
## 11                                    Moderate                      3
## 12                                         Low                      0
## 13                                    Moderate                      0
## 14                                         Low                      0
## 15                                         Low                      0
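
In R, a dependent/independent pairing like the ones above is usually written as a formula, with the dependent variable on the left of `~` and the independent variable on the right. The sketch below is only an illustration on made-up values, not an analysis of the dataset; the toy data frame, its column names, and the choice of a linear model are assumptions.

```r
# Hypothetical toy data: cost rises by exactly 2000 per cm from a base of 1000
toy <- data.frame(Tumor.Size.cm = c(1, 2, 3, 4, 5))
toy$Cost.USD <- 1000 + 2000 * toy$Tumor.Size.cm

# Dependent variable on the left of ~, independent variable on the right
cost_formula <- Cost.USD ~ Tumor.Size.cm
all.vars(cost_formula)

# A simple linear model recovers the intercept and slope used to build the toy data
fit <- lm(cost_formula, data = toy)
coef(fit)
```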

7. Data Cleaning and Adjustments

Remove rows with a missing treatment cost, drop duplicated rows, sort the rows by treatment cost in descending order, and rename two columns (Alcohol.Consumption becomes Drinker, and Cost.of.Treatment..USD. becomes Treatment.Cost.USD).

clean_data <- oral_data %>% filter(!is.na(Cost.of.Treatment..USD.))
clean_data <- clean_data %>% distinct()

clean_data <- clean_data %>% arrange(desc(Cost.of.Treatment..USD.))

clean_data <- clean_data %>% rename(
  Drinker = Alcohol.Consumption,
  Treatment.Cost.USD = Cost.of.Treatment..USD.
)
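
The effect of each cleaning step can be checked on a small made-up data frame. This sketch uses base R equivalents of the dplyr verbs above (an assumption made to keep the example free of package dependencies); the toy values are hypothetical.

```r
# Hypothetical toy data: one duplicated row (ID 2) and one missing cost (ID 3)
toy <- data.frame(
  ID   = c(1, 2, 2, 3, 4),
  Cost = c(100, 300, 300, NA, 200)
)

clean <- toy[!is.na(toy$Cost), ]                        # drop rows with a missing cost
clean <- clean[!duplicated(clean), ]                    # drop exact duplicate rows
clean <- clean[order(clean$Cost, decreasing = TRUE), ]  # sort by cost, descending
names(clean)[names(clean) == "Cost"] <- "Treatment.Cost.USD"  # rename the column

print(clean)
```

Three rows remain (IDs 2, 4, 1), ordered from the highest cost to the lowest.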

8. Adding New Variables to the Data Frame

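Add a cost.per.cm column: treatment cost divided by tumor size. Where both the cost and the tumor size are 0, the division 0/0 yields NaN; this is consistent with the 42573 NA's reported for this column in the summary in Section 10.

clean_data <- clean_data %>% mutate(cost.per.cm = Treatment.Cost.USD / Tumor.Size..cm.)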
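
If an explicit missing value is preferred over NaN for rows without a tumor, the division can be guarded. The following is a sketch of one possible approach on made-up values, not the method used in this report; the toy data frame and its column names are assumptions.

```r
# Hypothetical toy data: the first row has no tumor and no treatment cost
toy <- data.frame(
  Treatment.Cost.USD = c(0, 50000, 90000),
  Tumor.Size.cm      = c(0, 2.5, 4.5)
)

0 / 0   # NaN, which summary() counts under NA's

# Divide only where the tumor size is positive; otherwise record NA
toy$cost.per.cm <- ifelse(toy$Tumor.Size.cm > 0,
                          toy$Treatment.Cost.USD / toy$Tumor.Size.cm,
                          NA_real_)
print(toy)
```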

9. Creating a Training Set

Create a training set by sampling 5% of the cleaned data set without replacement, with the random seed set to 1234 so that the sample is reproducible.

set.seed(1234) 
training_data <- clean_data %>% sample_frac(0.05, replace = FALSE)
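
The role of the seed can be illustrated with a base R sketch on a made-up 100-row data frame: repeating the draw with the same seed selects the same rows, and the rows not selected can be kept as a holdout set. The toy data and the variable names are assumptions, and this base R draw is not guaranteed to pick the same rows as sample_frac.

```r
# Hypothetical toy data with 100 rows
toy <- data.frame(ID = 1:100)

# Draw 5% of the row indices without replacement
set.seed(1234)
train_idx <- sample(nrow(toy), size = round(0.05 * nrow(toy)))
training  <- toy[train_idx, , drop = FALSE]
holdout   <- toy[-train_idx, , drop = FALSE]  # the remaining 95%

# Resetting the seed reproduces the same draw
set.seed(1234)
train_idx_again <- sample(nrow(toy), size = round(0.05 * nrow(toy)))
identical(train_idx, train_idx_again)  # TRUE
```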

10. Summary of the Main Data Set

summary(clean_data)
##        ID          Country               Age            Gender         
##  Min.   :    1   Length:84922       Min.   : 15.00   Length:84922      
##  1st Qu.:21231   Class :character   1st Qu.: 48.00   Class :character  
##  Median :42462   Mode  :character   Median : 55.00   Mode  :character  
##  Mean   :42462                      Mean   : 54.51                     
##  3rd Qu.:63692                      3rd Qu.: 61.00                     
##  Max.   :84922                      Max.   :101.00                     
##                                                                        
##  Tobacco.Use          Drinker          HPV.Infection      Betel.Quid.Use    
##  Length:84922       Length:84922       Length:84922       Length:84922      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Chronic.Sun.Exposure Poor.Oral.Hygiene  Diet..Fruits...Vegetables.Intake.
##  Length:84922         Length:84922       Length:84922                     
##  Class :character     Class :character   Class :character                 
##  Mode  :character     Mode  :character   Mode  :character                 
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  Family.History.of.Cancer Compromised.Immune.System Oral.Lesions      
##  Length:84922             Length:84922              Length:84922      
##  Class :character         Class :character          Class :character  
##  Mode  :character         Mode  :character          Mode  :character  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  Unexplained.Bleeding Difficulty.Swallowing White.or.Red.Patches.in.Mouth
##  Length:84922         Length:84922          Length:84922                 
##  Class :character     Class :character      Class :character             
##  Mode  :character     Mode  :character      Mode  :character             
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  Tumor.Size..cm.  Cancer.Stage   Treatment.Type     Survival.Rate..5.Year....
##  Min.   :0.000   Min.   :0.000   Length:84922       Min.   : 10.00           
##  1st Qu.:0.000   1st Qu.:0.000   Class :character   1st Qu.: 65.23           
##  Median :0.000   Median :0.000   Mode  :character   Median :100.00           
##  Mean   :1.747   Mean   :1.119                      Mean   : 79.50           
##  3rd Qu.:3.480   3rd Qu.:2.000                      3rd Qu.:100.00           
##  Max.   :6.000   Max.   :4.000                      Max.   :100.00           
##                                                                              
##  Treatment.Cost.USD Economic.Burden..Lost.Workdays.per.Year. Early.Diagnosis   
##  Min.   :     0     Min.   :  0.00                           Length:84922      
##  1st Qu.:     0     1st Qu.:  0.00                           Class :character  
##  Median :     0     Median :  0.00                           Mode  :character  
##  Mean   : 39110     Mean   : 52.03                                             
##  3rd Qu.: 76468     3rd Qu.:104.00                                             
##  Max.   :159988     Max.   :179.00                                             
##                                                                                
##  Oral.Cancer..Diagnosis.  cost.per.cm    
##  Length:84922            Min.   :  4176  
##  Class :character        1st Qu.: 14770  
##  Mode  :character        Median : 22254  
##                          Mean   : 28090  
##                          3rd Qu.: 34968  
##                          Max.   :156336  
##                          NA's   :42573

11. Mean and Range of Treatment Costs

mean(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1] 39109.88
range(clean_data$Treatment.Cost.USD, na.rm = TRUE)
## [1]      0 159988
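
The summary in Section 10 reports a median treatment cost of 0, so the overall mean is pulled down by the many rows with no cost. If the average among rows with a non-zero cost is of interest, one possible approach is sketched below on made-up values (the cost vector is hypothetical).

```r
# Hypothetical costs: three rows with no treatment cost, two with a cost
cost <- c(0, 0, 0, 50000, 70000)

mean(cost)     # 24000: lowered by the zero-cost rows
median(cost)   # 0

# Restrict to rows with a positive cost before summarising
treated <- cost[cost > 0]
mean(treated)   # 60000
range(treated)  # 50000 70000
```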

12. Scatter Plot

Scatter plot of 5-year survival rate (x-axis) against treatment cost (y-axis), drawn from the 5% training sample.

ggplot(data = training_data, aes(x = Survival.Rate..5.Year...., y = Treatment.Cost.USD)) + 
  geom_point(color = "steelblue", size = 1.2)

13. Bar Plot

Bar plot displaying the number of individuals at each cancer stage, with bars filled by gender.

ggplot(data = clean_data, aes(x = Cancer.Stage, fill = Gender)) + geom_bar() + 
  ylab("Number of Individuals")

14. Pearson Correlation

A Pearson coefficient near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. Here the coefficient is about -0.81, a strong negative linear relationship: higher treatment costs are associated with lower 5-year survival rates.

cor(clean_data$Treatment.Cost.USD, clean_data$Survival.Rate..5.Year...., method="pearson")
## [1] -0.8066187
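
To make the coefficient less of a black box, the sketch below computes Pearson's r directly from its definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations) and compares it with cor(). The x and y vectors are made-up values, not taken from the dataset.

```r
# Hypothetical paired values with a downward trend
x <- c(1, 2, 3, 4, 5)
y <- c(9, 7, 8, 3, 2)

# Deviations from the means
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r from the definition
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Should agree with the built-in function
r_builtin <- cor(x, y, method = "pearson")
all.equal(r_manual, r_builtin)  # TRUE
```

The sign of r follows the sign of sum(dx * dy), which is negative here because larger x values pair with smaller y values.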