Data importation
mydata <- read.table("./cancer.csv", header=TRUE, sep=",", dec=".")
head(mydata)
##   PatientID Age Gender      Race  BMI SmokingStatus FamilyHistory CancerType
## 1         1  80 Female     Other 23.3        Smoker           Yes     Breast
## 2         2  76   Male Caucasian 22.4 Former Smoker           Yes      Colon
## 3         3  69   Male     Asian 21.5        Smoker           Yes     Breast
## 4         4  77   Male     Asian 30.4 Former Smoker           Yes   Prostate
## 5         5  89   Male Caucasian 20.9        Smoker           Yes       Lung
## 6         6  64 Female     Asian 33.8 Former Smoker           Yes       Lung
##   Stage TumorSize       TreatmentType  TreatmentResponse SurvivalMonths
## 1    II       1.7 Combination Therapy        No Response            103
## 2    IV       4.7             Surgery        No Response             14
## 3   III       8.3 Combination Therapy Complete Remission             61
## 4    II       1.7           Radiation  Partial Remission             64
## 5    IV       7.4           Radiation        No Response             82
## 6    IV       5.2 Combination Therapy        No Response             95
##   Recurrence
## 1        Yes
## 2        Yes
## 3        Yes
## 4         No
## 5        Yes
## 6         No
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 1000), ]
Data description

The unit of observation is a patient who was diagnosed with cancer. The full dataset contains 17,686 observations and 14 variables; the analysis below uses a random sample of 1,000 patients drawn from it.

  1. PatientID: ID number of patient
  2. Age: Age of the patient in years
  3. Gender: Gender of the patient (“Male”, “Female”)
  4. Race: Race or ethnicity of the patient (“African American”,“Asian”,“Caucasian”,“Hispanic”,“Other”)
  5. BMI: Body Mass Index of the patient
  6. SmokingStatus: Smoking status of the patient (“Former Smoker”,“Non-Smoker”,“Smoker”)
  7. FamilyHistory: Whether the patient had direct family members with cancer prior to diagnosis (“Yes”,“No”)
  8. CancerType: Original cancer type in patient (“Breast”,“Colon”,“Leukemia”,“Lung”,“Prostate”,“Skin”)
  9. Stage: Stage of cancer in patient when first detected (“I”,“II”,“III”,“IV”)
  10. TumorSize: Tumor size when first detected, in cm
  11. TreatmentType: Treatment type that the patient received (“Chemotherapy”,“Combination Therapy”,“Radiation”,“Surgery”)
  12. TreatmentResponse: Response patients had after treatment (“Complete Remission”,“No Response”,“Partial Remission”)
  13. SurvivalMonths: Time patients lived after initial diagnosis in months
  14. Recurrence: Whether the patient had a recurrence (“Yes”,“No”)

The dataset was obtained from Kaggle.com, from the user preetigupta004 on 13.01.2025.

mydata$Gender <- factor(mydata$Gender)
mydata$Race <- factor(mydata$Race)
mydata$SmokingStatus <- factor(mydata$SmokingStatus)
mydata$FamilyHistory <- factor(mydata$FamilyHistory)
mydata$CancerType <- factor(mydata$CancerType)
mydata$TreatmentType <- factor(mydata$TreatmentType)
mydata$TreatmentResponse <- factor(mydata$TreatmentResponse)
mydata$Recurrence <- factor(mydata$Recurrence)

mydata$Stage <- as.character(mydata$Stage)

mydata$Stage[mydata$Stage == "I"] <- 1
mydata$Stage[mydata$Stage == "II"] <- 2
mydata$Stage[mydata$Stage == "III"] <- 3
mydata$Stage[mydata$Stage == "IV"] <- 4

mydata$Stage <- as.numeric(mydata$Stage)

In this chunk I converted the categorical variables Gender, Race, SmokingStatus, FamilyHistory, CancerType, TreatmentType, TreatmentResponse and Recurrence into factors so that they can be analyzed properly. To obtain an ordinal variable, I consulted ChatGPT on how to recode the Stage levels (I to IV) as the numbers 1 to 4, which preserves their order for the further analysis.
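As an alternative to the numeric recoding, the order of the stages could also be stored in an ordered factor. The following is only a minimal sketch on a toy vector; the names stage_raw, stage_ord and stage_num are hypothetical and do not appear in the analysis.

```r
# Toy vector standing in for the Stage column (values are illustrative).
stage_raw <- c("II", "IV", "III", "II", "IV", "I")

# Ordered factor: keeps the labels and records that I < II < III < IV.
stage_ord <- factor(stage_raw,
                    levels = c("I", "II", "III", "IV"),
                    ordered = TRUE)

# Integer codes 1 to 4, the same values that the recoding above produces.
stage_num <- as.integer(stage_ord)

print(stage_ord)
print(stage_num)
```

An ordered factor keeps the original labels in tables and plots, while as.integer() still gives the numeric codes whenever they are needed.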

summary(mydata[-1])
##       Age           Gender                  Race          BMI      
##  Min.   :18.00   Female:510   African American:209   Min.   :18.5  
##  1st Qu.:36.00   Male  :490   Asian           :203   1st Qu.:23.7  
##  Median :54.00                Caucasian       :177   Median :29.0  
##  Mean   :54.52                Hispanic        :204   Mean   :29.2  
##  3rd Qu.:73.00                Other           :207   3rd Qu.:35.0  
##  Max.   :90.00                                       Max.   :40.0  
##        SmokingStatus FamilyHistory    CancerType      Stage      
##  Former Smoker:362   No :475       Breast  :168   Min.   :1.000  
##  Non-Smoker   :309   Yes:525       Colon   :182   1st Qu.:1.000  
##  Smoker       :329                 Leukemia:146   Median :2.000  
##                                    Lung    :165   Mean   :2.429  
##                                    Prostate:181   3rd Qu.:3.000  
##                                    Skin    :158   Max.   :4.000  
##    TumorSize                  TreatmentType          TreatmentResponse
##  Min.   : 1.000   Chemotherapy       :253   Complete Remission:348    
##  1st Qu.: 3.200   Combination Therapy:233   No Response       :320    
##  Median : 5.500   Radiation          :242   Partial Remission :332    
##  Mean   : 5.436   Surgery            :272                             
##  3rd Qu.: 7.600                                                       
##  Max.   :10.000                                                       
##  SurvivalMonths   Recurrence
##  Min.   :  1.00   No :468   
##  1st Qu.: 31.00   Yes:532   
##  Median : 64.00             
##  Mean   : 62.12             
##  3rd Qu.: 92.00             
##  Max.   :120.00

After converting the variables to their proper types, I called for a summary of the data. In general, the categorical variables are roughly balanced across their levels; for example, there are approximately as many females (510) as males (490). The numerical variables also appear evenly spread over their ranges: SurvivalMonths, for example, goes from 1 to 120 months with a mean of 62.12, close to the midpoint of the range.

#RQ1

#Is there a significant difference in age between patients with and without recurrence?

To answer this research question we perform a parametric test (independent samples t-test) if its assumptions are met, or a non-parametric test (Wilcoxon Rank-Sum Test) if they are violated.

#install.packages("dplyr")
#install.packages("rstatix")

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
leveneTest(Age ~ Recurrence, data = mydata)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   1   5e-04 0.9823
##       998

Because the p-value is greater than 0.05 (p = 0.9823), we fail to reject the null hypothesis of equal variances, so the Welch correction is not needed and the t-test can be performed with var.equal = TRUE.
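To make the link between Levene's test and the var.equal argument explicit, the decision can be sketched in base R. This is only an illustration on simulated data, not the cancer dataset; the names age, group, abs_dev, levene_p and equal_var are hypothetical. Levene's test with center = median amounts to a one-way ANOVA on the absolute deviations from the group medians.

```r
set.seed(1)
# Simulated stand-in data (not the cancer dataset): two groups with equal spread.
age   <- c(runif(468, 18, 90), runif(532, 18, 90))
group <- factor(rep(c("No", "Yes"), times = c(468, 532)))

# Levene's test with center = median: one-way ANOVA on the absolute
# deviations of each value from its own group median.
abs_dev  <- abs(age - ave(age, group, FUN = median))
levene_p <- anova(lm(abs_dev ~ group))[["Pr(>F)"]][1]

# Equal variances are assumed unless the test rejects them;
# var.equal = FALSE would apply the Welch correction.
equal_var <- levene_p > 0.05
res <- t.test(age ~ group, var.equal = equal_var)

print(levene_p)
print(res$method)
```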

t.test(mydata$Age ~ mydata$Recurrence,
       var.equal = TRUE,
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  mydata$Age by mydata$Recurrence
## t = 0.88504, df = 998, p-value = 0.3763
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -1.447035  3.824613
## sample estimates:
##  mean in group No mean in group Yes 
##          55.14744          53.95865

Because the p-value is greater than 0.05 (p = 0.3763), we fail to reject the null hypothesis; we cannot say that the mean age differs between patients with and without recurrence.

wilcox.test(mydata$Age ~ mydata$Recurrence,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Age by mydata$Recurrence
## W = 128561, p-value = 0.3715
## alternative hypothesis: true location shift is not equal to 0

Because the p-value is greater than 0.05 (p = 0.3715), the Wilcoxon Rank-Sum Test also fails to reject the null hypothesis; we cannot say that the age distribution differs between patients with and without recurrence.

#install.packages("ggpubr")

library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
NoRecurrence <- ggplot(mydata[mydata$Recurrence == "No", ], aes(x = Age)) +
  theme_linedraw() +
  geom_histogram(fill = "darkblue") +
  ylab("Frequency") +
  ggtitle("No Recurrence")

YesRecurrence <- ggplot(mydata[mydata$Recurrence == "Yes", ], aes(x = Age)) +
  theme_linedraw() +
  geom_histogram(fill = "darkred") +
  ylab("Frequency") +
  ggtitle("Yes Recurrence")

ggarrange(NoRecurrence, YesRecurrence,
          ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms of age for patients with and without recurrence do not look normally distributed, which would lead us to a non-parametric test. First I will formally test for normality.

mydata %>%
  group_by(Recurrence) %>%
  shapiro_test(Age)
## # A tibble: 2 × 4
##   Recurrence variable statistic        p
##   <fct>      <chr>        <dbl>    <dbl>
## 1 No         Age          0.952 4.04e-11
## 2 Yes        Age          0.953 5.60e-12

The Shapiro-Wilk test rejects normality in both groups (p < 0.05), so the non-parametric test is the appropriate choice for these data. The effect size is therefore computed from the Wilcoxon Rank-Sum Test.
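The choice between the two tests can be written down as a small rule. The sketch below uses simulated ages (evenly spread between 18 and 90, similar to what the histograms suggest) rather than the real data, and the names age, group and p_norm are hypothetical.

```r
set.seed(1)
# Simulated stand-in data (not the cancer dataset).
age   <- c(runif(468, 18, 90), runif(532, 18, 90))
group <- factor(rep(c("No", "Yes"), times = c(468, 532)))

# Shapiro-Wilk p-value within each group.
p_norm <- tapply(age, group, function(x) shapiro.test(x)$p.value)

# Use the t-test only if no group rejects normality; otherwise fall back
# to the Wilcoxon rank-sum test.
if (all(p_norm > 0.05)) {
  res <- t.test(age ~ group, var.equal = TRUE)
} else {
  res <- wilcox.test(age ~ group, correct = FALSE, exact = FALSE)
}

print(p_norm)
print(res$method)
```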

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
effectsize(wilcox.test(mydata$Age ~ mydata$Recurrence,
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided"))
## r (rank biserial) |        95% CI
## ---------------------------------
## 0.03              | [-0.04, 0.10]
interpret_rank_biserial(.03)
## [1] "tiny"
## (Rules: funder2019)

Based on the sample data, we cannot say that there is a difference in age between patients who had a recurrence and those who did not (p = 0.3715, r = 0.03, tiny effect).
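The rank-biserial correlation can be checked by hand from the Wilcoxon statistic and the two group sizes reported above. This is a sketch of the usual formula, r = 2W / (n1 * n2) - 1; the variable names are hypothetical.

```r
# Values copied from the output above.
W  <- 128561   # Wilcoxon statistic for the first group ("No")
n1 <- 468      # patients without recurrence
n2 <- 532      # patients with recurrence

# W / (n1 * n2) estimates the probability that a randomly chosen "No"
# patient is older than a randomly chosen "Yes" patient (ties count half).
# Rescaling maps this probability from [0, 1] onto [-1, 1].
r_rb <- 2 * W / (n1 * n2) - 1

print(round(r_rb, 2))  # 0.03
```

A value this close to 0 means that the ages of the two groups overlap almost completely.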

#RQ2

#Is there a correlation between the BMI of a patient and the size of the tumor when they were first diagnosed?

First I draw a scatter plot to visualize the relationship between these two variables.

library(car)

scatterplot(mydata$BMI, mydata$TumorSize)

We can observe that there is a really weak relationship between the Body Mass Index and the Tumor Size in the patients.

shapiro.test(mydata$BMI)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$BMI
## W = 0.94805, p-value < 2.2e-16
shapiro.test(mydata$TumorSize)
## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$TumorSize
## W = 0.95583, p-value < 2.2e-16

For both variables the p-value is below 0.05, which means that both violate normality. For this reason we use Spearman’s correlation coefficient to measure the correlation between the two.

library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(mydata$BMI, mydata$TumorSize, type = "spearman")
##      x    y
## x 1.00 0.05
## y 0.05 1.00
## 
## n= 1000 
## 
## 
## P
##   x      y     
## x        0.1536
## y 0.1536

Although the sample coefficient is 0.05 (a very weak positive correlation), the p-value is 0.1536, so we fail to reject the null hypothesis and cannot say that the relationship is statistically significant. Based on this sample, we cannot conclude that BMI and tumor size at first diagnosis are correlated.
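Spearman's coefficient is Pearson's correlation computed on the ranks of the data, which is why it does not require normality. The sketch below illustrates this in base R on simulated values, not the cancer dataset; the names bmi, tumor_size, rho_ranks and rho_builtin are hypothetical.

```r
set.seed(1)
# Simulated stand-in data (not the cancer dataset): two unrelated variables.
bmi        <- runif(1000, 18.5, 40)
tumor_size <- runif(1000, 1, 10)

# Spearman's rho is Pearson's r computed on the ranks.
rho_ranks   <- cor(rank(bmi), rank(tumor_size))
rho_builtin <- cor(bmi, tumor_size, method = "spearman")

# cor.test() adds a p-value; exact = FALSE requests the large-sample
# approximation instead of the exact distribution.
res <- cor.test(bmi, tumor_size, method = "spearman", exact = FALSE)

print(c(rho_ranks, rho_builtin, unname(res$estimate)))
print(res$p.value)
```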

#RQ3

#Is there an association between the race/ethnicity of the patients and the type of cancer that they were diagnosed with?

H0: Race and CancerType are not associated.
H1: Race and CancerType are associated.

mytable <- table(mydata$Race, mydata$CancerType)
print(mytable)
##                   
##                    Breast Colon Leukemia Lung Prostate Skin
##   African American     32    30       32   39       46   30
##   Asian                39    37       27   25       45   30
##   Caucasian            23    37       33   31       21   32
##   Hispanic             31    44       30   33       33   33
##   Other                43    34       24   37       36   33

In this chunk I created a contingency table, stored in the object mytable, which shows how many patients in each ethnic group have each type of cancer.

chi_squared <- chisq.test(mytable,
                          correct = FALSE)

chi_squared
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 24.391, df = 20, p-value = 0.2257

The Chi-Square test fails to reject the null hypothesis (p = 0.2257), which means that we cannot say that there is a statistically significant association between the race of a patient and the type of cancer they were diagnosed with.

addmargins(chi_squared$observed)
##                   
##                    Breast Colon Leukemia Lung Prostate Skin  Sum
##   African American     32    30       32   39       46   30  209
##   Asian                39    37       27   25       45   30  203
##   Caucasian            23    37       33   31       21   32  177
##   Hispanic             31    44       30   33       33   33  204
##   Other                43    34       24   37       36   33  207
##   Sum                 168   182      146  165      181  158 1000
round(chi_squared$expected,2)
##                   
##                    Breast Colon Leukemia  Lung Prostate  Skin
##   African American  35.11 38.04    30.51 34.48    37.83 33.02
##   Asian             34.10 36.95    29.64 33.49    36.74 32.07
##   Caucasian         29.74 32.21    25.84 29.20    32.04 27.97
##   Hispanic          34.27 37.13    29.78 33.66    36.92 32.23
##   Other             34.78 37.67    30.22 34.16    37.47 32.71
round(chi_squared$residuals,2)
##                   
##                    Breast Colon Leukemia  Lung Prostate  Skin
##   African American  -0.53 -1.30     0.27  0.77     1.33 -0.53
##   Asian              0.84  0.01    -0.48 -1.47     1.36 -0.37
##   Caucasian         -1.24  0.84     1.41  0.33    -1.95  0.76
##   Hispanic          -0.56  1.13     0.04 -0.11    -0.65  0.14
##   Other              1.39 -0.60    -1.13  0.49    -0.24  0.05

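These additional chunks show the observed counts, the counts expected under independence, and the Pearson residuals, which measure how far each observed count is from its expected value. No residual exceeds 2 in absolute value (the largest is -1.95, for Caucasian patients with prostate cancer), which is consistent with the non-significant test result.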
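The expected counts, the residuals and the test statistic can also be reproduced by hand from the observed table. The following base R sketch copies the counts from the contingency table above; the object names observed, expected, pearson_res, x2, df and p are hypothetical.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(
  c(32, 30, 32, 39, 46, 30,
    39, 37, 27, 25, 45, 30,
    23, 37, 33, 31, 21, 32,
    31, 44, 30, 33, 33, 33,
    43, 34, 24, 37, 36, 33),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("African American", "Asian", "Caucasian", "Hispanic", "Other"),
    c("Breast", "Colon", "Leukemia", "Lung", "Prostate", "Skin")))

n <- sum(observed)

# Expected count under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson residual for each cell.
pearson_res <- (observed - expected) / sqrt(expected)

# The test statistic is the sum of the squared residuals.
x2 <- sum(pearson_res^2)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p  <- pchisq(x2, df = df, lower.tail = FALSE)

print(round(c(X2 = x2, df = df, p = p), 4))
```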

library(effectsize)
effectsize::cramers_v(mytable)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.03              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

We cannot say that there is a statistically significant association between these two variables (p > 0.05); the effect size is tiny (Cramer’s V = 0.03).
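Cramer's V can be recomputed from the chi-squared statistic. The sketch below shows the unadjusted formula and a bias-corrected version; that the "(adj.)" label in the output refers to this particular correction is my assumption, and the variable names are hypothetical.

```r
# Values copied from the chi-squared output above.
x2 <- 24.391   # test statistic
n  <- 1000     # sample size
r  <- 5        # number of Race levels
k  <- 6        # number of CancerType levels

# Unadjusted Cramer's V.
v <- sqrt(x2 / (n * (min(r, k) - 1)))

# Bias-corrected version (assumed to be what "(adj.)" refers to):
# subtract the value of X2 / n expected under independence and
# shrink the table dimensions slightly.
phi2_adj <- max(0, x2 / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

print(round(c(V = v, V_adj = v_adj), 3))
```

Under this assumption the corrected value rounds to the 0.03 reported above, while the unadjusted value is somewhat larger because it is not corrected for the size of the table.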