mydata <- read.table("./cancer.csv", header=TRUE, sep=",", dec=".")
head(mydata)
## PatientID Age Gender Race BMI SmokingStatus FamilyHistory CancerType
## 1 1 80 Female Other 23.3 Smoker Yes Breast
## 2 2 76 Male Caucasian 22.4 Former Smoker Yes Colon
## 3 3 69 Male Asian 21.5 Smoker Yes Breast
## 4 4 77 Male Asian 30.4 Former Smoker Yes Prostate
## 5 5 89 Male Caucasian 20.9 Smoker Yes Lung
## 6 6 64 Female Asian 33.8 Former Smoker Yes Lung
## Stage TumorSize TreatmentType TreatmentResponse SurvivalMonths
## 1 II 1.7 Combination Therapy No Response 103
## 2 IV 4.7 Surgery No Response 14
## 3 III 8.3 Combination Therapy Complete Remission 61
## 4 II 1.7 Radiation Partial Remission 64
## 5 IV 7.4 Radiation No Response 82
## 6 IV 5.2 Combination Therapy No Response 95
## Recurrence
## 1 Yes
## 2 Yes
## 3 Yes
## 4 No
## 5 Yes
## 6 No
set.seed(1)
mydata <- mydata[sample(nrow(mydata), 1000), ]
There is one unit of observation: a patient who was diagnosed with cancer. The full dataset contains 17,686 observations and 14 variables; the analysis below uses a random sample of 1,000 patients drawn from it.
The dataset was obtained from Kaggle.com (user preetigupta004) on 13.01.2025.
mydata$Gender <- factor(mydata$Gender)
mydata$Race <- factor(mydata$Race)
mydata$SmokingStatus <- factor(mydata$SmokingStatus)
mydata$FamilyHistory <- factor(mydata$FamilyHistory)
mydata$CancerType <- factor(mydata$CancerType)
mydata$TreatmentType <- factor(mydata$TreatmentType)
mydata$TreatmentResponse <- factor(mydata$TreatmentResponse)
mydata$Recurrence <- factor(mydata$Recurrence)
mydata$Stage <- as.character(mydata$Stage)
mydata$Stage[mydata$Stage == "I"] <- 1
mydata$Stage[mydata$Stage == "II"] <- 2
mydata$Stage[mydata$Stage == "III"] <- 3
mydata$Stage[mydata$Stage == "IV"] <- 4
mydata$Stage <- as.numeric(mydata$Stage)
In this chunk I converted the categorical variables Gender, Race, SmokingStatus, FamilyHistory, CancerType, TreatmentType, TreatmentResponse and Recurrence into factors so that they can be analyzed properly. To obtain an ordinal variable, I consulted ChatGPT on how to recode the Stage labels (I to IV) as the numeric values 1 to 4, which preserve the stage order and can be treated as ordered categories in the further analysis.
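Because Stage is ordinal, an ordered factor is another way to encode it. The sketch below uses a small hypothetical vector (`stage_raw`, not the real column) and shows only one possible alternative, not the recoding used above:

```r
# Hypothetical stage labels standing in for mydata$Stage
stage_raw <- c("II", "IV", "III", "II", "IV", "I")

# Ordered factor: keeps the labels and records that I < II < III < IV
stage_ord <- factor(stage_raw,
                    levels = c("I", "II", "III", "IV"),
                    ordered = TRUE)

# Integer codes 1 to 4, equivalent to the manual recoding
stage_num <- as.integer(stage_ord)
stage_num
```

The ordered factor keeps the original labels for tables and plots, while `as.integer()` still provides the numeric codes when a rank-based method needs them.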
summary(mydata[-1])
## Age Gender Race BMI
## Min. :18.00 Female:510 African American:209 Min. :18.5
## 1st Qu.:36.00 Male :490 Asian :203 1st Qu.:23.7
## Median :54.00 Caucasian :177 Median :29.0
## Mean :54.52 Hispanic :204 Mean :29.2
## 3rd Qu.:73.00 Other :207 3rd Qu.:35.0
## Max. :90.00 Max. :40.0
## SmokingStatus FamilyHistory CancerType Stage
## Former Smoker:362 No :475 Breast :168 Min. :1.000
## Non-Smoker :309 Yes:525 Colon :182 1st Qu.:1.000
## Smoker :329 Leukemia:146 Median :2.000
## Lung :165 Mean :2.429
## Prostate:181 3rd Qu.:3.000
## Skin :158 Max. :4.000
## TumorSize TreatmentType TreatmentResponse
## Min. : 1.000 Chemotherapy :253 Complete Remission:348
## 1st Qu.: 3.200 Combination Therapy:233 No Response :320
## Median : 5.500 Radiation :242 Partial Remission :332
## Mean : 5.436 Surgery :272
## 3rd Qu.: 7.600
## Max. :10.000
## SurvivalMonths Recurrence
## Min. : 1.00 No :468
## 1st Qu.: 31.00 Yes:532
## Median : 64.00
## Mean : 62.12
## 3rd Qu.: 92.00
## Max. :120.00
After converting the variables to the proper types I requested a summary of the data. In general, the categorical variables are roughly evenly distributed across their levels; for example, there are approximately as many females (510) as males (490). The numerical variables also look evenly spread: SurvivalMonths, for example, ranges from 1 to 120 months with a mean of 62.12, close to the midpoint of the range (60.5).
#RQ1
#Is there a significant difference in age between patients with and without recurrence?
To answer this research question we use a parametric test (independent-samples t-test) if its assumptions hold, or a non-parametric test (Wilcoxon rank-sum test) if they are violated.
#install.packages("dplyr")
#install.packages("rstatix")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(Age ~ Recurrence, data = mydata)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 5e-04 0.9823
## 998
Levene's test does not reject the null hypothesis of equal variances (p = 0.9823), so homogeneity of variance can be assumed and the Welch correction is not needed when performing the t-test.
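For intuition, Levene's test with center = median amounts to a one-way ANOVA on the absolute deviations from each group's median. The following is a minimal base-R sketch on simulated data (the vectors `age` and `recurrence` are made up for illustration), including how the result could drive the `var.equal` argument:

```r
set.seed(123)
# Simulated stand-ins for Age and Recurrence (not the real data)
age <- round(runif(200, min = 18, max = 90))
recurrence <- factor(rep(c("No", "Yes"), each = 100))

# Absolute deviation of each value from its own group median
abs_dev <- abs(age - ave(age, recurrence, FUN = median))

# One-way ANOVA on those deviations gives the Levene F test
levene_fit <- anova(lm(abs_dev ~ recurrence))
levene_p <- levene_fit[["Pr(>F)"]][1]

# Use the pooled t-test only if equal variances are not rejected
equal_var <- levene_p >= 0.05
t_res <- t.test(age ~ recurrence, var.equal = equal_var)
t_res$method
```

A large Levene p-value leads to the pooled (Student) t-test, a small one to the Welch version.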
t.test(mydata$Age ~ mydata$Recurrence,
var.equal = TRUE,
alternative = "two.sided")
##
## Two Sample t-test
##
## data: mydata$Age by mydata$Recurrence
## t = 0.88504, df = 998, p-value = 0.3763
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -1.447035 3.824613
## sample estimates:
## mean in group No mean in group Yes
## 55.14744 53.95865
Because the p-value (0.3763) is greater than 0.05, we fail to reject the null hypothesis: we cannot say that the mean age differs between patients with and without recurrence.
wilcox.test(mydata$Age ~ mydata$Recurrence,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$Age by mydata$Recurrence
## W = 128561, p-value = 0.3715
## alternative hypothesis: true location shift is not equal to 0
Because the p-value (0.3715) is greater than 0.05, we again fail to reject the null hypothesis: we cannot say that the age distributions differ between patients with and without recurrence.
#install.packages("ggpubr")
library(ggplot2)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
NoRecurrence <- ggplot(mydata[mydata$Recurrence == "No", ], aes(x = Age)) +
theme_linedraw() +
geom_histogram(fill = "darkblue") +
ylab("Frequency") +
ggtitle("No Recurrence")
YesRecurrence <- ggplot(mydata[mydata$Recurrence == "Yes", ], aes(x = Age)) +
theme_linedraw() +
geom_histogram(fill = "darkred") +
ylab("Frequency") +
ggtitle("Yes Recurrence")
ggarrange(NoRecurrence, YesRecurrence,
ncol = 2, nrow = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histograms of age for the groups with and without recurrence do not look bell-shaped, which suggests that normality is violated and that a non-parametric test is appropriate. I will first test for normality formally.
mydata %>%
group_by(Recurrence) %>%
shapiro_test(Age)
## # A tibble: 2 × 4
## Recurrence variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 No Age 0.952 4.04e-11
## 2 Yes Age 0.953 5.60e-12
The Shapiro-Wilk test rejects normality in both groups (p < 0.001), so the non-parametric Wilcoxon rank-sum test is the appropriate test for these data. The effect size is therefore computed from the non-parametric test.
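The same per-group normality check can be sketched with base R only. The data here are simulated (`age` and `recurrence` are illustrative names), and the 0.05 cutoff is the conventional threshold rather than a fixed rule:

```r
set.seed(1)
# Simulated stand-ins for Age and Recurrence (not the real data)
age <- round(runif(200, min = 18, max = 90))
recurrence <- factor(rep(c("No", "Yes"), each = 100))

# Shapiro-Wilk p-value within each group
shapiro_p <- tapply(age, recurrence, function(a) shapiro.test(a)$p.value)

# Treat normality as acceptable only if no group rejects it
normal_ok <- all(shapiro_p >= 0.05)
test_choice <- if (normal_ok) "t-test" else "Wilcoxon rank-sum test"
test_choice
```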
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
effectsize(wilcox.test(mydata$Age ~ mydata$Recurrence,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## 0.03 | [-0.04, 0.10]
interpret_rank_biserial(.03)
## [1] "tiny"
## (Rules: funder2019)
Based on the sample data, we cannot say that there is a difference in age between patients who had a recurrence and those who did not (p = 0.37, r = 0.03, tiny effect).
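As a cross-check, the rank-biserial correlation can be recovered by hand from the Wilcoxon statistic and the group sizes, using r = 2W / (n1 * n2) - 1. The sketch below takes W from the test output and the group sizes from the summary of Recurrence; the variable names are illustrative:

```r
# Values taken from the output above
W     <- 128561   # Wilcoxon rank sum statistic (group "No" first)
n_no  <- 468      # patients without recurrence
n_yes <- 532      # patients with recurrence

# W counts the (No, Yes) pairs in which the "No" patient is older
# (ties count one half), so W / (n_no * n_yes) is the share of such pairs
r_rank_biserial <- 2 * W / (n_no * n_yes) - 1
round(r_rank_biserial, 2)
```

A value close to 0 means that a randomly chosen patient without recurrence is about as likely to be older as to be younger than a randomly chosen patient with recurrence.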
#RQ2
#Is there a correlation between the BMI of a patient and the size of the tumor when they were first diagnosed?
First I will call for a scatter plot to visualize the correlation between these two variables.
library(car)
scatterplot(mydata$BMI, mydata$TumorSize)
The scatter plot suggests at most a very weak relationship between Body Mass Index and tumor size in these patients.
shapiro.test(mydata$BMI)
##
## Shapiro-Wilk normality test
##
## data: mydata$BMI
## W = 0.94805, p-value < 2.2e-16
shapiro.test(mydata$TumorSize)
##
## Shapiro-Wilk normality test
##
## data: mydata$TumorSize
## W = 0.95583, p-value < 2.2e-16
Both variables have a p-value < 0.05, which means that both violate normality. For this reason we use Spearman’s rank correlation coefficient to assess the correlation between the two.
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(mydata$BMI, mydata$TumorSize, type = "spearman")
## x y
## x 1.00 0.05
## y 0.05 1.00
##
## n= 1000
##
##
## P
## x y
## x 0.1536
## y 0.1536
Although the sample coefficient is 0.05 (a very weak positive correlation), the p-value is 0.1536, so we fail to reject the null hypothesis and cannot say that the relationship is statistically significant. Based on this sample, there is no evidence of a correlation between BMI and tumor size at first diagnosis.
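Spearman's coefficient is simply Pearson's correlation computed on the ranks of the data, which is why it does not require normality. A small base-R sketch on simulated values (`bmi` and `tumor_size` are made-up vectors, not the real columns):

```r
set.seed(42)
# Simulated stand-ins for BMI and TumorSize (not the real data)
bmi        <- runif(50, min = 18.5, max = 40)
tumor_size <- runif(50, min = 1, max = 10)

# Spearman's rho computed directly ...
rho_direct <- cor(bmi, tumor_size, method = "spearman")

# ... and as Pearson's r on the ranks
rho_ranks <- cor(rank(bmi), rank(tumor_size))

all.equal(rho_direct, rho_ranks)

# Base-R significance test for the same coefficient
sp_test <- cor.test(bmi, tumor_size, method = "spearman", exact = FALSE)
sp_test$p.value
```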
#RQ3
#Is there an association between the race/ethnicity of the patients and the type of cancer that they were diagnosed with?
H0: Race and CancerType are not associated.
H1: Race and CancerType are associated.
mytable <- table(mydata$Race, mydata$CancerType)
print(mytable)
##
## Breast Colon Leukemia Lung Prostate Skin
## African American 32 30 32 39 46 30
## Asian 39 37 27 25 45 30
## Caucasian 23 37 33 31 21 32
## Hispanic 31 44 30 33 33 33
## Other 43 34 24 37 36 33
In this chunk I created a contingency table, stored in the object mytable, that shows how many patients within each ethnic group have each type of cancer.
chi_squared <- chisq.test(mytable,
correct = FALSE)
chi_squared
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 24.391, df = 20, p-value = 0.2257
The Chi-Square test shows that we fail to reject the null hypothesis (p > 0.05), which means that we cannot say that there is a statistically significant association between the race of a patient and the type of cancer they were diagnosed with.
addmargins(chi_squared$observed)
##
## Breast Colon Leukemia Lung Prostate Skin Sum
## African American 32 30 32 39 46 30 209
## Asian 39 37 27 25 45 30 203
## Caucasian 23 37 33 31 21 32 177
## Hispanic 31 44 30 33 33 33 204
## Other 43 34 24 37 36 33 207
## Sum 168 182 146 165 181 158 1000
round(chi_squared$expected,2)
##
## Breast Colon Leukemia Lung Prostate Skin
## African American 35.11 38.04 30.51 34.48 37.83 33.02
## Asian 34.10 36.95 29.64 33.49 36.74 32.07
## Caucasian 29.74 32.21 25.84 29.20 32.04 27.97
## Hispanic 34.27 37.13 29.78 33.66 36.92 32.23
## Other 34.78 37.67 30.22 34.16 37.47 32.71
round(chi_squared$residuals, 2)
##
## Breast Colon Leukemia Lung Prostate Skin
## African American -0.53 -1.30 0.27 0.77 1.33 -0.53
## Asian 0.84 0.01 -0.48 -1.47 1.36 -0.37
## Caucasian -1.24 0.84 1.41 0.33 -1.95 0.76
## Hispanic -0.56 1.13 0.04 -0.11 -0.65 0.14
## Other 1.39 -0.60 -1.13 0.49 -0.24 0.05
With these additional chunks we can compare the observed counts from the data with the counts that would be expected if Race and CancerType were independent, and inspect the Pearson residuals that measure the gap between the two.
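The expected counts and residuals can also be reproduced by hand from the margins of the table: each expected count is row total times column total divided by n, and each Pearson residual is (observed - expected) / sqrt(expected). The sketch below re-enters the observed table printed above; the object names are illustrative:

```r
# Observed counts copied from the contingency table above
observed <- matrix(
  c(32, 30, 32, 39, 46, 30,
    39, 37, 27, 25, 45, 30,
    23, 37, 33, 31, 21, 32,
    31, 44, 30, 33, 33, 33,
    43, 34, 24, 37, 36, 33),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Race = c("African American", "Asian", "Caucasian", "Hispanic", "Other"),
    CancerType = c("Breast", "Colon", "Leukemia", "Lung", "Prostate", "Skin")
  )
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals and the chi-squared statistic built from them
pearson_res <- (observed - expected) / sqrt(expected)
x_squared <- sum(pearson_res^2)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(expected, 2)
round(pearson_res, 2)
```

Residuals with an absolute value below roughly 2 are usually read as cells that do not deviate notably from independence, which is the case for every cell here.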
library(effectsize)
effectsize::cramers_v(mytable)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.03 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
We cannot say that there is a statistically significant association between these two variables (p > 0.05); the effect size is tiny (Cramér's V = 0.03).
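The adjusted Cramér's V reported above can be approximated by hand from the chi-squared statistic. This sketch assumes that the "(adj.)" value is the small-sample bias correction (Bergsma's correction); the inputs are taken from the output above and the variable names are illustrative:

```r
# Inputs taken from the chi-squared output above
x_squared <- 24.391
n <- 1000
n_rows <- 5   # Race levels
n_cols <- 6   # CancerType levels

# Unadjusted Cramer's V
phi2  <- x_squared / n
v_raw <- sqrt(phi2 / min(n_rows - 1, n_cols - 1))

# Bias-corrected version (assumed to match the "(adj.)" output)
phi2_adj <- max(0, phi2 - (n_rows - 1) * (n_cols - 1) / (n - 1))
rows_adj <- n_rows - (n_rows - 1)^2 / (n - 1)
cols_adj <- n_cols - (n_cols - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(rows_adj - 1, cols_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 2)
```

The correction subtracts the amount of association expected by chance alone in a table of this size, which is why the adjusted value is smaller than the unadjusted one.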