#Importing the data
cancer_data <- read.table("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/MVA/MVA HW1/cancer issue.csv", header=TRUE, sep=",", dec=".")
head(cancer_data)
## PatientID Age Gender Race.Ethnicity BMI SmokingStatus
## 1 1 80 Female Other 23.3 Smoker
## 2 2 76 Male Caucasian 22.4 Former Smoker
## 3 3 69 Male Asian 21.5 Smoker
## 4 4 77 Male Asian 30.4 Former Smoker
## 5 5 89 Male Caucasian 20.9 Smoker
## 6 6 64 Female Asian 33.8 Former Smoker
## FamilyHistory CancerType Stage TumorSize TreatmentType
## 1 Yes Breast II 1.7 Combination Therapy
## 2 Yes Colon IV 4.7 Surgery
## 3 Yes Breast III 8.3 Combination Therapy
## 4 Yes Prostate II 1.7 Radiation
## 5 Yes Lung IV 7.4 Radiation
## 6 Yes Lung IV 5.2 Combination Therapy
## TreatmentResponse SurvivalMonths Recurrence GeneticMarker
## 1 No Response 103 Yes None
## 2 No Response 14 Yes BRCA1
## 3 Complete Remission 61 Yes BRCA1
## 4 Partial Remission 64 No KRAS
## 5 No Response 82 Yes KRAS
## 6 No Response 95 No None
## HospitalRegion
## 1 South
## 2 West
## 3 West
## 4 South
## 5 South
## 6 North
The unit of observation is one individual cancer patient.
Description:
The summary below describes all variables in the dataset before any manipulation.
Because the dataset contains many variables, I decided to drop those that I consider unimportant for my research and to remove any missing values (NA), should there be any.
summary(cancer_data)
## PatientID Age Gender
## Min. : 1 Min. :18.00 Length:17686
## 1st Qu.: 4422 1st Qu.:35.00 Class :character
## Median : 8844 Median :54.00 Mode :character
## Mean : 8844 Mean :53.76
## 3rd Qu.:13265 3rd Qu.:72.00
## Max. :17686 Max. :90.00
## Race.Ethnicity BMI SmokingStatus
## Length:17686 Min. :18.50 Length:17686
## Class :character 1st Qu.:23.90 Class :character
## Mode :character Median :29.20 Mode :character
## Mean :29.25
## 3rd Qu.:34.60
## Max. :40.00
## FamilyHistory CancerType Stage
## Length:17686 Length:17686 Length:17686
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## TumorSize TreatmentType TreatmentResponse
## Min. : 1.0 Length:17686 Length:17686
## 1st Qu.: 3.3 Class :character Class :character
## Median : 5.5 Mode :character Mode :character
## Mean : 5.5
## 3rd Qu.: 7.7
## Max. :10.0
## SurvivalMonths Recurrence GeneticMarker
## Min. : 1.00 Length:17686 Length:17686
## 1st Qu.: 30.00 Class :character Class :character
## Median : 60.00 Mode :character Mode :character
## Mean : 60.39
## 3rd Qu.: 91.00
## Max. :120.00
## HospitalRegion
## Length:17686
## Class :character
## Mode :character
##
##
##
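The summary shows that the data needs to be adjusted: most variables are stored as character strings rather than factors, and several are not needed for the analysis. My further steps are as follows.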
# Keep only the necessary variables for the analysis
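# Note: the original selection also listed "ID", which matches no column (the identifier is "PatientID"), so no identifier is kept
cancer_data2 <- cancer_data[, colnames(cancer_data) %in% c("Gender", "Age", "TumorSize", "SmokingStatus", "CancerType", "SurvivalMonths", "BMI")]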
#Factoring variables Gender, SmokingStatus and CancerType
cancer_data2$Gender <- factor(cancer_data2$Gender,
levels = c("Male", "Female"),
labels = c("Male", "Female"))
cancer_data2$SmokingStatus <- factor(cancer_data2$SmokingStatus,
levels = c("Smoker", "Former Smoker", "Non-Smoker"),
labels = c("Smoker", "FormerS", "NonS"))
cancer_data2$CancerType <- factor(cancer_data2$CancerType,
levels = c("Breast", "Colon", "Prostate", "Lung", "Leukemia", "Skin"),
labels = c("Breast", "Colon", "Prostate", "Lung", "Leukemia", "Skin"))
summary(cancer_data2)
## Age Gender BMI SmokingStatus
## Min. :18.00 Male :8756 Min. :18.50 Smoker :5860
## 1st Qu.:35.00 Female:8930 1st Qu.:23.90 FormerS:5915
## Median :54.00 Median :29.20 NonS :5911
## Mean :53.76 Mean :29.25
## 3rd Qu.:72.00 3rd Qu.:34.60
## Max. :90.00 Max. :40.00
## CancerType TumorSize SurvivalMonths
## Breast :2938 Min. : 1.0 Min. : 1.00
## Colon :2934 1st Qu.: 3.3 1st Qu.: 30.00
## Prostate:2938 Median : 5.5 Median : 60.00
## Lung :2939 Mean : 5.5 Mean : 60.39
## Leukemia:2957 3rd Qu.: 7.7 3rd Qu.: 91.00
## Skin :2980 Max. :10.0 Max. :120.00
library(pastecs)
options(scipen = 999) #Ensuring numbers are not shown as e+
stat.desc(cancer_data2)
## Age Gender BMI SmokingStatus
## nbr.val 17686.0000000 NA 17686.00000000 NA
## nbr.null 0.0000000 NA 0.00000000 NA
## nbr.na 0.0000000 NA 0.00000000 NA
## min 18.0000000 NA 18.50000000 NA
## max 90.0000000 NA 40.00000000 NA
## range 72.0000000 NA 21.50000000 NA
## sum 950771.0000000 NA 517382.80000000 NA
## median 54.0000000 NA 29.20000000 NA
## mean 53.7583965 NA 29.25380527 NA
## SE.mean 0.1585057 NA 0.04664738 NA
## CI.mean.0.95 0.3106868 NA 0.09143343 NA
## var 444.3441690 NA 38.48434147 NA
## std.dev 21.0794727 NA 6.20357489 NA
## coef.var 0.3921150 NA 0.21206044 NA
## CancerType TumorSize SurvivalMonths
## nbr.val NA 17686.00000000 17686.0000000
## nbr.null NA 0.00000000 0.0000000
## nbr.na NA 0.00000000 0.0000000
## min NA 1.00000000 1.0000000
## max NA 10.00000000 120.0000000
## range NA 9.00000000 119.0000000
## sum NA 97268.60000000 1068019.0000000
## median NA 5.50000000 60.0000000
## mean NA 5.49975122 60.3878209
## SE.mean NA 0.01957389 0.2616377
## CI.mean.0.95 NA 0.03836675 0.5128355
## var NA 6.77616392 1210.6822130
## std.dev NA 2.60310659 34.7948590
## coef.var NA 0.47331352 0.5761900
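Since nbr.na is 0 for every numeric variable, and the factor counts in summary(cancer_data2) add up to 17686 for each factor, there are no missing values (NA) and no removal is needed.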
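Note that stat.desc reports NA for all factor columns, so its nbr.na row only covers the numeric variables. A minimal sketch of a check that covers every column type is shown below. The data frame `toy` and its values are hypothetical and only serve as an illustration; they are not part of the cancer dataset.

```r
# Hypothetical toy data with two deliberately missing entries
toy <- data.frame(
  Age = c(80, 76, NA, 77),
  Gender = factor(c("Female", "Male", "Male", NA), levels = c("Male", "Female")),
  TumorSize = c(1.7, 4.7, 8.3, 1.7)
)

# Number of missing values per column, for numeric and factor columns alike
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Applied to cancer_data2, the same two calls would confirm in one step that all counts are zero and that no rows need to be dropped.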
head(cancer_data2)
## Age Gender BMI SmokingStatus CancerType TumorSize SurvivalMonths
## 1 80 Female 23.3 Smoker Breast 1.7 103
## 2 76 Male 22.4 FormerS Colon 4.7 14
## 3 69 Male 21.5 Smoker Breast 8.3 61
## 4 77 Male 30.4 FormerS Prostate 1.7 64
## 5 89 Male 20.9 Smoker Lung 7.4 82
## 6 64 Female 33.8 FormerS Lung 5.2 95
Description of the finalized dataset:
- Age: Age of the patient (in years).
- Gender: Gender of the patient (Male, Female).
- BMI: Body mass index of the patient.
- SmokingStatus: Smoking status of the patient (Smoker, Former Smoker, Non-Smoker).
- CancerType: Type of cancer diagnosed (Breast, Colon, Prostate, Lung, Leukemia, Skin).
- TumorSize: Size of the tumor (in cm).
- SurvivalMonths: Survival time of the patient (in months).
library(psych)
describe(cancer_data2) # describe() is the ungrouped version; describeBy() without a group only adds a warning
## vars n mean sd median trimmed mad min max
## Age 1 17686 53.76 21.08 54.0 53.70 26.69 18.0 90
## Gender* 2 17686 1.50 0.50 2.0 1.51 0.00 1.0 2
## BMI 3 17686 29.25 6.20 29.2 29.25 7.86 18.5 40
## SmokingStatus* 4 17686 2.00 0.82 2.0 2.00 1.48 1.0 3
## CancerType* 5 17686 3.51 1.71 4.0 3.51 2.97 1.0 6
## TumorSize 6 17686 5.50 2.60 5.5 5.50 3.26 1.0 10
## SurvivalMonths 7 17686 60.39 34.79 60.0 60.34 44.48 1.0 120
## range skew kurtosis se
## Age 72.0 0.02 -1.20 0.16
## Gender* 1.0 -0.02 -2.00 0.00
## BMI 21.5 0.00 -1.19 0.05
## SmokingStatus* 2.0 -0.01 -1.50 0.01
## CancerType* 5.0 -0.01 -1.27 0.01
## TumorSize 9.0 0.00 -1.20 0.02
## SurvivalMonths 119.0 0.01 -1.21 0.26
## RQ1: Is there a significant difference in the age distribution of cancer patients between male and female patients?
#Descriptive statistics by group - Male and Female
library(psych)
describeBy(cancer_data2$Age, cancer_data2$Gender)
##
## Descriptive statistics by group
## group: Male
## vars n mean sd median trimmed mad min max range skew
## X1 1 8756 53.6 21.14 53 53.51 26.69 18 90 72 0.03
## kurtosis se
## X1 -1.21 0.23
## ----------------------------------------------------
## group: Female
## vars n mean sd median trimmed mad min max range skew
## X1 1 8930 53.92 21.02 54 53.89 26.69 18 90 72 0.01
## kurtosis se
## X1 -1.19 0.22
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
cancer_female <- ggplot(cancer_data2[cancer_data2$Gender == "Female", ], aes(x = Age)) +
theme_linedraw() +
geom_histogram(binwidth = 2, col = "black", fill = "orchid1") +
ylab("Frequency") +
ggtitle("Age distribution of female with cancer")
cancer_male <- ggplot(cancer_data2[cancer_data2$Gender == "Male", ], aes(x = Age)) +
theme_linedraw() +
geom_histogram(binwidth = 2, col = "black", fill = "steelblue1") +
ylab("Frequency") +
ggtitle("Age distribution of male with cancer")
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.4.2
ggarrange(cancer_female, cancer_male,
ncol = 2, nrow = 1)
library(ggpubr)
ggqqplot(cancer_data2,
"Age",
facet.by = "Gender")
Based on the histograms and the quantile-quantile plots above, I expect that age is not normally distributed in either gender group of this sample. But let’s check it.
The Shapiro-Wilk test in R (shapiro.test) accepts at most 5000 observations, and each gender group here is larger than that. Therefore I will draw a smaller random sample of 1000 units.
set.seed(1) #Setting initial point of sampling
cancer_data_1000 <- cancer_data2[sample(nrow(cancer_data2), 1000), ] #Random sample of 1000
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.4.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
cancer_data_1000 %>%
group_by(Gender) %>%
shapiro_test(Age)
## # A tibble: 2 × 4
## Gender variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Male Age 0.953 2.08e-11
## 2 Female Age 0.953 1.26e-11
Based on the Shapiro-Wilk test, we can reject the normal distribution of Age in both groups, Male and Female (p<0.001 in both cases). Since normality does not hold, we should move to non-parametric tests. For the purpose of this homework, however, I will first do the parametric test and then the non-parametric test.
Since we have independent samples from 2 populations (Male and Female), we will start the parametric part with a t-test with Welch correction (Welch two sample t-test) with the hypotheses:
H0: meanFemale = meanMale, or meanFemale - meanMale = 0
H1: meanFemale ≠ meanMale, or meanFemale - meanMale ≠ 0
# Independent samples
t.test(cancer_data2$Age ~ cancer_data2$Gender,
var.equal = FALSE,
alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: cancer_data2$Age by cancer_data2$Gender
## t = -1.002, df = 17673, p-value = 0.3163
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
## -0.9391237 0.3037487
## sample estimates:
## mean in group Male mean in group Female
## 53.59799 53.91568
With the Welch two sample t-test we cannot reject H0 (p = 0.32 > 0.05), meaning we cannot reject that the mean ages of male and female patients are equal. Comparing the group means directly (53.60 vs. 53.92 years) also shows only a small difference, which the test finds not to be statistically significant.
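To make the Welch result less of a black box, the test statistic can be recomputed from the group summaries alone. The sketch below uses the group means from the t.test output and the standard deviations from the describeBy output; because those standard deviations are rounded to two decimals, the results should land close to, but not exactly on, the reported t, df and p-value. All variable names are my own.

```r
# Group summaries copied from the outputs above
m_male <- 53.59799   # mean age, Male (t.test output)
m_female <- 53.91568 # mean age, Female (t.test output)
s_male <- 21.14      # sd of age, Male (describeBy output, rounded)
s_female <- 21.02    # sd of age, Female (describeBy output, rounded)
n_male <- 8756
n_female <- 8930

# Squared standard errors of the two group means
v_male <- s_male^2 / n_male
v_female <- s_female^2 / n_female

# Welch t statistic: difference in means over the un-pooled standard error
t_welch <- (m_male - m_female) / sqrt(v_male + v_female)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_male + v_female)^2 /
  (v_male^2 / (n_male - 1) + v_female^2 / (n_female - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_welch), df_welch)

print(c(t = t_welch, df = df_welch, p = p_value))
```

Because the two groups have almost identical sizes and standard deviations, the Welch degrees of freedom are nearly as large as the n - 2 of the classical t-test.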
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
effectsize::cohens_d(cancer_data2$Age ~ cancer_data2$Gender,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## -------------------------
## -0.02 | [-0.04, 0.01]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.02, rules = "sawilowsky2009")
## [1] "tiny"
## (Rules: sawilowsky2009)
Based on the sample data, we cannot reject the hypothesis that the average age is the same for both genders (p>0.05). Consistent with this, the effect size indicates only a tiny difference in age between the two genders (d = -0.02).
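As a rough check of the effect size, Cohen's d can be reproduced by hand from the group means in the t.test output and the (rounded) standard deviations in the describeBy output. This is a sketch that assumes the un-pooled standard deviation is the square root of the average of the two group variances:

$$
d = \frac{\bar{x}_{Male} - \bar{x}_{Female}}{\sqrt{\left(s_{Male}^2 + s_{Female}^2\right)/2}}
  = \frac{53.598 - 53.916}{\sqrt{\left(21.14^2 + 21.02^2\right)/2}}
  \approx \frac{-0.318}{21.08}
  \approx -0.015
$$

This rounds to the -0.02 printed above, and its absolute value is far below the 0.1 that the Sawilowsky (2009) rules use as the boundary of a "tiny" effect.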
wilcox.test(cancer_data2$Age ~ cancer_data2$Gender,
correct = FALSE,
exact = FALSE,
alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: cancer_data2$Age by cancer_data2$Gender
## W = 38752218, p-value = 0.3118
## alternative hypothesis: true location shift is not equal to 0
H0: Distribution location of age is the same for male and female.
H1: Distribution location of age is not the same for male and female.
Based on the p-value (p = 0.31 > 0.05), we cannot reject H0 that the distribution location of age is the same for male and female.
library(ggplot2)
ggplot(cancer_data2, aes(x = Age, fill = Gender)) +
  geom_histogram(position = position_dodge(width = 2.5), binwidth = 5, colour = "Black") +
  scale_x_continuous(breaks = seq(15, 90, 5)) + # Age ranges from 18 to 90
  scale_fill_manual(values = c("Male" = "steelblue1", "Female" = "orchid1")) +
  ylab("Frequency") +
  labs(fill = "Gender") # the fill legend shows Gender, not Age
The function scale_fill_manual allowed me to manually specify the fill colors so that male and female match the colors used in the previous graphs.
library(effectsize)
effectsize(wilcox.test(cancer_data2$Age ~ cancer_data2$Gender,
correct = FALSE,
exact = FALSE,
alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ---------------------------------
## -8.78e-03 | [-0.03, 0.01]
Based on the Funder & Ozer (2019) scale, this is considered a tiny effect size.
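The rank-biserial correlation can also be traced back to the Wilcoxon output. Assuming the usual relation between the statistic W (the Mann-Whitney U of the first group, here Male) and the rank-biserial r, the reported value follows from W and the two group sizes:

$$
r = \frac{2W}{n_{Male} \, n_{Female}} - 1
  = \frac{2 \cdot 38752218}{8756 \cdot 8930} - 1
  \approx 0.99122 - 1
  = -0.00878
$$

A value of 0 would mean that a randomly chosen male patient is older than a randomly chosen female patient exactly as often as the reverse, so -0.00878 is very close to no difference at all.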
## RQ1: Conclusion and answer
Based on the analysis done above, the more appropriate test for this sample is the non-parametric Wilcoxon Rank Sum Test, since we can reject the normal distribution of Age in both groups, Male and Female (p<0.001 in both cases). We should therefore skip the parametric test and go straight to the non-parametric alternative. Based on the Wilcoxon Rank Sum Test, I conclude that there is no statistically significant difference in the distribution location of Age between Male and Female in this sample, and the effect size analysis shows only a tiny difference (r = -0.00878).
We can conclude that there is no significant difference in the age distribution of cancer patients between male and female patients based on this sample. I would add that since this is secondary data, we cannot be sure how it was collected, that is, whether the sample is random or not. With that caveat, the data we have supports this conclusion.
library(ggplot2)
cancer_female <- ggplot(cancer_data2[cancer_data2$Gender == "Female", ], aes(x = TumorSize)) +
  theme_linedraw() +
  geom_histogram(binwidth = 0.5, col = "black", fill = "orchid1") +
  ylab("Frequency") +
  ggtitle("Tumor size distribution of female with cancer")
cancer_male <- ggplot(cancer_data2[cancer_data2$Gender == "Male", ], aes(x = TumorSize)) +
  theme_linedraw() +
  geom_histogram(binwidth = 0.5, col = "black", fill = "steelblue1") +
  ylab("Frequency") +
  ggtitle("Tumor size distribution of male with cancer")
library(ggpubr)
# Two plots side by side need two columns; with ncol = 1 and nrow = 1 ggarrange returns a list of separate pages
ggarrange(cancer_female, cancer_male,
  ncol = 2, nrow = 1)
library(psych)
psych::describe(cancer_data2[ , c("TumorSize", "SurvivalMonths")])
## vars n mean sd median trimmed mad min max
## TumorSize 1 17686 5.50 2.60 5.5 5.50 3.26 1 10
## SurvivalMonths 2 17686 60.39 34.79 60.0 60.34 44.48 1 120
## range skew kurtosis se
## TumorSize 9 0.00 -1.20 0.02
## SurvivalMonths 119 0.01 -1.21 0.26
library(ggplot2)
ggplot(cancer_data2, aes(x = TumorSize, y = SurvivalMonths)) +
geom_point()
The scatterplot does not show any clear pattern. Let’s check the linear relationship with a descriptive approach.
cor(cancer_data2$TumorSize, cancer_data2$SurvivalMonths,
method = "pearson",
use = "complete.obs")
## [1] 0.001953289
Based on the result, the linear correlation between tumor size and survival months after diagnosis is positive but practically zero (r = 0.002).
We still conduct the test of correlation coefficient with:
H0: ρ = 0
H1: ρ ≠ 0
# cor.test() has no `use` argument; it drops incomplete pairs on its own
cor.test(cancer_data2$TumorSize, cancer_data2$SurvivalMonths,
  method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cancer_data2$TumorSize and cancer_data2$SurvivalMonths
## t = 0.25975, df = 17684, p-value = 0.7951
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01278508 0.01669081
## sample estimates:
## cor
## 0.001953289
We cannot reject the null hypothesis H0 that the correlation coefficient is 0 (p = 0.80 > 0.05).
Based on the statistical tests above we cannot conclude that there is a significant correlation between tumor size and survival months in cancer patients.
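The numbers in the correlation test output can be reproduced from the correlation coefficient and the sample size alone. The sketch below assumes the standard t transformation of Pearson's r and a confidence interval based on the Fisher z transformation; variable names are my own.

```r
r <- 0.001953289 # Pearson correlation from the output above
n <- 17686       # number of patients

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z transformation
z <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

print(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]))
```

With more than 17000 observations, even a correlation of about 0.015 would already be statistically significant, so the non-significant result here reflects how close the coefficient is to zero.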
# Pearson Chi2 test
results <- chisq.test(cancer_data2$SmokingStatus, cancer_data2$CancerType,
correct = FALSE)
results
##
## Pearson's Chi-squared test
##
## data: cancer_data2$SmokingStatus and cancer_data2$CancerType
## X-squared = 4.4956, df = 10, p-value = 0.9222
H0: There is no association between smoking status and cancer type.
H1: There is an association between smoking status and cancer type.
Based on the p-value (p = 0.92 > 0.05), we cannot reject H0 that there is no association between the two categorical variables, smoking status and cancer type.
addmargins(results$observed)
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate Lung Leukemia Skin
## Smoker 943 972 976 997 970 1002
## FormerS 985 1002 977 962 988 1001
## NonS 1010 960 985 980 999 977
## Sum 2938 2934 2938 2939 2957 2980
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Sum
## Smoker 5860
## FormerS 5915
## NonS 5911
## Sum 17686
Here R created the table of observed (empirical) frequencies from our data set. For example, 985 means that there were 985 former smokers with breast cancer. The right-hand column shows the totals for each smoking category, the bottom row shows the totals for each cancer type, and the bottom right corner shows the total number of units.
We continue with calculating a table of expected frequencies.
#Calculating expected frequencies
round(results$expected, 2)
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate Lung Leukemia
## Smoker 973.46 972.14 973.46 973.80 979.76
## FormerS 982.60 981.26 982.60 982.93 988.95
## NonS 981.94 980.60 981.94 982.27 988.29
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Skin
## Smoker 987.38
## FormerS 996.65
## NonS 995.97
In this step we calculated the expected frequencies, that is, how many people we would expect to see in each combination of categories if there were no association between cancer type and smoking status.
For example: If there were no association between cancer type and smoking status, we would expect 982.60 former smokers to have breast cancer.
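The expected frequencies and the test statistic can be rebuilt from the observed table alone, which shows what chisq.test does internally. The sketch below re-enters the observed counts from the table above by hand; each expected count is the row total times the column total, divided by the overall total. Variable names are my own.

```r
# Observed counts copied from the table of observed frequencies above
observed <- matrix(
  c( 943,  972, 976, 997, 970, 1002,
     985, 1002, 977, 962, 988, 1001,
    1010,  960, 985, 980, 999,  977),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    SmokingStatus = c("Smoker", "FormerS", "NonS"),
    CancerType = c("Breast", "Colon", "Prostate", "Lung", "Leukemia", "Skin")
  )
)

n <- sum(observed)

# Expected count under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic, degrees of freedom and p-value
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(round(expected, 2))
print(c(chi_sq = chi_sq, df = df, p = p_value))
```

For the former smokers with breast cancer this gives 5915 * 2938 / 17686, which is the 982.60 used in the example above.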
With these calculations we also check one of the required assumptions of this analysis. The analysis of association between two categorical variables has 2 main assumptions:
1. The observations are independent of each other: This holds since no person is measured multiple times.
2. All expected frequencies are greater than 5: The table above shows that this holds.
Since both assumptions hold, the Pearson Chi2 test above is appropriate, and we can continue by examining the residuals.
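# Calculation of Pearson residuals, (observed - expected) / sqrt(expected)
# results$res only partially matches results$residuals; adjusted standardized residuals are in results$stdres
round(results$residuals, 2)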
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate Lung Leukemia Skin
## Smoker -0.98 0.00 0.08 0.74 -0.31 0.47
## FormerS 0.08 0.66 -0.18 -0.67 -0.03 0.14
## NonS 0.90 -0.66 0.10 -0.07 0.34 -0.60
Here we can see that none of the residuals is large in absolute value (all are well below 1.96), meaning no cell deviates significantly from its expected frequency.
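The residuals printed above can be checked by hand from the observed and expected tables. They correspond to Pearson residuals, the difference between the observed and the expected count divided by the square root of the expected count. For smokers with breast cancer:

$$
\frac{O - E}{\sqrt{E}} = \frac{943 - 973.46}{\sqrt{973.46}} \approx \frac{-30.46}{31.20} \approx -0.98
$$

The squares of all 18 residuals add up to the Chi2 statistic of the test, so small residuals in every cell and a small test statistic go hand in hand.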
addmargins(round(prop.table(results$observed), 3))
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate Lung Leukemia Skin
## Smoker 0.053 0.055 0.055 0.056 0.055 0.057
## FormerS 0.056 0.057 0.055 0.054 0.056 0.057
## NonS 0.057 0.054 0.056 0.055 0.056 0.055
## Sum 0.166 0.166 0.166 0.165 0.167 0.169
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Sum
## Smoker 0.331
## FormerS 0.335
## NonS 0.333
## Sum 0.999
In this table all cells together sum to 1 (0.999 in our case because of rounding). For example, 16.7% of the people in our data have leukemia and 33.1% are smokers.
addmargins(round(prop.table(results$observed, 1), 3), 2)
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate Lung Leukemia Skin
## Smoker 0.161 0.166 0.167 0.170 0.166 0.171
## FormerS 0.167 0.169 0.165 0.163 0.167 0.169
## NonS 0.171 0.162 0.167 0.166 0.169 0.165
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Sum
## Smoker 1.001
## FormerS 1.000
## NonS 1.000
This table is conditioned on smoking status, which we can see from the right-hand column, where each smoking category sums to 1 (up to rounding). Results from this table can be interpreted as follows:
16.9% of former smokers in the sample have skin cancer.
addmargins(round(prop.table(results$observed, 2), 3), 1)
## cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate Lung Leukemia Skin
## Smoker 0.321 0.331 0.332 0.339 0.328 0.336
## FormerS 0.335 0.342 0.333 0.327 0.334 0.336
## NonS 0.344 0.327 0.335 0.333 0.338 0.328
## Sum 1.000 1.000 1.000 0.999 1.000 1.000
This table is conditioned on cancer type, which we can see from the bottom row, where each cancer type sums to 1 (up to rounding). The data from this table can be interpreted as follows: Out of all people with lung cancer in this sample, 33.9% are smokers.
library(effectsize)
effectsize::cramers_v(cancer_data2$SmokingStatus, cancer_data2$CancerType)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.00 | [0.00, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.00)
## [1] "tiny"
## (Rules: funder2019)
Based on Cramér's V (0.00), the association between cancer type and smoking status is tiny, which is consistent with the non-significant Chi2 test.
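The value of exactly 0.00 may look suspicious, so it helps to recompute it from the Chi2 statistic. The sketch below assumes that the "(adj.)" in the output refers to the commonly used small-sample bias correction, in which the expected value of phi squared under independence is subtracted and the result is truncated at zero. Variable names are my own.

```r
chi_sq <- 4.4956 # Pearson chi-squared statistic from the test above
n <- 17686       # number of patients
n_rows <- 3      # smoking status categories
n_cols <- 6      # cancer type categories

# Unadjusted Cramer's V
phi2 <- chi_sq / n
v_plain <- sqrt(phi2 / min(n_rows - 1, n_cols - 1))

# Bias-corrected version (assumed to match the "adj." output)
phi2_adj <- max(0, phi2 - (n_rows - 1) * (n_cols - 1) / (n - 1))
rows_adj <- n_rows - (n_rows - 1)^2 / (n - 1)
cols_adj <- n_cols - (n_cols - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(rows_adj - 1, cols_adj - 1))

print(c(v_plain = v_plain, v_adj = v_adj))
```

The unadjusted V is already only about 0.01. Because the Chi2 statistic (4.50) is smaller than its degrees of freedom (10), the correction pushes the adjusted value down to zero.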