MVA Homework Assignment 1

Importing and preparation of the data

#Importing the data

cancer_data <- read.table("C:/Users/Veronika/ŠOLA/EKONOMSKA FAKULTETA/PRIJAVA NA IMB/MVA/MVA HW1/cancer issue.csv", header=TRUE, sep=",", dec=".")
head(cancer_data)

##   PatientID Age Gender Race.Ethnicity  BMI SmokingStatus
## 1         1  80 Female          Other 23.3        Smoker
## 2         2  76   Male      Caucasian 22.4 Former Smoker
## 3         3  69   Male          Asian 21.5        Smoker
## 4         4  77   Male          Asian 30.4 Former Smoker
## 5         5  89   Male      Caucasian 20.9        Smoker
## 6         6  64 Female          Asian 33.8 Former Smoker
##   FamilyHistory CancerType Stage TumorSize       TreatmentType
## 1           Yes     Breast    II       1.7 Combination Therapy
## 2           Yes      Colon    IV       4.7             Surgery
## 3           Yes     Breast   III       8.3 Combination Therapy
## 4           Yes   Prostate    II       1.7           Radiation
## 5           Yes       Lung    IV       7.4           Radiation
## 6           Yes       Lung    IV       5.2 Combination Therapy
##    TreatmentResponse SurvivalMonths Recurrence GeneticMarker
## 1        No Response            103        Yes          None
## 2        No Response             14        Yes         BRCA1
## 3 Complete Remission             61        Yes         BRCA1
## 4  Partial Remission             64         No          KRAS
## 5        No Response             82        Yes          KRAS
## 6        No Response             95         No          None
##   HospitalRegion
## 1          South
## 2           West
## 3           West
## 4          South
## 5          South
## 6          North

The unit of observation is one individual cancer patient.

Description:

The variables in the original dataset, before any manipulation, are:

PatientID: Unique identifier for each patient.
Age: Age of the patient (in years).
Gender: Gender of the patient (e.g., Male, Female).
Race/Ethnicity: The racial or ethnic background of the patient (e.g., Caucasian, Asian, Other).
BMI: Body Mass Index of the patient, measured in kg/m².
SmokingStatus: Smoking status of the patient (e.g., Smoker, Former Smoker, Non-Smoker).
FamilyHistory: Indicates whether the patient has a family history of cancer (Yes/No).
CancerType: Type of cancer diagnosed (e.g., Breast, Colon, Prostate, Lung).
Stage: Stage of the cancer (e.g., II, III, IV).
TumorSize: Size of the tumor (in cm).
TreatmentType: Type of treatment received by the patient (e.g., Surgery, Radiation, Combination Therapy).
TreatmentResponse: Response to the treatment (e.g., No Response, Partial Remission, Complete Remission).
SurvivalMonths: Number of months the patient survived after diagnosis.
Recurrence: Indicates whether the cancer recurred (Yes/No).
GeneticMarker: Presence of specific genetic markers related to cancer (e.g., BRCA1, KRAS, None).
HospitalRegion: Geographical region of the hospital where the patient was treated (e.g., South, West).

Since the dataset has many variables, I keep only those that are relevant to my research questions. I also check for missing values (NA) and remove them if any are found.

summary(cancer_data)

##    PatientID          Age           Gender         
##  Min.   :    1   Min.   :18.00   Length:17686      
##  1st Qu.: 4422   1st Qu.:35.00   Class :character  
##  Median : 8844   Median :54.00   Mode  :character  
##  Mean   : 8844   Mean   :53.76                     
##  3rd Qu.:13265   3rd Qu.:72.00                     
##  Max.   :17686   Max.   :90.00                     
##  Race.Ethnicity          BMI        SmokingStatus     
##  Length:17686       Min.   :18.50   Length:17686      
##  Class :character   1st Qu.:23.90   Class :character  
##  Mode  :character   Median :29.20   Mode  :character  
##                     Mean   :29.25                     
##                     3rd Qu.:34.60                     
##                     Max.   :40.00                     
##  FamilyHistory       CancerType           Stage          
##  Length:17686       Length:17686       Length:17686      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    TumorSize    TreatmentType      TreatmentResponse 
##  Min.   : 1.0   Length:17686       Length:17686      
##  1st Qu.: 3.3   Class :character   Class :character  
##  Median : 5.5   Mode  :character   Mode  :character  
##  Mean   : 5.5                                        
##  3rd Qu.: 7.7                                        
##  Max.   :10.0                                        
##  SurvivalMonths    Recurrence        GeneticMarker     
##  Min.   :  1.00   Length:17686       Length:17686      
##  1st Qu.: 30.00   Class :character   Class :character  
##  Median : 60.00   Mode  :character   Mode  :character  
##  Mean   : 60.39                                        
##  3rd Qu.: 91.00                                        
##  Max.   :120.00                                        
##  HospitalRegion    
##  Length:17686      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

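The summary shows that the categorical variables are stored as character vectors and that the dataset contains variables I do not need, so the data needs to be adjusted. My next steps are as follows.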

# Keep only the necessary variables for the analysis
# (PatientID is not kept, since it is not used in the analysis)
cancer_data2 <- cancer_data[, colnames(cancer_data) %in% c("Gender", "Age", "TumorSize", "SmokingStatus", "CancerType", "SurvivalMonths", "BMI")]

#Factoring variables Gender, SmokingStatus and CancerType

cancer_data2$Gender <- factor(cancer_data2$Gender,
                         levels = c("Male", "Female"),
                         labels = c("Male", "Female"))

cancer_data2$SmokingStatus <- factor(cancer_data2$SmokingStatus,
                         levels = c("Smoker", "Former Smoker", "Non-Smoker"),
                         labels = c("Smoker", "FormerS", "NonS"))

cancer_data2$CancerType <- factor(cancer_data2$CancerType,
                         levels = c("Breast", "Colon", "Prostate", "Lung", "Leukemia", "Skin"),
                         labels = c("Breast", "Colon", "Prostate", "Lung", "Leukemia", "Skin"))

summary(cancer_data2)

##       Age           Gender          BMI        SmokingStatus 
##  Min.   :18.00   Male  :8756   Min.   :18.50   Smoker :5860  
##  1st Qu.:35.00   Female:8930   1st Qu.:23.90   FormerS:5915  
##  Median :54.00                 Median :29.20   NonS   :5911  
##  Mean   :53.76                 Mean   :29.25                 
##  3rd Qu.:72.00                 3rd Qu.:34.60                 
##  Max.   :90.00                 Max.   :40.00                 
##     CancerType     TumorSize    SurvivalMonths  
##  Breast  :2938   Min.   : 1.0   Min.   :  1.00  
##  Colon   :2934   1st Qu.: 3.3   1st Qu.: 30.00  
##  Prostate:2938   Median : 5.5   Median : 60.00  
##  Lung    :2939   Mean   : 5.5   Mean   : 60.39  
##  Leukemia:2957   3rd Qu.: 7.7   3rd Qu.: 91.00  
##  Skin    :2980   Max.   :10.0   Max.   :120.00

library(pastecs)

options(scipen = 999) #Ensuring numbers are not shown as e+
stat.desc(cancer_data2)

##                         Age Gender             BMI SmokingStatus
## nbr.val       17686.0000000     NA  17686.00000000            NA
## nbr.null          0.0000000     NA      0.00000000            NA
## nbr.na            0.0000000     NA      0.00000000            NA
## min              18.0000000     NA     18.50000000            NA
## max              90.0000000     NA     40.00000000            NA
## range            72.0000000     NA     21.50000000            NA
## sum          950771.0000000     NA 517382.80000000            NA
## median           54.0000000     NA     29.20000000            NA
## mean             53.7583965     NA     29.25380527            NA
## SE.mean           0.1585057     NA      0.04664738            NA
## CI.mean.0.95      0.3106868     NA      0.09143343            NA
## var             444.3441690     NA     38.48434147            NA
## std.dev          21.0794727     NA      6.20357489            NA
## coef.var          0.3921150     NA      0.21206044            NA
##              CancerType      TumorSize  SurvivalMonths
## nbr.val              NA 17686.00000000   17686.0000000
## nbr.null             NA     0.00000000       0.0000000
## nbr.na               NA     0.00000000       0.0000000
## min                  NA     1.00000000       1.0000000
## max                  NA    10.00000000     120.0000000
## range                NA     9.00000000     119.0000000
## sum                  NA 97268.60000000 1068019.0000000
## median               NA     5.50000000      60.0000000
## mean                 NA     5.49975122      60.3878209
## SE.mean              NA     0.01957389       0.2616377
## CI.mean.0.95         NA     0.03836675       0.5128355
## var                  NA     6.77616392    1210.6822130
## std.dev              NA     2.60310659      34.7948590
## coef.var             NA     0.47331352       0.5761900

Since nbr.na is 0 for every numeric variable and summary() reports no NA's for the factors, there are no missing values to remove.
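stat.desc() reports NA for the factor columns, so a check that covers every column type can be useful as well. Below is a minimal base R sketch of such a check. The data frame toy_data is made up for illustration and is not part of the cancer data.

```r
# Hypothetical toy data, used only to illustrate the check
toy_data <- data.frame(
  Age = c(80, 76, NA, 77),
  Gender = c("Female", "Male", "Male", NA),
  BMI = c(23.3, 22.4, 21.5, 30.4)
)

# Number of missing values in each column (works for any column type)
na_per_column <- colSums(is.na(toy_data))
na_per_column

# Keep only the rows without any missing value
complete_rows <- toy_data[complete.cases(toy_data), ]
nrow(complete_rows)
```

Applied to cancer_data2, the same colSums(is.na(...)) call should return 0 for every column if the conclusion above is right.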

head(cancer_data2)

##   Age Gender  BMI SmokingStatus CancerType TumorSize SurvivalMonths
## 1  80 Female 23.3        Smoker     Breast       1.7            103
## 2  76   Male 22.4       FormerS      Colon       4.7             14
## 3  69   Male 21.5        Smoker     Breast       8.3             61
## 4  77   Male 30.4       FormerS   Prostate       1.7             64
## 5  89   Male 20.9        Smoker       Lung       7.4             82
## 6  64 Female 33.8       FormerS       Lung       5.2             95

Description of finalized dataset:

- Age: Age of the patient (in years).
- Gender: Gender of the patient (Male, Female).
- BMI: Body Mass Index of the patient, measured in kg/m².
- SmokingStatus: Smoking status of the patient (Smoker, FormerS = Former Smoker, NonS = Non-Smoker).
- CancerType: Type of cancer diagnosed (Breast, Colon, Prostate, Lung, Leukemia, Skin).
- TumorSize: Size of the tumor (in cm).
- SurvivalMonths: Number of months the patient survived after diagnosis.

library(psych)
describe(cancer_data2)

##                vars     n  mean    sd median trimmed   mad  min max
## Age               1 17686 53.76 21.08   54.0   53.70 26.69 18.0  90
## Gender*           2 17686  1.50  0.50    2.0    1.51  0.00  1.0   2
## BMI               3 17686 29.25  6.20   29.2   29.25  7.86 18.5  40
## SmokingStatus*    4 17686  2.00  0.82    2.0    2.00  1.48  1.0   3
## CancerType*       5 17686  3.51  1.71    4.0    3.51  2.97  1.0   6
## TumorSize         6 17686  5.50  2.60    5.5    5.50  3.26  1.0  10
## SurvivalMonths    7 17686 60.39 34.79   60.0   60.34 44.48  1.0 120
##                range  skew kurtosis   se
## Age             72.0  0.02    -1.20 0.16
## Gender*          1.0 -0.02    -2.00 0.00
## BMI             21.5  0.00    -1.19 0.05
## SmokingStatus*   2.0 -0.01    -1.50 0.01
## CancerType*      5.0 -0.01    -1.27 0.01
## TumorSize        9.0  0.00    -1.20 0.02
## SurvivalMonths 119.0  0.01    -1.21 0.26

RQ1: Is there a significant difference in the age distribution of cancer patients between male and female patients?

#Descriptive statistics by group - Male and Female

library(psych)
describeBy(cancer_data2$Age, cancer_data2$Gender)

## 
##  Descriptive statistics by group 
## group: Male
##    vars    n mean    sd median trimmed   mad min max range skew
## X1    1 8756 53.6 21.14     53   53.51 26.69  18  90    72 0.03
##    kurtosis   se
## X1    -1.21 0.23
## ---------------------------------------------------- 
## group: Female
##    vars    n  mean    sd median trimmed   mad min max range skew
## X1    1 8930 53.92 21.02     54   53.89 26.69  18  90    72 0.01
##    kurtosis   se
## X1    -1.19 0.22

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

cancer_female <- ggplot(cancer_data2[cancer_data2$Gender == "Female",  ], aes(x = Age)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 2, col = "black", fill = "orchid1") +
  ylab("Frequency") +
  ggtitle("Age distribution of female patients with cancer")

cancer_male <- ggplot(cancer_data2[cancer_data2$Gender == "Male",  ], aes(x = Age)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 2, col = "black", fill = "steelblue1") +
  ylab("Frequency") +
  ggtitle("Age distribution of male patients with cancer")

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.4.2

ggarrange(cancer_female, cancer_male,
          ncol = 2, nrow = 1)

library(ggpubr)
ggqqplot(cancer_data2,
         "Age",
         facet.by = "Gender")

Based on the histograms and the quantile-quantile plots above, I expect that Age is not normally distributed for either gender in this sample. Let’s check this formally.

Each group has more than 5000 observations, which is more than the Shapiro-Wilk test in R accepts. Therefore I draw a random subsample of 1000 units.

set.seed(1) #Setting initial point of sampling
cancer_data_1000 <- cancer_data2[sample(nrow(cancer_data2), 1000), ] #Random sample of 1000

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.4.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

cancer_data_1000 %>%
  group_by(Gender) %>%
  shapiro_test(Age)

## # A tibble: 2 × 4
##   Gender variable statistic        p
##   <fct>  <chr>        <dbl>    <dbl>
## 1 Male   Age          0.953 2.08e-11
## 2 Female Age          0.953 1.26e-11

Based on the Shapiro-Wilk test we reject normality of Age for both Male and Female (p < 0.001 in each group). Since the normality assumption does not hold, a non-parametric test is the appropriate choice. For the purpose of this homework, however, I will first run the parametric test and then the non-parametric test.

Since we have independent samples from two populations (Male and Female), we start with the parametric t-test with Welch correction (Welch two sample t-test), with the hypotheses:

H0: meanFemale = meanMale, or meanFemale - meanMale = 0
H1: meanFemale ≠ meanMale, or meanFemale - meanMale ≠ 0

# Independent samples

t.test(cancer_data2$Age ~ cancer_data2$Gender, 
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  cancer_data2$Age by cancer_data2$Gender
## t = -1.002, df = 17673, p-value = 0.3163
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  -0.9391237  0.3037487
## sample estimates:
##   mean in group Male mean in group Female 
##             53.59799             53.91568

With the Welch two sample t-test we cannot reject H0 (t = -1.002, p = 0.32 > 0.05), meaning we find no evidence that the mean age differs between male and female patients. Comparing the group means directly (53.60 vs. 53.92 years) shows only a small difference, and the test confirms that it is not statistically significant.
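To make the Welch test less of a black box, the statistic can be recomputed by hand from the group summaries. The sketch below copies the means from the t.test() output and the standard deviations and group sizes from the describeBy() output. Because the standard deviations are rounded to two decimals there, the results should agree with R only approximately.

```r
# Group summaries copied from the describeBy() and t.test() output above
mean_male <- 53.59799
sd_male <- 21.14
n_male <- 8756
mean_female <- 53.91568
sd_female <- 21.02
n_female <- 8930

# Squared standard errors of the two group means
se2_male <- sd_male^2 / n_male
se2_female <- sd_female^2 / n_female

# Welch t statistic: difference in means divided by its standard error
t_welch <- (mean_male - mean_female) / sqrt(se2_male + se2_female)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_male + se2_female)^2 /
  (se2_male^2 / (n_male - 1) + se2_female^2 / (n_female - 1))

# Two-sided p-value from the t distribution
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)

round(c(t = t_welch, df = df_welch, p = p_welch), 4)
```

The three values should land close to the t, df and p-value reported by t.test().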

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

effectsize::cohens_d(cancer_data2$Age ~ cancer_data2$Gender, 
                     pooled_sd = FALSE)

## Cohen's d |        95% CI
## -------------------------
## -0.02     | [-0.04, 0.01]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.02, rules = "sawilowsky2009")

## [1] "tiny"
## (Rules: sawilowsky2009)

Based on the sample data, we cannot reject the hypothesis that the average age is the same for both genders (p > 0.05). The effect size agrees with this: the difference in age between the two genders is tiny (d = -0.02).
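The effect size can also be approximated by hand. The sketch below assumes that with pooled_sd = FALSE the standardizer is the square root of the average of the two group variances, which is my understanding of the un-pooled version and not something checked against the package source. The inputs are again the rounded group summaries from above.

```r
# Group summaries copied from the output above
mean_male <- 53.59799
sd_male <- 21.14
mean_female <- 53.91568
sd_female <- 21.02

# Assumed un-pooled standardizer: root of the mean of the two variances
sd_unpooled <- sqrt((sd_male^2 + sd_female^2) / 2)

# Cohen's d: mean difference expressed in standard deviation units
d_unpooled <- (mean_male - mean_female) / sd_unpooled

round(d_unpooled, 3)
```

The mean difference of about 0.3 years is very small compared with a standard deviation of about 21 years, which is why the effect is classified as tiny.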

wilcox.test(cancer_data2$Age ~ cancer_data2$Gender, 
            correct = FALSE,
            exact = FALSE,
            alternative = "two.sided")

## 
##  Wilcoxon rank sum test
## 
## data:  cancer_data2$Age by cancer_data2$Gender
## W = 38752218, p-value = 0.3118
## alternative hypothesis: true location shift is not equal to 0

H0: Distribution location of age is the same for male and female.
H1: Distribution location of age is not the same for male and female.

Based on the p-value (p = 0.31 > 0.05), we cannot reject H0 that the distribution location of age is the same for male and female.

library(ggplot2)
ggplot(cancer_data2, aes(x = Age, fill = Gender)) +
  geom_histogram(position = position_dodge(width = 2.5), binwidth = 5, colour = "Black") +
  scale_x_continuous(breaks = seq(15, 90, 5)) + # Age ranges from 18 to 90
  scale_fill_manual(values = c("Male" = "steelblue1", "Female" = "orchid1")) +
  ylab("Frequency") +
  labs(fill = "Gender") # The fill colour encodes Gender, not Age

The function scale_fill_manual allowed me to specify the fill colours manually, so that the colours for male and female match the previous graphs.

library(effectsize)
effectsize(wilcox.test(cancer_data2$Age ~ cancer_data2$Gender, 
           correct = FALSE,
           exact = FALSE,
           alternative = "two.sided"))

## r (rank biserial) |        95% CI
## ---------------------------------
## -8.78e-03         | [-0.03, 0.01]

On the Funder & Ozer (2019) scale, this is considered a tiny effect size.
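The rank-biserial correlation can be recovered directly from the W statistic of the Wilcoxon test. The sketch below uses the standard relation r = 2W / (n1 * n2) - 1, with W and the group sizes copied from the output above. I am assuming that this is the same definition the effectsize package uses.

```r
# W statistic from wilcox.test() and group sizes from describeBy()
W <- 38752218
n_male <- 8756
n_female <- 8930

# W / (n1 * n2) is the share of male-female pairs in which the male is older
# (ties counted as one half); rescaling it to [-1, 1] gives the rank-biserial r
r_rank_biserial <- 2 * W / (n_male * n_female) - 1

signif(r_rank_biserial, 3)
```

A value this close to 0 means that a randomly chosen male patient is almost exactly as likely to be older as to be younger than a randomly chosen female patient.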

RQ1: Conclusion and answer

Of the two tests above, the non-parametric Wilcoxon rank sum test is the more appropriate one for this sample, because the Shapiro-Wilk test rejects normality of Age in both the Male and the Female group (p < 0.001 in each). Strictly speaking, we should therefore skip the parametric test and go straight to the non-parametric alternative. The Wilcoxon rank sum test finds no statistically significant difference in the location of the age distribution between male and female patients in this sample (p = 0.31), and the effect size is tiny (r = -0.00878).

We can conclude that there is no significant difference in the age distribution of cancer patients between male and female patients based on this sample. I would add that this is secondary data, so we cannot be sure how it was collected, in particular whether the sample is random. The conclusion therefore holds only for the data at hand.

RQ1.2 (optional additional check): Is there a significant difference in tumor size between male and female cancer patients?

library(ggplot2)

cancer_female <- ggplot(cancer_data2[cancer_data2$Gender == "Female",  ], aes(x = TumorSize)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 0.5, col = "black", fill = "orchid1") +
  ylab("Frequency") +
  ggtitle("Tumor size distribution of female patients with cancer")

cancer_male <- ggplot(cancer_data2[cancer_data2$Gender == "Male",  ], aes(x = TumorSize)) +
  theme_linedraw() + 
  geom_histogram(binwidth = 0.5, col = "black", fill = "steelblue1") +
  ylab("Frequency") +
  ggtitle("Tumor size distribution of male patients with cancer")

library(ggpubr)
# Two columns, so that both histograms fit on one page
ggarrange(cancer_female, cancer_male,
          ncol = 2, nrow = 1)

RQ2: Is there a significant correlation between tumor size and survival months in cancer patients?

library(psych)
psych::describe(cancer_data2[ , c("TumorSize", "SurvivalMonths")])

##                vars     n  mean    sd median trimmed   mad min max
## TumorSize         1 17686  5.50  2.60    5.5    5.50  3.26   1  10
## SurvivalMonths    2 17686 60.39 34.79   60.0   60.34 44.48   1 120
##                range skew kurtosis   se
## TumorSize          9 0.00    -1.20 0.02
## SurvivalMonths   119 0.01    -1.21 0.26

library(ggplot2)
ggplot(cancer_data2, aes(x = TumorSize, y = SurvivalMonths)) +
  geom_point()

This graphical representation is hard to read, so let’s quantify the linear relationship with a descriptive measure, the correlation coefficient.

cor(cancer_data2$TumorSize, cancer_data2$SurvivalMonths, 
    method = "pearson",
    use = "complete.obs")

## [1] 0.001953289

The coefficient (r = 0.002) is positive but practically zero, so there is essentially no linear relationship between tumor size and the number of survival months after diagnosis.

We nevertheless test the correlation coefficient formally, with the hypotheses:

H0: ρ = 0
H1: ρ ≠ 0

cor.test(cancer_data2$TumorSize, cancer_data2$SurvivalMonths, 
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  cancer_data2$TumorSize and cancer_data2$SurvivalMonths
## t = 0.25975, df = 17684, p-value = 0.7951
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01278508  0.01669081
## sample estimates:
##         cor 
## 0.001953289

We cannot reject the null hypothesis H0 that the correlation coefficient is 0 (p = 0.80 > 0.05).

Based on the statistical tests above we cannot conclude that there is a significant correlation between tumor size and survival months in cancer patients.
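The numbers reported by cor.test() can be reproduced from r and n alone. The sketch below uses the usual t transformation of the correlation coefficient and a Fisher z confidence interval, with r and n copied from the output above. It is meant as an illustration of where the numbers come from, not as a replacement for cor.test().

```r
# Correlation coefficient and sample size from the output above
r <- 0.001953289
n <- 17686

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z transformation
z <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

The confidence interval is narrow and centred almost exactly on 0, so even with 17686 patients the data give no indication of a linear relationship.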

RQ3: Is there an association between smoking status and type of cancer?

# Pearson Chi2 test
results <- chisq.test(cancer_data2$SmokingStatus, cancer_data2$CancerType, 
                      correct = FALSE)

results

## 
##  Pearson's Chi-squared test
## 
## data:  cancer_data2$SmokingStatus and cancer_data2$CancerType
## X-squared = 4.4956, df = 10, p-value = 0.9222

H0: There is no association between smoking status and cancer type.
H1: There is an association between smoking status and cancer type.

Based on the p-value (p = 0.92 > 0.05), we cannot reject H0 that there is no association between the two categorical variables, smoking status and cancer type.

addmargins(results$observed)

##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate  Lung Leukemia  Skin
##                    Smoker     943   972      976   997      970  1002
##                    FormerS    985  1002      977   962      988  1001
##                    NonS      1010   960      985   980      999   977
##                    Sum       2938  2934     2938  2939     2957  2980
##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus   Sum
##                    Smoker   5860
##                    FormerS  5915
##                    NonS     5911
##                    Sum     17686

This table shows the observed (empirical) frequencies in our data set. For example, 985 patients were former smokers with breast cancer. The right-hand column gives the totals for each smoking status, the bottom row gives the totals for each cancer type, and the bottom-right corner gives the total number of units (17686).

We continue with calculating a table of expected frequencies.

#Calculating expected frequencies
round(results$expected, 2)

##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast  Colon Prostate   Lung Leukemia
##                    Smoker  973.46 972.14   973.46 973.80   979.76
##                    FormerS 982.60 981.26   982.60 982.93   988.95
##                    NonS    981.94 980.60   981.94 982.27   988.29
##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus   Skin
##                    Smoker  987.38
##                    FormerS 996.65
##                    NonS    995.97

In this step we calculated the expected frequencies, that is, how many people we would expect to see in each combination of categories if there were no association between cancer type and smoking status.

For example: if there were no association between cancer type and smoking status, we would expect 982.60 former smokers to have breast cancer.
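Each expected frequency is simply the row total times the column total divided by the number of units. The sketch below rebuilds the expected table and the chi-squared statistic by hand from the observed frequencies printed above. The object names are my own and the matrix is typed in from the addmargins() output.

```r
# Observed frequencies typed in from the table above
observed <- matrix(
  c(943, 972, 976, 997, 970, 1002,
    985, 1002, 977, 962, 988, 1001,
    1010, 960, 985, 980, 999, 977),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    SmokingStatus = c("Smoker", "FormerS", "NonS"),
    CancerType = c("Breast", "Colon", "Prostate", "Lung", "Leukemia", "Skin")
  )
)

# Expected frequency = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
dimnames(expected) <- dimnames(observed)

# Example from the text: former smokers with breast cancer
round(expected["FormerS", "Breast"], 2)

# Pearson chi-squared statistic: sum of (O - E)^2 / E over all cells
chi_squared <- sum((observed - expected)^2 / expected)
round(chi_squared, 4)
```

For the example cell this is 5915 * 2938 / 17686, and the statistic should agree with the X-squared value reported by chisq.test().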

These calculations also let us check the assumptions of the test. An association analysis between two categorical variables has two main assumptions:

1. The observations are independent of each other: this holds, since we know that no person is measured more than once.
2. All expected frequencies are greater than 5: the table above shows that this holds.

Since both assumptions hold, the Pearson Chi2 test above is appropriate for testing our hypothesis.

# Calculation of Pearson residuals, (observed - expected) / sqrt(expected)

round(results$residuals, 2)

##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate  Lung Leukemia  Skin
##                    Smoker   -0.98  0.00     0.08  0.74    -0.31  0.47
##                    FormerS   0.08  0.66    -0.18 -0.67    -0.03  0.14
##                    NonS      0.90 -0.66     0.10 -0.07     0.34 -0.60

None of the residuals comes close to 1.96 in absolute value (the largest is 0.98), meaning that no cell deviates significantly from its expected frequency.
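chisq.test() returns two kinds of residuals: Pearson residuals (residuals) and standardized residuals (stdres), where the latter are the ones that are usually compared with ±1.96. The sketch below computes both from the observed table typed in from the output above, as a way of double-checking the conclusion. The object names are my own.

```r
# Observed frequencies typed in from the table above
observed <- matrix(
  c(943, 972, 976, 997, 970, 1002,
    985, 1002, 977, 962, 988, 1001,
    1010, 960, 985, 980, 999, 977),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    SmokingStatus = c("Smoker", "FormerS", "NonS"),
    CancerType = c("Breast", "Colon", "Prostate", "Lung", "Leukemia", "Skin")
  )
)

test <- chisq.test(observed, correct = FALSE)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- test$residuals

# Standardized residuals: additionally adjusted for the row and column margins
std_res <- test$stdres

round(std_res, 2)

# Largest absolute standardized residual, to compare with 1.96
max(abs(std_res))
```

Standardized residuals are somewhat larger in absolute value than Pearson residuals, so checking them is the stricter of the two comparisons.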

addmargins(round(prop.table(results$observed), 3))

##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate  Lung Leukemia  Skin
##                    Smoker   0.053 0.055    0.055 0.056    0.055 0.057
##                    FormerS  0.056 0.057    0.055 0.054    0.056 0.057
##                    NonS     0.057 0.054    0.056 0.055    0.056 0.055
##                    Sum      0.166 0.166    0.166 0.165    0.167 0.169
##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus   Sum
##                    Smoker  0.331
##                    FormerS 0.335
##                    NonS    0.333
##                    Sum     0.999

In this table the proportions of all cells together sum to 1 (0.999 in our case because of rounding). For example, 16.7% of the patients in our data have leukemia and 33.1% are smokers.

addmargins(round(prop.table(results$observed, 1), 3), 2)

##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate  Lung Leukemia  Skin
##                    Smoker   0.161 0.166    0.167 0.170    0.166 0.171
##                    FormerS  0.167 0.169    0.165 0.163    0.167 0.169
##                    NonS     0.171 0.162    0.167 0.166    0.169 0.165
##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus   Sum
##                    Smoker  1.001
##                    FormerS 1.000
##                    NonS    1.000

This table conditions on smoking status, which we can see from the fact that each smoking category sums to 1 (1.001 for smokers because of rounding). The results from this table can be interpreted as follows:

16.9% of former smokers in the sample have skin cancer.

addmargins(round(prop.table(results$observed, 2), 3), 1)

##                           cancer_data2$CancerType
## cancer_data2$SmokingStatus Breast Colon Prostate  Lung Leukemia  Skin
##                    Smoker   0.321 0.331    0.332 0.339    0.328 0.336
##                    FormerS  0.335 0.342    0.333 0.327    0.334 0.336
##                    NonS     0.344 0.327    0.335 0.333    0.338 0.328
##                    Sum      1.000 1.000    1.000 0.999    1.000 1.000

This table conditions on cancer type, which we can see from the bottom row, where each column sums to 1 (0.999 for lung cancer because of rounding). The data from this table can be interpreted as follows: out of all people with lung cancer in this sample, 33.9% are smokers.

library(effectsize)
effectsize::cramers_v(cancer_data2$SmokingStatus, cancer_data2$CancerType)

## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.00              | [0.00, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

interpret_cramers_v(0.00)

## [1] "tiny"
## (Rules: funder2019)

With an adjusted Cramér's V of 0.00, the effect size is tiny: there is practically no association between the cancer type and the smoking status of a person.
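To see why the adjusted value is exactly 0.00, Cramér's V can be computed by hand from the chi-squared statistic. The sketch below shows the unadjusted version and a bias-corrected version. I am assuming that "(adj.)" in the output refers to this kind of bias correction, which I have not verified against the package source.

```r
# Values copied from the chisq.test() output and the table dimensions
chi_squared <- 4.4956
n <- 17686
n_rows <- 3 # smoking status categories
n_cols <- 6 # cancer types

# Unadjusted Cramér's V
v_raw <- sqrt(chi_squared / (n * min(n_rows - 1, n_cols - 1)))

# Bias-corrected version (assumed to correspond to the "adj." output):
# subtract the value of phi^2 expected under independence, floor at 0
phi2 <- chi_squared / n
phi2_adj <- max(0, phi2 - (n_rows - 1) * (n_cols - 1) / (n - 1))
rows_adj <- n_rows - (n_rows - 1)^2 / (n - 1)
cols_adj <- n_cols - (n_cols - 1)^2 / (n - 1)
v_adj <- sqrt(phi2_adj / min(rows_adj - 1, cols_adj - 1))

round(c(raw = v_raw, adjusted = v_adj), 4)
```

The chi-squared statistic (4.50) is smaller than its degrees of freedom (10), so the observed association is weaker than what chance alone would produce on average, and the correction floors the estimate at 0.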

Veronika Avbar

2025-01-13
