Lung Cancer Analysis Using R

Executive Summary

This assignment is part of the Alabama College of Osteopathic Medicine certificate course “Introduction to Data Science with R for Public Health” by Emmanuel Segui, MS.

Its purpose is to analyze the Lung Cancer Data by the North Central Cancer Treatment Group by applying all the tools and techniques learned in the course to real-life data.

As such, this page demonstrates familiarity with R’s syntax and package ecosystem for importing, tidying, manipulating, modeling, and visualizing the aforementioned data set.

Required Resources

The following 3 main resources are needed in order to complete this analysis: ggpubr, tidyverse & survminer.
Other R packages designed for epidemiologists: survival, epiR & epitools.
Lung Cancer dataset in CSV format.

Analysis

1. Loading of necessary libraries

We begin by loading all of the necessary libraries into R. These were previously installed using the install.packages function.

library(tidyverse)
library(ggpubr)
library(survival)
library(survminer)

2. Import of “lung-cancer” CSV File Using RStudio

While importing the file, we omit the first column as it merely provides an ascending number of observations. Factors like sex or status are left unchanged for now.

lung <- read_csv("lung-cancer.csv", col_types = cols(X1 = col_skip()))
lung

## # A tibble: 228 x 10
##     inst  time status   age   sex ph.ecog ph.karno pat.karno meal.cal wt.loss
##    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
##  1     3   306      2    74     1       1       90       100     1175      NA
##  2     3   455      2    68     1       0       90        90     1225      15
##  3     3  1010      1    56     1       0       90        90       NA      15
##  4     5   210      2    57     1       1       90        60     1150      11
##  5     1   883      2    60     1       0      100        90       NA       0
##  6    12  1022      1    74     1       1       50        80      513       0
##  7     7   310      2    68     2       2       70        60      384      10
##  8    11   361      2    71     2       2       60        80      538       1
##  9     1   218      2    53     1       1       70        80      825      16
## 10     7   166      2    61     1       2       70        70      271      34
## # … with 218 more rows

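Note: alternatively, the file can be imported without col_skip() and the first column dropped afterwards with select(). Use one approach or the other; running both would also remove the inst column: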

lung <- lung %>% select(-1)
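The positional approach can be tried without the original file. The sketch below is only an illustration: it assumes the CSV has the same contents as the lung data set shipped with the survival package, writes that data to a temporary file with a leading row-number column, and reads it back. The name lung_demo is hypothetical and is used to avoid overwriting lung.

```r
library(readr)
library(dplyr)

# Stand-in for "lung-cancer.csv" (assumption: same contents as survival::lung).
# write.csv() adds the row numbers as an unnamed first column.
tmp <- tempfile(fileext = ".csv")
write.csv(survival::lung, tmp)

# Read everything, then drop the first column by position rather than by name,
# so the code does not depend on how readr names the unnamed column.
raw <- suppressMessages(read_csv(tmp))
lung_demo <- raw %>% select(-1)

dim(lung_demo)    # rows and columns after dropping the index column
names(lung_demo)  # should start with "inst"
```

Dropping by position sidesteps the fact that different readr versions give the unnamed column different names.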

3. Filtering of Observations by Female Cases Only

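We assign the subset containing only females (sex coded as 2) to a new variable. Since sex is stored as a number, we compare it with the number 2 rather than the string "2":

lungfemale <- lung %>% filter(sex == 2)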

4. Summary of Resulting Dataset

We calculate the average “age”:

lungfemale %>% summarize(average = mean(age))

## # A tibble: 1 x 1
##   average
##     <dbl>
## 1    61.1

And the average “meal calories”:

lungfemale %>% summarize(average = mean(meal.cal, na.rm=TRUE))

## # A tibble: 1 x 1
##   average
##     <dbl>
## 1    841.

Note: na.rm=TRUE is added as the second argument in the “mean()” function to drop the NAs (not available values) in the file.

Among females, the average age is 61.1 years and the average calorie intake at meals is 841 calories.
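Both averages can also be computed in a single summarize() call. This is a minimal sketch, assuming the CSV matches survival::lung; lung_demo, female_summary and the column names inside summarize() are hypothetical names chosen for illustration.

```r
library(dplyr)

lung_demo <- survival::lung  # assumption: same data as the CSV

female_summary <- lung_demo %>%
  filter(sex == 2) %>%                         # females only
  summarize(
    n             = n(),                       # number of female patients
    mean_age      = mean(age),
    mean_meal_cal = mean(meal.cal, na.rm = TRUE),
    missing_cal   = sum(is.na(meal.cal))       # how many values na.rm dropped
  )

female_summary
```

Reporting the number of missing values next to the mean makes it clear how many observations the average is actually based on.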

5. Average Survival Time in Days for Both Status Groups

Per NCCTG Lung Cancer Data, status: censoring status 1 = censored, 2 = dead.

We first keep only the censored observations (status 1) and average their time:

censored <- lung %>% filter(status == 1) %>% summarize(average = mean(time))

The average follow-up time for censored patients is 363 days.

We then do the same for patients who died (status 2):

dead <- lung %>% filter(status == 2) %>% summarize(average = mean(time))

The average survival time for patients who died is 283 days.
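The two averages can also be obtained in one pass by grouping on status. This is a sketch under the assumption that the CSV matches survival::lung; lung_demo and time_by_status are hypothetical names.

```r
library(dplyr)

lung_demo <- survival::lung  # assumption: same data as the CSV

time_by_status <- lung_demo %>%
  group_by(status) %>%                 # 1 = censored, 2 = dead
  summarize(
    n       = n(),
    average = mean(time)               # mean time in days per group
  )

time_by_status
```

Grouping avoids repeating the filter for every status value and keeps both results in one table.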

6. Observations per Status Group

We group the data by status with group_by() and then count the observations per group using the tally() function.

lung %>% group_by(status) %>% tally()

## # A tibble: 2 x 2
##   status     n
##    <dbl> <int>
## 1      1    63
## 2      2   165

There are 63 observations for status group “1” (censored) and 165 observations for status group “2” (dead).

7. Observations by “Status” and “Sex”

The group_by() function accepts several variables separated by commas:

lung %>% group_by(status, sex) %>% tally()

## # A tibble: 4 x 3
## # Groups:   status [2]
##   status   sex     n
##    <dbl> <dbl> <int>
## 1      1     1    26
## 2      1     2    37
## 3      2     1   112
## 4      2     2    53
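As a shorthand, dplyr's count() combines group_by() and tally() in one call. The sketch below assumes the CSV matches survival::lung; lung_demo and counts are hypothetical names.

```r
library(dplyr)

lung_demo <- survival::lung  # assumption: same data as the CSV

# count() is equivalent to group_by(status, sex) %>% tally()
counts <- lung_demo %>% count(status, sex)
counts

# The same numbers as a base R cross-tabulation (status in rows, sex in columns)
xtabs(~ status + sex, data = lung_demo)
```

The cross-tabulation is often easier to read when there are only two grouping variables.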

8. Karnofsky Ratings Differential

This step computes the difference between the Karnofsky ratings given by the physician and by the patient. We begin by adding a column called “karnodiff”:

lung <- lung %>% mutate(karnodiff = ph.karno-pat.karno)

As the difference sometimes yields a negative value, we calculate the absolute value of karnodiff with the abs() function:

lung <- lung %>% mutate(karnodiff = abs(ph.karno-pat.karno))

The resulting “karnodiff” column only consists of positive numbers.

9. Ratings Differential Between Patients and Physicians

We calculate mean of karnodiff with na.rm=TRUE as second argument (to drop the NAs).

lung %>% summarize(average = mean(karnodiff, na.rm=TRUE))

## # A tibble: 1 x 1
##   average
##     <dbl>
## 1    10.6

The mean difference of the ratings is 10.6.
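Steps 8 and 9 can be chained into one pipeline. This is a minimal sketch, assuming the CSV matches survival::lung; lung_demo, karno_summary and the summary column names are hypothetical.

```r
library(dplyr)

lung_demo <- survival::lung  # assumption: same data as the CSV

karno_summary <- lung_demo %>%
  mutate(karnodiff = abs(ph.karno - pat.karno)) %>%  # absolute rating gap
  summarize(
    mean_abs_diff = mean(karnodiff, na.rm = TRUE),
    min_diff      = min(karnodiff, na.rm = TRUE),    # should never be negative
    n_missing     = sum(is.na(karnodiff))            # rows with a missing rating
  )

karno_summary
```

Checking the minimum is a quick way to confirm that abs() removed all negative differences.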

10. Difference Between Calories Consumed by Males and Females

To test this, we use an independent two-sample t-test with sex as the grouping factor (R’s t.test() applies the Welch version by default):

t.test(lung$meal.cal ~ lung$sex)

## 
##  Welch Two Sample t-test
## 
## data:  lung$meal.cal by lung$sex
## t = 2.3533, df = 151.16, p-value = 0.01989
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   22.43394 257.25080
## sample estimates:
## mean in group 1 mean in group 2 
##        980.5439        840.7015

The mean values for males and females are 980.5 and 840.7 calories, respectively. The p-value is 0.01989, which is below the conventional 0.05 threshold, so we reject the null hypothesis that the two groups have the same mean calorie intake.
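The same test can be written with the data argument, and its components extracted from the result instead of being read off the printout. The sketch assumes the CSV matches survival::lung; lung_demo and tt are hypothetical names.

```r
lung_demo <- survival::lung  # assumption: same data as the CSV

# Formula interface with a data argument; Welch's test is the default
tt <- t.test(meal.cal ~ sex, data = lung_demo)

tt$estimate   # group means (1 = male, 2 = female)
tt$conf.int   # 95% confidence interval for the difference in means
tt$p.value    # p-value of the two-sided test
```

Extracting the values this way is convenient when they need to be reused in text or tables.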

11. Model for Survival Analysis

We use the following syntax for our model:

fit <- survfit(Surv(time, status) ~ sex, data = lung)

Next, we use the ggsurvplot() function from survminer to draw the Kaplan-Meier curves:

ggsurvplot(fit, data = lung)

We then customize it further by displaying the p-value and the name of the test used (pval.method), setting the line size to 2, hiding the confidence intervals, applying theme_classic() as the ggtheme, and omitting the risk table:

ggsurvplot(
  fit,
  data = lung,
  pval = TRUE,
  pval.method = TRUE,
  size = 2,
  conf.int = FALSE,
  ggtheme = theme_classic(),
  risk.table = FALSE
)

Note: the test used to calculate p-value is the log-rank test.
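The log-rank test can also be run directly with survdiff() from the survival package, which shows where a p-value of this kind comes from. This is a sketch using the lung data bundled with survival (assumed to match the CSV); lr and p_value are hypothetical names.

```r
library(survival)

# Log-rank test comparing the survival curves of the two sexes
lr <- survdiff(Surv(time, status) ~ sex, data = lung)
lr

# p-value from the chi-square statistic, with (number of groups - 1) degrees of freedom
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```

A small p-value here indicates that the survival curves for males and females differ.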

12. Histogram of the Continuous Variable “meal.cal”

We start with a basic histogram:

gghistogram(lung,x = "meal.cal")

We customize it by adding a mean dotted line:

gghistogram(lung,x = "meal.cal", add = "mean")

We then add a title “Calories Consumed at Meals” with the x axis label as “Calories”, while hiding the Y axis label:

gghistogram(lung,
            x = "meal.cal", add = "mean",
            title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE
            )

Finally, we group by sex so that the histogram shows the distribution of calories for each sex (by adding fill = "sex" as an argument to gghistogram()):

gghistogram(lung,
            x = "meal.cal", add = "mean",
            title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE,
            fill = "sex"
            )

We notice that ‘sex’ is a continuous variable in the legend as the values are in numeric format.

We remedy this by converting ‘sex’ into a factor:

lung$sex <- as.factor(lung$sex)

We verify that “sex” variable is a factor now with 2 levels using str() and levels():

str(lung$sex)

##  Factor w/ 2 levels "1","2": 1 1 1 1 1 1 2 2 1 1 ...

levels(lung$sex)

## [1] "1" "2"

Using the previous code for gghistogram(), we see that both sexes are now properly represented as stacked histograms, with a correct legend.

We can now finalize the histogram with some color by adding a palette argument for green and blue, c("green", "blue"):

gghistogram(lung,
            x = "meal.cal", add = "mean",
            title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE,
            fill = "sex", palette = c("green", "blue")
            )
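As an optional refinement, the factor levels can be given readable labels so that legends show names instead of the codes 1 and 2. The sketch assumes the CSV matches survival::lung and uses the coding from the data set's documentation (1 = male, 2 = female); lung_demo is a hypothetical name.

```r
lung_demo <- survival::lung  # assumption: same data as the CSV

# Convert the numeric codes into a labelled factor (1 = Male, 2 = Female)
lung_demo$sex <- factor(lung_demo$sex,
                        levels = c(1, 2),
                        labels = c("Male", "Female"))

levels(lung_demo$sex)   # the new labels
table(lung_demo$sex)    # patients per sex
```

Plots that use fill = "sex" or color = "sex" would then pick up these labels in the legend.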

13. Correlation Between the Karnofsky Ratings

We use the cor() function from the stats package (part of base R) to look at the correlation between the Karnofsky ratings given by the patient and the physician. By default it computes the Pearson coefficient. We get rid of NAs by using complete observations only:

cor(lung$pat.karno, lung$ph.karno, use = "complete.obs")

## [1] 0.5202974

The correlation coefficient is 0.5202974.
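For comparison, the sketch below computes both the Pearson and the Spearman coefficients and adds a significance test with cor.test(). It assumes the CSV matches survival::lung; lung_demo, pearson, spearman and ct are hypothetical names.

```r
lung_demo <- survival::lung  # assumption: same data as the CSV

# Pearson (the default) and Spearman (rank-based) correlation
pearson  <- cor(lung_demo$pat.karno, lung_demo$ph.karno,
                use = "complete.obs")
spearman <- cor(lung_demo$pat.karno, lung_demo$ph.karno,
                use = "complete.obs", method = "spearman")

# cor.test() drops incomplete pairs itself and reports a p-value
ct <- cor.test(lung_demo$pat.karno, lung_demo$ph.karno)

c(pearson = pearson, spearman = spearman)
ct$p.value
```

Since the Karnofsky scores take only a few distinct values, the rank-based Spearman coefficient is a reasonable companion to the Pearson value.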

14. Scatterplot of Karnofsky Ratings

We start with a basic ggscatter() from ggpubr with patient on the y axis and the physician on the x axis:

ggscatter(lung, x = "ph.karno", y = "pat.karno")

We further customize the scatterplot:
1. Color the points by sex
2. Set the title to “Correlation Between Karnofsky Performance Score Done by Physicians & Patients”
3. Set the x axis label to “Score by Physician”
4. Set the y axis label to “Score by Patient”
5. Add a linear regression line
6. Change the color of the regression line to light blue (color = "lightblue")
7. Add the confidence intervals
8. Add the group mean point to the plot
9. Set the size of the group mean points to 5
10. Add the Spearman correlation coefficient to the plot

ggscatter(lung,
          x = "ph.karno",
          y = "pat.karno",
          color = "sex",
          title = "Correlation Between Karnofsky Performance \nScore Done by Physicians & Patients",
          xlab = "Score by Physician",
          ylab = "Score by Patient",
          add = "reg.line",
          add.params = list(color = "lightblue"),
          conf.int = TRUE,
          ellipse.type = "confidence",
          mean.point = TRUE,
          mean.point.size = 5,
          cor.coef = TRUE,
          cor.coeff.args = list(method = "spearman", label.x.npc = "middle", label.y.npc = "bottom")
          )
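To see the numbers behind a regression line like the one in the plot, the model can be fitted directly with lm(). This is only a sketch: it assumes the plotted line is an ordinary least-squares fit of the patient score on the physician score, ignores sex, and uses survival::lung as a stand-in for the CSV. The names lung_demo and fit_lm are hypothetical.

```r
lung_demo <- survival::lung  # assumption: same data as the CSV

# Least-squares fit of the patient rating on the physician rating;
# lm() drops rows with missing values by default
fit_lm <- lm(pat.karno ~ ph.karno, data = lung_demo)

coef(fit_lm)     # intercept and slope of the line
confint(fit_lm)  # 95% confidence intervals for both coefficients
```

A positive slope is consistent with the positive correlation found in the previous section.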

15. Scatterplot of Calories Consumed (x axis) and Weight Loss (y axis)

We begin with a basic scatterplot using the ggscatter() function:

ggscatter(lung, x = "meal.cal", y="wt.loss")

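There are 5 people who consume more than 2000 calories. We treat them as outliers and remove them from the dataset with the filter() function. Note that filter() also drops the rows where meal.cal is NA, because the comparison evaluates to NA for those rows: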

lung <- filter(lung, meal.cal < 2000)
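The effect of this filter on missing values can be checked explicitly. The sketch below assumes the CSV matches survival::lung and compares the filter used above with a variant that keeps the rows where meal.cal is missing; lung_demo, dropped_na and kept_na are hypothetical names.

```r
library(dplyr)

lung_demo <- survival::lung  # assumption: same data as the CSV

n_missing <- sum(is.na(lung_demo$meal.cal))             # rows without a calorie value
n_high    <- sum(lung_demo$meal.cal >= 2000, na.rm = TRUE)  # rows at or above the cutoff

# Same condition as in the text: rows with NA in meal.cal are dropped as well
dropped_na <- filter(lung_demo, meal.cal < 2000)

# Variant that removes only the high values and keeps the NA rows
kept_na <- filter(lung_demo, is.na(meal.cal) | meal.cal < 2000)

c(original = nrow(lung_demo),
  dropped_na = nrow(dropped_na),
  kept_na = nrow(kept_na))
```

Which variant is appropriate depends on whether later steps need the patients with missing calorie data; for the calorie scatterplot they would not be plotted either way.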

With outliers removed, we customize our scatterplot:
1. Color the points by sex
2. Set the title to “Correlation Between Calories Consumed at Meals and Weight Loss in the Last 6 Months, by Sex”
3. Set the x axis label to “Calories Consumed”
4. Set the y axis label to “Weight Loss”
5. Add a linear regression line
6. Add an ellipse around the data points
7. Add the confidence intervals
8. Add the group mean point to the plot

ggscatter(lung,
          x = "meal.cal",
          y = "wt.loss",
          color = "sex",
          title = "Correlation Between Calories Consumed at Meals \n and Weight Loss in the Last 6 Months, by Sex",
          xlab = "Calories Consumed",
          ylab = "Weight Loss",
          add = "reg.line",
          ellipse = TRUE,
          conf.int = TRUE,
          mean.point = TRUE
          )

Contact

Thank you for taking the time to look at this page.

Please feel free to contact me at steven.altiner@gmail.com if you have any questions.