Assignment 4 Document

Load the necessary libraries: tidyverse, ggpubr, survival, survminer

Here I’m going to load the packages I need for the assignment.

library(tidyverse)
library(ggpubr)
library(survival)
library(survminer)
library(readr)

Import lung-cancer.csv with RStudio

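When I import the csv file, I’m going to skip the first column, and I’m leaving the variable types as they are for now (no conversion to factors). I named the dataset lung because the rest of the code refers to it by that name.

lung <- read_csv("lung-cancer.csv", col_types = cols(X1 = col_skip()))
lung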

## # A tibble: 228 x 10
##     inst  time status   age   sex ph.ecog ph.karno pat.karno meal.cal wt.loss
##    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
##  1     3   306      2    74     1       1       90       100     1175      NA
##  2     3   455      2    68     1       0       90        90     1225      15
##  3     3  1010      1    56     1       0       90        90       NA      15
##  4     5   210      2    57     1       1       90        60     1150      11
##  5     1   883      2    60     1       0      100        90       NA       0
##  6    12  1022      1    74     1       1       50        80      513       0
##  7     7   310      2    68     2       2       70        60      384      10
##  8    11   361      2    71     2       2       60        80      538       1
##  9     1   218      2    53     1       1       70        80      825      16
## 10     7   166      2    61     1       2       70        70      271      34
## # … with 218 more rows

Begin to interpret the data

I filtered the observations to keep only the females (sex equal to 2) and assigned the result to a new variable.

lungfemale <- lung %>% filter(sex == 2)

I summarized the resulting dataset (females only) by calculating the average age and the average meal calories. Because meal.cal contains missing values, I passed na.rm = TRUE as the second argument to mean() to drop the NAs.

lungfemale %>% summarize(average = mean(age))

##    average
## 1 61.07778

lungfemale %>% summarize(average = mean(meal.cal, na.rm=TRUE))

##    average
## 1 840.7015

The average age for females is 61.1 years and the average meal calories is 841 calories.

Then, I calculated the average survival time in days for both status groups.

censored <- lung %>% filter(status == 1) %>% summarize(average = mean(time))
dead <- lung %>% filter(status == 2) %>% summarize(average = mean(time))

The average survival time for the censored group is 363 days and the average survival time for the dead group is 283 days.
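The two filter-and-summarize calls above could also be written as one grouped summary. Here is a minimal sketch on a small made-up tibble (the name toy and all of its values are assumptions for illustration, not the lung data):

```r
library(dplyr)

# Hypothetical toy data: status 1 = censored, 2 = dead (values are made up)
toy <- tibble(
  time   = c(100, 300, 200, 400, 600),
  status = c(1, 1, 2, 2, 2)
)

# One grouped summary instead of two separate filter() calls
avg_time <- toy %>%
  group_by(status) %>%
  summarize(average = mean(time))

avg_time
```

This returns one row per status group, so both averages can be read from a single table.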

Next, I calculated how many observations were in the dataset by status group.

lung %>% group_by(status) %>% tally()

## # A tibble: 2 x 2
##   status     n
##    <dbl> <int>
## 1      1    63
## 2      2   165

I also calculated how many observations there were by status and sex.

lung %>% group_by(status, sex) %>% tally()

## # A tibble: 4 x 3
## # Groups:   status [2]
##   status   sex     n
##    <dbl> <dbl> <int>
## 1      1     1    26
## 2      1     2    37
## 3      2     1   112
## 4      2     2    53

I added a column to the dataset that is a result of the difference between the patient and physician Karnofsky ratings. The column was called “karnodiff”.

lung <- lung %>% mutate(karnodiff = ph.karno-pat.karno)

I then took the absolute value of karnodiff so that it only contains positive numbers.

lung <- lung %>% mutate(karnodiff = abs(ph.karno-pat.karno))

Using those values, I calculated the mean of karnodiff. To drop the NAs, I passed na.rm = TRUE as the second argument to mean().

lung %>% summarize(average=mean(karnodiff, na.rm=TRUE))

##    average
## 1 10.58036

The mean of karnodiff [the difference between the patient and physician Karnofsky ratings] is 10.6 points.
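To illustrate why the absolute value matters here, the sketch below compares the mean of the signed difference with the mean of the absolute difference. The tibble toy and its ratings are made up for illustration and are not taken from the lung data:

```r
library(dplyr)

# Hypothetical ratings (made-up values, one missing patient rating)
toy <- tibble(
  ph.karno  = c(90, 70, 80, 60),
  pat.karno = c(80, 90, 80, NA)
)

toy <- toy %>%
  mutate(
    signed_diff = ph.karno - pat.karno,      # 10, -20, 0, NA
    karnodiff   = abs(ph.karno - pat.karno)  # 10,  20, 0, NA
  )

# Signed differences partly cancel out; absolute differences do not
res <- toy %>% summarize(
  mean_signed = mean(signed_diff, na.rm = TRUE),  # (10 - 20 + 0) / 3
  mean_abs    = mean(karnodiff, na.rm = TRUE)     # (10 + 20 + 0) / 3
)

res
```

With signed differences, disagreements in opposite directions offset each other, so the mean can look small even when the two raters often disagree.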

I ran an independent-samples t-test to assess whether mean calories consumed differed significantly between males and females. By default, t.test() runs the Welch version, which does not assume equal variances.

t.test(lung$meal.cal ~ lung$sex)

## 
##  Welch Two Sample t-test
## 
## data:  lung$meal.cal by lung$sex
## t = 2.3533, df = 151.16, p-value = 0.01989
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   22.43394 257.25080
## sample estimates:
## mean in group 1 mean in group 2 
##        980.5439        840.7015

With a p-value of 0.01989, which is below the alpha level of 0.05, there is a statistically significant difference in mean calories consumed between the sexes (about 981 calories for group 1 versus 841 calories for group 2).
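The comparison of the p-value with alpha can also be done in code instead of reading it off the printout. This is a rough sketch on made-up numbers (the data frame toy, its values, and the name alpha are assumptions for illustration):

```r
# Hypothetical calorie values for two groups (made up for illustration)
toy <- data.frame(
  meal.cal = c(1000, 1100, 950, 1200, 800, 850, 900, 700),
  sex      = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# t.test() runs the Welch (unequal-variance) test unless var.equal = TRUE
result <- t.test(meal.cal ~ sex, data = toy)

alpha <- 0.05
result$method            # name of the test that was run
result$p.value < alpha   # TRUE when the difference is significant at alpha
```

The object returned by t.test() is a list, so the p-value, the confidence interval (result$conf.int), and the group means (result$estimate) can each be pulled out by name.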

Survival Analysis Plot

I copied the following code into the console to calculate a model for survival analysis.

fit <- survfit(Surv(time, status) ~ sex, data = lung)

Next, I used the ggsurvplot() function from the survminer package to draw a Kaplan-Meier Curve.

ggsurvplot(fit, data = lung)

Then I customized the plot by adding the p-value and the p-value method, setting the line size to 2, hiding the confidence interval, using theme_classic() as the ggtheme, and leaving out the risk table.

ggsurvplot(fit, data = lung, pval = TRUE, pval.method = TRUE, size = 2,
           conf.int = FALSE, ggtheme = theme_classic(), risk.table = FALSE)
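The p-value that ggsurvplot() prints with pval = TRUE comes from a log-rank test by default, and the same test can be run directly with survdiff() from the survival package. The sketch below assumes that the lung dataset bundled with survival has the same time, status, and sex columns as lung-cancer.csv; the names lr and p_value are mine:

```r
library(survival)

# Assumption: survival's bundled `lung` data stands in for lung-cancer.csv
lr <- survdiff(Surv(time, status) ~ sex, data = survival::lung)
lr  # observed vs expected events per group and the chi-square statistic

# p-value from the chi-square statistic (groups - 1 degrees of freedom)
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```

Running the test separately is a way to check the number shown on the plot.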

Histogram Plot

I used the gghistogram() function from the ggpubr package to create a histogram of the continuous variable “meal.cal”.

gghistogram(lung, x = "meal.cal")

Next, I customized the histogram by:

adding a dotted line for the mean

gghistogram(lung, x = "meal.cal", add = "mean")

adding the title “Calories Consumed at Meals”, labeling the x axis “Calories”, and hiding the y axis label

gghistogram(lung, x = "meal.cal",
            add = "mean", title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE)

grouping by sex so that the histogram shows the frequency of calories by sex

gghistogram(lung,
            x = "meal.cal", add = "mean",
            title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE, fill = "sex")

Next, I converted sex to a factor using the as.factor() function.

lung$sex <- as.factor(lung$sex)

I made sure that the “sex” variable is a factor now with 2 levels using the str() and levels() to look at the structure and the levels of the variable.
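The check described above is not shown in the document, so the following is only a sketch of what it might look like. It uses a short made-up vector named sex rather than the real lung$sex column:

```r
# Hypothetical stand-in for the sex column (1 = male, 2 = female)
sex <- c(1, 2, 2, 1, 2)
sex <- as.factor(sex)

str(sex)      # structure: reports a factor and its levels
levels(sex)   # level labels, stored as character strings
nlevels(sex)  # number of levels
```

On the real data, the same calls would be str(lung$sex) and levels(lung$sex).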

I reused the previous code for gghistogram(). Because sex is now a factor instead of a continuous variable, the plot recognizes each individual as either male or female.

Then I added a palette argument to color the two groups green and blue.

gghistogram(lung, x = "meal.cal",
            add = "mean", title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE,
            fill = "sex", palette = c("green", "blue"))

Karnofsky Performance Score Scatterplot

First, I found the correlation between the Karnofsky ratings done by the patient and the physician with the cor() function from the stats package. I dropped the NAs by setting the use argument to "complete.obs".

cor(lung$pat.karno, lung$ph.karno, use = "complete.obs")

## [1] 0.5202974

The r value is 0.5203, indicating a moderate positive correlation.
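To show what use = "complete.obs" does, here is a small sketch with made-up ratings (the vectors pat and ph and their values are assumptions for illustration):

```r
# Hypothetical ratings with one missing value in each vector
pat <- c(90, 80, NA, 70, 60)
ph  <- c(80, 80, 70, NA, 50)

# Without `use`, any NA makes the result NA
r_default <- cor(pat, ph)

# "complete.obs" keeps only the pairs where both values are present
r <- cor(pat, ph, use = "complete.obs")

# Same result as dropping the incomplete pairs by hand
keep <- complete.cases(pat, ph)
r_manual <- cor(pat[keep], ph[keep])
```

Here only three of the five pairs are complete, so the correlation is computed from those three.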

Then, I created a basic scatterplot of the Karnofsky ratings done by the patient (y axis) and the physician (x axis) with the ggscatter() function from the ggpubr package.

ggscatter(lung, x = "ph.karno", y = "pat.karno")

Next, I customized the scatterplot by:

adding color by sex
putting a title of “Correlation Between Karnofsky Performance Score Done by Physicians & Patients”
labeling the x axis “Score by Physician”
labeling the y axis “Score by Patient”
adding a linear regression line
modifying the color of the regression line to light blue
adding the confidence intervals
adding the group mean point to the plot
changing the group mean points to a size of 5
adding the “spearman” correlation coefficient to the plot

ggscatter(lung, x = "ph.karno", y = "pat.karno",
          color = "sex",
          title = "Correlation Between Karnofsky Performance \nScore Done by Physicians & Patients",
          xlab = "Score by Physician",
          ylab = "Score by Patient",
          add = "reg.line", add.params = list(color = "lightblue"),
          conf.int = TRUE, ellipse.type = "confidence",
          mean.point = TRUE, mean.point.size = 5,
          cor.coef = TRUE,
          cor.coeff.args = list(method = "spearman",
                                label.x.npc = "middle", label.y.npc = "bottom"))

Scatterplot of Weight Loss Versus Calories Consumed

First, I created a basic scatterplot of the calories consumed (x axis) and weight loss (y axis).

ggscatter(lung, x = "meal.cal", y="wt.loss")

Then I counted the number of people who consume more than 2000 calories. These 5 people are outliers, so I removed them by filtering the dataset to keep only the rows where meal.cal < 2000. Note that filter() also drops the rows where meal.cal is NA.

lung <- filter(lung, meal.cal < 2000)
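The counting step is described but not shown, so this is a minimal sketch of one way to do it, together with the effect of the filter on missing values. The tibble toy and its values are made up for illustration:

```r
library(dplyr)

# Hypothetical calorie values (made up), including one missing value
toy <- tibble(meal.cal = c(500, 1200, 2400, NA, 1999, 2600))

# Count the rows above the cutoff; the NA row is not counted
n_high <- toy %>% filter(meal.cal > 2000) %>% nrow()

# Keep the rows below the cutoff; filter() also drops the NA row
kept <- toy %>% filter(meal.cal < 2000)

n_high
nrow(kept)
```

In this toy example, 2 rows are above the cutoff, but the filtered data has 3 rows rather than 4 because the NA row is removed as well.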

I recreated the scatterplot without the outliers and customized the scatterplot by:

adding color by sex
putting a title of “Correlation Between Calories Consumed and Meals and Weight Loss in the Last 6 Months, by Sex”
labeling the x axis “Calories Consumed”
labeling the y axis “Weight Loss”
adding a linear regression line
adding an ellipse around the data points
adding the confidence intervals
adding the group mean point to the plot

ggscatter(lung,
          x = "meal.cal", y = "wt.loss",
          color = "sex",
          title = "Correlation Between Calories Consumed and Meals \n and Weight Loss in the Last 6 Months, by Sex",
          xlab = "Calories Consumed",
          ylab = "Weight Loss",
          add = "reg.line", ellipse = TRUE,
          conf.int = TRUE, mean.point = TRUE)