Load the necessary libraries: tidyverse, ggpubr, survival, survminer

Here I’m going to load the packages I need for the assignments.
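
A minimal sketch of this step, assuming all four packages are already installed (the install line is left commented out):

```r
# One-time install, if needed:
# install.packages(c("tidyverse", "ggpubr", "survival", "survminer"))

library(tidyverse)  # dplyr, readr, ggplot2 and friends
library(ggpubr)     # gghistogram(), ggscatter()
library(survival)   # Surv(), survfit()
library(survminer)  # ggsurvplot()
```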

Import lung-cancer.csv with RStudio

When I import the CSV file, I skip the first column and leave the categorical variables (such as sex) as numeric instead of converting them to factors for now.
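
The import was done through RStudio, so the code is not shown. A rough equivalent in code might look like the sketch below. Because lung-cancer.csv is not available here, it assumes the file has the same contents as the `lung` data shipped with the survival package, with row numbers as an unnamed first column; `csv_path` is a hypothetical temporary stand-in for the real file.

```r
library(readr)
library(dplyr)

# Assumption: write survival::lung to a temporary CSV to stand in for
# lung-cancer.csv; row.names = TRUE produces the extra first column.
csv_path <- tempfile(fileext = ".csv")
write.csv(survival::lung, csv_path, row.names = TRUE)

# Read every column with guessed types (all numeric here, no factors),
# then drop the first column by position.
lung <- read_csv(csv_path, col_types = cols()) %>%
  select(-1)

lung
```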

## # A tibble: 228 x 10
##     inst  time status   age   sex ph.ecog ph.karno pat.karno meal.cal wt.loss
##    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
##  1     3   306      2    74     1       1       90       100     1175      NA
##  2     3   455      2    68     1       0       90        90     1225      15
##  3     3  1010      1    56     1       0       90        90       NA      15
##  4     5   210      2    57     1       1       90        60     1150      11
##  5     1   883      2    60     1       0      100        90       NA       0
##  6    12  1022      1    74     1       1       50        80      513       0
##  7     7   310      2    68     2       2       70        60      384      10
##  8    11   361      2    71     2       2       60        80      538       1
##  9     1   218      2    53     1       1       70        80      825      16
## 10     7   166      2    61     1       2       70        70      271      34
## # … with 218 more rows

Begin to interpret the data

I filtered the observations to keep only females and assigned the result to a new variable.

I summarized the resulting dataset (with females only) by calculating the average age and average meal calories. To drop the NAs, I passed “na.rm=TRUE” as the second argument to the mean() function.

##    average
## 1 61.07778
##    average
## 1 840.7015

The average age for females is 61.1 years and the average meal calories is 841 calories.
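
A sketch of how these two summaries could be computed, assuming `survival::lung` stands in for the imported CSV and that sex is coded 1 = male, 2 = female. The names `females`, `avg_age`, and `avg_cal` are hypothetical.

```r
library(dplyr)

lung <- survival::lung  # assumption: same data as lung-cancer.csv

# Keep only the female patients (sex == 2).
females <- lung %>% filter(sex == 2)

# na.rm = TRUE drops missing values before averaging.
avg_age <- females %>% summarise(average = mean(age, na.rm = TRUE))
avg_cal <- females %>% summarise(average = mean(meal.cal, na.rm = TRUE))

avg_age
avg_cal
```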


Then, I calculated the average survival time in days for both status groups (1 = censored, 2 = dead).

The average survival time for the censored group is 363 days and the average survival time for the dead group is 283 days.
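
One way to get these group averages, sketched with `survival::lung` as an assumed stand-in for the imported data (`avg_time` is a hypothetical name):

```r
library(dplyr)

lung <- survival::lung  # assumption: same data as lung-cancer.csv

# Mean of the time column within each status group
# (1 = censored, 2 = dead).
avg_time <- lung %>%
  group_by(status) %>%
  summarise(average = mean(time))

avg_time
```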


Next, I calculated how many observations were in the dataset by status group.

## # A tibble: 2 x 2
##   status     n
##    <dbl> <int>
## 1      1    63
## 2      2   165

I also calculated how many observations there were by status and sex.

## # A tibble: 4 x 3
## # Groups:   status [2]
##   status   sex     n
##    <dbl> <dbl> <int>
## 1      1     1    26
## 2      1     2    37
## 3      2     1   112
## 4      2     2    53
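
Both count tables can be reproduced with a group_by() plus summarise() pattern. This is a sketch that assumes `survival::lung` matches the imported file; `by_status` and `by_status_sex` are hypothetical names.

```r
library(dplyr)

lung <- survival::lung  # assumption: same data as lung-cancer.csv

# Number of observations in each status group.
by_status <- lung %>%
  group_by(status) %>%
  summarise(n = n())

# Number of observations for each status and sex combination;
# the result stays grouped by status, as in the output above.
by_status_sex <- lung %>%
  group_by(status, sex) %>%
  summarise(n = n())

by_status
by_status_sex
```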

I added a column called “karnodiff” to the dataset, holding the difference between the patient and physician Karnofsky ratings.

I then took the absolute value of karnodiff so that it only contains positive numbers.

Using those values, I calculated the mean of karnodiff. To drop the NAs, I passed “na.rm=TRUE” as the second argument to the mean() function.

##    average
## 1 10.58036

The mean of karnodiff (the absolute difference between the patient and physician Karnofsky ratings) is 10.6 points.
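
The three steps (difference, absolute value, mean) might be written as follows. The sketch assumes `survival::lung` stands in for the imported data; `avg_diff` is a hypothetical name.

```r
library(dplyr)

lung <- survival::lung %>%                      # assumption: same data as the CSV
  mutate(karnodiff = pat.karno - ph.karno) %>%  # patient minus physician rating
  mutate(karnodiff = abs(karnodiff))            # keep only the size of the gap

# Rows where either rating is missing give NA, so drop them in the mean.
avg_diff <- lung %>% summarise(average = mean(karnodiff, na.rm = TRUE))

avg_diff
```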


I ran an independent-samples t-test (Welch’s version, as shown in the output) to assess whether there was a significant difference in calories consumed between males and females.

## 
##  Welch Two Sample t-test
## 
## data:  lung$meal.cal by lung$sex
## t = 2.3533, df = 151.16, p-value = 0.01989
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   22.43394 257.25080
## sample estimates:
## mean in group 1 mean in group 2 
##        980.5439        840.7015

With a p-value of 0.01989, which is below the alpha level of 0.05, there is a significant difference in calories consumed between the sexes: group 1 (males) averaged about 981 calories and group 2 (females) about 841.
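
The `data:` line in the output suggests a formula call on the two columns. A sketch of that call, assuming `survival::lung` matches the imported data (`tt` is a hypothetical name):

```r
lung <- survival::lung  # assumption: same data as lung-cancer.csv

# t.test() defaults to var.equal = FALSE, which gives the Welch test.
tt <- t.test(lung$meal.cal ~ lung$sex)

tt
tt$p.value  # the p-value on its own
```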


Survival Analysis Plot

I copied the code into the console to fit a model for the survival analysis.

Next, I used the ggsurvplot() function from the survminer package to draw a Kaplan-Meier Curve.

Then I customized the plot by adding the p-value and the p-value method, setting the line size to 2, hiding the confidence interval, using theme_classic() as the ggtheme, and omitting the risk table.
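
The model code is not shown above, so the following is only a sketch of what the fit and the customized plot could look like. It assumes the model compares survival by sex and that `survival::lung` stands in for the imported data; `km_fit` and `km_plot` are hypothetical names.

```r
library(survival)
library(ggplot2)    # theme_classic()
library(survminer)

lung <- survival::lung  # assumption: same data as lung-cancer.csv

# Assumption: Kaplan-Meier fit stratified by sex.
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)

km_plot <- ggsurvplot(
  km_fit,
  data = lung,
  pval = TRUE,                # add the p-value
  pval.method = TRUE,         # name the test behind the p-value
  size = 2,                   # line size
  conf.int = FALSE,           # no confidence interval
  ggtheme = theme_classic(),
  risk.table = FALSE          # no risk table
)

km_plot
```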

Histogram Plot

I used the gghistogram() function from the ggpubr package to create a histogram of the continuous variable “meal.cal”.

Next, I customized the histogram by:

  1. adding a dotted line for the mean

  2. adding a title “Calories consumed at meals” with the x axis label as “Calories” and hiding the y axis label

  3. grouping by sex so that the histogram shows the frequency of calories by sex

Next, I converted sex to a factor using the as.factor() function.

I confirmed that the “sex” variable is now a factor with 2 levels by using str() and levels() to look at the structure and the levels of the variable.
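
A short sketch of the conversion and the two checks, assuming `survival::lung` stands in for the imported data:

```r
lung <- survival::lung  # assumption: same data as lung-cancer.csv

# Convert the numeric 1/2 coding into a factor.
lung$sex <- as.factor(lung$sex)

str(lung$sex)     # structure: should report a factor with 2 levels
levels(lung$sex)  # the level labels
```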

I reused the previous gghistogram() code. With sex converted to a factor instead of a continuous variable, the plot recognized the individuals as either male or female. I also added a palette argument to color the two groups green and blue.
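
Putting the histogram customizations together might look like the sketch below. It assumes `survival::lung` stands in for the imported data and that grouping by sex was done through both the `color` and `fill` arguments; `hist_plot` is a hypothetical name.

```r
library(ggpubr)

lung <- survival::lung           # assumption: same data as lung-cancer.csv
lung$sex <- as.factor(lung$sex)  # factor so the groups are discrete

hist_plot <- gghistogram(
  lung,
  x = "meal.cal",
  add = "mean",                            # line at the mean
  add.params = list(linetype = "dotted"),  # make that line dotted
  color = "sex",                           # assumption: outline by sex
  fill = "sex",                            # fill by sex
  palette = c("green", "blue"),
  title = "Calories consumed at meals",
  xlab = "Calories",
  ylab = FALSE                             # hide the y axis label
)

hist_plot
```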


Karnofsky Performance Score Scatterplot

First, I found the correlation between the Karnofsky ratings done by the patient and the physician with the cor() function from the stats package. I dropped the NAs by using the complete-observations option of the use argument.

## [1] 0.5202974

The r value is 0.5203, indicating a moderate positive correlation.
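
A sketch of the correlation call, assuming `survival::lung` stands in for the imported data and that the complete-observations option was `use = "complete.obs"` (`r` is a hypothetical name):

```r
lung <- survival::lung  # assumption: same data as lung-cancer.csv

# cor() defaults to the Pearson method; use = "complete.obs" drops
# any row where either rating is missing.
r <- cor(lung$pat.karno, lung$ph.karno, use = "complete.obs")

r
```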


Then, I created a custom scatterplot of the Karnofsky ratings done by the patient (y axis) and the physician (x axis) with the ggscatter() function from the ggpubr package.

Next, I customized the scatterplot by:

  1. adding color by sex

  2. putting a title of “Correlation Between Karnofsky Performance Done by Physicians and Patients”

  3. labeling the x axis “Score by Physician”

  4. labeling the y axis “Score by Patient”

  5. adding a linear regression line

  6. modifying the color of the regression line to light blue

  7. adding the confidence intervals

  8. adding the group mean point to the plot

  9. changing the group mean points to a size of 5

  10. adding the “spearman” correlation coefficient to the plot
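
The ten customizations above map onto ggscatter() arguments roughly as in this sketch. It assumes `survival::lung` stands in for the imported data; dropping rows with missing ratings beforehand is my own addition to avoid missing-value warnings, and `karno_plot` is a hypothetical name.

```r
library(dplyr)
library(ggpubr)

lung <- survival::lung %>%                        # assumption: same data as the CSV
  filter(!is.na(ph.karno), !is.na(pat.karno)) %>% # assumption: drop missing ratings
  mutate(sex = as.factor(sex))

karno_plot <- ggscatter(
  lung,
  x = "ph.karno", y = "pat.karno",
  color = "sex",                                  # 1. color by sex
  title = "Correlation Between Karnofsky Performance Done by Physicians and Patients",
  xlab = "Score by Physician",                    # 3. x axis label
  ylab = "Score by Patient",                      # 4. y axis label
  add = "reg.line",                               # 5. linear regression line
  add.params = list(color = "lightblue"),         # 6. line color
  conf.int = TRUE,                                # 7. confidence interval
  mean.point = TRUE,                              # 8. group mean points
  mean.point.size = 5,                            # 9. size of the mean points
  cor.coef = TRUE, cor.method = "spearman"        # 10. Spearman coefficient
)

karno_plot
```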

Scatterplot of Weight Loss Versus Calories Consumed

First, I created a basic scatterplot of the calories consumed (x axis) and weight loss (y axis).

Then I counted the number of people who consumed more than 2000 calories. These 5 people are outliers, so I removed them by filtering the dataset to keep only the data points where meal.cal < 2000.
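
A sketch of the basic plot, the count, and the filter, assuming `survival::lung` stands in for the imported data. The names `basic_plot`, `n_high`, and `lung_trim` are hypothetical.

```r
library(dplyr)
library(ggpubr)

lung <- survival::lung  # assumption: same data as lung-cancer.csv

# Basic scatterplot: calories on the x axis, weight loss on the y axis.
basic_plot <- ggscatter(lung, x = "meal.cal", y = "wt.loss")

# Count the patients above 2000 calories.
n_high <- lung %>% filter(meal.cal > 2000) %>% nrow()
n_high

# Keep only rows below 2000 calories. filter() also drops rows where
# meal.cal is NA, because the comparison evaluates to NA for them.
lung_trim <- lung %>% filter(meal.cal < 2000)
```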

I recreated the scatterplot without the outliers and customized the scatterplot by:

  1. adding color by sex

  2. putting a title of “Correlation Between Calories Consumed and Meals and Weight Loss in the Last 6 Months, by Sex”

  3. labeling the x axis “Calories Consumed”

  4. labeling the y axis “Weight Loss”

  5. adding a linear regression line

  6. adding an ellipse around the data points

  7. adding the confidence intervals

  8. adding the group mean point to the plot
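
The customized version of this plot could be sketched as follows, again assuming `survival::lung` stands in for the imported data. Dropping rows with missing weight loss is my own addition to avoid missing-value warnings, and `wt_plot` is a hypothetical name.

```r
library(dplyr)
library(ggpubr)

lung_trim <- survival::lung %>%                   # assumption: same data as the CSV
  filter(meal.cal < 2000, !is.na(wt.loss)) %>%    # remove outliers and missing values
  mutate(sex = as.factor(sex))

wt_plot <- ggscatter(
  lung_trim,
  x = "meal.cal", y = "wt.loss",
  color = "sex",                                  # 1. color by sex
  title = "Correlation Between Calories Consumed and Meals and Weight Loss in the Last 6 Months, by Sex",
  xlab = "Calories Consumed",                     # 3. x axis label
  ylab = "Weight Loss",                           # 4. y axis label
  add = "reg.line",                               # 5. linear regression line
  ellipse = TRUE,                                 # 6. ellipse around the points
  conf.int = TRUE,                                # 7. confidence interval
  mean.point = TRUE                               # 8. group mean points
)

wt_plot
```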