Load the necessary libraries: tidyverse, ggpubr, survival, survminer
First, I loaded the packages I need for the assignment.
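A minimal sketch of the loading step, assuming all four packages are already installed (the install call mentioned in the comment is not part of the original report):

```r
# Load the four packages named in the heading.
# Assumes they were installed beforehand, e.g. with install.packages().
library(tidyverse)  # dplyr, readr, ggplot2 and related packages
library(ggpubr)     # gghistogram(), ggscatter()
library(survival)   # Surv(), survfit()
library(survminer)  # ggsurvplot()
```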
Import lung-cancer.csv with RStudio
When I imported the CSV file, I skipped the first column and left the column types unchanged for now (nothing is converted to a factor yet).
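A minimal sketch of this import in code. The original file is not available here, so as an assumption the sketch first writes the `lung` dataset from the survival package to a temporary `lung-cancer.csv`, which produces the same kind of leading row-index column; the path and the use of `read_csv()` with `select(-1)` are illustrative choices, not necessarily what the RStudio import dialog generated.

```r
library(readr)
library(dplyr)

# Stand-in for lung-cancer.csv (assumption): write survival::lung to a
# temporary file. write.csv() stores the row names as an unnamed first column.
csv_path <- file.path(tempdir(), "lung-cancer.csv")
write.csv(survival::lung, csv_path)

# Read the file, then drop the first column (the row index).
# Column types are left as guessed by readr; nothing is converted to a factor.
lung <- read_csv(csv_path) %>% select(-1)
lung
```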
## # A tibble: 228 x 10
##     inst  time status   age   sex ph.ecog ph.karno pat.karno meal.cal wt.loss
##    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
##  1     3   306      2    74     1       1       90       100     1175      NA
##  2     3   455      2    68     1       0       90        90     1225      15
##  3     3  1010      1    56     1       0       90        90       NA      15
##  4     5   210      2    57     1       1       90        60     1150      11
##  5     1   883      2    60     1       0      100        90       NA       0
##  6    12  1022      1    74     1       1       50        80      513       0
##  7     7   310      2    68     2       2       70        60      384      10
##  8    11   361      2    71     2       2       60        80      538       1
##  9     1   218      2    53     1       1       70        80      825      16
## 10     7   166      2    61     1       2       70        70      271      34
## # … with 218 more rows
Begin to interpret the data
I filtered the observations to keep only females and assigned the result to a new variable.
I summarized the resulting dataset (females only) by calculating the average age and the average meal calories. To drop the NAs, I passed the argument na.rm = TRUE as the second argument to the mean() function.
## average
## 1 61.07778
## average
## 1 840.7015
The average age for females is 61.1 years and the average meal calories is 841 calories.
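The filtering and summarizing steps can be sketched as follows. This assumes the imported data matches `survival::lung` and that females are coded as `sex == 2`, which is consistent with the group means in the t-test output later in the report; the names `females`, `avg_age`, and `avg_cal` are illustrative.

```r
library(dplyr)

# Assumption: survival::lung holds the same data as lung-cancer.csv
lung <- survival::lung

# Keep only the female patients (assumed coding: sex == 2)
females <- lung %>% filter(sex == 2)

# Average age; na.rm = TRUE drops any missing values before averaging
avg_age <- females %>% summarize(average = mean(age, na.rm = TRUE))

# Average meal calories, again ignoring missing values
avg_cal <- females %>% summarize(average = mean(meal.cal, na.rm = TRUE))

avg_age
avg_cal
```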
Then, I calculated the average survival time in days for both status groups.
# Mean follow-up time in days for the censored group (status 1)
censored <- lung %>% filter(status == 1) %>% summarize(average = mean(time))
# Mean survival time in days for the group that died (status 2)
dead <- lung %>% filter(status == 2) %>% summarize(average = mean(time))
The average survival time for the censored group is 363 days and the average survival time for the dead group is 283 days.
Next, I calculated how many observations were in the dataset by status group.
## # A tibble: 2 x 2
##   status     n
##    <dbl> <int>
## 1      1    63
## 2      2   165
I also calculated how many observations there were by status and sex.
## # A tibble: 4 x 3
## # Groups:   status [2]
##   status   sex     n
##    <dbl> <dbl> <int>
## 1      1     1    26
## 2      1     2    37
## 3      2     1   112
## 4      2     2    53
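Both counts can be reproduced with a short sketch. It assumes `survival::lung` as a stand-in for the imported file; grouping by status before counting by sex is one of several dplyr idioms that give a result grouped by status, and the report's exact calls may have differed.

```r
library(dplyr)

# Assumption: survival::lung holds the same data as lung-cancer.csv
lung <- survival::lung

# Number of observations in each status group
by_status <- lung %>% count(status)

# Number of observations for each combination of status and sex;
# the result stays grouped by status
by_status_sex <- lung %>% group_by(status) %>% count(sex)

by_status
by_status_sex
```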
I added a column to the dataset that is a result of the difference between the patient and physician Karnofsky ratings. The column was called “karnodiff”.
I then took the absolute value of karnodiff so that it only contains positive numbers.
Using those values, I calculated the mean of karnodiff. To drop the NAs, I passed the argument na.rm = TRUE as the second argument to the mean() function.
## average
## 1 10.58036
The mean of karnodiff [the difference between the patient and physician Karnofsky ratings] is 10.6 points.
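The three karnodiff steps can be sketched as below, assuming `survival::lung` as the data. The direction of the subtraction (patient minus physician) is an assumption; it does not affect the result once the absolute value is taken.

```r
library(dplyr)

# Assumption: survival::lung holds the same data as lung-cancer.csv
lung <- survival::lung

lung <- lung %>%
  # Difference between patient and physician ratings (direction assumed)
  mutate(karnodiff = pat.karno - ph.karno) %>%
  # Absolute value, so the column holds only non-negative numbers
  mutate(karnodiff = abs(karnodiff))

# Mean difference, ignoring rows where either rating is missing
karno_mean <- lung %>% summarize(average = mean(karnodiff, na.rm = TRUE))
karno_mean
```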
I ran an independent two-sample t-test (Welch's, the default in t.test()) to assess whether the mean calories consumed differed significantly between males and females.
##
## Welch Two Sample t-test
##
## data: lung$meal.cal by lung$sex
## t = 2.3533, df = 151.16, p-value = 0.01989
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 22.43394 257.25080
## sample estimates:
## mean in group 1 mean in group 2
## 980.5439 840.7015
With p = 0.01989, which is below the alpha level of 0.05, there is a statistically significant difference in mean calories consumed between males (980.5) and females (840.7).
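A call of the following form produces output with the same `data:` line as shown above. It is a sketch that assumes `survival::lung` as the data; the name `res` is illustrative.

```r
# Assumption: survival::lung holds the same data as lung-cancer.csv
lung <- survival::lung

# Two-sample t-test of meal calories by sex.
# t.test() does not assume equal variances by default, so this is Welch's test.
res <- t.test(lung$meal.cal ~ lung$sex)
res

# The p-value can be pulled out for comparison against alpha = 0.05
res$p.value < 0.05
```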
Survival Analysis Plot
I copied the following code into the console to calculate a model for survival analysis.
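The model code itself is not reproduced in this report. A common form, shown here only as an assumption, fits Kaplan-Meier curves stratified by sex; some grouping variable is needed for the p-value requested in the plot, but the choice of `sex` is a guess.

```r
library(survival)

# Assumption: survival::lung holds the same data as lung-cancer.csv
lung <- survival::lung

# Kaplan-Meier fit of survival time by sex (grouping variable assumed).
# With status coded 1/2, Surv() treats 2 as the event (death).
fit <- survfit(Surv(time, status) ~ sex, data = lung)
fit
```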
Next, I used the ggsurvplot() function from the survminer package to draw a Kaplan-Meier Curve.
Then I customized the plot by adding the p-value and the p-value method, setting the line size to 2, turning off the confidence interval and the risk table, and using theme_classic() as the ggtheme.
ggsurvplot(fit, data = lung, pval = TRUE, pval.method = TRUE, size = 2,
           conf.int = FALSE, ggtheme = theme_classic(), risk.table = FALSE)
Histogram Plot
I used the gghistogram() function from the ggpubr package to create a histogram of the continuous variable “meal.cal”.
Next, I customized the histogram by:
- adding a dotted line for the mean
- adding the title “Calories Consumed at Meals”, labeling the x axis “Calories”, and hiding the y axis label
gghistogram(lung, x = "meal.cal",
            add = "mean", title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE)
- grouping by sex so that the histogram shows the frequency of calories by sex
gghistogram(lung,
            x = "meal.cal", add = "mean",
            title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE, fill = "sex")
Next, I converted sex to a factor using the as.factor() function.
I confirmed that the “sex” variable is now a factor with 2 levels by using str() and levels() to inspect its structure and levels.
I reused the previous gghistogram() code. Because sex is now a factor rather than a continuous variable, the plot treats each individual as belonging to one of two groups (male or female).
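The conversion and the two checks can be sketched as follows, assuming `survival::lung` as the data:

```r
# Assumption: survival::lung holds the same data as lung-cancer.csv
lung <- survival::lung

# Convert sex from a numeric column to a factor
lung$sex <- as.factor(lung$sex)

# Inspect the structure and the levels of the converted variable
str(lung$sex)
levels(lung$sex)
```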
- adding a palette argument to color the two groups green and blue
gghistogram(lung, x = "meal.cal",
            add = "mean", title = "Calories Consumed at Meals",
            xlab = "Calories", ylab = FALSE,
            fill = "sex", palette = c("green", "blue"))
Karnofsky Performance Score Scatterplot
First, I found the correlation between the Karnofsky ratings given by the patient and by the physician with the cor() function from the stats package. I dropped the NAs by setting the use argument to complete observations.
## [1] 0.5202974
The r value is 0.5203, indicating a moderate positive correlation.
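A sketch of the correlation call, assuming `survival::lung` as the data and `use = "complete.obs"` as the value behind the "complete observations" argument; the name `r` is illustrative. `cor()` computes the Pearson coefficient by default.

```r
# Assumption: survival::lung holds the same data as lung-cancer.csv
lung <- survival::lung

# Pearson correlation between patient and physician Karnofsky ratings,
# using only the rows where both ratings are present
r <- cor(lung$pat.karno, lung$ph.karno, use = "complete.obs")
r
```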
Then, I created a custom scatterplot of the Karnofsky ratings done by the patient (y axis) and the physician (x axis) with the ggscatter() function from the ggpubr package.
Next, I customized the scatterplot by:
- adding color by sex
- putting a title of “Correlation Between Karnofsky Performance Score Done by Physicians & Patients”
- labeling the x axis “Score by Physician”
- labeling the y axis “Score by Patient”
- adding a linear regression line
- changing the color of the regression line to light blue
- adding the confidence interval
- adding the group mean points to the plot
- setting the size of the group mean points to 5
- adding the Spearman correlation coefficient to the plot
ggscatter(lung, x = "ph.karno", y = "pat.karno",
          color = "sex",
          title = "Correlation Between Karnofsky Performance \nScore Done by Physicians & Patients",
          xlab = "Score by Physician",
          ylab = "Score by Patient",
          add = "reg.line", add.params = list(color = "lightblue"),
          conf.int = TRUE, ellipse.type = "confidence",
          mean.point = TRUE, mean.point.size = 5,
          cor.coef = TRUE,
          cor.coeff.args = list(method = "spearman",
                                label.x.npc = "middle", label.y.npc = "bottom"))
Scatterplot of Weight Loss Versus Calories Consumed
First, I created a basic scatterplot of the calories consumed (x axis) and weight loss (y axis).
Then I counted the number of people who consumed more than 2000 calories. These 5 people are outliers, so I removed them by filtering the dataset to keep only the observations where meal.cal < 2000.
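The count and the filter can be sketched as follows, assuming `survival::lung` as the data. The names `n_high` and `lung_trimmed` are illustrative; the plotting code that follows uses `lung`, which suggests the report overwrote the original object instead.

```r
library(dplyr)

# Assumption: survival::lung holds the same data as lung-cancer.csv
lung <- survival::lung

# Number of people who consumed more than 2000 calories
n_high <- lung %>% filter(meal.cal > 2000) %>% nrow()
n_high

# Keep only observations below 2000 calories.
# filter() also drops rows where meal.cal is NA, because the
# comparison evaluates to NA for those rows.
lung_trimmed <- lung %>% filter(meal.cal < 2000)
```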
I recreated the scatterplot without the outliers and customized the scatterplot by:
- adding color by sex
- putting a title of “Correlation Between Calories Consumed and Meals and Weight Loss in the Last 6 Months, by Sex”
- labeling the x axis “Calories Consumed”
- labeling the y axis “Weight Loss”
- adding a linear regression line
- adding an ellipse around the data points
- adding the confidence intervals
- adding the group mean point to the plot
ggscatter(lung,
          x = "meal.cal", y = "wt.loss",
          color = "sex",
          title = "Correlation Between Calories Consumed and Meals \n and Weight Loss in the Last 6 Months, by Sex",
          xlab = "Calories Consumed",
          ylab = "Weight Loss",
          add = "reg.line", ellipse = TRUE,
          conf.int = TRUE, mean.point = TRUE)