Introduction to Data Science - Assignment Week 5

Load the necessary libraries: tidyverse, ggpubr, survival, survminer

Here, I’m going to install and load the packages needed for the assignment. The install.packages() calls only need to run once per machine.

install.packages("tidyverse")
install.packages("ggpubr")
install.packages("survival")
install.packages("survminer")
install.packages("readr")
library(tidyverse)
library(ggpubr)
library(survival)
library(survminer)
library(readr)

Import lung-cancer.csv with RStudio

Skip the first column
Don’t change the factors (like sex, status) for now

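library(readr)
# Assign to `lung`, the name every later step uses. Otherwise `lung` would
# silently resolve to the dataset bundled with the survival package, not the CSV.
lung <- read_csv("lung-cancer.csv",
    col_types = cols(X1 = col_skip()))
lung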

## # A tibble: 228 x 10
##     inst  time status   age   sex ph.ecog ph.karno pat.karno meal.cal wt.loss
##    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>     <dbl>    <dbl>   <dbl>
##  1     3   306      2    74     1       1       90       100     1175      NA
##  2     3   455      2    68     1       0       90        90     1225      15
##  3     3  1010      1    56     1       0       90        90       NA      15
##  4     5   210      2    57     1       1       90        60     1150      11
##  5     1   883      2    60     1       0      100        90       NA       0
##  6    12  1022      1    74     1       1       50        80      513       0
##  7     7   310      2    68     2       2       70        60      384      10
##  8    11   361      2    71     2       2       60        80      538       1
##  9     1   218      2    53     1       1       70        80      825      16
## 10     7   166      2    61     1       2       70        70      271      34
## # … with 218 more rows
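The name given to the unnamed first column depends on the readr version: older releases call it X1 (as in the import code above), while readr 2.0 and later call it ...1. A minimal sketch, using a made-up file written to a temporary path (not the assignment data), that drops the first column by position so it works either way:

```r
library(readr)

# Hypothetical toy file: the first column has no header, like a row index
tmp <- tempfile(fileext = ".csv")
writeLines(c(",time,status", "1,306,2", "2,455,2"), tmp)

raw <- suppressMessages(read_csv(tmp))
names(raw)[1]   # "X1" or "...1", depending on the readr version

# Drop the first column by position instead of by name
toy <- raw[, -1]
names(toy)
```

If col_skip() complains about an unknown column, checking names() on a plain read_csv() call like this shows which name to use.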

Filter all observations by females and assign it to a different variable

# sex is still numeric at this point (2 = female), so compare with a number
lungfemale <- lung %>% filter(sex == 2)

Summarize the resulting dataset (with females only) by calculating average “age” and average “Meal Calories” (hint: add “na.rm=TRUE” as the second argument in the “mean()” function to drop the NAs)

lungfemale %>% summarize(average = mean(age))

##    average
## 1 61.07778

lungfemale %>% summarize(average = mean(meal.cal, na.rm=TRUE))

##    average
## 1 840.7015

What is the average age and average meal calories for females?

The average age is 61.1 years and the average meal calorie intake is 841 cal.
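Both averages can also come from a single summarize() call. A small sketch on made-up rows (the column names match the dataset, the values do not), which also shows what happens when na.rm = TRUE is left out:

```r
library(dplyr)

# Made-up rows, not the assignment data
toy <- data.frame(age = c(60, 70, 65), meal.cal = c(800, NA, 1000))

res <- toy %>%
  summarize(
    mean_age     = mean(age),
    mean_cal     = mean(meal.cal, na.rm = TRUE),  # NA dropped: (800 + 1000) / 2
    mean_cal_raw = mean(meal.cal)                 # NA, because one value is missing
  )
res
```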

What is the average survival time in days for both status groups?

Group by status first, then calculate average

# Average survival time in days per status group (1 = censored, 2 = dead)
lung %>% group_by(status) %>% summarize(average = mean(time))

How many observations do you have in your dataset per “status” group?

group_by() status and then calculate number of observations per group with tally()

lung %>% group_by(status) %>% tally()

## # A tibble: 2 x 2
##   status     n
##    <dbl> <int>
## 1      1    63
## 2      2   165
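dplyr also has count(), which does the grouping and the tally in one step. A short sketch on a made-up status column, checking that both forms give the same counts:

```r
library(dplyr)

# Made-up status codes, not the assignment data
toy <- data.frame(status = c(1, 2, 2, 1, 2))

two_step <- toy %>% group_by(status) %>% tally()
one_step <- toy %>% count(status)

two_step$n   # 2 3
one_step$n   # 2 3
```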

How many observations do you have by “status” and “sex”?

The group_by() function accepts several variables separated by commas.

lung %>% group_by(status, sex) %>% tally()

## # A tibble: 4 x 3
## # Groups:   status [2]
##   status   sex     n
##    <dbl> <dbl> <int>
## 1      1     1    26
## 2      1     2    37
## 3      2     1   112
## 4      2     2    53

Add a column to the dataset that is the result of the difference between the Karnofsky ratings done by the patient and the physician

Call the column “karnodiff”

lung <- lung %>% mutate(karnodiff = ph.karno-pat.karno)

Calculate the absolute value of karnodiff with the “abs()” function and reassign it to karnodiff.

lung <- lung %>% mutate(karnodiff = abs(ph.karno-pat.karno))

The resulting karnodiff column should have only positive numbers

What is the mean difference of the ratings between patients and physicians?

Calculate mean of karnodiff with na.rm=TRUE as second argument (to drop the NAs)

lung %>% summarize(average=mean(karnodiff, na.rm=TRUE))

##    average
## 1 10.58036
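The abs() step matters because signed differences can cancel each other out. A tiny base R sketch with made-up ratings shows the effect:

```r
# Made-up ratings: the physician rates higher once, the patient rates higher once
ph  <- c(90, 70, 80)
pat <- c(80, 80, NA)

signed_mean   <- mean(ph - pat, na.rm = TRUE)       # 10 and -10 cancel out
absolute_mean <- mean(abs(ph - pat), na.rm = TRUE)  # 10 and 10

signed_mean     # 0
absolute_mean   # 10
```

The signed mean would suggest that the two raters agree, while the absolute mean shows they differ by 10 points on average.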

Is there a significant difference in the calories consumed by males and females?

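What statistical test are you using? Why?

A Welch two-sample t-test (the default for t.test()), because it compares the mean of a continuous variable (meal.cal) between two independent groups (sex) without assuming equal variances. With p = 0.01989, below 0.05, the difference is significant.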

t.test(lung$meal.cal ~ lung$sex)

## 
##  Welch Two Sample t-test
## 
## data:  lung$meal.cal by lung$sex
## t = 2.3533, df = 151.16, p-value = 0.01989
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   22.43394 257.25080
## sample estimates:
## mean in group 1 mean in group 2 
##        980.5439        840.7015
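The object returned by t.test() can be stored and queried instead of reading the numbers off the printout. A sketch on made-up calorie values (the data frame toy is hypothetical), using the data = argument rather than lung$ prefixes:

```r
# Made-up values, not the assignment data
toy <- data.frame(
  meal.cal = c(900, 1000, 1100, 1200, 700, 800, 850, 950),
  sex      = c(1, 1, 1, 1, 2, 2, 2, 2)
)

res <- t.test(meal.cal ~ sex, data = toy)

res$method     # name of the test that was run
res$p.value    # p-value as a plain number
res$estimate   # the two group means
```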

Write or copy-paste the following code into the console to calculate a model for survival analysis

fit<-survfit(Surv(time,status)~sex,data=name_of_your_dataset)

fit <- survfit(Surv(time, status) ~ sex, data = lung)

Use the ggsurvplot() function from the survminer package to draw a Kaplan-Meier curve (hint: https://rpkgs.datanovia.com/survminer/index.html).

ggsurvplot(fit, data = lung)

a. Customize the plot: add the p-value and the p-value method (pval.method), set the line size to 2, hide the confidence interval, set the ggtheme to theme_classic(), and omit the risk table

ggsurvplot(fit, data = lung,
           pval = TRUE, pval.method = TRUE,
           size = 2, conf.int = FALSE,
           ggtheme = theme_classic(), risk.table = FALSE)
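As far as I can tell from the survminer documentation, the p-value drawn with pval = TRUE comes from a log-rank test by default. The same test can be run directly with survdiff() from the survival package. This sketch assumes the lung dataset bundled with survival, which may not be identical to lung-cancer.csv:

```r
library(survival)

# Assumption: survival::lung stands in for the imported CSV
lr <- survdiff(Surv(time, status) ~ sex, data = survival::lung)

# Log-rank p-value from the chi-square statistic (groups - 1 degrees of freedom)
p_logrank <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_logrank
```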

Use gghistogram() from ggpubr to create a histogram of the continuous variable “meal.cal”.

gghistogram(lung,x = "meal.cal")

Customize the histogram by
adding a dotted line for the mean

gghistogram(lung,x = "meal.cal", add = "mean")

adding a title “Calories consumed at meals” with x axis label as “Calories”. Hide Y axis label

gghistogram(lung,
            x = "meal.cal",
            add = "mean",
            title = "Calories Consumed at Meals",
            xlab = "Calories",
            ylab = FALSE)

Group by sex so that the histogram shows the frequency of calories by sex (add fill = “sex” as an argument to gghistogram())

gghistogram(lung,
            x = "meal.cal",
            add = "mean",
            title = "Calories Consumed at Meals",
            xlab = "Calories",
            ylab = FALSE,
            fill = "sex")

Convert sex as a factor by using as.factor() function (hint: google “as.factor() in r” if you want more details on how to use the function)

lung$sex <- as.factor(lung$sex)

Make sure the “sex” variable is now a factor with 2 levels (use str() and levels() to look at the structure and the levels of the variable).

levels(lung$sex)

## [1] "1" "2"
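as.factor() keeps the raw codes as level names. If readable legend labels are wanted, factor() with a labels argument is an alternative. A sketch on a made-up vector of codes, assuming the 1 = male, 2 = female coding used when filtering females earlier:

```r
# Made-up codes, not the assignment data
sex_codes <- c(1, 2, 2, 1)

sex_plain   <- as.factor(sex_codes)
sex_labeled <- factor(sex_codes, levels = c(1, 2), labels = c("Male", "Female"))

levels(sex_plain)     # "1" "2"
levels(sex_labeled)   # "Male" "Female"
```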

Now, use the previous code for gghistogram(). What is the difference?

Add palette argument as green and blue: c(“green”, “blue”)

gghistogram(lung, x = "meal.cal",
add = "mean", title = "Calories Consumed at Meals", xlab = "Calories", ylab = FALSE,
fill = "sex", palette = c("green", "blue"))

Find the correlation between the Karnofsky ratings done by the patient and the physician with the cor() function from the base package

Drop the NAs by using complete observations (See lecture)

cor(lung$pat.karno, lung$ph.karno, use = "complete.obs")

## [1] 0.5202974
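A quick base R sketch, on made-up vectors, of what use = "complete.obs" changes and how to ask for a Spearman coefficient instead of the default Pearson one:

```r
# Made-up vectors with one missing value
x <- c(1, 2, 3, NA)
y <- c(2, 4, 6, 8)

r_default  <- cor(x, y)                        # NA: the missing value propagates
r_complete <- cor(x, y, use = "complete.obs")  # only the 3 complete pairs are used
r_spearman <- cor(x, y, use = "complete.obs", method = "spearman")

r_default    # NA
r_complete   # 1
r_spearman   # 1
```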

Create a basic scatterplot of the Karnofsky ratings done by the patient (y axis) and the physician (x axis) with the ggscatter() function from the ggpubr package.

ggscatter(lung, x = "ph.karno", y = "pat.karno")

Customize the scatterplot with
a. color as sex,
b. title as “Correlation Between Karnofsky Performance Done by Physicians and Patients”,
c. x axis label as “Score by Physician”,
d. y axis label as “Score by Patient”,
e. add a linear regression line,
f. modify the color of the regression line to light blue (color = “lightblue”),
g. add the confidence intervals,
h. add the group mean point to the plot,
i. change the size of the group mean points to 5,
j. add the “spearman” correlation coefficient to the plot,
k. copy paste the plot here

ggscatter(lung,
          x = "ph.karno",
          y = "pat.karno",
          color = "sex",
          # Title kept on one source line so that \n is the only line break
          title = "Correlation Between Karnofsky Performance \nScore Done by Physicians & Patients",
          xlab = "Score by Physician",
          ylab = "Score by Patient",
          add = "reg.line",
          add.params = list(color = "lightblue"),
          conf.int = TRUE,
          ellipse.type = "confidence",  # only has an effect when ellipse = TRUE
          mean.point = TRUE,
          mean.point.size = 5,
          cor.coef = TRUE,
          cor.coeff.args = list(method = "spearman",
                                label.x.npc = "middle",
                                label.y.npc = "bottom"))

Create a basic scatterplot of the calories consumed (x axis) and weight loss (y axis).

ggscatter(lung, x = "meal.cal", y="wt.loss")

How many people consume more than 2000 calories, according to the plot?
They are outliers and we’re going to get rid of them. Filter your dataset and keep the data points when meal.cal < 2000

lung <- filter(lung, meal.cal < 2000)
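One thing to keep in mind: filter() keeps only rows where the condition is TRUE, so rows with a missing meal.cal are dropped together with the outliers. A sketch on made-up rows showing the difference, plus a variant that keeps the NAs:

```r
library(dplyr)

# Made-up rows: one normal value, one outlier, one missing value
toy <- data.frame(id = 1:3, meal.cal = c(1000, 2500, NA))

strict  <- toy %>% filter(meal.cal < 2000)                     # drops the outlier and the NA row
lenient <- toy %>% filter(is.na(meal.cal) | meal.cal < 2000)   # drops only the outlier

strict$id    # 1
lenient$id   # 1 3
```

For the scatterplot this makes no visible difference, since points with a missing meal.cal cannot be drawn anyway, but it does change the number of rows left in the dataset.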

Use the same code to create the basic scatterplot again. The outliers should be gone. Customize the scatterplot with
color as sex,
title as “Correlation Between Calories Consumed and Meals and Weight Loss in the Last 6 Months, by Sex”,
x axis label as “Calories Consumed”,
y axis label as “Weight Loss”,
add a linear regression line,
add an ellipse around the data points,
add the confidence intervals,
add the group mean point to the plot,
copy paste the plot here

ggscatter(lung,
          x = "meal.cal",
          y = "wt.loss",
          color = "sex",
          # Title kept on one source line so that \n is the only line break
          title = "Correlation Between Calories Consumed and Meals \nand Weight Loss in the Last 6 Months, by Sex",
          xlab = "Calories Consumed",
          ylab = "Weight Loss",
          add = "reg.line",
          ellipse = TRUE,
          conf.int = TRUE,
          mean.point = TRUE)