Get familiar with R (in RStudio)
R packages for data science (manipulate and visualize data)
Obtain summary statistics and produce useful graphs
Example of survival analysis
More material at https://alecri.github.io/courses/getStartedR_uu.html
October 12, 2017
Get familiar with R (in RStudio)
R packages for data science (manipulate and visualize data)
Obtain summary statistics and produce useful graphs
Example of survival analysis
More material at https://alecri.github.io/courses/getStartedR_uu.html
base for Windows, or the last .pkg for Mac, and download..exe or .dmg file and run it.Bottom left: console (or command) window.
Here you can type simple commands after the ">" prompt and R will then execute your command.
Top left: editor (or script) window.
Collections of commands (scripts) can be edited and saved. If you want to run a line from the script window you can click Run or press CTRL+ENTER or CMD+ENTER to send it to the command window.
Top right: environment/history window.
In the environment window you can see which data and values R has in its memory. The history window shows what has been typed before.
Bottom right: files/plots/packages/help window.
Here you can open files, view plots, install and load packages or use the help function.
R can do many statistical and data analyses. They are organized in so-called packages or libraries. Most common packages are already installed.
Many more packages are available on CRAN.
For example, to install and use the tidyverse package type in the command line or execute them from the editor/script
# install the package
install.packages("tidyverse")
# load the package
library(tidyverse)
http://r4ds.had.co.nz/index.html
The tidyverse is a set of packages that work in harmony because they share common data representations and API design.
It includes:
Data-set (oralca.txt)
orca = read.table("http://www.stats4life.se/data/oralca.txt")
orca = as_data_frame(orca)
Reference (article)
Läärä E, Korpi JT, Pitkänen H, Alho OP, Kantola S. Competing risks analysis of cause‐specific mortality in patients with oral squamous cell carcinoma. Head & neck. 2017 Jan 1;39(1):56-62.
Descriptive abstract
A population-based retrospective cohort design, including patients diagnosed with an OSCC cancer between January 1, 1985 and December 31, 2005 from the 2 northernmost provinces of Finland. The classification of deaths into the 2 categories in this study: (1) deaths from OSCC; and (2) deaths of other causes, was based on the Finnish Cancer Registry.
# reads a file in table format (.csv) into data.frame
orca = read.csv("http://www.stats4life.se/data/oralca.csv")
# read txt into a data.frame
orca = read.table("http://www.stats4life.se/data/oralca.txt")
# reads a delimited file (.csv and .tsc) into tibble
library(readr)
orca = read_csv("http://www.stats4life.se/data/oralca.csv")
library(haven)
# read a Stata data set into tibble
orca = read_dta("http://www.stats4life.se/data/oralca.dta")
# read a SPSS data set into tibble
orca = read_sav("http://www.stats4life.se/data/oralca.sav")
# read a SAS data set into tibble
orca = read_sas("http://www.stats4life.se/data/oralca.sas7bdat")
library(readxl)
# download and read an Excel sheet into tibble
download.file("http://www.stats4life.se/data/oralca.xlsx", destfile = "oralca.xlsx")
orca = read_excel("oralca.xlsx")
# load data in R format
load(url("http://www.stats4life.se/data/orcalca.rda"))
# print data in the console
orca
# A tibble: 338 x 6
id sex age stage time event
* <int> <fctr> <dbl> <fctr> <dbl> <fctr>
1 1 Male 65.42274 unkn 5.081 Alive
2 2 Female 83.08783 III 0.419 Oral ca. death
3 3 Male 52.59008 II 7.915 Other death
4 4 Male 77.08630 I 2.480 Other death
5 5 Male 80.33622 IV 2.500 Oral ca. death
6 6 Female 82.58132 IV 0.167 Other death
7 7 Male 73.07524 II 5.925 Alive
8 8 Male 74.49622 unkn 1.503 Oral ca. death
9 9 Female 56.49984 IV 13.333 Oral ca. death
10 10 Male 49.49075 II 7.666 Alive
# ... with 328 more rows
# dim = number of row anxd number of columns dim(orca) [1] 338 6
c("Number of obs:" = nrow(orca), "Number of variables:" = ncol(orca))
Number of obs: Number of variables:
338 6
# More in general, get the structure of the data str(orca) Classes 'tbl_df', 'tbl' and 'data.frame': 338 obs. of 6 variables: $ id : int 1 2 3 4 5 6 7 8 9 10 ... $ sex : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 2 2 1 2 ... $ age : num 65.4 83.1 52.6 77.1 80.3 ... $ stage: Factor w/ 5 levels "I","II","III",..: 5 3 2 1 4 4 2 5 4 2 ... $ time : num 5.081 0.419 7.915 2.48 2.5 ... $ event: Factor w/ 3 levels "Alive","Oral ca. death",..: 1 2 3 3 2 3 1 2 2 1 ...
# names of the variable names(orca) [1] "id" "sex" "age" "stage" "time" "event"
# a quick summary of the data
summary(orca)
id sex age stage time event
Min. : 1.00 Female:152 Min. :15.15 I :50 Min. : 0.085 Alive :109
1st Qu.: 85.25 Male :186 1st Qu.:53.24 II :77 1st Qu.: 1.333 Oral ca. death:122
Median :169.50 Median :64.86 III :72 Median : 3.869 Other death :107
Mean :169.50 Mean :63.51 IV :68 Mean : 5.662
3rd Qu.:253.75 3rd Qu.:74.29 unkn:71 3rd Qu.: 8.417
Max. :338.00 Max. :92.24 Max. :23.258
https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
5 functions (verbs) to manipulate data
mutate())select())filter())arrange())summarise())mutate()# create an indicator variable for death in a new data-set called 'orca2'
orca2 = mutate(orca, dead = (event != "Alive"))
orca2
# A tibble: 338 x 7
id sex age stage time event dead
<int> <fctr> <dbl> <fctr> <dbl> <fctr> <lgl>
1 1 Male 65.42274 unkn 5.081 Alive FALSE
2 2 Female 83.08783 III 0.419 Oral ca. death TRUE
3 3 Male 52.59008 II 7.915 Other death TRUE
4 4 Male 77.08630 I 2.480 Other death TRUE
5 5 Male 80.33622 IV 2.500 Oral ca. death TRUE
6 6 Female 82.58132 IV 0.167 Other death TRUE
7 7 Male 73.07524 II 5.925 Alive FALSE
8 8 Male 74.49622 unkn 1.503 Oral ca. death TRUE
9 9 Female 56.49984 IV 13.333 Oral ca. death TRUE
10 10 Male 49.49075 II 7.666 Alive FALSE
# ... with 328 more rows
select()# select only variables: age, sex, stage, time, and dead
select(orca2, age, sex, stage, time, dead)
# A tibble: 338 x 5
age sex stage time dead
<dbl> <fctr> <fctr> <dbl> <lgl>
1 65.42274 Male unkn 5.081 FALSE
2 83.08783 Female III 0.419 TRUE
3 52.59008 Male II 7.915 TRUE
4 77.08630 Male I 2.480 TRUE
5 80.33622 Male IV 2.500 TRUE
6 82.58132 Female IV 0.167 TRUE
7 73.07524 Male II 5.925 FALSE
8 74.49622 Male unkn 1.503 TRUE
9 56.49984 Female IV 13.333 TRUE
10 49.49075 Male II 7.666 FALSE
# ... with 328 more rows
filter()# select only male patients with diagnosed stage IV
filter(orca2, sex == "Male", stage == "IV")
# A tibble: 44 x 7
id sex age stage time event dead
<int> <fctr> <dbl> <fctr> <dbl> <fctr> <lgl>
1 5 Male 80.33622 IV 2.500 Oral ca. death TRUE
2 11 Male 53.08017 IV 2.834 Alive FALSE
3 14 Male 74.08006 IV 4.668 Other death TRUE
4 27 Male 65.98675 IV 5.246 Other death TRUE
5 28 Male 57.31574 IV 1.815 Alive FALSE
6 29 Male 74.40313 IV 0.747 Other death TRUE
7 37 Male 63.31179 IV 0.252 Oral ca. death TRUE
8 52 Male 54.06034 IV 5.248 Other death TRUE
9 57 Male 51.05684 IV 0.914 Other death TRUE
10 62 Male 59.47049 IV 1.084 Oral ca. death TRUE
# ... with 34 more rows
arrange()# order the data-set with increasing time, and descreasing age
arrange(orca2, time, desc(age))
# A tibble: 338 x 7
id sex age stage time event dead
<int> <fctr> <dbl> <fctr> <dbl> <fctr> <lgl>
1 265 Male 62.62731 IV 0.085 Oral ca. death TRUE
2 227 Male 47.80966 IV 0.085 Oral ca. death TRUE
3 18 Female 90.74855 IV 0.162 Oral ca. death TRUE
4 229 Male 53.14588 unkn 0.162 Oral ca. death TRUE
5 79 Male 87.71219 unkn 0.167 Oral ca. death TRUE
6 6 Female 82.58132 IV 0.167 Other death TRUE
7 183 Male 77.32176 III 0.167 Other death TRUE
8 199 Male 72.40171 unkn 0.167 Oral ca. death TRUE
9 249 Female 81.71887 I 0.170 Other death TRUE
10 190 Male 62.81897 unkn 0.170 Oral ca. death TRUE
# ... with 328 more rows
Combine your code and make it readable
orca %>%
mutate(dead = (event != "Alive")) %>%
select(age, sex, stage, time, dead) %>%
filter(sex == "Male", stage == "IV") %>%
arrange(time, desc(age)) %>% head()
# A tibble: 6 x 5
age sex stage time dead
<dbl> <fctr> <fctr> <dbl> <lgl>
1 62.62731 Male IV 0.085 TRUE
2 47.80966 Male IV 0.085 TRUE
3 65.27489 Male IV 0.252 TRUE
4 63.31179 Male IV 0.252 TRUE
5 68.43719 Male IV 0.329 TRUE
6 70.81919 Male IV 0.419 TRUE
http://www.cookbook-r.com/Graphs/
A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
data: the data-set comprised of variables that we map;geom: the geometric object in question, i.e. the objects we can observe in our plot (e.g. points, lines, bars, etc);aes: aesthetic attributes of the geometric object that we can perceive on a graphic (e.g. x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data-set).# an example ggplot(orca, aes(x = id, y = time)) + geom_linerange(aes(ymin = 0, ymax = time)) + geom_point(aes(color = event, shape = event)) + coord_flip()
# another example ggplot(data = orca, aes(x = age, y = time)) + geom_point(aes(shape = event, color = stage)) + geom_smooth(se = FALSE)
# first 6 observations head(orca$age) [1] 65.42274 83.08783 52.59008 77.08630 80.33622 82.58132
# summary statistics (range, quartiles, and mean) summary(orca$age) Min. 1st Qu. Median Mean 3rd Qu. Max. 15.15 53.24 64.86 63.51 74.29 92.24
# save sample size, mean, median, and standard deviation in a new object
stat_age = c(n = length(orca$age), mean = mean(orca$age),
median = median(orca$age), sd = sd(orca$age))
stat_age
n mean median sd
338.00000 63.51398 64.85735 14.43288
Most common graphical presentation: histogram (?geom_histogram) and …
ggplot(orca, aes(age)) + geom_histogram()
ggplot(orca, aes(x = time, y = ..density..)) + geom_histogram(bins = 25) + geom_line(stat = "density")
… box plot (?geom_boxplot)
ggplot(orca, aes(x = 1, y = age)) + geom_boxplot() + labs(x = "", y = "Age (yr)") + theme_classic()
Table of frequencies are the typical way to summarize a categorical variable
# summary(orca$event)
tab = table(orca$event)
tab
Alive Oral ca. death Other death
109 122 107
# for proportions
round(prop.table(tab), 2)
Alive Oral ca. death Other death
0.32 0.36 0.32
Most common graphical presentation: bar plot (?geom_bar)
ggplot(orca, aes(event)) + geom_bar()
library(scales) ggplot(orca, aes(event)) + geom_bar(aes(y = ..count../sum(..count..))) + labs(x = "", y = "Frequency") + scale_y_continuous(labels = percent)
table(orca$stage) I II III IV unkn 50 77 72 68 71
orca %>% group_by(stage) %>% summarize(mean = mean(age), se = sd(age)) # A tibble: 5 x 3 stage mean se <fctr> <dbl> <dbl> 1 I 62.88139 14.87824 2 II 64.63002 14.63587 3 III 61.69197 13.64211 4 IV 64.20545 13.77372 5 unkn 63.93454 15.46824
Graphical presentation: box plot
ggplot(orca, aes(x = stage, y = age)) + geom_boxplot() + coord_flip()
mod = lm(age ~ stage, data = orca)
summary(mod)
Call:
lm(formula = age ~ stage, data = orca)
Residuals:
Min 1Q Median 3Q Max
-47.73 -10.22 1.13 10.45 28.30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.881 2.047 30.712 <2e-16 ***
stageII 1.749 2.630 0.665 0.507
stageIII -1.189 2.665 -0.446 0.656
stageIV 1.324 2.697 0.491 0.624
stageunkn 1.053 2.673 0.394 0.694
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.48 on 333 degrees of freedom
Multiple R-squared: 0.005698, Adjusted R-squared: -0.006245
F-statistic: 0.4771 on 4 and 333 DF, p-value: 0.7526
Graphically compare the distributions
ggplot(orca, aes(x = age, col = stage)) + geom_line(stat = "density")
ggplot(orca, aes(x = age, group = stage)) + geom_line(stat = "density") + facet_grid(stage ~ .)
# create binary variable 'dead' (0 alive 1 dead)
orca = mutate(orca, dead = factor(event != "Alive", labels = c("Alive", "Dead")))
tab2 = with(orca, table(sex, dead))
addmargins(tab2)
dead
sex Alive Dead Sum
Female 50 102 152
Male 59 127 186
Sum 109 229 338
# get proportion (margin: 1 by row, 2 by col)
100*prop.table(tab2, margin = 2)
dead
sex Alive Dead
Female 45.87156 44.54148
Male 54.12844 55.45852
# Test the association
chisq.test(tab2)
Pearson's Chi-squared test with Yates' continuity correction
data: tab2
X-squared = 0.012725, df = 1, p-value = 0.9102
Check out the Epi package
library(Epi)
stat.table(list(sex, dead),
list(count(), percent(sex)),
orca, margin = T)
---------------------------------
----------dead-----------
sex Alive Dead Total
---------------------------------
Female 50 102 152
45.9 44.5 45.0
Male 59 127 186
54.1 55.5 55.0
Total 109 229 338
100.0 100.0 100.0
---------------------------------
with(orca, twoby2(sex, relevel(dead, 2)))
2 by 2 table analysis:
------------------------------------------------------
Outcome : Dead
Comparing : Female vs. Male
Dead Alive P(Dead) 95% conf. interval
Female 102 50 0.6711 0.5926 0.7410
Male 127 59 0.6828 0.6125 0.7456
95% conf. interval
Relative Risk: 0.9828 0.8474 1.1399
Sample Odds Ratio: 0.9477 0.5994 1.4984
Conditional MLE Odds Ratio: 0.9479 0.5845 1.5396
Probability difference: -0.0117 -0.1118 0.0870
Exact P-value: 0.9069
Asymptotic P-value: 0.8183
------------------------------------------------------
Or alternatively with the epitools package
library(epitools) with(orca, epitab(table(dead, sex), verbose = T))
Graphical presentation: bar plot
ggplot(orca, aes(x = dead, group = sex)) +
geom_bar(aes(y = ..prop.., fill = sex),
position = "dodge") +
scale_y_continuous(labels = percent)
ggplot(orca, aes(x = dead, group = sex)) + geom_bar(aes(y = ..prop..), position = "dodge") + facet_grid(~ sex) + scale_y_continuous(labels = percent)
If time to event was not of interest (not this case)
mod_bin = glm(dead ~ relevel(sex, 2), data = orca, family = "binomial")
# presenting odds ratio
ci.exp(mod_bin)
exp(Est.) 2.5% 97.5%
(Intercept) 2.1525424 1.580672 2.931309
relevel(sex, 2)Female 0.9477165 0.599421 1.498390
# predicted probabilities
orca %>%
mutate(prob = predict(mod_bin, type = "response")) %>%
head()
# A tibble: 6 x 8
id sex age stage time event dead prob
<int> <fctr> <dbl> <fctr> <dbl> <fctr> <fctr> <dbl>
1 1 Male 65.42274 unkn 5.081 Alive Alive 0.6827957
2 2 Female 83.08783 III 0.419 Oral ca. death Dead 0.6710526
3 3 Male 52.59008 II 7.915 Other death Dead 0.6827957
4 4 Male 77.08630 I 2.480 Other death Dead 0.6827957
5 5 Male 80.33622 IV 2.500 Oral ca. death Dead 0.6827957
6 6 Female 82.58132 IV 0.167 Other death Dead 0.6710526
orca %>%
group_by(sex, dead) %>%
summarise(mean = mean(age), sd = sd(age))
# A tibble: 4 x 4
# Groups: sex [?]
sex dead mean sd
<fctr> <fctr> <dbl> <dbl>
1 Female Alive 61.12249 14.23391
2 Female Dead 71.21227 12.74809
3 Male Alive 51.94397 14.06550
4 Male Dead 63.64768 12.00282
stat.table(index = list(sex, dead),
contents = list(mean(age), sd(age)),
data = orca, margins = T)
---------------------------------
----------dead-----------
sex Alive Dead Total
---------------------------------
Female 61.12 71.21 67.89
14.23 12.75 14.04
Male 51.94 63.65 59.94
14.07 12.00 13.78
Total 56.15 67.02 63.51
14.81 12.88 14.43
---------------------------------
ggplot(orca, aes(x = age, col = dead)) + geom_line(stat = "density") + facet_grid(~ sex)
library(plotly)
ggplotly(
ggplot(orca, aes(x = id, y = time)) + geom_linerange(aes(ymin = 0, ymax = time)) +
geom_point(aes(shape = event, color = event)) + scale_shape_manual(values = c(1, 3, 4)) +
labs(y = "Time (years)", x = "Subject ID") + coord_flip() + theme_classic()
)
The Surv() function creates a survival object, i.e. the time variable and the failure or status indicator
library(survival) su_obj = Surv(orca$time, orca$dead == "Dead") head(su_obj, n = 20) [1] 5.081+ 0.419 7.915 2.480 2.500 0.167 5.925+ 1.503 13.333 7.666+ 2.834+ 0.914 3.083 4.668 [15] 4.816 2.166 5.837 0.162 3.666 0.504
The created survival object is then used as response variable in specific functions for survival analysis.
The survfit() function by the default estimate then survival curve using the Kaplan–Meier estimator
fit_km = survfit(su_obj ~ 1, data = orca)
print(fit_km, print.rmean = TRUE)
Call: survfit(formula = su_obj ~ 1, data = orca)
n events *rmean *se(rmean) median 0.95LCL 0.95UCL
338.000 229.000 8.060 0.465 5.418 4.331 6.916
* restricted mean with upper limit = 23.3
The fortify() function of the ggfortify package is useful to extract the whole survival table in a data.frame.
library(ggfortify) fortify(fit_km) %>% head() time n.risk n.event n.censor surv std.err upper lower 1 0.085 338 2 0 0.9940828 0.004196498 1.0000000 0.9859401 2 0.162 336 2 0 0.9881657 0.005952486 0.9997618 0.9767041 3 0.167 334 4 0 0.9763314 0.008468952 0.9926726 0.9602592 4 0.170 330 2 0 0.9704142 0.009497400 0.9886472 0.9525175 5 0.246 328 1 0 0.9674556 0.009976176 0.9865584 0.9487228 6 0.249 327 1 0 0.9644970 0.010435745 0.9844277 0.9449699
Change the argument fun = events for the cumulative proportion, cumhaz for the cumulative hazard, cloglog for the complementary log−log
library(survminer)
ggsurvplot(fit_km, risk.table = TRUE, fun = "pct", censor = T,
xlab = "Time (years)", palette = "black", legend = "none")
mc = data.frame(q = c(.25, .5, .75), km = quantile(fit_km))
mc
q km.quantile km.lower km.upper
25 0.25 1.333 1.084 1.834
50 0.50 5.418 4.331 6.916
75 0.75 13.673 11.748 16.580
ggsurvplot(fit_km, xlab = "Time (years)", censor = F, legend = "none")$plot +
geom_segment(data = mc, aes(x = km.quantile, y = 1-q, xend = km.quantile,
yend = 0), lty = 2) +
geom_segment(data = mc, aes(x = 0, y = 1-q, xend = km.quantile, yend = 1-q), lty = 2)
Check ?flexsurvreg for a description of the different distributions
library(flexsurv)
fit_wei = flexsurvreg(su_obj ~ 1, data = orca, dist = "weibull")
fit_wei
Call:
flexsurvreg(formula = su_obj ~ 1, data = orca, dist = "weibull")
Estimates:
est L95% U95% se
shape 0.821 0.737 0.914 0.045
scale 8.426 7.193 9.871 0.680
N = 338, Events: 229, Censored: 109
Total time at risk: 1913.673
Log-likelihood = -708.0542, df = 2
AIC = 1420.108
data.frame(summary(fit_wei)) %>% head() time est lcl ucl 1 0.085 0.9772491 0.9672428 0.9844084 2 0.162 0.9616852 0.9471458 0.9725004 3 0.167 0.9607367 0.9459484 0.9717434 4 0.170 0.9601706 0.9452349 0.9712906 5 0.246 0.9464460 0.9281915 0.9602414 6 0.249 0.9459254 0.9275537 0.9598219
library(gridExtra)
grid.arrange(
ggplot(data.frame(summary(fit_wei)), aes(time, est)) +
geom_line() + labs(x = "Time (years)", y = "Survival"),
ggplot(data.frame(summary(fit_wei, type = "hazard")), aes(time, est)) +
geom_line() + labs(x = "Time (years)", y = "Hazard"), ncol = 2
)
su_stg = survfit(su_obj ~ stage, data = orca)
su_stg
Call: survfit(formula = su_obj ~ stage, data = orca)
n events median 0.95LCL 0.95UCL
stage=I 50 25 10.56 6.17 NA
stage=II 77 51 7.92 4.92 13.34
stage=III 72 51 7.41 3.92 9.90
stage=IV 68 57 2.00 1.08 4.82
stage=unkn 71 45 3.67 2.83 8.17
survdiff(su_obj ~ stage, data = orca)
Call:
survdiff(formula = su_obj ~ stage, data = orca)
N Observed Expected (O-E)^2/E (O-E)^2/V
stage=I 50 25 39.9 5.573 6.813
stage=II 77 51 63.9 2.606 3.662
stage=III 72 51 54.1 0.174 0.231
stage=IV 68 57 33.2 16.966 20.103
stage=unkn 71 45 37.9 1.346 1.642
Chisq= 27.2 on 4 degrees of freedom, p= 1.78e-05
Change the argument fun = events to compare cumulative proportions, cumhaz for the cumulative hazards, cloglog for the complementary log−log
ggsurvplot(su_stg, fun = "event", censor = F, xlab = "Time (years)")
m1 = coxph(su_obj ~ sex + I((age-65)/10) + stage, data = orca)
ci.exp(m1)
exp(Est.) 2.5% 97.5%
sexMale 1.421036 1.0770928 1.874809
I((age - 65)/10) 1.515929 1.3572535 1.693156
stageII 1.035537 0.6385626 1.679298
stageIII 1.412619 0.8727664 2.286398
stageIV 2.423992 1.5063276 3.900703
stageunkn 1.793932 1.0963386 2.935398
# check proportionallity of the hazard
cox.zph(m1)
rho chisq p
sexMale -0.00137 0.000439 0.983
I((age - 65)/10) 0.07539 1.393597 0.238
stageII -0.04208 0.411652 0.521
stageIII -0.06915 1.083755 0.298
stageIV -0.10044 2.301780 0.129
stageunkn -0.09663 2.082042 0.149
GLOBAL NA 4.895492 0.557
newd = data.frame(sex = "Male", age = 40, stage = "IV") fortify(survfit(m1, newdata = newd)) %>% ggplot(aes(x = time, y = surv)) + geom_step() + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = .2) + labs(x = "Time (years)", y = "Survival probability") + theme_classic()
library(shiny)
shinyApp(
ui = fluidPage(
titlePanel("Prediction from Cox model"),
sidebarLayout(
sidebarPanel(
sliderInput("age", "Select age:", min = 50, max = 75, value = 60),
selectInput("stage", "Select tumoral stage:", levels(orca$stage), "IV"),
radioButtons("sex", "Select sex:", levels(orca$sex), "Male")
),
mainPanel(
plotOutput("surv")
)
)),
server = function(input, output){
output$surv = renderPlot({
newd = reactive({
data.frame(sex = input$sex, age = input$age, stage = input$stage)
})
fortify(survfit(m1, newdata = newd())) %>%
ggplot(aes(x = time, y = surv)) +
geom_step() + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = .2) +
labs(x = "Time (years)", y = "Survival probability") +
theme_classic()
})
}
)
m2 = flexsurvreg(su_obj ~ sex + I((age-65)/10) + stage, data = orca, dist = "weibull")
m2
Call:
flexsurvreg(formula = su_obj ~ sex + I((age - 65)/10) + stage,
data = orca, dist = "weibull")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 0.9181 0.8272 1.0191 0.0489 NA NA NA
scale NA 14.5312 9.1847 22.9898 3.4012 NA NA NA
sexMale 0.5503 -0.4276 -0.7283 -0.1269 0.1534 0.6521 0.4827 0.8808
I((age - 65)/10) -0.1486 -0.4613 -0.5817 -0.3409 0.0614 0.6305 0.5590 0.7111
stageII 0.2278 -0.0144 -0.5397 0.5109 0.2680 0.9857 0.5829 1.6668
stageIII 0.2130 -0.3538 -0.8783 0.1706 0.2676 0.7020 0.4155 1.1860
stageIV 0.2012 -0.9674 -1.4889 -0.4458 0.2661 0.3801 0.2256 0.6403
stageunkn 0.2101 -0.6513 -1.1863 -0.1163 0.2730 0.5214 0.3053 0.8902
N = 338, Events: 229, Censored: 109
Total time at risk: 1913.673
Log-likelihood = -662.2349, df = 8
AIC = 1340.47
newd = data_frame(sex = "Male", age = 40, stage = "IV") data.frame(unname(summary(m2, newdata = newd))) %>% ggplot(aes(x = time, y = est)) + geom_step() + geom_ribbon(aes(ymin = lcl, ymax = ucl), alpha = .2) + labs(x = "Time (years)", y = "Survival probability") + theme_classic()