1 TABLE OF CONTENTS

2 INTRODUCTION

2.1 REASONS FOR CHOOSING THE TOPIC

Consider comparing two treatments for diseases with high mortality, such as AIDS or cancer. If the analysis model, for example logistic regression, looks only at the outcome variable (alive/dead or cured/not cured) and ignores time, it will sometimes find no difference between the two treatments because the mortality rates are nearly equal, even though the time to death may differ between the two groups. As another example, when comparing two antibiotics for typhoid fever, the cure rates may be the same while the time to fever clearance differs between the two groups, so a survival analysis model is needed to detect the difference. A study design that describes the outcome only as a binary variable (alive/dead, fever resolved/fever persisting) is therefore important but incomplete.
Subjects who are still alive, including those who dropped out, are called censored: the event has not been observed for them. Subjects who died, or whose fever resolved (in the typhoid study, for example), are called events: the event has occurred.
So when we need to analyze how factors affect an outcome (dependent variable) that has a time dimension, we need a model designed for that purpose. That model is survival analysis, also called event analysis.

2.2 PURPOSE OF THE PROJECT

Understand the concept of survival analysis.
Learn the basics of the survival package and how to use it for survival analysis.
Compare survival analysis with linear regression.
Understand the advantages of survival analysis.
Be able to use survival analysis methods such as:
- Kaplan-Meier plots to draw survival curves.
- The log-rank test to compare the survival curves of two or more groups.
- Cox proportional hazards regression to describe the effect of variables on survival.

2.3 STRUCTURE OF THE ESSAY

Reasons for choosing the topic.
Subjects and data used in the project.
Overview of the survival package and survival analysis.
Survival analysis in R.
Summary of the essay on survival analysis.

3 CONTENT

3.1 PREPARING THE PACKAGES AND DATA FOR THE ANALYSIS

3.1.1 Loading the package

First, we install the survival package with install.packages("survival"). Then we load the package from the library so that it can be used:

library(survival)
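If the script is run repeatedly, or on a machine where the package is already present, a conditional install avoids reinstalling it each time. The following is a minimal sketch using base R's requireNamespace(); it is one common pattern, not the only way to do this.

```r
# Install survival only when it is missing, then attach it.
if (!requireNamespace("survival", quietly = TRUE)) {
  install.packages("survival")
}
library(survival)
```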

3.1.2 Preparing the data set for the analysis

I have a data set in .rds format containing simulated cases from an Ebola outbreak. An RDS file (.rds) is an R-specific file format that stores a single R object, here a data frame. It is useful for storing cleaned data because it preserves the R column types. To import the data flexibly, I use two helper packages, rio and here:

library(rio)
library(here)

## here() starts at C:/Users/84896/Documents

Next, I import the data with the import() function from the rio package:

ebola1 <- rio::import("ebola.rds")
head(ebola1,3)

##   case_id generation date_infection date_onset date_hospitalisation
## 1  5fe599          4     2014-05-08 2014-05-13           2014-05-15
## 2  8689b7          4           <NA> 2014-05-13           2014-05-14
## 3  11f8ea          2           <NA> 2014-05-16           2014-05-18
##   date_outcome outcome gender age age_unit age_years age_cat age_cat5
## 1         <NA>    <NA>      m   2    years         2     0-4      0-4
## 2   2014-05-18 Recover      f   3    years         3     0-4      0-4
## 3   2014-05-30 Recover      m  56    years        56   50-69    55-59
##                               hospital       lon      lat infector source wt_kg
## 1                                Other -13.21574 8.468973   f547d6  other    27
## 2                              Missing -13.21523 8.451719     <NA>   <NA>    25
## 3 St. Mark's Maternity Hospital (SMMH) -13.21291 8.464817     <NA>   <NA>    91
##   ht_cm ct_blood fever chills cough aches vomit temp time_admission       bmi
## 1    48       22    no     no   yes    no   yes 36.8           <NA> 117.18750
## 2    59       22  <NA>   <NA>  <NA>  <NA>  <NA> 36.9          09:36  71.81844
## 3   238       21  <NA>   <NA>  <NA>  <NA>  <NA> 36.9          16:48  16.06525
##   days_onset_hosp
## 1               2
## 2               1
## 3               2
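The claim that an .rds file preserves column types can be checked with a small round trip. The sketch below uses base R's saveRDS() and readRDS() on a made-up data frame (the values and the name toy are illustrative and are not part of the Ebola data set).

```r
# Toy data frame with a Date column and a factor column (made-up values).
toy <- data.frame(
  case_id    = c("a1", "b2"),
  date_onset = as.Date(c("2014-05-13", "2014-05-16")),
  outcome    = factor(c("Recover", "Death"))
)

# Write to a temporary .rds file and read it back.
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
toy_back <- readRDS(path)

# Inspect the column classes after the round trip.
sapply(toy_back, class)
```

A CSV round trip would, by contrast, return the dates and the factor as plain text unless they are converted again after reading.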

3.1.3 Describing the data

Data for survival analysis must have the following features:
- The dependent variable is the length of time from the starting point until the event occurs.
- Censored observations are those for which the event of interest has not occurred by the time the data are analyzed.
- Predictor or explanatory variables are those whose effect on the time to event we want to assess or control for.
Therefore:
- We need a new data set, linelist_surv, for the survival analysis.
- The event of interest is "death"; the follow-up time (futime) is the number of days between disease onset and the outcome; censored patients are those who recovered or whose outcome is unknown.

3.1.4 Processing the data for the analysis

First, we load the helper packages used for data processing:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.2     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# dplyr and tidyr are already attached by library(tidyverse) above
library(utf8)
library(DT)
options(digits = 4)  # print numbers with 4 significant digits

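Create an event variable (event) that is 1 when the patient died and 0 when the patient recovered or the outcome is unknown, and a follow-up time variable (futime) measured in days. The filter keeps only cases whose outcome date is later than the onset date; rows with a missing date_outcome are dropped as well, because the comparison evaluates to NA for them.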

ebola1x <-  ebola1 %>% 
  dplyr::filter(date_outcome > date_onset) %>% 
  dplyr::mutate(
       event = ifelse(is.na(outcome) | outcome == "Recover", 0, 1), 
       futime = as.double(date_outcome - date_onset))

We look at the first six observations of the processed data set, restricted to the relevant variables:

ebola1xx <- ebola1x %>% select(case_id, date_onset, date_outcome, outcome, event, futime)
head(ebola1xx)

##   case_id date_onset date_outcome outcome event futime
## 1  8689b7 2014-05-13   2014-05-18 Recover     0      5
## 2  11f8ea 2014-05-16   2014-05-30 Recover     0     14
## 3  893f25 2014-05-21   2014-05-29 Recover     0      8
## 4  be99c8 2014-05-22   2014-05-24 Recover     0      2
## 5  07e3e8 2014-05-27   2014-06-01 Recover     0      5
## 6  369449 2014-06-02   2014-06-07   Death     1      5
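As a sanity check, the derived columns can be recomputed by hand from the six rows printed above. The sketch below retypes those rows into a small data frame (the name check is illustrative) and applies the same rules as the mutate() call, using base R only.

```r
# The six rows printed above, retyped by hand.
check <- data.frame(
  date_onset   = as.Date(c("2014-05-13", "2014-05-16", "2014-05-21",
                           "2014-05-22", "2014-05-27", "2014-06-02")),
  date_outcome = as.Date(c("2014-05-18", "2014-05-30", "2014-05-29",
                           "2014-05-24", "2014-06-01", "2014-06-07")),
  outcome      = c("Recover", "Recover", "Recover",
                   "Recover", "Recover", "Death")
)

# Same rule as in mutate(): recovered or unknown -> 0, otherwise 1.
check$event <- ifelse(is.na(check$outcome) | check$outcome == "Recover", 0, 1)

# Subtracting two Date vectors gives a difference in days.
check$futime <- as.double(check$date_outcome - check$date_onset)

check[, c("event", "futime")]

# A missing outcome is coded as censored (0), not as NA,
# because TRUE | NA evaluates to TRUE.
ifelse(is.na(NA_character_) | NA_character_ == "Recover", 0, 1)
```

The recomputed event and futime values should match the printed columns row by row.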

3.2 OVERVIEW OF THE SURVIVAL PACKAGE AND SURVIVAL ANALYSIS

3.2.1 INTRODUCTION TO THE SURVIVAL PACKAGE

3.2.1.1 History

The following account is taken from the survival package vignette and is written in the first person by the package author, Terry Therneau.
Work on the survival package began in 1985 in connection with the analysis of medical research data, without any realization at the time that the work would become a package. Eventually, the software was placed on the Statlib repository hosted by Carnegie Mellon University. Multiple versions were released in this fashion but I don’t have a list of the dates; version 2 was the first to make use of the print method that was introduced in 'New S' in 1988, which places that release somewhere in 1989. The library was eventually incorporated directly in S-Plus, and from there it became a standard part of R.
I suspect that one of the primary reasons for the package’s success is that all of the functions have been written to solve real analysis questions that arose from real data sets; theoretical issues were explored when necessary but they have never played a leading role. As a statistician in a major medical center, the central focus of my department is to advance medicine; statistics is a tool to that end. This also highlights one of the deficiencies of the package: if a particular analysis question has not yet arisen in one of my studies then the survival package will also have nothing to say on the topic. Luckily, there are many other R packages that build on or extend the survival package, and anyone working in the field (the author included) can expect to use more packages than just this one. I certainly never foresaw that the library would become as popular as it has.
This vignette is an introduction to version 3.x of the survival package. We can think of versions 1.x as the S-Plus era and 2.1 to 2.44 as maturation of the package in R. Version 3 had 4 major goals:
- Make multi-state curves and models as easy to use as an ordinary Kaplan-Meier and Cox model.
- Deeper support for absolute risk estimates.
- Consistent use of robust variance estimates.
- Clean up various naming inconsistencies that have arisen over time.
With over 600 dependent packages in 2019, not counting Bioconductor, other guiding lights of the change are:
- We can’t do everything (so don’t try).
- Allow other packages to build on this one. That means clear documentation of all of the results that are produced, the use of simple S3 objects that are easy to manipulate, and setting up many of the routines as a pair. For example, concordance and concordancefit; the former is the user front end and the latter does the actual work. Other package authors might want to access the lower level interface, while accepting the penalty of fewer error checks.
- Don’t mess it up!
This meant preserving the current argument names as much as possible. Appendix A.1 summarizes changes that were made which are not backwards compatible. The two other major changes are to collapse many of the vignettes into this single large one, and the parallel creation of an actual book. Documentation is an ongoing process, and there are still things the package can do which are not well described. That said, we’ve recognized that the package needs more than a vignette. With the book’s (eventual) appearance this vignette can also be more brief, essentially leaving out a lot of the theory. Version 3 will not appear all at once, however; it will take some time to get all of the documentation sorted out in the way that we like.

3.2.1.2 Survival data

The survival package is concerned with time-to-event analysis. Such outcomes arise very often in the analysis of medical data: time from chemotherapy to tumor recurrence, the durability of a joint replacement, recurrent lung infections in subjects with cystic fibrosis, the appearance of hypertension, hyperlipidemia and other comorbidities of age, and of course death itself, from which the overall label of survival analysis derives. A key principle of all such studies is that it takes time to observe time, which in turn leads to two of the primary challenges.
- 1. Incomplete information. At the time of an analysis, not everyone will have yet had the event. This is a form of partial information known as censoring: if a particular subject was enrolled in a study 2 years ago, and has not yet had an event at the time of analysis, we only know that their time to event is > 2 years.
- 2. Dated results. In order to report 5 year survival, say, from a treatment, patients need to be enrolled and then followed for 5+ years. By the time recruitment and follow-up are finished, the analysis is done, and the report is finally published, the treatment in question might be 8 years old and considered to be out of date. This leads to a tension between early reporting and long term outcomes.
Survival data is often represented as a pair (t_i, δ_i) where t is the time until endpoint or last follow-up, and δ is a 0/1 variable with 0 = subject was censored at t and 1 = subject had an event at t, or in R code as Surv(time, status). The status variable can be logical, e.g., vtype == "death" where vtype is a variable in the data set. An alternate view is to think of time to event data as a multi-state process as is shown in figure 1.1. The upper left panel is simple survival with two states of alive and dead, classic survival analysis. The other three panels show repeated events of the same type (upper right), competing risks for subjects on a liver transplant waiting list (lower left) and the illness-death model (lower right). In this approach interest normally centers on the transition rates or hazards (arrows) from state to state (box to box). For simple survival the two viewpoints, multistate/hazard and time-to-event, are equivalent, and we will move freely between them, i.e., use whichever viewpoint is handy at the moment. When there is more than one transition the rate approach is particularly useful. The figure also displays a 2 by 2 division of survival data sets, one that will be used to organize other subsections of this document.
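The pair representation can be illustrated with a short sketch. The data below are made up for illustration (the data frame toy and its values are assumptions; only Surv() and the vtype == "death" idiom come from the text above). It builds the same response once with a numeric 0/1 status and once with a logical status.

```r
library(survival)

# Made-up follow-up data: time in years and the type of the last visit.
toy <- data.frame(
  time  = c(2.0, 3.5, 1.2, 4.0),
  vtype = c("death", "last follow-up", "death", "last follow-up")
)

# A numeric 0/1 status and a logical status encode the same information.
s_num <- Surv(toy$time, ifelse(toy$vtype == "death", 1, 0))
s_lgl <- Surv(toy$time, toy$vtype == "death")

# Censored times are printed with a trailing "+".
print(s_lgl)

# The object stores the (t_i, delta_i) pairs as a two-column matrix.
unclass(s_lgl)[, c("time", "status")]
```

The logical form is convenient when the event type is stored as text, since no separate 0/1 column has to be created first.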

TOPIC: ESSAY ON THE SURVIVAL PACKAGE AND SURVIVAL ANALYSIS IN R

Members: Nguyễn Khánh An-2121012544 and Lê Phạm Thị Phương Tiền-2121006573

1 TABLE OF CONTENTS

2 INTRODUCTION

2.1 REASONS FOR CHOOSING THE TOPIC

2.2 PURPOSE OF THE PROJECT

2.3 STRUCTURE OF THE ESSAY

3 CONTENT

3.1 PREPARING THE PACKAGES AND DATA FOR THE ANALYSIS

3.1.1 Loading the package

3.1.2 Preparing the data set for the analysis

3.1.3 Describing the data

3.1.4 Processing the data for the analysis

3.2 OVERVIEW OF THE SURVIVAL PACKAGE AND SURVIVAL ANALYSIS

3.2.1 INTRODUCTION TO THE SURVIVAL PACKAGE

3.2.1.1 History

3.2.1.2 Survival data

3.2.2 BASIC AND COMMON FUNCTIONS IN THE SURVIVAL PACKAGE

3.2.3 WHAT IS SURVIVAL ANALYSIS?

3.2.4 BASIC TERMINOLOGY IN SURVIVAL ANALYSIS

3.2.5 KAPLAN-MEIER ANALYSIS AND THE COX MODEL

3.3 SURVIVAL ANALYSIS IN R WITH THE SURVIVAL PACKAGE

3.3.1 LOADING THE PACKAGE AND PREPARING THE DATA

3.3.2 KAPLAN-MEIER ANALYSIS

3.3.2.1 CUMULATIVE SURVIVAL PROBABILITY

3.3.2.2 COMPUTING THE SURVIVAL CURVE

3.3.2.3 PLOTTING THE SURVIVAL CURVES

3.3.2.4 LOG-RANK TEST

3.3.2.5 KAPLAN-MEIER LIFE TABLE

3.3.3 COX MODEL

3.3.3.1 ANALYZING THE COX MODEL

3.3.3.2 PLOTTING THE MODEL RESULTS

3.4 SUMMARY OF THE SURVIVAL PACKAGE

3.4.1 DETAILED COMPARISON OF SURVIVAL ANALYSIS AND LINEAR REGRESSION

3.4.2 ADVANTAGES OF SURVIVAL ANALYSIS

3.4.3 OTHER REAL-WORLD APPLICATIONS