Day 1: Introduction to R

Task 1. Download R

Task 2. Install packages

# install.packages(c("lessR", "table1", "simpleboot", "boot", "gapminder", "ggfortify", "DescTools", "epiDisplay", "BMA", "ggplot2", "gridExtra", "metafor", "MatchIt", "cobalt"), dependencies = TRUE)
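
Running install.packages() on the full list reinstalls every package each time. A minimal sketch, assuming you only want to find out which packages are still missing; `wanted` and `missing_pkgs` are illustrative names, and the list is shortened to three packages:

```r
# Packages needed for the course (shortened list for illustration)
wanted <- c("lessR", "table1", "ggplot2")

# Names of packages already installed in the current library paths
have <- rownames(installed.packages())

# Keep only the ones that are not installed yet
missing_pkgs <- setdiff(wanted, have)

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them:
  # install.packages(missing_pkgs, dependencies = TRUE)
} else {
  message("All packages are already installed.")
}
```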

Task 3. Read the data into R

df = read.csv("D:\\NCKH GS Tuan\\Stroke Data.csv")
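
The path above is specific to one Windows machine. The sketch below shows the same read.csv() step in a form that runs anywhere: it writes a tiny stand-in file to a temporary folder and reads it back. The file name `toy_stroke.csv`, its contents, and the `na.strings` choices are assumptions for illustration; check how missing values are written in your own file.

```r
# Build a small stand-in file so the example runs without the course data
csv_path <- file.path(tempdir(), "toy_stroke.csv")   # hypothetical file name
toy <- data.frame(id = 1:3,
                  gender = c("Female", "Male", "Female"),
                  bmi = c(18.6, NA, 31.5))
write.csv(toy, csv_path, row.names = FALSE)

# Read it back; "NA", "N/A" and empty cells are all treated as missing
df_toy <- read.csv(csv_path, na.strings = c("NA", "N/A", ""))

dim(df_toy)   # number of rows and columns
str(df_toy)   # type of each column
```

file.path() joins folder and file names with a separator that works on every platform, which avoids the doubled backslashes.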

Task 4. Information about the dataset

4.1 How many variables and observations are there?

dim(df)
## [1] 5110   12

4.2 List the first 10 observations

head(df, 10)
##     id gender age hypertension heart.disease ever.married work.type
## 1   67 Female  17            0             0           No   Private
## 2   77 Female  13            0             0           No  children
## 3   84   Male  55            0             0          Yes   Private
## 4   91 Female  42            0             0           No   Private
## 5   99 Female  31            0             0           No   Private
## 6  121 Female  38            0             0          Yes   Private
## 7  129 Female  24            0             0           No   Private
## 8  132 Female  80            0             0          Yes  Govt_job
## 9  156 Female  33            0             0          Yes   Private
## 10 163 Female  20            0             0           No   Private
##    Residence.type glucose.level  bmi         smoking stroke
## 1           Urban         92.97   NA formerly smoked      0
## 2           Rural         85.81 18.6         Unknown      0
## 3           Urban         89.17 31.5    never smoked      0
## 4           Urban         98.53 18.5    never smoked      0
## 5           Urban        108.89 52.3         Unknown      0
## 6           Urban         91.44   NA         Unknown      0
## 7           Urban         97.55 26.2    never smoked      0
## 8           Urban         84.86   NA         Unknown      0
## 9           Rural         86.97 42.2    never smoked      0
## 10          Rural         94.67 28.8         Unknown      0

4.3 List the last 6 observations

tail(df)
##         id gender age hypertension heart.disease ever.married work.type
## 5105 72882   Male  47            0             0          Yes   Private
## 5106 72911 Female  57            1             0          Yes   Private
## 5107 72914 Female  19            0             0           No   Private
## 5108 72915 Female  45            0             0          Yes   Private
## 5109 72918 Female  53            1             0          Yes   Private
## 5110 72940 Female   2            0             0           No  children
##      Residence.type glucose.level  bmi         smoking stroke
## 5105          Rural         75.30 25.0 formerly smoked      0
## 5106          Rural        129.54 60.9          smokes      0
## 5107          Urban         90.57 24.2         Unknown      0
## 5108          Urban        172.33 45.3 formerly smoked      0
## 5109          Urban         62.55 30.3         Unknown      1
## 5110          Urban        102.92 17.6         Unknown      0

4.4 Summarize the data, noting missing or unusual values

summary(df)
##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##                                                                      
##  heart.disease     ever.married        work.type         Residence.type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##                                                                            
##  glucose.level         bmi          smoking              stroke       
##  Min.   : 55.12   Min.   :10.30   Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25   1st Qu.:23.50   Class :character   1st Qu.:0.00000  
##  Median : 91.89   Median :28.10   Mode  :character   Median :0.00000  
##  Mean   :106.15   Mean   :28.89                      Mean   :0.04873  
##  3rd Qu.:114.09   3rd Qu.:33.10                      3rd Qu.:0.00000  
##  Max.   :271.74   Max.   :97.60                      Max.   :1.00000  
##                   NA's   :201

Values to double-check: age has a minimum of 0.08; bmi has a maximum of 97.60; bmi has 201 missing values (NA's).
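
One way to follow up on such values is to count the missing entries per column and pull out the suspicious rows. This is a sketch on a made-up data frame `toy`, not the course data; the BMI cut-off of 60 is an arbitrary assumption chosen only to flag extreme values.

```r
# Toy data with the same kinds of issues (values are made up)
toy <- data.frame(age = c(0.08, 45, 67, 1.32),
                  bmi = c(NA, 97.6, 28.1, 17.6))

# Number of missing values in each column
na_count <- colSums(is.na(toy))
na_count

# Rows with age below 1 year (possibly infants recorded in fractions of a year)
infants <- toy[toy$age < 1, ]
infants

# Rows with BMI above 60; which() drops rows where bmi is NA
extreme_bmi <- toy[which(toy$bmi > 60), ]
extreme_bmi
```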

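Task 5. Edit the data

5.1. Recode gender as sex

Recode gender (Female/Male/Other) into a numeric variable sex with values 0/1/2 (0 = Female; 1 = Male; 2 = Other). In this file no observation has gender "Other", so the cross-tabulation shows only codes 0 and 1.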

df$sex[df$gender == "Female"] = 0
df$sex[df$gender == "Male"] = 1
df$sex[df$gender == "Other"] = 2

head(df)
##    id gender age hypertension heart.disease ever.married work.type
## 1  67 Female  17            0             0           No   Private
## 2  77 Female  13            0             0           No  children
## 3  84   Male  55            0             0          Yes   Private
## 4  91 Female  42            0             0           No   Private
## 5  99 Female  31            0             0           No   Private
## 6 121 Female  38            0             0          Yes   Private
##   Residence.type glucose.level  bmi         smoking stroke sex
## 1          Urban         92.97   NA formerly smoked      0   0
## 2          Rural         85.81 18.6         Unknown      0   0
## 3          Urban         89.17 31.5    never smoked      0   1
## 4          Urban         98.53 18.5    never smoked      0   0
## 5          Urban        108.89 52.3         Unknown      0   0
## 6          Urban         91.44   NA         Unknown      0   0
table(df$sex, df$gender)
##    
##     Female Male
##   0   2994    0
##   1      0 2116
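
Recoding one value at a time leaves NA for any spelling that was not listed, and table() hides NA by default. A minimal sketch on a made-up vector, using match() as a lookup and `useNA = "ifany"` to expose anything that was not recoded:

```r
# Toy gender values, including one unexpected spelling (made up)
gender <- c("Female", "Male", "Other", "female", "Male")

# Position of each value in the level list, shifted to 0/1/2
sex <- match(gender, c("Female", "Male", "Other")) - 1

# useNA = "ifany" adds an NA row when some values were not recoded
check <- table(sex, gender, useNA = "ifany")
check
```

Here "female" (lower case) matches none of the three levels, so it shows up in the NA row instead of silently disappearing.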

5.2. Recode bmi as bmi_cat

If bmi < 18.5, then bmi_cat = "Underweight"
If 18.5 ≤ bmi < 25.0, then bmi_cat = "Normal"
If 25.0 ≤ bmi < 30.0, then bmi_cat = "Overweight"
If bmi ≥ 30.0, then bmi_cat = "Obese"

df$bmi_cat[df$bmi < 18.5] = "Underweight"
df$bmi_cat[df$bmi>= 18.5 & df$bmi< 25] = "Normal"
df$bmi_cat[df$bmi>=25 & df$bmi< 30] = "Overweight"
df$bmi_cat[df$bmi >= 30] = "Obese"

table(df$bmi_cat)
## 
##      Normal       Obese  Overweight Underweight 
##        1243        1920        1409         337
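
The same four categories can also be produced in one step with cut(). This is an alternative sketch, not the method used above; the BMI values are made up and sit on the category boundaries to show which side each boundary falls on.

```r
# Made-up BMI values at the category boundaries, plus one missing value
bmi <- c(17.0, 18.5, 24.9, 25.0, 29.9, 30.0, NA)

bmi_cat <- cut(bmi,
               breaks = c(-Inf, 18.5, 25, 30, Inf),
               labels = c("Underweight", "Normal", "Overweight", "Obese"),
               right = FALSE)   # intervals are [lower, upper)

data.frame(bmi, bmi_cat)
table(bmi_cat, useNA = "ifany")
```

cut() returns a factor whose levels keep the order given in `labels`, so tables list the categories from Underweight to Obese rather than alphabetically. A missing bmi stays NA.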

5.3. The variable stroke1

df$stroke1 = as.factor(df$stroke)

table(df$stroke1, df$stroke)
##    
##        0    1
##   0 4861    0
##   1    0  249
head(df)
##    id gender age hypertension heart.disease ever.married work.type
## 1  67 Female  17            0             0           No   Private
## 2  77 Female  13            0             0           No  children
## 3  84   Male  55            0             0          Yes   Private
## 4  91 Female  42            0             0           No   Private
## 5  99 Female  31            0             0           No   Private
## 6 121 Female  38            0             0          Yes   Private
##   Residence.type glucose.level  bmi         smoking stroke sex bmi_cat stroke1
## 1          Urban         92.97   NA formerly smoked      0   0    <NA>       0
## 2          Rural         85.81 18.6         Unknown      0   0  Normal       0
## 3          Urban         89.17 31.5    never smoked      0   1   Obese       0
## 4          Urban         98.53 18.5    never smoked      0   0  Normal       0
## 5          Urban        108.89 52.3         Unknown      0   0   Obese       0
## 6          Urban         91.44   NA         Unknown      0   0    <NA>       0

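How does stroke1 differ from stroke?

stroke is stored as a number, so R treats it as a continuous variable and reports the mean, SD, and so on, which is not a sensible summary for a 0/1 outcome.

stroke1 is a categorical variable (a factor), so R reports the count and percentage in each category.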
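
A small sketch of the difference, on a made-up 0/1 vector rather than the course data:

```r
stroke <- c(0, 0, 0, 1)            # made-up 0/1 outcome
stroke1 <- as.factor(stroke)

mean(stroke)                       # proportion of 1s, because the coding is 0/1
summary(stroke)                    # quartiles and mean, as for a continuous variable

summary(stroke1)                   # count in each category
pct <- prop.table(table(stroke1)) * 100
pct                                # percentage in each category
```

With 0/1 coding the mean is just the proportion of 1s: in the summary(df) output, the mean of stroke (0.04873) equals 249/5110.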

Task 6. Descriptive analysis

6.1 Describe age (age), sex (gender), hypertension (hypertension), heart disease (heart.disease), marital status (ever.married), work type (work.type), place of residence (Residence.type), blood glucose level (glucose.level), body mass index (bmi), and smoking status (smoking) by stroke status (stroke)

library(table1)
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ age + gender + hypertension + heart.disease + ever.married + work.type + Residence.type + glucose.level + bmi + smoking | stroke, data = df)
## Warning in table1.formula(~age + gender + hypertension + heart.disease + :
## Terms to the right of '|' in formula 'x' define table columns and are expected
## to be factors with meaningful labels.

                      0                     1                     Overall
                      (N=4861)              (N=249)               (N=5110)
age
  Mean (SD)           42.0 (22.3)           67.7 (12.7)           43.2 (22.6)
  Median [Min, Max]   43.0 [0.0800, 82.0]   71.0 [1.32, 82.0]     45.0 [0.0800, 82.0]
gender
  Female              2853 (58.7%)          141 (56.6%)           2994 (58.6%)
  Male                2008 (41.3%)          108 (43.4%)           2116 (41.4%)
hypertension
  Mean (SD)           0.0889 (0.285)        0.265 (0.442)         0.0975 (0.297)
  Median [Min, Max]   0 [0, 1.00]           0 [0, 1.00]           0 [0, 1.00]
heart.disease
  Mean (SD)           0.0471 (0.212)        0.189 (0.392)         0.0540 (0.226)
  Median [Min, Max]   0 [0, 1.00]           0 [0, 1.00]           0 [0, 1.00]
ever.married
  No                  1728 (35.5%)          29 (11.6%)            1757 (34.4%)
  Yes                 3133 (64.5%)          220 (88.4%)           3353 (65.6%)
work.type
  children            685 (14.1%)           2 (0.8%)              687 (13.4%)
  Govt_job            624 (12.8%)           33 (13.3%)            657 (12.9%)
  Never_worked        22 (0.5%)             0 (0%)                22 (0.4%)
  Private             2776 (57.1%)          149 (59.8%)           2925 (57.2%)
  Self-employed       754 (15.5%)           65 (26.1%)            819 (16.0%)
Residence.type
  Rural               2400 (49.4%)          114 (45.8%)           2514 (49.2%)
  Urban               2461 (50.6%)          135 (54.2%)           2596 (50.8%)
glucose.level
  Mean (SD)           105 (43.8)            133 (61.9)            106 (45.3)
  Median [Min, Max]   91.5 [55.1, 268]      105 [56.1, 272]       91.9 [55.1, 272]
bmi
  Mean (SD)           28.8 (7.91)           30.5 (6.33)           28.9 (7.85)
  Median [Min, Max]   28.0 [10.3, 97.6]     29.7 [16.9, 56.6]     28.1 [10.3, 97.6]
  Missing             161 (3.3%)            40 (16.1%)            201 (3.9%)
smoking
  formerly smoked     815 (16.8%)           70 (28.1%)            885 (17.3%)
  never smoked        1802 (37.1%)          90 (36.1%)            1892 (37.0%)
  smokes              747 (15.4%)           42 (16.9%)            789 (15.4%)
  Unknown             1497 (30.8%)          47 (18.9%)            1544 (30.2%)

table1(~ hypertension + as.factor(hypertension) + heart.disease + as.factor(heart.disease) | stroke, data = df)
## Warning in table1.formula(~hypertension + as.factor(hypertension) +
## heart.disease + : Terms to the right of '|' in formula 'x' define table columns
## and are expected to be factors with meaningful labels.

                      0                     1                     Overall
                      (N=4861)              (N=249)               (N=5110)
hypertension
  Mean (SD)           0.0889 (0.285)        0.265 (0.442)         0.0975 (0.297)
  Median [Min, Max]   0 [0, 1.00]           0 [0, 1.00]           0 [0, 1.00]
as.factor(hypertension)
  0                   4429 (91.1%)          183 (73.5%)           4612 (90.3%)
  1                   432 (8.9%)            66 (26.5%)            498 (9.7%)
heart.disease
  Mean (SD)           0.0471 (0.212)        0.189 (0.392)         0.0540 (0.226)
  Median [Min, Max]   0 [0, 1.00]           0 [0, 1.00]           0 [0, 1.00]
as.factor(heart.disease)
  0                   4632 (95.3%)          202 (81.1%)           4834 (94.6%)
  1                   229 (4.7%)            47 (18.9%)            276 (5.4%)
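
The warnings from table1() appear because the column variable stroke is numeric, and the numeric 0/1 predictors are summarized with Mean (SD) instead of counts. A minimal sketch of one way to avoid both, assuming that code 1 means the condition is present; the data frame `toy`, the new column names, and the labels are all illustrative:

```r
# Toy data standing in for df (values are made up)
toy <- data.frame(
  age          = c(34, 71, 58, 80, 45, 66),
  hypertension = c(0, 1, 0, 1, 0, 0),
  stroke       = c(0, 1, 0, 1, 0, 0)
)

# Convert the 0/1 codes to factors with readable labels
toy$stroke_f <- factor(toy$stroke, levels = c(0, 1),
                       labels = c("No stroke", "Stroke"))
toy$hypertension_f <- factor(toy$hypertension, levels = c(0, 1),
                             labels = c("No", "Yes"))

# Build the table only if the table1 package is installed
if (requireNamespace("table1", quietly = TRUE)) {
  tab <- table1::table1(~ age + hypertension_f | stroke_f, data = toy)
  # print(tab) displays it (as HTML in the viewer or in a knitted report)
}
```

With a factor on the right of `|`, the column headers show the labels instead of 0 and 1, and the factor predictors are reported as n (%).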