Day 1: Introduction to R and Rmd

Task 1: Download R

Task 2: Download required packages
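A minimal sketch of this step, assuming the only add-on package needed is table1 (the one loaded later in these notes). The helper names missing_packages and install_if_missing are hypothetical, not part of R.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not installed yet.
missing_packages = function(pkgs) {
  installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only what is missing (needs an internet connection).
install_if_missing = function(pkgs) {
  to_install = missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Usage (run once, interactively):
# install_if_missing('table1')
```

Installing is needed only once per machine; library (table1) still has to be called in every new session.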

Task 3: Read data into R

df = read.csv('/Users/lecat/Downloads/AI_R_NVT/Stroke Data.csv')

Task 4: Summary of data

4.1: Number of variables and observations

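The counts are shown in the Environment panel, or can be printed with dim(), which returns the number of rows (observations) followed by the number of columns (variables):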

dim (df)

## [1] 5110   12

4.2: First 10 rows of df

head (df,10)

4.3: Last 6 rows of df

tail(df)

4.4: Summary of df

summary(df)

##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##                                                                      
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##                                                                            
##  avg_glucose_level      bmi        smoking_status         stroke       
##  Min.   : 55.12    Min.   :10.30   Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    1st Qu.:23.50   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Median :28.10   Mode  :character   Median :0.00000  
##  Mean   :106.15    Mean   :28.89                      Mean   :0.04873  
##  3rd Qu.:114.09    3rd Qu.:33.10                      3rd Qu.:0.00000  
##  Max.   :271.74    Max.   :97.60                      Max.   :1.00000  
##                    NA's   :201

NA represents a missing (blank) value. Here, bmi has 201 of them.

We need to decide how to handle missing values BEFORE starting the data analysis.

There are 2 main methods: the complete-case approach and multiple imputation.
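As a rough illustration of the first method, the sketch below counts missing values per column and then keeps only the complete rows. It uses a small made-up data frame (toy) rather than df, so the numbers are not from the stroke data. Multiple imputation usually relies on a dedicated package and is not shown here.

```r
# Toy data frame standing in for df (values are made up for illustration).
toy = data.frame(
  age    = c(67, 61, 80, 49),
  bmi    = c(36.6, NA, 32.5, NA),
  stroke = c(1, 1, 1, 0)
)

# Number of missing values in each column
na_count = colSums(is.na(toy))
print(na_count)

# Complete-case approach: keep only the rows with no missing value at all
toy_cc = toy[complete.cases(toy), ]
print(nrow(toy_cc))
```

The same two calls can be applied to df directly, for example colSums(is.na(df)).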

The minimum age is 0.08 -> implausible unless age is recorded in fractional years -> revisit the study design and the data-cleaning process
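One way to look at such values before deciding anything is to list the rows concerned. This is a sketch on made-up data; the cutoff of 1 is an assumption for illustration, not a rule from the course.

```r
# Toy data standing in for df (made-up values).
toy = data.frame(id = 1:5, age = c(0.08, 1.32, 45, 0.64, 82))

# Rows whose age is below the chosen cutoff; which() drops any NA comparisons
young = toy[which(toy$age < 1), ]
print(young)

# How many such rows there are
n_young = sum(toy$age < 1, na.rm = TRUE)
print(n_young)
```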

Task 5: Editing data

Recode ‘gender’ (Female/Male/Other) into ‘sex’ with codes 0/1/2 (0 = Male; 1 = Female; 2 = Other)

df$sex = factor (df$gender, levels= c('Male','Female','Other'), labels = c('0','1','2'))

Check the result

head (df)

table (df$gender,df$sex)

##         
##             0    1    2
##   Female    0 2994    0
##   Male   2115    0    0
##   Other     0    0    1
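A point worth knowing about factor(): any value that is not listed in levels becomes NA, and table() hides NA unless asked. The sketch below uses a made-up vector with one deliberately misspelled entry to show how useNA = 'ifany' exposes it.

```r
# Made-up values; the last one is lower case and will not match any level.
gender = c('Male', 'Female', 'Other', 'female')

sex = factor(gender, levels = c('Male', 'Female', 'Other'), labels = c('0', '1', '2'))

# useNA = 'ifany' adds an NA column when some values failed to match
check = table(gender, sex, useNA = 'ifany')
print(check)

# Number of values that could not be recoded
n_unmatched = sum(is.na(sex))
print(n_unmatched)
```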

Recode ‘bmi’ into ‘bmi_cat’ with 4 groups

df$bmi_cat [df$bmi < 18.5] ='Underweight' 
df$bmi_cat [df$bmi >= 18.5 & df$bmi<25] ='Normal' 
df$bmi_cat [df$bmi >= 25 & df$bmi<30] ='Overweight' 
df$bmi_cat [df$bmi >= 30] ='Obese'

Check the result

head (df)
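The same four groups can also be built in one call with cut(). This is an alternative sketch, not the method used above: it returns a factor (with the levels in a fixed order) instead of a character vector. The bmi values below are made up to sit on and around the cutoffs.

```r
# Made-up BMI values, including the boundary values and one missing value.
bmi = c(17.0, 18.5, 24.9, 25, 29.9, 30, NA)

# right = FALSE makes each interval closed on the left: [18.5, 25), [25, 30), ...
bmi_cat = cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c('Underweight', 'Normal', 'Overweight', 'Obese'),
  right  = FALSE
)

print(data.frame(bmi, bmi_cat))
```

Missing bmi values stay NA in bmi_cat, which matches the behaviour of the four assignment lines.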

Recode ‘stroke’ as a factor

df$stroke1 = as.factor (df$stroke)
table (df$stroke, df$stroke1)

##    
##        0    1
##   0 4861    0
##   1    0  249

head (df)

stroke1 is now treated as a categorical variable (a factor), so summary() reports a count for each level instead of a mean
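A short sketch of what the conversion changes, on a made-up vector. It also shows a common pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the original 0/1 values.

```r
stroke = c(0, 1, 0, 0, 1)          # made-up values
stroke1 = as.factor(stroke)

print(class(stroke1))              # the class is now "factor"
print(levels(stroke1))             # the distinct values, stored as text

# Internal level codes (1 for the first level, 2 for the second)
codes = as.numeric(stroke1)

# Going through as.character() first recovers the original numbers
back = as.numeric(as.character(stroke1))
```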

summary (df)

##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##                                                                      
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##                                                                            
##  avg_glucose_level      bmi        smoking_status         stroke       
##  Min.   : 55.12    Min.   :10.30   Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    1st Qu.:23.50   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Median :28.10   Mode  :character   Median :0.00000  
##  Mean   :106.15    Mean   :28.89                      Mean   :0.04873  
##  3rd Qu.:114.09    3rd Qu.:33.10                      3rd Qu.:0.00000  
##  Max.   :271.74    Max.   :97.60                      Max.   :1.00000  
##                    NA's   :201                                         
##  sex        bmi_cat          stroke1 
##  0:2115   Length:5110        0:4861  
##  1:2994   Class :character   1: 249  
##  2:   1   Mode  :character           
##                                      
##                                      
##                                      
##

Task 6: Descriptive statistics

Describe all the variables in the ‘Stroke Data.csv’ file

6.1: Descriptive statistics

library (table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1 (~ age + gender + hypertension + heart_disease + ever_married + work_type + Residence_type + avg_glucose_level + bmi + smoking_status| stroke, data = df )

## Warning in table1.formula(~age + gender + hypertension + heart_disease + :
## Terms to the right of '|' in formula 'x' define table columns and are expected
## to be factors with meaningful labels.

	0 (N=4861)	1 (N=249)	Overall (N=5110)
age
Mean (SD)	42.0 (22.3)	67.7 (12.7)	43.2 (22.6)
Median [Min, Max]	43.0 [0.0800, 82.0]	71.0 [1.32, 82.0]	45.0 [0.0800, 82.0]
gender
Female	2853 (58.7%)	141 (56.6%)	2994 (58.6%)
Male	2007 (41.3%)	108 (43.4%)	2115 (41.4%)
Other	1 (0.0%)	0 (0%)	1 (0.0%)
hypertension
Mean (SD)	0.0889 (0.285)	0.265 (0.442)	0.0975 (0.297)
Median [Min, Max]	0 [0, 1.00]	0 [0, 1.00]	0 [0, 1.00]
heart_disease
Mean (SD)	0.0471 (0.212)	0.189 (0.392)	0.0540 (0.226)
Median [Min, Max]	0 [0, 1.00]	0 [0, 1.00]	0 [0, 1.00]
ever_married
No	1728 (35.5%)	29 (11.6%)	1757 (34.4%)
Yes	3133 (64.5%)	220 (88.4%)	3353 (65.6%)
work_type
children	685 (14.1%)	2 (0.8%)	687 (13.4%)
Govt_job	624 (12.8%)	33 (13.3%)	657 (12.9%)
Never_worked	22 (0.5%)	0 (0%)	22 (0.4%)
Private	2776 (57.1%)	149 (59.8%)	2925 (57.2%)
Self-employed	754 (15.5%)	65 (26.1%)	819 (16.0%)
Residence_type
Rural	2400 (49.4%)	114 (45.8%)	2514 (49.2%)
Urban	2461 (50.6%)	135 (54.2%)	2596 (50.8%)
avg_glucose_level
Mean (SD)	105 (43.8)	133 (61.9)	106 (45.3)
Median [Min, Max]	91.5 [55.1, 268]	105 [56.1, 272]	91.9 [55.1, 272]
bmi
Mean (SD)	28.8 (7.91)	30.5 (6.33)	28.9 (7.85)
Median [Min, Max]	28.0 [10.3, 97.6]	29.7 [16.9, 56.6]	28.1 [10.3, 97.6]
Missing	161 (3.3%)	40 (16.1%)	201 (3.9%)
smoking_status
formerly smoked	815 (16.8%)	70 (28.1%)	885 (17.3%)
never smoked	1802 (37.1%)	90 (36.1%)	1892 (37.0%)
smokes	747 (15.4%)	42 (16.9%)	789 (15.4%)
Unknown	1497 (30.8%)	47 (18.9%)	1544 (30.2%)

For each continuous variable, report either the mean OR the median.

NEVER present both the mean and the median in the analysis or report
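Which of the two to report is a judgment call. A common convention, assumed here, is the mean (SD) for roughly symmetric variables and the median with quartiles for skewed ones. The helper describe_numeric below is hypothetical and only illustrates producing one summary or the other in base R; the glucose values are made up.

```r
# Hypothetical helper: one summary line for a numeric vector, mean OR median.
describe_numeric = function(x, use_median = FALSE) {
  x = x[!is.na(x)]                          # drop missing values first
  if (use_median) {
    q = quantile(x, c(0.25, 0.5, 0.75))
    sprintf('Median [Q1, Q3]: %.1f [%.1f, %.1f]', q[2], q[1], q[3])
  } else {
    sprintf('Mean (SD): %.1f (%.1f)', mean(x), sd(x))
  }
}

glucose = c(70, 80, 90, 100, 260)           # made-up, right-skewed values

print(describe_numeric(glucose))
print(describe_numeric(glucose, use_median = TRUE))
```

The table1 package also has a render.continuous argument for choosing which statistics appear in the table; its exact form should be checked in the package documentation.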

6.2: Treat ‘hypertension’ and ‘heart_disease’ as factors

table1 (~hypertension + as.factor(hypertension) + heart_disease + as.factor(heart_disease)| stroke, data = df)

## Warning in table1.formula(~hypertension + as.factor(hypertension) +
## heart_disease + : Terms to the right of '|' in formula 'x' define table columns
## and are expected to be factors with meaningful labels.

	0 (N=4861)	1 (N=249)	Overall (N=5110)
hypertension
Mean (SD)	0.0889 (0.285)	0.265 (0.442)	0.0975 (0.297)
Median [Min, Max]	0 [0, 1.00]	0 [0, 1.00]	0 [0, 1.00]
as.factor(hypertension)
0	4429 (91.1%)	183 (73.5%)	4612 (90.3%)
1	432 (8.9%)	66 (26.5%)	498 (9.7%)
heart_disease
Mean (SD)	0.0471 (0.212)	0.189 (0.392)	0.0540 (0.226)
Median [Min, Max]	0 [0, 1.00]	0 [0, 1.00]	0 [0, 1.00]
as.factor(heart_disease)
0	4632 (95.3%)	202 (81.1%)	4834 (94.6%)
1	229 (4.7%)	47 (18.9%)	276 (5.4%)

Here, ‘hypertension’ and ‘heart_disease’ are treated as categorical variables (factors) -> counts and percentages are reported, and NO mean or median is calculated

Alternatively, we can create a new factor variable:

df$hyper.f = as.factor (df$hypertension)
table1 (~hyper.f, data = df)

	Overall (N=5110)
hyper.f
0	4612 (90.3%)
1	498 (9.7%)
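The warnings printed by the stratified table1 calls arise because stroke, the variable to the right of '|', is numeric rather than a factor with labels. A possible fix, sketched on made-up values, is a labeled factor; the name stroke_f and the labels 'No stroke' and 'Stroke' are assumptions chosen for illustration.

```r
stroke = c(0, 0, 1, 0, 1)                   # made-up values

# Factor with explicit, readable labels for the two groups
stroke_f = factor(stroke, levels = c(0, 1), labels = c('No stroke', 'Stroke'))
print(table(stroke, stroke_f))

# With the real data this could be used as follows (not run here):
# df$stroke_f = factor(df$stroke, levels = c(0, 1), labels = c('No stroke', 'Stroke'))
# table1 (~ age + gender | stroke_f, data = df)
```

The column headers of the table would then show the labels instead of 0 and 1.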

AI_R_day1

Gia Cat

2026-01-06