In this exercise, we use dataset data_example1 that we have downloaded before in the Introduction section
library(readxl)
## Warning: package 'readxl' was built under R version 3.6.1
data=read_excel("data_example1.xlsx")
to list all variables name in the dataset ‘data’
names(data)
## [1] "patientid" "date" "enddate" "ptage" "ptgender"
## [6] "ptrace" "q1" "q2" "q3" "q4"
## [11] "q5" "height" "weight" "sysbp" "diasbp"
to display the dataset structure where you can identify data type for each variable
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28 obs. of 15 variables:
## $ patientid: num 1 1 2 3 4 5 6 7 8 9 ...
## $ date : POSIXct, format: "2009-05-13" "2009-06-17" ...
## $ enddate : POSIXct, format: "2009-12-13" "2009-12-13" ...
## $ ptage : num 48 44 35 44 45 48 34 39 48 35 ...
## $ ptgender : num 1 2 1 1 2 2 2 2 2 1 ...
## $ ptrace : num 3 1 1 1 1 1 1 3 3 1 ...
## $ q1 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ q2 : num 2 2 4 2 2 2 4 1 2 4 ...
## $ q3 : num 1 2 1 2 1 1 1 1 1 1 ...
## $ q4 : num 1 1 1 1 1 1 1 1 1 1 ...
## $ q5 : num 1 4 1 4 4 3 3 5 1 1 ...
## $ height : num 1.6 1.67 1.72 1.67 1.68 NA 1.65 1.68 1.6 1.72 ...
## $ weight : num 77 74 128 74 78.5 NA 100 194 77 128 ...
## $ sysbp : num 117 123 191 123 107 120 122 140 1 191 ...
## $ diasbp : num 78 80 109 80 91 78 73 90 78 109 ...
num is numerical data type
no. of rows/ observations in the ‘data’
nrow(data)
## [1] 28
no. of columns/ variables in the ‘data’
ncol(data)
## [1] 15
to get range (min and max) for variable ptage
Notes:na.rm=TRUE is a function to exclude missing value in the analysis
range(data$ptage,na.rm=TRUE)
## [1] 34 110
to get range (min and max) for ALL variable in dataset ‘data’
Notes:na.rm=TRUE is a function to exclude missing value in the analysis
sapply(data,range,na.rm=TRUE)
## patientid date enddate ptage ptgender ptrace q1 q2 q3 q4 q5
## [1,] 1 1242172800 1260662400 34 1 1 1 1 1 1 1
## [2,] 28 1247961600 1260662400 110 3 3 1 4 2 1 5
## height weight sysbp diasbp
## [1,] 1.58 74 1 73
## [2,] 1.73 194 1120 109
we can see ptgender has ‘3’ which is neither male nor female
we can see sysbp has outlier of ‘1120’
data$patientid[duplicated(data$patientid)]
## [1] 1 14
data$patientid[which(data$ptgender==3)]
## [1] 19 26
boxplot.stats(data$sysbp)
## $stats
## [1] 107 117 122 140 140
##
## $n
## [1] 28
##
## $conf
## [1] 115.1324 128.8676
##
## $out
## [1] 191 1 191 191 1120
‘$out’ are listed outliers for sysbp. In this case, 191,1,1120
to identify which subject has sysbp = 1120
data$patientid[which(data$sysbp==1120)]
## [1] 23
Exercise:
Extract outliers for ptage
factor is a function to set to categorical data type
data$gender is the new variable created into the dataset ‘data’
labels=c(“Male”,“Female”,NA) is value labels for data$gender i.e 1=Male,2=Female,3=Missing
data$gender=factor(data$ptgender, levels=c(1,2,3), labels=c("Male","Female",NA))
then your data will have new variable ‘gender’ and look like this
Exercise:
Change ptrace, q1, q2, q3, q4 and q5 according to data_example1_Dictionary
data$bmi is the new variable created into the dataset ‘data’
data$bmi=data$weight/(data$height^2)
data$bmi_cat is the new variable created into the dataset ‘data’
data$bmi_cat[data$bmi<=18.5]=1
## Warning: Unknown or uninitialised column: 'bmi_cat'.
data$bmi_cat[data$bmi>=18.6 & data$bmi<=24.9]=2
data$bmi_cat[data$bmi>=25.0 & data$bmi<=29.9]=3
data$bmi_cat[data$bmi>=30.0]=4
Exercise:
Change data$bmi_cat to categorical data type and label according to;
1 = underweight, 2 = normal, 3 = overweight, 4 = obese
data$days is the new variable created into the dataset ‘data’
data$days=(data$enddate-data$date)
data$years is the new variable created into the dataset ‘data’
/365.25 - Divide 365.25 for unit in years
data$years=(data$enddate-data$date)/365.25