In this exercise, we use dataset data_example1 that we have downloaded before in the Introduction section

library(readxl)
## Warning: package 'readxl' was built under R version 3.6.1
data=read_excel("data_example1.xlsx")

Data Examination

to list all variables name in the dataset ‘data’

names(data)
##  [1] "patientid" "date"      "enddate"   "ptage"     "ptgender" 
##  [6] "ptrace"    "q1"        "q2"        "q3"        "q4"       
## [11] "q5"        "height"    "weight"    "sysbp"     "diasbp"

to display the dataset structure where you can identify data type for each variable

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  15 variables:
##  $ patientid: num  1 1 2 3 4 5 6 7 8 9 ...
##  $ date     : POSIXct, format: "2009-05-13" "2009-06-17" ...
##  $ enddate  : POSIXct, format: "2009-12-13" "2009-12-13" ...
##  $ ptage    : num  48 44 35 44 45 48 34 39 48 35 ...
##  $ ptgender : num  1 2 1 1 2 2 2 2 2 1 ...
##  $ ptrace   : num  3 1 1 1 1 1 1 3 3 1 ...
##  $ q1       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ q2       : num  2 2 4 2 2 2 4 1 2 4 ...
##  $ q3       : num  1 2 1 2 1 1 1 1 1 1 ...
##  $ q4       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ q5       : num  1 4 1 4 4 3 3 5 1 1 ...
##  $ height   : num  1.6 1.67 1.72 1.67 1.68 NA 1.65 1.68 1.6 1.72 ...
##  $ weight   : num  77 74 128 74 78.5 NA 100 194 77 128 ...
##  $ sysbp    : num  117 123 191 123 107 120 122 140 1 191 ...
##  $ diasbp   : num  78 80 109 80 91 78 73 90 78 109 ...

num is numerical data type

no. of rows/ observations in the ‘data’

nrow(data)
## [1] 28

no. of columns/ variables in the ‘data’

ncol(data)
## [1] 15

Data Cleaning

A.Inspect Data Abnormality

to get range (min and max) for variable ptage

Notes:na.rm=TRUE is a function to exclude missing value in the analysis

range(data$ptage,na.rm=TRUE)
## [1]  34 110

to get range (min and max) for ALL variable in dataset ‘data’

Notes:na.rm=TRUE is a function to exclude missing value in the analysis

sapply(data,range,na.rm=TRUE)
##      patientid       date    enddate ptage ptgender ptrace q1 q2 q3 q4 q5
## [1,]         1 1242172800 1260662400    34        1      1  1  1  1  1  1
## [2,]        28 1247961600 1260662400   110        3      3  1  4  2  1  5
##      height weight sysbp diasbp
## [1,]   1.58     74     1     73
## [2,]   1.73    194  1120    109

we can see ptgender has ‘3’ which is neither male nor female

we can see sysbp has outlier of ‘1120’

C.Extract Duplicate Cases

data$patientid[duplicated(data$patientid)]
## [1]  1 14

D.Extract Inconsistency Cases

data$patientid[which(data$ptgender==3)]
## [1] 19 26

E.Extract Outliers’ Cases for sysbp

boxplot.stats(data$sysbp)
## $stats
## [1] 107 117 122 140 140
## 
## $n
## [1] 28
## 
## $conf
## [1] 115.1324 128.8676
## 
## $out
## [1]  191    1  191  191 1120

‘$out’ are listed outliers for sysbp. In this case, 191,1,1120

to identify which subject has sysbp = 1120

data$patientid[which(data$sysbp==1120)] 
## [1] 23

Exercise:

Extract outliers for ptage

Data Modification

A.Change numerical data to categorical data

*Notes:*

factor is a function to set to categorical data type

data$gender is the new variable created into the dataset ‘data’

labels=c(“Male”,“Female”,NA) is value labels for data$gender i.e 1=Male,2=Female,3=Missing

data$gender=factor(data$ptgender, levels=c(1,2,3), labels=c("Male","Female",NA)) 

then your data will have new variable ‘gender’ and look like this

Exercise:

Change ptrace, q1, q2, q3, q4 and q5 according to data_example1_Dictionary

B.Compute variables

Calculate bmi

*Notes:*

data$bmi is the new variable created into the dataset ‘data’

data$bmi=data$weight/(data$height^2)

C.Recode into different variable

Categorize bmi into bmi category

*Notes:*

data$bmi_cat is the new variable created into the dataset ‘data’

data$bmi_cat[data$bmi<=18.5]=1
## Warning: Unknown or uninitialised column: 'bmi_cat'.
data$bmi_cat[data$bmi>=18.6 & data$bmi<=24.9]=2
data$bmi_cat[data$bmi>=25.0 & data$bmi<=29.9]=3
data$bmi_cat[data$bmi>=30.0]=4

Exercise:

Change data$bmi_cat to categorical data type and label according to;

1 = underweight, 2 = normal, 3 = overweight, 4 = obese

D.Calculate number between two dates

Calculate no of days

*Notes:*

data$days is the new variable created into the dataset ‘data’

data$days=(data$enddate-data$date)

Calculate no of years

*Notes:*

data$years is the new variable created into the dataset ‘data’

/365.25 - Divide 365.25 for unit in years

data$years=(data$enddate-data$date)/365.25