In this exercise, we use dataset data_example1 that we have downloaded before in the Introduction section

library(readxl)

## Warning: package 'readxl' was built under R version 3.6.1

data=read_excel("data_example1.xlsx")

Data Examination

to list all variables name in the dataset ‘data’

names(data)

##  [1] "patientid" "date"      "enddate"   "ptage"     "ptgender" 
##  [6] "ptrace"    "q1"        "q2"        "q3"        "q4"       
## [11] "q5"        "height"    "weight"    "sysbp"     "diasbp"

to display the dataset structure where you can identify data type for each variable

str(data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  15 variables:
##  $ patientid: num  1 1 2 3 4 5 6 7 8 9 ...
##  $ date     : POSIXct, format: "2009-05-13" "2009-06-17" ...
##  $ enddate  : POSIXct, format: "2009-12-13" "2009-12-13" ...
##  $ ptage    : num  48 44 35 44 45 48 34 39 48 35 ...
##  $ ptgender : num  1 2 1 1 2 2 2 2 2 1 ...
##  $ ptrace   : num  3 1 1 1 1 1 1 3 3 1 ...
##  $ q1       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ q2       : num  2 2 4 2 2 2 4 1 2 4 ...
##  $ q3       : num  1 2 1 2 1 1 1 1 1 1 ...
##  $ q4       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ q5       : num  1 4 1 4 4 3 3 5 1 1 ...
##  $ height   : num  1.6 1.67 1.72 1.67 1.68 NA 1.65 1.68 1.6 1.72 ...
##  $ weight   : num  77 74 128 74 78.5 NA 100 194 77 128 ...
##  $ sysbp    : num  117 123 191 123 107 120 122 140 1 191 ...
##  $ diasbp   : num  78 80 109 80 91 78 73 90 78 109 ...

num is numerical data type

no. of rows/ observations in the ‘data’

nrow(data)

## [1] 28

no. of columns/ variables in the ‘data’

ncol(data)

## [1] 15

Data Cleaning

A.Inspect Data Abnormality

to get range (min and max) for variable ptage

Notes:na.rm=TRUE is a function to exclude missing value in the analysis

range(data$ptage,na.rm=TRUE)

## [1]  34 110

to get range (min and max) for ALL variable in dataset ‘data’

Notes:na.rm=TRUE is a function to exclude missing value in the analysis

sapply(data,range,na.rm=TRUE)

##      patientid       date    enddate ptage ptgender ptrace q1 q2 q3 q4 q5
## [1,]         1 1242172800 1260662400    34        1      1  1  1  1  1  1
## [2,]        28 1247961600 1260662400   110        3      3  1  4  2  1  5
##      height weight sysbp diasbp
## [1,]   1.58     74     1     73
## [2,]   1.73    194  1120    109

we can see ptgender has ‘3’ which is neither male nor female

we can see sysbp has outlier of ‘1120’

C.Extract Duplicate Cases

data$patientid[duplicated(data$patientid)]

## [1]  1 14

D.Extract Inconsistency Cases

data$patientid[which(data$ptgender==3)]

## [1] 19 26

E.Extract Outliers’ Cases for sysbp

boxplot.stats(data$sysbp)

## $stats
## [1] 107 117 122 140 140
## 
## $n
## [1] 28
## 
## $conf
## [1] 115.1324 128.8676
## 
## $out
## [1]  191    1  191  191 1120

‘$out’ are listed outliers for sysbp. In this case, 191,1,1120

to identify which subject has sysbp = 1120

data$patientid[which(data$sysbp==1120)]

## [1] 23

Exercise:

Extract outliers for ptage

Data Modification

A.Change numerical data to categorical data

*Notes:*

factor is a function to set to categorical data type

data$gender is the new variable created into the dataset ‘data’

labels=c(“Male”,“Female”,NA) is value labels for data$gender i.e 1=Male,2=Female,3=Missing

data$gender=factor(data$ptgender, levels=c(1,2,3), labels=c("Male","Female",NA))

then your data will have new variable ‘gender’ and look like this

Exercise:

Change ptrace, q1, q2, q3, q4 and q5 according to data_example1_Dictionary

B.Compute variables

Calculate bmi

*Notes:*

data$bmi is the new variable created into the dataset ‘data’

data$bmi=data$weight/(data$height^2)

C.Recode into different variable

Categorize bmi into bmi category

*Notes:*

data$bmi_cat is the new variable created into the dataset ‘data’

data$bmi_cat[data$bmi<=18.5]=1

## Warning: Unknown or uninitialised column: 'bmi_cat'.

data$bmi_cat[data$bmi>=18.6 & data$bmi<=24.9]=2
data$bmi_cat[data$bmi>=25.0 & data$bmi<=29.9]=3
data$bmi_cat[data$bmi>=30.0]=4

Exercise:

Change data$bmi_cat to categorical data type and label according to;

1 = underweight, 2 = normal, 3 = overweight, 4 = obese

D.Calculate number between two dates

Calculate no of days

*Notes:*

data$days is the new variable created into the dataset ‘data’

data$days=(data$enddate-data$date)

Calculate no of years

*Notes:*

data$years is the new variable created into the dataset ‘data’

/365.25 - Divide 365.25 for unit in years

data$years=(data$enddate-data$date)/365.25

Basic Data Management using R

Evi Diana Omar,Shahrul Aiman Soelar,Fatimah Diana Amin Nordin

September 4, 2019

Data Examination

Data Cleaning

A.Inspect Data Abnormality

C.Extract Duplicate Cases

D.Extract Inconsistency Cases

E.Extract Outliers’ Cases for sysbp

Data Modification

A.Change numerical data to categorical data

B.Compute variables

Calculate bmi

C.Recode into different variable

Categorize bmi into bmi category

D.Calculate number between two dates

Calculate no of days

Calculate no of years