Data Management - An Introduction

Ching-Fan SHEU

26 February 2018

Outline

Data formats

Organizing data

Sharing your data

Data cleaning

Data manipulation

Data analysis

Data management - Before

Data management - After

Data format I

Gender Pregnant (= No) Pregnant (= Yes)
Female 3 1
Male 2 0

Data format II

Gender Pregnant Frequency
Female No 3
Male No 2
Female Yes 1
Male Yes 0

Data format III

Gender Pregnant
Female Yes
Female No
Female No
Female No
Male No
Male No

Data format IV

Gender Pregnant
1 1
1 0
1 0
1 0
0 0
0 0

Gender: Female = 1, Male = 0
Pregnant: No = 0, Yes = 1

Data format V

ID Gender Pregnant
1 1 1
2 1 0
3 1 0
4 1 0
5 0 0
6 0 0

ID: Subject Index
Gender: Female = 1, Male = 0
Pregnant: No = 0, Yes = 1

Generate raw data with R

dta <- data.frame(Gender = rep(c(1, 0), c(4, 2)),   
                  Pregnant = c(1, rep(0, 5)))
dta
##   Gender Pregnant
## 1      1        1
## 2      1        0
## 3      1        0
## 4      1        0
## 5      0        0
## 6      0        0
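
The data frame above matches Data format IV. As a minimal sketch, a subject index can be added to reach Data format V; the object name dta_id is introduced here for illustration only.

```r
# Rebuild the raw data (same values as above) under a separate name
dta_id <- data.frame(Gender = rep(c(1, 0), c(4, 2)),
                     Pregnant = c(1, rep(0, 5)))
# Prepend a running subject index, one value per row
dta_id <- cbind(ID = seq_len(nrow(dta_id)), dta_id)
dta_id
```

Keeping an explicit ID column means each record can still be traced after the rows are sorted or merged.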

Recode variables

dta <- within(dta, {     
    Gender <- ifelse(Gender == 0, "Male", "Female")
    Pregnant <- ifelse(Pregnant == 0, "No", "Yes") 
    } )
dta
##   Gender Pregnant
## 1 Female      Yes
## 2 Female       No
## 3 Female       No
## 4 Female       No
## 5   Male       No
## 6   Male       No
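
One possible alternative to ifelse() is factor(), which attaches the labels and also fixes the order of the categories. This is only a sketch; the object name dta2 is an assumption made for illustration.

```r
# Rebuild the numeric raw data (same values as above)
dta2 <- data.frame(Gender = rep(c(1, 0), c(4, 2)),
                   Pregnant = c(1, rep(0, 5)))
# levels lists the codes, labels gives the matching names in the same order
dta2$Gender <- factor(dta2$Gender, levels = c(1, 0),
                      labels = c("Female", "Male"))
dta2$Pregnant <- factor(dta2$Pregnant, levels = c(0, 1),
                        labels = c("No", "Yes"))
str(dta2)
```

Because the levels are declared explicitly, an empty category such as pregnant males still appears as a zero cell in table(dta2).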

Cross-classification

table(dta)
##         Pregnant
## Gender   No Yes
##   Female  3   1
##   Male    2   0
with(dta, table(Pregnant, Gender))
##         Gender
## Pregnant Female Male
##      No       3    2
##      Yes      1    0

Order of transformation matters

t(table(dta))
##         Gender
## Pregnant Female Male
##      No       3    2
##      Yes      1    0
table(t(dta))
## 
## Female   Male     No    Yes 
##      4      2      5      1
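
The difference arises because t() applied to a data frame returns a character matrix, and table() then counts every entry of that matrix as one variable. A small sketch to check this, with the hypothetical names dta3 and m:

```r
# Recoded data typed in directly (same values as above)
dta3 <- data.frame(Gender = c("Female", "Female", "Female", "Female", "Male", "Male"),
                   Pregnant = c("Yes", "No", "No", "No", "No", "No"))
m <- t(dta3)           # transposing a data frame gives a matrix
is.matrix(m)           # the data frame structure is gone
dim(m)                 # variables are now rows, subjects are columns
length(dim(table(m)))  # table() sees one variable, not two
```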

Turn tables into data frames

data.frame(table(dta))
##   Gender Pregnant Freq
## 1 Female       No    3
## 2   Male       No    2
## 3 Female      Yes    1
## 4   Male      Yes    0
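
The reverse direction, from frequency form (Data format II) back to one row per subject (Data format III), can be sketched with rep(). The object names freq, idx and cases are assumptions made for illustration.

```r
# Frequency form, typed in directly
freq <- data.frame(Gender = c("Female", "Male", "Female", "Male"),
                   Pregnant = c("No", "No", "Yes", "Yes"),
                   Freq = c(3, 2, 1, 0))
# Repeat each row index Freq times; rows with Freq = 0 drop out
idx <- rep(seq_len(nrow(freq)), times = freq$Freq)
cases <- freq[idx, c("Gender", "Pregnant")]
rownames(cases) <- NULL  # discard the repeated row labels
cases
```

The rows come out in the order of the frequency table, so the ordering may differ from the original raw data even though the counts agree.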

Mosaic plot as the default for table objects

plot(table(dta), main = "")

mosaicplot(table(dta), main = "")

Save a table object and name the cell elements

tdta <- table(dta)
as.vector(tdta)
## [1] 3 2 1 0
names(tdta) <- c("Female_No", "Male_No", "Female_Yes", "Male_Yes")

Pie charts

pie(tdta, radius = 1, col = gray(seq(1, .4, length.out = 4)))

Doughnut plot as an improvement?

pie(tdta, radius = 1, col = gray(seq(.8, .2, length.out = 4)))
symbols(0, 0, circles = .5, bg = "white", add = TRUE)

Cleveland’s dot plot - raw data

dotchart(table(dta))

Cleveland’s dot plot - proportions

dotchart(prop.table(table(dta), 1))
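
The second argument of prop.table() selects the margin: 1 gives proportions within each row (here, within each gender), while omitting it gives proportions of the grand total. A minimal sketch, with the table typed in directly and the hypothetical names tab, row_prop and cell_prop:

```r
# The cross-classification from above, entered as a table
tab <- as.table(matrix(c(3, 2, 1, 0), nrow = 2,
                       dimnames = list(Gender = c("Female", "Male"),
                                       Pregnant = c("No", "Yes"))))
row_prop <- prop.table(tab, 1)  # each row sums to 1
rowSums(row_prop)
cell_prop <- prop.table(tab)    # all cells together sum to 1
sum(cell_prop)
```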

Version control

The End