Data Management - An Introduction

Yu-Jou Lin

08 March 2018

Outline

Data formats

Organizing data

Sharing your data

Data cleaning

Data manipulation

Data analysis

Data management - Before

Data management - After

Data format I

Gender  Pregnant (= No)  Pregnant (= Yes)
Female  3                1
Male    2                0

Data format II

Gender Pregnant Frequency
Female No 3
Male No 2
Female Yes 1
Male Yes 0

Data format III

Gender Pregnant
Female Yes
Female No
Female No
Female No
Male No
Male No

Data format IV

Gender Pregnant
1 1
1 0
1 0
1 0
0 0
0 0

Gender: Female = 1, Male = 0
Pregnant: No = 0, Yes = 1

Data format V

ID Gender Pregnant
1 1 1
2 1 0
3 1 0
4 1 0
5 0 0
6 0 0

ID: Subject Index
Gender: Female = 1, Male = 0
Pregnant: No = 0, Yes = 1

Generate raw data with R

dta <- data.frame(Gender = rep(c(1, 0), c(4, 2)),   
                  Pregnant = c(1, rep(0, 5)))
dta
  Gender Pregnant
1      1        1
2      1        0
3      1        0
4      1        0
5      0        0
6      0        0
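Adding a subject index (Data format V): a minimal sketch, assuming the ID is simply the row number. The name dta_id is hypothetical and is used only so that dta itself stays unchanged for the slides that follow.

```r
# Rebuild the coded data (format IV) so the example is self-contained.
dta_id <- data.frame(Gender = rep(c(1, 0), c(4, 2)),
                     Pregnant = c(1, rep(0, 5)))

# Prepend a subject index; seq_len() is safe even for a data frame with 0 rows.
dta_id <- cbind(ID = seq_len(nrow(dta_id)), dta_id)
dta_id
```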

Recode variables

dta <- within(dta, {
    Gender <- ifelse(Gender == 0, "Male", "Female")
    Pregnant <- ifelse(Pregnant == 0, "No", "Yes")
})
dta
  Gender Pregnant
1 Female      Yes
2 Female       No
3 Female       No
4 Female       No
5   Male       No
6   Male       No
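An alternative recoding with factor(): a sketch, not the approach used on this slide. factor() attaches the labels and fixes the level order explicitly instead of producing character strings. The names dta_num and dta_fac are hypothetical.

```r
# Coded data (format IV): Gender 1 = Female, 0 = Male; Pregnant 0 = No, 1 = Yes.
dta_num <- data.frame(Gender = rep(c(1, 0), c(4, 2)),
                      Pregnant = c(1, rep(0, 5)))

# Map each code to its label; the order of 'levels' sets the order of the levels.
dta_fac <- within(dta_num, {
    Gender <- factor(Gender, levels = c(1, 0), labels = c("Female", "Male"))
    Pregnant <- factor(Pregnant, levels = c(0, 1), labels = c("No", "Yes"))
})
str(dta_fac)
```

Because both levels are declared, an empty category such as Male/Yes still appears in tables built from dta_fac.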

Cross-classification

table(dta)
        Pregnant
Gender   No Yes
  Female  3   1
  Male    2   0
with(dta, table(Pregnant, Gender))
        Gender
Pregnant Female Male
     No       3    2
     Yes      1    0

Order of transformation matters

t(table(dta))
        Gender
Pregnant Female Male
     No       3    2
     Yes      1    0
table(t(dta))

Female   Male     No    Yes 
     4      2      5      1 
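Why the two results differ: t() turns the data frame into a character matrix, so table() no longer sees two variables and counts every cell value in a single pool. A short sketch (dta_chr and m are hypothetical names):

```r
# Recoded data, rebuilt here so the example is self-contained.
dta_chr <- data.frame(Gender = rep(c("Female", "Male"), c(4, 2)),
                      Pregnant = c("Yes", rep("No", 5)))

# Transposing a data frame coerces it to a matrix (here: character, 2 x 6).
m <- t(dta_chr)
is.matrix(m)
dim(m)

# All 12 cells are tabulated together, hence one row of four counts.
table(m)
```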

Turn tables into data frames

data.frame(table(dta))
  Gender Pregnant Freq
1 Female       No    3
2   Male       No    2
3 Female      Yes    1
4   Male      Yes    0
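Going the other way, from frequency form (Data format II) back to one row per subject (Data format III): a minimal sketch that repeats each row Freq times. The names dta_chr, freq and cases are hypothetical.

```r
# Recoded data, rebuilt here so the example is self-contained.
dta_chr <- data.frame(Gender = rep(c("Female", "Male"), c(4, 2)),
                      Pregnant = c("Yes", rep("No", 5)))

# Frequency form: one row per Gender x Pregnant combination.
freq <- data.frame(table(dta_chr))

# Repeat each row index Freq times; combinations with Freq = 0 drop out.
cases <- freq[rep(seq_len(nrow(freq)), freq$Freq), c("Gender", "Pregnant")]
rownames(cases) <- NULL
cases
```

Tabulating cases reproduces the original cross-classification, which is a quick check that no subject was lost.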

Mosaic plot is the default plot for a table object

plot(table(dta), main = "")

mosaicplot(table(dta), main = "")

Save a table object and name the cell elements

tdta <- table(dta)
as.vector(tdta)
[1] 3 2 1 0
names(tdta) <- c("Female_No", "Male_No", "Female_Yes", "Male_Yes")

Pie charts

pie(tdta, radius = 1, col = gray(seq(1, .4, length.out = 4)))

Doughnut plot as an improvement?

pie(tdta, radius = 1, col = gray(seq(.8, .2, length.out = 4)))
symbols(0, 0, circles = .5, bg = "white", add = TRUE)

Cleveland’s dot plot - raw data

dotchart(table(dta))

Cleveland’s dot plot - proportions

dotchart(prop.table(table(dta), 1))
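What the margin argument of prop.table() does: a short sketch comparing row, column and overall proportions for the same table. The names dta_chr, tab, p_row, p_col and p_all are hypothetical.

```r
# Recoded data, rebuilt here so the example is self-contained.
dta_chr <- data.frame(Gender = rep(c("Female", "Male"), c(4, 2)),
                      Pregnant = c("Yes", rep("No", 5)))
tab <- table(dta_chr)

p_row <- prop.table(tab, 1)  # within each Gender: every row sums to 1
p_col <- prop.table(tab, 2)  # within each Pregnant status: every column sums to 1
p_all <- prop.table(tab)     # share of all subjects: the whole table sums to 1

p_row
```

The dot plot above uses the row version, so each point is a proportion within one gender.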

Version control

The End