picture of spaghetti

AMPHIBIAN Working Group 2014-2015

Switch2R 02

Rich Jones

Monday, November 10, 2014

AMPHIBIAN logo

2014-2015 presentation series: Switch2R

Where to find code

on github (https://github.com/rnj0nes/Switch2R)

Updates from October 2014

November 2014: Clean Data with R

Things to remember about R

Things to remember about data in R

What are clean data?

Example from last month

Atkins, D. C. (2005). Using multilevel models to analyze couple and family treatment data: Basic and advanced issues. Journal of Family Psychology, 19, 98-110.

Data from UCLA IDRE

Article: findit at Brown full text (http://goo.gl/D6m2ID)

R Session

What I do

work <- "c:/work/shows/switch2r"
setwd(work)
getwd()
## [1] "c:/work/shows/switch2r"

Load some packages

(note: order matters)

# install.packages("lubridate")
# install.packages("psych")
# install.packages("gmodels")
# install.packages("Hmisc")
# install.packages("sjPlot")
require(lubridate) # easy handling of dates and times
require(psych)    # using the scoreItems command 
require(Hmisc)    # using the describe command
require(sjPlot)   # tools for reading SPSS formatted data
require(gmodels)  # using the CrossTable command

Read in the data set from the web

# Read data from UCLA Web site
url <- "http://www.ats.ucla.edu/stat/paperexamples/atkins_mlm/Atkins_JFP_data.txt"
data <- read.csv(url, sep="\t", header=TRUE)
# I like all lowercase variable names
names(data)<- tolower(names(data)) 
# show the first 15 lines
head(data, 15)
##    id sex therapy time       das pilot miss m.ind
## 1   1   0    -0.5    0  94.51204     1    0     1
## 2   1   0    -0.5   13  87.53364     1    0     1
## 3   1   0    -0.5   26  81.46659     1    1     1
## 4   1   0    -0.5   35  83.44614     1    1     1
## 5   1   1    -0.5    0  81.27981     1    0     1
## 6   1   1    -0.5   13  68.80343     1    0     1
## 7   1   1    -0.5   26  71.16971     1    1     1
## 8   1   1    -0.5   35  76.88723     1    1     1
## 9   2   0     0.5    0  79.53347     1    0     0
## 10  2   0     0.5   13  98.75209     1    0     0
## 11  2   0     0.5   26 100.15992     1    0     0
## 12  2   0     0.5   35 127.55920     1    0     0
## 13  2   1     0.5    0 106.98098     1    0     0
## 14  2   1     0.5   13 124.13277     1    0     0
## 15  2   1     0.5   26 143.09363     1    0     0

Are these data tidy?

Make some variables

The data file has a variable therapy that is coded -.5/+.5.

Let’s say I’d like to have a version of that variable that was coded 1/2.

I’ll call it tx

data$tx <- 1 # New variable tx in data.frame data = 1 (by default)
data$tx[which(data$therapy==0.5)] <- 2 # if therapy == 0.5 data$tx = 2
str(data[,c("tx","therapy")]) # show the characteristics of two variables
## 'data.frame':    1072 obs. of  2 variables:
##  $ tx     : num  1 1 1 1 1 1 1 1 2 2 ...
##  $ therapy: num  -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 0.5 0.5 ...
# Now check
table(data$tx, data$therapy) # a crosstab of tx and therapy in data.frame data
##    
##     -0.5 0.5
##   1  536   0
##   2    0 536

A nicer looking cross-tab

Using CrossTable from the gmodels package

CrossTable(data$tx, data$therapy, 
  missing.include=TRUE, 
  prop.r=FALSE, 
  prop.c=FALSE, 
  prop.t=FALSE, 
  prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  1072 
## 
##  
##              | data$therapy 
##      data$tx |      -0.5 |       0.5 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |       536 |         0 |       536 | 
## -------------|-----------|-----------|-----------|
##            2 |         0 |       536 |       536 | 
## -------------|-----------|-----------|-----------|
## Column Total |       536 |       536 |      1072 | 
## -------------|-----------|-----------|-----------|
## 
## 

But that is so cumbersome to code…

Welcome to R

tab <- function(r,c) {
  CrossTable(r, c, 
    missing.include=TRUE, prop.r=FALSE, prop.c=FALSE, prop.t=FALSE, 
    prop.chisq=FALSE)
}

Now a nice cross-tab

tab(data$tx,data$therapy)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  1072 
## 
##  
##              | c 
##            r |      -0.5 |       0.5 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |       536 |         0 |       536 | 
## -------------|-----------|-----------|-----------|
##            2 |         0 |       536 |       536 | 
## -------------|-----------|-----------|-----------|
## Column Total |       536 |       536 |      1072 | 
## -------------|-----------|-----------|-----------|
## 
## 

And anyway the coding looks right. But notice the variable names are now the not-so-helpful “r” and “c”. So whatever.

Watch me do this with Rcmdr

Link to video

Sex

Another example. I’ll code a variable male from the original sex

data$male <- 0 # initialize to 0
data$male[which(data$sex!=1)] <- 1 # sex is 0:Husband 1:Wife
tab(data$male,data$sex)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |-------------------------|
## 
##  
## Total Observations in Table:  1072 
## 
##  
##              | c 
##            r |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |         0 |       536 |       536 | 
## -------------|-----------|-----------|-----------|
##            1 |       536 |         0 |       536 | 
## -------------|-----------|-----------|-----------|
## Column Total |       536 |       536 |      1072 | 
## -------------|-----------|-----------|-----------|
## 
## 

About indicator variables

male is a binary indicator (0/1). When coding binary indicators with the values (0/1), always code so that the name of the variable always matches the label for value 1. For example,

R sex Categorical Variable Binary indicator Binary indicator
is really Gender codes Female Male
a man 1 = male 0 1
a woman 2 = female 1 0

Dates

The example data set does not have any date information.

Public data rarely will.

But you will surely have to deal with dates.

So let’s pretend it has dates.

To pretend it has dates, we will generate some date data. So now you get to see how we generate data with R.

set.seed(3481)
data$year <- sample(1:3,nrow(data),replace=T)+2008
data$day <-  sample(1:27,nrow(data),replace=T)
data$month <- round(runif(nrow(data),0.51,12.49))

The first line says: make a new variable year in data frame data. Assign to it values, sample from the range 1 to 3, as many times as there are rows in the data frame named data. Sample with replacement. Then add 2008.

The third line says: make a new variable month in data frame data. Assign to it a random numbers (as many as there are rows in the data frame) drawn from a uniform distribution ranging from 0.51 to 12.49. Oh but round it to the nearest whole number too.

Display the structure of the day month year variables

str(data[,c("day","month","year")])
## 'data.frame':    1072 obs. of  3 variables:
##  $ day  : int  5 1 27 16 3 18 19 24 3 6 ...
##  $ month: num  2 4 3 10 5 5 5 11 3 5 ...
##  $ year : num  2011 2010 2010 2010 2011 ...

Tabulate the day month year variables

table(data$day)
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
## 41 37 45 43 37 49 34 36 44 31 32 33 40 37 42 33 44 35 47 40 49 35 37 47 42 
## 26 27 
## 42 40
table(data$month)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  75  83  87  72 102  93  84  83  72 106 105 110
table(data$year)
## 
## 2009 2010 2011 
##  363  363  346

Dates using lubridate

(luridate vignette)[http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html]

Some interesting things about dates

the.time.is.now <- now()
the.time.is.now.numeric <- as.numeric(now())
the.time.is.now
## [1] "2014-11-12 15:54:03 EST"
the.time.is.now.numeric
## [1] 1415825644
as.numeric(mdy("1/1/1970"))
## [1] 0
as.numeric(mdy("1/2/1970"))
## [1] 86400
60*60*24
## [1] 86400
str(the.time.is.now)
##  POSIXct[1:1], format: "2014-11-12 15:54:03"

Making a datetime variable

data$year <- as.character(data$year)
data$month <- as.character(data$month)
data$day <- as.character(data$day)
data$date <- mdy(paste0(data$month,"/",data$day,"/",data$year))
head(data[,c("year","month","day","date")])
##   year month day       date
## 1 2011     2   5 2011-02-05
## 2 2010     4   1 2010-04-01
## 3 2010     3  27 2010-03-27
## 4 2010    10  16 2010-10-16
## 5 2011     5   3 2011-05-03
## 6 2009     5  18 2009-05-18

New Example: National Comorbidity Study Replication (NCS-R 2002-2003)

Data from ICPSR (public use)

I can redistribute to Brown faculty, students, and staff (authorized users)

If you are not Brown faculty, students, and staff, see about getting access to the NCS-R public use files at ICPSR

NCS-R data

I am pulling the NCS-R data as archived along with the Collaborative Psychiatric Epidemiology Surveys.

CPES Website

CPES-1

Download the SPSS data file

CPES-2

Let’s get set

work <- "c:/work/shows/switch2r" # my working folder you edit here
setwd(work) # change to it
getwd() # check it
## [1] "c:/work/shows/switch2r"

I’ve downloaded the SPSS file to a subfolder here (NCSR)

dir("./NCSR")
## [1] "20240-0002-Codebook.pdf"      "20240-0002-Data.sav"         
## [3] "20240-0002-Data.txt"          "20240-0002-Documentation.pdf"
## [5] "20240-0002-Setup.sas"

We’ll use the sjPlot package to read in the SPSS data

and generate a nice data dictionary

spssdata <- sji.SPSS("./NCSR/20240-0002-Data.sav")
# sji.viewSPSS(spssdata) # opens in the Rstudio Viewer Panel
# Or in a browser window if you're running the R Console

Let’s save that codebook

#sji.viewSPSS(spssdata, file="codebook.html", showFreq = TRUE, useViewer = FALSE)
codebookpath <- file.path(work,"codebook.html")

The codebook

But that link only works on your (i.e., my) local computer

Viewers of this presentation can see the codebook I generated on my dropbox

Let’s make a summated rating scale score

Psychotic experiences

Each coded 1=Yes, 5=No

Look at the data

PSitems <- c("PS1A", "PS1B", "PS1C", "PS1D", "PS1E", "PS1F")
describe(spssdata[PSitems])
## spssdata[PSitems] 
## 
##  6  Variables      9282  Observations
## ---------------------------------------------------------------------------
## PS1A 
##       n missing  unique    Info    Mean 
##    2349    6933       2    0.25    4.63 
## 
## 1 (217, 9%), 5 (2132, 91%) 
## ---------------------------------------------------------------------------
## PS1B 
##       n missing  unique    Info    Mean 
##    2349    6933       2    0.17   4.753 
## 
## 1 (145, 6%), 5 (2204, 94%) 
## ---------------------------------------------------------------------------
## PS1C 
##       n missing  unique    Info    Mean 
##    2352    6930       2    0.02   4.968 
## 
## 1 (19, 1%), 5 (2333, 99%) 
## ---------------------------------------------------------------------------
## PS1D 
##       n missing  unique    Info    Mean 
##    2353    6929       2    0.01   4.981 
## 
## 1 (11, 0%), 5 (2342, 100%) 
## ---------------------------------------------------------------------------
## PS1E 
##       n missing  unique    Info    Mean 
##    2352    6930       2    0.04   4.951 
## 
## 1 (29, 1%), 5 (2323, 99%) 
## ---------------------------------------------------------------------------
## PS1F 
##       n missing  unique    Info    Mean 
##    2352    6930       2    0.05   4.937 
## 
## 1 (37, 2%), 5 (2315, 98%) 
## ---------------------------------------------------------------------------

Looks like a skip pattern

Kessler says…

Kessler RC, Birnbaum H, Demler O, Falloon IR, Gagnon E, Guyer M, Howes MJ, Kendler KS, Shi L, Walters E. The prevalence and correlates of nonaffective psychosis in the National Comorbidity Survey Replication (NCS-R). Biol Psychiatry. 2005;58(8):668-76.

Right ways and wrong ways to make a sum score

Remember

table(spssdata$PS1A)
## 
##    1    5 
##  217 2132

Right ways and wrong ways to make a sum score

This works

spssdata$count.ok <- (spssdata$PS1A==1)+
  (spssdata$PS1B==1)+(spssdata$PS1C==1)+
  (spssdata$PS1D==1)+(spssdata$PS1E==1)+
  (spssdata$PS1F==1)

Right ways and wrong ways to make a sum score

This does not

spssdata$count.notok <- (spssdata$PS1A==1)
  +(spssdata$PS1B==1)
  +(spssdata$PS1C==1)
  +(spssdata$PS1D==1)
  +(spssdata$PS1E==1)
  +(spssdata$PS1F==1)

Right ways and wrong ways to make a sum score

See

spssdata$count.ok <- (spssdata$PS1A==1)+
  (spssdata$PS1B==1)+(spssdata$PS1C==1)+
  (spssdata$PS1D==1)+(spssdata$PS1E==1)+
  (spssdata$PS1F==1)
summary(spssdata$count.ok)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.194   0.000   6.000    6938

Right ways and wrong ways to make a sum score

Let’s check

spssdata$check <- paste0(as.character(spssdata$PS1A),
  as.character(spssdata$PS1B),as.character(spssdata$PS1C),
  as.character(spssdata$PS1D),as.character(spssdata$PS1E),
  as.character(spssdata$PS1F))

Right ways and wrong ways to make a sum score

Let’s check

head(spssdata[which(spssdata$count==0),c("check","count.ok")])
##     check count.ok
## 8  555555        0
## 12 555555        0
## 18 555555        0
## 24 555555        0
## 25 555555        0
## 26 555555        0

Right ways and wrong ways to make a sum score

Let’s check

head(spssdata[which(spssdata$count==1),c("check","count.ok")])
##      check count.ok
## 17  155555        1
## 19  155555        1
## 212 155555        1
## 321 515555        1
## 378 155555        1
## 381 555551        1

Right ways and wrong ways to make a sum score

Let’s check

head(spssdata[which(spssdata$count==2),c("check","count.ok")])
##      check count.ok
## 313 115555        2
## 367 515551        2
## 453 115555        2
## 501 155551        2
## 660 115555        2
## 721 155515        2

Right ways and wrong ways to make a sum score

Let’s check

head(spssdata[which(spssdata$count==3),c("check","count.ok")])
##       check count.ok
## 1176 115155        3
## 2254 115551        3
## 3084 115155        3
## 3620 111555        3
## 3658 115551        3
## 4115 111555        3

Right ways and wrong ways to make a sum score

Let’s check

head(spssdata[which(spssdata$count==4),c("check","count.ok")])
##       check count.ok
## 38   151115        4
## 3793 111551        4
## 4426 115115        4
## 6260 115511        4
## 6299 115511        4
## 7040 111515        4

Right ways and wrong ways to make a sum score

Let’s check

head(spssdata[which(spssdata$count==5),c("check","count.ok")])
##       check count.ok
## 6721 111511        5

Right ways and wrong ways to make a sum score

Let’s check

head(spssdata[which(spssdata$count==6),c("check","count.ok")])
##       check count.ok
## 1453 111111        6
## 4460 111111        6

Using scoreItems from package psych

psitems <- c("ps1a.i", "ps1b.i", "ps1c.i", "ps1d.i", "ps1e.i", "ps1f.i")
# Make indicator versions of response items
spssdata$ps1a.i <- as.numeric(spssdata$PS1A==1)
spssdata$ps1b.i <- as.numeric(spssdata$PS1B==1)
spssdata$ps1c.i <- as.numeric(spssdata$PS1C==1)
spssdata$ps1d.i <- as.numeric(spssdata$PS1D==1)
spssdata$ps1e.i <- as.numeric(spssdata$PS1E==1)
spssdata$ps1f.i <- as.numeric(spssdata$PS1F==1)

Using scoreItems from package psych

key <- c(1,1,1,1,1,1) # the "right" answers
results <- scoreItems(
  items = spssdata[psitems],
  keys = key, 
  missing = TRUE , impute = "none"  # person's non-missing mean
  #                                   response used for missing
  )

Display scoreItem results

results
## Call: scoreItems(keys = key, items = spssdata[psitems], missing = TRUE, 
##     impute = "none")
## 
## (Standardized) Alpha:
##         A1
## alpha 0.53
## 
## Standard errors of unstandardized Alpha:
##         [,1]
## ASE   0.0094
## 
## Standardized Alpha of observed scales:
##                [,1]
## alpha.observed 0.53
## 
## Average item correlation:
##           [,1]
## average.r 0.16
## 
##  Guttman 6* reliability: 
##          [,1]
## Lambda.6 0.52
## 
## Signal/Noise based upon av.r : 
##              [,1]
## Signal/Noise  1.1
## 
## Scale intercorrelations corrected for attenuation 
##  raw correlations below the diagonal, alpha on the diagonal 
##  corrected correlations above the diagonal:
## 
## Note that these are the correlations of the complete scales based on the correlation matrix,
##  not the observed scales based on the raw items.
##      [,1]
## [1,] 0.53
## 
##  In order to see the item by scale loadings and frequency counts of the data
##  print with the short option = FALSE

Display the structure of the results

str(results)
## List of 17
##  $ scores        : num [1:9282, 1] NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr "A1"
##  $ missing       : num [1:9282, 1] 6 6 6 6 6 6 6 0 6 6 ...
##  $ alpha         : num [1, 1] 0.527
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "alpha"
##   .. ..$ : chr "A1"
##  $ av.r          : num [1, 1] 0.157
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "average.r"
##   .. ..$ : NULL
##  $ sn            : num [1, 1] 1.11
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "Signal/Noise"
##   .. ..$ : NULL
##  $ n.items       : num 6
##  $ item.cor      : num [1:6, 1] 0.755 0.726 0.445 0.359 0.517 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "ps1a.i" "ps1b.i" "ps1c.i" "ps1d.i" ...
##   .. ..$ : chr "A1"
##  $ cor           : num [1, 1] 1
##  $ corrected     : num [1, 1] 0.527
##  $ G6            : num [1, 1] 0.523
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "Lambda.6"
##   .. ..$ : NULL
##  $ item.corrected: num [1:6, 1] 0.44 0.521 0.427 0.353 0.498 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "ps1a.i" "ps1b.i" "ps1c.i" "ps1d.i" ...
##   .. ..$ : chr "A1"
##  $ response.freq : num [1:6, 1:3] 0.908 0.938 0.992 0.995 0.988 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "ps1a.i" "ps1b.i" "ps1c.i" "ps1d.i" ...
##   .. ..$ : chr [1:3] "0" "1" "miss"
##  $ raw           : logi FALSE
##  $ alpha.ob      : num [1, 1] 0.527
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr "alpha.observed"
##   .. ..$ : NULL
##  $ num.ob.item   : num 6
##  $ ase           : num [1, 1] 0.00941
##  $ Call          : language scoreItems(keys = key, items = spssdata[psitems], missing = TRUE,      impute = "none")
##  - attr(*, "class")= chr [1:2] "psych" "score.items"

Using scoreItems from package psych

summary(results$score)
##        A1       
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.032  
##  3rd Qu.:0.000  
##  Max.   :1.000  
##  NA's   :6929

Using scoreItems from package psych

results$pscount <- 6*results$score
summary(results$pscount)
##        A1       
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.195  
##  3rd Qu.:0.000  
##  Max.   :6.000  
##  NA's   :6929

But

How do we get results$pscount back into the working data set?

spssdata$pscount.master <- scoreItems(items = spssdata[psitems],
  keys = key, missing = TRUE , impute = "none")[[1]]*6
summary(spssdata$pscount.master)
##        A1       
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.195  
##  3rd Qu.:0.000  
##  Max.   :6.000  
##  NA's   :6929

Display the internal consistency reliability coefficient

results$alpha
##              A1
## alpha 0.5271367

Save alpha to two decimal places

alpha <- round(results$alpha,2)

My markdown code

The internal consistency reliability coefficient for the sum of the
psychotic experiences scale is 'r alpha'.

The internal consistency reliability coefficient for the sum of the psychotic experiences scale is 0.53.

Display the structure and report the item response frequencies

str(results$response.freq)
##  num [1:6, 1:3] 0.908 0.938 0.992 0.995 0.988 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:6] "ps1a.i" "ps1b.i" "ps1c.i" "ps1d.i" ...
##   ..$ : chr [1:3] "0" "1" "miss"
results$response.freq
##                0           1      miss
## ps1a.i 0.9076203 0.092379736 0.7469295
## ps1b.i 0.9382716 0.061728395 0.7469295
## ps1c.i 0.9919218 0.008078231 0.7466063
## ps1d.i 0.9953251 0.004674883 0.7464986
## ps1e.i 0.9876701 0.012329932 0.7466063
## ps1f.i 0.9842687 0.015731293 0.7466063

Review of things we accomplished today

Big ideas

Packages

Commands and functions

Skills and tasks

Next meeting

Analysis with R

Use R to do multivariable analysis

Monday, December 8, 2014, 2-3pm.