Rich Jones
Monday, November 10, 2014
If you are a Mac user, don’t use Revolution R Open
If you are a Mac user, there was an issue with R, RStudio with the Yosemite Mac OS X 10.10 update
Use R to clean data
Data can be
Tidy data have the following characteristics
Atkins, D. C. (2005). Using multilevel models to analyze couple and family treatment data: Basic and advanced issues. Journal of Family Psychology, 19, 98-110.
Data from UCLA IDRE
Article: findit at Brown full text (http://goo.gl/D6m2ID)
work <- "c:/work/shows/switch2r"
setwd(work)
getwd()
## [1] "c:/work/shows/switch2r"
(note: order matters)
# install.packages("lubridate")
# install.packages("psych")
# install.packages("gmodels")
# install.packages("Hmisc")
# install.packages("sjPlot")
require(lubridate) # easy handling of dates and times
require(psych) # using the scoreItems command
require(Hmisc) # using the describe command
require(sjPlot) # tools for reading SPSS formatted data
require(gmodels) # using the CrossTable command
# Read data from UCLA Web site
url <- "http://www.ats.ucla.edu/stat/paperexamples/atkins_mlm/Atkins_JFP_data.txt"
data <- read.csv(url, sep="\t", header=TRUE)
# I like all lowercase variable names
names(data)<- tolower(names(data))
# show the first 15 lines
head(data, 15)
## id sex therapy time das pilot miss m.ind
## 1 1 0 -0.5 0 94.51204 1 0 1
## 2 1 0 -0.5 13 87.53364 1 0 1
## 3 1 0 -0.5 26 81.46659 1 1 1
## 4 1 0 -0.5 35 83.44614 1 1 1
## 5 1 1 -0.5 0 81.27981 1 0 1
## 6 1 1 -0.5 13 68.80343 1 0 1
## 7 1 1 -0.5 26 71.16971 1 1 1
## 8 1 1 -0.5 35 76.88723 1 1 1
## 9 2 0 0.5 0 79.53347 1 0 0
## 10 2 0 0.5 13 98.75209 1 0 0
## 11 2 0 0.5 26 100.15992 1 0 0
## 12 2 0 0.5 35 127.55920 1 0 0
## 13 2 1 0.5 0 106.98098 1 0 0
## 14 2 1 0.5 13 124.13277 1 0 0
## 15 2 1 0.5 26 143.09363 1 0 0
The data file has a variable therapy that is coded -.5/+.5.
Let’s say I’d like to have a version of that variable that was coded 1/2.
I’ll call it tx
data$tx <- 1 # New variable tx in data.frame data = 1 (by default)
data$tx[which(data$therapy==0.5)] <- 2 # if therapy == 0.5 data$tx = 2
str(data[,c("tx","therapy")]) # show the characteristics of two variables
## 'data.frame': 1072 obs. of 2 variables:
## $ tx : num 1 1 1 1 1 1 1 1 2 2 ...
## $ therapy: num -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 -0.5 0.5 0.5 ...
# Now check
table(data$tx, data$therapy) # a crosstab of tx and therapy in data.frame data
##
## -0.5 0.5
## 1 536 0
## 2 0 536
Using CrossTable from the gmodels package
CrossTable(data$tx, data$therapy,
missing.include=TRUE,
prop.r=FALSE,
prop.c=FALSE,
prop.t=FALSE,
prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## |-------------------------|
##
##
## Total Observations in Table: 1072
##
##
## | data$therapy
## data$tx | -0.5 | 0.5 | Row Total |
## -------------|-----------|-----------|-----------|
## 1 | 536 | 0 | 536 |
## -------------|-----------|-----------|-----------|
## 2 | 0 | 536 | 536 |
## -------------|-----------|-----------|-----------|
## Column Total | 536 | 536 | 1072 |
## -------------|-----------|-----------|-----------|
##
##
Welcome to R
tab <- function(r,c) {
CrossTable(r, c,
missing.include=TRUE, prop.r=FALSE, prop.c=FALSE, prop.t=FALSE,
prop.chisq=FALSE)
}
tab(data$tx,data$therapy)
##
##
## Cell Contents
## |-------------------------|
## | N |
## |-------------------------|
##
##
## Total Observations in Table: 1072
##
##
## | c
## r | -0.5 | 0.5 | Row Total |
## -------------|-----------|-----------|-----------|
## 1 | 536 | 0 | 536 |
## -------------|-----------|-----------|-----------|
## 2 | 0 | 536 | 536 |
## -------------|-----------|-----------|-----------|
## Column Total | 536 | 536 | 1072 |
## -------------|-----------|-----------|-----------|
##
##
And anyway the coding looks right. But notice the variable names are now the not-so-helpful “r” and “c”. So whatever.
Another example. I’ll code a variable male from the original sex
data$male <- 0 # initialize to 0
data$male[which(data$sex!=1)] <- 1 # sex is 0:Husband 1:Wife
tab(data$male,data$sex)
##
##
## Cell Contents
## |-------------------------|
## | N |
## |-------------------------|
##
##
## Total Observations in Table: 1072
##
##
## | c
## r | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 0 | 536 | 536 |
## -------------|-----------|-----------|-----------|
## 1 | 536 | 0 | 536 |
## -------------|-----------|-----------|-----------|
## Column Total | 536 | 536 | 1072 |
## -------------|-----------|-----------|-----------|
##
##
male is a binary indicator (0/1). When coding binary indicators with the values (0/1), always code so that the name of the variable always matches the label for value 1. For example,
| R sex | Categorical Variable | Binary indicator | Binary indicator |
|---|---|---|---|
| is really | Gender codes | Female | Male |
| a man | 1 = male | 0 | 1 |
| a woman | 2 = female | 1 | 0 |
The example data set does not have any date information.
Public data rarely will.
But you will surely have to deal with dates.
So let’s pretend it has dates.
To pretend it has dates, we will generate some date data. So now you get to see how we generate data with R.
set.seed(3481)
data$year <- sample(1:3,nrow(data),replace=T)+2008
data$day <- sample(1:27,nrow(data),replace=T)
data$month <- round(runif(nrow(data),0.51,12.49))
The first line says: make a new variable year in data frame data. Assign to it values, sample from the range 1 to 3, as many times as there are rows in the data frame named data. Sample with replacement. Then add 2008.
The third line says: make a new variable month in data frame data. Assign to it a random numbers (as many as there are rows in the data frame) drawn from a uniform distribution ranging from 0.51 to 12.49. Oh but round it to the nearest whole number too.
str(data[,c("day","month","year")])
## 'data.frame': 1072 obs. of 3 variables:
## $ day : int 5 1 27 16 3 18 19 24 3 6 ...
## $ month: num 2 4 3 10 5 5 5 11 3 5 ...
## $ year : num 2011 2010 2010 2010 2011 ...
table(data$day)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 41 37 45 43 37 49 34 36 44 31 32 33 40 37 42 33 44 35 47 40 49 35 37 47 42
## 26 27
## 42 40
table(data$month)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 75 83 87 72 102 93 84 83 72 106 105 110
table(data$year)
##
## 2009 2010 2011
## 363 363 346
(luridate vignette)[http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html]
the.time.is.now <- now()
the.time.is.now.numeric <- as.numeric(now())
the.time.is.now
## [1] "2014-11-12 15:54:03 EST"
the.time.is.now.numeric
## [1] 1415825644
as.numeric(mdy("1/1/1970"))
## [1] 0
as.numeric(mdy("1/2/1970"))
## [1] 86400
60*60*24
## [1] 86400
str(the.time.is.now)
## POSIXct[1:1], format: "2014-11-12 15:54:03"
data$year <- as.character(data$year)
data$month <- as.character(data$month)
data$day <- as.character(data$day)
data$date <- mdy(paste0(data$month,"/",data$day,"/",data$year))
head(data[,c("year","month","day","date")])
## year month day date
## 1 2011 2 5 2011-02-05
## 2 2010 4 1 2010-04-01
## 3 2010 3 27 2010-03-27
## 4 2010 10 16 2010-10-16
## 5 2011 5 3 2011-05-03
## 6 2009 5 18 2009-05-18
Data from ICPSR (public use)
I can redistribute to Brown faculty, students, and staff (authorized users)
If you are not Brown faculty, students, and staff, see about getting access to the NCS-R public use files at ICPSR
I am pulling the NCS-R data as archived along with the Collaborative Psychiatric Epidemiology Surveys.
work <- "c:/work/shows/switch2r" # my working folder you edit here
setwd(work) # change to it
getwd() # check it
## [1] "c:/work/shows/switch2r"
dir("./NCSR")
## [1] "20240-0002-Codebook.pdf" "20240-0002-Data.sav"
## [3] "20240-0002-Data.txt" "20240-0002-Documentation.pdf"
## [5] "20240-0002-Setup.sas"
and generate a nice data dictionary
spssdata <- sji.SPSS("./NCSR/20240-0002-Data.sav")
# sji.viewSPSS(spssdata) # opens in the Rstudio Viewer Panel
# Or in a browser window if you're running the R Console
#sji.viewSPSS(spssdata, file="codebook.html", showFreq = TRUE, useViewer = FALSE)
codebookpath <- file.path(work,"codebook.html")
But that link only works on your (i.e., my) local computer
Viewers of this presentation can see the codebook I generated on my dropbox
Psychotic experiences
Each coded 1=Yes, 5=No
PSitems <- c("PS1A", "PS1B", "PS1C", "PS1D", "PS1E", "PS1F")
describe(spssdata[PSitems])
## spssdata[PSitems]
##
## 6 Variables 9282 Observations
## ---------------------------------------------------------------------------
## PS1A
## n missing unique Info Mean
## 2349 6933 2 0.25 4.63
##
## 1 (217, 9%), 5 (2132, 91%)
## ---------------------------------------------------------------------------
## PS1B
## n missing unique Info Mean
## 2349 6933 2 0.17 4.753
##
## 1 (145, 6%), 5 (2204, 94%)
## ---------------------------------------------------------------------------
## PS1C
## n missing unique Info Mean
## 2352 6930 2 0.02 4.968
##
## 1 (19, 1%), 5 (2333, 99%)
## ---------------------------------------------------------------------------
## PS1D
## n missing unique Info Mean
## 2353 6929 2 0.01 4.981
##
## 1 (11, 0%), 5 (2342, 100%)
## ---------------------------------------------------------------------------
## PS1E
## n missing unique Info Mean
## 2352 6930 2 0.04 4.951
##
## 1 (29, 1%), 5 (2323, 99%)
## ---------------------------------------------------------------------------
## PS1F
## n missing unique Info Mean
## 2352 6930 2 0.05 4.937
##
## 1 (37, 2%), 5 (2315, 98%)
## ---------------------------------------------------------------------------
Looks like a skip pattern
Kessler RC, Birnbaum H, Demler O, Falloon IR, Gagnon E, Guyer M, Howes MJ, Kendler KS, Shi L, Walters E. The prevalence and correlates of nonaffective psychosis in the National Comorbidity Survey Replication (NCS-R). Biol Psychiatry. 2005;58(8):668-76.
Remember
table(spssdata$PS1A)
##
## 1 5
## 217 2132
This works
spssdata$count.ok <- (spssdata$PS1A==1)+
(spssdata$PS1B==1)+(spssdata$PS1C==1)+
(spssdata$PS1D==1)+(spssdata$PS1E==1)+
(spssdata$PS1F==1)
This does not
spssdata$count.notok <- (spssdata$PS1A==1)
+(spssdata$PS1B==1)
+(spssdata$PS1C==1)
+(spssdata$PS1D==1)
+(spssdata$PS1E==1)
+(spssdata$PS1F==1)
See
spssdata$count.ok <- (spssdata$PS1A==1)+
(spssdata$PS1B==1)+(spssdata$PS1C==1)+
(spssdata$PS1D==1)+(spssdata$PS1E==1)+
(spssdata$PS1F==1)
summary(spssdata$count.ok)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.194 0.000 6.000 6938
Let’s check
spssdata$check <- paste0(as.character(spssdata$PS1A),
as.character(spssdata$PS1B),as.character(spssdata$PS1C),
as.character(spssdata$PS1D),as.character(spssdata$PS1E),
as.character(spssdata$PS1F))
Let’s check
head(spssdata[which(spssdata$count==0),c("check","count.ok")])
## check count.ok
## 8 555555 0
## 12 555555 0
## 18 555555 0
## 24 555555 0
## 25 555555 0
## 26 555555 0
Let’s check
head(spssdata[which(spssdata$count==1),c("check","count.ok")])
## check count.ok
## 17 155555 1
## 19 155555 1
## 212 155555 1
## 321 515555 1
## 378 155555 1
## 381 555551 1
Let’s check
head(spssdata[which(spssdata$count==2),c("check","count.ok")])
## check count.ok
## 313 115555 2
## 367 515551 2
## 453 115555 2
## 501 155551 2
## 660 115555 2
## 721 155515 2
Let’s check
head(spssdata[which(spssdata$count==3),c("check","count.ok")])
## check count.ok
## 1176 115155 3
## 2254 115551 3
## 3084 115155 3
## 3620 111555 3
## 3658 115551 3
## 4115 111555 3
Let’s check
head(spssdata[which(spssdata$count==4),c("check","count.ok")])
## check count.ok
## 38 151115 4
## 3793 111551 4
## 4426 115115 4
## 6260 115511 4
## 6299 115511 4
## 7040 111515 4
Let’s check
head(spssdata[which(spssdata$count==5),c("check","count.ok")])
## check count.ok
## 6721 111511 5
Let’s check
head(spssdata[which(spssdata$count==6),c("check","count.ok")])
## check count.ok
## 1453 111111 6
## 4460 111111 6
psitems <- c("ps1a.i", "ps1b.i", "ps1c.i", "ps1d.i", "ps1e.i", "ps1f.i")
# Make indicator versions of response items
spssdata$ps1a.i <- as.numeric(spssdata$PS1A==1)
spssdata$ps1b.i <- as.numeric(spssdata$PS1B==1)
spssdata$ps1c.i <- as.numeric(spssdata$PS1C==1)
spssdata$ps1d.i <- as.numeric(spssdata$PS1D==1)
spssdata$ps1e.i <- as.numeric(spssdata$PS1E==1)
spssdata$ps1f.i <- as.numeric(spssdata$PS1F==1)
key <- c(1,1,1,1,1,1) # the "right" answers
results <- scoreItems(
items = spssdata[psitems],
keys = key,
missing = TRUE , impute = "none" # person's non-missing mean
# response used for missing
)
results
## Call: scoreItems(keys = key, items = spssdata[psitems], missing = TRUE,
## impute = "none")
##
## (Standardized) Alpha:
## A1
## alpha 0.53
##
## Standard errors of unstandardized Alpha:
## [,1]
## ASE 0.0094
##
## Standardized Alpha of observed scales:
## [,1]
## alpha.observed 0.53
##
## Average item correlation:
## [,1]
## average.r 0.16
##
## Guttman 6* reliability:
## [,1]
## Lambda.6 0.52
##
## Signal/Noise based upon av.r :
## [,1]
## Signal/Noise 1.1
##
## Scale intercorrelations corrected for attenuation
## raw correlations below the diagonal, alpha on the diagonal
## corrected correlations above the diagonal:
##
## Note that these are the correlations of the complete scales based on the correlation matrix,
## not the observed scales based on the raw items.
## [,1]
## [1,] 0.53
##
## In order to see the item by scale loadings and frequency counts of the data
## print with the short option = FALSE
str(results)
## List of 17
## $ scores : num [1:9282, 1] NaN NaN NaN NaN NaN NaN NaN 0 NaN NaN ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "A1"
## $ missing : num [1:9282, 1] 6 6 6 6 6 6 6 0 6 6 ...
## $ alpha : num [1, 1] 0.527
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "alpha"
## .. ..$ : chr "A1"
## $ av.r : num [1, 1] 0.157
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "average.r"
## .. ..$ : NULL
## $ sn : num [1, 1] 1.11
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "Signal/Noise"
## .. ..$ : NULL
## $ n.items : num 6
## $ item.cor : num [1:6, 1] 0.755 0.726 0.445 0.359 0.517 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "ps1a.i" "ps1b.i" "ps1c.i" "ps1d.i" ...
## .. ..$ : chr "A1"
## $ cor : num [1, 1] 1
## $ corrected : num [1, 1] 0.527
## $ G6 : num [1, 1] 0.523
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "Lambda.6"
## .. ..$ : NULL
## $ item.corrected: num [1:6, 1] 0.44 0.521 0.427 0.353 0.498 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "ps1a.i" "ps1b.i" "ps1c.i" "ps1d.i" ...
## .. ..$ : chr "A1"
## $ response.freq : num [1:6, 1:3] 0.908 0.938 0.992 0.995 0.988 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:6] "ps1a.i" "ps1b.i" "ps1c.i" "ps1d.i" ...
## .. ..$ : chr [1:3] "0" "1" "miss"
## $ raw : logi FALSE
## $ alpha.ob : num [1, 1] 0.527
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr "alpha.observed"
## .. ..$ : NULL
## $ num.ob.item : num 6
## $ ase : num [1, 1] 0.00941
## $ Call : language scoreItems(keys = key, items = spssdata[psitems], missing = TRUE, impute = "none")
## - attr(*, "class")= chr [1:2] "psych" "score.items"
summary(results$score)
## A1
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.032
## 3rd Qu.:0.000
## Max. :1.000
## NA's :6929
results$pscount <- 6*results$score
summary(results$pscount)
## A1
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.195
## 3rd Qu.:0.000
## Max. :6.000
## NA's :6929
How do we get results$pscount back into the working data set?
spssdata$pscount.master <- scoreItems(items = spssdata[psitems],
keys = key, missing = TRUE , impute = "none")[[1]]*6
summary(spssdata$pscount.master)
## A1
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.195
## 3rd Qu.:0.000
## Max. :6.000
## NA's :6929
results$alpha
## A1
## alpha 0.5271367
alpha <- round(results$alpha,2)
My markdown code
The internal consistency reliability coefficient for the sum of the
psychotic experiences scale is 'r alpha'.
The internal consistency reliability coefficient for the sum of the psychotic experiences scale is 0.53.
str(results$response.freq)
## num [1:6, 1:3] 0.908 0.938 0.992 0.995 0.988 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:6] "ps1a.i" "ps1b.i" "ps1c.i" "ps1d.i" ...
## ..$ : chr [1:3] "0" "1" "miss"
results$response.freq
## 0 1 miss
## ps1a.i 0.9076203 0.092379736 0.7469295
## ps1b.i 0.9382716 0.061728395 0.7469295
## ps1c.i 0.9919218 0.008078231 0.7466063
## ps1d.i 0.9953251 0.004674883 0.7464986
## ps1e.i 0.9876701 0.012329932 0.7466063
## ps1f.i 0.9842687 0.015731293 0.7466063
Analysis with R
Use R to do multivariable analysis
Monday, December 8, 2014, 2-3pm.