DataM: HW Exercise 0330 6

The HELP (Health Evaluation and Linkage to Primary Care) study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. Eligible subjects were adults who spoke Spanish or English, reported alcohol, heroin, or cocaine as their first or second drug of choice, and either resided in proximity to the primary care clinic to which they would be referred or were homeless. Subjects were interviewed at baseline during their detoxification stay, and follow-up interviews were undertaken every 6 months for 2 years. A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions.

The following R script is used to manage the data file at the initial stage of investigation. Provide comments on what each line of the script is meant to achieve.

Source: Kleinman, K., & Horton, N.J. (2015). Using R for Data Management, Statistical Analysis, and Graphics.

Chunk 1

echo=FALSE
eval=TRUE

Run the code in the chunk, but do not display the code in the output file.

Chunk 2

options(continue="  ")
options(digits=3)
options(width=72) # narrow output

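Set the continuation prompt (shown when a command spans more than one line) to two spaces instead of the default "+ ".
Print numbers with 3 significant digits.
Set the width of the printed output to 72 characters per line.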
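The effect of the digits option can be checked with a small sketch. The numbers below are made up for illustration and are not part of the HELP data; the point is that digits limits significant digits in printed output and does not round the stored values.

```r
# Remember the current settings so they can be restored afterwards.
old <- options(digits = 3, width = 72)

# format() uses getOption("digits") by default.
# digits counts significant digits, not decimal places.
big   <- format(123.456)
small <- format(0.0012345)
print(c(big = big, small = small))

# The stored value is unchanged; only its printed form is shortened.
x <- 123.456
unchanged <- (x == 123.456)

options(old)  # restore the previous settings
```

Saving the old settings in old and restoring them at the end keeps the change local to this example.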

Chunk 3

ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(dplyr)
newds = select(ds, cesd, female, i1, i2, id, treat, f1a, f1b, f1c, f1d, f1e, f1f, f1g, f1h, f1i, f1j, f1k, f1l, f1m, f1n, f1o, f1p, f1q, f1r, f1s, f1t)

Load in the dataset from the online file.
Load in the package dplyr.
Select some variables from ds as a new data frame and name it newds.

Chunk 4

names(newds)

 [1] "cesd"   "female" "i1"     "i2"     "id"     "treat"  "f1a"   
 [8] "f1b"    "f1c"    "f1d"    "f1e"    "f1f"    "f1g"    "f1h"   
[15] "f1i"    "f1j"    "f1k"    "f1l"    "f1m"    "f1n"    "f1o"   
[22] "f1p"    "f1q"    "f1r"    "f1s"    "f1t"

str(newds[,1:10]) # structure of the first 10 variables

'data.frame':   453 obs. of  10 variables:
 $ cesd  : int  49 30 39 15 39 6 52 32 50 46 ...
 $ female: int  0 0 0 1 0 1 1 0 1 0 ...
 $ i1    : int  13 56 0 5 10 4 13 12 71 20 ...
 $ i2    : int  26 62 0 5 13 4 20 24 129 27 ...
 $ id    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ treat : int  1 1 0 0 0 1 0 1 0 1 ...
 $ f1a   : int  3 3 3 0 3 1 3 1 3 2 ...
 $ f1b   : int  2 2 2 0 0 0 1 1 2 3 ...
 $ f1c   : int  3 0 3 1 3 1 3 2 3 3 ...
 $ f1d   : int  0 3 0 3 3 3 1 3 1 0 ...

Display names of variables in newds.
Display the structure of the data frame with the first 10 variables in newds.

Chunk 5

summary(newds[,1:10]) # summary of the first 10 variables

      cesd          female            i1              i2       
 Min.   : 1.0   Min.   :0.000   Min.   :  0.0   Min.   :  0.0  
 1st Qu.:25.0   1st Qu.:0.000   1st Qu.:  3.0   1st Qu.:  3.0  
 Median :34.0   Median :0.000   Median : 13.0   Median : 15.0  
 Mean   :32.8   Mean   :0.236   Mean   : 17.9   Mean   : 22.6  
 3rd Qu.:41.0   3rd Qu.:0.000   3rd Qu.: 26.0   3rd Qu.: 32.0  
 Max.   :60.0   Max.   :1.000   Max.   :142.0   Max.   :184.0  
       id          treat            f1a            f1b      
 Min.   :  1   Min.   :0.000   Min.   :0.00   Min.   :0.00  
 1st Qu.:119   1st Qu.:0.000   1st Qu.:1.00   1st Qu.:0.00  
 Median :233   Median :0.000   Median :2.00   Median :1.00  
 Mean   :233   Mean   :0.497   Mean   :1.63   Mean   :1.39  
 3rd Qu.:348   3rd Qu.:1.000   3rd Qu.:3.00   3rd Qu.:2.00  
 Max.   :470   Max.   :1.000   Max.   :3.00   Max.   :3.00  
      f1c            f1d      
 Min.   :0.00   Min.   :0.00  
 1st Qu.:1.00   1st Qu.:0.00  
 Median :2.00   Median :1.00  
 Mean   :1.92   Mean   :1.56  
 3rd Qu.:3.00   3rd Qu.:3.00  
 Max.   :3.00   Max.   :3.00

Summarize the first 10 variables in newds with descriptive statistics: the minimum, the first quartile (Q1), the median, the mean, the third quartile (Q3), and the maximum. If a variable has missing values, the number of missing values is reported as well.
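The behavior of summary with missing values can be seen in a minimal sketch. The data frame toy and its values are invented for illustration only.

```r
# Toy data (not from the HELP study): one score is missing.
toy <- data.frame(score = c(10, 20, NA, 40), group = c(0, 1, 1, 0))

s <- summary(toy$score)
print(s)

# An extra "NA's" entry appears only when values are missing.
has_na_count  <- "NA's" %in% names(s)
n_missing     <- as.numeric(s[["NA's"]])
group_has_nas <- "NA's" %in% names(summary(toy$group))
```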

Chunk 6

head(newds, n=3)

Display the first 3 rows of newds.

Chunk 7

comment(newds) = "HELP baseline dataset"
comment(newds)

[1] "HELP baseline dataset"

save(ds, file="savedfile")

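Attach a comment to newds (stored as an attribute and not printed by default) and display it.
Save the data frame ds (not newds) in R's binary format to the file savedfile.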
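A small round trip illustrates what save does, assuming a temporary file rather than the file name used in the script. The object toy and the path are hypothetical.

```r
# Toy data frame with a comment attribute (values invented for illustration).
toy <- data.frame(id = 1:3, cesd = c(49, 30, 39))
comment(toy) <- "toy dataset"

path <- tempfile()          # hypothetical file location
save(toy, file = path)      # write the object in R's binary format

rm(toy)                     # remove it from the workspace
loaded_names <- load(path)  # load() returns the names of the restored objects

# The object comes back under its original name, attributes included.
restored_comment <- comment(toy)
restored_rows    <- nrow(toy)
unlink(path)
```

Because load restores objects under the names they had when saved, the file written in Chunk 7 would bring back an object called ds.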

Chunk 8

write.csv(ds, file="ds.csv")

Save the dataset as a csv file.
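One detail worth checking is that write.csv also writes the row names as a first column unless row.names=FALSE is given. The sketch below uses an invented data frame and a temporary file to show the difference.

```r
# Toy data (not from the HELP study).
toy <- data.frame(id = 1:3, cesd = c(49, 30, 39))
path <- tempfile(fileext = ".csv")   # hypothetical file location

# Default: row names are written as an unnamed first column,
# which read.csv then calls "X".
write.csv(toy, file = path)
with_rownames <- read.csv(path)

# With row.names = FALSE only the variables are written.
write.csv(toy, file = path, row.names = FALSE)
without_rownames <- read.csv(path)

unlink(path)
```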

Chunk 9

library(foreign)
write.foreign(newds, "file.dat", "file.sas", package="SAS")

Load in the package foreign.
Write newds as a plain-text data file (file.dat) together with a SAS program (file.sas) that reads that data file into SAS.

Chunk 10

with(newds, cesd[1:10])

 [1] 49 30 39 15 39  6 52 32 50 46

with(newds, head(cesd, 10))

 [1] 49 30 39 15 39  6 52 32 50 46

Show the first 10 values of cesd, a variable of newds, in two equivalent ways (indexing with [1:10] and head).

Chunk 11

with(newds, cesd[cesd > 56])

[1] 57 58 57 60 58 58 57

Show the values of cesd that exceed 56.

Chunk 12

library(dplyr)
filter(newds, cesd > 56) %>% select(id, cesd)

Load in the package dplyr.
Keep the rows of newds in which cesd > 56, then pass the result through the pipe (%>%) to select two variables, id and cesd.
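The same row and column selection can be sketched in base R, without dplyr. The data frame toy is invented for illustration, and the base R calls are shown as an alternative, not as what the script itself does.

```r
# Toy data (not from the HELP study).
toy <- data.frame(id = 1:5,
                  cesd = c(49, 57, 39, 60, 56),
                  female = c(0, 1, 0, 1, 0))

# Values only, as in Chunk 11.
high_values <- with(toy, cesd[cesd > 56])

# Rows and columns, as in Chunk 12, using logical indexing ...
high_rows <- toy[toy$cesd > 56, c("id", "cesd")]

# ... or subset(), which reads closer to filter() followed by select().
high_rows2 <- subset(toy, cesd > 56, select = c(id, cesd))
```

Note that 56 itself is excluded because the comparison is strict.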

Chunk 13

with(newds, sort(cesd)[1:4])

[1] 1 3 3 4

with(newds, which.min(cesd))

[1] 199

Sort newds$cesd in ascending order and show the first 4 elements. That is, show the 4 smallest values of newds$cesd.
Show the index (the row number) of the minimum of newds$cesd.
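The difference between sorted values and positions can be seen on a short made-up vector. The use of order here is an addition for illustration and does not appear in the script.

```r
# Toy vector (not the HELP cesd values).
cesd <- c(49, 30, 3, 15, 1, 3, 4)

smallest4 <- sort(cesd)[1:4]   # the 4 smallest values
min_pos   <- which.min(cesd)   # position of the (first) minimum

# order() returns positions instead of values, so it can also
# tell where the 4 smallest values sit in the original vector.
smallest4_pos <- order(cesd)[1:4]
```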

Chunk 14

library(mosaic)
tally(~ is.na(f1g), data=newds)

is.na(f1g)
 TRUE FALSE 
    1   452

favstats(~ f1g, data=newds)

Load in the package mosaic.
Count the missing values of f1g, a variable of newds.
Compute descriptive statistics of newds$f1g.
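For readers without mosaic, the missing-value count can be sketched in base R. The vector below is invented and only stands in for f1g.

```r
# Toy item scores with one missing value (not the HELP data).
f1g <- c(3, 1, NA, 2, 0)

n_missing <- sum(is.na(f1g))    # TRUE counts as 1
counts    <- table(is.na(f1g))  # FALSE/TRUE frequencies

# Most summaries return NA unless missing values are removed explicitly.
mean_all      <- mean(f1g)
mean_observed <- mean(f1g, na.rm = TRUE)
```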

Chunk 15

# reverse code f1d, f1h, f1l and f1p
cesditems = with(newds, cbind(f1a, f1b, f1c, (3 - f1d), f1e, f1f, f1g, 
                              (3 - f1h), f1i, f1j, f1k, (3 - f1l), f1m, f1n, f1o, (3 - f1p), 
                              f1q, f1r, f1s, f1t)) #1
nmisscesd = apply(is.na(cesditems), 1, sum)        #2
ncesditems = cesditems                             #3
ncesditems[is.na(cesditems)] = 0                   #3
newcesd = apply(ncesditems, 1, sum)                #4
imputemeancesd = 20/(20-nmisscesd)*newcesd

Collect the 20 CESD items into a matrix, reverse-coding the four reversed items (f1d, f1h, f1l, f1p) as 3 minus the score, and call this matrix cesditems.
Count the number of missing items for each subject (row).
Duplicate cesditems as ncesditems and replace all missing values in ncesditems with 0.
Sum the item scores within each subject (row) to compute the total CESD score.
Compute a mean-imputed CESD score by scaling the total by 20/(20 - number of missing items), which is equivalent to replacing each missing item with the mean of that subject's observed items.
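The arithmetic of Chunk 15 can be followed on a tiny example. This is a sketch with an invented 4-item scale instead of the 20 CESD items, and it uses rowSums where the script uses apply(..., 1, sum).

```r
# Toy item matrix: 3 subjects (rows) by 4 items (columns).
items <- rbind(
  c(3, 2, 3, 0),     # complete
  c(1, NA, 2, 3),    # one item missing
  c(NA, NA, 1, 2)    # two items missing
)
n_items <- ncol(items)

nmiss <- rowSums(is.na(items))         # missing items per subject
total <- rowSums(items, na.rm = TRUE)  # same as setting NA to 0 and summing

# Scale the observed total up to the full number of items.
imputed <- n_items / (n_items - nmiss) * total

# Equivalent view: each missing item is replaced by the subject's own mean.
imputed_check <- n_items * rowMeans(items, na.rm = TRUE)
```

For the third subject the observed total is 3 over 2 items, so the scaled score is 4/2 * 3 = 6.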

Chunk 16

data.frame(newcesd, newds$cesd, nmisscesd, imputemeancesd)[nmisscesd>0,]

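Collect the variables created in Chunk 15, together with the original newds$cesd, into a data frame and show only the rows of subjects with at least one missing item (nmisscesd > 0).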

Chunk 17

message=FALSE    #1
library(dplyr)   #2
library(memisc)  #2
newds = mutate(newds, drinkstat= 
                 cases(
                   "abstinent" = i1==0,
                   "moderate" = (i1>0 & i1<=1 & i2<=3 & female==1) |
                     (i1>0 & i1<=2 & i2<=4 & female==0),
                   "highrisk" = ((i1>1 | i2>3) & female==1) |
                     ((i1>2 | i2>4) & female==0)))

Suppress messages (especially those printed when loading packages) in the output.
Load in the packages dplyr and memisc.
Create a new variable drinkstat in newds. drinkstat is a factor with 3 levels (abstinent, moderate, highrisk) that classifies drinking status from i1, i2, and female according to the given conditions, with different thresholds for females and males.
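The classification rule can be sketched in base R with nested ifelse calls. The function drink_status and the toy inputs are hypothetical and are not part of the script; the thresholds are copied from the conditions in Chunk 17. One assumption: a subject with i1 == 0 is always labeled abstinent here, whatever the value of i2.

```r
# Hypothetical helper that mirrors the conditions passed to cases().
drink_status <- function(i1, i2, female) {
  limit1 <- ifelse(female == 1, 1, 2)  # threshold on i1
  limit2 <- ifelse(female == 1, 3, 4)  # threshold on i2
  status <- ifelse(i1 == 0, "abstinent",
            ifelse(i1 <= limit1 & i2 <= limit2, "moderate", "highrisk"))
  factor(status, levels = c("abstinent", "moderate", "highrisk"))
}

# Toy subjects (values invented for illustration).
i1     <- c(0, 1, 2, 2, 3)
i2     <- c(0, 3, 4, 4, 3)
female <- c(1, 1, 1, 0, 0)

drinkstat <- drink_status(i1, i2, female)
print(data.frame(i1, i2, female, drinkstat))
```

The third and fourth subjects have the same i1 and i2 but different classifications, which shows the sex-specific thresholds at work.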

Chunk 18

echo=FALSE
library(mosaic)

Do not show the code.
Load in the package mosaic.

Chunk 19

detach(package:memisc)
detach(package:MASS)

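Detach the packages memisc and MASS from the search path (they remain installed), presumably so that their functions no longer mask dplyr functions such as select.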

Chunk 20

library(dplyr)
tmpds = select(newds, i1, i2, female, drinkstat)
tmpds[365:370,]

Load in the package dplyr.
Pick variables i1, i2, female, and drinkstat from newds into a new data frame and name it tmpds.
Show the 365th-370th rows of tmpds.

Chunk 21

library(dplyr)
filter(tmpds, drinkstat=="moderate" & female==1)

Load in the package dplyr.
Show the rows of tmpds that satisfy both conditions (drinking status is moderate and the subject is female).

Chunk 22

library(gmodels)
with(tmpds, CrossTable(drinkstat))


 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  453 

 
          | abstinent |  moderate |  highrisk | 
          |-----------|-----------|-----------|
          |        68 |        28 |       357 | 
          |     0.150 |     0.062 |     0.788 | 
          |-----------|-----------|-----------|

Load in the package gmodels.
Display the frequency distribution table of the severity of drinking.

Chunk 23

with(tmpds, CrossTable(drinkstat, female, 
                       prop.t=FALSE, prop.c=FALSE, prop.chisq=FALSE))


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  453 

 
             | female 
   drinkstat |         0 |         1 | Row Total | 
-------------|-----------|-----------|-----------|
   abstinent |        42 |        26 |        68 | 
             |     0.618 |     0.382 |     0.150 | 
-------------|-----------|-----------|-----------|
    moderate |        21 |         7 |        28 | 
             |     0.750 |     0.250 |     0.062 | 
-------------|-----------|-----------|-----------|
    highrisk |       283 |        74 |       357 | 
             |     0.793 |     0.207 |     0.788 | 
-------------|-----------|-----------|-----------|
Column Total |       346 |       107 |       453 | 
-------------|-----------|-----------|-----------|

Display the two-way frequency table of drinking status by female, showing counts and row proportions only (table proportions, column proportions, and chi-square contributions are turned off).
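The counts and row proportions that CrossTable prints can also be obtained in base R. The sketch below uses table and prop.table on invented data, as an alternative to gmodels and not as a description of how CrossTable works internally.

```r
# Toy data (not from the HELP study).
toy <- data.frame(
  drinkstat = factor(c("abstinent", "highrisk", "highrisk",
                       "moderate", "highrisk", "abstinent"),
                     levels = c("abstinent", "moderate", "highrisk")),
  female = c(0, 1, 0, 0, 0, 1)
)

counts    <- with(toy, table(drinkstat, female))  # N in each cell
row_props <- prop.table(counts, margin = 1)       # N / row total

print(counts)
print(round(row_props, 3))
```

Using margin = 2 instead would give column proportions, which the script switches off with prop.c=FALSE.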

Chunk 24

newds = transform(newds, 
                  gender=factor(female, c(0,1), c("Male","Female")))
tally(~ female + gender, margin=FALSE, data=newds)

      gender
female Male Female
     0  346      0
     1    0    107

Revise newds: use the binary variable female to create a factor with 2 levels (0 = Male, 1 = Female) and name it gender.
Show the contingency table of female and gender to check that the recoding is correct.
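The recoding and its check can be reproduced on a short made-up vector. The sketch uses the named arguments levels and labels, which the script passes by position.

```r
# Toy indicator (not the HELP data): 0 = male, 1 = female.
female <- c(0, 1, 1, 0, 0)

gender <- factor(female, levels = c(0, 1), labels = c("Male", "Female"))

# Cross-tabulating old against new should leave the off-diagonal cells empty.
check <- table(female, gender)
print(check)
```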

Chunk 25

library(dplyr)
newds = arrange(ds, cesd, i1)
newds[1:5, c("cesd", "i1", "id")]

Load in the package dplyr.
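Sort ds in ascending order by cesd and, within ties, by i1, and store the result in newds (this overwrites the earlier newds).
Show the first 5 rows of the sorted data for the variables cesd, i1, and id.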
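Sorting by two keys can be sketched in base R with order. The data frame toy is invented, and order is shown as an alternative to arrange, not as its implementation.

```r
# Toy data (not from the HELP study); two subjects tie on cesd.
toy <- data.frame(id   = 1:5,
                  cesd = c(30, 3, 49, 3, 1),
                  i1   = c(56, 5, 13, 2, 10))

# Sort by cesd first; i1 breaks ties between equal cesd values.
sorted <- toy[order(toy$cesd, toy$i1), ]
print(sorted[1:3, c("cesd", "i1", "id")])
```

Subjects 2 and 4 both have cesd equal to 3, and subject 4 comes first because its i1 is smaller.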

Chunk 26

library(dplyr)
females = filter(ds, female==1)
with(females, mean(cesd))

[1] 36.9

# an alternative approach
mean(ds$cesd[ds$female==1])

[1] 36.9

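Load in the package dplyr.
Create a subset of ds that contains only the female subjects.
Compute the mean CESD score in the female subset.
As an alternative, compute the same mean directly by indexing ds$cesd with the condition female==1.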

Chunk 27

with(ds, tapply(cesd, female, mean))

   0    1 
31.6 36.9

library(mosaic)
mean(cesd ~ female, data=ds)

   0    1 
31.6 36.9

Compute the mean CESD score for each level of female (0 = male, 1 = female) in ds with tapply.
Load in the package mosaic.
Compute the same group means in ds with the formula interface that mosaic adds to mean.
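The group means can be checked by hand on a small invented data set. The sketch compares tapply with aggregate, a base R function that also accepts a formula; aggregate is an addition here and does not appear in the script.

```r
# Toy data (not from the HELP study).
toy <- data.frame(cesd   = c(30, 40, 50, 20, 36),
                  female = c(0, 1, 1, 0, 1))

# Named vector of means, one per level of female.
by_tapply <- with(toy, tapply(cesd, female, mean))

# Same means returned as a data frame.
by_aggregate <- aggregate(cesd ~ female, data = toy, FUN = mean)

# Single-group version, as in Chunk 26.
females_only <- mean(toy$cesd[toy$female == 1])
```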