DataM: HW Exercise 0330 6
The HELP (Health Evaluation and Linkage to Primary Care) study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. Eligible subjects were adults, who spoke Spanish or English, reported alcohol, heroin or cocaine as their first or second drug of choice, resided in proximity to the primary care clinic to which they would be referred or were homeless. Subjects were interviewed at baseline during their detoxification stay and follow-up interviews were undertaken every 6 months for 2 years. A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions.
The following R script is used to manage the data file at the initial stage of investigation. Provide comments on what each line of the script is meant to achieve.
Source: Kleinman, K., & Horton, N.J. (2015). Using R for Data Management, Statistical Analysis, and Graphics.
Chunk 2
- Use " " as the continuation prompt for input that spans more than one line.
- Display numbers with 3 significant digits.
- Set the display width to 72 characters.
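The code for this chunk is not shown in the write-up. A minimal sketch of settings that would match the three comments, assuming they are made with base R's options() and that the first comment refers to the continuation prompt:

```r
# Assumed reconstruction: the original chunk's code is not shown.
options(continue = " ",  # prompt shown when a command continues on the next line
        digits = 3,      # print numbers with 3 significant digits
        width = 72)      # wrap printed output at 72 characters
print(pi)                # 3.14
```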
Chunk 3
ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")
library(dplyr)
newds = select(ds, cesd, female, i1, i2, id, treat, f1a, f1b, f1c, f1d, f1e, f1f, f1g, f1h, f1i, f1j, f1k, f1l, f1m, f1n, f1o, f1p, f1q, f1r, f1s, f1t)
- Load the dataset from the online file.
- Load the package dplyr.
- Select some variables from ds as a new data frame and name it newds.
Chunk 4
[1] "cesd" "female" "i1" "i2" "id" "treat" "f1a"
[8] "f1b" "f1c" "f1d" "f1e" "f1f" "f1g" "f1h"
[15] "f1i" "f1j" "f1k" "f1l" "f1m" "f1n" "f1o"
[22] "f1p" "f1q" "f1r" "f1s" "f1t"
'data.frame': 453 obs. of 10 variables:
$ cesd : int 49 30 39 15 39 6 52 32 50 46 ...
$ female: int 0 0 0 1 0 1 1 0 1 0 ...
$ i1 : int 13 56 0 5 10 4 13 12 71 20 ...
$ i2 : int 26 62 0 5 13 4 20 24 129 27 ...
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ treat : int 1 1 0 0 0 1 0 1 0 1 ...
$ f1a : int 3 3 3 0 3 1 3 1 3 2 ...
$ f1b : int 2 2 2 0 0 0 1 1 2 3 ...
$ f1c : int 3 0 3 1 3 1 3 2 3 3 ...
$ f1d : int 0 3 0 3 3 3 1 3 1 0 ...
- Display the names of the variables in newds.
- Display the structure of the first 10 variables in newds.
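The chunk's code is not shown. A small sketch of the two steps, assuming base R's names() and str() were used; toy is a hypothetical stand-in for newds:

```r
# Hypothetical stand-in for newds (the real data has 453 rows and 26 columns)
toy = data.frame(cesd = c(49L, 30L, 39L),
                 female = c(0L, 0L, 1L),
                 i1 = c(13L, 56L, 0L))
names(toy)       # column names as a character vector
str(toy[, 1:2])  # structure of the first two columns only
```

With the real data, str(newds[, 1:10]) would restrict the display to the first 10 variables.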
Chunk 5
cesd female i1 i2
Min. : 1.0 Min. :0.000 Min. : 0.0 Min. : 0.0
1st Qu.:25.0 1st Qu.:0.000 1st Qu.: 3.0 1st Qu.: 3.0
Median :34.0 Median :0.000 Median : 13.0 Median : 15.0
Mean :32.8 Mean :0.236 Mean : 17.9 Mean : 22.6
3rd Qu.:41.0 3rd Qu.:0.000 3rd Qu.: 26.0 3rd Qu.: 32.0
Max. :60.0 Max. :1.000 Max. :142.0 Max. :184.0
id treat f1a f1b
Min. : 1 Min. :0.000 Min. :0.00 Min. :0.00
1st Qu.:119 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.00
Median :233 Median :0.000 Median :2.00 Median :1.00
Mean :233 Mean :0.497 Mean :1.63 Mean :1.39
3rd Qu.:348 3rd Qu.:1.000 3rd Qu.:3.00 3rd Qu.:2.00
Max. :470 Max. :1.000 Max. :3.00 Max. :3.00
f1c f1d
Min. :0.00 Min. :0.00
1st Qu.:1.00 1st Qu.:0.00
Median :2.00 Median :1.00
Mean :1.92 Mean :1.56
3rd Qu.:3.00 3rd Qu.:3.00
Max. :3.00 Max. :3.00
Summarize the first 10 variables in newds with descriptive statistics: the minimum, the first quartile (Q1), the median, the mean, the third quartile (Q3), and the maximum. If a variable has missing values, the number of missing values is reported as well.
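A short illustration of how summary() reports missing values, using a hypothetical toy data frame rather than newds:

```r
# Hypothetical data: f1g has one missing value
toy = data.frame(cesd = c(49, 30, 39, 15),
                 f1g = c(3, NA, 1, 2))
summary(toy[, 1:2])     # one block of six statistics per column
s = summary(toy$f1g)    # for a single vector, the NA count is an extra "NA's" entry
s
```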
Chunk 7
[1] "HELP baseline dataset"
Attach a descriptive comment to newds, display it, and save the data to a file.
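The chunk's code is not shown. A sketch of the idea, assuming base R's comment() and save(); the data frame and the file name are hypothetical:

```r
toy = data.frame(id = 1:3, cesd = c(49L, 30L, 39L))  # hypothetical data
comment(toy) = "HELP baseline dataset"   # attach the comment as an attribute
comment(toy)                             # display it

path = file.path(tempdir(), "toy.RData") # hypothetical file name
save(toy, file = path)                   # write the object to disk

rm(toy)
load(path)                               # restores toy, comment included
comment(toy)
```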
Chunk 9
- Load the package foreign.
- Save newds as a .dta file and a .sas file.
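The chunk's code is not shown. One way to do this with the foreign package is sketched below, assuming write.dta() for Stata and write.foreign() for SAS; the data and file names are hypothetical. For SAS, write.foreign() produces a plain-text data file plus a .sas program that reads it.

```r
library(foreign)

toy = data.frame(id = 1:3, cesd = c(49L, 30L, 39L))  # hypothetical data
dta_path = file.path(tempdir(), "toy.dta")
dat_path = file.path(tempdir(), "toy.dat")
sas_path = file.path(tempdir(), "toy.sas")

write.dta(toy, dta_path)                       # Stata format
write.foreign(toy, datafile = dat_path,        # text data file ...
              codefile = sas_path,             # ... and SAS code to read it
              package = "SAS")

back = read.dta(dta_path)                      # read the Stata file back as a check
```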
Chunk 10
[1] 49 30 39 15 39 6 52 32 50 46
[1] 49 30 39 15 39 6 52 32 50 46
Show the first 10 values of cesd, a variable in newds.
Chunk 11
[1] 57 58 57 60 58 58 57
Show the values of cesd that exceed 56.
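Chunks 10 and 11 both come down to indexing a vector, by position and by a logical condition. A sketch with a hypothetical vector (the original code is not shown):

```r
cesd = c(49, 30, 57, 15, 60, 6)   # hypothetical scores

first3 = head(cesd, 3)            # same values as cesd[1:3]
high = cesd[cesd > 56]            # keep only elements where the condition is TRUE
high                              # 57 60
```

With the real data, with(newds, cesd[1:10]) and with(newds, cesd[cesd > 56]) would follow the same pattern.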
Chunk 12
- Load the package dplyr.
- Pick the rows of newds in which cesd > 56 and select 2 variables, id and cesd.
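A sketch of the filter-then-select step with dplyr, on a hypothetical toy data frame (the chunk's own code is not shown):

```r
library(dplyr)

toy = data.frame(id = 1:5,
                 cesd = c(49, 57, 39, 60, 56),
                 female = c(0, 1, 0, 1, 0))   # hypothetical data

# filter() keeps rows meeting the condition; select() keeps the named columns
high = select(filter(toy, cesd > 56), id, cesd)
high
```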
Chunk 13
[1] 1 3 3 4
[1] 199
- Sort newds$cesd in ascending order and show the first 4 elements, that is, the 4 smallest values of newds$cesd.
- Show the index (the row number) of the minimum of newds$cesd.
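A sketch of both steps with base R's sort() and which.min(), on a hypothetical vector:

```r
cesd = c(49, 3, 39, 1, 4, 3)   # hypothetical scores

smallest4 = sort(cesd)[1:4]    # ascending sort, then the first four: 1 3 3 4
idx = which.min(cesd)          # position of the (first) minimum: 4
```

which.min() returns a position in the original vector, not in the sorted one, and only the first position if the minimum is tied.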
Chunk 14
is.na(f1g)
TRUE FALSE
1 452
- Load the package mosaic.
- Count the missing values of f1g, a variable in newds.
- Compute descriptive statistics of newds$f1g.
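The same two quantities can be obtained without mosaic. The sketch below uses base R as a stand-in for the mosaic calls (it is not the chunk's original code), on a hypothetical vector:

```r
f1g = c(3, NA, 1, 2, 0)        # hypothetical item scores with one missing value

miss = table(is.na(f1g))       # counts of FALSE (observed) and TRUE (missing)
miss

summary(f1g)                   # quartiles, mean, and the NA count
sd(f1g, na.rm = TRUE)          # standard deviation of the observed values
```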
Chunk 15
# reverse code f1d, f1h, f1l and f1p
cesditems = with(newds, cbind(f1a, f1b, f1c, (3 - f1d), f1e, f1f, f1g,
(3 - f1h), f1i, f1j, f1k, (3 - f1l), f1m, f1n, f1o, (3 - f1p),
f1q, f1r, f1s, f1t)) #1
nmisscesd = apply(is.na(cesditems), 1, sum) #2
ncesditems = cesditems #3
ncesditems[is.na(cesditems)] = 0 #3
newcesd = apply(ncesditems, 1, sum) #4
imputemeancesd = 20/(20-nmisscesd)*newcesd #5
- Pick the 20 items of the CESD scale, reverse-score the four reversed items, and call the resulting matrix cesditems.
- Count the number of missing CESD items for each subject (row).
- Duplicate cesditems and name it ncesditems. Fill all missing values in ncesditems with 0.
- Sum the item scores in each row to compute each subject's total CESD score.
- Compute the mean-imputed CESD score by scaling the observed total by 20/(20 - nmisscesd), which replaces each missing item with the mean of that subject's observed items.
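A small worked example of the same arithmetic, using a hypothetical 2-subject item matrix instead of newds. Subject 2 answers 2 on every item but skips two of them, so the observed total of 36 is scaled up to 20/18 * 36 = 40:

```r
# Hypothetical data: 2 subjects x 20 items
items = rbind(rep(1, 20), rep(2, 20))
items[2, c(7, 12)] = NA                   # subject 2 is missing two items

nmiss = apply(is.na(items), 1, sum)       # missing items per subject: 0 2
filled = items
filled[is.na(items)] = 0                  # treat missing items as 0 for the sum
total = apply(filled, 1, sum)             # observed totals: 20 36
imputed = 20 / (20 - nmiss) * total       # mean-imputed totals: 20 40

data.frame(total, nmiss, imputed)
```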
Chunk 16
Collect the data created in Chunk 15 into a data frame and show it.
Chunk 17
message=FALSE #1
library(dplyr) #2
library(memisc) #2
newds = mutate(newds, drinkstat=
cases(
"abstinent" = i1==0,
"moderate" = (i1>0 & i1<=1 & i2<=3 & female==1) |
(i1>0 & i1<=2 & i2<=4 & female==0),
"highrisk" = ((i1>1 | i2>3) & female==1) |
((i1>2 | i2>4) & female==0)))
- Suppress messages (especially those printed when loading packages).
- Load the packages dplyr and memisc.
- Create a new variable drinkstat in newds. drinkstat is a factor with 3 levels that classifies the severity of drinking according to the given conditions.
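To check the classification rules on a few cases, the sketch below applies the same three conditions in base R instead of memisc's cases(). The data are hypothetical, and the choice to let the condition listed first win if two conditions overlap is an assumption of this sketch:

```r
toy = data.frame(i1 = c(0, 1, 2, 2, 13),
                 i2 = c(0, 3, 4, 4, 26),
                 female = c(1, 1, 0, 1, 0))   # hypothetical subjects

abstinent = with(toy, i1 == 0)
moderate  = with(toy, (i1 > 0 & i1 <= 1 & i2 <= 3 & female == 1) |
                      (i1 > 0 & i1 <= 2 & i2 <= 4 & female == 0))
highrisk  = with(toy, ((i1 > 1 | i2 > 3) & female == 1) |
                      ((i1 > 2 | i2 > 4) & female == 0))

# Assign in reverse order so that an earlier-listed condition overrides a later one
status = rep(NA_character_, nrow(toy))
status[highrisk]  = "highrisk"
status[moderate]  = "moderate"
status[abstinent] = "abstinent"
toy$drinkstat = factor(status, levels = c("abstinent", "moderate", "highrisk"))
toy
```

Row 4 shows the effect of the sex-specific limits: the same intake (i1 = 2, i2 = 4) is moderate for the man in row 3 but high risk for the woman.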
Chunk 20
- Load the package dplyr.
- Pick the variables i1, i2, female, and drinkstat from newds into a new data frame and name it tmpds.
- Show rows 365 to 370 of tmpds.
Chunk 21
- Load the package dplyr.
- Show the rows of tmpds that meet the given conditions (drinking status is moderate and the subject is female).
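Chunks 20 and 21 can be sketched with dplyr as follows. The code for these chunks is not shown, so this is an assumed reconstruction on a hypothetical toy data frame; rows 2 to 4 play the role of rows 365 to 370:

```r
library(dplyr)

toy = data.frame(id = 1:5,
                 i1 = c(0, 1, 2, 2, 13),
                 i2 = c(0, 3, 4, 4, 26),
                 female = c(1, 1, 0, 1, 0),
                 drinkstat = factor(c("abstinent", "moderate", "moderate",
                                      "highrisk", "highrisk"),
                                    levels = c("abstinent", "moderate", "highrisk")))

tmp = select(toy, i1, i2, female, drinkstat)   # keep four columns
tmp[2:4, ]                                     # show a range of rows

modfem = filter(tmp, drinkstat == "moderate" & female == 1)
modfem
```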
Chunk 22
Cell Contents
|-------------------------|
| N |
| N / Table Total |
|-------------------------|
Total Observations in Table: 453
| abstinent | moderate | highrisk |
|-----------|-----------|-----------|
| 68 | 28 | 357 |
| 0.150 | 0.062 | 0.788 |
|-----------|-----------|-----------|
- Load the package gmodels.
- Display the frequency distribution table of the severity of drinking.
Chunk 23
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 453
| female
drinkstat | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
abstinent | 42 | 26 | 68 |
| 0.618 | 0.382 | 0.150 |
-------------|-----------|-----------|-----------|
moderate | 21 | 7 | 28 |
| 0.750 | 0.250 | 0.062 |
-------------|-----------|-----------|-----------|
highrisk | 283 | 74 | 357 |
| 0.793 | 0.207 | 0.788 |
-------------|-----------|-----------|-----------|
Column Total | 346 | 107 | 453 |
-------------|-----------|-----------|-----------|
Display the two-way frequency table of drinking severity by gender, with row proportions.
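The proportions in Chunks 22 and 23 can be checked from the printed counts. The sketch below uses base R's prop.table() in place of gmodels (it is not the chunk's original code) and enters the counts by hand:

```r
# Counts copied from the two-way table above
tab = as.table(matrix(c(42, 21, 283,    # female == 0
                        26, 7, 74),     # female == 1
                      nrow = 3,
                      dimnames = list(drinkstat = c("abstinent", "moderate", "highrisk"),
                                      female = c("0", "1"))))

rowprop = round(prop.table(tab, 1), 3)      # N / Row Total
rowprop
overall = round(rowSums(tab) / sum(tab), 3) # N / Table Total: 0.150 0.062 0.788
overall
```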
Chunk 24
newds = transform(newds,
gender=factor(female, c(0,1), c("Male","Female")))
tally(~ female + gender, margin=FALSE, data=newds)
      gender
female Male Female
     0  346      0
     1    0    107
- Revise newds: use the binary variable female to create a factor with 2 levels (Male and Female) and name it gender.
- Show the contingency table of female and gender to verify the recoding.
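The recoding check can be reproduced on a hypothetical toy data frame. Base R's table() is used here in place of mosaic's tally():

```r
toy = data.frame(female = c(0, 1, 0, 0, 1))   # hypothetical data

# factor(x, levels, labels): 0 becomes "Male", 1 becomes "Female"
toy = transform(toy, gender = factor(female, c(0, 1), c("Male", "Female")))

tab = table(female = toy$female, gender = toy$gender)
tab   # all counts should fall on the diagonal
```

Zero counts off the diagonal confirm that every 0 was labeled Male and every 1 Female.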
Chunk 25
- Load the package dplyr.
- Sort newds in ascending order by two variables, cesd and i1.
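A sketch of a two-key sort with dplyr's arrange(), on hypothetical data (the chunk's own code is not shown). Rows are ordered by cesd first, and ties in cesd are broken by i1:

```r
library(dplyr)

toy = data.frame(id = 1:4,
                 cesd = c(30, 15, 30, 15),
                 i1 = c(5, 9, 2, 1))   # hypothetical data

sorted = arrange(toy, cesd, i1)
sorted$id   # 4 2 3 1
```

The base R equivalent would be toy[order(toy$cesd, toy$i1), ].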
Chunk 26
[1] 36.9
[1] 36.9
- Load the package dplyr.
- Create a female subset from ds.
- Compute the mean CESD score in the female subset.
Chunk 27
0 1
31.6 36.9
0 1
31.6 36.9
- Compute the mean CESD score for each level of female (0 = male, 1 = female) in ds.
- Load the package mosaic.
- Compute the same group means again, this time with mosaic.
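The subset mean of Chunk 26 and the group means of Chunk 27 can be sketched in base R as follows. The data are hypothetical and the original code is not shown; with mosaic loaded, a formula call such as mean(cesd ~ female, data = ds) is presumably what produced the second copy of the output.

```r
toy = data.frame(female = c(0, 1, 0, 1),
                 cesd = c(30, 40, 32, 34))   # hypothetical data

fmean = mean(toy$cesd[toy$female == 1])      # mean in the female subset: 37
means = tapply(toy$cesd, toy$female, mean)   # one mean per level of female
means                                        # 0: 31, 1: 37
```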