The HELP study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. The data set corresponding to this study is part of the mosaicData package.
First, save the data set as help.
help<-HELPrct
Let’s create a two-way table with substance as the row variable and sex as the column variable.
tally(~substance|sex, data=help)
##          sex
## substance female male
##   alcohol     36  141
##   cocaine     41  111
##   heroin      30   94
We can also display the entries as percents by adding the option format="percent":
tally(~substance|sex, data=help, format="percent")
##          sex
## substance   female     male
##   alcohol 33.64486 40.75145
##   cocaine 38.31776 32.08092
##   heroin  28.03738 27.16763
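Notice that the percentages in each column add to 100%: each column shows the distribution of substance within one sex (the conditional distribution of substance given sex).
Or we can display proportions by adding the option format="proportion":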
tally(~substance|sex, data=help, format="proportion")
##          sex
## substance    female      male
##   alcohol 0.3364486 0.4075145
##   cocaine 0.3831776 0.3208092
##   heroin  0.2803738 0.2716763
Notice that the proportions in each column add to 1.
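As a cross-check, the column proportions can also be reproduced in base R. The sketch below is not part of the mosaic workflow: it re-enters the counts from the first table by hand (the object names counts and col_props are illustrative) and uses prop.table with margin = 2 to divide each entry by its column total.

```r
# Counts copied by hand from the first tally() output above.
# `counts` is an illustrative name, not an object created earlier.
counts <- matrix(c(36, 41, 30, 141, 111, 94), nrow = 3,
                 dimnames = list(substance = c("alcohol", "cocaine", "heroin"),
                                 sex = c("female", "male")))

# margin = 2 divides every entry by its column total, giving the
# distribution of substance within each sex.
col_props <- prop.table(counts, margin = 2)
print(col_props)

# Each column sums to 1.
print(colSums(col_props))
```

Using margin = 1 instead would divide by the row totals, giving the distribution of sex within each substance.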
We can ask R to give us the sum of the counts in each column by adding the option margins=T:
tally(~substance|sex, data=help, margins=T)
##          sex
## substance female male
##   alcohol     36  141
##   cocaine     41  111
##   heroin      30   94
##   Total      107  346
If we want to display both the column sums and the row sums (i.e. the totals behind the two marginal distributions), then we have to modify the way we enter the variables inside the tally command (notice that | is replaced by &):
tally(~substance&sex, data=help, margins=T)
##          sex
## substance female male Total
##   alcohol     36  141   177
##   cocaine     41  111   152
##   heroin      30   94   124
##   Total      107  346   453
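For comparison, here is a minimal base R sketch of the same row and column totals. It assumes the counts are re-entered by hand from the table above (counts and with_totals are illustrative names) and uses addmargins, which labels the totals Sum rather than Total.

```r
# Counts copied by hand from the tally() output; names are illustrative.
counts <- matrix(c(36, 41, 30, 141, 111, 94), nrow = 3,
                 dimnames = list(substance = c("alcohol", "cocaine", "heroin"),
                                 sex = c("female", "male")))

# addmargins() appends a "Sum" row and a "Sum" column to the table.
with_totals <- addmargins(as.table(counts))
print(with_totals)
```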
We can also do this when we display proportions (or percents), but in this case the proportions in all the cells (instead of in each column) add to 1 (or 100%). So format="proportion" with this command gives the joint distribution of the two variables, and the Total row and column give the two marginal distributions.
tally(~substance&sex, data=help, format="proportion",margins=T)
##          sex
## substance     female       male      Total
##   alcohol 0.07947020 0.31125828 0.39072848
##   cocaine 0.09050773 0.24503311 0.33554084
##   heroin  0.06622517 0.20750552 0.27373068
##   Total   0.23620309 0.76379691 1.00000000
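The relationship between the joint and marginal distributions can be sketched in base R. This is only an illustration built on hand-entered counts (counts, joint, marg_substance, and marg_sex are names introduced here): calling prop.table without a margin divides every cell by the grand total, and the row and column sums of the result are the marginal distributions.

```r
# Counts copied by hand from the two-way table; names are illustrative.
counts <- matrix(c(36, 41, 30, 141, 111, 94), nrow = 3,
                 dimnames = list(substance = c("alcohol", "cocaine", "heroin"),
                                 sex = c("female", "male")))

# With no margin argument, every cell is divided by the grand total (453),
# which gives the joint distribution.
joint <- prop.table(counts)
print(joint)

# Summing the joint distribution across rows or columns gives the marginals.
marg_substance <- rowSums(joint)
marg_sex <- colSums(joint)
print(marg_substance)
print(marg_sex)
```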
If we only want to look at a certain subset of the data (say only the observations for which the variable sex has the value female), then we can add the optional subset argument inside tally:
tally(~substance, data=help, format="proportion",margins=T,subset=sex=="female")
## 
##   alcohol   cocaine    heroin     Total 
## 0.3364486 0.3831776 0.2803738 1.0000000 
We can select a different criterion to define the subset we want to look at (say when the value of the variable age is less than 35), and create the two-way table for only these observations.
tally(~substance&sex, data=help, margins=T, subset=age<35)
##          sex
## substance female male Total
##   alcohol     12   46    58
##   cocaine     22   63    85
##   heroin      16   58    74
##   Total       50  167   217
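Conceptually, the subset argument filters the rows first and tabulates what is left. The base R sketch below illustrates that idea on a small made-up data frame (toy is hypothetical and is not the HELP data); it is an analogy, not a description of how tally is implemented.

```r
# Hypothetical toy data, NOT the HELP data.
toy <- data.frame(
  substance = c("alcohol", "heroin", "cocaine", "alcohol", "heroin", "cocaine"),
  sex       = c("female", "male", "male", "male", "female", "female"),
  age       = c(28, 41, 33, 36, 30, 52)
)

# Step 1: keep only the rows that satisfy the condition.
young <- subset(toy, age < 35)

# Step 2: build the two-way table from the remaining rows.
tab <- table(young$substance, young$sex, dnn = c("substance", "sex"))
print(tab)
```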
Let’s display the distribution of substance in bargraphs, with a separate panel for each value of sex:
bargraph(~substance|sex, data=help)
The following command will put the two variables on the same graph and group the bars by the categories in the variable sex.
bargraph(~substance, groups=sex, data=help)
For these bargraphs, it is very useful to provide a key to what each color means. We can accomplish this by adding the option auto.key=T:
bargraph(~substance, groups=sex, data=help, auto.key=T)
As before, this command also works with subsets, so we can look at just the observations for which the value of age is less than 35:
bargraph(~substance, groups=sex, data=help, auto.key=T, subset=age<35)
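If only the summary counts are available, a similar grouped bar graph can be drawn with base R's barplot. This is a sketch under the assumption that the counts are re-entered by hand from the full two-way table (counts and mids are illustrative names); it uses base graphics rather than the mosaic bargraph function.

```r
# Counts copied by hand from the full two-way table; names are illustrative.
counts <- matrix(c(36, 41, 30, 141, 111, 94), nrow = 3,
                 dimnames = list(substance = c("alcohol", "cocaine", "heroin"),
                                 sex = c("female", "male")))

# barplot() draws one group of bars per column, so transpose the matrix to
# get one group per substance with one bar per sex inside each group.
# legend.text = TRUE builds the key from the row names (female, male).
mids <- barplot(t(counts), beside = TRUE, legend.text = TRUE,
                xlab = "substance", ylab = "count")
```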
Load the following data from the textbook:
binge<-read.file("/home/emesekennedy/Data/Ch2/bingegender.txt")
## Reading data with read.table()
Notice that this data set only has the summaries of the counts of the two categorical variables Frequent and Gender, but it is not quite in the right format to be a two-way table.
We can use the function dcast from the package reshape2 to format the data into a two-way table:
require(reshape2)
## Loading required package: reshape2
dcast(binge, Frequent~Gender)
## Using Count as value column: use value.var to override.
##   Frequent  Men Women
## 1       No 5550  8232
## 2      Yes 1630  1684
The “d” in dcast stands for data frame, which means that we can save the output as a new data set:
binge2<-dcast(binge, Frequent~Gender)
## Using Count as value column: use value.var to override.
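Base R's xtabs can do a similar long-to-wide conversion. The sketch below assumes the file holds one row per Frequent/Gender combination with a Count column, which is what the dcast message suggests but is not shown in the text; binge_long and two_way are illustrative names, and the counts are copied from the output above.

```r
# Assumed long layout of the data (one row per combination); the counts
# are copied from the dcast() output. `binge_long` is an illustrative name.
binge_long <- data.frame(
  Frequent = c("No", "Yes", "No", "Yes"),
  Gender   = c("Men", "Men", "Women", "Women"),
  Count    = c(5550, 1630, 8232, 1684)
)

# The left side of the formula is the column to sum; the right side
# names the row and column variables of the two-way table.
two_way <- xtabs(Count ~ Frequent + Gender, data = binge_long)
print(two_way)
```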
With a little more work, we can find the total number of observations:
sum(~Men, data=binge2)
## [1] 7180
sum(~Women, data=binge2)
## [1] 9916
7180+9916
## [1] 17096
Now, if we want to look at the proportions instead of the counts, we can divide each entry in the two-way table by the total number of observations. However, we cannot do this easily on the data set binge2, because its first column, Frequent, holds labels rather than numbers, so we will use the function acast to create the two-way table. The “a” in acast stands for array, which means that the command outputs an array that we can do arithmetic on.
acast(binge, Frequent~Gender)
## Using Count as value column: use value.var to override.
##      Men Women
## No  5550  8232
## Yes 1630  1684
acast(binge, Frequent~Gender)/17096
## Using Count as value column: use value.var to override.
##            Men      Women
## No  0.32463734 0.48151614
## Yes 0.09534394 0.09850257
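The division above can be written without typing the total by hand. In this sketch the matrix m is entered from the acast output (m, total, and props are illustrative names); sum adds up every cell, and dividing a numeric matrix by a single number divides each entry.

```r
# Two-way table of counts, copied from the acast() output.
m <- matrix(c(5550, 1630, 8232, 1684), nrow = 2,
            dimnames = list(Frequent = c("No", "Yes"),
                            Gender = c("Men", "Women")))

# Total number of observations, computed rather than typed in.
total <- sum(m)
print(total)

# Dividing the matrix by a number divides every entry by it.
props <- m / total
print(props)
```

The same result comes from prop.table(m), which performs this division in one step.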
Let’s create a bargraph to graphically represent the data.
barchart(Count~Frequent, groups=Gender, data=binge, auto.key=T)
We have worked with stemplots before to look at the distribution of a quantitative variable. If we want to compare the distribution of a quantitative variable in two groups, then we can create a back-to-back stemplot. We’ll do this on the data sets vitdboys and vitdgirls, each of which has one variable called VitaminD. First, let’s load each set and create a stemplot of VitaminD for each.
vitB<-read.file("/home/emesekennedy/Data/Ch1/vitdboys.txt")
## Reading data with read.table()
stem(vitB$VitaminD)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 8
## 1 | 28
## 2 | 134447788899
## 3 | 11237
vitG<-read.file("/home/emesekennedy/Data/Ch1/vitdgirls.txt")
## Reading data with read.table()
stem(vitG$VitaminD)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 1 | 6
## 2 | 3568
## 3 | 3455678
## 4 | 0122338
## 5 | 1
A back-to-back stemplot allows us to look at the two stemplots at the same time. We can create a back-to-back stemplot using the stem.leaf.backback command from the package aplpack.
require(aplpack)
## Loading required package: aplpack
## Loading required package: tcltk
## Warning in fun(libname, pkgname): couldn't connect to display ":0"
stem.leaf.backback(vitB$VitaminD,vitG$VitaminD,style="bare", depths=F)
## ___________________________
## 1 | 2: represents 12, leaf unit: 1
## vitB$VitaminD
## vitG$VitaminD
## ___________________________
##          | 0 |
##         8| 0 |
##         2| 1 |
##         8| 1 |6
##     44431| 2 |3
##   9988877| 2 |568
##      3211| 3 |34
##         7| 3 |55678
##          | 4 |012233
##          | 4 |8
##          | 5 |1
##          | 5 |
##          | 6 |
## ___________________________
## n: 20 20
## ___________________________
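If we don’t include the option depths=F, then we get a little more information about the two distributions. The extra outer columns show the depth of each row, which is the number of observations counted in from the nearer end of the data; the number in parentheses is the count of observations in the row that contains the median: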
stem.leaf.backback(vitB$VitaminD,vitG$VitaminD,style="bare")
## _________________________________
## 1 | 2: represents 12, leaf unit: 1
## vitB$VitaminD vitG$VitaminD
## _________________________________
##                | 0 |
##    1          8| 0 |
##    2          2| 1 |
##    3          8| 1 |6         1
##    8      44431| 2 |3         2
##   (7)   9988877| 2 |568       5
##    5       3211| 3 |34        7
##    1          7| 3 |55678    (5)
##                | 4 |012233    8
##                | 4 |8         2
##                | 5 |1         1
##                | 5 |
##                | 6 |
## _________________________________
## n: 20 20
## _________________________________
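The depth column can be checked by hand. The sketch below re-enters the values as they are displayed in the stemplot (leaf unit 1, so they may be rounded versions of the values in the files; boys and girls are illustrative names) and confirms that the rows marked with parentheses are the ones that contain the medians.

```r
# Values read off the stemplot above (leaf unit 1), so possibly rounded.
boys  <- c(8, 12, 18, 21, 23, 24, 24, 24, 27, 27,
           28, 28, 28, 29, 29, 31, 31, 32, 33, 37)
girls <- c(16, 23, 25, 26, 28, 33, 34, 35, 35, 36,
           37, 38, 40, 41, 42, 42, 43, 43, 48, 51)

# The medians fall in the rows marked (7) and (5).
print(median(boys))
print(median(girls))

# Number of observations in each median row: 25-29 for boys, 35-39 for girls.
print(sum(boys >= 25 & boys <= 29))
print(sum(girls >= 35 & girls <= 39))

# Depth of the boys' row just below the median row: values under 25,
# counted in from the low end.
print(sum(boys < 25))
```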