Helping Students

Question 1:

trialdata<-read.csv("C:\\Users\\eomara.000\\Documents\\University of Maryland\\Courses\\PHAR663\\Course data files\\AHTtrial_II.csv",skip=5,sep=",")

attach(trialdata)

The following object is masked by .GlobalEnv:

Gender

The following object is masked from trialdata (position 3):

Age, DelDBP, DELSBP, Dose, Gender, ID, Race, SCreatinine, SerumALT, TRT, Weight

(1) I do not understand why Gender is "masked _by_ .GlobalEnv" in the code above and what that means.

(2) When I use Gender[Gender=1] it lists only the single category "Male" rather than the entire list of male subjects in the database.

Response

Hi Edward,

A couple things for you.

ANSWER TO (1)

First, the message 'The following object is masked _by_ .GlobalEnv:' (it is a notice rather than an error) appears when an object with the same name already exists in your global environment (think of this as the things you have already defined in your session). To give a trivial example, imagine you had already defined some constants such as:

Weight <- 70
AMT <- 500

Weight and AMT are now in your global environment for you to use.

So if I wanted to calculate the dose in mg/kg I could easily do:

DOSE <- AMT/Weight

Now, if I then come in and type:

trialdata <- read.csv("./DATA/AHTtrial_II.csv", skip = 5, sep = ",")
attach(trialdata)

## The following object is masked _by_ .GlobalEnv:
## 
##     Weight

Because the dataset has a column named 'Weight', R tells you: whoa, you already have a different object named 'Weight' in your global environment. R looks in the global environment first, so it masks (hides) the NEW object you are trying to call Weight (which is the column of Weight values from the AHTtrial_II dataset). You then get the same message you reported (as I've demonstrated here with Weight).

So if I now type in 'Weight', R will return 70 (which I had already defined) rather than the column of 'Weights'

Weight

## [1] 70

To get around this you must remove what was previously stored in your global environment to unmask what you've attached:

rm(Weight)
Weight  #should now be vector from attached dataset

##   [1] 63.62 67.13 58.14 68.80 63.66 71.62 62.28 79.34 63.06 70.61 66.71
##  [12] 78.95 75.73 72.18 71.65 61.99 63.57 63.90 74.75 68.15 71.72 69.27
##  [23] 66.38 67.32 70.05 60.88 69.60 67.05 65.13 75.88 73.62 67.23 63.75
##  [34] 73.85 72.05 78.39 67.81 71.28 66.21 81.19 78.51 65.25 72.59 79.26
##  [45] 66.87 76.90 70.19 70.11 66.41 66.82 65.69 79.93 70.72 87.05 71.99
##  [56] 59.95 78.06 78.52 78.55 71.12 64.33 74.87 76.18 69.55 68.08 60.47
##  [67] 73.76 82.76 63.91 70.98 70.44 75.77 75.44 66.46 68.03 67.63 72.88
##  [78] 64.40 76.95 70.02 56.29 76.81 78.04 72.86 71.96 67.31 68.19 74.25
##  [89] 68.49 60.99 70.53 80.57 68.19 65.67 66.09 77.62 65.53 58.25 73.17
## [100] 73.88
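
If you want to see the lookup order for yourself, here is a minimal sketch using a made-up three-row data frame. The name `toy` and its values are hypothetical, not taken from AHTtrial_II. `search()` lists the places R looks for a name, in order, and `find()` reports every place where a name was found.

```r
# Hypothetical toy data standing in for the course dataset
toy <- data.frame(ID = 1:3, Weight = c(63.62, 67.13, 58.14))

Weight <- 70               # a constant defined in the global environment
attach(toy)                # toy's columns are now on the search path

search()[1:2]              # the global environment is searched before "toy"
find("Weight")             # lists every place a 'Weight' was found
masked_value <- Weight     # the constant wins over the attached column

rm(Weight)                 # remove the constant...
unmasked_value <- Weight   # ...and the attached column becomes visible

detach(toy)                # always clean up after attach()
```

Run at the console, `masked_value` holds the constant and `unmasked_value` holds the column, which is exactly the behaviour shown with the real data above.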

Now, to answer your question: the real way to 'get around this' is to never attach a dataset. It is risky business (as you've found out), especially when you start working on larger projects.

Instead, you can query values by explicitly calling the column you want with dataset$columnname

For example:

Right now, if you want to see the Gender column, you attach the dataset and type the column name:

attach(trialdata)

## The following objects are masked from trialdata (position 3):
## 
##     Age, DELSBP, DelDBP, Dose, Gender, ID, Race, SCreatinine,
##     SerumALT, TRT, Weight


# I will use 'head' to print only the first 10 values to save space
head(Gender, n = 10)

##  [1] Male   Female Female Male   Female Female Female Male   Female Female
## Levels: Female Male

(Bad!! It is easy to start masking previously defined objects, especially when working with multiple datasets.)

# let's detach it
detach(trialdata)

INSTEAD you should do:

trialdata$Gender

##   [1] Male   Female Female Male   Female Female Female Male   Female Female
##  [11] Female Male   Male   Female Male   Male   Female Female Female Female
##  [21] Male   Female Female Female Male   Female Male   Female Male   Male  
##  [31] Male   Male   Male   Female Male   Female Male   Female Female Female
##  [41] Female Female Female Female Male   Male   Female Male   Female Female
##  [51] Male   Male   Female Male   Male   Male   Female Female Female Female
##  [61] Male   Female Female Female Female Female Female Female Male   Female
##  [71] Male   Female Male   Female Female Male   Male   Male   Female Male  
##  [81] Female Male   Male   Male   Female Female Female Male   Female Female
##  [91] Male   Male   Male   Female Female Male   Male   Female Male   Male  
## Levels: Female Male
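
As a small sketch of the ways to reach a column without attaching anything, here is a made-up four-row data frame (`toy` and its values are hypothetical). All three indexing forms return the same vector, and `with()` is a base R helper that evaluates one expression using the columns of a data frame and leaves nothing attached afterwards.

```r
# Hypothetical toy data; the real file has more columns and 100 rows
toy <- data.frame(
  Gender = factor(c("Male", "Female", "Female", "Male")),
  Weight = c(63.62, 67.13, 58.14, 68.80)
)

g1 <- toy$Gender         # column by name with $
g2 <- toy[["Gender"]]    # same column with [[ ]], handy when the name is stored in a variable
g3 <- toy[, "Gender"]    # same column with [row, column] indexing

# with() looks up Weight inside toy for this one expression only
mean_weight <- with(toy, mean(Weight))
```

Which form to use is mostly a matter of taste; `$` is the most common in interactive work.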

ANSWER TO (2)

Following on what I've just told you (never attach!) you can easily subset out rows.

The general form is: dataset[dataset$columnname == 'row value you want',] (NOTE: you need that ',' after the row value - easy to miss)

How this 'reads' is: for the dataframe named dataset get all rows such that in the column (columnname) they have the value 'row value you want'.

So in your case you would type: trialdata[trialdata$Gender == 'Male',]

head(trialdata[trialdata$Gender == "Male", ], n = 10)

##    ID   Age Weight SCreatinine SerumALT Gender      Race  DelDBP  DELSBP
## 1   1 44.16  63.62      1.1201    16.22   Male Caucasian   4.042 -28.483
## 4   4 50.85  68.80      0.9241    15.44   Male Caucasian -13.607   2.326
## 8   8 63.02  79.34      0.9335    14.47   Male     Other  -0.069 -32.223
## 12 12 55.73  78.95      0.8810    19.72   Male Caucasian  -2.043 -32.135
## 13 13 49.51  75.73      1.1102    18.28   Male Caucasian -12.808  -2.770
## 15 15 51.23  71.65      0.9064    19.78   Male Caucasian -17.061   0.235
## 16 16 49.02  61.99      1.0713    15.54   Male Caucasian  -8.612 -25.948
## 21 21 47.42  71.72      1.0775    17.67   Male     Black   6.172 -40.991
## 25 25 51.41  70.05      1.0111    16.26   Male Caucasian   3.913 -11.781
## 27 27 51.96  69.60      0.9725    16.55   Male Caucasian   8.450  -6.806
##        TRT Dose
## 1  Placebo    0
## 4  Placebo    0
## 8  Placebo    0
## 12 Placebo    0
## 13 Placebo    0
## 15 Placebo    0
## 16 Placebo    0
## 21 Placebo    0
## 25 Placebo    0
## 27 Placebo    0

This will give you the whole AHTtrial dataset (all columns: ID, Age, Weight, ...) but only the rows that have Gender == 'Male'.

If you want to save as a new dataset with only males you could easily type it as:

male_trialdata <- trialdata[trialdata$Gender == "Male", ]

So you would now have two dataframes (one with only males and the original with everything)

head(trialdata)

##   ID   Age Weight SCreatinine SerumALT Gender      Race  DelDBP  DELSBP
## 1  1 44.16  63.62      1.1201    16.22   Male Caucasian   4.042 -28.483
## 2  2 47.15  67.13      1.1004    16.03 Female  Hispanic -16.785 -25.319
## 3  3 44.47  58.14      1.0299    15.64 Female     Black -10.010 -49.503
## 4  4 50.85  68.80      0.9241    15.44   Male Caucasian -13.607   2.326
## 5  5 43.98  63.66      1.0151    18.55 Female     Other -15.791 -24.010
## 6  6 47.57  71.62      1.0199    18.13 Female Caucasian  -4.153 -45.685
##       TRT Dose
## 1 Placebo    0
## 2 Placebo    0
## 3 Placebo    0
## 4 Placebo    0
## 5 Placebo    0
## 6 Placebo    0

head(male_trialdata)

##    ID   Age Weight SCreatinine SerumALT Gender      Race  DelDBP  DELSBP
## 1   1 44.16  63.62      1.1201    16.22   Male Caucasian   4.042 -28.483
## 4   4 50.85  68.80      0.9241    15.44   Male Caucasian -13.607   2.326
## 8   8 63.02  79.34      0.9335    14.47   Male     Other  -0.069 -32.223
## 12 12 55.73  78.95      0.8810    19.72   Male Caucasian  -2.043 -32.135
## 13 13 49.51  75.73      1.1102    18.28   Male Caucasian -12.808  -2.770
## 15 15 51.23  71.65      0.9064    19.78   Male Caucasian -17.061   0.235
##        TRT Dose
## 1  Placebo    0
## 4  Placebo    0
## 8  Placebo    0
## 12 Placebo    0
## 13 Placebo    0
## 15 Placebo    0
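
Here is a short sketch of the same row filter on a made-up data frame (`toy` is hypothetical), alongside base R's `subset()`, which does the same job with bare column names. The last line shows what can happen when the comma is forgotten: R then treats the TRUE/FALSE vector as a column selector, and in this toy case that fails because there are four values but only three columns.

```r
# Hypothetical toy data
toy <- data.frame(
  ID     = 1:4,
  Gender = factor(c("Male", "Female", "Female", "Male")),
  Weight = c(63.62, 67.13, 58.14, 68.80)
)

# Row filter with [ , ]: logical test before the comma, nothing after it
males_bracket <- toy[toy$Gender == "Male", ]

# subset() applies the same filter using bare column names
males_subset <- subset(toy, Gender == "Male")

# Without the comma the logical vector selects columns, not rows;
# try() captures the resulting error so the script keeps running
no_comma <- try(toy[toy$Gender == "Male"], silent = TRUE)
```

Both filtered data frames keep the original row names (1 and 4 here), just as the real output above keeps 1, 4, 8, and so on.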

One last thing to note:

You tried Gender[Gender=1]. Keep in mind that a single '=' never tests for equality; that is what '==' is for. At the top level '=' works like '<-' (it means 'assign this value'), and inside brackets or a function call it names an argument.

So in your case, R treated 'Gender = 1' as an argument named Gender with the value 1, ignored the name, and used the 1 as the position to look up in the Gender vector (so essentially you were writing Gender[1]). This just asked for the first value in the Gender column, which in this case was Male.

Even if you had typed it the correct way (Gender[Gender == 'Male']) you would only have gotten a vector of Male Male Male ..., because attaching a dataframe exposes its columns as independent vectors; you would not have seen the values of the other columns in the rows where Gender == 'Male'.

I don't want to go on too much so I will post some additional info in a followup post if people want to continue reading…

For your reference, here are the outputs from some of the things you tried:


# when attached
attach(trialdata)

## The following objects are masked from trialdata (position 3):
## 
##     Age, DELSBP, DelDBP, Dose, Gender, ID, Race, SCreatinine,
##     SerumALT, TRT, Weight

Gender[Gender = 1]  # Essentially asking for Gender[1]

## [1] Male
## Levels: Female Male


Gender[Gender == 1]  # asking for the values in vector 'Gender' where Gender == 1

## factor(0)
## Levels: Female Male


Gender[Gender == "Male"]  # works, but probably not the result you were expecting

##  [1] Male Male Male Male Male Male Male Male Male Male Male Male Male Male
## [15] Male Male Male Male Male Male Male Male Male Male Male Male Male Male
## [29] Male Male Male Male Male Male Male Male Male Male Male Male Male Male
## [43] Male Male
## Levels: Female Male


detach(trialdata)

# instead want
male_trialdata <- trialdata[trialdata$Gender == "Male", ]
head(male_trialdata)  # subset containing only the rows where Gender is Male

##    ID   Age Weight SCreatinine SerumALT Gender      Race  DelDBP  DELSBP
## 1   1 44.16  63.62      1.1201    16.22   Male Caucasian   4.042 -28.483
## 4   4 50.85  68.80      0.9241    15.44   Male Caucasian -13.607   2.326
## 8   8 63.02  79.34      0.9335    14.47   Male     Other  -0.069 -32.223
## 12 12 55.73  78.95      0.8810    19.72   Male Caucasian  -2.043 -32.135
## 13 13 49.51  75.73      1.1102    18.28   Male Caucasian -12.808  -2.770
## 15 15 51.23  71.65      0.9064    19.78   Male Caucasian -17.061   0.235
##        TRT Dose
## 1  Placebo    0
## 4  Placebo    0
## 8  Placebo    0
## 12 Placebo    0
## 13 Placebo    0
## 15 Placebo    0
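
A note on why Gender == 1 returned factor(0) above: a factor stores integer codes internally, but comparisons are made against its labels. The sketch below uses a made-up four-value factor (`gender` is hypothetical) to show the difference.

```r
# Hypothetical toy factor
gender <- factor(c("Male", "Female", "Female", "Male"))

levels(gender)                   # labels, sorted alphabetically by default
codes <- as.integer(gender)      # underlying integer codes, one per value

by_number <- gender[gender == 1]               # compares labels with 1: no label matches
by_label  <- gender[gender == "Male"]          # compares labels with "Male"
by_code   <- gender[as.integer(gender) == 2]   # works, but depends on the level order
```

Comparing against the label is the safer habit, because the integer codes change if the levels are ever reordered.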

Question 2

These questions refer to the writing of R code for the AHTtrial_I(1) data for DBP and SBP of placebo and AHT (slide 54).

When working out an intermediate step in the R code, I got the following result:

tapply(placebo_trialdata$SBP,placebo_trialdata$Time=="0",mean)

   FALSE     TRUE
153.0701 164.8006

It appears (above) as if the mean function was applied both to all SBP values in the dataframe (of the placebo TRT group) and to those restricted to the Time = 0 group. Does the "FALSE" posted above refer to the fact that the answer 153.0701 contains values outside of Time = 0, while the second value of 164.8006 is limited to values in the Time = 0 group?

tapply(placebo_trialdata$SBP, placebo_trialdata$Time == "0", placebo_trialdata$Time == 
    "2", mean)

Error in match.fun(FUN) :
  'placebo_trialdata$Time == "2"' is not a function, character or symbol

Why does the code not recognize and process the Time==2 data in the same way that it recognizes Time==0? I did try the following as well with a similar result:

tapply(placebo_trialdata$SBP, placebo_trialdata$Time == "0", placebo_trialdata$SBP, 
    placebo_trialdata$Time == "2", mean)

I did this in case each Time value needed "placebo_trialdata$SBP" before it in order to execute.

Thanks!

Answer to Question 2

To understand what is going on here, you need to understand what tapply actually does.

tapply is one of the many functions in R's 'apply' family. It is a convenient function that loops through your data in a particular way, and each of the 'apply' functions loops over the data differently. I would highly suggest reading this post on stackoverflow that gives a great summary of how the various ones work.

In your case, the code uses tapply to split a variable into groups and then apply a function to each group.

To think about how tapply works in a generic sense:

tapply(Summary Variable, Group Variable, Function)
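
As a minimal sketch of that generic form, here is tapply on two made-up vectors (`sbp` and `time` are hypothetical values, not the course data). The summary variable and the group variable must be the same length, and the result has one entry per distinct group value.

```r
# Hypothetical toy values
sbp  <- c(160, 150, 164, 148, 158, 152)   # summary variable
time <- c(0, 2, 0, 2, 0, 2)               # group variable, same length as sbp

# One mean per distinct value of time; the result is named by group
means_by_time <- tapply(sbp, time, mean)
```

The third argument has to be a function (or the name of one), which is the key to the error discussed next.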

Now, let us compare with what your code is attempting to do and dissect it a little:

tapply(placebo_trialdata$SBP, placebo_trialdata$Time == "0", placebo_trialdata$Time == 
    "2", mean)

So first, what arguments are you supplying vs what tapply needs:

Summary Variable: placebo_trialdata$SBP

OK, this is a good start: you want to summarize all the SBP values.

Group Variable: placebo_trialdata$Time == "0"

Here is error #1 - let's break down what you're trying to do.

By saying group by $Time == "0", tapply breaks the SBP values into two groups based on a logical (TRUE/FALSE) test: every row where Time matches "0" is labelled 'TRUE' and every other row is labelled 'FALSE'. The significance of this will become apparent when tapply looks for the FUN (function) argument. So, according to your code the function is…

Function: placebo_trialdata$Time == "2"

Whoops - tapply gets very confused because this is a vector of TRUE/FALSE values, not a function. That is why you got your error 'not a function, character or symbol' (there are some additional reasons but they are out of the scope of this class).

So let's tie everything back together and see what you were asking when it did work, and what the TRUE/FALSE values are.

Given your request to group by Time == 0, tapply labels every row TRUE or FALSE and uses those labels as the groups. You then give it the function 'mean', so it takes the mean of all values labelled TRUE and, separately, the mean of all values labelled FALSE. In your case, FALSE is all times != 0 and TRUE is times == 0. So 153.0701 is the mean SBP over every time other than 0 (not over all values), and 164.8006 is the mean SBP at Time 0.
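
To make the TRUE/FALSE grouping concrete, here is a small sketch on made-up vectors (`sbp` and `time` are hypothetical). Grouping by a logical test gives the same two numbers as filtering by hand with the test and with its opposite.

```r
# Hypothetical toy values
sbp  <- c(160, 150, 164, 148, 140, 138)
time <- c(0, 2, 0, 2, 4, 4)

# Two groups: FALSE (time is not 0) and TRUE (time is 0)
grouped <- tapply(sbp, time == 0, mean)

# The same two means computed by hand
mean_at_zero  <- mean(sbp[time == 0])   # matches grouped["TRUE"]
mean_not_zero <- mean(sbp[time != 0])   # matches grouped["FALSE"]
```

Note that the FALSE group pools every other time point together, which is rarely what you want to report.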

Now, how the heck can you easily get all the times at once? It seems so slow to do this for each time separately, take the TRUE values at each one, and recombine them. Luckily, tapply is smart: it automatically converts the group variable to a factor and then applies the function to each level separately.

So if instead you had used the code:

tapply(trialdata1$SBP,trialdata1$Time,mean)

You'll see the results I think you're looking for.

One thing to keep in mind: getting mean SBP and DBP separately for the AHT and placebo groups can be done with subsetting, but I would highly suggest checking out the 'plyr' package. It is absolutely invaluable for data manipulation.

trialdata1 <- read.csv("./DATA/AHTtrial_I.csv", skip = 7, sep = ",")
tapply(trialdata1$SBP, trialdata1$Time, mean)

##     0     2     4     6     8 
## 162.9 156.6 148.6 145.5 136.9

You can see it nicely groups each time value separately and then takes the mean SBP for each.

vs the previously attempted filtering at a single time:

tapply(trialdata1$SBP, trialdata1$Time == 0, mean)

## FALSE  TRUE 
## 146.9 162.9
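
If you would rather stay in base R for the treatment-by-time summary mentioned earlier, here is one possible sketch. The data frame `toy`, its values, and the column names TRT, Time, SBP and DBP are assumptions made for illustration; check them against the real AHTtrial_I file before reusing the code.

```r
# Hypothetical toy data; column names are assumed, not read from the course file
toy <- data.frame(
  TRT  = rep(c("Placebo", "AHT"), each = 4),
  Time = rep(c(0, 0, 2, 2), times = 2),
  SBP  = c(165, 163, 160, 158, 166, 164, 150, 146),
  DBP  = c(100, 98, 97, 95, 101, 99, 90, 88)
)

# A list of grouping variables gives a treatment-by-time table of means
sbp_table <- tapply(toy$SBP, list(toy$TRT, toy$Time), mean)

# aggregate() summarises several columns at once, one row per group
both <- aggregate(cbind(SBP, DBP) ~ TRT + Time, data = toy, FUN = mean)
```

tapply returns a compact table for a single variable, while aggregate returns a data frame that is easier to plot or merge later.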