WPA #3: Chapter 6 – Matrices and Dataframes

A drug study

You’re a lab assistant for a multi-billion dollar drug company called Novartirosche. The company has just developed a new cognitive performance enhancing drug called drug.x that it expects will revolutionize the industry. To test the performance of the drug, the company recruited 1,000 participants to perform one of two cognitive tasks after having taken drug.x or a placebo sugar pill. Participants assigned to the ‘wordsearch’ task have to find 100 words in a jumbled list as fast as possible. Participants assigned to the ‘animals’ task have to name 20 different animals as quickly as possible. For each task, a lab assistant recorded how long it took each participant, in seconds, to complete their assigned task. The results are stored in a tab-delimited text file at https://dl.dropboxusercontent.com/u/7618380/drug.txt

To get access to the dataset, load the data into R as a new dataframe object by running the following code.

drug <- read.table("https://dl.dropboxusercontent.com/u/7618380/drug.txt")

# That code didn't work for some people -- I'm not sure why. If it doesn't work for you, download the file to your computer, then replace the https link above with the file's path on your computer. For example:
#drug <- read.table(file = "/Users/nathaniel/desktop/drug.txt")
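
If read.table() seems to guess the layout of a file wrongly, it can help to state the layout explicitly. Here is a minimal sketch using a tiny made-up file written to a temporary location. The file contents and the object name toy are illustrative only and are not the real drug.txt.

```r
# Write a tiny tab-delimited file to a temporary path (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("drug\ttask\ttime.s",
             "drug.x\tanimals\t171",
             "placebo\twordsearch\t231"),
           con = tmp)

# State the header row and the separator explicitly
toy <- read.table(file = tmp, header = TRUE, sep = "\t")

# Check the column names and types before going further
str(toy)
```

Running str() right after loading is a quick way to spot a missing header or a column read as the wrong type.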

Ok, let’s make sure it looks ok. Print the first few rows of the dataframe into the console using numerical indexing (that is, brackets!).

drug[1:5,]

##        drug       task sex.asdf time.s age phone.number  id
## 677  drug.x    animals   female    171  37   9322672379 268
## 348 placebo wordsearch   female    231  37   7180780308 128
## 429 placebo    animals    other    131  24   8508113795 331
## 385 placebo wordsearch     male    230  24   1648030302 254
## 419 placebo    animals   female    130  41   6482193027 396

Hmmm, let’s double-check the first few rows with a function. Print the first few rows of the dataframe into the console using a function (if you use your head you should be able to remember the name of the function…).

head(drug)

##        drug       task sex.asdf time.s age phone.number  id
## 677  drug.x    animals   female    171  37   9322672379 268
## 348 placebo wordsearch   female    231  37   7180780308 128
## 429 placebo    animals    other    131  24   8508113795 331
## 385 placebo wordsearch     male    230  24   1648030302 254
## 419 placebo    animals   female    130  41   6482193027 396
## 263 placebo wordsearch   female    231  15   2494861347  13
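
Note that head() shows six rows by default, which is why it printed one more row than drug[1:5,]. A small sketch of controlling the number of rows, using a made-up dataframe called toy (not the study data):

```r
# Made-up dataframe with 10 rows
toy <- data.frame(id = 1:10, time.s = seq(100, 190, by = 10))

first.by.index <- toy[1:3, ]        # rows 1 to 3 by numerical indexing
first.by.head  <- head(toy, n = 3)  # the same rows using head()
last.two       <- tail(toy, n = 2)  # tail() works from the other end

first.by.head
last.two
```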

Someone left me a note saying that there could be problems in rows 50 through 60. Print rows 50 through 60 into the console and make sure they look ok.

drug[50:60,]

##        drug       task sex.asdf time.s age phone.number  id
## 462 placebo    animals    other    129  15   2753248327 111
## 407 placebo    animals   female    129  37   6592522210 694
## 100 placebo wordsearch     male    230  26   3410544564 427
## 487 placebo    animals    other    130  31   8102985146  67
## 840  drug.x    animals    other    169  29   6430696567 192
## 727  drug.x    animals   female    168  32   3334892955 905
## 685  drug.x    animals    other    170  23   1231159931 564
## 996  drug.x    animals     male    171  23   5439840998 620
## 119 placebo wordsearch   female    229  30   1807868391 878
## 585  drug.x wordsearch     male    270  28   6975130923 276
## 317 placebo wordsearch     male    231  28   2908492601 435

Better take a quick look at the whole dataset. View the entire dataframe in a new window using View()

View(drug)

Print summary statistics of each column using summary()

summary(drug)

##       drug             task       sex.asdf       time.s         age       
##  drug.x :500   animals   :500   female:348   Min.   :128   Min.   :10.00  
##  placebo:500   wordsearch:500   male  :348   1st Qu.:170   1st Qu.:25.00  
##                                 other :304   Median :200   Median :30.00  
##                                              Mean   :200   Mean   :30.01  
##                                              3rd Qu.:230   3rd Qu.:35.00  
##                                              Max.   :273   Max.   :50.00  
##   phone.number             id        
##  Min.   :1.009e+09   Min.   :   1.0  
##  1st Qu.:3.102e+09   1st Qu.: 250.8  
##  Median :5.416e+09   Median : 500.5  
##  Mean   :5.442e+09   Mean   : 500.5  
##  3rd Qu.:7.688e+09   3rd Qu.: 750.2  
##  Max.   :9.998e+09   Max.   :1000.0

I’m superstitious about the digits of pi… What was the data from the patient with id 314?

drug[drug$id == 314,]

##       drug    task sex.asdf time.s age phone.number  id
## 747 drug.x animals   female    170  36   4925379726 314
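
If the logical indexing above feels like magic, it may help to break it into steps. A sketch with a made-up three-row dataframe (toy and is.314 are illustrative names):

```r
# Made-up data: three participants
toy <- data.frame(id = c(12, 314, 7), age = c(25, 36, 41))

# Step 1: the comparison gives one TRUE/FALSE per row
is.314 <- toy$id == 314
is.314

# Step 2: use that logical vector in the row position; blank column position keeps all columns
toy[is.314, ]

# subset() does the same thing in one call
subset(toy, id == 314)
```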

What are the names of the columns of the dataframe? Use the appropriate function.

names(drug)

## [1] "drug"         "task"         "sex.asdf"     "time.s"      
## [5] "age"          "phone.number" "id"

One of the column names has some unnecessary characters. Fix the name.

names(drug)[3] <- "sex"
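
Renaming by position works, but it silently renames the wrong column if the column order ever changes. One alternative, sketched here on a made-up one-row dataframe, is to look the column up by its old name:

```r
# Made-up dataframe with the same badly named column
toy <- data.frame(drug = "placebo", task = "animals", sex.asdf = "male")

# Find the column called "sex.asdf", wherever it is, and rename it
names(toy)[names(toy) == "sex.asdf"] <- "sex"

names(toy)
```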

A colleague requested just the response time data. She also wants it in minutes instead of seconds. Create a new vector object called time.m that only contains the time data in minutes.

time.m <- drug$time.s / 60

Let’s check some of the effects of sex. Create two separate dataframes for male and female participants. Call them drug.male and drug.female.

drug.female <- subset(drug, sex == "female")
drug.male <- subset(drug, sex == "male")

Let’s look just at the females: What percent of the female participants were given the placebo? What was the mean response time of females?

mean(drug.female$drug == "placebo")

## [1] 0.5

mean(drug.female$time.s)

## [1] 202.0345

Now let’s focus on the males: What percent of the male participants were given the placebo? What was the mean response time of males?

mean(drug.male$drug == "placebo")

## [1] 0.5

mean(drug.male$time.s)

## [1] 201.1983
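
Strictly speaking, mean() of a logical vector gives a proportion (0.5), so multiply by 100 to report a percent (50). As a sketch of getting the value for every group at once without creating separate dataframes, here is tapply() on made-up data (toy and prop.placebo are illustrative names):

```r
# Made-up data: 2 females and 4 males
toy <- data.frame(
  sex  = c("female", "female", "male", "male", "male", "male"),
  drug = c("placebo", "drug.x", "placebo", "placebo", "placebo", "drug.x")
)

# Proportion given the placebo, computed separately for each sex
prop.placebo <- tapply(toy$drug == "placebo", toy$sex, mean)

# Convert proportions to percents
prop.placebo * 100
```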

Let’s look at some of the age related data. Create a new dataframe called drug.oldest containing only the data from the oldest patient(s), and a new dataframe called drug.youngest that only contains data from the youngest participants. To do this, use basic indexing in addition to the max() and min() functions.

drug.oldest <- drug[drug$age == max(drug$age),]
drug.youngest <- drug[drug$age == min(drug$age),]

Show me a table of the age data. In other words, how many participants were there of each age?

table(drug$age)

## 
## 10 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 
##  2  1  2  5  4  5  6 15 13 18 28 29 38 41 52 44 60 56 69 49 47 52 58 48 47 
## 36 37 38 39 40 41 42 43 44 45 47 48 49 50 
## 41 38 25 26 22 17 13 10  6  7  3  1  1  1

In the next question, you need to change specific values of a vector based on some criteria. We learned how to do this in Chapter 5. If you forgot how, here’s an example:

a <- c(1, 1, 1, 1, 2, 2, 2, 2)
a == 2

## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

a[a == 2] <- 10
a

## [1]  1  1  1  1 10 10 10 10

Oh no. Some of the patients may have been too young to participate in a medical study…Please um…help them grow up. Change the age of anyone less than 18 to 18.

drug$age[drug$age < 18] <- 18

Ok. Now that you’ve ‘fixed’ the dataframe, show me a new table of the age data. No one should be less than 18 now.

table(drug$age)

## 
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 
## 40 13 18 28 29 38 41 52 44 60 56 69 49 47 52 58 48 47 41 38 25 26 22 17 13 
## 43 44 45 47 48 49 50 
## 10  6  7  3  1  1  1
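
For what it's worth, the base R function pmax() can do the same kind of 'raise everything below a floor' operation in one step. A sketch on a made-up age vector, compared against the indexing approach:

```r
# Made-up ages, some below 18
age <- c(10, 17, 18, 25, 50)

# Approach 1: logical indexing and assignment
age.fixed <- age
age.fixed[age.fixed < 18] <- 18

# Approach 2: pmax() takes the larger of each element and 18
age.pmax <- pmax(age, 18)

age.fixed
age.pmax
```

Both approaches leave ages of 18 and above untouched.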

Hey, I just found some new data from the study. The participants’ heights and weights are stored in a separate table. Run the following code to store the data as a new dataframe called heightweight. Look at the first few rows of the dataframe to see how it looks.

heightweight <- read.table("https://dl.dropboxusercontent.com/u/7618380/moredata.txt", sep = "\t")

# Again, if the code doesn't work, download the file to your computer, then put the file path from your computer as the argument to read.table()

Now, combine the two dataframes drug and heightweight into a single dataframe using either data.frame() or cbind()

drug <- cbind(drug, heightweight)
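
One caution: cbind() pairs rows purely by position, so it is only correct when both dataframes list the participants in the same order. The sketch below uses made-up data to show the difference between combining by position and matching on a shared key with merge(). Whether the real heightweight table contains an id column is an assumption here, not something the exercise states.

```r
# Made-up data: the second table lists the same ids in a different order
toy.drug <- data.frame(id = c(1, 2, 3), time.s = c(171, 231, 131))
toy.hw   <- data.frame(id = c(3, 1, 2), height = c(180, 165, 172))

# cbind() needs the same number of rows, so check first
stopifnot(nrow(toy.drug) == nrow(toy.hw))

# By position: row 1 is glued to row 1, even though the ids differ
by.position <- cbind(toy.drug, toy.hw)

# By key: rows are matched on the shared id column
by.id <- merge(toy.drug, toy.hw, by = "id")

by.position
by.id
```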

Ok, now let’s start analyzing the response time data. Of course, we want to see if drug.x improved people’s performance. Calculate the mean response time separately for the placebo and drug.x pills (ignoring the type of task).

with(drug, mean(time.s[drug == "placebo"]))

## [1] 210.03

with(drug, mean(time.s[drug == "drug.x"]))

## [1] 190.022

Did drug.x help performance by reducing response times relative to the placebo? You can answer with words.

Response times with drug.x were, on average, lower than with the placebo. Therefore, drug.x appeared to help.

Just for fun, let’s see which task was harder. Calculate the mean response time for each task (ignoring the drug)

with(drug, mean(time.s[task == "animals"]))

## [1] 162.022

with(drug, mean(time.s[task == "wordsearch"]))

## [1] 238.03

Based on what you found, which task was harder?

The wordsearch task was much harder than the animal naming task.

Ok, now calculate the mean response time, for each combination of pills and task. That is, calculate the mean response time for placebo in the animal task, drug.x in the animal task, and placebo in the wordsearch task, and drug.x in the wordsearch task.

with(drug, mean(time.s[task == "animals" & drug == "drug.x"]))

## [1] 170.0125

with(drug, mean(time.s[task == "animals" & drug == "placebo"]))

## [1] 130.06

with(drug, mean(time.s[task == "wordsearch" & drug == "drug.x"]))

## [1] 270.06

with(drug, mean(time.s[task == "wordsearch" & drug == "placebo"]))

## [1] 230.0225
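
Typing one line per combination gets tedious quickly. As a sketch of an alternative, aggregate() and tapply() can compute all of the cell means in one call. The data below are made up (six rows) just to show the shape of the result.

```r
# Made-up data covering all four pill-by-task combinations
toy <- data.frame(
  drug   = c("drug.x", "drug.x", "placebo", "placebo", "drug.x", "placebo"),
  task   = c("animals", "animals", "animals", "wordsearch", "wordsearch", "wordsearch"),
  time.s = c(170, 172, 130, 230, 270, 232)
)

# Long format: one row per combination of drug and task
cell.means <- aggregate(time.s ~ drug + task, data = toy, FUN = mean)
cell.means

# Wide format: tasks in rows, pills in columns
cell.table <- with(toy, tapply(time.s, list(task, drug), mean))
cell.table
```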

Based on what you know now, did drug.x help performance by reducing response times?

drug.x appeared to lead to SLOWER response times in both tasks! Crazy!

I think there may have been a problem with the study design. Use the table() function to create a classification table showing how many people were assigned to each pill and each task. To do this, just put the drug and task columns as two separate arguments to table()

with(drug, table(task, drug))

##             drug
## task         drug.x placebo
##   animals       400     100
##   wordsearch    100     400

What went wrong with this study? Why did the apparent performance of drug.x versus the placebo change between the overall comparison (ignoring the task) and the comparison within each task?

Simpson’s paradox…again! People given drug.x were mostly assigned to the easier task (animals), while people given the placebo were mostly assigned to the harder task (wordsearch). This is why people given drug.x appeared to do better on average compared to the placebo. However, people given the placebo actually did better on BOTH tasks. drug.x sucks.
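
To see where the reversal comes from, here is a sketch that rebuilds the overall means from the four cell means and the group sizes reported above. The object names (cell.mean, cell.n, overall) are just illustrative.

```r
# Cell means reported above (tasks in rows, pills in columns)
cell.mean <- matrix(c(170.0125, 270.06,    # drug.x: animals, wordsearch
                      130.06,   230.0225), # placebo: animals, wordsearch
                    nrow = 2,
                    dimnames = list(task = c("animals", "wordsearch"),
                                    drug = c("drug.x", "placebo")))

# Group sizes from the classification table above
cell.n <- matrix(c(400, 100,   # drug.x: animals, wordsearch
                   100, 400),  # placebo: animals, wordsearch
                 nrow = 2,
                 dimnames = dimnames(cell.mean))

# The overall mean for each pill is a weighted mean of its two task means
overall <- colSums(cell.mean * cell.n) / colSums(cell.n)
overall
```

The result matches the overall means calculated earlier (about 190.02 for drug.x and 210.03 for the placebo). The drug.x average is dominated by its 400 participants on the fast animals task, while the placebo average is dominated by its 400 participants on the slow wordsearch task.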

Basel Spring 2016