I have chosen to work with the CSV file “BtheB”. This file contains data from a clinical trial designed to deliver cognitive behavioral therapy to depressed patients via a computer terminal. The treatment (named the “Beating the Blues” program) was tested against standard care, aptly named “Treatment as Usual” (TAU). The Beck Depression Inventory II scale was used to measure the level of depression in patients, where higher total scores indicate more severe depression. Eight variables were considered:

  1. drug - whether the patient took antidepressants
  2. length - duration of current depressive episode
  3. treatment - TAU (Treatment as Usual) or BtheB (Beat the Blues)
  4. bdi.pre - depression degree before treatment
  5. bdi.2m - depression degree after 2 months
  6. bdi.4m - depression degree after 4 months
  7. bdi.6m - depression degree after 6 months
  8. bdi.8m - depression degree after 8 months
library(ggplot2)

At what point in the treatment do we see a significant effect of the BtheB program on the patients, if we even see one?

BtheB <- read.csv(file = "BtheB.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
BtB = BtheB[-1]   # drop the first column (a row index)
colnames(BtB) = c("drug", "length", "treatment", "pre", "two", "four", "six", "eight")
head(BtB)
##   drug length treatment pre two four six eight
## 1   No    >6m       TAU  29   2    2  NA    NA
## 2  Yes    >6m     BtheB  32  16   24  17    20
## 3  Yes    <6m       TAU  25  20   NA  NA    NA
## 4   No    >6m     BtheB  21  17   16  10     9
## 5  Yes    >6m     BtheB  26  23   NA  NA    NA
## 6  Yes    <6m     BtheB   7   0    0   0     0
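
Even these first rows show dropout accumulating over the later months. A minimal sketch of how to count the missing observations per time point:

colSums(is.na(BtB[, c("pre", "two", "four", "six", "eight")]))   # NA count per time point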

Subset the data into two groups: treatment vs control

control = subset(BtB, treatment == "TAU", select = c(pre, two, four, six, eight))
head(control)
##    pre two four six eight
## 1   29   2    2  NA    NA
## 3   25  20   NA  NA    NA
## 7   17   7    7   3     7
## 8   20  20   21  19    13
## 11  30  32   24  12     2
## 13  26  27   23  NA    NA
experim = subset(BtB, treatment == "BtheB",select = c(pre, two, four, six, eight))
head(experim)
##    pre two four six eight
## 2   32  16   24  17    20
## 4   21  17   16  10     9
## 5   26  23   NA  NA    NA
## 6    7   0    0   0     0
## 9   18  13   14  20    11
## 10  20   5    5   8    12

Means, medians, quartiles, and other relevant summary statistics

summary(control)
##       pre             two             four            six       
##  Min.   : 7.00   Min.   : 0.00   Min.   : 2.00   Min.   : 0.00  
##  1st Qu.:16.75   1st Qu.: 9.00   1st Qu.: 7.00   1st Qu.: 3.00  
##  Median :23.00   Median :20.00   Median :15.50   Median :19.00  
##  Mean   :24.19   Mean   :19.47   Mean   :17.67   Mean   :16.28  
##  3rd Qu.:30.25   3rd Qu.:27.00   3rd Qu.:24.00   3rd Qu.:24.00  
##  Max.   :47.00   Max.   :48.00   Max.   :49.00   Max.   :47.00  
##                  NA's   :3       NA's   :12      NA's   :19     
##      eight     
##  Min.   : 0.0  
##  1st Qu.: 2.0  
##  Median :13.0  
##  Mean   :13.6  
##  3rd Qu.:20.0  
##  Max.   :40.0  
##  NA's   :23
summary(experim)
##       pre             two             four            six        
##  Min.   : 2.00   Min.   : 0.00   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:13.75   1st Qu.: 7.00   1st Qu.: 5.00   1st Qu.: 3.000  
##  Median :20.50   Median :12.50   Median :10.00   Median : 8.000  
##  Mean   :22.54   Mean   :14.71   Mean   :12.03   Mean   : 9.241  
##  3rd Qu.:30.50   3rd Qu.:20.50   3rd Qu.:16.00   3rd Qu.:12.000  
##  Max.   :49.00   Max.   :40.00   Max.   :53.00   Max.   :30.000  
##                                  NA's   :15      NA's   :23      
##      eight       
##  Min.   : 0.000  
##  1st Qu.: 3.000  
##  Median : 9.000  
##  Mean   : 8.852  
##  3rd Qu.:12.500  
##  Max.   :23.000  
##  NA's   :25
contmean= c(mean(control$pre, na.rm = T),
     mean(control$two, na.rm = T),
     mean(control$four, na.rm = T),
     mean(control$six, na.rm = T),
     mean(control$eight, na.rm = T))
expmean= c(mean(experim$pre, na.rm = T),
     mean(experim$two, na.rm = T),
     mean(experim$four, na.rm = T),
     mean(experim$six, na.rm = T),
     mean(experim$eight, na.rm = T))
means= matrix(c(contmean, expmean), ncol=5, byrow = T)
colnames(means)= c("pre", "two", "four", "six", "eight")
rownames(means)= c("Control means", "Experiment means")
means = as.table(means)
round(means, digits= 2)
##                    pre   two  four   six eight
## Control means    24.19 19.47 17.67 16.28 13.60
## Experiment means 22.54 14.71 12.03  9.24  8.85
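
The same table can be built more concisely with colMeans; a minimal equivalent sketch:

# equivalent, more concise construction of the means table
contmean = colMeans(control, na.rm = TRUE)
expmean = colMeans(experim, na.rm = TRUE)
means = rbind("Control means" = contmean, "Experiment means" = expmean)
round(means, digits = 2)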

Just from the raw data we see a decline in mean BDI over the months. The experimental group has lower means throughout; however, since its pre-treatment mean was already lower, we cannot assume that the lower means are due to the Beat the Blues program.

Plots of the control and experimental group means

par(mfrow=c(1,2))
plot(contmean, main= "Control Group Means", xlab= "Months of Recorded BDI", ylab = "BDI Mean")
abline(lsfit(1:5, contmean), col= "red")    # least-squares trend line
plot(expmean, main= "Experimental Group Means", xlab= "Months of Recorded BDI", ylab = "BDI Mean")
abline(lsfit(1:5, expmean), col= "blue")    # least-squares trend line

par(mfrow=c(1,2))
boxplot(control, main= "Control Group" , xlab= "Months of Recorded BDI", ylab = "BDI Levels")
boxplot(experim, main= "Experimental Group" , xlab= "Months of Recorded BDI", ylab = "BDI Levels")
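
Since ggplot2 is already loaded, the same comparison can be drawn as faceted boxplots; a sketch, where BtB.long is a hypothetical intermediate name (rows with missing BDI are dropped):

# reshape to long form: one row per patient per time point, keeping the treatment label
BtB.long = cbind(stack(BtB[, c("pre", "two", "four", "six", "eight")]), treatment = BtB$treatment)
colnames(BtB.long) = c("BDI", "month", "treatment")
ggplot(na.omit(BtB.long), aes(x = month, y = BDI)) + geom_boxplot() + facet_wrap(~ treatment)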

Within both the control and experimental groups there is a clear difference in means across the time points, particularly for the pre scores, which are noticeably higher in both groups than the BDI levels two, four, six, and eight months later. The boxplots, which show the full distributions rather than just the means, also reveal that the rise at month six in the control group is absent in the experimental group, meaning that the Beat the Blues treatment may have a noticeable effect six months into the treatment.

Using histograms to visually assess whether the data are normally distributed

normal= function(x){
  x = x[!is.na(x)]   # drop missing values rather than recoding them as zero
  graph= hist(x, col= "gray", main= "", xlab = "BDI level")
  fitx= seq(min(x), max(x))
  fity= dnorm(fitx, mean= mean(x), sd= sd(x))
  fity= fity*diff(graph$mids[1:2])*length(x)   # rescale the density to the histogram's count scale
  lines(fitx, fity, col="red")
}

par(mfrow=c(1,2))
# Pre is somewhat normally distributed.
normal(control$pre)
normal(experim$pre)

# Two could be normal for the control, but the experimental group is a tad skewed right.
normal(control$two)
normal(experim$two)

# Four is skewed right.
normal(control$four)
normal(experim$four)

# Six is skewed right.
normal(control$six)
normal(experim$six)

# Eight could be normally distributed.
normal(control$eight)
normal(experim$eight)

The data are not normal (skewed right in months two, four, six, and possibly eight), with the exception of the pre scores. Any tests run on these data should therefore be non-parametric.
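
A formal check points the same way; a sketch using the Shapiro-Wilk test on the complete cases of each time point (p-values below 0.05 would indicate departure from normality):

# Shapiro-Wilk normality test per time point, NAs dropped
sapply(control, function(x) shapiro.test(na.omit(x))$p.value)
sapply(experim, function(x) shapiro.test(na.omit(x))$p.value)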

Conducting Kruskal-Wallis tests (do BDI levels differ significantly across the five time points within each group?)

kruskal.test(control)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  control
## Kruskal-Wallis chi-squared = 18.326, df = 4, p-value = 0.001066
qchisq(0.950, 4)
## [1] 9.487729
kruskal.test(experim)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  experim
## Kruskal-Wallis chi-squared = 42.509, df = 4, p-value = 1.309e-08
qchisq(0.950, 4)
## [1] 9.487729

Since in both cases the test statistic exceeds the chi-squared critical value (9.49 at the 0.95 level with 4 degrees of freedom) and the p-values are below 0.05, we can conclude that the five time points are not statistically equal: BDI levels change significantly over the course of the study in both groups.
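
The Kruskal-Wallis test only says that at least one time point differs, not which ones. A possible follow-up is a set of pairwise Wilcoxon rank-sum tests with a Holm correction; a sketch, where con.long is a hypothetical name (wilcox.test drops NAs and may warn about ties):

# pairwise comparisons between time points within the control group
con.long = stack(control)
pairwise.wilcox.test(con.long$values, con.long$ind, p.adjust.method = "holm")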

Plotting the time points against each other to examine correlation

pairs(na.omit(control))

cor(na.omit(control))
##             pre       two      four       six     eight
## pre   1.0000000 0.4414580 0.5392721 0.4958526 0.4066434
## two   0.4414580 1.0000000 0.7856932 0.7692827 0.7164076
## four  0.5392721 0.7856932 1.0000000 0.8898735 0.7952014
## six   0.4958526 0.7692827 0.8898735 1.0000000 0.8498854
## eight 0.4066434 0.7164076 0.7952014 0.8498854 1.0000000
pairs(na.omit(experim))

cor(na.omit(experim))
##             pre       two      four       six     eight
## pre   1.0000000 0.3250588 0.3333027 0.5487748 0.4140905
## two   0.3250588 1.0000000 0.4355687 0.7577294 0.5763700
## four  0.3333027 0.4355687 1.0000000 0.5750822 0.4019558
## six   0.5487748 0.7577294 0.5750822 1.0000000 0.6557662
## eight 0.4140905 0.5763700 0.4019558 0.6557662 1.0000000

Many of the control-group time points are highly correlated with one another, somewhat less so in the experimental group, but none of the correlations are high enough to worry about multicollinearity interfering with the regression (a common rule of thumb treats correlations of 0.90 or higher as problematic).

Run a regression on the experimental subset

exp2 = stack(experim)
exp2$patient = rep(1:52, times = 5)   # 52 patients, each measured at five time points (stack goes column by column)
exp2$patient = factor(exp2$patient)   # make it a factor
exp2[is.na(exp2)] = 0                 # replace NA values with zero (treats a missing score as a BDI of 0)
colnames(exp2) = c("BDI", "month", "patient")
head(exp2)
##   BDI month patient
## 1  32   pre       1
## 2  21   pre       2
## 3  26   pre       3
## 4   7   pre       4
## 5  18   pre       5
## 6  20   pre       6
exp.reg = lm(BDI~ month, data=exp2) #Regression
summary(exp.reg)
## 
## Call:
## lm(formula = BDI ~ month, data = exp2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.538  -5.154  -3.346   5.288  44.442 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.5962     1.3038   3.525 0.000502 ***
## monthfour     3.9615     1.8439   2.148 0.032619 *  
## monthpre     17.9423     1.8439   9.731  < 2e-16 ***
## monthsix      0.5577     1.8439   0.302 0.762553    
## monthtwo     10.1154     1.8439   5.486 9.91e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.402 on 255 degrees of freedom
## Multiple R-squared:  0.3447, Adjusted R-squared:  0.3344 
## F-statistic: 33.53 on 4 and 255 DF,  p-value: < 2.2e-16

Run a regression on the control subset

con2 = stack(control)
con2$patient = rep(1:48, times = 5)   # 48 patients, each measured at five time points (stack goes column by column)
con2$patient = factor(con2$patient)   # make it a factor
con2[is.na(con2)] = 0                 # replace NA values with zero (treats a missing score as a BDI of 0)
colnames(con2) = c("BDI", "month", "patient")
head(con2)
##   BDI month patient
## 1  29   pre       1
## 2  25   pre       2
## 3  17   pre       3
## 4  20   pre       4
## 5  30   pre       5
## 6  26   pre       6
con.reg = lm(BDI~ month, data=con2) #Regression
summary(con.reg)
## 
## Call:
## lm(formula = BDI ~ month, data = con2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.250  -9.396  -3.219   8.000  37.167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.083      1.695   4.179 4.14e-05 ***
## monthfour      6.167      2.397   2.572   0.0107 *  
## monthpre      17.104      2.397   7.135 1.19e-11 ***
## monthsix       2.750      2.397   1.147   0.2525    
## monthtwo      11.167      2.397   4.658 5.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.74 on 235 degrees of freedom
## Multiple R-squared:  0.2162, Adjusted R-squared:  0.2029 
## F-statistic: 16.21 on 4 and 235 DF,  p-value: 9.791e-12

Because month is a factor, R uses the alphabetically first level (eight) as the reference, so each coefficient estimates the difference in mean BDI between that month and month eight. In both regressions the largest and most significant differences are for the pre scores and the second month. Month four is also significant, but less so than the other two.
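
If pre-treatment is the more natural baseline, the month factor can be releveled so that each coefficient instead estimates the change from baseline; a minimal sketch:

# make "pre" the reference level; coefficients then measure change from pre-treatment
exp2$month = relevel(exp2$month, ref = "pre")
summary(lm(BDI ~ month, data = exp2))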

We cannot be sure that the Beat the Blues program had any significant effect on the patients, apart from the boxplot comparison, where the sixth month of treatment did not spike in the experimental group as it did in the control. Though the means of the experimental group were lower than the control’s throughout, the pre-treatment mean was also lower, so there is no evidence that the lower BDI values were due to the treatment. The analysis was further complicated by the fact that the data were not normally distributed, which restricted us to non-parametric tests. Finally, the regressions showed that in both the control and experimental groups the time points that differed most from the end of the study were the pre-treatment measurement (pre) and the second month of treatment (two).
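
A more direct way to address the original question would be to compare the two groups at each time point; a minimal non-parametric sketch (two-sample Wilcoxon rank-sum tests on complete cases, output not shown):

# treatment vs. control at each time point; small p-values would suggest a group difference
sapply(colnames(control), function(m) wilcox.test(control[[m]], experim[[m]])$p.value)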