Linguistics Pilot Results

For 35 individuals, we have their responses to 40 questions on each of two tests, a 'pre-test' and a 'post-test'. The individuals were randomly assigned to four treatment groups and one control group, with 7 participants in each. Between the two tests, three of the non-control groups (Morphology, Morphonology and Phonology) had linguistic training of different types, and the 7 participants in the fourth (HLV) benefited from highlighting of vowels. We want to find out whether there is any difference between the pre-test and post-test scores, and whether there were differences by group. A table of the distribution of the participants is below. HLV is 'highlight vowel'.

## recode
##      Control          HLV   Morphology Morphonology    Phonology 
##            7            7            7            7            7

To get the data into a dataframe format consisting of response variable (post-test score) and explanatory variables (pre-test score, group membership), I recorded 'vowel blindness' when the result was “0-1”. I then counted the number of instances of vowel blindness for the pre-test and the post-test scores. For the pre-test, there were 808 instances of vowel blindness and for the post-test the figure was 570. The total possible was 35 x 40 = 1400.

Comparison of pre-test scores by treatment type

Before training the randomly assigned groups, we should establish that there is no pre-existing difference in abilities. Therefore we examine the pre-test scores against the treatment groups that the participants will later be assigned to.

plot of chunk unnamed-chunk-2

Here we can see the pre-test scores divided by group. By eye, the HLV group seems to have a slightly higher pre-test score. We can test for statistical differences between the five groups using Tukey's multiple comparisons of means:

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = prevb ~ recode, data = reem2)
## 
## $recode
##                            diff     lwr   upr  p adj
## HLV-Control              1.0000  -5.867 7.867 0.9930
## Morphology-Control      -4.5714 -11.439 2.296 0.3236
## Morphonology-Control    -1.8571  -8.725 5.010 0.9332
## Phonology-Control       -2.0000  -8.867 4.867 0.9143
## Morphology-HLV          -5.5714 -12.439 1.296 0.1565
## Morphonology-HLV        -2.8571  -9.725 4.010 0.7475
## Phonology-HLV           -3.0000  -9.867 3.867 0.7127
## Morphonology-Morphology  2.7143  -4.153 9.582 0.7808
## Phonology-Morphology     2.5714  -4.296 9.439 0.8122
## Phonology-Morphonology  -0.1429  -7.010 6.725 1.0000

There is no statistically significant difference (none of the p-values are smaller than 0.05). All of the confidence intervals contain zero.
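The code that produced the comparison above is not shown in this report. As a minimal sketch, the same kind of table can be obtained with `aov()` followed by `TukeyHSD()`. The data frame `sim` below is simulated purely for illustration (5 groups of 7 with made-up scores) and is not the study data; only the variable names `prevb` and `recode` are taken from the output above.

```r
# Simulated stand-in for the real data: 5 groups of 7, invented scores.
set.seed(42)
sim <- data.frame(
  recode = factor(rep(c("Control", "HLV", "Morphology",
                        "Morphonology", "Phonology"), each = 7)),
  prevb  = round(rnorm(35, mean = 23, sd = 5))
)

fit <- aov(prevb ~ recode, data = sim)    # one-way ANOVA
tuk <- TukeyHSD(fit, conf.level = 0.95)   # all pairwise comparisons

# One row per pair of groups: choose(5, 2) = 10 rows
cmp <- tuk$recode

# Flag the pairs whose family-wise interval excludes zero
excludes_zero <- cmp[, "lwr"] > 0 | cmp[, "upr"] < 0
cmp[excludes_zero, , drop = FALSE]
```

A pair is significant at the family-wise 5% level exactly when its interval excludes zero, which is why checking the intervals and checking the adjusted p-values lead to the same conclusion.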

plot of chunk unnamed-chunk-4

Check for differences in the pre-test scores between the control group and the four treatment groups consolidated

plot of chunk unnamed-chunk-5

The control group is on the left and appears to have a greater median than the other four groups combined. We test for the difference:

## 
##  Welch Two Sample t-test
## 
## data:  prevb by groups2 
## t = 0.8374, df = 8.061, p-value = 0.4265
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -3.250  6.965 
## sample estimates:
##   mean in group Control mean in group Treatment 
##                   24.57                   22.71

The p-value is 0.4265, meaning that there is no statistically significant difference between the control group and the four other groups. Neither set of tests gives evidence of a difference in ability between the groups at the pre-test stage, although with only 7 participants per group a small difference could go undetected.
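For readers who want to reproduce this kind of comparison, here is a hedged sketch of how a two-level grouping like `groups2` can be derived from the five-level `recode` variable and passed to a Welch t-test. The scores are simulated and the construction of `groups2` is an assumption, since the report does not show how it was built.

```r
# Simulated stand-in data; not the study data.
set.seed(7)
sim <- data.frame(
  recode = rep(c("Control", "HLV", "Morphology",
                 "Morphonology", "Phonology"), each = 7),
  prevb  = round(rnorm(35, mean = 23, sd = 5))
)

# Collapse the four non-control groups into a single "Treatment" level
sim$groups2 <- factor(ifelse(sim$recode == "Control", "Control", "Treatment"))
table(sim$groups2)

# t.test() uses the Welch correction by default (var.equal = FALSE),
# which suits the unequal group sizes (7 versus 28)
welch <- t.test(prevb ~ groups2, data = sim)
welch$p.value
```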

Comparison of post-test scores by group

First, a boxplot of post-test scores divided into control group and four treatment groups consolidated

plot of chunk unnamed-chunk-7

The post-test score for the control group is on the left. By eye the control group appears to have a higher score than the other four groups together. We can test for the difference:

## 
##  Welch Two Sample t-test
## 
## data:  postvb by groups2 
## t = 7.014, df = 8.023, p-value = 0.0001096
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##   8.632 17.082 
## sample estimates:
##   mean in group Control mean in group Treatment 
##                   26.57                   13.71

The p-value is very small at 0.0001096 and so we can confidently say that the mean post-test scores of the control group and the four treatment groups combined are different.

Then by the individual groups (four treatment and one control)

plot of chunk unnamed-chunk-9

The control group appears to have both a larger median and a larger variance. The highlight vowel group (HLV) has one outlier, represented by the dot towards the x axis.

Use an ANOVA with Tukey's multiple comparisons to test for differences between the five sample means

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = postvb ~ recode, data = reem2)
## 
## $recode
##                             diff     lwr    upr  p adj
## HLV-Control             -11.5714 -17.591 -5.551 0.0000
## Morphology-Control      -13.1429 -19.163 -7.123 0.0000
## Morphonology-Control    -12.8571 -18.877 -6.837 0.0000
## Phonology-Control       -13.8571 -19.877 -7.837 0.0000
## Morphology-HLV           -1.5714  -7.591  4.449 0.9408
## Morphonology-HLV         -1.2857  -7.306  4.734 0.9708
## Phonology-HLV            -2.2857  -8.306  3.734 0.8045
## Morphonology-Morphology   0.2857  -5.734  6.306 0.9999
## Phonology-Morphology     -0.7143  -6.734  5.306 0.9968
## Phonology-Morphonology   -1.0000  -7.020  5.020 0.9885

plot of chunk unnamed-chunk-10

The control group has a mean different from the other four groups. Compared pairwise, any combination containing 'Control' does not have zero in its confidence interval.

Comparison between pre-test and post-test scores by participant

First a boxplot

plot of chunk unnamed-chunk-11

The post-test scores seem to be smaller than the pre-test scores. We use a paired t-test because we have the results with the two sets of scores matched by participant

## 
##  Paired t-test
## 
## data:  prevb and postvb 
## t = 6.563, df = 34, p-value = 1.616e-07
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  4.694 8.906 
## sample estimates:
## mean of the differences 
##                     6.8

It is clear from the very small p-value (1.616e-07) that the means are indeed different. The mean difference between pre-test and post-test scores (pre minus post) is 6.8. This means that on average post-test scores were 6.8 lower than pre-test scores.
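As a quick consistency check, the 6.8 figure can be recovered from the totals reported at the start of this report (808 instances of vowel blindness on the pre-test and 570 on the post-test, across 35 participants). The variable names below are just for illustration.

```r
n_participants <- 35
pre_total  <- 808   # instances of vowel blindness, pre-test
post_total <- 570   # instances of vowel blindness, post-test

pre_mean  <- pre_total / n_participants
post_mean <- post_total / n_participants
mean_drop <- pre_mean - post_mean   # same as (808 - 570) / 35

round(c(pre = pre_mean, post = post_mean, drop = mean_drop), 2)
```

The mean of the paired differences always equals the difference of the two means, so the two routes have to agree.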

Linear regression to determine the factor variable relationships

Now we use linear regression to find out the effect of the 'factor' variables of group membership. First for the difference between the control group and the four treatment groups combined

## 
## Call:
## lm(formula = postvb ~ prevb + groups2, data = reem2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.607 -1.666 -0.015  1.751  7.254 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        16.551      3.347    4.94  2.3e-05 ***
## prevb               0.408      0.126    3.23   0.0028 ** 
## groups2Treatment  -12.100      1.428   -8.47  1.1e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 3.33 on 32 degrees of freedom
## Multiple R-squared: 0.746,   Adjusted R-squared: 0.73 
## F-statistic: 46.9 on 2 and 32 DF,  p-value: 3.08e-10

The model has an adjusted R-squared value of 0.7297, meaning that 73% of the variance in the post-test scores is explained by the model. (This is pretty good!) Both variables, the pre-test score and membership of the consolidated treatment group, are statistically significant with p-values of 0.00285 and the vanishingly small 1.10e-09. The pre-test score variable has a positive sign, meaning that a higher pre-test score is related to a higher post-test score (which makes sense). Membership of one of the four treatment groups reduces the post-test score by 12.0998, holding the pre-test score constant. This is the significant negative difference illustrated on the boxplot.
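To make the coefficients concrete, here is a small worked prediction using the rounded estimates from the table above. The pre-test score of 23 is an arbitrary illustrative value (close to the overall pre-test mean), and `predict_post` is a helper defined here only for this sketch.

```r
# Rounded coefficients copied from the regression output
b0      <- 16.551    # intercept
b_pre   <- 0.408     # slope on pre-test score
b_treat <- -12.100   # shift for membership of any treatment group

# Hypothetical helper: fitted post-test score for a given pre-test score
predict_post <- function(pre, treated) {
  b0 + b_pre * pre + b_treat * treated
}

pre_score    <- 23
control_pred <- predict_post(pre_score, treated = 0)
treated_pred <- predict_post(pre_score, treated = 1)

c(control = control_pred, treated = treated_pred,
  gap = control_pred - treated_pred)
```

Whatever pre-test score is plugged in, the gap between the two fitted values stays at 12.1, because the model has no interaction between pre-test score and group.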

We can plot the differences, with the control group coded as zero and represented by the red dots

plot of chunk unnamed-chunk-14

The effect of the individual groups is available through linear regression

## 
## Call:
## lm(formula = postvb ~ prevb + recode, data = reem2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5.61  -2.16   0.02   1.82   7.25 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          16.039      3.724    4.31  0.00017 ***
## prevb                 0.429      0.142    3.02  0.00525 ** 
## recodeHLV           -12.000      1.847   -6.50  4.1e-07 ***
## recodeMorphology    -11.183      1.952   -5.73  3.4e-06 ***
## recodeMorphonology  -12.061      1.860   -6.48  4.3e-07 ***
## recodePhonology     -13.000      1.863   -6.98  1.1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 3.44 on 29 degrees of freedom
## Multiple R-squared: 0.754,   Adjusted R-squared: 0.711 
## F-statistic: 17.7 on 5 and 29 DF,  p-value: 4.77e-08

The reference level of the group membership variable is 'Control'. The results mean that, compared to members of the control group with the same pre-test score, all four other groups had lower post-test scores. This is indicated by the negative signs. The smallest difference was for the Morphology treatment group, which had on average a post-test score 11.1834 points lower than a control group member with the same pre-test score. The greatest difference was for Phonology at 12.9999.

This is a statistically significant model, and an interesting result although the sample size is small.

Using a classification tree

Classification trees are a relatively new approach to breaking up data for classification. The 'splits' occur at points which are statistically significant. Here I have used exactly the same formula as for the linear regression.

plot of chunk unnamed-chunk-16

The tree is really simple: there are indeed 7 participants in the control group (Node 2). Within the treatment group, the tree splits the participants into those with a pre-test vowel blindness score of less than or equal to 20, and those with a higher score. This model would be useful for prediction. We might test it against the larger sample.

Effect of distractors pre-test and post-test

Eight questions in the pre-test and the post-test question sets are defined as distractor questions. The vowel blindness scores for distractors for the pre-test and post-test scores are shown in the boxplot below. The pre-test distractor scores appear to have a higher mean compared to the post-test scores.

plot of chunk unnamed-chunk-17

Using a paired t-test for differences in means shows a statistically significant difference. The post-test distractor scores are lower by on average 1.8 points.

## 
##  Paired t-test
## 
## data:  reem2$predist and reem2$postdist 
## t = 5.332, df = 34, p-value = 6.359e-06
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  1.114 2.486 
## sample estimates:
## mean of the differences 
##                     1.8

Using linear regression

## 
## Call:
## lm(formula = postdist ~ predist + recode, data = reem2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6044 -0.5765 -0.0456  0.8381  1.7295 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.4802     0.8230    5.44  7.4e-06 ***
## predist              0.0177     0.1336    0.13  0.89532    
## recodeHLV           -1.2806     0.6420   -1.99  0.05555 .  
## recodeMorphology    -2.4159     0.6479   -3.73  0.00083 ***
## recodeMorphonology  -1.9924     0.6434   -3.10  0.00432 ** 
## recodePhonology     -1.5765     0.6420   -2.46  0.02030 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.2 on 29 degrees of freedom
## Multiple R-squared: 0.364,   Adjusted R-squared: 0.254 
## F-statistic: 3.32 on 5 and 29 DF,  p-value: 0.0172

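The results are interesting because the pre-test distractor score is not statistically significant (p = 0.895). The most significant variable is Morphology with a p-value of 0.000832. Holding the pre-test distractor score constant, Morphology training reduced vowel blindness across the distractors by on average 2.41591 points relative to the control group. The HLV coefficient (p = 0.056) falls just short of significance at the 0.05 level.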

Question by question analysis

The list below shows the number of instances of vowel blindness by question on pre-test and post-test, and also the difference between them (pre-test minus post-test). The list is sorted in descending order of pre-test vowel blindness score. The question with the worst pre-test score (31 out of 35) was question 20. For the same question, the equivalent post-test score is 8, resulting in a difference between the two of 23.

##    Pre.VBScore Pre.Q Post.VBScore Post.Q Difference
## 1           31 Q20VB            8    Q20         23
## 2           28 Q25VB           20    Q25          8
## 3           27 Q11VB            9    Q11         18
## 4           27 Q17VB           15    Q17         12
## 5           27 Q26VB           26    Q26          1
## 6           25  Q5VB            5     Q5         20
## 7           25 Q30VB           13    Q30         12
## 8           25 Q38VB           10    Q38         15
## 9           24  Q2VB           24     Q2          0
## 10          24 Q14VB           13    Q14         11
## 11          24 Q19VB           11    Q19         13
## 12          24 Q28VB            8    Q28         16
## 13          23  Q4VB           10     Q4         13
## 14          23 Q15VB           13    Q15         10
## 15          23 Q22VB           12    Q22         11
## 16          23 Q32VB           16    Q32          7
## 17          22 Q21VB           20    Q21          2
## 18          22 Q37VB           17    Q37          5
## 19          21 Q13VB            6    Q13         15
## 20          21 Q18VB            9    Q18         12
## 21          21 Q36VB            5    Q36         16
## 22          20 Q10VB           17    Q10          3
## 23          20 Q29VB           19    Q29          1
## 24          20 Q40VB           16    Q40          4
## 25          19  Q9VB           19     Q9          0
## 26          19 Q12VB           14    Q12          5
## 27          19 Q27VB            4    Q27         15
## 28          18  Q3VB           14     Q3          4
## 29          17  Q7VB           20     Q7         -3
## 30          17  Q8VB           16     Q8          1
## 31          17 Q34VB           25    Q34         -8
## 32          17 Q35VB           17    Q35          0
## 33          16 Q23VB           18    Q23         -2
## 34          16 Q33VB           12    Q33          4
## 35          15  Q6VB           18     Q6         -3
## 36          15 Q39VB            6    Q39          9
## 37          13  Q1VB           16     Q1         -3
## 38          12 Q31VB           18    Q31         -6
## 39           8 Q16VB           24    Q16        -16
## 40           0 Q24VB            7    Q24         -7
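The tabulation code is not shown in this report. A minimal sketch of how such a table could be built is given below, assuming the responses are held as 0/1 matrices with one row per participant and one column per question (1 = vowel blindness). The matrices `pre` and `post` are randomly generated here and are not the study data.

```r
# Hypothetical 0/1 response matrices (1 = vowel blindness); invented data.
set.seed(11)
n_participants <- 35
questions <- paste0("Q", 1:40)
pre  <- matrix(rbinom(n_participants * 40, 1, 0.6), nrow = n_participants,
               dimnames = list(NULL, questions))
post <- matrix(rbinom(n_participants * 40, 1, 0.4), nrow = n_participants,
               dimnames = list(NULL, questions))

# Column sums give the number of participants showing vowel blindness
# on each question
by_question <- data.frame(
  Pre.VBScore  = colSums(pre),
  Post.VBScore = colSums(post),
  row.names    = questions
)
by_question$Difference <- by_question$Pre.VBScore - by_question$Post.VBScore

# Worst pre-test questions first
by_question <- by_question[order(by_question$Pre.VBScore, decreasing = TRUE), ]
head(by_question)
```

Sorting by `Difference` instead would put the questions with the largest improvement at the top, which may be the more useful view when looking for questions that got worse.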

Follow-up test

I developed a small piece of code to simplify the task of adding up the vowel blindness scores. This is much quicker than the manual method I had previously used and probably less likely to include human errors in tabulation. I went back through the pre-test and post-test results before adding them to the follow-up test results. I conducted these analyses: a boxplot of the vowel blindness scores of the three tests; a paired t-test to compare the means of the post-test and the follow-up test; and an ANOVA of follow-up test by highlighting. First, a boxplot comparing the results of the three tests:

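# Read the per-participant totals for each of the three tests
pretestsum <- read.csv("C:/Users/Stephen/Desktop/Reem/pretestsum.csv")
postestsum <- read.csv("C:/Users/Stephen/Desktop/Reem/postestsum.csv")
follupsum <- read.csv("C:/Users/Stephen/Desktop/Reem/follupsum.csv")
# Join the three tables on the participant identifier
pilot <- merge(pretestsum, postestsum, by = "users")
pilot1 <- merge(pilot, follupsum, by = "users")
p <- as.data.frame(pilot1)
# Stack only the numeric score columns into long format; this assumes
# "users" is a non-numeric identifier (stack() on the whole data frame
# breaks when read.csv() returns it as character, as in R >= 4.0)
pstack <- stack(p[sapply(p, is.numeric)])
names(pstack) <- c("VBScore", "Test")
boxplot(pstack$VBScore ~ pstack$Test, main = "VB Scores by Test")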

plot of chunk unnamed-chunk-21

By eye, it looks as though follow-up results and post-test results are very similar. We can use a paired t-test to see if individual participants had changed vowel-blindness scores:

t.test(p$followupsum, p$posttestsum, paired = TRUE)
## 
##  Paired t-test
## 
## data:  p$followupsum and p$posttestsum 
## t = -1.451, df = 34, p-value = 0.156
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -2.7436  0.4579 
## sample estimates:
## mean of the differences 
##                  -1.143

The p-value of 0.156 indicates that there is no statistically significant difference between follow-up and post-test scores. Finally, highlighting. First, here is a boxplot of follow-up scores by highlighting:

boxplot(p$followupsum ~ p$highlightvowel, main = "Effects of highlighting")

plot of chunk unnamed-chunk-23

There appears to be little difference. Test this:

faov <- aov(p$followupsum ~ p$highlightvowel)
summary(faov)
##                  Df Sum Sq Mean Sq F value Pr(>F)
## p$highlightvowel  1     58    57.9    1.24   0.27
## Residuals        33   1544    46.8

So the follow-up scores of those who had highlighting did not differ significantly from the scores of those who did not (p = 0.27).
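With only two levels of highlighting (1 degree of freedom), this one-way ANOVA is equivalent to a two-sample t-test with pooled variance: the F value is the square of the t statistic and the p-values coincide. The sketch below illustrates this on simulated data. The scores and the 28/7 split between levels are invented for the example, and only the column names are borrowed from the code above.

```r
# Simulated follow-up scores for two highlighting levels; not the study data.
set.seed(3)
sim <- data.frame(
  followupsum    = round(rnorm(35, mean = 15, sd = 7)),
  highlightvowel = factor(rep(c("no", "yes"), times = c(28, 7)))
)

# F statistic from the one-way ANOVA table
f_val <- summary(aov(followupsum ~ highlightvowel, data = sim))[[1]][["F value"]][1]

# t statistic from the pooled-variance (Student) two-sample t-test
t_val <- unname(t.test(followupsum ~ highlightvowel, data = sim,
                       var.equal = TRUE)$statistic)

c(F = f_val, t_squared = t_val^2)
```

The Welch version of the t-test (the default in `t.test()`) does not assume equal variances, so it can give a slightly different p-value from the ANOVA.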

Interesting individual scores

Below is a plot of the records of individuals. Event 1 is the pre-test score, event 2 is the post-test score, and event 3 is the follow-up score. The blue line is the overall trend line for all of the participants. Note that the post-test score for some individuals is higher than the pre-test score.

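preemlong <- read.csv("C:/Users/Stephen/Desktop/Reem/preemlong.csv")
library(ggplot2)
# One line per participant across the three events
pr <- ggplot(preemlong, aes(event, SCORE, group = id)) + geom_line()
# Overlay a single smoothed trend across all participants (group = 1)
pr1 <- pr + geom_smooth(aes(group = 1), size = 2, col = "blue", se = TRUE)
pr1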
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.

plot of chunk unnamed-chunk-25

Who are these individuals?

preem <- read.csv("C:/Users/Stephen/Desktop/Reem/preem.csv")
preemhigh <- subset(preem, subset = (POSTTEST > PRETEST))
## Error: object 'POSTTEST' not found
preemhigh
## Error: object 'preemhigh' not found

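(The subset call above failed when this report was compiled because `preem` has no column called `POSTTEST`, so the listing is missing; the column names in the comparison need to match those in the file.) The individuals in question are all from the 'None' group and without vowel highlighting.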
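One way to guard against the "object not found" error shown above is to check the column names before subsetting. The sketch below uses a small invented data frame, because the real column names in `preem.csv` are not known from this report; `toy`, `pretest`, `posttest` and `group` are all hypothetical names.

```r
# Invented example data; the real column names in preem.csv may differ.
toy <- data.frame(
  id       = 1:4,
  pretest  = c(20, 25, 18, 22),
  posttest = c(12, 27, 18, 10),
  group    = c("Morphology", "None", "None", "Phonology")
)

# Confirm the columns exist before using them (names are case-sensitive)
wanted       <- c("pretest", "posttest")
missing_cols <- setdiff(wanted, names(toy))
if (length(missing_cols) > 0) {
  stop("Columns not found: ", paste(missing_cols, collapse = ", "))
}

# Participants whose post-test score is strictly higher than their pre-test score
higher <- subset(toy, posttest > pretest)
higher
```

Running `names(preem)` on the real data would show which spellings to use in place of the hypothetical ones here.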

More interesting graphical output

Instead of looking at individuals, we can examine their changed vowel-blindness over time grouped by treatment. The plot below shows the smoothed trend by group. The top line is for the “None” group. They do not have vowel highlighting. Their scores are essentially unchanged. The other four lines are for the other four groups, including the control group with highlighting (the next worst group). They show a similar pattern to one another. I am working on getting a legend and colour-coding the lines.

pr <- ggplot(preemlong, aes(event, SCORE, group = id)) + geom_smooth(aes(group = GROUP1, 
    color = GROUP1), se = FALSE, size = 2)
pr

plot of chunk unnamed-chunk-27

We can do the same thing with vowel highlighting. The line which starts off highest is for those with highlighting. The improvement is dramatic.

ggplot(preemlong, aes(x = event, y = SCORE, group = id)) + geom_smooth(aes(group = hlv, 
    color = hlv), se = FALSE, size = 2)

plot of chunk unnamed-chunk-28

Getting a percentage change in performance by group

An interesting and highly useful quality of natural logarithms is that regressing the natural logarithm of one variable on the natural logarithm of another provides the elasticity: the percentage change in the response for a one percent change in the predictor. Working in percentages means we don't need to worry about the units. I have converted the vowel blindness scores to logs and then performed the regression, using the 'None' group as the reference level.

preem <- read.csv("C:/Users/Stephen/Desktop/Reem/preem.csv")
preem$GROUP1 <- as.factor(preem$GROUP1)
preem$GROUP1 <- relevel(preem$GROUP1, ref = "None")
changeaftertest <- lm(LNFOLLOWUP ~ LNPRETEST + GROUP1, data = preem)
summary(changeaftertest)
## 
## Call:
## lm(formula = LNFOLLOWUP ~ LNPRETEST + GROUP1, data = preem)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6487 -0.1973  0.0546  0.1834  0.7717 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.473      1.039    2.38  0.02403 *  
## LNPRETEST             0.239      0.321    0.74  0.46300    
## GROUP1Morphology     -0.722      0.202   -3.57  0.00126 ** 
## GROUP1Morphonology   -0.770      0.192   -4.01  0.00039 ***
## GROUP1NoneHLV        -0.830      0.191   -4.34  0.00016 ***
## GROUP1Phonology      -0.708      0.193   -3.68  0.00095 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.357 on 29 degrees of freedom
## Multiple R-squared: 0.486,   Adjusted R-squared: 0.397 
## F-statistic: 5.47 on 5 and 29 DF,  p-value: 0.00115

These results are really interesting. Notice the negative signs by each group? This means that the follow-up score (the response variable) was in all cases smaller than for the reference level (the 'None' group). So compared to the None group, all the groups came in with a lower score. This fits with the graph we did above. Because the response is in logs, a group coefficient b under 'Estimate' corresponds to a proportional change of exp(b) - 1 in the follow-up score; reading the coefficient directly as a percentage is only a good approximation when it is close to zero. So, for any given pre-test score, the follow-up score will be about 51% lower for participants given the Morphology treatment (exp(-0.722) - 1 is about -0.51), and so on. From this it looks as though highlighting on its own is the most effective, with a reduction in vowel blindness of about 56%, although the four group coefficients differ from each other by less than their standard errors (about 0.19 to 0.20). CAUTION! Although this model 'works' it explains only 40% of the variance (see the adjusted R-squared value of 0.3969). There will be some other factors not in this model. A larger sample should give a clearer picture.
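When the response is a natural logarithm and the predictor is a group indicator, the exact percentage difference implied by a coefficient b is 100 * (exp(b) - 1) rather than 100 * b. A short sketch of that conversion, using the rounded group estimates copied from the table above:

```r
# Group coefficients (rounded) from the log-scale regression output
coefs <- c(Morphology   = -0.722,
           Morphonology = -0.770,
           NoneHLV      = -0.830,
           Phonology    = -0.708)

# Exact percentage difference from the 'None' reference group
pct_change <- 100 * (exp(coefs) - 1)

# Naive reading of the coefficient as a percentage, for comparison
naive_pct <- 100 * coefs

round(rbind(exact = pct_change, naive = naive_pct), 1)
```

For coefficients of this size the naive reading overstates the reduction considerably: the exact figures all fall between roughly 50% and 57%.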