Tutorial in R - exploratory analyses and the t.test

Read in the data
Correlation - First pass exploratory analyses
EXERCISE
Aggregating and summarising the data
EXERCISE
Summary
Data Analyses
- Testing the assumptions
- Developing the hypothesis
Intepreting the output from a ttest
EXERCISE

Read in the data

In the code below change the text in between the inverted commas to your directory and file name.

meas <- read.csv("~/OneDrive - University of the Sunshine Coast/Courses/ENG403/2019/Weeks/4/meas.csv")

Correlation - First pass exploratory analyses

The following will produce the matrix of scatter plots referenced in the video. Remember that correlation is different to a linear regression in that there are no dependent and independent variables.

pairs(meas)

As you can see in the pairs output that there is some anomalies with the dates. We need to fix this and look for other anomalies. The unique function is an ideal tool for this:

unique(meas$Date)

## [1] 7/08/2019 7-Aug     07-Aug-19
## Levels: 07-Aug-19 7-Aug 7/08/2019

From this output we need to assign the same value to all of the rows and also make the datatype Date (they are currently assigned a datatype level)

meas$Date<-format(as.Date("2019-08-07"),"%Y-%m-%d")

EXERCISE

Use the unique function to see if there are any other variables with spurious observations.

Aggregating and summarising the data

The by function aggregates the data, but as you can see below, the output is a bit messy. The colons (:) separates the variables that we want to group by; in r this is an interaction operator. colMeans indicates that the function to apply to the groupings and the data of interest (L, W, and D) are in columns 8 to 10.

do<-by(meas[,8:10],meas$Instrument:meas$Specimen:meas$Group.Name,colMeans)
print(head(do))

## $`Callipers:Stretched:Explorers`
##      L      W      D 
##     NA 18.560  1.388 
## 
## $`Callipers:Stretched:Group1`
##       L       W       D 
##      NA 17.6275  1.4675 
## 
## $`Callipers:Stretched:Group8B`
##      L      W      D 
##     NA 16.212  1.436 
## 
## $`Callipers:Stretched:Group9`
##       L       W       D 
##      NA 13.7275  1.4325 
## 
## $`Callipers:Stretched:GroupA`
##       L       W       D 
## 162.840  17.666   1.248 
## 
## $`Callipers:Stretched:GroupC`
##        L        W        D 
## 165.0000  16.0675   1.2325

#The head command is used to limit the output.  If you want to see all the output from this command, remove the head command.

EXERCISE

Produce the summary for just the Specimen (ie. Stretched and Unstretched) for L, W and D

Summary

Ideally we want to be able to see a summary of all the data in one nice table. The following will summarise the L, W and D (they are in columns 8, 9 and 10, respectively), but this is really meaningless, as we want to see the differences between the stretched and unstretched and across the groups.

summary(meas[,8:10])

##        L               W               D         
##  Min.   :144.7   Min.   : 1.80   Min.   : 1.000  
##  1st Qu.:150.0   1st Qu.:17.66   1st Qu.: 1.300  
##  Median :150.0   Median :19.87   Median : 1.510  
##  Mean   :155.3   Mean   :18.91   Mean   : 6.069  
##  3rd Qu.:162.5   3rd Qu.:20.00   3rd Qu.: 2.000  
##  Max.   :166.0   Max.   :23.00   Max.   :83.000  
##  NA's   :74      NA's   :4       NA's   :22

We can do a summary of the data by adding some row filters in each case (eg. as below), but this is a bit cumbersome and there are some nice packages available that allow us to do this in just 1 or 2 lines of code

summary(meas[meas$Specimen=="Unstretched",8:10])

##        L               W               D        
##  Min.   :144.7   Min.   :17.00   Min.   :1.000  
##  1st Qu.:149.9   1st Qu.:19.91   1st Qu.:1.500  
##  Median :150.0   Median :20.00   Median :1.550  
##  Mean   :150.3   Mean   :20.02   Mean   :1.646  
##  3rd Qu.:150.0   3rd Qu.:20.00   3rd Qu.:2.000  
##  Max.   :165.2   Max.   :23.00   Max.   :2.500  
##  NA's   :19      NA's   :2       NA's   :16

summary(meas[meas$Specimen=="Stretched",8:10])

##        L               W               D        
##  Min.   :158.0   Min.   : 1.80   Min.   : 1.00  
##  1st Qu.:160.0   1st Qu.:17.00   1st Qu.: 1.20  
##  Median :162.7   Median :17.69   Median : 1.42  
##  Mean   :162.1   Mean   :17.84   Mean   :10.01  
##  3rd Qu.:164.0   3rd Qu.:18.88   3rd Qu.: 1.70  
##  Max.   :166.0   Max.   :23.00   Max.   :83.00  
##  NA's   :55      NA's   :2       NA's   :6

The package EnvStats is one such package. Don’t forget you will need to do a install.package(“EnvStats”) before you can library it. In this package the `summaryFull function will produce very clean output if we ‘nest’ the split function on the interaction between the columns of interest:

library(EnvStats)
sdat<-summaryFull(split(meas$L,meas$Specimen:meas$Instrument))

## Warning in FUN(X[[i]], ...): All values of 'x' are missing

## Warning in FUN(X[[i]], ...): All values of 'x' are missing

print(sdat)

##                              Stretched:Callipers Stretched:Micrometer
## N                              9                   0                 
## NA's                          31                  24                 
## N.Total                       40                  24                 
## Mean                         163.8                                   
## Median                       164                                     
## 10% Trimmed Mean             163.8                                   
## Geometric Mean               163.8                                   
## Skew                          -0.08679                               
## Kurtosis                      -2.418                                 
## Min                          162.5                                   
## Max                          165                                     
## Range                          2.5                                   
## 1st Quartile                 162.5                                   
## 3rd Quartile                 165                                     
## Standard Deviation             1.229                                 
## Geometric Standard Deviation   1.008                                 
## Interquartile Range            2.5                                   
## Median Absolute Deviation      1.483                                 
## Coefficient of Variation       0.007502                              
##                              Stretched:Ruler Stretched:Tape Measure
## N                             40              40                   
## NA's                           0               0                   
## N.Total                       40              40                   
## Mean                         162.1           161.8                 
## Median                       162.6           162.5                 
## 10% Trimmed Mean             162.2           161.8                 
## Geometric Mean               162.1           161.8                 
## Skew                          -0.3417         -0.2739              
## Kurtosis                      -1.482          -1.468               
## Min                          158             158                   
## Max                          165             166                   
## Range                          7               8                   
## 1st Quartile                 160             159                   
## 3rd Quartile                 164             164                   
## Standard Deviation             2.37            2.619               
## Geometric Standard Deviation   1.015           1.016               
## Interquartile Range            4               5                   
## Median Absolute Deviation      3.262           3.336               
## Coefficient of Variation       0.01462         0.01619             
##                              Unstretched:Callipers Unstretched:Micrometer
## N                             40                     0                   
## NA's                           0                    19                   
## N.Total                       40                    19                   
## Mean                         150.6                                       
## Median                       149.8                                       
## 10% Trimmed Mean             149.8                                       
## Geometric Mean               150.5                                       
## Skew                           3.243                                     
## Kurtosis                      11.57                                      
## Min                          144.7                                       
## Max                          165.2                                       
## Range                         20.5                                       
## 1st Quartile                 149.7                                       
## 3rd Quartile                 150                                         
## Standard Deviation             3.689                                     
## Geometric Standard Deviation   1.024                                     
## Interquartile Range            0.3                                       
## Median Absolute Deviation      0.1483                                    
## Coefficient of Variation       0.0245                                    
##                              Unstretched:Ruler Unstretched:Tape Measure
## N                             40                40                     
## NA's                           0                 0                     
## N.Total                       40                40                     
## Mean                         150.2             150                     
## Median                       150               150                     
## 10% Trimmed Mean             150               150                     
## Geometric Mean               150.2             150                     
## Skew                           3.316             2.668                 
## Kurtosis                      10.78             12.87                  
## Min                          149.7             149                     
## Max                          152.9             153                     
## Range                          3.2               4                     
## 1st Quartile                 150               150                     
## 3rd Quartile                 150               150                     
## Standard Deviation             0.6768            0.6368                
## Geometric Standard Deviation   1.004             1.004                 
## Interquartile Range            0                 0                     
## Median Absolute Deviation      0                 0                     
## Coefficient of Variation       0.004505          0.004246

#The head command is used to

Data Analyses

We will want to do some data analyses to test for significant differences. However the test we are going to apply assumes that the samples for both groups are drawn from a normal distribution with equal variance. Let’s assume that an intervention is to stretch the material and we want to test if this makes a significant difference to the width of the sample. Obviously we would be expecting this to be the case. First step is to create a histogram of both the stretched and unstretched data (to make a subjective judgement about whether the samples are normally distributed) and then a create a side by side boxplot of the samples to make a subjective judgement about the variance. The histogram of all the measurements suggests this is not the case.

The following two lines of code create two different datasets for the stretched and unstretched specimens, with just the width measurement. In this example we are not ‘pairing the measurements’ and just treating them as an unpaired sample. Further on we will use the reshape library and pair the two samples to observe the difference to our ttest.

ustr<-meas$W[meas$Specimen=="Unstretched"]
stret<-meas$W[meas$Specimen=="Stretched"]

To create the side by side histograms we need to divide the plot output up into two columns using the par command. The mfrow parameter takes a vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively. There are many other ways you can configure your plot output with the par command. Type ?par into the command line to obtain the help file to read up on these.

par(mfrow=c(1,2))
#The two histogram are created with the following code:
hist(ustr,main="Unstretched", xlab="Width (mm)",freq=FALSE,col="red")
hist(stret,main="Stretched", xlab="Width (mm)",freq=FALSE,col="red")

The boxplot command will not require the two separate datasets as we can ‘condition’ the output on the Specimen type using the ~ symbol. However we will need to reset the output device to remove the side by side plots (from the previous par command). This done by using the command dev.off()

boxplot(W~Specimen,data=meas,las=2)

Testing the assumptions

The following notes are compiled from a Berkeley Statistics Page.

The two outputs we have looked at so far suggest that the assumption of normality is ok, but there is a considerable difference in the variance. There is a version of the t.test, which applies when the variances are not equal.

Developing the hypothesis

Before we undertake the t-test we need to develop the Null hypothesis, which is: the mean of the stretched width < the mean of the unstretched (control) width, and the alternative is the opposite. Under the null hypothesis (assuming unpaired samples) we can calculate a t-statistic that will follow a t-distribution with n1 + n2 - 2 degrees of freedom (dof), where n1 is the first dataset and n2 is the second. Each observation contributes a degree of freedom, but we lose two because we have to estimate the mean of each group. How many dof are you expecting for this test?

dofcal<-length(stret)+length(ustr)-2
print(dofcal)

## [1] 281

ttest<-t.test(meas$W~meas$Specimen,paired=FALSE, alternative="less")
print(ttest)

## 
## Results of Hypothesis Test
## --------------------------
## 
## Null Hypothesis:                 difference in means = 0
## 
## Alternative Hypothesis:          True difference in means is less than 0
## 
## Test Name:                       Welch Two Sample t-test
## 
## Estimated Parameter(s):          mean in group Stretched   = 17.84190
##                                  mean in group Unstretched = 20.01781
## 
## Data:                            meas$W by meas$Specimen
## 
## Test Statistic:                  t = -11.69415
## 
## Test Statistic Parameter:        df = 167.9572
## 
## P-value:                         8.595005e-24
## 
## 95% Confidence Interval:         LCL =      -Inf
##                                  UCL = -1.868156

Intepreting the output from a ttest

Based on this result we can say that at the 95% confidence level the mean of the width of the stretched material is significantly less than the mean of the width of the unstretched material (p<0.05) and therefore can reject the null hypothesis. Strictly speaking we cant say that we accept the alternative hypothesis.

You will note we used the paired=FALSE option in the t.test, because when we extracted the data we didn’t look at how they are paired. To investigate if this has any difference we will pair the data using the cast function in the reshape library. If you don’t have this library installed, you will need to use the install.package("reshape") and library(reshape) commands before you can access the cast function.

The cast function works best if you have only the columns you want to group by and call the column you want to pivot on “variable” and column with the observation of interest “value”:

library(reshape)
#Extract only the data for the analysis (d4a)
d4a<-meas[,c(1:2,4:7,9)]
#Name the columns such that the last two have the names variable and value
colnames(d4a)[6:7]<-c("variable","value")
#Cast the data to a new dataset (cd4a)
cd4a<-cast(d4a)

The t distribution is a probability distribution that is often used for small sample sizes. When sample sizes are small or if the standard deviation of the population is not known the t statistic is often used. If we know the standard deviation and have a normally distributed sample, we would obtain the z-score to obtain the probability of how far away an observation is from the mean. The t-distribution uses a similar approach and uses a t-statistic (t-score) rather than a z-score. However, unlike the standard normal distribution there are many t-distributions and the particular form is determined by its degree of freedom.

ttest<-t.test(cd4a$Stretched,cd4a$Unstretched,paired=TRUE,na.rm=TRUE,alternative = "less")
print(ttest)

## 
## Results of Hypothesis Test
## --------------------------
## 
## Null Hypothesis:                 difference in means = 0
## 
## Alternative Hypothesis:          True difference in means is less than 0
## 
## Test Name:                       Paired t-test
## 
## Estimated Parameter(s):          mean of the differences = -2.270365
## 
## Data:                            cd4a$Stretched and cd4a$Unstretched
## 
## Test Statistic:                  t = -12.62997
## 
## Test Statistic Parameter:        df = 136
## 
## P-value:                         5.56821e-25
## 
## 95% Confidence Interval:         LCL =      -Inf
##                                  UCL = -1.972658

The number of dof in this test is 136, which is the number of observations (ignoring the NAs) - 1.

EXERCISE

Explore for differences between the tut groups using the t.test.

Tutorial in R - exploratory analyses and the t.test

Dr Helen Fairweather

20 August, 2019

Read in the data

Correlation - First pass exploratory analyses

EXERCISE

Aggregating and summarising the data

EXERCISE

Summary

Data Analyses

Testing the assumptions

Developing the hypothesis

Intepreting the output from a ttest

EXERCISE