In the code below change the text in between the inverted commas to your directory and file name.
meas <- read.csv("~/OneDrive - University of the Sunshine Coast/Courses/ENG403/2019/Weeks/4/meas.csv")
The following will produce the matrix of scatter plots referenced in the video. Remember that correlation is different to a linear regression in that there are no dependent and independent variables.
pairs(meas)
As you can see in the pairs output that there is some anomalies with the dates. We need to fix this and look for other anomalies. The unique function is an ideal tool for this:
unique(meas$Date)
## [1] 7/08/2019 7-Aug 07-Aug-19
## Levels: 07-Aug-19 7-Aug 7/08/2019
From this output we need to assign the same value to all of the rows and also make the datatype Date (they are currently assigned a datatype level)
meas$Date<-format(as.Date("2019-08-07"),"%Y-%m-%d")
Use the unique function to see if there are any other variables with spurious observations.
The by function aggregates the data, but as you can see below, the output is a bit messy. The colons (:) separates the variables that we want to group by; in r this is an interaction operator. colMeans indicates that the function to apply to the groupings and the data of interest (L, W, and D) are in columns 8 to 10.
do<-by(meas[,8:10],meas$Instrument:meas$Specimen:meas$Group.Name,colMeans)
print(head(do))
## $`Callipers:Stretched:Explorers`
## L W D
## NA 18.560 1.388
##
## $`Callipers:Stretched:Group1`
## L W D
## NA 17.6275 1.4675
##
## $`Callipers:Stretched:Group8B`
## L W D
## NA 16.212 1.436
##
## $`Callipers:Stretched:Group9`
## L W D
## NA 13.7275 1.4325
##
## $`Callipers:Stretched:GroupA`
## L W D
## 162.840 17.666 1.248
##
## $`Callipers:Stretched:GroupC`
## L W D
## 165.0000 16.0675 1.2325
#The head command is used to limit the output. If you want to see all the output from this command, remove the head command.
Produce the summary for just the Specimen (ie. Stretched and Unstretched) for L, W and D
Ideally we want to be able to see a summary of all the data in one nice table. The following will summarise the L, W and D (they are in columns 8, 9 and 10, respectively), but this is really meaningless, as we want to see the differences between the stretched and unstretched and across the groups.
summary(meas[,8:10])
## L W D
## Min. :144.7 Min. : 1.80 Min. : 1.000
## 1st Qu.:150.0 1st Qu.:17.66 1st Qu.: 1.300
## Median :150.0 Median :19.87 Median : 1.510
## Mean :155.3 Mean :18.91 Mean : 6.069
## 3rd Qu.:162.5 3rd Qu.:20.00 3rd Qu.: 2.000
## Max. :166.0 Max. :23.00 Max. :83.000
## NA's :74 NA's :4 NA's :22
We can do a summary of the data by adding some row filters in each case (eg. as below), but this is a bit cumbersome and there are some nice packages available that allow us to do this in just 1 or 2 lines of code
summary(meas[meas$Specimen=="Unstretched",8:10])
## L W D
## Min. :144.7 Min. :17.00 Min. :1.000
## 1st Qu.:149.9 1st Qu.:19.91 1st Qu.:1.500
## Median :150.0 Median :20.00 Median :1.550
## Mean :150.3 Mean :20.02 Mean :1.646
## 3rd Qu.:150.0 3rd Qu.:20.00 3rd Qu.:2.000
## Max. :165.2 Max. :23.00 Max. :2.500
## NA's :19 NA's :2 NA's :16
summary(meas[meas$Specimen=="Stretched",8:10])
## L W D
## Min. :158.0 Min. : 1.80 Min. : 1.00
## 1st Qu.:160.0 1st Qu.:17.00 1st Qu.: 1.20
## Median :162.7 Median :17.69 Median : 1.42
## Mean :162.1 Mean :17.84 Mean :10.01
## 3rd Qu.:164.0 3rd Qu.:18.88 3rd Qu.: 1.70
## Max. :166.0 Max. :23.00 Max. :83.00
## NA's :55 NA's :2 NA's :6
The package EnvStats is one such package. Don’t forget you will need to do a install.package(“EnvStats”) before you can library it. In this package the `summaryFull function will produce very clean output if we ‘nest’ the split function on the interaction between the columns of interest:
library(EnvStats)
sdat<-summaryFull(split(meas$L,meas$Specimen:meas$Instrument))
## Warning in FUN(X[[i]], ...): All values of 'x' are missing
## Warning in FUN(X[[i]], ...): All values of 'x' are missing
print(sdat)
## Stretched:Callipers Stretched:Micrometer
## N 9 0
## NA's 31 24
## N.Total 40 24
## Mean 163.8
## Median 164
## 10% Trimmed Mean 163.8
## Geometric Mean 163.8
## Skew -0.08679
## Kurtosis -2.418
## Min 162.5
## Max 165
## Range 2.5
## 1st Quartile 162.5
## 3rd Quartile 165
## Standard Deviation 1.229
## Geometric Standard Deviation 1.008
## Interquartile Range 2.5
## Median Absolute Deviation 1.483
## Coefficient of Variation 0.007502
## Stretched:Ruler Stretched:Tape Measure
## N 40 40
## NA's 0 0
## N.Total 40 40
## Mean 162.1 161.8
## Median 162.6 162.5
## 10% Trimmed Mean 162.2 161.8
## Geometric Mean 162.1 161.8
## Skew -0.3417 -0.2739
## Kurtosis -1.482 -1.468
## Min 158 158
## Max 165 166
## Range 7 8
## 1st Quartile 160 159
## 3rd Quartile 164 164
## Standard Deviation 2.37 2.619
## Geometric Standard Deviation 1.015 1.016
## Interquartile Range 4 5
## Median Absolute Deviation 3.262 3.336
## Coefficient of Variation 0.01462 0.01619
## Unstretched:Callipers Unstretched:Micrometer
## N 40 0
## NA's 0 19
## N.Total 40 19
## Mean 150.6
## Median 149.8
## 10% Trimmed Mean 149.8
## Geometric Mean 150.5
## Skew 3.243
## Kurtosis 11.57
## Min 144.7
## Max 165.2
## Range 20.5
## 1st Quartile 149.7
## 3rd Quartile 150
## Standard Deviation 3.689
## Geometric Standard Deviation 1.024
## Interquartile Range 0.3
## Median Absolute Deviation 0.1483
## Coefficient of Variation 0.0245
## Unstretched:Ruler Unstretched:Tape Measure
## N 40 40
## NA's 0 0
## N.Total 40 40
## Mean 150.2 150
## Median 150 150
## 10% Trimmed Mean 150 150
## Geometric Mean 150.2 150
## Skew 3.316 2.668
## Kurtosis 10.78 12.87
## Min 149.7 149
## Max 152.9 153
## Range 3.2 4
## 1st Quartile 150 150
## 3rd Quartile 150 150
## Standard Deviation 0.6768 0.6368
## Geometric Standard Deviation 1.004 1.004
## Interquartile Range 0 0
## Median Absolute Deviation 0 0
## Coefficient of Variation 0.004505 0.004246
#The head command is used to
We will want to do some data analyses to test for significant differences. However the test we are going to apply assumes that the samples for both groups are drawn from a normal distribution with equal variance. Let’s assume that an intervention is to stretch the material and we want to test if this makes a significant difference to the width of the sample. Obviously we would be expecting this to be the case. First step is to create a histogram of both the stretched and unstretched data (to make a subjective judgement about whether the samples are normally distributed) and then a create a side by side boxplot of the samples to make a subjective judgement about the variance. The histogram of all the measurements suggests this is not the case.
The following two lines of code create two different datasets for the stretched and unstretched specimens, with just the width measurement. In this example we are not ‘pairing the measurements’ and just treating them as an unpaired sample. Further on we will use the reshape library and pair the two samples to observe the difference to our ttest.
ustr<-meas$W[meas$Specimen=="Unstretched"]
stret<-meas$W[meas$Specimen=="Stretched"]
To create the side by side histograms we need to divide the plot output up into two columns using the par command. The mfrow parameter takes a vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively. There are many other ways you can configure your plot output with the par command. Type ?par into the command line to obtain the help file to read up on these.
par(mfrow=c(1,2))
#The two histogram are created with the following code:
hist(ustr,main="Unstretched", xlab="Width (mm)",freq=FALSE,col="red")
hist(stret,main="Stretched", xlab="Width (mm)",freq=FALSE,col="red")
The boxplot command will not require the two separate datasets as we can ‘condition’ the output on the Specimen type using the ~ symbol. However we will need to reset the output device to remove the side by side plots (from the previous par command). This done by using the command dev.off()
boxplot(W~Specimen,data=meas,las=2)
The following notes are compiled from a Berkeley Statistics Page.
The two outputs we have looked at so far suggest that the assumption of normality is ok, but there is a considerable difference in the variance. There is a version of the t.test, which applies when the variances are not equal.
Before we undertake the t-test we need to develop the Null hypothesis, which is: the mean of the stretched width < the mean of the unstretched (control) width, and the alternative is the opposite. Under the null hypothesis (assuming unpaired samples) we can calculate a t-statistic that will follow a t-distribution with n1 + n2 - 2 degrees of freedom (dof), where n1 is the first dataset and n2 is the second. Each observation contributes a degree of freedom, but we lose two because we have to estimate the mean of each group. How many dof are you expecting for this test?
dofcal<-length(stret)+length(ustr)-2
print(dofcal)
## [1] 281
ttest<-t.test(meas$W~meas$Specimen,paired=FALSE, alternative="less")
print(ttest)
##
## Results of Hypothesis Test
## --------------------------
##
## Null Hypothesis: difference in means = 0
##
## Alternative Hypothesis: True difference in means is less than 0
##
## Test Name: Welch Two Sample t-test
##
## Estimated Parameter(s): mean in group Stretched = 17.84190
## mean in group Unstretched = 20.01781
##
## Data: meas$W by meas$Specimen
##
## Test Statistic: t = -11.69415
##
## Test Statistic Parameter: df = 167.9572
##
## P-value: 8.595005e-24
##
## 95% Confidence Interval: LCL = -Inf
## UCL = -1.868156
Based on this result we can say that at the 95% confidence level the mean of the width of the stretched material is significantly less than the mean of the width of the unstretched material (p<0.05) and therefore can reject the null hypothesis. Strictly speaking we cant say that we accept the alternative hypothesis.
You will note we used the paired=FALSE option in the t.test, because when we extracted the data we didn’t look at how they are paired. To investigate if this has any difference we will pair the data using the cast function in the reshape library. If you don’t have this library installed, you will need to use the install.package("reshape") and library(reshape) commands before you can access the cast function.
The cast function works best if you have only the columns you want to group by and call the column you want to pivot on “variable” and column with the observation of interest “value”:
library(reshape)
#Extract only the data for the analysis (d4a)
d4a<-meas[,c(1:2,4:7,9)]
#Name the columns such that the last two have the names variable and value
colnames(d4a)[6:7]<-c("variable","value")
#Cast the data to a new dataset (cd4a)
cd4a<-cast(d4a)
The t distribution is a probability distribution that is often used for small sample sizes. When sample sizes are small or if the standard deviation of the population is not known the t statistic is often used. If we know the standard deviation and have a normally distributed sample, we would obtain the z-score to obtain the probability of how far away an observation is from the mean. The t-distribution uses a similar approach and uses a t-statistic (t-score) rather than a z-score. However, unlike the standard normal distribution there are many t-distributions and the particular form is determined by its degree of freedom.
ttest<-t.test(cd4a$Stretched,cd4a$Unstretched,paired=TRUE,na.rm=TRUE,alternative = "less")
print(ttest)
##
## Results of Hypothesis Test
## --------------------------
##
## Null Hypothesis: difference in means = 0
##
## Alternative Hypothesis: True difference in means is less than 0
##
## Test Name: Paired t-test
##
## Estimated Parameter(s): mean of the differences = -2.270365
##
## Data: cd4a$Stretched and cd4a$Unstretched
##
## Test Statistic: t = -12.62997
##
## Test Statistic Parameter: df = 136
##
## P-value: 5.56821e-25
##
## 95% Confidence Interval: LCL = -Inf
## UCL = -1.972658
The number of dof in this test is 136, which is the number of observations (ignoring the NAs) - 1.
Explore for differences between the tut groups using the t.test.