Evaluating equation (3.4) with \(N_{1,2}\), \(\bar{x}_{1,2}\) and \(s_{1,2}\) gives a t-value of \(t_{1,2}=-2.740403\). The following R snippet shows how the two samples were generated and how the t-value was calculated with the user-defined function t.val() (note that the two samples truly differ by \(\Delta\mu=5\) with "true" uncertainty \(\sigma=5\)).
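For reference, the statistic implemented in t.val() below is the pooled two-sample t-value, cf. equation (3.4), in the notation used here:
\[
t_{1,2}=\frac{\bar{x}_1-\bar{x}_2}{s_{pool}\sqrt{\frac{1}{N_1}+\frac{1}{N_2}}},\qquad
s_{pool}=\sqrt{\frac{(N_1-1)s_1^2+(N_2-1)s_2^2}{N_1+N_2-2}}.
\]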
rm(list=ls())                      # delete all objects in the workspace
set.seed(1234)                     # set a seed so that the results are reproducible
t.val <- function(pop1,pop2){      # pooled two-sample t-value, cf. equation (3.4)
  x1 <- mean(pop1,na.rm=T)         # sample mean of group 1
  x2 <- mean(pop2,na.rm=T)         # sample mean of group 2
  n1 <- length(pop1[!is.na(pop1)]) # sample size of group 1
  n2 <- length(pop2[!is.na(pop2)]) # sample size of group 2
  s2.1 <- var(pop1,na.rm=T)        # sample variance of group 1
  s2.2 <- var(pop2,na.rm=T)        # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 + (n2-1)*s2.2 ) / (n1+n2-2)) # pooled standard deviation
  t <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2 ) )                 # t-value
  return(t)
}
old <- rnorm(10,135,5) # 10 draws from a normal distribution with mean = 135 and sd = 5
new <- rnorm(10,140,5) # 10 draws from a normal distribution with mean = 140 and sd = 5
cat("Old(N=10):",round( mean(old),2),"+/-", round(sd(old),2)) # report mean and standard deviation of old
cat("\nNew(N=10):",round( mean(new),2),"+/-", round(sd(new),2)) # report mean and standard deviation of new
cat("\nt-value=", t.val(old,new)) # t-value of the two samples
## Old(N=10): 133.08 +/- 4.98
## New(N=10): 139.41 +/- 5.34
## t-value= -2.740403
As such, the value of \(t=-2.74\) does not tell much, but it would tell more if the distribution of t were known under the assumption that both samples come from the same distribution. This distribution is easily obtained by Monte Carlo simulation, as shown with the following R code. The code generates 100000 t-values by sampling from a normal distribution, \(N(\mu=0,\sigma^2=1)\), with sample sizes \(N_{old,new}=10\), and shows the resulting distribution as a histogram, figure 3.10. It then counts the number of t-values exceeding or falling short of \(\pm 2.74\) (that is, \(N_{|t|>2.74}\)) and reports this as a relative frequency, \(\alpha=0.01312\).
t.val <- function(pop1,pop2){      # pooled two-sample t-value, as defined above
  x1 <- mean(pop1,na.rm=T)         # sample mean of group 1
  x2 <- mean(pop2,na.rm=T)         # sample mean of group 2
  n1 <- length(pop1[!is.na(pop1)]) # sample size of group 1
  n2 <- length(pop2[!is.na(pop2)]) # sample size of group 2
  s2.1 <- var(pop1,na.rm=T)        # sample variance of group 1
  s2.2 <- var(pop2,na.rm=T)        # sample variance of group 2
  s.pool <- sqrt( ( (n1-1)*s2.1 + (n2-1)*s2.2 ) / (n1+n2-2)) # pooled standard deviation
  t <- (x1-x2)/(s.pool*sqrt( 1/n1 + 1/n2 ) )                 # t-value
  return(t)
}
set.seed(1234)                   # set a seed so that the results are reproducible
t <- NULL                        # vector collecting the simulated t-values
for (i in 1:100000) {            # 100000 Monte Carlo resampling steps
  old <- rnorm(10,0,1)           # 10 draws from N(0,1)
  new <- rnorm(10,0,1)           # 10 draws from N(0,1)
  t <- c(t,t.val(old,new))       # t-value of the two resampled groups
}
hist(t, breaks=60,col="lightblue",main="")
abline(v=c(-2.740403,0,2.740403),
lty=c(2,1,2),col=c("red","black","red"))
box()
Figure 3.10: Monte Carlo t-distribution for sample size \(N=10\), obtained by 100000 resampling steps from \(N(0,1)\); the test value \(\pm 2.74\) of the present example is marked red in the plot.
table( abs(t) > 2.740403 )/length(t) # fraction of simulated |t|-values exceeding 2.74
##
## FALSE TRUE
## 0.98688 0.01312
Of course, there are ready-made functions for performing t-tests in all statistical software packages; in R, e.g., the base function t.test():
set.seed(1234)         # set a seed so that the results are reproducible
old <- rnorm(10,135,5) # 10 draws from a normal distribution with mean = 135 and sd = 5
new <- rnorm(10,140,5) # 10 draws from a normal distribution with mean = 140 and sd = 5
t.test(old,new, var.equal=T ) # two-sample t-test assuming equal variances
##
## Two Sample t-test
##
## data: old and new
## t = -2.7404, df = 18, p-value = 0.01344
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.173926 -1.475941
## sample estimates:
## mean of x mean of y
## 133.0842 139.4091
The t-test reports an \(\alpha\)-value\(^{14}\) of \(0.01344\), in good agreement with the result obtained from the Monte Carlo simulation. In addition, the parameter df in the output reports the degrees of freedom of the test, \(df=18\). The t-test is based on two "internal" parameters, \(\bar{X}_1\) and \(\bar{X}_2\), and these two parameters must be deducted from the total sample size \(N=20\), hence \(df=18\).\(^{15}\) By convention, a test result is called significant in statistics when \(0.01\le\alpha\le 0.05\) and highly significant when \(\alpha<0.01\). When a test result is judged significant at a given \(\alpha\), \(H_0\) is rejected in favour of \(H_1\), while silently accepting that in \(100\cdot\alpha\%\) of such cases \(H_0\) will in fact be true.
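The reported value can also be reproduced directly from the t-value and the degrees of freedom with the t-distribution function pt() (a quick check, not part of the original listing):
2*pt(-abs(-2.740403), df=18) # two-sided tail probability for t = -2.74 at df = 18, approx. 0.01344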
Closely related to the choice of \(\alpha\) is the question of how many observations are needed to detect a given effect with sufficient power; in R such a power analysis can be done, e.g., with pwr.t.test() from the pwr package:
pwr::pwr.t.test(              # power analysis for the two-sample t-test
  n = NULL,                   # observations in _each_ group (to be determined)
  d = 1,                      # effect size, here d = delta mu / sigma = 5/5 = 1
  sig.level = 0.05,           # type I error probability
  power = 0.9,                # desired power (1 - type II error probability)
  alternative = c("greater")) # one-sided alternative hypothesis
##
## Two-sample t test power calculation
##
## n = 17.84712
## d = 1
## sig.level = 0.05
## power = 0.9
## alternative = greater
##
## NOTE: n is number in *each* group
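Conversely, pwr.t.test() can report the power actually achieved by the present design with \(N=10\) observations per group and \(d=\Delta\mu/\sigma=1\) when the argument power is left unspecified (a sketch of the complementary call, assuming the same one-sided alternative):
pwr::pwr.t.test(n = 10, d = 1,            # 10 observations in each group, effect size d = 1
                sig.level = 0.05,         # type I error probability
                power = NULL,             # power is now the quantity to be solved for
                alternative = "greater")  # one-sided alternative, as above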
Least Squares problems can be solved conveniently with statistical software, making Ordinary Least Squares (OLS) the most widely used statistical method in science. In R, the lm() function provides all the functionality for solving OLS problems, e.g. with the R example code
###### plot SS
set.seed(1234)           # set a seed so that the results are reproducible
x <- seq(0,5,length=30)  # 30 equally spaced x-values between 0 and 5
y <- 3 + 5*x + rnorm(length(x),0,1) # linear model y = 3 + 5x plus N(0,1) noise
x.df <- data.frame(y,x)  # make a data.frame
res <- lm(y~x,data=x.df) # fit the linear model by OLS
summary(res)             # show results
##
## Call:
## lm(formula = y ~ x, data = x.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0306 -0.5751 -0.2109 0.5522 2.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6800 0.3273 8.188 6.51e-09 ***
## x 5.0094 0.1124 44.562 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9189 on 28 degrees of freedom
## Multiple R-squared: 0.9861, Adjusted R-squared: 0.9856
## F-statistic: 1986 on 1 and 28 DF, p-value: < 2.2e-16
a1 <- cov(x,y)/var(x)      # analytical OLS solution for the slope
a0 <- mean(y) - a1*mean(x) # analytical OLS solution for the intercept
c(a0=a0,a1=a1)             # show OLS estimators a0, a1
## a0 a1
## 2.680009 5.009426
cor(x.df)                 # correlation matrix of y and x
## y x
## y 1.0000000 0.9930235
## x 0.9930235 1.0000000
summary(res)$sigma        # residual standard error
## [1] 0.9188509
summary(res)$r.squared    # multiple R-squared
## [1] 0.9860958
names(anova(res))         # components of the ANOVA table of the fit
## [1] "Df" "Sum Sq" "Mean Sq" "F value" "Pr(>F)"
cor(x,y)^2                # R-squared equals the squared correlation between x and y
## [1] 0.9860958
anova(res)$"F value"[1]   # F-statistic of the regression
## [1] 1985.773
1 - pf(anova(res)$"F value"[1], 1, 28) # p-value of the F-test (numerically zero)
## [1] 0
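The same estimates can also be obtained from the normal equations in matrix form, \(\hat{a}=(X^TX)^{-1}X^Ty\); a minimal sketch, where X is the design matrix consisting of a column of ones (intercept) and the column of x-values:
X <- cbind(1, x)               # design matrix: intercept column and x
solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations (X'X) a = X'y for a0 and a1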
The R-code for estimating and testing the contrast (new-old) with OLS then becomes
rm(list=ls())          # delete all objects in the workspace
set.seed(1234)         # set a seed so that the results are reproducible
old <- rnorm(10,135,5) # 10 draws from a normal distribution with mean = 135 and sd = 5
new <- rnorm(10,140,5) # 10 draws from a normal distribution with mean = 140 and sd = 5
x <- data.frame(obsnr=1:20, treat=c(rep("old",10), rep("new",10)),
                y=round(c(old,new),2)) # run number, treatment and response
levels(x$treat) # wrong order, with "new" becoming the reference level in lm(); treat is still a character vector, hence NULL
## NULL
summary( lm(y ~ treat, data=x) ) # OLS estimation of the treatment contrast
##
## Call:
## lm(formula = y ~ treat, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.815 -3.365 -0.930 3.495 12.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.408 1.632 85.404 <2e-16 ***
## treatold -6.323 2.308 -2.739 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.162 on 18 degrees of freedom
## Multiple R-squared: 0.2942, Adjusted R-squared: 0.255
## F-statistic: 7.502 on 1 and 18 DF, p-value: 0.01348
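With "new" as the reference level, the intercept is the mean of the new sample and the treatold coefficient is the estimated contrast \(\bar{x}_{old}-\bar{x}_{new}\); this can be verified directly from the data (a quick check, not part of the original listing):
with(x, mean(y[treat=="new"]))                         # about 139.41, the intercept
with(x, mean(y[treat=="old"]) - mean(y[treat=="new"])) # about -6.32, the treatold coefficient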
Outliers are a reality in empirical data and might reflect a lack of experimental control or, alternatively, model inadequacy, i.e. a non-linearity not appropriately described by the empirical model. Depending on the objective of the project\(^{20}\), outliers should be checked by repeating the outlying experiments.
set.seed(1234)              # set a seed so that the results are reproducible
x <- seq(0,5,length=30)     # 30 equally spaced x-values between 0 and 5
XX <- as.matrix(cbind(1,x)) # design matrix with intercept column
y <- 3 + 5*x + rnorm(length(x),0,5) # linear model y = 3 + 5x plus N(0,25) noise
y[x > 3.8 & x < 4.2] <- y[x > 3.8 & x < 4.2] - 25 # shift observations near x = 4 downwards (outliers)
res <- lm(y~x)              # OLS fit of the contaminated data
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
plot(x,y,type="n",pch=16, ylim=c(-20,40), main="y~x")
text(x,y,label=1:length(y))
abline(lm(y~x)$coef)
grid()
plot(predict(res), rstudent(res),xlab=expression( paste("predicted response ", hat(y)) ) ,ylim=c(-5,3) ,
ylab="studentized residuals",type="n",
main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ ", hat(y) )) )
text(predict(res), rstudent(res),label=1:length(y))
abline(h=c(-2,0,2),lty=c(2,1,2))
grid()
plot(1:length(y), rstudent(res),xlab="run order",ylab="studentized residuals",type="l",
main=expression(paste("(",frac(paste("y - ", hat(y)),sigma), ") ~ run order" ))) # residuals vs run order
abline(h=c(-2,0,2),lty=c(2,1,2)) # reference lines at 0 and +/- 2
grid()
acf(rstudent(res), main="Autocorrelation function of residuals") # autocorrelation of the studentized residuals
Figure: Effects of outlying observations on OLS estimation.
summary(res) # numerical summary of the fit including the outliers
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.4949 -1.3215 0.6568 3.2538 16.0924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6366 2.6802 0.984 0.333666
## x 3.8858 0.9205 4.221 0.000232 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.524 on 28 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3671
## F-statistic: 17.82 on 1 and 28 DF, p-value: 0.0002315
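Observations suspected to be outliers can be flagged directly from the studentized residuals used in the diagnostic plots above, e.g. with the \(\pm 2\) threshold marked there (a minimal sketch, not part of the original listing):
which( abs(rstudent(res)) > 2 ) # indices of observations with |studentized residual| > 2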